S24 Parties at The Fillmore

Last night was the annual Secure-24 holiday party. This marks the third year that we got to attend this amazing event. Once again the company rented out The Fillmore so, naturally, we got a room at The Westin Book Cadillac. The Killer Flamingos once again graced the stage and provided the evening’s entertainment, and once again they did a fantastic job! As usual the food and beverages were outstanding, and the organizers did a great job with the raffles and prizes. Though we kicked around the idea last year, the guys never did get our act together to wear bow ties! It was another fun night though. I really appreciate that the company has these parties for us; it really does show that they appreciate everything we do all year long. It’s a high-stress, fast-paced environment, and the ability to just let loose for the night is great. Already looking forward to next year’s party!

One Year at Secure-24

Today marks my one-year anniversary at Secure-24. I can honestly say it seems like it has been a lot longer. When I first started working here everyone referred to time in S24 years, and I found out that it was a running joke amongst the engineers. It refers to how the years at S24 seem to be much longer than at other places, because the environments change constantly given the nature of the business a Managed Service Provider runs. I laughed it off for the first few months until I slowly began to feel it too. Now, here I am, one year later, and I find it hard to believe that it has only been one year. During this time I even managed to meet a personal goal and change teams. I’m not even sure I could manage to list all of the new things I have learned between the two positions, or the things I have had access to here that I would never have had access to at my old company. The people I have met here are all incredibly talented and great to be around. The company culture is amazing and the events that encourage us all to come together outside the office are terrific. What I am most thankful for, though, is the fact that it has been a year and I still enjoy coming to work every day, with the exception of the commute anyway! I’m working on that…

From Backup to Storage

One of the goals that I set for myself when I took the position with Secure-24 was to transition to the storage team within 6 months. This has been my goal because I enjoy doing storage work a lot more than I do backup and recovery. I even mentioned in my interview (for a backup engineering position) that I did not want to do backup work and that if I were to get hired this would be my goal. Over the last few months I have been halfway to my goal, since I am the point person for both storage and backups for our enterprise delivery team. In fact, this team has a major project underway for one of our customers, implementing two new EMC VMAX arrays. I have been deeply involved with it and will begin provisioning storage on the arrays next week. However, today, my full transition has been made (sort of) due to a colleague leaving the company for another opportunity. Jon was the NetApp engineer, and now that he is gone the storage team will be leaning heavily on me to fill the NetApp knowledge gap, which suits me just fine! Considering another customer is lined up to bring in some new NetApp filers as well, this puts me in a great position to build another set of controllers from the ground up. So officially I have transitioned to the storage team, but until they can backfill my position I will be pulling split duty between the two teams. Needless to say I am pretty happy about these circumstances, and I even beat my goal by a month!

First full day of new career

Today was my first full day at Secure-24, and it seems like it is going to be a good place to work. My hope to hit the ground running wasn’t as fruitful as I had planned though. The environment in which the company operates is pretty locked down, which I understand the need for. However, this makes for a lot of jumping through hoops in order to connect to the desired systems where work needs to happen. This means there are also many accounts that need to be configured to allow the appropriate access, and not all of mine have been set up yet. Hopefully over the next few days things will get straightened out. From what I can tell so far, working here is going to be an entirely different experience from what I am used to. I already see a lot of exciting challenges and opportunities to learn new skills and improve on existing ones!

A New Beginning

My first day at Secure-24 was a typical slow orientation day. Spending the better part of 8 hours in a training room and going through procedures is about as fun as it sounds. There were only two other people in my orientation group: Brian is joining the project management team, and Patti is joining the storage team. I’ll be working more closely with Patti, as the backup and storage teams work together and report up through the same manager, and our desk areas are right next to each other. I will say that it was nice to see that the facilities team had everything together. We got our badges, our desks, and our laptops and bags all on the first day. At the end of the day we went and spent about 1.5 hours with our teams, meeting everyone and setting up our desks. I opted to configure my desk to be a standing desk, and am looking forward to it. I have the rest of the week off, but now that corporate training and orientation are over I’ll be able to hit the ground running!

Last Day, Technically

Today was my last day at Truven and it was a lot easier than expected. This is mainly because my manager did not even have me come into the office; instead we all just met up for lunch at Ichiban. I have to admit, I am going to miss those lunches, as Ichiban is a great place to eat, especially when it is free! In all seriousness though, yesterday was a long day of saying goodbye to people I have spent the better part of seven years working with. I am sure I will keep in touch with some of them, especially Chuck, but it’s still hard to say goodbye. I am still very excited though and looking forward to the new adventure!

Time For A Change

After 7 and a half years, and what I consider a good career, it is time for a change. Two weeks from today will be my last day at Truven Health Analytics. It’s been a good career as far as I am concerned and I have come a long way from where I started; however, all good things must come to an end.

It all started in February of 2005, with a bang I might add, when I started working for Thomson Medstat as a Computer Operator on the midnight shift. After about a year and a half I relocated to Eagan, MN to help rebuild the company infrastructure and move our data center from Ann Arbor to Eagan. I went out on September 16, 2006 and was only supposed to be out there for about 3 months, but that turned into almost 9 months. I made the long drive home on April 10, 2007, and during that entire 12-hour trip I was driving home to no job, as I had just relocated the data center I once worked at to Eagan! However, after being home for only a few hours that day I received a phone call and got some good news, and that is when I stepped away from Computer Operations and started down the road to being a Systems Manager.

In the beginning my focus was strictly on TSM administration, but sometime in 2008 I started working with NetApp filers and my career as a Storage Engineer came along with it. However, in typical corporate form I wouldn’t get any training for my new position until almost a year later, and it was an additional two years before I got certified. The past year and a half have been spent trying to hone my skills on the NetApp filers along with trying to learn how to provision SAN disk on both EMC and HDS hardware with both Brocade and Cisco fabric switches. Admittedly I am stronger on the EMC hardware (specifically the Symmetrix line) than I am on the HDS hardware, but I think that is because I have had a lot more secondhand exposure to those. Sometime in there I got raised to a Sr. Storage Engineer as well and ran point on a number of different projects, including upgrading OnTAP versions, performing filer head swaps, and even upgrading the TSM systems and hardware and implementing encryption.

Yet after all this it has come time to say goodbye. I have been given the opportunity to join a new company out in Southfield, Secure-24. Secure-24 is a hosting provider for other companies, supplying everything from SAP and Oracle hosting to total IT outsourcing. As for me, I’ll be joining their backup team and focusing on NetWorker with a little bit of Avamar thrown in. Though in all honesty I am hoping to quickly transition over to their storage team as I think that is really the direction I want my career to follow.

While I was not actively looking for a new job (a recruiter for the company actually contacted me via LinkedIn), too many stars aligned and showed me that it was a good time to make a change. For instance, Truven is currently gearing up for a data center migration now that the company has parted ways with Thomson Reuters and become its own standalone company, which I guess makes the fact that I am going to work for a data center provider a little ironic. However, I also feel that I have hit a professional growth wall here. While it is true that I am still learning much of the SAN side of the house, that learning is slow at best due to the siloed nature of such a small team (there are a total of 3 engineers, 1 contractor in India, and our manager). Those factors, coupled with a few personal reasons, have made this opportunity too hard to pass up.

So it is with great sadness, great nervousness, and great expectation that I have put in my two weeks’ notice of resignation with the only company I have known in my professional career.

OnTAP version issues

Last week (on the 24th) I performed a routine OnTAP upgrade across 5 of my 3170 filers; I upgraded from 7.3.5 to 7.3.5.1P5. This upgrade was performed to help prevent a system panic, which had happened twice before on our standalone 3170 SnapVault target system, under Bug 446493. Here is the bug description:

Much disk and shelf hardware can be managed by an ANSI-standard technology called SCSI Enclosure Services (SES).  To support SES-related processing, the SES subsystem of Data ONTAP schedules various periodic actions, using an internal timeout mechanism.

Due to a software defect, under certain conditions, the SES subsystem may set an excessive number of timers, with new timers being set before old ones expire.  If this continues, an internal callout table will fill up, triggering an interruption of processing.

One condition in which the problem can occur is during initial setup and configuration of the storage system:  when the “cluster-setup wizard” is run, it asks the administrator for configuration input as follows:

Do you want to create a new cluster or join an existing cluster? {create, join}: (Login timeout will occur in 60 seconds)

When installing releases in which the defect is present, if this prompt is allowed to time out, an interruption may occur at some later time.

However, the callout table can also fill up during routine production, if storage events occur in rapid succession, such that SES scans are rapidly invoked.  Such events may include:

  • a continuing series of disk errors
  • breakages in disk-communication links
  • adding new shelves
  • power-cycling shelves
  • removing a power supply
  • shelf firmware updates
  • shelf faults
  • takeover/giveback

We were hitting this bug under the “continuing series of disk errors” event, which caused the SES scans to fill the timer-callout table. When this happened the system would panic and reboot. After the upgrade I performed all standard checkouts and everything appeared to be functioning normally and within standards, so I closed the upgrade process and marked it as successful.
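
To picture how a burst of events can fill a fixed-size table, here is a toy Python model. It is only a sketch: the class, the function, the capacity, and the timeout values are all made up for illustration and say nothing about how Data ONTAP actually implements its callout table. It just shows that when new timers are set faster than old ones expire, a bounded table eventually overflows.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CalloutTable:
    """Toy fixed-size timer table (hypothetical, not Data ONTAP's real structure)."""
    capacity: int
    expiries: List[int] = field(default_factory=list)

    def expire(self, now: int) -> None:
        # Drop every timer whose expiry time has been reached.
        self.expiries = [t for t in self.expiries if t > now]

    def schedule(self, now: int, timeout: int) -> None:
        # A full table stands in for the panic described in the bug report.
        if len(self.expiries) >= self.capacity:
            raise RuntimeError(f"callout table full at tick {now}")
        self.expiries.append(now + timeout)


def ticks_until_full(scans_per_tick: int, timeout: int, capacity: int,
                     max_ticks: int = 1000) -> Optional[int]:
    """Return the tick at which the table overflows, or None if it never does."""
    table = CalloutTable(capacity)
    for now in range(max_ticks):
        table.expire(now)
        try:
            for _ in range(scans_per_tick):
                table.schedule(now, timeout)
        except RuntimeError:
            return now
    return None


# One scan per tick keeps up with expiry; four scans per tick do not.
print(ticks_until_full(scans_per_tick=1, timeout=5, capacity=16))
print(ticks_until_full(scans_per_tick=4, timeout=5, capacity=16))
```

With these made-up numbers the slow case never fills the table, while the burst case overflows within a handful of ticks, which is the same shape of failure we saw during the run of disk errors.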

Then on the 26th we attempted to perform an allocation using the NMC. While going through all the steps everything appeared as though it was going well; all of the checks passed and we hit commit, only to have the process come back with an error message indicating that it had failed. After opening a support call with NetApp and providing both screen captures of the error received and steps to reproduce, it was determined that we had run into another bug. This time we hit Bug 474612. Here is the bug description:

system-cli API returns cli-result-value with an invalid return status. This invalid return status may break OnCommand and other third party applications utilizing NMSDK.

Basically what is happening is that the NMC executes the requested commands, but the filer CLI returns a response code that the NMC is not expecting and in fact does not recognize. When this happens the NMC does not know how to proceed and the command fails without performing the provisioning. Apparently this issue was introduced in OnTAP 7.3.5.1P4 and still exists in 1P5, so the solution is to downgrade OnTAP to 7.3.5.1P3. In this version of OnTAP bug 446493 is resolved and bug 474612 has not yet been introduced (as of this writing, bug 474612 is NOT resolved in any version of OnTAP).
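
The choice of release boils down to finding a version that contains neither bug. A small Python sketch of that reasoning follows. The table only encodes what is described in this post, the function name is made up, and the entry for 7.3.5.1P4 assumes that bug 446493 was already fixed there (support only told us when 474612 was introduced). It is not a substitute for checking NetApp's own bug tools.

```python
# Bugs present in each release, based only on what is described in this post.
# Assumption: 446493 is already fixed in 7.3.5.1P4 (known fixed in 1P3 and 1P5).
KNOWN_BUGS = {
    "7.3.5": {"446493"},
    "7.3.5.1P3": set(),
    "7.3.5.1P4": {"474612"},
    "7.3.5.1P5": {"474612"},
}


def safe_releases(known_bugs, must_avoid):
    """Return the releases that contain none of the bugs in must_avoid."""
    return [release for release, bugs in known_bugs.items()
            if not (bugs & must_avoid)]


print(safe_releases(KNOWN_BUGS, {"446493", "474612"}))  # ['7.3.5.1P3']
```

Only 7.3.5.1P3 survives the filter, which matches the downgrade target NetApp support gave us.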

After performing the downgrade tonight to OnTAP 7.3.5.1P3 I performed all normal checkouts and additionally performed the pending allocation via the NMC to verify the functionality and non-existence of bug 474612 in this version of OnTAP. Happily the allocation went through without issue, and from what we can tell all aspects of the filers are functioning normally. Our next OnTAP upgrade will have to be to 7.3.6.

The problems with combining FC and SATA drives on the same NetApp

The idea of tiered storage is something many businesses are seriously exploring nowadays, and combining it with “cloud” operations is a major focus. The idea behind tiered storage is that you have different levels of disk with different performance characteristics, speed being the main one. We recently looked into the possibility of adding some tiering to one of the NetApp environments I manage. The idea was to use 300GB FC (Fibre Channel) as our Tier 1 NAS disk and some 1TB SATA as our Tier 2. On the surface of things this seems like a good idea for a couple of reasons:

  1. The Tier 1 disk is sufficient to run Oracle databases over NFS if those databases are configured properly.
  2. The Tier 2 disk is much cheaper and would be perfect for housing non-critical, non-performance-intensive shares, such as home directories, at a fraction of the cost.
  3. The mix of available disk would allow us to tailor allocations to the actual needs of each project based on its performance requirements.

So as I began to look into this course of action I discovered a few things that completely negate this idea, at least for having it housed in one filer (or even a clustered pair):

  • When using FC connectivity to disk shelves, FC drives and SATA drives must go on different loops.  This means that if you add a shelf of SATA to an open FC port on a filer, you will not be able to add any FC drives to that loop.
  • All write operations are committed to disk as a group, regardless of what the destination aggregate is for those writes.  So, on a system with SATA and FC disk, write operations destined for FC drives may be slowed down by the writes going to the SATA drives, if the SATA drives are busy.

The first point, the dedicated loops, isn’t such a big deal if you are planning on adding a full loop’s worth of SATA shelves (6 shelves per loop) and you have an open FC port with nothing else attached (or can move other shelves to fill in open spots on other loops to free up an FC port). So while the dedicated loop can be resolved and may or may not be an issue depending on your setup, it’s the second point that poses the most trouble in our environment (and I would assume most others as well).

Running the risk of impacting performance on your Tier 1 disk is not acceptable. The applications running on that tier are there for a reason: they need the performance of those faster disks. But how do you know whether you will hit that impact? Maybe it won’t apply. Good question. Perhaps this won’t be an issue for your environment. So ask yourself this: Do you know the exact details of your workloads? No, of course you don’t. You may know that there are databases on some of the exports, or that certain exports are used for regular CIFS or NFS shares for home directories, but you most likely do not know all the intimate details of each given application’s workload. Without that precise knowledge it is nearly impossible to quantify the potential impact ahead of time, and thus this possible latency becomes a real concern.
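
To make the grouped-write concern concrete, here is a deliberately crude Python model. Everything in it is an assumption for illustration: the function name, the tier labels, and the throughput and backlog numbers are invented, and the real filer behavior is far more involved. The only idea it captures is that when writes for all tiers are committed together, the group finishes only when the slowest tier finishes.

```python
def group_commit_ms(pending_mb, throughput_mb_per_s):
    """Return (group commit time, per-tier flush times) in milliseconds.

    Toy model: each tier flushes its own backlog in parallel, but the
    grouped commit completes only when the slowest tier is done.
    """
    per_tier_ms = {
        tier: 1000.0 * pending_mb[tier] / throughput_mb_per_s[tier]
        for tier in pending_mb
    }
    return max(per_tier_ms.values()), per_tier_ms


# Hypothetical numbers: same backlog on both tiers, SATA four times slower.
total_ms, per_tier = group_commit_ms(
    pending_mb={"fc": 100, "sata": 100},
    throughput_mb_per_s={"fc": 200, "sata": 50},
)
print(per_tier)   # {'fc': 500.0, 'sata': 2000.0}
print(total_ms)   # 2000.0
```

In this made-up case the FC writes would be done in 500 ms on their own, yet the grouped commit takes 2000 ms because it waits on the busy SATA tier. Without real workload numbers you cannot fill in this kind of model for your own environment, which is exactly the problem.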

Because of these factors we chose (and I recommend) not to mix FC and SATA shelves on a single (or clustered) system. If you need to have multiple tiers you still have options:

  • Implement SAN as your Tier 1 storage and utilize NAS as Tier 2
  • Implement a Tier 1 NAS environment and a Tier 2 NAS environment on separate hardware (read: separate physical systems, either single-headed or clustered)
  • Look into an appliance that can handle different types of disk in the same housing without impact and configure tiering therein.

Tiering your storage is a great idea and allows for a lot of flexibility and possible cost savings for the customer in terms of chargebacks for disk utilization. Even so, you still need to keep performance in mind, and for me, the possible performance impact is not worth the risk of mixing these shelf types on a single head.

NCDA Boot Camp, Day 6

I just passed NS163 and now officially have my NCDA! It will take a couple of weeks for them to ship out my certificate, but the system has been updated with my passing score so I am officially certified. Not just me though: everyone in the class passed both exams, so there are now 8 new NCDAs in the field. Congratulations to all the other guys, and thanks to Jon Presti for being a great instructor and making sure we all had the knowledge needed to pass the exams and get certified. Now to figure out what to do with the rest of my day, because my flight doesn’t leave until tomorrow. Had I known we would be done by 2pm today, I would have scheduled an early evening flight. Oh well, looks like I get to go explore Las Colinas a little.