Thursday, April 30, 2009

CASD and Career Trajectory

A recently added responsibility* at work is looking after the 2000+ 73GB Fibre Channel drives in a SAN farm -- 1084 are 2Gbps and the others are 4Gbps. While I would find the initial configuration of the arrays and fabric an interesting project, all I really do is monitor the controllers with a GUI that operates out-of-band and replace failed drives, which are all under a support contract. It's all pretty simple, in theory.

It's a 10 minute hike from my office to the raised floor where these live, and I'm lazy. The previous guy would find an error on the Storage Manager GUI, wander down, note the drive position and enclosure serial number, call it in to support, get the replacement and wander down again to swap the drive. I made a list of all the enclosures and their serial numbers, so I can call in the drive using just that info and make only the one trip to physically swap drives.
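A list like that can be as simple as a lookup table. Here is a minimal sketch in Python, assuming the only things support needs are the enclosure serial number and the slot; the enclosure names, serial numbers and the `support_ticket_line` helper are all made up for illustration and are not taken from the real system.

```python
# Hypothetical inventory: enclosure names and serial numbers are invented.
ENCLOSURES = {
    "enclosure-01": "SN-EXAMPLE-0001",
    "enclosure-02": "SN-EXAMPLE-0002",
}


def support_ticket_line(enclosure, slot):
    """Build the one-line summary needed to call in a failed drive."""
    # Raises KeyError if the enclosure is not in the list, which is the
    # cue that a walk to the raised floor is needed after all.
    serial = ENCLOSURES[enclosure]
    return f"Failed drive: {enclosure} (serial {serial}), slot {slot}"


print(support_ticket_line("enclosure-01", 7))
```

The GUI supplies the enclosure and slot of the failed drive; the table supplies the serial number that used to require the first trip.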

For some reason (no, there is no firmware fix for this), in the bigger, faster controllers, when you remove and replace a failed drive, the failed drive remains listed in the array until you replace it manually in the list with the new drive. Annoying, but ok, fine. I just had to replace a hot spare. In a lot of ways, the hot spare pool acts like an array, but I don't have all the same options -- in particular, there seems to be no way to remove a failed drive from the hot spare pool without powering down the whole system (recycling each of the redundant controllers doesn't do the trick) to have it re-ident the drives**. When I add the new drive to the hot spare pool, that enclosure/slot shows up on the hot spare list twice, once as "Optimal" and once as "Missing". So we see an error at the controller level, but no discrete device is in an error state: all the drives are fine, all the arrays are good and we have all eight hot spares in service.

The fun part was getting through the layers of support to someone who A) gets that this is a problem and B) knows more than me, and hopefully C) can fix it. In retrospect, I should have just lied to the first level of support. They're really just there to confirm that we have a support contract and forward us on to the next level. Second level really does know more about these systems than me, but I have just been doing this a couple months. The guy at third level I really have high hopes for, if I can get him interested. "Have you power-cycled the system?" "No, it's linked to manufacturing -- there's slack, but if something goes wrong and it won't come back up, it costs over $1M/minute that the line stops." Plus, I didn't set it up, I don't know what systems it feeds, etc.

So during a low traffic window, we cycled each of the two controllers, which back each other up. First take down 1, 2 takes over and handles all traffic; bring 1 back up, wait for it to stabilize and bring down 2; 1 takes over all traffic while we bring 2 back up. No change. I sent this to my level 3 guy and got this:

send me a new CASD - the phydev # change on the power cycle and let me see what i can find out

You're probably thinking that this is terminology we've used before. Nope! I've sent a bunch of screen shots showing the error, but the other thing I sent was a zip file of logs which the management GUI creates. Sure enough, on the screen that creates it, the top of the window says "Capture All System Diagnostics". As for "the phydev [physical device?] # change on the power cycle", this is a statement, I hope. If he wants a list of "phydev #" that changed, I got nothing. Punctuation would help me here.

One thing about being at the mature stage of my career is that I can easily tell the difference between when I just don't have the knowledge to follow what's being discussed and when someone is being intentionally obscure.

Of course, he could just have poor communication skills. I had someone ask me once what the difference was between a password and an email address. He was trying to ask whether the ISP dialup account password was the same as their ISP email password.

--==<<>>==--

* - Read: they got rid of someone but their job still needed doing.

** - If that even works. Since these SAN arrays are integral to the manufacturing line, shutting one down over a false positive storage system error is not our first choice.
