Tuesday, July 28, 2009

Redundancy

redundant: n. see redundant

At the top of the IT world sits not only expensive equipment, but equipment that must never fail. Let me give two examples from my work.
http://www.fz-juelich.de/jsc/CompServ/graphics/cell_blade_side1_large.jpg
http://www-06.ibm.com/systems/jp/photo/bladecenter/picture/ls42_r.jpg
We have a few VMware clusters. One recent setup has seven blades, each with two sockets of quad-core processors and 32GB of RAM. It occupies a standard rack width and stands about 4U high. If you've never worked with VMs, you probably think this means we have the equivalent of 56* systems, each with 4GB of memory, in that little space, which is pretty cool. But we can actually run over 60 machines on a single blade without overloading it. We can easily fill a Class C subnet in a quarter-rack -- crazy! This is part of the recent appeal of virtualization: floor space on a raised floor is expensive, and if you have a smallish server farm, you could relocate it into a closet with the right air handling.

There are plenty of other arguments for virtualization, but redundancy is a pleasant surprise for most people who actually use the servers. If one of the seven blades** dies, the infrastructure software can boot its server instances on another blade automatically in seconds (VMs boot really fast). If we need to do planned service on a blade, the VMs can be migrated to another blade with a hiccup so quick (1-2 seconds) that you have to be looking for it to notice, since all running programs and memory get copied to the new blade. The blade center itself (the chassis that holds the blades) has dual power supplies fed from entirely different power distribution boxes, which in turn provide power to the blades. The blade center also features eight I/O module slots, each of which can house six gigabit ethernet ports, six fiber port pairs, or a console interface. We have ours connected to four different subnets, with two cables for each coming from different switches.

In terms of space: we're using part of one rack to replace eight or so, two 30-amp circuits to replace sixty 15-amp circuits, and 8 network cables to replace 60. And through all this, it's as if we have given each virtualized server dual power, failover networking, and a hot-spare system, since all files, including the OS, are stored on a SAN.
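For the curious, here's a quick back-of-the-envelope sketch of that arithmetic in Python. The blade, socket, core, and RAM figures come from above; the 60-VMs-per-blade number is the comfortable load I mentioned, and the rest is just illustration.

```python
# Back-of-the-envelope capacity math for the blade cluster described above.
# Figures come from the post; the per-blade VM count is the observed comfortable load.

BLADES = 7
SOCKETS_PER_BLADE = 2
CORES_PER_SOCKET = 4          # quad-core
RAM_PER_BLADE_GB = 32
VMS_PER_BLADE = 60            # "over 60 machines on a single blade without overloading it"

total_cores = BLADES * SOCKETS_PER_BLADE * CORES_PER_SOCKET
ram_per_core_gb = RAM_PER_BLADE_GB / (SOCKETS_PER_BLADE * CORES_PER_SOCKET)
total_vms = BLADES * VMS_PER_BLADE

print(f"{total_cores} cores total, {ram_per_core_gb:.0f} GB RAM per core")
print(f"roughly {total_vms} VMs in a quarter rack -- more than a Class C subnet holds")

# N+1 view: if one blade dies, its ~60 guests redistribute across the other six,
# adding about ten VMs to each survivor.
extra_per_survivor = VMS_PER_BLADE / (BLADES - 1)
print(f"one blade lost -> ~{extra_per_survivor:.0f} extra VMs per surviving blade")
```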
http://www-05.ibm.com/lt/storage/disk/i/DS4800.jpg
Another example of redundancy is a SAN system we use for data streams to and from manufacturing, which is critical to operations. In addition to the redundant fiber, ethernet, and power, each DS4800 controller is actually two controllers which alternate as primary and backup for each group of drives. We have twelve drawers of fourteen drives (the horizontal container of drives is a drawer), with drives arranged into RAID5 arrays vertically, so if an entire drawer dies, no data is lost; the hot spare drive or two at the end of each drawer take up some of the slack, completely rebuilding some arrays while others continue in a slower, degraded configuration. With the hot spares and the parity inherent in RAID5, 12.2TB of raw disk space yields 8.7TB of usable space in 32 arrays/LUNs. But this level of redundancy allows us to maintain the system on a Monday-through-Friday, 8-5 basis despite its being critical to manufacturing 24x7.

I once had a drive fail on a Saturday -- this slowed down the array, but no data was lost. The data from the missing drive was re-created on a hot spare in about 15 minutes, after which it was added to the array and the array was back up to full speed. Then about 10 hours later, another drive from the same array (in a different drawer) failed and was replaced by another hot spare. For those who aren't in this game, a failed drive in a RAID5 array for a critical system will usually result in someone getting called out. For two drives in the same array, multiple people lose sleep and calls are made to techs in other states. All of this is followed by a week or more of post-mortem meetings deciding how to avoid it in the future, and quite possibly who is going to lose their job. With a SAN's ability to realistically share so many hot spares among multiple arrays, this was a non-event. As long as two drives in the same array don't die at the same time, we can lose nine drives before data loss at the most pessimistic, and if the failures are evenly distributed, we could lose 40 drives without losing data.
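To make the raw-versus-usable gap concrete, here's a rough sketch of the RAID5 math. Only the totals above (12.2TB raw, 8.7TB usable, 32 arrays/LUNs) are real; the drive size, array width, and hot-spare count below are assumptions for illustration, not the actual DS4800 layout.

```python
# A minimal sketch of where the raw-vs-usable gap comes from in a layout like this.
# Array width and hot-spare count are assumptions; the post only gives the totals.

TOTAL_DRIVES = 12 * 14            # twelve drawers of fourteen drives
DRIVE_TB = 12.2 / TOTAL_DRIVES    # ~73 GB per drive, implied by 12.2 TB raw

def raid5_usable_tb(arrays: int, drives_per_array: int, drive_tb: float) -> float:
    """RAID5 keeps one drive's worth of parity per array."""
    return arrays * (drives_per_array - 1) * drive_tb

# Assumed layout: 32 arrays of 5 drives spanning drawers, plus 8 hot spares held back.
arrays, width, spares = 32, 5, 8
assert arrays * width + spares == TOTAL_DRIVES

usable = raid5_usable_tb(arrays, width, DRIVE_TB)
print(f"raw 12.2 TB -> ~{usable:.1f} TB usable with {spares} spares held back")
# Prints ~9.3 TB; the real 8.7 TB figure also loses a little to formatting and
# controller metadata, which this sketch ignores.
```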

All this is great for protecting against equipment failures or other remote issues like switch failures. And because it's so dense, physical security is easier to achieve and monitor for one spot than for dozens scattered around. It does make a tempting target for the malicious, however. Since there are four subnets connected to the blade center with two cables each, someone who understood what was going on could swap the ethernet cables, plugging each into a port for the wrong subnet. You could easily disconnect 200 servers from the network in less than 30 seconds, and it would probably take over an hour to diagnose the problem -- quite possibly 4-5 hours while the admin checks the switches and routers, which would seem the obvious culprits. Even a simple accident, like a maintenance worker in the ceiling above the drop tiles stepping on the wrong pipe, could send hundreds of gallons of water from the sprinkler system onto that one rack hosting 200 machines.

Last April, Morgan Hill, an affluent community in a valley south of San Jose (and SF) along the 101, which runs the length of California, fell victim to an apparently coordinated team of unknown men who entered four service access areas via manhole and cut eight fiber-optic cables, utterly disconnecting Morgan Hill from the rest of the US communication network. Eight teeny glass fibers the size of a human hair -- how bad could it be?
The city of Morgan Hill and parts of three counties lost 911 service, cellular mobile telephone communications, land-line telephone, DSL internet and private networks, central station fire and burglar alarms, ATMs, credit card terminals, and monitoring of critical utilities. In addition, resources that should not have failed, like the local hospital's internal computer network, proved to be dependent on external resources, leaving the hospital with a "paper system" for the day.

Although the author goes on to blame centralization, the fact that they had to hit four places seems reasonably redundant. The problem lies, in my opinion, with so many pieces of critical technology having so many physical and logical layers, each with dependencies on still lower layers, that analysis is nearly impossible. I'm sure a lot of planners figured that even if the telephone land-lines went out, the cell phones would still work. I'll bet the hospital didn't have its own DNS server, and that was the piece that broke communications on their local network. The local workstations and servers were still connected, but they couldn't find each other without a DNS server to tell them which unique address went with which system name.
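To make that guess concrete, here's a small sketch (hostnames and addresses are made up for illustration) of the failure mode: the machines are still on the wire, but every lookup depends on a resolver that's no longer reachable, and a local table or hosts file is what would have kept them talking.

```python
# Sketch of the suspected hospital failure: the LAN is fine, but name resolution
# depends on an unreachable DNS server. Names and addresses below are hypothetical.

import socket

LOCAL_HOSTS = {                     # what a local DNS zone (or hosts file) would hold
    "lab-server.hospital.local": "10.1.20.15",
    "pharmacy-db.hospital.local": "10.1.20.22",
}

def resolve(name: str) -> str:
    """Try normal DNS first; fall back to a local table if the resolver is gone."""
    try:
        return socket.gethostbyname(name)   # fails when the upstream DNS is unreachable
    except socket.gaierror:
        try:
            return LOCAL_HOSTS[name]        # local fallback keeps the LAN usable
        except KeyError:
            raise RuntimeError(f"no DNS and no local entry for {name}")

print(resolve("lab-server.hospital.local"))
```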

If I had to guess, I'd say that between 1 and 2 billion USD were spent on Y2K compliance across the US. After all the studies, it looks like the fallout might have topped 50M, but we didn't know that until we did the studies. We all hated the Y2K process because it was mandated by bureaucrats and tedious. But as IT services become more embedded in critical services***, we need to dust off and reuse those skills to dig down to bare ground when analyzing failure scenarios. If you have badge access to the server room (and who doesn't) and the whole city block loses power, will you be able to physically get to the server room to shut down systems? Security will probably be there to let people in the front door, but how about getting to your floor? How about the lobby doors, or the door to the server room? Sure, you have UPS for the critical systems, but it can take 4-5 hours, or sometimes days, to fix or replace a blown transformer for a downtown block.

--=={{}}==--

* - 7 blades * 2 sockets * 4 (quad) cores = 56; 32GB / (2 sockets * 4 cores) = 4GB

** - "The Seven Blades" sounds like a Japanese Martial Arts film doesn't it?

*** - It's worth mentioning that I live in a little town that gets its water from an impressive well-head which has a dedicated backup generator in case of power loss, and even then the storage will run the town for about 3 days before people at higher elevations go dry.


3 Comments:

Blogger Jeff Mountjoy said...

Wow, I can't believe I never heard of Morgan Hill. Creepy. And, yeah -- I suspect disgruntled telecom workers. As with your hypothetical switching of ethernet cables, it requires a lot of information to know which wires to cut, and where they are. I'm constantly amazed at how much knowledge is required to create tricky mayhem these days. Any idiot can whack something with a fire ax, but it takes real cleverness to cause the more interesting kinds of damage.

The Seven Blades could be a movie based on Musashi's Book of Five Rings. Or the prequel/sequel to the Seven Samurai. :-)

8/01/2009 10:54 PM  
Blogger Jeff Mountjoy said...

Unrelated: when I posted that comment, I noticed you use a Captcha for verification. I wonder if anyone has ever designed Captcha graphic text specifically to fool computers -- that is, instead of swirling letters so that computers can't OCR easily, if anyone makes Captcha text that will OCR as specific letters, but the wrong letters....

8/01/2009 10:57 PM  
Blogger NerfSmuggler said...

These are both excellent leads to possible future topics. As tech becomes ubiquitous, it can be suborned into an avenue for social engineering. Killing all the phone lines, cell towers, and Internet for an area essentially makes every alarm system fail silently. They have no way to signal a breach (except a local audible siren), and even if they're using a dedicated IP connection, when the security office calls around, they'll find out that everybody's cut off and ignore it.
It's an interesting idea just to visualize those kinds of optical illusions. Since different captcha-spammer programs use different OCR software, I'd expect it would be hard, and it would need to be generated by a real person instead of the random captchas generated now by graphics programs with command-line options. The captcha spammers could also add exceptions and scan for them more easily than new ones could be made.

8/06/2009 10:50 AM  
