Adding Reliability to Your Infrastructure

Architecture

I don’t wanna fly in a single-engine plane.

Call me chicken, call me scaredy-cat, but I’m not excited to get into an airplane that will kill me if an engine fails.  The next step up is a twin-engine plane – but I don’t get on all of those either.

I like my engines to be RAID 10.

If a single engine averages a failure once in every 10,000 hours of operation, then a plane with just one of those engines will experience a failure once every 10,000 hours.  What if we equip our plane with two of those engines – how often will we experience a failure?

  • Once in every 20,000 hours of operation, or
  • Once in every 5,000 hours of operation

The correct answer is once in every 5,000 hours of operation.  All other things being equal, two-engine planes are twice as likely to have an engine failure in the same span of time.
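Under the usual simplifying assumptions – independent engines, each failing at a constant average rate – the arithmetic is a one-liner. A quick sketch (the function name is mine, not from any real reliability tool):

```python
# Back-of-envelope failure math, assuming each engine fails independently
# at a constant average rate. Purely illustrative.

def time_between_failures(single_unit_mtbf_hours, unit_count):
    """Average hours between failures of *any* unit in the fleet.

    Independent failure rates add: n units fail n times as often as one,
    so the time between failures shrinks by a factor of n.
    """
    return single_unit_mtbf_hours / unit_count

print(time_between_failures(10_000, 1))  # 10000.0 - one engine
print(time_between_failures(10_000, 2))  # 5000.0 - two engines, twice the failures
```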

The only way a twin-engine plane is more reliable is if just one of the two engines is enough to power the airplane safely. If the airplane requires both engines in order to maneuver and land, then the second engine didn’t add reliability: it just added complexity, expense and maintenance woes.

If one engine fails, the other engine might suddenly be running at full capacity.  In day-to-day operations, we’d only be using around 50% of each engine’s power (because we got twice as much power as we needed in order to cover our disaster recovery plan).  This engine would have to suddenly go from 50% utilized to 100% utilized – and that’s when things really start to get tested.  This means we probably shouldn’t take our time to land if one engine fails: we should get our plane on the ground as fast as possible to minimize the risks of overworking the remaining engine.  It’s working much harder than normal, and it isn’t used to that kind of load.

The only way a twin-engine plane is more reliable is if the one remaining engine can last long enough to get us to the ground. If it can’t handle the stress of running at 100% capacity, we’re not much better off than we were in the first place.  Therefore, it probably makes sense to build in even more capacity; either using more powerful engines so that they each only need 80% of their power to handle our plane, or using three engines instead of two.
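The "can the survivors carry the load?" question is also simple arithmetic. Here's a sketch assuming load spreads evenly across identical units; it reproduces the two options above – beefier engines at 80%, or three engines instead of two:

```python
# Capacity headroom check, assuming load is spread evenly across identical
# units. Names are illustrative, not from any capacity-planning tool.

def utilization_after_failure(unit_count, normal_utilization):
    """Each survivor's utilization after one of unit_count units dies.

    normal_utilization is each unit's day-to-day share of capacity,
    e.g. 0.5 means every unit normally runs at 50%.
    """
    return unit_count * normal_utilization / (unit_count - 1)

print(utilization_after_failure(2, 0.5))  # 1.0  - survivor maxed out, no headroom
print(utilization_after_failure(2, 0.4))  # 0.8  - the "more powerful engines" option
print(utilization_after_failure(3, 0.5))  # 0.75 - the "three engines" option
```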

But we can’t just go bolting on engines like crazy: engines cost money, add complexity, and add weight, which makes the plane harder to get off the ground.

Now Replace “Engines” with “Servers”

Some disaster recovery plans call for two database servers: a primary server used for production, and a secondary disaster recovery server at another site.  That secondary server is constantly refreshed with data from production – perhaps via log shipping, replication, or database mirroring.  So far, so good: we’ve improved the reliability of our production site, even though we’ve added complexity.

Later, management looks at that server sitting idle and says, “We can’t leave those resources lying around. Let’s use those for reporting purposes.  We’ll have reports run against the DR server, and that’ll make our production server much faster.”  Query loads grow over time, and before you know it, both of those servers are now production.  If even just the disaster recovery system goes down, we suddenly have a problem.

The only way a two-server disaster recovery plan is more reliable is if just one of the two servers is enough to power your application safely. Otherwise, you don’t have a disaster recovery plan: you have a pending disaster.  You have the insinuation of protection without enough actual protection.  Sure, your data will still be around if one server dies, but you won’t have enough horsepower to actually service your users.  In the users’ minds, that’s a failure.

To prepare for that disaster, do some basic documentation ahead of time.  Make a list of your environments, and note whether each DR server is purely DR, or if it’s actually turned into production over time.  Before disaster strikes, make a list of which user-facing services will need to be disabled, and which can remain standing.  Decide ahead of time whether to shut down reporting queries, for example, in order to continue to service other end user activities.
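That documentation doesn't need fancy tooling – even a plain data structure will do. A sketch with made-up server and service names, just to show the shape of the inventory:

```python
# A minimal DR inventory sketch. All server, database, and service names
# here are invented for illustration.

environments = [
    {"name": "SalesDB", "dr_server": "SQLDR01",
     "dr_is_pure": False,  # reporting quietly moved here over time
     "shed_on_failover": ["nightly reports", "ad-hoc BI queries"],
     "keep_on_failover": ["order entry", "checkout"]},
    {"name": "HRDB", "dr_server": "SQLDR02",
     "dr_is_pure": True,   # still a dedicated DR box
     "shed_on_failover": [],
     "keep_on_failover": ["payroll"]},
]

def at_risk_environments(envs):
    """Environments whose 'DR' server has quietly become production."""
    return [e["name"] for e in envs if not e["dr_is_pure"]]

print(at_risk_environments(environments))  # ['SalesDB']
```

Reviewing that `dr_is_pure` flag periodically is the whole point: it tends to silently flip from True to False as query loads grow.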

Now Replace “Engines” with “Drives”

RAID 5 protects your data by striping it across multiple drives and storing parity information as well.  If any one drive fails in a RAID 5 array, you’re completely fine.  Pull out the failed drive, swap in a brand new one, and the RAID card will automatically begin rebuilding the missing data onto the blank drive from the parity information.  For more about this process, check out the Wikipedia article on RAID.

Hard drives have moving parts, and moving parts fail.  The more drives we add, the more likely we are to experience a failure.  We’re distributing the work across more drives, which increases performance, but it simultaneously increases risk.
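The same independence assumption from the engine math puts numbers on this. With each drive having some chance of failing in a given year, the odds that *at least one* drive in the array fails climb quickly as drives are added (the 3% annual rate below is illustrative, not a quoted spec):

```python
# Probability that at least one of n independent drives fails in a year.
# The annual failure rate used here is illustrative.

def chance_of_any_failure(drive_count, annual_failure_rate):
    """1 minus the probability that every drive survives the year."""
    all_survive = (1 - annual_failure_rate) ** drive_count
    return 1 - all_survive

print(round(chance_of_any_failure(1, 0.03), 3))  # 0.03  - one drive
print(round(chance_of_any_failure(8, 0.03), 3))  # 0.216 - eight drives, ~7x the risk
```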

When there’s a drive failure, the clock starts ticking.  We have to get a new drive in as fast as possible.  To shrink that failure window, enterprise systems use hot spare hard drives: blank drives that sit around idle doing nothing.  When there’s a failure on Saturday night at midnight (the universally agreed-upon standard time for drive failures), the RAID array automatically presses a hot spare into service as the replacement and starts rebuilding. SAN administrators like hot spares, because they like doing other things on Saturday nights instead.

On Saturday nights, SAN administrators like to do karaoke at The Arbitrated Loop.

When they finally return to the datacenter on Monday to replace the dead drive with a fresh one, that fresh one becomes the new hot spare.  (Not all arrays work this way – I’m generalizing.  I can hear SAN admins typing their replies already.)

While the drive array rebuilds, the remaining drives are working harder than they normally would.  Not only are they handling their regular load, but they’re also simultaneously reading data to write it onto the fresh drive.  This means our hard drives are working overtime – just like the remaining engines in our plane scenario.

This becomes a tricky balance:

  • The more drives we add, the more easily they handle normal load from end users
  • The more drives we add, the more likely we are to have a failure
  • But when a failure does happen, the more drives we have, the easier it is to keep up with the rebuild while still serving users
  • The larger the drives, the longer rebuilds take, which stretches the window where we’re exposed to a second failure
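That last point is easy to put rough numbers on. A sketch with illustrative figures – real rebuild rates vary wildly with the controller, the array's workload, and the RAID level:

```python
# Rough rebuild-window math. Rebuild speed and drive sizes here are
# illustrative; real arrays vary a great deal.

def rebuild_hours(drive_size_gb, rebuild_mb_per_sec):
    """Hours to rewrite a full drive's worth of data at a given rebuild speed."""
    seconds = drive_size_gb * 1024 / rebuild_mb_per_sec
    return seconds / 3600

print(round(rebuild_hours(500, 50), 1))   # 2.8  hours of exposure
print(round(rebuild_hours(2000, 50), 1))  # 11.4 hours - 4x the drive, 4x the window
```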

It’s just like planes: adding more stuff means managing a balance between cost, complexity and reliability.

The next time someone asks you to add more gear to your environment, or asks to take advantage of the disaster recovery gear that’s “sitting around idle”, it’s time to recalculate your risks and reliabilities.


8 Comments

  • The only time I’ve ever flown in a single-engine plane is when I was planning to jump out of it. I guess, in your analogy, my parachute is like a database snapshot: a little dangerous, but it’ll usually get you to safety when the engine dies.

  • Airborne Geek
    March 31, 2009 12:03 pm

    You’re missin’ out, dude… You really are.

    Also, if an Airbus 319 can make it to the Hudson, a Cessna can make it to that field over there in one piece, too 😉

  • HA! I should have known you’d have something to say about that one!

  • I know, not directly related to your point, but some more about the logic of having two engines (being able to fly on just one):

    The Navy (and therefore the Marine Corps) used to have a policy where all fixed-wing aircraft had to have two engines, if at all possible, with the aircraft being able to fly on just one. The idea there was that if one engine should fail, you could still land on the carrier. In the middle of the ocean with but one engine, a failure meant ditching in the water, which meant losing a very expensive piece of government equipment at the taxpayers’ expense. That’s why the Navy selected the F/A-18 Hornet (then the Northrop YF-17 Cobra) over the F-16 Falcon (and the USAF did the opposite, because it didn’t have such a requirement). However, I think those standards have been relaxed.

  • Brent, I liked your analogies.

    I still believe a 4-engine plane is safer than a 2-engine one, but there’s a balance to strike. Plus, most 2-engine planes now support ETOPS, so they can last long enough to land:
    http://en.wikipedia.org/wiki/ETOPS

    I find it difficult to convince business owners to pay for something to be used “just in case”, just like a lot of air carriers are moving away from 4-engine planes (A340) to 2-engine ones (A330 or 777). I am getting carried away here…

  • The same rules apply for power supplies. We had a situation, years ago, where we went down in production because nobody ever noticed that one of our power supplies had lost a fan and had subsequently burnt out. Several weeks later, the fan on the fail over power supply died and that power supply subsequently burnt out. Thankfully we had a geographic active-passive fail over in place with identical hardware.

  • Here’s a question I had for you. I work in an environment where, once we lock down, we do not do updates. After each update, we need compliance approval, regression testing, and all kinds of things. That means we do not do Windows, firmware, or any other kind of updates. Our current DR plan is the following:

    1) Mirroring Onsite.
    2) Log shipping offsite.

    (both are manual, no automatic failover)

    We do not have a clustered environment. Is there any reason you can think of that I would need a clustered environment?

