I don’t wanna fly in a single-engine plane.
Call me chicken, call me scaredy-cat, but I’m not excited to get into an airplane that will kill me if an engine fails. The next step up is a twin-engine plane – but I won’t get on just any of those, either.
If a single engine averages a failure once in every 10,000 hours of operation, then a plane with just one of those engines will experience a failure once every 10,000 hours. What if we equip our plane with two of those engines – how often will we experience a failure?
- Once in every 20,000 hours of operation, or
- Once in every 5,000 hours of operation
The correct answer is once in every 5,000 hours of operation. All other things being equal, two-engine planes are twice as likely to have an engine failure in the same span of time.
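A quick back-of-the-napkin way to check that math, sketched in Python. (The function name and the memoryless-failure assumption are mine, not from the original – it's a modeling simplification, not a claim about real engines.)

```python
# Expected hours until the FIRST failure among n independent engines,
# assuming memoryless (exponential) failure times -- a modeling simplification.
def hours_to_first_failure(n_engines: int, mtbf_hours: float = 10_000) -> float:
    # The minimum of n independent exponential failure times has rate
    # n / mtbf, so its expected value is mtbf / n.
    return mtbf_hours / n_engines

print(hours_to_first_failure(1))  # 10000.0 -- one engine: a failure every 10,000 hours
print(hours_to_first_failure(2))  # 5000.0  -- two engines: a failure every 5,000 hours
```

Same engines, twice as many of them, half the time between failures.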
The only way a twin-engine plane is more reliable is if just one of the two engines is enough to power the airplane safely. If the airplane requires both engines in order to maneuver and land, then the second engine didn’t add reliability: it just added complexity, expense and maintenance woes.
If one engine fails, the other engine might suddenly be running at full capacity. In day-to-day operations, we’d only be using around 50% of each engine’s power (because we got twice as much power as we needed in order to cover our disaster recovery plan). This engine would have to suddenly go from 50% utilized to 100% utilized – and that’s when things really start to get tested. This means we probably shouldn’t take our time to land if one engine fails: we should get our plane on the ground as fast as possible to minimize the risks of overworking the remaining engine. It’s working much harder than normal, and it isn’t used to that kind of load.
The only way a twin-engine plane is more reliable is if the one remaining engine can last long enough to get us to the ground. If it can’t handle the stress of running at 100% capacity, we’re not much better off than we were in the first place. Therefore, it probably makes sense to build in even more capacity; either using more powerful engines so that they each only need 80% of their power to handle our plane, or using three engines instead of two.
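That utilization jump is easy to quantify. Here's a minimal sketch, assuming the total load redistributes evenly across whatever units survive (the function name is mine):

```python
# Per-unit utilization after one of n identical units fails, assuming the
# total load redistributes evenly across the n - 1 survivors.
def utilization_after_failure(n_units: int, utilization_each: float) -> float:
    total_load = n_units * utilization_each
    return total_load / (n_units - 1)

print(utilization_after_failure(2, 0.50))  # 1.0 -- twin at 50% each: the survivor hits 100%
print(utilization_after_failure(3, 0.40))  # 0.6 -- three units at 40%: survivors hit 60%
```

This is the 80% idea in numbers: size the engines so one alone only needs 80% of its power, and each one normally loafs along around 40% – a failure pushes the survivor to 80%, not to its redline.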
But we can’t just go bolting on engines like crazy: engines cost money, add complexity, and add weight, which makes the plane harder to get off the ground.
Now Replace “Engines” with “Servers”
Some disaster recovery plans call for two database servers: a primary server used for production, and then a secondary disaster recovery server at another site. That secondary server is constantly refreshed with data from production – maybe with log shipping, replication, database mirroring, etc. So far, so good: we’ve improved the reliability of our production site, even though we’ve added complexity.
Later, management looks at that server sitting idle and says, “We can’t leave those resources lying around. Let’s use those for reporting purposes. We’ll have reports run against the DR server, and that’ll make our production server much faster.” Query loads grow over time, and before you know it, both of those servers are now production. If even just the disaster recovery system goes down, we suddenly have a problem.
The only way a two-server disaster recovery plan is more reliable is if just one of the two servers is enough to power your application safely. Otherwise, you don’t have a disaster recovery plan: you have a pending disaster. You have the illusion of protection without enough actual protection. Sure, your data will still be around if one server dies, but you won’t have enough horsepower to actually service your users. In the users’ minds, that’s a failure.
To prepare for that disaster, do some basic documentation ahead of time. Make a list of your environments, and note whether each DR server is purely DR, or if it’s actually turned into production over time. Before disaster strikes, make a list of which user-facing services will need to be disabled, and which can remain standing. Decide ahead of time whether to shut down reporting queries, for example, in order to continue to service other end user activities.
Now Replace “Engines” with “Drives”
RAID 5 protects your data by striping it across multiple drives and storing parity information too. If any one drive fails in a RAID 5 array, you’re completely fine. Pull out the failed drive, swap in a brand new one, and the RAID card will automatically begin rebuilding the missing data from parity data on the blank drive. For more about this process, check out the Wikipedia article on RAID.
Hard drives have moving parts, and moving parts fail. The more drives we add, the more likely we are to experience a failure. We’re distributing the work across more drives, which increases performance, but it simultaneously increases risk.
When there’s a drive failure, the clock starts ticking. We have to get a new drive in as fast as possible. In order to reduce the failure window, enterprise systems use hot spare hard drives: blank drives that sit around idle doing nothing. When there’s a failure on Saturday night at midnight (the universally agreed-upon standard time for drive failures), the RAID array automatically presses a hot spare into service as the replacement and starts rebuilding the array on its own. SAN administrators like hot spares, because they like doing other things on Saturday nights instead.
When they finally return to the datacenter on Monday to replace the dead drive with a fresh one, that fresh one becomes the new hot spare. (Not all arrays work this way – I’m generalizing. I can hear SAN admins typing their replies already.)
While the drive array rebuilds, the remaining drives are working harder than they normally would. Not only are they handling their regular load, but they’re also simultaneously reading data to write it onto the fresh drive. This means our hard drives are working overtime – just like the remaining engines in our plane scenario.
This becomes a tricky balance:
- The more drives we add, the easier they can handle normal load from end users
- The more drives we add, the more likely we are to have failures
- But when we have failures, the more drives we have, the easier a time we’ll have keeping up with the rebuilds
- The larger the drives, the longer rebuilds take, which lengthens our time window for recovery
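To put rough numbers on that last trade-off, here's a sketch. (The throughput, drive-size, and MTBF figures are illustrative assumptions of mine, not vendor specs, and the risk formula assumes independent, constant failure rates.)

```python
# Rough rebuild window for one failed drive, assuming the rebuild runs
# at a fixed sustained throughput the whole time.
def rebuild_hours(drive_tb: float, rebuild_mb_per_s: float) -> float:
    megabytes = drive_tb * 1_000_000
    return megabytes / rebuild_mb_per_s / 3600

# Approximate chance that ANY surviving drive fails during the rebuild
# window, assuming independent drives with constant failure rates.
def second_failure_risk(surviving_drives: int, mtbf_hours: float,
                        window_hours: float) -> float:
    per_drive = window_hours / mtbf_hours
    return 1 - (1 - per_drive) ** surviving_drives

window = rebuild_hours(8, 100)  # an 8 TB drive rebuilding at 100 MB/s: ~22 hours
risk = second_failure_risk(7, 1_000_000, window)
print(round(window, 1), risk)
```

Double the drive size and the window doubles, which roughly doubles the odds of a second drive dying mid-rebuild – exactly the lengthened recovery window the list above warns about.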
It’s just like planes: adding more stuff means managing a balance between cost, complexity and reliability.
The next time someone asks you to add more gear into your scenario or asks to take advantage of the disaster recovery gear that’s “sitting around idle”, it’s time to recalculate your risks and reliabilities.