Everything breaks. Clothing wears out; iron rusts; wood rots, splinters, or breaks; software crashes. Even though we are armed with the knowledge that breakage is inevitable, we still strive to build software that can withstand any possible failure. Withstanding every failure is a necessity for a select few applications. But for most software, failure should be more than tolerated; it should be accepted as part of day-to-day operation. When we examine our software and ask how successful we’ve been, we can’t count uptime alone as a sign of success; it’s critical that we also measure how well we responded to failure.
Disasters in History
Looking at how we’ve historically handled failure shows how our attitudes have changed over time. In the aftermath of the Titanic's sinking, "it was believed that even though the human toll could have been reduced through proper action, the sinking itself was seen as an act of God, no more avoidable than the toppling of the Tower of Babel – a consequence of man’s pride in materialism and technology." In short, the prevailing attitude was that disasters simply happen, even though someone clearly needed a better plan.
Fast forward a few decades to when the American space program was dealt a major blow by the Challenger disaster. Cold weather caused the O-rings to lose resilience and fail to seal a booster joint, ultimately resulting in Challenger exploding mid-flight. While the explosion was a tremendous disaster on its own, the gravity of the situation was made worse by the fact that the incident could have been prevented: engineers knew about the degradation of O-ring performance in cold temperatures, yet NASA insisted on launching Challenger in unseasonably cold weather.
Truthfully, both disasters could have been prevented. In the case of Challenger, heeding early warnings about material failure in cold weather could have stopped the chain of events that led up to the mid-flight explosion; a well-known and easily observed failure condition led to an avoidable tragedy. In the case of Titanic, not only could she have set sail with additional lifeboats, but the captain could have heeded warnings from nearby ships and changed course to avoid the heavy ice in the region.
Disasters happen; it’s our response to disaster that matters. Brent hit the nail on the head in Before You Fail Over a SQL Server AlwaysOn Availability Group. No amount of planning will prevent an equipment failure that leaves you without power or connectivity in your data center. When the business needs to remain operational, it’s the execution of your disaster recovery plan that keeps things running. Decisions have to be made with limited information: should you fail over, and potentially lose data, or should you stay put and troubleshoot the issue? The sketch below shows what that choice looks like in practice.
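To make the trade-off concrete, here is a minimal T-SQL sketch, assuming a hypothetical availability group named [Sales]: a planned failover requires a synchronized secondary and loses nothing, while a forced failover brings the application back at the cost of any transactions that never reached the secondary.

    -- Run on the secondary replica you want to promote.

    -- Option 1: planned manual failover. Requires the secondary to be
    -- SYNCHRONIZED, so no committed data is lost.
    ALTER AVAILABILITY GROUP [Sales] FAILOVER;

    -- Option 2: forced failover. Works even when the secondary is not
    -- synchronized, but any transactions that never reached it are lost.
    ALTER AVAILABILITY GROUP [Sales] FORCE_FAILOVER_ALLOW_DATA_LOSS;

Which statement you reach for depends entirely on what you know about the state of the secondary at that moment.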
To make good decisions during a disaster, or to get to the root cause after one, you need some way to examine your system. This is where monitoring comes into play. Once you understand what was happening and how your system was responding before a disaster struck, you can make better decisions about how to monitor for, respond to, and potentially prevent a similar disaster in the future. Most applications don’t need 100% uptime with 0% data loss. For many applications, simply understanding the root cause and doing our best to prevent or avoid it in the future is good enough. For the rest… that’s a completely different world.
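As one sketch of the kind of visibility that helps, the following query (against the same assumed availability group setup) reads SQL Server’s dynamic management views to show how far each replica has fallen behind; the send and redo queue sizes give a rough measure of how much data a forced failover would put at risk.

    -- Sketch: check synchronization health and queue sizes for each
    -- database in every availability group before deciding to fail over.
    SELECT ag.name                  AS availability_group,
           ar.replica_server_name,
           DB_NAME(drs.database_id) AS database_name,
           drs.synchronization_state_desc, -- SYNCHRONIZED vs. SYNCHRONIZING
           drs.log_send_queue_size,        -- KB of log not yet sent to the replica
           drs.redo_queue_size             -- KB received but not yet redone
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id
    JOIN sys.availability_groups   AS ag ON ag.group_id   = drs.group_id;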
Eberhart, Mark. Why Things Break. New York: Three Rivers Press, 2007.