In theory, before you introduce a new system – database server, load balancer, virtualization infrastructure, etc – you build a robust runbook that documents how you’ll handle every conceivable scenario. When there’s any kind of failure, you’ll simply turn to chapter X and start going through a precise checklist that will guide you to the promised land of uptime.
Yeah, right. In reality, you’re behind the 8 ball. Everybody wants to go live with brand spankin’ new technology right now – even if we have absolutely no experience troubleshooting it. Do it live, they say.
Here’s the easy way:
- Find a room with a big whiteboard and a projector
- Gather one person from each team (networking, systems, database, app, etc)
- Connect to the system in question via remote desktop or whatever
- Write a list on the whiteboard of every component involved
For example, on a SQL Server 2012 AlwaysOn Availability Group system, I connect to Failover Cluster Manager and list all of the components:
- Drives (local, SAN, quorum if applicable)
- IP addresses
- Services (local & clustered)
For each component, ask:
- When it fails, what will the symptoms look like?
- How will it affect the system as a whole?
- When we suspect that the component failed, who do we call to troubleshoot it further?
- How long will we wait for them to figure out if it’s broken?
- After that time, what’s our Plan B?
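If you'd rather capture the whiteboard as you go, here's a rough Python sketch of one way to record those five questions per component and see which ones are still unanswered. The `ComponentEntry` class, its field names, and every sample answer are made up for illustration; nothing here comes from SQL Server or Failover Cluster Manager.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ComponentEntry:
    """One whiteboard row: a component plus answers to the five questions."""
    component: str
    symptoms: str = ""                  # What will the failure look like?
    system_impact: str = ""             # How does it affect the whole system?
    contact: str = ""                   # Who do we call to troubleshoot further?
    wait_minutes: Optional[int] = None  # How long do we wait for a diagnosis?
    plan_b: str = ""                    # What do we do after that time?

    def unanswered(self) -> List[str]:
        """Return the names of the questions nobody has answered yet."""
        return [
            f.name
            for f in fields(self)
            if getattr(self, f.name) in ("", None)
        ]


# Hypothetical sample data: the answers are placeholders, not real guidance.
runbook = [
    ComponentEntry(
        component="Cluster admin IP address",
        symptoms="(example) management tools cannot reach the cluster",
        system_impact="(example) failovers cannot be managed",
        contact="(example) networking on-call",
        wait_minutes=15,
        plan_b="(example) escalate to the networking lead",
    ),
    ComponentEntry(component="Quorum drive"),  # nothing answered yet
]

# Show which components still need a conversation.
for entry in runbook:
    gaps = entry.unanswered()
    status = "complete" if not gaps else "missing: " + ", ".join(gaps)
    print(f"{entry.component}: {status}")
```

The point isn't the code. It's that every component gets the same five questions, and the blanks tell you where the next meeting should start.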
If we wrote down all of the answers, we’d have a runbook – but remember, we’re under the gun, so we probably won’t produce something that good. That’s completely okay. Let’s just get started by thinking through the complexity of the system and envisioning what failure might look like.
In complex systems, nothing ever fails in a way that’s completely obvious and intuitive. There’s no warning message in the event log that says, “The root cause is that Bob in Accounting decided to grab your cluster’s admin IP address for his new virtual server. Go tell Bob to get his own unique IP address, and everything will be fine.” Even if you’ve never experienced a failure like that, you might be able to recognize the symptoms if you imagine what a cluster admin IP failure would look like. Document that, and you’re on your way to a killer runbook – which means faster recovery and easier troubleshooting.