The First Step to the Poor Man’s Runbook

Last Updated February 9, 2017

Backup and Recovery, Processes and Practices

In theory, before you introduce a new system – database server, load balancer, virtualization infrastructure, etc – you build a robust runbook that documents how you’ll handle every conceivable scenario. When there’s any kind of failure, you’ll simply turn to chapter X and start going through a precise checklist that will guide you to the promised land of uptime.

Yeah, right. In reality, you’re behind the 8 ball. Everybody wants to go live with brand spankin’ new technology right now – even if we have absolutely no experience troubleshooting it. Do it live, they say.

Here’s the easy way:

Find a room with a big whiteboard and a projector
Gather one person from each team (networking, systems, database, app, etc)
Connect to the system in question via remote desktop or whatever
Write a list on the whiteboard of every component involved

For example, on a SQL Server 2012 AlwaysOn Availability Group system, I connect to Failover Cluster Manager and list through all of the components:

Servers
Drives (local, SAN, quorum if applicable)
IP addresses
Services (local & clustered)

(Image caption: Ah, if only all risks were marked with signs.)

For each component, ask:

When it fails, what will the symptoms look like?
How will it affect the system as a whole?
When we suspect that the component failed, who do we call to troubleshoot it further?
How long will we wait for them to figure out if it’s broken?
After that time, what’s our Plan B?
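If the whiteboard answers need to outlive the meeting, one option is to capture each component as a small record. The following is only a minimal sketch, not a prescribed format: the `ComponentEntry` class, its field names, the `render` helper, and the sample answers are all hypothetical illustrations of the five questions above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ComponentEntry:
    """One whiteboard row: a component plus answers to the five questions."""
    component: str
    symptoms: Optional[str] = None       # When it fails, what will the symptoms look like?
    system_impact: Optional[str] = None  # How will it affect the system as a whole?
    contact: Optional[str] = None        # Who do we call to troubleshoot it further?
    wait_minutes: Optional[int] = None   # How long will we wait for a diagnosis?
    plan_b: Optional[str] = None         # After that time, what's our Plan B?

    def unanswered(self) -> List[str]:
        """Return the names of the questions nobody has answered yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


def render(entries: List[ComponentEntry]) -> str:
    """Format the entries as plain text, marking open questions as TODO."""
    lines = []
    for entry in entries:
        lines.append(entry.component)
        for f in fields(entry):
            if f.name == "component":
                continue
            value = getattr(entry, f.name)
            lines.append(f"  {f.name}: {'TODO' if value is None else value}")
    return "\n".join(lines)


# Sample answers below are made up for illustration only.
whiteboard = [
    ComponentEntry(
        component="Cluster admin IP address",
        symptoms="(example answer) admin tools cannot reach the cluster",
        contact="Networking team",
        wait_minutes=15,
    ),
    ComponentEntry(component="Quorum drive"),
]

print(render(whiteboard))
```

Calling `unanswered()` on an entry shows which questions are still open, which fits the "we probably won't finish" reality: a half-filled entry is still useful, and the TODO lines tell you where to pick up next time.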

If we wrote down all of the answers, we’d have a runbook – but remember, we’re probably under the gun, so we probably won’t produce something that good. That’s completely okay. Let’s just get started by thinking through the complexity of the system and envisioning what failure might look like.

In complex systems, nothing ever fails in a way that’s completely obvious and intuitive. There’s no warning message in the event log that says, “The root cause is that Bob in Accounting decided to grab your cluster’s admin IP address for his new virtual server. Go tell Bob to get his own unique IP address, and everything will be fine.” Even if you’ve never experienced a failure like that, you might be able to recognize the symptoms if you imagine what a cluster admin IP failure would look like. Document that, and you’re on your way to a killer runbook, which means faster recovery and easier troubleshooting.
