We’ve talked about how to do a free SQL Server health check, and then how to drill deeper with performance check.
After these two, management usually steps back and asks, “Wait a minute – how’d we get into this mess? And long term, strategically, what should we do next?”
And usually, we’re in a hot mess because we’ve underprovisioned hardware capabilities relative to our query performance goals, not to mention our backup/recovery goals. Our databases grew larger and our maintenance window got smaller, and now if we get into trouble, we can’t restore our databases fast enough.
Step 1: Fill out the HA/DR RPO/RTO worksheet.
In our free First Responder Kit, open up the file “Worksheet – High Availability and Disaster Recovery Planning.pdf.”
Fill out the “Current State” columns based on your HA/DR setup, your backup schedules, and your restore speeds. If you haven’t tested restores or failovers in the last 6-12 months, put “UNKNOWN” across the appropriate areas.
Make sure to read the details carefully (c’mon, it’s a short page) to understand each of the failure modes. For example, folks often tell me, “We have high availability built in because we use VMware or Hyper-V.” No, those are not high availability – they’re just availability. Read those failure modes, and you’ll notice that if some of them occur, it doesn’t matter which host your VM boots up on – it’s still down for the count.
This stuff really does happen. For real-life examples, check out the post 9 Ways to Lose Data.
Step 2: Hand it to your stakeholders & management.
Say, “Here’s our current RPO/RTO. If you want to improve these numbers – and our query performance – tell me what your new goals and budget are.” That’s where the second page of the RPO/RTO worksheet come in – I don’t wanna spoil that one for you, but suffice it to say that I usually print page 1 & page 2 back-to-back on the same sheet of paper.
Yes, the business does default to wanting zero data loss and zero downtime, but they change their minds as soon as they see page 2.
If you stay in the room while they discuss their new goals, they’ll usually ask, “Can you give us more specific costs for these?” Unfortunately, no – that requires building out a project plan, hardware, licensing estimates, etc – and it’s a ton of work. Management picks the box first (which estimates a cost range), and then us geeks get started building a more accurate cost estimate based on the right technical fit for the database, team, and RPO/RTO goals.
Management will also often ask, “What about 15 minutes? Can we do 6 hours? How’s about 12 parsecs?” We can – but let’s keep things simple at first with 0, 1 minute, 1 hour, or 1 day because that’s roughly what SQL Server’s capabilities line up with.
- At zero RTO, we need automatic, always-running solutions. SQL Server doesn’t have anything built in that allows for automatic failover in zero time, preserving transactions in flight, so this is a super-expensive category.
- At one minute RTO, we need automatic solutions, but we have a few native choices that will fail over automatically when everything’s working properly. (Of course it’s not working properly, and that’s why you’re failing over, but I digress.)
- At one hour, we start to get enough time for a meatbag to get the phone call, get to a computer, log in, do some very fast troubleshooting, and perform a well-rehearsed (and documented) process.
- At one day, we’ve bought ourselves enough time for that meatbag to sober up, do in-depth troubleshooting, try to bring a server back from the dead, and when it fails, restore databases from scratch.
If they want something in between those times, that’s cool – but we need to think big picture first.
Step 3: With their RPO/RTO goals, sketch out an architecture.
The RPO/RTO goals and rough cost estimates drive the shape of your architecture diagram: the number of boxes, the arrows that connect the boxes together.
Your performance goals drive the size of the boxes.
In the previous posts in this series, I’ve given you sample schedules on what you can do in one hour or one day. Architecture reviews are a lot more ambitious – plus, they usually assume you’ve already done the health check and performance check. (You need to know the workloads before you sketch out the sizes of the boxes.) So for this one, I don’t have an easy button – but to see an example of a deliverable, check out Patient C’s sample findings for our SQL Critical Care®.
Now that you’ve seen all three parts of this series, you have a better feel for what free tools are out there to help – and what you can realistically expect to accomplish given the time that you have. (Toss in lots of servers, and the problem becomes a lot harder to solve today at scale.)
Brent, do you have a variant of the “HA/DR RPO/RTO worksheet” for if you’re hosted in the cloud?
For Infrastructure-as-a-Service, there are no changes. For Platform-as-a-Service, there’s no chart – just put in your credit card and turn the knob to the level of protection that you want. 😀
Why did you create a separate section for Corruption?
James – because some HA/DR technologies don’t protect you from corruption. For example, with failover clustering using a SAN back end, when your single point of failure has a corruption issue, there’s no HA. Not every shop wants to fail over to DR for that, or they may have a separate RPO/RTO for it.
Thanks for this Brent, really useful.
One question I have is where you see lifecycle playing a part of these sorts of review? I’m still seeing a lot of 2008 R2 in use, as well as not planning for the 2012 end of life in 2022. Would you look to include that as a consideration?