How to Test Disaster Recovery Before You Go Live

When I build a server, success means not touching the server again for 2-3 years. I already have enough crappy, unreliable servers that fall over when someone walks past. I only wanna build good, permanent stuff going forward.

So when I build disaster recovery for something, I want to test it 3 ways:

  1. Planned failover without data loss
  2. Unplanned failover WITH data loss
  3. Planned fail-back without data loss

Let’s say we have a really simple scenario: a 2-node Always On Availability Group with one replica in our primary data center on top, and a second replica in our DR data center (or the cloud, or someone else’s computer, or whatever) on the bottom:

Simple Always On Availability Group

Here’s what those scenarios look like.

1. Planned failover without data loss

If you want to leverage your AG for easier patching with less frequent downtime, you can:

  • Patch the secondary on a weekday when you’re caffeinated and sober
  • During a maintenance window, fail over to it (which will involve steps like switching to synchronous mode if you normally run async, and then possibly switching back to async after the failover)
  • Patch the former primary
  • During another maintenance window, fail back to the former primary

This would be a planned failover, and you should be able to do it without data loss whether you’re using Availability Groups, database mirroring, log shipping, SAN replication, whatever.
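On a plain Availability Group, those mode switches and the failover itself boil down to a handful of T-SQL commands. This is only a sketch: the AG name [MyAG] and the replica names SQL1 and SQL2 are made-up placeholders for your own environment.

```sql
-- On the current primary (SQL1): switch the DR replica to synchronous
-- commit so the failover can happen without data loss.
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL2'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

-- Wait until the databases show SYNCHRONIZED, then, connected to the
-- DR replica (SQL2), perform the planned failover:
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

-- Afterwards, if you normally run async, switch the (now-secondary)
-- former primary back to asynchronous commit:
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL1'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
```

Every one of those steps belongs on the checklist, along with a screenshot of what "SYNCHRONIZED" looks like in the AG dashboard.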

As you step through doing it, document the work involved, taking screenshots as you go. Write down any jobs that need to be changed, how to check backups, etc. The goal here isn’t necessarily for anyone on your team to be able to patch your SQL Server – the goal is to enable them to do a planned failover.

Say you’re out on vacation, and your company gets word that there’s a data center emergency, and you have to migrate everything out quickly. Someone should be able to grab your checklist, follow the steps, and fail over with confidence.

2. Unplanned failover WITH data loss

Assuming that you normally run in asynchronous mode, when you experience a disaster in your primary data center, you’re gonna lose data. Some of the transactions won’t have replicated over to DR.

To simulate this:

  • Run a workload on the primary (rebuilding indexes is great for this because it generates a ton of transaction log activity, fast)
  • As the primary gets farther behind, shut it down not-at-all gracefully (I like simply disabling the network ports behind the scenes)
  • Now, tag, you’re it.
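If you don't have a workload generator handy, index rebuilds are an easy way to flood the log send queue because they're fully logged. A minimal sketch, where StackOverflow and dbo.Posts are just placeholders for any large database and table you have lying around:

```sql
-- Run on the primary. Index rebuilds generate a ton of transaction
-- log, which piles up in the send queue faster than async replication
-- can ship it to DR.
USE StackOverflow;
GO
ALTER INDEX ALL ON dbo.Posts REBUILD;
GO 10   -- SSMS/sqlcmd batch repeat: do it ten times in a row
```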

Your job is to:

  • Figure out how much data you’re going to lose when you bring the DR secondary online
  • Communicate that to management to get their consensus as to how hard it will be to get that data back (to learn about that process, watch this Senior DBA class video)
  • Bring the DR secondary online
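To put numbers on that first bullet, the DMVs on the DR secondary tell you how far behind it is, and once management signs off, the forced failover is a single command. [MyAG] and [MyDatabase] are placeholder names here:

```sql
-- Run on the DR secondary to estimate data loss:
SELECT DB_NAME(drs.database_id)  AS database_name,
       drs.last_commit_time,      -- last transaction hardened on THIS replica
       drs.log_send_queue_size,   -- KB of log the primary hadn't sent yet
       drs.redo_queue_size        -- KB received here but not yet redone
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;

-- After management has agreed to the data loss, force the failover
-- on the DR secondary:
ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```

The gap between last_commit_time here and the time the primary died is roughly the window of transactions you're losing, and that's the number management needs to hear.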

Again, document your work as you go, building checklists and taking screenshots. This is the checklist I really wanna be confident in – when the hurricane hits, I want any of the members of the IT team to be able to accomplish this. I don’t write documents for the janitorial team, mind you, just the IT team.

3. Planned fail-back without data loss

Continuing the above scenario, bring the former primary back online. This part is way, way trickier than it looks. Depending on your business, you may need to take backups of the former primary, start documenting what data was lost, and maybe even pave the former primaries and rebuild them completely if they were far enough behind.

This scenario is the one least likely to be done without the DBA’s involvement. Once you truly pull the trigger to fail over to DR, you’re not going to want to jump back into that hot water quickly.
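When the former primary does come back up after a forced failover, it rejoins as a secondary with data movement suspended, which is your one chance to salvage its divergent data before it gets rolled back. Once you've decided to discard that data, the commands are short, using the same placeholder names as above:

```sql
-- On the former primary, AFTER you've taken whatever backups or
-- extracts you need: resuming data movement discards the divergent
-- transactions and resynchronizes from the new primary.
ALTER DATABASE [MyDatabase] SET HADR RESUME;

-- Later, during a maintenance window, fail back with the same planned
-- failover process as scenario 1 (sync mode first, then failover):
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;
```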


After you’ve done all three of the above scenarios, and you’ve got checklists for them, you’re much more confident in how the infrastructure is going to react to problems. The end result is something that is more likely to stand the test of time, being predictable and reliable over the course of several years.

However, you can only do this BEFORE you go live, not afterwards. Nobody wants to take production down repeatedly to test this.

That’s why when I’m asked to build an Availability Group, I usually start by saying, “Great, let’s build it from scratch in an isolated environment so you can write all these checklists out and be confident in how to manage it.”


9 Comments

  • Andrew Robertson
    July 20, 2018 10:00 am

    This is a great article, I just wish management would understand this approach. It’s disheartening when you get told that this is overkill, and there is no money in the budget for this level of testing. When things go wrong, wishing you went to this level of detail is too late. Saying I told you so to management could be a career killer. Maybe next time I’ll add a link to this article before begging for a test site. 🙂

    • Andrew – thanks! I used to think that way too – I used to say, “Doggone these managers, they don’t do what I want! They must be dumb or something!”

      Then I realized over time that I had to do a sales pitch or marketing effort to convince management to do what I wanted. If I could do a good job of explaining why going live without testing scared ME, then I could make it scare them too, and then we got the resources necessary to do a safe, stable deployment.

      My favorite starting point on this is the book Getting To Yes: Negotiating Agreement Without Giving In – https://amzn.to/2uCFFad

      • Andrew Robertson
        July 22, 2018 4:43 am

        Thanks Brent, I found the book at my local bookstore, interestingly right next to The Art of War.

  • This is great stuff! Can you point me to where I can get info on how to test with only a VM and not an AG setup? I need to make my team aware of the risks with only a VM and backups for DR. When we upgrade to SQL 2017 or vNext in the next year or so, this will be a perfect exercise as our DR exercises usually consist of taking our backups and restoring them to a restored VM at our offsite vendor recovery site.

  • Great post Erik!
    hehehe

  • “When I build a server, success means not touching the server again for 2-3 years.”
    Perhaps another team will be patching SQL Server in production and DR for you? 🙂
