When I build a server, success means not touching the server again for 2-3 years. I already have enough crappy, unreliable servers that fall over when someone walks past. I only wanna build good, permanent stuff going forward.
So when I build disaster recovery for something, I want to test it 3 ways:
- Planned failover without data loss
- Unplanned failover WITH data loss
- Planned fail-back without data loss
Let’s say we have a really simple scenario: a 2-node Always On Availability Group with one replica in our primary data center, and a second replica in our DR data center (or the cloud, or someone else’s computer, or whatever).
Here’s what those scenarios look like.
1. Planned failover without data loss
If you want to leverage your AG for easier patching with less frequent downtime, you can:
- Patch the secondary on a weekday when you’re caffeinated and sober
- During a maintenance window, fail over to it (which will involve steps like switching to synchronous mode if you normally run async, and then possibly switching back to async after the failover; there’s a rough sketch of those commands after this list)
- Patch the former primary
- During another maintenance window, fail back to the former primary
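Here’s a rough sketch of what those failover commands can look like in T-SQL. The AG name ([MyAG]) and replica names (SQLPROD01, SQLDR01) are placeholders for whatever yours are called, so treat this as a starting point for your checklist, not gospel:

```sql
-- On the current primary: switch the DR replica to synchronous commit
-- so the failover can happen without data loss (assumes you normally run async).
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON 'SQLDR01' WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

-- Wait until the databases show SYNCHRONIZED before failing over:
SELECT ag.name, ar.replica_server_name, drs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON drs.group_id = ag.group_id;

-- On the secondary (the replica you're failing over TO):
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;

-- Afterwards, if you normally run async, switch the replicas back:
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON 'SQLDR01' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON 'SQLPROD01' WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
```

The fail-back in the last step of the list is the same FAILOVER command run in the other direction, once the former primary is patched and synchronized again.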
This would be a planned failover, and you should be able to do it without data loss whether you’re using Availability Groups, database mirroring, log shipping, SAN replication, whatever.
As you step through doing it, document the work involved, taking screenshots as you go. Write down any jobs that need to be changed, how to check backups, etc. The goal here isn’t necessarily for anyone on your team to be able to patch your SQL Server – the goal is to enable them to do a planned failover.
Say you’re out on vacation, and your company gets word that there’s a data center emergency, and you have to migrate everything out quickly. Someone should be able to grab your checklist, follow the steps, and fail over with confidence.
2. Unplanned failover WITH data loss
Assuming that you normally run in asynchronous mode, when you experience a disaster in your primary data center, you’re gonna lose data. Some of the transactions won’t have replicated over to DR.
To simulate this:
- Run a workload on the primary (rebuilding indexes is great for this because it generates a ton of transaction log activity, fast; there’s a quick-and-dirty sketch after this list)
- As the DR secondary falls farther behind, shut the primary down not-at-all gracefully (I like simply disabling the network ports behind the scenes)
- Now, tag, you’re it.
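If you want a quick-and-dirty way to hammer the transaction log while you test, something like this works. It assumes a scratch database named StressTest with some tables in it; don’t point it at anything you care about:

```sql
-- Generate a pile of transaction log activity on the primary
-- by rebuilding every index in a scratch database, over and over.
USE StressTest;
GO
DECLARE @sql NVARCHAR(MAX) = N'';
SELECT @sql += N'ALTER INDEX ALL ON ' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name)
             + N' REBUILD;' + CHAR(10)
FROM sys.tables t
JOIN sys.schemas s ON t.schema_id = s.schema_id;
EXEC sys.sp_executesql @sql;
GO 10   -- SSMS/sqlcmd trick: repeat the previous batch 10 times
```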
Your job is to:
- Figure out how much data you’re going to lose when you bring the DR secondary online (there’s a sketch of that after this list)
- Communicate that to management to get their consensus as to how hard it will be to get that data back (to learn about that process, watch this Senior DBA class video)
- Bring the DR secondary online
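Here’s a sketch of what that looks like on the DR secondary. The DMV is real, the AG name is a placeholder, and translating “the last commit time we have” into a business answer is the hard part:

```sql
-- On the DR secondary: estimate how much data you're about to lose.
-- last_commit_time is (approximately) the commit time of the last transaction
-- this replica received; the gap between that and when the primary died
-- is your data loss window.
SELECT DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.last_hardened_time,
       drs.last_commit_time
FROM sys.dm_hadr_database_replica_states drs
WHERE drs.is_local = 1;

-- Once management has signed off on the loss, force the failover
-- on the DR secondary:
ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```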
Again, document your work as you go, building checklists and taking screenshots. This is the checklist I really wanna be confident in – when the hurricane hits, I want any of the members of the IT team to be able to accomplish this. I don’t write documents for the janitorial team, mind you, just the IT team.
3. Planned fail-back without data loss
Then, continuing the above scenario, bring the former primary back online. This part is way, way more tricky than it looks. Depending on your business, you may need to take backups of the former primary, start documenting what data was lost, and maybe even pave the former primary and rebuild it completely if it was far enough behind.
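One gotcha worth putting in your checklist: after a forced failover, when the former primary rejoins the AG as a secondary, its databases sit there with data movement suspended until you resume them, and resuming rolls back whatever transactions never made it to DR. A minimal sketch, assuming a database named MyDatabase and that your backups and data-loss documentation are already done:

```sql
-- Run on the former primary AFTER it has rejoined the AG as a secondary,
-- and AFTER you've taken whatever backups / data-loss notes you need.
-- Resuming data movement rolls back the transactions that never made it to DR.
ALTER DATABASE [MyDatabase] SET HADR RESUME;
```

Once it’s caught back up, the fail-back itself is just another planned failover like scenario #1.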
This scenario is the one least likely to be done without the DBA’s involvement. Once you truly pull the trigger to fail over to DR, you’re not going to want to jump back into that hot water quickly.
After you’ve done all three of the above scenarios, and you’ve got checklists for them, you’re much more confident in how the infrastructure is going to react to problems. The end result is something that is more likely to stand the test of time, being predictable and reliable over the course of several years.
However, you can only do this BEFORE you go live, not afterwards. Nobody wants to take production down repeatedly to test this.
That’s why when I’m asked to build an Availability Group, I usually start by saying, “Great, let’s build it from scratch in an isolated environment so you can write all these checklists out and be confident in how to manage it.”