Think fast! When it’s time to do a manual failover with SQL Server 2012 AlwaysOn Availability Groups, we need to make a few decisions and perform a few tasks as quickly as possible to get the business back up and running. Here’s what you need to consider ahead of time so you can stay calm, cool, and collected – even if you have to trust someone else to do this work.
How much data will we lose? If you’re failing over to an asynchronous replica, you’re probably going to lose data. Build a set of queries that show exactly how old the replica’s data is – but here’s the catch: you can’t rely on being able to query the primary replica. Instead, look at timestamp-based tables. For example, if you’ve got a Sales table with a SalesDateTime field, run this query on the replica:
```sql
SELECT TOP 10 * FROM dbo.Sales ORDER BY SalesDateTime DESC;
```
If the newest sales are 20 minutes old, you can tell management that they stand to lose up to 20 minutes of data if we fail over now. You may still be able to recover the data from the primary – more on that later. The amount of potential data loss helps you with the rest of the decisions you’re about to make.
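To turn that eyeball check into a number you can read to an executive, a variation on the same idea works well. This is a sketch using the article's example Sales table and SalesDateTime column; substitute any table you know gets constant inserts:

```sql
-- Run on the secondary replica: how many minutes behind is this data?
-- dbo.Sales / SalesDateTime are the example names from above; use your own
-- frequently-inserted table here.
SELECT DATEDIFF(MINUTE, MAX(SalesDateTime), SYSDATETIME()) AS MinutesBehind
FROM dbo.Sales;
```

Wrap a few of these (one per busy table) into a single stored procedure ahead of time, so anyone on call can get the data-loss estimate with one command.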
How long will we troubleshoot? If we stand to lose 20 minutes of data, then it helps to be able to tell an executive, “We can either go live with our DR replica right now and lose 20 minutes of data, or I can spend 30 minutes troubleshooting to find out how bad the situation is. What would you like to do?” Put that decision in management’s hands as fast as you can – preferably after you’ve done just five minutes of troubleshooting. Within five minutes, you should know whether it’s an easy problem or a challenging one, and you want to let management decide when the outage should end.
Who’s allowed to make the go/no-go decision? If the DBA, dev manager, or IT manager isn’t around, who’s responsible for the call to fail over to the async replica and lose data? Ideally, we’ve got as large a list as possible here so that the company can react fast without waiting around for a long phone tree. This list of people must be firmed up ahead of time.
Who’s allowed to perform the failover? AlwaysOn Availability Group failover can be scripted out with PowerShell and T-SQL, and these scripts can be given to a 24×7 help desk rotation. I know, as a DBA, it can be scary to hand out this kind of permission, but I’d rather empower someone else – when armed with a manager who can make go/no-go decisions – so that the company can be back up and running faster.
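The heart of that script is a single T-SQL statement. A minimal sketch, assuming an availability group named AG1 (a placeholder – use your own AG name):

```sql
-- Run on the secondary replica being promoted, not on the dead primary.
-- FORCE_FAILOVER_ALLOW_DATA_LOSS is required when failing over to an
-- asynchronous replica; the keyword makes the data-loss decision explicit.
ALTER AVAILABILITY GROUP [AG1] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```

Keep the full script in the run book alongside the go/no-go contact list, so the help desk runs a tested command instead of improvising syntax at 3 AM.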
Once failover starts, who’s got quorum votes? If groups of servers, or heaven forbid, an entire datacenter is offline, failover starts to get more complex. The team will need to force the quorum online without enough votes, and then reassign voting rights using PowerShell or cluster.exe. We’ll probably need to change the quorum model as well, perhaps going from node majority to node-and-file-share majority. These are complex topics that should be thought through ahead of time, not researched mid-outage with Books Online open in one window and an angry executive in the other.
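A sketch of those cluster steps using the FailoverClusters PowerShell module – node names, and the file share path are placeholders for your own environment:

```powershell
# Force the surviving node to start the cluster service without quorum.
Start-ClusterNode -Name "SQLNODE3" -FixQuorum

# Strip votes from the nodes in the offline datacenter so they can't
# outvote the survivors when the network comes back.
(Get-ClusterNode -Name "SQLNODE1").NodeWeight = 0
(Get-ClusterNode -Name "SQLNODE2").NodeWeight = 0

# Switch the quorum model to node-and-file-share majority.
Set-ClusterQuorum -NodeAndFileShareMajority "\\FILESERVER\ClusterWitness"
```

Rehearse this in a lab first – the exact sequence depends on which nodes survive, and that’s precisely why it belongs in a pre-written run book rather than in your head.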
Do jobs need to be enabled/disabled? If we’ve turned a secondary replica into a primary, how does this affect our SQL Server Agent jobs that were running on the old replica? Do we need to change any of our backup jobs? In theory, the built-in backup preferences settings for AlwaysOn Availability Groups will cover this, but complex replica scenarios may have multiple backup jobs to maintain both onsite and offsite backups. We might also have aggregation jobs or ISV jobs that must always run on the primary, and we need to ensure those get enabled.
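One low-maintenance approach is to deploy every job to every replica, and have each job step check whether the local instance is currently the primary before doing any work. A sketch that works on SQL Server 2012 DMVs – the database name is a placeholder:

```sql
-- First step of an Agent job: quietly exit unless this instance is
-- currently the primary replica for the AG containing MyDatabase.
IF NOT EXISTS (
    SELECT 1
    FROM sys.dm_hadr_availability_replica_states ars
    INNER JOIN sys.availability_databases_cluster adc
        ON ars.group_id = adc.group_id
    WHERE adc.database_name = 'MyDatabase'
      AND ars.is_local = 1
      AND ars.role_desc = 'PRIMARY'
)
    RETURN; -- we're a secondary; skip the rest of this batch
```

With this guard in place, failover doesn’t require anyone to remember which jobs to enable or disable – the jobs sort themselves out.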
Who will troubleshoot connectivity problems? When there’s been a massive failure, I like having an open conference bridge with all hands on deck – but not necessarily all hands should focus on the call. For example, in multi-DBA teams, I prefer to designate just one DBA team member as the primary point of contact on the call. They handle all questions from developers and end users while the other DBAs focus on making sure the replica stays healthy.
What do we do if we have multiple primary replicas? In a split-brain cluster scenario, we can have two database servers that both think they’re the primary replica for the same database. We need a written checklist that we can give to someone who will drive to the not-supposed-to-be-primary datacenter, shut the servers down, and stay there until the network comes back online to bring the server up gracefully.
Will we try to recover the lost data? When the formerly-primary server comes back online, the databases will still be readable. We can use data comparison tools to compare the two replicas and generate insert/update/delete scripts to bring the data back into sync – in theory. In practice, identity fields can make this impractical or impossible. I’d start by comparing timestamps in a few key tables just to see how much data is at stake, and then give management a rough time estimate on what it’d take to recover that data. In order to give a good estimate, you’ll want to try this approach ahead of time (long before an outage) using a pair of development or QA database servers. Get a feeling for how the tables are related and how much work would be involved to sync them.
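Before reaching for a full comparison tool, a quick query on the formerly-primary server scopes the problem. A sketch using the article's Sales example; the cutoff value is a placeholder for the newest timestamp you observed on the replica at failover time:

```sql
-- Run on the old primary once its databases are readable again:
-- list the rows that never made it to the new primary.
DECLARE @Cutoff DATETIME2 = '2012-06-01 09:40:00'; -- newest row on the replica
SELECT *
FROM dbo.Sales
WHERE SalesDateTime > @Cutoff
ORDER BY SalesDateTime;
```

The row counts from a handful of key tables give management a concrete sense of what’s stranded, which makes the recover-or-abandon conversation much shorter.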
There are a lot of questions here. When your team is armed with the answers ahead of time, it’ll make failovers much less painful, and you’ll look like the smartest, fastest-reacting ninjas in the company.