As a DBA, you’re responsible for identifying necessary updates to keep your SQL Servers healthy. Your business may have some mandates about the frequency of patches, but even if they don’t, you have a duty to look out for Cumulative Updates, Service Packs, and out of band hotfixes that can prevent downtime and prevent performance problems. It’s your task to test and apply them as needed.
If you use SQL Server Availability Groups (AGs), this is a tough responsibility. There are a lot of updates released for critical issues. And sometimes there are updates to the updates, because…. oops. And sometimes the updates may cause you downtime.
Here’s the lose-lose situation that the Availability Group DBA ends up in:
- If you don’t apply updates, you could hit a critical performance issue that brings down your environment which was fixed in an existing cumulative update or hotfix. An update you should have known about.
- If you apply updates regularly, you can cause yourself downtime by issues introduced in the updates. And here’s the worst part: testing reduces your risk of this, but doesn’t prevent it.
You can’t just ignore the situation. After all, you’re responsible.
High Availability is Supposed to Reduce Downtime– Including Downtime for Patching
One of the big reasons DBAs like High Availability solutions is that we want to reduce downtime during patching. If you’ve ever done much patching, you know that it can take a while to restart servers sometimes. Sometimes the server hangs on restart and you may have to connect to a special server management card to nudge it along. (iLo, DRAC, etc.)
If you’re using Database Mirroring, a Failover Cluster, or Availability Groups, you can reduce downtime in patching: you can fail your SQL Server resources around so that there are only short downtimes and your customers aren’t dependent on the server to come back. You want these features so you have less to worry about, not more.
PATCHING an Availability Group Can Cause Unplanned Downtime
Check out this recent critical problem impacting multiple cumulative updates across SQL Server 2012 and 2014. If you apply these updates, you may run into a critical blocking issue which you can only fix by:
- Turning off automatic failover
- Restarting the SQL Server instance on the primary replica. Yep, that’s an outage.
- Turning on automatic failover
Update: KB 3033492 has now been published which gives more information on options to avoid the issue (spoiler: it requires an outage), or other ways you can attempt to correct it by killing sessions (which may not work).
You might wonder, “How many people could this have impacted?” I wondered too. The night I learned about this issue, I went to my local user group meeting and heard that a person there had been impacted by the bug. They said their production environment was down for more than 9 hours while they worked with Microsoft to resolve the issue.
Which of These Downtime Scenarios is Worse?
Pick your poison. And it will feel like poison. Is it worse to:
A) Be an unlucky optimist. Follow the steps in the KB. Hope that you detect the problem during your patching window if it occurs. There’s a chance that the issue might be detected after your planned patching window has ended (and you’ve gone back to sleep), and then you’ve got an unplanned downtime on your hand and unhappy customers.
B) Be a beleaguered pessimist. Proactively take the steps above when applying patches to prevent the bug from occurring. This is more work for you, more steps for you to carry out and validate (you did put that back in automatic failover, right?), and more outages for your customer. You’re now not just failing back and forth between nodes, you’re doing an extra restart. And restarting the instance underneath the primary replica with no failover is exactly the type of thing we expect High Availability to save us from.
This Couldn’t Happen Again, Right?
Remember SQL Server 2012 SP1? It had an issue that impacted some installs where processes named msiexec.exe would keep running and use tons of CPU. It bloated out the registry. It was pretty nasty, and it raised the issue, “Should you trust Service Packs?” (Spoiler: no.)
That issue wasn’t specific to AGs. But we now have a pattern where supported, released updates (even service packs) can cause major problems on a portion of installs and they are released anyway. Odds are, it will happen again.
But Microsoft Said I Should Only Install Cumulative Updates if I Need Them, Right?
Yep– that’s what they used to say. It was dangerous advice to follow, because Service Packs are released very slowly these days.
But this changed with KB 2964518 – “Recommended updates and configuration options for SQL Server 2012 and SQL Server 2014 used with high-performance workloads.” This article specifically recommends using the latest cumulative update, and gives a dramatic list of issues fixed in those updates.
Cumulative Updates ARE Especially Critical for AGs. Look at these Fixes…
It’s possible you might still be thinking that it’s safer to just not patch an availability group and hope for the best. That’s not a good idea. Take a look at these fixes:
- “SQL Server 2012 experiences out-of-memory errors” http://support.microsoft.com/kb/2769594 (this is specific to AGs)
- “FIX: Out-of-memory errors related to a memory clerk in SQL Server 2012” http://support.microsoft.com/kb/2821783 (this is specific to AGs)
- “FIX: SQL Cache Memory decreases and CPU usage increases when you rebuild an index for a table in SQL Server (with AlwaysOnAG). 2012 SP2 CU2, 2014 CU1” http://support.microsoft.com/kb/2958054
- “FIX: High “log write waits” counter value on a SQL Server 2012 instance” http://support.microsoft.com/kb/2809338 (This is specific to Availability Groups per the Symptoms)
- “FIX: Error when you back up a database that has case-sensitive collation by using VSS in SQL Server 2012 SP2” http://support2.microsoft.com/kb/2987610 (the Symptoms specifically list backing up AG secondaries)
- FIX: SQL Server 2012 or SQL Server 2014 instance shuts down when you join database as secondary replica during AlwaysOn Availability Groups configuration http://support.microsoft.com/kb/2929193
- “Connection times out when you use AlwaysOn availability group listener with MultiSubnetFailover parameter” http://support.microsoft.com/kb/2870437 (This one is a Windows 2008R2 issue)
- “FIX: A memory leak occurs when you enable AlwaysOn Availability Groups or SQL Server failover cluster in Microsoft SQL Server 2012” http://support.microsoft.com/kb/2877100
- “FIX: “System objects not be updated (sic)” when you use AlwaysOn Availability Groups in SQL Server 2012 or SQL Server 2014″ http://support.microsoft.com/kb/3002071
- “A hotfix that improves the performance of the “AlwaysOn Availability Group” feature in SQL Server 2012 is available for Windows Server 2008 R2″ http://support.microsoft.com/KB/2687741
- “Time-out error and you cannot connect to a SQL Server 2012 AlwaysOn availability group listener in a multi-subnet environment” http://support.microsoft.com/kb/2792139
- “FIX: Synchronization state of an AlwaysOn Availability Group replica may not be updated if primary is unhealthy” http://support.microsoft.com/kb/2897554
- “Troubleshooting automatic failover problems in SQL Server 2012 AlwaysOn environments” http://support.microsoft.com/kb/2833707 (This is a list of things you need to look out for that can make automatic failover not work which may not be obvious, not a hotfix.)
- “FIX: “Non-yielding scheduler” error occurs and AlwaysOn Availability Group transits to RESOLVING state” http://support.microsoft.com/kb/3020116
- “FIX: Scheduler deadlock on AlwaysOn Availability Group primary replica in SQL Server 2012” http://support.microsoft.com/kb/2869734/en-us
- “SQL Server 2012 service shuts down unexpectedly upon availability group replica role transition on a Windows Server 2008 R2-based failover cluster” http://support.microsoft.com/kb/2777201
- “FIX: SQL Server 2012 or SQL Server 2014 instance shuts down when you join database as secondary replica during AlwaysOn Availability Groups configuration” http://support.microsoft.com/kb/2929193
I know that some folks are going to ask for a comprehensive bug list specific to AGs. I’m afraid I don’t have time to maintain that list– and that is kinda telling.
So What Should You Do?
How much time and effort do you have for reading about fixes, testing them carefully, watching blogs for out of release fixes and known issues, and establishing complex patching and production processes?
If you don’t have at least an hour a week to keep on top of available fixes, and another eight hours a month (minimum) to devote to testing and deploying fixes, you don’t have time for an AG.
Got more than two replicas? You need more time. Running Windows Server 2008R2? Well you need a lot more time, because you should be moving out of there.
It was hard for me to write this post. I’ve met some of the folks who make SQL Server. They’re incredibly smart, very dedicated, super helpful, and they work very hard to make a great product. But right now Availability Groups are a very complex and time consuming feature to manage — and many teams don’t have the right staffing and time to take on that challenge and succeed.
Brent Says: I have a hunch that maybe Kendra was a little angry when she wrote this, but even so, I’m glad I’m a consultant these days and not a full time DBA. Patching blows.