As a DBA, you’re responsible for identifying the updates needed to keep your SQL Servers healthy. Your business may have mandates about patching frequency, but even if it doesn’t, you have a duty to look out for Cumulative Updates, Service Packs, and out-of-band hotfixes that can prevent downtime and performance problems. It’s your task to test and apply them as needed.
If you use SQL Server Availability Groups (AGs), this is a tough responsibility. There are a lot of updates released for critical issues. And sometimes there are updates to the updates, because… oops. And sometimes the updates may cause you downtime.
Here’s the lose-lose situation that the Availability Group DBA ends up in:
- If you don’t apply updates, you could hit a critical performance issue that brings down your environment, one that was already fixed in an existing cumulative update or hotfix. An update you should have known about.
- If you apply updates regularly, you can cause yourself downtime through issues introduced by the updates themselves. And here’s the worst part: testing reduces your risk of this, but doesn’t prevent it.
You can’t just ignore the situation. After all, you’re responsible.
High Availability is Supposed to Reduce Downtime, Including Downtime for Patching
One of the big reasons DBAs like High Availability solutions is that we want to reduce downtime during patching. If you’ve ever done much patching, you know that restarting servers can take a while. Sometimes the server hangs on restart and you have to connect to a special server management card (iLO, DRAC, etc.) to nudge it along.
If you’re using Database Mirroring, a Failover Cluster, or Availability Groups, you can reduce downtime during patching: you can fail your SQL Server resources around so that there are only short interruptions and your customers aren’t stuck waiting for a server to come back. You want these features so you have less to worry about, not more.
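With an AG, the usual patching dance is: patch a secondary, then do a planned manual failover so the patched node becomes primary. The failover half of that dance can be sketched in T-SQL (a sketch only; [MyAG] is a placeholder name, and it assumes a synchronized, synchronous-commit secondary):

```sql
-- Run this on the secondary replica you want to promote.
-- It must be in synchronous-commit mode and in a SYNCHRONIZED state,
-- or the planned failover will be refused.
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;
```

After the failover, you patch and restart the old primary (now a secondary) while your databases stay online on the other node.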
PATCHING an Availability Group Can Cause Unplanned Downtime
Check out this recent critical problem impacting multiple cumulative updates across SQL Server 2012 and 2014. If you apply these updates, you may run into a critical blocking issue which you can only fix by:
- Turning off automatic failover
- Restarting the SQL Server instance on the primary replica. Yep, that’s an outage.
- Turning on automatic failover
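In T-SQL, those workaround steps look roughly like this. This is a sketch, not the KB’s exact script: [MyAG] and SQL01 are placeholder names, and depending on which replica pair is set to automatic, you may need to run the MODIFY REPLICA statement for more than one replica:

```sql
-- 1. Turn off automatic failover (run on the primary).
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL01' WITH (FAILOVER_MODE = MANUAL);

-- 2. Restart the SQL Server instance on the primary replica.
--    There's no T-SQL for this; use Configuration Manager, or from an
--    elevated prompt on the primary:
--      net stop MSSQLSERVER
--      net start MSSQLSERVER
--    Yep, that's an outage.

-- 3. Turn automatic failover back on.
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL01' WITH (FAILOVER_MODE = AUTOMATIC);
```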
Update: KB 3033492 has now been published, which gives more information on options to avoid the issue (spoiler: they require an outage), plus other ways you can attempt to correct it by killing sessions (which may not work).
You might wonder, “How many people could this have impacted?” I wondered too. The night I learned about this issue, I went to my local user group meeting and heard that a person there had been impacted by the bug. They said their production environment was down for more than 9 hours while they worked with Microsoft to resolve the issue.
Which of These Downtime Scenarios is Worse?
Pick your poison. And it will feel like poison. Is it worse to:
A) Be an unlucky optimist. Follow the steps in the KB. Hope that if the problem occurs, you detect it during your patching window. There’s a chance the issue might be detected after your planned patching window has ended (and you’ve gone back to sleep), and then you’ve got an unplanned downtime on your hands and unhappy customers.
B) Be a beleaguered pessimist. Proactively take the steps above when applying patches to prevent the bug from occurring. This is more work for you, more steps to carry out and validate (you did set that back to automatic failover, right?), and more outages for your customers. You’re now not just failing back and forth between nodes, you’re doing an extra restart. And restarting the instance underneath the primary replica with no failover is exactly the type of thing we expect High Availability to save us from.
This Couldn’t Happen Again, Right?
Remember SQL Server 2012 SP1? It had an issue that impacted some installs where processes named msiexec.exe would keep running and use tons of CPU. It bloated the registry. It was pretty nasty, and it raised the question, “Should you trust Service Packs?” (Spoiler: no.)
That issue wasn’t specific to AGs. But we now have a pattern where supported, released updates (even service packs) can cause major problems on a portion of installs, and they get released anyway. Odds are, it will happen again.
But Microsoft Said I Should Only Install Cumulative Updates if I Need Them, Right?
Yep, that’s what they used to say. It was dangerous advice to follow, because Service Packs are released very slowly these days.
But this changed with KB 2964518 – “Recommended updates and configuration options for SQL Server 2012 and SQL Server 2014 used with high-performance workloads.” This article specifically recommends using the latest cumulative update, and gives a dramatic list of issues fixed in those updates.
Cumulative Updates ARE Especially Critical for AGs. Look at these Fixes…
It’s possible you might still be thinking that it’s safer to just not patch an availability group and hope for the best. That’s not a good idea. Take a look at these fixes:
- “SQL Server 2012 experiences out-of-memory errors” http://support.microsoft.com/kb/2769594 (this is specific to AGs)
- “FIX: Out-of-memory errors related to a memory clerk in SQL Server 2012” http://support.microsoft.com/kb/2821783 (this is specific to AGs)
- “FIX: SQL Cache Memory decreases and CPU usage increases when you rebuild an index for a table in SQL Server (with AlwaysOn AG)” (fixed in 2012 SP2 CU2 and 2014 CU1) http://support.microsoft.com/kb/2958054
- “FIX: High “log write waits” counter value on a SQL Server 2012 instance” http://support.microsoft.com/kb/2809338 (This is specific to Availability Groups per the Symptoms)
- “FIX: Error when you back up a database that has case-sensitive collation by using VSS in SQL Server 2012 SP2” http://support2.microsoft.com/kb/2987610 (the Symptoms specifically list backing up AG secondaries)
- “FIX: SQL Server 2012 or SQL Server 2014 instance shuts down when you join database as secondary replica during AlwaysOn Availability Groups configuration” http://support.microsoft.com/kb/2929193
- “Connection times out when you use AlwaysOn availability group listener with MultiSubnetFailover parameter” http://support.microsoft.com/kb/2870437 (This one is a Windows 2008R2 issue)
- “FIX: A memory leak occurs when you enable AlwaysOn Availability Groups or SQL Server failover cluster in Microsoft SQL Server 2012” http://support.microsoft.com/kb/2877100
- “FIX: “System objects not be updated (sic)” when you use AlwaysOn Availability Groups in SQL Server 2012 or SQL Server 2014” http://support.microsoft.com/kb/3002071
- “A hotfix that improves the performance of the “AlwaysOn Availability Group” feature in SQL Server 2012 is available for Windows Server 2008 R2” http://support.microsoft.com/KB/2687741
- “Time-out error and you cannot connect to a SQL Server 2012 AlwaysOn availability group listener in a multi-subnet environment” http://support.microsoft.com/kb/2792139
- “FIX: Synchronization state of an AlwaysOn Availability Group replica may not be updated if primary is unhealthy” http://support.microsoft.com/kb/2897554
- “Troubleshooting automatic failover problems in SQL Server 2012 AlwaysOn environments” http://support.microsoft.com/kb/2833707 (This is a list of things you need to look out for that can make automatic failover not work which may not be obvious, not a hotfix.)
- “FIX: “Non-yielding scheduler” error occurs and AlwaysOn Availability Group transits to RESOLVING state” http://support.microsoft.com/kb/3020116
- “FIX: Scheduler deadlock on AlwaysOn Availability Group primary replica in SQL Server 2012” http://support.microsoft.com/kb/2869734/en-us
- “SQL Server 2012 service shuts down unexpectedly upon availability group replica role transition on a Windows Server 2008 R2-based failover cluster” http://support.microsoft.com/kb/2777201
I know that some folks are going to ask for a comprehensive bug list specific to AGs. I’m afraid I don’t have time to maintain that list, and that is kinda telling.
So What Should You Do?
How much time and effort do you have for reading about fixes, testing them carefully, watching blogs for out of release fixes and known issues, and establishing complex patching and production processes?
If you don’t have at least an hour a week to keep on top of available fixes, and another eight hours a month (minimum) to devote to testing and deploying fixes, you don’t have time for an AG.
Got more than two replicas? You need more time. Running Windows Server 2008 R2? Well, you need a lot more time, because you should be working on moving off of it.
It was hard for me to write this post. I’ve met some of the folks who make SQL Server. They’re incredibly smart, very dedicated, super helpful, and they work very hard to make a great product. But right now Availability Groups are a very complex and time-consuming feature to manage, and many teams don’t have the staffing and time to take on that challenge and succeed.
Brent Says: I have a hunch that maybe Kendra was a little angry when she wrote this, but even so, I’m glad I’m a consultant these days and not a full time DBA. Patching blows.
This whole thing that’s happening to IT right now (“to the cloud,” Power BI, et al.) is at its core a collapse of engineering, as everyone is sucked into the gravitational pull of enormous piles of money. If a company like Instagram can be worth a billion dollars when the software could be built by one person in their spare time, why invest in anything? The whole IT ecosystem is being abandoned as money chases tech fashion. The whole of on-prem IT that’s been built up over 30 years or so is a rounding error compared to Facebook, Google, Instagram, and Twitter. Solid engineering practice is no longer a priority.
Which is depressing considering how big of a business SQL Server is already.
OK, I’ve been holding back on patching because the boss hates downtime. Here’s my current version: 11.0.3431.0 (2012 SP1 CU10). I’m wondering if I should go to SP2 and then CU4, and do that whole turn-off-auto-failover, restart-primary, turn-auto-failover-back-on routine? I know my CIO is probably going to flat out say NO to restarting SQL on the primary. I work in a 7×24 health care environment, and downtime on our primary SQL AG is a BIG deal; unscheduled downtime CAN impact customer care.
I think it’s time to build a TEST AG that mimics production as best I can. Oh, did I mention this AG has 2 nodes on one subnet, and the 3rd is on a different subnet?
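A quick way to double-check the exact build on each replica before and after patching is SERVERPROPERTY (standard T-SQL; the column aliases here are just illustrative):

```sql
-- Build check on each replica:
SELECT SERVERPROPERTY('ProductVersion') AS build,     -- e.g. 11.0.3431.0
       SERVERPROPERTY('ProductLevel')   AS level,     -- e.g. SP1
       SERVERPROPERTY('Edition')        AS edition;
```

Run it on every node; it is an easy way to catch a replica that quietly missed a patch.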
Yeah, with you on hating the downtime. I just updated the article to point to KB 3033492, which explains the options. There are some “preventive medicine” options, but they do require downtime.
I do think this issue can easily justify building that pre-production AG. I would also document why you’re making the decision you’re making and what the risks are (either way), and what investments can be made to try to reduce risks in the future.
One of the things that may be the most persuasive is arguing that a planned downtime, during a known low-use time where users have prepared for the outage is far less impactful than an outage in the middle of the day where users don’t know they’re going to have to put alternative methods in place and you don’t know how long the outage is going to be.
The hospital I work for has a monthly outage window where work can be done. We had to go through a few unplanned outages due to maintenance that wasn’t done to get there but it has served us well.
Maintenance windows are required, whether you’re 24/7 or not. How would your manager react if the service were compromised by a virus and down for two days? Every system should have a contingency to fail over and patch, end of. It should be built in when the system is introduced.
What do you think about patching every Nth CU? That way you have less per-patching risk and still eventually get all the fixes.
N could be 3 or 5…
I was there when SQL Server 2000 SP3 came out and broke our Reporting Services. It was ugly… hence Service Pack 3a. I am usually for doing Service Packs, as long as we get a chance to test them somewhere before they are applied to production servers. More often, though, da boss will hold off on the SP until something breaks or a vulnerability is exploited, because the system is “just too busy” to be taken down. My last company avoided SPs for that reason, and they reaped the results. Any new server I built, however, was patched to the highest level before it was deployed.
Scheduling time to patch your environment is an issue no matter who you work for. But to ignore or put off patching can open your company’s data to hackers. That is why testing the patch in development is essential. And Kendra is right, you can’t catch everything.
I have over 150 SQL instances running versions 2005 through 2014 at my company. Three years ago I started patching my SQL Server environment every 6 months. Prior to patching my entire development environment, the Cumulative Updates are patched on a subset of development instances. This has proven to catch most problems. The production environment is patched one month after the complete development environment.
I agree. I’ve actually been using Cumulative Updates for years, because of issues they fixed, but with a patching process like yours. In one large environment, our patching process first went to development, then burned into a pre-production environment, then we deployed it to production in a staged series from least critical servers to more critical servers, gradually. It was a lot of work, but it was worth it.
And like you, we didn’t do that cycle every month. We reviewed the cumulative updates and did it either every quarter or if there were fixes that we felt were urgent.
Kendra, I couldn’t agree more with this approach. During the design phase of an FCI/AG (or even stand-alone) it’s important to emphasize the need for a test/QA environment that logically matches production for these types of changes.
We see related issues on the Exchange Server admin side, e.g. regular Microsoft updates, many of which break things, especially in coexistence-type scenarios.
A great article on the ever-changing topic of patching… I noticed this part of your article in particular:
“But this changed with KB 2964518 – “Recommended updates and configuration options for SQL Server 2012 and SQL Server 2014 used with high-performance workloads.” This article specifically recommends using the latest cumulative update, and gives a dramatic list of issues fixed in those updates.”
I’ve read that article and the article for CU14 for SQL Server 2012 SP1 …. one part says
“We recommend that you apply the most recent update release”
and right below it it says
“This cumulative package is intended to correct only the problems that are described in this article. Apply it only to systems that are experiencing these specific problems.”
Am I correct in thinking that Microsoft still recommends that we only apply Cumulative Updates when we encounter an issue that the CU specifically addresses… but that when we do encounter an issue and are required to apply a Cumulative Update, we should use the latest update rather than the first CU that addressed the issue?
This may be a lawyer level question 🙂
The way I read KB2964518, it changes the rules. However, you’re right that the default wording on the individual CUs has not been changed. There’s definitely a contradiction there.
Thank you Kendra, I thought it was just me! 🙂
To upgrade or not has always been a thorny issue. Microsoft changed the game some years ago when they introduced CUs and deferred Service Packs. CUs every two months sometimes means a fix comes out for a problem created in the previous CU (KB 3033492, actually fixed in KB 3034679, goes back two CUs).
We try to update twice a year, and definitely when there is a security fix announced on Patch Tuesday (these seem to happen more often than they used to).
It is also good to review either the previous SP CU or the SP or version above your build as these might give you insight into what is coming next.
We have been running Availability Groups for over two years on Windows 2012, and we take an active part in the monthly Windows patching exercise because we couldn’t trust the SQL 2012 RTM and SP1 builds to recover properly. Change the failover to manual, fail over to the other node, wait for the node to come back up, fail back, and switch back to automatic while the servers get Windows patches.
If you are actively patching with CUs, look for them on the Friday or Monday after Patch Tuesday and read the documented fixes. As I mentioned, also look in any other CUs that come out on the same day for issues that could affect you. Keep watching for tweets, and follow both the CSS blog and the Release Services blog (that’s the one that mentioned KB 3033492).
Something that people may not have seen is the number of fixes for what should be a mature product by now (SQL 2012). I can only put this down to cloud computing: Microsoft can see what errors are happening in their cloud, since most of us on-premises don’t check the “send errors to Microsoft” option when we install.
I really believe that when it comes to patching, you have to apply the virtue of patience. It sounds selfish but I say let someone else get the bruised knees and wait a month or three AFTER a patch / CU / SP has been released before applying it.
I do remember 2012 SP1 because my boss asked me when do you want to go to SQL Server 2012? My reply, like any seasoned DBA, was “Oh not until SP1 is out”. Well, a day or two later, guess what? SP1 came out and my boss said “let’s go!”. Naively, with visions of little red flags dancing in my head, I said OK.
Anyone else who was an early adopter of SQLServer 2012 SP1 knows what happened next. As we say in the great white north, H-E- double hockey sticks!
CPU utilization out the wazoo. Registry bloat that would eventually fill up the registry if left unchecked. But, as I recall, not every SQL Server 2012 SP1 install went sideways.
So, what did we learn from all this? WAIT FOR SP1 AND THEN WAIT A MONTH OR THREE.
Hi Kendra. I always thought, when patching clusters and latterly AGs, that it would be a good idea to patch, say, one cluster node, then leave it to ‘soak’ for a week or two before doing any other nodes; or, if problems turned up, roll back the change on the one node you’d patched. However, I always met resistance to this idea, mainly along the lines of ‘all the hardware needs to be at the same level.’ On the face of it, this is also a good idea. Any thoughts one way or the other?
Hey Gordon, Yeah, opinions can vary on this one, and I can see both sides of the argument. The right answer often depends on the size of your team and the type of processes you have: for large teams with helpdesks, ensuring that everything is the same version in a given cluster can reduce confusion and simplify response time.
The one thing I think is absolute is that patching one node of a cluster or an AG is never a substitute for testing updates in a non-production environment.
My concern with patching one side of an availability group is that we do get unexpected failovers, so you think you’re running on a patched node (or an unpatched one) and then it bounces in the middle of the night. If you have a stable environment, then this could work.
I have had a situation with a clustered server where I could only patch one node and had to make sure that it was the active node (it was only a development server, though).
In the current situation, where the CUs seem to have problems, go for current minus one.
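One way to confirm which role a node holds before you touch it (so you’re not about to patch the current primary) is to ask the local replica itself, using the standard AG DMVs:

```sql
-- What role does the local replica hold right now?
SELECT ar.replica_server_name,
       ars.role_desc               -- PRIMARY or SECONDARY
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
    ON ars.replica_id = ar.replica_id
WHERE ars.is_local = 1;
```

Worth running right before the patch kicks off, since an overnight bounce can silently swap the roles on you.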
Oh, ouch! Yeah, unplanned failovers shouldn’t be normal; something’s definitely wrong.
We get hiccups around midnight, just after we have installed Windows patches; the patching happens mid-morning.
Something we have got used to happening and plan accordingly for.
Oh. Are you not restarting Windows when you apply Windows Updates? I only ever apply Windows Updates when I can reboot Windows, even if it doesn’t announce that it’s required.
We have a Windows 2012 Hyper-V set of hosts with the two database servers as guests. We patch the database server and its host, letting the availability group reconnect properly before we do the same on the other node. We have had Microsoft look at this arrangement a number of times because of issues. It works, but we get the occasional hiccup around midnight that normally corrects itself. Sometimes it burps enough that we get a SQL failover.
Running into the bug listed was a fun event. I work for a large healthcare company, all of our hospitals are run by a software suite that runs on SQL, and we billed the upgrade as zero downtime. Was a fun evening.
Oh, wow. That’s so rough. Sorry to hear about that.
Seems we now have something similar to the bug after applying the February Microsoft patches to the secondary replica on Windows 2012 running SQL 2012 SP2 CU1, not CU3 or 4.
The errorlogs on both show no issues but the dashboard thinks they are not synchronizing.
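One way to see what the replicas themselves report, rather than trusting the dashboard, is to query the per-database replica states directly (standard AG DMVs):

```sql
-- Per-database synchronization state as the server sees it:
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       DB_NAME(drs.database_id)          AS database_name,
       drs.synchronization_state_desc,   -- e.g. SYNCHRONIZED, SYNCHRONIZING
       drs.synchronization_health_desc   -- e.g. HEALTHY, NOT_HEALTHY
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON drs.replica_id = ar.replica_id
JOIN sys.availability_groups AS ag
    ON drs.group_id = ag.group_id;
```

If the DMVs say SYNCHRONIZED but the dashboard disagrees, the dashboard (or its cached state) is the more likely suspect.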
I have been trying to find an official style guide on what processes to use to apply Windows patches to an AG. We currently run a 3-node AG with the third node in a DR data centre on a separate subnet. I have found Microsoft resources for SQL-related updates, and I am also aware that the Cluster-Aware Updating tool does not work with SQL AGs.
Do you have any recommendations?
Tell me a little bit more about what you mean by “style guide” for patches? I’m just not familiar with the term being used in that way.
I think I fell victim to autocorrect! Below is a corrected version of my initial question:
I have been trying to find an official guide on what processes to use to apply Windows patches to an AG. We currently run a 3-node AG with the third node in a DR data centre on a separate subnet. I have found Microsoft resources for SQL-related updates, and I am also aware that the Cluster-Aware Updating tool does not work with SQL AGs.
Do you have any recommendations?
Absolutely. Testing the patches thoroughly outside of production in a staging environment that’s as similar to production as possible is key. Today’s post on 2014 SP1 is an example: https://www.brentozar.com/archive/2015/04/breaking-news-dont-install-sql-server-2014-sp1/
For patching a production environment, some people automate pieces of it, but typically if you have an AG it’s critical enough that you don’t leave patching to run unattended and you smoketest at various steps.
Is there a similar resource on what updates and patches not to install on the Windows Server underneath the SQL Server?
Testing different versions of CUs is a pain but a necessary evil. However, here are some twists that our DBA team fell upon. We normally install SQL 2014 with KB 2977315 and KB 2977316. Interestingly, we rolled out CU6 on 2014 and forgot to apply the GDR and QFE patches. When we re-applied them, the SQL version number rolled back to 12.0.2381.0 (the number for the two patches) rather than the expected 12.0.2480.0.
Now I’ve just read to update SQL 2012 the same way and not rely on just waiting for SP3 for 2012. Ah, the life of a SQL DBA, never envied.
Short story: be careful applying SQL patches these days… you never know what you’re gonna get. Too bad it doesn’t taste like chocolate.
Wow, I’m a solo, just-starting DBA in a shop that is about to roll out a DR site. This article has me sweating bullets…
I’ve never stood up any replication before, and I came here looking to solve my issue with SQL 2012 Reporting Services on version 11.0.3128 when FIPS is enabled.
I can’t seem to find much recent information on FIPS enabled with SQL Server.
Can I configure an AlwaysOn AG with one server on a hotfix and the other without, on SQL Server 2012 SP1?
Madi – yes.
Thank you, Kendra. Although SQL 2016 has been tested very thoroughly, this post right here is the top reason I decided not to update to it within 8 weeks of release. I feel the slew of updates coming up could put us more at risk than the ones that came in after SQL 2014 SP1, which should be more security-related and perhaps less severe/impactful bugs.
It would have been quite a rush to roll out 2016 in prod on a major environment though, even with all the testing!
ServerA – Primary Instance
ServerB – Secondary Passive Instance
AG from ServerA to ServerB with automatic failover and synchronous replication
If I wanted to patch ServerB, do I:
1. Patch ServerB without pausing data movement, or
2. Pause data movement between ServerA and ServerB, patch ServerB, and once patched, resume data movement?
Rory – if you want to be safe, you’ll probably want to pause replication between them to avoid hiccups on server A while you’re patching server B.
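For completeness, pausing and resuming data movement is a per-database command run on the secondary. A sketch ([MyDatabase] is a placeholder; you’d repeat this for each database in the AG):

```sql
-- On ServerB (the secondary), before patching:
ALTER DATABASE [MyDatabase] SET HADR SUSPEND;

-- After ServerB is patched and back up, resume data movement:
ALTER DATABASE [MyDatabase] SET HADR RESUME;
```

While suspended, the secondary stops applying log records, so the primary’s transaction log can’t be truncated past that point; don’t leave it suspended longer than necessary.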
Patching SQL Server is a necessary evil in the financial services industry, and it’s probably my least favorite thing to do as a DBA. As a matter of fact, we are in the midst of our annual (yes, annual) SQL Server patching effort, and other than hitting the annoying SQL Server patch hangs on the “MsiTimingAction” issue on a few servers, we have had pretty good success (knocks wood) overall with the 60+ instances we have patched thus far.
We do have a 3-node SQL 2014 AOAG environment which will be upgraded to SP1 CU6 (the fixed version) that we are saving for last, so this information will be helpful in our planning. Lucky for us, we have a test system that was built just like production, so we have an adequate test environment.
We use AlwaysOn extensively in our environment, but we’ve encountered really long timeouts or wait times when installing patches on our database servers. This also happens sometimes when doing instance installs on servers already in a cluster, so if it’s a new server, we run our installs first and add it to the cluster second. This obviously doesn’t help with patching already-active servers with live availability groups.
I’m not sure what the aggravating factor here is, but I suspect it may be the number of nodes and/or how busy they are. If you have a healthy number of availability groups (30+), with 4 to 5 databases in each, per instance per server, then certain DMVs are VERY slow (they seem to make calls to underlying cluster DLLs), and if patching is calling those DMVs, then this would explain part of the problem at least.
We’ve found that installing a patch on our servers during an agreed weekend quiet period with no user activity can take 5+ hours to complete on just one or two servers in the cluster, requiring us to reschedule the rest for the next weekend. I’m reading up about cluster aware updating but haven’t had a chance to test this yet. Is this something anyone here has used and does it help?
We’re also reconfiguring one of our clusters so that instead of our servers hosting a mix of primary and secondary systems side by side, we will have dedicated servers for primary replicas and dedicated servers for secondaries. This should make it easier to plan patching, as we can at least more easily ensure that we are never patching the live server, which is currently logistically difficult to commit to in our mix-and-match setup.
When will mother Microsoft release something similar to Oracle RAC? Something where we can patch an instance and “live moving” user database connection to other node of the cluster with zero downtime? Came on!
About the same time people get the spelling & grammar right in their comments. Come on! 😉
don’t bully me, I’m Italian, my “spelling & grammar” is the first source of mistakes 🙂
Four years on and generally the landscape looks…
Hi Brent, apologies for the comment on an old post. My SQL Server farm has been neglected with patching at both the OS level and the SQL Server level (because they are critical systems, outages are hard to come by).
An option is to patch the secondary nodes of our AOAG clusters up to the latest patch one month, then the next month the business agrees to schedule a failover out of hours. I can then patch the new secondary (old primary). This will mean the nodes won’t be on the same patch level for a month. Is this “a no-no”?
Dan – for general questions, head to a Q&A site like https://dba.stackexchange.com.
I am just getting used to the topic of availability groups and wonder why it is important to set the failover mode to “manual” before patching?
Hi! For questions, you’re best off hitting SQLhelp: https://www.brentozar.com/go/sqlhelp
Thanks, but with SQL AOAG, if you follow rolling updates (secondary replica > testing > primary replica, that kind of sequence), downtime can be limited to the minimum; is that not true?
What we do is that after one node has been patched, a thorough test is run to make sure applications will run fine on the patched node. Only after that step do we move on to patch the remaining nodes.
If something happens after the first node is patched, at least the SQL instance will still be running on a single node (assuming a 2-node cluster) and a graceful remediation process can take place.
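That “thorough test” step can also include a quick AG health query before moving on to the next node (a sketch using the standard DMVs):

```sql
-- Overall AG health and current primary, per availability group:
SELECT ag.name,
       ags.primary_replica,
       ags.synchronization_health_desc   -- want HEALTHY before patching the next node
FROM sys.dm_hadr_availability_group_states AS ags
JOIN sys.availability_groups AS ag
    ON ags.group_id = ag.group_id;
```

If synchronization_health_desc isn’t HEALTHY, stop the rolling patch there and investigate before touching the remaining nodes.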