I often hear companies say, “We can never ever go down, so we’d like to implement Always On Availability Groups.”
Let’s say on January 1, 2016, you rolled out a new Availability Group on SQL Server 2014. It’s the most current version available at the time, and you deploy Service Pack 1, Cumulative Update 4 (released 2015/12/22). You’re fully current, and it’s a stable engine from 2014 – how many more bugs can they find, right?
Here’s what your patching schedule would look like:
2016/02/22 – Cumulative Update 5 – corrupted columnstore indexes when AG fails over, stack dumps on AG secondaries.
2016/04/19 – Cumulative Update 6 – non-yielding schedulers during AG version cleanup, FileTables unavailable after AG failover, canceling a backup causes the server to crash (not related, but cringeworthy) – whew! This one has a lot of big fixes. We should definitely apply this.
2016/05/31 – OH SNAP! CU6 broke NOLOCK. Sure hope you didn’t apply that. Time to take another outage to apply the revised version.
2016/06/21 – Cumulative Update 7 – SQLDiag fails in AGs. You could probably skip this one if you don’t use SQLDiag, and most shops don’t.
2016/07/11 – Service Pack 2 – improved lease timeout to prevent outages, filestream directory not visible after a replica is restarted (wait I thought we fixed that in CU6? no wait that was FileTables), missing error numbers in XE.
2016/08/26 – Service Pack 2 Cumulative Update 1 – memory leak on AGs with change tracking, error 1478 when you add a database back into an AlwaysOn availability group (sic).
2016/10/18 – Service Pack 2 Cumulative Update 2 – no AG fixes, woohoo!
That’s 5-7 patch outages in 11 months (and I’m not even listing all of the fixes in these, which include things like incorrect results bugs, plus awesome new DMV diagnostic features that you definitely want.)
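If you want to put numbers on that schedule, here's a minimal Python sketch. The dates are copied straight from the list above. The labels, the `patch_gaps` helper, and the choice to measure from the January 1 rollout to the last patch listed are my own assumptions for illustration, not anything Microsoft publishes.

```python
from datetime import date

# Rollout date and release dates copied from the schedule above.
# The labels are shorthand for this sketch only.
ROLLOUT = date(2016, 1, 1)
RELEASES = [
    (date(2016, 2, 22), "SP1 CU5"),
    (date(2016, 4, 19), "SP1 CU6"),
    (date(2016, 5, 31), "SP1 CU6 (revised)"),
    (date(2016, 6, 21), "SP1 CU7"),
    (date(2016, 7, 11), "SP2"),
    (date(2016, 8, 26), "SP2 CU1"),
    (date(2016, 10, 18), "SP2 CU2"),
]


def patch_gaps(rollout, releases):
    """Return the days between consecutive releases, counting from rollout."""
    gaps = []
    previous = rollout
    for released, _label in sorted(releases):
        gaps.append((released - previous).days)
        previous = released
    return gaps


gaps = patch_gaps(ROLLOUT, RELEASES)
print(f"Releases to evaluate: {len(RELEASES)}")
print(f"Days from rollout to last release: {sum(gaps)}")
print(f"Shortest gap between releases: {min(gaps)} days")
print(f"Average gap between releases: {sum(gaps) / len(gaps):.1f} days")
```

Remove the entries you would realistically skip (say, CU7 if you don't use SQLDiag) and rerun it to see how the count moves toward the low end of the range.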
Here’s the way I like to explain it to companies: if you have an airplane, it’s absolutely imperative that its engines not fail mid-flight. In order to accomplish that, you have to have regular downtime for mechanics to examine and replace parts – and that doesn’t happen up in the air. With Availability Groups, we’re lucky enough to be able to transfer our passengers, er, databases from one airplane to another quickly – but we still have to have those other airplanes getting constant examinations and patches from mechanics.
Great post as always. I can barely handle being on an airplane when everything goes as planned. I can’t imagine hearing “uh… hi guys… this is your captain speaking… we uh… we’re having some problems up here and we… uh… well … do you see that other plane over there flying dangerously close to us? Well… we uh… we’re gonna have to get you all over there like.. like right away… I uh… Women and children first, I guess. Smoke if you got ’em.”
You sure have been posting a lot of pics of you in that Oracle jacket lately. How has Microsoft not revoked your MVP card?
Heh heh heh – I gave up my MVP card, actually. Nothing against the program, was just time for a change.
I seem to be spending more and more time coming up with analogies to try and explain basic process concepts to people recently, but your airplane one there’s a thing of pure beauty. Have a drink.
Liked the article. Good analogy.
Is it advisable to hold off on setting up AGs, as there seem to be so many issues?
Cluster environments can help us achieve always on, right?
By the way, always love to read your posts!
Amanda – I would just generally advise folks to find the simplest solution that meets their RPO and RTO goals. Always On Availability Groups is a fantastic feature – you just have to be armed with the right people and processes to tackle it.
…and the more engines you have, the more likely you’ll have some kind of engine failure at some point.
Reminds me about my flight simulator: runs on Windows 7, and is *never ever* patched/updated. I don’t want to introduce *any* side effect through an update.
Klaus – makes sense! After all, if it works, that’s good enough! The OS is only there to provide services.
I like your analogy. Thank you so much for simplifying it.