Have Availability Groups Gotten Easier Lately?

Easy Lover

I don’t blog a lot about AGs. If we’re being honest (and I do try to be honest with you, dear reader), I just like performance tuning topics way more. When new features get announced for AGs, some of you may ooh and aah, but not me.

I Make A Face

I don’t make a face because I don’t like the features, or think they’re bad. I make a face because I imagine them breaking. I imagine them breaking in the middle of the night. I try to think about upgrading them.

And that, dear readers, is why this Bouncer-American picked performance tuning. I didn’t learn SQL to spend more weekend nights hanging out with people I’m inevitably going to have to yell at.

But Also This

I don’t really do much with AGs, so perhaps my fears are unfounded. Once upon a time I had a home lab set up with VMs. I had FCI VMs, Log Shipping VMs, Mirroring VMs, and AG VMs. You know what they did? Gathered dust. Most of my questions about those technologies were just kind of settled.

It’s not that I don’t pay attention! I do, really. For instance, when Microsoft released this Failover Detection Utility, I thought it was super cool.

I also pay attention to stuff that gets released in updates. Just in the release notes for 2017 CUs, I found these:

::scrunch face::

This wasn’t all of them. There were some for Linux that I skipped over, and not because I think the Linux stuff is irrelevant.

It’s just so new, I expect it to be buggier than a flophouse wig stand (sorry Bob).

About You

So I gotta ask, how are you fine folks out there doing with your AGs?

  • Are you hitting bugs?
  • Are you patching regularly?
  • Are you running into other issues?
  • Did you pick up and run to the cloud?

If you’re patching regularly, are you using a pre-prod environment to test patches out?

I’m not saying you need one, but uh…

You might wanna think about getting one. Those patches can be dangerous, and troubleshooting outages can sure be a wild ride.

Thanks for reading!


36 Comments

  • We have three clusterless AGs. Each AG has a Primary, a Secondary (failover), and a Reporting node. Not too many issues really, just network. We run async, and it always catches up. One of the sites has a small bandwidth, and occasionally I would see backups to the offsite Reporting node, so I worked with IT and put a rule in the switch that gave priority to the pertinent port. We have not done a CU update, but will surely test before doing the actual deed. Our setup meets our needs without a lot of issues or maintenance.

    • We just set up exactly the same thing. Clusterless (read-scale) AGs. Works great for DR, as we don’t require HA. It really simplifies things when you yank out WSFC and only need async DR with manual failover.
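For anyone curious what that looks like, a clusterless read-scale AG can be created without WSFC on SQL Server 2017 and later. Here’s a minimal sketch; the AG name, database, server names, and endpoint URLs are all placeholders:

```sql
-- Hypothetical sketch of a clusterless (read-scale) AG, SQL Server 2017+.
-- All names and endpoint URLs below are placeholders.
CREATE AVAILABILITY GROUP [ReadScaleAG]
WITH (CLUSTER_TYPE = NONE)                 -- no Windows failover cluster
FOR DATABASE [SalesDb]
REPLICA ON
    N'SQLNODE1' WITH (
        ENDPOINT_URL      = N'TCP://sqlnode1.example.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = MANUAL,        -- manual failover only
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL)),
    N'SQLNODE2' WITH (
        ENDPOINT_URL      = N'TCP://sqlnode2.example.com:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = MANUAL,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL));
```

With CLUSTER_TYPE = NONE there’s no automatic failover and no listener-based HA, which is exactly the tradeoff the commenter describes: async DR with manual failover, minus the WSFC complexity.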

  • Joshua McIntyre
    December 10, 2018 12:05 pm

    I have a Primary, Secondary (Synchronous), Secondary (Asynchronous) setup and I think it’s great! There are definitely some headaches to deal with, working through issues of replicating non-sync items such as users, and dealing with jobs across several nodes, and the like; however, I think the headaches are worth every bit of the functionality.
    Thankfully, over the course of several years, I’ve done regular (near-monthly) patching and SPs/CUs with no significant issues, and the unexpected downtime has been near-zero, with even a few mid-day unplanned automatic failovers with no customer impact for good measure (and that right there makes it worth it to me)! Also, I’ve not been bitten by some of the nasty bugs, maybe because of slightly delaying patches and checking the SQL community for reported issues first (thanks to all the adventurous early adopters).
    I have DEV and QA environments and solid teams around me, without which I probably would not like AGs nearly as much. The Brent Ozar Unlimited and SQLSkills teams, among many others, do an excellent job of blogging and training that I benefit from as well, and it keeps my environment running smoothly.
    I would do Availability Groups again without much question if I could choose.

  • Thanks for the post. We’re only running a few AGs, but we’ve encountered issues with automatic seeding. We patched for error 35217 (not enough worker threads and massive blocking) and ran into 41169 (the availability database does not get created on the secondary). That one was supposed to be fixed in SQL 2016 SP1 CU5, but we saw the error in SP2 CU2. We do try to patch in pre-production prior to taking it to prod.

  • We’ve deployed one an AG where the primary is a Failover Cluster Instance, and then 2 secondaries are standalone instance, in async replication mode. The secondaries are not meant to ever become primary, they’re only there as readable secondaries and they receive reporting / some async read-only workloads. The Failover Cluster takes care of the availability part, so the AG is only there as a replication facilitator for the read-only workloads.

    We’ve hit a few bugs having to do with schema changes: the redo process that alters a table schema can deadlock with a read operation on that table, but SQL Server will not detect that deadlock. We had to implement specific monitoring for that condition to kill our process when that happens.

  • I have several clusters with several AGs on both 2014 and 2017. In production they all have Primary, Sync-Secondary, and Async-Reporting nodes (sometimes more than one). We started using them in PROD when they first came out {gasp!} and yes, the first few months were pretty rough. The problems we have had since then have mostly been problems we would have had anyway (networking, slow DCs/witnesses, storage issues, etc.). We also have AGs set up for DEV and TEST so we can test how applications handle failovers. We generally patch DEV the same day patches are released, TEST 24 hours later, and PROD within the next week or two depending on what issues we see or hear about. To do AGs, I recommend you understand DNS and some networking, or are close buddies with someone who does. And seeding never works for me if I do more than one-database-at-a-time-until-done. Other than that, I love it! (Of course I am weird; I also do Partially Contained databases.)

  • Oh, and when we were poor we even cross-pollinated AGs: two servers, with some databases primary on the first server and secondary on the second, other databases primary on the second server and secondary on the first, and a third server (async) for reporting. Worked great for a little bit of DR, a little bit of HA, and a little bit of reporting!

  • This is a topic which has been timely for me recently. My server team stopped supporting Windows Failover Clusters (WFC) when we stopped using bare-metal servers and started virtualizing. The server team told everyone that the native HA abilities in VMware are comparable to WFC, and because of the shared-disk requirement in WFC, they were no longer going to support it. I tried to bring up the fact that SQL Server could go down and not trigger a VM failover, but that was just swept under the rug. Since we didn’t have many app teams asking for Always On AGs, and since I’ve heard a number of horror stories about it, and since I had other work to do, I ignored it.

    Fast forward to now: we have a new vision for the future, and it includes minimal- or no-downtime options, so I now have to start thinking about this. But one thing I haven’t been able to find any posts on is this: AGs work on top of WFC, so what additional benefit do I get from AGs? I mean, if I’m running a WFC anyway, why take on the additional overhead and support of AGs? My current thinking is to use SIOS DataKeeper to get around the shared-disk requirement and allow for synchronous or asynchronous disk mirroring; then I can have straight-up WFC with no AGs…but I wonder if all the work of setting up AGs, configuring quorums, patching, and all the things mentioned in the post above would give me some benefit that makes it all worthwhile. I’m excited to see other people’s posts here about what their experiences have been.

  • Eric, in addition to not requiring shared storage, AGs also allow readable secondaries.

  • We have been using Availability Groups since SQL 2012 and are now on SQL 2016 SP2 builds. Be aware that the Windows patches can have an effect too. We just finished installing the November W2K16 Windows patches, which have a fix that should stop failover hiccups. We got numerous 35266 messages, but the AG always recovered. We do have an outstanding assertion-error situation for which a fix has been promised in SP2 CU5, due in January.

  • We started with AGs on SQL 2014 with Windows Server 2008R2 (2 local synchronous nodes with one offsite async). We had quite a few problems in the first months by not anticipating what WAN outages would do to our cluster, but we’re largely over them now. Patching during the middle of the day on production servers is working great with Sharepoint apps, but is noticed by our SCSM users. The read-only databases for our metrics projects have made it all worth it. We’re now designing our SQL 2017 architecture on Windows 2016 with the two nodes at each site and are hoping to use a similar solution using distributed AGs.

  • Adi Hafiskadic
    December 10, 2018 6:42 pm

    We have noticed a big improvement in reliability and stability when running AGs on Windows Server 2016 and SQL 2016 SP2 compared to SQL 2012, 2014 on Windows Server 2012 R2.

  • Well, it is definitely time someone wrote about this. I am supporting a 6-node SQL 2016 SP2 CU3 setup on Windows 2016. It is a hybrid environment: 2 nodes in Azure and 4 nodes on premises, with really huge machines and all-flash storage, extremely critical 24/7, and all the secondaries are for offloading read-only workload. The naked truth is that if you really have constant read-only activity against on-disk tables on the secondaries, expect redo threads timing out on buffer latch timeouts. NO SOLUTION: even though it is written up as fixed in SP2 CU3, it is not. We are actually discussing this in depth with the dev team; this is effectively unsolvable in 2016/2017. ADR in 2019 solves most of the lock/deadlock issues due to its use of snapshots rather than the log, so the redo threads will not time out.

    We had problems with the Azure copies lagging behind in asynchronous mode, and we discovered it was due to Query Store maintenance; we removed Query Store from the databases and the issue was resolved. Our team is highly trained, with 20+ years of SQL Server experience, and the thing that frustrates us most is that in the last 2-3 years the technical knowledge and professionalism of Microsoft support has declined significantly. We are getting ridiculous answers and solutions; the Query Store issue, for instance, was resolved by our team, although a case had been open for months without the cause being detected. There are other issues, but they are not that significant.

    We did not experience network issues on a regular basis, and the rare failovers went fine. We are missing read-only routing features, but that is not related.

  • Danny,
    The reference to Query Store: is it https://support.microsoft.com/en-us/help/4340759 you are referring to? As I mentioned before, the Windows patches seem to affect SQL more than they used to. We had been getting numerous 35266 messages, which appear to be fixed by a Windows patch in the October 18th W2K16 update: https://support.microsoft.com/en-ca/help/4462928. This reminds me of what Allan Hirt said about clustered DTC:
    If you’re using DTC, the only way to get the full benefits of DTC with an FCI or AG is to cluster it. I’m trying to influence change, but results are not guaranteed. The problem is that a clustered DTC requires shared storage (or in WS2019, CSV, which is also shared). I heard him at PASS Summit and then chatted on Twitter to get clarification. I was also involved in a discussion with Amit Banerjee and @sirsql about use of the BPOOL on the secondary regardless of whether it is read-only. Just like MS re-jigged the processing in SQL 2016 to speed up the replication an AG needs, they are doing it again for SQL 2019 with ADR. Hopefully they can bring it back into SQL 2016 and SQL 2017.

    • Chris, no, the issue is not related to https://support.microsoft.com/en-us/help/4340759; in any case, we have CU3 for SP2 installed. The issue caused the Always On secondaries to lag for hours, always at the same time each day, because of Query Store maintenance. I managed to catch it with DBCC OPENTRAN, although there was enough space in the Query Store and automatic maintenance was off. Log backups grew to a few gigs. The only way to solve it was to shut down Query Store.

      • Danny,
        How many databases do you have in your AGs, and what are your Query Store settings? You mentioned using RO on the secondaries. Are you also pushing lots of data?

  • We have several 2014 AGs and a few 2016. We have had a couple of instances where synchronization wasn’t working after a reboot or a VM migration, but only a couple of times where it actually caused an outage. Easily solved by restarting SQL on the secondary. Occasionally we see replication going very slowly to the DR site, and it is apparently due to some hung threads because it’s fast after restarting SQL at DR.

    Overall it is good, and one of the nice benefits is that you can install service packs during business hours with no downtime (assuming mgmt is OK with your HA/DR not being available for 30 minutes). We install security updates for sure and get service packs on eventually. We don’t upgrade with CUs unless we do an SP upgrade, at which point we’ll put on the latest or next-to-latest CU.

    Rule #1 is that you should not use clusters or AGs to accommodate poor infrastructure. If you have a lot of network issues, you’ll have even more problems and downtime if you try to use clusters.

    Rule #2 is that you should not use VM snapshots for backing up, as the stunning can cause a failover. It may work 99 times out of 100, but if something causes the VM to be stunned just a little longer than normal, you’ll have a failover. We’ve also seen a lot of issues with synchronization after a failover that was caused by VM stunning.

  • Been using DB Mirroring on Standard Edition with no problems to share (sorry about that).

    Just make sure to increase worker threads to a couple of thousand for each thousand DBs mirrored.

    LOL

  • My main bugbear with AGs is them failing over for some unknown reason (especially in shops with no dedicated SQL monitoring). I’m looking at the Failover Detection Utility to help with this.

  • George Humphries
    December 17, 2018 1:55 pm

    For AG failover, I use alert 1480 calling a job that switches CDC jobs in and out, and activates/deactivates AG-specific jobs depending on whether the node is primary or secondary. Since I am calling a job, I can also fire off an email on success to let me know how the failover went for each node participating in the AG cluster.
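For readers who want to try the same pattern, here’s a minimal sketch of wiring error 1480 (raised when an AG replica changes roles) to a SQL Agent job via an alert. The alert, job, and operator names below are hypothetical placeholders:

```sql
-- Hypothetical sketch: fire a job when error 1480 (AG role change) is raised.
-- The alert/job/operator names are placeholders; the job itself would enable
-- or disable CDC jobs and AG-specific jobs based on the replica's new role.
EXEC msdb.dbo.sp_add_alert
    @name       = N'AG Role Change (1480)',
    @message_id = 1480,
    @job_name   = N'AG - Handle Role Change';

-- Optional: also e-mail an operator whenever the alert fires.
EXEC msdb.dbo.sp_add_notification
    @alert_name          = N'AG Role Change (1480)',
    @operator_name       = N'DBA Team',
    @notification_method = 1;  -- 1 = e-mail
```

As with any Agent alert, the message has to reach the Windows application log for the alert to fire, so it’s worth testing a manual failover once before trusting it.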

  • Wilfred van Dijk
    December 28, 2018 3:50 am

    I’m still asking myself whether AlwaysOn is an on-premises feature and not really an Azure HA solution (except for those who have VMs in the cloud). Having said that, it seems logical that it will be deprecated in the future.

  • Mark Underhill
    December 28, 2018 4:09 am

    You can always make the first step of a job detect whether the local instance is primary or secondary and run or not based on the result. This means jobs will only ever run on the primary, and so will continue after a failover without any intervention.
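A minimal sketch of that first job step, using the built-in role-check function available in SQL Server 2014 and later ('YourAgDatabase' is a placeholder for any database in the AG):

```sql
-- Hypothetical first job step: abort the job on a non-primary replica.
-- 'YourAgDatabase' is a placeholder for any database in the AG.
-- Configure this step's "On failure action" as "Quit the job reporting
-- success" so non-primary replicas skip the rest of the job quietly.
IF sys.fn_hadr_is_primary_replica(N'YourAgDatabase') <> 1
    THROW 50000, N'Not the primary replica; skipping this job.', 1;
```

The same step can be pasted as step 1 of every AG-sensitive job, which keeps the jobs themselves identical across replicas.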

  • As someone who does not have AGs in the wild and has only played with them a bit in a lab setup, I find the article and comments fascinating. Great read!!!!

  • We run 4 production Always On Availability Groups, each with 3 nodes. They sit on top of Server 2016 Datacenter on VMware virtual servers. For approximately 2 months so far, the system has been rock solid. I have a QA environment that mirrors production. I have found workarounds for the auto-seeding issues, and overall the failovers have been smooth. Most failovers have been to patch (OS monthly, after testing) or to test the Availability Group.

  • Michele Redding
    May 9, 2019 10:37 am

    I am new to the AWS cloud and I have to deploy WSFC, AD, and AlwaysOn Availability Groups. Is there a helpful deployment guide for doing this manually? I know that AWS has stacks, but I guess I just want to really understand how this all works together. Help!!!

    • Michele – that’s pretty far beyond what I can do quickly in a blog post comment. The reason why you’re not finding a simple guide to do it is that the steps keep changing radically in the cloud and with different versions of SQL Server. Anything you’d write up today would be fairly out of date in short order.

  • Michele Redding
    May 9, 2019 12:35 pm

    I was afraid that might be the case, but I was still hoping for hope’s sake. Thank you for your reply! One other question: there is another way, stacks, but what are the cons to using those? Thanks again.

    • I really wish I could do that kind of personal advice quickly in a blog post comment, but it’s just not realistic. When you’re ready to design an architecture for a company, that’s usually where consulting comes in.

  • Michael J. Reid
    December 6, 2019 10:34 am

    We are looking at moving to Distributed Availability Groups for the simple reason that our Windows team cannot seem to grasp the network configurations and quorum voting requirements for traditional multi-site WSFC (don’t get me started!). However, as the SQL Architect, I’m trying to minimize the manual workload of managing non-replicated objects within each data center. Things like logins and SQL jobs. To that end, I want to utilize SFI (SQL Failover Instances) while also moving to Distributed Availability Groups. I have a lot of experience using SFIs inside multi-site WSFC using traditional Availability Groups (asynch replication, of course), but I was wondering if anyone had any experience or tips regarding the use of SFIs with Distributed Availability Groups (I believe it’s possible)? Would you still be able to have your automatic SQL Instance fail over, yet maintain the synchronization across the two AGs contained by the Distributed AG? I believe the listener configuration would also be different (you would point the Listener to the VNN of the Primary host within the SFI, not the actual AG itself), but how would the Listener then automatically reroute to the alternate site (the second AG within the Distributed AG). Would you make the VNN of that SFI the “alternative” IP address of the Listener?
    Any advice would be most appreciated!

