Synchronous Always On Availability Groups Is Not Zero Data Loss

In theory, when you configure AlwaysOn Availability Groups with synchronous replication between multiple replicas, you won’t lose data. When any transaction is committed, it’s saved across multiple replicas.

That’s the way it works, right? I mean, except when you restart your synchronous replicas, or patch them, or they just stop working for any number of reasons. The primary keeps right on trucking, accepting deletes/updates/inserts, without telling end users that all their eggs are in a single basket.

But I hear you – those are really rare cases. Most of the time, as long as you’ve got a synchronous replica, it’s synchronous.

Except when it’s not. Read the manual carefully:

If primary’s session-timeout period is exceeded by a secondary replica, the primary replica temporarily shifts into asynchronous-commit mode for that secondary replica.

That’s right – your sync secondary becomes asynchronous.

Automatically. Without warning you. And you can’t control it. You can’t say, “I only want you to accept transactions as long as data is being copied to another replica.” For more details on what causes a sync replica to drop into async, check out the BOL page on synchronous commits.

Bottom line – you can’t actually guarantee zero data loss with AlwaysOn Availability Groups. I love AGs, but they’re much, much more complex than they look at first glance.

Previous Post
Enabling Query Store in Azure SQL Database
Next Post
We’re Hiring: SQL Server Triage Specialist (Remote Position)

24 Comments. Leave new

  • Also see this for an example of synchronous data loss or loss of availability:

    SQL Server can’t beat the CAP theorem. Either it is not 100% available or not 100% consistent in case of a network problem.

    • That stackexchange question has false assumptions. Just ignore it.

      • Could you leave a comment or answer explaining the misconception? I think it’s getting a lot of attention now and I might have even seen it in the ‘hot questions’ sidebar on other Stack Exchange sites.

  • I think this just shows that there is no such thing as a “set it and forget it” high availability option. You need an administrator to watch over it, no matter the technology. AG makes it easier, but not bullet proof.

    In addition, this also shows that RPO/RTO are still valid objectives to build out your data availability needs by.

  • Hi Brent, just curious what your clients’ expectations have been around this type of thing. Have you worked with clients who would have preferred that the master refuse (or alternatively, wait indefinitely to acknowledge) writes if it couldn’t reach any of the secondaries?

    I’m not suggesting that any particular way is better, but those seem to be the alternatives to my limited imagination, and I’m genuinely curious about what preferences you’ve encountered among clients in the wild.

    • James – thanks! The post came about as a result of a visit to a client recently. They use 3 sync replicas at all times (primary, plus 2 sync secondaries) plus an async secondary. If any of the sync replicas fail (whether it’s the primary or secondary), their scripts check to make sure they can still automatically fail over to another sync replica (like set it up as an automatic failover partner if necessary), and then build in a new async secondary just in case. That’s the best approach I’ve seen so far.

  • And thanks for posting this, btw. It (and the question that tobi linked to) have got me thinking about some CAP-related topics that I haven’t thought enough about in a SQL Server context.

  • Hi Brent, do you know how to catch these events in case they occur ?

    • Yes, get a monitoring product. Based on your thresholds, they can alert you when a secondary is more than X seconds/minutes out of sync.

      • Can you expand on what should be monitored to detect the out-of-sync situation? We’ve been surprised to find on a couple of occasions our secondary has drifted hours behind (we were using read intents to pull data from the secondary and clients complained about changes not showing up).

        Ideally we want to solve the underlying issue and prevent the AG Secondary from falling out of sync, but documentation is rather sparse on the subject.Right now it seems related to long “open/close trans” statements

  • My interpretation of the Microsoft doco is different:

    The silent switch from Sync to Async has zero effect on the replica and happens only on the Primary.
    It is done so that the Primary can commit transactions when connection to a Synchronous Commit replica is lost.
    Once the Primary & replica can communicate again, the Primary treats the replica as a Synchronous Commit replica.

  • “… the primary replica waits for the secondary replica to confirm that it has hardened the log (unless the secondary replica fails to ping the primary replica within the primary’s session-timeout period).”

    Doesn’t that fit my interpretation? And if not, what does, bearing in mind that the primary & secondary are not communicating at that point?

    • Nothing does, that’s the point. Take the example of the secondary having a BSOD and rebooting. The primary keeps right on taking transactions, and you’re down to a single point of failure. If the secondary doesn’t come back up promptly, and you lose the primary, you’ve lost data.

  • yassine elouati
    March 10, 2019 3:04 pm

    Thanks as always for your contributions to the community. What’s your experience when ALWAYSON replication is slow and it is traced to the network. ALWAYSON uses only ONE TCP connection to send data and over a high latency network it is turtle slow. RFC 1323 is supposed to resolve this issue. In my case it does not. I am suspecting the network devices at this point. What is your experience?

    • Yassine – I wish I could do free personal consulting for everyone here at the blog, but realistically, when you have a question, you’ll either need to post it on a Q&A site or forum, or hire me for consulting. Thanks!

  • Toby Ovod-Everett
    April 19, 2019 3:17 pm

    I found (from this guy named Brent Ozar 🙂 ) which mentions, “you’ll need to upgrade to SQL Server 2017 and use the new REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT setting.”

  • Stephen Roberts
    June 30, 2021 8:07 pm

    Sorry to throw this in here and maybe the wrong place but here goes…

    Had an interview and was asked is Always On a DR solution.
    My basic answer is no its not on itself, because if you loose data by someone’s action always on wont protect you – e.g. not enough coffee and opps drop table.
    Always on brings some protection e.g. locations and maybe corruptions and so on…its a long list but a DR solution is a combination of a good backup strategy and something like Always On. Together they make a good DR solution.

    I’m right aren’t I? certainly in my company when we do DR tests we test remote locations (Always On) and backups and its audited and checked and unless you pass all its not full DR.

    I found it interesting that some people think Always On, on its own, is the 100% answer.

    Maybe I’m wrong haha!!

  • This would seem to imply that the automatic failover mechanism isn’t aware that the primary and replica are out of sync and performs a failover anyway. Or that the fact that they are out of sync isn’t visible to a DBA performing a manual failover. Is this the case?

    • Click that “manual” link in the post for more details on what happens – but no, it doesn’t perform an automatic failover when they’re out of sync.


Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.