Synchronous Always On Availability Groups Is Not Zero Data Loss

Last Updated February 9, 2017

In theory, when you configure Always On Availability Groups with synchronous replication between multiple replicas, you won’t lose data. When any transaction is committed, it’s saved across multiple replicas.

That’s the way it works, right? I mean, except when you restart your synchronous replicas, or patch them, or they just stop working for any number of reasons. The primary keeps right on trucking, accepting deletes/updates/inserts, without telling end users that all their eggs are in a single basket.

But I hear you – those are really rare cases. Most of the time, as long as you’ve got a synchronous replica, it’s synchronous.

Except when it’s not. Read the manual carefully:

If primary’s session-timeout period is exceeded by a secondary replica, the primary replica temporarily shifts into asynchronous-commit mode for that secondary replica.

That’s right – your sync secondary becomes asynchronous.

Automatically. Without warning you. And you can’t control it. You can’t say, “I only want you to accept transactions as long as data is being copied to another replica.” For more details on what causes a sync replica to drop into async, check out the BOL page on synchronous commits.
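If it helps to see that rule in motion, here's a toy Python model of it. This is not SQL Server code: the `Primary` and `Secondary` classes, the `commit` and `at_risk` methods, and the 10-second `session_timeout` value are all made up for illustration. The only thing it tries to capture is the sentence quoted above: the primary waits for the secondary to harden the log, unless the secondary has been silent for longer than the session timeout.

```python
from dataclasses import dataclass, field


@dataclass
class Secondary:
    """Toy sync secondary: remembers which log blocks it has hardened."""
    hardened: list = field(default_factory=list)
    reachable: bool = True
    last_heard_from: float = 0.0  # last time the primary heard from us

    def harden(self, block, now):
        """Harden a log block and acknowledge it, if we are reachable."""
        if not self.reachable:
            return False
        self.hardened.append(block)
        self.last_heard_from = now
        return True


@dataclass
class Primary:
    """Toy primary that follows the rule quoted from the manual."""
    secondary: Secondary
    session_timeout: float = 10.0  # seconds; assumed value for this sketch
    log: list = field(default_factory=list)

    def commit(self, block, now):
        if self.secondary.harden(block, now):
            self.log.append(block)
            return "synchronous"
        if now - self.secondary.last_heard_from > self.session_timeout:
            # Timeout exceeded: stop waiting and commit locally only.
            self.log.append(block)
            return "asynchronous"
        # Still inside the timeout window: the commit keeps waiting.
        return "waiting"

    def at_risk(self):
        """Committed blocks that exist only on the primary."""
        return [b for b in self.log if b not in self.secondary.hardened]


secondary = Secondary()
primary = Primary(secondary)
print(primary.commit("t1", now=1.0))   # secondary acknowledges
secondary.reachable = False            # secondary reboots, gets patched, etc.
print(primary.commit("t2", now=5.0))   # inside the timeout window
print(primary.commit("t2", now=12.0))  # timeout exceeded
print(primary.at_risk())
```

Run it and the three commits come back as `synchronous`, `waiting`, and `asynchronous`, and `at_risk()` returns `['t2']`. That last list is the point: those are transactions your users were told were committed, living on exactly one server.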

Bottom line – you can’t actually guarantee zero data loss with Always On Availability Groups. I love AGs, but they’re much, much more complex than they look at first glance.

27 Comments

tobi
September 3, 2015 8:35 am

Also see this for an example of synchronous data loss or loss of availability: http://dba.stackexchange.com/questions/80674/what-happens-in-this-strange-sql-server-mirroring-situation

SQL Server can’t beat the CAP theorem. Either it is not 100% available or not 100% consistent in case of a network problem.

- sirius
  September 11, 2015 5:03 am
  
  That stackexchange question has false assumptions. Just ignore it.
  
  - James Lupolt
    September 11, 2015 7:13 am
    
    Could you leave a comment or answer explaining the misconception? I think it’s getting a lot of attention now and I might have even seen it in the ‘hot questions’ sidebar on other Stack Exchange sites.
    
    - Brent Ozar
      September 11, 2015 7:17 am
      
      The top-rated answer by Retracement is correct – this situation doesn’t happen. The correct LSN is always resolved.
      
Dale
September 3, 2015 12:10 pm

I think this just shows that there is no such thing as a “set it and forget it” high availability option. You need an administrator to watch over it, no matter the technology. AGs make it easier, but not bulletproof.

In addition, this also shows that RPO/RTO are still valid objectives to build out your data availability needs by.

- John
  June 30, 2023 5:22 pm
  
  Oracle Fast-Start Failover is “set it and forget it”. Properly configured, I’ve never seen it lose committed data.
  
  - Brent Ozar
    June 30, 2023 5:42 pm
    
    Cool. How much does that cost, by the way?
    
James Lupolt
September 3, 2015 1:05 pm

Hi Brent, just curious what your clients’ expectations have been around this type of thing. Have you worked with clients who would have preferred that the master refuse (or alternatively, wait indefinitely to acknowledge) writes if it couldn’t reach any of the secondaries?

I’m not suggesting that any particular way is better, but those seem to be the alternatives to my limited imagination, and I’m genuinely curious about what preferences you’ve encountered among clients in the wild.

- Brent Ozar
  September 3, 2015 1:11 pm
  
  James – thanks! The post came about as a result of a visit to a client recently. They use 3 sync replicas at all times (primary, plus 2 sync secondaries) plus an async secondary. If any of the sync replicas fail (whether it’s the primary or secondary), their scripts check to make sure they can still automatically fail over to another sync replica (like set it up as an automatic failover partner if necessary), and then build in a new async secondary just in case. That’s the best approach I’ve seen so far.
  
James Lupolt
September 3, 2015 1:08 pm

And thanks for posting this, btw. It (and the question that tobi linked to) have got me thinking about some CAP-related topics that I haven’t thought enough about in a SQL Server context.

Julien
April 8, 2016 11:53 am

Hi Brent, do you know how to catch these events in case they occur?

- Brent Ozar
  April 8, 2016 12:20 pm
  
  Yes, get a monitoring product. Based on your thresholds, they can alert you when a secondary is more than X seconds/minutes out of sync.
  
  - Pail
    May 18, 2017 10:17 am
    
    Can you expand on what should be monitored to detect the out-of-sync situation? We’ve been surprised to find on a couple of occasions our secondary has drifted hours behind (we were using read intents to pull data from the secondary and clients complained about changes not showing up).
    
    Ideally we want to solve the underlying issue and prevent the AG secondary from falling out of sync, but documentation is rather sparse on the subject. Right now it seems related to long “open/close trans” statements.
    
    - Brent Ozar
      May 18, 2017 10:18 am
      
      Pail – yep, pick up a third party monitoring tool like Idera SQL DM, Quest Spotlight, or SentryOne SQL Sentry. Don’t try to reinvent this wheel.
      
Richard
December 13, 2018 9:37 am

My interpretation of the Microsoft doco is different:

The silent switch from Sync to Async has zero effect on the replica and happens only on the Primary.
It is done so that the Primary can commit transactions when connection to a Synchronous Commit replica is lost.
Once the Primary & replica can communicate again, the Primary treats the replica as a Synchronous Commit replica.

- Brent Ozar
  December 13, 2018 9:37 am
  
  LOLOL, if the connection to a sync commit replica is lost, then you’re losing data, chief.
  
Richard
December 13, 2018 10:05 am

“… the primary replica waits for the secondary replica to confirm that it has hardened the log (unless the secondary replica fails to ping the primary replica within the primary’s session-timeout period).”

Doesn’t that fit my interpretation? And if not, what does, bearing in mind that the primary & secondary are not communicating at that point?

- Brent Ozar
  December 13, 2018 10:07 am
  
  Nothing does, that’s the point. Take the example of the secondary having a BSOD and rebooting. The primary keeps right on taking transactions, and you’re down to a single point of failure. If the secondary doesn’t come back up promptly, and you lose the primary, you’ve lost data.
  
yassine elouati
March 10, 2019 3:04 pm

Thanks as always for your contributions to the community. What’s your experience when Always On replication is slow and the slowness is traced to the network? Always On uses only one TCP connection to send data, and over a high-latency network it is turtle slow. RFC 1323 is supposed to resolve this issue, but in my case it does not. I suspect the network devices at this point.

- Brent Ozar
  March 11, 2019 1:04 am
  
  Yassine – I wish I could do free personal consulting for everyone here at the blog, but realistically, when you have a question, you’ll either need to post it on a Q&A site or forum, or hire me for consulting. Thanks!
  
Toby Ovod-Everett
April 19, 2019 3:17 pm

I found https://dba.stackexchange.com/questions/210609/alwayson-commit-on-primary-if-secondary-goes-down (from this guy named Brent Ozar 🙂 ) which mentions, “you’ll need to upgrade to SQL Server 2017 and use the new REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT setting.”

Stephen Roberts
June 30, 2021 8:07 pm

Sorry to throw this in here and maybe the wrong place but here goes…

Had an interview and was asked whether Always On is a DR solution.
My basic answer is no, it’s not on its own, because if you lose data through someone’s action, Always On won’t protect you – e.g. not enough coffee and oops, drop table.
Always On brings some protection, e.g. across locations and maybe against corruption and so on… it’s a long list, but a DR solution is a combination of a good backup strategy and something like Always On. Together they make a good DR solution.

I’m right, aren’t I? Certainly in my company, when we do DR tests, we test remote locations (Always On) and backups, it’s all audited and checked, and unless you pass everything it’s not full DR.

I found it interesting that some people think Always On, on its own, is the 100% answer.

Maybe I’m wrong haha!!

John
June 30, 2023 5:24 pm

This would seem to imply that the automatic failover mechanism isn’t aware that the primary and replica are out of sync and performs a failover anyway. Or that the fact that they are out of sync isn’t visible to a DBA performing a manual failover. Is this the case?

- Brent Ozar
  June 30, 2023 5:43 pm
  
  Click that “manual” link in the post for more details on what happens – but no, it doesn’t perform an automatic failover when they’re out of sync.
  
Gary Shen
March 22, 2025 8:49 pm

I know why Always On adopts this design.

Because transactions are transferred and redone in order, one after another. If any one transaction is delayed in sync mode, all the ongoing transactions are blocked, which means the whole system is suspended. To avoid this scenario, Microsoft chose to silently switch from sync to async temporarily.

Our solution doesn’t have this kind of problem. Zero Data Loss is guaranteed.

Silly P
August 1, 2025 11:35 pm

This blog is desperate for a mention of REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT. That one setting makes the whole article out of date.

- Brent Ozar
  August 1, 2025 11:41 pm
  
  As the Microsoft support folks like to call it, Low Availability, High License Cost Mode. You only get one free HA and one free DR standby with paid SA licensing, and you certainly don’t wanna do synchronous commits to DR. That means you would only have one HA replica – and if it goes down, boom, your whole cluster is down.
  
  Or, as soon as you start setting minimum commit replicas to 2, that means you’d want 3 HA instances minimum – but since only 1 extra is free, that means you’ve just doubled your licensing costs.
  
  So as long as you’re okay with doubling your licensing costs or having the entire cluster go down if any one replica goes down, sure, you’re right, and this article is out of date.
  
  Tell me you haven’t actually worked with licensing that feature, without telling me you haven’t worked with licensing that feature.
  