One of your SQL Servers is going to fail.
When one of your AG members goes down, what happens next is just like opening a new SSMS window and typing BEGIN TRAN. From this moment forwards, the transaction log starts growing.
And growing. And growing.
SQL Server is waiting for that failed replica to come back online and start synchronizing again. The transaction log has to contain all of the data necessary to bring that once-failed replica back up to speed, to the current point in time. It doesn’t matter whether this replica is sync or async, important failover partner or an unimportant reporting server.
After every 4 hours of outage troubleshooting (at 4 hours in, 8 hours, 12 hours), I look at the facts:
- How much free space do I have left on the log file drive?
- How long will it take to reseed the failed replica by using a full backup?
- Based on these, should I remove the failed replica from the Availability Group?
This means at minimum, my AG replicas all need enough space to have 4 hours of transaction log activity. I can buy myself more time by disabling index defrag/rebuild jobs as soon as a replica fails, too. (In a perfect world, I’d code that into my index rebuild jobs, but that’s left as an exercise for the reader.)
After the emergency finishes, I don’t bother shrinking the log files back down – because after all, another replica is going to fail. It’s not a matter of if – it’s just a matter of when.
Knowing that all this is going to happen, it makes sense to pre-grow/size your transaction log files large enough to handle at least 4 hours of activity without a log backup. After all, when I’m troubleshooting an availability emergency, I don’t want to waste time troubleshooting a performance emergency. If my AG databases are constantly freezing up while they grow the log files, it’s going to be a bad distraction from the problem I really need to focus on.
Great post. A perpetually growing transaction log can be a real pain. My strategy has always been to avoid that situation at all costs. This is the first time I heard of a strategy to cope with that situation (for a time). It makes sense. Yes it’s best to avoid a storm, but it’s better if you can also weather it for a while.
Thanks! Yeah, and you *know* the storm is coming with AGs. It’s just a matter of time.
I have a question on the data file sizes in an availability group. We allow the database to grow and have an alert fire when the data file expands. The secondary replica, in read-intent only mode, does not have its alert fire until the availability group fails over. The failover will happen at least once a month due to Windows patching.
Why doesn’t the expansion happen on the secondary replica, and fire the expansion alert, when it happens on the primary?
Chris – lemme zoom out a little and ask a bigger question. Why would you want multiple growth alerts happening when a database grows? Why isn’t the alert from the primary enough?
I agree but I am curious as to why it doesn’t happen on both nodes at the same time. It isn’t a worry more just curious.
I have an update on my situation that is both good and adds to the confusion. This weekend we ran database re-indexes and one database did actually expand and as expected the alert fired on the primary replica but not on the secondary replica. We had patched these servers last weekend (Canadian Thanksgiving this weekend brought forward the maintenance work) and so I might have expected to run into what you had created this blog post about. Now I look and see very little log used on the primary replica, as the log backups had cleared the committed data, but also noticed the database had in fact expanded on the secondary replica to the same size as the primary replica. The secondary alert will fire as soon as the SQL failover occurs. Still puzzling but not as concerning.
As I have sent email to you, we have same situations and can’t able to understand the reason of growing 100GB log in one day, One of our three availability replica was down due to disk was full, while index maintenance.
I have another Question that How can I stop growing log files during index maintenance? Our last Index maintenance task was span to 4 hours.
Please assist me. Thanks for writing such a nice topic on Always on.
Haresh – as I replied to you in email, I can’t do that personalized troubleshooting for free. I really wish I could, but private personal help is where consulting comes in.
If you have questions about your environment, your best bet is to post them at Q&A sites like http://DBA.StackExchange.com. Thanks!
From SQL2012 SP2 CU8 https://support.microsoft.com/en-ca/kb/3095156
This might give you a little relief and a method to use if you lose your secondary(s).
Highly coincidental to see this post after going through this exact situation recently, are you sure you’re not visiting my office without saying hi? One of our 5 node 2012 AG clusters started having a few of its secondaries get into bugcheck error / reboot loops, and the same points you made above started weighing heavily on our minds. Luckily they would stay up long enough for things to catch up before rebooting again so we didn’t have to resort to removing them from the AG, and they’ve seemed to settle down after making sure we had all the most recent BIOS and firmware updates installed.
Yeah, it’s important to keep this in mind when a replica is down.
No need to reseed from a full backup though, you can just apply the transaction log backups.
Alex – in theory, yes. In practice, the log size can exceed the full backup size if the replica communication is down for a day or more, or if some bonehead is running index rebuilds, or if we’re doing queueing in the database, or if someone rebuilds reporting tables constantly, or…you get the point.
1) I have seen at least one perfect storm for AOAG. The secondary goes down, the log on the primary fills the disk, the secondary has refused to re-sync and said that the log records it needs are not available, and it happened on a Monday after a full backup and just beyond the log backup retention period. It required a full re-initialisation of the secondary from backups.
So the log doesn’t always retain everything that’s required; I wish I had kept more written notes on exactly what happened but I did Google at the time and found others who had been in the same situation.
2) Also as we’re talking about full AOAG restores, I think it’s very important for admins to track how long it takes to do their backups and store those statistics on a management server (e.g. by scraping the msdb.dbo.backupset table) to help with estimating restores.
One client moved hundreds of servers to a new data centre. Backups were still working but we noticed they were slower (and notified the authorities who shrugged their shoulders). The extra 15-30-60 minutes doing restores (e.g. from Production to QA) wasn’t a big deal at first; until we had to roll back a failed application upgrade and it took 8 additional hours to restore the database. Ouch.
Oh yeah and when I looked into it, our largest database used to have a full backup in 12 hours on a Sunday; now it takes 4.5 days. That requires an email to the business owner that, “Hey the RTO has changed and it’s going to blow your socks off.”
We’ve since rewritten the book on our change estimations.
If AG’s need 4 hours of log space is it worth running log backups < 4 hours?
Jeremy – I’ll answer that with a question: if you need to restore, is the business comfortable losing up to four hours of data?
We actually changed out log backups from 2 hours to 1 hour since we were trying to catch transactions in between to help maintain the log files. But this server has an AG as well, asynch. We don’t backup from the secondary although were were thinking about it.
We also have a log percent full alert so that we offload logs if we have more than normal activity.
how do you offload your logs?
We use Ola Hallengren’s utility as SQL Agent jobs. If the alert fires then it executes the log backup job.
We have four node AO. Is it possible to shrink data file on primary AO server without removing the database from AO?
Suresh – have you thought about giving it a shot and seeing what happens? This is why you really want to have a staging environment so you can try these kinds of things rather than asking a stranger on the internet every time you want to do something in production. 😉
Will running the Log backup on the AG secondary server truncate\ Mark for reuse the Log file on the primary AG server ?
Mohamed, for Q&A head on over to dba.stackexchange.com
Remember to search for questions that have already been asked that may get you an answer first.
I have two node AG in SQL server 2016 . My Primary server down two days and currently failover to secondary .
Now my secondary server become primary .
Can we take Trans Log backup of New primary ?
After two days Old Primary server Up , So nay other trans log backup restored or automatic restored all data ?
Please do the needful
Vipul – when you want to ask questions, go to a question & answer site like https://dba.stackexchange.com. That’s the needful, and you’re the one who needs to do it. Take care.
Do the needfuuuuuuul!
Great post Brent!
Thanks for sharing!