Snapshot backup tools like Azure Site Recovery and Veeam are great for sysadmins. They let you quickly replicate a virtual machine to somewhere else without knowing too much about the server’s contents.
To pull it off, they use the Windows Volume Shadow Copy Service to do snapshot backups – often referred to as VSS snaps. In a nutshell:
- The VSS service tells SQL Server to freeze its writes
- SQL Server stops writing to each of its databases, freezing them
- SQL Server tells VSS, “Okay, I’m out”
- VSS creates a shadow copy of the data & log files
- VSS tells SQL Server, “You can get back to work”
- SQL Server begins writing again, one database at a time (called thawing)
These snapshot backups show up in system tables like msdb.dbo.backupset because when used successfully, they’re like full backups. It sounds like witchcraft, but it really does work, and it’s worked for years.
But when it goes wrong, it goes WRONG.
SQL Server freezes databases one at a time, serially, until it’s got them all frozen. This can take some time, and while databases are frozen, you’ll see blocking chains for writing queries (which can in turn block readers, too, depending on your isolation level.)
Because of that one-at-a-time problem, KB #943471 says:
We recommend that you create a snapshot backup of fewer than 35 databases at the same time.
Backup apps like Azure Site Recovery and Veeam don’t warn you about that, though. You just find out when your SQL Server suddenly feels like it’s locked up, with tons of blocking issues.
For example, here’s a screenshot from a client’s msdb.dbo.backupset table when Azure Site Recovery was trying to snapshot hundreds of databases all at once:
See how the backup_start_date and backup_finish_date are 40-50 seconds apart? That’s because SQL Server was slowly, gradually stepping through the snapshots of each database. Finally, they all “finished” at 8:08:10, but even that is a little tricky – in this case, they didn’t all thaw at the exact same time. (I’m guessing, based on the blocking storms I’ve been seeing lately, that the backup_start_date and backup_finish_date date/times don’t really correspond to the exact VSS freeze/thaw times.)
To make things worse, sysadmins just check boxes in tools like Azure Site Recovery and Veeam and tell them to take backups once an hour. The backups don’t necessarily happen exactly on the hour, so developers are left scratching their heads as to why seemingly simple queries suddenly just freeze every now and then for no reason.
How To Tell If It’s Happening to You
If you’re doing a health check, run sp_Blitz. We warn you about how many snapshot backups your server has taken recently, and in the upcoming October build, we also warn you if you’ve had snapshots spanning 35+ databases in the last two weeks.
If you’re troubleshooting live, run sp_BlitzWho and look for blocking chains led by backup commands. You’ll have to move fast, though, because blocking storms like this can disappear within 30-60 seconds. I don’t recommend that you log sp_BlitzWho to table that often – that’s just too much overhead. That’s where 3rd party monitoring tools are more helpful.