SQL Server Suddenly Frozen? You Might Be Snapshotting Too Many Databases.

Snapshot backup tools like Azure Site Recovery and Veeam are great for sysadmins. They let you quickly replicate a virtual machine to somewhere else without knowing too much about the server’s contents.

To pull it off, they use the Windows Volume Shadow Copy Service to do snapshot backups – often referred to as VSS snaps. In a nutshell:

  1. The VSS service tells SQL Server to freeze its writes
  2. SQL Server stops writing to each of its databases, freezing them
  3. SQL Server tells VSS, “Okay, I’m out”
  4. VSS creates a shadow copy of the data & log files
  5. VSS tells SQL Server, “You can get back to work”
  6. SQL Server begins writing again, one database at a time (called thawing)

These snapshot backups show up in system tables like msdb.dbo.backupset because when used successfully, they’re like full backups. It sounds like witchcraft, but it really does work, and it’s worked for years.
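You can see those snapshot backups for yourself. Here's a quick sketch of a query against msdb – the is_snapshot column in msdb.dbo.backupset flags backups taken through the VSS/SQL Writer path:

```sql
/* Recent VSS snapshot backups logged in msdb.
   is_snapshot = 1 flags backups taken via the VSS/SQL Writer path. */
SELECT bs.database_name,
       bs.backup_start_date,
       bs.backup_finish_date,
       DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date)
           AS duration_seconds
FROM msdb.dbo.backupset AS bs
WHERE bs.is_snapshot = 1
  AND bs.backup_start_date >= DATEADD(DAY, -14, GETDATE())
ORDER BY bs.backup_start_date DESC;
```

The 14-day window is arbitrary – widen it if your backup history retention allows.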

But when it goes wrong, it goes WRONG.

SQL Server freezes databases one at a time, serially, until it's got them all frozen. This can take some time, and while databases are frozen, you'll see blocking chains for writing queries (which can in turn block readers, too, depending on your isolation level).

Because of that one-at-a-time problem, KB #943471 says:

We recommend that you create a snapshot backup of fewer than 35 databases at the same time.

Backup apps like Azure Site Recovery and Veeam don’t warn you about that, though. You just find out when your SQL Server suddenly feels like it’s locked up, with tons of blocking issues.

For example, here’s a screenshot from a client’s msdb.dbo.backupset table when Azure Site Recovery was trying to snapshot hundreds of databases all at once:

See how the backup_start_date and backup_finish_date are 40-50 seconds apart? That’s because SQL Server was slowly, gradually stepping through the snapshots of each database. Finally, they all “finished” at 8:08:10, but even that is a little tricky – in this case, they didn’t all thaw at the exact same time. (I’m guessing, based on the blocking storms I’ve been seeing lately, that the backup_start_date and backup_finish_date date/times don’t really correspond to the exact VSS freeze/thaw times.)
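If you want to gauge how big those snapshot runs were on your own server, you can group backupset rows by their shared finish time. This is a rough sketch – it assumes every database in a run records the same backup_finish_date, which, as noted above, doesn't necessarily line up with the actual VSS freeze/thaw times:

```sql
/* How many databases were in each snapshot run, and how long
   the whole run spanned. 35+ databases in one run exceeds the
   recommendation in KB 943471. */
SELECT bs.backup_finish_date,
       COUNT(*) AS databases_in_run,
       DATEDIFF(SECOND,
                MIN(bs.backup_start_date),
                MAX(bs.backup_finish_date)) AS run_span_seconds
FROM msdb.dbo.backupset AS bs
WHERE bs.is_snapshot = 1
GROUP BY bs.backup_finish_date
ORDER BY databases_in_run DESC;
```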

To make things worse, sysadmins just check boxes in tools like Azure Site Recovery and Veeam and tell them to take backups once an hour. The backups don’t necessarily happen exactly on the hour, so developers are left scratching their heads as to why seemingly simple queries suddenly just freeze every now and then for no reason.

How To Tell If It’s Happening to You

If you’re doing a health check, run sp_Blitz. We warn you about how many snapshot backups your server has taken recently, and in the upcoming October build, we also warn you if you’ve had snapshots spanning 35+ databases in the last two weeks.

If you’re troubleshooting live, run sp_BlitzWho and look for blocking chains led by backup commands. You’ll have to move fast, though, because blocking storms like this can disappear within 30-60 seconds. I don’t recommend that you log sp_BlitzWho to a table that often – that’s just too much overhead. That’s where third-party monitoring tools are more helpful.
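If you'd rather roll your own quick live check, here's a bare-bones sketch against sys.dm_exec_requests that surfaces blocked requests alongside any running BACKUP commands – sp_BlitzWho gives far richer output than this:

```sql
/* Rough live check: blocked requests, plus any running
   BACKUP commands that may be leading the blocking chain. */
SELECT r.session_id,
       r.blocking_session_id,
       r.command,
       r.wait_type,
       r.wait_time AS wait_time_ms,
       DB_NAME(r.database_id) AS database_name
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
   OR r.command LIKE 'BACKUP%';
```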


21 Comments

  • VSS based backups on 8k databases was “fun.” That backup vendor learned a lot that month.

  • It wasn’t the taking of the snapshot (or rather the quiescence of the server) that caused us issues. It was when the snapshot was released that the server would drop connections.

    We’d been OK until we moved data centres and switched to rented kit. To resolve it, the backups for the SQL Servers now use dedicated proxies (it’s what I’ve been told, and who am I to question it), and that fixed it.

  • Heh… yet another feature that can screw you to the floor. Lordy. Upgrades to new versions are getting damned scary. For example, the “Fast Inserts” change they made, along with the caveats that it creates huge amounts of “unused” space and you need a trace flag to turn it off because it’s on by default. Upgrading from 2016 to 2019 this year has me seriously concerned with all the supposed “improvements” they made. From what I saw, 2017 was a train wreck for performance (I’ve got it on my laptop). I realize that 2019 has some CUs under its belt, but I find that performance improvements frequently aren’t. Like the Cardinality Estimator “improvements”… that was a bloody train wreck for us, and we still have it turned off.

    “Change is inevitable… change for the better is not”.

    • I experienced massive performance improvements in 2017 just by copying databases over to it. Mostly coming from 2014 and 2012, and only a little from 2016.

      The 2014 CE upgrades, at least in my case, were from the CE too correctly estimating shitty queries and not providing enough resources based on the number of records read and returned and the expense of getting them back. We couldn’t fix all our queries, so we upgraded the worst servers to 2016 briefly, then everything else to 2017. There were a few queries we had to add a legacy CE query hint to, but they were few and far between.

    • Jeff – I love that last line!

  • I joined a new company and the first thing I noticed was I/O freeze messages in the log. I immediately got a backup solution and implemented backup schedules. Everyone is happier now!

  • I remember several years ago at Company X, when we were rolling native, a particular vendor’s product taking full non-copy-only backups. Our internal support claimed they weren’t doing SQL backups, but the msdb backupset clearly showed they were. The vendor’s code was triggering the SQL VSS Writer; our only option was to disable the service. That product thankfully was retired.

    AWS… wasn’t there a similar limit recommended? Which would be expected, unless they’ve written their own VSS provider…

    *Yeah I am not naming and shaming.

  • When we implemented Veeam, I was very concerned about the “freeze,” especially with the servers behind SolarWinds. The NPM/NetFlow data never stops flowing, and I didn’t want 3 am phone calls every morning. These DB servers typically only had one large user database, so the issue with freezing sequentially never made itself known, and things generally ran smoothly. Until I inherited an environment that had a two-member AG.

    It wouldn’t happen daily, but I started getting alerts about the secondary getting a serious beating. It was to the point that the AG considered the secondary offline. This would occur during the time the backup was running, so it seemed pretty obvious where I needed to start looking. In addition to managing the DB servers the Veeam environment was mine as well, so nothing was in my way as far as making changes. All tweaks related to the backup job made no difference. No increase or decrease in frequency of the issue.

    For our VMware, Azure and GCP environments I’ve written reports I get daily that tell me everything about the environments, backups, storage, etc. While looking at the VM detail I noticed that the secondary in the AG had volumes in two datastores. This was unusual, as we generally keep DB servers together on dedicated LUNs/datastores. I discovered that the odd volume wasn’t just on a different datastore, it was also on a different SAN. The SAN was doing its own snaps during the time the Veeam job was running. Once the file location issue was cleaned up, the issues stopped.

  • Snapshots are the single most malignant cancer that has ever been added to systems administration. It’s a hack tool that allows mediocre sysadmins to pretend like they can get away with not knowing what they’re doing, and it creates absolutely colossal amounts of work and 40-hour days when they invariably go wrong.

    Getting the call to be up all night when an AG cluster with four availability groups containing several multi-terabyte databases all have to roll back at the same time? Check.

    Getting the call when Veeam corrupts an entire AD FS farm serving 18 thousand users on 2 continents due to a snapshot bug? Check.

    Getting the call when most of an Exchange 2010 organization with a few dozen databases goes offline in a 6-server DAG, requiring a manual copy resume for every DB? Check.

    Having to deal with absolutely horrendous queuing from tens of thousands of monitoring alerts and daily internal mail blasts getting stuck immediately after that? Check.

    Rebuilding SCCM at a university, just before fall term, from an 8-month-old copy of the site database because Veeam decided to stop clearing out snapshots in March and never alerted on it while most of the technical staff were out? Check.

    Rebuilding an extraordinarily complex SharePoint farm manually following a power surge that took out the UPS and SAN, because the snapshot-based SharePoint backup wasn’t actually taking consistent backups, and the “low brow” backups that were definitely not as good as snapshots had been cancelled? Check.

    Running a four-month support ticket with Azure trying to make their snapshot-based backup work while trying to fit ad hoc backups into the extremely firm budget limit of your client? Check.

    Naturally, the sysadmins who evangelized how awesome these snapshot-based backups are, who couldn’t originally get the regular backups to work, would leave WELL before the 24-hour mark to get some sleep and food. But maybe I misunderstand when I hear them say that “snapshot backups just work without being messed with.”

    • The thing that I really, really hate with these solutions is losing capabilities that are available when using native backups. Want to skip the last full backup, restore from the one before and restore all the transaction log backups up to now? No. But the CHECKDB has failed and it’s not repairable without data loss. Want to do a page restore? No. Want access to the transaction log backups? No.

      And then you have the conversations about the space to store your backups because you want to “go native”.

      • Yeah, agreed, I’ve had some awkward discussions with clients when it comes to the recent data corruption bugs in SQL 2019 and how their snapshot backup apps would have been helpless.

        • Indeed. Most of them abstract your backup monitoring so heavily and helpfully tune out the ‘noise’ that you sometimes miss stuff. Commvault was the only one I found that managed your backups nicely, provided all the logging you needed and reliably produced SQL-relevant alerts. It had quite a steep learning curve, however, and you needed to take care to configure it to use regular SQL backups – there was some check box that made it back up, but not checkpoint.

    • Keith – you sound like you’ve got all the fun scars, hehheh!

  • We have a similar issue with Rubrik taking VM snapshots. We experience AG failovers within 1 minute of Rubrik logging on. Any suggestions for host record TTL? Has anyone else experienced this?

  • Our firm uses Commvault (DBAs have no access!!) for full and diff scheduled backups; these use the Commvault SQL Agent and generally work well. However, there are also daily VMware snapshots initiated through Commvault which back up our virtual SQL servers (only the C drive); all other SQL drives are excluded. However, on some of our VMs we still experience these “I/O is frozen” messages at the time the snapshot is executed. No one can seem to explain this, and we’re being advised this is expected behavior, but I have my doubts as we have other servers not having this problem.

    Does anyone have any thoughts here?

