The Perils Of VSS Snaps

So much, so often

Ah, backups. Why are they so tough to get right?

You start taking them, you find out you’re not taking enough of them, or that they’re not the right kind, or that you’re not using checksums or compression, or that you’re not storing them in the right place, or that the storage isn’t redundant.

It’s just like, why won’t someone make this easy?

Then you read about VSS Snaps, and they look so dead simple. You don’t need your DBA Ph.D to use them.

And look how fast they are! Oh how they blaze.

Hold it there, chieftan

I know it sounds like those Snaps only taking a minute sounds good, but what’s really going on?

If you look at the docs for the VSS backups, there’s a heck of a lot going on in there.

D:

We’re gonna focus on the part where the “freeze” and “thaw” happen.

Why? Because while the data is frozen, no one can read or write data.

Prahblimz

Yes, you read that right. All activity in the database is paused.

Or frozen, if you will.

Now, if you’re just taking one VSS Snap after hours when no one’s around, chances are that pause isn’t going to make a noise.

But if you’re relying on VSS Snaps for more frequent backups, or if you’re running a 24×7 shop where users are still doing important stuff, this could be a real performance hit.

Don’t believe me?

Check your error log.

holditnowhitit

If you notice the times there, and you should because there’s like gigantic rectangles calling your attention to them, you can see that I/O was stuck for a full 61 seconds for the VSS Snap.

Just add 61 seconds to all your queries. Will users be happy?

Why, Erik? Why Do they hurt me so?

It’s not your fault, pal. They’re just built that way.

That’s why smart hosting providers limit the number of databases you can stick on a server.

Take AWS RDS for example. The first limitation they mention is 30 databases per instance.

Riotous

The cloud geniuses know their hardware capabilities, and put hard limits on them.

Now it’s time for you to hit your error log to see how often, and how long your VSS Snaps are taking.

Thanks for reading!

Brent says: MS KB 943471 is an oldie, but it notes that “We recommend that you create a snapshot backup of fewer than 35 databases at the same time.”

Previous Post
SQL Server Patches for Meltdown and Spectre Attacks
Next Post
Automating Complacency

34 Comments. Leave new

  • Alex Friedman
    January 4, 2018 8:40 am

    Great point, just note that only write operations are frozen — reads go through normally.

    By the way, what’s your recommended way to take backups of TBs-sized databases in 24×7 shops?

    Reply
    • Alex — depends on the nature of the data and some other requirements, like data retention and RTO. With the details you’ve given, all I can say is “yes, have some.”

      Reply
  • AG and backup off inactive node

    Reply
    • ti — what if it’s async? what if sync falls behind? what if sync falls so far behind it gets secretly demoted to async?

      Reply
  • What I’ve also seen, is that when these snapshots occur, features such as Availability Groups go haywire when the I/O gets frozen, and failovers can occur if you have automatic failover enabled.

    Reply
  • Hey Erik, are you talking about application consistent VM backups using VSS or a database level backup using VSS? Both of these operations have very different performance implications.

    In the case of an application consistent VM backup, the entire VM must be frozen in order to backup. This is typically what causes the issues/time of impact you describe in your post. However, a database specific VSS backup will only freeze the specific database being backed up and thus doesn’t have the same level of performance hit.

    I think it’s important to understand the distinction, because if a vendor is performing an app consistent VM backup, that’s where you’re going to get the heaviest hit. Specific database backups using VSS still incur an I/O freeze, but because of much reduced freeze time, the impact is more equivalent to a native full.

    Disclaimer: I work for a company that implements VSS level database backups. 🙂 Happy to discuss it in more detail. Hell, I really need to blog on it.

    Reply
    • Hi Mike — actually, both. I’ve seen some VSS tools taking database snaps pause activity for quite a while to quiesce and then write the data out. Looking forward to your blog posts!

      Reply
    • Brian Boodman
      January 4, 2018 4:35 pm

      My *very untrustworthy* understanding is that if you’re taking VM backups, the freeze/thaw only applies during the requester snapshot, not during the VM snapshot. The VM snapshot might introduce IO delays, but these delays are normally transparent to the application. High level example flow:

      Backup
      1) Initialize DiskShadow. In Microsoft’s diagram, DiskShadow is the Requestor. Note that Disk I/O may temporarily freeze – this corresponds to Microsoft’s diagram.
      2) Initialize VM snapshot (this is external to Microsoft’s diagram. Disk(Disk I/O might temporarily freeze, but any such freezing is at the VM level, not the application level.

      Restore
      1) Restore VM Snapshot
      2) Run DiskShadow Revert

      Effectively, this means taking an application snapshot (freezes IO), then taking a VM snapshot (which contains the application snapshot). Mind you, I think there’s still a risk that the VM snapshot will put the OS in an inconsistent state, despite the application itself being in a consistent state.

      I believe this is the approach Amazon is advocating at https://aws.amazon.com/blogs/compute/automating-the-creation-of-consistent-amazon-ebs-snapshots-with-amazon-ec2-systems-manager-part-2/ (note that this blog post is probably superceded by https://aws.amazon.com/blogs/mt/take-microsoft-vss-enabled-snapshots-using-amazon-ec2-systems-manager/ , but the latter is more turnkey and thus has less explanatory power).

      Reply
  • Vijayasaradhi Samavedam
    January 4, 2018 12:19 pm

    I have been noticing issues in my current workplace. The business wants to continue using Veeam backups for standalone’s and AAG’s. The backups on the AAG’s are creating failovers, and sometimes, Veeam is actually freezing both the nodes at the same time.

    Reply
    • same here. Veeam Support has no solution, and isnt even able to tell why snapshots are happening on all the cluster nodes
      its a pitty

      Reply
  • There are better implementation recently by Google Persistent Snapshot provided in their Google Cloud Engine. Basically it is a lower level driver to work with VSS. Google claimed can do online (with very minimum frozen) for busy databases (SQL SERVER here). Had anyone gave a try? I have 6TB data disk and will give a try during busy hours. I will let us know the result.

    Reply
  • Probably not a surprise (in hindsight anyway) but the VSS backups also break the backup chain if you’re taking regular SQL full/diff backups. We ran into this problem when we set up our DBA backups then later on the VMWare implemented VMWare server snapshots as well

    Reply
    • If the snap backup is made with a copy only option, then it won’t affect your SQL “traditional” backup chain. Veeam has that problem in earlier versions. The backup (snap) they did wasn’t done using copy_only so we couldn’t implement differential SQL Server backups. Then came a new version of Veeam (7 ?, this was some 3-4 years ago) where the snap was made as a copy_only and thanks to that it didn’t affect our SQL backups – allowing us to implement differential backups.

      Reply
  • The ironic part to VSS Snapshots running. Is that disabling the SQL Writer service prevents the I/O Freeze and Thaw process but turning off this Service does not prevent the AG Failover on virtual Clusters. This has been documented by vmware and the use of snapshots is not recommended or supported on virtual clusters.
    https://kb.vmware.com/s/article/2147662?r=2&Quarterback.validateRoute=1&KM_Utility.getArticleData=1&KM_Utility.getGUser=1&KM_Utility.getArticleLanguage=1&KM_Utility.getArticle=1

    Reply
    • This is because the cluster failovers are due to the OS drive being frozen, not any of the databases. The SQL Writer is only used to manage database VSS backups, but other VSS backups can still occur. The reason VSS backups can cause a cluster failover is due to the OS drive being frozen for longer than the Health Check Timeout setting for the cluster, which will cause quorum failure.

      This can be addressed by increasing the Health Check Timeout in the cluster. Additionally, see if the VM backup can exclude VMDKs hosting database files. By minimizing the amount that needs to be frozen in a VM backup you can reduce the amount of time needed to freeze the I/O.

      Reply
      • Vincent Brengman
        January 5, 2018 2:28 pm

        Why not just disable quiescing when running a backup of OS or filesystem? If you’re running SQL backups with an agent, what is the purpose of an application consistent OS backup?

        Reply
        • So you can’t disable quiescing, that’s how VSS works. It *must* quiesce. To your point, though, if you’re taking SQL Server database backups natively, you many not need the VM backup. If you do need it, you should exclude the database files/volumes so there’s less to backup (and, thus, less to quiesce).

          Reply
  • Abhinaya Rao K
    December 20, 2018 5:07 am

    Hi Mr Brent,

    Will VSS Snaps effect Log Shipping?
    Because my log shipping is not working. When I look in backupmediafamily and backupset tables, VSS backup file has the first LSN value that matches the last LSN value of my trn file and this is causing the restore job to throw an error “The file is too recent to apply to the secondary database”. Please help.

    Thanks and Regards,
    Abhinaya Rao K

    Reply
  • Need to clarify one point, only physical reads will be frozen, right? Logical reads and all writes are expected to continue as long as you have enough memory,

    Reply
    • Yazeed – just so I understand, you’re expecting that SQL Server will allow insert/update/delete transactions (writes) to continue even though they aren’t getting written to disk?

      Reply
  • Yes, in memory before written to disk

    Reply
  • Ivan Saez Scheihing
    October 23, 2020 5:38 am

    Hi,

    We have Veaam making backups of Sqlserver instances (snapshots) and the process looks like this:


    23-10-2020 00:02:37 :: Creating VM snapshot

    23-10-2020 00:05:03 :: Hard disk 1 (40 GB) 2,1 GB read at 267 MB/s [CBT]
    23-10-2020 00:05:03 :: Hard disk 2 (30 GB) 505 MB read at 234 MB/s [CBT]
    23-10-2020 00:05:16 :: Removing VM snapshot 48 sec
    23-10-2020 00:06:05 :: Finalizing

    and what we see is that during the “Removing VM snapshot” phase the (vm) machine becomes unreachable for relative short period of time. Most of the time no issues happen but when an application is on that time accessing the database we see errors like:

    2020-10-23 00:05:53.723 +0200] [Error] Processing tasks failed: Liquit.DAL.DalUnexpectedException: Database error “Dal_Unexpected”:
    A network-related or instance-specific error occurred while establishing a connection to SQL Server

    I think it’s because of the quiescing of the vm. Our application admins are asking for HA db’s (Always on ) but reading the comments I don’t think it will solve the issue.
    Any comments?

    regards,

    Ivan

    Reply
    • Ivan – unfortunately we don’t do Veeam support here. Your best bet would be to contact Veeam, or if you need to hire us for personal consulting, click Consulting at the top of the site.

      Reply
  • We are using the ola hallengren backup scripts on our Azure VM’s. I have disabeld the SQL VSS writer when I saw failing backups at times we were not running any backups. I saw these were caused by another process, porbably a snapshot of the server. By disabling the SQL VSS writer service I want to prevent a gap in the backup chain but now I’m wandering if I did the right thing….

    Reply
  • So best to limit the number of databases to 35. We have a ms sql express with 45 client databases, each database is specific to a customer. We take snapshots 4x per day but sometimes existing connections are blocked or dropped.

    Does anyone have a solution where a snapshot can loop through databases and process each database separately?

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.