The Perils Of VSS Snaps

Last Updated 6 years ago

So much, so often

Ah, backups. Why are they so tough to get right?

You start taking them, you find out you’re not taking enough of them, or that they’re not the right kind, or that you’re not using checksums or compression, or that you’re not storing them in the right place, or that the storage isn’t redundant.

It’s just like, why won’t someone make this easy?

Then you read about VSS Snaps, and they look so dead simple. You don’t need your DBA Ph.D to use them.

And look how fast they are! Oh how they blaze.

Hold it there, chieftain

I know those Snaps only taking a minute sounds good, but what’s really going on?

If you look at the docs for the VSS backups, there’s a heck of a lot going on in there.

We’re gonna focus on the part where the “freeze” and “thaw” happen.

Why? Because while the data is frozen, no one can read or write data.

Prahblimz

Yes, you read that right. All activity in the database is paused.

Or frozen, if you will.

Now, if you’re just taking one VSS Snap after hours when no one’s around, chances are that pause isn’t going to make a noise.

But if you’re relying on VSS Snaps for more frequent backups, or if you’re running a 24×7 shop where users are still doing important stuff, this could be a real performance hit.

Don’t believe me?

Check your error log.

If you notice the times there, and you should because there’s like gigantic rectangles calling your attention to them, you can see that I/O was stuck for a full 61 seconds for the VSS Snap.

Just add 61 seconds to all your queries. Will users be happy?

Why, Erik? Why do they hurt me so?

It’s not your fault, pal. They’re just built that way.

That’s why smart hosting providers limit the number of databases you can stick on a server.

Take AWS RDS for example. The first limitation they mention is 30 databases per instance.

Riotous

The cloud geniuses know their hardware capabilities, and put hard limits on them.

Now it’s time for you to hit your error log to see how often your VSS Snaps are running, and how long they’re taking.
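
If you'd rather not eyeball the log, here's a minimal Python sketch that pairs up the freeze and resume messages and works out how long each freeze lasted. It assumes the standard "I/O is frozen on database ..." and "I/O was resumed on database ..." message text and the usual timestamp-then-spid line layout. The sample lines and the database name Sales are made up for illustration, and reading your actual log file (its path and encoding vary by install) is left to you.

```python
import re
from datetime import datetime

# Matches the two messages SQL Server logs around a snapshot freeze.
# Assumed layout: "<timestamp> <spid> I/O is frozen on database <name>. No user action ..."
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+\S+\s+"
    r"I/O (?P<event>is frozen|was resumed) on database (?P<db>.+?)\. No user action"
)

def freeze_durations(lines):
    """Return (database, frozen_at, seconds_frozen) for each freeze/resume pair."""
    frozen_at = {}  # database name -> timestamp of the freeze still waiting for a resume
    results = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S.%f")
        db = m["db"]
        if m["event"] == "is frozen":
            frozen_at[db] = ts
        elif db in frozen_at:
            start = frozen_at.pop(db)
            results.append((db, start, (ts - start).total_seconds()))
    return results

# Made-up sample lines, not real log output.
sample = [
    "2018-01-04 02:00:01.10 spid55      I/O is frozen on database Sales. "
    "No user action is required. However, if I/O is not resumed promptly, "
    "you could cancel the backup.",
    "2018-01-04 02:01:02.10 spid55      I/O was resumed on database Sales. "
    "No user action is required.",
]

for db, start, seconds in freeze_durations(sample):
    print(f"{db}: frozen at {start}, for {seconds:.0f} seconds")
```

Sort the results by duration and the worst offenders float to the top. A freeze with no matching resume line is simply skipped here, which you may want to flag instead.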

Thanks for reading!

Brent says: MS KB 943471 is an oldie, but it notes that “We recommend that you create a snapshot backup of fewer than 35 databases at the same time.”
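
To see whether you're past that number, you could count databases per volume, since (as comes up in the comments) an app-aware snapshot of a volume freezes every database on it. Here's a rough Python sketch, assuming each database lives on a single volume and that you've already pulled the database-to-volume mapping from somewhere. The database names and drive letters are invented, and the threshold of 35 is just the figure from the KB note.

```python
from collections import Counter

# "Fewer than 35 databases" per the KB note quoted above; adjust to taste.
SNAPSHOT_DB_LIMIT = 35

def crowded_volumes(db_to_volume, limit=SNAPSHOT_DB_LIMIT):
    """Return {volume: count} for volumes holding `limit` or more databases."""
    counts = Counter(db_to_volume.values())
    return {volume: n for volume, n in counts.items() if n >= limit}

# Hypothetical layout: 45 client databases on D:, 10 reporting databases on E:.
layout = {f"client_{i:02d}": "D:" for i in range(45)}
layout.update({f"report_{i:02d}": "E:" for i in range(10)})

print(crowded_volumes(layout))
```

Any volume that shows up in the result is a candidate for splitting across more volumes, or for moving to native backups.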


34 Comments

Alex Friedman
January 4, 2018 8:40 am

Great point, just note that only write operations are frozen — reads go through normally.

By the way, what’s your recommended way to take backups of TB-sized databases in 24×7 shops?

Reply
- Erik Darling
  January 4, 2018 12:14 pm
  
  Alex — depends on the nature of the data and some other requirements, like data retention and RTO. With the details you’ve given, all I can say is “yes, have some.”
  
  Reply
  - Doug A
    April 2, 2019 7:53 am
    
    Nice ghostbusters reference.
    
    Reply
ti
January 4, 2018 11:18 am

AG and backup off inactive node

Reply
- Erik Darling
  January 4, 2018 12:15 pm
  
  ti — what if it’s async? what if sync falls behind? what if sync falls so far behind it gets secretly demoted to async?
  
  Reply
KenKellman
January 4, 2018 11:19 am

What I’ve also seen is that when these snapshots occur, features such as Availability Groups go haywire when the I/O gets frozen, and failovers can occur if you have automatic failover enabled.

Reply
- Erik Darling
  January 4, 2018 12:15 pm
  
  Ken — yep, the redo threads go buck wild on the other end.
  
  Reply
- alen teplitsky
  January 12, 2018 12:12 pm
  
  Interesting. My last place we never had AlwaysOn problems and backups went straight to tape. My new employer it’s the opposite
  
  Reply
Mike Fal
January 4, 2018 12:11 pm

Hey Erik, are you talking about application consistent VM backups using VSS or a database level backup using VSS? Both of these operations have very different performance implications.

In the case of an application consistent VM backup, the entire VM must be frozen in order to back up. This is typically what causes the issues/time of impact you describe in your post. However, a database specific VSS backup will only freeze the specific database being backed up and thus doesn’t have the same level of performance hit.

I think it’s important to understand the distinction, because if a vendor is performing an app consistent VM backup, that’s where you’re going to get the heaviest hit. Specific database backups using VSS still incur an I/O freeze, but because of much reduced freeze time, the impact is more equivalent to a native full.

Disclaimer: I work for a company that implements VSS level database backups. 🙂 Happy to discuss it in more detail. Hell, I really need to blog on it.

Reply
- Erik Darling
  January 4, 2018 12:16 pm
  
  Hi Mike — actually, both. I’ve seen some VSS tools taking database snaps pause activity for quite a while to quiesce and then write the data out. Looking forward to your blog posts!
  
  Reply
- Brian Boodman
  January 4, 2018 4:35 pm
  
  My *very untrustworthy* understanding is that if you’re taking VM backups, the freeze/thaw only applies during the requester snapshot, not during the VM snapshot. The VM snapshot might introduce IO delays, but these delays are normally transparent to the application. High level example flow:
  
  Backup
  1) Initialize DiskShadow. In Microsoft’s diagram, DiskShadow is the Requester. Note that Disk I/O may temporarily freeze – this corresponds to Microsoft’s diagram.
  2) Initialize VM snapshot (this is external to Microsoft’s diagram). Disk I/O might temporarily freeze, but any such freezing is at the VM level, not the application level.
  
  Restore
  1) Restore VM Snapshot
  2) Run DiskShadow Revert
  
  Effectively, this means taking an application snapshot (freezes IO), then taking a VM snapshot (which contains the application snapshot). Mind you, I think there’s still a risk that the VM snapshot will put the OS in an inconsistent state, despite the application itself being in a consistent state.
  
  I believe this is the approach Amazon is advocating at https://aws.amazon.com/blogs/compute/automating-the-creation-of-consistent-amazon-ebs-snapshots-with-amazon-ec2-systems-manager-part-2/ (note that this blog post is probably superseded by https://aws.amazon.com/blogs/mt/take-microsoft-vss-enabled-snapshots-using-amazon-ec2-systems-manager/ , but the latter is more turnkey and thus has less explanatory power).
  
  Reply
Vijayasaradhi Samavedam
January 4, 2018 12:19 pm

I have been noticing issues in my current workplace. The business wants to continue using Veeam backups for standalones and AAGs. The backups on the AAGs are causing failovers, and sometimes, Veeam is actually freezing both the nodes at the same time.

Reply
- fritz
  September 6, 2023 10:32 am
  
  Same here. Veeam Support has no solution, and isn’t even able to tell why snapshots are happening on all the cluster nodes.
  It’s a pity.
  
  Reply
Zen
January 4, 2018 4:16 pm

There is a better implementation recently in Google Persistent Snapshot, provided in their Google Cloud Engine. Basically, it is a lower-level driver that works with VSS. Google claims it can take snapshots online (with a very minimal freeze) for busy databases (SQL Server here). Has anyone given it a try? I have a 6TB data disk and will give it a try during busy hours. I will let you know the result.

Reply
Alex M
January 4, 2018 6:49 pm

Probably not a surprise (in hindsight anyway) but the VSS backups also break the backup chain if you’re taking regular SQL full/diff backups. We ran into this problem when we set up our DBA backups, then later on the VMware team implemented VMware server snapshots as well.

Reply
- Tibor Karaszi
  January 5, 2018 1:04 am
  
  If the snap backup is made with a copy-only option, then it won’t affect your SQL “traditional” backup chain. Veeam had that problem in earlier versions. The backup (snap) they did wasn’t done using copy_only, so we couldn’t implement differential SQL Server backups. Then came a new version of Veeam (7?, this was some 3-4 years ago) where the snap was made as a copy_only, and thanks to that it didn’t affect our SQL backups, allowing us to implement differential backups.
  
  Reply
Paul Asaro
January 5, 2018 1:23 pm

The ironic part of VSS Snapshots running is that disabling the SQL Writer service prevents the I/O freeze and thaw process, but turning off this service does not prevent the AG failover on virtual clusters. This has been documented by VMware, and the use of snapshots is not recommended or supported on virtual clusters.
https://kb.vmware.com/s/article/2147662?r=2&Quarterback.validateRoute=1&KM_Utility.getArticleData=1&KM_Utility.getGUser=1&KM_Utility.getArticleLanguage=1&KM_Utility.getArticle=1

Reply
- Mike Fal
  January 5, 2018 2:01 pm
  
  This is because the cluster failovers are due to the OS drive being frozen, not any of the databases. The SQL Writer is only used to manage database VSS backups, but other VSS backups can still occur. The reason VSS backups can cause a cluster failover is due to the OS drive being frozen for longer than the Health Check Timeout setting for the cluster, which will cause quorum failure.
  
  This can be addressed by increasing the Health Check Timeout in the cluster. Additionally, see if the VM backup can exclude VMDKs hosting database files. By minimizing the amount that needs to be frozen in a VM backup you can reduce the amount of time needed to freeze the I/O.
  
  Reply
  - Vincent Brengman
    January 5, 2018 2:28 pm
    
    Why not just disable quiescing when running a backup of OS or filesystem? If you’re running SQL backups with an agent, what is the purpose of an application consistent OS backup?
    
    Reply
    - Mike Fal
      January 5, 2018 3:29 pm
      
      So you can’t disable quiescing, that’s how VSS works. It *must* quiesce. To your point, though, if you’re taking SQL Server database backups natively, you may not need the VM backup. If you do need it, you should exclude the database files/volumes so there’s less to back up (and, thus, less to quiesce).
      
      Reply
Abhinaya Rao K
December 20, 2018 5:07 am

Hi Mr Brent,

Will VSS Snaps affect Log Shipping?
Because my log shipping is not working. When I look in backupmediafamily and backupset tables, VSS backup file has the first LSN value that matches the last LSN value of my trn file and this is causing the restore job to throw an error “The file is too recent to apply to the secondary database”. Please help.

Thanks and Regards,
Abhinaya Rao K

Reply
- Erik Darling
  December 20, 2018 7:27 am
  
  Floyd,
  
  For unrelated questions, head over to https://dba.stackexchange.com/questions
  
  Thanks!
  
  Reply
Yazeed
October 27, 2019 3:39 am

Need to clarify one point: only physical reads will be frozen, right? Logical reads and all writes are expected to continue as long as you have enough memory.

Reply
- Brent Ozar
  October 27, 2019 4:34 am
  
  Yazeed – just so I understand, you’re expecting that SQL Server will allow insert/update/delete transactions (writes) to continue even though they aren’t getting written to disk?
  
  Reply
Yazeed
October 27, 2019 5:02 am

Yes, in memory before written to disk

Reply
- Brent Ozar
  October 27, 2019 5:23 am
  
  Ah. No, that’s not correct. I’d start by reading up on write ahead logging:
  
  https://docs.microsoft.com/en-us/sql/relational-databases/sql-server-transaction-log-architecture-and-management-guide
  
  That’s part of how SQL Server achieves ACID. You wouldn’t have ACID if a committed transaction was only in memory, and then the server crashed, for example.
  
  Reply
  - Yazeed
    October 27, 2019 9:54 pm
    
    good read, thanks for the clarification.
    
    Reply
Ivan Saez Scheihing
October 23, 2020 5:38 am

Hi,

We have Veeam making backups of SQL Server instances (snapshots) and the process looks like this:

…
23-10-2020 00:02:37 :: Creating VM snapshot
…
23-10-2020 00:05:03 :: Hard disk 1 (40 GB) 2,1 GB read at 267 MB/s [CBT]
23-10-2020 00:05:03 :: Hard disk 2 (30 GB) 505 MB read at 234 MB/s [CBT]
23-10-2020 00:05:16 :: Removing VM snapshot 48 sec
23-10-2020 00:06:05 :: Finalizing
…

and what we see is that during the “Removing VM snapshot” phase the (VM) machine becomes unreachable for a relatively short period of time. Most of the time no issues happen, but when an application is accessing the database at that time we see errors like:

2020-10-23 00:05:53.723 +0200] [Error] Processing tasks failed: Liquit.DAL.DalUnexpectedException: Database error “Dal_Unexpected”:
A network-related or instance-specific error occurred while establishing a connection to SQL Server

I think it’s because of the quiescing of the vm. Our application admins are asking for HA db’s (Always on ) but reading the comments I don’t think it will solve the issue.
Any comments?

regards,

Ivan

Reply
- Brent Ozar
  October 23, 2020 5:40 am
  
  Ivan – unfortunately we don’t do Veeam support here. Your best bet would be to contact Veeam, or if you need to hire us for personal consulting, click Consulting at the top of the site.
  
  Reply
André
October 8, 2021 10:20 am

We are using the Ola Hallengren backup scripts on our Azure VMs. I have disabled the SQL VSS writer when I saw failing backups at times we were not running any backups. I saw these were caused by another process, probably a snapshot of the server. By disabling the SQL VSS writer service I want to prevent a gap in the backup chain, but now I’m wondering if I did the right thing...

Reply
Gary
March 12, 2022 6:55 pm

So best to limit the number of databases to 35. We have an MS SQL Express instance with 45 client databases, each specific to a customer. We take snapshots 4x per day, but sometimes existing connections are blocked or dropped.

Does anyone have a solution where a snapshot can loop through databases and process each database separately?

Reply
- Brent Ozar
  March 12, 2022 9:27 pm
  
  No, that’s not how Windows VSS snapshots work: whenever you take an app-aware snapshot of a volume, it freezes all IO for all databases on that volume, period, full stop.
  
  If you want to process databases individually, you’ll need to use native full backups.
  
  Reply
  - Bob
    June 9, 2022 9:58 pm
    
    Can I create multiple volumes, placing 35 DBs on each volume, as a workaround to this issue?
    
    Reply
    - Brent Ozar
      June 10, 2022 11:23 am
      
      Yep, absolutely.
      
      Reply