Production DBA

SQL Server Timeouts During Backups and CHECKDB

By Brent Ozar · October 5, 2016 · 19 comments

So you’re hosting your SQL Server in the cloud – say Amazon EC2, Azure VM, or Google Compute Engine – and you’ve noticed that when you’re running a backup or a DBCC CHECKDB, you suffer from extreme performance problems.

Queries run slow, and even worse, applications report timeout errors even just trying to connect to SQL Server. More symptoms can include database mirroring and cluster failovers. What’s going on?

To understand it, let’s look at a quick sketch of your infrastructure:

Your SQL Server virtual machine lives on a physical host, and it accesses your storage via the network – plain old Ethernet, the same Ethernet you’re using right now to surf our web site.

The advantage of Ethernet-connected storage is that it’s really, really cheap to build and manage.

The drawback of Ethernet-connected storage is that if your network connection isn’t really robust, then it’s really, really easy to saturate. 1Gb Ethernet maxes out at around 100MB/sec – for comparison, a single $250 1TB SSD pushes around 500MB/sec. During high-throughput activities like backups and corruption checking, your storage is more than capable of pouring tons of data into your SQL Server – thereby completely saturating your network connection.

It gets worse: in most cases, you’re not the only VM on a given host, so your cloud provider has to throttle your network throughput.

So your network connection matters – a lot.

Faster (and/or separate) networks are certainly available to you in the cloud – it’s just a matter of budget. For example, the excellent ec2instances.info lists all of the VM types at Amazon. It includes columns for Network Performance, plus a whole bunch of EBS columns that aren’t shown by default (click the Columns dropdown to see them):

The eye-opening column is EBS Optimized: Throughput – how much you can get from your storage in a best case scenario, typically streaming sequential reads like backups. (Don’t expect to get that from small random activities like OLTP database operations.)

Sorting by that column, here are your capabilities for the very largest VMs:

While 1250 MB/sec is good, that’s also expensive: those two instance types are $8/hour and $19/hour. Once you’re past those, the throughput simply plummets right down to that single $250 SSD we were discussing earlier.

So what’s an admin to do?

One option is to bypass the network entirely. Note how in the architecture sketch, local ephemeral solid state storage isn’t hooked up through the network at all. Ephemeral storage is blazin’ fast and super cheap (included with most SQL-Server-sized instance types these days).

There’s just one little drawback: it can disappear at any time.

So if you’re going to use that for user databases, you have to protect your instances using technologies like Always On Availability Groups or database mirroring. Those can get you automatic failover with no data loss (not guaranteed, though), at the cost of slower deletes/updates/inserts due to synchronous writes across multiple servers.

Or, uh, you could just skip backups and CHECKDB. I wish I was joking, but I’ve seen more and more folks simply opt to run with scissors rather than have timeouts. That’s a bummer.

The cloud: giving you new ways to save money and run with scissors.

Erik says: The Cloud: Like getting a haircut from your ex-girlfriend.

Free, 3× a week

Get my new posts by email

Three posts a week, plus a Monday roundup of the best database news from around the web.

19 comments

Chris

October 5, 2016 at 8:05 am

Seriously? Run without backups and CHECKDB? How can they call themselves DBAs? Our first job is to protect against loss of data.

Reply
1. Brent Ozar
  
  October 5, 2016 at 8:06 am
  
  You might be surprised. I had a discussion on Twitter a while back about when you fire the DBA, and I got some interesting answers:
  
  https://www.brentozar.com/archive/2014/03/fire-dba/
  
  Reply
  1. OZ Student since 2017
    
    October 6, 2016 at 4:56 am
    
    Wow, all the Buck Woody tweets in that blog post have become dead links, and the account seems to have been replaced by a blonde lady’s account? https://twitter.com/buckwoody/
    
    Reply
    1. Erik Darling
      
      October 6, 2016 at 7:59 am
      
      Well, no, Buck just went through a “change”, and I don’t mean she started working with Oracle.
      
      (He got a new Twitter account.)
      
      Reply
      1. OZ Student since 2017
        
        October 6, 2016 at 8:28 am
        
        Good thing I don’t rely on the auto-email this comments system sends (it comes through as a text email, but with lots of HTML code bulking it up in places).
        If I hadn’t clicked through to the webpage, I’d have missed your parenthesized postscript…
        
        (ya nearly got me, Eric ^_^)
Jason A

October 5, 2016 at 9:10 am

Running without backups?
I just threw up in my mouth a little…
I don’t have total control over my backups, by I certainly work to keep an eye on when they succeed and fail, then work with the backup admins to fix problems. I suppose with a cloud-based DB (mine are all on-prem) with the performance issues you described, a change in tactics for the backups could be called for. Maybe an Availability Group with backups being run on the secondary (not having set up an AG, and being too lazy at the moment to Google it, is this even an option?)

Reply
1. Jason A
  
  October 5, 2016 at 9:20 am
  
  So, yes, you can backup the Secondary, but I’d expect you’d also need to license it at that point, so it would be a business decision which would be the preferred option:
  A) No backups, possible total data loss
  B) Just an AG, possible performance impacts
  C) Backups of the AG secondary, performance impacts and cost, but less chance of data loss
  
  Reply
James Student since 2017

October 5, 2016 at 12:47 pm

Brent,

Do you have a checklist to test out SQL Server on VM performance?

Reply
1. Brent Ozar
  
  October 5, 2016 at 12:58 pm
  
  James – no, not something I can communicate quickly in a blog post comment, sorry.
  
  Reply
Brian

October 5, 2016 at 12:53 pm

The ephemeral drive will only be lost during a user-initiated machine shut-downs or a hardware failures. SQL Server restarts recreate TempDB, so stuffing TempDB on the ephemeral drive (and using options like sort_in_tempdb) is a straightforward way to boost io performance.

The ephemeral drive may require a reformat/mount after a restart (preventing SQL from initializing), but this is easy to compensate for.

Reply
1. Brent Ozar
  
  October 5, 2016 at 12:56 pm
  
  Brian – yep, and also create the necessary folders. This also means if you’re in a mirror, AG, replication, log shipping, etc, then you have to re-initialize those from scratch.
  
  Reply
Edgar Cayce

October 5, 2016 at 1:19 pm

I’m pretty happy with Azure SQL – but I’m running at P4 level with Geo Replication. No downtime in 2 years (touch wood).

Reply
1. Brent Ozar
  
  October 5, 2016 at 10:30 pm
  
  Edgar – this was about SQL Server, not Azure SQL Database. That’s a different product. (It’s a great product, too! I like it a lot. It’s just different than what we’re discussing here.) You don’t do backups to a network file share with Azure SQL DB.
  
  Reply
Andrew Hill

October 5, 2016 at 2:57 pm

One thing prep people overlook is that for a db, a general purpose ssd has 3 iops/gb. if you know you are going to be mb/s limited, just throw a 4000+gb general purpose drive at it for the max of 10k iops, which doesn’t need to be specifically provisioned. This also means that you have enough disk space that ‘do I have enough space for that index’ shouldn’t ever come up.
If your lucky enough to have a db with page compression (eg sql server enterprise) then compression =speed as you again mitigate that io bottleneck.
(I primarily work in the olap space, but the first point stands for oltp also)

Reply
1. Geoff
  
  October 6, 2016 at 9:21 pm
  
  Yeah I’ve been dealing with this recently. I’ve found I’ve had to be careful with online index builds/rebuilds saturating the throughput of the EBS volume hosting the transaction logs. I had to restrict the volume of data by setting MAXDOP of the index build down to 1-3.
  
  Also, while I don’t have any hard evidence, the IO throughput difference between an R3.4xlarge and an R3.8xlarge seems to be greater than 2. Maybe Amazon turns off the limiting altogether when you have the whole physical machine?
  
  Reply
Dkr

October 5, 2016 at 7:52 pm

What about having a second NIC for backup traffic?

Reply
1. Brent Ozar
  
  October 5, 2016 at 10:29 pm
  
  Dkr – in the cloud, no one can hear your requests for a second network adapter.
  
  Sorry, couldn’t resist the joke. Depending on your instance size, you can get LARGER network adapters, but you don’t get the ability to provision specific pieces of hardware to route specific network traffic. (You can do a lot of neat stuff with virtual networking, but that doesn’t get you additional physical ports.)
  
  Reply
Bruce

October 25, 2016 at 12:10 pm

Wouldn’t another option for additional IOPS be to RAID-0 EBS volumes?
http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/raid-config.html
The EBS drives already have some redundancy so it’s not like running on a single physical spindle.

Reply
1. Brian
  
  October 25, 2016 at 12:28 pm
  
  There are several problems with this:
  
  1. Please note that Brent’s article focuses on bandwidth rather than IOPS. Bandwidth is limited by instance type, not by EBS volume type. Increasing IOPS tends to be far cheaper than increasing bandwidth, since there’s databases with high IOPS requirements tend to require more disk space.
  
  2. Using more EBS volumes to surpass IOPS limits has diminishing returns, and increases average latency. Your drive performance is limited by the performance of the slowest drive. EBS drives tend to have less homogeneous performance than physical drives.
  
  3. Taking a drive snapshot without a reboot is incredibly dangerous, rather than very dangerous.
  
  Reply

SQL Server Timeouts During Backups and CHECKDB

Get my new posts by email

Keep digging

19 comments

Leave a comment Cancel reply