SQL Server Timeouts During Backups and CHECKDB

So you’re hosting your SQL Server in the cloud – say Amazon EC2, Azure VM, or Google Compute Engine – and you’ve noticed that when you’re running a backup or a DBCC CHECKDB, you suffer from extreme performance problems.

Queries run slowly, and worse, applications report timeout errors just trying to connect to SQL Server. Other symptoms include database mirroring failovers and cluster failovers. What’s going on?

To understand it, let’s look at a quick sketch of your infrastructure:

I told you it was a fast sketch

Your SQL Server virtual machine lives on a physical host, and it accesses your storage via the network – plain old Ethernet, the same Ethernet you’re using right now to surf our web site.

The advantage of Ethernet-connected storage is that it’s really, really cheap to build and manage.

The drawback of Ethernet-connected storage is that if your network connection isn’t really robust, then it’s really, really easy to saturate. 1Gb Ethernet maxes out at around 100MB/sec – for comparison, a single $250 1TB SSD pushes around 500MB/sec. During high-throughput activities like backups and corruption checking, your storage is more than capable of pouring tons of data into your SQL Server – thereby completely saturating your network connection.
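How fast are your backups actually running? SQL Server keeps the history in msdb, so a quick query – a rough sketch, nothing fancy – shows the MB/sec of your recent full backups:

-- Throughput of recent full backups, from SQL Server's own history tables.
SELECT TOP (20)
    bs.database_name,
    bs.backup_start_date,
    CAST(bs.backup_size / 1048576.0 AS DECIMAL(12, 1)) AS backup_mb,
    DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) AS duration_sec,
    CAST(bs.backup_size / 1048576.0
         / NULLIF(DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date), 0)
         AS DECIMAL(12, 1)) AS mb_per_sec
FROM msdb.dbo.backupset AS bs
WHERE bs.type = 'D'  -- 'D' = full database backups
ORDER BY bs.backup_start_date DESC;

If those numbers cluster right around 100MB/sec, your 1Gb network pipe is the prime suspect.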

It gets worse: in most cases, you’re not the only VM on a given host, so your cloud provider has to throttle your network throughput.

So your network connection matters – a lot.

Faster (and/or separate) networks are certainly available to you in the cloud – it’s just a matter of budget. For example, the excellent ec2instances.info lists all of the VM types at Amazon. It includes columns for Network Performance, plus a whole bunch of EBS columns that aren’t shown by default (click the Columns dropdown to see them):

EC2instances.info

The eye-opening column is EBS Optimized: Throughput – how much you can get from your storage in a best-case scenario, typically streaming sequential reads like backups. (Don’t expect that number from small random operations like OLTP database workloads.)

Sorting by that column, here are your capabilities for the very largest VMs:

EC2 instances by throughput

While 1250 MB/sec is good, it’s also expensive: those two instance types run $8/hour and $19/hour. Once you’re past those, throughput plummets right down to the level of that single $250 SSD we were discussing earlier.
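Want to check whether you’re actually hitting that ceiling while a backup or CHECKDB runs? One rough sketch: sample SQL Server’s file-level I/O counters twice, a few seconds apart (the 10-second window here is arbitrary):

-- Snapshot cumulative file I/O counters into a temp table.
SELECT database_id, file_id, num_of_bytes_read, num_of_bytes_written
INTO #before
FROM sys.dm_io_virtual_file_stats(NULL, NULL);

WAITFOR DELAY '00:00:10';

-- Compare against a second snapshot to get MB/sec per database file.
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.file_id,
       (vfs.num_of_bytes_read - b.num_of_bytes_read) / 1048576.0 / 10 AS read_mb_per_sec,
       (vfs.num_of_bytes_written - b.num_of_bytes_written) / 1048576.0 / 10 AS write_mb_per_sec
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN #before AS b
  ON vfs.database_id = b.database_id
 AND vfs.file_id = b.file_id
ORDER BY read_mb_per_sec DESC;

DROP TABLE #before;

If the read numbers during a backup hover right around your instance type’s advertised throughput, you’ve found your bottleneck.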

So what’s an admin to do?

One option is to bypass the network entirely. Note how in the architecture sketch, local ephemeral solid state storage isn’t hooked up through the network at all. Ephemeral storage is blazin’ fast and super cheap (included with most SQL-Server-sized instance types these days).

There’s just one little drawback: it can disappear at any time.

So if you’re going to use that for user databases, you have to protect your instances using technologies like Always On Availability Groups or database mirroring. Those can get you automatic failover with no data loss (not guaranteed, though), at the cost of slower deletes/updates/inserts due to synchronous writes across multiple servers.

Or, uh, you could just skip backups and CHECKDB. I wish I were joking, but I’ve seen more and more folks simply opt to run with scissors rather than have timeouts. That’s a bummer.

The cloud: giving you new ways to save money and run with scissors.

Erik says: The Cloud: Like getting a haircut from your ex-girlfriend.
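If you’d rather not run with scissors, there’s a middle ground between full-speed maintenance and none at all: make the maintenance move fewer bytes. Here’s a sketch – YourDatabase and the X: backup path are placeholders, and the knob values are starting points to test, not gospel:

-- Compression shrinks the bytes crossing the wire; a small BUFFERCOUNT
-- keeps the backup from monopolizing the network connection.
BACKUP DATABASE YourDatabase
TO DISK = N'X:\Backups\YourDatabase.bak'
WITH COMPRESSION,
     MAXTRANSFERSIZE = 1048576,
     BUFFERCOUNT = 4,
     STATS = 10;

-- PHYSICAL_ONLY skips the logical checks, so it reads less and finishes
-- faster - just run the full CHECKDB whenever you can afford it.
DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY;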


19 Comments

  • Seriously? Run without backups and CHECKDB? How can they call themselves DBAs? Our first job is to protect against loss of data.

  • Running without backups?
    I just threw up in my mouth a little…
    I don’t have total control over my backups, but I certainly work to keep an eye on when they succeed and fail, then work with the backup admins to fix problems. I suppose with a cloud-based DB (mine are all on-prem) with the performance issues you described, a change in tactics for the backups could be called for. Maybe an Availability Group with backups being run on the secondary? (Not having set up an AG, and being too lazy at the moment to Google it – is this even an option?)

    • So, yes, you can back up the secondary, but I’d expect you’d also need to license it at that point, so it would be a business decision which would be the preferred option:
      A) No backups, possible total data loss
      B) Just an AG, possible performance impacts
      C) Backups of the AG secondary, performance impacts and cost, but less chance of data loss
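      For what it’s worth, option C usually hinges on SQL Server’s backup-preference check, so the same job can be scheduled on every replica and only fire where you want it. A minimal sketch (YourDatabase and the path are placeholders):

      -- Returns 1 only on the replica the AG's backup preference points at.
      IF sys.fn_hadr_backup_is_preferred_replica(N'YourDatabase') = 1
      BEGIN
          -- Full backups taken on a secondary must be COPY_ONLY.
          BACKUP DATABASE YourDatabase
          TO DISK = N'X:\Backups\YourDatabase.bak'
          WITH COPY_ONLY, COMPRESSION;
      END;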

  • Brent,

    Do you have a checklist to test out SQL Server on VM performance?

  • The ephemeral drive will only be lost during a user-initiated machine shut-down or a hardware failure. SQL Server recreates TempDB on every restart, so stuffing TempDB on the ephemeral drive (and using options like SORT_IN_TEMPDB) is a straightforward way to boost I/O performance.

    The ephemeral drive may require a reformat/mount after a restart (which would keep SQL Server from starting), but this is easy to compensate for.
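    For anyone wanting to try it, the move is a one-time change that takes effect at the next restart – a sketch assuming the ephemeral drive is Z: and the default logical file names (both assumptions):

    -- Point TempDB's data and log files at the ephemeral drive.
    -- Nothing actually moves until the next SQL Server restart.
    ALTER DATABASE tempdb
    MODIFY FILE (NAME = tempdev, FILENAME = N'Z:\TempDB\tempdb.mdf');

    ALTER DATABASE tempdb
    MODIFY FILE (NAME = templog, FILENAME = N'Z:\TempDB\templog.ldf');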

    • Brian – yep, and also create the necessary folders. This also means if you’re in a mirror, AG, replication, log shipping, etc, then you have to re-initialize those from scratch.

  • I’m pretty happy with Azure SQL – but I’m running at P4 level with Geo Replication. No downtime in 2 years (touch wood).

    • Edgar – this was about SQL Server, not Azure SQL Database. That’s a different product. (It’s a great product, too! I like it a lot. It’s just different than what we’re discussing here.) You don’t do backups to a network file share with Azure SQL DB.

  • One thing people often overlook is that for a DB, a general-purpose SSD gives you 3 IOPS/GB. If you know you’re going to be MB/s-limited, just throw a 4,000+ GB general-purpose drive at it for the max of 10K IOPS, which doesn’t need to be specifically provisioned. This also means that you have enough disk space that “do I have enough space for that index?” shouldn’t ever come up.
    If you’re lucky enough to have a DB with page compression (e.g. SQL Server Enterprise), then compression = speed, as you again mitigate that I/O bottleneck – see the sketch below.
    (I primarily work in the OLAP space, but the first point stands for OLTP also.)
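    Turning page compression on for a single table looks like this – a sketch with a placeholder table name (page compression requires Enterprise Edition before SQL Server 2016 SP1):

    -- Rebuild the table with page compression: reads then pull compressed
    -- pages off disk, trading some CPU for a lot less I/O.
    ALTER TABLE dbo.YourBigTable
    REBUILD WITH (DATA_COMPRESSION = PAGE);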

    • Yeah, I’ve been dealing with this recently. I’ve found I’ve had to be careful with online index builds/rebuilds saturating the throughput of the EBS volume hosting the transaction logs. I had to restrict the volume of data by setting the index build’s MAXDOP down to 1-3 (see the sketch below).

      Also, while I don’t have any hard evidence, the I/O throughput difference between an R3.4xlarge and an R3.8xlarge seems to be greater than 2x. Maybe Amazon turns off the limiting altogether when you have the whole physical machine?
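      For reference, the capped rebuild looks like this – a sketch with placeholder index and table names:

      -- ONLINE = ON keeps the table usable during the rebuild;
      -- MAXDOP = 2 throttles how fast it can generate log writes.
      ALTER INDEX IX_YourIndex ON dbo.YourBigTable
      REBUILD WITH (ONLINE = ON, MAXDOP = 2);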

  • What about having a second NIC for backup traffic?

    • Dkr – in the cloud, no one can hear your requests for a second network adapter.

      Sorry, couldn’t resist the joke. Depending on your instance size, you can get LARGER network adapters, but you don’t get the ability to provision specific pieces of hardware to route specific network traffic. (You can do a lot of neat stuff with virtual networking, but that doesn’t get you additional physical ports.)

  • Wouldn’t another option for additional IOPS be to RAID-0 EBS volumes?
    http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/raid-config.html
    The EBS drives already have some redundancy so it’s not like running on a single physical spindle.

    • There are several problems with this:

      1. Please note that Brent’s article focuses on bandwidth rather than IOPS. Bandwidth is limited by instance type, not by EBS volume type. Increasing IOPS tends to be far cheaper than increasing bandwidth, since databases with high IOPS requirements tend to require more disk space anyway.

      2. Using more EBS volumes to surpass IOPS limits has diminishing returns and increases average latency: the stripe set’s performance is limited by its slowest volume, and EBS volumes tend to have less homogeneous performance than physical drives.

      3. With a stripe set, taking a drive snapshot without a reboot goes from very dangerous to incredibly dangerous.

