So you’re hosting your SQL Server in the cloud – say Amazon EC2, Azure VM, or Google Compute Engine – and you’ve noticed that when you’re running a backup or a DBCC CHECKDB, you suffer from extreme performance problems.
Queries run slow, and even worse, applications report timeout errors even just trying to connect to SQL Server. More symptoms can include database mirroring and cluster failovers. What’s going on?
To understand it, let’s look at a quick sketch of your infrastructure:
Your SQL Server virtual machine lives on a physical host, and it accesses your storage via the network – plain old Ethernet, the same Ethernet you’re using right now to surf our web site.
The advantage of Ethernet-connected storage is that it’s really, really cheap to build and manage.
The drawback of Ethernet-connected storage is that if your network connection isn’t really robust, then it’s really, really easy to saturate. 1Gb Ethernet maxes out at around 100MB/sec – for comparison, a single $250 1TB SSD pushes around 500MB/sec. During high-throughput activities like backups and corruption checking, your storage is more than capable of pouring tons of data into your SQL Server – thereby completely saturating your network connection.
It gets worse: in most cases, you’re not the only VM on a given host, so your cloud provider has to throttle your network throughput.
So your network connection matters – a lot.
Faster (and/or separate) networks are certainly available to you in the cloud – it’s just a matter of budget. For example, the excellent ec2instances.info lists all of the VM types at Amazon. It includes columns for Network Performance, plus a whole bunch of EBS columns that aren’t shown by default (click the Columns dropdown to see them):
The eye-opening column is EBS Optimized: Throughput – how much you can get from your storage in a best case scenario, typically streaming sequential reads like backups. (Don’t expect to get that from small random activities like OLTP database operations.)
Sorting by that column, here are your capabilities for the very largest VMs:
While 1250 MB/sec is good, that’s also expensive: those two instance types are $8/hour and $19/hour. Once you’re past those, the throughput simply plummets right down to that single $250 SSD we were discussing earlier.
So what’s an admin to do?
One option is to bypass the network entirely. Note how in the architecture sketch, local ephemeral solid state storage isn’t hooked up through the network at all. Ephemeral storage is blazin’ fast and super cheap (included with most SQL-Server-sized instance types these days).
There’s just one little drawback: it can disappear at any time.
So if you’re going to use that for user databases, you have to protect your instances using technologies like Always On Availability Groups or database mirroring. Those can get you automatic failover with no data loss (not guaranteed, though), at the cost of slower deletes/updates/inserts due to synchronous writes across multiple servers.
Or, uh, you could just skip backups and CHECKDB. I wish I was joking, but I’ve seen more and more folks simply opt to run with scissors rather than have timeouts. That’s a bummer.
The cloud: giving you new ways to save money and run with scissors.
Erik says: The Cloud: Like getting a haircut from your ex-girlfriend.