As you pay more for Business Critical Azure SQL DB servers, they’re supposed to get more storage throughput. The documentation on this is kinda hard to read, but boiling it down: for every core you add to a Gen5 server, you’re supposed to get another 2,500 IOPS. That should scale linearly: insert speed should go up evenly with each added core.
The bottom line of this screenshot shows IOPS scaling linearly, from 40,000 at 16 cores up to 200,000 at 80 cores:
But can an 80-core server write any faster than a 16-core server? Well…no.
Sounds like a trick question, doesn’t it?
Let’s start with the Stack Overflow database, which has a 104GB Posts table – 18GB of which is off-row text data because Lord knows y’all like to write some long questions and answers. Then we’ll run this simple query to copy that Posts table to a new one:
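The query itself is simple enough. A minimal sketch, assuming the standard Stack Overflow schema and a hypothetical target table name of `Posts_Copy`:

```sql
-- Copy the entire Posts table into a new heap. Every row (including
-- the off-row text data) has to be written through the transaction log,
-- so this is a log-throughput stress test as much as anything else.
SELECT *
INTO dbo.Posts_Copy   -- hypothetical target table name
FROM dbo.Posts;
```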
The query will go parallel, using more cores for the reading & writing – so will more cores help? Plus, when you move from an 8-core Azure SQL DB to an 80-core one, you get more memory, and theoretically, more storage throughput.
Drum roll please:
- 8 cores, $1,991 per month: 64 minutes
- 16 cores, $3,555 per month: 32 minutes (and interestingly, it’s the same speed with zone redundancy enabled)
- 80 cores, $18,299 per month: 32 minutes
- Just for reference: 8-core AWS EC2 i3.2xl VM, $1,424 per month with SQL Server Standard Edition licensing: 2 minutes (and I don’t put that in to tout AWS, I just happen to have most of my lab VMs there, so it was a quick comparison)
You can spend five times more, but you hit a wall at 16 cores.
So why don’t more cores equal dramatically more speed? Your first thought is probably, “Well, Brent, that query isn’t CPU-bottlenecked” – and you’re only partially right, but also pretty wrong. Here’s the Azure Portal showing performance metrics on the 8-core setup while the insert is running:
The very light-colored line banging up against the top at 100% is the transaction log IO percentage. Our workload is simply limited by how much Azure is willing to write to disk at any one time. The CPU percentage is 6%, and the read IO is 7%. Even with only 8 cores, we’re paying for CPUs that are sitting around bored.
Upgrade the server from 8 cores to 16 cores, and the load speed doubles – which is great! So what’s our bottleneck now? Here are the load metrics on the 16-core box:
We’re still transaction-log-file-bottlenecked at 100%, and the only way to get more transaction log throughput is to buy more cores. So throw 80 cores at it, and:
The only thing that changes is that CPU utilization drops – storage throughput doesn’t get any faster. Our workload completes in the same 32 minutes.
This also leads to some interesting MAXDOP behavior. On all of the servers, restricting the query to MAXDOP 1, 2, 4, or any other number didn’t seem to affect the query’s runtime at all. The storage limit is the storage limit, full stop.
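For reference, the degree-of-parallelism cap was applied as a query hint – something like this, again using a hypothetical `Posts_Copy` target:

```sql
-- Forcing the copy to run single-threaded. Per the tests above,
-- MAXDOP 1, 2, 4, etc. all finish in the same time, because the
-- transaction log throughput limit is the bottleneck, not CPU.
SELECT *
INTO dbo.Posts_Copy
FROM dbo.Posts
OPTION (MAXDOP 1);
```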
The takeaway: bottlenecked on transaction log IO? Right now, over 16 cores is a waste of money.
Just because sys.dm_db_resource_stats shows that you’re hitting a resource limit, and the documentation says that the limit will go up linearly as you add more cores, don’t trust that it will. You’re going to have to experiment with your workload and different Azure SQL DB sizes in order to find the sweet spot for how much you pay for how fast your query will go. Trust, but verify.
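If you want to check which limit you’re actually hitting, `sys.dm_db_resource_stats` keeps 15-second samples of resource consumption for roughly the last hour. A quick look at the most recent samples might go something like:

```sql
-- Recent resource consumption, expressed as a percentage of this
-- service tier's limits. If avg_log_write_percent is pinned at 100,
-- transaction log throughput is your wall, not CPU or data IO.
SELECT TOP 20
       end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
```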
You might be spending more money, but hitting the same limits a 16-core server is hitting!
The old-school DBAs out there will say, “Yes, but if the transaction log file is your limit, you shouldn’t add more CPUs – you should be adding storage.” That’s the problem: in Azure SQL DB, you simply can’t. The documentation says you can by adding more CPUs, but as of this writing, it just isn’t true. I don’t have any inside information here, but there seems to be a problem with Azure SQL DB’s storage scalability. Either there’s a bug in the storage limits that were set, or the documentation isn’t right, because the throughput simply isn’t rising once you get past 16 cores.
Update 2019/02/26 – clarified that we’re talking about the transaction log file, not the error log file, per the discussion on Reddit.