Update March 19: Microsoft has since acknowledged a hidden limit, then documented it, then raised it – but it’s still disappointingly slow.
In my last post, I explored how fast a $5,436/mo Azure SQL DB Hyperscale could load data. I’d had a client who was curious about spinning their dev environment up there to see how query plans might look different. Well, as long as I was running this test, I thought – “How does this compare with Azure SQL DB?”
There are 3 different products with similar names: Azure SQL DB, Managed Instances, and Hyperscale. In this post, I’m specifically talking about Azure SQL DB – not Managed Instances or Hyperscale. Managed Instances have a way easier method to ingest data – simply restore a database backup – so that one isn’t really worth exploring. There are a bunch of similar restore-a-backup feature requests to make restores easier in Azure SQL DB, but they don’t exist today, so we’re stuck with the same options we had with Hyperscale:
- Offline: use the Data Migration Assistant (also, useful white paper)
- Offline: generate a BACPAC, and then import the BACPAC in Azure SQL DB
- Online: set up transactional replication to keep the two in sync, and then cut over
- Offline (for a SQL Server source): Azure Database Migration Service
To do this quickly and easily, I went with the first option (as I did in the last post with Hyperscale.)
Attempt #1: 80 cores, $20,088 per month.
The Data Migration Assistant and its documentation strongly suggest that you shouldn’t start small: you should over-provision your Azure SQL DB during the migration process:
That seems fair – a P15 is 4,000 DTUs and up to 4TB storage. I was using the new Business Critical vCore model, though, so I maxed it out at 80 cores, 408GB RAM, $20,088 per month:
Note that it says up to 200,000 IOPs, 1-2ms latency – that sounds like a pretty good target for data ingestion. Nothing in the wizard (or the documentation that I could find) seemed to tie data size to storage throughput – which is surprising, because I’ve seen that kind of relationship all over the cloud – but since it wasn’t mentioned, I just left it at 400GB (my database is 340GB.)
I fired up the Data Migration Assistant and let ‘er rip:
You can’t read anything into the time numbers on this screenshot compared to the last post: the Data Migration Assistant doesn’t load tables in the same predictable order every time, and each time shown is just the point in the overall load at which that table finished.
Why couldn’t it go faster? sp_BlitzFirst makes it easy to find out:
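(For reference, a typical way to take that kind of sample – the parameter values here are just an example, not a requirement:)

/* Take a 60-second sample of waits, file activity, and Perfmon counters */
EXEC dbo.sp_BlitzFirst @Seconds = 60, @ExpertMode = 1;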
Note the “Database is Maxed Out” lines at the bottom, and in the details at the far right, we were bumping up against the max allowed log write percent. Ouch. The Azure portal showed writes maxing out too:
Well, that was disappointing. (Note that in these screenshots, you’ll see server & database names switch around – that’s because I tried these tests with several different servers and instances in the hopes of getting past the log throughput limits.)
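(Side note: you don’t need the portal to see those percentages. Azure SQL DB exposes them in sys.dm_db_resource_stats, which keeps roughly an hour of 15-second samples – something like this shows the recent log write numbers:)

/* Recent resource usage for this database, including log write percent */
SELECT TOP (20)
       end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;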
Attempt #2: 80 cores, 4TB storage, $21,468 per month
In the Azure portal provisioning wizard, I only had one remaining slider, and I cranked it as far to the right as it would go, upsizing my storage to 4TB (even though I only needed 340GB for the database) – hoping it would upsize my log rate throughput:
One of the awesome things about the cloud is that you can make these kinds of changes on the fly, and sure enough, it took less than 3 minutes for Azure to report that my storage upsize was complete.
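(You don’t have to use the portal slider for that, either – T-SQL works too. A sketch, assuming a database named MigrationTarget and a tier that allows 4TB:)

/* Bump max storage to 4TB – the values you can pass depend on your service tier */
ALTER DATABASE [MigrationTarget] MODIFY (MAXSIZE = 4096 GB);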
Except I was still banging up against the log rate governor:
There are several interesting things to note in that screenshot:
- At the time of this screenshot, queries are waiting on WRITELOG (but you can’t rely on point-in-time checks of individual queries)
- I’m hitting my DTU limits, and I’m getting poison LOG_RATE_GOVERNOR waits
- I’m averaging 96.85% of my allowed log throughput – not peaking at, but averaging
- Log backups are happening at the same time, which is probably also affecting my log rate throughput, but I can’t blame Microsoft for taking log backups
- In a 60-second sample, 4GB of data was read from the logs (most likely the log backups)
- In that same sample, 1.6GB was written to the data file and 1.7GB to the log file, and the log file writes averaged a nice and tidy zero milliseconds (very nice) – but in 60 seconds, that’s only 28 megabytes of data per second to the log file. That’s USB thumb drive territory.
Eventually, it finished in 3 hours, 52 minutes – remarkably similar to the Hyperscale speeds:
And at first you’re probably saying, “Well, Brent, that means you’re bottlenecked by the source. Both Hyperscale and Azure SQL DB had the same ingestion speeds, and they have something in common: your data source.” That’s what I thought too, so…
Attempt #3: synthetic local loads
Just in case there was a problem with something with my Azure SQL DB instance, I blew it away and started again from scratch. I also took the Data Migration Assistant and networking completely out of the equation. I created a new 80-core Azure SQL DB, maxed out on storage, and loaded a table with junk contents from sys.messages – this way, I didn’t have to worry about any cross-server communication at all:
SELECT m.*
INTO dbo.LoadTest   /* illustrative table name */
FROM sys.messages m
CROSS JOIN sys.all_columns ac;
Waits while that happens:
We’re not hitting the log rate governor, but we might as well be. In any given 60-second sample, we waited for hundreds of log file growths, and we managed to write less than 2GB to the log file. 30MB/sec of write throughput is simply terrible – even allowing for the time spent growing out a 2GB log file, that’s unacceptable for $21,468 per month. You really could do better with a USB thumb drive.
To really stress things out, I started 8 sessions all running that same insert query (with different table names.) I started hitting the log rate governor again, while still only writing about 1.5GB per minute:
And then all of a sudden, my wait stats dropped to nothing! Things were flying, because…uh…wait a minute…
Huh. All my queries had crashed at once. That’s not good, and I’d dig deeper to do a repro query, but…the beach is calling.
They left the database in a bit of an odd state, too – the tables were there, with pages allocated, but no rows, indicating they’d all rolled back:
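(You can see that state with a quick catalog query – a sketch that compares allocated pages to row counts for user tables:)

/* Pages allocated but zero rows = the loads rolled back */
SELECT o.name,
       SUM(ps.reserved_page_count) AS reserved_pages,
       SUM(ps.row_count) AS total_rows
FROM sys.dm_db_partition_stats ps
INNER JOIN sys.objects o ON o.object_id = ps.object_id
WHERE o.is_ms_shipped = 0
GROUP BY o.name;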
Summary: Azure SQL DB’s log throughput seems kinda slow for $21,468 per month.
You measure storage with 3 numbers:
- How many operations you can perform per second (often abbreviated as IOPs)
- The size of each operation (operations * size = total throughput, usually measured in MB/sec or GB/sec)
- Latency of each operation (how long individual operations take to complete)
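For example, in the 60-second sample earlier, about 1.7GB went to the log file – that’s where the 28MB/sec throughput number comes from, regardless of how many individual write operations it took or how fast each one returned.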
The latency was great – fast and predictable – but the log rate governor is capping throughput. It means you can’t do too many things at once – but boy, those few things you can do, they sure are fast. Having reliably low-latency operations is great for day-to-day transactional workloads that only do a few writes per user transaction, but…for data ingestion, it makes for slow going.
For imports, I really wish Azure SQL DB would have a mode to let you temporarily optimize for ingestion throughput rather than low latency. I don’t need sub-millisecond latency when I’m dealing with insert commands coming from another machine (especially less-than-optimized commands like the ones from Data Migration Assistant, which even ran into blocking problems during my loads, blocking themselves across threads.)
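(If you want to see whether your own migration is tripping over itself the same way, a quick look at sys.dm_exec_requests shows who’s blocking whom – just a sketch:)

/* Sessions that are currently blocked, and the session blocking them */
SELECT r.session_id, r.blocking_session_id, r.wait_type, r.wait_time, r.command
FROM sys.dm_exec_requests r
WHERE r.blocking_session_id <> 0;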
Have any of you migrated a sizable database (at least 100GB) into Azure SQL DB? If so, did you measure how long it took, like how many GB per hour you were able to import? Any tricks you found to work around the log rate governor?
Update March 9: Microsoft acknowledged & documented the limit.
Microsoft contacted me, agreed that there’s a limit, and that it wasn’t documented for the public. They’ve now updated a few pages:
- Single database limits – now shows the transaction log is limited to 20 MB/sec for General Purpose, and 48 MB/sec for Business Critical.
- Those limits are reached at 8 cores for GP and BC Gen4, and 16 cores for BC Gen 5.
- Log limit workarounds – like loading data into TempDB or using a clustered columnstore index for compression (see the sketch below).
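The columnstore workaround works because compressed pages mean fewer bytes hitting the log rate governor – a minimal sketch, with an illustrative table name:

/* Compressed data = less written to the transaction log during loads */
CREATE CLUSTERED COLUMNSTORE INDEX CCI_LoadTest ON dbo.LoadTest;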
Ouch. 48 MB/sec really is bad, especially at $21,468 per month. This is such a great example of why you need to understand the limitations of a product before you make that leap.
Update March 19: Microsoft raised the limit.
Within a few days of documenting the limit, Microsoft must have realized that it was just set entirely too low, and quietly raised it:
(Tweet from Robin Sue, @Suchtiman, March 19, 2019.)
It’s twice as high now – 96 MB/sec – but…still uncompetitive with modern USB 3.0 thumb drives. Good to see that it’s moving in the right direction.