In our last episode, we found out what your server was waiting on. In the closing remarks, I mentioned that you should find out which queries are causing the problem, and focus your mitigation efforts there. You might have circled back to the episode where we covered sp_BlitzCache, used a @SortOrder parameter that lines up with your top wait type, and…
Now you’re frustrated.
Because you’re looking at all these bad queries and thinking, “Isn’t there an easier way? Can’t I just throw hardware at this?”
Scaling out reads with replicas:
splitting queries between servers
When you want to copy the exact same production data to a second server to offload reporting queries, some of the costs are immediately obvious:
- Hardware and storage – even if you’re running it in a virtual machine, you need to account for the costs of say, 4 cores and 32GB RAM. Not only will you need storage for the databases, but you’ll also need to decide whether this server gets backed up, and copied to a disaster recovery data center.
- Software licensing – Standard Edition is ~$2k per core, and Enterprise Edition is ~$7k per core. Toss in Windows (especially now that it’s licensed per-core), your management/backup/antivirus tools, and your monitoring software.
- Project planning – you’ll need to design how to get the data from production to the reporting server, like with Always On Availability Groups, log shipping, or transactional replication.
- App modifications – the app running reporting queries will need a new connection string. Even with Always On Availability Groups, reads aren’t automatically offloaded to readable replicas – you have to use the connection string parameter ApplicationIntent = ReadOnly to tell SQL Server that you promise not to try to write anything. If you have a single app that does both reads and writes, and you only want to offload some of the queries, you’ll need to go through the code to switch those queries over to the new connection string.
But some of the costs come as cruel surprises:
- Adding a troubleshooting process – sooner or later, the data replication process will break. Depending on the method (AGs, log shipping, replication) and failure type, it’ll fail in different ways – maybe all of the data is old, maybe just some of it is, or maybe the reports aren’t accessible at all. You’ll want to list out the failure methods and explain what symptoms will look like. This helps business users recognize when their reports are wrong, and react appropriately. If you don’t do this step, then after the first failure, people are just always going to expect that there’s a bug in the report data.
- Prepare for failure – for each of those failure methods, decide how you’re going to react. For example, if AG replication breaks and reports are out of date, will you point reports at the primary until the problem is resolved, or will users just have to deal with unavailable reports while you troubleshoot or resync the replicas? If you don’t do this step, then you’re going to be winging it every time, and you’ll look unprepared while reports are wrong or down.
- Set realistic expectations for RPO and RTO – based on your process and preparation, make sure the business users understand how long their reports will be down when things break.
- Measure the overhead of replication – AGs and transactional replication can add performance slowdowns beyond what the reports used to cost. For example, if you were only running a few reports an hour, and only hitting a subset of the data, then suddenly replicating every individual delete/update/insert operation can have a huge overhead.
- Add monitoring – you need to start monitoring how far behind the reporting server is, and how performance is doing on both. Performance troubleshooting becomes a lot harder, too – for example, when you’re doing index tuning, you have to combine data across both the primary and the reporting servers in order to find the right mix of indexes across the board.
And at the end of all this, all you’ve got is a way to split out the read-only queries. Those might not even be your biggest problem – you might need to split up writes between servers, and that gets REALLY hard.
Scaling out writes with sharding:
splitting data between servers
Sharding is the technique of splitting your data between multiple shards. You have the same exact database structure on each server, but you split up customers into multiple shards – for example, maybe the US customers are on one shard, and the European customers are on another shard.
Sharding gets trickier when you have to keep data in sync between shards. For example, you might run a web site, and you want to keep all of the product and pricing data exactly the same across shards. You may also need to combine data back from all of the shards into one, like for reporting purposes.
Other techniques for splitting load across servers involve replication with multiple masters – multiple servers that can accept writes at any given time. If you’re interested in that topic, check out Many Masters, One Truth. (Spoiler: it’s a lot harder than it sounds.)
But the real bottom line with sharding, and the reason implementation is so hard, is that your application has to know which shard to query. If you’re connecting all of the apps to a single SQL Server and routing requests from there, you’re right back where you started: bottlenecked on a single SQL Server.
This whole post was a trick.
None of the scale methods are really easy. Whether you’re convincing the boss to buy more memory, taking an outage to upgrade SQL Server, or rewriting the app’s back end to accommodate sharding, you’re going to be putting in serious work.
That’s why I put so much emphasis on tuning indexes and queries. It’s hard to make your server 100x faster – but it’s really easy to make your database design and queries suddenly 100x more efficient. Fixing queries and tables and indexes is indeed hard work – but it’s totally worth it. The better you get at these techniques, the farther you can go in your career.