Time for a thought exercise.
You’ve got a database application that has bursty and unpredictable loads. Out of nowhere, you’ll suddenly get socked with a large amount of SELECT queries. Due to the way the app is written, you can’t cache the query results – the queries keep changing, and the business wants data from within the last couple of minutes.
In the past, you’d have used replication or AlwaysOn Availability Groups to build multiple read-only reporting servers, but that means big SQL Servers that sit around idle most of the time. You have to size them for the bursts in load. You could probably get by with a single 2-socket server most of the time, but when the loads come in, you need an 8-socket server to handle all these reads.
But what if you combined virtualization with SQL Server’s ability to add AG replicas on the fly? Think Amazon’s Auto Scaling, but on-premise. The script logic would go something like this:
- If your current readable replicas are experiencing bottlenecks that could be fixed by adding additional replicas,
- And the load lasts more than X minutes/hours (set by the business’s goals)
- Start up a new virtual machine (ideally a template with SQL Server already installed)
- Add it to the cluster
- Restore the most recent full and transaction log backups to it (very easy if you’re already backing up to a network share, and would require no additional load on the current replicas)
- Join it to the Availability Group
- Add it to the read-only routing list
And presto, you’ve got more power. You can also use the same type of logic to tear down replicas you don’t need.
If you wanted to get even fancier and more proactive, rather than adding an all-new server to the Availability Group, you could have the script shut down one of the existing replicas, add more vCPUs and/or memory, and start it back up again. That way you could add more power without having to deal with backups and restores, but on the downside, this means temporarily taking away power from the existing AG.
You can even do this without affecting end users. Take the replica out of the read-only routing list, wait for the last query to finish, and then start the maintenance work on that replica.
It would only make sense at companies where:
- The app’s config string used the ApplicationIntent=ReadOnly parameter for read-only queries, thereby letting us move those to read-only replicas
- The write loads can still be handled by a single server
- The script could finish in time to handle the added load (for example, this isn’t going to work with 5TB databases on slow storage)
- There’s an incentive to get rid of unneeded replicas (because some companies are just okay running lots of unneeded servers to handle peak loads)
- Licensing is done at the virtual host layer, not at the guest layer (which rules out Amazon EC2 and Azure VMs, neither of which would be cost-effective here)
If I was going to write something like this, I’d do it in PowerShell because I’d want one language that interfaces well with my monitoring software, VMware/Hyper-V, Windows clustering, and SQL Server. Alas, I’m not doing it anytime soon – I’ve gone through this thought exercise with a couple of clients, and in both cases, the answer has been, “Screw it, we’ll do it manually instead.” It sure is a fun thought exercise, though.
It does sound like a lot of work, but the slick part is that once the scripts are built and tested, you can leverage it to auto-scale read load for any of your AG-protected databases.
And, uh, open source it, will you? Kthxbai.