Auto-Scaling SQL Server Always On Availability Groups with Virtualization

Last Updated February 13, 2017

Always On Availability Groups, Amazon Web Services (AWS), Virtualization

Time for a thought exercise.

You’ve got a database application that has bursty and unpredictable loads. Out of nowhere, you’ll suddenly get socked with a large amount of SELECT queries. Due to the way the app is written, you can’t cache the query results – the queries keep changing, and the business wants data from within the last couple of minutes.

In the past, you’d have used replication or AlwaysOn Availability Groups to build multiple read-only reporting servers, but that means big SQL Servers that sit around idle most of the time. You have to size them for the bursts in load. You could probably get by with a single 2-socket server most of the time, but when the loads come in, you need an 8-socket server to handle all these reads.

But what if you combined virtualization with SQL Server’s ability to add AG replicas on the fly? Think Amazon’s Auto Scaling, but on-premise. The script logic would go something like this:

If your current readable replicas are experiencing bottlenecks that could be fixed by adding additional replicas,
And the load lasts more than X minutes/hours (set by the business’s goals)
Start up a new virtual machine (ideally a template with SQL Server already installed)
Add it to the cluster
Restore the most recent full and transaction log backups to it (very easy if you’re already backing up to a network share, and would require no additional load on the current replicas)
Join it to the Availability Group
Add it to the read-only routing list

And presto, you’ve got more power. You can also use the same type of logic to tear down replicas you don’t need.

If you wanted to get even fancier and more proactive, rather than adding an all-new server to the Availability Group, you could have the script shut down one of the existing replicas, add more vCPUs and/or memory, and start it back up again. That way you could add more power without having to deal with backups and restores, but on the downside, this means temporarily taking away power from the existing AG.

You can even do this without affecting end users. Take the replica out of the read-only routing list, wait for the last query to finish, and then start the maintenance work on that replica.

It would only make sense at companies where:

The app’s config string used the ApplicationIntent=ReadOnly parameter for read-only queries, thereby letting us move those to read-only replicas
The write loads can still be handled by a single server
The script could finish in time to handle the added load (for example, this isn’t going to work with 5TB databases on slow storage)
There’s an incentive to get rid of unneeded replicas (because some companies are just okay running lots of unneeded servers to handle peak loads)
Licensing is done at the virtual host layer, not at the guest layer (which rules out Amazon EC2 and Azure VMs, neither of which would be cost-effective here)

If I was going to write something like this, I’d do it in PowerShell because I’d want one language that interfaces well with my monitoring software, VMware/Hyper-V, Windows clustering, and SQL Server. Alas, I’m not doing it anytime soon – I’ve gone through this thought exercise with a couple of clients, and in both cases, the answer has been, “Screw it, we’ll do it manually instead.” It sure is a fun thought exercise, though.

It does sound like a lot of work, but the slick part is that once the scripts are built and tested, you can leverage it to auto-scale read load for any of your AG-protected databases.

And, uh, open source it, will you? Kthxbai.

RECOMPILE Hints and Execution Plan Caching

How Would You Change Windows Azure SQL Database?

8 Comments. Leave new

Mike Lewis
December 18, 2013 9:02 am

I’d be interested to hear your thoughts on the uses for this approach. My mind keeps wandering to the fact that if a company has meaty enough hardware to run the VMs and cope with bursts, why not just set them up and leave them running? In theory, you need to keep the host freed up enough to allow the scale-up, so you couldn’t really use it for anything else (except maybe temporary workloads here and there).

Reply
- Brent Ozar
  December 19, 2013 7:02 am
  
  Mike – Like I said in the opening sentence, it’s a thought exercise. 😀
  
  Reply
  - Mike Lewis
    December 19, 2013 8:06 am
    
    I bet this is how Azure and AWS started. A whimsy.
    
    Reply
Ishtiaq Ahmed
December 18, 2013 2:16 pm

At one of the contract engagement i had the opportunity to write these kinds of power shell scripts to build and deploy over 300 servers which saved alot of time .

VMSnapshot–> Silent Installation of SQL Server –> Backup and restore–> Creation of AG–>Combining all the AG–> Moving old legacy SQL Server to 2012.

It was fun to run those powershell scripts without much effort

The only problem i faced in building those powershell scripts which took alot of contracting hours.

Reply
- Brent Ozar
  December 19, 2013 7:01 am
  
  Ishtiaq – yep, that’s a fairly common problem. It takes a lot of work to build out these kinds of scripts discussed in the blog post, so it has to be really useful for the business before they invest that kind of money.
  
  Reply
Khanh
September 21, 2014 4:50 pm

Has anyone considered configuring a scenario where Server1 would have SQL Instance 1 running as primary and SQL Instance 2 running as a secondary while Server2 would be configured the exact opposite? I’ve tested this in our environment and it seems to be running fine. Obviously I’m not concerned with the workload yet as I have not fully tested what type of performance I’d get once Server1 basically becomes primary for both instance. I just wanted to add this configuration scenario since it seems a lot of DBAs are concerned that they would need two beefy servers but only one would really be utilized at any given time.

Reply
- Brent Ozar
  September 22, 2014 9:29 am
  
  Khanh – you can accomplish that same goal by using multiple Availability Groups rather than multiple instances. For more details, watch my AlwaysOn introduction video: https://www.youtube.com/watch?v=-jxQERXPM9s
  
  Reply