If you’ve ever wanted to use the cloud for a DR site, there’s good news: as of this week you can use AlwaysOn Availability Groups and set up an asynchronous replica of your data into Amazon Web Services (AWS). If you’re already in the cloud, or if you were holding off, this means that you can run multiple SQL Servers in several Availability Zones and Regions to scale out reads or survive outages.
What Can We Do With Availability Groups in AWS?
It’s possible to create an Availability Group and take advantage of almost all of the AlwaysOn Availability Group functionality. Readable replicas can be easily configured and with a few additional steps it’s possible to get the Availability Group Listener working. The listener is the part of the Availability Group that will send read requests to the readable secondaries. You really want this working – without the listener it’s up to you or your networking team to create and maintain load balancers and perform some kind of complex DNS dance to send writes to the correct SQL Server. That doesn’t sound like anyone’s idea of a good time.
The other piece of functionality that’s now available is manual failover. Failover didn’t work previously – there were limitations to how AWS handled networking and virtual IPs. While many of those issues have been resolved, a few still remain. Manual failovers work… with some caveats (more on that in a minute).
What Doesn’t Work With Availability Groups in AWS?
Automatic failover doesn’t work: Amazon’s networking stack doesn’t support it. But that’s okay, and here’s why – automatic failover in SQL Server 2012 AlwaysOn Availability Groups requires synchronous commit. This can introduce additional latency and we’re interested in keeping things fast. Instead use asynchronous replication and your SQL Servers will be within a few minutes, if not milliseconds, of each other.
What Happens When It’s Time To Failover?
When the time comes to failover, you’re going to have to do it manually. When the primary server, or even Availability Zone, goes down you’ll need to run three commands. If you’re failing over across multiple Availability Zones or regions, this is really simple. Connect to the new primary node in SSMS and run:
ALTER AVAILABILITY GROUP MyTestAG FORCE_FAILOVER_ALLOW_DATA_LOSS;
It’s just that simple. When that Availability Zone or Region comes back up, connect to the original primary node in SSMS and run:
ALTER AVAILABILITY GROUP MyTestAG FAILOVER;
When you’re failing over inside the same Availability Zone you need to move an IP address around the cluster manually, but the EC2 command line tools make this easy as well. Any computer that can connect to EC2 can run a combination of
ec2-assign-private-ip-addresses to move the listener to the new location in the Availability Group.
This sounds easy, and failing over is. However, this isn’t exactly easy to set up. SQL Server AlwaysOn Availability Groups have a couple of things to watch out for.
What Should You Watch Out For
AlwaysOn Availability Groups are an Enterprise Edition feature of SQL Server. To deploy SQL Server AlwaysOn Availability Groups in AWS you’re going to have to bring your own licensing. More than that, it means that you won’t be able to use one of Amazon’s SQL Server images that they provide – you’re going to have to install SQL Server on your own VMs unless you want to pay for SQL Server licensing twice. If you want to make it easy to automate deploying SQL Server in AWS you’re going to have to take a lot of specialized steps that are well outside of most DBAs’ job duties – sysprepped SQL Server installations and command line Linux enter into the picture. You’ll have put in a lot of effort to automate your SQL Server deployments but it will be worth it.
This isn’t exactly easy to set up. SQL Server AlwaysOn Availability Groups have a number of installation steps and deploying everything in a purely virtual environment adds additional complexity. If you thought that configuring AlwaysOn Availability Groups was tricky on hardware in your data center you’ll find that configuring AlwaysOn Availability Groups in AWS adds a few layers of complexity.
Why You Should Care About Availability Groups in AWS
Even though not all of the SQL Server AlwaysOn Availability Groups functionality can be used, this is a huge step forward for businesses who are using the cloud or have been considering a move to AWS. Companies can use their existing SQL Server 2012 licensing to establish their Availability Group. Availability Groups make it easy to have multiple DR sites. With recent cloud outages (both in Azure and AWS), companies are aware that even massively redundant cloud hosting is vulnerable. AlwaysOn Availability Groups make it possible to implement a highly available deployment strategy that can be used to survive severe outages.
Learn more about our SQL Server in Amazon cloud resources page.
Can you expand upon ‘with a few additional steps it’s possible to get the Availability Group Listener working’. I’m banging my head against a wall trying to get this working correctly.
Hi Scott – To get the listener working you’ll need to assign an additional Elastic IP address to the listener. You also want to make sure that this secondary IP address can be reassigned. Finally, and this is the fun part, because Windows is using a virtual IP address and refuses to behave well, you’ll need to script out AG failover yourself by detaching the Elastic IP, failing over, and then attaching the Elastic IP to the new instance. This can all be done with agent jobs on each instance, but it’s not as pretty as it might be if you were doing this in your own data center.
I have two questions on AG i.e. I have two nodes 1 at each site Prod and DR. I have configured AG with Asyncronous commit to DR.
My questions are :
1. I have a quorum setup of One SQL Node (production server) and the other is just some other server configured to vote on production site and I have given 1 vote to the DR site server. So that majority is at production site. Is that ok ?
2. In the above configuration or even if there is 0 vote to the DR site what happens when the primary datacenter goes offline or there is a disaster ? How would we recover from that on DR site ?
I have to answer your first question with a few questions of my own: Is this situation acceptable to you? By that I mean, have you tested all of the ways that quorum could fail? Are all of those possibilities acceptable for your SLA with the customers/business? Effectively designing HA/DR solutions is based on balancing the weaknesses and strengths of a system. As long as the system meets your business requirements and you’re aware of the potential failure points, then you should be good.
Question two is easy: for asynchronous replicas, you have to perform manual failover. This is outlined in Failover and Failover Modes (AlwaysOn Availability Groups). Notification of the outage, and your reaction to it, should be based on the DR planning that went into the decision to put AlwaysOn Availability Groups in place.
First off let me thank you for an awesome post! The level of detail you provide is uncanny, and very much appreciated it.
Now, to the question. haha. I’ve managed to configure the WFC and SQL AG as necessary on AWS. All went well, ran into some bumps with the listener, but got around that by assigning the listener IP to the ENI of the primary instance.
My question is: Do you know of a way to script against AWS that will move the listener IP to the secondary instance upon detection of failure?
Explained another way, is there an amazon script that will move that listener IP from the primary instance to the secondary instance IF one of Amazon’s 2 checks (system/instance) fails?
I think a script like that would enable SQL to failover automatically, assuming that script fired quickly enough. What do you think?
Thanks in advance,
There’s no stock script provided by AWS, but the AWS CLI tools make it easy to move the IP addresses around. The reason you can’t easily automate this movement is that the cluster attempts to register the virtual IP for the AG Listener. If that IP is in use anywhere, the failover will fail.
If events happen to transpire in the exactly correct clock order, then failover could work. That’s a big if, though. In order for the failover to happen correctly, you have to:
So, you’d have to create some sort of CloudWatch Alarm based on failing n instance checks over t time, push a notification to an AWS SNS end point and then have an event subscriber receive that notification and take action on it (running a script). You’d have automatic failover at a 2-5 minute granularity.
Thanks for the quick response!
Here’s what I’ve done so far, with pretty decent success.
Inside the AWS Management GUI, BEFORE initializing any failover(still manual at this time):
1. Reassociate the IP to a new ENI (on secondary node)
2. Go into SQL Management Studio, right-click the AG, and hit Failover.
3. It detects the cluster IP is still alive, and the failover proceeds as planned.
So maybe a combination of an Amazon alert to notice the instance failure, a script to reassociate the private IP to the new ENI, and a script to trigger the failover in SQL?
Am I on the right path here? I’m disappointed that automatic failover doesn’t seem like an option (unless I can time the script execution properly) – but at the least this will give me failover within a minute or two, right?
Yup, you’re on the right path.
Thanks for the info! We’re working on setting up something similar in EC2. But, I dont’ quite understand how you’re able to async write from US-East to US-West. Do you guys have a complete guide to setting up the diagram in this article? The guide you provide (https://www.brentozar.com/archive/2011/07/how-set-up-sql-server-denali-availability-groups/) seems like it’s for corporate networks, not cloud environments. Here are some things we found that make it particularly difficult:
1) You need an AD. Because of the need for a static address, the only way we could reliably get this to work was setting up a Microsoft AD in a VPC and assigning an EIP to the Active Directory Server. Not a problem.
2) We went with three SQL Servers (running Windows Server 2012). Here’s where it gets tricky. We have two severs bound by an EC2 affinity group in EC2 (not VPC) running in US-East-1a. They’re joined to the active directory domain running in a VPC.
3) We launched a third async server in US-West-2a. We couldn’t run this in EC2. So, we had to place it in a VPC and assign it an Elastic IP. Why? Well, when we tried to join it to the active directory domain server in US-East, we hit routing problems.
Ok, so we have all three DB Servers talking to the AD. But, for whatever reason, the servers in US-East can’t read the network settings of the server in US-West.
1) Is it possible to run all nodes as EC2 nodes? Or, do they need to be in a VPC?
2) If you choose three servers (2 sync in USEast, 1 async in USWest), do you need a VPN between the VPCs?
3) Can we simply add the async server later?
We don’t have a step by step guide for setting up AlwaysOn in AWS. You can download our SQL Server 2012 AlwaysOn Availability Groups Setup Checklist to get a step by step guide to getting AlwaysOn Availability Grous configured.
I can say that I recommend the whole thing be put in a VPC and use a VPN – you need to open firewall ports and should use static IPs to make networking easier between both data centers. A VPC is going to give you better control of the network in general and even give you a level of control similar to what you could get in a regular on premise network.
Agree. I think the easiest thing (without mucking w/ a VPN… I’m surprised Amazon doesnt offer a service to connect two VPCs…), is to just set everything up in one VPC spanning multiple availability zones.
If we want to add an additional async db server in another region down the road, is that fairly straight forward?
It’s as straight forward as setting up a VPN and multi-subnet failover clustering. Once those two aspects are set up and working properly, you should be ready to go. The nice part about AWS is that you can clone your entire set up into a separate network and try it out.
There’s a good walkthrough (including auto-failover) from Amazon here: http://aws.amazon.com/whitepapers/microsoft-wsfc-sql-alwayson/
Anyone have success building out an Always On Win/SQL 2014 Cluster on Rackspace Cloud services? We have one up on dedicated servers with them, but trying to figure out if it’s possible over cloud servers, to use for a DEV environment and their customer support has not been able to give clear guidance, sadly.
Is there anything fundamentally different between Rackspace and AWS support for Always On?
Thanks for any insight.
Rackspace is, from my limited exposure, different from AWS. I’d ask your Rackspace rep for access to their technical resources. There is some guidance in this Microsoft case study.
I am looking at setting this up in AWS and am wondering if Greg’s Low’s post is indicating that automatic failover is now possible in AWS. Can anyone clarify?
I have SQL 2014 AlwaysOn Availability groups and a AG listeners setup in AWS.
I used this white paper to complete the setup.
I think the limiting factor is that the setup is probably the most complex Windows cluster environment I have ever configured. Be sure to understand MultiSubnet cluster configuration for MS Clusters before going into it.
I suggest the Quick Start reference guide from here: https://aws.amazon.com/windows/resources/whitepapers/alwayson/
This is a more in depth guide that includes Cloud formation examples:
Hope that helps!
What about DAGs in AWS. Will these now work with automatic failover in the cloud? And will this serve to assist in the implementation of CQRS if the primary is write and and the seondary replica is a read only?
Michele – great question, but that’s a little beyond what I can answer quickly in a blog post comment. You may try breaking that up into a couple of questions at https://dba.stackexchange.com.
Can someone please guide me how I can configure a load balancer for an Always On availability group for cross-region in AWS environment?
Ravi – for questions, go to a Q&A site like DBA.StackExchange.com.
Thank you Brent Ozar for your help.
Can someone please answer on this.
-we have 2 node WFC with AG group + listener listed to secondary IPs in AWS. one replica in AZ1, another in AZ2. Now due to perf issues the users added two more nodes. The idea is seamless migration of dbs to new node without any changes to connection string. How can we make the existing listener work when the listener is tied to secondary IPs of the node 1 and node 2. What are my options, how can we tie the existing listener to Node 3 and Node 4.
any help is greatly appreciated.
Kumarraju – for personalized consulting, you can click Consulting at the top of the page, or go to a Q&A site like https://dba.stackexchange.com.