SQL Server 2016 adds Availability Groups in Standard Edition. While the licensing details can obviously change right up to release date, here’s what was announced at the Ignite conference:
- Limited to two nodes only (a primary and a secondary)
- Like mirroring, you can’t read from the secondary, nor take backups of it
- But like database mirroring, you can take snapshots of the secondary for a static reporting copy
- Each database is in its own Availability Group
In CTP2, this is referred to as a Basic Availability Group.
Yes, a BAG. I can’t make this stuff up people. But here’s what I can make up:
- If you use HammerDB for load testing of a BAG, that’s a BAG of Hammers
- If you have ten databases, each in its own BAG, that’s a Dime BAG
- It could be worse – it could be a Database Basic Availability Group, or a D-BAG
Enough laughing – let’s implement it. For this scenario, I’m already in production with one SQL Server 2016 machine, and I’d like to add a second one to give me some high availability or disaster recovery protection.
I’m setting up a simple AG with two standalone machines:
- SQL2016A – this will be my primary instance where my databases will normally run.
- SQL2016B – this will be my secondary, with less horsepower. Could be in the same data center, or in a different one. (To keep things simple for this explanation, I’m going to stick with the same data center and the same IP subnets.)
This checklist will focus on the Availability Groups portion, not the SQL Server installation, so go ahead and set up your second SQL Server using our SQL Server Setup Checklist. Make sure the second one uses the same user database data/log drive letters and paths as your primary server. If you use different paths, then whenever you add a data or log file to an existing database, it will succeed on the primary but then fail on the replica where the drive letters are different. (Only the user database paths have to match – not TempDB, system databases, binaries, etc.)
In theory, you might be able to do this in production without a major outage.
In practice, I wouldn’t bother trying – you’re probably behind on Windows Updates, and you need to apply these prerequisite Windows patches. Because of that, I’d schedule a separate (long) maintenance window to get up to speed on Windows patches, the latest SQL Server updates, and do the first few steps. Even when you rehearse it in a lab, it’s not unusual to run into several restarts-and-check-again-for-updates in the wild, or brief outages to
Let’s get started!
ADD THE CLUSTERING FEATURE TO BOTH WINDOWS SERVERS.
“But wait, Brent – I don’t want to build a cluster!” Unfortunately, SQL Server’s AlwaysOn Availability Groups is built atop Windows Failover Clustering Services, and you have to join your AG replicas into a single Windows cluster.
Fear not – this is nowhere near as scary as it sounds, and it can be done after-the-fact to an already-in-production SQL Server. (I’d recommend you set up a lab or staging environment first to get used to it, though.)
After installing those clustering prerequisites, open Server Manager and click Manage, Add Roles and Features, select your server, and on the Features list, check the Failover Clustering box:
You can move on to the next step after this feature has been installed on all of the servers that will be involved in the Availability Group.
Validate your cluster candidates.
Open Server Manager and click Tools, Failover Cluster Manager. (I know – you thought Failover Cluster Manager would be under Manage instead of Tools, but it’s these clever switcheroos that keep us all employed as IT professionals.) Click Validate Configuration on the right hand side to start the wizard:
Put in the names of all the servers that will be participating in the Availability Group, and hit Next. It’s going to ask if you want to run all tests – this is another one of those gotchas where, in theory, you could do this in production without an outage if you chose to run specific tests. I don’t recommend that – just run all tests. If SQL Server is running on any of these nodes, it’ll likely stop while this wizard runs, so do this during a maintenance window.
The validation wizard can take several minutes depending on the number of replicas, environment complexity, and phase of the moon.
Your cluster will probably fail, and that’s okay. This isn’t like school. Nobody has to know if you fail several times. (Chemistry was hard for all of us.) Even if you think you passed, you probably failed a portion of it. (The wizard, not chemistry.)
Click the View Report button, and Internet Explorer will launch because why wouldn’t we take every opportunity to open a browser on a production server:
You’ll be tempted to scroll down through it because that’s how most web pages work. This is not most web pages. Click on the first error – in my case, Network – and then click on the first warning it takes you to – in my case, Validate Network Communication – and then read the warnings in yellow:
This is where things get a little weird.
Some errors are completely okay to ignore. The most common two are:
- Single network interface warnings – if you’re doing network teaming or virtualization, you’re probably protecting your network through means that the clustering wizard can’t detect.
- No disks found for cluster validation – if you’re building an AlwaysOn configuration with no shared disk, like failover clustering on a UNC path or a pair of Availability Group replicas with local disk, this isn’t a problem.
But otherwise, read this report really carefully and fix the warnings before you go on to the next step. For example, if the two nodes have different Windows patch levels, get the missing updates installed first. (If you’re having difficulty getting some of the patches to apply, there’s your first sign that this isn’t a good candidate for a reliable AlwaysOn implementation – more like OftenOn, or OffAndOn.)
After you fix the issues, run the validation wizard again. This is important because the last validation report is saved in Failover Cluster Manager, and you always want that report to look good. When you call Microsoft (or a team of highly skilled consultants) for Availability Groups help, they’re going to start by looking at this validation report, and if it didn’t pass last time, that’s going to be a bad start.
Create the Cluster.
After a successful validation report, leave the “Create the cluster now” checkbox checked, and click Finish – your cluster will be created:
This checkbox is particularly diabolical – it will be checked by default as long as you didn’t have any heinously ugly showstopper errors, but even with mildly serious errors, it will still let you create the cluster. Make sure you fix the errors first.
You will be prompted for a cluster name – again, a little diabolical here, because this is not how you will address SQL Server. This name is how you will control the cluster itself:
Think of your cluster like a kitchen – a restaurant can have lots of cooks, but it will have only one head chef. Your head chef knows who’s doing what at all times, and bosses people around. The head chef never actually cooks anything – he just tells other people what to do.
Your cluster can have lots of SQL Server instances in it, and they can have lots of Availability Groups. We need one controller to know what’s going on, and that’s the cluster name.
In my case, I’m using SQL2016CLUSTER. You’ll notice that in my screenshot, it says “One or more DHCP IPv4 addresses were configured automatically” – don’t do that in your production environment. Use static IP addresses. You always want to know where the head chef is.
The next screen will prompt you to add all eligible storage to the cluster – uncheck that unless you’re building an AlwaysOn Failover Clustered Instance.
In a perfect world, the wizard finishes, your cluster is completed successfully, and your Failover Cluster Manager looks something like this:
Pay particular attention to the “Witness: None” line. Just like any good preacher, your cluster needs a witness, so right-click on the cluster name on the left side of the screen, click More Actions, and Configure Cluster Quorum Settings. In the next wizard, choose “Select the quorum witness” and choose either a disk (shared drive) or file share (UNC path) witness.
Choosing a quorum method deserves its own webcast, and luckily, Kendra Little has recorded a video on that very topic.
For my simple two-server configuration, I need a really reliable tiebreaker, so I’m going with a file share witness. I need it hosted on my most reliable server possible – in my simple lab, that’s one of my domain controllers, so I type in the server name and then click the Show Shared Folders button:
Click the New Shared Folder button, configure your file share, and exit the wizard. What I just did there is right for my lab, but only my lab – you’ll want to make some better-informed decisions in your production environment. For example, if all of my SQL Servers lived on virtual machines backed by NetApp storage, I’d probably use a file share right on the NetApp itself, because if that thing goes down, all my SQL Servers are hosed anyway.
Enable AlwaysOn Availability Groups on Each Server
On each replica, launch SQL Server Configuration Manager, then go into SQL Server Services, right-click on the SQL Server service, and click Properties.
On the AlwaysOn High Availability tab, check the box to Enable, and click OK. This won’t take effect until the SQL Server service is restarted. Repeat this on all of the SQL Servers.
Now let’s take a moment to reflect: we’ve got a pair of Windows servers that are in the same cluster, but at this point, there are no dependencies between the SQL Server services running. They can be hosting different databases, doing log shipping or mirroring between them, or have absolutely nothing in common. If you’re planning your first AG in production, the work done up to this point is a good amount of work to tackle in one maintenance window. I would schedule all of this for the first outage, and then let things simmer down for the next week or two before assembling the Availability Group components.
Create the Availability Group
Whew! We’re finally done with the Windows portion of setup. Let’s fire up SQL Server Management Studio and have some fun.
On my SQL Server, I have a copy of the StackOverflow.com database that I want to copy over to my secondary replica. It’s already in full recovery mode, and I’ve been taking full backups, so it already meets the lengthy AlwaysOn Availability Group prerequisites.
In Object Explorer, right-click on AlwaysOn High Availability and click New Availability Group Wizard. Starting in SQL Server 2016, we’ve got a few new options:
As of this writing, the AG documentation hasn’t caught up with the product yet, so here’s an explanation of each option:
Basic Availability Group – terminology can always change leading up to release, but as of CTP2, this functionality is called a Basic AG. I would just like to take another second to point out that the acronym is BAG.
Database Level Health Detection – in Super Availability Groups (as opposed to Basic – look, if they can make up stupid acronyms, so can I) with multiple databases, the SAG doesn’t drop if one database happens to go offline, become corrupt, or take the day off. New in SQL Server 2016 is the ability to force that failover if any one database goes down, so the SAG stays up.
Per Database DTC Support – The Distributed Transaction Coordinator (DTC) enables simultaneous commits across multiple SQL Servers hosted on different hardware. Prior to SQL Server 2016, AG databases didn’t support DTC participation, but now, just check this box.*
* – You’ll probably have to do more than that, but hey, the infrastructure is here.
Since we’re doing Standard Edition, we’ll just check the Basic Availability Group box and move on. Check the box to pick which database you want in your BAG, and then it’s time for replica selections. On this screen, click the Add Replica button and connect to your new soon-to-be secondary:
Here’s what the choices mean, and I’m going to explain them a little out of order:
Synchronous Commit – say SQL2016A is the primary, and a user does a delete/update/insert there. If I check the Synchronous Commit boxes for SQL2016A and SQL2016B, then SQL2016A will write the user’s transaction locally, send the transaction over the network to SQL2016B, write it to disk there, and then wait for confirmation back before telling the user that the transaction is committed. (Most of the time. If SQL2016B drops offline, that’s another story.)
If I don’t check the Synchronous Commit boxes, then I get asynchronous transactions – the SQL2016A primary transaction will succeed right away, and SQL2016A just sends the transactions over to SQL2016B later. SQL2016B can be seconds (or minutes, or hours) behind, so I’ll lose data on failover, but the primary’s transactions won’t be subject to that killer latency.
Automatic Failover – if the Synchronous Commit boxes are checked, then you can do automatic failover between two of the replicas. (This is where the cluster witness starts to come into play – in my two-node setup, SQL Server needs a tiebreaker to know the difference between single node failures and a split-brain scenario.)
For my evil purposes, I only need a disaster recovery copy of my databases, and I don’t want to fail over to DR if my primary just happens to restart or hiccup. I’m going to leave the Automatic Failover and Synchronous Commit boxes unchecked.
Readable Secondary – in Super Availability Groups, you can offload reads and backups to your secondaries. Since I’m dealing with a BAG here, the Readable Secondary dropdown only offers the “No” option.
The tabs for Endpoints, Backup Preferences, and Listener don’t really matter here. SQL Server automatically creates the endpoints for you, and you can’t offload backups or create a listener for BAGs.
The next step in the wizard syncs the data between replicas:
The Full option performs one full and one transaction log backup on the primary, then restores them over to the secondary to start the replication. (Keep in mind that if your databases are large, you’ll want to set database backup compression on by default in order to speed this process – you don’t get to choose compression settings in the wizard.)
The other two options are useful if you performed the backups yourself manually, or if you’re using storage replication technologies to get it started.
If you choose the Full option, you can track backup and restore status with sp_WhoIsActive’s percent-completed column.
After the AG wizard finishes, you can right-click on your Availability Group in Object Explorer and click Show Dashboard:
Voila! Papa’s got a brand new BAG.
Think AlwaysOn Availability Groups are right for your environment? Take my 6-question quiz to find out:
That’s from our in-person training classes, where we cover what DBAs and developers need to do for a successful AG implementation.
In those classes, here’s some of the questions I get the most often:
Q: How much network bandwidth will I need?
For a really rough estimate, sum up the amount of uncompressed transaction log backups that you generate in a 24-hour period. You’ll need to push that amount of data per day across the wire. Things get trickier when you have multiple replicas – the primary pushes changes out to all replicas, so if you’ve got 3 replicas in your DR site, you’ll need 3x the network throughput. Calculating burst requirements is much more difficult – but at least this helps you get started.
Q: What’s the performance overhead of a synchronous replica?
From the primary replica, ping the secondary, and see how long (in milliseconds) the response takes. Then run load tests on the secondary’s transaction log drive and see how long writes take. That’s the minimum additional time that will be added to each transaction on the primary. To reduce the impact, make sure your network is low-latency and your transaction log drive writes are fast.
Q: How far behind will my asynchronous replica be?
The faster your network and your servers are, and the less transactional activity you have, the more up-to-date each replica will be. I’ve seen setups where the replicas are indistinguishable from the primary. However, I’ve also seen cases with underpowered replicas, slow wide area network connections, and heavy log activity (like index maintenance) where the replicas were several minutes behind.
Q: What’s the difference between AGs in SQL 2012 and SQL 2014?
SQL Server 2014’s biggest improvement is that the replica’s databases stay visible when the primary drops offline – as long as the underlying cluster is still up and running. If I have one primary and four secondary replicas, and I lose just my primary, the secondaries are still online servicing read-only queries. (Now, you may have difficulties connecting to them unless you’re using the secondary’s name, but that’s another story.) Back in SQL 2012, when the primary dropped offline, all of the secondaries’ copies immediately dropped offline – breaking all read-only reporting queries.
Q: How do I monitor AlwaysOn Availability Groups?
That’s rather challenging right now. Uptime monitoring means knowing if the listener is accepting writeable connections, if it’s correctly routing read-only requests to other servers, if all read-only replicas are up and running, if load is distributed between replicas the way you want, and how far each replica is running behind. Performance monitoring is even tougher – each replica has its own statistics and execution plans, so queries can run at totally different speeds on identical replicas. There’s not really a good answer to this question right now.
Q: How does licensing work with AlwaysOn Availability Groups in SQL 2012 and 2014?
All replicas have to have Enterprise Edition. If you run queries, backups, or DBCCs on a replica, you have to license it. For every server licensed with Software Assurance, you get one standby replica for free – but only as long as it’s truly standby, and you’re not doing queries, backups, or DBCCs on it.
Q: Can I use AlwaysOn Availability Groups with Standard Edition?
Not at this time, but it’s certainly something folks have been asking for since database mirroring has been deprecated.
Q: Do AlwaysOn AGs require shared storage or a SAN?
No, you can use local storage, like cheap SSDs.
Q: Do Availability Groups require a Windows cluster?
Yes, they’re built atop Windows failover clustering. This is the same Windows feature that also enables failover clustered instances of SQL Server, but you don’t have to run a failover clustered instance in order to use AlwaysOn Availability Groups.
Q: Do I need a shared quorum disk for my cluster?
No, and check out Kendra’s video on failover cluster quorum to learn your options.
Q: What version of Windows do I need for AlwaysOn AGs?
We highly recommend Windows Server 2012R2, and here’s why.
Q: Can I have different indexes or tables on my replicas?
No, the replica database contents will be exactly the same as the primary.
Q: If I fail over to an asynchronous replica, and it’s behind, how do I sync up changes after the original primary comes back online?
That is left as an exercise for the reader. When I go through an AG design with a team, we talk about the work required to merge the two databases together. If it’s complex (like lots of parent/child tables with identity fields, and no update datestamp field on the tables), then management agrees to a certain amount of data loss upon failover. For example, “If we’re under fifteen minutes of data is involved, we’re just going to walk away from it.” Then we build a project plan for what it would take to actually recover >15 minutes of data, and management decides whether they want to build that tool ahead of time, or wait until disaster strikes.
Kendra says: I wish I’d had this list when I started learning about AGs. I was also surprised to learn that crazy things can happen in an AG when you create indexes. Check out more on that here.
In theory, you can add indexes online with SQL Server Enterprise Edition.
In theory, with AlwaysOn Availability Groups, you can add and drop indexes on the primary replica whenever you want.
In theory, you can perform read-only queries on the secondaries whenever you want, and nobody gets blocked.
In practice, these things don’t always add up to the same answer.
I’ve got an AlwaysOn Availability Group demo lab with SQL Server 2014 hosting the AdventureWorks2012 database. I’ll start on my secondary server, and I’ll get all of the suffixes from the Person.Person table:
The query results aren’t important – instead, I’m showing the actual execution plan, which is very important. By default, there’s no index on the Suffix field, so I get a clustered index scan.
Say that for some bizarre reason, this is the type of query my end users constantly run, so I decide to switch over to the primary to add an index on Suffix. (As a reminder, you have to add indexes on the primary, not the secondaries.) After adding the index, I run the query again:
Woohoo! We’ve now got an index scan on my new index. (Yeah, it’s not a seek, but this is a crappy query, and a scan is the best I’m going to get.) So far, so good.
But now let’s mix things up a little – let’s run that same query, but start a transaction with it:
Now I have an open transaction, which really shouldn’t mean anything because you can’t do updates on the secondary. However, switch over to the primary, and drop that newly created index. Back over on the secondary, where we’ve got that open transaction, run the exact same query again:
Our transaction has been knocked out and failed. It’s funny to think of a read-only transaction as “failing”, but if you have a stored procedure that starts things off with BEGIN TRAN and then does a lot of reads, thinking it’s going to get a complete point-in-time picture of the data, that’s not going to work.
I know what you’re thinking: “But Brent, I don’t even use snapshot isolation, because I read Kendra’s post about it, and I know we’re not ready to test our app with it yet.” Thing is, SQL Server uses RCSI behind the scenes on all AlwaysOn Availability Group replicas – and it’s lying to you about whether or not the feature is enabled. Here’s the sys.databases view for my secondary replica:
SQL Server claims that for the AdventureWorks2012 database, both snapshot isolation and read committed snapshot are off – but it’s lying to you. Read Books Online about querying the replicas and note:
Read-only workloads use row versioning to remove blocking contention on the secondary databases. All queries that run against the secondary databases are automatically mapped to snapshot isolation transaction level, even when other transaction isolation levels are explicitly set. Also, all locking hints are ignored. This eliminates reader/writer contention.
They’re not kidding – even if I specify dirty reads, I can run into the same problem. Here, I start a transaction, but I’ve asked for the dirtiest, no-lockin-est reads I can get:
But when I add or drop a related index on the primary, even WITH (NOLOCK) transactions are affected:
This starts to point to the challenge of running complex reporting jobs off the secondary replicas. SQL Server is constantly changing the data underneath your queries, and most of the time, it cooperates beautifully. However, if you have a long-running stored procedure (like minutes or hours long), and you want a point-in-time picture of the underlying data while you generate reports on the replica, you don’t want to play around with transactions. Instead, a database snapshot might be a better fit.
If you’d like to run these demos on your own lab, download the scripts here.
Our live webcast topic this week was Q&A – you could bring any HA/DR questions, and we could avoid your answers. Just kidding. Here’s the webcast – we apologize for the audio, WebEx is still getting their act together. It cleans up after the first couple of minutes.
And here’s the questions we couldn’t get to during the webcast:
SQL Server Licensing Questions
Q: If using virtual machines and clustering / failing over at that level (not sql server) is there any reason that SQL Server Standard Edition won’t work? Someone once told us in a sql class that Enterprise Edition was necessary for this.
Answer from Brent: don’t you just love those “someone once told us” things? You’ll want to get them to tell you why. Standard Edition works fine in virtual machines. It may not be cost-effective once you start stacking multiple virtual machines on the same host, though, because you have to pay for Standard Edition for every guest.
Q: Hi, with mirroring being deprecated and AlwaysOn AG only available with Enterprise Edition – what are our HA options going to be with Standard Edition in the future? Any ideas if AlwaysOn synchronous will make it into Standard?
Answer from Jeremiah: You have a few HA choices with SQL Server 2012 Standard Edition and beyond. Even though mirroring is deprecated, you could feasibly use mirroring in the hope that something new will come out. Obviously, this isn’t a viable option. The other HA option is to use clustering. SQL Server Standard Edition supports 2 node clusters, so you can always use it for HA.
Training for Clustering, AlwaysOn
Q: Is there a good resource on setting up a lab environment for a clustering setup?
Answer from Kendra: I’m so glad you asked!
- Matt Velic has put together a guide on how to build a virtual lab with a Windows Server 2012 Cluster with Virtual Box and SQL Server 2012 Evaluation Edition– complete with eBook.
- If folks want to test AlwaysOn, I did a video a while back on how to plan an AvailabilityGroup lab, and our AlwaysOn page has a setup checklist PDF.
How to Manage AlwaysOn Availability Groups
Q: Did you experience or know “split brain scenario” in AlwaysOn Availability Groups that when secondary node is up to take over primary role, the transaction becomes inconsistent? And how to avoid it?
Answer from Brent: Ooo, there’s several questions in here. First, there’s the concept of split brained clusters – when two different database servers both believe they’re the master. Windows Server Failover Clustering (WSFC) has a lot of plumbing built in to avoid that scenario. When you design a cluster, you set up quorum voting so that the nodes work together to elect a leader. In theory, you can’t run into a split brain scenario automatically – but, you can most definitely run into it manually if you go behind the scenes and change cluster settings. The simple answer here: education. Learn about how the quorum process works, learn the right quorum settings for the number of servers you have, and prepare for disaster ahead of time. Know how you’ll need to react when a server (or an entire data center) goes down. Plan and script those tasks, and then you can better avoid split brain scenarios.
Q: Can you recommend any custom policies for monitoring AlwaysOn? Or do the system policies provide thorough coverage? Thank you!
Answer from Brent: I was a pretty hard-core early adopter of AlwaysOn Availability Groups because I had some clients who needed it right away. In that situation, you have to go to production with the monitoring you have, not the monitoring you want. The built-in stuff just wasn’t anywhere near enough, so most of my early adopters ended up rolling their own. StackOverflow’s about to share some really fun stuff there, so I’d keep an eye on Blog.ServerFault.com. You should also evaluate SQL Sentry 7.5’s new AlwaysOn monitoring – it’s the only production monitoring I’m aware of, although I know all the other developers are coming out with updates to their tools for monitoring too.
Q: Is it wise to have primary availability groups in one server of the nodes and have primary groups on another of the servers that form the cluster. Or is it better to have all primary groups on server 1 and secondary on server 2?
Answer from Brent: If you split the primaries onto two different nodes, then you can do some load balancing.
Q: Would you consider Always-ON AG Read-Only replicas as a method to offload or load balance reporting? Looks like the Read Intent option acts like a load balancer for reading off of those DBs, right?
Answer from Brent: Offload yes, load balance no. The read intent options give you the ability to push read-only queries to a different replica, but there’s no load balancing. Your clients just hit the first server in the list. If you need true load balancing, you’ll want to put all of the read-only replicas behind a real load balancing appliance.
Windows Clustering Setup and Management
Q: Where can I find a good list of cluster hotfixes SQL 2008 R2 and perhaps the OS as well?
Jes here. You can go to the Update Center for Microsoft SQL Server to find the latest CU and hotfixes. Check the Support pages for Windows Server 2008R2. Updates aren’t released as cluster-specific. This is why it’s really important to have a test or QA environment that is also set up as a cluster, so you know if the cluster services are affected at all.
Q: What is the recommended order/procedure when you have to do Windows updates to servers in a cluster?
Answer from Kendra: Microsoft knew you were gonna ask this! Check out their “SQL Server failover cluster rolling patch and service pack process” KB here. But do yourself a favor and always deploy patches to a non-production test cluster first and let them burn in a bit.
Q: From your previous answers, it sounded like you don’t recommend use Windows 2008 R2 for AlwaysOn. Can you elaborate bit more on why Windows 2012 is better suited for this? I need more persuasive power to talk the rest of folks of my company to use it.
Answer from Brent: Sure, check out the AlwaysOn Availability Groups Lessons Learned video at the bottom of that page.
Q: Would you have a single DTC group or multiple groups configured for a 4 instance cluster?
Answer from Kendra: There’s no shortcut here: you have to decide on an instance by instance basis. For each instance you gotta determine how much it uses distributed transactions, and how impacted it might be if DTC were to temporarily be offline. Review Cindy Gross’ information on DTC to find out pros and cons of different approaches to configuring DTC.
SQL Server Clustering with VMware and Hyper-V
Q: Is VMWare HA a good alternative to use instead of a Microsoft Cluster?
Answer from Jeremiah: The HA choice comes down to where you want your HA to be managed. VMware HA pushes the high availability question out of the SQL Server realm and into the VMware infrastructure. More than anything else, this is a business decision – just be sure you’re happy with the decision of which team is managing your uptime.
Q: When using a virtualized active/passive 2008R2 cluster with underlying iSCSI storage can the nodes by on different hosts or is FoE needed to have nodes on different hosts?
Answer from Brent: Check out VMware’s knowledge base article on Microsoft cluster support. It lays out your options for iSCSI, FC, FCoE, and more, and separates them by shared-disk clustering versus non-shared-disk (AlwaysOn Availability Groups).
Q: Any thoughts on implementing AlwaysOn in conjunction with a virtual SQL environment using VMWare HA/ Site Recovery Manager (SRM)?
Answer from Kendra: With this level of complexity, when things get tricky it’s incredi-hard to sort out. You gotta have a rockstar team with great processes and communication skills to handle problems as they arise– and you are going to hit problems.
Even if you have the rockstar team, you want to first ask if there’s a simpler way to meet your requirements with a less risky cocktail of technologies. If you rush into what you describe, you’ll find that your high availability solution becomes your primary cause of downtime.
Shared Storage for Clusters
Q: Was reading a great article from Brent on SQLIO. How does this work on a SQL Cluster?
Answer from Kendra: You run SQLIO against the storage (not the SQL Server instance) so it works the exact same way.
Q: After Setting up The Cluster and adding the various CLUSTER DATA Drives how can I add additional Drives after gaining new internal storage?
Answer from Kendra: Before you touch production, make sure you’ve got a lab environment. If you don’t, check out the link above on how to build one. The exact steps to do this are going to vary depending on your version of Windows, your version of SQL Server, and exactly what storage you’re talking about.
For new shared storage on Sever 2008 or later, the basic process is presenting the storage to all of the nodes, bringing the drive online on one node, creating a volume, adding the disk in the failover cluster, and then adjusting dependencies in the cluster as needed. (Dependencies can be adjusted online in SQL Server 2008 and later).
If you have new non-shared storage that you want to use under tempdb (such as SSDs), you’ve got to make sure that every node in the cluster has the drives for tempdb online / volumed/ formatted/ and configured identically, and then you can move tempdb files over to it. You will need to restart SQL Sever to make modified tempdb files recognize the new paths.
Sharding and Mirroring Questions
Q: I have a peer to peer replication with 3 nodes (all bidirectional). Very beneficial but a big pain to maintain. Is that what the industry feels?
Answer from Jeremiah: SQL Server peer-to-peer replication solves a very specific need – the ability to have multiple active SQL Servers where writes can occur and where you can have near real-time updates to the other servers. While peer-to-peer replication meets that need, it has a relatively heavy price tag in terms of DBA expertise, support, and licensing costs. Even experienced teams want to have multiple DBAs on staff to deal with on call rotations and, let’s face it, while peer-to-peer replication hasn’t been deprecated, it’s a difficult feature to work with.
Q: I’ve implemented db sharding on Oracle in several environments. Is there an applicable tech in SQL Server?
Answer from Jeremiah: Sharding is just a buzzword for horizontal partitioning. In a sharded database, either the application or a load balancing router/reverse proxy is aware of the sharding scheme and sends reads and writes to the appropriate server. This can be accomplished with SQL Server, Oracle, MySQL, or even Access. There are no technologies from Microsoft and I’d be wary of anyone attempting to sell something that Just Works® – database sharding is time consuming, requires deep domain knowledge, and adds additional database overhead.
Q: Currently using SQL 2008 Mirroring. Planning a move to 2012. Your thoughts about jumping 2012 and going straight to 2014 Always On technologies?
Jes here. There were no major changes to Database Mirroring in SQL Server 2012, and I don’t foresee any coming in 2014. Eventually – we don’t have a specific version yet – Mirroring will be deprecated. Read our AlwaysOn Availability Groups Checklist to get an idea of the work involved in setting these up – it’s much more complicated than Mirroring – before you decide to jump in.
More Free AlwaysOn and Clustering Training Resources
And if your business needs help deciding which one is right for you, check out our SQL Critical Care® methodology. In just 4 days, we get to the root cause of your availability and performance pains, and we train you how to fix it. Learn more about our SQL Critical Care® now.
Business is booming; the sales team is bringing in more business and the software developers are successfully scaling out at the web server tier. There are signs of pressure on the database tier and you realize that it’s time to scale out. You’re using SQL Server 2012, you need to improve performance, and the first thought that comes to mind is the new hotness: SQL Server AlwaysOn Availability Groups.
Get Your Bearings
Before making any infrastructure decisions you need to stop and consider your goals. Ask yourself: What pain am I trying to solve?
If you’re trying to scale out reads without changing much code, AlwaysOn Availability Groups are probably the right idea. With an AlwaysOn Availability Group we can scale out to a total of 5 servers to scale out reads across the board. This is fantastic for scaling out – we change a connection string or change a DNS CNAME and point connection strings at the Availability Group.
The hope and dream is that we can wildly scale with SQL Server AlwaysOn Availability Groups. Reality is different.
The Average of its Parts
Think about this – in order for an AlwaysOn Availability Group to work for scale out reads, we have to get data from the write server to the replicas. The bottleneck isn’t how fast we can read; the bottleneck is how fast we can write.For synchronous replicas, every replica must be at least as capable as the the primary. If peak load on the primary is 16 gigabit per second, the synchronous secondaries need to be able to write at least 16 gigabits of data per second. To keep our application moving, and to handle potential spikes, we need to make sure that the secondaries are sized to be able to handle increases in load from the primary server. In short, we’re going to need to make sure that the secondary servers have the same storage as the primary.
Of course, you’re only going to use those secondaries for the same databases as the primaries, right? Nobody would ever want to add another database to one of those secondaries, right? Keeping in mind that we can only have two synchronous secondaries, how much faster can we make our SQL Server for synchronous data?
We can increase read performance, but write performance isn’t going to move.
The Average of its Parts… Over Time
The game changes when we bring asynchronous secondaries into the mix. With synchronous secondaries, all servers in the mix need to have the same write characteristics – if the primary has eight SSDs, the secondaries need to have eight SSDs (or at least the throughput of eight SSDs). When we’re using asynchronous secondaries, the secondaries have to be able to keep up with the primary’s workload on average.
To figure out what kind of hardware we need for the asynchronous secondaries, we need an SLA. Think about it like this: in order to make a hardware decision, we need to know how long our asynchronous secondaries can take to catch up. If we don’t know how far the secondaries can lag behind the primary, then we can’t accurately size the hardware. Conversely, once we have an SLA, we have to size all of the equipment to make sure that the async secondaries can never fall further behind than our SLA. Of course, that SLA is also going to have to include index rebuilds, bulk loads, and other activity that abuses the transaction log.
The Average of its Parts… with a Perfect Network
Assuming a perfect network, this all holds true. Once network congestion comes into play, performance guarantees go out the window.
During the design stage of building out an AlwaysOn Availability Group it’s critical that the network is taken into account as a critical component of the infrastructure. The network must be able to handle the load of multiple secondary nodes – network performance may cause more problems than poor disk performance.
Consider Your Environment in Aggregate
Before planning a SQL Server AlwaysOn Availability Group deployment, consider your environment as a whole. Make sure that you have the bandwidth available to stream regular operations between the primary and replicas as well as to stream under peak load. Remember – adding four additional servers may put considerable load on your network backplane. Also consider that these won’t be physically cheap servers; if you need performance from the primary server, the synchronous and asynchronous secondaries require enough power to meet your performance SLAs. This isn’t an activity for the faint of heart.
If you have an existing application and you want to scale with money, you may be able to get away with an AlwaysOn Availability Group deployment. You’ll be buying multiple servers, multiple SQL Server licenses, and a lot of high end switching equipment, but you’ll only have to change a few connection strings in your application.
If you’re considering scaling out and you’re willing to make code changes, consider Windows Azure SQL Database. In many cases, you’ll be able to scale out using database federations. Through the changes you’re likely to lose flexibility in how you tune T-SQL, perform database maintenance, and work within your schema. The end flexibility may be worth it, but there may be significant code changes to go along with the database changes.
If you’re considering scaling out and you’re willing to consider many options, take a step back and consider your application requirements. Think about the database features your application is using, how those features fit into different databases, and how those features can be spread across different services. There are a lot of distributed databases out there that may meet your requirements and provide better scale out functionality than Windows Azure SQL Database or AlwaysOn Availability Groups.
Learn more in our AlwaysOn Availability Groups resource page.
SQL Server 2012 introduced AlwaysOn Availability Groups, a feature intended to replace clustering, database mirroring, log shipping, replication, and other technologies by giving us one easy-to-manage feature for high availability, disaster recovery, and scale-out.
One year in, how’s it working out? I’ve done a lot of AG deployments, and in this 30-minute video, I explain some of the gotchas:
You can learn more at BrentOzar.com/go/alwayson.
Many companies are investigating a move into cloud hosting. With the recent release of SQL Server 2012, these very same companies are also looking at a migration to SQL Server 2012 to take advantage of readable secondaries. Combining Amazon Web Services and SQL Server 2012 seems like a perfect match. Unfortunately, there are some gotchas, like limited support for clustering, to keep in mind as you consider deploying SQL Server 2012 AlwaysOn Availability Groups in AWS.
What’s Wrong With Clustering?
AlwaysOn Availability Groups don’t require shared storage, but they do require Windows Server Failover Clustering. Let’s be clear: there’s nothing wrong with clustering.
There’s a white paper from Amazon Web Services that suggests clustering won’t work in AWS because clustering requires support for multicast networking. This changed with Windows Server 2008 – multicast networking is no longer part of Windows Server clustering. For those who don’t know, multicast networking is delivering one packet from the source computer to multiple destination computers. Multicast relies on the intermediate hardware making the decision when to make copies of the packets to send to the destinations and, in general, relies on the hardware making that decision at the latest possible moment. Because so much of the AWS infrastructure is virtual, a lot of these decisions become trickier for the intervening networking gear (which might also be completely virtual). Ultimately, this is why you’d want to avoid multicast networking in a purely virtual environment. Since Windows Server 2008 and newer don’t use multicast networking, it’s not an issue.
It’s Not You, It’s… You
If multicast networking isn’t what stops clustering from working in AWS, what does?
Every cluster has a cluster access point – a name and IP address used to administer the cluster. The cluster access point is a virtual IP address, when you set up the cluster access point, Windows creates a virtual IP address that is used to provide a way to access the active node in the cluster. This isn’t strictly necessary for SQL Server clustering, but it does make life easier when managing the underlying Windows Server Failover Cluster.
In addition to each server’s individual IP address, the cluster access point requires its own IP address. This is where things start to get sticky. When you assign static IP addresses in AWS, you don’t assign the IP address at the server level – you assign the static IP address to a virtual ethernet adapter and attach the ethernet adapter to your EC2 instance. Each of these Elastic Network Interfaces (ENI) can only be assigned one IP address, for now.
Creating an Availability Group, Is It Possible?
Since it isn’t possible to create virtual IP addresses it would be easy to conclude that it isn’t possible to create an availability group – during availability group set up SQL Server has to get in touch with the cluster control point. There’s a way around this limitation – along as you create the availability group on the current host server of the cluster, then you can create the availability group.
It is possible to create an availability group, however it isn’t possible to create a TCP listener for the same reason that it isn’t possible to create the cluster control point – it isn’t possible to have more than one IP address per ENI in AWS. This doesn’t mean that AlwaysOn Availability Groups are a non-starter in AWS, but it does mean that you won’t have access to the full AlwaysOn Availability Group functionality until later this year when Amazon introduces multiple IP addresses per ENI.
Faking Out the TCP Listener
Here’s what we know so far:
- It is possible to set up a Windows Server Failover Cluster in AWS.
- It is possible to set up AlwaysOn Availability Groups on our Failover Cluster in AWS.
- It is not possible to configure a Cluster Control Point or TCP Listener.
It is possible, though, to get much of the same benefit of the TCP Listener without using the SQL Server’s built-in functionality. A load balancer like HAProxy can be used to create groups of readable secondaries for different applications. True, your applications won’t be able to take advantage of the other TCP Listener features like specifying read only intent in the connection strings, but it will be possible to create multiple connections in HAProxy that will work the same as connections specifying read only intent. HAProxy has the ability to use sticky sessions, monitor connections, and detect when servers are offline and route traffic away from the dead server. People have been doing this with MySQL for years.
Where Do We Go From Here?
Teams using AWS who need SQL Server 2012 Always On Availability Groups can start making use of these features right now. All of the benefits of SQL Server 2012 Always On Availability Groups won’t be available, but it will be possible to use multiple readable secondaries for scale out reads. Many of the features of Always On Availability Groups can be configured using a load balancer like HAProxy for now and configured using native functionality once multiple IP addresses can be attached to a single ENI. With a little work and some patience, you’ll be able to use this feature right now and be ready for when AWS supports virtual IP addresses.
If you want to learn more about how to design, test, and manage this slick feature that offers high availability, disaster recovery, and scale-out reads, visit our AlwaysOn Availability Groups resources page.
What a fast week. I was in Dallas this past week doing a project, and the week simply flew by. I’ll be back up there again this coming week, which means I’ll miss the HoustonWireless monthly meeting for the third time in a row. Not thrilled about that, but I’ll live.
The server move to Dallas didn’t go as smoothly as I’d planned. I’m still basing my mail server out of Houston for the time being, while I figure out the finer points of Imail 7. I’m in the midst of changing over to Imail from Exchange, and I’m not exactly tickled pink with it. You get what you pay for – Exchange rocks. It’s a memory hog, but it’s got an awesome UI.