An Introduction to SQL Server Clusters

The options for high availability can get confusing. I was lucky enough to begin working with SQL Server clusters early in my career, but many people have a hard time finding simple information on what a cluster does and the most common gotchas when planning a cluster.

Today, I’ll tell you what clusters are, what they’re good for, and why I like to plan out my clusters in a very specific way. I’ll also give an overview of how clustering relates to the AlwaysOn Availability Groups feature in SQL Server 2012, and wrap up with frequently asked questions about clustering SQL Server.

What Type of Clustering are we Talking About?

There are lots of types of clusters out there. When we cluster SQL Server, we install one or more SQL Server instances into a Windows Failover Cluster. In this post I’m talking specifically about clustering SQL Server 2005 or later using Windows Server 2008 or later.

Key Concept: A Windows Failover Cluster uses shared storage– typically, this shared storage is on a SAN. When a SQL Server instance is installed on the cluster, system and user databases are required to be on the shared storage. That allows the cluster to move the SQL instance to any server (or “node”) in the cluster whenever you request, or if one of the nodes is having a problem.  There is only one copy of the data, but the network name and SQL Server service for the instance can be made active from any cluster node.

Translation: A failover cluster basically gives you the ability to have all the data for a SQL Server instance installed in something like a share that can be accessed from different servers. It will always have the same instance name, SQL Agent jobs, Linked Servers and Logins wherever you bring it up. You can even make it always use the same IPAddress and port– so no users of the SQL Server have to know where it is at any given time.

Here is a diagram of a SQL Server cluster. The cluster is named SQLCLUSTER01. It has two nodes (servers), which are named SQLCLU01NODE01 and SQLCLU01NODE02. People connect to the SQL Server instance at SQLCLU01A\SQL. The instance has been configured on port 1433.

Oh no! There’s been a failure in our environment!

Here’s what happened.

The SQLCLU01NODE01 server crashed unexpectedly. When this happened, the Windows Failover Cluster service saw that it went offline. It brought up the SQL Server services on SQLCLU01NODE02. The SQLCLU01A\SQL instance started up and connected to all the same databases on the shared storage– there’s one copy of the data, and it doesn’t  move. As part of the SQL Server startup, any transactions that were in flight and had not committed at the time of the crash were rolled back.

While this automatic failover was occurring, users could not connect to the SQLCLU01A\SQL instance. However, after it came back up they were able to resume operations as normal, and had no idea that a server was still offline.

Why You Care About Clustering

If you’re a business owner, manager, or DBA, you care about clustering because it helps keep your applications online more of the time— when done properly, it makes your database highly available.

Here are some ways that clustering makes your life easier:

  • Hardware failures are a nightmare on standalone servers. If a server starts having problems in a failover cluster, you can easily run your SQL Server instance from another node while you resolve the issue.
  • Applying security patches on a standalone server can be very tedious and annoying to the business: the SQL Server is offline while you wait for the server to reboot. By using failover clustering, you can apply patches with only brief downtimes for your application as you move your SQL Server instance to a different node.
  • Failover clusters can also give you an additional tool in your troubleshooting toolkit. Example: if you start seeing high latency when using storage and you’ve ruled out all the immediate candidates, you can fail to another node to try to rule out if it’s a problem with a per-node component like an HBA.
  • Clustering is transparent to the calling application. Lots of things with SQL Server “just work” with clustering, whereas they’re a little harder with other alternatives. With clustering, all of my databases, logins, agent jobs, and everything else that’s in my SQL Server instance fail over and come up together as a single unit— I don’t have to script or configure any of that. I can also cluster my distributed transaction coordinator and fail it over with my instance as well.

Gotchas and Notes for Planning a Cluster

Know What Clustering Doesn’t Do

The first gotcha is to be aware of what a failover cluster won’t help you with.

Clustering won’t improve your performance, unless you’re moving to more powerful servers or faster storage at the same time you implement clustering. If you’ve been on local storage, don’t assume moving to a SAN means a nirvana of performance. Also, clustering doesn’t guarantee that everything involved in your SAN is redundant! If your storage goes offline, your database goes too.

Clustering doesn’t save you space or effort for backups or maintenance. You still need to do all of your maintenance as normal.

Clustering also won’t help you scale out your reads. While a SQL Server instance can run on any node in the cluster, the instance is only started on one node at a time. That storage can’t be read by anyone else on the cluster.

Finally, clusters won’t give you 100% uptime. There are periods of downtime when your SQL Server instance is “failing over”, or moving between nodes.

Invest Time Determining the Right Naming Convention

You have a lot of names involved in a cluster: a name for the cluster itself, names for each of the servers in the cluster, and names for each SQL instance in the cluster. This can get confusing because you can use any of these names later on when connecting with Remote Desktop– so if you’re not careful, there may be times when you’re not entirely sure what server you’re logged onto! I have two general rules for naming:

First, make sure it’s obvious from the name what type of component it is– whether it’s a cluster, physical server, a SQL Server instance, or a Distributed Transaction Coordinator. I also recommend installing BGINFO to display the server name on the desktop for every server in the cluster.

Second, name everything so that if you later add further nodes or install another SQL Server instance onto the cluster, the names will be consistent.

Avoid Putting Too Many Nodes in One Cluster

I prefer to have only two or three nodes in a cluster. For example, if I need to cluster five SQL Server instances, I would put them in two failover clusters.

This requires a few extra names and IP Addresses overall, but I prefer this for management reasons.  When you apply patches or upgrades, you must make sure that each service on your cluster runs on each node successfully after you’ve applied the change. Having a smaller cluster means you don’t need to fail your instance over as many times after a change.

Don’t Assume Your Applications Will Reconnect Properly After Failover

Even though your SQL Server instance will come up with the same network name and IPAddress (if not using DHCP), many applications aren’t written to continue gracefully if the database server goes offline briefly.

Include application testing with your migration to a failover cluster. Even though the application doesn’t know it’s talking to a cluster (it’s a connection string like any other), it may not reconnect after a failover. I worked with one application where everything worked fine after a failover, except web servers stopped writing their log data to a database because they weren’t designed to retry after a connection failure. The data was written asynchronously and didn’t cause any failures that impacted users, but the issue wasn’t noticed immediately and caused the loss of some trending data.

“Active Active” Can Be Useful

My ideal cluster layout to work with is a two node cluster with identical hardware and two SQL Server instances on it. This is commonly called “Active Active” clustering, but that term is technically a no-no. Officially this is called a “Multi-Instance Failover Cluster.” Not quite as catchy, is it?

Many people think the ideal situation is to put their most important SQL Server instance on a two node cluster and leave the second node ready, waiting, and idle. So, why do I want a second SQL Server instance?

I like to put my critical, heavy hitter database on one of those instances in the cluster. I then want to take a couple of less critical, less busy databases and put them on the second instance. The perfect examples are logging databases. There are two requirements for these databases: first, they can’t require a large amount of memory or processor use to run well, because I absolutely have to know that these two instances can run successfully at peak load on a single node if required. Second, the databases on the “quiet” instance shouldn’t cause the whole application to go offline if they aren’t available.

Why do I like having a “quiet” instance? Well, whenever I need to apply updates to Windows or SQL Server, this is the canary I send into the coal mine first. You can perform rolling upgrades with failover clusters, which is great. But it’s even better to know that the first instance you fail over onto an upgraded node won’t take absolutely everything down if it has a problem.

Notes: Because of licensing costs, this option won’t always be realistic. If you go this route you have to make sure everything can stay within SLA if it has to run on a single node at your busiest times– don’t overload that “quiet” instance!

Re-Evaluate your SQL Server Configuration Settings

Revisit your configuration settings as part of your planning. For example, on a multi-instance cluster, you use the minimum memory setting for SQL Server to configure how your instances will balance their memory usage if they are on the same node.

Do I Have to Use Clustering to Use Availability Groups in SQL Server 2012?

This is an interesting question– don’t let it confuse you. We have a very cool new feature called Availability Groups coming in SQL Server 2012, which does offer  awesome scale-out read functionality. You’ll read in many places that it “requires Failover Clustering.”

This is true. In order to use the Availability Group feature in SQL Server 2012, the Failover Clustering feature must be enabled in Windows. This feature is only available in Datacenter and Enterprise edition of Windows Server, so that feature isn’t free.

But wait, there’s a catch! Even though you’re enabling the Failover Cluster feature, you are NOT required to have shared storage to use Availability Groups. You have the option to use a Failover Cluster in an Availability Group, but you can also run your Availability Groups with entirely independent storage subsystems if you desire. The feature is required because no matter what, Availability Groups will use parts of the Failover Clustering feature to manage a virtual network name and IP Address.

Frequently Asked Questions for Clustering SQL Server

Q: Can I install every SQL Server component on my cluster?
A: Nope. SQL Server Integration Services is not “cluster-aware” and can’t fail back and forth with your cluster.

Q: How long does it take to fail over?
A: There are several factors to consider in failover time. There’s the time for the SQL Server Instance’s service to go down on one node, be initiated on another node, and start up. This time for instances to start and stop includes normal database recovery times. If you need to keep failovers within an SLA, you’ll want to test failover times in a planned downtime, but also estimate in how long failover might be if it happened at peak load.

Q: Can I cluster a virtualized server?
A: Yes, you can create failover clusters with virtual servers with VMware or Hyper-V, and install SQL Server into it. I think this is great for learning and testing, but I’m not crazy about this for production environments. Before you go too far down this path, look at leveraging your hypervisor’s native components for high availability.

Q: Why do you make such a big deal about the shared storage?
A: Because not everyone has robust shared storage available. You want to make sure you’re using shared storage that has redundancy in all the right places, because in a failover cluster shared storage is a single point of failure, no matter how magical the SAN seems. This also means that if your data is corrupted, it’s going to be corrupted no matter which node you access it from.

Q: What’s the minimum number of nodes in a failover cluster?
A: One. This is called a single-node cluster. This is useful for testing purposes and in case you have a two node cluster and need to do a work on a node. You can evict a node without destroying the cluster.

Q: Can I use clustering for Disaster Recovery?
A: Yes, but it requires some fancy setup. Most SQL Server clusters are installed in the same subnet in a single datacenter and are suitable for high availability. If you want to look into multi-site clustering, “geo-clustering” became available with SQL Server 2008, and is being enhanced in SQL Server 2012. Note: you’ll need storage magic like SAN replication to get your Geo-cluster on.

Q: Does it matter which version of Windows I use?
A: Yes, it matters a lot. Plan to install your Windows Failover Cluster on the most recent version of Windows Server, and you need Enterprise or Datacenter edition. If you must use an older version of Windows, make sure it’s at least Server 2008 with the latest service packs installed. The Failover Clustering Component of Windows was rewritten with Server 2008, so if you run on older versions you’ll have fewer features and you’ll be stuck chasing old problems.

Q: What is Quorum?
A: Quorum is a count of voting members— a quorum is a way of taking attendance of cluster members who are present. The cluster uses a quorum to determine who should be online. Read more about quorum here.

Kendra Little

Kendra specializes in high availability and performance tuning. She is a Microsoft Certified Master in SQL Server-- the highest technical SQL Server Certification available. Kendra loves databases and software development more than long walks on the beach. Those cartoons in her blog posts? She draws 'em all. Read more and contact Kendra.

More Posts - Website

Follow Me:
TwitterFacebookLinkedInGoogle PlusYouTube

15 Responses to An Introduction to SQL Server Clusters
  1. Michael J Swart
    February 14, 2012 | 10:08 AM

    Ahh, so that’s what this tweet from you meant.

    “@Kendra_Little: I’m dedicating the next 30 minutes to @MJSwart and @BrentO . I’m drawing a portrait of an EqualLogic.”

    (BTW, I really like the long format).

    • Kendra Little
      February 14, 2012 | 10:49 AM

      Thanks, Michael. Right after I wrote it I found a post of advice on writing by David Ogilvy. One of the items is “never write anything longer than two pages.”

      Well, maybe Ogilvy wouldn’t have read to the end, but I’m glad you like it!

  2. Kyle D.
    February 14, 2012 | 1:47 PM

    Excellent post!

    Can you elaborate a little bit on using the minimum memory setting to balance memory usage in a failover scenario? There is a lot of conflicting information out there on the best way to configure minimum and maximum memory settings in a Multi-Instance Failover Cluster.

    • Kendra Little
      February 16, 2012 | 1:12 AM

      Hi Kyle,

      That’s a great question. The key is the way dynamic memory management works. I have had very good luck with multi-instance clusters when configuring the minimum memory values for the SQL Server instances.

      The key is to make sure that the more important instances have the right minimum amount, and that your total possible sum of minimum values is not too high. In my experience my clusters have done a very good job of balancing memory after failovers when configured in this way. (Note that I’m just talking SQL Server 2005 and later here!)

      Why doesn’t SQL Server always share memory well with other applications? Well, not all applications share memory in ways that work well with it! More information is here: http://msdn.microsoft.com/en-us/library/ms177455.aspx

      Note: I do disagree with this link in one way. I personally prefer to always set a maximum server memory for SQL Server to keep memory space free outside of what that setting limits. That doesn’t actually represent ALL the memory that even SQL Server itself consumes, and I need to leave extra room for any agent jobs that may be running, backup copies, etc, as well as the OS.

      As with any configuration change, I always recommend careful testing, change control, and verification for modifying memory settings. (I know, I know, but I have to say it. Settings are a big deal!)

      • Kyle Doyle
        February 16, 2012 | 2:34 PM

        Thanks for the reply.

        I am currently testing a few different configurations in an attempt to allow less important instances to use any extra memory on each node during normal operation. At the same time, I want to make sure that the most important instances have the memory they need in a failover scenario.

        Like you, I prefer to always set the maximum server memory. This is leading me towards making a job that runs on startup to determine which node the instance is on and adjust the max memory setting accordingly.

  3. Bodhisatya Mookherjee
    February 14, 2012 | 10:38 PM

    Excellent article. simple, elegant and educative!!!

  4. DBA100
    February 15, 2012 | 10:39 PM

    But the cluster nodes with SQL server service is not active /active like the Oracle RAC, right?

    • Kendra Little
      February 16, 2012 | 1:05 AM

      Correct, that’s why the term “Active Active” is not technically correct here. This is a failover cluster. A failover can have more than one SQL Server instance installed on it, but they are separate instances using totally separate storage as I have designed it here.

      SQL Server’s Availability Group feature that’s releasing with SQL Server 2012 allows readable secondaries– which is more like the feature you’re describing from Oracle RAC. Availability Groups can use the Failover Clustering feature in Windows, but they do more and different things.

  5. Perry Whittle
    February 16, 2012 | 6:06 AM

    Hi

    the term crash is a little generic lol.
    It depends on what the resource failure was, each resource has a set of policies (they differ between Windows 2003 and Windows 2008 clusters), if the resource restarts successfully no fail over will occur

    Regards

    Perry

    • Kendra Little
      February 20, 2012 | 10:17 AM

      Hi Perry,

      That’s a very fair point– in the example I basically meant that the first node bluescreened. Or perhaps someone shut down the wrong server by accident. :)

      Thanks for mentioning this about specific resource failures. It does absolutely matter what resource has failed, as well as the dependency tree that resource is part of.

  6. David Eaton
    February 16, 2012 | 5:28 PM

    Disaster recovery isn’t all that fancy to setup.

    All that needs to happen is to have the remote cluster and the local cluster share a IP address space with additional NIC cards. There are easy ways to bridge subnets to accomodate this type of cluster failover for disaster recovery.

  7. Lester
    February 16, 2012 | 5:55 PM

    Wow, I am schooled. Thanks Kendra. This is very informative. I did not know about clustering and did not bother to surf about it until your writing appeared on my feed. I used to have an understanding that Mirroring is somehow similar to clustering. :) But that’s totally not true.

    Some inquiries:
    1. On Mirroring, does it require human-intervention in order to make the Secondary SQL server to be active when the primary SQL server goes down?

    2. T-Logging works good for me so far. I prefer to manual-restore the last 1-3 log files rather than automating it via a script. Any thoughts?

    3. Do you know any reference(book, blog, etc…) that thoroughly compare sql server disaster plan?(clustering, mirroring, tlogging, etc…)

    Thanks Kendra

    • Kendra Little
      February 20, 2012 | 10:06 AM

      Hi Lester,

      Thanks for your comment! Mirroring and clustering are indeed quite different.

      1. Mirroring does have an automatic failover option– it involves using synchronous mirroring and a witness.

      2. Log shipping can be a great solution– it’s all about finding the solution that fits your own business correctly and meets the Recovery Point Objective (RPO) and Recovery Time Obective (RTO) that you need!

      3. Brent did a great webcast comparing disaster recovery options. He gives an overview of different methods along with pros and cons of each one. Check it out here: http://www.brentozar.com/archive/2011/12/sql-server-high-availability-disaster-recovery-basics-webcast/

  8. Stephen LaRochelle
    February 22, 2012 | 1:28 PM

    Thanks for the article, very timely. Can you explain a little more on your reluctance to cluster a virtualized server? We’re looking into doing just that; our first implementation of a cluster to boot, in a production environment.

  9. Sam
    April 20, 2012 | 10:36 AM

    Hi Kendra,
    Your Article is interesting.I am new in setting up DB. I have simple question may be silly but if I am trying to set multiple instances.
    For example type of scenario is below,
    I have server 1 with instances A and B and Server 2 with instances A and B also running and wanted both servers instances should be connected to one shared storage SAN( Configuration type I want both of them Active at same time).

    Q1. Is it practically possible that both servers Instances will be active at the same time that is Instance A on Server 1 and 2 active and performing DB operations or in second situation one instance of A will be active at a given time and other one will wait for that instance to die before starting any operations this will become exact situation described above by you.

    -> Then what will be the difference between the multiple instances active at same time with single instance that is applied across the servers where the one instance die and other one will start processing.

    Q2. Is it possible to configure nodes with SQL instances to be set with priority that is Instance A should always run from server 1 if server 1 is available.

    Q3. Which method do you prefer for back up and recovery. I have one very critical database which data cannot be lose in any situation. and three other Database they are important but a little data loss is affordable.

    And if you have any helpful documentation available please provide links.

    Please consider all my questions valid.

    Thank you

Leave a Reply

Wanting to leave an <em>phasis on your comment?

Notify me of followup comments via e-mail. You can also subscribe without commenting.