
Remember that weirdo in high school who had no social skills?  Couldn’t get a date, worked at the fast food joint as a fry cook, face covered with zits – you know the one.

Okay, actually, it was us.  Anyway, the point is, we got our act together, didn’t we?  So did Windows Failover Clustering.  When you weren’t looking, Windows Server 2008 cleaned up its clustering, and now it’s the new hotness that gets all the dates.  It’s time to revisit what you thought you knew about Windows clusters.

Clusters Require Identical Hardware and Configuration

No, that’s not me. Although it’s pretty close.

When I was your age, I had to look up every single piece of server hardware on the Windows Hardware Compatibility List (HCL) to make sure it was tested and approved.  I either had to buy approved clusters as a package, or assemble them from detailed hardware lists.  The new servers I wanted were never on the HCL, or they were way too expensive.  Even when I got the goods, I had to assemble everything and then just hope it worked right.  Inevitably, it didn’t, but the hardware usually wasn’t to blame – it was my own stupidity.

With Windows Server 2008 and newer, you can slap together pretty much any old hardware, run the Validate a Cluster wizard, and know right away that…uh, you’ve got a lot of work to do.  I know you’ve got passionate feelings about how wizards are evil, but the Validate a Cluster wizard is AWESOME.  It tests just about all of the requirements for a failover cluster and gives you a simple report of what’s broken and why you need to fix it.

You don’t need identical hardware on each node anymore – not by a long shot – but the configuration rules are still really, really specific.  Some rules are guidelines (like the suggestion of multiple network cards to mitigate risks of a patch cable coming loose) but some are outright requirements.

See, this is one of my favorite things about the wizard: by default, if your cluster doesn’t pass the validation wizard, SQL Server won’t install.  This is a DBA’s best friend in the war for systems excellence.  If your company has separate Windows, storage, and networking teams, you can run the wizard before installing SQL Server.  If it doesn’t pass, you can shrug, pass the ball back to the other teams to get the setup right, and work with them to get ‘er done.

Clusters Need a Heartbeat Network

Cluster nodes used to keep tabs on each other, and if they couldn’t reach a node, they’d freak out.  To minimize the freakiness, we used a separate heartbeat network that didn’t handle any other traffic than just cluster chatter.  In simple two-node clusters, this was often done by running a crossover cable between the two nodes, which even eliminated the possibility of switch failures.  This was a giant pain, and almost nobody got the configuration quite right – 258750 was one of the few Microsoft knowledge base article numbers I actually knew by heart.

Windows Server 2008’s failover cluster networking is less freaky and more friendly: it’ll use whatever networks it can to reach the other nodes.  This has its own drawbacks – we need to make sure that any node can reach any other node over any available network, and we need to make sure that all of our networks are highly available.  That highly available part is key – preferably we’ve got two network cards in a teamed pair, and we test before go-live to make sure the cluster stays up if a patch cable goes down.
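One way to picture that "any node, any network" rule is as a checklist of node pairs per network.  Here's a rough Python sketch, assuming you've already collected connectivity results by hand or by script.  The node names, network names, data layout, and the missing_links function are all made up for illustration; this is not something Windows or the validation wizard exposes.

```python
from itertools import combinations


def missing_links(nodes, networks, reachable):
    """List every (network, node_a, node_b) pair with no confirmed connectivity.

    `reachable` is a set of (network, frozenset({node_a, node_b})) entries
    recording the pairs that were tested and could talk to each other.
    """
    gaps = []
    for network in networks:
        for node_a, node_b in combinations(sorted(nodes), 2):
            if (network, frozenset((node_a, node_b))) not in reachable:
                gaps.append((network, node_a, node_b))
    return gaps


# Hypothetical three-node cluster with two networks.
nodes = ["SQLNODE1", "SQLNODE2", "SQLNODE3"]
networks = ["public", "backup"]

# Everything was confirmed except SQLNODE2 <-> SQLNODE3 on the backup network.
reachable = {
    ("public", frozenset(("SQLNODE1", "SQLNODE2"))),
    ("public", frozenset(("SQLNODE1", "SQLNODE3"))),
    ("public", frozenset(("SQLNODE2", "SQLNODE3"))),
    ("backup", frozenset(("SQLNODE1", "SQLNODE2"))),
    ("backup", frozenset(("SQLNODE1", "SQLNODE3"))),
}

gaps = missing_links(nodes, networks, reachable)
for network, node_a, node_b in gaps:
    print(f"No confirmed link on {network}: {node_a} <-> {node_b}")
```

An empty list means every pair was confirmed on every network.  It says nothing about whether those networks are highly available, so the pull-the-patch-cable test before go-live still matters.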

Clusters Require Shared Storage (SAN)

As the film business died, Kodak chose…poorly.

The entire history of Windows clustering has revolved around a shared set of drives that all of the nodes could access.  If one server crashed, another node reset the drive ownership, took control, and fired up SQL Server.

You can still build shared storage failover clusters, but Windows Server 2008 and 2012 both manage to run some clustered applications with no shared storage devices.  The app has to be designed to work without shared storage, like SQL Server 2012’s new AlwaysOn Availability Groups.  Heck, we can even fake out traditional shared-disk clustering solutions by using UNC paths for our databases.  Jonathan Kehayias wrote an in-depth tutorial on how to build a SQL Server cluster on a NAS.

Cluster Quorums Manage Themselves

Back in the days of the quorum drive, all of the cluster nodes got together and decided who was boss simply based on who could see the quorum drive on shared storage.  We could move the cluster ownership around by passing the quorum drive around.

Today, since we don’t necessarily have shared storage, we can’t rely on a quorum drive.  Windows Server now offers a variety of quorum options including node majority, node and disk majority, and my favorite at the moment, node and file share majority.  This means a file share can act as a voting member of the team, enabling two-node clusters with no shared storage.

Configuring quorum – especially managing non-voting members of the quorum – is a tricky but necessary part of building a solid cluster.  I’ve already helped a few folks bring their clusters back online after they accidentally took the whole thing down by rebooting just one (seemingly) passive node.  We have to understand our cluster’s quorum method, document what happens if one of the members is rebooted, and ensure that all team members know what needs to happen during patch windows.
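If the vote math feels abstract, here's a rough sketch in Python.  It is not anything Windows ships: the function names and member names are made up for illustration.  It assumes the plain majority rule, where the cluster stays up only while more than half of the configured votes are online, and it ignores anything fancier your version of Windows might do.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of online votes that is strictly more than half."""
    return total_votes // 2 + 1


def survives_losing(voters: set, lost: set) -> bool:
    """True if the cluster keeps quorum after the members in `lost` go offline.

    Only members that hold a vote count toward quorum, so losing a
    non-voting member changes nothing.
    """
    remaining = voters - lost
    return len(remaining) >= votes_needed(len(voters))


# Two nodes plus a file share witness: three votes, two needed.
with_witness = {"SQLNODE1", "SQLNODE2", "FILESHARE"}
print(survives_losing(with_witness, {"SQLNODE2"}))

# Two voting nodes and no witness: two votes, two needed.
no_witness = {"SQLNODE1", "SQLNODE2"}
print(survives_losing(no_witness, {"SQLNODE2"}))
```

Under that rule, the first case keeps quorum and the second does not, which is how rebooting one "passive" node can take everything down.  Working through your own list of voting members this way is a cheap thing to put in the patch-window runbook.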

Cluster Management is Painful and Obscure

I don’t need to learn PowerShell. Betty handles that for me.

If you’ve had the misfortune of memorizing cluster.exe commands just to get your job done, raise your hand.  No, wait, put your finger back down, that’s not appropriate.  This is a family web site, and we don’t need to hear your horror stories about reading obscure knowledge base articles in the dead of night.

The bad news is that this particular point is still true.  You’re still going to be managing clusters at the command line.

The good news is that for the most part, you can use PowerShell instead of cluster.exe.  This means that as you learn to manage clusters, you’ll also be learning a language that can be used to manage more SQL Servers simultaneously, plus Windows, VMware, Exchange, and lots of other things that you probably didn’t want to have to learn.  Okay, so that’s also still kinda bad news – but the good news is that sysadmins will find cluster management more intuitive, because they can use the language they already know.

LEARN MORE AT BrentOzar.com/go/alwayson

Including our SQL Server 2012 AlwaysOn Availability Groups checklist.  It’s built atop clustering, and it’s even more cool (but complex).

  1. Clustering has come a long way since NT indeed – now you can have a cluster on your PC (if the CPU supports virtualization). All it takes is Hyper-V and 3-4 VMs.

  2. Nice post, Brent! As you said, clusters are way too easy to set up and manage with 2008.

    Adding a disk as a dependency for a resource (say, SQL) without recycling the service was one of the coolest things to happen starting with 2008 :-)

  3. I love that Brent is pushing PowerShell. Did not think I would ever see the day. Great article and timely as I think I will be building a new SQL Server cluster very shortly. Thanks for all the great info you put out.

  4. I’ve only worked with SQL clustering since Server 2003/SQL 2005 but I can attest to the facts you’ve laid out in this post. You mentioned that admin is still painful and I’d stress that folks shouldn’t assume that easy setup means easy maintenance. Without going into great detail many admin tasks are pretty simple (GUI or PS) once things are set up but you can get deep in the weeds very quickly. I had a very painful 48 hrs over a weekend a few months back that started with a fairly simple operation of removing a disk resource from a clustered instance of SQL Server. I now have a facial tic and an obvious limp ;-)

  5. Thanks for the consistently valuable articles.
    You crack me up.

  6. Hi Brent,

    I have a question regarding the virtual file latency for our production database. I pulled up some stats using the DMV sys.dm_io_virtual_file_stats and found high read and write latencies for the MDF file placed on a local physical machine partition. We checked for bad sectors and did not find any. Also, we have a lot of free space. We ran defragmentation and re-indexing jobs, but the production MDF file latencies are still high. Other MDFs on the same physical partition have fine values for virtual file stats. Could you provide your opinion here? Thanks.

    • Abhinav – does your production MDF have significantly *more* reads and writes than the other files? Are the other databases simply not doing that much activity? Or it may also have to do with the time of day when the slow reads and writes are happening.

      • Hi Brent,

        Yes, since this is our production database, this MDF does have a high read and write load compared to the other databases. But using the DMV, the read latency I got was around 65 and the write latency around 270. Is that normal? Another database on the same partition, although not used as heavily as this one, had values of 10 and 15 respectively.

        Thanks for helping out.

        • That sounds pretty high on the write latency. Generally I get worried if I see write latencies over 100ms. If you’d like consulting help with this, click the Contact link at the top of the page and we can work with you on a SQL Server health check to narrow down the root cause, or check out the storage chapter that I wrote in the book Pro SQL Server 2008 Internals and Troubleshooting.

  7. Great post, Master Brent. One thing that a lot of people implementing Windows Server 2008 failover clusters need to understand is that just because it is supported to have a single network adapter on each node doesn’t mean it’s OK. I’ve dealt with customers who have highly critical databases running on clusters with only ONE network card. Since the Failover Cluster Validation Wizard simply flags this as a Warning and not an Error, they go ahead and build the cluster anyway. They forget the reason they have the cluster in the first place – high availability. Even if you have multiple NICs per node that are teamed up, how sure are we that the network switches are redundant and highly available? I’ve seen DR exercises where only the servers are tested but not the underlying network architecture. Only when the network switches themselves fail do they realize that they are not at all highly available. I still recommend having a dedicated network for the heartbeat communication, and if the customer can guarantee that the network layer is highly available, then I’ll be happy with a NIC teaming implementation.

    • Edwin – thanks, sir! Yep, when it really counts, you want redundancy everywhere. Same thing with redundant power distribution units in the datacenter – ideally, I want every server plugged into separate ones, running off separate battery supplies, etc.

  8. I’m with you on having redundancy when it really counts. Unfortunately, most IT professionals don’t think beyond the scope of their job responsibilities. Things like HVAC, power sources, documentation, etc. should be part of the entire HA/DR stack. I remember dealing with a cluster that blew up simply because the AC broke down and turned the temperature up in the datacenter. We had to bring in industrial fans and power down non-production, non-critical servers just to keep the temperature down a bit. This was the driving motivation behind my 24HoP presentation last year
    http://www.sqlpass.org/24hours/fall2011/SessionsbySchedule/DisasterRecoveryIsNotJustAboutTechnology.aspx

  9. Hi

    Thanks for the nice post. I have been using a disk cluster for many years for an Oracle database and for server redundancy. Now we are thinking of using Windows 2012 but eliminating the disk cluster. Is it possible to have server redundancy without server? How about the shared storage where we need to put the Oracle database?

    Thanks,

    • Hello,

      I think I may not have understood your question completely, but I’ll give it a shot.

      If you’re asking if you can have redundancy for a SQL Server without shared storage, the answer is yes– check out Brent’s video on high availability and disaster recovery for SQL Server here: http://www.brentozar.com/go/fail/

      If you’re asking about clustering Oracle, I can’t help on that one!

      • Thanks for your comment. Actually, it is for Oracle 11g Standard Edition, not SQL Server. Do you have a link on building server redundancy (forget about Oracle for now) without a disk cluster? I will start from there!

        Thanks

  10. How much does it cost to install cluster server for a company with less than 100 people?

    • Lyang – well, cluster installation costs don’t usually depend on the number of people, but more on the size of the application, the requirements for Recovery Point Objective (RPO) and Recovery Time Objective (RTO), scale-out needs, number of nodes, and so forth.

  11. Thanks for this article, I enjoyed reading.
    I am a bit late to this POST.
    I wonder when exactly Microsoft will launch the final release of its CLUSTERING !
    I would like to add that clustering is not the ultimate solution for our business continuity…better is backup.

    GEO~

    • Geo – backups are only a part of the solution. After all, if you experience an outage, you can’t usually wait for a restore to finish before the business comes back online. :-D
