Remember that weirdo in high school who had no social skills? Couldn’t get a date, worked at the fast food joint as a fry cook, face covered with zits – you know the one.
Okay, actually, it was us. Anyway, the point is, we got our act together, didn’t we? So did Windows Failover Clustering. When you weren’t looking, Windows Server 2008 cleaned up its clustering, and now it’s the new hotness that gets all the dates. It’s time to revisit what you thought you knew about Windows clusters.
Clusters Require Identical Hardware and Configuration
When I was your age, I had to look up every single piece of server hardware on the Windows Hardware Compatibility List (HCL) to make sure it was tested and approved. I either had to buy approved clusters as a package, or assemble them from detailed hardware lists. The new servers I wanted were never on the HCL, or they were way too expensive. Even when I got the goods, I had to assemble everything and then just hope it worked right. Inevitably, it didn’t, but the hardware usually wasn’t to blame – it was my own stupidity.
With Windows 2008 and newer, you can slap together pretty much any old hardware, run the Validate a Cluster wizard, and know right away that…uh, you’ve got a lot of work to do. I know you’ve got passionate feelings about how wizards are evil, but the Validate a Cluster wizard is AWESOME. It tests just about all of the requirements for a failover cluster and gives you a simple report of what’s broken and why you need to fix it.
You don’t need identical hardware on each node anymore – not by a long shot – but the configuration rules are still really, really specific. Some rules are guidelines (like the suggestion of multiple network cards to mitigate risks of a patch cable coming loose) but some are outright requirements.
See, this is one of my favorite things about the wizard: by default, if your cluster doesn’t pass the validation wizard, SQL Server won’t install. This is a DBA’s best friend in the war for systems excellence. If your company has separate Windows, storage, and networking teams, you can run the wizard before installing SQL Server. If it doesn’t pass, you can shrug, pass the ball back to the other teams to get the setup right, and work with them to get ‘er done.
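If you'd rather skip the clicking, you can kick off those same validation tests from PowerShell with the FailoverClusters module (Windows Server 2008 R2 and later). Here's a minimal sketch; the node names are made up, so substitute your own:

```powershell
# Runs the same tests as the Validate a Cluster wizard and produces a
# report you can hand to the Windows, storage, and networking teams.
# Node names here are hypothetical.
Import-Module FailoverClusters
Test-Cluster -Node SQLNODE1, SQLNODE2
```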
Clusters Need a Heartbeat Network
Cluster nodes used to keep tabs on each other, and if they couldn’t reach a node, they’d freak out. To minimize the freakiness, we used a separate heartbeat network that carried nothing but cluster chatter. In simple two-node clusters, this was often done by running a crossover cable between the two nodes, which even eliminated the possibility of switch failures. This was a giant pain, and almost nobody got the configuration quite right – 258750 was one of the few Microsoft knowledge base article numbers I actually knew by heart.
Windows Server 2008’s failover cluster networking is less freaky and more friendly: it’ll use whatever networks it can to reach the other nodes. This has its own drawbacks – we need to make sure that any node can reach any other node over any available network, and we need to make sure that all of our networks are highly available. That highly available part is key – preferably we’ve got two network cards in a teamed pair, and we test before go-live to make sure the cluster stays up if a patch cable goes down.
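To see how your cluster actually plans to use those networks, peek at them in PowerShell. A quick sketch (the network name in the last line is hypothetical):

```powershell
# List every network the cluster knows about and what it's allowed
# to carry on each one.
Import-Module FailoverClusters
Get-ClusterNetwork | Format-Table Name, State, Role

# Role values: 0 = not used by the cluster, 1 = cluster traffic only,
# 3 = cluster and client traffic. To dedicate a network to cluster chatter:
(Get-ClusterNetwork "Cluster Network 2").Role = 1
```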
Clusters Require Shared Storage (SAN)
The entire history of Windows clustering has revolved around a shared set of drives that all of the nodes could access. If one server crashed, another node reset the drive ownership, took control, and fired up SQL Server.
You can still build shared storage failover clusters, but Windows Server 2008 and 2012 both manage to run some clustered applications with no shared storage devices. The app has to be designed to work without shared storage, like SQL Server 2012’s new AlwaysOn Availability Groups. Heck, we can even fake out traditional shared-disk clustering solutions by using UNC paths for our databases. Jonathan Kehayias wrote an in-depth tutorial on how to build a SQL Server cluster on a NAS.
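For a taste of the UNC trick, here's a rough sketch using Invoke-Sqlcmd. The instance name, NAS share, and database are all hypothetical, and SQL Server 2012 supports SMB paths out of the box (older versions needed trace flag 1807), so test this in a lab before you trust production data to it:

```powershell
# Create a database whose files live on a network share rather than
# local or SAN storage. Server, share, and database names are made up.
Import-Module SQLPS -DisableNameChecking

Invoke-Sqlcmd -ServerInstance "SQLNODE1" -Query @"
CREATE DATABASE NasDemo
ON (NAME = NasDemo_data, FILENAME = '\\nas01\sqldata\NasDemo.mdf')
LOG ON (NAME = NasDemo_log, FILENAME = '\\nas01\sqldata\NasDemo_log.ldf');
"@
```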
Cluster Quorums Manage Themselves
Back in the days of the quorum drive, all of the cluster nodes got together and decided who was boss simply based on who could see the quorum drive on shared storage. We could move the cluster ownership around by passing the quorum drive around.
Today, since we don’t necessarily have shared storage, we can’t rely on a quorum drive. Windows Server now offers a variety of quorum options including node majority, node and disk majority, and my favorite at the moment, node and file share majority. This means a file share can act as a voting member of the team, enabling two-node clusters with no shared storage.
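Here's roughly what that looks like in PowerShell. The witness share path is made up, but the cmdlets are the real FailoverClusters ones; the share just needs to be reachable by every node (and highly available itself):

```powershell
Import-Module FailoverClusters

# See what quorum model the cluster is using today
Get-ClusterQuorum

# Switch to node and file share majority, pointing at a file share witness
Set-ClusterQuorum -NodeAndFileShareMajority "\\fileserver01\ClusterWitness"
```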
Configuring quorum – especially managing non-voting members of the quorum – is a tricky but necessary part of building a solid cluster. I’ve already helped a few folks bring their clusters back online after they accidentally took the whole thing down by rebooting just one (seemingly) passive node. We have to understand our cluster’s quorum method, document what happens if one of the members is rebooted, and ensure that all team members know what needs to happen during patch windows.
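One quick sanity check before a patch window: list each node's vote. A minimal sketch (NodeWeight shows up in Windows Server 2012, or in 2008 R2 with hotfix KB2494036):

```powershell
# Show which nodes actually have a quorum vote before you start rebooting.
Import-Module FailoverClusters
Get-ClusterNode | Format-Table Name, State, NodeWeight
```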
Cluster Management is Painful and Obscure
If you’ve had the misfortune of memorizing cluster.exe commands just to get your job done, raise your hand. No, wait, put your finger back down, that’s not appropriate. This is a family web site, and we don’t need to hear your horror stories about reading obscure knowledge base articles in the dead of night.
The bad news: this particular point is still true. You’re still going to be managing clusters at the command line.
The good news is that for the most part, you can use PowerShell instead of cluster.exe. This means that as you learn to manage clusters, you’ll also be learning a language that can be used to manage more SQL Servers simultaneously, plus Windows, VMware, Exchange, and lots of other things that you probably didn’t want to have to learn. Okay, so that’s also still kinda bad news – but the good news is that sysadmins will find cluster management more intuitive, because they can use the language they already know.
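To give you a flavor, here are a few everyday FailoverClusters cmdlets standing in for old cluster.exe incantations. The SQL Server group and node names are hypothetical:

```powershell
Import-Module FailoverClusters

Get-Cluster                # basic cluster properties
Get-ClusterGroup           # what's running, and where

# Manual failover of a clustered SQL Server instance to another node
Move-ClusterGroup "SQL Server (MSSQLSERVER)" -Node SQLNODE2

# Dump the cluster log for troubleshooting
Get-ClusterLog -UseLocalTime
```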
Want to keep learning? Check out our clustering resources, including our SQL Server 2012 AlwaysOn Availability Groups checklist. It’s built atop clustering, and it’s even more cool (but complex).