Maybe You Shouldn’t Even Be Using Clustering or AGs.

Backup and Recovery

Sandra Delany (LinkedIn) wrote a well-thought-out blog post called, “Should a SQL Server DBA Know Windows Clustering?” She’s got about 20 years of DBA experience, and she works for Straight Path (a firm I respect) as a consultant. You can probably guess based on her background that yes, she believes you should know how to set up, configure, and troubleshoot Windows clustering. It’s a good post, and you should read it.

But… I don’t agree.

Two things. First, I have a problem with any blog post (even my own) that says, “If you call yourself an X, you should definitely know Y.” The term “DBA” encompasses a huge variety of jobs, held by people with a huge variety of seniority levels. If someone’s in their first year in a DBA job, maybe even their fifth, I don’t necessarily expect them to know Windows clustering.

For example, I once (briefly) worked at a global company where the DBAs weren’t even given permissions to glance at the cluster. If the SQL Server service wouldn’t start, the DBAs had to transfer the issue to the Windows team, who handled clustering. The company’s logic was that clustering is a foundational part of Windows, and it doesn’t really have anything to do with SQL Server – and in fact, it’s reused by other cluster-savvy applications. Clustering troubleshooting is a giant pain in the ass that involves Windows logs, DNS, IP addresses, etc. – all things that DBAs aren’t good at, but the Windows team is (or at least should be).

Which brings me to another blog post…

Chrissy LeMaire (LinkedIn, Bluesky), the creator of the powerful and popular dbatools PowerShell module, wrote a solid blog post called Have You Considered Not Using SQL Server High Availability? You should read that post too.

A lot of you are full-time database administrators, and you’re already taking a deep breath in anticipation of yelling back at the screen, but hang on a second.

First, your SQL Servers still all need disaster recovery. That’s different. When we say disaster recovery, we’re usually talking about things like native SQL Server backups, log shipping, storage replication, etc. These are techniques that you can use to rebuild the SQL Server in a different place, like after a ransomware attack or a natural disaster. Nobody’s suggesting you get rid of that.
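At its simplest, that disaster recovery foundation is just scheduled native backups. As a minimal T-SQL sketch – the database name and backup share are placeholders, not recommendations:

```sql
-- Full backup nightly; COMPRESSION and CHECKSUM are cheap insurance.
BACKUP DATABASE [YourDatabase]
TO DISK = N'\\backupshare\sql\YourDatabase_FULL.bak'
WITH COMPRESSION, CHECKSUM, INIT;

-- Frequent log backups cap your data loss (RPO) for databases
-- in the FULL recovery model.
BACKUP LOG [YourDatabase]
TO DISK = N'\\backupshare\sql\YourDatabase_LOG.trn'
WITH COMPRESSION, CHECKSUM;
```

How often you run that log backup is what determines how much data you can lose – no clustering knowledge required.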

We’re specifically talking HA features here: failover clustered instances (FCIs), Always On Availability Groups, and database mirroring. Chrissy writes about the operational challenge of those features.

HA features require 2 kinds of labor from 4 kinds of people. Chrissy’s post points out that the teams who manage the databases, the storage, Active Directory/DNS, and networking all have to get involved. I’d add that it requires 2 kinds of labor from all of these people: both planned and unplanned/chaotic. When there’s a production database outage, there’s a lot of finger pointing, and management demands that everybody drop their work and jump into conference calls. Everybody starts stabbing at various switches and dials, groping blindly and wasting time, until things go back to normal – and the next server has its next emergency.

Small companies don’t have 4 kinds of people. They just have a core handful of IT people who do everything. They’re experts in how the entire stack is configured at this shop, but they’re not experts in all of the underlying technologies. When things go wrong, the work is usually single-threaded, dependent on the one person unlucky enough to be on call. That person is even more likely to stab at various switches and dials, making unplanned, chaotic changes that end up making the environment even less stable over time.

Virtualization provides pretty dang good HA for many failures, in many shops. Properly configured, it protects you from individual host hardware failures and single network cable problems. No, it doesn’t protect you from a bad Windows or SQL Server patch, but in small shops, they don’t do a lot of patching anyway. (Ease up on the outrage – I’ve seen your SQL ConstantCare® data, I know you’re several CUs behind, and I know you’ve muted that recommendation to patch.)

Virtualization HA is easier to manage. It’s just one technology that works to protect all of your VMs. That’s less learning that the overworked staff have to do, and besides, they have to learn it anyway to protect the rest of your servers. As long as they’re using it for everything else, they might as well lean on it to protect SQL Server as well.

So when clients are talking to me about easy ways they can improve their uptime, and they’re already running SQL Server in a VM, we take a step back and look at their virtualization high availability setup. If that’s working well, I explain the part about the net-new planned & unplanned work for all the different roles, and then I ask about their on-call rotations, and their plans to hire more staff in order to handle this net-new work.

If they haven’t been adding staff, and don’t plan to, then I’d rather have the staff focus on improving their ability to rapidly restore SQL Server backups, provision & configure new servers, and just generally automate their environment. That’ll come in handy more often: it helps with disaster recovery, ransomware rebuilding, recovering from “oops” delete queries, and just generally reacting to day-to-day issues.
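If you’re going to lean on restores instead of HA, practice them regularly. Here’s a sketch using Chrissy’s dbatools – the instance names and backup path are placeholders for your own environment:

```powershell
# Restore the most recent full + log chain from a backup share onto a
# scratch instance, proving the backup chain actually restores end to end.
Restore-DbaDatabase -SqlInstance 'SCRATCH01' `
    -Path '\\backupshare\sql\YourDatabase' -WithReplace

# Or let dbatools restore and integrity-check last night's backups for you.
Test-DbaLastBackup -SqlInstance 'PROD01' -Destination 'SCRATCH01'
```

A team that runs something like this on a schedule knows their real recovery time, which is worth more than an untested availability group.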

In summary:

  • If you don’t know how to troubleshoot DNS, file shares, Windows cluster validation, or PowerShell, then Chrissy’s blog post is right for you, and you should probably try virtualization for high availability.
  • If your company’s HA/DR needs require Availability Groups and/or failover clustered instances, then Sandra’s blog post is right for you, and you probably have a big learning journey ahead.

2 Comments

  • FCI and AGs can definitely be lower-availability solutions if you don’t have the knowledge, time, and budget to get all the details perfect. When I worked for a hosting company with lots of FCIs, I strongly suspected the FCI customers on overloaded shared SANs had more downtime than customers running single-node systems with local arrays or non-shared DAS.

    At a small hedge fund I worked for, we relied on VMware for good-enough HA for most things, and if an application was critical enough, we ran multiple instances of it, each with its own database, and double-wrote the data. This worked pretty well when you had a deterministic write path you could control (e.g. ingesting market data). It was a little hard to get new people to accept it (including me at one point), but it worked well enough for us, and we had zero problems with FCI and AG hiccups and complexity.

  • These details, in my experience, usually work themselves out in the environments that don’t need them or aren’t mature enough to use them. A shop that doesn’t run updates, drivers, or firmware more than every couple of years, and doesn’t have adequate hardware or systems health, **usually** won’t be in a place to spend the time and money on HA. For those shops, I like log shipping. I have even used a SQL Express destination for those, just to have eyes on it to make sure it’s working, and to have at least some sort of vague plan to get it to Standard Edition if there were ever a disaster.

    I prefer to do the AD work myself – before I accidentally got into SQL DBA work I was an AD engineering consultant, so I’m familiar with the tech. I like the idea of having the systems group handle it, but I have only ever worked with a handful of AD admins who would implement a cluster with the care and detail I would – pesky details like ensuring the firewall is enabled with exceptions made in it, instead of just turning the firewall off. (Have fun with Service Broker, MSDTC, and MSMQ if you don’t make exceptions in Windows Firewall while it is ON.) Since Server 2012 R2, failover clustering is extremely forgiving, even in environments with horrible DNS and AD problems, as long as you RTFM during implementation. I haven’t worked at a place yet where I wouldn’t be called after hours for any issue that impacts the SQL Servers – even if it’s a virtualization, networking, or other systems problem – so I may as well have as much skin in the game as possible.

    I think a good sign that you should either not be doing HA or having someone else do it, is if you are making OS level configurations manually in each node instead of using a state-based configuration tool to configure all nodes consistently (such as group policy). Or if your servers have the performance of a netbook from 20 years ago.

    One shop I worked at had an organizational ego problem: they thought that because they had 12 GB of SQL data organization-wide, they were extreme users of SQL Server with special needs. On a good day that environment had access to 35 I/O, and on a typical day, 25 (Nutanix – on a bad day, 10-12 I/O was not out of the question). I had to fight really hard to NOT implement HA, and I thank the maker I never had to implement it in an environment that regularly experienced 7,000+ ms I/O stalls.

