Maybe You Shouldn’t Even Be Using Clustering or AGs.
Sandra Delany (LinkedIn) wrote a well-thought-out blog post called, “Should a SQL Server DBA Know Windows Clustering?” She’s got about 20 years of DBA experience, and she works for Straight Path (a firm I respect) as a consultant. You can probably guess based on her background that yes, she believes you should know how to set up, configure, and troubleshoot Windows clustering. It’s a good post, and you should read it.
But… I don’t agree.
Two things. First, I have a problem with any blog post (even my own) that says, “If you call yourself an X, you should definitely know Y.” The term “DBA” encompasses a huge variety of jobs, held by people with a huge variety of seniority levels. If someone’s in their first year in a DBA job, maybe even their fifth, I don’t necessarily expect them to know Windows clustering.
For example, I once (briefly) worked at a global company where the DBAs weren’t even given permissions to glance at the cluster. If the SQL Server service wouldn’t start, the DBAs had to transfer the issue to the Windows team, who handled clustering. The company’s logic was that clustering is a foundational part of Windows, and it doesn’t really have anything to do with SQL Server – and in fact, it’s reused by other cluster-savvy applications. Clustering troubleshooting is a giant pain in the ass that involves Windows logs, DNS, IP addresses, etc., all things that DBAs aren’t good at – but the Windows team is (or at least should be).
Which brings me to another blog post…
Chrissy LeMaire (LinkedIn – Bluesky), the creator of the powerful and popular dbatools PowerShell module, wrote a solid blog post called “Have You Considered Not Using SQL Server High Availability?” You should read that post too.
A lot of you are full-time database administrators, and you’re already taking a deep breath in anticipation of yelling back at the screen, but hang on a second.
First, your SQL Servers still all need disaster recovery. That’s different. When we say disaster recovery, we’re usually talking about things like native SQL Server backups, log shipping, storage replication, etc. These are techniques that you can use to rebuild the SQL Server in a different place, like after a ransomware attack or a natural disaster. Nobody’s suggesting you get rid of that.
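To make that concrete, here’s a minimal sketch of the native-backup piece – the database name and backup path are placeholders, not anything from either blog post:

```sql
-- Full backup with checksums, so corruption gets caught at backup time
-- instead of at restore time during an actual disaster.
BACKUP DATABASE [YourDatabase]
    TO DISK = N'\\BackupShare\YourDatabase_Full.bak'
    WITH CHECKSUM, COMPRESSION, INIT;

-- Frequent log backups are what make point-in-time recovery possible.
BACKUP LOG [YourDatabase]
    TO DISK = N'\\BackupShare\YourDatabase_Log.trn'
    WITH CHECKSUM, COMPRESSION;

-- Verify the backup file is readable - cheap insurance before you need it.
RESTORE VERIFYONLY
    FROM DISK = N'\\BackupShare\YourDatabase_Full.bak'
    WITH CHECKSUM;
```

The real test, of course, is actually restoring these backups somewhere else on a schedule, not just writing them.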
We’re specifically talking about HA features here: failover clustered instances (FCIs), Always On Availability Groups, and database mirroring. Chrissy writes about the operational challenge of those features.
HA features require 2 kinds of labor from 4 kinds of people. Chrissy’s post points out that the teams who manage the databases, the storage, Active Directory/DNS, and networking all have to get involved. I’d add that it requires 2 kinds of labor from all of these people: both planned, and unplanned/chaotic. When there’s a production database outage, there’s a lot of finger pointing, and management demands that everybody drop their work and jump into conference calls. Everybody starts stabbing at various switches and dials, groping blindly and wasting time, until things go back to normal – and the next server has its next emergency.
Small companies don’t have 4 kinds of people. They just have a core handful of IT people who do everything. They’re experts in how the entire stack is configured at this shop, but they’re not experts in all of the underlying technologies. When things go wrong, the work is usually single-threaded, dependent on the one person unlucky enough to be on call. That person is even more likely to stab at various switches and dials, making unplanned, chaotic changes that end up making the environment even less stable over time.
Virtualization provides pretty dang good HA for many failures, in many shops. Properly configured, it protects you from individual host hardware failures and single network cable problems. No, it doesn’t protect you from a bad Windows or SQL Server patch, but in small shops, they don’t do a lot of patching anyway. (Ease up on the outrage – I’ve seen your SQL ConstantCare® data, I know you’re several CUs behind, and I know you’ve muted that recommendation to patch.)
Virtualization HA is easier to manage. It’s just one technology that works to protect all of your VMs. That’s less learning that the overworked staff have to do, and besides, they have to learn it anyway to protect the rest of your servers. As long as they’re using it for everything else, they might as well lean on it to protect SQL Server as well.
So when clients are talking to me about easy ways they can improve their uptime, and they’re already running SQL Server in a VM, we take a step back and look at their virtualization high availability setup. If that’s working well, I explain the part about the net-new planned & unplanned work for all the different roles, and then I ask about their on-call rotations, and their plans to hire more staff in order to handle this net-new work.
If they haven’t been adding staff, and don’t plan to, then I’d rather have the staff focus on improving their ability to rapidly restore SQL Server backups, provision & configure new servers, and just generally automate their environment. That’ll come in handy more often, and help with disaster recovery, ransomware rebuilding, recovering from “oops” delete queries, and just generally reacting to day-to-day issues.
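If the staff are going to practice rapid restores, scripting them helps. Here’s a hedged sketch using dbatools (the instance names and paths are placeholders – check the cmdlet documentation before relying on parameter details):

```powershell
# Restore the most recent backups from a folder onto a test instance.
# dbatools reads the backup headers and figures out the restore chain.
Restore-DbaDatabase -SqlInstance 'TestServer\SQL2019' `
    -Path '\\BackupShare\YourDatabase' -WithReplace

# Or let dbatools run the whole restore-and-verify loop for you:
# it restores production's last backups to the test instance and
# runs integrity checks against the result.
Test-DbaLastBackup -SqlInstance 'ProdServer' -Destination 'TestServer\SQL2019'
```

Run on a schedule, that turns “we think we can restore” into “we restored last night.”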
In summary:
- If you don’t know how to troubleshoot DNS, file shares, Windows cluster validation, or PowerShell, then Chrissy’s blog post is right for you, and you should probably try virtualization for high availability.
- If your company’s HA/DR needs require Availability Groups and/or failover clustered instances, then Sandra’s blog post is right for you, and you probably have a big learning journey ahead.
Hi! I’m Brent Ozar.
I make Microsoft SQL Server go faster. I love teaching, travel, cars, and laughing. I’m based out of Las Vegas. He/him. I teach SQL Server training classes, or if you haven’t got time for the pain, I’m available for consulting too.
6 Comments
FCI and AGs can definitely be lower-availability solutions if you don’t have the knowledge, time, and budget to get all the details perfect. When I worked for a hosting company with lots of FCIs, I highly suspected the FCI customers on overloaded shared SANs had more downtime than customers running single-node systems with local arrays or non-shared DASs.
At a small hedge fund I worked for, we relied on VMware for good-enough HA for most things, and if an application was critical enough, we ran multiple instances of it, each with its own database, and double-wrote the data. This worked pretty well when you had a deterministic write path that you could control (e.g. ingesting market data). It was a little hard to get new people to accept it (including me at one point), but it worked well enough for us, and we had 0 problems with FCI and AG hiccups and complexity.
These details, in my experience, usually work themselves out in the environments that don’t need them or aren’t mature enough to use them. A shop that doesn’t run updates, drivers, or firmware more than every couple of years, and doesn’t have adequate hardware or systems health, **usually** won’t be in a place to spend the time and money on HA, and for those, I like log shipping. I have even used a SQL Express destination for those, just to have eyes on it to make sure it’s working, and to have at least some sort of vague plan to get it to Standard Edition if there were ever a disaster.
I prefer to do the AD work myself; before I accidentally got into SQL DBA work, I was an AD engineer consultant and am familiar with the tech. I like the idea of having the systems group handle it, but I have only ever worked with a handful of AD admins who would implement a cluster with the care and detail I would; pesky details like ensuring the firewall is enabled and exceptions are made in it, instead of only turning the firewall off. (Have fun with Service Broker, MSDTC, and MSMQ if you don’t make exceptions in Windows Firewall while it is ON.) Since Server 2012 R2, failover clustering is extremely forgiving even in environments with horrible DNS and AD problems, as long as you RTFM in its implementation. I haven’t worked at a place yet where I wouldn’t be called after hours for any issue that impacts the SQL Servers, even if it is a virtualization, networking, or other systems problem, so I may as well have skin in the game as much as possible.
I think a good sign that you should either not be doing HA, or should have someone else do it, is if you are making OS-level configurations manually on each node instead of using a state-based configuration tool (such as Group Policy) to configure all nodes consistently. Or if your servers have the performance of a netbook from 20 years ago.
One shop I worked at had an organizational ego problem: because they had 12 GB of SQL data organization-wide, on Nutanix storage that on a good day had access to 35 I/O and on a typical day 25 (on a bad day, 10–12 was not out of the question), they thought they were extreme users of SQL Server with special needs. I had to fight really hard to NOT implement HA, and I thank the maker I never had to implement HA in an environment that regularly experienced 7,000+ ms I/O stalls.
I have mixed feelings about this. As a consulting engineer, I’ve seen more than my share of VMs borked for the stupidest reasons, so I don’t put much more trust in virtualization technology than I would in anything else. I’m also quite proficient with the AD and OS side of tech, so I tend to be a one-man team when it comes to deployments, migrations, troubleshooting outages, and incident response. This tends to speed things up considerably, and I really hate working on projects where I have to constantly fight siloed teams to have them do trivial tasks because no one understands the individual building blocks of the bigger picture.
Guilty of muting the patching recommendation as charged. However, the ones muted are the ones that required paid extended support to get the updates.
If you can get away with putting everything in HA at the virtualization level, cool. However, if you are using an AG with a read-only replica to help balance the load some… Having worn a few different IT hats back in the day before settling on DBA, I’m pretty comfortable with WSFC and whatnot. However, with the expanded Windows ecosystem and all the management layers and functionality that have come along, being that “fortunate” these days is probably a bit tough.
I can see and understand both sides and for better or worse, skillset siloing, IMHO, won’t be going away any time soon
Cheers
Agree with the article. As a DBA with 25+ years in the trenches, management has been known to knee-jerk into “WE NEED HA” before investigating all their options and the environmental issues. And SQL Server HA is a pain from an administrative standpoint because Microsoft never built in the tools to sync logins, jobs, etc. between nodes – which to me is just insane. WHY WOULD YOU NOT DO THIS AT THE PRODUCT LEVEL, MS ????
As Brent would always say “What is the problem you are trying to solve?” when addressing the use of HA.
In our case, we are solving a zero-data-loss requirement should one of our 2 data centers catastrophically fail, since our networking is such that the latency supports synchronous mirroring. If both are destroyed, we probably don’t have a business left to recover, so data recovery (we mirror the backups taken every 1.5 hours to a 3rd data center) will be adequate for those who survive.
If you have a mature enough organization, the DBA can learn enough of the bit parts to be conversant with the other teams to address the relatively infrequent issues that come up. I’ve been doing this for 8 years and have talked to those people on average once a year.
Failover clustering does not solve our issues as the storage doesn’t move. It does fix patching downtime (but AGs do as well). But if you are going to have a cluster, just go full AG and get over it.
I had to teach myself the sysadmin skills when I set this stuff up. Had a good MS consultant review and adjust my design and implementation methods. This is why you hire consultants.
I’ve had one complete datacenter failure (power systems issue) during my time. My systems were all perfect within 10 seconds, before anyone could reconnect and tell the difference. Our web applications also have a different (load balancer) type of arrangement. Nobody noticed anything but a few failed transactions that were retried and successful.
I’ve solved my problems and have the proof. I laugh when Azure is down for hours at a time. My customers are barking at our CIO to not cloudify (as seems to be his mission). He doesn’t understand the problem. You can’t put health and safety (to the local physical plant) monitoring and response systems purely in a remote site. This is not smart. But you don’t build a career by making smart moves, but by doing sexy (and costly) moves and then fail up to the next opportunity… That’s the problem they are trying to solve….