In our recent webcast on HA & DR basics for DBAs, we got a bunch of questions that couldn’t be answered fast enough. Here you go:
Q: Do you have book recommendations for replication, mirroring, and the other topics?
Absolutely. Check out the resources on our HA & DR Basics video page.
Q: Do any of these techniques work for companies with just 1-2 IT staff?
To keep things simple, I’d check out virtualization replication like VMware SRM. SRM isn’t necessarily easy, but if you use that one technique, you can protect all of your virtual machines no matter what they’re running. That’s easier than learning different HA/DR techniques for a lot of different applications.
Q: For 1,000 databases on an instance, what uses more resources – mirroring or log shipping?
Technically, the answer is going to be log shipping because you’re probably already doing transaction log backups on those databases anyway. However, when you’re running thousands of databases per instance, several other interesting issues arise. How do you react to newly added databases? Can you back up all 1,000 in a timely manner, or do you need to roll your own transaction log backup jobs to run simultaneously? What’s your RPO/RTO? There’s a lot of neat questions that determine the real answer at this scale. Feel free to contact us for consulting help – we’ve got clients who run as many as 10,000 databases per instance, and we can help you learn lessons the easy way instead of the hard way.
Q: What happens if my DR is in a different datacenter, on another domain?
Your life is harder. Many built-in solutions become more challenging, and even just getting your users to be able to connect to the database become more challenging. This is a scenario where I like to step back and ask, “What problem are we trying to solve by using different domains?”
Q: Do my HA and DR servers have to be on the same subnet or VLAN?
No, all of SQL Server’s HA/DR features work on different subnets and VLANs.
Q: Do you have to kill all SQL Server connections before restoring a database?
Yes. Restoring a database – even a similar copy – means starting over with all-new data. Restoring transaction logs means SQL Server will be jumping around all over inside the database, and it doesn’t have time to deal with your queries.
Q: Does clustering work for SSAS, SSIS, and SSRS?
Microsoft sells these things inside the same box as SQL Server, but for all practical purposes, you should consider them different products. They’re like free toys that just come in the same box, but they’re wildly different. The techniques you use to protect the SQL Server engine won’t be the same as the techniques you use for the rest.
Q: Can you read a database mirror?
Not directly, but you can take a snapshot of it and read the snapshot. I rarely see this used, though, because in order to refresh the snapshot, you have to drop it – and that means kicking everybody out of the snapshot. Users don’t seem to be a big fan of getting their queries killed.
Q: What Windows Server edition is needed for clustering and AlwaysOn Availability Groups?
You can use any version of Windows Server 2012, but if you’re still using Windows Server 2008R2, you’ll need Enterprise Edition.
Q: How do you get hosts in different data centers into the same cluster?
Clusters don’t need shared storage anymore. This is why I wrote the post Everything You Know About Clustering Is Wrong. Things really have changed a lot in the last several years around clustering.
Q: How well do these features work over a slow connection?
If you’ve got enough bandwidth to keep up with copying transaction log backups as they’re happening, you can stay up to date. If you can’t, they don’t work well.
Q: Can you install SQL Server Enterprise on Windows Standard?
Q: I just joined the webcast and it’s almost over. Can you send me the slides?
You’d be surprised how often I get this question. This is why we make our material available for free on our YouTube channel. I don’t usually share the slide decks, though, because I’ve had some nasty incidents of people taking my slide decks, plagiarizing them, and presenting them as their own material.
Q: I’m using an AlwaysOn Availability Group. If I fail over to an asynchronous replica, will there be data loss?
Probably. If your primary has been doing a lot of work, and hasn’t been able to send all of that work to the asynchronous secondary, you’re doing to lose that data. It’s up to you to monitor the DMVs to see how far behind the replicas are.
Q: Is (feature A) better than (feature B)?
No. They both suck.
Q: Seriously, I need an answer.
Getting to the right answer means knowing your RPO, RTO, business needs, staff capabilities, hardware, network throughput, and more. If you ask a stranger to guess the right answer for you with just five minutes of background information, odds are it’s going to be the wrong answer. You’re going to have to buckle down and do some research into the features and your business needs, or you can bring in an experienced outsider who does this for a living. This is what we do, and we can help you get to the right answer as quickly as possible, and probably for less money than you’d expect.
Q: Is there a way of automating failovers for SQL Agent jobs for HA purposes?
If you’re using clustering, then this is built in. If you’re using a solution that does failovers at the user database level, then things get trickier because you have to know which jobs you want to fail over. For example, you probably don’t want backup jobs failing over (because you can just run those jobs on all nodes), but you might want a few database-based jobs failing over. That’s left as an exercise for the reader, though, because every shop’s needs are different – often even different between database servers in the same shop.
Q: What’s the right quorum configuration for my cluster?
Q: Can I integrate clustering with (mirroring/log shipping/replication/AlwaysOn AGs)?
Clustering is probably the strongest feature in terms of interoperability with other features. It’s been around a long time, so it’s got a good interoperability story for pretty much everything.
Q: How can one second RTO be achieved in a DR scenario?
By having both the production and disaster recovery systems live at all times, accepting queries. At that point, it’s up to the application logic to try both servers to see which one is available fastest. Some applications do this by timing out their database connections after just one second. Other applications do it by trying both database servers simultaneously and taking the first available connection.
Q: Can you back up a SQL Server to SQL Azure?
Not in the sense that you can restore directly into SQL Azure (or as it’s known now, Windows Azure SQL Database.) However, you can use Windows Azure as a backup solution if you’re so inclined.
Q: Azure is a French word meaning blue. Americans are not pronouncing French correctly.
French is the most beautiful language in the world – when it’s spoken by the French. When the rest of us try it, we sound like we’re coughing up an unfiltered Marlboro. Thank you for your cheese, wine, and wonderful words like lingerie and ménage à trois. You are a beautiful country, and the rest of us apologize for the sounds we make when confronted with French.
Q: Can you cluster inside VMware?
Q: What are the best practice recommendations for achieving 99.999% uptime?
Obviously this is way beyond what I can answer quickly, but the best advice I can give is to remember that there’s three parts to every solution: people, process, and technology. You need written processes that cover exactly how to react to an outage, and you need to rehearse and improve those processes constantly. Car racing teams practice over and over to get the fastest pit stop possible so that their race car spends more time driving and less time sitting. DBAs need to do the same.
Q: What HA option is Brent’s personal favorite, and why?
I like failover clustering because it protects everything in the instance – jobs, logins, and the server name itself – with as little manual intervention as possible. It still has weaknesses in a few single points of failure, and it’s not completely easy to implement and manage. I think the benefits still outweigh the costs.
I served with High Availability. I knew High Availability. High Availability was a friend of mine. VMware HA, you’re no High Availability.
See, for us database administrators, high availability means protection when:
- The system drive fills up because some potato decided to download a bunch of files
- An operating system or database server update goes horribly awry
- Or even when an OS or SQL update goes right – because the beauty of real high availability solutions is that they let you patch the standby node first, make sure it works, and then fail over to it so you can patch the other node.
Don’t get me wrong – I love VMware, and I love using VMware HA for database servers. It’s a fantastic way to get higher availability for those old dinosaur database servers running SQL Server 2000 that we just can’t kill, yet still run important apps. But in systems where uptime really matters, a single virtual machine isn’t the answer to high availability. That’s where solutions like clustering, database mirroring, replication, and AlwaysOn Availability Groups come into play.
Thankfully, there’s good news: when VMware HA is paired with SQL Server technologies, they can both work even better. Two standalone physical database servers running AlwaysOn Availability Groups are more reliable than just one server, but two virtual machines doing the same thing are even more reliable. They’re protected from hardware failures because they can be spun up on any VMware host in the datacenter. They’re more flexible because we can add CPU power or memory quickly based on demand.
I’ve blogged about why your SQL Server cluster shouldn’t be virtualized, and that still holds true. If you need to build a hybrid AlwaysOn solution involving both failover clustered instances (FCIs) and standalone instances, I would rather not put the FCIs in VMware first. But if you’re under pressure from management to cut costs and cut your datacenter footprint, put the rest of the instances in virtual machines. You’ll gain the power and comfort you want from physical machines while getting even higher availability from the virtual machines. Everybody wins, and the future will be better tomorrow.
When people buy my virtualization training video, one of the followup questions I get most often via email is, “Can I build SQL Server clusters in VMware and Hyper-V?”
In theory, yes. Microsoft’s knowledge base article on SQL Server virtualization support says they’ll support you as long as you’re using configurations listed in the Server Virtualization Validation Program (SVVP).
But the real question for me isn’t whether or not Microsoft supports virtual SQL Server clusters.
The question is about whether you can support it.
Us geeks usually implement clusters because the business wants higher availability. Higher availability means faster troubleshooting when the system is down. We need to be able to get the system back up and running as quickly as possible. Getting there usually means reducing the amount of complexity; complex systems take longer to troubleshoot.
Adding virtualization (which also means shared storage) makes things much tougher to troubleshoot. If the business wants a highly available SQL Server, ask yourself these questions before virtualizing a SQL Server cluster:
- Do you have a great relationship between the SQL Server, storage, and network teams?
- Do all of the teams have read-only access to each others’ tools to speed up troubleshooting?
- Do all of the teams have access to the on-call list for all other teams, and feel comfortable calling them?
- Do you have a well-practiced, well-documented troubleshooting checklist for SQL Server outages?
- Does your company have a good change control process to avoid surprises?
- Do you have an identical environment to test configuration changes and patches before going live?
If the answer to any of those questions is no, consider honing your processes before adding complexity.
But the Business is Making Me Do It!
They’re making you do it because you haven’t clearly laid out your concerns about the business risk. Show the business managers this same list of questions. Talk to them about what each answer means for the business. Was there a recent outage with a lot of finger-pointing between teams? Bring that up, and remind the business about how painful that troubleshooting session was. Things will only get worse under virtualization.
To really drive the point home, I like whiteboarding out the troubleshooting process for a physical cluster versus a virtual cluster. Show all of the parts involved in the infrastructure, and designate which teams own which parts. Every additional team involved means longer troubleshooting time.
Once the business signs off on that increased risk, then everyone’s on the same page. They’re comfortable with the additional risk you’re taking, and you’re comfortable that you’re not to blame when things go wrong. And when they do go wrong – and they will – do a post-mortem meeting explaining the outage and the time spent on troubleshooting. If the finger-pointing between the app team, SQL Server DBAs, network admins, virtualization admins, and sysadmins was a problem, document it and share it (in a friendly way) with management. They might change their mind when it’s time to deploy the next SQL Server cluster.