Answers to Common High Availability and Disaster Recovery Questions

In our recent webcast on HA & DR basics for DBAs, we got a bunch of questions that couldn’t be answered fast enough. Here you go:

 

Q: Do any of these techniques work for companies with just 1-2 IT staff?

To keep things simple, I’d check out virtualization replication like VMware SRM. SRM isn’t necessarily easy, but if you use that one technique, you can protect all of your virtual machines no matter what they’re running. That’s easier than learning different HA/DR techniques for a lot of different applications.

Q: For 1,000 databases on an instance, what uses more resources – mirroring or log shipping?

Technically, the answer is going to be log shipping because you’re probably already doing transaction log backups on those databases anyway. However, when you’re running thousands of databases per instance, several other interesting issues arise. How do you react to newly added databases? Can you back up all 1,000 in a timely manner, or do you need to roll your own transaction log backup jobs to run simultaneously? What’s your RPO/RTO? There’s a lot of neat questions that determine the real answer at this scale. Feel free to contact us for consulting help – we’ve got clients who run as many as 10,000 databases per instance, and we can help you learn lessons the easy way instead of the hard way.

Q: What happens if my DR is in a different datacenter, on another domain?

Your life is harder. Many built-in solutions become more challenging, and even just getting your users to be able to connect to the database become more challenging. This is a scenario where I like to step back and ask, “What problem are we trying to solve by using different domains?”

Q: Do my HA and DR servers have to be on the same subnet or VLAN?

No, all of SQL Server’s HA/DR features work on different subnets and VLANs.

Q: Do you have to kill all SQL Server connections before restoring a database?

Yes. Restoring a database – even a similar copy – means starting over with all-new data. Restoring transaction logs means SQL Server will be jumping around all over inside the database, and it doesn’t have time to deal with your queries. 😉

Q: Does clustering work for SSAS, SSIS, and SSRS?

Microsoft sells these things inside the same box as SQL Server, but for all practical purposes, you should consider them different products. They’re like free toys that just come in the same box, but they’re wildly different. The techniques you use to protect the SQL Server engine won’t be the same as the techniques you use for the rest.

Q: Can you read a database mirror?

Not directly, but you can take a snapshot of it and read the snapshot. I rarely see this used, though, because in order to refresh the snapshot, you have to drop it – and that means kicking everybody out of the snapshot.  Users don’t seem to be a big fan of getting their queries killed.

Q: What Windows Server edition is needed for clustering and AlwaysOn Availability Groups?

You can use any version of Windows Server 2012, but if you’re still using Windows Server 2008R2, you’ll need Enterprise Edition.

Q: How do you get hosts in different data centers into the same cluster?

Clusters don’t need shared storage anymore. This is why I wrote the post Everything You Know About Clustering Is Wrong. Things really have changed a lot in the last several years around clustering.

Q: How well do these features work over a slow connection?

If you’ve got enough bandwidth to keep up with copying transaction log backups as they’re happening, you can stay up to date. If you can’t, they don’t work well.

Q: Can you install SQL Server Enterprise on Windows Standard?

Yes.

Q: I just joined the webcast and it’s almost over. Can you send me the slides?

You’d be surprised how often I get this question. This is why we make our material available for free on our YouTube channel. I don’t usually share the slide decks, though, because I’ve had some nasty incidents of people taking my slide decks, plagiarizing them, and presenting them as their own material.

Q: I’m using an AlwaysOn Availability Group. If I fail over to an asynchronous replica, will there be data loss?

Probably. If your primary has been doing a lot of work, and hasn’t been able to send all of that work to the asynchronous secondary, you’re doing to lose that data. It’s up to you to monitor the DMVs to see how far behind the replicas are.

Q: Is (feature A) better than (feature B)?

Yes.

Q: Really?

No. They both suck.

Q: Seriously, I need an answer.

Getting to the right answer means knowing your RPO, RTO, business needs, staff capabilities, hardware, network throughput, and more. If you ask a stranger to guess the right answer for you with just five minutes of background information, odds are it’s going to be the wrong answer. You’re going to have to buckle down and do some research into the features and your business needs, or you can bring in an experienced outsider who does this for a living. This is what we do, and we can help you get to the right answer as quickly as possible, and probably for less money than you’d expect.

Q: Is there a way of automating failovers for SQL Agent jobs for HA purposes?

If you’re using clustering, then this is built in. If you’re using a solution that does failovers at the user database level, then things get trickier because you have to know which jobs you want to fail over. For example, you probably don’t want backup jobs failing over (because you can just run those jobs on all nodes), but you might want a few database-based jobs failing over. That’s left as an exercise for the reader, though, because every shop’s needs are different – often even different between database servers in the same shop.

Q: What’s the right quorum configuration for my cluster?

Check out the excellent Books Online quorum sections for Windows Server 2012 and Windows Server 2008/R2.  It’s a lot of reading, but it’s worth it – the quorum decision is a big one.

Q: Can I integrate clustering with (mirroring/log shipping/replication/AlwaysOn AGs)?

Clustering is probably the strongest feature in terms of interoperability with other features. It’s been around a long time, so it’s got a good interoperability story for pretty much everything.

Q: How can one second RTO be achieved in a DR scenario?

By having both the production and disaster recovery systems live at all times, accepting queries. At that point, it’s up to the application logic to try both servers to see which one is available fastest. Some applications do this by timing out their database connections after just one second. Other applications do it by trying both database servers simultaneously and taking the first available connection.

Q: Can you back up a SQL Server to SQL Azure?

Not in the sense that you can restore directly into SQL Azure (or as it’s known now, Windows Azure SQL Database.) However, you can use Windows Azure as a backup solution if you’re so inclined.

Q: Azure is a French word meaning blue. Americans are not pronouncing French correctly.

French is the most beautiful language in the world – when it’s spoken by the French. When the rest of us try it, we sound like we’re coughing up an unfiltered Marlboro. Thank you for your cheese, wine, and wonderful words like lingerie and ménage à trois. You are a beautiful country, and the rest of us apologize for the sounds we make when confronted with French.

Q: Can you cluster inside VMware?

Yes, but if you’re asking the question, you shouldn’t be doing it.

Q: What are the best practice recommendations for achieving 99.999% uptime?

Obviously this is way beyond what I can answer quickly, but the best advice I can give is to remember that there’s three parts to every solution: people, process, and technology. You need written processes that cover exactly how to react to an outage, and you need to rehearse and improve those processes constantly. Car racing teams practice over and over to get the fastest pit stop possible so that their race car spends more time driving and less time sitting. DBAs need to do the same.

Q: What HA option is Brent’s personal favorite, and why?

I like failover clustering because it protects everything in the instance – jobs, logins, and the server name itself – with as little manual intervention as possible. It still has weaknesses in a few single points of failure, and it’s not completely easy to implement and manage. I think the benefits still outweigh the costs.

Previous Post
How Much Cache Do You Have?
Next Post
3 Things You Need to Start Doing to Your Database Server

12 Comments. Leave new

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.