Your first concern is licensing, and once you’ve got that under control, it’s time to talk about backups, restores, and uptime.
The business begins by defining the RPO and RTO requirements for high availability, disaster recovery, and “oops” deletions. Use our High Availability Planning Worksheet and fill in the “Current” column on page 1:
You need to give the business a rough idea of the current state – how much data you’ll lose under each of those scenarios, and how long the SQL Server will be down. Be honest – don’t overstate your capabilities, because that just means you’ll get less budget money to build the system the business wants.
Then let the business users fill in the “Biz Goal” column for each scenario. Of course, by default, everyone wants to pick zero data loss and zero downtime, but that’s where page 2 of the worksheet comes in:
Give both page 1 & 2 of the worksheet to the business users and let them pick the right availability requirements for their budget. Technical managers will want to leap to specific methods (“we have to use VMware replication”) but keep steering the conversation back to the big picture – what does the database need?
The costs aren’t meant to be exact estimates, just rough ballpark numbers of the hardware, software, installation, and ongoing maintenance costs for a couple/few years. To learn more about the technologies in each box, check out our High Availability and Disaster Recovery Resources page.
If your business checks different boxes for each of the three scenarios (HA, DR, and OOPS), then you’ll probably end up with a mix of different SQL Server technologies. For example, the OOPS scenario is usually handled by transaction log backups and a separate standby restore server so that you can pull out just the data you need, but that isn’t going to help you for high availability if you need 1-minute automatic failover with zero data loss.
After the business checks boxes for high availability, disaster recovery, and “oops” deletions, it’s time to pick the right recoverability option. There are virtualization-friendly options in every box, but it’s important to note what isn’t in these boxes.
VMware, Hyper-V, and Xen Alone Aren’t High Availability.
Virtualization admins often think, “If something goes wrong with hardware, the virtual server will just start up on another host.” VMware, Hyper-V, and Xen all do a great job of recovering from these kinds of failures, but those aren’t the only kinds of failures we need to avoid. Note the list of failure types in our High Availability scenario.
Sure, hypervisors protect you fairly well from Windows crashes, RAID controller failures, bad memory chips, or somebody unplugging the wrong box.
But what about patching and OS-drive-full type problems?
I’ve heard admins say, “No problem – before we do any patching, we’ll shut SQL Server down, take a snapshot of the VM, and then start the patch. If anything goes wrong, we’ll just roll back to the snapshot. No data loss, not much downtime.”
Oh, I wish. Let me tell you a story.
One of my clients was preparing a new SQL Server for production. As part of their prep, they needed to apply SQL updates to it, so they started the installation process, and …
A few minutes later the phone calls started pouring in.
Because they were patching the wrong box. They were patching the production SQL Server VM, not the new VM. To make matters worse, they ran into the SSISDB bug, and their production server was down for hours while they figured it out.
How Recoverability Influences Virtualization Design
When you’re designing solutions for HA, DR, and OOPS, read the scenarios described in this simple worksheet. Expect that sooner or later, at a time you can’t predict or control, every one of these is going to happen to you. (Well, maybe not zombies in the data center.) Your technical solution is driven by the business’s requirements for RPO/RTO in each scenario. Understand what virtualization alone can give you, and when you’re going to have to add SQL-Server-level solutions.
Your design then needs to take into account one more challenge: capacity. How many databases will you have, how large will they be, and how fast will the data change? For example, a solution involving log shipping, a 1TB database, a 10% change rate per day due to batch load jobs, and a 1Gb Ethernet pipe probably isn’t going to be a 1-minute-data-loss solution.
It all boils down to one simple question:
can you back up, DBCC, and restore the databases fast enough?
Not only do you have to answer the question for today, but for the life expectancy of the server as well. In my next virtualization post, I explore capacity planning as it relates to RPO/RTO, and then we’ll be able to put the whole picture together of what our VM environment will look like.