New High Availability Planning Worksheet

One of the questions we ask every new client is, “How much data is this application allowed to lose, and how long is it allowed to be down for?”

This leads to some really fun conversations.

The business users assume that the database will never lose data and cannot go down. The IT admins assume that the business understands all systems can lose data, and that the business will be okay with some data loss during a crash.

To get the business users and the IT team onto the same page, I had to build that page for them to get onto! And here it is:

High Availability Worksheet, Page 1
High Availability Worksheet, Page 1

Step 1: the it team fills out the current state.

We start by filling out the “current” column for each of the three scenarios, documenting how much data the business could lose in the event of an outage, and how long the system could be down.

Say we’ve got a single physical standalone SQL Server, and we’re doing transaction log backups every hour to a network share. That means we could lose at least an hour’s worth of data, and be down for several hours (if not a day). To get the system back online, we would need to:

  • Provision hardware and storage space
  • Install the right version/edition of Windows, plus patches
  • Install the right version/edition of SQL Server, plus patches
  • Set up the same sp_configure options (maxdop, any trace flags, etc)
  • Set up security (users, passwords)
  • Restore the databases

Often I find that just getting everybody to agree on the “current” column risks can take an hour, and the business users are horrified by how much work is involved. It’s nobody’s fault – this is just the current state of the system.

Step 2: The business users set a goal.

They’ll say, “We’re not allowed to lose any data and we’re never allowed to go down”, and that’s where page 2 of the worksheet comes in:

Page 2 - The Costs of Choices
Page 2 – The Costs of Choices

Down the left side, pick how much data we’re allowed to lose.

Across the top, pick how long we’re allowed to be down.

Where they meet, that’s a rough cost of the solution, and the list of options that can meet the business goals. These costs are wide ranges that give the business a rough idea of what it’ll take to achieve their objective – don’t get hung up on the exact dollar amounts. The business users just need a rough idea to help guide their choice.

I strongly, strongly encourage business users to pick one of those 4 categories, not something like 15 minutes or 4 hours. SQL Server’s high availability and disaster recovery methods line up really well with these 4 buckets, and if they choose 15 minutes, they’re not really doing me any favors – I’m going to need to design a 1-minute solution.

Step 3: The IT Team designs a solution to meet the business goals.

In our DBA training classes, we coach DBAs on:

  • How to get real answers in writing from business users
  • How to set the “IT Goal” column of this worksheet
  • How to give the business users the right recommendation
  • How to handle cost objections
  • How to build a solution that meets the business users’ needs

When I was a DBA, that was the hardest part of my job. The business kept telling me, “You’re not allowed to lose data and you’re not allowed to go down, but you don’t have a budget.” Now, in our classes, we arm DBAs with worksheets and guides to go toe-to-toe with the business. You look like a pro because you get to leverage all these cool worksheets that we built for our own consulting projects.

You can grab this planning worksheet plus much more in our First Aid resources. Enjoy, and hope it makes your next HA/DR project easier!

Previous Post
How to Configure Anti-Virus for SQL Servers
Next Post
Curious About the SQL Server 2014 Cardinality Estimator?

6 Comments. Leave new

  • Brent Willis
    May 14, 2014 12:12 pm

    Brent,

    This is a great article and I plan to use this this worksheet for an estimate that I will be presenting to senior management soon.

    I do have a question though, in your example, you state that it is a physical server and that log backups are taken every two hours to a network share, but you also state that the max data loss would be 1 hour. In this case wouldn’t the max data loss be two hours? (Assuming the network share is not lost as well and that data was not replicated or moved to tape quickly enough.)

    Thanks,
    Brent Willis

    Reply
  • Nic Neufeld
    May 16, 2014 11:52 am

    I like the seeming “oversimplicity” of this, down to the stock art for each of the categories, because too often the concepts of HA/DR get lost in the great noxious sea of buzzwords and vendor hype. I remember a previous video slide using a picture of a blue screen and a hurricane-damaged building to illustrate the HA/DR difference, I think it was yours. Adding in Mr. DidntMeanToDropThatDatabase kind of rounds out the main scenarios for this sort of stuff.

    Reply
    • Nic – thanks! Yep, that blue screen and hurricane-damaged building was my first version of this. I’ve been gradually refining it over the years.

      Reply
  • Linda Wenglikowski
    May 20, 2014 8:27 am

    Nice article Brent. I’ve seen issues when application had multiple databases. So I ask the following:
    1. do you have multiple SQL Server databases supporting the application?
    If so, then I ask
    2. do you have cross database transactions?
    If so, then I ask
    3. do you have a continuity script that needs to be ran on the event of a hardware failure or disaster to isolate where data is found in one database and is missing in one or more other databases?

    I like to ask this as most haven’t thought about this in there design. As they haven’t thought about each database having it’s own transaction log and separate replication.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.