Updated High Availability and Disaster Recovery Planning Worksheet

One of the most popular things in our First Responder Kit is our HA/DR planning worksheet. Here’s page one:

Page 1 – how our servers are doing now, versus what the business wants

In the past, we had three columns on this worksheet – HA, DR, and Oops Deletes. In this new version, we changed “Oops” Deletes to “Oops” Queries to make it clear that sometimes folks just update parts of a table, or they drop an entire database. We also added a column for corruption (since your protection & recovery options are different than they are for Oops moments).
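If you'd rather track page one in code than on paper, here's a minimal sketch of the idea. It assumes each column holds a current and a wanted RPO/RTO in minutes. The `Target` class, the `find_gaps` helper, and every number below are hypothetical illustrations, not values from the worksheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One cell pair on the worksheet (hypothetical representation)."""
    rpo_minutes: float  # maximum tolerable data loss
    rto_minutes: float  # maximum tolerable downtime


# The four columns named in this version of the worksheet.
COLUMNS = ("HA", "DR", "Oops Queries", "Corruption")


def find_gaps(current, wanted):
    """Return the columns where today's setup is worse than what the business wants."""
    gaps = []
    for column in COLUMNS:
        now, goal = current[column], wanted[column]
        if now.rpo_minutes > goal.rpo_minutes or now.rto_minutes > goal.rto_minutes:
            gaps.append(column)
    return gaps


# Made-up example numbers, purely for illustration.
current = {
    "HA": Target(1, 5),
    "DR": Target(60, 1440),
    "Oops Queries": Target(1440, 240),
    "Corruption": Target(1440, 240),
}
wanted = {
    "HA": Target(1, 1),
    "DR": Target(60, 1440),
    "Oops Queries": Target(60, 240),
    "Corruption": Target(1440, 240),
}

print(find_gaps(current, wanted))
```

The columns that come back are the ones worth discussing with the business first.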

When people first see this worksheet, they usually scoff and say, “The business is going to tell me we never want to lose data, and we’re never allowed to go down.” No problem – that’s where the second page of the worksheet comes in:

RPO/RTO cost range estimates

Find the amount of data you’re willing to lose on the left side, and the amount of downtime you’re willing to tolerate across the top. Where the boxes match up, that’s a rough price range of the solution.
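The lookup itself is just a two-key table. Here's a rough sketch, assuming a small grid with data loss (RPO) as rows and downtime (RTO) as columns. The row and column labels and the `range-…` values are placeholders I made up for illustration; the real price ranges are the ones printed on page two of the worksheet.

```python
# Rows: how much data you're willing to lose (RPO).
# Columns: how much downtime you're willing to tolerate (RTO).
# Every label and value here is a placeholder, not a real price range.
PRICE_GRID = {
    "zero": {"1 minute": "range-A", "1 hour": "range-B", "1 day": "range-C"},
    "1 minute": {"1 minute": "range-D", "1 hour": "range-E", "1 day": "range-F"},
    "1 hour": {"1 minute": "range-G", "1 hour": "range-H", "1 day": "range-I"},
}


def price_range(rpo, rto):
    """Return the box where the RPO row and the RTO column meet."""
    if rpo not in PRICE_GRID:
        raise ValueError(f"unknown RPO row: {rpo!r}")
    row = PRICE_GRID[rpo]
    if rto not in row:
        raise ValueError(f"unknown RTO column: {rto!r}")
    return row[rto]


print(price_range("1 minute", "1 hour"))
```

Asking for a row or column that isn't on the grid raises an error instead of guessing, which matches how the paper version works: you pick from the boxes that exist.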

In this version, we added an asterisk to flag that a lot of supposedly synchronous solutions aren’t truly synchronous – for example, Always On Availability Groups don’t actually guarantee zero data loss. I still keep that sort of thing in the zero data loss row because most of the time, it is zero data loss, but you need to understand it’s not a guarantee.

I like printing those two pages front and back on the same piece of paper because it helps management understand that requirements and costs are two sides of the same coin. It’s management’s job to pick the right box (price range), and then it’s IT’s job to build a more detailed estimate for the costs inside the box. The third and final page of the worksheet breaks out the feature differences for each HA/DR option.

If you’re one of the tens of thousands of folks who’ve signed up for email alerts whenever we update our First Responder Kit, then you’ve already gotten an email this week with these new changes. If not, head on over and pick it up now.


9 Comments

  • Great presentation of costs of different RTO / RPO’s. It might be worth including backup & recovery as additional options.

  • It’s been a long week and I’m likely missing the blindingly obvious… Would you mind explaining why you break the RTO/RPO into different categories? I would have thought for any one service the business owner would have a single RTO/RPO, rather than different requirements depending on what went wrong.

    • Steve – no problem! Businesses usually take different gambles for different risks. For example, for high availability, they usually want automatic failover with minimal data loss (zero or 1 minute) for their mission-critical data. However, if they lose their entire building, they’re often okay losing an hour’s worth of data and being down for a day.

      Each column has its own separate costs. With that above example, we’re talking something like failover clustering inside the production data center (which has one set of costs), plus perhaps log shipping to DR (which has a separate set of costs.)

      Sometimes businesses say, “I want only 1 minute of data loss and 1 minute of downtime no matter what” – but when they see the full pricing on it, they often change their minds.

  • Varsham Papikian
    July 22, 2016 12:37 pm

    Thanks Brent.
    While I love the simplicity, which is important for such documents, depending on your goal you may want to somehow reflect the fact that SQL Server 2016 introduced ‘Basic Availability Groups’, so a limited version of Always On AGs is no longer Enterprise Edition-only. Yes, there are various limitations (like supporting only a single db per AG), but I love the fact that we have the ‘Async’ option in Standard Edition (as everyone knows, we didn’t get the ‘Async’ option for Mirroring in Standard).

    Thanks,
    Varsham Papikian

    • Varsham – thanks, glad you enjoyed the doc.

      Unfortunately, as you noted, it’s impossible to fit all the various gotchas in a document this small. It’s only a starting point, not a finishing point. 😉

  • Dear Brent,

    Great – but none of the MSSQL HA solutions help reduce RTO when it comes to wrong update or delete statements, or am I wrong?

    Kind regards
    Gerald

  • Hi Brent, on page 2 of the PDF the text for 1 minute RTO seems to be cut off after “sync database”

