Updated High Availability and Disaster Recovery Planning Worksheet

Last Updated February 23, 2021

Always On Availability Groups, Backup and Recovery, Clustering, SQL Server

One of the most popular things in our First Responder Kit is our HA/DR planning worksheet. Here’s page one:

Page 1 – how our servers are doing now, versus what the business wants

In the past, we had three columns on this worksheet – HA, DR, and Oops Deletes. In this new version, we changed “Oops” Deletes to “Oops” Queries to make it clear that sometimes folks just update parts of a table, or they drop an entire database. We also added a column for corruption (since your protection & recovery options are different than they are for Oops moments).
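If you would rather keep page one in a script than on paper, here is a minimal Python sketch of the idea: one data loss and downtime pair per column, recorded once for where the servers are now and once for what the business wants, then compared. The names `Objective` and `find_gaps` are hypothetical, and every number is a made-up placeholder rather than a value from the worksheet.

```python
from dataclasses import dataclass
from datetime import timedelta

# The four worksheet columns named in the post.
RISKS = ("HA", "DR", "Oops Queries", "Corruption")


@dataclass
class Objective:
    """Tolerated data loss and downtime for one risk column."""
    max_data_loss: timedelta
    max_downtime: timedelta


def find_gaps(current, wanted):
    """Return the risks where today's numbers are worse than the business wants."""
    gaps = {}
    for risk in RISKS:
        now, goal = current[risk], wanted[risk]
        problems = []
        if now.max_data_loss > goal.max_data_loss:
            problems.append("data loss")
        if now.max_downtime > goal.max_downtime:
            problems.append("downtime")
        if problems:
            gaps[risk] = problems
    return gaps


# Hypothetical numbers for illustration only, not taken from the worksheet.
current = {
    "HA": Objective(timedelta(minutes=1), timedelta(minutes=5)),
    "DR": Objective(timedelta(hours=24), timedelta(days=2)),
    "Oops Queries": Objective(timedelta(hours=1), timedelta(hours=4)),
    "Corruption": Objective(timedelta(hours=1), timedelta(hours=4)),
}
wanted = {
    "HA": Objective(timedelta(minutes=1), timedelta(minutes=1)),
    "DR": Objective(timedelta(hours=1), timedelta(days=1)),
    "Oops Queries": Objective(timedelta(hours=1), timedelta(hours=4)),
    "Corruption": Objective(timedelta(hours=1), timedelta(hours=8)),
}

for risk, problems in find_gaps(current, wanted).items():
    print(f"{risk}: falls short on {', '.join(problems)}")
```

Keeping each column separate matters because the business may accept very different numbers for a failed server than for a lost building, so a single pair of numbers per database would hide the gaps.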

When people first see this worksheet, they usually scoff and say, “The business is going to tell me we never want to lose data, and we’re never allowed to go down.” No problem – that’s where the second page of the worksheet comes in:

Find the amount of data you’re willing to lose on the left side, and the amount of downtime you’re willing to tolerate across the top. Where the boxes match up, that’s a rough price range of the solution.

In this version, we added an asterisk because a lot of supposedly synchronous solutions aren’t – for example, Always On Availability Groups don’t actually guarantee zero data loss. I still keep that sort of thing in zero data loss because most of the time, it’s zero data loss, but you just need to understand it’s not a guarantee.

I like printing those two pages front and back on the same piece of paper because it helps management understand that requirements and costs are two sides of the same coin. It’s management’s job to pick the right box (price range), and then it’s IT’s job to build a more detailed estimate for the costs inside the box. The third and final page of the worksheet breaks out the feature differences for each HA/DR option.

If you’re one of the tens of thousands of folks who’ve signed up for email alerts whenever we update our First Responder Kit, then you’ve already got an email this week with these new changes. If not, head on over and pick it up now.


9 Comments

T-Rex
July 22, 2016 11:51 am

Great presentation of costs of different RTO / RPO’s. It might be worth including backup & recovery as additional options.

Steve
July 22, 2016 12:14 pm

It’s been a long week and I’m likely missing the blindingly obvious… Would you mind explaining why you break the RTO/RPO into different categories? I would have thought for any one service the business owner would have a single RTO/RPO, rather than different requirements depending on what went wrong.

- Brent Ozar
  July 22, 2016 12:25 pm
  
  Steve – no problem! Businesses usually take different gambles for different risks. For example, for high availability, they usually want automatic failover with minimal data loss (zero or 1 minute) for their mission-critical data. However, if they lose their entire building, they’re often okay losing an hour’s worth of data and being down for a day.
  
  Each column has its own separate costs. With that above example, we’re talking something like failover clustering inside the production data center (which has one set of costs), plus perhaps log shipping to DR (which has a separate set of costs.)
  
  Sometimes businesses say, “I want only 1 minute of data loss and 1 minute of downtime no matter what” – but when they see the full pricing on it, they often change their minds.
  
  - Steve
    July 22, 2016 12:36 pm
    
    Gotcha, that makes perfect sense – thanks Brent!
    
Varsham Papikian
July 22, 2016 12:37 pm

Thanks Brent.
While I love the simplicity, which is important for such documents, depending on your goal you may want to somehow reflect the fact that SQL Server 2016 introduced ‘Basic Availability Groups’, so a limited version of Always On AGs stopped being EE-only. Yes, there are various limitations (like supporting only a single db per AG), but I love the fact we have the ‘Async’ option in Standard Edition (as everyone knows, we didn’t get the ‘Async’ option for Mirroring in Standard).

Thanks,
Varsham Papikian

- Brent Ozar
  July 22, 2016 12:38 pm
  
  Varsham – thanks, glad you enjoyed the doc.
  
  Unfortunately, as you noted, it’s impossible to fit all the various gotchas in a document this small. It’s only a starting point, not a finishing point. 😉
  
Gerald
July 22, 2016 12:54 pm

Dear Brent,

Great – but none of the MSSQL HA solutions will help to reduce RTO when it comes to wrong update or delete statements, or am I wrong?

Kind regards
Gerald

- Brent Ozar
  July 22, 2016 12:55 pm
  
  Exactly! That’s where you have to switch to backup & restore, especially with third party log reading products that can produce undo statements.
  
Henry
January 28, 2022 5:20 pm

Hi Brent, on page 2 of the PDF the text for 1 minute RTO seems to be cut off after “sync database”
