If one of your SQL Servers went down at some point over the last week:
- What’s the most data you could have lost?
- What’s the longest you would have been down for?
To find out, Erik and I (mostly Erik) wrote sp_BlitzBackups (sure, blame me for everything. –ED).
- Download the latest First Responder Kit including the new sp_BlitzBackups
- For questions about how it works, read on, and then leave a comment
- For issues or things to change, add an issue in Github
The Easy Way: Running It with No Parameters
When you run it with no parameters, it looks at the local server’s backup history stored in the msdb database and returns 3 result sets.
The first result set says, for each database that was backed up in the last week:
- RPOWorstCaseMinutes – what’s the most amount of data you could have lost in the last week? Say you normally run full backups daily and log backups every 10 minutes, but you had a 45-minute window on Thursday when no log backups happened. It would report 45 minutes.
- RTOWorstCaseMinutes – what’s the longest amount of time you’d have spent restoring? By default, this is only the length of time that your backups took – and in real life, sometimes your restores take longer than your backups (especially if you don’t have Instant File Initialization turned on. Note that the RTOWorstCaseMinutes isn’t necessarily at the same point in time as your RPOWorstCaseMinutes. For example, in our full-daily-logs-every-10 scenario, your RTO Worst Case point in time is probably a point in time during the daily fulls, before they finish.
- Plus a whole bunch of additional information fields about when the worst points in time were, how fast/slow your backups are going, their average sizes, and more.
The second and third result sets give you deeper analysis about issues with your backups, plus sizes over time going back -1 month, -2 months, -3 months, and more.
Power-Ups with Parameters
Here’s where the really neato stuff starts to come in.
Want to analyze more or less time? Use the @HoursBack parameter, which defaults to 168 hours (1 week). For example, on one project, we’ll be looking at backups every day to track holes in our backup strategy.
Want a more precise RTO? If you’ve done restore performance tuning, and you know how fast your full, diff, and log backups restore, use the @RestoreSpeedFullMBps (and diff and log). In our Faux PaaS Part 2 post, we did a lot of tuning around restores, and oddly, we actually got ’em to go faster than the backups – but of course, your mileage may vary. When you pass in these numbers, then we calculate RTO based on the file size / your throughput. (We just take your word for the throughput, though.)
Using sp_BlitzBackups with an Availability Group
Ah, this is where it really starts to shine. See, the problem with AGs is that backups can be run on any replica, but the data isn’t centralized back into one server.
In a nutshell, you’re going to:
- Create an msdb_fake database where your backup history will be copied into
- Add it into your Availability Group so that it fails around with the rest of your databases
- Set up a job for sp_BlitzBackups on all of your replicas, passing in parameters for @AGName, @PushBackupHistoryToListener, @WriteBackupsToListenerName, and @WriteBackupsToDatabaseName
And then you’ll be able to analyze your RPO/RTO across the entire AG. Now, let’s get into specifics about how to do that – take it away, Erik.
Part of the magic in sp_BlitzBackups is being able to centralize backup information from multiple servers into a repository for analysis.
It’s primarily aimed at people using Availability Groups, but it can be used by anyone with a Linked Server connection between one or more servers.
As of this writing, it’s a push-only model. So you’ll have to set this up on each server that you want to move data from.
We decided against writing in a pull process, because the challenges of doing this in SQL Server are just a bit silly.
- Dealing with lists: should we dynamically generate them? If they’re static, who keeps them up to date as servers change?
- Dealing with timeouts: There’s not a good way to parallelize a process like this without The Service Broker That Shall Not Be Named, or setting up multiple jobs.
- Not using xp_cmdshell to start a ping DDOS attack against your own servers trying to figure out if they’re up or not
We’re going to look at how to set it up using an AG Listener, but like I said, any ol’ Linked Server will do. I’m using the AG Listener so the process is failover-agnostic, meaning we’re writing to whomever the Primary is. If ‘Server 2’ is the Primary because ‘Server 1’ is down, it wouldn’t do any good trying to write data there, would it?
Also, you ever try to write data to a read only database?
Here’s my three node AG that I swear I meant to patch before taking screenshots of, but hey. We’ll fix it in post.
Here’s my linked server pointing to the AG Listener. I’m using sa for my Linked Server connection (I know, I know) — you’ll need at least write permissions for all this to work. If there’s a service account that can get you there, by all means: be a better DBA than me.
The command we’re going to be using looks like this:
@PushBackupHistoryToListener = 1,
@WriteBackupsToListenerName = 'SQLAGLISTEN',
@WriteBackupsToDatabaseName = 'msdb_fake',
@WriteBackupsLastHours = 168
- @PushBackupHistoryToListener = 1 — This skips the analysis process and goes right to the data mover. The default is 0, to analyze only.
- @WriteBackupsToListenerName = ‘SQLAGLISTEN’ — This is (somewhat confusingly) the name of the Linked Server connection you’ll be using. Note that it doesn’t have to be an AG Listener name, but when we started writing this, we named stuff for what we thought we were going to end up with. It turned out a little bit different, but we’re heding our bets that most people using this will be using AGs, anyway. This can be NULL if you specify the @AGName variable — we’ll hit some DMVs to find the Listener name for the AG you specify.
- @WriteBackupsToDatabaseName = ‘msdb_fake’ — The name of the database with the fake dbo.backupset table in it. Note that this can’t be msdb. We can’t write to your actual system databases or tables. I mean, we could, we just don’t want to because that’s dumb. This can’t be NULL. If you don’t have the backupset table in this database, we’ll create it for you, along with some indexes, using Leprechaun Magic.
- @WriteBackupsLastHours = 168 — How many hours back you want to start checking for backup data to move. This can be just about any number, really. If you feed in a positive number, we’ll flip it negative to do some date math.
The process uses a batching technique inspired by (but that didn’t fit 100% in this scenario) by Michael J. Swart, Canadian Extraordinaire.
Since we take your argument for hours back, we also double check to make sure you actually have data that far back to move. To be nice to your server, we move data in 10 minute intervals from one to another. There’s no sense in looping through 10 minute intervals with nothing in them.
We’ve built in some safeguards to prevent things from going off the rails. If there’s no data to move, we exit. If the 10 minute interval is past the current server time, we break. There’s some other smart stuff, too.
With that command explained a bit, all that’s left is setting up an agent job to run.
The job itself can be a simple stored procedure call.
You’ll want to set the schedule at a reasonable interval — if you take backups every 10 minutes, you don’t need to run this every minute.
To check and see if it’s working, just run a query like this to verify backup dates are being inserted.
SELECT MAX(backup_finish_date) AS max_backup_date
Thankfully, It Works On My Machine®