Announcing sp_BlitzBackups: Check Yourself Before You Wreck Yourself.

If one of your SQL Servers went down at some point over the last week:

  • What’s the most data you could have lost?
  • What’s the longest you would have been down for?

To find out, Erik and I (mostly Erik) wrote sp_BlitzBackups (sure, blame me for everything. –ED).

The Easy Way: Running It with No Parameters

When you run it with no parameters, it looks at the local server’s backup history stored in the msdb database and returns 3 result sets.
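
In its simplest form, that’s just:

EXEC dbo.sp_BlitzBackups;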

The first result set says, for each database that was backed up in the last week:

  • RPOWorstCaseMinutes – what’s the most data you could have lost in the last week? Say you normally run full backups daily and log backups every 10 minutes, but you had a 45-minute window on Thursday when no log backups happened. It would report 45 minutes.
  • RTOWorstCaseMinutes – what’s the longest amount of time you’d have spent restoring? By default, this is only the length of time that your backups took – and in real life, sometimes your restores take longer than your backups (especially if you don’t have Instant File Initialization turned on). Note that the RTOWorstCaseMinutes isn’t necessarily at the same point in time as your RPOWorstCaseMinutes. For example, in our full-daily-logs-every-10 scenario, your RTO Worst Case point in time is probably during the daily fulls, before they finish.
  • Plus a whole bunch of additional information fields about when the worst points in time were, how fast/slow your backups are going, their average sizes, and more.

The second and third result sets give you deeper analysis about issues with your backups, plus backup sizes over time going back 1 month, 2 months, 3 months, and more.

Power-Ups with Parameters

Here’s where the really neato stuff starts to come in.

Want to analyze more or less time? Use the @HoursBack parameter, which defaults to 168 hours (1 week). For example, on one project, we’ll be looking at backups every day to track holes in our backup strategy.
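
For example, to analyze just the last day:

EXEC dbo.sp_BlitzBackups @HoursBack = 24;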

Want a more precise RTO? If you’ve done restore performance tuning, and you know how fast your full, diff, and log backups restore, use the @RestoreSpeedFullMBps parameter (and its diff and log equivalents). In our Faux PaaS Part 2 post, we did a lot of tuning around restores, and oddly, we actually got ’em to go faster than the backups – but of course, your mileage may vary. When you pass in these numbers, we calculate RTO based on the backup file size divided by your throughput. (We just take your word for the throughput, though.)
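
Here’s a sketch of that kind of call – the throughput numbers are made up, and I’m assuming the diff and log parameters follow the same naming pattern as the full one:

EXEC dbo.sp_BlitzBackups
    @RestoreSpeedFullMBps = 200, -- hypothetical throughput: measure your own restores
    @RestoreSpeedDiffMBps = 200, -- assumed name, following the Full parameter's pattern
    @RestoreSpeedLogMBps = 100;  -- assumed name, following the Full parameter's pattern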

Using sp_BlitzBackups with an Availability Group

Ah, this is where it really starts to shine. See, the problem with AGs is that backups can be run on any replica, but the data isn’t centralized back into one server.

In a nutshell, you’re going to:

  • Create an msdb_fake database that your backup history will be copied into
  • Add it into your Availability Group so that it fails around with the rest of your databases (see the sketch after this list)
  • Set up a job for sp_BlitzBackups on all of your replicas, passing in parameters for @AGName, @PushBackupHistoryToListener, @WriteBackupsToListenerName, and @WriteBackupsToDatabaseName
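
Here’s a rough sketch of the first two steps – the AG name and backup path are hypothetical, and remember that a database needs a full backup before it can join an AG:

CREATE DATABASE msdb_fake;
ALTER DATABASE msdb_fake SET RECOVERY FULL;
BACKUP DATABASE msdb_fake TO DISK = N'C:\Backups\msdb_fake.bak'; -- hypothetical path; required before joining the AG
ALTER AVAILABILITY GROUP [YourAGName] ADD DATABASE msdb_fake;    -- run on the primary; [YourAGName] is a placeholder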

And then you’ll be able to analyze your RPO/RTO across the entire AG. Now, let’s get into specifics about how to do that – take it away, Erik.

Details

Part of the magic in sp_BlitzBackups is being able to centralize backup information from multiple servers into a repository for analysis.

It’s primarily aimed at people using Availability Groups, but it can be used by anyone with a Linked Server connection between two or more servers.

As of this writing, it’s a push-only model. So you’ll have to set this up on each server that you want to move data from.

We decided against writing a pull process, because the challenges of doing this in SQL Server are just a bit silly.

  1. Dealing with lists: should we dynamically generate them? If they’re static, who keeps them up to date as servers change?
  2. Dealing with timeouts: There’s not a good way to parallelize a process like this without The Service Broker That Shall Not Be Named, or setting up multiple jobs.
  3. Not using xp_cmdshell to start a ping DDoS attack against your own servers trying to figure out whether they’re up or not

We’re going to look at how to set it up using an AG Listener, but like I said, any ol’ Linked Server will do. I’m using the AG Listener so the process is failover-agnostic, meaning we’re writing to whoever the Primary is. If ‘Server 2’ is the Primary because ‘Server 1’ is down, it wouldn’t do any good trying to write data to ‘Server 1’, would it?

Also, you ever try to write data to a read-only database?

Kinda sucks.

Here’s my three node AG that I swear I meant to patch before taking screenshots of, but hey. We’ll fix it in post.

Here’s my linked server pointing to the AG Listener. I’m using sa for my Linked Server connection (I know, I know) — you’ll need at least write permissions for all this to work. If there’s a service account that can get you there, by all means: be a better DBA than me.
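
If you need to create that linked server from scratch, here’s a minimal sketch – the login and password are placeholders, not recommendations:

EXEC master.dbo.sp_addlinkedserver
    @server = N'SQLAGLISTEN',
    @srvproduct = N'',
    @provider = N'SQLNCLI',
    @datasrc = N'SQLAGLISTEN'; -- the AG Listener name

EXEC master.dbo.sp_addlinkedsrvlogin
    @rmtsrvname = N'SQLAGLISTEN',
    @useself = N'FALSE',
    @locallogin = NULL,
    @rmtuser = N'sa', -- again: be a better DBA than me
    @rmtpassword = N'HypotheticalPassword1!'; -- placeholder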

The command we’re going to be using looks like this:

  • @PushBackupHistoryToListener = 1 — This skips the analysis process and goes right to the data mover. The default is 0, to analyze only.
  • @WriteBackupsToListenerName = ‘SQLAGLISTEN’ — This is (somewhat confusingly) the name of the Linked Server connection you’ll be using. Note that it doesn’t have to be an AG Listener name, but when we started writing this, we named stuff for what we thought we were going to end up with. It turned out a little bit different, but we’re hedging our bets that most people using this will be using AGs, anyway. This can be NULL if you specify the @AGName variable — we’ll hit some DMVs to find the Listener name for the AG you specify.
  • @WriteBackupsToDatabaseName = ‘msdb_fake’ — The name of the database with the fake dbo.backupset table in it. Note that this can’t be msdb. We can’t write to your actual system databases or tables. I mean, we could, we just don’t want to because that’s dumb. This can’t be NULL. If you don’t have the backupset table in this database, we’ll create it for you, along with some indexes, using Leprechaun Magic.
  • @WriteBackupsLastHours = 168 — How many hours back you want to start checking for backup data to move. This can be just about any number, really. If you feed in a positive number, we’ll flip it negative to do some date math.
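
Putting those parameters together, the full call looks like this:

EXEC dbo.sp_BlitzBackups
    @PushBackupHistoryToListener = 1,
    @WriteBackupsToListenerName = 'SQLAGLISTEN',
    @WriteBackupsToDatabaseName = 'msdb_fake',
    @WriteBackupsLastHours = 168;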

The process uses a batching technique inspired by (though it didn’t fit 100% in this scenario) Michael J. Swart, Canadian Extraordinaire.

Since we take your argument for hours back, we also double-check to make sure you actually have data that far back to move. To be nice to your server, we move data in 10-minute intervals from one server to the other. There’s no sense in looping through 10-minute intervals with nothing in them.

We’ve built in some safeguards to prevent things from going off the rails. If there’s no data to move, we exit. If the 10-minute interval is past the current server time, we break. There’s some other smart stuff, too.
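
If you want a mental model of that loop, here’s a conceptual sketch – this is not the actual sp_BlitzBackups code, and it copies only a handful of backupset columns into a temp table for illustration. It starts at the oldest backup in range (if there’s no data, the start date is NULL and the loop never runs), and the WHILE condition stops once the window passes the current server time:

CREATE TABLE #backup_history_stage
(
    database_name NVARCHAR(128),
    [type] CHAR(1),
    backup_start_date DATETIME,
    backup_finish_date DATETIME,
    backup_size NUMERIC(20, 0)
);

/* Oldest backup inside the requested range; NULL when there's nothing to move. */
DECLARE @WindowStart DATETIME =
    (SELECT MIN(backup_finish_date)
     FROM msdb.dbo.backupset
     WHERE backup_finish_date >= DATEADD(HOUR, -168, GETDATE()));
DECLARE @WindowEnd DATETIME;

WHILE @WindowStart < GETDATE() /* stop once we pass the current server time */
BEGIN
    SET @WindowEnd = DATEADD(MINUTE, 10, @WindowStart);

    /* Move one 10-minute window of history. */
    INSERT #backup_history_stage (database_name, [type], backup_start_date, backup_finish_date, backup_size)
    SELECT database_name, [type], backup_start_date, backup_finish_date, backup_size
    FROM msdb.dbo.backupset
    WHERE backup_finish_date >= @WindowStart
      AND backup_finish_date < @WindowEnd;

    SET @WindowStart = @WindowEnd;
END;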

With that command explained a bit, all that’s left is setting up an agent job to run.

The job itself can be a simple stored procedure call.
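
For example, here’s a minimal sketch (the job and step names are made up):

EXEC msdb.dbo.sp_add_job
    @job_name = N'sp_BlitzBackups - Push Backup History';

EXEC msdb.dbo.sp_add_jobstep
    @job_name = N'sp_BlitzBackups - Push Backup History',
    @step_name = N'Push backup history to listener',
    @subsystem = N'TSQL',
    @database_name = N'master',
    @command = N'EXEC dbo.sp_BlitzBackups
        @PushBackupHistoryToListener = 1,
        @WriteBackupsToListenerName = ''SQLAGLISTEN'',
        @WriteBackupsToDatabaseName = ''msdb_fake'',
        @WriteBackupsLastHours = 168;';

/* Target the local server so the job actually runs somewhere. */
EXEC msdb.dbo.sp_add_jobserver
    @job_name = N'sp_BlitzBackups - Push Backup History';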

You’ll want to set the schedule at a reasonable interval — if you take backups every 10 minutes, you don’t need to run this every minute.
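
For example, a 10-minute schedule for the job sketched above, assuming you take log backups every 10 minutes:

EXEC msdb.dbo.sp_add_jobschedule
    @job_name = N'sp_BlitzBackups - Push Backup History',
    @name = N'Every 10 Minutes',
    @freq_type = 4,             -- daily
    @freq_interval = 1,         -- every day
    @freq_subday_type = 4,      -- subday units: minutes
    @freq_subday_interval = 10; -- every 10 minutes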

To check and see if it’s working, just run a query like this to verify backup dates are being inserted.

SELECT MAX(backup_finish_date) AS max_backup_date
FROM msdb_fake.dbo.backupset

Thankfully, It Works On My Machine®


31 Comments

  • Randolph West
    May 30, 2017 1:10 pm

    I ran this on SQL Server 2008 Web Edition and it told me I can compress my backups.

  • Wes Crockett
    May 30, 2017 3:54 pm

    Well, since this has gone most of the day w/out a comment… let me just say…

    .First.

  • Ron MacNeil
    June 5, 2017 2:33 pm

    First of all, good work! Though I found an issue when using AGs…
    In an AG, if backups are always done on the primary host, when centralizing the backupset from the AG replicas into ‘msdb_fake’, the output from sp_BlitzBackups won’t consolidate the backups for a particular database. I.e., Replica1 does some backups of ‘Crap1’, fails over, Replica2 does some backups of ‘Crap1’, and the output of sp_BlitzBackups will have two rows for ‘Crap1’ with holes in the RPO calculation.

    • Ron – yeah, msdb_fake needs to be *in* the availability group so that it fails around with your databases, always maintaining the full backup history.

      • Ron MacNeil
        June 5, 2017 2:42 pm

        The issue seems to be related to database_guid being different on the varied replicas and thus the centralized backupset rows for ‘Crap1’ won’t group together within msdb_fake (there’s a group by on database_guid).

        • Hmm, if the database_guids are different, are you sure they’re in the same Availability Group? If so, go ahead and create an issue in GitHub. (The only way I’ve seen that is when they’re not really the same database – like if they’re not in an AG.)

          • Ron MacNeil
            June 5, 2017 2:58 pm

            Yup, same AG – the master.sys.databases.group_database_id values are the same. I’ll attempt to put in an issue on GitHub. Thanks again!

  • Nice addition to your Blitz suite. The alert about not having backup checksums is especially useful. Maybe that alert should be added to sp_Blitz as well?

  • I’ve been looking for a good solution to push backup history to a central repository for reporting. However (and maybe I’m missing something), this only pushes msdb..backupset to the central database and doesn’t include the master..sysdatabases table, which would let me see when a database on a server doesn’t have any backups at all.

    On a new server to quickly figure out if backups are running I can run two queries:
    SELECT 'No Full backup in last 1 day', MAX(ISNULL(DATEDIFF(dd,ISNULL(b.backup_start_date, '01/01/1900'),GETDATE()),0)) AS 'NumDays'
    , d.name as 'DBName'
    FROM master..sysdatabases d
    LEFT JOIN msdb..backupset b ON d.name = b.database_name
    AND b.backup_start_date = (SELECT MAX(backup_start_date)
    FROM msdb..backupset b2
    WHERE b.database_name = b2.database_name AND b2.type IN ('D','I'))
    WHERE d.name != 'tempdb'
    --AND d.name NOT IN (SELECT db_name FROM dbautility.dbo.db_exclude)
    AND DATABASEPROPERTYEX(d.name, 'Status') = 'ONLINE'
    GROUP BY d.name, b.type, b.backup_size
    HAVING MAX(ISNULL(DATEDIFF(dd,ISNULL(b.backup_start_date, '01/01/1900'),GETDATE()),0)) > 1

    and

    SELECT 'No Transaction log backup in last 20 mins', @@servername, MAX(ISNULL(DATEDIFF(hh,ISNULL(b.backup_start_date, '01/01/1900'),GETDATE()),0)) AS 'NumDays'
    , d.name as 'DBName', (select recovery_model_desc from [master].[sys].[Databases] s WITH (NOLOCK) where name = d.name) as 'Recovery Model'
    FROM master..sysdatabases d
    LEFT JOIN msdb..backupset b ON d.name = b.database_name
    AND b.backup_start_date = (SELECT MAX(backup_start_date)
    FROM msdb..backupset b2
    WHERE b.database_name = b2.database_name AND b2.type IN ('L'))
    WHERE d.name != 'tempdb'
    --AND d.name NOT IN (SELECT db_name FROM dbautility.dbo.db_exclude)
    AND DATABASEPROPERTYEX(d.name, 'Status') = 'ONLINE'
    GROUP BY d.name, b.type, b.backup_size
    HAVING MAX(ISNULL(DATEDIFF(hh,ISNULL(b.backup_start_date, '01/01/1900'),GETDATE()),0)) > 20

    Maybe there’s a better way to do it, but this has been working well. I can quickly see if we have a database that doesn’t have a full or diff backup in the last day, or a transaction log backup in the last 20 minutes. This lets me easily see if backups are working, without looking at Agent Jobs.

    Any way you could also include sysdatabases in the mix when pushing backup history to the listener?

    • Joel – I’m going to try to rephrase what you wrote just to make sure I captured it correctly:
      * You’re using a different technique today
      * You want to request a feature to sp_BlitzBackups

      If I captured that right, can you file an issue with your feature request in the GitHub repo? The link/instructions are in the beginning of the post. Thanks!

  • How will this work where I have 8 AGs over two nodes? At any point in time, the AGs could be primary on either node, dependent on network or power issues. Normally there are 4 on each node.

  • Hi Erik, Brent

    to feed your new PowerShell addiction:
    foreach ( $inst in $mysqls )
    {
    $WorstCase,$Detail,$Issue = Invoke-Sqlcmd -ServerInstance $inst -Database master -Query "EXEC dbo.sp_BlitzBackups;" -OutputAs DataTables
    # Write-DataTable or something …
    }
    We could write the 3 data sets to a central DB without having linked servers. When centralizing the results of multiple servers, it would be nice to have Hostname and Instancename as output columns as well.

    Thank you very much for this great procedure!

  • I just found a major bug in the @PushBackupHistoryToListener functionality. When the backupset table is pushed to the database, it doesn’t push the time_zone column. So if you look at msdb_fake, the time_zone column records will all be NULL.

    This is required if you want to create a centralized report and convert the backup times from servers with different timezones.

    For example, this query will show the time_zone, the backup start date (which is the server time that the backup kicked off), and 4 different columns that translate the time to EST, CST, MST, and UTC. If the time_zone or backup_start_date is NULL, then it will display ~NEVER~ to indicate you don’t have a good backup.

    Since time_zone is not included in sp_BlitzBackups, the times will ALWAYS show ~NEVER~.

    Select
    time_zone,
    backup_start_date
    ,ISNULL(CAST(DATEADD(HOUR,DATEDIFF(hh, getutcdate(), getdate()) +1,dateadd(MINUTE,(cast(b.time_zone as float) / cast(4 as float) * 60 * -1), b.backup_start_date)) as varchar(19)), '~NEVER~') as 'Last Backup Date (EST)'
    ,ISNULL(CAST(DATEADD(HOUR,DATEDIFF(hh, getutcdate(), getdate()) +0,dateadd(MINUTE,(cast(b.time_zone as float) / cast(4 as float) * 60 * -1), b.backup_start_date)) as varchar(19)), '~NEVER~') as 'Last Backup Date (CST)'
    ,ISNULL(CAST(DATEADD(hour,DATEDIFF(hh, getutcdate(), getdate()) -1,dateadd(MINUTE,(cast(b.time_zone as float) / cast(4 as float) * 60 * -1), b.backup_start_date)) as varchar(19)), '~NEVER~') as 'Last Backup Date (MST)'
    ,ISNULL(CAST(dateadd(MINUTE,(cast(b.time_zone as float) / cast(4 as float) * 60 * -1), b.backup_start_date) as varchar(19)), '~NEVER~') as 'Last Backup Date (UTC)'
    FROM [msdb_fake].[dbo].[backupset] b

    So needless to say, if you are using this to push backup information to a Central database with servers on different timezones, it will be next to impossible to determine if you have recent backups.

    • Simple fix was to update sp_BlitzBackups.

      Change

      SET @StringToExecute += N' (database_name, database_guid, backup_set_uuid, type, backup_size, backup_start_date, backup_finish_date, media_set_id,
      compressed_backup_size, recovery_model, server_name, machine_name, first_lsn, last_lsn, user_name, compatibility_level,
      is_password_protected, is_snapshot, is_readonly, is_single_user, has_backup_checksums, is_damaged, ' + CASE WHEN @ProductVersionMajor >= 12
      THEN + N'encryptor_type, has_bulk_logged_data)' + @crlf
      ELSE + N'has_bulk_logged_data)' + @crlf
      END

      SET @StringToExecute += N'
      SELECT database_name, database_guid, backup_set_uuid, type, backup_size, backup_start_date, backup_finish_date, media_set_id,
      compressed_backup_size, recovery_model, server_name, machine_name, first_lsn, last_lsn, user_name, compatibility_level,
      is_password_protected, is_snapshot, is_readonly, is_single_user, has_backup_checksums, is_damaged, ' + CASE WHEN @ProductVersionMajor >= 12
      THEN + N'encryptor_type, has_bulk_logged_data' + @crlf
      ELSE + N'has_bulk_logged_data' + @crlf
      END

      To

      SET @StringToExecute += N' (database_name, database_guid, backup_set_uuid, type, backup_size, time_zone, backup_start_date, backup_finish_date, media_set_id,
      compressed_backup_size, recovery_model, server_name, machine_name, first_lsn, last_lsn, user_name, compatibility_level,
      is_password_protected, is_snapshot, is_readonly, is_single_user, has_backup_checksums, is_damaged, ' + CASE WHEN @ProductVersionMajor >= 12
      THEN + N'encryptor_type, has_bulk_logged_data)' + @crlf
      ELSE + N'has_bulk_logged_data)' + @crlf
      END

      SET @StringToExecute += N'
      SELECT database_name, database_guid, backup_set_uuid, type, backup_size, time_zone, backup_start_date, backup_finish_date, media_set_id,
      compressed_backup_size, recovery_model, server_name, machine_name, first_lsn, last_lsn, user_name, compatibility_level,
      is_password_protected, is_snapshot, is_readonly, is_single_user, has_backup_checksums, is_damaged, ' + CASE WHEN @ProductVersionMajor >= 12
      THEN + N'encryptor_type, has_bulk_logged_data' + @crlf
      ELSE + N'has_bulk_logged_data' + @crlf
      END

      • Joel – OK, great, thanks, but can you read the post about where to post issues? We use GitHub for source control & issue handling, rather than blog post comments. It’s tough for us to diagnose complex issues here inside the blog. Thanks for your understanding, and I’m excited to see the issue over there! We’ll get it fixed pronto once we’ve got it in our dev queue.

  • What is the advice if you use SQL 2016/2017 Standard Edition?
    In Standard you can only have 1 database in a Basic Availability Group.

  • modernuniverse
    April 11, 2019 4:12 am

    > With that command explained a bit, all that’s left is setting up an agent job to run.

    I assume we create and enable such a SQL Agent job on every server, whether or not we’re running a backup process from there at the moment, so that when we fail over, the central backup data keeps being updated automatically.

    Could you please confirm my understanding? Thank you.

  • I have several DBs in the full recovery model, on which I’m performing full backups every weekend, differentials every midnight, and log backups every minute. Any reason why I would see RPOWorstCaseMinutes in the thousands?

    • Could be a whole lot of reasons: your backups aren’t working, jobs failed for a while, there’s a hole in the schedule, etc. I’m going to give you a little tough love here: your job is to roll up your sleeves and find that out. Hope that gets you started!

  • Well, that’s the thing – I’ve confirmed all my backups are running fine.

