How Do You Manage DBAs? Part 1: Measuring Backups

When the DBA reports to you, ask yourself two questions:

  1. “What would I fire the DBA for?”
  2. “How can I prove that they’re not doing that?”

There’s a lot of possible answers to #1, many of which involve cocaine and the office copier, but let’s focus on one I hear a lot: an inability to restore data that they claimed was backed up.

Backups are the foundation of our Hierarchy of Database Needs. They’re incredibly important, but yet they often get ignored because nobody’s banging on the DBA’s cubicle telling them to focus on backups. Instead, users are screaming about performance or too-tight security.

It’s up to you, their manager, to enable them to get the basics down. Here’s how to do it.

1. Ask for a report of most recent backups, by database.

KPI: Coffee Cups Consumed (by André Freitas)

KPI: Coffee Cups Consumed (by André Freitas)

Call in the DBA and give them a list of columns you want in the report:

  • Server name
  • Database name
  • Database size
  • Purpose (production, QA, development)
  • Last full backup finish date/time
  • Last log backup finish date/time
  • Last successful DBCC CHECKDB finish date/time
  • Last date/time the backups were restored for testing

They can produce the report in any format that works for them, but they have to be able to produce it on demand with less than five minutes of manual work. (Ideally, they’d give you a solution you can run on your own, without telling them, but let’s start with something easy.)

Tell them this first task is about getting the lay of the land first, not fixing the problems. Oh, they’re going to find problems alright – but focus on just getting a clear picture of the work ahead first.

This task may take them a day or two because they have to assemble an exact list of the servers they’re responsible for, plus build the report process. (The good DBAs reading this are thinking, “I could do this in fifteen minutes!” If you’re just saying that now, that means you haven’t done it yet, and you’re not quite as good as you think. The great DBAs are thinking, “Yeah, that’s about how long it took me to get it right the first time.”)

2. Help the DBA come up with an improvement plan.

They’re going to be ashamed/disappointed by some of these numbers. (Or if they’re not, there’s your answer about their fit for the DBA position.)

Let them build a back-of-the-napkin mini-project plan to improve the metrics for your most important server. Give them that time – typically there’s a day or two of work involved – and after they finish, review what worked about the mini-project and what didn’t. Use that knowledge to build out a plan for the rest of the servers.

Your job is to run interference for them while they get the backups and DBCCs under control. When the DBA gets “urgent” help requests from end users, here’s what I coach them to say:

“Right now, I’m focused on a project to get our backups under control. If this issue is more important than getting our production data backed up, let’s go into my manager’s office together and talk through it. If not, I hate to say this, but I’ll need you to file a help desk ticket and I’ll get to it once we’ve got the backups under control.”

Yes, this means admitting that the backups aren’t under control. It’s the only way end users will accept a “no” from the DBA.

After this project finishes, run the report again, talk through the results, and then take the results to the business users who are storing data on these servers.

3. Ask the business if this amount of data loss is acceptable.

Take the backup numbers to the business along with our free RPO/RTO Worksheet. By default, the business usually wants zero data loss and zero downtime, but as good IT managers, we have to communicate the realities of today’s databases.

Armed with the costs and timelines in that worksheet, bring written RPO goals back to the DBA. (Yes, to be a good manager takes work, and these are political problems, not technical problems – save your DBA for the technical problems.) Sometimes, this will mean bringing budget numbers back as well – if the business wants tighter RPO/RTO goals than the current infrastructure provides, the DBA will need resources to build the infrastructure improvements.

4. Measure the gap between IT’s RPO goals and the DBA’s delivery.

Remember that report the DBA built? Add two columns:

  • IT’s goal for this database’s data loss (in minutes/hours) – you and the DBA come to an agreement on a realistic goal. In underfunded shops, this is sadly less ambitious than the business’s goals. (Example: “The business wants us to lose no more than 1 minute of data, but I understand that they’re not giving you space for log backups, so let’s aim for losing no more than 1 hour of data.”) In well-funded shops, the IT goal may actually be more ambitious than the business’s goal.
  • The gap between the DBA’s delivery and IT’s goal – if the last successful backup was 45 minutes ago, but we’re aiming for <30 minutes of data loss, we’re in the red by 15 minutes. This metric is how we measure DBAs.

As long as this gap is in the black, the DBA is doing a good job on backups. (Note: I’m not even talking about restore tests or DBCCs here – this is just what I consider table stakes on managing a DBA.)

When this number goes into the red, the DBA should stop focusing on anything else – anything – and focus exclusively on getting the data protected again.

For bonus points, trend the total server count, database count, and total data size over time. This helps your user base understand that the infrastructure will come under more pressure as the data size grows – you can only cram so much data into a server before performance issues arise.

In the next post in this series, I discuss training budgets for DBAs.

Previous Post
New Updates for sp_Blitz®, sp_BlitzCache™
Next Post
New SQL Server Management Studio 2015/June – with Check for Updates!

30 Comments. Leave new

  • End users requests should be coming from a ticketing system, urgent or not.
    This helps the DBA team (or any support team really) set their priority depending on the work load and emergency they have at hand.

    This kind of discipline has to be sponsored from the very top of management.

    Then there’s time for working on the out of control backups!

    My personal rule on measuring backup is: Perform a recovery once a month.
    This is not only to make sure the backup are available and working but also sharpens the DBA skills.
    It’s one thing to know what you’re doing when you’re relaxed, it’s another thing to have to perform it under stress. There won’t be time for research then.

    I like to make sure that my team is prepared in every way.

    Reply
  • Joel Witherspoon
    June 22, 2015 12:26 pm

    Thanks for this. Using this to test and align my enterprise to build standards.

    Reply
  • This is why I have no interest in being a DBA. I like the creativity and challenges of app development using relational dbs and tsql and some fun front end technology, but I don’t have the slightest interest in managing DB backups, replication, fail over, SANS, etc. etc.

    ken

    Reply
    • Having done both sides I can understand that mentality. But having coded an automated restore routine that randomly picks databases, locates the last backup, restores it from archive to a test machine, and runs an integrity check you can get coding challenges as a DBA. And there’s the added challenge of dealing with developers that think they know more than me. There’s no small amount of satisfaction to be derived from proving, using hard data, that the implicit conversions are, in fact, the cause of production randomly going down or pointing out the exact data condition that’s causing 1 million (literally) errors to be logged to their error table daily after they haven’t provided any updates for three weeks despite it resulting in the drive filling and bringing down replication repeatedly.

      I still miss development but there are more than enough challenges for me as a production DBA right now. And it is nice to not need to worry about mundane things like source control and code reviews for the most part. Even if I have to deal with backups and failovers.

      Reply
  • I guess as long as we pretend that management has given the DBA team the resources and the responsibility this is great. But at my last job, the network team was responsible for disksapce allocation for local and offsite storage. I begged them to let us keep 48 hours local (and that was a ramp down from 7 days because, literally, they laughed at that request) and a week quickly available. I was told the local storage SQL backups were backed up daily and put on near term storage so there would never be any chance of of not having backups available and I was out my mind to ask for any more than 24 hours locally. I told them “stuff happens.” Well guess what? Stuff happened. and it turned out the off site backups of the SQL backups were never set up despite them swearing they had been. And a user had made an update mistake.

    Nobody likes it when you say “I told you so.”

    Reply
    • Cjewel – yep, I coach DBAs to practice quarterly fire drills. Act as if a user dropped a table, and get your backup teams to deliver you the data as of a certain point in time. Like Ronald Reagan said, trust – but verify.

      It sounds like you forgot the verify part. 😉

      Reply
  • How would you modify this, or would you, in the case where the DBA’s have no control over the backups other than specifying what intervals they should be run in? What I’m getting at is that our server administrators are the only one’s that can do backups because they use another tool to perform backups and refuse to give us access to that backup software.

    Reply
  • This is one of the jewel posts of this blog. This is what Brent Ozar Unlimited is good at! Prefer blogging about this rather than repeat-blogging about some technical detail of SQL Server that any Google search would answer as well. This has a lot more value.

    Reply
    • Kendra Little
      June 23, 2015 8:44 am

      Wow. I’m not sure if this is an intentional or inadvertent backhanded compliment, but it’s remarkable.

      Reply
    • I’ll tech the technical details on here any day. Saves me from Googling! 😀

      Reply
    • Maybe some of the tech posts teach you things that you could find elsewhere via Google. They also go into why it works, verify it with repeatable tests you can try yourself, and have a sense of humour. Not many places give you all of that.

      Reply
  • Henrik Staun Poulsen
    June 23, 2015 8:59 am

    I’ve tried to write such a script, but I cannot get the DBCC to work across OpenRowSet.
    How would you do that?

    USE mydb

    /* dwh */
    SET NOCOUNT on
    BEGIN TRY
    DROP table #S
    END TRY
    BEGIN CATCH
    END CATCH

    BEGIN TRY
    DROP TABLE #Mytable
    DROP TABLE #DBInfoResults
    END TRY
    BEGIN CATCH
    END CATCH
    go

    CREATE table #S ( server_name sysname, purpose VARCHAR(255) )
    insert into #s
    VALUES
    (‘myserver’, ‘Test’),
    (‘mybigserver\dsa’, ‘Production’),
    (‘myservertoo’, ‘Test’)

    declare @Server_name sysname, @Purpose varchar(255)
    declare @sql nvarchar(max) = N”, @s varchar(21)=”, @loopCounter int=0, @debug TINYINT=0

    DECLARE Server_Cursor CURSOR
    FOR
    SELECT s.server_name, s.purpose
    FROM #S s
    order by 1

    CREATE TABLE #Mytable (server_name sysname, database_name sysname, LastFullBackup DATE, LastIncrementalBackup DATE, comment VARCHAR(255), SizeInGB BIGINT, LastRestoreDate DATE, LastKnownGoodDBCCCheck DATE)

    OPEN Server_Cursor
    FETCH NEXT FROM Server_Cursor INTO @Server_name, @Purpose
    WHILE @@FETCH_STATUS = 0 BEGIN
    set @loopCounter +=1
    RAISERROR (‘%i server: “%s” ‘, 10 ,1, @Loopcounter, @Server_name) WITH NOWAIT

    select @sql = N’
    insert into #Mytable ( server_name, database_name, LastFullBackup, LastIncrementalBackup)
    SELECT server_name, name, LastFullBackup, LastIncrementalBackup
    from OPENROWSET(”SQLNCLI10”, ”Server=’+@Server_name+’;Trusted_Connection=yes;”,
    ”SELECT server.server_name, d.name, FullBackup.LastFullBackup, IncBackup.LastIncrementalBackup
    FROM sys.databases d
    OUTER APPLY (SELECT @@SERVERNAME AS server_name) AS server
    OUTER APPLY (
    SELECT TOP 1 B.backup_finish_date AS LastFullBackup
    FROM msdb.dbo.backupset B
    WHERE TYPE=””d””
    AND server.server_name=B.server_name
    AND d.name=b.database_name
    ORDER BY B.backup_finish_date DESC
    ) AS FullBackup
    OUTER APPLY (
    SELECT TOP 1 B.backup_finish_date AS LastIncrementalBackup
    FROM msdb.dbo.backupset B
    WHERE TYPE=””I””
    AND server.server_name=B.server_name
    AND d.name=b.database_name
    ORDER BY B.backup_finish_date DESC
    ) AS IncBackup
    WHERE STATE_DESC = ””ONLINE””
    AND name ””tempdb”” /* no backups, checkdbs needed */
    ORDER BY 1,2

    ) as a

    if @loopCounter <= 1 IF @debug 0 select @sql
    begin try
    exec sp_executesql @sql
    end try
    begin CATCH
    print error_number()
    print ERROR_MESSAGE()
    end catch

    /* loop the databases found on this server */
    DECLARE @database_name sysname
    DECLARE db_Cursor CURSOR
    FOR
    SELECT database_name
    FROM #Mytable H
    WHERE H.server_name=@Server_name
    order by 1

    OPEN db_Cursor
    FETCH NEXT FROM db_Cursor INTO @database_name
    WHILE @@FETCH_STATUS = 0 BEGIN
    select @sql = N’
    UPDATE #Mytable
    SET SizeInGB=(
    SELECT SizeInGB
    from OPENROWSET(”SQLNCLI10”, ”Server=’+@Server_name+’;Trusted_Connection=yes;”,

    SELECT SUM(CAST(size AS BIGINT))*8/1024/1024 as SizeInGB FROM ‘ + @database_name + ‘.sys.database_files DF

    ) as a
    )
    from #Mytable h
    where h.server_name=”’ + @server_name + ”’ and h.DataBase_name=”’ + @database_name + ”’


    RAISERROR (‘%i server: “%s” db: %s get Size’, 10 ,1, @Loopcounter, @Server_name, @database_name) WITH NOWAIT
    if @loopCounter <= 1 IF @debug 0 select @sql
    begin try
    exec sp_executesql @sql
    end try
    begin CATCH

    print error_number()
    print ERROR_MESSAGE()
    end CATCH

    /* last restore date */
    select @sql = N’
    UPDATE #Mytable
    SET LastRestoreDate=(
    SELECT restore_date
    FROM OPENROWSET(”SQLNCLI10”, ”Server=’+@Server_name+’;Trusted_Connection=yes;”,

    SELECT max(restore_date) as restore_date FROM msdb.dbo.restorehistory where destination_database_name=””’ + @database_name + ””’

    ) as a
    )
    from #Mytable h
    where h.server_name=”’ + @server_name + ”’ and h.DataBase_name=”’ + @database_name + ”’

    RAISERROR (‘%i server: “%s” db: %s get Restore date ‘, 10 ,1, @Loopcounter, @Server_name, @database_name) WITH NOWAIT
    if @loopCounter <= 1 IF @debug 0 select @sql
    begin try
    exec sp_executesql @sql
    end try
    begin CATCH

    print error_number()
    print ERROR_MESSAGE()
    end CATCH

    /* last DBCC */
    /* does not work
    select @sql = N’
    UPDATE #Mytable
    SET LastKnownGoodDBCCCheck=(
    SELECT value
    FROM OPENROWSET(”SQLNCLI10”, ”Server=’+@Server_name+’;Trusted_Connection=yes;”,

    USE ‘ + @database_name + ‘;
    CREATE TABLE #DBInfoResults
    (
    [ParentObject] VARCHAR(512),
    [Object] VARCHAR(512),
    [Field] VARCHAR(512),
    [VALUE] VARCHAR(512)
    )
    insert into #DBInfoResults exec (”DBCC DBINFO() WITH TABLERESULTS, NO_INFOMSGS”);
    SELECT value FROM #DBInfoResults where Field = ”dbi_dbccLastKnownGood”

    ) as a
    )
    from #Mytable h
    where h.server_name=”’ + @server_name + ”’ and h.DataBase_name=”’ + @database_name + ”’

    RAISERROR (‘%i server: “%s” db: %s ‘, 10 ,1, @Loopcounter, @Server_name, @database_name) WITH NOWAIT
    if @loopCounter <= 1 select @sql
    begin try
    exec sp_executesql @sql
    end try
    begin CATCH

    print error_number()
    print ERROR_MESSAGE()
    end CATCH
    */
    SET @loopCounter+=1
    FETCH NEXT FROM db_Cursor INTO @database_name
    END
    CLOSE db_Cursor ;
    DEALLOCATE db_Cursor ;

    FETCH NEXT FROM Server_Cursor INTO @Server_name, @Purpose
    END
    CLOSE Server_Cursor ;
    DEALLOCATE Server_Cursor ;

    UPDATE #Mytable SET Comment = 'No backup required; structure in TFS.' WHERE database_name IN ('vdcasdw', 'DataWarehouseTemp')
    UPDATE #Mytable SET Comment = 'No backup required; test server.' FROM #Mytable H INNER JOIN #S S ON S.server_name = H.server_name WHERE Purpose IN ('test', 'Development')

    SELECT top 10000 h.*, s.purpose FROM #Mytable H
    INNER JOIN #S S ON S.server_name = H.server_name
    ORDER BY 1 DESC

    /* run report */
    SELECT H.server_name, H.database_name, H.LastFullBackup, H.LastIncrementalBackup, comment, sizeinGB, h2.LastRestoreDate, h2.LastKnownGoodDBCCCheck, h2.Purpose
    FROM #Mytable H
    INNER JOIN #S S ON S.server_name = H.server_name
    OUTER APPLY (
    SELECT MAX(LastKnownGoodDBCCCheck) AS LastKnownGoodDBCCCheck, MAX(LastRestoreDate) AS LastRestoreDate, utl.CommaListConcatenate(s3.Purpose) AS Purpose FROM #Mytable H3
    INNER JOIN #S S3 ON S3.server_name = H3.server_name
    WHERE h.database_name=h3.database_name ) AS h2
    WHERE s.purpose='production'
    ORDER BY 1,2

    Reply
  • Nice job! Looking forward to part 2.

    Reply
  • Here is a little PowerShell function I came up with that returns backup info about all your databases based on a list of instances (last full, last diff, last log, etc): https://gist.github.com/ryandevries/18f64f9a7a5e92cf7b56

    You can also specify an RPO in hours and it will give you only the databases that violate that.

    Reply
  • How do you get this column? •Last date/time the backups were restored for testing

    It’s a manual record that has to be kept, or marked by an automated backup testing process, right?

    I wrote an SSIS package that tested backups a while back, maybe time to break it out 🙂

    Thanks.

    Reply
    • Henrik Staun Poulsen
      June 26, 2015 1:28 am

      Hi Sam,

      We have test systems, which are created from backups taken on the production systems.
      In our case we keep the same database name when we do a restore, so that makes it easy to get the “Last Restore Date”.

      Best regards,
      Henrik

      Reply
  • Brent,

    You’re making me go back and update my blog to be more complete, because when I first wrote those posts I was missing some of the details you discussed here. (Read: I wasn’t great, yet.)

    Check everything you want to know with your production databases with the script on my Database Assessment post. Then check to see when it was last refreshed from prod with the script on my Backup History post, which, despite the name, also shows when backups go the other way as well. You just need to find out what the server names are and if they’re production; unfortunately, that can be a job in itself sometimes.

    I hope this helps, especially with how awkward it is to find when DBCC CheckDB was last run. As for me, I’m very glad I read this post because it reminded me to run these scripts as a double-check, and led me to find spots where DBCC CheckDB wasn’t run for a while.

    Thanks,
    Steve

    Reply
    • Awesome, glad I could help! I’d make sure to test these in a real-world environment – for example, at first glance, I don’t think the backup history one detects the difference between successful, failed, and aborted backups.

      Reply
  • I use Powershell like this

    # Load SMO extension
    [System.Reflection.Assembly]::LoadWithPartialName(“Microsoft.SqlServer.Smo”) | Out-Null;
    $Servers =
    ## A list ‘Servername1′,’Servername2’ a text file Get-Content ‘PATHTOSERVERFILE’ or query a database Invoke-SQLCmd -Server SERVERNAME -Database ALLMyInstances -Query “Select Name FROM Instances”
    foreach($Server in $Servers)
    {
    $srv = New-Object (‘Microsoft.SqlServer.Management.Smo.Server’) $Server
    $lastDBCC_CHECKDB = @{Name=”Last DBCC Check”;Expression={$_.ExecuteWithResults(“DBCC DBINFO () WITH TABLERESULTS”).Tables[0] | where {$_.Field.ToString() -eq “dbi_dbccLastKnownGood”} | Select Value -ExpandProperty Value}}
    foreach($db in $srv.databases)
    {
    $db|Select Parent,Name,LastBackupDate,LastDifferentialBackupDate,LastLogBackupDate,$lastDBCC_CHECKDB
    }
    }

    and then take those objects and manipulate them at will IE – write to a table, create a HTML email to the manager, write and colour code an Excel file for the junior DA to resolve, whatever is required

    Reply
  • Thanks, Brent.

    As an accidental DBA who 12 years later became an accidental DBA manager, I am really looking forward to this series.

    Reply
  • Unai Garcia
    July 26, 2015 3:53 pm

    IMHO you should add “for committed data” right behind RPO.

    If the business has transactions that runs for hours you need to make them understand that your RPO, based on backups, is only for committed data.

    Reply
  • DBA “SQL is so stable you don’t need backups these days.”

    Senior DBA “….but disks…and developers…..”

    DBA *runs to create backup strategy*

    //Background music queue – Eye of the Tiger//

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Menu
{"cart_token":"","hash":"","cart_data":""}