Test: The Top Two SQL Server Problems I Find Everywhere

Companies call us for performance or high availability issues, but over and over, the very first two things we find are:

  1. They’re not taking backups to match the business’s RPO and RTO
  2. They’re not doing CHECKDB weekly, or at all, and don’t understand why that’s an issue

So let’s walk through a simple scenario and see how you do.

It’s Thursday morning at 11AM, and you get an email: users are reporting corruption errors when they run SELECTs on a critical table. You run the query, and it turns out the clustered index on the table has corruption.

Here’s your maintenance schedule:

  • Full backups nightly at 11PM
  • Log backups every 15 minutes
  • Delete log backups older than 2 days (because you only need point-in-time restore capability for recent points in time, right?)
  • CHECKDB weekly on Saturdays at 9AM
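
If you scripted that schedule out by hand, the jobs would boil down to something like this (the database name and paths here are placeholders):

    -- Nightly at 11PM
    BACKUP DATABASE CriticalDB
        TO DISK = N'X:\Backups\CriticalDB_FULL.bak'
        WITH CHECKSUM, COMPRESSION, STATS = 10;

    -- Every 15 minutes
    BACKUP LOG CriticalDB
        TO DISK = N'X:\Backups\CriticalDB_LOG.trn'
        WITH CHECKSUM;

    -- Saturdays at 9AM
    DBCC CHECKDB (CriticalDB) WITH NO_INFOMSGS, ALL_ERRORMSGS;
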
Our national symbol is ashamed of your RPO/RTO

You can’t repair the corruption (it’s a clustered index, and there aren’t enough nonclustered indexes to cover all the columns), and the business needs that data back. You’re on. Answer these questions:

  1. What backups do you restore, in order?
  2. Will they be free of corruption?
  3. How much data will you have lost?
  4. How long will that process take?
  5. Given that, what’s your effective RPO and RTO?
  6. If the business said that wasn’t good enough, what specific steps could you take to improve those numbers without spending money?
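
Before you can answer question 1, you need to know exactly which backups exist. msdb keeps that history; here’s a minimal sketch, assuming a database called CriticalDB:

    -- List recent backups, newest first. type: D = full, I = differential, L = log.
    SELECT bs.backup_start_date,
           bs.type,
           bmf.physical_device_name
    FROM msdb.dbo.backupset AS bs
    JOIN msdb.dbo.backupmediafamily AS bmf
        ON bs.media_set_id = bmf.media_set_id
    WHERE bs.database_name = N'CriticalDB'
    ORDER BY bs.backup_start_date DESC;

Keep in mind the history only tells you what was backed up, not whether the files are still on disk after that two-day cleanup job runs.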

This week, while folks are working at low speed due to the holidays, double-check those backups and corruption check jobs.
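
Two quick checks get you most of the way there, sketched against that same hypothetical CriticalDB (LastGoodCheckDbTime requires SQL Server 2016 SP2 or later):

    -- When did CHECKDB last finish cleanly?
    SELECT DATABASEPROPERTYEX(N'CriticalDB', 'LastGoodCheckDbTime') AS last_good_checkdb;

    -- When did the last full backup finish?
    SELECT MAX(backup_finish_date) AS last_full_backup
    FROM msdb.dbo.backupset
    WHERE database_name = N'CriticalDB'
      AND type = 'D';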


10 Comments

  • DoubleBarrellDarrell
    December 6, 2016 11:42 am

    11:00 AM?
    There goes lunch!
    Please Brent post the answer.
    This stuff keeps me up at night.
    I just struggled through a corruption on a secondary replica.
    suspect_pages, anyone?
    So this suspect_pages table has more rows than I expected. Most of them are marked event_type 5 (repaired). I’ve got one that won’t repair. DBCC PAGE showed no object ID; my guess was the whole page was bad. It was a good learning experience getting the replica offline, restored, and syncing again.
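    For anyone who hasn’t poked at it, it’s just a small table in msdb; something like this shows what it’s been recording:

    -- event_type: 1-3 = damage detected, 4 = restored, 5 = repaired, 7 = deallocated
    SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
    FROM msdb.dbo.suspect_pages
    ORDER BY last_update_date DESC;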
    But this problem above?!
    I’ve already blown my RPO/RTO just finding the issue!
    1.What backups do you restore, in order?
    restore from last night’s backup and apply logs up to 10:45 AM
    2.Will they be free of corruption?
    backup with checksum and restore verifyonly make me feel good my backups are clean
    3.How much data will you have lost?
    15 minutes
    4.How long will that process take?
    I got my 2tb db back in an hour but the checkdb took 6 hours
    5.Given that, what’s your effective RPO and RTO?
    10:45(15 minutes)/1 hour
    6.If the business said that wasn’t good enough, what specific steps could you take to improve those numbers without spending money?
    idk
    calling Ozar will be expensive
    unlicensed replica will still require $ for OS and server
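    For answer 2, the checksum/verifyonly combo I lean on is just this (file name made up):

    -- Write page checksums into the backup, then verify them without restoring
    BACKUP DATABASE CriticalDB
        TO DISK = N'X:\Backups\CriticalDB_FULL.bak'
        WITH CHECKSUM;

    RESTORE VERIFYONLY
        FROM DISK = N'X:\Backups\CriticalDB_FULL.bak'
        WITH CHECKSUM;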

    Reply
  • This post only seems to outline a problem without a resolution. Of course you’re correct to point this out, because the backup very likely contains the same corruption, but what about the missing factors? What if the nightly backup is the only full backup? How long should the nightly backup be retained? What if the DB is several hundred GB and takes 3 to 4 hours to capture nightly, and your backup plan is a weekly full, nightly diffs, and hourly transaction logs? Space is a factor at times if resources are limited.

    Reply
    • Old, but disagree. If the nightly Full also contained the corruption, your phone would have rung long before 11AM. The trick will be figuring out which set of T-log backups is still good.

      Reply
  • I’m going to make some assumptions here (yes, I know what they say happens when you assume).
    > Most recent Saturday’s CHECKDB was OK, giving you 1 known good full backup state and several suspect states
    > Corruption occurred at some unknown point since that Saturday CHECKDB
    > The reason CHECKDB only runs once a week is because the database(s) is large
    > The business doesn’t want to lose any data, meaning we should try to restore the most recent first
    1. Restore this morning’s full backup, check for corruption. If not present, apply logs and continue checking for corruption. Otherwise, restore yesterday’s full backup and do the same.
    2. If today’s and yesterday’s full backups have the corruption, you may be stuck going back to Saturday’s without it. Try each day, going backward.
    3. Up to 5 days, in scenario 2. As little as 15 minutes, if you’re lucky.
    4. This is the rough one. If the database is large (see assumptions) you’ve got many hours poring through backup after backup, and if you do have to go back to Saturday’s full backup, not only are you out the days of data, but also the hours it takes to restore each backup and test for corruption. If corruption occurred sometime Saturday and nobody caught it, you’re in a nightmare scenario. Best case: 15-30 minutes, for a very recently corrupted small database, where you restore last night’s full backup and then each log up to 10:45 or so. And you have that pre-scripted so you don’t have to hunt down each log backup in sequence, right? 🙂
    5. Depends very much on when corruption occurred. If you’re lucky, 15 mins and 1 hour. If you’re not lucky, both of these could go into “days”.
    6. Easy steps would be to run CHECKDB more often; we run it nightly. On large databases you can break up CHECKDB using WITH PHYSICAL_ONLY, or check certain tables instead of the entire DB. Darrell mentioned backing up with checksum, which is a good idea, and you should also turn on the CHECKSUM page verification database option. You can also take full backups more frequently, or differentials (to limit the performance impact), and keep log backups and differentials longer.
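    Concretely, those knobs look something like this (database name and path are placeholders):

    -- Lighter-weight consistency check for big databases
    DBCC CHECKDB (CriticalDB) WITH PHYSICAL_ONLY, NO_INFOMSGS;

    -- Page checksums get written on every page write and verified on read
    ALTER DATABASE CriticalDB SET PAGE_VERIFY CHECKSUM;

    -- And have backups verify those checksums as they read the pages
    BACKUP DATABASE CriticalDB
        TO DISK = N'X:\Backups\CriticalDB_FULL.bak'
        WITH CHECKSUM;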

    With better pre-configuration, the scenario would go more like:
    11am Thursday, users report database issues. You investigate and find corruption on the clustered index which is irreparable. You’ve got a clean DBCC CHECKDB report from last night, so you restore last night’s full backup, the 6am differential, and then your 15-minute logs up til 10:45.
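    Scripted out, that restore chain is roughly this (all names and paths invented):

    RESTORE DATABASE CriticalDB
        FROM DISK = N'X:\Backups\CriticalDB_FULL_Wed2300.bak'
        WITH NORECOVERY, CHECKSUM, REPLACE;

    RESTORE DATABASE CriticalDB
        FROM DISK = N'X:\Backups\CriticalDB_DIFF_0600.bak'
        WITH NORECOVERY, CHECKSUM;

    -- ...one of these per 15-minute log backup, in order, up through 10:45...
    RESTORE LOG CriticalDB
        FROM DISK = N'X:\Backups\CriticalDB_LOG_1045.trn'
        WITH NORECOVERY, CHECKSUM;

    RESTORE DATABASE CriticalDB WITH RECOVERY;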

    Reply
  • My experience with perhaps 2005 or 2008 (can’t recall) is that SQL Server is happy to back up a corrupted database without complaint, but not so happy to restore it. My analysis in this situation is that there is no restore option unless a full backup of the DB taken prior to the corruption has been retained. This would violate just about any RPO.

    Reply
  • Simon Millar
    July 14, 2017 11:32 pm

     Having Transaction Log Backups going back to the last backup taken before the corruption, and taking a Tail-Log Backup before restoring, offer a good chance of getting through this with no data loss:
    1. Preparation
     a) stop any Transaction Log Backup Jobs, as it’s likely these also have logic to delete the Log Backups for the 2 Day Retention policy (which is too short if we’re only doing weekly CHECKDBs, by the way). Our Transaction Log Backups are our “Crown Jewels”, and if we have an unbroken Log Chain going back to the Last Good Backup (the backup taken pre-corruption), we have a good chance to recover from this without data loss. So, the more on disk the better
    b) advise any application support / incident management team of the issue: we need to initiate an outage of the application to:
    * stop new data going into a corrupt database; and
    * quiesce the application connections to allow a restore

     2. Investigate
    a) Restore the most recent Wednesday 11pm backup as a new Database – run CHECKDB on it
    b) If it’s corrupt, repeat for the previous Tuesday night
    c) If that’s also corrupt, and assuming the Backups also have a 2 day retention (should be at least 3 to cover long weekends?), hopefully you can restore from Tape / VTL Backups. Run CHECKDB again. Be sure to restore any Log Backups since the Last Good Backup as well.
     3. Restore
    a) Take another Database Backup in case anything goes wrong with next steps (Space / Time permitting)
    b) Confirm no Connections, or ALTER DATABASE .. SET RESTRICTED_USER
     c) Take a Tail-Log Backup: BACKUP LOG .. WITH NORECOVERY – this leaves the Database in the RESTORING state (no more data goes in), and ensures we have all Transactions in Backups
    d) Restore from Last Good Backup determined from Investigate phase
    e) Roll Forward / Restore all Transaction Log backups since the Last Good Backup, right up to the final Tail-Log Backup. SSMS can be used to generate the Scripts for this.
    f) RESTORE DATABASE .. WITH RECOVERY, then run CHECKDB to ensure issue resolved
    g) Hopefully, you have been able to use the Transaction Log Backups to re-apply all changes to the Database Pages restored from the Backup of the Database taken before the Corruption
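     In rough T-SQL, steps c) to f) come out something like this (all names and paths are placeholders):

     -- c) Tail-Log Backup: capture the final transactions and leave the DB in RESTORING
     BACKUP LOG CriticalDB
         TO DISK = N'X:\Backups\CriticalDB_TAIL.trn'
         WITH NORECOVERY, NO_TRUNCATE;

     -- d) Restore the Last Good Backup
     RESTORE DATABASE CriticalDB
         FROM DISK = N'X:\Backups\CriticalDB_LastGood.bak'
         WITH NORECOVERY, REPLACE;

     -- e) Roll forward every Log Backup since then, in order, ending with the Tail-Log
     RESTORE LOG CriticalDB
         FROM DISK = N'X:\Backups\CriticalDB_LOG_2300.trn' WITH NORECOVERY;
     RESTORE LOG CriticalDB
         FROM DISK = N'X:\Backups\CriticalDB_TAIL.trn' WITH NORECOVERY;

     -- f) Bring it online and confirm the corruption is gone
     RESTORE DATABASE CriticalDB WITH RECOVERY;
     DBCC CHECKDB (CriticalDB) WITH NO_INFOMSGS;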
     4. Followup
    a) what caused the corruption – look for Disk / NTFS Events in the System Event Log, issues with SAN, or memory errors on the Server
     5. Review
    a) our Backup Retention policy – we need to retain enough Log Backups on Disk since the last successful CHECKDB
    b) CHECKDB Frequency, and Alerting on when it Fails

    Reply
    • Erik Darling
      July 15, 2017 7:42 am

      This is one of the best reasons to have good HA and DR set up. There’s a lot of investigatory work, app downtime, and server downtime in this process. Then imagine you go through, do everything right, and you restored your data onto bad disks. Corruption comes right back.

      Having the ability to failover quickly prevents a lot of this. Thinking about servers as islands is DBA malpractice.

      Reply
      • Simon Millar
        July 16, 2017 5:37 am

        Hi Erik – was assuming no Log Shipping / DB Mirroring / AlwaysOn AG in Brent’s scenario, as it’s not explicitly called out. But agree that it’s a lengthy process, so being able to quickly fail over to a non-corrupt database is a huge benefit of good DR. My main point was to illustrate the importance of Transaction Log Backup retention, and taking a Tail-Log Backup prior to doing any Restores to allow no data loss (as it hadn’t been mentioned, and I’ve seen many cases of SQL Server DBAs not realising that Log Backups likely won’t contain the corruption that made its way into the data files).
        Would also add that HA based on Failover Clustering (it’s still the same disk / filesystem), and DR based on SAN Replication (block replication means any file corruption is now also at DR), won’t help us in these situations. I believe Log Shipping / DB Mirroring / AlwaysOn AGs are the only things that will avoid having to go through my steps above. Additionally, with DB Mirroring / AlwaysOn AGs, we get the feature of restoring Non-Corrupt Pages from the Secondary to the Principal for free!
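        You can even see those automatic page repairs happening, e.g.:

        -- Pages automatically repaired from an Availability Group partner
        SELECT * FROM sys.dm_hadr_auto_page_repair;

        -- The Database Mirroring equivalent
        SELECT * FROM sys.dm_db_mirroring_auto_page_repair;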

        Reply
