Let’s Corrupt a Database Together, Part 3: Detecting Corruption

So far in this series, I’ve shown you how to corrupt a clustered index, then how nonclustered indexes can be corrupted independently. If you haven’t read those, you should start there first.

Let’s start again with our 50Ways database, but this time we’re going to be good and make sure that we’ve got checksums enabled, and that we’re in full recovery model:
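The original setup script isn’t reproduced here, but a minimal sketch (the database name comes from the series; everything else is an assumption) looks roughly like this:

```sql
CREATE DATABASE [50Ways];
GO
-- CHECKSUM page verification stamps a checksum on every page as it's
-- written to disk, and verifies it whenever the page is read back in
ALTER DATABASE [50Ways] SET PAGE_VERIFY CHECKSUM;
-- Full recovery model so transaction log backups are possible
ALTER DATABASE [50Ways] SET RECOVERY FULL;
GO
```

CHECKSUM has been the default page verify option since SQL Server 2005, but databases upgraded from older versions can still be stuck on TORN_PAGE_DETECTION or NONE, so it’s worth checking.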

Then fire up your trusty hex editor, like xvi32, and open up the MDF file just like we did in Part 1. Change Stan’s name to Flan, save the MDF, close your hex editor, and bring the database back online again:
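SQL Server holds the MDF open while the database is online, so the offline/edit/online dance from Part 1 goes roughly like this (a sketch, not the original script):

```sql
-- Kick everyone out and release the file handles so the hex editor can open the MDF
ALTER DATABASE [50Ways] SET OFFLINE WITH ROLLBACK IMMEDIATE;
-- (do your hex editing here, save the file, then:)
ALTER DATABASE [50Ways] SET ONLINE;
```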

Thanks to the magic of checksums, SQL Server knows what you did last summer:
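Read the mangled page and the checksum fails. As a sketch (the table name is an assumption – use whatever you built in Part 1):

```sql
SELECT * FROM dbo.Friends;
-- Msg 824: SQL Server detected a logical consistency-based I/O error:
-- incorrect checksum. Note that this error only fires when the page is
-- actually READ from disk - which is the whole point of this post.
```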

So the table is toast, right?

Try inserting more rows.

Let’s add another four rows:
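Something like this (the table and column names are assumptions, since the original script isn’t reproduced here):

```sql
-- New rows land on a different page, so the corrupt page never gets read
INSERT INTO dbo.Friends (FriendName)
VALUES ('Gus'), ('Lee'), ('Roy'), ('Jack');
```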

And if the moon is right, it works just fine.

(4 rows affected)

Adding more data doesn’t necessarily trip corruption detection – SQL Server only verifies a page’s checksum when it reads that page back from disk. But if you try to select the rows back out, you’re still going to have a problem.

This is the worst part about database corruption – SQL Server will let your users keep right on databasin’, adding more data into a corrupt database. You have a serious time bomb on your hands here: the longer you let this go on, the more dangerous things get. Watch this.

Take a full and a transaction log backup of your database again:
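As a sketch (file names and paths are assumptions – swap in your own):

```sql
-- Full backup #2: happily backs up the corrupt page, checksum and all
BACKUP DATABASE [50Ways]
    TO DISK = N'M:\MSSQL\DATA\50Ways_Full_2.bak'
    WITH INIT;

-- Log backup #2
BACKUP LOG [50Ways]
    TO DISK = N'M:\MSSQL\DATA\50Ways_Log_2.trn'
    WITH INIT;
```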

And they work fine – no errors. And guess what happens when you try restores? You’ll need to change M:\MSSQL\DATA\ to your own data/log paths here (I try to keep scripts super-generic so they work everywhere, but restore ain’t so flexible):
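A sketch of the restore – the logical file names in the MOVE clauses are assumptions (run `RESTORE FILELISTONLY` against the backup to see yours):

```sql
-- Restore to a new name so we don't clobber the original
RESTORE DATABASE [50Ways_Restored]
    FROM DISK = N'M:\MSSQL\DATA\50Ways_Full_2.bak'
    WITH MOVE N'50Ways'     TO N'M:\MSSQL\DATA\50Ways_Restored.mdf',
         MOVE N'50Ways_log' TO N'M:\MSSQL\DATA\50Ways_Restored_log.ldf',
         NORECOVERY;

RESTORE LOG [50Ways_Restored]
    FROM DISK = N'M:\MSSQL\DATA\50Ways_Log_2.trn'
    WITH RECOVERY;
```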

No errors there either.

By default, backups and restores don’t detect corruption.

So now think through the timeline of what we just did:

  1. We created a database
  2. We put some data in it
  3. We took a full backup #1
  4. We took a log backup #1
  5. We corrupted the data
  6. We took full backup #2 (which backed up a corrupt data page)
  7. We took log backup #2

If we need to recover from the corruption that just happened (and assuming we can’t do page-level restores, which is for another blog post), the proper sequence is:

  1. Take a tail-of-the-log backup, which grabs the last of our transactions and leaves the database in a restoring state so no further changes sneak in
  2. Restore full backup #1 (from before the corruption happened)
  3. Restore log backup #1
  4. (Skip full backup #2 altogether, because its data page is corrupt)
  5. Restore log backup #2
  6. Restore the tail-of-the-log backup
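The steps above sketch out in T-SQL roughly like this (all file names are assumptions):

```sql
-- 1. Tail-of-the-log backup: NO_TRUNCATE reads the log even if the data
--    file is damaged; NORECOVERY puts the database into a restoring state
BACKUP LOG [50Ways]
    TO DISK = N'M:\MSSQL\DATA\50Ways_Tail.trn'
    WITH NO_TRUNCATE, NORECOVERY;

-- 2. Restore full backup #1, taken before the corruption happened
RESTORE DATABASE [50Ways]
    FROM DISK = N'M:\MSSQL\DATA\50Ways_Full_1.bak'
    WITH NORECOVERY, REPLACE;

-- 3. Restore log backup #1
RESTORE LOG [50Ways]
    FROM DISK = N'M:\MSSQL\DATA\50Ways_Log_1.trn' WITH NORECOVERY;

-- 4. Skip full backup #2 entirely - the unbroken log chain carries us past it

-- 5. Restore log backup #2
RESTORE LOG [50Ways]
    FROM DISK = N'M:\MSSQL\DATA\50Ways_Log_2.trn' WITH NORECOVERY;

-- 6. Restore the tail, then bring the database online
RESTORE LOG [50Ways]
    FROM DISK = N'M:\MSSQL\DATA\50Ways_Tail.trn' WITH RECOVERY;
```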

But this only works if you actually have log backup #1. This is kinda horrifying because I bet you:

  • Do CHECKDB once a week, like on Saturdays
  • Do full backups every day
  • Do transaction log backups multiple times per day, like hourly
  • But due to limited drive space, you delete log backups older than 24-48 hours

So take this timeline:

  • Saturday – DBCC CHECKDB runs, reports success
  • Sunday – full database backup, and logs all day long
  • Monday – full database backup, and logs all day long, and we delete Sunday’s log backups
  • Tuesday – full database backup, and logs all day long, and we delete Monday’s log backups
  • Wednesday – we find database corruption.

Depending on when the corruption happened, you may not be able to restore without experiencing serious data loss.

Thankfully, it’s easy to detect corruption.

Once you’re aware of this problem, you have a few great options.

You could keep your log backup files around longer. In theory, you keep them all the way back to your last clean CHECKDB. In practice, you’ll need to keep them longer than that. If you do CHECKDB every 7 days, and you delete log files older than 7 days, then when CHECKDB fails, a human being probably won’t disable the log-deletion job fast enough to keep the log backups online. In that scenario, 10-14 days of log backups might be a better choice – especially if there’s only one DBA, and everyone else just leaves the alert emails for the DBA to handle when they get back from vacation.

You could run CHECKDB more often, or on a restored copy of production. It kills me when I see people doing index rebuilds every night, but CHECKDB only once a week. Your priorities are backwards. DBAs get fired for lost data, not for slow performance. (Are you kidding me? You actually get extra training and tools budgets when your server is slow. Come to think of it, you should probably go hit the turbo button on the SQL Server just to make it slow down for a while. The turbo button doesn’t exist anymore? Maybe change your power plan down to Fair and Balanced instead of High Performance.)

Configure your alerts and failsafe operator. By default, SQL Server keeps its secrets and doesn’t tell you when it detects a corrupt page. It’s crazy easy to set up alerts – check out our SQL Server Setup Guide in the First Responder Kit.
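As a sketch, an Agent alert for error 824 looks roughly like this (the operator name is an assumption – create your own with sp_add_operator first, and repeat for errors 823 and 825 plus severities 19-25):

```sql
-- Fire an alert whenever error 824 (logical consistency I/O error) is logged
EXEC msdb.dbo.sp_add_alert
    @name = N'Error 824: Logical consistency I/O error',
    @message_id = 824,
    @severity = 0,
    @include_event_description_in = 1;

-- Email the on-call operator when the alert fires
EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Error 824: Logical consistency I/O error',
    @operator_name = N'The DBA Team',
    @notification_method = 1;  -- 1 = email
```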

Use the WITH CHECKSUM option on backups. When specified, SQL Server verifies each page’s checksum as it reads the page during the backup. It’s not as good as a full CHECKDB, but in the case of our toy database, it works like a charm:
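A sketch (path is an assumption):

```sql
BACKUP DATABASE [50Ways]
    TO DISK = N'M:\MSSQL\DATA\50Ways_Checked.bak'
    WITH CHECKSUM, INIT;
-- Fails with an error instead of quietly backing up the corrupt page,
-- which is exactly the early warning we want
```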

You could implement mirroring or Always On Availability Groups. By themselves, these don’t make it easier to detect corruption, but they make it easier to deal with the aftermath. Sure, they have Automatic Page Repair, but they also have something simpler: a separate copy of your database’s data files. Instead of your data files being a single point of failure, you now have redundancy against storage failures (assuming, of course, that the copies are on separate physical storage, not the same device).
