DBA Training Plan 4: Checking for Corruption

Last Updated August 13, 2019

You would think that when SQL Server writes your data to disk, it’s totally safe.

Forever.

You would be wrong. It’s almost like there are gremlins out to get you, purposely trying to trash your data. In the really old days, it was problems with magnetic hard drives. Later, it was buggy shared storage arrays. Today, it’s the cloud, where your storage is 99.971H!7 LOST CARRIER

That was an old joke. You probably won’t get that.

First, get alerted when SQL Server finds corruption.

By default, if SQL Server reads a corrupt page from disk, it logs a message, and then keeps right on going. It doesn’t email you, doesn’t show up with a red icon in SSMS, or tell you anything about the impending doom. Let’s get that fixed.

Here’s what you need to do:

Configure Database Mail so it can send you alerts when all hell breaks loose
Set up an operator so SQL Server knows who to email – ideally, this is a distribution list
Set up Agent alerts so SQL Server emails you whenever it detects corruption

Second, get SQL Server to proactively check for corruption.

By default, when SQL Server backs up your databases, it happily backs up corrupt data without throwing an error message, either. We need it to start checking the checksums on each page to make sure they’re actually valid. Wherever you configured your backups – whether it’s maintenance plans, Ola’s scripts, or a third party tool – look for the option to check the checksums on backup. In most cases, it won’t take a noticeable amount more time, and besides, you need the peace of mind anyway.

Then, schedule regular jobs to run DBCC CHECKDB to proactively read the contents of your databases and look for corruption. We’ve got videos on how do to CHECKDB when you don’t have a DBA, when you do, and when you have a Very Large Database (VLDB).

Some folks are hesitant to check for corruption in production because they’ve run into timeouts during CHECKDB. That’s typically caused by bad storage or storage networking performance, and it’s a good sign that it’s time to dig deeper into the storage.

Third, check yo’self before you wreck yo’self.

Download a corrupt database from Steve Stedman’s Corruption Challenge. It’s a ten-week series where he gives you a different corrupt database each week, and it’s up to you to figure out how to fix the corruption with as little data loss as possible. Then, you can compare your answers to the winners and see how they solved it.

I’m not saying you have to actually fix the corruption – just download the corrupt database, and then attach it to your production server. Yes, your production server. I want you to find out if your corruption jobs really report errors, if your alerts actually work, and if your backup jobs are checking with checksum like they’re supposed to.

I would much rather have you learn with a known broken database than learn the hard way when corruption strikes your servers.

Wanna learn more about corruption?

Work along with me in this interactive blog post series:

And sooner or later, you’re going to hit corruption. Here’s how to react.

In our next episode, we’ll find out who can get you fired.

Alen T
July 23, 2019 8:38 am

Using Commvault now, but last job we did SQL backups on Netbackup. I always had the checksum option checked but the few times we had corruption I never received and alert from SQL or Netbackup that anything was picked up during the backup process.

Reply
- Keith
  July 23, 2019 9:15 am
  
  Checksum isn’t enough, it can only find some types of corruption.
  
  Brent has a video about it somewhere and does a demo and he was able to backup with checksum successfully. Have to do CheckDB. In Commvault you can set that up to restore a DB to a test server and run a checkdb against it.
  
  I envy you, I used to back up 1200 servers in 20-25 network sites with simpana 9, 10 and 11, and it was truly a pleasure working with the software. Definitely miss having it.
  
  Reply
Mike
July 23, 2019 9:23 am

LOST CARRIER… ah, the good old days where cloud computing cost a $300 1200 Baud Modem and oh the joy of hours long corrupted downloads.

Reply
- Ross Presser
  July 23, 2019 10:25 am
  
  1200? I had to make do with 300 when I started … and I remember using ^X to manually abort an xmodem transfer.
  
  Reply
richardarmstrong-finnerty
July 23, 2019 11:00 am

Worth pointing out that DBCC CHECKDB, since SQL Server 2005, operates on a database snapshot, thus not affecting the database.

Reply
- Brent Ozar
  July 23, 2019 11:02 am
  
  Richard – I’m gonna disagree on that one because it still has a pretty hefty impact on IO, CPU, memory grants, etc.
  
  Reply
- Keith
  July 23, 2019 1:04 pm
  
  I’m not even sure it was a performance enhancement, because if it can’t make the snapshot it will resort to using locks anyways
  
  Reply