You would think that when SQL Server writes your data to disk, it’s totally safe.
You would be wrong. It’s almost like there are gremlins out to get you, purposely trying to trash your data. In the really old days, it was problems with magnetic hard drives. Later, it was buggy shared storage arrays. Today, it’s the cloud, where your storage is 99.971H!7 LOST CARRIER
That was an old joke. You probably won’t get that.
First, get alerted when SQL Server finds corruption.
By default, if SQL Server reads a corrupt page from disk, it logs a message, and then keeps right on going. It doesn’t email you, doesn’t show up with a red icon in SSMS, or tell you anything about the impending doom. Let’s get that fixed.
Here’s what you need to do:
- Configure Database Mail so it can send you alerts when all hell breaks loose
- Set up an operator so SQL Server knows who to email – ideally, this is a distribution list
- Set up Agent alerts so SQL Server emails you whenever it detects corruption
Second, get SQL Server to proactively check for corruption.
By default, when SQL Server backs up your databases, it happily backs up corrupt data without throwing an error message, either. We need it to start checking the checksums on each page to make sure they’re actually valid. Wherever you configured your backups – whether it’s maintenance plans, Ola’s scripts, or a third party tool – look for the option to check the checksums on backup. In most cases, it won’t take a noticeable amount more time, and besides, you need the peace of mind anyway.
Then, schedule regular jobs to run DBCC CHECKDB to proactively read the contents of your databases and look for corruption. We’ve got videos on how do to CHECKDB when you don’t have a DBA, when you do, and when you have a Very Large Database (VLDB).
Some folks are hesitant to check for corruption in production because they’ve run into timeouts during CHECKDB. That’s typically caused by bad storage or storage networking performance, and it’s a good sign that it’s time to dig deeper into the storage.
Third, check yo’self before you wreck yo’self.
Download a corrupt database from Steve Stedman’s Corruption Challenge. It’s a ten-week series where he gives you a different corrupt database each week, and it’s up to you to figure out how to fix the corruption with as little data loss as possible. Then, you can compare your answers to the winners and see how they solved it.
I’m not saying you have to actually fix the corruption – just download the corrupt database, and then attach it to your production server. Yes, your production server. I want you to find out if your corruption jobs really report errors, if your alerts actually work, and if your backup jobs are checking with checksum like they’re supposed to.
I would much rather have you learn with a known broken database than learn the hard way when corruption strikes your servers.
Wanna learn more about corruption?
Work along with me in this interactive blog post series:
- Let’s Corrupt a Database Together, Part 1: Clustered Indexes
- Part 2: Corrupting Nonclustered Indexes
- Part 3: Detecting Corruption
And sooner or later, you’re going to hit corruption. Here’s how to react.
In our next episode, we’ll find out who can get you fired.