In part 1, I showed you how to corrupt a database – not because you’d need to do it, or because SQL Server has some kind of vulnerability. I just wanted to show you how SQL Server reacts when it finds corruption in a data file, why you should check your databases for CHECKSUM settings, and set up alerts so you can react fast.
However, in part 1, I purposely kept the data simple to teach you the basics first.
This time around, I want to teach you the concepts of what happens when you have multiple copies of the data. Let’s make our table a little more complex, and have both a clustered index and a nonclustered index:
CREATE DATABASE [50Ways]; GO ALTER DATABASE [50Ways] SET PAGE_VERIFY NONE; /* Normally a bad idea */ GO USE [50Ways]; GO CREATE TABLE [dbo].[ToLeaveYourLover] ([ID] INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, [Way] VARCHAR(50), [Guy] VARCHAR(50)); GO CREATE INDEX [IX_Guy] ON [dbo].[ToLeaveYourLover]([Guy]); GO INSERT INTO [dbo].[ToLeaveYourLover]([Way],[Guy]) VALUES ('Slip out the back','Jack'), ('Make a new plan','Stan'), ('Hop on the bus','Gus'), ('Drop off the key','Lee') GO SELECT * FROM [50Ways]..[ToLeaveYourLover]; /* Yep, we have data */ GO ALTER DATABASE [50Ways] SET OFFLINE WITH ROLLBACK IMMEDIATE; GO
Here’s our data:
As we did in the first post, let’s fire up XVI32 and edit our mdf file. This time, though, when you do a search for Stan, you’ll find it in two different places in the data file – once for the clustered index, and once for the nonclustered index (which only has Guy and ID on its 8K pages.) For more details on how those are stored, check out my free How to Think Like the Engine class.
Let’s do something a little tricky: let’s edit the nonclustered index (the latter screenshot) and change Stan to Flan, then save the file and bring the database online again. What’s our data look like?
His name is either Stan or Flan – depending on where you get his name from, the clustered index or the nonclustered index! SQL Server doesn’t catch this unless you have checksums turned on. (Again, really important: if you inherited somebody else’s server, or if a vendor is creating databases for you, you need to run sp_Blitz and make sure checksums are on for all of your databases. Just because you’re running current versions of SQL Server doesn’t mean you changed database settings along the way as you upgraded. Trust me, I’ve seen your databases. They’re heinous. Also, you need to floss.)
If you query msdb.dbo.suspect_pages, the kinda-sorta-system-table that tracks pages where SQL Server suspects corruption, it comes back empty:
SQL Server can’t know that there’s corruption if there’s no checksums. We’re flying blind here.
To catch this problem we have to run CHECKDB.
I’m trying to get better at calling it CHECKDB instead of just saying DBCC, because DBCC has other useful commands like SHRINKFILE and DROPCLEANBUFFERS and INPUTBUFFER and … wait, none of those are really all that useful. Anyway, here’s CHECKDB:
CHECKDB has a heck of a lot of work to do: it has to read every page and identify differences between objects that are supposed to be the same. In this case, it basically joined the clustered index and the nonclustered index together, detected that not every field matched, and told us exactly what 8K pages had problems.
More data, more pages, more indexes = more time.
If you read CHECKDB’s output in that screenshot carefully, you’ll see that it says “index ID 2” – which means our nonclustered index. (Index #0 is a heap, and #1 is a clustered rowstore index.)
Oh, and by the way – even though DBCC CHECKDB realized that something is wrong with the page, and our index rebuild failed, the suspect_pages table is still empty:
Muhahaha. There’s a very specific list of times when SQL Server will populate the suspect_pages table, like errors 823 or 824, but this particular issue isn’t in the list.
How do you fix this corruption?
In the real world – especially after you read this post – you’ll have checksums on all of your pages, which makes this kind of thing really simple:
- The clustered index says his name is Stan, and the page has a correct checksum
- The nonclustered index says he’s Flan, but the page has an incorrect checksum
By reading CHECKDB’s output, looking at index ID’s, and looking at page numbers, I can figure out whether I need to just drop & recreate the nonclustered index, or whether I have more widespread problems with clustered indexes too.
In this case, I can’t just rebuild it to make the corruption go away, because that fails:
Instead, I have to actually recreate the index by dropping it and recreating it.
This was an artificially simple scenario (and I’m still coming up on a thousand words) because I need you to understand that:
- Clustered and nonclustered indexes get corrupted independently
- The same field can have different contents in different places
- CHECKDB won’t do all of the work for you, and you have to do some manual work to recover from corruption even in the simplest cases
I haven’t even covered:
- How to parse CHECKDB’s output at scale (multi-terabyte databases, thousands of corrupt pages)
- How to recover from single-instance-storage failures (where a field’s only existence is corrupt, like when there’s not an index copy)
- How to do page-level restores
What you need to do next:
I said it in the first post in the series, but I need to drive it home again: enable checksums. No foolin’, this is important stuff. It doesn’t stop corruption, but rather just lets you detect it faster.
You might go your whole life and never experience the sheer terror of corruption. However, if you wanna be more prepared – or if you’re like me, and you just find these kinds of stunts fascinating – check out Steve Stedman’s Corruption Challenge. He gives you ten corrupt databases, and you have to figure out how to recover with as little data loss as possible.
In the next part in the series, you’ll learn what SQL Server operations do – and don’t – detect corruption.