In part 1, I showed you how to corrupt a database – not because you’d need to do it, or because SQL Server has some kind of vulnerability. I just wanted to show you how SQL Server reacts when it finds corruption in a data file, why you should check your databases for CHECKSUM settings, and set up alerts so you can react fast.
However, in part 1, I purposely kept the data simple to teach you the basics first.
This time around, I want to teach you the concepts of what happens when you have multiple copies of the data. Let’s make our table a little more complex, and have both a clustered index and a nonclustered index:
CREATE DATABASE [50Ways];
ALTER DATABASE [50Ways] SET PAGE_VERIFY NONE; /* Normally a bad idea */
CREATE TABLE [dbo].[ToLeaveYourLover]
([ID] INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
[Way] VARCHAR(50), [Guy] VARCHAR(50));
CREATE INDEX [IX_Guy] ON [dbo].[ToLeaveYourLover]([Guy]);
INSERT INTO [dbo].[ToLeaveYourLover]([Way],[Guy])
VALUES ('Slip out the back','Jack'),
('Make a new plan','Stan'),
('Hop on the bus','Gus'),
('Drop off the key','Lee')
SELECT * FROM [50Ways]..[ToLeaveYourLover]; /* Yep, we have data */
ALTER DATABASE [50Ways] SET OFFLINE WITH ROLLBACK IMMEDIATE;
Here’s our data:
As we did in the first post, let’s fire up XVI32 and edit our mdf file. This time, though, when you do a search for Stan, you’ll find it in two different places in the data file – once for the clustered index, and once for the nonclustered index (which only has Guy and ID on its 8K pages.) For more details on how those are stored, check out my free How to Think Like the Engine class.
Let’s do something a little tricky: let’s edit the nonclustered index (the latter screenshot) and change Stan to Flan, then save the file and bring the database online again. What’s our data look like?
His name is either Stan or Flan – depending on where you get his name from, the clustered index or the nonclustered index! SQL Server doesn’t catch this unless you have checksums turned on. (Again, really important: if you inherited somebody else’s server, or if a vendor is creating databases for you, you need to run sp_Blitz and make sure checksums are on for all of your databases. Just because you’re running current versions of SQL Server doesn’t mean you changed database settings along the way as you upgraded. Trust me, I’ve seen your databases. They’re heinous. Also, you need to floss.)
If you query msdb.dbo.suspect_pages, the kinda-sorta-system-table that tracks pages where SQL Server suspects corruption, it comes back empty:
SQL Server can’t know that there’s corruption if there’s no checksums. We’re flying blind here.
To catch this problem we have to run CHECKDB.
I’m trying to get better at calling it CHECKDB instead of just saying DBCC, because DBCC has other useful commands like SHRINKFILE and DROPCLEANBUFFERS and INPUTBUFFER and … wait, none of those are really all that useful. Anyway, here’s CHECKDB:
CHECKDB has a heck of a lot of work to do: it has to read every page and identify differences between objects that are supposed to be the same. In this case, it basically joined the clustered index and the nonclustered index together, detected that not every field matched, and told us exactly what 8K pages had problems.
More data, more pages, more indexes = more time.
If you read CHECKDB’s output in that screenshot carefully, you’ll see that it says “index ID 2” – which means our nonclustered index. (Index #0 is a heap, and #1 is a clustered rowstore index.)
Oh, and by the way – even though DBCC CHECKDB realized that something is wrong with the page, and our index rebuild failed, the suspect_pages table is still empty:
Muhahaha. There’s a very specific list of times when SQL Server will populate the suspect_pages table, like errors 823 or 824, but this particular issue isn’t in the list.
How do you fix this corruption?
In the real world – especially after you read this post – you’ll have checksums on all of your pages, which makes this kind of thing really simple:
- The clustered index says his name is Stan, and the page has a correct checksum
- The nonclustered index says he’s Flan, but the page has an incorrect checksum
By reading CHECKDB’s output, looking at index ID’s, and looking at page numbers, I can figure out whether I need to just drop & recreate the nonclustered index, or whether I have more widespread problems with clustered indexes too.
In this case, I can’t just rebuild it to make the corruption go away, because that fails:
Instead, I have to actually recreate the index by dropping it and recreating it.
This was an artificially simple scenario (and I’m still coming up on a thousand words) because I need you to understand that:
- Clustered and nonclustered indexes get corrupted independently
- The same field can have different contents in different places
- CHECKDB won’t do all of the work for you, and you have to do some manual work to recover from corruption even in the simplest cases
I haven’t even covered:
- How to parse CHECKDB’s output at scale (multi-terabyte databases, thousands of corrupt pages)
- How to recover from single-instance-storage failures (where a field’s only existence is corrupt, like when there’s not an index copy)
- How to do page-level restores
What you need to do next:
I said it in the first post in the series, but I need to drive it home again: enable checksums. No foolin’, this is important stuff. It doesn’t stop corruption, but rather just lets you detect it faster.
You might go your whole life and never experience the sheer terror of corruption. However, if you wanna be more prepared – or if you’re like me, and you just find these kinds of stunts fascinating – check out Steve Stedman’s Corruption Challenge. He gives you ten corrupt databases, and you have to figure out how to recover with as little data loss as possible.
In the next part in the series, you’ll learn what SQL Server operations do – and don’t – detect corruption.
You should be able to disable the index and then rebuild it. Prefer that to delete and recreate.
Frank – why?
Because I’m lazy. It saves on scripting out the index definition.
Because if the index is enforcing a constraint, dropping and recreating it allows the constraint to be broken, but disabling and rebuilding it in a transaction keeps the constraint enforced. Disable and rebuild is the correct approach.
I take it the strategy has changed from drop + recreate to disable + recreate. Any chance you can update the edit on https://www.sqlskills.com/blogs/paul/teched-demo-nonclustered-index-corruption/ that mentioned drop + create?
“In this case, it basically joined the clustered index and the nonclustered index together, detected that not every field matched, and told us exactly what 8K pages had problems.”
Did it, really?
“Keys out of order on page (1:264), slots 2 and 3.”
I read this like the value in slot 2 (third row, Lee) is greater than the value in slot 3 (fourth row, Flan).
This is a consistency problem within the index itself. There is no pairing to the clustered data here as far as I can see.
Not here, but try this same thing with index includes (which aren’t sorted in order) and watch what happens.
Thanks Brent. I did try this and also using my own way. Honestly this gave me a hint why I was failing to solve a sticking issue in my life. I am beginning to open Database files and try doing all sorts of things. I will see.
Thanks, this was very good information especially when it comes to Clustered and Non-clustered index relationships.
Good to see more people publicizing corruption and recovery techniques. However, I’m confused how this example illustrates your point: “CHECKDB won’t do all of the work for you, and you have to do some manual work to recover from corruption even in the simplest cases”. Running repair will fix this corruption – no effort required (apart from having the database essentially offline). Thanks
Paul – because I’m just introducing them, heh. It’s only part 2. 😀
Unable to repair the corrupted database. How to fix it?
Where’s Part 3 and any other parts? There’s no link to the next part at the end of this one.
Michael – I haven’t written them yet. Stay tuned!
Funny, as I found Part 3 here: https://www.brentozar.com/archive/2017/07/lets-corrupt-database-together-part-3-detecting-corruption/. Do you have a Tardis?
(I’ve learned over time that if someone asks me for something, and I tell them it doesn’t exist, then they seem to exert that little extra bit of effort to find the thing they were looking for in the first place, and they’re email specially proud of themselves. So good job, you! You didn’t need me after all, eh? Web searches are so powerful these days. Enjoy your treasure and your newfound power.)
I actually found it before I saw your comment, so I’m sorry to say it had no influence on my efforts.
Web searches can also be incredibly frustrating.
Anyway, interesting series, good stuff.
OMG.. this is a funny discussion. haha