It’s Time to Improve DBCC CHECKDB.

Last Updated August 13, 2018

Microsoft has been resting on Paul Randal’s laurels for far too long.

From 1999 to 2007, Paul poured his heart and soul into rewriting SQL Server’s code to check for and repair database corruption. (For more about his illustrious career, read his bio and enjoy the infectious enthusiasm in his bio photo.)

Paul did great work – his baby has lived on for over a decade, and it’s an extremely rare cumulative update that fixes a bug in CHECKDB. I’d like to think it’s not because nobody’s looking, but because he wrote good, solid code that got the job done.

But Microsoft is coasting.

This is a $30 RAID controller.

Meet the LSI MegaRAID SAS 9285CV-8e, one of the most junior RAID controllers you can buy for a server. When he’s bored, he has a couple of homework tasks he likes to perform: Patrol Read and Consistency Check. Between these two, he’s checking all of the drives in the array to make sure they match each other, and that they can successfully read and write data.

This helps catch storage failures earlier with less data loss.

You don’t have to configure this or set up a schedule – he just knows to do it because that’s what he does. It’s his job. You trusted him with your data, so every now and then, he does his homework.

SQL Server needs to do that.

Some of the pieces are there – for example, SQL Server already has the ability to watch for idle CPU times and run Agent jobs when it’s bored. For starters, that’d probably be good enough to save a lot of small businesses from heartache. For the databases over, say, 100GB, it’d be really awesome to have resumable physical_only corruption checking – tracking which pages have been checked (just like how the differential bitmap tracks page changes), with page activity reset when the page is changed (again, just like the differential bitmap.) This wouldn’t count the same as a real CHECKDB, which needs to do things like compare index contents – but holy mackerel, it’d be better than what we have now.

Because I’m just so tired of seeing corruption problems, and we can’t expect admins to know how this stuff works. I know, dear reader, you think admins should know how to set up and run corruption checking because it’s just so doggone important, you say.

But if it’s so important…

Why isn’t SQL Server doing it in the background automatically like $30 RAID cards have been doing for decades?

Want it? Cast your vote here.

Foundational Material: Microsoft SQL Server Book and Blogs From The Past

Tall Tales From Table Variables

18 Comments. Leave new

Stephen Morris
August 22, 2018 8:27 am

If you create the “collaborate” item I’d be happy to give it an upvote 🙂

Reply
- Erik Darling
  August 22, 2018 8:35 am
  
  Stephen — see the big text at the end that says “cast your vote here”? Give that a click 😉
  
  Reply
  - Melody Spencer
    October 29, 2018 12:08 am
    
    Ha ha ha …. I like that !
    
    Reply
Andre
August 22, 2018 8:44 am

That would be nice. We currently have to do a restore of backup on another server and then run a DBCC CHECKDB.

Reply
- David
  August 22, 2018 9:52 am
  
  Then you’re actually checking the restored copy, and not your actual production database.
  
  Reply
  - Andre
    August 22, 2018 11:31 am
    
    Yep best we can do – database is working 24/7 and over 300 GB
    
    Reply
  - Erik Darling
    August 22, 2018 12:24 pm
    
    David — that’s a valid way of doing it.
    
    See here: Importance of how you run consistency checks.
    
    I’ve discussed this many times before so I won’t labor the point, but you either have to run the consistency checks on the production database, or on a restored backup of it, or you’re not testing the production database. Nothing else is good enough.
    
    Thanks!
    
    Reply
    - David
      August 23, 2018 4:47 am
      
      Interesting, my first thought would be that there could be an issue on prod that wouldn’t be present on the surrogate but that seems to not be possible.
      
      Thanks
      
      Reply
Stephen Morris
August 22, 2018 9:03 am

oops – too used to links in blue with underlining, voted !

Reply
George Zhou
August 22, 2018 11:32 am

such a great idea, I vote!

Reply
Tim
August 22, 2018 4:12 pm

Ouch – coasting feels a little harsh!

Hosting your data in Azure SQL DB (and I assume Managed Instances, though that isn’t clear) gives you access to additional checks and safeguards – https://azure.microsoft.com/en-us/blog/data-integrity-in-azure-sql-database/.

Is that not trickling down to the box-product to force people into the cloud (cynical view :)), or because the ongoing investment is all based on the Service Fabric programming & hosting model?

Reply
Matthew LeVier
August 22, 2018 10:25 pm

I like the approach that Brent Ozar and his team are taking on things like this. It is just like updating stats is can be more effective than rebuilding indexes to fix fragmentation on a daily basis. They are challenging what status quo has been for too long. That’s what we need to do as professionals. Technology is changing. Maybe some of us are too busy with everything but how is anything going to get better if we don’t challenge what was done in the past? We need to do something to compete with those robots taking over.

Reply
alen teplitsky
August 23, 2018 1:48 pm

i’ve only seen a few instances of corruption over the years. Is it really that common? Wouldn’t constant corruption be a symptom of bad SAN and storage design?

Reply
- TedS
  August 24, 2018 6:20 am
  
  I’ve seen a couple of instances of data corruption. More frequently it is index corruption I see.
  
  Reply
- Brent Ozar
  August 24, 2018 7:49 am
  
  Alan – no, it can be all kinds of things, including SQL Server bugs.
  
  Reply
Paul Randal
September 11, 2018 7:01 am

Ha!

It would be relatively easy to have a crawler process that would cycle through PFS intervals and simply read allocated pages into the the buffer pool to test page-checksums, using a set of instantly-disfavored buffers to avoid taking any noticeable space, but there some very difficult problems with making it correct. One problem comes with checking pages that are already in memory – forcing them to be read from disk again would require them to be owned by a different database ID (otherwise that breaks buffer pool invariants), which is what using a database snapshot provides. Another problem is writing a PFS-based scanner that copes with changes to the allocation state of pages in the PFS interval without holding a latch on the PFS page – again, something that the database snapshot provides. Basically, without a database snapshot (prohibitively expensive for what you want) the latch-concurrency around the pages gets very challenging, if not impractical.

But I would like to see anything that checks consistency more frequently, given how common it is for corruption problems to occur (I get multiple emails per week from random people trying to recover from corruption). Answering Alex’s comment above, in a well-configured environment, it’s not common to see corruption, but well-configured environments aren’t the most common in the set of all SQL Server installations across the world.

Reply
- Brent Ozar
  September 11, 2018 7:02 am
  
  I thought you’d like that, hahaha.
  
  Reply
Andre
October 17, 2018 5:00 am

Great! Received this e-mail Monday from Azure feedback:

This aligns with our vision for improved manageability, particularly for VLDBs. Will consider for future development.

Microsoft SQL Server
Admin, Microsoft Azure

Reply