Microsoft has been resting on Paul Randal’s laurels for far too long.
From 1999 to 2007, Paul poured his heart and soul into rewriting SQL Server’s code to check for and repair database corruption. (For more about his illustrious career, read his bio and enjoy the infectious enthusiasm in his bio photo.)
Paul did great work – his baby has lived on for over a decade, and it’s an extremely rare cumulative update that fixes a bug in CHECKDB. I’d like to think it’s not because nobody’s looking, but because he wrote good, solid code that got the job done.
But Microsoft is coasting.
This is a $30 RAID controller.
Meet the LSI MegaRAID SAS 9285CV-8e, one of the most junior RAID controllers you can buy for a server. When he’s bored, he has a couple of homework tasks he likes to perform: Patrol Read and Consistency Check. Between these two, he’s checking all of the drives in the array to make sure they match each other, and that they can successfully read and write data.
This helps catch storage failures earlier with less data loss.
You don’t have to configure this or set up a schedule – he just knows to do it because that’s what he does. It’s his job. You trusted him with your data, so every now and then, he does his homework.
SQL Server needs to do that.
Some of the pieces are there – for example, SQL Server already has the ability to watch for idle CPU times and run Agent jobs when it’s bored. For starters, that’d probably be good enough to save a lot of small businesses from heartache. For the databases over, say, 100GB, it’d be really awesome to have resumable physical_only corruption checking – tracking which pages have been checked (just like how the differential bitmap tracks page changes), with page activity reset when the page is changed (again, just like the differential bitmap.) This wouldn’t count the same as a real CHECKDB, which needs to do things like compare index contents – but holy mackerel, it’d be better than what we have now.
Because I’m just so tired of seeing corruption problems, and we can’t expect admins to know how this stuff works. I know, dear reader, you think admins should know how to set up and run corruption checking because it’s just so doggone important, you say.
But if it’s so important…
Why isn’t SQL Server doing it in the background automatically like $30 RAID cards have been doing for decades?
If you create the “collaborate” item I’d be happy to give it an upvote 🙂
Stephen — see the big text at the end that says “cast your vote here”? Give that a click 😉
Ha ha ha …. I like that !
That would be nice. We currently have to do a restore of backup on another server and then run a DBCC CHECKDB.
Then you’re actually checking the restored copy, and not your actual production database.
Yep best we can do – database is working 24/7 and over 300 GB
David — that’s a valid way of doing it.
See here: Importance of how you run consistency checks.
Interesting, my first thought would be that there could be an issue on prod that wouldn’t be present on the surrogate but that seems to not be possible.
oops – too used to links in blue with underlining, voted !
such a great idea, I vote!
Ouch – coasting feels a little harsh!
Hosting your data in Azure SQL DB (and I assume Managed Instances, though that isn’t clear) gives you access to additional checks and safeguards – https://azure.microsoft.com/en-us/blog/data-integrity-in-azure-sql-database/.
Is that not trickling down to the box-product to force people into the cloud (cynical view :)), or because the ongoing investment is all based on the Service Fabric programming & hosting model?
I like the approach that Brent Ozar and his team are taking on things like this. It is just like updating stats is can be more effective than rebuilding indexes to fix fragmentation on a daily basis. They are challenging what status quo has been for too long. That’s what we need to do as professionals. Technology is changing. Maybe some of us are too busy with everything but how is anything going to get better if we don’t challenge what was done in the past? We need to do something to compete with those robots taking over.
i’ve only seen a few instances of corruption over the years. Is it really that common? Wouldn’t constant corruption be a symptom of bad SAN and storage design?
I’ve seen a couple of instances of data corruption. More frequently it is index corruption I see.
Alan – no, it can be all kinds of things, including SQL Server bugs.
It would be relatively easy to have a crawler process that would cycle through PFS intervals and simply read allocated pages into the the buffer pool to test page-checksums, using a set of instantly-disfavored buffers to avoid taking any noticeable space, but there some very difficult problems with making it correct. One problem comes with checking pages that are already in memory – forcing them to be read from disk again would require them to be owned by a different database ID (otherwise that breaks buffer pool invariants), which is what using a database snapshot provides. Another problem is writing a PFS-based scanner that copes with changes to the allocation state of pages in the PFS interval without holding a latch on the PFS page – again, something that the database snapshot provides. Basically, without a database snapshot (prohibitively expensive for what you want) the latch-concurrency around the pages gets very challenging, if not impractical.
But I would like to see anything that checks consistency more frequently, given how common it is for corruption problems to occur (I get multiple emails per week from random people trying to recover from corruption). Answering Alex’s comment above, in a well-configured environment, it’s not common to see corruption, but well-configured environments aren’t the most common in the set of all SQL Server installations across the world.
I thought you’d like that, hahaha.
Great! Received this e-mail Monday from Azure feedback:
This aligns with our vision for improved manageability, particularly for VLDBs. Will consider for future development.
Microsoft SQL Server
Admin, Microsoft Azure