There’s an old DBA saying…
May you already have a backup restored
A half hour before your boss knows there’s corruption
What? There’s no such thing as old DBA sayings? Well, maybe if you all said something other than “no” once in a while, you’d be more quotable. Hmpf.
Anyway, this is a serious question! And there are a lot of things to consider
- Do I have a different RTO for corruption?
- What’s my backup retention policy?
- How much data do I have?
- How long are my maintenance windows?
- Do I have a server I can offload checks to?
Recovery Time Objectification
When you’re setting these numbers with management, you need to make them aware that certain forms of corruption are more serious than others, and may take longer to recover from. If system tables or clustered indexes become corrupt, you’re potentially looking at a much more invasive procedure than if a nonclustered index gets a little wonky — something you can disable and rebuild pretty easily.
Either way, you’re looking at an RTO of at least how long it takes you to restore your largest database, assuming the corruption isn’t present in your most recent full backup. That’s why backup checksums are important. They’re not a replacement for regular consistency checks by any means, but they can provide an early warning for some types of page corruption, if you have page verification turned on, and your page is assigned a checksum.
If you use a 3rd party backup tool that doesn’t allow you to use the backup checksum option, stop using it. Seriously, that’s garbage. And turn on Trace Flag 3023 until you find a replacement that does.
Notice I’m not talking about RPO here. But there’s a simple equation you can do: the shorter your RTO for corruption, the longer your RPO. It’s real easy to run repair with allow data loss immediately. The amount of data you lose in doing so is ¯\_(?)_/¯
Which is why you need to carefully consider…
The shorter the period of time you keep backups, the more often you need to run DBCC CHECKDB. If you keep data for two weeks, weekly is a good starting point. If you take weekly fulls, you should consider running your DBCC checks before those happen. A corrupt backup doesn’t help you worth a lick. Garbage backup, garbage restore. If your data only goes back two weeks, and your corruption goes back a month, best of luck with your job search.
Of course, keeping backups around for a long time is physically impossible depending on…
How much data YOU have
The more you have, the harder it is to check it all. It’s not like these checks are a lightweight process. They chew up CPU, memory, disk I/O, and tempdb. They don’t cause blocking, the way a lot of people think they do, because they take the equivalent of a database snapshot to perform the checks on. It’s transactionally consistent, meaning the check is as good as your database was when the check started.
You can make things a little easier by running with the PHYSICAL ONLY option, but you lose out on some of the logical checks. The more complicated process is to break DBCC checks into pieces and run them a little every night. This is harder, but you stand a better chance of getting everything checked.
Especially if you have terabytes and terabytes of data, and really a short…
Are you 24×7? Do you have nights or weekends to do this stuff? Are you juggling maintenance items alongside data loads, reports, or other internal tasks? Your server may have a different database for different customer locations, which means you have a revolving maintenance window for each zone (think North America, Europe, APAC, etc.), so at best you’re just spreading the pain around.
Or you could start…
This is my absolute favorite. Sure, it can be a bear to script out yourself. Automating rotating backups and restores can be a nightmare; so many different servers with different drive letters.
Dell LiteSpeed has been automating this process since at least version 7.4, and it’s not like it costs a lot. For sure, it doesn’t cost more than you losing a bunch of data to corruption. If you’re the kind of shop that has trouble with in-place DBCC checks, it’s totally worth the price of admission.
But what about you?
Tell me how you tackle DBCC checks in the comments. You can answer the questions at the beginning of the post, or ask your own questions. Part of my job is to help you keep your job.
Thanks for reading!
Brent says: if you’re using NetApp SAN snapshots, they’ve also got great tooling to offload corruption checks to your DR site. Licensing gotchas may apply – for both SQL Server and NetApp writeable snaps.