There’s an old DBA saying…
May you already have a backup restored
A half hour before your boss knows there’s corruption
What? There’s no such thing as old DBA sayings? Well, maybe if you all said something other than “no” once in a while, you’d be more quotable. Hmpf.
Anyway, this is a serious question! And there are a lot of things to consider
- Do I have a different RTO for corruption?
- What’s my backup retention policy?
- How much data do I have?
- How long are my maintenance windows?
- Do I have a server I can offload checks to?
Recovery Time Objectification
When you’re setting these numbers with management, you need to make them aware that certain forms of corruption are more serious than others, and may take longer to recover from. If system tables or clustered indexes become corrupt, you’re potentially looking at a much more invasive procedure than if a nonclustered index gets a little wonky — something you can disable and rebuild pretty easily.
Either way, you’re looking at an RTO of at least how long it takes you to restore your largest database, assuming the corruption isn’t present in your most recent full backup. That’s why backup checksums are important. They’re not a replacement for regular consistency checks by any means, but they can provide an early warning for some types of page corruption, if you have page verification turned on, and your page is assigned a checksum.
If you use a 3rd party backup tool that doesn’t allow you to use the backup checksum option, stop using it. Seriously, that’s garbage. And turn on Trace Flag 3023 until you find a replacement that does.
Notice I’m not talking about RPO here. But there’s a simple equation you can do: the shorter your RTO for corruption, the longer your RPO. It’s real easy to run repair with allow data loss immediately. The amount of data you lose in doing so is ¯\_(?)_/¯
Which is why you need to carefully consider…
The shorter the period of time you keep backups, the more often you need to run DBCC CHECKDB. If you keep data for two weeks, weekly is a good starting point. If you take weekly fulls, you should consider running your DBCC checks before those happen. A corrupt backup doesn’t help you worth a lick. Garbage backup, garbage restore. If your data only goes back two weeks, and your corruption goes back a month, best of luck with your job search.
Of course, keeping backups around for a long time is physically impossible depending on…
How much data YOU have
The more you have, the harder it is to check it all. It’s not like these checks are a lightweight process. They chew up CPU, memory, disk I/O, and tempdb. They don’t cause blocking, the way a lot of people think they do, because they take the equivalent of a database snapshot to perform the checks on. It’s transactionally consistent, meaning the check is as good as your database was when the check started.
You can make things a little easier by running with the PHYSICAL ONLY option, but you lose out on some of the logical checks. The more complicated process is to break DBCC checks into pieces and run them a little every night. This is harder, but you stand a better chance of getting everything checked.
Especially if you have terabytes and terabytes of data, and really a short…
Are you 24×7? Do you have nights or weekends to do this stuff? Are you juggling maintenance items alongside data loads, reports, or other internal tasks? Your server may have a different database for different customer locations, which means you have a revolving maintenance window for each zone (think North America, Europe, APAC, etc.), so at best you’re just spreading the pain around.
Or you could start…
This is my absolute favorite. Sure, it can be a bear to script out yourself. Automating rotating backups and restores can be a nightmare; so many different servers with different drive letters.
Dell LiteSpeed has been automating this process since at least version 7.4, and it’s not like it costs a lot. For sure, it doesn’t cost more than you losing a bunch of data to corruption. If you’re the kind of shop that has trouble with in-place DBCC checks, it’s totally worth the price of admission.
But what about you?
Tell me how you tackle DBCC checks in the comments. You can answer the questions at the beginning of the post, or ask your own questions. Part of my job is to help you keep your job.
Thanks for reading!
Brent says: if you’re using NetApp SAN snapshots, they’ve also got great tooling to offload corruption checks to your DR site. Licensing gotchas may apply – for both SQL Server and NetApp writeable snaps.
I do daily a full and 15 minute incremental log backups and run CHECKDB every Sunday and Wednesday night.
I have created an SSIS package that restores my database to a different server and run my DBCC CHECKDB there. This process validates that my backups work and also run the DBCC CHECKDB
Neat! That’d make a great blog post, if you do that sort of thing.
Hello Erik I just slapped a quick post of my SSIS solutions.
The gotcha in offloading DBCC check DB to another server comes in the storage. If the storage is different did you really do a check on your production database? This is one of the DBA Myths Paul Randall discusses, specifically regarding integrity checks on a secondary replica in an AG. In short, since the storage is different you are good to go for that replica, but the primary not so much.
I do agree that a restore and integrity check is a great way to confirm your backups are solid.
The issue with Replicas, Log Shipped Secondaries, et al. is that you’ve restored logs or log data to a full backup from when you first initialized the process. A full backup is a page for page copy of your current database as it is now. It reads all the pages from your data file as they are now. I’m going by Aaron Bertrand, though.
We have a 2 replica AlwaysOn configuration. We are currently do CHECKDB on a primary server. I’ve been considering starting CHECKDB on the secondary as well because they are not using shared storage.
Is it even possible to perform CHECKDB on a secondary dB that is not readable (HA only)? Does this violate MS’s licensing since the HA secondary is not licensed?
The secondary in an AG is readable so you technologically can perform maintenance on it. The licensing question is something that I think (do not quote me) depends on if you are using Standard or Enterprise edition.
As of 2014, AGs are Enterprise only. In 2016, you can have a Basic Availability Group on a Standard Edition server, but there are many limitations.
Being able to query and perform maintenance on secondary Replicas requires Software Assurance, and/or additional licensing. Always check with your licensing provider, lest ye incur the wrath of a Microsoft audit.
There is a reason I claim not expertise in licensing. 🙂
FWIW we’re using AlwaysOn AGs with a primary and a readable secondary. We would not have been required to license the secondary as it would have been considered a passive failover instance even though we actively query it. However because we use the SQL instance that houses the readable secondaries to offload our DBCC CHECKDB process we were required to license it because it was performing “work”. (I’m specifically referring to DBCC CHECKDB against a separate set of restored copies of our databases, not against the readable secondaries, although that would be considered “work” as well and require licensing.)
Don’t take our experience as gospel though since YMMV. As Erik notes, always check with your licensing rep since Microsoft licensing is definitely relative (although it shouldn’t be). One piece of advise I’d add is to not be hesitant to push back if you believe you’re being required to license instances you don’t think you should. We did, and it definitely paid off. (Of course, we’re also under an audit from Microsoft and perhaps that’s not coincidence, hmm…)
I do weekly DBCC’s and have a job that runs every 2 minutes that checks for suspect pages and alerts me.
We have an Availability Group setup and the DBCC CHECKDB is run weekly on the primary during a slow period (Sunday). DBCC CHECKDB is run daily on the secondary replica. Is the job necessary on the primary if I run it on the secondary daily?
Yes! You can definitely have corruption on your primary replica that’s not present on your secondary replica (or vice versa). You’re going to want to run DBCC CHECKDB on your primary replica as well as any secondary replicas you have. Remember that a secondary replica is ultimately a restored copy of a full backup of the database with transaction logs constantly being applied to it. Any corruption that occurs on the primary replica after that initial full backup was taken may not necessarily be replicated to the secondary. Likewise your secondary replicas can experience corruption on their own which wouldn’t be present on the primary replica.
Picture a faulty I/O subsystem which corrupts a clustered index containing static data on the primary replica. It never gets targeted to be rebuilt since it doesn’t fragment (being non-volatile) and so it’s corrupt on the primary replica and pristine on the secondary replica. Only running DBCC CHECKDB on the secondary replica would never reveal this.
I run CHECKDB daily on each of the 800+ production databases and at least weekly on 400+ test databases. I am fortunate to have the maintenance window and server resources to accomplish this feat. I also run CHECKDB on every restore. When restoring to a newer version of SQL, I use the WITH DATA_PURITY parameter to ensure the old database works in the new environment.
The offloading of databases to do CHECKDB on another server could result in misleading results as the corruption could occur on the destination server. That could send you on a wild goose chase looking for corruption that doesn’t exist on the production server.
You always make me think about how I do things. Thank you.
You mentioned responses other than ‘No’ at the beginning of this article. in that light, I ask “Do you know what DBA stands for?”.
Don’t Bother Asking!
People do have a funny impression of what DBAs do.
I’ve been accused of sitting in my Ivory Tower, dispensing tiny morsels of privileges from on high.
Yeah, I put on my robe and wizard hat and I cast Lvl 3 select rights on the peasants below.
I wrote a FOR XML query that generates an XML document describing all full and transaction log backups for a SQL instance. Then, using an xslt, I can generate restore scripts with different options. My nightly restore, for example, restores to a QA environment on a different server and runs DBCC on it.
Russell’s comment regarding a suspect page alert is intriguing…
Smaller shop with minimal activity at night. I do integrity checks every night on all DBs. I also restore all backups to a different server every night and check those.
We run CHECKDB daily on our Dev, Test, and Prod Data Warehouse environments via Agent Jobs. They are spread out, in that we have jobs that run CHECKDB on different (and explicit) databases at certain times, e.g. “system databases” (master, msdb, model) at 2:30am, “Core DW Databases” (DW, Stage, History, etc.) at 3:00 am, then “Secondary DW Databases” at 3:30am, etc.
Since they are Agent Jobs, we get failure email notifications via Database Mail to the configured Operators. Fortunately, we have not yet had any failures 🙂
Make sure those Agent jobs log failures to a file! If you ever do find corruption, you may not want to sit around running CHECKDB again to get the full error messages.
A solid point… so question then…
We use the below command:
Now, we (I) assume that if it does fail, the error messages would be shown in the Agent Job’s history, would that be correct or not? If the error messages would indeed be in the History of the Agent Job, would that enough or not enough? Since we haven’t seen a failure yet, we haven’t seen it to know.
Nope. You may get the job error, and why it failed, but not the full output of what errors came from DBCC CHECKDB.
Good to know. Thanks for the tip!
When I started at my last role, there were 11 out of 750+ databases across 60 instances that had actually had DBCC CHECKDB run against them in the last 5 years . This is not an exaggeration.
Nowadays, all databases on all production instances have CHECKDB run once a week, as well as the majority of Dev/Test instances. There were a lot of databases that required DBCC UPDATEUSAGE, and even a few that needed even more care to bring to a consistent and uncorrupt state.
We run CHECKDB WITH NO_INFOMSGS as daily job for every database.
Database count on production server < 50, max(dbsize) 100 GB, total job-runtime 1 h.
Daily restore of all db's on staging-sql-instance and additional dbcc CHECKDB-run there.
Thanks for that post!
“What? There’s no such thing as old DBA sayings? Well, maybe if you all said something other than “no” once in a while, you’d be more quotable. Hmpf.”
Love it – great to start the day with a giggle!
We actually take backups and then run an integrity check daily (on most systems). My thought process behind that is that we keep backups for 40 days so taking a backup of a corrupt database won’t age out the one usable backup we need. It also gets the backup to disk faster so if there’s an outage during the integrity check we already have a backup. It also gives us a higher degree of certainty that the backup isn’t corrupt since there’s a possibility that a page gets corrupted between when the integrity check finishes it and when the backup reads it.
I’m not saying that is the best way to do it since not everyone can keep all their daily fulls for 40 days. The important thing is know your setup and what that particular setup means. If you do backups first and checkdb finds corruption then that backup likely isn’t good. If you do checkdb first and there’s a loss of database before end of maintenance (assuming maintenance starts at the same time whichever you do first) then you may need to go back to an earlier backup.
I like James’ idea of a job that automatically restores the backup to another instance for a DBCC and kills one bird with two stones, or something like that. I’ll give that some thought.
Myself, I don’t have huge databases or 24/7 operations, so I have no problem running a full DBCC every night. I also have it filter the full DBCC for the word ‘ERROR’ and send me the summary file, so I don’t have to page through that monstrosity of a result file looking for problems.
As far as DBA sayings, I came across a blog whose headline said An Unindexed Heap Is Not Worth Searching. I liked it.
Accidental DBA here btw.
I’m running DBCC CHECKDB weekly because they’re all production critical and meant to be up 24×7. I was flat out denied in my request to get a regular maintenance window..so yeaah. .. anyway, I’m very interested in knowing how often I should be running DBCC CHECKDB on critical apps.
1) I noticed that sp_Blitz is checking for most recent DBCC on the TempDB – is that really required?
Database [tempdb] never had a successful DBCC CHECKDB. This check should be run regularly to catch any database corruption as soon as possible. Note: you can restore a backup of a busy production database to a test server and run DBCC CHECKDB against that to minimize impact. If you do that, you can ignore this warning.
2) One of the servers throws a guaranteed stack dump every week (97 weeks in a row now …) when the DBCC job runs. What’s the best way to determine the potential data loss in the afflicted database? My plan is to perform the DBCC repair on a copy of the DB and compare tables – but is there a better way?
DBCC CHECKDB (collab) WITH physical_only executed by NT AUTHORI
TY\SYSTEM found 6 errors and repaired 0 errors. Elapsed time: 0 hours 8 minutes 0 seconds. Internal database sna
pshot has split point LSN = 00027698:00001523:0006 and first LSN = 00027698:0000151c:0001.
repair_allow_data_loss is the minimum repair level for the err
ors found by DBCC CHECKDB (collab).
**Dump thread – spid = 0, EC = 0x00000000D4E43EC0
***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\LOG\SQLDump0097.txt
You should definitely call Microsoft about that one. Those errors could be anywhere.
We have a server set up that acts as our replication distributor only. Any reasons or cautions for not running CHECKDB on the distributor database?
No, sounds good.
I run CHECKALLOC and CHECKCATALOG weekly on all databases (system and user) with Ola Hallengren’s maintenance solution. I then run a CHECKTABLE daily specifying a 3 hour window and excluding tables that have been checked in the last 9 days. The biggest reason for this is we often have multiple databases, exceeding a couple of terabytes on single instances. Without offloading, the CHECKDB’s take more days than there is in a week 🙁
Great solution rotating your tables for CHECKTABLE. Would you be able to share the script you used for this?
I have a developer who has suggested they implement this (DBCC DHECKDB() as part of his nightly Scheduled Job ETL. I am not sure what I think about allowing him to perform his task, as we already have a maintenance task set up to perform this. The real questions is, is there a reason I should not allow him to perform this action from a security or performance point of view.
Any thoughts appreciated.
Well, I’m not opposed to it being part of that process, but I wonder if that’s the right time to be running it.
I usually want to run my DBCC CHECKDB job before a full backup. I also want to make sure I get an email if it fails, with as much of the error as possible in the text.
Great post. I like the suggestion of running PHYSICAL ONLY on large TBs. We have a 4TB Db that takes nearly 24hrs to complete and the internal snapshots are consuming a lot of storage in the process.
Our maintenance window is about 6hrs per night and we’re using Always On (2016 SP1). We could offload to an async replica, but my OCD tells me to run CHECKDB on all replicas!
How do you recommend breaking DBCC checks into pieces? Are there DBCC checks you should always run?
You definitely can’t offload checks to AG replicas. The only valid sources for offloading are a full backup or a full SAN snapshot restored to another server.
Back when I had a real job, I used a schedule sort of like the one suggested by Paul here, which worked well for some of my databases that had a static set of tables.
Is checksum option in daily full backups ,will help me to escape from running CHECKDB..?
For questions, head on over to a Q&A site like https://dba.stackexchange.com.