You want to check for corruption, but you don’t want to slow down your primary production server.
In this post, I’m specifically talking about offloading the corruption checking process. I’m not talking about doing corruption checking on both the primary and other servers – that’s wonderful, and if you’re doing that, you should hug yourself. You’re doing a good job. Who’s a good dog? You are! Good dog.
Now, for the rest of you – your production server is slow, and you wanna know what the drawbacks are of running CHECKDB on other servers instead. Here are the issues to consider. I’m going to be talking in general terms about business risk. I wish I could give you an idea of the risk you’re taking there, but without seeing your servers and practices, that’s tough. Start by putting your RPO & RTO in writing, and then these risks will start to make sense based on the size of your data, the amount of data you’re willing to lose, and how long you’re willing to be down:
If you offload production work, you need a production license. Developer Edition isn’t gonna cut it. Depending on your SQL Server version and the database features you’re using, you may be able to run Enterprise in production and then Standard in your offload environment, saving yourself some licensing costs. 2016 SP1 moved a ton of developer-focused Enterprise features down to Standard, including partitioning, compression, columnstore indexes, and more, and 2019 even added Transparent Data Encryption. That’s pretty cool! You can cut the costs on your maintenance jobs, so to speak.
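If you're weighing Enterprise in production against Standard on the offload box, here's a quick sanity-check sketch – the database name is a placeholder, and the final word on licensing always comes from your agreement, not a DMV:

-- Rough sanity check (not a licensing tool): run this in the database you plan
-- to offload. Any rows mean the database persists features that are
-- edition-restricted on your version, which can block restoring it onto a
-- lower edition in the offload environment.
USE YourDatabase;   -- placeholder database name
GO
SELECT feature_name
FROM sys.dm_db_persisted_sku_features;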
If you don’t run CHECKDB on the same server that takes your backups, you’re running a very, very serious risk: your backups themselves may be corrupt. SQL Server will happily back up corrupt pages all night long without throwing an error. Even the restore process doesn’t check for corruption. It’s imperative that you get a corruption check done on the backup as quickly as possible. If you’re using Availability Groups and you’re offloading backups to a different replica, you gotta run CHECKDB on that same replica at the bare minimum. I’ve seen situations where the backup and CHECKDB were run on two different replicas, and the CHECKDB was passing on one replica, but the backups were backing up corrupt data pages on a different replica.
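There is a partial mitigation here, but it's not a CHECKDB replacement: backing up WITH CHECKSUM makes SQL Server re-verify existing page checksums as it reads each page, and RESTORE VERIFYONLY WITH CHECKSUM re-checks the backup file without restoring it. Neither catches logical corruption, or damage on pages written before page checksums were enabled. A sketch, with placeholder database name and share path:

-- Partial mitigation, not a CHECKDB replacement: WITH CHECKSUM re-verifies
-- existing page checksums as the backup reads each page, and RESTORE
-- VERIFYONLY WITH CHECKSUM re-checks the backup file without restoring it.
BACKUP DATABASE YourDatabase                               -- placeholder name
TO DISK = N'\\FileShare\Backups\YourDatabase.bak'          -- placeholder path
WITH CHECKSUM, COMPRESSION, STATS = 10;

RESTORE VERIFYONLY
FROM DISK = N'\\FileShare\Backups\YourDatabase.bak'
WITH CHECKSUM;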
If you restore full backups elsewhere, you’re introducing a lag time risk. Say you’re offloading the work because you’re dealing with a large database – that also likely means you’re dealing with slow backups and restores. So if your timeline looks like this:
- 6PM-8PM – full backup runs on production, pointed at a file share
- 8PM-10PM – full restore runs in another environment, restoring from the file share
- 10PM – CHECKDB starts
- Midnight – CHECKDB finishes with a failure, indicating corruption
Okay, now what? Was it a problem with the backup, the restore, or the hardware in the CHECKDB environment? Your troubleshooting is only just beginning, and while you’re troubleshooting, the corruption might be getting worse. On the other hand, if you’d run CHECKDB directly in production, you’d know with confidence where the problem is.
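For reference, the offload pattern in that timeline boils down to something like this sketch – the database name, share path, and logical file names are all placeholders you'd swap for your own:

-- Restore the latest full backup from the file share into the offload
-- environment, then run CHECKDB against the restored copy.
RESTORE DATABASE YourDatabase_Check
FROM DISK = N'\\FileShare\Backups\YourDatabase.bak'
WITH MOVE N'YourDatabase'     TO N'D:\Data\YourDatabase_Check.mdf',   -- logical file names are assumptions
     MOVE N'YourDatabase_log' TO N'L:\Logs\YourDatabase_Check.ldf',
     CHECKSUM, REPLACE, STATS = 10;

-- ALL_ERRORMSGS and NO_INFOMSGS keep the output focused on actual problems.
DBCC CHECKDB (N'YourDatabase_Check') WITH NO_INFOMSGS, ALL_ERRORMSGS;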
If you run CHECKDB on a server that had log backups applied, you’re testing different data pages than what lives on the primary. You can run CHECKDB on a log shipping secondary, for example, but only the logged changes are moving from the primary to that secondary, not the data page changes. That means it’s possible for the primary to be chock full of nuts at the same time the log shipped secondary is fine. What’s the big deal? Well, imagine two problems:
- If the primary encounters corruption, sure, you can fail over to the log shipped secondary. Yay! Clean data pages.
- If you need to restore something from backup, like for compliance purposes or to unwind an “oops” delete, think about where your full backups have been taken: that’s right, the primary! Those full backups were backing up corrupt data pages, and they’re worthless. You haven’t been taking full backups on the log shipped secondary, so while it has clean data pages, it only has clean data pages as of right now.
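That's why, even when CHECKDB lives on the log shipping secondary, it's worth keeping a lightweight eye on the primary. A sketch, with a placeholder database name – msdb's suspect_pages table records pages that have already thrown checksum or 823/824 errors, and PHYSICAL_ONLY is a cheaper check you can squeeze into a quiet window:

-- Pages that have already failed checksum or 823/824 reads on this instance:
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages;

-- A lighter-weight consistency check on the primary: it skips the logical
-- checks, so it's cheaper, but it's not a substitute for a full CHECKDB somewhere.
DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY, NO_INFOMSGS;   -- placeholder name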
If you’re running CHECKDB somewhere that you won’t fail over to, you’re even more exposed. In our scenarios above, let’s say we offload CHECKDB to a restore server with fewer cores & less memory, running cheaper Standard Edition. If our primary suddenly reports corruption, it doesn’t do us a lot of good to turn to the restore server and say, “Alright, buddy, time to promote you into the primary role.” Remember, we’re already offloading CHECKDB because the primary itself wasn’t fast enough to keep up with our workloads and our maintenance jobs: odds are, our lil’ buddy isn’t going to be able to do it, either. At that point, when you hit corruption on the primary, you have to wipe it and restore, and hope that the corruption doesn’t come back – but without doing root cause analysis on why the corruption hit, well, it’s probably gonna come back. Because of that, I’m not a fan of having only one big primary and offloading CHECKDB to one tiny restore server. At that point, all our eggs are in one basket, and all CHECKDB is good for is to confirm that yes, the basket is broken. The business is down.
But if you’re using AGs, not restores, and you’re willing to fail over to that replica, the risk is more tolerable. If you offload both CHECKDB and backups to a replica that you’re comfortable failing over to if the primary encounters corruption, then you don’t have as much of a lag time risk. You just need more than 2 servers in the AG at this point, like this:
- Replica1: normally our primary, super-busy, can’t tolerate maintenance slowdowns
- Replica2: not queried, only used for backups & CHECKDB
- Replica3: async, ready to jump into duty if necessary
Then if Replica1 starts reporting corruption, you can fail over to Replica2, make Replica3 the new backup/CHECKDB replica, and take Replica1 down while you troubleshoot its IO or SQL Server problems.
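If you script that out, gate the CHECKDB job so it only does work on whichever replica currently holds the backup/CHECKDB role. A sketch, assuming SQL Server 2014 or newer and a readable secondary – the replica and database names are placeholders:

-- Only run CHECKDB if this instance is the designated backup/CHECKDB replica
-- (Replica2 in the list above) and it isn't currently the primary.
DECLARE @CheckdbReplica sysname = N'Replica2';   -- store this wherever you keep job config

IF @@SERVERNAME = @CheckdbReplica
   AND sys.fn_hadr_is_primary_replica(N'YourDatabase') = 0   -- requires SQL Server 2014+
BEGIN
    -- Needs a readable secondary; CHECKDB can't run against a non-readable replica.
    DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
END
ELSE
BEGIN
    PRINT 'Not the designated CHECKDB replica - skipping.';
END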
Automatic page repair isn’t a long-term fix. In the scenario above, folks might say, “It’s okay if Replica1 hits corruption – it’ll just automatically fetch clean copies of the pages from Replica2 and Replica3, and heal itself in real time.” APR is like a bulletproof vest: it’s great for temporary protection for a lot of your body, but if you get shot somewhere the vest doesn’t cover, you’re gonna have a bad day. APR can’t heal corruption in all types of pages, in all databases. Once you start getting shot, you need to evacuate the area as soon as possible, and a bulletproof vest buys you that time. Same thing with APR – your IO subsystem or SQL Server itself is trashing your data pages, and you need to evacuate your data out of there as quickly as possible and do root cause analysis.
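If you do lean on APR as a stopgap, at least watch for it happening – every automatic repair is evidence that something is chewing on your pages. A small sketch using the DMV that tracks page-repair attempts on availability group databases:

-- Any rows here mean pages on this replica were corrupted and patched from
-- another replica. Treat that as an alarm to start root cause analysis,
-- not as proof that the problem is handled.
SELECT DB_NAME(database_id) AS database_name,
       file_id, page_id, error_type, page_status, modification_time
FROM sys.dm_hadr_auto_page_repair
ORDER BY modification_time DESC;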
tl;dr – if your only choices are either not running CHECKDB at all or offloading CHECKDB elsewhere, then obviously you should offload it. I’d rather you do it on the primary replica, but if you can’t, you can’t – just be mindful of the above gotchas.
Comments
Many thanks for all the valuable knowledge shared.
Just to understand your point: if I ran CHECKDB on the secondary Replica2 and it reported corruption, how would I be able to fail over to it, since it would also be corrupted?
And in that case, how would I know that Replica1 is corrupted, as long as I’m only running CHECKDB on Replica2?
Would the alerts for corruption and checksum errors be enough for Replica1?
Ayman – if you run CHECKDB on the secondary replica, it doesn’t necessarily mean the *primary* is corrupt. You have to run CHECKDB on whatever replica you would want to fail over TO. https://www.brentozar.com/archive/2012/02/where-run-dbcc-on-alwayson-availability-groups/
I gather there’s still not a good way to query the last time a CHECKDB was run for a database?
I’m interested to know if there’s a production database somewhere that isn’t getting checked.
Good news! Check out sp_Blitz: https://www.brentozar.com/blitz/
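If you just want the raw value without installing anything, a couple of options – the database name below is a placeholder, and DBCC DBINFO is undocumented, so use it at your own risk. Keep in mind that the “last known good” date lives inside each copy of the database, so running CHECKDB on a restored copy doesn’t update the value on the production database:

-- SQL Server 2016 SP2 and later: documented one-liner.
SELECT DATABASEPROPERTYEX(N'YourDatabase', 'LastGoodCheckDbTime') AS last_good_checkdb;

-- Older versions: dig dbi_dbccLastKnownGood out of DBCC DBINFO (undocumented).
CREATE TABLE #dbinfo (ParentObject varchar(255), [Object] varchar(255),
                      Field varchar(255), [Value] varchar(255));

INSERT #dbinfo
EXEC ('DBCC DBINFO (''YourDatabase'') WITH TABLERESULTS');

SELECT [Value] AS last_good_checkdb
FROM #dbinfo
WHERE Field = 'dbi_dbccLastKnownGood';

DROP TABLE #dbinfo;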
If I run a full backup with checksum, will it be the same as running DBCC CHECKDB WITH PHYSICAL_ONLY?
Sergey – that’s not really the topic of this post, so for other questions, head to SQLhelp: https://www.brentozar.com/go/sqlhelp or https://DBA.StackExchange.com. Thanks!
Found an answer
https://www.sqlskills.com/blogs/paul/a-sql-server-dba-myth-a-day-2730-use-backup-with-checksum-to-replace-dbcc-checkdb/
Yep, there you go! Whenever you think about posting a question, Google ‘er first and you’d be amazed at how many good answers are out there.
RE: “If you offload production work, you need a production license”, I think that requires a footnote because of the Nov 1, 2019 licensing changes. If you have Software Assurance and are using an unlicensed passive failover instance, that instance can now be used for DBCC CHECKDB without requiring a license and you can run backups from there too.
That’s a great point! I love that change.
Great, Brent.
SAN snaps/clones are a great solution too. Of course, that doesn’t work in a cloud environment. We take a nightly SAN snap of a multi-TB database – it’s available on another server almost instantly, and then we run CHECKDB and backups on the copy.
“SAN snaps/clones are great solution also” – assuming it links in with VSS, and the person setting it up remembered to check the box….
That depends on the storage array you use. Some don’t require VSS snapshots for usable snapshots and clones.
Care to name one? I’d love to find out more about how they ensure a TLog can be applied in a consistent state.
Out of respect for Brent’s blog (don’t want to astroturf here) – go check who I work for.
Argenis – awww, thanks!
Mitch – Argenis works for a very well-respected storage vendor, Pure, who employs very well-respected people like Argenis.
Argenis – also, you have my full thumbs-up permission to astroturf here whenever you want. You’ve earned it.
https://support.purestorage.com/Solutions/Microsoft_Platform_Guide/L_Volume_Shadow_Copy_Service_(VSS)/Volume_Shadow_Copy_Service_(VSS)
Thanks Brent, you made me blush 🙂
Mitchell – check out this video of a database being cloned live without VSS:
https://www.youtube.com/watch?v=mhWL5In_pv0
Great post as usual, Brent! I am a bit confused learning that “Even the restore process doesn’t check for corruption”….
What I do is have a nightly backup job which checks for corruption, does index and statistics maintenance, and then takes full backups. These backups are subsequently restored on a test environment in a way similar to your sp_allnightlog procedure.
I originally thought: well, I check that the backups are OK by restoring them, and I check that the DB itself is OK by running CHECKDB. What am I potentially missing? I guess the DB could become corrupt between running CHECKDB and taking the backup… right? Hmm, definitely a walk in the park, as you already pointed out.
Martin – I’m referring to the fact that some folks think simply doing a backup and a restore will check for corruption. You said “nightly backup job which checks for corruption” – as long as by that you mean you’re running DBCC CHECKDB, then you’re fine.
This post is about “Can I Offload DBCC CHECKDB to Another Server” – you’ve said you’re not doing that, that you’re running CHECKDB on the primary itself, so it doesn’t really apply to you.
Thanks for clarifying, Brent. Keep up the great work!
Hi Brent,
Just wondered what you were implying by the comment
“If you offload production work, you need a production license. Developer Edition isn’t gonna cut it. “
Just what I said – what part isn’t clear? (I’m not sure how to rephrase that.)
MS licensing gives me a headache… and with the cloud, damn… now I understand why you can make licensing your entire profession. -_-
Why would you offload production work?
Maybe just so you can test something out.
Couldn’t you do that with a Developer license, so long as you don’t use the offloaded SQL Server for production?
If you’re running CHECKDB on a regular, scheduled basis to make sure your data is safe, that’s not “testing something out.”