You want to check for corruption, but you don’t want to slow down your primary production server.
In this post, I’m specifically talking about offloading the corruption checking process. I’m not talking about doing corruption checking on both the primary and other servers – that’s wonderful, and if you’re doing that, you should hug yourself. You’re doing a good job. Who’s a good dog? You are! Good dog.
Now, for the rest of you – your production server is slow, and you wanna know the drawbacks of running CHECKDB on other servers instead. Here are the issues to consider. I’m going to be talking in general terms about business risk – I wish I could quantify the risk you’re taking, but without seeing your servers and practices, that’s tough. Start by putting your RPO & RTO in writing, and then these risks will start to make sense based on the size of your data, the amount of data you’re willing to lose, and how long you’re willing to be down:
If you offload production work, you need a production license. Developer Edition isn’t gonna cut it. Depending on your SQL Server version and the database features you’re using, you may be able to run Enterprise in production and then Standard in your offload environment, saving yourself some licensing costs. 2016 SP1 moved a ton of developer-focused Enterprise features down to Standard, including partitioning, compression, columnstore indexes, and more, and 2019 even added Transparent Data Encryption. That’s pretty cool! You can cut the costs on your maintenance jobs, so to speak.
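If you’re weighing that Enterprise-on-primary, Standard-on-offload split, one quick sanity check is whether the database actually uses any edition-restricted features. Here’s a sketch using SQL Server’s built-in DMV – the database name is hypothetical:

```sql
USE YourDatabase;  -- hypothetical database name
GO
-- Lists edition-restricted features persisted in the current database.
-- An empty result means nothing here blocks a restore onto a lower
-- edition (subject to the version caveats above, like 2016 SP1 & 2019).
SELECT feature_name
FROM sys.dm_db_persisted_sku_features;
```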
If you don’t run CHECKDB on the same server that takes your backups, you’re running a very, very serious risk: your backups themselves may be corrupt. SQL Server will happily back up corrupt pages all night long without throwing an error. Even the restore process doesn’t check for corruption by default – and RESTORE WITH CHECKSUM only validates page checksums, not logical consistency. It’s imperative that you get a corruption check done on the backup as quickly as possible. If you’re using Availability Groups and you’re offloading backups to a different replica, you gotta run CHECKDB on that same replica at the bare minimum. I’ve seen situations where the backup and CHECKDB were run on two different replicas, and CHECKDB was passing on one replica while the backups were backing up corrupt data pages on the other.
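While you’re sorting out where CHECKDB runs, you can at least make SQL Server checksum pages on the way out and on verification. A sketch with hypothetical names – and to be clear, WITH CHECKSUM only catches what page checksums can catch, so it’s a supplement to CHECKDB, not a replacement:

```sql
-- WITH CHECKSUM verifies each page's existing checksum as it's read,
-- and writes a checksum over the whole backup. It can't detect logical
-- corruption - only CHECKDB does that.
BACKUP DATABASE [YourDatabase]  -- hypothetical names throughout
TO DISK = N'\\FileShare\Backups\YourDatabase.bak'
WITH CHECKSUM, COMPRESSION;

-- Likewise, verify the backup file itself without restoring it:
RESTORE VERIFYONLY
FROM DISK = N'\\FileShare\Backups\YourDatabase.bak'
WITH CHECKSUM;
```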
If you restore full backups elsewhere, you’re introducing a lag time risk. Say you’re offloading the work because you’re dealing with a large database – that also likely means you’re dealing with slow backups and restores. So if your timeline looks like this:
- 6PM-8PM – full backup runs on production, pointed at a file share
- 8PM-10PM – full restore runs in another environment, restoring from the file share
- 10PM – CHECKDB starts
- Midnight – CHECKDB finishes with a failure, indicating corruption
Okay, now what? Was it a problem with the backup, the restore, or the hardware in the CHECKDB environment? Your troubleshooting is only just beginning, and while you’re troubleshooting, the corruption might be getting worse. On the other hand, if you’d run CHECKDB directly in production, you’d know with confidence where the problem is.
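For reference, the restore-then-check step in that timeline usually boils down to something like this – the database name, logical file names, and paths here are all hypothetical:

```sql
-- 8PM-10PM: restore the full backup from the file share into the
-- offload environment under a throwaway name.
RESTORE DATABASE [YourDatabase_Check]
FROM DISK = N'\\FileShare\Backups\YourDatabase.bak'
WITH MOVE N'YourDatabase'     TO N'D:\Data\YourDatabase_Check.mdf',
     MOVE N'YourDatabase_log' TO N'L:\Log\YourDatabase_Check.ldf',
     REPLACE, RECOVERY;

-- 10PM: check it. NO_INFOMSGS cuts the noise; ALL_ERRORMSGS makes
-- sure you see every error, not just the first few per object.
DBCC CHECKDB (N'YourDatabase_Check') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```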
If you run CHECKDB on a server that had log backups applied, you’re testing different data pages than what lives on the primary. You can run CHECKDB on a log shipping secondary, for example, but only the logged changes are moving from the primary to that secondary, not the data page changes. That means it’s possible for the primary to be chock full of nuts at the same time the log shipped secondary is fine. What’s the big deal? Well, imagine two problems:
- If the primary encounters corruption, sure, you can fail over to the log shipped secondary. Yay! Clean data pages.
- If you need to restore something from backup, like for compliance purposes or to unwind an “oops” delete, think about where your full backups have been taken: that’s right, the primary! Those full backups were backing up corrupt data pages, and they’re worthless. You haven’t been taking full backups on the log shipped secondary, so while it has clean data pages, it only has them as of right now – it can’t give you a point-in-time restore from the past.
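Since those full backups come off the primary, at a bare minimum keep an eye on the primary’s suspect_pages table. It’s nowhere near a CHECKDB – it only records pages that have already failed during normal reads – but it’s cheap to query:

```sql
-- Run on the PRIMARY. Pages SQL Server has flagged as suspect:
-- event_type 1 = 823/824 error, 2 = bad checksum, 3 = torn page.
-- Any rows here mean your full backups may already contain corruption.
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
WHERE event_type IN (1, 2, 3);
```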
If you’re running CHECKDB somewhere that you won’t fail over to, you’re even more exposed. In our scenarios above, let’s say we offload CHECKDB to a restore server with fewer cores & less memory, running cheaper Standard Edition. If our primary suddenly reports corruption, it doesn’t do us a lot of good to turn to the restore server and say, “Alright, buddy, time to promote you into the primary role.” Remember, we’re already offloading CHECKDB because the primary itself wasn’t fast enough to keep up with our workloads and our maintenance jobs: odds are, our lil’ buddy isn’t going to be able to do it, either. At that point, when you hit corruption on the primary, you have to wipe it and restore, and hope the corruption doesn’t come back – but without doing root cause analysis on why the corruption hit, well, it’s probably gonna come back. Because of that, I’m not a fan of having only one big primary and offloading CHECKDB to one tiny restore server. At that point, all our eggs are in one basket, and all CHECKDB is good for is confirming that yes, the basket is broken. The business is down.
But if you’re using AGs, not restores, and you’re willing to fail over to that replica, the risk is more tolerable. If you offload both CHECKDB and backups to a replica that you’re comfortable failing over to if the primary encounters corruption, then you don’t have as much of a lag time risk. You just need more than 2 servers in the AG at this point, like this:
- Replica1: normally our primary, super-busy, can’t tolerate maintenance slowdowns
- Replica2: not queried, only used for backups & CHECKDB
- Replica3: async, ready to jump into duty if necessary
Then if Replica1 starts reporting corruption, you can fail over to Replica2, make Replica3 the new backup/CHECKDB replica, and take Replica1 down while you troubleshoot its IO or SQL Server problems.
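When Replica1 does report corruption, the failover itself is a one-liner – the AG name is hypothetical, and this assumes Replica2 is a synchronized, synchronous-commit secondary:

```sql
-- Run this on Replica2, the secondary being promoted -
-- not on the failing primary.
ALTER AVAILABILITY GROUP [YourAG] FAILOVER;
```

The hard part isn’t the command – it’s having already rehearsed the failover, repointed backup jobs at Replica3, and confirmed your apps follow the listener.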
Automatic page repair isn’t a long-term fix. In the scenario above, folks might say, “It’s okay if Replica1 hits corruption – it’ll just automatically fetch clean copies of the pages from Replica2 and Replica3, and heal itself in real time.” APR is like a bulletproof vest: it’s great temporary protection for a lot of your body, but if you get shot somewhere the vest doesn’t cover, you’re gonna have a bad day. APR can’t heal corruption in all types of pages, in all databases. Once you start getting shot, you need to evacuate the area as soon as possible, and a bulletproof vest buys you that time. Same thing with APR – your IO subsystem or SQL Server itself is trashing your data pages, and you need to get your data out of that area as quickly as possible and do root cause analysis.
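You can see how often the vest has been taking bullets for you by checking the auto page repair DMV on each replica – a quick sketch:

```sql
-- Each row is a page that automatic page repair attempted to fix.
-- Rows here mean something is already damaging pages on this replica:
-- treat this as a fire alarm, not a maintenance log.
SELECT database_id, file_id, page_id, error_type, page_status, modification_time
FROM sys.dm_hadr_auto_page_repair
ORDER BY modification_time DESC;
```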
tl;dr – if your only choices are either not running CHECKDB at all or offloading CHECKDB elsewhere, then obviously you should offload it. I’d rather you do it on the primary replica, but if you can’t, you can’t – just be mindful of the above gotchas.