Corruption 2: DBCC CHECKDB for VLDBs
“All they know is that a database has exploded,” the news reporter will say grimly, staring into the camera while a ticker rolls across the bottom of the screen announcing over and over again that (YOUR NAME HERE) has lost several terabytes of irreplaceable data. Family heirloom data. For charity. Orphans. Big-eyed baby animals who look like they’re crying.
In this video, you’ll learn how to avoid all that by running DBCC CHECKDB when dealing with terabyte-sized databases. That means splitting up checks, running a different type of check, and even offloading checks to another server.
Full Transcript:
Howdy folks and welcome back to the Brent Ozar Unlimited video training. I am Erik, I am of course with Brent Ozar Unlimited. We’re going to talk about DBCC CHECKDB on very large databases. This is a big, important topic because running DBCC CHECKDB is so crucial, and running it on very large databases is often difficult. So we’ll talk about some ways to do that, make your life easier and better, why it’s so important, all that stuff. Bigger databases just mean you have more data to lose, and that’s scary. So here’s what we’re going to cover: what is CHECKDB, why do you need to run it, how often do you need to run it, how do you automate it, whether there are other ways to run CHECKDB, whether there are faster ways, whether there are ways to take some smaller bites out of CHECKDB, what to do if you find corruption, and what causes corruption.
Now, of course, DBCC CHECKDB is a long, complicated, convoluted routine that takes a transactionally consistent snapshot of your database and then runs all sorts of checks across all sorts of areas: checking system pages, checking page validity, checking page lengths, checking page allocations across the page free space (PFS), global allocation map (GAM), and shared global allocation map (SGAM) pages, checking extents (extents are groups of eight 8KB pages all put together), performing several different logical checks across your data, and checking page relationships. I don’t mean like a relationship counselor, like why are mommy and daddy fighting; I mean it checks that your pages are actually where they belong. It checks page headers to make sure no one went and did graffiti on them, it performs a bunch more logical checks, it checks metadata, and then, if you ask really, really nicely, it will check indexed views and XML indexes as well, to make sure that everything is copacetic in those places.
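Here’s a minimal sketch of what those checks look like when you kick them off yourself. The database name is just a placeholder, and the second command uses the EXTENDED_LOGICAL_CHECKS option, which is the “ask really nicely” part that covers indexed views and XML indexes:

    -- plain full check, suppressing the informational chatter
    DBCC CHECKDB (N'YourVLDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;

    -- deeper (and slower) run that also validates indexed views and XML indexes
    DBCC CHECKDB (N'YourVLDB') WITH EXTENDED_LOGICAL_CHECKS, NO_INFOMSGS;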
A lot of people think that it will cause blocking, but the database snapshot, the transactionally consistent snapshot that it takes, avoids that. You can specify the TABLOCK hint to override that behavior if you want, but that’s generally a bad idea, especially with a very large database, you know what I mean? It’s big, you don’t want that to get locked down. It is, though, a resource intensive procedure, because it reads everything from disk. It uses a bunch of CPU, it can use a bunch of tempdb, and because everything is read from disk and because you have a very large database, the chances are that very large database outpaces the amount of RAM that you have on your server. If you have a terabyte database and 512GB of RAM, guess what doesn’t fit into RAM. If you have other databases on that server, guess what doesn’t fit into RAM. You guessed it: much of anything.
Speaking of resources, from this perspective you will also hit your SAN extra hard. Unless your SAN and drives are super fast, like SSDs, and you have the kind of pathing that it takes to really chug data onto a server, keep in mind that 8Gb Fibre Channel has a rough limit of about 850MB a second. If you’re running CHECKDB from a bunch of different servers at once, or some other servers are doing backups and this and that and the other thing, you’re going to saturate those Fibre Channel connections pretty quickly. So CHECKDB, especially for very large databases, should be pretty isolated if you can manage it.
It also kind of depends on whether you have enough memory to handle your very large database being read alongside other objects. If you’re on Enterprise Edition, it can go parallel, which, god, I hope that if you’re hosting very large databases, you’re on Enterprise. You have to have tempdb configured really well: we’re talking multiple data files, four or eight data files, sometimes more, sitting on its own drives with SSD or flash behind it. You use trace flags 1117 and 1118, either by enabling them or by being on SQL Server 2016 where that behavior is the default. Or, best of all, you have a server to offload checks to. That is really the best way to do it, because you save your production box so much god-awful work. Just make sure that you have instant file initialization enabled over there, otherwise restoring those very large databases is going to take an awful, awful long time. Trust me, I’ve been through that; I made that mistake very early on in my career.
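If you do go the trace flag route on an older version, here’s a sketch of what that looks like. Keep in mind DBCC TRACEON doesn’t survive a restart, so most people also add these as -T startup parameters:

    -- 1117: grow all files in a filegroup together; 1118: use full extents for new allocations
    -- (this is built-in behavior on SQL Server 2016+, so skip it there)
    DBCC TRACEON (1117, 1118, -1);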
You should try to run DBCC CHECKDB as often as possible. You want to shoot for once a week, but the size of very large databases makes it even more important to catch corruption at the same time that it makes it difficult to fit those checks in. If you find corruption, you have to figure out how long it’s been there and whether your backups are corrupt. If you have a data retention policy, checks really do need to occur about twice as often as you’re getting rid of data. So for instance, if you have a two-week retention policy, that means running DBCC CHECKDB weekly, because if you only run a DBCC check every three weeks and you’re deleting data every two weeks, there’s a gap where corruption can age out of your backups before you’ve ever checked for it.
With very large databases, it is definitely a good idea to have a separate RPO and RTO agreement for corruption, because it’s often much harder to recover from. Even just restoring a very large database can take much, much longer than restoring some middle-of-the-road 25 or 50GB database. If you have to restore a terabyte of data, that can be time-consuming.
Diagnosing corruption, a.k.a. running DBCC CHECKDB, determining the repair method that you need, checking backups, all that stuff takes time. This makes frequent checks all the more necessary. If you’re using maintenance plans, stop. It’s not good to use maintenance plans to run CHECKDB on very large databases, because they don’t expose the more granular options for running checks. We’ll cover this a little later, but you’re better off using something different and better, something still totally free, like Ola Hallengren’s scripts. They are awesome, they are free, and they also handle backups and index and statistics maintenance if you’re in the market for that sort of thing.
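For reference, here’s roughly what a check scheduled through Ola Hallengren’s DatabaseIntegrityCheck procedure looks like. This is a sketch based on current versions of his scripts, so double-check the parameter names against the documentation at ola.hallengren.com:

    EXECUTE dbo.DatabaseIntegrityCheck
        @Databases     = 'USER_DATABASES',  -- or a specific database name
        @CheckCommands = 'CHECKDB',         -- full check; can also be CHECKALLOC, CHECKTABLE, CHECKCATALOG
        @LogToTable    = 'Y';               -- keep command history in dbo.CommandLog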
You can run these jobs as is, just like you would with a maintenance plan, but you’re going to want to make sure those jobs write output logs, because if CHECKDB throws an error, it might be a really long error with a really big database. It might have found a whole bunch of corruption. CHECKDB doesn’t go, I found a little corruption, I’m going to email you now. I found a little bit more, I’m going to email you. It waits until it checks the whole thing and then it just sends you this laundry list of errors, like, go ahead, fix that now. And you’re like, I was going to, but now I’m going to become a farmer, because I don’t have to deal with that. I would just have to deal with anthrax, and that’s easier.
You probably can’t afford to re-run DBCC CHECKDB on a very large database. If you have terabytes of data, that check is long and expensive. You can take the output of those job files and send it off to Microsoft for support, or you can send it to some shady internet forum, you know, FixCorruption.ru, and for the low, low price of, say, 150 bitcoins, buy an EXE file that won’t be ransomware at all and run it on your production box to fix corruption. Trust me, it’s totally legit.
Offloading DBCC CHECKDB is usually the smartest thing to do when you have very large databases. It does require restoring a full backup, or, if you’re really nifty with those VLDBs, you could mount a SAN snapshot, and that’s crazy fast. Taking a full backup and restoring it with SQL native backups kind of sucks when you get up to databases that are terabyte-plus in size; SAN snapshots are really an awesome way to go. Running checks on another server does make things a bit more complicated, and it does require another server, but it keeps resources on your production server dedicated to end user tasks and internal processing tasks, which is usually a much better use of them. And when you relay this sort of stuff to the business, they’ll often be like, you know, we just happen to have this extra bit of hardware money lying around that we could do something with.
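At its simplest, the offload pattern is just restore-then-check on the other server. Here’s a hedged sketch; the paths, backup file name, and logical file names are all placeholders you’d swap for your own:

    -- on the offload server: restore last night's full backup
    RESTORE DATABASE YourVLDB
        FROM DISK = N'\\backupshare\YourVLDB_FULL.bak'
        WITH MOVE N'YourVLDB'     TO N'D:\Data\YourVLDB.mdf',
             MOVE N'YourVLDB_log' TO N'L:\Log\YourVLDB_log.ldf',
             STATS = 5;

    -- then run the full check there, where it can't hurt production
    DBCC CHECKDB (N'YourVLDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;

One nice side effect: this also proves your backups actually restore, which is a test a lot of shops never get around to.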
Before I tell you about what doesn’t cut it: third party tools like Quest LiteSpeed offer automation for this sort of task. They offer the ability to back up, restore, run CHECKDB and, assuming there’s no corruption, drop that database off the server and restore the next database in the next wave of databases.
What doesn’t cut it: checking AG replicas, even though you have to check those anyway, because checking an AG replica doesn’t tell you if the primary is corrupt. Checking snapshots of mirrored databases, because that’s a different copy of the database at that point. Checking log shipped databases, because that’s just a bunch of transaction logs applied over a really old full backup of your database. Checking SAN replication, because that’s not a bit-for-bit copy of your current database, it’s just differences. And of course SQL replication, which is useless and I hate it with a burning passion. Just kidding, it’s great for some stuff.
If you want to run things faster, say if you can’t offload your checks and you have limited maintenance windows, you can run DBCC CHECKDB with the PHYSICAL_ONLY option. This skips some logical checks of your data, stuff like checking computed columns and making sure that data types are correct. Little things, you know, who would care about that? It’s not as complete as a full DBCC CHECKDB but it is faster, and it’s still a good check. It’s still a good starting place, but you will still eventually need to run a full CHECKDB somewhere. Trace flags 2562 and 2549 can help when you run PHYSICAL_ONLY. 2562 treats the whole check as one big batch, which does consume more tempdb, and it’s usually not necessary on SQL Server 2012 and later. It pains me that in the year 2017 I still have to qualify that if you’re on a version earlier than 2012 you might need these trace flags, but here we are nonetheless. 2549 I’ve had very bad luck getting to show any improvement, but apparently it’s good for some people, so you might want to give it a shot; it’s mostly here for completeness, not because I’m saying it’s awesome. 2549 makes CHECKDB assume that every database file is on its own drive or spindles, even if they’re on the same drive letter, which is helpful for people who use mount points and multiple data files on their fancy SAN. But like I said, PHYSICAL_ONLY is not 100%. You still have to run a full CHECKDB sometimes. Where that happens and how often that happens is up to you and the business to decide.
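Here’s what that faster-but-shallower check looks like, with the trace flags from this section thrown in as an illustration. Treat it as a sketch, and remember a full CHECKDB still has to happen somewhere, sometime:

    -- optional helpers for PHYSICAL_ONLY runs (see the caveats above)
    DBCC TRACEON (2562, 2549, -1);

    -- physical-only check: much faster, skips the deeper logical checks
    DBCC CHECKDB (N'YourVLDB') WITH PHYSICAL_ONLY, NO_INFOMSGS;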
This kind of conversation may push them to provision a server for offloading, or it may buy you some more time during regular maintenance windows to perform longer checks. It may also be good for starting a conversation. Say you have regional databases on a server, say you have America, Europe, APAC, whatever else on there, and you can’t run CHECKDB on America’s database when their users aren’t connected because that’s when APAC’s users are connected, and so on and so forth. Like that snake game where it eats its own butt. This could spur the business to split those regional databases out onto their own servers so that you have more time and more room for maintenance on them.
Enterprise Edition, of course. Again, if you have very large databases, you’d better be on Enterprise Edition. DBCC CHECKDB can run in parallel when you use Enterprise. It obeys the server’s MAXDOP settings unless you go crazy, absolutely bat-dung crazy (what is it, guano crazy?), and set up a separate user and Resource Governor workload group just for running DBCC CHECKDB. I know, right? Resource Governor.
If you’re on 2016 plus, which I don’t feel so bad saying because it’s only 2017, you can avoid the whole Resource Governor shenanigans by specifying the degree of parallelism with the DBCC CHECKDB command. You can set MAXDOP right there in 2016 and up. How awesome is that? You’ve been able to do it for index rebuilds for like a decade, and now CHECKDB finally has it.
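The syntax for that, on versions that support it (2016 and up, per the video), is about as simple as it gets. The database name and MAXDOP value here are just placeholders:

    DBCC CHECKDB (N'YourVLDB') WITH MAXDOP = 4, NO_INFOMSGS;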
Running Enterprise Edition also hopefully gives you the ability to pack enough RAM into a server so that memory isn’t too much of an awful bottleneck. Something that you’re going to have to get used to when you’re running DBCC CHECKDB on a very large database is breaking checks up into smaller bits. There are three main components that make up a full DBCC CHECKDB: DBCC CHECKCATALOG, DBCC CHECKALLOC, and DBCC CHECKTABLE for each table and indexed view. If you’re using filegroups, you can use DBCC CHECKFILEGROUP as well. Ola Hallengren’s scripts make this a little bit easier. You can specify a full DBCC CHECKDB, separate commands, or groups of commands. So if you want to run just CHECKTABLE, you can run just CHECKTABLE. If you want to run CHECKALLOC and CHECKCATALOG together, you can do that.
The basic premise here is that over the course of a week, you run a full CHECKDB in different parts. So you run CHECKCATALOG, you run CHECKALLOC, you run CHECKTABLE across one bucket of tables, then the next time you run CHECKTABLE on the next bucket of tables, and you just keep cycling through and doing that over and over again. That can cut down on the amount of time each night that you spend on corruption checks. It does open you up to having corruption linger for longer, though.
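A hedged sketch of what one of those weekly rotations might look like. The table names are made up, and the bucket boundaries are entirely up to you:

    -- night 1: allocation and catalog checks (usually the quick part)
    DBCC CHECKALLOC (N'YourVLDB') WITH NO_INFOMSGS;
    DBCC CHECKCATALOG (N'YourVLDB') WITH NO_INFOMSGS;

    -- night 2: first bucket of tables (CHECKTABLE runs in the database's context)
    USE YourVLDB;
    DBCC CHECKTABLE (N'dbo.Orders')    WITH NO_INFOMSGS;
    DBCC CHECKTABLE (N'dbo.Customers') WITH NO_INFOMSGS;

    -- night 3: the next bucket, and so on until everything has been touched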
If it does take you a full week to get through CHECKTABLE for every table and every indexed view, and you have a bunch of large tables in there, well: if you run CHECKTABLE on very large table number one on Monday and on Tuesday it becomes corrupted, then you run CHECKTABLE on everything else and eventually that snake comes around and eats its butt again, and you find the corruption a week later on Monday. It’s better than not finding it at all, but it’s still pretty rough.
All that stuff is good because you can run those checks on a rotating schedule, which makes it more complicated, but if your very large database is like most other very large databases, it’s typically one or two tables that make up the majority of that space. You might have a table for, say, auditing, or a main table where users focus their queries that has, I don’t know, a bunch of logged data in it. Think of a big, fat table; these can be hundreds of gigs, if not terabytes, of data on their own. Checks of the other, smaller tables can run separately from checks of the bigger ones, so you can save those bigger ones for maybe weekends, when you have more time on your hands for maintenance windows.
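If you want to see which tables those giants are, a quick size query against the DMVs will tell you, and then the big ones can get their own CHECKTABLE schedule. This is a sketch; the audit table name is hypothetical:

    -- which tables are hogging the space?
    SELECT TOP (5)
           s.name + N'.' + t.name                        AS table_name,
           SUM(ps.reserved_page_count) * 8 / 1024 / 1024 AS reserved_gb
    FROM sys.dm_db_partition_stats AS ps
    JOIN sys.tables  AS t ON t.object_id = ps.object_id
    JOIN sys.schemas AS s ON s.schema_id = t.schema_id
    GROUP BY s.name, t.name
    ORDER BY reserved_gb DESC;

    -- save the monsters for the weekend window
    DBCC CHECKTABLE (N'dbo.AuditLog') WITH NO_INFOMSGS, ALL_ERRORMSGS;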
If all of this is starting to sound too complicated, you really need to talk to the business and have them give you a server to offload CHECKDB to, or find a job somewhere that will, because seriously, losing a very large database is a lot different from losing a 5, 10, 20 GB database. If you ever lose terabytes of data, well, that’s a major shock to the confidence of your users and customers. They probably don’t want to deal with a company that loses terabytes of data.
CHECKDB of course has some helpers. These don’t absolve you from having to run DBCC CHECKDB, but they do give you a heads up on some forms of corruption in between runs of CHECKDB. Alerts for errors 823, 824 and 825 cover hard and soft I/O errors. These get thrown if you have page verification set to CHECKSUM on your databases: when pages are written to disk, they have a checksum assigned to them; when they are read from disk, that checksum is read and verified, and if there’s a problem with it, SQL Server throws one of those errors at you. You are now aware that something has become corrupt. You can take your backups with CHECKSUM as well, but this also relies on page verification. It checks data pages as they’re written to the backup file, and the backup will throw an error if it finds a problem.
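Setting those helpers up is a one-time job. Here’s a hedged sketch: one alert shown for error 823 (repeat the same pattern for 824 and 825), an operator name that’s purely a placeholder, plus the page verification and backup checksum pieces alongside it:

    -- make sure pages get checksums written with them
    ALTER DATABASE YourVLDB SET PAGE_VERIFY CHECKSUM;

    -- alert on hard I/O errors; clone this for 824 and 825
    EXEC msdb.dbo.sp_add_alert
        @name = N'Error 823 - Hard I/O error',
        @message_id = 823,
        @severity = 0,
        @include_event_description_in = 1;
    EXEC msdb.dbo.sp_add_notification
        @alert_name = N'Error 823 - Hard I/O error',
        @operator_name = N'The DBA Team',   -- assumed operator; create your own first
        @notification_method = 1;           -- 1 = email

    -- verify page checksums as the backup is written
    BACKUP DATABASE YourVLDB
        TO DISK = N'\\backupshare\YourVLDB_FULL.bak'
        WITH CHECKSUM, STATS = 5;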
Now, if you use a third party backup tool that doesn’t allow you to alter the backup command, or doesn’t give you a box to click to take your backups with checksums, you can turn on trace flag 3023. That will force all your backups to occur with checksums, so you don’t have to worry about your negligent backup software vendor not letting you take them.
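If you need it, enabling that globally looks like this. Since DBCC TRACEON won’t persist through a restart, the usual approach is also adding -T3023 as a startup parameter:

    -- force CHECKSUM on every backup, regardless of what the backup tool asks for
    DBCC TRACEON (3023, -1);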
Corruption errors are scary enough on a regular-sized database. On a very large database, you really just get more errors, more stuff usually goes bad, and the damage is usually more widespread. One weekend you run CHECKDB and say you get an email that looks something like this from your job failure email – I said email too many times there. But look at this mess. What does this mean? Object ID 3, index ID 1, partition ID 1990-something-08 – what is it? How important is it? Figuring all that stuff out stinks. It’s not fun, especially if it’s in the middle of the night, especially if there’s someone angry calling you saying users can’t do anything, and you’re like, I’m drunk, what do you want. It’s not fun, it’s not good, not a good place to be, and not something you want to be facing on your own.
If you’re a DBA by yourself and you’ve never dealt with corruption, you should just call Microsoft. For $500, assuming that you’re in Microsoft’s support good graces and you’ve paid all your bills, they’re not going to poke around and find out that you’re running Developer Edition on your production server. They will sit there and help you read those error messages, figure out which pages are corrupt, figure out if they’re system pages and tell you to get the hell off this server or start migrating data off to another server, figure out if they’re clustered indexes and tell you to maybe find a backup that works, or non-clustered indexes, or walk you through a complicated restore process to replace single pages in a database, which you can do. It’s crazy and it’s awesome, but it’s hard. Hard work.
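That single-page restore he mentions really does exist, and it’s worth knowing the shape of it before you need it at 3am. This is only a sketch; the page ID, paths, and backup names are placeholders, and you’d still want Microsoft or the documentation walking you through it on a real outage:

    -- 1) restore just the damaged page from the last good full backup
    RESTORE DATABASE YourVLDB
        PAGE = '1:12345'   -- file_id:page_id from the error message or CHECKDB output
        FROM DISK = N'\\backupshare\YourVLDB_FULL.bak'
        WITH NORECOVERY;

    -- 2) restore each subsequent log backup WITH NORECOVERY (not shown)

    -- 3) back up the tail of the log, restore it, and bring the database online
    BACKUP LOG YourVLDB TO DISK = N'\\backupshare\YourVLDB_tail.trn';
    RESTORE LOG YourVLDB FROM DISK = N'\\backupshare\YourVLDB_tail.trn' WITH RECOVERY;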
Even worse, at the end of almost every DBCC CHECKDB error list I have ever seen has been a message that says something along the lines of, we can only fix this if you run repair with allow data loss. This command doesn’t tell you how much data you’re going to lose, which data you’re going to lose, or where you’re going to lose it from. It just says, okay look, you’re going to lose some of this stuff when we fix it. And it gets worse. After you run repair with allow data loss, you have to run DBCC CHECKDB again on your very large database. And it could come back and say, we found more corruption, you have to run repair with allow data loss again to fix this corruption. You may have to iterate over this step several times before repair with allow data loss repairs all your corruption, and it may not be able to repair all of it. There are known cases where repair with allow data loss has not been able to fix things. Then you just have to get your data off that server: restore from backup, figure out how to get other stuff over there, and if there are parent and child tables it’s a nightmare, really. Call Microsoft to make this as painless as possible.
Brent wrote a great post (great post, Brent, if you’re listening) all about what to do when DBCC CHECKDB reports corruption: how to split people up into groups to work on it, what they should work on, how you should talk to management and end users, and the frequency of those emails. Really good, solid advice, stuff like that. Tara and I contributed some okay stuff to the end, but this is a good starting point if you have found corruption in your database. Messing up corruption – I’m not saying messing up fixing corruption, but not checking for corruption, being completely negligent in DBCC checks – is often a resume generating event. No one wants to hire a DBA that botched a run-in with corruption. If you want to get some practice fixing corrupt databases, another consultant named Steve Stedman published, a couple years ago, I think for 10 or 12 weeks, a new corrupt database every week and challenged members of the SQL community to fix them to the best of their ability. Brent won I think the first one of those. Great job, Brent. When he put this challenge out there, I had pretty much just started here, so I was really busy learning how to consult and didn’t really have time to go fixing corrupt databases, but some day maybe I’ll go back and revisit it. Anyway, it’s a good learning exercise for you.
So what causes corruption? We talked about what DBCC CHECKDB is and what it does and how often to run it and ways to run it and all sort of neat, nifty stuff that can really help you out, but what causes corruption? Well, usually it’s a disk problem. It could be a bad sector or a whole disk or a whole pool of disks if you’re on a SAN. It could be a less than graceful server shut down, or a less than graceful failover, could be a dirty snapshot involved in something. Every once in a while, Microsoft causes it.
Back in 2012 – well, for SQL Server 2012 and for SQL Server 2014 – there was a nasty bug. That nasty bug caused indexes to become corrupt if they were rebuilt online and in parallel while there was an update or some other data modification going on. The end result was a lot of corrupted clustered indexes – sorry, non-clustered indexes. The workaround was to run those index rebuilds with MAXDOP 1. Now, if you have a clustered index in a very large database on a very large table, say that table is 300GB, 500GB, somewhere in there, is running a MAXDOP 1 clustered index rebuild a good time? I don’t think so. I don’t think so at all. I wouldn’t do that.
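For context, the workaround he’s describing is just forcing the rebuild to run single-threaded, something like this. The table name is assumed, and ONLINE rebuilds are an Enterprise Edition feature:

    ALTER INDEX ALL ON dbo.BigTable
        REBUILD WITH (ONLINE = ON, MAXDOP = 1);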
This is another one of those reasons why I threw index rebuilds by the wayside on my very large databases: the amount of effort and resource consumption to run them was obscene in comparison to the benefit gained by running them. So I just flipped everything over to updating statistics, and users never complained or said a word, and I never had to worry about that bug corrupting my databases. I had to worry for other reasons, but that was not one of them.
Anyway, to recap, because I’ve kept you here long enough: checking for corruption is super important. Corruption is becoming less common as disks become more reliable, but it can still happen because of disks and Microsoft bugs and failovers and shutdowns and dirty snapshots and all sorts of other reasons. Running DBCC CHECKDB will find that corruption for you, but it’s still up to you to schedule it and decide on the frequency. With a very large database, your choices become even more complicated as to which subroutines you’re going to run when – remember, CHECKTABLE, CHECKFILEGROUP, CHECKALLOC, CHECKCATALOG – which nights you do those on, which nights the big tables happen on, which nights you do this, that and the other thing. It becomes a whole lot more complicated, but you still have to do it. Generally, like I said before, the preference here would be to offload those checks to a different server, or different servers, where you can run CHECKDB in one big get-up-and-go and not have to worry about all the complicated scheduling, and maybe I missed something, and what do I check next, and did I get corruption on Tuesday that I didn’t find until the next Tuesday. Just offload. Talk to management, talk to the business stakeholders, and tell them why it’s absolutely necessary for databases of this size to be checked in isolation.
Make sure you help CHECKDB and yourself out by setting page verification to CHECKSUM and taking backups with CHECKSUM, along with enabling alerts for errors 823, 824 and 825. And again, if you find corruption, just call Microsoft. It’s so worth the money to have that sort of reassurance on the phone with you. Typically managers and business people are much more confident when you say, I’m on the phone with Microsoft, we’re going to fix it. Even if you don’t end up fixing it, you can say it was Microsoft’s fault we didn’t fix it. Blame them too, they’re a big company, what are they going to do? Ask for another $500? Maybe you get your $500 back. But the bottom line is, you need someone who’s dealt with corruption before, which Microsoft has; they own a product that becomes corrupt semi-frequently.
You may have discovered a new bug in SQL Server, like that bug, that needs to be fixed, or you may have stumbled across a known bug that just doesn’t have a public fix available yet, where you need to install a behind-the-scenes hotfix after you sign some new NDAs and all that stuff. Anyway, that’s sort of the ins and outs of running DBCC CHECKDB on a very large database. I hope you’ve learned some stuff, and I hope that I’ve helped you in your quest to become a better DBA, a corruption-free DBA. Thanks for watching.
Comments:
Noob DBA here and probably a Noob question. 🙂
Would you recommend running CHECKDB on a virtual machine, considering the VM is in a data center where they monitor and actively replace bad HDs before they cause an issue?
This question is mainly geared toward our DW environment. I spoke with the previous DBAs and they seem to believe that running a weekly CHECKDB is not important, due to how large the environment is and the run time. Offloading would not be an option, since our QA DW environment has less than half the resources of production; this would take forever to run.
Unfortunately, hard drives aren’t the only thing that causes corruption: SQL Server has plenty of known bugs where it’s caused storage corruption as well when it writes data to disk.