The instant you encounter corruption in a production SQL Server database, stop. Read this entire article first to understand the big picture, and then come back here to take action.
First, understand that you shouldn’t start by trying to repair the corruption. You may be facing a storage subsystem problem where your storage is corrupting the data underneath you. In a scenario like that, your corruption repair efforts – simply reading and writing data – may be making the situation worse. Your primary goal is going to be to communicate how widespread the corruption is, and get off the storage immediately. You just can’t know the root cause until you do a thorough investigation, and by then, it may be too late – the corruption may have spread to more areas in your database.
Now, let’s walk through the checklist:
1. Turn off your backup-delete jobs: you’re gonna need ’em. If you have any kind of job that automatically deletes backups older than a certain number of days, turn off that job. Depending on the type of corruption, you may have to restore a clean full backup from before the corruption occurred, plus all of the transaction log backups since.
2. Query the msdb.dbo.suspect_pages table. This system table tracks up to the last 1,000 corrupt pages detected. The query will respond instantly, so do this first rather than trying to check the database for corruption. This will give you a quick idea of how widespread the corruption might be. Given those results, send an email to your manager, team, and app stakeholders:
SQL Server (INSERT NAME HERE) just reported corruption in these databases: (LIST OF DATABASES). This indicates that it has storage problems. We need to fail over to another SQL Server immediately – come to my cube (OR ONLINE MEETING OR WHATEVER) to discuss our options. I’m going to have to drop everything and do emergency troubleshooting.
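The suspect_pages check from step 2 can be sketched like this. These are the table’s real columns; event_type values 1–3 indicate corruption errors, while 4, 5, and 7 mean the page was restored, repaired, or deallocated:

```sql
-- Quick check: up to the last 1,000 suspect pages SQL Server has detected.
SELECT DB_NAME(database_id) AS database_name,
       file_id,
       page_id,
       event_type,       -- 1-3 = corruption; 4 = restored; 5 = repaired; 7 = deallocated
       error_count,
       last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;
```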
At that point, look at your disaster recovery plan to figure out where you’re going to fail over (like a log shipping secondary). You’ll need to check that location for corruption as well, but if we’re dealing with widespread, multi-database corruption, it’s beyond the scope of what I can teach you to do in a blog post. You can keep reading for single-database corruption recovery options, and use these same tactics across multiple databases, but it’s going to be a ton of work.
3. If you’re using a SAN, alert the storage team. If your data and log files reside on shared storage, there may be a more widespread issue. Multiple databases or servers may be affected if the shared storage is having problems. Copy/paste this into an email:
SQL Server (INSERT NAME HERE) just reported corruption in a database that lives on the SAN. I don’t know yet whether this is due to a storage issue or a SQL Server bug. I’m dropping everything to do emergency troubleshooting, and I’ll let you know more in half an hour, but if you hear anything from other server admins about SAN issues, please let me know.
4. Alert the application owners and stakeholders. Copy/paste this into an email:
SQL Server (INSERT NAME HERE) just reported database corruption. I’m dropping everything to do emergency troubleshooting to find out if we’ve lost data, and how we’ll recover it. I’ll let you know more in half an hour, but in the meantime, I would recommend taking the applications offline. If more data is added after this point, I may not be able to recover it.
This sounds paranoid, but as an example, here’s a corruption case I had recently: shortly after detecting corruption, the team realized they’d have to revert to a backup of the database from a few days ago. Rather than telling users about that possibility, they let the users keep adding data into the already-corrupt database while the DBAs did troubleshooting. Several days later, as the corruption got worse, even Microsoft couldn’t repair the corruption – and the affected tables went completely offline, permanently. If the users had been alerted earlier, they could have avoided that additional data loss.
5. Review the history of your CHECKDB and backup jobs. Fill out these answers:
- When was the most recent clean CHECKDB? (Meaning, when did CHECKDB run against this database without errors?) ________
- Do we have a full backup available from prior to that date? (For example, if CHECKDB said this database was okay on January 15th at 10PM, what’s the latest backup we have BEFORE that date, like January 14th at 11PM?) ________
- Do we have all of the transaction log backups since that date, yes/no? ________
If you can pass all three of these, great news! You’re going to have an alternate recovery path. If any of these backups are not online (like if they’re on a slow virtual tape library, or have been sent offsite), get one of your Windows admins to start retrieving these backups and putting them on a very fast network share as quickly as possible. You may need to restore all of these backups in a subsequent step.
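If your backup history hasn’t been purged from msdb, a query like this sketch can answer the second and third questions quickly. The table and column names are real; the database name is a placeholder (type ‘D’ = full, ‘I’ = differential, ‘L’ = log):

```sql
-- List the most recent backups of this database, newest first,
-- including where each backup file was written.
SELECT bs.database_name,
       bs.type,                      -- D = full, I = differential, L = log
       bs.backup_finish_date,
       bmf.physical_device_name
FROM msdb.dbo.backupset AS bs
JOIN msdb.dbo.backupmediafamily AS bmf
  ON bs.media_set_id = bmf.media_set_id
WHERE bs.database_name = 'MyDB'      -- placeholder database name
ORDER BY bs.backup_finish_date DESC;
```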
6. If you have the full CHECKDB output, look for an easy fix. DO NOT RUN CHECKDB in this step – that can be a long process, and before you go down that road, I need you to do other stuff first. I’m just saying that if you do happen to have the CHECKDB output that shows the corruption, look at the index IDs involved. If the only corruption involves an index with an ID of 2 or higher, that means it’s only corruption in a nonclustered index, and you’ll be able to repair that relatively quickly. Here’s how:
- Run sp_BlitzIndex® to list out the index definitions on the affected table:
EXEC sp_BlitzIndex @DatabaseName = 'MyDB', @SchemaName = 'dbo', @TableName = 'MyTable'
- Find the corrupt index by its ID
- Copy out the CREATE INDEX statement from sp_BlitzIndex’s output
- Drop the index
- Recreate it using that statement
- To verify that no other corruption exists in the database, run:
DBCC CHECKDB('MyDB') WITH NO_INFOMSGS, ALL_ERRORMSGS
- If it comes back clean, you’re good: email the teams that things are OK, then jump to the Aftermath section at the end of this post
- If it doesn’t come back clean, keep going
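The drop-and-recreate fix from the list above looks something like this sketch. The index, table, and column names here are hypothetical placeholders; use the actual CREATE INDEX definition you copied out of sp_BlitzIndex’s output:

```sql
-- Hypothetical names; substitute the definition copied from sp_BlitzIndex.
DROP INDEX IX_MyTable_MyColumn ON dbo.MyTable;

CREATE NONCLUSTERED INDEX IX_MyTable_MyColumn
    ON dbo.MyTable (MyColumn)
    INCLUDE (OtherColumn);

-- Then verify no other corruption remains:
DBCC CHECKDB('MyDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```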
7. Fork your recovery efforts into multiple teams. Grab other members of your team and assign them tasks that will be done in parallel:
- Team 1: Restore clean backups to another server – this team grabs a different SQL Server of the same version as production. For example, if production is a SQL 2012 box, grab a development server that’s also SQL 2012. Restore the full and log backups from the dates specified in step 5. After it finishes, run DBCC CHECKDB to make sure there’s no corruption in this copy. This may take a long time, but if it works, you’ve got an excellent plan B.
- Team 2: Open a support call with Microsoft. If you don’t have a support agreement with Microsoft, grab a credit card and call professional support. It’s about $500, and they work the problem with you until it’s done. Don’t be afraid to say, “We’re not making progress fast enough – I’d like to escalate this call to the next support level.” (But at the same time, you’d better be doing what they tell you to do.)
- Team 3: Run CHECKDB if you don’t have the output. If the corruption was detected by an end user’s query or a monitoring system alert, run this, save the output to a text file, and then analyze the output with Gail Shaw’s corruption recovery instructions. Depending on the database’s size, this can take hours:
DBCC CHECKDB('MyDB') WITH NO_INFOMSGS, ALL_ERRORMSGS
- Team 4: If you don’t have a clean backup, restore the most recent backups to yet another server. If you’re going to try your own attempts at corruption repair, you’re going to want another copy of the corrupt database on a different server where you can test first before trying in production. As soon as that restore finishes, if the other teams aren’t ready to try their corruption repair commands yet, create a new database on this same server, and start copying as many tables as you can from the corrupt database into the new one. You want to be able to salvage as many tables as you can.
- Team 5: Contact a corruption specialist. If things are looking bad, and management wants another option, contact one of the following to start a consulting engagement:
- Kroll Ontrack – data recovery company
- SQLskills.com – SQL Server consulting shop with Paul Randal, former Microsoft employee who worked on CHECKDB
- StedmanSolutions.com – Steve Stedman specializes in database corruption recovery.
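Team 1’s restore sequence from step 5’s backups can be sketched like this. The file paths, logical file names, and database name are all placeholders for your environment:

```sql
-- Restore the last clean full backup, leaving the database ready for log restores.
RESTORE DATABASE MyDB_Restore
    FROM DISK = N'\\FastShare\Backups\MyDB_Full.bak'
    WITH MOVE 'MyDB'     TO N'D:\Data\MyDB_Restore.mdf',
         MOVE 'MyDB_log' TO N'L:\Logs\MyDB_Restore.ldf',
         NORECOVERY;

-- Apply each transaction log backup in order...
RESTORE LOG MyDB_Restore
    FROM DISK = N'\\FastShare\Backups\MyDB_Log1.trn'
    WITH NORECOVERY;

-- ...and bring the database online with the final one:
RESTORE LOG MyDB_Restore
    FROM DISK = N'\\FastShare\Backups\MyDB_LogN.trn'
    WITH RECOVERY;

-- Prove the restored copy is clean before you bet the business on it.
DBCC CHECKDB('MyDB_Restore') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```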
8. Send out a status update as work starts. As soon as the above team task assignments are done – not when the work stops, but when it starts – it’s time to communicate your status to your manager, team members, and the application stakeholders.
Your update email will be based on what you’ve learned so far, but here’s an example:
Update on the corruption issue: we have identified that the corruption is confined to the SalesSite database, and it occurred at some point after 2016/05/04. We have split up our recovery efforts into three teams. Based on a quick assessment, we believe Team 1 will be the first to finish, and we hope to have a clean copy of the database up on DevSQL2012. If we have to fail over to that, we’ll have a one-hour outage. We’ll update you in another half-hour as we know more.
9. Keep working the plan. Don’t panic – let each of the teams do their thing independently. Keep doing check-ins across teams every 30 minutes, see which team is the closest to a solution the business is comfortable with, and keep the end users informed.
The Aftermath: Avoiding Corruption Next Time
Turn your backup-cleanup job back on. (Turning it off was step #1 above.)
As soon as practical, get on the most current service pack and cumulative update for your SQL Server version. You may have hit a bug like the notorious online index rebuild bug or the NOLOCK UPDATE bug, and patching will reduce the chances that you hit that issue again.
If you can’t rule out a SQL Server bug as the cause of the corruption, work with your storage teams to make sure your storage gear is patched up to its most recent firmware and driver versions. Check to see if there were any storage errors over the last several weeks – not just in the Windows event logs, but in the storage gear’s logs as well.
Once you’ve hit corruption on a server, consider taking a few proactive measures:
- Run CHECKDB more often, like every night (which sounds crazy – until you hit a corruption scenario)
- Keep your backups longer – if you’re only running CHECKDB every week, you need at least a week’s worth of full and transaction log backups so that you can recover via restores
- Run transaction log backups more frequently, like every minute
- Put your RPO and RTO in writing – the business always assumes databases can’t lose data, and you assume that the business understands technology breaks sometimes. Get everybody on the same page.
- Based on your RPO and RTO, revisit how many databases you have on each server. If your hardware isn’t able to keep up with backups and corruption checks, it’s time to up your horsepower or switch to snapshot backups.
Erik says: Brent is spot on with how to react to corruption scenarios. I’m all about not being past the point of no return in the first place. Whether it’s:
- Making sure I have email alerts set up for 823, 824, and 825 errors that catch hard and soft I/O issues
- Running DBCC CHECKDB often enough
- Having Page Verification turned on
- Taking my backups with CHECKSUM (or using trace flag 3023 if I can’t do it via my backup software)
- Making sure my backup routines at minimum do a restore with VERIFYONLY to assess basic usability, or even better, a restore of my FULL backups to another server
- Finally, using technology like Mirroring or Availability Groups, which have a feature called Automatic Page Repair that can fix some page corruption scenarios
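The 823/824/825 alerts at the top of Erik’s list can be sketched with SQL Server Agent’s documented stored procedures. This assumes Database Mail is configured and an operator named ‘DBA Team’ already exists, which are placeholders for your setup:

```sql
-- Alert on error 823 (hard I/O error) and email the DBA team.
EXEC msdb.dbo.sp_add_alert
    @name = N'Error 823 - Hard I/O Error',
    @message_id = 823,
    @severity = 0,
    @include_event_description_in = 1;

EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Error 823 - Hard I/O Error',
    @operator_name = N'DBA Team',   -- assumed existing operator
    @notification_method = 1;       -- 1 = email

-- Repeat the same pair of calls for 824 (soft I/O) and 825 (read-retry).
```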
sp_Blitz® can help you find all sorts of dangers and inadequacies with your server setup. Run it on all your servers. You may be surprised what you find.
Tara says: I’ve encountered corruption 3 times in my career. The last event was the most memorable, and I don’t mean that as a good thing. If we had set up alerts for the 823, 824, and 825 errors as Erik mentioned, we would have been notified of the corruption sooner and had less data loss. We implemented the alerts through SCOM after the event, but we should have also created SQL Server Agent alerts for those errors as SCOM wasn’t always reliable (your mileage may vary). Reacting to corruption quickly is very important.
Great post! I’ve printed it out to read more in depth at a later date (i.e. when I am not at work and getting interrupted). When I took over the SQL Servers at my job, I was fortunate enough to come across several blog posts relating to agent alerts, checksum on backups, and daily restores and CHECKDBs of backups, and implemented them just for this very thing. Our previous philosophy was just install SQL, get the app going, and set up maintenance plans for backups with no planned testing or alerting. Your blog happens to be where I get a ton of my info and ideas to improve.
Excellent post Brent. I’m adding this to my toolbelt for my first DBA gig. Hopefully I never have to use it!
For any SQL Server DBA – this is your definitive go-to corruption checklist. A few things on that list that I’d not thought of. Thanks Brent!
Great post. One other idea I’d add would be to run through some mock scenarios so that you have some practice with this before you have to do it for real. One method would be to simulate corruption (e.g. take db offline, make edit with hex editor, bring back online or use DBCC WRITEPAGE). Another way to practice would be to do it as a table top exercise where someone lays out the situation and then the teams work through the steps that they would take.
Excellent and concise post. I might change the email wording of “I’m going to drop everything” – that could be interpreted badly 😀
Back up the log, if you can, after detecting corruption and stopping production! I REALLY like replaying log files, unless the corruption hits the log file, or a SQL Server bug hit you…
Did I mention backing up the log?
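The tail-of-the-log backup this commenter is describing can be sketched like this. NO_TRUNCATE lets you back up the log even when the database’s data files are damaged or offline; the path is a placeholder:

```sql
-- Capture the tail of the log before doing anything else to the database,
-- so you can replay activity right up to the moment you stopped production.
BACKUP LOG MyDB
    TO DISK = N'\\FastShare\Backups\MyDB_Tail.trn'
    WITH NO_TRUNCATE, INIT;
```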
It’s not a case of IF, it’s just a matter of WHEN! Also, a very timely post, and thanks for the Checksum trace flag – for those shops forced to use things like Avamar (urggghh!) this is an invaluable trace flag. See, I’ve been swimming in this SQL pool a very long time and there are still things to learn.
Tops Brent !!
Very good list & the points added by Erik are perhaps most important of all.
Very telling that even someone as experienced as Tara has only seen corruption 3 times – open a support case with MS and be clear that it is a data corruption case, the escalation engineers that deal with corruption get cases every week, year after year, as a former SQL PFE I know that’s where I would turn for help.
“we may be facing data loss, and in order to find out, I’m going to have to drop everything”
This made my day. 😀
Good post – two comments, in your communication to Application Owners/Stakeholders or people that don’t know the SQL Server estate the “SQL Server (INSERT NAME HERE)” won’t mean anything .. it might be worth saying “The SQL Server that runs ..”
… also perhaps worth mentioning the region as well so that they know if it is a production/UAT/QA/DR/Dev server …. this will affect the severity of the issue.
Bear in mind that Test/QA is not production to your clients, but it might feel like production to your developers and you could have a team of devs unable to work whilst you work the problem.
Excellent advice. I’ve been doing Checksums on my backups and verifies on my restores for ages, along with nightly DBCCs. I run a batch job to filter the DBCCs so that I only see lines with the word ERROR in them, then I email those to me every morning. If an error ever pops up, I have the full DBCC to refer to for deeper analysis.
A question for you, Brent: DBCC before or after database backup? What are your thoughts? This has been a subject of debate for a LONG time and I’ve never seen a good argument one way or another.
One additional point — the use of the REPAIR_ALLOW_DATA_LOSS option. This should always be the absolute last option, as your data is going to fall into the bit bucket and will never be seen again.
Yes, regarding the DBCC before/after, I’d also like to know the answer to that. Logically I’d say before, so you don’t waste time backing up a corrupt database, and send an alert instead.
Wayne – I want CHECKDB to run after a full backup. If it succeeds, I know the backup was good.
If I run CHECKDB before the backup, I don’t know for sure that corruption didn’t strike after CHECKDB completed, and before the backup completed. (I know it sounds paranoid, but every minute counts when you’re trying to recover, and you’ve only got the time to restore one full backup.)
I use a restore strategy (as opposed to the backup strategy). I restore every full backup and I always run CHECKDB on them. If my latest backup contains corruption that requires data loss, my previous backup is still available (and clean, because I checked it too). I also restore log backups, and run checks on them.
If or when I detect a full backup is corrupted (and CHECKDB’s repair level will allow data loss), I will try to backup the corrupted database (as the option of last resort), and I will back up its tail (as my potential saving grace). I will then restore the prior (clean) full backup, every log backup, and that tail. Assuming the powers that be allow me time to do this right, I will run CHECKDB on that restored chain. If this chain’s CHECKDB is clean, I am back in production without data loss.
At the same time that I am restoring the chain (and ultimately running CHECKDB), I will restore the backup that contains corruption on my DR server, right next to my restored chain’s database. That way I can run cross-database DML in DR if needed, without a network wire between them.
In other words, no matter what happens I am backing up the database and moving to DR. I don’t care whether a database is corrupted or not when I back it up. All I care about is getting a database off that hardware, as quickly as possible.
EXCELLENT, EXCELLENT Blog Post!! I will create a fake DR and put this article into good use.
Thanks for this post, entertaining and extremely useful, which is just how you all roll! There is one thing that i cannot seem to find ANY references to scouring the internet, which could be due to not being possible.
Is there any way to look at any system views to determine which Clustered Index records with a specific key are associated with a given PAGE? I have found ways to accomplish this while the page is not corrupted… sys.fn_PhysLocFormatter(s.%%physloc%%). But are there any system views to look up a list of IDs (keys) that are associated with a given PAGE, without pulling the actual PAGE from the disk (in the event that the page is corrupted)?
Jeremiah – not directly, although you can back into some of it by using nonclustered indexes in some cases. For info, check out Steve Stedman’s Database Corruption Challenge.
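When the page in question is still readable, one common way to see which key values live on it is the undocumented (but widely used) DBCC PAGE command. This is a sketch with placeholder database, file, and page numbers; note it reads the page from disk, so it doesn’t help with the commenter’s corrupted-page case:

```sql
-- Send DBCC output to the client instead of the error log.
DBCC TRACEON(3604);

-- Dump database 'MyDB', file 1, page 1533696; print mode 3 shows row details,
-- including the clustering key values stored on the page.
DBCC PAGE('MyDB', 1, 1533696, 3);
```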
>> Query the msdb.dbo.suspect_pages table.
What happens if msdb is corrupted?
It’s like Inception, but harder to understand.
We are seeing a strange case, where the database has gone into suspect mode with the errors below:
Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 21:1533696; actual 0:0). It occurred during a read of page (21:1533696) in database ID 5 at offset 0x000002ece00000 in file ‘F:\Data05\MSSQL\MSSQL12.I01\MSSQL\DATA\***.ndf’. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
After running DBCC CHECKDB, we found many errors like these:
Page errors on the GAM, SGAM, or PFS pages prevent allocation integrity checks in database ID 5 pages from (21:1075704) to (21:1083791). See other errors for cause.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (21:1083792) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8998, Level 16, State 2, Line 1
Also, I see the date modified for the .ndf file has changed to 02/1/1980, which is strange. Unsure what could have caused this to happen. Is this a problem with the storage?
Raj – none of the steps in the blog post include posting the errors in the comments of the blog post 🙂
The SharePoint team changed the default protocol to TLS 1.2 on the servers. After that reboot of the servers, many of the databases have been in suspect mode. I have tried to bring them back to online mode using the code below.
ALTER DATABASE [TestDB] SET EMERGENCY;
ALTER DATABASE [TestDB] SET SINGLE_USER;
DBCC CHECKDB (N'TestDB', REPAIR_ALLOW_DATA_LOSS) WITH NO_INFOMSGS, ALL_ERRORMSGS;
But the databases keep coming back to the suspect mode. Is there any way to for my databases to not go back that way?
Allan, for Q&A, head over to https://dba.stackexchange.com/
I encountered a corruption error at a SQL Server version upgrade. I have a database on a SQL 2014 SP1 server and I am migrating to a SQL 2016 SP2 server. I do it using backup and restore. However, after I restore it on 2016 and run a DBCC CHECKDB or DBCC CHECKALLOC, it yields corruption on 2 PFS pages. The output of DBCC looks like this – Database error: Page (1:860785) is marked with the wrong type in PFS page (1:857328). PFS status 0x40 expected 0x60.

If I restore the same backup on a 2014 server and do a CHECKDB, I cannot see any corruption. Nor do the weekly integrity checks on the live database output any corruption. I have restored multiple backups of the same DB on multiple SQL 2016 servers, all with the same result. This issue only occurs on one specific database that is running on Compatibility Level 2008 and is enabled for CDC. The specific pages to which the PFS point are all of a table that is partitioned.

I have done multiple tests to try and find the reason for corruption at restore, and so far I have not found anything that works. Did you ever encounter corruption at SQL Server upgrade? Can it be a bug in the upgrade mechanism at restore?
Nela – for corruption errors, we do what’s described in the post, including opening a support case with Microsoft. Go ahead and follow those directions. Thanks!
Hi Brent. Thanks for your reply. I contacted MS and also asked around to see if someone encountered this issue as I found it particularly strange, and I got an answer from Paul Randal in the form of a blog post, perhaps it helps more people: https://www.sqlskills.com/blogs/paul/pfs-corruption-after-upgrading-from-sql-server-2014/
A monitor with data corruption error msgs on it is what keeps me up at night; forget Michael Myers. Thank you for this blog and all of your others. I was blessed to inherit a system where the previous DBA was a student of your work so CheckDB is run nightly after full backups and trans log every 10 min (not 1 just yet). I didn’t know about the i/o error msgs though so that’s next on the list. Keep up the good work 2016 Brent Ozar!
Another way to manually fix SQL server database corruption by using SQL Server Management Studio. Follow the below steps:
1. Launch Microsoft SQL Server Management Studio
2. Click on New Query Button (Below File menu)
3. Write SQL Scripts
EXEC sp_resetstatus [YourDatabaseName];
ALTER DATABASE [YourDatabaseName] SET EMERGENCY
DBCC CHECKDB ([YourDatabaseName], REPAIR_ALLOW_DATA_LOSS)
ALTER DATABASE [YourDatabaseName] SET SINGLE_USER WITH ROLLBACK IMMEDIATE
DBCC DATABASE ([YourDatabaseName], REPAIR_ALLOW_DATA_LOSS)
ALTER DATABASE [YourDatabaseName] SET MULTI_USER
4. Now click on Execute button
After executing all these commands you can repair SQL database.
That is truly terrible advice.
Thanks for this post and also the follow up comments from Erik and Tara. I had my first corruption yesterday after 7 years as a DBA, I felt I should have been more experienced in it before but hearing that Tara had only encountered it 3 times was reassuring.
What about recovering from a situation where:
1. some entries were added to suspect_pages months ago
2. no one noticed
3. every DBCC CHECKDB for the past month is clean (run twice weekly)
The suspect_pages table still has entries in it.
What do I do at this point, when clearly restoring from backup is not viable?
Try running CHECKDB with extended logical checks, and if it still passes, then you’re fine. It may have been a transient error with your storage. You can then delete the entries from the suspect_pages table.
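The re-check and cleanup described in that answer can be sketched like this; the database name is a placeholder:

```sql
-- Deeper consistency check, including extended logical checks on
-- indexed views, XML indexes, and spatial indexes.
DBCC CHECKDB('MyDB') WITH EXTENDED_LOGICAL_CHECKS, NO_INFOMSGS, ALL_ERRORMSGS;

-- If it comes back clean, clear the stale rows for that database.
-- suspect_pages is one of the few system tables Microsoft documents
-- as safe (and recommended) to delete from.
DELETE msdb.dbo.suspect_pages
WHERE database_id = DB_ID('MyDB')
  AND last_update_date < DATEADD(MONTH, -1, GETDATE());
```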