The instant you encounter corruption in a production SQL Server database, stop. Read this entire article first to understand the big picture, and then come back here to take action.
First, understand that you shouldn’t start by trying to repair the corruption. You may be facing a storage subsystem problem where your storage is corrupting the data underneath you. In a scenario like that, your corruption repair efforts – simply reading and writing data – may be making the situation worse. Your primary goal is going to be to communicate how widespread the corruption is, and get off the storage immediately. You just can’t know the root cause until you do thorough investigation, and by then, it may be too late – the corruption may have spread to more areas in your database.
Now, let’s walk through the checklist:
1. Turn off your backup-delete jobs: you’re gonna need ’em. If you have any kind of job that automatically deletes backups older than a certain number of days, turn off that job. Depending on the type of corruption, you may have to restore a clean full backup from before the corruption occurred, plus all of the transaction log backups since.
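If your cleanup runs as a SQL Server Agent job, disabling it is a one-liner with msdb's sp_update_job. (The job name below is a placeholder – substitute whatever your cleanup job is actually called.)

```sql
-- Disable the backup cleanup job so old backups stop getting deleted.
-- 'Backup Cleanup' is a placeholder name - use your actual job name.
EXEC msdb.dbo.sp_update_job
    @job_name = N'Backup Cleanup',
    @enabled  = 0;
```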
2. Query the msdb.dbo.suspect_pages table. This system table tracks up to the last 1,000 corrupt pages detected. The query will respond instantly, so do this first rather than trying to check the database for corruption. This will give you a quick idea of how widespread the corruption might be. Given those results, send an email to your manager, team, and app stakeholders:
SQL Server (INSERT NAME HERE) just reported corruption in these databases: (LIST OF DATABASES). This indicates that it has storage problems. We need to fail over to another SQL Server immediately – come to my cube (OR ONLINE MEETING OR WHATEVER) to discuss our options. I’m going to have to drop everything and do emergency troubleshooting.
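The suspect_pages check in step 2 can be as simple as this – the columns shown match the documented schema for the table:

```sql
-- List recently detected bad pages, newest first.
SELECT DB_NAME(database_id) AS database_name,
       file_id,
       page_id,
       event_type,   -- 1 = 823/824 error, 2 = bad checksum, 3 = torn page
       error_count,
       last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;
```

If this returns pages across multiple databases, that's your strongest early signal of a storage-level problem rather than a one-off.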
At that point, look at your disaster recovery plan to figure out where you’re going to fail over (like a log shipping secondary). You’ll need to check that location for corruption as well, but if we’re dealing with widespread, multi-database corruption, it’s beyond the scope of what I can teach you to do in a blog post. You can keep reading for single-database corruption recovery options, and use these same tactics across multiple databases, but it’s going to be a ton of work.
3. If you’re using a SAN, alert the storage team. If your data and log files reside on shared storage, there may be a more widespread issue. Multiple databases or servers may be affected if the shared storage is having problems. Copy/paste this into an email:
SQL Server (INSERT NAME HERE) just reported corruption in a database that lives on the SAN. I don’t know yet whether this is due to a storage issue or a SQL Server bug. I’m dropping everything to do emergency troubleshooting, and I’ll let you know more in half an hour, but if you hear anything from other server admins about SAN issues, please let me know.
4. Alert the application owners and stakeholders. Copy/paste this into an email:
SQL Server (INSERT NAME HERE) just reported database corruption. I’m dropping everything to do emergency troubleshooting to find out if we’ve lost data, and how we’ll recover it. I’ll let you know more in half an hour, but in the meantime, I would recommend taking the applications offline. If more data is added after this point, I may not be able to recover it.
This sounds paranoid, but as an example, here’s a corruption case I had recently: shortly after detecting corruption, the team realized they’d have to revert to a backup of the database from a few days ago. Rather than telling users about that possibility, they let the users keep adding data into the already-corrupt database while the DBAs did troubleshooting. Several days later, as the corruption got worse, even Microsoft couldn’t repair the corruption – and the affected tables went completely offline, permanently. If the users had been alerted earlier, they could have avoided even more data loss.
5. Review the history of your CHECKDB and backup jobs. Fill out these answers:
- When was the most recent clean CHECKDB? (Meaning, when did CHECKDB run against this database without errors?) ________
- Do we have a full backup available from prior to that date? (For example, if CHECKDB said this database was okay on January 15th at 10PM, what’s the latest backup we have BEFORE that date, like January 14th at 11PM?) ________
- Do we have all of the transaction log backups since that date, yes/no? ________
If you can pass all three of these, great news! You’re going to have an alternate recovery path. If any of these backups are not online (like if they’re on a slow virtual tape library, or have been sent offsite), get one of your Windows admins to start retrieving these backups and putting them on a very fast network share as quickly as possible. You may need to restore all of these backups in a subsequent step.
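Two queries can help fill in those blanks. DBCC DBINFO is undocumented, but it exposes the last clean CHECKDB date (dbi_dbccLastKnownGood), and msdb.dbo.backupset records your backup history – the database name here is a placeholder:

```sql
-- When did CHECKDB last complete cleanly against this database?
-- Look for the row where Field = 'dbi_dbccLastKnownGood'.
DBCC DBINFO ('MyDB') WITH TABLERESULTS;

-- What full (type = 'D') and log (type = 'L') backups do we have, and where?
SELECT bs.database_name,
       bs.type,
       bs.backup_start_date,
       bs.backup_finish_date,
       bmf.physical_device_name
FROM msdb.dbo.backupset AS bs
JOIN msdb.dbo.backupmediafamily AS bmf
    ON bs.media_set_id = bmf.media_set_id
WHERE bs.database_name = 'MyDB'
ORDER BY bs.backup_start_date DESC;
```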
6. If you have the full CHECKDB output, look for an easy fix. DO NOT RUN CHECKDB in this step – that can be a long process, and before you go down that road, I need you to do other stuff first. I’m just saying that if you do happen to have the CHECKDB output that shows the corruption, look at the index IDs involved. If the only corruption involves an index with an ID of 2 or higher, that means it’s only corruption in a nonclustered index, and you’ll be able to repair that relatively quickly by:
- Run sp_BlitzIndex® to list out the index definitions on the affected table:
EXEC sp_BlitzIndex @DatabaseName = 'MyDB', @SchemaName = 'dbo', @TableName = 'MyTable'
- Find the corrupt index by its ID
- Copy out the CREATE INDEX statement from sp_BlitzIndex’s output
- Drop the index
- Recreate it using that definition
- To verify that no other corruption exists in the database, run:
DBCC CHECKDB('MyDB') WITH NO_INFOMSGS, ALL_ERRORMSGS
- If it comes back clean, you’re good, and you can email the teams that things are OK, but jump to the Aftermath section at the end of this post
- If it doesn’t come back clean, keep going
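Assuming sp_BlitzIndex identified the corrupt index as, say, IX_MyTable_MyColumn (a hypothetical name – use the exact definition sp_BlitzIndex scripted out for you), the drop-and-recreate looks like:

```sql
-- Drop the corrupt nonclustered index...
DROP INDEX IX_MyTable_MyColumn ON dbo.MyTable;

-- ...and rebuild it from the (uncorrupted) clustered index or heap data.
-- Column names here are placeholders; match your real index definition.
CREATE NONCLUSTERED INDEX IX_MyTable_MyColumn
    ON dbo.MyTable (MyColumn)
    INCLUDE (MyOtherColumn);
```

This works because a nonclustered index is just a redundant copy of data that still lives intact in the clustered index or heap.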
7. Fork your recovery efforts into multiple teams. Grab other members of your team and assign them tasks that will be done in parallel:
- Team 1: Restore clean backups to another server – this team grabs a different SQL Server of the same version as production. For example, if production is a SQL 2012 box, grab a development server that’s also SQL 2012. Restore the full and log backups from the dates specified in step 5. After it finishes, run DBCC CHECKDB to make sure there’s no corruption in this copy. This may take a long time, but if it works, you’ve got an excellent plan B.
- Team 2: Open a support call with Microsoft. If you don’t have a support agreement with Microsoft, grab a credit card and call professional support. It’s about $500, and they work the problem with you until it’s done. Don’t be afraid to say, “We’re not making progress fast enough – I’d like to escalate this call to the next support level.” (But at the same time, you’d better be doing what they tell you to do.)
- Team 3: Run CHECKDB if you don’t have the output. If the corruption was detected by an end user’s query or a monitoring system alert, run this, save the output to a text file, and then analyze the output with Gail Shaw’s corruption recovery instructions. Depending on the database’s size, this can take hours:
DBCC CHECKDB('MyDB') WITH NO_INFOMSGS, ALL_ERRORMSGS
- Team 4: If you don’t have a clean backup, restore the most recent backups to yet another server. If you’re going to try your own attempts at corruption repair, you’re going to want another copy of the corrupt database on a different server where you can test first before trying in production. As soon as that restore finishes, if the other teams aren’t ready to try their corruption repair commands yet, create a new database on this same server, and start copying as many tables as you can from the corrupt database into the new one. You want to be able to salvage as many tables as you can.
- Team 5: Contact a corruption specialist. If things are looking bad, and management wants another option, contact one of the following to start a consulting engagement:
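Team 1’s restore sequence, sketched with hypothetical file names, looks like this: restore the clean full backup WITH NORECOVERY, apply every log backup since in order, then recover and re-check:

```sql
-- Restore the last clean full backup, leaving the database restoring.
RESTORE DATABASE MyDB
    FROM DISK = N'\\backupshare\MyDB_Full_20160503.bak'
    WITH NORECOVERY, REPLACE;

-- Apply each transaction log backup taken since, in order.
RESTORE LOG MyDB
    FROM DISK = N'\\backupshare\MyDB_Log_20160503_2300.trn'
    WITH NORECOVERY;
-- ...repeat RESTORE LOG for every subsequent log backup...

-- Bring the database online, then verify it really is clean.
RESTORE DATABASE MyDB WITH RECOVERY;
DBCC CHECKDB('MyDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```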
8. Send out a status update as work starts. As soon as the above team task assignments are done – not when the work stops, but when it starts – it’s time to communicate your status to your manager, team members, and the application stakeholders.
Your update email will be based on what you’ve learned so far, but here’s an example:
Update on the corruption issue: we have identified that the corruption is confined to the SalesSite database, and it occurred at some point after 2016/05/04. We have split up our recovery efforts into three teams. Based on a quick assessment, we believe Team 1 will be the first to finish, and we hope to have a clean copy of the database up on DevSQL2012. If we have to fail over to that, we’ll have a one-hour outage. We’ll update you in another half-hour as we know more.
9. Keep working the plan. Don’t panic – let each of the teams do their thing independently. Keep doing check-ins across teams every 30 minutes, see which team is the closest to a solution the business is comfortable with, and keep the end users informed.
The Aftermath: Avoiding Corruption Next Time
Turn your backup-cleanup job back on. (Turning it off was step #1 above.)
As soon as practical, get on the most current service pack and cumulative update for your SQL Server version. You may have hit a bug like the notorious online index rebuild bug or the NOLOCK UPDATE bug, and patching will reduce the chances that you hit that issue again.
If you can’t rule out a SQL Server bug as the cause of the corruption, work with your storage teams to make sure your storage gear is patched up to its most recent firmware and driver versions. Check to see if there were any storage errors over the last several weeks – not just in the Windows event logs, but in the storage gear’s logs as well.
Once you’ve hit corruption on a server, consider taking a few proactive measures:
- Run CHECKDB more often, like every night (which sounds crazy – until you hit a corruption scenario)
- Keep your backups for a longer length of time – if you’re only running CHECKDB every week, you need at least a week’s worth of full and transaction log backups so that you can recover via restores
- Run transaction log backups more frequently, like every minute
- Put your RPO and RTO in writing – the business always assumes databases can’t lose data, and you assume that the business understands technology breaks sometimes. Get everybody on the same page.
- Based on your RPO and RTO, revisit how many databases you have on each server. If your hardware isn’t able to keep up with backups and corruption checks, it’s time to up your horsepower or switch to snapshot backups.
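Frequent log backups are just the standard BACKUP LOG command on a tight Agent schedule. Adding WITH CHECKSUM makes SQL Server verify page checksums as it reads the data, so backups themselves become a corruption check. (The path below is a placeholder – in practice you’d generate timestamped file names.)

```sql
-- Back up the transaction log, verifying page checksums as it reads.
BACKUP LOG MyDB
    TO DISK = N'\\backupshare\MyDB_Log_20160504_1200.trn'
    WITH CHECKSUM;
```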
Erik says: Brent is spot on with how to react to corruption scenarios. I’m all about not being past the point of no return in the first place. Whether it’s:
- Making sure I have email alerts set up for 823, 824, and 825 errors that catch hard and soft I/O issues
- Running DBCC CHECKDB often enough
- Having the CHECKSUM page verification option turned on
- Taking my backups with CHECKSUM (or using a Trace Flag if I can’t do it via my backup software)
- Making sure my backup routines at minimum do a restore with VERIFYONLY to assess basic usability, or even better, a restore of my FULL backups to another server
- Finally, using technology like Mirroring or Availability Groups, which have a feature called Automatic Page Repair that can fix some page corruption scenarios
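Those error-number alerts are plain SQL Server Agent alerts. Here’s a sketch for error 823 – repeat it for 824 and 825, and point @operator_name at an operator you’ve already created (‘DBA Team’ is a placeholder):

```sql
-- Alert on error 823 (hard I/O error). Create matching alerts for 824 and 825.
EXEC msdb.dbo.sp_add_alert
    @name = N'Error 823 - Hard I/O Error',
    @message_id = 823,
    @severity = 0,
    @include_event_description_in = 1;

-- Email an existing operator when the alert fires.
EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Error 823 - Hard I/O Error',
    @operator_name = N'DBA Team',
    @notification_method = 1;  -- 1 = email
```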
sp_Blitz® can help you find all sorts of dangers and inadequacies with your server setup. Run it on all your servers. You may be surprised what you find.
Tara says: I’ve encountered corruption 3 times in my career. The last event was the most memorable, and I don’t mean that as a good thing. If we had set up alerts for the 823, 824, and 825 errors as Erik mentioned, we would have been notified of the corruption sooner and had less data loss. We implemented the alerts through SCOM after the event, but we should have also created SQL Server Agent alerts for those errors, as SCOM wasn’t always reliable (your mileage may vary). Reacting to corruption quickly is very important.