The latest version of sp_Blitz® alerts you if you haven’t been using the WITH CHECKSUM parameter on your backups. This parameter tells SQL Server to check the checksums on each page and alert if there’s corruption.
But what about corrupt backups? Books Online says:
NO_CHECKSUM – Explicitly disables the generation of backup checksums (and the validation of page checksums). This is the default behavior, except for a compressed backup.
CHECKSUM – Specifies that the backup operation will verify each page for checksum and torn page, if enabled and available, and generate a checksum for the entire backup. This is the default behavior for a compressed backup.
Hmmm, let’s see about that. In my SQL Server 2014 lab environment, I shut down my primary replica, then busted out the hex editor XVI32 to edit the data file by hand, thereby introducing some corruption on a clustered index.
After starting the replica up again, I ran a normal compressed backup:
BACKUP DATABASE [AdventureWorks2012] TO DISK = N'\\DC1\SQLCLUSTERA\MSSQL\Backup\AW20131202_248_NoChecksum' WITH NOFORMAT, INIT, NAME = N'AdventureWorks2012-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, COMPRESSION, STATS = 10
The backup completed fine without errors – even though compressed backups are supposed to run WITH CHECKSUM by default.
Then I ran a compressed backup and manually specified the CHECKSUM parameter:
BACKUP DATABASE [AdventureWorks2012] TO DISK = N'\\DC1\SQLCLUSTERA\MSSQL\Backup\AW20131202_256_Checksum' WITH NOFORMAT, INIT, NAME = N'AdventureWorks2012-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, COMPRESSION, STATS = 10, CHECKSUM
That time, the backup stopped with an error:
10 percent processed. Msg 3043, Level 16, State 1, Line 4 BACKUP 'AdventureWorks2012' detected an error on page (1:3578) in file '\\dc1\SQLClusterA\MSSQL\Data\AdventureWorks2012_Data.mdf'. Msg 3013, Level 16, State 1, Line 4 BACKUP DATABASE is terminating abnormally.
Conclusion #1: Compressed backups don’t really check checksums. No idea if that’s a bug in the code or in the Books Online article.
But the plot thickens – this particular database is also part of an AlwaysOn Availability Group. One of the cool benefits of AGs (and also database mirroring) is that when one of the replicas encounters corruption, it automatically repairs the corruption using a clean copy of the page from one of the replicas. (After all, I didn’t use a hex editor on the secondary – only on the primary’s data file, so the secondaries still had a clean copy.)
After running the first backup (compressed, but no checksum), I queried sys.dm_hadr_auto_page_repair, the DMV that returns a row for every corruption repair attempt. The DMV held no data – because a backup without checksum doesn’t actually detect corruption.
After running the second backup (compressed, with checksum), I queried sys.dm_hadr_auto_page_repair again, and this time it successfully showed a row indicating which page had been detected as corrupt. However, the backup still failed – but why?
The clue is in the Books Online page for sys.dm_hadr_auto_page_repair – specifically, the page_status field’s possible values:
The status of the page-repair attempt:
2 = Queued for request from partner.
3 = Request sent to partner.
4 = Queued for automatic page repair (response received from partner).
5 = Automatic page repair succeeded and the page should be usable.
When I first queried the DMV, the page’s status was 3 – request sent to partner. My primary had asked for a clean copy of the page, but because my lab hardware is underpowered, it took several seconds for repair to complete. After it completed, I ran the backup again – and it completed without error.
A few things to take away here:
- Automatic page repair is automatic, but it’s not instant. When you’ve got corruption, a query (or backup) can fail due to corruption, and then magically succeed a few seconds later.
- Unless you’re doing daily DBCCs (and you’re not), then as long as you can stand the performance hit, use the WITH CHECKSUM parameter on your backups. Just doing compression alone isn’t enough.
- No, I can’t tell you what the performance hit will be on your system. Stop reading blogs and start doing some experimenting on your own.