Always On Availability Groups, Backup Checksums, and Corruption

Last Updated 7 years ago

Always On Availability Groups, SQL Server

The latest version of sp_Blitz® alerts you if you haven’t been using the WITH CHECKSUM parameter on your backups. This parameter tells SQL Server to check the checksums on each page and alert if there’s corruption.

But what about corrupt backups? Books Online says:

NO_CHECKSUM – Explicitly disables the generation of backup checksums (and the validation of page checksums). This is the default behavior, except for a compressed backup.
CHECKSUM – Specifies that the backup operation will verify each page for checksum and torn page, if enabled and available, and generate a checksum for the entire backup. This is the default behavior for a compressed backup.

Hmmm, let’s see about that. In my SQL Server 2014 lab environment, I shut down my primary replica, then busted out the hex editor XVI32 to edit the data file by hand, thereby introducing some corruption on a clustered index.

After starting the replica up again, I ran a normal compressed backup:

BACKUP DATABASE [AdventureWorks2012] TO DISK = N'\\DC1\SQLCLUSTERA\MSSQL\Backup\AW20131202_248_NoChecksum' WITH NOFORMAT, INIT,  NAME = N'AdventureWorks2012-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, COMPRESSION, STATS = 10

1	BACKUP DATABASE [AdventureWorks2012] TO DISK = N'\\DC1\SQLCLUSTERA\MSSQL\Backup\AW20131202_248_NoChecksum' WITH NOFORMAT, INIT, NAME = N'AdventureWorks2012-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, COMPRESSION, STATS = 10

The backup completed fine without errors – even though compressed backups are supposed to run WITH CHECKSUM by default.

Then I ran a compressed backup and manually specified the CHECKSUM parameter:

BACKUP DATABASE [AdventureWorks2012] TO  DISK = N'\\DC1\SQLCLUSTERA\MSSQL\Backup\AW20131202_256_Checksum' WITH NOFORMAT, INIT,  NAME = N'AdventureWorks2012-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, COMPRESSION,  STATS = 10, CHECKSUM

1	BACKUP DATABASE [AdventureWorks2012] TO DISK = N'\\DC1\SQLCLUSTERA\MSSQL\Backup\AW20131202_256_Checksum' WITH NOFORMAT, INIT, NAME = N'AdventureWorks2012-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, COMPRESSION, STATS = 10, CHECKSUM

That time, the backup stopped with an error:

10 percent processed.
Msg 3043, Level 16, State 1, Line 4
BACKUP 'AdventureWorks2012' detected an error on page (1:3578) in file '\\dc1\SQLClusterA\MSSQL\Data\AdventureWorks2012_Data.mdf'.
Msg 3013, Level 16, State 1, Line 4
BACKUP DATABASE is terminating abnormally.

10 percent processed.

Msg 3043, Level 16, State 1, Line 4

BACKUP 'AdventureWorks2012' detected an error on page (1:3578) in file '\\dc1\SQLClusterA\MSSQL\Data\AdventureWorks2012_Data.mdf'.

Msg 3013, Level 16, State 1, Line 4

BACKUP DATABASE is terminating abnormally.

And the warning to the left. — Do not stare into cork with remaining eye.

Conclusion #1: Compressed backups don’t really check checksums. No idea if that’s a bug in the code or in the Books Online article.

But the plot thickens – this particular database is also part of an AlwaysOn Availability Group. One of the cool benefits of AGs (and also database mirroring) is that when one of the replicas encounters corruption, it automatically repairs the corruption using a clean copy of the page from one of the replicas. (After all, I didn’t use a hex editor on the secondary – only on the primary’s data file, so the secondaries still had a clean copy.)

After running the first backup (compressed, but no checksum), I queried sys.dm_hadr_auto_page_repair, the DMV that returns a row for every corruption repair attempt. The DMV held no data – because a backup without checksum doesn’t actually detect corruption.

After running the second backup (compressed, with checksum), I queried sys.dm_hadr_auto_page_repair again, and this time it successfully showed a row indicating which page had been detected as corrupt. However, the backup still failed – but why?

The clue is in the Books Online page for sys.dm_hadr_auto_page_repair – specifically, the page_status field’s possible values:

The status of the page-repair attempt:
2 = Queued for request from partner.
3 = Request sent to partner.
4 = Queued for automatic page repair (response received from partner).
5 = Automatic page repair succeeded and the page should be usable.

When I first queried the DMV, the page’s status was 3 – request sent to partner. My primary had asked for a clean copy of the page, but because my lab hardware is underpowered, it took several seconds for repair to complete. After it completed, I ran the backup again – and it completed without error.

A few things to take away here:

Automatic page repair is automatic, but it’s not instant. When you’ve got corruption, a query (or backup) can fail due to corruption, and then magically succeed a few seconds later.
Unless you’re doing daily DBCCs (and you’re not), then as long as you can stand the performance hit, use the WITH CHECKSUM parameter on your backups. Just doing compression alone isn’t enough.
No, I can’t tell you what the performance hit will be on your system. Stop reading blogs and start doing some experimenting on your own.

Keep it Constrained

The Four Answers To “How Big Is Your SQL Server?”

15 Comments. Leave new

David RIechelmann
December 11, 2013 9:00 pm

Luckily, Mr. Hellengren’s maintenance plans enable checksums by default on the sample backup jobs, via ‘@CheckSum = ‘Y’. Hooray for Ola!

Reply
Andrew G
December 16, 2013 8:30 pm

Have confirmed compressed backup does not include checksum in 2008R2 SP2 and 2012 SP1 using RESTORE VERIFYONLY [db] WITH CHECKSUM. Must be a mistake in the documentation.

Reply
Kalen Delaney
December 19, 2013 6:36 pm

Awesome post, Brent. I especially like your comment about stopping reading blogs and trying it out for yourself! Plus… I also have to tell my students all the time that automatic doesn’t mean instantaneous.

Reply
- Brent Ozar
  December 19, 2013 7:30 pm
  
  Aww, thanks Kalen!
  
  Reply
Michael Meierruth
January 6, 2014 3:17 am

Brent,
A while back I played around with this WITH CHECKSUM thing. And I also intentionally corrupted the database using some desperate C# code. Thus I’d like to know if you can recommend a safe site for downloading this XVI32 editor that you are using.
Thanks loads.

PS
I seem to recall also that even with a compressed database (i.e. every 8K page is allocated), corrupting a page sometimes did not generate an error doing a backup WITH CHECKSUM. Did you ever run into this?

Reply
- Brent Ozar
  January 6, 2014 8:19 am
  
  Michael – if you Google for XVI32, there’s a bunch of download options.
  
  About the page corruption and backing up with checksum – I don’t purposely corrupt databases and back them up THAT often, heh.
  
  Reply
  - Michael Meierruth
    January 6, 2014 11:27 am
    
    Agreed. But when it came to testing this stuff you had to corrupt the database to believe that it truly works.
    BTW, in my original comment on ‘compressed databases’ I meant ‘shrunk databases’.
    Cheers.
    
    PS
    XVI32 is really neat – even with GB file sizes.
    
    Reply
Manish
June 2, 2014 11:22 pm

Gud Article 🙂

Reply
SHISHIR
May 31, 2017 9:59 am

As always, Great article, Brent. We are doing some tests on our own and therefore these questions. How often is the page corruption checked and how is it checked error log, and event being fired, etc? e.g. If a page gets corrupt on my primary will that corrupt page being synced to the replica(s) before auto page repair kicks in? What is the strategy to find corruption before it’s copied over? How does it work?

Reply
- Brent Ozar
  May 31, 2017 10:00 am
  
  Shishir – thanks, glad you liked the post? Those are all great questions that you’ll discover answers to during your tests.
  
  Reply
Melanie Sheehan
January 30, 2018 1:07 pm

This is a bit dated … But, looking for some clarifiction — if you apply CHECKSUM at the DB-level, use the “compress backup” setting, and backup the database WITH CHECKSUM — are you saying that it will not perform a CHECKSUM on the backup? I’ve been playing with sp_blitzbackups, and it only seems that when I turn off compression that the CHECKSUM is actually performed. Can you clarify “Unless you’re doing daily DBCCs (and you’re not), then as long as you can stand the performance hit, use the WITH CHECKSUM parameter on your backups. Just doing compression alone isn’t enough” …

Reply
- Brent Ozar
  January 30, 2018 1:09 pm
  
  Melanie – can you clearify what you mean by “if you apply CHECKSUM at the DB-level”? That’ll influence my answer. Thanks!
  
  Reply
MELANIE SHEEHAN
January 30, 2018 1:42 pm

ya, that wasn’t very clear … PAGE VERIFY = CHECKSUM

Reply
- Brent Ozar
  January 30, 2018 1:43 pm
  
  Gotcha. Correct, in that case, what I said in the blog post is still exactly correct. Thanks!
  
  Reply
yair
November 21, 2023 7:24 pm

backed up 10gb database more than 40 times for this experiment
1. Backup with compression, checksum and verify took on average – 78 seconds
2. Backup with compression and checksum – 39 seconds (50% less than 1)
3. Backup with compression and verify – 81 seconds (a bit more than 1)
4. Backup with compression – 46 seconds (46% less than 1)

Reply