Where to Run DBCC on Always On Availability Groups

With SQL Server 2012’s new AlwaysOn Availability Groups, we have the ability to run queries, backups, and even DBCCs on database replicas.  But should we?

Paul Randal (Blog | @PaulRandal) covered this issue in today’s SQLskills Insider newsletter, and he says:

“It’s not checking the I/O subsystem of the log shipping primary database – only that the log backups are working correctly…. Log shipping only ships transaction log backups – so any corruptions to the data files from the I/O subsystem on the primary will likely not be detected by the offloaded consistency checks….  I’ve heard it discussed that the SQL Server 2012 Availability Groups feature can allow consistency checks to be offloaded to one of the secondary copies – no – by the same argument.”

Actually – brace yourself – I’m about to suggest that Paul’s DBCC advice needs to be even stronger.  Take the following configuration:

  • SQLPRIMARY – primary replica where users connect.  We run DBCC here daily.
  • SQLNUMBERTWO – secondary replica where we do our full and transaction log backups.

When Bacon Flies

In this scenario, the DBCCs on the primary server aren’t testing the data we’re backing up.  We could be backing up corrupt data every night as part of our full backups.  If we experienced corruption on the primary server, and it tried to take care of business with automatic page repair by fetching a copy from the secondary server (which also happened to be corrupt), we’d be screwed.  We wouldn’t have a clean backup of the page, and our data would be permanently lost.

The moral of the story: with Availability Groups, we need to run DBCCs on whatever server is doing our full backups, too.  Doing them on the primary server isn’t enough.  In a perfect world, I’d run them regularly on any server that could take over as the primary.  Automatic page repair can only save your bacon if a replica is online and has a clean copy of the page.
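
If you want a starting point, here’s a minimal sketch of the nightly job step you’d schedule on the replica taking the full backups (the database name is a placeholder, and the secondary has to be a readable secondary for you to connect and run CHECKDB against it at all):

-- Hypothetical job step: only run the check when this instance is
-- currently acting as a secondary (the DMV and columns are real).
IF EXISTS (SELECT 1 FROM sys.dm_hadr_availability_replica_states
           WHERE is_local = 1 AND role_desc = N'SECONDARY')
    DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;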

One group I’m working with has taken a rather unique approach to running DBCCs.  They can’t afford the performance overhead of running DBCC or backups on the primary replica (ServerA), so every night they run backups and DBCC on a secondary (asynchronous) replica in the same datacenter (ServerB).  On Saturday night during their regularly scheduled outage, they switch into synchronous, get the boxes in sync, and then fail over to ServerB.  Then they run on ServerB all week long, and run DBCCs on ServerA every night.  It’s more manual work, but the payoff is a blazing fast and safe primary node.  (And in case you’re wondering about lost transactions in async mode, they’ve thought about that too – when the server becomes mission-critical, they plan to add a ServerC instance acting as a synchronous replica as well.)
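
For reference, a rough sketch of that Saturday-night switch, using real ALTER AVAILABILITY GROUP syntax but made-up AG and server names:

-- Step 1 (on the current primary, ServerA): flip both replicas to
-- synchronous commit, then wait until ServerB reports SYNCHRONIZED.
ALTER AVAILABILITY GROUP [YourAG] MODIFY REPLICA ON N'ServerA'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [YourAG] MODIFY REPLICA ON N'ServerB'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

-- Step 2 (on ServerB): planned, no-data-loss failover.
ALTER AVAILABILITY GROUP [YourAG] FAILOVER;

-- Step 3 (on ServerB, now the primary): drop both replicas back to
-- async for the week so the new primary never waits on remote commits.
ALTER AVAILABILITY GROUP [YourAG] MODIFY REPLICA ON N'ServerA'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [YourAG] MODIFY REPLICA ON N'ServerB'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);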

And no, I never thought I’d write that Paul Randal isn’t telling you to run DBCC often enough. 😉

46 Comments

  • Hi Brent.

    This approach seems a bit dangerous to me. If data file corruption on the primary can’t be detected on a secondary (or is unlikely to be), then any corruption won’t be found until failover up to six days later!
    It also doesn’t feel right to rely on copy-only backups for DR.

    • Hi, Brian. Can you elaborate a little more about exactly what part seems dangerous? Corruption is automatically repaired in the background with AlwaysOn Availability Groups whenever it’s detected, and you can have up to 4 replicas all with uncorrupted versions of the data. Corruption repair no longer takes downtime. What’s dangerous about that?

      I do want to know fairly quickly when storage starts to get corrupted on any given replica (like within a week) but I can’t usually afford to run DBCC on every replica, every night. As long as I’ve got one clean replica, I can always use that as my primary.
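
      If you want to watch that detection happening, here’s a quick illustrative query (the DMV and the msdb table are real):

      -- Page-repair attempts on this replica:
      SELECT DB_NAME(database_id) AS database_name, file_id, page_id,
             page_status, modification_time
      FROM sys.dm_hadr_auto_page_repair;

      -- Pages flagged as suspect on this instance:
      SELECT DB_NAME(database_id) AS database_name, file_id, page_id,
             event_type, error_count, last_update_date
      FROM msdb.dbo.suspect_pages;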

  • Hi Brent

    I was thinking along the lines that automatic page repair can’t repair all page types (file header, boot, GAM, SGAM, or PFS pages). Therefore you would run DBCC against the secondary, getting the all-clear each night, when in fact the primary is in trouble.

    • But when you say “the primary is in trouble,” what’s our traditional mode of repairing those types of pages? We restore the database or do page level restores. The point of AlwaysOn Availability Groups is that you’ve got a replica to fail over to now, with independent data page storage. When you detect corruption, fail over. I’m missing the danger here.

      Yes, sooner or later the primary will corrupt, and we’ll need to fail over. It’ll happen automatically when SQL tries to read the corrupted page – given that we’ve properly set up AGs and failover. Running DBCC frequently on the primary doesn’t change the corruption – it just means an earlier failover to the secondary.

  • Ahhh… thanks Brent. The penny/cent has finally dropped (I’m having a bad hair day or something).

    I’m implementing AlwaysOn Availability Groups within the next 8-10 weeks and the scenario is similar. I only started seriously looking at the technology yesterday, so I’m still at the stage where P45s are looming in every shadow ;)!

  • Does automatic page repair work in asynchronous availability mode?

  • Hi Brent,

    I am currently testing AlwaysOn and I am trying to figure out why differential backups are not allowed on the synchronous replicas.

    I checked the DCM page on both the primary and the synchronous replica and both have the same DIFF_MAP.
    I know there must be a logical reason for this but I can’t figure it out 🙁 .

    -- Primary:

    Allocation Status

    GAM (1:2) = ALLOCATED    SGAM (1:3) = NOT ALLOCATED    PFS (1:1) = 0x44 ALLOCATED 100_PCT_FULL
    DIFF (1:6) = CHANGED    ML (1:7) = NOT MIN_LOGGED

    DIFF_MAP: Header @0x0000000016B6A064 Slot 0, Offset 96

    status = 0x0

    DIFF_MAP: Extent Alloc Status @0x0000000016B6A0C2

    (1:0) - (1:16) = CHANGED
    (1:24) - (1:56) = NOT CHANGED
    (1:64) - = CHANGED
    (1:72) - (1:144) = NOT CHANGED
    (1:152) - (1:160) = CHANGED
    (1:168) - (1:1168) = NOT CHANGED

    -- Sync replica:

    Allocation Status

    GAM (1:2) = ALLOCATED    SGAM (1:3) = NOT ALLOCATED    PFS (1:1) = 0x44 ALLOCATED 100_PCT_FULL
    DIFF (1:6) = CHANGED    ML (1:7) = NOT MIN_LOGGED

    DIFF_MAP: Header @0x000000001321A064 Slot 0, Offset 96

    status = 0x0

    DIFF_MAP: Extent Alloc Status @0x000000001321A0C2

    (1:0) - (1:16) = CHANGED
    (1:24) - (1:56) = NOT CHANGED
    (1:64) - = CHANGED
    (1:72) - (1:144) = NOT CHANGED
    (1:152) - (1:160) = CHANGED
    (1:168) - (1:1168) = NOT CHANGED
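
    For anyone wanting to reproduce output like the above: it looks like a dump of the first DCM page via DBCC PAGE. A sketch, with the caveat that DBCC PAGE is undocumented and unsupported, and the database name here is a placeholder:

    DBCC TRACEON (3604);                  -- route DBCC output to the client
    DBCC PAGE ('YourDatabase', 1, 6, 3);  -- file 1, page 6 = first DCM page, print option 3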

  • Richard Van Veen
    August 5, 2014 1:00 pm

    Would there be any technical reason not to run full DBCCs on both the primary and secondary HA copies on the weekend (for example, if DBCC were replicated in HA)?

  • Given that you are only allowed to do “copy only” full backups on the secondary replica, it seems a bit strange to only do full backups on the secondary.

    • It’s not strange at all that this limitation exists, or that people would want to do their full backups on the secondary. The limitation exists because a normal full backup resets the flag that marks each page as modified (the flag differential backups use to find changed pages), and resetting those flags is a write that has to happen on the primary. That creates the one limitation of a copy-only backup: you’re limited to fulls and logs, with no diffs.

      Log backups also need to write to the database, but all they have to send back is the last LSN backed up, which is a reasonable piece of information to come back from a secondary.

      I still don’t understand why you can’t do copy-only log backups from a secondary, but I’m OK with that.
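
      To make the distinction concrete, here’s a hedged sketch (the database name and file paths are placeholders):

      -- On a secondary replica, full backups must be COPY_ONLY, so the
      -- differential bitmap on the primary is left untouched:
      BACKUP DATABASE YourDatabase
          TO DISK = N'X:\Backups\YourDatabase_full.bak'
          WITH COPY_ONLY, CHECKSUM, COMPRESSION;

      -- Log backups taken on a secondary are regular log backups: they
      -- stay part of the single log chain, just as on the primary.
      BACKUP LOG YourDatabase
          TO DISK = N'X:\Backups\YourDatabase_log.trn'
          WITH CHECKSUM, COMPRESSION;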

  • I was wondering if there was a way to include a verification step before a backup on a secondary replica under asynchronous commit mode in availability groups to verify that the secondary replica is up to date with the primary. We intend to run backups on secondary replicas only to keep load off of our primary. However, we do not want to sacrifice performance by using synchronous commit so we must stick with asynchronous.

    Current setup is an FCI as the primary, plus 2 additional instances set up as secondary replicas, all configured in asynchronous commit mode. One secondary is read-intent only; the other is a readable secondary.

    Is there any solution that you know of that could essentially give us the best of both worlds?

    • Travis – sure, you could build it by querying linked servers or DMVs. But here’s the catch: say the secondary is behind – that means the primary has to be able to detect that too, and run backups when the secondaries are too far behind.
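
      A rough sketch of that detection piece, run on the primary (the DMVs and columns are real, but the query and the 10 MB thresholds are just illustrative):

      SELECT adc.database_name,
             ar.replica_server_name,
             drs.log_send_queue_size,  -- KB of log not yet sent to that secondary
             drs.redo_queue_size       -- KB sent but not yet redone there
      FROM sys.dm_hadr_database_replica_states AS drs
      JOIN sys.availability_databases_cluster AS adc
        ON drs.group_database_id = adc.group_database_id
      JOIN sys.availability_replicas AS ar
        ON drs.replica_id = ar.replica_id
      WHERE drs.is_local = 0  -- remote (secondary) copies, as seen from the primary
        AND (drs.log_send_queue_size > 10240 OR drs.redo_queue_size > 10240);
      -- If this returns rows, the secondaries are too far behind: alert
      -- and delay the backup instead of running it.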

  • Hello, I have a SQL 2014 primary and use the read-only secondary replica for my reporting. Question: considering I do full, diff, and log backups on my primary, and I also have reorg, update stats, and DBCC CHECKDB jobs, should I add these same jobs on my secondary as well?

  • Hi Brent,
    I have a couple of questions.
    1. On the primary server, when I executed DBCC CHECKDB, it returned some errors. Will the secondary have the same errors? (Will there be any sync between primary and secondary related to corruption?)
    2. With Always On configured, when these errors are fixed on the primary, will this sync to the secondary and fix it automatically?
    Thanks in advance
    Ravi

  • Phil Cartwright
    December 12, 2016 5:14 am

    Why are they switching the primary servers each week?
    If the full and t-log backups are being run on the replica, and this is also where they’re running the nightly DBCC jobs, you know the data being backed up is not corrupted (and it’s also being tested by being restored elsewhere, I’m sure).
    With auto page repair in HA, any corruption on the primary will be corrected, or it will fail over to the checked replica. I understand that switching would be a good way to ensure their replicas are always configured correctly (for all the things that aren’t replicated), and great for ensuring their DR process is well rehearsed, but it isn’t absolutely necessary in terms of DBCC?

    • Phil – think through the process of what happens if you lose the standby replica (where checkdb is running), and then you encounter corruption. Automatic page repair only works if the checked replica is actually *online*, and manual page repair involves doing a restore – which means tearing the database out of your Availability Group.

      (And quick plug: this happens to be one of the scenarios we walk through in our Senior DBA online class – which is running this week, coincidentally! I have students design an AG, including where they’ll run backups and checkdb, and then we step through several failure scenarios we’ve seen out in the wild. It’s very eye-opening for folks who just assume the wizard is the finish line, heh.)

      • So how does your scenario protect against this? Surely if you encounter corruption while your replica is down, it will make your bad day even worse in either situation?
        Is it just a case of: the longer you run a database without checks, the higher the probability that there is corruption lurking undetected? So by switching each week you are reducing the risk of it being detected at the worst possible time?
        Thanks

  • Brent – Sorry to revive an old thread. Not sure if you remember our environment… it’s BIG :). Anyway, we have implemented several of your suggestions, but also noticed something interesting. We run DBCC on the secondary servers during the week, and then full backups from the secondary servers on the weekend. We run hourly log backups all week from the primary.

    What I have found is that while DBCC runs fine on the secondary and will fix problems if it finds them, it does not update the ‘dbi_dbccLastKnownGood’ field. That field is only updated when you run DBCC on the primary. We are looking to also run DBCC on the primaries, but if you remember the size of the environment, that will be a challenge. The biggest concern is around reporting, since I can’t find a good way (other than our own logs of DBCC runs) to make sure it has run for a particular database.

    We are still on 2012 R2… does this behavior change in later versions? Any thoughts on what we should do, or is the best answer to bite the bullet and run DBCC on the primaries as well? The behavior makes sense, since I assume the dates are stored in the database, which is read-only, but I had assumed it would communicate back to the primary and update.
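
    For reference, a sketch of how that field is usually read (DBCC DBINFO is undocumented, and the database name is a placeholder):

    DBCC DBINFO ('YourDatabase') WITH TABLERESULTS;
    -- Find the row where Field = 'dbi_dbccLastKnownGood' to see when
    -- CHECKDB last completed cleanly against this copy of the database.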

  • Ok, I am running Always On, and backing up the primary node daily. Every night I copy the backup file over to a dev server and restore it. Then I run an integrity check on the restored DB. Anything I am not doing, or not catching?

    • Mike – sure, what happens when you fail over from the primary to the secondary, and within a few minutes, your users report corruption? You won’t have the benefit of online page repair from the primary when it’s down. Sure would be nice to not have to restore from backup.

  • Michele Puopolo
    March 30, 2017 3:37 am

    Hello,
    when DBCC finds corruption and creates a SQL dump, it causes the SQL instance to freeze and a few minutes of service disruption.
    Does this mean that we can’t run checks on the production server?
    Is there any way to deactivate the creation of the SQL dump?

    Here’s the MSDN forum thread:
    https://social.msdn.microsoft.com/Forums/sqlserver/en-US/71da5f88-3d45-4ad9-86ea-7f5eecd2eb1a/dbcc-checkdb-cause-my-availability-group-to-failover?forum=sqldisasterrecovery

    • Michele – you’ve got answers over there in MSDN, check those out.

      • Puopolo Michele
        March 31, 2017 11:16 am

        Thanks, Brent! I think we’ll have to check our customer database on another SQL instance. We can’t freeze our primary node when corruption appears… but I really don’t like this ‘feature’.

  • Please provide a solution for this scenario:
    What is causing the server to restart?

    On our secondary replica (DBR2), the DBCC check has not completed for the past 3 weeks, and it’s causing the secondary replica (DBR2) to restart. The job history did not show a failure; it didn’t even show that the job attempted to run. But if you look in the SQL error logs, you see messages about transactions rolling forward and rolling back at the exact time of the integrity-check job run.

    2017-01-01 04:17:11.090 spid57 600 transactions rolled forward in database ‘abc’ (52:0). This is an informational message only. No user action is required.
    2017-01-01 04:17:11.220 spid57 0 transactions rolled back in database ‘abc’ (52:0). This is an informational message only. No user action is required.
    2017-01-01 04:17:11.850 spid57 Recovery completed for database abc (database ID 52) in 130 second(s) (analysis 997 ms, redo 148972 ms, undo 59 ms.) This is an informational message only. No user action is required.
    2017-01-01 04:22:09.050 Server AlwaysOn Availability Groups: Local Windows Server Failover Clustering node is no longer online. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.050 spid41s AlwaysOn: The availability replica manager is going offline because the local Windows Server Failover Clustering (WSFC) node has lost quorum. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.050 spid41s AlwaysOn: The local replica of availability group ‘AGxxxx’ is stopping. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.060 spid43s Nonqualified transactions are being rolled back in database abd for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.080 spid41s Nonqualified transactions are being rolled back in database dcs for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.100 spid49s Nonqualified transactions are being rolled back in database gfd for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.130 spid34s Nonqualified transactions are being rolled back in database hgf for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.140 spid44s Nonqualified transactions are being rolled back in database dgl for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.200 Server SQL Server is terminating because of a system shutdown. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.270 spid75 DBCC CHECKDB (abc) WITH all_errormsgs, no_infomsgs, data_purity executed by xxx\xxxb terminated abnormally due to error state 6. Elapsed time: 0 hours 7 minutes 8 seconds.
    2017-01-01 04:22:11.240 spid64s Nonqualified transactions are being rolled back in database abc for an AlwaysOn Availability Groups state change. Estimated rollback completion: 43%. This is an informational message only. No user action is required.
    2017-01-01 04:22:13.260 spid64s Nonqualified transactions are being rolled back in database abc for an AlwaysOn Availability Groups state change. Estimated rollback completion: 43%. This is an informational message only. No user action is required.
    2017-01-01 04:22:15.260 spid64s Nonqualified transactions are being rolled back in database abc for an AlwaysOn Availability Groups state change. Estimated rollback completion: 43%. This is an informational message only. No user action is required.
    2017-01-01 04:22:15.260 spid64s Error: 17054, Severity: 16, State: 1.
    2017-01-01 04:22:15.260 spid64s The current event was not reported to the Windows Events log. Operating system error = (null). You may need to clear the Windows Events log if it is full.
    2017-01-01 04:22:17.270 spid64s Nonqualified transactions are being rolled back in database abc for an AlwaysOn Availability Groups state change. Estimated rollback completion: 43%. This is an informational message only. No user action is required.
    2017-01-01 04:22:18.000 spid64s Nonqualified transactions are being rolled back in database abc for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

  • Hi Brent,

    Excellent article as always, and perfect timing 🙂 Just a quick question, please: where would you recommend running Ola’s index maintenance scripts, a secondary replica or all replicas?

  • Would it be OK to do CHECKDBs with the PHYSICAL_ONLY option on the primary server, and then do the full CHECKDBs on the secondary that is doing the backups?

    I’m considering this because that quote from Randal emphasises checking for corruption in the primary server’s I/O subsystem and data files.

    • Janneaa – wow, that’s a great idea, and I love it! It’d be just a little bit tricky to schedule since the primary can move around, but nothing that couldn’t be scripted. I like it!
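
      A sketch of what that split might look like in a scheduled job (sys.fn_hadr_is_primary_replica requires SQL Server 2014 or later, the secondary must be readable, and the database name is a placeholder):

      IF sys.fn_hadr_is_primary_replica(N'YourDatabase') = 1
          DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY, NO_INFOMSGS;  -- cheap I/O-subsystem check on the primary
      ELSE
          DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;  -- full logical check on the secondary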

      • carlos figueroa
        February 1, 2018 11:35 pm

        I was thinking about this strategy as well: run PHYSICAL_ONLY on the primary and a full CHECKDB on the secondary.
        Based on your answer, it seems to be a really good approach, and by using the sys.fn_hadr_is_primary_replica ( ‘dbname’ ) function it seems to be something that can be scripted.

        However, this leaves me with one question, Brent:
        Does this mean that when logical corruption occurs, let’s say on the primary, it will be replicated to the secondary nodes, and that’s why we can avoid running a full CHECKDB on the primary and do it only on the secondary?

  • Hi Brent,
    After reviewing both scenarios in your post, I am wondering why transaction logs are backed up on the secondary replica in both cases. On a secondary replica, full database backups are copy-only. Does that mean point-in-time restore is not available in both cases?

    • Tuan – there’s no such thing as copy-only log backups right now. Log backups have the same effect regardless of where they’re taken.

      • Hi Brent,
        On a secondary replica, I take full backups (copy-only) and regular log backups. If I need to restore the database to another server, I could only use the full backup (copy-only) and could not use any of the log backups. Would you recommend doing full backups on the primary replica and log backups on the secondary replica?

  • Hi Brent.
    Can we run DBCC CHECKDB on the secondary and perform our database backups with checksum on the primary?
    That would make sure that if corruption happens on the primary, the backup will fail, and the primary can use automatic page repair (tracked in sys.dm_hadr_auto_page_repair) to repair the page from a secondary node.

    Also, running CHECKDB on the secondary node will always keep DBCC-related resource utilization on the secondary only.
