Where to Run DBCC on Always On Availability Groups

With SQL Server 2012’s new AlwaysOn Availability Groups, we have the ability to run queries, backups, and even DBCCs on database replicas.  But should we?

Paul Randal (Blog | @PaulRandal) covered this issue in today’s SQLskills Insider newsletter, and he says:

“It’s not checking the I/O subsystem of the log shipping primary database – only that the log backups are working correctly…. Log shipping only ships transaction log backups – so any corruptions to the data files from the I/O subsystem on the primary will likely not be detected by the offloaded consistency checks….  I’ve heard it discussed that the SQL Server 2012 Availability Groups feature can allow consistency checks to be offloaded to one of the secondary copies – no – by the same argument.”

Actually – brace yourself – I’m about to suggest that Paul’s DBCC advice needs to be even stronger.  Take the following configuration:

  • SQLPRIMARY – primary replica where users connect.  We run DBCC here daily.
  • SQLNUMBERTWO – secondary replica where we do our full and transaction log backups.

When Bacon Flies

In this scenario, the DBCCs on the primary server aren’t testing the data we’re backing up.  We could be backing up corrupt data every night as part of our full backups.  If we experienced corruption on the primary server, and it tried to take care of business with automatic page repair by fetching a copy from the secondary server (which also happened to be corrupt), we’d be screwed.  We wouldn’t have a clean backup of the page, and our data would be permanently lost.

The moral of the story: with Availability Groups, we need to run DBCCs on whatever server is doing our full backups, too.  Doing them on the primary server isn’t enough.  In a perfect world, I’d run them regularly on any server that could take over as the primary.  Automatic page repair can only save your bacon if a replica is online and has a clean copy of the page.
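
If you want a starting point, here’s a minimal sketch of the nightly job step you’d schedule on the replica taking the full backups (the database name is a placeholder, and the secondary has to be a readable secondary for you to connect and run CHECKDB against it at all):

-- Hypothetical job step: only run the check when this instance is
-- currently acting as a secondary (the DMV and columns are real).
IF EXISTS (SELECT 1 FROM sys.dm_hadr_availability_replica_states
           WHERE is_local = 1 AND role_desc = N'SECONDARY')
    DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;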

One group I’m working with has taken a rather unique approach to running DBCCs.  They can’t afford the performance overhead of running DBCC or backups on the primary replica (ServerA), so every night they run backups and DBCC on a secondary (asynchronous) replica in the same datacenter (ServerB).  On Saturday night during their regularly scheduled outage, they switch into synchronous, get the boxes in sync, and then fail over to ServerB.  Then they run on ServerB all week long, and run DBCCs on ServerA every night.  It’s more manual work, but the payoff is a blazing fast and safe primary node.  (And in case you’re wondering about lost transactions in async mode, they’ve thought about that too – when the server becomes mission-critical, they plan to add a ServerC instance acting as a synchronous replica as well.)
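
For reference, a rough sketch of that Saturday-night switch, using real ALTER AVAILABILITY GROUP syntax but made-up AG and server names:

-- Step 1 (on the current primary, ServerA): flip both replicas to
-- synchronous commit, then wait until ServerB reports SYNCHRONIZED.
ALTER AVAILABILITY GROUP [YourAG] MODIFY REPLICA ON N'ServerA'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [YourAG] MODIFY REPLICA ON N'ServerB'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

-- Step 2 (on ServerB): planned, no-data-loss failover.
ALTER AVAILABILITY GROUP [YourAG] FAILOVER;

-- Step 3 (on ServerB, now the primary): drop both replicas back to
-- async for the week so the new primary never waits on remote commits.
ALTER AVAILABILITY GROUP [YourAG] MODIFY REPLICA ON N'ServerA'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
ALTER AVAILABILITY GROUP [YourAG] MODIFY REPLICA ON N'ServerB'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);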

And no, I never thought I’d write that Paul Randal isn’t telling you to run DBCC often enough. 😉

46 Comments

  • Hi Brent.

    This approach seems a bit dangerous to me. If data file corruption on the primary can’t be detected on a secondary (or is unlikely to be), then any corruption won’t be found until failover up to six days later!
    It also doesn’t feel right to rely on copy-only backups for DR.

    • Hi, Brian. Can you elaborate a little more about exactly what part seems dangerous? Corruption is automatically repaired in the background with AlwaysOn Availability Groups whenever it’s detected, and you can have up to 4 replicas all with uncorrupted versions of the data. Corruption repair no longer takes downtime. What’s dangerous about that?

      I do want to know fairly quickly when storage starts to get corrupted on any given replica (like within a week) but I can’t usually afford to run DBCC on every replica, every night. As long as I’ve got one clean replica, I can always use that as my primary.
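
      If you want to watch that detection happening, here’s a quick illustrative query (the DMV and the msdb table are real):

      -- Page-repair attempts on this replica:
      SELECT DB_NAME(database_id) AS database_name, file_id, page_id,
             page_status, modification_time
      FROM sys.dm_hadr_auto_page_repair;

      -- Pages flagged as suspect on this instance:
      SELECT DB_NAME(database_id) AS database_name, file_id, page_id,
             event_type, error_count, last_update_date
      FROM msdb.dbo.suspect_pages;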

  • Hi Brent

    I was thinking along the lines that automatic page repair can’t repair all page types (file header, boot, GAM, SGAM, or PFS pages). Therefore you would run DBCC against the secondary, getting the all-clear each night, when in fact the primary is in trouble.

    • But when you say “the primary is in trouble,” what’s our traditional mode of repairing those types of pages? We restore the database or do page level restores. The point of AlwaysOn Availability Groups is that you’ve got a replica to fail over to now, with independent data page storage. When you detect corruption, fail over. I’m missing the danger here.

      Yes, sooner or later the primary will corrupt, and we’ll need to fail over. It’ll happen automatically when SQL tries to read the corrupted page – given that we’ve properly set up AGs and failover. Running DBCC frequently on the primary doesn’t change the corruption – it just means an earlier failover to the secondary.

  • Ahhh… thanks Brent. The penny/cent has finally dropped (I’m having a bad hair day or something).

    I’m implementing AlwaysOn Availability Groups within the next 8-10 weeks and the scenario is similar. I only started seriously looking at the technology yesterday, so I’m still at the stage where P45s are looming in every shadow ;)!

  • Does automatic page repair work in asynchronous availability mode?

  • Hi Brent,

    I am currently testing AlwaysOn and I am trying to figure out why differential backups are not allowed on the synchronous replicas.

    I checked the DCM page on both the primary and the synchronous replica and both have the same DIFF_MAP.
    I know there must be a logical reason for this but I can’t figure it out 🙁 .

    -- Primary:

    Allocation Status

    GAM (1:2) = ALLOCATED    SGAM (1:3) = NOT ALLOCATED    PFS (1:1) = 0x44 ALLOCATED 100_PCT_FULL
    DIFF (1:6) = CHANGED    ML (1:7) = NOT MIN_LOGGED

    DIFF_MAP: Header @0x0000000016B6A064 Slot 0, Offset 96

    status = 0x0

    DIFF_MAP: Extent Alloc Status @0x0000000016B6A0C2

    (1:0) - (1:16) = CHANGED
    (1:24) - (1:56) = NOT CHANGED
    (1:64) - = CHANGED
    (1:72) - (1:144) = NOT CHANGED
    (1:152) - (1:160) = CHANGED
    (1:168) - (1:1168) = NOT CHANGED

    -- Sync replica:

    Allocation Status

    GAM (1:2) = ALLOCATED    SGAM (1:3) = NOT ALLOCATED    PFS (1:1) = 0x44 ALLOCATED 100_PCT_FULL
    DIFF (1:6) = CHANGED    ML (1:7) = NOT MIN_LOGGED

    DIFF_MAP: Header @0x000000001321A064 Slot 0, Offset 96

    status = 0x0

    DIFF_MAP: Extent Alloc Status @0x000000001321A0C2

    (1:0) - (1:16) = CHANGED
    (1:24) - (1:56) = NOT CHANGED
    (1:64) - = CHANGED
    (1:72) - (1:144) = NOT CHANGED
    (1:152) - (1:160) = CHANGED
    (1:168) - (1:1168) = NOT CHANGED
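
    For anyone wanting to reproduce output like the above: it looks like a dump of the first DCM page via DBCC PAGE. A sketch, with the caveat that DBCC PAGE is undocumented and unsupported, and the database name here is a placeholder:

    DBCC TRACEON (3604);                  -- route DBCC output to the client
    DBCC PAGE ('YourDatabase', 1, 6, 3);  -- file 1, page 6 = first DCM page, print option 3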

  • Richard Van Veen
    August 5, 2014 1:00 pm

    Would there be any technical reason not to run full DBCCs on both the primary and secondary HA copies on the weekend (for example, if DBCC were replicated in HA)?

  • Given that you are only allowed to do “copy only” full backups on the secondary replica, it seems a bit strange to only do full backups on the secondary.

    • It’s not strange at all that this limitation exists, or that people would want to do their full backups on the secondary. The limitation exists because a normal full backup resets the flag that marks each page as modified (the flag differential backups use to find changed pages), and resetting those flags is a write that has to happen on the primary. That creates the one limitation of a copy-only backup: you’re limited to fulls and logs, with no diffs.

      Log backups also need to write to the database, but all they have to send back is the last LSN backed up, which is a reasonable piece of information to come back from a secondary.

      I still don’t understand why you can’t do copy-only log backups from a secondary, but I’m OK with that.
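
      To make the distinction concrete, here’s a hedged sketch (the database name and file paths are placeholders):

      -- On a secondary replica, full backups must be COPY_ONLY, so the
      -- differential bitmap on the primary is left untouched:
      BACKUP DATABASE YourDatabase
          TO DISK = N'X:\Backups\YourDatabase_full.bak'
          WITH COPY_ONLY, CHECKSUM, COMPRESSION;

      -- Log backups taken on a secondary are regular log backups: they
      -- stay part of the single log chain, just as on the primary.
      BACKUP LOG YourDatabase
          TO DISK = N'X:\Backups\YourDatabase_log.trn'
          WITH CHECKSUM, COMPRESSION;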

  • I was wondering if there was a way to include a verification step before a backup on a secondary replica under asynchronous commit mode in availability groups to verify that the secondary replica is up to date with the primary. We intend to run backups on secondary replicas only to keep load off of our primary. However, we do not want to sacrifice performance by using synchronous commit so we must stick with asynchronous.

    Current setup is an FCI as the primary, plus 2 additional instances set up as secondary replicas, all configured in asynchronous commit mode. One secondary is read-intent only; the other is a readable secondary.

    Is there any solution that you know of that could essentially give us the best of both worlds?

    • Travis – sure, you could build it by querying linked servers or DMVs. But here’s the catch: say the secondary is behind – that means the primary has to be able to detect that too, and run backups when the secondaries are too far behind.
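
      A rough sketch of that detection piece, run on the primary (the DMVs and columns are real, but the query and the 10 MB thresholds are just illustrative):

      SELECT adc.database_name,
             ar.replica_server_name,
             drs.log_send_queue_size,  -- KB of log not yet sent to that secondary
             drs.redo_queue_size       -- KB sent but not yet redone there
      FROM sys.dm_hadr_database_replica_states AS drs
      JOIN sys.availability_databases_cluster AS adc
        ON drs.group_database_id = adc.group_database_id
      JOIN sys.availability_replicas AS ar
        ON drs.replica_id = ar.replica_id
      WHERE drs.is_local = 0  -- remote (secondary) copies, as seen from the primary
        AND (drs.log_send_queue_size > 10240 OR drs.redo_queue_size > 10240);
      -- If this returns rows, the secondaries are too far behind: alert
      -- and delay the backup instead of running it.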

  • Hello, I have a SQL 2014 primary and use the read-only secondary replica for my reporting. Question: considering I do full, diff, and log backups on my primary, and I also have reorg, update stats, and DBCC CHECKDB jobs, should I add these same jobs on my secondary as well?

  • Hi Brent,
    I have a couple of questions.
    1. On the primary server, when I executed DBCC CHECKDB, it returned some errors. Will the secondary have the same errors? (Will there be any sync between primary and secondary related to corruption?)
    2. With Always On configured, when these errors are fixed on the primary, will this sync to the secondary and fix it automatically?
    Thanks in advance
    Ravi

  • Phil Cartwright
    December 12, 2016 5:14 am

    Why are they switching the primary servers each week?
    If the full and t-log backups are being run on the replica, and this is also where they’re running the nightly DBCC jobs, you know the data being backed up is not corrupted (and it’s also being tested by being restored elsewhere, I’m sure).
    With auto page repair in HA, any corruption on the primary will be corrected, or it will fail over to the checked replica. I understand that switching would be a good way to ensure their replicas are always configured correctly (for all the things that aren’t replicated), and great for ensuring their DR process is well rehearsed, but it isn’t absolutely necessary in terms of DBCC?

    • Phil – think through the process of what happens if you lose the standby replica (where checkdb is running), and then you encounter corruption. Automatic page repair only works if the checked replica is actually *online*, and manual page repair involves doing a restore – which means tearing the database out of your Availability Group.

      (And quick plug: this happens to be one of the scenarios we walk through in our Senior DBA online class – which is running this week, coincidentally! I have students design an AG, including where they’ll run backups and checkdb, and then we step through several failure scenarios we’ve seen out in the wild. It’s very eye-opening for folks who just assume the wizard is the finish line, heh.)

      • So how does your scenario protect against this? Surely if you encounter corruption while your replica is down, it will make your bad day even worse in either situation?
        Is it just a case of: the longer you run a database without checks, the higher the probability that there is corruption lurking undetected? So by switching each week you are reducing the risk of it being detected at the worst possible time?
        Thanks

  • Brent – Sorry to revive an old thread. Not sure if you remember our environment… it’s BIG :). Anyway, we have implemented several of your suggestions, but also noticed something interesting. We run DBCC on the secondary servers during the week, and then full backups from the secondary servers on the weekend. We run hourly log backups all week from the primary.

    What I have found is that while DBCC runs fine on the secondary and will fix problems if it finds them, it does not update the ‘dbi_dbccLastKnownGood’ field. That field is only updated when you run DBCC on the primary. We are looking to also run DBCC on the primaries, but if you remember the size of the environment, that will be a challenge. The biggest concern is around reporting, since I can’t find a good way (other than our own logs of DBCC runs) to make sure it has run for a particular database.

    We are still on 2012 R2… does this behavior change in later versions? Any thoughts on what we should do, or is the best answer to bite the bullet and run DBCC on the primaries as well? The behavior makes sense, since I assume the dates are stored in the database, which is read-only, but I had assumed it would communicate back to the primary and update.
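
    For reference, a sketch of how that field is usually read (DBCC DBINFO is undocumented, and the database name is a placeholder):

    DBCC DBINFO ('YourDatabase') WITH TABLERESULTS;
    -- Find the row where Field = 'dbi_dbccLastKnownGood' to see when
    -- CHECKDB last completed cleanly against this copy of the database.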

  • Ok, I am running Always On, and backing up the primary node daily. Every night I copy the backup file over to a dev server and restore it. Then I run an integrity check on the restored DB. Anything I am not doing, or not catching?

    • Mike – sure, what happens when you fail over from the primary to the secondary, and within a few minutes, your users report corruption? You won’t have the benefit of online page repair from the primary when it’s down. Sure would be nice to not have to restore from backup.

  • Michele Puopolo
    March 30, 2017 3:37 am

    Hello,
    when DBCC finds corruption and creates a SQL dump, it causes the SQL instance to freeze and a few minutes of service disruption.
    Does this mean that we can’t run checks on the production server?
    Is there any way to deactivate the creation of the SQL dump?

    Here’s the MSDN forum thread:
    https://social.msdn.microsoft.com/Forums/sqlserver/en-US/71da5f88-3d45-4ad9-86ea-7f5eecd2eb1a/dbcc-checkdb-cause-my-availability-group-to-failover?forum=sqldisasterrecovery

    • Michele – you’ve got answers over there in MSDN, check those out.

      • Puopolo Michele
        March 31, 2017 11:16 am

        Thanks, Brent! I think we’ll have to check our customer database on another SQL instance. We can’t freeze our primary node when corruption appears… but I really don’t like this ‘feature’.

  • Please provide a solution for this scenario:
    What is causing the server to restart?

    On our secondary replica (DBR2), the DBCC check has not completed for the past 3 weeks, and it’s causing the secondary replica (DBR2) to restart. The job history did not show a failure; it didn’t even show that the job attempted to run. But if you look in the SQL error logs, you see messages about transactions rolling forward and rolling back at the exact time of the integrity-check job run.

    2017-01-01 04:17:11.090 spid57 600 transactions rolled forward in database ‘abc’ (52:0). This is an informational message only. No user action is required.
    2017-01-01 04:17:11.220 spid57 0 transactions rolled back in database ‘abc’ (52:0). This is an informational message only. No user action is required.
    2017-01-01 04:17:11.850 spid57 Recovery completed for database abc (database ID 52) in 130 second(s) (analysis 997 ms, redo 148972 ms, undo 59 ms.) This is an informational message only. No user action is required.
    2017-01-01 04:22:09.050 Server AlwaysOn Availability Groups: Local Windows Server Failover Clustering node is no longer online. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.050 spid41s AlwaysOn: The availability replica manager is going offline because the local Windows Server Failover Clustering (WSFC) node has lost quorum. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.050 spid41s AlwaysOn: The local replica of availability group ‘AGxxxx’ is stopping. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.060 spid43s Nonqualified transactions are being rolled back in database abd for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.080 spid41s Nonqualified transactions are being rolled back in database dcs for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.100 spid49s Nonqualified transactions are being rolled back in database gfd for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.130 spid34s Nonqualified transactions are being rolled back in database hgf for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.140 spid44s Nonqualified transactions are being rolled back in database dgl for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.200 Server SQL Server is terminating because of a system shutdown. This is an informational message only. No user action is required.
    2017-01-01 04:22:09.270 spid75 DBCC CHECKDB (abc) WITH all_errormsgs, no_infomsgs, data_purity executed by xxx\xxxb terminated abnormally due to error state 6. Elapsed time: 0 hours 7 minutes 8 seconds.
    2017-01-01 04:22:11.240 spid64s Nonqualified transactions are being rolled back in database abc for an AlwaysOn Availability Groups state change. Estimated rollback completion: 43%. This is an informational message only. No user action is required.
    2017-01-01 04:22:13.260 spid64s Nonqualified transactions are being rolled back in database abc for an AlwaysOn Availability Groups state change. Estimated rollback completion: 43%. This is an informational message only. No user action is required.
    2017-01-01 04:22:15.260 spid64s Nonqualified transactions are being rolled back in database abc for an AlwaysOn Availability Groups state change. Estimated rollback completion: 43%. This is an informational message only. No user action is required.
    2017-01-01 04:22:15.260 spid64s Error: 17054, Severity: 16, State: 1.
    2017-01-01 04:22:15.260 spid64s The current event was not reported to the Windows Events log. Operating system error = (null). You may need to clear the Windows Events log if it is full.
    2017-01-01 04:22:17.270 spid64s Nonqualified transactions are being rolled back in database abc for an AlwaysOn Availability Groups state change. Estimated rollback completion: 43%. This is an informational message only. No user action is required.
    2017-01-01 04:22:18.000 spid64s Nonqualified transactions are being rolled back in database abc for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.

  • Hi Brent,

    Excellent article as always, and perfect timing 🙂 Just a quick question, please: where would you recommend running Ola’s index maintenance scripts, a secondary replica or all replicas?

  • Would it be OK to do CHECKDBs with the PHYSICAL_ONLY option on the primary server, and then do the full CHECKDBs on the secondary that is doing the backups?

    I’m considering this because that quote from Randal emphasises checking for corruption in the primary server’s I/O subsystem and data files.

    • Janneaa – wow, that’s a great idea, and I love it! It’d be just a little bit tricky to schedule since the primary can move around, but nothing that couldn’t be scripted. I like it!
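
      A sketch of what that split might look like in a scheduled job (sys.fn_hadr_is_primary_replica requires SQL Server 2014 or later, the secondary must be readable, and the database name is a placeholder):

      IF sys.fn_hadr_is_primary_replica(N'YourDatabase') = 1
          DBCC CHECKDB (N'YourDatabase') WITH PHYSICAL_ONLY, NO_INFOMSGS;  -- cheap I/O-subsystem check on the primary
      ELSE
          DBCC CHECKDB (N'YourDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;  -- full logical check on the secondary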

      • carlos figueroa
        February 1, 2018 11:35 pm

        I was thinking about this strategy as well: run PHYSICAL_ONLY on the primary and a full CHECKDB on the secondary.
        Based on your answer, it seems to be a really good approach, and by using the sys.fn_hadr_is_primary_replica ( ‘dbname’ ) function it seems to be something that can be scripted.

        However, this leaves me with one question, Brent:
        Does this mean that when logical corruption occurs, let’s say on the primary, it will be replicated to the secondary nodes, and that’s why we can avoid running a full CHECKDB on the primary and do it only on the secondary?

  • Hi Brent,
    After reviewing both scenarios in your post, I am wondering why transaction logs are backed up on the secondary replica in both cases. On a secondary replica, full database backups are copy-only. Does that mean point-in-time restore is not available in both cases?

    • Tuan – there’s no such thing as copy-only log backups right now. Log backups have the same effect regardless of where they’re taken.

      • Hi Brent,
        On a secondary replica, I take full backups (copy-only) and regular log backups. If I need to restore the database to another server, I could only use the full backup (copy-only) and could not use any of the log backups. Would you recommend doing full backups on the primary replica and log backups on the secondary replica?

  • Hi Brent.
    Can we run DBCC CHECKDB on the secondary and perform our database backups with checksum on the primary?
    That would make sure that if corruption happens on the primary, the backup will fail, and the primary can use automatic page repair (tracked in sys.dm_hadr_auto_page_repair) to repair the page from a secondary node.

    Also, running CHECKDB on the secondary node will always keep DBCC-related resource utilization on the secondary only.
