Availability Group Direct Seeding: How to fix a database that won’t sync

This post covers two scenarios

You either created a database, and the sync failed for some reason, or a database stopped syncing. Our setup focuses on one where sync breaks immediately, because whatever it’s my blog post. In order to do that, I set up a script to create a bunch of databases, hoping that one of them would fail. Lucky me, two did! So let’s fix them.

You wimp.
You wimp.

You have to be especially vigilant during initial seeding

Automatic failover can’t happen while databases sync up. The AG dashboard reports an unhealthy state, so failover is manual. The good news is that in the limited test scenarios I checked out, Direct Seeding to Replicas will pick back up when the Primary is back online, but if anything really bad happens to your Primary, that may not be the warmest or fuzziest news.

Here’s our database stuck in a restoring state.

Poor Crap903
Poor Crap903

Now let’s look in the error log. Maybe we’ll have something good there. On the Primary…

Okie dokie. Good to know. On the Replica, you’ll probably see something like this…

Oh, correct the problem. You hear that, guys? Correct the problem.

IF ONLY I’D THOUGHT OF CORRECTING THE PROBLEM.

Sheesh

So what do we do? We can check out the AG dashboard, see a bunch of errors, and then focus in on them.

Sit, DBA, sit. Good DBA.
Sit, DBA, sit. Good DBA.

Alright, let’s see what we can do! We can run a couple magical DBA commands and see what happens.

ALTER DATABASE [Crap903] SET HADR RESUME

ALTER DATABASE [Crap903] SET HADR AVAILABILITY GROUP = SQLAG01;

Oh come on.
Oh come on.
THE SALES GUY SAID THIS WOULD BE SO EASY WTF SALES GUY
THE SALES GUY SAID THIS WOULD BE SO EASY WTF SALES GUY

The two errors were:
Msg 35242, Level 16, State 16, Line 1
Cannot complete this ALTER DATABASE SET HADR operation on database ‘Crap903’.
The database is not joined to an availability group. After the database has joined the availability group, retry the command.

And then

Msg 1412, Level 16, State 211, Line 1
The remote copy of database “Crap903” has not been rolled forward to a point in time that is encompassed in the local copy of the database log.

Interesting! What the heck does that mean? If Brent would give me his number, I’d call and ask. I don’t understand why he won’t give me his number. Well, let’s just kick this back off. We kind of expected that not to work because of the errors we saw in the log before, but it’s worth a shot to avoid taking additional steps.

Right? Wrong. Digging into our DMVs and Extended Events, they’re telling us that a database with that name already exists. What’s really lousy here is that this error doesn’t appear ANYWHERE ELSE. It’s not in the dashboard, it’s not in regular, documented DMVs, nor in the XE health session. It’s only in the undocumented stuff. If you’re going to use this feature, be prepared to do a lot of detective work. Be prepared to cry.

Crud.
Crud
Double crud
Double crud

What we have to do is go back, remove the database from the Availability Group again, then drop it from our other Replicas. We can’t just restore over what’s already there. That would break all sorts of laws of physics and whatever else makes front squats harder to do than back squats.

Since our database is in a restoring state, it’s a few steps to recover it, set it to single user so no one does anything dumber than our AG has done, and then drop it.

Drop it like it's crap.
Drop it like it’s crap.

When we re-add the database to our Availability Group, it should start syncing properly. Lucky for us, it did!

I'm not highly available and I'm so scared.
I’m not highly available and I’m so scared.

There’s no Tinder for databases.

I'm highly available. Call me.
I’m highly available. Call me.

New features are hard

With direct seeding, you have to be extra careful about named instances and default database creation paths. If you used named instances with default database paths to Program Files, or different drive letters and folder names, this isn’t going to work. You don’t have an option to change those things. SQL expects everything to be there in the same place across all of your Replicas. I learned that the annoying way. Several times. Troubleshooting this was weird because I still can’t track down a root cause as to why anything failed in the first place. For the record, I created 50 databases, and two of them didn’t work for some reason.

Correct the problem. Just correct the problem.

Thanks for reading!

Previous Post
First Responder Kit Updated, and sp_AskBrent is now sp_BlitzFirst
Next Post
Availability Group Direct Seeding: Extended Events and DMVs

24 Comments. Leave new

  • I guess this is the result of a “Cloud First” approach. It doesn’t have to be user-friendly at all, it just has to work in the Microsoft Azure environment 🙂

    Reply
  • How much for a front squat form check?

    Reply
  • Thomas Pullen
    June 30, 2016 3:38 am

    Reading this article has made me feel like binge drinking absinthe, it’s only 9.37am and I don’t even use AGs. Good work, Eric.

    Reply
  • hahaha your sarcasm toward MS error messages had me in stitches!

    Reply
  • This is literally the funniest blog I’ve ever seen. BY FAR. Bravo.

    Reply
  • “You don’t have an option to change those things. SQL expects everything to be there in the same place across all of your Replicas. I learned that the annoying way. Several times.”
    This tidbit of critical knowledge is worth all that pain. Thank you so much for finding this out. I was going to start using direct seeding except our server setup uses slightly different naming conventions between servers. It would have been messy without knowing this first.

    Reply
    • Erik Darling
      April 7, 2017 1:38 pm

      Thanks! While it may still be an issue, I always encourage people to try things on their own, especially after a few CUs or SPs have rolled out. You never know what may have changed.

      I have a feeling this is still a gotcha though.

      Reply
  • Can you please add programmatic step to add database to availability group? Since as far as I know we need to provide credentials and network path for the logs.
    If you share steps that will be awesome!

    Reply
  • Razvan Zoitanu
    August 3, 2018 1:21 pm

    Struggled with this on the latest version (2016 SP2 with CU2). Not using names instances, and all 3 availability replicas were deployed with the same settings (database files paths and all). Added 2 databases to an AG, hit this issue with the second database (secondary database not joined, files were not created). Cleaned up, ran the same script, all synced ok. Added another 5 databases, the third one failed. Cleaned it up, ran again for the failed one, now synced ok. And these are all small databases, 1GB each. Trace flag 9567 is not enabled.

    For me, failure_state was: FAILED 21 Seeding Check Message Timeout

    USE [master];
    SELECT TOP 100 start_time, completion_time, is_source, current_state, failure_state, failure_state_desc, error_code FROM sys.dm_hadr_automatic_seeding ORDER BY start_time DESC;
    SELECT * FROM sys.dm_hadr_physical_seeding_stats;

    Reply
  • I also struggled with the same issues until I upgraded (azure VM’s) all of the servers to 8 cores. With 8 cores, all of these issues seem to go away.

    Reply
  • I found a secret playing around. I experienced the same crap with failures doing 5 or 3 at a time adding to the AG group. So I said lets make tis really simple here for SQL. The below query is the meat & potatoes for a T-SQL script that first finds only 1 database ( even though there are 1200 waiting ) yes big smoke test and ran it out of threads doing so determining the Azure breaking point with thread exhaustion. SO back to what I was saying, snap in 1 database at a time using a SQL job running on 3-4min intervals seeded 1235 databases until SQL well was quite unresponsive. Script fetches the db name, runs a backup and pass in the AG group very successfully. My company hired a MS guy and he liked my code and ideas here since it worked.
    Drum roll fore the smallest but most useful query: We have a utility db we do not to bother with in build which is always #5 dbid.
    Select TOP 1 Name from master..sysdatabases
    where name not in(
    Select db_name(database_id) from sys.dm_hadr_database_replica_states)
    and dbid > 5
    I also added TDE encryption into this code and had to build a encryption watcher since while it is being encrypted you can not run the backup prior to adding to the AG group. Here is that small but very useful script:
    –Monitor the encryption process

    Declare @dbid int,@encryptionstate int
    Select @dbid = dbid from sysdatabases where name = @dbname
    looper:
    SELECT @encryptionstate = encryption_state FROM sys.dm_database_encryption_keys
    WHERE database_id = @dbid;
    Print @encryptionstate
    If @encryptionstate != 3
    Begin
    WAITFOR DELAY ’00:00:06′;
    Print ‘Checking for Encryption to completion status’
    GOTO looper
    End
    Else
    Begin
    Print ‘Encryption Process has Completed moving to adding to the AG Group now.’
    end

    Edward J Pochinski III

    Reply
  • I thought I was the only one thinking the Always On error messages were useless….

    “…needs to be in a SYNCHRONIZED state…”

    No kidding…

    Reply
  • Hey

    Alternative for the automatic seeding error

    Msg 1412, Level 16, State 211, Line 1
    The remote copy of database “Crap903” has not been rolled forward to a point in time that is encompassed in the local copy of the database log.

    You can avoid dropping the database (especially when it’s a big database) and fix this problem by applying the missing transactional backups. On my case, the secondary copy of the Distributed Availability Group failed with this error after the seeding successfully completed and resume operation didn’t work.

    The steps that i’ve did was to remove the db from AG ; when the db is in restore state apply the transactional backups with no recovery, backups that were taken during the seeding process.

    After this operation you will be able to add the db back to the AG by joining the db to the AG.

    Reply
  • Make sure when you run the seeding perms each node is primary replica. I had the auto seeding working aross 4 node replica systems flawlessly. I used POSH to auto deploy the perms. Also land the 6 busiest databases first in the AG group. We automated SQL to pickup new databases and slap them into the auto seeding. We also automated if one comes out of sync to remove from all secondary replicas and then re send across. This also worked flawlessly.

    Reply
  • Edward Pochinski
    June 25, 2021 6:53 pm

    I can get a bunch of the code up here I’m sure, I need to edit out any company information. I had done much testing, we exhausted worker threads exceedding 1000 databases in 1 Ag group( that was fun). Performed failovers with PowerShell with 400-600 databases and also to poll the 3 replicas and “resume data movement where needed” from failinbg over to a asynch replica. Wicked busy but they had me converted this wonderful technology due to a data issue ( Drum Roll )yes to LOg Shipping. So I now have 1 job as the puppet master and move 1,100 databases using some OLa H backup procedures and my custom T-SQL code. I also built in a healing process for the LS, if out of sync dropped and re established. I will checkj with my company to be sure everytrhing is peachy with me publically posting the code. I’m sure some time next week I can have some sort of posting. I wonder if Brent want to create a page or something so it is not posted in a blog. Stay tuned,

    Reply
  • thank you. however, I have a 2019 DAG configuration and here is the error message
    Seeding encountered a transient failure, state ‘108’, maximum retries exceeded’

    VM1 and VM2 (global) seed however, VM3 and VM4 (forward) did not

    What do you think

    Reply
  • Ed Pochinski
    July 22, 2021 4:29 pm

    Once the Auto seeding perms are configured correctly this script dispersed a database to 3 replicas. 1 in Azure East and 2 in Azure West was were my 3 were. East Replica was synchronous and West’s were Async. In testing we auto seeded over 1000 databases in a single AG group until yes, no more worker threads available. It was able to failover flawlessly with 400-600. We automated that with POSH and circled back to resume data movements on the new secondaries from the new Primary in West.

    CLS
    #add the databases to the AG Group Via Auto Seeding
    $SQL_Replica1=”SQLName,port”

    Write-Host “Performing the backup and seeding of new databases” -fore Magenta
    Write-Host “Connecting to : ” $SQL_Replica1 -fore Yellow
    Invoke-Sqlcmd -ServerInstance $SQL_Replica1 -verbose -Query ”

    –# Purpose – This script will deploy DBs with TDE and Auto seed to the replicas

    SET NOCOUNT ON
    Declare @sqlstring nvarchar (555),@backuppath varchar(555),@sqlstring1 nvarchar (555),@sqlstring2 nvarchar(555)
    DECLARE UserDBNotSyncd INSENSITIVE CURSOR

    FOR

    Select TOP 1 Name from master..sysdatabases
    where name not in(
    Select db_name(database_id) from sys.dm_hadr_database_replica_states)
    and dbid > 5

    FOR READ ONLY

    OPEN UserDBNotSyncd

    DECLARE @DBName varchar(50)

    FETCH NEXT FROM UserDBNotSyncd INTO @DBName

    WHILE (@@FETCH_STATUS = 0)

    BEGIN

    Print ‘Backing up database: ‘ + @DBName
    Set @backuppath = ‘\\PhyNodeName\AGshare\’ + @DBName + ‘.bak’
    Backup Database @DBName TO DISK = @backuppath

    Print ‘Completed Database Backup to staging’

    Print’*****Adding new database to AG Group for Auto Seeding: ‘ + @DBName + ‘ ******’
    set @sqlstring=’ ALTER AVAILABILITY GROUP AG1 ADD DATABASE [‘ + @DBName + ‘];’

    Print @sqlstring
    EXECUTE sp_executesql @sqlstring

    If @@error = 0
    Print ‘Database Database Successfully added: ‘ + @DBName

    FETCH NEXT FROM UserDBNotSyncd INTO @DBName
    END

    CLOSE UserDBNotSyncd

    DEALLOCATE UserDBNotSyncd

    –cleanup backup staged files

    Declare @cmdshellvalue sql_variant
    select @cmdshellvalue = value_in_use from sys.configurations where name = ‘xp_cmdshell’
    If @cmdshellvalue = 0
    Begin
    Print ‘Turning on xp_cmdshell,it magically was turned off somehow.’
    exec sp_configure ‘show advanced options’,’1′
    RECONFIGURE WITH OVERRIDE
    exec sp_configure ‘xp_cmdshell’,’1′
    RECONFIGURE WITH OVERRIDE
    END
    waitfor Delay ’00:00:02′
    Print ‘Cleaned up the full backup file from AG Prep & Auto Seeding’
    exec xp_cmdshell ‘ del \\PhyNodeName\AGshare\*.bak’
    Print ”
    Print ‘The process has completed, refresh the AG Dash Board’
    ” -querytimeout ([int]::MaxValue)

    Write-Host “Completed Seeding Process” -fore Green

    –Worried about worker thread issues this worked great

    –Author – Ed Pochinski 04232019
    –Purpose – AG Worker thread alert
    –SQL Server Central posting copy
    Declare @availableThreads varchar(25), @workers_waitingForCPU varchar(25), @Request_waiting_ForThread varchar(25),@msgbody varchar(500), @msgsubject varchar(500)
    Select @availableThreads = (select max_workers_count from sys.dm_os_sys_info)-sum(active_Workers_count) from sys.dm_os_Schedulers where status=’VISIBLE ONLINE’
    Print @availableThreads
    Select @workers_waitingForCPU = sum(runnable_tasks_count) from sys.dm_os_Schedulers where status=’VISIBLE ONLINE’
    Print @workers_waitingForCPU
    Select @Request_waiting_ForThread = sum(work_queue_count) from sys.dm_os_Schedulers where status=’VISIBLE ONLINE’
    Print @Request_waiting_ForThread
    –edit threshold to accomodate this is a test machine with no load so the humber is high to facilitate the alert
    If @availableThreads <=50
    Begin
    Print ' Send a email alert here !!!!! Worker threads are low: ' + @availableThreads + ', Workers Waiting on CPU: ' + @workers_waitingForCPU + ' \ Requests Waiting for thread: ' + @Request_waiting_ForThread
    set @msgbody ='Worker threads are low on Server: ' + @@servername + ' – Available Thread Count: ' + @availableThreads
    set @msgbody = @msgbody + char(13);
    set @msgbody = @msgbody + ' – Worker Threads Waiting on CPU count: ' + @workers_waitingForCPU + char(10);
    set @msgbody = @msgbody + ' – Requests Waiting for worker thread count: ' + @Request_waiting_ForThread
    set @msgsubject = '!!! AG Worker Thread Alert on SQL Server: ' + @@servername + ' !!!'
    EXEC msdb.dbo.sp_send_dbmail
    @recipients =N'email@somewhere.com',
    @body = @msgbody,
    @subject =@msgsubject ,
    @profile_name ='SQLMail_Public_Profile'
    End

    Reply
  • Thank you, for your reply. I found the solution. I did not failover the DAG correctly.
    Once I executed the following command, the databases were created and sync on the other 2 replicas

    alter availability group dag_samdc4892prod3 set (role = secondary)

    the error message within the SQL log did not suggest a review of the DAG. – thanks again

    Reply
  • Had this issue last night at 1am! The AG successfully failed over, but none of the databases were syncing. The SQL logs were vague, the dashboard showed when the last logs were hardened. After digging through the cluster log (Powershell: get-clusterlog -uselocaltime) I saw that the internal cluster database was out of sync with 2 of the 3 nodes.

    To fix this, within the Failover Cluster Manager on the current primary server, I stopped and restarted the SQL AG role. After the restart, all databases got re-synced and the dashboard showed healthy.

    I hope this helps somebody else.

    Reply
  • With Server 2019 and automatic seeding, the sync will fail if there is a transaction log backup taken on the primary whilst it is seeding. Disabling transaction log backups whilst synchronising seems to resolve the issue.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.