Availability Group Direct Seeding: How to fix a database that won’t sync

This post covers two scenarios

You either created a database, and the sync failed for some reason, or a database stopped syncing. Our setup focuses on one where sync breaks immediately, because whatever it’s my blog post. In order to do that, I set up a script to create a bunch of databases, hoping that one of them would fail. Lucky me, two did! So let’s fix them.

You wimp.

You wimp.

You have to be especially vigilant during initial seeding

Automatic failover can’t happen while databases sync up. The AG dashboard reports an unhealthy state, so failover is manual. The good news is that in the limited test scenarios I checked out, Direct Seeding to Replicas will pick back up when the Primary is back online, but if anything really bad happens to your Primary, that may not be the warmest or fuzziest news.

Here’s our database stuck in a restoring state.

Poor Crap903

Poor Crap903

Now let’s look in the error log. Maybe we’ll have something good there. On the Primary…

Okie dokie. Good to know. On the Replica, you’ll probably see something like this…

Oh, correct the problem. You hear that, guys? Correct the problem.

IF ONLY I’D THOUGHT OF CORRECTING THE PROBLEM.

Sheesh

So what do we do? We can check out the AG dashboard, see a bunch of errors, and then focus in on them.

Sit, DBA, sit. Good DBA.

Sit, DBA, sit. Good DBA.

Alright, let’s see what we can do! We can run a couple magical DBA commands and see what happens.

ALTER DATABASE [Crap903] SET HADR RESUME

ALTER DATABASE [Crap903] SET HADR AVAILABILITY GROUP = SQLAG01;

Oh come on.

Oh come on.

THE SALES GUY SAID THIS WOULD BE SO EASY WTF SALES GUY

THE SALES GUY SAID THIS WOULD BE SO EASY WTF SALES GUY

The two errors were:
Msg 35242, Level 16, State 16, Line 1
Cannot complete this ALTER DATABASE SET HADR operation on database ‘Crap903’.
The database is not joined to an availability group. After the database has joined the availability group, retry the command.

And then

Msg 1412, Level 16, State 211, Line 1
The remote copy of database “Crap903” has not been rolled forward to a point in time that is encompassed in the local copy of the database log.

Interesting! What the heck does that mean? If Brent would give me his number, I’d call and ask. I don’t understand why he won’t give me his number. Well, let’s just kick this back off. We kind of expected that not to work because of the errors we saw in the log before, but it’s worth a shot to avoid taking additional steps.

Right? Wrong. Digging into our DMVs and Extended Events, they’re telling us that a database with that name already exists. What’s really lousy here is that this error doesn’t appear ANYWHERE ELSE. It’s not in the dashboard, it’s not in regular, documented DMVs, nor in the XE health session. It’s only in the undocumented stuff. If you’re going to use this feature, be prepared to do a lot of detective work. Be prepared to cry.

Crud.

Crud

Double crud

Double crud

What we have to do is go back, remove the database from the Availability Group again, then drop it from our other Replicas. We can’t just restore over what’s already there. That would break all sorts of laws of physics and whatever else makes front squats harder to do than back squats.

Since our database is in a restoring state, it’s a few steps to recover it, set it to single user so no one does anything dumber than our AG has done, and then drop it.

Drop it like it's crap.

Drop it like it’s crap.

When we re-add the database to our Availability Group, it should start syncing properly. Lucky for us, it did!

I'm not highly available and I'm so scared.

I’m not highly available and I’m so scared.

There’s no Tinder for databases.

I'm highly available. Call me.

I’m highly available. Call me.

New features are hard

With direct seeding, you have to be extra careful about named instances and default database creation paths. If you used named instances with default database paths to Program Files, or different drive letters and folder names, this isn’t going to work. You don’t have an option to change those things. SQL expects everything to be there in the same place across all of your Replicas. I learned that the annoying way. Several times. Troubleshooting this was weird because I still can’t track down a root cause as to why anything failed in the first place. For the record, I created 50 databases, and two of them didn’t work for some reason.

Correct the problem. Just correct the problem.

Thanks for reading!

, , ,
Previous Post
First Responder Kit Updated, and sp_BlitzFirst is now sp_BlitzFirst
Next Post
Availability Group Direct Seeding: Extended Events and DMVs

13 Comments. Leave new

  • I guess this is the result of a “Cloud First” approach. It doesn’t have to be user-friendly at all, it just has to work in the Microsoft Azure environment 🙂

    Reply
  • How much for a front squat form check?

    Reply
  • Thomas Pullen
    June 30, 2016 3:38 am

    Reading this article has made me feel like binge drinking absinthe, it’s only 9.37am and I don’t even use AGs. Good work, Eric.

    Reply
  • hahaha your sarcasm toward MS error messages had me in stitches!

    Reply
  • This is literally the funniest blog I’ve ever seen. BY FAR. Bravo.

    Reply
  • “You don’t have an option to change those things. SQL expects everything to be there in the same place across all of your Replicas. I learned that the annoying way. Several times.”
    This tidbit of critical knowledge is worth all that pain. Thank you so much for finding this out. I was going to start using direct seeding except our server setup uses slightly different naming conventions between servers. It would have been messy without knowing this first.

    Reply
    • Erik Darling
      April 7, 2017 1:38 pm

      Thanks! While it may still be an issue, I always encourage people to try things on their own, especially after a few CUs or SPs have rolled out. You never know what may have changed.

      I have a feeling this is still a gotcha though.

      Reply
  • Can you please add programmatic step to add database to availability group? Since as far as I know we need to provide credentials and network path for the logs.
    If you share steps that will be awesome!

    Reply
  • Razvan Zoitanu
    August 3, 2018 1:21 pm

    Struggled with this on the latest version (2016 SP2 with CU2). Not using names instances, and all 3 availability replicas were deployed with the same settings (database files paths and all). Added 2 databases to an AG, hit this issue with the second database (secondary database not joined, files were not created). Cleaned up, ran the same script, all synced ok. Added another 5 databases, the third one failed. Cleaned it up, ran again for the failed one, now synced ok. And these are all small databases, 1GB each. Trace flag 9567 is not enabled.

    For me, failure_state was: FAILED 21 Seeding Check Message Timeout

    USE [master];
    SELECT TOP 100 start_time, completion_time, is_source, current_state, failure_state, failure_state_desc, error_code FROM sys.dm_hadr_automatic_seeding ORDER BY start_time DESC;
    SELECT * FROM sys.dm_hadr_physical_seeding_stats;

    Reply
  • I also struggled with the same issues until I upgraded (azure VM’s) all of the servers to 8 cores. With 8 cores, all of these issues seem to go away.

    Reply
  • I found a secret playing around. I experienced the same crap with failures doing 5 or 3 at a time adding to the AG group. So I said lets make tis really simple here for SQL. The below query is the meat & potatoes for a T-SQL script that first finds only 1 database ( even though there are 1200 waiting ) yes big smoke test and ran it out of threads doing so determining the Azure breaking point with thread exhaustion. SO back to what I was saying, snap in 1 database at a time using a SQL job running on 3-4min intervals seeded 1235 databases until SQL well was quite unresponsive. Script fetches the db name, runs a backup and pass in the AG group very successfully. My company hired a MS guy and he liked my code and ideas here since it worked.
    Drum roll fore the smallest but most useful query: We have a utility db we do not to bother with in build which is always #5 dbid.
    Select TOP 1 Name from master..sysdatabases
    where name not in(
    Select db_name(database_id) from sys.dm_hadr_database_replica_states)
    and dbid > 5
    I also added TDE encryption into this code and had to build a encryption watcher since while it is being encrypted you can not run the backup prior to adding to the AG group. Here is that small but very useful script:
    –Monitor the encryption process

    Declare @dbid int,@encryptionstate int
    Select @dbid = dbid from sysdatabases where name = @dbname
    looper:
    SELECT @encryptionstate = encryption_state FROM sys.dm_database_encryption_keys
    WHERE database_id = @dbid;
    Print @encryptionstate
    If @encryptionstate != 3
    Begin
    WAITFOR DELAY ’00:00:06′;
    Print ‘Checking for Encryption to completion status’
    GOTO looper
    End
    Else
    Begin
    Print ‘Encryption Process has Completed moving to adding to the AG Group now.’
    end

    Edward J Pochinski III

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Menu
{"cart_token":"","hash":"","cart_data":""}