Scary SQL Surprises: Crouching Tiger, Hidden Replication

Replication
22 Comments

The more SQL Servers I see, the more scared I get.  There are a staggering number of ways that your server can sneak around behind you and … well, let’s just say you shouldn’t drop the soap in the datacenter.

As part of my health checks, I’m used to verifying that full backups are done regularly, but that’s not enough.

Suspicious Cat, Angry Birds

If your database is in full recovery mode, then it needs transaction log backups to clear out the transaction log.  Full backups by themselves don’t free up log space – only transaction log backups do it.  (This explains why people are still searching the web for how to do BACKUP LOG WITH TRUNCATE_ONLY in SQL 2008 – they need to clear out log space without waiting around for a ginormous log backup.)

However, I ran into an interesting surprise recently at a client.  The full backups were working great, the transaction log backups were working great, so the DBA team thought everything was fine.

Except that we found a 16GB database with a 160GB log file.

After the disbelief and panic died down, we ran a simple query to see the problem:

SELECT name, log_reuse_wait_desc FROM sys.databases

By querying sys.databases, we can get all kinds of nifty information about database status.  The log_reuse_wait_desc column shows why SQL Server can’t reuse the existing space in the transaction log and must continue growing it out – in our case, to 10x the data size.  Some possibilities include:

  • NOTHING – you’re in good shape
  • LOG_BACKUP – you need to take a log backup (and preferably schedule it to happen regularly in a way that matches your recovery point objective)
  • DATABASE_MIRRORING, REPLICATION, and AVAILABILITY_REPLICA – one of your mirrors or replicas is down, and SQL Server is holding on to log data until the replica comes back up

In our case, we saw REPLICATION – which was quite a surprise since the company wasn’t using any SQL Server replication – or so they thought.  Turned out that somebody had played around with replication once and tore it down, but didn’t rip out all of the parts.  SQL Server was patiently waiting for a dead server to come back online and catch up.

Whoops.

The Moral of the Story

Just because you’ve got the right jobs in place and the jobs are running successfully doesn’t mean they’re doing everything you expect.  Yes, transaction log backups are supposed to let SQL Server reuse log space, but there are a lot of prerequisites for that to work.

The more you monitor, the more you can prevent these kinds of problems.  The answer isn’t to build your own monitoring tool and continue to add checks like this one by one AFTER they’ve jumped you in the shower.  The answer is to reuse wheels that other people have already invented that have extensive checks built in, and learn from other peoples’ mistakes.

Previous Post
Who’s taking your backups? No, really – who’s taking your backups?
Next Post
Write a Killer Technical Resume in Five Steps

22 Comments. Leave new

  • Thanks Brent
    What centralize monitoring solution have you been impressed with, if cost is not a road block?

    Reply
  • Looking forward to hearing your session at Red Gate SQL in the City [Seattle] in November.

    Reply
  • Ha! I encountered the exact same issue only a couple months back. A 2GB database, regular log file backups and 90GB odd log file!

    The customer had changed or removed mirroring from the database a long time ago, but the database was still in a state that it thought it needed to keep them around.

    Reply
  • It sounds like my workplace 😀

    Reply
  • I wish I had enough time to do this type of monitoring at work. Unfortunately, I end up with multiple hats many times and we don’t allot enough time for this type of work. I love being proactive, I’ll just keep pushing and hopefully things will change.

    Oh yeah, nice post and hilarious. One of my new worst fears now is dropping that bar of soap.

    Reply
  • Thanks Brent. I ran the above query and most of my databases returned “NOTHING” except two of them showed “LOG_BACKUP” in the log_reuse-wait_desc column. I just don’t know why since their log files are being backup every two hours. Any thoughts? Thanks.

    Reply
  • Thank you so much Brent. That explains it. I reran the query and they all returned “NOTHING” now.

    Reply
  • I realized that I have a handful of databases where log_reuse_wait_desc=LOG_BACKUP. So they need a LOG backup. But I have all user databases doing LOG backups multiple times a day. But these databases only actually yield FULL and DIFF backups.

    So to try to fix: after setting to SIMPLE then back to FULL mode, I am yet unable to get a LOG backup to run. It was say it is successful, but nothing actually gets written. I use the ola hallengren method of backup. Any help on this would be greatly appreciated. Thanks.

    Reply
    • Hi Josh. I can’t tell exactly what you mean by “but these databases only actually yield FULL and DIFF backups”. This is kind of beyond what I can troubleshoot in a blog post comment. You may consider posting it on http://DBA.StackExchange.com.

      Reply
      • Sorry, I meant to say that I can successfully get FULL and Differential backups to run, yielding actual backup files, but am unable to get LOG backup files to do the same. Most of the databases are doing all three right on time, no issues, but these few will not actually write out a LOG backup. I will see what I can get at stackexchange as well. Thanks for your time.

        Reply
        • Wow I’m still subscribed to this post 🙂

          You know what I would try Josh? Take your full and restore it on a Development server. Try your log backups there. If they don’t work, try to detach and reattach the database without the log file. Then try running log backups again. Could be a bad log file. I’ve had a similar situation and that helped me. Not exactly the same and please do it in a dev environment 🙂

          Reply
          • Ayman,

            Thank you so much for the reply. I just tried both suggestions. I detached and reattached without the LDF and the problem remained.

            I then took a recent full backup file and restored it to a dev server and the problem existed there as well.

            I then went to double check that the backup was in FULL recovery mode and it wasn’t. Switched it to FULL from SIMPLE and then it worked on the dev side.

            Back to production, turns out the db was back in SIMPLE mode as well so I changed it to FULL. But it still won’t do a LOG backup.

            I’m trying not to be a burden on the community, honestly. I am just puzzled. Thanks.

  • you need to do a full backup after changing the recovery model before doing the log backup.

    you’re not burdening anyone the community enjoys the challenges of helping others that’s how we grow stronger.

    Reply
    • I failed to mention that in each of the scenarios, I did a full backup prior to attempting the LOG backup.

      So, I took this full backup from dev and (after taking backup precautions with the original database) replaced it with the ‘working’ one from dev. Checked to see if it was FULL recovery mode. Yes it was. Made a full backup. Tried to do a LOG backup, same problem. :/

      Reply
      • Well darn, I thought you may have had a bad log file with an easy fix 🙁

        I would take a step back and approach this like a doctor; maybe because my Dad is one and I work for a medical society?! No need to put on a lab coat like Brent does in his webinars 🙂 Just look at the symptoms, are they occurring with other DBs on the same server? Check the SQL Logs, Windows Logs, anything interesting in there? I know in the stack exchange post you mentioned that you used Ola’s robust logging, but just be certain and check the SQL/Windows Logs.

        Also, what changed the recovery model to simple? That’s a big huge deal to be honest. Looks like you were working under the assumption that it is under the Full Recovery model, who has access to change that? I had an issue where DBs were set to Auto_Close and Auto_Shrink by a vendor, put a policy in to turn that option off on all DBs on a daily basis so they couldn’t do it again. Do you have a gremlin in the system that is playing with your settings?

        I’m going to follow the discussion over at stack exchange, very interesting issue.

        Reply
  • Here is the link to the same issue I cross posted to StackExchange: http://dba.stackexchange.com/questions/52005/how-to-fix-log-reuse-wait-desc-log-backup-log-backup-not-possible
    if that helps as well.

    Reply
  • UPDATE: I figured it out!

    Because the ola hallengren scripts check for log shipping roles and skip log backups if it finds the current db in either

    msdb.dbo.log_shipping_primary_databases

    or

    msdb.dbo.log_shipping_secondary_databases

    AND because the databases I have been fighting with recently DID have old log shipping entries that were not proiperly cleaned out, this script was not able to create the LOG backups. Now that that is all cleaned up, we appear to be good to go.

    Many thanks to all who took the time to chime in! -Josh

    Reply
  • Good article and liked the title. Reminds me of a round of the BBC’s I’m Sorry I Haven’t A Clue, when the panel where asked to adapt a movie title to describe handymen at one’s home. My fave was from the ingenious Barry Cryer: Crouching Plumber, Hidden Charges.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.