Building stuff

I sometimes really hate coming up with ideas, and much prefer someone to just say “I want this” so I can go out into the world and figure out how to do that. Occasionally though, I realize that I’m not going to get all that much direction. That’s what happened with sp_BlitzQueryStore.

It also happened with some of the internals of sp_AllNightLog. Now, this is mostly a wrapper for two already-brilliant stored procedures: sp_DatabaseRestore, and dbo.DatabaseBackup. It has a few different bits of functionality that I’ll write more about, but two bits in particular both work off a queue table: backups and restores. I’m going to explain a little bit how the backup portion works, and why we chose to do things this way.

Prior to this, I’d never used a queue table, or written any code that used queues. I’d always been told (and I think rightly) that queues don’t belong in the database. I joked with Brent about using Service Broker.

He didn’t laugh.

Now that I’m out of the hospital

Here’s why a queue made sense in this scenario

We do:

  • Need backups and restores to happen as soon as possible
  • Add databases frequently
  • Want a mostly static number of Agent jobs

We don’t:

  • Want to rebalance Agent jobs to make sure they all finish within RPO
  • Add Agent jobs if we can’t rebalance
  • Have a job per database

Given the set of needs, a queue was the only thing that makes sense.

The pseudo code looks something like this:

We have 10 ‘workers’ (Agent jobs)
Each job runs in a loop with a WAITFOR to throttle it
When it activates, it checks the queue table for either a database that hasn’t had a log backup in the last 30 seconds (this is configurable), or a database that has never had a log backup

When it finds one, it locks the record (there’s a begin tran, and some locking hints), updates a couple fields: one that will allow other workers to skip over it: is_started, and a date started.
Takes the log backup (this is what calls Ola’s proc)
Marks the backup as completed, mark the backup as not started (sets is_started back to 0), and marks a completion time

There’s some other mesh and lace in there to handle errors and failures. Each job also has a 10 second run schedule behind it, so that if it fails miserably, it will try to restart again pretty quickly.

Now, I know what you’re thinking. Because I said “queue table”, you’re picturing some awful deadlock-ridden hotspot, inserting a row every time something needs to be done.

Non, non, non! We are not the dummies!

cheek2cheek

Every database gets 1 row, and only 1 row. We use Locking Magick® to serialize row access, and a pretty simple query for workers to figure out which database needs a backing up. It’s nifty, I promise. It’s not like those other queue tables that have given you the runaround and broken your heart before. When you create a database, a polling job adds it to the worker table, and it gets picked up the next time a worker activates.

Four queue

There’s a really cool feature in Ola’s procs — @ChangeBackupType. What this will do, is if a backup job comes along to do a diff or log, and it finds that it can’t because there’s never been a full backup, it will change the backup type to a full. It was really important to us that any job where this happened didn’t impede other backups finishing.

That means that if we kept trying to have a set number of jobs, and trying to rebalance them, some number of databases could get held up if we added a large database to the server that had never had a full backup here.

Picture an Agent job responsible for backing up 199 databases. We add another to it, Database200, and it needs a full. If that full takes 5-10 minutes, the other 199 databases don’t have a log backup start until that’s done. That breaks RPO.

Since each worker is only responsible for one backup, and one database at a time, the other 9 workers can still work on other log backups. This gives us a higher degree of safety, and gives us a better shot at staying with RPO.

I hope this clarifies a bit why we chose to use a queue in this scenario.

Thanks for reading!

Previous Post
New Stack Overflow Public Database Available (2017-06)
Next Post
I’m on the new SQLPlayer Podcast

13 Comments. Leave new

  • Maybe nobody except me was dumb enough to need to peek at the source to figure this out, but the reason is_started and is_completed both exist is because rows in the queue have three states:
    1. Job is new and has never been backed up. is_started = 0 and is_completed = 0
    2. Job is currently being backed up (is_started = 1, is_completed = 0).
    3. Job is not currently being backed up (is_started = 0, is_completed = 1)

    In principle, all you really need is is_running (or, equivalently, is_completed could be a computed column), with new entries being those with a null or sentinel value for last_log_backup_start_time. The current approach is a bit more obvious/simple, though personally it tends to feel funny to have two database columns that are always updated in lockstep to each other’s inverse.

    Reply
  • ken ambrose
    July 7, 2017 11:23 am

    I’ve found FIFO queues – accessible from multiple concurrent processes – to be very useful in managing maintenance processes. And I have even used service broker when directed my management to do so.

    regarding [[columns updated that will allow other workers to skip over it: is_started, and a date started.]]
    Did you consider using “readpast” hint in the statement that finds the next task that needs to be done?

    Reply
  • Tim Cartwright
    July 7, 2017 11:35 am

    It would be nice to have a configurable amount of wait time per row. To allow different tlog schedules per database.

    Reply
  • Tim Cartwright
    July 7, 2017 12:01 pm

    @Erik, I was intrigued by your idea, and liking it until I ran across the need to have xp_cmdshell enabled. It looks like you did the xp_cmdshell to be able to monitor for new files when a database is restored. Most of us cannot enable it because of the security risk and compliance. An alternative idea would be to use an Extended Events (2016 only!): https://sqlperformance.com/2015/06/extended-events/t-sql-tuesday-67-backup-restore

    Reply
    • Erik Darling
      July 7, 2017 12:04 pm

      Yeah, even if I didn’t here, we use it in sp_DatabaseRestore.

      In 2017 there’s a non-xp function for looking at file system items, too, but… the number of people on 2016 is depressingly low. Nevermind the future.

      For the XE thing — yeah, it would work on one server, but we want this to all work without servers ever having to talk to each other. Like a good marriage.

      Reply
  • I have a different (not necessarily better) way of getting around the issue of a DIFF or FULL run starting when you are only trying to do a LOG or DIFF and using @ChangeBackupType. I’m testing some code that will add a new @ChangeBackupType of “DEFER.” If you use this type, any changes to DIFF or FULL are saved off in a table variable (I only use that and not a temp table because that’s how other temporary tables are coded in the procedure) and then, when the original run is done, a DIFF run is started for any DIFFs needed and then a FULL run is started for any FULLs needed. I hope to get this added as part of the standard code once it’s been fully tested here, but so far it seems to work OK. It doesn’t allow for simultaneous backups like your solution, but it does allow the backup of the LOGs or DIFFs to continue without being slowed by an unexpected DIFF or FULL.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Menu
{"cart_token":"","hash":"","cart_data":""}