I sometimes really hate coming up with ideas, and much prefer someone to just say “I want this” so I can go out into the world and figure out how to do that. Occasionally though, I realize that I’m not going to get all that much direction. That’s what happened with sp_BlitzQueryStore.
It also happened with some of the internals of sp_AllNightLog. Now, this is mostly a wrapper for two already-brilliant stored procedures: sp_DatabaseRestore, and dbo.DatabaseBackup. It has a few different bits of functionality that I’ll write more about, but two bits in particular both work off a queue table: backups and restores. I’m going to explain a little bit how the backup portion works, and why we chose to do things this way.
Prior to this, I’d never used a queue table, or written any code that used queues. I’d always been told (and I think rightly) that queues don’t belong in the database. I joked with Brent about using Service Broker.
He didn’t laugh.
Now that I’m out of the hospital, here’s why a queue made sense in this scenario:
- Need backups and restores to happen as soon as possible
- Add databases frequently
- Want a mostly static number of Agent jobs
- Want to rebalance Agent jobs to make sure they all finish within RPO
- Add Agent jobs if we can’t rebalance
- Have a job per database
Given that set of needs, a queue was the only thing that made sense.
The pseudocode looks something like this:
- We have 10 ‘workers’ (Agent jobs)
- Each job runs in a loop, with a WAITFOR to throttle it
- When it activates, it checks the queue table for either a database that hasn’t had a log backup in the last 30 seconds (this is configurable), or a database that has never had a log backup
- When it finds one, it locks the record (there’s a BEGIN TRAN and some locking hints) and updates a couple of fields: is_started, which tells other workers to skip over it, and a start date
- Takes the log backup (this is what calls Ola’s proc)
- Marks the backup as completed, sets is_started back to 0, and records a completion time
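The steps above boil down to a claim-and-release pattern. Here’s a rough sketch of one worker’s cycle — table and column names (other than is_started, which is mentioned above) are my illustrative guesses, not the shipped sp_AllNightLog code:

```sql
/* Sketch of one worker's claim-and-work cycle.
   Table/column names besides is_started are illustrative. */
DECLARE @database sysname;

BEGIN TRAN;

/* Grab one eligible database; READPAST skips rows other workers
   have locked, UPDLOCK keeps two workers from claiming the same row */
SELECT TOP (1) @database = bw.database_name
FROM dbo.backup_worker AS bw WITH (UPDLOCK, READPAST, ROWLOCK)
WHERE bw.is_started = 0
AND ( bw.last_log_backup_finish_time IS NULL                        /* never backed up */
   OR bw.last_log_backup_finish_time <= DATEADD(SECOND, -30, GETDATE()) )
ORDER BY bw.last_log_backup_finish_time;  /* NULLs sort first, so new databases win */

IF @database IS NOT NULL
BEGIN
    UPDATE dbo.backup_worker
    SET is_started = 1,
        last_log_backup_start_time = GETDATE()
    WHERE database_name = @database;
END;

COMMIT;

/* Outside the transaction: take the log backup, then release the row */
IF @database IS NOT NULL
BEGIN
    /* ...call Ola's dbo.DatabaseBackup for @database here... */

    UPDATE dbo.backup_worker
    SET is_started = 0,
        is_completed = 1,
        last_log_backup_finish_time = GETDATE()
    WHERE database_name = @database;
END;
```

Claiming the row inside a short transaction, but running the backup itself outside of it, keeps the lock held for milliseconds rather than for the length of the backup.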
There’s some other mesh and lace in there to handle errors and failures. Each job also has a 10 second run schedule behind it, so that if it fails miserably, it will try to restart again pretty quickly.
Now, I know what you’re thinking. Because I said “queue table”, you’re picturing some awful deadlock-ridden hotspot, inserting a row every time something needs to be done.
Non, non, non! We are not the dummies!
Every database gets 1 row, and only 1 row. We use Locking Magick® to serialize row access, and a pretty simple query for workers to figure out which database needs a backing up. It’s nifty, I promise. It’s not like those other queue tables that have given you the runaround and broken your heart before. When you create a database, a polling job adds it to the worker table, and it gets picked up the next time a worker activates.
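To make that one-row-per-database idea concrete, the worker table might look something like this — again, a sketch with assumed names, not the actual shipped schema:

```sql
/* Illustrative shape of the worker (queue) table:
   one row per database, no per-task inserts. */
CREATE TABLE dbo.backup_worker
(
    id INT IDENTITY(1, 1) PRIMARY KEY CLUSTERED,
    database_name sysname NOT NULL,
    last_log_backup_start_time DATETIME NULL,
    last_log_backup_finish_time DATETIME NULL,
    is_started BIT NOT NULL DEFAULT 0,
    is_completed BIT NOT NULL DEFAULT 0
);

/* The polling job adds any new databases as single rows */
INSERT dbo.backup_worker (database_name)
SELECT d.name
FROM sys.databases AS d
WHERE d.database_id > 4   /* skip system databases */
AND NOT EXISTS
    ( SELECT 1
      FROM dbo.backup_worker AS bw
      WHERE bw.database_name = d.name );
```

Because rows are only ever updated in place (never inserted and deleted per backup), the table doesn’t turn into the insert/delete hotspot people fear from queue tables.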
There’s a really cool feature in Ola’s procs — @ChangeBackupType. If a backup job comes along to take a diff or log backup and finds that it can’t, because there’s never been a full backup, it changes the backup type to a full. It was really important to us that any job where this happened didn’t impede other backups finishing.
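Using it looks something like this (the database name and backup path here are made up for the example):

```sql
/* If no full backup exists yet for this database, @ChangeBackupType
   promotes this log backup to a full instead of failing. */
EXEC dbo.DatabaseBackup
    @Databases = N'Database200',
    @Directory = N'\\backup\share',   /* illustrative path */
    @BackupType = 'LOG',
    @ChangeBackupType = 'Y';
```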
That means that if we kept a set number of jobs and tried to rebalance them, some number of databases could get held up whenever we added a large database that had never had a full backup taken.
Picture an Agent job responsible for backing up 199 databases. We add another to it, Database200, and it needs a full. If that full takes 5-10 minutes, the other 199 databases don’t have a log backup start until that’s done. That breaks RPO.
Since each worker is only responsible for one backup, and one database at a time, the other 9 workers can still work on other log backups. This gives us a higher degree of safety, and a better shot at staying within RPO.
I hope this clarifies a bit why we chose to use a queue in this scenario.
Thanks for reading!