Backup speed isn’t the sexiest thing DBAs spend time on during the day. It’s kind of boring.
Backups don’t complain that they’re not running fast enough. Users, those are the ones who complain: “My query’s not fast enough. Why can’t I do a cross join between data warehouse tables?” The squeaky wheel gets the lube, so we bend the users over the – wait, where was I going with that?
Backup speed also worsens very gradually over time, taking longer and longer to complete each night. DBAs assume the time increase is caused by more database data, but you know what they say about assume: there’s no lube involved. What if the time has nothing to do with the database size?
In the next couple of posts, I’ll explore some backup bottlenecks I found when doing a Backup Health Check recently for a company with a couple dozen SQL Servers. Today, put your lifejackets on, because we’re in for…
The Perfect MSDB Storm
Any one or two of these things wouldn’t have been a problem, but combine them all together and we had a disaster on our hands:
- SQL Server 2000 – which by itself isn’t a big deal.
- MSDB was on the C drive – which in this case, happened to be a very slow pair of ancient drives. The rest of the server was on a pretty quick SAN with RAID 10 arrays for data and logs, but MSDB was still stuck.
- The server had over 100 databases – again, no big deal.
- Transaction log backups ran every 15 minutes – of all 100 databases. That’s a lot of backup history data pouring into MSDB.
- MSDB cleanup jobs weren’t set up initially – which meant that weeks of backup history data started turning into months, and then into years.
- MSDB had grown to 3GB – on SQL Server 2000, this is a huge problem because it’s not indexed well.
Combine these factors, and we had The Perfect MSDB Storm. I couldn’t add MSDB cleanup jobs at this point because they couldn’t finish in any reasonable time. Search the web for problems like this, and you’ll find people in forums beating their chests and wailing for lifejackets. I tried a couple of stored procs and an ordered-delete trick, but doing these deletes meant holding a lock on the MSDB backup tables. The business wasn’t comfortable not being able to back up while I ran these deletes.
Kids – don’t try this at home. We’re what you call experts. We’ve got years of experience that keeps us safe.
- I scripted out the foreign key constraints on the relevant MSDB tables (backupset, backupfile, backupmediaset, etc)
- I disabled the backup jobs and waited for them to stop
- I dropped the constraints
- I truncated the tables
- I created the foreign key constraints again
- I enabled the backup jobs
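The drop-truncate-recreate dance above, sketched as T-SQL. The table names are the real msdb history tables, but the constraint names below are placeholders – script out your server’s actual foreign keys (sp_helpconstraint on each table) and use those definitions. Again: don’t try this at home.

```sql
USE msdb;
GO
-- 1. Drop every foreign key that references the history tables.
--    (These FK names are placeholders; the real names vary by version,
--    so use the ones you scripted out, and drop ALL of them.)
ALTER TABLE dbo.backupfile        DROP CONSTRAINT FK_backupfile_backupset;
ALTER TABLE dbo.backupmediafamily DROP CONSTRAINT FK_backupmediafamily_backupmediaset;
ALTER TABLE dbo.backupset         DROP CONSTRAINT FK_backupset_backupmediaset;
ALTER TABLE dbo.restorehistory    DROP CONSTRAINT FK_restorehistory_backupset;

-- 2. TRUNCATE deallocates whole pages instead of logging row-by-row
--    deletes, which is why it finishes instantly.
TRUNCATE TABLE dbo.restorefilegroup;
TRUNCATE TABLE dbo.restorefile;
TRUNCATE TABLE dbo.restorehistory;
TRUNCATE TABLE dbo.backupfile;
TRUNCATE TABLE dbo.backupmediafamily;
TRUNCATE TABLE dbo.backupset;
TRUNCATE TABLE dbo.backupmediaset;

-- 3. Re-create the foreign keys from your scripted definitions,
--    then re-enable the backup jobs.
```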
The good news: all the backup history was instantly gone without waiting for slow logged deletions.
The bad news: all the backup history was instantly gone. This wasn’t a problem for this client because they used Quest LiteSpeed, and it has its own backup history repository anyway. (Shameless product plug: when your server goes down, what good does a local MSDB backup history do you anyway? LiteSpeed’s repository is centralized on another server, so you can take action on restoring backups faster.) This is not a great solution, and I wouldn’t post the code here for that reason, but it did work.
The bottom line: backup times dropped by 2/3! The nightly backups had been taking two hours, but a whopping 90 minutes of that was spent just updating MSDB tables with the backup history. They now take under 40 minutes. This certainly isn’t a typical result, but take a minute to make really sure that you’re cleaning up your MSDB history regularly.
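If you’re wondering whether your own MSDB is headed the same way, here’s a quick sanity check. backupset is the real msdb history table; what counts as “too much” is your call:

```sql
-- How much backup history has piled up, and how far back does it go?
SELECT COUNT(*)               AS backup_sets,
       MIN(backup_start_date) AS oldest_history,
       MAX(backup_start_date) AS newest_history
FROM msdb.dbo.backupset;
```

If the oldest row is years old and nobody set up a cleanup job, you’ve found your suspect.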
A very interesting and informative read Brent, thanks!
One question if I may, probably because I’m missing something here but anyway, how did you identify that the MSDB database itself was causing slower execution of the backup jobs? Was it a case of deduction, so to speak, as a result of the combination of all the points you detailed or was there something specific?
Great question. In this particular case, I could tell because the DBAs tried to implement MSDB cleanup jobs and those wreaked havoc. If that hadn’t happened, I could have identified it by running sp_who2 during the backup process, by running Profiler to catch the slow statements, or by looking at wait stats to see where the holdups were. If the waits during the backup window weren’t on the backups themselves, I could then look at what the slow statements were.
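On SQL Server 2005 and later, that wait check could be as simple as this sketch (on SQL 2000 you’d poke at master..sysprocesses instead, since the DMVs don’t exist there):

```sql
-- What are the backup sessions currently waiting on?
SELECT r.session_id, r.command, r.wait_type, r.wait_time,
       r.blocking_session_id
FROM sys.dm_exec_requests AS r
WHERE r.command LIKE 'BACKUP%';
```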
Very useful to know, thanks!
Quest LiteSpeed. Perhaps it’s time I get me one of those!
Great job as always Brent. When it’s time for me to blog I always read your stuff to get me ramped up!
Ha! Thank you, sir.
I was interested in preserving some history and ended up going with deletes, but tried to optimize as much as possible. This is what I came up with: delbackuphist
Quite a few years later yet still very useful. Thanks for sharing the script.
Mr Ozar, once again you have wowed me with some crazy DBA knowledge. Thank you for sharing your knowledge with us all. Especially the part where you told us how you knew to look at this. Good times.
“on SQL Server 2000, this is a huge problem because it’s not indexed well.”
Wow! Thank god I’ve never forgotten to set up my history cleanup tasks!
Geoff Hiten had a good post on this topic a while back, including adding some suggested performance indexes for msdb. I wonder if having these or other indexes would help make the msdb maintenance process take less time – especially for SQL 2000?
I just checked it with SQL 2014:
At least in this version, all the indexes suggested by Geoff Hiten’s script exist (except one on backupset(backup_start_date), but I think it’s not important because an index on backup_finish_date exists instead).
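For reference, the kind of date index under discussion looks roughly like this (the index name here is my own invention; Geoff’s script uses its own names):

```sql
-- Helps sp_delete_backuphistory find old rows by date
-- instead of scanning the whole backupset table.
CREATE NONCLUSTERED INDEX IX_backupset_backup_finish_date
    ON msdb.dbo.backupset (backup_finish_date);
```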
I ran into this problem too. We had over 7 years of uncleared backup history in MS SQL 2000, 160 databases, and transaction log backups every 15 min. The major symptom was that right clicking a database to view properties in Enterprise Manager or SSMS would lock the application for 4-9 minutes.
Just to clarify: if I set my maintenance plan to delete backup history, will I have any issue in the future just from deleting that history? Or are those records used by SQL Server only as history? Pardon my poor knowledge.
Jeffry – no apologies necessary! No, you won’t have any issues as long as you don’t delete the last day or so. If you’re doing log shipping, for example, your log shipping software may require knowledge of the most recent full/differential/t-log backups.
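For the manual equivalent of that maintenance plan task, you pass a cutoff date so the recent history survives (30 days here is just an example – pick whatever your log shipping or backup software needs):

```sql
-- Purge backup/restore history older than 30 days,
-- keeping the recent rows other tooling may still need.
DECLARE @cutoff datetime;
SET @cutoff = DATEADD(day, -30, GETDATE());
EXEC msdb.dbo.sp_delete_backuphistory @oldest_date = @cutoff;
```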
Thanks a lot, Brent. I have a backup that takes more than 12 hours – it’s a blob DB – and this could be a good reason why it’s taking so long. I found history records from 4 months ago.
You can schedule and configure the sp_delete_backuphistory job from http://ola.hallengren.com/scripts/MaintenanceSolution.sql, too.
Stumbled across this old post and it got me thinking: is this still relevant in today’s versions of SQL Server (aka 2008 R2 & 2012)?
Kevin – great question. I haven’t had the chance to benchmark a horrendous MSDB setup lately – it’s not something I’d put a lot of time into. Are you asking because you’d like to keep years of history in MSDB?
Still relevant in my environment:
2008 R2 – 100 databases – Log shipping secondary 15 minute intervals.
Started getting system memory exceptions in SSMS, and “kill kill kill” VIH (voices in head) while waiting for databases to drop when also deleting their associated history.
Ran the purge sp, decreasing record counts in, for example, backupfile from ~8 million to ~5 million. No more errors. Added nightly purge step.
“Saving your time with horrendous setups” – Alan
Hi. I’m a little late to the party but… is this still relevant with availability groups, or will it break them?
Gabriele – still relevant. Each replica stores its own backups in its own msdb.
Brent – Yeah, this is an issue on my end, and I had to do the same thing you did. As I don’t care about backup history in this case, I whacked it and all of a sudden things got much, much better.
Another approach I found useful was this one using partitioning to purge the tables
Man, I am really, really nervous about that. I don’t think I’d encourage that approach. Something goes wrong, and you’re left with a wildly unsupported MSDB and busted server.
In the case of a large MSDB, would it work to find the oldest items, and then repeatedly run sp_delete_backuphistory in a loop (with a WAITFOR DELAY), starting with the oldest date and incrementing it by one day at a time? Or is the performance so bad that it can’t even clean out the oldest day in a reasonable time?
If that does work, then it would be simpler (and “acceptably slower”) than the specialist approach with dropping/recreating constraints etc.
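Sketched out, that day-at-a-time loop might look like this (the delay and the 30-day retention window are arbitrary choices, not recommendations):

```sql
-- Walk forward from the oldest history, purging one day per pass.
DECLARE @oldest datetime, @keep_after datetime;

SELECT @oldest = MIN(backup_start_date) FROM msdb.dbo.backupset;
SET @keep_after = DATEADD(day, -30, GETDATE());  -- keep the last 30 days

WHILE @oldest IS NOT NULL AND @oldest < @keep_after
BEGIN
    SET @oldest = DATEADD(day, 1, @oldest);
    EXEC msdb.dbo.sp_delete_backuphistory @oldest_date = @oldest;
    WAITFOR DELAY '00:00:10';  -- give the backup jobs breathing room
END;
```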
John – unfortunately there aren’t supporting indexes for that, and that’s where these techniques come in.
I just created the missing indexes on the msdb tables (with my company name inside so I can recognize them)…
11 years later, I’ve come across this post while enduring the “perfect msdb storm” you describe. I’m a new DBA on a SharePoint team that hasn’t had a DBA before; among other scary findings, msdb has apparently not been cleaned up since 2017. The system DBs have their own drive, which is 100 GB, and msdb is taking up 97 GB of that. I’m currently running sp_delete_backuphistory a month at a time, because doing it all at once is not an option with the current incrementals running every 15 minutes (hundreds of DBs at a time, most of them huge)… but it seems to have no effect on the drive usage at all.
My only concern with truncating the tables as you mention in the article is that I may not get approval to actually get rid of every bit of backup/restore history – it seems like a good idea to keep at the very least a day or two, if not a month. I understand that truncating will actually deallocate the space, as opposed to a simple delete of the rows (which is what I need), but is there another way I can accomplish that without losing absolutely everything?
I really appreciate any help, or points in the right direction you could give.
It’s beyond what I can answer quickly in a blog comment, unfortunately.
that’s unfortunate, but thank you for your response.
Thanks for your post.
I was faced with the same problem on my standby servers and just did the same as you did. It worked great and allowed me to get out of the situation I was stuck in.
I can live with no backup history until new history builds up on the standby servers.