This week, developer tools company GitLab had a serious database outage.
The short story:
- An admin was trying to set up replication
- The site had an unusual load spike, causing replication to get behind
- While struggling with troubleshooting, the admin made a lot of changes
- After hours of work, the admin accidentally deleted the production database directory
You can read more about the details in GitLab’s outage timeline doc, which they heroically shared while they worked on the outage. Oh, and they streamed the whole thing live on YouTube with over 5,000 viewers.
There are so many amazing lessons to learn from this outage: transparency, accountability, processes, checklists, you name it. I’m not sure that you, dear reader, can actually put a lot of those lessons to use, though. After all, your company probably isn’t going to let you live stream your outages. (I do pledge to you that I’m gonna do my damnedest to do the same with our own services, though.)
I want you to zoom in on one particular part: the backups.
After the above stuff happened, it was time to recover from backups. In the outage timeline doc, scroll down to the Problems Encountered section, and you’ll see 7 bullet points. GitLab used PostgreSQL, Linux, and Azure VMs, but I’m going to translate these into a language that you, the SQL Server user, can understand.
Their 7 layers of protection were:
- LVM snapshots taken every 24 hours
- Regular backups every 24 hours
- Disk snapshots in Azure
- Synchronization to staging
- Replication
- Backups to Amazon S3
- Backup failure alerts
Let’s turn this into SQL Server on Windows language.
1. OS volume snapshots
In Windows, these are VSS (shadow copy) snaps. They freeze SQL Server’s writes for a matter of seconds to get a consistent picture of the MDF/NDF/LDF files of all of your databases on that volume. (These are not database snapshots, which are also useful in some cases, but unrelated.)
VSS is a building block, and you don’t hear Windows admins just using the term VSS by itself without also referring to a third party backup product. These are usually the products you despise, like NetBackup, which use VSS to substitute for full backups. Depending on your vendor, you may or may not be able to apply additional point-in-time transaction log backups to them. If the product doesn’t have that capability, it usually resorts to doing VSS snaps every X minutes, so it looks like you have a full backup every X minutes that you can restore to – but no other point in time.
Because of that, they’re usually a last-resort for SQL Server users where point-in-time recovery is required. (However, they’re better than nothing.)
2. Regular database backups
You’re familiar with native BACKUP DATABASE commands in SQL Server, and you probably know the difference between:
- Full backups – typically done daily or weekly
- Log backups – done every X minutes, and you should probably be doing them more often
These are usually a DBA’s first choice for recovery. However, you’re only as good as your last restore. (In GitLab’s case, their backups were failing silently.)
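As a refresher, here’s a minimal sketch of both backup types. The database name and file paths are placeholders, not real ones:

```sql
-- Hypothetical example: a nightly full backup plus a frequent log backup.
-- CHECKSUM helps catch corruption at backup time; INIT overwrites the file.
BACKUP DATABASE YourDatabase
    TO DISK = N'D:\Backups\YourDatabase_Full.bak'
    WITH CHECKSUM, COMPRESSION, INIT;

-- Log backups only work in the FULL or BULK_LOGGED recovery model.
BACKUP LOG YourDatabase
    TO DISK = N'D:\Backups\YourDatabase_Log.trn'
    WITH CHECKSUM, COMPRESSION, INIT;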
I adore transaction log shipping because it’s essentially testing my log backups all the time. Log shipping is easy to set up, nearly bulletproof, and works with all versions/editions of SQL Server. Don’t think of it as just your disaster recovery: it’s also verifying that you’ve got good backup files.
3. Disk snapshots in Azure
On premises, this is the equivalent of a SAN snapshot or a VMware snapshot. The exact implementation details can either mean that the entire VM is snapshotted, or just the data/log drives.
This is a great insurance policy, and I hear some advanced SQL Server DBAs saying they do this before they undertake something dangerous like a SQL Server version upgrade or a massive schema change. However, rollback is all-or-nothing: if you revert the snapshot, you’re going to lose everything since the snapshot. (That’s why it makes sense for scheduled outages involving big changes with no simultaneous end user access.)
The usual problem with relying on volume snapshots as part of your normal recovery routine (not manual snaps) is that they’re done outside of the SQL Server DBA’s territory. The SAN admin usually controls when they happen, and who has access to them. If you’re going to rely on volume snapshots as part of your backup plan, you have to test those snaps.
In a perfect world, you build automation so that your snapshots are immediately made available to a second SQL Server, which then performs a CHECKDB on that snap. However, that costs licensing money plus personnel time, so I rarely see it done. Folks just assume their snaps are okay – but the SAN error emails aren’t sent to the DBAs.
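If you did build that automation, the second server’s side of it might look something like this sketch. The database name and snapshot mount paths are assumptions for illustration:

```sql
-- Hypothetical example: after presenting a snapshot's files to a second
-- server, attach them under a different name and run a consistency check.
CREATE DATABASE YourDatabase_SnapCheck
    ON (FILENAME = N'E:\SnapMount\YourDatabase.mdf'),
       (FILENAME = N'E:\SnapMount\YourDatabase_log.ldf')
    FOR ATTACH;

DBCC CHECKDB (YourDatabase_SnapCheck) WITH NO_INFOMSGS, ALL_ERRORMSGS;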
4. Synchronization to staging
In GitLab’s case, they were pulling parts of the data to another environment. (GitLab users – I’m going to take some liberties here with the description.)
In your case, think about a periodic ETL process that takes data from production and pushes it into staging tables in a data warehouse. If the poop really hit the fan hard, you might be able to recover some of your most critical data that way.
The DBAs in the audience might immediately give that method the finger, but keep in mind that we’re in a new age of DevOps here. If everybody’s a developer, then you can do more creative code-based approaches to recovery.
5. Replication
Outside of SQL Server, it’s common to see replication used as a high availability and disaster recovery technology. Other platforms just echo their delete/update/insert operations to other servers.
GitLab wrote that their “replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented” – and I can certainly relate to that. I’ve been in replication environments like that. (I built one like that when I took the Microsoft Certified Master exam, ha ha ho ho.)
In the SQL Server world, I see a lot of replication setups like that, so the thought of using replication for HA/DR usually prompts reactions of horror. It’s just a cultural thing: we’re more accustomed to using either direct copies of transaction log data, or direct copies of the data pages.
So when you read “replication” in GitLab’s post-mortem, think database mirroring or Always On Availability Groups. It’s not the same – it’s just what we would culturally use.
Just like your backups can fail, your AGs can fail. Replication breaks, service packs have surprises, all kinds of nastiness. Data replication in any form doesn’t make your job easier – it becomes harder, and you have to staff up for it.
6. Backups to Amazon S3
I’m a huge fan of cross-cloud backups because stuff happens. It shouldn’t be your primary failover mechanism – especially if you rely on proprietary services that only one cloud provider offers – but it should be a line in your insurance policy.
It just so happens that Tara’s working on a white paper on doing this with Google Compute Engine’s newfound SQL Server capabilities. More on that soon.
7. Backup failure alerts
I know way, way too many admins who set up an Outlook rule to dump all of their monitoring emails into a folder.
These admins say they go in and look at the contents of the folder every now and then to see if there’s anything important. They go in, do a quick scan, get overwhelmed by the thousands of emails, and move on to other stuff.
I know because I used to be one of ’em myself.
Even worse, when the backup monitoring stops working, these admins think that no emails = good news. They never stop to think that there’s a much bigger problem at play.
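Rather than trusting that no email means good news, check the backup history yourself. Here’s a minimal sketch that queries msdb for databases with no recent full backup; the 24-hour threshold is just an assumption you’d tune to your own SLAs:

```sql
-- Hypothetical example: flag databases with no full backup in the
-- last 24 hours, instead of assuming that "no email = good news."
SELECT d.name,
       MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
    ON b.database_name = d.name
   AND b.type = 'D'  -- D = full database backup
WHERE d.name <> 'tempdb'
GROUP BY d.name
HAVING MAX(b.backup_finish_date) IS NULL
    OR MAX(b.backup_finish_date) < DATEADD(HOUR, -24, GETDATE());
```

Run that from a central job that alerts loudly when it returns rows – and make sure those alerts go to a distribution list that actually works.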
I went into one giant corporation, had a room full of DBAs in suits, and the first sp_Blitz we ran pointed out that they had corrupt databases. The head DBA said condescendingly, “Your script must be wrong. We have CHECKDB jobs set up on all of our servers, and we never get failure emails.” Upon drilling down, we found out that the jobs had been failing for years, but the notification emails were going to a defunct distribution list. They permanently lost data, and had a very awkward discussion with the end users.
More layers of backups only help if they’re working.
No pointing fingers at GitLab today, folks.
Go run sp_Blitz on your own SQL Servers, look for priority 1 alerts, and get ’em fixed.
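Assuming you’ve already installed the free sp_Blitz stored procedure, running it is a one-liner:

```sql
-- Assumes sp_Blitz is installed in the database you're connected to.
EXEC dbo.sp_Blitz;
```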
I’d rather not see you on YouTube.
Good article, thanks! I’ve configured my production servers to send me an “All is well” email daily, just to be certain that the supporting systems (database mail, etc.) are functioning. In my opinion, if you route all of your server communications to an Outlook folder for occasional review, you might as well just turn the emails off and hope for the best.
Pity that with all those YouTubers watching, there’s no way for one or more of them to hit the “for God’s sake, don’t run that command” button! There but for the grace of God go I…
Thanks for the excellent write-up!
My pleasure, sir! (As I was reading their stuff, I thought, “I should type what I’m thinking.”)
Great write up Brent. Going to add a link to this in what I wrote about it. You focused on a good translation of action items at a more detailed level for SQL Server DBAs to jump into.
You actually did comment. For some reason WP was waiting on someone to approve it. I put the first one through.
Regarding OS volume snapshots : I have made a few attempts, never successful, to restore / attach MDF files that had been snapshotted via VSS / StorageCraft SPX. I don’t even bother trying anymore, during the occasional disaster recovery. I go straight to my most-recent BAK files.
Does MS actually suggest anywhere that VSS snapshots of online SQL Server files are restorable? I would be very interested in studying any references you may be able to provide, please. Thank you in advance.
Bradley – yeah, VSS snapshots are how tools like Microsoft DPM work.
It’s like the standard cliche…
The quality of your backup regime means nothing… the quality of your restore regime means everything.
“Even worse, when the backup monitoring stops working, these admins think that no emails = good news. They never stop to think that there’s a much bigger problem at play.”
I’ve made the SP that does the daily full backups send me an email regardless of whether or not the backup was successful. The mail will tell me if it was successful. Should I do the same for the half-hourly log backups? That would give me an awful lot of mails. For now, I have it send me a mail only when it fails. I do test backups of the important databases not daily, but more than once a week, including logs and DBCC CHECKDB.
Silent database backup failures are one of my bugbears, especially when using 3rd party tools which may not be under the control of the DBA team. The best way I found to get around this, was manually kicking off a daily script from a batch file as part of the morning checks – it hit all SQL Server instances to check the last backup date & time (for full & logs) and dumped the details of anything out of range into a text file for review.
Using this method we got alerted when NetBackup policies were accidentally amended or deleted, when someone added a new database and neglected to include it in the backups, when the SQL Agent had stopped and no one re-started it, or when there was some other failure of the monitoring agents or ticketing system.
It really opened my eyes to the number and type of silent failures there are!
Streamed it on YouTube? I smell another reality show concept!
As usual Typical Brentozaric explanation. Excellent!!!!
Brent, your closing H1 and comment, was exactly what this world needs more of. Less finger-pointing, more taking care of what needs taking care of.
“blaming” never fixes anything.
If we don’t receive error messages for a couple of hours, we know there must be something wrong!
Silent database backup failures! This, and the ongoing failure to test backups. And a great point on using log shipping as a way to test and/or monitor backups.