This week, developer tools company GitLab had a serious database outage.
The short story:
- An admin was trying to set up replication
- The site had an unusual load spike, causing replication to get behind
- In the course of troubleshooting, the admin made a lot of changes
- After hours of work, the admin accidentally deleted the production database directory
You can read more about the details in GitLab’s outage timeline doc, which they heroically shared while they worked on the outage. Oh, and they streamed the whole thing live on YouTube with over 5,000 viewers.
There are so many amazing lessons to learn from this outage: transparency, accountability, processes, checklists, you name it. I’m not sure that you, dear reader, can actually put a lot of those lessons to use, though. After all, your company probably isn’t going to let you live stream your outages. (I do pledge to you that I’m gonna do my damnedest to do exactly that with our own services, though.)
I want you to zoom in on one particular part: the backups.
After the above stuff happened, it was time to recover from backups. In the outage timeline doc, scroll down to the Problems Encountered section, and you’ll see 7 bullet points. GitLab used PostgreSQL, Linux, and Azure VMs, but I’m going to translate these into a language that you, the SQL Server user, can understand.
Their 7 layers of protection were:
- LVM snapshots taken every 24 hours
- Regular backups every 24 hours
- Disk snapshots in Azure
- Synchronization to staging
- Database replication
- Backups to Amazon S3
- Backup failure alerts
Let’s turn this into SQL Server on Windows language.
1. OS volume snapshots
In Windows, these are VSS (shadow copy) snaps. They freeze SQL Server’s writes for a matter of seconds to get a consistent picture of the MDF/NDF/LDF files of all of your databases on that volume. (These are not database snapshots, which are also useful in some cases, but unrelated.)
VSS is a building block, and you don’t hear Windows admins just using the term VSS by itself without also referring to a third party backup product. These are usually the products you despise, like NetBackup, which use VSS to substitute for full backups. Depending on your vendor, you may or may not be able to apply additional point-in-time transaction log backups to them. If the product doesn’t have that capability, it usually resorts to doing VSS snaps every X minutes, so it looks like you have a full backup every X minutes that you can restore to – but no other point in time.
Because of that, they’re usually a last resort for SQL Server users where point-in-time recovery is required. (However, they’re better than nothing.)
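Want to know whether a VSS-based product is quietly taking snapshot backups of your servers? Assuming the product records its activity in msdb (most VSS writers do), the backup history will show them flagged as snapshots. A quick sketch:

```sql
-- Check recent backup history for VSS (snapshot) backups vs. native backups.
-- VSS-initiated backups are recorded in msdb with is_snapshot = 1.
SELECT  database_name,
        backup_start_date,
        [type],          -- D = full, L = log, I = differential
        is_snapshot      -- 1 = taken via VSS, 0 = native BACKUP command
FROM    msdb.dbo.backupset
WHERE   backup_start_date >= DATEADD(DAY, -7, GETDATE())
ORDER BY backup_start_date DESC;
```

If rows show up with is_snapshot = 1 and you didn’t know about them, go find out what product is taking them – it may be resetting your differential base.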
2. Regular database backups
You’re familiar with native BACKUP DATABASE commands in SQL Server, and you probably know the difference between:
- Full backups – typically done daily or weekly
- Log backups – done every X minutes, and you should probably be doing them more often
These are usually a DBA’s first choice for recovery. However, you’re only as good as your last restore. (In GitLab’s case, their backups were failing silently.)
I adore transaction log shipping because it’s essentially testing my log backups all the time. Log shipping is easy to set up, nearly bulletproof, and works with all versions/editions of SQL Server. Don’t think of it as just your disaster recovery: it’s also verifying that you’ve got good backup files.
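The two backup types above boil down to a couple of commands. This is a minimal sketch – the database name and backup paths are placeholders, and COMPRESSION requires an edition that supports it:

```sql
-- Full backup, typically daily or weekly (path is a placeholder):
BACKUP DATABASE [YourDatabase]
TO DISK = N'\\backupshare\YourDatabase_FULL.bak'
WITH CHECKSUM, COMPRESSION;

-- Log backup, every few minutes, to enable point-in-time recovery:
BACKUP LOG [YourDatabase]
TO DISK = N'\\backupshare\YourDatabase_LOG.trn'
WITH CHECKSUM, COMPRESSION;
```

WITH CHECKSUM is the cheap insurance here: it verifies page checksums as the backup is written, so you find corruption at backup time instead of restore time.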
3. Disk snapshots in Azure
On premises, this is the equivalent of a SAN snapshot or a VMware snapshot. The exact implementation details can either mean that the entire VM is snapshotted, or just the data/log drives.
This is a great insurance policy, and I hear some advanced SQL Server DBAs saying they do this before they undertake something dangerous like a SQL Server version upgrade or a massive schema change. However, rollback is all-or-nothing: if you revert the snapshot, you’re going to lose everything since the snapshot. (That’s why it makes sense for scheduled outages involving big changes with no simultaneous end user access.)
The usual problem with relying on volume snapshots as part of your normal recovery routine (not manual snaps) is that they’re done outside of the SQL Server DBA’s territory. The SAN admin usually controls when they happen, and who has access to them. If you’re going to rely on volume snapshots as part of your backup plan, you have to test those snaps.
In a perfect world, you build automation so that your snapshots are immediately made available to a second SQL Server, which then performs a CHECKDB on that snap. However, that costs licensing money plus personnel time, so I rarely see it done. Folks just assume their snaps are okay – but the SAN error emails aren’t sent to the DBAs.
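The verification step itself is just plain CHECKDB once the snapshot has been presented to the second server and attached – the plumbing to get the snapshot over there is vendor-specific, so this sketch assumes that part is already automated:

```sql
-- Run on the second SQL Server against the database attached
-- from the snapshot copy. Name is a placeholder.
DBCC CHECKDB (N'YourDatabase')
    WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

If this comes back clean, you know two things: the snapshot is restorable, and the data in it isn’t corrupt. That’s the whole point of the exercise.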
4. Synchronization to staging
In GitLab’s case, they were pulling parts of the data to another environment. (GitLab users – I’m going to take some liberties here with the description.)
In your case, think about a periodic ETL process that takes data from production and pushes it into staging tables in a data warehouse. If the poop really hit the fan hard, you might be able to recover some of your most critical data that way.
The DBAs in the audience might immediately give that method the finger, but keep in mind that we’re in a new age of DevOps here. If everybody’s a developer, then you can do more creative code-based approaches to recovery.
5. Database replication
Outside of SQL Server, it’s common to see replication used as a high availability and disaster recovery technology. Other platforms just echo their delete/update/insert operations to other servers.
GitLab wrote that their “replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented” – and I can certainly relate to that. I’ve been in replication environments like that. (I built one like that when I took the Microsoft Certified Master exam, ha ha ho ho.)
In the SQL Server world, I see a lot of replication setups like that, so the thought of using replication for HA/DR usually prompts reactions of horror. It’s just a cultural thing: we’re more accustomed to using either direct copies of transaction log data, or direct copies of the data pages.
So when you read “replication” in GitLab’s post-mortem, think database mirroring or Always On Availability Groups. It’s not the same – it’s just what we would culturally use.
Just like your backups can fail, your AGs can fail. Replication breaks, service packs have surprises, all kinds of nastiness. Data replication in any form doesn’t make your job easier – it becomes harder, and you have to staff up for it.
6. Backups to Amazon S3
I’m a huge fan of cross-cloud backups because stuff happens. It shouldn’t be your primary failover mechanism – especially if you rely on proprietary services that only one cloud provider offers – but it should be a line in your insurance policy.
It just so happens that Tara’s working on a white paper on doing this with Google Compute Engine’s newfound SQL Server capabilities. More on that soon.
7. Backup failure alerts
I know way, way too many admins who set up an Outlook rule to dump all of their monitoring emails into a folder.
These admins say they go in and look at the contents of the folder every now and then to see if there’s anything important. They go in, do a quick scan, get overwhelmed by the thousands of emails, and move on to other stuff.
I know because I used to be one of ’em myself.
Even worse, when the backup monitoring stops working, these admins think that no emails = good news. They never stop to think that there’s a much bigger problem at play.
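The fix is boring: make sure job failure notifications go to a real, monitored address, not a dead distribution list. A minimal sketch using SQL Server Agent – the operator name, email address, and job name here are placeholders:

```sql
-- Create an operator pointing at a monitored mailbox (placeholders):
EXEC msdb.dbo.sp_add_operator
     @name          = N'DBA Team',
     @email_address = N'dba-team@yourcompany.com';

-- Wire an existing backup job to email that operator on failure
-- (job name is a placeholder; 2 = notify on failure):
EXEC msdb.dbo.sp_update_job
     @job_name                    = N'DatabaseBackup - FULL',
     @notify_level_email          = 2,
     @notify_email_operator_name  = N'DBA Team';
```

Then actually test it: break a job on purpose and confirm the email arrives. An alert path you’ve never seen fire is an alert path you can’t trust.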
I went into one giant corporation, had a room full of DBAs in suits, and the first sp_Blitz we ran pointed out that they had corrupt databases. The head DBA said condescendingly, “Your script must be wrong. We have CHECKDB jobs set up on all of our servers, and we never get failure emails.” Upon drilling down, we found out that the jobs had been failing for years, but the notification emails were going to a defunct distribution list. They permanently lost data, and had a very awkward discussion with the end users.
More layers of backups only help if they’re working.
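One simple “is it actually working” check: ask msdb which databases haven’t had a recent full backup, instead of trusting the silence of your inbox. sp_Blitz does a version of this for you, but here’s a minimal sketch:

```sql
-- Find databases with no full backup in the last 7 days (or ever).
SELECT   d.name,
         MAX(b.backup_finish_date) AS last_full_backup
FROM     sys.databases d
LEFT JOIN msdb.dbo.backupset b
       ON b.database_name = d.name
      AND b.[type] = 'D'            -- D = full backup
WHERE    d.name <> 'tempdb'
GROUP BY d.name
HAVING   MAX(b.backup_finish_date) IS NULL
      OR MAX(b.backup_finish_date) < DATEADD(DAY, -7, GETDATE());
```

Schedule something like this to run daily and page you when it returns rows. It alerts on the absence of backups, which is exactly the failure mode that silent-failure setups miss.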
No pointing fingers at GitLab today, folks.
Go run sp_Blitz on your own SQL Servers, look for priority 1 alerts, and get ’em fixed.
I’d rather not see you on YouTube.