I’ve been backing up SQL Servers for almost a decade now, and it’s time to share the lessons I’ve learned. All of this is my own opinion – your mileage may vary – but I’ll try to explain the reasoning behind the choices I make. I won’t address log shipping or snapshots this time around.
Never back up databases to local disk.
If the SQL Server crashes, especially due to a hardware problem or a severe OS problem, the local drives may not be available. In the event of a hardware disaster, I’ve been able to point to my junior guy and say, “Go build a new server and start doing restores from the network share while I try to resuscitate this server.”
Back up databases to a fileshare, then back the share up to tape.
Tape drives these days are fast enough that the vendors like to say DBAs should go straight to tape, and they’re technically right: tape backup & restore speed is not a bottleneck. However, there’s a limited number of drives available. When the DBA needs a restore right away, the tape drives aren’t necessarily sitting idle. At our shop, if multiple people need to do simultaneous restores or backups, there can be a lag time of hours. Disk backups, on the other hand, are always available.
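The backup itself doesn’t need anything fancy – just point the BACKUP command at a UNC path instead of a local drive letter. Here’s a rough sketch; the server, share, and database names are made up, and the SQL Server service account needs write permission on the share. In real life I’d put a date/time stamp in each file name rather than overwriting with INIT, but you get the idea:

-- Full backup straight to the network share (names are examples only)
BACKUP DATABASE SalesDB
  TO DISK = N'\\BACKUPSHARE\SQLBackups\SQLPROD1\SalesDB_Full.bak'
  WITH INIT, CHECKSUM, STATS = 10;

-- Transaction log backups go to the same share
BACKUP LOG SalesDB
  TO DISK = N'\\BACKUPSHARE\SQLBackups\SQLPROD1\SalesDB_Log.trn'
  WITH INIT, CHECKSUM, STATS = 10;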
Cost justify the network share with lower licensing costs & simpler backups.
The SAN & backup admins will need cost justification for a new dedicated array just for SQL backups. Tell them that it pays for itself by eliminating the need for backup agents on each SQL Server, plus it simplifies their lives because they can just have one backup policy. Back up everything on that network share once a day, and get it offsite ASAP. They don’t have to worry about what servers are on what schedules or when the peak loads are – they just have one easy task to back up that network share.
Back up to a different SAN if possible.
This one’s not a reality for all shops, but it’s saved my bacon, so I have to mention it. We had our production SAN go down for an entire weekend. I was able to build new database servers using local disk & another SAN, and then restore databases & transaction logs from the network share. The only reason that worked was that my backup array was on a different SAN than the production SAN.
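The restores themselves were just the standard full-plus-log sequence pulled off the share. A simplified sketch – the file names and paths are made up, and on a rebuilt server with a different drive layout you’d also need MOVE clauses to relocate the data and log files:

-- Restore the last full backup, but stay in restoring mode so logs can be applied
RESTORE DATABASE SalesDB
  FROM DISK = N'\\BACKUPSHARE\SQLBackups\SQLPROD1\SalesDB_Full.bak'
  WITH NORECOVERY, REPLACE, STATS = 10;

-- Apply each transaction log backup in order
RESTORE LOG SalesDB
  FROM DISK = N'\\BACKUPSHARE\SQLBackups\SQLPROD1\SalesDB_Log_01.trn'
  WITH NORECOVERY;

RESTORE LOG SalesDB
  FROM DISK = N'\\BACKUPSHARE\SQLBackups\SQLPROD1\SalesDB_Log_02.trn'
  WITH NORECOVERY;

-- Bring the database online once the last log is applied
RESTORE DATABASE SalesDB WITH RECOVERY;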
My sweet spot for the backup array is RAID 10 SATA.
Depending on the backup windows, multiple database servers may be writing backups to that same share simultaneously. RAID 10 gives better write performance than RAID 5, and while I’d love to have Fibre Channel backup drives, they’re cost-prohibitive. RAID 10 on SATA gets me the best balance of cost versus backup throughput. I’ll write a separate article about how to monitor performance metrics during backups to determine where the bottleneck is. Before I started benchmarking, I’d expected my bottleneck to be my gigabit network cards, but it turned out to be a RAID 5 SATA array.
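In the meantime, the backup history in msdb gives a quick read on the throughput you’re actually getting, which makes a slow array easy to spot. A rough example:

-- Throughput of the past week's full backups, in MB per second, slowest first
SELECT bs.database_name,
       bs.backup_start_date,
       bs.backup_size / 1048576.0 AS backup_size_mb,
       bs.backup_size / 1048576.0
         / NULLIF(DATEDIFF(second, bs.backup_start_date, bs.backup_finish_date), 0)
         AS mb_per_second
FROM msdb.dbo.backupset AS bs
WHERE bs.type = 'D'  -- full database backups
  AND bs.backup_start_date > DATEADD(day, -7, GETDATE())
ORDER BY mb_per_second;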
Backup agents like NetBackup and Backup Exec mean giving up scheduling control.
I steer clear of backup agents like Backup Exec, NetBackup and TSM because the schedules are generally dictated by the backup guys instead of the database administrators. I like having the control in the hands of the DBAs because they’re quicker to react to SQL Server problems. For example, if the nightly ETL jobs fail on a data warehouse, the DBAs can quickly pause the backup schedule or restore the databases without input from anybody else. I want my backup guys to concentrate on getting the data to tape and getting it offsite rather than worrying about application problems.
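When the backup schedule lives in SQL Server Agent, pausing it really is that quick. For example – the job name here is made up:

-- Pause the nightly backups while the data warehouse gets straightened out
EXEC msdb.dbo.sp_update_job @job_name = N'Nightly Full Backups', @enabled = 0;

-- Turn them back on once the restores are done
EXEC msdb.dbo.sp_update_job @job_name = N'Nightly Full Backups', @enabled = 1;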
Do regular fire drill rebuilds and restores.
At least once a quarter, do a restore from tape. When the DBA team is separate from the backup administrator, just tell the backup admin it’s a real restore request – say a developer lost something and needs it back from tape – and restore it.
Build a standalone restore testbed.
Stay buddies with the system administrators. Ask them to keep an eye out for leftover end-of-life servers in the shop that have enough internal drive space to do fire drill restores. I acquired a pretty slow server with a huge number of internal drive bays, picked up a bunch of big drives off eBay, and presto, I have a restore testbed. One of these days, I’ll automate the fire drill restore process so that each week it restores off the network share, but I haven’t quite gotten that ambitious yet. For now, I do the restores manually.
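The manual version is only a few statements anyway. A rough sketch – the logical file names and testbed drive letters are examples, so check yours with RESTORE FILELISTONLY first:

-- See what logical files are inside the backup before moving them
RESTORE FILELISTONLY
  FROM DISK = N'\\BACKUPSHARE\SQLBackups\SQLPROD1\SalesDB_Full.bak';

-- Restore onto the testbed's local drives under a throwaway name
-- (logical names below are examples; use the ones FILELISTONLY reports)
RESTORE DATABASE SalesDB_FireDrill
  FROM DISK = N'\\BACKUPSHARE\SQLBackups\SQLPROD1\SalesDB_Full.bak'
  WITH MOVE N'SalesDB_Data' TO N'E:\FireDrill\SalesDB_FireDrill.mdf',
       MOVE N'SalesDB_Log'  TO N'E:\FireDrill\SalesDB_FireDrill.ldf',
       STATS = 10;

-- Make sure what came back is actually usable
DBCC CHECKDB (SalesDB_FireDrill) WITH NO_INFOMSGS;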
Keep management informed on restore time estimates.
Every quarter, I look at the backup sizes & speeds, and estimate how long it will take to restore a server from tape. I turn that into a written report to management, and for each server, I give cost estimates to decrease that outage window. Sometimes the answer is third-party backup compression software, sometimes it’s more CPU power for the server, sometimes it’s adding memory, etc. The primary goal is to give them enough information to make a good business decision about whether it’s worth the money. The secondary goal is to CYA: when a server is down, and a restore is taking longer than the business owners would like, then it’s a great time to pull out last quarter’s report and say, “See, I told you this would take an hour to restore, and you said it wasn’t worth the $5,000 to shrink that window down. Should we reconsider that decision now?”
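The math for that report comes straight out of msdb. Something along these lines – the 50 MB/sec figure is just a placeholder, so plug in the restore rate you’ve actually measured:

-- Largest full backup on record per database, with a rough restore-time estimate
SELECT bs.database_name,
       MAX(bs.backup_size) / 1048576.0 AS largest_full_backup_mb,
       MAX(bs.backup_size) / 1048576.0 / 50.0 / 60.0 AS est_restore_minutes
FROM msdb.dbo.backupset AS bs
WHERE bs.type = 'D'  -- full database backups
GROUP BY bs.database_name
ORDER BY est_restore_minutes DESC;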
Trust no one.
Get an external FireWire or USB drive. Once a month, bring it into the office, hook it up to the server hosting the network share, and copy the entire backup share to that drive. It’ll take hours, and that’s OK. Take it home, or in today’s security-conscious world, take it to the bank and put it in a safe deposit box. Nobody should ever need to do this, but sometimes even the best backup administrators make mistakes.