I’ve been backing up SQL Servers for almost a decade now, and it’s time to share the lessons I’ve learned. All of this is my own opinion – your mileage may vary – but I’ll try to explain the reasoning behind the choices I make. I won’t address log shipping or snapshots this time around.

Never back up databases to local disk.

If the SQL Server crashes, especially due to a hardware problem or a severe OS problem, the local drives may not be available. In the event of a hardware disaster, I’ve been able to point to my junior guy and say, “Go build a new server and start doing restores from the network share while I try to resuscitate this server.”
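
In T-SQL, pointing backups at the share is just a matter of the target path. Here’s a minimal sketch; the share and database names are hypothetical, and the SQL Server service account needs write permission on the share:

```sql
-- Full backup straight to a UNC path instead of local disk.
BACKUP DATABASE SalesDB
TO DISK = N'\\BACKUPSERVER\SQLBackups\SQLPROD1\SalesDB_Full.bak'
WITH INIT, CHECKSUM, STATS = 10;
```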

Back up databases to a fileshare, then back the share up to tape.

Tape drives these days are fast enough that the vendors like to say DBAs should go straight to tape, and they’re technically right: tape backup & restore speed is not a bottleneck. However, there’s a limited number of drives available. When the DBA needs a restore right away, the tape drives aren’t necessarily sitting idle. At our shop, if multiple people need to do simultaneous restores or backups, there can be a lag time of hours. Disk backups, on the other hand, are always available.

Cost justify the network share with lower licensing costs & simpler backups.

The SAN & backup admins will need cost justification for a new dedicated array just for SQL backups. Tell them that it pays for itself by eliminating the need for backup agents on each SQL Server, plus it simplifies their lives because they can just have one backup policy. Back up everything on that network share once a day, and get it offsite ASAP. They don’t have to worry about what servers are on what schedules or when the peak loads are – they just have one easy task to back up that network share.

Back up to a different SAN if possible.

This one’s not a reality for all shops, but it’s saved my bacon, so I have to mention it. We had our production SAN go down for an entire weekend. I was able to build new database servers using local disk & another SAN, and then restore databases & transaction logs from the network share. It succeeded for exactly one reason: my backup array was on a different SAN than the production SAN.
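
That weekend boiled down to restore statements run against the share. A sketch of the sequence, with hypothetical file names:

```sql
-- Full backup first; NORECOVERY keeps the database open for log restores.
RESTORE DATABASE SalesDB
FROM DISK = N'\\BACKUPSERVER\SQLBackups\SQLPROD1\SalesDB_Full.bak'
WITH NORECOVERY, REPLACE, STATS = 10;

-- Then each transaction log backup, in order.
RESTORE LOG SalesDB
FROM DISK = N'\\BACKUPSERVER\SQLBackups\SQLPROD1\SalesDB_log_1.trn'
WITH NORECOVERY;

-- The last log brings the database online.
RESTORE LOG SalesDB
FROM DISK = N'\\BACKUPSERVER\SQLBackups\SQLPROD1\SalesDB_log_2.trn'
WITH RECOVERY;
```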

My sweet spot for the backup array is RAID 10 SATA.

Depending on the backup windows, multiple database servers may be writing backups to that same share simultaneously. RAID 10 gives better write performance than RAID 5, and while I’d love to have fiber channel backup drives, it’s cost-prohibitive. RAID 10 on SATA gets me the best balance of cost versus backup throughput. I’ll write a separate article about how to monitor performance metrics during backups to determine where the bottleneck is. Before I started benchmarking, I’d expected my bottleneck to be my gig network cards, but it turned out to be a RAID 5 SATA array.
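
Until then, msdb’s backup history is enough for a rough throughput number per backup. A sketch of one such query; slow MB/sec figures point at the array, the network, or the source drives:

```sql
-- Rough throughput for each full backup, from msdb's history tables.
SELECT database_name,
       backup_start_date,
       CAST(backup_size / 1048576.0 AS DECIMAL(12, 1)) AS backup_mb,
       DATEDIFF(SECOND, backup_start_date, backup_finish_date) AS duration_sec,
       CAST(backup_size / 1048576.0
            / NULLIF(DATEDIFF(SECOND, backup_start_date, backup_finish_date), 0)
            AS DECIMAL(12, 1)) AS mb_per_sec
FROM msdb.dbo.backupset
WHERE type = 'D'  -- full backups only
ORDER BY backup_start_date DESC;
```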

Backup agents like NetBackup and Backup Exec mean giving up scheduling control.

I steer clear of backup agents like Backup Exec, NetBackup and TSM because the schedules are generally dictated by the backup guys instead of the database administrators. I like having the control in the hands of the DBAs because they’re quicker to react to SQL Server problems. For example, if the nightly ETL jobs fail on a data warehouse, the DBAs can quickly pause the backup schedule or restore the databases without input from anybody else. I want my backup guys to concentrate on getting the data to tape and getting it offsite rather than worrying about application problems.
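
For example, pausing a nightly backup job is one call; a sketch, with a hypothetical job name:

```sql
-- ETL blew up? The DBA disables the backup job on the spot,
-- no ticket to the backup team required.
EXEC msdb.dbo.sp_update_job
    @job_name = N'Nightly Full Backups',
    @enabled = 0;
```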

Do regular fire drill rebuilds and restores.

At least once a quarter, do a restore from tape. When the DBA team is separate from the backup administrator, just tell the backup admin that it’s a real restore need, like a developer lost something and needs it back from tape. Restore it.

Build a standalone restore testbed.

Stay buddies with the system administrators. Ask them to keep an eye out for leftover end-of-life servers in the shop that have enough internal drive space to do fire drill restores. I acquired a pretty slow server with a huge number of internal drive bays, picked up a bunch of big drives off eBay, and presto, I have a restore testbed. One of these days, I’ll automate the fire drill restore process so that each week it restores off the network share, but I haven’t quite gotten that ambitious yet. For now, I do the restores manually.
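
The core of each manual fire drill is a restore with WITH MOVE, because the testbed’s drive layout doesn’t match production. A sketch with hypothetical logical file names; RESTORE FILELISTONLY reveals the real ones:

```sql
-- List the logical file names inside the backup first.
RESTORE FILELISTONLY
FROM DISK = N'\\BACKUPSERVER\SQLBackups\SQLPROD1\SalesDB_Full.bak';

-- Restore under a new name, relocating the files to the testbed's drives.
RESTORE DATABASE SalesDB_FireDrill
FROM DISK = N'\\BACKUPSERVER\SQLBackups\SQLPROD1\SalesDB_Full.bak'
WITH MOVE N'SalesDB' TO N'E:\Restores\SalesDB.mdf',
     MOVE N'SalesDB_log' TO N'E:\Restores\SalesDB_log.ldf',
     RECOVERY, STATS = 10;
```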

Keep management informed on restore time estimates.

Every quarter, I look at the backup sizes & speeds, and estimate how long it will take to restore a server from tape. I turn that into a written report to management, and for each server, I give cost estimates to decrease that outage window. Sometimes the answer is third-party backup compression software, sometimes it’s more CPU power for the server, sometimes it’s adding memory, etc. The primary goal is to give them enough information to make a good business decision about whether it’s worth the money. The secondary goal is to CYA: when a server is down, and a restore is taking longer than the business owners would like, then it’s a great time to pull out last quarter’s report and say, “See, I told you this would take an hour to restore, and you said it wasn’t worth the $5,000 to shrink down. Should we reconsider that decision now?”
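
The raw numbers for that report come out of msdb. A sketch of one possible starting point, using each database’s last full backup duration as a rough stand-in for restore time:

```sql
-- Duration of the most recent full backup per database; restore time
-- usually lands in the same ballpark, close enough for a quarterly estimate.
SELECT database_name,
       MAX(backup_finish_date) AS last_full_backup,
       DATEDIFF(MINUTE, MAX(backup_start_date), MAX(backup_finish_date)) AS duration_min
FROM msdb.dbo.backupset
WHERE type = 'D'
GROUP BY database_name
ORDER BY duration_min DESC;
```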

Trust no one.

Get an external FireWire or USB drive. Once a month, bring it into the office, hook it up to the server hosting the network share, and copy the entire backup share to that drive. It’ll take hours, and that’s OK. Take it home, or in today’s security-conscious world, take it to the bank and put it in a safe deposit box. Nobody should ever need to do this, but sometimes even the best backup administrators make mistakes.

  1. Well written. I really liked the second-to-last point, the “management report”. It makes the DBA’s presence felt, and it’s a good approach to “slowly” convince management to put more faith in us.

    I laughed at the last point, though. I don’t have enough $$ to buy a big enough drive to copy the backups, and I can’t get into the data center to copy them, either.

  2. Brent,
    In the paragraph above, “My sweet spot for the backup array is RAID 10 SATA,” you said, “I’ll write a separate article about how to monitor performance metrics during backups to determine where the bottleneck is.”
    Did you write this somewhere? I cannot find it.
    Please update, thanks.

  3. Forgot to say: very nice article. I used it already and cut my backup time from 24 hours to 4 hours (WOW!!!).
    Thanks

  4. When you did your benchmarking on the backup process, where did the network overhead fall in the grand scheme of things? Just curious, as I am trying to get some metrics around backing up locally vs. directly to NAS/SAN…

    cheers and thanks for the great ideas…

    • Typically the network wasn’t the bottleneck as long as I was doing backup compression. I consider backup compression an absolute necessity these days, so I didn’t dig deeper than that.
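
      Here’s the shape of it: a sketch assuming SQL Server 2008’s native compression (third-party tools fill the same role on older builds), with hypothetical names:

      ```sql
      -- Compressed backup: less data over the wire, so the network
      -- stops being the bottleneck.
      BACKUP DATABASE SalesDB
      TO DISK = N'\\BACKUPSERVER\SQLBackups\SQLPROD1\SalesDB_Full.bak'
      WITH COMPRESSION, CHECKSUM, INIT, STATS = 10;
      ```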

  5. The post is very helpful for all DBAs. Thanks Brent!

  6. One more thing about backing up to tape that should be mentioned: tapes are often taken offsite (or should be) in case of a physical data center disaster. While such an event should be rare, there is probably going to be a delay while retrieving the tapes. It’s best to keep an onsite backup on disk.

  7. One of the things you mention:

    “Cost justify the network share with lower licensing costs & simpler backups.”

    Are you talking about backups to a normal Windows share using the CIFS protocol, or do you have another technique? I know CIFS backups are not recommended by Microsoft because the physical write to disk is not guaranteed. I have had to rebuild log shipping environments for this very reason.

    • Yes, normal Windows shares. Where have you seen that those aren’t recommended by Microsoft? I’d love to see that – it might be a misunderstanding that needs to be cleared up.

  8. Remote CIFS drives don’t constitute what Microsoft would call stable media, and as such are occasionally subject to lost-write conditions as defined here:

    http://technet.microsoft.com/en-us/library/cc966500.aspx

    I’ve re-examined the Microsoft literature, and it actually does say that backing up to network shares can be subject to network errors and that backups should be verified.

    http://msdn.microsoft.com/en-us/library/ms179313.aspx

    I know from experience that when setting up log shipping to a remote network share, the whole environment can get corrupted when the destination subsystem experiences heavy load.

    • These same guidelines apply to all backup targets. Are you saying you never verify your backups that are written to local disks? I assure you I’ve seen many cases where those have been corrupt as well.
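
      Verification is one statement, whatever the target; a sketch with a hypothetical path:

      ```sql
      -- Check a backup without restoring it. WITH CHECKSUM revalidates
      -- the checksums recorded at backup time.
      RESTORE VERIFYONLY
      FROM DISK = N'\\BACKUPSERVER\SQLBackups\SQLPROD1\SalesDB_Full.bak'
      WITH CHECKSUM;
      ```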

      • By default I normally verify all backups; if I’m doing ad-hoc backups I may not always do this, depending on size. However, it seems Microsoft does not configure log shipping this way by default, hence the corruption I’ve previously experienced.

        It’s good, though; revisiting this subject gives me more confidence in backing up to network shares. I’m guessing there must be a way to force log shipping to verify its backups.

  9. Brent,
    I am one of those rare people, it seems, who was a very accomplished Windows AD/Exchange/App admin and then decided to become a DBA. Above you reference not using backup agents, and I don’t get this. It seems like each DBA wants to do their own style of backups, and as an ex-admin I think standardization is key, especially since the Oracle, DB2, and Teradata guys do this with no issues at all. It was a pain to set up, but they just restored an 8 TB Oracle DB with no issues, so we know it works. Note: we just replaced our whole backup infrastructure, so it’s all new setups right now. SQL is next, hopefully.

    It makes it hard to sell SQL Server to management when it has very different backup requirements, especially since we have non-trusted dev/qa/prod domains that all share the same backup infrastructure. SQL native backups seem not to want to work across untrusted domains to file shares.

    The senior DBA and I get into arguments about this constantly, and I wanted to get your thoughts.

    B

    P.S. I enjoyed the book. I enjoy this blog and read your chapters right away. I took 8 pages of notes that should be very helpful.

    • Hi! Lots of questions in there. I’m heading out on vacation and answering this from my phone, but wanted to jot down a couple of thoughts.

      First, it sounds like you’re saying you can’t sell SQL to management because you have so many choices on how to back it up, whereas the other platforms don’t have any choices. I’d call that an advantage for SQL Server, actually. There’s a ton of ways you can do it, and that flexibility gives SQL DBAs power if they use it wisely.

      SQL backups can work fine across multiple untrusted domains – I’ve got several clients doing this. What specific errors or challenges are you having?

      • OK, I made that a whole lot more complicated than it needed to be. I really only had one question. First, though, enjoy your vacation. You can answer this whenever, as I was just looking for more experienced DBAs’ opinions, and yours are usually right on.

        The real question from the mess was:

        Why are senior DBAs so insistent on their native backups versus using the backup plugin of a product like NetBackup, etc.?

        • Ah, gotcha. The reason is that native backups give us much, much, MUCH more flexibility and power. For example, we can do things like log shipping, which lets us maintain a disaster recovery server with data as up-to-date as the last 60 seconds – or less. Products like NetBackup can’t touch that.
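
          A sketch of the moving parts, with hypothetical names:

          ```sql
          -- On the primary, every minute or so:
          BACKUP LOG SalesDB
          TO DISK = N'\\BACKUPSERVER\SQLBackups\SalesDB_0900.trn'
          WITH CHECKSUM;

          -- On the standby, restore it, staying ready for the next log:
          RESTORE LOG SalesDB
          FROM DISK = N'\\BACKUPSERVER\SQLBackups\SalesDB_0900.trn'
          WITH NORECOVERY;
          ```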

  10. Hi Brent,

    I’m a new SQL Server DBA. I would like to know if there are any GUI tools or scripts available that will make it much easier to monitor backups for SQL Server databases that are hosted on multiple servers?

    I’ve been doing this for three weeks now, so any advice or suggestions would be greatly appreciated.

    Regards,
    Jim

    • Hi, Jim. I work for Quest Software, and their LiteSpeed for SQL Server does exactly that. It has a central repository that stores the data for all backups across the enterprise in one place. We’ve also got SQL Server Reporting Services reports you can use to let anybody view the backup status too. Here’s the product page:

      http://www.quest.com/litespeed-for-sql-server/

      I haven’t seen any open source scripts to do the same thing because of the difficulty in gathering backup status across lots of servers. You’d need to build some kind of linked server system or remote querying utility. That gets kind of painful, especially with servers separated by firewalls or in different domains.

      Hope that helps!
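
      That said, for any single server, plain msdb queries make a decent starting point. A sketch (not LiteSpeed’s code) that flags databases with no recent full backup:

      ```sql
      -- Databases with no full backup in the last 24 hours (or ever).
      SELECT d.name,
             MAX(bs.backup_finish_date) AS last_full_backup
      FROM sys.databases AS d
      LEFT JOIN msdb.dbo.backupset AS bs
             ON bs.database_name = d.name
            AND bs.type = 'D'
      WHERE d.name <> N'tempdb'
      GROUP BY d.name
      HAVING MAX(bs.backup_finish_date) IS NULL
          OR MAX(bs.backup_finish_date) < DATEADD(HOUR, -24, GETDATE());
      ```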

  11. Brent,

    I greatly appreciate your response.

  12. Hey Brent,

    I actually have a question about backup schedules. Is there any wisdom in staggering your backups so that they aren’t all going on at the same time? Specifically the full backups that are done each night. We have a share that all of our databases write their backups to, which is then backed up to tape every night. This is on a SAN with SATA drives in RAID 10… or so I’m told by our SAN admin :-). All of our full backups are scheduled to happen at the same time each night during our off hours. Some of the jobs have been failing, reporting back that the network path could no longer be found. I’m fairly certain that it is a network issue or something wrong with the SAN, because when I re-run the jobs that failed, they still sometimes fail, and this is during a time when the other backups aren’t happening.

    I’ve gotten a little bit off topic here since I wasn’t really looking for an answer to my problem, but it got me thinking about scheduling and I was wondering how you handle your scheduling when you’re dealing with a lot of servers and databases. There’s obviously a limitation to what the drives and the network can do… when is too much too much?

    • Shane – that’s a good question. There definitely is wisdom in staggering them, but to find out if that’s a problem, use Perfmon as I describe on this site. (Search for Perfmon to see the tutorial videos.) Run those counters on the file server hosting the backups, and you’ll find out pretty quickly whether your writes are within Microsoft’s guidelines. Hope that helps!

  13. I just want to thank you for taking the pains to write these fine-quality posts. You just saved me two days of research!

  14. Brent,
    We have been using a script to back up our databases for several years. This script deletes all backups older than a certain date once the backups are complete.

    Do you use this method, or do you use the RETAINDAYS parameter with your backups? What are the pros and cons of these two methods?
    Thanks!
    Sam

  15. Have you written the article on “how to monitor performance metrics during backups” that you mentioned here?

    It would certainly be a very useful one.

    sonny

  16. Pingback: SQL Server 2005/2008 Setup Checklist | Brent Ozar PLF

  17. Great post! I’ve been searching around for a backup post like this, especially one that agrees with me. We go straight to tape and yes, just like you said, do not have enough drives to support the madness that could happen if we needed lots of backups all at once.

    Definitely not local disk, but could you get around the server crash problem with a LUN that you could map to another host? Or should I stick with the \\ and try to fix the network bottleneck with more compression?

  18. Hi,
    I’ve heard that if we take the backup to the following path, which is where SQL Server is installed:

    \Microsoft SQL Server\MSSQL.2\MSSQL

    there is a possibility of losing the backup if SQL crashes or gets corrupted – the folder may not be accessible if SQL crashes. Is this true?
    Thanks,

  19. Our 200-server SQL backup regimen, 300 TB to network shares and then to tape (as suggested above), is working; however, it is costly and taxing our infrastructure… We are growing and looking for a better SQL backup solution.

    Do any new technologies exist now in 2013, such as NetVault LiteSpeed for SQL or other CDP solutions, that will cut down on the sheer volume of data for nightly backups?

  20. A recent power surge (even going through an APC 1500 UPS) has generated some strange things.
    One thing I am unable to figure out is why SQL Server Management Studio started making FULL backups of ALL databases (even inactive ones) to a “temp” folder on the C drive.
    Even when I delete the “temp” folder, it gets recreated each night when this backup runs.
    The backups in the maintenance plan are still running and being stored on the network.
    This is something new and unusual; it started eating up space, and I cannot find in SQL where this job is located.
    It is not in the typical maintenance backup jobs, which run and back up to a network drive.
    Any suggestions?
    Everything on the internet says SQL does not back up to a temp folder.
    Perhaps this was done by a prior admin (I have been here 11 months) and was somehow triggered by the event… I have tried contacting the prior admin, but no such luck on an answer.

  21. Unbelievably well put together. Incredible stuff. Absolutely incredible! It’s so dead-on perfect to what anyone running SQL deals with every day. Obviously that’s because you guys ARE us. You’re doing the exact same things with real-life clients… I am *so* impressed by every single thing your company puts out. I’ve built and run two companies, and it’s CLEAR that you guys have great/strong/confident/[many other great adjectives] leadership. A company just cannot produce the amount of great stuff you guys do without that kind of management and leadership. I’m not sure how recently you got a simple Thank You, but here’s mine to each of you: Thank You! Kurt.

  22. Great down-to-earth information. One question for those with a larger amount of experience… I run a full backup of the major databases every night (for us, the larger ones are about 25 GB) to a network drive and send the backup offsite for DR. Up to now we have been saving seven days of full backups on the network drive and just two at our DR site. A full image backup of the server is done weekly. Because of the space issue, I was thinking of saving only two on the network drive. Oh, and we only have production, not development, SQL Servers. Is there ever a time that one would be asked to restore a backup from more than two days back?

    The dbs hold our financial, ERP, and other major data. Thanks, jean

    • Jean – that’s actually a question for your business. For example, if someone dropped a table on Friday evening, and nobody noticed until Monday morning, how would that work out?

  23. Brent, great article as always! Nicely done. Since this article is 6+ years old now, do you stand by your recommendations? Has anything changed? My new employer backs up straight to tape, and the wait types BACKUPBUFFER, BACKUPTHREAD, and BACKUPIO are usually at the top for every SQL Server, so we’re definitely looking into backing up to a network share, then off to tape instead.

    • M – thanks! As you read through the post, what parts do you think have changed and are no longer relevant?

      • I am interested if there are more specific recommendations based on newer SAN technology. For example, for the backup file share, would a simple Windows file server with a gigantic LUN assigned be a good option? (VMware environment on an EMC VNX SAN) I would then use our third party backup tool to go from there to tape.

        • M – the answer depends on your business objectives for RPO and RTO. Building a customized backup plan for your business would be a little beyond the scope of what we could do here in a blog comment, unfortunately.

  24. I have worked places where the unauthorized backup someone had at home saved the company. But its very existence would have gotten the person fired, until the major crash-and-burn happened. Are we supposed to run BitLocker To Go on that bad boy, or what?
