How to Back Up Terabytes of Databases

When you weren’t looking, your databases went and grew up. Now your backup window has grown so large that you’re about ready to open it and jump.

Time to make a choice.

The Native Way: Tuning SQL Server Backups

You can theoretically pull this off by using a combination of tactics:

Back up as infrequently as the business will allow. Run your full backups once a week (or if you want to go wild and crazy, once per month) and differential backups periodically. As Jes explains in her backup and recovery class, differentials back up the data pages that have changed since the last full backup. When disaster strikes, you only need to recover the most recent full backup, the most recent differential backup, and all of the log backups taken after the differential (there’s a sketch of that restore chain below). This can shave a lot of time off your restores – but only if you minimize the number of changed pages in the database. This means…

Change the database as little as possible. We can’t change what the users do, but we can change what we DBAs do. Stop doing daily index defrag/rebuild jobs – you’re just changing pages in the database, which instantly inflates the size of your differential backups. In a scenario like this, do index maintenance only when you’re sure it’s the only way to solve a performance problem – it absolutely has to be your last resort.
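
To make that restore chain concrete, here’s a minimal sketch – the database name, file paths, and log backup times are hypothetical, not from any real environment:

  -- Most recent full, then most recent differential, then every log backup after the differential.
  RESTORE DATABASE SalesDB FROM DISK = N'X:\Backups\SalesDB_Full.bak' WITH NORECOVERY;
  RESTORE DATABASE SalesDB FROM DISK = N'X:\Backups\SalesDB_Diff.bak' WITH NORECOVERY;
  RESTORE LOG SalesDB FROM DISK = N'X:\Backups\SalesDB_Log_0100.trn' WITH NORECOVERY;
  RESTORE LOG SalesDB FROM DISK = N'X:\Backups\SalesDB_Log_0115.trn' WITH RECOVERY;

The fewer pages that changed since the last full, the smaller that differential restore is – which is exactly why the index maintenance advice above matters.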

My backup camera collection has started to grow out of control.

Tune the data file read speeds. You need to read the pages off disk as fast as possible to back them up. Use tools like CrystalDiskMark and SQLIO to measure how fast you’re going, and then tune your storage to go faster.
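
One rough way to isolate the read side – my suggestion, not a tool named in this post – is to back up to the NUL device, which reads every page but throws the output away, so the elapsed time reflects read speed with no write bottleneck:

  -- Reads the entire database and discards the output: a read-throughput test, not a real backup.
  -- COPY_ONLY keeps this test from affecting the differential backup chain.
  BACKUP DATABASE SalesDB TO DISK = N'NUL'
  WITH COPY_ONLY, STATS = 10;

Don’t run that casually on a busy production server – it still hammers the data files with reads.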

Compress the data as much as possible. It’s not just about minimizing the size of your backup file – it’s about minimizing the amount of data we have to write to disk. Bonus points for using index compression inside the database so that it’s compressed once, not recompressed every time we do a backup, although that doesn’t really help with off-row data.
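
A minimal sketch of both layers – compression inside the database plus backup compression on the way out – with hypothetical object names and paths:

  -- Compress the data once, inside the database (an Enterprise Edition feature when this was written).
  -- This is a one-time rebuild, not the routine maintenance warned about above.
  ALTER INDEX IX_Orders_OrderDate ON dbo.Orders
  REBUILD WITH (DATA_COMPRESSION = PAGE);

  -- And compress the backup stream itself:
  BACKUP DATABASE SalesDB TO DISK = N'X:\Backups\SalesDB.bak'
  WITH COMPRESSION;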

Tune the backup target write speeds. If you’re using a small pool of SATA drives in RAID 5 as a backup target, it’s probably not going to be able to keep up with a giant volume of streaming writes, even if those writes are compressed versions of the database. Problems will get even worse if multiple servers are backing up to the same RAID 5 pool simultaneously because the writes will turn random, which is the worst case scenario for RAID 5.

Tune the bottleneck between the reads and the writes. If you’re backing up over the network, use 10Gb Ethernet to avoid the pains of trying to push a lot of data through a tiny 1Gb straw.

Tune your backup software settings. If you’re using native backups, start with using multiple files and the built-in options, and graph your results. Third party compression products usually offer all kinds of knobs to tweak – you’ll need to use that same level of graphing diligence.
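
Here’s a sketch of those native knobs – striping across multiple files plus BUFFERCOUNT and MAXTRANSFERSIZE – followed by a query against msdb to do the graphing. The paths and starting values are hypothetical; measure, graph, and adjust rather than copying these numbers:

  BACKUP DATABASE SalesDB
  TO DISK = N'X:\Backups\SalesDB_1.bak',
     DISK = N'Y:\Backups\SalesDB_2.bak',
     DISK = N'Z:\Backups\SalesDB_3.bak',
     DISK = N'W:\Backups\SalesDB_4.bak'
  WITH COMPRESSION, BUFFERCOUNT = 50, MAXTRANSFERSIZE = 4194304, STATS = 5;

  -- Throughput in MB/sec for recent full backups, for graphing over time.
  SELECT database_name, backup_start_date,
         backup_size / NULLIF(DATEDIFF(SECOND, backup_start_date, backup_finish_date), 0) / 1048576.0 AS mb_per_sec
  FROM msdb.dbo.backupset
  WHERE type = 'D'
  ORDER BY backup_start_date DESC;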

Whew. I got tired just typing all that stuff. And if you’re lucky, at the end of it, your backups will complete in an hour or two, but the server might be darned near unusable while you’re beating the daylights out of it. Then the fun balancing act starts, trying to figure out the right point where the system is still usable but the backups complete quickly.

Or Just Cheat with SAN Snapshots

In my Virtualization, SAN, and Hardware video class, I explain how SAN snapshots are able to take a full database backup of any size in just a couple of seconds.

I’m a big believer in snapshots.

Seriously.

Sort of.

See, while it’s technically a backup, I don’t really consider it a backup until it’s off the primary storage device. Your SAN storage, expensive as it was, is still vulnerable to failure, and you need to get that data out as quickly as possible. The good news is that you can move that data out without dragging it through the SQL Server’s storage connections, CPU, and network ports. You can simply (simply?) hook a virtual tape library, actual tape library, or another storage device to the same storage network, and copy directly between the two.

Your data read speeds may degrade during that process, but it’s up to you – if you want to architect your storage so that it’s fast enough to do these full backups without any noticeable performance impact to the end user, it’s possible by inserting enough quarters in the front.

You still have to pay attention, though, because your backup process will look like this:

  • Daily full backups via SAN snapshots – all writes are quiesced for 1-10 seconds during this time
  • Conventional log backups every X minutes – where X is dictated by the business
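
The log side of that schedule is plain T-SQL – a minimal sketch with a hypothetical database name and path, normally wrapped in a SQL Server Agent job that runs every X minutes:

  -- The SAN snapshot covers the fulls; this keeps the log chain intact for point-in-time restores.
  -- In practice, generate a unique file name per run (e.g., with a timestamp).
  BACKUP LOG SalesDB
  TO DISK = N'X:\LogBackups\SalesDB_Log.trn'
  WITH COMPRESSION;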

If you push a big index rebuild job through, you can still bloat the transaction log, and your log backups may take longer than X minutes to complete. This is where our RPO/RTO planning worksheet is so important – if your RPO is 1 minute, you simply may not be able to do index rebuild jobs.
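
One way to catch that bloat early – my suggestion, not part of the worksheet itself – is to trend log backup sizes and durations out of msdb and watch for spikes that line up with maintenance jobs (the database name is hypothetical):

  -- Log backups (type 'L') by size and duration; spikes usually line up with index maintenance.
  SELECT backup_start_date,
         backup_size / 1048576.0 AS backup_mb,
         DATEDIFF(SECOND, backup_start_date, backup_finish_date) AS duration_sec
  FROM msdb.dbo.backupset
  WHERE database_name = N'SalesDB' AND type = 'L'
  ORDER BY backup_start_date DESC;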

SAN snapshots have one other drawback: depending on your storage make/model, snapshots may not be included in your licensing pricing. You may have to spend a lot more (typically tens of thousands of dollars) to unlock the feature. Ask your SAN admin if snapshots are right for your wallet.

Building Terabyte Servers Means Starting with Backups First

When I’m building a SQL Server to hold multiple terabytes of databases, this backup question is the very first one we have to address – even before we talk about the speed of end user queries.

Otherwise, we could end up designing a server with all local solid state drives, which is very inexpensive and satisfies end user performance goals – but leaves us unable to back the data up fast enough.


33 Comments

  • Greg Wellbrock
    July 17, 2014 5:51 pm

    On the SAN snapshot: if you are using the SAN to do the full backups but NATIVE SQL to do the LOG backups, how does SQL know how far to take the log backup? If I restored my FULL SAN backup, how would I get point-in-time recovery? In my experience, SAN snapshots haven’t been application aware, but it could just be me not looking in the right place.

    *Currently using 3PAR SAN

  • Greg Wellbrock
    July 17, 2014 5:54 pm

    *Was assuming by Conventional log backups you meant using SQL Server.

  • Alex Friedman
    July 19, 2014 9:15 am

    We’re using EMC SAN snapshots, but how can you copy them offsite?
    Mounting as file system gives us the mdf/ldf files, but they can only be attached with recovery, so no point in time restore.
    Another option is to mount the snapshot directly on a SQL instance with NORECOVERY, but how to copy that offsite?

  • In my case, I have SAN snapshots, and to copy them to another site, I have replication with SRDF.

    Rolando.

  • I caught you at SQLBits and really enjoyed your session – I was the guy sitting in the front row of your “Watch Brent Tune Queries” who suggested “Statler and Waldorf”.

    Anyhow, one other idea that I’ve seen that seems to improve backup performance is to customise the BUFFERCOUNT and/or MAXTRANSFERSIZE values within the backup command.

    Do you have any thoughts on this?

    Obviously changing these to ridiculously high values is a no-no, but in the testing that we carried out at my client, quite modest increases to the BUFFERCOUNT led to quite respectable reductions in the backup duration.

    Keep up the good work 🙂

    • Andy – thanks, glad you liked the session!

      I do see people tweaking that with native software, but typically when you get to the point that you need to tweak those numbers, you can get a much bigger bang for the buck by switching to 3rd party compression software instead.

  • Anthony Martin
    July 23, 2014 7:25 am

    With the right scenario, I’ve seen partial backups be a huge win for very large databases.

  • Joel Berman (@j0el)
    July 29, 2014 10:31 am

    Snapshots are not backups. Snapshots can be very important and very useful, but they cannot be trusted and the space they consume is hard to manage automatically. Snapshots can be a lifesaver when you are making many changes or updates to either the data or to the queries, but if you are protecting against hardware failures, site failures, malicious people, cyber crime, stupid human errors, you need to do real backups as well.

    • Joel – can you briefly describe the difference between a SAN snapshot and a “real backup”?

      (Also to clarify for folks – Joel works for Acronis, a backup company.)

      • Joel Berman (@j0el)
        July 29, 2014 11:13 am

        Brent,
        Generically, a snapshot is the file metadata (pointers), and any data changed since the snapshot was taken (copy on write). So you need both the snapshot and the original disk to recover to the point in time that the snapshot was taken. A snapshot is very fast. A little overhead to make the initial snap, and then time proportional to the changes after the snapshot.

        A backup is a complete copy of the dataset. It takes much longer to create and the time is proportional to the total amount of data (neglecting any compression or deduplication).

        If you are fiddling and tinkering, snapshots are great, they are fast, and if you make a mistake you can easily and quickly correct it. If you are worried about a hardware failure or some sort of storm or flood, backups are great. They can be stored offsite.

        We can get more technical and speak about ways that snapshots have improved through other copying techniques and how backups are sped up through differential and incremental techniques. But basically snapshots are for quick recovery over a short period of time, and backups are for full recovery after major problems.

        • Joel – correct, and when you take a SAN snapshot, you can then export that entire copy of the data as a backup. Granted, it’s a two-part process, but you get the same end result (a duplicate copy of the data stored somewhere else) that you would have from a traditional backup.

          Thanks for stopping by!

          • Joel Berman (@j0el)
            July 29, 2014 12:41 pm

            If the system is running in a virtual machine, most of the backup software will take a snapshot and do the same thing. There are a number of ways to protect data. Among cost, speed, and safety, you can pick any two 🙂 And there are many variables to consider in choosing the best solution. We have not even considered various log shipping and DB replication approaches.

            Thanks

          • Joel – you wrote:

            “most of the backup software will take a snapshot and do the same thing.”

            Err, no, most SQL Server backup software will not do that. (Remember, you’re at a SQL Server focused blog.)

          • Also, a snapshot behaves somewhat like a COPY_ONLY backup; it does not cause the transaction log to truncate, correct? I’m surprised this point wasn’t mentioned. 🙂

          • Brent – errr, not exactly. A snapshot is a full backup, and it behaves like a full backup. Full backups don’t truncate the log.

  • Joel Berman (@j0el)
    July 29, 2014 12:53 pm

    Not sure of your rules on blatant commercials, but my company can do that. But your point is well taken. I was thinking generally.

    • “Thinking generally” about your own company’s products, yes, I get that. 😉

      I get the feeling you might have rushed in to comment on a post without really understanding the context of the audience and the industry. Just a tip – I’d slow down a little bit on the astroturfing there.

  • Hi Brent, we have a 300 GB database doing full backup snapshots via Veeam. IO freezes for only a few seconds, but the IO wait is spiked throughout the snapshot period, which is about 20 minutes (almost as long as a native backup) – and they’re doing hourly replication, so this happens hourly. I thought the VM snaps were very fast: “…SAN snapshots are able to take a full database backup of any size in just a couple of seconds.” Any recommendation about this IO wait on the hourly snaps process?

    Thank you!

    The following is the log entry:

    Query duration: 1159 sec – almost 20 mins

    BACKUP DATABASE [XXXX] TO VIRTUAL_DEVICE='{XXXXXXXXXXXXXX}' WITH SNAPSHOT, BUFFERCOUNT=1, BLOCKSIZE=1024

  • Wes Crockett
    May 22, 2018 3:35 pm

    Brent! Thank you for this article… I manage two DWs (one @ ~1 TB and one @ ~4 TB) on 3PAR, and my org uses CommVault for backup management… I am revolting against CommVault’s native SQL backups and going to dump to flat file… the option for VSS through CommVault is there, but I am terrified of it from a loss of control and EASY availability standpoint. Where would you land on this? Native SQL Server or jump to VSS?

    • Wes Crockett
      May 22, 2018 3:36 pm

      Simple recovery… nightly fulls… restore plan would usually just be to restore the last full and perform the nightly ETL again.

    • Howdy sir! Personal architecture design is kinda beyond what we can do fast in blog post comments, though. We have to ask a lot of probing questions around RPO, RTO, archival requirements, etc.

  • Hi Brent!

    We have a huge DB in our environment that is currently 62 TB and is expected to grow by 30 more TB in another 3 months’ time. Having this thing in our bucket is always a risk, especially when it comes to disaster recovery.
    Please advise on the best practice that should be followed here. Thanks!

    • Chakra – sure, that’s exactly the kind of work I do in my consulting. You can click Consulting at the top of the page to get started with an engagement. Thanks!

  • As mentioned, for a large DB the backup process will look like this:

    Daily full backups via SAN snapshots – all writes are quiesced for 1-10 seconds during this time
    Conventional log backups every X minutes – where X is dictated by the business

    But how would we do a restore using a SAN snapshot backup and conventional log backups?

    Can you give an example using a timeline?

    • Hi, Nikhil. For questions about how to do restores with your particular SAN vendor’s tools, it’s best to check with your SAN vendor. Or, if you want me to walk you through it, click Consulting at the top of the screen. Thanks!

  • Hi, I have a 7 TB database and am planning to set up DR. What would be the best method from the options below?
    1. Log shipping.
    2. Mirroring.
    3. Transactional replication (but some tables do not have a primary key).
    4. Snapshot replication.
    5. Merge replication.
    6. Or would any third-party replication tools be best?

    Thanks
    Abdus

