Why Dedupe is a Bad Idea for SQL Server Backups

Has your SAN admin or CIO been telling you not to compress your SQL Server backups because they’re doing it for you with a dedupe tool like EMC’s Data Domain?  I’ve been hearing from a lot of DBAs who’ve been getting this bad advice, and it’s time to set some records straight.

The Basics of Storage Deduplication

Dedupe appliances basically sit between a server and storage, and compress the storage.  Sometimes they do it by identifying duplicate files, and sometimes they do it by identifying duplicate blocks inside files.  It isn’t like traditional compression because the process is totally transparent to anything that stores files on a deduped file share – you can save an Excel file to one of these things without knowing anything about dedupe or installing any special drivers.  In theory, this makes dedupe great because it works with everything.

The key thing to know about dedupe, though, is that the magic doesn’t happen until the files are written to disk.  If you want to store a 100 megabyte file on a dedupe appliance, you have to store 100 megabytes – and then after you’re done, the dedupe tool will shrink it.  But by that point, you’ve already pushed 100 megabytes over the network, and that’s where the problem comes in for SQL Server.
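To make that concrete, here is a toy sketch of block-level dedupe in Python. It is not how Data Domain or any real appliance is implemented: the fixed 4 KB block size, the `DedupeStore` class, and the SHA-256 fingerprints are all assumptions I picked for illustration. The point is just to show that the target has to receive every byte before it can decide which blocks it already has.

```python
import hashlib

# Assumed fixed block size for illustration; real products may chunk differently.
BLOCK_SIZE = 4096


class DedupeStore:
    """Toy dedupe target that keeps one copy of each distinct block."""

    def __init__(self):
        self.blocks = {}         # block fingerprint -> block bytes
        self.bytes_received = 0  # everything that crossed the network

    def write_file(self, data):
        """Receive a whole file, store only unseen blocks, return its recipe."""
        self.bytes_received += len(data)
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fingerprint, block)  # keep first copy only
            recipe.append(fingerprint)
        return recipe

    def read_file(self, recipe):
        """Rebuild the full-size file from its list of block fingerprints."""
        return b"".join(self.blocks[fingerprint] for fingerprint in recipe)

    @property
    def bytes_stored(self):
        return sum(len(block) for block in self.blocks.values())


store = DedupeStore()
monday = b"A" * BLOCK_SIZE * 100       # 100 identical blocks
tuesday = monday + b"B" * BLOCK_SIZE   # same data plus one new block
monday_recipe = store.write_file(monday)
tuesday_recipe = store.write_file(tuesday)

print("bytes received over the wire:", store.bytes_received)
print("bytes actually stored:       ", store.bytes_stored)
```

In this toy run the store keeps only two distinct blocks, but it still had to receive both files at full size, and `read_file` hands back the full-size file again on the way out.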

Dedupe Slows Down SQL Server Backups

In almost every scenario I’ve ever seen, the SQL Server backup bottleneck is the network interface or the drives we’re writing to.  We DBAs purposely set up our SQL Servers so that they can read an awful lot of data very fast, but our backup drives have trouble keeping up.

That’s why Quest LiteSpeed (and SQL 2008’s upcoming backup compression) uses CPU cycles to compress the data before it leaves the server, and that’s why in the vast majority of scenarios, compressed backups are faster than uncompressed backups.  People who haven’t used LiteSpeed before think, “Oh, it must be slower because it has to compress the data first,” but that’s almost never the case.  Backups run faster because the CPUs were sitting around idle anyway, waiting for the backup drive to be ready to accept the next write.  (This will really ring true for folks who sat through Dr. DeWitt’s excellent keynote at PASS about CPU performance versus storage performance.)

With dedupe, you have to write the full-size, uncompressed backup over the network.  This takes longer – plain and simple.
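If you want to put rough numbers on that, one way is to treat the backup as a pipeline that runs at the speed of its slowest stage. The sketch below is a back-of-the-envelope model, not a measurement: the function name and every throughput figure are made-up assumptions, so plug in numbers from your own servers.

```python
def backup_seconds(db_mb, read_mb_s, network_mb_s,
                   compression_pct=0, compress_mb_s=None):
    """Estimate backup time as a pipeline limited by its slowest stage.

    All inputs are assumptions the caller supplies; nothing here is measured.
    """
    stages = [db_mb / read_mb_s]  # reading the database files
    if compression_pct:
        if compress_mb_s is None:
            raise ValueError("compress_mb_s is required when compressing")
        stages.append(db_mb / compress_mb_s)  # CPU time spent compressing
    sent_mb = db_mb * (100 - compression_pct) / 100
    stages.append(sent_mb / network_mb_s)  # pushing data to the backup target
    return max(stages)


# Hypothetical server: 100 GB database, fast reads, slow network.
uncompressed = backup_seconds(100_000, read_mb_s=500, network_mb_s=100)
compressed = backup_seconds(100_000, read_mb_s=500, network_mb_s=100,
                            compression_pct=80, compress_mb_s=300)
print(f"uncompressed: {uncompressed:.0f} s, compressed: {compressed:.0f} s")
```

With these invented numbers the uncompressed backup is limited by the network, while the compressed one is limited by the CPU and still finishes in about a third of the time.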

Dedupe Slows Down Restores Too

The same problem happens again when we need to restore a database.  At the worst possible time, just when you’re under pressure to do a restore as fast as possible, you have to wait for that full-size file to be streamed across the network.  It’s not unusual for LiteSpeed customers to see 80-90% compression rates, meaning they can pull restores 5-10 times faster across the network when they’re compressed – or in comparison, deduped restores will take 5-10 times longer to copy across the network.  Ouch.
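The jump from "80-90% compression" to "5-10 times faster" is simple arithmetic, assuming the network copy is the bottleneck: a file that is 80% smaller is one fifth the size, so it copies in one fifth the time. A quick sketch (the function names and the 50 GB example are mine, purely for illustration):

```python
def transfer_speedup(compression_pct):
    """How many times faster a compressed file copies over a fixed-speed link."""
    if not 0 <= compression_pct < 100:
        raise ValueError("compression_pct must be in [0, 100)")
    return 100 / (100 - compression_pct)


def restore_copy_seconds(backup_mb, network_mb_s, compression_pct=0):
    """Time to pull a backup across the network before it can be restored."""
    return backup_mb * (100 - compression_pct) / 100 / network_mb_s


for pct in (80, 90):
    print(f"{pct}% compression -> {transfer_speedup(pct):.0f}x faster copy")

# Hypothetical 50 GB backup over a 100 MB/s link.
print(restore_copy_seconds(50_000, 100))                      # full size
print(restore_copy_seconds(50_000, 100, compression_pct=90))  # compressed
```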

It gets worse if you verify your backups after you finish.  You’re incurring the speed penalty both ways every time you do a backup!

And heaven help you if you’re doing log shipping.  That’s the worst dedupe candidate of all: log shipping does restores across one or more SQL servers, all of which are hammering the network to copy these full size backups back and forth.

So Why Do SAN Admins Keep Pushing Dedupe?

Dedupe makes great sense for applications that don’t compress their own data, like file servers.  Dedupe can save a ton of backup space by compressing those files, saving expensive SAN space.

SAN admins see these incoming SQL Server backups and get frustrated because they don’t compress.  Everybody else’s backups shrink by a lot, but not our databases.  As a result, they complain to us and say, “Whatever you’re doing with your backups, you’re doing it wrong, and you need to do it the other way so my dedupe works.”  When we turn off our backup compression, suddenly they see 80-90% compression rates on the dedupe reports, and they think everything’s great.

They’re wrong, and you can prove it.

They don’t notice the fact that we’re pushing 5-10x more data across the network than we did before, and our backups are taking 5-10x longer.  Do an uncompressed backup to deduped storage, then do a compressed backup to regular storage, and record the time differences.  Show the results to your SAN administrator – and perhaps their manager – and you’ll be able to explain why your SQL Server backups shouldn’t go to dedupe storage.
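Here is one way that side-by-side test could be scripted. Treat it as a sketch under assumptions: it uses SQL Server's native `WITH COMPRESSION` / `NO_COMPRESSION` backup options rather than LiteSpeed, the database name and UNC paths are hypothetical, and `run_statement` is a placeholder for whatever you use to execute T-SQL (BACKUP can't run inside a transaction, so a driver connection would need autocommit turned on).

```python
import time


def backup_statement(database, target_path, compressed):
    """Build a T-SQL BACKUP DATABASE statement for one test run."""
    option = "COMPRESSION" if compressed else "NO_COMPRESSION"
    return (
        f"BACKUP DATABASE [{database}] "
        f"TO DISK = N'{target_path}' "
        f"WITH {option}, INIT, STATS = 10;"
    )


def timed(run_statement, statement):
    """Execute one statement via the supplied callable; return elapsed seconds."""
    start = time.perf_counter()
    run_statement(statement)
    return time.perf_counter() - start


def compare(run_statement, database, dedupe_path, plain_path):
    """Time an uncompressed backup to dedupe vs. a compressed one to a plain share."""
    return {
        "uncompressed_to_dedupe": timed(
            run_statement, backup_statement(database, dedupe_path, False)),
        "compressed_to_plain": timed(
            run_statement, backup_statement(database, plain_path, True)),
    }


# Dry run: print the statements instead of executing them.
# The database name and both paths below are hypothetical.
results = compare(print, "Sales",
                  r"\\dedupe-appliance\backups\Sales.bak",
                  r"\\fileserver\backups\Sales.bak")
```

Swap `print` for a function that really executes the statement and waits for it to finish, run it a few times at a comparable time of day, and do the same for restores so you have both halves of the story.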

In a nutshell, DBAs should use SQL Server backup compression because it makes for 80-90% faster backups and restores.  When faced with backing up to a dedupe appliance, back up to a plain file share instead.  Save the deduped storage space for servers that really need it – especially since dedupe storage is so expensive.

28 Responses to Why Dedupe is a Bad Idea for SQL Server Backups
  1. Glenn Berry
    November 16, 2009 | 7:35 AM

    Brent,

    I completely agree that backup compression is a huge win in nearly every scenario (unless you are under heavy CPU pressure). It really makes initializing a database mirror much quicker and easier. SQL Server 2008 Enterprise Edition already has native backup compression, while SQL Server 2008 R2 Standard Edition will also get it. Of course, SQL Server native backup compression does not have the flexibility of LiteSpeed; it is either on or off.

    • Brent Ozar
      November 16, 2009 | 3:39 PM

      Glenn – yep, I covered R2’s new inclusion of backup compression in Std Edition last week here, and I think that’ll make the dedupe conversation even more common. It’s gonna get ugly with dedupe vendors over this one!

  2. WIDBA
    November 16, 2009 | 8:25 AM

    Good stuff, I just sat through the Data Domain dedupe seminar. It really sounds good for file shares, etc., as you mention. The DBAs in the room were all thinking that this makes little sense in the database world. I can point to your article if the SAN guys get ornery!

  3. Jason Hall
    November 16, 2009 | 9:04 AM

    Thanks for writing this, Brent. When new and expensive technology comes out, there always seems to be a push to use it. The “I just bought this 5 million dollar deduped storage system and you better use it” attitude can wreak havoc on a backup infrastructure. Isn’t the reason why we back up our databases so that we can restore them in the event of a disaster? I would think that any technology that dramatically increases our time to recovery would be a negative, but I’m finding more and more DBAs struggling to fight that battle. The end result of deduped hardware vs. compression may be the same reduction in storage utilization, but compression gets to that end result significantly more efficiently (and cost-effectively) on both backup and restore.

    Just my two cents…

  4. Kendra Little
    November 16, 2009 | 11:47 AM

    To be fair to the dedup appliances, you shouldn’t just time the first full backup to it. You need to get a few backups to the dedup appliance completed and then start timing what time regular backups take. In the case of at least the DataDomain appliance, they should become more efficient once they have backups to de-duplicate against.

    I do also have to say that some of the snapshotting technology DD can do is pretty cool. I will note that I have not tested restoring from those snapshots, but it sure sounds good. ;)

    • Brent Ozar
      November 16, 2009 | 3:38 PM

      Kendra – the dedupe happens as SQL pushes the backup into Data Domain. The full size of every backup has to be copied across the wire, no matter whether it’s the first backup or the tenth. Dedupe makes SQL backups smaller, but not faster. If you have a system that works otherwise, I’d love to see it.

  5. Merrill Aldrich
    November 16, 2009 | 12:56 PM

    Hey Brent – I just went through this exact issue this year with my team. I would qualify this a little, by phrasing it this way: “the common advice about NOT using compression for SQL backups that go into a dedupe store is probably bad advice.” We do use *both* backup compression and a dedupe archive (+ remote mirror) together, and it works well. Here’s the thing: compressed data fed into the dedupe process generally doesn’t deduplicate as well as other data, because it’s very unlikely to match up against other unrelated bits that are already in there (which is how dedupe works). BUT, it can dedupe against prior versions of the compressed backup files that haven’t changed entirely. Example: take a full, compressed SQL backup on Monday, then another on Tuesday where only a small portion of the database has changed – lots of blocks in the file do match up the second time. So you take a big hit for the first file, but maybe not quite so bad the second and later iterations. We had to actually prove this out by testing it with real data from real compressed SQL backups over a couple of weeks when we first got the dedupe system.

    Also, related to the speed issue: you are exactly correct. We always back up to a disk available directly to Windows (local or SAN attached), with compression, to get a fast backup. The resulting files are then archived in a dedupe store, by backup software, before being deleted from the local disk at some later time. If we need to perform a restore, we have recent files sitting right at the server. Only older backups would need to come out of the archive and take the restore speed hit.

  6. Nick Weber
    November 16, 2009 | 3:30 PM

    Brent,

    Great read!

    Just one quick correction and a comment. DataDomain actually does Dedupe on the fly (“inline Deduplication”), so it only writes Dedupe data to disk, while other products like Avamar do a post-base Dedupe after the data is written to disk. Even though DataDomain is an inline Dedupe, it has been throttled to only transfer data as fast as it can Dedupe or cache it in memory. I’m also in 100% agreement with you regarding backing up directly to Dedupe targets. My initial backup is to a large Raid 10 (rdm) attached to a VM, then CommVault Simpana 8 runs nightly to back up the raid 10 (rdm), off-site it, and Dedupe it a bit.

    • Brent Ozar
      November 16, 2009 | 3:35 PM

      Nick – when you say inline dedupe, are you saying SQL won’t have to push 100 megs of data across the wire? That’s not how I understand Data Domain to work. That would require running a driver or app on the SQL Server and deduping the data on the SQL Server, correct?

  7. Nick Weber
    November 16, 2009 | 3:55 PM

    With an inline or most post-base Deduplication you would still need to transfer all the data across the line. DataDomain Dedupes the data as it enters the unit and only writes Dedupe data to disk, while most post-base Dedupe sends all the data to disk and then Dedupes it after the data is living on the target disk. A big pitfall with post-base is that it requires a lot more disk to keep the full copy while it creates the Dedupe copy; restore times are also terrible. The nice thing about DataDomain is that their fixed bandwidth rate includes Dedupe time up and down, but it is limited. We use CommVault Simpana 8 here with their new Dedupe feature. I’m not going to say it’s the cat’s meow but I would strongly recommend it over Avamar.

    With Avamar you would not need to push the whole 100 megs over the wire, but it also requires an agent on the box constantly talking to the master Dedupe database. It also has some of the worst restore times. If restore times are not important to you then this is a wonderful product.

    Nick

    • Brent Ozar
      November 16, 2009 | 4:04 PM

      Okay, great. Just to close the loop on this so future readers understand, you’re saying you agree with my statements that backing up to Data Domain is not faster, right? And in fact, if the DBA is required to send uncompressed data to Data Domain, it’ll take even longer than if they were doing compressed backups to a regular file share, right? Just wanna make it absolutely clear to readers – I understand what you’re saying about Data Domain deduping “inline”, but that means after the full size is already sent over the network, and at that point it’s too late for the DBA.

  8. Nick Weber
    November 16, 2009 | 4:06 PM

    I’m in 100% full agreement with you!

    • Brent Ozar
      November 16, 2009 | 4:07 PM

      Ok, cool! Thanks for following up so fast! Have a good one!

  9. alen
    November 17, 2009 | 12:59 PM

    i ran deduped SQL backups for almost a year in parallel with tape backups. In the end we dumped the dedupe system and spent $$$ to upgrade to LTO-4 tape. the dedupe had some amazing compression and it was a lot better than tape. but LTO-4 tapes are $55 for 1.6TB tapes which in the real world come out to almost 3TB of storage per tape.

    • Brent Ozar
      November 17, 2009 | 10:23 PM

      Ouch! Was it strictly a cost issue when you got rid of dedupe?

  10. alen
    November 17, 2009 | 1:06 PM

    forgot to mention, we used a post dedupe system and it actually used SQL 2005 Express as its backend. even had replication functionality built in. and my favorite was when the dedupe tasks ran, it would create a storm of blocking and a lot of the dedupe tasks would fail resulting in more space used. i tried to schedule it and it kind of worked, but never completely.

  11. alen
    November 18, 2009 | 9:18 AM

    cost was part of it. Over 3 years I think the estimate was 100TB – 200TB of disk at two locations. and us DBA’s never trusted the disk and trusted tapes a lot more than disk.

    with netbackup a restore to a different server was a 1 step process. with i365 it was 2 steps and several hours longer than with Netbackup.

  12. Anand Shah
    December 14, 2009 | 7:12 PM

    Same here…

    We were offered de-dupe technologies… about a year back… and when I went into the technicals… i thought to myself… that it is not de-dupe… but dupe…

    I then went ahead with an open source version of quest litespeed… called lzop compression….

    it basically reduces a 25 gb full / transactional backup to 6 gb within a matter of 3-4 minutes…

    http://www.lzop.org/

    There are windows binary links from those sites…

    • Brent Ozar
      December 15, 2009 | 6:18 AM

      Anand – so that we’re clear, LZOP is in no way, shape, or form an “open source version of Quest LiteSpeed.” All it is is a file compression engine, and LiteSpeed has a lot more than that.

      Who do you call for support?

      Does the author keep it up to date? I read the “News” section of the site, which says:

      As of Oct 2006 there has not been any problem report, and 1.02rc1 will get released as official version 1.02 whenever I find some spare time.

      That doesn’t sound like the kind of product I want to trust my mission-critical SQL Server backups with, and frankly, if I was hiring a DBA, I would avoid candidates who made choices like that. Something to think about…

  13. Anand Shah
    December 16, 2009 | 11:25 PM

    Brent – I do agree that lzop is only a compression engine and is not an exact copy of quest litespeed… but if you take the concept of fast compression (less than 1/2 the time taken by 7z) for multiple number of databases… it is much better to do that… and then send it over the wire to a secondary location, than to send a whole database or transaction logs…

    I wonder whether anybody calls the creators of zip software for basic support… all of these compression technologies are open source… so if you are not happy with a particular feature… go ahead… make it better…

    There might be many people in the market… who may not be able to purchase quest LiteSpeed… or similar such paid software… for them this is a very good choice…

    frankly, I wonder if you can decide a DBA based on the compression software they use… whether it be zip, arc, . I am just providing my opinion based on what I did when a choice was available for me to invest anywhere between 10-20 grand on a de-dupe…

    As much as sql servers are based on set theory… there is a set of non-DBAs out there in the world, who are part-time DBAs by choice… I do not think they will ever look for a DBA employer… because they are happily doing other things…

    A DBA is not on the required list of requirements to use SQL server…

    • Brent Ozar
      December 17, 2009 | 7:04 AM

      Anand – as someone who works for a software vendor, I can assure you that yes, people do indeed need support. I wish there was such a thing as bug-free software, but I haven’t seen it yet.

      You said: “all of these compression technologies are open source… so if you are not happy with a particular feature… go ahead… make it better…” That one argument alone shows that I can’t convince you of why open source isn’t the right answer for everybody. I wish you the best of luck with that software, though.

      Out of curiosity – if you’re such a big fan of open source, why use SQL Server? Why aren’t you using MySQL? After all, if it doesn’t have a feature you need, “go ahead… make it better…” ;-)

  14. Craig
    February 2, 2010 | 2:34 PM

    This is a pretty decent article, and sums up nicely some of the frustrations I come into contact with on a day-to-day basis, but it totally ignores systems with deduplication on client AND server. Systems that do this (to minimize both storage requirements AND network traffic) are more efficient than either traditional compression (with a small window size: e.g. LZ usually has a working range of about 32KB) or non-deduped data going out across the network.

    It’s hard to get some people to understand detail. This article does it pretty well.

    C

  15. Doug
    February 17, 2010 | 8:32 AM

    While I agree completely, we are currently testing de-duplication devices (DataDomain’s) with SQL server backups – uncompressed and compressed via Litespeed. While doing some investigation, I discovered Quest doesn’t support Litespeed when used with de-duplication technologies. It’s listed on Quest’s support under solution SOL49562. Here’s the direct link – https://support.quest.com/SUPPORT/index?page=solution&id=SOL49562. You’ll have to log in using your support ID to view it (which you should have if you use Litespeed). For your convenience, here’s the case…

    Solution SOL49562
    Title
    Using de-duping technology (EMC Avamar, DataDomain, NetApp) with LiteSpeed
    Problem Description

    De-duping technology like EMC Avamar, DataDomain, or NetApp is not supported for LiteSpeed backup files regardless of whether the file is encrypted or not.

    Cause
    The data is not always in the same place from backup to backup.

    Resolution
    Unsupported software/platform.

    Environment
    Product: LiteSpeed for SQL Server
    Attachments:
    Server OS: Windows – All
    Database: SQL Server – All

    • Brent Ozar
      February 18, 2010 | 11:38 AM

      Hi, Doug! I sent this over to the support department and they’re trying to clarify that article. The person who wrote that article is no longer with Quest, and we’re pretty sure it’s incorrect. We’re tracking it down to make sure.

      We can’t “support” dedupe appliances in the sense that we don’t have any in-house, and we don’t test with them. We’re not aware of any issues with them, though.

  16. Craig
    February 18, 2010 | 2:03 PM

    TSM 6.2 (just announced) does client and server-side dedupe.

    Best of both worlds.

    C

    • Brent Ozar
      February 18, 2010 | 2:06 PM

      Craig – just to be clear, TSM 6.2 was just announced for first delivery next month. I look forward to reading how it performs with SQL Server. I can’t seem to find any documentation on that, though.

  17. Craig
    February 18, 2010 | 11:56 PM

    Hi Brent,
    TSM infocenters (that is, the online TSM manuals) are usually only enabled on the day of general availability.

    I am not in the loop on it, but I’ve no reason to expect anything different for the 6.2 release.

    For info, the TSM v6.1 infocenter is over here http://publib.boulder.ibm.com/infocenter/tsminfo/v6/index.jsp – usually a new URL would be used for each new infocenter.

    The TSM v6.1 server-side deduplication arrangement is described in a redbook at http://www.redbooks.ibm.com/abstracts/sg247718.html – this is likely to be one half of the setup in the 6.2 design.

    C
