Designing a Recovery Strategy for StackOverflow

When I design a backup & recovery strategy for a database, I don’t talk to the developers, database administrators, or systems administrators first.  The first people I go to are the business managers, and I ask three questions:

  • How much money would you lose if you lost the data altogether?
  • How much money would you lose if you were down for X hours?
  • How much time & money can you afford to devote to backups?

Each of these questions drives the strategy, and none of them is actually answered by the DBA.  In Jeff Atwood’s post about StackOverflow’s backup & recovery strategy this week, there have been a lot of comments suggesting alternate backup methods.  Curiously, though, none of them seems to actually ask any of those three questions.  Let’s examine these questions one by one, and then look at StackOverflow’s choices to see if they make sense.

How much is your data worth?

If you had to reconstruct the data from scratch, using other systems and records, how much would it cost you to do it?  And would you actually pursue that objective, or would you consider folding up the business altogether, or maybe living without the data?

If your database holds financial transactions for your customers, with live incoming credit card transactions and debit card withdrawals, it’s obviously very valuable, and you’d be forced to spend a fortune to get back in business.  If you’re owned by Microsoft and you lose all Sidekick customer contact data, you pour money and manpower into getting it back.  Other companies like Magnolia and Journalspace, on the other hand, have decided to pack their bags and call it a day.

The business has to work with IT to come up with a quick back-of-the-envelope calculation as to what complete data loss would cost, and that’s part of the formula that dictates what we spend on data protection.  Sometimes this simple question leads businesses to realize, “That particular data doesn’t really matter – we could rebuild it all from other sources for next to nothing.  Maybe we shouldn’t back it up at all.”

How much money does downtime cost?

While this database is down, can you still sell your products?  If not, then it’s easy to calculate the cost of downtime – it’s your sales metrics.  If you sell an average of $100,000 per hour, then an hour of downtime costs you $100,000.

And furthermore, if you can’t sell products, do your customers hold off on the purchase, or do they switch to another vendor?  If Amazon.com’s databases go down, then their customers probably won’t wait around until the site comes back up.  They’ll head straight over to another web site and spend money with a competitor.  This has a hidden business danger, too – if your customers like that new site better, they might stick with it and bypass you for future orders.

However, if you can still sell products, keep your customers & employees happy, and business moves along unaffected except for a few bumps, then that might guide your backup & recovery strategy too.  Or if your company isn’t making all that much per hour, then maybe you don’t want to dedicate a fortune to having your systems highly available.
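To make the back-of-the-envelope math concrete, here is a minimal sketch in Python.  The function name and the `lost_fraction` knob are my own illustration, not anything from a real tool: `lost_fraction` is a rough guess at how many of the sales during the outage go to a competitor instead of waiting for you to come back.

```python
def downtime_cost(sales_per_hour, hours_down, lost_fraction=1.0):
    """Estimate revenue lost during an outage.

    lost_fraction is a hypothetical knob: 1.0 means every sale during the
    outage is gone for good; lower values mean some customers wait and
    buy once the site is back.
    """
    if not 0.0 <= lost_fraction <= 1.0:
        raise ValueError("lost_fraction must be between 0 and 1")
    return sales_per_hour * hours_down * lost_fraction


# The example from above: $100,000 per hour, one hour down, all sales lost.
print(downtime_cost(100_000, 1))

# Four hours down, but assume half the customers wait and buy later.
print(downtime_cost(100_000, 4, lost_fraction=0.5))
```

The number that comes out is only as good as the guesses that go in, but even a rough figure gives the business something to weigh against the cost of higher availability.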

How many resources can you devote to backups?

Availability costs time and money.

The more available your system needs to be, the more time and money it costs.  If you’re a global enterprise with a killer cash flow, then you can make more conservative decisions, back up more databases more often, and not be as concerned with the costs.  If you’re a startup with three guys, and all your revenue goes towards paying salaries, then you want to watch those backup costs a little more closely.

In addition, backups cost more than just money.  If you need up-to-the-minute recovery with constant transaction log backups, you have to put your database in the full recovery model – which can slow things down.  If you want the fastest possible response times, and you’re looking for every millisecond edge against your competitors on each page load, backups are going to hit your radar.

So how does StackOverflow stack up?

Let’s ask the three questions:

  • How much is their data worth? Their data consists of questions and answers from the programming community.  Sure, they’re the #1 programming site in the world, but even the words of Jon Skeet are only worth so much.
  • How much money does downtime cost? This might sound callous to users, but if StackOverflow were down for four hours, the vast majority of users would get over it.  They might post a few questions elsewhere, but for the most part, they’d just sit around on Twitter complaining, refreshing their browser while they waited for StackOverflow to come back up.  They’re addicted, and they’ll tolerate downtime.
  • How many resources can they devote to backups? StackOverflow is a small startup trying to make a living off ad revenue.  Their primary target users are extremely tech-savvy people who are fully aware of tools like Firefox and Adblock Plus, making it even more challenging.  In an ideal world, they’d have a SAN with sub-second snapshot backup & restore technology – but that costs a lot of money, and it’s not realistic.  Frankly, every bit of traffic in and out of their colo servers costs them money, and not an insignificant amount.

With these answers in mind, StackOverflow’s decisions not to do transaction log backups, offsite log shipping, database mirroring, and so on make good business sense. We geeks in the crowd may not like it, and we might demand the latest and greatest in backup & recovery technology, but at the same time we want StackOverflow to remain free.  As their volunteer DBA, I’d love to do 24×7 log shipping or database mirroring to a secondary server at another colo facility – but I wouldn’t be willing to pay out of my own pocket for expenses like that.

StackOverflow Database Server

To drive the resources part home, take a look at the database server as shown in Jeff’s Stack Overflow Rack Glamour Shots post this week.  Count the number of hard drives.  That’s six SATA drives shared by the OS, page file, database files, log files, and full text catalogs to serve over one million pageviews per day.  Many of you out there use a server like this as your development server, and you complain that it’s slow.  Guess what – this is both their production server and development server.  They’re achieving some incredible stuff with a very limited hardware budget, and it’s a testament to what you can do if you really, really focus on performance.


11 Comments

  • To paraphrase Paul Randal on a recent Dot Net Rocks interview, “You don’t need a backup strategy, you need a recovery strategy”.

  • I hate sounding like a sycophant, but you amaze me on a regular basis by simply doing things better than almost everyone else.

    However, you will never convince me that they are wise in using the same server for development and production. Shouldn’t everyone, even if they use old hardware, be using a separate server for development? The cost for that is nearly insignificant.

    • David – hahaha, you sycophant… Ideally, yes, they’d use a separate server for development, but they pay for rack space. Each developer uses a local copy of the database for their own development, but when it comes time for testing builds & things with each other, they use the shared copy in colo. Each server you plug into the colo environment comes with its own costs, so I understand why they’d avoid doing that for now.

  • They should host their dev server in the Brent Ozar data center. You’ll just need a few cycles for BitTorrent, right? 😉

  • I think a key item to remember is even though we are technologists there are business decisions involved in everything we do.

    Nice job putting this in business context.

  • how soon before you get their dev environment set up with SQLAzure?

    • Their code relies on full text search, which isn’t available yet in Azure. We talked about it, but it’d require too many mods to their code. I’m surprised nobody’s uploaded their public data dumps to Azure though.

  • I love this pragmatic approach to your backup strategy. You’re right, as much as I’m addicted to StackOverflow, I don’t have my financial related data in there and I could don my Googlian Monk robes if I was in that much of a bind.

    I’m sure Jeff & crew will eventually upgrade their strategies, but from what Jeff has described and your analysis of their needs, the backup strategy seems to fit the needs soundly.

