Tag Archive: backup

SQL Server High Availability and Disaster Recovery Basics Webcast

If your boss wants you to make your SQL Server more reliable, there’s a lot of confusing options. I’ll walk you through the pros and cons of each method and show you how they can help make queries go faster too. We’ll finish with a list of books and blogs I’d recommend for each option.

This 30-minute session is for production DBAs who haven’t implemented clustering, replication, or log shipping before, and aren’t sure where to start.

Clustering Resources

Clustering had a bad reputation for being difficult to configure and manage, but Microsoft’s made dramatic improvements in recent versions of both Windows and SQL Server.  Whether you’re just getting started with clustering for the first time, or you still think clusters require heartbeat networks, your first resource should be Pro SQL Server 2008 Clustering by Allan Hirt (@SQLHA).

Database Mirroring Resources

For the first couple of years that I used Database Mirroring, I thought it was brain-dead-simple: just Next-Next-Next your way through a wizard, and you’re in business.  I couldn’t imagine how anyone could write an entire book about mirroring.  That, however, was before I found out about Pro SQL Server 2008 Mirroring, which I’d definitely recommend.  The book’s coauthors are Robert Davis (@SQLSoldier) and Ken Simmons (@KenSimmons).

For some quick free learning, check out Robert’s posts tagged with database mirroring.

Replication Resources

Replication’s been out for several SQL Server versions now, and it’s slowly changed with each version.  As a result, there’s a ton of good free info out there, but it’s not all up-to-date.  My favorite place to start is the Books Online topic for Designing and Implementing Replication, and then move on to:

Log Shipping Resources

Log shipping is more of a technique than a technology: it just means restoring your backups continuously on another server.  When database administrators first started doing this, they did it with their own scripts.  Third party backup compression tools like LiteSpeed started offering cool easy-to-manage user interfaces for it too.

These days, the log shipping and backup compression built into SQL Server 2008 R2 Standard Edition put log shipping within everybody’s reach.  The easiest place to get started is with the Books Online section on Log Shipping.  If you use a third party backup compression tool, poke around in their documentation too.

SAN/VM Replication Resources

These solutions are all vendor-specific, and unfortunately a lot of vendors lock their documentation behind support sites that require logins. However, there’s one very good place to start: Mike Laverick’s free ebook Administering VMware Site Recovery Manager 4.  This isn’t the most current information available, but don’t worry about that – just read Chapter 1 for an excellent vendor-neutral introduction to disaster recovery planning using SAN and VMware replication.

If your SAN happens to be covered in his book, skim through that chapter too just to get an idea of the difficulty level for these solutions.  These don’t have the ease-of-use of, say, SQL Server log shipping, but it’s one pane of glass to manage all of your disaster recovery needs regardless of the operating system and application.

For more learnin’, check out Veeam’s free whitepaper section.  Veeam is a vendor that sells their own virtualization backup and recovery software, and I’m not affiliated with ‘em or getting paid to say this, but I’m a big fan of vendors that give away free education to the community.

AlwaysOn Availability Groups Resources

This is a brand new feature of SQL Server 2012, so we don’t have a lot of good community resources out there yet.  There’s a few screenshot walkthroughs, but they don’t really tell you anything you can’t figure out in five minutes of playing around.  For now, the place to go is the Books Online section on Availability Enhancements.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

The 9 Letters That Get DBAs Fired

I’m proud to say that I’ve only lost one job in my life.  I was a grocery store bagger when the store came under pressure to increase their percentage of minority workers.  They laid off all of the Caucasian baggers, but I’m not bitter.  I actually loved bagging groceries, and if I hadn’t been laid off, I’d still be the guy asking you “Paper or plastic?”

I probably didn’t deserve to lose that job, but I’ve deserved to lose others.  I’ve just been lucky.  Through my career, I’ve seen a lot of database administrators go through some dicey situations that cost them their jobs.  Maybe not right away – but once these situations happen, it’s only a matter of time until management finds a legally defensible reason to swing the axe.  These situations often involve just nine letters.

The First Three Letters: RPO

Recovery Point Objective (RPO) measures how much data you’d lose if your server went down and you brought up the secondary server.  It’s measured in time, not megabytes.  The less time you have, the more frequently you have to get the data to a secondary location.

In theory, if you’re doing transaction log backups every 5 minutes to a network share, so you’ve got a 5-minute RPO.

In theory, “in theory” and “in practice” are the same.  In practice, they are different.

If you lit up the oxygen tank, the view would actually improve.

The Gremlin of Disaster

In practice, you’re unlucky, and your transaction log backups will fail before the engine itself does.  I know this because it’s happened to me and my clients.  In one case, the volume for backups filled up and the transaction log backups stopped working.  Since the log backups didn’t run, the transaction logs didn’t clear, and they continued to grow.  The Gremlin of Disaster arrived, sowing corruption in his wake.  By the time the dust cleared, we’d lost over an hour of data – permanently.

If you’re backing up to a local drive and the RAID controller fails, check out this helpful article.  You’ll be needing it, because with a failed RAID controller, you can’t access any of the data on the local drives.  Before you try pulling the hard drives out of the failed server and putting them into another server, check out this article.  After all, you’re gambling – gambling that the RAID controller firmware matches between servers and that it’ll be able to read and rebuild the newly inserted drives.  Guess wrong, and it will format the drives, thereby erasing your backups.

Savvy readers will detect a sense of acerbic humor in the post so far – even more biting than my usual pieces.  Brace yourself, because it’s going to get worse.

The Next Three Letters: RTO

Recovery Time Objective is how much time you’ve got to get your server’s act together.  Some companies measure it from the time you learn about the problem, some measure when the service is first unavailable.

The clock’s starting point matters a lot.  When things go wrong, SQL Server Agent might just email the failsafe operator – if you’ve gone to the trouble of setting one up and configuring alerting.  Even then, you probably won’t notice the email because you’ll be drunk at the pub having a nutritious dinner with your lovely family.  An hour later, when you realize that your cell phone is set on vibrate, you’ll panic, run to the car, fire up your laptop, and try connecting over your phone’s 1-bar internet connection.  Should we start the clock when the server started failing, when the email was sent to you, when you saw the email, or when you finally connected to the VPN?  If the business wants to start the clock when the server stops, you have to account for the meatbag reaction time, and with most of us, there’s a pretty long delay.  (If you’re the kind of DBA who sets up rules to move automated alerts into a separate email folder, you’re off to a bad start.)

Beep beep, back that thing up. Wait, on second thought...

The Pinto of Peril

If our idea of disaster protection is restoring the databases onto another server, we might assume that the RTO just needs to cover the length of time to do a full restore.  Not so fast, Speedy McPinto – here’s a few things you’ll need to do to get the server online:

  • Provision enough space to do the restore – do you know how much space you need on each drive?
  • Make sure it’s on the right version/build of SQL Server – sometimes we have to install service packs, cumulative updates, or hotfixes to get production to work right.  There’s no guarantee our standby box has those same binaries.
  • Get the sp_configure settings right – if production had a certain maxdop, for example, it might fail terribly on the new hardware.
  • Get security working correctly – you did script out the logins and passwords, didn’t you?
  • Configure TempDB correctly – I’ve seen people restore production databases onto dev environments where TempDB was just configured with a few gigabytes of storage, and it instantly started growing when handling production load.  TempDB ran out of space, and the server went Pete Tong.
  • Restore the database – and hope to hell it restores correctly the very first time.

The less time you have, the more tasks you have to perform ahead of time, and the more you have to document, test, and rehearse the steps you’re going to perform at the moment of disaster.

The Most Important Three Letters: CYA

Right now, all of your users believe your RPO and RTO is zero.  They believe you’ve got everything under control, and that when disaster strikes, you’re going to instantly bring up another server with no data loss whatsoever.  We both know the truth, though – you spend your time at work checking blogs, not backups, so we’ve got some work to do.

Start with a statement of fact that avoids blame: “Right now, if we lost the production system at 3:45 PM, it would take me at least four hours to bring another copy online, and that copy would have lost all data between 3PM and 3:45PM.  Is that acceptable?”

They’re going to say, “ZOMG, NO WAY!!!1!  IT HAS TO BE BETTER THAN THAT!”

You say, “I thought you’d say that, because I’m uncomfortable with that too.”  (See what we did there?  We just put you on their side.)  “Here are our options to improve our RPO and RTO.”

To give them options, check out my free video on HA/DR Options: Clustering, Log Shipping, and Replication.  I cover the pros and cons of all your high availability and disaster recovery options including database mirroring, SAN replication, virtualization, and more.  I won’t show you how to implement each of those (hey, I’ve only got 30 minutes) but I’ll give you enough information to make a recommendation to your business users.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

DBA Nightmare: SQL Server Down, No Plans

Managing data is about managing risk, but no matter how we good we are at managing risks, they’re still risks.

We’ve seen several high-profile data failures recently:

Ouch.  It’s time we start a series of DBA Nightmares to cover basic preparations that should be a part of every DBA’s career planning.  Why career planning?  Because if one of these happens to you and you’re not prepared, it’s a URLT moment – Update Resume, Leave Town.  If, on the other hand, you’re well-prepared and react smoothly, this could be your moment to shine.

Today’s Nightmare: From-Scratch Server Restore

Let’s be honest: most of us have never rebuilt a product server from scratch under duress.  Many of us bury our heads in the sand, hoping production will just keep on keepin’ on.  We don’t test our backups, and even if we do, we don’t go to the extreme of attempting a complete from-scratch reinstall.  When the system is down and the CIO’s standing behind us, tapping us on the shoulder, we learn some ugly lessons.

Right away, you need to choose one of two recovery plans: will you try to restore everything exactly as it was (including the system databases), or will you build a new server from scratch and just restore the user databases?  Ideally, you’ve designed your recovery plan ahead of time, but in a nightmare scenario, you’re standing in the datacenter with empty pockets and no game plan.

If you decide to restore the system databases, you should try this ahead of time.  Restoring the master database is different than typical user databases because you can’t use SQL Server Management Studio.  You have to set the SQL Server to run in single-user mode, then use SQLCMD to restore the master database, then remove the -m parameter that you added to start SQL Server, and start it back up again.  If you’re using a third-party product to do your database backups, it’ll require separate instructions like Tom LaRock’s instructions on restoring master with Quest LiteSpeed.  After restoring master, you’ll need to restore the msdb database, but fortunately that one can be done through the SSMS GUI as long as the SQL Server Agent is shut down.

If you don’t restore the system databases, you may be able to get your server up and running faster – at the cost of some configuration data.  For example, logins, Agent jobs, and linked servers are stored in the system databases.  On a small development server with a handful of logins and only maintenance jobs, it might be easier to install a fresh instance of SQL Server on a newly installed server, then just restore the user databases.  (This is one of the reasons I try to avoid excessive custom logins or Agent jobs where possible on development servers.)  Knowing your recovery process and risk will help you design your SQL Server security and Agent job configuration better.

If you decide ahead of time that your recovery plan involves a fresh OS and SQL Server, there’s one thing you can do to make your recovery process either: automate login creation.  Schedule a job to run weekly with Robert Davis’s login copy script and send the results to yourself via email.  That way, at the very least, you’ll have the exact list of logins, passwords, and SIDs to avoid the orphaned login problem when you restore databases.  Run the create-login script sent to you by Robert’s tool, then as you restore each user database, the logins will automatically be associated and users can resume work as normal.

To help plan your build-versus-restore decision ahead of time, it helps to think through all of the implications.  These are just some of the questions you’ll need to think through when designing a disaster recovery plan:

  • What service pack & cumulative update pack was the server running?
  • Did we have any non-SQL applications installed?
  • Were any server-level settings like trace flags configured?

For answers, try running my Blitz script ahead of time.  I bet you’ll learn a lot about your servers – I know my clients do!

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

How to Make SQL Server Backups and Restores Faster

Every day, your servers do hundreds or thousands of operations that are all resource-intensive.  You want these operations to finish as quickly as possible, because they’re absolutely mission critical, and you’ll lose your job if they’re not done successfully.  Whole companies have been lost when these operations fail.  Yet do you really know how to make these operations happen faster?

These operations are backups and restores.

And no, most of us don’t know why they take so long.

In this five-minute video, I explain the only two surefire ways to improve your backup and restore performance: instant file initialization and cleaning your MSDB history.

To learn more, here’s related links for each of those techniques:

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

How to REALLY Compress Your SQL Server Backups

StackOverflow’s database servers are hosted in a colo facility. They pay by the byte for transfer in and out of their servers, so it’s really crucial to compress the database backups as much as possible. The smaller the backups, the less they pay. They’re already using backup compression, but they wanted to explore more ways to cut their backup sizes.

There’s two things DBAs can do to deflate their database backups, but both of these have serious implications. The only way this compression process will work is if you’ve got a second database server that can:

  1. Restore the production backups
  2. Do some work on them to decrease their size
  3. Back up the smaller databases

Notice I didn’t use the word “shrink” – nowhere in today’s discussion am I going to shrink the databases. Shrinking databases is evil. Instead, we’re going to do some things to lose the fat and keep the muscle.

Do not try what you’re about to read on your own production server. The only reason we can do these at StackOverflow is because we’ve got a second database server available. These techniques should never be used if you’ve only got one database server, because dropping your indexes and changing to 100% fill factor will both give you horrendous performance.

Consider taking this approach with a dedicated fire drill server. In a perfect world, you’ve got a server lying around that you’re using for fire drills: periodic restores of your databases. The whole reason we do backups is so that we can do restores, and you should be testing your backups as often as possible. In this scenario, we’re actually testing every single one of our backups.

Now, on to the shrin…I mean, compressing!

Drop All Non-Clustered Indexes

There’s two kinds of indexes – clustered, and non-clustered. Clustered indexes aren’t really indexes at all; they define how the table itself is laid out on disk. Dropping a clustered index doesn’t save space. Non-clustered indexes, on the other hand, take up space because they’re additional copies of parts of the table. Non-clustered indexes can help performance, but they can be rebuilt from scratch as long as you save the index definition first.

Which of these takes less space to back up:

  • The string “CREATE INDEX IX_EmployeeName ON dbo.Employees (LastName, FirstName)”, or
  • An index with thousands or millions of rows in it

The answer is the string. To minimize what we’re backing up, we need to:

  1. Script out all of the non-clustered indexes in the database
  2. Save those definitions to a table (or a stored procedure)
  3. Create a stored proc that will loop through those index definitions and recreate them later
  4. Drop the indexes

If our database’s content is, say, 30% indexes, we just dropped the size of our backup by 30%. The downside is that after the database is restored somewhere else, it’ll take longer for the database to be fully available, because we have to run the stored proc to recreate all our indexes. If we’re running SQL Server Enterprise Edition, our database will be online while the indexes are recreated.

Here’s a couple of scripts to generate T-SQL scripts for all of the indexes in a database:

To make these work, you’ll want to adapt them so that they insert the scripts into a table for later execution, and then drop the indexes. I know, I know, you’re going to want the script I’m using at StackOverflow, but I’m not quite comfortable publicizing that code just yet. It’s still a little duct-tape-y. I’ll revisit that in the future though.

This index removal task is extremely fast, only a matter of seconds, because SQL Server is just dumping metadata. The next step, however, is much more intensive.

Rebuild Tables with 100% Fill Factor

SQL Server stores data in pages, and you can think of these much like the pages in the phone book. If we tried to fill the pages completely full, then whenever we needed to add a single record, we’d have all kinds of shuffling to do. When we know we’re going to be adding a lot of data, we’re better off leaving some working area on each page so that we can add records without doing a lot of shuffling every time. The default amount of space we use on each page is known as the Fill Factor.

Normally, StackOverflow’s tables wouldn’t have a 100% fill factor because insert performance would be horrible. However, in our backup scenario, we want to cram all of the pages completely full. The solution is to rebuild all of the clustered indexes with a 100% fill factor. The lower your default fill factor, the more space you’ll save by rebuilding to 100% fill factor. This step takes a lot of IO crunching, and again, it only makes sense when you’ve got a second database server available to do all this work.

Our Results: 50-55% Smaller Backups

Cutting backup size in half means:

  • Half the bandwidth expense
  • Half the time to move backups across the wire for disaster recovery
  • Half the long-term storage requirements

Your results will vary depending on your fill factor and your amount of indexes.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Designing a Recovery Strategy for StackOverflow

When I design a backup & recovery strategy for a database, I don’t talk to the developers, database administrators, or systems administrators first.  The first people I go to are the business managers, and I ask three questions:

  • How much money would you lose if you lost the data altogether?
  • How much money would you lose if you were down for X hours?
  • How much time & money can you afford to devote to backups?

Each of these questions drive the strategy, and none of these questions are actually answered by the DBA.  In Jeff Atwood’s post about StackOverflow’s backup & recovery strategy this week, there’s been a lot of comments suggesting alternate backup methods.  Curiously, though, none of them seem to actually ask any of those three questions.  Let’s examine these questions one by one, and then look at StackOverflow‘s choices to see if they make sense.

How much is your data worth?

If you had to reconstruct the data from scratch, using other systems and records, how much would it cost you to do it?  And would you actually pursue that objective, or would you consider folding up the business altogether, or maybe living without the data?

If your database holds financial transactions for your customers, with live incoming credit card transactions and debit card withdrawals, it’s obviously very valuable, and you’d be forced to spend a fortune to get back in business.  If you’re owned by Microsoft and you lose all Sidekick customer contact data, you pour money and manpower into getting it back.  Other companies like Magnolia and Journalspace, on the other hand, have decided to pack their bags and call it a day.

The business has to work with IT to come up with a quick back-of-the-envelope calculation as to what complete data loss would cost, and that’s part of the formula that dictates what we spend on data protection.  Sometimes this simple question leads businesses to realize, “That particular data doesn’t really matter – we could rebuild it all from other sources for next to nothing.  Maybe we shouldn’t back it up at all.”

How much money does downtime cost?

While this database is down, can you still sell your products?  If not, then it’s easy to calculate the cost of downtime – it’s your sales metrics.  If you sell an average of $100,000 per hour, then an hour of downtime costs you $100,000.

And furthermore, if you can’t sell products, do your customers hold off on the purchase, or do they switch to another vendor?  If Amazon.com’s databases go down, then their customers probably won’t wait around until the site comes back up.  They’ll head straight over to another web site and spend money with a competitor.  This has a hidden business danger, too – if your customers like that new site better, they might stick with it and bypass you for future orders.

However, if you can still sell products, keep your customers & employees happy, and business moves along unaffected except for a few bumps, then that might guide your backup & recovery strategy too.  Or if your company isn’t making all that much per hour, then maybe you don’t want to dedicate a fortune to having your systems highly available.

How much resources can you devote to backups?

Availability costs time and money.

The more available your system needs to be, the more time and money it costs.  If you’re a global enterprise with a killer cash flow, then you can make more conservative decisions, back up more databases more often, and not be as concerned with the costs.  If you’re a startup with three guys, and all your revenue goes towards paying salaries, then you want to watch those backup costs a little more closely.

In addition, backups cost more than just money.  If you need up-to-the-minute recovery with constant transaction log backups, you have to put your database in full recovery mode – which can slow things down.  If you want the fastest possible response times, and you’re looking for every millisecond edge against your competitors on each page load, backups are going to hit your radar.
stackoverflow

So how does StackOverflow stack up?

Let’s ask the three questions:

  • How much is their data worth? Their data consists of questions and answers from the programming community.  Sure, they’re the #1 programming site in the world, but even the words of Jon Skeet are only worth so much.
  • How much money does downtime cost? This might sound callous to users, but if StackOverflow was down for four hours, the vast majority of users would get over it.  They might post a few questions elsewhere, but for the most part, they’d just sit around on Twitter complaining, refreshing their browser while they waited for StackOverflow to come back up.  They’re addicted, and they’ll tolerate downtime.
  • How much resources can they devote to backups? StackOverflow is a small startup trying to make a living off ad revenue.  Their primary target users are extremely tech-savvy people who are fully aware of tools like Firefox and Adblock Plus, making it even more challenging.  In an ideal world, they’d have a SAN with snapshot sub-second backup & restore technology – but that costs a lot of money, and it’s not realistic.  Frankly, every bit of traffic in and out of their colo servers costs them money, and not an insignificant amount.

With these answers in mind, StackOverflow’s decisions not to do transaction log backups, offsite log shipping, database mirroring, and so on make good business sense. Us geeks in the crowd may not like it, and we might demand the latest and greatest in backup & recovery technology, but at the same time we want StackOverflow to remain free.  As their volunteer DBA, I’d love to do 24×7 log shipping or database mirroring to a secondary server at another colo facility – but I wouldn’t be willing to pay out of my own pocket for expenses like that.

StackOverflow Database Server

StackOverflow Database Server

To drive the resources part home, take a look at the database server as shown in Jeff’s Stack Overflow Rack Glamour Shots post this week.  Count the number of hard drives.  That’s six SATA drives shared by the OS, page file, database files, log files, and full text catalogs to serve over one million pageviews per day.  Many of you out there use a server like this as your development server, and you complain that it’s slow.  Guess what – this is both their production server and development server.  They’re achieving some incredible stuff with a very limited hardware budget, and it’s a testimonial to what you can do if you really, really focus on performance.

And while I’ve got you thinking about backups, now’s a great time to check out some of my other backup articles:

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Losing Weight and Testing Code

David Stein (BlogTwitter) wrote a post called “Pop Quiz Hotshot” about starting your disaster recovery plan *now*.  It’s a great read with good points that everyone needs to act on, but the comments indicate that not everybody’s seeing eye to eye.  As usual, I rely on similes because I’m lazy.

How to Lose Weight

Get your pencils ready, because I’m about to give you the ultimate weight loss tip.  It’s going to sound almost too good to be true because it’s so darned easy, and here’s the craziest part – it can actually save you money!  That’s right – this is the tip that the exercise equipment industry, the personal trainer industry, and the vitamin industry are desperately trying to keep under wraps.  You ready?  Here it comes.  Don’t blink – you might miss it.

Eat less.  A lot less.

I know – it sounds ridiculous, but that wild technique helped me drop 40 pounds in under a year.  I didn’t exercise one bit, either.

I know what you’re thinking – you’re thinking, “But Brent, how can I possibly lose weight without spending money?  Don’t I have to spend a fortune on the FlabBlaster 3000 just like Chuck Norris tells me to?”  Far be it from me to disagree with Chuck – very far – but…

When The Regular Thighmaster Isn't Working

When The Regular Thighmaster Isn't Working

Buying Hardware Doesn’t Fix Bad Practices

No matter how much you spend on exercise equipment or systems management, you and your servers aren’t going to get healthier when the stuff sits on the shelf.

Money can’t buy you health.

It can pay for experts to come in and fix you when you’re sick, but it can’t keep you healthy – that part is up to you.  The very first step to getting healthier, and this goes for both your servers and your thighs, is to change your habits.  Elbow grease has amazing results when applied liberally.

Start Testing Your Restores. Now.

If you don’t have a spare server lying around, use somebody’s spare desktop.  We all have ancient machines sitting in closets from our last upgrade or that employee who just got fired because he couldn’t restore a dropped table.  (Get it? That’s a hint.)  Get that machine, and throw in a 1.5 terabyte drive for about $100.  Yes, use your own money if the company won’t pay, because this is an investment in your career.  If you’ve got several machines lying around, consider combining their memory if possible, but don’t sweat it – this is only your training wheels system.

Install the OS again from scratch, and put SQL Server Developer Edition on there or the 180-day evaluation version of SQL Server 2005 or 2008.  Don’t overcomplicate your life by trying to get every best practice ideally perfect – even if everything’s installed on a 1.5tb C drive, this system will still work for the basics.

Start by testing your restores once per week.  The first few times you do it, don’t try to script the whole thing out – just use SQL Server Management Studio and point/click your way through it.  Remember, high bang, low buck/effort: we want this whole thing to take less than two hours per week of your time, max. The restores aren’t going to be fast, but the point is to even figure out if we can do them, period.  After a couple of weeks, you’ll start scripting your work as you find more and more things that aren’t included in your test system – logins, DTS packages, jobs, whatever.  Document what you’re doing along the way, because every time you find something else that has to be done to make the server work, that’s one less lesson you’d need to learn under the gun.

One Month Later: Add the Apps

After you’re comfortable restoring the database, try to configure your application.  Install IIS, DLLs, code, whatever else you might need to get the app to run.  If you don’t manage the app, ask the app guys to take another old desktop and try to do their part to set up a restore testbed for themselves.  If they don’t want to, that’s okay – but now you’re starting to build up some cover for your rear end.

Some things might not work in your environment.  For example, if you’re using the evil xp_cmdshell, your developers may have hard-coded paths and files into their code.  The faster you find things like this, the faster you can get them fixed before disaster strikes.  When disaster strikes, these problems won’t be seen as developer mistakes – you’ll get blamed, because you can’t make the server work the way it used to.  99% of your problems won’t stem from hardware that you can buy with a check, though – they’ll stem from practices.  Stop waiting for the company to buy you a Thighmaster, and start doing pushups.  It’s better than nothing, and when disaster strikes, the last thing you want to have is nothing.

When there’s enough basic plumbing in place that you think everything works, format the box and start over.  Use your documentation and try to repeat the whole process.  The first several times you do this, you’re going to continue to find more errors and gotchas.

When you think your documentation is complete, format the box and hand the documentation to your junior person or your manager.  Say, “I’ve got a set of steps to follow when disaster strikes, and I want you to test them for me, because if I’m not around then you’ll be the one doing it.”  They will be shocked, but down the road they’ll appreciate your due diligence.

This kind of disciplined effort is why experienced DBAs walk around with an air of confidence.  The best DBAs aren’t worried about what happens when disaster strikes, because they’ve already practiced it again and again and again.  When I was a DBA, I liked to say that disaster struck every week for me – it just struck in my lab.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Why Dedupe is a Bad Idea for SQL Server Backups

Has your SAN admin or CIO been telling you not to compress your SQL Server backups because they’re doing it for you with a dedupe tool like EMC’s Data Domain?  I’ve been hearing from a lot of DBAs who’ve been getting this bad advice, and it’s time to set some records straight.

The Basics of Storage Deduplication

Dedupe appliances basically sit between a server and storage, and compress the storage.  Sometimes they do it by identifying duplicate files, and sometimes they do it by identifying duplicate blocks inside files.  It isn’t like traditional compression because the process is totally transparent to anything that stores files on a deduped file share – you can save an Excel file to one of these things without knowing anything about dedupe or installing any special drivers.  In theory, this makes dedupe great because it works with everything.

The key thing to know about dedupe, though, is that the magic doesn’t happen until the files are written to disk.  If you want to store a 100 megabyte file on a dedupe appliance, you have to store 100 megabytes – and then after you’re done, the dedupe tool will shrink it.  But by that point, you’ve already pushed 100 megabytes over the network, and that’s where the problem comes in for SQL Server.

Dedupe Slows Down SQL Server Backups

In almost every scenario I’ve ever seen, the SQL Server backup bottleneck is the network interface or the drives we’re writing to.  We DBAs purposely set up our SQL Servers so that they can read an awful lot of data very fast, but our backup drives have trouble keeping up.

That’s why Quest LiteSpeed (and SQL 2008′s upcoming backup compression) uses CPU cycles to compress the data before it leaves the server, and that’s why in the vast majority of scenarios, compressed backups are faster than uncompressed backups.  People who haven’t used LiteSpeed before think, “Oh, it must be slower because it has to compress the data first,” but that’s almost never the case.  Backups run faster because the CPUs were sitting around idle anyway, waiting for the backup drive to be ready to accept the next write.  (This will really ring true for folks who sat through Dr. DeWitt’s excellent keynote at PASS about CPU performance versus storage performance.)

With dedupe, you have to write the full-size, uncompressed backup over the network.  This takes longer – plain and simple.

Dedupe Slows Down Restores Too

The same problem happens again when we need to restore a database.  At the worst possible time, just when you’re under pressure to do a restore as fast as possible, you have to wait for that full-size file to be streamed across the network.  It’s not unusual for LiteSpeed customers to see 80-90% compression rates, meaning they can pull restores 5-10 faster across the network when they’re compressed – or in comparison, deduped restores will take 5-10 times longer to copy across the network.  Ouch.

It gets worse if you verify your backups after you finish.  You’re incurring the speed penalty both ways every time you do a backup!

And heaven help you if you’re doing log shipping.  That’s the worst dedupe candidate of all: log shipping does restores across one or more SQL servers, all of which are hammering the network to copy these full size backups back and forth.

So Why Do SAN Admins Keep Pushing Dedupe?

Dedupe makes great sense for applications that don’t compress their own data, like file servers.  Dedupe can save a ton of backup space by compressing those files, saving expensive SAN space.

SAN admins see these incoming SQL Server backups and get frustrated because they don’t compress.  Everybody else’s backups shrink by a lot, but not our databases.  As a result, they complain to us and say, “Whatever you’re doing with your backups, you’re doing it wrong, and you need to do it the other way so my dedupe works.”  When we turn off our backup compression, suddenly they see 80-90% compression rates on the dedupe reports, and they think everything’s great.

They’re wrong, and you can prove it.

They don’t notice the fact that we’re storing 5-10x more data than we stored before, and our backups are taking 5-10x longer.  Do an uncompressed backup to deduped storage, then do a compressed backup to regular storage, and record the time differences.  Show the results to your SAN administrator – and perhaps their manager – you’ll be able to explain why your SQL Server backups shouldn’t go to dedupe storage.

In a nutshell, DBAs should use SQL Server backup compression because it makes for 80-90% faster backups and restores.  When faced with backing up to a dedupe appliance, back up to a plain file share instead.  Save the deduped storage space for servers that really need it – especially since dedupe storage is so expensive.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

My Weekly Bookmarks for October 16th

Here’s my bookmarked links for the week ending Friday, October 16th:

PASS Election Links

PASS Summit Links

SQL Server Links

IT Links

The Junk Drawer

These bookmarks are automatically imported from my bookmarks at Delicious.com. If you’d like to get up-to-the-minute updates on what I’m bookmarking, you can subscribe to my bookmark RSS feed.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Bad News, Good News, Worse News

Ran across a funny situation in my test lab, and it’s only funny because it was my test lab.

The bad news: backups started failing a few days ago. One of the databases had a filegroup that wasn’t online, and as my maintenance plan looped through the list of databases, it died when it couldn’t back up that database.  Unfortunately, it was going in alphabetical order, and that database started with a B.

Jesus Saves - and he always uses the shared drive.

Jesus Saves - and he always uses the shared drive.

The good news: the separate cleanup jobs still worked great. They were dutifully cleaning out any backups older than a few days.

The worst news: database mail had failed – of course, a few days before the backups started failing.  My DNS servers in my lab had decided to take the week off, so email wasn’t making it out of my lab.  I didn’t get notified about the backup jobs that started to fail.

The cleanup jobs worked better than the backup jobs, and the right hand didn’t know what the left hand was doing.  In this particular case, the left hand had been amputated at the wrist.  In a perfect world – or at least, in a world where my job depended on this data – the maintenance plan jobs would be interconnected so that they wouldn’t delete backup files if the backup job failed.  That perfect world would not be my server lab.

No real data was harmed in the making of this blog post, but times like this remind me of just how hard it is to be a good database administrator, and how easy it is to lose data.  Have you tested your restores lately?  Do you really think you’ve got something more important to do?

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts