DBA Nightmare: SQL Server Down, No Plans

Managing data is about managing risk, but no matter how good we are at managing risks, they’re still risks.

We’ve seen several high-profile data failures recently.

Ouch.  It’s time we start a series of DBA Nightmares to cover basic preparations that should be a part of every DBA’s career planning.  Why career planning?  Because if one of these happens to you and you’re not prepared, it’s a URLT moment – Update Resume, Leave Town.  If, on the other hand, you’re well-prepared and react smoothly, this could be your moment to shine.

Today’s Nightmare: From-Scratch Server Restore

Let’s be honest: most of us have never rebuilt a production server from scratch under duress.  Many of us bury our heads in the sand, hoping production will just keep on keepin’ on.  We don’t test our backups, and even if we do, we don’t go to the extreme of attempting a complete from-scratch reinstall.  When the system is down and the CIO’s standing behind us, tapping us on the shoulder, we learn some ugly lessons.

Right away, you need to choose one of two recovery plans: will you try to restore everything exactly as it was (including the system databases), or will you build a new server from scratch and just restore the user databases?  Ideally, you’ve designed your recovery plan ahead of time, but in a nightmare scenario, you’re standing in the datacenter with empty pockets and no game plan.

If you decide to restore the system databases, you should try this ahead of time.  Restoring the master database is different from restoring a typical user database because you can’t use SQL Server Management Studio.  You have to start SQL Server in single-user mode by adding the -m startup parameter, use SQLCMD to restore the master database, then remove the -m parameter and start SQL Server back up again.  If you’re using a third-party product to do your database backups, it’ll require its own procedure, like Tom LaRock’s instructions on restoring master with Quest LiteSpeed.  After restoring master, you’ll need to restore the msdb database, but fortunately that one can be done through the SSMS GUI as long as the SQL Server Agent is shut down.
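Here’s a minimal sketch of that sequence from the command line, assuming a default instance, Windows authentication, and a master backup at D:\Backups\master.bak (your paths, instance name, and service name will differ):

    REM Stop SQL Server, then restart it in single-user mode
    net stop MSSQLSERVER
    net start MSSQLSERVER /m

    REM Restore master via SQLCMD; the instance shuts itself down when this finishes
    sqlcmd -S localhost -E -Q "RESTORE DATABASE master FROM DISK = N'D:\Backups\master.bak' WITH REPLACE;"

    REM Start SQL Server normally (no -m) and move on to msdb and the user databases
    net start MSSQLSERVER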

If you don’t restore the system databases, you may be able to get your server up and running faster – at the cost of some configuration data.  For example, logins, Agent jobs, and linked servers are stored in the system databases.  On a small development server with a handful of logins and only maintenance jobs, it might be easier to install a fresh instance of SQL Server on a newly installed server, then just restore the user databases.  (This is one of the reasons I try to avoid excessive custom logins or Agent jobs where possible on development servers.)  Knowing your recovery process and risk will help you design your SQL Server security and Agent job configuration better.
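To weigh that tradeoff, it helps to know exactly what you’d be giving up.  Here’s a rough sketch of a pre-disaster inventory you could run and file away; the filters are just a starting point:

    -- Logins live in master
    SELECT name, type_desc
    FROM sys.server_principals
    WHERE type IN ('S', 'U', 'G') AND name NOT LIKE '##%';

    -- Agent jobs live in msdb
    SELECT name, enabled
    FROM msdb.dbo.sysjobs;

    -- Linked servers live in master
    SELECT name, product, data_source
    FROM sys.servers
    WHERE is_linked = 1;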

If you decide ahead of time that your recovery plan involves a fresh OS and SQL Server, there’s one thing you can do to make your recovery process easier: automate login creation.  Schedule a job to run weekly with Robert Davis’s login copy script and send the results to yourself via email.  That way, at the very least, you’ll have the exact list of logins, passwords, and SIDs to avoid the orphaned login problem when you restore databases.  Run the create-login script sent to you by Robert’s tool, then as you restore each user database, its users will automatically map to the recreated logins and people can resume work as normal.
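Robert’s script does the heavy lifting, but the core idea is simple: capture each SQL login’s SID and password hash so the login can be recreated exactly.  A bare-bones sketch of the capture side:

    -- Capture SQL logins with their SIDs and password hashes.
    -- Recreating a login WITH PASSWORD = ... HASHED, SID = ... preserves
    -- the database user mappings, so nothing ends up orphaned.
    SELECT name,
           sid,
           LOGINPROPERTY(name, 'PasswordHash') AS password_hash
    FROM sys.sql_logins
    WHERE name NOT LIKE '##%';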

To plan your build-versus-restore decision ahead of time, it helps to think through all of the implications.  These are just some of the questions you’ll need to answer when designing a disaster recovery plan:

  • What service pack and cumulative update level was the server running?
  • Did we have any non-SQL applications installed?
  • Were any server-level settings like trace flags configured?

For answers, try running my Blitz script ahead of time.  I bet you’ll learn a lot about your servers – I know my clients do!
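In the meantime, a couple of built-in queries cover the first and third questions – run something like this on each server and stash the output somewhere other than the server itself:

    -- Version, service pack level, and edition
    SELECT SERVERPROPERTY('ProductVersion') AS version,
           SERVERPROPERTY('ProductLevel')   AS service_pack,
           SERVERPROPERTY('Edition')        AS edition;

    -- Globally enabled trace flags
    DBCC TRACESTATUS(-1);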


9 Comments

  • When considering third-party apps, oftentimes you can’t just restore the DB to get things back up and running. Sometimes you need to do app-side things too. So going from scratch isn’t just a DB-level exercise; it has many layers, some of which need to be coordinated with other groups. DBAs need to ask, because if the app can’t start, it can still look like their fault.

    • Yeah, absolutely. It’s so important to try the whole thing ahead of time. If you don’t, then you end up randomly jiggling things until they unbreak.

  • How practical would it be to have a “full restore from blank disks” be part of the normal workflow? Given a production server, a slave/hot-backup, a dev server, etc., it seems there should be ample opportunity to practice/automate your disaster recovery.

    • BCS – It depends on your environment. When I was a DBA, I played fire-drill-restore whenever I got a new server to build. Before I built the new server, I’d use the hardware to act as if I had to restore one of my production servers from the ground up. It kept me really sharp.

      • I’ve been doing these tests with my new Dell server and have uncovered major hardware issues/failures in the process. I’m so glad I found these problems before putting the server into production.

  • URLT – I’m so stealing that. 🙂

  • “Let’s be honest: most of us have never rebuilt a production server from scratch under duress.” Then I’m a proud member of the minority! After a SAN failure and restore, we had overnight to rebuild a 2005 cluster. The most unsettling moment happened when we installed the primary node and patched it up to a service pack and then a CU, IIRC. Installing the secondary node after that didn’t go so smoothly. We had to call MSFT support and do some registry hacks according to some internal articles to get the install rolling.

    After that we went on to restore the system databases and happily watched about 150 user databases come online just in time for the 7am opening – happy day!

    Then on to Oracle drivers, and other tweaks for SSIS and whatnot.

    We use Rodney Landrum’s SSIS collection method to automate collecting data about our environment – that was a great thing to have on hand.

    Fortunately I’d been around for the install, so I knew all the little bits of the server. I’d say if you’re in a new position and you don’t have reliable documents, doing a test rebuild should be very high on your list.

    • Sam – ugh, sounds like a rough time! The first time is always the worst, but it’s such a great feeling once you’ve done it and you know you can conquer anything. 😀
