9 Ways to Lose Your Data

Every time someone tells me, “This database is mission critical – we can’t have data loss or downtime,” I just smile and shake my head. Technology is seriously difficult.

To illustrate, here’s a collection of client stories from the last few years:

  1. The DBCC CHECKDB job ran every week just like it was supposed to – but it failed due to corruption every week. No one got email alerts because the SQL Agent mail profile was no longer valid – internal email server changes meant the alerts were just piling up inside SQL Server, undelivered. CHECKDB had been failing for three years, longer than the backups were kept. Data was permanently lost.
  2. The DBA configured his backups to write to a file share. The sysadmins never understood they were supposed to back up that file share. When the DBA asked for a restore, he was surprised to find there were no backups.
  3. Three SQL Servers were all replicating data to each other. When I asked the DBA where the backups were run, he looked at one server, then another, then the third. He sheepishly admitted – in front of his manager – that there were no backups done anywhere.
  4. The DBA set up full backups daily, plus log backups of all databases in full recovery mode. Later, she put a few databases into simple recovery mode in order to fix an issue. She forgot to put them back into full recovery mode. When problems struck and she needed to recover a database, she lost all data back to the prior full backup.
  5. The SQL Server ran out of space on the C drive. During emergency troubleshooting, someone deleted a bunch of BAK files. The server started up, but databases were offline and corrupt. Turned out the user databases were on the C drive, as were all of the backups – the very backups that were just deleted to free up space.
  6. The DBA started getting odd corruption errors on one of his servers, then more, and quickly all of them. The SAN admin had flashed the storage with new firmware – which had a bug. The DBA was writing his backups to that same SAN, and sure enough, some of the corrupt databases had corrupt backups too.
  7. The admin wanted to restore the production databases onto another server. He tried, but it kept saying the files were in use. He stopped the SQL Server service, deleted the files, started it again, and finally his restore worked – but his phone lit up. Turned out he’d remote desktopped into the wrong server – he was on production.
  8. The developer did a deployment on Friday afternoon, tested it, and went home. However, the deployment had an UPDATE script that wrote invalid data to a percentage of the rows in a critical table. Over the weekend, people worked in that table, putting in valid data and changing some invalid data. On Monday, by the time the team figured out what had happened, the data in the table was a total mess – with real valuable data that shouldn’t be restored over.
  9. The team built a complex AlwaysOn Availability Groups infrastructure for really reliable databases, with full backups every night and log backups every 4 hours. When someone accidentally issued a delete statement, that delete was instantly replicated across to the other replica. They could still recover back to a point in time – but it meant serious downtime for the AG, since you can’t restore a database while it’s in an Availability Group. While they debated what to do, more valid data went into their production database – meaning they could no longer simply do a point-in-time restore.

Each of these teams thought their data was safe.

They were wrong.
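Several of these stories hinge on point-in-time recovery working when you need it. As a hedged sketch – database names, file paths, and the STOPAT time below are all made up – a point-in-time restore looks roughly like this:

```sql
-- Hypothetical names and paths throughout; adjust for your environment.
-- 1. Take a tail-of-the-log backup so the most recent activity isn't lost.
BACKUP LOG SalesDB TO DISK = N'D:\Backups\SalesDB_tail.trn' WITH NORECOVERY;

-- 2. Restore the last full backup without recovering the database.
RESTORE DATABASE SalesDB FROM DISK = N'D:\Backups\SalesDB_full.bak'
    WITH NORECOVERY, REPLACE;

-- 3. Restore each log backup in sequence, stopping just before the bad statement.
RESTORE LOG SalesDB FROM DISK = N'D:\Backups\SalesDB_log1.trn'
    WITH NORECOVERY, STOPAT = '2015-02-19 14:05:00';
RESTORE LOG SalesDB FROM DISK = N'D:\Backups\SalesDB_tail.trn'
    WITH RECOVERY, STOPAT = '2015-02-19 14:05:00';
```

Note that this only works when the database has stayed in full recovery mode with an unbroken log chain – exactly the assumption that stories 4 and 9 show going wrong.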

Kendra says: OK, I’ve got to admit it – I’ve been part of teams where we’ve fallen victim to more than one of these scenarios. And I work on teams with smart people! This can happen to you – look at these stories and your own environment slowly and carefully.


32 Comments

  • Wow. Nine terrifying true stories.

  • Makes you think… I guess the only true way to know if your plan is solid is to test it, regularly.

  • First thing I did when taking a new job was ask “what’s the DR plan?” They didn’t have one. Well, they do now. It’s just one of those things we all hope to never need, but when we do, we hope it’s there and actually works.

    • When I was independent I always gave away the first hour free…but I used it to make them show me the backups were working on the servers I would be touching 🙂

  • # 1. My DBCC CHECKDB job also writes a record to the Windows log.
    # 2. We write our backup files to 2 different places. We are paranoid.
    # 5. We store our databases on drives other than C:. The backups are stored on drives other than the ones holding the DBs and logs. Plus they are in 2 spots. Well, technically 3, since we also use AppAssure to back up the backups.
    # 6. That is a concern for us also, but since we write the backups to a spot outside the SAN, hopefully we’d be OK.
    # 7. Whoops, I’d never do it that way (I hope).
    # 8. Heavy sigh…
    # 9. Last time we had a weird issue (power outage), I made sure we let no one in the system while we troubleshot. Better to have 50 people idle for 30 minutes than 50 people idle for a day.

  • We had a client that called us to say that our application had deleted months of data and that we were responsible for the data loss. It didn’t take long to find out that he had done a restore of their latest full backup, but their last full backup was from 3 months earlier! Luckily for him we were troubleshooting an issue earlier in the week and we had taken a backup to run tests at the office. They only lost 3 days of data and he kept his job.

  • My piece today on Striped Backups comes out of a story I had. A client ran backups, striped, inadvertently because of the poor GUI in SSMS. They had been doing this for weeks and called me when they needed a restore and couldn’t get the “one” backup file they had to restore.

    I hate being the bearer of bad news.

  • Kenny Boudreaux
    February 19, 2015 2:23 pm

    Great article. #9 hits home for us because we’re just starting to add AlwaysOn Availability Groups in our SQL environment to replicate client data to our DR site.

    We hadn’t really considered this restore scenario… now I need to figure out what we’ll do in this scenario. Research time! I expect that you have to delete the DB from the secondary site, do the restore, then add it back to the AAG?
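    Roughly, yes – as a sketch (the group and database names below are hypothetical), the sequence looks like this:

```sql
-- Run on the primary: take the database out of the Availability Group.
ALTER AVAILABILITY GROUP MyAG REMOVE DATABASE SalesDB;

-- Do the restore as needed (point-in-time, etc.). Restore the copies on
-- the secondaries WITH NORECOVERY so they can rejoin afterwards.

-- Then add the database back to the group, again on the primary...
ALTER AVAILABILITY GROUP MyAG ADD DATABASE SalesDB;

-- ...and join it on each secondary.
ALTER DATABASE SalesDB SET HADR AVAILABILITY GROUP = MyAG;
```

    The database is out of the AG (and unprotected) for the duration, which is the downtime the article describes.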

  • work for Target
    work for Home Depot
    work for Depont
    work for Horizon BCBS
    work for Anthem BCBS
    work for Nordstroms

    I’m sure there are a few others.

  • I’ve seen or participated in or have predicted most of these and am just waiting on the rest 🙂 Many of the problems extend well beyond just the duties of a DBA, though, and are affected by corporate spending cutbacks, corporate culture, and poor documentation. On that last point: good documentation means more than just one of hundreds of Word documents bunged into a document control system – it means an easy-to-access, easy-to-update, scriptable, living and breathing system for storing and retrieving server and configuration information, which nobody seems to have.

    1) Regarding CHECKDB corruption, this is a failure on the DBA’s part. Relying on an email isn’t enough; we have scripts that actively reach out and check both that the jobs are configured to run and that they have run successfully, and we check the numbers on that every single morning. I can tell you now 20 databases out of 220+ in the environment are off track, and we deal with those as they come up.

    Why 20? A couple databases can’t be checked properly due to problems with spatial data types and we are switching them to physical checks only. Others are too transactional causing tempdb log to blow out during checks and have recently been disabled while we capture baselines to work out the best time to move them to and/or allocate extra disk space.

    But we’re not squeaky clean. Backups to disk only last 1-3 days, because all of that disk is just wasted money, right? (Projects will fight that they don’t need to “buy” that disk because it’s not required… yeah right.) We run CHECKDB once a week, and that means there’s a significant gap there.

    Oh, of course the outsourced data centre backs up to tape (see the next topic), but it has exactly one retention option: 30 days. It doesn’t keep one tape set per month or one per year – that’s not even an option. So, we are “likely” covered.

    By the way, what does corporate think? Their documentation states that any database must have 7 years of complete data retention including audits of all logins and data changes. No reason, it’s just the only document they have, written by an architect in some high tower out of touch with reality. We’ve yet to find a server configured even remotely close to that. Of course we questioned the discrepancy but it was swept under the carpet.

    2) Backups to a file share that isn’t backed up. The outsourced data centre refuses to provide us current lists of what is backed up; it seems they don’t know how to use the reporting feature of NetBackup?

    Well they did once, 6 months ago… and it appeared to be cut and pasted into a Word document by hand and as such it was next to worthless in determining our coverage.

    Of course we shouldn’t worry, because once we back up to disk it’s no longer our problem. We have neither the interfaces nor the authority to request that the outsourced data centre add, remove, or change any of those configurations – it’s meant to be done by whoever set up the server in IT.

    We raised that as an issue and it joined the previous one under the rug. We’re going to need a bigger rug.

    3) This one is a clear failure on the DBA’s part. Not having scripts to check msdb.dbo.backupset each day and verify that at least some kind of FULL/LOG backup went SOMEWHERE? Unacceptable.
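    A minimal version of that daily check (the 24-hour threshold is arbitrary) can be as simple as:

```sql
-- Databases with no full backup recorded in the last 24 hours.
-- Run against msdb on each instance; threshold is arbitrary.
SELECT d.name,
       MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
    ON b.database_name = d.name
   AND b.type = 'D'  -- 'D' = full database backup
WHERE d.name <> 'tempdb'
GROUP BY d.name
HAVING MAX(b.backup_finish_date) IS NULL
    OR MAX(b.backup_finish_date) < DATEADD(HOUR, -24, GETDATE());
```

    Any rows returned are databases with no recent full backup – the exact situation in story 3.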

    4) 220+ servers, 2,000+ databases, what recovery mode should each be in? Vendors (or the Business Analysts / Project Managers attached to them) don’t know or refuse to say (“that’s your job to determine!”). Corporate also never kept a proper list of which departments owned which servers, let alone which databases.

    So while we don’t change those settings ourselves, and we ask new setups if they want point-in-time recovery (and assume so unless told otherwise), all the previous databases are in a fingers-crossed situation.

    Of course you can’t just go change them all to FULL recovery. Disk space and backups become issues. Nothing is easy and this is super low on the priority list. Now do we also script out which databases are in which mode so we can determine if they have changed? Yes. But we don’t track it because… there are a total of 10,000 logins with sysadmin privileges.
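    Scripting out the current state is the easy half of that – comparing against a saved baseline is left as an exercise, but the snapshot itself is just:

```sql
-- Snapshot of each database's recovery model; save today's output and
-- diff it against yesterday's to spot databases that quietly changed.
SELECT name, recovery_model_desc, state_desc
FROM sys.databases
ORDER BY name;
```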

    5) C: filled and all they deleted were the databases and backups? Luxury. Outsourced IT “helps” us by deleting everything from C:\Windows\Installer. SQL upgrades often fail, and one day when we get some authority to apply service packs, we are likely going to be in a world of hurt.

    6) This one just sucks. No way around it. You can back up with checksum (and deal with the hit to your network and SAN from reading all that data back to verify it), but even then, if it gets corrupted afterwards, what’s anyone to do?

    This is why I think disk to tape backups will always have a place.
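    For reference, the checksum option mentioned above is just a WITH clause on the backup (names and paths here are hypothetical):

```sql
-- Write page checksums into the backup, and fail the backup if any
-- page being read is already corrupt.
BACKUP DATABASE SalesDB TO DISK = N'D:\Backups\SalesDB_full.bak'
    WITH CHECKSUM;

-- Later, re-verify the backup file without actually restoring it.
RESTORE VERIFYONLY FROM DISK = N'D:\Backups\SalesDB_full.bak'
    WITH CHECKSUM;
```

    It catches corruption that happens before or during the backup – not, as the comment says, corruption that hits the backup file afterwards.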

    7) This one just sucks. I’ve done things on the wrong server before (multiple times throughout the past decade) and it’s a real hard problem to solve that can strike at any point; from reading the original specs, to writing it down repeatedly in documentation and change tickets, to finally carrying it out and mistyping or misreading the screen. It doesn’t help that so many servers are assigned similar and nonsensical names. Heaven forbid they should be descriptive… but Microsoft only gave us 15 characters and that’s what we have to deal with.

    8) This one just sucks and there’s nothing you can do about it. Heaven help them if they’re not going through at least one other environment between development and production for some third party verification, but even then it can be missed. Also, out there in vendor land, I can guarantee that most of them are NOT doing even that much. Automated application testing? Database replay traces? Oh lordy lordy, it’s rarer than religious figures on toast.

    9) This one just sucks. As we all know, HA isn’t DR, and again, while you hope they were running some kind of script that went through testing or QA, it could very well have done so and still caused this problem.

    You know what the real lesson here is? As a DBA you can only do the best you can and your job is still expendable. Keep on good terms with your previous/current bosses so that you can get good references, document well what you’ve done (so you can keep your resume updated), and hope for the best. One day you or someone like you will make a minor mistake that has serious consequences and your job is gone.

  • First, let me just say, I can’t believe people still do deploys on a Friday afternoon – and if you do, you kinda deserve whatever problems you get!

    At any rate, regarding #7 and “accidentally” logging in to the production server, I am curious what others’ thoughts are on server naming conventions that might prevent this. I’m not in charge of naming servers, so I don’t get to make those decisions, but as someone who has trouble remembering which server is Prod, Staging, or Dev because of our odd naming convention: would you say that a server’s name should indicate what type of environment it is, or is it normal to name servers without regard to the environment type? Just curious what others think.

    p.s. I keep a print-out of which servers are which that I refer to OFTEN to make sure I’m on the server I think I am.

    • As we’ve recently been told by our CSOX auditor, developers are not supposed to have access to the production servers. Any production deployments should be done by production admins, so this should be mitigated by an “Access Denied” message when you try to log in to the wrong server. Now, real world is a bit different. Being the only DBA on staff, I don’t have much choice in having access to both development and production, but I am making sure I always check the server I’m on before I run anything and check the backup copy of my resume in case I blow it.

      • We are also getting hit by auditors. We’ve developed an authority matrix, and have release-based deployment in place, with a dedicated deployment team (even for SQL DDL deployments, which are scripted and tested in a pre-production environment before release).

        However, most companies are nowhere near large enough to have that kind of staffing. Finding the happy medium for such companies will be challenging.

    • Many shops – and not only small ones – have that exact problem. I usually suggest setting up a specific user account and login that ONLY works on prod – nowhere else. That way, you can’t accidentally log into prod with your normal credentials thinking you are on the UAT or test box.
      Admittedly, I usually get the occasional blank stare when I suggest this, but after the first disaster everyone agrees to it 🙂

    • We have hundreds of servers and the naming convention for them is dictated by our server admin team. It’s the same convention across all servers (applications, email, domain) and it does reflect its prod/qa/dev status and purpose in the name of each server.

  • > The SQL Server ran out of space on the C drive.

    This is why I keep RESERVATION FILES (files that take up a chunk of space and can be deleted in just such an emergency) on every drive. See http://www.sbrickey.com/Tech/Blog/Post/Reservation_Files

  • Regarding the DBCC command that had been failing, and the failure emails that had also been failing: I see similar things implemented frequently. Jobs, processes, and servers that monitor themselves. Bad idea. If the job/process/server goes south, how do you expect it to reliably tell you?

    Set up a 2nd server to monitor jobs/servers/processes, and a 3rd server to monitor the 2nd.

    • That might be a tad overkill, but you can purchase very solid monitoring tools that will alert you whenever any job fails, independently of SQL Server’s failure notifications. One is Spotlight by Dell.

      The trick is not to go off on a tangent and disable all of the alerts. Live with the initial deluge of alerts and then learn what each one means BEFORE you disable them. Even after that, try not to disable anything but instead to set thresholds that will bring your attention to an issue before catastrophe strikes. An alert that can be disabled: Excessive use of one CPU core. An alert that can be tuned down: Total CPU usage.

      You can also configure a good tool to page you for certain errors, and even to run programs in the event of an alert (excellent to handle additional data-gathering for triage, or to stop a process before it fills up a database).

  • I never trust that email alerts are working, so we set up a job that sends us an email every day. If we don’t get it, alerting is down and we need to fall back to manual checking. Obviously that only covers us against one small hole – but it’s a hole that seems to happen a lot.
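    A sketch of that heartbeat (the profile name and address are made up) – the arrival of the message is itself the test, so if it ever stops showing up, the alerting pipeline is broken:

```sql
-- Scheduled daily in a SQL Agent job; assumes Database Mail is configured.
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = N'DBA Mail Profile',      -- hypothetical mail profile
    @recipients   = N'dba-team@example.com',  -- hypothetical address
    @subject      = N'Heartbeat: mail alerting is alive',
    @body         = N'If this stops arriving, check Database Mail and SQL Agent.';
```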

  • #7 Suggestion: This is by no means fool-proof, but it’s been very helpful to me. I set the Windows desktop background color for my DEV servers to a really hideous, bright green color. If I’ve got several RDC connections open simultaneously, it’s pretty obvious that I’m on DEV (or not).

    Granted RDC to PROD isn’t recommended, but it happens. This is a visually loud warning.

  • The hardest part about all of these problems is that they are a constant threat and need to be checked for periodically. I can set up a system today that avoids all of these issues, but will that be true tomorrow when I add a database? Or a server? Or a new storage platform? Etc.

    That’s why I’ve been working on setting up a quarterly check to test these things and (most importantly) validate that our monitoring is working. Trusting your monitoring is one of the biggest pitfalls I see people run into. It looks good when all of your sensors are green… but man, does it suck when they’re green and they shouldn’t be. And wouldn’t you know it – you typically only figure out that they’re incorrect when the !@#$ has already hit the fan, which is too late.

    • Monitoring is only as good as the person(s) managing the monitoring tool. As I posted earlier, buying a good packaged monitoring tool and then deactivating most alerts is foolish. I also think that quarterly review of monitoring effectiveness is inadequate. The monitoring software should be constantly checking itself and the DBA staff should review alert settings at least monthly if not much more often.

      The best tools make it a bit difficult to ignore/disable alerts, for good reason.

      • Every alert in our environment is actionable… a sensor/alert that isn’t actionable is either useless and should be removed or needs to be changed so that it’s useful. Reviewing your environment as a whole and checking to make sure your assumptions about the environment are still valid needs to happen on some sort of interval. I want to do it quarterly because we don’t have enough resources to do it any more frequently than that. I 100% agree that more frequently should be the goal but there’s a difference between what I want to do and what I can do.

        At the end of the day a quarterly review of the environment is a significant improvement over doing nothing at all… which is where a lot of people are at these days.

  • I remember one system where the admin kept forgetting to change the tape. The result was the backups stayed on the disk. We got called when the users could not update the DB because the disk was full. Another good reason to keep the DB and backups on separate disks.

    My favourite was the DBA who decided to do a DR test on a live server before checking he had a decent backup. Luckily, one of my colleagues had burnt one to CD before leaving site.

  • Saw this in a code review…

    UPDATE a
    SET a.whatever = b.whatever
    FROM a
    JOIN b ON a.id = a.id – which will always be true, so the entire table was updated.
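    For contrast, the join that was presumably intended matches the two tables on their own keys, so only the rows with a counterpart in b get touched:

```sql
-- Corrected version: compare b's id to a's id, not a's id to itself.
UPDATE a
SET a.whatever = b.whatever
FROM a
JOIN b ON b.id = a.id;
```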

  • Client got hit by ransomware, as we have seen many times. In this particular hit, step one: the SentinelOne ransomware protection was uninstalled!
    NetBackup toast, password vault toast, SAN LUNs toast, SQL Servers toast. The only thing that saved them from a total loss was a checkbox on some of the LUNs that kept them from being deleted without the SAN vendor’s involvement. We are in the age of ransomware – keep your backups off your domain. If the same credentials can access your SQL Servers and your backups, you’re doing it wrong. It’s probably just a matter of time before disaster strikes.

