I remember it really clearly.
In the mid 1990s, long, long before I went into IT as a career, I was working at a photo studio in Muskegon, Michigan. They specialized in high school class photos, and they did a LOT of ’em. Every morning, the photographers would come into the office to collect printouts for the pictures they were going to shoot that day – student name, address, time of the photos, that kind of thing.
My job duties included:
- Removing the prior night’s backup tape and switching it out for a new one
- Running a few database queries to prep for the day’s sessions
- Printing out the appointments & labels for the photographers
One morning, I ran a query without the WHERE clause. Because that’s what you do.
I don’t remember how the queries worked, and I don’t even remember what the database was. I just remember that it ran on SCO Xenix because I remember the manuals so clearly, and I remember that I didn’t panic at all. I knew nobody else had accessed the database yet – I was one of the first ones in every morning – so all I had to do was restore the database and go through the steps again.
But I also remember that the boss (not manager – boss) had an epic temper. As in, swear-and-throw-things-and-fire-people kind of temper. And like me, he was an early riser, and I knew it was only a matter of time before he showed up and looked for his printouts to see who he was going to photograph that day. I was sure I was gonna get my rear handed to me, and I was resigned to that fact.
So I put last night’s tape in, started the restore, and waited. Sure enough, the boss came in, and before he could say anything, I said:
I screwed up, my fault, and I’m sorry. I messed up the queries I was supposed to run, so I’m restoring last night’s backup, and then I’m going to run the queries again and do the printouts. You can fire me if you want, and I’ll totally understand it if you do, but you should probably wait to fire me until after the restores finish and I do the printouts. Nobody else here knows how to do this, and the photographers need to work today.
Worked like a charm. He nodded, and I could tell he was pissed, but he didn’t yell or throw things. He just gruffly left the computer room and did other stuff.
The photographers and other people started trickling in, looking for their printouts, and I explained that they weren’t ready yet, and explained why. They all got big eyes and asked if the boss knew about this, and they were all sure I was going to get fired.
I didn’t get fired, and everybody was surprised. The boss used me as an example, saying, “No, this is what you’re supposed to do – own up when you do something stupid, fix it yourself, and be ready to deal with consequences.”
Part of me was a little disappointed that I didn’t get fired, though. I wasn’t a big fan of that job. I’ve only been fired once – from a Hardee’s – but that’s a story for another blog post.
What about you? Do you remember the very first time you had to do a database restore to fix something you messed up?
I’m from Muskegon, MI originally. Can I ask which studio?
I have worked in IT and with SQL Server since SQL 6.5.
Jason – I’m going to plead the fifth on that one just because I wouldn’t wanna throw anybody under the bus, heh. The city’s big enough that there were quite a few photo studios, so it leaves enough room that nobody’s feelings should get hurt. 😀
No worries! That boss is already dead from a heart attack. This is what happens to bad people in reality, right?
The stress level would certainly point toward that, heh.
I have lived and worked almost my whole life in the Muskegon, MI area. I had to read your post in the email a few times to make sure it didn’t look like a form letter (having a hard time believing you came from here). Anyway, where did you go to cause trouble growing up? Do you miss the Muskegon area at all or ever stray through?
Rob – hahaha, yeah, I went to Whitehall High School (go Vikings!) To cause trouble, we mostly just got drunk and stole harmless road signs.
My dad’s side of the family still lives in Muskegon, so I get out there every couple of years. I don’t miss it too much though – those gray skies in the winter are too rough for me. It’s March as we speak, and I’m staring out at bright blue skies and 66 degrees here in San Diego. This gets pretty comfy out here, heh.
Both my kids are in high school at Whitehall right now. I went to Reeths-Puffer, class of 85.
That is great advice. Always own up and that will be remembered and you will be trusted.
Chris – yeah, we’re all going to make mistakes. The difference is in how you react to ’em.
I remember my first restore.
Our main middleware system went for a reboot and refused to come back up. The entire VM got corrupted, drives and all.
Have a guess where the backups were stored………Yep.
Luckily, for some reason a Mac could still mount the drives and retrieve the backup, but my ass got squeaky for a moment or two. Had an entire day of downtime because of it.
Didn’t get fired because UK employment is more tolerant than the US. Scares me how easily Americans can be fired.
Matt – yeah, I had the same reaction about UK employment, hahaha. Y’all have some serious contracts over there!
Not first but worst. Was writing an update to a few records in a large customer database – they were like a store card and had a credit card type number, and a few had the wrong first few digits for various reasons. In the middle of the query the phone rang and I took a call from a notoriously long-winded customer. Lost the thread, and ran the highlighted portion of a statement in SSMS without the where clause. About 300K records ended up with the same number. Customer was actually pretty good about it, and I wrote procedures for peer review and being undisturbed for this kind of task that are still in use.
Steve – HAHAHA, man, I hate context-switching.
I recall mine – but as a computer operator. One of the programmers screwed something up after a SOC7 (invalid numeric data) error and after hours of trying to fix it and making things worse finally told us he “was out of his depth”. At that point the daytime had rolled around and these databases had to be online for the IBM CICS programs for Order Entry. I’d walked in at 9pm the previous night and simply stayed on through all three shifts and my normal one (35 hours). We were still synchronizing the databases until about 6pm.
The programmer? He quit out of frustration and opened a flower shop. That’s the last we ever heard of him. I got promoted from a computer operator to a programmer a few months after that. I think I took his old job – I worked for the same manager.
Kevin – wow, that’s quite a transition, computer programmer to flower shop. I’m not gonna lie: I wanna learn the Japanese art of flower arrangement, but…yeah, not gonna make that career shift anytime soon.
Oh… I’ve rebuilt a few meta tables here and there… 😉
My main takeaway on a stressful 3am night around 1998 and SQL 6.5: WHY TF does diskdump exist and why did the former sysadmin set backups to go there?!
Also backups don’t matter. Restores are the only thing that matter. Would be an awesome lesson if it wasn’t a 24/7 shop, and I was the fresh out of college sole admin.
Jason – yep, you nailed it, restores are the only things that matter.
This is going to be printed and posted in a place where I can see it every day: “own up when you do something stupid, fix it yourself, and be ready to deal with consequences.” I ran a CHECKDB and, without my knowing, it changed the path where the log files reside for Log Shipping. I didn’t get fired either, but being a woman, they have doubted everything I do from that point till today.
Aileen – AAARGH, everything about that totally blows.
Yup. An Oracle 8.x DB ran out of space, resized it from cmd line. Forgot an additional zero. Oracle happily downsized the table space. All nighter to restore. The company was very near where you lived in Chicago, but your building wasn’t built yet;)
DBA – ugh, it blows that so much of what we do still involves human beings typing numbers into boxes. Software should manage so much more of this. Had somebody hit a similar problem with max memory sizes recently.
In the late 90’s we had issues with connectivity to a server, which held the DB and a large amount of images. We had the local PC support guy reboot the server. Unknown to us, he didn’t reboot from Windows. He just went in and held the power button until it powered down, then hit it again to power back up.
Blue screen of death…
After an hour of remotely trying to get it to boot up we were driving 2 hours to that office. We tried locally for about an hour and decided to put the server in my jeep and drive it back to central office where we had more server support staff.
2 hours later it still didn’t want to come up. We had already started restoring tapes to another server but these servers were large and we really didn’t have a standby server of the same size. Restore was going to take 8+ hours and lose 1-2 days of data.
In walks the Director who was the nicest person up until this point. He politely said if this doesn’t come up today, it could put his job in jeopardy… I politely responded, we might lose a day or two of data but we are still working on it multiple ways. We’ve started the restore and are working on getting the server back up still.
Pressure is on. Bad things are going to happen if we lose 1-2 days of data or it’s not up by tomorrow. An expectation I really didn’t fully understand until that discussion. Everything in the 90’s was get it up ASAP. RTO and RPO weren’t as clearly defined or tested as they are now.
So, knowing the drives are mirrored and RAID, replicated across 5 drives, I decided to see if any of them would boot. I pulled all but the primary drive, no dice. All but 2nd drive, no dice. 3rd drive, nope. 4th, nope. At this point I’m thinking a new op system and restore is in my future. But I go through the motions and pull all but the 5th drive, and… IT ACTUALLY BOOTS. System is up working on ONE drive. We slowly put all drives back in and let the mirroring/RAID do its job. And the system is back to normal.
We drove the server back to the remote office, discussed the proper way to reboot Windows with the local tech, and all was well.
Moral of the story, there are multiple ways to get things back up. Multiple methods should be enacted at the same time. If your RTO is same day, then your equipment should reflect that expectation. If your RPO is 0, then again your setup should reflect that. We’ve all learned this from our 90’s nightmares, right? If you were born in the 90’s this will all sound really strange to you. Yes, we used to build huge servers, that sat in hot rooms, with only nightly backups directly to local tape. And a person picked up those tapes and moved to an offsite location daily…
Mitch – HAHAHA, the power button, that’s awesome.
I’ve luckily never had to do a restore for something I did – about the worst thing I have done is deploy a bunch of stored procedures to the wrong database. Had some close calls, but they were in dev.
My first exposure to backups was in taking over a managed backup service we ran in my first IT job that we had just created. It was based on Simpana 9 at the beginning and because the number of restore options was so daunting I had to go through and make screen shot manuals with condensed restore options for every type of backup we provided. I’d estimate I did around a thousand restores over 3 months just getting practice with it enough to be able to write the manuals and then get the screen shots I needed. Then for every client I set up, initially I’d do a test restore of every backup, but eventually came to trust it enough to only do sample test restores.
I’ve done plenty of restores for mistakes coworkers caused, and one thing I am certain of is that some mid-tier business analyst running around panicking and hovering over my shoulder is the absolute most counterproductive thing to me being able to get them back online. I think the important thing is that whether it’s your fault or something else’s fault for needing to recover a backup, panicking is a pretty horrible way of managing your success and can make things worse, and it absolutely makes me furious when I am trying to work to get something back online and the peanut gallery’s panic interferes with my ability to do it. I just had a colleague restore a server that was taken out by malware, WITH the payload still on it, the server mostly ruined, and it was all because they were freaking out.
My two worst restores came from hardware failure. One of them mostly restored fine, but there was a DLL required for 2008 R2 to boot that was corrupted. That was the first time in my life I observed sfc /scannow actually fix anything….3 days later. The other one was a failed RAID controller that took out all the volumes in a ProLiant ML350 G5. We didn’t know exactly where all the SQL files were, and I learned to use the errorlog and attempt starting the service over and over again until it cleared and the service started. Finding the exactly correct SQL 2000 media to install on the server in 2014 or 2015 was pretty fun when all the client had documentation for were the paper certificates for the CALs out of a retail box that we weren’t sure were actually for that SQL instance. Also, installing Server 2003 on bare metal is absolutely horrible compared to even 2008.
Keith – part of me thinks that today, in the cloud, we still face all these exact same issues. It’s just that instead of getting the chance to troubleshoot it, the server just disappears, hahaha.
I’m waiting for season 2: the Hardee’s
Zhen – HAHAHA, bonus points for the history trivia knowledge there!
I got fired from a zoo for “piloting” a tourist boat that “sailed” on a moat around the zoo.
I accidentally “piloted” it into a well-marked mud bank, right next to the lions’ enclosure. Pretty soon all the kids on board were screaming and crying (really, really, screaming) as the rather grumpy lions came over and pawed at the flimsy-looking fence.
I had to climb out and, waist deep in very stinky sludgy water with lions roaring, push the boat off the bank.
No tips that journey.
Richard – HAHAHA, woooow, that’s a good story there.
My first IT job was in a real estate title office that was probably run by the mob and shall remain nameless. Minor issues were met with slammed doors, tantrums, and comments like, “I don’t know what the f— we pay you for.” Late one night I was moving their database over to a brand spankin’ new NetWare 3.0 server and I, um, accidentally overwrote the only recent copy. (It wasn’t a real backup, it was just a file copy because the “database” was actually just a collection of Btrieve files.) Anyway, I confidently pulled out the most recent tape backup and discovered to my horror that it was not readable. Cold sweat. Not only that, but none of the backups were readable because I was new to this game and didn’t know that you have to actually TEST your backups to make sure they are useable. Panic. It was about 2 AM. I was planning to get on the first plane out of the city (not a joke, I really was). When I ‘fessed up to my boss, who was in the office for the server change, he told me that without my knowledge he made an extra backup of the database onto his desktop PC as a precaution but wanted me to suffer a little bit and confess before he told me. That’s why he had a PhD, I guess.
Curt – I always wondered what it would be like to do IT work for the mob! My guess was that it’d be exactly like a regular penny-pinching company. “Do you know how many guys I had to off in order to afford this server? You better make it work.”
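Curt’s unreadable tapes are the usual argument for restoring a backup before trusting it. Here is a minimal sketch of that habit, assuming Python’s built-in sqlite3 module as a stand-in for whatever database you actually run. The function name `backup_and_verify` and the row-count comparison are illustrative choices, not anything from the stories above.

```python
import sqlite3


def backup_and_verify(source, table):
    """Copy `source` into a fresh in-memory database and check the copy is usable.

    Hypothetical helper: `table` is one table whose row count we compare
    between the original and the restored copy as a basic sanity check.
    """
    # The table name is interpolated into SQL below, so only accept a plain identifier.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")

    restored = sqlite3.connect(":memory:")
    try:
        # Take the backup (sqlite3.Connection.backup, Python 3.7+).
        source.backup(restored)

        # A backup only counts once it has been opened and checked.
        status = restored.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            raise RuntimeError(f"restored copy failed integrity check: {status}")

        count_sql = f"SELECT COUNT(*) FROM {table}"
        original_rows = source.execute(count_sql).fetchone()[0]
        restored_rows = restored.execute(count_sql).fetchone()[0]
        if original_rows != restored_rows:
            raise RuntimeError(
                f"row count mismatch: {original_rows} original vs {restored_rows} restored"
            )
        return restored_rows
    finally:
        restored.close()
```

A real setup would restore to a separate server and compare more than one row count, but the shape is the same: back up, restore somewhere else, then check the restored copy.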
Had a very similar situation back in the good old days in the late 90s – a few months into my first job.
Classic example – forgot to comment out the select statement – ran the 2nd and 3rd lines – Sweet! That’s the record I want to delete; add a delete statement (line 1), then highlight all 3 lines and F5... and you know what’s next… (and don’t ask me why my brain wasn’t thinking at the time, I just wanted to get it done fast)
DELETE FROM TABLE
SELECT * FROM TABLE
WHERE COLA = 1
Wasn’t fast enough to get to the secondary – log shipping had already gone through to the secondary, so warm backups and log backups were the only option. Painful exercise. Two things I taught myself: first, there is always something you can learn, even the hard way; second, I still don’t understand why I had access to delete from the table even though I was the most junior person, and I would probably be more careful than my manager was about granting access.
Ken – HAHAHA, as soon as I saw those 3 lines typed out, I knew exactly what was gonna happen. I’ve done something similar myself, too.
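One common guard against the missing-WHERE delete is to run the statement inside a transaction and only commit if the number of affected rows matches what you expected. The sketch below is a rough illustration using Python’s built-in sqlite3 module rather than SQL Server. The helper name `guarded_delete` and the `max_rows` threshold are made up for the example.

```python
import sqlite3


def guarded_delete(conn, sql, params, max_rows):
    """Run a DELETE and commit only if it touched at most `max_rows` rows.

    Hypothetical helper: anything over the limit is rolled back and reported.
    """
    cur = conn.cursor()
    try:
        # With sqlite3's default settings, a DELETE implicitly opens a transaction.
        cur.execute(sql, params)
        affected = cur.rowcount
        if affected > max_rows:
            raise RuntimeError(
                f"refusing to commit: {affected} rows affected, expected at most {max_rows}"
            )
        conn.commit()
        return affected
    except Exception:
        # Undo the delete on any failure, including the row-count check above.
        conn.rollback()
        raise
```

Calling `guarded_delete(conn, "DELETE FROM t WHERE cola = ?", (1,), 1)` commits a one-row delete, while the same call with the WHERE clause left off raises and leaves the table untouched. The SSMS equivalent is typing BEGIN TRAN first, checking the row count, and only then deciding between COMMIT and ROLLBACK.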
When I got into SQL Server DBA work at a bank, a senior DBA (gifted, but a bit of a show off) was wowing me with the various differences to Oracle (my previous speciality). Did I know that SQL Server, unlike Oracle, could rollback a table truncation? No, I didn’t.
Unable to resist another opportunity to bathe me in the sunshine of his expertise, he immediately started to demo it.
We were interrupted by someone who needed to ask him something; 5 minutes went by; we returned to his demo.
He highlighted the TRUNCATE TABLE statement and hit F5.
Except he hadn’t executed the BEGIN TRAN.
It was a core lookup table.
Nagios was pinging like a popcorn kettle.
I was sweating on his behalf.
Then the phone started ringing.
On a regular basis we would restore a copy of our production ERP database to a QA system. As part of this process, I had developed a “scrubbing” script that would remove all sensitive data like bank account information as well as vendor and customer email addresses, etc. One day, I was dutifully updating my script. My first mistake was modifying my script inside SSMS (I should have been using an “inert” text/code editor). My second mistake was not removing (or at a minimum moving) the “Execute” button in SSMS (it’s perilously close to the save button). My third mistake was already being connected to the production database via SSMS with my sysadmin account (we didn’t have separate accounts at that time). My fourth mistake was listening to a podcast while connected to a production instance (I was distracted). My fifth and final mistake in this comedy of errors was to accidentally hit the “Execute” button when I only meant to “Save” the script. It immediately broke production. By the time I could get my bearings, talk with the boss, and get the ERP application turned off (so no more transactions could be added), 15 minutes had passed. When we restored, we had to skip those 15 minutes from the time of my self-induced data-corruption to when we stopped the application. Yikes! ….and new lesson learned, even if the business claims 60 minutes of data loss is acceptable on the SLA, they don’t really mean it because this was painful for many people and we heard about it from the execs. All the transactions made during that 15 minutes had to be manually recreated.
Andy – oooooof, ouch. It’s like that TV show, “Seconds from Disaster” where they lay out the series of unfortunate events that led to a plane crash. If just ooooone little thing had happened differently…
A month after I started my IT career my boss quit. I was alone. I was alone with 6 T1’s, AD, all flavors of server (think mid 90’s servers though), printers, scanners, UNIX, fax, etc.. and 6 retail locations. I was working 80+ hours a week to keep things together until a new boss was hired.
Then it happened…I deleted 50+ thousand items from our database (SQL 2000). We didn’t have a test, dev or UAT database. Prod was the only one and I performed some spring cleaning on it. I had highlighted the first couple of lines of a delete statement but not the where clause and *poof*. There it all went. I turned red and could feel the panic setting in on me. I was soo scared. lol
To this day, I don’t know how I came up with the idea to check backups but I restored the database from backup. Replication was set up after that and 2 months later, my new boss was hired. I hate admitting that I did this but…it happened. *sigh*
We had a 42TB AlwaysOn cluster in multiple geos, and we were running a multiday operation to remove terabytes of data and rebuild a 10TB unpartitioned index on our biggest table.
Guess what happened?
The primary database service restarted due to an IO latch wait error at 99% completion, and we had to roll back that entire index and restore everything from scratch over nearly a week of effort.
That mega sucked.
First in a production environment… (surprising that it took nearly 10 years into my career to have to do this)
Worked for a software company – probably back around 2009/10ish, we were running SQL 2008 and had for the developers/testers a fairly primitive script to restore a copy of the production environment for clients back to a dev/test server so they could work on bugs and the like. I inadvertently passed the parameters to the procedure the wrong way around and started restoring the test environment (which was months old by that stage) onto the live production system.
Fortunately I had enough foresight in the restore script to at least take a differential backup of the database it was restoring over as a “just in case” – so when I noted about 10 minutes into the restore (before the phones started running hot) I killed my accidental restore and started restoring the latest full backup plus the pre-restore differential. Went straight to the service desk folks, told them the phone was about to run hot and why and that it’d be back in about an hour. Next step to the managing director of my company and finally to the head of IT of the company whose DB I had screwed up.
Much like yourself I learned the lesson that accountability – owning up to your shit straight away – can get you through some pretty big mistakes and I badger that message to the team I lead now pretty regularly – I will 100% back you regardless of how big you f*&# up as long as you own up to it straight away, fix it straight away and tell me what you’re going to do to make sure it can’t happen again.
No surprise that after that the restore script got some upgrades to prevent restoring to a prod environment.
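The story doesn’t say what those upgrades were, so the following is only a guess at the general idea, written as a Python sketch. It takes source and target as keyword-only arguments, so they can’t be silently swapped by position, and it refuses any target that looks like production. The name `plan_restore` and the environment labels are hypothetical.

```python
# Hypothetical labels for environments that must never be a restore target.
PROTECTED_ENVIRONMENTS = {"prod", "production", "live"}


def plan_restore(*, source_env, target_env):
    """Validate a restore request before anything destructive runs.

    The bare `*` makes both parameters keyword-only, so a call such as
    plan_restore("prod", "test") fails with TypeError instead of relying
    on the caller remembering the argument order.
    """
    source = source_env.strip().lower()
    target = target_env.strip().lower()

    if target in PROTECTED_ENVIRONMENTS:
        raise ValueError(f"refusing to restore over protected environment: {target}")
    if source == target:
        raise ValueError("source and target environments are the same")

    # Only the validated pair is returned; the actual restore happens elsewhere.
    return source, target
```

With this in place, `plan_restore(source_env="prod", target_env="test")` returns the pair, and reversing the two values raises before any restore starts.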
I had to do a restore just last week because another developer didn’t highlight the where clause before running. I have also done the highlight without the where clause before, so I had full sympathy as I was helping him.
Worst I had was at one of my previous companies where they sent a salesperson with my update script, as they were trying to save costs by only sending one person to the site. Asked him to make sure they had a backup before running the script, and then run it. It was a badly designed database using the customer number as a primary key and my script was to update them to a new structure – a trigger on the table went awkward and all customers got the same customer number for some reason (I swear this worked in the test environment). That’s ok, they just needed to restore the database – which was a good time to find out that they had zero technical skills at the site and that the manager thought his exported Excel spreadsheet from the application was a backup… They had to start over on a clean database.
I’ve done other things that caused downtime but nothing requiring a restore. However, when I was working for an EHR vendor we had a bug where when a doctor saved a patient note it would get truncated. While development worked on fixing the bug the process was for support to run a SQL statement to set the content for the note to the last valid version. As was bound to happen one of the techs ran the query without the where clause and set every version of every note to the same thing. Fortunately that client had space to restore a second copy of the database alongside prod so I could just copy over the notes instead of restoring the whole database.
First RECOVERY – 52 Banks down – YES, I mean 52 Banks and ALL their branches
Long-Story-Short: the overall gotcha was that 40 thousand tapes in the tape library were occupied due to active retention periods. First week on-call and I had never used the tape management CLI ever. I think after that professional monitoring systems evolved from 😉
RTO-wise, the most time was spent defining the RPO, the point in time to restore to.
After all that, the restore went well.
While this was ongoing the bankers were hand writing the bank transactions on paper, and later when the bank had closed updating them on the system
Good post @BrentOzar
GUN10 – ouch, 52 banks, daaaaamn.
I remember the cold sweat. This was a production database during working hours. Pharmacy. People were calling asking why patients had disappeared.
Loved your post today Brent, really the kind of light read I needed today… I didn’t feel like getting too technical for some reason and you read my mind…
I don’t think I ever messed up like that, but just because I’m extremely paranoid about any statements having the words UPDATE/DELETE/ALTER on them… maybe I was lucky… I dunno.
One tool I discovered recently (since we are on that topic) is SSMS Boost, which is a plugin that, among other things, has settings to force a popup to come up when it detects such statements being executed. You can even configure those warnings on a per-connection basis so it only kicks in when you are issuing statements on production environments (by the way… I don’t work for them, nor do I have any affiliation with them, I just thought it would be a good tip to share that goes hand in hand with your post).
Thanks, glad you enjoyed it!
Cool story! Haha! Mine is boring... I have never had that yet as a DB admin. I did click on ‘Start XP installation’ instead of (and before…) ‘Backup hard drive’ in the boot menu during a migration of a salesperson’s laptop.
He also was the kind of person that stores his whole life on the hard drive of his work laptop (holiday pictures, movies, private government related stuff, nudes also would not have surprised me), so he was not so happy, to say the least. Losing his stuff would have come one day with that strategy anyway, but it was a little sooner than I thought.
We eventually went to a specialized company that restores hard drives in a lab, but the imaging process already touched all sectors, so bad luck. I did the same thing. Just say what you have done directly and try to fix it.
I did the classic – DELETE statement without a where clause. It was for a retail company and I deleted about a thousand price changes that would all have to be manually re-keyed if I didn’t get the data back.
I didn’t know this at the time, but apparently I was the DBA, as well as the developer (it was a small team), and the backups were down to me. It’s at this point I realised the data and log backups had been failing for about 6 months, and the T-Log had just been growing and growing and growing. It had got so big that despite my (inexperienced) efforts, every attempt to back it up and then restore to a point-in-time failed.
Now, this part is a little sketchy, but I eventually solved the problem with the help of a colleague who found a piece of freeware that would scan the T-Log file and create a script to reverse each statement for each transaction, so the DELETE became a series of INSERTS. I ran the script and my data returned (with new keys and timestamps, but I could live with that fortunately).
Not sure how it happened but it did. I was a relatively new employee with my current employer many years ago. At that time I was a developer not even using a db system. Another area of the company was developing an application that was using SQL Server 6.5. Again, not sure how it happened but it happened and I deleted the “master” db. Oops. Welcome to the world of databases and SQL. As it turned out, we did not have any backup of master, so I had to restore SQL Server. This led to finding out about collation and choices made during setup. Well, the whole experience opened my eyes to databases and I started a regimen of new opportunities for growth. 24 years and still learning.
Luckily I never (knock on wood) messed up so bad in production that I had to restore a full database, Brent… One time I messed up the Customers table in our CRM application by giving them all the same name (yeah, those pesky where-clauses) but I restored the backup as a copy database and repaired the Customers table using that one.
Also, there is a song for this: https://www.youtube.com/watch?v=6sUSJE8pxsg&ab_channel=RodneyKrick