Last week, I asked for your worst database horror stories. Hoowee, did you deliver. Make yourself a nice cup of cocoa, gather around the campfire, and listen up for these terrible tales. Illustrations courtesy of Swear Trek (not safe for work).
The mistake we’ve all made
Russ did an update without a where clause, but his storytelling earns a laugh and a prize:
One of my first SQL 6.5 scripting tasks was to correct the spelling of a customer’s last name. When I finished, everyone in the database was related: everyone’s last name was Anderssen. Learned a lot about restores early in my career.
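Russ’s mistake is the classic one, and the classic defense is also generic: wrap the update in an explicit transaction and sanity-check the rowcount before committing. Here is a minimal sketch of that pattern using Python’s sqlite3 as a stand-in for any SQL database; the table and names are invented for illustration.

```python
import sqlite3

# Guard an UPDATE with an explicit transaction and a rowcount check.
con = sqlite3.connect(":memory:", isolation_level=None)  # autocommit; we manage transactions
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, last_name TEXT)")
con.executemany("INSERT INTO customers (last_name) VALUES (?)",
                [("Anderssen",), ("Smith",), ("Jones",)])

cur = con.cursor()
cur.execute("BEGIN")
# The mistake: no WHERE clause, so every row gets touched.
cur.execute("UPDATE customers SET last_name = 'Andersen'")
if cur.rowcount == 1:      # we only expected to fix one customer
    cur.execute("COMMIT")
else:
    cur.execute("ROLLBACK")  # touched more rows than intended; undo everything

names = [r[0] for r in con.execute("SELECT last_name FROM customers ORDER BY id")]
print(names)  # ['Anderssen', 'Smith', 'Jones']
```

Because the rowcount came back as 3 instead of 1, the rollback fires and nobody becomes an Anderssen.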
“We did miss the guy who did it”
Richard’s tale merited inclusion just for that phrase:
The best is the oldest: did something in Production thinking it was Dev, and an airport came to a halt.
Thankfully not my doing – we did miss the guy who did it.
This is always a tough decision with major mistakes: do you punish the person, or embrace the mistake as a learning opportunity and move on? I like the concept of blameless post-mortems, but I understand why it’s tough for most organizations to really embrace.
DBA-101: two power outages in a row
Viking’s story ends with the most appropriate Jira issue number ever:
Right at the beginning of my DBA (AKA: “software dev with good SQL skills”) career, we had a power outage, which left us with an…unhappy…server. The first solution was to restore a server level backup. In the middle of that, there was another power outage. This is where we learned that the first one fried the UPS (though it didn’t report any errors), so the server went down again, mid restore.

I can’t remember all the details, but this resulted in lost data, so we had to restore the server from an older backup and then apply the SQL backups to get the data back. This is where we learned that, while we did have backups (full/diffs/logs), we’d never actually tested a restore to a point in time using the full log chain… we’d just tested the full backups. The restore failed after something like the second log file and we were still out ~12 hours of data.

“Luckily”, this all happened on a Friday morning, so the business only lost a day of work, while I, and a couple of our most experienced devs, spent the weekend rebuilding the lost data from one of our replicated servers, which still had everything up to the point of the first power failure.
The icing on the cake: The number I got when I started a new ticket to track the work in my DBA project in Jira? DBA-101.
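The moral of Viking’s story generalizes to every platform: a backup only counts once you’ve actually restored it and verified the data, not when the backup job reports success. As a toy illustration of a restore drill, here’s a sketch using sqlite3’s online backup API; the schema is invented, and a real SQL Server drill would of course exercise the full log chain, not just the full backup.

```python
import sqlite3

# Toy restore drill: take a backup, simulate a disaster, then actually
# restore and verify row counts instead of trusting the backup job.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
src.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (20.0,)])
src.commit()

backup = sqlite3.connect(":memory:")
src.backup(backup)                   # take the "full backup"

src.execute("DELETE FROM orders")    # disaster strikes
src.commit()

# The drill: restore the backup into a fresh database and verify.
restored = sqlite3.connect(":memory:")
backup.backup(restored)
count = restored.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 2
```

The verification step at the end is the part Viking’s team skipped: they had the backups, but the first time the full chain was exercised was during the outage itself.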
Reverse engineering the worst database ever
Miu’s tale is a few paragraphs long, but I think that’s fair given the 18-month project from hell:
Just got off a project where I was tasked with archaeologically rebuilding an application that had ceased working and the source code was gone into the aether 20 years before.
That in and of itself wasn’t too bad. The real problem was the database: Oracle 8, hadn’t seen an update in nearly 2 decades. Its latest iteration had been architected by a rank amateur in 1987, when they pulled the data off a mainframe, and it had been maintained by a random combination of offshore firms and amateurs ever since.
The database had more than 8,000 tables spread across three discrete clusters. Every time some new data needed to be stored, a new table was born. Column and table names were inconsistent, with no keys, indexes, or documented relationships. Several of the tables had column names in transliterated Portuguese. Just determining what wasn’t needed, wasn’t used, or was just someone’s data dumping ground took months of work. There was even a permanent set of tables, nominally earmarked as temp for the application, holding stale data containing everything from SOAP messages to credit card numbers and PII. A table of usernames and passwords (plaintext) was stored under the name “NOT_SECURITY.” The only bright side to that is we were never able to determine what system they were for, as they didn’t match any scheme used at the company over its 100+ year history.
Over the course of 18 months, we managed to reverse engineer most of it and get the application replacement stood up with just under 100 tables, a sane domain model, and a documented business process. Definitely the worst I’d ever seen.
I’m still laughing and crying about passwords being stored in plaintext as a table called “NOT_SECURITY” – that’s awesome. Congratulations, Miu.
Shutting down your machine to cancel a transaction
I laughed so hard at Gordon’s cold sweat – I have totally been there:
Early in my career, I was working as a developer on a system that used an Oracle database. I had written a script to reset my test data. It had several deletes to clear tables and I included a final commit in the script. I had given a demo to executives of the new functionality I had built. After the demo, the execs wanted to know how much data would be affected in the production database. So I changed my environment to point to prod and queried row counts. After this I decided to reset my test database so I ran my reset script. Wondering why it was running so long, I realized I was running it against the production database! I immediately broke out in a cold sweat. Not sure what to do, I shut down my computer and ran back to my desk where for the next 30 minutes I ran row counts to convince myself that the script was shut down before the commit executed.
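Gordon’s gamble rested on how transactional databases generally behave: work inside an open transaction is rolled back if the connection dies before the COMMIT runs. Here’s a minimal sketch of that behavior using sqlite3 as a stand-in, with an invented table; killing the client mid-transaction in Oracle or SQL Server follows the same principle.

```python
import sqlite3, tempfile, os

# Simulate "shutting down the computer" before a COMMIT: the pending
# deletes die with the connection, and the data survives.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

con = sqlite3.connect(path)
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
con.commit()

# The runaway reset script: deletes issued, but never committed.
con.execute("DELETE FROM t")
con.close()  # connection dies before the COMMIT at the end of the script

# A fresh connection sees the data intact: the deletes were rolled back.
con2 = sqlite3.connect(path)
count = con2.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 3
```

That’s why Gordon’s row counts kept coming back clean: the shutdown raced the final commit, and the shutdown won.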
Users will find a workaround
Recce had to fix corruption, and then found out the data wasn’t any good anyway:
Got a support call that our software at a French client was down. Dialed in over the PSTN line and quickly determined the SQL 6.5 database (this was a while ago) was corrupt. Massively corrupt. Table scans were failing due to the data pages’ linked-list pointers being incorrect. The logs showed that the corruption had started 3 months previously, 2 hours after someone had created a new, compressed, data device and added it to the database. The users had been clicking through the errors that bubbled up through the UI for 3 months, until they couldn’t perform order allocations anymore.
The oldest backup was 1 month old. All the new orders and inventory levels were in the corrupt data device. Worked until dawn to get as much data out as I could, as the client was adamant they couldn’t organise a stock count. Even with my best efforts it wasn’t really representative of reality; it turned out the forklift truck drivers had started ignoring the system weeks earlier because they couldn’t complete transactions either.
Learning about conversions the hard way
JChertudi ran into something that you will all run into sooner or later as you work with different platforms & databases:
We were using a forked php-mssql driver ~2008, and converting user data to support HTML (CHAR(10) to <br>, for example) with a PHP script. Due to a code mismatch between Unicode and non-Unicode SQL strings, the PHP script left a “land mine” in the center of the user data. It looked fine after conversion, but once the user edited their data, the last half of the data disappeared on save. Sigh. That was a few days of finding impacted customers, restoring a copy of their data, and copy/pasting it back into the documents.
Don’t force it, get a bigger hammer
Steve faced the perfect storm:
There was a unit at my work that did 500 million dollars a year in revenue across both company and independent locations. The only backups they had were on the same RAID array as the data. The corporation did have an Exabyte tape backup at the time, but no one wanted to pay for the tapes.
One day, they bought a serial drive array so they could back up off the server. The technician jammed it into the port and shorted out the board. All data was toast. Cue $50k in data recovery services. Failure.
They went to dev. Nothing there, really. Most development is done live in prod.
3 weeks of 18 hour days to get the app running again. Corporate stores can enter in 3-week old data on contracts that bill nightly. Independent locations are completely out of luck.
Lawsuits drag on for years with 7 figure awards.
If you use SQL Server’s database mail for notifications, sooner or later you’re probably gonna make the same mistake as Anon, too:
My darkest hour took place one Saturday morning.
It was a perfect storm – a Friday afternoon change, insufficient testing, someone else on-call and Blackberry safely left behind while taking the kids surfing.
The Change was simple enough, an update to the daily Data Load / notification process. Unfortunately, there was some code missing which resulted in an endless loop.
The Data Load completed and SQL Server sent the notification e-mail. Then it sent the e-mail again, and again and again. In total, 12.6 million e-mails over the course of four hours.
The on-call agent forgot that they were on-call so they remained oblivious to the issue. I returned home and picked up my Blackberry to see a record number of unread e-mails. Logging onto the SQL Server was almost impossible as the Network had somehow frozen.
SQL Server had become unresponsive, Exchange Servers died. Reports came in of Blackberrys vibrating themselves to death as they notified of each e-mail arriving until the battery ran out. One Blackberry decided enough was enough, vibrating off a table and smashing on the tiled floor below. The only physical casualty, although my mental scars remain even now.
The mess was cleared and the “DeNigel of Service Attack” went into IT folklore to join “Nickageddon”.
I have not surfed since.
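Anon’s 12.6 million e-mails are the textbook case for a circuit breaker on any notification loop: cap how many times the same alert can fire inside a window, so a code bug degrades into a handful of e-mails instead of a denial of service. A minimal sketch of that idea in Python, with all names and thresholds invented for illustration:

```python
import time
from collections import defaultdict

# Circuit breaker: allow at most N sends of the same alert per window.
MAX_SENDS_PER_WINDOW = 5
WINDOW_SECONDS = 3600

_sent = defaultdict(list)  # alert key -> timestamps of recent sends

def should_send(key, now=None):
    now = time.time() if now is None else now
    recent = [t for t in _sent[key] if now - t < WINDOW_SECONDS]
    _sent[key] = recent
    if len(recent) >= MAX_SENDS_PER_WINDOW:
        return False  # cap reached: stop hammering the mail server
    _sent[key].append(now)
    return True

# Even a buggy endless loop now fires at most MAX_SENDS_PER_WINDOW emails.
sends = sum(should_send("data-load-complete", now=1000.0) for _ in range(100))
print(sends)  # 5
```

With SQL Server’s database mail specifically, the same effect can be had by checking how recently the last notification went out before calling the send procedure again; the guard matters more than where it lives.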
So what did we learn?
Kenzo sums it up well and earns a prize too:
Couple things I learned from all these comments:
1: every single person who supports relational databases likely has multiple horror stories to tell.
2: The ultimate, #1, primary, existential, responsibility of a DBA – for which all other responsibilities pale in comparison – is to implement database backup and restore processing adequate to support the business’s acceptable level of data loss.
Congrats to the “winners” – let’s call them winners just for surviving – who each picked up their choice of a Recorded Class Season Pass or the Consultant Toolkit to help ease their pains.
“Reports came in of Blackberrys vibrating themselves to death as they notified of each e-mail arriving until the battery ran out.”
“Not sure what to do, I shut down my computer and ran back to my desk where for the next 30 minutes I ran row counts to convince myself that the script was shut down before the commit executed.”
Yeah. This one right here, officer.
I think we have all been there at one point in our IT career! 🙂 Great stories.
Been there, done that, where’s my T-Shirt?
Forgotten where clause? Yep
Email sends, yep, code runs into a problem, email sent, retry executed, repeat indefinitely, oh but get at least 3 to 10 emails out per second, people will respond faster.
Routine maintenance, shut down system, let on-call dude get paged, as he knows, alerts drain the Blackberry faster than the charger can charge.
New one: SQL 2000 do a DBCC Check contig, bring down production. (Ooops)
re: Gordon – never turn on that computer again!! it’ll COMMIT! lol
My wife creates views for a Tableau data mart in their own DB, and the DBA never backed it up. A year’s work got blown away this weekend, and he’s trying to put it on her for not saving the scripts as files in git.
“You should have taken precautions against my stupidity”
I love you in Top Gear.
but yeah, she should have had her stuff backed up too. but Eff that guy!
Well, I am not confirming this happened here or anywhere else, but there is the issue where someone accidentally uses SCCM to roll out a new O/S to every server and desktop in the company and formats the drives. That can really create some havoc. And it kept trying for a while, so even powering off a device was not enough if it was powered back on before the process was finally killed. And it wasn’t realized right away what was going on. This could take quite a while to recover from. Interesting on SQL clusters also!
Other than that, you have a server which you are told is really important, so you put the full recovery model on it. Then a little while later your disk for the log fills up. Hmm, let’s reset the backup chain and run the log backup a few times just to verify all is working. Once, twice the log backup runs; the third time, a full disk again! The developer was so paranoid about the application that they had programmed a loop reading and writing some data to the database. It was such a loop that in seconds it filled the large drive for the log! He wouldn’t respond to emails or other communications. (And he was a sacred cow; many thought he was so smart they really wouldn’t question him.) So I had to back off the backup strategy to fulls and diffs. Crazy, but you can only do so much depending on the support you have.
There is a prickly feeling I get in the back of my neck that is coupled with my head dropping and a light-headed airy sensation I get when I think I’ve made a huge mistake in SQL that would cause an outage or is irreversible. I don’t seem to get it at any other time when I make a huge mistake, either professionally or personally.
https://www.brentozar.com/wp-content/uploads/2019/06/everythings-fucked-and-something-is-on-fire.gif I love this gif. I wish I could use it at work when presenting the issues with legacy systems
Not_Security… oh man, I’m dying. That’s probably my favorite thing ever.
Not long ago, we added a separate drive for tempdb and I reconfigured the tempdb files, existing and new (I added a few more tempdb files).
This is an FCI.
To apply the changes, the service had to be restarted. I did it by failover.
The service refused to start. Eventually I figured out that I had over-provisioned the tempdb file sizes because I didn’t account for the existing files 🙁
NET START MSSQLSERVER /f came to help to reconfigure the file size.
30 mins outage…
It just comes to mind that the DBA’s “it was a dark and stormy night” is actually “it was a calm and pleasant Friday afternoon”.
I have discovered, the hard way, that there really are people in this world that do have the last name “Null”.
“We did miss the guy who did it” – I’m reminded of a news article from just a few days ago…! https://www.bbc.co.uk/news/uk-england-manchester-48732031
One of the best database developers on the team needed to reload the main transaction table with 900 million rows. For obvious performance reasons, a truncate was used… 2 minutes later the phones started ringing. The developer thought they were in test but were in prod. After production was restored the next day and the dust settled, the DBAs took the heat for not restricting such a powerful command to a developer. The developer kept his job and production access, just without truncate permissions.
Not as bad as the other stories, but at my last job we had a battle royale one time: SQL vs Exchange. Something happened which generated over a million emails from SQL. It crashed the Exchange server. Took me like a week to clear out my mailbox.
[…] Brent Ozar: The ultimate, #1, primary, existential, responsibility of a DBA – for which all other responsibilities pale in comparison – is to implement database backup and restore processing adequate to support the business’s acceptable level of data loss. […]
It’s stupid, but it still makes me twitch. I once shut down my laptop quickly to join in singing happy birthday, momentarily forgetting I was remoted onto production in fullscreen mode. No cake left by the time I got back from the server room.
HAHAHA, ooouuuuch. We’ve all been there though. That design was so terrible.
In 2002 I was working for a marketing company that organized international events for many thousands of participants. One time, working on site in Valencia, Spain, our temporary team office was a hotel room: SQL Server, an Access client, administrating all kinds of live data on people, flight information, shuttle data, hotel rooms, meal preferences, workshop bookings, etc.
One nervous colleague permanently nagged and stressed me for some special kinds of reports and analysis that were difficult to achieve in MS Access, so I decided to copy the data to Excel for further calculation. I opened the central participants table in Access and selected all rows, but instead of Ctrl+C I pressed Ctrl+X, and due to bad database design the core table was emptied. Foreign key relationships weren’t set up as foreign key constraints in the db, so deletion and orphaned data were allowed.
Re-pasting the copy would have caused new incremental IDs so I parked the copied data in Excel.
One minute later, while I desperately pored over how to repair the data relations, a power outage occurred, and I hadn’t saved the Excel file yet. Server and desktop PCs all went dark. Needless to say, we didn’t even have a short-term backup.
My team switched over to doing some paperwork. I was left in despair with the hotel technician, trying to regain power from somewhere. I knew that I had screwed the entire conference, since we no longer knew who, when, or how many people would arrive or leave, which hotel, which country, anything.
When the power came back two hours later, I opened Excel and luckily found the data in the unsaved files list. I switched off the ID auto increment, pasted the data back in, and a few seconds later my team came back in and returned to work with the database, not knowing what I had been through. I crept out of the room, went up to the hotel roof, and smoked half a pack of cigarettes in a row, shaking and soaked in sweat.
It took years until I could tell them what had happened, and I still get goosebumps when I think about it.
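The Valencia disaster hinged on one design flaw the commenter calls out: the relationships existed only on paper, not as enforced foreign key constraints. With constraints in place, the accidental Ctrl+X would simply have failed. A minimal sketch of that safety net, again using sqlite3 with an invented schema (note that sqlite leaves enforcement off by default):

```python
import sqlite3

# With enforced foreign keys, you can't empty a parent table while
# child rows still reference it.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # sqlite requires opting in
con.execute("CREATE TABLE participants (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE bookings (
                   id INTEGER PRIMARY KEY,
                   participant_id INTEGER NOT NULL REFERENCES participants(id)
               )""")
con.execute("INSERT INTO participants VALUES (1, 'A. Traveler')")
con.execute("INSERT INTO bookings VALUES (1, 1)")
con.commit()

try:
    con.execute("DELETE FROM participants")  # the accidental Ctrl+X
    blocked = False
except sqlite3.IntegrityError:
    blocked = True  # the constraint refuses to orphan the bookings

print(blocked)  # True
```

The same protection exists in Access, SQL Server, and every other relational engine; it just has to be declared.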
I shall never forget…”large” bulk update on a Friday evening, with the table in Txn Replication, from SSMS…session timed-out after an hour or 2…then someone killed the SPID…not talking about a friend…took the website down for the whole weekend…