Last week, I asked for your worst database horror stories. Hoowee, did you deliver. Make yourself a nice cup of cocoa, gather around the campfire, and listen up for these terrible tales. Illustrations courtesy of Swear Trek (not safe for work.)
The mistake we’ve all made
Russ did an update without a where clause, but his storytelling earns a laugh and a prize:
One of my first SQL 6.5 scripting tasks was to correct the spelling for a customer’s last name. When I got finished, everyone in the database was related, everyone’s last name was Anderssen. Learned a lot about restores early in my career.
“We did miss the guy who did it”
Richard’s tale merited inclusion just for that phrase:
The best is the oldest: did something in Production, thinking it was Dev. and an airport came to a halt.
Thankfully not my doing – we did miss the guy who did it.
This is always a tough decision with major mistakes: do you punish the person, or embrace the mistake as a learning opportunity and move on? I like the concept of blameless post-mortems, but I understand why it’s tough for most organizations to really embrace.
DBA-101: two power outages in a row
Viking’s story ends with the most appropriate Jira issue number ever:
Right at the beginning of my DBA (AKA: “software dev with good SQL skills”) career, we had a power outage, which left us with an…unhappy…server. The first solution was to restore a server level backup. In the middle of that, there was another power outage. This is where we learned that the first one fried the UPS (though it didn’t report any errors), so the server went down again, mid restore. I can’t remember all the details, but this resulted in lost data, so we had to restore the server from an older backup and then apply the SQL backups to get the data back. This is where we learned that, while we did have backups (full/diffs/logs), we’d never actually tested a restore to a point in time using the full log chain….we’d just tested the full backups. The restore failed after something like the second log file and we were still out ~12 hours of data. “Luckily”, this all happened on a Friday morning, so the business only lost a day of work, while I, and a couple of our most experienced devs, spent the weekend rebuilding the lost data from one of our replicated servers, which still had everything up to the point of the first power failure.
The icing on the cake: The number I got when I started a new ticket to track the work in my DBA project in Jira? DBA-101.
Reverse engineering the worst database ever
Miu’s tale is a few paragraphs long, but I think that’s fair given the 18-month project from hell:
Just got off a project where I was tasked with archaeologically rebuilding an application that had ceased working and the source code was gone into the aether 20 years before.
That in and of itself wasn’t too bad. The real problem was the database: Oracle 8, hadn’t seen an update in nearly 2 decades. Its latest iteration been architected by a rank amateur in 1987 when they pulled the data off a mainframe, and had been maintained by a random combination of offshore firms and amateurs in the same time frame.
The database had more than 8,000 tables spread across three discrete clusters. Every time some new data needed to be stored, a new table was born. Column and table names were inconsistent, no keys, indexes, or documented relationships. Several of the tables had column names in transliterated Portuguese. Just determining what wasn’t needed, used, or was just someone’s data dumping ground was months of work. There was even a permanent set of tables that were earmarked as temp nomenclaturally for the application that had stale data containing everything from SOAP messages to credit card numbers and PII. A table of usernames and passwords (plaintext) were stored under the name “NOT_SECURITY.” Only bright side to that is we were never able to determine what system they were for, as they didn’t match any scheme used at the company over it’s 100+ year history.
Over the course of 18 months, we managed to reverse engineer most of it and get the application replacement stood up with just under 100 tables and a sane domain model, documented business process. Definitely the worst I’d ever seen.
I’m still laughing and crying about passwords being stored in plaintext as a table called “NOT_SECURITY” – that’s awesome. Congratulations, Miu.
Shutting down your machine to cancel a transaction
I laughed so hard at Gordon’s cold sweat – I have totally been there:
Early in my career, I was working as a developer on a system that used an Oracle database. I had written a script to reset my test data. It had several deletes to clear tables and I included a final commit in the script. I had given a demo to executives of the new functionality I had built. After the demo, the execs wanted to know how much data would be affected in the production database. So I changed my environment to point to prod and queried row counts. After this I decided to reset my test database so I ran my reset script. Wondering why it was running so long, I realized I was running it against the production database! I immediately broke out in a cold sweat. Not sure what to do, I shut down my computer and ran back to my desk where for the next 30 minutes I ran row counts to convince myself that the script was shut down before the commit executed.
Users will find a workaround
Recce had to fix corruption, and then found out the data wasn’t any good anyway:
Got a support call that our software at a French client was down. Dialed in over the PSTN line and quickly determined the SQL6.5 database (this was a while ago) was corrupt. Massively corrupt. Table scans were failing due to the data pages linked list points being incorrect. The logs showed that the corruption has started 3 months previously, 2 hours after someone had created a new, compressed, data device and added it to the database. The users had been clicking through the errors that bubbled up through the UI for 3 months, until they could perform order allocations anymore.
The oldest backup was 1 month old. All the new orders and inventory levels were in the corrupt data device. Worked until dawn to get as much data out as I could as the client was adamant they couldn’t organise a stock count. With my best efforts it wasn’t really representative of reality, it turned out the forklift truck drivers had started ignoring the system weeks earlier because they couldn’t complete transaction either.
Learning about conversions the hard way
JChertudi ran into something that you will all run into sooner or later as you work with different platforms & databases:
We were using a forked php-mssql driver ~2008, and converting user data to support HTML (CHAR(10) to ) for example) with a php script. Due to a code mismatch between Unicode and non-Unicode SQL strings , the php script left a “land mine” in the center of the user data. It looked fine after conversion, but once the user edited their data, the last half of the data disappeared on save. Sigh, that was a few days of finding impacted customers, restoring a copy of their data, and copy/pasting it back into the documents.
Don’t force it, get a bigger hammer
Steve faced the perfect storm:
There was a unit at my work that did 500 million dollars a year in revenue for both company and independent locations. The only backups they had were on the same raid array as the data. The corporation did have an exabyte tape backup at the time, but no one wanted to pay for the tapes.
One day, they bought a serial drive array so they could back up to off the server. The technician jammed it into the port and shorted out the board. All data is toast. Cue $50k in data recovery services. Failure.
They went to dev. Nothing there, really. Most development is done live in prod.
3 weeks of 18 hour days to get the app running again. Corporate stores can enter in 3-week old data on contracts that bill nightly. Independent locations are completely out of luck.
Lawsuits drag on for years with 7 figure awards.
If you use SQL Server’s database mail for notifications, sooner or later you’re probably gonna make the same mistake as Anon, too:
My darkest hour took place one Saturday morning.
It was a perfect storm – a Friday afternoon change, insufficient testing, someone else on-call and Blackberry safely left behind while taking the kids surfing.
The Change was simple enough, an update to the daily Data Load / notification process. Unfortunately, there was some code missing which resulted in an endless loop.
The Data Load completed and SQL Server sent the notification e-mail. Then it sent the e-mail again, and again and again. In total, 12.6 million e-mails over the course of four hours.
The on-call agent forgot that they were on-call so they remained oblivious to the issue. I returned home and picked up my Blackberry to see a record number of unread e-mails. Logging onto the SQL Server was almost impossible as the Network had somehow frozen.
SQL Server had become unresponsive, Exchange Servers died. Reports came in of Blackberrys vibrating themselves to death as they notified of each e-mail arriving until the battery ran out. One Blackberry decided enough was enough, vibrating off a table and smashing on the tiled floor below. The only physical casualty, although my mental scars remain even now.
The mess was cleared and the “DeNigel of Service Attack” went into IT folklore to join “Nickageddon”.
I have not surfed since.
So what did we learn?
Kenzo sums it up well and earns a prize too:
Couple things I learned from all these comments:
1: every single person who supports relational databases likely has multiple horror stories to tell.
2: The ultimate, #1, primary, existential, responsibility of a DBA – for which all other responsibilities pale in comparison – is to implement database backup and restore processing adequate to support the business’s acceptable level of data loss.