At SQL Intersection in Orlando last week, a few of us were sitting around a table talking shop. One thing led to another, and as so often happens at conferences, we started talking about our worst scars from database misfortunes.
That got me thinking – I could kinda bring that fun experience to y’all here at the blog. Even if you can’t go to a conference, you can still have the fun of commiserating with your peers. Leave your comment here with your worst database horror story.
Just a few ground rules:
1. Keep it anonymous-ish. Don’t put details in here that your manager wouldn’t want to see out in public.
2. Don’t throw anybody under the bus. Remember that your fellow coworker might be reading this, and they might recognize the scenario. (I’m being polite. I’m really just trying to make sure your fellow coworkers don’t throw YOU under the bus. We’ve all had our days, am I right? Especially you.)
3. It needs to involve you and databases. And no, I don’t really wanna hear about how your data was involved in somebody’s data breach, hahaha. We’re to the point where newspapers are going to start including a “data breach” section just like they do a sports section.
This weekend, I’ll read through the comments and pick 5 of my favorites to be re-published in a blog post next week. Those 5 commenters will get their choice of a Recorded Class Season Pass or the Consultant Toolkit. My judgment is final. Not good, just final.
I’ll go make the popcorn.
Early in the cloud VM game, I was working on a machine and mistakenly detached a virtual disk from a SQL Server. It was an SSD for temp DB. It was terrible.
Just got off a project where I was tasked with archaeologically rebuilding an application that had ceased working and the source code was gone into the aether 20 years before.
That in and of itself wasn’t too bad. The real problem was the database: Oracle 8, which hadn’t seen an update in nearly 2 decades. Its latest iteration had been architected by a rank amateur in 1987 when they pulled the data off a mainframe, and it had been maintained by a random combination of offshore firms and amateurs ever since.
The database had more than 8,000 tables spread across three discrete clusters. Every time some new data needed to be stored, a new table was born. Column and table names were inconsistent; there were no keys, indexes, or documented relationships. Several of the tables had column names in transliterated Portuguese. Just determining what wasn’t needed, wasn’t used, or was just someone’s data dumping ground was months of work. There was even a permanent set of tables, named as if they were the application’s temp tables, holding stale data containing everything from SOAP messages to credit card numbers and PII. A table of usernames and passwords (plaintext) was stored under the name “NOT_SECURITY.” The only bright side to that is we were never able to determine what system they were for, as they didn’t match any scheme used at the company over its 100+ year history.
Over the course of 18 months, we managed to reverse engineer most of it and get the application replacement stood up with just under 100 tables, a sane domain model, and documented business processes. Definitely the worst I’d ever seen.
congratulations, you win
I was new to SQL Server (SQL Server 2000) and especially new to Data Transformation Services. While importing a flat file as a table to a production database, I checked the box for “drop destination tables first”, thinking it would only drop the destination of the import. It actually dropped every table in the database.
Andy – HAHAHA, I have totally been there. (Thankfully I caught that one in a dev environment.)
Aahahahha Ahhh thanks, that made my day 🙂
We were using a forked php-mssql driver ~2008, and converting user data to support HTML (CHAR(10) to <br>, for example) with a PHP script. Due to a code mismatch between Unicode and non-Unicode SQL strings, the PHP script left a “land mine” in the center of the user data. It looked fine after conversion, but once the user edited their data, the last half of the data disappeared on save. Sigh, that was a few days of finding impacted customers, restoring a copy of their data, and copy/pasting it back into the documents.
OUCH! Wooooow, that’s rough.
Not sure this merits top 5, but just to drive home a point. About 15 years ago I created a database to manage authentication and authorization. Coming from an application development background, I used UDFs extensively. I was encapsulating code with a lot of reuse. It was great, worked great (in dev, that is, with very infrequent use). Went live and died. As in totally died. Very bad. Thus began a very quick crash course on UDFs and the difference with inline UDFs. I am a better man for the experience.
Regan – it looks so good in Books Online though! It looks like that’s the right thing to do, build the code so that it can easily be reused…
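For anyone who hasn’t hit this one yet, a minimal sketch of the difference, with hypothetical table and function names: the scalar UDF executes once per row and hides its cost from the optimizer, while the inline table-valued version gets expanded into the calling query’s plan.

```sql
-- Scalar UDF: looks clean, runs row-by-row, invisible to the optimizer
CREATE FUNCTION dbo.GetUserRole (@UserId INT)
RETURNS VARCHAR(50)
AS
BEGIN
    RETURN (SELECT TOP (1) RoleName
            FROM dbo.UserRoles
            WHERE UserId = @UserId);
END;
GO
SELECT u.UserName, dbo.GetUserRole(u.UserId) AS RoleName
FROM dbo.Users AS u;   -- one function call per row
GO
-- Inline table-valued function: same logic, but inlined into the plan
CREATE FUNCTION dbo.GetUserRoleInline (@UserId INT)
RETURNS TABLE
AS
RETURN (SELECT TOP (1) RoleName
        FROM dbo.UserRoles
        WHERE UserId = @UserId);
GO
SELECT u.UserName, r.RoleName
FROM dbo.Users AS u
CROSS APPLY dbo.GetUserRoleInline(u.UserId) AS r;
```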
I made an in-place SQL Server upgrade from 2012 to 2014. An HA cluster was in use, and when I switched from DB-1 to DB-2, the database was in “Script upgrade mode” and the SQL service was not starting. Downtime approx. 1 hour. Actually the issue was the SSIS database…
James – oof, the SSIS database has been a little thorny with AGs over the years, yeah.
The best is the oldest: did something in Production, thinking it was Dev. and an airport came to a halt.
Thankfully not my doing – we did miss the guy who did it.
“we did miss the guy who did it” – HAHAHA, ouch, oooof.
6-7 yrs ago …
Amongst 15-20 different SSMS windows open, I am testing a new process that drops/recreates a particular database. I mis-click in one of them and accidentally start deleting a 20-25 TB OLTP prod db. … … … and it actually stopped without deleting it. Must have been blocked. It must have taken me 2-3 hrs to calm down.
those ellipses were me banging the ‘cancel query’ button and trying not to soil myself
Mine took place about 15 years ago. SQL Server 2000 running on a Windows 2000 failover cluster. Hardware was hosted a few states away and rented from the hosting company. The shared storage that held the cluster quorum had an ever-so-slight hiccup. After the initial crash of the cluster and SQL Server, both machines could briefly see the shared drives, which was supposed to be impossible. Both servers were rebooted; now only one could see the storage, but the cluster was down hard. Installed a named instance on the server with storage, attached the mdfs and ldfs to the new instance. It took two or three weekends working with MS to get the file system and registry cleaned up enough to install a functioning clustered SQL Server default instance again. Those were fun times.
Was a rookie DBA, and the “senior” devs gave me a script to update a column from varchar(4000) to varchar(max) on a table with about 50 million rows. In 2 hours the database expanded around 200 GB and the execution failed. The disk got full and the other databases were not usable as well 🙂
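For reference, the trap here is that a single ALTER like this can turn into one giant logged operation that rewrites rows, which is consistent with the 200 GB of growth. A batched backfill (a sketch, with hypothetical names) keeps each transaction, and the log, small:

```sql
-- The one-shot version that can blow up the log on a 50M-row table:
-- ALTER TABLE dbo.BigTable ALTER COLUMN Notes VARCHAR(MAX) NULL;

-- A gentler alternative: add the new column, backfill in small batches,
-- then drop/rename columns once everything is copied
ALTER TABLE dbo.BigTable ADD NotesMax VARCHAR(MAX) NULL;
GO
WHILE 1 = 1
BEGIN
    UPDATE TOP (100000) dbo.BigTable
    SET NotesMax = Notes
    WHERE NotesMax IS NULL
      AND Notes IS NOT NULL;

    IF @@ROWCOUNT = 0 BREAK;   -- done: every row has been copied
END;
```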
At my last job they weren’t that big on change control, so while troubleshooting a data issue I changed a stored procedure in Prod just to add a PRINT statement, something I had done before.
This time it was between an IF statement and a BEGIN statement.
Inside the begin was a loop that could be quite involved.
Brought down the entire system – maxed out CPU, blocking on nearly every table – for a while before I figured out what I had done.
While working for a SaaS company in the local municipality health inspection market, I was meaning to restore a copy of prod into a test cluster… I restored the nightly backup over production for the largest municipality in the country. It was a bad, BAD, day… We were in pre-production with their implementation, so about 8 hours of work was lost for their employees… Could have been way worse.
Back in the day I worked as a software developer with, what I thought was, a good handle on SQL Server. The business came to me with an issue with my software and I came up with a brilliant solution, a database trigger. I wouldn’t need to rewrite or redeploy any of my app. All it involved was a small change in the database. What could be easier? The DBA happened to be on a work trip overseas and I didn’t want to wait the 24 hours turn around on emails. I had access to production, he trusted me, what could go wrong? The trigger could go wrong. The trigger did go wrong. Instead of updating a newly inserted row, it updated Every. Single. Row. in the largest production table of the most important database we had, used in every office we had across the US and Canada. I weighed my options of running out the door or confessing to my boss, hung my head and spilled my guts. He told me to call the DBA and get it fixed. I made the middle of the night call, the DBA saved the day with a restore, I learned valuable lessons about backups, restores, and triggers then spent the next week dealing with the fall out. Good life lessons as a dev and soon to be DBA.
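The classic version of that trigger bug is worth spelling out (hypothetical table, for illustration): the UPDATE never joins to the `inserted` pseudo-table, so it hits every row.

```sql
-- The bug: nothing ties the UPDATE to the rows that fired the trigger
CREATE TRIGGER trg_Orders_Insert ON dbo.Orders
AFTER INSERT
AS
UPDATE dbo.Orders SET Status = 'NEW';   -- updates EVERY row in the table
GO
-- The fix: scope the UPDATE to the inserted pseudo-table
CREATE TRIGGER trg_Orders_Insert_Fixed ON dbo.Orders
AFTER INSERT
AS
UPDATE o
SET o.Status = 'NEW'
FROM dbo.Orders AS o
INNER JOIN inserted AS i
    ON i.OrderId = o.OrderId;           -- only the newly inserted rows
```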
True story – 1998 – we had a server drive that went on the fritz and would not boot – was making a grinding sound. I got to the client, removed the drive from the server and set it sideways outside the server case at a 45 degree angle. It booted and was recognized by the server and we were able to backup 20 years of data stored on that device. Did I mention we did not have a good backup – but I guess you figured that out. It never booted again after we got the data off. Lots of lessons learned on that one – and we still talk about it. 🙂
Entire infrastructure team comes running into my cubicle. Faces bright red/angry looking. Lots of cursing. I’m just a lowly developer. Lowest of the low. Organizational pond scum. The entire production system is down. This is a no-longer-in-business health club chain. Let’s call them Sally Fitness. They decided to update the HBA firmware mid-day on the production SQL Server. It didn’t go so well. Production isn’t just down. It’s gone. They corrupted the master database on their 2-node SQL 2000 cluster. There were also no backups. Not of master. Not of anything. Somehow it was a developer’s job to figure it out, since all this DBA stuff sure looked like programming. In about 10 minutes I was able to rebuild master from the installation CD. The resulting mess took a bit longer to clean up, but total downtime was under 20 minutes.
I wrote a script having an update without a where clause (copy and pasted a few lines of code but missed the where clause on the last line) on a “system” table of a 3rd party accounting app. The accounting/billing app was dead for most of a week.
The table was used by the app to generate unique IDs for other tables containing business data. As users continued to use the app, all of a sudden IDs were no longer unique.
This got to prod because the testing team ignored the problem when exposed during testing. They simply asked for a DEV database refresh and never said anything about the issue.
p.s. I offered my resignation (it was declined of course) and never said anything to the testing team or management about the failure of testing team to communicate. I wrote the code and I can’t rely on anyone else in the company to do their jobs. That is up to management.
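The copy/paste pattern that bites here is worth spelling out (hypothetical table name), along with one cheap safety habit:

```sql
-- The mistake: the last pasted line lost its WHERE clause
UPDATE dbo.IdRegistry SET NextId = NextId + 1 WHERE TableName = 'Orders';
UPDATE dbo.IdRegistry SET NextId = NextId + 1 WHERE TableName = 'Customers';
UPDATE dbo.IdRegistry SET NextId = NextId + 1;   -- every table's ID counter

-- A cheap habit: wrap the update in a transaction and sanity-check
-- the row count before committing
BEGIN TRAN;
UPDATE dbo.IdRegistry SET NextId = NextId + 1 WHERE TableName = 'Invoices';
IF @@ROWCOUNT <> 1 ROLLBACK TRAN ELSE COMMIT TRAN;
```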
That reminds me of when a developer was developing against a production database and failed to include a WHERE clause in his update. He came running over to me, asking to restore the database, as one of the fields “suddenly” was changed to the wrong value. Clearly a problem with the database. After restoring, I set up a trace, just in case. A few minutes later, he was back. I pulled the trace and showed him where his code issued the incorrect update. Like a thief upset at being caught, he was not pleased I discovered that he was the problem.
Thankfully it wasn’t a production environment, but important enough to backup daily.
Tape backup system couldn’t keep up with daily backups, so backups were sent to a separate LUN on the same SAN to be staged, then streamed to tape throughout the day.
All backups were restored once per month from the most current backups.
SAN provider detects issues with the device, remotes in to start looking into it, and notifies operations of the issue.
SAN provider asks how many LUNs we can afford to lose. We say none, go back to the drawing board.
SAN provider comes up with plan, goes to execute, skips a step, loses all LUNs.
Operations goes to the backups on tape (backups on LUN were obviously gone).
Tape management software apparently doesn’t like mountpoints, ignored all mountpoints inside the folder the backups had been placed in.
Retention policy was 30 days for backups on tape.
HOWEVER, it only cleared off backups as it needed the space.
From 3 months back.
5-6 people had to redo work from during that time.
Could have been sooo much worse.
About 20 years ago, I was supporting a new critical system built around a Progress database. There was a problem in production that caused various background processes to crash and leave remnant locks behind on the database, requiring a full system restart of the *nix server to clear. To speed up the diagnosis process, we were restoring the production database to a test server and clearing the locks in test by simply issuing a drop database and answering “Y” to the warning about the active connections.
Well, with all the back and forth with irate users hovering over my shoulder wanting to know when they could start up the line, I was flying pretty fast and low … Well, I issued a drop database command and answered “Y” … and one ohno second later realized I was on the production server, not the test.
Recovering the database in production took several hours, and ultimately we determined what was wrong with minimal loss of data. Ever since then I *ALWAYS* use different user credentials with write permissions to the production systems and configure my machine to have a different background colours for production vs non-production environments.
A client that used our software emailed saying their production SQL 2005 server was having problems (this was in 2018). I was able to remote in and the machine was in its death throes. Their backups hadn’t run in months. Opening any GUI sent it into a reboot, so I had to stop the services via command prompt so I could start to copy the data and log files. I got the MDF copied off and started the LDF when the server died for good, never to properly boot again.
The only saving grace was after about 8 years of talking, they’d finally set up a new server and asked for a software upgrade. So for the first time in ages we 1) had an active login and 2) had a second server to copy that MDF to.
We’d already set up a path for the upgrade to SQL 2016 (with a hop thru 2008 R2) and the 16-years-newer software version, which had 3-4 different stop points because of various incompatibilities along the way. Just now we were starting with only an MDF… and no working copy of SQL 2005 anywhere anymore.
I honestly don’t remember how we managed it. There was a chicken-and-egg issue of needing the LDF to connect it into 2008, but needing 2005 to regenerate the LDF. In the end, they were up and running the next day.
Not the recommended approach to SQL and software upgrades.
Our production database is a virtual machine in VMware, along with our backups on EMC, and when my system admin applied a patch to the main VMware environment with VMware on the phone, the entire environment became inaccessible, including my SQL backups! We were down for about 14 hours and all of us nearly lost our jobs! Now when we apply any patch, we copy my backups to bare metal first, just in case.
At the start of my accidental DBA career. Need to change the max memory used by SQL Server down because web services won’t start unless there is 5% free memory (IIS and SQL Server on same computer). Forgot that the max memory setting is in MB instead of GB, set the max memory to 24 MB. Got to learn all about single user mode and how to change settings via the command prompt quickly that day. Only thing that saved me was this was a development box, so only the QA team was affected.
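A sketch of the mistake and the recovery, for anyone who hasn’t been there (exact startup flags vary by situation; -f starts SQL Server with minimal configuration when it can’t come up normally):

```sql
-- The mistake: the setting is in MB, so this caps SQL Server at 24 MB
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 24;      -- meant 24 GB...
RECONFIGURE;

-- The recovery (after restarting the instance with the -f startup flag
-- and connecting via sqlcmd from the command prompt):
EXEC sp_configure 'max server memory (MB)', 24576;   -- 24 GB, as intended
RECONFIGURE;
```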
Right at the beginning of my DBA (AKA: “software dev with good SQL skills”) career, we had a power outage, which left us with an…unhappy…server. The first solution was to restore a server level backup. In the middle of that, there was another power outage. This is where we learned that the first one fried the UPS (though it didn’t report any errors), so the server went down again, mid restore. I can’t remember all the details, but this resulted in lost data, so we had to restore the server from an older backup and then apply the SQL backups to get the data back. This is where we learned that, while we did have backups (full/diffs/logs), we’d never actually tested a restore to a point in time using the full log chain….we’d just tested the full backups. The restore failed after something like the second log file and we were still out ~12 hours of data. “Luckily”, this all happened on a Friday morning, so the business only lost a day of work, while I, and a couple of our most experienced devs, spent the weekend rebuilding the lost data from one of our replicated servers, which still had everything up to the point of the first power failure.
The icing on the cake: The number I got when I started a new ticket to track the work in my DBA project in Jira? DBA-101.
Back when I was in support, there was a dangerous script I had to run that would check to see if you were running it against the master db and would exit with a warning if so. Or so we all thought. Unfortunately, the developer had put a ‘GO’ after the warning code out of habit, so on this occasion the script helpfully warned me and then immediately trashed the master db on a customer’s live server. That was the first and only time I had to rebuild master, and I was never more relieved that something worked.
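The failure mode is worth sketching, because it still bites people: GO is a client-side batch separator, so a RETURN before it only exits the guard’s own batch, and the batch after the GO runs no matter what.

```sql
-- The guard everyone trusted:
IF DB_NAME() = 'master'
BEGIN
    RAISERROR('Do not run this against master!', 16, 1);
    RETURN;    -- exits THIS batch only
END
GO             -- batch boundary: execution starts fresh below
-- ...the dangerous cleanup code here runs against master anyway...
```

Common mitigations include putting SET NOEXEC ON in the guard (so the following batches compile but do not execute), or running scripts through sqlcmd with :on error exit and a high-severity error.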
A while back the company I worked for got acquired. We had a comprehensive backup system in place, with mirroring, warm-standby, dr-site, transactional replication and daily full backups. About a year after getting acquired the company got let again, so we had to temporarily shut off a lot of the backup processes to account for physically moving SQL servers from geographical locations separated thousands of miles apart.
When we were on our own again, we kind of put the task of setting up the backup system on hold. We were understaffed as it is, so our sysadmins basically took care of full nightly backups through a 3rd party tool. I was the only SQL guy in the company, and my day-to-day tasks were mostly developing and reporting. We got transactional replication back up to satisfy the business need of a reporting site.
Anyway, soon after the worst happened on one particular Monday morning. The main application database was reporting corruption. I had no idea what to do, so I quickly googled some command you run to get more information. The command was taking forever to complete, and it reached the point that operations, a call-center business, had to shut down for the entire day. Unfortunately, I did not know who Brent Ozar was at this point in my life.
I checked with our sysadmin and he said that the nightly backups of the database had been failing. We had not a single backup of the database anywhere! It is easy to point the finger at him, but honestly I should have been keeping an eye on things. At the very least, despite being swamped with work I should have found some time to get our mirroring and standby processes back online.
I prayed harder than I ever have in my life. I felt like I was going to die, just from the pressure of the whole thing. I mean the entire business was in jeopardy of going down because of this. We would lose our clients, for sure. Nobody else seemed that fazed, but I was freaking out bigtime.
Thanks to the help of one of my friends, I found out that the log reader agent for replication was blocking the diagnostic command from finishing. When we figured this out, we learned that it was a single, large table in the db that was corrupt. No backups anywhere, and we couldn’t just spin up the subscriber database for prod since it lacked many of the stored procedures needed to run the production application.
Fortunately, the data from the subscriber database saved the day. It had the correct and up to date copy of the table that was corrupt. We ran the update to the prod db and then quickly made it a priority to get our backups in order after that. Lost a day of operations, but the business survived. One of those days that I will never forget, and certainly never want to repeat. God bless all the production DBAs out there. For now I am happy staying a developer.
Long, well planned upgrade from SQL2012 to SQL2016. I set up log shipping with scripts I developed and had used for a very long time. They even wrote to a log file which would tell me if everything was in sync. The cut-over in our test environments took minutes.
A few days before the planned cutover there were many complaints of poor performance. We shut off everything that was “different”. I set up the log shipping again on the morning of the cutover, and all was well. Or so I thought. We did the cutover but I had never checked the log file that was being written. I would have seen that the databases were not in sync.
We did the cutover on Saturday morning at 6 AM. The smoke tests all worked, we were happy. About Monday at 4:30, the support team determined that there were no records from about Friday at 8 AM. Three of the 30 databases had never been synced from that time. We lost a LOT of data that could not be recovered.
Lesson learned was look at the _)!*(#^ file you set up so that you can prevent these issues from occurring.
Early in my career, I was working as a developer on a system that used an Oracle database. I had written a script to reset my test data. It had several deletes to clear tables and I included a final commit in the script. I had given a demo to executives of the new functionality I had built. After the demo, the execs wanted to know how much data would be affected in the production database. So I changed my environment to point to prod and queried row counts. After this I decided to reset my test database so I ran my reset script. Wondering why it was running so long, I realized I was running it against the production database! I immediately broke out in a cold sweat. Not sure what to do, I shut down my computer and ran back to my desk where for the next 30 minutes I ran row counts to convince myself that the script was shut down before the commit executed.
I had crafted a query to repopulate some customer data that had been skipped during a deployment. It worked so great on each individual run, I decided to automate it within a loop. This was before I knew that table variables retained their rows, even if you “re-declare” them inside a loop. I created cross-customer data contamination that took three months to clean up.
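A sketch of the gotcha (small numbers for illustration): the DECLARE inside the loop does not re-create the table variable on each pass, so rows accumulate across iterations unless you clear them explicitly.

```sql
DECLARE @i INT = 1;
WHILE @i <= 3
BEGIN
    DECLARE @t TABLE (CustomerId INT);   -- NOT reset on each iteration
    INSERT INTO @t (CustomerId) VALUES (@i);

    SELECT COUNT(*) AS RowsInTableVar FROM @t;  -- 1, then 2, then 3

    -- DELETE FROM @t;   -- the missing reset that caused the contamination
    SET @i += 1;
END;
```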
Many years ago, a co-worker was on a customer’s server and accidentally started a restore over top of a live production database (he meant to select a development database as the destination but picked prod instead). A few minutes into this he realized his mistake. With his heart in his throat and his jaw on the floor and working under the influence of high adrenaline, he clicked cancel (and probably a few other things he won’t admit to) and the production database became unusable. That was at least 10-12 years ago, so we’re talking SQL 2005 or maybe 2008, and I can’t remember if it was simply corrupt or marked as suspect, but it was a goner. He was forced to restore from a day-old backup and he spent the next couple days hand-entering data from “paper backups” to get the customer’s data back in. That was a TERRIBLE week. The customer took it amazingly well though.
18 years ago I decided it was a good idea to store user settings as xml in the database, in a varbinary(max) column, zipped.
When SQL Server 2012 reaches end of life and we force all customers to 2016 I can finally do stuff with those settings from SQL using DECOMPRESS.
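For anyone in the same boat, a sketch of what 2016’s DECOMPRESS makes possible, assuming the stored blobs are in a GZip-compatible format (COMPRESS/DECOMPRESS use GZip) and hypothetical table/column names:

```sql
-- Round trip to show the pairing:
SELECT CAST(
         DECOMPRESS(COMPRESS(N'<settings><theme>dark</theme></settings>'))
       AS NVARCHAR(MAX)) AS RoundTripped;

-- Reading previously zipped settings straight from the table:
SELECT UserId,
       CAST(DECOMPRESS(SettingsBlob) AS NVARCHAR(MAX)) AS SettingsXml
FROM dbo.UserSettings;
```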
Many years back, very green at the time, so much so that I only had a vague idea that I should rename the production database to a different name when I refreshed the test environment. A fairly garden-variety screw-up, but of course I’d thought I was in the test instance and instead right-clicked and deleted the production database for the payroll system. Got to learn point-in-time restores immediately after. In the end we only lost a couple of minutes of data, and we were in the off-cycle week, thankfully. I was despondent over the mistake, but my boss mostly just laughed at me.
Ransomware attack. We didn’t pay. They hit the server hosting our accounting system, which was storing the SQL backups locally, thanks to the consultants who set it up. We had a script that would copy those local backups to a remote backup server. Imagine the horror when we found out that script had been silently failing for two months. Thankfully we had another backup, albeit weeks old. The accounting team had to re-enter a lot of transactions, but we dodged Armageddon.
I can think of two off the top of my head – both of mine related around Oracle on Unix
1) Going back about 17 years and two employers was the time that the system administrator did one of those famous “rm -r *” commands on the wrong file system. Boom. Database data files gone. Go to restore from tape. We discovered a fairly critical bug in the backup software we were using: it could not restore from a backup set that crossed two reels of tape. We desperately went hunting through the backup repo, found one tape that contained a full backup, and then the other DBA had a long night restoring the database and archive logs to get it back up, with no data loss to boot.
2) About 14 years and one employer back, we had a SAN that either didn’t have very good disk mapping crosschecks or we just hadn’t bought the option (I don’t remember). There was this spreadsheet that said “this disk is mounted on this server”. I mucked up one of the updates. Assigned the same volume to two different servers. One database fails. Restore it. Another database fails. The two databases weren’t particularly active, so sometimes the failure took 4-6 hours to happen / get noticed. Went on for about 2 days (including an all nighter) before someone thought to check the disk mounts. Alas this one left one of the databases with some lost data – which also taught me better backup procedures as well!
In my earlier days as a DBA, I used to support a clinical, electronic health records database and it had 1000’s of users on at any given time during the day, mostly nurses and doctors. Inevitably, once a quarter, someone (or a handful of people) would call in and complain about slowness. One of my normal things to do (among other things) was fire up SQL Profiler and look for errors/warnings or other craziness going on.
I accidentally ran a session checking for 100’s of attributes with No filters and it brought the whole system to a crawl! A handful of users’ complaints turned into 1000’s of complaints! The heavy slowness mysteriously lightened up once I stopped my profiler session!
Couple things I learned from all these comments:
1: every single person who supports relational databases likely has multiple horror stories to tell.
2: The ultimate, #1, primary, existential, responsibility of a DBA – for which all other responsibilities pale in comparison – is to implement database backup and restore processing adequate to support the business’s acceptable level of data loss.
One of my first SQL 6.5 scripting tasks was to correct the spelling for a customer’s last name. When I got finished, everyone in the database was related, everyone’s last name was Anderssen. Learned a lot about restores early in my career.
I was adding a new server to our failover cluster and accidentally left the “Add all eligible storage to the cluster” box checked (and why is that checked by default?). Needless to say, it dropped all the drives, and our main database and website were down for about an hour while I figured out what happened. I knew it was my fault, but at that point I had never seen a server without hard drives, nor did I know how to get them back. Server guys were very helpful. I wish I could say that was the only time a checkbox came back to haunt me.
Simulating tables (few base tables existing) via nested views (more than 5 levels deep). No wonder everything was slow. New development then became locked into death spiral using these views as APIs. Guess who was responsible for that? Why, an app developer ‘genius’, of course!
New employee created and executed a poor query to delete old records from multiple tables, but a bad join deleted most of the data from all the tables selected. That is when we found out the backup job hadn’t been running correctly for almost a year. Had to “find” the missing data and spent almost a full week fixing the data. Valuable lesson for the new guy, and now we have procedures in place to check the backup history to make sure the backup files are successfully created and validated.
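A sketch of that kind of check against msdb’s backup history (the one-day threshold is an assumption; adjust it to your recovery point objective):

```sql
-- Databases with no full backup at all, or none in the last day
SELECT d.name,
       MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
      AND b.type = 'D'                    -- D = full database backup
WHERE d.name <> 'tempdb'
GROUP BY d.name
HAVING MAX(b.backup_finish_date) IS NULL
    OR MAX(b.backup_finish_date) < DATEADD(DAY, -1, GETDATE());
```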
Well… this happened 3 years ago. SQL Server 2014, 1 TB database… customer:
“We are filling up the disk please check it”
Me: ok sure, the primary filegroup is filling up the disk
Customer: we have partitioning and we are not leaving data there
Me: are you sure?
Customer: yes, we are using the primary filegroup as default for the partition boundary
Me: I checked it, and the default boundary is holding 2 years of data. Did you add the new boundaries?
Client:………….??? Please fix this
Then the dba nightmare story started…
After trying to split the boundaries and using 4 TB of transaction log space with no success (obviously I filled up the drive and the database went into recovery…), we had to add two new boundaries at the beginning of the partition function (because they had all their tables on the same partition scheme and partition function…), then update 2 billion records in 75 tables (committing every 100k rows…) to move them to these new partition boundaries, split the new boundaries again (fiscal year 2016-2017-2018-2019-2020), and finally update the records again and merge the old boundaries. Everything took 3 weeks = application down?????
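The underlying rule, for anyone who hasn’t fought this: SPLIT RANGE is only a quick metadata change when the new boundary carves off an empty range. Split inside a partition already holding 2 years of data and every row that lands on the wrong side has to physically move, hence the transaction log blowout. A sketch with hypothetical function and filegroup names:

```sql
-- Cheap: the 2020 partition is still empty when its boundary is added
ALTER PARTITION SCHEME ps_Sales NEXT USED fg_Sales_2020;
ALTER PARTITION FUNCTION pf_Sales() SPLIT RANGE ('2020-01-01');

-- Expensive: splitting a boundary that falls inside existing data forces
-- fully logged data movement for every row in the new range.
-- Routine maintenance avoids this: always split ahead of the data,
-- so the newest partition is empty when its boundary arrives.
```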
Yes, a real nightmare!!
Only somewhat SQL-related, but I was working on disabling TLS 1.0 for a client. We had tested it pretty well against our most important systems (mainly web apps), and for everything else we just accepted we would work through issues if we found them.
We didn’t think to test it against SQL database instances. A few weeks after we pushed out the registry key to disable it, the SQL Servers in the AG that held the Network Policy Server auditing database were rebooted and BOOM! 200-something switches in the network stopped working.
We found out that the SQL Native Client version NPS used (don’t remember which) only supported SSL 3.0 and TLS 1.0, so it couldn’t connect to SQL, and the RADIUS policy was to deny requests if they could not be audited. They were behind on their payments to their network consultant and couldn’t call them for help, so someone had to find a port on one of the 6509s (which thankfully weren’t using 802.1x) that had DHCP on it, to get in and edit the GPO to turn TLS 1.0 back on.
Had a similar experience; our security admin seems to think that his registry changes don’t need to be tested with the applications. Arrived in the morning to discover all the SQL Server 2008 R2 instances were down, and .NET web applications wouldn’t connect to the SQL 2012 and 2014 instances which were still running. Every encryption protocol prior to TLS 1.2 had been disabled, across the entire server estate, in a single evening.
Had some blunt conversations about testing, the difference between server and client and how it relates to those registry keys. And if a server isn’t using encryption, is the encryption it’s not using really insecure?
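For reference, the Client and Server Schannel subkeys are independent, which is exactly the server-vs-client distinction above. A sketch of the usual disable pattern (values shown are the common convention; test before rolling out):

```
HKLM\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.0\Client
    Enabled = dword:00000000    ; outbound: what client libraries may negotiate
HKLM\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.0\Server
    Enabled = dword:00000000    ; inbound: what the server will accept
```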
One interesting thing about these comments: nothing related to data breaches! That’s a surprise to me. Is that because we don’t talk when it comes to data breaches?
I worked for one health care provider organization, designing and supporting a data warehouse of PHI. Had we experienced a data breach (we never did), we would have been on the national news that night. I kept the warehouse off the public network for that reason alone.
The only DB-related data breaches clients had were on a QuickBooks DB and a Peachtree DB.
My former employer released several thousand tax returns in a phishing attack through something in their ERP system, though.
Managed to drop all logins on production, including the DBA group, whilst thinking I was happily on a test server.
Found out very quickly as multiple websites went down. Needless to say, that day I learnt the hard way how to recover the master database post-haste!!
I did that using PowerShell with two instances on a server, one default, one named. Had maintenance on the named one, but accidentally ran the script without the named instance reference, which, of course, connected to the default instance, dropping the logins. Not a fun day.
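A cheap guard against exactly that default-instance trap: have destructive scripts check which instance they actually landed on before doing anything. A sketch; the instance name is a placeholder:

```sql
-- Bail out unless we're connected to the instance we intended.
DECLARE @srv sysname = @@SERVERNAME;

IF @srv <> N'MYSERVER\MAINTINSTANCE'
BEGIN
    RAISERROR('Connected to %s, not the intended instance. Aborting.', 16, 1, @srv);
    RETURN;  -- stops this batch before any drops run
END

PRINT 'Safe to proceed with login maintenance.';
```

In sqlcmd-driven scripts, `:on error exit` plus a severity-16 RAISERROR will also stop the whole file, not just the batch.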
There was a unit at my work that did 500 million dollars a year in revenue for both company and independent locations. The only backups they had were on the same raid array as the data. The corporation did have an exabyte tape backup at the time, but no one wanted to pay for the tapes.
One day, they bought a serial drive array so they could back up to something off the server. The technician jammed it into the port and shorted out the board. All data was toast. Cue $50k in data recovery services. Failure.
They went to dev. Nothing there, really. Most development is done live in prod.
3 weeks of 18-hour days to get the app running again. Corporate stores could re-enter the 3 weeks of lost data on contracts that bill nightly. Independent locations were completely out of luck.
Lawsuits drag on for years with 7 figure awards.
Not a single person fired.
While onlining a large corp database at the large corp's yearly offsite DR test, we found that the wrong SAN LUN ID had been specified for DR replication. Since the db had already been brought up as suspect, it took hours to get it back to a valid, testable state. Bullets were sweated.
You know how Management Studio treats the list of databases like File Explorer treats files, i.e. you can rename a database if you double-click into it? (Worst feature ever, as far as I’m concerned.) Well, I was working on a laptop and remoted into a client of mine… and I realized that I was inadvertently about to rename a database, so I quickly reached for the escape key to get out of that mode… but accidentally hit the ~ key. So yes, I renamed the database to ~. Worse, I wasn’t even sure what the original name was! And it was Friday afternoon. What a mess.
haha! I did exactly that to a table once! I noticed it happen and knew what to change it back to though.
Accidentally deploying crap to the master database is another one
And…now that the Object Explorer delete key bug has been fixed in SSMS (meaning, it works), more fun & games will be had!
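For what it's worth, if you ever fat-finger a rename like the ~ story above, the physical file names usually still carry the old database name, and the rename back is one statement. Sketch; the "original" name here is invented:

```sql
-- Find the original name hiding in the physical file paths:
SELECT name, physical_name
FROM sys.master_files
WHERE database_id = DB_ID(N'~');

-- Then undo the damage (hypothetical original name):
ALTER DATABASE [~] MODIFY NAME = [SalesDB];
```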
My customer started with a SQL 2005 instance. After a few years, a new instance of SQL 2008 was installed. Some time later I got a phone call: the production databases couldn’t start and they didn’t know why. After long research, it turned out that both the SQL 2005 and SQL 2008 instances had been restarted, and it happened that 2005 started first. The database was registered at the same file locations in both 2005 and 2008, so 2005 grabbed the data files and locked them. At the moment of the accident, the customer’s administrator didn’t remember the old SQL 2005 existed, so the troubleshooting started completely away from the real problem.
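A quick way to spot that landmine before it goes off: run this on every instance on the box and look for overlapping paths. Nothing instance-specific is assumed here:

```sql
-- If two instances on the same machine list the same physical_name,
-- whichever service starts first wins the file locks.
SELECT @@SERVERNAME            AS instance_name,
       DB_NAME(database_id)    AS database_name,
       physical_name
FROM sys.master_files
ORDER BY physical_name;
```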
I built an ecommerce website for a friend of mine. He had the ability to upload prices via an excel spreadsheet from his POS system. Not elegant, but it met his needs. He had an annual sale and once fat-fingered a price and asked if I could fix it quickly. No problem! Open up SSMS, type up an update query, run it…..and realize I’d forgotten the where clause. So, I just changed the price of *every* product in the system. Whups…. The best part is when I accidentally did the same thing again about 4 months later. Sometimes, I’m a slow learner….
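The defensive habit that stories like this eventually beat into all of us: wrap ad-hoc fixes in a transaction and eyeball the row count before committing. A sketch with invented table and column names:

```sql
BEGIN TRAN;

UPDATE dbo.Products
SET    Price = 19.99
WHERE  ProductId = 12345;   -- the clause that gets forgotten

SELECT @@ROWCOUNT AS RowsUpdated;  -- expect 1 here, not "every product"

-- If the count looks right:
COMMIT TRAN;
-- If it doesn't:
-- ROLLBACK TRAN;
```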
Got a support call that our software at a French client was down. Dialed in over the PSTN line and quickly determined the SQL 6.5 database (this was a while ago) was corrupt. Massively corrupt. Table scans were failing because the data pages’ linked-list pointers were incorrect. The logs showed that the corruption had started 3 months previously, 2 hours after someone had created a new, compressed data device and added it to the database. The users had been clicking through the errors that bubbled up through the UI for 3 months, until they couldn’t perform order allocations anymore.
The oldest backup was 1 month old. All the new orders and inventory levels were in the corrupt data device. Worked until dawn to get as much data out as I could, as the client was adamant they couldn’t organise a stock count. Even with my best efforts it wasn’t really representative of reality; it turned out the forklift truck drivers had started ignoring the system weeks earlier because they couldn’t complete transactions either.
Back in the SQL 2000 days, I had inherited a system that backed up straight to a tape drive that failed around 75% of the time. I was in the process of changing the system to back up to disk after 3 consecutive failures, and had sent out a department-wide email stating that the system would not be backed up that night (I honestly can’t recall anymore why there would be a night of no backups, but it made sense at the time). Despite this, the app admin chose to run an update that night, and the data was corrupted with no good backups to roll back to without losing days’ worth of financial data. Luckily the system was a midway tier between two other systems, and we were able to reverse-engineer the data from the downstream system after about 8 hours of overnight work.
Years ago, I got called in to assist with restoring a SQL Server 7 database (I did say it was years ago). I discovered their master database had been corrupted and recreated. So, the obvious question: backups. Oh, we have them… but we can’t read them. It turned out that the combination of their SCSI controller, tape firmware, and backup software wrote tapes that were gibberish. Of course, nobody had bothered to actually test the ability to read from a backup until they needed to read from a backup.
Had a European client which provided parts to a JIT car production line. Their 3rd party IT support provider had been onsite (Saturday morning) and reinstalled WinNT 4.0 server straight over the top of the factory’s application server, which hosted our SQL7.0 database. That our software, the main application required for site operations wasn’t running anymore was apparently a “vendor issue” and the engineer legged it.
Couldn’t dial in over the PSTN as the modem was on the same box and that software was effectively uninstalled, along with the Seagate Backup Exec. Spent a fraught morning talking the factory’s Flemish shift supervisor through reinstalling SQL 7.0 while making sure we didn’t accidentally lose the databases. And the car production line got all their parts in time… and in the right order. I hope the shift supervisor got the credit he deserved.
I love this one. Well done, that shift supervisor!
About 12 years ago (SQL 2000 days) we primarily used log shipping for our DR between 3 different data centres; let’s call them DC1, DC2, and DC3. Anyway, about 10am one of the prod servers in DC1 blue-screened and we quickly failed over to DC2, no problem. Then about 10 minutes later a server in DC2 blue-screened, so we failed that over to DC3. 2 minutes later one in DC3 blue-screened, and then another and another and another, all in different data centres. By this time we were in full-blown confusion mode. Long story short, it turned out one of the sysadmins had changed the antivirus exclusion list without telling anyone, and the AV started scanning the SQL binaries (logship.exe), which caused the servers to blue-screen. It was a hell of a day, but it proved how important a solid DR plan is, and also that you can’t plan for everything.
I just want to say thank you for all that you’ve done.
But most of all for bringing my attention to Swear Trek.
Glenn – hahaha, you’re welcome!
We had a database that documented all the clinical trials (testing new drugs in development) in progress around the world. This is several hundreds of thousands of records. Data included drug, disease, trial location, companies involved etc…
It was updated constantly, both via automatic means and manual curation.
I accidentally relocated all of the trials to Colorado, United States, via a malformed WHERE clause.
The database didn’t just have trials data in it, it had a lot of other content sets. They were fine, and constantly being updated as well. So we couldn’t just revert to a backup.
Fixing required telling 200+ curators to go home early (it was 10am) and suspending all datafeeds (both in and out), restoring a copy of the database to point in time in parallel, and updating all the records, managing the changes that had occurred since the hiccup and the shut-down order and resetting their last-update/high-watermark values so they weren’t in scope for outbound updates.
Took the better part of the day to do so.
Oh. And it cost me several boxes of chocolate fish as a penance payment to the curators
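For anyone who hasn't done the point-in-time piece of that dance, it looks roughly like this. File names, paths, and the stop time are all invented:

```sql
-- Restore a side-by-side copy, leaving it ready for log restores:
RESTORE DATABASE Trials_Recovery
FROM DISK = N'X:\Backups\Trials_Full.bak'
WITH NORECOVERY,
     MOVE N'Trials_Data' TO N'X:\Recovery\Trials_Recovery.mdf',
     MOVE N'Trials_Log'  TO N'X:\Recovery\Trials_Recovery.ldf';

-- Roll the log forward to just before the malformed UPDATE:
RESTORE LOG Trials_Recovery
FROM DISK = N'X:\Backups\Trials_log_0955.trn'
WITH RECOVERY, STOPAT = N'2019-05-10T09:55:00';
```

From there it's the manual part: diff the restored copy against the damaged records and merge the fixes back, as described above.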
I was new to the company as a DBA and had to restore last night’s backup of the production CRM. The CRM had crashed and the database was unrecoverably corrupted.
I asked my manager where they stored their backups. He was kind of proud, and told me that they had a special server on which the backups from the last 2 years were stored.
I checked the backup from last night. It could not be restored. Same for the backup from 1 week ago. Even the first backup they ever took was corrupted. The backups had run through without an error, and the manager had assumed that was enough. It wasn’t.
Since that time, I have a new server on which I can test restores of the backups and make sure everything is ready for the next real restore.
And the company now knows about the importance of regular DBCC checks.
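For anyone building that restore-test server: RESTORE VERIFYONLY catches unreadable media but not corruption inside the database, so the real test is a restore plus CHECKDB. A sketch; paths and logical file names are invented:

```sql
-- Cheap first pass: is the backup file readable and complete?
RESTORE VERIFYONLY FROM DISK = N'\\backupserver\sql\CRM_Full.bak';

-- The test that actually matters: restore it somewhere disposable...
RESTORE DATABASE CRM_RestoreTest
FROM DISK = N'\\backupserver\sql\CRM_Full.bak'
WITH MOVE N'CRM'     TO N'T:\RestoreTest\CRM.mdf',
     MOVE N'CRM_log' TO N'T:\RestoreTest\CRM_log.ldf';

-- ...and check it for corruption:
DBCC CHECKDB (CRM_RestoreTest) WITH NO_INFOMSGS;
```

Taking the backups WITH CHECKSUM in the first place also makes VERIFYONLY a much more meaningful check.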
We had a database that was performing slowly. Looked through sp_Blitz, the Qs, etc. Nothing out of the ordinary. So I let Profiler run a trace for a day, and after that started the Database Engine Tuning Advisor… which crashed. I quickly looked online for what to do, and saw that I needed to delete the DTA_ indexes on all the tables… which I did. It was nowhere in the article that I also had to delete (nor that it had created!) DTA_ views, statistics, etc. Performance was down to barely working, and I only looked in BlitzNow, Query Store and so on, not in the “normal” sp_Blitz where the answer was. A coworker found the ~8,000 views and deleted them, and poof, performance was back to normal.
Two words: Emergency Services. That organisation had an on-premises 2014 active/passive database cluster that was MISSION CRITICAL. In short: no system, no “situational awareness”. Catastrophic, the kind of stuff that makes the news. One fine day a few years ago turned out to be one of the busiest on record. It was time for this department’s newly acquired 3rd-party enterprise application to shine. Bring it on, I say. Load-based performance testing? Heathen. You’re expelled!!!!
Now, the vendor of this new product didn’t like us making changes to their database (such as applying new indexes), but hey, I’d been running Hallengren’s maintenance scripts, checking for database corruption, and had Spotlight on SQL Server Enterprise Edition pointed at it long enough to get a handle on the performance benchmarks. I was sure it was healthy enough to handle anything mother nature and the state’s worst drivers could throw at it. Performant? I’m coming to that.
2 hrs into this particular day and I could see that the CPU utilisation had climbed to an a$$-clenching 92% average (from a benchmark of around 30%). Another hour passes and it hits 100%. Web interfaces stop responding, integration endpoints stop integrating, DBAs stop breathing. It isn’t long before 5 high-level uniformed operational personnel are standing around my desk. I’m the only DBA in the village and it was time to earn my wings. The ICT dept. manager flexes his delegation muscles and commands me to throw more cores at the server. My attuned DBA / spidey senses tell me that this will most likely not solve our problem, but hey, he’s our manager and I don’t want to show him up in front of the OPS commanders. I adjust the MAXDOP settings in line with Microsoft recommendations. We fail over once more, the server comes back up and……. nothing. We’re right back to 100% CPU utilisation. A quick glance at sys.dm_os_schedulers shows me that only half of the available cores are online. “Standard Edition!” I say to myself in the Seinfeld / Newman accusatory tone. I blabber something about the organisation having to stump up for Enterprise Edition and instruct the sysadmins to reconfigure the VM (yes, a clustered VM – don’t ask) to utilise 4 sockets and 4 cores (we all know the limits of Standard Edition, hey). We fail over again and……. nothing. All cores are now online but the beast refuses to comply. It’s still at 100%! Enough of this poppycock, I say to myself. The others look at me. “Please give me 15 minutes to assess things and I’ll get back to you with a plan of attack.” I sit down and look at Spotlight. Wait stats, a godsend. CXPACKET makes up 70% of all waits. SOS_SCHEDULER_YIELD is right up there too. Hmmm. The number of connections to this server is 350% higher than normal. I run off a list of the top CPU-consuming queries and pick the top one. A big fat nasty SELECT statement. Inner joins, numerous left joins, and to top it off, a function in the WHERE clause.
Gold. I put it into SSMS and run it with “Include Actual Execution Plan”. The results instantly fill my screen. It looks like the Jackson Pollock of query plans, made up of parallel Repartition and Distribute Streams. I feel like crying but I pull myself together. I look closer. Green text. An index recommendation with 98% impact. I pull the recommended index out and assess it. Seems to cover a lot of the columns in the Pollock query. After some consideration (and gut instinct) I throw a Hail Mary and apply it. Executing, executing….. execuuuuutttttiiiiiing….. The application of the index finishes and the CPU almost immediately drops from 100% to 50%. Web interfaces start to respond. Message queues start to process, DBAs start to breathe. I send the rest of the queries off to the vendor and say “please assess, dial in and index where necessary”. I slump back in my chair, exhausted. In short: you can’t “out-resource” bad design or bad queries. I head to the bar.
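The sys.dm_os_schedulers sanity check from that story, for reference. Standard Edition is capped at the lesser of 4 sockets or 16 cores (2012 and later), which is exactly the half-offline pattern described:

```sql
-- How many schedulers does the engine actually have, and in what state?
-- VISIBLE ONLINE = cores SQL Server can schedule user work on.
SELECT status, COUNT(*) AS schedulers
FROM sys.dm_os_schedulers
WHERE status LIKE 'VISIBLE%'
GROUP BY status;
```

If `VISIBLE OFFLINE` shows up with a healthy count, you're looking at an edition/licensing core limit or an affinity setting, and no amount of added vCPUs will help until the socket layout matches what the edition can use.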
Back in the day, a customer of mine wanted me to install SQL 2000. Fine, no problem. They created a database and put all kinds of info in it, cool.
Then they wanted me to move the database to another SQL Server. So I tried to move the data/tlog files to the new server; that didn’t work because the files were in use. Cool, I’ll stop the service and move the files to the new server.
On the new server I start Enterprise Manager, but I can’t see the database. Restart the service, still no database.
Moved the files back to the old server; database not working. (I’d moved them back to the wrong folder.)
Oh well, it must be corrupt then?! Told the customer that the database was corrupt and hey, that’s life. The customer created a new database and called in her kids to re-enter all the info that they’d had in the database (it took 2–3 days).
No problem, case closed and problem solved.
A couple of months later my boss sent me to a SQL 2000 administrator class with Tibor Karaszi. Tibor showed us how to attach/detach a database… in that second I remembered the above case and just thought “oooh sh*t”.
But hey, now I knew how to move a database, and I actually had another database that had been created in C:\. (A very important db.) So I VPN in from the hotel, connect to the server, detach and move the files.
Then I try to attach the database: error… wtf! Again: error! I’m f*cked now… panic!
So I start searching the web for Tibor’s phone number, find it, and call him at around 10pm. “Yo Tibor, I’ve screwed up,” and Tibor just calmly goes, “When attaching, you must have the whole filename for the log file; tlog_log is not enough, it has to be tlog_log.ldf.”
Hahaha, that’s awesome that Tibor answered the phone!
I was teaching my fellow developers how Azure VMs work. One of them had the mouse. Me: “Here you have the redeploy tile, you can click the menu and then don’t click the butt…” Too late, he’d already redeployed our production SQL Server 2016. The good thing is it takes only 5 minutes for Azure to redeploy, and 2016 handles this kind of thing really nicely.
True story which happened to me a couple of years ago.
Production SQL in Hyper-V VM with Hyper-V replica to another location.
C: – system, D: – data, L: – logs, …
One of DBs grew to 1.5TB and D: drive partitioned in MBR (allows max 2TB) can no longer fit it.
Task is simple: make backups, shut down SQL, copy all data away, re-partition to GPT and extend drive to 3TB – 3h of work for 2 people.
The night everything goes wrong:
– 07:00 PM – due to miscommunication, one of the guys doesn’t show up on time.
– 08:30 PM – finally start maintenance. Pause Hyper-V replica, copy all MDFs and LDFs to a network share, …
– 10:00 PM – copy to network share (1.5TB) fails due to network issue and must be restarted.
– 11:00 PM – Re-partition and copy data back.
– 00:15 AM – Drive suddenly becomes read-only. Turned out Hyper-V could not replicate that much data.
– 00:30 AM – Disable Hyper-V replica and start copying files back anew.
– 02:00 AM – Copy finished but MDFs are corrupted.
// Turned out Hyper-V replica applied a DIFF file to the freshly-copied files and corrupted them.
– 02:30 AM – Kill D:/L: drives, kill any hyper-v replica stuff, create new drives, partition them and start copy anew.
– 03:40 AM – While copying, the SAN partition where the original files lived dies!
– 04:00 AM – Time for “Plan B” – restore all from backups!
– 04:30 AM – Wipe all drives clean again and start SQL Server – it does not start!
// Turned out master, model, and msdb were on the D: and L: drives and were also lost!
– 04:45 AM – Try to rebuild the master DB, but fail because someone moved the share with the installation ISO.
(!) – 05:00 AM – “Plan C” – restore backup of a master DB on DEV server, detach -> move to production server -> attach -> SQL Server is on -> Start restore backups.
// Luckily both DEV and PROD servers had same SP/CU version so MDFs were not automatically upgraded to say SQL 2014.
– 07:00 AM – Most DBs are restored and running.
– 08:00 AM – One of DBs cannot be restored because full backup is corrupted.
// Restoring previous full backup + tons of log backups (taken each 10 minutes).
– 08:30 AM – THE 1.5TB DB is restored and running.
// No one was affected. Customers and business continued to work as usual.
My biggest database mistake/disaster is….
We had developed a loading process that ran from hourly CSVs off an SFTP site. I was really eager to take the next step, but I hadn’t totally finished/tested the current stage. At that time there was no test environment, so everything we did was on a production server, and we were loading the data into staging tables. While we were loading, I was eager to start on the DM tables so the data would be ready for reporting purposes. While developing a fact table, I truncated the wrong table. Worse, I did this after weeks of loading. And to make things worse still, I did it on a Friday, when there was no one left at the IT department to recover a backup. So the next Monday I got to tell everyone that we’d be loading thousands of CSV files for a second time 🙁 Oops… don’t forget your FK constraint 🙂
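About that closing smiley: a foreign key really does act as a seatbelt here, because TRUNCATE TABLE refuses to touch a table referenced by one. Sketch with invented names:

```sql
-- Any FK pointing at the table blocks TRUNCATE outright:
ALTER TABLE dbo.LoadAudit
ADD CONSTRAINT FK_LoadAudit_Staging
    FOREIGN KEY (LoadId) REFERENCES dbo.Staging (LoadId);

TRUNCATE TABLE dbo.Staging;
-- Fails: "Cannot truncate table 'dbo.Staging' because it is being
-- referenced by a FOREIGN KEY constraint."
-- DELETE still works, but at least it's fully logged and can be rolled back.
```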
This was about 15 years ago.
About a year after joining the DBA team at my then employer (I’d worked there for a few years but started at 1st line tech support and moved into DBA work from there) I was asked to copy our HR database from Live onto Test. This was a semi-regular request.
To do this, you had to go into the server room, plug in a network cable between the two, and then access the local terminal on the Shark in order to do an expdp/impdp (it was an Oracle system, please don’t judge me too harshly :D). You’d run the export command on Live to create the file and then ssh onto the Test server to run the import command.
Well, guess who forgot to ssh to the Test server! It was halfway through the import that I got a phone call from my boss, asking if everything was OK, as HR were having issues looking at employee records. Cut to me looking as pale as if I’d seen a ghost.
All resolved relatively quickly with the use of backups, etc but some of my then-colleagues still tease me about it even though we don’t work together any more
Ha ha! Well, I’ve got two for consideration: one mine, one 3rd-party but a headache to work on.
1) First up, I was a fresh-out-of-university developer/dev DBA (we had one DBA and I was helping him out and learning the ropes). This was Sybase on Unix, and the DBAs had root-level permissions. I was given the task of setting up the UAT servers for the final customer acceptance tests. While there, I realised that there were a lot of old “hidden” files that weren’t needed – in *nix land, they’re hidden by simply prefixing the name with a dot, e.g. “.old_file”. Our Unix had this annoying habit of asking are-you-sure (y/n) for deletes, so I bypassed that by writing a little script to feed “yeses” to the delete command. I then issued an “rm -rf .*” command. That’s when I found out that .* will match everything, including “..”, and so recurse up the directory tree. I managed to delete about 1/3 of the entire file system before I caught it. I then had to go fess up to the Unix sysadmin and get the entire file system restored – embarrassing.
2) I worked for a while for a utility company that had a very good DR system: RAID storage, dual remote sites, regular DR tests, the works. One weekend, the RAID system with the database files reported a bad disk. No worries: replace the disk and let the system start re-striping onto the new one. Then a second disk went. The RAID system panicked and we initiated a DR failover to the second data centre. All good so far – well, good in the sense of according to plan. Then the power went on the storage unit of the secondary data centre and wouldn’t come back. So now our DR plan was in the toilet, and we were falling back on restoring tape backups onto non-prod hardware, which wasn’t large enough for everything. We had to requisition every non-prod server and kick out the devs and testers, plus work with the business to prioritise what they wanted back immediately and what could wait for the hardware to be fixed in the data centre. That was a long week, and it taught me that no matter how good your DR plan is, nature will find a way to break something you haven’t planned for.
Heh heh heh, about #1: computers sure are good at following bad instructions quickly, hahaha.
I was moving files online. I can’t remember why, but I was moving the files to one drive, and it was going to be tight, but I checked and rechecked my math and it would fit.
On the primary.
On the secondary, we had a reporting database that happened to have its files on that drive.
When the file growth command failed on the secondary, it brought the db on the primary down.
This is what led my company to run a second AG from the secondary back to the primary, for that one DB, and why I get nervous whenever I see files on one side of an AG but not the other.
Ooo, yes! I’ve run into that with log shipping, too! Had another team that was restoring the log backups to a server that I didn’t even know about.
Somehow I got elected to set up SharePoint for the company. So I installed SQL Server on a box and then installed SharePoint. Apparently the SharePoint installer defaults to installing a SQL Express edition to use, and I didn’t notice. Fast forward a year and a half, and SharePoint starts giving users error messages that they can’t add new documents anymore because the database is full. Of course, that edition had a 2GB size limit, and we’d hit it. So now came the fun of trying to migrate all the data from Express over to the Standard instance, and then trying to coerce SharePoint into talking to the Standard instance. I don’t remember exactly what I had to do to get SharePoint to move over, but I remember the technique involved doing something that was either going to save the system or completely trash it.
First job as a DBA/Dev working for a tiny company handling credit cards.
#1 – Updated a credit card number manually for a single customer in production, except I left off the WHERE clause. Luckily it was a very low-volume database, and I managed to replace the data from a backup before it affected anything.
#2 – Same company: testing credit card processing in the Dev database, which was on the same server as production, and naturally I used the production database by mistake. I ran up a few thousand in charges against a single customer card, thinking I was using the test number that had been plugged in for that customer in the dev database. Luckily I noticed the same day, so I was able to cancel the charges before they were committed by the CC company. I guess the customer either had a very high credit limit or didn’t try to charge anything, since we never heard anything about them having a charge rejected.
A 30+ TB Oracle database storing data with a required retention period of 7 years, where the two local copies were corrupted and the tapes were bad (nobody had tested them for years because of the amount of time it took to pull off tape). Had to pay Oracle to come on site and patch the engine in order to mount the corrupted files. $3 million plus in costs to vendor/consultants and 4 weeks down. Sad face :'(
We had an old application running on Java 6.
The day we upgraded everything to SQL Server 2016 SP1, we discovered that the app no longer worked: SQL Server’s TDS protocol was now using TLS 1.2, while the SQL Server JDBC driver for Java 6 only supported TLS 1.1.
And I had to make a SQL Server proxy:
– Install SQL Server 2012 Express locally
– Add a linked server to SQL 2016
– Create all the required SPs and proxy the calls to the live DB through that linked server.
My darkest hour took place one Saturday morning.
It was a perfect storm – a Friday afternoon change, insufficient testing, someone else on-call and Blackberry safely left behind while taking the kids surfing.
The Change was simple enough, an update to the daily Data Load / notification process. Unfortunately, there was some code missing which resulted in an endless loop.
The Data Load completed and SQL Server sent the notification e-mail. Then it sent the e-mail again, and again and again. In total, 12.6 million e-mails over the course of four hours.
The on-call agent forgot that they were on-call so they remained oblivious to the issue. I returned home and picked up my Blackberry to see a record number of unread e-mails. Logging onto the SQL Server was almost impossible as the Network had somehow frozen.
SQL Server had become unresponsive, Exchange Servers died. Reports came in of Blackberrys vibrating themselves to death as they notified of each e-mail arriving until the battery ran out. One Blackberry decided enough was enough, vibrating off a table and smashing on the tiled floor below. The only physical casualty, although my mental scars remain even now.
The mess was cleared and the “DeNigel of Service Attack” went into IT folklore to join “Nickageddon”.
I have not surfed since.
A certain web filtering vendor’s SQL code was sending a 1MB-sized SELECT statement with THOUSANDS of AND … NOT IN predicates. Replaced.
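The usual rewrite for those generated NOT IN walls: load the exclusion values once and anti-join, which the optimizer can actually do something with. A sketch; all names invented:

```sql
-- Instead of: ... AND Url NOT IN ('a') AND Url NOT IN ('b') ... x thousands
CREATE TABLE #BlockedUrls (Url varchar(400) NOT NULL PRIMARY KEY);
-- (bulk insert the thousands of values here, once)

SELECT r.*
FROM dbo.WebRequests AS r
WHERE NOT EXISTS
      (SELECT 1 FROM #BlockedUrls AS b WHERE b.Url = r.Url);
```

Beyond plan quality, this also keeps the statement tiny and cache-friendly instead of compiling a fresh 1MB query every call.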
I work primarily in developer support for an ERP application that does a large amount of work in a SQL Server database. One lovely morning, a customer called us complaining about server performance. I connected and looked at the activity. I noticed a rather large cost on the longest running query:
Estimated CPU Cost: 3,346,030,000
Estimated Operator Cost: 3,346,100,000 (10%)
Estimated Subtree Cost: 34,372,800,000
Estimated Number of Rows: 348,099,000,000,000
It turns out one of the customer’s developers had written an SSRS report query that had cross joins of billion row tables without adequate where clauses. When the query ran, everything slowed to a halt, because the query used most of the server’s available CPU, memory, and I/O.
We killed the process and sent the customer some info on the report. Performance immediately returned to normal. The customer explained that the developer had deployed the report to production without checking performance in a test environment first.
While I was a team manager at my old job at an MSP a few years back, one of my DBAs ran into an interesting issue. He had a client with a 2-node SQL 2014 AG cluster. One of the databases had In-Memory OLTP tables. Every two days, the replica server would completely crash due to a memory leak… 64GB of memory. AG/In-Memory OLTP will carry a small amount of memory overhead on the secondary server, but crashing it is ridiculous! My DBA would rebuild the AG and watch as SQL took up all of the memory on the server and then tanked. We couldn’t find anything online about this issue, and every few days we’d get a call from our irate customer about this server dying. After a few days of deductive research, I found the issue had to do with garbage collection cleaning up the in-memory row versions. GC is a system process, but it is very much driven by user activity. The issue was that the secondary replica’s GC process is (or was; not sure if this has changed in newer versions) independent of the primary replica, and there was no user activity on the secondary to mark the versions as old, so GC couldn’t clean up the versions. This caused the bloat and then the death of the server. I fixed the issue by writing a simple select query that hit the table once for each nonclustered index, forcing the indexes. I made a SQL job that ran this against the 4-5 tables every 5 minutes. We immediately saw the memory start to go down, and the server has been humming along ever since.
One of the best parts of this story was that I tried to open a support case with Microsoft to see if they were aware of this as we couldn’t find any thing online. I couldn’t open the case because the MS website was down 🙂
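In case anyone hits the same thing, the workaround query described above was shaped roughly like this: touch the table once per index so the secondary registers activity. Table and index names are invented, and I can't promise current builds still need it:

```sql
-- Run on the secondary replica on a schedule (e.g. an Agent job every
-- 5 minutes). Forcing each index makes the replica's garbage collection
-- see the in-memory row versions as stale so it can reclaim them.
SELECT COUNT(*) FROM dbo.Orders WITH (INDEX (ix_Orders_CustomerId));
SELECT COUNT(*) FROM dbo.Orders WITH (INDEX (ix_Orders_OrderDate));
-- ...one SELECT per nonclustered index, per affected table.
```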
Dammit, I really thought I had a chance at winning some prizes, but I’ve still got so much to learn!
The best I have is restarting multiple nodes and taking out a cluster. I’ve also accidentally restarted servers instead of logging off, and clobbered the odd table by forgetting to highlight those pesky where clauses, but this is baby stuff compared to the genius above.
In a shameless attempt to win prizes, I’m going to credit my flukiness to the brilliant SQL knowledge gleaned from the fabulous Brent Ozar Ltd and the people who have sacrificed so much so that I could remain unscathed – I salute you, good folks (and hope my creepiness has not induced vomit).
Am hoping to recreate some of the glorious moments shared already, all the best DBAs have them…guess that says it all.
You’re not a true DBA if you haven’t rebooted a server by accident, IMO. One time I rebooted a prod server when the Windows Updates notification came up: aimed for Remind Me Later, missed, and hit Restart Now 🙁
A number of years back, as a junior DBA, I was tasked with an upgrade/migration for an important system. Everything went well that evening. Then the next day someone noticed a problem: the history information for new data in the application was showing different names (e.g. record last modified by Joe Blow would instead say Jane Doe). The names weren’t broken – the issue was that this application was using the login ID values to write history records instead of the login names, and grabbing whatever name matched the ID. Those ID numbers were different in the newer server since the logins were all created at once, likely in alphabetical order. After sweating it out for a bit we reset the login IDs on the new server to match the old one, so going forward all was good. The data from the day after the migration was still showing the incorrect values, and was still wrong years later since they never approved a data fix.
I forgot to put a WHERE condition in an UPDATE. So, instead of updating just 10,000 records, I updated 1 million records. Thankfully, I had taken a backup of the DB just before the change.
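One defensive habit that helps here: wrap the update in an explicit transaction and sanity-check the row count before committing. A sketch – the table and predicate are hypothetical:

```sql
BEGIN TRAN;

UPDATE dbo.Orders                     -- hypothetical table
SET    Status = 'Archived'
WHERE  OrderDate < '2010-01-01';

-- Expecting ~10,000 rows; read this immediately after the UPDATE.
SELECT @@ROWCOUNT AS RowsUpdated;

-- COMMIT;    -- only after eyeballing the count
-- ROLLBACK;  -- if the count looks wrong
```

It costs a few seconds per change and turns a missing WHERE clause from a restore job into a rollback.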
During a migration from SQL 2005 to 2012, we had an application deployment to make as well.
We had planned on doing the migration, and then running the software deployment. I had a spare hour after the software had been taken offline, but before the migration was to start, so I thought I’d be efficient and start the deployment first, thinking I could just stop it, do the final backup, restore to the new instance and carry on with the deployment.
When the time came to start the migration, I hit the stop button on the script deployment…and nothing appeared to happen. I waited for 15 minutes for my query to stop…and then I started voicing my panic. 3 hours later (well outside the planned window) my query finally stopped. This is when I learnt that rollback is single threaded, and you can’t stop it (makes sense). In the end, no harm done, but a stressful few hours trying to work out why I couldn’t stop my transaction.
I’ve had my moments and blunders along the way, but the one that haunts me the most: after days of logging in and out of servers doing upgrades, migrations, deployments and firefighting, with everything going well… I logged into Prod to do something trivial, and after I’d finished, instead of logging off… I shut the server down.
Phones start ringing, alarms going off, Red-Alert, all-hands-on-deck type response in motion… and I had to fess up and admit what I’d just done. Then just awkward silence as the rest of IT stared at me with utter contempt on their faces.
I just love how “log out” was right next to “shut down” for over a decade, and nobody thought it was a user experience problem.
So here is my story: PFS page corruption. Just one 8KB page within several hundred GB of data.
The corruption wasn’t detected until a month after it happened. One day we had a massive delete operation that left a lot of empty space behind, and the manager said to reclaim the space. With PFS corruption you can only grow the file – you cannot shrink it to make the file shorter than the corrupted PFS page and the subsequent pages that PFS covers.
Luckily, the corrupted file held only index data – the row data itself was unaffected. The last (probably) good backup had been deleted several days before detection.
A ticket was opened with Microsoft, but they just said “restore the last good backup”.
I checked what was allocated within the PFS page’s range – no IAM chains were even touching pages in the affected area.
So maybe a brand-new, empty PFS page would be OK? Time to do my first PFS page transplant.
I created an empty database with as many files as the corrupted one, making the matching file just large enough to contain the PFS page I was interested in.
I detached both databases and copied the 8KB page from the healthy file into the corrupted one. DBCC CHECKDB ran for over 24 hours and did return complaints about the PFS page (I don’t remember exactly – an improper backup LSN maybe?), but it worked.
I dropped every object allocated in that file, shrunk the file to its minimal size and removed it. Everything was fine.
Even Microsoft’s PFE was surprised.
And one more thing – I violated the license agreement, because I did this operation on a 180-day evaluation instance 😀
Takeaways: do not do this in production, always check backups, use proper engine edition.
How on earth did you copy just the 8KB page across?
The databases were offline, so using PowerShell I read 8KB from the freshly created file at the given offset and wrote it into the corrupted file at that same offset. The page header carries the file ID, page ID, backup LSN, checksum, etc. If the header information is OK, the rest of the page doesn’t matter – but only if every page within that PFS interval was un- or de-allocated.
So I admit, we were probably extremely lucky.
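For the curious, the offset arithmetic is straightforward, assuming the documented 8KB page size and the usual 8,088-page PFS interval. A sketch only – absolutely not something to try in production, as the story itself warns:

```sql
-- PFS pages sit at page 1 of each data file and then every 8,088 pages.
DECLARE @pageID bigint = 25000;  -- hypothetical page we care about

DECLARE @pfsPage bigint =
    CASE WHEN @pageID < 8088 THEN 1
         ELSE (@pageID / 8088) * 8088
    END;

SELECT PfsPage       = @pfsPage,
       PfsByteOffset = @pfsPage * 8192;  -- where to seek in the file
```

That byte offset is where an 8KB read or write of the PFS page would land in the raw data file.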
When I was a fresher, I was asked to produce a database diagram. I prepared the diagram by removing the unnecessary columns and saved it. It asked whether to save the changes; I didn’t even read the messages and pressed the spacebar a couple of times. Gone… all the unselected columns dropped from the actual tables, including the identity column. In production. Luckily it was just a 5-row master table, restored immediately.
HAHAHA, wow, I never even thought about that happening! Wow.
How a space can shut down SQL Server:
We needed to clear trailing dots and spaces from texts before showing them in reports.
A colleague created this script:
WHILE ((RIGHT(@text, 1) = '.') OR (RIGHT(@text, 1) = ' '))
SET @text = LEFT(@text, LEN(@text) - 1)
Tested with several test cases (yes, including an empty string) and it went to production with a lot of other stuff.
But the next morning everything was slowwww… and after 1-2 hours everything stopped working, so I kept killing processes while trying to figure it out.
You should try the above query with:
SET @text = ' '
That day I learned the difference between LEN and DATALENGTH.
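For anyone who hasn’t hit this yet: LEN ignores trailing spaces, which is what broke the loop’s length math. A minimal sketch of the difference, plus one possible DATALENGTH-based fix (this assumes varchar – nvarchar stores 2 bytes per character, so the byte math would differ):

```sql
DECLARE @text varchar(100) = ' ';

SELECT LEN(@text)        AS LenResult,        -- 0: LEN ignores trailing spaces
       DATALENGTH(@text) AS DataLengthResult; -- 1: actual bytes for varchar

-- One possible fix: base the length math on DATALENGTH instead.
WHILE DATALENGTH(@text) > 0
  AND (RIGHT(@text, 1) = '.' OR RIGHT(@text, 1) = ' ')
    SET @text = LEFT(@text, DATALENGTH(@text) - 1);
```

With the original code, LEN(' ') returns 0, so the loop tries LEFT(@text, -1) while the loop condition still holds – the length math and the exit condition disagree about trailing spaces.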
Maybe it isn’t truly horror, but sure it was stressful and unexpected.
So I had the misfortune of getting handed a database that someone had built to track the supply side costing of the business. The way it was supposed to work was that the purchasing group would put a barcode on the materials they bought and record the purchase price and some dimension information. Then when the material was used in the factory, they would report the barcodes they saw coming through and we’d be able to loosely tie the material input to the factory output.
Well, the program kicks off, and after a few months, the purchasing group needs to order more barcodes. But for some reason, they placed an order for the same barcode number series. So they came back with duplicate barcode numbers, and I was left trying to tell them that the database was now not going to know which barcode to look up when it hit the factory. They told me that next time they would not order the same barcode numbers again, but it was OK because the raw material has a shelf life, so I could archive out any barcodes older than 8 months. Anything that hadn’t been marked as having been used probably just fell off.
So then, some time goes by and we start having a lot of material coming in as being marked as not having a barcode. This seems strange, and after a few weeks, we find out that the factory was having trouble keeping up the pace, so they just reported that the material didn’t have a barcode. So now we have a bunch of barcodes that haven’t been marked as seen. Fine, whatever. They age out of the database.
More time goes by, and suddenly we are asked why a whole bunch of barcodes that should have been present in a special run aren’t in the reports. I look into it, and I find them in the archive database, having been aged out. It turns out that they had been planning this special run for over a year, and had specially preserved this one batch of material and saved it for about 14 months. So now I’ve got to bring a bunch of stuff out of the archive to get the reports they want. I am so glad that I don’t have to deal with that awful system anymore.
Being an intern I drew the short straw to be on call for an update to deal with a VAT change. Senior dev coded it and set a SQL job to execute at 1am. At 5 past 1 I got an angry call from the client asking what was going on. Logging on, I saw the job ran successfully but no changes were made. Test and QA were fine, but they had been run by hand the previous day. I looked at the SQL in the job and there was our friend ROLLBACK at the bottom. The dev must have tested it on prod before setting up the job. No online sales were made, luckily.
Perhaps my worst horror story was the time I brought down the accounting team at a small business with an ill-thought-out idea.
First, some background. This was at a small business and I had gradually been handed more and more duties, such that I went from phone support guy for clients, to:
Phone support guy for clients
On-site support for clients
Domain Admin (and yes, I signed in with one account that was a Domain Admin)
SQL Server Admin (as in, I knew how to install it, at least, and some basics)
Phone system admin
Yeah, I was a busy beaver.
So the mistake. At the time of this incident, we only had a single domain controller which was handling AD, DNS, and DHCP all on its own. Bright-boy me had read that you really should have more than one. Hmm, so looking at the various servers we had, the only one at the time with the same OS was the server the accounting software was living on, with its SQL2008 back-end on there as well. So, I said to myself, let’s run DCPROMO on that server and make it a Domain Controller too!
Yeah, it didn’t work out so well. The DCPromo went OK, the server rebooted, and shortly after I had the head of accounting asking why the accounting application was unable to find its database. I went looking and the SQL Server service wasn’t running. And the service account (NT Service\NETWORK I think) was GONE. Turns out, a Domain Controller in Windows 2008 does weird things to its local service accounts…
A swift demotion of the server back to a member server and a re-setting of the service accounts for SQL, and I managed to recover things back to normal.
Oh, and all of that was done with NO SQL backups, NO OS backups, and NO export backups from the accounting package…
Yeah. Brown pants time.
The other thing that surprises me about most, if not all, of these stories is the “interactive” nature of the work practice. Where are the run books?
I got to the point where all work I did in prod was always spec’d first in a run book, reviewed by the team prior to the work being done, tested as far as possible in non-prod first, with each step signed off as completed by the technician, and permanently maintained and refined for future repetitions of the work.
What? Change management? So quaint. Except for the place where I set one up years ago, I’ve never seen one.
Sounds great, but that’s a luxury only afforded to organisations large enough to resource it. I work for a small local government organisation. I’m the only DBA. Who’s going to review my work? We consider ourselves fortunate to have separate Dev and Prod environments. So even though we develop and test our changes, deployments to Live are always a one-shot deal.
Found a user record in our eCommerce users table that had become corrupted and needed to be recreated, so I typed a quick DELETE statement into my active SSMS tab to remove their record. But in my haste to solve the problem and be the hero, I didn’t select the last line of the WHERE clause before executing it and ended up deleting *ALL* of the users. I could hear the help desk phones start to ring almost immediately.
Fortunately, our Senior DBA was the hero that day and I learned what RPO means. He had the user data restored in about 5 minutes. Uggh, I didn’t want to learn about RPO/RTO like that, but I will never underestimate the importance of knowing where your backups are and how to use them.
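A restore like the one our senior DBA pulled off is typically a point-in-time restore: full backup first, then roll the log forward to just before the bad statement. A sketch – the database, paths, and timestamp here are all invented:

```sql
-- Restore the last full backup without recovering yet...
RESTORE DATABASE ECommerce            -- hypothetical names throughout
FROM DISK = N'X:\Backups\ECommerce_full.bak'
WITH NORECOVERY, REPLACE;

-- ...then roll the log forward to just before the DELETE ran.
RESTORE LOG ECommerce
FROM DISK = N'X:\Backups\ECommerce_log.trn'
WITH STOPAT = '2019-06-07 14:31:00', RECOVERY;
```

This only works if the database was in full (or bulk-logged) recovery and the log backup chain is intact – which is exactly why knowing where your backups are matters.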
I was once trying to free up some space on a Linux server that hosted a calendar app I’d written in PHP, sitting on top of PostgreSQL. It was a prod box, used by around 400 people, and I was logged on as root. After deleting some stale backups (we backed up every 15 mins) I finally came across a huge file called postgres.log, which I deleted, saving another big chunk of space. A few minutes later the phone started ringing, and a few minutes after that I discovered I’d deleted the transaction log. Luckily we had backed up a few minutes before. I didn’t log on as root again.
There was a new application entering the database instance, but they had some performance issues with one of their SQL statements. So we asked them to please email us the statement. They attached it, but when we tried to open the attachment, Outlook crashed on the desktop. When we finally got it open, it was 22MB of text for one single SELECT statement. Why? They included all columns with their values in the WHERE clause, like: WHERE (fruit = 'apple' AND size = 'big' AND quantity = 2) OR (fruit = 'banana' AND size = 'small' AND quantity = 100) OR …
The SQL statement became 100 times larger than the actual table. We told them they had to rewrite it…
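The usual rewrite for a giant OR-chain like that is to load the value tuples into a table once and join against it. A schematic sketch – the table and column names are invented:

```sql
-- Load the search tuples once...
CREATE TABLE #criteria (fruit varchar(50), size varchar(20), quantity int);
INSERT INTO #criteria VALUES
    ('apple',  'big',   2),
    ('banana', 'small', 100);
    -- ...thousands more rows, bulk-loaded rather than inlined in the query text

-- ...then one join replaces the 22MB OR-chain.
SELECT p.*
FROM dbo.Produce  AS p               -- hypothetical table
JOIN #criteria    AS c
  ON  c.fruit    = p.fruit
  AND c.size     = p.size
  AND c.quantity = p.quantity;
```

Besides shrinking the statement, this gives the optimizer a plan it can reuse instead of compiling a fresh monster every time.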
When I was a junior DBA I got a call from a customer complaining the application was slow. It turned out to be down to high IO: when I set up log shipping, instead of setting the job up to take a backup of the transaction log, I had accidentally set it up to do a full database backup (every minute). My boss at the time found it funny and simply said, “I know they say a DBA is only as good as their last backup, but in this case…”. Anyway, I ended up buying a few rounds down the pub that night as penance.
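For the record, the difference between the job I meant to schedule and the one I actually did comes down to one keyword (database name and paths here are invented):

```sql
-- What a log shipping backup job should run every minute or so:
BACKUP LOG SalesDB                    -- hypothetical database
TO DISK = N'X:\LogShip\SalesDB.trn';

-- What was accidentally scheduled instead: a FULL backup every minute.
BACKUP DATABASE SalesDB
TO DISK = N'X:\LogShip\SalesDB.bak';
```

A log backup copies only the log records since the last log backup; a full backup reads every allocated page in the database – doing that once a minute is what hammered the IO.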
Data centers in two locations, DC1 and DC2, with servers booting from SAN storage that synchronized between sites, so we could repoint the LUNs and boot the same system in the other DC in case of failure. This was what we were going to test. The SAN admin didn’t take an extra snapshot before testing, since this was such an easy and quick test anyway… why waste 10 minutes?
When the test started, we shut down prod in DC1 and fired up the second server in DC2 to use the LUNs – all fine… except a bug in the iLO board meant the prod server in DC1 didn’t actually stay down; it restarted after a delay. It accessed the drives at the same time, making a complete mess of the security settings in the system folders.
Did I mention this was the main stock exchange system… After failing to get it to work through several blue screens and hours of trying, we re-installed a clean box, restored the databases, and finally got hold of the application specialist around 5AM – he had just returned home from a vacation on a night flight. Well rested from the vacation (not), he got to work and somehow managed to get the system up and running 20 minutes before the stock exchange opened for the day. System uptime was still at 100% during office hours that month. 🙂
Job interview – interview test on a production database. I didn’t know it at the time though.
Sorry to be late at that party.
Well, I use PostgreSQL … so I really don’t know the pains you are talking about. 🙂
Yeah, your performance is just so bad that you end up commenting on blog posts almost a year after they’re published.
I’m not referring to Postgres.
I’m referring to your own personal performance.
Hell, you can’t even figure out which web site you’re supposed to be reading. Wow, you’re in rough shape over there. Hope things get better for you.