This has been blogged before by several people. One thing I’ve learned from Brent is to not let the fact that the topic has been blogged before stop you from blogging about it. With that out of the way…
When you have sysadmin access, you are bound to make a big mistake in production at some point in your career. Not everyone has. Maybe they’re perfect, or maybe it just hasn’t happened yet. I’ve made two big mistakes in production.
15 years ago. I remember it like it was yesterday. This was back in SQL Server 2000 days. We were using Log Shipping for our Disaster Recovery needs. This company failed over to the DR site 1-2 times per year and would run production there for 2-3 weeks at a time. This had two big benefits.
-Proves to the customers that the DR plan works (as long as resources are availabe in an unplanned situation)
-Allows you to do major maintenance at the primary site
During one of the planned failover maintenance windows, I was tasked with dropping Log Shipping and setting it up again in the reverse (from DR site to primary site). One of the drawbacks with Log Shipping is that after you failover to the secondary server, you can’t reverse it without dropping Log Shipping and setting it up again.
Well 15 years ago, I was a lot less experienced than I am now. My steps would be different these days. But here’s what I used to do:
- Drop Log Shipping
- Check the Log Shipping tables in the msdb database to make sure it dropped cleanly (back then there were often rows that had to be manually deleted)
- Drop the database at the primary site to avoid having to use WITH MOVE in the RESTORE DATABASE command
- Setup Log Shipping
I accidentally dropped the database at the DR site where production was now running. I immediately told my manager and the manager of the WebOps team. We were still inside our maintenance window, so I started restoring the database. I restored to the last transaction log backup that had been run, but there’d be data loss if I stopped there. I got lucky on that system that the incoming data could be recovered from text files. The bad news was that someone else had to do that task as it wasn’t a SQL Server task. I felt bad for not only having dropped the database but that I increased the work of someone else who already had a ton of stuff to complete in that maintenance window.
Now every time that I have had to drop a database in production, it takes me a good minute before I am confident that I am doing it on the right server and on the right database.
I’m embarrassed to say that my second big mistake in production occurred 2 years ago. The system that this big mistake occurred on required many manual data changes due to various reasons, such as the application not having a certain feature or an application bug that hadn’t been fixed yet.
I needed to delete some data in a table. You see where this is going, right?
First I had to query the table to see what needed to be deleted so that I could get my WHERE clause right. I changed the SELECT to a DELETE. The full DELETE query was in the query window including the WHERE clause. I highlighted the query (I had other things in the query window) and hit F5.
The problem was that I accidentally missed highlighting the WHERE clause and hit F5 too fast. To top it off, I had dismissed the SSMSToolsPack warning about deleting data without a WHERE clause. I was so used to dismissing the warning that I clicked it without even thinking about what it said.
My stomach dropped. I noted the time (this is important for point-in-time recovery). I ran to my manager’s office and fessed up to what I had done. He told me to start the side-by-side restore while he worked on figuring out the impact. We had enough disk space that I could restore the database with a different name so that there wouldn’t be downtime. I restored the database and then the transaction logs to the point in time before the DELETE command had been run. I then copied the data over to the actual database. There wasn’t foreign key data that I had to worry about, luckily. It was just the data in this table that needed to be recovered.
So, fess up! What have you done?
What is the biggest mistake you made in production? Feel free to comment with non-SQL Server mistakes, just as long as they were made in production. Mistakes in non-production can be costly, but doing them in production takes it to another level.
Brent says: I’ve done a lot of dumb stuff, and I should probably write a whole compendium blog post. One of my favorites was trying to move a big 4U rack-mount server from a bottom spot to a top spot, by myself, in the middle of the night during a Saturday maintenance window. I’m no Erik Darling, and my pencil-thin arms managed to drop the server from above my head. It missed my foot by maybe an inch, shattering the raised floor tile. Took me a good half hour just to calm down. I have to hand it to IBM, too – that x445 kept right on working when I plugged it back in.
Even recently, I have a client that has their “test” and prod databases on the same instance. Of course, this is not optimal but I have to deal. I was doing some development changes, which included dropping and recreating a table that held a history for ETL processing. Of course I was in the wrong database when doing so. The database is in simple recovery and last backup was day before. A couple big steel companies had to wait for processing until I got this corrected. That was “fun”
Many years ago, when I wore quite a few more hats (I was the sysadmin, DBA, network-admin, and a couple other -admins) I was in the process of adding a DC to the network (Windows Server 2003, maybe 2008.) Well, I hopped on the console, thought I had the KVM on the right server, and started the DCPROMO.
Shortly after is when the accountant asked me to look into why the accounting software (SQL 2005 or 2008 back end) had quit working.
Yeah, I’d promoted the SQL Server. Turns out SQL doesn’t like being on a Domain Controller. Took most of a day to straighten that mess out (unpromoting the server, fixing up the SQL service accounts privileges, etc)
Brilliant story. I have one about changing from Kerberos to NTLMv2 and then back. Was due to go to a party at 7pm and was still in work at 7am.
I missed a 0 when I was setting max server memory once. SQL Server does not like only having 700 MB of memory. Database mirroring HATES only having 700 MB of memory.
Fortunately, that server supported our Lync installation, so nobody was able to send me any angry IMs.
Sweet, i did the same but instead i punched in a whooping 1 MB. Tell you what, SQL Server doesn´t like that either…
I brought down XBOX Live Marketplace for 17 hours. (ok complete outage was probably 4-5 hours fully back to normal 17 hours) NIGHT shift DBA in the Operations center. Replication broke from the Zune catalog to the XBOX live catalog. I called the on call DBA from T3. I asked him if he wanted to fix it or if he wanted me to. He said I could. So I set up the replication from the zune server back to the catalog on the xbox server. Small problem. the main catalogs were the exact same name in both environments but with very different data. As soon as I set up the replication it started poisoning the xbox live catalog. The xbox live catalog automatically replicates out to 20 or so replicas to face the service. A few hours later when this was discovered, we had to bring down all of the catalog servers. Restore the master xbox live catalog from backups, and resetup replication to all of the replica servers. Zune was supposed to replicate to a different DB name on the main Xbox catalog server. Jobs then process the data correctly to the xbox live catalog. I’m pretty sure the way this works has all changed in the 5 years or so since this happened. The only thing that saved my job was that 75% of the T3 support team (3 out of 4) said they would have made the EXACT same mistake. Including the T3 I called and said I could do it. The 1 other said they had been campaigning for a change in naming setup since the integration of the zune catalog into the xbox live catalog. I was sweating bullets about my job until I was told that I still had one. I never did get the promotion to T3 that I was in line for before the incident however.
Will – DAAAAANG!
Will – DAAAAANG! X2
Should have just not bothered fixing the Zune server. Or maybe this is how you de-commed it. 😀
When doing an update statement, the WHERE clause is your best friend!
Yes. Another in addition It is always good to run a SELECT (prior to an UPDATE or DELETE) with the WHERE clause to see what will be changed (or deleted if using a delete statement). Another way is to add a BEGIN TRANS … ROLLBACK TRANS .. add an OUTPUT clause to your statement to see the records that would have been updated. If this is the desired result is returned, remove the change the ROLLBACK to a COMMIT..
When doing an UPDATE statement an explicit transaction and a commented out COMMIT is an even better friend 🙂 I now do that religiously.
Currency rounding error left undetected for three years lost client some $20 000 total.
It came down to a lack of testing. Client was very understanding.
First month on the job 14 years ago. Saw a disabled SQL Agent job called Mass Tax Update, thought it looked important, and stupidly decided to enable it. Not realizing it was obsolete and had just not been removed yet. Took a couple of nights and five people to clean up all the data that it changed. My job was probably saved because I was VERY active in helping to determine the impact and the remediation steps. Well of course I was!
Brian K – HAHAHA, I had a client do EXACTLY the same thing. You aren’t alone on that one for sure. That taught me not to ever enable Agent jobs (or fix broken ones.) Plus it’s a huge problem when people fail over to DR – “Is that Agent job actually supposed to be running?”
I never made/make mistakes. (aka, too embarrassed to tell!) 🙂
In my pre-DBA days I worked as a developer. I knew a few things about SQL so while the DBA was overseas I set up a trigger on one of the largest tables in our most important database. The trigger didn’t exactly work right so I quickly confessed to my boss and called the DBA. He didn’t really like being woken up in the middle of the night, but he thankfully got things back in working order. I learned quite a bit about testing and change control processes that day/night.
in 30 years of coding did do a delete without a where clause big OOP’s, Stuff happens
My first exposure to being a DBA (in addition to PC support, systems admin, etc.) was installing our first Oracle server. Back in the days of Oracle 7.3, archivelog mode wasn’t enabled by default (equivalent to running SQL Server DBs in “Simple” Recovery).
Then something broke in Oracle and the system wouldn’t come online. I didn’t have near enough training to figure out what, so I called up Oracle support and gave them the error message. They asked if I was in archivelog mode. I figured out that I wasn’t. They closed the ticket with “Sorry, we cannot help you.”
It was also when I learned that Oracle backups aren’t a default thing. Nor were there any GUIs or Wizards or other tools for idiots like me to use. (This was before Google.)
Luckily, the environment hadn’t been moved into production yet, so I got to rebuild it from scratch, the right way, and it only delayed deployment a month or so.
It was a great justification when I went to my boss and said that I needed to go Oracle training so this wouldn’t happen again.
Oracle support still uses that line on a regular basis, though for many more products than just the database.
Ran a DBCC SHOWCONTIG on Production SQL Server 2000 database.
Fortunately it was a quick recovery, i.e. kill query and no data loss.
I have also forgotten the where clause in an update statement on production. It didn’t have the same impact but when I worked with someone else to do the same update, they missed the where clause. Fortunately the Transaction was available to be rolled back.
I was developing a server-level trigger to track logins. Naturally I tested it on dev first however I got distracted for a period during testing and when I returned to it and hit F5 I discovered I wasn’t in dev any longer. So, the server-level trigger decided it wasn’t going to let anyone into that production instance anymore and booted everyone else out and several angry phone calls ensued, including from my boss who was demoing a new SharePoint site on the instance I’d taken down to the capo de tutti capi. In those days I had never even heard of the DAC but it took me 10 minutes to find it, research it, fire up sqlcmd and kill the trigger. Then I went off to change my underpants.
I’ve got 2:
1. I forgot to set up log backups on a reporting database – they filled the drive to 2 TB before I noticed. Luckily, its mostly me that uses that server.
2. In my early performance tuning days, disabled some clustered indexes in prod during the day, freezing some doctors.
I had just taken a new job as Jr DBA when the Sr DBA put in his notice. A week later I decided that production would run faster if I dropped/recreated the PK on the main transaction table. That broke several applications, which I found out 2 days later was because I had recreated the PK with the wrong column order. I managed to fix it on day 3 with several levels of management “watching”.
So many make mistakes in Production, but no one is willing to talk about it. I think learning and talking about mistakes can inform the conversation about not making them again and improving the process. Thank you.
Great stories, thanks for sharing. I had a good one when I first started where I missed an include column in a index when moving it to a new filegroup. Not only did I get to learn what an include column was, I also got to learn what query plans were. Good times…
One thing about Log Shipping, not sure about the 2000 era, but from at least 2008 on you can reverse the flow of log shipping without having to drop the existing setup or do a full initialization. It’s all in the tail of the log backups.
told someone-yeah I can help you with that query
Al – HAHAHA, that rings way too true.
On a busy financial system (international 24×7 type stuff) I ran a rebuild index totally forgetting to ensure to rebuild it online. Between China calling in a complete panic and boss man huffing over my shoulder it was an intense multi-hr rollback process. Lesson learned…stop the index maintenance madness!!! 🙂
On production database did an update to a table using a WHERE … AND tag1 = 1 OR tag2 = 2 but forgot parenthesis should have been AND (tag1 = 1 OR tag2 = 2) so it ignored all the rest of the filters in the where clause and updated everything where tag2=2. Which I then executed COMMIT because I was in a hurry and forgot to look at the number of rows affected. I then got to restore the DB on another server and update all the values back. Thankfully table didn’t change often and wasn’t used a great deal so I was able to restore without too many users noticing.
Truncated a view not knowing the obvious results. Had to manually recreate the data that fed an SSIS package because we can only restore to last night’s FULL backup. A valuable lesson was learned from all this though – our backup strategy did not protect against brain-farts and was changed to also include differential backups.
I’m on a national team of 8 DBAs, 3 local.
I’ve been 30 years in IT & 17 years Oracle DBA.
I’m solely responsible for the company’s admin system for the last 5 years. Thousands depend on me.
The admin system has 12, sometimes 13, environments, 1 being Prod – 3tb.
Early morning Christmas Eve. I’m working from home. I have the next week off, ending in my birthday and New Years.
Nobody want’n nuthin’ today!
I’ll tackle the Break Fix upgrade, complete the upgrade project before EOY and continue my hero status as the most productive DBA in the company. (I had just received the best review in my life.)
Refreshing Break Fix consists of dropping the hold-all schema and pulling the metadata from Prod.
I connect to my server and drop the schema.
Wow! Look at that thing spin!? Usually comes right back.
What’s that feeling in my stomach?
Ping. “What’s this error?”
Ping. “The app is not responding.”
Ping. “Do you know what’s wrong?”
(You know. A “B” and a “P” look very similar in the bleary early morning haze.)
(IT team thought the best standard was to have the environment indicator in the 3rd position, location first!)
I immediately fess up, “We’re in a recovery situation. I’ve dropped Production.”
I spend the next 8 hours, in a continuous conference call, with the storage admins, sysadmins, and managers, restoring 1 file, attempting recovery, then another file restore, then another recovery, then starting all over. (It’s an Oracle control file mis-match thing)
No one on my team would help me. I asked! Believe me, I asked for help.
My manager would not let me off of the conference call.
I eventually restored the database by 5 PM but the India team had already been told not to report to work on Christmas day.
I was told to take my vacation.
When I returned after New Years I was told I was not allowed to touch the admin system.
My off-site manager and off-site director told me all was fine, they wanted help in the other locations anyway, but 2 weeks later I was called to HR and walked out.
On the bright side every interviewer was incredulous that I was fired and my local coworkers were caustic.
Now I wear glasses and worship at the Temple of Ozar.
Now 33.4 years in IT and 3.4 years SQL Server DBA (and a legacy Oracle system)
The SQL Server community is so much nicer.
No one in the SQL Community has ever screamed at me to go RTFM.
I had no idea DBA work could be this friendly.
I’d have to say 15 years ago I stopped a service pack upgrade (2000 SP3) by ending it via task manager – bad idea. It was a very, very long recovery process.
Many years ago, our HR process for firing people required a manual update of the active bit field on the employee table. Once this bit was flipped there were several processes that ran to propagate permissions across all of the systems of the company. One day I was in a hurry and managed to “fire” EVERYONE in the company by missing the where clause. Everyone, including the admins, lost access to everything and we had to use sa and the windows equivalent user logins to gain access to our systems to fix things.
I have 2, neither are mine, but I was there at the time.
My manager was making changes to a central database used by the application to discover other servers.
Then that answer their phone, lock workstation, and wander off.
Not so bad as most of the data is cached in the application… apart from after the first custom complaint, the Ops team assume its a caching issue and flushed the cache, sending 1000’s of calls into the locked table, downing the entire system.
2nd one is what I call a close call…
A helpdesk user submitted the following script to be ran against production, it was scheduled to be run but luckily was checked before it ran:
update Address set Line1=’123 Road’
update Address set Line2=’Village’
update Address set Town=’My Town’
update Address set Postcode=’AB1 2CD’
where AddressId = 1;
We now have a more formal process for reviews 🙂
I hope my employer doesn’t read these…
1) I truncated a dimension table critical to our data warehouse. The database was under active development, but was still production. Thankfully, we’re no where near a 24×7 shop. I spent the weekend reloading all of the data from scratch.
2) In testing a change to an ETL procedure, I put in a WHERE clause filtering down to just one case. I left it in when I deployed to production (no source control… no QA…). During the next load, every *other* case was deleted from the target table.
3) In SQL Server 2012, we were super excited about using columnstore indexes. We slapped them on every fact table. Those suckers weren’t updatable, so we had to drop them, load data, then recreate them. We didn’t have a load window, so we caused INSANE amounts of blocking.
Please note I documented the process but it’s hugely manual because the data has to migrate from the “removed” file to the other files and I had to monitor the space and migration manually. I even tested it on a test system.
I have a multi-TB database with a hard limit of 255 database data files. To increase that, it requires a rebuild and import of the DB, restore to a new layout won’t work. (No, not Oracle, it wouldn’t have happened in that case.) We were up to 240 datafiles, many of them created as 2GB. How this works is that you “remove” the files via a database command which tells the DB to move the data to other files. After that, you remove the file at OS level and tell the DB to recreate as 16GB (or even 32GB).
I’d gotten up to #50 or so and (wrongly) deleted the datafile first on the DB. It was a 9 hour restore and due to the way the lovely thing works, another 1.5 hours starting up with no evidence it’s doing much of anything. That last part was the worst part of the experience. There’s precious little documentation on it, which is surprising since it’s been around since the 80s. <- Not a typo.
Needless to say, the folks who usually get beaten up at the monthly issues meetings were out to throw someone else to the wolves. They went on about what was done to discipline the person who did this and my director responded that I was beaten with whips.
These were the same people who broke HUGE (global) ERP systems with blind changes a few months earlier which would have meant about 24-36 hours downtime EACH. (One system they broke twice). They managed to import vendor source code from a vastly different release which broke many things, in one case the system wouldn't even start. In another, the background subsystem was completely broken which meant that changes would not propagate since it relied on background jobs for long running import processes. I'd gotten them up and running with some magic tricks in less than 4 hours each. The vendor told them to restore.
My favorite one dates back to about 2003-4. We were working on HP-UX Servers running Oracle. The SAN was an IBM Shark. The interface for allocating disk volumes to servers wasn’t very solid and would easily let you allocate the same disk to two different servers (you can see where this is going already, right) ?
I accidentally managed to assign the same volume to two servers. One database would go down. I’d restore it. The other would go down. Stupid me didn’t even twig to what I had done. Not something you think clearly about at 3AM. Fortunately, they were two low volume databases.
I was also very new as a DBA back in those days. What I found out after this was that the backup cycle for one of the databases wasn’t so crash hot … 4 days of data was lost once we figured out what was going on.
It’s mistakes like this one that can really make you look at your DR plan – 4 days on the main ERP system would have been much worse than this lightly used document management system (but the couple of users that lost data weren’t very happy with me, regardless).
About 5 years agoo I worked in an maintenance window to upgrade several stored procedures from the developer team into an Oracle Server 11g schema.
Normal work, normal way of doing thinks. After finishing all upgrades, I checked for invalide or not compiled procedures, functions, views, … .
Noticed that there were two invalid procedures that hade to be compiled, like on every other oracle server from the company and that are used in an Oracle Job on daily basis.
I check always, after such a midnight task if there are invalid objects. Also I compiled this two procedures, tested the rest, everithing was ok, and go to bed.
At 1:00 a clock in the morning the phone was rining and the colege responsible with the project told me that the oracle instance on that i worked ist not responding any more.
Checked what happend and find out that the main disk was full. Deleted some files, started manualy backup generation from the archivelogs, and broght the instance to work.
Checked that the disk was delating all backed up archivelogs and incresed the free size on the disk.
What happend: on DML changes, this changes, there are allways backed up on the online redologs thet are copied into archivelogs (if the instance is in archivelog modus).
Becouse this two prcedures deleted and chenged avery day a small amount of data/records the normal amount of archivelogs was small and did not groe to much until the full backup was done.
But this 2 procedures where stoped working and changed to invalid for longer than 6 months. So when this oracle job started it had a lot of changes to made and a lot of archivelogs to generate.
More archivelogs than there were free space available on thta mountpoint/disk. After more then one our work, the instance was working normal again.
PS.: Iforgot to mention, it was not my task to compile tihs two stored procedures.
This two stored procedures where not relevant for the upgrades I made with the developers scripts.
I just find it a goood idea, at that time, to help other DBA colegs when I find that somethin is not looking right.
In my first week on the new job I was trying to get acquainted with the database. Of the 10.000+ procedures only a few 100 were needed for my work. So I decided to create a test environment and remove the others (don’t ask me why). I created a script…. and ran it on production.
Since then I use a different color for the status bar for Production, as a friendly reminder.
When working as an application developer for an internet bank I was working on a Credit Card closure process. As part of the process, a check was carried out to see if there was any balance on the card that needed to be paid off. If so, a message was sent to the customer informing them so. So far, so dull right.
Well, in an attempt to be humorous, whilst testing this, the message that I wrote to the customer was something along the lines of
You’re trying to close your card and you still owe us money. Pay up, or Fat Tony will be round with a baseball bat”
You see where this is going? Yep, that message made it into production. During the first run, 300 customers were sent that message.
My heart sank, I started sweating and wanted to run and hide.
I went to tell my boss, thankfully these were “Secure Messages” to an online portal, rather than emails. We were able to delete the messages within about 60 minutes – to this day, I don’t know whether any of them were read.
Lesson learned: Don’t write humorous messages that to customers – even in Development.
Not a DBA mistake, but I once wrote a simple script to delete all files in a temp folder for an application on all our client machines (400ish). Only about 200 machines had the application installed. I rolled it out to all of our machines using GPO.
I did not bother to write a check to see if the application was installed, or any other checks for that matter.
Guess what happened on the machines where the temp folder didn’t excist…
Luckily it was set to run on startup, so we managed to stop it before too many machines got wrecked 😛
I uninstalled SQL Server from a running system with replication. We had just upgraded the server and they thought it was smart to alias the old server name to the new server. After the re-install and restore of master replication did the only magic I have ever seen from it and just started running again.
Back in 2004 I was a SQL developer told by the DBA to move a database from stage to prod and to use the wizard. You sure, I’ve not used that before. “Sure, it will be fine”. But it’s during production hours? “It will be fine”. Turns out one of the defaults was to move associated objects, and security is one of those objects. If you didn’t have a login on stage, you suddenly no longer had a login on prod…
Okay…here’s my top three off the top of my head at various stages of my career:
* As a jr. level developer — while using Query Analyzer against SQL Server 6.5 (yeah, it was a while ago!) I was trying out a new way of deleting all records from a table quickly called “TRUNCATE TABLE”. Yeah…it was super quick compared to DELETE! Anyway, I was changing a bunch of scripts that purged records from various import tables we loaded very large text files into using bcp. I even went so far as to have the I/O department restore a copy of production to another database instance on the same server so I could test my script changes.
So, I did all my testing in one window against a new database named [Customers_Test] while in another window I had a connection to the production [Customers] database. You can probably see where this is going 🙂 Somehow, I switched windows and ran the TRUNCATE TABLE commands against the production database. Within a few seconds, I realized what I did and turned white as a ghost as I knew there was no way to get the data back on my own. I meekly went to my supervisor to tell her what I did, and she just laughed and said, “Okay…just pay better attention next time and fill out this database restore slip to give to the I/O department”.
* As a mid level developer — I had an ETL process that ran a DTS package and then ran a stored procedure to crunch through a bunch of calculations. The call chain of stored procedures went VERY deep along with a bunch of looping involved one record at a time. That said, it was tough sometimes to debug if there was a logic issue or error (at least with the tools and knowledge I had at the time). Anyway, we needed to add additional logic to a WHERE clause in one of the scripts whenever certain values were encountered in a temp table that was created as part of this chain of procedures. So, the ‘quick and dirty’ way was to do something like this:
WHERE MyField = (SELECT DISTINCT MyOtherField FROM #tempTable)
Notice the WHERE clause — it has an EQUALS sign and NOT the IN keyword! Previously, the WHERE clause had a single static, hard-coded value, so I just changed the hard-coded value to have the nested SELECT. Now, given the data present at the time, this worked just fine, it compiled and ran for a while. Then one day, we started getting reports of missing data and the job not running right. What happened is that the source data changed such that the nested SELECT DISTINCT now returned more than one record — which now caused that statement to fail. If it had used the IN clause as I had intended, all would have been well, but not with the equals sign! I don’t even like doing IN clauses in that way now due to potential performance implications, but this one had me stumped for a day or two…
* As a sr. level developer — this happened just this year, I hate to say! I was making changes to a customer’s ETL process that pulls in eCommerce order data from a cloud based MySQL db into their local SQL Server db. Due to budgetary constraints, the customer could not stand up separate test servers on either the MySQL or SQL Server side. So…we’re back to creating multiple databases with “_TEST” in them!
So, we had two Linked Server objects (one normal production one and the other has “_TEST” in the name) which go through two separate MySQL DSNs setup on the SQL box — each of them going to their respective MySQL db environment. Most of the time, we were using the 4-part name of LinkedServerName.DbName.DbOwner.TableName to get at the data, but sometimes, we used the OPENQUERY() command to execute a remote stored procedure in the MySQL database. I know — a weird chain of connectivity here!
Anyway, we test everything and move all the scripts to production. A couple of days later, the customer called asking why 200+ new orders placed online weren’t showing up in their order fulfillment system. All appeared well at first, but then I noticed the shipping address validation job from UPS WorldShip didn’t appear to fire for any of the newly placed orders post-go live. And guess what remote procedure is called from the OPENQUERY() call? And guess where that production script was still pointing to? [OrdersDb_Test]. OOPS! I fixed it production instantly and re-ran the SQL Agent job, and then voila…all the orders began showing up! Go figure!
I haven’t done it yet, but I’m thinking about creating a DDL Trigger that’ll fire to see if certain phrases are found in the stored procedure text (like “_Test”) and then prevent the statement from committing if the string is found. However, I haven’t dug into actually trying that yet 🙂
Never mind deleting a table or a database. I once, a long time ago now. Managed to remove a complete SQL Instance from production, I was supposed to be removing from pre production but connected to the wrong VM. Luckily I had transaction log backups and managed to re install and restore before anyone noticed. I had always suspecteded that that CRM system was under used. Lol.
On a linux server, I vimed a file that was bigger than the available RAM, minimised the putty and went to do some other stuff (even away from keyboard)… until I got complaints about the postgres server not responding. I logged into it (again) and started looking at the issue… No errors, no big queries running, no locks, nothing… Then I realised there was a process that was eating all the RAM available, my vim session!! I killed it and all went back to normal. All in all the service was down for a couple of hours, and I got a yelling out of it.
Up until now it seems like a “normal” fuck up, but while going back home I started thinking about it and that was when i got really fucking scared. The client was a health care provider and having that database unavailable for a couple of hours meant that, among other stuff, the medical history of patients was unavailable. Someone could have entered the emergency room and could have died because thank to my stupid mistake doctors had no access to crucial information.
All in all nothing serious happened, but I still feel really bad about it.
I work in healthcare field too. Our critical systems all run on MS SQL in an Availability group. Outages in our field can be a big deal. Luckily in my case we are a radiology company. The radiologist can still read xrays. Do their reports manually. Depending what might be done, they might not be able to see prior xrays during an outage. Knock on wood I don’t end up with a major outage for saying this but so far I’ve avoided the big outage. There was 1 time however where there was a slight corruption somewhere that caused the AG to start flapping back and forth continuously and we had to engage MS Support for help.
As a noob DBA/dev, I once placed tempdb (or maybe it was master) in single user mode to perform ‘maintenance’ – oh the panic…
HAHAHA, that’s awesome.
I work in a car manufacturing plant.
One job required a change to one of the lineside systems. So I exported the row to be changed out of the system. Ran the change and then contacted the users to see if the change had worked, They said that it had not worked. So, I asked our DB2 guy what he wanted me to do and he said to import the exported row back into the table and this would set it back to how it had been.
Funny thing about DB2 is that if you miss a bit off of the command then it replaces the table with the data in the import file. You guessed it……my one row became the only record in the table.
This stopped the 3 production lines in the factory for about 20 minutes while we figured out what had happened and then restored the table back to how it was.
Well, live and learn they say !!
Not as a DBA, but as a SQL Developer…
* Stored Procedure for HR Salary Systems
* So much code lines + big f****ng cursors + calls to others procedures (about 15.000 lines, between the main procedure and the others called inside the process)
* Bonus calculations was recently modified by myself, to make it faster (i finded some data who was always queried directly on the tables, added some inner joins and new variables on the cursors to avoid the extra queries)
* Forgot to set null to a new variable that was passed to all the other procedures and cursors, at the end of the 1st loop…
The management staff called me around 11:00 p.m. to tell me that approximately 1000 employees filed a massive waiver and were going to file a lawsuit against the company for non-payment of their bonuses
Not as a DBA but as general system admin:
I was reviewing the VMs in our VSphere cluster, dropping a few (after checking thoroughly), renaming one or two. I see that our giant fileserver VM (13 TB, tens of millions of small files) is not in a folder but is sitting in the root. I helpfully move it into what seems the appropriate folder.
Five days later the fileserver freezes up solid and all the production systems go down because they can’t get their files.
It turns out that the folder I moved it into was being backed up by a VEEAM backup job. So now our 13 TB file server got backed up, which took about four and a half days. VEEAM does VM backups this way, using VMWare functions:
– create a snapshot
– back up the snapshot
– remove the snapshot
Guess what … “remove the VM snapshot” really means “integrate all changes since the snapshot back into the main VMDK file.” And on a 13 TB VMDK file, integrating 190GB of data changes takes a long time. All of which happens inside the context of the VM. Seven and a half hours, as it turned out. Of which the last five were spent with the progressbar insisting it was 99% done.
It could have been much worse if we hadn’t had so many vCPUs on that VM. There are horror stories on Google of such things taking days for much smaller VMs.
Lesson learned: we need a DR plan that works
I find it a great idea to talk about disaters in production enviroments.
About 14 JHears a go, I worked as SQL Server Developer on an SQL Server 2000 and did the most administrative tasks as well.
There were also a Windos Adminsitrator or network Administrator in that team, that were formaly in task with the admistration of this SQL Server, but generaly avoided to do anithing that had to do with SQL Server.
For performance issuse I decided to make some indexed views with stuff(tables columns) that were generaly needed in the most productive Querys. Createdthe indexed views by the book, tested with some inserts, updats and deletes on a small Developer edition, eveithing was going faster, and createdthis indexed views in to production.
Querys in the production where a little bit faster as before and everithing looked ok to me.
Second day i was very sik and had to stay home, wat took more then a week.
In the second day, after my changes into the production server, i was called in panic that the production is no longer working and inserts ar no longer possible on the most tables. So i helped over the phone somebody ho was in fact a Oracle DBA, how to restore the database in an SQL Server prior the creation of that indexed views.
One month later I found out that the formal administrator had the briliant idea to try to configure a replication between two SQL Servers and made some chenges on the database Parameters without telling somebody about it.
This was in the beginning of my career and I landed a side job for a mid size company they hired me to fix there index issues. I ran the Database Tunning advisor and implemented all the index recommendation. ouch!!!
So there’s this login trigger that logs to a table. I added the database containing this table to an Availability Group. The instance itself was a failover cluster instance — and guess what happened when we did a “quick” failover for maintenance.
Venue: Major trading platform.
Me: Inexperienced DBA.
Took offline a database as part of a housekeeping exercise. I had been told the database was “Definitely not part of Production. Definitely not in-use.” It *was* part of Production. And it *was* in-use. And trading stopped dead.
My putting that database back online was so fast that I was in danger of tearing the space-time continuum.
Takeaway: Check for oneself, and check again, and get your checking checked if you have any doubt.
When doing an intake at a new customer I accidentally enabled traceflag 4136 instead of detecting if this traceflag was enabled. Their AX database was heavily tuned and didn’t like this traceflag: all processes slowed down to a terrible response. My mistake wasn’t detected directly, because it was also at the same time an update was implemented, we blame the update. Luckily, this mistake was easy to fix.
Back in 2009 when the IT department was working on a high-visibility project, and the dev and qa teams worked very hard for several days to generate test data in a sandbox database. At around 4:30pm they asked me to copy that data to the QA/UAT instance where end users would finally perform testing the next day, so I launched the import/export wizard and configured it to delete all rows in the destination tables… with UAT as the source and sandbox as the destination. On top of that, the UAT tables were empty, so I ended up with two sets of empty tables.
Of course, I had never bothered to take regular backups of the sandbox databases… only production databases are worth of maintenance plans, right? :/
Once in a server room with a system admin, we were trying to boot a server from a USB stick. The system admin was pressing CTR+ALT+DEL like crazy, changing BIOS settings and trying out all the different ports and options, but nothing worked. After quite few reboots I looked at the server’s spec sheet and oops, wrong OS…
Suffice to say I’ve learned that the importance of pre-production like for like environments and of engineering out the possibility of failure.
SQL Server is great for providing control flow SQL, IF [NOT] EXISTS and metadata functions to detect which server you are actually on. I learned the hard way why those commands are great. Which is why I miss them very much when dealing with AWS RedShift and HP Vertica.
So I installed SQL2K on our new test server, mirroring the settings from the live (production server), using Remote Access switching between machines as necessary. I made a mistake, not including allowing logons using your windoes credentials. The only way to change that was to reinstall SQL2K.
So I uninstalled SQL2K. On the production server.
Aparantly my face went white as I realised what I’d done. Reinstaleld SQL on production and then attached all the databases in an hour.
Fortunatly it happened at lunchtime, and that impacted people almost not at all.
I was once making a manual change to one of the core tables in the database. It was routine work based on a ticket. I had updated the integration system, all was well, and now I was making the change on prod.
I ran the script and was checking the results in the Messages pane when one of our developers came over. I had wanted to talk to him since the previous day. Being a chatty fellow, our conversation went on for about half-an-hour until I guessed that the state of emergency being declared in the office might have something to do with me. People from the government department who own the product we manage wanted to know why the website wouldn’t work any more. The sysadmins were sure that their hardware was not the problem.
I committed the transaction.
I then went over to the sysadmins and product managers and sheepishly apologised. This lesson remains with me still how one open transaction can bring an entire system unintentionally to its knees, if the right table is chosen.
First day on the job as newly minted DBA/DEV/SysAdmin, I changed the sa password. 17 server links, all agent jobs, and 30 user applications failed — grinding the entire enterprise to a halt. Having moved on to the plethora of other tasks I did not immediately realize the connection.
Oh, and I parked in the CEO’s parking spot. Cost me $120 to retrieve my car from the towing yard.
It was my first month of a new job. I was working on a test server maintenance. Maintenance was completed successfully but by mistake I rebooted Prod server instead of test server. That server was hosting “the” main manufacturing application for our global manufacturing client. That application was backbone of the business. It took about 3 hours to recover from this mess because almost 20 other applications were dependent on that main application and the application was not smart enough to auto detect SQL shutdown and start by itself.
BTW this “main” application was one of the most widely used application from a major global product vendor which I will not name here.. 🙂
My blunder isn’t entirely database related, but let’s go back to the mid-90s, when most everything was CONNECT 14400. I was slightly younger and working at a Real Estate startup, where we downloaded information and files from their specialized MLS system. From there, we’d take the data, and then re-insert into this new thing called SQL Server 6.0. We got slightly more successful, and got more customers, which meant that we’d call more MLS systems, and things were so well, that 14.4k wasn’t going to cut it.
In New England, your phone lines are terribly old, and SNET wouldn’t guarantee anything higher, unless you get ISDN. After a slight wait, we finally got the unit and I jumped all over the networking, and wow, you could download the entire internet in just seconds. You’d have two huge 64k channels and a signaling channel. There were costs for how much you downloaded, how long you were connected, how many connections you made, and various pricings all over the place. Boss man wanted a channel to disconnect after 3 minutes of inactivity. Thinking I was even better, I figured we’d set it to disconnect every 3 seconds, and save even more money, because I’m good with technology, and he isn’t.
FedEx came by at the end of the month with a few reams of paper, with the 400 some-odd pages making up the call/connection history. Boss man walked into our office, ducked under the desk, and tore the Netopia Router right out of the wall, along with the ISDN modem, which is a perfectly normal response to a $1800 phone bill. Instead of configuring it to disconnect after 3 seconds of inactivity on a channel, I had configured it to reconnect after 3 seconds of inactivity on a channel. ¯\_(?)_/¯
Whoa…should have used 10-10-321 instead! 🙂
Your bit/logic mix-up reminds me of a story I read a while back about a financial market trading company that lost $440 million in only 45 minutes due to a logic issue with an old variable. Lesson learned: NEVER, EVER, EVER re-purpose variables/flags in code!
Sorry — forgot the link to the story: https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/
happend around 13 years ago, i was very green and just had a few years exp under my belt. We we using Sybase Adaptive Server Enterprise, we got a call from management to look at some issues in PROD. WARCRAFT 3 DOTA was a craze during that time, senior dev , who had access to the prod server, was busy playing it, so he pauses game, walks towards me, gives me access to PROD, told me stuff i should do, then he goes back playing. So, i did stuff in prod, including submitting a query that create a cartesian product of 2 very big tables when i ran a join without enough filters… WIthin minutes calls from around the country flooded us, i just brought down the courier company’s system because the DB ran out of resource to process my query. I got off easily, senior dev didnt.
i updated the distribution cleanup job in our SQL replication to delete from top(2000) to top(10000) based on an online article i read, then totally forgot about it. The job runs every 10 mins, we went live to prod and we have been experiencing these weird episodes where the server would screech to a very sluggish phase that we can tpinpoint the reason why, SPWHO2 no blocks, DM, etc, etc. Eventually, i did this in all our environments and it took one inexperienced developer to accidentally see the dsitribution job causing a TON of IO_LATCH because the job was deleting so much data from the disk that the DISK and OS itself wasnt able to cope with the request of SQL SERVER ENTERPRISE. I then secretly reverted the change i made in all the environments, then scheduled it to run only during weekends. VOILA! queries are now fast almost all the time.
I volunteered to do something that sounded cool & exciting: go to Prague airport and do a quick fix on a broken Oracle server. Hey, not a problem! Quick trip to a pretty city, my work would be over in a couple of hours, and maybe get to see a few sights. What’s not to love?
3 days in a cold basement, no phone signal, a rather unpleasant armed guard outside the locked door, and a realisation that toothpaste, soap & deodorant weren’t ever going to be top of the airport techie staff members’ shopping lists.
I won’t dwell on the free spirited approach to copyright and DVD-burning used there.
I work in Prague. You got off easy.
Created an restore script via ssms on production Server , database had a lot of ndf files, database was 2.1 TB… script was created .. and then go to lunsch. I went back and pressed F5 ? database offline ?