You’re a data professional working with (or at least applying to work with) a company using the StackOverflow database (I’m using the March 2016 version today). Your users are complaining that this stored procedure is slow:
They didn’t give you parameter 26837 – I’m just giving you that so you can see an execution plan.
You don’t have to talk me through the query itself, or what you’d want to do to fix it. In fact, I want you to avoid that altogether.
Instead, tell me what things you need to know before you start tuning, and explain how you’re going to get them.
I’ll follow up in a few days with my thoughts.
Let’s just get one thing out of the way first.
Yes, I understand that you, dear reader, probably hate maintenance plans. You associate them with everything dumb that you did early on in your career, like setting up a single plan that reorganized indexes, then rebuilt them, then updated their statistics. You are older and wiser now, and you swear by tools like Ola Hallengren’s maintenance scripts or Minion Reindex.
This blog post, however, is not about you.
It’s about all of the SQL Servers out there that have not yet had the wonderful opportunity to feel the tender loving hand of a qualified database administrator such as yourself. It’s about the tools that accidental DBAs will use over the years to come.
So let’s start with the index rebuild task:
Be still, my beating heart.
You can tell it’s kinda slapped together haphazardly – note the awkward spacing of the “Used in last” line at the bottom – but God bless ’em, Microsoft’s heart is in the right place. We have new options to only rebuild indexes if they’re a certain percent fragmented, or a certain size, or they’ve been used recently.
The goodness continues on the reorg screen:
Same nice options about only optimizing indexes that are in use, or are in bad shape.
The CHECKDB screen shows off its new MAXDOP capabilities, now that DBCC CHECKDB can take a hint:
Part of me is happy because undereducated database caretakers now have new, more powerful tools at their disposal.
The other part of me is a little bit sad because it’s still not easy to use. If maintenance plans are designed for the accidental and junior DBAs amongst us, I don’t think a lot of this stuff should even be an option. It should just default to the right thing, and take care of the database with Microsoft’s best practices set up as standard.
But that is a really, really small part of me. Maintenance plans are getting better, and that means something good.
The instant you encounter corruption in a production SQL Server database, here’s what to do:
1. Query the msdb.dbo.suspect_pages table. This system table tracks up to the last 1,000 corrupt pages detected. This will give you a quick idea of how widespread the corruption might be. If it spans multiple databases, you need to consider getting off this hardware rather than repairing the corruption. The storage itself may be bad, and your corruption repair efforts may be fighting a losing battle. If you see corruption in multiple databases, send an email to your manager, team, and app stakeholders:
SQL Server (INSERT NAME HERE) just reported corruption across multiple databases. This indicates that it has storage problems. We need to fail over to another SQL Server immediately – come to my cube (OR ONLINE MEETING OR WHATEVER) to discuss our options.
At that point, look at your disaster recovery plan to figure out where you’re going to fail over (like a log shipping secondary.) You’ll need to check that location for corruption as well, but if we’re dealing with widespread, multi-database corruption, it’s beyond the scope of what I can teach you to do in a blog post. You can keep reading for single-database corruption recovery options, and use these same tactics across multiple databases, but it’s going to be a ton of work.
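The suspect_pages check in step 1 is a quick query – here's a minimal sketch (the event_type decode in the comment is abbreviated; check Books Online for the full list):

```sql
-- How widespread is the corruption? Check SQL Server's log of bad pages.
-- suspect_pages only holds up to 1,000 rows, so treat it as a hint, not a full inventory.
SELECT DB_NAME(database_id) AS database_name,
       file_id,
       page_id,
       event_type,        -- 1-3 = I/O, checksum, and torn-page errors; 4 = restored; 5 = repaired; 7 = deallocated
       error_count,
       last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;
```

If multiple database_id values come back, you're in the multi-database scenario above.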
2. If it’s just one database, send an email to your manager and your team. Copy/paste this in:
SQL Server (INSERT NAME HERE) just reported corruption in one of our databases. This is really bad – we may be facing data loss, and in order to find out, I’m going to have to drop everything and do emergency troubleshooting. I’ll let you know more in half an hour.
In this step – and the next couple – I’ve got you sending emails, and people are likely to come running in asking for more details. Be polite but firm – you need to do 30 minutes of homework before you can give answers.
3. If you’re using a SAN, alert the storage team. If your data and log files reside on shared storage, there may be a more widespread issue. Multiple databases or servers may be affected if the shared storage is having problems. Copy/paste this into an email:
SQL Server (INSERT NAME HERE) just reported corruption in a database that lives on the SAN. I don’t know yet whether this is due to a storage issue or a SQL Server bug. I’m dropping everything to do emergency troubleshooting, and I’ll let you know more in half an hour, but if you hear anything from other server admins about SAN issues, please let me know.
4. Alert the application owners and stakeholders. Copy/paste this into an email:
SQL Server (INSERT NAME HERE) just reported database corruption. I’m dropping everything to do emergency troubleshooting to find out if we’ve lost data, and how we’ll recover it. I’ll let you know more in half an hour, but in the meantime, I would recommend taking the applications offline. If more data is added after this point, I may not be able to recover it.
This sounds paranoid, but as an example, here’s a corruption case I had recently: shortly after detecting corruption, the team realized they’d have to revert to a backup of the database from a few days ago. Rather than telling users about that possibility, they let the users keep adding data into the already-corrupt database while the DBAs did troubleshooting. Several days later, as the corruption got worse, even Microsoft couldn’t repair the corruption – and the affected tables went completely offline, permanently. If the users had been alerted earlier, they could have avoided even more data loss.
5. Turn off your backup-delete jobs: you’re gonna need ’em. If you have any kind of job that automatically deletes backups older than a certain number of days, turn off that job. Depending on the type of corruption, you may have to restore a clean full backup from before the corruption occurred, plus all of the transaction log backups since.
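If the cleanup runs as a SQL Server Agent job, disabling it is one line – the job name here is a made-up example, so substitute your own:

```sql
-- Stop automatic backup deletion until the corruption is resolved.
EXEC msdb.dbo.sp_update_job
    @job_name = N'Maintenance - Delete Old Backups',  -- hypothetical job name
    @enabled = 0;
```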
6. Review the history of your CHECKDB and backup jobs. Fill out these answers:
- When was the most recent clean CHECKDB? (Meaning, when did CHECKDB run against this database without errors?) ________
- Do we have a full backup available from prior to that date? (For example, if CHECKDB said this database was okay on January 15th at 10PM, what’s the latest backup we have BEFORE that date, like January 14th at 11PM.) ________
- Do we have all of the transaction log backups since that date, yes/no? ________
If you can pass all three of these, great news! You’re going to have an alternate recovery path. If any of these backups are not online (like if they’re on a slow virtual tape library, or have been sent offsite), get one of your Windows admins to start retrieving these backups and putting them on a very fast network share as quickly as possible. You may need to restore all of these backups in a subsequent step.
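If you’re not sure when your last backups ran, msdb’s backup history can fill in the blanks. A rough sketch, assuming the history hasn’t been purged:

```sql
-- Most recent backups for the affected database, newest first.
SELECT TOP 20
       bs.database_name,
       bs.type,                    -- D = full, I = differential, L = log
       bs.backup_finish_date,
       bmf.physical_device_name
FROM msdb.dbo.backupset bs
JOIN msdb.dbo.backupmediafamily bmf
  ON bs.media_set_id = bmf.media_set_id
WHERE bs.database_name = N'MyDB'   -- placeholder database name
ORDER BY bs.backup_finish_date DESC;
```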
7. If you have the full CHECKDB output, look for an easy fix. DO NOT RUN CHECKDB in this step – that can be a long process, and before you go down that road, I need you to do other stuff first. I’m just saying that if you do happen to have the CHECKDB output that shows the corruption, look at the index IDs involved. If the only corruption involves an index with an ID of 2 or higher, that means it’s only corruption in a nonclustered index, and you’ll be able to repair that relatively quickly. Here’s how:
- Run sp_BlitzIndex® to list out the index definitions on the affected table:
EXEC sp_BlitzIndex @DatabaseName = 'MyDB', @SchemaName = 'dbo', @TableName = 'MyTable'
- Find the corrupt index by its ID
- Copy out the CREATE INDEX statement from sp_BlitzIndex’s output
- Drop the index
- Recreate it
- To verify that no other corruption exists in the database, run:
DBCC CHECKDB('MyDB') WITH NO_INFOMSGS, ALL_ERRORMSGS
- If it comes back clean, you’re good – email the teams that things are OK, then jump to the Aftermath section at the end of this post
- If it doesn’t come back clean, keep going
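The drop-and-recreate in step 7 boils down to something like this – the index, table, and column names are placeholders, and the real CREATE INDEX definition should come straight from sp_BlitzIndex’s output:

```sql
-- Drop the corrupt nonclustered index...
DROP INDEX IX_MyIndex ON dbo.MyTable;
GO
-- ...then rebuild it from the definition you copied out earlier.
CREATE NONCLUSTERED INDEX IX_MyIndex ON dbo.MyTable (SomeColumn) INCLUDE (AnotherColumn);
GO
-- Verify nothing else is corrupt.
DBCC CHECKDB('MyDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```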
8. Fork your recovery efforts into multiple teams. Grab other members of your team and assign them tasks that will be done in parallel:
- Team 1: Restore clean backups to another server – this team grabs a different SQL Server of the same version as production. For example, if production is a SQL 2012 box, grab a development server that’s also SQL 2012. Restore the full and log backups from the dates identified in step 6. After it finishes, run DBCC CHECKDB to make sure there’s no corruption in this copy. This may take a long time, but if it works, you’ve got an excellent plan B.
- Team 2: Open a support call with Microsoft. If you don’t have a support agreement with Microsoft, grab a credit card and call professional support. It’s about $500, and they work the problem with you until it’s done. Don’t be afraid to say, “We’re not making progress fast enough – I’d like to escalate this call to the next support level.” (But at the same time, you’d better be doing what they tell you to do.)
- Team 3: Run CHECKDB if you don’t have the output. If the corruption was detected by an end user’s query or a monitoring system alert, run this, save the output to a text file, and then analyze the output with Gail Shaw’s corruption recovery instructions. Depending on the database’s size, this can take hours:
DBCC CHECKDB('MyDB') WITH NO_INFOMSGS, ALL_ERRORMSGS
- Team 4: If you don’t have a clean backup, restore the most recent backups to yet another server. If you’re going to try your own attempts at corruption repair, you’re going to want another copy of the corrupt database on a different server where you can test first before trying in production. As soon as that restore finishes, if the other teams aren’t ready to try their corruption repair commands yet, create a new database on this same server, and start copying as many tables as you can from the corrupt database into the new one. You want to be able to salvage as many tables as you can.
- Team 5: Contact a data recovery company. If things are looking bad, and management wants another option, contact Kroll Ontrack for a quote for database recovery. The pricing isn’t public, and it will likely vary based on the complexity of the situation.
9. Send out a status update as work starts. As soon as the above team task assignments are done – not when the work stops, but when it starts – it’s time to communicate your status to your manager, team members, and the application stakeholders.
Your update email will be based on what you’ve learned so far, but here’s an example:
Update on the corruption issue: we have identified that the corruption is confined to the SalesSite database, and it occurred at some point after 2016/05/04. We have split up our recovery efforts into three teams. Based on a quick assessment, we believe Team 1 will be the first to finish, and we hope to have a clean copy of the database up on DevSQL2012. If we have to fail over to that, we’ll have a one-hour outage. We’ll update you in another half-hour as we know more.
10. Keep working the plan. Don’t panic – let each of the teams do their thing independently. Keep doing check-ins across teams every 30 minutes, see which team is the closest to a solution the business is comfortable with, and keep the end users informed.
The Aftermath: Avoiding Corruption Next Time
As soon as practical, get on the most current service pack and cumulative update for your SQL Server version. You may have hit a bug like the notorious online index rebuild bug or the NOLOCK UPDATE bug, and patching will reduce the chances that you hit that issue again.
If you can’t rule out a SQL Server bug as the cause of the corruption, work with your storage teams to make sure your storage gear is patched up to its most recent firmware and driver versions. Check to see if there were any storage errors over the last several weeks – not just in the Windows event logs, but in the storage gear’s logs as well.
Once you’ve hit corruption on a server, consider taking a few proactive measures:
- Run CHECKDB more often, like every night (which sounds crazy – until you hit a corruption scenario)
- Keep your backups for a longer length of time – if you’re only running CHECKDB every week, you need at least a week’s worth of full and transaction log backups so that you can recover via restores
- Run transaction log backups more frequently, like every minute
- Put your RPO and RTO in writing – the business always assumes databases can’t lose data, and you assume that the business understands technology breaks sometimes. Get everybody on the same page.
- Based on your RPO and RTO, revisit how many databases you have on each server. If your hardware isn’t able to keep up with backups and corruption checks, it’s time to up your horsepower or switch to snapshot backups.
Erik says: Brent is spot on with how to react to corruption scenarios. I’m all about not being past the point of no return in the first place. Whether it’s:
- Making sure I have email alerts set up for 823, 824, and 825 errors that catch hard and soft I/O issues
- Running DBCC CHECKDB often enough
- Having Page Verification turned on
- Taking my backups with CHECKSUM (or using a Trace Flag if I can’t do it via my backup software),
- Making sure my backup routines at minimum do a restore with VERIFYONLY to assess basic usability, or even better, a restore of my FULL backups to another server
- Finally, using technology like Mirroring or Availability Groups, which have a feature called Automatic Page Repair that can fix some page corruption scenarios
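Those Agent alerts for 823/824/825 take just a few lines apiece – a sketch, where the operator name is a placeholder you’d create first with sp_add_operator:

```sql
-- Alert on error 823 (hard I/O error) and email the on-call operator.
EXEC msdb.dbo.sp_add_alert
    @name = N'Error 823 - Hard I/O Error',
    @message_id = 823,
    @severity = 0,                        -- must be 0 when alerting on a message_id
    @include_event_description_in = 1;    -- include the error text in the email

EXEC msdb.dbo.sp_add_notification
    @alert_name = N'Error 823 - Hard I/O Error',
    @operator_name = N'The DBA Team',     -- hypothetical operator name
    @notification_method = 1;             -- 1 = email

-- Repeat for 824 (soft I/O error) and 825 (read-retry warning).
```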
sp_Blitz® can help you find all sorts of dangers and inadequacies with your server setup. Run it on all your servers. You may be surprised what you find.
Tara says: I’ve encountered corruption 3 times in my career. The last event was the most memorable, and I don’t mean that as a good thing. If we had set up alerts for the 823, 824 and 825 errors as Erik mentioned, we would have been notified of the corruption sooner and had less data loss. We implemented the alerts through SCOM after the event, but we should have also created SQL Server Agent alerts for those errors as SCOM wasn’t always reliable (your mileage may vary). Reacting to corruption quickly is very important.
After writing about “For Technical Interviews, Don’t Ask Questions, Show Screenshots”, lots of folks asked what kinds of screenshots I’d show. Here’s this week’s example.
I show each screenshot on a projector (or shared desktop) to the candidate and say:
- What’s this screen from?
- What does the screen mean?
- If it was a server you inherited from someone else, would there be any actions you’d take?
- What questions might you want to ask before you take those actions?
- Would there be any drawbacks to your actions?
- What would be the benefits of your actions?
After a few days, I’ll follow up with my own thoughts.
UPDATE 2016/05/20 – Great thoughts, everybody. This one was fun because it stems from real-life scenarios I’ve seen several times. You wouldn’t believe how long it takes folks to recognize this screen in real-life interviews – often it takes DBAs tens of seconds to realize they’re looking at TempDB. (They often start by talking about some other database because the file name tempdev is so deceiving.)
The DBA heard that they were supposed to create a file for each core, but they misunderstood the difference between cores and processors. The server had 2 processors, each with 4 cores – but they created 2 data files originally.
They had a super-fast SSD attached to the SQL Server as E:, and it’s got a relatively limited amount of space – say 128GB – so they started with small file sizes and let them autogrow.
At some point, the SSD ran out of space, so the DBA added another emergency overflow file on a slower drive (M:). Maybe they shrank it back manually, or maybe they have a job to shrink it – in either case, I get a little suspicious when I see small file sizes because there’s probably shrinking going on.
I got a chuckle out of the answer about the server being a dev box because the database file is named tempdev – even though I see a ton of servers, the default “tempdev” makes me pause every time because it was such an odd file name choice by Microsoft. Funny how everybody’s just as bad at naming conventions as I am.
So to answer the questions:
3. Would I take actions? I’d check to see if there are shrink jobs set up on TempDB, and if so, I’d start by disabling those. I might consider adding more TempDB data files, although if it only had one data file, I’d be a little more careful because it can have a funny side effect.
4. What questions would I ask? What wait types is this server facing? Is the E drive actually a good spot for TempDB? How are the file stats looking on that drive? Have we had a history of running out of space here? How big are the user databases? Are we sorting indexes in TempDB?
5. Any drawbacks? If TempDB is getting regularly hammered, and it runs out of space and needs the overflow file, I might not know it due to the shrinks. I’d start by disabling the shrink jobs so that I can see if this thing grows, and what it ends up growing to. That’ll help me plan for capacity.
6. Benefits to my actions? Some folks mentioned adding files or pre-growing files can make it faster for end users, but be really careful there. Anytime you say something will be faster, then as an interviewer, I’m going to challenge you to define what you would measure, and how it would change. If you don’t have metrics at the ready, then I’m going to suspect cargo cult programming.
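For reference, evening out the TempDB file count from the answers above is one ALTER DATABASE per file – the names, paths, and sizes here are placeholders, and you’d keep every file the same size so SQL Server’s proportional fill spreads the load evenly:

```sql
-- Add data files so TempDB's file count matches the core count (commonly capped around 8),
-- with identical sizes and growth increments across all files.
ALTER DATABASE tempdb
ADD FILE (NAME = N'tempdev3', FILENAME = N'E:\tempdb3.ndf', SIZE = 4GB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb
ADD FILE (NAME = N'tempdev4', FILENAME = N'E:\tempdb4.ndf', SIZE = 4GB, FILEGROWTH = 512MB);
```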
On today’s episode of the SQL Server Radio podcast, I talk with Guy Glantser and Matan Yungman about what we look for when we’re hiring.
In the broadest sense, don’t think junior or senior:
- I’m hiring someone for what they already know, or
- I’m hiring someone for their capability to learn
(In reality, it’s usually a blend of both, but just think big picture for now.)
If I’m hiring you for what you already know, then I’ve got a list of skills, and I want to see your proficiency in those skills. If one of those skills includes communication, then I’m going to judge you based on how you communicate your mastery of the other skills. For example, I might be looking at your blog posts, presentations, or webcasts about the topics you’re great at.
If I’m hiring your excellent learning skills, then I want to see what you’ve been interested in learning in the past, and how you’ve gone about learning those topics. It doesn’t have to be technical, either – maybe you were interested in perfecting an Eggs Benedict recipe. Show me what resources you used, your preferred style of learning, what lessons you picked up along the way, and how you would recommend that I learn that same thing as fast as possible.
To hear more about my philosophies on that, and hear how Guy and Matan approach hiring for their own companies, check out the half-hour SQLServerRadio podcast.
Every time we announce a new SQLSaturday Pre-Con, the number one question I get is, “When are you going to teach Performance Tuning When You Can’t Fix the Queries in my city?”
Since I can’t get to all cities, let’s do this online!
Performance Tuning When You Can’t Fix the Queries – Friday, May 20th Online, $299 – Your users are frustrated because the app is too slow, but you can’t change the queries. Maybe it’s a third party app, or maybe you’re using generated code, or maybe you’re just not allowed to change it. Learn more.
SQL Server Performance Tuning – Tuesday-Friday, July 5-8 Online, $3,995 – You’re a developer or DBA who needs to speed up a database server that you don’t fully understand – but that’s about to change in a class of learning and fun with Brent. Learn more.
The Senior DBA Class of 2016 – Tuesday-Friday, July 26-29 Online, $3,995 – You’re a SQL Server DBA who is ready to advance to the next level in your career but aren’t sure how to fully master your environment and drive the right architectural changes. Learn more.
These online training classes run from 9AM to 5PM Eastern time using GoToWebinar, live with Brent Ozar, with a one-hour lunch break. Audio can come through either your computer audio (headset recommended), or by dialing into a US phone number. For best performance, test your computer’s compatibility before the meeting to make sure you’re not blocked by a corporate firewall. After purchasing your ticket, you’ll receive an email with your personal GoToWebinar link – don’t share this between users, because only one person will be able to log into the meeting with it. During the session, you can ask questions via text chat, or at the end during the audio Q&A as well.
Or if you’d like to join me at an in-person event, check out our upcoming classes.
No, not #tables – actual tables. Here’s how:
USE tempdb;
GO
/* This one is only available during my session: */
CREATE TABLE #myTempTable (ID INT IDENTITY(1,1), Stuffing VARCHAR(100));
GO
/* This one is global, meaning it's available to other sessions: */
CREATE TABLE ##myTempTable (ID INT IDENTITY(1,1), Stuffing VARCHAR(100));
GO
/* You can create both of those at the same time. They're different. */
/* This one is just like a user table, but in TempDB: */
CREATE TABLE dbo.myTempTable (ID INT IDENTITY(1,1), Stuffing VARCHAR(100));
GO
The first one disappears when my session is over, the global one goes away once my session ends and every other session stops referencing it, and the dbo one persists until the SQL Server is restarted.
Why would you ever do the latter two? Say you need to share data between sessions, or between different applications, or staging tables for a data warehouse, or just faster tables that live on local SSDs in a cluster (as opposed to slower shared storage), or you wanna build a really crappy caching tier.
If you use global temp tables or user-space tables, though, you have to check for duplicates before creating your tables. Local temp tables are just all yours, and you can have a thousand users with the exact same-name local temp tables.
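That duplicate check is the classic OBJECT_ID pattern – a minimal sketch:

```sql
-- Global temp tables share one namespace, so make sure nobody beat you to the name.
IF OBJECT_ID('tempdb..##myTempTable') IS NULL
    CREATE TABLE ##myTempTable (ID INT IDENTITY(1,1), Stuffing VARCHAR(100));
```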
Next up, the ever-so-slightly different magic of temporary stored procedures:
USE tempdb;
GO
/* This one is only available during my session: */
CREATE PROC #usp_myTempWorker AS
SELECT * FROM sys.databases;
GO
/* This one is global, meaning it's available to other sessions,
   but ONLY as long as my session is available: */
CREATE PROC ##usp_myTempWorker AS
SELECT * FROM sys.databases;
GO
/* This one is just like a user stored proc, but in TempDB: */
CREATE PROC dbo.usp_myTempWorker AS
SELECT * FROM sys.databases;
GO
Here, the first TWO disappear when my session is over, and only the latter one sticks around. Diabolical. So the ## temp stored proc doesn’t really help me here because I can never tell when the creator’s session is going to finish. (Not God. His session keeps right on going.)
So why would you ever create stored procedures – temporary or user – in TempDB? You might not have permissions in the user databases, just might not be technically allowed to change things, or maybe you’ve got monitoring queries that you want to hide, or you want to create procs temporarily to check parameter sniffing issues.
All of the above will disappear when the SQL Server is restarted – or will they? Not if you create them permanently in the model database, which is the source of TempDB’s creation when SQL Server restarts:
USE model;
GO
CREATE PROC dbo.usp_myTempWorker AS
SELECT * FROM sys.databases;
GO
CREATE TABLE dbo.myTempTable (ID INT IDENTITY(1,1), Stuffing VARCHAR(100));
GO
/* Now restart your SQL Server, and check in TempDB */
USE tempdb;
GO
EXEC dbo.usp_myTempWorker;
GO
SELECT * FROM dbo.myTempTable;
GO
Why would you ever wanna do this? Well, say you need to make sure that, uh, in case … look, I’m just an idea man. Somebody, somewhere, is looking for a really bad idea. That’s what I’m here for.
Last month’s post “For Technical Interviews, Don’t Ask Questions, Show Screenshots” was a surprise hit, and lots of folks asked for more details about the types of screenshots I’d show. Over the next few weeks, I’ll share a few more.
Normally I’d show this query as a screenshot, but for easier copy/pasting into comments, I’m showing it as code here.
CREATE PROC dbo.usp_ByCategory
    @Category NVARCHAR(20)
AS
IF @Category = NULL
    SET @Category = 'Default'
SELECT i.itemID, i.itemName, COALESCE(po.Price, i.Price, 0) AS Price
FROM Items I
LEFT OUTER JOIN PriceOverrides po
    ON i.itemID = po.itemID
    AND po.SaleStartDate >= GETDATE()
    AND po.SaleEndDate <= GETDATE()
WHERE i.Category = @Category
I’d say to the job candidate, “You’ve been asked to take a quick look at this code as part of a deployment. Explain what the business purpose of the code is, and tell me if there’s anything that concerns you.”
After a few days, I’ll follow up with my own thoughts in the comments.
It’s the news we’ve all been waiting for!
This PDF lays out the differences between editions, and here are a few points that stand out:
- Standard Edition now goes up to 24 cores, and still just 128GB max memory
- Query Store is Enterprise Edition only (see update below)
- Always Encrypted is Enterprise only, thereby killing its adoption rate among ISVs
- In-memory analytics, R integration are Enterprise only
- Business Intelligence Edition is gone with the wind
- According to the newly released TPC-H benchmark Executive Summary, Enterprise Edition still costs around $7k USD per core
Great news! Let me know what you think in the comments.
UPDATE 6:30PM – Microsoft unveiled a more detailed feature comparison by edition, and this one says Query Store will be available in all editions (including Express!)
The StackOverflow XML Data Dump was recently updated with 2016-03 data, so I’ve updated our torrent of the SQL Server database version of the Stack Overflow data dump.
Fun facts about the database and its real-world-ness:
- 95GB in size
- 29,499,660 posts spanning 2008-07-31 to 2016-03-06
- 5,277,831 users spanning ages from -972 to 96 (just like real world data, you can’t trust it)
- 46,306,538 comments (227 of which have the F-bomb)
- Every table has a clustered key on an Id identity field, and has relationships to other tables’ Ids (again, much more real-world-ish)
- Lots of lumpy data distribution and sizes, making it fun for parameter sniffing demos
- Case-sensitive collation (because if you’re going to share scripts online, you want to get used to testing them on case sensitive servers – this stuff exists out in the real world)
- 1,305% cooler than AdventureWorks
Here’s how I built the torrent:
In our AWS lab, we have an m4.large (2 cores, 8GB RAM) VM with SQL Server 2005. We use that for testing behaviors – even though 2005 isn’t supported anymore, sometimes it’s helpful to hop in and see how things used to work.
I still use 2005 to create the dump because I want the widest possible number of folks to be able to use it. (This is the same reason I don’t make the database smaller with table compression – that’s an Enterprise Edition feature, and not everybody can use that.) You can attach this database to a SQL 2005, 2008, 2008R2, 2012, or 2014 instance and it’s immediately usable. Keep in mind, though, that it attaches at a 2005 or similar compatibility level. If you want 2014’s new cardinality estimator, you’ll need to set your compat level to 2014 after you attach the database.
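Bumping the compatibility level after attaching takes one statement – 120 is SQL Server 2014’s level:

```sql
-- Opt in to the 2014 cardinality estimator after attaching the database.
ALTER DATABASE StackOverflow SET COMPATIBILITY_LEVEL = 120;
```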
I downloaded the Stack Exchange data dump on that 2005 VM. It’s a little confusing because the Archive.org page says it was uploaded on 1/21/2014, but that’s just the first date the file was published. The top update date of March 1, 2016 is the current version you’ll get if you use the download links at the top right of the page.
To make the import run faster, I shut the VM down, then changed its instance type to the largest supported m4 – the m4.10xlarge (“Deca Extra Large”), with 40 cores and 160GB RAM for $4.91/hour – and booted it back up. (Don’t forget to revisit your SQL Server’s max memory, MAXDOP, and TempDB settings when you make changes like this.)
I created an empty StackOverflow database, then fired up the Stack Overflow Data Dump Importer (SODDI), an open source tool that reads the XML data dump files and does batch inserts into a SQL Server database. I pasted in a connection string pointing to my SQL Server – ConnectionStrings.com makes this easy – and off it went:
The import finished in about 25 minutes, although it turns out the extra cores didn’t really help here – SODDI is single-threaded per import file:
After SODDI finished, I stopped the SQL Server service so I could access the ~95GB data and log files directly, and then used 7-zip set to use ultra compression and 32 cores, and the CPU usage showed a little different story:
After creating the 7z file, I shut down the EC2 VM, adjusted it back down to m4.large. I created a torrent with uTorrent, then hopped over to my Whatbox. Whatbox sells seedboxes – virtual machines that stay online and seed your torrent for you. They’re relatively inexpensive – around $10-$30/mo depending on the plan, and I just go for unlimited traffic to make sure the database is always available.
To double-check my work, I fired up my home BitTorrent client, downloaded the torrent, extracted it, and attached the database in my home lab. Presto, working 95GB StackOverflow database.
Now, you can go grab our torrent of the SQL Server database version of the Stack Overflow data dump. Enjoy!