Have you got a SQL Server that usually performs just fine, but every now and then, everything falls to pieces? Users complain about performance, and then minutes – or hours – later, the problem just mysteriously solves itself?
Here’s a few things to look for:
Are data or log files growing? When SQL Server needs to grow a file, it can bring all transactions in that database to a grinding halt. You can mitigate this issue by turning on Instant File Initialization, pre-growing out your data and log files, and using smaller autogrowth sizes. To figure out if this is your root cause and which cure makes the most sense, run sp_Blitz® and look for the warning about data and log file growths that have been taking more than 15 seconds.
Are queries being blocked? If one query takes out a lock that prevents other queries from getting their work done, it presents interesting symptoms. Your SQL Server’s CPU use might look low – but it’s because so many queries are sitting around idly waiting. To diagnose, start by running sp_BlitzFirst® with the @Seconds = 0, @ExpertMode = 1 parameters to check your wait stats since startup. If LCK* waits are significant, then start digging at the database level. Run sp_BlitzIndex® in each database, and look for the Aggressive Indexes warnings about indexes that are a frequent blocking sore spot. Consider dumping indexes you don’t need on those tables (to speed up deletes/updates/inserts), and adding the right indexes that you do need (to speed up other queries).
Are queries being rolled back? If a query fails – either due to being canceled, or running into an error, or hitting a deadlock – SQL Server has to undo their work. This may cause blocking, or it just may hammer storage. This one’s a little trickier to catch. If you run into it live, then you can see it with sp_BlitzFirst® – it warns about queries that are currently rolling back.
Did a bad plan get into cache? When you run parameterized SQL, SQL Server builds an execution plan based on the first set of parameters it sees. Normally, parameter sniffing is a good thing, but sometimes it goes awry. You can have a plan get pushed out of cache (due to memory pressure or statistics changes) and then get compiled based on the first – possibly unusual – set of parameters that come in. That plan might not work so well for other sets of parameters, and it can degrade performance on the whole box until a new plan is generated for it. You can learn more about solutions in this blog post about parameter sniffing, and to diagnose it, run sp_BlitzCache®. Look at the top resource-consuming queries, and see if an unusual one has suddenly jumped to the top of the charts.
Are shared resources under pressure? In virtualization environments, other guests may be using an increased amount of CPU, memory, or storage throughput, giving your guest less horsepower. In shared storage environments, other servers might be doing backups or disk-intensive workloads. In multi-instance servers, another instance on the same box might be the real culprit. To identify these, check out wait stats with sp_BlitzFirst®.
Looking back at what I’ve written, I’m sure someone’s going to say, “Brent, you’re just trying to pitch tools.” And sure, I am – they’re free. I want you to use the same powerful tools that we use every day in our consulting. Your job is hard enough without reinventing the wheel!