Wait Stats

Using Wait Stats to Find Why SQL Server is Slow

Forget page life expectancy, buffer cache hit ratio, and disk queue length – those antiquated stats worked great for Grandpa, but it’s time to find an easier way to tune.

Whenever SQL Server is running queries, it’s tracking how much time it spends waiting on bottlenecks. These wait statistics are the easiest way to identify your bottleneck.

These statistics are tracked automatically in every version/edition of SQL Server, and they’re easy to query. Here’s our favorite way using sp_BlitzFirst®:

/* To see the waits over a current 5-second period: */
EXEC sp_BlitzFirst;

/* OR, to see waits since startup, skip taking a sample: */
EXEC sp_BlitzFirst @Seconds = 0;

/* To see the waits over a current 5-second period: */

EXEC sp_BlitzFirst;

/* OR, to see waits since startup, skip taking a sample: */

EXEC sp_BlitzFirst @Seconds = 0;

Then, look at the waits it identifies as your top waits. Here’s the most common ones:

ASYNC_NETWORK_IO

SQL Server built the query results, and it’s just sitting around waiting for the application on the other end of the pipe to consume the results faster. There’s nothing you can do to performance tune the SQL Server here – you have to figure out why the app can’t get the data down faster. It could be a slow network pipe between the app and the SQL Server (like a long distance wide area network), an underpowered client machine, or row-by-row processing happening on the application server.

To help your developers narrow down what’s happening, run sp_WhoIsActive several times. Look for queries that are waiting on ASYNC_NETWORK_IO, see which queries they are, and which machines they’re coming from. You may see a pattern of just a couple of queries – hand those queries and machine names to your developers.

BACKUP*

Who cares? Sure, your backups are slow, but if this is your top wait, you’re probably not going to make users happy by tuning backups.

To fix it, though, check out our backup best practices. If you’ve got a single database that’s approaching or exceeding 1TB in size, it’s time to think about SAN snapshot backups. We cover those in our Senior DBA Class.

CXPACKET

We explain this one in our classic post, What is the CXPACKET Wait Type, and How Do You Reduce It?

To fix it, first start by making the Cost Threshold for Parallelism and MAXDOP changes recommended in the post above. Then, start looking at which queries are reading the most data – that’s typically the source for queries going parallel. You can find those by running: sp_BlitzCache @SortOrder = ‘reads’

Then, start tuning those queries for better sargability, and tune the indexes to drive down reads and sorting work required.

LCK*

It means one query is holding locks on an object, and another query is waiting to get locks on that same object. Maybe one query is trying to update all the rows in the Customers table, and another query is trying to read them.

To fix it, you want enough indexes to make queries fast, but not so many indexes that there’s a huge overhead for deletes/updates/inserts (which slows down those activities, and makes blocking worse.) It’s a delicate balance, but generally we recommend aiming for 5 or less nonclustered indexes, each with 5 or less fields on it (including includes). You can get away with more – but you’d better be really careful with blocking overhead, and if you’re facing LCK* waits, you’ve gone over (or under) the line.

Start by running sp_BlitzIndex® in the user database facing the problem. Look for warnings about:

Aggressive indexes – meaning SQL Server is seeing a lot of lock waits on these indexes
Missing indexes – if they’re on the aggressive tables, we’re probably doing table scans where we could be doing quick seeks
Unused indexes – if they’re on the aggressive tables, we could be locking up a bunch of indexes that we’re not even reading, thereby making our deletes/updates/inserts slower

For more details, check out our course on How to Tune Indexes.

PAGEIOLATCH_*

SQL Server is waiting to read data pages from storage. These pages weren’t cached in memory, so we have to get ’em from disk. In most cases, this is to satisfy a SELECT statement, but you can also see this for DELETES, UPDATES, and INSERTS (DUIs) as well.

To fix it, start by running sp_BlitzCache @SortOrder = ‘reads’, and it’ll list the queries reading the most data. Look for missing indexes in those query plans, or non-sargable things in the query that you could tune manually. Start here because it’s the easiest/cheapest.

Next up, consider adding more memory to the SQL Server. Sure, storage might be really slow – but changing out your storage is time-consuming and expensive. Check our classic post, What’s the Smallest SQL Server You Should Build?

Finally, if you can’t do any of those, look into improving the speed of your storage. We list this one last because it’s the hardest to do, and we cover it in our video training course, Virtualization, Storage, and Hardware for DBAs.

SOS_SCHEDULER_YIELD

The SQL Operating System (SOS, get it?) is waiting for a CPU scheduler to yield more time. It doesn’t necessarily mean that your server needs more CPU time overall – it means that an individual task in a query needs more CPU time.

If your query contains parallelism inhibitors like a serial UDF, a multi-statement table-valued function, modifies table variable contents, or queries system tables, the whole plan (or parts of it) may be forced into only using a single CPU core. In that case, you’ll see SOS_SCHEDULER_YIELD waits even when your server has lots of free CPU capacity.

To identify which queries are using the most CPU, run sp_BlitzCache @SortOrder = ‘CPU’.

WRITELOG

Users are doing delete/update/insert (DUI) operations, and SQL Server has to write those to the transaction log file before telling the end users that their transaction is committed. Your transaction log storage is having a tough time keeping up.

To fix it, tell your users to write less data.

Okay, maybe that won’t work – so let’s find out how bad the problem is by checking sp_BlitzFirst @Seconds = 0, @ExpertMode = 1. The file stats result set will tell you the most active data & log files since the last restart, and you’ll see how fast (or in this case, slow) your log files are. If the log files taking more than a few milliseconds for writes, there’s your problem. It’s time to hit up our Senior DBA Class.

If you’re on SQL Server 2014, you can investigate Delayed Durability – a database-level setting that will let transactions be “committed” even before they get written to the transaction log. How does this work, you ask? Well, if SQL Server happens to crash or get shut down before the writes make it to disk, the transaction doesn’t actually get committed – so you can lose data. This option only really makes sense when you’re dealing with data that it’s okay to lose – perhaps web site session state, or an ETL database where the nightly loads can be restarted from scratch. To learn more and see it in action, Watch Brent Tune Servers.

Got a different wait type?

Check out these resources:

Waitopedia by Dell Spotlight Essentials – tells you how common or rare the wait type is, and what “normal” numbers look like for that wait type.
SQL Server Wait Type Repository – started by Microsoftie Bob Ward, but not really updated.
Waits and Queues Whitepaper – the original whitepaper that started it all, but kinda out of date.

Or post a question on DBA.StackExchange.com.

If you want to learn more, check out our online Wait Stats Training Class.