For this month’s T-SQL Tuesday topic, Michael J. Swart told us to argue against a popular opinion, and man, is that right up my alley. I’ve told you to stop worrying about index fragmentation, stop backing up your logs once an hour, and to stop running $14k of licensing on $3k of hardware.
You’re probably checking SQL Server wait stats periodically to find your SQL Server’s bottleneck – and that’s a good thing. Instead of checking Perfmon counters and trying to figure out what’s the slowest part of your server, at least wait statistics tell you what SQL Server has been waiting on while running queries.
But it all falls apart when you say, “67% of the time, SQL Server is waiting on ____, so we should focus there.”
We need to understand the difference between latency and throughput.
Explaining Latency and Throughput with Shipping Companies
If we run an online store and we wanted to measure how fast a delivery service works, we could call them to pick up a single envelope, and then measure the amount of time it took to arrive at its destination. We could then say, “That service can deliver exactly one package per day. If our business is going to grow, we’re gonna have to find a delivery service that can ship faster, because we need to move more than one package per day.” We could then focus all our efforts trying to use local courier services, or spreading our load across multiple shipping companies.
But we would be morons.
Instead, we need to put MORE packages out on our doorstep and call the delivery service to get it. They’ll send a truck, pick up all the packages, and deliver them to various destinations. As we try to ship more and more packages, we’ll probably need to upgrade to a loading dock, or maybe even multiple loading docks, and set up an arrangement with our shipping company to send more trucks simultaneously.
Latency is the length of time it takes to deliver a single package.
Throughput is the number of packages they can deliver per day.
Just because our business is waiting overnight for a single package to be delivered doesn’t mean we have to go finding a new shipping company. It’s completely normal. We need to keep pushing our business to figure out where the breaking point is. Are packages piling up at the door because the shipping company only has a single small cargo van? Sure, that’s the point at which we worry.
How This Relates to SQL Server Wait Stats
In a recent load testing engagement, the DBA told me, “We’ve only got a few end users hitting the system, and we’re already seeing 80-90% of our waits in PAGEIOLATCH. The data file storage simply isn’t able to keep up.”
We switched from using his wait stats script to sp_BlitzFirst® instead, which shows the cumulative amount of time spent waiting on each wait type. In any given 5-second span, the server was spending less than a second waiting on PAGEIOLATCH. Furthermore, the average wait time was less than 5 milliseconds each time – indicating that the storage was responding fairly quickly to each request.
The server was sitting idle, and the DBA was reading wait stats incorrectly. Sure, the majority of time spent waiting was due to storage, but there just wasn’t much time spent waiting period.
“Crank up the load,” I said. “Quadruple the amount of work you’re throwing at the server.”
Everybody in the room looked at me like I was crazy, but they agreed – and the SQL Server still didn’t flinch. We kept upping and upping the load, and finally we did find a breaking point, but it wasn’t storage. Just as you can pile up a lot of boxes in front of your house and the shipping company will pick them all up to deliver them in the same amount of time, the SQL Server’s storage kept right on delivering every result within 5-6 milliseconds.
The Moral of the Story
When using wait stats for monitoring, make sure you’re looking at the total number of seconds spent waiting per second on the clock. If you sample waits for 5 seconds on a 16-core server, don’t freak out about 5 seconds worth of wait. Each core can have multiple queries piled up, all waiting on different resources, so even 15-20 seconds of wait time during a 5-second period may not indicate problems.