For this month’s T-SQL Tuesday topic, Michael J. Swart told us to argue against a popular opinion, and man, is that right up my alley. I’ve told you to stop worrying about index fragmentation, stop backing up your logs once an hour, and to stop running $14k of licensing on $3k of hardware.
You’re probably checking SQL Server wait stats periodically to find your SQL Server’s bottleneck – and that’s a good thing. Instead of checking Perfmon counters and trying to figure out what’s the slowest part of your server, at least wait statistics tell you what SQL Server has been waiting on while running queries.
But it all falls apart when you say, “67% of the time, SQL Server is waiting on ____, so we should focus there.”
We need to understand the difference between latency and throughput.
Explaining Latency and Throughput with Shipping Companies
If we ran an online store and wanted to measure how fast a delivery service works, we could call them to pick up a single envelope, then measure the amount of time it took to arrive at its destination. We could then say, “That service can deliver exactly one package per day. If our business is going to grow, we’re gonna have to find a delivery service that can ship faster, because we need to move more than one package per day.” We could then focus all our efforts on trying local courier services, or on spreading our load across multiple shipping companies.
But we would be morons.
Instead, we need to put MORE packages out on our doorstep and call the delivery service to come get them. They’ll send a truck, pick up all the packages, and deliver them to various destinations. As we try to ship more and more packages, we’ll probably need to upgrade to a loading dock, or maybe even multiple loading docks, and set up an arrangement with our shipping company to send more trucks simultaneously.
Latency is the length of time it takes to deliver a single package.
Throughput is the number of packages they can deliver per day.
Just because our business is waiting overnight for a single package to be delivered doesn’t mean we have to go find a new shipping company. It’s completely normal. We need to keep pushing our business to figure out where the breaking point is. Are packages piling up at the door because the shipping company only has a single small cargo van? Sure, that’s the point at which we worry.
How This Relates to SQL Server Wait Stats
In a recent load testing engagement, the DBA told me, “We’ve only got a few end users hitting the system, and we’re already seeing 80-90% of our waits in PAGEIOLATCH. The data file storage simply isn’t able to keep up.”
We switched from using his wait stats script to sp_BlitzFirst® instead, which shows the cumulative amount of time spent waiting on each wait type. In any given 5-second span, the server was spending less than a second waiting on PAGEIOLATCH. Furthermore, the average wait time was less than 5 milliseconds each time – indicating that the storage was responding fairly quickly to each request.
The server was sitting idle, and the DBA was reading wait stats incorrectly. Sure, the majority of time spent waiting was due to storage, but there just wasn’t much time spent waiting, period.
“Crank up the load,” I said. “Quadruple the amount of work you’re throwing at the server.”
Everybody in the room looked at me like I was crazy, but they agreed – and the SQL Server still didn’t flinch. We kept upping and upping the load, and finally we did find a breaking point, but it wasn’t storage. Just as you can pile up a lot of boxes in front of your house and the shipping company will pick them all up to deliver them in the same amount of time, the SQL Server’s storage kept right on delivering every result within 5-6 milliseconds.
The Moral of the Story
When using wait stats for monitoring, make sure you’re looking at the total number of seconds spent waiting per second on the clock. If you sample waits for 5 seconds on a 16-core server, don’t freak out about 5 seconds worth of wait. Each core can have multiple queries piled up, all waiting on different resources, so even 15-20 seconds of wait time during a 5-second period may not indicate problems.
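That “seconds of wait per second on the clock” measurement can be sketched with a plain wait-stats delta: snapshot sys.dm_os_wait_stats, wait 5 seconds, snapshot again, and compare the deltas to the elapsed clock time. This is only a rough sketch of the technique, not sp_BlitzFirst itself, and the benign-wait filter list here is abbreviated for illustration:

```sql
-- Snapshot cumulative waits since instance startup.
SELECT wait_type, wait_time_ms, waiting_tasks_count
INTO #waits_before
FROM sys.dm_os_wait_stats;

-- Sample window: 5 seconds of clock time.
WAITFOR DELAY '00:00:05';

-- Delta each wait type, expressed as seconds waited per second of clock time.
-- On a busy 16-core box, a ratio well above 1.0 is normal; a few hundredths is idle.
SELECT  w.wait_type,
        (w.wait_time_ms - b.wait_time_ms) / 1000.0 AS wait_sec_in_sample,
        (w.wait_time_ms - b.wait_time_ms) / 5000.0 AS wait_sec_per_clock_sec,
        (w.wait_time_ms - b.wait_time_ms)
          / NULLIF(w.waiting_tasks_count - b.waiting_tasks_count, 0) AS avg_ms_per_wait
FROM sys.dm_os_wait_stats AS w
JOIN #waits_before AS b ON b.wait_type = w.wait_type
WHERE w.wait_time_ms > b.wait_time_ms
  -- Abbreviated filter: real scripts exclude dozens of harmless background waits.
  AND w.wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'BROKER_TASK_STOP')
ORDER BY wait_sec_per_clock_sec DESC;

DROP TABLE #waits_before;
```

The avg_ms_per_wait column is what caught the storage myth in the story above: lots of PAGEIOLATCH waits averaging under 5ms each means the storage is answering quickly, not drowning.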
Thanks for contributing! (finally). Man, who hasn’t been at least a little suspicious of those wait stats.
Michael – thanks for writing such an excellent topic! It’s so rare that the T-SQL Tuesday topic and my schedule line up in a way that works.
You hit the nail on the head. It seems like people spend too much time looking at aggregated wait stats. Does it really matter what the aggregate is if there is no perceived performance problem? The only time I really look at aggregated wait stats is when we do Load Runner tests because we know we can never accurately simulate real production chaos. Even then I’m less concerned with any given aggregated wait stat, rather, with this run’s aggregates vs the last run’s (assuming the load test profile didn’t change).
I’m more concerned with finding out from users what the pain point is and then finding out when those occurred and what the wait stats looked like during that moment in time. To further your example, PAGEIOLATCH_EX waits happen on our systems all the time because we improperly index and our queries are written by idiots. Generally these are a problem ONLY when every other spid serializes behind one spid on PAGEIOLATCH_EX. In that case the calls start coming in that the db is down when really we had table lock escalation due to improper use of a table variable or a bad parameter sniffing event. But these are literally six sigma events.
I find people also overly freak out when CXPACKETs are high and PLEs aren’t “good”…whatever that means. I’ll hear statements like “this server’s CXPACKET waits are orders of magnitude higher than that server’s” or “our PLEs are 300 on this server and 3000 on another one.”
Dave – yep, you got it. It’s amazing how many different things can trigger PAGEIOLATCH waits, above and beyond storage.
“We could focus all our efforts… but we would be morons.”
Man, even if I didn’t get the profoundly good insights about SQL Server from this blog, this would keep me coming back. You don’t even have to know what SQL Server is to understand this analogy.
Awww, thanks, sir! That’s one of my favorite analogies too, and I’m surprised I haven’t blogged about it before. I use it constantly when I’m talking about storage.
Analogy is the language of business users.
I know for me, I clear the wait stats weekly and then run a weekly report to gauge whether there were any issues, since spikes do occur from time to time. Now, is this a better approach versus running, say, a 1-minute sample of my waits to see if there are any issues? And what are other things besides storage that could cause PAGEIOLATCH waits?
IMHO looking at PAGEIOLATCH in aggregate isn’t very helpful b/c lots of things cause that wait type. Instead I want to know the queries that are causing that wait so I can look for the individual causes…missing indexes, refactoring opportunities (table scan lock escalations, joins to @TableVars with thousands of rows, bad param sniffing, etc.), or very infrequently, fragmented indexes that are scanned. There are probably others I can’t think of. If you sample every minute, you aren’t going to see the waits that lasted less than a minute, so you miss a lot.
Others will likely disagree, but I don't assume PAGEIOLATCH is caused by storage in most cases. Rather, it is a symptom that I use as an indicator to help me find the true cause. True storage issues will display different symptoms, like IO stalls.
Eric & Dave – you’ve both got interesting points.
I do want periodic reports of my waits – say, total wait time per hour. If I know that between 7AM and 8AM, my SQL Server has 3 hours of PAGEIOLATCH, then I need to look at what queries are running during that hour. Sampling at the minute level is probably overkill – if I’m not going to act on data at a granularity that small, then I don’t really want to collect it.