SQL Server 2016 Service Pack 1 brought us performance tuners all kinds of shiny goodies – like the ability to see wait stats inside a query’s actual execution plan. I was really excited when this first came out, but I keep seeing some real oddities.
Let’s keep this one really simple: I’m using the Stack Overflow public database, and any version will do. Start by purposely kneecapping your SQL Server, restricting it to just 1GB of memory:
EXEC sys.sp_configure N'show advanced options', N'1';
RECONFIGURE;
EXEC sys.sp_configure N'max server memory (MB)', N'1000';
RECONFIGURE;
After that finishes, pop open another window, give your SQL Server amnesia by clearing its wait stats, and start sp_BlitzFirst for a 60-second monitoring span:
EXEC sp_BlitzFirst @ExpertMode = 1, @Seconds = 60;
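The "amnesia" step is clearing the instance-level wait stats so the sample starts from zero. A minimal sketch, for test servers only, since it wipes the cumulative numbers that other monitoring tools may rely on:

```sql
-- Reset the cumulative wait stats in sys.dm_os_wait_stats.
-- Do this only on a test box: it zeroes out history for everyone.
DBCC SQLPERF (N'sys.dm_os_wait_stats', CLEAR);
```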
Then start another window, turn on actual execution plans, and run this simple query, which scans the Posts table – far too large to fit into 1GB of RAM:
SELECT COUNT(*) FROM dbo.Posts;
Assuming this is the only query running at the time, you would expect that the server-level wait stats would match the execution plan’s waits. Here’s what the query plan shows:
That sounds amazing! Our server only had 706 milliseconds of wait time altogether, none of which was spent waiting on storage! Our storage must be super-blazing fast, and there’s practically no way to tune it to wait less than 706 milliseconds, right?
Just to show that I’ve got nothing up my sleeve, here’s the waits in the XML plan:
Your first clue that these aren’t quite accurate is the fact that this query went parallel, but no CXPACKET waits are shown.
Your second clue: this query took a full 39 seconds to run, and CPU wasn’t 100% the whole time. Here’s a quick shot of Task Manager just to prove that point:
So what was SQL Server waiting on? Server-level wait stats tell a very different (and more accurate) story:
Ah-ha! We waited a heck of a long time for PAGEIOLATCH_SH, which means reading data pages from a data file. And while we’re looking – that “actual” execution plan didn’t mention anything about CXPACKET waits either.
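If you want to check the server-level numbers yourself instead of relying on sp_BlitzFirst, a minimal sketch against the sys.dm_os_wait_stats DMV (the columns are real; the TOP and ordering are just my choice for readability):

```sql
-- Top waits by total wait time since the stats were last cleared.
SELECT TOP (10)
       wait_type,
       wait_time_ms,
       signal_wait_time_ms,
       waiting_tasks_count
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```

On the repro above, PAGEIOLATCH_SH and CXPACKET should float to the top of this list even though the plan's wait stats never mention them.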
Can you see wait stats in query plans? Sure. They’re just not accurate – yet, at least – so don’t go making any complex tuning decisions based on those. Keep your eye on this Connect item to know when it’s fixed.
Update 2017/10/05: Bad news.
Microsoft marked the bug as “Won’t Fix” and wrote:
This results in PAGEIOLATCH_* wait type and other IO waits not being tracked per session, because by-design these are SOS waits that are tracked at the server level. Potentially tracking individual IO requests per thread and session also has performance considerations that would impact query execution.
So no, 2016 doesn’t really show wait stats in query plans.
To me, this is worse than not showing any wait stats at all because the waits aren’t anywhere near accurate. Some poor performance tuner who hasn’t read this post is going to start analyzing their query’s wait stats in the plan, do all kinds of work to “fix” it, and not understand why the query’s not getting faster.
Update 2017/10/05 Part II – there might be more hope! Pedro Lopes reopened the Connect item for discussion.
How do you get 609 mSec of wait time? I calculate 59 + 100 + 547 = 706 mSec. I suspect that someone added 59 + 547, but then mistakenly added 3 (the Wait Count) for the LATCH_SH.
Tom – ooo! Good catch! Yep, should be 706.
Wow, that’s terrible. How did they even release it like this…
Imagine you’re starting 2 similar non-trivial I/O and CPU intensive queries at almost the same moment, both involving a scan on the same big table. The first query is running, the SOS is starting to read pages from the disk to satisfy it, so it suspends the task. The other query starts and “overtakes” the first one, maybe using the pages that were already read and signaled to the first query (but it cannot run, because it’s Runnable, waiting for CPU). So the pages are read by the first query, but are used first by the second query. Imagine any other sophisticated interleaving execution pattern, possibly involving Merry-go-round scans, expunging pages from memory, read-ahead, multiple processors and so on. If that’s not enough, increase the amount of tasks involved. And defragment or rebuild an index on this table in ONLINE mode if that’s still not enough. Do you think I/O waits can be reliably assigned to just one of these queries?
And now, if you run the same query twice on a loaded server, the waits will be there every time. Then run it twice on the same server during off-peak hours. For the second execution the waits disappear, so the query tuning problem is solved, yes?
Are you sure that these specific I/O waits for this specific query execution are the problem? Maybe they're caused by another badly performing query (which made SQL Server throw my important pages out of memory), or by external memory pressure. Having them in the query plan doesn't make my tuning job easier at all.
As they wrote, I/Os are per-instance, per-volume, not per-query.
For me, physical reads are just a plain incorrect measure most of the time (PAGEIOLATCH is the symptom, not the cause). The tuning work most often is to minimize *logical* reads and the amount of computation involved, which may (but doesn't have to) translate to fewer physical I/Os and less CPU usage.
But the title is catchy, and it counts, I’ve clicked it, too 🙂
Ernest – I’ve read your comment a few times, and from what I can tell, it sounds like you agree? I think you’re saying that, “Yes, the wait stats in the query plan don’t match what the query is really waiting on because tracking it is hard.”
Did I sum that up right? If not, can you try to sum it up for me? Thanks!
Is it realistic to use extended events to capture the before-query and after-query server-side wait stats? Obviously we’d need a BetterSSMS query editor to auto-magically do that behind the scenes. Kind of like SET STATISTICS IO ON; SET STATISTICS TIME ON; but SET BRENTHAPPY ON;
Only if you have exactly one query running at a time on your server. 😉
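One rough way to sketch the commenter's before/after idea without Extended Events is to snapshot sys.dm_os_wait_stats around the query and diff the two snapshots. As noted above, the delta is only meaningful when yours is the only query running on the instance:

```sql
-- Snapshot the waits, run the query under test, snapshot again, and diff.
-- Only meaningful when nothing else is running on the instance.
SELECT wait_type, wait_time_ms
INTO #waits_before
FROM sys.dm_os_wait_stats;

SELECT COUNT(*) FROM dbo.Posts;  -- the query being measured

SELECT a.wait_type,
       a.wait_time_ms - ISNULL(b.wait_time_ms, 0) AS wait_time_ms_delta
FROM sys.dm_os_wait_stats AS a
LEFT JOIN #waits_before AS b ON b.wait_type = a.wait_type
WHERE a.wait_time_ms - ISNULL(b.wait_time_ms, 0) > 0
ORDER BY wait_time_ms_delta DESC;

DROP TABLE #waits_before;
```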
You’re completely correct, but also missing the fact that this is exactly what we do in practice. Before I learned how to really tune queries, I was the sole person using a SQL Server 2008 R2 Standard Edition server at the company, so I did have exactly one query running at a time. I could see when all 4 (!) CPUs were maxed out and that the query was perfectly parallel, and when the SQL page cache had a miss and disk activity went berserk.
Until I read Grant Fritchey’s book about query plans a year or so later, my method was simply to watch PerfMon. I still rely on that PerfMon mental model today, albeit much harder in AWS with “IOPS”.
That’s totally okay – just be aware that that technique – measuring only one query on a server at a time – may not serve you well as your career scales. I sure wish people would stop all activity on their servers so I could run one query at a time, but… 😉
The idea is not to measure one query. The idea is to measure tangible changes in an environment, similar to using wind speed and temperature to estimate power utilization/generation and cost per watt in an energy grid (another problem domain where data samples are not lined up cleanly).
These days, I automate a great deal of this exact concept through tools like SolarWinds Database Performance Analyzer, but even those tools do not do the best job possible at overlaying the right stats, because they’re not deliberately sampling before and after specific causal events.
Ok, cool! Sounds like you’ve got it under control then. Might be neat for you to start a blog and share your knowledge, too.
Brent, sorry for late reply, I’ve just discovered the notification (one year later!). I’ll make sure this doesn’t happen again.
While you could potentially assign a specific physical read to a specific query, its tuning value is close to zero. From my tuning experience, physical reads are an almost useless measure. When I focus on logical reads for a specific query, I have much better results. If you’re able to reduce the number of logical reads, you’re reducing physical reads as well.
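To see the logical-versus-physical split being described here, SET STATISTICS IO reports both per query in the Messages tab; a minimal sketch (the numbers it prints will depend entirely on your data and buffer pool):

```sql
-- Report logical and physical reads for each statement in this session.
SET STATISTICS IO ON;

SELECT COUNT(*) FROM dbo.Posts;
-- Messages tab shows scan count, logical reads, physical reads,
-- and read-ahead reads for the Posts table.

SET STATISTICS IO OFF;
```

Logical reads stay stable between runs of the same query, which is what makes them a repeatable tuning target, unlike physical reads, which vanish once the pages are cached.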
PAGEIOLATCH gives you a good I/O overview for the instance and/or database, not per query. If you happen to run multiple invocations of the same query requiring the same pages, the physical I/O wait statistics are completely different for each invocation – just by chance. If you add more concurrent queries to the mix, not requiring exactly the same set of pages as the former ones, but sharing some of them, it makes the statistic picture even more blurry. This is why I think they’re useless.
In other words, in a scenario where multiple running queries share some pages, the occurrence of PAGEIOLATCH_SH can be considered random. No reliable conclusions can be drawn from these waits.
Here’s a description of a simple example.
Imagine two instances of SELECT SUM(TotalDue) FROM Sales.SalesOrderHeader starting a very short moment apart from each other. They need the same pages. If those pages are not in the buffer pool, they need to be read from disk. Imagine that the first query has enough time to read one page from disk before query 2 starts. So query 1 has one PAGEIOLATCH_SH to read page 1, and query 2 doesn’t, because it can just read it from the buffer pool. And maybe one of those queries sat in SUSPENDED or RUNNABLE state long enough for the other one to read other pages in the meantime?
That’s just my POV, but I’m open to other arguments.
I personally disagree with what you’re saying. I want that information, not to know whether the query always has to wait a long time for disk, but to know that in a particular execution, the factors that influenced it and accounted for the extra time it took were X, Y, and Z.
For troubleshooting performance you are completely right, but to know why my query is taking some undetermined amount of time, that information would be extremely helpful.
My actual execution plans no longer include wait stats from a database at compatibility level 150 on SQL Server 2019. I have *.sqlplan files from before 2021-11-16 with occurrences of “<Wait WaitType=", but I don't see that string in any of my more recent sqlplan files. Could I have accidentally switched that feature off?
Link to Pedro’s re-opening of the issue just goes to a generic Bing search page…
Yes, time has moved on, sadly – Microsoft doesn’t believe in maintaining links from old systems. Sad, isn’t it?