Beyond Key Lookups
I’m going to show you two queries, and two query plans. The two queries are only different by one selected column, but the two query plans are wildly different.
Unless I’ve talked to you in the last few days, these may surprise you as well.
Here’s the first one.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
DECLARE @DisplayName NVARCHAR(40) SELECT @DisplayName = u.DisplayName FROM dbo.Users AS u LEFT JOIN ( SELECT p.OwnerUserId, COUNT_BIG(*) AS Records FROM dbo.Posts AS p LEFT JOIN dbo.PostTypes AS pt ON p.Id = pt.Id AND pt.Id = 2 WHERE p.Score >= 10 GROUP BY p.OwnerUserId HAVING COUNT_BIG(*) >= 10 ) AS pp ON pp.OwnerUserId = u.Id LEFT JOIN ( SELECT p2.OwnerUserId, COUNT_BIG(*) AS Records FROM dbo.Posts AS p2 JOIN dbo.PostLinks AS pl2 ON p2.Id = pl2.PostId AND pl2.LinkTypeId = 1 WHERE pl2.CreationDate >= '2016-01-01' GROUP BY p2.OwnerUserId HAVING COUNT_BIG(*) >= 2 ) AS pl ON pl.OwnerUserId = u.Id AND pl.OwnerUserId = pp.OwnerUserId LEFT JOIN ( SELECT DISTINCT p3.OwnerUserId FROM dbo.Votes AS v JOIN dbo.VoteTypes AS vt ON v.VoteTypeId = vt.Id AND vt.Id = 2 JOIN dbo.Posts AS p3 ON p3.Id = v.PostId ) AS pv ON pv.OwnerUserId = u.Id AND pv.OwnerUserId = pl.OwnerUserId LEFT JOIN ( SELECT DISTINCT c.UserId, c.PostId FROM dbo.Comments AS c JOIN dbo.Users AS u2 ON c.UserId = u2.Id JOIN dbo.Posts AS p4 ON c.PostId = p4.Id WHERE c.CreationDate >= '2016-01-01' AND c.Score >= 1) AS cc ON cc.UserId = pv.OwnerUserId AND cc.UserId = u.Id; |
Here’s the second one.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
DECLARE @DisplayName NVARCHAR(40) SELECT @DisplayName = u.DisplayName FROM dbo.Users AS u LEFT JOIN ( SELECT p.OwnerUserId, COUNT_BIG(*) AS Records FROM dbo.Posts AS p LEFT JOIN dbo.PostTypes AS pt ON p.Id = pt.Id AND pt.Id = 2 WHERE p.Score >= 10 GROUP BY p.OwnerUserId HAVING COUNT_BIG(*) >= 10 ) AS pp ON pp.OwnerUserId = u.Id LEFT JOIN ( SELECT p2.OwnerUserId, COUNT_BIG(*) AS Records FROM dbo.Posts AS p2 JOIN dbo.PostLinks AS pl2 ON p2.Id = pl2.PostId AND pl2.LinkTypeId = 1 WHERE pl2.CreationDate >= '2016-01-01' GROUP BY p2.OwnerUserId HAVING COUNT_BIG(*) >= 2 ) AS pl ON pl.OwnerUserId = u.Id AND pl.OwnerUserId = pp.OwnerUserId LEFT JOIN ( SELECT DISTINCT p3.OwnerUserId FROM dbo.Votes AS v JOIN dbo.VoteTypes AS vt ON v.VoteTypeId = vt.Id AND vt.Id = 2 JOIN dbo.Posts AS p3 ON p3.Id = v.PostId ) AS pv ON pv.OwnerUserId = u.Id AND pv.OwnerUserId = pl.OwnerUserId LEFT JOIN ( SELECT DISTINCT c.UserId--, c.PostId /*Just one column removed*/ FROM dbo.Comments AS c JOIN dbo.Users AS u2 ON c.UserId = u2.Id JOIN dbo.Posts AS p4 ON c.PostId = p4.Id WHERE c.CreationDate >= '2016-01-01' AND c.Score >= 1) AS cc ON cc.UserId = pv.OwnerUserId AND cc.UserId = u.Id; |
Part 2 coming up!
In part 2, I’ll discuss what happened and cite a couple of my favorite sources. In the meantime, feel free to have at it in the comments.
Unless you’re someone I talked about this with in the last couple days!
Thanks for reading!
27 Comments. Leave new
I am confused. You are saying that second plan was generated off of that second query? That is not even possible. Many of the tables used in the query are not represented at all in the second plan:
https://www.brentozar.com/pastetheplan/?id=rkwMUlmYZ
Did you accidentally post the wrong second plan?
Tim — nope! It’s exactly the plan from that query.
Hmmmm, join elimination possibly?
Wow, that’s amazing. Here’s my guess.
I noticed that for each outer join to a derived table, the cardinality of each userid on the right side is guaranteed to be 0 or 1. And because no columns are used come from that derived table, that whole outer join will have no effect on the results. So SQL Server can logically eliminate those joins from the query plan.
Except in the initial query, in the derived table “cc”, the cardinality of each userid is 0,1 or many. So SQL Server will have to do the work to figure out how many. This particular join cannot be logically eliminated.
… and here’s the rest of the mystery. For some reason, SQL Server freaks out and won’t ignore any of the outer joins to the derived tables. I guess it figures if it can’t ignore every derived table then It will execute every derived table.
I have no idea why that last mystery. It seems to me that SQL Server could in theory optimize a little bit more than it does in the original query. I look forward to part 2!
Doesn’t SQL Server have to determine all the other derived tables because “cc” is joined to “pv” on cc.UserId = pv.OwnerUserId? And so on up the line pulling in the other derived tables.
I think you’re onto something David!
There are multiple equivalences everywhere which means it’s still technically possible for SQL Server to eliminate some of those joins.
But I learned recently that “given multiple equivalences in a query, the optimizer won’t always explore all possible combinations”. Perhaps that’s the case here too.
I think you’re still right David, but I take back my comment. The post I linked to is about inner joins not outer joins. The chain of left joins I think is enough on its own to require executing every derived table. Because the 0,1 cardinality of each userid has to be calculated in each derived table.
Thanks, that was an interesting post none-the-less.
Looking at the query again, I’m wondering now if the fact that the “cc” derived table join is causing some of the problem as well. It is the only one that does not join to the dbo.Users table first and another derived table second.
ON cc.UserId = pv.OwnerUserId
AND cc.UserId = u.Id;
I know logically it shouldn’t, but since Brent has us chasing minutia, I can’t help but wonder if that is part of the issue as well.
Oops, I should have said Erik has us chasing minutia. Sorry Erik.
The blogger name isn’t quite minutia. 😉
Hmmmm, join elimination possibly?
Err, or what @Michael said so much more eloquently. 🙂
Erik – I’d be interested in seeing the query plan with “option (maxdop 1)” on please. The query plan looks complicated, but I have a feeling that is just because of the parallelism and its splitting work between processors.
If we limit this to one proc, the query plan will probably look as expected where its joining the Users table to the 4th left join sub query.
As a side note, you can get these expected results if your DB schema matches the one to one criteria and you are not including one of the tables in the select clause.
Andy – bad news. Here’s the plans with MAXDOP 1 per your request:
Query 1: https://www.brentozar.com/pastetheplan/?id=Hy9RF8ntb (estimated, because it takes so dang long to run on a single core)
Query 2: https://www.brentozar.com/pastetheplan/?id=SkVsFUnt-
It was an interesting thought though! Next guess?
SELECT DISTINCT c.UserId, c.PostId
VS
SELECT DISTINCT c.UserId, –c.PostId
Probably much less users then there are users and posts… right?
That is what changed, but think more about what the combination could produce for results rather than the number of results it could produce.
Just spit balling here, but looking at the last query by itself, its joining to Users and Posts tables but not referencing them other that filtering the result set.
If there is a trusted FK relationship between comments and users, any comments.userID would exist in the users table so maybe the optimizer can ignore reading comments and posts since the users.id list is inclusive of anything it’d find. But when the comments.postID is included then it has to actually read the comments table to get that info resulting in the more complex plan.
At least thats my random guess.
Hi Bob,
There aren’t any foreign keys in the database at all.
Ok, I actually didn’t take a close look at the query. I had assumed you were joining everything straight off the User’s table UserId, but you are doing two joins, one to Users and then one from the previous subquery’s left join.
If you remove that and have them all join off Users UserId, I get a query plan with just the Users and Comments table which is interesting.
As a side note, I had re-created the tables with a bit of guess work for columns and such. Is the first subquery supposed to join “ON p.PostTypeId = pt.Id” and not “ON p.Id = pt.Id”?
All the derived tables /sub queries can be eliminated as they are LEFT joined, and provide no results for the select.
Query optimiser hard at work fixing code!
Including the PostId field creates duplicate rows in the subquery ‘cc’. When left joining a table with a query/table containing duplicates for the join field, the end result also contains those duplicates.
Simple example:
SELECT pt.Id FROM PostTypes AS pt;
SELECT pt.Id FROM PostTypes AS pt
LEFT JOIN (SELECT PostTypeId FROM Posts) AS p
ON p.PostTypeId = pt.Id;
The first query returns 8 records in a split second. The second one about 30 million rows according to the estimated plan, which is the number of records in the Posts table…
The lesson is therefore never to perform a left join on a table/query with duplicates.
Richard – I’m leaving this comment more for other readers than for you (because you know this already) – here’s a slightly different pair of queries that illustrate the difference:
Query with duplicate values produced from the LEFT JOIN:
SELECT pt.Id FROM PostTypes AS pt
LEFT JOIN (SELECT PostTypeId FROM Posts) AS p
ON p.PostTypeId = pt.Id;
Plan: https://www.brentozar.com/pastetheplan/?id=SJxH7vaF-
Query with only unique values produced from the LEFT JOIN:
SELECT pt.Id FROM PostTypes AS pt
LEFT JOIN (SELECT Id FROM Posts) AS p
ON p.Id = pt.Id;
Plan: https://www.brentozar.com/pastetheplan/?id=rkXOmPTKW
The completely different query plans make sense. The first three sub-queries guarantee a unique UserID value returned. In the second query the fourth sub-query guarantees a unique UserID returned as well, so none of the queries have an impact on the result set, as none of their columns are in the select clause. So they can all be safely eliminated.
Since the fourth sub-query in the first query does not guarantee a unique UserID in the result set, it cannot eliminate the sub-query as you may get multiple results. The thing I didn’t notice at first glance with the first query, is that you are actually joining each sub-query off the last sub-query in addition to the users table.
So we actually need the results of the previous 3 sub-queries to perform the join for the 4th. If any of the first three sub-queries result in a null row returned for the userid, it will result in a null set for the 4th sub-query.
IE. It looks like the first sub-query finds users who have a post that has scored 10 or greater and the fourth sub-query finds users and comments with a score 1 or greater. If a user never got a post to a score of 10 or higher, but they did get multiple comments to a score of 1 or higher, their name would not be repeated for each comment that was 1 or higher, it would only result in their name displayed once.
I would have expected each sub-query to be joined straight off the Users table. If it was, then the first, second and third sub-queries would be eliminated from the query. But I’m not sure what you were using the query for.
The first three subqueries pp, pl and pv are all eliminated. It doesn’t matter what filters or aggregations are used. They don’t do anything since they are LEFT JOINed. The same for the last subquery cc when the PostId column is removed.
The LEFT JOIN inside the first subquery is redundant as is the join to the Posts table in the last subquery.
When you remove the first three subqueries and the redundant stuff you’re left with the following:
SELECT
TheDisplayName = u.DisplayName
FROM
dbo.Users AS u
LEFT JOIN
(
SELECT DISTINCT
c.UserId,
c.PostId /*Just one column removed*/
FROM
dbo.Comments AS c
JOIN
dbo.Users AS u2
ON c.UserId = u2.Id
WHERE
c.CreationDate >= ‘2016-01-01’
AND c.Score >= 1
) AS cc
ON cc.UserId = u.Id;
You could even remove the where clause for simplicities sake. The cc subquery does not affect the outcome given the left join. Or does it?
Turns out that when left joining a table with duplicates for the join field the end result also contains those duplicates.
For the first query, the first three sub-queries DO matter and cannot be eliminated. The fourth query joins off the results of the third, the third off the second and the second off the first. Below is a simplification of why the query plans on different and how the results can differ since the joins are based off not only the Users table but the previous Sub-query table.
create table TopDawg (
ID int not null identity(1,1) primary key,
Description varchar(8)
)
insert into TopDawg values ( ‘One’ )
insert into TopDawg values ( ‘Two’ )
create table OneOneDawg (
ID int not null identity(1,1),
TopDawgID int not null unique constraint fk_TopDawg_OneOneDawg references TopDawg(ID),
Description varchar(8)
)
insert into OneOneDawg values ( 1, ‘OneOne’)
create table OneManyDawg (
ID int not null identity(1,1),
TopDawgID int not null constraint fk_TopDawg_OneManyDawg references TopDawg(ID),
Description varchar(8)
)
insert into OneManyDawg values ( 1, ‘OneMany’)
insert into OneManyDawg values ( 2, ‘TwoMany’)
insert into OneManyDawg values ( 2, ‘2Many’)
select td.*
from TopDawg td
left join OneOneDawg ood
on ood.TopDawgID=td.ID
left join OneManyDawg omd
on omd.TopDawgID=td.Id
–and omd.TopDawgID=ood.ID /*Commented out returns three rows, uncommented returns 2*/
Yes, they can be eliminated. ALL subqueries are LEFT JOINed. The results of the subqueries are a subset of the UserId’s in the Users table, so no need to query them all. That’s the reason SQL Server sticks to a simple scan of the Users table.
With the extra PostId column however you create a whole new problem.
Looks like it is doing some type of elimination — If you look at the XML you see a couple missing index hints in the first and none in the second.
It also looks like one thread did a lot more work in one op than the others (if not maxdop 1, maybe 4? or 2?)…probably doesn’t explain the difference though…