I was writing another query, and became enamored with the fact that HAVING will accept IS NULL or IS NOT NULL as a predicate.
What I ended up writing as an example was this query:
SELECT v.PostId, SUM(v.UserId) AS whatever
FROM dbo.Votes AS v
WHERE v.UserId IS NULL
GROUP BY v.PostId
HAVING SUM(v.UserId) IS NOT NULL;
Why this query?
I figured the optimizer would take one look at it and bail out with a Constant Scan.
After all, the WHERE clause filters to only NULL UserIds, and this column is NULLable in the Votes table.
The HAVING could only ever produce a NULL. And according to the laws of Logical Query Processing, WHERE is processed earlier than HAVING is.
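That reasoning can be sketched in plain Python (a stand-in for the T-SQL, with `None` playing the role of NULL) by walking the clauses in logical processing order:

```python
# Hedged sketch: emulating SQL's logical processing order (WHERE, then
# GROUP BY, then HAVING) to show why the query above can never return rows.
# SUM over a group ignores NULLs (None here) and returns NULL when every
# input is NULL. The rows below are toy stand-ins for dbo.Votes.
from collections import defaultdict

def sql_sum(values):
    """Mimic T-SQL SUM(): ignore NULLs; an all-NULL input yields NULL."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None

# Toy rows: (PostId, UserId)
votes = [(1, None), (1, None), (2, None)]

# WHERE v.UserId IS NULL
filtered = [(p, u) for p, u in votes if u is None]

# GROUP BY v.PostId
groups = defaultdict(list)
for post_id, user_id in filtered:
    groups[post_id].append(user_id)

# HAVING SUM(v.UserId) IS NOT NULL
result = {p: sql_sum(u) for p, u in groups.items() if sql_sum(u) is not None}
print(result)  # {} -- every group sums to NULL, so no group survives
```

Every row that survives the WHERE has a NULL UserId, so every group's SUM is NULL and the HAVING throws all of them away.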
But that’s not what happens.
Query At Work
Just may kick my butt in the comments about why this doesn’t bail out. My soul and butt are prepared.
Thanks for reading!
The more I learn about SQL Server, the more I realize it is only as smart as the query you give it. Very powerful tool; proceed with caution.
Isaac — yeah, a lot of the time I’m amazed it can figure out what people are asking for, and I’m even more amazed at the ways people come up with to ask for things. Hahaha.
SUM() makes the column non-SARGable?
I can’t help but be amused at Sarge Argable’s name, or in short, S. Argable.
SUM() makes the column non-SARGable, and also strips out the NULL values (hence the Warning: NULL value eliminated… message).
Seems like the optimiser cannot think about the WHERE and the HAVING together (especially with that GROUP BY a-different-column-to-the-one-I’m-filtering-on).
I wonder if it could connect those 2 clauses if you were GROUPing by the actual UserID. (I’d test myself, but I’m at work. Shhhhh)
QT — hm. Well, when you’ve had a chance to test, lemme know!
It is reduced to a Constant Scan if UserId is marked NOT NULL.
That won’t work out too well on a column with NULLs in it…
Absolutely. Just an academic answer, nothing practical. Like the question… 😛
I could be wrong here, but I was always taught (and evidence has yet to contradict it) that the HAVING clause operates on the result set, so I’m guessing the optimizer isn’t smart enough to notice the contradiction until runtime.
What a silly oversight. Please send this patch for the SUM function to your bros at Microsoft.
if (identity.FullName == "Erik Darling") return null;
Billy — that’d hurt more if you returned 0.
Same kind of query on my local box resulted in a Constant Scan (SQL Server 2012, and the table has some 500 rows).
Oh, column in my case is not nullable.
I don’t know why, but it looks like SUM, AVG, etc. are all converted into something like:
CASE WHEN COUNT_BIG(Col) = 0 THEN NULL ELSE SUM(Col) END
MIN and MAX work as expected. COUNT/COUNT_BIG do *not* work (which seems weird). I’m guessing the rewrite of SUM above can’t be pushed down because COUNT_BIG can’t be optimized away.
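That internal rewrite can be sketched in Python (a hedged stand-in for the T-SQL semantics, with `None` as NULL; COUNT_BIG counts only non-NULL values):

```python
# Hedged sketch of the rewrite described above:
#   SUM(col) => CASE WHEN COUNT_BIG(col) = 0 THEN NULL ELSE SUM(col) END
# where COUNT_BIG only counts non-NULL values.
def count_big(values):
    """Mimic COUNT_BIG(col): count non-NULL values only."""
    return sum(1 for v in values if v is not None)

def rewritten_sum(values):
    # CASE WHEN COUNT_BIG(col) = 0 THEN NULL ELSE SUM(col) END
    if count_big(values) == 0:
        return None
    return sum(v for v in values if v is not None)

print(rewritten_sum([None, None]))  # None -- the all-NULL group case
print(rewritten_sum([1, None, 2]))  # 3 -- NULLs eliminated, then summed
```

The COUNT_BIG branch is what produces NULL for an all-NULL group, which lines up with the Compute Scalar expression seen in the plan.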
Of course, Paul White wrote something about this: http://sqlblog.com/blogs/paul_white/archive/2011/02/27/bug-slow-sums-and-averages.aspx
Microsoft’s documentation for the SUM function states:
“Null values are ignored.”
Using the http://data.stackexchange.com/ tool to run your query, I first checked that there was a stats object on the dbo.Votes.UserID field:
SELECT s.name AS stats_name, sc2.name AS column_name
FROM sys.stats AS s
INNER JOIN sys.stats_columns AS sc
ON s.object_id = sc.object_id
AND s.stats_id = sc.stats_id
INNER JOIN sys.syscolumns AS sc2
ON s.object_id = sc2.id
AND sc.column_id = sc2.colid
WHERE s.object_id = OBJECT_ID('dbo.Votes');
Is there a stats object on that column in your copy? If there isn’t one, then obviously different work would need to be done by the optimizer.
Now for running your query, to test the opposite case: if you flip the IS NULL/IS NOT NULL checks and CAST UserId as a BIGINT to prevent arithmetic overflow, like so:
SELECT v.PostId, SUM(CAST(v.UserId AS BIGINT)) AS whatever
FROM dbo.Votes AS v
WHERE v.UserId IS NOT NULL
GROUP BY v.PostId
HAVING SUM(CAST(v.UserId AS BIGINT)) IS NULL;
You get basically the same plan, where the FILTER operator applies the HAVING after the WHERE data comes back, because it’s still possible for a SUM to produce a NULL value. However, in the revised statement the COMPUTE SCALAR operator for the BIGINT cast comes before the HASH MATCH operator, and there isn’t a second COMPUTE SCALAR operator, resulting in a more efficient plan. The COMPUTE SCALAR operator from your query is
[Expr1001] = Scalar Operator(CASE WHEN [Expr1006]=(0) THEN NULL ELSE [Expr1007] END)
This doesn’t exist in the plan for the flipped query.
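The flipped query's logic can be sketched the same way in Python (a hedged stand-in, with `None` as NULL and toy rows in place of dbo.Votes): once the WHERE keeps only non-NULL UserIds, every group has at least one value, so its SUM can never be NULL and the HAVING rejects every group.

```python
# Hedged sketch of the flipped query: WHERE IS NOT NULL + HAVING SUM IS NULL.
def sql_sum(values):
    """Mimic T-SQL SUM(): ignore NULLs; an all-NULL input yields NULL."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None

# Toy rows: (PostId, UserId)
votes = [(1, 10), (1, None), (2, 20)]

# WHERE v.UserId IS NOT NULL
filtered = [(p, u) for p, u in votes if u is not None]

# GROUP BY v.PostId
groups = {}
for post_id, user_id in filtered:
    groups.setdefault(post_id, []).append(user_id)

# HAVING SUM(v.UserId) IS NULL -- no group can pass: each has >= 1 value
result = {p: sql_sum(u) for p, u in groups.items() if sql_sum(u) is None}
print(result)  # {} -- zero rows, yet the engine still scans to find out
```

Which matches the observation above: zero rows come back, but the work to read and aggregate the data still happens before the filter.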
In both cases the execution plan’s estimated row count is 1 and the actual is 0. However, the STATISTICS IO results show no actual worktable work being done:
(0 row(s) returned)
Table 'Votes'. Scan count 7, logical reads 650511, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
I’m not sure if that part is due to the nature of the site, versus clearing the cache and running it clean.
It looks like HAVING SUM bails in the case for IS NULL vs the case for IS NOT NULL, but it still needs to do the work to get the data before applying the filter.