How to Check for Non-Existence of Rows

Last Updated July 20, 2018

You’re writing a query, and you wanna check to see if rows exist in a table.

I’m using the free Stack Overflow database, and I wanna find all of the users who have not left a comment. The tables involved are:

In dbo.Users, the Id field uniquely identifies a user.
In dbo.Comments, there’s a UserId field that links to who left the comment.

A quick way to write it is:

SELECT u.*
  FROM dbo.Users u
  WHERE NOT EXISTS (SELECT * FROM dbo.Comments c WHERE c.UserId = u.Id);

SELECT u.*

FROM dbo.Users u

WHERE NOT EXISTS (SELECT * FROM dbo.Comments c WHERE c.UserId = u.Id);

And this works fine. When you read the query, you might think SQL Server would run that SELECT * FROM dbo.Comments query for every single row of the Users table – but it’s way smarter than that, bucko. It scans the Comments index first because it’s much larger, and then joins that to the Users table. Here’s the plan:

But another way to write that same query is:

SELECT u.*
  FROM dbo.Users u
  LEFT OUTER JOIN dbo.Comments c ON c.UserId = u.Id
  WHERE c.Id IS NULL;

SELECT u.*

FROM dbo.Users u

LEFT OUTER JOIN dbo.Comments c ON c.UserId = u.Id

WHERE c.Id IS NULL;

This can be a little tricky to wrap your head around the first time you see it – I’m joining to the Comments table, but it’s an optional (left outer) join, and I’m only pulling back rows where the Comments primary key (Id) is null. That means, only give me Users rows with no matching Comments rows.

This join plan is completely different: there’s no stream aggregate, and now there’s a filter (c.Id IS NULL) that occurs after the merge join:

It’s completely different:

The Users table is processed first
There’s a different kind of merge join (left outer)
There’s a filter after the join

To see which one performs better, let’s use the metrics I explain in Watch Brent Tune Queries: logical reads, CPU time, and duration. In the 10GB StackOverflow2010 database, both queries do identical logical reads and duration, but the join technique uses around 20-30% more CPU time on my VM.

But don’t draw a conclusion from just one query.

I tell my developers, write your queries in whatever way feels the most intuitively readable to you and your coworkers. If you can understand what’s going on easily, then the engine is likely to, as well. Later, if there’s a performance problem, we can go back and try to nitpick our way through different tuning options. The slight pros and cons to the different approaches are less useful when you’re writing new queries from scratch, and more useful when you’re tuning queries to wring every last bit of speed out of ’em.

Why Does My Select Query Have An Assert?

The First 3 Things I Look At on a SQL Server

8 Comments. Leave new

Joe Celko
August 9, 2018 11:12 am

Can you let us know if anyone came up with a better (or just different!) Solution for the been packing problem? I would like to keep my notes current, in case I ever update an edition of one of my books.

Reply
- Brent Ozar
  August 9, 2018 11:14 am
  
  Hi Joe – no, I just linked folks to the Stack question. If an answer showed up, it’d be over there.
  
  Reply
- H
  August 20, 2018 3:14 am
  
  Joe, I assume you are already aware of the bin packing algorithm I published in 2011?
  If not, here is a link: https://sqlserverfast.com/?s=bin+packing
  (Note that the posts have moved so some formatting and links might be off. You can always send me an email if you want to know more)
  
  Reply
Mark Freeman
August 9, 2018 11:49 am

I keep seeing developers use WHERE NOT IN(), which is almost always horrible for performance.

Reply
Calin Oprea
August 9, 2018 2:24 pm

NOT IN, NOT EXISTS, LEFT JOIN… there is one difference, at the logical processing level, mind that: NOT EXISTS uses two value logic, namely TRUE and FALSE, while the rest use three value logic; TRUE, FALSE, and UNKNOWN (our NULLs). so it kinda makes sense that NOT EXISTS might use less resources.

Reply
Jason Smith
August 13, 2018 8:08 am

Just a thought; in the example, the requirement is to return a list of users that have not left a comment. Given the table structure and no insider knowledge of the data we should assume that a user can leave more than one comment. In one-to-many relationship structure the NOT EXISTS method is the most accurate since if a user has left 100 comments that user will only be returned once where as in the LEFT JOIN that same user will return 100 times. Of course we can add a DISTINCT or GROUP BY and exclude any comment table data that would expand the data sets cardinality, however the execution plans will vary depending on comment per user count. Would you say in general, that this technique is better for performance mostly when your primary and foreign table have a one-to-one relationship or does the LEFT OUTER technique still consistently yield better performance even with the DISTINCT?

Reply
- Brent Ozar
  August 13, 2018 8:09 am
  
  Jason – you’re misreading the code a little. Note the WHERE clause – it only produces rows where Comment.Id is null, meaning no rows will be returned for users who have left a comment.
  
  Reply
Lee Malatesta
August 13, 2018 10:15 am

For readability, I think the EXCEPT key word would be the way to go here. You’d have to get a little fancy if you need data other than ID. But if you just need a list of IDs.

SELECT u.id
FROM dbo.Users u
EXCEPT
SELECT c.UserId from dbo.Comments c

Reply