How to Check for Non-Existence of Rows

You’re writing a query, and you wanna check to see if rows exist in a table.

I’m using the free Stack Overflow database, and I wanna find all of the users who have not left a comment. The tables involved are:

  • In dbo.Users, the Id field uniquely identifies a user.
  • In dbo.Comments, there’s a UserId field that links to who left the comment.

A quick way to write it is:

And this works fine. When you read the query, you might think SQL Server would run that SELECT * FROM dbo.Comments query for every single row of the Users table – but it’s way smarter than that, bucko. It scans the Comments index first because it’s much larger, and then joins that to the Users table. Here’s the plan:

Doing the scan-scan
Doing the scan-scan

But another way to write that same query is:

This can be a little tricky to wrap your head around the first time you see it – I’m joining to the Comments table, but it’s an optional (left outer) join, and I’m only pulling back rows where the Comments primary key (Id) is null. That means, only give me Users rows with no matching Comments rows.

This join plan is completely different: there’s no stream aggregate, and now there’s a filter (c.Id IS NULL) that occurs after the merge join:

Filters are like rose-colored glasses

It’s completely different:

  • The Users table is processed first
  • There’s a different kind of merge join (left outer)
  • There’s a filter after the join

To see which one performs better, let’s use the metrics I explain in Watch Brent Tune Queries: logical reads, CPU time, and duration. In the 10GB StackOverflow2010 database, both queries do identical logical reads and duration, but the join technique uses around 20-30% more CPU time on my VM.

But don’t draw a conclusion from just one query.

I tell my developers, write your queries in whatever way feels the most intuitively readable to you and your coworkers. If you can understand what’s going on easily, then the engine is likely to, as well. Later, if there’s a performance problem, we can go back and try to nitpick our way through different tuning options. The slight pros and cons to the different approaches are less useful when you’re writing new queries from scratch, and more useful when you’re tuning queries to wring every last bit of speed out of ’em.

Previous Post
Why Does My Select Query Have An Assert?
Next Post
The First 3 Things I Look At on a SQL Server

8 Comments. Leave new

  • Can you let us know if anyone came up with a better (or just different!) Solution for the been packing problem? I would like to keep my notes current, in case I ever update an edition of one of my books.

    Reply
  • I keep seeing developers use WHERE NOT IN(), which is almost always horrible for performance.

    Reply
  • NOT IN, NOT EXISTS, LEFT JOIN… there is one difference, at the logical processing level, mind that: NOT EXISTS uses two value logic, namely TRUE and FALSE, while the rest use three value logic; TRUE, FALSE, and UNKNOWN (our NULLs). so it kinda makes sense that NOT EXISTS might use less resources.

    Reply
  • Just a thought; in the example, the requirement is to return a list of users that have not left a comment. Given the table structure and no insider knowledge of the data we should assume that a user can leave more than one comment. In one-to-many relationship structure the NOT EXISTS method is the most accurate since if a user has left 100 comments that user will only be returned once where as in the LEFT JOIN that same user will return 100 times. Of course we can add a DISTINCT or GROUP BY and exclude any comment table data that would expand the data sets cardinality, however the execution plans will vary depending on comment per user count. Would you say in general, that this technique is better for performance mostly when your primary and foreign table have a one-to-one relationship or does the LEFT OUTER technique still consistently yield better performance even with the DISTINCT?

    Reply
    • Jason – you’re misreading the code a little. Note the WHERE clause – it only produces rows where Comment.Id is null, meaning no rows will be returned for users who have left a comment.

      Reply
  • Lee Malatesta
    August 13, 2018 10:15 am

    For readability, I think the EXCEPT key word would be the way to go here. You’d have to get a little fancy if you need data other than ID. But if you just need a list of IDs.

    SELECT u.id
    FROM dbo.Users u
    EXCEPT
    SELECT c.UserId from dbo.Comments c

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.