The Joy of Joining on NULLs

With all the trouble NULLs cause…

You’d think people would be more inclined to avoid them. Slap a NOT NULL constraint and a default value on your column and call it a day. I’m perfectly fine with bizarro world canary values. If it’s an integer column, some really high (low?) negative number. If it’s date-based, why not have it be the lowest value your choice accomodates?

But no, no, no. Instead of stopping the bleeding, developers tend to just avoid getting blood on the new carpet. Some of the worst performing queries I see have ISNULL(something, '') = ISNULL(somethingelse, '') either in the JOIN or WHERE clause. Still. To this day. In the age of the hoverboard, people are still putting functions where they don’t belong. And I know they have the internet.

But did you really mean that?

Most people will slap together their ISNULL manifesto and call it the dog end of a day gone by. But most of them don’t realize how that changes the logic of the query.

For instance, I see ISNULL(intcol, 0) = 0 pretty often. Most people are surprised that rows with 0 in them will also be included here.

But for joins, well, let’s look at what’s really going on.

Teeny tiny tables, appear!

We don’t even need 100 rows for this, but whatever. Let’s join these muddy truckers. Here’s a pretty average inner join, on both ID and OrderDate. It returns 53 rows; all 100 IDs match, but the NULL dates prohibit some joins from occurring.

The results will be something like this. IDs are missing where dates are NULL. For us, that’s every 3rd column in t1, and every 5th column in t2.

Missing links

Missing links

Most people will think this will get them out of the woods, but it only fixes six of the NULL joins.

Hey! Something happened!

Hey! Something happened!

We’ll call them the FizzBuzz six, because…

Jammin' on the 3. And 5.

Jammin’ on the 3. And 5.

You guessed it, only numbers divisible by three AND five get joined on our canary value. If we switch over to a left join, we’ll at least see where the gaps are.

And uh, carry the... Whatever.

And uh, carry the… Whatever.

Since t1 is the left side, the gaps are every 3rd ID. With t2 on the right side of the join, every 3rd and 5th ID is NULL. To really get the full picture, we’ll need to switch to a full join. This is instructive because it gives us ALL the unmatched rows.

Check out the next three queries, and then why each is important.

For the full join with no WHERE clause, we get back 147 rows. In the next two queries, we filter out NULLs from each of the date columns.

In the second query, 33 rows are missing. That makes sense, because there are 33 rows between 1 and 100 that are evenly divisible by 3.

For the third query, there are 20 rows missing. If you’re not a Mathlete, yes, 20 rows between 1 and 100 are evenly divisible by 5.

This goes back to our unfiltered query: 200 – 20 – 33 = 147.

There are your missing 57 rows.

So what did you really need?

Think carefully about what you’re trying to return with your query. It’s highly unlikely that ISNULL solved your problem, and that you needed all matching results, plus matching results where BOTH are NULL. If you did, this would be a better way to do that:

Killing Join

Remember that joins describe the relationship between your tables, and the where clause describes what you want (or don’t want) to keep from the table relationship.

Think about the first time you stupidly got an apartment with someone you just started dating. You joined your furniture based on the rooms in the apartment, and then filtered out what you weren’t keeping.

If you were me, all your cool guy stuff ended up at the curb. Heckuva where clause, that was.

This post goes out to Paul White, who also wants his t-shirts back. You know who you are!

Thanks for reading!

Previous Post
Contest: SQL Server Theme Songs
Next Post
2016 Company Retreat Photos

43 Comments. Leave new

  • You could probably shorten your solution to

    SELECT *
    FROM [#t1] AS [t]
    JOIN [#t2] AS [t2]
    ON ([t2].[ID] = [t].[ID]
    AND [t2].[OrderDate] = [t].[OrderDate])
    OR
    ([t].[OrderDate] IS NULL
    AND [t2].[OrderDate] IS NULL)
    ORDER BY [t].[ID]

    Reply
    • Jonathan Shields
      March 19, 2016 6:53 am

      Nice article and debate. I agree that canary values can be good in the real world but the real problem is lack of understanding of what null does.

      I would rather see default values too but sometimes people don’t realise they are there and use isnull anyhow. Not defensible but it happens…

      Reply
    • Except that this is going to run quite badly. I’ve experienced this exact problem in production code before and the best result that could take advantage of the indexes on the table was the UNION solution. Especially once the WHERE criteria get a bit convoluted.

      Reply
  • Nicholas Peterson
    March 17, 2016 11:13 am

    I’ll be ‘That Guy’, just never use null values, and don’t use canary values either, because some other query will do an aggregate on that column and forget to exclude those values.

    Reply
    • So, if it’s not null, and not a canary value… What would be there?

      Reply
      • Nicholas Peterson
        March 17, 2016 1:36 pm

        Nothing, a different table that reflects that value with a foreign key back to the existing table.

        Reply
        • Erik Darling
          March 17, 2016 3:57 pm

          That sounds like a fast track to ending up with some nasty EAV data.

          Reply
          • Nicholas Peterson
            March 17, 2016 5:27 pm

            I’m not sure I follow, EAV tables tend be when you don’t know what you’re going to store or the matrix is so sparse that a single table would be too wide and empty. I’m suggesting that an optional value really should be moved to a related table to the primary one. Otherwise you have to contend with NULLS and their drawbacks.

          • Erik Darling
            March 18, 2016 8:00 am

            Two quick things:
            Not sure I’d want a whole separate table for optional values. Customer information for one: do I need another table for alt. phone or address line 2?

            Some values aren’t optional, just eventual. Take hospital stays. You may know the arrival date, but not the check out date.

  • Hello Erik

    I’m interested in this as we have quite a lot of clauses that say- where isnull(field, ‘n’) = ‘n’

    Are you just saying the last method (with the union all) is a cleaner way to return the data you want?

    I might have missed something but the query with the JOIN that used
    ISNULL([t2].[OrderDate], ‘1900-01-01’) = ISNULL([t].[OrderDate], ‘1900-01-01’)
    gave me the same results as the final query you wrote.

    and . . . looking at the execution plan it used less resources too.

    Reply
    • Cleaner and more obvious, yes.

      At 100 rows, I don’t expect it to make much of a difference either way.

      At scale, functions in WHERE clauses rob the QP of its ability to efficiently use indexes or statistics to generate smart plans.

      Check out this excellent post by Grant Fritchey from a few years back.
      https://www.simple-talk.com/sql/performance/the-seven-sins-against-tsql-performance/

      Reply
    • I am with Chris and thinking that may be in the final solution, the “AND” should be “OR”
      —————————————–
      WHERE [t].[OrderDate] IS NULL OR [t2].[OrderDate] IS NULL

      instead of

      WHERE [t].[OrderDate] IS NULL AND [t2].[OrderDate] IS NULL

      Reply
  • Nice post.

    The SQL Server team really needs to add function indexes support to SQL Server like Oracle has had for a while now.

    i.e., have support for “ISNULL(something, ”) = ISNULL(somethingelse, ”)” type joins without any performance penalties.
    https://oracle-base.com/articles/8i/function-based-indexes
    http://docs.oracle.com/cd/B19306_01/appdev.102/b14251/adfns_indexing.htm#i1006366

    This is as close as it gets to function indexes:
    http://stackoverflow.com/questions/22168213/function-based-indexes-in-sql-server

    Just my 2 cents. 🙂

    Reply
    • Hey! Yeah, those are pretty cool in Oracle, but I also wish Oracle had true filtered indexes.

      Then again, I wish SQL Server had filtered indexes the way Postgres does, where you can use <> and NOT IN in the filter definition…

      Reply
  • I agree. Haven’t worked with Postgres though!

    Reply
  • IMO, “magic values” (or what you are calling canary values) are evil. The violate the least astonishment rule for the next developer that discovers that ‘-12345.76’ has magic meaning. Further, those magic values have to bleed from the data tier, to the middle tier to the presentation tier. Absence of a value *should* use nulls. That is its purpose. Developers should not avoid nulls per se. They should *understand* nulls so that they use and consume them properly.

    Reply
    • With a little documentation, canary values won’t astonish anyone. Most developers I talk to are astonished when NOT IN is broken by NULLs. I’d wager it’s easier for most people to understand how -9999999 will act as a value than how NULLs will act as a not-value. As for bleeding, NULLs do the same thing, and there would be little difference in handling them at presentation-time regardless of the layer.

      Reply
      • Gotta say, i’m firmly with Thomas on this one. NULL is one of my favorite things about SQL (yes, really) I’ve never understood the lack of love for NULL out there on the part of data professionals.

        “With a little documentation, canary values won’t astonish anyone. Most developers I talk to are astonished when NOT IN is broken by NULLs.”

        agree with both points. however, with proper use of NULLability and defaults, which i would define as:

        1.) strive for as few NULL-able columns as possible
        2.) employ default values where a default is correct or natural [1]

        if you reject canary values, every value, including NULL (where allowed) is the *truth*, you need *no* anti-astonishment documentation: the schema *is* the documenation, and i can know what to expect just by reading it (the schema) without SELECT ing a single row. if canaries are in the mix, i gotta look at data–maybe even lots of it–and *then* some documentation, which may or may not even exist.

        but the bottom line is that canary values aren’t data: they’re logic, masquerading as data. i want data in my tables, and i want the schema to tell me what said data purports to be, and the engine to enforce its validity. because if i have those things, then the logic of the SQL code, if it is good, should pretty stand much on its own: it will be obvious what it is purporting to do, whether or not it does so correctly and efficiently, etc. etc.

        [1] for example: in a users table in a message board database, PostCount has a natural default of 0 for a new user, whereas there is no natural default for, say, LastPostID (though both would probably be dumb choices for static columns)

        [2] if you use canary values, you need at least as many values (potentially) as there are data types–and there are some datatypes (like datetime [3]) for which there is, i would argue, no such thing as a non-arbitrary canary. even one canary potentially begets a flock.

        [3] because, when you think about it, *all* dates are abitrary (1970-01-01 and 2012-12-21 are no more or less arbitrary than 2016-03-18, though all three represent real moments in time, their representations all depend upon some “event” which could’ve happened earlier, later, or not at all). whereas -1 (or any negative value) at least has some validity qua canary in a column that represents a count of physical objects.

        Reply
        • Mike – I totally hear where you’re coming from, and I think both sides make great points.

          Your data-vs-logic explanation is so well illustrated that I just wanted to add one small counterpoint: joining to an explanation record. For example, in a data warehouse, when we receive fact table data that doesn’t have a dimensional match, sometimes we want to add an “unknown” record that joins up to something else. For example, I’ve seen sales tables that had a SalespersonID field, and then a separate dimensional table for salespeople that had a RegionID field. Sales reports were run by region, and those had to be accurate, but sometimes online sales came in without an attached SalesPersonID. We could populate SalesPersonID with -1, put in a SalesPerson record with an ID of -1, and put them in a specific region (online rather than a geo).

          Yes, that totally violates your data-vs-logic, but in that example, we were strapped for time and had to get the data right without rebuilding the reports from scratch. Canary values give you a workable shortcut for that.

          (I was young and I needed the money.)

          Reply
        • Erik Darling
          March 18, 2016 8:37 am

          You’re pretty.

          Reply
      • Eric, if you teach a developer about the issues using IN with NULLs and NULL logic, it is knowledge that will serve them both in the system in question and all other systems in the future. Learning what one developer used for magic values in one system is knowledge that only serves in that system. Further, no new documentation is required with NULLs whereas whenever a new magic value is declared or changed, someone has to remember to keep the documentation current. My experience is that this documentation frequently sits next to the Bigfoot photos.

        Reply
        • Erik Darling
          March 21, 2016 5:48 pm

          If developers were all interested in proper NULL handling, I wouldn’t have to write posts like this. Would that everyone were properly trained the moment they opened SSMS… Besides, they don’t need to know what the magic values mean, just that they exist, and they don’t have to worry about NULLs. I do agree about documentation though. It’s always in short supply.

          Reply
  • Gulp . . . the nested views bit has me a little concerned.

    We’ve recently used views to overcome a problem where an external supplier is providing us a database but in the development stage the data keeps changing, so we’ve created views that refer to their data, then we only need to alter what those views point at, not the procedures that point at the views.

    Hopefully they’re simple enough for the optimiser to create good plans.

    Reply
  • SQL Server Performance Tune
    March 18, 2016 7:22 am

    “We don’t even need 100 rows for this, but whatever. Let’s join these muddy truckers. Here’s a pretty average inner join, on both ID and OrderDate. It returns 53 rows; all 100 IDs match, but the NULL dates prohibit some joins from occurring.”

    “We’ll call them the FizzBuzz six, because…”

    “This goes back to our unfiltered query: 200 – 20 – 33 = 147.”

    “There are your missing 57 rows.”

    53 rows returned = 47 missing.
    33 are excluded because of NULL on one side
    20 are excluded because of NULL on the other side
    6 have NULL on both sides

    33 + 20 – 6 = 47 missing because of NULL on any (not either) side.

    Reply
  • Jens-Peter Giersch
    March 18, 2016 9:25 am

    Dear Erik,

    thank you for that insight; I found it very interesting because I wasn’t aware of the problem before. Inspired me to make some checks.

    One really simple question: I do not understand “canary value”; didn’t find it in my dictionaries. Is there a synonym in English? Is it an artificial value used to replace the NULL value and an additional row in a referenced table stating that this value in reality doesn’t exist?

    Reply
    • Erik Darling
      March 18, 2016 9:32 am

      Hey! You know, I got the term canary value from Jeremiah. I’ve also heard “magic value” and, I guess, “placeholder” used to describe them as well, the latter being what you’re most likely to find in a dictionary.

      Reply
  • Isn’t this just about a violation of normal form? If some of your joins are happy with one field and others require a second field and that second field is often null, as PK fields cannot be, then of course complications arise.

    Reply
  • You might want to stop using the old Sybase ISNULL() and switch to the correct, modern, ANSI/ISO Standard COALESCE(). There are some subtle differences in them.

    Reply
    • Mr. Celko! Yeah, totally with you. This post is based on what I see people doing pretty often, so I used ISNULL() to make it more relate-able to casual readers.

      Reply
  • Haresh Ambaliya
    March 19, 2016 2:26 pm

    Hey Erik,

    Do you think that instead of using ISNULL everywhere I can Use Cross apply for that column and just convert that column like

    cross apply( isnull(#t1.OrderDate,’1900-01-01′) OrderDate) as t1

    Does this make sense?

    Reply
    • Erik Darling
      March 19, 2016 3:49 pm

      No, I think the only thing that’s going to help you is to either give the column a default value, or insert a filtered result into a temp table with the ISNULL() conversion there. That’s the only way SQL would be able to generate usable statistics on the column, or efficiently use any indexes you put on it.

      Reply
      • Haresh Ambaliya
        March 20, 2016 3:18 pm

        We are in a loop again. Inserting records into the public toilet (temp table) with the ISNULL() conversion again leads to performance issue. Is there any another solution?

        Reply
  • Nobody mentioned the IS [NOT] DISTINCT FROM comparison operator that got added by ANSI a short time ago. I acts like equality, but treats NULLs as the same. Details at:
    https://wiki.postgresql.org/wiki/Is_distinct_from

    Reply
    • Erik Darling
      March 20, 2016 6:39 pm

      Well, since it was added by ANSI a short time ago, we can expect MS adoption in 10 or so years.

      Reply
  • Even without function indexes or IS NOT DISTINCT FROM, you can do the following on SQL Server from at least 2008:

    SELECT *
    FROM [#t1] AS [t]
    JOIN [#t2] AS [t2]
    ON [t2].[ID] = [t].[ID]
    AND ([t2].[OrderDate] = [t].[OrderDate]
    OR ([t].[OrderDate] IS NULL
    AND [t2].[OrderDate] IS NULL))
    ORDER BY [t].[ID]

    There is logic in the query engine that recognizes the sub-expression in the join on OrderDate above and treats it as a single logical operation including NULLs. This is an exception to the usual advice that OR in a join condition is bad for performance.

    http://sqlblog.com/blogs/paul_white/archive/2011/06/22/undocumented-query-plans-equality-comparisons.aspx

    Reply
  • Martin Smith
    March 21, 2016 4:06 am

    Paul White blogged about this here http://sqlblog.com/blogs/paul_white/archive/2011/06/22/undocumented-query-plans-equality-comparisons.aspx

    You can use

    SELECT *
    FROM [#t1] AS [t]
    JOIN [#t2] AS [t2]
    ON EXISTS(SELECT [t].[ID],
    [t].[OrderDate]
    INTERSECT
    SELECT [t2].[ID],
    [t2].[OrderDate])

    Reply
  • Bit late to the party I know, but thought i’d add a word of support for nulls. One of the negatives raised for ISNULL / COALESCE is that the replacement value can occur in the data naturally and you accidentally join to populated data that happens to contain the same value – that applies to replacing nulls with high/low values too. It might be impossible to choose a value that as obtuse enough to be relied upon. In that case NULL is the only real option as it accurately represents the intended meaning of UNKNOWN. If you use any other value, that meaning is lost. At least with a NULL you have the ability, as with the suggested queries, to obtain the data you intended. Of course if your data originates from flat files, it might already have been lost…

    I’d also support Nicholas Peterson’s suggestion regarding using a relation where it’s in a relevant schema – it totally makes sense to have one of your lookup FK/PK values as NOT NULL and -1=unknown or similar.

    Reply
  • I realize this is an old blog post …but somewhat related is the disturbing “logic” that regards NULLs as equal when determining DISTINCT values – including when DISTINCT is implicit (like in EXCEPT).
    http://sqlrecords.blogspot.com/2017/02/null-is-theabsence-of-value.html

    Reply
  • There’s actually bit of history on this from the old ANSI X3H2 days. We have two things called equivalence relations in SQL.. An equivalence relation partitions a set into disjoint subsets, which are also a covering of the original set. The mod() function is one example, and so is plain old equality.. But in SQL, we also have “grouping”, and the rule in grouping is that it’s like equality with the addition that the NULLs test equal.

    This was not always the case in some of the very, very early SQL products. The thing that switch to the committee’s mind the example of traffic tickets. There’s been a clipping in the newspaper about traffic tickets issued in California. Cars with a missing tag were written up as “no tag” and since this was a string, everybody without a tag was seen as belonging to one vehicle. The poor master that was randomly picked by the database. Got several thousand tickets all in one day from all over the state..

    We then started thinking about the use of NULLs for missing values in such situations. Our decision was to put them in their own class, so that summations, counts, and other aggregates would not be done on each individual NULL. We then added a warning message to tell people when an aggregate had dropped a NULL.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Menu
{"cart_token":"","hash":"","cart_data":""}