WHERE GETDATE() BETWEEN StartDate AND COALESCE(CancelDate, EndDate) Is Even Harder to Tune.


In my last post, we started with a memberships table, and each membership had start & end dates. I’m going to create the table and populate it with everyone having an active membership – their StartDate is the same as their Stack Overflow account creation date, and their EndDate is around a year or two from now:
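(A sketch along these lines does it – dbo.Memberships is a stand-in table name of my own, I’m building it from the Stack Overflow Users table, and the DATEADD math is just one way to land EndDate a year or two out.)

DROP TABLE IF EXISTS dbo.Memberships;

CREATE TABLE dbo.Memberships
    (Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
     UserId INT,
     StartDate DATETIME,
     EndDate DATETIME,
     CancelledEarlyDate DATETIME NULL);

INSERT INTO dbo.Memberships (UserId, StartDate, EndDate)
SELECT Id,
       CreationDate,                               /* membership starts when the account was created */
       DATEADD(DAY, 365 + (Id % 365), GETDATE())   /* ends roughly 1-2 years from now */
FROM dbo.Users;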

If I run either of these queries to check how many active users we have:
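(Something like these two – one written with BETWEEN, one with explicit comparisons; the shape matters more than the exact text.)

SELECT COUNT(*)
FROM dbo.Memberships
WHERE GETDATE() BETWEEN StartDate AND EndDate;

SELECT COUNT(*)
FROM dbo.Memberships
WHERE StartDate <= GETDATE()
  AND EndDate   >= GETDATE();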

The actual plans show that SQL Server does a great job of accurately estimating that all 9 million of the rows in the table will match:

But now let’s introduce that CancelledEarlyDate.

I’m going to modify my reporting query to include a COALESCE, taking the first non-null column between CancelledEarlyDate and EndDate. If the user cancelled their membership early, we’ll use that cancellation date – otherwise, use EndDate. This is similar to the ISNULL function, but COALESCE lets me pass in multiple values rather than just two. (ISNULL performs the same in this case.)
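The reworked reporting query looks roughly like this (stand-in table name again):

SELECT COUNT(*)
FROM dbo.Memberships
WHERE GETDATE() BETWEEN StartDate
                    AND COALESCE(CancelledEarlyDate, EndDate);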

Note that I haven’t updated anyone’s CancelledEarlyDate yet – they’re all still null. SQL Server’s actual execution plans do a beautiful job of estimating that all 9M rows will still match our search. Note that we’re getting index seeks here (without key lookups) because I included the CancelledEarlyDate in my indexes this time around.
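Those indexes look something like this – the names and key order are my own, but the important part is that CancelledEarlyDate rides along as an included column:

CREATE INDEX IX_StartDate_EndDate
    ON dbo.Memberships (StartDate, EndDate)
    INCLUDE (CancelledEarlyDate);

CREATE INDEX IX_EndDate_StartDate
    ON dbo.Memberships (EndDate, StartDate)
    INCLUDE (CancelledEarlyDate);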

And if I update 50% of the rows to have a CancelledEarlyDate of yesterday:
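(The modulo trick below is just one way to hit half the rows.)

UPDATE dbo.Memberships
SET CancelledEarlyDate = DATEADD(DAY, -1, GETDATE())
WHERE Id % 2 = 0;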

Now our estimates go off the rails.

When we run the reporting queries again, the actual plans show that SQL Server believes all of the rows will match our filter – even though only half of them do. He estimates 9M rows will come back, but in reality, only 4.5M do:

In the last post, I had to introduce a more real-world data distribution to show how estimation went off the rails as your population started to shift. Here, I don’t – even a very simple population causes our query plans to go wildly wrong.

Now let’s take a breather here for a second because I can almost hear the comments rolling in.

First, you might be thinking that this particular illustration isn’t really all that bad: SQL Server uses the right access method (an index seek rather than a table scan), and it shouldn’t have approached this problem in any other way. However, in the real world, this incorrect row estimation has a huge impact on things like memory grants, how SQL Server approaches other tables in the query, and the order of operations in a query.

Second, you might be typing, “Brent, how might different indexing strategies fix this problem?” That’s why I use the free Stack Overflow database for my demos, and why I give you all of the scripts. You can go test out different approaches for yourself to learn.

Back to the stuff I wanna teach you here. I don’t think it’s fair for us to ask SQL Server to estimate this particular part of the query on the fly:
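WHERE GETDATE() BETWEEN StartDate
                    AND COALESCE(CancelledEarlyDate, EndDate)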

That’s just too hard. Sure, SQL Server may have statistics on each of those columns individually, but it doesn’t understand how those columns relate to each other on particular rows. He just has a rough idea of each one’s date distribution. In my simple scenario, it might be theoretically doable, but in real-world data distributions with memberships & policies spread across decades, it’d be really tough. You would want whole separate statistics based on the relationships between columns (and no, multi-column stats aren’t anywhere near that powerful.)

We can help by combining the two columns.

While SQL Server isn’t able to do this itself automatically, we can build our own computed column that performs the calculation for us:
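(The EffectiveEndDate name below is my own – the important part is that the expression matches the COALESCE in the query.)

ALTER TABLE dbo.Memberships
    ADD EffectiveEndDate AS COALESCE(CancelledEarlyDate, EndDate);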

And suddenly – with no other changes, like no indexes or no query modifications – the actual plan’s estimates are way better:

That’s because when we add a computed column, SQL Server automatically adds a statistic on that column, and it can use that statistic when estimating rows.
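If you want to see that statistic for yourself, peek at sys.stats on the table (my stand-in dbo.Memberships name again) – the auto-created statistic shows up with auto_created = 1:

SELECT s.name AS stats_name,
       c.name AS column_name,
       s.auto_created
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON  s.object_id = sc.object_id
    AND s.stats_id  = sc.stats_id
JOIN sys.columns AS c
    ON  sc.object_id = c.object_id
    AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID('dbo.Memberships');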

It’s still not perfect because the plan has a variety of issues:

  • We’re doing an index scan here, not a seek
  • The scan reads all 9M rows of the index by doing 37,591 logical reads
  • We’re calculating the COALESCE for every row, every time we run the query, burning about 1 second of CPU time

In this case, indexing the computed column solves all 3 problems.

I’ll create two indexes – one that leads with StartDate, and one that leads with the computed column – because SQL Server may choose different ones depending on our data distribution:
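(Index and column names below are my own stand-ins.)

CREATE INDEX IX_StartDate_EffectiveEndDate
    ON dbo.Memberships (StartDate, EffectiveEndDate);

CREATE INDEX IX_EffectiveEndDate_StartDate
    ON dbo.Memberships (EffectiveEndDate, StartDate);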

The new actual plans of our reporting queries show that we fixed ’em:

We get an index seek, we only read the rows we need to output, and we don’t have to compute the coalesced dates on every row. Yay!

But only in this case.

All of our reporting queries had better be exactly the same. Remember how I said earlier that I’m using COALESCE because it takes multiple arguments, and how ISNULL performed the same in that query? Not so much now. If I try ISNULL here, SQL Server ignores the computed column altogether.
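For example, a rewrite like this no longer matches the computed column’s expression, so the statistic and indexes on it go unused:

SELECT COUNT(*)
FROM dbo.Memberships
WHERE GETDATE() BETWEEN StartDate
                    AND ISNULL(CancelledEarlyDate, EndDate);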

And even if they’re all the same, I’m still back to the same problems I had in the last post: as our data distribution becomes more real-world, with a mix of expired, current, and future memberships, I’m going to have estimation problems.

For more about the computed column technique, including when to persist them, when not to, and how to level up with indexed views, check out the Mastering Index Tuning class module on filtered indexes, indexed views, and computed columns.


8 Comments

  • When you wrote the “columns relate to each other” I thought “The columns in a table refer to the primary key, the whole primary key, and nothing but the primary key, so help me Codd”. But heck, you were presenting a real world example, so I have faith: Codd will forgive you :).

    Reply
  • Is it too much to hope that I could talk to the developer and convince them to remodel this table to have [EndDate] and [OriginalEndDate]? Hopefully they’re talking to me as this is developed, instead of me finding out about it two days after deployment when performance is tanking.

    (This is assuming that the early end date (if it exists) is of greater importance to our business than the original end date.)

    Reply
    • Andrew K – it’s definitely possible, but start with the recommendation from the original blog post yesterday: write up your experiment, prove that the performance is better, and post it as a Github gist (or a blog post on your own site.) I’ve already had a bunch of people email me privately saying, “Wow, I thought I knew how to fix that, but then I ran an experiment and I couldn’t fix it either.”

      Reply
  • And here I was thinking the advice was to prefer ISNULL over COALESCE since Coalesce ended up working like a CASE statement behind the scenes.
    If I’m still using SQL 2014 in a lot of cases (with highest possible compatibility level applied), should I just switch to COALESCE all of the time then?

    I even have some IsNull(A, IsNull(B, C)) checks in my code as the advice from years ago was that Coalesce was not well implemented… It’d be nice to get rid of them 🙂

    Reply
    • Ian – your best bet is to evaluate query plans like I do in these posts, comparing & contrasting the plans and the STATISTICS IO results. Even when there’s a flat global rule (“always do X”), you’ll find that it tends to morph over time with different versions (and even cumulative updates) of SQL Server.

      Reply
  • @Brent, have you considered the common real-world case where EndDate is left open (is null)?

    Reply
