WHERE GETDATE() BETWEEN StartDate AND EndDate Is Hard to Tune.

Development, Indexing

Say you’ve got a memberships (or policies) table, and each membership has start & end dates:

If all you need to do is look up the memberships for a specific UserId, and you know the UserId, then it’s a piece of cake. You put a nonclustered index on UserId, and call it a day.

But what if you frequently need to pull all of the memberships that were active on a specific date? That’s where performance tuning gets hard: when you don’t know the UserId, and even worse, you can’t predict the date/time you’re looking up, or if it’s always Right Now.

Let’s say – just to say – all of our memberships are current. They’ve all started in the past, and they all expire in the future. I’m going to synthesize the data by using the Stack Overflow Users table, taking the user’s CreationDate as the start of their membership, and making up an EndDate in the future:

I’m going to index the bejeezus out of both StartDate and EndDate to give SQL Server the best possible chance:

When you’re querying for active memberships, you’ll probably run one of these two query design patterns:

In this case, estimation is easy. The actual plans show that SQL Server knows that all of the rows are going to match:

So when you just start building a membership or policy system, the query plans look nice and simple: estimation is bang on. Over time, though, your data starts to skew, and so do your estimates.

That imaginary case is easy,
but in real life, some folks have expired.

To show how bad it can get, I’m going to create another contrived extreme: instead of everyone’s membership being current, I’m going to say that NO one is current. 50% of our members’ plans have expired, and 50% haven’t started yet:

Now, when I run the query to check our active members again, there are a few tricky problems hidden in a plan that looks deceivingly simple. It looks like a pair of simple index seeks, but they don’t perform worth a damn:

The first problem:
these seeks read a lot of data to produce 0 rows.

While this looks like a pair of lightweight index seeks, both of these queries did a heck of a lot of work. Hover your mouse over either index seek, and check out the number of rows read:

They read 4.5M rows – half of the table – in order to produce a whopping 0 rows. I can almost hear you screaming, “How is that possible?!? It says SEEK right on there, so we’re seeking to the rows we need, right?”

Not exactly: look at the Seek Predicates at the bottom of that tooltip. SQL Server is opening up the EndDate_StartDate index, and it’s turning to the first row where the EndDate >= GETDATE(), and then it starts reading. But from that point forward, it has to examine every row that it finds.

To simulate what it’s doing, let’s visualize the index by crafting a query that accurately simulates what SQL Server will find when it opens the index. (To learn more about this technique, check out the very first module in my Fundamentals of Index Tuning class.)

We can verify that it’s an accurate visualization of the index’s contents by looking at the contents of the execution plan, and making sure that there are no sorts and no residual predicates on the index seek:

When SQL Server opens up that index on EndDate, StartDate and starts reading, it’s jumping to EndDate = GETDATE(), and it reads forward. It’s examining all future-expiring memberships, and checking to see if their StartDates are in the past:

And it has to read through all future-expiring memberships just hoping to find one that started in the past. Until it reads all of them, every single future-expiring membership, it can’t guarantee that one of them hasn’t started in the past.

In terms of the good ol’ white pages of the phone book, it’s as if I told you to find everyone whose LastName >= P, and FirstName <= AAA. Sure, you could seek to the P’s and start reading, but you’re going to have to read through a lot of pages checking to see who has a FirstName like AAAAaandy. In the phone book example, you would jump to the P’s, find that no one matches, and then skip ahead to the Q’s. You can do that because there are a relatively limited number of letters in the alphabet, but SQL Server can’t do that because there aren’t a limited number of EndDates. There might be millions of distinct EndDates.

That sucks. But it gets worse.

The next problem:
SQL Server’s estimates were way off.

SQL Server estimated that 27.5% of the table would match. He thought this query would return 3.3M of the table’s 8.9M rows:

That’s a problem for me if my index doesn’t fully cover the query. In my contrived example, I completely covered my contrived membership query:

But eagle-eyed readers might have caught that my Memberships table has 4 columns: Id (the primary clustered key), StartDate, EndDate, and also, UserId, which links back to the Users table to show whose membership it is. If I add that UserId column in as well:

Then SQL Server ignores my nonclustered index altogether! He thinks that 3.3M key lookups would cost too much money, so he ends up scanning the table, and doing even MORE logical reads than he was doing before!

Thankfully, real-life data distributions aren’t quite THAT bad.

In real life, your own business has its own skew of expired, current, and future memberships. Based on your own skew, you might need to cluster a table like this on either EndDate, StartDate, or StartDate, EndDate. I can’t write a demo to show your exact data skew (well, I mean, I can, I’m a consultant for hire, and that’s how this whole thing started.) For the sake of this blog post, I wrote a script to populate the table with a bunch of data so that about 25% of my memberships haven’t started yet, 25% have started but already expired, 25% are current, and then 25% are kinda wacko.

For that demo script, you’ll note a few unusual table design choices:

  • The table is a heap
  • It has a nonclustered primary key on Id
  • It has a nonclustered index on EndDate, StartDate that includes Id, UserId
  • It has another on StartDate, EndDate that includes Id, UserId

With that data mix populated, let’s run our reporting query, check out our estimates, and we’ll even hint it with both of the nonclustered indexes so we can see which one does less reads:

The good news is that both query plans get decent row estimations:

In addition, both pick the same IX_StartDate_EndDate index. The bad news, though, is that even though the estimate is decent, the residual scan predicate means that we seek to one point, but then we read 2x more rows than the query actually produces:

If I use index hints for testing which index actually reads less data:

In my case, with my demo’s data distribution, the StartDate index reads less 8KB pages to produce the same results:

So if I needed to tune my demo data’s scenario, and if we did a lot of querying for active memberships, and I couldn’t predict the date/time we were going to query for nor could I know which UserIds we were looking for, I might:

  • Put a clustered index on on StartDate, EndDate
  • Put a nonclustered index on UserId (and probably no other fields, although I’d be open to changing that if individual users could have a lot of memberships)

I know this is one of those blog posts that’s going to provoke a lot of “why don’t you do ___” comments, but I’ll be honest: I’m not going to answer them here because you can answer ’em yourself. Download a Stack Overflow database, run my script to populate the UsersMemberships table, and then run your own experiments to test your ideas. When you find something that seems like it works, you’re welcome to share it in the comments. You can also use a Github gist to share your source code.

One approach I expect folks to suggest is to build a child table like this with a row for each day that a subscription is active:

  • UserId
  • MembershipId
  • ActiveDate

So if your subscription was active for 300 days, you’d have 300 rows in that table. Unfortunately, solutions like that can be tough to scale too: using our UsersMembership table, the number of rows in it would be pretty ugly if we had to maintain all history. I’d be at 7 billion rows even with this simplified example where each user only has one membership:

Razor-sharp readers like yourself may have noticed the column that I didn’t mention in this post. That particular column makes this exercise deliciously difficult – more on that in the next post.

Previous Post
Recommended SQL Server Books, 2020 Edition
Next Post
WHERE GETDATE() BETWEEN StartDate AND COALESCE(CancelDate, EndDate) Is Even Harder to Tune.

19 Comments. Leave new

  • Just create a persisted computed column calculating whether or not the row is active based off the current time. That will work great, won’t it? 🙂

    • People constantly suggest that at first too, hahaha, and then they realize the problem when they’re thinking through the details. Time is a cruel mistress.

      • As I was typing it out I was imagining SQL dedicating CPU cycles to recalculating it frequently behind the scenes and shuddered.

  • Or you could have an active table that is truncated and populated at midnight via a scheduled job…the membership only expires at the end of the day right?

  • […] Brent Ozar shows why the StartDate + EndDate pattern is not great for filtering: […]

  • What about using a Temporal table. Add a user to the table when they “Start” and have a Job delete them when the End Date is reached, or in a trigger where CancelledEarlyDate is set to current (I’m assuming that you only set CancelledEarlyDate to the current date, if not, then into the previous Job they go.

    Then you can simply use the Temporal querying for any specific date.

    • Scott – sure, but then you’d have to change the queries (often dramatically, as I talk about in the next post when the early cancelation date column comes into play.)

      I wish I could tell people, “hey, just change the way you store and query data,” but in real life, that’s usually a pretty tough sell.

  • Dominique Boucher
    February 19, 2020 10:53 am

    This would required quantum physics (as object can have 2 different values at the same time, like the start and the end) in order to make it work great 😉
    Maybe in SQL2119

  • how about treating timestamps like points in 1 dimensional space and use heavy artillery of spatial goodies builtin db engine?

    • Filip – go ahead and read the whole post, particularly the part about suggestions.

      • I’ve read that part before posting my question (no fingers crossed). I’ll try that myself when I manage to accumulate some free time. I won’t forget that topic, because I’m dealing with sub 1TB tables, where long scans on timestamp columns pains me every week.

  • Wonder where you got that scenario from… 😉 Thank you for sharing, big hug my friend!

  • Hi,

    I’d like to know when was the first order and last order placed for a given product. How to extract first order start date and last order end date from same column (OrderDate)?Do I need to use dateadd() or lead function?


Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.