Say you’ve got a memberships (or policies) table, and each membership has start & end dates:
USE StackOverflow;
GO

DROP TABLE IF EXISTS dbo.UsersMemberships;

CREATE TABLE dbo.UsersMemberships
    (Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
     UserId INT NOT NULL,
     StartDate DATETIME NOT NULL,
     EndDate DATETIME NOT NULL,
     CancelledEarlyDate DATETIME NULL);
GO
If all you need to do is look up the memberships for a specific UserId, and you know the UserId, then it’s a piece of cake. You put a nonclustered index on UserId, and call it a day.
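That index is about as simple as it gets – roughly something like this (the index name here is just illustrative):

CREATE INDEX IX_UserId ON dbo.UsersMemberships(UserId);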
But what if you frequently need to pull all of the memberships that were active on a specific date? That’s where performance tuning gets hard: when you don’t know the UserId, and even worse, you can’t predict the date/time you’re looking up, or whether it’ll always be Right Now.
Let’s say – just to say – all of our memberships are current. They’ve all started in the past, and they all expire in the future. I’m going to synthesize the data by using the Stack Overflow Users table, taking the user’s CreationDate as the start of their membership, and making up an EndDate in the future:
INSERT INTO dbo.UsersMemberships(UserId, StartDate, EndDate)
SELECT Id, CreationDate,
       DATEFROMPARTS(YEAR(GETDATE()) + 2, MONTH(LastAccessDate), DAY(LastAccessDate))
FROM dbo.Users
WHERE NOT(MONTH(LastAccessDate) = 2 AND DAY(LastAccessDate) > 28); /* Avoid Feb 29 problems */
I’m going to index the bejeezus out of both StartDate and EndDate to give SQL Server the best possible chance:
CREATE INDEX IX_StartDate_EndDate ON dbo.UsersMemberships(StartDate, EndDate);
CREATE INDEX IX_EndDate_StartDate ON dbo.UsersMemberships(EndDate, StartDate);
GO
When you’re querying for active memberships, you’ll probably run one of these two query design patterns:
/* The queries we would normally run: */
SELECT Id, StartDate, EndDate
FROM dbo.UsersMemberships
WHERE GETDATE() BETWEEN StartDate AND EndDate;
GO

SELECT Id, StartDate, EndDate
FROM dbo.UsersMemberships
WHERE StartDate <= GETDATE()
  AND EndDate >= GETDATE();
GO
In this case, estimation is easy. The actual plans show that SQL Server knows that all of the rows are going to match:
So when you just start building a membership or policy system, the query plans look nice and simple: estimation is bang on. Over time, though, your data starts to skew, and so do your estimates.
That imaginary case is easy, but in real life, some folks have expired.
To show how bad it can get, I’m going to create another contrived extreme: instead of everyone’s membership being current, I’m going to say that NO one is current. 50% of our members’ plans have expired, and 50% haven’t started yet:
TRUNCATE TABLE dbo.UsersMemberships;

/* 50% are expired. They both start and end in the past. */
INSERT INTO dbo.UsersMemberships(UserId, StartDate, EndDate)
SELECT Id, CreationDate, LastAccessDate
FROM dbo.Users
WHERE Id % 2 = 0;
GO

/* 50% haven't started yet. They both start and end in the future. */
INSERT INTO dbo.UsersMemberships(UserId, StartDate, EndDate)
SELECT Id,
       CASE WHEN GETDATE() < DATEFROMPARTS(YEAR(GETDATE()), MONTH(LastAccessDate), DAY(LastAccessDate))
            THEN DATEFROMPARTS(YEAR(GETDATE()), MONTH(LastAccessDate), DAY(LastAccessDate))
            ELSE DATEFROMPARTS(YEAR(GETDATE()) + 1, MONTH(LastAccessDate), DAY(LastAccessDate))
       END,
       DATEFROMPARTS(YEAR(GETDATE()) + 2, MONTH(LastAccessDate), DAY(LastAccessDate))
FROM dbo.Users
WHERE Id % 2 = 1
  AND NOT(MONTH(LastAccessDate) = 2 AND DAY(LastAccessDate) > 28);
GO
Now, when I run the query to check our active members again, there are a few tricky problems hidden in a plan that looks deceptively simple. It looks like a pair of simple index seeks, but they don’t perform worth a damn:
The first problem: these seeks read a lot of data to produce 0 rows.
While this looks like a pair of lightweight index seeks, both of these queries did a heck of a lot of work. Hover your mouse over either index seek, and check out the number of rows read:
They read 4.5M rows – half of the table – in order to produce a whopping 0 rows. I can almost hear you screaming, “How is that possible?!? It says SEEK right on there, so we’re seeking to the rows we need, right?”
Not exactly: look at the Seek Predicates at the bottom of that tooltip. SQL Server is opening up the EndDate_StartDate index, and it’s turning to the first row where the EndDate >= GETDATE(), and then it starts reading. But from that point forward, it has to examine every row that it finds.
To simulate what it’s doing, let’s visualize the index by crafting a query that accurately simulates what SQL Server will find when it opens the index. (To learn more about this technique, check out the very first module in my Fundamentals of Index Tuning class.)
SELECT TOP 1000 EndDate,
       ExpiredYet = CASE WHEN GETDATE() > EndDate THEN 'Expired' ELSE 'OK' END,
       StartDate,
       StartedYet = CASE WHEN GETDATE() > StartDate THEN 'OK' ELSE 'Not Started Yet' END,
       Id
FROM dbo.UsersMemberships
WHERE EndDate >= GETDATE()
ORDER BY EndDate, StartDate;
GO
We can verify that it’s an accurate visualization of the index’s contents by looking at the contents of the execution plan, and making sure that there are no sorts and no residual predicates on the index seek:
When SQL Server opens up that index on EndDate, StartDate and starts reading, it’s jumping to EndDate = GETDATE(), and it reads forward. It’s examining all future-expiring memberships, and checking to see if their StartDates are in the past:
And it has to read through all of the future-expiring memberships, hoping to find ones that started in the past. Until it reads all of them – every single future-expiring membership – it can’t be sure it has found every row that matches.
In terms of the good ol’ white pages of the phone book, it’s as if I told you to find everyone whose LastName >= P, and FirstName <= AAA. Sure, you could seek to the P’s and start reading, but you’re going to have to read through a lot of pages checking to see who has a FirstName like AAAAaandy. In the phone book example, you would jump to the P’s, find that no one matches, and then skip ahead to the Q’s. You can do that because there are a relatively limited number of letters in the alphabet, but SQL Server can’t do that because there aren’t a limited number of EndDates. There might be millions of distinct EndDates.
That sucks. But it gets worse.
The next problem: SQL Server’s estimates were way off.
SQL Server estimated that 27.5% of the table would match. He thought this query would return 3.3M of the table’s 8.9M rows:
That’s a problem for me if my index doesn’t fully cover the query. In my contrived example, I completely covered the membership query:
SELECT Id, StartDate, EndDate
FROM dbo.UsersMemberships
WHERE StartDate <= GETDATE()
  AND EndDate >= GETDATE();
GO
But eagle-eyed readers might have caught that my Memberships table has 4 columns: Id (the primary clustered key), StartDate, EndDate, and also, UserId, which links back to the Users table to show whose membership it is. If I add that UserId column in as well:
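That’s the same covered query from above, just with UserId added to the SELECT list – roughly:

SELECT Id, UserId, StartDate, EndDate
FROM dbo.UsersMemberships
WHERE StartDate <= GETDATE()
  AND EndDate >= GETDATE();
GO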
Then SQL Server ignores my nonclustered index altogether! He thinks that 3.3M key lookups would cost too much money, so he ends up scanning the table, and doing even MORE logical reads than he was doing before!
Thankfully, real-life data distributions aren’t quite THAT bad.
In real life, your own business has its own skew of expired, current, and future memberships. Based on your own skew, you might need to cluster a table like this on either EndDate, StartDate, or StartDate, EndDate. I can’t write a demo to show your exact data skew (well, I mean, I can, I’m a consultant for hire, and that’s how this whole thing started.) For the sake of this blog post, I wrote a script to populate the table with a bunch of data so that about 25% of my memberships haven’t started yet, 25% have started but already expired, 25% are current, and then 25% are kinda wacko.
For that demo script, you’ll note a few unusual table design choices (sketched in DDL after this list):
- The table is a heap
- It has a nonclustered primary key on Id
- It has a nonclustered index on EndDate, StartDate that includes Id, UserId
- It has another on StartDate, EndDate that includes Id, UserId
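In DDL form, those choices look roughly like this – a sketch reconstructed from the bullets above rather than copied from the demo script, with a made-up primary key constraint name:

CREATE TABLE dbo.UsersMemberships
    (Id INT IDENTITY(1,1) NOT NULL,
     UserId INT NOT NULL,
     StartDate DATETIME NOT NULL,
     EndDate DATETIME NOT NULL,
     CancelledEarlyDate DATETIME NULL,
     CONSTRAINT PK_UsersMemberships_Id PRIMARY KEY NONCLUSTERED (Id)); /* nonclustered PK, so the table stays a heap */

CREATE INDEX IX_EndDate_StartDate ON dbo.UsersMemberships(EndDate, StartDate) INCLUDE (Id, UserId);
CREATE INDEX IX_StartDate_EndDate ON dbo.UsersMemberships(StartDate, EndDate) INCLUDE (Id, UserId);
GO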
With that data mix populated, let’s run our reporting query, check out our estimates, and we’ll even hint it with both of the nonclustered indexes so we can see which one does less reads:
SET STATISTICS IO ON;
GO

SELECT Id, StartDate, EndDate
FROM dbo.UsersMemberships
WHERE GETDATE() BETWEEN StartDate AND EndDate;
GO

SELECT Id, StartDate, EndDate
FROM dbo.UsersMemberships
WHERE StartDate <= GETDATE()
  AND EndDate >= GETDATE();
GO
The good news is that both query plans get decent row estimations:
In addition, both pick the same IX_StartDate_EndDate index. The bad news, though, is that even though the estimate is decent, the residual scan predicate means that we seek to one point, but then we read 2x more rows than the query actually produces:
If I use index hints to test which index actually reads less data:
SELECT Id, StartDate, EndDate
FROM dbo.UsersMemberships WITH (INDEX = IX_StartDate_EndDate)
WHERE GETDATE() BETWEEN StartDate AND EndDate;
GO

SELECT Id, StartDate, EndDate
FROM dbo.UsersMemberships WITH (INDEX = IX_EndDate_StartDate)
WHERE GETDATE() BETWEEN StartDate AND EndDate;
GO
In my case, with my demo’s data distribution, the StartDate index reads fewer 8KB pages to produce the same results:
So if I needed to tune my demo data’s scenario, and if we did a lot of querying for active memberships, and I couldn’t predict the date/time we were going to query for nor could I know which UserIds we were looking for, I might:
- Put a clustered index on StartDate, EndDate (sketched in code after this list)
- Put a nonclustered index on UserId (and probably no other fields, although I’d be open to changing that if individual users could have a lot of memberships)
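In index DDL, that would look roughly like this – a sketch with made-up index names, assuming the heap version of the table from the demo script (test it against your own data skew before trusting it):

/* The demo-script version of the table is a heap, so it has no clustered index yet: */
CREATE CLUSTERED INDEX CIX_StartDate_EndDate ON dbo.UsersMemberships(StartDate, EndDate);

/* A narrow index for the "I already know the UserId" lookups: */
CREATE INDEX IX_UserId ON dbo.UsersMemberships(UserId);
GO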
I know this is one of those blog posts that’s going to provoke a lot of “why don’t you do ___” comments, but I’ll be honest: I’m not going to answer them here because you can answer ’em yourself. Download a Stack Overflow database, run my script to populate the UsersMemberships table, and then run your own experiments to test your ideas. When you find something that seems like it works, you’re welcome to share it in the comments. You can also use a GitHub gist to share your source code.
One approach I expect folks to suggest is to build a child table like this, with a row for each day that a subscription is active (sketched in code after this list):
- UserId
- MembershipId
- ActiveDate
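For concreteness, that child table might look roughly like this – a sketch of the suggestion, with a hypothetical table name and key choice:

CREATE TABLE dbo.UsersMembershipsActiveDays
    (UserId INT NOT NULL,
     MembershipId INT NOT NULL,
     ActiveDate DATE NOT NULL,
     CONSTRAINT PK_UsersMembershipsActiveDays
         PRIMARY KEY CLUSTERED (ActiveDate, MembershipId)); /* one row per membership per day it's active */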
So if your subscription was active for 300 days, you’d have 300 rows in that table. Unfortunately, solutions like that can be tough to scale, too: using our UsersMemberships table, the number of rows would get pretty ugly if we had to maintain all history. I’d be at 7 billion rows even with this simplified example where each user only has one membership:
Razor-sharp readers like yourself may have noticed the column that I didn’t mention in this post. That particular column makes this exercise deliciously difficult – more on that in the next post.
19 Comments
Just create a persisted computed column calculating whether or not the row is active based off the current time. That will work great, won’t it? 🙂
People constantly suggest that at first too, hahaha, and then they realize the problem when they’re thinking through the details. Time is a cruel mistress.
As I was typing it out I was imagining SQL dedicating CPU cycles to recalculating it frequently behind the scenes and shuddered.
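For anyone curious what that attempt looks like, a minimal sketch (with a made-up column name) – SQL Server rejects the PERSISTED version because GETDATE() is nondeterministic, and without PERSISTED it just gets recalculated on every query anyway:

/* Hypothetical column name; this fails because GETDATE() is nondeterministic: */
ALTER TABLE dbo.UsersMemberships
    ADD IsActiveNow AS
        (CASE WHEN GETDATE() BETWEEN StartDate AND EndDate THEN 1 ELSE 0 END)
    PERSISTED;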
Or you could have an active table that is truncated and populated at midnight via a scheduled job…the membership only expires at the end of the day right?
Unfortunately, businesses usually need to check eligibility for past dates too – as in, show me if the membership/policy was active as of last Thursday.
See, this is why you get the big bucks… 😉
HA!
[…] Brent Ozar shows why the StartDate + EndDate pattern is not great for filtering: […]
What about using a Temporal table? Add a user to the table when they “Start” and have a Job delete them when the End Date is reached, or in a trigger where CancelledEarlyDate is set to the current date (I’m assuming that you only set CancelledEarlyDate to the current date; if not, then into the previous Job they go).
Then you can simply use the Temporal querying for any specific date.
Scott – sure, but then you’d have to change the queries (often dramatically, as I talk about in the next post when the early cancelation date column comes into play.)
I wish I could tell people, “hey, just change the way you store and query data,” but in real life, that’s usually a pretty tough sell.
This would require quantum physics (so an object can have 2 different values at the same time, like the start and the end) in order to make it work great 😉
Maybe in SQL2119
How about treating timestamps like points in 1-dimensional space and using the heavy artillery of the spatial goodies built into the database engine?
Filip – go ahead and read the whole post, particularly the part about suggestions.
I’ve read that part before posting my question (no fingers crossed). I’ll try that myself when I manage to accumulate some free time. I won’t forget that topic, because I’m dealing with sub-1TB tables where long scans on timestamp columns pain me every week.
Wonder where you got that scenario from… 😉 Thank you for sharing, big hug my friend!
I thought you’d like that! Good hearing from you, and hope to see you again soon.
I do indeed!
Same here.
Hi,
I’d like to know when the first order and last order were placed for a given product. How do I extract the first order start date and last order end date from the same column (OrderDate)? Do I need to use dateadd() or the lead function?
Hi, Chris. Unfortunately I can’t do free personalized consulting & training here in the comments.