Date Tables are Great for Users, but Not So Great for Performance

Date tables help users filter data based on day of week, holiday, this year vs last year, and other kinds of filters that business users usually wanna add to their reports. If you haven’t used a date table before, here are a couple of primers on how to create one and use it:

And for the purposes of this post, I’m going to use the date_calendar table created in the second bullet point above. Here’s what it looks like:

Typically when people use date tables, they join from their real tables over to the date table, and they do their date calculations & filtering by just specifying columns in the date table.

For example, if you wanted to find out what day of the week people usually sign up for Stack Overflow accounts (database), you might write a query like this:

The results look like this:

Pretty cool! But…does it perform well? To find out, let’s write an old-school version of the query that doesn’t use a date table, and we’ll compare the two using SET STATISTICS IO, TIME ON to measure CPU, duration, and logical reads:

The metrics tell an interesting story:

  • Date table method: 100,074 logical reads, CPU time 3,422 ms, elapsed time 3,425 ms
  • Old school method: 20,095 logical reads, CPU time 3,671 ms, elapsed time 1,467 ms

The actual execution plans are obviously different:

So why did the date table not perform as well as the old-school way?

SQL Server doesn’t understand the relationship between these two tables. It simply doesn’t know that all of the rows in the Users table will match up with rows in the calendar table. It assumes that the calendar table is doing some kind of filtering, especially given our CAST on the date. It doesn’t expect all of the rows to match.

SQL Server decides it’s going to scan the date_calendar table first, and then for each date that it finds, it does an index seek into the Users index on CreationDate:

If you hover your mouse over that index seek on CreationDate, the number of executions are worth a laugh:

This index seek was done over 18,000 times – once for every day in our calendar table.

Because SQL Server thought the date table would do some filtering, it also underestimated the number of rows it’d find in Users, so it also underestimated the CPU work required, so the date table approach only went single-threaded. The old school way understood that lots of rows would be involved, so the work was parallelized across multiple threads.

That’s a simple query where I’m not even doing filtering – but query plans get even more predictable as you build atop this, adding filters on the date table looking for weekdays, specific days of the month, or ranges. SQL Server just has a hard time guessing how many rows are going to come out of the related tables. The more complex your filtering becomes, the crazier the estimates become.

So if I have date tables, what should I do next?

This quick experiment doesn’t mean that you should abandon your date tables and force everyone to do calculations against the base tables. Date tables are WONDERFUL, and I recommend them highly.

However, you just need to be aware that:

  • If you filter on the date table rather than the base tables, you’ll likely get inaccurate estimates
  • The more you rely on functions or ranges of data in the date tables, the more likely you are to get increasingly bad estimates
  • To fix these estimates, you may need to add relational columns on your base tables that define exactly which rows in the calendar table match which rows in your base table
  • When doing performance tuning on a query that uses a date table, consider briefly tweaking the query to not use the date table, and see how much better the execution plan gets. If it’s a night and day difference, you may need to work on the date table’s relationships or switch this one particular query to filter directly on the base tables rather than the date tables.

Got questions? Want to try different approaches to my queries above? Grab the Stack Overflow database and try running your own experiments. That’s why I love using the Stack Overflow database as a sample: it contains real-world distribution on stuff like dates so you can prove your own ideas or disprove mine.

Watch me write this post

I streamed this post live on Sunday, July 19:

Previous Post
[Video] The Top 10 Developer Mistakes That Won’t Scale on Microsoft SQL Server
Next Post
SQL Server Problems We Don’t Have Anymore

30 Comments. Leave new

  • Not The DBA You're Looking For
    August 4, 2020 9:17 am

    I disagreed with this less and less as I read through it.

    First thoughts:
    1) That CAST is unfair
    2) Having a foreign key on the base table would help no end

    Then you actually touched on each point. My main take away is, If you’re going to implement a date table, implement it correctly

    Reply
    • The cast IS unfair, but…that’s how a lot of folks use date tables. Now about the foreign key – best to try it and go see if it really helps. (Hint: in this case, it won’t.)

      Reply
    • Koen Verbeeck
      August 4, 2020 10:20 pm

      That CAST is unfair indeed. There’s no issue with a date table, there’s an issue with the data modelling. I did almost the exact same query on my data warehouse, and I got a vastly different execution plan (hint: it was fast).
      If you’re going to open up your database to users (as the title suggests), then model your database for user access (in other words: build a data warehouse). Even if you just follow the Kimball method, you’ll model it correctly.

      Reply
      • We could frame all database problems that way, right? “You don’t have a stored procedure problem – it’s just your data modeling.” 😉

        Ah, if only everyone modeled their database correctly, the world would be a delightful place.

        Reply
        • Koen Verbeeck
          August 5, 2020 3:44 am

          I know I know. So many issues come from data modelling alone 🙂
          I just took offense at the title. It seems like date tables lead to performance issues, but that’s not the case. Only when you use non-SARGable expressions in your joins, they give issues as you demonstrated, but that problem is much wider than date tables alone.

          Reply
          • Yeah, it just stems from the fact that I’ve seen 3 clients in a row doing it this same way. I figure if I grab people’s attention with the title, then they’ll at least read to see if they’re doing date tables that way.

            If I write it as, “If you use a cast, blah blah” then they won’t necessarily catch it and read the post.

          • (Oh I should also add – it’s not only when you use non-SARGable expressions. I’ve got another post in the queue about how date tables cause problems with row estimations when you filter on things like weekday/holiday columns in the date tables.)

          • Koen Verbeeck
            August 5, 2020 3:56 am

            You and your clickbait articles 😀
            Looking forward to the other post about date tables.

  • Interesting. I suppose my inability to replicate the poor results in your test are from not knowing enough about your dbo.Users table. When I remove the CAST I get completely parallelized results, but the table I created is completely different from the one you used. I haven’t tried it yet, but would it be worth putting a date_low and date_high column in the date_calendar table and indexing them, to change that CAST to a BETWEEN clause?

    Reply
  • Interesting post, I often find date tables implemented in the most bizarre way… I wonder: what if instead of the table, a view that build the dates on the fly is used? You know, with dateadd, row_number and so on, without relying on a physical table.. I’ll give it a try as soon as I have some spare time

    Reply
  • […] Brent Ozar walks through a scenario in which a calendar table (AKA, date dimension) makes a query pe…: […]

    Reply
  • Well, the example you used made no sense, IMHO. Calendar table has lots of great usages and you shown a case where it has no usage at all.

    Did you try to compare one of the typical use cases when we need to find number of work dates (e.g. not a holiday and not a weekday) between 2 dates and do some calculation using some extra, say, Orders table?

    Reply
  • The title was catchy indeed, BTW 🙂

    Reply
  • There’s no shield, I’m afraid, that would save you from people writing bad SQL and people designing bad models. All we can or maybe should count on (or are forced to do) is Microsoft taking steps to make changes to the internals of SQL Server so that even bad SQL is transformed into good SQL. When you know what you do, you do it carefully and right. When all you want is to get A result, you don’t care about anything else. You just write and go home. Can’t be helped. But at least we, who do care, have well-paid jobs 🙂

    Reply
  • Esther Neustadt
    August 10, 2020 10:05 am

    Thanks for the article. Very helpful info. The title though it is catchy may leave the reader with the impression that data tables are not good for performance. Even if further in the article there is more detail as to when they actually go bad the message of the title sometimes does remain imprinted in a reader’s mind.

    Reply
  • If the purpose of the post is to show how “not” to use a date table, then it does it well indeed. Reading the early links help frame it by showing how to implement well. I would have benefited if the juxtaposition was stated initially – i.e. good vs the bad.

    Reply
  • Just so I’m clear… so the optimal solution in the model is to create a FK constraint between the date and have tables?

    Reply
  • Maxim Ivashkov
    March 18, 2021 9:44 pm

    In one hand this is so common in reporting, we always need this extra column in date format just to convert datetime so it matches the date dimension.
    It would be nice if sql server did clever joins and supported joins of datetime with date format by dropping the time and using the date index.
    I guess too much to ask from microsoft.

    Reply
  • Use a real bad method #1.
    Use a real really bad method #2.
    Never tell anyone about a better/faster method.

    Great article! Again.

    Reply
  • For me, the title of this article is definitely NOT merely “click bait”. This article and the discussion that follow demonstrate exactly what I was looking for. An author posts a problem and some code. People look at the code and find issues with the code in the demo (the supposedly “unfair CAST”). The author defends it by correctly stating that a whole lot of people make the mistake of using CAST because the User table uses DATETIME and the code doesn’t work correctly without the CAST that so many people use because they don’t know the correct method. Some people also cite that the author doesn’t demonstrate the correct method (and some of those people are known heavy hitters) but they also fail to post the correct method. Other’s cite the fact that they don’t have the data to test with even though the info on where to get the data is clearly listed near the end of the article.

    Thank you for this article and the code you posted. I didn’t see the CAST thing as being “unfair” at all. I saw it as precisely what you intended… it’s exactly what most users would end up doing once they realize that the datatype mismatch between the calendar table and the User table produces the incorrect result. In fact, that reminded me that I absolutely MUST include such a “typical user mistake” in something I’m writing up where a calendar table is used to solve the “Get all dates for a Date Range” relational multiplication problem.

    Thanks for what you do for the community day in and day out. PLEASE… Heh… save all the baby turtles on the beach you want but never fully retire. There’s a whole bunch of “baby turtles” out here in the SQL Community that need saving as well! 😀

    Reply
    • … and I’m still one of them! 😀

      Reply
      • p.s. Awesome ‘tube you made on this, as well. It shows the “thought process” and it actually does explain the “why” of the seemingly unfair CAST. It really is the way most people would probably do the JOIN.

        Reply
    • BWAAA-HAAA!!! Here’s another reason you have to be really careful when when using OPS code (ummm… Other People’s Stuff is the polite interpretation. ;D ) that they download off the internet. Even the best-of-the-best can forget in the heat of battle.

      SELECT avg_fragmentation_in_percent
      ,avg_page_space_used_in_percent
      FROM sys.dm_db_index_physical_stats(DB_ID(‘StackOverflow2013’),OBJECT_ID(‘StackOverflow2013.dbo.date_calendar’),1,1,’SAMPLED’)
      ;

      Results:
      avg_fragmentation_in_percent avg_page_space_used_in_percent
      ——————————————- ————————————————
      58.5956416464891 56.180701754386

      Shifting gears back to the problem at hand, I’d personally love to see what folks come up with to use the calendar table that Brent used to solve this problem correctly because I’m not seeing a way that will beat the straight up old school method that Brent posted, which doesn’t use the calendar table even when adding a second column for the “End of day” or “Next Day”. If there is one, I’d love to see it because I don’t personally use a Calendar table for this type of thing and don’t mind learning something new/different.

      Reply
  • I thought CAST AS DATE was sargable?

    Reply

Leave a Reply

Your email address will not be published.

Fill out this field
Fill out this field
Please enter a valid email address.