How to Think Like the SQL Server Engine: When Statistics Don’t Help

In our last episode, we saw how SQL Server estimates row counts using statistics. Let’s write two slightly different versions of our query – this time, only looking for a single day’s worth of users – and see how their estimates turn out.
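Here’s the shape of the two queries – a sketch against the Stack Overflow Users table, with an illustrative column list (the WHERE clauses are what matter):

SELECT Id, DisplayName, Age
FROM dbo.Users
WHERE LastAccessDate BETWEEN '2018-08-27' AND '2018-08-28';

SELECT Id, DisplayName, Age
FROM dbo.Users
WHERE CAST(LastAccessDate AS DATE) = '2018-08-27';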

Both of those queries are theoretically identical in that they produce exactly the same rows – but their execution plans are different. On this one, you’ll probably want to click to zoom in and play spot-the-differences:

The good news is that both plans are essentially doing an index seek + key lookup. The bottom one’s Constant Scan is where SQL Server is figuring out what the first date would be if it had to seek to a specific date/time to start the CAST process.

If you read a lot of blog posts – and you strike me as the kind of person who does – then you’ve probably read a post somewhere that said if you use functions in the WHERE clause, SQL Server won’t use an index, or that it won’t do an index seek. That’s not exactly true, as the bottom query plan shows – and it’s awesome! Hot diggety, I love SQL Server at moments like this.

The bad news is that they have very different row estimates. The top one estimates that 1,989 rows will match that date range, as shown by the “2443 of 1989” on the top Index Seek.

The bottom one has a very different row estimate: 1.

SQL Server brought back 2443 rows out of an expected 1.

And if you hover your mouse over the index seek to see more details, you can see that SQL Server knew doggone good and well what date it was looking for – and even that our CAST got converted (see what I did there?)

But the statistics didn’t help. SQL Server just couldn’t be bothered to guess how many rows were going to match the date of 2018-08-27.
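If you want to peek at what SQL Server had to work with here, you can dump the index’s histogram yourself – a quick sketch, assuming the index is named IX_LastAccessDate:

DBCC SHOW_STATISTICS ('dbo.Users', IX_LastAccessDate) WITH HISTOGRAM;

The histogram steps are built on LastAccessDate’s raw datetime values, not on CAST(LastAccessDate AS DATE), so there’s nothing there for the converted predicate to line up against.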

In this case, the lowball row estimate of 1 helped me out a lot, because it encouraged SQL Server to do an index seek + key lookup. But…what if that was a bad idea?

Let’s try another date.

This time, let’s aim for just 3 days later.
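Same two queries, just with the dates bumped three days – something like:

SELECT Id, DisplayName, Age
FROM dbo.Users
WHERE LastAccessDate BETWEEN '2018-08-30' AND '2018-08-31';

SELECT Id, DisplayName, Age
FROM dbo.Users
WHERE CAST(LastAccessDate AS DATE) = '2018-08-30';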

It just so happens 24,380 people logged in on that date, so now the top execution plan looks very different from the bottom one:

In this case, the 1 row lowball estimate backfired on me. The top query understood that over 24,000 rows would come back, so it did a clustered index scan. The bottom one’s 1-row estimate caused an index seek + key lookup – and as you’ll remember from your earlier lessons, dear reader, an index seek + key lookup is dramatically less effective as you add more rows, as statistics IO will show:

The top query did 7,405 reads (the entire table), whereas the bottom one did ALMOST TEN TIMES AS MANY.
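If you want to reproduce that comparison yourself, turn on the IO metrics before running both queries, then check the Messages tab for logical reads:

SET STATISTICS IO ON;
-- now run both versions of the query and compare their logical reads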

The easier it is for SQL Server to read your query and quickly guess how many rows are going to come back, the faster the query plan it’ll usually build. The more you layer in obfuscation – even the tiniest obfuscations that seem completely obvious to you – the less likely it is you’ll get a fast query plan.

Or, you’re gonna have to compensate by putting more work into your indexes – and we’ll do that in the next post, Building Wider Indexes to Deal with Bad T-SQL.


16 Comments

  • Hi, Brent! The first link is broken

  • The queries are currently not exactly identical: the first one will include rows stamped exactly at midnight on 2018-08-28 (and should be written WHERE LastAccessDate >= '2018-08-27' AND LastAccessDate < '2018-08-28'). Also, the dynamic seek built from the CAST covers a wider range – > '2018-08-26' … < '2018-08-28' – and discards all rows from the 26th (i.e. after 00:00 on that date) via a residual predicate, so “Number of rows read” will be higher.

    • Martin – this is where it’d help if you actually got the database and tested your ideas first, so you could see the number of rows on particular dates. By all means, jump in and give it a shot!

      • The comment you are replying to had a chunk out of the middle replaced with an ellipsis on posting (as it contained a “less than” sign, and I think it tried to strip out some following content as an HTML tag), so not sure if you got the points or not.

        I don’t need the database; I can demo both points easily with this dummy data:

        CREATE TABLE dbo.Users
        (
            Id INT IDENTITY,
            LastAccessDate DATETIME INDEX IX_LastAccessDate
        );

        /* One row per second across the 26th and the first ten seconds
           of the 27th, plus a single row at exactly midnight on the 28th: */
        INSERT INTO dbo.Users (LastAccessDate)
        SELECT TOP 86410 DATEADD(SECOND, ISNULL(NULLIF(ROW_NUMBER() OVER (ORDER BY @@SPID), 86410), 172800), '2018-08-26')
        FROM sys.all_objects o1, sys.all_objects o2;

        SELECT LastAccessDate
        FROM dbo.Users
        WHERE CAST(LastAccessDate AS DATE) = '2018-08-27'
        ORDER BY LastAccessDate;

        SELECT LastAccessDate
        FROM dbo.Users
        WHERE LastAccessDate BETWEEN '2018-08-27' AND '2018-08-28'
        ORDER BY LastAccessDate;
        The second query returns an extra row compared to the first because of the use of BETWEEN.
        The first one reads 86,409 rows because it reads all of the rows for the 26th rather than starting the seek on the 27th.

        • Yes, but that’s irrelevant because it doesn’t match the data we’re dealing with in the database. Please, before you reply further, download the database and do what I’m doing in the post – you’ll see that no one logged in at exactly midnight, so your point is irrelevant.

          Should people always write queries exactly the way you personally desire? Sure. Does that matter for this discussion here? Not even a little.

          Thanks for being passionate about your querying standards though! Good to see people fighting for the right query design.

          • Brian Boodman
            November 6, 2019 7:22 am

            “Yes, but that’s irrelevant because it doesn’t match the data we’re dealing with in the database. Please, before you reply further, download the database and do what I’m doing in the post – you’ll see that no one logged in at exactly midnight, so your point is irrelevant.”

            Actually, this makes his point even more relevant, since running your query against Microsoft’s database will reinforce the false belief that the two queries have the same meaning. On many real world databases, data ends up clustered around midnight, since the granularity of the input data often varies.

          • I totally get that y’all are excited about discussing how to write date queries, and that’s great. However, that’s not what this post – or this series – is about.

            I also totally get that you want me to cover every single topic in every single post. I wish I could, but that just isn’t how blogging works, especially when I’m tackling beginner topics like this one.

            By all means, if you’re passionate about this topic, you should totally start a blog and talk about it. It’s a great way to show your expertise and discuss things you care about. Here’s how to get started: https://www.brentozar.com/archive/2008/12/how-start-blog/

            I look forward to reading your work!

  • My above comment seems to have been mangled. Not sure if it thinks the angle brackets are HTML that needs to be stripped.

    Points were that the first query shouldn’t use BETWEEN – it should use >= and < so it doesn’t include rows from the next day at midnight – and that the dynamic cast seeks a wider range than the handcrafted seek and discards some unneeded rows through a residual predicate, so “Number of rows read” will be higher.

  • You ruined all of my beliefs about the idea that using functions will force SQL Server to scan.
    Thanks for this damage 😉

    • You’re not alone. Every time I think I’m getting a handle on even the basics, something comes along to tell me I’ve overlooked something all along 🙂 Now I’m going to have to go back to some of my queries and see if I’ve overlooked estimates on the Constant Scans/Compute Scalars that I “cleanse” my parameters with. The good part is we’re reading and learning.

      @Brent how does this relate, though, to the fact that, I believe, having another variable and doing a SET inside a procedure will also play havoc with estimates – and especially parameter sniffing? I.e., if I were to say:

      CREATE PROCEDURE ... @startDate DATETIME, @endDate DATETIME
      AS
      SET @startDate = ISNULL(@startDate, GETDATE());
      ...
      WHERE LastAccessed >= @startDate

      Not that I’d generally put a function on the table side of that query anyway if I could help it, because indexes don’t tend to like having every value parsed/converted.

      • Andrew – that’s a totally different concept, and I cover that one in Mastering Query Tuning. That one’s a lot of fun too!

        • Thanks, and since I just got that in the sale, I’ll have to have a look thru it.
          But I thought there was a “damned if you do, damned if you don’t” choice there.

    • Hahaha, my pleasure.

  • Great post, Brent!
    Everyone is talking about SARGability, but here you proved that even an underestimate can be good sometimes. I wasn’t able to get exactly the same numbers as shown above with the new CE (2014 and greater). But when I used the old CE, the exact numbers appeared.

    • Milan – yep, we cover the differences in cardinality estimation in later classes & posts. There are only so many things I can cover in each blog post. 😀

