How to Think Like the SQL Server Engine: When Statistics Don’t Help

In our last episode, we saw how SQL Server estimates row counts using statistics. Let’s write two slightly different versions of our query – this time, only looking for a single day’s worth of users – and see how their estimates turn out.
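Here’s the shape of the two queries – a sketch against the Stack Overflow Users table, with an illustrative column list (the WHERE clauses are what matter):

SELECT Id, DisplayName, Age
FROM dbo.Users
WHERE LastAccessDate BETWEEN '2018-08-27' AND '2018-08-28';

SELECT Id, DisplayName, Age
FROM dbo.Users
WHERE CAST(LastAccessDate AS DATE) = '2018-08-27';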

Both of those queries are theoretically identical in that they produce exactly the same rows – but their execution plans are different. On this one, you’ll probably want to click to zoom in and play spot-the-differences:

The good news is that both plans are essentially doing an index seek + key lookup. The bottom one’s Constant Scan is where SQL Server is figuring out what the first date would be if it had to seek to a specific date/time to start the CAST process.

If you read a lot of blog posts – and you strike me as the kind of person who does – then you’ve probably read a post somewhere that said if you use functions in the WHERE clause, SQL Server won’t use an index, or that it won’t do an index seek. That’s not exactly true, as the bottom query plan shows – and it’s awesome! Hot diggety, I love SQL Server at moments like this.

The bad news is that they have very different row estimates. The top one estimates that 1,989 rows will match that date range, as shown by the “2443 of 1989” on the top Index Seek.

The bottom one has a very different row estimate: 1.

SQL Server brought back 2443 rows out of an expected 1.

And if you hover your mouse over the index seek to see more details, you can see that SQL Server knew doggone good and well what date it was looking for – and even that our CAST got converted (see what I did there?)

But the statistics didn’t help. SQL Server just couldn’t be bothered to guess how many rows were going to match the date of 2018-08-27.
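If you want to peek at what SQL Server had to work with here, you can dump the index’s histogram yourself – a quick sketch, assuming the index is named IX_LastAccessDate:

DBCC SHOW_STATISTICS ('dbo.Users', IX_LastAccessDate) WITH HISTOGRAM;

The histogram steps are built on LastAccessDate’s raw datetime values, not on CAST(LastAccessDate AS DATE), so there’s nothing there for the converted predicate to line up against.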

In this case, the lowball row estimate of 1 helped me out a lot, because it encouraged SQL Server to do an index seek + key lookup. But…what if that was a bad idea?

Let’s try another date.

This time, let’s aim for just 3 days later.
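Same two queries, just with the dates bumped three days – something like:

SELECT Id, DisplayName, Age
FROM dbo.Users
WHERE LastAccessDate BETWEEN '2018-08-30' AND '2018-08-31';

SELECT Id, DisplayName, Age
FROM dbo.Users
WHERE CAST(LastAccessDate AS DATE) = '2018-08-30';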

It just so happens 24,380 people logged in on that date, so now the top execution plan looks very different from the bottom one:

In this case, the 1 row lowball estimate backfired on me. The top query understood that over 24,000 rows would come back, so it did a clustered index scan. The bottom one’s 1-row estimate caused an index seek + key lookup – and as you’ll remember from your earlier lessons, dear reader, an index seek + key lookup is dramatically less effective as you add more rows, as statistics IO will show:

The top query did 7,405 reads (the entire table), whereas the bottom one did ALMOST TEN TIMES AS MANY.
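If you want to reproduce that comparison yourself, turn on the IO metrics before running both queries, then check the Messages tab for logical reads:

SET STATISTICS IO ON;
-- now run both versions of the query and compare their logical reads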

The easier it is for SQL Server to read your query and quickly guess how many rows are going to come back, the faster the query plan it’ll usually build. The more you layer in obfuscation – even the tiniest obfuscations that seem completely obvious to you – the less likely it is you’ll get a fast query plan.

Or, you’re gonna have to compensate by putting more work into your indexes – and we’ll do that in the next post, Building Wider Indexes to Deal with Bad T-SQL.


16 Comments

  • Hi, Brent! The first link is broken

  • The queries are currently not exactly identical: the first one will include rows stamped exactly at midnight on 2018-08-28 (and should be written WHERE LastAccessDate >= '2018-08-27' AND LastAccessDate < '2018-08-28'). Also, the dynamic seek built from the CAST covers a wider range – > '2018-08-26' … < '2018-08-28' – and discards all rows from the 26th (i.e. after 00:00 on that date) via a residual predicate, so “Number of rows read” will be higher.

    • Martin – this is where it’d help if you actually got the database and tested your ideas first, so you could see the number of rows on particular dates. By all means, jump in and give it a shot!

      • The comment you are replying to had a chunk out of the middle replaced with an ellipsis on posting (as it contained a “less than” sign, and I think it tried to strip out some following content as an HTML tag), so not sure if you got the points or not.

        I don’t need the database; I can demo both points easily with this dummy data:

        CREATE TABLE dbo.Users
        (
            Id INT IDENTITY,
            LastAccessDate DATETIME INDEX IX_LastAccessDate
        );

        /* One row per second across the 26th and the first ten seconds
           of the 27th, plus a single row at exactly midnight on the 28th: */
        INSERT INTO dbo.Users (LastAccessDate)
        SELECT TOP 86410 DATEADD(SECOND, ISNULL(NULLIF(ROW_NUMBER() OVER (ORDER BY @@SPID), 86410), 172800), '2018-08-26')
        FROM sys.all_objects o1, sys.all_objects o2;

        SELECT LastAccessDate
        FROM dbo.Users
        WHERE CAST(LastAccessDate AS DATE) = '2018-08-27'
        ORDER BY LastAccessDate;

        SELECT LastAccessDate
        FROM dbo.Users
        WHERE LastAccessDate BETWEEN '2018-08-27' AND '2018-08-28'
        ORDER BY LastAccessDate;
        The second query returns an extra row compared to the first because of the use of BETWEEN.
        The first one reads 86,409 rows because it reads all of the rows for the 26th rather than starting the seek on the 27th.

        • Yes, but that’s irrelevant because it doesn’t match the data we’re dealing with in the database. Please, before you reply further, download the database and do what I’m doing in the post – you’ll see that no one logged in at exactly midnight, so your point is irrelevant.

          Should people always write queries exactly the way you personally desire? Sure. Does that matter for this discussion here? Not even a little.

          Thanks for being passionate about your querying standards though! Good to see people fighting for the right query design.

          • Brian Boodman
            November 6, 2019 7:22 am

            “Yes, but that’s irrelevant because it doesn’t match the data we’re dealing with in the database. Please, before you reply further, download the database and do what I’m doing in the post – you’ll see that no one logged in at exactly midnight, so your point is irrelevant.”

            Actually, this makes his point even more relevant, since running your query against Microsoft’s database will reinforce the false belief that the two queries have the same meaning. On many real world databases, data ends up clustered around midnight, since the granularity of the input data often varies.

          • I totally get that y’all are excited about discussing how to write date queries, and that’s great. However, that’s not what this post – or this series – is about.

            I also totally get that you want me to cover every single topic in every single post. I wish I could, but that just isn’t how blogging works, especially when I’m tackling beginner topics like this one.

            By all means, if you’re passionate about this topic, you should totally start a blog and talk about it. It’s a great way to show your expertise and discuss things you care about. Here’s how to get started: https://www.brentozar.com/archive/2008/12/how-start-blog/

            I look forward to reading your work!

  • My above comment seems to have been mangled. Not sure if it thinks the angle brackets are HTML that needs to be stripped.

    Points were that the first query shouldn’t use BETWEEN – it should use >= and < so it doesn’t include rows from the next day at midnight – and that the dynamic cast seeks a wider range than the handcrafted seek and discards some unneeded rows through a residual predicate, so “Number of rows read” will be higher.

  • You ruined all of my beliefs about the idea that using functions will force SQL Server to scan.
    Thanks for this damage 😉

    • You’re not alone. Every time I think I’m getting a handle on even the basics, something comes along to tell me I’ve overlooked something all along 🙂 Now I’m going to have to go back to some of my queries and see if I’ve overlooked estimates on the Constant Scans/Compute Scalars that I “cleanse” my parameters with. The good part is we’re reading and learning.

      @Brent how does this relate, though, to the fact that, I believe, having another variable and doing a SET inside a procedure will also play havoc with estimates – and especially parameter sniffing? I.e., if I were to say:

      CREATE PROCEDURE ... @startDate DATETIME, @endDate DATETIME
      AS
      SET @startDate = ISNULL(@startDate, GETDATE());
      ...
      WHERE LastAccessed >= @startDate

      Not that I’d generally put a function on the table side of that query anyway if I could help it, because indexes don’t tend to like having every value parsed/converted.

      • Andrew – that’s a totally different concept, and I cover that one in Mastering Query Tuning. That one’s a lot of fun too!

        • Thanks, and since I just got that in the sale, I’ll have to have a look thru it.
          But I thought there was a “damned if you do, damned if you don’t” choice there.

    • Hahaha, my pleasure.

  • Great post, Brent!
    Everyone is talking about SARGability, but here you proved that even an underestimate can be good sometimes. I wasn’t able to get exactly the same numbers as shown above with the new CE (2014 and greater). But when I used the old CE, the exact numbers appeared.

    • Milan – yep, we cover the differences in cardinality estimation in later classes & posts. There are only so many things I can cover in each blog post. 😀

