Statistics Aren’t Guarantees: SQL Server Still Checks.

By Brent Ozar · November 4, 2020 · 3 comments

When I’m reviewing plans with folks, I get this response a lot:

But SQL Server should know there’s no data that matches! It has statistics!

Statistics are guideposts, not guarantees.

We’ll start with the Stack Overflow database and put an index on LastAccessDate, which also creates a statistic on that column. We’ll check the min and max LastAccessDates:

CREATE INDEX LastAccessDate ON dbo.Users(LastAccessDate);

SELECT MIN(LastAccessDate), MAX(LastAccessDate) FROM dbo.Users;

In the database I’m using today, the min is ‘2008-08-01 00:59:11.147’ and the max is ‘2018-06-03 05:18:24.843’. If I query for all users lower than the min, or all users above the max:

SELECT * FROM dbo.Users 
    WHERE LastAccessDate < '2008-08-01 00:59:11.147'
    ORDER BY DisplayName;

SELECT * FROM dbo.Users 
    WHERE LastAccessDate > '2018-06-03 05:18:24.843'
    ORDER BY DisplayName;

The actual execution plans show that SQL Server estimated 1 row would come back for each of those, but of course 0 rows actually return.
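If you want to see where that estimate comes from, you can look at the histogram behind the statistic. This is a minimal sketch, assuming the statistic has the same name as the index we just created (LastAccessDate):

```sql
-- Show the histogram for the statistic that came along with the index.
-- The last RANGE_HI_KEY row is the highest LastAccessDate that SQL Server
-- knew about when the statistic was built.
DBCC SHOW_STATISTICS ('dbo.Users', 'LastAccessDate') WITH HISTOGRAM;
```

Values outside the histogram's range don't produce a zero-row estimate here. SQL Server still guesses 1 row, and then goes and checks the data.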

You might say, “But SQL Server has statistics on those columns, and it knows what the top values are!” Well, that’s true, but…data can change without the statistics being updated. For example, say that one user logs in right now, and then we run the MAX query again:

UPDATE dbo.Users 
    SET LastAccessDate = GETDATE() 
    WHERE DisplayName = N'Brent Ozar';

SELECT * FROM dbo.Users 
    WHERE LastAccessDate > '2018-06-03 05:18:24.843'
    ORDER BY DisplayName;

This time, one row actually returns.
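To check whether that update touched the statistic, you can ask SQL Server how many modifications it has seen since the statistic was last updated. This is a rough sketch, again assuming the statistic is named LastAccessDate, and assuming a version of SQL Server that includes sys.dm_db_stats_properties:

```sql
-- Compare when the statistic was last built with how many changes
-- have piled up on its leading column since then.
SELECT  s.name AS stats_name,
        sp.last_updated,
        sp.rows,
        sp.rows_sampled,
        sp.modification_counter
FROM    sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE   s.object_id = OBJECT_ID(N'dbo.Users')
  AND   s.name = N'LastAccessDate';
```

A nonzero modification_counter alongside an old last_updated date is a hint that the data has moved on since the statistic was built.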

This has a bunch of interesting ramifications.

SQL Server has to build the whole query plan at once. He doesn’t go execute the first operator, check to see how many rows come back, and then build the rest of the plan. There are adaptive joins that will change their behavior depending on how many rows are found, but their existence in the plan has to be set up before the first operation in the plan even starts. The more complex your query is, the more time it takes to build the whole plan – even if no rows are going to be found.

The statistics can be way off. In our Stack Overflow scenario, imagine that users are constantly logging in, all the time. SQL Server will keep thinking just one row is going to come back until the statistics are updated, or SQL Server figures out that we have what’s called an “ascending key problem.” This is a constant issue in SQL Server where it doesn’t realize that a particular column is going to keep growing in the future, and that it should always assume there’s going to be more data for higher values, even if that data hasn’t been loaded yet.
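One way to catch the statistics up is to update them by hand. This is a sketch, not a maintenance recommendation: it assumes the statistic is named LastAccessDate, and FULLSCAN is just one sampling choice, which can be slow on a big table.

```sql
-- Rebuild the statistic on the LastAccessDate index by reading every row,
-- so the histogram includes the newest logins.
UPDATE STATISTICS dbo.Users LastAccessDate WITH FULLSCAN;

-- Rerun the query and compare the estimated and actual row counts.
SELECT * FROM dbo.Users
    WHERE LastAccessDate > '2018-06-03 05:18:24.843'
    ORDER BY DisplayName;
```

The fix only lasts until the next batch of logins, which is what makes ascending keys a recurring problem.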

If apps constantly query for data that can’t possibly exist, like if we know no one is ever going to have a LastAccessDate < 2008, then we can put in a check constraint. SQL Server may bypass querying the table altogether and do a constant scan. I’ve used this trick when an app sent in queries that I couldn’t control, and the query was becoming a performance issue.
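Here's roughly what that could look like. The constraint name and the 2008-01-01 cutoff are assumptions for illustration:

```sql
-- WITH CHECK validates the existing rows, so the constraint is trusted.
-- CK_Users_LastAccessDate_Min is a hypothetical name.
ALTER TABLE dbo.Users WITH CHECK
    ADD CONSTRAINT CK_Users_LastAccessDate_Min
    CHECK (LastAccessDate >= '20080101');

-- Confirm the constraint is trusted (is_not_trusted should be 0).
SELECT name, is_not_trusted
FROM sys.check_constraints
WHERE parent_object_id = OBJECT_ID(N'dbo.Users');

-- This predicate contradicts the constraint, so the plan may turn
-- into a constant scan instead of touching the table.
SELECT * FROM dbo.Users
    WHERE LastAccessDate < '20080101'
    ORDER BY DisplayName;
```

Whether you actually get the constant scan depends on things like the constraint being trusted and the query not being parameterized, so check the actual plan rather than assuming.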

If apps constantly query for data that rarely exists, like if they check for invalid data and then correct it with a subsequent query, we can use a filtered index to just focus on that bad range of data. I cover that in the filtered indexes module in Mastering Index Tuning.
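As a sketch of the idea, assuming the "bad" range is anything before 2008 (the index name and the cutoff are made up for illustration):

```sql
-- A small index that only holds the rows in the suspect range.
-- IX_Users_LastAccessDate_Invalid is a hypothetical name.
CREATE INDEX IX_Users_LastAccessDate_Invalid
    ON dbo.Users (LastAccessDate)
    WHERE LastAccessDate < '20080101';

-- A query whose predicate matches the filter can use the small index.
SELECT Id, LastAccessDate
FROM dbo.Users
WHERE LastAccessDate < '20080101';
```

The filtered index also gets its own statistic covering just those rows. Keep in mind the query usually needs a literal that fits inside the filter; a parameter or variable may keep SQL Server from using the filtered index.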

If you want to learn more about statistics, check out the free statistics courses on my YouTube channel.

3 comments

Chad Franklin

November 4, 2020 at 11:59 am

I love how you refer to SQL Server as “he”. I’ve always used “it” as the pronoun for SQL Server. Maybe I will start using “he” now as well.

1. Brent Ozar
  
  November 4, 2020 at 12:43 pm
  
  Thanks! In the live streams & classes, I explain that it’s because he’s overly confident and refuses to take direction. He keeps grunting, “TRUST ME, I’VE GOT THIS.”
  
2. Alex Kosariev
  
  November 4, 2020 at 11:58 pm
  
  It is always “he” for SQL Server in Eastern European languages such as Ukrainian or Russian.
  However, statistics is “she”. Brent, feel free to use she for statistics ;o)
  
