Statistics Aren’t Guarantees: SQL Server Still Checks.

Statistics
3 Comments

When I’m reviewing plans with folks, I get this response a lot:

But SQL Server should know there’s no data that matches! It has statistics!

Statistics are guideposts, not guarantees.

We’ll start with the Stack Overflow database and put an index on LastAccessDate, which also creates a statistic on that column. We’ll check the min and max LastAccessDates:

In the database I’m using today, the min is ‘2008-08-01 00:59:11.147’ and the max is ‘2018-06-03 05:18:24.843’. If I query for all users lower than the min, or all users above the max:

The actual execution plans show that SQL Server estimated 1 row would return for each of those, but of course 0 rows actually return:

You might say, “But SQL Server has statistics on those columns, and it knows what the top values are!” Well, that’s true, but…data can change without the statistics being updated. For example, say that one user logs in right now, and then we run the MAX query again:

One row actually returns:

This has a bunch of interesting ramifications.

SQL Server has to build the whole query plan at once. He doesn’t go execute the first operator, check to see how many rows come back, and then build the rest of the plan. There are adaptive joins that will change their behavior depending on how many rows are found, but their existence in the plan has to be set up before the first operation in the plan even starts. The more complex your query is, the more time it takes to build the whole plan – even if no rows are going to be found.

The statistics can be way off. In our Stack Overflow scenario, imagine that users are constantly logging in, all the time. SQL Server will keep thinking just one row is going to come back until the statistics are updated, or SQL Server figures out that we have what’s called an “ascending key problem.” This is a constant issue in SQL Server where it doesn’t realize that a particular column is going to keep growing in the future, and that it should always assume there’s going to be more data for higher values, even if that data hasn’t been loaded yet.

If apps constantly query for data that can’t possibly exist, like if we know no one is ever going to have a LastAccessDate < 2008, then we can put in a check constraint. SQL Server may bypass querying the table altogether and do a constant scan. I’ve used this trick when an app sent in queries that I couldn’t control, and the query was becoming a performance issue.

If apps constantly query for data that rarely exists, like if they check for invalid data and then correct it with a subsequent query, we can use a filtered index to just focus on that bad range of data. I cover that in the filtered indexes module in Mastering Index Tuning.

If you want to learn more about statistics, check out the free statistics courses in my YouTube channel. Or, if you’d like to learn about more advanced indexing tricks like these, check out my Mastering Index Tuning class. I have upcoming live classes starting this Friday, and also December 8th, and you can get in free with a Live Class Season Pass.

Previous Post
Why I Teach Training Classes on Weekends (the next one starts Friday)
Next Post
Free Webcast: Why is the Same Query Sometimes Slow?

3 Comments. Leave new

  • I love it how you refer to SQL Server as “he”. I’ve always used “it” as the pronoun for SQL Server. Maybe I will will start using he now as well.

    Reply
    • Thanks! In the live streams & classes, I explain that it’s because he’s overly confident and refuses to take direction. He keeps grunting, “TRUST ME, I’VE GOT THIS.”

      Reply
    • It is always “he” for SQL Server in Eastern European languages such as Ukrainian or Russian.
      However, statistics is “she”. Brent, feel free to use she for statistics ;o)

      Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.