The 201 Buckets Problem, Part 2: How Bad Estimates Backfire As Your Data Grows


In the last post, I talked about how we don’t get accurate estimates because SQL Server’s statistics only have up to 201 buckets in the histogram. It didn’t matter much in that post, though, because we were using the small StackOverflow2010 database.

But what happens as our data grows? Let’s move to a newer Stack Overflow database, the 2018-06 one that I use for my Mastering training classes. We’ll create the same index, giving us the same statistics, and then look at the histogram to see how Miami grew over time:
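The script itself only appears in screenshots, but the setup is roughly the following, assuming the same dbo.Users(Location) index as in Part 1 (the index name IX_Location is just a placeholder):

```sql
/* Roughly the setup, assuming the same dbo.Users(Location) index as in Part 1;
   the index name IX_Location is just a placeholder. */
CREATE INDEX IX_Location ON dbo.Users (Location);
GO

/* Then look at the histogram on that index's statistics: */
DBCC SHOW_STATISTICS ('dbo.Users', 'IX_Location') WITH HISTOGRAM;
```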

Now we’re up to about 9 million users:

And the statistics histogram for the area around Miami:

The data distribution has changed, but Miami still isn’t big enough to have its own bucket. Now:

  • Miami, FL is between México and Minneapolis (note that Michigan no longer has its own bucket)
  • AVG_RANGE_ROWS = 13.3 means that any location in between México and Minneapolis will get an estimate of 13.3 rows
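If you'd rather query those bucket numbers than read them off a screenshot, sys.dm_db_stats_histogram (SQL Server 2016 SP1 CU2 and later) exposes the same histogram; a minimal sketch, assuming the placeholder index name from earlier:

```sql
/* A sketch using sys.dm_db_stats_histogram, assuming the IX_Location index name above. */
SELECT  h.step_number,
        h.range_high_key,       -- bucket boundary values: ..., México, Minneapolis, ...
        h.equal_rows,           -- rows exactly equal to the boundary value
        h.average_range_rows    -- the 13.3 estimate for anything between boundaries
FROM    sys.stats AS s
CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
WHERE   s.object_id = OBJECT_ID(N'dbo.Users')
  AND   s.name = N'IX_Location'
ORDER BY h.step_number;
```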

So when we run our Miami query, what’s it look like now:
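The query itself appears only as a screenshot above; a sketch along these lines reproduces the shape of it (the exact column list and ORDER BY may differ from the original):

```sql
/* A sketch of the Miami query; the exact column list and ORDER BY
   may differ from the one in the screenshots. */
SELECT  *
FROM    dbo.Users
WHERE   Location = N'Miami, FL'
ORDER BY DisplayName;
```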

The estimated number of rows for Miami, FL is 13, but in actuality, 625 rows come out of the index seek. The variance is starting to grow as our data size grows.

So what do we do about Miami’s inaccurate estimates?

Again, in this case…nothing. 625 rows vs 13 rows isn’t really that big of a deal. In my Fundamentals of Query Tuning class, I talk about how you want to read query plans from right to left, looking for places where estimates vs actual are 10x off or more. In this case, we’re 48x off, but…it doesn’t really matter that much. SQL Server still allocated enough memory to do the sort, and it didn’t spill to disk. (Note there’s no yellow bang on the sort operator.)

However, as I noted yesterday, it will become a big deal if we join to other tables. Let’s see what happens if we add a join to Comments, for example, to display the top comments left by people from Miami:
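Again, the query lives in a screenshot; a sketch of a "top comments from Miami" join looks something like this (the exact columns, TOP value, and sort are assumptions):

```sql
/* A sketch of the join; exact columns, TOP value, and sort column are assumptions. */
SELECT TOP 100 u.DisplayName, c.Score, c.Text
FROM    dbo.Users AS u
JOIN    dbo.Comments AS c ON c.UserId = u.Id
WHERE   u.Location = N'Miami, FL'
ORDER BY c.Score DESC;
```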

The actual plan:

We read plans from right to left, top to bottom, to understand what SQL Server did in what order. (You can also read from left to right in some cases, and I’m just legally required to say that, because if I don’t, Grant Fritchey and Hugo Kornelis will smother me as I sleep.)

So from right to left:

  1. SQL Server did an index seek on Location = Miami, FL, and expected only 13 rows to come back
  2. In reality, 625 rows came back
  3. So we also had to do 625 key lookups
  4. We had to do 625 index seeks on Comments.UserId because each person has their own set of comments
  5. In total, we found 14,176 comments left by people in Miami, FL, so that poor Sort operator never stood a chance. He has a yellow bang because he ended up spilling to disk.

So, really, now, seriously, what do we do about Miami’s inaccurate estimates?

In this example, the query still runs fairly quickly. But if you run into this kind of problem with a larger query, larger spills to disk, larger estimate variations, and so forth, there are some techniques you can consider for performance tuning.
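One such technique, which also comes up in the comments below, is a filtered statistic on the skewed value; a minimal sketch, assuming the dbo.Users(Location) column from this example:

```sql
/* One possibility (a sketch, not the post's own list of techniques): a filtered
   statistic just for the skewed value, so Miami gets its own histogram instead
   of inheriting AVG_RANGE_ROWS from the surrounding range. */
CREATE STATISTICS Stats_Users_Location_Miami
    ON dbo.Users (Location)
    WHERE Location = N'Miami, FL';
```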

But the big takeaway for me is that 201 buckets in a histogram just isn’t enough to portray real-world data skew, especially as your data size grows into the millions of rows and beyond.


14 Comments

  • It would be interesting to see what the economic cost is on a cloud platform in terms of CPU cycles (if any).

    • On that one, it’s practically nothing – it’s just an example to show how estimates gradually start to go haywire. When I see real-world production examples, it’s much more dramatic, though.

  • Have you had any good results with filtered statistics for this type of problem? When I tried in the past, I couldn’t get the estimator to actually use the filtered statistic as the source of the estimate. This was a few years ago, so I don’t remember all the details, but if I recall correctly they won’t be used for joins.

    I wonder what MS’s thoughts on adding more buckets are. Surely the resource tradeoffs for adding more buckets are much different today than they were in the 80s or 90s when this bucket count was established.

    • You’re not alone: I haven’t had good luck with filtered stats either, especially in real-world scenarios like this where there are hundreds or thousands of values that get bad estimates, not just Miami.

    • Thomas Franz
      July 9, 2020 12:31 am

      I didn’t test it with filtered statistics specifically, but with filtered indexes you may need to add an OPTION (RECOMPILE) hint to your query if you do not hardcode Miami (but use a variable / parameter instead). Whether this is worth the effort depends on how often the query runs (number of recompiles per minute) and how much you save.
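      A minimal sketch of that pattern, using the Users.Location example from the article:

      ```sql
      DECLARE @Location nvarchar(100) = N'Miami, FL';

      SELECT  *
      FROM    dbo.Users
      WHERE   Location = @Location
      ORDER BY DisplayName
      OPTION (RECOMPILE);  -- lets the optimizer see the variable's actual value at compile time
      ```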

  • I still haven’t tried column indexes, but in my thinking a column index is like a statistic on steroids. There should be an entry for each unique value. Right?

  • Andrej Kuklin
    July 9, 2020 3:24 am

    A real-world example where filtered statistics did help:
    A star-schema DWH with a DimDate table having an integer key in the format YYYYMMDDVV, where VV is a version within a given day. So 2020010103 would be the third version on January 1, 2020. Usually there are not a lot of versions per day (like 1 or 2, almost always <10).
    As there are more than 200 records in DimDate, we get exactly the same problem as in the article. Worse than that, if you do a "select … from facttable where DateID=…" for a given DateID which didn’t get its own histogram bucket, SQL Server computes the probability that this integer value falls within the range. As we can theoretically have 100 versions per day (but usually have 1), the cardinality estimates were 100x too low in these cases.

    The solution (or rather workaround, or rather dirty hack 😉) was to have the ETL create filtered statistics for each loaded DateID (approximately 700 statistics for 2 years of data; they are automatically removed along with outdated data during housekeeping). This way the above query gets correct estimates.
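    Roughly what each of those ETL-created statistics looks like (the fact table and statistic names here are hypothetical):

    ```sql
    /* Roughly what each ETL-created statistic looks like;
       the fact table and statistic names are hypothetical. */
    CREATE STATISTICS Stats_Fact_DateID_2020010103
        ON dbo.FactSales (DateID)
        WHERE DateID = 2020010103;
    ```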

    • 700 statistics per table, interesting. Did you happen to notice any issues with your update statistics jobs, or did you have to just abandon them altogether?

