Query Exercise Answers: Solving the 201 Buckets Problem

In this week’s Query Exercise challenge, I explained SQL Server’s 201 buckets problem. SQL Server’s statistics only handle up to ~201 outliers, which means that outliers ~202-300 get wildly inaccurate estimates.

In our example, I had an index on Location and perfectly accurate statistics, but even still, this query gets bad estimates because Lithuania is in outliers ~202-300:
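(Here's a sketch of it – the exact column list in the original post may differ, but the shape is a plain equality search against the Stack Overflow Users table.)

/* Sketch of the challenge query: equality filter on the indexed Location
   column of the Stack Overflow Users table. */
SELECT u.Id, u.DisplayName, u.Location
FROM dbo.Users AS u
WHERE u.Location = N'Lithuania';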

SQL Server estimates that only 8 rows will be found for Lithuania, when in reality 2,554 rows come back.

This under-estimation isn’t really a problem for this particular query’s performance, but when I’m teaching you concepts, I gotta teach you how the simple things break before I start getting into real-world specifics. If you can’t solve this simplistic version, then there’s no way you’re gonna be able to solve the real-world version, which we’re gonna get into in the next Query Exercise.

Solving It with a New Filtered Statistic

I cringe when I type that heading out because it’s such a slimy solution. I’ll start with a simple version, and it’s actually something I hinted at you not to do in the challenge instructions because it’s a bad idea, but bear with me for a second.

SQL Server lets you create your own statistic on specific outliers by using a statistic with a WHERE clause, like this:
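(A minimal sketch, assuming the same dbo.Users table – the statistic name here is my own choice, not necessarily what the original script used.)

/* A filtered statistic that covers only Lithuania. WITH FULLSCAN reads the
   whole table so the filtered histogram is as accurate as possible. */
CREATE STATISTICS Location_Lithuania
    ON dbo.Users (Location)
    WHERE Location = N'Lithuania'
    WITH FULLSCAN;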

The query gets a new execution plan with bang-on estimates.

Okay, great – but like I mentioned in the challenge requirements, you wouldn’t actually do this in the real world because you’d probably have multiple outliers, not just 1. If you only had 1 big outlier, then… it’d be in your regular 201 buckets!

Instead, if we’re going to solve this with a filtered statistic, we need to create our own manual statistic for the next 200 outliers. First, let’s write a query to find outliers 201-400. Keep in mind, I’m not guaranteeing that some of these aren’t already in the existing Location statistic – I’m just trying to illustrate how we’d find Locations with a large population that aren’t in the top 200:
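(A sketch – ranking Locations by raw row counts and skipping the 200 most common.)

/* Rank Locations by user count, skip the top 200, and return the next 200
   (outliers 201-400 by population). No tiebreaker on the sort. */
SELECT Location, COUNT(*) AS UsersInLocation
FROM dbo.Users
WHERE Location IS NOT NULL
GROUP BY Location
ORDER BY COUNT(*) DESC
OFFSET 200 ROWS FETCH NEXT 200 ROWS ONLY;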

To keep things simple, I’m not using a tiebreaker here. (There are only so many things I can cover in a blog post, and at the end of the day, remember that the focus here is on statistics & outliers.)

Next, let’s use that to build a string that will create our filtered statistic:
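(A sketch using STRING_AGG, which needs SQL Server 2017 or newer – the Location_Outliers name is the one we’ll reference later, but the rest of the plumbing is my own.)

/* Build a CREATE STATISTICS command whose filter covers the next 200 most
   common Locations. QUOTENAME escapes any embedded single quotes, and the
   NVARCHAR(MAX) convert keeps STRING_AGG from truncating the list. */
DECLARE @StringToExecute NVARCHAR(MAX);

SELECT @StringToExecute =
        N'CREATE STATISTICS Location_Outliers ON dbo.Users (Location) WHERE Location IN ('
      + STRING_AGG(CONVERT(NVARCHAR(MAX), QUOTENAME(Location, N'''')), N', ')
      + N') WITH FULLSCAN;'
FROM
(
    SELECT Location
    FROM dbo.Users
    WHERE Location IS NOT NULL
    GROUP BY Location
    ORDER BY COUNT(*) DESC
    OFFSET 200 ROWS FETCH NEXT 200 ROWS ONLY
) AS NextOutliers;

SELECT @StringToExecute AS CommandToRun;  /* PRINT truncates long strings, so SELECT it to review */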

That gives us a string we can execute – and sure enough, Lithuania is in the list.

Note that my dynamic SQL is not checking for the existence of the stat and dropping it – if you were going to productize a solution like this, that’s left as an exercise for the reader. You could either drop & create the statistic on a regular basis (like, say, quarterly – the top outliers shouldn’t change that much in a mature database) or create a new filtered stat with a date-based name, and then drop the old one. That’s much more work though.

For now, let’s create the statistic and check out its contents:
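(Continuing the script above – same variable and statistic names.)

/* Execute the generated CREATE STATISTICS command, then look at the new
   statistic's histogram. */
EXEC sys.sp_executesql @StringToExecute;

DBCC SHOW_STATISTICS ('dbo.Users', Location_Outliers) WITH HISTOGRAM;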

We now have a bucket dedicated exclusively to Lithuania. Try the Lithuania query filter again, and look at the new actual query plan:
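(Same query sketch as before, for convenience.)

/* Same query as before; compare the estimated and actual row counts in the
   actual execution plan. */
SELECT u.Id, u.DisplayName, u.Location
FROM dbo.Users AS u
WHERE u.Location = N'Lithuania';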

Presto, Lithuania gets accurate estimates, as does any other Location covered by the new filtered statistic.

To see why, right-click on the SELECT operator, click Properties, and go into OptimizerStatsUsage. Both the Location and the Location_Outliers statistics were used when building the query plan’s estimates.

Like I said when I first started talking about it above, this solution makes me cringe. It feels dirty, like it’s made out of chewing gum and duct tape. This feels like something that should have a more intuitive answer. For example, over in the free database crowd, Postgres lets you set the number of buckets at the server level with default_statistics_target, and lets you override it per column with ALTER TABLE … ALTER COLUMN … SET STATISTICS. Postgres lets you pick up to 10,000 outliers – and hell, SQL Server’s filtered statistics solution only gives you another 200 per filtered stat that you create!
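For reference, the Postgres knobs look roughly like this (a sketch – the users table and location column are hypothetical):

-- Server-wide default for histogram buckets / most common values
-- (the maximum allowed value is 10,000):
ALTER SYSTEM SET default_statistics_target = 1000;

-- Per-column override on a hypothetical users.location column:
ALTER TABLE users ALTER COLUMN location SET STATISTICS 10000;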

At the same time though, the vast majority of tables will never grow large enough to experience this problem. In this blog post, I’m illustrating the problem with the Users table in the most recent Stack Overflow database, and that table has over 22 million rows, with over a decade of activity for a very large web site. Even still, the query we’re talking about doesn’t even have a performance problem!

It’s also important to note that both the 201 buckets problem and the filtered statistics solution have nothing to do with parameter sniffing, trivial plans, plan reuse, or anything like that. People often get these problems confused because they’re similar, but in this specific example, I’m looking at best-case scenarios for query optimization – this query doesn’t even have parameters, and we’re getting a fresh execution plan each time, specifically designed for Lithuania. (You can try recompile hints to prove to yourself that plan reuse isn’t the issue here, and even with the filtered stats solution, you’d also still have parameter sniffing issues on reusable plans for different locations.)

So would I recommend this solution to solve the above problem? Probably not – but again, like I said, I have to teach you how to solve it in a simple scenario before I move into a more complex, real-world scenario. That one’s coming up in the next Query Exercise where we’ll add in larger tables and more query complexity.

If you want to see even more solutions, check out the comments on the challenge post. Thomas Franz had a particularly interesting one: he created 26 filtered stats, one for each letter of the alphabet, effectively giving him over 5,000 buckets for his statistics!

Hope you enjoyed the challenge! For more exercises, check out the prior Query Exercises posts and catch up on the ones you missed so far.
