Query Exercise: Solving The 201 Buckets Problem


When you run a query, SQL Server needs to estimate the number of matching rows it’ll find – so that it can decide which indexes to use, whether to go parallel, how much memory to grant, and more.

For example, take any Stack Overflow database, and let’s say I have an index on Location, and I want to find the top-ranking users in Lithuania:
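Here's roughly what that looks like – an index on Location plus the query (sketched from the comments below; the exact index name and SELECT list are approximations):

CREATE INDEX Location ON dbo.Users (Location);
GO

SELECT *
  FROM dbo.Users
  WHERE Location = N'Lithuania'
  ORDER BY Reputation DESC;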

Then SQL Server has to guess how many people are in Lithuania so it can decide whether to use the index on Location, or do a table scan – because if there are a lot of folks in Lithuania, then it would mean a lot of key lookups to get the Reputation value for each of them.

We’ll run the query in the small StackOverflow2010 database and review the actual execution plan:

In the top right operator, the Index Seek, SQL Server only estimated 5 rows, but 84 rows actually came back. Now, that’s not really a problem for this particular query because:

  • SQL Server used the index – which makes the query fast
  • SQL Server did 84 key lookups instead of 5 – but still, that’s fewer logical reads than a table scan
  • The query went single-threaded – but there was so little work that it didn’t matter
  • The query didn’t spill to disk – there’s no yellow bang on the sort operator

As our database grows, though, the lines start to blur. Let’s run the same query on the largest current version of the StackOverflow database and see what happens in the actual execution plan:

The top right operator, the Index Seek, shows just 8 rows estimated, but 2,554 rows were actually found. As our data size grows, these estimate variances start to become problematic. Now granted, this succeeds in the same way the 2010 query succeeds: we get an index seek, it’s still fewer logical reads than a table scan would be, the single-threaded thing isn’t a problem for a 27 millisecond query, and we don’t spill to disk.

However, if we start to join to other tables (and we will, in the next Query Exercise), then this under-estimation is going to become a problem.

Why is the estimate wrong?

We do indeed have statistics on the Location index, and they were created with fullscan since we just created the index. Let’s view the statistics for the large database:
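One way to do that – assuming the statistics object shares the index’s name, Location:

DBCC SHOW_STATISTICS ('dbo.Users', 'Location');

-- Or jump straight to the histogram steps:
DBCC SHOW_STATISTICS ('dbo.Users', 'Location') WITH HISTOGRAM;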

And check out the histogram contents – we’ll page down to Lithuania:

Or rather, we’ll page down to where you would expect Lithuania to be, and there’s a problem: Lithuania’s not there. SQL Server’s statistics are limited to just 201 buckets, max. (Technically, it’s up to 200 buckets for “normal” values in the table, plus 1 bucket for null.)

SQL Server does the best job it can of picking outliers in order to paint an accurate picture of the data, but that’s hard with just 201 buckets.

Typically – but not always – when SQL Server picks the locations that it’ll use for outliers, it uses around the top 200 locations by size, but this can vary a lot depending on the sort order of the column and the distribution of the data. Let’s look at the top locations:
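A quick way to rank them, sketched out – the TOP 300 cutoff here is arbitrary:

SELECT TOP 300
       Location,
       COUNT(*) AS UsersInLocation
  FROM dbo.Users
  GROUP BY Location
  ORDER BY COUNT(*) DESC;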

And Lithuania is at row 240 in this case:

So it’s a big location – but not big enough to hit the top 201, which means it’s not going to get accurate estimates. The estimates are derived by looking at which bucket Lithuania is in – in the screenshot below, it’s row 100:

Lithuania sorts after Lisbon but before London, so it falls into the row 100 bucket. Row 100’s AVG_RANGE_ROWS is 7.847202, which means that any location between Lisbon and London gets an average estimate of about 8 rows. And that’s where the estimate is coming from in our query:
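If you’d rather query the histogram than eyeball the DBCC output, sys.dm_db_stats_histogram (SQL Server 2016 SP1 CU2 and later) exposes the same steps – here, assuming the stat is named Location:

SELECT h.step_number,
       CONVERT(NVARCHAR(100), h.range_high_key) AS range_high_key,
       h.equal_rows,
       h.average_range_rows
  FROM sys.stats AS s
  CROSS APPLY sys.dm_db_stats_histogram(s.object_id, s.stats_id) AS h
  WHERE s.object_id = OBJECT_ID(N'dbo.Users')
    AND s.name = N'Location'
  ORDER BY h.step_number;
-- Lithuania isn't a range_high_key of its own, so it inherits the
-- average_range_rows of the step whose range_high_key comes next
-- after it in sort order (London, in this case).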

Your challenge: get an accurate estimate.

You can change the query, the database, server-level settings, you name it. Anything that you would do in a real-life situation, you can do here. However, having done this exercise in my Mastering classes, I can tell you a couple things that people will try to do, but don’t really make sense.

You don’t wanna dump the data into a temp table first. Sometimes people will extract all of the data into a temp table, and then select data out of the temp table and say, “See, the estimate is accurate!” Sure it is, speedy, but look at your estimate from when you’re pulling the data out of the real table – the estimate’s still wrong there.

You don’t wanna use a hard-coded stat or index for just ‘Lithuania’. That only solves this one value, but you’ll still have the problem for every other outlier. We’re looking for a solution that we can use for most big outliers. (It’s always tricky to phrase question requirements in a way that rules out bad answers without pointing you to a specifically good answer, hahaha.)

Put your queries in a Github Gist and the query plans in PasteThePlan, showing your new accurate estimates, and include those links in your comments. Check out the solutions from other folks, and compare and contrast your work. I’ll circle back next week for a discussion on the answers. Have fun!


14 Comments

  • My Gist
    https://gist.github.com/Paul-Fenton/83b0829263e9586868e1bd29fc2d6ccf

    Query Plan
    https://www.brentozar.com/pastetheplan/?id=r19_spQjC

    Create a new column “PopularLocation” which is set to the Location if it’s one of the top 250 locations.

    Then change the query to look like:

    SELECT * FROM dbo.Users
    WHERE Location = N'Lithuania' OR PopularLocation = N'Lithuania'
    ORDER BY Reputation DESC;

    The estimate is now “84 of 85 rows (98%)” instead of “84 of 5 rows (1680%)”
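    For reference, the column setup described above might look roughly like this sketch – the real version is in the Gist, so treat the data type, population step, and index name as assumptions:

    -- Hypothetical sketch: flag the top 250 locations by user count
    ALTER TABLE dbo.Users ADD PopularLocation NVARCHAR(100) NULL;
    GO
    UPDATE u
      SET u.PopularLocation = u.Location
      FROM dbo.Users AS u
      WHERE u.Location IN (SELECT TOP 250 Location
                             FROM dbo.Users
                             GROUP BY Location
                             ORDER BY COUNT(*) DESC);
    GO
    CREATE INDEX PopularLocation ON dbo.Users (PopularLocation);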

  • Sorry, I cannot test it right now – just thinking out loud: perhaps filtered stats may solve it. I know it probably won’t work for other locations, but perhaps if we use a parameter for the location along with the RECOMPILE hint, that will be more accurate. Again, sorry, but I cannot test it right now.

  • Wojciech Sawicki
    August 22, 2024 6:58 am

    Sometimes I need to cheat SQL Server because I want to see what will happen when there are many rows.

    UPDATE STATISTICS dbo.Users([Location])
    WITH ROWCOUNT = 500000000000;

    This gives an estimate of 77 rows for Lithuania. I know it’s not real life, but for testing it is sometimes useful.

  • Code:
    https://gist.github.com/samot1/8d6e3bc5feee76b7959841cae8f81b97
    Plan:
    https://www.brentozar.com/pastetheplan/?id=HJnkAt4oR

    Idea:
    I just created a filtered statistic for every location starting with A, with B, with C … with Z

    This way I have not just 200 histogram steps but 5,200, which is enough to cover most of the larger locations and get the correct estimates.

    Drawback: you may or may not have to use OPTION(RECOMPILE) to get the correct statistic used. When you put the location into a variable and use it in the query, or have the PARAMETERIZATION option on your database set to FORCED, you always have to use RECOMPILE to get the correct estimate.

    PS: in a real scenario there may be much better options to achieve the correct estimate, but you would need to know how exactly the database is queried (which queries, how often, what happens with the results, how busy the server is …) and use all this information to decide the best option (which could be to simply ignore the wrong estimates, since the overhead is bigger than the benefit)
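    One of those 26 per-letter filtered statistics might look something like this sketch (the Gist has the actual generation script):

    CREATE STATISTICS Location_A
      ON dbo.Users (Location)
      WHERE Location >= N'A' AND Location < N'B'
      WITH FULLSCAN;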

    • Thomas – that is intriguing! I love the creativity.

      I like the drawbacks that you put in the Gist – there are totally drawbacks here, for sure. For example, each stat has to be updated separately, so we just made our maintenance window explode. However, if someone is advanced enough to use this solution, I’d also expect them to be advanced enough to have relatively infrequent, targeted statistics updates.

      I like it! I think you’ll also like the solution I talk about in next week’s post.

      • I had another much worse idea but it didn’t work (even if it could/should):
        CREATE PARTITION FUNCTION pf_a_to_z (NVARCHAR(100)) AS RANGE RIGHT FOR VALUES ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z');
        CREATE PARTITION SCHEME ps_a_to_z AS PARTITION pf_a_to_z ALL TO ([PRIMARY]);
        ALTER TABLE [dbo].[Users] DROP CONSTRAINT [PK_Users_Id]; -- drop old PK

        CREATE UNIQUE CLUSTERED INDEX pk_Users_id ON dbo.Users (Id, Location) ON ps_a_to_z (Location); -- base table (= clustered index) must be partition aligned to allow STATISTICS_INCREMENTAL = ON on the Location index
        GO
        CREATE NONCLUSTERED INDEX [Location] ON [dbo].[Users]
        ([Location])
        WITH (DROP_EXISTING = ON,
        DATA_COMPRESSION = NONE,
        SORT_IN_TEMPDB = ON,
        ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON,
        STATISTICS_INCREMENTAL = ON
        )
        ON ps_a_to_z (Location);
        GO

        Now I have a table that is partitioned by location. STATISTICS_INCREMENTAL = ON says it should get one statistic per partition (plus a combined one), but sadly SQL Server eliminates all but one partition yet still doesn’t use that partition’s statistic, so the estimates are still wrong.

        And it has several drawbacks, e.g. I need to specify the location to be able to query by UserId efficiently, and I can have the same UserId twice in several locations unless I create another unique nonclustered index on UserId …

  • Geoff Patterson
    August 27, 2024 8:59 pm

    Code: https://gist.github.com/pattertall/0588861171caff94467ba7d6100c8762
    Plan: https://www.brentozar.com/pastetheplan/?id=B1mDKnijA

    An indexed view that groups by location is an efficient way to get the perfect cardinality “estimate” for any location via a single index seek. In principle, SQL Server could even use such an indexed view (if one exists) in place of the statistics histogram in order to overcome the 200 bucket limit.

    Since SQL Server can’t do this automatically, we can hack together a rudimentary version by looking up a location that does exist within the 200 bucket histogram and has a user count roughly the same as the location we are querying. Then we execute the query with OPTIMIZE FOR to tell SQL Server to pretend we are querying whichever location does exist in the histogram with a similar number of rows.

    There might be any number of risks to putting something so hacky into production (e.g., much greater code complexity, overhead of querying histogram, even more complexity if you introduce a cache to eliminate that small fixed overhead, possibility of getting blocked querying the histogram internals, etc.), but it’s the most creative approach I could think of that seems to generally work for both Lithuania and other test cases.
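    The indexed view half of that idea might be sketched like this – the names are assumptions, and the Gist above has the full version, including the OPTIMIZE FOR lookup:

    CREATE VIEW dbo.LocationCounts
      WITH SCHEMABINDING
    AS
      SELECT Location, COUNT_BIG(*) AS UsersInLocation
        FROM dbo.Users
        GROUP BY Location;
    GO
    -- The unique clustered index materializes the view, giving an exact
    -- per-location row count via a single seek.
    CREATE UNIQUE CLUSTERED INDEX CIX_LocationCounts
      ON dbo.LocationCounts (Location);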

    • That’s wild! It introduces a lot of overhead though – slower inserts/updates/deletes, more queries to get the similar location counts, etc. The cure here may be worse than the disease. 😉

