Query Exercise: Why Are These 3 Estimates So Wrong?

Query Exercises
6 Comments

Our prior Query Exercise introduced SQL Server’s 201 buckets problem: its inability to accurately estimate rows for more than 201 outliers in a table. I followed up with a solution using filtered statistics to help with the next 200 outliers, and I talked about how that’s really overkill for the simple problem we were facing in that initial challenge.

Let’s build upon that knowledge, but use a more complex query. I’m going to keep the filtered statistic in place, which gives me accurate estimates on the number of people in Lithuania, and I’m going to add an index on the Posts table (questions & answers.) My query’s goal is to find the top-voted Posts (questions & answers) created by people who live in Lithuania:

The query’s actual execution plan in the biggest Stack Overflow database, on SQL Server 2022, using 2022 compatibility level:

Let’s zoom in to read the estimates from right to left, top to bottom:

They’re great! The top right index seek on Users.Location accurately estimated that 2,554 users live in Lithuania. Then, it joined to Posts on OwnerUserId, and its estimate that the Lithuanians had created 24,316 posts was really close to accurate!

Granted, there’s a yellow bang on the plan because SQL Server didn’t grant enough memory. The sort spilled thousands of pages to disk, but if you run the query again on SQL Server 2022, adaptive memory grants start taking care of that problem for you automatically.

Try another location and add another join.

I hear Berkshire in the UK is a scenic place, and let’s also sort their posts by which ones have had the most votes cast overall. I’ll even have an index to support that, too:

The query’s actual plan has some problems:

ENHANCE! Let’s zoom in on the top right and read from right to left to see where things went wrong:

What happened, in order:

  1. SQL Server seeked on Users.Location to ‘Reading’ and expected to find 10 rows – but it actually found 418
  2. SQL Server seeked on Posts.OwnerUserId to find those 418 people’s posts, expecting to only find 94 (because there were so few estimated people in Reading), but it actually found 59,391
  3. SQL Server seeked on Votes.PostId to find the votes cast on those posts, expecting to only find 415 (because there were so few expected posts), but it actually found 603,085

Your exercise includes a few parts this time.

Let’s play fill-in-the-blank. Your comment needs to include 3 answers:

  1. What caused the estimation on Users.Location = ‘Reading, United Kingdom’ to be wrong?
  2. What caused the estimation on Posts.OwnerUserId to be wrong?
  3. What caused the estimation on Votes.PostId to be wrong?

I’m not even asking you to solve any of them – just conceptually put together the pieces about why each estimate failed.

If you do wanna try to fix any of the estimates – and again, you don’t have to, that’s just icing on the cake – then put your queries in a Github Gist and the query plans in PasteThePlan, showing your new accurate estimates, and include those link in your comments. Check out the solutions from other folks, compare and contrast your work, then check out my post with the answers. Have fun!

Previous Post
PASS Data Community Summit Prices Are About to Go Up!
Next Post
[Video] Office Hours: Database Q&A

6 Comments. Leave new

  • Without knowing the exact statistics on your database I have to estimate too 🙂
    – There is no exact statistic for ‘Reading, United Kingdom’ for Users.Location so it takes the average number of users per location (between R… and S… or whatever the statistic step in the 200-bucket is). And this average are 10 Users.
    – there is no way that the server can use more than the total average number posts per user in the Posts table, since it can’t know which UserIds will be returned from the Users table, he can’t use any detailed statisctics to include e.g. the one very active guy who has posted 10.000 times (don’t know if this is true here). The average posts per user is 9.4 (94 estimated for 10 users), the real average at this location is 142
    – same vor the Votes table, where the total average of votes per post is 415/94=4.4 but the real average vote count for posts created by users from ‘Reading, United Kingdom’ is 10.15

    A potential solution could be a cursor over the Users table which reads the the UserId into @UserId and uses it in a secon query which selects the Posts and joins the Votes table. Since it is already doing Nested Lookups on the Posts this would only slightly reduce the performance for the most locations, but allows the SQL server to use better statistics for the most active posters, if some of them are at your requested location (if not it may help just a very little bit but would not harm you except for big locations where SQL server would have done a hash or merge join instead).

    Another solution would be a datawarehouse table, where you already join / aggregate / denormalize the three tables in a meaningful manner and use it for your reports.

    Reply
  • […] Query Exercise: Why Are These 3 Estimates So Wrong? (Brent Ozar) […]

    Reply
  • I am getting different numbers to you as I am on SO2010 on SQL Server 2019 compatibility. However, my numbers below:

    Index seek on Users.Location = ‘Reading, United Kingdom’
    62(Act) 3(Est)

    Index Seek on Posts.OwnerUserId = Users.Id
    15,042 (Act) 50 (Est)

    My Histogram on Users.Location does not have a RANGE_HI_KEY for Reading, United Kingdom , the next bucket is Redmond, WA which has a AVG_RANGE_ROWS of 3.486486 so SQL Server appears to truncate this to 3

    As for the estimate on Posts.OwnerUserId = Users.Id

    My statistics on Posts.OwnerUserId shows
    Rows: 3,729,195
    Density Vector: 3.817654E-06

    so

    Number of unique OwnerUserIds : 1 / 3.817654E-06 = 261,940.97

    OwnerUserIds Per location = Rows / Number of unique locations = 14.23

    3 users estimated * OwnerUserIds Per location = 42.71

    So clearly I am out somewhere (or doing it completely wrong!) but not far off at 42.71 making me think I am on the right lines

    Looking forward to seeing the solution

    Reply
    • Ah! Typo!
      OwnerUserIds Per location = Rows / Number of unique locations = 14.23

      should be

      Posts per OwnerUserId = Rows / Number of unique OwnerUserIds = 14.23

      Reply
  • […] This week’s Query Exercise challenged you to figure out why these 3 estimates went so badly: […]

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.