In the last post, I talked about how we don’t get accurate estimates because SQL Server’s statistics only have up to 201 buckets in the histogram. It didn’t matter much in that post, though, because we were using the small StackOverflow2010 database.
But what happens as our data grows? Let’s move to a newer Stack Overflow database, the 2018-06 one that I use for my Mastering training classes. We’ll create the same index, giving us the same statistics, and then look at the histogram to see how Miami grew over time:
CREATE INDEX Location ON dbo.Users(Location);
SELECT COUNT(*) FROM dbo.Users;
Now we’re up to about 9 million users:
And the statistics histogram for Miami's area:
The data distribution has changed, but Miami still isn’t big enough to have its own bucket. Now:
- Miami, FL is between México and Minneapolis (note that Michigan no longer has its own bucket)
- AVG_RANGE_ROWS = 13.3 means that any location in between México and Minneapolis will get an estimate of 13.3 rows
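If you want to check this yourself, DBCC SHOW_STATISTICS dumps the histogram – the statistic shares the name of the index we just created:

DBCC SHOW_STATISTICS ('dbo.Users', Location) WITH HISTOGRAM;
-- Look at RANGE_HI_KEY and AVG_RANGE_ROWS: Miami, FL falls between
-- the México and Minneapolis buckets, so it inherits that range's 13.3.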
So when we run our Miami query, what does it look like now?
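(For anyone following along at home: the exact query is in the last post. It's roughly this shape – the ORDER BY column here is my guess, but the point is that something like it produces the sort operator discussed below:)

SELECT Id, DisplayName, Location
FROM dbo.Users
WHERE Location = N'Miami, FL'
ORDER BY DisplayName; -- hypothetical ORDER BY; this is what drives the sort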
The estimated number of rows for Miami, FL is 13, but in actuality, 625 rows come out of the index seek. The variance is starting to grow as our data size gets larger.
So what do we do about Miami’s inaccurate estimates?
Again, in this case…nothing. 625 rows vs 13 rows isn't really that big of a deal. In my Fundamentals of Query Tuning class, I talk about how you want to read query plans from right to left, looking for places where estimates vs. actuals are 10x off or more. Here, we're 48x off, but…it doesn't really matter that much. SQL Server still allocated enough memory to do the sort, and it didn't spill to disk. (Note there's no yellow bang on the sort operator.)
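A side note on spotting spills: rather than eyeballing yellow bangs one plan at a time, SQL Server 2016 SP2 and later also track spill counters in sys.dm_exec_query_stats. A minimal sketch (the TOP 10 cutoff is arbitrary):

SELECT TOP 10
    qs.total_spills,    -- pages spilled to tempdb across all executions
    qs.execution_count,
    SUBSTRING(st.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
WHERE qs.total_spills > 0
ORDER BY qs.total_spills DESC;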
However, as I noted yesterday, it will become a big deal if we join to other tables. Let’s see what happens if we add a join to Comments, for example, to display the top comments left by people from Miami:
SELECT TOP 1000 u.DisplayName, c.Score, c.Text
FROM dbo.Users u
INNER JOIN dbo.Comments c ON u.Id = c.UserId
WHERE u.Location = N'Miami, FL'
ORDER BY c.Score DESC;
The actual plan:
We read plans from right to left, top to bottom, to understand what SQL Server did in what order. (You can also read from left to right in some cases, and I’m just legally required to say that, because if I don’t, Grant Fritchey and Hugo Kornelis will smother me as I sleep.)
So from right to left:
- SQL Server did an index seek on Location = Miami, FL, and expected only 13 rows to come back
- In reality, 625 rows came back
- So we also had to do 625 key lookups
- We had to do 625 index seeks on Comments.UserId because each person has their own set of comments
- In total, we found 14,176 comments left by people in Miami, FL, so that poor Sort operator never stood a chance. He has a yellow bang because he ended up spilling to disk.
So, really, now, seriously, what do we do about Miami's inaccurate estimates?
In this example, the query still runs fairly quickly. But if you run into this kind of problem with a larger query, larger spills to disk, larger estimate variations, and so forth, here are some techniques you can consider for performance tuning:
- Breaking the query up into phases using techniques like temp tables (see the first sketch after this list)
- Query hints, like different cardinality estimator versions, or optimization hints (like OPTIMIZE FOR, asking for a larger grant, etc. – see the second sketch)
- Newer compatibility levels with tricks like adaptive joins or adaptive memory grants (see the last sketch)
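To make the temp table approach concrete, here's a minimal sketch against the Miami query. The idea: materialize the Miami users first, so by the time SQL Server plans the join to Comments, it's working from the temp table's real row count (625) instead of the histogram's guess (13). (This is an illustration of the technique, not a guaranteed fix.)

SELECT u.Id, u.DisplayName
INTO #MiamiUsers
FROM dbo.Users u
WHERE u.Location = N'Miami, FL'; -- phase 1: 625 actual rows land here, with their own stats

SELECT TOP 1000 mu.DisplayName, c.Score, c.Text
FROM #MiamiUsers mu
INNER JOIN dbo.Comments c ON mu.Id = c.UserId
ORDER BY c.Score DESC; -- phase 2: the join estimate starts from the temp table's real size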
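For the hint route, here's the same query with a couple of real hints bolted on as examples – you'd test these one at a time rather than stacking them blindly, and whether either one helps depends on your data:

SELECT TOP 1000 u.DisplayName, c.Score, c.Text
FROM dbo.Users u
INNER JOIN dbo.Comments c ON u.Id = c.UserId
WHERE u.Location = N'Miami, FL'
ORDER BY c.Score DESC
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'), -- try the older CE's math
        MIN_GRANT_PERCENT = 10); -- ask for a bigger memory grant up front to dodge the sort spill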
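And for the compatibility level route, the adaptive features mostly come along for the ride once the level is high enough: adaptive joins and memory grant feedback arrived with batch mode at level 140, and level 150 (SQL Server 2019) extends grant feedback to row mode plans. A one-liner, assuming the server itself is new enough (and that you've tested before touching production):

ALTER DATABASE CURRENT SET COMPATIBILITY_LEVEL = 150; -- SQL Server 2019 behaviors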
But the big takeaway for me is that 201 buckets in a histogram just isn’t enough to portray real-world data skew, especially as your data size grows into the millions of rows and beyond.