How Multi-Column Statistics Work

Statistics
4 Comments

The short answer: in the real world, only the first column works. When SQL Server needs data about the second column, it builds its own stats on that column instead (assuming they don’t already exist), and uses those two statistics together – but they’re not really correlated.

For the longer answer, let’s take a large version of the Stack Overflow database, create a two-column index on the Users table, and then view the resulting statistics:

The output of DBCC SHOW_STATISTICS shows that we’ve got about 22 million rows in this table. So, what does the statistics histogram say about the relationships between locations and reputations?

DBCC SHOW_STATISTICS output

In the first result set, you can see that Rows = 22,484,235, and Rows Sampled = 22,484,235 – the same number. That means our statistics had a full scan, which is as good as they can possibly get.

The density vectors aren’t very useful.

The second result set are the density vectors – aka, the averages – and it has 3 rows. The first one says Location: all density = 3.282714E-06. If you take that number, times 22,484,235, you get 73.8093. That’s the density vector: if SQL Server needs to estimate how many rows are going to match for a given location, and it doesn’t know what the location is, it’ll estimate 73.8093.

Here’s an example query to prove that. I’m using a local variable to prevent SQL Server from sniffing the value, so it’ll have no idea what the @LocationUnknown value will be at compilation time:

In the resulting query plan:

Query plan

When you hover your mouse over the index seek and look at “Estimated Number of Rows Per Execution”, you get 73.8093:

ENHANCE

Back to our DBCC SHOW_STATISTICS output. We explained that the first row was the density vector for Location alone – so what’s the second row, which says Location, Reputation?

Density vectors

What’s that “1.058257E-06” number mean in the Location, Reputation column? Multiply that by the number of rows in the table (22,484,235) and you get 23.794. I bet you can see where I’m going with this:

Here’s our query plan, and bingo, 23.794 estimated rows:

Unknown location and reputation

If you’re searching for an unknown (or unpredictable before runtime) location and reputation combo, SQL Server uses the density vector in the histogram to calculate how many rows are going to match. SQL Server thinks that for any given location and reputation, no matter what values you pass in, they’re going to produce 23.7941.

At first, that sounds ridiculous: Reputation is an integer number. If you’re asking for equality searches on that number, and you’re passing in random numbers, there’s practically no way that estimate could be right. On “average”, maybe across thousands or millions of searches, this could be vaguely the average number – but it’s never going to be correct. It’s going to be wildly overestimated some times, and wildly underestimated at other times.

But what about the histogram?

The next result set in DBCC SHOW_STATISTICS is the histogram, which contains the detailed list of up to 201 location values – because Location is the first column in our statistic:

Locations histogram

Let’s scroll down to the Vegas area:

Las Vegas area

Las Vegas isn’t big enough of an outlier to get its own bucket, so if we query for the people who live in Las Vegas:

Note that I’m using 1 = (SELECT 1) in order to prevent autoparameterization, which is a totally different subject for another day. Hover our mouse over the execution plan to see the estimated number of rows:

Estimated number of rows

The 7.04485 estimate comes from our statistics. Scroll back up a couple of images, and note that in our statistics, I highlighted the row for Lebanon. “Las Vegas, NV” is somewhere between “Lahore, Pakistan” and “Lebanon” (it certainly feels that way on the highway sometimes), so SQL Server uses the AVG_RANGE_ROWS number of 7.044848.

When SQL Server is searching for an unknown Location, it uses the density vector. When it’s searching for a known location in between two range high keys, it uses the AVG_RANGE_ROWS number. So far, so good.

But what happens if we pass in a search for a known location AND a known reputation?

Before we look at the execution plan of that query, take a moment to review the statistics at play here. Which column is SQL Server going to use for its estimate?

Las Vegas area

Which column in the above screenshot tells SQL Server that any particular location has a higher or lower average reputation score, or how distributed the values are?

That’s right: there isn’t one!

This statistics histogram isn’t really about the second column of the object at all. It’s about the first column! Multi-column stats aren’t, really: they’re really just single-column stats!

Here’s the part that’s kinda mind-blowing. Here’s the query plan for Las Vegas and 1234:

Query plan of disappointment

Look familiar? That’s a 7.04485 estimate. Exactly the same as if we weren’t filtering on reputation at all. It’s using the avg_range_rows from our statistics, giving us the exact same estimate that we got from just filtering on Location = Las Vegas.

The histogram values for the first column in our object is really useful. Subsequent columns, not so much.

The Reputation value I’m searching for doesn’t really matter here either. Let’s try one of the biggest values, Reputation = 1. You’ll see why this is important later:

Estimated number of rows = static here

The estimate is still 7.04485: exactly the same as not filtering by reputation at all. That’s… not great.

Things change a little for really big outliers.

Let’s try searching for the very biggest location value: India. If we search for just the location value (not a reputation yet), the histogram is useful because India’s one of our outliers:

The resulting execution plan estimates are absolutely bang on:

Estimated number of rows for India

Then add in a filter for a given reputation number, and here I’m going to do both 1234 and 1 back to back:

And whaddya know: now our estimates are not 7.04485, unlike Las Vegas:

Estimated number of rows for India

SQL Server manages to figure out that India is not only huge, but also that Reputation = 1 is huge. So, how’d it do that? Right-click on the SELECT operator of the second query, go into Properties, and then OptimizerStatsUsage:

Optimizer Stats Usage

SQL Server didn’t just use the stat on Location_Reputation. In order to understand that Reputation = 1 is an outlier, it also automatically created a statistic on the Reputation column separately because the Reputation data in the Location_Reputation statistic just wasn’t useful enough.

Multi-column stats just don’t help much by themselves.

To really prove it, let’s set up an artificial scenario. Let’s say that everyone in China has really high reputation. And, just to give SQL Server the best defenses possible, let’s create a multi-column index (and therefore stat) on Reputation, Location. Hell, let’s even update statistics on Users so that our existing Location, Reputation stat completely understands that China’s where the smart people are at:

If that was a Venn diagram, we would now have a perfect circle: all of the people in China have exactly 1,000,000 reputation points, and the only people with exactly 1,000,000 reputation points are in China:

The results:

Venn diagram results

So now, let’s ask SQL Server:

  • How many people do you THINK live in China?
  • How many people do you THINK have 1,000,000 reputation points?
  • How many people do you THINK live in China, AND have 1,000,000 points?

Check out the estimated number of rows on the query plans:

Oopsie daisie

Or for those of you who prefer memes:

How you doin

If SQL Server had anything even remotely resembling true multi-column stats, the estimate would be closer than this. We don’t, so it’s not.

The documentation suggests that you should create these statistics manually when you know there’s correlation:

But yeah no, that still doesn’t work, and still produces the same 89-row estimate.

To learn how to solve these kinds of problems, check out my Mastering Query Tuning class.

Previous Post
Who’s Hiring Database People? April 2026 Edition
Next Post
Contest: Make the Comments Look Like My ChatGPT History.

4 Comments. Leave new

  • Forgive me, I may need more caffeine, but I’m confused by one of your examples. Your first 2 example queries use a variable in the WHERE clause so the values are “unknown” and SQL Server uses the density vectors to estimate 73 rows & 23 rows. Then you use a hard-coded Location and show how SQL Server uses the histogram to estimate 7 rows.

    But then you hard-code both Location and Reputation, and SQL Server estimates 7 rows again. But you say, “Exactly the same as if we had a completely unknown location and reputation! SQL Server used the density vector numbers, not the statistics histogram…” If it had used the density vectors, wouldn’t the estimate be 23 rows, same as when you use variables for Location and Reputation?

    Reply
    • It’s totally confusing, I get it! This is such a tricky issue. I’ve gone back and tried to clarify the wording to make it more intuitive. Hope that helps!

      Reply
  • […] How Multi-Column Statistics Work (Brent Ozar) […]

    Reply
  • Thanks for this. I don’t work with SQL Server anymore, but when I did, there were a few occasions when I tried to fix bad estimates for correlated predicates with multi column stats and, as you’ve shown, it didn’t help. I’m a bit baffled as to what the point of this feature is.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.