Stats Week: Statistics Terminology Cheatsheet

These things used to confuse me so much

Despite having worked at a market research company for a while, I know nothing about statistics, other than that project managers have all sorts of disagreeably subjective phrases for describing them: vast majority, convincing plurality, dwindling minority, and so on. Less talky, more picture.

When I started getting into SQL Server, and learning about statistics, I heard the same phrases over and over again, but wasn’t exactly sure what they meant.

Here are a few of them:

Selectivity

This tells you how special your snowflakes are. When a column is called “highly selective,” that usually means values rarely repeat, if at all. Think about order numbers, identity or sequence values, GUIDs, etc.

Density

This is sort of the anti-matter to selectivity. Highly dense columns aren’t very unique: SQL Server calculates density as 1 divided by the number of distinct values, so the fewer distinct values a column has, the higher its density. A dense column returns a lot of rows for a given value. Think about zip codes, gender, marital status, etc. If you were to select all the people in 10002, a densely (there’s that word again) populated zip code in Chinatown, you’d probably wait a while, kill the query, and add another filter.

Cardinality

If you mash selectivity and density together, you end up with cardinality. This is the number of rows that satisfy a given predicate. It matters a lot, because the optimizer chooses a plan based on how many rows it expects to get back. Poor cardinality estimates can come from a number of places, and when they’re far enough off, they can really ruin query performance.

Here’s a quick example of each for a 10,000 row table with three columns.
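If you'd rather poke at the numbers yourself, here's a rough Python sketch of the same idea. The three column names and their value distributions are made up for illustration, and the values are spread perfectly evenly, which real data rarely is. It counts distinct values, works out density as 1 / distinct values, and turns that into the rows you'd expect back for a single equality predicate.

```python
ROWS = 10_000

# Hypothetical columns, invented for this sketch.
table = {
    "order_id": list(range(1, ROWS + 1)),                  # every value unique
    "zip_code": [10000 + (i % 100) for i in range(ROWS)],  # 100 distinct values
    "marital_status": [i % 4 for i in range(ROWS)],        # 4 distinct values
}

def describe(values):
    """Return (distinct values, density, expected rows per value)."""
    distinct = len(set(values))
    density = 1 / distinct
    # Same as rows * density; dividing avoids float rounding noise.
    rows_per_value = len(values) / distinct
    return distinct, density, rows_per_value

summary = {name: describe(values) for name, values in table.items()}

for name, (distinct, density, rows_per_value) in summary.items():
    print(f"{name:15} distinct={distinct:6} density={density:.4f} "
          f"rows per value={rows_per_value:,.0f}")
```

The unique column comes out highly selective (tiny density, one row per value), while the four-value column is dense and hands back a quarter of the table for any one value.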

 

[Image: Bigger by the day]

A lot has been written about cardinality estimation. SQL Server 2014 saw a total rewrite of the cardinality estimation guts that had been around since SQL Server 7.0, build-to-build tinkering notwithstanding.

In my examples, it’s all pretty cut and dried. If you’re looking at a normal sales database that follows the 80/20 rule, where 80 percent of your business comes from 20 percent of your clients, the customer ID columns may be highly skewed towards a small group of clients. It’s good for SQL Server to know this stuff so it can come up with good execution plans for you. It’s good for you to understand how parameter sniffing works, so you understand why that execution plan was good for a small client, but not good for any big clients.
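To put some numbers on that skew, here's another small Python sketch. The customer and order counts are invented to fit an 80/20 split, nothing more. It compares a single "average rows per customer" guess (rows times density) against what each customer actually has.

```python
from collections import Counter

# Hypothetical sales table: 10,000 orders across 100 customers,
# where 20 big customers place 80 percent of the orders.
big = [cid for cid in range(1, 21) for _ in range(400)]     # 20 * 400 = 8,000 rows
small = [cid for cid in range(21, 101) for _ in range(25)]  # 80 * 25  = 2,000 rows
customer_ids = big + small

actual = Counter(customer_ids)

# One-size-fits-all guess: total rows / distinct customers.
average_guess = len(customer_ids) / len(actual)

for cid in (1, 100):
    print(f"customer {cid}: guess={average_guess:.0f} actual={actual[cid]}")
# customer 1: guess=100 actual=400
# customer 100: guess=100 actual=25
```

The average is wrong in both directions: four times too low for a big client and four times too high for a small one. The same gap is what bites with parameter sniffing, when a plan built around a 25-row client gets reused for a 400-row one.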

That’s why you should go see Brent in person. He’ll tell you all this stuff, feed you, give you prizes, and then you go home and get a raise because you can fix problems. Everyone wins!

Thanks for reading!

Brent says: wanna learn more about statistics? Check out Dave Ballantyne’s past SQLbits videos, including the one about the new 2014 CE.
