These things used to confuse me so much
Despite having worked at a Market Research company for a while, I know nothing about statistics, other than that project managers have all sorts of disagreeably subjective phrases for describing them. Vast majority, convincing plurality, dwindling minority, et al. Less talky, more picture.
When I started getting into SQL Server, and learning about statistics, I heard the same phrases over and over again, but wasn’t exactly sure what they meant.
Here are a few of them:
Selectivity
This tells you how special your snowflakes are. When a column is called “highly selective” that usually means values aren’t repeating all that often, if at all. Think about order numbers, identity or sequence values, GUIDs, etc.
Density
This is sort of the anti-matter to selectivity. Highly dense columns aren’t very unique. They’ll return a lot of rows for a given value. Think about Zip Codes, Gender, Marital Status, etc. If you were to select all the people in 10002, a densely (there’s that word again) populated zip code in Chinatown, you’d probably wait a while, kill the query, and add another filter.
Cardinality
If you mash selectivity and density together, you end up with cardinality. This is the number of rows that satisfy a given predicate. This is very important, because poor cardinality estimation can arise from a number of places, and every time it can really ruin query performance.
Here’s a quick example of each for a 10,000 row table with three columns.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 |
USE [tempdb]; WITH x AS ( SELECT TOP 10000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS [rn] FROM sys.[messages] AS [m] ) SELECT [x].[rn], CASE WHEN [x].[rn] % 2 = 0 THEN 'M' ELSE 'F' END AS [Gender], CASE WHEN [x].[rn] % 2 = 0 THEN 'Married' WHEN [x].[rn] % 3 = 0 THEN 'Divorced' WHEN [x].[rn] % 5 = 0 THEN 'Single' ELSE 'Dead' END AS [MaritalStatus] INTO #xgen FROM [x] /*Selectivity*/ SELECT COUNT_BIG(DISTINCT [x].[rn]) FROM [#xgen] AS [x] SELECT COUNT_BIG(DISTINCT [x].[Gender]) FROM [#xgen] AS [x] SELECT COUNT_BIG(DISTINCT [x].[MaritalStatus]) FROM [#xgen] AS [x] /*Density*/ SELECT (1. / COUNT_BIG(DISTINCT [x].[rn])) FROM [#xgen] AS [x] SELECT (1. / COUNT_BIG(DISTINCT [x].[Gender])) FROM [#xgen] AS [x] SELECT (1. / COUNT_BIG(DISTINCT [x].[MaritalStatus])) FROM [#xgen] AS [x] /*Reverse engineering Density*/ SELECT 1.0 / 0.00010000000000000000 SELECT 1.0 / 0.50000000000000000000 SELECT 1.0 / 0.25000000000000000000 /*Cardinality*/ SELECT COUNT_BIG(*) / COUNT_BIG(DISTINCT [x].[rn]) FROM [#xgen] AS [x] SELECT COUNT_BIG(*) / COUNT_BIG(DISTINCT [x].[Gender]) FROM [#xgen] AS [x] SELECT COUNT_BIG(*) / COUNT_BIG(DISTINCT [x].[MaritalStatus]) FROM [#xgen] AS [x] DROP TABLE [#xgen] |
Bigger by the day
A lot has been written about cardinality estimation. SQL Server 2014 saw a total re-write of the cardinality estimation guts that had been around since SQL Server 2000, build-to-build tinkering notwithstanding.
In my examples, it’s all pretty cut and dry. If you’re looking at a normal sales database that follows the 80/20 rule, where 80 percent of your business comes from 20 percent of your clients, the customer ID columns may be highly skewed towards a small group of clients. It’s good for SQL to know this stuff so it can come up with good execution plans for you. It’s good for you to understand how parameter sniffing works so you understand why that execution plan was good for a small client, but not good for any big clients.
That’s why you should go see Brent in person. He’ll tell you all this stuff, feed you, give you prizes, and then you go home and get a raise because you can fix problems. Everyone wins!
Thanks for reading!
Brent says: wanna learn more about statistics? Check out Dave Ballantyne’s past SQLbits videos, including the one about the new 2014 CE.
7 Comments. Leave new
Erik,
Please, sir. May I have a semi-colon after the USE statement. 😉
You should include a ; af USE tempdb 🙂
Executed as a whole:
Msg 319, Level 15, State 1, Line 2
Incorrect syntax near the keyword ‘with’. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be terminated with a semicolon.
🙂 the site
I’m against people running scripts they found on the internet all at once. Semicolon request denied.
One more thing, syntax check goes boom without that semi-colon. ;-D
I always run the syntax check before running any script.
BTW, thanks for the refresher on these terms. Y’all do good work.
LOL. I agree with your position. I only ran the first two statements when I found it.
You I might put one in for. You seem trustworthy.
My day is complete. I’m going home now.