Blog

Stats Week: Statistics Terminology Cheatsheet

These things used to confuse me so much

Despite having worked at a Market Research company for a while, I know nothing about statistics, other than that project managers have all sorts of disagreeably subjective phrases for describing them. Vast majority, convincing plurality, dwindling minority, et al. Less talky, more picture.

When I started getting into SQL Server, and learning about statistics, I heard the same phrases over and over again, but wasn’t exactly sure what they meant.

Here are a few of them:

Selectivity

This tells you how special your snowflakes are. When a column is called “highly selective,” that usually means its values aren’t repeating all that often, if at all. Think about order numbers, identity or sequence values, GUIDs, etc.

Density

This is sort of the anti-matter to selectivity. Highly dense columns don’t have many distinct values, so they’ll return a lot of rows for a given value. Think about Zip Codes, Gender, Marital Status, etc. If you were to select all the people in 10002, a densely (there’s that word again) populated zip code in Chinatown, you’d probably wait a while, kill the query, and add another filter.

Cardinality

If you mash selectivity and density together, you end up with cardinality: the number of rows that satisfy a given predicate. This is very important, because poor cardinality estimates can arise from a number of places, and when they’re off, they can really ruin query performance.

Here’s a quick example of each for a 10,000 row table with three columns.

USE [tempdb];

/*Build a 10,000 row test table: rn is unique, while Gender and MaritalStatus repeat a lot*/
WITH x AS (
SELECT TOP 10000
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS [rn]
FROM sys.[messages] AS [m]
)
SELECT
[x].[rn],
CASE WHEN [x].[rn] % 2 = 0 THEN 'M' ELSE 'F' END AS [Gender],
CASE WHEN [x].[rn] % 2 = 0 THEN 'Married' WHEN [x].[rn] % 3 = 0 THEN 'Divorced' WHEN [x].[rn] % 5 = 0 THEN 'Single' ELSE 'Dead' END AS [MaritalStatus]
INTO #xgen
FROM [x];

/*Selectivity: how many distinct values each column has*/
SELECT COUNT_BIG(DISTINCT [x].[rn])
FROM [#xgen] AS [x]

SELECT COUNT_BIG(DISTINCT [x].[Gender])
FROM [#xgen] AS [x]

SELECT COUNT_BIG(DISTINCT [x].[MaritalStatus])
FROM [#xgen] AS [x]

/*Density: 1 divided by the number of distinct values*/
SELECT (1. / COUNT_BIG(DISTINCT [x].[rn]))
FROM [#xgen] AS [x]

SELECT (1. / COUNT_BIG(DISTINCT [x].[Gender]))
FROM [#xgen] AS [x]

SELECT (1. / COUNT_BIG(DISTINCT [x].[MaritalStatus]))
FROM [#xgen] AS [x]

/*Reverse engineering Density: 1 divided by density gives back the distinct value count*/
SELECT 1.0 / 0.00010000000000000000

SELECT 1.0 / 0.50000000000000000000

SELECT 1.0 / 0.25000000000000000000

/*Cardinality: average rows per distinct value (total rows * density)*/
SELECT COUNT_BIG(*) / COUNT_BIG(DISTINCT [x].[rn])
FROM [#xgen] AS [x]

SELECT COUNT_BIG(*) / COUNT_BIG(DISTINCT [x].[Gender])
FROM [#xgen] AS [x]

SELECT COUNT_BIG(*) / COUNT_BIG(DISTINCT [x].[MaritalStatus])
FROM [#xgen] AS [x]

DROP TABLE [#xgen]

 

Bigger by the day

A lot has been written about cardinality estimation. SQL Server 2014 saw a total re-write of the cardinality estimation guts that had been around since SQL Server 2000, build-to-build tinkering notwithstanding.

In my examples, it’s all pretty cut and dried. If you’re looking at a normal sales database that follows the 80/20 rule, where 80 percent of your business comes from 20 percent of your clients, the customer ID columns may be highly skewed towards a small group of clients. It’s good for SQL Server to know this stuff so it can come up with good execution plans for you. It’s good for you to understand how parameter sniffing works, so you understand why an execution plan that was good for a small client isn’t good for any big clients.
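If you want to see how that plays out, here’s a minimal sketch of parameter sniffing against a skewed column. The table, columns, and customer IDs are all hypothetical — point it at whatever skewed data you have handy.

/*Hypothetical skewed table: dbo.Sales, where CustomerId = 1 owns 80% of the rows*/
CREATE PROCEDURE dbo.GetSalesByCustomer
    @CustomerId INT
AS
BEGIN
    SELECT [s].[OrderId], [s].[OrderDate], [s].[Amount]
    FROM [dbo].[Sales] AS [s]
    WHERE [s].[CustomerId] = @CustomerId;
END
GO

/*First call: a tiny customer. The optimizer sniffs @CustomerId = 42, estimates a
  handful of rows, and caches a plan built for that estimate.*/
EXEC dbo.GetSalesByCustomer @CustomerId = 42;

/*Second call: the whale. The cached small-customer plan gets reused,
  and the query drags because the estimate no longer matches reality.*/
EXEC dbo.GetSalesByCustomer @CustomerId = 1;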

That’s why you should go see Brent in person. He’ll tell you all this stuff, feed you, give you prizes, and then you go home and get a raise because you can fix problems. Everyone wins!

Thanks for reading!

Brent says: wanna learn more about statistics? Check out Dave Ballantyne’s past SQLbits videos, including the one about the new 2014 CE.

Happy Fifth Birthday to Us, 50% Off for You.

Five years ago, we turned BrentOzar.com into a consulting and training company. Now look at us. (Well, maybe not too closely, we’re getting a little gray and wrinkly.)

To celebrate:

  • Video classes are now 50% off with coupon code HighFive
  • That includes bundles, too, which brings the epic Everything Bundle down to just $449!
  • Two of our in-person training classes are 50% off too – the Senior DBA Class of 2016 in Philly in August, and the Performance Tuning class in Portland in August – just use that same HighFive coupon code
  • Free Everything Bundle with all training seats bought during the sale

Head on over and shop now – the sale ends when April 2016’s over.

Enjoy, and here’s to another five years of cartoons, retreats, webcasts, free tools, horrific SQL Server problems, and laughter about the whole thing.

Generating test data without complicated T-SQL

Sometimes you need garbage

Not because DBAs are the IT equivalent of Oscar the Grouch, but maybe you want to post a question on a forum, and you don’t want to use your own data. At least, you probably shouldn’t just post your own data without masking it. But masking data is annoying, and by the time you get everything in order, someone’s breathing down your neck for an answer.

Your other option is to write T-SQL to generate random data of your choosing, but that can be daunting too. Generating random numbers and dates between ranges, creating relationships, etc. isn’t a ton of fun. Adding more tables makes the whole process exponentially more painful.
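For the simple stuff, the T-SQL isn’t too bad. Here’s a minimal sketch that generates random integers and dates within a range — the temp table name, row count, and ranges are all arbitrary placeholders:

/*Quick-and-dirty random data: 1,000 rows of random integers and dates.
  Row count, ranges, and names are placeholders — adjust to taste.*/
SELECT TOP 1000
    ABS(CHECKSUM(NEWID())) % 10000 + 1 AS [RandomQuantity], /*1 through 10,000*/
    DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 3650, '20060101') AS [RandomDate] /*within ten years of 2006-01-01*/
INTO #RandomJunk
FROM [sys].[messages] AS [m];

SELECT TOP 10 * FROM [#RandomJunk];

It works, but once you want several related tables with realistic-looking values, it gets old fast.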

Enter the website

A website I really like to help generate data is over here. It’s really easy to set up data the way you want, and you get a lot of neat customization options for a wide variety of data types.


Cold Gettin’ Dumb

You can have it generate the data to just about any format you can think of: CSV, Excel, SQL, JSON, HTML, XML, and you have all sorts of options for each type. It’s really awesome.


It’s Like That

 

Next time you need to generate some test data, consider trying this site out. It’s been really helpful for me on a bunch of occasions. If you like it, donate. There’s no such thing as a free table.

Thanks for reading!

Contest: SQL Server Theme Songs

During this month’s Senior DBA Class in Chicago, I started joking with the attendees about what songs they should play during common DBA job tasks, like:

  • Best Song to Play During a Server Outage
  • Best Song to Illustrate Your Predecessor’s Skills
  • Best Song to Accompany an Index Rebuild
  • Best Song to Play When Examining a Slow Query
  • Best Song to Play When Corruption is Found
  • Best Song for a SQL Server with 4GB RAM
  • Best Song to Play While a Cursor Runs
  • Best Song to Prepare for a SQL Server Service Pack

We wanna hear your ideas.

Rules:

  • To submit your songs, leave a comment here (BrentOzar.com) before Wednesday, March 23rd, 2016.
  • Your comment must include the category, the song & singer’s names, and the link to the song’s video. You can enter multiple times.
  • The winners will be chosen by us, the Brent Ozar Unlimited team, judged using completely unreliable and unpredictable methods. Winners will be announced within one week of the contest closing.
  • If multiple people submit the same song in a category, the winner will be the first one.
  • Taxes are the responsibility of the winners.
  • The winner in each category will get a training video class of their choice. (No bundles, just individual classes.)
  • The overall winner will get an Everything Bundle.

I’ll get you started with my favorite song to play during a server outage.

Update March 23rd – comments have been closed. Let the judging commence!

[Video] Office Hours 2016/03/09

Angie, Doug, Richie, and Tara get together for this week’s Office Hours to talk about how many cores you should leave for the operating system, how many unions you should do in a single query, and how often you should install patches.

Here’s the video on YouTube:

You can register to attend next week’s Office Hours, or subscribe to our podcast to listen on the go.

New York City: The Data That Never Sleeps

I love living in the city

Blog posts about people’s favorite data sets seem to be popular these days, so I’m throwing my hat in the ring.

NYC has been collecting all sorts of data from all sorts of sources. There’s some really interesting stuff in here.

Another personal favorite of mine is MTA turnstile data. If you’re a developer looking to hone your ETL skills, this is a great dataset, because it’s kind of a mess. I actually had to use PowerShell to fix inconsistencies with the older text files, which I’m still recovering from. I won’t spoil all the surprises for you.

Of course, there’s Stack Overflow.

You can’t go wrong with data from either of these sources. They’re pretty big. The main problem I have with Adventure Works is that it’s a really small database. It really doesn’t mimic the large databases that people deal with in the real world, unless you do some work or run a script to make it bigger. The other problem with Adventure Works is that it went out of business a decade ago because no one wanted to buy yellow bikes. I’ve been learning a bit about Oracle, and their sample data sets are even smaller. If anyone knows of better ones, leave a comment.

Anyway, get downloading! Just don’t ask me about SSIS imports. I still haven’t opened it.

Thanks for reading!

Should I Worry About Index Fragmentation?

Here’s a handy flowchart for whenever you find yourself worrying about fragmentation:


*Talk to your doctor about whether fragmentation is a problem for you. Seek help for index maintenance lasting longer than four hours. Serious side effects may occur if all index maintenance is ignored completely.

Brent says: pop quiz: what are you doing more often, rebuilding indexes or running DBCC CHECKDB?

[Video] Free Training of the Week: Joins in Execution Plans

This year, we’re trying something new: every Monday, we’re going to show you one module out of our online training classes.

It’s the whole module, not a sneak peek, and it’s available the whole week. But you’d better not put it off – this post will disappear the following week, and get replaced by a new post.

This week, it’s Jeremiah Peschka explaining the different kinds of joins in SQL Server execution plans:

UPDATE 1/30 – Expired!

(If you don’t see the video above, you’re reading this somewhere that doesn’t support video embedding. You’ll need to read the post on our blog.)

Wanna see more? Check out our How to Read Execution Plans course. We cover what an execution plan is, how to read them, predicates, aggregates, grouping, windowing, parallelism, the plan cache, and common problems in execution plans. For $299, you get 18 months of access to stream it on your desktop, laptop, tablet, or phone.

Cheat Sheet: How to Configure TempDB for Microsoft SQL Server

The short version: configure one volume/drive for TempDB. Divide the total space by 9, and that’s your size number. Create 8 equally sized data files and one log file, each that size. Presto, the drive is full and your TempDB is configured for easy performance.

The long version is a little more complicated.

If you’re on SQL Server 2012, get on SP1 Cumulative Update 10 or newer. Microsoft made performance improvements to SQL Server 2014 so that it writes to TempDB less frequently. Because you were so good this year, they even backported it to SQL Server 2012, too. Get on this level – and by this level, I mean the latest updates shown on SQLServerUpdates.com.
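If you’re not sure which build you’re running, a quick check looks something like this (compare the output against the latest builds on SQLServerUpdates.com):

/*Check the current version, patch level, and edition*/
SELECT  SERVERPROPERTY('ProductVersion') AS [ProductVersion],
        SERVERPROPERTY('ProductLevel') AS [ProductLevel],
        SERVERPROPERTY('Edition') AS [Edition];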

Create one volume/drive/array for TempDB. You want this to be a separate volume from your user database & log files – not just for performance reasons, but also for logistical management reasons. Any user can dump a ton of stuff into TempDB without your control, and they can even run TempDB out of space. If TempDB lives on the same volume as your user data and log files, and the drive runs out of space, you can have a tough time restarting SQL Server. However, if TempDB is on its own volume that runs out of space, no big deal – just restart the instance and you’re back in business.

If you’re using a SAN-backed cluster, use local SSDs for TempDB. Since SQL Server 2012, Microsoft has fully supported using local solid state drives even in a cluster. This has three benefits: the local SSD is often cheaper, it’s often faster, and it keeps your SAN path traffic freed up for your valuable user data file & log file reads and writes.

Create 8 equally sized data files. This helps you avoid the GAM/SGAM/PFS page contention issues described in Microsoft KB 2154845. In your journey through the blogosphere, you may have read that at one point you were supposed to create one data file per core. That’s no longer the case – just start with 8, and only add more when you’re facing the page contention issues described in that KB article.
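If you want to check whether you’re actually hitting that contention, a rough sketch like this — polling sys.dm_os_waiting_tasks while the workload is running — will surface PAGELATCH waits on TempDB (database ID 2) pages:

/*Look for PAGELATCH waits on TempDB (database ID 2) allocation pages.
  Resource descriptions like 2:1:1, 2:1:2, and 2:1:3 are PFS/GAM/SGAM pages.
  Run this while the workload is active.*/
SELECT  [owt].[session_id],
        [owt].[wait_type],
        [owt].[wait_duration_ms],
        [owt].[resource_description]
FROM    [sys].[dm_os_waiting_tasks] AS [owt]
WHERE   [owt].[wait_type] LIKE N'PAGELATCH%'
        AND [owt].[resource_description] LIKE N'2:%';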


TempDB files need to be sized evenly to avoid page contention.

Size them (and the log file) to fill up the drive. If the drive’s dedicated to TempDB anyway, why leave free space? You don’t want SQL Server pausing user activity to stop to grow the log file out, and you don’t want to have to hassle with possibly uneven data file growth sizes, and you haven’t got time for the pain. Just set it up and walk away with confidence.
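As a sketch, for a hypothetical 900GB drive dedicated to TempDB, that works out to roughly 100GB per file. The drive letter, paths, and sizes below are assumptions, so plug in your own numbers and repeat the ADD FILE for the remaining data files:

/*Sketch only: size TempDB to fill a hypothetical 900GB dedicated drive.
  900GB / 9 files = roughly 100GB each; names, paths, and sizes are assumptions.*/
ALTER DATABASE [tempdb] MODIFY FILE (NAME = N'tempdev', SIZE = 102400MB, FILEGROWTH = 0);
ALTER DATABASE [tempdb] MODIFY FILE (NAME = N'templog', SIZE = 102400MB, FILEGROWTH = 0);

ALTER DATABASE [tempdb] ADD FILE
    (NAME = N'tempdev2', FILENAME = N'T:\TempDB\tempdev2.ndf', SIZE = 102400MB, FILEGROWTH = 0);
/*...repeat ADD FILE for tempdev3 through tempdev8 so all eight data files match*/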

And while you’re setting up the server, check out our free SQL Server Setup Checklist.

What happens to transaction log backups during full backups?

TL;DR

  • Unless you’re on SQL 2000, don’t worry about scheduling log backups during full backups
  • Log backups during full backups won’t truncate the transaction log
  • You want to keep taking log backups in case your full backup fails

The first time I ever set up backups

Was, unfortunately, using a maintenance plan. All of the databases were in simple recovery. It was largely used for staging and tenderizing data during ETL. No log backups need apply.

Fast forward a bit, and I’m setting up backups for a server where losing 30 minutes of data could set a project back several hours. We’re now hiring log backups.


You’re hired!

So there I was, dutifully creating extravagant maintenance plans, pondering the miracle of the differential backup, and the grand eloquence of log backups. They were running every 10 minutes, those log backups.

Every 10 minutes.

Even during full backups.

I’m a developer and what is this?

  • What is SQL going to do with those?
  • Do I have to restore them?
  • Should I pause log backups during full backups?
  • Will this break something?
  • How much Laphroaig do I have to drink to forget I thought about this?

This was confounding to me. So I did some digging. Back in the SQL 2000 days, this could have gotten weird. But I was, thankfully(?) on 2005. Heh. Yeah. 2005. I know.

From SQL Server 2005 onward, you can totally, absolutely, 100% take log backups during full backups. You can take them, nothing will break, and you won’t have to restore them (unless your full backup fails).

Wanna see a demo?

Of course you wanna see a demo. This one is a little more complicated than usual. It will require several-SSMS-tab technology.

ALTER DATABASE [StackOverflow] SET RECOVERY FULL

BACKUP DATABASE [StackOverflow]
TO DISK = 'D:\Backup\Crap\StackOverflow\StackOverflow.bak'
WITH CHECKSUM, COMPRESSION, STATS = 1

I usually keep my SO database in simple, because I do horrible things that I don’t want to fully log. Fun fact: if you switch from simple to full recovery model and don’t take a full backup, you’re basically still in simple recovery. You should think about that for a minute and then take steps to avoid getting fired. Backing up SO takes me about 3.5-4 minutes.
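Side note: if you want a quick way to spot databases stuck in that pseudo-simple state I just mentioned, this is one sketch of a check — it looks for databases in FULL recovery that have never started a log backup chain:

/*Databases in FULL recovery with no last_log_backup_lsn haven't started a
  log chain yet -- they're effectively still running in simple recovery.*/
SELECT  [d].[name],
        [d].[recovery_model_desc],
        [drs].[last_log_backup_lsn]
FROM    [sys].[databases] AS [d]
JOIN    [sys].[database_recovery_status] AS [drs]
        ON [drs].[database_id] = [d].[database_id]
WHERE   [d].[recovery_model_desc] = N'FULL'
        AND [drs].[last_log_backup_lsn] IS NULL;

Anyway, back to the demo.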

In another window, kick off some log backups 1 minute apart. Note that doing this won’t overwrite the log backups; it will stack them all within a single file.

You can verify this behavior by running the RESTORE HEADERONLY command at the end of this block. If you want to restore a particular backup out of a file with multiple backups in it, you use the position column and specify it with FILE = [n], which you can read more about here.

/*INIT and FORMAT start a fresh backup file; the log backups that follow append to it*/
BACKUP LOG [StackOverflow]
TO DISK = 'D:\Backup\Crap\StackOverflow\StackOverflow_log.trn'
WITH COMPRESSION, CHECKSUM, INIT, FORMAT

WAITFOR DELAY '00:01:00.000'

BACKUP LOG [StackOverflow]
TO DISK = 'D:\Backup\Crap\StackOverflow\StackOverflow_log.trn'
WITH COMPRESSION, CHECKSUM

WAITFOR DELAY '00:01:00.000'

BACKUP LOG [StackOverflow]
TO DISK = 'D:\Backup\Crap\StackOverflow\StackOverflow_log.trn'
WITH COMPRESSION, CHECKSUM

WAITFOR DELAY '00:01:00.000'

BACKUP LOG [StackOverflow]
TO DISK = 'D:\Backup\Crap\StackOverflow\StackOverflow_log.trn'
WITH COMPRESSION, CHECKSUM

WAITFOR DELAY '00:01:00.000'

BACKUP LOG [StackOverflow]
TO DISK = 'D:\Backup\Crap\StackOverflow\StackOverflow_log.trn'
WITH COMPRESSION, CHECKSUM

WAITFOR DELAY '00:01:00.000'

BACKUP LOG [StackOverflow]
TO DISK = 'D:\Backup\Crap\StackOverflow\StackOverflow_log.trn'
WITH COMPRESSION, CHECKSUM

RESTORE HEADERONLY FROM DISK = 'D:\Backup\Crap\StackOverflow\StackOverflow_log.trn'
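If you ever do need to pull one specific backup out of that stacked file, that’s where the FILE option comes in. The file number here is just an example — match it to the Position column from RESTORE HEADERONLY:

/*Restore the third log backup stored inside the stacked backup file.
  FILE = 3 is an example; use the Position value from RESTORE HEADERONLY.*/
RESTORE LOG [StackOverflow]
FROM DISK = 'D:\Backup\Crap\StackOverflow\StackOverflow_log.trn'
WITH FILE = 3, NORECOVERY;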

Just so you know I’m not pulling any shenanigans, let’s generate some log activity. This will dump 100 rows into a dummy table every 30 seconds. It is neither pretty nor elegant.

USE [StackOverflow]

IF OBJECT_ID('dbo.YouBigDummy') IS NOT NULL
DROP TABLE [dbo].[YouBigDummy]

CREATE TABLE [dbo].[YouBigDummy]
(
ID INT IDENTITY(1,1),
[Name] VARCHAR(50) DEFAULT 'You',
CreateDate DATETIME DEFAULT GETUTCDATE()
)

INSERT dbo.[YouBigDummy]
DEFAULT VALUES
GO 100
WAITFOR DELAY '00:00:30.000'
GO 
INSERT dbo.[YouBigDummy]
DEFAULT VALUES
GO 100
WAITFOR DELAY '00:00:30.000'
GO 
INSERT dbo.[YouBigDummy]
DEFAULT VALUES
GO 100
WAITFOR DELAY '00:00:30.000'
GO 
INSERT dbo.[YouBigDummy]
DEFAULT VALUES
GO 100
WAITFOR DELAY '00:00:30.000'
GO 
INSERT dbo.[YouBigDummy]
DEFAULT VALUES
GO 100
WAITFOR DELAY '00:00:30.000'
GO 
INSERT dbo.[YouBigDummy]
DEFAULT VALUES
GO 100
WAITFOR DELAY '00:00:30.000'
GO 
TRUNCATE TABLE [dbo].[YouBigDummy]

When that’s all done, you can run something like this to see what happened. You’ll probably have to replace the date. I wrote this, like, two or three weeks ago by now.

SELECT  [b].[database_name] ,
        [b].[backup_start_date] ,
        [b].[backup_finish_date] ,
        [b].[type] ,
        [b].[first_lsn] ,
        [b].[last_lsn] ,
        [b].[checkpoint_lsn] ,
        [b].[database_backup_lsn]
FROM    [msdb].[dbo].[backupset] AS [b]
WHERE   [b].[database_name] = 'StackOverflow'
        AND [b].[backup_start_date] >= '2015-12-02 17:30:00.000'
ORDER BY [b].[backup_start_date];

When you take a full backup, the first thing it does is issue a checkpoint. That’s why the full and all subsequent log backups have the same checkpoint LSN. The first four log backups all have the same database backup LSN because they occurred during the full backup. That doesn’t change until the full is done.


RED SQUARES AT NIGHT

For toots and snickers, I ran this all a second time, and canceled the full backup halfway through. The full backup issued a new checkpoint, so the checkpoint LSN changes, but the database backup LSN never changes, because the backup got canceled. That means taking log backups during full backups is totally useful. If your full backup fails for whatever reason, these things keep the chain alive.


NOTHING CHANGES

If the third time is to be a charm, and it is, the same thing occurs as the first run. New checkpoint LSN, and the database backup LSN runs through until the backup finishes. You can verify that by looking at the start and end times columns.


I’m always convinced that whoever came up with the term LSN Chains really liked 90’s Grunge.

If you still don’t believe me

Just look at sys.databases while you’re running a full backup.

SELECT  [name],
        [log_reuse_wait] ,
        [log_reuse_wait_desc]
FROM    [sys].[databases]
WHERE [name] = 'StackOverflow';

Yes, I made you read all that to get here.

The result is acknowledged, though not documented, here. This does indeed mean that log truncation will not occur during a full backup even if you take log backups. It will happen when you take the first log backup after the full finishes. You may want to consider this when scheduling maintenance items that may hog up log space alongside full backups.

Recap

Backups are beautiful things. You should take full ones, and probably differential ones, and if you’re in full recovery model, definitely log ones. How often you take them is up to you and your boss. Or maybe their boss. But it’s definitely not up to you. Unless you’re your boss.

Log backups during full backups won’t hurt anything, and may end up helping things if your full backup fails, and you need to restore something.

Log backups during full backups will not truncate the log. That has to wait until the first log backup after the full finishes.
