Stats Week: Messin’ With Statistics

If there’s one thing living in Texas has taught me

It’s that people are very paranoid that you may Mess With it. Even in Austin, where the citizenry demand weirdness, they are vehemently opposed to any form of Messing, unless it results in mayonnaise-based dipping sauce.

Me? I like Messing With stuff. Today we’re going to look at one way you can make SQL think your tables are much bigger than they actually are, without wasting a bunch of disk space that costs nearly as much as wall-to-wall carpeting.

To do this, we’re going to venture deep into the Undocumented Command Zone. It’s like the Twilight Zone, except if you go there on your production server, you’ll probably end up getting fired. So, dev servers only here, people.

Creatine

Let’s make a table, stuff a little bit of data in it, and make some indexes.
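The original setup code didn’t survive the trip into this text, so here’s a minimal sketch of the kind of thing I mean. The names (dbo.Orders, cx_ID, ix_OrderDate, ix_Amount) and column shapes are my stand-ins, roughly matched to the histogram values we’ll poke at later.

CREATE TABLE dbo.Orders
(
    ID INT IDENTITY(1, 1) NOT NULL,
    OrderDate DATE NOT NULL,
    Amount DECIMAL(10, 2) NOT NULL
);

/* Regular old clustered index on the identity column */
CREATE CLUSTERED INDEX cx_ID ON dbo.Orders (ID);

/* A thousand rows of filler data */
INSERT dbo.Orders (OrderDate, Amount)
SELECT TOP (1000)
       DATEADD(DAY, ABS(CHECKSUM(NEWID())) % 365, '20160101'),
       CAST(ABS(CHECKSUM(NEWID())) % 10000 AS DECIMAL(10, 2))
FROM sys.messages;

/* Nonclustered indexes to Mess With */
CREATE NONCLUSTERED INDEX ix_OrderDate ON dbo.Orders (OrderDate);
CREATE NONCLUSTERED INDEX ix_Amount ON dbo.Orders (Amount);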

There’s our thousand rows. If you’re dev testing against 1000 rows, your production data better only have 1001 rows in it, or you’re really gonna be screwed when your code hits real data. How do we cheat and make our data bigger without sacrificing disk space?

Eat Clen, Tren Hard, Anavar give up

You can update all statistics on the table at once, or target specific indexes with the following commands.
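The exact statements got eaten by formatting, but the trick is the undocumented ROWCOUNT and PAGECOUNT options on UPDATE STATISTICS. Against my stand-in table, that looks something like this:

/* Everything on the table at once */
UPDATE STATISTICS dbo.Orders
WITH ROWCOUNT = 10000000000, PAGECOUNT = 1000000000;

/* Or one index at a time */
UPDATE STATISTICS dbo.Orders ix_OrderDate
WITH ROWCOUNT = 10000000000, PAGECOUNT = 1000000000;

UPDATE STATISTICS dbo.Orders ix_Amount
WITH ROWCOUNT = 10000000000, PAGECOUNT = 1000000000;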

This will set your table row count to uh… 10 billion, and your page count to 1 billion. This makes sense, since usually a bunch of rows fit on a page. You can be more scientific about it than I was, this is just to give you an idea.

So let’s check in on our statistics! Sup with those?
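Here’s roughly what I mean, one DBCC SHOW_STATISTICS call per stats object (object names are still my stand-ins):

DBCC SHOW_STATISTICS ('dbo.Orders', cx_ID);
DBCC SHOW_STATISTICS ('dbo.Orders', ix_OrderDate);
DBCC SHOW_STATISTICS ('dbo.Orders', ix_Amount);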

Hint: these commands will not show inflated page or row counts in them. They actually won’t show page counts at all. Hah. That’s kinda silly, though. Hm.

Anyway, what we should grab from the statistics histograms are some middling values we can play with. For me, that’s an ID of 500, a date of 2016-03-18, and an amount of 4733.00.

One thing I’ve found is that the inflated counts don’t seem to change anything for Identities or Primary Keys. You’ll always get very reasonable plans and estimates regardless of how high you set the row and page counts for those. Regular old clustered indexes are fair game, though.

Some really interesting things can start to happen to execution plans when SQL thinks there are this many rows in a table. The first is that SQL will use a rare (in my experience) plan choice: Index Intersection. You can think of it like a Key Lookup, except between two nonclustered indexes rather than from one nonclustered index to the clustered index.

For these equality queries, we get the following plans:
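Reconstructed from those middling values, and still using my stand-in names, the three equality queries look something like this:

SELECT * FROM dbo.Orders WHERE ID = 500;

SELECT * FROM dbo.Orders WHERE OrderDate = '2016-03-18';

SELECT * FROM dbo.Orders WHERE Amount = 4733.00;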

Bizarre Love Parallel

SQL isn’t fooled by the equality on ID = 500; we get a little plan. We’ll examine inequality plans in a moment. For now, let’s look at the middle plan. That’s where the Index Intersection is occurring. The bottom plan has a regular Key Lookup.

Blood Everywhere.

The costs and estimates here are Banana Town crazy. And right there down the bottom, we can see SQL using the Clustered Index key to join our Nonclustered Indexes together. If you’ve been reading this blog regularly, you should know that Clustered Index key columns are carried over to all your Nonclustered Indexes.

If we switch to inequality queries, well…
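Same three queries, just with inequality predicates, something like:

SELECT * FROM dbo.Orders WHERE ID >= 500;

SELECT * FROM dbo.Orders WHERE OrderDate >= '2016-03-18';

SELECT * FROM dbo.Orders WHERE Amount >= 4733.00;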

All Hell Breaks Loose

The top query that SQL wasn’t fooled by before now has the same insane estimates as the others. Our two bottom queries get missing index requests due to the amount of work the Index Intersection takes.

It’s happening because of the SELECT * query pattern. This goes away if we stick to selecting only columns that are in our Nonclustered Indexes. For example, SELECT ID will result in some pretty sane index seeks occurring. The estimated rows are still way up there, though.
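For instance, a narrowed-down version like this (my guess at the shape) gets the seeks back:

SELECT ID FROM dbo.Orders WHERE OrderDate >= '2016-03-18';

SELECT ID FROM dbo.Orders WHERE Amount >= 4733.00;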

Unfortunately, STATISTICS TIME and IO are not fooled by our statistical tomfoolery.
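If you want to check for yourself, wrap one of the queries like so:

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT * FROM dbo.Orders WHERE OrderDate >= '2016-03-18';

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;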

Womp womp womp

They use reality-based measurements of our query activity. This trick is really only useful for seeing what happens to execution plans. But hey, it’s a lot cheaper, easier, and faster than actually inserting 10 billion rows.

So what?

Like a grandma in a grocery store, SQL Server makes all its decisions based on cost. Whatever is cheapest is the choice it makes. If SQL Server were a person, it would probably wash and dry used aluminum foil, save old bread ties, and use clothespins for the right thing.

I forget what I was going to say. Probably something smart about testing your queries against sets of data commensurate with what you have in production (or larger), so that you don’t get caught flat-footed by perf issues on code releases, or when your company finally starts getting customers. This is one technique to see how SQL will treat your code as you get more rows and pages involved.

Just don’t forget to set things back when you’re done. A regular stats update will take care of that.
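On my stand-in table, that reset would be something like:

UPDATE STATISTICS dbo.Orders WITH FULLSCAN;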

Thanks for reading!
