Here at Brent Ozar Unlimited
We have a proud tradition of not blaming index fragmentation for everything. There are points you should deal with it, but they’re probably not 5% and 30% and 1000 pages. But that’s not what this blog post is about. I’m hoping to clarify why we’re more interested in up to date statistics, and also why statistics outside of indexes aren’t really the most helpful thing. If they were, we’d all just create statistics and every query would magically blaze from select to offset without a care in the world.
Statistics: It’s What’s Inside That Counts
Statistics are what SQL Server uses to figure out how many rows to expect from accessing a data structure. You can do some things that fundamentally break this, like using functions in joins or where clauses, using local variables or optimize for unknown, using table variables without recompile hints, and a sniffed parameter can just allow SQL to run a query without guessing at all. It already guessed once. No tag backs. As you may imagine, this is important information for SQL to have for running your queries optimally.
Indexes: Pride and Vanity
Indexes of the nonclustered variety contain subsets of your table or view’s data. Clustered ones are your table or view data ordered by the clustering key(s). Focusing on the nonclustered variety, they’re the “and the [band name]” to your clustered index’s “[lead singer name]”, and they’re great for providing SQL with a more appropriate data structure for your query.
If you don’t need to select all the columns, and you do need to filter, join, order by, or aggregate a column not in the key of the clustered index, nonclustered indexes get a solo after the chorus. Give the drummer some.
Nonclustered indexes will, under normal circumstances, get a statistics object created with rolling density information for the key columns going left to right, and a histogram on only the leftmost column.
With that out of the way
Why do we care more about statistics being updated than indexes being fragmented? Largely, because reading pages with some empty space from a fragmented index is oh-so-very-rarely the root cause of a performance issue. Especially if those pages are already in memory. Out of date statistics can allow SQL to continue to make some really bad guesses, and keep giving you a lousy execution plan no matter which way you tune your query.
The bigger your table gets, the worse the problem gets. Prior to 2016, if you don’t turn on Trace Flag 2371, about 20% of your table’s rows need to change before an automatic statistics update kicks in. For a 100 million row table, this can be a long ways off. Poor cardinality estimates here can really sink you. Rebuilding indexes for a 100 million row table is a B-U-M-M-E-R.
Log Shipping? Mirroring? Availability Group? Good luck with that.
Crappy server? Low memory? Slow disk? Dead man walking.
You may ultimately spend more time and expend more server resources defragmenting indexes than your queries will spend reading extra pages from fragmented indexes. Rebuilding or reorganizing large indexes can be a special kind of brutal.
Consider the process
Read a bunch of index pages with sys.dm_db_index_physical_stats to figure out if there’s fragmentation, reorganize or rebuild based on feedback.
- Reorganize is online but single threaded and can take FOREVER on big tables, especially if you’re compacting LOBs.
- Rebuild is offline and single threaded in Standard, online and potentially parallel in Enterprise, but you better hope you’re patched up so you don’t corrupt anything.
Is that worth it? Every night? For every index on every table in every user database? Only if you can prove it.
The one upside to Rebuilding is that it also updates statistics with a full scan. Think about this the next time you say something like “rebuilding the index fixed the problem”, you may have an epiphany on the way.
Wait for it.
Wait for it.
Wait for it.
Statistics with no indexes
SQL Server doesn’t make easy work of getting information about your Statistics, or finding out which statistics get used. Even at the query level, you have to use a spate of Trace Flags to find out what gets loaded and looked at. Cached plans don’t fare much better.
No wonder everyone cares about indexes and their fragmentation. Microsoft has made information about them easy and abundant, while Statistics are kept hidden in the basement next to piles of soft bones and a bowl of hot blood.
Head rush moment: SQL may use information from histograms outside of the index it chooses for cardinality estimation.
Back to earth: If you just create a bunch of statistics instead of indexes, you’re (at best) using your Clustered Index for everything (which is still bad), or you’re using a HEAP for everything (which is usually worse). You’re still generally better off creating good indexes for your workload. They’ll get statistics objects created and associated with them, and if SQL thinks another column is interesting, it will create a single column statistics object for it, as long as you haven’t turned off auto create stats.
Sure, you can put on your black cloak and goat mask and create some multi-column or filtered statistics, but in the words of a wise man (Doug), you end up with more stats to maintain and understanding query behavior gets more difficult.
Filtered statistics suffer from a problem where they don’t automatically update based on the filtered row count, but rather the table row count. Imagine you have a 100 million row table, and your filtered index is on 1 million rows. All million of those rows might change, but the statistics on that index won’t. 1 million is not 20% of 100 million. You’ll have to update the statistics manually, or rebuild the filtered index.
Multi-column statistics are hardly a replacement for a multi-column index, and it’s not like you get an enhanced histogram that includes the second column. It’s just like a normal histogram. All you get is the density information for the columns you throw in there. Boo hiss.
Moral of the story (B-B-B-B-B-B-BENNY WITHOUT THE JETS)
Indexes are a really important factor for performance, but index fragmentation very rarely is. Statistics are super helpful when they’re not out of date, and getting them up to date is much easier on your server’s nerves. Though not perfect, I’d rather take my chances here. Updating statistics can also cause a bad execution plan to get flushed out of the cache. On their own they can sometimes help queries, but you should only end up here after you’ve really tuned your indexes.
Unless you can establish a metric that makes nightly index defragmentation worthwhile, don’t jump to it as the default. Try just updating statistics. You may find that nothing at all changes, and you now have many extra hours a night to do maintenance. Like run DBCC CHECKDB. If you think index fragmentation is a performance problem, try corruption sometime. That’s not what Brent meant when he said “the fastest query is one you never make.”
Thanks for reading!