Last week, I blogged about the basics of SQL Server index fragmentation: why it happens, how to fix it, and how often people are fixing it. I left you with a cliffhanger: the frequency of defrag jobs didn’t appear to affect fragmentation levels:
- Databases with no index defragmentation were an average of 5% fragmented
- Monthly – 17% fragmented
- Weekly – 3% fragmented
- Daily – 6% fragmented
At first glance, that would seem to indicate that your database was worse off if you defragmented it! But like all good novels (and most bad ones), the plot thickens.
Enter Data Mining with Excel and SQL Server
Data mining is a lot like diamond mining, only there’s no monopoly on the market, and the ladies don’t seem to appreciate a quality KPI. Otherwise, they’re identical: there’s a whole lot of money in it, but that money doesn’t usually go to the people who do the actual mining. It goes to the executives and salespeople who take advantage of the mined products to make better decisions.
The people doing the mining, on the other hand, are forced to spend their lives in tiny, dark caves (or “cubicles”) trying to extract beautiful gems (or “data”) while risking painful lung ailments (or “carpal tunnel”) due to toiling with terribly unsafe and outdated hardware (or “hardware”).
For today’s demo, I will be the miner, and you’ll be the executive who takes advantage of my work. (It’s okay, I’m used to it – I work for a vendor now.)
In my podcast Data Mining with Excel in Four Minutes, I explained how to set up Microsoft’s free data mining add-ins for Excel 2007. It’s an Excel plugin that hooks up to any SQL Server Analysis Services server on your network, either SQL Server 2005 or 2008, and makes data mining a point-and-click affair. It doesn’t require high-end horsepower – even a desktop or laptop works great for this. If you can’t be bothered to set up an SSAS instance, then check out my Data Mining in the Cloud writeup on how to get started without using a server at all.
Help SSAS Help You: Explain Your Numbers
While data mining is really easy to set up, you can get much better results if you “prequalify” your data and turn some of the numbers into basic categories.
If I were working with United States salary data, for example, my source data might have a column for Hourly Wage. I would add another column and call it Tipped Employees:
- Under $6.55 per hour – Tipped Employees = Yes. You can pay someone less than minimum wage if they get tips, and in that case, you really just can’t go by their hourly wage alone.
- $6.55 per hour and over – Tipped Employees = Unknown. In a perfect world, I’d have enough data to find out if these people get tips, but that’s not always the case.
By adding a new attribute to my data, something that’s not clear from the numbers alone, I might get better insight from my data mining efforts.
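If your source data lives in SQL Server rather than a spreadsheet, the same trick is just a CASE expression away. Here’s a minimal sketch of the idea; the table and column names (dbo.EmployeePay, HourlyWage) are made up for illustration:

-- A minimal sketch of the "prequalify your numbers" idea in T-SQL.
-- dbo.EmployeePay and HourlyWage are hypothetical names, not real objects.
SELECT  HourlyWage,
        CASE
            WHEN HourlyWage < 6.55 THEN 'Yes'  -- paid below minimum wage, so they must be getting tips
            ELSE 'Unknown'                     -- at or above minimum wage, the wage alone can't tell us
        END AS TippedEmployee
FROM    dbo.EmployeePay;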
By the way, if you’re reading this and it’s after July 2009, the minimum wage has risen to $7.25 per hour. If you’re a VB developer, you should immediately ask for a pay increase to match the new standard – unless of course they’ve got a tip jar by your desk.
Explaining Our Index Fragmentation Numbers
In the case of our index fragmentation numbers, one of the source data fields is Page Count – the number of pages that an object has. Size matters with fragmentation: small objects with only a handful of pages may appear to have very high fragmentation numbers, but they can’t actually be defragmented. There’s only so much defragmentation you can do when a table only has three pages. I’ve actually been on support escalation calls where customers demand to know why a defrag job doesn’t reduce all types of fragmentation to absolute zero, even for tables with just one page.
Microsoft’s guidance on index defragmentation backs this up: “Generally, you should not be concerned with fragmentation levels of indexes with less than 1,000 pages. In the tests, indexes containing more than 10,000 pages realized performance gains, with the biggest gains on indexes with significantly more pages (greater than 50,000 pages).”
With that in mind, I added a Page Count Group column and calculated it with a formula:
=IF(Table1[[#This Row],[page_count]]>50000,"Large",(IF(Table1[[#This Row],[page_count]]<10000,"Small","Medium")))
That adds a text label of Small, Medium, or Large depending on how many pages the object has.
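If you’d rather do that bucketing before the numbers ever reach Excel, you can pull page counts and fragmentation straight out of sys.dm_db_index_physical_stats and apply the same labels with a CASE expression. This is just a sketch of the idea against the current database, not the exact query behind my survey data:

-- A sketch of the same Small/Medium/Large bucketing done in T-SQL instead of Excel.
-- Run it in the database you want to check; it is not the survey's collection query.
SELECT  OBJECT_NAME(ips.object_id)        AS object_name,
        ips.index_id,
        ips.page_count,
        ips.avg_fragmentation_in_percent,
        CASE
            WHEN ips.page_count > 50000 THEN 'Large'
            WHEN ips.page_count < 10000 THEN 'Small'
            ELSE 'Medium'
        END AS page_count_group
FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
ORDER BY ips.page_count DESC;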
Suddenly, The Data Makes More Sense
Even before doing data mining, if we just add a Pivot Table, we can suddenly make more sense out of the numbers.
For Large tables, we see an average of 44% fragmentation when the database has no defragmentation jobs set up. Monthly defrag drops that to 14%, and daily drops it to just 2%! The Weekly data is a bit of an outlier here, but it’s still lower than having no defrag jobs at all, so we’ll have to dig deeper.
For Medium tables, we see the type of data distribution we would hope for: the more often we defrag, the lower our fragmentation gets.
For Small tables, the data is all over the place, but we know why: objects with only a handful of pages can show high fragmentation numbers that a defrag job can’t really do anything about.
Adding this bit of human interpretation helped us get better results from our data – and we haven’t even started mining!
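And if pivot tables aren’t your thing, the same rollup is easy in T-SQL once the survey results are sitting in a table. Here’s a rough equivalent; dbo.FragSurvey and its column names are hypothetical, not the real dataset:

-- A rough T-SQL equivalent of the pivot table: average fragmentation
-- by page count group and defrag frequency. dbo.FragSurvey is hypothetical.
SELECT  page_count_group,
        defrag_frequency,
        AVG(avg_fragmentation_in_percent) AS avg_fragmentation
FROM    dbo.FragSurvey
GROUP BY page_count_group, defrag_frequency
ORDER BY page_count_group, defrag_frequency;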
Learning More About SQL Server Data Storage
Want to learn more? We’ve got video training explaining it! In our free 90-minute video series How to Think Like the SQL Server Engine, you’ll learn:
- The differences between clustered and nonclustered indexes
- How (and when) to make a covering index
- The basics of execution plans
- What determines sargability
- How SQL Server estimates query memory requirements
- What parameter sniffing means, and why it’s not always helpful