I bet you’re worried about your index fragmentation.
And I bet you’re wasting your time, missing the real point.
What is Index Fragmentation and How It Happens
Let’s take a step back for a second and pretend that your database is a phone book organized by last name, first name.
As people move into your city, we have to add them to the phone book. Ideally, each page has some empty space, and we govern that with the fill factor. When SQL Server rebuilds indexes, it uses the fill factor to decide how much free space to leave on each page. If there’s not enough free space, SQL Server has to do some rearranging – but it can’t exactly shove a brand new page in the middle of the phone book. The book’s already bound. We’ll have to tack more blank pages onto the end.
Problem #1 – Internal Fragmentation: We’ve got a newly added page with hardly any stuff on it.
Problem #2 – External Fragmentation: The phone book pages are out of order.
Updating people in our phone book causes problems too. When Pat Down marries Phil McCann, and we update her last name in the phone book, we leave empty space in the D section and cram a new record into the M’s. Deletes cause problems by leaving empty space behind (internal fragmentation).
How Fragmentation Hurts SQL Server Performance
Bad internal fragmentation (having lots of free space on the pages) means the index is bigger than it needs to be. Instead of our phone book having 1,000 pages that are 100% full, we might have 1100 pages that are only 90% full. This means every time we need to scan the index, it’ll take 10% longer (1,100 pages instead of 1,000). This also means we need more memory to cache the same amount of data – because SQL Server has to cache the empty space on each page. Our lowest unit of caching is a single page, not a record.
Bad external fragmentation (having shuffled pages on disk) means our storage performance could be slower. If magnetic hard drives have to jump around to different areas of the drive, their performance suffers – a lot. Take this recent Toshiba enterprise drive review by StorageReview – it gets around 200MB/sec for large sequential reads, but under 2MB/sec for random reads. Ouch.
Fixing Index Fragmentation Temporarily
If you can’t cache the database in memory, or you want to fix it on disk anyway, you can solve it by rebuilding or defragmenting the index. Most folks do this with maintenance plans, but those have a nasty problem. They rebuild (or defrag) every single index in the database, every time, whether it’s necessary or not. The maintenance plans ignore whether the table’s even had a single write since the last time it was maintained.
This is a problem because rebuilding and defragmenting indexes causes SQL Server to write to the transaction log. The more we write to the logs, the longer our log backups take, the more we have to push across the network wire for database mirroring or log shipping, and the longer restores take.
We might even be doing more damage, too. Some DBAs decide they want to fix fragmentation by setting a low fill factor, like say 50%. By doing so, half of every page would be empty – so inserts would be blazing fast. Reads, however, would be twice as slow. In order to scan the entire phone book, we’d have twice as many pages we have to read. Tweaking fill factor is a dangerous dance, much like the ones I did in high school. (True story: broke my nose slam dancing.) We might be forcing more empty space on each page every time we rebuild when we don’t need to.
Fix Index Fragmentation Permanently
Start by trying to cache your database – or at least the data that’s frequently accessed – in memory. External fragmentation (out-of-order pages on disk) doesn’t matter as much when we don’t have to hit the disks to begin with. The difference between physically fragmented disk throughput and physically unfragmented disk throughput is miniscule compared to the speed difference between memory and disk. Cache it and be done with it – 384GB of memory is just $5-$6k.
Next, dig into what your storage system is really doing with your data. If you’re using shared storage like EMC, NetApp, or Dell gear that shares drives between a bunch of different servers, then all of your drive access will be random anyway. Your hard drives are shared with other servers that are also making drive requests at the same time, so the drives will always be jumping all over the place to get data. Defragging your indexes is just meaningless busy work.
Find out what your queries are waiting on – and identify what problem you’re really trying to fix. During our SQL Server health checks, we rarely say, “Ah, yes, here’s the problem – your data is fragmented.” We DBAs are just used to defragmenting the bejeezus out of our databases because it’s one of the few problems we can fix easily, and see simple numbers to know whether our indexes become less fragmented. Find the real root cause, and then attack that.
And if you discover that the root problem really is fragmented indexes – and it just might be – use Ola Hallengren’s free database maintenance scripts. They check the fragmentation level of each index before they start – and then only defrag or rebuild the indexes that meet your thresholds. This keeps your log file activity as light as possible while achieving the performance improvement you’re after.


Michael J Swart August 14, 2012 | 8:18 am
Great post (as always)… I’m just curious. What search terms did you use to discover those (creative commons licensed) photos? That first one is so so perfect for this topic.
Brent Ozar August 14, 2012 | 8:21 am
Thanks man! I can’t remember offhand – I do a lot of image searches – but I’m betting I searched for yellow pages. I usually end up looking through a lot of images before I find just the right ones.
Tom Weaver October 23, 2012 | 7:40 am
On that note… I still prefer Kendra Little’s art! Though each of us has our own talents
Rowland August 14, 2012 | 8:30 am
Right on target Mr Brent. Nice article.
Cody K. August 14, 2012 | 9:40 am
Great read! The subject of indexes and other performance tuning is something that I am trying to focus on this year at work. Keep up the good work!
Ayman El-Ghazali August 14, 2012 | 9:55 am
Sorry about your nose, it looks good though no worries!
Two Quick questions.
1. How do you go about caching everything? Do you just add RAM and hope for the best, or is there a specific process that you would use?
2. If one were to use a Storage system that included SSD (better random read speeds than Magnetic storage), would that help reduce the External fragmentation problem? Would you recommend it as an alternative to adding RAM? I’m asking because our OPS team is putting together a nice Tiered Storage system from EMC that has SSDs in it.
Brent Ozar August 14, 2012 | 10:19 am
Ayman – to cache everything, yep, you just add memory and monitor memory usage via the DMVs. There’s a lot of good instrumentation in the system views that will tell you which objects are getting cached.
Tiered storage, however, that’s a much trickier question. It really depends on the size of the databases, the size of the SSDs, the way the data is queried, and much more. I couldn’t give a simple blanket answer that covers all scenarios here. However, I definitely wouldn’t recommend tiered storage over adding memory – nothing’s going to be faster than memory, and memory is *way* cheaper.
Ayman El-Ghazali August 14, 2012 | 10:44 am
Thanks for the quick response. Our OPS team put together the tiered storage system and I highly doubt they want to spend more money adding RAM. I will keep this article in mind for future upgrades. We probably could have spent less money on RAM then we did on the Enterprise level SSDs.
Oh well, didn’t come out of my pocket
Kevin Di Sotto August 14, 2012 | 10:03 am
Great post! I am currently testing ola’s scripts for as a replacement for our maintenance plans. I watched your video DBA Darwin Awards: Index Edition which is also superb. In that you say that Rebuild Indexes also updates statistics with full scans. Our plan at the moment runs the tasks reorganize indexes, rebuild indexes and update statistics with full scans in that order. Yes huge transaction log but why does the update statistics job take longer? If the rebuild indexes task does the same thing and also rebuilds the indexes I thought that would take longer.
Brent Ozar August 14, 2012 | 10:19 am
Kevin – I’m not quite sure what you mean – when you say “Why does the update statistics job take longer?” can you elaborate? Take longer than what? Have you got some metrics to illustrate what you’re asking?
Kevin Di Sotto August 14, 2012 | 10:56 am
Sorry Brent I see this in my maintenance plan log:
Rebuild Index
Rebuild index on Local server connection
Databases: MYPRODUCTION
Object: Tables and views
Original amount of free space
Task start: 2012-08-12T04:52:48.
Task end: 2012-08-12T08:25:57.
Success
Update Statistics Update Statistics on Local server connection
Databases: MYPRODUCTION
Object: Tables and views
All existing statistics
Task start: 2012-08-12T08:28:10.
Task end: 2012-08-12T15:37:36.
Success
If the rebuild indexes task also updates statistics with full scans I would have expected the rebuild index task to take longer than the update statistics task.
Brent Ozar August 14, 2012 | 10:57 am
Kevin – yep, something seems odd there. I’d want to know more about the underlying code being run, like more about the maintenance plan tasks you’re using. Unfortunately this is beyond something I can troubleshoot in blog comments – might want to post the full details with screenshots on http://DBA.StackExchange.com.
Kevin Di Sotto August 14, 2012 | 11:01 am
Thanks Brent. This was setup by someone else but I beleive it was using the built in maintenance plan wizard. The sooner I can get ola’s jobs then better.
Steve Criddle August 14, 2012 | 9:23 pm
The update statistics task updates all stats, which includes stats on columns that are not indexed. A specific case that I have seen before is if you have BLOB columns with statistics on them, they can take in the order of 4-5 hours to update the stats in a table with a few million rows.
Pingback: If an Index is Heavily Fragmented Will SQL Server Still Use It? | | Colin Stasiuk
Wayne August 14, 2012 | 10:44 am
Hi Brent,
Michelle Ufford also has a nice defrag script on her blog (latest version is at http://sqlfool.com/2011/06/index-defrag-script-v4-1/).
Brent Ozar August 14, 2012 | 10:48 am
Wayne – unfortunately, Michelle’s hasn’t been maintained, so it doesn’t update statistics, doesn’t manage the new column-level indexes in 2012, etc. I’d stick with Ola’s because it’s kept up to date.
Kevin Di Sotto August 14, 2012 | 10:57 am
Have you seen this one brent for statistics and what is your view on it?
http://crankydba.com/2012/02/03/selectively-updating-statistics/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+TheCrankyDbaSqlserverpediaSyndication+%28The+Cranky+DBA+%C2%BB+SQLServerPedia+Syndication%29
Brent Ozar August 14, 2012 | 10:59 am
Kevin – no, Mike’s a smart guy, but I use Ola’s scripts because they cover more things (indexes, backups, DBCC, etc.)
Aaron Bertrand August 14, 2012 | 3:50 pm
Brent, are you sure Ola’s scripts do anything with columnstore indexes? I can’t imagine what you would want his process to do without some kind of at least semi-manual intervention (e.g. planning some kind of bulk load + rebuild).
Brent Ozar August 14, 2012 | 3:51 pm
No, sorry, I wasn’t clear – by handling them I just meant it’s not goofed up by them.
Noeldr August 14, 2012 | 1:17 pm
Do you care to backup the “$5k-$6k for 384GB of RAM” claim with a link ?
Aaron Bertrand August 14, 2012 | 3:32 pm
If you have 24 slots, here are two links:
http://www.tigerdirect.com/applications/SearchTools/search.asp?keywords=KVR13LR9D4K4/64
http://www.newegg.com/Product/Product.aspx?Item=N82E16820239143&Tpk=KVR13LR9D4K4%2f64
Brent Ozar August 14, 2012 | 3:56 pm
Sure – go to http://www.crucial.com, pick the server you’re using, and go. Just priced out an HP DL580 for a client.
Noeldr August 15, 2012 | 1:49 pm
Although we get it at 7K I would consider it close enough. Thanks!
I think the “increase memory” advice is right but you should also add the fact that after 64GB you will be _forced_ into Enterprise Edition and that is a non-trivial expense in licensing cost.
Tim Plas August 17, 2012 | 10:02 am
Also note that going above 32GB will require Enterprise Edition of Windows Server. Not as pricey as the jump to EE on SQL, but still.. Or then again, maybe another reason to adopt Win Server 2012 (same specs for SE and EE).
Brent Ozar August 17, 2012 | 10:03 am
Tim – great point! I always forget about that limit.
Aaron Bertrand August 14, 2012 | 3:20 pm
Great read Brent!
Even if you have gobs of memory, it can still be wasteful to cache horribly fragmented indexes there. On the other hand, pulling it out of buffer to defragment it is quite wasteful also.
Not all fragmentation processes are created equal. Being associated with a tools vendor we used our collective DBA experience to strike a good balance : much like Ola’s and Michelle’s scripts, we only defragment or rebuild indexes using an algorithm (which is customizable enterprise-wide or down to the individual index level). On top of that, we have rules based on how much of an index is in memory – e.g. if it’s already largely or entirely in buffer, do we really need to do anything with it? Finally, we allow concurrent operations, so if you have a small window for index maintenance, you can hammer the crap out of multiple indexes at once instead of doing one at a time and extending your window. All of this beats stock maintenance plans hands down, and allows you to only care about the process as much as you want to.
I agree that it is a noble goal to stop worrying about fragmentation, and it is possible with the right tools / scripts in place. But I hope nobody takes from this the headline that it’s really okay to stop doing it entirely.
Aaron Morelli August 14, 2012 | 3:30 pm
This post from Conor (Query Opt team) seems especially relevant:
http://blogs.msdn.com/b/conor_cunningham_msft/archive/2010/04/23/conor-vs-index-fragmentation-in-plan-selection.aspx
Not every DB provides this luxury, but our transactional DBs really do have “OLTP patterns” (i.e. mostly index seeks to just a few rows), so we’ve reserved our defragmentation processes for the evening, and only when frag % gets pretty bad. Our Data Warehouse DBs are a different story…
Pingback: Something for the Weekend - SQL Server Links 17/08/12
Edward Dortland August 17, 2012 | 4:43 pm
Hi Brent thanks for your post. Although I do agree with you that since memory is cheap nowadays, you could make an argument that as long as you can keep the data in mem, it doesn’t really matter if that in mem data is logically fragmented or not.
however:
Say if an nc index is kept completely in memory. It’s page density still has a lot of influence on the decision to go for a index seek with bookmark lookup or go for a table scan. If the amount of rows being seeked in the index are spread over more than 25%/33% percent of pages of the total amount of pages of that index, the optimiser will not go for a index seek with bookmark lookup. It will go for a table scan.
That table, now has to be read into mem. If it’s a big table, there goes your cache. If the nc index would have had a better page density, the optimiser would have gone for a index seek with bookmark lookup, reading only the appropriate amount of pages from the clustered index/heap.
Regarding the logical fragmentation.(external). A lot of the time, page splits are a root cause. Page splits by itself cause a magnitue more translog entries. So I do agree that in a “large perc. of your db in memory” setup, the logical fragmentation by itself isn’t really an issue. The cause of that logical fragmentation by itself is something to worry about.
Warm regards,
Edward
Just something to think about.
Brent Ozar August 17, 2012 | 4:46 pm
Edward – thanks for the note. What’s interesting about page split is that often they’re caused by an overzealous DBA rebuilding indexes with 100% fill factor. That actually makes page splits worse due to updates.
About transaction log activity due to page splits – that’s a drop in the bucket compared to the transaction log activity doing constant defrags and rebuilds, wouldn’t you agree? If we’re rebuilding all of our indexes every night, then the transaction log activity is huge compared to a small percentage of inserts/updates during the day.
Edward Dortland August 17, 2012 | 5:16 pm
Well.. it depends.
I guess what I’m trying to say is that I largely agree with you that once you have fragmentation it’s not always necessary to imidiately go fix it. Especially if the reason for that is: “Well, that’s what we always do, every day… that’s what maintenance plans are for right…”.. An exception would be highly used nc indexes. (See my previous comment).
But I do think it’s still very important to fix whatever is causing fragmentation. Bad clustered indexes (Guids), to large row sizes, lot’s of var columns with huge differences between rows. etc. etc.
You say, that the overhead of pagesplits is small compared to a full blown rebuild. It depends on your load and on your schema design but lets say it is, even when the complete db is in mem, all that extra log overhead still has to be written to disk. And depending on the load, that can be very large overhead.
So I’d like to keep that log overhead to a minimum.
The rebuilds could be minimal logged. That helps me with keeping my log disk IOs low but I agree that doesn’t help you with your examples of mirroring (bulk logged not supported) and logshipping (still big log backups)
So yes, don’t go instantly running rebuild or reorganize but please DO keep worrying about fragmentation.
Cheers,
Edward
Ed August 28, 2012 | 6:04 pm
Nice phone book analogy. I can instantly understand and relate to. Thanks.
Mark Jones November 27, 2012 | 8:22 am
Hilarious, loved the point about 384GB ONLY costing $5-6K.
And what about when your DB is closer to 1TB on disk and still working really well (sub second queries) with only 1.5GB of RAM for the server? Oh yea, we call that MySQL with INNODB Tables.
In a cloud environment where you paying monthly for for RAM, it makes sense to NOT use databases that are as badly broken as MSSQL is. And a MySQL defrag? haven’t needed/done one in 5 years even though we add and purge 1 million entities/day (which is more than 1 million rows/day considering the multiple tables and joins).
Brent Ozar November 27, 2012 | 8:24 am
Mark – that’s great to hear that you’re able to work with a 1TB database with 1.5GB memory for the server. Sounds like you’ve found a solution that works well for you. Have you blogged about your database setup? I’m sure a lot of readers would love to hear more about that. Thanks!
Lonny Niederstadt February 6, 2013 | 9:27 pm
I agree that index fragmentation may not deserve as much attention as it sometimes gets. Improved query elapsed time after rebuilding indexes may be from the defragmentation, but may more likely be due to updated stats or the subsequent recompile of query plans dependent on those stats.
However, its simply not true that the SAN is performing random hard drive reads to retrieve data that the database and filesystem consider sequential. Consider the stripe unit size per hard disk up through at least the EMC Symmetrix DMX4 – 960k. That’s 15 consecutive 64k database extents in one disk stripe unit. There are efficiencies in retrieving as much data as possible in as few transfers as possible. There’s also less chance of being punished for QFULL or scsi_busy conditions.
SQL server does a good job with readahead, coalescing contiguous extent retrievals into reads of up to 512k (I’ve heard rumors of more) and warming the database buffer cache before the query processing threads get there. So query elapsed time might not suffer from suboptimal IO (as long as QFULL conditions aren’t incurred). But the SAN admin will notice if the system is busy enough. And other tenants of the SAN could notice as well.
On SSDs, the benefits of contiguity only really extend to the per drive stripe unit size, because each stripe unit is randomly placed within the drive regardless of its positioning within the filesystem or database. There is still a benefit in retrieving as few stripe units as possible – less of a chance of saturation between the storage controller and the disks.
On a very busy SQL server, reasonably defragged large high use indexes and possibly the -E adjustment of proportional fill can make io performance more consistent for SQL server, and make SQL Server a better neighbor.
Brent Ozar February 7, 2013 | 8:16 am
Lonny – have you noticed the stripe sizes on newer EMC SANs? I haven’t seen the VNX doing stripes that large, for example. NetApp is another example – it’s 4k.
Lonny Niederstadt February 7, 2013 | 10:37 am
Saying that sequential access is all basically random anyway – that completely disregards what SAN administrators spend lots of their time planning for, and correcting. You probably won’t win many over that way
The basic issue: even when the stripe unit per disk shrinks, truly random access to that disk for the next read incurs an average of half the maximum disk head seek time for the disk. Performance disks spin at 15k rpm (or 10k). But the heads move more slowly. So go ahead and switch from disk to disk fast. But incur as little head movement as possible. That’s why its STILL a bad idea to put two or more heavy write, heavy growth, heavy read database files on a single LUN (with the exception of SSD, although I don’t recommend it there either). The sequential activity at two different points in disk geometry introduces a particular type of head thrashing that Microsoft calls “IO weaving”. The disk head weaves from the location for one sequential read track to the location for the other sequential read track – on each and every disk that the data is striped over. In cases of IO weaving, sequential access performance can be WORSE than random read access, depending on the distance the head must travel between the two sequential read locations. NetApp and ZFS are slightly less vulnerable to IO weaving than “update in place” storage. But IO weaving results in trouble for them, too.
I like NetApp (and ZFS), although my heart will always be with Hitachi storage. NetApp plays by its own rules (in comparison to other storage players like IBM, HP, EMC, and Hitachi), due to the ONTAP WAFL shadow paging filesystem. But the read cache accelerator cards (they used to call them PAM) still benefit greatly from contiguity of access across the disks. Although there is a conceptual 4k stripe unit, writes occur in 64k chunks per disk. Its an under-the-covers optimization. Here’s how it works: if you have 16+2 in a RAID DP group (I don’t recommend this, its for ease of illustration only) a 64k database extent will span the raid group. But that’s NOT how it gets written to the disks. It goes to an NVRAM card first (and mirrored to another NVRAM card in the paired controller) and is burst from NVRAM to disk when NVRAM reaches its write pending threshold, or the timer goes off. In NVRAM the writes are coalesced. Assume that an index rebuild in a single database file database is the only current activity. if 16 extents are written to the database file, the extents are sorted in NVRAM. Eighteen 64k writes will take place. Each 64k write is made up of 16 4k theoretical stripe units. If you take the parity stripe units out of the mental picture, each write would have 4k of each of the 16 database extents. Compact. Compact especially for the inode maintenance that has to take place for the WAFL “redirect on write” to present a coherent file. The more fragmentation in the files presented from WAFL, the more inode updates have to take place.
VMAX virtual pools use 768k extent sizes. 12 Symmetrix tracks of 64k each. The 768k extent size is the minimal size for promotion in the Symmetrix auto tiering. If there isn’t a good amount of contiguity of database contents, autotiering will be inefficient.
On the CLARiiON/VNX systems, the main consideration is what flavor the LUNs are. They could be traditional LUNs, virtual pool LUNs, or thin provisioned virtual pool LUNs. The rules for optimizing database performance on each are slightly different. But in no case can one simply expect the sequential reads to be randomized and always achieve the benefits of sequential IO. This is a good source for VNX implementation details.
http://virtualeverything.wordpress.com/2011/03/05/emc-storage-pool-deep-dive-design-considerations-caveats/
Note what Vijay, the blog author says about thin provisioned LUNs:
“Utilizing Thin LUNs introduces a whole new level of considerations as it does not pre-allocate the 1GB slices, but rather writes in 8K extents. This can cause even more unpredictable behavior under the circumstances outlined above, but that should be something to be aware of when using Thin provisioning in general.”
Thin LUNs do become more highly distributed and less sequential at the 8k level. I do NOT recommend using thin LUNs for a performance oriented database unless the majority of access is 8k random. Even then, if it grows considerably and the underlying disks are shared, the disk head movement when thin LUN contents from other applications are interleaved can be a performance killer.
Brent Ozar February 7, 2013 | 3:17 pm
Lonny – hmm, I think we’re coming from different perspectives. You write that “SAN administrators spend lots of their time planning for, and correcting” access so that random access becomes sequential access. That hasn’t been my experience, but it sounds like you work with a team of really good SAN admins who have the time to hone that kind of thing. That’s great, and I’m glad that works well for you.
In my experience – which is just different from yours, not better or worse – is that most SAN admins are overworked, and perhaps a little bit underfunded/undertrained. They don’t have the time to figure out random vs sequential access on a shared pool of storage. In that type of environment, buying RAM is a lot cheaper than hiring more SAN admins and sending them to training, and dedicating their time to managing random vs sequential access.
Sounds like you’re in a great environment, though, and more power to you! Enjoy, and thanks for the comment.
Lonny Niederstadt February 7, 2013 | 4:46 pm
You are misunderstanding me slightly. For someone to say, “who cares if its sequential or not, its all random on the SAN” is not true, disregards what SAN admins spend lots of time doing, and potentially makes more work for the SAN admins. Please don’t give people that message.
Brent Ozar February 7, 2013 | 4:53 pm
Hmm – again, I haven’t met a lot of SAN admins who spend time focusing on issues like that. I think it’s great that you do, and that’s awesome. You’re the right kind of shop to focus on fragmentation. However, that’s not what I typically see out in the field.
Lonny Niederstadt February 7, 2013 | 11:54 am
Brent – you do such a good job steering people away from myths, that when you give advice with hidden gotchas I feel compelled to speak up. In your post you said “Cache it and be done with it – 384GB of memory is just $5-$6k.”
Be careful trying to cache a database working set in SQL Server. Do NOT rely on the performance of an in-memory working set, unless all the factors of keeping the working set cached have been accounted for. With 384 GB, on a four NUMA node server, you aren’t guaranteed to keep even a 30 GB database working set in cache.
To keep working set of 30GB of database blocks in database cache reliably, across SQL Server restarts and regardless of how tasks are distributed across the four NUMA nodes, you’d need 400 GB + (buffer cache tempdb load) for max server memory. Big it up for OS memory, worker thread memory, other applications, etc, to get the full amount of server RAM.
I haven’t seen the task scheduling and NUMA node considerations for working set caching in SQL Server together in one place, so I made the first post in my new blog about it. I don’t think too many people will read my blog (it takes me forever to write anything and I don’t have too much to add to the encyclopedia), but if I can put out a few things that work their way around among the really knowledgeable folks, I’ll have accomplished what I’m after
http://sqlsasquatch.blogspot.com/2013/02/sql-server-cache-database-working-set.html