Stop Worrying About SQL Server Fragmentation

I bet you’re worried about your index fragmentation.

And I bet you’re wasting your time, missing the real point.

What Index Fragmentation Is and How It Happens

Let’s take a step back for a second and pretend that your database is a phone book organized by last name, first name.

As people move into your city, we have to add them to the phone book.  Ideally, each page has some empty space, and we govern that with the fill factor.  When SQL Server rebuilds indexes, it uses the fill factor to decide how much free space to leave on each page.  If there’s not enough free space, SQL Server has to do some rearranging – but it can’t exactly shove a brand new page in the middle of the phone book.  The book’s already bound.  We’ll have to tack more blank pages onto the end.
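
For example (a sketch – the table and index names below are made-up placeholders), rebuilding an index with a 90% fill factor leaves 10% free space on each page:

-- Rebuild, leaving 10% free space on every page for future inserts
ALTER INDEX IX_PhoneBook_LastName_FirstName
    ON dbo.PhoneBook
    REBUILD WITH (FILLFACTOR = 90);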

Problem #1 – Internal Fragmentation: We’ve got a newly added page with hardly any stuff on it.

Problem #2 – External Fragmentation: The phone book pages are out of order.

Updating people in our phone book causes problems too.  When Pat Down marries Phil McCann, and we update her last name in the phone book, we leave empty space in the D section and cram a new record into the M’s.  Deletes cause problems by leaving empty space behind (internal fragmentation).
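
To see how much of each kind you have, a query along these lines against sys.dm_db_index_physical_stats works (a sketch – SAMPLED mode is used because the internal-fragmentation column comes back NULL in LIMITED mode):

SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,    -- external: out-of-order pages
       ips.avg_page_space_used_in_percent,  -- internal: how full each page is
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id = ips.index_id
WHERE ips.page_count >= 1000   -- ignore tiny indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;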

How Fragmentation Hurts SQL Server Performance

Bad internal fragmentation (lots of free space on the pages) means the index is bigger than it needs to be.  Instead of a phone book with 1,000 pages that are 100% full, we might have roughly 1,100 pages that are only about 90% full.  Every time we scan the index, it takes about 10% longer (1,100 pages instead of 1,000).  We also need more memory to cache the same amount of data, because SQL Server caches the empty space on each page along with the rows.  Its smallest unit of caching is a single page, not a record.
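
You can even measure how much memory that empty space is costing you – sys.dm_os_buffer_descriptors reports the free bytes on every page SQL Server has cached (a sketch; it scans the whole buffer pool, so give it a moment on a big server):

SELECT DB_NAME(database_id) AS database_name,
       COUNT(*) * 8 / 1024 AS cached_mb,
       SUM(CAST(free_space_in_bytes AS BIGINT)) / 1048576 AS empty_space_mb
FROM sys.dm_os_buffer_descriptors
WHERE database_id <> 32767   -- skip the internal resource database
GROUP BY database_id
ORDER BY empty_space_mb DESC;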

Bad external fragmentation (having shuffled pages on disk) means our storage performance could be slower.  If magnetic hard drives have to jump around to different areas of the drive, their performance suffers – a lot.  Take this Toshiba enterprise drive review by StorageReview – it gets around 200MB/sec for large sequential reads, but under 2MB/sec for random reads.  Ouch.

Fixing Index Fragmentation Temporarily

If you can’t cache the database in memory, or you want to fix it on disk anyway, you can solve it by rebuilding or defragmenting the index.  Most folks do this with maintenance plans, but those have a nasty problem.  They rebuild (or defrag) every single index in the database, every time, whether it’s necessary or not.  The maintenance plans ignore whether the table’s even had a single write since the last time it was maintained.

This is a problem because rebuilding and defragmenting indexes causes SQL Server to write to the transaction log.  The more we write to the logs, the longer our log backups take, the more we have to push across the network wire for database mirroring or log shipping, and the longer restores take.
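
One way to see what that maintenance costs is to compare log space before and after the job runs – a rough sketch (DBCC SQLPERF works on older versions; the DMV is only on newer builds, so verify it exists on yours):

-- Log size and percent used for every database
DBCC SQLPERF(LOGSPACE);

-- Newer versions: per-database detail for the current database
SELECT total_log_size_in_bytes / 1048576 AS log_size_mb,
       used_log_space_in_bytes / 1048576 AS used_log_mb
FROM sys.dm_db_log_space_usage;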

We might even be doing more damage, too.  Some DBAs decide they want to fix fragmentation by setting a low fill factor – say, 50%.  Half of every page would be empty, so inserts would be blazing fast, but reads would be twice as slow: to scan the entire phone book, we'd have twice as many pages to read.  Tweaking fill factor is a dangerous dance, much like the ones I did in high school.  (True story: broke my nose slam dancing.)  Every time we rebuild, we might be forcing more empty space onto each page than we actually need.
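
Before dialing fill factor down, check what you've already got – a quick sketch to find indexes that are deliberately carrying a lot of empty space:

-- fill_factor = 0 means the server default (pages packed 100% full)
SELECT OBJECT_NAME(object_id) AS table_name,
       name AS index_name,
       fill_factor
FROM sys.indexes
WHERE fill_factor > 0
  AND fill_factor < 80
ORDER BY fill_factor;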

Fixing Index Fragmentation Permanently

Start by trying to cache your database – or at least the data that’s frequently accessed – in memory.  External fragmentation (out-of-order pages on disk) doesn’t matter as much when we don’t have to hit the disks to begin with.  The difference between physically fragmented and physically unfragmented disk throughput is minuscule compared to the speed difference between memory and disk.  Cache it and be done with it – 384GB of memory is just $5-$6k.

Next, dig into what your storage system is really doing with your data.  If you’re using shared storage like EMC, NetApp, or Dell gear that shares drives between a bunch of different servers, then all of your drive access will be random anyway.  Your hard drives are shared with other servers that are also making drive requests at the same time, so the drives will always be jumping all over the place to get data.  Defragging your indexes is just meaningless busy work.

Find out what your queries are waiting on – and identify what problem you’re really trying to fix.  During our SQL Server health checks, we rarely say, “Ah, yes, here’s the problem – your data is fragmented.”  We DBAs are just used to defragmenting the bejeezus out of our databases because it’s one of the few problems we can fix easily, and because simple numbers tell us whether our indexes ended up less fragmented.  Find the real root cause, and then attack that.
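
A simple starting point is the server's cumulative wait stats since the last restart – a sketch (the excluded wait types are just a few common benign ones, not a complete list):

SELECT TOP 10 wait_type,
       wait_time_ms / 1000 AS wait_time_sec,
       waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH',
                        'REQUEST_FOR_DEADLOCK_SEARCH', 'CHECKPOINT_QUEUE',
                        'XE_TIMER_EVENT', 'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;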

And if you discover that the root problem really is fragmented indexes – and it just might be – use Ola Hallengren’s free database maintenance scripts.  They check the fragmentation level of each index before they start – and then only defrag or rebuild the indexes that meet your thresholds.  This keeps your log file activity as light as possible while achieving the performance improvement you’re after.
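
A typical call looks something like this (parameter names come from Ola's documentation; the thresholds shown are illustrative, so check hallengren.com for the current options and defaults):

EXECUTE dbo.IndexOptimize
    @Databases = 'USER_DATABASES',
    @FragmentationLow = NULL,    -- lightly fragmented: leave it alone
    @FragmentationMedium = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
    @FragmentationHigh = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
    @FragmentationLevel1 = 5,    -- percent where "medium" starts
    @FragmentationLevel2 = 30,   -- percent where "high" starts
    @MinNumberOfPages = 1000,    -- ignore tiny indexes
    @LogToTable = 'Y';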

138 Comments

  • Great post (as always)… I’m just curious. What search terms did you use to discover those (creative commons licensed) photos? That first one is so so perfect for this topic.

    Reply
    • Thanks man! I can’t remember offhand – I do a lot of image searches – but I’m betting I searched for yellow pages. I usually end up looking through a lot of images before I find just the right ones.

      Reply
    • On that note… I still prefer Kendra Little’s art! Though each of us has our own talents 🙂

      Reply
  • Right on target Mr Brent. Nice article.

    Reply
  • Great read! The subject of indexes and other performance tuning is something that I am trying to focus on this year at work. Keep up the good work!

    Reply
  • Sorry about your nose, it looks good though no worries!
    Two Quick questions.
    1. How do you go about caching everything? Do you just add RAM and hope for the best, or is there a specific process that you would use?
    2. If one were to use a Storage system that included SSD (better random read speeds than Magnetic storage), would that help reduce the External fragmentation problem? Would you recommend it as an alternative to adding RAM? I’m asking because our OPS team is putting together a nice Tiered Storage system from EMC that has SSDs in it.

    Reply
    • Ayman – to cache everything, yep, you just add memory and monitor memory usage via the DMVs. There’s a lot of good instrumentation in the system views that will tell you which objects are getting cached.
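
      For example, something along these lines shows which objects are taking up buffer pool space in the current database (a sketch – it only counts in-row and row-overflow pages, and it can take a while against a large buffer pool):

      SELECT OBJECT_NAME(p.object_id) AS object_name,
             i.name AS index_name,
             COUNT(*) * 8 / 1024 AS cached_mb
      FROM sys.dm_os_buffer_descriptors AS bd
      JOIN sys.allocation_units AS au ON au.allocation_unit_id = bd.allocation_unit_id
      JOIN sys.partitions AS p ON p.hobt_id = au.container_id
      JOIN sys.indexes AS i ON i.object_id = p.object_id AND i.index_id = p.index_id
      WHERE bd.database_id = DB_ID()
        AND au.type IN (1, 3)   -- in-row and row-overflow data; LOB pages skipped for brevity
      GROUP BY p.object_id, i.name
      ORDER BY cached_mb DESC;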

      Tiered storage, however, that’s a much trickier question. It really depends on the size of the databases, the size of the SSDs, the way the data is queried, and much more. I couldn’t give a simple blanket answer that covers all scenarios here. However, I definitely wouldn’t recommend tiered storage over adding memory – nothing’s going to be faster than memory, and memory is *way* cheaper.

      Reply
      • Thanks for the quick response. Our OPS team put together the tiered storage system and I highly doubt they want to spend more money adding RAM. I will keep this article in mind for future upgrades. We probably could have spent less money on RAM than we did on the Enterprise-level SSDs.
        Oh well, didn’t come out of my pocket 🙂

        Reply
      • Brent – we are in the process of trying to “help” a vendor determine why their Application is so slow and it’s my job to try and analyse what I can at the Database level. The server it runs on has loads of memory (64GB and can allocate more), but I’m not sure how to force the entire DB to be cached to see if it makes a difference? Is a product like SafePeak worth considering in cases where a vendor’s badly written application & queries cannot be avoided?

        Reply
        • What’s the primary wait stat that the server is facing, and what’s the total size of all databases on the box?

          Reply
        • Rares Radulescu
          December 4, 2015 5:32 am

          In order to find out how you can cache the entire DB in RAM, you need to know how SQL Server uses RAM.
          Every time a select is sent to SQL Server, the engine determines whether the information is available in RAM or whether it has to be retrieved from disk.
          Once the select’s results are delivered to the client, the data remains in RAM and will not be cleared unless there is memory pressure (memory pressure: SQL Server needs RAM for objects not currently in RAM, and there is no more RAM available, so old cached objects get evicted).
          Knowing all of this means that if you had a MAGIC select statement that read all of the data inside your database (and your database_size<RAM_available_to_SQL), all you would have to do is run this MAGIC select (every time SQL Server is restarted) to pre-cache the database.
          I am sure most of you already know by now what MAGIC command you can use to make SQL Server read each and every page from a database:

          ———- MAGIC :
          Use databasename;
          go
          DBCC CHECKDB;
          go
          ———-Important : database_size<RAM_available_to_SQL

          Reply
  • Kevin Di Sotto
    August 14, 2012 10:03 am

    Great post! I am currently testing Ola’s scripts as a replacement for our maintenance plans. I watched your video DBA Darwin Awards: Index Edition, which is also superb. In that you say that Rebuild Indexes also updates statistics with full scans. Our plan at the moment runs the tasks reorganize indexes, rebuild indexes, and update statistics with full scans, in that order. Yes, huge transaction log – but why does the update statistics job take longer? If the rebuild indexes task does the same thing and also rebuilds the indexes, I thought that would take longer.

    Reply
    • Kevin – I’m not quite sure what you mean – when you say “Why does the update statistics job take longer?” can you elaborate? Take longer than what? Have you got some metrics to illustrate what you’re asking?

      Reply
      • Kevin Di Sotto
        August 14, 2012 10:56 am

        Sorry Brent I see this in my maintenance plan log:

        Rebuild Index
        Rebuild index on Local server connection
        Databases: MYPRODUCTION
        Object: Tables and views
        Original amount of free space
        Task start: 2012-08-12T04:52:48.
        Task end: 2012-08-12T08:25:57.
        Success

        Update Statistics Update Statistics on Local server connection
        Databases: MYPRODUCTION
        Object: Tables and views
        All existing statistics
        Task start: 2012-08-12T08:28:10.
        Task end: 2012-08-12T15:37:36.
        Success

        If the rebuild indexes task also updates statistics with full scans I would have expected the rebuild index task to take longer than the update statistics task.

        Reply
        • Kevin – yep, something seems odd there. I’d want to know more about the underlying code being run, like more about the maintenance plan tasks you’re using. Unfortunately this is beyond something I can troubleshoot in blog comments – might want to post the full details with screenshots on http://DBA.StackExchange.com.

          Reply
          • Kevin Di Sotto
            August 14, 2012 11:01 am

            Thanks Brent. This was set up by someone else but I believe it was using the built-in maintenance plan wizard. The sooner I can get Ola’s jobs in place, the better.

          • Steve Criddle
            August 14, 2012 9:23 pm

            The update statistics task updates all stats, which includes stats on columns that are not indexed. A specific case that I have seen before is if you have BLOB columns with statistics on them, they can take in the order of 4-5 hours to update the stats in a table with a few million rows.

    • Stacy L. Gray
      November 6, 2013 8:08 pm

      If you want to update statistics as part of your regular maintenance plan, there is a catch you should be aware of. Both UPDATE STATISTICS and sp_updatestats default to using the previously specified level of sampling (if any)—and this may be less than a full scan. Index rebuilds automatically update statistics with a full scan. If you manually update statistics after an index rebuild, it’s possible to end up with less accurate statistics! This can happen if a sampled scan from the manual update overwrites the full scan generated by the index rebuild. On the other hand, reorganizing an index does not update statistics at all. – Paul Randal, http://technet.microsoft.com/en-us/magazine/2008.08.database.aspx
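
      One way to check whether that has happened is sys.dm_db_stats_properties, which shows when each statistic was last updated and how many rows were sampled (a sketch – the function is only available on newer builds, so verify it on yours):

      SELECT OBJECT_NAME(s.object_id) AS table_name,
             s.name AS stats_name,
             sp.last_updated,
             sp.rows,
             sp.rows_sampled   -- equals rows when the last update was a full scan
      FROM sys.stats AS s
      CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
      WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
      ORDER BY sp.last_updated;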

      Reply
  • Hi Brent,
    Michelle Ufford also has a nice defrag script on her blog (latest version is at http://sqlfool.com/2011/06/index-defrag-script-v4-1/).

    Reply
  • Do you care to back up the “$5k-$6k for 384GB of RAM” claim with a link?

    Reply
  • Great read Brent!

    Even if you have gobs of memory, it can still be wasteful to cache horribly fragmented indexes there. On the other hand, pulling it out of buffer to defragment it is quite wasteful also.

    Not all fragmentation processes are created equal. Being associated with a tools vendor we used our collective DBA experience to strike a good balance : much like Ola’s and Michelle’s scripts, we only defragment or rebuild indexes using an algorithm (which is customizable enterprise-wide or down to the individual index level). On top of that, we have rules based on how much of an index is in memory – e.g. if it’s already largely or entirely in buffer, do we really need to do anything with it? Finally, we allow concurrent operations, so if you have a small window for index maintenance, you can hammer the crap out of multiple indexes at once instead of doing one at a time and extending your window. All of this beats stock maintenance plans hands down, and allows you to only care about the process as much as you want to.

    I agree that it is a noble goal to stop worrying about fragmentation, and it is possible with the right tools / scripts in place. But I hope nobody takes from this the headline that it’s really okay to stop doing it entirely. 🙂

    Reply
    • Regarding your comment “…we have rules based on how much of an index is in memory – e.g. if it’s already largely or entirely in buffer, do we really need to do anything with it?” Could you share some information on how this is implemented in your maintenance solution? What other factors (other than the number of index pages in memory versus the number of index pages) do you look at to determine if the index should be rebuild/reorganize?

      I also do a lot of checks and apply some business rules to decide which objects should be maintained. This seems like a promising addition to my maintenance solution.

      Reply
      • Sorry Michael, I spoke out of turn there. We have talked about implementing such a feature and last I checked it was on the roadmap, but it hasn’t made it yet. My thinking is simply that as long as the data is already in memory, who cares if it’s fragmented? Since it’s actively being used and not getting evicted, it would actually be worse for performance to put it on disk to reorganize it and make it fit slightly better into memory. That’s a lot of wasted I/O for – usually – very little gain. But yes, you’re right, the logic would be similar to that – percentage of the pages in memory, overall size, relative size to remainder of buffer pool, and perhaps even PLE (if PLE is low then you can make a guess that there is memory pressure and these pages may not be as stable in memory as you thought). You could also factor in index usage stats – if you have snapshots of this over time you could see if there are patterns with a particular index, like used heavily weekday mornings or at month end or on Saturdays, then you can make more educated decisions about when is the optimal time to rebuild or reorg. On some systems, it may be never, but I don’t think it should be forgotten about. Once you’re maintaining many layers of performance data over time, the number of switches and knobs and rules you can create is limitless. Now only if dev/QA resources didn’t have pesky limits. 🙂

        Again, I’m sorry I misspoke. We had talked about that feature enough that I thought it had already been implemented. 🙁

        Reply
  • Aaron Morelli
    August 14, 2012 3:30 pm

    This post from Conor (Query Opt team) seems especially relevant:
    http://blogs.msdn.com/b/conor_cunningham_msft/archive/2010/04/23/conor-vs-index-fragmentation-in-plan-selection.aspx

    Not every DB provides this luxury, but our transactional DBs really do have “OLTP patterns” (i.e. mostly index seeks to just a few rows), so we’ve reserved our defragmentation processes for the evening, and only when frag % gets pretty bad. Our Data Warehouse DBs are a different story…

    Reply
  • Edward Dortland
    August 17, 2012 4:43 pm

    Hi Brent thanks for your post. Although I do agree with you that since memory is cheap nowadays, you could make an argument that as long as you can keep the data in mem, it doesn’t really matter if that in mem data is logically fragmented or not.

    however:
    Say an nc index is kept completely in memory. Its page density still has a lot of influence on the decision to go for an index seek with bookmark lookup or for a table scan. If the rows being seeked are spread over more than 25%/33% of the total number of pages of that index, the optimiser will not go for an index seek with bookmark lookup. It will go for a table scan.

    That table now has to be read into mem. If it’s a big table, there goes your cache. If the nc index had had a better page density, the optimiser would have gone for an index seek with bookmark lookup, reading only the appropriate number of pages from the clustered index/heap.

    Regarding the logical (external) fragmentation: a lot of the time, page splits are a root cause. Page splits by themselves cause a magnitude more translog entries. So I do agree that in a “large perc. of your db in memory” setup, the logical fragmentation by itself isn’t really an issue. The cause of that logical fragmentation, though, is something to worry about.

    Warm regards,
    Edward

    Just something to think about.

    Reply
    • Edward – thanks for the note. What’s interesting about page splits is that they’re often caused by an overzealous DBA rebuilding indexes with a 100% fill factor. That actually makes page splits worse when updates come in.

      About transaction log activity due to page splits – that’s a drop in the bucket compared to the transaction log activity doing constant defrags and rebuilds, wouldn’t you agree? If we’re rebuilding all of our indexes every night, then the transaction log activity is huge compared to a small percentage of inserts/updates during the day.

      Reply
      • Edward Dortland
        August 17, 2012 5:16 pm

        Well.. it depends.

        I guess what I’m trying to say is that I largely agree with you that once you have fragmentation, it’s not always necessary to immediately go fix it. Especially if the reason for doing so is: “Well, that’s what we always do, every day… that’s what maintenance plans are for, right…” An exception would be highly used nc indexes. (See my previous comment.)

        But I do think it’s still very important to fix whatever is causing fragmentation. Bad clustered indexes (GUIDs), too-large row sizes, lots of var columns with huge differences between rows, etc.

        You say that the overhead of page splits is small compared to a full-blown rebuild. It depends on your load and on your schema design, but let’s say it is: even when the complete db is in mem, all that extra log overhead still has to be written to disk. And depending on the load, that can be a very large overhead.

        So I’d like to keep that log overhead to a minimum.

        The rebuilds could be minimally logged. That helps me with keeping my log disk IOs low, but I agree that doesn’t help you with your examples of mirroring (bulk logged not supported) and log shipping (still big log backups) 😉

        So yes, don’t go instantly running rebuild or reorganize but please DO keep worrying about fragmentation.

        Cheers,
        Edward

        Reply
  • Nice phone book analogy that I can instantly understand and relate to. Thanks.

    Reply
  • Hilarious, loved the point about 384GB ONLY costing $5-6K.

    And what about when your DB is closer to 1TB on disk and still working really well (sub second queries) with only 1.5GB of RAM for the server? Oh yea, we call that MySQL with INNODB Tables.

    In a cloud environment where you’re paying monthly for RAM, it makes sense NOT to use databases that are as badly broken as MSSQL is. And a MySQL defrag? Haven’t needed/done one in 5 years even though we add and purge 1 million entities/day (which is more than 1 million rows/day considering the multiple tables and joins).

    Reply
    • Mark – that’s great to hear that you’re able to work with a 1TB database with 1.5GB memory for the server. Sounds like you’ve found a solution that works well for you. Have you blogged about your database setup? I’m sure a lot of readers would love to hear more about that. Thanks!

      Reply
    • @Mark Jones,

      Since you used the word “entities”, do you mean you’re purging 1 million tables a day or 1 million “rows” per day?

      Also, it’s pretty easy to do a drive-by shooting. Do you have an article on the subject or any proof that what you say is actually in place and running the way you say? And, no… not trying to be a smarty here. I’d really like to know because it does sound interesting.

      Reply
  • Lonny Niederstadt
    February 6, 2013 9:27 pm

    I agree that index fragmentation may not deserve as much attention as it sometimes gets. Improved query elapsed time after rebuilding indexes may be from the defragmentation, but may more likely be due to updated stats or the subsequent recompile of query plans dependent on those stats.
    However, it’s simply not true that the SAN is performing random hard drive reads to retrieve data that the database and filesystem consider sequential. Consider the stripe unit size per hard disk up through at least the EMC Symmetrix DMX4 – 960k. That’s 15 consecutive 64k database extents in one disk stripe unit. There are efficiencies in retrieving as much data as possible in as few transfers as possible. There’s also less chance of being punished for QFULL or scsi_busy conditions.
    SQL server does a good job with readahead, coalescing contiguous extent retrievals into reads of up to 512k (I’ve heard rumors of more) and warming the database buffer cache before the query processing threads get there. So query elapsed time might not suffer from suboptimal IO (as long as QFULL conditions aren’t incurred). But the SAN admin will notice if the system is busy enough. And other tenants of the SAN could notice as well.
    On SSDs, the benefits of contiguity only really extend to the per drive stripe unit size, because each stripe unit is randomly placed within the drive regardless of its positioning within the filesystem or database. There is still a benefit in retrieving as few stripe units as possible – less of a chance of saturation between the storage controller and the disks.
    On a very busy SQL server, reasonably defragged large high use indexes and possibly the -E adjustment of proportional fill can make io performance more consistent for SQL server, and make SQL Server a better neighbor.

    Reply
    • Lonny – have you noticed the stripe sizes on newer EMC SANs? I haven’t seen the VNX doing stripes that large, for example. NetApp is another example – it’s 4k.

      Reply
      • Lonny Niederstadt
        February 7, 2013 10:37 am

        Saying that sequential access is all basically random anyway – that completely disregards what SAN administrators spend lots of their time planning for, and correcting. You probably won’t win many over that way 🙂
        The basic issue: even when the stripe unit per disk shrinks, truly random access to that disk for the next read incurs an average of half the maximum disk head seek time for the disk. Performance disks spin at 15k rpm (or 10k). But the heads move more slowly. So go ahead and switch from disk to disk fast. But incur as little head movement as possible. That’s why its STILL a bad idea to put two or more heavy write, heavy growth, heavy read database files on a single LUN (with the exception of SSD, although I don’t recommend it there either). The sequential activity at two different points in disk geometry introduces a particular type of head thrashing that Microsoft calls “IO weaving”. The disk head weaves from the location for one sequential read track to the location for the other sequential read track – on each and every disk that the data is striped over. In cases of IO weaving, sequential access performance can be WORSE than random read access, depending on the distance the head must travel between the two sequential read locations. NetApp and ZFS are slightly less vulnerable to IO weaving than “update in place” storage. But IO weaving results in trouble for them, too.
        I like NetApp (and ZFS), although my heart will always be with Hitachi storage. NetApp plays by its own rules (in comparison to other storage players like IBM, HP, EMC, and Hitachi), due to the ONTAP WAFL shadow paging filesystem. But the read cache accelerator cards (they used to call them PAM) still benefit greatly from contiguity of access across the disks. Although there is a conceptual 4k stripe unit, writes occur in 64k chunks per disk. Its an under-the-covers optimization. Here’s how it works: if you have 16+2 in a RAID DP group (I don’t recommend this, its for ease of illustration only) a 64k database extent will span the raid group. But that’s NOT how it gets written to the disks. It goes to an NVRAM card first (and mirrored to another NVRAM card in the paired controller) and is burst from NVRAM to disk when NVRAM reaches its write pending threshold, or the timer goes off. In NVRAM the writes are coalesced. Assume that an index rebuild in a single database file database is the only current activity. if 16 extents are written to the database file, the extents are sorted in NVRAM. Eighteen 64k writes will take place. Each 64k write is made up of 16 4k theoretical stripe units. If you take the parity stripe units out of the mental picture, each write would have 4k of each of the 16 database extents. Compact. Compact especially for the inode maintenance that has to take place for the WAFL “redirect on write” to present a coherent file. The more fragmentation in the files presented from WAFL, the more inode updates have to take place.
        VMAX virtual pools use 768k extent sizes. 12 Symmetrix tracks of 64k each. The 768k extent size is the minimal size for promotion in the Symmetrix auto tiering. If there isn’t a good amount of contiguity of database contents, autotiering will be inefficient.
        On the CLARiiON/VNX systems, the main consideration is what flavor the LUNs are. They could be traditional LUNs, virtual pool LUNs, or thin provisioned virtual pool LUNs. The rules for optimizing database performance on each are slightly different. But in no case can one simply expect the sequential reads to be randomized and always achieve the benefits of sequential IO. This is a good source for VNX implementation details.
        http://virtualeverything.wordpress.com/2011/03/05/emc-storage-pool-deep-dive-design-considerations-caveats/
        Note what Vijay, the blog author says about thin provisioned LUNs:
        “Utilizing Thin LUNs introduces a whole new level of considerations as it does not pre-allocate the 1GB slices, but rather writes in 8K extents. This can cause even more unpredictable behavior under the circumstances outlined above, but that should be something to be aware of when using Thin provisioning in general.”
        Thin LUNs do become more highly distributed and less sequential at the 8k level. I do NOT recommend using thin LUNs for a performance oriented database unless the majority of access is 8k random. Even then, if it grows considerably and the underlying disks are shared, the disk head movement when thin LUN contents from other applications are interleaved can be a performance killer.

        Reply
        • Lonny – hmm, I think we’re coming from different perspectives. You write that “SAN administrators spend lots of their time planning for, and correcting” access so that random access becomes sequential access. That hasn’t been my experience, but it sounds like you work with a team of really good SAN admins who have the time to hone that kind of thing. That’s great, and I’m glad that works well for you.

          In my experience – which is just different from yours, not better or worse – most SAN admins are overworked, and perhaps a little bit underfunded/undertrained. They don’t have the time to figure out random vs sequential access on a shared pool of storage. In that type of environment, buying RAM is a lot cheaper than hiring more SAN admins, sending them to training, and dedicating their time to managing random vs sequential access.

          Sounds like you’re in a great environment, though, and more power to you! Enjoy, and thanks for the comment.

          Reply
          • Lonny Niederstadt
            February 7, 2013 4:46 pm

            You are misunderstanding me slightly. Saying “who cares if it’s sequential or not, it’s all random on the SAN” is not true, disregards what SAN admins spend lots of time doing, and potentially makes more work for the SAN admins. Please don’t give people that message.

          • Hmm – again, I haven’t met a lot of SAN admins who spend time focusing on issues like that. I think it’s great that you do, and that’s awesome. You’re the right kind of shop to focus on fragmentation. However, that’s not what I typically see out in the field.

      • Brian Sullivan
        January 9, 2015 9:43 am

        NetApp? Not sure what they are like now, but 4 years ago I suffered through migrating a 1.2TB DB to NetApp. Worst 4 months ever.

        Reply
        • Brian Sullivan
          January 9, 2015 9:46 am

          Should mention we had the NetApp team in to help.

          In the end we had twice the number of spindles as our old EMC to get the same performance.

          Reply
  • Lonny Niederstadt
    February 7, 2013 11:54 am

    Brent – you do such a good job steering people away from myths, that when you give advice with hidden gotchas I feel compelled to speak up. In your post you said “Cache it and be done with it – 384GB of memory is just $5-$6k.”
    Be careful trying to cache a database working set in SQL Server. Do NOT rely on the performance of an in-memory working set, unless all the factors of keeping the working set cached have been accounted for. With 384 GB, on a four NUMA node server, you aren’t guaranteed to keep even a 30 GB database working set in cache.
    To keep working set of 30GB of database blocks in database cache reliably, across SQL Server restarts and regardless of how tasks are distributed across the four NUMA nodes, you’d need 400 GB + (buffer cache tempdb load) for max server memory. Big it up for OS memory, worker thread memory, other applications, etc, to get the full amount of server RAM.
    I haven’t seen the task scheduling and NUMA node considerations for working set caching in SQL Server together in one place, so I made the first post in my new blog about it. I don’t think too many people will read my blog (it takes me forever to write anything and I don’t have too much to add to the encyclopedia), but if I can put out a few things that work their way around among the really knowledgeable folks, I’ll have accomplished what I’m after 🙂
    http://sqlsasquatch.blogspot.com/2013/02/sql-server-cache-database-working-set.html

    Reply
  • Sorry, I strongly disagree. What you’re saying is ‘just get bigger hardware’. That’s a really bad way to solve a performance issue. Also, you’re assuming DBAs have control over the storage configuration. Generally we don’t. The storage solution where I currently work is configured in almost the exact opposite of industry standard, and the person administering it doesn’t agree with me on how it should be configured. Also, we have roughly 50GB of RAM to work with. We can’t just up the RAM whenever we have I/O issues.

    Recently I set up a weekly defrag maint job using the scripts you talk about. The results were monumental. Before I did the first massive rebuilds we had massive blocking issues, to the point one of the databases was having chronic deadlocks. Reports on our reporting services box were timing out. Databases were growing out of control. Performance was deteriorating pretty rapidly. After running the index maint scripts on our production boxes I noticed from the log many rather large indexes including the clustered indexes were 99% fragmented. This was killing disk times. After defragging our deadlock issue was gone, going from 50+ deadlocks a day to 1 or 2, our reports which were timing out were coming back in 2 seconds, our databases went from 95% full to 60-70% full, locks are being released much, much faster than previous. It has been the single largest performance improvement I’ve ever had the privilege to implement in my 3 years with this small company.

    Moral of the story, index defrag is important no matter what system you administer. Also once it’s running weekly, the hit to the system for maintaining indexes should be minimal.

    Reply
  • Thanks for your comments on Defragmentation! And I mean all of the commenters as I am still getting my feet under me on this and have found only periodic defragmentation to be a solution in my circumstances.

    My question comes from a situation where a client with an exceptionally large database by our application standards (250 gig) is replicating large portions of it making maintenance a bit problematic.

    This is a WinServer 2003 32-bit box with 8GB RAM and SQL 2005. We have forced them to stop accessing the DB directly via other apps due to concerns about corruption, uncontrolled I/O, etc… The problem is the server never seems to use more than 3GB of RAM for the SQL Server process, and general perf seems to degenerate within 3-4 months of a reindex. They kick and scream because we have to drop their replication to reindex, which then takes almost 24 hours to bring the replication back up. (If there is a way to “pause” transactional replication so that it doesn’t need to re-snapshot, I’m all ears!)

    I am tempted to tell them it’s time to kick their RAIDed old box (~5 years old) and get a new 64-bit server w/ SQL 2012 and gobs of RAM, but before I do, is there hope for getting SQL 2005 to better use the resources on this server? Task Manager makes it look like it’s doing next to nothing CPU-wise (I think it has 8 cores) and the RAM isn’t even close to 50% used…

    So there are two questions here really: Hardware /facepalm and Reindex w/ Replication…

    Thanks!

    Reply
    • K Fulcher –
      Hardware /facepalm: If SQL is a Workgroup edition of SQL 2005, the memory limit is 3GB. Alternatively, look at the Max Memory setting in the server’s properties (you probably already looked, but still). I would say you can definitely get more out of that box, judging from your last sentence. Try to get an IO monitor, or look at WAIT statistics, as you may have bottlenecks other than CPU and RAM that make it look like the server is sleeping. Is the server also used as a reporting server, or does it hold more applications, or does it serve as a fileserver for the company?
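
      A quick way to check both (a rough sketch):

      -- Which edition is this instance?
      SELECT SERVERPROPERTY('Edition') AS edition;

      -- What is max server memory currently set to?
      SELECT name, value_in_use
      FROM sys.configurations
      WHERE name = 'max server memory (MB)';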

      Reindex w/ Replication: It depends on what replication tools and methods they are using. SQL-to-SQL using the built-in SQL Replication? A 3rd party tool like Attunity or ConnX or IBM-IDR? I use 2 of these 4 methods, and using Ola Hallengren’s maintenance scripts never caused me to have to stop my subscriptions. Do they replicate to another DB on the same server or to another SQL Server?

      I run the script for an hour each day, forcing a full re-scan every saturday. This way all my indexes get maintained on a regular basis, the worst ones getting more attention more often. Less stress on the system too as it is done in low-business hours.

      Reply
  • Hey Brent,

    Would this also be true for the clustered key? How about on SSDs?
    I’ve always favoured sequential data (identities, created etc) when generating keys to avoid fragmentation.
    Read this a while back, but I’m building a new DB so I dug it out and I’ll be testing using the main business keys.
    Add in the fact that we are moving to Pure Storage in the next couple of weeks (full SSD storage, very excited), and using the cluster on an otherwise meaningless identity seems wasteful.
    Mind you, perhaps random access makes the discussion about fragmentation and clustered index choice pointless anyway.
    What do you think?

    Reply
  • Gary Nease
    May 5, 2014 9:08 pm

    I saw no response to commenter Edward Dortland (August 17, 2012 | 4:43 pm) having pointed out that the optimizer may choose a suboptimal plan because it ignores a fragmented index. Even if I have the whole table and all its indexes in cache, a bad plan can mean the difference between excellent join plans or scanning a covering index, vs. table scans and/or nested loops. You are saying this difference is meaningless?

    Reply
    • Gary & Edward – sorry, sometimes a blog comment slips through the cracks. Any proof on that optimizer/fragmentation thing? That seems a little suspicious.

      Reply
  • I disagree with Brent here. We recently had our production app start failing… because queries ran slow… because indexes on large tables had become fragmented. (Actually, the core issue is that our maint script to rebuild indexes had not been running for months…) Fragmentation MATTERS. Brent is wrong to think we all have an extra 20-80 GB RAM in our servers, waiting for this situation. (Our server currently has 32GB RAM)

    Reply
  • Brian Sullivan
    January 9, 2015 10:02 am

    Many people reindex way too often. Ola H’s scripts do a great job only going after things that really matter.

    Some things they don’t do.

    When rebuilding an index, the whole thing seems to be logged, which means your logs can grow very quickly. I had a 200GB DB with a 4GB log grow to 80GB at least once a week, which meant I had to shrink it back so I could restore on my report server. In the end I modified his script to punt when the log size got bigger than 20GB (my largest heap/index + 10%).

    When deciding whether to rebuild or reorganize, it only looks at fragmentation and not index depth. I had a 1,200M-row table that we were adding 40M rows to each day, updating 40M, and deleting 40M. So about 4% fragmentation each day. The scripts would do a reorg rather than a rebuild. Over a six-month period we saw performance degrade because it had become a ragged right index. Once a month we had a job that forced the rebuild.

    Reply
  • Good article!! However, index defrag will become MORE useful as services like Amazon, Softlayer, and Azure become more common and pricing is based on RAM and CPU. We are paying over $1 an hour for the cheapest 64GB system with Amazon. Now… to move up to the next system with 117GB of RAM you are looking at $2/hour, then the next at $3.50/hr at 244GB.

    What you may think… That scales well…

    Server r3.2xlarge 61GB ram, 8 cores, $720/month
    Server r3.4xlarge 122GB ram, 16 cores, $1440/month (increase of $43,200 over 5 years)
    Server r3.8xlarge 244GB ram, 32 cores, $2520/month (increase of $108,000 over 5 years)

    Now add in the extra licensing costs for the extra cores, and you’re looking at a good reason to keep your indexes in check. Amazon storage is IO bound as well… You’re going to get considerably less performance at 4k random vs. sequential: 4-8MB/sec for a queue depth of 1, 12MB/sec for a QD of 32. Sequential is at 130MB/sec.

    For servers that you control more RAM is great! However not so good for cloud providers.

    Reply
    • Ron – that’s an interesting concept. Have you tested to see how defragmentation makes a difference on queries on those instances?

      Reply
      • Brent,

        No I have not as we just moved our systems the first of this month and there have been plenty of fires to put out. What is nice is that I have not done any defrag yet. Just a restore from backup, begin async mirror and break the mirror to move our systems to the new data center.

        I can run some tests next month. Would you like to see some data before and after? Is there anything in particular you would have me look for other than reads, writes, disk, cpu etc?

        Reply
        • Ron – the test isn’t for me – it’s for you. 😉 What would *you* want to see?

          Reply
          • Hi you two,
            is there any news on testing/using this solution, so that you could write some more about the sizing and whether it depends on fragmentation?

          • Bjoern – no, we haven’t put any more work into this. We still haven’t seen a case where external/physical fragmentation was the root cause of anybody’s performance issues. (Logical/internal fragmentation, aka free space on pages, that’s another story.)

  • I am a new DBA and have done TONS of research. In this, I have found that Ola’s solution is pretty well top notch… It is funny that I came across your site and you have also referred to Ola’s solution. It makes me feel good about my research, as Ola’s tasks are what I am starting to practice… Great write-up by the way… I have bookmarked you for reference if needed.

    Reply
  • Shalom Slavin
    June 30, 2015 11:53 am

    Brent,

    Just heard you mention this article on the ‘Out of Office’ webinar today.

    I’ve been looking into index fragmentation recently and coming upon this it sounds like I may have been wasting time.

    Some questions if you don’t mind:
    1) What if you have a table that’s deleted from often, wouldn’t a rebuild/reorg compact the pages needed to house the current data?

    2) Someone mentioned this and I have seen it myself – with a badly fragmented index SQL chose one plan; after rebuilding the index (it was ad-hoc SQL, so no SP to recompile), SQL chose an entirely different plan (using a different index). So it seems fragmentation most definitely affects the execution plan, no?

    3) Finally, I’m a big believer in the ‘scientific method’ and would like to prove whether or not fragmentation is the cause of an issue. Any tips as to how I could prove something like that?

    Thanks!

    Reply
    • Shalom – let’s turn it around a different way: what are the bad performance symptoms that you’re trying to solve?

      Reply
      • Shalom Slavin
        June 30, 2015 1:09 pm

        Firstly, I realized my 1st question is irrelevant – that’s internal fragmentation, not external.

        There isn’t one specific issue. We have a large DB (1.5TB or so) which started life in SQL 2000, and is now on SQL 2k8. We have (too) many indexes on tables, many of which are really duplicates (found a few using sp_blitzIndex 😉 ), and so on.

        The system has performance issues overall due to a lack of update and lack of maintenance, so I’ve been trying to put scripts together to find ‘no-no’ things such as wide rows (i.e. 5K rows wasting 3K space), indexes that can be restructured, unneeded text columns, etc.

        As I was doing this I’ve come across the fragmentation as well, and noticed quite a few indexes (I focused only on heavily used ones) are something like 99+% fragmented, which makes it seem like it’s not really in order AT ALL anymore.

        After reading about EXTERNAL fragmentation it sounded really bad – how you can’t do ‘read-aheads’ and how the disk needs to jump all over the place to read what’s logically contiguous – so I got a bit hung up on it.

        Another issue I noticed is low page life expectancy, and that can indeed be helped with solving the INTERNAL fragmentation to allow the pages to be more ‘full’.
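
        For reference, page life expectancy can be read straight from the DMVs (a sketch – on NUMA servers there are also per-node counters under Buffer Node):

        SELECT object_name, counter_name, cntr_value AS seconds
        FROM sys.dm_os_performance_counters
        WHERE counter_name = 'Page life expectancy'
          AND object_name LIKE '%Buffer Manager%';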

        I’ve actually just started putting a test together – creating a large table, purposely fragmenting it, and then comparing queries before and after an index re-org. So far the results are on your side, it actually does less physical reads BEFORE a reorganize which is really weird – still fine tuning that.

        Any insight would be greatly appreciated.

        Reply
        • You wrote:

          “The system has performance issues overall due to a lack of update and lack of maintenance”

          Let’s focus this: specifically, what are the symptoms that led you to this conclusion? What made you say, “Clearly, the problem is fragmentation.”

          Reply
          • Shalom Slavin
            June 30, 2015 1:19 pm

            Nothing – I was looking at things overall, bad code, bad table design, bad indexes, fragmentation is just one of the things I was looking at – I like to multi-task :).

          • Ok, great – when you’ve got a specific performance problem, let’s talk. Otherwise, quit bloating your log file, growing your maintenance window, and increasing the challenge of meeting your RPO and RTO.

          • Shalom Slavin
            June 30, 2015 1:31 pm

            –Reached the indentation level of replies so this may be out of order–

            Ok,

            I hear your point. I also watched your video about this subject and it’s pretty convincing. I’m going to switch gears and only focus on internal fragmentation for now (if any) and not pay as much attention to external fragmentation unless I see an issue.

            Thanks!

  • Big can of worms, but with 40 years of DBA experience on various platforms (including Ingres, Oracle, MS SQL and a few more on UNIX and Windoze), the 2 causes of slow query performance are the table design and the query strategies! People trying to use twisted SQL queries to help hide their poor application or data model design always hit performance issues. Defragmenting the tables and indices can help, but usually very marginally. My 2 cents.

    Reply
    • Angela Kelly
      April 6, 2016 3:16 pm

      I would have to agree with you on this. Poor design is usually the issue. What makes it worse is when you have vendor databases and applications that you cannot touch, but the blame gets put on the DBA and they are constantly saying it’s the database, when the database is fine. It’s the bad SQL and bad code that has been written. What does a Jr. DBA do when it is a vendor system? My current issue is, my boss asked me to rebuild the indices on a particular table, and when I ran a query to check the fragmentation it was actually very low, like 1-plus percent. Do I even bother rebuilding the indices?

      Reply
      • Just tell the boss the truth. Tell the boss about all the indexes only being 1+ percent fragmented (be prepared to prove that!) and suggest that it would be a waste of time, CPU, and disk resources. Instead, suggest that you’d like to do what REBUILDing the indexes does and REORGANIZEing does not… rebuild the stats that need it and see if that makes the necessary improvements. I’ll state again that it’s been working perfectly for me for months. It may not for you, but it’s absolutely worth the experiment.

        Reply
  • Great post! Thanks for this.
    Just a question, please: do you know if a restart of the database has any effect on fragmentation?
    I have a job which rebuilds or reorgs indexes depending on fragmentation, but I restart the DB just before this job – is there any impact on my job?
    Thanks
    Regards,
    Tarek

    Reply
  • Things that make me go hmmm:

    DBA: “Performance improved after we rebuilt the indexes for the first time last year. Since then we reorganize our indexes each day, but performance has gotten gradually slower.”
    Me: @Statistics?

    DBA: “We rebuild our indexes every day, then we update all statistics with fullscan”
    Me: POWER(2,@Statistics?)

    Reply
  • Hey there, Brent,

    First, it was really good to see you in person again at the Pittsburgh SQLSaturday back in October. Always a pleasure.

    Shifting gears to the subject at hand… when I first read this article and then watched your video on the same subject, I thought you were drinking bong water for all the reasons that many folks have talked about on this thread. I even tried an experiment. When I copy prod to dev, I delete some rather large audit tables and then shrink the DB from about 400GB down to about 150GB. Of course, that frags the dickens out of the database and indexes so I normally carefully rebuild them so as not to cause too much regrowth nor blow the log file out of the water. I didn’t do that once just to see what would happen and performance was absolutely horrible. I thought “Yep… Brent’s using two straws on a big bong”, rebuilt the indexes, and everything was wonderful again.

    We’ve been having some blocking issues on the production box. I originally thought it was because of all the new code and customers we’ve been putting on but then made a correlation. The worst blocking occurred on Mondays and Thursdays, particularly in the mornings, and then got better all by itself later on those days to the point of not happening. I also run PerfMon 24/7 on the prod box and noticed that as the days after Mondays and Thursdays progressed, disk IO would decrease, memory IO would decrease, and CPU usage would decrease from about 22% down to about 12% across all 16 cores. I checked and the same jobs and intensity of work on the floor was occurring and I didn’t originally understand what was happening. Breaking the vein in my forehead, it finally dawned on me that index rebuilds/reorgs were happening on Sunday and Wednesday nights.

    Just on a major SWAG, I disabled the index rebuild jobs and just left the stats rebuild jobs running and with some rather surprising results… CPU usage, memory IO, and disk IO have never been better and have remained low for going on two months.

    It would seem that once the initial round of pages splits (which caused all the blocking problems because no one actually did the FILL FACTORs on this 400GB database right) have had their way, the open spots seem to have become kind of a selective Fill Factor in the areas where a lower Fill Factor would be needed without having to waste free space on every bloody page. Oddly enough, once the initial round of splits occurred, the system is running better than it has in a couple of years and even “normal” blocking has decreased significantly. And it hasn’t caused unexpected growth in the tables, either, because the split pages get filled up eventually.

    I still think that bong water was involved for you to have made this realization but I, for one, have witnessed your claim first hand. It may not be for everyone but, for my “money maker” system, not doing index maintenance has made a truly unexpected improvement and has given me more “night hours” to get stuff done because I’m not doing the index maintenance any more. I’m going to start trying it on a few more databases and see what happens.

    For those that are curious, I have about 20 databases (I don’t actually count them because that’s like counting chips at the poker table ;-)) on the server. 3 of them are in the 400-450GB range and only a couple of them are in the < 100GB range. The larger databases are a combination of heavily hit OLTP and nightly/daily batch processes that each affect several million rows across multiple tables.

    Oh… and why did the fragmentation caused by a shrink cause such bad performance? I haven't done a deep dive on that (don't have the time just now) but I suspect it's because of the random nature of that type of fragmentation rather than the “targeted” (and seemingly beneficial) fragmentation of normal usage.

    Heh… I might be sipping water from the same bong as you but, remarkably and seemingly against all odds and in direct violation of supposed “Best Practices”, NOT rebuilding indexes on my production box seems (at this point… the experiment continues) to have all the benefits you’ve claimed, with the additional benefit of actually improving performance and decreasing blocking.

    To me, this is a crazy bit of proof that “Before you can think outside the box, you must first realize that… you’re in a box!”

    Well done, ol’ friend. I’ll let you know if anything goes haywire.

    Reply
    • Jeff – HAHAHA, great. I have this vision of you seeing me in a cafe, with me drinking bong water, and you saying, “I’ll have what he’s having.” That’s an awesome mental image.

      Reply
  • Wait, so I can use the fact that I haven’t reindexed TBs of DBs in years upon years as a POSITIVE factor in my annual performance review? I felt the same way – why reindex if you’re not getting any real net benefit but are adding workload and wear and tear? Being on shitty SAN hardware was how I came to that realization.

    Reply
  • @Ali,

    See the mistake you’re making? You should be reindexing 24/7 so you can break that grubby ol’ SAN of yours and get a new one. 😉 Then you can join Brent and me for a sip of some of the finest bong water there is. 😉

    Reply
  • I may be off base here, but couldn’t this logic apply to auto-growth of .mdf and .ldf files as well?

    Reply
  • Just a status update. I’ve not done any index maintenance (just stats updates) since the 18th of January 2016 and both the OLTP and heavily batch oriented databases have suffered no performance problems. In fact and as previously noted, performance continued to improve and finally “bottomed out” about a month ago. I’m truly and absolutely amazed.

    Reply
  • Hi Brent. First, thanks for your posts, great as always.

    I have a quick question, consider next paragraph: “If you’re using shared storage like EMC, NetApp, or Dell gear that shares drives between a bunch of different servers, then all of your drive access will be random anyway. Your hard drives are shared with other servers that are also making drive requests at the same time, so the drives will always be jumping all over the place to get data. Defragging your indexes is just meaningless busy work.”

    Would it be fair to say that REBUILD/REORGANIZE operations on large tables would help us reduce internal fragmentation, so the index would need less space and therefore be more likely to be cached in RAM? That way we would take more advantage of our memory.

    Thanks!

    Reply
  • Hi Brent,

    I am a SQL DBA, but am attempting to learn Oracle as required for a new role. As part of the transition I looked into index maintenance. Microsoft recommends that indexes are routinely rebuilt, but Oracle does not. As to why, I found out that Oracle does single-block reads, whereas SQL first looks at the index fragmentation and then determines whether it can read ahead in multiple blocks. This is only beneficial for long range scans, but it is definitely worth consideration.

    In addition I was going to mention internal fragmentation but it looks like the above poster covered this one off!

    Reply
    • Harrison – sure, now test it. In my live 4-day class, we actually run a test to show the performance overhead of externally fragmented indexes. (Hint: it doesn’t matter.)

      Reply
      • Just a note of confirmation: I still haven’t rebuilt indexes since 17 Jan 2016 (over a year ago). Thanks to a clone of the system, I now have a system where I can rebuild all of the indexes and see what the differences are, especially when it comes to what I refer to as a “natural Fill Factor”. For others that don’t know, I had continuous gradual performance increases over the first 3 months of not rebuilding indexes, and I’ve never had the “blocking after rebuilding” problems that we used to suffer. My log file and reserved empty space on the databases look a hell of a lot better, as well.

        Reply
    • @Harrison,

      When executing a query, cached or not yet cached, SQL Server doesn’t look at fragmentation at all. It only looks at stats. Busy SANs operate quite randomly, so having large volumes of contiguous data won’t actually help on large batch jobs, as we found out on our 1TB ETL database where each batch affects hundreds of thousands of new and existing rows.
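
      To illustrate that point, here’s a hedged sketch of what the optimizer actually consults: the statistics objects on a table, which you can inspect and refresh directly (dbo.BigTable is a placeholder name):

      -- What the optimizer reads: statistics, not the fragmentation DMVs.
      SELECT s.name                  AS stats_name,
             sp.last_updated,
             sp.rows,
             sp.rows_sampled,
             sp.modification_counter -- rows changed since the last update
      FROM sys.stats AS s
      CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
      WHERE s.object_id = OBJECT_ID('dbo.BigTable');

      -- Refreshing the stats (not rebuilding the index) is what changes the
      -- optimizer's view of the data:
      UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;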

      Reply
  • Hi Brent. Yesterday I had a very weird situation on a SQL Server 2008 instance: SQL started locking pretty much everything that was running on one single DB. Stopping the web app that accesses this DB and resetting the DB fixed it, but as soon as the app was placed online, all stored procs immediately started locking (and according to the SQL monitoring, the code or SPs causing the locks kept changing, so it couldn’t be pinpointed to a single piece of code). After a while of scratching our heads, we checked fragmentation (we have a script that runs defragmentation nightly, but only for tables that are larger than 1,000 pages), and at least a couple of tables were very fragmented (one over 99%). One was larger than 1,000 pages (and I’m still trying to figure out why the automated defrag job missed it) and the other was around 500 pages, so we forced an index rebuild on those two, and after that everything started running smoothly again. As far as you know, could it be that even if a table is smaller than 1,000 pages, getting fragmented over 99% could throw SQL Server out of whack? (A sketch of that kind of page-count filter is below.)
    Thanks in advance for your time and all your great articles!
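
    A sketch of the kind of page-count filter such nightly jobs typically use (the thresholds and mode here are assumptions, not the actual script described above), which shows why a ~500-page table never even gets considered:

    -- Hypothetical version of a nightly defrag filter: only indexes over
    -- 1,000 pages are looked at, so a ~500-page table is skipped by design.
    SELECT OBJECT_NAME(ips.object_id) AS table_name,
           i.name                     AS index_name,
           ips.page_count,
           ips.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
      ON i.object_id = ips.object_id
     AND i.index_id  = ips.index_id
    WHERE ips.page_count > 1000                    -- the threshold in question
      AND ips.avg_fragmentation_in_percent >= 30;  -- a typical "rebuild" cutoff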

    Reply
  • German Porta
    March 29, 2017 5:44 pm

    What took you so long to reply? Almost a full 10 minutes 🙂
    Thanks a ton Brent!
    German

    Reply
  • Pat Down marries Phil McCann – I missed those puns first time around.

    Reply
  • Pradeep Korukonda
    August 16, 2017 1:56 pm

    Hi Brent, does SQL Server rebuild the indexes on a table after every delete statement execution? Please clarify. I have a requirement to delete 6 million records (out of 25-30 million) from a table that has foreign key references. Is there any way to complete the delete in 1 hour (using SQL Server Standard Edition)? Please advise.
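
    (SQL Server does not rebuild indexes after a DELETE; it just leaves empty space behind on the pages.) One common pattern for that kind of purge is deleting in small batches so each transaction stays short. This is only a hedged sketch: the table name, filter column, and batch size are placeholders, and the foreign key references still have to be handled, for example by deleting the child rows first or using cascading deletes.

    -- Hypothetical batched purge: short transactions keep the log and blocking manageable.
    DECLARE @rows int = 1;

    WHILE @rows > 0
    BEGIN
        DELETE TOP (4000) FROM dbo.BigTable
        WHERE PurgeDate < '2017-01-01';

        SET @rows = @@ROWCOUNT;
    END;

    Whether that finishes inside an hour depends mostly on the indexes supporting the WHERE clause and the foreign key checks, not on the edition.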

    Reply
  • It’s been 2 years and 2 months (last done on 16 Jan 2016) since I’ve rebuilt indexes on my production boxes, and the same great performance results simply continue as before. There are some indexes on tables that are inserted to and then updated with larger data in some of the variable-width columns, and I am setting up to rebuild those for space recovery, but they’re generally very few and far between.

    I’m also writing an article on all of this. Since you can’t afford to wait two years on your own boxes and need some serious evidence that what Brent talked about really does work, it’ll include the tests I’ve been running, which simulate adding 1,000 rows per hour for 10 hours per day for 365 days (3.65 million rows total). I measure fragmentation, page count, and two different types of performance at the end of each simulated hour, and the plots from that data clearly show why defragging the wrong way will actually cause major blocking the next day, and the right way to defrag if you must. (A rough sketch of that kind of hourly measurement is at the end of this comment.)

    Brent, thank you again for the bong water… it’s much better than the purple KoolAid that MS has recommended in BOL. 😉
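
    This isn’t the actual test harness described above, just the general shape one simulated “hour” of that kind of experiment might take (dbo.TestTable and dbo.HourlyStats are invented names, and dbo.TestTable is assumed to have a clustered index):

    -- Simulate one "hour" of inserts, then capture fragmentation and page counts.
    -- Both table names are placeholders, not the real harness.
    INSERT INTO dbo.TestTable (SomeValue)
    SELECT TOP (1000) NEWID()
    FROM sys.all_objects;

    INSERT INTO dbo.HourlyStats
        (captured_at, page_count, avg_fragmentation_in_percent, avg_page_space_used_in_percent)
    SELECT SYSDATETIME(),
           ips.page_count,
           ips.avg_fragmentation_in_percent,
           ips.avg_page_space_used_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.TestTable'), 1, NULL, 'SAMPLED') AS ips;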

    Reply
  • What are your thoughts on whether fragmentation can contribute to deadlocks? The respondents on this post https://www.sqlservercentral.com/Forums/Topic1375490-1550-1.aspx give a definite ‘no’, but I’m not convinced. To me, index fragmentation can equal slower reads. Slower reads = longer-running transactions = increased risk of a deadlock (without being the explicit cause).

    Reply
    • Shaun – why not run experiments to test your hypothesis, then?

      If you don’t have the time to do that, start here: https://groupby.org/conference-session-abstracts/why-defragmenting-your-indexes-isnt-helping/

      Reply
    • While I do agree that the faster code runs, the less chance there is for deadlocks, every deadlock problem I’ve ever solved was ultimately fixed by changing some code, even if it was only subtly performance-challenged.

      Shifting gears a bit, I made a “Hail Mary!” decision on Monday, 18 Jan 2016 to stop rebuilding any indexes based on Brent’s observations. My main system was suffering from major blocking (didn’t check for deadlocks… I think I still have the logs and will go back and check) every Monday morning and then performance would get better for the remainder of the week. To make a much longer story shorter, I checked everything (it was a 3 month problem that I spent a huge amount of time on) and the only difference between Monday and the rest of the week was the fact that I was doing index maintenance on Sunday nights. I turned off all index maintenance, average CPU usage across 36 CPUs dropped from 22% to 8% over a period of 3 months, and has stayed there even though the server is under 50% more load than it was back then.

      Yeah… I agree… that’s anecdotal evidence from one system that won’t stand up in court. That’s why I’ve been doing repeatable, fully demonstrable tests for the last 5 months: to prove that it was the index maintenance that was causing the problem, and to show what the effects of not doing any index maintenance actually are. To my surprise, I’ve proven that the supposed “Best Practice” methods (no defrag ’til 10%, reorg at 10 to 30%, and rebuild >= 30%) actually cause more harm than good, because everyone is doing defrags the wrong way.

      As a teaser for the presentation/article that I’m writing about all of that, please see the following post…

      @Brent… you’ll like this teaser… 😉
      https://www.sqlservercentral.com/Forums/FindPost1928856.aspx

      I do agree that “fragmentation” can slow queries down, but NOT the kind most people consider. I’ve not completed all of my testing yet, but preliminary results seem to confirm that the average % of fragmentation means very little for performance, even on a single-user machine. The only thing that can slow down performance is the “natural fill factor” formed over time, as represented by the average % of page fullness. If every page is only 50% full, then it will take about 2X longer for a given query to run IF it has to read more than one page (it still doesn’t matter if only one row is read at a time, because SQL Server still has to do a logical read for that one row, and a logical read is one page). With today’s memory and bus speeds, 2 times slower usually isn’t even detectable unless you’re reading tens of thousands of pages or more. (A small repro of that arithmetic is sketched at the end of this comment.)

      The bottom line here is, no… fragmentation is not the cause of deadlocks. Bad code or highly contentious usage is the only cause of deadlocks.
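
      A small, self-contained repro of that page-fullness arithmetic (the table name and row count are arbitrary, and the exact read counts will vary):

      -- The same data at 100% vs. 50% page fullness takes roughly twice as many
      -- logical reads to scan.
      CREATE TABLE dbo.FillFactorDemo
      (
          id  int IDENTITY(1,1) PRIMARY KEY CLUSTERED,
          pad char(400) NOT NULL
      );

      INSERT INTO dbo.FillFactorDemo (pad)
      SELECT TOP (100000) 'x'
      FROM sys.all_objects AS a
      CROSS JOIN sys.all_objects AS b;

      SET STATISTICS IO ON;

      ALTER INDEX ALL ON dbo.FillFactorDemo REBUILD WITH (FILLFACTOR = 100);
      SELECT COUNT(*) FROM dbo.FillFactorDemo;   -- note the logical reads

      ALTER INDEX ALL ON dbo.FillFactorDemo REBUILD WITH (FILLFACTOR = 50);
      SELECT COUNT(*) FROM dbo.FillFactorDemo;   -- roughly double the logical reads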

      Reply
  • @Brent,

    Just a follow-up… I’ve found the smoking gun on the performance problems (the ones I ran into way back in January 2016) caused by index maintenance. Bloody amazing. For those “addicted” to index maintenance, I’ve also found a way to defrag indexes and suffer nearly zero page splits. That also means I’ve determined that most folks following the “Best Practice” recommendations are suffering performance problems they might not even be aware of. I’m doing the first presentation of what I found, in a very “Alice’s Restaurant” fashion, for my local PASS user group on Thursday, May 10th, 2018. I’m really looking forward to coming out to your neck of the woods for your local groups after that.

    Thanks again for your original article on this subject. It led to some serious eye opening. If nothing else, it got me out of hot water in January 2016 and has continued to work incredibly well after all this time.

    Reply
  • George Kochkin
    July 11, 2018 10:22 am

    Hey, Brent.
    What is a bit unclear is that you recommend using Ola Hallengren’s solution AND you also say that not-full pages can lead to performance degradation. BUT Ola’s scripts check only avg_fragmentation_in_percent in sys.dm_db_index_physical_stats, which reflects only logical (external) fragmentation and leaves out the internal kind.
    The point is: do we have to look at the avg_page_space_used_in_percent column to save disk and RAM space?

    Reply
    • Reply
      • George Kochkin
        July 12, 2018 6:47 am

        Thanks a lot.
        Most stuff became clear, I have not seen this video.
        I’ve got some points from there:
        1. Put stuff into RAM; external fragmentation doesn’t matter there
        2. Set the fill factor back to 100% so it’s NOT the cause of internal fragmentation
        But page splits will leave some pages pretty much empty, so to get the pages as full as possible I should look at avg_page_space_used_in_percent and rebuild indexes anyway, right?

        Reply
        • George – I totally understand that you’re very focused on this sentence:

          “But page splits will leave some pages pretty much empty, so to get the pages as full as possible I should look at avg_page_space_used_in_percent and rebuild indexes anyway, right?”

          And the best advice I can give you is not to take anyone else’s word for things. Run experiments. Find things out for yourself. Test to see what kind of difference it makes in your queries before and after.

          Reply
          • George Kochkin
            July 12, 2018 9:47 am

            Brent, simple tests showed me fewer logical reads and less CPU during an index scan. Well, that was predictable.
            Indeed, I was hoping you’d tell me another reason not to track page space used…
            Is there one?

          • George – sure, if your queries are scanning an entire index, then, uh…yeah, they’ll go faster if there’s less free space.

            You might want to attend our Mastering Index Tuning or Mastering Query Tuning classes though. If your queries are scanning an entire index, you have much bigger performance gain opportunities than defragmenting your tables. 😉

  • Alastair Farrugia
    August 20, 2018 11:46 am

    Hi Brent,

    Suppose we have a database that is on SSDs (all the mdf and ldf files are on SSDs).
    What sort of performance improvement could there be if we cache the database entirely in memory? Could the speed double, for example?
    (I know that you said caching is definitely better than a tiered storage system that includes SSDs, but I’m asking about a database that is completely on SSDs.)

    And what if the database is stored on a RAM disk? I imagine that caching the database can’t improve the speed much in that case – is that correct?

    Thanks
    Alastair

    Reply
    • Alastair – why not run experiments and find out for yourself?

      Reply
    • I’ve tested the RAM disk thing. It doesn’t do a thing to improve performance. In fact, if you think about it, you’re stealing memory from SQL Server to create a RAM disk. The only time a RAM disk helps is when you have “Read Aheads” but, again, you’ve stolen memory from SQL Server to do it. Once the data is in memory, logical fragmentation doesn’t matter at all for performance. Physical fragmentation (low page density) can affect performance two ways… 1) you have wasted free space in memory that won’t be used, which may knock other tables’ pages out of memory (a query for eyeballing that wasted space is sketched at the end of this comment), and 2) you have to read more pages from memory to get the same job done. On that last part, though, you have to have a whole lot of pages with very low page density to cause any kind of significant performance problem.

      SSDs help a bit when it comes to reading from “disk” into memory, but they only really help with the speed of doing Read Aheads. The best I’ve seen for some of our batch-intensive procs is twice the performance, but it’s actually quite rare because stuff is usually in memory for that type of thing.

      The real key here is that, with all this trickery with RAM disks and hardware, the best you can do is usually only a 2X improvement, and then only in about 10% of cases, and things like RAM disks can actually cause performance problems, as previously stated. That’s because hardware is hardware, and that’s not normally the primary key to performance.

      If you want blinding speed, then do a good database design along with proper indexes and then write good code. Code is where the true performance is, and I’ve found that, on most systems, you can pretty easily achieve 60 to 1 MILLION times (not a mistake… that’s my record) the performance by making sometimes insanely simple changes to the code. And it’s not really dependent on defragging indexes, either… I should know… I’ve not defragged indexes on my production box since Monday, the 18th of January, 2016 (more than two years now). Performance actually got better in the first 3 months after that, because the current “Best Practice” recommendations for doing defrags are actually quite incorrect and lead to the perpetuation of “bad” page splits, which forms a vicious circle of needing to defrag more often, which causes more bad page splits… wash, rinse, repeat.

      If you want good performance, you need to make sure the code is good. Without that, all the tricks and fancy hardware in the world won’t help even if you spend a bazillion bucks on MPP appliances.
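
      One hedged way to put a number on that “wasted free space in memory” point: the buffer pool DMV tracks how much of each cached 8 KB page is empty, so you can total it per database.

      -- Roughly how much of the buffer pool is caching empty space on pages.
      SELECT DB_NAME(bd.database_id)                               AS database_name,
             COUNT_BIG(*) * 8 / 1024                               AS cached_mb,
             SUM(CAST(bd.free_space_in_bytes AS bigint)) / 1048576 AS empty_space_mb
      FROM sys.dm_os_buffer_descriptors AS bd
      GROUP BY bd.database_id
      ORDER BY cached_mb DESC;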

      Reply
  • Alastair Farrugia
    August 21, 2018 3:11 pm

    Jeff – thanks for the detailed post. It’s certainly changed my perception.

    We are using an ERP called Dynamics NAV, and NAV then talks to SQL.
    99% of the code is standard NAV, and has been tested by thousands of users around the world.
    That doesn’t mean that the code is perfect, and we do occasionally make changes and test them, but there is also a significant chance of breaking something and only finding out months later when some rarely-used functionality is executed.
    But when one particular report is much slower than other reports, it’s likely that the code or the design of that report is at fault.

    Reply
    • Alastair,

      Hmmm… correct me if I’m wrong, please. Doesn’t Dynamics NAV use GUIDs for all primary keys? Lemme know because, if it does, I’ve got an awesome solution for how to manage your indexes that will literally prevent any kind of page split for up to a couple of months (depending on the INSERT rate and the width of each table). While not doing any index maintenance is quite effective (and I’ve proven that over the last 2-1/2 years), it still allows page splits and still wastes RAM due to physical fragmentation. Like I said, GUIDs are SO random that they actually lend themselves very well to a stupid-simple but very effective index maintenance method. Because the indexes never get any significant fragmentation problems, you’ll never have performance issues due to “Read Aheads”.

      Reply
  • Alastair Farrugia
    August 23, 2018 7:45 am

    Hi Jeff,
    Unfortunately no, there are very few NAV tables that use GUIDs as primary keys.
    There is some discussion of that in, for example, https://www.archerpoint.com/blog/Posts/dynamics-nav-blob-fields-and-guids-as-primary-clustered-key

    But it’s good to know that you have a good approach for tables with GUID primary keys.

    Reply
    • Thanks. And, BTW, I agree… when everything is pretty zippy and one report isn’t, there IS a pretty good chance that it’s the code and not much else that’s the problem, as you said. There may be an exception now and then but it’s almost always something wrong in the code.

      Reply
  • I’ve found an interesting twist on all of this. We run a pretty large web storefront application and all the supporting back-office tables on a SQL 2016 SP1 EE server with 168GB of RAM and one of those shiny new PureStorage SSD arrays for all the disk. There are all sorts of different workloads on this server, including an approximately 1TB DW that doesn’t get cached much at all, and some pretty active OLTP workloads that are mostly cached in memory, including a replicated copy of a 100GB Dynamics AX ERP database, from which the data gets loaded via change tracking into the tables that support the storefront application. While most of the indexes are not affected one way or another by the fragmentation level, as one would likely expect when using the SSD SAN and a decent-sized buffer cache, it turns out that weekly rebuilds of some indexes that grow steadily are actually quite important, as just updating the statistics does not change the query plans, and that causes terrible performance with anything that is CPU bound, which, it turns out, is pretty much everything with IOPS this high. Even with 8 x E5-2698 v3 @ 2.3GHz, the CPU typically hovers around 85% total utilization. All of this is to say that being CPU bound may change things a bit, because the regular index rebuilds aren’t being done to help with disk I/O, but rather to ensure the correct query plans are being used by the optimizer.

    By the way, this is the first time I recall being CPU bound on a SQL Server since around 1998, running SQL 6.5 EE on NT 4 EE on one of those big black IBM Netfinity 7000 servers with Pentium Pro CPUs.

    Reply
  • Great post! This held true in one of our clients’ environments where they had indexes rebuilt on a daily basis. Definitely not needed.

    Reply
  • DirkAndTheMac
    June 11, 2019 3:59 am

    Brent, your kudos just went up a thousand percent in my book. Curious to know… can you remember what band/song you broke your nose to? Die-hard Kennedys fan… never broke my nose to them, though ;-)

    Reply
  • Hi, your videos are brilliant.

    If we have SSD/SAN storage and the pages are stored randomly,

    do we still need to reorganise indexes? Will reorganising indexes give any major benefit?

    From Edmond

    Reply
  • https://www.dynamicssquare.co.uk/
    November 10, 2021 11:33 am

    Great post (as always) … I’m curious. What search terms did you use to discover those (creative commons licensed) photos? That first one is so so perfect for this topic.

    Reply
