Tag Archive: indexing

Seeks, Scans, and Statistics in the Grocery Store

Building an execution plan is a lot like going shopping.  Before I leave the house, I need to think about a few things:

  • How many items do I need? I take a glance at my grocery list to see roughly how many things I’m shopping for.
  • What stores am I going to visit? I can’t get everything from the same store, so I make a list of where I’m going to source everything.
  • Am I in a big store or a small store? The bigger the store, the more likely I’m going to pick up more items that I might not have had on my list.  My guess of what I need might change if I see cool things I’d like to pick up.  The store also might not have everything that I’d like to find.

Armed with this background information, it’s time to make my plan of attack:

  • Which store should I visit first? The order is determined by what I’m shopping for.  I want to visit my butcher or fishmonger first to see what’s fresh and what’s on sale, and then those purchases determine what I get from the grocery store and wine store.
  • In each store, do I need a push cart, a basket, or just my hands? This is influenced by how many items I think I’m going to take to the checkout stand.
  • In each store, will I go to specific aisles for what’s on my list, or just hit all the aisles? If I only need a couple of things, I’ll go directly to those aisles, get ‘em, and walk out.  (When I get rich, I’ll pay the minimart back.)  If I have a huge list, then it makes more sense to start at one side of the store and walk through every aisle, crossing items off my list as I pick ‘em up.

Why, that’s just like SQL Server’s query processor!

How SQL Server Builds Execution Plans

The query processor looks at your query, reviews the table statistics, and tries to quickly choose an efficient execution plan – just like you choose which stores to visit in what order, and then what aisles to hit inside the store.  The first order of business is checking out what stores tables are involved, and what we need from each one.  Let’s take a simple AdventureWorks query joining two tables:

SELECT  *
FROM    Sales.SalesOrderHeader h
        INNER JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
WHERE   h.DueDate >= '5/15/2011'
        AND d.OrderQty > 10
Well, it was good, but I dunno about amazing.

Found at Whole Foods

We’re joining between SalesOrderHeader and SalesOrderDetail, which have a parent/child relationship: rows in SalesOrderHeader can have multiple rows in SalesOrderDetail table.  We’re filtering on both tables – we’re checking for the header’s DueDate field and the details OrderQuantity field.  Which one should we check first?  We’ve got two options:

  1. Check SalesOrderHeader to look up all rows with DueDate >= 5/15/2011, and then look up their matching SalesOrderDetail records to examine their OrderQty, or
  2. Check SalesOrderDetail to look up all rows with OrderQty > 10, and then look up their matching SalesOrderHeader records to examine their DueDate

Let’s check to see what SQL Server chose in one situation.  When reading execution plans, we read from the top right to determine what happens first, and then go left.  While this may not seem intuitive at first, keep in mind that it’s designed to increase the job security of the database administrator.  You can thank Microsoft later.

Well, not totally simple.

Simple Execution Plan

In my sample data, it chose to check SalesOrderHeader first – that’s the top right operator in the execution plan.  SQL Server determined that this would be the most effective method because neither table had an index on the fields I needed.  Either SQL Server was going to have to scan the entire SalesOrderHeader table to check each record’s DueDate or it would need to scan the entire SalesOrderDetail table to check each line item’s OrderQty.  In this example, SalesOrderHeader was the smaller table, so it made more sense to scan that one.

Mo Tables, Mo Problems

I thought "Vegetarian Feast" meant what happens when we eat the vegetarians.

The Other Other White Meat

When I go to the fishmonger and discover crawfish is finally in season, I have to change the rest of my shopping plans.  I need to go to the wine store because I don’t keep the right wine pairing for mud bugs.  Me being a flexible guy, I can easily rework my entire shopping plan if crawfish happens to be in season, but SQL Server doesn’t have that kind of on-the-fly flexibility.  It has to build a query execution plan, and then rigidly adhere to that plan – no matter what it finds in at the fishmonger.

The more tables we add into our query, the more SQL Server has to make guesstimates.  When we daisy-chain lots of tables together, SQL Server estimates how many rows it’s going to match in each table, pick the right orders, and reserve memory to do its work.  One bad estimate early on in the query execution plan can turn the whole plan sour.

In performance tuning, sometimes I help SQL Server by breaking a ginormous shopping trip into two: I separate a large query into two separate queries.  I don’t mean nested SELECT statements in a maelstrom of parenthesis, either – I build a temp table, populate it with the first query’s results, and then join that temp table to another set of results in a second query.  Here’s an overly simplified example:

CREATE TABLE #CreditCards (CreditCardID INT)
INSERT INTO #CreditCards
        ( CreditCardID )
SELECT CreditCardID
  FROM Sales.CreditCard
  WHERE Hacked = 1

SELECT  *
FROM    #CreditCards cc
		INNER JOIN Sales.SalesOrderHeader h ON cc.CreditCardID = h.CreditCardID
        INNER JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
WHERE   h.DueDate >= '5/15/2011'
        AND d.OrderQty > 100

Sometimes by breaking up our shopping trip, SQL Server is able to build a better execution plan for the second query.  This is especially effective when I’m dealing with a very large number of tables, yet there’s just a couple of key tables in the query that determine whether we show any results or not.  I don’t use this technique proactively – I only try it when I’m consistently running into problems with execution plans that would be better suited separately.  Alternate techniques could include index hints or query plan guides.

How SQL Server Chooses An Aisle Seek or a Store Scan

How Erika wooed me

Man Bait - Bacon Lollipops

Once we’re inside a particular store (table), we’re faced with another choice: do we jump directly around to just a couple of aisles to find exactly what we want, or do we walk up and down every aisle of the store, picking things up as we go past?  SQL Server makes this decision based on statistics about the table (as well as other tables in the query).  If it believes it only has to pick up a few items, it’ll choose to do a seek (possibly combined with row lookups).  If the number of things we need is greater than the tipping point, then SQL Server does a scan instead.

While the word “seek” has a reputation for performance amongst database administrators, don’t get sidetracked trying to force every query execution plan to use all seeks.  Sometimes it’s just more efficient to do a scan – like when we need to pick up several things from every aisle, or every page in the table.  The better question is to ask whether we really need something from every aisle.  (Erika asks me this question every time I walk into Frys – my credit card can’t handle table scans in there.)

The decision to do a seek or a scan is influenced by our shopping list.  Sometimes I can’t exactly make sense of what’s on my list.  When Erika wrote down “Frantoia,” that didn’t make sense to me, so I was forced to go up and down the grocery store aisles looking for a product by that name.  If she would have said, “WHERE Brand = Frantoia AND ProductCategory = Olive Oil”, then I’d have been able to use my trusty index to seek directly to the olive oil, but I didn’t have that option.  Instead, I had to do a scan – there wasn’t enough information in the query for me to use an available index, and I was too lazy to ask for directions.

In addition to searching by fields that aren’t indexed, other things in our query can force SQL Server to do a scan instead of a seek:

Using functions in the WHERE clause – if Erika tells me to buy everything WHERE PriceToday <= (50% * RegularPrice), I’m going to have to walk through the entire store looking at every single price tag.  I can’t ask the store for a map of items that are half-off or more, because they simply won’t have a map like that.  Likewise, if my SQL Server query gets fancy with the functions, it won’t be sargable.

Bad or out-of-date statistics – SQL Server uses statistics to gauge the contents of the grocery store as a whole.  If it believes that the store is like a typical grocery store with only a small section of olive oils, then it’ll jump directly to the olive oil area if we say WHERE ProductCategory = Olive Oil.  However, if this store just came under new ownership, and the new owners decided to carry a stunning array of olive oil, SQL Server might be overwhelmed by the amount of data that comes back.

Why I Hate “ORDER BY”

We believe in division of labor in the Ozar house.  When I get home from shopping, I deliver the bags into the dining room table.  From there, it’s Erika’s job to put things away in the various cupboards and closets.  I know that I could make Erika’s job easier if I sorted items as I loaded them into the car, but that just doesn’t make sense.  It’s not efficient for me to stop what I’m doing in the middle of the shopping trip, move things around between bags, and get them in exactly the right order for efficient cabinet loading.  There’s plenty of space for efficient sorting when I get home.

Likewise, I don’t want my SQL Server sorting data.  The more processing that I can offload to web servers, application servers, and thick clients, the better.  SQL Server Enterprise Edition is $30k/CPU, but your web/app servers are likely a great deal cheaper, and you’ve probably got much better methods of scaling out your app server tier.  We just don’t have a good way of scaling out SQL Server queries until Denali AlwaysOn hits.

More Reading on SQL Server Seeks, Scans, and Statistics

To learn more about the topics in today’s post, check out:

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Sargability: Why %string% Is Slow

People love to search.

Google has us addicted to fast, easy search functions.  Users expect every application to have a built-in blazing-fast search functionality.  To pull that off, developers build search queries that let users enter a string, and we ask SQL Server to find matches.  For example, say our users need to find some nuts.  We take their string, put percent signs on either side, and then search for that field in the product names:

SELECT *
FROM AdventureWorks.Production.Product
WHERE [Name] LIKE '%nut%'

This works great on our machine – but when we scale it up to hundreds or thousands of simultaneous users, or if we ramp up the number of products, our response time sucks.  Let’s find out why by checking the execution plan.  In SQL Server Management Studio, click Query, Display Estimated Execution Plan, and we get:

Query Execution Plan

Query Execution Plan

There’s only one operation, which would sound like a good thing, but that one operation is a clustered index scan.  That means SQL Server has to read every row out of the Product table, check to see whether it’s got “nut” anywhere in the name, and then return our results.

But wait – there’s an index on Production.Product for the Name field, and SQL Server will use that index if we have a slightly different query:

Less Wildcards, Same Nuts

Less Wildcards, Same Nuts

If we take the leading % wildcard out, notice that this time it does two operations.  First, it does an index seek against our Name index, then it looks up the corresponding rows in the base table (Product).  Two operations might sound more expensive, but if we looked at the total cost for each query, the second one would be faster in situations where we’re only pulling a small number of search results back.  If this table was for a company called Nothin’ But Nuts, on the other hand, we would probably still need to scan the entire table, but that’s a discussion for another day.

So why doesn’t SQL Server use the index for the %nut% query?  Pretend for a moment that you held a phone book in your hand, and I asked you to find everyone whose last name contains the letters HAM.  You would have to scan every single page in the phone book, because the results would include things like:

  • Beckham
  • Chambers
  • Hamilton
  • Thames

If, on the other hand, I asked you to find everyone whose last name began with the letters HAM, you could turn straight to the H page in the phone book, scan through a few lines, and quickly get the information you need.

When I asked you for everyone beginning with the letters HAM, my query was sargable.  When I asked you for all last names containing HAM anywhere in the name, my query was not sargable – meaning, you couldn’t leverage the indexes to do an index seek.  (Yes, sargable is sort of a real word – it stems from the phrase Search Arguments.)

Since we can’t talk users out of using search queries, and since non-sargable queries don’t scale, we need to find a better way to search for text.  That’s where SQL Server’s Full Text Search comes in.  Unfortunately, your queries must change – normally when you add an index, your queries just magically become faster.  Full text indexes don’t work that way.  You have to tweak your queries to use operators like CONTAINS instead of LIKE.  If you’re dealing with an old boat anchor of a database and the developers are long gone, you might be out of luck.

%String% is just one thing that will cause SQL Server to do slower scans instead of high-performing index seeks.  To learn more about sargability, check out:

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

SQL Server Index Tuning Tip: Identify Overlaps

Performance Tuning 101 - Add More Spoilers

Performance Tuning 101 - Add More Spoilers

If you’ve got performance troubles with an application that stores data in SQL Server, and especially if it’s a home-grown application (not a store-bought app), you can get dramatic performance improvements simply by focusing on some basic indexing techniques.  These tips and tricks pay off more than pouring money into hardware that might look good sitting in the datacenter, but doesn’t really make the application significantly faster.

When I go into a shop to speed up an application I’ve never seen before, two of my favorite quick-hits are from the index performance tuning queries from SQLServerPedia:

  • Find unused indexes – these are indexes the SQL Server engine says it’s not using.  Unused indexes incur a speed penalty because SQL Server still has to add/update the indexes as records change, so they make writes slower.
  • Find missing indexes – these are indexes SQL Server wishes it had available.

I’m not going to cover those in detail this week because I’ve already recorded tutorial videos over at SQLServerPedia for those, but I do want to focus on something these queries won’t pick up.  Sometimes a table has two nearly-identical indexes, and they’re both being used for reads.  Take these two:

CREATE NONCLUSTERED INDEX [IX_RunID_SiteID_DataSource_OutputType_PeriodType] ON [dbo].[MyTable]
([RunID] ASC,
[SiteID] ASC,
[DataSource] ASC,
[OutputType] ASC,
[PeriodType] ASC
)
 
CREATE NONCLUSTERED INDEX [IX_RunID_SiteID_DataSource_OutputType_PeriodType_QuotaItemDriverID] ON [dbo].[MyTable]
([RunID] ASC,
[SiteID] ASC,
[DataSource] ASC,
[OutputType] ASC,
[PeriodType] ASC,
[QuotaItemDriverID] ASC,
)
Performance Tuning 201 - Even Wood 2x4s Can Be Spoilers

Performance Tuning 201 - Even Wood 2x4s Can Be Spoilers

They’re two different indexes, and they’re both getting used – but does that mean we need them both?

They’re very nearly identical – but the second index has one extra field.  When the SQL Server engine gets a query that needs RunID, SiteID, DataSource, OutputType, and PeriodType – but not QuotaItemDriverID – then it will use the first index.  When it gets a query that needs all six fields, then it’ll use the second index.

In cases like this, I prefer to drop that first index and let the slightly bigger index pick up the slack.  Reading a slightly larger index will take slightly more time: if a query didn’t need that QuotaItemDriverID field, it still has to pull it off the disk in order to perform the query.  However, dropping the index pays off during inserts/updates/deletes, because it’s one less index SQL Server has to manage.  It also makes the database smaller, thereby making database maintenance tasks smaller/faster.

If:

  • I have two indexes with the exact same fields in the same order, but
  • One has 1-2 extra fields, and
  • There aren’t include fields, or the include fields are the same

Then I’ll drop the shorter index with extreme prejudice.

When Indexes Have Include Fields

If they have “include” fields, then I’ll merge the include fields between the two indexes to make one index to serve both needs.  Say we have these two indexes:

CREATE NONCLUSTERED INDEX [IX_RunID_SiteID_DataSource_OutputType_PeriodType_Includes] ON [dbo].[MyTable]
([RunID] ASC,
[SiteID] ASC,
[DataSource] ASC,
[OutputType] ASC,
[PeriodType] ASC
) INCLUDE ( [YTDRevenue], [MTDRevenue] )
 
CREATE NONCLUSTERED INDEX [IX_RunID_SiteID_DataSource_OutputType_PeriodType_QuotaItemDriverID_Includes] ON [dbo].[MyTable]([RunID] ASC,
[SiteID] ASC,
[DataSource] ASC,
[OutputType] ASC,
[PeriodType] ASC,
[QuotaItemDriverID] ASC,
) INCLUDE ( [SalespersonID], [MTDRevenue] )

The first index includes the YTDRevenue field, but the second index doesn’t.  If I just drop the first index, then queries that needed that field won’t get the full speed benefits from the second index.  To merge the two indexes, I need to drop both indexes and recreate the second one with the YTDRevenue field included, like this:

CREATE NONCLUSTERED INDEX [IX_RunID_SiteID_DataSource_OutputType_PeriodType_QuotaItemDriverID_Includes] ON [dbo].[MyTable]([RunID] ASC,
[SiteID] ASC,
[DataSource] ASC,
[OutputType] ASC,
[PeriodType] ASC,
[QuotaItemDriverID] ASC,
) INCLUDE ( [SalespersonID], [MTDRevenue], [YTDRevenue] )

In this example, I tacked the YTDRevenue field on to the end of the include field list.  The order of the included fields doesn’t matter, since SQL Server doesn’t sort by those.

Performance Tuning 301 - Beauty Just Adds Weight

Performance Tuning 301 - Beauty Just Adds Weight

Things to Watch Out For

In my examples, I kept things simple by omitting all of the extra indexing options like partitioning and sorting in TempDB.  When doing index tuning in real life, though, you’ll want to check those options to make sure they’re consistent from index to index.

Field order matters in indexes; if two indexes have the same fields but in different order, that doesn’t mean you can drop one of them.

Ideally, after making index changes, we would restart the SQL Server instance to reset the DMV counters that monitor index use.  In reality, though, that’s not so easy to pull off, so we need to log our changes to understand what the changes have been.  After making index changes, log the changes somewhere. I keep the output of the index performance tuning DMV queries in Excel spreadsheets because it’s easier to email those back and forth from machine to machine, especially when I’m consulting. The next time you do index performance tuning on the same database, you can use the historical spreadsheets to determine whether or not your changes worked the way you’d planned.

SQL Server 2008 Query Performance Tuning Distilled

SQL Server 2008 Query Performance Tuning Distilled

Learn More About SQL Server Index Tuning

I really like Grant Fritchey’s book SQL Server 2008 Query Performance Tuning Distilled, and I wrote a book review about it.  I can’t recommend it highly enough, and I’d start there.

If you don’t have the patience to wait for a book, here’s a few more blog posts about performance tuning:

Performance Tuning with Perfmon – how to set up Perfmon, what SQL Server Perfmon counters to track what the indicators mean.

Data Mining Your SQL Server Perfmon Counters – want to take your Performance Monitor statistics to a new level? I wrote an article on SQLServerPedia explaining how to use Microsoft’s free Table Analysis Tools for the Cloud plugins to dive deeply into your data.

Primary Keys and Indexes – I explain the concepts behind keys and indexes using phone books as an example. Indexes have huge impacts on performance, and if you master these you can make your server go a whole lot faster without spending more money.

SQL Server 2005 Setup Checklist – some simple configuration tweaks can get you 20-30% performance increases right from the start without spending any extra money.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts