Seeks, Scans, and Statistics in the Grocery Store

Last Updated 5 years ago

Brent Ozar

Indexing, T-SQL

Building an execution plan is a lot like going shopping. Before I leave the house, I need to think about a few things:

How many items do I need? I take a glance at my grocery list to see roughly how many things I’m shopping for.
What stores am I going to visit? I can’t get everything from the same store, so I make a list of where I’m going to source everything.
Am I in a big store or a small store? The bigger the store, the more likely I’m going to pick up more items that I might not have had on my list. My guess of what I need might change if I see cool things I’d like to pick up. The store also might not have everything that I’d like to find.

Armed with this background information, it’s time to make my plan of attack:

Which store should I visit first? The order is determined by what I’m shopping for. I want to visit my butcher or fishmonger first to see what’s fresh and what’s on sale, and then those purchases determine what I get from the grocery store and wine store.
In each store, do I need a push cart, a basket, or just my hands? This is influenced by how many items I think I’m going to take to the checkout stand.
In each store, will I go to specific aisles for what’s on my list, or just hit all the aisles? If I only need a couple of things, I’ll go directly to those aisles, get ’em, and walk out. (When I get rich, I’ll pay the minimart back.) If I have a huge list, then it makes more sense to start at one side of the store and walk through every aisle, crossing items off my list as I pick ’em up.

Why, that’s just like SQL Server’s query processor!

How SQL Server Builds Execution Plans

The query processor looks at your query, reviews the table statistics, and tries to quickly choose an efficient execution plan – just like you choose which stores to visit in what order, and then what aisles to hit inside the store. The first order of business is checking out what ~~stores~~ tables are involved, and what we need from each one. Let’s take a simple AdventureWorks query joining two tables:

SELECT *
FROM Sales.SalesOrderHeader h
INNER JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
WHERE h.DueDate >= '5/15/2011'
AND d.OrderQty > 10

SELECT *

FROM Sales.SalesOrderHeader h

INNER JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID

WHERE h.DueDate >= '5/15/2011'

AND d.OrderQty > 10

Well, it was good, but I dunno about amazing. — Found at Whole Foods

We’re joining between SalesOrderHeader and SalesOrderDetail, which have a parent/child relationship: rows in SalesOrderHeader can have multiple rows in SalesOrderDetail table. We’re filtering on both tables – we’re checking for the header’s DueDate field and the details OrderQuantity field. Which one should we check first? We’ve got two options:

Check SalesOrderHeader to look up all rows with DueDate >= 5/15/2011, and then look up their matching SalesOrderDetail records to examine their OrderQty, or
Check SalesOrderDetail to look up all rows with OrderQty > 10, and then look up their matching SalesOrderHeader records to examine their DueDate

Let’s check to see what SQL Server chose in one situation. When reading execution plans, we read from the top right to determine what happens first, and then go left. While this may not seem intuitive at first, keep in mind that it’s designed to increase the job security of the database administrator. You can thank Microsoft later.

Well, not totally simple. — Simple Execution Plan

In my sample data, it chose to check SalesOrderHeader first – that’s the top right operator in the execution plan. SQL Server determined that this would be the most effective method because neither table had an index on the fields I needed. Either SQL Server was going to have to scan the entire SalesOrderHeader table to check each record’s DueDate or it would need to scan the entire SalesOrderDetail table to check each line item’s OrderQty. In this example, SalesOrderHeader was the smaller table, so it made more sense to scan that one.

Mo Tables, Mo Problems

I thought "Vegetarian Feast" meant what happens when we eat the vegetarians. — The Other Other White Meat

When I go to the fishmonger and discover crawfish is finally in season, I have to change the rest of my shopping plans. I need to go to the wine store because I don’t keep the right wine pairing for mud bugs. Me being a flexible guy, I can easily rework my entire shopping plan if crawfish happens to be in season, but SQL Server doesn’t have that kind of on-the-fly flexibility. It has to build a query execution plan, and then rigidly adhere to that plan – no matter what it finds in at the fishmonger.

The more tables we add into our query, the more SQL Server has to make guesstimates. When we daisy-chain lots of tables together, SQL Server estimates how many rows it’s going to match in each table, pick the right orders, and reserve memory to do its work. One bad estimate early on in the query execution plan can turn the whole plan sour.

In performance tuning, sometimes I help SQL Server by breaking a ginormous shopping trip into two: I separate a large query into two separate queries. I don’t mean nested SELECT statements in a maelstrom of parenthesis, either – I build a temp table, populate it with the first query’s results, and then join that temp table to another set of results in a second query. Here’s an overly simplified example:

CREATE TABLE #CreditCards (CreditCardID INT)
INSERT INTO #CreditCards
( CreditCardID )
SELECT CreditCardID
FROM Sales.CreditCard
WHERE Hacked = 1

SELECT *
FROM #CreditCards cc
INNER JOIN Sales.SalesOrderHeader h ON cc.CreditCardID = h.CreditCardID
INNER JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
WHERE h.DueDate >= '5/15/2011'
AND d.OrderQty > 100

CREATE TABLE #CreditCards (CreditCardID INT)

INSERT INTO #CreditCards

( CreditCardID )

SELECT CreditCardID

FROM Sales.CreditCard

WHERE Hacked = 1

SELECT *

FROM #CreditCards cc

INNER JOIN Sales.SalesOrderHeader h ON cc.CreditCardID = h.CreditCardID

INNER JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID

WHERE h.DueDate >= '5/15/2011'

AND d.OrderQty > 100

Sometimes by breaking up our shopping trip, SQL Server is able to build a better execution plan for the second query. This is especially effective when I’m dealing with a very large number of tables, yet there’s just a couple of key tables in the query that determine whether we show any results or not. I don’t use this technique proactively – I only try it when I’m consistently running into problems with execution plans that would be better suited separately. Alternate techniques could include index hints or query plan guides.

How SQL Server Chooses An Aisle Seek or a Store Scan

How Erika wooed me — Man Bait – Bacon Lollipops

Once we’re inside a particular store (table), we’re faced with another choice: do we jump directly around to just a couple of aisles to find exactly what we want, or do we walk up and down every aisle of the store, picking things up as we go past? SQL Server makes this decision based on statistics about the table (as well as other tables in the query). If it believes it only has to pick up a few items, it’ll choose to do a seek (possibly combined with row lookups). If the number of things we need is greater than the tipping point, then SQL Server does a scan instead.

While the word “seek” has a reputation for performance amongst database administrators, don’t get sidetracked trying to force every query execution plan to use all seeks. Sometimes it’s just more efficient to do a scan – like when we need to pick up several things from every aisle, or every page in the table. The better question is to ask whether we really need something from every aisle. (Erika asks me this question every time I walk into Frys – my credit card can’t handle table scans in there.)

The decision to do a seek or a scan is influenced by our shopping list. Sometimes I can’t exactly make sense of what’s on my list. When Erika wrote down “Frantoia,” that didn’t make sense to me, so I was forced to go up and down the grocery store aisles looking for a product by that name. If she would have said, “WHERE Brand = Frantoia AND ProductCategory = Olive Oil”, then I’d have been able to use my trusty index to seek directly to the olive oil, but I didn’t have that option. Instead, I had to do a scan – there wasn’t enough information in the query for me to use an available index, and I was too lazy to ask for directions.

In addition to searching by fields that aren’t indexed, other things in our query can force SQL Server to do a scan instead of a seek:

Using functions in the WHERE clause – if Erika tells me to buy everything WHERE PriceToday <= (50% * RegularPrice), I’m going to have to walk through the entire store looking at every single price tag. I can’t ask the store for a map of items that are half-off or more, because they simply won’t have a map like that. Likewise, if my SQL Server query gets fancy with the functions, it won’t be sargable.

Bad or out-of-date statistics – SQL Server uses statistics to gauge the contents of the grocery store as a whole. If it believes that the store is like a typical grocery store with only a small section of olive oils, then it’ll jump directly to the olive oil area if we say WHERE ProductCategory = Olive Oil. However, if this store just came under new ownership, and the new owners decided to carry a stunning array of olive oil, SQL Server might be overwhelmed by the amount of data that comes back.

Why I Hate “ORDER BY”

We believe in division of labor in the Ozar house. When I get home from shopping, I deliver the bags into the dining room table. From there, it’s Erika’s job to put things away in the various cupboards and closets. I know that I could make Erika’s job easier if I sorted items as I loaded them into the car, but that just doesn’t make sense. It’s not efficient for me to stop what I’m doing in the middle of the shopping trip, move things around between bags, and get them in exactly the right order for efficient cabinet loading. There’s plenty of space for efficient sorting when I get home.

Likewise, I don’t want my SQL Server sorting data. The more processing that I can offload to web servers, application servers, and thick clients, the better. SQL Server Enterprise Edition is $30k/CPU, but your web/app servers are likely a great deal cheaper, and you’ve probably got much better methods of scaling out your app server tier. We just don’t have a good way of scaling out SQL Server queries until Denali AlwaysOn hits.