Seeks, Scans, and Statistics in the Grocery Store

Building an execution plan is a lot like going shopping.  Before I leave the house, I need to think about a few things:

  • How many items do I need? I take a glance at my grocery list to see roughly how many things I’m shopping for.
  • What stores am I going to visit? I can’t get everything from the same store, so I make a list of where I’m going to source everything.
  • Am I in a big store or a small store? The bigger the store, the more likely I’m going to pick up more items that I might not have had on my list.  My guess of what I need might change if I see cool things I’d like to pick up.  The store also might not have everything that I’d like to find.

Armed with this background information, it’s time to make my plan of attack:

  • Which store should I visit first? The order is determined by what I’m shopping for.  I want to visit my butcher or fishmonger first to see what’s fresh and what’s on sale, and then those purchases determine what I get from the grocery store and wine store.
  • In each store, do I need a push cart, a basket, or just my hands? This is influenced by how many items I think I’m going to take to the checkout stand.
  • In each store, will I go to specific aisles for what’s on my list, or just hit all the aisles? If I only need a couple of things, I’ll go directly to those aisles, get ’em, and walk out.  (When I get rich, I’ll pay the minimart back.)  If I have a huge list, then it makes more sense to start at one side of the store and walk through every aisle, crossing items off my list as I pick ’em up.

Why, that’s just like SQL Server’s query processor!

How SQL Server Builds Execution Plans

The query processor looks at your query, reviews the table statistics, and tries to quickly choose an efficient execution plan – just like you choose which stores to visit in what order, and then what aisles to hit inside the store.  The first order of business is checking out what stores tables are involved, and what we need from each one.  Let’s take a simple AdventureWorks query joining two tables:

Well, it was good, but I dunno about amazing.

Found at Whole Foods

We’re joining between SalesOrderHeader and SalesOrderDetail, which have a parent/child relationship: rows in SalesOrderHeader can have multiple rows in SalesOrderDetail table.  We’re filtering on both tables – we’re checking for the header’s DueDate field and the details OrderQuantity field.  Which one should we check first?  We’ve got two options:

  1. Check SalesOrderHeader to look up all rows with DueDate >= 5/15/2011, and then look up their matching SalesOrderDetail records to examine their OrderQty, or
  2. Check SalesOrderDetail to look up all rows with OrderQty > 10, and then look up their matching SalesOrderHeader records to examine their DueDate

Let’s check to see what SQL Server chose in one situation.  When reading execution plans, we read from the top right to determine what happens first, and then go left.  While this may not seem intuitive at first, keep in mind that it’s designed to increase the job security of the database administrator.  You can thank Microsoft later.

Well, not totally simple.

Simple Execution Plan

In my sample data, it chose to check SalesOrderHeader first – that’s the top right operator in the execution plan.  SQL Server determined that this would be the most effective method because neither table had an index on the fields I needed.  Either SQL Server was going to have to scan the entire SalesOrderHeader table to check each record’s DueDate or it would need to scan the entire SalesOrderDetail table to check each line item’s OrderQty.  In this example, SalesOrderHeader was the smaller table, so it made more sense to scan that one.

Mo Tables, Mo Problems

I thought "Vegetarian Feast" meant what happens when we eat the vegetarians.

The Other Other White Meat

When I go to the fishmonger and discover crawfish is finally in season, I have to change the rest of my shopping plans.  I need to go to the wine store because I don’t keep the right wine pairing for mud bugs.  Me being a flexible guy, I can easily rework my entire shopping plan if crawfish happens to be in season, but SQL Server doesn’t have that kind of on-the-fly flexibility.  It has to build a query execution plan, and then rigidly adhere to that plan – no matter what it finds in at the fishmonger.

The more tables we add into our query, the more SQL Server has to make guesstimates.  When we daisy-chain lots of tables together, SQL Server estimates how many rows it’s going to match in each table, pick the right orders, and reserve memory to do its work.  One bad estimate early on in the query execution plan can turn the whole plan sour.

In performance tuning, sometimes I help SQL Server by breaking a ginormous shopping trip into two: I separate a large query into two separate queries.  I don’t mean nested SELECT statements in a maelstrom of parenthesis, either – I build a temp table, populate it with the first query’s results, and then join that temp table to another set of results in a second query.  Here’s an overly simplified example:

Sometimes by breaking up our shopping trip, SQL Server is able to build a better execution plan for the second query.  This is especially effective when I’m dealing with a very large number of tables, yet there’s just a couple of key tables in the query that determine whether we show any results or not.  I don’t use this technique proactively – I only try it when I’m consistently running into problems with execution plans that would be better suited separately.  Alternate techniques could include index hints or query plan guides.

How SQL Server Chooses An Aisle Seek or a Store Scan

How Erika wooed me

Man Bait – Bacon Lollipops

Once we’re inside a particular store (table), we’re faced with another choice: do we jump directly around to just a couple of aisles to find exactly what we want, or do we walk up and down every aisle of the store, picking things up as we go past?  SQL Server makes this decision based on statistics about the table (as well as other tables in the query).  If it believes it only has to pick up a few items, it’ll choose to do a seek (possibly combined with row lookups).  If the number of things we need is greater than the tipping point, then SQL Server does a scan instead.

While the word “seek” has a reputation for performance amongst database administrators, don’t get sidetracked trying to force every query execution plan to use all seeks.  Sometimes it’s just more efficient to do a scan – like when we need to pick up several things from every aisle, or every page in the table.  The better question is to ask whether we really need something from every aisle.  (Erika asks me this question every time I walk into Frys – my credit card can’t handle table scans in there.)

The decision to do a seek or a scan is influenced by our shopping list.  Sometimes I can’t exactly make sense of what’s on my list.  When Erika wrote down “Frantoia,” that didn’t make sense to me, so I was forced to go up and down the grocery store aisles looking for a product by that name.  If she would have said, “WHERE Brand = Frantoia AND ProductCategory = Olive Oil”, then I’d have been able to use my trusty index to seek directly to the olive oil, but I didn’t have that option.  Instead, I had to do a scan – there wasn’t enough information in the query for me to use an available index, and I was too lazy to ask for directions.

In addition to searching by fields that aren’t indexed, other things in our query can force SQL Server to do a scan instead of a seek:

Using functions in the WHERE clause – if Erika tells me to buy everything WHERE PriceToday <= (50% * RegularPrice), I’m going to have to walk through the entire store looking at every single price tag.  I can’t ask the store for a map of items that are half-off or more, because they simply won’t have a map like that.  Likewise, if my SQL Server query gets fancy with the functions, it won’t be sargable.

Bad or out-of-date statistics – SQL Server uses statistics to gauge the contents of the grocery store as a whole.  If it believes that the store is like a typical grocery store with only a small section of olive oils, then it’ll jump directly to the olive oil area if we say WHERE ProductCategory = Olive Oil.  However, if this store just came under new ownership, and the new owners decided to carry a stunning array of olive oil, SQL Server might be overwhelmed by the amount of data that comes back.

Why I Hate “ORDER BY”

We believe in division of labor in the Ozar house.  When I get home from shopping, I deliver the bags into the dining room table.  From there, it’s Erika’s job to put things away in the various cupboards and closets.  I know that I could make Erika’s job easier if I sorted items as I loaded them into the car, but that just doesn’t make sense.  It’s not efficient for me to stop what I’m doing in the middle of the shopping trip, move things around between bags, and get them in exactly the right order for efficient cabinet loading.  There’s plenty of space for efficient sorting when I get home.

Likewise, I don’t want my SQL Server sorting data.  The more processing that I can offload to web servers, application servers, and thick clients, the better.  SQL Server Enterprise Edition is $30k/CPU, but your web/app servers are likely a great deal cheaper, and you’ve probably got much better methods of scaling out your app server tier.  We just don’t have a good way of scaling out SQL Server queries until Denali AlwaysOn hits.

More Reading on SQL Server Seeks, Scans, and Statistics

To learn more about the topics in today’s post, check out:

Previous Post
New Community Event by Denny Cherry: SQL Excursions
Next Post
Twitter #SQLHelp Hash Tag Dos and Don’ts

15 Comments. Leave new

  • Great analogy and an exquisite ref list !

    Reply
  • Good luck on the ORDER BY mission! The database makes it so easy, as a developer, I would have to experience a lot of personal pain before I yielded on that one.

    You reminded me again too why SQL Server needs function-based indexes!

    The first thing I do when looking at a plan, is reformat it to a flow-chart diagram widget so I don’t have to do the left-right-Arabic translation 🙂

    Reply
    • It’s compellingly easy in .NET:

      var result = new DataSet();
      using (var connection = new SqlConnection(ConfigurationManager.ConnectionStrings[“stage”].ConnectionString))
      {
      using (var command = new SqlCommand())
      {
      using (var da = new SqlDataAdapter(command))
      {
      command.Connection = connection;
      command.CommandText = sql;
      command.Parameters.AddRange(parameters.ToArray());
      da.Fill(result);
      }
      }
      }
      result.Sort(“hell_yeah DESC”);
      return result;

      Reply
  • Love the extended metaphor! However, I now have “Lost in the Supermarket” stuck in my head.

    Reply
  • Steve LaRochelle
    May 10, 2011 9:35 am

    Grocery stores need better clustered indexes. In my grocery store, olive oil can be found in the general oils shelves, in the international isle, in the heart healthy shelf, or the gourmet section. So it is an inefficient scan every time looking for a specific oil. Great article.

    Reply
  • Jeff Stanlick
    May 10, 2011 12:33 pm

    Love the analogy! Although, now I have this overwhelming desire to defrag my grocery list so it’s in the same order as the store’s shelves.

    Reply
  • Luke Winikates
    May 10, 2011 12:43 pm

    Likewise, I don’t want my SQL Server sorting data. The more processing that I can offload to web servers, application servers, and thick clients, the better.

    I’m not against this, but how do you do paging without ORDER BY?

    Reply
    • Luke – great question, and there’s a simple answer: caching. Fetch the data you want to page through, cache it in the application tier, and then page through it. That’s how you scale.

      Reply
      • If total number of records are in thousands, caching it in the application tier to perform paging might not be efficient. In such cases, we are using order by and is there any alternative for that?

        Reply
        • GK – what makes you think it’s more efficient to perform that task on a centralized database server rather than scale it out with the cheaper CPU/memory in the app tier? Remember, Enterprise Edition = $30k/CPU, and you can buy a LOT of web servers for that.

          Reply
  • Love the crawfish example!! You never have to change your shopping plans here in Louisiana because you KNOW when crawfish is in season! Now you went and made me hungry for crawfish etouffe! Great article as usual!

    Reply
  • Hi Brent, everybody

    you are talking about ORDER BY high cpu cost, is there way, any magic dmv to have estimation about this ORDER BY impact compared to the overall activity !?

    thanks for you help and congrats for your site and your way of life 🙂
    nicolas

    Reply
    • Nicolas – well, sort of, but it’ll take some work. SQL Server stores execution plans in the procedure cache in XML format. You could parse those, take the total cost of the orders, and report on that. I haven’t seen a pre-built query to do that though.

      Reply
  • Simply superb! The shopping trip analogy is simple and can be used for easy understanding of concepts. The next time I have some novice struggling with execution plans and optimization, I will be sure to guide him/her to this post.

    Thank-you so much!

    Thanks & Regards,
    Nakul Vachhrajani

    Be courteous. Drive responsibly.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Menu
{"cart_token":"","hash":"","cart_data":""}