Tag Archive: statistics

Webcast Recording – DBA Darwin Awards: Index Edition

Right after the DBA says “Uh oh,” they call in the consultants. You wouldn’t believe some of the stuff I’ve seen, but here’s the worst of it: some of these things are probably happening in your environment right now. If you’re a production DBA of any experience level, do yourself a favor and look for these common mistakes before they call in the consultants.

In this 30-minute video, we’ll discuss heaps, fill factor, maintenance plans, over-indexing, and the Database Tuning Advisor’s stupidity so you can learn what you’re doing wrong before your boss has to hear it from the outsider:

The links we discuss in the video:

Enjoyed this video?  Check out our upcoming webcasts.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Seeks, Scans, and Statistics in the Grocery Store

Building an execution plan is a lot like going shopping.  Before I leave the house, I need to think about a few things:

  • How many items do I need? I take a glance at my grocery list to see roughly how many things I’m shopping for.
  • What stores am I going to visit? I can’t get everything from the same store, so I make a list of where I’m going to source everything.
  • Am I in a big store or a small store? The bigger the store, the more likely I’m going to pick up more items that I might not have had on my list.  My guess of what I need might change if I see cool things I’d like to pick up.  The store also might not have everything that I’d like to find.

Armed with this background information, it’s time to make my plan of attack:

  • Which store should I visit first? The order is determined by what I’m shopping for.  I want to visit my butcher or fishmonger first to see what’s fresh and what’s on sale, and then those purchases determine what I get from the grocery store and wine store.
  • In each store, do I need a push cart, a basket, or just my hands? This is influenced by how many items I think I’m going to take to the checkout stand.
  • In each store, will I go to specific aisles for what’s on my list, or just hit all the aisles? If I only need a couple of things, I’ll go directly to those aisles, get ‘em, and walk out.  (When I get rich, I’ll pay the minimart back.)  If I have a huge list, then it makes more sense to start at one side of the store and walk through every aisle, crossing items off my list as I pick ‘em up.

Why, that’s just like SQL Server’s query processor!

How SQL Server Builds Execution Plans

The query processor looks at your query, reviews the table statistics, and tries to quickly choose an efficient execution plan – just like you choose which stores to visit in what order, and then what aisles to hit inside the store.  The first order of business is checking out what stores tables are involved, and what we need from each one.  Let’s take a simple AdventureWorks query joining two tables:

SELECT  *
FROM    Sales.SalesOrderHeader h
        INNER JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
WHERE   h.DueDate >= '5/15/2011'
        AND d.OrderQty > 10
Well, it was good, but I dunno about amazing.

Found at Whole Foods

We’re joining between SalesOrderHeader and SalesOrderDetail, which have a parent/child relationship: rows in SalesOrderHeader can have multiple rows in SalesOrderDetail table.  We’re filtering on both tables – we’re checking for the header’s DueDate field and the details OrderQuantity field.  Which one should we check first?  We’ve got two options:

  1. Check SalesOrderHeader to look up all rows with DueDate >= 5/15/2011, and then look up their matching SalesOrderDetail records to examine their OrderQty, or
  2. Check SalesOrderDetail to look up all rows with OrderQty > 10, and then look up their matching SalesOrderHeader records to examine their DueDate

Let’s check to see what SQL Server chose in one situation.  When reading execution plans, we read from the top right to determine what happens first, and then go left.  While this may not seem intuitive at first, keep in mind that it’s designed to increase the job security of the database administrator.  You can thank Microsoft later.

Well, not totally simple.

Simple Execution Plan

In my sample data, it chose to check SalesOrderHeader first – that’s the top right operator in the execution plan.  SQL Server determined that this would be the most effective method because neither table had an index on the fields I needed.  Either SQL Server was going to have to scan the entire SalesOrderHeader table to check each record’s DueDate or it would need to scan the entire SalesOrderDetail table to check each line item’s OrderQty.  In this example, SalesOrderHeader was the smaller table, so it made more sense to scan that one.

Mo Tables, Mo Problems

I thought "Vegetarian Feast" meant what happens when we eat the vegetarians.

The Other Other White Meat

When I go to the fishmonger and discover crawfish is finally in season, I have to change the rest of my shopping plans.  I need to go to the wine store because I don’t keep the right wine pairing for mud bugs.  Me being a flexible guy, I can easily rework my entire shopping plan if crawfish happens to be in season, but SQL Server doesn’t have that kind of on-the-fly flexibility.  It has to build a query execution plan, and then rigidly adhere to that plan – no matter what it finds in at the fishmonger.

The more tables we add into our query, the more SQL Server has to make guesstimates.  When we daisy-chain lots of tables together, SQL Server estimates how many rows it’s going to match in each table, pick the right orders, and reserve memory to do its work.  One bad estimate early on in the query execution plan can turn the whole plan sour.

In performance tuning, sometimes I help SQL Server by breaking a ginormous shopping trip into two: I separate a large query into two separate queries.  I don’t mean nested SELECT statements in a maelstrom of parenthesis, either – I build a temp table, populate it with the first query’s results, and then join that temp table to another set of results in a second query.  Here’s an overly simplified example:

CREATE TABLE #CreditCards (CreditCardID INT)
INSERT INTO #CreditCards
        ( CreditCardID )
SELECT CreditCardID
  FROM Sales.CreditCard
  WHERE Hacked = 1

SELECT  *
FROM    #CreditCards cc
		INNER JOIN Sales.SalesOrderHeader h ON cc.CreditCardID = h.CreditCardID
        INNER JOIN Sales.SalesOrderDetail d ON h.SalesOrderID = d.SalesOrderID
WHERE   h.DueDate >= '5/15/2011'
        AND d.OrderQty > 100

Sometimes by breaking up our shopping trip, SQL Server is able to build a better execution plan for the second query.  This is especially effective when I’m dealing with a very large number of tables, yet there’s just a couple of key tables in the query that determine whether we show any results or not.  I don’t use this technique proactively – I only try it when I’m consistently running into problems with execution plans that would be better suited separately.  Alternate techniques could include index hints or query plan guides.

How SQL Server Chooses An Aisle Seek or a Store Scan

How Erika wooed me

Man Bait - Bacon Lollipops

Once we’re inside a particular store (table), we’re faced with another choice: do we jump directly around to just a couple of aisles to find exactly what we want, or do we walk up and down every aisle of the store, picking things up as we go past?  SQL Server makes this decision based on statistics about the table (as well as other tables in the query).  If it believes it only has to pick up a few items, it’ll choose to do a seek (possibly combined with row lookups).  If the number of things we need is greater than the tipping point, then SQL Server does a scan instead.

While the word “seek” has a reputation for performance amongst database administrators, don’t get sidetracked trying to force every query execution plan to use all seeks.  Sometimes it’s just more efficient to do a scan – like when we need to pick up several things from every aisle, or every page in the table.  The better question is to ask whether we really need something from every aisle.  (Erika asks me this question every time I walk into Frys – my credit card can’t handle table scans in there.)

The decision to do a seek or a scan is influenced by our shopping list.  Sometimes I can’t exactly make sense of what’s on my list.  When Erika wrote down “Frantoia,” that didn’t make sense to me, so I was forced to go up and down the grocery store aisles looking for a product by that name.  If she would have said, “WHERE Brand = Frantoia AND ProductCategory = Olive Oil”, then I’d have been able to use my trusty index to seek directly to the olive oil, but I didn’t have that option.  Instead, I had to do a scan – there wasn’t enough information in the query for me to use an available index, and I was too lazy to ask for directions.

In addition to searching by fields that aren’t indexed, other things in our query can force SQL Server to do a scan instead of a seek:

Using functions in the WHERE clause – if Erika tells me to buy everything WHERE PriceToday <= (50% * RegularPrice), I’m going to have to walk through the entire store looking at every single price tag.  I can’t ask the store for a map of items that are half-off or more, because they simply won’t have a map like that.  Likewise, if my SQL Server query gets fancy with the functions, it won’t be sargable.

Bad or out-of-date statistics – SQL Server uses statistics to gauge the contents of the grocery store as a whole.  If it believes that the store is like a typical grocery store with only a small section of olive oils, then it’ll jump directly to the olive oil area if we say WHERE ProductCategory = Olive Oil.  However, if this store just came under new ownership, and the new owners decided to carry a stunning array of olive oil, SQL Server might be overwhelmed by the amount of data that comes back.

Why I Hate “ORDER BY”

We believe in division of labor in the Ozar house.  When I get home from shopping, I deliver the bags into the dining room table.  From there, it’s Erika’s job to put things away in the various cupboards and closets.  I know that I could make Erika’s job easier if I sorted items as I loaded them into the car, but that just doesn’t make sense.  It’s not efficient for me to stop what I’m doing in the middle of the shopping trip, move things around between bags, and get them in exactly the right order for efficient cabinet loading.  There’s plenty of space for efficient sorting when I get home.

Likewise, I don’t want my SQL Server sorting data.  The more processing that I can offload to web servers, application servers, and thick clients, the better.  SQL Server Enterprise Edition is $30k/CPU, but your web/app servers are likely a great deal cheaper, and you’ve probably got much better methods of scaling out your app server tier.  We just don’t have a good way of scaling out SQL Server queries until Denali AlwaysOn hits.

More Reading on SQL Server Seeks, Scans, and Statistics

To learn more about the topics in today’s post, check out:

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts

Simon Sabin’s talk on query performance

At #SQLbits, Simon Sabin (BlogTwitter) talked about car crash queries: queries that suddenly have bad performance out of nowhere, yet if you copy/paste them into SQL Server Management Studio, they run quickly.  The culprit is often parameter sniffing: the engine looks at the query and builds a plan, but if the parameters in the WHERE clause (or joins or whatever) change that plan may not work right anymore.

Simon Sabin onstage at SQLBits

Simon Sabin onstage at SQLBits

Simon likened it to getting walking directions.  If you ask for directions from Big Ben to the Millennium Wheel, your best option is walking.  But if you decide to go from Big Ben to the Eiffel Tower,walking is no longer the best option – because the statistics of the distance you’re trying to cover is suddenly different.  SQL Server’s statistics about the quantity and distribution of the data in your database help it determine how to build the best directions.

Scaling up to higher numbers of users also changes how queries perform.  If one query returns in 250 milliseconds, that doesn’t mean that 100 of those queries running simultaneously may not all finish in 250 milliseconds each.  He likened it to a grocery checkout line – the more people you put in line, the more the folks at the end of the line are going to have to wait.

A great example for parameter sniffing troubles is when the same query is run in both a stored procedure, and standalone.  It may perform poorly inside the stored proc, but when the DBA tries to troubleshoot it, it performs fine.  The two queries are using different execution plans.  Simon demoed how to compare the two plans, and how to find out why the bad plan isn’t optimal.

One way to get better plans is to get better statistics, and you have to do better than SQL Server’s default statistics.  SQL creates its own statistics for new tables, but they don’t work terribly well.  If you’re using table variables, those don’t get statistics at all, but on the plus side, they won’t cause recompiles when the data distributions change.  Temp tables, on the other hand, get statistics, but they incur recompiles when SQL Server sees that the temp table contents have changed dramatically and need statistics updates.

He demoed how SQL Server will create different query plans based on the exact text of the query.  If one query includes comments and another doesn’t, they can end up with different query plans and different performance.

In order to catch these problems, you have to monitor and baseline your application, then identify changes in reads, writes, and CPU time.  You don’t want to monitor duration, because you won’t catch things until it’s too late.

Brent Ozar

Brent specializes in performance tuning for SQL Server, VMware, and storage. He's one of the very few Microsoft Certified Masters of SQL Server, a published author, and a Microsoft MVP. He likes travel, Jeeps, Apple gear, jokes, and writing about himself in the third person. Read more and contact Brent.

Website - Twitter - Facebook - More Posts