Then Why Doesn’t SQL Always Seek?

Last Updated February 6, 2018

Scan The Man

There seems to be a perpetual battle waged against the Index Scan.

At some point it was declared that scans were inferior to seeks, and all energy should be dedicated to eradicating them.

Much like asking why the whole plane isn’t made out of the stuff the black box is made out of, you start to wonder why the Index Scan is even an operator.

Obviously, if Microsoft cared about performance, there would be no Index Scans.

Right?

I bet Postgres would never scan an index.

Being Practical

It should be pretty easy to prove Seek Supremacy once and for all. Then you can literally (LITERALLY!) ignore every other operator in a query plan and focus the entirety of your being on this unknown astral plane toward mocking those without the mental capacity to receive your wisdom.

About Index Scans.

Let’s start with a query.

SELECT  COUNT(*)
FROM    dbo.Users AS u
JOIN    dbo.Posts AS p
    ON p.OwnerUserId = u.Id;

SELECT COUNT(*)

FROM dbo.Users AS u

JOIN dbo.Posts AS p

ON p.OwnerUserId = u.Id;

It has no predicates. It’s just joining two tables together in total.

In the query plan, we have two scans. It’s a two scan plan.

For the Seekists out there; fear not, your moment of ascension is at hand.

SELECT  COUNT(*)
FROM    dbo.Users AS u WITH ( FORCESEEK )
JOIN    dbo.Posts AS p
    ON p.OwnerUserId = u.Id;

SELECT COUNT(*)

FROM dbo.Users AS u WITH ( FORCESEEK )

JOIN dbo.Posts AS p

ON p.OwnerUserId = u.Id;

This plan must surely be superior.

Not All Who Scan Are Lost

In this case, the Two Scan Plan does a bit better.

The source of all those extra reads, and likely the extra CPU time for the seek plan is…

Well, it’s in the seek.

You can see why the optimizer chose the scan in the first place.

Hang on though, maybe we picked the wrong table to force a seek on.

SELECT  COUNT(*)
FROM    dbo.Users AS u
JOIN    dbo.Posts AS p WITH ( FORCESEEK )
    ON p.OwnerUserId = u.Id;

SELECT COUNT(*)

FROM dbo.Users AS u

JOIN dbo.Posts AS p WITH ( FORCESEEK )

ON p.OwnerUserId = u.Id;

This must be the better choi-

Important Information

The reason we get this error here is because we don’t have an index on the Posts table with OwnerUserId as a key column.

This is where most Earth Peoples start to get annoyed — they usually do have an index. In fact, if consulting has taught me anything, you have 17 of the same index with _dta_ somewhere in the name on the data you wish to seek into.

Alright then. Let’s add an index.

CREATE INDEX ix_maybe
    ON dbo.Posts ( OwnerUserId );

1 2	CREATE INDEX ix_maybe ON dbo.Posts ( OwnerUserId );

If we re-run our query without hints, what happens?

We still get two scans!

Comparing all three plans (we can force a seek on Posts now, because we have an index on OwnerUserId), the seeks still aren’t doing so hot.

Now what?

Hopefully you learned that not every seek is great, and not every scan is awful.

If you’re joining two tables together, often a single scan of the data is the wisest choice.

The worst part of a plan may not always be obvious, and it may not always be the method that the optimizer chooses to access data with.

Thanks for reading!

New SQL Server Management Studio 17.5: It’s Classified

[Video] Office Hours 2018/2/14 (With Transcriptions)

13 Comments. Leave new

Aaron Morelli
February 16, 2018 11:36 am

“I bet Postgres would never scan an index.” ISWYDT, nice.

Reply
- Erik Darling
  February 16, 2018 11:37 am
  
  Yeah, if we’re being honest, I wrote this whole post as a vehicle for that.
  
  Reply
Andrew Notarian
February 16, 2018 11:37 am

I am going to assume the Postgres would never scan an index line is specifically directed at me.

Reply
- Erik Darling
  February 16, 2018 11:39 am
  
  No, it’s directed at my favorite martian 🙂
  
  Reply
not a guru
February 16, 2018 11:38 am

Very nice.

Reply
Chris Harshman
February 16, 2018 11:47 am

I don’t think it’s cases like this where people fight scans, it’s cases where your expected results are only a portion of the total rows of the table, yet the optimizer seems to decide to scan everything anyway. I was just looking at a query last week where it did a scan on a 40 million row table to return 20 thousand rows. LOL

Reply
- Erik Darling
  February 16, 2018 11:50 am
  
  Chris — sure! That’s a case I’d probably wanna take a look at too. But it’s a far cry from the “let no scan go unpunished” mentality out there.
  
  Out of curiosity, what was the root cause on that?
  
  Thanks!
  
  Reply
  - Chris Harshman
    February 16, 2018 1:22 pm
    
    In that particular case, I believe it was a problem where the developer was trying to do too many joins at once.
    I ended up suggesting they break it down a bit, trying to separate the “filtering logic” types of joins from the “presentation logic” types of joins.
    
    Reply
    - Erik Darling
      February 16, 2018 1:23 pm
      
      Good call. I typically find it helps to separate relational data from informational data.
      
      Reply
Brian Beuning
February 17, 2018 10:42 am

Seeks read pages (8KB), I thought I read scans read extents (64KB).
At the hardware level, the costs are almost the same.
So a query that reads more than 1/8 the pages is more efficient as a scan.

Also every seek needs to traverse the B-tree (more IO) while scans only
look in the B-tree for the first and last items in the scan. Usually the B-tree
nodes are cached in memory, but the query optimizer has a hard time
costing that factor.

Reply
Brian Battles
February 19, 2018 8:29 am

Can anyone post a link to some explanation of exactly how a seek/scan actually works, for a mere mortal….

Reply
Perry Whittle
February 20, 2018 1:33 am

Btw. You can make the plane out of the same stuff as the black box, which is actually red.
As long as you don’t want the ability to fit engines and load fuel and passengers ?

Reply
OR vs UNION ALL – Is One Better For Performance? – SQL with Bert
February 20, 2018 6:00 am

[…] Well the plan shows seeks, but as Erik Darling recently pointed out, seeks aren’t always a good thing. […]

Reply