Missing Index Impact and Join Type

Last Updated March 22, 2018

Just Another Way

No matter how you delve into missing index requests — whether it’s the plan level, DMV analysis, or (forgive me for saying it), DTA, the requests will generally be the same.

They’ll prioritize equality predicates, the columns may or not may be in the right order, the columns may or may not be in the right part of the index, and the impact…

Oh, that impact.

It’s all just a cry for help, anyway.

Like a teenager watching anime and buying intricate parasols.

Salted Grains

If I run these three queries with different join types:

SELECT b.UserId, b.Name, b.Date, p.CreationDate
FROM dbo.Badges AS b
JOIN dbo.Posts AS p
ON b.UserId = p.OwnerUserId
WHERE b.Date = '20170101'
OPTION(LOOP JOIN, MAXDOP 1, RECOMPILE)

SELECT b.UserId, b.Name, b.Date, p.CreationDate
FROM dbo.Badges AS b
JOIN dbo.Posts AS p
ON b.UserId = p.OwnerUserId
WHERE b.Date = '20170101'
OPTION(MERGE JOIN, MAXDOP 1, RECOMPILE)

SELECT b.UserId, b.Name, b.Date, p.CreationDate
FROM dbo.Badges AS b
JOIN dbo.Posts AS p
ON b.UserId = p.OwnerUserId
WHERE b.Date = '20170101'
OPTION(HASH JOIN, MAXDOP 1, RECOMPILE)

SELECT b.UserId, b.Name, b.Date, p.CreationDate

FROM dbo.Badges AS b

JOIN dbo.Posts AS p

ON b.UserId = p.OwnerUserId

WHERE b.Date = '20170101'

OPTION(LOOP JOIN, MAXDOP 1, RECOMPILE)

SELECT b.UserId, b.Name, b.Date, p.CreationDate

FROM dbo.Badges AS b

JOIN dbo.Posts AS p

ON b.UserId = p.OwnerUserId

WHERE b.Date = '20170101'

OPTION(MERGE JOIN, MAXDOP 1, RECOMPILE)

SELECT b.UserId, b.Name, b.Date, p.CreationDate

FROM dbo.Badges AS b

JOIN dbo.Posts AS p

ON b.UserId = p.OwnerUserId

WHERE b.Date = '20170101'

OPTION(HASH JOIN, MAXDOP 1, RECOMPILE)

They’re all going to ask for the same missing index:

USE [StackOverflow]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[Badges] ([Date])
INCLUDE ([Name],[UserId])

USE [StackOverflow]

CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]

ON [dbo].[Badges] ([Date])

INCLUDE ([Name],[UserId])

Kinda weird already, that a join column is an INCLUDE, but hey.

What’s even weirder is that they have diminishing estimated impacts based on join type.

Loop: 99.1954%
Merge: 42.7795%
Hash: 28.7901%

It gets a bit stranger if I force parallelism!

SELECT b.UserId, b.Name, b.Date, p.CreationDate
FROM dbo.Badges AS b
JOIN dbo.Posts AS p
ON b.UserId = p.OwnerUserId
WHERE b.Date = '20170101'
OPTION(LOOP JOIN, USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'), RECOMPILE)

SELECT b.UserId, b.Name, b.Date, p.CreationDate
FROM dbo.Badges AS b
JOIN dbo.Posts AS p
ON b.UserId = p.OwnerUserId
WHERE b.Date = '20170101'
OPTION(MERGE JOIN, USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'), RECOMPILE)

SELECT b.UserId, b.Name, b.Date, p.CreationDate
FROM dbo.Badges AS b
JOIN dbo.Posts AS p
ON b.UserId = p.OwnerUserId
WHERE b.Date = '20170101'
OPTION(HASH JOIN, USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'), RECOMPILE)

SELECT b.UserId, b.Name, b.Date, p.CreationDate

FROM dbo.Badges AS b

JOIN dbo.Posts AS p

ON b.UserId = p.OwnerUserId

WHERE b.Date = '20170101'

OPTION(LOOP JOIN, USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'), RECOMPILE)

SELECT b.UserId, b.Name, b.Date, p.CreationDate

FROM dbo.Badges AS b

JOIN dbo.Posts AS p

ON b.UserId = p.OwnerUserId

WHERE b.Date = '20170101'

OPTION(MERGE JOIN, USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'), RECOMPILE)

SELECT b.UserId, b.Name, b.Date, p.CreationDate

FROM dbo.Badges AS b

JOIN dbo.Posts AS p

ON b.UserId = p.OwnerUserId

WHERE b.Date = '20170101'

OPTION(HASH JOIN, USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE'), RECOMPILE)

Now the estimated impacts look like this:

Loop: 98.9569%
Merge: 29.1982%
Hash: 44.2455%

The Hash and Merge join impacts have just about changed places.

The funny thing is…

I totally agree.

As far as indexes go, that’s a crappy index for the Merge and Hash Join plans. But no better one is being offered, not even sneakily.

For the nested loops plans, it’s super easy to grab the UserIds for that date, and dig into the Posts table for just those.

For the merge join plans, it’s less helpful. While it’s nice that we can easily filter the date predicate, we still have to order our data for the merge join. There are further complications in the parallel version.

For the hash join plans, it’s a similar situation. We need to create a hash table on UserId. It being in the leaf of an index on Date doesn’t help us much, or rather as much as it would if it were in the key.

In short, both the merge and hash join plans have cost-increasing operators thrown into the mix that the index as requested just wouldn’t help.

Thanks for reading!

Brent says: in the Mastering Index Tuning class, these types of examples are why I tell students that you should look at one-key-column index suggestions carefully. In most cases, SQL Server really needs one (or more) of the included fields to be in the key. The hard part is figuring out which one(s) without looking at the execution plans.

Building SQL ConstantCare®: Product, Packaging, and Pricing

Office Hours Guest Instructor Month

8 Comments. Leave new

roe
April 26, 2018 1:23 pm

thanks, never seen the parallel query hint. I have had a few tables with forcing a loop join bring the run time down from 10 hours to 1 hour (large data processing joining membership to a $ table on membershipID and date)

Reply
- Erik Darling
  April 27, 2018 5:58 am
  
  Roe — yeah, it’s new as of 2016.
  
  Reply
Toby Ovod-Everett
April 26, 2018 3:20 pm

I don’t see why having the join column (UserId) in the INCLUDE instead of a key column is an issue for this query (assuming the LOOP join). If you’re doing a LOOP join with Badges as your left-hand side, the order that the Posts get retrieved by UserId is of minimal consequence. If you had a range for the Date field in the WHERE clause, there would be no guarantee that the Posts would be retrieved in UserId order unless the optimizer sorted the output from Badges on UserId (since the initial sort would be Date, UserId – Date has to be the leading key for the index to be useful with the Date range).

Adding fields to the key instead of include means index re-ordering if that field is updated, so if there’s no advantage to having the field in the key, it makes sense to me to leave it in the include.

It’s funny – everytime I start discussing indexes, I keep wishing SQL Server supported skip scanning. Joe Chang has a great writeup on faking a skip scan at http://sqlblog.com/blogs/joe_chang/archive/2011/06/13/oracle-index-skip-scan.aspx, but I really wish we didn’t have to add unnecessary joins as a workaround.

Reply
- Erik Darling
  April 27, 2018 5:59 am
  
  Toby — that’s a really interesting point of view. Sounds like a great thing for you to experiment with and blog about.
  
  Reply
- Brent Ozar
  April 27, 2018 6:05 am
  
  Toby – isn’t it interesting how most of Joe’s posts seem to be about making a case for something a vendor doesn’t have – like something he wishes Microsoft or Intel or laptop makers would put in – but then the vendors never do? I wonder why that is. I wonder what they know that he doesn’t know, or why they don’t bring him in to consult?
  
  Reply
Ian
April 27, 2018 2:04 am

As a slight aside point, am I right in thinking that the missing index advice is based on the plan chosen by the optimiser. Therefore, if the plan selected was suboptimal (maybe due to stale statistics) then the danger is that the missing index advice could be misleading and not the real issue.
So maybe it all comes down to the sage advice of taking the missing index advice as a hint that optimisations are required and reviewing many other things before deciding on a course of action?

Reply
- Erik Darling
  April 27, 2018 6:00 am
  
  Ian — No, it’s contrived (from what I know) based on the index matching portion of optimization. Adding indexes often changes query plans quite bit.
  
  Reply
LeoG
May 31, 2019 7:24 am

Questions. I have some big tables in SQL 2016 with columnstore indexes. The execution plan shows missing indexes but after creating those, it is still using the columnstore indexes.
Q1. Does the optimizer have a priority to regular indexes over columnstore indexes?
Q2. Do I need to hit it or clear statistics etc…?

Reply