The Many Mysteries of Merge Joins

Not A Single Picture Of A Zipper

Image humbly borrowed from https://70srichard.wordpress.com/2014/12/17/beverly-hills-cop/

There are some interesting things about Merge Joins, and Merge Join plans that I figured I’d blog about.

Merge joins have at least one interesting attribute, and may add some weird stuff to your query plans.

It’s not that I think they’re bad, but they can be tricky.

Lots of people see a Merge Join and are somewhere between grateful (that it’s not a Hash Join) and curious (as to why it’s not Nested Loops).

Oh, you exotic Merge Join.

E pluribus unum

In a “good” Merge Join, the join operator in the query plan will have the Many to Many: False attribute.

The optimizer knows this because the Primary Key on Id (though a unique index or constraint offers similar assurances) is distinct for each value in the Users table.

Having one unique input give you a one to many Merge Join.

Simple as a pimple

The statistics TIME and IO profile for this query is about like so:

Not too shabby for one meeeeeeeeeeeeeeeeeeeellion rows.

E pluribus pluribus

In a “bad” Merge Join, that attribute will be True.

Hamburger Lady

Why does this happen, and why is it bad?

It happens most commonly when there are, or may be duplicates on both sides of the results. They can also happen when the outer input has duplicates, but quite often the optimizer will rewrite the JOIN order to put it on the inside to avoid having to use a worktable, etc. Thanks to Adam and Kalen for nudging me in the comments to clarify this part.

Internally, the Merge Join will spin up a work table (similar to how a Hash Join operates), and work out the duplicate situation.

The stats TIME and IO profile of this plan looks like so:

If we were to, say, choose that serial Merge Join plan in the compilation of a stored procedure that became the victim of parameter sniffing, we could run into trouble.

Yes, that is seven minutes and forty seconds. A little over half of one metric cigarette break.

Head to head, the large merge (many to many) is costed higher than the smaller merge (one to many). But you won’t see that in a parameter sniffing situation.

You’ll only see the lower costed Merge Join version of the plan.

TELL’EM LARGE MERGE SENT YA

The optimizer will sometimes to try protect itself from such hi jinks.

Aggregation Ruling The Nation

In some cases, the optimizer may inject an aggregation into one side of the join ahead of time to distinctify the data. It doesn’t need to do both — we only need one distinct input for the many to many attribute to be false.

I haven’t seen a situation where both inputs get aggregated merely to support the Merge Join, but it might happen if you ask for other aggregations on the join column.

It can use a Stream Aggregate, which would generally make more sense, since both the Stream Aggregate and the Merge join require sorted data.

Sense and Sensibility

Under less reasonable circumstances, you may get a Hash Match Aggregate. This plan has the additional misfortune of needing to re-order the data for the Merge Join. Teehee.

Hash Gang

If you see this, something has truly gone wrong in your life. This query was heavily under the hint-fluence.

A Sort Is A Sort

Much more common is seeing a Sort injected into a plan to support one or more downstream operators that require sorted input (Merge Joins, Stream Aggregates, Windowing Functions).

For instance, a query like this:

May give you a plan like this:

Sort early, Sort often

In this case, the Sort happens early on to support the Window Function. It also aids the Stream Aggregate, but whatever. Once data is sorted, the optimizer tends to not un-sort it.

The Sort in the next example will be needed with no index on, or where the join column is not the leading column in the index (there’s an If here, which we’ll get to).

If this is my index on the Votes table, data is sorted by PostId, and then UserId

When my query is just a simple join, like this:

My query plan will look like this, with a big honkin’ Sort in it:

Is kill

On the other hand, if my query looks like this:

My WHERE clause filters the leading column to a single PostId (one Post can be voted on by many Users), the UserId column will already be sorted for that single PostId value.

We won’t need to physically sort data coming out of that.

Like mustard

Out In The Street, They Call It Merge Join

I hope you learned some stuff that you can use when troubleshooting, or trying to understand a Merge Join plan.

This is one of many places that the optimizer may inject a Sort into a plan that you didn’t ask for.

Thanks for reading!

ZIPPER FREE!

Previous Post
Building SQL ConstantCare®: The Minimum Viable Product (MVP)
Next Post
Building SQL ConstantCare®: 10% of you have Priority Boost on.

5 Comments. Leave new

  • In the section right after the graphic showing Many to Many = True, you say “t happens when there are, or may be duplicates in one side of the results.” I would think duplicates on one side would be One to Many and not a problem, as in the first example. I would thing Many to Many = True would happen when there are duplicates in BOTH sides.

    Reply
    • Erik Darling
      April 4, 2018 2:13 pm

      Hi Kalen – yes you’re right, I’ll correct that. Thanks.

      Reply
      • No, that’s not correct. Many-to-many only depends upon whether or not there are duplicates on the OUTER side.


        DECLARE @i TABLE(y INT)
        INSERT @i SELECT 1 UNION ALL SELECT 1

        SELECT
        *
        FROM @i AS x
        INNER MERGE JOIN
        (
        SELECT DISTINCT y FROM @i
        ) AS z (y) ON
        z.y = x.y

        Reply
        • Erik Darling
          April 5, 2018 5:03 pm

          Adam – yeah, but I’ve noticed the optimizer will flip join order a lot of times. The perils of responding from vacation. I’ll rework this section when I get home! Thanks!

          Reply
          • Yes, the optimizer will flip the join order as it sees fit. I think your initial phrasing as criticized by Kalen was mostly correct from a logical perspective — but probably better to say something like “it may occur … if at least one of the two input sets is not guaranteed unique.” From a tuning perspective it is especially important to understand that it’s only the outer input that matters and putting a join hint or other query construct in place to force the order to change can effect a fix.

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.