Faster queries using narrow indexes and CROSS APPLY

Indexing, SQL Server

It’s common to try to build the perfect index for a query: the key contains exactly the right columns, and other, presentational, data is pushed to the INCLUDE section of the index. While the key stays narrow this way, the entire index row can become wide.

A Sample Report

For this blog post, we’re going to work with the StackOverflow dataset. For more information on getting started with the StackOverflow database, check out How to Query the StackExchange database.

We want to find interesting, unanswered questions that have been asked in the last 90 days.

We’ll start by creating some indexes:
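(The original index definitions were shown as images and haven’t survived here. As a stand-in, here’s a sketch of the kind of index the opening paragraph describes – a narrow key with wide INCLUDE columns – assuming the standard StackOverflow schema. The column choices are my guesses, not the original definitions.)

```sql
-- Hypothetical reconstruction (not the original definition): a narrow
-- two-column key, with presentational columns pushed into INCLUDE.
-- Every INCLUDE column still lives in each leaf-level row, so the
-- index rows themselves end up wide.
CREATE NONCLUSTERED INDEX IX_Posts_Wide
    ON dbo.Posts (PostTypeId, CreationDate)
    INCLUDE (Score, OwnerUserId, Title, Tags);
```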

And here’s our sweet query:
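(The query was also an image. This sketch follows the stated goal – unanswered questions from the last 90 days, ranked by recent votes – but the exact predicates and TOP are illustrative, not the original text.)

```sql
-- Illustrative only: questions from the last 90 days with no answers,
-- ranked by how many votes they've picked up recently.
SELECT TOP (100)
       q.Id, q.Title, q.Score, COUNT(v.Id) AS RecentVotes
FROM dbo.Posts AS q
JOIN dbo.Votes AS v
    ON  v.PostId = q.Id
    AND v.CreationDate >= DATEADD(DAY, -90, GETDATE())
WHERE q.PostTypeId = 1                           -- questions
  AND q.CreationDate >= DATEADD(DAY, -90, GETDATE())
  AND NOT EXISTS (SELECT 1 FROM dbo.Posts AS a   -- unanswered
                  WHERE a.ParentId = q.Id)
GROUP BY q.Id, q.Title, q.Score
ORDER BY RecentVotes DESC;
```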

This query is CPU-hungry – it burns nearly 5 seconds of CPU time on my machine between parsing, compilation, and execution – and it does around 1 million logical reads. That’s a few too many reads for my liking. I think we can make it better.

Simple, Small Indexes and APPLY

We need a way to make this query faster and lighter. First, we get rid of the wide index we created before.

We add two more indexes:
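(These definitions were images too; this is the shape described – one narrow index to drive each half of the rewrite – with assumed column choices.)

```sql
-- Assumed reconstructions: narrow keys only, no wide INCLUDE columns.
CREATE NONCLUSTERED INDEX IX_Votes_CreationDate
    ON dbo.Votes (CreationDate, PostId);

CREATE NONCLUSTERED INDEX IX_Posts_Narrow
    ON dbo.Posts (PostTypeId, CreationDate);
```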

 

And then we re-write the query with magic:
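(The rewritten query was an image as well. Here’s a sketch of the shape the next section describes – two CTEs, recent_votes and posts, joined up, with a CROSS APPLY doing a clustered-key lookup for the presentational columns. Column choices and predicates are illustrative; see the comments below for the IS NULL / IS NOT NULL wrinkle in the published version.)

```sql
WITH recent_votes AS (
    -- the narrow index on Votes (CreationDate, PostId) covers this
    SELECT v.PostId, COUNT(*) AS VoteCount
    FROM dbo.Votes AS v
    WHERE v.CreationDate >= DATEADD(DAY, -90, GETDATE())
    GROUP BY v.PostId
),
posts AS (
    -- the narrow index on Posts (PostTypeId, CreationDate) covers this
    SELECT q.Id
    FROM dbo.Posts AS q
    LEFT JOIN dbo.Posts AS answers
        ON answers.ParentId = q.Id
    WHERE q.PostTypeId = 1
      AND q.CreationDate >= DATEADD(DAY, -90, GETDATE())
      AND answers.Id IS NULL                      -- unanswered
)
SELECT TOP (100) details.Title, details.Score, rv.VoteCount
FROM posts AS p
JOIN recent_votes AS rv
    ON rv.PostId = p.Id
CROSS APPLY (SELECT d.Title, d.Score              -- clustered-key lookup
             FROM dbo.Posts AS d
             WHERE d.Id = p.Id) AS details
ORDER BY rv.VoteCount DESC;
```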

The new query performs 800,000 reads but it only takes 1.2 seconds to execute and it goes parallel (1.9 total CPU seconds are used). This is a big improvement over the first query, and I’ll explain how we got there.

What makes the CROSS APPLY faster?

This query has a lot of differences from the first query. Let’s break down what’s going on in here.

We’ve split out the core of the query into two CTEs. In this case, we’re using the CTEs as optimization fences. Rather than let SQL Server do its own thing with optimization, we’re tricking it into providing the optimization path that we want – recent_votes and posts will be optimized separately. Because of this optimization fence, we’ve been able to construct very narrow indexes that bring back the data that we need for each.

After tricking SQL Server into optimizing our CTEs separately, we use CROSS APPLY to pull back the details we want based on a primary key lookup. This helps our query run faster – we need fewer pages allocated for our indexes which means that SQL Server estimates that less memory will be used. The results speak for themselves, too – the second query performs 1/4 the I/O of the first query.

What’s the Result?

The end result is that we have a query that’s fast, and our indexes are free of the bloat caused by wide INCLUDE columns.

If you like this sort of thing, you might be interested in our Advanced Querying & Indexing class this August in Portland, OR.


32 Comments

  • Both the CTE and the CROSS APPLY have nothing to do with the speedup. A CTE has no impact on the plan ever and the CROSS APPLY is converted into a normal inner join.

    • Tobi – wow, that’s an interesting opinion. Go on – I’d like to hear more about your evidence.

      • It is well known that CTEs are inlined into the query plan like views. Therefore I assume you are talking about the CROSS APPLY because that statement is more controversial.

        It is an important capability of the optimizer to convert cross applys and subqueries into joins. This is important so that it does not have to use a loop join to execute them. Once those patterns are converted to joins all three join types are applicable. This is easy to test:

        SELECT *
        INTO #o
        FROM sys.objects

        SELECT *
        FROM #o o1
        CROSS APPLY (SELECT * FROM #o o2 WHERE o1.object_id = o2.object_id) x

        Results in a hash join as if you had written a normal join. This rewrite happens at an early stage. The way this join/cross apply is written has zero query performance impact.

        It really is necessary to compare the before and after plans to see *why* the optimization described in this article worked.

        • Tobi – so you’re saying that views don’t change query plans either, eh?

          In that case, you’ll want to dig a little deeper. Believe it or not, they can! Back when I worked for Quest Software, one of my favorite products was Toad Xpert – it’s like the Index Tuning Wizard, but for T-SQL. It tries a bunch of different possible rewrites for your code that you wouldn’t ordinarily expect to change your plans, but they can.

It’s like that common myth that join order doesn’t matter – the more complex your T-SQL is, the bigger a difference a rewrite can make, even when the versions are logically identical. Here’s a great blog post from Michael J Swart talking about it:

          http://michaeljswart.com/2013/01/joins-are-commutative-and-sql-server-knows-it/

          Enjoy!

          • I would like to see a case where using a view makes a difference because I have never seen one and I read about the way views are inlined.

            I do believe that join order matters and I have seen that often. It’s the join order that causes a difference. Not the presence of a view or CTE.

            I absolutely believe in rewriting queries! It’s just that some changes never ever cause a difference. It is important to attribute improvements to the right things.

          • Tobi – sure, start experimenting with queries just like I did on Michael’s post. The presence of a view or a CTE is effectively join ordering – making the query more or less complex in a way that influences which plans the optimizer tries out first, which helps you in the event of optimization timeouts.

            I do wish I could build repro scripts for every blog comment – I did it for Michael because I had some free time that day – but in this case, you’ll have to continue your learning journey the same way I do: by doing experiments. The way I approach it is to say, “Let’s say my opinion is wrong, and I hold the opposite viewpoint – what experiment would I run to prove my point?” In this case, you could build queries that incur optimization timeouts, then try them with different views and CTEs that effectively produce different join orders, and see the different plans coming up.

            Have fun on your journey! I love doing stuff like that when I want proof.

  • Aaron Morelli
    June 11, 2015 11:08 am

    I agree with Tobi that the CTEs aren’t affecting the optimizer’s logic in the way implied by the article. CTEs might affect the shape of the initial input tree (just because the objects appear in a different order) but the optimizer still has the freedom to reorder the joins however it sees fit. Using the well-known “optimizer tree” trace flags should be helpful here.

However, I disagree with his statement about CROSS APPLY. If CROSS APPLY could always be converted to a normal inner join, Adam Machanic’s oft-discussed “APPLY pattern” wouldn’t be reliable at all, yet it is a usually-reliable way to turn a query that uses hash joins into one that uses loop joins, and Adam uses it frequently to make a Dimension table the “driver table” on the outer side of a parallel nested loop join. So I suspect that it is the CROSS APPLY that is the primary contributor to Jeremiah’s trick.

    I admit that I don’t fully understand when the optimizer can and cannot treat an APPLY operator as just a normal join (since obviously Tobi just provided an example of APPLY being optimized as a hash join). However, there are definitely cases where it cannot, as mentioned by Paul White here, in the “Implementation and Optimization” section:
    http://www.sqlservercentral.com/articles/APPLY/69954/

I like CTEs, and use them. But I try not to associate them with reliable optimization “tricks”, whereas APPLY seems to fit that bill more nicely.

    Cheers!
    Aaron

    • Aaron Morelli
      June 11, 2015 11:09 am

      Yikes… I really just left a comment under the name “Aar”…

I did not mean to imply that CROSS APPLY is *always* converted. It is converted when it implements an equi-join in an obvious way. Adam’s trick involves using a TOP, which prevents this rewrite from happening.

      In fact you can disable this rewrite by saying TOP 100000000000000 or by using other such hacks.
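Against Tobi’s temp-table example above, that hack looks something like this (a sketch; the absurdly large TOP is the point, not the exact number):

```sql
-- The redundant TOP blocks the apply-to-join rewrite, so the optimizer
-- keeps a nested loops apply instead of considering a hash join.
SELECT *
FROM #o AS o1
CROSS APPLY (
    SELECT TOP (9223372036854775807) *
    FROM #o AS o2
    WHERE o2.object_id = o1.object_id
) AS x;
```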

      I wish the SQL Server team would provide a reliable way to introduce an optimization barrier. Something like WITH (MATERIALIZE) or WITH (SUBQUERY) to force materialization or at least force separate evaluation of arbitrary parts of the query.

      • Aaron Morelli
        June 11, 2015 11:36 am

        I’d definitely be interested in an in-depth post by Paul or Dima (or a Microsoft employee, of course, though they tend to post shorter articles) describing what we can and can’t rely on when using APPLY.

Back when I was converting lots of slow or volatile SQL 2000 code to 2005, I discovered that I could get more consistently-fast query plans by converting some of the non-correlated sub-queries to APPLY. This approach seemed to lock in the order in which tables were accessed, to some extent (i.e. tables in the APPLY construct were always accessed after the outer table(s) whose columns were used to correlate, matching the “left-to-right” ordering that Paul mentions in the article I linked), so I’d just place the APPLY clauses in the place where I wanted the tables to be accessed (i.e. after lots of filtering joins had been applied). My results were consistent enough that I grew to rely on this approach.

        However, I never understood WHY the optimizer always seemed to do this. Maybe it was just because the queries had enough tables that the optimizer couldn’t consider the rewrites that you’re referencing.

        Cheers!
        Aaron

Can you elaborate a bit more on how CROSS APPLY is making that faster? You mentioned “we use CROSS APPLY to pull back the details we want based on a primary key lookup” – could you elaborate there or expand the post a bit?

    Maybe a simple pic or query plan before/after?

    I’d just like to better understand the impact of Cross Apply.

    Thanks!

    • In this case, SQL Server is rewriting the CROSS APPLY into an INNER JOIN. No matter how this is accomplished, part of the benefit in this query is that the optimizer is able to use a narrow index in the posts CTE. Using that small index decreases the I/O and memory requirements for the query and leads to some efficiencies further down in the plan.

      Because we don’t have a lot of data moving through the early stages of the query, less memory is required and less work has to be done.

The lookup in the CROSS APPLY is a by-product of the index tuning – it could be written in different ways, but the CROSS APPLY has, in my experience, given me better optimization results when I’m doing something like this. In this specific case, there’s no plan difference between CROSS APPLY and INNER JOIN.
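To make that equivalence concrete, here’s a minimal sketch (the #ids temp table of post ids is hypothetical) – both forms of the key lookup get the same plan in a case like this:

```sql
-- The lookup written as a CROSS APPLY...
SELECT i.Id, d.Title
FROM #ids AS i
CROSS APPLY (SELECT d.Title
             FROM dbo.Posts AS d
             WHERE d.Id = i.Id) AS d;

-- ...and as the equivalent INNER JOIN.
SELECT i.Id, d.Title
FROM #ids AS i
INNER JOIN dbo.Posts AS d
    ON d.Id = i.Id;
```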

      If you want to try this at home, you can check out How to query the StackExchange databases for info on how to import the StackOverflow database.

  • In the posts CTE, why do you left join to answers, but use answers.id IS NOT NULL in the WHERE clause? Does it perform better than using an inner join here?

A view or CTE would not NECESSARILY force join order. If there are many CTEs or views involved in joins, it can confuse the optimizer.

  • Geoff Patterson
    June 11, 2015 3:57 pm

    I think that the ” AND answers.Id IS NOT NULL” line in the posts CTE of the second query should read “AND answers.Id IS NULL”. Since it is trying to replicate a NOT EXISTS, it should be confirming that the LEFT JOIN was not matched. (I was wondering why the two queries returned different results on my instance of StackOverflow!)

Unfortunately, the refactored query runs about 50% slower than the original query on my machine after making that fix so that it generates the same results.

    One other thing that I found confusing is the term “optimization fence” as applied to CTEs here. This is a more specific technical term in Postgres (e.g., see http://stackoverflow.com/a/17583736/1582862), and based on my prior experience SQL Server does not treat CTEs as an optimization fence. Instead, SQL will happily move logic into or out of the “CTE region” of the query plan during optimization in order to generate the lowest cost query plan. This is quite different than the Postgres behavior, in which the optimizer truly treats the CTE as a fence and would not push a predicate into the CTE even in cases where it would be an obvious optimization to do so. Both behaviors can be useful (it would be great to have the option to apply either behavior!), but it’s important to remember that a (non-recursive) CTE in SQL Server is simply an alternative way of phrasing a sub-query.

    • Regarding the logic, see my response to Nicholai.

      Regarding CTEs – I’ve typically seen SQL Server behave closer to PostgreSQL than what you describe.

      • Geoff Patterson
        June 11, 2015 4:22 pm

        Got it, I had opened the post this morning and just got around to reading it. I didn’t see that the query had since been corrected, so thanks for pointing that out. It seems like the LEFT JOIN and IS NOT NULL check could simply be converted to an INNER JOIN in the new query, so the fact that it used a LEFT JOIN was why I had assumed that the NOT EXISTS was the behavior you were going for.

        That’s an interesting viewpoint on CTEs. It doesn’t necessarily reflect what I have seen (or some other posters, judging from the comments) in the specific CTEs that I have worked with. However, it is something to keep an eye on, so thanks for bringing up the possibility. I particularly agree with Brent’s point that, as query complexity grows, re-writes such as using a CTE or changing join order can be increasingly important. Here is a way that I have previously summarized my intuition about that (though not sure if it’s 100% correct!):

“Because the solution space of potential execution plans is so large (it grows exponentially with the number of objects involved in the query), SQL Server is only going to be able to explore a tiny fraction of the potential query plans when coming up with a plan for very complex queries. Remember that SQL Server’s job isn’t to create the best query plan possible, but instead to create a query plan that is good enough and to do so as quickly as possible.

        I have often found that small changes in the way a query is formulated, even if they don’t impact the logic of the query, can have a significant impact on the query plan. Anecdotally, this grows more and more likely as the query grows more and more complex. One possible reason that this could happen is that a tweak to the query might cause SQL Server to begin cost-based optimization with a different initial plan. As cost-based optimization proceeds, this different starting point could yield a different exploration of the space of potential query plans–kind of like a different random seed impacts random number generation. Note that the query plans you provided are significantly different (compare images of the plan shape below!) and SQL Server actually does estimate that the @table variable plan is slightly cheaper.”

        In any case, thanks for this post which has clearly spurred some thought and friendly debate!

  • I think the only reason the query was faster was because of the typo “IS NOT NULL” instead of “IS NULL”

    After correcting that, I’m back to same number of logical reads as the first query.

  • Harrison Brock
    June 24, 2015 5:29 pm

    Thanks Jeremiah. This was a great article.

I found this article http://www.toadworld.com/platforms/sql-server/w/wiki/9466.how-to-import-the-stackoverflow-xml-into-sql-server.aspx which is a SQL script for loading XML data into SQL Server using OPENROWSET.

  • Is part of the “performance gain” from the indexing that occurred in the second example? Did you test the first example again with similar optimal indexes?

I converted a nested subquery last night into a CROSS APPLY and my query now takes 100% longer (2 minutes) than it did previously. The subquery is part of a CASE statement and an IN statement as well. Thanks!

  • George Burten
    May 29, 2017 3:41 am

Not a solution to the problem. CROSS APPLY is still very slow.

