Parallelism Can Make Queries Perform Worse.

While I was building lab queries for my all-new Fundamentals of Parameter Sniffing course – first live one is next week, still time to get in – I ran across a query with delightfully terrible behavior.

I’m always torn when I build the hands-on labs. I want to make the challenges easy enough that you can accomplish ’em in the span of an hour, but hard enough that they’ll … well, challenge you.

It’s tricky walking that line.

This query, though…this query, even though it’s just one table with one nonclustered index, crosses the line way past what you’d probably pull off in a Fundamentals class. It’s just so beautifully diabolical that I gotta share it somewhere, so here we are, dear reader.

Let’s query the Posts table to find new questions & answers.

I’m using the 50GB 2013 version of the Stack Overflow database – large enough that it’s got some meat to it, but small enough that you can still download it quickly to follow along (it’s just 10GB compressed.)

Every day, all day long, people are asking questions and posting answers at StackOverflow.com. As new questions & answers are posted, they start with CreationDate = right now, and Score = 0.

Let’s say that our users want to find posts with a minimum amount of Score points, Answers, and Comments, and we’ve written this stored proc to help ’em:
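The original proc isn’t shown here, so here’s a sketch of the kind of proc being described – the name, column list, and TOP count are my placeholders, reconstructed from the predicates and the ORDER BY discussed below, not necessarily the exact lab code:

```sql
-- Hypothetical reconstruction of the proc described in the post.
CREATE OR ALTER PROC dbo.usp_FindInterestingPosts
    @ScoreMin INT,
    @AnswerCountMin INT,
    @CommentCountMin INT
AS
BEGIN
    SELECT TOP 1000 Id, Title, CreationDate, Score, AnswerCount, CommentCount
      FROM dbo.Posts
     WHERE Score        >= @ScoreMin
       AND AnswerCount  >= @AnswerCountMin
       AND CommentCount >= @CommentCountMin
     ORDER BY CreationDate DESC;  -- newest posts first
END
GO
```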

Sometimes our users want to catch brand-new questions before anyone else has jumped in. Those users call it with a very LOW ScoreMin, AnswerCountMin, and CommentCountMin. Other users are looking for a leaderboard – they just wanna see the highest-ranking questions so they can go read to learn. Those users call it with a very HIGH ScoreMin, AnswerCountMin, and CommentCountMin.

With that in mind, here are the two parameter sets we’ll test with:
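Something along these lines, assuming a proc named dbo.usp_FindInterestingPosts (my placeholder name) – the low set matches the “catch brand-new questions” users, the high set matches the leaderboard users:

```sql
-- Low thresholds: brand-new, unanswered questions.
EXEC dbo.usp_FindInterestingPosts @ScoreMin = 1,   @AnswerCountMin = 0,  @CommentCountMin = 0;

-- High thresholds: only the heavy hitters.
EXEC dbo.usp_FindInterestingPosts @ScoreMin = 250, @AnswerCountMin = 10, @CommentCountMin = 10;
```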

We give those a shot, and we get lucky – our query runs pretty quickly. It just so happens there’s an index that our query will use, so we’re satisfied with these actual plans since the two queries combined run in just a few seconds:

We do a quick glance, and it’s using our index, and the actual numbers are pretty small – we’re good here, right? Ship it.

Eagle-eyed readers will note that this query does an index scan, but – that’s not necessarily a bad thing at all! In this case, it’s quite appropriate. SQL Server sniffs the parameters for the first query (@ScoreMin = 1) and says, “It’ll be really easy to find rows with such low Score & Comment & Answer numbers. I could just scan the posts from most recent CreationDate down, and do key lookups for each of ’em. I won’t have to look up many rows at all before I find enough rows with the right minimums for CommentCount & AnswerCount!” And he’s right. In case you’re following along at home, here’s the definition for that index on Posts:
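The index definition didn’t survive into this copy of the post, but based on the name referenced later (CreationDate_Score) and the plan behavior described – a scan ordered by CreationDate plus key lookups for the other columns – it was plausibly something like this (exact keys are my guess):

```sql
-- Plausible reconstruction of the index the plan uses.
CREATE INDEX IX_CreationDate_Score
    ON dbo.Posts (CreationDate, Score);
```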

But try running them in the opposite order, and parallelism makes things terrible.

Free that plan from the cache, and then run them in the opposite order:
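One way to do that – sp_recompile marks just this proc’s plan for removal, so you don’t have to nuke the whole plan cache (proc name is my placeholder for the lab’s proc):

```sql
-- Evict only this proc's cached plan, then run big-parameters-first.
EXEC sp_recompile N'dbo.usp_FindInterestingPosts';

EXEC dbo.usp_FindInterestingPosts @ScoreMin = 250, @AnswerCountMin = 10, @CommentCountMin = 10;
EXEC dbo.usp_FindInterestingPosts @ScoreMin = 1,   @AnswerCountMin = 0,  @CommentCountMin = 0;
```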

And, uh oh – this isn’t good – the first query is quick, but the second query takes almost a minute. The secret is in the sauce, and by sauce I mean actual execution plans:

Let’s walk through what happened when the first proc call ran:

  • This time, SQL Server optimized the proc’s plan looking for Posts with Score >= 250, AnswerCount >= 10, and CommentCount >= 10
  • It knows there are WAY fewer Posts that match those predicates
  • It still didn’t want to do a table scan – it still wanted to use the luscious CreationDate index because it satisfies the ORDER BY
  • But it knew it would have to read a lot of rows from that index in order to find the matching ones
  • So it decided to parallelize the scan on the CreationDate_Score index, letting multiple threads simultaneously work on doing all those key lookups
  • But because of that, when the CreationDate index scan is suddenly scattered across multiple worker threads, the threads’ output is no longer sorted by the index, and has to be re-sorted

This works out just fine for those parameters because there are relatively few rows with Score >= 250, AnswerCount >= 10, and CommentCount >= 10. It’s not hard to sort those rows. Piece of cake, the query’s done in under a second. This is a completely appropriate use for parallelism.

However, let’s walk through what happens when the second proc call runs:

  • We’re reusing the plan built for Score >= 250, AnswerCount >= 10, and CommentCount >= 10
  • But there are way MORE rows with Score >= 1, AnswerCount >= 0, and CommentCount >= 0 – millions, in fact
  • Originally, when we only used 1 CPU core, we could just scan the index backward on one thread and stop doing key lookups as soon as we’d found enough rows to match the TOP 1000 in our query
  • But now, since the work was parallelized, all of the threads have to check all of the rows in the index, doing key lookups for them
  • And then of course the poor sort is screwed too, because he didn’t allocate enough memory to sort all these rows

And bam, we’re in a world of hurt.

So what’s the fix?

There are lots of different options here: query tuning, index tuning, hints, plan guides, you name it – but I’m going to use one you probably wouldn’t expect just to show you that it works. I’m gonna stick an OPTION (MAXDOP 1) hint on the query:
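Applied to the sketch of the proc from earlier (names are my placeholders), the hint goes on the statement itself:

```sql
-- Same proc, now forced to a serial plan for this one query.
CREATE OR ALTER PROC dbo.usp_FindInterestingPosts
    @ScoreMin INT,
    @AnswerCountMin INT,
    @CommentCountMin INT
AS
BEGIN
    SELECT TOP 1000 Id, Title, CreationDate, Score, AnswerCount, CommentCount
      FROM dbo.Posts
     WHERE Score        >= @ScoreMin
       AND AnswerCount  >= @AnswerCountMin
       AND CommentCount >= @CommentCountMin
     ORDER BY CreationDate DESC
    OPTION (MAXDOP 1);  -- keep the backward index scan on a single thread
END
GO
```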

With that change in place, there are no differences between the plans no matter which parameters go in first. Both queries get the first plan we discussed, and both run in a second or less. A similar change would involve setting Cost Threshold for Parallelism and MAXDOP appropriately at the server level to reduce the blast radius, but of course that would affect many more queries.
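For reference, the server-level knobs are set with sp_configure – the numbers below are illustrative examples, not recommendations, and these settings affect every query on the instance:

```sql
-- Server-wide parallelism settings (example values only).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 50;  -- plans cheaper than this stay serial
EXEC sp_configure 'max degree of parallelism', 4;        -- cap threads per parallel operator
RECONFIGURE;
```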

Is a parallelism hint the right fix? Well, there’s a whole buffet of choices here. Heck, one could spend an entire day just discussing the fundamental issue at hand, or even three days mastering it. Which reminds me – this demo’s too hard for the fundamentals class, but if I add in another index and use a third set of parameters, introducing 3 unique query plan combinations – yep, that’s perfectly diabolical for a Mastering level lab. See you in class!


3 Comments

  • I’ve had other, less technical cases where parallelism – or too much of it – made things worse. We had a really busy report server that was super slow. We started by throwing RAM at it to no effect, then more CPUs, and increased MAXDOP, and it took a total nosedive. In one of the databases, about 40 thousand queries were running each hour (we figured out years after the fact that it was an obscene number of “personal” copies of almost identical reports used by thousands of users), and when they tried to execute they would wait and wait and wait on threads. We tried MAXDOP 2 for a bit – sometimes the reports would be extremely fast, other times slow – and ended up settling on 1. It was still slow, but we didn’t have 5-page reports running for 15 minutes anymore.

  • Pedro Mogga
    May 26, 2020 11:48 pm

    Brent … would this not be a case for using a recompile hint on the statement … or indeed the proc? I am assuming that it’s not a proc executed many times every minute.

    I see countless overnight processes that benefit massively from recompiles due to bad param sniffing – e.g. the same procs called 50 times during an 8-hour batch period. A lot of customers seem to be of the opinion that recompiles are a no-go area. I tell them that’s not really the case with procs that run only a few times an hour. The CPU cost of recompiling vs. getting a good plan (nearly) every time is no doubt covered in your courses.

