When You’re Troubleshooting Blocking, Look at Query #2, Too.

When I’m troubleshooting a blocking emergency, the culprit is usually the query at the head of a blocking chain. Somebody did something ill-advised like starting a transaction and then locking a whole bunch of tables.

But sometimes, the lead blocker isn’t the real problem. It’s query #2.

Here’s a sample scenario:

  1. A long-running select with nolock starts, taking out a schema stability lock: nobody can change the structure of the table while he runs
  2. An online index operation needs to finish up, which needs a schema modification lock on that same table in order to switch in the newly built index – but he can’t do that until select #1 finishes
  3. Other selects with nolock start – but they need a schema stability lock, which they can’t get until #2 finishes

Most monitoring tools will say that #1 is the lead blocker, and they’re technically correct. However, it’s really confusing for users because they look at the monitoring tool and ask questions like:

  • “How can that #1 query possibly block someone else?”
  • “What could I even do to reduce the impact of select #1?”
  • “How are a bunch of selects with nolock being blocked by another select with nolock?”

Reproducing the problem with an “online” index rebuild

Let’s reproduce it with the Stack Overflow database. I’m going to put a tiny index on the Posts table:

That index only has a few rows in it, so it’ll be quick to rebuild – and the rebuild is what I’m going to use to illustrate the problem.

Query #1: To start our blocking chain, I kick off a long-running select with nolock:

Query #2: I kick off an online index rebuild on that tiny filtered index – which would normally happen instantly, but it needs a schema modification lock to switch in the new index:

That query is blocked by #1, which wouldn’t be a big deal, but now…

Queries #3: I fire off Query #1 again, but this time in SQLQueryStress so I can throw a lot of sessions at it at once:

They’re all blocked.

The problem is easy to diagnose with sp_WhoIsActive.

Here’s what the situation looks like in sp_WhoIsActive @find_block_leaders = 1, which helpfully organizes the blocking chains in a row:

The lead blocker looks like a select with nolock – but if you’re going to troubleshoot something, that’s not the query to troubleshoot. He’s fairly harmless. Instead, you have to go down a level to figure out who’s blocking the rest. I just love this approach.

It's the one between the aubergine query and the eggplant query.
See the index rebuild?

But the reason we’re gathered here today, dear congregation, is because I had to troubleshoot this exact issue for a client. They were seeing strange blocking problems that didn’t make sense: all day long, random queries that never should have been a lead blocker were suddenly blocking dozens or hundreds of other queries out of nowhere.

Their monitoring tool just listed all of the waiting statements in a pile – without distinguishing query #2, which was the real problem. The client didn’t realize there was a hidden “online” index operation in that colorful mess. That was the real culprit, but the monitoring tool only showed query #1 as the lead blocker – and since query #1 was different every time the index rebuild would sneak in, it was nearly impossible to troubleshoot.

In this case, SQL Server 2014 & newer has an easy solution.

SQL Server 2014 introduced a few new options for index rebuilds:

Now, when I rerun the same scenario, sp_WhoIsActive looks wildly different:

Only the “online” index rebuild is blocked, and he has a different wait type – he’s waiting at a low priority. He’s sitting by in the background, graciously letting other queries get in ahead of him.

If you use Ola Hallengren’s IndexOptimize script, use the parameters for WaitAtLowPriorityMaxDuration and WaitAtLowPriorityAbortAfterWait to configure these options.

Want to learn more about troubleshooting this stuff?

Check out my How I Use the First Responder Kit class. I walk you through using sp_BlitzWho, sp_BlitzFirst, sp_BlitzIndex, and many of the other scripts in the open source First Responder Kit.

Previous Post
How Scalar User-Defined Functions Slow Down Queries
Next Post
What It Takes To Write Two Blog Posts

2 Comments. Leave new

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.