When You’re Troubleshooting Blocking, Look at Query #2, Too.

Last Updated 3 years ago

When I’m troubleshooting a blocking emergency, the culprit is usually the query at the head of a blocking chain. Somebody did something ill-advised like starting a transaction and then locking a whole bunch of tables.

But sometimes, the lead blocker isn’t the real problem. It’s query #2.

Here’s a sample scenario:

A long-running select with nolock starts, taking out a schema stability lock: nobody can change the structure of the table while he runs
An online index operation needs to finish up, which needs a schema modification lock on that same table in order to switch in the newly built index – but he can’t do that until select #1 finishes
Other selects with nolock start – but they need a schema stability lock, which they can’t get until #2 finishes

Most monitoring tools will say that #1 is the lead blocker, and they’re technically correct. However, it’s really confusing for users because they look at the monitoring tool and ask questions like:

“How can that #1 query possibly block someone else?”
“What could I even do to reduce the impact of select #1?”
“How are a bunch of selects with nolock being blocked by another select with nolock?”

Reproducing the problem with an “online” index rebuild

Let’s reproduce it with the Stack Overflow database. I’m going to put a tiny index on the Posts table:

CREATE INDEX AlmostEmpty ON dbo.Posts(Id) 
WHERE Id >= 1 AND Id <= 10;

1 2	CREATE INDEX AlmostEmpty ON dbo.Posts(Id) WHERE Id >= 1 AND Id <= 10;

That index only has a few rows in it, so it’ll be quick to rebuild – and the rebuild is what I’m going to use to illustrate the problem.

Query #1: To start our blocking chain, I kick off a long-running select with nolock:

SELECT COUNT(DISTINCT Title) FROM dbo.Posts WITH (NOLOCK);

1	SELECT COUNT(DISTINCT Title) FROM dbo.Posts WITH (NOLOCK);

Query #2: I kick off an online index rebuild on that tiny filtered index – which would normally happen instantly, but it needs a schema modification lock to switch in the new index:

ALTER INDEX AlmostEmpty ON dbo.Posts 
REBUILD WITH (ONLINE = ON);

1 2	ALTER INDEX AlmostEmpty ON dbo.Posts REBUILD WITH (ONLINE = ON);

That query is blocked by #1, which wouldn’t be a big deal, but now…

Queries #3: I fire off Query #1 again, but this time in SQLQueryStress so I can throw a lot of sessions at it at once:

They’re all blocked.

The problem is easy to diagnose with sp_WhoIsActive.

Here’s what the situation looks like in sp_WhoIsActive @find_block_leaders = 1, which helpfully organizes the blocking chains in a row:

The lead blocker looks like a select with nolock – but if you’re going to troubleshoot something, that’s not the query to troubleshoot. He’s fairly harmless. Instead, you have to go down a level to figure out who’s blocking the rest. I just love this approach.

It's the one between the aubergine query and the eggplant query. — See the index rebuild?

But the reason we’re gathered here today, dear congregation, is because I had to troubleshoot this exact issue for a client. They were seeing strange blocking problems that didn’t make sense: all day long, random queries that never should have been a lead blocker were suddenly blocking dozens or hundreds of other queries out of nowhere.

Their monitoring tool just listed all of the waiting statements in a pile – without distinguishing query #2, which was the real problem. The client didn’t realize there was a hidden “online” index operation in that colorful mess. That was the real culprit, but the monitoring tool only showed query #1 as the lead blocker – and since query #1 was different every time the index rebuild would sneak in, it was nearly impossible to troubleshoot.

In this case, SQL Server 2014 & newer has an easy solution.

SQL Server 2014 introduced a few new options for index rebuilds:

ALTER INDEX AlmostEmpty ON dbo.Posts REBUILD 
WITH (ONLINE = ON(WAIT_AT_LOW_PRIORITY 
	(MAX_DURATION = 1 MINUTES, ABORT_AFTER_WAIT = SELF)));

1

2

3

ALTER INDEX AlmostEmpty ON dbo.Posts REBUILD

WITH (ONLINE = ON(WAIT_AT_LOW_PRIORITY

(MAX_DURATION = 1 MINUTES, ABORT_AFTER_WAIT = SELF)));

Now, when I rerun the same scenario, sp_WhoIsActive looks wildly different:

Only the “online” index rebuild is blocked, and he has a different wait type – he’s waiting at a low priority. He’s sitting by in the background, graciously letting other queries get in ahead of him.

If you use Ola Hallengren’s IndexOptimize script, use the parameters for WaitAtLowPriorityMaxDuration and WaitAtLowPriorityAbortAfterWait to configure these options.

Want to learn more about troubleshooting this stuff?

Check out my How I Use the First Responder Kit class. I walk you through using sp_BlitzWho, sp_BlitzFirst, sp_BlitzIndex, and many of the other scripts in the open source First Responder Kit.

2 Comments. Leave new

Chad Franklin
November 20, 2020 10:52 am

Wow. Really interesting post. Thanks for sharing.

Reply
- Brent Ozar
  November 20, 2020 12:11 pm
  
  Thanks!
  
  Reply