Network Issue or THREADPOOL waits?


I’ve had a handful of clients that were experiencing significant THREADPOOL waits. This isn’t a wait type that I had any knowledge of prior to joining Brent Ozar Unlimited. Looking back, I am pretty sure that a few servers that I’ve been responsible for over the years were experiencing it based on the symptoms.

WebOps: For a few minutes last night the app servers couldn’t connect to SQL. Can you check if there was a SQL problem during that time?

Me: <checks SQL Server Error Log and our fancy shmancy monitoring tools>

Me: I didn’t see anything in the log, but it appears to be a network issue since the monitoring tools couldn’t connect either. CPU utilization was pretty low during that time.

WebOps: <checks with network team>

Network team: I don’t see any network issues last night.

Me: <still thinks it’s a network issue>

Was it a network issue?

I believe these servers were experiencing THREADPOOL waits at night due to blocking caused by the various database maintenance jobs that ran during that time frame. One such job would purge data at night. It did the DELETEs in ~10,000-row batches, but it still caused some blocking. Blocking can lead to THREADPOOL waits if the blocking doesn’t clear before the server runs out of worker threads. I’m simplifying, of course.
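
Here’s a rough sketch of what that kind of purge job looks like; the table, column, and retention values are made up for illustration:

    /* Hypothetical nightly purge job: delete old rows in ~10,000-row batches.
       Table, column, and retention values are made up for illustration. */
    DECLARE @RowsDeleted INT = 1;

    WHILE @RowsDeleted > 0
    BEGIN
        DELETE TOP (10000)
        FROM dbo.AuditLog
        WHERE CreatedDate < DATEADD(DAY, -90, GETDATE());

        SET @RowsDeleted = @@ROWCOUNT;

        /* Even in small batches, each DELETE takes locks that can block
           the application queries hitting the same table. */
    END;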

Imagine this scenario:

  • A job kicks off that modifies a bunch of data and holds locks on critical tables for a few minutes.
  • You’ve got several application servers that are trying to run queries against those critical tables.
  • Many of the application server queries are blocked by the job.
  • More and more queries are coming in wanting to get data back from those critical tables. They’re all blocked, timing out and retrying.

Eventually the server is going to run out of worker threads. When this happens, new connections will get errors. You won’t even be able to connect to the server in SSMS unless you know to use the Remote DAC. During this time, THREADPOOL waits are happening.
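
If you know to connect over the Remote DAC (the admin: prefix in SSMS or sqlcmd), a rough sketch like this shows who’s stuck waiting on THREADPOOL and who’s doing the blocking while the event is happening:

    /* Run over the DAC during the incident. Shows tasks waiting on THREADPOOL
       plus anything that is blocked, and the session doing the blocking.
       THREADPOOL waiters may show a NULL session_id since they have no worker yet. */
    SELECT wt.session_id,
           wt.wait_type,
           wt.wait_duration_ms,
           wt.blocking_session_id
    FROM sys.dm_os_waiting_tasks AS wt
    WHERE wt.wait_type = N'THREADPOOL'
       OR wt.blocking_session_id IS NOT NULL
    ORDER BY wt.wait_duration_ms DESC;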

What do you do if you are experiencing THREADPOOL waits?

A couple of my clients have been told by Microsoft to increase the “max worker threads” setting. It’s no surprise that increasing it didn’t help. Increasing it can lead to memory issues: each worker thread uses 2MB of memory. That doesn’t sound like a lot, but remember we’ve increased the number of worker threads on the server. Maybe you delay when THREADPOOL happens since you have more worker threads to hand out, but maybe you’re now encountering RESOURCE_SEMAPHORE or RESOURCE_SEMAPHORE_QUERY_COMPILE waits instead.
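
As a rough sketch, you can compare what the server is configured for against how many worker threads it’s actually handing out:

    /* Configured "max worker threads" (0 = let SQL Server calculate it)
       and the calculated maximum the server will actually use. */
    SELECT c.value_in_use AS configured_max_worker_threads,
           i.max_workers_count AS calculated_max_workers
    FROM sys.configurations AS c
    CROSS JOIN sys.dm_os_sys_info AS i
    WHERE c.name = N'max worker threads';

    /* Workers per scheduler, and how many tasks are queued waiting for one. */
    SELECT scheduler_id,
           current_workers_count,
           active_workers_count,
           work_queue_count
    FROM sys.dm_os_schedulers
    WHERE status = N'VISIBLE ONLINE';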

You have to get to the root cause of the problem.

If you’ve got monitoring tools, you should be able to see what’s leading up to the problem, even though there’ll be a gap in the monitoring data during the event itself: once the server has run out of worker threads, the monitoring tools will be getting connection errors too.

Is blocking the only reason to experience THREADPOOL waits?

Blocking is the most common culprit behind THREADPOOL waits. They can also be caused by a massive number of connections actively trying to run queries at the same time; idle connections alone won’t do it.

Find the blockers and “fix” them. That’s easier said than done and isn’t going to be covered here, but check out this post.
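
If you just want a quick look at who’s blocking whom right now, a minimal sketch like this works (sp_WhoIsActive or sp_BlitzWho will give you far more detail):

    /* Quick look at current blocking: each blocked request, its immediate
       blocker, and the text of the blocked query. */
    SELECT r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time,
           t.text AS blocked_query
    FROM sys.dm_exec_requests AS r
    OUTER APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.blocking_session_id <> 0;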

If only I could turn back time!

I’d love to go back and troubleshoot the servers that I blamed on network issues but now suspect were experiencing THREADPOOL waits. I get to troubleshoot this with clients, but I’d love to know whether my theory is correct for the servers I was responsible for in the past.

Brent says: I had this exact same reaction when I first understood what THREADPOOL was doing to my server. “Awww, man, I’ve been haunted by this for years….” I think that’s one of the big perks of being a constant proactive learner like yourself, dear reader – you’re going to read symptoms and then recognize them when disaster strikes.


Comments

  • great post, Tara! very interesting.

  • We get a similar symptom with our server (everything is locked up, can’t connect, etc) when our query plans go parallel in a nasty way with blocking – our query plans are now taking 4 threads where they normally would only take 1, and we hammer the server and take away all of the available worker threads. It usually is a statistics problem and an update to them (along with the recompile that happens because of it) usually fixes the problem in our case.
    I set up a process that runs in a continual loop (that way it has its worker thread already allocated) that counts sys.dm_os_tasks and saves it off so that I can see how often we run out of worker threads. It hasn’t pointed out what the problem is, but now I can pinpoint when it happens and know why.
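
    Something along these lines would do that (just a sketch; the logging table name is made up, and the loop has to be started while worker threads are still available):

      /* Hypothetical logging loop: keeps its own worker thread and records
         how many tasks are running once a minute. */
      CREATE TABLE dbo.WorkerThreadLog
      (
          log_time    DATETIME2(0) NOT NULL DEFAULT SYSDATETIME(),
          task_count  INT NOT NULL,
          max_workers INT NOT NULL
      );

      WHILE 1 = 1
      BEGIN
          INSERT INTO dbo.WorkerThreadLog (task_count, max_workers)
          SELECT COUNT(*),
                 (SELECT max_workers_count FROM sys.dm_os_sys_info)
          FROM sys.dm_os_tasks;

          WAITFOR DELAY '00:01:00';
      END;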

  • Thanks Tara and Brent for sharing your knowledge.
    Mostly I read your blog when I’m on the way to or from work and I’m always happy to learn new stuff about MS SQL.
    All the best for 2018 to all of you
    Gerald

  • Thank you for your post. This story sounds too familiar; we are going through it right now. We initially focused on network issues and finally found threadpool waits. We started watching this wait very closely and discovered that when we get what we call a “DB connection timeout burst” (aka a short period of time where our user DB connection attempts would time out), we also see a significant jump in threadpool waits. We have identified several opportunities in our SW that our dev team is working to resolve, etc. For short term solutions we have increased the cost threshold from 5 to 25, then 50, then 100 (perhaps we will leave it at 100 since the HW is pretty kickass). This has helped and has reduced the incidents quite a bit. Our MAXDOP is currently set to 8 (4 NUMA nodes, 32 cores). What are your thoughts on decreasing MAXDOP temporarily to help with threadpool waits?

    • Decreasing MAXDOP can help, but you’ll likely still have issues. The key is fixing the root cause. What is causing the blocking?
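
      For what it’s worth, both of the settings mentioned above are plain sp_configure changes, so they’re easy to adjust and roll back while you chase the root cause. A rough sketch follows; the specific values are just the ones discussed in this thread, not recommendations:

        /* Adjusting the parallelism-related settings mentioned above.
           These specific values come from this discussion, not a recommendation. */
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;

        EXEC sp_configure 'cost threshold for parallelism', 50;
        EXEC sp_configure 'max degree of parallelism', 4;
        RECONFIGURE;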

  • Great post Tara.

    I’m wondering, other than constantly monitoring sys.dm_os_tasks (like one of the previous comments mentioned) is there anything written to the SQL Server Event Log that clearly indicates you have suffered from complete depletion of worker threads?

    I’m wondering if there is an easy way to retroactively look for these symptoms…

    • You’d be looking for the THREADPOOL waits. Use sp_BlitzFirst @SinceStartup=1 to see what you’ve got. Or any wait stat query.
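
      If you’d rather roll your own, a minimal wait stats query for this looks something like:

        /* Cumulative THREADPOOL waits since the last restart. A non-zero
           waiting_tasks_count means the server has run low on workers at some point. */
        SELECT wait_type,
               waiting_tasks_count,
               wait_time_ms,
               max_wait_time_ms
        FROM sys.dm_os_wait_stats
        WHERE wait_type = N'THREADPOOL';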

    • Martin – you may see messages that say “New queries assigned to process on node (n) have not been picked up by a worker thread in the last (n) seconds.” You can also try filtering the error log on ‘exhaust’ for similar messages.
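
      Searching the current error log for those messages can be done with something like this:

        /* Search the current error log (file 0) for worker thread messages.
           The second parameter (1) means the SQL Server error log. */
        EXEC sys.xp_readerrorlog 0, 1, N'worker thread';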

  • Klaus Aschenbrenner has a nice article on this: Troubleshooting THREADPOOL Waits
    http://www.sqlpassion.at/archive/2011/10/25/troubleshooting-threadpool-waits/

  • It would probably be best to schedule a job at a time when you know the database won’t be used that much. Or am I missing something? Never had a case like this so I’m just wondering. 🙂

