I’ve had a handful of clients that were experiencing significant THREADPOOL waits. This isn’t a wait type that I had any knowledge of prior to joining Brent Ozar Unlimited. Looking back, I am pretty sure that a few servers that I’ve been responsible for over the years were experiencing it based on the symptoms.
WebOps: For a few minutes last night the app servers couldn’t connect to SQL. Can you check if there was a SQL problem during that time?
Me: <checks SQL Server Error Log and our fancy shmancy monitoring tools>
Me: I didn’t see anything in the log, but it appears to be a network issue since the monitoring tools couldn’t connect either. CPU utilization was pretty low during that time.
WebOps: <checks with network team>
Network team: I don’t see any network issues last night.
Me: <still thinks it’s a network issue>
Was it a network issue?
I believe these servers were experiencing THREADPOOL waits at night due to blocking caused by the various database maintenance jobs that ran during that time frame. One such job purged data at night. It did the DELETEs in ~10,000 row batches, but it still caused some blocking. Blocking can lead to THREADPOOL waits if the blocking doesn’t clear before the server runs out of worker threads. I’m simplifying, of course.
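That kind of purge job looks roughly like this; the table and column names here are made up for illustration:

```sql
/* Hypothetical nightly purge: delete old rows in ~10,000 row batches.
   Each batch takes (and releases) locks, so blocking comes in waves. */
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM dbo.EventLog
    WHERE CreatedDate < DATEADD(DAY, -90, GETDATE());

    SET @rows = @@ROWCOUNT;
END;
```

Batching keeps any single transaction small, but each batch can still block readers on those tables while it runs.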
Imagine this scenario:
- A job kicks off that modifies a bunch of data and holds locks on critical tables for a few minutes.
- You’ve got several application servers that are trying to run queries against those critical tables.
- Many of the application server queries are blocked by the job.
- More and more queries are coming in wanting to get data back from those critical tables. They’re all blocked, timing out and retrying.
Eventually the server is going to run out of worker threads. When this happens, new connections will get errors. You won’t even be able to connect to the server in SSMS unless you know to use the Remote DAC. During this time, THREADPOOL waits are happening.
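If you do get in over the Remote DAC, you can see the tasks that are stuck waiting for a worker thread. A sketch of what I'd run:

```sql
/* Connect over the Remote DAC first:
   SSMS: prefix the server name with ADMIN:
   sqlcmd: use the -A switch, e.g. sqlcmd -A -S YourServer */
SELECT wait_duration_ms,
       blocking_session_id,
       resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type = N'THREADPOOL';
```

The DAC works when normal connections can't because SQL Server reserves a dedicated scheduler and worker for it.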
What do you do if you are experiencing THREADPOOL waits?
A couple of my clients have been told by Microsoft to increase the “max worker threads” setting. It’s no surprise that increasing it didn’t help. Increasing it can lead to memory issues: each worker thread uses 2MB of memory. That doesn’t sound like a lot, but remember we’ve increased the number of worker threads on the server. Maybe you delay when THREADPOOL happens since you have more worker threads to hand out, but maybe you start encountering RESOURCE_SEMAPHORE or RESOURCE_SEMAPHORE_QUERY_COMPILE waits instead.
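For reference, here’s how that setting gets inspected and changed (the example value of 960 is just an illustration, not a recommendation):

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

/* 0 (the default) means SQL Server calculates max worker threads
   automatically based on CPU count and architecture */
EXEC sp_configure 'max worker threads';

/* Bumping it up, per the advice my clients got -- which treats
   the symptom, not the root cause:
EXEC sp_configure 'max worker threads', 960;
RECONFIGURE;
*/
```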
You have to get to the root cause of the problem.
If you’ve got monitoring tools, you should be able to see what led up to the problem, even though there’ll be a gap in the monitoring data during the event itself: the monitoring tools will be getting connection errors too once the server has run out of worker threads.
Is blocking the only reason to experience THREADPOOL waits?
Blocking is the most common culprit of THREADPOOL waits, but they can also be caused by a massive number of connections actively trying to run queries. Idle connections don’t consume worker threads, so it’s the active ones that matter.
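You can compare how many workers exist against the server’s calculated maximum with something like this:

```sql
/* How close are we to running out of worker threads? */
SELECT (SELECT max_workers_count
        FROM sys.dm_os_sys_info) AS max_workers,
       SUM(current_workers_count) AS workers_created,
       SUM(work_queue_count)      AS queued_tasks
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';
```

If workers_created is near max_workers and queued_tasks is climbing, THREADPOOL waits are either happening or about to.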
Find the blockers and “fix” them. That’s easier said than done and isn’t going to be covered here, but check out this post.
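As a starting point, you can spot live blocking with something along these lines:

```sql
/* Who is blocked, who is blocking them, and what are they running? */
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text AS query_text
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id <> 0;
```

Chase the blocking_session_id values up the chain until you find the session at the head of it.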
If only I could turn back time!
I’d love to go back and troubleshoot the servers that I wrote off as having network issues but that I now suspect were experiencing THREADPOOL waits. I get to troubleshoot this with clients, but I’d love to know whether my theory is correct for the servers I was responsible for in the past.
Brent says: I had this exact same reaction when I first understood what THREADPOOL was doing to my server. “Awww, man, I’ve been haunted by this for years….” I think that’s one of the big perks of being a constant proactive learner like yourself, dear reader – you’re going to read symptoms and then recognize them when disaster strikes.