Three Reasons to (Temporarily) Change Max Worker Threads

By Brent Ozar · January 8, 2020 · 10 comments

When you go to the deli, there’s a “take a number” dispenser where you tear off a number and wait for your number to be called.

If you walked into a deli and you saw 50 people lined up all waiting their turn, and you went to take a number and the dispenser was completely empty, you’d say to yourself, “I think I’ll come back later.”

The deli could solve the problem by getting a larger take-a-number dispenser – or could they? All that would let you do is grab a higher number, but it’s not like they’d be able to start working with you any quicker. More numbers don’t let the people behind the counter actually do more work. In fact, it’s almost the opposite: it just means that the queues are going to grow larger and larger.

SQL Server is the same way:

The workers behind the counter are your CPU cores.
When a query wants to start, it grabs a numbered ticket.
If there aren’t any tickets left, there’s a problem – but to be honest, there was already a problem if the deli was chock full of angry customers waiting their turn.

If a user walks into your SQL Server and there aren’t any tickets left, you run into a poison wait called THREADPOOL. To a user, THREADPOOL feels like the SQL Server isn’t responding to network requests – but it’s just because the workers behind the counter are really busy right now, and so many customers have piled up already.

You could solve the problem (except it’s not really solving the problem) by adding more numbered tickets. In SQL Server, the max worker threads setting is how many tickets are available. In the max worker threads documentation page, Microsoft explains that for most servers (64-bit operating systems with 5-63 CPU cores), the default max worker threads is 512 + ((logical CPUs -4) * 16):

8 cores: 576 max worker threads
16 cores: 704
32 cores: 960
128 cores: 4,480
256 cores: 8,576

In theory, if you had 8 workers at your deli, and you wanted thousands of customers to be able to grab a number and sit around waiting forever to be called, you could set max worker threads to be, say, 8,576. However, in most scenarios, this is just going to mean a serious performance resource problem and a whole lot of pissed-off customers. Trust me, your sandwiches aren’t worth waiting that long.

There are 3 unusual customers who often stand around without ordering food.

SQL Server has a few features that consume worker threads even when they’re not doing anything:

Availability Groups – which consume threads based on the number of databases you’re protecting
Database mirroring – similar to AGs
Query notifications – which let apps subscribe to query updates, but consumes threads based on the number of SqlDependencies (times the number of servers subscribed)

In our deli analogy, these 3 features can consume all of your ticket numbers – even when they’re not actually doing any work. They’re considered active, running queries – even though they’re just standing around the deli, and whenever they’re called on, they just say, “Go on to the next customer – I haven’t figured out which cheese I want yet.”

If you’re backed into a corner where you’re experiencing worker thread exhaustion and THREADPOOL waits – running out of numbers to start new queries – just because these 3 features have consumed all your worker threads, then you could temporarily solve that problem by increasing max worker threads. That will let other customers go ahead and start placing orders – and at first, it’ll seem like you’ve averted disaster, and your deli can make money again.

But sooner or later, those 3 customers WILL order food.

And that’s when your deli is going to be in a world of hurt.

Great example: I’ve had several clients with thousands of databases protected with Always On Availability Groups spread across several replicas. They all had to raise max worker threads to a ginormous number just so that they could continue to service end user queries. They told themselves, “It’s fine, it’s not like all of these databases are changing data at the same time – most of the time, those worker threads are idle, so it doesn’t matter.”

Until one of the replicas went offline for more than a few minutes.

Then, as soon as the replica came back online, all of the databases on the primary replica had been active over the last few minutes, and all of them had changes they needed to push over to the replica. All of a sudden, the deli was chock full of angry customers, all of which wanted to place an order RIGHT NOW, and CPU went to 100% in a really ugly way, with new queries timing out and experiencing THREADPOOL waits.

When you buy a bigger number dispenser, you also need to start architecture discussions.

The bigger number dispenser buys you time, but only as long as not all the customers wanna place an order. The very fact that you had to buy more numbers is your big sign that at some point, workloads can be unsustainable, and you’re going to need to figure out how to:

Reduce the number of potential active worker threads, or
Get yourself more CPU cores (workers behind the deli counter), or
Divide the work across more SQL Servers

When you’re making these decisions, stop to talk to the business about their future growth plans. Think about the number of databases you’ll have 3 years from now, and the number of query notifications you plan to add. If those numbers are up in the thousands, it’s time to revisit the architecture strategy.

Free, 3× a week

Get my new posts by email

Three posts a week, plus a Monday roundup of the best database news from around the web.

10 comments

DBADave

January 8, 2020 at 9:28 am

All of a sudden I’m hungry, great post Brent-thank you.

Reply
1. Brent Ozar
  
  January 8, 2020 at 9:36 am
  
  You’re welcome!
  
  Reply
JRStern

January 8, 2020 at 9:52 am

“Query Notifications”, whazzat? How long has this been going on? OMG, since SQL 2005!?!?! I’m glad I didn’t know! I’m glad my project managers didn’t know! OMG.
Good article, though.
I only messed with max worker threads once. Company had a big, ancient (1990’s) batch program that opened one connection per table, then effectively did its own joins, but that was a couple of hundred threads and mostly idle. So they wanted to run two or more instances concurrently, and it ran out of threads, so I raised the count. So it ran, but it still ran more slowly than just running them consecutively, so I put the count back down.

Reply
Allen Shepard VIP Student since 2019

January 8, 2020 at 11:28 am

Two questions
1) Are all worker threads the same? A deli may need more people slicking meat but not sweeping the floor.
2) how does one know they are at 80% or 20% or worker threads? 100% is easy – wait go up but how does one measure percent of capacity used?
All I have found is SELECT scheduler_id
,current_tasks_count
,current_workers_count
,active_workers_count
,work_queue_count
FROM sys.dm_os_schedulers
WHERE STATUS = ‘Visible Online’

Reply
1. Brent Ozar
  
  January 8, 2020 at 5:42 pm
  
  1. Yes
  
  2. That’s a great question for DBA.StackExchange.com.
  
  Reply
  1. Allen Shepard VIP Student since 2019
    
    January 9, 2020 at 7:25 am
    
    (claps hands for asking a good question. I never know till I speak – then everyone can tell)
    Aye aye Capt. I did number 2 [ask the question on stack exchange 🙂
    Best code to do this so far is
    SELECT cast((select count(*)*100 from sys.dm_os_workers ) as real)/i.max_workers_count FROM sys.dm_os_sys_info i
    
    Reply
Chris Wood

January 9, 2020 at 6:29 am

This came out at just the right time.
We recently had a situation where I saw Threadpool as high up in the waits (I use Paul Randal’s scripts and his wait stats library) and so I thought we perhaps had a lack of cores on this server. I got more cores added and Threadpool is not a concern now. This post just confirmed my thoughts.

Reply
Sunil

January 9, 2020 at 6:37 am

Thats a great analogy to describe worker threads.

Reply
Now I'm Hungry

January 24, 2024 at 11:08 pm

I know this is a bit old but SQL internals are always relevant.

I’m a bit confused. What I’m reading is – the Max Worker Thread default may result in SQL refusing requests, but it’s not a good idea to change it because then it will accept the request and probably serve it slowly. But if I do raise it, I may be in trouble when it’s really busy, and in that case it will…..refuse my request because it’s busy serving all the other ones. But it’s already doing that at the default setting.

So what am I missing here? Why is it a good thing for me to allow SQL to fail to serve requests? Isn’t it better to use the setting that is going to let the request be filled? To use the deli analogy – If I REALLY want a sandwich I’ll grab number 500. If I get lucky 490 of those people infront of me are still picking their sauce, and I get my sandwich. If I am unlucky I have to wait till 499 other sandwiches are made and then I get my sandwich. But I want my sandwich, and it’s not like there’s another shop I can go to in this down?

Reply
1. JRStern
  
  January 24, 2024 at 11:33 pm
  
  Because as hungry as you are, if you grab #500 you’ll get your sandwich three days later, and if you grab #5000 the place may go out of business before you get your sandwich, or the other $4999 people block the kitchen and nothing gets done.
  Or see my comment above.
  
  Reply