Three Reasons to (Temporarily) Change Max Worker Threads

When you go to the deli, there’s a “take a number” dispenser where you tear off a number and wait for your number to be called.

If you walked into a deli and you saw 50 people lined up all waiting their turn, and you went to take a number and the dispenser was completely empty, you’d say to yourself, “I think I’ll come back later.”

The deli could solve the problem by getting a larger take-a-number dispenser – or could they? All that would let you do is grab a higher number, but it’s not like they’d be able to start working with you any quicker. More numbers don’t let the people behind the counter actually do more work. In fact, it’s almost the opposite: it just means that the queues are going to grow larger and larger.

SQL Server is the same way:

  • The workers behind the counter are your CPU cores.
  • When a query wants to start, it grabs a numbered ticket.
  • If there aren’t any tickets left, there’s a problem – but to be honest, there was already a problem if the deli was chock full of angry customers waiting their turn.

If a user walks into your SQL Server and there aren’t any tickets left, you run into a poison wait called THREADPOOL. To a user, THREADPOOL feels like the SQL Server isn’t responding to network requests – but it’s just because the workers behind the counter are really busy right now, and so many customers have piled up already.

You could solve the problem (except it’s not really solving the problem) by adding more numbered tickets. In SQL Server, the max worker threads setting is how many tickets are available. In the max worker threads documentation page, Microsoft explains that for most servers (64-bit operating systems with 5-63 CPU cores), the default max worker threads is 512 + ((logical CPUs -4) * 16):

  • 8 cores: 576 max worker threads
  • 16 cores: 704
  • 32 cores: 960
  • 128 cores: 4,480
  • 256 cores: 8,576

In theory, if you had 8 workers at your deli, and you wanted thousands of customers to be able to grab a number and sit around waiting forever to be called, you could set max worker threads to be, say, 8,576. However, in most scenarios, this is just going to mean a serious performance resource problem and a whole lot of pissed-off customers. Trust me, your sandwiches aren’t worth waiting that long.

There are 3 unusual customers who often stand around without ordering food.

SQL Server has a few features that consume worker threads even when they’re not doing anything:

  1. Availability Groups – which consume threads based on the number of databases you’re protecting
  2. Database mirroring – similar to AGs
  3. Query notifications – which let apps subscribe to query updates, but consumes threads based on the number of SqlDependencies (times the number of servers subscribed)

In our deli analogy, these 3 features can consume all of your ticket numbers – even when they’re not actually doing any work. They’re considered active, running queries – even though they’re just standing around the deli, and whenever they’re called on, they just say, “Go on to the next customer – I haven’t figured out which cheese I want yet.”

If you’re backed into a corner where you’re experiencing worker thread exhaustion and THREADPOOL waits – running out of numbers to start new queries – just because these 3 features have consumed all your worker threads, then you could temporarily solve that problem by increasing max worker threads. That will let other customers go ahead and start placing orders – and at first, it’ll seem like you’ve averted disaster, and your deli can make money again.

But sooner or later, those 3 customers WILL order food.

And that’s when your deli is going to be in a world of hurt.

Great example: I’ve had several clients with thousands of databases protected with Always On Availability Groups spread across several replicas. They all had to raise max worker threads to a ginormous number just so that they could continue to service end user queries. They told themselves, “It’s fine, it’s not like all of these databases are changing data at the same time – most of the time, those worker threads are idle, so it doesn’t matter.”

Until one of the replicas went offline for more than a few minutes.

Then, as soon as the replica came back online, all of the databases on the primary replica had been active over the last few minutes, and all of them had changes they needed to push over to the replica. All of a sudden, the deli was chock full of angry customers, all of which wanted to place an order RIGHT NOW, and CPU went to 100% in a really ugly way, with new queries timing out and experiencing THREADPOOL waits.

When you buy a bigger number dispenser, you also need to start architecture discussions.

The bigger number dispenser buys you time, but only as long as not all the customers wanna place an order. The very fact that you had to buy more numbers is your big sign that at some point, workloads can be unsustainable, and you’re going to need to figure out how to:

  • Reduce the number of potential active worker threads, or
  • Get yourself more CPU cores (workers behind the deli counter), or
  • Divide the work across more SQL Servers

When you’re making these decisions, stop to talk to the business about their future growth plans. Think about the number of databases you’ll have 3 years from now, and the number of query notifications you plan to add. If those numbers are up in the thousands, it’s time to revisit the architecture strategy.

Previous Post
[Video] You shouldn’t dynamically create/alter/drop triggers.
Next Post
Building SQL ConstantCare®: Which Recommendations Do People Mute?

8 Comments. Leave new

  • All of a sudden I’m hungry, great post Brent-thank you.

  • “Query Notifications”, whazzat? How long has this been going on? OMG, since SQL 2005!?!?! I’m glad I didn’t know! I’m glad my project managers didn’t know! OMG.
    Good article, though.
    I only messed with max worker threads once. Company had a big, ancient (1990’s) batch program that opened one connection per table, then effectively did its own joins, but that was a couple of hundred threads and mostly idle. So they wanted to run two or more instances concurrently, and it ran out of threads, so I raised the count. So it ran, but it still ran more slowly than just running them consecutively, so I put the count back down.

  • Allen Shepard
    January 8, 2020 11:28 am

    Two questions
    1) Are all worker threads the same? A deli may need more people slicking meat but not sweeping the floor.
    2) how does one know they are at 80% or 20% or worker threads? 100% is easy – wait go up but how does one measure percent of capacity used?
    All I have found is SELECT scheduler_id
    FROM sys.dm_os_schedulers
    WHERE STATUS = ‘Visible Online’

    • 1. Yes

      2. That’s a great question for

      • Allen Shepard
        January 9, 2020 7:25 am

        (claps hands for asking a good question. I never know till I speak – then everyone can tell)
        Aye aye Capt. I did number 2 [ask the question on stack exchange 🙂
        Best code to do this so far is
        SELECT cast((select count(*)*100 from sys.dm_os_workers ) as real)/i.max_workers_count FROM sys.dm_os_sys_info i

  • This came out at just the right time.
    We recently had a situation where I saw Threadpool as high up in the waits (I use Paul Randal’s scripts and his wait stats library) and so I thought we perhaps had a lack of cores on this server. I got more cores added and Threadpool is not a concern now. This post just confirmed my thoughts.

  • Thats a great analogy to describe worker threads.


Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.