If you walked into a deli and you saw 50 people lined up all waiting their turn, and you went to take a number and the dispenser was completely empty, you’d say to yourself, “I think I’ll come back later.”
The deli could solve the problem by getting a larger take-a-number dispenser – or could they? All that would let you do is grab a higher number, but it’s not like they’d be able to start working with you any quicker. More numbers don’t let the people behind the counter actually do more work. In fact, it’s almost the opposite: it just means that the queues are going to grow larger and larger.
SQL Server is the same way:
- The workers behind the counter are your CPU cores.
- When a query wants to start, it grabs a numbered ticket.
- If there aren’t any tickets left, there’s a problem – but to be honest, there was already a problem if the deli was chock full of angry customers waiting their turn.
If a user walks into your SQL Server and there aren’t any tickets left, you run into a poison wait called THREADPOOL. To a user, THREADPOOL feels like the SQL Server isn’t responding to network requests – but it’s just because the workers behind the counter are really busy right now, and so many customers have piled up already.
You could solve the problem (except it’s not really solving the problem) by adding more numbered tickets. In SQL Server, the max worker threads setting is how many tickets are available. In the max worker threads documentation page, Microsoft explains that for most servers (64-bit operating systems with 5-63 CPU cores), the default max worker threads is 512 + ((logical CPUs -4) * 16):
- 8 cores: 576 max worker threads
- 16 cores: 704
- 32 cores: 960
- 128 cores: 4,480
- 256 cores: 8,576
In theory, if you had 8 workers at your deli, and you wanted thousands of customers to be able to grab a number and sit around waiting forever to be called, you could set max worker threads to be, say, 8,576. However, in most scenarios, this is just going to mean a serious performance resource problem and a whole lot of pissed-off customers. Trust me, your sandwiches aren’t worth waiting that long.
There are 3 unusual customers who often stand around without ordering food.
SQL Server has a few features that consume worker threads even when they’re not doing anything:
- Availability Groups – which consume threads based on the number of databases you’re protecting
- Database mirroring – similar to AGs
- Query notifications – which let apps subscribe to query updates, but consumes threads based on the number of SqlDependencies (times the number of servers subscribed)
In our deli analogy, these 3 features can consume all of your ticket numbers – even when they’re not actually doing any work. They’re considered active, running queries – even though they’re just standing around the deli, and whenever they’re called on, they just say, “Go on to the next customer – I haven’t figured out which cheese I want yet.”
If you’re backed into a corner where you’re experiencing worker thread exhaustion and THREADPOOL waits – running out of numbers to start new queries – just because these 3 features have consumed all your worker threads, then you could temporarily solve that problem by increasing max worker threads. That will let other customers go ahead and start placing orders – and at first, it’ll seem like you’ve averted disaster, and your deli can make money again.
But sooner or later, those 3 customers WILL order food.
And that’s when your deli is going to be in a world of hurt.
Great example: I’ve had several clients with thousands of databases protected with Always On Availability Groups spread across several replicas. They all had to raise max worker threads to a ginormous number just so that they could continue to service end user queries. They told themselves, “It’s fine, it’s not like all of these databases are changing data at the same time – most of the time, those worker threads are idle, so it doesn’t matter.”
Until one of the replicas went offline for more than a few minutes.
Then, as soon as the replica came back online, all of the databases on the primary replica had been active over the last few minutes, and all of them had changes they needed to push over to the replica. All of a sudden, the deli was chock full of angry customers, all of which wanted to place an order RIGHT NOW, and CPU went to 100% in a really ugly way, with new queries timing out and experiencing THREADPOOL waits.
When you buy a bigger number dispenser, you also need to start architecture discussions.
The bigger number dispenser buys you time, but only as long as not all the customers wanna place an order. The very fact that you had to buy more numbers is your big sign that at some point, workloads can be unsustainable, and you’re going to need to figure out how to:
- Reduce the number of potential active worker threads, or
- Get yourself more CPU cores (workers behind the deli counter), or
- Divide the work across more SQL Servers
When you’re making these decisions, stop to talk to the business about their future growth plans. Think about the number of databases you’ll have 3 years from now, and the number of query notifications you plan to add. If those numbers are up in the thousands, it’s time to revisit the architecture strategy.