Why Database Monitoring Tools Are So Hard to Interpret


SentryOne’s Greg Gonzalez recently published a post about “Why Wait Stats Alone Are Not Enough.” Sure, Greg’s obviously biased towards his own monitoring tool, but his post is really well-written and fair. I want to explain the same story, but from a different (and simplified) angle.

Say we’ve got a really simple database server, and we start out with just one query that runs each day:

Query R is a reporting query that scans ten years of sales history. It’s a terribly written query, everyone hates it, and when it runs, it maxes out all of our CPU cores for quite a while.

At this point, if you use any monitoring tool, it’ll tell you that:

  • CPU is a problem for us around 8AM (either by judging CPU %, or if the query is complex enough, SOS_SCHEDULER_YIELD waits)
  • The query that’s waiting on CPU at that time is Query R
  • The query that’s using the most CPU time throughout the entire day is Query R

Now let’s add in transactions.

Whenever our web site does transactions, Query T is a transaction that inserts a new sale and subtracts from our inventory. It runs every now and then, mostly during business hours:

Query T’s code is alright – not great, just alright – so when it runs, it just consumes some processor cycles for a few seconds and then finishes. People are still pretty happy with performance – other than that terrible Query R – so nobody bothers looking at monitoring tools.

But over time, our web site starts doing more sales and transactions:

Now the data gets a little trickier. Monitoring tools that focus solely on wait stats will tell you that you have a CPU problem around 8AM, and that the query waiting on CPU during that time is Query R, the big ugly report.

However, outside of that window, you may not have any wait time on CPU at all. If queries are simple enough, and there isn’t enough concurrency, then as we discuss in Mastering Server Tuning, you can end up with SOS_SCHEDULER_YIELD waits with no time attached to them. You wouldn’t notice a CPU problem at all outside of that 8AM window because you don’t have a CPU bottleneck – yet. And to be fair, you probably wouldn’t tune a server in this kind of shape, either.

But when you DO need to tune a server that isn’t running at 100% capacity, picking the right query to tune isn’t just about wait stats: it’s also about which queries you need to be more responsive, and how you need to plan for future load. If your company tries to run a flash sale, and a lot of folks try to check out at the same time, Query T is going to knock your server over. But wait-stats based tools won’t see that coming: they’ll still be focused on Query R, the only one that spends a lot of time waiting on CPU.

If you’re trying to tune this scenario, you need to think about two things:

  • Which time windows best represent the workload that I’m trying to tune?
  • Which queries are consuming the most resources during that time, and not necessarily which queries are waiting on resources yet at that time?

For example, if I was tuning this exact scenario in Microsoft SQL Server or Azure SQL DB, I’d use sp_BlitzCache @SortOrder = ‘cpu’ to see which queries are consuming the most CPU cycles. They may not be waiting on CPU – yet – but it doesn’t matter. I might need to find ’em before we get to the point where we’re actually waiting on CPU. (I would also strongly recommend that we move the reporting query to a replica, hahaha.)

Now let’s add in monitoring and backups.

Query B is our nightly backups, Query D is database corruption checking, and Query M is our monitoring tool that runs continuously 24/7:

Now it doesn’t matter whether you sort queries by wait stats or by how much CPU they used: you’re going to be in a fog. You have to ask yourself a few questions to get the right context:

  • What are the time windows where I’m worried about performance? You get that answer by working with the business.
  • During that window, what’s my current bottleneck today? You get that answer from wait stats.
  • During that window, what queries are causing that current bottleneck? You get that answer from a variety of sources like the plan cache, Query Store, Extended Events, sampling live workloads, and more.
  • Looking ahead, are there queries that will cause a bottleneck tomorrow? You get that answer by reviewing workloads whose quantity will vary as your user workloads or data sizes grow.

If you only use one source – whether it’s wait stats or the plan cache – and you don’t put it in context with the above questions, you’re going to be led down the wrong path, and your tuning efforts will be wasted. You need context, and that’s why I teach training classes to help you figure this stuff out. I long for the day when you can open a tool that has a big flashing button that says “click here to fix the problem,” but we ain’t there yet.

Previous Post
Updated First Responder Kit and Consultant Toolkit for January 2020
Next Post
Building PasteThePlan.com: Removing Comments

2 Comments. Leave new

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.