Things Your SQL Server Monitoring Software Should Tell You

Tools
47 Comments

In our work as consultants, we see a lot of shops that have a monitoring tool, and yet they’re still not able to get to the root cause of a SQL Server performance or reliability issue. In many of these cases, the problem isn’t the people: the problem is that they bought a tool that doesn’t do what they think it’ll do.

I think of monitoring tools in terms of maturity tiers:

  • Level 1: Inexpensive Wall of Numbers – a tool that shows you a big dashboard of numbers, but has absolutely no guidance whatsoever on what’s a normal range for that number. It auto-scales every graph so that you always see peaks and valleys, making it even harder to diagnose problems. It’s not SQL Server specific at all, just a bunch of Perfmon and WMI counters. There’s nothing wrong with this tier – if you’re an advanced sysadmin, all you want is a metrics repository so you can do your own analysis, and you get exactly that here.
  • Level 2: Basic DMV Wall of Numbers – adds on a bunch of DMV queries for things like wait stats, but typically not the queries from the plan cache that are causing a given wait type. A good example of this is New Relic’s SQL Server agent, and a good example of user questions for these tools are, “Why is Page Life Expectancy going up and down regularly in a consistent sawtooth pattern?”
  • Level 3: Advanced DMV Wall of Numbers – if it’s in a DMV, you get it here, including Availability Groups diagnostics, execution plan cache, and index utilization. For performance tuning, tools like this typically start with a wait stats dashboard, and then let you drill down into the specific query causing your highest wait type. However, it’s still just a wall of numbers that you need to interpret: for example, if you’re seeing 100% CXPACKET waits, is it even a problem – or are we talking 6 seconds of CXPACKET waits per hour?
  • Level 4: Mentoring and Root Cause Analysis – tools that use this huge volume of data to actually tell you if your SQL Server is having a problem right now (or not!), and if so, what the root cause is. When you open the dashboard, you don’t have to ask if you have a problem – the tool tells you what the problem is, and guides you to a solution, leveling up your skills along the way.

So what level is your monitoring tool?

It’s easy to find out: on a dev box, trigger a few common SQL Server emergencies, and see how your monitoring tool surfaces that problem.

Have databases not been backed up recently? When I open the dashboard under its default configurations, this needs to be a full-blown emergency. If failed Agent jobs are gonna cause me to lose my own job, then the tool needs to make this blatantly obvious. There’s nothing more important to me as a data professional. I’m not asking for it to assess my RPO/RTO and make sure my backups match – just at least make it obvious that my backups aren’t working.

Are databases corrupt? If there’s rows in msdb.dbo.suspect_pages, sys.dm_hadr_auto_page_repair, or sys.dm_db_mirroring_auto_page_repair, we have a serious problem. (This happened to yet another client last month – their storage was actively corrupting their database, and yet their monitoring tool was happily singing along about CXPACKET being the biggest problem.)

Who’s the lead blocker in a blocking chain? If you open the dashboard when queries are piled up waiting on locks, it should be immediately obvious who the problem is. Ironically, the low-end monitoring tools show huge lock waits, but when you drill down, they only show the blocked queries experiencing the LCK* waits, leading DBAs down the completely wrong troubleshooting path.

If a log file or TempDB is full and growing, why? When someone or a broken app leaves a transaction open, I’ve got a ticking time bomb in terms of disk space. Don’t just tell me the disk is filling up – that’s a Wall of Numbers approach – tell me why.

Odds are, you don’t have a Level 4 tool, so when these emergencies happen, you’re going to be doing your own diagnostics using your monitoring tool. You have two options: learn to use your tool better, or get a better tool.

Your monitoring vendor wants to help! You can call their support and ask for a consulting engagement where they take control of your monitoring tool, walk you through it, and explain what they’re seeing in the tool. They want to teach you how to get the most out of the tool because after all, if you can’t figure out how to use it, you’re probably going to switch tools.

I know – you think you don’t have a budget or approval to get a better tool. I’ve got great news – other monitoring vendors will take trade-ins. They’ll swap your current licenses for another tool, and you just start paying them maintenance from here on out.

Previous Post
My Transaction Log File Is How Big?
Next Post
First Responder Kit Release: Fully Hand Automated

47 Comments. Leave new

  • One more thing I would like to see regarding commercial monitoring tools: the impact (observer effect) of monitoring queries on monitored SQL instance.

    Reply
    • I’d put that in kinda the same category as the overhead of a speedometer on a car’s speed. Sure, there’s some overhead – but you wouldn’t wanna remove the speedometer in order to go faster, because then you wouldn’t know how fast (or slow) you’re going, and what you need to work on.

      Reply
  • Lawrence Dugger
    August 31, 2017 11:18 am

    Unfortunately, many to ‘dashboards’ are oriented to management, not troubleshooting…

    Reply
  • You missed one type – The Handholder without numbers. The real executive dashboard, which tells you what is wrong by looking at some level of Wall of Numbers input and running it through rules. But getting to the numbers behind the warnings requires you to know where the numbers come from and getting them yourself. These can be Tier 1, 2, or 3, but are far more annoying than the simple Wall o’ Numbers, because the rules are rarely comprehensive.

    “Check Disk Throughput, and consider adding faster disks” – Okay, I’ve done that, disk throughput is fine, no hot spots, no significant physical or logical fragmentation. Got any other suggestions? No? Aaaaaaa!

    Reply
  • James Sleeman
    August 31, 2017 1:41 pm

    Would be interested in knowing what tools you think fall into Tier 4.

    Reply
    • James – I’ll wait for product users to chime in. I wouldn’t want to skew things one way or another – folks are welcome to comment with their thoughts about the tools they’re using day to day.

      Reply
  • Fabien Sabinet
    September 1, 2017 2:38 am

    I was a .NET developer for many years before so when I became DBA I developed my own monitoring tool : sqlProcexp. It’s based on the concept of the excellent sysinternals procexp tool, for those who know, and I it’s free as well.

    It monitor and mostly historize main counters on the whole instance : CPU, logical&physical read and write IO, wait time, signal wait, plus two user configurable list of wait(s) (by default IO waits and CXPacket). All counters are historized with the query executed at the moment. It’s made with regular capture on sys.sysprocesses, sp.sql_handle, sys.dm_os_wait_stats and sys.dm_exec_query_stats.

    You can find it with some more information and documentation at :
    https://fabiensabinet.wordpress.com/2017/06/21/sql-procexp-en/

    A picture of the main window :
    https://fabiensabinet.files.wordpress.com/2017/06/sqlprocexpapp21.png

    Even if it’s possible to monitor instances remotely (also with the DAC in case of emergency) I finally run one permanently directly on each server as it’s very cheap in CPU and IO ressource.

    The main problem of the tool is that now I got use to it, I miss it when it’s not available 🙂 Hope you will enjoy it as much as me.

    Reply
  • scottjcomputer
    September 1, 2017 8:29 am

    I use solarwinds database performance analyzer, I don’t know if I like it any better than anything else would like to here other’s thoughts.

    Reply
  • andreas yiannaki
    September 1, 2017 9:03 am

    Hello,
    i use Embarcadero Dba artisan( now Idera) and Spotlight ( ex Quest),
    I think that they are Level 3 1/2 or 3+
    😉

    Reply
  • sqlProcexp tool monitor (and mostly historize) main counters on the whole instance : CPU, logical&physical read and write IO, wait time, signal wait, plus two user configurable list of wait(s) (by default IO waits and CXPacket). All counters are historized with the query executed at the moment. It’s made with regular capture on sys.sysprocesses, sp.sql_handle, sys.dm_os_wait_stats and sys.dm_exec_query_stats.

    It can be found with some more information and documentation at :

    fabiensabinet.wordpress.com/2017/06/21/sql-procexp-en/

    Even if it’s possible to monitor instances remotely (also with the DAC in case of emergency) I finally run one permanently directly on each server as it’s very cheap in CPU and IO ressource.

    The main problem of the tool is that now I got use to it, I miss it when it’s not available ? Hope you will enjoy it as much as me.

    Reply
  • SQL Sentry has the best drill down and correlation capabilities of all the tools we’ve tried. We got immediate value from it troubleshooting production issues even during the trial period, heh.

    Reply
  • For simple, easy alerting, with access to trend information, I love Redgate’s SQL Monitor.

    For more in depth analysis it’s SQL Sentry for me

    For non-prod environments where licence cost is a problem, Opserver is great
    Actually, it’s probably great in production environments too. I just don’t know how to use it well enough and get the information I need

    Reply
  • I have used many commercial tools – DBTuna, Redgate SQL Monitor, Idera Diagnostic Manager, SQL Sentry, Nagios, Foglight for SQL Server, and Solarwinds Database Performance Analyzer. Right now I say both Foglight and Solarwinds DPA are both are at the top of my list, with Solarwinds being preferred. One thing I can say for sure, ANY of these beat trying to capture, monitor, and manage metrics on my own.

    I would put Foglight at a tier 3, and Solarwinds at a tier 4 – but missing some key pieces (like resource reporting – but is being worked on for newer release).

    Reply
    • Add Spotlight to my list, but I dislike that with a passion. It’s slow and requires a client side tool. I have a strong opinion in the tool being web based to keep usage and installation simple as can be.

      Reply
    • What did you like about Foglight and Solarwinds that’s missing in Sentry?

      Reply
      • In this case, it’s not about what it’s missing, but rather about ease of use, ease of installation (web based goes along way). Foglight allows drilling down into many dimensions, while Solarwinds just presents you with the top performance hitters off the bat.

        Reply
    • Tony – when you say Solarwinds is a Tier 4, it sounds like you’ve got a lot of experience with it. Would you mind putting together a quick Word doc or screenshots that explain how it handles the 4 scenarios in the post (databases w/o backups, data corruption, lead blocker in a live blocking chain, and a growing log file or TempDB) and gets you to the root cause quickly? I’d love to be able to show that to folks in the post so they can see how a Tier 4 tool does it well. Thanks!

      Reply
      • Brent, I can do that. Note that I classified it as a tier 4 based on the description supplied for “Tier 4 – Mentoring and Root Cause Analysis”. But it has a variety of alerts and custom metrics to extend for other purposes. Let me see what I can put together. Specifically custom metrics would include databases w/o backups, data corruption, possibly growing log file. Lead blocker and blocking chains have been in the tool since the 10.1 release I believe. Either way, I’ll put something together but might not be until next week. Busy preparing for a fundraising event this weekend for my kids school that I’m organizing!

        Reply
        • Tony – uh, hello – if you have to build your own code into the tool to handle the above simple scenarios, then it ain’t Tier 4.

          Reply
          • I would argue that if they provide a framework to interact with their collections, then it qualifies.

          • Tony – hahaha, no, reread the description of Tier 1. That’s exactly what you have – a framework you can use to build your own alerts and business logic. Nothing wrong with that – just don’t confuse it with Tier 4 tools that do these things out of the box.

          • Brent, then it’s a hybrid :)… because the metrics it does show OOB, it provides mentoring, root cause analysis, and some guidance.

          • Sorry, Tony, you’re not getting it. Even Tier 1 tools provide metrics you can customize. If you’re telling the tool what to do – like building your own custom measurements to see if your backups are done – then you’re mentoring the tool, not the other way around. Nice try though.

        • Brent, let me restate that. For the metrics SWDPA captures out-of-the-box, it does provide mentoring and root cause analysis, and even some tutoring. In addition to it’s metric collections, it provides alerting and reporting. The product *also* allows you to create custom collections / alerts (not so much on the reporting side). That’s all I meant.

          Reply
          • Tony – again, I hear you, but look at the scenarios in the post. If you have to tell the tool how to get those answers, it ain’t Level 4. (It’s not like those are even hard scenarios – sp_Blitz catches half of ’em, and that’s free!)

          • Ok, so it has to meet 100% of the criteria, not just some. Roger.

          • Brent did you had some time to take a look at my tool ? 😀

            https://fabiensabinet.wordpress.com/2017/06/21/sql-procexp-en/

          • Fabien – no, sorry, I wish I had enough time to look at everything, but the reality is that I don’t spend time working on closed-source free tools. The problem is that the tool could disappear at any time, or the licensing could change or become expensive.

            If you want to get market adoption, you’re best off either open sourcing the tool, or building a company behind it. (But once you’re building a company behind it, you need to have a lot of resources – I wouldn’t bother using a paid tool that only has one developer. If they get in a car accident, I’m screwed.)

            Sorry that I don’t have a happier answer there.

          • Yes I understand.

            But if I promise you, even an oath! that I will, never ! NEVER make pay for it, AND that if I stop working on it (which is not probable but who knows) I will release the source code freely !? 🙂

            Then, you can give it e try no ? You will love it ! it’s another promise 🙂 It’s a level 1 for sure but the one who give you immediately the information to get directly to the level 4 as a DBA ! 😀

          • No, sorry. Promises in blog post comments from strangers don’t hold a lot of weight. (sigh) Wish I could, but just not enough hours in my day to gamble on that. Any personal time would either come out of my day job or my family time.

          • Ok again I understand. Hope one day, if it become popular, you will finally think : “hey wait a minute.. is it the tool from that swearing guy I remember ! 🙂 So many people using it ?!! … what the hell.. let’s try it !!!” 😀 😀

      • Brent, based on your feedback, I will put both Foglight and Solarwinds at a 2.5… I do not agree they are a tier 1. While they might not cover *all* the bases they do cover a lot.

        Reply
  • I’m currently in the market, with executive backing, for a new monitoring tool (I don’t have anything at the moment). Which true tier 4 tools have you found?

    Reply
    • Aaron – rather than endorsing a specific vendor’s tools here, I’d rather let the audience talk about which tools they’ve found are in a particular tier. I’ll let them answer though.

      Reply
  • So nobody knows of a Tier 4 tool then??? And only 2 come in at Tier 3 (Embarcadero/Idera Dba artisan and Spotlight)

    Reply
    • Well that’s simply because the only reliable tier 4 tool is .. a DBA !! 🙂 thanks for us 😉

      Reply
      • Fabien – congratulations, you nailed it, heh.

        Everybody thinks their tool is Tier 4 until they stop and think about what it really means. If you’re telling the tool the very basic steps of how to monitor – like building custom alerts for things like your databases not being backed up or alerting you about corruption – then the tool just isn’t that smart.

        There’s a few great Tier 3 tools – Idera SQL DM, Quest Spotlight, and SentryOne SQL Sentry come to mind – but the rest often aren’t quite what they claim to be.

        Reply
  • I recently joined a new organization & they have SolarWinds DPA. One of the servers it always shows a steady bar of Memory/CPU waits. But when I use some of the dmv queries, it doesn’t match. Any ideas?

    Reply
    • Prashant – for support with SolarWinds, your best bet would be to contact them. I wouldn’t know why their monitoring tool doesn’t match the DMVs. Interesting though!

      Reply
      • Brent,
        Posted the query on THWACK & below is the response. Now am worried of my other server which doesn’t have this Memory/CPU bar!

        “When you see memory/cpu, that’s actually an absence of waits. It means that the query is executing and likely the most significant activity is doing logical reads (cpu fetching data from cache). The reason we track it is that from an end user perspective, wall clock time is still passing, so even though that’s an efficient activity, there is still impact.

        Look at cpu metrics and make sure nothing is under pressure (maxed out). If not, in any context, the vast majority of time in memory/cpu is a good profile and should be desired.”

        Reply
        • Prashant – so your monitoring vendor is telling YOU to look at metrics and make sure nothing is maxed out?

          Huh.

          Would have thought that was what the monitoring software was supposed to be doing. Go figure.

          Reply
  • I am considering purchasing a diagnostic/monitoring tool for my shop and we have SQL Server 2012
    availability Group instances and plan on migrating to SQLServer 2017 by next year . What tool would you recommend. Would very much appreciate your guidiance.
    Thanks
    Patrick

    Reply
  • Chris Beardsley
    July 20, 2018 11:59 am

    While it isn’t a paid product, so far we’ve compiled favorite diagnostic queries/sp’s/free GUI tools from various vendors/good folk to build a decent composite view (with some blitzy scripts among them, ty) that we think we’d get (more or less) from an off the shelf product.

    The challenge is this amalgam points to viable series of solutions, just not the smoking gun management wants. Our hope it a singular diagnostic tool would reinforce our findings, ‘officially’ telling management it is Mr.Plum in the library with the disk IO hammer.

    We think we’ve eliminated RedGate, still investigating Idera (VM angle is intriguing) and will be adding Sentry to the research list – and will work making a troubleshooting list, creating testable scenarios.

    Reply
  • Hello Brent,
    What is your opinion about Redgate SQL Toolbelt?
    Thanks,
    Sunny

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.