In our work as consultants, we see a lot of shops that have a monitoring tool, and yet they’re still not able to get to the root cause of a SQL Server performance or reliability issue. In many of these cases, the problem isn’t the people: the problem is that they bought a tool that doesn’t do what they think it’ll do.
I think of monitoring tools in terms of maturity tiers:
- Level 1: Inexpensive Wall of Numbers – a tool that shows you a big dashboard of numbers, but has absolutely no guidance whatsoever on what’s a normal range for that number. It auto-scales every graph so that you always see peaks and valleys, making it even harder to diagnose problems. It’s not SQL Server specific at all, just a bunch of Perfmon and WMI counters. There’s nothing wrong with this tier – if you’re an advanced sysadmin, all you want is a metrics repository so you can do your own analysis, and you get exactly that here.
- Level 2: Basic DMV Wall of Numbers – adds on a bunch of DMV queries for things like wait stats, but typically not the queries from the plan cache that are causing a given wait type. A good example of this is New Relic’s SQL Server agent, and a good example of user questions for these tools are, “Why is Page Life Expectancy going up and down regularly in a consistent sawtooth pattern?”
- Level 3: Advanced DMV Wall of Numbers – if it’s in a DMV, you get it here, including Availability Groups diagnostics, execution plan cache, and index utilization. For performance tuning, tools like this typically start with a wait stats dashboard, and then let you drill down into the specific query causing your highest wait type. However, it’s still just a wall of numbers that you need to interpret: for example, if you’re seeing 100% CXPACKET waits, is it even a problem – or are we talking 6 seconds of CXPACKET waits per hour?
- Level 4: Mentoring and Root Cause Analysis – tools that use this huge volume of data to actually tell you if your SQL Server is having a problem right now (or not!), and if so, what the root cause is. When you open the dashboard, you don’t have to ask if you have a problem – the tool tells you what the problem is, and guides you to a solution, leveling up your skills along the way.
So what level is your monitoring tool?
It’s easy to find out: on a dev box, trigger a few common SQL Server emergencies, and see how your monitoring tool surfaces that problem.
Have databases not been backed up recently? When I open the dashboard under its default configurations, this needs to be a full-blown emergency. If failed Agent jobs are gonna cause me to lose my own job, then the tool needs to make this blatantly obvious. There’s nothing more important to me as a data professional. I’m not asking for it to assess my RPO/RTO and make sure my backups match – just at least make it obvious that my backups aren’t working.
Are databases corrupt? If there’s rows in msdb.dbo.suspect_pages, sys.dm_hadr_auto_page_repair, or sys.dm_db_mirroring_auto_page_repair, we have a serious problem. (This happened to yet another client last month – their storage was actively corrupting their database, and yet their monitoring tool was happily singing along about CXPACKET being the biggest problem.)
Who’s the lead blocker in a blocking chain? If you open the dashboard when queries are piled up waiting on locks, it should be immediately obvious who the problem is. Ironically, the low-end monitoring tools show huge lock waits, but when you drill down, they only show the blocked queries experiencing the LCK* waits, leading DBAs down the completely wrong troubleshooting path.
If a log file or TempDB is full and growing, why? When someone or a broken app leaves a transaction open, I’ve got a ticking time bomb in terms of disk space. Don’t just tell me the disk is filling up – that’s a Wall of Numbers approach – tell me why.
Odds are, you don’t have a Level 4 tool, so when these emergencies happen, you’re going to be doing your own diagnostics using your monitoring tool. You have two options: learn to use your tool better, or get a better tool.
Your monitoring vendor wants to help! You can call their support and ask for a consulting engagement where they take control of your monitoring tool, walk you through it, and explain what they’re seeing in the tool. They want to teach you how to get the most out of the tool because after all, if you can’t figure out how to use it, you’re probably going to switch tools.
I know – you think you don’t have a budget or approval to get a better tool. I’ve got great news – other monitoring vendors will take trade-ins. They’ll swap your current licenses for another tool, and you just start paying them maintenance from here on out.