Did you know that you can ship a bobcat 1/30th of the time and still maintain 97% positive feedback on ebay?
What other statistical lies are lurking out there for you to find?
Cache Hit Ratio
You’re tracking your SQL Server’s cache hit ratio. That number is consistently 97-99%, surely that means you don’t need more memory, right?
Consider this: 256GB of RAM is relatively cheap. To move the needle from 99% to 98% on the buffer cache hit ratio, nearly 2.56GB of requested data would have to be absent from memory when SQL Server asks for it. Unfortunately for our monitoring, SQL Server’s read-ahead mechanism pulls data into cache during regular I/O. As long as your disks can keep up, the buffer cache hit ratio is going to stay high.
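To see how little the ratio moves, here’s a quick back-of-the-envelope sketch in Python. The workload numbers are made up for illustration; the only real fact baked in is that SQL Server data pages are 8KB.

```python
# Hypothetical workload numbers -- for illustration only.
PAGE_KB = 8  # SQL Server data pages are 8KB

def hit_ratio(logical_reads, physical_reads):
    """Fraction of page requests answered from the buffer pool."""
    return 1 - physical_reads / logical_reads

logical = 100_000_000   # page requests
physical = 1_000_000    # of which this many had to hit disk
ratio = hit_ratio(logical, physical)
gb_from_disk = physical * PAGE_KB / 1024 / 1024

print(f"hit ratio: {ratio:.0%}")            # still reads as 99%
print(f"from disk: {gb_from_disk:.1f} GB")  # ~7.6 GB came off disk
```

A dashboard showing 99% and a dashboard hiding 7.6GB of physical reads are the same dashboard.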
If you’re looking for a different metric, remember that it’s more important to measure disk responsiveness.
Speaking of disks…
Disk Queue Length
Once upon a time, in the dark ages of 4200RPM hard drives, the advice was that servers with a sustained disk queue length greater than 2 per physical disk drive were under I/O pressure. That guidance has persisted into the present day even though the world has moved on.
Modern disk drives have large caches built in – consumer-grade drives have 32-64MB caches on them. When you take RAID controllers, HBAs, and modern SANs into account, there can be a huge cache in place to buffer writes to disk. These same caches, plus the SQL Server buffer pool, can also be configured to cache reads.
More useful counters to measure are the disk latency numbers – either Avg Disk sec/Read and Avg Disk sec/Write from perfmon, or the cumulative stall figures in sys.dm_io_virtual_file_stats. Definitions of acceptable latency vary from application to application but, in general, you can reasonably expect reads to be under 100ms and writes to be under 30ms.
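The DMV exposes cumulative stall times and operation counts per database file, so average latency is just the ratio of the two. Here’s a Python sketch of that arithmetic; the column names (io_stall_read_ms, num_of_reads, io_stall_write_ms, num_of_writes) are the DMV’s real columns, but the counter values are invented for illustration.

```python
# Average I/O latency from cumulative counters shaped like
# sys.dm_io_virtual_file_stats. Column names are real; the
# numbers are made up.

def avg_latency_ms(io_stall_ms, num_of_ops):
    """Cumulative stall time divided by operation count."""
    return io_stall_ms / num_of_ops if num_of_ops else 0.0

file_stats = {
    "io_stall_read_ms": 120_000, "num_of_reads": 10_000,
    "io_stall_write_ms": 45_000, "num_of_writes": 9_000,
}

read_ms = avg_latency_ms(file_stats["io_stall_read_ms"],
                         file_stats["num_of_reads"])     # 12.0 ms
write_ms = avg_latency_ms(file_stats["io_stall_write_ms"],
                          file_stats["num_of_writes"])   # 5.0 ms

print(read_ms < 100, write_ms < 30)  # within the rough ceilings above
```

One caveat: the DMV’s counters are cumulative since the file was opened, so real monitoring should take two snapshots and divide the deltas, not the lifetime totals.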
Page Life Expectancy
“But, I know deep down that page life expectancy should always be higher than 300!” — said every DBA, ever.
Not so fast there, sparky. Page life expectancy, as measured by SQL Server, is an average across NUMA nodes – without getting too deep into hardware configurations, each NUMA node keeps its own page life expectancy. On a four socket system, you could have a page life expectancy of 4,000 on one NUMA node and a page life expectancy of 0 on the other three, which still averages out to a page life expectancy of 1,000.
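The averaging problem is easy to reproduce. A quick Python sketch of the four-socket example above:

```python
# The four-socket example from the text: per-NUMA-node page life
# expectancy values and the single averaged number you'd see.
node_ple = [4000, 0, 0, 0]

server_ple = sum(node_ple) / len(node_ple)
print(server_ple)     # 1000.0 -- looks healthy enough

# min() tells the real story: three nodes are churning constantly.
print(min(node_ple))  # 0
```

If you monitor page life expectancy at all, the per-node numbers (or at least the minimum) are worth more than the blended average.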
While there are a variety of formulas to calculate an effective page life expectancy metric, it’s even more important to know what you’re measuring and why. If you’re worried about the page life expectancy dropping during index maintenance, you’re probably worried about the wrong thing. But if you’re worried about page life expectancy during OLTP or during reporting, then you’re worrying about the right thing.
Page Splits
From time to time, people get so focused on tracking index fragmentation that they start tracking page splits – it’s right there in the Page Splits/sec perfmon counter, after all. The thought behind tracking page splits is that you, as the DBA, will be able to figure out when those pesky page splits that lead to index fragmentation are occurring. With accurate monitoring, you can figure out a better way to keep your indexes shipshape. The problem with tracking page splits is that you’re tracking something, but not something with meaning.
It isn’t possible to derive meaning from the Page Splits/sec counter on its own – the number doesn’t just include the so-called bad splits (inserts into the middle of an index); it also includes good page splits.
Good page splits? A good page split occurs when a new page is allocated at the end of a table or index. These new page allocations count as page splits too. Measuring page splits… well, it just measures new pages being added to the table, regardless of where they land.
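A toy model makes the conflation obvious. The sketch below is entirely hypothetical – real page-split mechanics are more involved – but it shows that a strictly append-only workload and a shuffled, mid-index workload both tick the same split counter.

```python
# Toy model of fixed-capacity index pages, showing why a raw split
# counter can't separate "good" end-of-index splits from "bad"
# mid-index ones. Hypothetical; not real page-split mechanics.
import random

PAGE_CAPACITY = 4  # keys per page; absurdly small for illustration

def insert(pages, key):
    """Insert key into its sorted page; return 1 if a split occurred."""
    for i, page in enumerate(pages):
        if not page or key <= page[-1] or i == len(pages) - 1:
            target = i
            break
    page = pages[target]
    page.append(key)
    page.sort()
    if len(page) <= PAGE_CAPACITY:
        return 0
    mid = len(page) // 2  # overflow: split the page in half
    pages[target:target + 1] = [page[:mid], page[mid:]]
    return 1

# Strictly ascending inserts -- the "good" split case:
pages = [[]]
splits_from_appends = sum(insert(pages, k) for k in range(1, 21))

# The same 20 keys in shuffled order -- mid-index ("bad") splits:
random.seed(42)
keys = list(range(1, 21))
random.shuffle(keys)
pages = [[]]
splits_from_random = sum(insert(pages, k) for k in keys)

# Both workloads increment the counter; it can't tell them apart.
print(splits_from_appends, splits_from_random)
```

Both workloads report splits, even though only the shuffled one leaves fragmentation behind – which is exactly why the counter alone can’t tell you anything actionable.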
There’s no easy metric for tracking fragmentation rates, but you shouldn’t be worrying about it too much.
User Connections
Most people don’t think about it, but you can track the number of users connected to SQL Server. This has an incredibly loose relationship to the maximum number of worker threads – queries use threads and users run queries, right?
Unfortunately, this is another one of those numbers that doesn’t mean a lot on its own. Modern database clients use connection pools to keep persistent connections open to the database – it’s computationally cheaper to be lazy and keep a network connection open for a long time. In the .NET world, every different connection string has a separate connection pool. Each connection pool can have up to 100 connections to SQL Server. Most of those connections are going to be sitting around doing nothing until they’re killed off by the connection pool itself.
In extreme cases, a high number of active user connections (each with a lot of active threads) can lead to THREADPOOL waits – which, despite sounding marginally cool, are a terrible situation where SQL Server has run out of worker threads to hand out to users trying to do work.
Having a high number of user connections can be an indicator that something might be amiss, but it’s not concrete enough to raise a warning on – we can’t gather enough information with this metric.
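The arithmetic behind that mismatch is worth a sketch. The workload numbers below are invented; the default max-worker-threads formula is the documented default for 64-bit SQL Server with more than 4 logical CPUs, but check the documentation for your version.

```python
# Illustrative arithmetic: many idle pooled connections vs. a much
# smaller pool of worker threads. The formula below is the documented
# default for 64-bit SQL Server with > 4 logical CPUs (hedged --
# verify against your version's documentation).

def max_worker_threads(logical_cpus):
    """Default worker thread cap for a 64-bit instance."""
    if logical_cpus <= 4:
        return 512
    return 512 + (logical_cpus - 4) * 16

# Ten distinct connection strings, each .NET pool maxed at 100:
pooled_connections = 10 * 100          # 1,000 open connections
threads = max_worker_threads(16)       # 704 threads on 16 cores

# Harmless while those connections sit idle -- but if they all run
# queries at once, the thread pool runs dry: THREADPOOL waits.
print(pooled_connections, threads)
```

A thousand connections against seven hundred threads is a non-event right up until everyone gets busy at the same time, which is why the raw connection count alone isn’t alarm-worthy.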
You only get what you measure – by recording spurious metrics, you’ll end up with a misleading, and potentially wrong, view of performance.
While good metrics depend on your application, there’s something you can do immediately: only monitor things that produce actionable information. Don’t add more spam to your inbox – stop watching bogus metrics and start watching the things that make a difference to your application’s performance.