Did you know that you can ship a bobcat 1/30th of the time and still maintain 97% positive feedback on ebay?
What other statistical lies are lurking out there for you to find?
Cache Hit Ratio
You’re tracking your SQL Server’s cache hit ratio. That number is consistently 97-99%, surely that means you don’t need more memory, right?
Consider this: 256GB of RAM is relatively cheap. To move the needle from 99% to 98% on the buffer cache hit ratio, nearly 2.56GB of requested data would have to be absent from memory when SQL Server asks for it. Unfortunately for our monitoring, SQL Server’s read-ahead mechanism pulls data into cache during regular I/O. As long as your disks can keep up, the buffer cache hit ratio is going to stay high.
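To see how little the ratio moves, here’s a quick back-of-the-envelope sketch in Python. The workload numbers are made up for illustration; the only real fact baked in is that SQL Server data pages are 8KB.

```python
# Hypothetical workload numbers -- for illustration only.
PAGE_KB = 8  # SQL Server data pages are 8KB

def hit_ratio(logical_reads, physical_reads):
    """Fraction of page requests answered from the buffer pool."""
    return 1 - physical_reads / logical_reads

logical = 100_000_000   # page requests
physical = 1_000_000    # of which this many had to hit disk
ratio = hit_ratio(logical, physical)
gb_from_disk = physical * PAGE_KB / 1024 / 1024

print(f"hit ratio: {ratio:.0%}")            # still reads as 99%
print(f"from disk: {gb_from_disk:.1f} GB")  # ~7.6 GB came off disk
```

A dashboard showing 99% and a dashboard hiding 7.6GB of physical reads are the same dashboard.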
If you’re looking for a different metric, remember that it’s more important to measure disk responsiveness.
Speaking of disks…
Disk Queue Length
Once upon a time, in the dark ages of 4200RPM hard drives, the advice was that servers with a sustained disk queue length greater than 2 per physical disk drive were under I/O pressure. That guidance has persisted into the present day even though the world has moved on.
Modern disk drives have large caches built in – consumer-grade drives have 32-64MB caches on them. When you take RAID controllers, HBAs, and modern SANs into account, there can be a huge cache in place to buffer writes to disk. These same caches, plus the SQL Server buffer pool, can also be configured to cache reads.
More useful counters to measure are the disk latency numbers – either Avg Disk sec/Read and Avg Disk sec/Write from perfmon, or the cumulative stall figures in sys.dm_io_virtual_file_stats. Definitions of acceptable latency vary from application to application but, in general, you can reasonably expect reads to be under 100ms and writes to be under 30ms.
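The DMV exposes cumulative stall times and operation counts per database file, so average latency is just the ratio of the two. Here’s a Python sketch of that arithmetic; the column names (io_stall_read_ms, num_of_reads, io_stall_write_ms, num_of_writes) are the DMV’s real columns, but the counter values are invented for illustration.

```python
# Average I/O latency from cumulative counters shaped like
# sys.dm_io_virtual_file_stats. Column names are real; the
# numbers are made up.

def avg_latency_ms(io_stall_ms, num_of_ops):
    """Cumulative stall time divided by operation count."""
    return io_stall_ms / num_of_ops if num_of_ops else 0.0

file_stats = {
    "io_stall_read_ms": 120_000, "num_of_reads": 10_000,
    "io_stall_write_ms": 45_000, "num_of_writes": 9_000,
}

read_ms = avg_latency_ms(file_stats["io_stall_read_ms"],
                         file_stats["num_of_reads"])     # 12.0 ms
write_ms = avg_latency_ms(file_stats["io_stall_write_ms"],
                          file_stats["num_of_writes"])   # 5.0 ms

print(read_ms < 100, write_ms < 30)  # within the rough ceilings above
```

One caveat: the DMV’s counters are cumulative since the file was opened, so real monitoring should take two snapshots and divide the deltas, not the lifetime totals.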
Page Life Expectancy
“But, I know deep down that page life expectancy should always be higher than 300!” — said every DBA, ever.
Not so fast there, sparky. Page life expectancy, as measured by SQL Server, is an average across NUMA nodes – without getting too deep into hardware configurations, each NUMA node keeps its own page life expectancy. On a four socket system, you could have a page life expectancy of 4,000 on one NUMA node and a page life expectancy of 0 on the other three, which still averages out to a page life expectancy of 1,000.
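The averaging problem is easy to reproduce. A quick Python sketch of the four-socket example above:

```python
# The four-socket example from the text: per-NUMA-node page life
# expectancy values and the single averaged number you'd see.
node_ple = [4000, 0, 0, 0]

server_ple = sum(node_ple) / len(node_ple)
print(server_ple)     # 1000.0 -- looks healthy enough

# min() tells the real story: three nodes are churning constantly.
print(min(node_ple))  # 0
```

If you monitor page life expectancy at all, the per-node numbers (or at least the minimum) are worth more than the blended average.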
While there are a variety of formulas to calculate an effective page life expectancy metric, it’s even more important to know what you’re measuring and why. If you’re worried about the page life expectancy dropping during index maintenance, you’re probably worried about the wrong thing. But if you’re worried about page life expectancy during OLTP or during reporting, then you’re worrying about the right thing.
Page Splits
From time to time, people get so focused on tracking index fragmentation that they start tracking page splits – it’s right there in the Page Splits/sec perfmon counter, after all. The thought behind tracking page splits is that you, as the DBA, will be able to figure out when those pesky page splits that lead to index fragmentation are occurring. With accurate monitoring, you can figure out a better way to keep your indexes shipshape. The problem with tracking page splits is that you’re tracking something, but not something with meaning.
It isn’t possible to derive meaning from the Page Splits/sec counter on its own – the number doesn’t just include the so-called bad splits (inserts into the middle of an index); it also includes good page splits.
Good page splits? A good page split occurs when a new page is allocated at the end of a table or index. These new page allocations count as page splits too. Measuring page splits… well, it just measures new pages being added to the table, regardless of where they land.
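A toy model makes the conflation obvious. The sketch below is entirely hypothetical – real page-split mechanics are more involved – but it shows that a strictly append-only workload and a shuffled, mid-index workload both tick the same split counter.

```python
# Toy model of fixed-capacity index pages, showing why a raw split
# counter can't separate "good" end-of-index splits from "bad"
# mid-index ones. Hypothetical; not real page-split mechanics.
import random

PAGE_CAPACITY = 4  # keys per page; absurdly small for illustration

def insert(pages, key):
    """Insert key into its sorted page; return 1 if a split occurred."""
    for i, page in enumerate(pages):
        if not page or key <= page[-1] or i == len(pages) - 1:
            target = i
            break
    page = pages[target]
    page.append(key)
    page.sort()
    if len(page) <= PAGE_CAPACITY:
        return 0
    mid = len(page) // 2  # overflow: split the page in half
    pages[target:target + 1] = [page[:mid], page[mid:]]
    return 1

# Strictly ascending inserts -- the "good" split case:
pages = [[]]
splits_from_appends = sum(insert(pages, k) for k in range(1, 21))

# The same 20 keys in shuffled order -- mid-index ("bad") splits:
random.seed(42)
keys = list(range(1, 21))
random.shuffle(keys)
pages = [[]]
splits_from_random = sum(insert(pages, k) for k in keys)

# Both workloads increment the counter; it can't tell them apart.
print(splits_from_appends, splits_from_random)
```

Both workloads report splits, even though only the shuffled one leaves fragmentation behind – which is exactly why the counter alone can’t tell you anything actionable.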
There’s no easy metric for tracking fragmentation rates, but you shouldn’t be worrying about it too much.
User Connections
Most people don’t think about it, but you can track the number of users connected to SQL Server. This has an incredibly loose relationship to the maximum number of worker threads – queries use threads and users run queries, right?
Unfortunately, this is another one of those numbers that doesn’t mean a lot on its own. Modern database clients use connection pools to keep persistent connections open to the database – it’s computationally cheaper to be lazy and keep a network connection open for a long time. In the .NET world, every different connection string has a separate connection pool. Each connection pool can have up to 100 connections to SQL Server. Most of those connections are going to be sitting around doing nothing until they’re killed off by the connection pool itself.
In extreme cases, a high number of active user connections (each with a lot of active threads) can lead to THREADPOOL waits – which, despite sounding marginally cool, are a terrible situation where SQL Server has run out of worker threads to hand out to users trying to do work.
Having a high number of user connections can be an indicator that something might be amiss, but it’s not concrete enough to raise a warning on – we can’t gather enough information with this metric.
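The arithmetic behind that mismatch is worth a sketch. The workload numbers below are invented; the default max-worker-threads formula is the documented default for 64-bit SQL Server with more than 4 logical CPUs, but check the documentation for your version.

```python
# Illustrative arithmetic: many idle pooled connections vs. a much
# smaller pool of worker threads. The formula below is the documented
# default for 64-bit SQL Server with > 4 logical CPUs (hedged --
# verify against your version's documentation).

def max_worker_threads(logical_cpus):
    """Default worker thread cap for a 64-bit instance."""
    if logical_cpus <= 4:
        return 512
    return 512 + (logical_cpus - 4) * 16

# Ten distinct connection strings, each .NET pool maxed at 100:
pooled_connections = 10 * 100          # 1,000 open connections
threads = max_worker_threads(16)       # 704 threads on 16 cores

# Harmless while those connections sit idle -- but if they all run
# queries at once, the thread pool runs dry: THREADPOOL waits.
print(pooled_connections, threads)
```

A thousand connections against seven hundred threads is a non-event right up until everyone gets busy at the same time, which is why the raw connection count alone isn’t alarm-worthy.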
You only get what you measure – by recording spurious metrics, you’ll end up with a misleading, and potentially wrong, view of performance.
While good metrics depend on your application, there’s something you can do immediately: only monitor things that produce actionable information. Don’t add more spam to your inbox – stop watching bogus metrics and start watching the things that make a difference to your application’s performance.