Building SQL ConstantCare®: Refining Our Recommendations

SQL ConstantCare

One of the fun things about building SQL ConstantCare® as a cloud-based service is that we can iterate fast, making improvements based on the data we find. We can assess the population as a whole, figure out what people are doing well, and look more closely at the rules they’re struggling with.

What checks are everyone passing?

Let’s look at a sample day from last week when 562 servers sent in data (out of the users who agreed to share their data to improve recommendations):

  • No one was using the -x startup flag
  • No one had known-dangerous third party modules installed
  • No one was running a 32-bit SQL Server
  • No one had high memory usage for in-memory OLTP (Hekaton)
  • No one’s plan cache had been erased recently (aside from server restarts)
  • No one needed to restart to fix a TempDB file configuration mismatch in the DMVs
  • Everyone on 2016 Standard Edition had patched to SP1 to get the free features

That’s not to say no one was having these problems when they first started ConstantCare – the subject of which rules people are fixing the fastest is the topic of another upcoming post.

What recommendations are people struggling with the most?

  • 276 servers need to enable remote access to the DAC
  • 197 servers have databases with no CHECKDB in the last couple of weeks
  • 135 servers have over 10,000 plans cached for a single query
  • 132 servers have databases with transaction logs larger than their total data file sizes
  • 89 servers have databases with no recent full backups
  • 79 servers have databases in full recovery model, but no recent log backups (kinda ties into the 132 with big log files, too)
  • 78 servers have databases without checksum page verification turned on
  • 70 servers have max memory at the default (unlimited) or larger than their physical memory
  • 69 servers have user databases on the C drive (and 60 have TempDB on C)

As I look at that list, I ask myself:

  • Are the recommendations right? Are the thresholds valid, or do I need to tweak some?
  • Why aren’t people doing those tasks?
  • What can I do to make those tasks easier?

Lemme share 2 examples of how we’re adapting the emails based on your data.

Making the DAC recommendation easier

Take the top one, enabling remote access to the DAC. I believe the recommendation is right, and I don’t think that many people are pushing back due to valid security concerns. (Hardly anybody is muting that recommendation.) However, it sounds dangerous the way it’s written, and the call to action was kinda vague: go read this blog post and watch this video to learn how to fix the problem. It sounds like it’s going to take work on your part – when in reality, it’d be really simple. I was just making it sound harder than it is.

What I really needed to do was rewrite the guidance to get faster adoption. Here’s the new one:

Enable the Remote DAC – When poison wait issues like threadpool strike, you can fix them by connecting to the Dedicated Admin Connection (DAC). However, by default, you can only do this by remote desktopping into the box – something I’m not a big fan of doing. Run the below command, and the next time SQL Server restarts, you’ll be able to access the DAC remotely. More information about the DAC.

And in case you’re curious, yes, we can adapt the T-SQL in the recommendation based on whether it’s safe to run in their environment, like if they have any reconfigurations pending or if they don’t have show-advanced-options turned on. (To do that, I found myself building dynamic SQL in PostgreSQL. Man, my life is weird.)

That new recommendation will go into production soon, and we’ll see how that affects completion rates. I really take that stuff seriously – if your servers aren’t getting better, then we’re not doing our job and earning our keep.

Changing thresholds for plan cache guidance

The remote DAC recommendation make sense, and I just needed to tune the advice to make it easier to follow. However, what about the 135 servers with over 10,000 plans cached for a single query? Fixing that is much harder: changing apps to send in parameterized queries can take a long time, and forced parameterization comes with some risks.

For this one, I stepped back and asked – “Am I really sure this is a problem for all of these servers? Or are we maybe triggering that recommendation a little too casually?”

Upon further investigation, I realized that most of these servers had no memory pressure. Many of ’em were outright bored – with wait time ratios below 1.0, and in many cases below 0.1! To pick an example: if a server has 256GB RAM, spends less than 10 minutes per hour waiting on resources, and has days of queries in the plan cache, is it really a problem if 10,000 plans are cached for a single query? I mean, sure, it’s bad, but in the grand scheme of things, how bad is it? Especially when you’ve got lots of servers and lots of tasks to perform?

Some customers even emailed in the sources of these queries – they included backup software, monitoring tools, and even sp_BlitzFirst’s logging query! Granted, with open source tools, smart folks like Gooch can check in improvements to mitigate the problem – but realistically, is your backup vendor going to react to issues like this? Probably not.

Therefore, we’re editing that rule’s thresholds so it only fires when you’re under serious memory pressure, and you can’t keep much in the plan cache, and there are tens of thousands of queries for a single plan in the cache. We’ll also consider it when it looks like your server is under heavy CPU load due to high compilations per second.

Analyzing customer data from SQL ConstantCare® is my favorite part of my job right now. We’re building a new set of realistic, doable performance recommendations that make a measurable difference in servers – cutting the fluff, and focusing on the big bang for the buck tasks. I want you to be able to make as few changes as possible, and see as big of a difference as possible. We’re learning more every week.

Previous Post
[Video] Office Hours 2018/4/25 (With Transcriptions)
Next Post
First Responder Kit Release: Darn May Showers

6 Comments. Leave new

  • Are you sure you don’t mean 10,000 query plans in the plan cache with a use count of 1 (rather than 10,000 query plans for a single query)?

    • No, I indeed do mean 10,000 query plans for 1 query. For example, strings that get seen as “different” queries even though they’re really the same thing:

      SELECT * FROM dbo.Users WHERE Id = 1;
      SELECT * FROM dbo.Users WHERE Id = 2;

      That kind of thing. They’re different strings, but in reality, they’re the same query.

      • Brian Beuning
        May 5, 2018 6:47 am

        We had this type issue. We looked at an app fix but it was not easy. Then we figured out the total memory used was under 1 MB and said we don’t care.

  • Gavin Harris
    May 10, 2018 5:29 am

    Hi Brent,

    I know that this is few days late, but I started looking at our DAC connections for named instances and hit a snag. DAC connections for named instances don’t have a set port #, so remote connection to those servers requires the network/firewall guys to open up the entire range of port numbers for TCP connections (49152-65535). They’re not entirely happy with that as a concept.

    Is there any way to set the DAC to a specific port number on a named instance the same as you can do with the standard connection?


Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.