Consulting Lines: SQL Server Needs a Dog

SQL Server isn’t lonely, but it needs a dog.  In today’s Consulting Lines series post, I’ll share my very favorite consulting line of all – one that makes people laugh while illustrating a point.

The Situation: Bob the Button Basher

My Blitz script helps me quickly spot SQL Server settings that have been changed from the default values.  I always work with the client’s staff in the room, and when I see a changed setting, I turn to them and ask:
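You can get a rough flavor of that check with a plain query against sys.configurations. There's no "default value" column in that view, so the defaults below are a hand-picked sample for illustration only – the real script carries a much fuller list:

```sql
-- Sketch: flag a few settings whose running value differs from the
-- out-of-the-box default. The default values here are hand-picked
-- examples; sys.configurations doesn't store the default itself.
SELECT name, value_in_use
FROM sys.configurations
WHERE (name = 'max degree of parallelism'      AND value_in_use <> 0)
   OR (name = 'cost threshold for parallelism' AND value_in_use <> 5)
   OR (name = 'optimize for ad hoc workloads'  AND value_in_use <> 0);
```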

Me: “Hey, Bob, I notice that max degree of parallelism has been set to 1.  The default is zero.  Can you tell me why that was set to 1?”  (The key here is to not say anything whatsoever about what the setting does – let Bob explain it to you.)

Bob: “Yeah, I read a blog saying that all OLTP servers should have maxdop set to 1.”

Me: “Ah, yeah, I’ve read those too.  Did we measure the before and after impact of the change?”

Bob: “No, it just seemed like a good setting to change.”

Me: “Gotcha.  Funny thing there – SQL Server is like an airplane.  Modern airplanes are so advanced that there’s only two things in the cockpit: there’s the pilot, and there’s the dog.  It’s the pilot’s job to feed the dog, and it’s the dog’s job to bite the pilot if he touches anything.”

Bob: “HA HA HO HO”

Me: “The problem with SQL Server is that it doesn’t have a dog.”

Bob: “Wait, what?”

Me: “SQL Server has a lot of knobs in the cockpit, and there’s a lot of advice out on the web that tells us to push all kinds of buttons to go faster.  Thing is, SQL Server ships out of the box with a pretty good set of defaults.  Let’s use those defaults until we have a really good reason to change something, and when we do, we need to document what we change so we know how it improved performance or reliability.”
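For reference, here's roughly what checking that particular knob and putting it back looks like. Treat it as a sketch – only run the change after you've gathered the before-and-after evidence we just talked about, and write down why:

```sql
-- See the current setting (it's an advanced option):
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism';

-- Revert to the default of 0 only after measuring, and document the change:
EXEC sp_configure 'max degree of parallelism', 0;
RECONFIGURE;
```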

What That Line Does

SQL Server really does have a lot of knobs, but every setting – every single one of them, including the defaults – has drawbacks.  I salute Microsoft for doing a really good job of weighing the pros and cons and coming up with a solid set of defaults.

In my line of work, I flip a lot of knobs, but for every one I touch, I try to explain thoroughly what I’m doing.  I may casually say, “This server needs Optimize for Ad Hoc Workloads enabled,” but when I do it, I also explain what I’m seeing about that particular server’s workload that justifies the change.
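As a sketch of what that evidence might look like: before flipping the switch, check how much of the plan cache is burned on single-use ad hoc plans:

```sql
-- Evidence first: how much cache is tied up in single-use ad hoc plans?
SELECT COUNT(*) AS single_use_plans,
       SUM(CAST(size_in_bytes AS bigint)) / 1024 / 1024 AS cache_mb
FROM sys.dm_exec_cached_plans
WHERE objtype = 'Adhoc'
  AND usecounts = 1;

-- If the numbers justify it, enable the setting and document why:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
RECONFIGURE;
```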

Jonathan Kehayias (Blog@SQLPoolBoy) recently wrote an excellent article about Lock Pages in Memory, one of those settings people just love playing with.  Before he posted the article, we had a great discussion about why we have differing opinions on that recommendation.  I understand why he recommends using it as a default, but the reason I don’t recommend it is illustrated beautifully by the fourth comment on the post:

“With LPIM, SQLOS will back down its usage to your ‘min server memory’ value. At that point Windows can try and hard trim the other processes, and potentially page the process requiring the 30GB memory out, or if it can’t respond efficiently this way, you may face a crash due to Out Of Memory for the OS.” – Jonathan

Ouch.  I’m not a big fan of giving a default recommendation that might crash the OS.  There’s also a bug in SQL Server 2008 R2 that causes corruption with LPIM.  I understand why Microsoft doesn’t enable Lock Pages in Memory by default, and I agree with them – it’s just not a safe button to push if you aren’t intimately familiar with the consequences and how to monitor for them.
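If you’re not sure whether a server already has Lock Pages in Memory in effect, one quick check on SQL Server 2008 and later is a look at the process memory DMV:

```sql
-- Nonzero means SQL Server is currently locking pages in memory.
SELECT locked_page_allocations_kb
FROM sys.dm_os_process_memory;
```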

This consulting line helps establish whether the other person understands the real ramifications of pushing buttons.  If they’re blindly bashing buttons based on blogs, it helps them understand the risks.

What Happens Next

The success of this line depends on your ability to dive deep and explain the concepts.  You have to be able to explain the pros and cons of both settings – the default setting and the one they chose.  Jonathan’s post on LPIM is a great example – if you can be comfortable explaining a topic to that level of detail, then you should feel comfortable flipping that switch.  (If, on the other hand, you’re not even comfortable reading a post with that level of detail, then you shouldn’t be flipping that switch.)

Bob: “So should we leave maxdop set to 1, or go back to the default?  We have no CXPACKET waits, so it must be working.”

Me: “Well, that’s true – no queries are going parallel. But parallelism isn’t always a bad thing – sometimes we want to throw lots of CPU power at queries.  Right now, we’ve got queries that aren’t running as fast as they could because they’re not getting the CPU they need.”

Bob: “But the server isn’t CPU-bound – CPU % is usually under 10%.”

Me: “That’s because you’re measuring CPU use as a total, not per thread.  We’ve only got a few active queries at a time, but we’ve got a four-socket server with 10 cores per processor, and hyperthreading is turned on.  Our CPUs are sitting around bored.  If we let some queries go parallel, we’ll be able to use those CPUs more effectively.  Based on what I’m seeing in your procedure cache, I recommend setting maxdop to 4 and cost threshold for parallelism to 100.  Here’s the evidence that supports it, and here’s how we’ll measure the success of that change…”
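The change itself is just a couple of sp_configure calls – but remember, the 4 and the 100 came from that particular server's workload and procedure cache, not from a universal rule:

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;
EXEC sp_configure 'cost threshold for parallelism', 100;
RECONFIGURE;
```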

More of My Favorite Consulting Lines


29 Comments

  • The comment about Windows crashing due to Out of Memory issues is only 1/4 of the original comment, and out of context it really doesn’t provide the whole story. The most important part of that comment is actually the first two sentences:

    My first question would be, what on the server caused a 30GB memory demand that was not SQL? If you have something that can do that, you haven’t appropriately planned your memory configuration if you have ‘max server memory’ set at 54GB, as explained in the article.

    I’ve never had a server crash due to the OS being Out of Memory with LPIM enabled when all the considerations for its usage, as covered in the article, were made as a part of the configuration. I’ve worked on plenty of servers that did have problems when it wasn’t used that could have been avoided, though.

    • I hear ya, but I *have* seen episodes where there’s been a suddenly huge demand on the server’s memory that wasn’t SQL. Yes, if someone carefully monitors the server, LPIM won’t be a problem, but we don’t get into these situations when we’re carefully monitoring the server to begin with. That’s my concern about doing it as a default – we don’t necessarily watch every server carefully.

      • If some process on my production SQL Server that isn’t SQL suddenly needs 30 GB of memory, I’d rather it crash that process than have SQL get hammered by being paged out. Whatever it is isn’t likely to have a legitimate need for that much memory, or I wouldn’t put it on my SQL Server in the first place.

        After all, when was the last time an end-user complained because they were running a critical query and anti-virus crashed while it was running?

        The argument about people not carefully monitoring can be made on both sides. Let’s assume that they’re not monitoring at all (I think the majority of servers out there are being monitored). If that’s the case, then setting max memory appropriately and enabling LPIM will result in far fewer SQL problems than not … in my experience.

      • You don’t really want to run something else with a huge memory demand on a SQL Server database host. If it’s an app with a runaway memory problem, run it on an app server. The first goal of a database server is to let its databases work well. If you put things that contend for the same resources together, neither one is going to work well.

        • Jason – of course I don’t *want* to. I also really don’t want to run SQL 2000, use SATA drives, use RAID 5, or any number of other things. It’s not always about what we *want* – it’s about what we have to deal with out in the wild.

          • When I used anti-virus as an example, I didn’t just pull it out of my … errr, thin air. Yes, if a process has a memory leak and starves itself and the OS for memory, just that process can crash and not the whole OS. I used anti-virus because it is known to do this. But really, I’ve seen it with lots of system drivers where the NIC will crash or the USB port will crash or something else because of a driver with a memory leak.


        • What if… you have a six-node cluster setup with three instances of SQL Server on it, each one with a different primary and secondary node, but if two fail it has somewhere else to go? In that situation, worst-case scenario, you could have 3 instances of SQL Server on one box, all fighting for memory until you get the other 5 servers back up.

          • Bill – generally speaking, I don’t see a lot of people running multiple standby nodes in clusters. If people need 3 instances of SQL Server, I’ll often see them running 4 nodes – one sitting around passive. The worst case scenario isn’t 3 instances running on 1 node – the worst case scenario is zero instances running because you’ve got a major failure like storage or networking. 😀

          • I was referring to the servers themselves. Yes, our SAN or network going down would be much worse for customers ^.^ What we ended up with was a 6-node setup, initially planned as 2 nodes and 1 instance, but it was decided before my time to make it 6 nodes and 3 instances just in case. This has actually happened with all three on the same box, and in that situation we had 3 SQL Servers fighting for the same memory.

            Just a basic setup where SQL gets into that kind of fight supports the idea of a “rogue” item taking another 30GB of memory.

  • I’m gonna buy me a dog….because I need a friend now…


  • A meta-question about blog post writing: Something about a sentence you wrote piqued my interest. You wrote “Before he [Jonathan] posted the article, we had a great discussion …”

    How often do you have blog posts vetted by others? How often do other people ask you to vet their posts?

    Being able to write articles whose accuracy you can stand behind is so so so valuable. I feel comfortable writing about something technical if I have a ton of experience in that area. But often, I’m writing about something fascinating that I just learned recently. It would be cool to have a second pair of eyes on it before the post goes live.

    • Michael – great question. I have the luxury of a few partners who love to blog too, and we bounce blog post ideas off each other all the time. I’d say maybe half of my blog posts get run past Jeremiah & Kendra before they go live. Before we all teamed up, though, I think I bounced maybe 1/4 to 1/10 of my posts off somebody else first. It helps so much to have someone else poke holes in it before the public does.

  • Update – I’ve gotten a couple of private emails from readers pointing out another problem with locking pages in memory. Most shops are no longer running VMware ESX 3.5, so this shouldn’t be a widespread issue, but I understand why they’re pointing it out. It’s another reason why LPIM shouldn’t be enabled without understanding the entire environment and its risks.

  • Brent,

    You are pretty right on most of the defaults, but my experience has been very negative with leaving MaxDoP set to 0. Even on a server with 16 Itanium CPUs running IA64 SQL Server, there were issues with over-allocation of CPU resources with it set to 0.

    But you and I think alike, because setting the MaxDoP to 4 and setting the threshold for parallelism to 160 solved the problems quite well.

    On Lock Pages in Memory, that needs to be used with the SQL Server Min/Max memory settings. If the Min and the Max are the same with Lock Pages In Memory enabled, SQL Server sets up a contiguous, well-bounded area of memory, and that helps SQL Server not have to worry about testing memory locations as it is processing.
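    For example, pinning Min and Max to the same value looks like this – the 53248 MB (52 GB) figure is purely illustrative, not a recommendation:

    ```sql
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'min server memory (MB)', 53248;
    EXEC sp_configure 'max server memory (MB)', 53248;
    RECONFIGURE;
    ```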

  • And now you’re going to find a bunch of servers configured with “maxdop to 4 and cost threshold for parallelism to 100” because someone read it on some guy’s blog… 🙂

    • Hahaha, yeah. Every now and then I do run across stuff like that in the wild, and the client points out where they read it on my blog from 2005, and then I have to go through and add language on that page explaining why things may have changed since then. 😀

  • Read this blog post again today, and I should have commented the first time that I think the dog analogy is perfect and SQL Server definitely does need it.

  • Well explained, with a perfect example comparing a plane with SQL Server.

  • There is an adjacent argument to this: when you walk into an environment of hundreds of instances, all configured differently, with no documentation about why they have been set this way.

    This can then go in one of a few ways:
    – Set a new standard and only apply it to new servers (not very useful unless there’s a high rate of server turnover).
    – Set a new standard and apply it to servers one by one after thorough investigation. However, this is extremely time-consuming and changes the balance of cost vs. benefit; just trying to build relationships with all of the vendors (who want money for their time discussing configurations, even though they often don’t even know themselves), internal support teams, business groups, etc. is often all in vain.
    – Set a standard and apply it to all servers after a best-effort investigation into the outliers. If it causes any severe problems, they can be quickly reverted and now you have documentation about why something is required. More likely it will cause milder performance issues down the line which can then be investigated properly.

    What kind of settings?
    – Max server memory.
    – MAXDOP
    – Page verification
    – Auto close, auto shrink
    – But not things that would obviously immediately break most apps (xp_cmdshell).

    This often evokes the same response from other DBAs.
    a) This can break things (true, and accepted).
    b) Better to just leave it rather than touch it and break everything.

    My counter-belief is that standardisation plus documentation is worth the risk of a little short-term pain – even if you can’t afford to do the baseline before and after.

    Everyone seems to fall into one camp or the other. What do you think Brent?

    And this aside, maybe it’s my inexperience showing, but with baselining on modern virtual hardware, on a SAN, and with multiple applications per server, we’re finding more and more how difficult it is to extract anything meaningful from it. I find my time much better spent investigating performance issues from the basics up: deadlocks and missing/unused indexes, and if all else fails, capturing and replaying traces elsewhere to tighten things up.

  • wish I had a dog

  • Michael Irwin
    December 6, 2021 5:56 pm

    I worked once writing CADD software. The keyboards were metal, strangely, and the pointing devices were light pens. Some users would sit thinking and point the pen up at the fluorescent lights, thus effectively hitting the machine with a position-click every 1/60th of a second! After a few seconds the OS gave up!
    We asked them not to do this … no use.
    So we ran a cable from the wall power socket and soldered it to the keyboard, and hinted at the pain of 240V AC! Problem over!

