Consulting Lines: SQL Server Needs a Dog

Last Updated 7 years ago

SQL Server isn’t lonely, but it needs a dog. In today’s Consulting Lines series post, I’ll share my very favorite consulting line of all – one that makes people laugh while illustrating a point.

The Situation: Bob the Button Basher

My Blitz script helps me quickly spot SQL Server settings that have been changed from the default values. I always work with the client’s staff in the room, and when I see a changed setting, I turn to them and ask:

Me: “Hey, Bob, I notice that max degree of parallelism has been set to 1. The default is zero. Can you tell me why that was set to 1?” (The key here is to not say anything whatsoever about what the setting does – let Bob explain it to you.)

Bob: “Yeah, I read a blog saying that all OLTP servers should have maxdop set to 1.”

Me: “Ah, yeah, I’ve read those too. Did we measure the before and after impact of the change?”

Bob: “No, it just seemed like a good setting to change.”

Me: “Gotcha. Funny thing there – SQL Server is like an airplane. Modern airplanes are so advanced that there’s only two things in the cockpit: there’s the pilot, and there’s the dog. It’s the pilot’s job to feed the dog, and it’s the dog’s job to bite the pilot if he touches anything.”

Bob: “HA HA HO HO”

Me: “The problem with SQL Server is that it doesn’t have a dog.”

Bob: “Wait, what?”

Me: “SQL Server has a lot of knobs in the cockpit, and there’s a lot of advice out on the web that tells us to push all kinds of buttons to go faster. Thing is, SQL Server ships out of the box with a pretty good set of defaults. Let’s use those defaults until we have a really good reason to change something, and when we do, we need to document what we change so we know how it improved performance or reliability.”

What That Line Does

SQL Server really does have a lot of knobs, but every setting – every single one of them, including the defaults – has drawbacks. I salute Microsoft for doing a really good job of weighing the pros and cons and coming up with a solid set of defaults.

In my line of work, I flip a lot of knobs, but for every one I touch, I try to explain thoroughly what I’m doing. I may casually say, “This server needs Optimize for Ad Hoc Workloads enabled,” but when I do it, I also explain what I’m seeing about that particular server’s workload that justifies the change.

Jonathan Kehayias (Blog – @SQLPoolBoy) recently wrote an excellent article about Lock Pages in Memory, one of those settings people just love playing with. Before he posted the article, we had a great discussion about why we have differing opinions on that recommendation. I understand why he recommends using it as a default, but the reason I don’t recommend it is illustrated beautifully by the fourth comment on the post:

“With LPIM, SQLOS will back down its usage to your ‘min server memory’ value. At that point Windows can try and hard trim the other processes, and potentially page the process requiring the 30GB memory out, or if it can’t respond efficiently this way, you may face a crash due to Out Of Memory for the OS.” – Jonathan

Ouch. I’m not a big fan of giving a default recommendation that might crash the OS. There’s also a bug in SQL2008R2 that causes corruption with LPIM. I understand why Microsoft doesn’t enable Lock Pages in Memory by default, and I agree with them – it’s just not a safe button to push if you aren’t intimately familiar with the consequences and how to monitor for them.

This consulting line helps establish whether the other person understands the real ramifications of pushing buttons. If they’re blindly bashing buttons based on blogs, it helps them understand the risks.

What Happens Next

The success of this line depends on your ability to dive deep and explain the concepts. You have to be able to explain the pros and cons of both settings – the default setting and the one they chose. Jonathan’s post on LPIM is a great example – if you can be comfortable explaining a topic to that level of detail, then you should feel comfortable flipping that switch. (If, on the other hand, you’re not even comfortable reading a post with that level of detail, then you shouldn’t be flipping that switch.)

Bob: “So should we leave maxdop set to 1, or go back to the default? We have no CXPACKET waits, so it must be working.”

Me: “Well, that’s true – no queries are going parallel. But parallelism isn’t always a bad thing – sometimes we want to throw lots of CPU power at queries. Right now, we’ve got queries that aren’t running as fast as they could because they’re not getting the CPU they need.”

Bob: “But the server isn’t CPU-bound – CPU % is usually under 10%.”

Me: “That’s because you’re measuring CPU use as a total, not per thread. We’ve only got a few active queries at a time, but we’ve got a four-socket server with 10 cores per processor, and hyperthreading is turned on. Our CPUs are sitting around bored. If we let some queries go parallel, we’ll be able to use those CPUs more effectively. Based on what I’m seeing in your procedure cache, I recommend setting maxdop to 4 and cost threshold for parallelism to 100. Here’s the evidence that supports it, and here’s how we’ll measure the success of that change…”

More of My Favorite Consulting Lines

Perk Up Your Career with the SQL Server Troubleshooting Checklist

sp_Blitz® Updated to v5

29 Comments. Leave new

Jonathan Kehayias
December 16, 2011 9:04 am

The comment about Windows crashing due to Out of Memory issues, is only 1/4 of the original comment, and out of context really doesn’t provide the whole story. The most important part of that comment is actually the first two sentences:

My first question would be, what on the server caused a 30GB memory demand that was not SQL? If you have something that can do that, you haven’t appropriately planned your memory configuration if you have ‘max server memory’ set at 54GB, as explained in the article.

I’ve never had a server crash due to the OS being Out of Memory with LPIM enabled when all the considerations for its usage, as covered in the article, were made as a part of the configuration. I’ve worked on plenty of servers that did have problems when it wasn’t used that could be have been avoided though.

Reply
- Brent Ozar
  December 16, 2011 9:06 am
  
  I hear ya, but I *have* seen episodes where there’s been a suddenly huge demand on the server’s memory that wasn’t SQL. Yes, if someone carefully monitors the server, LPIM won’t be a problem, but we don’t get into these situations when we’re carefully monitoring the server to begin with. That’s my concern about doing it as a default – we don’t necessarily watch every server carefully.
  
  Reply
  - Robert L Davis
    December 17, 2011 11:51 am
    
    If some process on my production SQL Server that isn’t SQL suddenly needs 30 GB of memory, I’d rather it crash that process than SQL get hammered by being paged out. Whatever it is isnt likely to have a legitimate need for that much memory or I wouldn’t put it on my SQL Server in the first place.
    
    After all, when was the last time time an end-user complained because they were running a critical query and anti-virus crashed while it was running?
    
    The argument about people not carefully monitoring can be made on both sides of the argument. Let’s assume that they’re not monitoring at all (I think the majority of servers out there are being monitored). If that’s the case, then setting max memory appropriately and enabling LPIM will result in far fewer SQL problems than not … in my experience.
    
    Reply
    - Robert L Davis
      December 17, 2011 11:53 am
      
      I meant to say “I think the majority of servers out there are NOT being monitored”.
      
      And to clarify, I include servers that have monitoring running against them but nobody looks at it.
      
      Reply
      - Jason
        December 21, 2011 12:18 pm
        
        “I think the majority of servers out there are NOT being monitored”.
        
        If it is not monitored, why do you worry about performance?
        
        I think the majority of PRODUCTION servers out there ARE BEING monitored
      - Brent Ozar
        December 21, 2011 5:00 pm
        
        Hi, Jason. We’re going to have to agree to disagree – in both Robert and my experience, most production servers aren’t being monitored. It’s great that you’re monitoring yours, though.
    - Brent Ozar
      December 17, 2011 11:53 am
      
      Robert – the process doesn’t crash, the *OS* crashes. You can get a BSOD. It’s exacerbated by the 64-bit caching problem:
      
      https://blogs.msdn.com/b/ntdebugging/archive/2007/11/27/too-much-cache.aspx?Redirected=true
      
      I’ve seen a lot of issues lately where servers have had zero free memory due to that.
      
      Reply
  - Jason
    December 21, 2011 11:49 am
    
    You don’t really want to run something else a huge demand of memory on SQL Server database host. If it is an app with run-away memory problem, run it on app server. The first goal of database server is to allow databases working well. If you put things that have contention to the same resources, neither one is going to work well.
    
    Reply
    - Brent Ozar
      December 21, 2011 11:51 am
      
      Jason – of course I don’t *want* to. I also really don’t want to run SQL 2000, use SATA drives, use RAID 5, or any number of other things. It’s not always about what we *want* – it’s about what we have to deal with out in the wild.
      
      Reply
      - Robert L Davis
        April 17, 2013 5:42 pm
        
        When I used anti-virus as an example, I idn’t just pull it out of my … errr, thin air. Yes, if a process has a memory leak and starves itself and the OS for memory, just that process can crash and not the whole OS. I used anti-virus because it is known to do this. But really, I’ve seen it with lots of system drivers where the nic will crash or the USB port will crash or something else because of a driver with a memory leak.
  - Jason
    December 21, 2011 11:51 am
    
    You don’t really want to run something else a huge demand of memory on SQL Server database host. If it is an app with run-away memory problem, run it on app server. The first goal of database server is to allow databases working well. If you put things that have contention to the same resources, neither one is going to work well.
    
    Reply
    - Bill
      January 16, 2013 8:22 am
      
      What if… you have a Six Cluster setup with three instances of SQL server on it. each one with a different primary and secondary node, but if two fail it has some where else to go. In that situation, worse case senario… you could have 3 instances of SQL on one box all fighting for memory until you get the other 5 servers up.
      
      Reply
      - Brent Ozar
        January 16, 2013 8:45 am
        
        Bill – generally speaking, I don’t see a lot of people running multiple standby nodes in clusters. If people need 3 instances of SQL Server, I’ll often see them running 4 nodes – one sitting around passive. The worst case scenario isn’t 3 instances running on 1 node – the worst case scenario is zero instances running because you’ve got a major failure like storage or networking. 😀
      - Bill
        January 16, 2013 9:21 am
        
        I was refering to servers themselves. yes our san or network going down would be much worse for customers ^.^ we gained on that was a 6 node setup, initially setup for 2 node 1 instance but it was decided before my time to make it 6 node 3 instance just in case. This actually has happened with all three on the same box. in that situation, now we have 3 SQL servers fighting for the same memory.
        
        Just a basic setup where SQL has a fight that supports the idea of a “rogue” item taking another 30GB of memory.
Andy Galbraith (@DBA_ANDY)
December 16, 2011 1:20 pm

I’m gonna buy me a dog….because I need a friend now…

http://www.youtube.com/watch?v=RBlitTE9kLk

{-:

Reply
Michael J. Swart
December 16, 2011 1:39 pm

A meta-question about blog post writing: Something about a sentence you wrote piqued my interest. You wrote “Before he [Jonathan] posted the article, we had a great discussion …”

How often do you have blog posts vetted by others? How often do other people ask you to vet their posts?

Being able to write articles whose accuracy you can stand behind is so so so valuable. I feel comfortable writing about something technical if I have a ton of experience in that area. But often, I’m writing about something fascinating that I just learned recently. It would be cool to have a second pair of eyes on it before the post goes live.

Reply
- Brent Ozar
  December 16, 2011 1:41 pm
  
  Michael – great question. I have the luxury of a few partners who love to blog too, and we bounce blog post ideas off each other all the time. I’d say maybe half of my blog posts get run past Jeremiah & Kendra before they go live. Before we all teamed up, though, I think I bounced maybe 1/4 to 1/10 of my posts off somebody else first. It helps so much to have someone else poke holes in it before the public does.
  
  Reply
Brent Ozar
December 17, 2011 11:59 am

Update – I’ve gotten a couple of private emails from readers pointing out another problem with locking pages in memory. Most shops are no longer running VMware ESX 3.5, so this shouldn’t be a widespread issue, but I understand why they’re pointing it out. It’s another reason why LPIM shouldn’t be enabled without understanding the entire environment and its risks.

Reply
David Eaton
December 19, 2011 12:16 pm

Brent,

You are pretty right on most of the defaults, but my experience has been very negative with leaving MaxDoP set to 0. Even on a server with 16 Itanium CPUs running IA64 SQL Server had issues with over allocation of CPU resources with it set to 0.

But you and I think alike, because setting the MaxDoP to 4 and setting the threshold for parallelism to 160 solved the problems quite well.

On Lock Pages in Memory, that needs to be used with the SQL Server Min/Max settings. If the Min and the Max are the same with Lock Pages In Memory enabled, SQL sets up a continous area of memory that is well bounded, and that helps SQL Server to not worry about testing memory locations as it is processing.

Reply
Chuck Rummel
December 28, 2011 7:45 am

And now you’re going to find a bunch of servers configured with “maxdop to 4 and cost threshold for parallelism to 100” because someone read it on some guy’s blog… 🙂

Reply
- Brent Ozar
  December 28, 2011 7:47 am
  
  Hahaha, yeah. Every now and then I do run across stuff like that in the wild, and the client points out where they read it on my blog from 2005, and then I have to go through and add language on that page explaining why things may have changed since then. 😀
  
  Reply
Robert L Davis
April 17, 2013 5:45 pm

Read this blog post again today,a nd I should have commented the first time that I think the dog analogy is perfect and SQL Server definitely does need it.

Reply
- Brent Ozar
  April 19, 2013 6:42 am
  
  Thanks, sir! Glad you liked it.
  
  Reply
Dhananjay
December 27, 2013 5:10 am

Well explained with perfect example comparing Plane with SQL

Reply
Cody Konior
January 5, 2015 12:21 am

There is an adjacent argument to this, when you walk into an environment of hundreds of instances all configured differently with no documentation about why they have been set this way.

This can then go in one of a few ways:
– Set a new standard and only apply it to new servers (not very useful unless there’s a high rate of server turnover).
– Set a new standard and apply it to servers one by one after thorough investigation. However this is extremely time consuming and changes the balance of cost vs benefit; just trying to build relationships with all of the vendors (who want money for their time discussing configurations, even though they often don’t even know themselves), internal support teams, business groups, etc, often all in vain.
– Set a standard and apply it to all servers after a best-effort investigation into the outliers. If it causes any severe problems, they can be quickly reverted and now you have documentation about why something is required. More likely it will cause milder performance issues down the line which can then be investigated properly.

What kind of settings?
– Max server memory.
– MAXDOP
– Page verification
– Auto close, auto shrink
– But not things that would obviously immediately break most apps (xp_cmdshell).

This often evokes the same response from other DBAs.
a) This can break things (true, and accepted).
b) Better to just leave it rather than touch it and break everything.

My counter belief is that standardisation plus documentation is worth the risk a little short term pain – even if you can’t afford to do the baseline before and after.

Everyone seems to fall into one camp or the other. What do you think Brent?

And this aside, maybe it’s my inexperience showing, but baselining on modern virtual hardware on a SAN and with multiple applications per server; we’re finding more and more how difficult it is to extract anything meaningful from it. I find my time much better spent starting investigating performance issues from the basics up: deadlocks and missing/unused indexes, and if all else fails then capturing and replacing traces elsewhere to tighten things up.

Reply
- Brent Ozar
  January 5, 2015 6:33 am
  
  Cody – for that, I’ll answer with another Consulting Line:
  
  https://www.brentozar.com/archive/2011/08/consulting-line-problem-trying-solve/
  
  Reply
IM
April 20, 2016 12:58 pm

wish I had a dog

Reply
Michael Irwin
December 6, 2021 5:56 pm

I worked once writing CADD software. The keyboards were metal, strangely, and the pointing devices were light pens. Some users would sit thinking and point the pen up, at the fluorescent lights, thus effectively hitting the machine with a position-click every 1/60th of a second! After a few seconds the OS gave up !!
We asked them not to do this …. no use.
So we ran a cable from the wall power socket and soldered it to the keyboard, and hinted at the pain of 240V AC! Problem over !

Reply
- Brent Ozar
  December 7, 2021 12:46 am
  
  HAHAHA!
  
  Reply