
Brent Answers Your Virtualization Questions

I’m doing a 1-hour open Q&A session on May 8th for the PASS Virtualization Virtual Chapter.  Bring your VMware and Hyper-V questions about setup, performance, management, monitoring, or whatever, and I’ll answer ‘em.

You can even get a head start here – post your questions in the comments below, and I’ll build slides to answer ‘em ahead of time.  That way you can make sure you get the best answer possible.  (Well, from me anyway, ha ha ho ho.)

Then come join us on the webcast and hear the answers. See you there!

VMware HA is Not Database Server High Availability

I served with High Availability. I knew High Availability. High Availability was a friend of mine.  VMware HA, you’re no High Availability.

See, for us database administrators, high availability means protection when:

  • The system drive fills up because some potato decided to download a bunch of files
  • An operating system or database server update goes horribly awry
  • Or even when an OS or SQL update goes right – because the beauty of real high availability solutions is that they let you patch the standby node first, make sure it works, and then fail over to it so you can patch the other node.

Don’t get me wrong – I love VMware, and I love using VMware HA for database servers.  It’s a fantastic way to get higher availability for those old dinosaur database servers running SQL Server 2000 that we just can’t kill, yet still run important apps.  But in systems where uptime really matters, a single virtual machine isn’t the answer to high availability.  That’s where solutions like clustering, database mirroring, replication, and AlwaysOn Availability Groups come into play.

Thankfully, there’s good news: when VMware HA is paired with SQL Server technologies, they can both work even better.  Two standalone physical database servers running AlwaysOn Availability Groups are more reliable than just one server, but two virtual machines doing the same thing are even more reliable.  They’re protected from hardware failures because they can be spun up on any VMware host in the datacenter.  They’re more flexible because we can add CPU power or memory quickly based on demand.

I’ve blogged about why your SQL Server cluster shouldn’t be virtualized, and that still holds true.  If you need to build a hybrid AlwaysOn solution involving both failover clustered instances (FCIs) and standalone instances, I’d rather keep the FCIs on physical hardware.  But if you’re under pressure from management to cut costs and shrink your datacenter footprint, put the rest of the instances in virtual machines.  You’ll get the power and comfort you want from the physical machines while getting even higher availability from the virtual machines.  Everybody wins, and the future will be better tomorrow.

Wanna learn more? Check out our VMware, SANs, and Hardware training videos. It’s a 5-hour training video series explaining how to buy the right hardware, configure it, and set up SQL Server for the best performance and reliability. Here’s a preview:

Buy it now.

Big Changes for Big Virtual Machines in VMware vSphere 5

Even if your SQL Server is the only guest on a host, it still might not be as fast as bare metal.

One of the reasons is NUMA, which stands for Not Ur Momma’s Architecture.  Okay, no, smart reader, you caught me – it actually stands for Non-Uniform Memory Access.  In your momma’s architecture (Symmetric Multi-Processing), any CPU could access any memory all at the same low price.  In today’s NUMA servers, a single motherboard with two CPUs and 128GB of memory can actually be divided into different nodes.


When a process running on CPU #1 wants to access memory that’s directly connected to it, that’s local access, and it’s fast.  However, when that same process wants to grab data stored in CPU #2’s memory, that’s remote access, and it’s not as fast.

The performance penalty of remote memory access varies greatly from system to system, and you can measure it with Coreinfo from Sysinternals.  (That Russinovich knows everything.)  Blogger Linchi Shea went so far as to test the overhead of local versus remote access on one particular system, and he saw about 5% performance reduction.  He considered that the worst case scenario for the server hardware he was using, but keep in mind that the situation will be much worse on servers with higher costs for remote memory access like IBM’s x3850 and x3950.
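
Curious what the penalty looks like on your own hardware?  Coreinfo will dump the NUMA layout and an approximate cross-node access cost.  Here’s a rough sketch – the -n and -m switches handle the NUMA reporting on the versions I’ve used, but double-check the switch list on the Sysinternals page for your version:

    :: List the NUMA nodes and which logical processors belong to each
    coreinfo -n
    :: Show the approximate cross-NUMA node access cost (relative to the fastest access)
    coreinfo -m

The -m output is a small matrix of relative costs – the fastest (local) access shows as 1.0, and anything above that is what you’re paying to reach across nodes.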

Windows exposes NUMA configuration details to applications, and it’s up to the app to tune itself appropriately.  SQL Server has been NUMA-aware since 2005, and Microsoft’s continued to add improvements for it through 2008 and 2012.  To learn more about how SQL Server handles NUMA, check out Gavin Payne’s SQLbits presentation, The NUMA Internals of SQL Server 2012.
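
Want to see how SQL Server has carved itself up?  A couple of DMVs will show you.  Here’s a quick sketch – these exist in 2008 and later, though the exact column lists vary a bit by version:

    -- Memory nodes SQL Server has created (you may see one extra node for the DAC - ignore it)
    SELECT memory_node_id, virtual_address_space_reserved_kb
    FROM sys.dm_os_memory_nodes;

    -- Which schedulers (and therefore CPUs) belong to which NUMA node
    SELECT parent_node_id, scheduler_id, cpu_id
    FROM sys.dm_os_schedulers
    WHERE status = 'VISIBLE ONLINE';

On a NUMA box you’ll see multiple nodes with the schedulers split between them.  Keep that picture in mind – it’s about to matter.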

How Virtualization Screws Things Up

The good thing about virtualization is that it abstracts away the hardware.  You can run any virtual SQL Server on any server in the datacenter without a reinstall.  You can even move virtual machines from one host to another, live, without a service restart – even if the underlying hardware is completely different.  You can use multiple VMware hosts with completely different NUMA architectures – different numbers of cores per NUMA node, different amounts of memory per node, etc.

In order to pull this off, virtualization just presents a lump of CPUs and memory to our guest.  Our virtual machine has no idea what the underlying NUMA configuration is – and it can’t, because it could change at any time when we’re moved from one host to another.  This isn’t a performance problem for most apps because they don’t need to know anything about NUMA.  They just want a lump of CPUs and memory.

Unfortunately, this is a performance problem for SQL Server because it actually wants to know the underlying configuration – and wants to tune itself for it.  This is why, even when SQL Server is the only guest on the host, performance still won’t match bare metal.

How vSphere 5’s Virtual NUMA Fixed Things Up Again

There are three key decisions that will make your life easier (and possibly your performance better).

First, isolate your virtual SQL Servers onto their own hosts.  With SQL Server 2012’s licensing, when you buy Enterprise Edition for the host’s CPUs, you get unlimited virtual machines on that host.  For a while, this wasn’t easily doable in VMware because of their incredibly stupid memory limits with licensing, but thank goodness they fixed that license stupidity recently.  I can’t imagine a software vendor being dumb enough to limit their product to 64GB of memory in this day and age. <cough>sqlserverstandardedition</cough>  I’m so glad VMware listened to their end users and fixed that limitation.  <cough>microsoftpayattention</cough>  Restricting a database server to just $500 worth of memory, why, that’d be like releasing a tablet with 4 hours of battery life.  <cough>mylastpostasanmvp</cough>

Second, in that pool of hosts, use identical hardware running vSphere 5.  All of the hosts need to have the same NUMA architecture.  This does come with a drawback: it’s harder to do hardware refreshes.  Most shops just buy new hardware as it becomes available, throw it into the VMware cluster, and let VMware DRS automatically rebalance the load.  Unfortunately, the SQL Servers won’t be able to vMotion onto this hardware if it has a different NUMA configuration.  The guests will need to be shut down at the next maintenance window, get a different NUMA config, and then be booted on the appropriate hosts.

Finally, configure vSphere 5’s Virtual NUMA on your guests.  This is done automatically for guests with more than 8 vCPUs, but for guests with 8 or fewer, you’ll need to enable it manually.  Presto, SQL Server will see the underlying architecture and tune itself appropriately.  (Well, not entirely appropriately – now SQL Server just solves the easy problems for you, and creates new hard problems.)

To enable virtual NUMA on VMs with 8 or fewer vCPUs, follow the instructions on page 41 of the Performance Best Practices for VMware vSphere 5.1 PDF.  And hey, while you’re in there, get your learn on – it’s an excellent resource for SQL Server DBAs who want to know if their shop is doing things right.
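
For the impatient: the knob the PDF walks you through is a per-VM advanced configuration parameter.  I’m working from memory here – I believe the setting is numa.vcpu.min, which defaults to 9 (that’s why 8-vCPU-and-under guests don’t get virtual NUMA automatically) – so treat this as a sketch and verify the exact name and value against page 41 before touching production:

    numa.vcpu.min = "8"

It goes in the guest’s advanced configuration parameters, and as I understand it, the guest needs a full power cycle (not just a Windows reboot) before the new topology shows up.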

Wanna learn more? Check out our VMware, SANs, and Hardware training videos. It’s a 5-hour training video series explaining how to buy the right hardware, configure it, and set up SQL Server for the best performance and reliability. Here’s a preview:

Buy it now.

How to Set SQL Server Max Memory for VMware

SQL Server has two settings that kinda-sorta govern how much memory we’ll use to cache data: min server memory and max server memory.  I explain the basics of how they work in my Sysadmin’s Guide to SQL Server Memory, but things are a little different under virtualization.

VMware does a great job of sharing memory between different virtual machines, but to do that, sometimes it has to steal memory from one VM to take care of another.  Just because we gave 16GB of memory to our VM doesn’t mean the memory’s always there.  If a host crashes and we suddenly need to boot up a bunch more guests on our existing host, VMware might pull some memory away from the other guests temporarily.  If our company’s really cheap, we just might have never bought enough memory to begin with, and the memory might be stolen permanently.


To work around that, VMware admins can set a reservation for any guest’s memory.  It works like a reservation for a table in a restaurant – we’re guaranteeing that our corner table (er, memory) will be available whenever the virtual server needs it.  By default, guests don’t have reservations – they just walk up and try to take whatever they need at the time.  That works really well for most applications, but not for SQL Server.  SQL Server starts at near-zero memory used, and then gradually caches more and more data as queries request it.  Unlike most apps, SQL Server’s memory needs don’t go back down.  It’s like that guy who keeps going to the buffet over and over and claiming all the food for himself.

When we build new virtual machines, we need to come up with three numbers:

  • The guest’s memory – this is the amount of memory the guest thinks it has when it starts up.  Let’s say we’re building a virtual machine with 32GB of memory.
  • SQL Server’s max memory – I like to set this to leave the OS 4GB of memory or 10% of the guest’s memory, whichever is greater.  In this case, we’d set SQL Server’s max memory at 28GB, which leaves 4GB free for the OS.
  • The VMware reservation – the lowest amount of memory the guest will have.  Ideally, this is 100% of the guest’s memory, but that’s not always practical. If a host crashes, I’d rather be able to boot up all my guests with less memory than not be able to boot them up at all.  For SQL Server, I generally set my reservations at 75% of the guest memory – in this case, 24GB.

So now we have an interesting problem: in the event of a disaster, VMware’s balloon driver may fire up and claim 25% of the memory, leaving just 24GB total for the guest.  This will come as an ugly surprise for SQL Server because he was humming along using 28GB of memory (our max).

That’s where SQL Server’s min memory comes into play.  I have to set the min memory in a way that accommodates my reservation.  If my reservation is only 24GB, that means the balloon driver might pipe up and steal 8GB of my memory at any time.  If I still want to leave 4GB or 10% free, that means my min memory should be 20GB.
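
Put into T-SQL, the worked example above looks something like this – just a sketch, with the values in MB, and note that ‘show advanced options’ has to be on before these two settings are visible:

    -- 32GB guest with a 24GB reservation: max = 28GB, min = 20GB
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', 28672;  -- guest memory minus 4GB for the OS
    EXEC sp_configure 'min server memory (MB)', 20480;  -- reservation minus 4GB for the OS
    RECONFIGURE;

Both settings are dynamic – they take effect as soon as RECONFIGURE runs, no restart required.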


The max memory number doesn’t change – but suddenly we need to pay more attention to our min server memory number.  It’s completely okay to set that number even lower as long as you’re okay with reduced performance.  For example, if this server is a value meal that also hosts SSAS/SSIS/SSRS, you’ll need to set min memory much lower to let those other apps get their jobs done.

If the VMware team refuses to set a reservation, you can’t fake your way around it by setting a high min server memory number.  When things start swapping to disk, SQL Server is going to run slower – even if it’s not the one paging to disk.  When the OS ain’t happy, nobody’s happy.
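
If you suspect that’s happening, SQL Server will tattle on the OS for you.  Here’s a quick sketch of the kind of check I mean – both DMVs showed up in 2008:

    -- Is Windows itself low on memory?
    SELECT total_physical_memory_kb, available_physical_memory_kb, system_memory_state_desc
    FROM sys.dm_os_sys_memory;

    -- Is the SQL Server process getting squeezed?
    SELECT physical_memory_in_use_kb, memory_utilization_percentage, process_physical_memory_low
    FROM sys.dm_os_process_memory;

If available physical memory is in the basement and memory_utilization_percentage keeps sliding, something outside SQL Server – like a balloon driver on a stingy host – is winning that fight, and no min server memory setting will save you.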

Wanna learn more? Check out our VMware, SANs, and Hardware training videos. It’s a 5-hour training video series explaining how to buy the right hardware, configure it, and set up SQL Server for the best performance and reliability. Here’s a preview:

Buy it now.

Reader Letter from Kelly J.

Hi Brent –

I would imagine you get many of these notes of appreciation for what you do, but I just couldn’t leave the office for the day without saying “Thank you so much”.  I would never have thought that my personal investment would have paid such dividends.

Today, armed with the awesome information, knowledge and suggestions you provided in your SQL Server for VMware training,  I headed into what I thought would be a rather contentious meeting as we have been experiencing some serious performance issues for almost a year and the discussions just never went anywhere.

To make a long story short – it was amazing, I was able to bring them on board, got my reservations, got access to the ESX Hosts for checking performance and an admission that maybe the SAN could have been better configured for SQL Server.

I have been a fan for years – thanks again and I look forward to reading and hearing more from you, as always.

Best Regards,
Kelly J.

Wanna know what Kelly was raving about?  My 3-hour training session is half off until October 25. SOLD OUT!

Why Your SQL Server Cluster Shouldn’t Be Virtualized

When people buy my virtualization training video, one of the follow-up questions I get most often via email is, “Can I build SQL Server clusters in VMware and Hyper-V?”

In theory, yes.  Microsoft’s knowledge base article on SQL Server virtualization support says they’ll support you as long as you’re using configurations listed in the Server Virtualization Validation Program (SVVP).

But the real question for me isn’t whether or not Microsoft supports virtual SQL Server clusters.

The question is about whether you can support it.

Us geeks usually implement clusters because the business wants higher availability.  Higher availability means faster troubleshooting when the system is down.  We need to be able to get the system back up and running as quickly as possible.  Getting there usually means reducing the amount of complexity; complex systems take longer to troubleshoot.

If your SAN team, VMware team, and DBA team all hang out together like old friends, you’re good with virtual clusters.

Adding virtualization (which also means shared storage) makes things much tougher to troubleshoot.  If the business wants a highly available SQL Server, ask yourself these questions before virtualizing a SQL Server cluster:

  • Do you have a great relationship between the SQL Server, storage, and network teams?
  • Do all of the teams have read-only access to each other’s tools to speed up troubleshooting?
  • Do all of the teams have access to the on-call list for all other teams, and feel comfortable calling them?
  • Do you have a well-practiced, well-documented troubleshooting checklist for SQL Server outages?
  • Does your company have a good change control process to avoid surprises?
  • Do you have an identical environment to test configuration changes and patches before going live?

If the answer to any of those questions is no, consider honing your processes before adding complexity.

But the Business is Making Me Do It!

They’re making you do it because you haven’t clearly laid out your concerns about the business risk.  Show the business managers this same list of questions.  Talk to them about what each answer means for the business.  Was there a recent outage with a lot of finger-pointing between teams?  Bring that up, and remind the business about how painful that troubleshooting session was.  Things will only get worse under virtualization.

To really drive the point home, I like whiteboarding out the troubleshooting process for a physical cluster versus a virtual cluster.  Show all of the parts involved in the infrastructure, and designate which teams own which parts.  Every additional team involved means longer troubleshooting time.

Once the business signs off on that increased risk, then everyone’s on the same page.  They’re comfortable with the additional risk you’re taking, and you’re comfortable that you’re not to blame when things go wrong.  And when they do go wrong – and they will – do a post-mortem meeting explaining the outage and the time spent on troubleshooting.  If the finger-pointing between the app team, SQL Server DBAs, network admins, virtualization admins, and sysadmins was a problem, document it and share it (in a friendly way) with management.  They might change their mind when it’s time to deploy the next SQL Server cluster.

More Microsoft SQL Server Clustering Resources

Whether you want help choosing between an active/passive and an active/active cluster, or if you’re the kind of DBA who knows that’s not even the right name for failover clustered instances anymore, check out our SQL Server clustering training page.
