Here’s some of the reasons companies usually virtualize their SQL Servers:
- Cost savings on hardware
- Cost savings on Windows OS licensing
- Cost savings on SQL Server licensing
- Protect against the failure of a single hardware element
- Leverage extended features for Disaster Recovery
- Automatic load balancing across multiple hosts
- Easier hardware replacement/migration
When we perform a SQL Critical Care® on a virtualized SQL Server, we often ask, “Are we actually getting those benefits?”
1. Cost savings on hardware – do you find yourself putting one SQL Server guest on each host, isolating them to make sure they get the performance they need? If so, you’re not actually saving money on hardware.
2. Cost savings on Windows OS licensing – as a standard, some companies license all their virtualization hosts with Windows Server Datacenter Edition in order to get unlimited virtualization rights. However, if you’re only running one guest per host (or just a few), then you’re not saving money here either.
3. Cost savings on SQL Server licensing – for this one, you’ve gotta do a little bit harder work. Add up the licensing you’re spending now, and look at what it would take to run similar instances on bare metal hardware. Keep in mind that you can still buy dual-socket, quad-core servers that are insanely powerful (768GB RAM, dozens of SSDs), thereby keeping your SQL licensing lower.
4. Protect against the failure of a single hardware element – on the free versions of most hypervisors, you don’t get automatic failover protection. You can manually start up a guest on another host with some human intervention. Is that enough for the business, or are they assuming it’ll all happen automatically with only a minute or two of downtime – even when you’re not around? Or even worse, do you not have enough hardware horsepower to start up your biggest SQL Server guest somewhere else if its host fails? Or, heaven forbid, are you using local SSDs with virtualization, thereby missing the entire ability to move guests around?
5. Leverage extended features for Disaster Recovery – VMware and Hyper-V have killer features (and third-party app extensions) that make it easy to replicate a guest from one site to another. Are you using those, or have you given up because SQL Server’s data change rates are too high, and your network can’t keep up?
6. Automatic load balancing across multiple hosts – VMware’s Distributed Resource Scheduler (DRS) will automatically shuffle VMs around between hosts based on resource utilization. It’s an amazing way to react to performance issues with less human intervention. You should be using it.
7. Easier hardware replacement/migration – because SQL Server licensing is priced by the CPU core, and it’s super expensive, many shops choose to improve their virtualization host hardware annually. Whenever they need more capacity in their VMware or Hyper-V clusters, they drop in a couple of new hosts, vMotion or LiveMigrate the most expensive per-core guests over to those hosts (thereby taking advantage of today’s faster processors), and then give everybody else the hand-me-downs. It’s easy to do even live during the daytime. However, some shops are still running their SQL Servers on CPUs that might get featured on Antiques Roadshow.
If you’re not leveraging at least some of these virtualization features, and you don’t plan to…then what was the point of virtualizing to begin with? Jump on in – the water’s fine!
Our first two concerns in this series were virtualization licensing and recoverability. It’s not enough to have answers for these questions today, though – we have to think about how long the server will be in place.
The server’s expiration date isn’t dictated by IT. If it was up to us, we’d never run an old, unsupported version of anything – but sometimes the business is quite content doing just that. So how do we plan for this from the start?
We need to get written answers from the business using a simple worksheet. In our Virtualization, Storage, and Hardware video class, students get an Excel spreadsheet to use when interviewing the business users about the project requirements. Here’s one of the tabs:
Start with the first question: how long does this SQL Server need to last? This can be measured in time (like 3 years) or in business growth (like until we quadruple the number of active users).
There are columns for several answers because database servers are rarely one-person projects – people have lots of different opinions, so everybody fills in their answers, and then we come to a group consensus in the last column.
The next several questions ask about how the business, application, and data will change during that time. We’re not looking for exact percentages, but just vague ideas. Are we going to double the number of end users per year? Are we going to start saving files in the database in the next version of our app?
This isn’t just about query performance – remember, we have to be able to do our backups, DBCCs, and restores fast enough to meet the business goals for RPO and RTO. After all, the business isn’t likely to say, “Since we’re growing fast, it’s okay if you lose more data.”
Who Really Answers These Questions?
Every time I teach this class in person, someone raises their hand and asks, “Do business people ever actually have answers for these questions?”
But it opens their eyes to your challenge as a database administrator. How are you supposed to build a perfect server to handle load that the business can’t define?
The less specific the business goals are, the more agile your server design needs to be.
That’s Where Virtualization Shines.
When you’re handed a wildly vague set of business requirements, it’s easy to build a virtual machine sized for today, and then add additional capacity to handle growth of data, users, and app complexity.
But here’s the kicker: you have to actually monitor the growth.
And just buying an off-the-shelf monitoring product ain’t gonna cut it.
Off-the-shelf monitoring products don’t tell you when your number of active users have doubled, when your incoming order rate has quadrupled, or when you’re keeping ten years of sales history instead of two. You have to work with the business to define a list of business metrics that you can monitor via database queries, and then add those metrics to your monitoring software.
For example, when I ran a sales system, I added monitoring queries for the number of orders per day, the number of active salespeople, and the number of orders we kept in the active database. I used my monitoring tool to create line graphs of those numbers every day, and then I sent that report to the stakeholders every month. I could point to the graph and say, “Sales volume has grown by forty percent, and we haven’t archived any data. Without additional investments in hardware, tuning, or staff training, you can expect that performance will degrade accordingly.”
Lines like that open wallets.
But You Gotta Understand Databases, Virtualization, and Business.
This is where I got so frustrated with guidance that says things like, “Set up one data file per virtual core.” If you’ve got a database that’s doubled in size over the past year, and you need to add four additional vCPUs, that guidance means you would have to:
- Add four LUNs on every SQL Server (primary, disaster recovery, reporting, QA, development – because you’re going to refresh those databases from production)
- Add four more identically sized data files (which instantly gives you hot spots on those empty files)
- Rebuild your clustered indexes to even out the load on the old and new data files
- Watch your transaction logs and backup sizes during this time because you’re going to have a huge change rate
But the one-file-per-core never includes guidance like that. They just expect you to “add files” and call it a day, and they ignore the technical and business side effects. Does the business really want to take a huge slowdown impact while all this happens? Especially if they’re in the midst of a growth spurt? Of course not.
Instead, you have to look at the bottleneck that the SQL Server is really facing, and identify the best way to fix that bottleneck with the least impact to the business.
That brings us to the last set of questions on that page of the planning spreadsheet – how will we handle growth when it happens?
If we suddenly acquire our biggest competitor and need to double the data and user load, or if we run a Super Bowl ad, how will we proactively keep the database fast? Are we going to be able to tune code or hire staff – or are we only able to throw hardware at the problem?
Another way to visualize this for management is my Manager’s Guide to Tuning Code – it shows management what your options are, and lets them choose your avenues.
Don’t get me wrong – I love throwing hardware at a problem when it’s the easiest, cheapest, fastest way to solve the pain. But remember that while virtualization makes it easier to throw hardware at a problem quickly, that may not be the easiest/cheapest/fastest solution. And even if you do it, you need to understand the real side effects of tricks like “one data file per volume per core” – because it’s nowhere near easy/cheap/fast in the real world.
Your first concern is licensing, and once you’ve got that under control, it’s time to talk about backups, restores, and uptime.
The business begins by defining the RPO and RTO requirements for high availability, disaster recovery, and “oops” deletions. Use our High Availability Planning Worksheet and fill in the “Current” column on page 1:
You need to give the business a rough idea of the current state – how much data you’ll lose under each of those scenarios, and how long the SQL Server will be down. Be honest – don’t overstate your capabilities, because that just means you’ll get less budget money to build the system the business wants.
Then let the business users fill in the “Biz Goal” column for each scenario. Of course, by default, everyone wants to pick zero data loss and zero downtime, but that’s where page 2 of the worksheet comes in:
Give both page 1 & 2 of the worksheet to the business users and let them pick the right availability requirements for their budget. Technical managers will want to leap to specific methods (“we have to use VMware replication”) but keep steering the conversation back to the big picture – what does the database need?
The costs aren’t meant to be exact estimates, just rough ballpark numbers of the hardware, software, installation, and ongoing maintenance costs for a couple/few years. To learn more about the technologies in each box, check out our High Availability and Disaster Recovery Resources page.
If your business checks different boxes for each of the three scenarios (HA, DR, and OOPS), then you’ll probably end up with a mix of different SQL Server technologies. For example, the OOPS scenario is usually handled by transaction log backups and a separate standby restore server so that you can pull out just the data you need, but that isn’t going to help you for high availability if you need 1-minute automatic failover with zero data loss.
After the business checks boxes for high availability, disaster recovery, and “oops” deletions, it’s time to pick the right recoverability option. There are virtualization-friendly options in every box, but it’s important to note what isn’t in these boxes.
VMware, Hyper-V, and Xen Alone Aren’t High Availability.
Virtualization admins often think, “If something goes wrong with hardware, the virtual server will just start up on another host.” VMware, Hyper-V, and Xen all do a great job of recovering from these kinds of failures, but those aren’t the only kinds of failures we need to avoid. Note the list of failure types in our High Availability scenario.
Sure, hypervisors protect you fairly well from Windows crashes, RAID controller failures, bad memory chips, or somebody unplugging the wrong box.
But what about patching and OS-drive-full type problems?
I’ve heard admins say, “No problem – before we do any patching, we’ll shut SQL Server down, take a snapshot of the VM, and then start the patch. If anything goes wrong, we’ll just roll back to the snapshot. No data loss, not much downtime.”
Oh, I wish. Let me tell you a story.
One of my clients was preparing a new SQL Server for production. As part of their prep, they needed to apply SQL updates to it, so they started the installation process, and …
A few minutes later the phone calls started pouring in.
Because they were patching the wrong box. They were patching the production SQL Server VM, not the new VM. To make matters worse, they ran into the SSISDB bug, and their production server was down for hours while they figured it out.
How Recoverability Influences Virtualization Design
When you’re designing solutions for HA, DR, and OOPS, read the scenarios described in this simple worksheet. Expect that sooner or later, at a time you can’t predict or control, every one of these is going to happen to you. (Well, maybe not zombies in the data center.) Your technical solution is driven by the business’s requirements for RPO/RTO in each scenario. Understand what virtualization alone can give you, and when you’re going to have to add SQL-Server-level solutions.
Your design then needs to take into account one more challenge: capacity. How many databases will you have, how large will they be, and how fast will the data change? For example, a solution involving log shipping, a 1TB database, a 10% change rate per day due to batch load jobs, and a 1Gb Ethernet pipe probably isn’t going to be a 1-minute-data-loss solution.
It all boils down to one simple question:
can you back up, DBCC, and restore the databases fast enough?
Not only do you have to answer the question for today, but for the life expectancy of the server as well. In my next virtualization post, I explore capacity planning as it relates to RPO/RTO, and then we’ll be able to put the whole picture together of what our VM environment will look like.
Every single thing you do with a virtual SQL Server starts with one very important concern: how are you going to license it?
The 3 Common Ways to License Virtual SQL Servers
Developer Edition, by the guest – Developer Edition has all the power of Enterprise Edition, but it’s way cheaper (like in the $50/server range) and you’re not allowed to use it for production purposes. Because the licensing is so cheap, just mix these VMs in with the rest of your guests on a shared pool of hosts. Developer Edition licensing is handled differently than other SQL Server editions, typically bundled with your developers’ MSDN subscriptions.
Enterprise Edition, by the host – When you have 5 or more guests, it usually makes sense to buy a dedicated pair of host servers in an active/passive configuration. For example:
- Dell R720 Primary – 2 sockets, 4 cores each, 768GB memory – ~$25k USD
- Dell R720 Secondary – same hardware config – ~$25k USD
- 8 physical cores of SQL Server Enterprise Edition licensing and Software Assurance – ~$55k USD (you only have to pay licensing on the physical cores regardless of whether you enable hyperthreading)
- Hypervisor, Windows, management software licensing, shared storage – left as an exercise for the reader
You can configure your hypervisor (VMware, Hyper-V, Xen) to use one of the hosts as the primary, and when it fails, automatically boot up the guests on the secondary R720. This way, you only have to license one of the two hosts, and you can run an unlimited* number of SQL Servers on this host. (* – Not really.) Even better, even though they may be small VMs, you can still use Enterprise Edition on these guests and get access to cool EE-only features like online index rebuilds.
As your SQL Server needs grow, you can license the Secondary and balance the VM load across both, and even start to add additional licensed hosts. The key, though, is to avoid running any non-SQL-Server loads here because these are your utmost expensive host cores.
Standard Edition, by the guest – When you have 4 or fewer SQL Server instances, and they don’t need enough hardware horsepower to merit a dedicated host pool, then you’re usually best off licensing each guest with Standard Edition. This way, you can intermingle them on the same hardware that the rest of your virtual guests use.
At around $2k USD per core, it’s roughly 1/4 the price of Enterprise Edition, but you have to keep an eye on the number of virtual cores you’re using. You may read training material that says “just add vCPUs when you have performance pressure,” but this is expensive advice when you’re licensing each guest by the core. As you approach 16 vCPUs of Standard Edition licensing, you need to build a plan to migrate over to the host-based Enterprise Edition licensing approach. 28 vCPUs of Standard costs about the same as 8 vCPUs of Enterprise, but only the latter lets you run an unlimited number of guests, use over 64GB of memory on SQL 2012 and prior, do online index rebuilds, etc.
How Licensing Affects Your Administration
If you need SSRS, SSIS, or SSAS, and you’re licensing hosts with Enterprise Edition, you simply deploy these other services on separate virtual servers. You never make these other services fight with the core SQL Server engine for the same memory. However, if you’re on Standard, you saved money on licensing, but you spend extra performance troubleshooting time. You’ll need to install these services on the same VM as the engine to keep costs down, but you’ll need to understand how to triage performance.
When you design a high availability solution, the edition determines your choices as well. If you’re on Standard, and you need automatic failover for OS & SQL problems, you’ve backed yourself into a corner of either failover clustered instances or synchronous database mirroring. These solutions add complexity that shops without a DBA often can’t afford, so they end up going without OS/SQL protection and sticking with the simpler hypervisor-based failovers. On the other hand, once you’ve decided to license the hosts with Enterprise Edition, you can take advantage of AlwaysOn Availability Groups and asynchronous database mirroring without extra licensing costs – even for your smallest VMs.
Knowing your licensing means faster decision-making on architecture, and that’s why you want to start your virtualization projects here. Licensing is usually the most expensive and most restrictive part of the project – get it right first. To learn more about licensing SQL Server, check out the 2014 Licensing Guide PDF.
The PASS Virtualization Virtual Chapter hosted a Q&A session with me last week. We talked about storage configuation options like VMDK/VHD vs RDM, how licensing works, what’s the biggest SQL Server I’m comfortable virtualizing, and much more:
For more tips, check out our virtualization resources page.
I’m doing a 1-hour open Q&A session on May 8th for the PASS Virtualization Virtual Chapter. Bring your VMware and Hyper-V questions about setup, performance, management, monitoring, or whatever, and I’ll answer ‘em.
You can even get a head start here – post your questions in the comments below, and I’ll build slides to answer ‘em ahead of time. That way you can make sure you get the best answer possible. (Well, from me anyway, ha ha ho ho.)
Then come join us on the webcast and hear the answers. See you there!
I served with High Availability. I knew High Availability. High Availability was a friend of mine. VMware HA, you’re no High Availability.
See, for us database administrators, high availability means protection when:
- The system drive fills up because some potato decided to download a bunch of files
- An operating system or database server update goes horribly awry
- Or even when an OS or SQL update goes right – because the beauty of real high availability solutions is that they let you patch the standby node first, make sure it works, and then fail over to it so you can patch the other node.
Don’t get me wrong – I love VMware, and I love using VMware HA for database servers. It’s a fantastic way to get higher availability for those old dinosaur database servers running SQL Server 2000 that we just can’t kill, yet still run important apps. But in systems where uptime really matters, a single virtual machine isn’t the answer to high availability. That’s where solutions like clustering, database mirroring, replication, and AlwaysOn Availability Groups come into play.
Thankfully, there’s good news: when VMware HA is paired with SQL Server technologies, they can both work even better. Two standalone physical database servers running AlwaysOn Availability Groups are more reliable than just one server, but two virtual machines doing the same thing are even more reliable. They’re protected from hardware failures because they can be spun up on any VMware host in the datacenter. They’re more flexible because we can add CPU power or memory quickly based on demand.
I’ve blogged about why your SQL Server cluster shouldn’t be virtualized, and that still holds true. If you need to build a hybrid AlwaysOn solution involving both failover clustered instances (FCIs) and standalone instances, I would rather not put the FCIs in VMware first. But if you’re under pressure from management to cut costs and cut your datacenter footprint, put the rest of the instances in virtual machines. You’ll gain the power and comfort you want from physical machines while getting even higher availability from the virtual machines. Everybody wins, and the future will be better tomorrow.
Wanna learn more? Check out our VMware, SANs, and Hardware training videos. It’s a 5-hour training video series explaining how to buy the right hardware, configure it, and set up SQL Server for the best performance and reliability. Here’s a preview:
Even if your SQL Server is the only guest on a host, it still might not be as fast as bare metal.
One of the reasons is NUMA, which stands for Not Ur Momma’s Architecture. Okay, no, smart reader, you caught me – it actually stands for Non-Uniform Memory Access. In your momma’s architecture (Symmetric Multi-Processing), any CPU could access any memory all at the same low price. In today’s NUMA servers, a single motherboard with two CPUs and 128GB of memory can actually be divided into different nodes.
When a process running on CPU #1 wants to access memory that’s directly connected to it, that’s local access, and it’s fast. However, when that same process wants to grab data stored in CPU #2’s memory, that’s remote access, and it’s not as fast.
The performance penalty of remote memory access varies greatly from system to system, and you can measure it with Coreinfo from Sysinternals. (That Russinovich knows everything.) Blogger Linchi Shea went so far as to test the overhead of local versus remote access on one particular system, and he saw about 5% performance reduction. He considered that the worst case scenario for the server hardware he was using, but keep in mind that the situation will be much worse on servers with higher costs for remote memory access like IBM’s x3850 and x3950.
Windows exposes NUMA configuration details to applications, and it’s up to the app to tune itself appropriately. SQL Server has been NUMA-aware since 2005, and Microsoft’s continued to add improvements for it through 2008 and 2012. To learn more about how SQL Server handles NUMA, check out Gavin Payne’s SQLbits presentation, The NUMA Internals of SQL Server 2012.
How Virtualization Screws Things Up
The good thing about virtualization is that it abstracts away the hardware. You can run any virtual SQL Server on any server in the datacenter without a reinstall. You can even move virtual machines from one host to another, live, without a service restart – even if the underlying hardware is completely different. You can use multiple VMware hosts with completely different NUMA architectures – different numbers of cores per NUMA node, different amounts of memory per node, etc.
In order to pull this off, virtualization just presents a lump of CPUs and memory to our guest. Our virtual machine has no idea what the underlying NUMA configuration is – and it can’t, because it could change at any time when we’re moved from one host to another. This isn’t a performance problem for most apps because they don’t need to know anything about NUMA. They just want a lump of CPUs and memory.
Unfortunately, this is a performance problem for SQL Server because it actually wants to know the underlying configuration – and wants to tune itself for it. This is why when even running on a host with no other guests involved, performance still won’t match bare metal.
How vSphere 5’s Virtual NUMA Fixed Things Up Again
There are three key decisions that will make your life easier (and possibly your performance better).
First, isolate your virtual SQL Servers onto their own hosts. With SQL Server 2012’s licensing, when you buy Enterprise Edition for the host’s CPUs, you get unlimited virtual machines on that host. For a while, this wasn’t easily doable in VMware because of their incredibly stupid memory limits with licensing, but thank goodness they fixed that license stupidity recently. I can’t imagine a software vendor being dumb enough to limit their product to 64GB of memory in this day and age. <cough>sqlserverstandardedition</cough> I’m so glad VMware listened to their end users and fixed that limitation. <cough>microsoftpayattention</cough> Restricting a database server to just $500 worth of memory, why, that’d be like releasing a tablet with 4 hours of battery life. <cough>mylastpostasanmvp</cough>
Second, in that pool of hosts, use identical hardware running vSphere 5. All of the hosts need to have the same NUMA architecture. This does come with a drawback: it’s harder to do hardware refreshes. Most shops just buy new hardware as it becomes available, throw it into the VMware cluster, and let VMware DRS automatically rebalance the load. Unfortunately, the SQL Servers won’t be able to vMotion onto this hardware if it has a different NUMA configuration. The guests will need to be shut down at the next maintenance window, get a different NUMA config, and then be booted on the appropriate hosts.
Finally, configure vSphere 5’s Virtual NUMA on your guests. This is done automatically for guests with more than 8 vCPUs, but at 8 or less, you’ll need to enable it manually. Presto, SQL Server will see the underlying architecture and tune itself appropriately. (Well, not entirely appropriately – now SQL Server just solves the easy problems for you, and creates new hard problems.)
To enable virtual NUMA on VMs with 8 or less vCPUs, follow the instructions on page 41 of the Performance Best Practices for VMware vSphere 5.1 PDF. And hey, while you’re in there, get your learn on – it’s an excellent resource for SQL Server DBAs who want to know if their shop is doing things right.
Wanna learn more? Check out our VMware, SANs, and Hardware training videos. It’s a 5-hour training video series explaining how to buy the right hardware, configure it, and set up SQL Server for the best performance and reliability. Here’s a preview:
Hi Brent –
I would imagine you get many of these notes of appreciation for what you do but I just couldn’t leave the office for the day without say “Thank you so much”. I would have never have thought that my personal investment would have paid such dividends.
Today, armed with the awesome information, knowledge and suggestions you provided in your SQL Server for VMware training, I headed into what I thought would be a rather contentious meeting as we have been experiencing some serious performance issues for almost a year and the discussions just never went anywhere.
To make a long story short – it was amazing, I was able to bring them on board, got my reservations, got access to the ESX Hosts for checking performance and an admission that maybe the SAN could have been better configured for SQL Server.
I have been a fan for years – thanks again and I look forward to reading and hearing more from you, as always.
Wanna know what Kelly was raving about?
My 3-hour training session is half off until October 25. SOLD OUT!