The most successful deployments of virtual SQL Servers have a few things in common. When a company is doing all of these things, odds are they’re going to be very happy with their virtual SQL Servers:
10. Use vSphere 4.1, not earlier versions. vSphere 4.1 introduced major improvements in multi-CPU scheduling that vastly improve performance. SQL Server will break queries out across multiple virtual CPUs, and it expects all of those vCPUs to perform identically. vSphere 4.1 helps make that happen. You can read more about the vSphere CPU scheduler here.
9. When using blades, avoid overcommitting switch uplinks. Virtualization works great with blades – but only when the SAN and network teams monitor utilization rates between the blade chassis switches and the rest of the networks. All too frequently, the SQL Server is thrown into a blade chassis with dozens of other servers all competing for the same small amount of storage and network bandwidth leaving the blade chassis. Everyone looks at their monitoring systems and can’t understand why the backups, big queries, and maintenance jobs run so slow, and the culprit is the hidden small uplink.
8. Avoid 1Gb iSCSI for storage. While this cheap and easy-to-use storage works great for most servers, it doesn’t work as well for SQL Server. If you’re happy with the speed of SQL Server running on local storage, you can be in for a rude surprise with 1Gb iSCSI.
7. Test storage performance before deployment. I’m partial to SQLIO, which might be the worst-named tool in history. It has absolutely nothing to do with SQL Server – it doesn’t require SQL Server, and it doesn’t mimic SQL Server. It just flat out hammers IO using whatever parameters you pass in. Using a test file of the same size as the expected database, the storage should consistently exceed 100MB/sec whether it’s random or sequential, reads or writes. Higher numbers than that may be required due to the application design or end user expectations, but that’s a good foot-in-the-door number.
6. Compensate for slow storage with memory. If you can’t get the storage performance you need, you can help by caching as much of the database as possible in memory. If the projected database size is 50GB but the virtual server only has 16GB of memory, then it’s not going to be able to cache the entire database. Perhaps the users won’t be querying old data, in which case you might be able to get by with less.
5. Ensure CPU power saving is turned off. While Intel’s latest Xeon processors provide impressive power savings, they won’t ramp up to full processor speed unless the CPU is under heavy load. SQL Server will rarely push the processors that hard, which means they stay slow – sounds good in theory, but in reality, every query takes 70-100% longer than it did pre-virtualization. If you care about query performance, turn this setting off in the hardware BIOS. You can read more about the power-saving CPU issue here.
4. Coordinate VMware reservations and shares with SQL Server settings. VMware vSphere has a great set of options to ensure that a guest OS gets the right amount of resources for its needs. These settings need to be set in concert with SQL Server’s min and max memory settings. There’s plenty of bad advice out on the web saying things like, “Just disable the balloon driver and give SQL Server all the memory” – that’s not right either. There’s a happy medium in letting SQL Server and VMware cooperate to balance resources across multiple guests, but it only works when these settings are aware of each other.
3. Use Resource Pools to track SQL Server licensing. Microsoft’s virtualization licensing for SQL Server is notoriously complex. Starting with SQL Server 2008R2, only Datacenter Edition ($60k/cpu socket) provides unlimited virtualization rights. Enterprise Edition ($30k/cpu socket) provides just four VMs. Tracking these closely with VMware Resource Pools can result in huge cost savings. If SQL Servers are allowed to move to any host in the VMware cluster, then a licensing audit can produce staggering costs.
2. Use VMware HA for high availability. If your users can tolerate a SQL Server outage of 2-3 minutes, VMware HA is much easier to manage than a SQL Server cluster. If your users require less than 30 seconds of downtime, consider implementing a physical SQL Server cluster instead. SQL Server clusters are tricky enough to manage on their own, and doing them inside VMware adds an impractical level of management on servers that can’t be down for more than 30 seconds.
1. Virtualize small SQL Servers first. Start by gaining experience with 1-2 vCPU servers with under 16GB of memory. As the company’s sysadmins grow accustomed to how SQL Server uses CPU and memory in a virtual environment, they’ll be more confident virtualizing larger servers. If the DBAs and sysadmins don’t get along when trying to pin down performance problems on smaller servers, they’re going to be very adversarial when dealing with larger servers.
Wow – That’s a Demanding List!
I know what you’re thinking: “This outside consultant is greedy and expects everybody to dump tons of money into their SQL Servers. He’s asking for the moon. Nobody does this in real life.”
Let’s rewind back to the beginning of the recommendations where I said, “When a company is doing all of these things, odds are they’re going to be very happy with their virtual SQL Servers.” You can certainly skimp on some of these items and still stand a pretty good chance of being happy with your SQL Server performance.
The more items you skimp on, the worse your chances become. If you implement a virtual SQL Server with 32GB of memory trying to cache 250GB of databases on RAID 5 storage, hooked up via 1GB iSCSI, with no prior experience virtualizing SQL Servers of that size, odds are you’re going to be miserable. Users will scream and complain, and you’ll bring in an outsider who will track the problems back to these types of recommendations. Keep my contact info handy.
Our Training Videos: VMware, SANs, and Hardware for SQL Server DBAs – our 5-hour training video series explaining how to buy the right hardware, configure it, and set up SQL Server for the best performance and reliability. Here’s a preview:
Tony O'Grady May 31, 2011 | 7:03 am
Good article Brent.
Some really clever thinking in there.
I have one query. We looked at 10GB vs 1GB iSCSI there is a considerable cost difference. With VMWare 4.1 we were able to use the multipathing module to get up to 4 GB connectivity to our storage. Running SQLIO against this we were able to satisfy ourselves that it would meet our performance requirements, we now have it in production and are quite happy with it, the perfmon counters back this up! The only issue is all the extra cables in server room!
Brent Ozar May 31, 2011 | 11:28 am
Tony – Thanks! I’m a little confused though – you said you had a query, but is there a question in here for me?
Tony O'Grady May 31, 2011 | 11:57 am
Brent,
I was just wondering if we overlooked something. Is there a particular reason you recommend avoiding 1GB iSCSI, with our configuration we get 4GB throughput, how does this stack up against 4GB FC, in your experience?
Tony
Brent Ozar May 31, 2011 | 12:51 pm
Tony – ah, gotcha. You’re most likely not getting 4GB read and write throughput to just one SQL Server data file unless you’re doing connection bonding at the switch level, and if you’re doing that, all 4 ports are going through one switch – which means you have no switch failover capabilities.
Tony O'Grady June 1, 2011 | 11:15 am
Thanks Brent,
I am going to put a post together with my analysis and configuration.
Tony
Tom Fox June 6, 2011 | 7:33 am
Brent,
Some of today’s switches, namely the Cisco Nexus models, allow you to aggregate ports across multiple physical switches. May be a good note for the toolbox.
Also, many folks tend to think that multiple aggregated links equal the same as one link of the same speed (i.e. 10 x 1Gb bonded vs 1 x 10Gb port). That’s frankly not true, as there’s overhead in managing the bond and you will find that the load balancing algorithm may only put one connection on one of the physical links. It’s always better to go with the faster link speed, if the budget will support it.
Tom
Brent Ozar June 6, 2011 | 7:35 am
Tom – thanks for the heads-up, but I haven’t seen anyone successfully managing port aggregation in the wild. It’s one of those things that yes, the hard-core guys can do it, but the rest of us (well, “them”, hahaha) don’t go to that level of detail.
Michael J Swart May 31, 2011 | 8:50 am
I’ve got a question about #7: “Test storage performance before deployment”.
My guess is that you frequently look at systems *after* deployment. Are there ways (some perf counter over time maybe?) to look at performance of the disk subsystems after deployment?
Brent Ozar May 31, 2011 | 11:29 am
Michael – unfortunately, no, there’s not a good way to see after deployment other than monitoring how much throughput you’re getting with SQL Server, but then, it can be a SQL problem instead of a SAN problem.
Jorge Segarra May 31, 2011 | 9:07 am
Nice list Brent but quick comment on #7. If you’re going to run SQLIO, be very careful ESPECIALLY if that LUN is being shared by other VM’s. Your tests could potentially drag down a ton of other VMs not at all related to your application and you’re going to get a flood of calls with “my application is running like crap!”.
Brent Ozar May 31, 2011 | 11:27 am
Jorge – absolutely, great point!
Oscar Zamora May 31, 2011 | 10:20 am
“Avoid 1Gb iSCSI for storage”… is 10 GB iSCSI the way to go instead?
Brent Ozar May 31, 2011 | 11:27 am
Oscar – yep, or fiber channel.
Scott Shaw May 31, 2011 | 3:05 pm
Another great VM article. You do a wonderful job of pulling back the curtain on VM technology. This is a world many DBAs in large organization rarely get to see and have little control over. It is especially difficult to find a one-shop stop for all the best practices for virtualizing SQL Server. Thanks for your contributions!
Scott Lowe May 31, 2011 | 9:34 pm
Brent, this is a great list! Thanks for taking the time to compile it.
One comment regarding #3 (using resource pools to track licensing): I might be misunderstanding you here, but if you want to control which physical systems run SQL Server for the purposes of controlling license costs, resource pools are not the right tool. Instead, look at DRS host affinity rules, which will allow you to define rules to constrain the SQL Server VMs to a subset of ESX/ESXi hosts.
Otherwise, great work!
Brent Ozar June 1, 2011 | 12:30 am
Scott – thanks, and you’re absolutely right! I was thinking DRS and I typed out Resource Pools. I’ll edit that when I get back to land. Thanks!
John Sansom June 1, 2011 | 8:40 am
Content like this is knowledge gold to the DBA looking to get their learn on or has been thrown in the deep in end with Virtualization.
I’ve got a lot of projects involving virtualization coming my way and there’s info in here that’s new to me right now. If it’s new to the Server Admins too then we’re in trouble
Thanks for sharing!
Darren June 3, 2011 | 1:08 am
Thanks for a great checklist Brent! I was at a SQL Server on Vmware session at EMC World and the presenter mentioned using the “Lock pages in memory” permission for the SQL service user. Would you suggest this for 64bit systems as well?
Mark June 3, 2011 | 7:54 am
Brent, thank you for this awesome checklist.
I’m a dba and virtualized servers & db servers were “pushed” down our throats, and since then, performance has been slow as heck.
At this stage we are incorrectly set to #10,9,8,7,6 & 4 as starters, even #1.
Now I finally have something I can go to our vm/sys admin about and say, hey you have these set wrong!!!!!
THANK YOU!
He always thought we were being whiny dba’s!, But I KNEW I was right and our infrastructure wasn’t built properly for all that was virtualized.
Rich Fantozzi June 3, 2011 | 10:45 am
Great article. One other point that we have found important when migrating an existing database to vmware is to really baseline the performance of the existing server using perfmon with specific scenerios and even over specific length of time. When a user comes in and says the app seems slower. Having a target (real numbers) for acceptable performance makes fixing performance issues much more straight-forward. Using your suggestion above becomes more about picking one trying it out and checking to see if fixes the performance issue. For us it was #5, 70%-100% performance hit is very very true.
Pingback: Something for the Weekend – SQL Server Links 03/06/11
Laura Grob June 6, 2011 | 3:09 pm
Awesome article and just in time as we are starting a plan to virtualize most of our sql servers when we upgrade from 2005 to 2008R2.
Eve Lynes June 7, 2011 | 12:01 pm
Thank you for this list, much appreciated!! Can I get a little more direction on #4 “These settings need to be set in concert with SQL Server’s min and max memory settings … but it only works when these settings are aware of each other.” Do you really mean “aware” or do you mean that the settings are manually configured to same values?
Brent Ozar June 7, 2011 | 12:02 pm
Eve – they have to be manually configured in concert with each other – definitely not the same, though. SQL Server’s memory settings are only the amount of memory SQL uses by itself, not including the OS. It requires some SQL expertise to set these correctly in relation to the VMware reservation.
Jason Crider June 14, 2011 | 8:35 pm
Great. Now I have to research our current 1GB iSCSI to see where we stand and how we could get to 10GB iSCSI. If anyone has any research they’ve done or good links feel free to send them to me by email: arsqldba at gmail.
Michael Codanti June 20, 2011 | 3:56 pm
Great timing, we are just looking at virtualizing our first SQL Server, so this will help.
The first thing I asked about was storage I/O and they told me the blade chassis has 2) 10 Gb connections shared across 4 blades. Given that the SQL Server in question is currently generating up to 11 Gb of DAS storage I/O, I don’t think things are going to go well with it sharing 20 Gb with 20-50 other virtual servers.
Brent Ozar June 20, 2011 | 3:57 pm
Michael – wow, how are you generating 11Gb of DAS storage IO? That’s enough to saturate not one but two of the latest generation 6Gb SAS controllers. What kind of server is that, and what kind of storage are you using?
Michael Codanti June 20, 2011 | 4:27 pm
This is an old HP ProLiant ML570G4, and as things went we eliminated bottlenecks. The current bottleneck is the CPUs, and there is room to add 2. (We were just about to replace the server when they decided they wanted to go virtual.)
The storage is setup this way:
* P400 SmartArray controller with 512MB cache hooked to 18) 3Gb 10k SAS drives
* P800 SmartArray controller with 512MB cache hooked to 9) 3Gb 15k SAS drives and 3) 1.5Gb 7.2k SATA drives
* P812 SmartArray controller with 1GB cache hooked to 12) 3Gb 15k SAS drives
Each controller is connected to the drive cage with a SAS 4x cable, 4) 3Gb physical links, and is capable of up to 1.2GB/sec of throughput if the drives can handle it. (9.6 Gb/sec if my calculations are correct.)
The 11Gb measurement is during actual production load; if I run SQLIO on all of the drives at the same time I come to an aggregate of just over 2GB/sec of sequential reads which I calculate as 16Gb/sec. (The current storage bottlneck is the old PCIe bus in the server and the CPUs.
The P812 controller is 6Gb SAS and is capable of 2.4GB/sec or about 19Gb/sec if we had gotten a 6Gb cage and drives. (And the PCIe bus could handle it.) Are you looking at SAS controllers that only have one 6Gb physical link?
Needless to say I was shocked that you mentioned 100MB/sec as a minimum target for storage. (Our random writes drop to as low as 50 MB/sec on one of the arrays, but it is used for almost exclusively reads.)
Brent Ozar June 20, 2011 | 4:29 pm
Michael – Yep, absolutely, if you’re using three dedicated RAID controllers in a single server, you shouldn’t virtualize that box on anything less than 10Gb iSCSI.
Also, just as a sanity check – you are WAY outside the norm there. Very, very, *very* few folks hook up multiple DAS RAID controllers in their servers. Usually when a server is that performance-critical, they’ve moved on to SAN storage.
Michael Codanti June 20, 2011 | 5:00 pm
Yeah, when talking to an HP support rep once he mentioned that “we sure have that server tricked out.”
We only have 10 servers, all using DAS, and 2 of them have multiple RAID controllers. (The main SQL and the Exchange servers.)
We looked at getting a SAN but the startup cost was out of this world. (Not to mention the FTEs and training to manage the SAN.)
For just over $10k I can get a P812 controller, a drive cage that holds 12 drives, the SAS 4x cable, and 12) 450GB 3G 15k LFF SAS drives. For about $12.3k I can get the P812, a drive cage that holds 25 drives, the SAS 4x cable, and 25) 146GB 6G 15k SFF SAS drives. (Both come with three years of warranty/support.)
How much would it cost to get 2.4GB/sec of throughput and 3.6TB of raw drive space from a SAN?
Going back to my problem where they plan to “outsource” my server they currently have people complaining about the performance on a physical Oracle database server. (They are getting a maximum of 384MB/sec from the SAN with dual 4Gb HBAs.) Their answer to me was that they were moving away from the SAN and going to Network Appliances for the storage needs. I’m not getting any warm fuzzy feelings about this. (It will be way cheaper, but I suspect management will be very unhappy with the performance.)
Michael Codanti June 20, 2011 | 5:22 pm
Correction: the P812 is capable of 8GB/sec (4 GB in each direction) if it is hooked to at least 2) 6Gb SAS cages with drives fast enough to support the speed.
John C July 21, 2011 | 8:25 am
Hi Brent, Regarding #5, you say to turn this off in the hardware BIOS. This may be a naive question, but the VM also has BIOS, right? I assume you’re referring to the BIOS of the ESX host or hosts that the SQL VM is living on?
Brent Ozar July 21, 2011 | 8:26 am
Hi, John. Yep, I’m referring to the real hardware BIOS – the BIOS of the hosts.
Bruce Waaso August 10, 2011 | 11:03 am
We are reviewing virtulizing our SQL 2005 server into VMWare 4.1 using a iSCSI 1gb SAN. Our databases are mostly under 1GB in size and the number of users hitting the database at the same time is usually under 10.
#8 recommends not using iSCSI but under small loads would we notice much delay?
Brent Ozar August 10, 2011 | 3:16 pm
Bruce – sure, as long as you never need to do backups, restores, DBCCs, index rebuilds, or statistics updates, you should be fine.
Andrew T October 10, 2011 | 3:44 pm
Brent –
I’m in the process of researching mirgrations to a VM environment and I keep getting led back to your list here. I have just two questions.
1. HA – I currently support a few A/A clusters and haven’t had many issues with them and the downtime when failing over is usually less than 10 seconds. Is it that it takes that long to move the VM and spin it up on a different part of the VM cluster that takes the 2 – 3 minutes?
2. I’ve researched and have seen several people on the ‘Net that have had numerous issues with IO performance. From your experience (just looking for general guidance), would you feel comfortable putting highly transactional OLTP enviornments on a shared ESX cluster?
Brent Ozar October 11, 2011 | 7:29 am
Andrew:
1. The 2-3 minutes is the time it takes to boot Windows from scratch and fully start SQL Server to the point where it can accept incoming queries. In active/active clustering, you can achieve sub-10 second failover when you manually fail nodes over, but it takes longer during real-life failures. SQL Server’s LooksAlive and IsAlive checks need to be run, for starters, and LooksAlive only runs every 10 seconds to begin with – and that’s before the failover process even starts. Be careful when you compare manual failover to a real emergency.
2. I’ll answer this with another question: have you also found people on the net who have had numerous issues with IO performance on physical boxes?
Andrew T October 11, 2011 | 10:24 am
Thank you Brent. You are right about the clustering. Our manual failovers take about 10 seconds. When we simulated power failure to one of the boxes, it took about 45 seconds for it to fully fail over, recover, and come back online.
Also, with your question, you have answered mine.
Brent Ozar October 11, 2011 | 6:14 pm
Cool, glad I could help!
Steve December 20, 2011 | 3:36 pm
Great article..Quick question. How do you go about allocating CPUs to Virtual SQL Servers. I am in the process of Virtualizing some physical servers and VMWare’s best practices say dont allocate more CPUs than needed, but the servers originally have 2 quad core processors. I would like to start with 1 or 2 and add more as needed but that may not work for our DBAs. I know that the way VMWare works is if you assign a machine a processor the machine has access to all the cores of the physical processor however in Windows and in SQL only 1 processor is seen. Our DBA was curious about a setting called ” Degree of Parallelism” which he said may needed to be changed in SQL. Have you had to deal with this before.
Brent Ozar December 20, 2011 | 3:51 pm
Steve – I cover that in my VMware training events, and it’s beyond what I can cover in a comment, but here’s a related post I wrote about it a while back:
http://sqlskills.com/blogs/brent/post/SQL-Server-Virtualization-Get-Your-Hands-Off-That-MAXDOP.aspx
Aneil February 28, 2012 | 10:19 am
I would like to understand from you why iSCSI is not a good idea when using VMware?
DO you also beleive iSCSI is not a good idea without Vmware for SQL?
thanks
Brent Ozar February 28, 2012 | 10:54 am
Hi, Aneil. That’s a little beyond what I can address in a comment, but I go into detail in this in my 4-hour training sessions.
Steven Stirling January 22, 2013 | 2:25 pm
Hi Brent
This is good information, but with vsphere 5.1 and SQL 2012 AG’s, do you have an updated article that takes the new features into account? I’m looking at rolling out a new SQL enviroment with vsphere 5.1 and server 2012 – i’d be interested in hearing some real world experience with that. We’re still only using 1GB uplinks to the SAN storage (Equalogic) but we have 4 per ESXi host, so i’m hoping it won’t be so bad.
Thanks. Loving the site so far
Brent Ozar January 22, 2013 | 2:40 pm
Steven – thanks, glad you like the site! Right now I don’t have an updated version of the post for SQL 2012 AGs, though. If you’re interested in consulting help, let me know. I’ve been too busy doing the implementations to blog much lately with the details – they’ve been too fluid with a lot of the recent changes.
Pingback: Microsoft – SQLServer – Frenemy | Daniel Adeniji's – Learning in the Open
Jeff Bennett May 28, 2013 | 11:34 am
Brent,
Regarding item #9 in your listing, what would be the best way to monitor this? What tools and metrics would best cover uplink switches. We have both SAN and network teams, and I would like to approach both with a performance troubleshooting template that includes their view of performance for these vague issues where each team cannot find the problem. Lately, a lot of SQL VM performance troubleshooting is starting to resemble the old story about 3 blind men and an elephant…
Brent Ozar May 28, 2013 | 4:57 pm
Jeff – you actually can’t cover #9 from a SQL Server perspective. You have to use network monitoring tools for that, and whiteboard out the switch infrastructure so you know where the bottlenecks may be.