
When people buy my virtualization training video, one of the followup questions I get most often via email is, “Can I build SQL Server clusters in VMware and Hyper-V?”

In theory, yes.  Microsoft’s knowledge base article on SQL Server virtualization support says they’ll support you as long as you’re using configurations listed in the Server Virtualization Validation Program (SVVP).

But the real question for me isn’t whether or not Microsoft supports virtual SQL Server clusters.

The question is about whether you can support it.

Us geeks usually implement clusters because the business wants higher availability.  Higher availability means faster troubleshooting when the system is down.  We need to be able to get the system back up and running as quickly as possible.  Getting there usually means reducing the amount of complexity; complex systems take longer to troubleshoot.

If this is how your SAN team, VMware team, and DBA team hang out, you’re good with virtual clusters.

Adding virtualization (which also means shared storage) makes things much tougher to troubleshoot.  If the business wants a highly available SQL Server, ask yourself these questions before virtualizing a SQL Server cluster:

  • Do you have a great relationship between the SQL Server, storage, and network teams?
  • Do all of the teams have read-only access to each other’s tools to speed up troubleshooting?
  • Do all of the teams have access to the on-call list for all other teams, and feel comfortable calling them?
  • Do you have a well-practiced, well-documented troubleshooting checklist for SQL Server outages?
  • Does your company have a good change control process to avoid surprises?
  • Do you have an identical environment to test configuration changes and patches before going live?

If the answer to any of those questions is no, consider honing your processes before adding complexity.

But the Business is Making Me Do It!

They’re making you do it because you haven’t clearly laid out your concerns about the business risk.  Show the business managers this same list of questions.  Talk to them about what each answer means for the business.  Was there a recent outage with a lot of finger-pointing between teams?  Bring that up, and remind the business about how painful that troubleshooting session was.  Things will only get worse under virtualization.

To really drive the point home, I like whiteboarding out the troubleshooting process for a physical cluster versus a virtual cluster.  Show all of the parts involved in the infrastructure, and designate which teams own which parts.  Every additional team involved means longer troubleshooting time.
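If a whiteboard isn't handy, the same comparison can be roughed out in a few lines of Python. This is only a sketch: the component names and the team that owns each one are hypothetical placeholders, so swap in whatever your own org chart looks like.

```python
# Hypothetical component-to-team ownership maps. Every name here is an
# illustrative assumption, not a description of any real environment.
PHYSICAL_CLUSTER = {
    "application": "app team",
    "SQL Server instance": "DBAs",
    "Windows failover cluster": "sysadmins",
    "network": "network admins",
    "shared storage (SAN)": "storage team",
}

# The virtual cluster keeps every physical layer and adds hypervisor layers.
VIRTUAL_CLUSTER = {
    **PHYSICAL_CLUSTER,
    "hypervisor hosts": "virtualization admins",
    "virtual disks": "virtualization admins",
}


def teams_involved(ownership):
    """Return the sorted list of distinct teams owning at least one component."""
    return sorted(set(ownership.values()))


def summarize(name, ownership):
    """Build a one-line summary of how many parts and teams a design involves."""
    teams = teams_involved(ownership)
    return f"{name}: {len(ownership)} components, {len(teams)} teams ({', '.join(teams)})"


if __name__ == "__main__":
    print(summarize("Physical cluster", PHYSICAL_CLUSTER))
    print(summarize("Virtual cluster", VIRTUAL_CLUSTER))
```

The numbers themselves don't matter much. What matters is that each extra team in the second list is one more group that has to be on the call when the cluster goes down.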

Once the business signs off on that increased risk, then everyone’s on the same page.  They’re comfortable with the additional risk you’re taking, and you’re comfortable that you’re not to blame when things go wrong.  And when they do go wrong – and they will – do a post-mortem meeting explaining the outage and the time spent on troubleshooting.  If the finger-pointing between the app team, SQL Server DBAs, network admins, virtualization admins, and sysadmins was a problem, document it and share it (in a friendly way) with management.  They might change their mind when it’s time to deploy the next SQL Server cluster.

More Microsoft SQL Server Clustering Resources

Whether you want help choosing between an active/passive and an active/active cluster, or if you’re the kind of DBA who knows that’s not even the right name for failover clustered instances anymore, check out our SQL Server clustering training page.

  1. I’ve implemented a cross training program between the DB team and the Ops team. I’ve found it very useful to explain to the Ops team how DBs, logs, DB security, etc work. They’ve also done some sessions on network security and other stuff. I’m hoping to get to Clustering and VMWare soon so we can all be on the same page with many of the technologies that both of our teams are involved with. We’ve managed to get a good DR process using DB Mirroring to a remote location working well; both teams worked very well on this effort.
    I want to highlight Brent’s first question “Do you have a great relationship between the SQL Server, storage, and network teams?” Without a doubt, this has been the most important part of breaking the Silos and working towards a successful solution that works out for everyone.

    After all DBAs are good at relations since we work with relational data all the time, right? :)

  2. We spend a lot of time talking about how the technology *can* and *does* work, but this underlines the point that many people miss – with bad (or no) processes in your infrastructure group, it doesn’t matter what the technology can do – you won’t be able to support it effectively.

    Thanks Brent!

  3. Good points. The relationships between people and teams are often overlooked when implementing technology. Having the knowledge and a wallet doesn’t always mean I should buy into something.

    This summer I wrestled with the idea of virtualizing SQL Server in a farm vs. Clustering SQL Server on Virtual Servers. I considered both as High Availability options. The bottom line as I saw it came down to this: if the virtual host dies, your down time is the time it takes for your VM to start on another host; if your clustering node fails, your down time is the length it takes for SQL Server to start on the secondary server.

    The relationships are something to ponder when evaluating the difference between the two. How important is that extra minute or two, and how much effort do I invest in this relationship to make that work in my environment?

    Thanks for the post Brent.

    • Your bottom line is exactly what a network admin and I tried to help a technical manager understand when they called a meeting yesterday to discuss how soon they could set up a SQL cluster on a db that was already virtualized. They still want some additional HA beyond VM reboot because they think those few minutes of downtime are still too much, but at least I think we helped them see the other options.

  4. What I think is funny is that clustering isn’t really needed in the virtual environment. VMware has the VMotion technology that will move the server if needed. Hyper-V is nowhere near as elegant but has a brute force way to do the same thing.

    When you have these in place, clustering only causes more issues.

    One time I was contracting for a public entity that had virtualized and clustered the servers, and there was nothing but problems, which led to the complete de-virtualization of the SQL Server infrastructure.

    Respectfully,

    David

    • David – so how do you do SQL Server service packs in your environments?

      • Aww man, Brent beat me to it.
        We were going that route at my last job and I argued for clustering since we only have 5 days of downtime a year (our production servers were 24/7). Clustering would help with Windows and SQL Patching.

        • Good points, I was wondering the same thing: why would anyone consider a cluster if your SQL box was a VM that could be vmotioned? However, I did not consider Windows/SQL updates. What about a hybrid solution for those that already have a VM environment? A physical SQL Server clustered with a virtual server?

          • David – well, keeping in mind the points mentioned in the posts, how do you think the experience would be? Would a mixed physical & virtual cluster be easy to troubleshoot?

      • Test on an exact replica of the db, then just take a snapshot before the update; if it works correctly, apply the snapshot.

        • David – hmm, can you define a little bit more about your suggestion? What do you mean by an exact replica of the DB? Are you talking about database snapshots, VMware snapshots, or SAN snapshots?

    • I agree – I’m not a fan of this approach at all. Not too many DBA’s are experts in VMWare so would they really put their names and careers on the line for a technology that someone else supports – assuming you have a VMWare/VMotion expert in-house? I would not roll those dice personally, even if I did have a great relationship with that individual. And that’s only considering the unknown – as Brent noted, you know you have to apply patches. I could not implement a solution in a 24×7 environment knowing I’m going to be having planned downtime – that is counter-intuitive.

  5. Great blog post. I have been in that situation and no, the DBA team did not have a great relationship with the storage team or the virtualization team. The DBA team shared their tools but did not have any kind of access into the other teams’ tools. And the change management process was there, but the risks were often dismissed as unlikely. Infrastructure had the green light to call the shots as to what would be virtualized in the enterprise, and they decided all SQL Servers had to be virtualized. All I can say is the DBAs just could not do their jobs, and the DBAs kept quitting after only a few months. It is not always about how much can be saved with virtualization. Organizations need to be mature enough to support solutions that depend on so many different pieces owned by different teams.

  6. At one point I was the VMware admin and the DBA. However, things change, and I now no longer even have visibility into all the virtualization aspects, and the ‘new’ VMware guy isn’t a SQL guy at all. I have my disks being changed, CPU assignments being discussed, overcommitment (CPU and memory and number of CPUs) happening as people try to save money and look good. Now some users are complaining about performance. It is really annoying when people are saving so much by virtualizing in the first place and then don’t want to spend the money to ensure good performance for virtualized SQL Server. The fact is that virtualization, while great in theory (and it can be in practice), does add a lot of complexity technically and also organizationally. I would avoid a virtual SQL cluster and would be careful with virtualizing production due to these considerations.

    • I like your point, Ron– and that’s totally true that what’s supportable one day may become harder down the line when duties get separated out and an organization grows!

  7. Thanks!

    Here is one scenario that occurred. Somehow the RDMs assigned to a virtual cluster for disks such as the quorum showed up as available to be assigned in vCenter for the VMware cluster. So naturally the VMware admin requested they be reclaimed by the storage team, which they were. However, they were actually still in use by the Microsoft application cluster when they were somewhat rudely yanked away. Well, the Microsoft cluster wasn’t very happy with this. Fortunately it was not a Microsoft SQL Server cluster. :) However, since the Microsoft SQL Server cluster guy usually has the most experience with cluster issues, I sometimes get asked to assist with such issues. The point is that this created a company-wide service outage, and it could easily have been a SQL Server cluster issue if I had virtualized one. If it was a physical cluster, this would have been even more unlikely to have ever occurred as an issue in the first place.
    If someone wants to virtualize a SQL cluster, I would recommend possibly devoting stand-alone hosts for separate cluster nodes in VMware, just like you would have to in Hyper-V, as this could help avoid issues.

  8. I am now where you were at, Ron. Hosts are way overloaded and performance is fading fast. And to top that off, data growth will be doubling in the next 90 days.

    We have a new higher performance SAN in place, and new dedicated servers ordered for the SQL infrastructure. I just hope they arrive in time. And in this case we are sending the SQL Servers back to physical machines and will use clustering for high availability.

    One of the things I have picked up over the years is that there is a critical point where the data has outgrown the infrastructure supporting it. And I am definitely there right now.

    In our case

  9. We are working hard to extend the critical point for at least a 5 fold increase in size.

  10. Resource-intensive SQL Servers are definitely a special use case for virtualization. Virtualization definitely has its advantages, but you lose a lot of control over the resources supporting SQL. And I don’t see that getting any better. To guarantee any level of service, the resources have to be tightly controlled and devoted to the SQL Server. If it’s a small SQL Server not really doing that much, then it could probably ‘play well with others’. The heavy-use SQL Servers are trickier to manage. VMware/Microsoft does give guidelines/recommendations, such as building your virtual SQL Server the same way you would your physical SQL Servers, with dedicated resources and disks, and also putting in a reservation for all the memory assigned to the server. Split out the drives across all the virtual SCSI controllers (helps with multi-tasking). These rules aren’t usually followed by VMware admins. Also don’t have it competing with a lot of lower vCPU count virtual machines either. But over time the host will get overcommitted, the LUNs will get consolidated, and the memory reservations will get reversed by someone, and they may even want to ‘standardize’ on one virtual SCSI controller per VM. They may even load a lot of high I/O VMs on one host. This creates a management headache for the DBA trying to manage performance on the SQL Servers. Say the SQL Server needs a lot of CPU for 2 hours a day to complete a load within a time frame. But someone says ‘vKernel says the CPU is largely unused so I am reducing it’. Or maybe doesn’t even tell you. Suddenly your load may be taking longer.
    Organizationally I don’t see how this situation can get better. Great benefits, but management complexity and inefficiencies are introduced.
    I am not saying this is always going to be the case, but it is too easy for it to happen.
    Some great benefits though. High availability (at the server level). Can easily migrate to more powerful hardware. Can easily add storage and increase CPU. Can easily change storage or even have it automatically use the appropriate storage as needed (SSD, FC, SATA..). DR is easy with replicated LUNs for the VMs or even VMware SRM for a more true DR solution. These are just some of the advantages. For many this can work out well. But sometimes it can become a real PITA. :)

  11. very helpful discussion

  12. This seems to be a case of avoiding something that is beneficial because processes are broken. Should we not be looking to address the process gaps, as they will fundamentally break any environment regardless of technology choices? There doesn’t have to be an increase in risk just because you virtualize SQL clusters, if you approach the management and operations in a disciplined and methodical way, as you should approach the design and implementation. It is the approach, design and process that are important when supporting any large SQL database, regardless of whether it is virtual or physical. If the VMware admins simply throw the database into an already overloaded cluster and everyone expects it to work, you’re all kidding yourselves and you should get some new admins. This is not the way to treat business critical applications. The production environments shouldn’t be getting into an overcommitted state where performance suffers in the first place, and again this points back to point 1 regarding broken processes. Broken processes and lack of discipline can make any technology look bad. I’ve designed, implemented and seen many an environment that is well run and where clustered SQL databases work great. But the dependent teams have good working relationships and the processes aren’t broken. I’ve seen just as many environments where this isn’t the case. An investment in good inter-team relationships and good processes will pay you back in spades, regardless of whether you are virtualizing or not.

    • Michael – yep, we both agree. If you do make the investment in good inter-team relationships and processes, then you can leverage that investment to do cool stuff like virtual clusters. However, if you try to run before you learn to walk, you’ll end up in a body cast. ;-) The investment has to come first.

  13. Investing in good inter-team relationships and processes is worthwhile. I am in a situation where we consolidated 4 IT teams down to one bigger team. There was ‘right sizing’ and many changed responsibilities, roles and management. In a situation such as this most of those relationships are changed completely and previous agreements are pretty much gone.
    In such a time a debate about whether to consolidate those SQL disk LUNs or not may not be the best start of a new relationship. It can get back to waiting until there is a reason to review the current setup based on performance and try and work out the ‘new’ solution with new processes and agreements if this is open to discussion.
    Everything is always changing and any planning should be done with this in mind. And for SQL Clusters planning is key to the configuration holding up over time for performance and reliability.
    And the team members adapting to change is key to them being around to work on the SQL Clusters.
    Case in point: Patching SQL Cluster nodes. The ‘old’ patching would migrate the instance back to its original planned primary node after patching. The ‘new’ patching does not, and this is considered a possible enhancement not planned on being looked at until 2014. (So time to look into Failback! :) )

  14. Great article Brent! Do you feel that VMware HA can replace a failover cluster for many systems, assuming they have planned maintenance windows? You could take a snapshot, patch, reboot, and go right back to the snapshot if there are problems. This would be a “stability through simplicity” approach and would require not much more time to restart than a failover cluster would take to fail over.

    • Andrew – you’re not just assuming planned maintenance windows, but you’re also assuming that all changes happen inside those windows. If you work in a shop where everyone pre-announces every change, coordinates it with the sysadmins for good snapshots ahead of time, and then does user acceptance testing afterward to make sure the change had the planned effect, then sure, VMware HA can take the place of clustering. Back here in reality, though… ;-)

  15. I started working with VMware technology back in 2004, I think. I was still working for HP and I remember we were one of the DBAs in the USA to deploy SQL 2000 failover instances (lab environment and for testing). It was a real pain! It was cool, because we were the VMware admins but we were also the DBAs, so we had total and complete control, no black boxes.

    Fast forward to 2013: VMware has improved a lot, and configuring a Windows 2008 R2 cluster is easier than it was before. However, there is still a problem: how to troubleshoot SQL performance issues when SQL runs on a virtual machine.

    If the DBA has no rights on the VMware Server (which is usually the case) it is really hard, if not impossible, to troubleshoot I/O, CPU, or RAM issues.

    Another thing that concerns me is the use of DMVs (and actually a question to Brent, how can a DBA troubleshoot performance issues?). Inside a virtual machine, those DMVs won’t provide actual I/O or RAM usage; those parameters are just a mere abstraction of the real work or SAN.

    But, seeing the industry’s trend and the desire of management teams to consolidate and save money, I honestly believe that professional DBAs (Oracle as well) won’t have any option other than to learn VMware too, or even get certified.

    I think VMware is fantastic, but it is not the “one size fits all” solution that some VMware experts believe it is.

  16. excuse me if my questions sound stupid but i am a ‘noob’ to this technology. however, with the intent of mastering it.
    the general consensus seems to steer away from clustering in virtualisation (also on other sources over the web).
    would you still be of the same opinion if you had a stand alone server connected to the SAN? i currently use it for Mirroring as part of my HA strategy. Mirroring saved my beef too often to give away :)

    • Waged – every situation is different, and it boils down to your business’s RPO and RTO. What are they?

      • hi Brent,

        thanks for getting back to me. in a nutshell my application is extremely sensitive to any data loss and down time.
        RPO = 30min
        RTO = 5-10min max.
        so far Mirroring has served me well in the fact i mirrored to a stand alone machine outside my SAN. when a storm arrives, server failure, storage etc, i have a procedure to shift application interfacing to the mirrored database, we achieve that in minutes and my application is online again.

        by the way, the material you and your team provide on youtube, website and books is invaluable. thank you very much, you’ve been great influence. :)

        • If your recovery time objective is 5-10 minutes, it needs to be a fully automatic process. If the server goes down while you’re in an elevator or in the bathroom, you won’t be able to manually fail over in time. You’re looking at a cluster or AlwaysOn Availability Groups in sync mode, or database mirroring in sync with a witness.

          • agreed, I automated the procedure with powershell scripts.
            i think my solution lies with AlwaysOn Availability Groups, i could motivate the licensing.
            and thanks to AlwaysOn i can finally get rid of replication and its frustrations :/

            thank you again

  17. Very helpful post and follow-up comments. We currently have mirrored SQL Server 2008 instances in our primary data centre and a single SQL Server 2008 instance in our disaster recovery data centre with merge replication between the primary and disaster recovery instances.

    We have a remote mirroring witness server which is also the merge replication distributor. This represents a single point of failure in our high availability architecture (!). We are looking to eliminate this and have a number of options:

    SQL Server clustering on physical servers.
    SQL Server clustering on VMWare virtual servers with VMware HA.
    VMware HA.

    We don’t face some of the challenges outlined above as we have a small operations team responsible for the server, database and VMware environment.

    As the DBA my inclination is to go down the SQL Server clustering route. I have taken a look at http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1037959 which highlights some of the limitations of running MSCS on VMware: vMotion migration, storage vMotion, hot adding memory and CPU ….

    Would welcome any thoughts.

    • Neil – as much as I’d love to be able to do customized infrastructure planning in each blog comment, that’s kinda beyond what I have the bandwidth to do. :-D

      • Understand completely. I was really hoping for a steer on the factors that I need to consider when evaluating SQL Server Clustering on physical servers; SQL Server Clustering on VMWare HA and VMware HA.

  18. if the server is big enough to take all the VMware host resources then maybe the extra VMware licensing costs are not worth it.
    The RDM’s are a little unfriendly so maybe only use a couple of hosts with these so the start up timeouts will not be much of an issue. (WSFC does SCSI locks on them so others cannot get to them.)
    Clustering is pretty good. But VMware HA may be enough depending on your needs.
    (WSFC = MSCS now)
    Anyway , some thoughts…

  19. Thanks for the comments. If we do virtualise then it would be hosted on an ESXi server cluster with HA and iSCSI SAN.

    The service that depends on the mirroring/merge replication is 7*24. We do not take the service down for routine maintenance and have a custom-built application failover solution for both intra- and inter-site availability. As such, we really can’t afford for the witness/distributor to be unavailable for a prolonged period as it leaves us vulnerable in the event that another failure requires us to failover.
    Would VMware HA be enough given these requirements? What other factors would influence SQL Server Cluster? If we were to adopt SQL Server Cluster would deployment on VMware HA make sense?

  20. The Microsoft Support Policy has the following exception which makes answering the question easier:

    http://support2.microsoft.com/?id=956893

    Exceptions:
    If multiple SQL VMs are tightly coupled with one another, individual VMs can failover to the disaster recovery (DR) site but SQL high availability (HA) features inside the VM need to be removed and re-configured after VM failover. For this reason the following SQL Server features are not supported on Hyper-V Replica:
    • Availability Groups
    • Database mirroring
    • Failover Cluster instances
    • Log shipping
    • Replication

  21. After having managed 100+ virtual clusters on VMware and Hyper-V, I am so glad we have moved back to more physical clusters and are slowly decommissioning the virtual servers. It was the added troubleshooting required, which was a huge and unnecessary layer of complexity.
