Designing Microsoft Windows Server Failover Clusters for vCloud Air Network Service Providers
In the modern dynamic business environment, uptime of virtualized business critical applications (vBCAs) and fast recovery from system failures are vital to meeting business service-level agreements (SLAs) for vCloud Air Network Service Providers. Cloud service providers must be prepared for business disruptions and be able to minimize their impact on their consumers.
The “being prepared” approach to providing application high availability is aimed at reducing the risk of revenue losses, maintaining compliance, and meeting customer-agreed SLAs. Designing and deploying applications on Microsoft Windows Server Failover Clusters (WSFC), and having a highly available infrastructure, can help organizations to meet these challenges.
The following figure provides a simple overview of a Microsoft Windows Server Failover Cluster running on ESXi hosts in a VMware vSphere infrastructure.
Figure 1. Microsoft Windows Cluster Service on VMware ESXi Hosts
The Microsoft Clustering Services (MSCS) has been available in the Microsoft Server products since the release of Microsoft Windows NT Server, Enterprise Edition. A Microsoft Server failover cluster is defined as a group of independently running servers that work together and co-exist to increase the availability of the applications and services they provide. The clustered servers, generally referred to as nodes, are connected by virtual and physical networking and by the clustering software. If one of the cluster compute nodes fails, the Microsoft cluster provides the service through a failover process with minimal disruption to the consumer.
Since the release of Microsoft Windows Server 2008, Microsoft clustering services has been renamed to Windows Server Failover Clustering (WSFC) with a number of significant enhancements.
Due to additional cost and increased complexity, Microsoft clustering technology is typically used by cloud service providers to provide high availability to Tier 1 applications such as Microsoft Exchange mailbox servers or highly available database services for Microsoft SQL Server. However, it can also be used to protect other services, such as a highly available Windows Dynamic Host Configuration Protocol (DHCP) Server or file and print services.
Windows Server Failover Cluster technologies protect services and the application layer against the following types of system failure:
- Application and service failures, which can affect application software running on the nodes and the essential services they provide.
- Hardware failures, which affect hardware components such as CPUs, drives, memory, network adapters, and power supplies.
- Physical site failures in multisite organizations, which can be caused by natural disasters, power outages, or connectivity outages.
The decision to implement a Microsoft clustering solution on top of a vCloud Air Network platform should not be taken without the appropriate consideration and certainly not before addressing all of the design options and business requirements. This implementation adds environmental constraints that might limit other vCloud benefits such as mobility, flexibility, and manageability. It also adds a layer of complexity to the vCloud Air Network platform.
The aim of this vCloud Architectural Toolkit for Service Providers (vCAT-SP) technical blog is to address some of the most critical design considerations of running WSFC on top of a vCloud Air Network Service Provider platform. It is not intended to be a step-by-step installation and configuration guide for WSFC; for that, see the VMware Setup for Failover Clustering and Microsoft Cluster Service document.
The customer or provider decision to employ Microsoft clustering in a vCloud infrastructure should not be taken lightly. If VMware vSphere High Availability, VMware vSphere Distributed Resource Scheduler, and VMware vSphere SMP-FT can provide a high enough level of availability to meet the application SLAs, why reduce flexibility by implementing a Microsoft clustered application? Having said this, vSphere HA cannot be considered a replacement for WSFC, because vSphere HA is not application-aware. vSphere HA focuses on VMware ESXi host failure from the network and can, if configured to do so, verify whether a virtual machine is still running by checking the heartbeat provided through VMware Tools. Microsoft Cluster Services is application-aware and is aimed at high-end applications that require high service availability, such as Microsoft Exchange Mailbox Servers or Microsoft SQL Server.
Also, consider if other alternatives, such as Database Log Shipping, Mirroring, or AlwaysOn Availability Groups for Microsoft SQL Server could meet the availability requirement of the applications. For Microsoft Exchange, technologies such as Database Availability Groups (DAGs) make single copy cluster technology less of a necessity in today’s data center.
The decision to use any high availability technology should be defined and driven by the cloud consumer’s requirements for the application or service in question. Inevitably, this depends on the application and whether it is cluster-aware. The majority of common applications are not Microsoft clustering-aware.
As with all design decisions, the architect’s skill in collecting information, correlating it with a solid design, and understanding the trade-offs of different design decisions plays a key role in a successful architecture and implementation. However, a good design is not unnecessarily complex and includes rationales for design decisions. A good design decision about the approach taken to availability balances the organization’s requirements with a robust technical platform. It also involves key stakeholders and the customer’s subject matter experts in every aspect of the design, delivery, testing, and handover.
The following table is not intended to demonstrate a preferred choice to meet your specific application availability requirements, but rather to assist in carrying out an assessment of the advantages, drawbacks, similarities, and differences in the technologies being proposed. In reality, most vCloud Air Network data centers use a combination of all these technologies, in a combined manner and independently, to provide different applications and services with the highest level of availability possible, while maintaining stability, performance, and operational support from vendors.
Table 1. Advantages and Drawbacks of Microsoft Clustering and VMware Availability Technologies
|Advantages of Microsoft Clustering on vSphere
||Drawbacks of Microsoft Clustering on vSphere
||Advantages of VMware Availability Technologies
||Drawbacks of VMware Availability Technologies
|Supports application-level awareness. A WSFC application or service will survive a single node operating system failure. While VMware clusters that are using vSphere HA can use virtual machine failure monitoring to provide a certain level of protection against the failure of the guest operating system, you do not have the protection of the application running on the guest operating system, which is provided with WSFC.
||Additional cost to deploy and maintain the redundant nodes from an operational maintenance perspective.
||Reduced complexity and lower infrastructure implementation effort. vSphere HA and vSphere SMP-FT are extremely simple to enable, configure, and manage. Far more so than a WSFC operating system-level cluster.
||If vSphere HA fails to recognize a system failure, human intervention is required. With vSphere 5.5, App HA can potentially work to overcome some of the vSphere HA shortcomings by working with VMware vRealize™ Hyperic to provide application high availability within the vSphere environment. However, this might require additional application development and implementation efforts to support the application-awareness elements. In addition, there is a continuing management and operational overhead of this solution to take into account. The implementation and design of App HA is beyond the scope of this document. For more information, refer to the VMware App HA documentation page at https://www.vmware.com/support/pubs/appha-pubs.html. Note that with the release of vSphere 6, App HA is now End of Availability (EOA). See http://kb.vmware.com/kb/2108249 for further details.
|WSFC minimizes the downtime of applications that should remain available while maintenance patching is performed on the redundant node. A short outage would be required during the obligatory failover event. As a result, WSFC can potentially reduce patching downtime.
||Potentially added environment costs for passive node virtual machines. That is, wasted hardware resources utilized on hosts for passive WSFC cluster nodes.
||Reduced costs because no redundant node resources are required for vSphere HA. Overall, vSphere HA can allow for higher levels of utilization within an ESXi host cluster than operating system-level clustering.
||You are not able to use vSphere HA or SMP-FT to fail over between systems for performing scheduled patching of the guest operating system or application.
|If architected appropriately with vSphere, a virtual implementation of clustered business critical applications can meet the demands for a Tier 1 application that cannot tolerate any periods of downtime.
||Reduced use of virtual machine functionality. (There is no VMware vMotion®, DRS, VMware Storage vMotion, VMware vSphere Storage DRS™, or snapshots). This also means no snapshot-based backups can be utilized for full virtual machine backups. While other options are available for backups, a cluster node or full cluster loss could require a full rebuild (extending RTO into days and not hours).
||vSphere HA and vSphere SMP-FT do not require any specific license versions of the guest operating system or application in order to make use of their benefits.
||vSphere SMP-FT does not protect you against a guest operating system failure. A failure of the operating system in the primary virtual machine will typically result in a failure of the operating system in the secondary virtual machine.
|WSFC permits an automated response to either a failed server or application. Typically, no human intervention is required to ensure applications and services continue to run.
||Added implementation and operational management complexity for the application and vSphere environment. This requires more experienced application administrators, vSphere, storage, and network administrators to support the cluster services.
||Application-agnostic. vSphere HA and SMP-FT are not application-aware and do not require any application layer support to protect the virtual machine and its workloads, unlike operating system clustering which requires application-level support.
||vSphere SMP-FT does not protect you against an application failure. A failure of the application or service on the primary virtual machine will typically result in a failure of that application in the secondary virtual machine.
|Potentially faster recovery during failover events than with vSphere HA. Virtual machine reboots might take 30 to 60 seconds before all services are up and running.
||Any failover event might require server admin and application admin interaction. This action could be anything from a node reboot to a node rebuild (not self-healing).
||Eliminates the need for dedicated standby resources and for installation of additional software and operating system licensing.
||A failover event requires the virtual machine to be restarted, which could take 30 to 60 seconds. Applications protected solely by vSphere HA might not be available during this time.
|Virtualizing Tier 1 business critical applications can reduce hardware costs by consolidating current WSFC deployments.
||SCSI LUN ID limitation. When using RDMs, remember that each presented RDM reserves one LUN ID. There is a maximum of 256 LUN IDs per ESXi host. These can mount up quickly when running multiple WSFC instances on a vSphere host.
||vSphere SMP-FT can provide higher levels of availability than are available in most operating system-level clustering solutions today.
||Admission control policy requires reserved resources to support host failures in the cluster (% / slot size).
|Failback is quick and can be performed once the primary server is fixed and put back online.
||In a situation where both nodes have failed, recovery time might be increased greatly due to the added complexity of the vSphere layer.
||Supports the full range of virtual machine functionality, which in turn leads to maximized resource utilization. DRS and vMotion provide significant flexibility when it comes to virtual machine placement. Full vSphere functionality can be realized for the servers (that is, snapshots, vMotion, DRS, Storage vMotion, and Storage DRS).
||Requires additional configuration to support host isolation response and virtual machine monitoring.
|WSFC is a supported Microsoft solution, which makes it an obvious choice for Microsoft Applications such as SQL or Exchange.
||Many applications do not support Microsoft clustering. Use cases are typically Microsoft Tier 1 applications, such as SQL and Exchange.
||vSphere host patching/maintenance can be accomplished without after-hour maintenance and Windows Server or application owner participation.
||Reserved capacity and DRS licensing required to facilitate host patching of live systems.
|DRS can be employed to determine initial virtual machine placement at power-on.
||vSphere host patching and maintenance would still have to be done after hours due to the failover outage and could require application owner participation.
||Can support a 99.9% availability SLA.
||Can only support a 99.9% availability SLA, which could mean up to 10.1 minutes per week of downtime.
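The figure of roughly 10.1 minutes per week in the last row follows directly from the 99.9% target. As a minimal sketch (the function name and the period constants are illustrative assumptions, not part of any VMware or Microsoft tooling), the downtime budget implied by an availability percentage can be worked out as follows:

```python
# Illustrative helper: downtime permitted by an availability SLA.
MINUTES_PER_WEEK = 7 * 24 * 60      # 10,080 minutes
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 minutes (assumes a non-leap year)


def allowed_downtime_minutes(availability_pct: float, period_minutes: int) -> float:
    """Return the maximum downtime, in minutes, permitted over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare a few common SLA targets side by side.
    for pct in (99.0, 99.9, 99.99):
        weekly = allowed_downtime_minutes(pct, MINUTES_PER_WEEK)
        yearly = allowed_downtime_minutes(pct, MINUTES_PER_YEAR)
        print(f"{pct}% -> {weekly:.2f} min/week, {yearly / 60:.2f} h/year")
```

Running the same calculation for each candidate SLA target makes it easier to discuss with the consumer whether a restart-based technology fits within the agreed budget.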
Based on what has been discussed so far, you can see there is additional complexity when introducing Microsoft clustering on a vCloud Air Network platform. As such, one should carefully consider all of the business and technical requirements. The next section discusses the process of gathering those business requirements to make an informed recommendation.
Figure 2. Cost Versus Complexity
Establishing Business Requirements
For either the vCloud provider or consumer, the first step in establishing the need to employ Microsoft clustering on the cloud platform is to assess and define the application availability requirements and to understand the impact of downtime on stakeholders, application owners, and most importantly, the end users.
To identify availability requirements for a Microsoft failover cluster, you can use some or all of the following questions. The answers to these questions will help the service provider cloud architect to gather, define, and clarify the deployment goals of the application and services being considered for failover clustering.
- What applications are considered business critical to the organization’s central purpose? What applications and services do end users require when working?
- Are there any Service Level Agreements (SLAs) or similar agreements that define service levels for the applications in question?
- For the end users of the services, what defines a satisfactory level of service for the applications in question?
- What increments of downtime are considered significant and unacceptable to the business (for example, five seconds, five minutes, or an hour) during peak and non-peak hours? If availability is measured by the customer, how is it measured?
The following table might help establish the requirements for the applications in question.
Does the cloud consumer have a business requirement for 24-hour, 7-days-a-week availability or is there a working schedule (for example, 9:00 a.m. to 5:00 p.m. on weekdays)? Do the services or applications that are being targeted have the same availability requirements, or are some of them more important than others? Business days, hours of use, and availability requirements can typically be obtained by the service provider from end-user leadership, application owners, and business managers.
For instance, the following table provides a simple business application list along with the end-user requirements for availability and common hours of use. These requirements are important to establish because downtime when an application is not being used, for example overnight, might not negatively impact the application service level agreement.
||Hours of Use
|Customer Tracking System
|Document Management System
|Microsoft SharePoint (Collaboration)
|Microsoft Exchange (Email and Collaboration)
|Microsoft Lync (Collaboration)
|Digital Imaging System
|Document Archiving System
|Public Facing Web Infrastructure
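To make the point about hours of use concrete, the sketch below estimates how much of an outage falls inside an application's agreed service window. The 9:00 a.m. to 5:00 p.m. weekday window and the sample outage are hypothetical values chosen for illustration; they are not taken from the table, and the function name is an assumption.

```python
from datetime import datetime, time, timedelta


def downtime_in_service_hours(start: datetime, end: datetime,
                              open_at: time = time(9, 0),
                              close_at: time = time(17, 0),
                              weekdays_only: bool = True) -> timedelta:
    """Return the part of an outage [start, end) that overlaps service hours."""
    total = timedelta()
    day = start.date()
    while day <= end.date():
        # weekday() returns 5 and 6 for Saturday and Sunday.
        if not (weekdays_only and day.weekday() >= 5):
            window_start = datetime.combine(day, open_at)
            window_end = datetime.combine(day, close_at)
            overlap_start = max(start, window_start)
            overlap_end = min(end, window_end)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total


if __name__ == "__main__":
    # Hypothetical outage: Tuesday 03:00 to Tuesday 09:30 (naive local time).
    outage_start = datetime(2024, 1, 2, 3, 0)
    outage_end = datetime(2024, 1, 2, 9, 30)
    print(downtime_in_service_hours(outage_start, outage_end))
```

In this hypothetical case, only the portion of the outage after 9:00 a.m. would count against an SLA that is measured during business hours, which is why the hours of use should be agreed before the availability technology is chosen.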
It is also important to establish and understand application dependencies. Many of the applications shown in the previous table consist of a number of components including databases, application layer software, web servers, load balancers, and firewalls. In order to achieve the levels of availability required by the business, a number of techniques must be employed by a range of technologies, not only by clustered services.
- Do the applications in question have variations in load over time or the business cycle (for example, 9:00 a.m. to 5:00 p.m. on weekdays, monthly, or quarterly)?
- How many vSphere host servers are available on the vCloud platform for failover clustering and what type of storage is available for use in the cluster or clusters?
- Is having a disaster recovery option for the services or applications important to the cloud consumer’s organization? What type of infrastructure will be available to support the workload at your recovery site? Is your recovery site cold/hot or a regional data center used by other parts of the business? Is any storage replication technology in place? Have you accounted for the clustered application itself? What steps must be taken to ensure the application is accessible to users/customers if failed over to the recovery site?
- Is it possible for some of the Microsoft clustered nodes to be placed in a separate vCloud Air Network Service Provider site, an adjacent data center or data center zone to provide an option for disaster recovery if a serious problem develops at the primary site?
When asking these questions of your cloud platform customer, also consider that an application having always been protected by Microsoft clustering in the past does not mean it must be in the future. VMware vSphere and the vCloud platform offer several high availability solutions that can be used collectively to support applications where there is a requirement to minimize unplanned downtime. It is important for the provider to examine all options with the consumer and carefully consider and understand the impact of that decision on the application or service.
Microsoft Cluster Configuration Implementation Options
When implementing Microsoft clusters in a vSphere-based vCloud environment, three primary architectural options exist. The choice of the most appropriate design will depend on your specific design use case. For instance, if you are looking for a solution to provide high availability in case of a single hardware failure (N+1), hosting both cluster nodes on the same physical host will fail to meet this basic requirement.
In this section, we examine three options and analyze the advantages and drawbacks of each.
Option 1: Cluster-In-A-Box (CIB)
Cluster-In-A-Box (CIB) is a design in which the two clustered virtual machine nodes run on the same vSphere ESXi host. In this scenario, the shared disks and quorum can be either local or remote and are shared between the virtual machines within the single host. For instance, you can use VMDKs or RDMs (with their SCSI bus set to virtual mode). The use of RDMs can be beneficial if you decide to migrate one of the virtual machines to another host to create a Cluster-Across-Boxes (CAB) design (described in the next section).
The Cluster-In-A-Box option would most typically be used in test or development environments, because this solution offers no high availability in the event of a host hardware failure.
For CIB deployments, create VM-to-VM affinity rules to keep the cluster nodes together. Also account for VMware vSphere Distributed Resource Scheduler (DRS): additional host-to-VM rule groups are required because, depending on the version of vSphere, HA does not consider the VM-to-VM rules when restarting VMs in the event of hardware failure. For CIB deployments, the virtual machines must be in the same virtual machine DRS group, which must be assigned to a host DRS group containing two hosts using a “must run on hosts in group” rule.
Figure 3. Option 1 Design Cluster-In-A-Box (CIB)
Option 2: Cluster-Across-Boxes (CAB)
Cluster-Across-Boxes (CAB) is the most common scenario and describes the design where a WSFC is employed on two virtual machines that are running across two different physical ESXi hosts. The primary advantage here is that this protects the environment against a hardware failure of a single physical server (N+1). In this design scenario, VMware recommends physical RDMs as the disk choice. The shared storage and quorum should be located on Fibre Channel SAN storage or be available through an in-guest iSCSI initiator.
For CAB deployments, create VM-to-VM anti-affinity rules to keep the cluster nodes apart. These should be “must run” rules because there is no point in having the two nodes running on the same ESXi host. Again, account for DRS: you will need additional host-to-VM rule groups, because HA does not consider the VM-to-VM rules when restarting virtual machines in the event of hardware failure. For CAB deployments, the VMs must be in different VM DRS groups, and those groups must be assigned to different host DRS groups using a “must run on hosts in group” rule.
Figure 4. Option 2 Design Cluster-Across-Boxes (CAB)
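The “keep together” intent for CIB and the “keep apart” intent for CAB can be expressed as a simple placement check. This is a conceptual sketch only: the data model, function names, and the VM and host names are hypothetical, and the code does not call any vSphere or DRS API. It merely validates that a proposed VM-to-host mapping satisfies the rule in question.

```python
from typing import Dict, List


def violates_cib(placement: Dict[str, str], nodes: List[str]) -> bool:
    """CIB (affinity): all cluster nodes must run on the same ESXi host."""
    hosts = {placement[vm] for vm in nodes}
    return len(hosts) != 1


def violates_cab(placement: Dict[str, str], nodes: List[str]) -> bool:
    """CAB (anti-affinity): no two cluster nodes may share an ESXi host."""
    hosts = [placement[vm] for vm in nodes]
    return len(set(hosts)) != len(hosts)


if __name__ == "__main__":
    # Hypothetical VM -> ESXi host mapping.
    placement = {
        "sql-node-a": "esxi-01",
        "sql-node-b": "esxi-02",
        "dev-node-a": "esxi-03",
        "dev-node-b": "esxi-03",
    }
    print("CAB violated:", violates_cab(placement, ["sql-node-a", "sql-node-b"]))
    print("CIB violated:", violates_cib(placement, ["dev-node-a", "dev-node-b"]))
```

A check of this kind could be run against the expected placement after a simulated host failure, to confirm that an HA restart would not leave both CAB nodes on the same host.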
Option 3: Physical and Virtual Machine
The final typical design scenario is Physical and Virtual Machine (Physical and N+1 VM). This cluster design allows for the primary (active) node of a WSFC cluster to run natively on a bare metal physical server, while the secondary (passive) node runs in a virtual machine. This model can be used to migrate from a physical two-node deployment to a virtualized environment, or as a means of providing N+1 availability with the purchase of a single physical server. With this design, if you need to run on the secondary node during primary business hours, performance-based SLAs might be impacted. However, when you consider that typically a WSFC only runs on the primary node and is only failed over to the secondary node for short periods of time (and outside of business hours for maintenance), this might be a viable option for some use cases. The Physical and N+1 virtual machine model does not require any special affinity rules because one of the nodes is virtual and the other is physical.
Figure 5. Option 3 Design Physical and Virtual Machine
VMware recommends physical RDMs as the disk option. Shared storage and quorum disks must be located on Fibre Channel SAN or iSCSI storage or be presented through an in-guest iSCSI initiator. Note that RDMs backed by VMware Virtual SAN are not supported. Refer to http://kb.vmware.com/kb/1037959 for further details.
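Because each presented RDM reserves one LUN ID, and Table 1 notes a maximum of 256 LUN IDs per ESXi host, it can help to estimate LUN consumption before committing to RDM-based clusters. The sketch below is a rough budgeting aid; the function name and the sample figures (number of clusters, RDMs per cluster, LUNs already used for datastores) are assumptions for illustration only.

```python
from typing import List

MAX_LUN_IDS_PER_HOST = 256  # per-host limit cited in Table 1


def remaining_lun_ids(rdms_per_cluster: List[int], other_luns: int = 0,
                      limit: int = MAX_LUN_IDS_PER_HOST) -> int:
    """Return the LUN IDs left on a host (negative means over the limit).

    rdms_per_cluster: RDM count for each WSFC cluster with a node on the host.
    other_luns: LUNs the host already consumes, for example VMFS datastores.
    """
    if other_luns < 0 or any(count < 0 for count in rdms_per_cluster):
        raise ValueError("LUN counts cannot be negative")
    used = sum(rdms_per_cluster) + other_luns
    return limit - used


if __name__ == "__main__":
    # Hypothetical host: 20 clusters with 6 RDMs each, plus 40 datastore LUNs.
    headroom = remaining_lun_ids([6] * 20, other_luns=40)
    print(f"LUN IDs remaining on host: {headroom}")
```

A negative result indicates that the planned cluster density would exceed the per-host limit, so the clusters would need to be spread across more hosts or use fewer shared disks.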
Design factors are components that are combined and that dictate the outcomes of each design decision. If your customer is looking at virtualizing physical Microsoft Windows clusters on vSphere, you must first assess the impact of using WSFC in your design. Consider the impact on availability, manageability, performance, recoverability, and security.
The use of Microsoft clustering on a vCloud Air Network Platform will add new design requirements, constraints, and risks to the environment. It is crucial that all design factors and their impact on the architecture be addressed at the design stage.
Migrating from physical to virtual cloud platform instances of WSFC offers a significant cost reduction in required hardware, and if architected correctly, can provide the performance and levels of availability to support the most demanding application and the strictest of SLAs. However, it is also important to evaluate other solutions, such as the native high availability features of vSphere, which can be implemented without the high operational costs associated with WSFC. These alternatives can often provide levels of availability that meet the SLAs for the majority of your consumer’s business applications and provide a good alternative to Microsoft clustered implementations, particularly where application-level availability can be used alongside established vSphere technologies.
The decision to use WSFC on a vCloud Air Network Platform should be driven by the workload availability requirements of the end user’s application or service as defined by the customer or application owner. These requirements ultimately drive the decision behind your application’s availability strategy.
To meet high availability and disaster recovery requirements for cloud consumers using WSFC, it is important for the service provider to:
- Determine high availability and disaster recovery needs of the applications in question.
- Examine design requirements, constraints and risks for your customer-specific use cases.
- Develop a WSFC design strategy for the business and overall solution architecture that can be replicated for different applications within the infrastructure.
- Choose an appropriate WSFC design and size, and configure the infrastructure components to meet the application’s performance and availability requirements.
- Follow VMware’s proven technical guidance for WSFC on a vSphere platform.