
Tag Archives: vCAT-SP

Dedicated Hosted Cloud with vCloud Director for VMware Cloud Providers

When looking for service providers for hosted infrastructure, some customers require dedicated infrastructure for their workloads. Whether the customer is looking for additional separation for security or more predictable performance of hosted workloads, service providers will need tools that enable them to provide dedicated hardware service for customers while reducing their operational overhead. In some scenarios, providers will implement managed vSphere environments for customers to satisfy this type of request and then manage the individual vSphere environments manually or with custom automation and orchestration tools. However, it is also possible to leverage vCloud Director to provide dedicated hardware per customer while also providing a central management platform for service providers to manage multiple tenants. In this post, we will explore how this can be accomplished with ‘out of the box’ functionality in vCloud Director.


Deploying Cassandra for vCloud Availability Part 2

In the previous post, we reviewed the preparation steps necessary for the installation of Cassandra for use with vCloud Availability. In this post we will complete the deployment by showing the steps necessary to install Cassandra and then configure Cassandra for secure communication as well as clustering the 3 nodes. This post assumes basic proficiency with the ‘vi’ text editor.

Installing & Configuring Cassandra

For this example, the Datastax version of Cassandra will be deployed. To prepare the server for Cassandra, create the datastax.repo file in the /etc/yum.repos.d directory with the following command:

vi /etc/yum.repos.d/datastax.repo

Then enter the Datastax repo details into the file.

 [datastax]
 name = DataStax Repo for Apache Cassandra
 baseurl = https://rpm.datastax.com/community
 enabled = 1
 gpgcheck = 0

Once the repo details have been entered correctly, press the ESC key and type :wq! to write the file and exit.
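Before running any yum command, it can be worth confirming that the repo file parses as intended. The following is a minimal Python sketch and not part of the original procedure: it parses the stanza shown above with the standard-library configparser module. The function name parse_repo is hypothetical, and the text is embedded as a string here; in practice you would read /etc/yum.repos.d/datastax.repo instead.

```python
import configparser

# Repo stanza as shown above (embedded for illustration only).
REPO_TEXT = """\
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = https://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
"""


def parse_repo(text):
    """Return {section: {key: value}} for a yum-style .repo file (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


if __name__ == "__main__":
    repo = parse_repo(REPO_TEXT)["datastax"]
    state = "enabled" if repo["enabled"] == "1" else "disabled"
    print(repo["baseurl"], state)
```

A parse error or a missing baseurl key at this point usually indicates a typing mistake made in vi.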


Deploying Cassandra for vCloud Availability Part 1

With the recent release of vCloud Availability for vCloud Director 2.0, it seems like a good opportunity to review the steps for one of the key components required for its installation, the Cassandra database cluster. While the vCloud Availability installation provides a container-based deployment of Cassandra, this container instance of Cassandra is only meant for ‘proof of concept’ deployments.

To support a production implementation of vCloud Availability, a fully clustered instance of Cassandra must be deployed with a recommended minimum of 3 nodes. This post will outline the steps for prepping the nodes for the installation of Cassandra. These preparation steps consist of:

  • Installation of Java JDK 8
  • Installation of Python 2.7

This post assumes basic proficiency with the ‘vi’ text editor.

Infrastructure Considerations

Before deploying the Cassandra nodes for vCloud Availability, ensure that:

  • All nodes have access to communicate with the vSphere Cloud Replication Service over ports 9160 and 9042.
  • DNS is properly configured so that each node can successfully be resolved by the respective FQDN.

It is also worth mentioning that for this implementation, Cassandra does not require a load balancer, as the vSphere Cloud Replication Service will automatically select an available node from the Cassandra cluster for database communications.
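The two prerequisites above (name resolution and reachability on ports 9160 and 9042) can be checked with a short script. This is a minimal sketch using only the Python standard library; it assumes it is run from a machine that should be able to reach the Cassandra nodes, and the function names and the host name in the example are placeholders rather than part of any VMware tooling. The ports will only answer once Cassandra is installed and running.

```python
import socket


def resolves(fqdn):
    """Return True if the name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(fqdn, None)) > 0
    except socket.gaierror:
        return False


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_nodes(nodes, ports=(9160, 9042)):
    """Map each node name to its DNS result and per-port reachability."""
    report = {}
    for fqdn in nodes:
        dns_ok = resolves(fqdn)
        report[fqdn] = {
            "dns": dns_ok,
            # Skip the port test when the name does not resolve.
            "ports": {p: dns_ok and port_open(fqdn, p) for p in ports},
        }
    return report


if __name__ == "__main__":
    # Placeholder name; substitute the FQDNs of your Cassandra nodes.
    for node, result in check_nodes(["localhost"]).items():
        print(node, result)
```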


Virtual Machine Performance Metrics in VMware vCloud Director 9.0

Starting with VMware vCloud Director® 5.6, service providers have been able to configure vCloud Director to store metrics that it collects on virtual machine performance and resource consumption. Data for historic metrics is stored in a Cassandra and KairosDB database.

VMware Cloud Providers™ can set up a database schema to store basic VM historical performance and resource consumption metrics (CPU, memory, and storage), which are collected every 5 minutes (with 20-second granularity) by a StatsFeeder process running on the vCloud Director cells. These metrics are then pushed to a Cassandra NoSQL database cluster with KairosDB persistent storage.

However, this implementation has several limitations, including the following:

• Uses Kairos on top of Cassandra, with an extra layer to maintain
• Supports an outdated version of Kairos DB 0.9.1 and Cassandra 1.2.x/2.0.x
• VMware vCenter Server® does not provide metrics for NFS-based storage
• Difficult to manage the size of the performance data because there is no TTL setting
• Lack of SSL support

With vCloud Director 9.0, VMware has made the following enhancements:

• Provides hybrid mode (you can still choose to use KairosDB)
• Uses a native Cassandra schema and supports Cassandra 3.x
• Uses SSL
• Uses vCloud Director entity IDs to tag data in Cassandra instead of Moref/VC-id
• Adds the CMT command to configure a Cassandra cluster

 

After the service provider has successfully implemented this VM performance metrics collecting mechanism, vCloud Director tenant users can directly view their VM’s performance chart from within their vCloud Director 9.0 tenant HTML5 user interface. Service providers are no longer required to use the API call for this purpose, enabling them to offer this benefit to their customers in a much simpler way.

To configure basic VM metrics for vCloud Director 9.0, follow the steps in “Install and Configure Optional Database Software to Store and Retrieve Historic Virtual Machine Performance Metrics” in the vCloud Director 9.0 Installation and Upgrade Guide here. In this version, the configuration file does not need to be generated first. Simply follow the documented steps and everything will automatically be done for you.

If you issue the cell-management-tool configure-metrics --metrics-config /tmp/metrics.groovy command described here, you might encounter a problem adding the schema (as shown in the following screen capture), in which vCloud Director 9.0 cannot start up normally and stops at the com.vmware.vcloud.metrices-core process.

You must perform the following steps before running the cell-management-tool cassandra command, because it will try to add the same schema again which will cause the error:

1. Remove the keyspace on Cassandra:
# cqlsh -u cassandra -p cassandra    (or another superuser account)
cqlsh> drop keyspace vcloud_metrics;

2. Edit the content of the /tmp/metrics.groovy file to:

configuration {
}

3. Run the following command:
# cell-management-tool configure-metrics --metrics-config /tmp/metrics.groovy

4. Run the following command (replace with your Cassandra user and IPs):
# cell-management-tool cassandra --configure --create-schema --cluster-nodes ip1,ip2,ip3,ip4 --username cassandra --password 'cassandra' --ttl 15 --port 9042
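Long option strings such as the one in step 4 are easy to damage when copied from a web page, because typographic dashes and curly quotes are not recognized as option prefixes or shell quotes. The following Python sketch assembles the command from its parts and flags typographic dashes. It uses only the options that appear in step 4; the helper names and the 10.0.0.x addresses are illustrative assumptions.

```python
import shlex

TYPOGRAPHIC_DASHES = "\u2013\u2014"  # en dash and em dash


def has_typographic_dash(command):
    """Return True if the command text contains an en dash or em dash."""
    return any(ch in command for ch in TYPOGRAPHIC_DASHES)


def cassandra_configure_command(nodes, username, password, ttl=15, port=9042):
    """Assemble the step 4 command line from the options shown above."""
    args = [
        "cell-management-tool", "cassandra", "--configure", "--create-schema",
        "--cluster-nodes", ",".join(nodes),
        "--username", username,
        "--password", password,
        "--ttl", str(ttl),
        "--port", str(port),
    ]
    return shlex.join(args)  # quotes any argument that needs quoting


if __name__ == "__main__":
    # Placeholder addresses; replace with your Cassandra node IPs.
    cmd = cassandra_configure_command(
        ["10.0.0.1", "10.0.0.2", "10.0.0.3"], "cassandra", "cassandra"
    )
    print(cmd)
    print("typographic dash found:", has_typographic_dash(cmd))
```

Building the argument list explicitly also keeps a password containing spaces or shell metacharacters correctly quoted.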

Notes:

• See the latest vCloud Director 9.0 release notes here for supported vCloud Director Cassandra versions:
– Cassandra 2.2.6 (deprecated for new installations. Supported for legacy upgrades still using KairosDB)
– Cassandra 3.x (3.9 recommended)

• See the vCAT blog at https://blogs.vmware.com/vcat/2015/08/vmware-vcloud-director-virtual-machine-metric-database.html for detailed VM metrics explanations.

• The service provider can implement a more advanced tenant-facing performance monitoring solution by using the VMware vRealize® Operations Manager™ Tenant App for vCloud Director, which gives a tenant administrator visibility into their vCloud Director environment. For more information, go to https://marketplace.vmware.com/vsx/solutions/management-pack-for-vcloud-director.

• There is no need to set up an additional load balancer in front of a Cassandra cluster. Cassandra’s Java driver balances requests across the Cassandra nodes on its own.
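To illustrate why no external load balancer is needed, the sketch below shows the general idea of client-side node selection: the client keeps the list of cluster nodes, rotates through them, and skips any node it has marked as down. This is a deliberately simplified, hypothetical model written in Python; it is not the Cassandra Java driver's actual policy or API.

```python
import itertools


class RoundRobinNodes:
    """Cycle through cluster nodes, skipping any currently marked down."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._down = set()
        self._cycle = itertools.cycle(self._nodes)

    def mark_down(self, node):
        self._down.add(node)

    def mark_up(self, node):
        self._down.discard(node)

    def next_node(self):
        # Try each node at most once per call before giving up.
        for _ in range(len(self._nodes)):
            node = next(self._cycle)
            if node not in self._down:
                return node
        raise RuntimeError("no Cassandra node available")


if __name__ == "__main__":
    picker = RoundRobinNodes(["node1", "node2", "node3"])
    picker.mark_down("node2")
    print([picker.next_node() for _ in range(4)])
```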

Service Provider Multi-Tenant vRealize Operations (Managed Service)

VMware vRealize Operations™ is a key component of a vCloud Air Network powered cloud service offering. It provides a simplified yet extensible approach to operations management of the cloud infrastructure. It helps service providers maximize profitability by optimizing efficiency, and it differentiates their service offerings by increasing customer satisfaction and delivering to SLAs.

VMware vRealize Operations also enables service providers to generate new revenue streams by expanding their footprint to offer vRealize Operations as a service, giving their tenants deeper insight into the health, capacity, and performance of their hosted environments.

This can be delivered either on a dedicated per-tenant basis as part of a private cloud solution offering or, alternatively, as a shared vRealize Operations platform that the vCAN service provider offers as a managed service.

Conceptual Overview:
[Figure: mt-vr-ops]

In this scenario, the service provider operates a centralized vRealize Operations Manager instance to collect all data generated by the resource cluster. Both service provider personnel and tenants will access the same instance of vRealize Operations, and data access will be controlled with RBAC. This scenario allows for easy management and deployment.

This approach is especially attractive for service providers who can operate their complete environment within one vRealize Operations Manager environment.

Advantages include the following:

  • Easy to deploy and manage
  • No additional data/configuration distribution for dashboards, policies, and so on is needed
  • Only one instance to maintain (software updates, management packs, and so on)

Disadvantages involve the following:

  • Role-based access control requires careful maintenance
  • Objects can only be operated under one policy, removing the ability to limit alert visibility for a customer/tenant
  • Sizing can get complex and larger environments could be limited by sizing parameters. A possible workaround could be to build instances per larger resource group.

Example:
[Figure: mt-vr-ops-1]

This is just one way a vCloud Air Network provider can differentiate its service portfolio with vRealize Operations™: by extending consumption to its end customers as a managed service.

For more information on common deployment models for vCloud Air Network Service Providers, please visit the vCloud Architecture Toolkit for Service Providers.


How to use and interpret the vCloud Availability for vCloud Director Business Calculator

Foreword:

In this blog we will run through how to use the vCloud Air Network vCloud Availability for vCloud Director Calculator to see how a multi-tier DR solution could benefit your business. It has been created to provide indicative revenues and margins based on a multi-tiered Disaster Recovery solution using vCloud Air Network vCloud Availability for vCloud Director as the middle tier option.

Using the calculator

Please access the calculator at the Partner Central link: “vCloud Air Network Services IP”

https://vmware.my.salesforce.com/apex/page?name=set.hybrid

Capital Expenditure Modelling

In the sheet called CapEx modelling, you can change any cell highlighted in GREY with Bold Red Text.

  • Input your number of VMs for the Premium, Standard, and Basic tiers of the Disaster Recovery service.
  • Input the approximate number of virtual CPUs (vCPU), the amount of virtual RAM (vRAM), and the storage for each VM in each tier.
  • Input the contention ratio of compute (vCPU) for each tier. Usually, the lower the service tier, the higher the contention ratio.
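The effect of these inputs can be sketched with a little arithmetic. The Python example below is an assumption about how such inputs are commonly combined (vCPUs divided by the contention ratio to estimate physical cores); it is not the calculator's actual formula, and every figure in it is illustrative.

```python
import math
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    vm_count: int
    vcpu_per_vm: int
    vram_gb_per_vm: float
    storage_gb_per_vm: float
    vcpu_contention: float  # vCPUs sharing one physical core; 4 means 4:1


def physical_cores(tier):
    """Cores needed once the tier's vCPU contention ratio is applied."""
    return math.ceil(tier.vm_count * tier.vcpu_per_vm / tier.vcpu_contention)


def totals(tiers):
    """Aggregate cores, vRAM, and storage across all tiers."""
    return {
        "cores": sum(physical_cores(t) for t in tiers),
        "vram_gb": sum(t.vm_count * t.vram_gb_per_vm for t in tiers),
        "storage_gb": sum(t.vm_count * t.storage_gb_per_vm for t in tiers),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the calculator.
    tiers = [
        Tier("Premium", 100, 2, 8, 100, 2),
        Tier("Standard", 100, 2, 8, 100, 4),
        Tier("Basic", 100, 2, 4, 50, 8),
    ]
    for t in tiers:
        print(t.name, physical_cores(t), "cores")
    print(totals(tiers))
```

With the same VM count and size, moving from a 2:1 to an 8:1 ratio cuts the estimated core count to a quarter, which is why the ratio has such a strong influence on capital expenditure.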


Protecting workloads in the cloud with minimal effort through VMware vCloud Availability

Among the many challenges an organization and its IT department confront on a daily basis, availability of services is particularly critical for the survival of the businesses that entrust and rely on the technologies on which their services have been built. At the same time, legislation in many countries is putting continuous pressure on every organization to maintain an appropriate plan to protect and secure its data and services.

Historically, every large enterprise has planned and built its own approach to face a disaster of small or large proportions in the most suitable way for its business: backups, hardware redundancy, host clustering, data mirroring, replication, geographically distributed sites, and so on, are just a few of the technologies and strategies used to build a solution that addresses the problem.

Over the years, some of these technologies have been commoditized. Still for some of them, the financial burden to allow their implementation has been an overwhelming capital expense for many medium and small organizations. In addition, expertise is required to manage and organize the software, hardware, and storage components involved.

In this context, a great opportunity for cloud service providers has materialized. The market has increased its confidence in using cloud-based services offering a more cost-effective (subscription based) access to resources. Disaster recovery as a service (DRaaS) is a highly desirable service to offer to all organizations, but particularly for the ones that might have concerns or financial exposures caused by planning and building their own secondary data center site to make their services more robust and resilient to local disasters.

Windows Failover Clusters for vCloud Air Network Service Providers

Designing Microsoft Windows Server Failover Clusters for vCloud Air Network Service Providers

Introduction

In the modern dynamic business environment, uptime of virtualized business critical applications (vBCAs) and fast recovery from system failures are vital to meeting business service-level agreements (SLAs) for vCloud Air Network Service Providers. Cloud service providers must be prepared for business disruptions and be able to minimize their impact to their consumers.

The “being prepared” approach to providing application high availability is aimed at reducing risk of revenue losses, maintaining compliance, and meeting customer agreed SLAs. Designing and deploying applications on Microsoft Windows Server Failover Clusters (WSFC), and having a highly available infrastructure, can help organizations to meet these challenges.

The following figure provides a simple overview of a Microsoft Windows Server Failover Cluster running on ESXi hosts in a VMware vSphere infrastructure.

Figure 1. Microsoft Windows Cluster Service on VMware ESXi Hosts


The Microsoft Clustering Services (MSCS) has been available in the Microsoft Server products since the release of Microsoft Windows NT Server, Enterprise Edition. A Microsoft Server failover cluster is defined as a group of independently running servers that work together and co-exist to increase the availability of the applications and services they provide. The clustered servers, generally referred to as nodes, are connected by virtual and physical networking and by the clustering software. If one of the cluster compute nodes fails, the Microsoft cluster provides the service through a failover process with minimal disruption to the consumer.

Since the release of Microsoft Windows Server 2008, Microsoft clustering services has been renamed to Windows Server Failover Clustering (WSFC) with a number of significant enhancements.

Due to additional cost and increased complexity, Microsoft clustering technology is typically used by cloud service providers to provide high availability to Tier 1 applications such as Microsoft Exchange mailbox servers or highly available database services for Microsoft SQL Server. However, it can also be used to protect other services, such as a highly available Windows Dynamic Host Configuration Protocol (DHCP) Server or file and print services.

Windows Server Failover Cluster technologies protect services and the application layer against the following types of system failure:

  • Application and service failures, which can affect application software running on the nodes and the essential services they provide.
  • Hardware failures, which affect hardware components such as CPUs, drives, memory, network adapters, and power supplies.
  • Physical site failures in multisite organizations, which can be caused by natural disasters, power outages, or connectivity outages.

The decision to implement a Microsoft clustering solution on top of a vCloud Air Network platform should not be taken without the appropriate consideration and certainly not before addressing all of the design options and business requirements. This implementation adds environmental constraints that might limit other vCloud benefits such as mobility, flexibility, and manageability. It also adds a layer of complexity to the vCloud Air Network platform.

The aim of this vCloud Architecture Toolkit for Service Providers (vCAT-SP) technical blog is to address some of the most critical design considerations of running WSFC on top of a vCloud Air Network Service Provider platform. It is not intended to be a step-by-step installation and configuration guide for WSFC. See instead the VMware Setup for Failover Clustering and Microsoft Cluster Service document.

The customer or provider decision to employ Microsoft clustering in a vCloud infrastructure should not be taken lightly. If VMware vSphere High Availability, VMware vSphere Distributed Resource Scheduler and VMware vSphere SMP-FT can provide a high enough level of availability to meet the application SLAs, why reduce flexibility by implementing a Microsoft Clustered application? Having said this, vSphere HA cannot be considered a replacement for WSFC, because vSphere HA is not application-aware. vSphere HA focuses on VMware ESXi host failure from the network and can, if configured to do so, verify whether a virtual machine is still running by checking the heartbeat provided through VMware Tools. Microsoft Cluster Services is application-aware and is aimed at the high-end and high service availability applications, such as Microsoft Exchange Mailbox Servers or Microsoft SQL.

Also, consider if other alternatives, such as Database Log Shipping, Mirroring, or AlwaysOn Availability Groups for Microsoft SQL Server could meet the availability requirement of the applications. For Microsoft Exchange, technologies such as Database Availability Groups (DAGs) make single copy cluster technology less of a necessity in today’s data center.

Feature Comparison

The decision to use any high availability technology should be defined and driven by the cloud consumer’s requirements for the application or service in question. Inevitably, this depends on the application and whether it is cluster-aware. The majority of common applications are not Microsoft clustering-aware.

As with all design decisions, the architect’s skill in collecting information, correlating it with a solid design, and understanding the trade-offs of different design decisions plays a key role in a successful architecture and implementation. However, a good design is not unnecessarily complex and includes rationales for design decisions. A good design decision about the approach taken to availability balances the organization’s requirements with a robust technical platform. It also involves key stakeholders and the customer’s subject matter experts in every aspect of the design, delivery, testing, and handover.

The following table is not intended to demonstrate a preferred choice to meet your specific application availability requirements, but rather to assist in carrying out an assessment of the advantages, drawbacks, similarities, and differences in the technologies being proposed. In reality, most vCloud Air Network data centers use a combination of all these technologies, in a combined manner and independently, to provide different applications and services with the highest level of availability possible, while maintaining stability, performance, and operational support from vendors.

Table 1. Advantages and Drawbacks of Microsoft Clustering and VMware Availability Technologies

| Advantages of Microsoft Clustering on vSphere | Drawbacks of Microsoft Clustering on vSphere | Advantages of VMware Availability Technologies | Drawbacks of VMware Availability Technologies |
| --- | --- | --- | --- |
| Supports application-level awareness. A WSFC application or service will survive a single node operating system failure. While VMware clusters that are using vSphere HA can use virtual machine failure monitoring to provide a certain level of protection against the failure of the guest operating system, you do not have the protection of the application running on the guest operating system, which is provided with WSFC. | Additional cost to deploy and maintain the redundant nodes from an operational maintenance perspective. | Reduced complexity and lower infrastructure implementation effort. vSphere HA and vSphere SMP-FT are extremely simple to enable, configure, and manage. Far more so than a WSFC operating system-level cluster. | If vSphere HA fails to recognize a system failure, human intervention is required. With vSphere 5.5, AppHA can potentially work to overcome some of the vSphere HA shortcomings by working with VMware vRealize™ Hyperic to provide application high availability within the vSphere environment. However, this might require additional application development and implementation efforts to support the application-awareness elements. In addition, there is a continuing management and operational overhead of this solution to take into account. The implementation and design of App HA is beyond the scope of this document. For more information, refer to the VMware App HA documentation page at https://www.vmware.com/support/pubs/appha-pubs.html. Note that with the release of vSphere 6 AppHA is now End of Availability (EOA). Please see http://kb.vmware.com/kb/2108249 for further details. |
| WSFC minimizes the downtime of applications that should remain available while maintenance patching is performed on the redundant node. A short outage would be required during the obligatory failover event. As a result, WSFC can potentially reduce patching downtime. | Potentially added environment costs for passive node virtual machines. That is, wasted hardware resources utilized on hosts for passive WSFC cluster nodes. | Reduced costs because no redundant node resources are required for vSphere HA. Overall vSphere HA can allow for higher levels of utilization within an ESXi host cluster than using operating system-level clustering. | You are not able to use vSphere HA or SMP-FT to fail over between systems for performing scheduled patching of the guest operating system or application. |
| If architected appropriately with vSphere, a virtual implementation of clustered business critical applications can meet the demands for a Tier 1 application that cannot tolerate any periods of downtime. | Reduced use of virtual machine functionality. (There is no VMware vMotion®, DRS, VMware Storage vMotion, VMware vSphere Storage DRS™, or snapshots). This also means no snapshot-based backups can be utilized for full virtual machine backups. While other options are available for backups, a cluster node or full cluster loss could require a full rebuild (extending RTO into days and not hours). | vSphere HA and vSphere SMP-FT do not require any specific license versions of the guest operating system or application in order to make use of their benefits. | vSphere SMP-FT does not protect you against a guest operating system failure. A failure of the operating system in the primary virtual machine will typically result in a failure of the operating system in the secondary virtual machine. |
| WSFC permits an automated response to either a failed server or application. Typically, no human intervention is required to ensure applications and services continue to run. | Added implementation and operational management complexity for the application and vSphere environment. This requires more experienced application administrators, vSphere, storage, and network administrators to support the cluster services. | Application-agnostic. vSphere HA and SMP-FT are not application-aware and do not require any application layer support to protect the virtual machine and its workloads, unlike operating system clustering which requires application-level support. | vSphere SMP-FT does not protect you against an application failure. A failure of the application or service on the primary virtual machine will typically result in a failure of that application in the secondary virtual machine. |
| Potentially faster recovery during failover events than with vSphere HA. Virtual machine reboots might take 30 to 60 seconds before all services are up and running. | Any failover event might require server admin and application admin interaction. This action could be anything from a node reboot to a node rebuild (not self-healing). | Eliminates the need for dedicated standby resources and for installation of additional software and operating system licensing. | A failover event requires the virtual machine to be restarted, which could take 30 to 60 seconds. Applications protected solely by vSphere HA might not be available during this time. |
| Virtualizing Tier 1 business critical applications can reduce hardware costs by consolidating current WSFC deployments. | SCSI LUN ID limitation. When using RDMs, remember that each presented RDM reserves one LUN ID. There is a maximum of 256 LUN IDs per ESXi host. These can mount up quickly when running multiple WSFC instances on a vSphere host. | vSphere SMP-FT can provide higher levels of availability than are available in most operating system-level clustering solutions today. | Admission control policy requires reserved resources to support host failures in the cluster (% / slot size). |
| Failback is quick and can be performed once the primary server is fixed and put back online. | In a situation where both nodes have failed, recovery time might be increased greatly due to the added complexity of the vSphere layer. | Supports the full range of virtual machine functionality, which in turn leads to maximized resource utilization. DRS and vMotion provide significant flexibility when it comes to virtual machine placement. Full vSphere functionality can be released for the servers (that is, snapshots, vMotion, DRS, Storage vMotion and Storage DRS). | Requires additional configuration to support host isolation response and virtual machine monitoring. |
| WSFC is a supported Microsoft solution, which makes it an obvious choice for Microsoft Applications such as SQL or Exchange. | Many applications do not support Microsoft clustering. Use cases are typically Microsoft Tier 1 applications, such as SQL and Exchange. | vSphere host patching/maintenance can be accomplished without after-hour maintenance and Windows Server or application owner participation. | Reserved capacity and DRS licensing required to facilitate host patching of live systems. |
| DRS can be employed to determine initial virtual machine placement at power-on. | vSphere host patching and maintenance would still have to be done after hours due to the failover outage and could require application owner participation. | Can support a 99.9% availability SLA. | Can only support a 99.9% availability SLA, which could mean up to 10.1 minutes per week of downtime. |
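The SCSI LUN ID limitation listed in the table lends itself to a quick budget check. The sketch below, written in Python, uses the 256 LUN IDs per ESXi host figure quoted in the table and the rule that each presented RDM reserves one LUN ID. The function name, the other_luns parameter, and the example counts are illustrative assumptions.

```python
MAX_LUN_IDS_PER_HOST = 256  # per-host maximum quoted in the table above


def lun_ids_remaining(rdms_per_cluster, other_luns=0):
    """LUN IDs left on a host after RDMs and other presented LUNs are counted.

    rdms_per_cluster: RDM count for each WSFC instance presented to the host.
    other_luns: LUN IDs consumed by datastores or other devices (assumption).
    A negative result means the host limit would be exceeded.
    """
    used = sum(rdms_per_cluster) + other_luns
    return MAX_LUN_IDS_PER_HOST - used


if __name__ == "__main__":
    # Example: 20 WSFC instances with 6 RDMs each, plus 40 datastore LUNs.
    print(lun_ids_remaining([6] * 20, other_luns=40))
```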

Based on what has been discussed so far, you can see there is additional complexity when introducing Microsoft clustering on a vCloud Air Network platform. As such, one should carefully consider all of the business and technical requirements. The next section discusses the process of gathering those business requirements to make an informed recommendation.

Figure 2. Cost Versus Complexity


Establishing Business Requirements

For either the vCloud provider or consumer, the first step in establishing the need to employ Microsoft clustering on the cloud platform is to assess and define the application availability requirements and to understand the impact of downtime on stakeholders, application owners, and most importantly, the end users.

To identify availability requirements for a Microsoft failover cluster, you can use some or all of the following questions. The answers to these questions will help the service provider cloud architect to gather, define, and clarify the deployment goals of the application and services being considered for failover clustering.

  • What applications are considered business critical to the organization’s central purpose? What applications and services do end users require when working?
  • Are there any Service Level Agreements (SLAs) or similar agreements that define service levels for the applications in question?
  • For the service’s end users, what defines a satisfactory level of service for the applications in question?
  • What increments of downtime are considered significant and unacceptable to the business (for example, five seconds, five minutes, or an hour) during peak and non-peak hours? If availability is measured by the customer, how is it measured?

The following table might help establish the requirements for the applications in question.

| Availability | Downtime (Year) | Downtime (Month) | Downtime (Week) |
| --- | --- | --- | --- |
| 90% (1-nine) | 36.5 days/year | 72 hours/month | 16.8 hours/week |
| 99% (2-nines) | 3.65 days/year | 7.20 hours/month | 1.68 hours/week |
| 99.9% (3-nines) | 8.76 hours/year | 43.8 minutes/month | 10.1 minutes/week |
| 99.99% (4-nines) | 52.56 minutes/year | 4.32 minutes/month | 1.01 minutes/week |
| 99.999% (5-nines) | 5.26 minutes/year | 25.9 seconds/month | 6.05 seconds/week |
| 99.9999% (6-nines) | 31.5 seconds/year | 2.59 seconds/month | 0.605 seconds/week |
| 99.99999% (7-nines) | 3.15 seconds/year | 0.259 seconds/month | 0.0605 seconds/week |
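The figures in this table follow from multiplying the length of each period by the unavailable fraction (1 minus availability). The Python sketch below reproduces the calculation, assuming a 365-day year, a 30-day month, and a 7-day week. Most monthly values in the table match a 30-day month; the 43.8 minutes/month entry for 99.9% corresponds to an average month of 365/12 days, whereas a 30-day month gives 43.2 minutes. The function name is hypothetical.

```python
SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,  # 30-day month (assumption, see note above)
    "week": 7 * 24 * 3600,
}


def allowed_downtime(availability_pct):
    """Return permitted downtime in seconds per year, month, and week."""
    unavailable = 1 - availability_pct / 100
    return {period: secs * unavailable for period, secs in SECONDS.items()}


if __name__ == "__main__":
    for pct in (90, 99, 99.9, 99.99, 99.999):
        d = allowed_downtime(pct)
        print(
            f"{pct}%: {d['year'] / 3600:.2f} h/year, "
            f"{d['month'] / 60:.2f} min/month, {d['week'] / 60:.2f} min/week"
        )
```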

Does the cloud consumer have a business requirement for 24-hour, 7-days-a-week availability or is there a working schedule (for example, 9:00 a.m. to 5:00 p.m. on weekdays)? Do the services or applications that are being targeted have the same availability requirements, or are some of them more important than others? Business days, hours of use, and availability requirements can typically be obtained by the service provider from end-user leadership, application owners, and business managers.

For instance, the following table provides a simple business application list along with the end-user requirements for availability and common hours of use. These requirements are important to establish because downtime when an application is not being used, for example overnight, might not negatively impact the application service level agreement.

| Application | Business Days | Hours of Use | Availability Requirements |
| --- | --- | --- | --- |
| Customer Tracking System | 7 Days | 0700-1900 | 99.999% |
| Document Management System | 7 Days | 0600-1800 | 99.999% |
| Microsoft SharePoint (Collaboration) | 7 Days | 0700-1900 | 99.99% |
| Microsoft Exchange (Email and Collaboration) | 7 Days | 24 Hours | 99.999% |
| Microsoft Lync (Collaboration) | 7 Days | 24 Hours | 99.99% |
| Digital Imaging System | 5 Days | 0800-1800 | 99.9% |
| Document Archiving System | 5 Days | 0800-1800 | 99.9% |
| Public Facing Web Infrastructure | 7 Days | 24 Hours | 99.999% |
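Because downtime outside the hours of use might not count against the service level, it can help to express each requirement as a weekly downtime budget within the hours of use. The following Python sketch assumes that availability is measured only across the hours of use, which is an interpretation rather than something the table states; the function names are hypothetical.

```python
def weekly_service_hours(days_per_week, start_hhmm, end_hhmm):
    """Hours of use per week; a '24 Hours' row can pass '0000' and '2400'."""
    start = int(start_hhmm[:2]) + int(start_hhmm[2:]) / 60
    end = int(end_hhmm[:2]) + int(end_hhmm[2:]) / 60
    return days_per_week * (end - start)


def weekly_downtime_budget_minutes(days_per_week, start_hhmm, end_hhmm,
                                   availability_pct):
    """Downtime allowed inside the hours of use, in minutes per week."""
    hours = weekly_service_hours(days_per_week, start_hhmm, end_hhmm)
    return hours * 60 * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Digital Imaging System row: 5 days, 0800-1800, 99.9%.
    print(weekly_downtime_budget_minutes(5, "0800", "1800", 99.9))
    # Customer Tracking System row: 7 days, 0700-1900, 99.999%.
    print(weekly_downtime_budget_minutes(7, "0700", "1900", 99.999))
```

Under this assumption, the Digital Imaging System has a budget of about 3 minutes per week within its 50 hours of use, while the Customer Tracking System has only about 3 seconds.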

It is also important to establish and understand application dependencies. Many of the applications shown in the previous table consist of a number of components including databases, application layer software, web servers, load balancers, and firewalls. In order to achieve the levels of availability required by the business, a number of techniques must be employed by a range of technologies, not only by clustered services.

  • Do the applications in question have variations in load over time or the business cycle (for example, 9:00 a.m. to 5:00 p.m. on weekdays, monthly, or quarterly)?
  • How many vSphere host servers are available on the vCloud platform for failover clustering and what type of storage is available for use in the cluster or clusters?
  • Is having a disaster recovery option for the services or applications important to the cloud consumer’s organization? What type of infrastructure will be available to support the workload at your recovery site? Is your recovery site cold/hot or a regional data center used by other parts of the business? Is any storage replication technology in place? Have you accounted for the clustered application itself? What steps must be taken to ensure the application is accessible to users/customers if failed over to the recovery site?
  • Is it possible for some of the Microsoft clustered nodes to be placed in a separate vCloud Air Network Service Provider site, an adjacent data center or data center zone to provide an option for disaster recovery if a serious problem develops at the primary site?

When asking these questions of your cloud platform customer, also consider that an application that has always been protected with Microsoft clustering in the past does not necessarily have to be in the future. VMware vSphere and the vCloud platform offer several high availability solutions that can be used collectively to support applications where there is a requirement to minimize unplanned downtime. It is important for the provider to examine all options with the consumer and to carefully consider and understand the impact of that decision on the application or service.

Microsoft Cluster Configuration Implementation Options

When implementing Microsoft clusters in a vSphere-based vCloud environment, three primary architectural options exist. The choice of the most appropriate design will depend on your specific design use case. For instance, if you are looking for a solution to provide high availability in case of a single hardware failure (N+1), hosting both cluster nodes on the same physical host will fail to meet this basic requirement.

In this section, we examine three options and analyze the advantages and drawbacks of each.

Option 1: Cluster-In-A-Box (CIB)

Cluster-In-A-Box (CIB) is a design where the two clustered virtual machine nodes run on the same vSphere ESXi host. In this scenario, the shared disks and quorum can be either local or remote and are shared between the virtual machines within the single host. For instance, you can use VMDKs or RDMs (with their SCSI bus set to virtual mode). The use of RDMs can be beneficial if you later decide to migrate one of the virtual machines to another host to create a Cluster-Across-Boxes (CAB) design (described in the next section).

The Cluster-In-A-Box option would most typically be used in test or development environments, because this solution offers no high availability in the event of a host hardware failure.

For CIB deployments, create VM-to-VM affinity rules to keep the cluster nodes together. VMware vSphere Distributed Resource Scheduler (DRS) also requires additional host-to-VM rule groups because, depending on the version of vSphere, HA does not consider the VM-to-VM rules when restarting virtual machines in the event of hardware failure. For CIB deployments, the virtual machines must be in the same virtual machine DRS group, which must be assigned to a host DRS group containing two hosts using a “must run on hosts in group” rule.

Figure 3. Option 1 Design Cluster-In-A-Box (CIB)


Option 2: Cluster-Across-Boxes (CAB)

Cluster-Across-Boxes (CAB) is the most common scenario and describes the design where a WSFC is employed on two virtual machines that are running across two different physical ESXi hosts. The primary advantage here is that this protects the environment against a hardware failure of a single physical server (N+1). In this design scenario, VMware recommends physical RDMs as the disk choice. The shared storage and quorum should be located on Fibre Channel SAN storage or be available through an in-guest iSCSI initiator.

For CAB deployments, create VM-to-VM anti-affinity rules to keep the cluster nodes apart. These should be “must run” rules because there is no point in having the two nodes running on the same ESXi host. Again, account for DRS. You will need additional host-to-VM rule groups, because HA does not consider the VM-to-VM rules when restarting virtual machines in the event of hardware failure. For CAB deployments, the VMs must be in different VM DRS groups, and each VM DRS group must be assigned to a different host DRS group using a “must run on hosts in group” rule.
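
The intent behind the affinity and anti-affinity rules can be sketched in a few lines of Python. This is only a model of the placement logic, not a vSphere API call: the VM and host names, the `placement` dictionary, and both function names are hypothetical and exist purely for illustration.

```python
def violates_rule(placement, cluster_nodes, design):
    """Check a VM-to-host placement against the rule implied by the design.

    design "CIB": affinity, all nodes must run on the same host.
    design "CAB": anti-affinity, every node must run on a different host.
    Returns True when the placement breaks the rule.
    """
    hosts = [placement[vm] for vm in cluster_nodes]
    if design == "CIB":
        return len(set(hosts)) != 1
    if design == "CAB":
        return len(set(hosts)) != len(hosts)
    raise ValueError(f"unknown design: {design}")


def survives_host_failure(placement, cluster_nodes, failed_host):
    """True if at least one cluster node is still running after a host fails."""
    return any(placement[vm] != failed_host for vm in cluster_nodes)


nodes = ["wsfc-node-1", "wsfc-node-2"]  # hypothetical VM names

# CAB: the two nodes are kept apart on different ESXi hosts.
cab_placement = {"wsfc-node-1": "esxi-01", "wsfc-node-2": "esxi-02"}
# CIB: both nodes are kept together on one ESXi host.
cib_placement = {"wsfc-node-1": "esxi-01", "wsfc-node-2": "esxi-01"}

print(violates_rule(cab_placement, nodes, "CAB"))               # False
print(survives_host_failure(cab_placement, nodes, "esxi-01"))   # True
print(survives_host_failure(cib_placement, nodes, "esxi-01"))   # False
```

The last two lines show why the rule type matters: with the CAB placement one node remains after the loss of a host, whereas the CIB placement loses both nodes at once.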

Figure 4. Option 2 Design Cluster-Across-Boxes (CAB)


Option 3: Physical and Virtual Machine

The final typical design scenario is Physical and Virtual Machine (Physical and N+1 VM). This cluster design allows for the primary (active) node of a WSFC cluster to run natively on a bare metal physical server, while the secondary (passive) node runs in a virtual machine. This model can be used to migrate from a physical two-node deployment to a virtualized environment, or as a means of providing N+1 availability with the purchase of a single physical server. With this design, if you need to run on the secondary node during primary business hours, performance-based SLAs might be impacted. However, when you consider that typically a WSFC only runs on the primary node and is only failed over to the secondary node for short periods of time (and outside of business hours for maintenance), this might be a viable option for some use cases. The Physical and N+1 virtual machine model does not require any special affinity rules because one of the nodes is virtual and the other is physical.

Figure 5. Option 3 Design Physical and Virtual Machine


VMware recommends physical RDMs as the disk option. Shared storage and quorum disks must be located on Fibre Channel SAN or iSCSI storage or be presented through an in-guest iSCSI initiator. Note that RDMs are not supported on VMware Virtual SAN. Refer to http://kb.vmware.com/kb/1037959 for further details.
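
As a compact recap of the three options, the following Python sketch records the characteristics described in this section and filters them by two simple requirements. The dictionary keys, the `candidate_designs` function, and its parameters are illustrative assumptions, not part of any VMware tooling, and a real design decision involves many more factors.

```python
# Characteristics of each option as described in this section.
DESIGN_OPTIONS = {
    "CIB": {
        "survives_single_host_failure": False,
        "needs_physical_server": False,
        "shared_disks": "VMDKs or virtual-mode RDMs",
        "drs_rule": "VM-to-VM affinity",
    },
    "CAB": {
        "survives_single_host_failure": True,
        "needs_physical_server": False,
        "shared_disks": "physical RDMs",
        "drs_rule": "VM-to-VM anti-affinity",
    },
    "Physical and VM": {
        "survives_single_host_failure": True,
        "needs_physical_server": True,
        "shared_disks": "physical RDMs",
        "drs_rule": "none required",
    },
}


def candidate_designs(require_host_failure_protection, physical_server_available):
    """Return the option names that satisfy two basic requirements."""
    matches = []
    for name, traits in DESIGN_OPTIONS.items():
        if require_host_failure_protection and not traits["survives_single_host_failure"]:
            continue  # option cannot tolerate the loss of a host
        if traits["needs_physical_server"] and not physical_server_available:
            continue  # option depends on a bare metal node
        matches.append(name)
    return matches


# A production workload with no bare metal server available.
print(candidate_designs(True, False))
```

For a workload that must survive a single host failure and has no physical server available, only the CAB design remains, which matches the observation that CAB is the most common scenario.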

Conclusion

Design factors combine to dictate the outcome of each design decision. If your customer is looking at virtualizing physical Microsoft Windows clusters on vSphere, you must first assess the impact of using WSFC in your design. Consider the impact on availability, manageability, performance, recoverability, and security.

The use of Microsoft clustering on a vCloud Air Network Platform will add new design requirements, constraints, and risks to the environment. It is crucial that all design factors and their impact on the architecture be addressed at the design stage.

Migrating from physical to virtual cloud platform instances of WSFC offers a significant cost reduction in required hardware, and if architected correctly, can provide the performance and levels of availability to support the most demanding application and the strictest of SLAs. However, it is also important to evaluate other solutions, such as the native high availability features of vSphere, which can be implemented without the high operational costs associated with WSFC. These alternatives can often provide levels of availability that meet the SLAs for the majority of your consumer’s business applications and provide a good alternative to Microsoft clustered implementations, particularly where application-level availability can be used alongside established vSphere technologies.

The decision to use WSFC on a vCloud Air Network Platform should be driven by the workload availability requirements of the end user’s application or service, as defined by the customer or application owner. These requirements ultimately drive the decision behind your application’s availability strategy.

To meet high availability and disaster recovery requirements for cloud consumers using WSFC, it is important for the service provider to:

  • Determine high availability and disaster recovery needs of the applications in question.
  • Examine design requirements, constraints and risks for your customer-specific use cases.
  • Develop a WSFC design strategy for the business and overall solution architecture that can be replicated for different applications within the infrastructure.
  • Choose an appropriate WSFC design and size, and configure the infrastructure components to meet the application’s performance and availability requirements.
  • Follow VMware’s proven technical guidance for WSFC on a vSphere platform.

Reference Documents

  • Microsoft Clustering on VMware vSphere: Guidelines for supported configurations (1037959): http://kb.vmware.com/kb/1037959
  • MSCS support enhancements in vSphere 5.5 (2052238): http://kb.vmware.com/kb/2052238
  • Microsoft Cluster Service (MSCS) support on ESXi/ESX (1004617): http://kb.vmware.com/kb/1004617
  • Windows Server Failover Clustering (WSFC) with SQL Server: http://technet.microsoft.com/en-us/library/hh270278.aspx
  • Setup for Failover Clustering and Microsoft Cluster Service: https://pubs.vmware.com/vsphere-60/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-60-setup-mscs.pdf