Home > Blogs > vCloud Architecture Toolkit (vCAT) Blog

Introducing VMware vCloud Architecture Toolkit for Service Providers

Just in time for VMworld Europe, we are pleased to announce the first release of the VMware vCloud® Architecture Toolkit for Service Providers (vCAT-SP). vCAT-SP is a set of reference documents and architectural notes designed to help VMware vCloud® Air™ Network Partners construct cloud platforms and service offerings that leverage current technologies, recommended practices, and innovative tools deployed in real-world cloud service provider environments. Written by the vCloud Air Network Global Cloud Practice Architecture team and experts across VMware, vCAT-SP provides cloud service provider IT managers and architects with recommended designs and support solutions that have been attested, validated, and optimized. They represent the most efficient examples to help you make the right choices for your business.

Solution Stacks

VMware vCAT-SP is supported by the VMware Cloud Service Provider Solution Stacks. These are recommended solution stacks aligned to common service provider delivery models: Hosting, Managed Private Cloud, and Public Cloud. The solution stacks provide recommendations on which VMware products should be included as part of a VMware powered Hosting, Managed Private Cloud, or Public Cloud platform.

You can download the vCAT-SP – Public Cloud Solution Stack

The Hosting and Private Cloud Solution Stacks will be released shortly.

Service Definitions

The service definition documents will help vCloud Air Network service providers define their cloud service requirements across compliance, SLAs and OLAs, recoverability and business continuity, integration requirements for OSS and BSS systems, and service offering use cases. The documents provide example service definitions that can be leveraged as a starting point.

The initial release of vCAT-SP will provide a service definition example document for Public Cloud. Additional service definition example documents for Hosting and Managed Private Cloud will be provided at a later date.

Architecture Domains

vCAT-SP has been broken down into seven architecture domains:

  • Virtualization Compute – Documents that detail specific design considerations for the virtualization platform across cloud service offerings for service providers.
  • Network and Security – Documents that detail network and security specific use-cases and design considerations for service providers.
  • Storage and Availability – Documents detailing design considerations for storage platforms and availability solutions for service providers.
  • Cloud Operations and Management – Documents detailing the design considerations and use-cases around operations management of cloud platforms and services.
  • Cloud Automation and Orchestration – Documents detailing the design considerations and use-cases for automation and orchestration within cloud platforms.
  • Unified Presentation – Documents detailing design considerations and use-cases around the presentation of cloud services through UIs and APIs, and the available options for service providers.
  • Hybridity – Documents detailing design considerations and use-cases for hybridity, ranging from hybrid application architectures to hybrid network design considerations.

Each domain contains architecture documents that cover the core platform architecture design considerations, key service provider use cases, and operational considerations.

Solution Architecture Examples

Solution Architecture Examples are reference solution designs that take the architecture domains into consideration to formulate a holistic cloud solution. These include design decisions driven by key requirements and constraints specific to a particular deployment scenario. We are aiming to provide solution architecture examples for Hosting, Managed Private Cloud, and Public Cloud platforms initially.

Solutions and Services Examples

Solutions and Services Examples are reference architecture blueprints detailing how to implement a particular cross-functional service offering, such as DR as a Service, which requires a core cloud platform solution and configuration across the network and security, storage and availability, and compute virtualization domains. This area is where we will publish additional value-add service offerings that cloud service providers can plug in to their core architectures.

More Information

The initial release of the vCAT-SP zip file will be available for download during VMworld Barcelona from: https://www.vmware.com/go/vcat

For any feedback or requests for additional materials, please contact the team at: vcat-sp@vmware.com



vCloud Director with Virtual SAN Sample Use Case

This brief, high-level implementation example provides a sample use case for using VMware Virtual SAN in a vCloud Director for Service Providers environment.

As outlined in the illustration below, each Provider Virtual Data Center / Resource Cluster has been configured with a Virtual SAN datastore that meets the specific capability requirements set out by the Service Level Agreement (SLA) for that tier of service.

In this example, the service provider is deploying three tiers of offerings: Gold, Silver, and Bronze. The compute consolidation ratio and the Virtual SAN capability, based on the disk group configuration and storage policy, define how the offering will perform for a consumer. In addition, although not shown in the configuration below, NIOC and QoS are being employed by the service provider to ensure an appropriate balance of network resources is assigned based on the tier of service. This requires the configuration of three separate tiered VLANs for Virtual SAN traffic (Gold, Silver, and Bronze), with traffic priorities configured accordingly.

The exact disk configuration will vary depending on hardware manufacturer and provider SLAs.
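As a purely illustrative sketch, the per-tier capabilities might be expressed as Virtual SAN storage policy rule sets along the following lines. The attribute names (hostFailuresToTolerate, stripeWidth) are Virtual SAN policy rules; the values are hypothetical and would be derived from each tier's SLA rather than taken as guidance:

GOLD_RULES='(("hostFailuresToTolerate" i2) ("stripeWidth" i2))'     # higher resilience and performance
SILVER_RULES='(("hostFailuresToTolerate" i1) ("stripeWidth" i2))'   # standard resilience, wider striping
BRONZE_RULES='(("hostFailuresToTolerate" i1) ("stripeWidth" i1))'   # baseline tier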

Logical Design Overview

Figure: Logical Design Overview (Virtual SAN with vCloud Director)


The full VMware technology solution stack is illustrated below.

Figure: Full VMware Technology Solution Stack (Virtual SAN with vCloud Director)

The above figure shows how the solution is constructed on VMware technologies. The core vSphere platform provides the storage capability through Virtual SAN, which in turn is abstracted by vCloud Director. The Virtual SAN disk group configuration across the hosts, along with the storage policy configured at the vSphere level, defines the performance and capacity capabilities of the distributed datastore, which in turn is used to define the SLAs for that tier of the cloud offering.

As illustrated above, the vSphere resources are abstracted by vCloud Director into a Provider Virtual Data Center (PvDC). These resources are then further carved up into individual Virtual Data Centers (vDCs) assigned to organizational tenants. The overall result is that the vApps residing within the organizational vDCs receive the Virtual SAN storage capability defined by the service provider.

Typically, but outside the scope of this discussion, tiered service offerings are defined by more than just storage capability. vCPU consolidation ratios, levels of guaranteed memory and network resources, backups, and so on are all employed by a service provider to define the SLAs.

As I develop this use case with the service providers I am working with, I will update this article further.

Windows Failover Clusters for vCloud Air Network Service Providers

Designing Microsoft Windows Server Failover Clusters for vCloud Air Network Service Providers


In the modern dynamic business environment, uptime of virtualized business critical applications (vBCAs) and fast recovery from system failures are vital to meeting business service-level agreements (SLAs) for vCloud Air Network Service Providers. Cloud service providers must be prepared for business disruptions and be able to minimize their impact on their consumers.

The “being prepared” approach to providing application high availability is aimed at reducing risk of revenue losses, maintaining compliance, and meeting customer agreed SLAs. Designing and deploying applications on Microsoft Windows Server Failover Clusters (WSFC), and having a highly available infrastructure, can help organizations to meet these challenges.

The following figure provides a simple overview of a Microsoft Windows Server Failover Cluster running on ESXi hosts in a VMware vSphere infrastructure.

Figure 1. Microsoft Windows Cluster Service on VMware ESXi Hosts


Microsoft Cluster Service (MSCS) has been available in Microsoft server products since the release of Microsoft Windows NT Server, Enterprise Edition. A Microsoft server failover cluster is defined as a group of independently running servers that work together and co-exist to increase the availability of the applications and services they provide. The clustered servers, generally referred to as nodes, are connected by virtual and physical networking and by the clustering software. If one of the cluster compute nodes fails, the Microsoft cluster provides the service through a failover process with minimal disruption to the consumer.

Since the release of Microsoft Windows Server 2008, the Microsoft clustering service has been renamed Windows Server Failover Clustering (WSFC), with a number of significant enhancements.

Due to additional cost and increased complexity, Microsoft clustering technology is typically used by cloud service providers to provide high availability to Tier 1 applications such as Microsoft Exchange mailbox servers or highly available database services for Microsoft SQL Server. However, it can also be used to protect other services, such as a highly available Windows Dynamic Host Configuration Protocol (DHCP) Server or file and print services.

Windows Server Failover Cluster technologies protect services and the application layer against the following types of system failure:

  • Application and service failures, which can affect application software running on the nodes and the essential services they provide.
  • Hardware failures, which affect hardware components such as CPUs, drives, memory, network adapters, and power supplies.
  • Physical site failures in multisite organizations, which can be caused by natural disasters, power outages, or connectivity outages.

The decision to implement a Microsoft clustering solution on top of a vCloud Air Network platform should not be taken without the appropriate consideration and certainly not before addressing all of the design options and business requirements. This implementation adds environmental constraints that might limit other vCloud benefits such as mobility, flexibility, and manageability. It also adds a layer of complexity to the vCloud Air Network platform.

The aim of this vCloud Architecture Toolkit for Service Providers (vCAT-SP) technical blog is to address some of the most critical design considerations of running WSFC on top of a vCloud Air Network Service Provider platform. It is not intended to be a step-by-step installation and configuration guide for WSFC. See instead the VMware Setup for Failover Clustering and Microsoft Cluster Service document.

The customer or provider decision to employ Microsoft clustering in a vCloud infrastructure should not be taken lightly. If VMware vSphere High Availability, VMware vSphere Distributed Resource Scheduler, and VMware vSphere SMP-FT can provide a high enough level of availability to meet the application SLAs, why reduce flexibility by implementing a Microsoft clustered application? Having said this, vSphere HA cannot be considered a replacement for WSFC, because vSphere HA is not application-aware. vSphere HA focuses on detecting VMware ESXi host failures from the network and can, if configured to do so, verify whether a virtual machine is still running by checking the heartbeat provided through VMware Tools. Microsoft Cluster Services is application-aware and is aimed at high-end, high service availability applications, such as Microsoft Exchange mailbox servers or Microsoft SQL Server.

Also, consider if other alternatives, such as Database Log Shipping, Mirroring, or AlwaysOn Availability Groups for Microsoft SQL Server could meet the availability requirement of the applications. For Microsoft Exchange, technologies such as Database Availability Groups (DAGs) make single copy cluster technology less of a necessity in today’s data center.

Feature Comparison

The decision to use any high availability technology should be defined and driven by the cloud consumer’s requirements for the application or service in question. Inevitably, this depends on the application and whether it is cluster-aware. The majority of common applications are not Microsoft clustering-aware.

As with all design decisions, the architect’s skill in collecting information, correlating it with a solid design, and understanding the trade-offs of different design decisions plays a key role in a successful architecture and implementation. However, a good design is not unnecessarily complex and includes rationales for design decisions. A good design decision about the approach taken to availability balances the organization’s requirements with a robust technical platform. It also involves key stakeholders and the customer’s subject matter experts in every aspect of the design, delivery, testing, and handover.

The following table is not intended to demonstrate a preferred choice to meet your specific application availability requirements, but rather to assist in carrying out an assessment of the advantages, drawbacks, similarities, and differences in the technologies being proposed. In reality, most vCloud Air Network data centers use a combination of all these technologies, in a combined manner and independently, to provide different applications and services with the highest level of availability possible, while maintaining stability, performance, and operational support from vendors.

Table 1. Advantages and Drawbacks of Microsoft Clustering and VMware Availability Technologies

Advantages of Microsoft Clustering on vSphere | Drawbacks of Microsoft Clustering on vSphere | Advantages of VMware Availability Technologies | Drawbacks of VMware Availability Technologies
Supports application-level awareness. A WSFC application or service will survive a single node operating system failure. While VMware clusters that are using vSphere HA can use virtual machine failure monitoring to provide a certain level of protection against the failure of the guest operating system, you do not have the protection of the application running on the guest operating system, which is provided with WSFC. | Additional cost to deploy and maintain the redundant nodes from an operational maintenance perspective. | Reduced complexity and lower infrastructure implementation effort. vSphere HA and vSphere SMP-FT are extremely simple to enable, configure, and manage. Far more so than a WSFC operating system-level cluster. | If vSphere HA fails to recognize a system failure, human intervention is required. With vSphere 5.5, App HA can potentially work to overcome some of the vSphere HA shortcomings by working with VMware vRealize™ Hyperic to provide application high availability within the vSphere environment. However, this might require additional application development and implementation efforts to support the application-awareness elements. In addition, there is a continuing management and operational overhead of this solution to take into account. The implementation and design of App HA is beyond the scope of this document. For more information, refer to the VMware App HA documentation page at https://www.vmware.com/support/pubs/appha-pubs.html. Note that with the release of vSphere 6, App HA is now End of Availability (EOA). See http://kb.vmware.com/kb/2108249 for further details.
WSFC minimizes the downtime of applications that should remain available while maintenance patching is performed on the redundant node. A short outage would be required during the obligatory failover event. As a result, WSFC can potentially reduce patching downtime. | Potentially added environment costs for passive node virtual machines. That is, wasted hardware resources utilized on hosts for passive WSFC cluster nodes. | Reduced costs because no redundant node resources are required for vSphere HA. Overall, vSphere HA can allow for higher levels of utilization within an ESXi host cluster than using operating system-level clustering. | You are not able to use vSphere HA or SMP-FT to fail over between systems for performing scheduled patching of the guest operating system or application.
If architected appropriately with vSphere, a virtual implementation of clustered business critical applications can meet the demands for a Tier 1 application that cannot tolerate any periods of downtime. | Reduced use of virtual machine functionality (there is no VMware vMotion®, DRS, VMware Storage vMotion, VMware vSphere Storage DRS™, or snapshots). This also means no snapshot-based backups can be utilized for full virtual machine backups. While other options are available for backups, a cluster node or full cluster loss could require a full rebuild (extending RTO into days and not hours). | vSphere HA and vSphere SMP-FT do not require any specific license versions of the guest operating system or application in order to make use of their benefits. | vSphere SMP-FT does not protect you against a guest operating system failure. A failure of the operating system in the primary virtual machine will typically result in a failure of the operating system in the secondary virtual machine.
WSFC permits an automated response to either a failed server or application. Typically, no human intervention is required to ensure applications and services continue to run. | Added implementation and operational management complexity for the application and vSphere environment. This requires more experienced application administrators, vSphere, storage, and network administrators to support the cluster services. | Application-agnostic. vSphere HA and SMP-FT are not application-aware and do not require any application layer support to protect the virtual machine and its workloads, unlike operating system clustering, which requires application-level support. | vSphere SMP-FT does not protect you against an application failure. A failure of the application or service on the primary virtual machine will typically result in a failure of that application in the secondary virtual machine.
Potentially faster recovery during failover events than with vSphere HA. Virtual machine reboots might take 30 to 60 seconds before all services are up and running. | Any failover event might require server admin and application admin interaction. This action could be anything from a node reboot to a node rebuild (not self-healing). | Eliminates the need for dedicated standby resources and for installation of additional software and operating system licensing. | A failover event requires the virtual machine to be restarted, which could take 30 to 60 seconds. Applications protected solely by vSphere HA might not be available during this time.
Virtualizing Tier 1 business critical applications can reduce hardware costs by consolidating current WSFC deployments. | SCSI LUN ID limitation. When using RDMs, remember that each presented RDM reserves one LUN ID. There is a maximum of 256 LUN IDs per ESXi host. These can mount up quickly when running multiple WSFC instances on a vSphere host. | vSphere SMP-FT can provide higher levels of availability than are available in most operating system-level clustering solutions today. | Admission control policy requires reserved resources to support host failures in the cluster (% / slot size).
Failback is quick and can be performed once the primary server is fixed and put back online. | In a situation where both nodes have failed, recovery time might be increased greatly due to the added complexity of the vSphere layer. | Supports the full range of virtual machine functionality, which in turn leads to maximized resource utilization. DRS and vMotion provide significant flexibility when it comes to virtual machine placement. Full vSphere functionality can be realized for the servers (that is, snapshots, vMotion, DRS, Storage vMotion, and Storage DRS). | Requires additional configuration to support host isolation response and virtual machine monitoring.
WSFC is a supported Microsoft solution, which makes it an obvious choice for Microsoft applications such as SQL Server or Exchange. | Many applications do not support Microsoft clustering. Use cases are typically Microsoft Tier 1 applications, such as SQL Server and Exchange. | vSphere host patching/maintenance can be accomplished without after-hours maintenance and Windows Server or application owner participation. | Reserved capacity and DRS licensing required to facilitate host patching of live systems.
DRS can be employed to determine initial virtual machine placement at power-on. | vSphere host patching and maintenance would still have to be done after hours due to the failover outage and could require application owner participation. | Can support a 99.9% availability SLA. | Can only support a 99.9% availability SLA, which could mean up to 10.1 minutes per week of downtime.

Based on what has been discussed so far, you can see there is additional complexity when introducing Microsoft clustering on a vCloud Air Network platform. As such, one should carefully consider all of the business and technical requirements. The next section discusses the process of gathering those business requirements to make an informed recommendation.

Figure 2. Cost Versus Complexity


Establishing Business Requirements

For either the vCloud provider or consumer, the first step in establishing the need to employ Microsoft clustering on the cloud platform is to assess and define the application availability requirements and to understand the impact of downtime on stakeholders, application owners, and most importantly, the end users.

To identify availability requirements for a Microsoft failover cluster, you can use some or all of the following questions. The answers to these questions will help the service provider cloud architect gather, define, and clarify the deployment goals of the application and services being considered for failover clustering.

  • What applications are considered business critical to the organization’s central purpose? What applications and services do end users require when working?
  • Are there any Service Level Agreements (SLAs) or similar agreements that define service levels for the applications in question?
  • For the service's end users, what defines a satisfactory level of service for the applications in question?
  • What increments of downtime are considered significant and unacceptable to the business (for example, five seconds, five minutes, or an hour) during peak and non-peak hours? If availability is measured by the customer, how is it measured?

The following table might help establish the requirements for the applications in question.

Availability Downtime (Year) Downtime (Month) Downtime (Week)
90% (1-nine) 36.5 days/year 72 hours/month 16.8 hours/week
99% (2-nines) 3.65 days/year 7.20 hours/month 1.68 hours/week
99.9% (3-nines) 8.76 hours/year 43.8 minutes/month 10.1 minutes/week
99.99% (4-nines) 52.56 minutes/year 4.32 minutes/month 1.01 minutes/week
99.999% (5-nines) 5.26 minutes/year 25.9 seconds/month 6.05 seconds/week
99.9999% (6-nines) 31.5 seconds/year 2.59 seconds/month 0.605 seconds/week
99.99999% (7-nines) 3.15 seconds/year 0.259 seconds/month 0.0605 seconds/week
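As a quick sanity check, any of the weekly figures can be reproduced directly from the availability percentage; for example, for the 3-nines row:

$ awk 'BEGIN { a = 99.9; printf "%.1f minutes/week\n", (1 - a/100) * 7 * 24 * 60 }'
10.1 minutes/week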

Does the cloud consumer have a business requirement for 24-hour, 7-days-a-week availability or is there a working schedule (for example, 9:00 a.m. to 5:00 p.m. on weekdays)? Do the services or applications that are being targeted have the same availability requirements, or are some of them more important than others? Business days, hours of use, and availability requirements can typically be obtained by the service provider from end-user leadership, application owners, and business managers.

For instance, the following table provides a simple business application list along with the end-user requirements for availability and common hours of use. These requirements are important to establish because downtime when an application is not being used, for example overnight, might not negatively impact the application service level agreement.

Application Business Days Hours of Use Availability Requirements
Customer Tracking System 7 Days 0700-1900 99.999%
Document Management System 7 Days 0600-1800 99.999%
Microsoft SharePoint (Collaboration) 7 Days 0700-1900 99.99%
Microsoft Exchange (Email and Collaboration) 7 Days 24 Hours 99.999%
Microsoft Lync (Collaboration) 7 Days 24 Hours 99.99%
Digital Imaging System 5 Days 0800-1800 99.9%
Document Archiving System 5 Days 0800-1800 99.9%
Public Facing Web Infrastructure 7 Days 24 Hours 99.999%

It is also important to establish and understand application dependencies. Many of the applications shown in the previous table consist of a number of components including databases, application layer software, web servers, load balancers, and firewalls. In order to achieve the levels of availability required by the business, a number of techniques must be employed by a range of technologies, not only by clustered services.

  • Do the applications in question have variations in load over time or the business cycle (for example, 9:00 a.m. to 5:00 p.m. on weekdays, monthly, or quarterly)?
  • How many vSphere host servers are available on the vCloud platform for failover clustering and what type of storage is available for use in the cluster or clusters?
  • Is having a disaster recovery option for the services or applications important to the cloud consumer’s organization? What type of infrastructure will be available to support the workload at your recovery site? Is your recovery site cold/hot or a regional data center used by other parts of the business? Is any storage replication technology in place? Have you accounted for the clustered application itself? What steps must be taken to ensure the application is accessible to users/customers if failed over to the recovery site?
  • Is it possible for some of the Microsoft clustered nodes to be placed in a separate vCloud Air Network Service Provider site, an adjacent data center or data center zone to provide an option for disaster recovery if a serious problem develops at the primary site?

When asking these questions of your cloud platform customer, also consider that simply because an application has always been protected with Microsoft clustering in the past does not mean it always has to be in the future. VMware vSphere and the vCloud platform offer several high availability solutions that can be used collectively to support applications where there is a requirement to minimize unplanned downtime. It is important for the provider to examine all options with the consumer and carefully consider and understand the impact of that decision on the application or service.

Microsoft Cluster Configuration Implementation Options

When implementing Microsoft clusters in a vSphere-based vCloud environment, three primary architectural options exist. The choice of the most appropriate design will depend on your specific design use case. For instance, if you are looking for a solution to provide high availability in case of a single hardware failure (N+1), hosting both cluster nodes on the same physical host will fail to meet this basic requirement.

In this section, we examine three options and analyze the advantages and drawbacks of each.

Option 1: Cluster-In-A-Box (CIB)

Option 1 is Cluster-In-A-Box (CIB). This is a design where the two clustered virtual machine nodes run on the same vSphere ESXi host. In this scenario, the shared disks and quorum disk can be VMDKs or local or remote RDMs (with the SCSI bus sharing set to virtual mode), and they are shared between the virtual machines within the single host. The use of RDMs can be beneficial if you later decide to migrate one of the virtual machines to another host to create a Cluster-Across-Boxes (CAB) design (described in the next section).

The Cluster-In-A-Box option would most typically be used in test or development environments, because this solution offers no high availability in the event of a host hardware failure.

For CIB deployments, create VM-to-VM affinity rules to keep the cluster nodes together. Also account for VMware vSphere Distributed Resource Scheduler (DRS): additional host-to-VM rule groups are required because, depending on the version of vSphere, HA does not consider the VM-to-VM rules when restarting virtual machines in the event of hardware failure. For CIB deployments, the virtual machines must be in the same virtual machine DRS group, which must be assigned to a host DRS group containing two hosts using a "must run on hosts in group" rule.

Figure 3. Option 1 Design Cluster-In-A-Box (CIB)


Option 2: Cluster–Across-Boxes (CAB)

Cluster-Across-Boxes (CAB) is the most common scenario and describes the design where a WSFC is deployed on two virtual machines running on two different physical ESXi hosts. The primary advantage is that this protects the environment against the hardware failure of a single physical server (N+1). In this design scenario, VMware recommends physical RDMs as the disk choice. The shared storage and quorum disk should be located on Fibre Channel SAN storage or be available through an in-guest iSCSI initiator.

For CAB deployments, create VM-to-VM anti-affinity rules to keep the nodes apart. These should be "must run" rules because there is no point in having the two nodes running on the same ESXi host. Again, account for DRS. You will need additional host-to-VM rule groups, because HA does not consider the VM-to-VM rules when restarting virtual machines in the event of hardware failure. For CAB deployments, the VMs must be in different VM DRS groups, and these must be assigned to different host DRS groups using a "must run on hosts in group" rule.

Figure 4. Option 2 Design Cluster–Across-Boxes (CAB)


Option 3: Physical and Virtual Machine

The final typical design scenario is Physical and Virtual Machine (Physical and N+1 VM). This cluster design allows for the primary (active) node of a WSFC cluster to run natively on a bare metal physical server, while the secondary (passive) node runs in a virtual machine. This model can be used to migrate from a physical two-node deployment to a virtualized environment, or as a means of providing N+1 availability with the purchase of a single physical server. With this design, if you need to run on the secondary node during primary business hours, performance-based SLAs might be impacted. However, when you consider that typically a WSFC only runs on the primary node and is only failed over to the secondary node for short periods of time (and outside of business hours for maintenance), this might be a viable option for some use cases. The Physical and N+1 virtual machine model does not require any special affinity rules because one of the nodes is virtual and the other is physical.

Figure 5. Option 3 Design Physical and Virtual Machine


VMware recommends physical RDMs as the disk option. Shared storage and quorum disks must be located on Fibre Channel SAN or iSCSI storage or be presented through an in-guest iSCSI initiator. Note that RDMs are not supported by VMware Virtual SAN. Refer to http://kb.vmware.com/kb/1037959 for further details.


Design factors are the components that, combined, dictate the outcome of each design decision. If your customer is looking at virtualizing physical Microsoft Windows clusters on vSphere, you must first assess the impact of using WSFC in your design. Consider the impact on availability, manageability, performance, recoverability, and security.

The use of Microsoft clustering on a vCloud Air Network Platform will add new design requirements, constraints, and risks to the environment. It is crucial that all design factors and their impact on the architecture be addressed at the design stage.

Migrating from physical to virtual cloud platform instances of WSFC offers a significant cost reduction in required hardware, and if architected correctly, can provide the performance and levels of availability to support the most demanding application and the strictest of SLAs. However, it is also important to evaluate other solutions, such as the native high availability features of vSphere, which can be implemented without the high operational costs associated with WSFC. These alternatives can often provide levels of availability that meet the SLAs for the majority of your consumer’s business applications and provide a good alternative to Microsoft clustered implementations, particularly where application-level availability can be used alongside established vSphere technologies.

The decision to use WSFC on a vCloud Air Network platform should be driven by the workload availability requirements of the end user's application or service, as defined by the customer or application owner. These requirements ultimately drive the decision behind your application availability strategy.

To meet high availability and disaster recovery requirements for cloud consumers using WSFC, it is important for the service provider to:

  • Determine high availability and disaster recovery needs of the applications in question.
  • Examine design requirements, constraints and risks for your customer-specific use cases.
  • Develop a WSFC design strategy for the business and overall solution architecture that can be replicated for different applications within the infrastructure.
  • Choose an appropriate WSFC design and size, and configure the infrastructure components to meet the application's performance and availability requirements.
  • Follow VMware’s proven technical guidance for WSFC on a vSphere platform.

Reference Documents

Description URL
Microsoft Clustering on VMware vSphere: Guidelines for supported configurations (1037959) http://kb.vmware.com/kb/1037959
MSCS support enhancements in vSphere 5.5 (2052238) http://kb.vmware.com/kb/2052238
Microsoft Cluster Service (MSCS) support on ESXi/ESX (1004617) http://kb.vmware.com/kb/1004617
Windows Server Failover Clustering (WSFC) with SQL Server http://technet.microsoft.com/en-us/library/hh270278.aspx
Setup for Failover Clustering and Microsoft Cluster Service https://pubs.vmware.com/vsphere-60/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-60-setup-mscs.pdf



vCloud Director for Service Providers (VCD-SP) and RabbitMQ Security

Let us start with two questions: what is RabbitMQ, and how does it fit into vCloud Director for Service Providers (VCD-SP)?

RabbitMQ provides robust messaging for applications, in particular vCloud Director for Service Providers (VCD-SP). Messaging describes the sending and receiving of data (in the form of messages) between systems. Messages are exchanged between programs or applications, similar to the way people communicate by email, but with selectable guarantees on delivery, speed, security, and the absence of spam.

A messaging infrastructure (also known as message-oriented middleware or an enterprise service bus) makes it easier for developers to create complex applications by decoupling the individual program components. Rather than communicating directly, the components exchange data through the messaging infrastructure. The components need know nothing about each other's status, availability, or implementation, which allows them to be distributed over heterogeneous platforms and turned off and on as required.

In a vCloud Director for Service Providers deployment, VCD-SP uses the open standard AMQP protocol to publish messages associated with Blocking Tasks or Notifications. AMQP is the wire protocol natively understood by RabbitMQ and many similar messaging systems. It defines the wire format of messages, as well as the operational details of how messages are published and consumed. VCD-SP also uses AMQP to communicate with extension services (see http://goo.gl/xZ9gkL). vCloud Director for Service Providers API extensions are implemented as services that consume API requests from a RabbitMQ queue. The API request (an HTTP request) is serialized and published as an AMQP message. The API implementation consumes the message, performs the business logic, and then replies with an AMQP message. In order to publish and consume messages, you need to configure your RabbitMQ exchange and queues.


A RabbitMQ server, or 'broker', runs within the vCloud Director for Service Providers network environment; for example, it can be deployed into the underlying vSphere installation as a virtual appliance, or vApp. Clients (in this case the vCloud Director for Service Providers cells belonging to the VCD-SP infrastructure itself, as well as other applications interested in notifications) connect to the RabbitMQ broker. Such clients then publish messages to, or consume messages from, the broker. The RabbitMQ broker is written in the Erlang programming language and runs on the Erlang virtual machine. Notes on Erlang-related security and operational issues are presented later in this vCAT-SP blog.


The Base Operating System Hosting the RabbitMQ Broker

Securing the RabbitMQ broker in a vCloud Director for Service Providers environment begins with securing the base operating system of the computer (bare metal or virtualized) on which Rabbit runs. Rabbit runs on many platforms, including Windows and multiple versions of Linux. As of this writing, commercial versions of RabbitMQ are sold by VMware as part of the vFabric suite and are supported on Windows and RPM-based Linux distributions in the Fedora/RHEL family, as well as in a tar.gz-packaged generic Linux edition. See http://docs.gopivotal.com/rabbitmq/index.html for purchasing details.

It is generally recommended in a vCloud Director Service Provider (VCD-SP) deployment that a Linux distribution of RabbitMQ be used.  VMware expects to eventually provide a pre-packaged vApp with a Linux installation, the necessary Erlang runtime, and a RabbitMQ broker, although this form factor is not yet officially released. The VMware RabbitMQ virtual appliance undergoes, as part of its build process, a security hardening regime common to VMware-produced virtual appliances.

If a customer is deploying RabbitMQ on a Linux distribution of their own choosing, whether running on a bare-metal OS or as part of a virtual appliance they have created themselves, VMware's security team recommends securing the base operating system in question by following established hardening guidance, such as the NSA operating system security configuration guides and the US DoD (DISA) Security Technical Implementation Guides (STIGs) listed in the reference section at the end of this post.

The hardening discipline applied to the VMware-produced RabbitMQ virtual appliance is based on the DISA STIG recommendations referenced above.


General networking concerns

Exposing the AMQP traffic that flows between vCloud Director for Service Providers cells and other interested applications in one's cloud infrastructure outside of the private networks meant for cloud management can expose a VCD-SP provider to security threats. Messages are published on an AMQP broker such as RabbitMQ for events that occur when something changes in vCloud Director for Service Providers, and thus may include sensitive information. Therefore, AMQP ports should be blocked at the network firewall protecting the DMZ to which the vCloud cells are connected. Code that consumes AMQP messages from the broker must also be connected to the same DMZ. Any such piece of code should be controlled, or at least audited to the point of trustworthiness, by the VCD-SP provider.

It is also worth mentioning that AMQP is not exposed to any Cloud tenants and is only used by the Service Provider.

The Erlang runtime

What is Erlang?

Erlang is a programming language developed and used by Ericsson in its high-end telephony and data routing products.  The language and its associated virtual machine support several features leveraged by RabbitMQ, including:

  • support for highly concurrent applications like RabbitMQ
  • built-in support for distributed computing, thus enabling easier clustering of RabbitMQ systems
  • built-in process monitoring and control, for ensuring that a RabbitMQ broker’s subsystems remain running and healthy
  • Mnesia: a performant distributed database
  • high-performance execution.

That RabbitMQ is written in Erlang matters relatively little to a system administrator responsible for deploying, configuring and securing the broker, with only a few small exceptions:

  • Erlang distribution has certain open port constraints.
  • Erlang distribution requires a special “cookie” file to be shared between hosts participating in distributed Erlang communication; this cookie must be kept private.
  • Some RabbitMQ configuration files are represented with Erlang syntax, of which one must be mindful when placing delimiters (like ‘[‘, ‘{‘, and ‘)’) and certain punctuation marks (notably the comma and the period).


Running Erlang securely for RabbitMQ

When clustered, RabbitMQ is a distributed Erlang system, consisting of multiple Erlang virtual machines communicating with one another.  Each such running virtual machine is called a 'node'.  In such a configuration, the administrator must be aware of two basic Erlang ideas: the Erlang port mapper daemon, and the Erlang node magic cookie.


epmd:  The Erlang port mapper daemon

The Erlang port mapper daemon is automatically started on every host where an Erlang node (such as a RabbitMQ broker) is started. The appearance of a process called 'epmd' is not to be viewed with alarm. The Erlang virtual machine itself is called 'beam' or 'beam.smp', and at least one of these will be seen on a machine running the RabbitMQ server. The Erlang port mapper daemon listens, by default, on TCP port 4369, so the host system's firewall should leave this port open.
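A simple check to confirm that the daemon is running and to see which Erlang nodes it has registered (assuming the Erlang tools are on the PATH) is:

$ epmd -names

The output lists the port epmd is listening on (4369 by default) and one line per registered node, such as the 'rabbit' node of a running broker.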


Node magic cookies

Each Erlang node (as defined above) has its own magic cookie, which is an Erlang atom contained in a text file.  When an Erlang node tries to connect to another node (this could be a pair of RabbitMQ brokers connecting in a clustered RabbitMQ implementation, or the rabbitmqctl utility connecting to a broker to perform some administrative function upon it), the magic cookie values are compared.  If the values of the cookies do not match, the connected node rejects the connection.

A node magic cookie on a system should be readable only by those users under whose id Erlang processes that need to communicate with one another are expected to run.  The Unix permissions of cookie files should typically be 400 (read-only by user).

For most versions of RabbitMQ, cookie creation and installation is handled automatically during installation.  For an RPM-based Linux distribution of RabbitMQ, such as that for RHEL/Fedora, the cookie will be created and deposited in /var/lib/rabbitmq, called '.erlang.cookie', and given permissions 400 as described above.
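If the cookie has been created manually, or its ownership or permissions have drifted, they can be corrected with standard commands (a minimal sketch assuming the default RPM file location and the 'rabbitmq' system user):

$ chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie
$ chmod 400 /var/lib/rabbitmq/.erlang.cookie
$ ls -l /var/lib/rabbitmq/.erlang.cookie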

Rabbit server concepts

Rabbit security: The OS-facing side

OS user accounts

RPM-based Linux

In an RPM-based Linux distribution, such as the vFabric release of RabbitMQ or the RabbitMQ virtual appliance, the Rabbit server runs as a daemon, started by default at OS boot time.  On such a platform the server is set up to run as the system user 'rabbitmq'.  The Mnesia database and log files must be owned by this user.  More will be said about these files in subsequent sections.

To change whether the server starts at system boot time, use:

$ chkconfig rabbitmq-server on

or

$ chkconfig rabbitmq-server off

An administrator can start or stop the server with:

$ /sbin/service rabbitmq-server stop|start|restart


Network ports

Unless configured otherwise, the RabbitMQ broker will listen on the default AMQP port of 5672.  If the management plugin is installed to provide browser-based and HTTP API-based management services, it will listen on port 55672.

Any firewall configuration should be certain to open these two ports.

Strictly speaking, you only need port 5672 open for VCD-SP to work. You open port 55672 only if you want to expose the management interface to the outside world.

Also, as noted above, the Erlang port mapper daemon port, TCP 4369, must also be open.
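On a host protected by iptables, the corresponding rules might look like the following (a sketch only; adapt chain names and source restrictions to your own firewall design, and omit the management port if that interface is not exposed):

$ iptables -A INPUT -p tcp --dport 5672 -j ACCEPT    # AMQP, required for VCD-SP
$ iptables -A INPUT -p tcp --dport 55672 -j ACCEPT   # management plugin, optional
$ iptables -A INPUT -p tcp --dport 4369 -j ACCEPT    # Erlang port mapper daemon (epmd)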


Rabbit security: The broker-facing side

When considering the security of the RabbitMQ broker itself, it is helpful to divide one's thinking into two areas: the face Rabbit shows to the outside world, in terms of how communication with clients can optionally be authenticated and secured against eavesdropping; and the ways in which RabbitMQ's internal structures, such as exchanges, queues, and the bindings between them that determine message routing, are governed.

For the former consideration, a RabbitMQ broker can be configured to communicate with clients using the SSL protocol.  This can provide channel security for client-broker communications and optionally the verification of the identities of communicating parties.


TLSv1.2 and RabbitMQ in vCloud Director for Service Providers (VCD-SP)

In the context of vCloud Director for Service Providers (VCD-SP), the administrator can configure VCD-SP to use secure communication based on TLSv1.2 when sending messages to the AMQP broker. TLSv1.2 can also be configured to verify the broker's presented certificate to authenticate its identity. To enable secured communication, log in to VCD-SP as a system administrator. In the 'Administration' section of the user interface, open the 'Blocking Tasks' page and select the 'Settings' tab. In the 'AMQP Broker Settings' section there is a checkbox labelled 'Use SSL'; turn this option on. You can then either accept all certificates (turn on the 'Accept All Certificates' option) or verify presented certificates. To configure verification of the broker's presented certificate, either create a Java KeyStore in JCEKS format that contains the trusted certificate(s) used to sign the broker's certificate, or directly upload the certificate if it is in PEM format. Under this same 'AMQP Broker Settings' section, use the 'Browse' button for either a single SSL Certificate or an SSL Key Store. If you upload a keystore, you must also provide the SSL Key Store Password. If neither a keystore nor a certificate is uploaded, the default JRE truststore is used.
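As an illustration of the keystore step, the CA certificate that signed the broker's certificate could be imported into a JCEKS keystore with the Java keytool (the file and keystore names below are hypothetical):

$ keytool -importcert -alias amqp-ca -file amqp-ca.pem -keystore amqp-truststore.jceks -storetype JCEKS -noprompt

The resulting amqp-truststore.jceks file, together with its store password, is what you would upload as the SSL Key Store in the 'AMQP Broker Settings' section.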


Securing RabbitMQ AMQP communication with SSL

Full documentation on setting up the RabbitMQ broker’s built-in SSL support can be found at: http://www.rabbitmq.com/ssl.html

The documentation at this site covers:

  • the creation of a certificate authority using OpenSSL and the generation of signed certificates for both the RabbitMQ server and its clients.
  • enabling SSL support in RabbitMQ by editing the broker’s config file (for its location on a specific Rabbit platform see http://www.rabbitmq.com/configure.html#configuration-file)


Broker virtual hosts and RabbitMQ users

A RabbitMQ server internally defines a set of AMQP users (with passwords), which are stored in its Mnesia database.  Note: A freshly installed RabbitMQ broker starts life with a user account called 'guest', endowed with the password 'guest'.  We recommend that this password be changed, or this account deleted, when RabbitMQ is first set up.
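For example, the default account can be removed and a dedicated account for vCloud Director created with rabbitmqctl (the user name and password shown are illustrative only):

$ rabbitmqctl delete_user guest
$ rabbitmqctl add_user vcdsp 'S0me-Strong-Passw0rd'
$ rabbitmqctl list_users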

A RabbitMQ broker’s resources are logically partitioned into multiple “virtual hosts.”  Each virtual host provides a separate namespace for resources such as exchanges and queues.  When clients connect to a broker, they specify the virtual host with which they plan to interact at connection time.  A first level of access control is enforced at this point, with the server checking whether the user has sufficient permissions to access the virtual host.  If not, the connection is rejected.

RabbitMQ offers configure, read, and write permissions on its resources.  Configure operations create or destroy resources, or modify their behavior.  Write operations inject messages into a resource, and read operations retrieve messages from a resource.

It is important to note that VCD-SP requires all of these permissions to be granted to its AMQP user.
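A hedged example of granting those permissions with rabbitmqctl, assuming the illustrative 'vcdsp' user created earlier and a dedicated virtual host named '/vcd':

$ rabbitmqctl add_vhost /vcd
$ rabbitmqctl set_permissions -p /vcd vcdsp ".*" ".*" ".*"
$ rabbitmqctl list_permissions -p /vcd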

Details on RabbitMQ virtual hosts, users, access control, and permissions can be found in the RabbitMQ administration guide: http://www.rabbitmq.com/admin-guide.html


The setting of permissions using the 'rabbitmqctl' utility is described in: http://www.rabbitmq.com/man/rabbitmqctl.1.man.html#Access%20control


One should stick to a policy of least privilege in the granting of permissions on broker resources.


The rabbitmqctl utility

The rabbitmqctl utility (analogous to apachectl or tomcatctl) is one of the primary points of contact for administering RabbitMQ.  On Linux systems a man page for rabbitmqctl is typically available specifying its many options.  The contents of this page can also be found online at: http://www.rabbitmq.com/man/rabbitmqctl.1.man.html



The Rabbit broker:  Where things are and how they should be protected

The following are true for a RabbitMQ server installed on an RPM-based Linux distribution such as RHEL/Fedora.  Permissions are given for top level directories where named.  Data files within them may have more liberal permissions set, particularly group/other authorized to read/write.


Erlang cookie

Ownership:    rabbitmq/rabbitmq

Permissions:  400

Location: /var/lib/rabbitmq/.erlang.cookie


RabbitMQ logs

Ownership:    rabbitmq/rabbitmq

Permissions:  755

Location: /var/log/rabbitmq/

|– rabbit@localhost-sasl.log

|– rabbit@localhost.log

|– startup_err

`– startup_log


Mnesia database location, plugins and message stores

Ownership:    rabbitmq/rabbitmq

Location: /var/lib/rabbitmq/mnesia

|– rabbit@localhost

|   |– msg_store_persistent

|   `– msg_store_transient

`– rabbit@localhost-plugins-expand


Configuration files location and permissions

RabbitMQ’s main configuration file, as well as the environment variables that influence its behavior, are documented here: http://www.rabbitmq.com/configure.html

Note that the contents of the rabbitmq.config file are an Erlang term, and it is thus important to be mindful of delimiters and line ending symbols, so as not to produce a syntactically invalid file that will prevent RabbitMQ from starting up.


Privileges required to run broker process and rabbitmqctl

Ownership:    root/root

Permissions:  755

Location: /usr/sbin/rabbitmqctl

The rabbitmqctl utility must be run as root, and must maintain the ownership and permissions shown above.

The broker can be started, stopped, restarted or status checked by an administrator running:

$ /sbin/service rabbitmq-server stop|start|restart|status



Reference Documents

VMware vFabric Cloud Application Platform (with purchase links for commercial RabbitMQ): http://docs.gopivotal.com/rabbitmq/index.html


NSA operating systems security guidelines: http://www.nsa.gov/ia/guidance/security_configuration_guides/operating_systems.shtml

US DoD Information Assurance Support Environment Security Technical Implementation Guides for operating systems: http://iase.disa.mil/stigs/os/index.html#

RabbitMQ broker configuration: http://www.rabbitmq.com/configure.html

RabbitMQ administration guide: http://www.rabbitmq.com/admin-guide.html

RabbitMQ broker/client SSL configuration guide: http://www.rabbitmq.com/ssl.html

RabbitMQ configuration file reference: http://www.rabbitmq.com/configure.html#configuration-file

Configuring access control with rabbitmqctl: http://www.rabbitmq.com/man/rabbitmqctl.1.man.html#Access%20control

Rabbitmqctl man page: http://www.rabbitmq.com/man/rabbitmqctl.1.man.html


Authored by Michael Haines – Global Cloud Practice

Special thanks to Radoslav Gerganov and Jerry Kuch for their help and support.

VMware vCloud Director Virtual Machine Metric Database

This article is a preview of a section from the Hybrid Cloud Powered Automation and Orchestration document that is part of the VMware vCloud® Architecture Toolkit for Service Providers (vCAT-SP) document set. The document focuses on architectural design considerations for obtaining the VMware vCloud Powered service badge, which guarantees a true hybrid cloud experience for VMware vSphere® customers. The service provider must obtain validation from VMware that its public cloud fulfills the hybridity requirements:

  • Cloud is built on vSphere and VMware vCloud Director®
  • vCloud user API is exposed to cloud tenants
  • Cloud supports Open Virtualization Format (OVF) for bidirectional workload movement

This particular section focuses on a new feature of vCloud Director—virtual machine performance and resource consumption metric collection, which requires deployment of an additional scalable database to persist and make available a large amount of data to cloud consumers.

Virtual Machine Metric Database

As of version 5.6, vCloud Director collects virtual machine performance metrics and provides historical data for up to two weeks.

Table 1. Virtual Machine Performance and Resource Consumption Metrics

Retrieval of both current and historical metrics is available through the vCloud API. The current metrics are retrieved directly from the VMware vCenter Server™ database with the Performance Manager API. The historical metrics are collected every 5 minutes (with 20-second granularity) by a StatsFeeder process running on the cells and are pushed to persistent storage: a Cassandra NoSQL database cluster with the KairosDB database schema and API. The following figure depicts the recommended VM metric database design. Multiple Cassandra nodes are deployed in the same network. On each node, the KairosDB database is running, which also provides an API endpoint for vCloud cells to store and retrieve data. For high availability and load balancing, all KairosDB instances are placed behind a single virtual IP address, which is configured with the cell management tool as the VM metric endpoint.

Figure 1. Virtual Machine Metric Database Design

Design Considerations

  • Currently, only KairosDB 0.9.1 and Cassandra 1.2.x/2.0.x are supported.
  • The minimum cluster size is three nodes (the cluster size must be equal to or larger than the replication factor). Use a scale-out rather than a scale-up approach, because Cassandra performance scales linearly with the number of nodes.
  • Estimate I/O requirements based on the expected number of VMs, and correctly size the Cassandra cluster and its storage.

n … expected number of VMs
m … number of metrics per VM (currently 8)
t … retention (days)
r … replication factor

Write I/O per second = n × m × r / 10
Storage = n × m × t × r × 114 kB

For 30,000 VMs, the I/O estimate is 72,000 write IOPS and 3288 GB of storage (worst-case scenario if data retention is 6 weeks and replication factor is 3).
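The arithmetic can be reproduced with a short shell snippet using the same worst-case inputs (a sketch only; adjust the variables to your own VM counts, retention, and replication factor):

n=30000; m=8; t=42; r=3
echo "Write IOPS: $(( n * m * r / 10 ))"
awk -v n=$n -v m=$m -v t=$t -v r=$r 'BEGIN { printf "Storage: %.0f GB\n", n * m * t * r * 114 / 1024 / 1024 }'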

  • Enable Leveled Compaction Strategy (LCS) on the Cassandra cluster to improve read performance.
  • Install JNA (Java Native Access) version 3.2.7 or later on each node because it can improve Cassandra memory usage (no JVM swapping).
  • For heavy read utilization (many tenants collecting performance statistics) and availability, VMware recommends increasing the replication factor to 3.
  • Recommended size of 1 Cassandra node: 8 vCPUs (more CPU improves write performance), 16 GB RAM (more memory improves read performance), and 2 TB storage (each backed by separate LUNs/disks with high IOPS performance).
  • KairosDB does not enforce a data retention policy, so old metric data must be regularly cleared with a script. The following example deletes one month’s worth of data:


if [ "$#" -ne 4 ]; then
    echo "$0  port month year"

let DAYS=$(( ( $(date -ud 'now' +'%s') - $(date -ud "${4}-${3}-01 00:00:00" +'%s')  )/60/60/24 ))
if [[ $DAYS -lt "42" ]]; then
 echo "Date to delete is in not before 6 weeks"

METRICS=( `curl -s -k http://$1:$2/api/v1/metricnames -X GET|sed -e 's/[{}]/''/g' | awk -v k="results" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}'|tr -d '[":]'|sed 's/results//g'|grep -w "cpu\|mem\|disk\|net\|sys"` ) echo $METRICS for var in "${METRICS[@]}" do for date in `seq 1 30`;   do     STARTDAY=$(($(date -d $3/$date/$4 +%s%N)/1000000))     end=$((date + 1))     date -d $3/$end/$4 > /dev/null 2>&1
    if [ $? -eq 0 ]; then
       ENDDAY=$(($(date -d $3/$end/$4 +%s%N)/1000000))
       echo "Deleting $var from " $3/$date/$4 " to " $3/$end/$4
       echo '
          "metrics": [
            "tags": {},
            "name": "'${var}'"
          "cache_time": 0,
          "start_absolute": "'${STARTDAY}'",
          "end_absolute": "'${ENDDAY}'"
       }' > /tmp/metricsquery
       curl http://$1:$2/api/v1/datapoints/delete -X POST -d @/tmp/metricsquery

rm -f /tmp/metricsquery > /dev/null 2>&1

Note: The space gains will not be seen until data compaction occurs and the delete marker columns (tombstones) expire. The tombstone grace period is 10 days by default; it can be changed by adjusting the gc_grace_seconds property on the KairosDB column families.
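For example, a hedged sketch of how the grace period could be lowered and a compaction forced is shown below. The kairosdb keyspace and data_points column family names assume the default KairosDB schema, and the exact cassandra-cli and nodetool syntax should be verified against the Cassandra version in use.

# Lower the tombstone grace period on the KairosDB data_points column family to 1 day.
cassandra-cli -h cassandra-node-1 <<'EOF'
use kairosdb;
update column family data_points with gc_grace = 86400;
EOF

# Force a major compaction so expired tombstones are actually purged from disk.
nodetool -h cassandra-node-1 compact kairosdb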

  • KairosDB v0.9.1 uses the QUORUM consistency level for both reads and writes. Quorum is calculated as (replication factor / 2) + 1, rounded down, and for both reads and writes a quorum of replica nodes must be available. Data is assigned to nodes through a hash algorithm, and every replica is of equal importance. The following table provides guidance on replication factor and cluster size configurations.
Table 2. Cassandra Configuration Guidance



VMware vCloud Architecture Toolkit (vCAT) is back!

Introducing VMware vCloud Architecture Toolkit for Service Providers (vCAT-SP)

The current VMware vCloud® Architecture Toolkit (3.1.2) is a set of reference documents that help our service provider partners and enterprises architect, operate and consume cloud services based on the VMware vCloud Suite® of products.

As the VMware product portfolio has diversified over the past few years, with the introduction of new cloud automation, cloud operations, and cloud business products, plus the launch of VMware’s own hybrid cloud service, VMware vCloud Air™, VMware service provider partners now have many more options for designing and building their VMware powered cloud services.

VMware has decided to create a new version of vCAT specifically focused on guiding our partners in defining, designing, implementing, and operating VMware based cloud solutions across the breadth of our product suites. This new version of vCAT is called VMware vCloud Architecture Toolkit for Service Providers (or vCAT-SP).

What are we attempting to achieve?

What VMware intends to do through the new vCAT-SP is to provide prescriptive guidance to our partners on what is required to define, design, build, and operate a VMware based cloud service… aligned to the common service models that are typically deployed by our partners. This will include core architectures as well as value-add products and add-ons.

VMware vCAT-SP will be developed using the architecture methodology shown in the following graphic. This methodology takes service models, use cases, functional and non-functional requirements, and implementation examples that have been validated in real-world deployments.

Architecture Methodology


Which implementation models will be covered?

The new vCAT-SP initially focuses on two implementation models: Hybrid Cloud Powered and Infrastructure as a Service (IaaS) Powered. These in turn align to common cloud service models, such as Hosting, Managed Private Cloud, and Public/Hybrid Cloud.

Hybrid Cloud Powered

To become hybrid cloud powered, the service provider’s cloud infrastructure must meet the following criteria:

  • The cloud service must be built with VMware vSphere® and VMware vCloud Director for Service Providers.
  • The vCloud APIs must be exposed to the cloud tenants (a quick way to verify this is shown after this list).
  • Cloud tenants must be able to upload and download virtual workloads packaged with Open Virtualization Format (OVF) version 1.0.
  • The cloud provider must have an active rental contract of 3,600 points or more with an aggregator.
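As a simple illustration of the API exposure requirement, the unauthenticated /api/versions endpoint lists the vCloud API versions (and login URLs) that a cloud exposes to its tenants; the host name below is a placeholder.

# Returns the SupportedVersions document listing the exposed vCloud API versions.
curl -sk https://vcloud.example.com/api/versions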

This implementation model is typically used to build large-scale, multi-tenant public or hybrid cloud solutions offering a range of IaaS, PaaS, or SaaS services to end customers.

Infrastructure as a Service Powered

To design an Infrastructure as a Service powered cloud infrastructure, the solution must meet the following criteria:

  • The cloud service must be built with vSphere.
  • The cloud provider must have an active rental subscription with an aggregator.

This implementation model is typically used to build managed hosting and managed private cloud solutions with varying levels of dedication through compute, storage and networking layers, again offering a range of IaaS, PaaS and SaaS services to the end customers.

The vCloud Architecture Toolkit provides all the required information to design and implement a hybrid cloud powered or IaaS powered cloud service, and to implement value-added functionality for policy-based operations management, software-defined networking and security, hybridity, unified presentation, cloud business management, cloud automation and orchestration, software-defined storage, developer services integration, and so on.

For more information please visit: vcloudairnetwork.com

Modular and iterative development framework

Modularity is one of the key principles behind the new vCAT-SP architecture framework. A modular approach makes the framework easier to iterate on: smaller building blocks can be checked out of the architecture, assessed for impact against other components, updated, and then re-inserted into the architecture with minimal impact on the larger solution landscape.

What will vCAT-SP contain?

VMware vCAT-SP provides the following core documents:

Introductory Documents

Within this section there will be a document map, which details all the available documents and document types that are contained within vCAT-SP. There will also be an introduction document that provides the partners with guidance on how to get the most out of vCAT-SP as a consumer.

Service Definitions

The service definition document(s) provide the information needed to create an effective service definition document. They contain use cases, SLAs, OLAs, business drivers, and the like, that are required to build a hybrid cloud powered or IaaS powered cloud service. The initial vCAT-SP efforts will focus on the hybrid cloud powered service definition, with IaaS Powered following shortly after.

Architecture Documents

The vCAT-SP architecture documents detail the logical design specifics, the architecture options available to the designing architect, and design considerations for availability, manageability, performance, scalability, recoverability, security, and cost.

Implementation Examples

The implementation example documents detail an end-to-end specific implementation of a solution aligned to an implementation model and service definition. These documents highlight which design decisions were taken and how the solution meets the use cases and requirements identified in a service definition.

Additionally, there will be implementation examples for pluggable value-added services that are developed through the VMware vCloud Air Network, for example, Disaster Recovery as a Service (DRaaS). These components can be plugged in to the core architecture.

Emerging Tools, Solutions and Add-Ons

This area is not just for documentation; it also allows the team to capture and store useful software tools and utilities, such as scripts, plugins, and workflows, that can be used to enhance a particular implementation model, for example, how a cloud platform can present cloud-native applications such as Project Photon. The development of these documents and add-ons will be iterative and not aligned to the core documentation releases.

The following figure shows the map of documentation currently planned. This is subject to change.

Document Map

When can I get a copy of vCAT-SP?

We are planning to launch the first PDF-based release of vCAT-SP on www.vmware.com/go/vcat around the VMworld EMEA time frame, and we will be publishing it in web format shortly afterwards… so watch this space!

VMware vCAT-SP will be developed iteratively, with a published roadmap. This will be in line with our major software releases where possible, to ensure there is effective service- and solution-focused architectural guidance available to VMware service provider partners as close to GA dates as possible.

Who is the vCAT-SP development team?

The Global Cloud Practice – vCloud Air Network team, led by Dan Gallivan, is a team of specialist service provider-focused cloud architects that work throughout the vCloud Air Network within the VMware Cloud Services Business Unit.

The team is a global team with many years’ experience helping our service provider partners build world-class cloud products based on VMware software. The team also includes five certified VCDX architects and three members of the VMware CTO Ambassadors program.

Over the next couple of months we will be releasing frequent technical preview blogs across the technology domains as we approach VMworld EMEA.


Be sure to subscribe to the vCAT blog and the vCloud blog, follow @VMwareSP on Twitter, or ‘like’ us on Facebook for future updates.


VMware vFabric Reference Architecture released

Now available, the highly anticipated vFabric Reference Architecture!!!

Customers, partners, and the VMware field can leverage the practical templates, examples, and guiding principles authored by VMware’s vFabric experts to design, implement, and run their own integrated vFabric middleware technology suite.

Download the vFabric Reference Architecture at www.vmware.com/go/vFabric-Ref-Arch.

vCAT 3.1 released with videos

vCAT 3.1 was released several weeks ago in time for PEX. We have added updates to vCenter Chargeback and vCloud Connector. Please see the release notes for specific details on content changes.

vCAT 3.1 now includes two sets of videos in support of the vCloud Architecture Toolkit.
a) Executive videos – We have provided several short executive briefs on VMware Validated Architectures, vCAT, our internal use of vCAT, and the alignment of our Cloud Infrastructure Management (Layer 1) set of technologies. Currently we have postings from Pat Gelsinger, Ray O’Farrell, Bogomil Balkanski, Scott Aronson, and Mark Egan.
b) Subject Matter Expert (SME) videos – We have included 10 videos: one covering the Document Center tool for viewing the documentation, and one for each of the 9 document areas.

VMware Press release
VMware Press will be publishing the vCAT 3.1 release for those wishing a printed copy.

As always, we welcome your feedback on how we can improve on our vCAT.next release. Please send feedback to ipfeedback@vmware.com.

vCloud Director Service Builder and the vCloud Director workflow run service


In vCAT 3, we described how to leverage vCloud Director blocking tasks and notifications to extend vCloud Director with new capabilities and, as part of the workflow examples document, provided the notification package as an implementation example leveraging vCenter Orchestrator’s rich library of workflows.

vCloud Director 5.1 introduced a new API extensions feature that allows a cloud provider to extend the vCloud API by developing services that provide functionality not available in the vCloud API. These API extensions have been covered in detail on Christopher Knowles’s theclouds.ca blog. Thomas Kraus wrote about implementing a specific service leveraging a vCenter Orchestrator workflow on his Cloud Actual blog.

Now it is my turn to release a tool to create custom services leveraging any vCenter Orchestrator workflow as a service operation. “vCloud Director Service Builder” is a wizard-based workflow that allows you to create new services and their operations in a few clicks.

In addition, once a service operation has been started, the included “vCloud Director workflow run” service allows you to manage the workflow life cycle.

Get vCloud Director Service Builder and the vCloud Director workflow run service and find out more information in the vCenter Orchestrator Communities.


Enforce System Wide CPU/Memory limits in vCloud Director

Have you ever wished you could prevent users from powering on VMs with 4 or more CPUs in your vCD environment? How about preventing VMs with more than 8 GB of memory from powering up? There may be performance benefits to enforcing such limitations, depending on the number of CPUs and cores available on the physical hosts in the underlying cluster.

While neither of the above items is possible with any basic setting, such control can be enforced in your vCloud Director environment. Laying out every step in detail is beyond the scope of today’s post, but the basic steps are as follows (a rough sketch of the resulting check appears after the list):

  • Configure an AMQP server
  • Enable the “Start vApp (Deploy from api)” blocking task
  • Specify a vCenter Orchestrator workflow to subscribe to the queue
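The sketch below illustrates the kind of check the subscribed workflow performs. In practice this logic lives inside the vCenter Orchestrator workflow, which resumes or aborts the blocking task; the host, token, VM identifier, and limits shown here are placeholders, not part of the original solution.

#!/bin/bash
# Conceptual sketch only: the real check runs inside the vCenter Orchestrator workflow
# that receives the AMQP blocking-task message. All values below are placeholders.
VCD=https://vcloud.example.com
VM_ID=vm-00000000-0000-0000-0000-000000000000
AUTH="x-vcloud-authorization: <token>"
ACCEPT='Accept: application/*+xml;version=5.6'
MAX_CPU=4        # block power-on at 4 or more vCPUs
MAX_MEM_MB=8192  # block power-on above 8 GB of memory

CPUS=$(curl -sk -H "$AUTH" -H "$ACCEPT" $VCD/api/vApp/$VM_ID/virtualHardwareSection/cpu \
       | grep -o '<rasd:VirtualQuantity>[0-9]*' | grep -o '[0-9]\+')
MEM=$(curl -sk -H "$AUTH" -H "$ACCEPT" $VCD/api/vApp/$VM_ID/virtualHardwareSection/memory \
       | grep -o '<rasd:VirtualQuantity>[0-9]*' | grep -o '[0-9]\+')

if [ "$CPUS" -ge "$MAX_CPU" ] || [ "$MEM" -gt "$MAX_MEM_MB" ]; then
    echo "Policy violation: the workflow would abort the blocking task (power-on denied)"
else
    echo "Within limits: the workflow would resume the blocking task (power-on continues)"
fi

The thresholds simply mirror the examples in the opening paragraph; the workflow can apply whatever policy is appropriate to the underlying cluster.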

If you find this concept interesting and feel you would benefit from such a solution, please leave me feedback as a comment to the Workflow Examples document in the Orchestrator Community.