Author Archives: Martin Hosken

About Martin Hosken

Martin is a Principal Architect, working within the Cloud Provider Software Business Unit at VMware. Martin has extensive experience architecting and consulting with international customers, and serves as a trusted adviser in the design and transition of enterprise organizations' and cloud service providers' legacy infrastructure onto VMware software-defined data center based cloud platforms. Martin specializes in cloud and storage architecture, as well as storage-related solutions for public and hybrid cloud platforms. He is also one of a small number of individuals who hold two VMware Certified Design Expert (VCDX #117) certifications, in Data Center Virtualization and in Cloud Management and Automation. He is an established vExpert, and is the author of multiple white papers, blogs and articles based on VMware and other technologies. He is also the sole author of the new Sybex publication VMware Software-Defined Storage: A Design Guide to the Policy-Driven, Software-Defined Storage Era.

VMware Cloud on AWS – Managed Service Provider (MSP) Program

Without doubt, one of the most significant announcements from VMware during 2017 was the launch of VMware Cloud on AWS. While many enterprises and organizations are still deliberating their specific use cases for this service, it is absolutely clear that giving VMware customers and partners the ability to run a vSphere cloud platform on AWS hardware, with a low-latency, high-bandwidth interconnect into AWS’ native services, is highly appealing. This is because, as we all know, the public cloud is often not the most appropriate location for every workload type.

During 2018 we will see significant growth of this service across the multiple global regions in which it will be made available, in addition to gaining visibility into the wide-ranging customer use cases that will become key drivers for customer adoption.

Also in 2018, we will see VMware Cloud on AWS being made available through the VMware Cloud Provider ‘Managed Service Provider’ (MSP) program, allowing VMware’s cloud provider partners to deliver this service to their end consumers, as part of any fully managed service offering.

For those of you who are unfamiliar with the concept of Managed Services, this is the practice of outsourcing IT services based on the proactive management offered through pre-defined service-level agreements. With this model, a cloud provider takes responsibility for IT functions, and also in many cases, acts as a trusted advisor to the consumer, offering strategic solutions for improving IT operations and reducing costs.

In the VMC on AWS managed service provider model, the cloud provider has direct oversight of the VMC on AWS organization, and the systems being managed. This allows the cloud provider to deliver the solution, with the consumer being provided with a service-level agreement that defines the performance and quality metrics based on the overall service provider offering, which might include multiple different components. The key differentiator of this solution is that the cloud provider maintains the relationship with the end consumer at all times, while being backed by VMware support services.

 

One of the key advantages to the end consumer is that this is an efficient way to stay up to date with technology trends, and to have access to all of the necessary skills to manage and maintain this truly hybrid solution, which in turn, minimizes risk. A recent survey [2017 State of Cloud Adoption and Security] identified that it is a lack of knowledge and expertise in cloud computing, rather than reluctance, which appears to be the main obstacle to cloud adoption for many corporate organizations. Therefore, as a value-added managed service provider, VMware Cloud Providers can evolve to offer a higher level of service and adopt service models that are tailored to meet the needs of these organizations. In addition, managing day-to-day IT processes and reducing related business costs can provide a significant advantage for consumer organizations, and also provides efficiency to cloud providers through the centralization of technical expertise.

As a result, VMware Cloud Providers can be instrumental as the IT infrastructure components of some corporations are migrated to the cloud, making it easier than ever for them to capture these workloads. Also, for cloud providers who have been providing in-house cloud services or acting as brokers for cloud service providers, the VMC on AWS solution takes this approach to a whole new level of integration, opening the door to integrated cross-cloud services, which can meet the needs of the most demanding, complex or diverse application. For instance, a VMC on AWS managed service provider might stretch applications across the boundaries of the hybrid solution, allowing tenants to build solutions that can consume the best from both worlds, such as EC2/ECS applications querying an Oracle RAC database or SAP modules running on the VMC’s SDDC platform. There are unlimited use cases for customers to leverage solutions between the two environments, all of which can be provided as a fully managed infrastructure by a VMware Cloud Provider.

In all likelihood, the most common use cases and managed services that will be offered on a VMC on AWS solution will revolve around the low-latency and high-bandwidth connectivity with AWS native services, and the disaster recovery solutions being made available through this offering. This takes application topologies and service development options beyond the capabilities of the traditional VMware infrastructure. As a result, cloud provider managed services can be extended significantly, and might include a wide range of new offerings, such as:

  • Software – application production support and maintenance
  • Authentication solutions
  • Systems management
  • Secure mobile device management
  • Data backup and managed recovery services
  • Data storage, data warehouse and management
  • Network monitoring, overall operational management
  • End-to-end security services
  • Communications services (mail, phone, VoIP)
  • Managed video services

In addition, we also expect to see VMware cloud providers deploy VMC on AWS as a means of rapid deployment into new regions, versus building new co-locations, providing a significantly faster route to local markets. This use case will see VMware cloud providers deploying new infrastructure, while avoiding complex, expensive and time-consuming processes. Also, cloud providers who wish to provision one-off or multiple resources into a new global region, where AWS is present, can now do so in a matter of days, as opposed to months or years.

Also, managed cloud providers who wish to reduce their data center footprint and consolidate customer workloads, in what might be smaller regions, can employ VMC – reducing the need for some or all of their own facilities. Likewise, expanding resources for both short and long periods, based on the end consumer’s needs, delivers a new level of flexibility that cloud providers can offer. From the cloud provider’s perspective, this service delivers what you need, when you need it, with no upfront capital outlay – in effect, creating a cloud bursting model.

Managed disaster recovery services are also highly likely to be one of the key use cases for the managed cloud providers who offer this solution as part of their portfolio. Disaster Recovery-as-a-Service can, in a simplified architecture, deliver business continuity through an on-demand service solution, optimized by VMware Cloud on AWS. This solution allows VMware cloud providers to offer services that can provide the operationally consistent experience of a VMware data center, while also:

  • Accelerating time-to-protection
  • Simplifying disaster recovery operations
  • Reducing secondary site costs with cloud economics

This Disaster Recovery-as-a-Service is built, as you would expect, on established VMware solutions, including Site Recovery Manager, vSphere Replication, and optionally VMware vRealize Orchestrator, which together provide the application-centric runbook and remove the need for service consumers to maintain a dedicated disaster recovery data center.

Sold as an add-on service to VMware Cloud on AWS, the Disaster Recovery-as-a-Service solution offers multiple failure topologies, to provide flexibility to both the end consumer and cloud provider, as illustrated below:

 

In summary, the VMware Cloud on AWS solution provides VMware cloud service providers the means to offer a whole new range of service offerings based on the combined benefits of the VMware and AWS platforms, including:

  • Maintain your teams, tools & skills investments
  • Consumption based economics
  • Unique service architecture options
  • Scale and elasticity with on-demand capacity and flexible consumption

It is important to recognise that VMware Cloud Providers are uniquely placed to merge VMware SDDC platforms and native AWS solutions seamlessly, through the power of managed services, transforming entire IT service realities through a powerful combination of service offerings. However, to maximize the benefits of VMware Cloud on AWS, cloud service providers need a holistic cloud strategy, and a way to make it real. To get there, cloud providers also need to be ready to act. For this reason, over the coming months, I will be working with many of VMware’s key cloud providers to develop new service offerings based on VMware Cloud on AWS architectures. For more information as these services become a reality, watch this space…

Martin Hosken | Principal Architect | Office of the CTO, Global Field
VCDX-DCV & VCDX-CMA | VCIX-DCV | vExpert
AWS Certified Solutions Architect – Professional

vCenter Server Scalability for Service Providers

Designing and architecting monster vCloud Air Network service provider environments takes VMware technology to its very limits, in terms of both scalability and complexity. vCenter Server, and its supporting services, such as SSO, are at the heart of the vSphere infrastructure, even in cloud service provider environments where a Cloud Management Platform (CMP) is employed to abstract the service presentation away from vCenter Server.

Meeting service provider scalability requirements with vCenter Server requires optimization at every level of the design, in order to implement a robust technical platform that can scale to its very limits, whilst also maintaining operational efficiency and support.

This article outlines design considerations around optimization of Microsoft Windows vCenter Server instances and best practice recommendations, in order to maximize operational performance of your vCenter ecosystem, which is particularly pertinent when scaling over 400 host servers. Each item listed below should be addressed in the context of the target environment, and properly evaluated before implementation, as there is no one solution to optimize all vCenter Server instances.

The following is simply a list of recommendations that should, to some extent, improve performance in large service provider environments. This blog targets the Windows variant of vCenter Server 5.x and 6.x with a Microsoft SQL database, which is still the most commonly deployed configuration.

Warning: Some of the procedures and tasks outlined in this article are potentially destructive to data, and therefore should only be undertaken by experienced personnel once all appropriate safeguards, such as backed up data and a tested recovery procedure, are in place.

 

Part 1 – vCenter Server Operational Optimization

vCenter Server Sizing
vCloud Air Network service providers must ensure that the vCenter virtual system(s) are sized accordingly, based on their inventory size. Where vCenter components are separated and distributed across multiple virtual machines, ensure that all systems meet the sizing recommendations set out in the installation and configuration documentation.

vSphere 5.5: https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html
vSphere 6.0: https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html
vSphere 5.1: http://kb.vmware.com/kb/2021202

Distribute vCenter Services across multiple virtual machines (vSphere 5.5)
In vSphere 5.5, depending on inventory size, multiple virtual machines can be used to accommodate different vCenter roles. VMware recommends separating VMware vCenter, SSO Server, Update Manager and SQL for flexibility during maintenance and to improve scalability of the vCenter management ecosystem. The new architecture of vCenter 6 simplifies the deployment model, but also reduces design and scaling flexibility, with only two component roles to deploy.

Dedicated Management Cluster
For anything other than the smallest of environments, VMware recommends separating all vSphere management components onto a separate out-of-band management cluster. The primary benefits of management component separation, include:

  • Facilitating quicker troubleshooting and problem resolution as management components are strictly contained in a relatively small and manageable cluster.
  • Providing resource isolation between workloads running in the production environment and the actual systems used to manage the infrastructure.
  • Separating the management components from the resources they are managing.

vCenter to Host operational latency
The number of network hops between the vCenter Server and the ESXi host affects operational latency. The ESXi host should reside as few network hops away from the vCenter Server as possible.

vCenter to SQL Server operational latency
The number of network hops between the vCenter Server and the SQL database also affects operational latency. Where possible, vCenter should reside on the same network segment as the supporting database. If appropriate, configure a DRS affinity rule to ensure that the vCenter Server and database server reside on the same ESXi host, reducing latency still further.

Java Max Heap Size 
vCloud Air Network service providers must ensure that the maximum heap size for each Java virtual machine is set correctly, based on the inventory size. Check the JVM heap settings for vCenter Server, Inventory Service, SSO and the Web Client, and monitor Web Services to verify. vSphere 5.1 & 5.5: http://kb.vmware.com/kb/2021302

Concurrent Client Connections
Whilst not always easy, attempt to limit the number of clients connected to vCenter Server, as this affects its performance. This is particularly the case for the traditional Windows C# client.

Performance Monitoring
Employ a performance monitoring tool to ensure the health of the vCenter ecosystem and to help troubleshoot problems when they arise. Where appropriate, configure a vROps Custom Dashboard for vCenter/Management components. Also ensure appropriate alerts and notifications on performance monitoring tools exist.

Virtual disk type
All vCenter Server virtual machine VMDKs should be provisioned in an eagerZeroedThick format. This provides approximately a 10-20 percent performance improvement over the other two disk formats (lazy-zeroed thick and thin).

vCenter vNIC type
vCloud Air Network service providers should employ the VMXNET3 paravirtualized network adaptor to maximise network throughput and efficiency, and to reduce latency.

ODBC Connection
Ensure that the vCenter and VUM ODBC connections are configured with the minimum permissions required for daily operations. Additional permissions are typically required during installation and upgrade activities, but not for day to day operations. Please refer to the Service Account Permissions provided below.

vCenter Logs Clean Up
vCenter Server has no automated way of purging old vCenter log files. These files can grow and consume a significant amount of disk space on the vCenter Server. Consider a scheduled task, run every three or six months, to delete or move log files older than the period of time defined by business requirements.

For instance, the VBScript below can be used to clean up old log files from vCenter. This script deletes files that are older than a fixed number of days, defined by the MaxAgeDays constant, from the path set in the LogPath constant. This VBScript can be configured to run as a scheduled task using the Windows Task Scheduler.

Option Explicit
' Delete vCenter Server log files older than MaxAgeDays from LogPath.
Const LogPath = "C:\ProgramData\VMware\VMware VirtualCenter\Logs\"
Const MaxAgeDays = 180
Dim Fso, Directory, Files, LogFile
Set Fso = CreateObject("Scripting.FileSystemObject")
Set Directory = Fso.GetFolder(LogPath)
Set Files = Directory.Files
For Each LogFile In Files
    ' Compare each file's last-modified date with the current date and time.
    If DateDiff("d", LogFile.DateLastModified, Now) > MaxAgeDays Then LogFile.Delete
Next

For more information, refer to KB article: KB1021804 Location of vCenter Server log files.
For additional information on modifying logging levels in vCenter please refer to KB1004795 and KB1001584.

Note: Once a log file reaches its maximum size, it is rotated and numbered in the form component-nnn.log, and may be compressed.

Statistics Levels
The statistics collection interval determines the frequency at which statistic queries occur, the length of time statistical data is stored in the database, and the type of statistical data that is collected.

As historical performance statistics can take up to 90% of the vCenter Server database size, they are the primary factor in the performance and scalability of the vCenter Server database. Retaining this performance data allows administrators to view the collected historical statistics through the performance charts in the vSphere Web Client, through the traditional Windows Client or through command-line monitoring utilities, for up to 1 year after the data was first ingested into the database.

You must ensure that statistics collection times are set as conservatively as possible so that the system does not become overloaded. For instance, you could set a new DB Data Retention Period of 60 days and configure the DB to not retain performance data beyond 60 days. At the same time, it is equally important to ensure that the retention of this historical data meets the service provider’s data compliance requirements.

As this statistics data consumes such a large proportion of the database, proper management of these vCenter Server statistics is an important consideration for overall database health. This is achieved by the processing of this data through a series of rollup jobs, which stop the database server from becoming overloaded. This is a key consideration for vCenter Server performance and is addressed in more detail in Part 2 of this article.

Task and Events Retention
Operational teams should ensure that the Task and Events retention levels are set as conservatively as possible, whilst still meeting the service provider’s data retention and compliance requirements. Every time a task or event is executed via vCenter, it is stored in the database. For example, a task is created when a user powers a virtual machine on or off, and an event is generated when something occurs, such as the vCPU usage for a VM changing to red.

vCenter Server has a Database Retention Policy setting that allows you to specify after how long vCenter Server Tasks and Events should be deleted. This correlates to a database rollup job that purges the data from the database after the selected period of time. Whilst these tables consume a relatively small amount of database space compared to statistical data, it is good practice to consider this option for further database optimization. For instance, by default, vCenter is configured to store tasks and events data for 180 days. However, it might be possible, based on the service provider’s compliance requirements, to configure vCenter not to retain Event and Task Data in the database beyond 60 days.

vCenter Server Backup Best Practice
In addition to scheduling regular backups of the vCenter Server database, the backups for the vCenter Server should also include the SSL certificates and license key information.

 

Part 2 – SQL DB Server Operational Optimization (for vCenter Server)

SQL Database Server Disk Configuration
The vCenter Server database data file (mdf) generates mostly random I/O, while database transaction logs (ldf) generate mostly sequential I/O. The traffic for these files is almost always simultaneous so it’s preferable to keep these files on two separate storage resources, that don’t share disks or I/O. Therefore, where a large service provider inventory demands it, operational teams should ensure that the vCenter Server database uses separate drives for data and logs which, in turn, are backed by different physical disks.

tempDB Separation
For large service provider inventories, place tempDB on a different drive, backed by different physical disks than the vCenter database files or transaction logs.

Reduce Allocation Contention in SQL Server tempDB database
Consider using multiple data files to increase the I/O throughput to tempDB. Configure 1:1 alignment between TempDB files and vCPUs (up to eight) by spreading tempDB across at least as many equal sized files as there are vCPUs.

For instance, where 4 vCPUs exist on the SQL server, create three additional tempDB data files, and make them all equally sized. They should also be configured to grow in equal amounts. After changing the configuration, a restart of the SQL Server instance is required. For more information please refer to: http://support.microsoft.com/kb/2154845
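As a rough sketch of the sizing rule above, the following Python snippet works out how many tempDB data files to aim for and how many need to be added. The function names are hypothetical and are not part of any VMware or Microsoft tooling; the cap of eight files and the single default data file are taken from the guidance and example in this section.

```python
def tempdb_data_files(vcpus: int, cap: int = 8) -> int:
    """Return the target number of equally sized tempDB data files.

    One file per vCPU, capped at eight as described in the text above.
    """
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return min(vcpus, cap)


def additional_files_needed(vcpus: int, existing: int = 1) -> int:
    """Return how many data files to add to reach the target.

    'existing' defaults to 1, assuming only the initial tempDB data file
    is present (an assumption for this sketch).
    """
    return max(tempdb_data_files(vcpus) - existing, 0)


if __name__ == "__main__":
    # Example from the text: a SQL server with 4 vCPUs.
    print(additional_files_needed(4))
    # A larger server is still capped at eight files.
    print(tempdb_data_files(16))
```

The calculation only gives the file count; sizing the files equally and setting equal growth increments still has to be done on the SQL Server itself.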

Database Connection Pool
vCenter Server starts, by default, with a database connection pool of 50 threads. This pool is then dynamically sized according to the vCenter Server’s workload. If high load is expected due to a large inventory, the size of the pool can be increased to 128 threads, although this will increase the memory consumption and load time of the vCenter Server. To change the pool size, edit the vpxd.cfg file and add the following entry, where ‘128’ is the number of connection threads to be configured.

<vpxd>
<odbc>
<maxConnections>128</maxConnections>
</odbc>
</vpxd>
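
Where this change has to be applied across several vCenter Server instances, it could be scripted. The sketch below is illustrative only: it uses Python's standard xml.etree.ElementTree module on an XML string, assumes that the root element of vpxd.cfg is <config>, and uses a made-up sample fragment rather than a real configuration file. The helper name set_max_connections is hypothetical. ElementTree does not preserve comments or original formatting, so back up the real file before trying anything similar.

```python
import xml.etree.ElementTree as ET


def set_max_connections(cfg_xml: str, threads: int) -> str:
    """Return cfg_xml with vpxd/odbc/maxConnections set to 'threads'.

    Missing <vpxd>, <odbc> or <maxConnections> elements are created.
    Assumes the root element is the parent of <vpxd> (an assumption).
    """
    root = ET.fromstring(cfg_xml)
    parent = root
    for tag in ("vpxd", "odbc", "maxConnections"):
        child = parent.find(tag)
        if child is None:
            child = ET.SubElement(parent, tag)
        parent = child
    parent.text = str(threads)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical fragment standing in for a real vpxd.cfg file.
    sample = (
        "<config><vpxd><odbc><maxConnections>50</maxConnections>"
        "</odbc></vpxd></config>"
    )
    print(set_max_connections(sample, 128))
```

Working on a string keeps the sketch free of file paths; reading and writing the actual file, and restarting the vCenter Server service afterwards, are left out.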

Table Statistics
Update statistics of the SQL tables and indexes on a regular basis, for better overall performance of the database. Create an SQL job to carry out this task, or alternatively, it should form part of a vSphere database maintenance plan. http://sqlserverplanet.com/dba/update-statistics

Index Fragmentation (Not Applicable to vCenter 5.1 or newer)
Check for fragmentation of index objects and recreate indexes if needed. Fragmentation occurs in the vCenter database as a result of the statistics rollups. Defragment when fragmentation exceeds 30%. For more information, refer to KB1003990.

Note: With the new enhancements and design changes made in the vCenter Server 5.1 database and later, this is no longer applicable or required.

Database Recovery Model
Depending on your vCenter database backup methodology, consider setting the database to the SIMPLE recovery model. This model will reduce the disk space needed for the transaction logs, as well as decrease I/O load.

Choosing the Recovery Model for a Database: http://msdn.microsoft.com/en-us/library/ms175987(SQL.90).aspx
How to view or Change the Recovery Model of a Database in SQL Server Management Studio: http://msdn.microsoft.com/en-us/library/ms189272(SQL.90).aspx

Virtual Disk Type
Where the vCenter Server database server is a virtual machine, ensure that all VMDKs are provisioned in an eagerZeroedThick format. This option provides approximately a 10-20 percent performance improvement over the other two disk formats (lazy-zeroed thick and thin).

Verify SQL Rollup Jobs
Ensure all the SQL Agent rollup jobs have been created on the SQL server during the vCenter Server Installation. For instance:

  • Past Day stats rollup
  • Past Week stats rollup
  • Past Month stats rollup

For the full set of stored procedures and jobs please refer to the appropriate article below. Where necessary, recreate MSSQL agent rollup jobs. Note that detaching, attaching, importing, and restoring a database to a newer version of MSSQL Server does not automatically recreate these jobs. To recreate these jobs, if missing, please refer to: KB1004382.

KB 2033096 (vSphere 5.1, 5.5 & 6.0): http://kb.vmware.com/kb/2033096
KB 2006097 (vSphere 5.0): http://kb.vmware.com/kb/2006097

Also, ensure that the myDB references the vCenter Server database, and not the master or some other database. If these jobs reference any other database, you must delete and recreate the jobs.

Ensure database jobs are running correctly
Monitor scheduled database jobs to ensure they are running correctly. For more information, refer to KB article: Checking the status of vCenter Server performance rollup jobs: KB2012226

Verify MSSQL Permissions
Ensure that the local SQL and AD permissions required are in place, and align with the principle of least privilege (see below). If necessary, truncate all unrequired performance data from the database (Purging Historical Statistical Performance Data). For more information, refer to KB article: Reducing the size of the vCenter Server database when the rollup scripts take a long time to run KB1007453

Truncate all performance data from vCenter Server
As discussed in Part 1, to truncate all performance data from vCenter Server 5.1 and 5.5:

Warning: This procedure permanently removes all historical performance data. Ensure to take a backup of the database/schema before proceeding.

  1. Stop the VMware VirtualCenter Server service. Note: Ensure that you have a recent backup of the vCenter Server database before continuing.
  2. Log in to the vCenter Server database using SQL Management Studio.
  3. Copy and paste the contents of the SQL_truncate_5.x.sql script (available from the link below) into SQL Management Studio.
  4. Execute the script to delete the data.
  5. Restart the vCenter Server services.

For truncating data in vCenter Server and vCenter Server Appliance 5.1, 5.5, and 6.0, see Selective deletion of tasks, events, and historical performance data in vSphere 5.x and 6.x (2110031)

Shrink Database
After purging historical data from the database, optionally shrink the database. This is an online procedure that reduces the database size and frees up space on the VMDK; however, this activity will not in itself improve performance. For more information, refer to: Shrinking the size of the VMware vCenter Server SQL database KB1036738

For further information on Shrinking a Database, refer to: http://msdn.microsoft.com/en-us/library/ms189080.aspx

Rebuilding indexes to Optimize the performance of SQL Server
Configure a regular maintenance job to rebuild indexes. For more information, refer to KB2009918. To rebuild the vCenter Server database indexes:

  1. For a vCenter Server 5.1 or 5.5 database, download and extract the .sql files from the 2009918_rebuild_51.zip file attached to KB2009918.
  2. Back up your vCenter Server database before proceeding. For more information, see Backing up and restoring vCenter Server 4.x and 5.x (1023985).
  3. Connect to the vCenter Server database using Management Studio for SQL Server. These steps must be performed against the vCenter database and not the master database.
  4. Execute the .sql file to create the REBUILD_INDEX stored procedure.
  5. Execute the stored procedure that was created in the previous step: execute REBUILD_INDEX

VPX_HIST_STAT Table Sizes
VMware recommends a fill factor of 70% for the four VPX_HIST_STAT tables. If the fill factor is set too high for the workload, the database server has to spend time splitting pages, which equates to additional I/O.

If you are experiencing high unexplained I/O in the environment, monitor the SQL Server Access Methods object: Page Splits/sec. Page splits are expensive, and cause your table to perform more poorly due to fragmentation. Therefore, the fewer page splits you have the better your system will perform.

By decreasing the fill factor in your indexes, what you are doing is increasing the amount of empty space on each data page. The more empty space there is, the fewer page splits you will experience. On the other hand, having too much unnecessary empty space can also hurt performance because it means that less data is stored per page, which means it takes more disk I/O to read tables, and less data can be stored in the buffer cache.

High Page Splits/sec will result in the database being larger than necessary and having more pages to read during normal operations.
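
The trade-off described above can be put into rough numbers. The following Python sketch is a simplified approximation, not the full Microsoft sizing formula: it assumes an 8 KB SQL Server page with 8,096 bytes available for rows, a 2-byte slot entry per row, and fixed-size rows. The 100-byte row size and one million row count are hypothetical values chosen purely for illustration, and the function names are made up for this example.

```python
import math

PAGE_DATA_BYTES = 8096   # bytes available for rows on an 8 KB page (after the 96-byte header)
ROW_SLOT_BYTES = 2       # per-row entry in the page's slot array


def rows_per_page(row_bytes: int, fill_factor: int) -> int:
    """Approximate rows stored per leaf page at the given fill factor (1-100)."""
    usable = PAGE_DATA_BYTES * fill_factor // 100
    return usable // (row_bytes + ROW_SLOT_BYTES)


def leaf_pages(row_count: int, row_bytes: int, fill_factor: int) -> int:
    """Approximate number of leaf pages needed to hold row_count rows."""
    return math.ceil(row_count / rows_per_page(row_bytes, fill_factor))


if __name__ == "__main__":
    rows, row_size = 1_000_000, 100  # hypothetical table
    for ff in (100, 70):
        print(ff, rows_per_page(row_size, ff), leaf_pages(rows, row_size, ff))
```

Under these assumptions, a fill factor of 70% leaves room for new rows on every page but needs noticeably more pages for the same data than a fill factor of 100%, which is the extra read I/O and buffer cache cost mentioned above.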

To determine where growth is occurring in the VMware vCenter Server database, refer to: http://kb.vmware.com/kb/1028356

For troubleshooting VPX_HIST_STAT table sizes in VMware vCenter Server 5, refer to: KB2038474

To reduce the size of the vCenter Server database when the rollup scripts take a long time to run, refer to: KB1007453

Monitor Database Growth
Service provider operational teams should monitor vCenter Server database growth over a period of time to ensure the database is functioning as expected. For more information, refer to KB article: Determining where growth is occurring in the vCenter Server database KB1028356

Schedule and verify regular database backups
The vCenter, SSO, VUM and SRM servers are by themselves stateless. The databases are far more critical since they store all the configuration and state information for each of the management components. These databases must be backed up nightly and the restore process of each database needs to be tested periodically.

Operational teams should ensure that a schedule of regular backups of the vCenter database exists and, based on the requirements of the business, periodically restore and mount databases from backup onto a non-production system to ensure that a clean recovery is possible should database corruption or data loss occur in the production environment.

Create a Maintenance Plan for vSphere databases
Work with the DBAs to create a daily and weekly database maintenance plan. For instance:

  • Check Database Integrity
  • Rebuild Index
  • Update Statistics
  • Back Up Database (Full)
  • Maintenance Cleanup Task

Warning: DO NOT SHRINK DB IN MAINTENANCE PLAN UNLESS THERE IS A SPECIFIC REQUIREMENT TO RECLAIM DISK SPACE: http://msdn.microsoft.com/en-us/library/ms189080.aspx

 

Part 3 – Service Account Permissions (Least Privilege)

vCenter Service Account
Required by the ODBC connection for access to the database, the vCenter service account must be configured with db_owner privileges for normal operational use. However, the vCenter database account being used to make the ODBC connection also requires the db_owner role on the MSDB System database during installation or upgrade of the vCenter Server. This permission facilitates the installation of SQL Agent jobs for vCenter statistic rollups.

Typically, the DBA should only grant the vCenter service account the db_owner role on the MSDB System database when installing or upgrading vCenter, then revoke that role when these activities are complete.
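
The grant-then-revoke pattern described above could be prepared as a pair of T-SQL statements for the DBA to run before and after the installation or upgrade. The Python sketch below simply builds those two statements as strings. The account name vcenter_svc and the function name are hypothetical; the sketch assumes the account already exists as a user in the msdb database and uses the sp_addrolemember and sp_droprolemember system stored procedures. It does not connect to SQL Server or execute anything.

```python
def msdb_role_statements(account: str) -> tuple[str, str]:
    """Return (grant, revoke) T-SQL for temporary db_owner membership on msdb.

    The account name is embedded as a T-SQL string literal, so single
    quotes are doubled to keep the literal well-formed.
    """
    quoted = account.replace("'", "''")
    grant = f"USE msdb; EXEC sp_addrolemember N'db_owner', N'{quoted}';"
    revoke = f"USE msdb; EXEC sp_droprolemember N'db_owner', N'{quoted}';"
    return grant, revoke


if __name__ == "__main__":
    # 'vcenter_svc' is a placeholder account name for illustration.
    grant_sql, revoke_sql = msdb_role_statements("vcenter_svc")
    print(grant_sql)   # run before installing or upgrading
    print(revoke_sql)  # run once the activity is complete
```

Generating both statements together makes it harder to forget the revoke step once the installation or upgrade has finished.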

RSA_DBA (vSphere 5.1 Only)
Only required for SSO 5.1, the RSA_DBA account is a local SQL account which is used for creating the schema (DDL) and requires db_owner permissions.

RSA_USER (vSphere 5.1 Only)
Only required for SSO 5.1, the RSA_USER account reads and writes data (DML only).

VUM Service Account
Despite being a 64-bit application, VUM requires a 32-bit ODBC connection, created from “C:\Windows\SysWOW64\odbcad32.exe”. The VUM service account must be granted the db_owner role on the VUM DB. The installation of vCenter Update Manager 5.x and 6.x with a Microsoft SQL back-end database also requires the ODBC connection account to temporarily have db_owner permissions on the MSDB System database. This was a new requirement in vSphere 5.0.

As with the vCenter service account, typically the DBA would only grant the VUM service account the db_owner role for the MSDB System database during an install or upgrade to the VUM component of vCenter. This permission should then be revoked when that task has been completed.

vCloud Director with Virtual SAN Sample Use Case

This brief, high-level implementation example provides a sample use case for VMware Virtual SAN in a vCloud Director for Service Providers environment.

As outlined in the illustration below, each Provider Virtual Data Center / Resource Cluster has been configured with a Virtual SAN datastore that meets the specific capability requirements set out by the Service Level Agreement (SLA) for that tier of service.

In this example, the service provider is deploying three tiers of offerings: Gold, Silver and Bronze. The compute consolidation ratio and Virtual SAN capability, based on the disk group configuration and storage policy, define how the offering will perform for a consumer. In addition, although not shown in the configuration below, NIOC and QoS are being employed by the service provider to ensure an appropriate balance of network resources is assigned, based on tier of service. This requires the configuration of three separate tiered VLANs for Virtual SAN traffic (Gold, Silver and Bronze) with traffic priorities configured accordingly.

The exact disk configuration will vary depending on hardware manufacturer and provider SLAs.

Logical Design Overview

blog

The full VMware technology solution stack is illustrated below.

VSAN with vCD2

The above figure shows how the solution is constructed on VMware technologies. The core vSphere platform provides the storage capability through Virtual SAN, which in turn is abstracted via vCloud Director. The Virtual SAN disk group configuration across the hosts, along with the storage policy configured at the vSphere level, defines the performance and capacity capabilities of the distributed datastore, which in turn are used to define the SLAs for this tier of the cloud offering.

As illustrated above, the vSphere resources are abstracted by vCloud Director into a Provider Virtual Data Center (PvDC). These resources are then further carved up into individual Virtual Data Centers (vDCs), assigned to Organization tenants. The overall result is that the vApps that reside within the Organization vDCs inherit the Virtual SAN storage capability defined by the service provider.

Typically, but outside the scope of this discussion, tiered service offerings are defined by more than just storage capability. vCPU consolidation ratios, levels of guaranteed memory and network resources and backups etc. will all be employed by a service provider to define the SLAs.

As I develop this use case for the service providers I’m working with, I will update this article further.

Windows Failover Clusters for vCloud Air Network Service Providers

Designing Microsoft Windows Server Failover Clusters for vCloud Air Network Service Providers

Introduction

In the modern dynamic business environment, uptime of virtualized business critical applications (vBCAs) and fast recovery from system failures are vital to meeting business service-level agreements (SLAs) for vCloud Air Network Service Providers. Cloud service providers must be prepared for business disruptions and be able to minimize their impact to their consumers.

The “being prepared” approach to providing application high availability is aimed at reducing risk of revenue losses, maintaining compliance, and meeting customer agreed SLAs. Designing and deploying applications on Microsoft Windows Server Failover Clusters (WSFC), and having a highly available infrastructure, can help organizations to meet these challenges.

The following figure provides a simple overview of a Microsoft Windows Server Failover Cluster running on ESXi hosts in a VMware vSphere infrastructure.

Figure 1. Microsoft Windows Cluster Service on VMware ESXi Hosts

Picture1

Microsoft Cluster Service (MSCS) has been available in Microsoft server products since the release of Microsoft Windows NT Server, Enterprise Edition. A Microsoft Server failover cluster is defined as a group of independently running servers that work together and co-exist to increase the availability of the applications and services they provide. The clustered servers, generally referred to as nodes, are connected by virtual and physical networking and by the clustering software. If one of the cluster nodes fails, the cluster continues to provide the service by failing over to another node, with minimal disruption to the consumer.

Since the release of Microsoft Windows Server 2008, Microsoft clustering services has been renamed to Windows Server Failover Clustering (WSFC) with a number of significant enhancements.

Due to additional cost and increased complexity, Microsoft clustering technology is typically used by cloud service providers to provide high availability to Tier 1 applications such as Microsoft Exchange mailbox servers or highly available database services for Microsoft SQL Server. However, it can also be used to protect other services, such as a highly available Windows Dynamic Host Configuration Protocol (DHCP) Server or file and print services.

Windows Server Failover Cluster technologies protect services and the application layer against the following types of system failure:

  • Application and service failures, which can affect application software running on the nodes and the essential services they provide.
  • Hardware failures, which affect hardware components such as CPUs, drives, memory, network adapters, and power supplies.
  • Physical site failures in multisite organizations, which can be caused by natural disasters, power outages, or connectivity outages.

The decision to implement a Microsoft clustering solution on top of a vCloud Air Network platform should not be taken without the appropriate consideration and certainly not before addressing all of the design options and business requirements. This implementation adds environmental constraints that might limit other vCloud benefits such as mobility, flexibility, and manageability. It also adds a layer of complexity to the vCloud Air Network platform.

The aim of this vCloud Architectural Toolkit for Service Providers (vCAT-SP) technical blog is to address some of the most important/critical design considerations of running WSFC on top of a vCloud Air Network Service Provider platform. It is not intended to be a step-by-step installation and configuration guide for WSFC. See instead the VMware Setup for Failover Clustering and Microsoft Cluster Service document.

The customer or provider decision to employ Microsoft clustering in a vCloud infrastructure should not be taken lightly. If VMware vSphere High Availability, VMware vSphere Distributed Resource Scheduler, and VMware vSphere SMP-FT can provide a high enough level of availability to meet the application SLAs, why reduce flexibility by implementing a Microsoft clustered application? Having said this, vSphere HA cannot be considered a replacement for WSFC, because vSphere HA is not application-aware. vSphere HA focuses on detecting VMware ESXi host failures over the network and can, if configured to do so, verify whether a virtual machine is still running by checking the heartbeat provided through VMware Tools. Microsoft clustering is application-aware and is aimed at high-end applications with high service availability requirements, such as Microsoft Exchange Mailbox Servers or Microsoft SQL Server.

Also, consider if other alternatives, such as Database Log Shipping, Mirroring, or AlwaysOn Availability Groups for Microsoft SQL Server could meet the availability requirement of the applications. For Microsoft Exchange, technologies such as Database Availability Groups (DAGs) make single copy cluster technology less of a necessity in today’s data center.

Feature Comparison

The decision to use any high availability technology should be defined and driven by the cloud consumer’s requirements for the application or service in question. Inevitably, this depends on the application and whether it is cluster-aware. The majority of common applications are not Microsoft clustering-aware.

As with all design decisions, the architect’s skill in collecting information, correlating it with a solid design, and understanding the trade-offs of different design decisions plays a key role in a successful architecture and implementation. However, a good design is not unnecessarily complex and includes rationales for design decisions. A good design decision about the approach taken to availability balances the organization’s requirements with a robust technical platform. It also involves key stakeholders and the customer’s subject matter experts in every aspect of the design, delivery, testing, and handover.

The following table is not intended to demonstrate a preferred choice to meet your specific application availability requirements, but rather to assist in carrying out an assessment of the advantages, drawbacks, similarities, and differences in the technologies being proposed. In reality, most vCloud Air Network data centers use these technologies both in combination and independently, to provide different applications and services with the highest level of availability possible, while maintaining stability, performance, and operational support from vendors.

Table 1. Advantages and Drawbacks of Microsoft Clustering and VMware Availability Technologies

Advantages of Microsoft Clustering on vSphere | Drawbacks of Microsoft Clustering on vSphere | Advantages of VMware Availability Technologies | Drawbacks of VMware Availability Technologies
Supports application-level awareness. A WSFC application or service will survive a single node operating system failure. While VMware clusters that are using vSphere HA can use virtual machine failure monitoring to provide a certain level of protection against the failure of the guest operating system, you do not have the protection of the application running on the guest operating system, which is provided with WSFC. | Additional cost to deploy and maintain the redundant nodes from an operational maintenance perspective. | Reduced complexity and lower infrastructure implementation effort. vSphere HA and vSphere SMP-FT are extremely simple to enable, configure, and manage, far more so than a WSFC operating system-level cluster. | If vSphere HA fails to recognize a system failure, human intervention is required. With vSphere 5.5, AppHA can potentially work to overcome some of the vSphere HA shortcomings by working with VMware vRealize™ Hyperic to provide application high availability within the vSphere environment. However, this might require additional application development and implementation efforts to support the application-awareness elements. In addition, there is a continuing management and operational overhead of this solution to take into account. The implementation and design of AppHA is beyond the scope of this document. For more information, refer to the VMware App HA documentation page at https://www.vmware.com/support/pubs/appha-pubs.html. Note that with the release of vSphere 6, AppHA is now End of Availability (EOA). Please see http://kb.vmware.com/kb/2108249 for further details.
WSFC minimizes the downtime of applications that should remain available while maintenance patching is performed on the redundant node. A short outage would be required during the obligatory failover event. As a result, WSFC can potentially reduce patching downtime. | Potentially added environment costs for passive node virtual machines. That is, wasted hardware resources utilized on hosts for passive WSFC cluster nodes. | Reduced costs because no redundant node resources are required for vSphere HA. Overall, vSphere HA can allow for higher levels of utilization within an ESXi host cluster than using operating system-level clustering. | You are not able to use vSphere HA or SMP-FT to fail over between systems for performing scheduled patching of the guest operating system or application.
If architected appropriately with vSphere, a virtual implementation of clustered business critical applications can meet the demands for a Tier 1 application that cannot tolerate any periods of downtime. | Reduced use of virtual machine functionality. (There is no VMware vMotion®, DRS, VMware Storage vMotion, VMware vSphere Storage DRS™, or snapshots). This also means no snapshot-based backups can be utilized for full virtual machine backups. While other options are available for backups, a cluster node or full cluster loss could require a full rebuild (extending RTO into days and not hours). | vSphere HA and vSphere SMP-FT do not require any specific license versions of the guest operating system or application in order to make use of their benefits. | vSphere SMP-FT does not protect you against a guest operating system failure. A failure of the operating system in the primary virtual machine will typically result in a failure of the operating system in the secondary virtual machine.
WSFC permits an automated response to either a failed server or application. Typically, no human intervention is required to ensure applications and services continue to run. | Added implementation and operational management complexity for the application and vSphere environment. This requires more experienced application, vSphere, storage, and network administrators to support the cluster services. | Application-agnostic. vSphere HA and SMP-FT are not application-aware and do not require any application layer support to protect the virtual machine and its workloads, unlike operating system clustering, which requires application-level support. | vSphere SMP-FT does not protect you against an application failure. A failure of the application or service on the primary virtual machine will typically result in a failure of that application in the secondary virtual machine.
Potentially faster recovery during failover events than with vSphere HA. Virtual machine reboots might take 30 to 60 seconds before all services are up and running. | Any failover event might require server admin and application admin interaction. This action could be anything from a node reboot to a node rebuild (not self-healing). | Eliminates the need for dedicated standby resources and for installation of additional software and operating system licensing. | A failover event requires the virtual machine to be restarted, which could take 30 to 60 seconds. Applications protected solely by vSphere HA might not be available during this time.
Virtualizing Tier 1 business critical applications can reduce hardware costs by consolidating current WSFC deployments. | SCSI LUN ID limitation. When using RDMs, remember that each presented RDM reserves one LUN ID. There is a maximum of 256 LUN IDs per ESXi host. These can mount up quickly when running multiple WSFC instances on a vSphere host. | vSphere SMP-FT can provide higher levels of availability than are available in most operating system-level clustering solutions today. | Admission control policy requires reserved resources to support host failures in the cluster (% / slot size).
Failback is quick and can be performed once the primary server is fixed and put back online. | In a situation where both nodes have failed, recovery time might be increased greatly due to the added complexity of the vSphere layer. | Supports the full range of virtual machine functionality, which in turn leads to maximized resource utilization. DRS and vMotion provide significant flexibility when it comes to virtual machine placement. Full vSphere functionality can be realized for the servers (that is, snapshots, vMotion, DRS, Storage vMotion and Storage DRS). | Requires additional configuration to support host isolation response and virtual machine monitoring.
WSFC is a supported Microsoft solution, which makes it an obvious choice for Microsoft applications such as SQL Server or Exchange. | Many applications do not support Microsoft clustering. Use cases are typically Microsoft Tier 1 applications, such as SQL Server and Exchange. | vSphere host patching/maintenance can be accomplished without after-hours maintenance and Windows Server or application owner participation. | Reserved capacity and DRS licensing required to facilitate host patching of live systems.
DRS can be employed to determine initial virtual machine placement at power-on. | vSphere host patching and maintenance would still have to be done after hours due to the failover outage and could require application owner participation. | Can support a 99.9% availability SLA. | Can only support a 99.9% availability SLA, which could mean up to 10.1 minutes per week of downtime.

Based on what has been discussed so far, you can see there is additional complexity when introducing Microsoft clustering on a vCloud Air Network platform. As such, one should carefully consider all of the business and technical requirements. The next section discusses the process of gathering those business requirements to make an informed recommendation.

Figure 2. Cost Versus Complexity

Picture2

Establishing Business Requirements

For either the vCloud provider or consumer, the first step in establishing the need to employ Microsoft clustering on the cloud platform is to assess and define the application availability requirements and to understand the impact of downtime on stakeholders, application owners, and most importantly, the end users.

To identify availability requirements for a Microsoft failover cluster, you can use some or all of the following questions. The answers to these questions will help the service provider cloud architect to gather, define, and clarify the deployment goals of the application and services being considered for failover clustering.

  • What applications are considered business critical to the organization’s central purpose? What applications and services do end users require when working?
  • Are there any Service Level Agreements (SLAs) or similar agreements that define service levels for the applications in question?
  • For the service’s end users, what defines a satisfactory level of service for the applications in question?
  • What increments of downtime are considered significant and unacceptable to the business (for example, five seconds, five minutes, or an hour) during peak and non-peak hours? If availability is measured by the customer, how is it measured?

The following table might help establish the requirements for the applications in question.

Availability | Downtime (Year) | Downtime (Month) | Downtime (Week)
90% (1-nine) | 36.5 days/year | 72 hours/month | 16.8 hours/week
99% (2-nines) | 3.65 days/year | 7.20 hours/month | 1.68 hours/week
99.9% (3-nines) | 8.76 hours/year | 43.2 minutes/month | 10.1 minutes/week
99.99% (4-nines) | 52.56 minutes/year | 4.32 minutes/month | 1.01 minutes/week
99.999% (5-nines) | 5.26 minutes/year | 25.9 seconds/month | 6.05 seconds/week
99.9999% (6-nines) | 31.5 seconds/year | 2.59 seconds/month | 0.605 seconds/week
99.99999% (7-nines) | 3.15 seconds/year | 0.259 seconds/month | 0.0605 seconds/week
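
The figures in the table follow from simple arithmetic: the allowed downtime is the unavailable fraction (100% minus the availability target) multiplied by the length of the period. The sketch below reproduces that calculation. It assumes a 365-day year and a 30-day month; a calendar-average month (365/12 days) gives slightly larger monthly figures. The function and constant names are my own and purely illustrative.

```python
# Period lengths in seconds. The 30-day month is an assumption.
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
}


def allowed_downtime_seconds(availability_percent, period):
    """Return the downtime (in seconds) permitted per period for a target."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * PERIOD_SECONDS[period]


if __name__ == "__main__":
    for target in (90.0, 99.0, 99.9, 99.99, 99.999):
        per_week_minutes = allowed_downtime_seconds(target, "week") / 60
        per_year_hours = allowed_downtime_seconds(target, "year") / 3600
        print(f"{target}%: {per_year_hours:.2f} hours/year, "
              f"{per_week_minutes:.2f} minutes/week")
```

Running the same function against a customer's stated target is a quick way to turn an abstract number of nines into a downtime figure that stakeholders can react to.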

Does the cloud consumer have a business requirement for 24-hour, 7-days-a-week availability or is there a working schedule (for example, 9:00 a.m. to 5:00 p.m. on weekdays)? Do the services or applications that are being targeted have the same availability requirements, or are some of them more important than others? Business days, hours of use, and availability requirements can typically be obtained by the service provider from end-user leadership, application owners, and business managers.

For instance, the following table provides a simple business application list along with the end-user requirements for availability and common hours of use. These requirements are important to establish because downtime when an application is not being used, for example overnight, might not negatively impact the application service level agreement.

Application | Business Days | Hours of Use | Availability Requirements
Customer Tracking System | 7 Days | 0700-1900 | 99.999%
Document Management System | 7 Days | 0600-1800 | 99.999%
Microsoft SharePoint (Collaboration) | 7 Days | 0700-1900 | 99.99%
Microsoft Exchange (Email and Collaboration) | 7 Days | 24 Hours | 99.999%
Microsoft Lync (Collaboration) | 7 Days | 24 Hours | 99.99%
Digital Imaging System | 5 Days | 0800-1800 | 99.9%
Document Archiving System | 5 Days | 0800-1800 | 99.9%
Public Facing Web Infrastructure | 7 Days | 24 Hours | 99.999%
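
To show why hours of use matter, the following sketch estimates a weekly downtime budget that counts only the hours an application is actually in use. It rests on an assumption that must be confirmed with the customer: that the SLA is measured only within the stated service hours. The function name is hypothetical; the inputs are taken from the table above.

```python
def weekly_downtime_budget_seconds(days_per_week, hours_per_day,
                                   availability_percent):
    """Downtime budget per week, counting only the hours of use.

    Assumes the availability target applies solely to the service window.
    """
    service_seconds = days_per_week * hours_per_day * 3600
    return service_seconds * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    # (application, business days, hours of use per day, availability %)
    applications = [
        ("Customer Tracking System", 7, 12, 99.999),
        ("Microsoft Exchange", 7, 24, 99.999),
        ("Digital Imaging System", 5, 10, 99.9),
    ]
    for name, days, hours, target in applications:
        budget = weekly_downtime_budget_seconds(days, hours, target)
        print(f"{name}: {budget:.1f} seconds/week inside service hours")
```

Under this assumption the Customer Tracking System has a budget of roughly three seconds per week inside its service window, while the Digital Imaging System has about three minutes. Any maintenance performed outside those windows would not count against the budget.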

It is also important to establish and understand application dependencies. Many of the applications shown in the previous table consist of a number of components including databases, application layer software, web servers, load balancers, and firewalls. To achieve the levels of availability required by the business, a number of techniques must be employed across a range of technologies, not only by clustered services.

  • Do the applications in question have variations in load over time or the business cycle (for example, 9:00 a.m. to 5:00 p.m. on weekdays, monthly, or quarterly)?
  • How many vSphere host servers are available on the vCloud platform for failover clustering and what type of storage is available for use in the cluster or clusters?
  • Is having a disaster recovery option for the services or applications important to the cloud consumer’s organization? What type of infrastructure will be available to support the workload at your recovery site? Is your recovery site cold/hot or a regional data center used by other parts of the business? Is any storage replication technology in place? Have you accounted for the clustered application itself? What steps must be taken to ensure the application is accessible to users/customers if failed over to the recovery site?
  • Is it possible for some of the Microsoft clustered nodes to be placed in a separate vCloud Air Network Service Provider site, an adjacent data center or data center zone to provide an option for disaster recovery if a serious problem develops at the primary site?

When asking these questions of your cloud platform customer, also consider that simply because an application has always been protected with Microsoft clustering in the past does not mean it always has to be in the future. VMware vSphere and the vCloud platform offer several high availability solutions that can be used collectively to support applications where there is a requirement to minimize unplanned downtime. It is important for the provider to examine all options with the consumer and carefully consider and understand the impact of that decision on the application or service.

Microsoft Cluster Configuration Implementation Options

When implementing Microsoft clusters in a vSphere-based vCloud environment, three primary architectural options exist. The choice of the most appropriate design will depend on your specific use case. For instance, if you are looking for a solution to provide high availability in case of a single hardware failure (N+1), hosting both cluster nodes on the same physical host will fail to meet this basic requirement.

In this section, we examine three options and analyze the advantages and drawbacks of each.

Option 1: Cluster-In-A-Box (CIB)

Cluster-In-A-Box (CIB) is a design where the two clustered virtual machine nodes run on the same vSphere ESXi host. In this scenario, the shared disks and quorum can be either local or remote and are shared between the virtual machines within the single host. For instance, you can use VMDKs or RDMs (with their SCSI bus sharing set to virtual mode). The use of RDMs can be beneficial if you decide to migrate one of the virtual machines to another host to create a Cluster-Across-Boxes (CAB) design (described in the next section).

The Cluster-In-A-Box option would most typically be used in test or development environments, because this solution offers no high availability in the event of a host hardware failure.

For CIB deployments, create VM-to-VM affinity rules to keep the cluster nodes together. You will also need additional host-to-VM rule groups, because (depending on the version of vSphere) HA does not consider the VM-to-VM rules when restarting virtual machines in the event of hardware failure. For CIB deployments, the virtual machines must be in the same virtual machine DRS group, which must be assigned to a host DRS group containing two hosts using a “must run on hosts in group” rule.

Figure 3. Option 1 Design Cluster-In-A-Box (CIB)

Picture3

Option 2: Cluster-Across-Boxes (CAB)

Cluster-Across-Boxes (CAB) is the most common scenario and describes the design where a WSFC is employed on two virtual machines that are running across two different physical ESXi hosts. The primary advantage here is that this protects the environment against a hardware failure of a single physical server (N+1). In this design scenario, VMware recommends physical RDMs as the disk choice. The shared storage and quorum should be located on Fibre Channel SAN storage or be available through an in-guest iSCSI initiator.

For CAB deployments, create VM-to-VM anti-affinity rules to keep the cluster nodes apart. These should be “must run” rules, because there is no point in having the two nodes running on the same ESXi host. Again, account for DRS: you will need additional host-to-VM rule groups, because HA does not consider the VM-to-VM rules when restarting virtual machines in the event of hardware failure. For CAB deployments, the VMs must be in different VM DRS groups, and those groups must be assigned to different host DRS groups using a “must run on hosts in group” rule.

Figure 4. Option 2 Design Cluster-Across-Boxes (CAB)

Picture4
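
The placement intent behind the CIB and CAB rules can be expressed as a simple check. The sketch below is a plain Python model, not a call into any vSphere API: it takes a hypothetical mapping of WSFC node virtual machines to ESXi hosts and reports whether the placement matches the chosen design (all nodes together for CIB, every node on a different host for CAB). All names are illustrative.

```python
def placement_violations(node_hosts, design):
    """Check WSFC node placement against a CIB or CAB design.

    node_hosts maps each cluster node VM name to the ESXi host it runs on.
    Returns a list of human-readable problems (empty if compliant).
    """
    hosts = list(node_hosts.values())
    distinct_hosts = set(hosts)
    problems = []
    if design == "CIB":
        # Affinity: every node must share a single host.
        if len(distinct_hosts) > 1:
            problems.append(
                "CIB nodes are spread across hosts: "
                + ", ".join(sorted(distinct_hosts))
            )
    elif design == "CAB":
        # Anti-affinity: no two nodes may share a host.
        if len(distinct_hosts) < len(hosts):
            problems.append("CAB nodes share a host, so N+1 is not met")
    else:
        raise ValueError(f"unknown design: {design!r}")
    return problems


if __name__ == "__main__":
    placement = {"sql-node-a": "esxi-01", "sql-node-b": "esxi-01"}
    print(placement_violations(placement, "CIB"))
    print(placement_violations(placement, "CAB"))
```

A check like this could be run against an inventory export after an HA restart event, which is exactly the situation in which the VM-to-VM rules might not have been honored.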

Option 3: Physical and Virtual Machine

The final typical design scenario is Physical and Virtual Machine (Physical and N+1 VM). This cluster design allows for the primary (active) node of a WSFC cluster to run natively on a bare metal physical server, while the secondary (passive) node runs in a virtual machine. This model can be used to migrate from a physical two-node deployment to a virtualized environment, or as a means of providing N+1 availability with the purchase of a single physical server. With this design, if you need to run on the secondary node during primary business hours, performance-based SLAs might be impacted. However, when you consider that typically a WSFC only runs on the primary node and is only failed over to the secondary node for short periods of time (and outside of business hours for maintenance), this might be a viable option for some use cases. The Physical and N+1 virtual machine model does not require any special affinity rules because one of the nodes is virtual and the other is physical.

Figure 5. Option 3 Design Physical and Virtual Machine

Picture5

VMware recommends physical RDMs as the disk option. Shared storage and quorum disks must be located on Fibre Channel SAN or iSCSI storage or be presented through an in-guest iSCSI initiator. Note that RDMs backed by VMware Virtual SAN are not supported. Refer to http://kb.vmware.com/kb/1037959 for further details.

Conclusion

Design factors are components that are combined and that dictate the outcomes of each design decision. If your customer is looking at virtualizing physical Microsoft Windows clusters on vSphere, you must first assess the impact of using WSFC in your design. Consider the impact on availability, manageability, performance, recoverability, and security.

The use of Microsoft clustering on a vCloud Air Network Platform will add new design requirements, constraints, and risks to the environment. It is crucial that all design factors and their impact on the architecture be addressed at the design stage.

Migrating from physical to virtual cloud platform instances of WSFC offers a significant cost reduction in required hardware, and if architected correctly, can provide the performance and levels of availability to support the most demanding application and the strictest of SLAs. However, it is also important to evaluate other solutions, such as the native high availability features of vSphere, which can be implemented without the high operational costs associated with WSFC. These alternatives can often provide levels of availability that meet the SLAs for the majority of your consumer’s business applications and provide a good alternative to Microsoft clustered implementations, particularly where application-level availability can be used alongside established vSphere technologies.

The decision to use WSFC on a vCloud Air Network Platform should be driven by the workload availability requirements of the end user’s application or service as defined by the customer or application owner. These requirements ultimately drive the decision behind your application availability strategy.

To meet high availability and disaster recovery requirements for cloud consumers using WSFC, it is important for the service provider to:

  • Determine high availability and disaster recovery needs of the applications in question.
  • Examine design requirements, constraints and risks for your customer-specific use cases.
  • Develop a WSFC design strategy for the business and overall solution architecture that can be replicated for different applications within the infrastructure.
  • Choose an appropriate WSFC design and size, and configure the infrastructure components to meet the application’s performance and availability requirements.
  • Follow VMware’s proven technical guidance for WSFC on a vSphere platform.

Reference Documents

Description URL
Microsoft Clustering on VMware vSphere: Guidelines for supported configurations (1037959) http://kb.vmware.com/kb/1037959
MSCS support enhancements in vSphere 5.5 (2052238) http://kb.vmware.com/kb/2052238
Microsoft Cluster Service (MSCS) support on ESXi/ESX (1004617) http://kb.vmware.com/kb/1004617
Windows Server Failover Clustering (WSFC) with SQL Server http://technet.microsoft.com/en-us/library/hh270278.aspx
Setup for Failover Clustering and Microsoft Cluster Service https://pubs.vmware.com/vsphere-60/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-60-setup-mscs.pdf