By: Gregg Robertson, vExpert
Business Continuity and Disaster Recovery (BC/DR) is something every business, no matter how big or small, should be thinking about and planning for. Whilst preparing for my VCAP-DCD and even for my VCDX attempt, BC/DR was a very important topic, because designs should show their impact on the AMPRS infrastructure qualities (Availability, Manageability, Performance, Recoverability and Security), two of which are availability and recoverability.
In my daily role as a consultant, BC/DR is a core component of every virtualization design, whether it is data center virtualization, end-user computing or hybrid cloud. In this four-part blog series, I am going to cover four different ways BC/DR can help your small/midsized business (SMB) IT infrastructure. In this second blog, we will cover the benefits of automated high availability, which is built into VMware vSphere as a feature.
Automated High Availability For SMBs
BC/DR requirements can be met with features that have been part of vSphere for years, such as VMware High Availability (HA). Since vSphere 5.0, HA has been rebuilt from the ground up to use the Fault Domain Manager (FDM) agent instead of the legacy AAM agent (Legato Automated Availability Manager). The new agent brings higher resiliency and less complexity: HA can be enabled with as few as five clicks and installed onto ESXi hosts in seconds, rather than the minutes it took previously. HA protects the virtual machines running on your hosts against host isolation and host failure by restarting the affected host's virtual machines on the remaining healthy hosts, bringing your applications and solutions back online as soon as possible. With the FDM agent, hosts that become partitioned can also elect a master node within their partition and maintain the uptime of the virtual machines on those hosts. HA also provides additional checks, through datastore heartbeating and the configuration of additional isolation addresses, to ensure that hosts really are unresponsive before it restarts their virtual machines elsewhere.
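To illustrate how a partition can end up with its own master, here is a minimal Python sketch of an election. It assumes the rule commonly described for FDM in vSphere 5.x: the host connected to the most datastores wins, and ties go to the host with the highest managed object ID (moID) compared as a string. The `Host` class, host names and moIDs are hypothetical; this is not VMware code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    # Hypothetical host record: display name, vCenter managed object ID
    # (moID) and the number of datastores the host can currently see.
    name: str
    moid: str
    datastores: int


def elect_master(partition: List[Host]) -> Host:
    """Pick a master for one network partition (illustrative sketch).

    The host connected to the most datastores wins; ties go to the
    lexically highest moID, so "host-99" beats "host-100" because the
    IDs are compared as strings, not as numbers.
    """
    if not partition:
        raise ValueError("a partition needs at least one host")
    return max(partition, key=lambda h: (h.datastores, h.moid))


# After a network split, each side elects its own master.
partition_a = [Host("esx01", "host-100", 4), Host("esx02", "host-99", 4)]
partition_b = [Host("esx03", "host-12", 3), Host("esx04", "host-15", 2)]

print(elect_master(partition_a).name)  # esx02 (tie on datastores, "host-99" > "host-100")
print(elect_master(partition_b).name)  # esx03 (more datastores)
```

Because each partition runs the same rule independently, both sides of a split get a master without needing to talk to each other.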
HA can also restart a virtual machine when the application running inside it fails, through application monitoring. By using the appropriate SDK, or an application that supports VMware application monitoring, HA can set up customized heartbeats for your applications.
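The idea behind application heartbeats can be sketched as a simple watchdog: the application reports in regularly, and if it goes quiet for longer than a failure interval, it is flagged for restart. This is a hypothetical Python illustration only; the real feature works through VMware Tools and the application monitoring SDK, and the class name and `failure_interval` parameter here are invented.

```python
from typing import Optional


class AppHeartbeatMonitor:
    """Hypothetical watchdog for application heartbeats (not the VMware SDK).

    Time is passed in explicitly as seconds so the logic is easy to follow
    and to test.
    """

    def __init__(self, failure_interval: float) -> None:
        self.failure_interval = failure_interval
        self.last_heartbeat: Optional[float] = None

    def heartbeat(self, now: float) -> None:
        # Called by the application whenever it considers itself healthy.
        self.last_heartbeat = now

    def needs_restart(self, now: float) -> bool:
        # Monitoring only starts once the first heartbeat has arrived.
        if self.last_heartbeat is None:
            return False
        return now - self.last_heartbeat > self.failure_interval


monitor = AppHeartbeatMonitor(failure_interval=30)
monitor.heartbeat(now=0)
monitor.heartbeat(now=10)
print(monitor.needs_restart(now=35))  # False: last heartbeat was 25 s ago
print(monitor.needs_restart(now=45))  # True: 35 s without a heartbeat
```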
vSphere HA has several advantages over traditional failover solutions, including:
Minimal setup – After a vSphere HA cluster is set up, all virtual machines in the cluster get failover support without additional configuration.
Reduced hardware cost and setup – The virtual machine acts as a portable container for the applications and it can be moved among hosts. Administrators avoid duplicate configurations on multiple machines. When you use vSphere HA, you must have sufficient resources to fail over the number of hosts you want to protect with vSphere HA. However, the vCenter Server system automatically manages resources and configures clusters.
Increased application availability – Any application running inside a virtual machine has access to increased availability. Because the virtual machine can recover from hardware failure, all applications that start at boot have increased availability without increased computing needs, even if the application is not itself a clustered application. By monitoring and responding to VMware Tools heartbeats and restarting nonresponsive virtual machines, vSphere HA also protects against guest operating system crashes.
Distributed Resource Scheduler (DRS) and vMotion integration – If a host fails and virtual machines are restarted on other hosts, DRS can provide migration recommendations or migrate virtual machines for balanced resource allocation. If one or both of the source and destination hosts of a migration fail, vSphere HA can help recover from that failure.
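To make the point about needing sufficient resources concrete, the following Python sketch shows one way surviving hosts could absorb a failed host's virtual machines. It is a hypothetical greedy placement (largest VM first, onto the host with the most free memory), not the algorithm vSphere HA actually uses; host names, VM names and memory sizes are made up, and only memory is considered.

```python
from typing import Dict, Optional


def place_restarts(failed_vms: Dict[str, int],
                   free_mb: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Greedy restart placement (hypothetical, not HA's real algorithm).

    failed_vms maps VM name -> memory needed (MB); free_mb maps surviving
    host -> free memory (MB). Larger VMs are placed first, each on the host
    with the most free memory. A VM that fits nowhere maps to None.
    """
    free = dict(free_mb)  # work on a copy so the caller's dict is unchanged
    placement: Dict[str, Optional[str]] = {}
    for vm, need in sorted(failed_vms.items(), key=lambda kv: kv[1], reverse=True):
        host = max(free, key=free.get) if free else None
        if host is not None and free[host] >= need:
            free[host] -= need
            placement[vm] = host
        else:
            placement[vm] = None
    return placement


failed = {"sql01": 16384, "web01": 4096, "web02": 4096}
survivors = {"esx02": 20000, "esx03": 12000}
print(place_restarts(failed, survivors))
```

If any VM maps to None, the cluster did not have enough spare capacity to restart everything, which is the situation admission control is designed to prevent.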
High Availability Overview
Fault Domain Manager Agent
HA’s architecture is fairly simple: the FDM agent is installed on each ESXi host within a vSphere cluster that has HA enabled. As of vSphere 5.0, there is a single master node, and all the remaining hosts within the cluster are slaves that report their health to the master node, which in turn communicates with vCenter Server. This is unlike HA in versions prior to vSphere 5.0, which used primary and secondary nodes, limiting you to five primary nodes and requiring at least one primary node to be available. The diagram below shows a simplistic view of the FDM agent on each host and the allocation of the master and slave roles to the hosts.
As of vSphere 5.0, HA uses two different heartbeat mechanisms to check the health of the ESXi hosts within an HA-enabled cluster. The first, new in vSphere 5.0, is datastore heartbeating. Datastore heartbeating adds an additional check in which HA uses the heartbeat region created by the existing VMFS file locking mechanism. In the heartbeat region, at least one file per host is kept open on each selected heartbeat datastore (two datastores by default). HA checks whether the heartbeat region has been updated; if it has, the host still has storage connectivity, so the virtual machines on that host don’t need to be restarted elsewhere. The diagram below shows the selection of three datastores and that, currently, only two of the hosts within the cluster are attached to the two datastores. Good design practice is to let HA select the datastores, as HA will choose the datastores connected to the most hosts and, where applicable, both NFS and FC/iSCSI datastores for added resiliency.
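The decision datastore heartbeating enables can be sketched from the master’s point of view: a slave that stops sending network heartbeats is only treated as failed if its heartbeat region on every heartbeat datastore has also stopped changing. This Python sketch is a simplification under that assumption; the real agent performs further checks and timing that are omitted here, and the datastore names are hypothetical.

```python
from enum import Enum
from typing import Dict


class HostState(Enum):
    HEALTHY = "healthy"
    ALIVE_NO_NETWORK = "isolated or partitioned"  # running, but unreachable
    FAILED = "failed"


def assess_slave(network_heartbeat_ok: bool,
                 region_updated: Dict[str, bool]) -> HostState:
    """Simplified master-side view of one slave (illustrative only).

    region_updated maps each heartbeat datastore to whether the slave's
    heartbeat region on it has changed since the last check.
    """
    if network_heartbeat_ok:
        return HostState.HEALTHY
    if any(region_updated.values()):
        # Still writing to storage: the host is up, so its VMs stay put.
        return HostState.ALIVE_NO_NETWORK
    # Neither network nor datastore heartbeats: restart its VMs elsewhere.
    return HostState.FAILED


print(assess_slave(False, {"ds-nfs01": True, "ds-fc02": False}).value)
print(assess_slave(False, {"ds-nfs01": False, "ds-fc02": False}).value)
```

Using two heartbeat datastores means a single storage path or array problem is less likely to make a live host look dead.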
The other mechanism is standard network heartbeating, as I mentioned earlier in this blog: slaves send heartbeats to the master over the heartbeat network, and the master sends heartbeats to the slaves. When a slave stops receiving heartbeats from its master, it tries to determine whether it is itself isolated or partitioned, or whether the master is isolated or has failed. To learn more about the isolated, partitioned and failed host states, the vSphere documentation on host failure types and detection describes them perfectly, as does the vSphere 5.1 Clustering Deepdive book by Duncan Epping and Frank Denneman.
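A slave’s reasoning when master heartbeats stop can be sketched roughly as follows, assuming the behaviour described above: if the host can still see election traffic from other hosts or get a reply from any configured isolation address, its network works, so it takes part in electing a new master; otherwise it considers itself isolated. This is a simplified Python illustration; the real agent also applies timers and the configured isolation response, which are left out, and the function name is hypothetical.

```python
from typing import List


def slave_view(master_heartbeat_ok: bool,
               election_traffic_seen: bool,
               isolation_pings: List[bool]) -> str:
    """How a slave might classify its own situation (simplified sketch).

    isolation_pings holds one result per configured isolation address.
    """
    if master_heartbeat_ok:
        return "connected"
    if election_traffic_seen or any(isolation_pings):
        # The network still works for this host, so either the master has
        # failed or the cluster is partitioned: take part in an election.
        return "elect new master"
    # No master, no election peers and no isolation address answers.
    return "isolated"


print(slave_view(False, False, [False, True]))   # one isolation address answers
print(slave_view(False, False, [False, False]))  # nothing answers
```

Configuring additional isolation addresses makes it less likely that a single unreachable gateway causes a healthy host to declare itself isolated.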
High Availability Installation
Installing HA is as simple as ticking the Turn ON vSphere HA box when creating a vSphere cluster, or going into the settings of an existing cluster and enabling HA from the cluster settings panel, as shown below.
Selecting Enable admission control lets the admission control mechanism reserve a specified percentage of cluster resources, or a specified number of hosts’ worth of resources, as failover capacity. I won’t go into all the different options and their permutations, as there are many, but the capabilities and settings of HA are defined and explained in depth in the vSphere 5.1 Clustering Deepdive book by Duncan Epping and Frank Denneman.
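The percentage-based option comes down to a little arithmetic. The Python sketch below assumes the check compares the cluster’s unreserved CPU and memory, after adding the new VM’s reservations, against the configured failover percentage. It deliberately ignores per-VM memory overhead and default minimum reservations, and the cluster sizes are made up, so treat it as an illustration rather than the exact vSphere calculation.

```python
def failover_capacity_pct(total: int, reserved: int) -> float:
    """Unreserved share of a resource, as a percentage of the cluster total."""
    return (total - reserved) / total * 100


def can_power_on(total_mhz: int, total_mb: int,
                 reserved_mhz: int, reserved_mb: int,
                 vm_mhz: int, vm_mb: int,
                 failover_pct: float) -> bool:
    """Percentage-based admission control check (simplified sketch).

    The power-on is allowed only if, after adding the VM's reservations,
    both CPU and memory still leave at least failover_pct of the cluster
    unreserved. Memory overhead and default minimum reservations are ignored.
    """
    cpu_left = failover_capacity_pct(total_mhz, reserved_mhz + vm_mhz)
    mem_left = failover_capacity_pct(total_mb, reserved_mb + vm_mb)
    return cpu_left >= failover_pct and mem_left >= failover_pct


# Hypothetical three-host cluster: 20,000 MHz and 65,536 MB per host,
# with 25% of resources reserved for failover.
TOTAL_MHZ, TOTAL_MB = 3 * 20_000, 3 * 65_536
print(can_power_on(TOTAL_MHZ, TOTAL_MB, 40_000, 100_000, 4_000, 8_192, 25))  # True
print(can_power_on(TOTAL_MHZ, TOTAL_MB, 40_000, 100_000, 6_000, 8_192, 25))  # False
```

In the second call the extra 2,000 MHz pushes unreserved CPU down to about 23%, below the 25% kept aside for failover, so the power-on is refused even though memory is plentiful.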
Conclusion: High availability benefits for SMBs
VMware vSphere High Availability can deliver three nines (99.9%) of availability, which is the sweet spot for SMB customers looking for automated and intelligent failover of their virtual machines when hosts fail. The multiple heartbeat checks, together with the assurance from admission control that resources are set aside in case one or more hosts fail, make HA a brilliant solution for SMBs.
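To put three nines in perspective, the yearly downtime budget is simply the unavailable fraction multiplied by the hours in a year; for 99.9% that is 8,760 × 0.001 = 8.76 hours. The short Python snippet below does this calculation for a few availability levels, assuming a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def max_downtime_hours(availability_pct: float) -> float:
    """Yearly downtime budget implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% available -> {max_downtime_hours(pct):.2f} hours of downtime per year")
```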
Look out for the third part of this series, where I will cover how BC/DR using vSphere Replication can help your SMB.
Gregg Robertson is a senior consultant, professional blogger, vExpert 2011 – 2014, VCAP5-DCA/DCD, VCP-Cloud, VCP 3/4/5, VMware communities moderator and co-host of the EMEA vBrownbag weekly webinars/podcasts. Gregg’s blog, TheSaffaGeek, started as a place to write down fixes plus VMware certification links and resources, but has quickly found a large following of readers and subscribers.
* HA Architecture diagram from vSphere 5.0 Clustering Deepdive book by Duncan Epping and Frank Denneman