By: Gregg Robertson, vExpert
Business Continuity and Disaster Recovery (BC/DR) is something every business, no matter how big or small, should be thinking about and planning for. Whilst I was preparing for my VCAP-DCD, and even for my VCDX attempt, BC/DR was a very important topic, because availability and recoverability are two of the AMPRS infrastructure qualities (Availability, Manageability, Performance, Recoverability and Security) that a design has to address.
In my daily role as a consultant, BC/DR is a core component of every virtualization design, whether it is data center virtualization, end-user computing or hybrid cloud. In this four-part blog series, I am going to cover four different ways BC/DR can help you with your small/midsized business (SMB) IT infrastructure. In this second blog, we will cover the benefits of automated high availability, which is built into VMware vSphere as a feature.
Automated High Availability for SMBs
BC/DR requirements can be met with features that have been part of vSphere for years, such as VMware High Availability (HA). Since vSphere 5.0, HA has been rebuilt from the ground up to use the Fault Domain Manager (FDM) agent instead of the legacy AAM agent (Legato Automated Availability Manager). The new agent brings higher resiliency and less complexity: HA can be enabled with as few as five clicks and is installed onto ESXi hosts in seconds rather than the minutes it took previously.
HA protects the virtual machines running on your hosts against host isolation and host failure by restarting the virtual machines from the affected host on the remaining working hosts, thereby bringing your applications and solutions back online as soon as possible. The FDM agent also allows partitioned hosts to elect a master node within the partitioned section and so maintain the uptime of the virtual machines on the affected hosts. Before restarting any virtual machines, HA performs additional checks to confirm that a host really is non-responsive, using datastore heartbeating and any additional isolation addresses that have been set.
HA can also restart a virtual machine if an application inside it fails, through the use of application monitoring. By using the appropriate SDK, or an application that supports VMware application monitoring, you can set up customized heartbeats for your applications.
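The idea behind a customized application heartbeat can be illustrated with a small watchdog. This is only a sketch of the general technique, not the VMware application monitoring SDK: the class `AppHeartbeatMonitor`, its method names, and the 30-second timeout are all hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
class AppHeartbeatMonitor:
    """Hypothetical watchdog: flags an application whose heartbeats stop."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_beat = None  # no heartbeat received yet

    def beat(self, now):
        """Record a heartbeat sent by the application at time `now` (seconds)."""
        self.last_beat = now

    def needs_restart(self, now):
        """Return True once the heartbeat has been silent longer than the timeout."""
        if self.last_beat is None:
            return False  # the application has not started heartbeating yet
        return (now - self.last_beat) > self.timeout_seconds


monitor = AppHeartbeatMonitor(timeout_seconds=30)  # timeout value is an assumption
monitor.beat(now=0)
print(monitor.needs_restart(now=20))  # False: last heartbeat was 20 seconds ago
print(monitor.needs_restart(now=45))  # True: 45 seconds of silence exceeds 30
```

In a real deployment the heartbeats and the restart action are handled by HA and the monitored application; the sketch only shows the timeout logic that such monitoring relies on.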
vSphere HA has several advantages over traditional failover solutions, including:
Minimal setup – After a vSphere HA cluster is set up, all virtual machines in the cluster get failover support without additional configuration.
Reduced hardware cost and setup – The virtual machine acts as a portable container for the applications and it can be moved among hosts. Administrators avoid duplicate configurations on multiple machines. When you use vSphere HA, you must have sufficient spare resources to restart the virtual machines from the number of hosts you want to protect against failure. However, the vCenter Server system automatically manages resources and configures clusters.
Increased application availability – Any application running inside a virtual machine has access to increased availability. Because the virtual machine can recover from hardware failure, all applications that start at boot have increased availability without increased computing needs, even if the application is not itself a clustered application. By monitoring and responding to VMware Tools heartbeats and restarting nonresponsive virtual machines, vSphere HA also protects against guest operating system crashes.
Distributed Resource Scheduler (DRS) and vMotion integration – If a host fails and virtual machines are restarted on other hosts, DRS can provide migration recommendations or migrate virtual machines for balanced resource allocation. If one or both of the source and destination hosts of a migration fail, vSphere HA can help recover from that failure.
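The requirement above to have sufficient resources for failover can be sanity-checked with some simple arithmetic. The following is a minimal sketch under stated assumptions, not how vSphere HA admission control actually calculates capacity: it considers memory only, assumes the largest hosts fail first, and ignores CPU, reservations, and whether each individual virtual machine fits on a single surviving host. The host names and sizes are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    memory_gb: int  # usable memory on the host (illustrative figure)


def can_tolerate_failures(hosts, vm_memory_gb, failures_to_tolerate):
    """Return True if all VMs still fit in memory after the largest hosts fail."""
    if failures_to_tolerate >= len(hosts):
        return False  # no hosts would be left to run anything
    # Worst case: the hosts with the most memory are the ones that fail.
    survivors = sorted(hosts, key=lambda h: h.memory_gb)[: len(hosts) - failures_to_tolerate]
    surviving_capacity = sum(h.memory_gb for h in survivors)
    return surviving_capacity >= sum(vm_memory_gb)


hosts = [Host("esxi-01", 128), Host("esxi-02", 128), Host("esxi-03", 128)]
vm_memory_gb = [32, 32, 48, 64, 64]  # 240 GB of configured VM memory in total

print(can_tolerate_failures(hosts, vm_memory_gb, 1))  # True: 256 GB remain for 240 GB
print(can_tolerate_failures(hosts, vm_memory_gb, 2))  # False: only 128 GB remain
```

For a three-host SMB cluster like this one, the check shows that the environment can absorb one host failure but not two, which is the kind of answer a design needs before HA is enabled.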
High Availability Overview
Fault Domain Manager Agent
HA’s architecture is fairly simple: the FDM agent is installed on each ESXi host within a vSphere cluster that has HA enabled. As of vSphere 5.0, there is only a single master node, and all the remaining hosts within the cluster are slaves, which report their health to the master node as well as to the vCenter Server. This is unlike HA in versions prior to vSphere 5.0, which used primary and secondary nodes, with a limit of five primary nodes and the need to keep at least one primary node available. The diagram below shows a simplified view of the FDM agent on each host and the allocation of the master and slave roles to the hosts.
As of vSphere 5.0, HA uses two different heartbeat mechanisms to verify the health of the ESXi hosts within the HA-enabled cluster. The first of these is datastore heartbeating, a new feature as of vSphere 5.0. Datastore heartbeating adds an additional check in which HA uses the existing VMFS file system locking mechanism to create a heartbeat region. The heartbeat region is where at least one file per host is kept open on each selected heartbeat datastore (the default is two datastores). HA checks whether the heartbeat region has been updated; if it has, the host still has storage connectivity and the virtual machines on it do not need to be restarted elsewhere. The diagram below shows the selection of three datastores and that currently, only two of the hosts within the cluster are attached to the two datastores. Good design practice is to let HA select the datastores, as HA will choose the datastores with the most connected hosts and, where applicable, both NFS and FC/iSCSI datastores for added resiliency.
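The way the datastore heartbeat acts as a second opinion can be sketched as a small decision function. This is a simplified illustration, not the FDM agent's actual logic: it assumes the other heartbeat mechanism is the network heartbeat exchanged between the HA agents, it ignores the configured isolation response, and the names `HostState`, `classify_host` and `should_restart_vms` are hypothetical.

```python
from enum import Enum


class HostState(Enum):
    HEALTHY = "healthy"
    ISOLATED_OR_PARTITIONED = "isolated or partitioned"
    FAILED = "failed"


def classify_host(network_heartbeat_ok, datastore_heartbeat_ok):
    """Classify a host from the two heartbeat checks (simplified model)."""
    if network_heartbeat_ok:
        return HostState.HEALTHY
    if datastore_heartbeat_ok:
        # Network heartbeats are missing, but the heartbeat region is still
        # being updated, so the host is alive and has storage connectivity.
        return HostState.ISOLATED_OR_PARTITIONED
    return HostState.FAILED


def should_restart_vms(state):
    """Only a host that fails both checks has its VMs restarted elsewhere."""
    return state is HostState.FAILED


state = classify_host(network_heartbeat_ok=False, datastore_heartbeat_ok=True)
print(state.value, should_restart_vms(state))  # isolated or partitioned False

state = classify_host(network_heartbeat_ok=False, datastore_heartbeat_ok=False)
print(state.value, should_restart_vms(state))  # failed True
```

The point of the second check is visible in the first example: without the datastore heartbeat, a host that has only lost its network connection would look identical to one that has failed completely.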