This post originally appeared on the vSphere blog.
Recently I’ve participated in a number of discussions where interesting questions came up about Virtual SAN and vSphere HA interoperability and behavior.
For the most part, the discussions have centered on how vSphere HA handles network partitions and isolation events in Virtual SAN enabled clusters. Those questions call for a bit more detail in order to give adequate technical guidance.
Before diving into the details, let me start with an official statement about Virtual SAN and vSphere HA. vSphere HA fully supports and is integrated with Virtual SAN. This support required some changes in vSphere HA which impact vSphere HA behavior and result in some unique Virtual SAN related configuration considerations for vSphere HA.
In this post I will detail the following information and recommendations:
- Architecture Changes Impacting Isolation and Partition Support
- Heartbeat Datastore Recommendations
- Host Isolation Address Recommendations
- Isolation Response Recommendations
Architecture Changes Impacting Isolation and Partition Support
In vSphere 5.5, whenever HA is enabled in a cluster that is also enabled for Virtual SAN, the vSphere HA FDM agents and their heartbeat monitoring operations use the Virtual SAN network instead of the Management Network(s).
These modifications to the design and behavior of vSphere HA were made to prevent network partition events in which the HA partitions and the Virtual SAN partitions do not line up, as illustrated here:
- HA partition A: hosts esxi-01, 05, 06
- Virtual SAN partition A: hosts esxi-01, 02, 03
- HA partition B: hosts esxi-02, 03, 04
- Virtual SAN partition B: hosts esxi-04, 05, 06
Such partitions are hard to reason about and troubleshoot. They would also have required significant additional HA logic to support.
The “same networks” constraint leads to simpler partitions. The desired and actual behavior is illustrated here:
- HA & Virtual SAN partition A: hosts esxi-01, 02, 03
- HA & Virtual SAN partition B: hosts esxi-04, 05, 06
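The difference between the two illustrations can be checked mechanically. The following is a minimal sketch, not anything vSphere HA itself runs: the function name `partitions_match` and the data layout are assumptions made for illustration, and the host names are simply the ones used in the lists above.

```python
def partitions_match(ha_partitions, vsan_partitions):
    """Return True when the HA partitions and the Virtual SAN partitions
    group the hosts in exactly the same way."""
    ha = {frozenset(p) for p in ha_partitions}
    vsan = {frozenset(p) for p in vsan_partitions}
    return ha == vsan


# First illustration: HA and Virtual SAN are partitioned differently.
mismatched_ha = [
    {"esxi-01", "esxi-05", "esxi-06"},
    {"esxi-02", "esxi-03", "esxi-04"},
]
mismatched_vsan = [
    {"esxi-01", "esxi-02", "esxi-03"},
    {"esxi-04", "esxi-05", "esxi-06"},
]

# Second illustration: both share the same network, so partitions coincide.
aligned = [
    {"esxi-01", "esxi-02", "esxi-03"},
    {"esxi-04", "esxi-05", "esxi-06"},
]

print(partitions_match(mismatched_ha, mismatched_vsan))  # False
print(partitions_match(aligned, aligned))                # True
```

When both services share one network, a network failure splits them along the same boundary, so the second case is the only one that can occur.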
In a vSphere 5.5 Virtual SAN enabled cluster, the vSphere HA agents do not use Virtual SAN datastores as a means of monitoring partitioned or isolated hosts. This is because during a partition or isolation event the impacted hosts would not be able to access the heartbeat information stored on the Virtual SAN datastore.
In a scenario where a partition event occurs, the heartbeat information would be accessible to only one segment of the cluster, defeating the purpose. In partition or isolation scenarios Virtual SAN uses a proprietary mechanism that prevents data corruption and, as a byproduct, would prevent the heartbeat information from being accessible to every host.
Heartbeat Datastore Recommendations
Heartbeat datastores are not necessary in a Virtual SAN cluster but, as in a non-Virtual SAN cluster, they can provide additional benefits if available. VMware recommends provisioning Heartbeat datastores when the benefits they provide are sufficient to warrant any additional provisioning costs.
For example, if you are using converged networking, provisioning Heartbeat datastores can be quite expensive since separate switch infrastructure should be used for providing each host with access to a fault-isolated datastore. Hence, there is a higher cost to realizing the benefits. However, if you already have multiple physical networks, the cost of setting up an iSCSI or NFS datastore could be much lower.
Heartbeat Datastores provide the following benefits:
- They allow vCenter to report the actual state of a partitioned or isolated host rather than reporting that it appears to have failed.
- For non-Virtual SAN VMs, they increase the likelihood that an FDM master will respond to a VM that fails after its host becomes partitioned or isolated.
- They prevent vSphere HA from causing VM MAC address conflicts on the VM network after a host isolation or partition when the VM network is not affected by the event. The conflict will exist until the original instance is powered off, which for a partition would occur automatically only after the partition was resolved.
Heartbeat datastores provide these benefits by allowing an FDM master agent to determine if a non-responsive host is isolated, partitioned or dead and if alive, which VMs are running on that host.
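To make that determination concrete, here is a deliberately simplified sketch of the decision. It is not the FDM implementation: the function `classify_host`, its parameters, and in particular the `isolation_flag` input (standing in for however an isolated host signals its state through the datastore) are assumptions made purely for illustration.

```python
from typing import Optional


def classify_host(network_heartbeat: bool,
                  datastore_heartbeat: Optional[bool],
                  isolation_flag: bool = False) -> str:
    """Simplified, hypothetical view of how a master might classify a host.

    datastore_heartbeat is None when no heartbeat datastore is configured.
    """
    if network_heartbeat:
        return "live"
    if datastore_heartbeat is None:
        # Nothing else to consult: the host can only be reported as
        # appearing to have failed.
        return "appears failed"
    if not datastore_heartbeat:
        return "dead"
    # The host is still updating the datastore, so it is alive but
    # unreachable over the network.
    return "isolated" if isolation_flag else "partitioned"


print(classify_host(False, None))        # appears failed
print(classify_host(False, False))       # dead
print(classify_host(False, True))        # partitioned
print(classify_host(False, True, True))  # isolated
```

The point of the sketch is the second branch: without a Heartbeat datastore the master has no second source of information, which is why the host can only be reported as appearing to have failed.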
Only use a datastore that all hosts will be able to access during a Virtual SAN network partition or isolation event. If you are already using a non-Virtual SAN datastore in a Virtual SAN cluster, there is no need to add another datastore just for heartbeating if the existing datastores are fault isolated from the Virtual SAN network.
To that end, look at your design holistically. For example, if you add an iSCSI datastore as a Heartbeat datastore and your Management, Virtual SAN and iSCSI VMkernel interfaces all use the same 10GbE link, you won’t get the same benefit as you would if the Virtual SAN and iSCSI interfaces used different links. In the shared-link case, if the 10GbE link fails, then even with a Heartbeat datastore the FDM master won’t be able to determine whether non-responsive slaves are isolated or dead.
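A quick way to sanity-check a design for this is to list which physical links each VMkernel interface uses and look for overlap. The sketch below assumes a hand-written inventory; the interface labels, the link names such as `10GbE-0`, and the function `is_fault_isolated` are all hypothetical and are not pulled from any vSphere API.

```python
def is_fault_isolated(interface_links, heartbeat_iface="iscsi", vsan_iface="vsan"):
    """Return True when the heartbeat datastore interface shares no
    physical link with the Virtual SAN interface."""
    shared = set(interface_links[heartbeat_iface]) & set(interface_links[vsan_iface])
    return not shared


# Converged design: everything rides on the same 10GbE link.
converged = {
    "management": ["10GbE-0"],
    "vsan": ["10GbE-0"],
    "iscsi": ["10GbE-0"],
}

# Separated design: Virtual SAN and iSCSI use different links.
separated = {
    "management": ["1GbE-0"],
    "vsan": ["10GbE-0"],
    "iscsi": ["10GbE-1"],
}

print(is_fault_isolated(converged))  # False
print(is_fault_isolated(separated))  # True
```

A result of False means a single link failure can take out both the Virtual SAN network and the path to the Heartbeat datastore, which is the situation described above.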