One of the many noticeable changes introduced with the vCloud Networking and Security 5.1 release is the availability of different sizes of Edge Gateway appliances: compact, large, and x-large. Large and x-large Edge Gateway appliances are deployed with 2 vCPUs. vSphere Fault Tolerance is currently not supported for workloads with multiple vCPUs. To ensure high availability (HA), 5.1 implements a built-in HA mechanism in which two Edge Gateways are deployed to work in Active-Standby mode. vSphere HA can be used in conjunction with Edge Gateway HA to handle host failure scenarios.
The vCloud Networking and Security Manager manages the life cycle of both the Active and Standby Edge Gateway instances and pushes user configurations to both simultaneously as they are made. The Active Edge Gateway pushes run-time state information to the Standby Edge Gateway. It is a best practice to create the primary and secondary Edge Gateway appliances on separate resource pools and data stores. When the data store is shared across all hosts in the cluster, the vCloud Networking and Security Manager deploys the active and standby Edge Gateway appliances on different hosts and sets up an anti-affinity rule to keep the virtual machines apart, as shown below. This ensures that the two HA Edge Gateway virtual machines are not on the same ESXi host even after a host failure. If the data store is local storage, both virtual machines are deployed on the same host.
Edge Gateway HA Communication
Edge Gateway peers exchange heartbeat messages and synchronize run-time state with each other over one of the internal interfaces. For the HA service to run, at least one internal interface/network must be configured. Each Edge Gateway has a designated IP address for communicating with its peer. This IP address is for HA purposes only and cannot be used for any other service. The firewall rule that allows HA heartbeat and state synchronization traffic is auto-plumbed by the vCloud Networking and Security Manager, as indicated below. By default, link-local addresses are used for HA communication, e.g. 169.254.0.5/30 and 169.254.0.6/30, as highlighted below.
Heartbeat messages are sent at 1-second intervals, and the default declared dead time is 6 seconds. The declared dead time parameter can be modified using “Change HA configuration” as shown below. By default, Active and Standby use one of the internal interfaces for exchanging HA heartbeat and state synchronization messages. A specific or dedicated internal interface can be assigned for this communication by using the drop-down option in “Change HA configuration” shown below.
By default, link-local IP addresses are used for active and standby heartbeat and state synchronization communication. Users can configure separate management IP addresses for this purpose, as indicated below. Manually entered HA management IP addresses must be on the same subnet and must not overlap with any other interface subnet.
The Active Edge Gateway shows all interface IP addresses in the Summary tab of the virtual machine, whereas the Standby Edge Gateway shows only the HA interface address used for heartbeat and state synchronization.
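The relationship between the heartbeat interval and the declared dead time can be illustrated with a small sketch. This is not the Edge Gateway implementation: the `HeartbeatMonitor` class, its method names, and the choice to treat the peer as dead once the elapsed time reaches (rather than exceeds) the dead time are all assumptions made for illustration. Only the 1-second interval and 6-second default come from the text above.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL_SECS = 1.0  # from the text: heartbeats are sent every second
DEFAULT_DEAD_TIME_SECS = 6.0   # from the text: default declared dead time


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper that tracks the last heartbeat seen from the peer."""

    dead_time_secs: float = DEFAULT_DEAD_TIME_SECS
    last_heartbeat_at: float = 0.0

    def record_heartbeat(self, now: float) -> None:
        # Remember the arrival time of the most recent heartbeat.
        self.last_heartbeat_at = now

    def missed_heartbeats(self, now: float) -> int:
        # Number of whole heartbeat intervals elapsed since the last heartbeat.
        return int((now - self.last_heartbeat_at) // HEARTBEAT_INTERVAL_SECS)

    def peer_declared_dead(self, now: float) -> bool:
        # Assumption: the peer is declared dead once the silence reaches the dead time.
        return (now - self.last_heartbeat_at) >= self.dead_time_secs


monitor = HeartbeatMonitor()
monitor.record_heartbeat(now=10.0)
print(monitor.peer_declared_dead(now=15.0))  # still within the dead time
print(monitor.peer_declared_dead(now=16.0))  # dead time reached
```

With the defaults, roughly six consecutive heartbeats must be missed before the peer is declared dead; raising the dead time tolerates longer interruptions at the cost of slower failure detection.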
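The two rules for manually entered HA management addresses (same subnet, no overlap with other interface subnets) can be checked ahead of time with Python's standard `ipaddress` module. The sketch below is only a pre-check one might run before entering the values; the function name `validate_ha_addresses` and the sample interface subnets are hypothetical, and the product may apply additional validation not covered here.

```python
import ipaddress


def validate_ha_addresses(active_cidr, standby_cidr, other_interface_cidrs):
    """Return a list of rule violations; an empty list means both rules hold."""
    active = ipaddress.ip_interface(active_cidr)
    standby = ipaddress.ip_interface(standby_cidr)
    problems = []

    # Rule 1: both HA management addresses must be on the same subnet.
    if active.network != standby.network:
        problems.append("HA addresses are not on the same subnet")

    # Rule 2: the HA subnet must not overlap any other interface subnet.
    for cidr in other_interface_cidrs:
        other = ipaddress.ip_interface(cidr).network
        if other.overlaps(active.network) or other.overlaps(standby.network):
            problems.append(f"HA subnet overlaps interface subnet {other}")

    return problems


# The default link-local pair from the text, checked against a made-up uplink subnet.
print(validate_ha_addresses("169.254.0.5/30", "169.254.0.6/30", ["192.168.1.1/24"]))

# A pair that collides with an existing interface subnet.
print(validate_ha_addresses("192.168.1.5/30", "192.168.1.6/30", ["192.168.1.1/24"]))
```

The default addresses 169.254.0.5/30 and 169.254.0.6/30 both fall in 169.254.0.4/30, which is why they pass the same-subnet check.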
Active Edge Gateway:
The Edge Gateway high availability service status, unit state, internal interface used for heartbeat, HA polling parameters, etc. are displayed by the CLI command “show service highavailability” as shown below. This command can be executed after logging in to the Edge Gateway through the virtual machine console or SSH.
Other Edge Gateway HA CLIs
- show service highavailability link
- show service highavailability connection-sync
- show configuration highavailability
Edge Gateway Standby to Active Switch-over
When heartbeat messages from the active Edge Gateway go missing, the standby Edge Gateway sends ARP probes on all configured internal interfaces (the HA interface as well as other internal interfaces) to confirm that the peer is down or completely isolated. If both nodes are still partially connected through other (non-HA) interfaces, the active and standby remain in their roles and no switch-over occurs. If the standby Edge Gateway gets no reply to any of its probes, it assumes that the peer is down and becomes Active. The switch-over takes around 10 seconds after the peer is detected as down.
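The decision the standby makes can be summarized as a small function. This is a simplified model of the behavior described above, not the actual Edge Gateway logic: the `Role` enum, the `standby_next_role` function, and the interface names in the example are hypothetical, and timing (dead time, the roughly 10-second switch-over) is left out.

```python
from enum import Enum
from typing import Mapping


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


def standby_next_role(heartbeat_missing: bool, probe_replies: Mapping[str, bool]) -> Role:
    """Decide the standby's next role.

    probe_replies maps each configured internal interface name to whether
    the ARP probe sent on that interface received a reply from the peer.
    """
    if not heartbeat_missing:
        # Heartbeats are arriving normally, so nothing changes.
        return Role.STANDBY
    if any(probe_replies.values()):
        # The peer is still reachable on at least one interface:
        # both nodes keep their roles and no switch-over occurs.
        return Role.STANDBY
    # No reply on any interface: assume the peer is down and take over.
    return Role.ACTIVE


# Heartbeats lost on the HA interface, but the peer answers on another interface.
print(standby_next_role(True, {"internal-ha": False, "internal-app": True}))

# Heartbeats lost and no probe answered anywhere.
print(standby_next_role(True, {"internal-ha": False, "internal-app": False}))
```

Probing every internal interface before taking over reduces the chance of promoting the standby when only the HA link has failed.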
Edge Gateway Split-brain Active/Active Scenario
When both nodes are up but isolated from each other, both become active, leading to a split-brain scenario. One scenario where we observed this is when all Edge internal interfaces are VXLAN based and the physical network is not set up properly for VXLAN transport. Once the heartbeat channel is re-established, both sides start negotiating; one node stays active and the other moves to standby.
Edge Gateway HA Switch-over Behavior
|Edge Feature|Switch-over behavior|
|---|---|
|Interfaces|All interfaces other than the HA interface are not connected on the standby. Once switch-over occurs, the new active brings up all user-configured interfaces. Since the standby has different MAC addresses, it sends out gratuitous ARP on every interface to update the ARP caches of connected VMs.|
|Firewall|Supports stateful switch-over of firewall connections.|
|DHCP|When the standby becomes active, the HA link synchronization preserves the DHCP allocation table state.|
|Load balancer (LB)|Health of backend pool servers is synced over. Edge Gateway HA does not support TCP handoff, so existing TCP sessions are broken and must reconnect after switch-over. Because of this, stateful switch-over is currently not supported for L7 proxy LB connections, whereas it is supported for L4 LB connections.|
|IPSec VPN|Stateful switch-over (SSO) is not supported for IPSec. When the standby becomes active, all configured IPSec tunnels are reset and reconnect automatically.|
|SSL VPN|When the standby becomes active, the client should reconnect automatically.|
Get notification of these blogs and more vCloud Networking and Security information by following me on Twitter @vCloudNetSec.