In a previous post, I demonstrated what would happen in a vSphere Storage Appliance (VSA) Cluster if we lost the back-end (cluster communication/replication) network. In this post, I want to look at what would happen in the event of losing the front-end (management/NFS) network.
To cause this failure, I'm going to do the same steps that I carried out in the previous blog, namely downing the uplinks on one of the ESXi hosts/VSA Cluster nodes, but this time I will be doing it to the vmnics associated with the front-end network. For an overview of the VSA cluster networking, please check my previous post which explains it in detail.
The configuration is exactly the same as before: a 3-node VSA cluster presenting 3 distinct NFS datastores. Once again I will have a single VM, running on host1, but using the NFS datastore exported from the appliance running on host3. As before, I will cause the outage on the ESXi host (host3) which hosts the appliance/datastore on which the WinXP VM resides. The ESXi host (host1) where WinXP is running will not be affected.
To begin, let's down the first interface on host3:
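On the ESXi host itself, this can be done from the command line. A minimal sketch, assuming vmnic2 is one of the uplinks teamed for the front-end network (check your own NIC teaming configuration first):

```shell
# List the physical NICs so we can identify the front-end uplinks
# (which vmnics carry the front-end network varies per configuration)
esxcli network nic list

# Administratively down the first front-end uplink
# (vmnic2 is an assumption for this example)
esxcli network nic down --nic-name=vmnic2
```

This is equivalent to pulling the cable from the host's perspective: the NIC team sees a link-down event and fails traffic over to the standby uplink.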
Once again, nothing externally visible happens when one of the teamed uplinks is downed. Internally, however, the three port groups associated with the vSwitch (i.e., VSA-Front End, VM Network, and Management Network) now share the same active uplink: whichever port groups were using the downed uplink (either VSA-Front End, or both VM Network and Management Network) fail over to their previously configured standby uplink. Until the failed NIC is restored to health, network traffic for all three port groups shares that single active uplink. Let's now bring down the second interface on host3.
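You can confirm which uplink each port group is actually configured to use after the failover. A sketch, assuming the default VSA port group names shown above:

```shell
# Show the active/standby uplink order configured for each port group
# (port group names here follow the VSA defaults; yours may differ)
esxcli network vswitch standard portgroup policy failover get --portgroup-name="VSA-Front End"
esxcli network vswitch standard portgroup policy failover get --portgroup-name="Management Network"
```

With one uplink down, both commands will report the same single uplink as active, which is exactly the shared-uplink situation described above.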
As I am sure you know by now, the VSA installer places all ESXi hosts that are VSA cluster members into a vSphere HA cluster. This provides a way for Virtual Machines to be automatically restarted if the ESXi host on which they were running goes down. Since I've just downed the uplinks for the management interfaces on one of the ESXi hosts, the vSphere HA agent running on that host can no longer communicate with the other hosts in the vSphere HA cluster or with the vCenter server. Therefore, the first thing you see when the front-end network is lost are complaints from vSphere HA that it cannot communicate with the HA agent on that ESXi host (I've expanded the vSphere HA State detail bubbles in the screenshots below to show more verbose messages):
This is shortly followed by a Host Failed event/status from vSphere HA:
Eventually, since vCenter communication with the vpxa agent is also via the front-end network, this ESXi host and the VSA appliance running on it become disconnected from vCenter:
Now, because cluster communication is all done over the back-end network, and this network is unaffected by the outage, the VSA Cluster will not take any corrective action in this case. It continues to export all NFS datastores from all appliances. Therefore there is no need for another appliance to take over the presentation of the datastore from the appliance that is running on the ESXi host that has the front-end network outage. Let's look now at the datastore from the VSA Manager UI:
From a VSA cluster perspective, the datastores are online. All appliances also remain online:
But because the front-end network is now down, the datastore can no longer be presented to the ESXi hosts in the cluster. This is because the front-end network is used for NFS exporting by the appliances. What this means is that all of the ESXi hosts in the datacenter have lost access to the datastore (inactive):
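The NFS mount state can also be checked from the command line on any of the ESXi hosts; while the front-end network is down, the Accessible column will show false for the affected datastore:

```shell
# List NFS mounts on this host along with their accessibility state
# (the affected VSA datastore will show Accessible = false)
esxcli storage nfs list
```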
Basically, what is happening in this network outage is that the VSA Cluster remains intact and functional (replication & heartbeating continues), but the front-end network outage is preventing it from presenting the NFS datastore(s) to the ESXi hosts.
The Virtual Machine will remain in this state indefinitely until the datastore comes back online. This is a useful feature of Virtual Machines: if the underlying disk 'goes away', they will retry I/O indefinitely until the disk becomes available again. This has saved the skin of many an admin who inadvertently removed the incorrect datastore from a host. When they realise their mistake, they re-present the datastore to the host and the VMs suffer no outage, picking up where they left off. One caveat though – just because the Guest OS can survive this sort of outage, there is no guarantee that the application running inside the Guest OS will.
And that is basically it – a front-end network outage on an ESXi host in the cluster means that the datastore exported by the appliance on that host becomes unavailable for the duration of the outage. The VMs will retry I/Os for the duration of the outage, and when the network issue is addressed and the datastore comes back online, the VMs resume from the point where the outage occurred. The key point is that the cluster framework itself is unaffected.
If we bring the uplinks for the front-end network back up…
This leads me to a question for those of you thinking about deploying the VSA – do you think this behaviour is optimal? Or would you prefer to see behaviour similar to the back-end network outage, i.e. failing over the NFS datastore to an alternate appliance? Please use the comments field to provide feedback. It is something that is being debated internally, and we'd love to hear your thoughts.