A couple of questions arose recently about the resilience of the vSphere Storage Appliance (VSA) to network outages, and what sort of behaviour you can expect from the VSA cluster when one occurs. Although we require a lot of redundancy to be configured for the VSA cluster (NIC teaming for the networks, redundant physical switches), it's probably worth demonstrating what actually happens during an outage.
Let's have a quick look at the VSA cluster networking to begin with. This is a logical diagram detailing the different networks used by the VSA Cluster, namely the front-end (appliance management & NFS server) networks, and the back-end (cluster communication, volume mirroring & vMotion) networks.
In my setup, I have a 3 node VSA cluster. This means that there are 3 NFS datastores presented from the cluster. I have deployed a single VM for the purposes of this test. The VM is running on host1, but resides on an NFS datastore (VSADs-2) presented from an appliance that is running on host3. The VM is called WinXP. This diagram shows the network configuration for host3, although all hosts have the same network configuration.
And before we bring the network down, let's check on the appliances and datastores from the VSA Manager plugin. First, let's look at the appliances:
In this particular post, I want to demonstrate what happens when the back-end network goes down on one of the nodes. I am going to bring down the back-end network on host3 (the host whose appliance is presenting the NFS storage for WinXP). To achieve this, I'm simply bringing down both NICs used in the back-end network NIC team, one at a time, using an esxcli network command in the ESXi shell.
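As a rough sketch, taking down the first uplink looks something like this. Note that vmnic2 is an assumed NIC name for my lab setup; run `esxcli network nic list` on your own host first to identify the uplinks actually used by the back-end NIC team:

```shell
# List the physical NICs to identify the back-end uplinks
esxcli network nic list

# Bring down the first uplink in the back-end NIC team
# (vmnic2 is an assumed name for this lab setup)
esxcli network nic down -n vmnic2
```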
Now, nothing externally visible happens to the cluster when a single back-end NIC is removed. However, internally, a significant change occurs. The removal of a single NIC from the team causes the two port groups associated with the vSwitch (i.e., VSA-Back End and VSA-VMotion) to utilize the same active uplink, since the NIC team for one of these port groups fails over to its previously configured standby uplink. Until the failed NIC is restored to health, the network traffic for the two port groups will share the same single active uplink.
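You can observe this failover from the ESXi shell by inspecting the teaming and failover policy of each port group. A sketch, assuming the port group names shown in the diagram above (yours may differ):

```shell
# Show the active/standby uplink order for each back-end port group
# ("VSA-Back End" and "VSA-VMotion" are the port group names in this setup)
esxcli network vswitch standard portgroup policy failover get -p "VSA-Back End"
esxcli network vswitch standard portgroup policy failover get -p "VSA-VMotion"
```

After the first NIC is taken down, both port groups should report the same surviving NIC as their active uplink.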
Now I go ahead and remove the second NIC from the team, leaving no uplinks for the back-end traffic.
If we check the UI, we can now see that both uplinks are removed from the vSwitch:
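The same check can be made from the ESXi shell rather than the UI. A sketch, assuming the back-end vSwitch is named vSwitch1 (verify the name on your own host):

```shell
# List the back-end vSwitch; its Uplinks field should now be empty
# (vSwitch1 is an assumed name for this lab setup)
esxcli network vswitch standard list -v vSwitch1
```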
As you might imagine, this is going to cause some impact in the VSA cluster. Since the front-end uplinks are still intact, the vSphere HA agent on the ESXi host can continue to communicate with the other hosts, so vSphere HA takes no action. However, this scenario does cause fail-over to kick in within the VSA cluster. Since the appliance running on this ESXi host can no longer communicate with the other nodes via the cluster communication network (e.g. heartbeats), and since it can no longer mirror to its remote replica, the appliance effectively goes offline:
And since the appliance is offline, so is its respective datastore. You can see from the above screenshot that the Exported Datastores column has changed, so that this appliance is no longer exporting any datastores. Instead the datastore is now being exported by another appliance, the one which had the mirror copy of the datastore. If we now look at the datastores view in VSA Manager:
In the datastores view we can see two datastores degraded. Why two, you might ask, since we only lost one appliance? Well, you have to remember that each appliance is responsible for presenting its primary datastore as well as a mirror replica copy of another datastore. Since we have lost an appliance, we have lost both a primary and a replica, which means that we are now running with two unmirrored datastores. That is why two datastores appear degraded.
The main point to take away from this is that although we've had a complete back-end network outage, the VSA resilience has allowed another appliance to take over the exporting of the NFS datastore via its mirror copy. This means that the ESXi servers in the data center which have this datastore mounted are oblivious to the fact that the datastore is now being presented from a different appliance (since it is using the same IP address for exporting). Therefore any VMs running on this datastore are also unaffected. In this example, my WinXP VM which was running on the datastore VSADs-2 continues to run just fine after the seamless failover of the datastore to the replica on another appliance:
The error symbol against host3 in the inventory is to flag that we've lost the network. But we can clearly see that the VM continues to run on datastore VSADs-2, even though this datastore is now being exported from a different appliance in the VSA cluster.
Finally, let's bring the network connections back up and let the appliance come back online.
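Restoring the uplinks is simply the reverse of taking them down. Again, vmnic2 and vmnic3 are the assumed uplink names from my lab setup:

```shell
# Bring both back-end uplinks back up
esxcli network nic up -n vmnic2
esxcli network nic up -n vmnic3
```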
Once the appliance comes back online, the primary volume and replica volume begin to resynchronize automatically. There are two synchronization tasks, one for each datastore that was affected by the outage.
When the synchronization completes, the appliance which was down will take back over the role of exporting the NFS datastore VSADs-2. This fail-back is once again seamless, and goes unnoticed by the ESXi hosts which have the datastore mounted (since the presentation is done using the same IP address), and thus goes unnoticed by any VMs (WinXP) running on the datastore.
So once again, we have demonstrated how resilient the vSphere Storage Appliance (VSA) actually is. It really is a cool product, with a lot of thought given to the availability of your data/Virtual Machines. I'll follow up shortly with another article detailing the behaviour when the Front End network suffers an outage. That behaviour is even more interesting as it involves vSphere HA doing its thing!