[Updated – October 12th, 2011 – newer higher resolution videos]
Recently I wrote a few blog posts on how the vSphere Storage Appliance (VSA) handles network outages. I described what happens when the Front-End (Management & NFS) network has an outage on one of the ESXi hosts participating in the cluster, and I also documented what happens when the Back-End (Cluster Communication & Replication) network has an outage.
However, reading blog posts can be time-consuming and, as they say, a picture is worth a thousand words. To that end, I decided to put together a few short videos of the same scenarios.
I'm ignoring the fact that we also have vSphere HA in the picture for restarting VMs if a host fails. In these videos, I will use a VM that is running on an ESXi host which does not fail, but which has its disk on an appliance/ESXi host which does fail.
VSA Front-End Network Outage
This first video (approx. 7 minutes) shows what happens when the front-end network is removed from an ESXi host which is hosting a VSA appliance. The appliance loses both its management network and its NFS network, so its NFS datastore can no longer be presented to the other ESXi hosts in the cluster. But since the back-end network continues to function, no fail-over is initiated. This is interesting behaviour, as it exhibits the All Paths Down (APD) scenario and the subsequent behaviour of Virtual Machines using that storage. VMs continue to issue I/O, but since the datastore is no longer online, the VMkernel requests that the VM resend the I/O indefinitely. When the datastore comes back online, the VM resumes from where it left off. APD (and the new PDL) behaviour is described here.
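To make the retry behaviour a little more concrete, here is a toy Python model of it. This is only a sketch of the idea described above, not how the VMkernel is implemented: the function name `issue_io` and the `datastore_online` callback are hypothetical names made up for illustration.

```python
from typing import Callable


def issue_io(datastore_online: Callable[[], bool]) -> int:
    """Toy model of one I/O during an APD condition.

    The I/O is retried for as long as the datastore is unreachable;
    there is deliberately no give-up limit, mirroring the
    "resend indefinitely" behaviour. Returns the number of retries
    that were needed before the I/O could complete.
    """
    retries = 0
    while not datastore_online():
        retries += 1  # datastore still offline: the I/O is resent
    return retries  # datastore is back: the I/O completes


# Hypothetical outage: unreachable for three checks, then back online.
states = iter([False, False, False, True])
retries_needed = issue_io(lambda: next(states))
```

In this model the VM never sees an I/O error. It simply waits, which matches what the video shows: the VM picks up where it left off once the front-end network is restored.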
VSA Back-End Network Outage
The second video (also approx. 7 minutes) highlights what happens when there is a back-end network outage. The other nodes in the cluster can no longer communicate with the failing node, so they initiate a fail-over of the NFS datastore to another appliance. This video once again shows how seamless this fail-over is, and how it is transparent to the remaining ESXi hosts and to the VMs running on the datastore.
I should point out that the VSA cluster reacts differently in 2-node and 3-node configurations. The descriptions above are correct for 3-node configurations. In a 2-node configuration, a VSA server will go offline after a front-end failure, and its volumes will fail over to one of the ESXi nodes. After a back-end failure, the VSA server will stay online, but its volumes will become degraded.
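Since there are now four combinations to keep track of, here is a small Python lookup table that summarises them. It is only a summary of the behaviour described in this post, not a VSA API; the names `OUTCOMES` and `expected_outcome` are made up for illustration.

```python
# Outcomes as described in this post, keyed by
# (number of nodes in the VSA cluster, network that failed).
OUTCOMES = {
    (3, "front-end"): "no fail-over; NFS datastore is unavailable (APD) until the network returns",
    (3, "back-end"): "NFS datastore fails over to another appliance",
    (2, "front-end"): "VSA server goes offline; volumes fail over",
    (2, "back-end"): "VSA server stays online; volumes become degraded",
}


def expected_outcome(nodes: int, failed_network: str) -> str:
    """Return the outcome described in the post for one scenario."""
    try:
        return OUTCOMES[(nodes, failed_network)]
    except KeyError:
        raise ValueError(
            f"no description for {nodes} nodes with a {failed_network} outage"
        ) from None


# Example: the 2-node back-end case differs from the 3-node one.
two_node_backend = expected_outcome(2, "back-end")
```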