
Auto Deploy Stateless Caching – Stateless Caching and PXE Infrastructure Outages?

So far I’ve introduced Auto Deploy stateless caching and discussed how it helps with troubleshooting when an auto-deployed host becomes isolated on the network. I now want to continue my blog series on stateless caching by discussing the role it plays in protecting against PXE infrastructure outages: that is, an outage affecting either the DHCP server or the TFTP server used in the PXE boot process.

PXE boot requires both a DHCP server and a TFTP server. Should either of these components fail, an auto-deployed host will not be able to PXE boot. This scenario differs from a network-isolated host in that the host is able to communicate on the network; it is the services running on the network that are unavailable. In this case, stateless caching allows an auto-deployed host to overcome the PXE boot failure by falling back to the cached image that was saved to disk during the last successful PXE boot. The screen shot below shows an example of a host attempting to PXE boot when the TFTP server is unavailable.
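
The fallback described above can be modeled as a simple decision. The Python sketch below is illustrative only: it assumes the host firmware is configured to try network (PXE) boot first and the local disk second, and the enum and function names are hypothetical, not part of Auto Deploy or any VMware API.

```python
from enum import Enum


class BootSource(Enum):
    PXE = "pxe"            # image streamed from Auto Deploy via DHCP + TFTP
    CACHED_DISK = "cache"  # image saved to local disk by stateless caching
    FAILED = "failed"      # no usable boot source


def choose_boot_source(dhcp_ok: bool, tftp_ok: bool, has_cached_image: bool) -> BootSource:
    """Hypothetical model of the boot-order fallback (not VMware code)."""
    # PXE boot needs both a DHCP lease and a successful TFTP download.
    if dhcp_ok and tftp_ok:
        return BootSource.PXE
    # If either service is down, fall back to the image cached on local
    # disk during the last successful PXE boot, if one exists.
    if has_cached_image:
        return BootSource.CACHED_DISK
    return BootSource.FAILED
```

For the TFTP outage in the screen shot, `choose_boot_source(dhcp_ok=True, tftp_ok=False, has_cached_image=True)` returns `BootSource.CACHED_DISK`. Without a cached image, the same outage leaves the host unable to boot.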

What is different about this scenario, compared with a network-isolated host, is that after the host boots from the cached disk image it can still communicate with the vCenter Server, so the administrator has the option to manually reconnect it and add it back to the cluster. The screen shot below shows an example of this.

While it is possible to manually reconnect the host after it has booted from the cached disk image, there are a few things to consider:

1.  Reconnecting an Auto Deploy host that has booted off the cached disk image is a manual step.  Admin intervention is required.
2.  When the host reconnects, vCenter will detect that it has booted from the cached disk and will flag it as not having booted statelessly. This flag prevents the host from becoming compliant with its Host Profile. The only way to clear the flag and bring the host back into compliance is to first resolve the PXE infrastructure outage and then reboot the host again. The image below shows an example of a host that has been flagged as having booted from the cached disk image.

3.  Once manually reconnected to vCenter, the host will successfully rejoin the cluster and will be able to host virtual machines. Remember, however, that the host will still be flagged as not having booted statelessly, and another reboot (after you fix the PXE outage, of course) will be required to clear this flag.
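
Taken together, the three considerations above describe a small state model. The Python sketch below is hypothetical: the class and method names are illustrative, not a vSphere API, and it assumes that a successful PXE boot through Auto Deploy reconnects the host to vCenter without admin intervention.

```python
from dataclasses import dataclass


@dataclass
class AutoDeployHost:
    """Hypothetical model of the host states described above; not a vSphere API."""
    connected: bool = False
    booted_from_cache: bool = False  # models the "not booted stateless" flag

    def reboot(self, pxe_available: bool) -> None:
        if pxe_available:
            # Assumption: a successful PXE boot clears the flag and
            # Auto Deploy reconnects the host on its own.
            self.booted_from_cache = False
            self.connected = True
        else:
            # PXE outage: the host boots from the cached image, is flagged,
            # and comes up disconnected from vCenter.
            self.booted_from_cache = True
            self.connected = False

    def manual_reconnect(self) -> None:
        # Admin intervention: the host rejoins the cluster and can run VMs,
        # but the cache flag stays set.
        self.connected = True

    def can_be_compliant(self) -> bool:
        # Host Profile compliance is blocked while the cache flag is set.
        return self.connected and not self.booted_from_cache
```

Walking through the scenario: reboot during the outage, reconnect manually (the host runs VMs but is not compliant), fix PXE, then reboot once more to clear the flag.

```python
host = AutoDeployHost()
host.reboot(pxe_available=False)  # PXE down: boots from cache, disconnected
host.manual_reconnect()           # back in the cluster, still flagged
host.reboot(pxe_available=True)   # after the PXE fix: flag cleared
```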

Note: Auto Deploy stateless caching does protect against PXE infrastructure outages, provided the host can still connect to vCenter. Manual admin intervention is required to reconnect the host.

As you can see, stateless caching does help protect against PXE component failures. If you are ever in a situation where PXE is down and you need to reboot an auto-deployed host, you can do the reboot; however, you’ll need to manually reconnect the host to vCenter afterward. Of course, you should make it a priority to identify and resolve the PXE failure as soon as possible. Once the outage has been resolved, you will need to reboot the host again to bring it back into compliance with its Host Profile.
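
When tracking down the PXE failure, one quick check is whether the TFTP server answers a read request at all. The Python sketch below is a hypothetical diagnostic, not a VMware tool. It sends a single RFC 1350 read request (RRQ) to UDP port 69 and classifies the first reply. The server address and boot file name you pass in are placeholders; use the values your DHCP options actually hand out.

```python
import socket
import struct

TFTP_PORT = 69  # well-known TFTP port (RFC 1350)
OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: 2-byte opcode, filename, NUL, mode, NUL."""
    return (struct.pack("!H", OP_RRQ)
            + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def classify_reply(packet: bytes) -> str:
    """Map the first reply packet to a coarse status string."""
    if len(packet) < 4:  # DATA and ERROR packets are at least 4 bytes long
        return "malformed"
    (opcode,) = struct.unpack("!H", packet[:2])
    if opcode == OP_DATA:
        return "ok"      # server is up and serving the requested file
    if opcode == OP_ERROR:
        return "error"   # server is up but refused the request (e.g. file not found)
    return "unexpected"


def probe_tftp(server: str, filename: str, timeout: float = 3.0) -> str:
    """Send one RRQ and report how (or whether) the server answered."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (server, TFTP_PORT))
        try:
            # The reply comes from a new server port (TID), so accept any sender.
            packet, _addr = sock.recvfrom(65536)
        except socket.timeout:
            return "timeout"      # no answer: service down or traffic blocked
        except OSError:
            return "unreachable"  # e.g. an ICMP error surfaced by the OS
        return classify_reply(packet)
```

For example, `probe_tftp("192.0.2.10", "placeholder-boot-file")` uses a documentation-range address and a placeholder file name. The probe never acknowledges the first DATA packet, so the server simply retransmits and then gives up. It checks only TFTP reachability and says nothing about DHCP.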

For notification of future posts, follow me on Twitter @VMwareESXi