Out-of-band management has long been a convenient way to reach the vSphere Direct Console User Interface (DCUI) on a host.  Traditionally, remote management through the Intelligent Platform Management Interface (IPMI) or a similar method provided the pre-boot visibility and control needed to update firmware on the host and to see the host state when it was not accessible through traditional, in-band methods.  For typical ESXi host restarts, most administrators get a feel for roughly how long a host takes to restart, and simply wait for it to reappear as “connected” in vCenter.  This may be one of the many reasons why out-of-band host management often isn’t configured, available, or part of operational practices.

Yet DCUI access can play an important role in administering a vSAN environment, which is why incorporating out-of-band console visibility into your operational practices is recommended.  Let’s look more closely at why this practice makes sense.

A host in a vSAN-based cluster has additional actions to perform during the reboot process.  Many of these additional tasks simply ensure the safety and integrity of data.  Looking at the DCUI during a host restart will reveal a few vSAN-related activities.  The most prominent message, and perhaps the one that takes the most time, is “vSAN:  Initializing SSD… Please wait…” similar to what is shown in Figure 1.

Figure 1. DCUI showing the “Initializing SSD” status.

During this step, vSAN is processing data and digesting the log entries in the buffer to generate all required metadata tables.  More detail on a variety of vSAN initialization activities can be exposed by pressing ALT + F11 or ALT + F12 in the DCUI, as shown in Figure 2.

Figure 2. Detailed log entries during “Initializing SSD” state.

The specific activities that relate to the duration of the “Initializing SSD” step are the Physical Log and Object Manager entries.  You might see entries such as:

SSDLOGLogEnumProgress:948: Estimated time for recovering 712459 log blks is 95221 ms
PLOG_Recover:970: Doing plog recovery on SSD
PLOGRecDisp:988: PLOG recovery complete

Entries like the examples shown above are a normal part of this “Initializing SSD” step, and show that vSAN is making progress in the processing of this data.  Since each vSAN host contributes to the overall storage footprint available to the VMs, the processing and reconciliation of data during a restart is expected behavior of a host in a vSAN environment.
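If these console logs are captured somewhere they can be searched, the estimate in the first entry is easy to pull out.  The following is a minimal Python sketch that assumes the line format matches the sample entry shown above; the exact wording may differ between vSAN releases, and the function name `parse_recovery_estimate` is purely illustrative.

```python
import re

# Pattern derived from the sample log entry above. This is an assumption:
# the exact message format may vary between vSAN releases.
ESTIMATE_RE = re.compile(
    r"Estimated time for recovering (?P<blocks>\d+) log blks is (?P<ms>\d+) ms"
)


def parse_recovery_estimate(line):
    """Return (log_blocks, estimated_seconds), or None if the line does not match."""
    match = ESTIMATE_RE.search(line)
    if match is None:
        return None
    return int(match.group("blocks")), int(match.group("ms")) / 1000.0


sample = (
    "SSDLOGLogEnumProgress:948: Estimated time for recovering "
    "712459 log blks is 95221 ms"
)
blocks, seconds = parse_recovery_estimate(sample)
print(f"{blocks} log blocks, roughly {seconds:.0f} seconds estimated")
```

For the sample entry, this works out to roughly a minute and a half of estimated recovery time for that portion of the log.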

This means that hosts in a vSAN cluster can take longer to reboot than non-vSAN hosts.  The message may appear only momentarily on the DCUI screen, or it may take several minutes per disk group to complete this step and proceed with the remainder of the host reboot.

This task may fail if there is an underlying issue with the SSD, or perhaps when a storage controller that is not on the Hardware Compatibility List (HCL) is in use.  In rare circumstances, an unusually long initialization can also indicate possible health issues with metadata associated with the components that make up an object in vSAN.  Metadata health can be easily viewed in the UI using the vSAN Health check service.

The time this initialization actually takes depends on a number of factors.  One of the primary variables is the amount of data, or blocks, in the write buffer at the time of the host restart.  During this “Initializing SSD” period, further reboots of the host should be avoided.  Having out-of-band access to the host DCUI is one of the best ways to provide proper visibility and avoid unnecessary, additional host restarts during these moments when the host is performing tasks but is not yet available to the cluster.

DCUI accessibility via remote management should also be incorporated into defined maintenance workflows such as host restarts.  This doesn’t mean that an administrator needs to watch the DCUI every time they restart a host.  The objective is twofold: (1) set realistic expectations for how long a vSAN host typically takes to restart, and (2) instill a good operational practice that defines how the administrator should proceed when the status of a host is uncertain during a normal restart.
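One way to picture such a workflow is a wait loop whose fallback is “look at the DCUI” rather than “reset the host.”  The Python sketch below is only an illustration of that idea: `is_connected` is a hypothetical callable that you would supply (for example, a check against your own inventory tooling), and the 30-minute timeout and 30-second polling interval are placeholder assumptions, not recommended values.

```python
import time


def wait_for_host(is_connected, timeout_s=1800, poll_s=30,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until the host reports as connected or the timeout expires.

    is_connected is a hypothetical, caller-supplied function that returns
    True once the host is back. The default timeout and poll interval are
    placeholders; tune them to the restart times you observe.
    """
    deadline = clock() + timeout_s
    while True:
        if is_connected():
            return "connected"
        if clock() >= deadline:
            # Do not hard reset here. The next step is to check the DCUI
            # through out-of-band management and see what the host is doing.
            return "check-dcui"
        sleep(poll_s)
```

The point of the design is that the timeout branch never triggers a reset on its own; it only tells the administrator that it is time to open the remote console.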

While some operational practices for vSAN are noticeably different from those of a traditional infrastructure, many practices, like this one, are a simple reminder of best practices for almost any environment.  Overly anxious hard resets of hosts have never been ideal, but unfortunately they have become a customary troubleshooting step for many organizations.  Understanding how vSAN can change the boot time of a host helps bring to light the importance of adhering to proper operational procedures.