Health Checks – Delivering a more Intelligent vSAN and vSphere

Guidance in most forms is a nice option to have. Consider a scenario that is commonplace today: a driver is almost certain of the directions to a destination, yet still enters the address into a phone for turn-by-turn guidance. Even the most capable individuals do this. Why? Compared to old methods (paper maps, asking others, etc.), the effort is insignificant and the payoff is clear: a dramatic reduction in potential mistakes. Accolades are rarely given to those doing things the hard way.

The desire for guidance follows us into the data center, and is one of the reasons why the vSAN Health Service plays such a prominent part in recent versions of vSAN. The vSAN Health Service has dozens of checks across ten different categories. More importantly, the framework used (introduced in recent versions of vSphere and vSAN) allows for asynchronous updates, which means the software driving the infrastructure can introduce new checks and guidance without any need for the existing installation to be updated. Much like driving directions from a phone, health checks remove uncertainty from the data center by providing awareness and a path to resolution. These health checks are integrated throughout VMware products, and play a part in one of the newest features introduced to vSAN: VMware Update Manager (VUM) firmware update integration in vSAN 6.7 U1.

VMware has demonstrated a commitment to simplifying operations, and the example given below shows how these intelligent health checks help new and experienced users alike.

Intelligence from all angles

Checking on the health of a vSAN cluster extends beyond the explicit vSAN health checks. Many of the recommended practices for a vSAN cluster are common to three-tier architectures as well, so it makes sense that vCenter continues to improve its health detection abilities: discovering misconfigurations, and monitoring the general health and well-being of the hosts that comprise a cluster. A good example of this, as shown in Figure 1, occurred recently in my own environment.

Figure 1. ESXi with bnx2x driver version check

The alert clearly indicates a NIC driver issue that needs attention, but I was confident that no alert like this had appeared in my environment until recently. This illustrates the beauty of the framework. The health check highlighted in Figure 1 was pushed out on October 16th, 2018, and is visible to all environments running vCenter 6.7 and above. Thanks to the new cloud-connected framework, this health check was introduced without any update to the installed software.

As Figure 2 shows, clicking on the “ESXi with bnx2x driver version 2.713.30.v60.5 or below” health check provides further detail on the issue.

Figure 2. Details of the health check condition
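The condition named in the health check title ("version 2.713.30.v60.5 or below") comes down to an ordered version comparison. The short Python sketch below illustrates that idea only. It is not how the vSAN or vCenter health framework is implemented, and the function names and the parsing rule (split on dots, keep the digits in each field, so "v60" becomes 60) are assumptions made for this example.

```python
import re

# Threshold taken from the health check title; the check flags this version and below.
AFFECTED_MAX = "2.713.30.v60.5"


def parse_driver_version(version: str) -> tuple:
    """Turn a string such as '2.713.30.v60.5' into (2, 713, 30, 60, 5).

    Assumption: every dot-separated field contains exactly one run of digits
    that carries its ordering, so a prefix like the 'v' in 'v60' is ignored.
    """
    parts = []
    for field in version.strip().split("."):
        match = re.search(r"\d+", field)
        if match is None:
            raise ValueError(f"no numeric part in field {field!r} of {version!r}")
        parts.append(int(match.group()))
    return tuple(parts)


def is_affected(installed: str, affected_max: str = AFFECTED_MAX) -> bool:
    """Return True when the installed driver is at or below the flagged version."""
    return parse_driver_version(installed) <= parse_driver_version(affected_max)


if __name__ == "__main__":
    # The flagged version from the alert and the patched version from the baseline.
    for version in ("2.713.30.v60.5", "2.713.30.v60.8"):
        status = "needs attention" if is_affected(version) else "passes"
        print(f"bnx2x {version}: {status}")
```

Comparing tuples of integers rather than raw strings avoids the usual pitfall of lexical ordering, where "v60.10" would sort before "v60.5".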

Clicking on the “Ask VMware” link takes the user directly to the KB article, as shown in Figure 3. The KB article describes the problem in detail, and the recommended course of action.

Figure 3. Cause and resolution details in KB article 53353

From there, it was just a matter of downloading the patch and letting VUM apply it to all of the hosts in the cluster. This is accomplished by performing the following steps:

  1. In VUM, click on “updates” > “upload from file” and select the recently downloaded zip file that contained the patch.
  2. Create a new Baseline in VUM, give it a name, and choose a content type of “patch.” For clarity in this example, filter by “QLogic” as the patch vendor. You will see bnx2x-2.713.30.v60.8, among others, listed in the “Content” section of the Baseline.
  3. At the cluster level, click on “Updates” and attach the baseline, then remediate the cluster.

Once complete, the subsequent health check passed without issue, as shown in Figure 4.

Figure 4. Verification of the health check condition after the update

Network connectivity, and all of the components that impact it, have grown to be a critical element of modern data center architectures. While this alert in my environment caught me by surprise, it was a great example of how VMware recognized an issue with a NIC driver and quickly introduced a health check to alert users to it. Agility in infrastructure software translates to agility for those who manage the data center.

Conclusion

Intelligent, dynamically updated health checks delivered through VMware’s new cloud-connected framework may not help your driving skills, but the guidance they provide does make the ongoing operation of your data center easier. If you are looking for reasons to upgrade your infrastructure to the latest versions of vSphere and vSAN, this should be reason enough. What are you waiting for?

@vmpete