
Upgrading Large vSAN Clusters

The post “Multi-Cluster Upgrading Strategies for vSAN” described an approach for upgrading vSAN in a data center comprising several vSAN clusters. But what about environments running large vSAN clusters? How should an upgrade of a large vSAN cluster be approached? Let’s take a look at this in more detail.

Standard vSAN clusters can range from 3 hosts to 64 hosts in size. Since vSAN provides storage services on a per-cluster basis, a large cluster will need to be treated in the same way as a small cluster: as a single unit of services and management. While the upgrade process runs through the updating of the discrete hosts that make up the cluster, the upgrade task should occur on a per-cluster basis. This is why the cluster can be thought of and referred to as a “maintenance domain.”

The procedure of upgrading vSAN clusters with a large number of hosts is no different than upgrading vSAN clusters with a smaller number of hosts. While the steps remain the same, there are a few additional considerations to be mindful of during these update procedures.

Figure 1. Visualizing the “maintenance domain” of a single large vSAN cluster

vSphere Update Manager (VUM) is limited to updating one host at a time in a vSAN cluster, so the time a cluster takes to complete the update is roughly proportional to the number of hosts in the cluster. If there is a desire to speed up the host-upgrade process by upgrading more than one host at a time, the size of the maintenance domain can be reduced by creating more clusters composed of fewer hosts. These smaller maintenance domains allow more hosts (one per cluster) to be upgraded in parallel. As demonstrated in the post “Multi-Cluster Upgrading Strategies for vSAN,” designing an environment that has a modestly sized maintenance domain is one of the most effective ways to improve operations and maintenance of a vSAN powered environment.
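To make the effect of the maintenance domain size concrete, here is a minimal back-of-the-envelope sketch. It assumes hosts are remediated one at a time within a cluster, that separate clusters can be remediated at the same time without slowing each other down, and that every host takes the same amount of time. The function name and the 30-minute per-host figure are hypothetical, chosen only for illustration.

```python
def rolling_upgrade_minutes(hosts_per_cluster, minutes_per_host):
    """Estimate wall-clock minutes for a rolling upgrade.

    Hosts inside a cluster are upgraded serially; clusters are assumed
    to run in parallel, so the largest cluster sets the total time.
    """
    if not hosts_per_cluster:
        return 0
    return max(hosts_per_cluster) * minutes_per_host


# Hypothetical figure: 30 minutes per host (evacuate, patch, reboot, rejoin).
one_large = rolling_upgrade_minutes([64], 30)
four_small = rolling_upgrade_minutes([16, 16, 16, 16], 30)

print(f"One 64-host cluster:    {one_large} minutes")
print(f"Four 16-host clusters:  {four_small} minutes")
```

Under these assumptions, the same 64 hosts finish in a quarter of the time when split into four clusters, because four hosts (one per cluster) are being upgraded at any moment.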

While VUM will not upgrade more than one host at a time within a vSAN cluster, there are some steps that can be taken to ensure the upgrade process is effective and non-disruptive.

  • Use hosts that support the new Quick Boot feature. This can shorten host restart times. Since hosts in a vSAN cluster will be updated one after the other, reducing host restart times can significantly improve the completion time of the larger clusters.
  • If a large cluster has relatively few resources used, an administrator may be able to place multiple hosts into maintenance mode safely without running short of storage and capacity resources. VUM will still update the hosts one at a time, but this may save some of the time spent placing the respective hosts into maintenance mode. This is only possible in large clusters that are noticeably underutilized, it may add complexity to an upgrade strategy, and the actual time savings may be minimal.
  • Monitor latency for a few critical workloads across the cluster. In the vSAN Performance Service, change the time window in the performance graphs to include the period before and during the upgrade. This provides reassurance to you and to any application owners that the expected levels of performance are being met during the upgrade process. One can monitor latency across the cluster, but since cluster-level statistics are based on rolling up the data from all of the hosts, monitoring at the VM level may provide a better understanding of whether discrete workloads are being impacted.
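The point about cluster-level roll-ups hiding the impact on an individual workload can be illustrated with a small sketch. The VM names and latency samples below are made up, and the "roll-up" here is a plain average of all samples, which is a simplification and not how the vSAN Performance Service computes its cluster statistics. The numbers are assumed to have been read from the performance graphs by hand.

```python
from statistics import mean


def latency_change(before_ms, during_ms):
    """Relative change in mean latency (1.0 means latency doubled)."""
    baseline = mean(before_ms)
    return (mean(during_ms) - baseline) / baseline


def impacted_vms(samples, threshold):
    """Names of VMs whose mean latency rose by more than `threshold`."""
    return sorted(
        name
        for name, (before, during) in samples.items()
        if latency_change(before, during) > threshold
    )


# Hypothetical latency samples in milliseconds: (before upgrade, during upgrade).
vm_samples = {
    "app01": ([1.0, 1.0], [1.0, 1.0]),
    "app02": ([1.0, 1.0], [1.0, 1.0]),
    "db01": ([2.0, 2.0], [3.0, 5.0]),
}

# Simplified roll-up: pool every sample from every VM.
all_before = [s for before, _ in vm_samples.values() for s in before]
all_during = [s for _, during in vm_samples.values() for s in during]
cluster_change = latency_change(all_before, all_during)

print(f"Roll-up change: {cluster_change:.0%}")
print("VMs above a 25% increase:", impacted_vms(vm_samples, 0.25))
```

In this made-up data set, one VM's latency doubles while the pooled figure shows a smaller increase. With more quiet VMs in the pool, the roll-up would move even less, which is why per-VM graphs are the better place to look for a specific workload.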

Recommendation: For cluster updates, focus on the efficient delivery of services while the cluster is being updated, as opposed to the speed at which the cluster is updated. vSAN restricts parallel host remediation. A well-designed and well-operating cluster will seamlessly roll through updating all of the hosts in the cluster without interfering with expected service levels. In other words, the speed at which the cluster is updated is less important than the efficient delivery of resources to the VMs powered by the cluster.

Larger vSAN clusters may have a greater ability to absorb reduced resources as a host enters maintenance mode for the update process. This is because, proportionally, each host contributes a smaller percentage of resources to the cluster. Large clusters may also see slightly less movement of data than much smaller clusters to comply with the “Ensure Accessibility” data migration option when VUM places the host into maintenance mode. For more information on understanding the tradeoffs between larger vSAN clusters versus smaller vSAN clusters, see “vSAN Cluster Design – Large Clusters Versus Small Cluster” on StorageHub.
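The proportional argument can be checked with simple arithmetic. The sketch below assumes identical hosts and an evenly spread load, and it ignores vSAN specifics such as storage policies, fault domains, and free-space requirements. The function names and the 60% starting utilization are hypothetical.

```python
def host_share(num_hosts):
    """Fraction of total cluster resources contributed by one host."""
    return 1 / num_hosts


def utilization_with_hosts_out(utilization, num_hosts, hosts_out=1):
    """Utilization of the remaining hosts once `hosts_out` are in maintenance mode."""
    remaining = num_hosts - hosts_out
    if remaining <= 0:
        raise ValueError("at least one host must remain in the cluster")
    return utilization * num_hosts / remaining


# Hypothetical starting point: each cluster is 60% utilized.
for size in (4, 16, 64):
    share = host_share(size)
    after = utilization_with_hosts_out(0.60, size)
    print(f"{size:>2} hosts: each host is {share:.1%} of the cluster, "
          f"utilization with one host out is {after:.1%}")
```

With one host in maintenance mode, the 4-host cluster rises from 60% to about 80% utilization, while the 64-host cluster rises only to about 61%.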

Summary

Upgrading a vSAN cluster composed of a larger quantity of hosts is no different than updating a vSAN cluster with a smaller quantity of hosts. While it is quite common to want an upgrade process to complete as quickly as possible, the real emphasis should be ensuring that a vSAN cluster can maintain its ability to meet VM performance and resilience expectations during the upgrade. If those requirements are met, then the time it takes to upgrade is much less relevant. For customers who do want to reduce the time needed to upgrade a cluster, current practices on host count per cluster may need to be revisited, and eligible servers should be verified to be using the Quick Boot feature.

@vmpete