Now more than ever, vSAN deployments are quickly growing in number. There are many vSAN documents available on StorageHub with great content, from design guides to day-2 operations, but a common ask from the field relates to vSAN deployment considerations. While there are many areas we could explore, some of them vary based on the hardware selected, needs and requirements, and different deployment scenarios such as 2-node or stretched clusters. However, some considerations apply to most deployment scenarios based on availability, sizing, performance, and features. Let's take a closer look at some of these considerations.
Number of Nodes
Number of nodes is probably the most commonly discussed consideration. A vSAN cluster requires a minimum of three nodes (two physical nodes, if you use the 2-node option with a Witness Appliance), and this minimum comes down to quorum and math. We need an uneven number of hosts to have a majority, and three is the lowest number that meets the default policy. Although the minimum number of nodes is three, it is advisable to consider having N+1 nodes, meaning that for a three-node cluster you would have four nodes instead of three, and so on. Having an additional node beyond the minimum required increases your failure domains, allowing you to rebuild data (vSAN self-healing) in case of a host outage or extended maintenance. Keep in mind that some vSAN capabilities, such as Erasure Coding, depend on a minimum number of hosts within the cluster. For example, RAID5 requires a minimum of four nodes, so with a design decision to use N+1, the new minimum would be five nodes in order to maintain policy compliance during outages or extended maintenance periods.
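To make the node-count math concrete, here is a minimal sketch of the arithmetic described above (the policy table and function name are illustrative, not part of any vSAN tooling):

```python
# Illustrative sketch: minimum host counts for common vSAN storage policies,
# plus an N+1 spare host for self-healing during outages or maintenance.
POLICY_MIN_HOSTS = {
    "RAID1 (FTT=1)": 3,   # three hosts for quorum/majority
    "RAID5 (FTT=1)": 4,   # erasure coding requires four hosts
    "RAID6 (FTT=2)": 6,   # double-parity erasure coding requires six hosts
}

def recommended_hosts(policy: str, spare: int = 1) -> int:
    """Minimum hosts for policy compliance, plus N+spare for rebuild capacity."""
    return POLICY_MIN_HOSTS[policy] + spare

for policy in POLICY_MIN_HOSTS:
    print(f"{policy}: minimum {POLICY_MIN_HOSTS[policy]}, "
          f"recommended {recommended_hosts(policy)}")
```

With the default spare of one host, RAID1 lands at four hosts and RAID5 at five, matching the N+1 guidance above.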
I know… what about price? Another consideration is single-socket nodes vs. dual-socket nodes. As far as licensing goes, vSAN uses per-CPU licensing in most cases, so replacing a dual-socket node with two single-socket nodes requires the same number of vSAN licenses (two, in this case). Single-socket nodes are cheaper than dual-socket nodes, so it is in your best interest to price both options. You may be pleasantly surprised! See the vSAN Licensing Guide for details.
Like any other storage solution, sizing is important. When going through a sizing exercise, consider future growth, the storage policy settings used (for example, PFTT=1 means two copies of data), swap space, RAID1 vs. Erasure Coding, and slack space, which we recommend in order to accommodate rebuilds and the reconstruction of objects and components after a policy change.
Another important aspect is properly sizing the cache layer. For hybrid environments, we still abide by the 10% rule; however, for all-flash environments we recommend basing cache size on the workload type and read/write profile. John Nicholson wrote a blog about this here.
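As a back-of-the-envelope illustration of the sizing math above, the sketch below applies the PFTT=1 mirroring multiplier, a slack-space reserve, and the 10% hybrid cache rule (the function names and the 30% slack figure are illustrative assumptions; use the vSAN sizing tools for real designs):

```python
def raw_capacity_needed(usable_tb: float, copies: int = 2, slack: float = 0.30) -> float:
    """Raw capacity to provision: usable data times the number of copies
    (PFTT=1 with RAID1 mirroring means two copies), scaled so the slack-space
    fraction stays free for rebuilds and policy changes."""
    return usable_tb * copies / (1 - slack)

def hybrid_cache_tb(consumed_tb: float, ratio: float = 0.10) -> float:
    """Hybrid 10% rule: cache tier sized at ~10% of anticipated consumed
    capacity (before replica copies are counted)."""
    return consumed_tb * ratio

# Example: 20 TB of data, mirrored (2 copies), keeping 30% slack free.
print(round(raw_capacity_needed(20), 1))  # ~57.1 TB of raw capacity
print(hybrid_cache_tb(20))                # 2.0 TB of cache across the cluster
```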
To help with vSAN sizing exercises, the vSAN Sizer tool is a great and easy way to work through this. The tool takes into consideration some of the items previously discussed, and it is constantly being updated to deliver faster sizing results in a more simplified manner.
Up until vSAN 6.7, SwapThickProvision was enabled by default, meaning that the swap space created for VMs based on RAM/memory reservations was thick provisioned. Starting with vSAN 6.7, this setting is disabled by default, making swap space thin provisioned out of the box. For older versions of vSAN, you can easily change this setting via CLI, PowerCLI, Host Profiles, the UI, etc. More on that here.
When it comes to networking, the version of vSAN deployed determines whether multicast or unicast is used for cluster membership updates. Starting with vSAN 6.6, unicast became the networking mode of choice. Prior to that, multicast had to be configured by either disabling IGMP, or enabling IGMP and configuring an IGMP querier.
Jumbo Frames (MTU 9000) are something else that should be considered. There is a performance gain from using jumbo frames, but for some deployments their implementation and management can become significant overhead for the staff. If jumbo frames are already enabled in the environment, vSAN supports them. If jumbo frames are not enabled, there is no requirement to enable them for vSAN, as the slight performance gain may not justify the additional complexity. If implemented, remember to implement Jumbo Frames end-to-end.
Network IO Control (NIOC) is something that is discussed a lot with vSAN. It is recommended to implement vSphere Distributed Switches with vSAN in order to use NIOC. The vSAN license includes vDS, and implementing NIOC (shares) allows vSAN traffic to take priority in case of contention. For example, in an active/passive configuration, an uplink failure may cause all traffic types to traverse a single link; if that link is saturated, NIOC will prioritize traffic based on shares. See more in the NIOC documentation.
As new disk technology gets introduced (cache and capacity), some vSAN designs eliminate the need for a disk controller, as is the case with NVMe devices. The fast processing speeds of these devices, and the elimination of a potential bottleneck device (the HBA), result in increased performance within the server. However, it is very important to pay close attention to outside factors such as network switch capabilities, especially switch buffer sizes. For more information on this topic, please read the following update. For all-NVMe vSAN configurations, buffer sizes may play a critical role depending on NIC configuration and application requirements.
When it comes to performance, consider hardware choices. First and foremost, the hardware utilized is required to be on the vSAN VMware Compatibility Guide (vSAN VCG) as approved hardware. The vSAN VCG not only provides an approved list of hardware, but also lists the approved firmware and driver versions to be used. Also consider checking the network adapters against the vSphere VCG and making sure the proper software level is applied, as a driver/firmware mismatch may cause issues.
Disk selection is also vital. Aside from verifying that devices are on the vSAN VCG and checking their endurance and performance classes, the protocol selected has performance implications. For example, SATA (SSD) devices not only have a slower bus speed, but also a smaller queue depth compared to SAS (SSD) devices. The SATA protocol has other disadvantages as well, such as bus locking. In most cases, SAS drives are recommended over SATA drives, so select your drives wisely.
The next step is to consider what is going to be backing those devices. There are still some RAID controllers on the vSAN VCG, but more and more HBAs are taking their place. Not only are HBAs more cost effective, they are also simpler: there is no need to disable advanced features, as is the case with RAID controllers. If additional performance is needed, adding HBAs may help.
Disk groups can also play a role in performance. Properly designing a vSAN cluster includes choosing the number of disk groups within a host. It is generally recommended to have a minimum of two disk groups per vSAN node. As more disk groups are added, cache capability increases, as do the failure domains. Having between two and four disk groups per host will yield better performance.
BIOS settings are important to consider during a vSAN deployment. Not only is having the latest BIOS version important during the ESXi build, but the power/performance settings can make a big difference. Most vendors set this profile to Balanced, which may cause a performance reduction of up to 30%. Setting the BIOS profile to OS Controlled or High Performance will yield better performance overall.
Scaling is often tricky, as no one has a crystal ball to determine the exact growth rate or predict sprawl. A common approach is to follow the N+1 rule of having an additional host, as previously discussed, but not populating all drive bays. This allows for future growth by simply adding drives and disk groups, while maintaining a uniform cluster with consistent hardware and the same CPU generation. Some customers opt for this route because they can procure additional drives as part of their operational expense model, versus having to request a capital expense to add more servers.
Boot Device Considerations
Last, but not least, boot devices are important to consider during design exercises. When selecting a boot device, aspects such as price, along with the placement of logs and vSAN traces, are key factors that drive the decision. Mirrored M.2 SSD devices are becoming the preferred boot device based on redundancy, endurance, price, and log placement capabilities. An earlier post about boot devices for vSAN breaks down the cost and log placement considerations for such devices.
For more design considerations, please refer to the vSAN Design Guide.