A common ask before deploying or migrating workloads to an environment is to understand the capabilities of that environment. After the initial deployment and configuration, we may also need reference metrics captured while the configuration is in a healthy state. The objective of this blog post is to show how to measure the performance of a vSAN cluster in terms of IOPS, throughput and latency, using a benchmarking utility: HCIBench.
Benchmarking is generally performed using synthetic I/O testing, which typically requires several customizations to measure an HCI environment. HCIBench simplifies this effort. Its integration into the platform allows for environment-aware tests based on the cluster design. HCIBench deploys VMs across the hosts and initiates I/O on all of them simultaneously, as would happen in a real-world scenario. Additionally, it provides an intuitive UI to assess the results graphically and sets the standard for HCI benchmarking tools.
Why would you need to benchmark the cluster?
The most common reasons include:
- Understanding the environment capabilities and ensuring that there are no anomalies
- Validation that the design meets the requirements or User Acceptance Testing (UAT)
- Reference numbers that can be used to compare against, if running into a performance issue
- For Proof of Concepts (PoC)
- Establish a baseline and set user expectations post implementation
There may be further variations on the above; in every case, knowing the cluster's capability has clear merits.
At the end of the tests that follow, we should be able to answer these questions:
- What is the highest amount of synthetically generated IOPS that can be achieved?
- What is the expected latency at a given number of IOPS (workload requirement)?
- What is the maximum throughput that can be achieved?
As with all storage solutions, enabling data services such as Deduplication & Compression and Encryption introduces processing overhead that impacts performance. In order to establish a performance baseline, it’s recommended to leave all data services other than checksum disabled. Similarly, the storage policies are left at their default settings, because choosing RAID-1 mirroring or RAID-5 erasure coding affects the outcome: mirroring is optimized for performance, while erasure coding is optimized for capacity.
The number of IOPS a storage system can provide depends on the hardware components and architecture of the system. Various attributes govern the actual IOPS delivered, such as the RAID configuration, utilization, etc. Since vSAN is a distributed storage sub-system, there are additional considerations influencing performance, depending on the availability and performance requirements. To deduce the IOPS capability, we can gradually increase the number of threads per object and repeat the tests until additional threads no longer yield additional IOPS. Another aspect to note is that I/O size correlates directly with latency, i.e. as the I/O size (block size/payload per I/O) increases, the number of IOPS drops and latency increases.
Latency is essentially the amount of time (typically measured in milliseconds) to complete a read or write operation. This metric is often the starting point of any performance benchmarking or troubleshooting effort to ascertain how slow/fast a workload is performing. An application administrator is unlikely to have sufficient understanding of the storage sub-system utilization and is limited to what the application or guest OS is generating. An application may be generating fewer IOPS yet experiencing higher latency. This may be due to the characteristics of the I/O, including but not limited to I/O size, read/write ratios, parallelism or contention with other sources of I/O. In this exercise, we strive to deduce the outcomes in the context of such characteristics.
Throughput indicates the volume of data transferred per unit of time. It provides context to IOPS and latency. To ascertain the maximum throughput possible, tests should be carried out with larger payloads, since larger I/O sizes generally yield higher throughput. It is important to note that because each I/O now carries a larger payload, the IOPS number will drop accordingly. For reference, a single 256k I/O carries the same payload as 64 I/Os of 4k each.
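The relationship between IOPS, I/O size and throughput can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the sample IOPS figures are assumptions chosen for the example, not HCIBench output.

```python
def throughput_mib_s(iops: float, io_size_kib: float) -> float:
    """Throughput in MiB/s for a given IOPS rate and I/O size in KiB."""
    return iops * io_size_kib / 1024


def iops_for_throughput(target_mib_s: float, io_size_kib: float) -> float:
    """IOPS needed to sustain a target throughput at a given I/O size."""
    return target_mib_s * 1024 / io_size_kib


# One 256k I/O carries the same payload as 64 I/Os of 4k each.
small_ios_per_large_io = 256 // 4

# Hypothetical figures: the same 250 MiB/s can come from many small I/Os
# or from far fewer large ones.
small = throughput_mib_s(iops=64_000, io_size_kib=4)
large = throughput_mib_s(iops=1_000, io_size_kib=256)
print(small_ios_per_large_io, small, large)
```

This is why an IOPS number quoted without its I/O size says little about how much data the cluster is actually moving.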
HCIBench embeds certain default test parameters representative of generic workload types. Those unfamiliar with synthetic testing can simply pick a workload type categorized as “EASY RUN” and get started. Behind the scenes, EASY RUN estimates the number of VMs, the number of disks and the size of each data disk based on the target vSAN cluster.
Here is a sample screenshot from the configuration page of HCIBench, which is reachable at https://<HCIBench IP address>:8443
On completing the EASY RUN test, you should see a results file with a name similar to vdb-8vmdk-100ws-4k-70rdpct-100randompct-4threads-xxxxxxxxxx-res.txt
This name encodes the following I/O profile:
Block size : 4k
Read/Write (%) : 70/30
Random (%) : 100
OIO (per vmdk) : 4
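When comparing many runs, it can help to decode the profile from the results file name programmatically. The following is a minimal sketch that assumes the file name follows the pattern shown above. The helper `parse_result_name` is hypothetical and not part of HCIBench, and reading the `100ws` field as the working-set percentage is an assumption.

```python
import re

# Pattern inferred from the sample file name in this post (an assumption,
# not an official HCIBench naming specification).
RESULT_NAME = re.compile(
    r"vdb-(?P<vmdks>\d+)vmdk-(?P<working_set_pct>\d+)ws-(?P<io_size>\d+[kKmM])-"
    r"(?P<read_pct>\d+)rdpct-(?P<random_pct>\d+)randompct-(?P<threads>\d+)threads-"
)


def parse_result_name(name: str) -> dict:
    """Decode the I/O profile encoded in a results file name."""
    match = RESULT_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized results file name: {name}")
    fields = match.groupdict()
    read_pct = int(fields["read_pct"])
    return {
        "vmdks": int(fields["vmdks"]),
        "working_set_pct": int(fields["working_set_pct"]),
        "io_size": fields["io_size"],
        "read_pct": read_pct,
        "write_pct": 100 - read_pct,
        "random_pct": int(fields["random_pct"]),
        "threads_per_vmdk": int(fields["threads"]),
    }


profile = parse_result_name(
    "vdb-8vmdk-100ws-4k-70rdpct-100randompct-4threads-xxxxxxxxxx-res.txt"
)
print(profile)
```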
With the above test, a baseline of IOPS, latency and throughput is obtained. The next step is to tune the parameters to put more pressure on the system and find where it performs best. This can be done by increasing the parallelism, i.e. modifying the number of threads or Outstanding I/O (OIO). Increasing OIO increases IOPS and also increases latency (note that an increase in latency indicates a reduction in performance).
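One way to reason about why more OIO raises both IOPS and latency is Little's Law, which for a steady-state system relates average concurrency, throughput and latency: outstanding I/O = IOPS × latency. The sketch below applies it with hypothetical numbers. The VM count, vmdk count and latency values are assumptions for illustration, not measurements.

```python
def expected_iops(total_oio: int, latency_ms: float) -> float:
    """Little's Law: IOPS = outstanding I/O / average latency (in seconds)."""
    return total_oio * 1000 / latency_ms


def expected_latency_ms(total_oio: int, iops: float) -> float:
    """Little's Law rearranged: average latency = outstanding I/O / IOPS."""
    return total_oio * 1000 / iops


# Hypothetical setup: 8 VMs, 8 vmdks per VM, 4 threads (OIO) per vmdk.
total_oio = 8 * 8 * 4  # 256 I/Os in flight across the cluster

# At an average latency of 2 ms, 256 outstanding I/Os sustain 128,000 IOPS.
baseline_iops = expected_iops(total_oio, latency_ms=2.0)

# If the cluster is already saturated at that IOPS level, doubling OIO
# cannot add IOPS, so the extra queued I/O shows up as doubled latency.
saturated_latency = expected_latency_ms(total_oio * 2, iops=baseline_iops)
print(baseline_iops, saturated_latency)
```

Below saturation, extra threads mostly buy IOPS. Past it, they mostly buy latency.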
In order to modify OIO, a custom parameter file must be created as follows:
- Log in to the HCIBench config page
- Toggle EASY RUN
- Click on ADD as shown in the screenshot below
- Increase or decrease the number of threads per vmdk in steps of 4
- Replicate all other settings similar to EASY RUN
Note: You may need to repeat the steps until you reach an inflection point where additional threads do not “meaningfully” improve IOPS, i.e. the rate of increase in IOPS is much smaller than the rate of increase in latency.
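The inflection point described in the note can be located mechanically once the results of several runs are tabulated. The following is a minimal sketch under stated assumptions: the `Run` type, the `find_inflection` helper and the 0.5 ratio threshold are hypothetical choices, and the sample numbers are made up for illustration rather than taken from a real cluster.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    threads: int       # threads (OIO) per vmdk
    iops: float        # measured IOPS (must be positive)
    latency_ms: float  # measured average latency (must be positive)


def find_inflection(runs: List[Run], min_ratio: float = 0.5) -> Run:
    """Return the last run before IOPS growth falls well behind latency growth.

    A step counts as past the inflection point when the relative IOPS gain
    is smaller than min_ratio times the relative latency increase.
    """
    ordered = sorted(runs, key=lambda r: r.threads)
    for prev, cur in zip(ordered, ordered[1:]):
        iops_gain = (cur.iops - prev.iops) / prev.iops
        latency_gain = (cur.latency_ms - prev.latency_ms) / prev.latency_ms
        if iops_gain < min_ratio * latency_gain:
            return prev
    return ordered[-1]  # no inflection observed; keep adding threads


# Made-up sample results, increasing threads per vmdk in steps of 4.
SAMPLE_RUNS = [
    Run(4, 40_000, 0.80),
    Run(8, 70_000, 0.91),
    Run(12, 90_000, 1.07),
    Run(16, 95_000, 1.35),
    Run(20, 96_000, 1.67),
]

best = find_inflection(SAMPLE_RUNS)
print(best)
```

In this made-up data set, going from 12 to 16 threads adds roughly 6% more IOPS for roughly 26% more latency, so 12 threads per vmdk is reported as the sweet spot. The threshold is a judgment call and should reflect the latency the workload can tolerate.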
From a throughput standpoint, the same test can be repeated with a larger I/O size. Most modern guest operating systems coalesce writes in-guest into I/Os ranging from 32k to 1 MB. We can therefore test I/O sizes from 32k to 256k to get an estimate of the throughput capability.
Additional points to consider
- Synthetic testing can approximate the general characteristics of a workload but cannot predict its exact results, mostly because application behavior is multifaceted and varies over time.
- The tests are oriented towards storage specific metrics and not CPU or Memory
- The I/O size referred to in this article is the size of each I/O; the term "blocksize" is avoided since it could also be interpreted as a filesystem attribute
In summary, benchmarking the cluster provides:
- Knowledge of your environment and a quantified view of the performance of a vSAN-backed cluster
- A gold standard reference of the performance after the initial deployment
- The ability to scale scientifically, using actual performance metrics from a given hardware configuration