In Part I, Part II and Part III of this blog post series, we reviewed three methods of running benchmark tests on a Virtual SAN cluster: synthetic I/O tools such as Iometer, pre-created application I/O trace replay files available for download, and custom-created application I/O trace replay. Once you are running benchmark tests, you will need to assess and analyze the performance results of your Virtual SAN cluster and how well they meet the needs of the target applications within your environment. In this post, we will review some key concepts in performing a performance analysis of your Virtual SAN solution.

Virtual SAN Observer

Virtual SAN Observer is a tool that is included with the Ruby vSphere Console (RVC), and was originally developed by the Virtual SAN Engineering team to monitor and diagnose Virtual SAN performance issues during development of the product. It provides a huge amount of information, but not all of this information is needed in standard monitoring and analysis. Virtual SAN Observer is the best tool to dive into the performance profile of your Virtual SAN cluster and understand the performance impact of running specific workloads, whether to validate the performance of your Virtual SAN cluster, gain in-depth knowledge of the performance profile of your cluster, or troubleshoot potential performance bottlenecks.

Virtual SAN Observer Setup

Virtual SAN Observer can be enabled on any vCenter Server (either the appliance or Windows vCenter). In production environments, it is recommended to configure RVC and run Virtual SAN Observer on a remote vCenter Server Appliance (VCVA), as diagrammed below.

For detailed information on the configuration and usage of Virtual SAN Observer, see the Monitoring with Virtual SAN Observer whitepaper: http://blogs.vmware.com/vsphere/files/2014/08/Monitoring-with-VSAN-Observer-v1.2.pdf

Virtual SAN Key Performance Indicators

Below is a review of key performance indicators that you should be aware of when analyzing storage performance, and how they relate to VMware Virtual SAN.

IOPS – People frequently ask how many I/Os per second a storage solution can achieve. While this is an important metric, IOPS figures need context to be meaningful. This includes understanding the average block size of your workloads and the acceptable response time (latency) of your application. In general, both throughput and latency will increase in an almost linear fashion as the average block size of your I/O increases, while at the same time the number of I/Os per second will decrease.

Some applications issue small block I/O (4K or 8K), while others do large streaming I/Os where MBs of data are transferred at a time. An application issuing 2000 IOPS with a 4K I/O block size will have relatively low bandwidth requirements (8 MB/s), while the same application with a 32K I/O size will require 8x the bandwidth at 64 MB/s. I/O size also has an impact on latency: large I/Os can cause round-trip latency to look higher than normal, even if Virtual SAN is performing well. Virtual SAN uses 4K writes and 1MB reads for I/O operations.
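The bandwidth figures above follow directly from multiplying IOPS by block size. A minimal sketch of that arithmetic, assuming 1 MB = 1000 KB to match the rounded numbers in the text (the function name is only illustrative):

```python
def bandwidth_mb_per_s(iops: int, block_size_kb: int) -> float:
    """Approximate bandwidth in MB/s for a given IOPS rate and I/O size.

    Assumes 1 MB = 1000 KB, which matches the rounded figures in the text.
    """
    return iops * block_size_kb / 1000


if __name__ == "__main__":
    for block_kb in (4, 32):
        mbps = bandwidth_mb_per_s(2000, block_kb)
        print(f"2000 IOPS at {block_kb}K -> {mbps} MB/s")
```

The same IOPS number therefore says very little about bandwidth requirements until the block size is known.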

I/O Access Patterns – I/O access patterns (the read/write mix of random and sequential I/O) will impact the optimal size of your Virtual SAN flash acceleration layer. Virtual SAN utilizes flash for all writes and as many reads as possible, with 70% of flash allocated to the read cache and 30% allocated as a write buffer (this percentage split is hard-coded and cannot be adjusted). Read cache misses are served from the underlying magnetic disks within Virtual SAN disk groups. Understanding the read/write mix of your application, along with the active working set size, can help determine how much flash is optimal in your Virtual SAN solution.
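To see what the fixed 70/30 split means for a given flash device, the following sketch computes the read cache and write buffer sizes. The 400 GB device size is a hypothetical example, not a recommendation:

```python
READ_CACHE_PERCENT = 70    # hard-coded split described in the text
WRITE_BUFFER_PERCENT = 30


def flash_split_gb(flash_gb: float) -> tuple[float, float]:
    """Return (read_cache_gb, write_buffer_gb) for a flash device."""
    read_cache = flash_gb * READ_CACHE_PERCENT / 100
    write_buffer = flash_gb * WRITE_BUFFER_PERCENT / 100
    return read_cache, write_buffer


if __name__ == "__main__":
    # Hypothetical 400 GB SSD in a disk group.
    read_cache, write_buffer = flash_split_gb(400)
    print(f"read cache: {read_cache} GB, write buffer: {write_buffer} GB")
```

Comparing the read cache figure against the active working set of your application gives a first indication of whether read cache misses are likely.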

Latency – Latency is the number one indicator of acceptable performance. You should always start a performance analysis by looking not only at IOPS, but at IOPS at a given latency. High IOPS are meaningless if the latency at which the storage subsystem achieves them negatively impacts application response time. Understanding the desired response time of your application, and then analyzing IOPS at that latency requirement, is key to profiling storage performance.
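One way to apply "IOPS at a given latency" to benchmark output is to keep only the test runs that met the latency requirement and report the best IOPS among them. The sketch below assumes you have collected (IOPS, average latency) pairs from your own runs; the sample numbers are hypothetical and not measured results:

```python
from typing import Optional


def max_iops_within_latency(
    results: list[tuple[int, float]], max_latency_ms: float
) -> Optional[int]:
    """Highest IOPS among runs whose average latency met the target.

    `results` holds (iops, avg_latency_ms) pairs, one per benchmark run.
    Returns None when no run met the latency requirement.
    """
    eligible = [iops for iops, latency in results if latency <= max_latency_ms]
    return max(eligible) if eligible else None


if __name__ == "__main__":
    # Hypothetical runs at increasing load.
    runs = [(5000, 2.0), (12000, 6.5), (20000, 18.0), (26000, 45.0)]
    print(max_iops_within_latency(runs, 10.0))
```

In this made-up data set the cluster peaks at 26000 IOPS, but only 12000 IOPS are usable for an application that needs 10 ms or better.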

Latency can occur at different levels of a Virtual SAN solution. You can observe exactly where the latency is occurring by using Virtual SAN Observer. Latency within a Virtual SAN environment can be viewed through the following tabs:

VSAN Client Tab – total latency as seen from the VSAN Client Layer. This includes any overhead from replication of components, fair-scheduler queuing, and networking.
VSAN Disk/Deep-Dive – latency coming from the VSAN local abstraction layer and the physical disk hardware, including the storage controller, SSD, and HDD.

Below is a diagram of the different layers where latency can occur within Virtual SAN, and how they are exposed through Virtual SAN Observer.

Note: We do not recommend utilizing esxtop for measuring latency metrics (e.g. DAVG and GAVG) within Virtual SAN 5.5, because esxtop will not properly detect amplified write IOs of guest IOs, ReadCache IOs, destaging IOs, and Recovery IOs. Virtual SAN Observer is aware of the specific IO types and breaks these out when viewing latency under the Physical disks deep dive tab.

Different applications have different latency requirements, but latency above 30ms is a common problem threshold. Virtual SAN Observer will mark latency as an issue (by placing a red bar under the graph) if the average latency within a collection interval exceeds 30ms. Always start a performance analysis by examining latency and IOPS under the VSAN Client Tab of Virtual SAN Observer.
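The same 30ms rule can be applied to latency samples you have exported or recorded yourself. The following is only an illustration of the threshold logic described above, not Virtual SAN Observer's actual implementation; the sample values are hypothetical:

```python
from statistics import mean

LATENCY_THRESHOLD_MS = 30.0  # problem threshold named in the text


def flag_intervals(intervals: list[list[float]]) -> list[bool]:
    """Return True for each collection interval whose average latency
    exceeds the threshold.

    Each inner list holds the latency samples (ms) of one interval.
    """
    return [mean(samples) > LATENCY_THRESHOLD_MS for samples in intervals]


if __name__ == "__main__":
    # Hypothetical samples: the second interval averages above 30 ms.
    samples = [[5.0, 10.0, 15.0], [20.0, 50.0, 40.0]]
    print(flag_intervals(samples))
```

Note that the check uses the interval average, so a single short spike inside an otherwise healthy interval will not be flagged.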

Network Performance – As Virtual SAN is a scale-out storage solution, performance of the network interconnects is also a key factor in validating performance. If you observe high latency during a test run, the next step is to isolate the source of the latency. An easy way to figure out whether the network is a potential source of latency in your performance test run is to compare the latency figures seen at the VSAN Client tab and the VSAN Disk tab. If the latency at the client tab is much higher than what is seen at the disk tab, the network could be a source of latency, and you should validate that your network configuration is appropriate. For details on Virtual SAN network design and recommended practices, see the Virtual SAN Network Design Guide.
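The client-versus-disk comparison can be expressed as a simple check. The text does not define how much higher counts as "much higher", so the 2x ratio below is purely an assumption for illustration, as are the function name and sample values:

```python
def network_suspected(
    client_latency_ms: float, disk_latency_ms: float, ratio: float = 2.0
) -> bool:
    """Return True when client-tab latency is much higher than disk-tab
    latency, suggesting the network is worth investigating.

    `ratio` is an assumed cutoff, not a VMware-documented value.
    """
    return client_latency_ms > disk_latency_ms * ratio


if __name__ == "__main__":
    # Hypothetical readings taken over the same interval.
    print(network_suspected(25.0, 5.0))  # large gap between the two tabs
    print(network_suspected(6.0, 5.0))   # client and disk roughly agree
```

A positive result is only a hint to review the network configuration; replication and queuing overhead at the client layer also contribute to the gap.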

Troubleshooting Performance Problems

Before you start troubleshooting performance results with Virtual SAN Observer, or diagnosing perceived performance problems, confirm that you have completed a basic set of prerequisite checks on the health of your environment.

Validate your Virtual SAN hardware components are on the VMware Compatibility Guide for Virtual SAN (SSD, HDD, I/O Controller)
Validate your hardware configuration is appropriate for your expected performance profile (i.e. don’t expect to be able to run a high performance Virtual SAN cluster if you are using a 1GbE network for the Virtual SAN vmkernel port, or Class B SSDs).
As a best practice, make sure all hosts are configured in a homogeneous manner and are contributing storage to the Virtual SAN cluster (i.e. same SSD, MD speed, number of disk groups per host).
See the Virtual SAN Hardware Quick Reference Guide for examples of low, medium and high performance configurations, and the Virtual SAN Hardware Design Guidance whitepaper for more in-depth information on how hardware configuration affects the performance of a Virtual SAN configuration.
Validate all hosts in your configuration are working as expected. One quick way to perform this validation is through use of the following RVC command.

diagnostics.vm_create -d <datastore> -v <vmfolder> <cluster>

This command will try to create a VM on every host and, if the VM creation fails, provide detailed error stack information about the host and the specific issue. It is recommended to perform these basic checks before proceeding with actual performance testing, and to verify that they have been completed if perceived performance issues occur.

Virtual SAN Performance Testing Conclusion

Understanding the performance needs of your workloads and sizing your Virtual SAN solution to fit them can be as simple as using our general rule of thumb (sizing the flash acceleration tier to be 10% of anticipated used capacity) together with our Virtual SAN Hardware Quick Reference Guide and Virtual SAN Sizing Tool, or as involved as performing Proof-of-Concept testing to dive deeper into the performance profile of your Virtual SAN cluster based on benchmark testing. The VMware Virtual SAN Performance Testing blog series reviewed recommended tools, practices and methodologies for performing that deeper analysis and understanding the performance profile of a Virtual SAN solution.
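As a quick illustration of the 10% rule of thumb, the sketch below derives a flash tier size from anticipated used capacity. The 20 TB figure is a hypothetical example, and the result is a starting point rather than a substitute for the sizing tool or testing:

```python
FLASH_PERCENT_OF_USED_CAPACITY = 10  # rule of thumb from the text


def recommended_flash_gb(anticipated_used_capacity_gb: float) -> float:
    """Flash tier size (GB) as 10% of anticipated used capacity."""
    return anticipated_used_capacity_gb * FLASH_PERCENT_OF_USED_CAPACITY / 100


if __name__ == "__main__":
    # Hypothetical cluster expecting 20 TB (20000 GB) of used capacity.
    print(recommended_flash_gb(20000), "GB of flash")
```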