Author Archives: Amitabha Banerjee

About Amitabha Banerjee

I am a performance engineer in the IO Performance team at VMware.

vSAN Performance Diagnostics Now Shows “Specific Issues and Recommendations” for HCIBench

By Amitabha Banerjee and Abhishek Srivastava

The vSAN Performance Diagnostics feature, which helps customers optimize their benchmarks or their vSAN configurations to achieve the best possible performance, was first introduced in vSphere 6.5 U1. vSAN Performance Diagnostics is a “cloud connected” feature and requires participation in the VMware Customer Experience Improvement Program (CEIP). Performance metrics and data are collected from the vSAN cluster and sent to the VMware Cloud, where they are analyzed; the results are then sent back for display in the vCenter Client. Each result is shown as a performance issue with a description of the problem and a link to a relevant KB article.

In this blog, we describe how vSAN Performance Diagnostics can be used with HCIBench and show the new feature in vSphere 6.7 U1 that provides HCIBench-specific issues and recommendations.

What is HCIBench?

HCIBench (Hyper-converged Infrastructure Benchmark) is a standard benchmark that vSAN customers can use to evaluate the performance of their vSAN systems. HCIBench is an automation wrapper around the popular and proven VDbench open source benchmark tool that makes it easier to automate testing across an HCI cluster. HCIBench, available as a fling, simplifies and accelerates customer performance testing in a consistent and controlled way.

Example: Achieving Maximum IOPS

As an example, consider the following HCIBench workload that is run on a vSAN system:

Number of VMs: 1
Number of Disks (vmdks) to Test: 10
Number of Threads/Disk: 1
Working Set Percentage: 100
Block Size: 4 KB
Read/Write Percentage: 0/100
Randomness Percentage: 100

If the goal is to achieve maximum IOs per second (IOPS), vSAN Performance Diagnostics for this workload yields the result shown in figure 1.

Figure 1

In this example, vSAN Performance Diagnostics reports, “The Outstanding IOs for the benchmark is too low to achieve the desired goal”. The feedback not only describes the problem but also suggests a possible solution: increase the number of outstanding IOs to a value of 2 per host. The linked “Ask VMware” article explains the issue, and what we must change in the benchmark, in more detail.

The new HCIBench Exceptions and Recommendations feature removes the need to read through a KB article by precisely mapping recommendations to one (or more) configurable parameters of HCIBench.

Now let us check how vSAN Performance Diagnostics works with vSphere 6.7 U1, which has access to this new feature (figure 2).

Figure 2

Now vSAN Performance Diagnostics pinpoints the exact issue in our HCIBench workload configuration and gives a precise recommendation to resolve it: “Increase number of threads per disk from 1 to 2”.
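To see why the recommendation targets threads per disk, note that the number of outstanding IOs HCIBench keeps in flight grows with the product of VMs, data disks per VM, and threads per disk (roughly speaking, each benchmark thread keeps about one IO outstanding against its disk). The sketch below is our own illustration of that relationship, not part of HCIBench; the function and parameter names are hypothetical.

```python
# Rough illustration (our own helper, not part of HCIBench): estimate the
# outstanding IOs a workload keeps in flight, assuming each benchmark thread
# keeps roughly one IO outstanding against its disk.
def estimated_outstanding_ios(num_vms, disks_per_vm, threads_per_disk):
    return num_vms * disks_per_vm * threads_per_disk

# Original run: 1 VM x 10 vmdks x 1 thread per disk
print(estimated_outstanding_ios(1, 10, 1))  # 10 outstanding IOs
# Recommended run: threads per disk raised from 1 to 2
print(estimated_outstanding_ios(1, 10, 2))  # 20 outstanding IOs
```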

Let us monitor the current write IOPS generated by the benchmark for our reference. We go to Data Center -> Cluster -> Monitor -> vSAN -> Performance (figure 3).

Figure 3

Now, as part of the evaluation, we apply the recommendation generated by vSAN Performance Diagnostics and check its impact. We now run the new HCIBench workload configuration with 2 threads per disk and the following parameters:

Number of VMs: 1
Number of Disks (vmdks) to Test: 10
Number of Threads/Disk: 2
Working Set Percentage: 100
Block Size: 4 KB
Read/Write Percentage: 0/100
Randomness Percentage: 100

After the benchmark completes, we use vSAN Performance Diagnostics again to see if we can now achieve the required goal of maximum IOPS. The result from Performance Diagnostics now shows “No Issues were found”, which means that we are achieving good IOPS from the vSAN system (figure 4).

Figure 4

Finally, to verify the impact, we look at the change in IOPS after applying the recommendation. From figure 5 below (a screenshot from Data Center -> Cluster -> Monitor -> vSAN -> Performance), we can clearly see a 25-30% increase in IOPS after applying the recommendation, which confirms that it helped us achieve our goal.

Figure 5
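The improvement quoted above is simply the relative change between the two runs. A minimal sketch of that arithmetic, using hypothetical IOPS values purely for illustration (the real numbers are in the charts above):

```python
# Hypothetical IOPS values chosen only to illustrate the calculation;
# the actual numbers come from the vSAN performance charts.
iops_before = 100_000  # run with 1 thread per disk (hypothetical)
iops_after = 127_000   # run with 2 threads per disk (hypothetical)

increase = (iops_after - iops_before) / iops_before * 100
print(f"IOPS increase: {increase:.1f}%")  # ~27% for these example values
```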

We believe that this feature will be very useful for customers who want to tune their HCIBench workload for a desired goal.

Prerequisites

  • Please note that this feature is currently integrated with HCIBench 1.6.7 or later.
  • This feature is available for vSphere 6.7 U1 and newer releases. It will not be available to patch releases of vSphere 6.7.

IPv6 performance improvements in vSphere 5.5

Many of our customers use IPv6 networks in their datacenters for a variety of reasons. We expect that many more will transition from IPv4 to IPv6 to reap the large address range and other benefits that IPv6 provides. Keeping this in mind, we have worked on a number of performance enhancements for the way that vSphere 5.5 manages IPv6 network traffic. Some new features that we have implemented include:

• TCP Checksum Offload: For Network Interface Cards (NICs) that support this feature, the computation of the TCP checksum of the IPv6 packet is offloaded to the NIC.

• Software Large Receive Offload (LRO): LRO is a technique of aggregating multiple incoming packets from a single stream into a larger buffer before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed and saving CPU (a simplified conceptual sketch follows this list). Many NICs do not support LRO for IPv6 packets in hardware. For such NICs, we implement LRO in the vSphere network stack.

• Zero-Copy Receive: This feature prevents an unnecessary copy from the packet frame to a memory space in the vSphere network stack. Instead, the frame is processed directly.
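To make the software LRO idea concrete, here is a simplified, conceptual sketch in Python, purely for illustration (it is not how the vSphere network stack is implemented): consecutive in-order TCP segments belonging to the same flow are merged into one larger segment before being passed up the stack.

```python
# Conceptual illustration of software LRO (not VMware's implementation):
# merge consecutive in-order TCP segments of the same flow into one larger
# segment so the upper layers process fewer, bigger packets.
from dataclasses import dataclass

@dataclass
class Segment:
    flow: tuple      # (src_ip, src_port, dst_ip, dst_port)
    seq: int         # TCP sequence number of the first payload byte
    payload: bytes

def lro_aggregate(segments, max_bytes=65535):
    """Coalesce adjacent in-order segments per flow, up to max_bytes."""
    aggregated = []
    current = None
    for seg in segments:
        if (current is not None
                and seg.flow == current.flow
                and seg.seq == current.seq + len(current.payload)
                and len(current.payload) + len(seg.payload) <= max_bytes):
            # In-order continuation of the same flow: append to the big buffer.
            current = Segment(current.flow, current.seq,
                              current.payload + seg.payload)
        else:
            if current is not None:
                aggregated.append(current)
            current = seg
    if current is not None:
        aggregated.append(current)
    return aggregated

# Example: three back-to-back 1,460-byte segments become one 4,380-byte segment.
flow = ("10.0.0.1", 12345, "10.0.0.2", 80)
segs = [Segment(flow, 1000 + i * 1460, b"x" * 1460) for i in range(3)]
print(len(lro_aggregate(segs)))  # 1
```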

vSphere 5.1 offers the same features, but only for IPv4. So, in vSphere 5.1, services such as vMotion, NFS, and Fault Tolerance had lower bandwidth in IPv6 networks when compared to IPv4 networks. vSphere 5.5 solves that problem—it delivers similar performance over both IPv4 and IPv6 networks. A seamless transition from IPv4 to IPv6 is now possible.

Next, we demonstrate the performance of vMotion over a 40Gb/s network connecting two vSphere hosts. We also demonstrate the performance of networking traffic between two virtual machines created on the vSphere hosts.

System Configuration
We set up a test environment with the following specifications:

• Servers: 2 Dell PowerEdge R720 servers running vSphere 5.5.
• CPU: 2-socket, 12-core Intel Xeon E5-2667 @ 2.90 GHz.
• Memory: 64GB of memory; 32GB on each of two NUMA nodes.
• Networking: 1 dual-port Intel 10GbE adapter and 1 dual-port Broadcom 10GbE adapter, placed on separate PCIe Gen 2 x8 connections in both machines. We thus had 40Gb/s of network connectivity between the two vSphere hosts.
• Virtual Machine for vMotion: 1 VM running Red Hat Enterprise Linux Server 6.3 assigned 2 virtual CPUs (vCPUs) and 48GB memory. We migrate this VM between the two vSphere hosts.
• Virtual Machines for networking tests: A pair of VMs running Red Hat Enterprise Linux Server 6.3, each assigned 4 vCPUs and 16GB of memory, one on each host. We used these VMs to test the performance of network traffic between two VMs.

We configured each vSphere host with four vSwitches, each vSwitch having one 10GbE uplink port. We created one VMkernel adapter on each vSwitch. Each VMkernel adapter was configured on the same subnet. The MTU of the NICs was set to the default of 1500 bytes. We enabled each VMkernel adapter for vMotion, which allowed vMotion traffic to use the 40Gb/s network connectivity. We created four VMXNET3 virtual adapters on the pair of virtual machines used for networking tests.

Methodology
In order to demonstrate the performance for vMotion, we simulated a heavy memory usage footprint in the virtual machine. The memory-intensive program allocated 48GB memory in the virtual machine and touched one byte in each page in an infinite loop. We migrated this virtual machine between the two vSphere hosts over the 40Gb/s network. We used net-stats to monitor network throughput and CPU utilization on the sending and receiving systems. We also noted the bandwidth achieved in each pre-copy iteration of vMotion from VMkernel logs.
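The memory-intensive program itself is not published, but its behavior is easy to approximate. The sketch below is a stand-in, scaled down from 48GB so it can run on a modest machine: it allocates a buffer and writes one byte in every 4KB page in an infinite loop, which keeps guest memory dirty while vMotion is copying it.

```python
# Simplified stand-in for the memory-intensive program described above:
# allocate a buffer and touch one byte in every 4 KB page in an infinite
# loop so the guest keeps dirtying memory during the migration.
# (Scaled down from 48 GB; adjust ALLOC_BYTES to suit the VM's memory size.)
PAGE_SIZE = 4096
ALLOC_BYTES = 4 * 1024 * 1024 * 1024  # 4 GB for illustration

buf = bytearray(ALLOC_BYTES)  # pages are dirtied as we write into the buffer

value = 0
while True:
    for offset in range(0, ALLOC_BYTES, PAGE_SIZE):
        buf[offset] = value & 0xFF  # one write per page marks the page dirty
    value += 1
```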

In order to demonstrate the performance of virtual machine networking traffic, we used Netperf 2.60 to simulate traffic from one virtual machine to the other. We created two connections for each virtual adapter. Each connection generated traffic for the TCP_STREAM workload, with a 16KB message size and a 256KB socket buffer size. As in the previous experiment, we used net-stats to monitor network throughput and CPU utilization.
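Netperf's TCP_STREAM test is essentially a bulk sender pushing fixed-size messages over a TCP connection. The sketch below is not netperf; it only shows what the 16KB message size and 256KB socket buffer size correspond to at the socket level. The receiver address and port are placeholders, and AF_INET would be swapped for AF_INET6 in the IPv6 runs.

```python
# Simplified stand-in for a netperf TCP_STREAM sender (not netperf itself):
# push 16 KB messages over a TCP connection with a 256 KB socket send buffer.
import socket

MESSAGE_SIZE = 16 * 1024          # 16 KB message size
SOCKET_BUFFER = 256 * 1024        # 256 KB socket buffer size
RECEIVER = ("192.0.2.10", 5001)   # placeholder address and port

# Use socket.AF_INET6 (and an IPv6 address) for the IPv6 experiments.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, SOCKET_BUFFER)
sock.connect(RECEIVER)

message = b"\0" * MESSAGE_SIZE
bytes_sent = 0
try:
    while True:
        sock.sendall(message)
        bytes_sent += MESSAGE_SIZE
except KeyboardInterrupt:
    print(f"sent {bytes_sent / 1e9:.2f} GB")
finally:
    sock.close()
```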

Results
Figures 1 and 2 show, for IPv4 and IPv6 traffic, the network throughput and CPU utilization data that we collected over the 40-second duration of the migration. After the guest memory is staged for migration, vMotion begins iterations of pre-copying the memory contents from the source vSphere host to the destination vSphere host.

In the first iteration, the destination vSphere host needs to allocate pages for the virtual machine. Network throughput is below the available bandwidth in this stage as vMotion bandwidth usage is throttled by the memory allocation on the destination host. The average network bandwidth during this phase was 1897 megabytes per second (MB/s) for IPv4 and 1866MB/s for IPv6.

After the first iteration, the source vSphere host sends the delta of changed pages. During this phase, the average network bandwidth was 4301MB/s with IPv4 and 4091MB/s with IPv6.

The peak bandwidth measured in net-stats was 34.5Gb/s for IPv4 and 32.9Gb/s for IPv6. The CPU utilization of both systems followed a similar trend for both IPv4 and IPv6. Note that vMotion is very CPU intensive on the receiving vSphere host, and a high CPU clock speed is necessary to achieve high bandwidths. The results are summarized in Table 1. In all, migration of the virtual machine completed in 40 seconds over both IPv4 and IPv6.

Figure 1. vMotion over an IPv4 network
Figure 2. vMotion over an IPv6 network

Table 1. vMotion results: IPv4 versus IPv6
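As a sanity check on the units, the per-phase averages above are reported in megabytes per second while link speeds are quoted in gigabits per second. Assuming decimal units (1 MB = 10^6 bytes), the conversion is a factor of 8/1000:

```python
# Convert the reported vMotion averages from MB/s to Gb/s
# (assuming decimal units: 1 MB = 10**6 bytes, 1 Gb = 10**9 bits).
def mb_per_s_to_gb_per_s(mb_per_s):
    return mb_per_s * 8 / 1000

for label, mbps in [("IPv4, first iteration", 1897),
                    ("IPv6, first iteration", 1866),
                    ("IPv4, delta phase", 4301),
                    ("IPv6, delta phase", 4091)]:
    print(f"{label}: {mbps} MB/s = {mb_per_s_to_gb_per_s(mbps):.1f} Gb/s")

# The IPv4 delta-phase average (~34.4 Gb/s) is consistent with the 34.5 Gb/s
# peak seen in net-stats, close to the 40 Gb/s of available bandwidth.
```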

The results for virtual machine networking traffic are in Table 2. While the throughput with IPv6 is about 2.5% lower, the CPU utilization is the same on both the sending and the receiving sides.

Table 2. Virtual machine networking results: IPv4 versus IPv6

Thanks to a number of IPv6 enhancements added to vSphere 5.5, migrations with vMotion over IPv6 networks now occur at speeds within 5% of those over IPv4 networks. For virtual machine networking, IPv6 throughput is within 2.5% of IPv4. In addition, testing shows that we can drive bandwidth close to the 40Gb/s link speed with both protocols. Combined, this allows for a seamless transition from IPv4 to IPv6 with little performance impact.