by Chien-Chia Chen
The networking stack of vSphere is, by default, tuned to balance the tradeoffs between CPU cost and latency to provide good performance across a wide variety of applications. However, there are some cases where using a tunable provides better performance. An example is Web-farm workloads, or any circumstance where a high consolidation ratio (lots of VMs on a single ESXi host) is preferred over extremely low end-to-end latency. VMware vSphere 6.0 introduces the Dynamic Host-Wide Performance Tuning feature (also known as dense mode), which provides a single configuration option to dynamically optimize individual ESXi hosts for high consolidation scenarios under certain use cases. Later in this blog, we define those use cases. Right now, we take a look at how dense mode works from an internal viewpoint.
Mitigating Virtualization Inefficiency under High Consolidation Scenarios
Figure 1 shows an example of the thread contexts within a high consolidation environment. In addition to the Virtual CPUs (each labeled VCPU) of the VMs, there are per-VM vmkernel threads (device-emulation threads, labeled “Dev Emu” in the figure) and multiple vmkernel threads for each Physical NIC (PNIC) executing physical device virtualization code and virtual switching code. One major source of virtualization inefficiency is the frequent context switches among all these threads. While context switches occur for a variety of reasons, the predominant networking-related one is Virtual NIC (VNIC) interrupt coalescing, namely, how frequently the vmkernel interrupts the guest for new receive packets (and vice versa for transmit packets). More frequent interrupts tend to lower per-packet latency but increase virtualization overhead. At very high consolidation ratios, the overhead from the increased interrupts hurts performance.
Dense mode uses two techniques to reduce the number of context switches:
- The VNIC coalescing scheme will be changed to a less aggressive scheme called static coalescing.
With static coalescing, a fixed number of requests is delivered in each batch of communication between the Virtual Machine Monitor (VMM) and the vmkernel. This generally reduces the frequency of communication, which means fewer context switches and better virtualization efficiency.
- The device emulation vmkernel thread wakeup opportunities are greatly reduced.
The device-emulation threads will now be executed only periodically, on a longer timer, or when the corresponding VCPUs are halted. This optimization greatly reduces how often the device-emulation threads are woken up, which in turn lowers the context-switch frequency.
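The saving from batched delivery can be illustrated with a toy model (a sketch only; the batch size and packet counts below are hypothetical and not vSphere's actual parameters):

```python
def deliveries_needed(num_packets, batch_size):
    """Number of VMM/vmkernel crossings needed to deliver num_packets
    when a fixed batch of batch_size packets is delivered per crossing
    (static coalescing); batch_size=1 models per-packet delivery."""
    # Ceiling division: a final partial batch still costs one crossing.
    return -(-num_packets // batch_size)

per_packet = deliveries_needed(10_000, 1)  # one crossing per packet
coalesced = deliveries_needed(10_000, 8)   # fixed batches of 8
print(per_packet, coalesced)  # 10000 1250
```

Even a modest fixed batch size cuts the number of VMM/vmkernel crossings (and hence context switches) by nearly the batch factor, at the cost of packets waiting for their batch to fill.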
Figure 1. High Consolidation Example
Enabling Dense Mode
Dense mode is disabled by default in vSphere 6.0. To enable it, change Net.NetTuneHostMode in the ESXi host’s Advanced System Settings (shown below in Figure 2) to dense.
Figure 2. Enabling Dynamic Host-Wide Performance Tuning
“default” is disabled; “dense” is enabled
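If you prefer the command line, the same tunable can be read and set from the ESXi Shell with esxcli (a sketch; run on the host directly or over SSH):

```shell
# Show the current value of the tunable (Net.NetTuneHostMode)
esxcli system settings advanced list -o /Net/NetTuneHostMode

# Enable dense mode
esxcli system settings advanced set -o /Net/NetTuneHostMode -s dense

# Revert to the default behavior
esxcli system settings advanced set -o /Net/NetTuneHostMode -s default
```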
Once dense mode is enabled, the system periodically checks the load of the ESXi host (every 60 seconds by default) based on the following three thresholds:
- Number of VMs ≥ number of PCPUs
- Number of VCPUs ≥ 2 * number of PCPUs
- Total PCPU utilization ≥ 50%
When the system load exceeds all three thresholds, these optimizations take effect for all regular VMs that carry default settings. When the system load drops below any one of the thresholds, the optimizations are automatically removed from all affected VMs, so that the ESXi host performs identically to when dense mode is disabled.
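The periodic check can be summarized in a short sketch (the function and variable names are ours, not the actual vmkernel implementation; utilization is expressed as a fraction):

```python
def dense_optimizations_active(num_vms, num_vcpus, num_pcpus, pcpu_util):
    """True when all three dense-mode thresholds are met; evaluated
    periodically (every 60 seconds by default). Dropping below any
    one threshold deactivates the optimizations."""
    return (num_vms >= num_pcpus and
            num_vcpus >= 2 * num_pcpus and
            pcpu_util >= 0.50)

# A 40-PCPU host running 470 single-VCPU VMs at 85% utilization:
print(dense_optimizations_active(470, 470, 40, 0.85))  # True
# The same host at 40% utilization falls below the CPU threshold:
print(dense_optimizations_active(470, 470, 40, 0.40))  # False
```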
Enabling dense mode can negatively impact performance for some applications. So, before enabling it, carefully profile the applications to determine whether or not the workload will benefit from this feature. Generally speaking, the feature improves the VM consolidation ratio on an ESXi host running medium-network-throughput applications that have some latency tolerance and are CPU bound. A good use case is a Web-farm workload, which needs CPU to process Web requests while generating only a medium level of network traffic and tolerating a few milliseconds of additional end-to-end latency. In contrast, if the bottleneck is not the CPU, enabling this feature only hurts network latency because of the less frequent context switching. For example, the following workloads are NOT good use cases for the feature:
- Throughput-intensive workloads: Since the network is the bottleneck, reducing the CPU cost does not necessarily improve network throughput.
- Little or no network traffic: If there is too little network traffic, the dense mode optimizations barely have any effect.
- Latency-sensitive workloads: Latency-sensitive workloads need a different set of optimizations, which is documented in the “Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5” performance white paper.
To evaluate this feature, we implemented a lightweight Web benchmark consisting of two lightweight clients and a large number of lightweight Web server VMs. The clients send HTTP requests to all Web servers at a given request rate, wait for responses, and report the response time. Each request is for static content comprising multiple text and JPEG files totaling around 100KB in size. The Web servers have memory caching enabled and therefore serve all content from memory. Two different request rates are used in the evaluation:
- Medium request rate: 25 requests per second per server
- High request rate: 50 requests per second per server
In both cases, the total packet rate on the ESXi host is around 400 Kilo-Packets/Second (KPPS) to 700 KPPS in each direction, where the receiving packet rate is slightly higher than the transmitting packet rate.
We configured our systems as follows:
- One ESXi host (running Web server VMs)
- Machine: HP DL580 G7 server running vSphere 6.0
- CPU: Four 10-core Intel® Xeon® E7-4870 @ 2.4 GHz
- Memory: 512 GB memory
- Physical NIC: Two dual-port Intel X520 with a total of three active 10GbE ports
- Virtual Switching: One virtual distributed switch (vDS) with three 10GbE uplinks using default teaming policy
- VM: Red Hat Enterprise Linux Server 6.3 assigned one VCPU, 1GB memory, and one VMXNET3 VNIC
- Two Clients (generating Web requests)
- Machine: HP DL585 G7 server running Red Hat Enterprise Linux Server 6.3
- CPU: Four 8-core AMD Opteron™ 6212 @ 2.6 GHz
- Memory: 128 GB memory
- Physical NIC: One dual-port Intel X520 with one active 10GbE port on each client
Medium Request Rate
We first present the evaluation results for the medium request rate workload. Figures 3 and 4 below show the 95th-percentile response time and the total host CPU utilization, respectively, as the number of VMs increases. For the 95th-percentile response time, we consider 100ms the preferred latency tolerance.
Figure 3 shows that at 100ms, default mode consolidates only about 470 Web server VMs, whereas dense mode consolidates more than 510 VMs, which is an over 10% improvement. For CPU utilization, we consider 90% is the desired maximum utilization.
Figure 3. Medium Request Rate 95th-Percentile Response Time
(Latency Tolerance 100ms)
Figure 4 shows that at 90% utilization, default mode consolidates around 465 Web server VMs, whereas dense mode consolidates about 495 Web server VMs, still a nearly 10% improvement in consolidation ratio. We also notice that dense mode in fact reduces response time as well. This is because the large reduction in context switching improves virtualization efficiency, which compensates for the increase in latency due to more aggressive batching.
Figure 4. Medium Request Rate Host Utilization
(Desired Maximum Utilization 90%)
High Request Rate
Figures 5 and 6 below show the 95th-percentile response time and the total host CPU utilization for the high request rate, respectively, as the number of VMs increases. Because the request rate is doubled, we reduce the number of Web server VMs consolidated on the ESXi host. Figure 5 shows that at 100ms response time, dense mode consolidates only about 5% more VMs than default mode (from ~280 VMs to ~290 VMs). However, if we look at the CPU utilization shown in Figure 6, at the 90% desired maximum load, dense mode still consolidates about 10% more VMs (from ~240 VMs to ~260 VMs). Considering both the response time and utilization metrics, because there are fewer active contexts under the high request rate workload, the benefit of reducing context switches is less significant than in the medium request rate case.
Figure 5. High Request Rate 95th-Percentile Response Time
(Latency Tolerance 100ms)
Figure 6. High Request Rate Host Utilization
(Desired Maximum Utilization at 90%)
We presented the Dynamic Host-Wide Performance Tuning feature, also known as dense mode. We showed that a Web-farm-like workload achieves an up to 10% higher consolidation ratio while still meeting a 100ms latency tolerance and 90% maximum host utilization. We emphasized that the improvements do not apply to every kind of application, so you should carefully profile your workloads before enabling dense mode.