
Monthly Archives: September 2009

Comparing Performance of 1vCPU Nehalem VM with 2vCPU Harpertown VM

vSphere has a new feature called Fault Tolerance (FT) that allows a VM to run in vLockstep on two physical servers at the same time.  In the event of a failure of the primary VM, the secondary VM immediately takes over with no downtime.  There is a great whitepaper that covers FT architecture and performance, and there have also been a couple of recent blog posts on VROOM! covering FT performance.  One uses VMmark to show that FT delivers excellent performance in a heavily loaded, multi-workload environment.  The other shows how an Exchange VM maintains excellent performance while supporting 2000 users with FT enabled.

FT currently requires that 1vCPU VMs be used.  This presents a challenge for some applications that have traditionally been run in 2vCPU VM configurations.  At the same time, new processors have features that provide much higher performance than in the past.  When combined with the performance enhancements of ESX 4, it is now possible to get much better performance per core. 

Testing Configuration

A series of Exchange Server 2007 tests was conducted to compare the performance of 1vCPU VMs on a current-generation processor with 2vCPU VMs on previous-generation processors.  For the 1vCPU tests, the Intel Xeon X5570 (Nehalem) processor was used with FT enabled.  (For detailed test results comparing FT enabled and disabled on the same VMs, read my previous blog post on Exchange with FT performance.)  For the 2vCPU tests, two previous-generation Intel processors were used: the Xeon X5355 (Clovertown) and the Xeon X5460 (Harpertown), in a Dell M600 and a Dell 2950 respectively.  Storage for all the tests was provided by several Dell EqualLogic PS5000XV iSCSI arrays.  Microsoft Exchange Load Generator (LoadGen) was used to run the tests.

The VM was configured with 10GB of RAM and installed with Windows Server 2008 x64 Enterprise Edition and the Exchange Server 2007 mailbox role.  A VM running on another ESX server hosted the domain controller and the Exchange Client Access and Hub Transport server roles.


The graph below shows the results in terms of the average latency for the sendmail action from LoadGen and the sum of the vCPU utilizations of the VM.  For these results the sum was used instead of the average because some VMs had 1vCPU and some had 2vCPUs. 


There are a couple of interesting things to note about the results. 

The first is that the sendmail average latency with FT enabled on a 1vCPU Xeon X5570 based VM with 1500 users was within 5ms of the 2vCPU Xeon X5460 VM with 2000 users.  This means that the Nehalem based 1vCPU VM supported 50% more users per vCPU than the 2vCPU Harpertown based VM.
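The 50% figure follows directly from the user counts and vCPU counts in these tests; a quick sketch of the arithmetic:

```python
# Users per vCPU for the two configurations discussed in this post.
nehalem_users, nehalem_vcpus = 1500, 1        # 1vCPU Xeon X5570 VM with FT enabled
harpertown_users, harpertown_vcpus = 2000, 2  # 2vCPU Xeon X5460 VM

per_vcpu_nehalem = nehalem_users / nehalem_vcpus          # 1500 users per vCPU
per_vcpu_harpertown = harpertown_users / harpertown_vcpus  # 1000 users per vCPU

gain = per_vcpu_nehalem / per_vcpu_harpertown - 1
print(f"{gain:.0%} more users per vCPU")  # 50% more users per vCPU
```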

Average CPU utilization on the 1vCPU VM with 2000 users and FT enabled was only 45%, which leaves headroom for spikes in usage.  This means that 2000 heavy online LoadGen users ran comfortably in a 1vCPU VM.


A 1vCPU Xeon X5500 series based Exchange Server VM can support 50% more users per core than a 2vCPU VM based on previous generation processors while maintaining the same level of performance in terms of Sendmail latency.  This is accomplished while the VM’s CPU utilization remains below 50%, allowing plenty of capacity for peaks in workload and making an FT VM practical for use with Exchange Server 2007.

Performance Troubleshooting for VMware vSphere 4 and ESX 4.0

Performance problems can arise in any computing environment. In a
virtualized computing environment performance problems can arise due to
new and often subtle interactions occurring in the shared
infrastructure. Uncovering the causes of those problems requires an
understanding of the available performance metrics and their
relationship to underlying configuration issues.

A new guide covering performance troubleshooting for VMware vSphere
4, including ESX 4.0 hosts, is now available. This document uses a guided
approach to lead the reader through the observable manifestations of
complex hardware/software interactions in order to identify specific
performance problems. For each problem covered, it includes a
discussion of the possible root-causes and solutions. Topics covered
include performance problems arising from issues in the CPU, memory,
storage, and network subsystems, as well as in the VM and ESX host.

The document is available on the VMware Performance Community.

Performance Evaluation of VMXNET3 Virtual Network Device

vSphere 4.0 introduces a new para-virtualized network device, VMXNET3.  We recently published a paper comparing its performance with that of enhanced VMXNET2, the previous generation of high-performance virtual network device from VMware.

Some highlights of this paper are:

(1) Throughput gains of up to 92% for 10G TCP/IPv4 Rx workloads with large socket buffer, which greatly improves bulk data transfer performance in a data center environment.

(2) Dramatic gains across all configurations of IPv6 traffic, with significant CPU usage reduction and throughput improvement over enhanced VMXNET2.

In a nutshell, VMXNET3 offers performance on par with or better than its predecessors on both Windows and Linux guests. Both the driver and the device have been highly tuned to perform better on modern systems.  Furthermore, VMXNET3 introduces new features and enhancements, such as TSO6 and RSS. TSO6 makes it especially useful for users deploying applications that deal with IPv6 traffic, while RSS is helpful for deployments requiring high scalability.  Moving forward, to keep pace with an ever-increasing demand for network bandwidth, we recommend customers migrate to VMXNET3.

For more details, please read our full paper.

Understanding Memory Resource Management in VMware ESX Server

We recently published a whitepaper on how ESX Server manages host memory.  The paper not only presents the basic memory resource management concepts but also shows experimental results explaining the performance impact of the three memory reclamation techniques used in ESX Server: page sharing, ballooning, and host swapping.  The experimental results show that:

1) Page sharing introduces negligible performance overhead;
2) Compared to host swapping, ballooning causes much less performance degradation when reclaiming memory; in some cases, ballooning incurs no performance overhead at all.

The following is a brief summary of the paper.

In general, ESX Server uses high-level resource management policies to compute a target memory allocation for every virtual machine based on the current system load and the virtual machine's parameter settings (shares, reservation, limit, etc.).  The computed target allocation guides the dynamic adjustment of each virtual machine's memory allocation.  When host memory is overcommitted, the target allocations are achieved by invoking lower-level memory reclamation techniques to reclaim memory from virtual machines.

In this paper, we start by introducing the basic memory virtualization concepts.  We then discuss why supporting memory overcommitment is necessary in ESX Server.  Three memory reclamation techniques are currently used in ESX Server: transparent page sharing (TPS), ballooning, and host swapping.  We illustrate the mechanism of each technique and analyze its pros and cons from a performance perspective.  In addition, we present how the ESX memory scheduler uses a share-based allocation algorithm to allocate memory among multiple virtual machines when host memory is overcommitted.
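To make the share-based allocation idea concrete, here is a minimal sketch of proportional-share allocation clamped by reservations and limits.  This is an illustrative toy, not ESX's actual algorithm; the function name and the iterative clamping scheme are assumptions for the example.

```python
# Toy sketch of share-based memory allocation (not ESX's real implementation).
# Each VM is described by (shares, reservation_mb, limit_mb); memory is divided
# in proportion to shares, with each VM's reservation as a floor and its limit
# as a ceiling.  VMs that hit a floor or ceiling get that amount, and the rest
# of the memory is re-divided among the remaining VMs.

def allocate_memory(vms, host_memory_mb):
    """vms: dict of name -> (shares, reservation_mb, limit_mb)."""
    alloc = {}
    remaining = host_memory_mb
    active = dict(vms)
    while active:
        total_shares = sum(shares for shares, _, _ in active.values())
        clamped = {}
        for name, (shares, res, lim) in active.items():
            target = remaining * shares / total_shares
            if target < res or target > lim:
                clamped[name] = min(max(target, res), lim)
        if not clamped:
            # Every remaining VM can take its pure proportional share.
            for name, (shares, _, _) in active.items():
                alloc[name] = remaining * shares / total_shares
            break
        # Fix clamped VMs at their floor/ceiling and redistribute the rest.
        for name, amount in clamped.items():
            alloc[name] = amount
            remaining -= amount
            del active[name]
    return alloc
```

For example, with three VMs on an 8GB host where "c" has a 4GB reservation, "c" is first pinned at its reservation and the remaining 4GB is split 2:1 between "a" and "b" according to their shares.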

Beyond the discussion of the techniques, we conducted experiments to help users understand how each memory reclamation technique impacts the performance of various applications.  In these experiments, we used the SPECjbb, kernel compile, Swingbench, and Exchange benchmarks to evaluate the different techniques.

Finally, based on the memory management concepts and performance evaluation results, we present some best practices for host and guest memory usage.

For more details, please read the full paper.