
Monthly Archives: March 2009

Performance Evaluation of Intel EPT Hardware Assist

We recently released a whitepaper that demonstrates huge
performance gains provided by VMware ESX on the latest Intel Xeon™ 5500-series
processors. These processors introduce Intel’s second-generation hardware
support for virtualization that incorporates memory management unit (MMU)
virtualization called Extended Page Tables™ (EPT). ESX has been adopting these
technologies as they are introduced, and many workloads see performance
benefits as a result. The benefits seen can include higher throughput and lower CPU utilization, improving user experience and freeing
up servers for greater consolidation. The paper can be found here: Performance Evaluation of Intel EPT Hardware Assist

 

We did a similar performance study when AMD RVI™ was released.
However, as the two studies were done with different VMware ESX builds, the
results are not directly comparable. The performance gains observed in this
paper were up to 48% for MMU-intensive benchmarks and up to 600% for
MMU-intensive microbenchmarks compared to software MMU virtualization. We also
observed that although EPT increases memory access latencies for a few
workloads, this cost can be reduced by effectively using large pages in the
guest and the hypervisor. For optimal performance, ESX aggressively tries to use
large pages for its own memory when EPT is used.
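
To make "using large pages in the guest" concrete, here is a minimal sketch, assuming a Linux guest with huge pages already reserved and a hugetlbfs mount at /mnt/huge (both the path and the reservation step are assumptions for illustration, not configuration from the paper). It simply backs a buffer with 2 MB pages, so the guest page tables, and with EPT the hardware page walker, handle far fewer entries for that memory.

/* Hedged sketch (not from the whitepaper): one way a Linux guest application
 * can back a buffer with 2 MB large pages, via a file on a hugetlbfs mount.
 * Assumes huge pages were reserved and hugetlbfs is mounted, for example:
 *   echo 128 > /proc/sys/vm/nr_hugepages
 *   mount -t hugetlbfs none /mnt/huge
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)        /* 2 MB on x86 */

int main(void)
{
    size_t len = 16 * HUGE_PAGE_SIZE;             /* 32 MB buffer */
    int fd = open("/mnt/huge/buf", O_CREAT | O_RDWR, 0600);  /* hypothetical path */
    if (fd < 0) {
        perror("open hugetlbfs file");
        return EXIT_FAILURE;
    }

    /* Mappings of hugetlbfs files are backed by 2 MB pages, so this region
     * is covered by fewer, larger page-table entries and suffers fewer
     * TLB misses than the same buffer in 4 KB pages. */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return EXIT_FAILURE;
    }

    memset(buf, 0, len);                          /* touch the memory */
    munmap(buf, len);
    close(fd);
    unlink("/mnt/huge/buf");
    return EXIT_SUCCESS;
}

Whether this helps a given workload varies; the point in the paper is simply that large pages in both the guest and the hypervisor reduce the extra translation cost that nested paging can add.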

 

Prior to the introduction in 2006 of first-generation hardware
support for x86 virtualization, AMD-V from AMD and Intel VT-x from Intel, the
VMware virtual machine monitor (VMM) relied upon software-only techniques for
virtualizing x86 processors.

We used:

- Binary Translation (BT) for instruction-set virtualization

- Shadow Paging for MMU virtualization

- Device Emulation for device virtualization.

 

With the advent of first-generation hardware support on Intel Xeon
processors, the VMM could make use of the hardware features for instruction-set
virtualization. However, MMU and device virtualization were still done in
software. Now with the introduction of second-generation virtualization
hardware support (Intel VT-x w/ EPT), the VMM can take advantage of
hardware-assist for both instruction-set and MMU virtualization. MMU
virtualization allows the guest to access only those memory locations that
belong to it. In software MMU virtualization, this requires the VMM to
intercept guest execution when the guest updates its virtual memory data
structures (page tables). In hardware MMU virtualization the hardware provides
a mechanism by which the VMM no longer needs to intercept guest execution
during page table updates. This results in significant performance improvements
for workloads that stress the x86 MMU.
Continue reading here.
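
To make the difference concrete, the toy model below separates the two translations involved: guest virtual to guest physical (maintained by the guest) and guest physical to host physical (maintained by the VMM). The flat arrays and function names are invented for illustration; real x86 page tables are multi-level, and ESX internals are not shown. With shadow paging, the VMM traps guest page-table writes to keep a combined guest-virtual-to-host-physical table for the hardware; with EPT, the hardware performs both walks itself on a TLB miss, so those traps go away.

/* Illustrative sketch only: a toy model of the two translation steps that
 * nested paging (EPT/RVI) composes in hardware. Single-level "page tables"
 * are modeled as flat arrays of page-frame numbers; the names and layout
 * are invented for this example. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                       /* 4 KB pages */
#define NUM_PAGES  1024

static uint64_t guest_pt[NUM_PAGES];        /* guest virtual  -> guest physical (guest-managed) */
static uint64_t nested_pt[NUM_PAGES];       /* guest physical -> host physical  (VMM-managed)   */

/* Step 1: the guest's own page tables. */
static uint64_t gva_to_gpa(uint64_t gva)
{
    uint64_t gfn = guest_pt[(gva >> PAGE_SHIFT) % NUM_PAGES];
    return (gfn << PAGE_SHIFT) | (gva & ((1 << PAGE_SHIFT) - 1));
}

/* Step 2: the VMM's guest-physical to host-physical mapping.
 * With EPT/RVI the hardware performs this second walk on a TLB miss, so the
 * VMM no longer traps guest page-table updates. With shadow paging the VMM
 * instead intercepts those updates and maintains a single combined
 * gva -> hpa "shadow" table for the hardware to walk. */
static uint64_t gpa_to_hpa(uint64_t gpa)
{
    uint64_t hfn = nested_pt[(gpa >> PAGE_SHIFT) % NUM_PAGES];
    return (hfn << PAGE_SHIFT) | (gpa & ((1 << PAGE_SHIFT) - 1));
}

int main(void)
{
    /* Invented mappings, just to exercise the two steps. */
    for (uint64_t i = 0; i < NUM_PAGES; i++) {
        guest_pt[i]  = i + 100;             /* guest maps page i to guest frame i+100 */
        nested_pt[i] = i + 500;             /* VMM maps guest frame i to host frame i+500 */
    }

    uint64_t gva = 0x3004;
    uint64_t hpa = gpa_to_hpa(gva_to_gpa(gva));
    printf("gva 0x%llx -> hpa 0x%llx\n",
           (unsigned long long)gva, (unsigned long long)hpa);
    return 0;
}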

Scale-out of XenApp on ESX 3.5

In an earlier posting (Virtualizing XenApp on XenServer 5.0 and ESX 3.5) we looked at the performance of virtualizing a Citrix XenApp workload in a 2-vCPU VM in comparison to the native OS booted with two cores. This provided valuable data about the single-VM performance of XenApp running on ESX 3.5. In our next set of experiments we used the same workload, and the same hardware, but scaled out to 8 VMs. This is compared to the native OS booted with all 16 cores. We found that ESX has near-linear scaling as the number of VMs is increased, and that aggregate performance with 8 VMs is much better than native.

We expected the earlier single-VM approach to produce representative results because of the excellent scale-out performance of ESX 3.5. This is especially true on NUMA architectures, where VMs running on different nodes are nearly independent in terms of CPU and memory resources. However, the same cannot be said for the scale-up performance (SMP scaling) of a single native machine, or a single VM. As with many other applications, virtualizing many relatively small XenApp servers on a single machine can overcome the inherent SMP performance limitations of XenApp on that machine.

In the current experiments, each VM is the same as before, except the allocated memory is set to 6700 MB (the amount needed to run 30 users). Windows 2003 x64 was used in both the VMs and natively. See the above posting for more workload and configuration details. Shown below is the average aggregate latency as a function of the total number of users. Every data point shown is a separate run with about 4 hours of steady state execution. Each user performs six iterations where a complete set of 22 workload operations is performed during each iteration. The latency of these operations is summed to get the aggregate latency. The average is over the middle four iterations, all the users, and all the VMs.

[Chart: average aggregate latency vs. total number of users, Native vs. 8 VMs on ESX]
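
To spell out the metric plotted above, here is a hedged sketch of the calculation described earlier; the array layout and names are assumptions for illustration, not the actual benchmark harness. Each user's 22 operation latencies are summed per iteration, the first and last of the six iterations are dropped, and the remainder is averaged over iterations, users, and VMs.

/* Hedged illustration of the reported metric, not the actual harness code.
 * latencies[vm][user][iteration][op] holds per-operation latencies in seconds;
 * the dimensions below are assumptions matching the description in the text. */
#include <stdio.h>

#define NUM_VMS        8
#define USERS_PER_VM   30
#define NUM_ITERATIONS 6     /* only the middle four are averaged */
#define NUM_OPS        22

double average_aggregate_latency(
    double latencies[NUM_VMS][USERS_PER_VM][NUM_ITERATIONS][NUM_OPS])
{
    double total = 0.0;
    int samples = 0;

    for (int vm = 0; vm < NUM_VMS; vm++) {
        for (int user = 0; user < USERS_PER_VM; user++) {
            /* Skip the first and last iteration (warm-up / wind-down). */
            for (int it = 1; it < NUM_ITERATIONS - 1; it++) {
                double aggregate = 0.0;      /* sum of the 22 operation latencies */
                for (int op = 0; op < NUM_OPS; op++)
                    aggregate += latencies[vm][user][it][op];
                total += aggregate;
                samples++;
            }
        }
    }
    return total / samples;                  /* average over iterations, users, VMs */
}

int main(void)
{
    static double latencies[NUM_VMS][USERS_PER_VM][NUM_ITERATIONS][NUM_OPS];
    /* In a real run the array would be filled with measured latencies. */
    printf("average aggregate latency: %.2f s\n",
           average_aggregate_latency(latencies));
    return 0;
}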

In both the Native and ESX cases all 16 cores are being used (although with much less than 100% utilization). At very low load Native has somewhat better total latency, but beyond 80 users the latency quickly degrades. Starting at 140 users some of the sessions start to fail. 120 users is really the upper limit for running this workload on Native. With 8 VMs on ESX, 20 users per VM (160 total) was not a problem at all, so we pushed the load up to 240 total users. At this point the latency is getting high, but there were no failures and all of the desktop applications were still usable. The load has to be increased to more than 200 users on ESX before the latency exceeds that from 120 users on Native. That is, for a Quality-of-Service standard of 39 seconds aggregate latency, ESX supports 67% more users than Native. Like many commonly-deployed applications, XenApp has limited SMP scalability. Roughly speaking, its scalability is better than common web apps but not as good as well-tuned databases. When running this workload, XenApp scales well to 4 CPUs, but 8 CPUs is marginal and 16 CPUs is clearly too many. Dividing the load among smaller VMs avoids SMP scaling issues and allows the full capabilities of the hardware to be utilized.

Some would say that even 200 XenApp users are not very many for such a powerful machine. In any benchmark of this kind many decisions have to be made with regard to the choice of applications, operations within each application, and amount of “user think time” between operations. As we pointed out earlier, we strove to make realistic choices when designing the VDI workload. However, one may choose to model users performing less resource-intensive operations and thus be able to support more of them.

The scale-out performance of ESX is quantified in the second chart, which shows the total latency as a function of the number of VMs, with all VMs running a low (10 users), medium (20 users), or high (30 users) load. Flat lines would indicate perfect scalability, and the lines are in fact nearly flat for each of the load cases up to 4 VMs. The latency increases noticeably only for 8 VMs, and then only for higher loads. This indicates that the increased application latency is mostly due to the increased memory latency caused by running 2 VMs per NUMA node (as opposed to at most a single VM per node for four or fewer VMs).
[Chart: total latency vs. number of VMs at low, medium, and high per-VM loads]

While our first blog showed how low the overhead is for running XenApp on a single 2-vCPU VM on ESX compared to a native OS booted with two CPUs, the current results, which fully utilize the 16-core machine, are even more compelling. They show the excellent scale-out performance of ESX on a modern NUMA machine, and that the aggregate performance of several VMs can far exceed the capabilities of a single native OS.

Performance Evaluation of AMD RVI Hardware Assist

We recently released a whitepaper that demonstrates huge performance gains provided by VMware ESX on the latest third-generation AMD Opteron™ processors. These processors introduce AMD’s second-generation hardware support for virtualization, which incorporates memory management unit (MMU) virtualization called Rapid Virtualization Indexing™ (RVI). Intel has also announced MMU virtualization support, called Extended Page Tables™ (EPT), in its “Nehalem” processors. ESX has been adopting these technologies as they are introduced, and many workloads see performance benefits as a result. The benefits seen can include higher throughput and lower CPU utilization, improving user experience and freeing up servers for greater consolidation. The paper can be found here: Performance Evaluation of AMD RVI Hardware Assist

The performance gains observed in this paper were up to 42% for MMU-intensive benchmarks and up to 500% for MMU-intensive microbenchmarks compared to software-only virtualization. We also observed that although RVI increases memory access latencies for a few workloads, this cost can be reduced by effectively using large pages in the guest and the hypervisor. For optimal performance, ESX aggressively tries to use large pages for its own memory when RVI is used.

Prior to the introduction in 2006 of first-generation hardware support for x86 virtualization, AMD-V from AMD and Intel VT-x from Intel, the VMware virtual machine monitor (VMM) relied upon software-only techniques for virtualizing x86 processors.

We used:

- Binary Translation (BT) for instruction-set virtualization

- Shadow Paging for MMU virtualization

- Device Emulation for device virtualization.

With the advent of first-generation hardware support, the VMM could make use of the hardware features for instruction-set virtualization. However, MMU and device virtualization were still done in software. Now with the introduction of second-generation virtualization hardware support, the VMM can take advantage of hardware-assist for both instruction-set and MMU virtualization. MMU virtualization allows the guest to access only those memory locations that belong to it. In software MMU virtualization, this requires the VMM to intercept guest execution when the guest updates its virtual memory data structures (page tables). In hardware MMU virtualization the hardware provides a mechanism by which the VMM no longer needs to intercept guest execution during page table updates. This results in significant performance improvements for workloads that stress the x86 MMU.
Continue reading here.