VMware

November 16, 2009

Performance Study of VMware vStorage Thin Provisioning

vStorage Thin Provisioning, a key component of VMware vSphere™, is a technology that redefines storage provisioning by allocating space to virtual disks on demand. We recently published a paper that describes this feature in detail and discusses experiments conducted to study the performance of thin-provisioned disks in a VMware vSphere™ environment.


The data we collected shows that:

  • Thin and thick disks perform similarly on various workloads.
  • Thin provisioned disks show performance trends similar to those of thick disks when scaled across different hosts.
  • External fragmentation has negligible impact on the performance of thin provisioned disks.
  • Implementing thin provisioning on a shared array has an insignificant performance impact on existing thick disks.

Figure: Performance of thin provisioned and thick disks
The above graph shows that the performance of a thin provisioned disk matches that of a thick disk. For more details on the test environment and the experiments conducted, we invite you to read the full whitepaper at http://www.vmware.com/pdf/vsp_4_thinprov_perf.pdf.
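
The on-demand allocation that thin provisioning performs can be pictured with a toy model. The sketch below is not VMware's implementation: the class name ThinDisk, the 1 MB block size, and the dictionary-backed store are all assumptions made for illustration. It shows only the basic idea that a thick disk reserves its full capacity at creation, while a thin disk consumes backing space when a block is first written.

```python
BLOCK_SIZE_MB = 1  # hypothetical allocation unit, chosen only for illustration


class ThinDisk:
    """Toy model of a thin-provisioned virtual disk (not VMware's implementation)."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.blocks = {}  # block index -> data; stays empty until the first write

    def write(self, block_index, data):
        if not 0 <= block_index < self.capacity_mb // BLOCK_SIZE_MB:
            raise IndexError("write beyond provisioned capacity")
        # Backing space is consumed only the first time a block is written;
        # rewriting an existing block does not grow the allocation.
        self.blocks[block_index] = data

    def read(self, block_index):
        # Blocks that were never written read back as zeros.
        return self.blocks.get(block_index, b"\x00")

    def allocated_mb(self):
        return len(self.blocks) * BLOCK_SIZE_MB


def thick_allocated_mb(capacity_mb):
    """A thick disk reserves its full capacity up front."""
    return capacity_mb
```

In this model, a 1024 MB thin disk with two distinct blocks written reports 2 MB allocated, whereas the thick equivalent reports 1024 MB from the start.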


October 20, 2009

Storage performance improvements in vSphere 4.0

We made a huge number of performance improvements in vSphere 4.0, and the ESX storage stack was no exception. We ran a wide variety of micro-benchmarks and real-world benchmarks to thoroughly evaluate and optimize vSphere’s storage subsystem. It is now even more efficient for the enterprise and ready to support the cloud.

With these improvements, a wide variety of I/O-intensive applications will run efficiently on vSphere. You can find details on the architectural changes and storage performance improvements in this white paper.

Some of the noteworthy improvements are:

  • VMware Paravirtualized SCSI (PVSCSI driver): vSphere ships with this new high-performance virtual storage adapter. Until now, BusLogic and LSI Logic were the only choices. PVSCSI is best suited to running highly I/O-intensive applications in the guest more efficiently (using fewer CPU cycles). This is made possible by a series of optimizations explained in the paper.

  • iSCSI support improvements: We made significant improvements to the iSCSI stack for both software and hardware iSCSI, in features as well as performance. Noteworthy among these are CPU efficiency improvements that range from 7% to 52%, depending on the type and size of I/O.

  • Software iSCSI and NFS support with Jumbo Frames: vSphere adds jumbo frame and 10Gbit NIC support for both NFS and iSCSI. This helps drive bandwidth that is many times higher than in previous ESX releases.

  • File system improvements for enhanced Virtual Desktop experience and scalable cloud solutions: We made several optimizations in the VMware File System (VMFS) with a special focus on enterprise desktop and cloud solutions. These file system changes, along with improvements in other parts of ESX, dramatically improve the performance of several provisioning operations. An example is “boot storm” performance (where several hundred virtual machines are booted simultaneously in a virtual desktop environment). With these improvements, booting a large number of virtual machines simultaneously is many times faster than on ESX 3.5.
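
One reason jumbo frames help is that a larger MTU carries the same I/O in fewer frames, so there is less per-frame processing. The sketch below is back-of-the-envelope arithmetic, not a measurement from the paper. The 9000-byte jumbo MTU is a commonly used value assumed here, and the calculation counts only IPv4 and TCP headers without options; it ignores iSCSI and NFS protocol headers and offloads such as TSO.

```python
import math

IP_TCP_HEADERS = 40  # bytes: 20-byte IPv4 header + 20-byte TCP header, no options


def frames_needed(io_bytes, mtu):
    """Number of frames needed to carry io_bytes of payload at a given MTU."""
    payload_per_frame = mtu - IP_TCP_HEADERS
    return math.ceil(io_bytes / payload_per_frame)


io_size = 64 * 1024                      # a 64 KB I/O
standard = frames_needed(io_size, 1500)  # 45 frames at the standard Ethernet MTU
jumbo = frames_needed(io_size, 9000)     # 8 frames at an assumed 9000-byte MTU
print(f"standard MTU: {standard} frames, jumbo MTU: {jumbo} frames")
```

Under these assumptions, a 64 KB transfer drops from 45 frames to 8, which is the kind of reduction in per-packet work that makes higher bandwidth easier to drive.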

ESX supports several storage protocols, such as Fibre Channel, iSCSI, and NFS. We published a white paper that compares I/O performance using each of these protocols. Results show that line rate can be achieved with each of the storage protocols for single or multiple virtual machines. The paper also highlights CPU efficiency improvements in vSphere compared to the previous release, which mean that more virtual machines can now run on the same hardware. The graph below shows one example (sequential read, 64KB block size) of the relative CPU cost for each of the storage protocols. Results on ESX 4.0 are shown next to ESX 3.5 to highlight efficiency improvements on all protocols.


Hardware configuration and detailed results can be found in this protocol comparison white paper.


Figure: Relative CPU cost of 64 KB sequential reads in a single virtual machine (lower is better)

 


September 21, 2009

Comparing Performance of 1vCPU Nehalem VM with 2vCPU Harpertown VM

vSphere has a new feature called Fault Tolerance (FT) that allows a VM to run in vLockstep on two physical servers at the same time.  In the event of a failure of the primary VM, the secondary VM immediately takes over with no downtime for the VM.  There is a great whitepaper that covers FT architecture and performance.  There have also been a couple of recent blog posts on VROOM! that cover FT performance.  One uses VMmark to show that FT performs well in a heavily loaded multi-workload environment.  The other shows how an Exchange VM maintains excellent performance while supporting 2000 users with FT enabled.

FT currently requires that 1vCPU VMs be used.  This presents a challenge for some applications that have traditionally been run in 2vCPU VM configurations.  At the same time, new processors have features that provide much higher performance than in the past.  When combined with the performance enhancements of ESX 4, it is now possible to get much better performance per core. 

Testing Configuration

A series of Exchange Server 2007 tests was conducted to compare the performance of 1vCPU VMs on the current processor generation with 2vCPU VMs on previous processor generations.  For the 1vCPU tests, the Intel Xeon X5570 (Nehalem) processor was used with FT enabled.  (For detailed test results comparing FT enabled and disabled on the same VMs, read my previous blog post on Exchange with FT Performance.)  For the 2vCPU tests, two previous-generation Intel processors were used: a Xeon X5355 (Clovertown) and a Xeon X5460 (Harpertown).  The servers used were a Dell M600 and a Dell 2950, respectively.  Storage for all the tests was provided by several Dell EqualLogic PS5000XV iSCSI arrays.  Microsoft Exchange Load Generator (LoadGen) was used to run the tests.

The VM was configured with 10GB of RAM and installed with Windows Server 2008 x64 Enterprise Edition and the Exchange Server 2007 mailbox role.  A VM running on another ESX server served as the domain controller and Exchange Client Access and Hub Transport server roles. 

Results

The graph below shows the results in terms of the average latency for the sendmail action from LoadGen and the sum of the vCPU utilizations of the VM.  For these results the sum was used instead of the average because some VMs had 1vCPU and some had 2vCPUs. 

Figure: Sendmail average latency and total vCPU utilization, 1vCPU VM with FT versus 2vCPU VMs

There are a couple of interesting things to note about the results. 

The first is that the sendmail average latency with FT enabled on a 1vCPU Xeon X5570 based VM with 1500 users was within 5ms of that of the 2vCPU Xeon X5460 VM with 2000 users.  This means that the Nehalem based 1vCPU VM supported 50% more users per vCPU (1500 versus 1000) than the 2vCPU Harpertown based VM.

Average CPU utilization on the 1vCPU VM with 2000 users and FT enabled was only 45%, which leaves headroom for spikes in usage.  This means that 2000 heavy online LoadGen users ran comfortably in a 1vCPU VM.

Conclusion

A 1vCPU Xeon X5500 series based Exchange Server VM can support 50% more users per core than a 2vCPU VM based on previous generation processors while maintaining the same level of performance in terms of Sendmail latency.  This is accomplished while the VM’s CPU utilization remains below 50%, allowing plenty of capacity for peaks in workload and making an FT VM practical for use with Exchange Server 2007.
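
The 50% figure follows from dividing the user counts by the vCPU counts reported in the results. The short calculation below simply restates that arithmetic; the function and variable names are illustrative.

```python
def users_per_vcpu(users, vcpus):
    """Users supported per virtual CPU."""
    return users / vcpus


nehalem = users_per_vcpu(1500, 1)     # 1vCPU Xeon X5570 VM with FT enabled
harpertown = users_per_vcpu(2000, 2)  # 2vCPU Xeon X5460 VM
gain_pct = (nehalem / harpertown - 1) * 100

print(f"{nehalem:.0f} vs {harpertown:.0f} users per vCPU: {gain_pct:.0f}% more")
# prints "1500 vs 1000 users per vCPU: 50% more"
```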


September 18, 2009

Performance Troubleshooting for VMware vSphere 4 and ESX 4.0

Performance problems can arise in any computing environment. In a virtualized computing environment, performance problems can arise due to new and often subtle interactions occurring in the shared infrastructure. Uncovering the causes of those problems requires an understanding of the available performance metrics and their relationship to underlying configuration issues.

A new guide covering performance troubleshooting for VMware vSphere 4, including ESX 4.0 hosts, is now available. This document uses a guided approach to lead the reader through the observable manifestations of complex hardware/software interactions in order to identify specific performance problems. For each problem covered, it includes a discussion of the possible root causes and solutions. Topics covered include performance problems arising from issues in the CPU, memory, storage, and network subsystems, as well as in the VM and ESX host configuration.

The document is available on the VMware Performance Community at http://communities.vmware.com/docs/DOC-10352


September 14, 2009

Performance Evaluation of VMXNET3 Virtual Network Device

vSphere 4.0 introduces a new paravirtualized network device, VMXNET3.  We recently published a paper demonstrating its performance characteristics compared to those of enhanced VMXNET2 (the previous generation of high-performance virtual network device from VMware).

Some highlights of this paper are:

(1) Throughput gains of up to 92% for 10G TCP/IPv4 Rx workloads with large socket buffers, which greatly improve bulk data transfer performance in a data center environment.

(2) Dramatic gains across all configurations of IPv6 traffic, with significant CPU usage reduction and throughput improvement over enhanced VMXNET2.

In a nutshell, VMXNET3 offers performance on par with or better than its predecessors on both Windows and Linux guests. Both the driver and the device have been highly tuned to perform better on modern systems.  Furthermore, VMXNET3 introduces new features and enhancements, such as TSO6 and RSS. TSO6 makes it especially useful for users deploying applications that deal with IPv6 traffic, while RSS is helpful for deployments requiring high scalability.  Moving forward, to keep pace with an ever-increasing demand for network bandwidth, we recommend customers migrate to VMXNET3.

For more details, please read our full paper here.

September 08, 2009

Understanding Memory Resource Management in VMware ESX Server

We recently published a whitepaper about how ESX Server manages host memory. The paper not only presents the basic memory resource management concepts but also shows experimental results explaining the performance impact of the three memory reclamation techniques used in ESX Server: page sharing, ballooning, and host swapping. The experimental results show that:

1) Page sharing introduces negligible performance overhead;
2) Compared to host swapping, ballooning causes much less performance degradation when reclaiming memory. In some cases, ballooning incurs no performance overhead at all.

The following is a brief summary of the paper.

In general, ESX Server uses high-level resource management policies to compute a target memory allocation for every virtual machine based on the current system load and the parameter settings for the virtual machine (shares, reservation, limit, and so on). The computed target allocation guides the dynamic adjustment of the memory allocation for each virtual machine. When host memory is overcommitted, the target allocations are still achieved by invoking several lower-level memory reclamation techniques to reclaim memory from virtual machines.

In this paper, we start by introducing the basic memory virtualization concepts. Then, we discuss why supporting memory overcommitment is necessary in ESX Server. Three memory reclamation techniques are currently used in ESX Server: Transparent Page Sharing (TPS), ballooning, and host swapping. We illustrate the mechanism of each technique and analyze its pros and cons from a performance perspective. In addition, we present how the ESX memory scheduler uses a share-based allocation algorithm to allocate memory for multiple virtual machines when host memory is overcommitted.
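
To make the roles of shares, reservation, and limit concrete, here is a minimal sketch of a share-based split. It is not the ESX memory scheduler, whose actual algorithm is described in the paper. The function name allocate_memory, the input layout, and the choice to distribute only the memory left after reservations are assumptions made for illustration. Each VM is guaranteed its reservation, never exceeds its limit, and otherwise receives memory in proportion to its shares.

```python
def allocate_memory(host_mb, vms):
    """Split host_mb among VMs in proportion to shares, honoring each VM's
    reservation (floor) and limit (ceiling). Illustrative only.

    vms maps a VM name to a dict with 'shares' (positive), 'reservation',
    and 'limit', with memory values in MB.
    """
    # Every VM first receives its reservation.
    alloc = {name: float(vm["reservation"]) for name, vm in vms.items()}
    remaining = host_mb - sum(alloc.values())
    if remaining < 0:
        raise ValueError("reservations exceed host memory")

    # Hand out what is left in proportion to shares. A VM that reaches its
    # limit drops out, and its unused portion is redistributed next round.
    active = {name for name, vm in vms.items() if alloc[name] < vm["limit"]}
    while remaining > 1e-9 and active:
        total_shares = sum(vms[name]["shares"] for name in active)
        pool, remaining = remaining, 0.0
        for name in list(active):
            offer = pool * vms[name]["shares"] / total_shares
            room = vms[name]["limit"] - alloc[name]
            if offer >= room:
                alloc[name] = float(vms[name]["limit"])
                remaining += offer - room
                active.discard(name)
            else:
                alloc[name] += offer
    return alloc
```

For example, on an 8192 MB host with VM A (2000 shares, 1024 MB reservation, no effective limit), VM B (1000 shares, 1024 MB reservation, 2048 MB limit), and VM C (1000 shares, no reservation), B is capped at 2048 MB and the memory it cannot use is divided between A and C in a 2:1 ratio.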

Beyond the technical discussion, we conduct experiments to help users understand how individual memory reclamation techniques impact the performance of various applications. In these experiments, we use the SPECjbb, Kernel Compile, Swingbench, and Exchange benchmarks to evaluate the different techniques.

Finally, based on the memory management concepts and performance evaluation results, we present some best practices for host and guest memory usage.
 
For more details, please read the full paper here.


August 26, 2009

Comparing Fault Tolerance Performance & Overhead Utilizing VMmark v1.1.1

VMware Fault Tolerance (FT), based on vLockstep technology and available with VMware vSphere, easily and efficiently provides zero downtime and zero data loss for your critical workloads. FT provides continuous availability in the event of server failures by creating a live shadow instance of the primary virtual machine on a secondary system.  The shadow VM (or secondary VM), running on the secondary system, executes sequences of x86 instructions identical to those of the primary VM, with which it proceeds in vLockstep.  As a result, if a catastrophic failure of the primary system occurs, an instantaneous failover to the secondary VM takes place that is virtually indistinguishable to the end user. While FT technology is certainly compelling, some potential users express concern about possible performance overhead. In this article, we explore the performance implications of running FT in realistic scenarios by measuring an FT-enabled environment based on the heterogeneous workloads found in VMmark, the tile-based mixed-workload consolidation benchmark from VMware®.

Figure 1: High-Level Architecture of VMware Fault Tolerance

Environment Configuration:

System under Test: 2 x Dell PowerEdge R905
CPUs: 4 Quad-Core AMD Opteron 8382 (2.6GHz); 4 Quad-Core AMD Opteron 8384 (2.7GHz)
Memory: 128GB DDR2 Reg ECC
Storage Array: EMC CX380
Hypervisor: VMware ESX 4.0
Application: VMmark v1.1.1
Virtual Hardware (per tile): 8 vCPUs, 5GB memory, 62GB disk

  • VMware Fault Tolerance currently supports only 1 vCPU VMs and requires specific processors; for our experiments, the VMmark Database and MailServer VMs were therefore set to run with 1 vCPU only.  For more information on FT and its requirements see here.
  • VMmark is a benchmark that measures the performance of virtualization environments so that customers can compare platforms.  It is also useful for studying the effect of architectural features. VMmark consists of six workloads (Web, File, Database, Java, Mail, and Standby servers). Multiple sets of workloads (tiles) can be added to scale the benchmark load to match the underlying hardware resources. For more information on VMmark see here.


Test Methodology:

An initial performance baseline was established by running VMmark from 1 to 13 tiles on the primary system with Fault Tolerance disabled for all workloads. FT was then enabled for the MailServer and Database workloads, which customer feedback suggested were the applications most likely to be protected by FT. The performance tests were then executed a second time and compared to the baseline performance data.
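
To give a sense of how the load grows across the 1 to 13 tile range, the sketch below multiplies out the per-tile virtual hardware listed in the environment configuration. The count of six VMs per tile assumes one VM per VMmark workload; the names PER_TILE and tile_footprint are illustrative.

```python
# Per-tile resources from the environment configuration above.
# "vms" assumes one VM for each of the six VMmark workloads.
PER_TILE = {"vms": 6, "vcpus": 8, "memory_gb": 5, "disk_gb": 62}


def tile_footprint(tiles):
    """Total virtual hardware for a given number of VMmark tiles."""
    return {resource: amount * tiles for resource, amount in PER_TILE.items()}


for tiles in (1, 7, 13):
    print(tiles, tile_footprint(tiles))
```

At 13 tiles this comes to 78 VMs with 104 vCPUs, 65GB of configured memory, and 806GB of virtual disk.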

 

Results:

The results in Table 1 illustrate the performance and efficiency of VMware Fault Tolerance.  Here, “FT-enabled Secondary %CPU” indicates the total CPU utilized by the secondary system under test.  Note also that, for our workload, the default ESX 4.0, High Availability, and Fault Tolerance settings were used, so these results should be considered ‘out of the box’ performance for this configuration.  Finally, the secondary system’s %CPU is much lower than the primary system’s because it runs only the MailServer and Database workloads, as opposed to the six workloads that run on the primary system.

Table 1:


You can see that as we scaled both configurations toward saturation, the overhead of enabling VMware Fault Tolerance remained surprisingly consistent, with an average delta in %CPU used of 7.89% over all of the runs.  ESX also achieved very comparable scaling for both the FT-enabled and FT-disabled configurations.  It isn’t until the FT-enabled configuration nears complete saturation, a scenario most end users will never see, that we start to see any discernible delta in scores.

It should be noted that these performance and overhead statements may not hold for dissimilar workloads and systems under test.  From the results of our testing, you can see that the overhead is well justified by the advantage of having mail servers and database servers truly protected, without fear of end-user interruption.

It’s a tough world out there; you never know when the next earthquake, power outage, or tripped-over power cord will strike.  It’s nice to know that your critical workloads are not only safe, but running at high efficiency.  The ability of VMware Fault Tolerance technology to provide quick and efficient protection for your critical workloads makes it a standout in the datacenter.

All information in this post regarding future directions and intent are subject to change or withdrawal without notice and should not be relied on in making a purchasing decision of VMware's products. The information in this post is not a legal obligation for VMware to deliver any material, code, or functionality. The release and timing of VMware's products remains at VMware's sole discretion.



August 24, 2009

Performance of Exchange Server 2007 in a Fault Tolerant Virtual Machine

One of the great new features of vSphere is VMware Fault Tolerance (FT), which allows a VM to run in lockstep on two different physical servers at the same time.  This provides a high-availability option with virtually no downtime.  A whitepaper focused on FT was recently published along with a blog post that has the complete details about this great new technology.  Using an Exchange Server 2007 mailbox VM, we ran some tests to measure the performance of up to 2000 users with FT.

In order to examine the performance of an FT VM running Exchange Server 2007, a series of tests was run with 1000, 1500, and 2000 users.  Performance was measured in terms of CPU utilization and Sendmail response time for the same VM both with and without FT enabled.  The results were used to measure the performance impact of using FT as well as the number of users that can be supported by a 1 vCPU VM. (Currently, FT is supported only on 1 vCPU VMs.)

Test Configuration

I worked with the Dell TechCenter team and used two of their Dell PowerEdge blade servers with Intel Nehalem-based Xeon 5500 processors.  The primary server was an M710 with two Intel Xeon X5570 processors running at 2.93GHz and 72GB of RAM.  The secondary server was an M610 with the same type of processors, but with 48GB of RAM.  The terms primary and secondary refer to the portions of the fault tolerant VMs that the servers hosted during the tests.

Both blade servers were in the same chassis, so all FT logging traffic remained local in the chassis Ethernet switch. The servers connected via iSCSI to EqualLogic PS5000XV storage arrays where the OS, data, and log LUNs for the VMs were stored.

The servers were installed with ESX 4.0 and managed by a vCenter Server.  VMs were created with 1 vCPU and 10GB of RAM, installed with Windows Server 2008 x64 and Exchange Server 2007 Mailbox role.  Another VM that acted as the domain controller and Hub Transport and Client Access server was on a third blade server in the same chassis.  Microsoft Exchange Load Generator (LoadGen) was used with the Heavy Online user profile to simulate an eight hour workday.

Fault Tolerant Test Results

The testing showed that the performance of the Exchange VM was affected only slightly when FT was used. Sendmail average latency increased by 10 to 13 milliseconds, and 95th percentile average latency increased by 33 to 45 milliseconds.  All test results were under the 1000ms threshold at which user experience starts to degrade.  These results indicate that, even at 2000 users, the performance of Exchange on a 1 vCPU VM was acceptable with or without FT.

Figure: Sendmail latency with and without FT

The CPU utilization results for the overall system show that FT has a low impact.  Because the Exchange VM was the only one on the ESX server, overall system utilization was very low, with a peak of just over 7% in the most stressful test.  Enabling FT caused only an additional 1 to 1.5% of system CPU to be used.  The utilization of the ESX host with the secondary VM was slightly lower than that of the primary.  For the 1 vCPU VM itself, average utilization reached just under 45%.  This is a comfortable level that still leaves room for the bursty nature of Exchange.

Figure: CPU utilization with and without FT

Enabling FT for an Exchange VM running on the latest server hardware shows good performance for up to the 2000 users tested, and the effect of FT on the workload was relatively small.  These results show that an Exchange VM can be a good candidate for using FT to enable increased uptime and availability.

 


August 18, 2009

VMware vSphere™ 4: The CPU Scheduler in VMware® ESX™ 4

VMware recently published a whitepaper that discusses changes to the CPU scheduler in ESX 4. The paper also describes a few key CPU scheduler concepts that are useful for understanding scheduler-related performance issues. Specifically, it attempts to answer the following questions:

  • How is CPU time allocated among virtual machines? How well does it work?
  • What is the difference between “strict” and “relaxed” co-scheduling? What is the performance impact of recent co-scheduling improvements?
  • What is the “CPU scheduler cell”? What happened to the scheduler cell in ESX 4?
  • How does the ESX scheduler exploit underlying CPU architecture features such as multi-core, Hyper-Threading, and NUMA?


The following is a brief summary of the paper:

In ESX 4, many improvements have been made to the CPU scheduler. These include further relaxed co-scheduling, lower lock contention, and multi-core aware load balancing. Co-scheduling overhead has been further reduced by accurately measuring the co-scheduling skew and by allowing more scheduling choices. Lower lock contention is achieved by replacing the scheduler cell lock with finer-grained locks. With the scheduler cell eliminated, a virtual machine can get higher aggregate cache capacity and memory bandwidth. Lastly, multi-core aware load balancing achieves high CPU utilization while minimizing the cost of migrations.

Experimental results show that the ESX 4 CPU scheduler faithfully allocates CPU resources as specified by users. While maintaining the benefit of a proportional-share algorithm, the improvements in co-scheduling and load-balancing algorithms are shown to benefit performance. Compared to ESX 3.5, ESX 4 significantly improves performance in both lightly loaded and heavily loaded systems.

For more details, please download and read our full paper here.


Virtual Machine Monitor Execution Modes in vSphere 4.0

Recently we published a whitepaper describing the VMware Virtual Machine Monitor (VMM) execution modes in vSphere 4.0. The VMM may choose hardware support for virtualization whenever it's available or may choose software techniques for virtualization when hardware support is unavailable or not enabled on the underlying platform. The method chosen by the VMM for virtualizing the x86 CPU and MMU is known as the "monitor mode".

This paper attempts to familiarize our customers with default monitor modes chosen by the VMware VMM for many popular guests running on modern x86 CPUs. Most workloads perform well under these default settings. In some cases the user may want to override the default monitor mode. We provide a few examples in which the user may observe performance benefits in overriding the default monitor modes and two ways by which the user can override the defaults.

The default monitor mode chosen by the VMM for a particular guest depends on the hardware features available (or enabled) on the underlying platform and on the guest OS performance in that mode. Differences in the availability of virtualization support on modern x86 CPUs, and in guest OS performance when using those features (or when using software techniques when they are unavailable), make choosing the appropriate monitor mode for a given guest on a given x86 CPU a complex problem.  For more details, please download and read our full paper here.
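
The selection logic described above can be pictured roughly as follows. This is a simplified sketch, not the VMM's actual decision procedure: the MonitorMode names, the choose_monitor_mode function, and the idea of passing in a per-guest preferred mode are all assumptions for illustration. The real defaults for each guest and CPU combination are listed in the paper.

```python
from enum import Enum


class MonitorMode(Enum):
    # Hypothetical labels for the ways the CPU and MMU can be virtualized.
    SOFTWARE = "software CPU and MMU virtualization"
    HW_CPU = "hardware CPU virtualization, software MMU virtualization"
    HW_CPU_MMU = "hardware CPU and MMU virtualization"


def choose_monitor_mode(hw_cpu, hw_mmu, guest_preference=None, override=None):
    """Pick a monitor mode from what the platform offers (illustrative only)."""
    # Modes the platform can support; the software mode is always possible.
    supported = [MonitorMode.SOFTWARE]
    if hw_cpu:
        supported.append(MonitorMode.HW_CPU)
        # Hardware MMU support is only usable together with hardware CPU support.
        if hw_mmu:
            supported.append(MonitorMode.HW_CPU_MMU)

    # An explicit user override wins if the platform can honor it.
    if override in supported:
        return override
    # Otherwise use the mode assumed to perform best for this guest, if possible.
    if guest_preference in supported:
        return guest_preference
    # Fall back to the most hardware-assisted mode available.
    return supported[-1]
```

With no hardware support available or enabled, the function falls back to the software mode; with full hardware support, a guest that is assumed to run better under software techniques can still be given that mode through guest_preference or override.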