
Monthly Archives: December 2010

Performance Scaling of an Entry-Level Cluster

Performance benchmarking is often conducted on top-of-the-line hardware, including hosts that typically have a large number of cores, maximum memory, and the fastest disks available. Hardware of this caliber is not always accessible to small or medium-sized businesses with modest IT budgets. As part of our ongoing investigation of different ways to benchmark the cloud using the newly released VMmark 2.0, we set out to determine whether a cluster of less powerful hosts could be a viable alternative for these businesses. We used VMmark 2.0 to see how a four-host cluster with a modest hardware configuration would scale under increasing load.

Workload throughput is often limited by disk performance, so the tests were repeated with two different storage arrays to show the performance improvement that upgrading the storage would provide. We tested two disk arrays that varied in both speed and number of disks, an EMC CX500 and an EMC CX3-20, while holding all other characteristics of the testbed constant.

To review, VMmark 2.0 is a next-generation, multi-host virtualization benchmark that models application performance and the effects of common infrastructure operations such as vMotion, Storage vMotion, and virtual machine deployment. Each tile contains Microsoft Exchange 2007, DVD Store 2.1, and Olio application workloads, which run in a throttled fashion. The Storage vMotion and VM deployment infrastructure operations require the user to specify a LUN as the storage destination. The VMmark 2.0 score is computed as a weighted average of application workload throughput and infrastructure operation throughput. For more details about VMmark 2.0, see the VMmark 2.0 website or Joshua Schnee’s description of the benchmark.
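
For readers who want a concrete picture of how the two throughput components combine, here is a minimal Python sketch of a weighted-average scoring scheme in the spirit of VMmark 2.0. The normalization step and the 80/20 weighting shown are illustrative assumptions only; the authoritative formula is defined in the VMmark 2.0 benchmarking guide.

  # Illustrative sketch: a VMmark 2.0-style score combines application-workload
  # throughput and infrastructure-operation throughput, each normalized to a
  # reference result. The 80/20 weights below are placeholders, not the
  # published weighting.
  def vmmark2_style_score(app_throughput, infra_throughput,
                          app_reference, infra_reference,
                          app_weight=0.8, infra_weight=0.2):
      app_component = app_throughput / app_reference        # normalized application score
      infra_component = infra_throughput / infra_reference  # normalized infrastructure score
      return app_weight * app_component + infra_weight * infra_component

  # Example: application throughput triples (three tiles) while the requested
  # infrastructure load stays constant, so the overall score grows sub-linearly.
  print(vmmark2_style_score(app_throughput=3.0, infra_throughput=1.0,
                            app_reference=1.0, infra_reference=1.0))

This sketch also hints at why the scores in the results below do not scale linearly with tile count: only the application component grows as tiles are added.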

Configuration
All tests were conducted on a cluster of four Dell PowerEdge R310 hosts running VMware ESX 4.1 and managed by VMware vCenter Server 4.1.  These are typical of today’s entry-level servers; each server contained a single quad-core 2.80 GHz Intel Xeon X3460 processor (with hyperthreading enabled) and 32 GB of RAM.  The servers also used two 1Gbit NICs for VM traffic and a third 1Gbit NIC for vMotion activity.

To determine the relative impact of different storage solutions on benchmark performance, runs were conducted on two existing storage arrays, an EMC CX500 and an EMC CX3-20. For details on the array configurations, refer to Table 1 below. VMs were stored on identically configured ‘application’ LUNs, while a designated ‘maintenance’ LUN was used for the Storage vMotion and VM deployment operations.

Table 1. Disk Array Configuration

Results
To measure the cluster's performance scaling under increasing load, we started by running one tile, then increased the number of tiles until a run failed to meet Quality of Service (QoS) requirements. As load on the cluster increases, application throughput, CPU utilization, and VMmark 2.0 scores are all expected to increase; the VMmark score increases as a function of throughput. By scaling out the number of tiles, we hoped to determine the maximum load our four-host cluster of entry-level servers could support. VMmark 2.0 scores will not scale linearly from one to three tiles because, in this configuration, the infrastructure operations load remained constant; infrastructure load increases primarily as a function of cluster size. Although it shows only a two-host cluster, this figure from Joshua Schnee’s recent blog article demonstrates the relationship between application throughput, infrastructure operations throughput, and the number of tiles more clearly. We also expected improved performance on the CX3-20 versus the CX500 because the CX3-20 has a larger number of disks per LUN as well as faster individual drives. Figure 1 below details the scale-out performance on the CX500 and the CX3-20 disk arrays using VMmark 2.0.

Figure 1. VMmark 2.0 Scale Out On a Four-Host Cluster


Both configurations saw improved throughput from one to three tiles but at four tiles they failed to meet at least one QoS requirement. These results show that a user wanting to maintain an average cluster CPU utilization of 50% on their four-host cluster could count on the cluster to support a two-tile load. Note that in this experiment, increased scores across tiles are largely due to increased workload throughput rather than an increased number of infrastructure operations.

As expected, runs using the CX3-20 showed consistently higher normalized scores than those on the CX500. Runs on the CX3-20 outperformed the CX500 by 15%, 14%, and 12% on the one, two, and three-tile runs, respectively. The increased performance of the CX3-20 over the CX500 was accompanied by approximately 10% higher CPU utilization, which indicated that the faster CX3-20 disks allowed the CPU to stay busier, increasing total throughput.

The results show that our cluster of entry-level servers with a modest disk array supported approximately 220 DVD Store 2.1 operations per second, 16 send-mail actions, and 235 Olio updates per second. A more robust disk array supported 270 DVD Store 2.1 operations per second, 16 send-mail actions, and 235 Olio updates per second with 20% lower latencies on average and a correspondingly slightly higher CPU utilization.

Note that this type of experiment is possible for the first time with VMmark 2.0; VMmark 1.x was limited to benchmarking a single host, and the entry-level servers under test in this study would not have been able to support even a single VMmark 2.0 tile individually. By spreading the load of one tile across a cluster of servers, however, it becomes possible to quantify the load that the cluster as a whole is capable of supporting.  Benchmarking our cluster with VMmark 2.0 has shown that even modest clusters running vSphere can deliver an enormous amount of computing power to run complex multi-tier workloads.

Future Directions
In this study, we scaled out VMmark 2.0 on a four-host entry-level cluster to measure performance scaling and the maximum supported number of tiles. This put a much higher load on the cluster than might be typical for a small or medium-sized business, demonstrating the headroom that lets such businesses confidently deploy their application workloads.  An alternate experiment would be to run fewer tiles while measuring the performance of other enterprise-level features, such as VMware High Availability. This ability to benchmark the cloud in many different ways is one benefit of having a well-designed multi-host benchmark. Keep watching this blog for more interesting studies in benchmarking the cloud with VMmark 2.0.

Oracle RAC Performance on vSphere 4.1

Oracle Real Application Clusters (RAC) is used to run critical databases with stringent performance requirements. A series of tests was recently run in the VMware performance lab to determine how an Oracle RAC database performs when running on vSphere. The test results showed that the application performed within 11 to 13 percent of physical performance when running in a virtualized environment.

Configuration

Two servers were used for both the physical and virtual tests: two Dell PowerEdge R710s with 2x Intel Xeon X5680 six-core processors and 96GB of RAM, connected via Fibre Channel to a NetApp FAS6030 array. The servers were dual booted between Red Hat Enterprise Linux 5.5 and vSphere ESXi 4.1. Each server was connected via three gigabit Ethernet NICs to a shared switch. One NIC was used for the public network and the other two were used for interconnect and cluster traffic.

The NetApp storage array had a total of 112 10K RPM 274GB Fibre Channel disks. Two 200GB LUNs, backed by a total of 80 disks, were used to create a data volume in Oracle ASM. Each data LUN was backed by a 40 disk RAID DP aggregate on the storage array. A 100GB log LUN was created on another volume that was backed by a 26 disk RAID DP aggregate. An additional small 2GB LUN was created to be used as the voting disk for the RAC cluster.

Tables: Server and LUN Configurations

Each VM was configured with 32GB of RAM, three VMXNET3 virtual NICs, and a PVSCSI adapter for all the LUNs except the OS disk. In order for the VMs to be able to share disks with physical hosts, it was necessary to mount the disks as RDMs and put the virtual SCSI adapter into physical compatibility mode. Additionally, to achieve the best performance for the Oracle RAC interconnect, the VMXNET3 NICs were configured with ethernetX.intrmode = 1 in the vmx file. This option is a workaround for an ESX performance bug that is specific to RHEL 5.5 VMs and to extremely latency-sensitive workloads. The additional configuration option is no longer needed starting with ESX 4.1u1 because the bug is fixed in that release.

Table: VM Configuration

A four-node Oracle RAC cluster was created with two virtual nodes and two physical nodes. The virtual nodes were hosted on a third server when the two servers used for testing were booted to the native RHEL environment. RHEL 5.5 x64 and Oracle 11gR2 were installed on all nodes. During testing, the two servers were booted either to native RHEL or to ESX for the physical or virtual tests, respectively. This meant that only the two native nodes or the two virtual nodes were powered on during a physical or virtual test, respectively. The diagrams below show the same test environment when set up for the two-node physical or virtual test.

Physical Test Diagram:


Virtual Test Diagram:

Testing

The servers used in testing have a total of 12 physical cores and 24 logical threads when hyperthreading is enabled. The maximum number of vCPUs per VM supported by ESXi 4.1 is eight. This made it necessary to limit the physical server to a smaller number of cores to enable a performance comparison. Using the BIOS settings of the server, hyperthreading was disabled and the number of cores was limited to two and four per socket. This resulted in four- and eight-core physical server configurations that were compared with VM configurations of four and eight vCPUs. The physical server configurations were limited only to enable a direct performance comparison; this is clearly not how a system would normally be configured for best performance.

Open source DVD Store 2.1 was used as the workload for the test.  DVD Store is an OLTP database workload that simulates customers logging on, browsing, and purchasing DVDs from an online store.  It includes database build scripts, load files, and driver programs.  For these tests, the database driver was used to drive load directly against the database, with no need to install the Web tier.  Using the new DVD Store 2.1 functionality, two custom-sized databases of 50GB each with a 12GB SGA were created as two different instances named DS2 and DS2B.  Both instances were running on both nodes of the cluster and were accessed equally on each node.

Results

An equal amount of load was run against each instance on each node in both the four-CPU and eight-CPU test cases.  The DS2 and DS2B instances spanned all nodes and were actively used on all nodes. An equal number of threads was connected for each instance on each node.  The amount of work was scaled up with the number of processors: twice as many DVD Store driver threads were used in the eight-CPU case as compared with the four-CPU case.  For example, a total of 40 threads were running against node one in the four-CPU test, with 20 accessing DS2 and 20 accessing DS2B.  Another 40 threads were accessing DS2 and DS2B on node two at the same time during that test.  CPU utilization of the physical hosts and VMs was above 95% in all tests.  Results are reported in terms of Orders Per Minute (OPM) and Average Response Time (RT) in milliseconds.
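
As a quick worked example of the thread scaling just described, the short Python snippet below reproduces the driver-thread counts from the text (20 threads per instance per node in the four-CPU case, doubled for the eight-CPU case). The general formula is only an illustration of that doubling, not part of the DVD Store tooling.

  # Worked example of the driver-thread scaling described above. The 4-vCPU
  # baseline of 20 threads per instance per node comes from the text; other
  # values are produced by the same proportional scaling and are illustrative.
  BASE_THREADS_PER_INSTANCE = 20   # per node, in the 4-vCPU configuration
  INSTANCES = ("DS2", "DS2B")
  NODES = ("node1", "node2")

  def driver_thread_plan(vcpus_per_node):
      per_instance = BASE_THREADS_PER_INSTANCE * vcpus_per_node // 4
      return {node: {inst: per_instance for inst in INSTANCES} for node in NODES}

  for vcpus in (4, 8):
      plan = driver_thread_plan(vcpus)
      per_node_total = sum(plan["node1"].values())
      print(f"{vcpus}-vCPU case: {per_node_total} driver threads per node -> {plan['node1']}")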

Figure: Oracle RAC Virtual vs. Native Performance

In both the OPM and RT measurements, the virtual RAC performance was within 11 to 13 percent of the physical RAC performance.  In an intensive test running on Oracle RAC, the CPU, disk, and network were heavily utilized, but virtual performance was close to native performance.  This result removes a barrier to virtualizing one of the more performance-intensive tier-one applications in the datacenter.

 

Two Host Matched-Pair Scaling Utilizing VMmark 2

As mentioned in Bruce’s previous blog, VMmark 2.0 has been released.  With its release we can now begin to benchmark an enterprise-class cloud platform in entirely new and interesting ways.  VMmark 2 is based on a multi-host configuration that includes bursty application and infrastructure workloads to drive load against a cluster.  VMmark 2 allows for the analysis of infrastructure operations within a controlled benchmark environment for the first time, distinguishing it from server consolidation benchmarks. 

This article leads off a series of new articles introducing VMmark 2; its goal is to provide a bit more detail about VMmark 2 and to test a vSphere enterprise cloud, focusing on the scaling performance of a matched pair of systems.  More simply put, this blog looks at what happens to cluster performance as more load is added to a pair of identical servers.  This is important because it provides a means of gauging the efficiency of a vSphere cluster as demand increases.

VMmark 2 Overview

VMmark 2 is a next-generation, multi-host virtualization benchmark that models not only application performance but also the effects of common infrastructure operations. It models application workloads in the now familiar VMmark 1 tile-based approach, where the benchmarker adds tiles until either a goal is met or the cluster is at saturation.  It’s important to note that while adding tiles does effectively linearly increase the application workload requests being made, the load caused by infrastructure operations does not scale in the same way.  VMmark 2 infrastructure operations scale with cluster size to better reflect modern datacenters.  Greater detail on workload scaling can be found within the benchmarking guide available for download.  To calculate the score for VMmark 2, final results are generated from a weighted average of the two kinds of workloads; hence scores will not increase linearly as tiles are added.  In addition to the throughput metrics, quality-of-service (QoS) metrics are also measured, and minimum standards must be maintained for a result to be considered fully compliant.

VMmark 2 runs the application workloads and infrastructure operations simultaneously, allowing the benchmark to reflect both of these critical aspects in the results it reports.  The application workloads that make up a VMmark 2 tile were chosen to better reflect applications in today’s datacenters by employing more modern and diverse technologies.  In addition to the application workloads, VMmark 2 makes infrastructure operation requests of the cluster.  These operations stress the cluster with vMotion, Storage vMotion, and Deploy operations.  It’s important to note that while the VMmark 2 harness is stressing the cluster through the infrastructure operations, VMware’s Distributed Resource Scheduler (DRS) is dynamically managing the cluster in order to distribute and balance the available computing resources.  The diagrams below summarize the key aspects of the application and infrastructure workloads.

VMmark 2 Workload Details:


Application Workloads – Each “Tile” consists of the following workloads and VMs.

• DVD Store 2.1 – multi-tier OLTP workload consisting of a database VM and three webserver VMs driving a bursty load profile

• Exchange 2007

• Standby Server (heartbeat server)

• Olio – multi-tier social networking workload consisting of a web server and a database server


Infrastructure Workloads – Consists of the following

• User-initiated vMotion

• Storage vMotion

• Deploy: VM cloning, OS customization, and updating

• DRS-initiated vMotion to accommodate host-level load variations

 

Environment Configuration:

  • Systems Under Test: 2x HP ProLiant DL380 G6
  • CPUs: 2x Quad-Core Intel® Xeon® X5570 @ 2.93 GHz with Hyper-Threading enabled
  • Memory: 96GB DDR2 Reg ECC
  • Storage Array: EMC CLARiiON CX3-80
  • Hypervisor: VMware ESX 4.1
  • Virtualization Management: VMware vCenter Server 4.1.0

Testing Methodology:

To test scalability as the number of VMmark 2 tiles increases, two HP ProLiant DL380 servers were configured identically and connected to an EMC CLARiiON CX3-80 storage array.  The minimum configuration for VMmark 2 is a two-host cluster running one tile; as such, this was our baseline, and all VMmark 2 scores were normalized to this result.  A series of tests was then conducted on this two-host configuration, increasing the number of tiles until the cluster approached saturation and recording both the VMmark 2 score and the average cluster CPU utilization during the run phase.

Results:

In circumstances where demand on a cluster increases, it becomes critical to understand how the environment adapts to these demands in order to plan for future needs.  In many cases it can be especially important for businesses to understand how the application and infrastructure workloads were individually impacted.  By breaking out the distinct VMmark 2 sub-metrics we can get a fine-grained view of how the vSphere cluster responded as the number of tiles, and thus the work performed, increased.

Figure: VMmark 2 Detailed Scaling

The graph above shows that the VMmark 2 scores make significant gains until the two-host cluster is saturated at 5 tiles.  Delving into this further, we see that, as expected, the infrastructure operations throughput remained nearly constant because the requested infrastructure load did not change during these experiments.  Continued examination shows that the cluster was able to achieve nearly linear scaling for the application workloads through 4 tiles, equivalent to 4 times the application work requested of the 1-tile configuration.  At the 5-tile configuration the cluster was unable to meet the minimum quality-of-service requirements of VMmark 2; however, this result still helps us understand the performance characteristics of the cluster.

Monitoring how the average cluster CPU utilization changed during the course of our experiments is another critical component of understanding cluster behavior as load increases.  The diagram below plots the VMmark 2 scores shown in the graph above together with the average cluster CPU utilization for each configuration.

Figure: VMmark 2 Cluster Scaling and Average Cluster CPU Utilization

The resulting diagram illustrates the impact on cluster CPU utilization and performance as the work done by our cluster was increased through the addition of VMmark 2 tiles. The results show that the vSphere matched-pair cluster was able to deliver outstanding scaling of enterprise-class applications while also providing unequaled flexibility in the load balancing, maintenance, and provisioning of our cloud. This is just the beginning of what we’ll see in terms of analysis using the newly released VMmark 2; we plan to explore larger and more diverse configurations next, so stay tuned …

 

VMware Load-Based Teaming (LBT) Performance

Virtualized data center environments are often characterized by both the variety and the sheer number of traffic flows, whose network demands often fluctuate widely and unpredictably. Provisioning a fixed network capacity for these traffic flows can either result in poor performance (by underprovisioning) or waste valuable capital (by overprovisioning).

NIC teaming in vSphere enables you to distribute (or load balance) the network traffic from different traffic flows among multiple physical NICs by providing a mechanism to logically bind together multiple physical NICs. This results in increased throughput and fault tolerance and alleviates the challenge of network-capacity provisioning to a great extent. Creating a NIC team in vSphere is as simple as adding multiple physical NICs to a vSwitch and choosing a load balancing policy.

vSphere 4 (and prior ESX releases) provides several load balancing choices, which base routing on the originating virtual port ID, an IP hash, or a source MAC hash. While these load balancing choices work fine in the majority of virtual environments, they all share a few limitations. For instance, all of these policies statically map the affiliations of the virtual NICs to the physical NICs (based on virtual switch port IDs or MAC addresses) and do not base their load balancing decisions on current network traffic, and therefore may not distribute traffic effectively among the physical uplinks. In addition, none of these policies takes into account disparities in physical NIC capacity (such as a mixture of 1 GbE and 10 GbE physical NICs in a NIC team). The next section describes the new teaming policy introduced in vSphere 4.1 that addresses these shortcomings.
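
To make the static-mapping limitation concrete, here is a simplified Python sketch of how a port-ID-based policy pins each vNIC to an uplink regardless of traffic. The modulo hash shown is an illustrative stand-in, not the actual ESX algorithm.

  # Simplified illustration (not the actual ESX algorithm) of a static teaming
  # policy: each vNIC is pinned to an uplink based only on its virtual switch
  # port ID, so busy and idle flows can pile onto one uplink while the other
  # uplink sits unused.
  uplinks = ["vmnic0", "vmnic1"]

  def static_uplink_for(port_id: int) -> str:
      return uplinks[port_id % len(uplinks)]   # fixed mapping, ignores current load

  # Four vNICs whose port IDs happen to hash the same way all share one uplink,
  # no matter how much traffic each one carries.
  for port_id in (16, 18, 20, 22):
      print(port_id, "->", static_uplink_for(port_id))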

Load-Based Teaming (LBT)

vSphere 4.1 introduces a load-based teaming (LBT) policy that is traffic-load-aware and ensures physical NIC capacity in a NIC team is optimized. Note that LBT is supported only with the vNetwork Distributed Switch (vDS). LBT avoids the situation of other teaming policies where some of the distributed virtual uplinks (dvUplinks) in a DV Port Group’s team are idle while others are completely saturated. LBT reshuffles the port binding dynamically, based on load and dvUplink usage, to make efficient use of the available bandwidth.

LBT is not the default teaming policy when creating a DV Port Group, so it is up to you to configure it as the active policy. As LBT moves flows among uplinks, it may occasionally cause reordering of packets at the receiver. LBT will only move a flow when the mean send or receive utilization on an uplink exceeds 75% of capacity over a 30-second period, and it will not move flows more often than once every 30 seconds.
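
The Python sketch below captures that decision rule in simplified form: a flow is moved off an uplink only when the uplink's mean utilization over a 30-second window exceeds 75% of link capacity, and moves are rate-limited to one per 30 seconds. This is an illustrative model of the published thresholds, not the ESX implementation.

  # Simplified model of the Load-Based Teaming policy described above: move a
  # flow only if the uplink's mean utilization over the last 30-second window
  # exceeds 75% of link capacity, and never move flows more often than once
  # every 30 seconds. Illustrative only.
  SATURATION_THRESHOLD = 0.75   # fraction of link capacity
  WAKEUP_PERIOD_S = 30          # evaluation / rate-limit interval, in seconds

  def should_move_flow(mean_utilization_gbps, link_capacity_gbps,
                       seconds_since_last_move):
      saturated = mean_utilization_gbps > SATURATION_THRESHOLD * link_capacity_gbps
      rate_limit_ok = seconds_since_last_move >= WAKEUP_PERIOD_S
      return saturated and rate_limit_ok

  # Example: a 10 GbE uplink averaging 7.6 Gbps over the last window, with no
  # recent move, triggers a remap of one vNIC port to another uplink.
  print(should_move_flow(mean_utilization_gbps=7.6, link_capacity_gbps=10.0,
                         seconds_since_last_move=30))   # True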

Performance

In this section, we describe in detail the test-bed configuration, the workload used to generate the network traffic flows, and the test results.

Test configuration

In our test configuration, we used an HP DL370 G6 server running the GA release of vSphere 4.1, and several client machines that generated SPECweb®2005 traffic. The server was configured with dual-socket, quad-core 3.1GHz Intel Xeon W5580 processors, 96GB of RAM, and two 10 GbE Intel Oplin NICs. The server hosted four virtual machines and SPECweb2005 traffic was evenly distributed among all four VMs.  Each VM was configured with 4 vCPUs, 16GB memory, 2 vmxnet3 vNICs, and SLES 11 as the guest OS.

SPECweb2005 is an industry-standard web server benchmark defined by the Standard Performance Evaluation Corporation (SPEC). The benchmark consists of three workloads: Banking, Ecommerce, and Support, each with different workload characteristics representing common use cases for web servers. We used the Support workload in our tests, which is the most I/O-intensive of the three workloads.

Baseline performance

In our baseline configuration, we configured a vDS with two dvUplinks and two DV Port Groups. We mapped the vNICs of two VMs to the first dvUplink and the vNICs of the other two VMs to the second dvUplink through the vDS interface. The SPECweb2005 workload was evenly distributed among all four VMs, ensuring that both dvUplinks were equally stressed. In terms of load balancing, this baseline configuration represents the optimal performance point. With a load of 30,000 SPECweb2005 support users, we observed a little over 13Gbps of traffic, that is, about 6.5Gbps per 10 GbE uplink. The %CPU utilization and the percentage of SPECweb2005 user sessions that met the quality-of-service (QoS) requirements were 80% and 99.99%, respectively. We chose this load point because customers typically do not stress their systems beyond this level.

LBT performance

We then reconfigured the vDS with two dvUplinks and a single DV Port Group to which all the vNICs of the VMs were mapped. The DV Port Group was configured with the LBT teaming policy. We used the default settings of LBT, which are primarily the wakeup period (30 seconds) and link saturation threshold (75%). Our goal was to evaluate the efficacy of the LBT policy in terms of load balancing and the added CPU cost, if any, when the same benchmark load of 30,000 SPECweb2005 support sessions was applied.

Before the start of the test, we noted that the traffic from all the VMs propagated through the first dvUplink. Note that the initial affiliation of the vNICs to the dvUplinks is made based on the hash of the virtual switch port IDs. To find the current affiliations of the vNICs to the dvUplinks, run the esxtop command and find the port-to-uplink mappings in the network screen. You can also use the “net-lbt” tool to find affiliations as well as to modify LBT settings.

The figure below shows the network bandwidth usage on both of the dvUplinks during the entire benchmark period.

LBT-new

A detailed explanation of the bandwidth usage in each phase follows:

Phase 1: Because all the virtual switch port IDs of the four VMs were hashed to the same dvUplink, only one of the dvUplinks was active. During this phase of the benchmark ramp-up, the total network traffic was below 7.5Gbps. Because the usage on the active dvUplink was lower than the saturation threshold, the second dvUplink remained unused.

Phase 2: The benchmark workload continued to ramp up and when the total network traffic exceeded 7.5Gbps (above the saturation threshold of 75% of link speed), LBT kicked in and dynamically remapped the port-to-uplink mapping of one of the vNIC ports from the saturated dvUplink1 to the unused dvUplink2. This resulted in dvUplink2 becoming active.  The usage on both the dvUplinks remained below the saturation threshold.

Phase 3: As the benchmark workload further ramped up and the total network traffic exceeded 10Gbps (7.5Gbps on dvUplink1 and 2.5Gbps on dvUplink2), LBT kicked in yet again and dynamically changed the port-to-uplink mapping of one of the three active vNIC ports currently mapped to the saturated dvUplink.

Phase 4: As the benchmark reached a steady state with the total network traffic at a little over 13Gbps, both dvUplinks saw the same usage.

We did not observe any spikes in CPU usage or any dip in SPECweb2005 QoS during any of the four phases. The %CPU utilization and the percentage of SPECweb2005 user sessions that met the QoS requirements were 80% and 99.99%, respectively.

These results show that LBT can serve as a very effective load balancing policy to optimally use all the available dvUplink capacity while matching the performance of a manually load-balanced configuration.

Summary

Load-based teaming (LBT) is a dynamic and traffic-load-aware teaming policy that can ensure physical NIC capacity in a NIC team is optimized.  In combination with VMware Network IO Control (NetIOC), LBT offers a powerful solution that will make your vSphere deployment even more suitable for your I/O-consolidated datacenter.

Performance and Use Cases of VMware DirectPath I/O for Networking

Summary

VMware DirectPath I/O is a technology, available in vSphere 4.0 and later, that leverages hardware support (Intel VT-d and AMD-Vi) to allow guests to access hardware devices directly. In the case of networking, a VM with DirectPath I/O can directly access the physical NIC instead of using an emulated (vlance, e1000) or a para-virtualized (vmxnet, vmxnet3) device. While both para-virtualized devices and DirectPath I/O can sustain high throughput (beyond 10Gbps), DirectPath I/O can additionally save CPU cycles in workloads with very high packet rates (say, greater than 50,000 packets per second). However, DirectPath I/O does not support many features such as physical NIC sharing, memory overcommit, vMotion, and Network I/O Control. Hence, VMware recommends using DirectPath I/O only for workloads with very high packet rates, where CPU savings from DirectPath I/O may be needed to achieve the desired performance.

DirectPath I/O for Networking

VMware vSphere 4.x provides three ways for guests to perform network I/O: device emulation, para-virtualization, and DirectPath I/O. A virtual machine using DirectPath I/O interacts directly with the network device using its device drivers. The vSphere host (running ESX or ESXi) is only involved in virtualizing interrupts of the network device. In contrast, a virtual machine (VM) using an emulated or para-virtualized device (referred to as a virtual NIC or virtualized mode henceforth) interacts with a virtual NIC that is completely controlled by the vSphere host. The vSphere host handles the physical NIC interrupts, processes packets, determines the recipient of each packet, and copies it into the destination VM if needed. The vSphere host also mediates packet transmissions over the physical NIC.

In terms of network throughput, a para-virtualized NIC such as vmxnet3 matches the performance of DirectPath I/O in most cases. This includes being able to transmit or receive 9+ Gbps of TCP traffic with a single virtual NIC connected to a 1-vCPU VM. However, DirectPath I/O has some advantages over virtual NICs, such as lower CPU costs (because it bypasses execution of the vSphere network virtualization layer) and the ability to use hardware features that are not yet supported by vSphere but might be supported by guest drivers (e.g., TCP Offload Engine or SSL offload). In the virtualized mode of operation, the vSphere host completely controls the virtual NIC and hence can provide a host of useful features such as physical NIC sharing, vMotion, and Network I/O Control. By bypassing this virtualization layer, DirectPath I/O trades off virtualization features for potentially lower networking-related CPU costs. Additionally, DirectPath I/O needs a memory reservation to ensure that the VM’s memory has not been swapped out when the physical NIC tries to access it.

VMware’s Performance Review of DirectPath I/O vs. Emulation

VMware used the netperf [1] microbenchmark to plot the gains of DirectPath I/O as a function of packet rate. For the evaluation, VMware used the following setup:

  • A SLES 11 SP1 VM on vSphere 4.1. vSphere was running on a dual-socket Intel Xeon E5520 system (2.27 GHz) with a Broadcom 57711 10GbE NIC as the physical NIC.
  • A native Linux machine was used as the traffic source or sink.
  • The UDP_STREAM benchmark of netperf, along with its burst and interval functionality, was used to send or receive packets at a controlled rate.

Figure 1. Packet Rate vs. CPU Savings with DirectPath I/O

The figure above plots CPU savings due to DirectPath I/O, as a percentage of one core, against packet rate (packets per second, or PPS). You can immediately see the benefits of DirectPath I/O at high packet rates (100,000 PPS). However, it is equally clear that at lower packet rates the benefits of DirectPath I/O are not as significant. At 10,000 PPS, DirectPath I/O saves only about 6% of one core. This is an important observation, as many enterprise workloads do not have very high networking traffic (see Tables 1 and 2).
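
As a rough screen based on the numbers above (only about 6% of a core saved at 10,000 PPS, with the clear wins appearing at much higher rates), the small Python helper below applies the "greater than 50,000 packets per second" guidance from the summary at the top of this post. It is illustrative only, not a sizing tool; real decisions should be based on measurements like Figure 1.

  # Back-of-the-envelope screen: DirectPath I/O mainly pays off at very high
  # packet rates (the summary above suggests roughly >50,000 packets/sec),
  # while at ~10,000 PPS the measured saving was only about 6% of one core.
  DIRECTPATH_PPS_THRESHOLD = 50_000   # rough guidance from the summary above

  def directpath_worth_considering(packets_per_second: int) -> bool:
      return packets_per_second > DIRECTPATH_PPS_THRESHOLD

  for pps in (5_000, 20_000, 80_000, 120_000):
      verdict = ("consider DirectPath I/O" if directpath_worth_considering(pps)
                 else "a para-virtualized vNIC is usually the better fit")
      print(f"{pps:>7} PPS -> {verdict}")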

Table 1. Performance of Enterprise-Class Workloads with DirectPath I/O

To further illustrate the specific use cases and benefits of DirectPath I/O, VMware also compared its performance against that of a virtual NIC with three complex workloads: a web server workload and two database workloads. The web server workload and configuration were similar to SPECweb®2005 (described in reference [2]). We ran a fixed number of users requesting data from a web server and compared the CPU utilization of DirectPath I/O with that of a para-virtualized virtual NIC. Due to the high packet rate of this workload, DirectPath I/O was able to support 15% more users per %CPU used. Note that in a typical web server workload, the packets that a web server receives are smaller than 1500 bytes (an average of 86 bytes in our experiments). Hence, we cannot directly use the receive numbers in Figure 1 to calculate CPU savings.

Next, we looked at a database workload that has far lower packet rates. We used the Order Entry benchmark [3] and compared the number of operations per second. As expected, due to the low packet rate, the performance of the virtual NIC and DirectPath I/O was similar.

We also looked at the performance of an OLTP-like workload with SAP and DB2 [4] on a 4-socket Intel Xeon X7550 machine with one 8-vCPU VM. The virtual NIC outperformed DirectPath I/O by about 3% in the default configuration. This performance gap was an artifact of memory pinning, reservation, and NUMA behavior of the platform in the DirectPath I/O configuration. By setting memory reservations for the virtual NIC configuration, we were able to match the performance of the two configurations. Table 2 lists packet rates for some more enterprise-class workloads. Based on these packet rates and the CPU cost saving estimates from Figure 1, we do not expect these workloads to benefit from the use of DirectPath I/O.

Table 2. Packet Rates for Some Enterprise-Class Workloads

Compatibility Matrix

DirectPath I/O requires that the VM be allowed to access a device directly and that the device be allowed to modify the VM’s memory (for example, to copy a received packet into the VM’s memory). Additionally, the VM and the device can share essential state information that is invisible to ESX. Hence, the use of DirectPath I/O is incompatible with many core virtualization features. Table 3 presents a compatibility matrix for DirectPath I/O.

Table 3. Feature Compatibility Matrix for DirectPath I/O

Conclusion

As stated in the beginning of this post, DirectPath I/O is intended for specific use cases. It is another technology VMware users can deploy to boost performance of applications with very high packet rate requirements.

Further Reading

  • VMware DirectPath I/O. http://communities.vmware.com/docs/DOC-11089
  • Configuration Examples and Troubleshooting for DirectPath I/O. http://www.vmware.com/files/pdf/techpaper/vsp_4_vmdirectpath_host.pdf

References

  1. netperf. http://www.netperf.org/netperf/
  2. Achieving High Web Throughput Scaling with VMware vSphere 4 on Intel Xeon 5500 series (Nehalem) servers. http://communities.vmware.com/docs/DOC-12103
  3. Virtualizing Performance-Critical Database Applications in VMware vSphere. http://www.vmware.com/pdf/Perf_ESX40_Oracle-eval.pdf
  4. SAP Performance on vSphere with IBM DB2 and SUSE Linux Enterprise. http://www.vmware.com/files/pdf/techpaper/vsp_41_perf_SAP_SUSE_DB2.pdf

SPECweb®2005 is a registered trademark of the Standard Performance Evaluation Corporation (SPEC).

VMmark 2.0 Release

VMmark 2.0, VMware’s next-generation multi-host virtualization benchmark, is now generally available here.

We were motivated to create VMmark 2.0 by the revolutionary advancements in virtualization since VMmark 1.0 was conceived. The rapid pace of innovation in both the hypervisor and the hardware has quickly transformed datacenters by enabling easier virtualization of heavy and bursty workloads coupled with dynamic VM relocation (vMotion), dynamic datastore relocation (Storage vMotion), and automation of many provisioning and administrative tasks across large-scale multi-host environments. In this paradigm, a large fraction of the stresses on the CPU, network, disk, and memory subsystems is generated by the underlying infrastructure operations. Load balancing across multiple hosts can also greatly affect application performance. The benchmarking methodology of VMmark 2.0 continues to focus on user-centric application performance while accounting for the effects of infrastructure activity on overall platform performance. This approach provides a much more accurate picture of platform capabilities than less comprehensive benchmarks.

I would like to thank all of our partners who participated in the VMmark 2.0 beta program. Their thorough testing and insightful feedback helped speed the development process while delivering a more robust benchmark. I anticipate a steady flow of benchmark results from partners over the coming months and years.

I should also acknowledge the hard work of my colleagues on the VMmark team who completed VMmark 2.0 on a relatively short timeline. We performed a wide array of experiments during the development of VMmark 2.0 and will use the data as the basis for a series of upcoming posts in this forum. Some topics likely to be covered are cluster-wide scalability, performance of heterogeneous clusters, and networking tradeoffs between 1Gbit and 10Gbit for vMotion. I hope we can inspire others to use VMmark 2.0 to explore performance characteristics in multi-host environments in novel and interesting ways, all the way up to cloud scale.