
Each vSphere release introduces new vMotion functionality, increased reliability and significant performance improvements. vSphere 5.5 continues this trend by offering new enhancements to vMotion to support EMC VPLEX Metro, which enables shared data access across metro distances.

In this blog, we evaluate vMotion performance on a VMware vSphere 5.5 virtual infrastructure that was stretched across two geographically dispersed datacenters using EMC VPLEX Metro.

Test Configuration

The VPLEX Metro test bed consisted of two identical VPLEX clusters, each with the following hardware configuration:

• Dell R610 host, 8 cores, 48GB memory, Broadcom BCM5709 1GbE NIC
• A single engine (two directors) VPLEX Metro IP appliance
• FC storage switch
• VNX array, FC connectivity, VMFS 5 volume on a 15-disk RAID-5 LUN


Figure 1. Logical layout of the VPLEX Metro deployment

Figure 1 illustrates the deployment of the VPLEX Metro system used for vMotion testing. The figure shows two data centers, each with a vSphere host connected to a VPLEX Metro appliance. The VPLEX virtual volumes presented to the vSphere hosts in each data center are synchronous, distributed volumes that mirror data between the two VPLEX clusters using write-through caching. As a result, vMotion sees the underlying storage as shared storage, exactly as if it were a SAN to which both the source and destination hosts have access. Hence, vMotion in a VPLEX Metro environment is as straightforward as a traditional vMotion, which live migrates only the memory and device state of a virtual machine.
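To make the write-through behavior concrete, here is a minimal conceptual sketch in Python (purely illustrative; the classes and method names are hypothetical and are not a VPLEX API): a write is acknowledged to the host only after both clusters have committed it, which is why each host can treat the distributed volume as ordinary shared storage.

```python
# Conceptual sketch of a synchronous, write-through distributed volume.
# Purely illustrative -- the names below are hypothetical, not a VPLEX API.

class BackendArray:
    """Stands in for the storage array behind one VPLEX cluster."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, lba, data):
        self.blocks[lba] = data          # committed to back-end storage

    def read(self, lba):
        return self.blocks.get(lba)


class DistributedVolume:
    """Mirrors every write to both clusters before acknowledging it."""
    def __init__(self, cluster_a, cluster_b):
        self.legs = (cluster_a, cluster_b)

    def write(self, lba, data):
        for leg in self.legs:            # write-through: no dirty cache to flush later
            leg.write(lba, data)
        return "ACK"                     # host sees the ack only after both legs commit

    def read(self, lba, local_leg):
        return local_leg.read(lba)       # reads are served from the local cluster


site_a, site_b = BackendArray("site-A"), BackendArray("site-B")
volume = DistributedVolume(site_a, site_b)
volume.write(lba=42, data=b"vm-home-block")
assert volume.read(42, site_a) == volume.read(42, site_b)   # both sites see the same data
```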

The two VPLEX Metro appliances in our test configuration used IP-based connectivity. The vMotion network between the two ESXi hosts used a physical network link distinct from the VPLEX network. The Round Trip Time (RTT) latency on both VPLEX and vMotion networks was 10 milliseconds.

Measuring vMotion Performance

The following metrics were used to understand the performance implications of vMotion (a brief illustrative sketch of how they relate follows the list):

• Migration Time: Total time taken for migration to complete
• Switch-over Time: Time during which the VM is quiesced to enable switchover from source to the destination host
• Guest Penalty: Performance impact on the applications running inside the VM during and after the migration
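As a rough illustration of how the three metrics relate, the sketch below (Python, with hypothetical timestamps, throughput samples, and function names; this is not how vMotion instruments itself) derives all of them from the migration start, quiesce, resume, and completion events plus application throughput readings.

```python
# Illustrative only: derives the three vMotion metrics from hypothetical
# timestamps and throughput samples. Not VMware instrumentation.

def vmotion_metrics(start, quiesce, resume, end, baseline_tput, samples):
    """start/quiesce/resume/end are timestamps in seconds;
    samples is a list of application throughput readings taken during and after migration."""
    migration_time = end - start                  # total time for the migration
    switchover_time = resume - quiesce            # window in which the VM is quiesced
    # Guest penalty: average throughput loss relative to the pre-migration baseline.
    avg_tput = sum(samples) / len(samples)
    guest_penalty = 1.0 - avg_tput / baseline_tput
    return migration_time, switchover_time, guest_penalty

# Example with made-up numbers:
m, s, p = vmotion_metrics(start=0.0, quiesce=32.4, resume=33.0, end=33.5,
                          baseline_tput=350, samples=[340, 320, 310, 315])
print(f"migration {m:.1f}s, switch-over {s:.1f}s, guest penalty {p:.1%}")
```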

Test Results


Figure 2. VPLEX Metro vMotion performance in vSphere 5.1 and vSphere 5.5

Figure 2 compares VPLEX Metro vMotion performance results in vSphere 5.1 and vSphere 5.5 environments. The test scenario used an idle VM configured with 2 vCPUs and 2GB of memory. The figure shows a minor difference in total migration time between the two vSphere environments and a significant improvement in vMotion switch-over time in the vSphere 5.5 environment. The switch-over time dropped from about 1.1 seconds to about 0.6 seconds (a nearly 2x improvement), thanks to a number of performance enhancements included in the vSphere 5.5 release.

We also investigated the impact of VPLEX Metro live migration on Microsoft SQL Server online transaction processing (OLTP) performance using the open-source DVD Store workload. The test scenario used a Windows Server 2008 VM configured with 4 vCPUs, 8GB of memory, and a 50GB SQL Server database.


Figure 3. VPLEX Metro vMotion impact on SQL Server Performance

Figure 3 plots the performance of a SQL Server virtual machine, in orders processed per second, over time: before, during, and after VPLEX Metro vMotion. As shown in the figure, the impact on SQL Server throughput during vMotion was minimal. The SQL Server throughput on the destination host was around 310 orders per second, compared to 350 orders per second on the source host (a temporary drop of roughly 11%). This throughput drop after vMotion is due to VPLEX inter-cluster cache coherency interactions and is expected. For some time after the vMotion, the destination VPLEX cluster continues to send cache page queries to the source VPLEX cluster, and this has some impact on performance. After all the metadata was fully migrated to the destination cluster, we observed SQL Server throughput return to 350 orders per second, the same level seen prior to vMotion.
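Conceptually, this warm-up effect can be pictured with the toy model below (Python; purely illustrative and not VPLEX's actual coherency protocol): reads at the destination cluster fall back to querying the source cluster until the per-page metadata has been pulled over, so the cost is paid once per page and throughput recovers gradually rather than instantly.

```python
# Purely conceptual sketch of the post-vMotion warm-up effect; not VPLEX's protocol.

class ClusterCache:
    def __init__(self, name):
        self.name = name
        self.page_meta = {}                # pages this cluster holds coherency metadata for

class DestinationReader:
    def __init__(self, local, remote):
        self.local, self.remote = local, remote
        self.remote_queries = 0

    def read(self, page):
        if page not in self.local.page_meta:
            # metadata still lives at the source cluster: query it, then cache locally
            self.remote_queries += 1
            self.local.page_meta[page] = self.remote.page_meta.get(page, "owner:source")
        return self.local.page_meta[page]

src, dst = ClusterCache("source"), ClusterCache("destination")
src.page_meta = {p: "owner:source" for p in range(1000)}
reader = DestinationReader(dst, src)
for p in list(range(1000)) * 2:            # second pass hits local metadata only
    reader.read(p)
print("remote queries:", reader.remote_queries)   # 1000 -> cost is paid once per page
```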

These performance test results show the following:

  • Remarkable improvements in vSphere 5.5 towards reducing vMotion switch-over time during metro migrations (for example, a nearly 2x improvement over vSphere 5.1)
  • VMware vMotion in vSphere 5.5 paired with EMC VPLEX Metro can provide workload federation over a metro distance by enabling administrators to dynamically distribute and balance the workloads seamlessly across data centers

To find out more about the test configuration, performance results, and best practices to follow, see our recently published performance study.

VMware vSphere 5.1 vMotion Architecture, Performance, and Best Practices

vMotion and Storage vMotion are key, widely adopted technologies which enable the live migration of virtual machines on the vSphere platform. vMotion provides the ability to live migrate a virtual machine from one vSphere host to another host, with no perceivable impact to the end user. Storage vMotion technology provides the ability to live migrate the virtual disks belonging to a virtual machine across storage elements on the same host.  Together, vMotion and Storage vMotion technologies enable critical datacenter workflows, including automated load-balancing with DRS and Storage DRS, hardware maintenance, and the permanent migration of workloads.

Each vSphere release introduces new vMotion functionality, increased reliability and significant performance improvements. vSphere 5.1 continues this trend by offering new enhancements to vMotion that provide a new level of ease and flexibility for live virtual machine migrations.  vSphere 5.1 vMotion now removes the shared storage requirement for live migration and allows combining traditional vMotion and Storage vMotion into one operation. The combined migration copies both the virtual machine memory and its disk over the network to the destination vSphere host. This shared-nothing unified live migration feature offers administrators significantly more simplicity and flexibility in managing and moving virtual machines across their virtual infrastructures compared to the traditional vMotion and Storage vMotion migration solutions.

A new white paper, “VMware vSphere 5.1 vMotion Architecture, Performance and Best Practices”, is now available. In that paper, we describe the vSphere 5.1 vMotion architecture and its features. Following the overview and feature description of vMotion in vSphere 5.1, we provide a comprehensive look at the performance of live migrating virtual machines running typical Tier 1 applications using vSphere 5.1 vMotion, Storage vMotion, and vMotion. Tests measure characteristics such as total migration time and application performance during live migration. In addition, we examine vSphere 5.1 vMotion performance over a high-latency network, such as that in a metro area network. Test results show the following:

  • During storage migration, vSphere 5.1 vMotion maintains the same performance as Storage vMotion, even when using the network to migrate, due to the optimizations added to the vSphere 5.1 vMotion network data path.
  • During memory migration, vSphere 5.1 vMotion maintains nearly identical performance as the traditional vMotion, due to the optimizations added to the vSphere 5.1 vMotion memory copy path.
  • vSphere 5.1 vMotion retains the proven reliability, performance, and atomicity of the traditional vMotion and Storage vMotion technologies, even at metro area network distances.

Finally, we describe several best practices to follow when using vMotion.

For the full paper, see “VMware vSphere 5.1 vMotion Architecture, Performance and Best Practices”.

 

vMotion Architecture, Performance, and Best Practices in VMware vSphere 5

VMware vSphere vMotion enables the live migration of virtual machines from one VMware vSphere 5 host to another, with no perceivable impact to the end user. vMotion brings invaluable benefits to administrators—it enables load balancing, helps prevent server downtime, and provides flexibility for troubleshooting. vMotion in vSphere 5 incorporates a number of performance enhancements which allow vMotion to be used with minimal overhead on even the largest virtual machines running heavy-duty, enterprise-class applications.

A new white paper, vMotion Architecture, Performance, and Best Practices in VMware vSphere 5, is now available. In that paper, we describe the vMotion architecture and present the features and performance enhancements that have been introduced in vMotion in vSphere 5. Among these improvements are multiple-network adaptor capability for vMotion, better utilization of 10GbE bandwidth, Metro vMotion, and optimizations to further reduce the impact on application performance.

Following the overview and feature description of vMotion in vSphere 5, we provide a comprehensive look at the performance of migrating VMs running typical Tier 1 applications including Rock Web Server, MS Exchange Server, MS SQL Server and VMware View. Tests measure characteristics such as total migration time and application performance during vMotion. Test results show the following:

  • Remarkable improvements in vSphere 5 towards reducing the impact on guest application performance during vMotion
  • Consistent performance gains in vMotion duration on vSphere 5, in the range of 30% shorter migration times
  • Dramatic performance improvements over vSphere 4.1 when using the newly added multiple-network adaptor feature in vSphere 5 (for example, vMotion duration is reduced by a factor of more than 3x)

Finally, we describe several best practices to follow when using vMotion.

For the full paper, see vMotion Architecture, Performance, and Best Practices in VMware vSphere 5.

VMware Load-Based Teaming (LBT) Performance

Virtualized data center environments are often characterized by both the variety and the sheer number of traffic flows, whose network demands often fluctuate widely and unpredictably. Provisioning a fixed network capacity for these traffic flows can either result in poor performance (under-provisioning) or waste valuable capital (over-provisioning).

NIC teaming in vSphere enables you to distribute (or load balance) the network traffic from different traffic flows among multiple physical NICs by providing a mechanism to logically bind together multiple physical NICs. This results in increased throughput and fault tolerance and alleviates the challenge of network-capacity provisioning to a great extent. Creating a NIC team in vSphere is as simple as adding multiple physical NICs to a vSwitch and choosing a load balancing policy.

vSphere 4 and prior ESX releases provide several load balancing choices, which base routing on the originating virtual port ID, an IP hash, or a source MAC hash. While these load balancing choices work fine in the majority of virtual environments, they share a few limitations. All of these policies statically map the affiliations of virtual NICs to physical NICs (based on virtual switch port IDs or MAC addresses); they do not base their load balancing decisions on current network traffic and therefore may not distribute traffic effectively among the physical uplinks. In addition, none of these policies accounts for disparities in physical NIC capacity (such as a mixture of 1 GbE and 10 GbE physical NICs in a NIC team). In the next section, we describe the teaming policy introduced in vSphere 4.1 that addresses these shortcomings.
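Before moving on, the static nature of these policies is easy to see in a toy model (Python; illustrative only, not the vSwitch implementation): the uplink choice is a pure function of the port ID or MAC address, so the mapping never changes no matter how unevenly the traffic actually lands.

```python
# Toy model of the static teaming policies; illustrative only.

def uplink_by_port_id(port_id, uplinks):
    """'Route based on originating virtual port ID' style selection."""
    return uplinks[port_id % len(uplinks)]

def uplink_by_mac(mac, uplinks):
    """'Route based on source MAC hash' style selection."""
    return uplinks[hash(mac) % len(uplinks)]

uplinks = ["dvUplink1", "dvUplink2"]
# Four vNIC ports can easily hash onto the same uplink, and the mapping
# never changes even if that uplink is saturated while the other sits idle.
for port in (10, 12, 14, 16):
    print(port, "->", uplink_by_port_id(port, uplinks))
```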

Load-Based Teaming (LBT)

vSphere 4.1 introduces a load-based teaming (LBT) policy that is traffic-load-aware and ensures physical NIC capacity in a NIC team is optimized. Note that LBT is supported only with the vNetwork Distributed Switch (vDS). LBT avoids the situation of other teaming policies where some of the distributed virtual uplinks (dvUplinks) in a DV Port Group’s team are idle while others are completely saturated. LBT reshuffles the port binding dynamically, based on load and dvUplink usage, to make efficient use of the available bandwidth.

LBT is not the default teaming policy when you create a DV Port Group, so it is up to you to configure it as the active policy. As LBT moves flows among uplinks, it may occasionally cause reordering of packets at the receiver. LBT only moves a flow when the mean send or receive utilization on an uplink exceeds 75% of capacity over a 30-second period, and it will not move flows more often than once every 30 seconds.
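A minimal sketch of the decision rule described above (Python; illustrative pseudologic only, not the vDS implementation, and the data structures are made up): a flow is moved only when an uplink's mean utilization over the 30-second window exceeds 75% of its capacity, and moves happen at most once per window.

```python
# Minimal sketch of the LBT rule described above; not the actual vDS code.

SATURATION = 0.75       # 75% of link capacity
WINDOW_S   = 30         # evaluation / wakeup period in seconds

def maybe_rebalance(uplinks, now, last_move_time):
    """uplinks: dict name -> {'capacity_bps': ..., 'mean_util_bps': ..., 'flows': [...]}.
    Returns (flow, src, dst) describing a single move, or None."""
    if now - last_move_time < WINDOW_S:          # never move more than once per window
        return None
    for name, up in uplinks.items():
        if up['mean_util_bps'] > SATURATION * up['capacity_bps']:
            # pick the least-loaded other uplink as the destination
            dst = min((n for n in uplinks if n != name),
                      key=lambda n: uplinks[n]['mean_util_bps'])
            flow = up['flows'][-1]                # some flow on the saturated uplink
            return flow, name, dst
    return None
```

Replaying the benchmark phases described later in this post against such a rule gives the same behavior: nothing moves until an uplink crosses 7.5Gbps (75% of 10 GbE), and successive moves are at least 30 seconds apart.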

Performance

In this section, we describe in detail the test-bed configuration, the workload used to generate the network traffic flows, and the test results.

Test configuration

In our test configuration, we used an HP DL370 G6 server running the GA release of vSphere 4.1, and several client machines that generated SPECweb®2005 traffic. The server was configured with dual-socket, quad-core 3.1GHz Intel Xeon W5580 processors, 96GB of RAM, and two 10 GbE Intel Oplin NICs. The server hosted four virtual machines and SPECweb2005 traffic was evenly distributed among all four VMs.  Each VM was configured with 4 vCPUs, 16GB memory, 2 vmxnet3 vNICs, and SLES 11 as the guest OS.

SPECweb2005 is an industry-standard web server benchmark defined by the Standard Performance Evaluation Corporation (SPEC). The benchmark consists of three workloads: Banking, Ecommerce, and Support, each with different workload characteristics representing common use cases for web servers. We used the Support workload in our tests, which is the most I/O-intensive of the three.

Baseline performance

In our baseline configuration, we configured a vDS with two dvUplinks and two DV Port Groups. Through the vDS interface, we mapped the vNICs of two of the VMs to the first dvUplink and the vNICs of the other two VMs to the second dvUplink. The SPECweb2005 workload was evenly distributed among all four VMs, so both dvUplinks were equally stressed. In terms of load balancing, this baseline configuration represents the optimal case. With a load of 30,000 SPECweb2005 support users, we observed a little over 13Gbps of traffic, that is, about 6.5Gbps per 10 GbE uplink. CPU utilization was 80%, and 99.99% of SPECweb2005 user sessions met the quality-of-service (QoS) requirements. We chose this load point because customers typically do not stress their systems beyond this level.

LBT performance

We then reconfigured the vDS with two dvUplinks and a single DV Port Group to which all the vNICs of the VMs were mapped. The DV Port Group was configured with the LBT teaming policy. We used the default settings of LBT, which are primarily the wakeup period (30 seconds) and link saturation threshold (75%). Our goal was to evaluate the efficacy of the LBT policy in terms of load balancing and the added CPU cost, if any, when the same benchmark load of 30,000 SPECweb2005 support sessions was applied.

Before the start of the test, we noted that the traffic from all the VMs propagated through the first dvUplink. Note that the initial affiliation of the vNICs to the dvUplinks is made based on the hash of the virtual switch port IDs. To find the current affiliations of the vNICs to the dvUplinks, run the esxtop command and find the port-to-uplink mappings in the network screen. You can also use the “net-lbt” tool to find affiliations as well as to modify LBT settings.

The figure below shows the network bandwidth usage on both of the dvUplinks during the entire benchmark period.

Figure. Network bandwidth usage on the two dvUplinks during the benchmark run

A detailed explanation of the bandwidth usage in each phase follows:

Phase 1: Because all the virtual switch port IDs of the four VMs were hashed to the same dvUplink, only one of the dvUplinks was active. During this phase of the benchmark ramp-up, the total network traffic was below 7.5Gbps. Because the usage on the active dvUplink was lower than the saturation threshold, the second dvUplink remained unused.

Phase 2: The benchmark workload continued to ramp up and when the total network traffic exceeded 7.5Gbps (above the saturation threshold of 75% of link speed), LBT kicked in and dynamically remapped the port-to-uplink mapping of one of the vNIC ports from the saturated dvUplink1 to the unused dvUplink2. This resulted in dvUplink2 becoming active.  The usage on both the dvUplinks remained below the saturation threshold.

Phase 3: As the benchmark workload further ramped up and the total network traffic exceeded 10Gbps (7.5Gbps on dvUplink1 and 2.5Gbps on dvUplink2), LBT kicked in yet again and dynamically changed the port-to-uplink mapping of one of the three active vNIC ports currently mapped to the saturated dvUplink.

Phase 4: As the benchmark reached a steady state, with total network traffic at a little over 13Gbps, both dvUplinks saw the same usage.

We did not observe any spikes in CPU usage or any dip in SPECweb2005 QoS during any of the four phases. CPU utilization remained at 80%, and 99.99% of SPECweb2005 user sessions met the QoS requirements.

These results show that LBT can serve as a very effective load balancing policy to optimally use all the available dvUplink capacity while matching the performance of a manually load-balanced configuration.

Summary

Load-based teaming (LBT) is a dynamic and traffic-load-aware teaming policy that can ensure physical NIC capacity in a NIC team is optimized.  In combination with VMware Network IO Control (NetIOC), LBT offers a powerful solution that will make your vSphere deployment even more suitable for your I/O-consolidated datacenter.

Achieving High Web Throughput with VMware vSphere 4 on Intel Xeon 5500 series (Nehalem) servers

We just published a SPECweb2005 benchmark score of 62,296, the highest result published to date on a virtual configuration. This result was obtained on an HP ProLiant DL380 G6 server running VMware vSphere 4 and featuring Intel Xeon 5500 series processors and Intel 82598EB 10 Gigabit AF network interface cards. While driving network throughput from a single host to just under 30 Gbps, the virtual configuration still achieved 85% of the performance of native (non-virtualized) execution on an equivalent hardware configuration.

Our latest benchmark results show that VMware, with our partners Intel and HP, is able to provide virtualization solutions that meet the performance and scaling needs of modern data centers. In addition, the simplification achieved through consolidation in a virtual environment, as demonstrated by the configuration used in our benchmark publication, contributes to eliminating complexity in the software environment.

Let me briefly discuss some of the distinctive characteristics of our latest benchmark results:

Use of VMDirectPath for virtualizing network I/O: VMDirectPath is a feature in vSphere 4 that builds upon the Intel VT-d (Virtualization Technology for Directed I/O) capability engineered into recent Intel processors to virtualize network I/O. It allows guest operating systems to directly access an I/O device, bypassing the virtualization layer. The result we just published is notably different from our previous results in that this time we used the VMDirectPath feature to take advantage of the higher performance it makes possible.

High performance and linear scaling with the addition of virtual machines: VMDirectPath bypasses the virtualization layer to a large extent for network interactions, but a measurable number of guest OS and hypervisor interactions still remain. The possibility therefore exists that the hypervisor could become a scaling limiter in a multi-VM environment. The excellent performance achieved by our benchmark configuration using four virtual machines shows that this should not be a concern.

A highly simplified setup: Results published on the SPECweb2005 website reveal the complex “interrupt pinning” configurations that are common in native settings, generally employed to make full use of all the cores in today’s multi-core processors. By comparison, our benchmark configuration does not use device interrupt pinning. This is because the virtualization approach divides the load among multiple VMs, each of which is smaller and therefore easier to keep core-efficient.

Virtualization Performance: Our results show that a single vSphere host can handle 30 Gbps real world Web traffic and still reach a performance level of 85% of the native results published on equivalent physical configuration. This demonstrates capabilities several orders of magnitude greater than those needed by typical Web applications, proof-positive that the vast majority of the Web applications can be consolidated, with excellent performance, in a virtualized environment.

For more details, check out the full length article published on the VMware community website in which we elaborate upon each of the characteristics that we briefly discussed here.

VMware breaks the 50,000 SPECweb2005 barrier using VMware vSphere 4

VMware has achieved a SPECweb2005 benchmark score of 50,166 using VMware vSphere 4, a 14% improvement over the world record results previously published on VI3. Our latest results further strengthen the position of VMware vSphere as an industry leader in web serving, thanks to a number of performance enhancements and features that are included in this release. In addition to the measured performance gains, some of these enhancements will help simplify administration in customer environments.

The key highlights of the current results include:

  1. Highly scalable virtual SMP performance.
  2. Over 25% performance improvement for the most I/O intensive SPECweb2005 support component.
  3. Highly simplified setup with no device interrupt pinning.

Let me briefly touch upon each of these highlights.

Virtual SMP performance

The improved scheduler in ESX 4.0 enables usage of large symmetric multiprocessor (SMP) virtual machines for web-centric workloads. Our previous world record results published on ESX 3.5 used as many as fifteen uniprocessor (UP) virtual machines. The current results with ESX 4.0 used just four SMP virtual machines. This is made possible by several improvements that went into the CPU scheduler in ESX 4.0.

From a scheduler perspective, SMP virtual machines present additional considerations such as co-scheduling. In the case of an SMP virtual machine, the ESX scheduler must present the applications and the guest OS running in the virtual machine with the illusion that they are running on a dedicated multiprocessor machine. ESX implements this illusion by co-scheduling the virtual processors of an SMP virtual machine. While the requirement to co-schedule all the virtual processors of a VM was relaxed in previous releases of ESX, the relaxed co-scheduling algorithm has been further refined in ESX 4.0. This gives the scheduler more choices when scheduling the virtual processors of a VM, which leads to higher system utilization and better overall performance in a consolidated environment.

ESX 4.0 has also improved its resource locking mechanism. The locking mechanism in ESX 3.5 was based on the cell lock construct: a cell is a logical grouping of physical CPUs in the system within which all the vCPUs of a VM had to be scheduled. This has been replaced with per-pCPU and per-VM locks. This fine-grained locking reduces contention and improves scalability. All these enhancements enable ESX 4.0 to use SMP VMs and achieve this new level of SPECweb2005 performance.

Very high performance gains for workloads with large I/O component

I/O intensive applications highlight the performance enhancements of ESX 4.0. These tests show that high-I/O workloads yield the largest gains when upgrading to this release.

In all our tests, we used the SPECweb2005 workload, which measures a system’s ability to act as a web server. It is designed with three workloads to characterize different web usage patterns: Banking (emulates online banking), E-commerce (emulates an E-commerce site), and Support (emulates a vendor support site that provides downloads). The performance score of each workload is measured in terms of the number of simultaneous sessions the system is able to support while meeting the QoS requirements of the workload. The aggregate metric reported by SPECweb2005 normalizes the performance scores obtained on the three workloads.

The following figure compares the scores of the three workloads obtained on ESX 4.0 to the previous results on ESX 3.5. The figure also highlights the percentage improvements obtained on ESX 4.0 over ESX 3.5. We used an HP ProLiant DL585 G5 server with four Quad-Core AMD Opteron processors as the system under test. The benchmark results have been reviewed and approved by the SPEC committee.

Figure. SPECweb2005 workload scores on ESX 4.0 compared to ESX 3.5

We used the same HP ProLiant DL585 G5 server and physical test infrastructure in both the current benchmark submission and the previous submission on VI3. There were some differences between the two test configurations (for example, ESX 3.5 used UP VMs while SMP VMs were used on ESX 4.0, and the ESX 4.0 tests were run on currently available processors that have a slightly higher clock speed). To highlight the performance gains, we will look at the percentage improvements obtained for all three workloads rather than the absolute numbers.

As you can see from the above figure, the biggest percentage gain was seen with the Support workload, which has the largest I/O component. In this test, a 25% gain was seen while ESX drove about 20 Gbps of web traffic. Of the three workloads, the Banking workload has the smallest I/O component, and accordingly had relatively smaller percentage gain.

Highly simplified setup

ESX 4.0 also simplifies customer environments without sacrificing performance. In our previous ESX 3.5 results, we pinned the device interrupts to make efficient use of hardware caches and improve performance. Binding device interrupts to specific processors is a technique commonly used in SPECweb2005 benchmarking to maximize performance. Results published on the http://www.spec.org/osg/web2005 website reveal the complex pinning configurations used by benchmark publishers in native environments.

The highly improved I/O processing model in ESX 4.0 obviates the need to do any manual device interrupt pinning. On ESX, the I/O requests issued by the VM are intercepted by the virtual machine monitor (VMM) which handles them in cooperation with the VMkernel. The improved execution model in ESX 4.0 processes these I/O requests asynchronously which allows the vCPUs of the VM to execute other tasks.

Furthermore, the scheduler in ESX 4.0 schedules processing of network traffic based on processor cache architecture, which eliminates the need for manual device interrupt pinning. With the new core-offload I/O system and related scheduler improvements, the results with ESX 4.0 compare favorably to ESX 3.5.

Conclusions

These SPECweb2005 results demonstrate that customers can expect substantial performance gains on ESX 4.0 for web-centric workloads. Our past results published on ESX 3.5 showed world record performance in a scale-out (increasing the number of virtual machines) configuration and our current results on vSphere 4 demonstrate world class performance while scaling up (increasing the number of vCPUs in a virtual machine). With an improved scheduler that required no fine-tuning for these experiments, VMware vSphere 4 can offer these gains while lowering the cost of administration.

VMware Sets Performance Record with SPECweb2005 Result

Introduction

We just published the largest SPECweb2005 score to date on a 16-core server. The benchmark was run on an HP ProLiant DL585 G5 with four Quad-Core AMD Opteron 8382 processors. This record score of 44,000 includes an Ecommerce component demonstrating 69,525 concurrent sessions. In the Support component, this single-host workload drove network throughput on the server to just under 16 Gb/s.

This once again proves the capabilities of the VI3 platform and its ability to service workloads with stringent Quality of Service (QoS) requirements along with a large storage and networking footprint. With continuous advancements in virtualization technology (such as hardware assist for MMU virtualization and NetQueue support for 10 Gigabit Ethernet) performance with the VI3 platform can meet the needs of the most demanding, high traffic web sites.

While record-setting performance of web servers proves the capabilities of ESX, the real story of web server virtualization is the gains due to web farm consolidation and improved flexibility. Infrastructure serving as the web front end today is designed around hundreds or even thousands of often underutilized two and four core servers.  Consolidation of these servers onto modern systems with multi-core CPUs reduces costs, simplifies management and eases power and cooling demands. Consolidating web servers makes business sense. This SPECweb2005 result from VMware has shown that the ESX Server can handle loads much more extreme than anticipated in such a consolidated environment.

The Benchmark

The SPECweb2005 benchmark consists of three workloads: Banking, Ecommerce, and Support, each with different workload characteristics representing common use cases for web servers. Each workload measures the number of simultaneous user sessions a web server can support while still meeting stringent quality-of-service and error-rate requirements. The aggregate metric reported by the SPECweb2005 benchmark is a normalized metric based on the performance scores obtained on all three workloads.

• Banking: 80,000. Models online banking; represents the number of customers accessing accounts at a given time that can be supported with acceptable QoS.
• E-commerce: 69,525. Models an online retail store; only 75 shy of the highest number reported, which required 50% more processing cores.
• Support: 33,000. Represents users acquiring patches and downloads from a support web site; in this test, network throughput was 16Gb/s.
• SPECweb2005 Score: 44,000. Normalized metric from the three components.

Table 1. Results in SPECweb2005 submission.

Benchmark Configuration  

• Hardware: HP ProLiant DL585 G5 with four Quad-Core AMD Opteron 8382 processors, 128 GB RAM
• Disk subsystem: Two EMC CLARiiON CX3-40 Fibre Channel SAN arrays, 110 x 133GB (15K RPM) spindles in total
• Network: Four Intel 10 Gigabit XF SR Server Adapters
• Hypervisor: ESX Server 3.5 U3
• Guest operating system: Red Hat Enterprise Linux 5 Update 1
• Virtual hardware: 1 vCPU, 8 GB memory, vmxnet virtual network adapter
• Web server software: Rock Web Server v1.4.7, Rock JSP/Servlet Container v1.3.2
• Client systems: 30 x Dell PowerEdge 1950, dual-socket dual-core Intel Xeon, 8 GB
• Workload: SPECweb2005

Table 2. Benchmark configuration.

Performance Details

Here’s a quick look at what was accomplished with a single ESX host running the SPECweb2005 workload.

Aggregate performance: The aggregate SPECweb2005 performance of 44,000 obtained on our 16-core virtual configuration is higher than any result ever recorded on a 16-core native system. 

Support performance: The support workload is the most I/O-intensive of the three workloads. The file-set data used for the support workload was laid out on a little over 100 spindles and consisted of files ranging in size from 100 KB to 36 MB. In our test configuration, we used fifteen virtual machines that shared the underlying physical 10Gbps NICs. Together they supported over 33,000 concurrent support user sessions and handled close to 16 gigabits per second of web traffic on a single ESX host.

Banking performance: This workload emulates online banking that transfers encrypted information using HTTPS. The file-set data used for the banking test was about 1.3 terabytes consisting of some eight million individual files of varied sizes. We laid out all this data in a single VMFS volume that spanned multiple LUNs. We used fifteen virtual machines that shared the same base image. Together, they supported 80,000 concurrent banking user sessions and handled 143,000 HTTP operations/second. 

Ecommerce performance: Of the three workloads, the Ecommerce workload probably best fits the profile of most customers. Unlike the Banking and Support workloads, it is a mixture of HTTP and HTTPS requests, and its I/O characteristics fall in between those of Banking and Support. In our test, ESX supported 69,525 concurrent Ecommerce user sessions on a 16-core server. Our result is the second highest E-commerce result ever published, bested by only another 75 sessions on a system with 50% more cores.

To learn more about the test configuration and tuning descriptions, please see the full disclosure report on the official SPEC website: http://www.spec.org/osg/web2005.

Scaling real-life Web server workloads

In an earlier blog, we compared the performance aspects (such as latency, throughput and CPU resource utilization) of real-life web server workloads in a native environment and a virtualized data center environment. In this post, we focus on yet another important dimension of performance – scalability.

For our scalability evaluation tests, we used the widely deployed Apache/PHP as the Web serving platform. We used the industry-standard SPECweb2005 as the web server workload. SPECweb2005 consists of three workloads: banking, e-commerce, and support. The three workloads have vastly different characteristics, and we thus evaluated the results from all three.

First, we evaluated the scalability of the Apache/PHP Web serving platform in the native environment with no virtualization by varying the number of available CPUs at boot time. Note that in all these native configurations, there was a conventional, single operating environment consisting of a single RHEL5 kernel system image and a single Apache/PHP deployment. We applied all the well-documented performance tunings to the Apache/PHP configuration, for example, increasing the number of Apache worker processes and using an opcode cache to improve PHP performance.

The figure below shows the scaling results of the SPECweb2005 workload in the native environment. The scaling curve plots the aggregate SPECweb2005 metric (a normalized metric based on the throughput scores obtained on all three workloads: banking, e-commerce, and support) as the number of processors was increased. In our test configuration, there were no bottlenecks in the hardware environment.

Figure. SPECweb2005 scaling in the native environment as the number of processors increases

As shown in the figure, scalability was severely limited as we increased the number of processors. In the single-CPU configuration, we achieved processor utilization above 95%, but as we increased the number of processors, we failed to achieve such high utilization. Performance was limited by software serialization points in the Apache/PHP/SPECweb2005 software stack. Analysis using the Intel VTune performance analyzer confirmed increasing hot-spot contention as we added CPUs. For the same workload of 1,800 banking sessions, the CPI (cycles per instruction) jumped by a factor of roughly four as we increased the number of CPUs from three to eight, indicating a software scaling issue. As observed in our test configuration, such issues often show up as unacceptable latencies even when there are plenty of compute resources available on the system. More often than not, diagnosing and fixing these issues is not practical in the time available.
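CPI here is simply total CPU cycles divided by instructions retired. The small sketch below (Python, with made-up counter values chosen only for illustration; the actual analysis used the VTune profiler mentioned above) shows why a roughly 4x jump in CPI for the same 1,800-session load means the same work is costing about four times the cycles, that is, the additional processors are being spent on serialization rather than on useful throughput.

```python
# CPI = cycles / instructions retired. Counter values below are made up
# for illustration only.

def cpi(cycles, instructions):
    return cycles / instructions

# Same 1,800-session banking load, roughly the same instruction count:
three_cpu = cpi(cycles=2.0e12, instructions=1.6e12)   # ~1.25
eight_cpu = cpi(cycles=8.0e12, instructions=1.6e12)   # ~5.0, about 4x higher

print(f"3-CPU CPI: {three_cpu:.2f}, 8-CPU CPI: {eight_cpu:.2f}, "
      f"ratio: {eight_cpu / three_cpu:.1f}x")
```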

Most real-life web server workloads suffer from scalability issues such as those observed in our tests. To circumvent these issues, many businesses choose to deploy web server workloads on a multitude of one-CPU or dual-CPU machines. However, such an approach leads to a proliferation of servers in the data center, resulting in higher costs in both power and space. Virtualization offers an easier alternative that avoids software scaling issues and improves power and space efficiency, because it enables several complex operating environments that do not scale easily on their own to run concurrently on a single physical machine and exploit the vast compute resources offered by today's power- and space-efficient multi-core systems. To quantify the effectiveness of this approach, we measured SPECweb2005 performance by deploying multiple Apache/PHP configurations in a virtual environment. We have submitted our test results to the SPEC committee and they are under review.

In our virtualized tests, we configured the virtual machines in accordance with the general performance best practices recommended by VMware. Each VM was assigned one virtual CPU and 4 GB of memory. We then varied the number of simultaneously running virtual machines from one to six. We stopped at six because this workload is highly network-intensive and ESX offloads some of the network processing to the other available cores; stopping short of allocating virtual machines to all cores ensured that, with I/O-intensive workloads such as this one, ESX Server had enough resources for virtual machine scheduling, I/O processing, and other housekeeping tasks. The following figure compares the SPECweb2005 scaling results between the native and the virtual environments.

Figure. SPECweb2005 scaling: native environment compared to multiple virtual machines

As shown in the above figure, we observed good scaling in the virtual environment as we increased the number of virtual machines. The aggregate SPECweb2005 performance obtained in the tests with up to two virtual machines was slightly lower than the performance observed in corresponding native configurations. However, as we further increased the number of processors, the cumulative performance of the configuration using multiple virtual machines well exceeded the performance of a single native environment.

These results clearly demonstrate the benefit of using VMware Infrastructure to bypass software scalability limitations and improve overall efficiency when running real-life web server workloads.

To find out more about the test configuration, tuning information, and detailed results of all the individual SPECweb2005 workloads, check out our recently published performance study.

SPECweb2005 Performance on VMware ESX Server 3.5

I got a chance to attend the VMworld 2007 conference in San Francisco a little over three months ago. During the conference, many of my Performance group colleagues and I had the opportunity to speak with a number of customers from various segments of industry. They all seem to love VMware products and are fully embracing virtualization technology across IT infrastructure, clearly reflecting a paradigm shift. As Diane Greene described in her keynote, virtualization has become a mainstream technology. However, among the customers we spoke to there were a few who had some concerns about virtualizing I/O-intensive applications. Not surprisingly, the concerns had more to do with perception than with their actual experience.

Truth be told, with a number of superior features and performance optimizations in VMware ESX Server 3.5, performance is no longer a barrier to virtualization, even for the most I/O-intensive workloads. In order to dispel the misconceptions these customers had, we decided to showcase the performance of ESX Server by benchmarking with industry-standard I/O-intensive benchmarks. We looked at the whole spectrum of I/O-intensive workloads. My colleague has already addressed database performance. Here, I’d like to focus on web server performance; in particular, the performance of a single virtual machine running the highly-network intensive SPECweb2005 benchmark.

SPECweb2005 is a SPEC benchmark for measuring a system’s ability to act as a web server. It is designed with three workloads to characterize different web usage patterns: Banking (emulates online banking), E-commerce (emulates an E-commerce site), and Support (emulates a vendor support site providing downloads). The three benchmark components have vastly different workload characteristics and we thus look at results from all three.

In our test environment we used an HP ProLiant DL385 G1 server as the system under test (SUT). The server was configured with two 2.2 GHz dual-core AMD Opteron 275 processors and 8GB of memory. In the native tests the system was booted with 1 CPU and 6GB of memory and ran RHEL4 64-bit. In the virtualized tests, we used a 1-vCPU virtual machine configured with 6GB of memory, running RHEL4 64-bit, and hosted on ESX Server 3.5. We used the same operating system version and web server software (Rock Web Server, Rock JSP/Servlet container) in both the native and virtualized tests. Note that neither the storage configuration nor the network configuration in the virtual environment required any additional hardware. In fact we used the same physical network and storage infrastructure when we switched between the native and virtual machine tests.

There are different dimensions to performance. For real-world applications the most significant of these are usually overall latency (execution time) and system throughput (maximum operations per second). We are also concerned with the physical resource utilization per request/response. We used the SPECweb2005 workloads to evaluate all these aspects of performance in a virtualized environment.

Figure 1 shows the performance we obtained using the SPECweb2005 Banking workload. The graph plots the latency in seconds against the total number of SPECweb2005-banking users. The blue dashed line corresponds to the performance observed in a physical environment and the green solid line corresponds to the performance observed in a virtual environment.


Figure 1. Response Time Curves

You can see from the graph that both curves have similar shapes. Both exhibit behavior observed in a typical response time curve. There are three regions of particular interest in the graph: the performance plateau, the stressed region, and the knee of the curve.

The part of the curve marked “Performance plateau” represents the behavior of the system under moderate stress, with CPU utilizations typically well below 50%. Interestingly, we observed lower latency in the virtual environment than in the native environment. This may be because ESX Server intelligently offloads some functionality to the available idle cores, and thus in certain cases users may experience slightly better latency in a virtual environment.

The part of the curve marked “Stressed region” represents the behavior of the system under heavier stress, with utilizations above approximately 60%. Response time gradually increases with load in both curves, but it remains within reasonable limits.

The knee of each curve is marked by a point where the solid red line intersects the curve. The knee represents the maximum throughput (or load) that can be sustained by the system while meeting reasonable response time requirements. Beyond that point the system can no longer gracefully handle higher loads.
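One simple way to locate such a knee numerically is sketched below (Python; illustrative only, with made-up data points and a latency cutoff chosen for the example rather than the benchmark's actual QoS criterion): take the highest load whose measured latency still meets the response-time limit.

```python
# Illustrative knee finder: the highest load that still meets a latency QoS limit.
# The data points and threshold are assumptions for the example, not benchmark rules.

def find_knee(curve, latency_limit_s):
    """curve: list of (load_users, latency_seconds) measured at increasing load."""
    ok = [(load, lat) for load, lat in curve if lat <= latency_limit_s]
    return max(ok)[0] if ok else None

native  = [(800, 0.4), (1200, 0.5), (1600, 0.7), (1800, 1.2), (2000, 3.0)]
virtual = [(800, 0.35), (1200, 0.5), (1600, 0.9), (1800, 2.5), (2000, 6.0)]

print("native knee: ", find_knee(native, latency_limit_s=1.5))   # 1800
print("virtual knee:", find_knee(virtual, latency_limit_s=1.5))  # 1600, slightly earlier
```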

From this graph we can draw the following conclusions:

  1. When the CPU resources in the system are not saturated, you may not notice any difference in the application latency between the virtual and physical environments.
  2. The behavior of the system in both the virtual and physical environments is nearly identical, albeit the knee of the curve in the virtual environment occurs slightly earlier (due to moderately more CPU resources being used by the virtualized system).

We have similar results for the Support and E-commerce workloads. For brevity, I’ll focus on the portion of the response time curve that interests most system administrators. We chose a load point that is approximately 80% of the peak throughput obtained on the native machine. This represents the center of the ‘stressed region’ of the native response time curve, with a CPU utilization level of 70% to 80%. We applied the same load in the virtual environment to understand the latency characteristics.

As you can see from Figure 2, we did not observe any appreciable difference in application latency between the native and virtual environments.


Figure 2. SPECweb2005 Latency

Figure 3 compares the knee points of the response time curves obtained in all three workloads.

The knee points represent the peak throughput (or the maximum connections) sustained by both the native and virtual systems while still meeting the benchmark latency requirements.


Figure 3. SPECweb2005 Throughput

As shown in Figure 3, we obtained close to 90% of native throughput performance on the SPECweb2005 Banking workload, close to 80% of native performance on the E-commerce workload, and 85% of native performance on Support workload.

If you’d like to know more about the test configuration, tuning information, and performance statistics we gathered during the tests, check out our recently published performance study.

These tests clearly demonstrate that performance in a virtualized environment can be close to that of a native environment even when using the most I/O-intensive applications. Virtualization does require moderately higher processor resources, but this is typically not a concern given the highly underutilized CPU resources in many IT environments. In fact, with so many additional benefits, such as server consolidation, lower maintenance costs, higher availability, and fault tolerance, a very compelling case can be made to virtualize any application, irrespective of its workload characteristics.