
Monthly Archives: June 2009

SQL Server Performance on VMware vSphere 4.0

    VMware recently published a whitepaper titled “Performance and Scalability of Microsoft SQL Server on VMware vSphere 4” that demonstrates that VMware vSphere 4.0 can virtualize large SQL Server deployments with excellent performance and scalability. The paper documents results for a resource-intensive OLTP workload running against a SQL Server 2008 database on the Windows Server 2008 operating system and highlights single-VM as well as multi-VM performance.

  • In an 8-vCPU virtual machine, we achieve OLTP throughput that is 86% of physical machine performance 
  • In consolidation experiments with multiple 2-vCPU virtual machines, aggregate throughput scales linearly until the physical CPUs are saturated 

Single-VM Performance Relative to Native 

    The table below summarizes the performance relative to the physical machine as we scale up the vCPUs in a VM running our workload.

Number of Virtual CPUs    Ratio to Native
1                         92%
2                         92%
4                         88%
8                         86%

    At 1, 2, and 4 vCPUs on the 8-pCPU server, ESX is able to effectively offload certain tasks, such as I/O processing, to idle cores. Even at 8 vCPUs on a fully committed system, vSphere 4.0 still delivers excellent performance.

    The following table summarizes the resource intensive nature of the workload used for the tests.

Metric                                  Physical Machine    Virtual Machine
Throughput (transactions per second)    3557                3060
Disk I/O throughput (IOPS)              29 K                25.5 K
Disk I/O latency                        9 milliseconds      8 milliseconds
Network bandwidth (receive)             11.8 Mb/s           10 Mb/s
Network bandwidth (send)                123 Mb/s            105 Mb/s
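
As a quick illustration of how the ratio-to-native figures reported earlier are derived, the throughput numbers in this table can be compared directly (a minimal sketch, not part of the benchmark tooling):

    # Minimal sketch: derive the 8-vCPU "ratio to native" from the throughput
    # numbers in the table above (illustrative only, not benchmark tooling).

    native_tps = 3557   # physical machine, transactions per second
    vm_tps = 3060       # 8-vCPU virtual machine, transactions per second

    ratio_to_native = vm_tps / native_tps
    print(f"8-vCPU VM ratio to native: {ratio_to_native:.0%}")   # ~86%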

Multi-VM Performance and Scalability

    Multiple SQL Server VMs running a resource-intensive OLTP workload can be consolidated to achieve excellent aggregate throughput with minimal performance impact to individual VMs. In the figure below, we plot the total throughput as we add eight 2-vCPU SQL Server VMs onto an 8-way host.

[Figure: Aggregate OLTP throughput as 2-vCPU SQL Server VMs are added to the 8-way host]

    The cumulative throughput increases linearly as we add up to four virtual machines (eight vCPUs). As we over-commit the physical CPUs by increasing the number of VMs from four to six (a factor of 1.5), the aggregate throughput increases by a factor of 1.4. Increasing to eight VMs saturates the physical CPUs on this host, yet ESX is able to use the few remaining idle cycles to deliver 5% more throughput.
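
To make the over-commitment arithmetic above concrete, here is a minimal sketch using the figures quoted in this paragraph (illustrative only):

    # Minimal sketch of the over-commitment scaling arithmetic quoted above
    # (values come from the text; this is illustrative only).

    vms_before, vms_after = 4, 6             # 2-vCPU SQL Server VMs on the 8-way host
    vm_growth = vms_after / vms_before       # 1.5x more VMs, over-committing the CPUs
    throughput_growth = 1.4                  # observed aggregate throughput increase

    efficiency = throughput_growth / vm_growth
    print(f"VM count grew {vm_growth:.1f}x, throughput grew {throughput_growth:.1f}x")
    print(f"Scaling efficiency under over-commitment: {efficiency:.0%}")   # ~93%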

    The data clearly shows that performance is not a barrier for configuring large multi-CPU SQL Server instances in virtual machines or consolidating multiple virtual machines on a single host to achieve impressive aggregate throughput on vSphere 4. 

For more details regarding these tests, we refer you to the paper, Performance and Scalability of Microsoft SQL Server on VMware vSphere 4.

Measuring the Cost of SMP with Mixed Workloads

It is no secret that vSphere 4.0 delivers excellent performance and provides the capability to virtualize the beefiest of workloads. Several impressive performance studies using ESX 4.0 have already been presented. (My favorite is this database performance whitepaper.) However, I continue to hear questions about the scheduling overhead of larger VMs within a heavily utilized, mixed-workload environment. We put together a study using simple variations of VMware’s mixed-workload consolidation benchmark, VMmark, to help answer this question.

For this study we chose two of the VMmark workloads,
database and web server, as the vCPU-scalability targets. These VMs represent
workloads that typically show the greatest range of load in production
environments so they are natural choices for a scalability assessment. We
varied the number of vCPUs in these two VMs between one and four and measured throughput
scaling and CPU utilization of each configuration by increasing the number of
benchmark tiles up to and beyond system saturation.

The standard VMmark workload levels were used and were held constant for all tests. Given that the workload is constant, we are measuring the cost of SMP VMs and their impact on the scheduler. This approach places increasing stress on the hypervisor as the vCPU allocations increase and creates a worst-case scenario for the scheduler. The vCPU allocations for the three configurations are shown in the table below:

 

             Webserver   Database   Fileserver   Mailserver   Javaserver   Standby   Total
             vCPUs       vCPUs      vCPUs        vCPUs        vCPUs        vCPUs     vCPUs
Config1      1           1          1            2            2            1         8
Config2      2           2          1            2            2            1         10
Config3      4           4          1            2            2            1         14

 

Config2 uses the standard VMmark vCPU allocation of 10 vCPUs
per tile. Config1 contains 20% fewer vCPUs than the standard while Config3
contains 40% more than the standard.
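
A quick check of those percentages against the vCPU table above (purely illustrative):

    # Quick check of the per-tile vCPU allocations relative to the standard
    # VMmark allocation (purely illustrative).

    standard = 10                                      # Config2: standard allocation
    configs = {"Config1": 8, "Config2": 10, "Config3": 14}

    for name, vcpus in configs.items():
        delta = (vcpus - standard) / standard
        print(f"{name}: {vcpus} vCPUs per tile ({delta:+.0%} vs. standard)")
    # Config1: -20%, Config2: +0%, Config3: +40%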

We also used Windows Server 2008 instead of Windows Server 2003 where possible to characterize its behavior in anticipation of using Server 2008 in a next-generation benchmark. As a result, we increased the memory in the Javaserver VMs from 1GB to 1.4GB to ensure sufficient memory space for the JVM. The table below provides a summary of each VM’s configuration:

Workload          Memory        Disk         OS
Mailserver        1GB           24GB         Windows 2003 32-bit
Javaserver        1.4GB         12GB (*)     Windows 2008 64-bit
Standby Server    256MB (*)     12GB (*)     Windows 2008 32-bit
Webserver         512MB         8GB          SLES 10 SP2 64-bit
Database          2GB           10GB         SLES 10 SP2 64-bit
Fileserver        256MB         8GB          SLES 10 SP2 32-bit

Below is a basic summary of the hardware used:

  • Dell PowerEdge R905 with 4 x 2.6GHz Quad-Core AMD Opteron 8382 processors
  • Firmware version 3.0.2 (latest available)
  • 128GB DDR2 memory
  • 2 x Intel E1000 dual-port NICs
  • 2 x QLogic 2462 dual-port 4Gb HBAs
  • 2 x EMC CX3-80 storage arrays
  • 15 x HP DL360 client systems

Experimental Results

Figure 1 below shows both the CPU utilization and the throughput
scaling normalized to the single-tile throughput of Config1. Both throughput and
CPU utilization remain roughly equal for all three configurations at load
levels of 1, 3, and 6 tiles (6, 18, and 36 VMs, respectively). The cost of
using SMP VMs is negligible here. The throughputs remain roughly equal while
the CPU utilization curves begin to diverge as the load increases to 9, 10, and
11 tiles (54, 60, and 66 VMs, respectively). Furthermore, all three
configurations achieve roughly linear scaling up to 11 tiles (66 VMs). CPU
utilization when running 11 tiles was 85%, 90%, and 93% for Config1, Config2,
and Config3, respectively. Considering that few customers are comfortable
running at overall system utilizations above 85%, this result shows remarkable
scheduler performance and limited SMP co-scheduling overhead within a typical
operating regime.

[Figure 1: Throughput scaling (normalized to single-tile Config1) and CPU utilization vs. number of tiles for Config1, Config2, and Config3]

Figure 2 below shows the same normalized throughput of Figure 1 as well as the total number of running vCPUs to illustrate the additional stresses put on the hypervisor by the progressively larger SMP configurations. For instance, the throughput scaling at nine tiles is equivalent despite the fact that Config1 requires only 72 vCPUs while
Config3 uses 126 vCPUs. As expected, Config3, with its heavier resource demands, is the first to transition into system saturation. This occurs at a load of 12 tiles (72 VMs). At 12 tiles, there are 168 vCPUs active – 48 more vCPUs than used by Config2 at 12 tiles. Nevertheless, Config3 scaling lags Config2 by only 9% and Config1 by 8%. Config2 reaches system saturation at 14 tiles (84 VMs), where it lags Config1 by 5%. Finally, Config1 hits the saturation point at 15 tiles (90 VMs).
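
The active-vCPU counts quoted above follow directly from the per-tile allocations in the configuration table; a small sketch of that arithmetic (illustrative only):

    # Small sketch of the active-vCPU arithmetic quoted above:
    # active vCPUs = tiles x vCPUs per tile (from the configuration table).

    vcpus_per_tile = {"Config1": 8, "Config2": 10, "Config3": 14}

    for config, tiles in [("Config1", 9), ("Config3", 9),
                          ("Config2", 12), ("Config3", 12)]:
        active = tiles * vcpus_per_tile[config]
        print(f"{config} at {tiles} tiles: {active} active vCPUs")
    # Config1 at 9 tiles: 72; Config3 at 9 tiles: 126;
    # Config2 at 12 tiles: 120; Config3 at 12 tiles: 168 (48 more than Config2)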

[Figure 2: Normalized throughput scaling and total running vCPUs vs. number of tiles]

Overall, these results show that ESX 4.0 effectively and fairly manages VMs of all shapes and sizes in a mixed-workload environment. ESX 4.0 also exhibits excellent throughput parity and minimal CPU differences between the three configurations throughout the typical operating envelope. ESX continues to demonstrate first-class enterprise stability, robustness, and predictability in all cases. Considering how well ESX 4.0 handles a tough situation like this, users can have confidence when virtualizing their larger workloads within larger VMs.

(*) The spartan memory and disk allocations for the Windows Server 2008 VMs might cause readers to question if the virtual machines were adequately provisioned. Since our internal testing covers a wide array of virtualization platforms, reducing the memory of the Standby Server enables us to measure the peak performance of the server before encountering memory bottlenecks on virtualization platforms where physical memory is limited and sophisticated memory overcommit techniques are unavailable. Likewise, we want to configure our tests so that the storage capacity doesn’t induce an
artificial bottleneck. Neither the Standby Server nor the Javaserver place significant demands on their virtual disks, allowing us to optimize storage usage. We carefully compared this spartan Windows Server 2008 configuration against a richly configured Windows Server 2008 tile and found no measurable difference in stability or performance. Of course, I would not encourage this type of configuration in a live production setting. On the other hand, if a VM gets configured in this way, vSphere users can sleep well knowing that ESX won’t let them down.

VMware breaks the 50,000 SPECweb2005 barrier using VMware vSphere 4

VMware has achieved a SPECweb2005 benchmark score of 50,166 using VMware vSphere 4, a 14% improvement over the world record results previously published on VI3. Our latest results further strengthen the position of VMware vSphere as an industry leader in web serving, thanks to a number of performance enhancements and features that are included in this release. In addition to the measured performance gains, some of these enhancements will help simplify administration in customer environments.

The key highlights of the current results include:

  1. Highly scalable virtual SMP performance.
  2. Over 25% performance improvement for the Support workload, the most I/O-intensive SPECweb2005 component.
  3. Highly simplified setup with no device interrupt pinning.

Let me briefly touch upon each of these highlights.

Virtual SMP performance

The improved scheduler in ESX 4.0 enables usage of large symmetric multiprocessor (SMP) virtual machines for web-centric workloads. Our previous world record results published on ESX 3.5 used as many as fifteen uniprocessor (UP) virtual machines. The current results with ESX 4.0 used just four SMP virtual machines. This is made possible by several improvements
that went into the CPU scheduler in ESX 4.0.

From a scheduler perspective, SMP virtual machines present additional considerations such as co-scheduling. This is because, in the case of an SMP virtual machine, it is important for the ESX scheduler to present the applications and the guest OS running in the virtual machine with the illusion that they are running on a dedicated multiprocessor machine. ESX implements this illusion by co-scheduling the virtual processors of an SMP virtual machine. While the requirement to co-schedule all the virtual processors of a VM was relaxed in previous releases of ESX, the relaxed co-scheduling algorithm has been further refined in ESX 4.0. This gives the scheduler more flexibility in scheduling the virtual processors of a VM, which leads to higher system utilization and better overall performance in a consolidated environment.
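
To give a feel for what relaxed co-scheduling means in practice, here is a deliberately simplified toy model of skew-based co-scheduling decisions; the class, threshold, and accounting below are hypothetical illustrations for this post, not ESX internals:

    # Toy model of skew-based relaxed co-scheduling (hypothetical and greatly
    # simplified; this is NOT the ESX scheduler implementation).

    SKEW_LIMIT_MS = 3   # hypothetical skew threshold, for illustration only

    class SmpVm:
        def __init__(self, num_vcpus):
            # progress[i] = accumulated execution time (ms) of vCPU i
            self.progress = [0] * num_vcpus

        def run(self, vcpu, ms):
            """Account 'ms' of execution to one vCPU while siblings may be idle."""
            self.progress[vcpu] += ms

        def lagging_vcpus(self):
            """vCPUs that have fallen too far behind the most-advanced sibling.
            Relaxed co-scheduling only needs to co-start these laggards, whereas
            strict co-scheduling would require running all siblings together."""
            lead = max(self.progress)
            return [i for i, p in enumerate(self.progress)
                    if lead - p > SKEW_LIMIT_MS]

    vm = SmpVm(num_vcpus=4)
    vm.run(0, 10)    # vCPU 0 makes progress while its siblings wait
    vm.run(1, 8)
    print(vm.lagging_vcpus())   # [2, 3] -> only these must be scheduled to catch up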

ESX 4.0 has also improved its resource locking mechanism. The
locking mechanism in ESX 3.5 was based on the cell lock construct. A cell is a
logical grouping of physical CPUs in the system within which all the vCPUs of a
VM had to be scheduled. This has been replaced with per-pCPU and per-VM locks.
This fine-grained locking reduces contention and improves scalability. All
these enhancements enable ESX 4.0 to use SMP VMs and achieve this new level of SPECweb2005 performance.
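
As a rough illustration of why finer-grained locking helps (ordinary Python threading used as an analogy; the lock names and structure below are hypothetical, not ESX code):

    # Rough analogy for lock granularity (hypothetical, not ESX code): a single
    # coarse lock serializes all scheduling decisions, while per-pCPU locks only
    # cause contention when two operations touch the same physical CPU.

    import threading

    NUM_PCPUS = 8

    cell_lock = threading.Lock()                                 # coarse, cell-style lock
    pcpu_locks = [threading.Lock() for _ in range(NUM_PCPUS)]    # fine-grained, per-pCPU

    def schedule_coarse(pcpu, decide):
        with cell_lock:          # every scheduling decision contends on one lock
            decide(pcpu)

    def schedule_fine(pcpu, decide):
        with pcpu_locks[pcpu]:   # decisions for different pCPUs proceed in parallel
            decide(pcpu)

    # Example: make a scheduling decision for pCPU 3 under the fine-grained scheme.
    schedule_fine(3, lambda pcpu: print(f"scheduling decision for pCPU {pcpu}"))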

Very high performance gains for workloads with large I/O component

I/O intensive applications highlight the performance enhancements of ESX 4.0. These tests show that high-I/O workloads yield the largest gains when upgrading to this release.

In all our tests, we used the SPECweb2005 workload, which measures a system’s ability to act as a web server. It is designed with three workloads to characterize different web usage patterns: Banking (emulates online banking), E-commerce (emulates an e-commerce site), and Support (emulates a vendor support site that provides downloads). The performance score of each workload is measured in terms of the number of simultaneous sessions the system is able to support while meeting the QoS requirements of the workload. The aggregate metric reported by SPECweb2005 normalizes the performance scores obtained on the three workloads.

The following figure compares the scores of the
three workloads obtained on ESX 4.0 to the previous results on ESX 3.5. The
figure also highlights the percentage improvements obtained on ESX 4.0 over ESX
3.5. We used an HP ProLiant DL585 G5 server with four Quad-Core AMD Opteron processors
as the system under test. The benchmark results have been reviewed and approved
by the SPEC committee.

[Figure: SPECweb2005 Banking, E-commerce, and Support scores on ESX 4.0 vs. ESX 3.5, with percentage improvements]

We used the same HP ProLiant
DL585 G5 server and the physical test infrastructure in the current as well as
the previous benchmark submission on VI3. There were some differences between
the two test configurations (for example, ESX 3.5 used UP VMs while SMP VMs were used
on ESX 4.0; ESX 4.0 tests were run on currently available processors that have
a slightly higher clock speed). To highlight the performance gains, we will look
at the percentage improvements obtained for all the three workloads rather than
the absolute numbers.

As you can see from the above figure, the biggest percentage gain was seen with the Support workload, which has the largest I/O component. In this test, a 25% gain was seen while ESX drove about 20 Gbps of web traffic. Of the three workloads, the Banking workload has the smallest I/O component and accordingly showed a relatively smaller percentage gain.

Highly simplified setup

ESX 4.0 also simplifies customer environments without sacrificing performance. In our previous ESX 3.5 results, we pinned the device interrupts to make efficient use of hardware caches and improve performance. Binding device interrupts to specific processors is a technique commonly used in SPECweb2005 benchmarking tests to maximize performance. Results published on the http://www.spec.org/osg/web2005 website reveal the complex pinning configurations used by benchmark publishers in native environments.

The highly improved I/O processing model in ESX 4.0 obviates the need to do any manual device interrupt pinning. On ESX, the I/O requests issued by the VM are intercepted by the virtual machine monitor (VMM) which handles them in cooperation with the VMkernel. The improved execution model in ESX 4.0 processes these I/O requests asynchronously which allows the vCPUs of the VM to execute other tasks.
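
As a loose conceptual analogy for that asynchronous execution model (ordinary Python asyncio, not ESX code), the "vCPU" below issues an I/O request and keeps doing other work while the request completes in the background:

    # Loose conceptual analogy only (ordinary Python asyncio, not ESX code):
    # asynchronous handling lets the "vCPU" keep executing other tasks while an
    # I/O request is processed in the background.

    import asyncio

    async def handle_io(request):
        await asyncio.sleep(0.01)              # stand-in for backend I/O processing
        return f"{request} completed"

    async def vcpu():
        io_task = asyncio.create_task(handle_io("disk write"))   # issue I/O, don't block
        other_work = sum(range(100_000))       # vCPU continues with other work
        result = await io_task                 # collect the completion later
        print(other_work, result)

    asyncio.run(vcpu())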

Furthermore, the scheduler in ESX 4.0 schedules processing of network traffic based on processor cache architecture, which eliminates the need for manual device interrupt pinning. With the new core-offload I/O system and related scheduler improvements, the results with ESX 4.0 compare favorably to ESX 3.5.

Conclusions

These SPECweb2005 results demonstrate that customers can expect substantial performance gains on ESX 4.0 for web-centric workloads. Our past results published on ESX 3.5 showed world record performance in a scale-out (increasing the number of virtual machines) configuration and our current results on vSphere 4 demonstrate world class performance while scaling up (increasing the number of vCPUs in a virtual machine). With an improved scheduler that required no fine-tuning for these experiments, VMware vSphere 4 can offer these gains while lowering the cost of administration.

SAP Performance with vSphere 4

VMware recently published a whitepaper that demonstrates VMware vSphere 4’s excellent performance and scalability with SAP ERP software.  The paper presents results of several experiments using VMware vSphere and SAP software with both the Microsoft Windows Server 2008 and SUSE Linux 10.2 operating systems.

First, vSphere’s support for nested page tables (AMD Rapid Virtualization Indexing and Intel Extended Page Tables) is shown to provide a 15-82% performance boost for SAP's most MMU-intensive memory models.  Next, the paper presents a "scale-up" study, comparing n-way virtual machines to n-way physical machines (see figure); using an SAP application load test, vSphere supported up to 95% of the users achieved on physical machines.  The paper also shows that vSphere maintains fairness during CPU overcommitment for an SAP workload and that a performance benefit can be realized when large pages are configured on the host and guest.

[Figure: SAP scale-up performance, n-way virtual machines vs. n-way physical machines]

The results in the paper suggest that to run SAP in a virtual machine most efficiently, one should adopt the following best practices:

  • Run with no more vCPUs than necessary.
  • Use the newest processors (e.g., “AMD Opteron 2300/8300 Series” or “Intel Xeon 5500 Series”) to exploit vSphere's support of hardware nested page tables.
  • Limit virtual machine size to fit within a NUMA node (a rough sizing check is sketched after this list).
  • Configure the guest operating system and applications for large pages.
  • If using a processor with hardware nested page tables (RVI or EPT) and Linux, choose the Std memory model.
  • If using a processor with hardware nested page tables (RVI or EPT) and Windows 2008, convenience should dictate the choice of memory model as it has only a minor effect on performance.
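
As a rough illustration of the NUMA recommendation above (see the third bullet), the sketch below checks whether a proposed VM fits within a single NUMA node; the host topology values are hypothetical examples, not figures from the paper:

    # Rough illustration of the "fit within a NUMA node" recommendation.
    # The host topology values below are hypothetical examples.

    cores_per_numa_node = 4        # e.g., one quad-core socket per NUMA node
    memory_per_numa_node_gb = 32   # local memory attached to that node

    def fits_in_numa_node(vcpus, memory_gb):
        """True if the VM can be placed entirely within one NUMA node, so its
        memory accesses can stay node-local."""
        return vcpus <= cores_per_numa_node and memory_gb <= memory_per_numa_node_gb

    print(fits_in_numa_node(vcpus=4, memory_gb=16))   # True  -> node-local placement
    print(fits_in_numa_node(vcpus=8, memory_gb=16))   # False -> VM would span nodes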

For more information on the experiments and how we arrive at these recommendations, we refer you to the full paper.  For additional SAP information, please visit: http://www.vmware.com/sap.