VMware

July 08, 2009

Application Performance Improvement with DRS

Summary

VMware Distributed Resource Scheduler (DRS), a unique feature of VMware vSphere 4, dynamically allocates and balances computing resources in a cluster of vSphere hosts. DRS continuously monitors utilization across resource pools and intelligently allocates available resources among virtual machines based on policies specified by administrators. DRS not only manages resources efficiently but can also deliver significant VM performance gains. In experiments conducted at EMC labs, we observed VM performance improvements as high as 47% with DRS when running virtualized SQL databases on a cluster of vSphere hosts.


Why DRS?

VMware vSphere provides a virtual platform to consolidate many servers onto fewer physical hosts. However, in such consolidation scenarios, unexpected spikes in the resource demands of the VMs can cause the total resource requirements to exceed the available resources on their host. A manual approach to mitigate this problem is to estimate the individual as well as aggregate resource demands upfront and place the VMs intelligently on the hosts based on the estimation.

But, even if the hosts are balanced initially, there is no guarantee that the resource demands of the VMs will stay constant and system loads on ESX will remain balanced. A change in workload behavior may cause resource demand to change, which can lead to contention among VMs for CPU cycles on some hosts while CPU cycles remain unused on other lightly loaded hosts.

DRS provides an automated mechanism to manage the resource demands. It monitors the resource needs of the VMs when they are powered on and allocates resources by placing the VMs on appropriate hosts in the cluster. If the resource demands change after the initial placement, DRS automatically relocates the VMs to hosts where the resources are readily available. VMs continue to get the resources their workloads demand and thus deliver the same performance they would if they were running on dedicated hardware.

Methodology

We created a DRS cluster consisting of 4 ESX hosts. All hosts were identical in hardware configuration (refer to "Configuration Details" for more information). On each host we created 4 VMs, and in each VM we installed SQL Server 2005 and the DVD Store version 2.0 (DS2) database.

We created 2 DS2 workload profiles:

  • Heavy: This profile drove vCPU utilization in a VM to 70%
  • Light: This profile drove vCPU utilization in a VM to 10%

We randomly assigned these profiles to the VMs as shown in Table 1. Though the number of VMs was the same on all hosts, differences in the application load led to CPU resource contention on some hosts and unused CPU resources on the remaining hosts.

Table 1. VM Workload Profiles

Host 1     Host 2     Host 3     Host 4
4H         4H         4L         4L
4H         3H / 1L    1H / 3L    4L
4H         2H / 2L    2H / 2L    4L
4H         2H / 2L    1H / 3L    1H / 3L
3H / 1L    3H / 1L    2H / 2L    4L
3H / 1L    3H / 1L    1H / 3L    1H / 3L
3H / 1L    2H / 2L    2H / 2L    1H / 3L

H – VM with 70% CPU utilization; L – VM with 10% CPU utilization

For each test case in Table 1, we ran a DS2 workload simultaneously in all VMs with DRS disabled. We collected the application throughput (Orders per Minute, or OPM) in all the VMs.

We repeated the experiments after enabling DRS. During each test case, DRS migrated a few VMs across the hosts based on their resource demands. The final balanced configuration achieved in each case was the same and is given in Table 2. We measured the aggregate throughput from all the VMs in this balanced configuration.

Table 2. Balanced DRS cluster

Host 1     Host 2     Host 3     Host 4
2H / 2L    2H / 2L    2H / 2L    2H / 2L
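
A rough estimate helps show why the placements in Table 1 produce contention while the balanced placement in Table 2 does not. The Python sketch below is illustrative only. It assumes that a VM's CPU demand is its vCPU count (4, per Configuration Details) multiplied by the profile's vCPU utilization, and that each host offers 8 cores (dual socket, quad core). The names `host_demand`, `unbalanced`, and `balanced` are hypothetical, and this is not how DRS itself computes load.

```python
# Rough estimate of per-host CPU demand for a placement of Heavy/Light VMs.
# Assumptions (not taken from DRS): demand = vCPUs x utilization, and each
# host offers 8 physical cores (dual socket, quad core).
VCPUS_PER_VM = 4
HOST_CORES = 8
UTILIZATION = {"H": 0.70, "L": 0.10}


def host_demand(heavy, light):
    """Return the estimated CPU demand of one host, in core equivalents."""
    return VCPUS_PER_VM * (heavy * UTILIZATION["H"] + light * UTILIZATION["L"])


# First row of Table 1: (heavy, light) VM counts for Host 1 through Host 4.
unbalanced = [(4, 0), (4, 0), (0, 4), (0, 4)]
# Table 2: the balanced placement reached with DRS enabled.
balanced = [(2, 2)] * 4

for name, placement in (("unbalanced", unbalanced), ("balanced", balanced)):
    demands = [host_demand(h, l) for h, l in placement]
    overcommitted = [d > HOST_CORES for d in demands]
    print(name, [round(d, 1) for d in demands], overcommitted)
```

Under these assumptions, a host running four Heavy VMs needs roughly 11.2 cores' worth of CPU against 8 available, while each host in the balanced placement needs about 6.4.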


Key Findings

Figure 1 compares the aggregate throughput of all heavily loaded VMs in each of our test cases (Table 1) with and without DRS. We have not shown the performance of the lightly loaded VMs because it did not change.

Figure 1. Performance Gains with DRS

With DRS we observed:

  • 15% to 47% gains in aggregate performance for the cases tested.
  • That the higher the resource demand and imbalance in the cluster, the higher the performance gain.
  • No performance impact when the cluster was already balanced.

This testing was the result of a joint effort between VMware and EMC. We would like to thank the Midrange Partner Solutions Engineering team at EMC, Santa Clara for providing access to the hardware, for the use of their lab, and for their joint collaboration throughout this project.

Configuration Details

ESX Hosts (4)
HP DL380
4 Dual socket, Quad core Intel Xeon 5450 3.0GHz
32GB of Memory
Dual port QLogic QLE2462 HBA

VC Server (1)
HP DL380
4 Dual socket, Quad core Intel Xeon 5450 3.0GHz
8GB of Memory

Load Generators (4)
Dual socket, Dual core server
8GB of Memory

Storage (1)
CX4-960 with 188 15K rpm FC disks

Virtual Platform:
VMware vSphere

Virtual Machines (16)
4 virtual CPUs
5GB memory
Windows Server 2003 x64 with SP2
SQL Server 2005 x64 with SP2
DVD Store version 2 (Large sized database)
http://www.delltechcenter.com/page/DVD+Store

Tuning
DRS aggressiveness threshold: 5 (most aggressive)

For more comments or questions, please join us in the VMware Performance Community website.

About the Authors:
Chethan Kumar is a member of the Performance Engineering team at VMware. Radhakrishnan Manga is a member of the Midrange Partner Solutions Engineering team at EMC.


June 26, 2009

SQL Server performance on VMware vSphere 4.0

    VMware recently published a whitepaper titled “Performance and Scalability of Microsoft SQL Server on VMware vSphere 4” that demonstrates that VMware vSphere 4.0 can virtualize large SQL Server deployments with excellent performance and scalability. The paper documents results for a resource-intensive OLTP workload running against a SQL Server 2008 database on the Windows Server 2008 operating platform and highlights single-VM as well as multi-VM performance.

  • In an 8vCPU virtual machine, we achieve OLTP throughput that is 86% of physical machine performance 
  • In consolidation experiments with multiple 2-vCPU virtual machines, aggregate throughput scales linearly until physical CPUs are saturated 

Single-VM Performance Relative to Native 

    The table below summarizes the performance relative to the physical machine as we scale-up the vCPUs in a VM running our workload.


Number of Virtual CPUs    Ratio to Native
1                         92%
2                         92%
4                         88%
8                         86%

 
    At 1, 2, and 4 vCPUs on the 8-pCPU server, ESX is able to effectively offload certain tasks such as I/O processing to idle cores. Even at 8 vCPUs on a fully committed system, vSphere 4.0 still delivers excellent performance.

    The following table summarizes the resource intensive nature of the workload used for the tests.

Metric                                   Physical Machine    Virtual Machine
Throughput in transactions per second    3557                3060
Disk I/O throughput (IOPS)               29 K                25.5 K
Disk I/O latencies                       9 milliseconds      8 milliseconds
Network bandwidth receive                11.8 Mb/s           10 Mb/s
Network bandwidth send                   123 Mb/s            105 Mb/s
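
The ratio-to-native figures can be reproduced from the raw numbers in this table. The short Python sketch below is only a worked calculation: the dictionary keys and the `ratio_to_native` helper are hypothetical names, and the inputs are the throughput and IOPS values listed above.

```python
# Virtual-to-physical ratios for two of the metrics in the table above.
# Each entry is (physical machine value, virtual machine value).
metrics = {
    "throughput_tps": (3557, 3060),
    "disk_iops": (29_000, 25_500),
}


def ratio_to_native(physical, virtual):
    """Return virtual performance as a fraction of physical performance."""
    return virtual / physical


ratios = {name: ratio_to_native(p, v) for name, (p, v) in metrics.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%} of native")
```

The throughput ratio works out to about 86%, in line with the 8-vCPU figure quoted earlier.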


Multi-VM Performance and Scalability

    Multiple SQL Server VMs running a resource-intensive OLTP workload can be consolidated to achieve excellent aggregate throughput with minimal performance impact to individual VMs. In the figure below, we plot the total throughput as we add up to eight 2-vCPU SQL Server VMs onto an 8-way host.

Figure: Total throughput as 2-vCPU SQL Server VMs are added to an 8-way host

    The cumulative throughput increases linearly as we add up to four virtual machines (eight vCPUs). As we over-commit the physical CPUs by increasing the number of VMs from four to six (a factor of 1.5), the aggregate throughput increases by a factor of 1.4. Increasing the count to eight VMs saturates the physical CPUs on this host, yet ESX is still able to use the few idle cycles to deliver 5% more throughput.
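
One way to read these numbers is as a scaling efficiency: the throughput gain divided by the increase in VM count. The Python sketch below is a worked calculation under stated assumptions, not a result from the paper. The `scaling_efficiency` helper is a hypothetical name, and the second calculation assumes that the 5% gain at eight VMs is measured relative to the six-VM result.

```python
def scaling_efficiency(vm_factor, throughput_factor):
    """Return the fraction of ideal linear scaling actually achieved."""
    return throughput_factor / vm_factor


# Four to six VMs: 1.5x the VMs delivered 1.4x the throughput.
four_to_six = scaling_efficiency(6 / 4, 1.4)

# Assumption: the 5% gain at eight VMs is relative to the six-VM result.
six_to_eight = scaling_efficiency(8 / 6, 1.05)

print(f"4 -> 6 VMs: {four_to_six:.0%} of linear scaling")
print(f"6 -> 8 VMs: {six_to_eight:.0%} of linear scaling")
```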

    The data clearly shows that performance is not a barrier for configuring large multi-CPU SQL Server instances in virtual machines or consolidating multiple virtual machines on a single host to achieve impressive aggregate throughput on vSphere 4. 

For more details regarding these tests, we refer you to the paper at Performance and Scalability of Microsoft SQL Server on VMware vSphere 4

June 23, 2009

Measuring the Cost of SMP with Mixed Workloads

It is no secret that vSphere 4.0 delivers excellent performance and provides the capability to virtualize the beefiest of workloads. Several impressive performance studies using ESX 4.0 have already been presented. (My favorite is this database performance whitepaper.) However, I continue to hear questions about the scheduling overhead of larger VMs within a heavily utilized, mixed-workload environment. We put together a study using simple variations of VMware’s mixed-workload consolidation benchmark VMmark to help answer this question.

For this study we chose two of the VMmark workloads, database and web server, as the vCPU-scalability targets. These VMs represent workloads that typically show the greatest range of load in production environments so they are natural choices for a scalability assessment. We varied the number of vCPUs in these two VMs between one and four and measured throughput scaling and CPU utilization of each configuration by increasing the number of benchmark tiles up to and beyond system saturation.

The standard VMmark workload levels were used and were held constant for all tests. Given that the workload is constant, we are measuring the cost of SMP VMs and their impact on the scheduler. This approach places increasing stress on the hypervisor as the vCPU allocations increase and creates a worst-case scenario for the scheduler. The vCPU allocations for the three configurations are shown in the table below:

 

           Webserver  Database  Fileserver  Mailserver  Javaserver  Standby  Total
           vCPUs      vCPUs     vCPUs       vCPUs       vCPUs       vCPUs    vCPUs
Config1    1          1         1           2           2           1        8
Config2    2          2         1           2           2           1        10
Config3    4          4         1           2           2           1        14

 

Config2 uses the standard VMmark vCPU allocation of 10 vCPUs per tile. Config1 contains 20% fewer vCPUs than the standard while Config3 contains 40% more than the standard.
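
The per-tile totals and percentages above follow directly from the allocation table. The Python sketch below reproduces that arithmetic and scales it to a given number of tiles. The dictionary layout and the helper names `vcpus_per_tile` and `total_vcpus` are hypothetical and used only for illustration.

```python
# vCPU allocations per VM for each configuration, from the table above.
CONFIGS = {
    "Config1": {"webserver": 1, "database": 1, "fileserver": 1,
                "mailserver": 2, "javaserver": 2, "standby": 1},
    "Config2": {"webserver": 2, "database": 2, "fileserver": 1,
                "mailserver": 2, "javaserver": 2, "standby": 1},
    "Config3": {"webserver": 4, "database": 4, "fileserver": 1,
                "mailserver": 2, "javaserver": 2, "standby": 1},
}


def vcpus_per_tile(name):
    """Return the number of vCPUs in one tile of the named configuration."""
    return sum(CONFIGS[name].values())


def total_vcpus(name, tiles):
    """Return the number of vCPUs running at the given tile count."""
    return vcpus_per_tile(name) * tiles


baseline = vcpus_per_tile("Config2")  # the standard VMmark allocation
for name in CONFIGS:
    per_tile = vcpus_per_tile(name)
    change = (per_tile - baseline) / baseline
    print(f"{name}: {per_tile} vCPUs per tile ({change:+.0%} vs. Config2)")

print(total_vcpus("Config1", 9), total_vcpus("Config3", 9), total_vcpus("Config3", 12))
```

With these numbers, nine tiles come to 72 vCPUs for Config1 and 126 for Config3, and twelve tiles of Config3 come to 168, matching the vCPU counts quoted in the experimental results.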

We also used Windows Server 2008 instead of Windows Server 2003 where possible to characterize its behavior in anticipation of using Server 2008 in a next-generation benchmark. As a result, we increased the memory in the Javaserver VMs from 1GB to 1.4GB to ensure sufficient memory space for the JVM. The table below provides a summary of each VM’s configuration:

Workload          Memory       Disk        OS
Mailserver        1GB          24GB        Windows 2003 32bit
Javaserver        1.4GB        12GB (*)    Windows 2008 64bit
Standby Server    256MB (*)    12GB (*)    Windows 2008 32bit
Webserver         512MB        8GB         SLES 10 SP2 64bit
Database          2GB          10GB        SLES 10 SP2 64bit
Fileserver        256MB        8GB         SLES 10 SP2 32bit

Below is a basic summary of the hardware used:

  • Dell PowerEdge R905 with 4 x 2.6GHz Quad Core AMD Opteron 8382
  • Firmware version 3.0.2 (latest available).
  • 128GB DDR2 Memory.
  • 2 x Intel E1000 dual-port NIC
  • 2 x Qlogic 2462 dual-port 4Gb
  • 2 x EMC CX3-80 Storage Arrays.
  • 15 x HP DL360 client systems.

Experimental Results

Figure 1 below shows both the CPU utilization and the throughput scaling normalized to the single-tile throughput of Config1. Both throughput and CPU utilization remain roughly equal for all three configurations at load levels of 1, 3, and 6 tiles (6, 18, and 36 VMs, respectively). The cost of using SMP VMs is negligible here. The throughputs remain roughly equal while the CPU utilization curves begin to diverge as the load increases to 9, 10, and 11 tiles (54, 60, and 66 VMs, respectively). Furthermore, all three configurations achieve roughly linear scaling up to 11 tiles (66 VMs). CPU utilization when running 11 tiles was 85%, 90%, and 93% for Config1, Config2, and Config3, respectively. Considering that few customers are comfortable running at overall system utilizations above 85%, this result shows remarkable scheduler performance and limited SMP co-scheduling overhead within a typical operating regime.

Figure 1. CPU utilization and throughput scaling normalized to the single-tile throughput of Config1

Figure 2 below shows the same normalized throughput of Figure 1 as well as the total number of running vCPUs to illustrate the additional stresses put on the hypervisor by the progressively larger SMP configurations. For instance, the throughput scaling at nine tiles is equivalent despite the fact that Config1 requires only 72 vCPUs while Config3 uses 126 vCPUs. As expected, Config3, with its heavier resource demands, is the first to transition into system saturation. This occurs at a load of 12 tiles (72 VMs). At 12 tiles, there are 168 vCPUs active – 48 more vCPUs than used by Config2 at 12 tiles. Nevertheless, Config3 scaling only lags Config2 by 9% and Config1 by 8%. Config2 reaches system saturation at 14 tiles (84 VMs), where it lags Config1 by 5%. Finally Config1 hits the saturation point at 15 tiles (90 VMs).

Figure 2. Normalized throughput and total number of running vCPUs

Overall, these results show that ESX 4.0 effectively and fairly manages VMs of all shapes and sizes in a mixed-workload environment. ESX 4.0 also exhibits excellent throughput parity and minimal CPU differences between the three configurations throughout the typical operating envelope. ESX continues to demonstrate first-class enterprise stability, robustness, and predictability in all cases. Considering how well ESX 4.0 handles a tough situation like this, users can have confidence when virtualizing their larger workloads within larger VMs.

(*) The spartan memory and disk allocations for the Windows Server 2008 VMs might cause readers to question if the virtual machines were adequately provisioned. Since our internal testing covers a wide array of virtualization platforms, reducing the memory of the Standby Server enables us to measure the peak performance of the server before encountering memory bottlenecks on virtualization platforms where physical memory is limited and sophisticated memory overcommit techniques are unavailable. Likewise, we want to configure our tests so that the storage capacity doesn’t induce an artificial bottleneck. Neither the Standby Server nor the Javaserver place significant demands on their virtual disks, allowing us to optimize storage usage. We carefully compared this spartan Windows Server 2008 configuration against a richly configured Windows Server 2008 tile and found no measurable difference in stability or performance. Of course, I would not encourage this type of configuration in a live production setting. On the other hand, if a VM gets configured in this way, vSphere users can sleep well knowing that ESX won’t let them down.


June 15, 2009

VMware breaks the 50,000 SPECweb2005 barrier using VMware vSphere 4

VMware has achieved a SPECweb2005 benchmark score of 50,166 using VMware vSphere 4, a 14% improvement over the world record results previously published on VI3. Our latest results further strengthen the position of VMware vSphere as an industry leader in web serving, thanks to a number of performance enhancements and features that are included in this release. In addition to the measured performance gains, some of these enhancements will help simplify administration in customer environments.

The key highlights of the current results include:

  1. Highly scalable virtual SMP performance.
  2. Over 25% performance improvement for the most I/O intensive SPECweb2005 support component.
  3. Highly simplified setup with no device interrupt pinning.

Let me briefly touch upon each of these highlights.

Virtual SMP performance

The improved scheduler in ESX 4.0 enables usage of large symmetric multiprocessor (SMP) virtual machines for web-centric workloads. Our previous world record results published on ESX 3.5 used as many as fifteen uniprocessor (UP) virtual machines. The current results with ESX 4.0 used just four SMP virtual machines. This is made possible by several improvements that went into the CPU scheduler in ESX 4.0.

From a scheduler perspective, SMP virtual machines present additional considerations such as co-scheduling. In the case of an SMP virtual machine, it is important for the ESX scheduler to present the applications and the guest OS running in the virtual machine with the illusion that they are running on a dedicated multiprocessor machine. ESX implements this illusion by co-scheduling the virtual processors of an SMP virtual machine. While the requirement to co-schedule all the virtual processors of a VM was relaxed in previous releases of ESX, the relaxed co-scheduling algorithm has been further refined in ESX 4.0. This means the scheduler has more choices in how it schedules the virtual processors of a VM, which leads to higher system utilization and better overall performance in a consolidated environment.

ESX 4.0 has also improved its resource locking mechanism. The locking mechanism in ESX 3.5 was based on the cell lock construct. A cell is a logical grouping of physical CPUs in the system within which all the vCPUs of a VM had to be scheduled. This has been replaced with per-pCPU and per-VM locks. This fine-grained locking reduces contention and improves scalability. All these enhancements enable ESX 4.0 to use SMP VMs and achieve this new level of SPECweb2005 performance.
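
The difference between a single coarse lock and fine-grained locks can be pictured with a toy example. The Python sketch below is purely illustrative and is not ESX code. The classes `CellLockQueues` and `PerPcpuLockQueues` and the `fill` helper are hypothetical names, and the run queues are plain lists. It shows only the structural difference: with one cell lock every pCPU contends for the same lock, while with per-pCPU locks only work targeting the same pCPU contends. Because of Python's global interpreter lock, it should not be read as a benchmark.

```python
import threading


class CellLockQueues:
    """Coarse-grained: a single lock guards every run queue in the cell."""

    def __init__(self, num_pcpus):
        self._lock = threading.Lock()
        self.queues = [[] for _ in range(num_pcpus)]

    def enqueue(self, pcpu, vcpu):
        with self._lock:  # every pCPU contends for the same lock
            self.queues[pcpu].append(vcpu)


class PerPcpuLockQueues:
    """Fine-grained: each pCPU run queue has its own lock."""

    def __init__(self, num_pcpus):
        self._locks = [threading.Lock() for _ in range(num_pcpus)]
        self.queues = [[] for _ in range(num_pcpus)]

    def enqueue(self, pcpu, vcpu):
        with self._locks[pcpu]:  # only work on the same pCPU contends
            self.queues[pcpu].append(vcpu)


def fill(queues, num_pcpus, per_thread):
    """Enqueue from one thread per pCPU and return the total items queued."""
    def worker(pcpu):
        for i in range(per_thread):
            queues.enqueue(pcpu, (pcpu, i))

    threads = [threading.Thread(target=worker, args=(p,)) for p in range(num_pcpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(len(q) for q in queues.queues)


print(fill(CellLockQueues(8), 8, 1000), fill(PerPcpuLockQueues(8), 8, 1000))
```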

Very high performance gains for workloads with large I/O component

I/O intensive applications highlight the performance enhancements of ESX 4.0. These tests show that high-I/O workloads yield the largest gains when upgrading to this release.

In all our tests, we used the SPECweb2005 workload, which measures the system's ability to act as a web server. It is designed with three workloads to characterize different web usage patterns: Banking (emulates online banking), E-commerce (emulates an E-commerce site), and Support (emulates a vendor support site that provides downloads). The performance score of each workload is measured in terms of the number of simultaneous sessions the system is able to support while meeting the QoS requirements of the workload. The aggregate metric reported by the SPECweb2005 workload normalizes the performance scores obtained on the three workloads.

The following figure compares the scores of the three workloads obtained on ESX 4.0 to the previous results on ESX 3.5. The figure also highlights the percentage improvements obtained on ESX 4.0 over ESX 3.5. We used an HP ProLiant DL585 G5 server with four Quad-Core AMD Opteron processors as the system under test. The benchmark results have been reviewed and approved by the SPEC committee.

Figure: SPECweb2005 workload scores on ESX 4.0 compared with ESX 3.5

We used the same HP ProLiant DL585 G5 server and the physical test infrastructure in the current as well as the previous benchmark submission on VI3. There were some differences between the two test configurations (for example, ESX 3.5 used UP VMs while SMP VMs were used on ESX 4.0; ESX 4.0 tests were run on currently available processors that have a slightly higher clock speed). To highlight the performance gains, we will look at the percentage improvements obtained for all the three workloads rather than the absolute numbers.

As you can see from the above figure, the biggest percentage gain was seen with the Support workload, which has the largest I/O component. In this test, a 25% gain was seen while ESX drove about 20 Gbps of web traffic. Of the three workloads, the Banking workload has the smallest I/O component, and accordingly had relatively smaller percentage gain.

Highly simplified setup

ESX 4.0 also simplifies customer environments without sacrificing performance. In our previous ESX 3.5 results, we pinned the device interrupts to make efficient use of hardware caches and improve performance. Binding device interrupts to specific processors is a technique common to SPECweb2005 benchmarking tests to maximize performance. Results published on the http://www.spec.org/osg/web2005 website reveal the complex pinning configurations used by the benchmark publishers in the native environment.

The highly improved I/O processing model in ESX 4.0 obviates the need to do any manual device interrupt pinning. On ESX, the I/O requests issued by the VM are intercepted by the virtual machine monitor (VMM) which handles them in cooperation with the VMkernel. The improved execution model in ESX 4.0 processes these I/O requests asynchronously which allows the vCPUs of the VM to execute other tasks.

Furthermore, the scheduler in ESX 4.0 schedules processing of network traffic based on processor cache architecture, which eliminates the need for manual device interrupt pinning. With the new core-offload I/O system and related scheduler improvements, the results with ESX 4.0 compare favorably to ESX 3.5.

Conclusions

These SPECweb2005 results demonstrate that customers can expect substantial performance gains on ESX 4.0 for web-centric workloads. Our past results published on ESX 3.5 showed world record performance in a scale-out (increasing the number of virtual machines) configuration and our current results on vSphere 4 demonstrate world class performance while scaling up (increasing the number of vCPUs in a virtual machine). With an improved scheduler that required no fine-tuning for these experiments, VMware vSphere 4 can offer these gains while lowering the cost of administration.


June 05, 2009

SAP Performance with vSphere 4

VMware recently published a whitepaper that demonstrates VMware vSphere 4’s excellent performance and scalability with SAP ERP software.  The paper presents results of several experiments using VMware vSphere and SAP software with both the Microsoft Windows Server 2008 and SUSE Linux 10.2 operating systems.

First, vSphere’s support for nested page tables (AMD Rapid Virtualization Indexing and Intel Extended Page Tables) is shown to provide a 15-82% performance boost for SAP's most MMU-intensive memory models. Next, the paper presents a "scale-up" study, comparing n-way virtual machines to n-way physical machines (see figure); using an SAP application load test, vSphere supported up to 95% of the users achieved on physical machines. The paper also shows that vSphere maintains fairness during CPU overcommitment for an SAP workload and that a performance benefit can be realized when large pages are configured on the host and guest.

SAP Scale-Up Performance

The results in the paper suggest that to run SAP in a virtual machine most efficiently, one should adopt the following best practices:

  • Run with no more vCPUs than necessary.
  • Use the newest processors  (e.g., “AMD Opteron 2300/8300 Series” or “Intel Xeon 5500 Series”) to exploit vSphere's support of hardware nested page tables.
  • Limit virtual machine size to fit within a NUMA node.
  • Configure guest operating system and applications for large pages.
  • If using a processor with hardware nested page tables (RVI or EPT) and Linux, choose the Std memory model.
  • If using a processor with hardware nested page tables (RVI or EPT) and Windows 2008, convenience should dictate the choice of memory model as it has only a minor effect on performance.

For more information on the experiments and how we arrive at these recommendations, we refer you to the full paper.  For additional SAP information, please visit: http://www.vmware.com/sap.


May 26, 2009

Java Performance on vSphere 4

VMware ESX is an excellent platform for deploying Java applications. Many customers use it to support Java applications from the desktop to business-critical enterprise servers. However, we haven't published any results recently highlighting the excellent performance of Java applications on VMware ESX. As a first step at remedying this situation, we compared native and virtualized performance using SPECjvm2008. This workload is a benchmark suite containing several real-life applications and benchmarks focusing on core Java functionality. The results demonstrate that Java applications run on VMware vSphere at greater than 94% of native performance over a range of VM sizes. This is up to a 9% improvement over VMware ESX 3.5, which already runs this workload at close to or better than 90% of native performance.

We ran SPECjvm2008 on Red Hat Enterprise Linux 5 Update 3 using the latest JVM from Sun Microsystems, JRE 1.6 Update 13. Tests were conducted with both 32-bit and 64-bit versions of the OS and JVM. An HP DL380G5 equipped with two quad-core Intel Xeon X5460 (Harpertown) processors running at 3.16GHz was used. This server had 32GB of memory. For native runs using fewer than the full number of available CPU cores, we used the kernel boot parameter maxcpus= to limit the OS to a given number of cores. We also used the kernel boot parameter mem= to limit the memory to 16GB in all 64-bit runs. The runs on VMware vSphere 4.0 and VMware ESX 3.5 Update 4 were done in virtual machines (VMs) using the stated number of virtual CPUs and 16GB of memory.

The runs of SPECjvm2008 were all base runs, meaning that no Java tuning parameters were used.   All SPECjvm2008 results are required to include a base run.  Unfortunately, the default heap size of the Sun JVM in the 1 CPU case is not large enough to run the SPECjvm2008 workload.  As a result, we were not able to generate 1 CPU results which would be compliant with the run-rules for SPECjvm2008.  We did generate native and vSphere 4.0 results for 2, 4, and 8 CPUs, and ESX 3.5 results for 2 and 4 CPUs.

Figure 1 shows the SPECjvm2008 results for the native, VMware vSphere 4.0, and VMware ESX 3.5 cases.  Figure 2 presents the same results normalized to the native result for that server and CPU count.  These results show that VMs running on VMware vSphere 4.0 perform at greater than 95% of native on this benchmark at all VM sizes.  Even with 8 vCPUs running on a server with only 8 physical cores, the vSphere 4.0 VM achieves 99% of native performance.   The VMware ESX 3.5 VMs ran at close to or greater than 90% of native, which is still excellent for a virtualized environment.  However, for 64-bit VMs, vSphere 4.0 gives a performance improvement over ESX 3.5U4 of 9% in the 4 vCPU case, and about 3% in the 2 vCPU case.

Figure 1. SPECjvm2008 on 8-Core Intel Harpertown Server

Figure 2. SPECjvm2008 performance relative to native

In order to sanity-check the native results, we compared the 8-Core Harpertown result using the 64-bit OS and JVM to the closest published result.  There is no directly comparable result, but there is a result generated by Sun on a 16-Core Intel Tigerton Server.  The Tigerton is architecturally similar to the Harpertown, but the Harpertown has a larger L2 cache.  The Sun 16-core Tigerton result, using Solaris 10, a special performance build of the Sun JVM (1.6.0_06p), and 64GB of memory, achieved 260 SPECjvm2008 ops/m.   Our native result on the 8-core Harpertown  with 16GB of memory was  145 SPECjvm2008 ops/m.   A native run on the Harpertown with 32GB and using the Sun 1.6.0_06p JVM achieved 174 SPECjvm2008 ops/m.  This is well more than half of the Tigerton result, and indicates that our native configuration is producing reasonable results.

Figure 3 shows the scaling of the results as we move from 2 to 4 and 8 CPUs for the 64-bit case.  The scaling is essentially the same for 32-bit.  The results are normalized to the 2 CPU results on the same platform.  These results show that VMware vSphere 4.0 scales as well as or better than native for this workload.  VMware ESX 3.5 scaling is just slightly below native.

Figure 3. SPECjvm2008 Scaling from 2 CPUs

The SPECjvm2008 results presented here show that core Java functionality runs extremely well on VMware vSphere 4.0 and VMware ESX 3.5.  No special tuning was required to get results that are remarkably close to native performance.  We hope to soon produce additional results to demonstrate that this excellent performance extends to multi-tier Java Enterprise Edition applications as well.  For comments or questions, please join us in the VMware Performance Community at this thread: http://communities.vmware.com/message/1262696


May 21, 2009

VMware vCenter Update Manager Sizing Estimator Posted

VMware vCenter Update Manager is a component of VMware Infrastructure that automates patches and upgrades of ESX hosts, virtual machine Tools and hardware, Windows and Linux virtual machines, and virtual appliances. A new sizing tool, VMware vCenter Update Manager Sizing Estimator, is now available.

 

The following input parameters are used to estimate database size, patch store disk space, and temporary disk space:

  • Feasibility for virtual machine remediation
  • Number of ESX and ESXi flavors in the deployment
  • Number of hosts, virtual machines, Windows distributions, average number of locales for Windows distribution, and average number of different Service Pack levels for Windows distribution
  • Patch scan frequency for virtual machines
  • VMware Tools upgrade scan frequency for virtual machines
  • Virtual machine hardware upgrade scan frequency
  • Patch scan frequency for hosts
  • Upgrade scan frequency for hosts

 

The following are the outputs from the tool:

  • VMware vCenter Update Manager 4.0 database deployment model recommendations
  • VMware vCenter Update Manager 4.0 server deployment model recommendations
  • Initial disk space utilization in MB for database, patch store, and temporary space
  • Monthly disk space utilization growth in MB for database and patch store
  • The upper and lower bounds on the estimation, assuming a 20% variance
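
As a rough illustration of the last output, the bounds can be thought of as the point estimate plus or minus 20%. The Python sketch below assumes the variance is applied symmetrically as a percentage of the estimate, which is an assumption and not a documented detail of the tool. The `estimate_bounds` helper and the 1000 MB input are hypothetical.

```python
def estimate_bounds(estimate_mb, variance=0.20):
    """Return (lower, upper) bounds around a disk space estimate in MB.

    Assumes the variance is applied symmetrically to the point estimate.
    """
    lower = estimate_mb * (1 - variance)
    upper = estimate_mb * (1 + variance)
    return lower, upper


# Hypothetical example: a 1000 MB estimate for the patch store.
low, high = estimate_bounds(1000)
print(f"between {low:.0f} MB and {high:.0f} MB")
```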

 

 

 

 


VMware vCenter Update Manager Performance and Best Practices White Paper Posted

VMware vCenter Update Manager is a component of VMware Infrastructure that automates patches and upgrades of ESX hosts, virtual machine Tools and hardware, Windows and Linux virtual machines, and virtual appliances. A new white paper, VMware vCenter Update Manager Performance and Best Practices, is now available.

In this paper we discuss VMware vCenter Update Manager 4.0 host deployment, latency, resource consumption, guest OS tuning, high-latency networks, and the impact of on-access virus scanning. We also provide performance tips to help customers tune the system for better performance.


Exchange 2007 performance on vSphere 4

VMware recently released a whitepaper showing the performance scalability of Exchange 2007 on VMware vSphere. This paper shows that vSphere 4.0 achieves excellent performance and scalability with regard to both scale up (adding more vCPUs) and scale out (adding more VMs). The results indicate that vSphere can easily support 4,000 heavy Exchange users with a single 8 vCPU VM or 8,000 heavy Exchange users with multiples of either 2 or 4 vCPU VMs. While supporting these high user counts, the latencies of most of our virtualized Exchange configurations are half the recommended threshold (500 ms) with little overhead compared to physical.

 

Even the largest configuration, which supports 8,000 heavy users with 16 vCPUs on an 8-way server, provides an outstanding user experience. For our 8,000 heavy user mailbox configuration, the 95th percentile Send Mail latency is 273 ms with eight 2 vCPU VMs and 304 ms with four 4 vCPU VMs.

 

95th Percentile Send Mail Latency (2 vCPU VM vs. 4 vCPU VM)

 

  


 

 

In addition to these low latencies, this paper also shows that the 8,000 mailbox configuration consumes less than 60% of host CPU resources, which leaves room for further user growth and further consolidation. In addition, the paper shows that ESX provides consistent performance across all consolidated virtual machines. For example, the response times of the Exchange transactions in the eight 2 vCPU configuration were within 2% of each other. For more information on this research, read the full paper: Microsoft Exchange Server 2007 Performance on VMware vSphere.


May 18, 2009

350,000 I/O operations per Second, One vSphere Host

Summary

VMware vSphere includes a number of enhancements that enable it to deliver very high I/O performance. In this study, we demonstrate that vSphere can easily support even an extreme demand for I/O throughput made possible by new products like Enterprise Flash Drives (EFD) offered by EMC. In the experiments conducted at EMC labs, we were able to achieve just above 350,000 I/O operations per second with:

  • A single vSphere host with just three virtual machines running on it
  • Latencies under 2ms
  • I/O block size of 8KB

What does such a high throughput mean to customers? Consider this: the entire database of Wikipedia is supported by 20 MySQL servers, each 200GB to 300GB in size. On average, Wikipedia receives 50,000 HTTP requests or 80,000 SQL queries per second¹, which translates to 4.3 billion hits per day. With the storage infrastructure used in our experiments, we could easily accommodate the entire database of Wikipedia and still have space to spare. A single vSphere host driving more than 350,000 I/O requests per second could easily support the throughput requirements of Wikipedia.
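
The figures in this comparison are easy to check with a little arithmetic. The Python sketch below is only a worked calculation using the numbers quoted above. The variable names are hypothetical, and the bandwidth figure assumes that 8KB means 8 × 1024 bytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Wikipedia request rate quoted above, converted to hits per day.
http_requests_per_second = 50_000
hits_per_day = http_requests_per_second * SECONDS_PER_DAY

# Total database size: 20 MySQL servers of 200GB to 300GB each.
db_size_gb_range = (20 * 200, 20 * 300)

# Data rate implied by 350,000 I/O operations per second at 8KB each.
# Assumption: 8KB = 8 * 1024 bytes.
iops = 350_000
block_bytes = 8 * 1024
bandwidth_bytes_per_second = iops * block_bytes

print(f"{hits_per_day / 1e9:.2f} billion hits per day")
print(f"{db_size_gb_range[0]} to {db_size_gb_range[1]} GB of database")
print(f"{bandwidth_bytes_per_second / 1e9:.2f} GB/s of 8KB I/O")
```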

Background

In late May 2008, we published a blog article on achieving 100K I/O operations per second with ESX 3.5. To achieve that, we used 495 15K RPM Fibre Channel disks spread across three CX3-80 arrays. To push the envelope further with vSphere, we needed more storage bandwidth. It would have taken approximately 1750 15K RPM Fibre Channel drives in 120 Disk Array Enclosures to provide 350,000 I/O operations per second. Adding redundancy would increase the numbers further, to as many as 3500 drives for a RAID 1/0 configuration, doubling the entire SAN infrastructure.

Instead, only 30 EFDs housed in three CX4-960 arrays provided enough storage bandwidth for vSphere to drive just above 350,000 I/O requests per second.
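
The drive counts above imply a per-device I/O rate that can be worked out directly. The Python sketch below is a back-of-the-envelope calculation from the numbers in this section. The variable names are hypothetical, and it assumes the load is spread evenly across drives.

```python
TARGET_IOPS = 350_000

# Fibre Channel estimate: about 1750 15K RPM drives for the same throughput.
fc_drives = 1750
iops_per_fc_drive = TARGET_IOPS / fc_drives

# RAID 1/0 mirrors every drive, doubling the drive count.
raid10_drives = fc_drives * 2

# Earlier ESX 3.5 study: 100K I/O operations per second from 495 FC disks.
earlier_iops_per_drive = 100_000 / 495

# EFD configuration used in this study.
efd_count = 30
iops_per_efd = TARGET_IOPS / efd_count

print(f"FC drive: about {iops_per_fc_drive:.0f} IOPS each")
print(f"Earlier study: about {earlier_iops_per_drive:.0f} IOPS per FC disk")
print(f"RAID 1/0 drive count: {raid10_drives}")
print(f"EFD: about {iops_per_efd:.0f} IOPS each")
```

The roughly 200 I/O operations per second per Fibre Channel drive implied here is consistent with the earlier 495-disk result, while each EFD handles on the order of 11,700.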

I/O workload


We could have achieved higher I/O operations per second with a smaller block size, but we focused our studies on 8KB blocks because they are the most representative of real applications. We chose an I/O pattern that was 100% random in nature.

Key Findings


  • 3 VMs on one vSphere host supported 350,000 I/O operations per second with 8KB block size (Figure 1)
  • A single VM with 2 vCPUs and 4GB memory provided just under 120,000 I/O operations per second with 8KB block size
  • I/O latency as measured in ESX was just under 2 ms
  • VMware’s new paravirtualized SCSI adapter (pvSCSI) offered 12% improvement in throughput at 18% less CPU cost compared to the LSI virtual adapter

Figure 1. Scaling I/O performance through vSphere

We are documenting all the experiments in detail in a white paper that will be posted on the VMware website. We encourage readers to refer to that white paper for more details.

This testing was the result of a joint effort between VMware and EMC. We would like to thank the Midrange Partner Solutions Engineering team at EMC, Santa Clara for providing access to the hardware, for the use of their lab, and for their joint collaboration throughout this project.

For more comments or questions, please join us in the VMware Performance Community website.

About the Authors:
Chethan Kumar is a member of the Performance Engineering team at VMware. Radhakrishnan Manga is a member of the Midrange Partner Solutions Engineering team at EMC.