VMware

February 25, 2008

16,000 Exchange Mailboxes, 1 Server

We recently finished a large Exchange 2007 capacity test on VMware ESX Server 3.5. How large? Well, larger than anything ever done before on a single server. And we did it from start to finish in about two weeks.

We did this test because we have felt for a while that advances in processor and server technology were about to leave another widely-used and important application unable to fully utilize the hardware that vendors were offering. Microsoft has guidelines on what environment works well with Exchange, and a system with more than eight CPUs and/or 32GB of RAM is beyond the recommended maximums.

Hardware vendors are now offering commodity servers with 16 cores (quad socket with four cores each) and enough memory slots to hold 256GB of RAM. Within a year or two we would expect this to go up even further, with commodity x86 systems being built with 32 cores. Microsoft Exchange deployments typically work well with the 'scale out' model, but that causes server proliferation and underutilized hardware, especially as systems get this large.  VMware ESX Server allows us to make more effective use of the hardware and improve capacity.

Using VMware ESX Server 3i version 3.5 we created eight virtual machines, each with two vCPUs and 14GB of memory, and configured 2,000 mailboxes on each one.  We chose 2,000 users based on Microsoft’s recommendation of 1,000 mailboxes per core and we selected 14GB of memory in accordance with the recommendation to use 4GB + 5MB/mailbox. We used the hardware recommendations for Exchange Server in Multi-Role configuration because each virtual machine was running the Hub, CAS, and UM components in addition to hosting the mailboxes.
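The sizing arithmetic above can be sketched in a few lines. This is only an illustration of the two rules of thumb quoted in the post (1,000 mailboxes per core, 4GB + 5MB per mailbox); the function and constant names are made up for this example, and it assumes 1GB is rounded to 1,000MB, as the 14GB figure implies.

```python
# Rules of thumb quoted in the post for a multi-role Exchange 2007 server.
MAILBOXES_PER_CORE = 1000
BASE_MEMORY_GB = 4
MEMORY_PER_MAILBOX_MB = 5


def size_exchange_vm(vcpus: int) -> tuple[int, float]:
    """Return (mailboxes, memory in GB) for one VM with the given vCPU count."""
    mailboxes = vcpus * MAILBOXES_PER_CORE
    # Assumption: 1 GB is treated as 1,000 MB, matching the post's 14GB result.
    memory_gb = BASE_MEMORY_GB + mailboxes * MEMORY_PER_MAILBOX_MB / 1000
    return mailboxes, memory_gb


if __name__ == "__main__":
    vms, vcpus_per_vm = 8, 2
    mailboxes, memory_gb = size_exchange_vm(vcpus_per_vm)
    print(f"per VM: {mailboxes} mailboxes, {memory_gb:.0f} GB")
    print(f"total:  {vms * mailboxes} mailboxes, {vms * memory_gb:.0f} GB configured")
```

With eight two-vCPU virtual machines this gives 16,000 mailboxes and 112GB of configured guest memory, which fits within the 128GB host used for the test.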

We ran this test on an IBM x3850 M2 server with 128GB of RAM. The virtual machines ran Microsoft Windows Server 2003 R2 Datacenter x64 Edition with Service Pack 2 and Microsoft Exchange 2007 Server Version 8 with Service Pack 1.

The storage used for these tests was an EMC CX3-40 with 225 disks (15 drawers of 15 disks each). Each virtual machine was configured to use two LUNs of 10 disks each for the Exchange database and a three-disk LUN for logs.

We used the Microsoft Load Generator (LoadGen) tool to drive the load on the mailboxes, and ran with the heavy user profile.  Here are the LoadGen settings:

  • Simulated day - 8 hours
  • Test run - 8 hours
  • Stress mode - disabled
  • No distribution lists or dynamic distribution lists for internal messages
  • No contacts for outgoing messages
  • No external outbound SMTP mail
  • Profile used: Outlook 2007 Online, Heavy, with Pre-Test Logon

We ran the tests using both ESX Server 3.5 and ESX Server 3i version 3.5 and the performance was the same across both versions. Tests were run with one through eight virtual machines, and even in the eight virtual machine case about half the CPU resources were still available.

Disk latencies were around 6ms across our runs. The IOPS rate started off at about 0.65 IOPS/mailbox in the first hour but stabilized at 0.37 IOPS/mailbox in the last hour (once the cache was warmed up). Over the duration of the run the average rate was 0.45 IOPS/mailbox. The read/write ratio observed was approximately 4:1.
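To put the per-mailbox rates in context, here is a rough sketch that scales them to the full 16,000-mailbox configuration. It assumes the rates apply uniformly across all eight virtual machines and that the 4:1 read/write ratio holds throughout; the helper names are hypothetical.

```python
TOTAL_MAILBOXES = 16_000  # eight VMs x 2,000 mailboxes each

# Per-mailbox rates reported above.
RATES = {"first hour": 0.65, "last hour": 0.37, "run average": 0.45}
READ_WRITE_RATIO = 4  # approximately four reads for every write


def aggregate_iops(rate_per_mailbox: float, mailboxes: int = TOTAL_MAILBOXES) -> float:
    """Scale a per-mailbox IOPS rate to the whole server."""
    return rate_per_mailbox * mailboxes


def split_reads_writes(total_iops: float, ratio: float = READ_WRITE_RATIO) -> tuple[float, float]:
    """Split a total IOPS figure into (reads, writes) for a reads:writes ratio of ratio:1."""
    writes = total_iops / (ratio + 1)
    return total_iops - writes, writes


if __name__ == "__main__":
    for label, rate in RATES.items():
        total = aggregate_iops(rate)
        reads, writes = split_reads_writes(total)
        print(f"{label}: {total:.0f} IOPS ({reads:.0f} reads, {writes:.0f} writes)")
```

Under these assumptions the run average works out to roughly 7,200 IOPS for the whole server.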

Sendmail latency is an important measure of the responsiveness of the Exchange Server. Figure 1 shows how it changed as more virtual machines were added to the system.

Exchange2007latency

Figure 1. Sendmail Latency

A 1000ms response time is considered the threshold at which user experience starts to degrade. As can be seen from the 95th percentile response times in Figure 1, there’s still a significant amount of headroom on this server, even at our highest tested load level.

These tests ran smoothly and demonstrated what we expected. This should come as no surprise. As new hardware becomes available, the scalability of ESX Server allows us to easily make productive use of the additional capacity.

It took many hours and creative hardware "repurposing" from our lab personnel to put this setup together within a couple of days, and it’ll probably take them even longer to get everything back to its original place.  I’d like to acknowledge that without their efforts, we wouldn’t have been able to get this done.

Summary

The large number of companies already running Microsoft Exchange Server on VMware ESX Server are experiencing improved resource utilization and better manageability as well as lower space, power, and cooling costs. New servers with greater processing power make the transition to Exchange on ESX Server even more compelling.

November 14, 2007

Ten Reasons Why Oracle Databases Run Best on VMware

We’re really excited about the buzz around Oracle in virtualized environments. One of the best kept secrets is just how well Oracle performs on VMware ESX. This didn’t happen by accident – there are a number of features and performance optimizations in the VMware ESX server architecture, specifically for databases.

In this blog, I'll walk through the top ten most important features for getting the best database performance. Here are a few of the performance highlights:

  • Near Native Performance: Oracle databases run at performance similar to that of a physical system
  • Extreme Database I/O Scalability: VMware ESX Server’s thin hypervisor layer can drive over 63,000 database I/Os per second (fifty times the requirement of a typical database)
  • Multi-core Scaling: Scale up using SMP virtual machines and multiple database instances
  • Large Memory: Scalable memory - 64GB per database, 256GB per host

We’ve continued to invest a great deal of work towards optimizing Oracle performance on VMware, because it’s already one of the most commonly virtualized applications. The imminent ESX 3.5 release is our best database platform to date, with several new advanced optimizations.

In this blog article we’d like to explain the unique and demanding nature of database applications such as Oracle, and show the performance capabilities of ESX Server on this type of workload.

The Nature of Databases

Databases have some unique properties, such as a large memory footprint. At the outset this can make them slightly more complex to virtualize well. However, this has proven to be an opportunity, since we can optimize specifically for these defining properties.

  • Large Memory: Databases use large amounts of memory to cache their storage. A large cache is one of the most important performance criteria for databases, since it can often reduce physical I/O by 10-100 fold.
  • High Performance Block I/O: Databases read and write their data in fixed, block sized chunks. The I/Os are typically small, and operate at a very high rate on a small number of files or devices.
  • Throughput Oriented: Databases often have a large number of concurrent users, which gives them natural parallelism and makes them ideally suited to take advantage of systems with multiple logical or physical processors.

Understanding and Quantifying Virtual Performance

The performance of a virtualized system should first be quantified in terms of latency and throughput, and then in terms of how efficiently resources are being used. For example, if a physical system is delivering 10000 transactions per minute at 500ms latency per transaction, then a virtualized system that is performing at 100% of native should provide the same level of throughput with acceptable latency characteristics. Secondary should be a metric of resource usage, which is a measure of how many additional physical resources were used to achieve the same level of performance. It’s sometimes overly easy to focus primarily on the CPU resource, when in reality memory and I/O are much more expensive resources to provision. This is becoming especially important going forward, as multicore CPUs continue to lower the cost per processor core, while memory cost remains at a premium.

More important for Oracle is the ability to scale up by taking advantage of multi-core CPUs, large memories, and the I/O throughput through the hypervisor to support the large number of disk spindles in the backend storage arrays.

Database Performance Myths

There are a few common myths about virtualizing databases:

  • Databases have a high overhead when virtualized: Virtualized Databases can perform at or near the speed of physical systems, in terms of latency and throughput. The virtualization overhead for typical real-world databases is minimal – for VMware ESX Server, we measured CPU overhead to be less than 10%.
  • Databases have too much I/O to be virtualized: Databases typically have a large number of small random  I/Os, and it is in theory possible to hit a scaling ceiling in the hypervisor layer. VMware ESX’s thin hypervisor layer can drive over 63,000 database I/Os per second, which is equivalent to more than 600 disk spindles of I/O  throughput. This is sufficient I/O scaling for even the largest databases on x86 systems.
  • Virtualization should only be used for smaller, non-critical applications: The ESX hypervisor is very robust: many customers are seeing over two years of uptime from ESX based systems. In addition, the ESX hypervisor remains stable, even if resources are overcommitted.

There isn’t one quick hit to make databases work well for a wide range of real-world applications – good performance is something that is earned from the long term discipline of focusing on the lessons learned from many customer-oriented real-world database workloads, and applying those lessons across the architecture of the hypervisor.

Let’s take a quick walk through the specific features that you should look for in the hypervisor for good database performance.


1: High Performance I/O in VMware ESX

Throughput and latency of the I/O system are critical to performance of online transaction processing systems. Since transaction database systems operate on small data items at random places in the dataset, it’s important that we measure random I/O throughput (measured in I/O operations per second), rather than bandwidth (MB/s).

Esx_io_2

Figure.1 VMware ESX I/O Driver Model


Since the hypervisor logically resides between the database in the guest virtual machine and the backend storage, it is critical that the hypervisor’s I/O facilities scale up without any performance ceilings, and don’t add any appreciable latency. The I/O subsystem in VMware ESX shown in Figure.1 uses a direct driver model, so that there is minimal latency added by the virtualization stack. This is possible because I/O requests can be handled  in-line by the same processor as the requesting virtual machine (other architectures add substantial latency and CPU overhead when I/O is proxied via a heavy-weight domain-0 or parent-partition).

Iops_2

Figure.2 – Random I/O Throughput

(Average 4-CPU DB vs. VMware ESX 3.5)

Oracle databases typically issue many small 4Kbyte or 8Kbyte I/Os in a random access pattern. For these I/Os, a single typical disk can deliver somewhere on the order of 100-200 I/Os per second, depending on the rotational speed of the disk, though in practice it's best not to push the drives beyond 100 IOPS each. The throughput of the VMware ESX 3.5 hypervisor has been increased significantly: more than 60,000 I/Os per second can be sustained, which is the throughput of over 600 disks.

Results from a VMware study of its customers showed that across 15,000 Oracle servers the average number of I/Os per second for a loaded 4-processor system is 1280, which is approximately the throughput of 15 disks. Since some workloads are more demanding than others, and some are bursty in nature, it’s important to have substantially more headroom. The throughput capability of the ESX I/O subsystem is more than sufficient for even the most demanding database.
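The spindle and headroom figures can be checked with a quick back-of-the-envelope calculation. This sketch uses the 63,000 IOPS figure quoted earlier in the post, the 1,280 IOPS customer average, and the conservative 100 IOPS-per-disk planning number; the function names are illustrative only.

```python
import math

HYPERVISOR_IOPS = 63_000      # sustained rate quoted for ESX 3.5
AVERAGE_DB_IOPS = 1_280       # average loaded 4-processor Oracle server
PLANNING_IOPS_PER_DISK = 100  # conservative per-disk planning figure from the text


def disks_needed(iops: float, iops_per_disk: float = PLANNING_IOPS_PER_DISK) -> int:
    """Number of disks required to deliver the given random I/O rate."""
    return math.ceil(iops / iops_per_disk)


def headroom(capacity_iops: float, demand_iops: float) -> float:
    """How many times the demand fits into the available capacity."""
    return capacity_iops / demand_iops


if __name__ == "__main__":
    print(disks_needed(HYPERVISOR_IOPS), "disks to match the hypervisor")
    print(disks_needed(AVERAGE_DB_IOPS), "disks for the average database")
    print(f"{headroom(HYPERVISOR_IOPS, AVERAGE_DB_IOPS):.1f}x headroom")
```

At 100 IOPS per disk the average database needs 13 disks rather than the "approximately 15" quoted above; the quoted figure corresponds to planning for a little under 100 IOPS per disk. The headroom comes out at about 49x, in line with the "fifty times" claim.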

2: Scale Up using Virtual SMP

VMware ESX can take advantage of systems with multiple physical processors in two ways: by scaling out through multiple virtual machines, and by scaling up each virtual machine to use more than one physical processor. VMware ESX provides a Virtual SMP capability, allowing up to four processors in each guest virtual machine, and up to 64 processors in the physical system.

Products_vsmp_diagram_6

Figure.3 - Virtual SMP


Since database workloads typically have a large number of concurrent users, they are explicitly parallel and can easily process more than one task at a time.

Oracle is able to take advantage of VMware’s Virtual SMP, so performance can be scaled beyond a single processor for each virtual machine. To demonstrate this, we ran several benchmarks with Oracle database 10g Release 2, using the popular SwingBench on-line transaction processing workload. Figure.5 shows the throughput of Oracle with an increasing number of processors in a single virtual machine. The benchmark measures transaction throughput, and shows 94% scaling as additional processors are added. Incidentally, this is exactly the same as the scaling we see on native, which is likely due to hardware and database scaling artifacts.

Swingbench_scaling_linear_norm_4

Figure.5 – Scaling of VMware Virtual SMP with a single instance of Oracle 10g R2

One of the key requirements for consolidation is good scalability with a large number of database instances. To show this, we ran multiple SMP instances of Oracle 10g on VMware ESX Server 3.5. Figure.4 shows the scaling of the VMware ESX platform when running the open source DVDstore database benchmark. The benchmark is run using client-server mode, so that we can focus on the database tier. In this study, we scaled the benchmark using one through seven dual processor SMP Linux virtual machines, each with its own database instance. We'll be posting further details of this benchmark on Vroom soon.

DVDstore Benchmark Scaling

Figure.4 – Scaling of multiple database instances on VMware ESX on a 16-core Sun x4600 M2

3: Scale up with Large Memory

Oracle databases love memory. The primary use is for caching pre-compiled SQL queries and caching blocks from disk in memory.

Database designers go to significant effort to avoid doing disk I/O when possible. This is because the latency of a disk I/O is substantially higher than the time a transaction will spend on the CPU. For example, a disk I/O takes on the order of 10 milliseconds, while the typical transaction takes just a few milliseconds of CPU time. If an I/O to disk can be cached in memory, it then could be serviced in a fraction of a millisecond. In addition, since disks are expensive, the cost of the storage systems for databases is often more affected by the I/Os per second that it can deliver, than by pure storage space. Lowering overall disk throughput can mean significantly lowering the cost of the system.

Larger memory sizes help Oracle by caching more disk blocks in memory. Consider this simple example: if a database system is using memory to cache its disk I/O, and is yielding a 90% cache hit rate, this means that one in every ten accesses causes a physical I/O. For 10,000 accesses a second, we would see 1,000 I/Os per second. If we increase memory to improve the cache hit rate to 99%, then we reduce the I/Os to one in one-hundred, reducing the physical I/O by 10x to only 100 I/Os per second.
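The cache example above reduces to a one-line formula: physical I/O rate equals the access rate multiplied by the miss rate. A minimal sketch, with a hypothetical function name:

```python
def physical_iops(accesses_per_second: float, hit_rate: float) -> float:
    """Physical I/Os per second left over after the cache absorbs the hits."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return accesses_per_second * (1.0 - hit_rate)


if __name__ == "__main__":
    for hit_rate in (0.90, 0.99):
        iops = physical_iops(10_000, hit_rate)
        print(f"hit rate {hit_rate:.0%}: about {iops:.0f} physical I/Os per second")
```

Going from a 90% to a 99% hit rate looks like a small change, but it shrinks the miss rate from 10% to 1%, which is where the 10x reduction in physical I/O comes from.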

Often, over 80% of the memory used by the guest operating system is used by the Oracle disk block cache. A general rule of thumb is that the database cache be sized at 5-10% of the database size, and that doubling memory improves throughput by about 20%. This is obviously very workload dependent, but you can see that larger memory sizes help improve resource efficiency in other areas of the system, and that generally, more is better. For these reasons, large memory is very important for databases. VMware ESX 3.0 allowed 16GB of RAM per guest, and 3.5 increases the capability to 64GB per guest.
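As a rough illustration of these rules of thumb, the sketch below computes the suggested cache size range for a given database size and extrapolates the "20% per doubling" figure to other memory ratios. The extrapolation beyond a single doubling is an assumption made for this example, not something measured in the post, and the results are as workload dependent as the rule itself.

```python
import math


def cache_size_range_gb(database_gb: float) -> tuple[float, float]:
    """Suggested database cache size: 5-10% of the database size."""
    return 0.05 * database_gb, 0.10 * database_gb


def estimated_throughput_gain(old_memory_gb: float, new_memory_gb: float,
                              gain_per_doubling: float = 0.20) -> float:
    """Throughput multiplier, assuming each doubling of memory adds about 20%.

    Applying the rule repeatedly across several doublings is an assumption.
    """
    if old_memory_gb <= 0 or new_memory_gb <= 0:
        raise ValueError("memory sizes must be positive")
    doublings = math.log2(new_memory_gb / old_memory_gb)
    return (1 + gain_per_doubling) ** doublings


if __name__ == "__main__":
    low, high = cache_size_range_gb(200)
    print(f"200GB database: cache of about {low:.0f}-{high:.0f} GB")
    print(f"16GB -> 64GB guest: about {estimated_throughput_gain(16, 64):.2f}x throughput")
```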

Because consolidating workloads raises processor utilization, we can squeeze more workloads onto a single system. This means that the memory requirement per physical processor is, on average, twice that of a traditional unvirtualized system. To accommodate these growing requirements, we’ve pushed the memory scalability curve considerably in ESX 3.5, which now supports up to 256 Gigabytes of RAM on the new high-end systems from Sun, IBM and Unisys.

4: Large Pages in ESX 3.5 Hypervisor

Oracle databases have used large pages in the CPU’s MMU to optimize memory performance for some time. This facility is used with the operating system’s large page feature, typically for the large shared memory segments that hold the database’s disk block cache. Large pages are supported on Linux, Windows and Solaris guests. Oracle typically yields a 5-20% performance advantage with large pages, depending on the type of processor and the size of the configured memory.

Other x86 hypervisors don’t provide virtual large page capability, so this optimization is lost when the database is virtualized. The ESX 3.5 hypervisor provides advanced large-page support which allows the database to properly exploit the CPU’s large page capability.

5: ESX Optimization for NUMA systems

Many of the interesting new hardware systems today are implemented using non-uniform memory architectures. This means that not all memory is of uniform speed – accessing memory that is closer topologically to the processor is faster than memory that is further away.

To ensure optimal performance, the VMware ESX hypervisor allocates memory for the guest operating systems from physical memory near the CPU on which the guest resides.

6: High-performance Paravirtualized Networking

Paravirtualization is a term used to describe when the guest operating system has some knowledge of the hypervisor, and can leverage this knowledge to optimize its execution in concert with the hypervisor. The VMware ESX hypervisor uses paravirtualized networking drivers in the guest operating system to provide high performance networking. These drivers are installed automatically through the VMware Tools package at the time the guest is first powered on. Unlike CPU paravirtualization, paravirtualized drivers do not require any changes to the guest operating system – they are simply installed as transparent new drivers.

Multinic

Figure.6 – Multi-NIC Scalability

VMware ESX can drive gigabit Ethernet at line rate, as demonstrated in the paper Networking Performance in Multiple Virtual Machines. The networking performance of ESX 3.5 has been further improved by incorporating new stateless offload features, such as large-segment-offload (LSO) and jumbo frames, and it now achieves near line rate (9.9 Gbit/s) on 10Gbit Ethernet.

The performance of networking can be increased beyond that of one NIC by scaling across multiple NICs. Figure.6 shows the scaling of gigabit Ethernet performance as multiple NICs are added.


7: Use VMware ESX’s Page-Sharing to use less memory

The ESX hypervisor can safely share physical memory that has the same contents through a facility known as transparent page sharing. Through page sharing, the hypervisor arranges for a single physical page of memory to back multiple pages in the guest, so that just one copy of the data need reside in memory. Using this technique, the total amount of memory consumed is less than the sum of the parts. The hypervisor ensures full security isolation -- if one guest modifies a page, then it gets its own private copy.
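The idea can be illustrated with a small toy model. This is not how ESX implements page sharing; it is a simplified sketch, with invented class and method names, that only shows the two behaviors described above: identical pages are backed by a single copy, and a guest that writes to a shared page ends up with its own private copy. A real hypervisor would do considerably more, for example verifying full page contents rather than trusting a hash alone.

```python
import hashlib


class SharedPageStore:
    """Toy model: guest pages with identical contents share one backing page."""

    def __init__(self) -> None:
        self._backing: dict[str, bytes] = {}            # content hash -> page contents
        self._refcount: dict[str, int] = {}             # content hash -> guest pages mapped to it
        self._mapping: dict[tuple[str, int], str] = {}  # (guest, page number) -> content hash

    def write(self, guest: str, page_no: int, data: bytes) -> None:
        """Set a guest page to `data`. Other guests sharing the old page are unaffected."""
        key = (guest, page_no)
        self._release(key)
        digest = hashlib.sha256(data).hexdigest()
        self._backing.setdefault(digest, data)  # reuse an existing identical page if present
        self._refcount[digest] = self._refcount.get(digest, 0) + 1
        self._mapping[key] = digest

    def read(self, guest: str, page_no: int) -> bytes:
        return self._backing[self._mapping[(guest, page_no)]]

    def _release(self, key: tuple[str, int]) -> None:
        """Drop this guest page's reference; free the backing page when unused."""
        old = self._mapping.pop(key, None)
        if old is None:
            return
        self._refcount[old] -= 1
        if self._refcount[old] == 0:
            del self._refcount[old]
            del self._backing[old]

    def guest_pages(self) -> int:
        return len(self._mapping)

    def backing_pages(self) -> int:
        return len(self._backing)


if __name__ == "__main__":
    store = SharedPageStore()
    store.write("vm1", 0, b"oracle code page")
    store.write("vm2", 0, b"oracle code page")
    print(store.guest_pages(), "guest pages backed by", store.backing_pages(), "physical page(s)")
    store.write("vm2", 0, b"modified by vm2")
    print(store.guest_pages(), "guest pages backed by", store.backing_pages(), "physical page(s)")
```

In this model the gap between `guest_pages()` and `backing_pages()` is the memory saved by sharing.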

This facility can be used effectively to save memory with Oracle in several ways. When there are multiple instances of a database running, the page sharing facility will automatically share the code portions of the operating system and the Oracle instance. This often results in saving in the order of a few hundred megabytes of memory per virtual machine.

When multiple databases are sharing similar data – for example, a shared reference table or multiple copies of the database for development purposes, ESX can automatically detect the duplicate disk blocks in the Oracle disk block cache, and arrange to share those. Thus, the database cache memory image can be transparently shared across database instances, and across virtual machines. This can result in a further saving in the order of tens of megabytes (at least the system tables will be the same), through several gigabytes, depending on the amount of common disk data between the instances.

As an additional benefit, some memory can be shared within each instance. This is often for zero pages.

8: Paravirtualized CPU

Pv1

Figure.7 – Virtualization Techniques

There are various techniques used to virtualize the x86 instruction set, including binary translation, paravirtualization and hardware assist. Binary translation has long been used by the VMware hypervisor to provide near native performance for virtualization for many workloads. CPU Paravirtualization or hardware assist are two approaches that can be used to provide small optimizations for workloads with many system calls as well as providing certain memory optimizations. No single approach is best for all workloads, and in VMware ESX, different approaches are used for different workloads. Ole Agesen and Keith Adams help explain the different technologies in their paper about the performance of virtualization.

Paravirt_2

VMware ESX can optionally use paravirtualization for some guest operating system types. In a recent study, the performance of Oracle 10g R2 using the Swingbench online transaction processing workload on a paravirtualized Linux guest showed a moderate gain of 10% when using paravirtualized CPU interfaces.

9: The Best Oracle-Windows Performance

Since all of the key CPU, memory and I/O virtualization capabilities are in a portable layer of the hypervisor, the performance of Oracle on Windows guests is equivalent to that of Linux guests. Oracle on Windows is able to take advantage of large-pages, SMP, and I/O scalability as well as our high performance paravirtualized networking drivers.

For Linux, Solaris and Unix administrators, this means that you have the freedom to choose the OS which has the best tools to facilitate your deployment. For Windows administrators, it means that you can confidently run your Oracle databases on Windows, with the same levels of performance and scalability.

10: Universal 32-bit and 64-bit Guest Support

To take advantage of more than 3.5Gbytes of memory in a guest, databases need to be configured as 64-bit applications, and use a 64-bit capable operating system. VMware ESX allows mixed 32-bit and 64-bit guests concurrently, thus simplifying the deployment of 64-bit guests when needed.

Summary

To all the database administrators out there: watch for more VROOM posts about VMware performance with Oracle, and for the new Oracle portal, which will contain plenty of good resources for virtualization of Oracle. There is also a new Oracle Discussion Forum; feel free to discuss Oracle performance there too. Virtual database performance has never been so good!


June 06, 2007

Networking Performance and Scaling in Multiple VMs

Last month we published a Tech Note summarizing networking throughput results using ESX Server 3.0.1 and XenEnterprise 3.2.0.  Multiple NICs were used in order to achieve the maximum throughput possible in a single uniprocessor VM.  While these results are very useful for evaluating the virtualization overhead of networking, a more common configuration is to spread the networking load across multiple VMs. We present results for multi-VM networking in a new paper just published.  Only a single 1 Gbps NIC is used per VM, but with up to four VMs running simultaneously. This simulates a consolidation scenario of several machines each with substantial, but not extreme, networking I/O.  Unlike the multi-NIC paper, there is no exact native analog, but we ran the same total load in a SMP native Windows machine for comparison.  The results are similar to the earlier ones: ESX stays close to native performance, achieving up to 3400 Mbps for the 4-VM case. XenEnterprise peaks at 3 VMs and falls off to 62-69% of the ESX throughput with 4 VMs.  According to the XenEnterprise documentation only three physical NICs are supported in the host, even though the UI let us configure and run four physical NICs without error or warning.  This is not surprising given the performance.  We then tried a couple of experiments (like making dom0 use more than 1 CPU) to fix the bottleneck, but only succeeded in further reducing the throughput.  The virtualization layer in ESX is always SMP, and together with a battle-tested scheduler and support for 32 e1000 NICs, scales to many heavily-loaded VMs. Let us know if you're able to reach the limits of ESX networking!
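For a sense of scale, the quoted figures can be turned into fractions of line rate. This sketch assumes the 62-69% range applies to the 3400 Mbps peak, which the post does not state explicitly (the range may refer to separate send and receive results), so treat the derived numbers as approximate. The names are hypothetical.

```python
NIC_LINE_RATE_MBPS = 1_000         # one 1 Gbps NIC per VM
ESX_4VM_THROUGHPUT_MBPS = 3_400    # peak ESX figure quoted for four VMs
XEN_FRACTION_OF_ESX = (0.62, 0.69) # XenEnterprise relative to ESX with four VMs


def line_rate_fraction(throughput_mbps: float, vms: int) -> float:
    """Fraction of the combined NIC line rate that was actually achieved."""
    return throughput_mbps / (vms * NIC_LINE_RATE_MBPS)


def xen_throughput_range(esx_mbps: float = ESX_4VM_THROUGHPUT_MBPS) -> tuple[float, float]:
    """Approximate XenEnterprise throughput, assuming the range applies to esx_mbps."""
    low, high = XEN_FRACTION_OF_ESX
    return esx_mbps * low, esx_mbps * high


if __name__ == "__main__":
    print(f"ESX, 4 VMs: {line_rate_fraction(ESX_4VM_THROUGHPUT_MBPS, 4):.0%} of line rate")
    low, high = xen_throughput_range()
    print(f"XenEnterprise, 4 VMs: roughly {low:.0f}-{high:.0f} Mbps")
```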

May 18, 2007

Windows Vista Performance in VMware Workstation 6.0

One of the great new features in VMware Workstation 6.0 is its Windows Vista support. Vista can be used as the host operating system (HOS) and the guest operating system (GOS) for VMware Virtual Machines (VM). The question to us is how well Vista performs in VMware Workstation 6.0. This actually contains two sub-questions. (1) What is the performance of Vista as the HOS? (2) What is the performance of Vista as the GOS? To answer these two questions, we did a comparison of Windows Vista and Windows XP performance.

To answer Question (1), we ran experiments using the same virtual machine on the two different HOSs (Vista and XP) and compared the results. We ran a set of workloads to measure the CPU, memory, disk and network performance of the VM. Vista host performance is on par with XP, except that Vista itself consumes more memory than XP. This means that Vista leaves less memory for the use of VMs than XP.

To answer Question (2), we compared a Vista VM against an XP VM both on an XP host. We ran the same set of workloads as for the Vista host experiments described above. While Vista guest performance is on par with XP in most of our workloads, we did find a few cases that perform worse on the Vista VM than on the XP VM. To understand why Vista was slower in those particular cases, we conducted the same measurements on native physical systems, rather than on virtual machines. We found that Vista is slower than XP on native hardware almost to the same degree as on virtual hardware. This made it clear that VMware Workstation 6.0 wasn't introducing any Vista-specific overheads, and that the relative performance on Vista is as good as on XP.

The chart below shows some representative results from our experiments. The bars represent the ratio of Vista to XP performance when comparing the Host OS, the Guest OS and native. The benchmarks shown are: gzip from the SPEC CPU2000 suite, PassMark PerformanceTest, Iometer disk workloads, Netperf networking send/receive, and boot/halt (time taken to boot and immediately halt the OS). The workloads had minimal variations from run to run, e.g. around 3% in performance.

Vistaperfinws6

In conclusion, Windows Vista works great with VMware Workstation 6.0! Go ahead and have fun with our cool virtualization technology!

VMware Workstation 6.0 supports both 32-bit and 64-bit Vista. Our conclusion here holds for both 32-bit and 64-bit.

Please refer to "Performance Tuning and Benchmarking Guidelines for VMware Workstation 6" for more information about Workstation 6.0 performance.

January 26, 2007

Shrinking the VMmark Tile

Our new benchmark, VMmark, had its first Beta release on December 21st. Now we are busy supporting the Beta users as well as trying to address some of the feedback we received during the earlier VMmark technology preview program with some of our hardware partners. We heard from almost everyone that the memory footprint of 7GB per tile should be reduced. (Details on VMmark and its tile definition can be found here: http://www.vmware.com/vmtn/resources/573). Looking at the trends in the mid-range space, the feedback makes sense. Many current two-socket, 4-core systems have only 8 DIMM slots. One would have to break the bank buying 4GB DIMMs to get 8GB/core and 4-core chips are arriving. Ultimately, I hope the hardware vendors add more memory slots to address this looming imbalance. But for now, if we are going to measure these types of systems, we’ll need to reduce the memory usage of VMmark.

Three of the workloads in a VMmark tile, the web, file, and standby servers, together consume only 1GB of memory. They are already pretty lean, so squeezing memory from them would have limited benefits. The remaining three workloads, the database, mail, and java servers, use 2GB each. Databases tend to like a large, well-tuned buffer cache. I’d rather leave that one alone since it is a fairly typical database size. That leaves the java and mail server VMs as candidates. If we cut both of those VMs down to 1GB each, the total memory footprint drops to 5GB. In this configuration, 3 tiles will fit into 16GB, which should max out a current 2-socket, dual-core system using the cheaper 2GB DIMMs while leaving plenty of headroom for quad-core with 4GB DIMMs.
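The tile arithmetic is easy to reproduce. The sketch below adds up the per-workload memory sizes given above and checks how many tiles fit into a host of a given size. It deliberately ignores hypervisor and per-VM overhead memory, which is an assumption of this example; the dictionary keys and function names are illustrative.

```python
# Memory per workload in the original 7GB tile, in GB, as described above.
ORIGINAL_TILE_GB = {"web+file+standby": 1, "database": 2, "mail": 2, "java": 2}

# Proposed tile: mail and java servers cut to 1GB each.
REDUCED_TILE_GB = {**ORIGINAL_TILE_GB, "mail": 1, "java": 1}


def tile_footprint_gb(tile: dict[str, int]) -> int:
    """Total guest memory configured for one tile."""
    return sum(tile.values())


def tiles_that_fit(host_memory_gb: int, tile_gb: int) -> int:
    """Whole tiles that fit in host memory, ignoring virtualization overhead."""
    return host_memory_gb // tile_gb


if __name__ == "__main__":
    for name, tile in (("original", ORIGINAL_TILE_GB), ("reduced", REDUCED_TILE_GB)):
        size = tile_footprint_gb(tile)
        print(f"{name}: {size}GB per tile, {tiles_that_fit(16, size)} tiles in 16GB")
```

With 16GB of host memory, the 7GB tile fits twice while the 5GB tile fits three times.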

I first made the necessary changes for the mail and java server VMs to use 1GB. I then ran them each in isolation to get an idea of the performance impacts. To my surprise, the java server exhibited only a 2-3% reduction in throughput while the mail server showed no discernible difference. Looking back, I suspect that this is due to the workload throttling we implemented in VMmark to ensure that the workloads run at less than full utilization as they would in a datacenter consolidation scenario. Given that we initially sized our VMs based upon various industry and customer surveys, I am led to wonder if there aren’t lots of servers over-configured with not only CPU but also memory. As a final series of tests, I reran the newly modified VMmark on several systems for which I already had data for the existing 7GB tile size. Overall I saw very little effect on the benchmark scores. It looks like the 5GB VMmark tile is a go.