
Monthly Archives: February 2008

16,000 Exchange Mailboxes, 1 Server

We recently finished a large Exchange 2007 capacity test on VMware
ESX Server 3.5. How large? Well, larger than anything ever done before on a
single server. And we did it from start to finish in about two weeks.

We did this test because we have felt for a while that
advances in processor and server technology were about to leave another
widely-used and important application unable to fully utilize the hardware that
vendors were offering. Microsoft has guidelines on what environment works well
with Exchange, and a system with more than eight CPUs and/or 32GB of RAM is beyond
the recommended maximums.

Hardware vendors are now offering commodity servers with 16 cores (quad socket with four cores each) and enough memory slots to hold 256GB of RAM. Within a year or two we would expect this to go up even further, with commodity x86 systems being built with 32 cores. Microsoft Exchange deployments
typically work well with the ‘scale out’ model, but that causes server proliferation and underutilized hardware, especially as systems get this large.  VMware ESX Server allows us to make more effective use of the hardware and improve capacity.

Using VMware ESX Server 3i version 3.5 we created eight virtual machines, each with two vCPUs and 14GB of memory, and configured 2,000 mailboxes on each one.  We chose 2,000 users based on Microsoft’s recommendation of 1,000 mailboxes per core and we selected 14GB of memory in accordance with the recommendation to use 4GB + 5MB/mailbox. We used the hardware recommendations for Exchange Server
in Multi-Role configuration because each virtual machine was running the Hub, CAS, and UM components in addition to hosting the mailboxes.
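For reference, here is a small Python sketch of that sizing arithmetic, using only the figures quoted above. It is a back-of-the-envelope check, not an official Exchange sizing tool:

```python
# Back-of-the-envelope check of the Exchange VM sizing described above.
# All figures come from the guidelines quoted in this post; this is not
# an official sizing tool.

MAILBOXES_PER_CORE = 1000        # Microsoft guideline: 1,000 mailboxes per core
BASE_MEMORY_GB = 4               # 4GB base memory ...
MEMORY_PER_MAILBOX_MB = 5        # ... plus 5MB per mailbox

vcpus_per_vm = 2
vms = 8

mailboxes_per_vm = vcpus_per_vm * MAILBOXES_PER_CORE                      # 2,000
memory_per_vm_gb = BASE_MEMORY_GB + mailboxes_per_vm * MEMORY_PER_MAILBOX_MB / 1024

print(f"Mailboxes per VM: {mailboxes_per_vm}")
print(f"Memory per VM:    {memory_per_vm_gb:.1f} GB (rounded up to 14GB)")
print(f"Total mailboxes:  {vms * mailboxes_per_vm}")                      # 16,000
print(f"Total VM memory:  {vms * 14} GB of the host's 128GB")             # 112 GB
```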

We ran this test on an IBM x3850 M2 server with 128GB of RAM. The virtual machines ran Microsoft Windows Server 2003 R2 Datacenter x64 Edition with Service Pack 2 and Microsoft Exchange 2007 Server Version 8 with Service Pack 1.

The storage used for these tests was an EMC CX3-40 with 225 disks (15 drawers of 15 disks each). Each virtual machine was configured to use two LUNs of 10 disks each for the Exchange database and a three-disk LUN for logs.
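As a quick sanity check, the disk counts above account for the array as follows (a simple tally of the layout just described):

```python
# Tally of the disk layout described above: per VM, two 10-disk database LUNs
# plus a three-disk log LUN, across eight virtual machines.
vms = 8
db_luns_per_vm, disks_per_db_lun = 2, 10
log_disks_per_vm = 3

disks_per_vm = db_luns_per_vm * disks_per_db_lun + log_disks_per_vm   # 23
disks_in_use = vms * disks_per_vm                                      # 184

print(f"Disks per VM: {disks_per_vm}")
print(f"Disks in use: {disks_in_use} of the CX3-40's 225")
```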

We used the Microsoft Load Generator (LoadGen) tool to drive the load on the mailboxes, and ran with the heavy user profile.  Here are the LoadGen settings:

  • Simulated day – 8 hours
  • Test run – 8 hours
  • Stress mode – disabled
  • No distribution lists or dynamic distribution lists for internal messages
  • No contacts for outgoing messages
  • No external outbound SMTP mail
  • Profile used: Outlook 2007 Online, Heavy, with Pre-Test Logon

We ran the tests using both ESX Server 3.5 and ESX Server 3i version 3.5 and the performance was the same across both versions. Tests were run with one through eight virtual machines, and even in the eight virtual machine case about half the CPU resources were still available.

Disk latencies were around 6ms across our runs. The IOPS rate started off at about 0.65 IOPS/mailbox in the first hour but stabilized at 0.37 IOPS/mailbox in the last hour (once the cache was warmed up). Over the duration of the run the average rate was 0.45 IOPS/mailbox. The read/write ratio observed was approximately 4:1.
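To put those per-mailbox figures in perspective, here is a short Python sketch that converts them into aggregate load on the array and splits it by the observed 4:1 read/write ratio. All numbers are the ones reported above; only the arithmetic is added:

```python
# Convert the per-mailbox IOPS figures reported above into aggregate array load.
mailboxes = 16000
read_fraction = 4 / 5            # observed read/write ratio of roughly 4:1

for label, iops_per_mailbox in [("first hour", 0.65),
                                ("steady state", 0.37),
                                ("run average", 0.45)]:
    total_iops = mailboxes * iops_per_mailbox
    reads = total_iops * read_fraction
    writes = total_iops - reads
    print(f"{label:12s}: {total_iops:6.0f} IOPS "
          f"(~{reads:.0f} reads/s, ~{writes:.0f} writes/s)")
```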

Sendmail latency is an important measure of the responsiveness of the Exchange Server. Figure 1 shows how it changed as more virtual machines were added to the system.


Figure 1. Sendmail Latency

A 1000ms response time is considered the threshold at which user experience starts to degrade. As can be seen from the 95th percentile response times in Figure 1, there’s still a significant amount of headroom on
this server, even at our highest tested load level.

These tests ran smoothly and demonstrated what we expected. This should come as no
surprise. As new hardware becomes available, the scalability of ESX Server allows us to easily make productive use of the additional capacity.

It took many hours and creative hardware "repurposing" from our lab personnel to put this setup together within a couple of days, and it’ll probably take them even longer to get everything back to its original place.  I’d like to acknowledge that without their efforts, we wouldn’t have been able to get this done.

Summary

The large number of companies already running Microsoft Exchange Server on VMware ESX Server are experiencing improved resource utilization and better manageability as well as lower space, power, and cooling costs. New servers with greater processing power make the transition to Exchange on ESX Server even more compelling.

VMI Performance Benefits

One of the hallmarks of VMware has been its continuous innovation. With every new release VMware adds exciting new features to its products. But there are also lots of changes in each release that may not be as obvious, but are still quite important. Performance improvements often fall into this category. One such feature that was added in VMware ESX Server 3.5 is the support for guest operating systems that are paravirtualized using the virtual machine interface (VMI) specification.

A little bit of history will provide insight into the motivation for the VMI standard. Previous Linux paravirtualization efforts produced paravirtualized kernels that were tightly coupled to the hypervisor. Frequent interface changes led to guest kernel and hypervisor version dependencies, impeding the independent evolution of the kernel and the hypervisor. These paravirtualized kernels were also not compatible with native hardware. Wouldn’t it be nice if a single kernel could run both natively and in a virtual machine, with improved performance from paravirtualization in the latter case? This capability, now called transparent paravirtualization, did not exist at the time, and its absence doubled the number of kernels to develop, test, and debug.

In order to address these issues VMware proposed the VMI paravirtualization standard in 2006. As a proof of concept VMware implemented the VMI standard in Linux kernel 2.6.16 and demonstrated it to the Linux open-source community at the Linux Symposium in Ottawa, Canada. It was clear that the performance gains of paravirtualization could be attained without sacrificing modularity and code cleanliness, and that VMI-enabled kernels could run transparently on native machines without any performance impact. Several VMI-style paravirtualization design philosophies were accepted by the Linux community, leading to the development of the paravirt-ops interface in the 2.6.20 mainline kernel. VMware then worked with the Linux community to include VMI in the Linux kernel. Today VMI can be enabled in mainline Linux kernel 2.6.22 and above. See Knowledge base article #1003644 for instructions on how to enable VMI in your custom Linux kernel.

Since VMI-enabled kernels can run on native systems, the popular Linux distributions Ubuntu Feisty Fawn (7.04) and Ubuntu Gutsy Gibbon (7.10) were shipped with VMI enabled by default in the kernel, providing transparent performance benefits when they are run in ESX Server 3.5. VMware is also working with Novell to include VMI in the SUSE Linux Enterprise Server distribution. We in the Performance group did a study to evaluate the performance benefits of using a distribution that supports VMI, the results of which can be found in this whitepaper.

The paper has details on the workloads that we ran, the benchmark methodologies used, and the reasoning behind them. It will be clear from the paper that VMware’s VMI-style paravirtualization offers performance benefits for a wide variety of workloads in a totally transparent way. With support for virtualization techniques like binary translation and hardware assist, and now the addition of VMI paravirtualization, VMware provides customers the most comprehensive range of choices, allowing them to choose the right virtualization technique based on the guest operating system and workload.

You can also learn more about VMI’s performance benefits at my upcoming talk, "VMI: Maximizing Linux virtual machine performance in ESX Server 3.5" at VMworld Europe 2008. I hope to see you there.

SPECweb2005 Performance on VMware ESX Server 3.5

I got a chance to attend the VMworld 2007 conference in San Francisco a little over three months ago. During the conference, many of my Performance group colleagues and I had the opportunity to speak with a number of customers from various segments of industry. They all seemed to love VMware products and were fully embracing virtualization technology across their IT infrastructure, clearly reflecting a paradigm shift. As Diane Greene described in her keynote, virtualization has become a mainstream technology. However, among the customers we spoke to there were a few who had some concerns about virtualizing I/O-intensive applications. Not surprisingly, the concerns had more to do with perception than with their actual experience.

Truth be told, with a number of superior features and performance optimizations in VMware ESX Server 3.5, performance is no longer a barrier to virtualization, even for the most I/O-intensive workloads. In order to dispel the misconceptions these customers had, we decided to showcase the performance of ESX Server by benchmarking with industry-standard I/O-intensive benchmarks. We looked at the whole spectrum of I/O-intensive workloads. My colleague has already addressed database performance. Here, I’d like to focus on web server performance; in particular, the performance of a single virtual machine running the highly network-intensive SPECweb2005 benchmark.

SPECweb2005 is a SPEC benchmark for measuring a system’s ability to act as a web server. It is designed with three workloads to characterize different web usage patterns: Banking (emulates online banking), E-commerce (emulates an E-commerce site), and Support (emulates a vendor support site providing downloads). The three benchmark components have vastly different workload characteristics and we thus look at results from all three.

In our test environment we used an HP ProLiant DL385 G1 server as the system under test (SUT). The server was configured with two 2.2 GHz dual-core AMD Opteron 275 processors and 8GB of memory. In the native tests the system was booted with 1 CPU and 6GB of memory and ran RHEL4 64-bit. In the virtualized tests, we used a 1-vCPU virtual machine configured with 6GB of memory, running RHEL4 64-bit, and hosted on ESX Server 3.5. We used the same operating system version and web server software (Rock Web Server, Rock JSP/Servlet container) in both the native and virtualized tests. Note that neither the storage configuration nor the network configuration in the virtual environment required any additional hardware. In fact we used the same physical network and storage infrastructure when we switched between the native and virtual machine tests.

There are different dimensions to performance. For real-world applications the most significant of these are usually overall latency (execution time) and system throughput (maximum operations per second). We are also concerned with the physical resource utilization per request/response. We used the SPECweb2005 workloads to evaluate all these aspects of performance in a virtualized environment.

Figure 1 shows the performance we obtained using the SPECweb2005 Banking workload. The graph plots the latency in seconds against the total number of SPECweb2005-banking users. The blue dashed line corresponds to the performance observed in a physical environment and the green solid line corresponds to the performance observed in a virtual environment.


Figure 1. Response Time Curves

You can see from the graph that both curves have similar shapes. Both exhibit behavior observed in a typical response time curve. There are three regions of particular interest in the graph: the performance plateau, the stressed region, and the knee of the curve.

The part of the curve marked “Performance plateau” represents the behavior of the system under moderate stress, with CPU utilizations typically well below 50%. Interestingly, we observed lower latency in the virtual environment than in the native environment. This may be because ESX Server intelligently offloads some functionality to the available idle cores, and thus in certain cases users may experience slightly better latency in a virtual environment.

The part of the curve marked “Stressed region” represents the behavior of the system under heavier stress, with utilizations above approximately 60%. Response time gradually increases with load in both curves, but remains within reasonable limits.

The knee of each curve is marked by a point where the solid red line intersects the curve. The knee represents the maximum throughput (or load) that can be sustained by the system while meeting reasonable response time requirements. Beyond that point the system can no longer gracefully handle higher loads.

From this graph we can draw the following conclusions:

  1. When the CPU resources in the system are not saturated, you may not notice any difference in the application latency between the virtual and physical environments.
  2. The behavior of the system in both the virtual and physical environments is nearly identical, although the knee of the curve in the virtual environment occurs slightly earlier (due to moderately more CPU resources being used by the virtualized system).

We have similar results for the Support and E-commerce workloads. For brevity, I’ll focus on the portion of the response time curve that interests most system administrators. We have chosen a load point that is approximately 80% of the peak throughput obtained on a native machine. This represents the center of the ‘stressed region’ of the native response time curve, with CPU utilization of 70% to 80%. We applied the same load in the virtual environment to understand the latency characteristics.

As you can see from Figure 2, we did not observe any appreciable difference in application latency between the native and virtual environments.


Figure 2. SPECweb2005 Latency

Figure 3 compares the knee points of the response time curves obtained in all three workloads.

The knee points represent the peak throughput (or the maximum connections) sustained by both the native and virtual systems while still meeting the benchmark latency requirements.


Figure 3. SPECweb2005 Throughput

As shown in Figure 3, we obtained close to 90% of native throughput performance on the SPECweb2005 Banking workload, close to 80% of native performance on the E-commerce workload, and 85% of native performance on the Support workload.

If you’d like to know more about the test configuration, tuning information, and performance statistics we gathered during the tests, check out our recently published performance study.

These tests clearly demonstrate that performance in a virtualized environment can be close to that of a native environment even when using the most I/O-intensive applications. Virtualization does require moderately higher processor resources, but this is typically not a concern given the highly underutilized CPU resources in many IT environments. In fact, with so many additional benefits, such as server consolidation, lower maintenance costs, higher availability, and fault tolerance, a very compelling case can be made to virtualize any application, irrespective of its workload characteristics.

Scalable Storage Performance with VMware ESX Server 3.5

We at VMware often get questions about how aggressively physical systems can be consolidated. Scalability on heavily-consolidated systems is not just a nice feature of VMware ESX Server, but is a requirement to support demanding applications in modern datacenters. With the launch of VI3 with ESX Server 3.5 we’ve further improved the efficiency of our storage system. For non-clustered environments, we’ve already shown in this comparison paper that our system overheads are negligible compared to physical devices. In this article we’d like to cover the scalable performance of VMFS, our clustered file system.

ESX Server enables multiple hosts to reliably share the same physical storage through its highly optimized storage stack and the VMFS file system. There are many benefits to a shared storage infrastructure, such as consolidation and live migration, but people commonly wonder about performance. While it is always desirable to squeeze the most performance out of the storage system, care should be taken not to severely over-commit the available resources, which can lead to performance degradation. Specifically, the primary factors that affect the shared storage performance of an ESX Server cluster are as follows:

1. The number of outstanding SCSI commands going to a shared LUN

SCSI allows multiple commands to be active on a link, and SCSI drivers support a configurable parameter called “queue depth” to control this. The maximum supported value is most commonly 256. For an I/O group (one or more ESX Server hosts sharing a LUN), it is important that the number of active SCSI commands does not exceed this value; otherwise the commands will be queued. Excessive queuing leads to increased latencies and potentially a drop in throughput. The number of commands queued per ESX Server host can be derived using the esxtop command.
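As a rough illustration of this rule of thumb, the sketch below compares the aggregate number of outstanding commands from a cluster against a per-LUN limit of 256. The 32-command HBA queue depth matches what we use in the experiments later in this article; the host counts are just examples:

```python
# Rough check of aggregate outstanding SCSI commands against a shared LUN's
# queue limit. 256 is the commonly-supported per-LUN maximum mentioned above;
# the host counts are illustrative.
LUN_QUEUE_LIMIT = 256
per_host_queue_depth = 32        # HBA queue depth configured on each ESX Server host

for hosts in (4, 8, 16, 64):
    outstanding = hosts * per_host_queue_depth
    status = "OK" if outstanding <= LUN_QUEUE_LIMIT else "excess commands will queue"
    print(f"{hosts:2d} hosts x {per_host_queue_depth} = {outstanding:4d} outstanding -> {status}")
```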

2. SCSI reservations

VMFS is a clustered file system and uses SCSI reservations to implement on-disk locks. Administrative operations, such as creating/deleting a virtual disk, extending a VMFS volume, or creating/deleting snapshots, update the file system metadata using locks, and hence result in SCSI reservations. A reservation causes the LUN to be available exclusively to a single ESX Server host for a brief period of time. It is therefore preferable that administrators perform the above-mentioned operations during off-peak hours, especially if there will be many of them.

3. Storage device capabilities

The capabilities of the storage array play a role in how well performance scales with multiple ESX Servers. The capabilities include the maximum LUN queue depth, the cache size, the number of sequential streams, and other vendor-specific enhancements. Our results have shown that most modern Fibre Channel storage arrays have enough capacity to provide good performance in an ESX Server cluster.

We’re glad to share with you some results from our storage scalability experiments. Our hardware setup includes 64 blades running VMware ESX Server 3.5. They are connected to a storage array via 2Gbps Fibre Channel links. All hosts share a single VMFS volume, and virtual machines running IOmeter generate a heavy I/O load to that one volume. The queue depth for the Fibre Channel HBA is set to 32 on each ESX Server host, which is exactly how many commands are configured to be generated by all virtual machines on a single host. We measure two things:

  • Aggregate Throughput – the sum of the throughput across all virtual machines on all hosts
  • Average Latency – the end-to-end average delay per command as seen by any virtual machine in the cluster

 

Figure 1. Aggregate Throughput

It is clear from Figure 1 that except for sequential read there is no drop in aggregate throughput as we scale the number of hosts. The reason sequential read drops is that the sequential streams coming in from different ESX Server hosts are no longer sequential when intermixed at the storage array, and thus become random. Writes generally do better than reads because they are absorbed by the write cache and flushed to disks in the background.

 

Figure 2. Average Latency

Figure 2 illustrates the effect of commands from all ESX Server hosts reaching the shared LUN on the storage array. Each ESX Server host generates 32 commands, hence at eight hosts we have reached the recommended maximum per LUN of 256. Beyond this point, latencies climb upwards of 100 msec, and could affect applications that are sensitive to latencies, although there is no drop in aggregate throughput.
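One way to see why latency keeps climbing while throughput stays flat is Little's law: average latency is roughly the number of outstanding commands divided by the throughput. The sketch below applies it with an assumed array throughput; the actual IOPS figure from these experiments is not reported here, so the numbers are purely illustrative.

```python
# Little's law: average latency ~= outstanding commands / throughput.
# ARRAY_IOPS is an assumed, illustrative figure, not a measured result.
ARRAY_IOPS = 10000               # assumed saturated throughput of the shared LUN
per_host_queue_depth = 32        # as configured on each ESX Server host

for hosts in (8, 16, 32, 64):
    outstanding = hosts * per_host_queue_depth
    latency_ms = outstanding / ARRAY_IOPS * 1000
    print(f"{hosts:2d} hosts: {outstanding:4d} outstanding commands "
          f"-> ~{latency_ms:.0f} ms average latency")
```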

These experiments represent a specific configuration with an aggressive I/O rate. Virtual machines deployed in typical customer environments may not have as high a rate and therefore may be able to scale further. In general, because of varying block sizes, access patterns, and numbers of outstanding commands, the results you see in your VMware environment will depend on the types of applications running. The results will also depend on the capabilities of your storage and whether it is tuned for the block sizes in your application. Also, processing very small commands adds some compute overhead in any system, be it virtualized or otherwise. Overall, the ESX Server storage stack is well tuned to run the majority of applications. If you are using iSCSI or NFS, this comparison paper nicely outlines how ESX Server can efficiently make use of the full Ethernet link speed for most block sizes.

We’re always pleased to show the scalability of VMware Infrastructure 3, and the file system that supports the VI3 features is a good example. Look for more details on storage and VMFS performance in the form of whitepapers and presentations from VMware and its partners in the coming weeks.

Sun Improves Upon Their Debut VMmark Result

Our partners at Sun have just published an improved result for the Sun Blade X8440. This latest test was run with VMware ESX Server 3.5 and demonstrates a 5% improvement over their original VMware ESX Server 3.0.2 result using a nearly identical configuration. Both benchmark disclosure reports can be found at the VMmark results page.

This result underlines the fact that well-designed and well-tuned blade platforms can perform competitively with traditional rack-mount servers in an enterprise-level consolidation scenario. It also demonstrates VMware’s continuous focus on improving the performance of virtualized workloads. And finally, it shows the value of having a well-designed virtualization benchmark with which to measure and compare platforms.