In an earlier paper on a small seven-host cluster it was shown that Hadoop can be virtualized with little overhead, and that better-than-native performance can be achieved with the right configuration. However, the reasons for the observed performance behavior were not well understood. Recently, this work was refreshed with a larger cluster of 32 high-performance hosts running VMware vSphere® 5.1. The performance of native and several virtual configurations was compared for three applications. The apples-to-apples case of a single virtual machine per host shows performance close to that of native. Improvements in elapsed time of up to 13% for the most important application (TeraSort) can be achieved by partitioning each host into two or four virtual machines, resulting in competitive or even better than native performance as shown in the figure below (number of VMs is per host, and a lower ratio is better). Details of the results are in a new whitepaper: “Virtualized Hadoop Performance with VMware vSphere 5.1“. The paper also discusses the use of several performance tools and models to gain a better understanding of both the sources of virtualization overhead and the reasons why configuring multiple smaller virtual machines per host can enhance performance. Based on this, recommendations for optimal hardware and software configuration are also given.
Category Archives: Web/Tech
Each new generation of servers brings advances in hardware components. For IT professionals purchasing or managing new generations of hardware, it’s vital to understand how these incremental hardware improvements translate into real-world gains in the datacenter. Using the VMware VMmark 2.5 virtualization benchmark, we compared performance and energy efficiency of two different generations of servers in four-node clusters.
VMmark 2.5 is a multi-host virtualization benchmark that uses varied application workloads as well as common datacenter operations to model the demands of the datacenter. VMs running diverse application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview.
All tests were conducted on two four-node clusters running VMware vSphere 5.1. We compared performance and energy efficiency between a cluster of previous generation Dell R310 servers, and a cluster of current generation Dell R620 servers. For simplicity, we refer to these as the ‘old cluster’ and ‘new cluster,’ respectively. Among other hardware differences, the old cluster servers contained four-core Intel Nehalem processors while the new cluster servers contained eight-core Intel Sandy Bridge EP processors. Memory in the newer servers was appropriately scaled up to accommodate their increased processing power and represents common current server configurations. Software and storage configurations were identical between clusters.
Systems Under Test: Four Dell PowerEdge R310 servers
CPUs (per server): One Quad-Core Intel® Xeon® X3460 @ 2.8 GHz, Hyper-Threading enabled
Memory (per server): 32GB DDR3 ECC @ 800 MHz
Systems Under Test: Four Dell PowerEdge R620 servers
CPUs (per server): One Eight-Core Intel® Xeon® E5-2665 @ 2.4 GHz, Hyper-Threading enabled
Memory (per server): 96GB DDR3 ECC @ 1067 MHz
Storage Array: EMC VNX5700
62 Enterprise Flash Drives (SSDs), RAID 0, grouped as 3 x 8 SSD LUNs, 7 x 5 SSD LUNs, and 1 x 3 SSD LUN
Hypervisor: VMware vSphere 5.1.0
Virtualization Management: VMware vCenter Server 5.1.0
VMmark version: 2.5
To determine the maximum VMmark load the old cluster could support, we increased the number of VMmark tiles until the cluster reached saturation, which is defined as the largest number of tiles that still meet Quality of Service (QoS) requirements. We then tested the new cluster at the same number of tiles. All data points are the mean of four tests in each configuration and VMmark scores are normalized to the old cluster’s performance.
The new cluster had a 32% higher VMmark score in combination with a 41% lower CPU utilization. The new cluster also showed a 24% increase in energy efficiency over the old cluster, which we’ll discuss further below. At four tiles, the old cluster was bottlenecked on CPU, resulting in decreased workload throughput, while the new cluster was not. With CPU resources to spare, the new cluster met the requested load at lower latencies, which increased its total throughput and score. Mean I/O latencies remained low for both clusters at 1.2ms reads and 1.1ms writes for the old cluster and 1.0ms reads and 0.9ms writes for the new cluster.
We next determined the maximum VMmark load the new cluster could support. While the old cluster was saturated at four tiles, the new cluster accommodated more than twice the load at nine tiles and produced a score 120% higher than the old cluster. Mean I/O latencies remained low at 1.0ms.
The performance advantages of the R620 over the R310 were largely due to the generational improvements of the R620’s eight-core E5-2665 processor versus the R310’s four-core x3460 processor, which includes improved bus speeds and larger L3 cache, and the R620’s increased memory.
These performance results suggest that it would be possible to replace four Dell R310 servers with two Dell R620 servers and expect better than equivalent performance. We put this to the test by removing two nodes from the new cluster and found that the two remaining nodes did support four tiles at 93% utilization, with an 11% higher VMmark score and 74% greater energy efficiency than the four-host old cluster.
Beyond their raw performance capability, we also compared the two server generations on their energy efficiency. The Performance per Kilowatt metric, which is new to VMmark 2.5, models energy efficiency as VMmark score per kilowatt of power consumed. Below, we’ve plotted energy efficiency against the normalized VMmark score. Both clusters were run with their servers’ power management set to “maximum performance.”
Two trends emerge from this figure. First, at four tiles, the four-host new cluster accomplishes more work at higher energy efficiency than the old cluster. Across the board, the new cluster is more energy efficient than the old cluster. Second, within the four-host new cluster, greater energy efficiency is correlated with increase in VMmark score. As the CPUs become busier, performance increases at a faster rate than the required power. This can be understood by noting that an idle server will still consume power, but with no performance to show for it. A highly utilized server is typically the most energy efficient per request completed, which is confirmed by the two-host new cluster that achieved high efficiency at 93% utilization. Higher energy efficiency creates cost savings in energy consumption and in cooling costs.
Our investigation shows that, while running vSphere 5.1, two newer Dell R620 servers are capable of supporting a greater load than four older Dell R310 servers. Because the Dell R620 performance is more than double that of the Dell R310, a four-node Dell R620 cluster reached a 120% higher maximum score than the Dell R310 cluster. In addition to its performance advantages, at each load level the Dell R620 cluster performed with greater energy efficiency, showing that the Dell R620 has superior performance but also has greater energy efficiency than the Dell R310.
At VMworld 2012 we demonstrated a single eight-way VM running on vSphere 5.1 exceeding one million IOPS. This testing illustrated the high end IOPS performance of vSphere 5.1.
In a new series of tests we have completed some additional characterization of high I/O performance using a very similar environment. The only difference between the 1 million IOPS test environment and the one used for these tests is that the number of Violin Memory Arrays was reduced from two to one (one of the arrays was a short term loan).
Hypervisor: vSphere 5.1
Server: HP DL380 Gen8
CPU: Two Intel Xeon E5-2690, HyperThreading disabled
HBAs: Five QLogic QLE2562
Storage: One Violin Memory 6616 Flash Memory Array
VM: Windows Server 2008 R2, 8 vCPUs and 48GB.
Iometer Configuration: Random, 4KB I/O size with 16 workers
We continued to characterize the performance of vSphere 5.1 and the Violin array across a wider range of configurations and workload conditions.
Based on the types of questions that we often get from customers, we focused on RDM versus VMFS5 comparisons and the usage of various I/O sizes. In the first series of experiments we compared RDM versus VMFS5 backed datastores using 100% read workload mix while ramping up the I/O size.
As you can see from the above graph, VMFS5 yielded roughly equivalent performance to that of RDM backed datastores. Comparing the average of the deltas across all data points showed performance within 1% of RDM for both IOPS and MB/s. As expected, the number of IOPS decreased after we exceed the default array block size of 4KB, but the throughput continued to scale, approaching 4500 MB/s at both 8KB and 16KB sizes.
For our second series of experiments, we continued to compare RDM versus VMFS5 backed datastores through a progression of block sizes, but this time we altered the workload mix to include 60% reads and 40% writes.
Violin Memory arrays use a 4KB sector size and perform at their optimal level when managing 4KB blocks. This is very visible in the above IOPS results at the 4KB block size. In the above graph, comparing RDM and VMFS5 IOPS, you can see that VMFS5 performs very well with a 60% read, 40% write mix. Throughputs continued to scale in a similar fashion as the read-only experimentation and VMFS5 performance for both IOPS and MB/s were within .01% of RDM performance when comparing the average of the deltas across all data points.
The amount of I/O, with just one eight-way VM running on one Violin storage array, is both considerable and sustainable at many I/O sizes. It’s also noteworthy to point out that running a 60% read and 40% write I/O mix still generated substantial IOPs and bandwidth. While in most cases a single VM won’t need to drive nearly this much I/O traffic, these experiments show that vSphere 5.1 is more than capable of handling it.
A system’s performance is often limited by the access time of its hard disk drive (HDD). Solid-state drives (SSDs), also known as Enterprise Flash Drives (EFDs), tout a superior performance profile to HDDs. In our previous comparison of EFD and HDD technologies using VMmark 2.1, we showed that EFD reads were on average four times faster than HDD reads, while EFD and HDD write speeds were comparable. However, EFDs are more costly per gigabyte.
Many vendors have attempted to address this issue using tiered storage technologies. Here, we tested the performance benefits of EMC’s FAST Cache storage array feature, which merges the strengths of both technologies. FAST Cache is an EFD-based read/write storage cache that supplements the array’s DRAM cache by giving frequently accessed data priority on the high performing EFDs. We used VMmark 2, a multi-host virtualization benchmark, to quantify the performance benefits of FAST Cache. For more details, see the overview, release notes for VMmark 2.1, and release notes for 2.1.1. VMmark 2 is an ideal tool to test FAST Cache performance for virtualized datacenters in that its varied workloads and bursty I/O patterns model the demands of the datacenter. We found that FAST Cache produced remarkable improvements in datacenter capacity and storage access latencies. With the addition of FAST Cache, the system could support twice as much load while still meeting QoS requirements.
FAST Cache is a feature of EMC’s storage systems that tracks frequently accessed data on disk, promotes the data into an array-wide EFD cache to take advantage of Flash I/O access speeds, then writes it back to disk when the data is superseded in importance. FAST Cache optimizes the use of EFD storage. In most workloads only a small percentage of data will be frequently accessed. This is referred to as the ‘working set.’ An EFD-based cache allows the data in the working set to take advantage of the performance characteristics of EFDs while the rest of the data stays on lower-cost HDDs. Relevant data is rapidly promoted into the cache in increments of 64 KB pages, and a least-recently-used algorithm is used to decide which data to write back to disk.
The benefit achieved with FAST Cache depends on the workload’s I/O profile. As with most caches, FAST Cache will show the most benefit for I/O with a high locality of reference, such as database indices and reference tables. FAST Cache will be least beneficial to workloads with sequential I/O patterns like log files or large I/O size access because these may not access the same 64 KB block multiple times and the FAST Cache would never become populated.
Systems Under Test: Four Dell PowerEdge R310 Servers
CPUs (per server): One Quad-Core Intel® Xeon® X3460 @ 2.8 GHz, Hyper-Threading enabled
Memory (per server): 32 GB DDR3 ECC @ 800 MHz
Storage Array: EMC VNX5500
FAST Cache configurations:
366 GB FAST Cache, 8 EFDs, RAID 1
92 GB FAST Cache, 2 EFDs, RAID 1
FAST Cache disabled
20 HDDs, 10K RPM, grouped into 3 LUNs of 8, 8, and 4 HDDs each
11 HDDs, 10K RPM, grouped into 3 LUNs of 4, 4, and 3 HDDs each
Hypervisor: ESXi 5.0.0
Virtualization Management: VMware vCenter Server 5.0
VMmark version: 2.1.1
We used VMmark 2 to investigate several different factors relating to FAST Cache. We wanted to measure the performance benefit afforded by adding FAST Cache into a VMmark 2 environment and we wanted to observe how the performance benefit of FAST Cache would scale as we changed the size of the cache. We tested with FAST Cache disabled and with two different FAST Cache sizes which were made from two EFDs and eight EFDs in RAID 1, creating a cache of 92 GB and 366 GB usable space, respectively. FAST Cache was configured according to best practices to ensure FAST Cache performance was not limited by array bus bandwidth. After the FAST Cache was created, it was warmed up by repeating VMmark 2 runs until scores showed less than 3% variability between runs.
We also wanted to examine whether FAST Cache could reduce the hardware requirements of our tests. As processors and other system hardware components have increased in capacity and speed, there has been greater and greater pressure for corresponding increasing performance from storage. RAID groups of HDDs have been one answer to these increasing performance demands, as RAID arrays provide performance and reliability benefits over individual disks. In typical RAID configurations, performance increases nearly linearly as disks are added to the RAID group. However, adding disks in order to increase storage access speed can result in underutilization of HDD space, which becomes far greater than required. FAST Cache should allow us to reduce the number of HDDs we require for RAID performance benefits, also reducing the cluster’s total power, cooling and space requirements, which results in lower cost. FAST Cache services the bulk of the workloads’ I/O operations at high speeds, so it is acceptable for us to service the remainder of operations at lower speeds and use only as many HDDs as needed for storage capacity rather than performance.
To test whether an environment with FAST Cache and a reduced number of disks could perform as well as an environment without FAST Cache, but with a larger number of disks, we tested performance with two different disk configurations. Workloads were tested on a set of 20 HDDs and then on a set of 11 HDDs, in both cases grouped into three LUNs. Each LUN was in a distinct RAID 0 group. Due to the performance characteristics of RAID 0, we expected the 20 HDD configuration to have better performance to than the 11 HDD configuration. The placement of workloads onto LUNs was meant to model a naïve environment with nonoptimal storage setup. Two LUNs held workload tile data, and the third smaller LUN served as the destination for VM Deploy and Storage vMotion workloads. The first LUN held VMs from the first and third tiles, and the second LUN held VMs from the second and fourth tiles. Running VMmark 2 with more than one tile per LUN was atypical of our best practices for the benchmark. It created a severe bottleneck for the disk, which was meant to simulate the types of storage performance issues we sometimes see in customer environments.
All VMmark 2 tests were conducted on a cluster of four identically configured entry-level Dell Power Edge R310 servers running ESXi 5.0. All components in the environment besides FAST Cache and number of HDDs remained unchanged during testing.
To characterize cluster performance at multiple load levels, we increased the number of tiles until the cluster reached saturation, defined as when the run failed to meet Quality of Service (QoS) requirements. Scaling out the number of tiles until saturation allows us to determine the maximum VMmark 2 load the cluster could support and to compare performance at each level of load for each cache and storage configuration. All data points are the mean of three tests in each configuration. Scaling data was generated by normalizing every score to the lowest passing score, which was 1 tile with FAST Cache disabled on 20 HDDs.
With FAST Cache disabled, the 20 HDD LUNs reached saturation at 2 tiles, and the 11 HDD LUNs were unable to support even 1 tile of load. Because all VMs for each tile were placed on the same LUN, a 1 tile run used one LUN, consisting of only four out of 11 HDDs or eight out of 20 HDDs. 4 HDDs were insufficient to provide the required QoS for even 1 tile. When FAST Cache was enabled, the 11 HDD and 20 HDD configurations supported 4 tiles. This is a remarkable improvement; with the addition of FAST Cache, the system could support twice as much load while still meeting QoS requirements. Even at lower load levels, the equivalent system with FAST Cache was allowing greater throughput and showed resulting increases in the VMmark score of 26% at 1 tile and 31% at 2 tiles. With FAST Cache enabled, the configuration with 11 HDDs performed equivalently to one with 20 HDDs until the system approached saturation.
With FAST Cache enabled, the system supported twice as much load on almost half as many disks. The results show that an environment with a 92 GB FAST Cache was able to greatly outperform a HDD-only environment that contains 82% more disks. At 4 tiles with FAST Cache enabled, the cluster’s CPU utilization was approaching saturation, reaching an average of 84%, but was not yet bottlenecked on storage.
In our tests, performance did not scale up very much as we increased FAST Cache size from 92 GB to 366 GB and the number of HDDs from 11 to 20.
We can see that all configurations scaled very similarly from 1 to 3 tiles with only minor differences appearing, primarily between the 92 GB FAST Cache and 366 GB FAST Cache. Only at the highest load level did performance begin to diverge. Predictably, the largest cache configurations show the best performance at 4 tiles, followed by the smaller cache configurations. To determine whether this performance falloff was directly attributable to the cache size and number of HDDs, we needed to know whether FAST Cache was performing to capacity.
Below are the FAST Cache and DRAM cache hit percentages for read and write operations at the 4 tile load. On average, our VMmark testing had I/O operations of 24% reads and 76% writes.
With the 366 GB FAST Cache, nearly all reads and writes were hitting either the DRAM or FAST Cache. In these cases, the number of backing disks did not affect the score because disks were rarely being accessed. At this cache size, all frequently accessed data fit into the FAST Cache. However, with the 92 GB FAST Cache, the cache hit percentage decreased to 96.5% and 92.1% for the 11 HDD and 20 HDD configurations, respectively. This indicated that the entire working set could no longer fit into the 92 GB FAST Cache. The 11 HDD configuration began to show decreased performance relative to 20 HDDs, because although only 3.5% of total I/O operations were going to disk, the increase in disk latency was large enough to reduce throughput and affect VMmark score. Despite this, a FAST Cache of 92 GB was still sufficient to provide us with VMmark performance that met QoS requirements. The higher read hit percentages in the 11 HDD configuration reflected this reduced throughput. Lower throughput resulted in a smaller working set and an accordingly higher read hit percentage.
Overall, FAST Cache did an excellent job of identifying the working set. Although only 8% of the 1.09 TB dataset could fit in the 92 GB cache at any one time, at least 92% of I/O requests were hitting the cache.
Scaling FAST Cache gave us a sense of the working set size of the VMmark benchmark. As performance with the 92 GB FAST Cache demonstrated a knee at 3 tiles, this suggests the working set size at 3 tiles is less than 92 GB and the working set size at 4 tiles is slightly greater than 92 GB. Knowing the approximate working set size per tile would allow us to select the minimum FAST Cache size required if we wanted our entire working set to fit into the FAST Cache, even if we scaled the benchmark to an arbitrary number of tiles in a different cluster.
The results below show that I/O operations per second and I/O latency were affected by our environment characteristics.
The variability in read latency is clearly affected by both FAST Cache size and number of backing HDDs. Latency is highest with only 11 HDDs and the smaller FAST Cache, and decreases as we add HDDs. Latency decreases even more with the larger FAST Cache size as nearly all reads hit the cache. Write latency, however, is relatively constant across configurations, which is as expected because in each configuration nearly all writes are being served by either the DRAM cache or FAST Cache.
It’s clear that we can replace a large number of HDDs with a much smaller number of EFDs and get similar or improved performance results. An array with 11 HDDs and FAST Cache outperformed an array with 20 HDDs without FAST Cache. FAST Cache handles the workloads’ performance requirements so that we need only to supply the HDDs necessary for their storage space, rather than performance capabilities. This allows us to reduce the number of HDDs and their associated power, space, cooling, and cost.
Tiered storage solutions like FAST Cache make excellent use of EFDs, even to the extent that 92% or more of our I/O operations are benefitting from Flash-level latencies while the EFD storage itself holds only 8% of our total data. The increased VMmark scores demonstrate the ability of FAST Cache to pinpoint the most active data remarkably well, and, even in a bursty environment, show incredible improvements in I/O latency and in the load that a cluster can support. Our testing showed FAST Cache provides Flash-level storage access speeds to the data that needs it most, reduces storage bottlenecking and increases supported load, making FAST Cache a highly valuable addition to the datacenter.
In previous articles on VROOM! we used VMmark 2 to investigate the effects of altering a single hardware component, such as a storage array or server model, in a vSphere cluster. In contrast to these earlier studies, we now examine the effects of upgrading the hosts’ software from ESXi 4.1 to ESXi 5.0 on the performance of a VMmark 2 cluster.
vSphere 5 includes many new features and virtual machine enhancements, the details of which can be found here. To the IT professional weighing the costs and benefits of upgrading their existing infrastructure to vSphere 5, an often important question is whether ESXi 5.0 can outperform ESXi 4.1 in the same environment. VMmark 2 is an ideal tool for answering this question with measurable results. We used VMmark 2.1.1 to see how ESXi 5.0 stacked up to ESXi 4.1 on an identically configured cluster.
VMmark 2 is a multi-host virtualization benchmark that models application performance as well as the effects of common infrastructure operations such as vMotion, Storage vMotion, and virtual machine deployments. Each VMmark tile contains a set of VMs running diverse application workloads as a unit of load. VMmark 2 scores are computed as a weighted average of application workload throughput and infrastructure operation throughput. For more details, see the overview, release notes for VMmark 2.1, and for 2.1.1.
All VMmark 2 tests were conducted on a cluster of four identically configured entry-level Dell Power Edge R310 servers. To determine the impact of the vSphere 5 environment on performance, a series of tests was conducted with these hosts running ESXi 4.1, then with ESXi 5.0. In addition, for the vSphere 5 environment, the virtual machine hardware and VMware Tools were upgraded on all workload VMs, and LUNs were reformatted as VMFS5. All other components in the environment remained unchanged during testing.
Systems Under Test: Four Dell PowerEdge R310 Servers
CPUs: One Quad-Core Intel® Xeon® X3460 @ 2.8 GHz, hyper-threading enabled per server
Memory: 32GB DDR3 ECC @ 800 MHz per server
Storage Array: EMC VNX5500
Hypervisors under test:
VMware ESXi 4.1
VMware ESXi 5.0
Virtualization Management: VMware vCenter Server 5.0
VMmark version: 2.1.1
To characterize cluster performance at multiple load levels, we increased the number of tiles until the cluster reached saturation, defined as when the run failed to meet Quality of Service (QoS) requirements. Scaling out the number of tiles until saturation allows us to determine the maximum VMmark 2 load the cluster could support and to compare the ESXi 4.1 and ESXi 5.0 configurations at each level of load.
The graph below shows the results of the VMmark 2 testing as described above with identically configured clusters running ESXi 4.1 and ESXi 5.0. All data points are the mean of three tests in each configuration.
The ESXi 4.1 cluster reached saturation at 3 tiles, but ESXi 5.0 was able to support 4 tiles while still meeting workload Quality of Service requirements. The ESXi 5.0 cluster also outperformed ESXi 4.1 by 3% and 4% on the two and three-tile runs, respectively. Differences in CPU utilization were negligible. The results show that, in an equivalent environment, vSphere 5 handled greater load than ESXi 4.1 before reaching saturation, and showed increased performance at lower levels of load as well. At saturation, vSphere 5 showed a 22% increase in overall VMmark 2 scores over ESXi 4.1. In this cluster, vSphere 5 supported 33% more VMs and twice the number of infrastructure operations while meeting Quality of Service requirements.
VMmark 2 scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. The completion time for vMotion, Storage vMotion, and VM Deploy is used as the latency measurement for the infrastructure operations. Latency can be very informative about the functioning of the environment and how the cluster as a whole performs under increasing loads. Examining latency at a 3-tile load, as seen in the figure below, reveals significant differences between the hypervisor versions. Latencies were normalized to the ESXi 4.1 results.
We saw decreases in latency for all VMmark 2 workloads with vSphere 5. The latency decreases were most striking in Olio, Storage vMotion, and DVD Store 2, with decreases of 20%, 19%, and 15%, respectively. These improvements to vMotion and Storage vMotion are consistent with publicized improvements in vMotion and Storage vMotion latency for vSphere 5 (details here).
A VMmark 2 run passes when all of its application QoS metrics, or latencies, remain below a specified threshold. These decreases in latency with ESXi 5.0 are directly related to why ESXi 5.0 was able to support an additional tile relative to ESXi 4.1.
Our comparison has shown that upgrading an ESXi 4.1 cluster to vSphere 5 had two high-level effects on performance. The vSphere 5 cluster supported 33% more VMs at saturation than the ESXi 4.1 cluster, and it also exhibited improved latency and throughput at lower levels of load, showing that ESXi 5.0 does outperform ESXi 4.1.
Zimbra Collaboration Server is VMware’s mail, calendaring, and collaboration software. A performance study looks at how the mail server performs on vSphere 5. Test results demonstrate the following:
- Due to optimizations made within Zimbra Collaboration Server and the tickless timer within Red Hat Enterprise Linux (RHEL) 6, Zimbra Collaboration Server in a virtual machine can scale up effectively, and performs within 95% of a physical host.
- Zimbra Collaboration Server scales out effortlessly, with only a 10% drop in sendmail latency as up to eight virtual machines are added.
- Zimbra Collaboration Server consumes less than half of the CPU of Microsoft Exchange Server 2010 for the same number of users. The user provisioning time for the tests is also orders of magnitude better.
For the full paper, see Zimbra Collaboration Server Performance on VMware vSphere 5.
Performance benchmarking is often conducted on top-of-the-line hardware, including hosts that typically have a large number of cores, maximum memory, and the fastest disks available. Hardware of this caliber is not always accessible to small or medium-sized businesses with modest IT budgets. As part of our ongoing investigation of different ways to benchmark the cloud using the newly released VMmark 2.0, we set out to determine whether a cluster of less powerful hosts could be a viable alternative for these businesses. We used VMmark 2.0 to see how a four-host cluster with a modest hardware configuration would scale under increasing load.
Workload throughput is often limited by disk performance, so the tests were repeated with two different storage arrays to show the effect that upgrading the storage would offer in terms of performance improvement. We tested two disk arrays that varied in both speed and number of disks, an EMC CX500 and an EMC CX3-20, while holding all other characteristics of the testbed constant.
To review, VMmark 2.0 is a next-generation, multi-host virtualization benchmark that models application performance and the effects of common infrastructure operations such as vMotion, Storage vMotion, and a virtual machine deployment. Each tile contains Microsoft Exchange 2007, DVD Store 2.1, and Olio application workloads which run in a throttled fashion. The Storage vMotion and VM deployment infrastructure operations require the user to specify a LUN as the storage destination. The VMmark 2.0 score is computed as a weighted average of application workload throughput and infrastructure operation throughput. For more details about VMmark 2.0, see the VMmark 2.0 website or Joshua Schnee’s description of the benchmark.
All tests were conducted on a cluster of four Dell PowerEdge R310 hosts running VMware ESX 4.1 and managed by VMware vCenter Server 4.1. These are typical of today’s entry-level servers; each server contained a single quad-core Intel Xeon 2.80 GHz X3460 processor (with hyperthreading enabled) and 32 GB of RAM. The servers also used two 1Gbit NICs for VM traffic and a third 1Gbit NIC for vMotion activity.
To determine the relative impact of different storage solutions on benchmark performance, runs were conducted on two existing storage arrays, an EMC CX500 and an EMC CX3-20. For details on the array configurations, refer to Table 1 below. VMs were stored on identically configured ‘application’ LUNs, while a designated ‘maintenance’ LUN was used for the Storage vMotion and VM deployment operations.
To measure the cluster's performance scaling under increasing load, we started by running one tile, then increased the number of tiles until the run failed to meet Quality of Service (QoS) requirements. As load is increased on the cluster, it is expected that the application throughput, CPU utilization, and VMmark 2.0 scores will increase; the VMmark score increases as a function of throughput. By scaling out the number of tiles, we hoped to determine the maximum load our four-host cluster of entry-level servers could support. VMmark 2.0 scores will not scale linearly from one to three tiles because, in this configuration, the infrastructure operations load remained constant. Infrastructure load increases primarily as a function of cluster size. Although showing only a two host cluster, the relationship between application throughput, infrastructure operations throughput and number of tiles is demonstrated more clearly by this figure from Joshua Schnee’s recent blog article. Secondly, we expected to see improved performance when running on the CX3-20 versus the CX500 because the CX3-20 has a larger number of disks per LUN as well as faster individual drives. Figure 1 below details the scale out performance on the CX500 and the CX3-20 disk arrays using VMmark 2.0.
Figure 1. VMmark 2.0 Scale Out On a Four-Host Cluster
Both configurations saw improved throughput from one to three tiles but at four tiles they failed to meet at least one QoS requirement. These results show that a user wanting to maintain an average cluster CPU utilization of 50% on their four-host cluster could count on the cluster to support a two-tile load. Note that in this experiment, increased scores across tiles are largely due to increased workload throughput rather than an increased number of infrastructure operations.
As expected, runs using the CX3-20 showed consistently higher normalized scores than those on the CX500. Runs on the CX3-20 outperformed the CX500 by 15%, 14%, and 12% on the one, two, and three-tile runs, respectively. The increased performance of the CX3-20 over the CX500 was accompanied by approximately 10% higher CPU utilization, which indicated that that the faster CX3-20 disks allowed the CPU to stay busier, increasing total throughput.
The results show that our cluster of entry-level servers with a modest disk array supported approximately 220 DVD Store 2.1 operations per second, 16 send-mail actions, and 235 Olio updates per second. A more robust disk array supported 270 DVD Store 2.1 operations per second, 16 send-mail actions, and 235 Olio updates per second with 20% lower latencies on average and a correspondingly slightly higher CPU utilization.
Note that this type of experiment is possible for the first time with VMmark 2.0; VMmark 1.x was limited to benchmarking a single host but the entry-level servers under test in this study would not have been able to support even a single VMmark 2.0 tile on an individual server. By spreading the load of one tile across a cluster of servers, however, it becomes possible to quantify the load that the cluster as a whole is capable of supporting. Benchmarking our cluster with VMmark 2.0 has shown that even modest clusters running vSphere can deliver an enormous amount of computing power to run complex multi-tier workloads.
In this study, we scaled out VMmark 2.0 on a four-host entry-level cluster to measure performance scaling and the maximum supported number of tiles. This put a much higher load onto the cluster than might be typical for a small or medium business so that businesses can confidently deploy their application workloads. An alternate experiment would be to run fewer tiles while measuring the performance of other enterprise-level features, such as VMware High Availability. This ability to benchmark the cloud in many different ways is one benefit of having a well-designed multi-host benchmark. Keep watching this blog for more interesting studies in benchmarking the cloud with VMmark 2.0.
We recently finished a large Exchange 2007 capacity test on VMware
ESX Server 3.5. How large? Well, larger than anything ever done before on a
single server. And we did it from start to finish in about two weeks.
We did this test because we have felt for a while that
advances in processor and server technology were about to leave another
widely-used and important application unable to fully utilize the hardware that
vendors were offering. Microsoft has guidelines on what environment works well
with Exchange, and a system with more than eight CPUs and/or 32GB of RAM is beyond
the recommended maximums.
Hardware vendors are now offering commodity servers with 16 cores (quad socket with four cores each) and enough memory slots to hold 256GB of RAM. Within a year or two we would expect this to go up even further, with commodity x86 systems being built with 32 cores. Microsoft Exchange deployments
typically work well with the ‘scale out’ model, but that causes server proliferation and underutilized hardware, especially as systems get this large. VMware ESX Server allows us to make more effective use of the hardware and improve capacity.
Using VMware ESX Server 3i version 3.5 we created eight virtual machines, each with two vCPUs and 14GB of memory, and configured 2,000 mailboxes on each one. We chose 2,000 users based on Microsoft’s recommendation of 1,000 mailboxes per core and we selected 14GB of memory in accordance with the recommendation to use 4GB + 5MB/mailbox. We used the hardware recommendations for Exchange Server
in Multi-Role configuration because each virtual machine was running the Hub, CAS, and UM components in addition to hosting the mailboxes.
We ran this test on an IBM x3850 M2 server with 128GB of RAM. The virtual machines ran Microsoft Windows Server 2003 R2 Datacenter x64 Edition with Service Pack 2 and Microsoft Exchange 2007 Server Version 8 with Service Pack 1.
The storage used for these tests was an EMC CX3-40 with 225 disks (15 drawers of 15 disks each). Each virtual machine was configured to use two LUNs of 10 disks each for the Exchange database and a three-disk LUN for logs.
We used the Microsoft Load Generator (LoadGen) tool to drive the load on the mailboxes, and ran with the heavy user profile. Here are the LoadGen settings:
- Simulated day – 8 hours
- Test run – 8 hours
- Stress mode – disabled
- No distribution lists or dynamic distribution lists for internal messages
- No contacts for outgoing messages
- No external outbound SMTP mail
- Profile used: Outlook 2007 Online, Heavy, with Pre-Test Logon
We ran the tests using both ESX Server 3.5 and ESX Server 3i version 3.5 and the performance was the same across both versions. Tests were run with one through eight virtual machines, and even in the eight virtual machine case about half the CPU resources were still available.
Disk latencies were around 6ms across our runs. The IOPS rate started off at about .65 IOPS/mailbox in the first hour but stabilized at .37 IOPS/mailbox in the last hour (once the cache was warmed up). Over the duration of the run the average rate was .45 IOPS/mailbox. The read/write ratio observed was approximately 4:1.
Sendmail latency is an important measure of the responsiveness of the Exchange Server. Figure 1 shows how it changed as more virtual machines were added to the system.
Figure 1. Sendmail Latency
A 1000ms response time is considered the threshold at which user experience starts to degrade. As can be seen from the 95th percentile response times in Figure 1, there’s still a significant amount of headroom on
this server, even at our highest tested load level.
These tests ran smoothly and demonstrated what we expected. This should come as no
surprise. As new hardware becomes available, the scalability of ESX Server allows us to easily make productive use of the additional capacity.
It took many hours and creative hardware "repurposing" from our lab personnel to put this setup together within a couple of days, and it’ll probably take them even longer to get everything back to its original place. I’d like to acknowledge that without their efforts, we wouldn’t have been able to get this done.
The large number of companies already running Microsoft Exchange Server on VMware ESX Server are experiencing improved resource utilization and better manageability as well as lower space, power, and cooling costs. New servers with greater processing power make the transition to Exchange on ESX Server even more compelling.
We’re really excited about the buzz around Oracle in virtualized environments. One of the best kept secrets is just how well Oracle performs on VMware ESX. This didn’t happen by accident – there are a number of features and performance optimizations in the VMware ESX server architecture, specifically for databases.
In this blog, I’ll walk through the top ten most important features for getting the best database performance. Here are a few of the performance highlights:
- Near Native Performance: Oracle databases run at performance similar to that of a physical system
- Extreme Database I/O Scalability: VMware ESX Server’s thin hypervisor layer can drive over 63,000 database I/Os per second (fifty times the requirement of a typical database)
- Multi-core Scaling: Scale up using SMP virtual machines and multiple database instances
- Large Memory : Scalable memory – 64GB per database, 256GB per host
We’ve continued to invest a great deal of work towards optimizing Oracle performance on VMware, because it’s already one of the most commonly virtualized applications. The imminent ESX 3.5 release is our best database platform to date, with several new advanced optimizations.
In this blog article we’d like to explain the unique and demanding nature of database applications such as Oracle produces and show the performance capabilities of ESX Server on this type of workload.
The Nature of Databases
Databases have some unique properties, such as a-large memory footprint. At the outset this can make them slightly more complex to virtualize well. However this has proven to be an opportunity, since we can optimize specifically for these defining properties.
- Large Memory: Databases use large amounts of memory to cache their storage. A large cache is one of the most important performance criteria for databases, since it can often reduce physical I/O by 10-100 fold.
- High Performance Block I/O: Databases read and write their data in fixed, block sized chunks. The I/Os are typically small, and operate at a very high rate on a small number of files or devices.
- Throughput Oriented: Databases often have a large number of concurrent users, giving them natural parallelism and makes them ideally suited to take advantage of systems with multiple logical or physical processors.
Understanding and Quantifying Virtual Performance
The performance of a virtualized system should first be quantified in terms of latency and throughput, and then in terms of how efficiently resources are being used. For example, if a physical system is delivering 10000 transactions per minute at 500ms latency per transaction, then a virtualized system that is performing at 100% of native should provide the same level of throughput with acceptable latency characteristics. Secondary should be a metric of resource usage, which is a measure of how many additional physical resources were used to achieve the same level of performance. It’s sometimes overly easy to focus primarily on the CPU resource, when in reality memory and I/O are much more expensive resources to provision. This is becoming especially important going forward, as multicore CPUs continue to lower the cost per processor core, while memory cost remains at a premium.
More important for Oracle is the ability to scale up by taking advantage of multi-core CPUs, large memories, and the I/O throughput through the hypervisor to support the large number of disk spindles in the backend storage arrays.
Database Performance Myths
There are a few common myths about virtualizing databases:
- Databases have a high overhead when virtualized: Virtualized Databases can perform at or near the speed of physical systems, in terms of latency and throughput. The virtualization overhead for typical real-world databases is minimal – for VMware ESX Server, we measured CPU overhead to be less than 10%.
- Databases have too much I/O to be virtualized: Databases typically have a large number of small random I/Os, and it is in theory possible to hit a scaling ceiling in the hypervisor layer. VMware ESX’s thin hypervisor layer can drive over 63,000 database I/Os per second, which is equivalent to more than 600 disk spindles of I/O throughput. This is sufficient I/O scaling for even the largest databases on x86 systems.
- Virtualization should only be used for smaller, non-critical applications: The ESX hypervisor is very robust: many customers are seeing over two years of uptime from ESX based systems. In addition, the ESX hypervisor remains stable, even if resources are overcomitted.
There isn’t one quick hit to make databases work well for a wide range of real-world applications – good performance is something that is earned from the long term discipline of focusing the lessons learned from many customer-oriented real-world database workloads, and applying those lessons across the architecture of the hypervisor.
Let’s take a quick walk through the specific features that you should look for in the hypervisor for good database performance.
1: High Performance I/O in VMware ESX
Throughput and latency of the I/O system are critical to performance of online transaction processing systems. Since transaction database systems operate on small data items at random places in the dataset, it’s important that we measure random I/O throughput (measured in I/O operations per second), rather than bandwidth (MB/s).
Figure.1 VMware ESX I/O Driver Model
Since the hypervisor logically resides between the database in the guest virtual machine and the backend storage, it is critical that the hypervisor’s I/O facilities scale up without any performance ceilings, and don’t add any appreciable latency. The I/O subsystem in VMware ESX shown in Figure.1 uses a direct driver model, so that there is minimal latency added by the virtualization stack. This is possible because I/O requests can be handled in-line by the same processor as the requesting virtual machine (other architectures add substantial latency and CPU overhead when I/O is proxied via a heavy-weight domain-0 or parent-partition).
Oracle databases typically issue many small 4Kbyte or 8Kbyte sized I/Os in a random access pattern. For these I/O’s, a single typical disk can deliver somewhere in the order of 100-200 I/Os per second, depending on the rotational speed of the disk, though in practice, it’s best not to push the drives beyond 100 IOPS each. The throughput of the VMware ESX 3.5 hypervisor has been increased significantly, and shows that more than 60,000 I/Os per second can be sustained – the throughput of over 600 disks.
Results from an VMware study of its customers, showed that across 15,000 Oracle servers the average number of I/Os per second for a loaded 4 processor system is 1280, which is approximately the throughput of 15 disks. Since some workloads are more demanding than others, and some are bursty in nature, it’s important to have substantially more headroom. The throughput capability of the ESX’s I/O subsystem is sufficient for more than even the most demanding database.
2: Scale Up using Virtual SMP
VMware ESX can take advantage of systems with multiple physical
processors in two ways, by scaling out through multiple virtual
machines, and by scaling up each virtual machine to use more than one
physical processor. VMware ESX provides a Virtual SMP capability,
allowing up to four processors in each guest virtual machine, and up to
64 processors in the physical system.
Figure.3 – Virtual SMP
Since database workloads typically have a large number of concurrent
users, they are explicitly parallel and can easily process more than
one task at a time.
Oracle is able to take advantage of VMware’s Virtual SMP, so
performance can be scaled beyond a single processor for each virtual
machine. To demonstrate this, we ran several benchmarks with Oracle
database 10g Release 2, using the popular SwingBench on-line transaction processing
workload. Figure.5 shows the throughput of Oracle with an
increasing number of processors in a single virtual machine. The
benchmark measures transaction throughput, and shows 94% scaling as
additional processors are added. Incidentally, this is exactly the same as
the scaling we see on native, which is likely due to hardware and
database scaling artifacts.
One of the key requirements for consolidation is good scalability with a large number of database instances. To show this, we ran multiple SMP instances of Oracle 10G on VMware ESX Server 3.5. Figure.3 shows the scaling of the VMware ESX platform when running
the open source DVDstore database benchmark. The benchmark is run using client-server mode, so that we can focus on the database tier . In this study, we scaled the benchmark using one through seven dual processor SMP Linux virtual machines, each with its own database instance. We’ll be posting further details of this benchmark on Vroom soon.
3: Scale up with Large Memory
Oracle databases love memory. The primary use is for caching pre-compiled SQL queries and caching blocks from disk in memory.
Database designers go to significant effort to avoid doing disk I/O when possible. This is because the latency of a disk I/O is substantially higher than the time a transaction will spend on the CPU. For example, a disk I/O takes on the order of 10 milliseconds, while the typical transaction takes just a few milliseconds of CPU time. If an I/O to disk can be cached in memory, it then could be serviced in a fraction of a millisecond. In addition, since disks are expensive, the cost of the storage systems for databases is often more affected by the I/Os per second that it can deliver, than by pure storage space. Lowering overall disk throughput can mean significantly lowering the cost of the system.
Larger memory sizes help Oracle by caching more disk blocks in memory. Consider this simple example: if a database system is using memory to cache it’s disk I/O, and is yielding a 90% cache hit rate, this means that one in every ten accesses causes a physical I/O. For 10,000 accesses a second, we would see 1,000 I/O’s per second. If we increase memory to improve the cache hit rate to 99%, then we reduce the I/Os to one in one-hundred, reducing the physical I/O by 10x to only 100 I/Os per second.
Often, over 80% of the memory used by the guest operating system is used by the Oracle disk block cache. A general rule of thumb is that the database cache be sized at 5-10% of the database size, and that doubling memory improves throughput by about 20%. This is obviously very workload dependent, but you can see that larger memory sizes help improve resource efficiency in other areas of the system, and that generally, more is better. For these reasons, large memory is very important for databases. VMware ESX 3.0 allowed 16GB of RAM per guest, and 3.5 increases the capability to 64GB per guest.
Due to the inherent gains in processor utilization through consolidation of workloads, we can squeeze more workloads onto a single system. This means that the average memory requirement per physical processor is on average twice that of a traditional unvirtualized system. To accommodate these growing requirements, we’ve pushed the memory scalability curve considerably in ESX 3.5, and now supports up to 256 Gigabytes of RAM on the new high-end systems from Sun, IBM and Unisys.
4: Large Pages in ESX 3.5 Hypervisor
Oracle databases have used large pages in the CPU’s MMU to optimize memory performance for some time. This facility is used with the operating system’s large page feature, typically for the large shared memory segments that hold the database’s disk block cache. Large pages are supported on Linux, Windows and Solaris guests. Oracle typically yields a 5-20% performance advantage with large pages, depending on the type of processor and the size of the configured memory.
Other x86 hypervisors don’t provide virtual large page capability, so this optimization is lost when the database is virtualized. The ESX 3.5 hypervisor provides advanced large-page support which allows the database to properly exploit the CPU’s large page capability.
5: ESX Optimization for NUMA systems
Many of the interesting new hardware systems today are implemented using non-uniform memory architectures. This means that not all memory is of uniform speed – accessing memory that is closer topologically to the processor is faster than memory that is further away.
To ensure optimal performance, the VMware ESX hypervisor allocates memory for the guest operating systems from physical memory near the CPU on which the guest resides on.
6: High-performance Paravirtualized Networking
Paravirtualization is a term used to describe when the guest operating system has some knowledge of the hypervisor, and can leverage this knowledge to optimize it’s execution in concert with the hypervisor. The VMware ESX hypervisor uses paravirtualized networking drivers in the guest operating system to provide high performance networking. These drivers are installed automatically through the VMware tools package at the time the guest is first powered on. Unlike CPU paravirtualization, paravirtualized drivers do not require any changes to the guest operating system – they are simply installed as transparent new drivers.
VMware ESX can drive gigabit Ethernet at line rate, as demonstrated in the paper Networking Performance in Multiple Virtual Machines. The networking performance of ESX 3.5 has further improved by incorporating new stateless offload features, such as large-segment-offload (LSO) and jumbo frames – and now achieves near line-rate (9.9gbits/s) on 10Gbit Ethernet.
The performance of networking can be increased beyond that of one NIC by scaling across multiple NICs. Figure.6 shows the scaling of gigabit Ethernet performance as multiple NICs are added.
7. Use VMware ESX’s Page-Sharing to use less memory
The ESX hypervisor can safely share physical memory that has the same contents through a facility known as transparent page sharing. Through page sharing, the hypervisor arranges for a single physical page of memory to back multiple pages in the guest, so that just one copy of the data need reside in memory. Using this technique, the total amount of memory consumed is less than the sum of the parts. The hypervisor ensures full security isolation — if one guest modifies a page, then it get its own private copy.
This facility can be used effectively to save memory with Oracle in several ways. When there are multiple instances of a database running, the page sharing facility will automatically share the code portions of the operating system and the Oracle instance. This often results in saving in the order of a few hundred megabytes of memory per virtual machine.
When multiple databases are sharing similar data – for example, a shared reference table or multiple copies of the database for development purposes, ESX can automatically detect the duplicate disk blocks in the Oracle disk block cache, and arrange to share those. Thus, the database cache memory image can be transparently shared across database instances, and across virtual machines. This can result in a further saving in the order of tens of megabytes (at least the system tables will be the same), through several gigabytes, depending on the amount of common disk data between the instances.
As an additional benefit, some memory can be shared within each instance. This is often for zero pages.
8: Paravirtualized CPU
Figure.7 – Virtualization Techniques
There are various techniques used to virtualize the x86 instruction set, including binary translation, paravirtualization and hardware assist. Binary translation has long been used by the VMware hypervisor to provide near native performance for virtualization for many workloads. CPU Paravirtualization or hardware assist are two approaches that can be used to provide small optimizations for workloads with many system calls as well as providing certain memory optimizations. No single approach is best for all workloads, and in VMware ESX, different approaches are used for different workloads. Ole Agesen and Keith Adams help explain the different technologies in their paper about the performance of virtualization.
VMware ESX can optionally use paravirtualization for some guest operating system types. In a recent study, the performance of Oracle 10g R2 using the Swingbench online transaction processing workload on a paravirtualized Linux guest shows a moderate gain of 10% when using paravirtualized CPU interfaces.
9: The Best Oracle-Windows Performance
Since all of the key CPU, memory and I/O virtualization capabilities are in a portable layer of the hypervisor, the performance of Oracle on Windows guests is equivalent to that of Linux guests. Oracle on Windows is able to take advantage of large-pages, SMP, and I/O scalability as well as our high performance paravirtualized networking drivers.
For Linux, Solaris and Unix administrators, this means that you have the freedom to choose the OS which has the best tools to facilitate your deployment. For Windows administrators, it means that you can confidently run your Oracle databases on Windows, with the same levels of performance and scalability.
10: Universal 32-bit and 64-bit Guest Support
To take advantage of more than 3.5Gbytes of memory in a guest, databases need to be configured as 64-bit applications, and use a 64-bit capable operating system. VMware ESX allows mixed 32-bit and 64-bit guests concurrently, thu simplifying the deployment of 64-bit guests when needed.
To all the database administrators out there, watch for more VROOM posts about VMware performance with Oracle, the new Oracle portal, which will contain plenty of good resources for virtualization of Oracle. Also, there is a new Oracle Discussion Forum – feel free to discuss Oracle performance over at the forum too. Virtual database performance has never been so good!
Last month we published a Tech Note summarizing networking throughput results using ESX Server 3.0.1 and XenEnterprise 3.2.0. Multiple NICs were used in order to achieve the maximum throughput possible in a single uniprocessor VM. While these results are very useful for evaluating the virtualization overhead of networking, a more common configuration is to spread the networking load across multiple VMs. We present results for multi-VM networking in a new paper just published. Only a single 1 Gbps NIC is used per VM, but with up to four VMs running simultaneously. This simulates a consolidation scenario of several machines each with substantial, but not extreme, networking I/O. Unlike the multi-NIC paper, there is no exact native analog, but we ran the same total load in a SMP native Windows machine for comparison. The results are similar to the earlier ones: ESX stays close to native performance, achieving up to 3400 Mbps for the 4-VM case. XenEnterprise peaks at 3 VMs and falls off to 62-69% of the ESX throughput with 4 VMs. According to the XenEnterprise documentation only three physical NICs are supported in the host, even though the UI let us configure and run four physical NICs without error or warning. This is not surprising given the performance. We then tried a couple of experiments (like making dom0 use more than 1 CPU) to fix the bottleneck, but only succeeded in further reducing the throughput. The virtualization layer in ESX is always SMP, and together with a battle-tested scheduler and support for 32 e1000 NICs, scales to many heavily-loaded VMs. Let us know if you’re able to reach the limits of ESX networking!