
Tag Archives: benchmarking

VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2

In part 1, we presented the VDI benchmark results on VSAN for 3-node and 7-node configurations. In this blog, we update the results with 5-node and 8-node VSAN configurations and show how VSAN scales across these cluster sizes.

The View Planner benchmark was run again to find the VDImark for different numbers of nodes (5 and 8) in a VSAN cluster, as described in the previous blog, and the results are shown in the following figure.

View Planner QoS (VDImark)

 

In the 5-node cluster, a VDImark score of 473 was achieved, and for the 8-node cluster, a VDImark score of 767 was achieved. These results are similar to the ones we saw on the 3-node and 7-node clusters earlier (about 95 VMs per host). So, the maximum number of VMs supported scales nicely as the number of nodes in the VSAN cluster is increased from 3 to 8.
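The per-host density behind that claim is easy to check. Below is a minimal sketch, using the scores quoted in this series (286 VMs on 3 nodes from part 1, 473 on 5 nodes, and 767 on 8 nodes), that computes VMs per host; it is purely illustrative.

    # Illustrative only: per-host VM density from the VDImark scores quoted
    # in this blog series (part 1 reported 286 VMs on the 3-node cluster).
    vdimark = {3: 286, 5: 473, 8: 767}   # nodes -> VDImark (max supported VMs)

    for nodes, score in sorted(vdimark.items()):
        print(f"{nodes}-node cluster: {score} VMs total, {score / nodes:.1f} VMs per host")
    # Each configuration lands near 95 VMs per host, i.e. close to linear scaling.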

To further illustrate user experience, we show the average response times of the individual Group-A and Group-B operations for these runs, as follows.

Group-A Response Times

As seen in the figure above, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience. Looking at the new results for the 5-node and 8-node VSAN clusters, we see that for most operations the response time remains roughly the same across the different node configurations.

Group-B Response Times

Since Group-B is more sensitive to I/O and CPU usage, the above chart for Group-B operations is more important for seeing how View Planner scales. The chart shows that there is not much difference in the response times as the number of VMs was increased from 286 VMs on a 3-node cluster to 767 VMs on an 8-node cluster. Hence, storage-sensitive VDI operations also scale well as we scale the VSAN cluster from 3 to 8 nodes, and user experience expectations are met.

To see other parts of the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

 

 

Line-Rate Performance with 80GbE and vSphere 5.5

With the increasing number of physical cores in a system, the networking bandwidth requirement per server has also increased. We often find many networking-intensive applications are now being placed on a single server, which results in a single vSphere server requiring more than one 10 Gigabit Ethernet (GbE) adapter. Additional network interface cards (NICs) are also deployed to separate management traffic and the actual virtual machine traffic. It is important for these servers to service the connected NICs well and to drive line rate on all the physical adapters simultaneously.

vSphere 5.5 supports eight 10GbE NICs on a single host, and we demonstrate that a host running with vSphere 5.5 can not only drive line rate on all the physical NICs connected to the system, but can do it with a modest increase in overall CPU cost as we add more NICs.

We configured a single host with four dual-port Intel 10GbE adapters for the experiment and connected them back-to-back with an IXIA Application Network Processor Server with eight 10GbE ports to generate traffic. We then measured the send/receive throughput and the corresponding CPU usage of the vSphere host as we increased the number of NICs under test on the system.

Environment Configuration

  • System Under Test: Dell PowerEdge R820
  • CPUs: 4 x Intel Xeon Processor E5-4650 @ 2.70 GHz
  • Memory: 128GB
  • NICs: 8 x Intel 82599EB 10GbE SFP+ Network Connection
  • Client: Ixia Xcellon-Ultra XT80-V2, 2U Application Network Processor Server

Challenges in Getting 80Gbps Throughput

To drive near 80 gigabits of data per second from a single vSphere host, we used a server that has not only the required CPU and memory resources, but also the PCI bandwidth that can perform the necessary I/O operations. We used a Dell PowerEdge Server with an Intel E5-4650 processor because it belongs to the first generation of Intel processors that supports PCI Gen 3.0. PCI Gen 3.0 doubles the PCI bandwidth capabilities compared to PCI Gen 2.0. Each dual-port Intel 10GbE adapter needs at least a PCI Gen 2.0 x8 to reach line rate. Also, the processor has Intel Data Direct I/O Technology where the packets are placed directly in the processor cache rather than going to the memory. This reduces the memory bandwidth consumption and also helps reduce latency.
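As a rough sanity check of the PCI bandwidth argument, the sketch below compares the usable bandwidth of PCIe Gen 2.0 x8 and Gen 3.0 x8 slots against the needs of a dual-port 10GbE adapter. The per-lane rates and encoding efficiencies are the standard published PCIe figures, not measurements from this experiment.

    # Back-of-the-envelope PCIe bandwidth check (standard PCIe figures, assumed,
    # not measured in this experiment).
    def pcie_gbps(lanes, gt_per_s, encoding_efficiency):
        """Approximate usable bandwidth per direction, in Gbps."""
        return lanes * gt_per_s * encoding_efficiency

    gen2_x8 = pcie_gbps(8, 5.0, 8 / 10)      # PCIe Gen 2.0: 5 GT/s, 8b/10b   -> ~32 Gbps
    gen3_x8 = pcie_gbps(8, 8.0, 128 / 130)   # PCIe Gen 3.0: 8 GT/s, 128b/130b -> ~63 Gbps
    dual_port_10gbe = 2 * 10                 # ~20 Gbps of NIC traffic per direction

    print(f"Gen 2.0 x8: ~{gen2_x8:.0f} Gbps, Gen 3.0 x8: ~{gen3_x8:.0f} Gbps, "
          f"dual-port 10GbE needs ~{dual_port_10gbe} Gbps")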

Experiment Overview

Each 10GbE port of the vSphere 5.5 server was configured with a separate vSwitch, and each vSwitch had two Red Hat 6.0 Linux virtual machines running an instance of Apache web server. The web server virtual machines were configured with 1 vCPU and 2GB of memory with VMXNET3 as the virtual NIC adapter.  The 10GbE ports were then connected to the Ixia Application Server port. Since the server had two x16 slots and five x8 slots, we used the x8 slots for the four 10GbE NICs so that each physical NIC had identical resources. For each physical connection, we then configured 200 web/HTTP connections, 100 for each web server, on an Ixia server that requested or posted the file. We used a high number of connections so that we had enough networking traffic to keep the physical NIC at 100% utilization.

Figure 1. System design of NICs, switches, and VMs

The Ixia Xcellon application server used an HTTP GET request to generate a send workload for the vSphere host. Each connection requested a 1MB file from the HTTP web server.
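For readers without access to a hardware traffic generator, the sketch below is a hypothetical, much smaller software stand-in for that GET workload: many concurrent connections, each repeatedly fetching a 1MB file from a web server VM. The URL and counts are placeholders; the actual tests used the Ixia appliance, not this script.

    # Hypothetical software stand-in for the Ixia HTTP GET workload; the URL,
    # connection count, and request count are placeholders.
    import concurrent.futures
    import urllib.request

    URL = "http://web-server-vm/1mb.bin"   # a 1MB file served by the Apache VM
    CONNECTIONS = 100                      # connections per web server VM
    REQUESTS_PER_CONNECTION = 1000

    def worker(_conn):
        fetched = 0
        for _ in range(REQUESTS_PER_CONNECTION):
            with urllib.request.urlopen(URL) as resp:
                fetched += len(resp.read())
        return fetched

    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as pool:
        total_bytes = sum(pool.map(worker, range(CONNECTIONS)))
    print(f"Transferred {total_bytes / 1e9:.1f} GB")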

Figure 2 shows that we could consistently get the available[1] line rate for each physical NIC as we added more NICs to the test. Each physical NIC was transmitting 120K packets per second and the average TSO packet size was close to 10K. The NIC was also receiving 400K packets per second for acknowledgements on the receive side. The total number of packets processed per second was close to 500K for each physical connection.

Figure 2. vSphere 5.5 drives throughput at available line rates. TSO on the NIC resulted in lower packets per second for send.

Similar to the send case, we configured the application server to post a 1MB file using an HTTP POST request to generate receive traffic for the vSphere host. We used the same number of connections and observed similar throughput behavior. Since the NIC does not have hardware LRO support, we were getting 800K packets per second for each NIC; with eight 10GbE NICs, the packet rate reached close to 6.4 million packets per second. ESXi performs software LRO for Linux, so we see large packets in the guest, and the guest packet rate is around 240K packets per second. There was also significant TCP acknowledgement traffic: the host was transmitting close to 120K acknowledgement packets for each physical NIC, bringing the total packets processed to close to 7.5 million per second for eight 10GbE ports.
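The packet-rate arithmetic in the two paragraphs above can be reconstructed roughly as follows; the per-packet sizes are approximations and the counts are the figures reported in the text.

    # Rough reconstruction of the reported per-port packet rates (approximate).
    send_tso_pps    = 120_000   # large TSO packets transmitted per second per port
    send_tso_bytes  = 10_000    # ~10 KB average TSO packet
    send_ack_rx_pps = 400_000   # acknowledgements received per second per port

    recv_wire_pps   = 800_000   # ~MTU-sized packets received per second (no hardware LRO)
    recv_ack_tx_pps = 120_000   # acknowledgements transmitted per second per port
    ports           = 8

    send_gbps = send_tso_pps * send_tso_bytes * 8 / 1e9
    print(f"Send: ~{send_gbps:.1f} Gbps and ~{(send_tso_pps + send_ack_rx_pps) / 1e3:.0f}K packets/s per port")

    recv_total_pps = ports * (recv_wire_pps + recv_ack_tx_pps)
    print(f"Receive: ~{recv_total_pps / 1e6:.1f} million packets/s across {ports} ports")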

Figure 3. Average vSphere 5.5 host CPU utilization for send and receive

We also measured the average CPU reported for each of the tests. Figure 3 shows that the vSphere host’s CPU usage increased linearly as we added more physical NICs to the test for both send and receive. This indicates that performance improves at an expected and acceptable rate.

Test results show that vSphere 5.5 is an excellent platform on which to deploy networking-intensive workloads. vSphere 5.5 makes use of all the physical bandwidth capacity available, and does so with only a modest, linear increase in CPU cost as NICs are added.

 


[1] A 10GbE NIC can achieve only about 9.4 Gbps of throughput with a standard MTU. For a 1500-byte packet, 40 bytes go to the TCP/IP headers, and each Ethernet frame adds 38 bytes of framing overhead on the wire.
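A quick worked version of this footnote, assuming the header and framing overheads listed above:

    # Effective TCP goodput of a 10GbE link with a standard 1500-byte MTU.
    mtu          = 1500
    tcp_ip_hdrs  = 40    # TCP + IP headers per packet
    eth_overhead = 38    # Ethernet header/FCS plus preamble and inter-frame gap

    payload   = mtu - tcp_ip_hdrs    # 1460 bytes of application data
    wire_size = mtu + eth_overhead   # 1538 bytes on the wire per packet
    print(f"Max goodput: {10 * payload / wire_size:.2f} Gbps")   # roughly 9.4-9.5 Gbps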

Power Management and Performance in ESXi 5.1

Power and cooling are a substantial portion of datacenter costs. Ideally, we could minimize these costs by optimizing the datacenter’s energy consumption without impacting performance. The Host Power Management feature, which has been enabled by default since ESXi 5.0, allows hosts to reduce power consumption while boosting energy efficiency by putting processors into a low-power state when not fully utilized.

Power management can be controlled by either the BIOS or the operating system. In the BIOS, manufacturers provide several types of Host Power Management policies. Although they vary by vendor, most include “Performance,” which does not use any power saving techniques, “Balanced,” which claims to increase energy efficiency with minimal or no impact to performance, and “OS Controlled,” which passes power management control to the operating system. The “Balanced” policy is variably known as “Performance per Watt,” “Dynamic,” and other labels; consult your vendor for details. If “OS Controlled” is enabled in the BIOS, ESXi will manage power using one of the policies “High performance,” “Balanced,” “Low power,” or “Custom.” We chose to study Balanced because it is the default setting.

But can the Balanced setting, whether controlled by the BIOS or ESXi, reduce performance relative to the Performance setting? We have received reports from customers who have had performance problems while using the BIOS-controlled Balanced setting. Without knowing the effect of Balanced on performance and energy efficiency, when performance is at a premium users might select the Performance policy to play it safe. To answer this question we tested the impact of power management policies on performance and energy efficiency using VMmark 2.5.

VMmark 2.5 is a multi-host virtualization benchmark that uses varied application workloads as well as common datacenter operations to model the demands of the datacenter. VMs running diverse application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview.

We tested three policies: the BIOS-controlled Performance setting, which uses no power management techniques, the ESXi-controlled Balanced setting (with the BIOS set to OS-Controlled mode), and the BIOS-controlled Balanced setting. The ESXi Balanced and BIOS-controlled Balanced settings cut power by reducing processor frequency and voltage among other power saving techniques.

We found that the ESXi Balanced setting did an excellent job of preserving performance, with no measurable performance impact at all levels of load. Not only was performance on par with the Performance setting, but the ESXi Balanced setting also delivered consistent improvements in energy efficiency, even while idle. By comparison, the BIOS Balanced setting aggressively saved power but created higher latencies and reduced performance. The following results detail our findings.

Testing Methodology
All tests were conducted on a four-node cluster running VMware vSphere 5.1. We compared performance and energy efficiency of VMmark between three power management policies: Performance, the ESXi-controlled Balanced setting, and the BIOS-controlled Balanced setting, also known as “Performance per Watt (Dell Active Power Controller).”

Configuration
Systems Under Test: Four Dell PowerEdge R620 servers
CPUs (per server): One Eight-Core Intel® Xeon® E5-2665 @ 2.4 GHz, Hyper-Threading enabled
Memory (per server): 96GB DDR3 ECC @ 1067 MHz
Host Bus Adapter: Two QLogic QLE2562, Dual Port 8Gb Fibre Channel to PCI Express
Network Controller: One Intel Gigabit Quad Port I350 Adapter
Hypervisor: VMware ESXi 5.1.0
Storage Array: EMC VNX5700
62 Enterprise Flash Drives (SSDs), RAID 0, grouped as 3 x 8 SSD LUNs, 7 x 5 SSD LUNs, and 1 x 3 SSD LUN
Virtualization Management: VMware vCenter Server 5.1.0
VMmark version: 2.5
Power Meters: Three Yokogawa WT210

Results
To determine the maximum VMmark load supported for each power management setting, we increased the number of VMmark tiles until the cluster reached saturation, which is defined as the largest number of tiles that still meet Quality of Service (QoS) requirements. All data points are the mean of three tests in each configuration and VMmark scores are normalized to the BIOS Balanced one-tile score.

Effects of Power Management on VMmark 2.5 score

The VMmark scores were equivalent between the Performance setting and the ESXi Balanced setting with less than a 1% difference at all load levels. However, running on the BIOS Balanced setting reduced the VMmark scores an average of 15%. On the BIOS Balanced setting, the environment was no longer able to support nine tiles and, even at low loads, on average, 31% of runs failed QoS requirements; only passing runs are pictured above.

We also compared the improvements in energy efficiency of the two Balanced settings against the Performance setting. The Performance per Kilowatt metric, which is new to VMmark 2.5, models energy efficiency as VMmark score per kilowatt of power consumed. More efficient results will have a higher Performance per Kilowatt.
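The sketch below illustrates the idea of the metric with made-up numbers; it is not VMmark’s official scoring code.

    # Performance per Kilowatt = VMmark score / average power draw in kilowatts.
    # The inputs below are illustrative, not measured results from these tests.
    def perf_per_kw(vmmark_score, avg_power_watts):
        return vmmark_score / (avg_power_watts / 1000.0)

    print(perf_per_kw(5.0, 1200))   # e.g. a Performance-policy run
    print(perf_per_kw(5.0, 1160))   # same score at ~3% less power -> ~3% higher efficiency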

Effects of Power Management on Energy Efficiency

Two trends are visible in this figure. As expected, the Performance setting showed the lowest energy efficiency. At every load level, ESXi Balanced was about 3% more energy efficient than the Performance setting, despite the fact that it delivered an equivalent score to Performance. The BIOS Balanced setting had the greatest energy efficiency, 20% average improvement over Performance.

Second, increase in load is correlated with greater energy efficiency. As the CPUs become busier, throughput increases at a faster rate than the required power. This can be understood by noting that an idle server will still consume power, but with no work to show for it. A highly utilized server is typically the most energy efficient per request completed, which is confirmed in our results. Higher energy efficiency creates cost savings in host energy consumption and in cooling costs.

The bursty nature of most environments leads them to sometimes idle, so we also measured each host’s idle power consumption. The Performance setting showed an average of 128 watts per host, while ESXi Balanced and BIOS Balanced consumed 85 watts per host. Although the Performance and ESXi Balanced settings performed very similarly under load, hosts using ESXi Balanced and BIOS Balanced power management consumed 33% less power while idle.

VMmark 2.5 scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. We wanted to see how power management policies affected application latency as opposed to the VMmark score. All latencies are normalized to the lowest results.

Effects of Power Management on VMmark 2.5 Latencies

Whereas the Performance and ESXi Balanced latencies tracked closely, BIOS Balanced latencies were significantly higher at all load levels. Furthermore, latencies were unpredictable even at low load levels, and for this reason, 31% of runs between one and eight tiles failed; these runs are omitted from the figure above. For example, half of the BIOS Balanced runs did not pass QoS requirements at four tiles. These higher latencies were the result of aggressive power saving by the BIOS Balanced policy.

Our tests showed that ESXi’s Balanced power management policy didn’t affect throughput or latency compared to the Performance policy, but did improve energy efficiency by 3%. While the BIOS-controlled Balanced policy improved power efficiency by an average of 20% over Performance, it was so aggressive in cutting power that it often caused VMmark to fail QoS requirements.

Overall, the BIOS-controlled Balanced policy produced substantial efficiency gains but with unpredictable performance, failed runs, and reduced performance at all load levels. This policy may still be suitable for some workloads that can tolerate this unpredictability, but it should be used with caution. On the other hand, the ESXi Balanced policy produced modest efficiency gains while doing an excellent job of protecting performance across all load levels. These findings make us confident that the ESXi Balanced policy is a good choice for most types of virtualized applications.

Exploring FAST Cache Performance Using VMmark 2.1.1

A system’s performance is often limited by the access time of its hard disk drive (HDD). Solid-state drives (SSDs), also known as Enterprise Flash Drives (EFDs), tout a superior performance profile to HDDs. In our previous comparison of EFD and HDD technologies using VMmark 2.1, we showed that EFD reads were on average four times faster than HDD reads, while EFD and HDD write speeds were comparable. However, EFDs are more costly per gigabyte.

Many vendors have attempted to address this issue using tiered storage technologies. Here, we tested the performance benefits of EMC’s FAST Cache storage array feature, which merges the strengths of both technologies. FAST Cache is an EFD-based read/write storage cache that supplements the array’s DRAM cache by giving frequently accessed data priority on the high performing EFDs. We used VMmark 2, a multi-host virtualization benchmark, to quantify the performance benefits of FAST Cache. For more details, see the overview, release notes for VMmark 2.1, and release notes for 2.1.1. VMmark 2 is an ideal tool to test FAST Cache performance for virtualized datacenters in that its varied workloads and bursty I/O patterns model the demands of the datacenter. We found that FAST Cache produced remarkable improvements in datacenter capacity and storage access latencies. With the addition of FAST Cache, the system could support twice as much load while still meeting QoS requirements.

FAST Cache
FAST Cache is a feature of EMC’s storage systems that tracks frequently accessed data on disk, promotes the data into an array-wide EFD cache to take advantage of Flash I/O access speeds, then writes it back to disk when the data is superseded in importance. FAST Cache optimizes the use of EFD storage. In most workloads only a small percentage of data will be frequently accessed. This is referred to as the ‘working set.’ An EFD-based cache allows the data in the working set to take advantage of the performance characteristics of EFDs while the rest of the data stays on lower-cost HDDs. Relevant data is rapidly promoted into the cache in increments of 64 KB pages, and a least-recently-used algorithm is used to decide which data to write back to disk.
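A highly simplified, hypothetical model of that behavior is sketched below: I/O is tracked in 64 KB pages, pages hit repeatedly are promoted into a fixed-size flash cache, and the least-recently-used page is evicted back to disk when the cache is full. This is an illustration of the concept, not EMC’s actual FAST Cache implementation.

    # Toy model of an EFD read/write cache with 64 KB promotion and LRU writeback.
    # Purely illustrative; thresholds and policies are assumptions.
    from collections import OrderedDict

    PAGE = 64 * 1024        # promotion granularity: 64 KB pages
    PROMOTE_AFTER = 3       # promote a page once it has been accessed this often

    class TinyFlashCache:
        def __init__(self, capacity_pages):
            self.capacity = capacity_pages
            self.cache = OrderedDict()   # page -> None, ordered by recency
            self.access_counts = {}

        def access(self, byte_offset):
            page = byte_offset // PAGE
            if page in self.cache:                    # cache hit: refresh recency
                self.cache.move_to_end(page)
                return "flash hit"
            self.access_counts[page] = self.access_counts.get(page, 0) + 1
            if self.access_counts[page] >= PROMOTE_AFTER:
                if len(self.cache) >= self.capacity:
                    self.cache.popitem(last=False)    # evict LRU page (write back to disk)
                self.cache[page] = None               # promote the hot page to flash
            return "disk access"

A sequential scan touches each 64 KB page only once, so nothing in this model ever reaches the promotion threshold, which matches the point below about sequential I/O seeing little benefit.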

The benefit achieved with FAST Cache depends on the workload’s I/O profile. As with most caches, FAST Cache will show the most benefit for I/O with a high locality of reference, such as database indices and reference tables. FAST Cache will be least beneficial to workloads with sequential I/O patterns like log files or large I/O size access because these may not access the same 64 KB block multiple times and the FAST Cache would never become populated.

Configuration
Systems Under Test: Four Dell PowerEdge R310 Servers
CPUs (per server): One Quad-Core Intel® Xeon® X3460 @ 2.8 GHz, Hyper-Threading enabled
Memory (per server): 32 GB DDR3 ECC @ 800 MHz
Storage Array: EMC VNX5500
FAST Cache configurations:
366 GB FAST Cache, 8 EFDs, RAID 1
92 GB FAST Cache, 2 EFDs, RAID 1
FAST Cache disabled
LUN configurations:
20 HDDs, 10K RPM, grouped into 3 LUNs of 8, 8, and 4 HDDs each
11 HDDs, 10K RPM, grouped into 3 LUNs of 4, 4, and 3 HDDs each
Hypervisor: ESXi 5.0.0
Virtualization Management: VMware vCenter Server 5.0
VMmark version: 2.1.1

Methodology
We used VMmark 2 to investigate several different factors relating to FAST Cache. We wanted to measure the performance benefit afforded by adding FAST Cache into a VMmark 2 environment and we wanted to observe how the performance benefit of FAST Cache would scale as we changed the size of the cache. We tested with FAST Cache disabled and with two different FAST Cache sizes which were made from two EFDs and eight EFDs in RAID 1, creating a cache of 92 GB and 366 GB usable space, respectively. FAST Cache was configured according to best practices to ensure FAST Cache performance was not limited by array bus bandwidth. After the FAST Cache was created, it was warmed up by repeating VMmark 2 runs until scores showed less than 3% variability between runs.
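A minimal sketch of that warm-up check, assuming “variability” means the spread of the most recent scores relative to the smallest (the scores below are made up):

    # Repeat runs until the last few scores differ by less than 3%.
    def variability(scores):
        return (max(scores) - min(scores)) / min(scores)

    recent_scores = [2.41, 2.44, 2.45]   # hypothetical last three warm-up runs
    print("warmed up" if variability(recent_scores) < 0.03 else "keep warming up")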

We also wanted to examine whether FAST Cache could reduce the hardware requirements of our tests. As processors and other system hardware components have increased in capacity and speed, there has been ever greater pressure for correspondingly higher performance from storage. RAID groups of HDDs have been one answer to these increasing performance demands, as RAID arrays provide performance and reliability benefits over individual disks. In typical RAID configurations, performance increases nearly linearly as disks are added to the RAID group. However, adding disks to increase storage access speed can leave far more HDD capacity than is required, much of it underutilized. FAST Cache should allow us to reduce the number of HDDs required for RAID performance benefits, also reducing the cluster’s total power, cooling, and space requirements, which results in lower cost. FAST Cache services the bulk of the workloads’ I/O operations at high speeds, so it is acceptable to service the remaining operations at lower speeds and use only as many HDDs as needed for storage capacity rather than performance.

To test whether an environment with FAST Cache and a reduced number of disks could perform as well as an environment without FAST Cache but with a larger number of disks, we tested performance with two different disk configurations. Workloads were tested on a set of 20 HDDs and then on a set of 11 HDDs, in both cases grouped into three LUNs. Each LUN was in a distinct RAID 0 group. Due to the performance characteristics of RAID 0, we expected the 20 HDD configuration to have better performance than the 11 HDD configuration. The placement of workloads onto LUNs was meant to model a naïve environment with a nonoptimal storage setup. Two LUNs held workload tile data, and the third, smaller LUN served as the destination for the VM Deploy and Storage vMotion workloads. The first LUN held VMs from the first and third tiles, and the second LUN held VMs from the second and fourth tiles. Running VMmark 2 with more than one tile per LUN was atypical of our best practices for the benchmark. It created a severe bottleneck for the disks, which was meant to simulate the types of storage performance issues we sometimes see in customer environments.

All VMmark 2 tests were conducted on a cluster of four identically configured entry-level Dell Power Edge R310 servers running ESXi 5.0. All components in the environment besides FAST Cache and number of HDDs remained unchanged during testing.

Results
To characterize cluster performance at multiple load levels, we increased the number of tiles until the cluster reached saturation, defined as when the run failed to meet Quality of Service (QoS) requirements. Scaling out the number of tiles until saturation allows us to determine the maximum VMmark 2 load the cluster could support and to compare performance at each level of load for each cache and storage configuration. All data points are the mean of three tests in each configuration. Scaling data was generated by normalizing every score to the lowest passing score, which was 1 tile with FAST Cache disabled on 20 HDDs.

VMmark 2.1.1 Scaling With and Without FAST Cache

With FAST Cache disabled, the 20 HDD LUNs reached saturation at 2 tiles, and the 11 HDD LUNs were unable to support even 1 tile of load. Because all VMs for each tile were placed on the same LUN, a 1 tile run used one LUN, consisting of only four out of 11 HDDs or eight out of 20 HDDs. 4 HDDs were insufficient to provide the required QoS for even 1 tile. When FAST Cache was enabled, the 11 HDD and 20 HDD configurations supported 4 tiles. This is a remarkable improvement; with the addition of FAST Cache, the system could support twice as much load while still meeting QoS requirements. Even at lower load levels, the equivalent system with FAST Cache was allowing greater throughput and showed resulting increases in the VMmark score of 26% at 1 tile and 31% at 2 tiles. With FAST Cache enabled, the configuration with 11 HDDs performed equivalently to one with 20 HDDs until the system approached saturation.

With FAST Cache enabled, the system supported twice as much load on almost half as many disks. The results show that an environment with a 92 GB FAST Cache was able to greatly outperform a HDD-only environment that contains 82% more disks. At 4 tiles with FAST Cache enabled, the cluster’s CPU utilization was approaching saturation, reaching an average of 84%, but was not yet bottlenecked on storage.

In our tests, performance did not scale up very much as we increased FAST Cache size from 92 GB to 366 GB and the number of HDDs from 11 to 20.

VMmark 2.1.1 Scaling with FAST Cache

We can see that all configurations scaled very similarly from 1 to 3 tiles with only minor differences appearing, primarily between the 92 GB FAST Cache and 366 GB FAST Cache. Only at the highest load level did performance begin to diverge. Predictably, the largest cache configurations show the best performance at 4 tiles, followed by the smaller cache configurations. To determine whether this performance falloff was directly attributable to the cache size and number of HDDs, we needed to know whether FAST Cache was performing to capacity.

Below are the FAST Cache and DRAM cache hit percentages for read and write operations at the 4 tile load. On average, our VMmark testing had I/O operations of 24% reads and 76% writes.

Total Cache Hits at 4 Tiles / Read and Write Cache Hits at 4 Tiles

With the 366 GB FAST Cache, nearly all reads and writes were hitting either the DRAM or FAST Cache. In these cases, the number of backing disks did not affect the score because disks were rarely being accessed. At this cache size, all frequently accessed data fit into the FAST Cache. However, with the 92 GB FAST Cache, the cache hit percentage decreased to 96.5% and 92.1% for the 11 HDD and 20 HDD configurations, respectively. This indicated that the entire working set could no longer fit into the 92 GB FAST Cache. The 11 HDD configuration began to show decreased performance relative to 20 HDDs, because although only 3.5% of total I/O operations were going to disk, the increase in disk latency was large enough to reduce throughput and affect VMmark score. Despite this, a FAST Cache of 92 GB was still sufficient to provide us with VMmark performance that met QoS requirements. The higher read hit percentages in the 11 HDD configuration reflected this reduced throughput. Lower throughput resulted in a smaller working set and an accordingly higher read hit percentage.

Overall, FAST Cache did an excellent job of identifying the working set. Although only 8% of the 1.09 TB dataset could fit in the 92 GB cache at any one time, at least 92% of I/O requests were hitting the cache.

Scaling FAST Cache gave us a sense of the working set size of the VMmark benchmark. As performance with the 92 GB FAST Cache demonstrated a knee at 3 tiles, this suggests the working set size at 3 tiles is less than 92 GB and the working set size at 4 tiles is slightly greater than 92 GB. Knowing the approximate working set size per tile would allow us to select the minimum FAST Cache size required if we wanted our entire working set to fit into the FAST Cache, even if we scaled the benchmark to an arbitrary number of tiles in a different cluster.
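A rough sizing sketch based on that observation: because 92 GB covers the working set at 3 tiles but not quite at 4, the per-tile working set falls between 92/4 and 92/3 GB, and a cache for N tiles can be sized from the upper bound. This is an extrapolation from these results, not a general rule.

    # Extrapolated working-set sizing from the 92 GB knee between 3 and 4 tiles.
    cache_gb = 92
    per_tile_low, per_tile_high = cache_gb / 4, cache_gb / 3   # ~23 to ~31 GB per tile

    def suggested_cache_gb(tiles):
        """Conservative FAST Cache size so an N-tile working set should fit."""
        return tiles * per_tile_high

    print(f"Per-tile working set: ~{per_tile_low:.0f}-{per_tile_high:.0f} GB")
    print(f"Suggested cache for 6 tiles: ~{suggested_cache_gb(6):.0f} GB")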

The results below show how I/O latency was affected by the FAST Cache size and the number of backing HDDs.

I/O Latency at 4 Tiles

Read latency is clearly affected by both FAST Cache size and the number of backing HDDs. It is highest with only 11 HDDs and the smaller FAST Cache, decreases as we add HDDs, and decreases even more with the larger FAST Cache size, where nearly all reads hit the cache. Write latency, however, is relatively constant across configurations, as expected, because in each configuration nearly all writes are served by either the DRAM cache or FAST Cache.

Summary
It’s clear that we can replace a large number of HDDs with a much smaller number of EFDs and get similar or improved performance results. An array with 11 HDDs and FAST Cache outperformed an array with 20 HDDs without FAST Cache. FAST Cache handles the workloads’ performance requirements so that we need only to supply the HDDs necessary for their storage space, rather than performance capabilities. This allows us to reduce the number of HDDs and their associated power, space, cooling, and cost.

Tiered storage solutions like FAST Cache make excellent use of EFDs, even to the extent that 92% or more of our I/O operations are benefitting from Flash-level latencies while the EFD storage itself holds only 8% of our total data. The increased VMmark scores demonstrate the ability of FAST Cache to pinpoint the most active data remarkably well, and, even in a bursty environment, show incredible improvements in I/O latency and in the load that a cluster can support.  Our testing showed FAST Cache provides Flash-level storage access speeds to the data that needs it most, reduces storage bottlenecking and increases supported load, making FAST Cache a highly valuable addition to the datacenter.

Comparing ESXi 4.1 and ESXi 5.0 Scaling Performance

In previous articles on VROOM! we used VMmark 2 to investigate the effects of altering a single hardware component, such as a storage array or server model, in a vSphere cluster. In contrast to these earlier studies, we now examine the effects of upgrading the hosts’ software from ESXi 4.1 to ESXi 5.0 on the performance of a VMmark 2 cluster.

vSphere 5 includes many new features and virtual machine enhancements, the details of which can be found here. To the IT professional weighing the costs and benefits of upgrading their existing infrastructure to vSphere 5, an often important question is whether ESXi 5.0 can outperform ESXi 4.1 in the same environment. VMmark 2 is an ideal tool for answering this question with measurable results. We used VMmark 2.1.1 to see how ESXi 5.0 stacked up to ESXi 4.1 on an identically configured cluster.

VMmark 2 is a multi-host virtualization benchmark that models application performance as well as the effects of common infrastructure operations such as vMotion, Storage vMotion, and virtual machine deployments. Each VMmark tile contains a set of VMs running diverse application workloads as a unit of load. VMmark 2 scores are computed as a weighted average of application workload throughput and infrastructure operation throughput. For more details, see the overview, release notes for VMmark 2.1, and for 2.1.1.
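The sketch below illustrates that scoring idea: normalize application and infrastructure throughput against a reference and combine them with a weighted average. The weights and reference values shown are placeholders, not the official VMmark 2 run-rule parameters.

    # Illustrative weighted-average scoring in the spirit of VMmark 2; the weights
    # and reference throughputs are assumptions, not the benchmark's actual values.
    def vmmark_style_score(app_tput, infra_tput, app_ref, infra_ref,
                           app_weight=0.8, infra_weight=0.2):
        app_norm = app_tput / app_ref
        infra_norm = infra_tput / infra_ref
        return app_weight * app_norm + infra_weight * infra_norm

    print(vmmark_style_score(app_tput=1200, infra_tput=30, app_ref=1000, infra_ref=25))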

Testing Methodology

All VMmark 2 tests were conducted on a cluster of four identically configured entry-level Dell Power Edge R310 servers. To determine the impact of the vSphere 5 environment on performance, a series of tests was conducted with these hosts running ESXi 4.1, then with ESXi 5.0. In addition, for the vSphere 5 environment, the virtual machine hardware and VMware Tools were upgraded on all workload VMs, and LUNs were reformatted as VMFS5. All other components in the environment remained unchanged during testing.

Configuration
Systems Under Test: Four Dell PowerEdge R310 Servers
CPUs: One Quad-Core Intel® Xeon® X3460 @ 2.8 GHz, hyper-threading enabled per server
Memory: 32GB DDR3 ECC @ 800 MHz per server
Storage Array: EMC VNX5500
Hypervisors under test:
VMware ESXi 4.1
VMware ESXi 5.0
Virtualization Management: VMware vCenter Server 5.0
VMmark version: 2.1.1

Results

To characterize cluster performance at multiple load levels, we increased the number of tiles until the cluster reached saturation, defined as when the run failed to meet Quality of Service (QoS) requirements. Scaling out the number of tiles until saturation allows us to determine the maximum VMmark 2 load the cluster could support and to compare the ESXi 4.1 and ESXi 5.0 configurations at each level of load.

The graph below shows the results of the VMmark 2 testing as described above with identically configured clusters running ESXi 4.1 and ESXi 5.0. All data points are the mean of three tests in each configuration.

  Scaling

 

The ESXi 4.1 cluster reached saturation at 3 tiles, but ESXi 5.0 was able to support 4 tiles while still meeting workload Quality of Service requirements. The ESXi 5.0 cluster also outperformed ESXi 4.1 by 3% and 4% on the two and three-tile runs, respectively. Differences in CPU utilization were negligible. The results show that, in an equivalent environment, vSphere 5 handled greater load than ESXi 4.1 before reaching saturation, and showed increased performance at lower levels of load as well. At saturation, vSphere 5 showed a 22% increase in overall VMmark 2 scores over ESXi 4.1. In this cluster, vSphere 5 supported 33% more VMs and twice the number of infrastructure operations while meeting Quality of Service requirements.

VMmark 2 scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. The completion time for vMotion, Storage vMotion, and VM Deploy is used as the latency measurement for the infrastructure operations. Latency can be very informative about the functioning of the environment and how the cluster as a whole performs under increasing loads. Examining latency at a 3-tile load, as seen in the figure below, reveals significant differences between the hypervisor versions. Latencies were normalized to the ESXi 4.1 results.

Latency

We saw decreases in latency for all VMmark 2 workloads with vSphere 5. The latency decreases were most striking in Olio, Storage vMotion, and DVD Store 2, with decreases of 20%, 19%, and 15%, respectively. These improvements to vMotion and Storage vMotion are consistent with publicized improvements in vMotion and Storage vMotion latency for vSphere 5 (details here).

A VMmark 2 run passes when all of its application QoS metrics, or latencies, remain below a specified threshold. These decreases in latency with ESXi 5.0 are directly related to why ESXi 5.0 was able to support an additional tile relative to ESXi 4.1.

Our comparison has shown that upgrading an ESXi 4.1 cluster to vSphere 5 had two high-level effects on performance. The vSphere 5 cluster supported 33% more VMs at saturation than the ESXi 4.1 cluster, and it also exhibited improved latency and throughput at lower levels of load, showing that ESXi 5.0 does outperform ESXi 4.1.

Performance Scaling of an Entry-Level Cluster

Performance benchmarking is often conducted on top-of-the-line hardware, including hosts that typically have a large number of cores, maximum memory, and the fastest disks available. Hardware of this caliber is not always accessible to small or medium-sized businesses with modest IT budgets. As part of our ongoing investigation of different ways to benchmark the cloud using the newly released VMmark 2.0, we set out to determine whether a cluster of less powerful hosts could be a viable alternative for these businesses. We used VMmark 2.0 to see how a four-host cluster with a modest hardware configuration would scale under increasing load.

Workload throughput is often limited by disk performance, so the tests were repeated with two different storage arrays to show the effect that upgrading the storage would offer in terms of performance improvement. We tested two disk arrays that varied in both speed and number of disks, an EMC CX500 and an EMC CX3-20, while holding all other characteristics of the testbed constant.

To review, VMmark 2.0 is a next-generation, multi-host virtualization benchmark that models application performance and the effects of common infrastructure operations such as vMotion, Storage vMotion, and a virtual machine deployment. Each tile contains Microsoft Exchange 2007, DVD Store 2.1, and Olio application workloads which run in a throttled fashion. The Storage vMotion and VM deployment infrastructure operations require the user to specify a LUN as the storage destination. The VMmark 2.0 score is computed as a weighted average of application workload throughput and infrastructure operation throughput. For more details about VMmark 2.0, see the VMmark 2.0 website or Joshua Schnee’s description of the benchmark.

Configuration
All tests were conducted on a cluster of four Dell PowerEdge R310 hosts running VMware ESX 4.1 and managed by VMware vCenter Server 4.1.  These are typical of today’s entry-level servers; each server contained a single quad-core Intel Xeon 2.80 GHz X3460 processor (with hyperthreading enabled) and 32 GB of RAM.  The servers also used two 1Gbit NICs for VM traffic and a third 1Gbit NIC for vMotion activity.

To determine the relative impact of different storage solutions on benchmark performance, runs were conducted on two existing storage arrays, an EMC CX500 and an EMC CX3-20. For details on the array configurations, refer to Table 1 below. VMs were stored on identically configured ‘application’ LUNs, while a designated ‘maintenance’ LUN was used for the Storage vMotion and VM deployment operations.

Table 1. Disk Array Configuration

Results
To measure the cluster's performance scaling under increasing load, we started by running one tile, then increased the number of tiles until the run failed to meet Quality of Service (QoS) requirements. As load is increased on the cluster, we expect application throughput, CPU utilization, and VMmark 2.0 scores to increase; the VMmark score increases as a function of throughput. By scaling out the number of tiles, we hoped to determine the maximum load our four-host cluster of entry-level servers could support. VMmark 2.0 scores will not scale linearly from one to three tiles because, in this configuration, the infrastructure operations load remained constant; infrastructure load increases primarily as a function of cluster size. Although it shows only a two-host cluster, this figure from Joshua Schnee’s recent blog article demonstrates the relationship between application throughput, infrastructure operations throughput, and the number of tiles more clearly. Secondly, we expected to see improved performance when running on the CX3-20 versus the CX500 because the CX3-20 has a larger number of disks per LUN as well as faster individual drives. Figure 1 below details the scale-out performance on the CX500 and the CX3-20 disk arrays using VMmark 2.0.
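The scale-out procedure described above can be summarized in a short pseudocode-style sketch; run_vmmark and qos_passed stand in for the actual benchmark harness and are hypothetical.

    # Scale out tiles one at a time and stop once a run no longer meets QoS.
    def find_saturation(run_vmmark, qos_passed, max_tiles=10):
        """run_vmmark(tiles) -> score; qos_passed(tiles) -> bool (both hypothetical)."""
        passing_scores = {}
        for tiles in range(1, max_tiles + 1):
            score = run_vmmark(tiles)
            if not qos_passed(tiles):
                break                      # saturation reached at the previous tile count
            passing_scores[tiles] = score
        return passing_scores              # the highest key is the maximum supported load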

Figure 1. VMmark 2.0 Scale Out On a Four-Host Cluster


Both configurations saw improved throughput from one to three tiles but at four tiles they failed to meet at least one QoS requirement. These results show that a user wanting to maintain an average cluster CPU utilization of 50% on their four-host cluster could count on the cluster to support a two-tile load. Note that in this experiment, increased scores across tiles are largely due to increased workload throughput rather than an increased number of infrastructure operations.

As expected, runs using the CX3-20 showed consistently higher normalized scores than those on the CX500. Runs on the CX3-20 outperformed the CX500 by 15%, 14%, and 12% on the one, two, and three-tile runs, respectively. The increased performance of the CX3-20 over the CX500 was accompanied by approximately 10% higher CPU utilization, which indicated that the faster CX3-20 disks allowed the CPU to stay busier, increasing total throughput.

The results show that our cluster of entry-level servers with a modest disk array supported approximately 220 DVD Store 2.1 operations per second, 16 send-mail actions, and 235 Olio updates per second. A more robust disk array supported 270 DVD Store 2.1 operations per second, 16 send-mail actions, and 235 Olio updates per second with 20% lower latencies on average and a correspondingly slightly higher CPU utilization.

Note that this type of experiment is possible for the first time with VMmark 2.0; VMmark 1.x was limited to benchmarking a single host but the entry-level servers under test in this study would not have been able to support even a single VMmark 2.0 tile on an individual server. By spreading the load of one tile across a cluster of servers, however, it becomes possible to quantify the load that the cluster as a whole is capable of supporting.  Benchmarking our cluster with VMmark 2.0 has shown that even modest clusters running vSphere can deliver an enormous amount of computing power to run complex multi-tier workloads.

Future Directions
In this study, we scaled out VMmark 2.0 on a four-host entry-level cluster to measure performance scaling and the maximum supported number of tiles. This put a much higher load on the cluster than would be typical for a small or medium-sized business, so such businesses can confidently deploy their application workloads on a cluster like this one. An alternate experiment would be to run fewer tiles while measuring the performance of other enterprise-level features, such as VMware High Availability. This ability to benchmark the cloud in many different ways is one benefit of having a well-designed multi-host benchmark. Keep watching this blog for more interesting studies in benchmarking the cloud with VMmark 2.0.