
Comparing Storage Density, Power, and Performance with VMmark 2.5

Datacenters continue to grow as the use of both public and private clouds becomes more prevalent.  A comprehensive review of density, power, and performance is becoming increasingly crucial to understanding the tradeoffs when considering new storage technologies as a replacement for legacy solutions.  Expanding on previous articles comparing storage technologies and the IOPS performance available when using flash-based storage, in this article we compare the density, power, and performance differences between traditional hard disk drives (HDDs) and flash-based storage.  As might be expected, we found that the flash-based storage performed very well in comparison to the traditional hard disk drives.  This article quantifies our findings.

In addition to VMmark’s previous performance measurement capability, VMmark 2.5 adds the ability to collect power measurements on servers and storage under test.  VMmark 2.5 is a multi-host virtualization consolidation benchmark that utilizes a combination of application workloads and infrastructure operations running simultaneously to model the performance of a cluster.  For more information on VMmark 2.5, see this overview.

Environment Configuration:
Hypervisor: VMware vSphere 5.1
Servers: Two x Dell PowerEdge R720
BIOS settings: High Performance Profile Enabled
CPU: Two x 2.9GHz Intel Xeon E5-2690
Memory: 192GB
HBAs: Two x 16Gb QLE2672 per system under test
Storage:
- HDD-Configuration: EMC CX3-80, 120 disks, 8 Trays, 1 SPE, 30U
- Flash-Based-Configuration: Violin Memory 6616, 64 VIMMs, 3U
Workload: VMware VMmark 2.5.1

Testing Methodology:
For this experimentation we set up a vSphere 5.1 DRS-enabled cluster consisting of two identically configured Dell PowerEdge R720 servers.  A series of VMmark 2.5 tests was then conducted on the cluster with the same VMs being moved to the storage configuration under test, progressively increasing the number of tiles until the cluster reached saturation.  Saturation was defined as the point where the cluster was unable to meet the VMmark 2.5 quality-of-service (QoS) requirements.  We selected the EMC CX3-80 and the Violin Memory 6616 as representatives of the previous generation of traditional HDD-based and flash-based storage, respectively.  We would expect comparable arrays from these generations to have characteristics similar to what we measured in these tests.  In addition to the VMmark 2.5 results, esxtop data was collected to provide further statistics.  The HDD configuration running a single tile was used as the baseline, and all VMmark 2.5 results in this article (excluding raw watts, %CPU, and latency) were normalized to that result.
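For readers who want to picture the sweep as a script, the sketch below captures the idea of adding tiles until QoS fails.  It is only an illustration; run_vmmark() is a hypothetical stand-in for driving a real benchmark run (assumed to return a score and a QoS pass/fail flag), not part of any actual VMmark tooling.

```python
# Minimal sketch of the scaling methodology described above: keep adding
# tiles and re-running the benchmark until the cluster can no longer meet
# the VMmark 2.5 QoS requirements. run_vmmark() is a hypothetical stand-in
# for a real benchmark run; it is assumed to return (score, qos_passed).
from typing import Callable, Tuple

def find_saturation(run_vmmark: Callable[[int], Tuple[float, bool]],
                    max_tiles: int = 16) -> int:
    """Return the highest tile count that still met QoS before saturation."""
    last_compliant = 0
    for tiles in range(1, max_tiles + 1):
        _score, qos_ok = run_vmmark(tiles)  # one full run at this tile count
        if not qos_ok:
            break                           # saturation: QoS requirements missed
        last_compliant = tiles
    return last_compliant
```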

Average Watts and VMmark 2.5 Performance Per Kilowatt Comparison:
For our comparison of the two technologies, the first point of evaluation was reviewing both the average watts required by the storage arrays and the corresponding VMmark 2.5 Performance Per Kilowatt (PPKW) score.  Note that the HDD configuration reached saturation at 7 tiles. In contrast, the Flash-based configuration was able to support a total of 9 tiles, while still meeting the quality of service requirements for VMmark 2.5.

As can be seen from the above graphs, the difference between the two technologies is striking.  The average watts drawn by the Flash-based configuration were nearly 50% lower than the HDD configuration across all tiles tested.  Additionally, the PPKW score of the Flash-based configuration was on average 3.4 times higher than that of the HDD configuration across all runs.
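The performance-per-kilowatt metric itself is simple arithmetic.  The sketch below shows one way to compute it, assuming PPKW is the VMmark score divided by the average power draw in kilowatts; the input values are placeholders rather than our measured data (the measured ratio between the two configurations was about 3.4x).

```python
# Hedged sketch of the performance-per-kilowatt (PPKW) arithmetic, assuming
# PPKW is simply the VMmark 2.5 score divided by average power in kilowatts.
# The inputs below are illustrative placeholders, not measured results.

def ppkw(vmmark_score: float, avg_watts: float) -> float:
    """Performance per kilowatt: benchmark score per kW of average power."""
    return vmmark_score / (avg_watts / 1000.0)

hdd_ppkw = ppkw(vmmark_score=7.0, avg_watts=1000.0)    # placeholder values
flash_ppkw = ppkw(vmmark_score=9.0, avg_watts=500.0)   # placeholder values
print(round(flash_ppkw / hdd_ppkw, 2))                 # toy ratio, ~2.57x
```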

Application Score Comparison:
Due to the very large difference in PPKW, we decided to dig deeper into the potential root causes, beyond just the discrepancy in power consumed.  Because the application workloads exhibit random access patterns, as opposed to the sequential nature of infrastructure operations, we focused on the differences in application scores between the two configurations, as this is where we would expect to see the majority of the gains provided by the Flash-based configuration.

The difference between the scaling of the application workloads is quite obvious.  Although running the same number of tiles, and thus attempting the same amount of work, the flash-based configuration was able to produce application workload scores that were 1.9 times higher than the HDD configuration across 7 tiles.

CPU and Latency Comparison:
After exploring the power consumption and various areas of performance difference, we decided to look into two additional key components behind the performance improvements: CPU utilization and storage latency.


In our final round of data assessment we found that the CPU utilization of the flash-based configuration was on average 1.53 times higher than that of the HDD configuration across all 7 tiles.  Higher CPU utilization might appear to be sub-optimal; however, we determined that the systems were spending less time waiting for I/O to complete and were thus getting more work done.  This is especially visible when reviewing the storage latencies of the two configurations.  The flash-based configuration showed extremely flat latencies, on average less than one tenth of the HDD configuration’s latencies.

Finally, when comparing the physical space requirements of the two configurations, the flash-based storage was effectively 92% denser than the traditional HDD configuration (achieving 9 tiles in 3U versus 7 tiles in 30U). In addition to the physical density advancements, the flash-based storage allowed for a 29% increase in the number of VMs run on the same server hardware, while maintaining the QoS requirements of VMmark 2.5.
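The density and consolidation figures can be verified directly from the numbers quoted in this article; the short calculation below walks through the arithmetic.

```python
# Worked check of the density and consolidation figures quoted above, using
# only the numbers stated in the article: 9 tiles in 3U of flash storage
# versus 7 tiles in 30U of HDD storage.

hdd_tiles, hdd_rack_units = 7, 30
flash_tiles, flash_rack_units = 9, 3

hdd_u_per_tile = hdd_rack_units / hdd_tiles        # ~4.29 rack units per tile
flash_u_per_tile = flash_rack_units / flash_tiles  # ~0.33 rack units per tile

space_reduction = 1 - flash_u_per_tile / hdd_u_per_tile
print(f"Rack space per tile reduced by {space_reduction:.0%}")                # ~92%

vm_increase = flash_tiles / hdd_tiles - 1
print(f"Increase in tiles (and VMs) on the same servers: {vm_increase:.0%}")  # ~29%
```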

The flash-based storage showed wins across the board for power and performance.  It consumed roughly half the power while achieving more than three times the performance per kilowatt.  Although the initial costs of flash-based storage can be somewhat daunting when compared to traditional HDD storage, the reduction in power, increased density, and superior performance of the flash-based storage make a strong argument for integrating the technology into future datacenters. VMmark 2.5 gives us the ability to look at the larger picture, enabling informed decisions across a wide variety of today’s concerns.

vSphere 5.1 IOPS Performance Characterization on Flash-based Storage

At VMworld 2012 we demonstrated a single eight-way VM running on vSphere 5.1 exceeding one million IOPS.  This testing illustrated the high-end IOPS performance of vSphere 5.1.

In a new series of tests we have completed some additional characterization of high I/O performance using a very similar environment. The only difference between the 1 million IOPS test environment and the one used for these tests is that the number of Violin Memory Arrays was reduced from two to one (one of the arrays was a short term loan).

Configuration:
Hypervisor: vSphere 5.1
Server: HP DL380 Gen8
CPU: Two Intel Xeon E5-2690, HyperThreading disabled
Memory: 256GB
HBAs: Five QLogic QLE2562
Storage: One Violin Memory 6616 Flash Memory Array
VM: Windows Server 2008 R2, 8 vCPUs and 48GB.
Iometer Configuration: Random, 4KB I/O size with 16 workers

We continued to characterize the performance of vSphere 5.1 and the Violin array across a wider range of configurations and workload conditions.

Based on the types of questions that we often get from customers, we focused on RDM versus VMFS5 comparisons and the usage of various I/O sizes.  In the first series of experiments we compared RDM versus VMFS5 backed datastores using 100% read workload mix while ramping up the I/O size.

[Graph: IOPS and throughput, RDM vs. VMFS5, 100% read mix across increasing I/O sizes]

As you can see from the above graph, VMFS5 yielded roughly equivalent performance to that of RDM-backed datastores.  Comparing the average of the deltas across all data points showed performance within 1% of RDM for both IOPS and MB/s.  As expected, the number of IOPS decreased once we exceeded the default array block size of 4KB, but the throughput continued to scale, approaching 4500 MB/s at both the 8KB and 16KB sizes.
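A quick way to sanity-check these curves is to remember that throughput is roughly IOPS multiplied by the I/O size.  The sketch below, which assumes binary (KiB/MiB) units for simplicity, backs out the approximate IOPS implied by a given throughput.

```python
# Rough rule of thumb for reading these graphs: throughput is approximately
# IOPS multiplied by the I/O size. Binary units (KiB/MiB) are assumed here,
# so treat the outputs as approximations of the plotted values.

def iops_from_throughput(mib_per_s: float, io_size_kib: float) -> float:
    """Approximate IOPS implied by a throughput figure at a given I/O size."""
    return mib_per_s * 1024 / io_size_kib

# ~4500 MB/s at an 8KB I/O size implies roughly 576K IOPS, and the same
# throughput at 16KB implies roughly 288K IOPS.
print(round(iops_from_throughput(4500, 8)))    # ~576000
print(round(iops_from_throughput(4500, 16)))   # ~288000
```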

For our second series of experiments, we continued to compare RDM versus VMFS5 backed datastores through a progression of block sizes, but this time we altered the workload mix to include 60% reads and 40% writes.

[Graph: IOPS and throughput, RDM vs. VMFS5, 60% read / 40% write mix across increasing I/O sizes]

Violin Memory arrays use a 4KB sector size and perform at their optimal level when managing 4KB blocks. This is very visible in the above IOPS results at the 4KB block size. In the above graph, comparing RDM and VMFS5 IOPS, you can see that VMFS5 performs very well with a 60% read, 40% write mix.  Throughput continued to scale in a fashion similar to the read-only experiments, and VMFS5 performance for both IOPS and MB/s was within 0.01% of RDM performance when comparing the average of the deltas across all data points.

The amount of I/O, with just one eight-way VM running on one Violin storage array, is both considerable and sustainable at many I/O sizes.  It’s also noteworthy that running a 60% read and 40% write I/O mix still generated substantial IOPS and bandwidth. While in most cases a single VM won’t need to drive nearly this much I/O traffic, these experiments show that vSphere 5.1 is more than capable of handling it.

1 Million IOPS on 1 VM

Last year at VMworld 2011 we presented one million I/O operations per second (IOPS) on a single vSphere 5 host (link).  The intent was to demonstrate vSphere 5's performance by using multiple VMs to drive an aggregate load of one million IOPS through a single server.   There has recently been some interest in driving similar I/O load through a single VM.  We used a pair of Violin Memory 6616 flash memory arrays, which we connected to a two-socket HP DL380 server, for some quick experiments prior to VMworld.  vSphere 5.1 was able to demonstrate high performance and I/O efficiency by exceeding one million IOPS, doing so with only a modest eight-way VM.  A brief description of our configuration and results is given below.

Configuration:
Hypervisor: vSphere 5.1
Server: HP DL380 Gen8
CPU: 2 x Intel Xeon E5-2690, HyperThreading disabled
Memory: 256GB
HBAs: 5 x QLE2562
Storage: 2 x Violin Memory 6616 Flash Memory Arrays
VM: Windows Server 2008 R2, 8 vCPUs and 48GB.
Iometer Configuration: 4KB I/O size with 16 workers

Results:
Using the above configuration we achieved 1,055,896 total sustained IOPS.  Check out the following short video clip from one of our latest runs.

Look out for a more thorough write-up after VMworld.
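As a back-of-the-envelope check on the headline number, the aggregate bandwidth implied by roughly 1.06 million 4KB IOPS works out to about 4 GiB/s; the short calculation below shows the arithmetic (binary units assumed, so treat the figures as approximate).

```python
# Back-of-the-envelope bandwidth implied by the headline result above:
# roughly 1.06 million 4KB I/Os per second. Binary units are assumed.

iops = 1_055_896
io_bytes = 4 * 1024

bytes_per_second = iops * io_bytes
print(f"{bytes_per_second / 2**20:,.0f} MiB/s")  # ~4,125 MiB/s
print(f"{bytes_per_second / 2**30:.1f} GiB/s")   # ~4.0 GiB/s
```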

 

Analysis of Storage Technologies on Clusters using VMmark 2.1

Previous blog entries utilizing VMmark 2.1 introduced the benchmark, showed the effects of generational scaling, and evaluated the scale-out performance of vSphere clusters.  This article analyzes the performance impact of the type of storage infrastructure used, specifically when comparing the effects of Enterprise Flash Drives (EFDs; often referred to as SSDs) versus traditional SCSI HDDs.  There is a general perception, both in the consumer and business space, that EFDs are better than HDDs.  Less clear, however, is how much better and whether the performance benefits of the typically more expensive EFDs are observed in today’s more complex datacenters. 

VMmark 2 Overview:

Once again we used VMmark 2.1 to model the performance characteristics of a multi-host heterogeneous virtualization environment.  VMmark 2.1 is a combination of application workloads and infrastructure operations running simultaneously.  In general, the infrastructure operations increase with the number of hosts in an N/2 fashion, where N is the number of hosts.  To calculate the score for VMmark 2.1, final results are generated from a weighted average of the two kinds of workloads; hence scores will not increase linearly as workload tiles are added.  For more general information on VMmark 2.1, including the application and infrastructure workload details, take a look at the expanded overview in my previous blog post or the VMmark 2.1 release notification written by Bruce Herndon.

Environment Configuration:

  • Systems Under Test: 2 HP ProLiant DL380 G6
  • CPUs: 2 Quad-Core Intel® Xeon® CPU 5570 @ 2.93 GHz with Hyper-Threading enabled per system
  • Memory: 96GB DDR2 Reg ECC per system
  • Storage Arrays Under Test:
    • HDD: EMC CX3-80
      • 8 Enclosures: RAID0 LUNs, 133.68GB FC HDDs
    • EFD: EMC CX4-960
      • 4 Enclosures: RAID0 LUNs, mix of 66.64GB and 366.8GB FC EFDs
  • Hypervisor: VMware ESX 4.1
  • Virtualization Management: VMware vCenter Server 4.1

Testing Methodology:

To analyze the comparative performance of EFDs versus HDDs with VMmark 2.1, a vSphere DRS enabled cluster consisting of two identically-configured HP ProLiant DL380 servers was connected to the two EMC storage arrays.  A series of tests were then conducted against the cluster with the same VMs being moved to the storage array under test, increasing the number of tiles until the cluster approached saturation.  Saturation was defined as the point where the cluster was unable to meet the minimum quality-of-service (QoS) requirements for VMmark 2.1.  The minimum configuration for VMmark 2.1 is a two-host cluster running a single tile.  The result from this minimal configuration on the HDD storage array was used as the baseline, and all VMmark 2.1 data in this article were normalized to that result.  In addition to the standard VMmark 2.1 results, esxtop data was also collected during the measurement phase of the benchmark to provide additional statistics. 

Results:

In a top-down approach to reviewing the two storage technologies, it seems natural that the first point of comparison would be the overall performance of VMmark 2.1.  By comparing the normalized scores, it’s possible to immediately see the impact of running our cluster on EFDs versus traditional HDDs at a variety of load levels.

[Graph: normalized VMmark 2.1 scores, EFD vs. HDD configurations]

The improvement in score is apparent at every point of utilization, from the lowest-loaded 1-tile configuration out to the saturation point of 6 tiles.  Overall, the average improvement in score for the EFD configuration was 25.4%.  And while the HDD configuration was unable to meet the QoS requirements at 6 tiles, the EFD configuration not only met the requirements, but also improved the overall VMmark 2.1 score, even when the cluster was completely saturated (as seen in the graph below).  VMmark 2.1 can drive a considerable amount of I/O, up to many thousands of IOPS for large numbers of tiles.  Digging deeper into the root cause of such dramatic improvement for EFDs led me to investigate the overall throughputs for each of the configurations. 

[Graph: total throughput (MB/s), EFD vs. HDD configurations]

It’s apparent from the above graph that there was a significant improvement in total bandwidth, represented by Total MB/s, in the EFD configuration.  Compared to the HDD configuration, the EFD configuration’s total throughput improved by 8%, 9.2%, 9.5%, 6.5%, and 14.5% at the successive load levels. The amount of improvement generally increased as the I/O demands on the cluster increased.  Another interesting detail that arose from reviewing the data over numerous points of utilization was that the %CPU used on the EFD configuration was typically higher than its HDD counterpart at the same load.  Although slightly counter-intuitive at first, it makes sense that if the system is waiting less for I/Os to complete, it can spend more time doing actual work, as demonstrated by the higher VMmark 2.1 scores.  This observation leads to another interesting comparison.  Disk latency characteristics are often used to predict hardware performance. This can be useful, but what can be unclear is how this translates to real-world disk latencies running a diverse set of workloads.

[Graphs: average write and read I/O latencies, EFD vs. HDD configurations (lower is better)]

Above is a series of graphs that display the average latency reported per write and read I/Os (note that lower latency is better).  In looking at each of the key latency counters we can get a better sense for where the additional performance is derived.  There’s a generalization that EFDs have poor write speeds by comparison to today’s HDDs.  The results here show that the generalization doesn’t always apply.  In fact, when looking at the average write latency for the tested EFDs across all data points, it was within 1% of the average write latency for the tested HDDs.  Additionally, reviewing the read latency comparison data showed massive reductions in latency across all workload levels, 76% on average.  Depending on the workload being run, this in itself could be all the justification needed to move to the newer technology.

It isn’t surprising that EFDs outperformed HDDs.  What is somewhat unexpected is the amount of performance, and the ability for EFDs to show immediate advantages even on the most lightly loaded clusters. With an average VMmark 2.1 score improvement of 25.4%, an average bandwidth increase of 9.6%, and a combined average read latency reduction of 76%, it’s easy to imagine there are a great many environments that might benefit from the real-world performance of EFDs. 


Exploring Generational Scaling with VMmark 2.1

The steady march of technological improvements is nothing new.  As companies either expand or refresh their datacenters it often becomes a non-trivial task to quantify the returns on hardware investments.  This difficulty can be further compounded when it’s no longer sufficient to answer how well one new server will perform in relation to its predecessor, but rather how well the new cluster will perform in comparison to the previous one.  With this in mind, we set off to explore the generational scaling performance of two clusters made up of very similar hardware using the newly released VMmark 2.1.  VMmark 2.1 is a next-generation, multi-host virtualization benchmark that models not only application performance but also the effects of common infrastructure operations.  For more general information on VMmark 2.1, including the application and infrastructure workload details, take a look at the expanded overview in one of my previous blog posts.

Environment Configuration:

  • Clusters Under Test
    • Cluster 1
      • Systems Under Test: 2 x Dell PowerEdge R805
      • CPUs: 2 Six-Core AMD Opteron™ 2427 @ 2.2 GHz
      • Memory: 128GB DDR2 Reg ECC @ 533MHz
      • Storage Array: EMC CX4-120
      • Hypervisor: VMware ESX 4.1
      • Virtualization Management: VMware vCenter Server 4.1
    • Cluster 2
      • Systems Under Test: 2 x Dell PowerEdge R815
      • CPUs: 2 Twelve-Core AMD Opteron™ 6174 @ 2.2 GHz
      • Memory: 128GB DDR3 Reg ECC @ 1066MHz
      • Storage Array: EMC CX4-120
      • Hypervisor: VMware ESX 4.1
      • Virtualization Management: VMware vCenter Server 4.1
  • VMmark 2.1

Testing Methodology:

To measure the generational improvement of the two clusters under test every attempt was made to set up and configure the servers identically.  The minimum configuration for VMmark 2.1 is a two-host cluster running a single tile.  The result from this minimal configuration on the older cluster, or cluster #1, was used as the baseline and all VMmark 2.1 scalability data in this article were normalized to that score.  A series of tests were then conducted on each of the clusters in isolation, increasing the number of tiles being run until the cluster approached saturation.  Saturation was defined as the point where the cluster was unable to meet the minimum quality-of-service (QoS) requirements for VMmark 2.1.  Results that were unable to meet minimum QoS for VMmark 2.1 were not plotted.

Results:

The primary component of change between the two clusters, making up the predominant factor in the generational scaling, is the change in processors.  The AMD Opteron™ 2427 processors provide six cores per socket for a total of twelve logical processors per server, whereas the newer AMD Opteron™ 6174 processors have twelve cores per socket for a total of twenty-four logical processors per server.  Factor in a doubling of the L3 cache per socket, as well as a doubling of the systems’ memory speeds, and the change in server characteristics is quite significant.

[Graph: normalized VMmark 2.1 scores for the two cluster generations]

As shown in the above graph, the generational scaling between the two clusters under test is significant.  In the one-tile case, both clusters were able to perform the work requested without the presence of resource constraints.  The performance improvement of the newer cluster became more apparent once we started scaling up the number of tiles and significantly increasing the level of CPU overcommitment and utilization.  It’s important to note that while adding tiles does effectively linearly increase the application workload requests being made, the workload caused by infrastructure operations does not scale in the same way, and was a constant across all tests.  Cluster #1 scaled to three tiles, at which point it was saturated and unable to support additional tiles while continuing to meet the minimum quality-of-service (QoS) requirements of the benchmark.  For comparison, Cluster #2 achieved an increase in normalized VMmark 2.1 scores of 1%, 14%, and 9% for the one-tile, two-tile, and three-tile configurations, respectively.  Cluster #2 was then scaled to seven tiles, beyond which point it was unable to meet the minimum QoS requirements.

The newer generation cluster, with two Dell PowerEdge R815 AMD Opteron™ 6174-based hosts running vSphere 4.1, exhibited excellent scaling as the load was increased up to seven tiles, more than doubling the previous generation cluster’s performance and work accomplished.  Because VMmark 2.1 not only utilizes heterogeneous applications across a diverse computing environment, but also measures the impact of commonplace infrastructure operations, it provided valuable insight into the generational scaling of the two cluster generations.  VMmark 2.1 proved itself an able benchmark for answering previously difficult datacenter questions.


Experimenting with Cluster Scale-Out Utilizing VMmark 2

The first article in our VMmark 2 series gave an in-depth introduction to the benchmark while also presenting results on the scaling performance of a cluster based on a matched pair of systems.  The goal of this article is to continue to characterize larger and more diverse cloud configurations by testing scale-out performance of an expanding vSphere cluster.  This blog explores an enterprise-class cluster’s performance as more servers are added and subsequently the amount of work being requested is increased. Determining the impact of adding hosts to a cluster is important because it enables the measurement of the total work being done as cluster capacity and workload demand increases within a controlled environment.  It also assists in identifying the efficiency with which a vSphere managed cluster can utilize an increasing number of hosts.

VMmark 2 Overview:

VMmark 2 is a next-generation, multi-host virtualization benchmark that models not only application performance but also the effects of common infrastructure operations. VMmark 2 is a combination of the application workloads and the infrastructure operations running simultaneously.  Although the application workload levels are scaled up by the addition of tiles, the infrastructure operations scale as the cluster size increases.  In general, the infrastructure operations increase with the number of hosts in an N/2 fashion, where N is the number of hosts.  To calculate the score for VMmark 2, final results are generated from a weighted average of the two kinds of workloads; hence scores will not increase linearly as tiles are added.  For more general information on VMmark 2, including the application and infrastructure workload details, take a look at the expanded overview in my previous blog post.
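To make the N/2 point concrete, the toy sketch below shows how relative infrastructure load grows as hosts are added while application load is held fixed.  The exact operation counts are defined by the benchmark itself, so treat this only as a reading aid.

```python
# Simplified reading aid for the "N/2" behavior described above: application
# load is set by the tile count, while infrastructure-operation load grows
# roughly with half the number of hosts. The exact operation counts are
# defined by the benchmark; this toy model is only for intuition.

def relative_infra_load(num_hosts: int, baseline_hosts: int = 2) -> float:
    """Infrastructure load relative to the baseline cluster size (N/2 model)."""
    return (num_hosts / 2) / (baseline_hosts / 2)

for hosts in (2, 3, 4, 5):
    print(f"{hosts} hosts -> {relative_infra_load(hosts):.1f}x infrastructure load")
# 2 hosts -> 1.0x, 3 hosts -> 1.5x, 4 hosts -> 2.0x, 5 hosts -> 2.5x
```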

Environment Configuration:

  • Systems Under Test : 2-5 HP ProLiant DL380 G6
  • CPUs : 2 Quad-Core Intel® Xeon® CPU 5570 @ 2.93 GHz with HyperThreading Enabled
  • Memory : 96GB DDR2 Reg ECC
  • Hypervisor : VMware ESX 4.1
  • Virtualization Management : VMware vCenter Server 4.1

Testing Methodology:

To test scale-out performance with VMmark 2, five identically-configured HP ProLiant DL380 servers were connected to an EMC CLARiiON CX3-80 storage array.  The minimum configuration for VMmark 2 is a two-host cluster running one tile.  The result from this minimal configuration was used as the baseline, and all VMmark 2 scalability data in this article were normalized to that score.  A series of tests were then conducted on this two-host configuration, increasing the number of tiles being run until the cluster approached saturation.  As shown in the series’ first article, our two-host cluster approached saturation at four tiles but failed QoS requirements when running five tiles.  Starting with a common workload level of four tiles, the three-host, four-host, and five-host configurations were tested in a similar fashion, increasing the number of tiles until each configuration approached saturation.  Saturation was defined to be the point where the cluster was unable to meet the minimum quality-of-service requirements for VMmark 2.  For all testing, we recorded both the VMmark 2 score and the average cluster CPU utilization during the run phase.

Results:

Organizations often outgrow existing hardware capacity, and it can become necessary to add one or more hosts in order to relieve performance bottlenecks and meet increasing demands.  VMmark 2 was used to measure such a scenario by keeping the load constant as new hosts were incrementally added to the cluster.  The starting point for the experiments was four tiles.  At this load level the two hosts had approached saturation, with nearly 90% CPU utilization.  The test then determined the impact on cluster CPU utilization and performance of adding identical hosts to the available cluster resources.

[Graph: normalized VMmark 2 scores and average cluster CPU utilization at a fixed four-tile load as hosts are added]

As expected, scoring gains were easily achieved by adding hosts until the environment was generating approximately the maximum score for the four-tile load level, as CPU resources became more plentiful.  In comparison to the two-host configuration, the normalized scores increased 6%, 12%, and 12% for the three-host, four-host, and five-host configurations, respectively.  The configurations with additional hosts were able to generate more throughput while also reducing the average cluster CPU utilization as the requested work was spread over more systems.  This highlights the additional CPU capacity held in reserve by the cluster at each data point.  By charting two or more points at the same load level, it is much easier to approximate the expected average CPU utilization after adding new hosts into the cluster.  This data, combined with established CPU usage thresholds, can make additional purchasing or system allocation decisions more straightforward.
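As a rough planning aid, the naive estimate below assumes the same total work is spread evenly across identical hosts.  Real clusters will deviate from it (per-host overheads, imperfect balancing, and the throughput gains noted above), which is exactly why charting measured points remains the better guide.

```python
# Naive first-order estimate of the effect discussed above: if the workload
# is held constant and DRS spreads it evenly, average cluster CPU utilization
# should fall roughly in proportion to the added host capacity.

def estimated_utilization(current_util_pct: float,
                          current_hosts: int,
                          new_hosts: int) -> float:
    """Estimate average utilization when the same work is spread over more hosts."""
    return current_util_pct * current_hosts / new_hosts

# Starting from roughly 90% average utilization on two hosts at four tiles:
for hosts in (3, 4, 5):
    print(f"{hosts} hosts -> ~{estimated_utilization(90, 2, hosts):.0f}% estimated avg CPU")
# 3 hosts -> ~60%, 4 hosts -> ~45%, 5 hosts -> ~36%
```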

The above analysis looks at scale out performance for an expanding cluster with a fixed amount of work.  To get the whole picture of performance it’s necessary to measure performance and available capacity as the load and the number of hosts increases.  Specifically, as we progress through each of the configurations, does the reduction in cluster CPU utilization and improved performance measured in the previous experiment hold true for varied amounts of load and hosts?  

[Graphs: normalized VMmark 2 scores and average cluster CPU utilization as both hosts and tiles are scaled]

As shown in the above graphs, the vSphere based cloud effortlessly integrated new hosts into our testing environment and delivered consistent returns on our physical server investments.  It’s important to note that in both the two-host and three-host configurations, the test failed at least one of the quality-of-service (QoS) requirements when the cluster reached saturation.  Also important, the five-host configuration was not run out to saturation due to a lack of additional client hardware.  During our testing the addition of each host showed expected results with respect to scaling of VMmark 2 scores.   As we went through each of the configurations, the normalized scores increased an average of 13%, 13%, and 16%, for the three-host, four-host, and five-host configurations, respectively.  Each of the configurations exhibited nearly linear scaling of CPU utilization as load was increased.  Based on these results, the VMware vSphere managed cluster was able to generate significant performance scaling while also utilizing the additional capacity of newly-provisioned hosts quite efficiently. 

Thus far all VMmark 2 studies have involved homogeneous clusters of identical servers.  Stay tuned for experimentation utilizing varying storage and/or networking solutions as well as heterogeneous clusters…

 

Two Host Matched-Pair Scaling Utilizing VMmark 2

As mentioned in Bruce’s previous blog, VMmark 2.0 has been released.  With its release we can now begin to benchmark an enterprise-class cloud platform in entirely new and interesting ways.  VMmark 2 is based on a multi-host configuration that includes bursty application and infrastructure workloads to drive load against a cluster.  VMmark 2 allows for the analysis of infrastructure operations within a controlled benchmark environment for the first time, distinguishing it from server consolidation benchmarks. 

Leading off a series of new articles introducing VMmark 2, the goal of this article was to provide a bit more detail about VMmark 2 and to test a vSphere enterprise cloud, focusing on the scaling performance of a matched pair of systems.  More simply put, this blog looks to see what happens to cluster performance as more load is added to a pair of identical servers.  This is important because it allows a means for identifying the efficiency of a vSphere cluster as demand increases.

VMmark 2 Overview:

VMmark 2 is a next-generation, multi-host virtualization benchmark that not only models application performance but also the effects of common infrastructure operations. It models application workloads in the now familiar VMmark 1 tile-based approach, where the benchmarker adds tiles until either a goal is met or the cluster is at saturation.  It’s important to note that while adding tiles does effectively linearly increase the application workload requests being made, the load caused by infrastructure operations does not scale in the same way.  VMmark 2 infrastructure operations scale as the cluster size grows to better reflect modern datacenters.  Greater detail on workload scaling can be found within the benchmarking guide available for download.  To calculate the score for VMmark 2, final results are generated from a weighted average of the two kinds of workloads; hence scores will not linearly increase as tiles are added.  In addition to the throughput metrics, quality-of-service (QoS) metrics are also measured and minimum standards must be maintained for a result to be considered fully compliant.
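To illustrate why scores do not grow linearly with tiles, the toy calculation below blends a normalized application component and a normalized infrastructure component.  The 0.8/0.2 weights are placeholders of my own choosing; the actual weighting is defined in the VMmark 2 benchmarking guide, so the numbers here are illustrative only.

```python
# Illustrative sketch of why the final metric grows sub-linearly with tiles:
# the score blends a normalized application component with a normalized
# infrastructure component. The 0.8/0.2 weights are placeholders, not the
# benchmark's actual weighting.

def blended_score(app_score: float, infra_score: float,
                  app_weight: float = 0.8, infra_weight: float = 0.2) -> float:
    """Weighted average of the application and infrastructure components."""
    return app_weight * app_score + infra_weight * infra_score

# For a fixed cluster size the infrastructure term stays roughly constant,
# so quadrupling the application work does not quadruple the blended score.
print(blended_score(app_score=1.0, infra_score=1.0))  # 1.0  (one-tile reference)
print(blended_score(app_score=4.0, infra_score=1.0))  # 3.4  (4x application work)
```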

VMmark 2 contains the combination of the application workloads and infrastructure operations running simultaneously.  This allows for the benchmark to include both of these critical aspects in the results that it reports.  The application workloads that make up a VMmark 2 tile were chosen to better reflect applications in today’s datacenters by employing more modern and diverse technologies.  In addition to the application workloads, VMmark 2 makes infrastructure operation requests of the cluster.  These operations stress the cluster with the use of vMotion, storage vMotion and Deploy operations.  It’s important to note that while the VMmark 2 harness is stressing the cluster through the infrastructure operations, VMware’s Distributed Resource Scheduler (DRS) is dynamically managing the cluster in order to distribute and balance the computing resources available.  The diagrams below summarize the key aspects of the application and infrastructure workloads.

VMmark 2 Workloads Details:

[Diagram: VMmark 2.0 application workload tile]

Application Workloads – Each “Tile” consists of the following workloads and VMs:

• DVD Store 2.1 – multi-tier OLTP workload consisting of a database VM and three webserver VMs driving a bursty load profile

• Exchange 2007

• Standby Server (heartbeat server)

• OLIO – multi-tier social networking workload consisting of a web server and a database server

[Diagram: VMmark 2.0 infrastructure workloads]

Infrastructure Workloads – Consists of the following operations:

• User-initiated vMotion

• Storage vMotion

• Deploy: VM cloning, OS customization, and updating

• DRS-initiated vMotion to accommodate host-level load variations

 

Environment Configuration:

  • Systems Under Test : 2 HP ProLiant DL380 G6
  • CPUs : 2 Quad-Core Intel® Xeon® CPU 5570 @ 2.93 GHz with HyperThreading Enabled
  • Memory : 96GB DDR2 Reg ECC
  • Storage Array : EMC CLARiiON CX3-80
  • Hypervisor : VMware ESX 4.1
  • Virtualization Management : VMware vCenter Server 4.1.0

Testing Methodology:

To test scalability as the number of VMmark 2 tiles increases, two HP ProLiant DL380 servers were configured identically and connected to an EMC CLARiiON CX3-80 storage array.  The minimum configuration for VMmark 2 is a two-host cluster running one tile; as such, this was our baseline, and all VMmark 2 scores were normalized to this result.  A series of tests were then conducted on this two-host configuration, increasing the number of tiles being run until the cluster approached saturation, recording both the VMmark 2 score and the average cluster CPU utilization during the run phase.

Results:

In circumstances where demand on a cluster increases, it becomes critical to understand how the environment adapts to these demands in order to plan for future needs.  In many cases it can be especially important for businesses to understand how the application and infrastructure workloads were individually impacted.  By breaking out the distinct VMmark 2 sub-metrics we can get a fine grained view of how the vSphere cluster responded as the number of tiles, and thus work performed, increased.

[Graph: normalized VMmark 2 scores broken out by application and infrastructure sub-metrics]

From the graph above we see that the VMmark 2 scores show significant gains until reaching the point where the two-host cluster was saturated at 5 tiles.  Delving into this further, we see that, as expected, the infrastructure operations remained nearly constant because the requested infrastructure load did not change during the experimentation.  Continued examination shows that the cluster was able to achieve nearly linear scaling for the application workloads through 4 tiles.  This is equivalent to 4 times the application work requested of the 1-tile configuration.  When we reached the 5-tile configuration the cluster was unable to meet the minimum quality-of-service requirements of VMmark 2; however, this result still helps us to understand the performance characteristics of the cluster.

Monitoring how the average cluster CPU utilization changed during the course of our experiments is another critical component to understanding cluster behavior as load increases.  The diagram below plots the VMmark 2 scores shown in the above graph and average cluster CPU utilization for each configuration.

[Graph: normalized VMmark 2 scores plotted against average cluster CPU utilization for each configuration]

The resulting diagram helps to illustrate the impact on cluster CPU utilization and performance as we incremented the work done by our cluster through the addition of VMmark 2 tiles. The results show that VMware’s vSphere matched-pair cluster was able to deliver outstanding scaling of enterprise-class applications while also providing unequaled flexibility in the load balancing, maintenance, and provisioning of our cloud. This is just the beginning of what we’ll see in terms of analysis using the newly-released VMmark 2; we plan to explore larger and more diverse configurations next, so stay tuned…

 

Comparing Fault Tolerance Performance & Overhead Utilizing VMmark v1.1.1

VMware Fault Tolerance (FT), based on vLockstep technology and available with VMware vSphere, easily and efficiently provides zero downtime and zero data loss for your critical workloads. FT provides continuous availability in the event of server failures by creating a live shadow instance of the primary virtual machine on a secondary system.  The shadow VM (or secondary VM), running on the secondary system, executes sequences of x86 instructions identical to the primary VM, with which it proceeds in vLockstep.  By doing so, if a catastrophic failure of the primary system occurs, an instantaneous failover to the secondary VM takes place that is virtually indistinguishable to the end user. While FT technology is certainly compelling, some potential users express concern about possible performance overhead. In this article, we explore the performance implications of running FT in realistic scenarios by measuring an FT-enabled environment based on the heterogeneous workloads found in VMmark, the tile-based mixed-workload consolidation benchmark from VMware®.

Figure 1: High-Level Architecture of VMware Fault Tolerance

Environment Configuration:

  • System under Test : 2 x Dell PowerEdge R905
  • CPUs : 4 Quad-Core AMD Opteron 8382 (2.6GHz); 4 Quad-Core AMD Opteron 8384 (2.7GHz)
  • Memory : 128GB DDR2 Reg ECC
  • Storage Array : EMC CLARiiON CX3-80
  • Hypervisor : VMware ESX 4.0
  • Application : VMmark v1.1.1
  • Virtual Hardware (per tile) : 8 vCPUs, 5GB memory, 62GB disk

  • VMware Fault Tolerance currently only supports 1-vCPU VMs and requires specific processors for enablement; for the purposes of our experimentation, our VMmark Database and MailServer VMs were set to run with 1 vCPU only.  For more information on FT and its requirements see here.
  • VMmark is a benchmark intended to measure the performance of virtualization environments in an effort to allow customers to compare platforms.  It is also useful in studying the effect of architectural features. VMmark consists of six workloads (Web, File, Database, Java, Mail and Standby servers). Multiple sets of workloads (tiles) can be added to scale the benchmark load to match the underlying hardware resources. For more information on VMmark see here.

Test Methodology:

An initial performance baseline was established by running VMmark from 1 to 13 tiles on the primary system with Fault Tolerance disabled for all workloads. FT was then enabled for the MailServer and Database workloads after customer feedback suggested they were the applications most likely to be protected by FT. The performance tests were then executed a second time and compared to the baseline performance data.

 

Results:

The results in Table 1 are enlightening as to the performance and efficiency of VMware’s Fault Tolerance.  In this case, “FT-enabled Secondary %CPU” indicates the total CPU utilized by the secondary system under test.  It should also be noted that, for our workload, the default ESX 4.0, High Availability, and Fault Tolerance settings were used, and these results should be considered ‘out of the box’ performance for this configuration.  Finally, the secondary system’s %CPU is much lower than the primary system’s because it is only running the MailServer and Database workloads, as opposed to the six workloads that are being run on the primary system.

Table 1: [VMmark scores and %CPU for the FT-disabled and FT-enabled configurations]

You can see that as we scaled both configurations toward saturation, the overhead of enabling VMware Fault Tolerance remained surprisingly consistent, with an average delta in %CPU used of 7.89% over all of the runs.  ESX was also able to achieve very comparable scaling for both the FT-enabled and FT-disabled configurations.  It isn’t until the FT-enabled configuration nears complete saturation, a scenario most end users will never see, that we start to see any real discernible delta in scores.

It should be noted that these performance and overhead statements may or may not hold true for dissimilar workloads and systems under test.  From the results of our testing, you can see that the modest overhead of having Mail servers and Database servers truly protected, without fear of end-user interruption, is completely justified.

It’s a tough world out there; you never know when the next earthquake, power outage, or tripped-over power cord will strike.  It’s nice to know that your critical workloads are not only safe, but running at high efficiency.  The ability of VMware Fault Tolerance technology to provide quick and efficient protection for your critical workloads makes it a standout in the datacenter.

All information in this post regarding future directions and intent are subject to change or withdrawal without notice and should not be relied on in making a purchasing decision of VMware's products. The information in this post is not a legal obligation for VMware to deliver any material, code, or functionality. The release and timing of VMware's products remains at VMware's sole discretion.

Comparing Hardware Virtualization Performance Utilizing VMmark v1.1

Virtualization has just begun to remake the datacenter. One only needs to look at the rapid pace of innovation to know that we are in the midst of a revolution. This is true not only for virtualization software, but also for the underlying hardware. A perfect example of this is new hardware support for virtualized page tables provided by both Intel’s Extended Page Tables (EPT) and AMD’s Rapid Virtualization Indexing (RVI). In general, these features reduce virtualization overhead and improve performance. A previous paper showed how RVI performs with data for a range of individual workloads. As a follow-on, we decided to measure the effects of RVI in a heterogeneous environment using VMmark, the tile-based mixed-workload consolidation benchmark from VMware®.

VMware ESX has the following three modes of operation: software virtualization (Binary Translation, abbreviated as BT), hardware support for CPU virtualization (abbreviated in AMD systems as AMD-V), and hardware support for both CPU and MMU virtualization utilizing AMD-V and RVI (abbreviated as AMD-V + RVI). For most workloads, VMware recommends that users let ESX automatically determine if a virtual machine should use hardware support, but it can also be valuable to determine the optimal settings as a sanity check.

Environment Configuration:

  • System under Test : Dell PowerEdge 2970
  • CPU : 2 x Quad-Core AMD Opteron 8384 (2.5GHz)
  • Memory : 64GB DDR2 Reg ECC
  • Hypervisor : VMware ESX (build 127430)
  • Application : VMmark v1.1
  • Virtual Hardware (per tile) : 10 vCPUs, 5GB memory, 62GB disk

  • AMD RVI works in conjunction with AMD-V technology, which is a set of hardware extensions to the x86 system architecture designed to improve efficiency and reduce the performance overhead of software-based virtualization solutions.  For more information on AMD virtualization technologies see here.

  • VMmark is a benchmark intended to measure the performance of virtualization environments in an effort to allow customers to compare platforms.  It is also useful in studying the effect of architectural features. VMmark consists of six workloads (Web, File, Database, Java, Mail and Standby servers). Multiple sets of workloads (tiles) can be added to scale the benchmark load to match the underlying hardware resources. For more information on VMmark see here.

Test Methodology:

By default, ESX automatically runs 32-bit VMs (Mail, File, and Standby) with BT, and runs 64-bit VMs (Database, Web, and Java) with AMD-V + RVI.  For these tests, we first ran the benchmark using the default configuration and determined the number of tiles it would take to saturate the CPU resources.  All subsequent benchmark tests used this same load level. We next measured the baseline benchmark score with all VMs under test except Standby configured to use BT (i.e., no hardware virtualization features). A series of benchmark tests was then executed while varying the hardware virtualization settings for different workloads to assess their effects in a heavily-utilized mixed-workload environment. All of the results presented are relative to the baseline score and illustrate the percentage performance gains achieved over the BT-only configuration.

We began by setting the Standby servers to use AMD-V + RVI.  We then stepped through each of the available workloads and altered the CPU/MMU hardware virtualization settings for that specific workload type.  After determining which setting was best (BT, AMD-V, or AMD-V + RVI), we used that setting for the subsequent tests.
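This stepwise tuning lends itself to a simple greedy loop.  The sketch below is a hypothetical illustration of that search; run_benchmark is a stand-in for driving a full VMmark run with the given per-workload monitor-mode settings, not a real API, and nothing here calls actual ESX or VMmark tooling.

```python
# Hypothetical sketch of the stepwise procedure described above: for each
# workload type in turn, try each monitor mode, keep whichever produces the
# best score, then move on with that choice locked in.
from typing import Callable, Dict

MODES = ("BT", "AMD-V", "AMD-V + RVI")
WORKLOADS = ("web", "file", "database", "java", "mail")

def tune_monitor_modes(run_benchmark: Callable[[Dict[str, str]], float],
                       baseline: Dict[str, str]) -> Dict[str, str]:
    """Greedy, one-workload-at-a-time search over CPU/MMU monitor modes."""
    settings = dict(baseline)                 # e.g. every workload starting at "BT"
    for workload in WORKLOADS:
        best_mode, best_score = settings[workload], float("-inf")
        for mode in MODES:
            trial = {**settings, workload: mode}
            score = run_benchmark(trial)      # one full benchmark run per trial
            if score > best_score:
                best_mode, best_score = mode, score
        settings[workload] = best_mode        # lock in the winner before moving on
    return settings
```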

Results:


The test results summarized in Table 1 are both interesting and insightful. ESX’s efficient utilization of AMD-V + RVI for each workload highlights a leap forward in virtualization platform performance. Remember that once we determined AMD-V + RVI to be the best for a workload, we continued to use that setting for that workload during all subsequent tests unless otherwise noted. For example in the AMD-V File run below, the Web server VMs were set to AMD-V + RVI, File server VMs were set to use just AMD-V, and all other non-Standby servers were set to BT.

[Graph: relative VMmark v1.1 performance gains over the BT-only baseline for each hardware virtualization setting]

By taking advantage of hardware-assist features in the processor, ESX is able to achieve significant performance gains over using software-only virtualization. The default or “out of the box” settings produced good results, and further tuning for this particular set of workloads yielded additional performance gains of nearly 6% for our SUT. 

It should be noted that these performance gains may or may not hold true for dissimilar workloads, but for this configuration the improvement made by utilizing an all AMD-V and RVI enabled environment was very impressive. In addition, older processor versions with different cache sizes, clock rates, etc. may produce different results.

It’s probably safe to say that hardware technologies will continue to deliver improvements for virtualized environments.  ESX’s ability to provide proficient deployment of the latest and greatest hardware innovations, combined with its flexibility in allowing users to run different workloads with different levels of hardware assist, is what truly sets it apart.

All information in this post regarding future directions and intent are subject to change or withdrawal without notice and should not be relied on in making a purchasing decision of VMware's products. The information in this post is not a legal obligation for VMware to deliver any material, code, or functionality. The release and timing of VMware's products remains at VMware's sole discretion.