
Tag Archives: VMmark

Updated VMmark 2.1 Benchmarking Guide Available

Just a quick note to inform all benchmarking enthusiasts that we have released an updated VMmark 2.1 Benchmarking Guide. You can get it from the VMmark 2.1 download page. The updated guide contains a new troubleshooting section as well as more comprehensive instructions for using virtual clients and Windows Server 2008 clients.

 

Exploring Generational Scaling with VMmark 2.1

The steady march of technological improvements is nothing new.  As companies either expand or refresh their datacenters it often becomes a non-trivial task to quantify the returns on hardware investments.  This difficulty can be further compounded when it’s no longer sufficient to answer how well one new server will perform in relation to its predecessor, but rather how well the new cluster will perform in comparison to the previous one.  With this in mind, we set off to explore the generational scaling performance of two clusters made up of very similar hardware using the newly released VMmark 2.1.  VMmark 2.1 is a next-generation, multi-host virtualization benchmark that models not only application performance but also the effects of common infrastructure operations.  For more general information on VMmark 2.1, including the application and infrastructure workload details, take a look at the expanded overview in one of my previous blog posts.

Environment Configuration:

  • Clusters Under Test
    • Cluster 1
      • Systems Under Test: 2 x Dell PowerEdge R805
      • CPUs: 2 Six-Core AMD Opteron™ 2427 @ 2.2 GHz
      • Memory: 128GB DDR2 Reg ECC @ 533MHz
      • Storage Array: EMC CX4-120
      • Hypervisor: VMware ESX 4.1
      • Virtualization Management: VMware vCenter Server 4.1
    • Cluster 2
      • Systems Under Test: 2 x Dell PowerEdge R815
      • CPUs: 2 Twelve-Core AMD Opteron™ 6174 @ 2.2 GHz
      • Memory: 128GB DDR3 Reg ECC @ 1066MHz
      • Storage Array: EMC CX4-120
      • Hypervisor: VMware ESX 4.1
      • Virtualization Management: VMware vCenter Server 4.1
  • VMmark 2.1

Testing Methodology:

To measure the generational improvement of the two clusters under test, every attempt was made to set up and configure the servers identically.  The minimum configuration for VMmark 2.1 is a two-host cluster running a single tile.  The result from this minimal configuration on the older cluster, Cluster #1, was used as the baseline, and all VMmark 2.1 scalability data in this article were normalized to that score.  A series of tests were then conducted on each of the clusters in isolation, increasing the number of tiles being run until the cluster approached saturation.  Saturation was defined as the point where the cluster was unable to meet the minimum quality-of-service (QoS) requirements for VMmark 2.1.  Results that failed to meet minimum QoS for VMmark 2.1 were not plotted.
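The normalization and percentage comparisons used throughout these results reduce to simple arithmetic.  A minimal sketch in Python (the scores below are invented for illustration, not measured results):

```python
# Every score is divided by the two-host, one-tile baseline result;
# improvements are then reported as percent gains over the older cluster.
def normalize(scores, baseline):
    """Express each raw score as a multiple of the baseline score."""
    return [s / baseline for s in scores]

def percent_gain(old, new):
    """Percent improvement of `new` over `old`."""
    return (new - old) / old * 100.0

# Invented normalized scores for 1, 2, and 3 tiles on each cluster.
cluster1 = [1.00, 1.71, 2.05]
cluster2 = [1.01, 1.95, 2.23]

gains = [percent_gain(old, new) for old, new in zip(cluster1, cluster2)]
print([round(g) for g in gains])
```

With these invented inputs the gains round to 1%, 14%, and 9%, mirroring the shape of the cluster-to-cluster comparison discussed in the results.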

Results:

The primary component of change between the two clusters, making up the predominant factor in the generational scaling, is the change in processors.  The AMD Opteron™ 2427 processors provide six cores per socket for a total of twelve logical processors per server, whereas the newer AMD Opteron™ 6174 processors have twelve cores per socket for a total of twenty-four logical processors per server.  Factor in a doubling of the L3 cache per socket, as well as a doubling of the systems’ memory speeds, and the change in server characteristics is quite significant.

[Figure: normalized VMmark 2.1 scores for both clusters as the number of tiles increases]

As shown in the above graph, the generational scaling between the two clusters under test is significant.  In the one-tile case, both clusters were able to perform the work requested without the presence of resource constraints.  The performance improvement of the newer cluster became more apparent once we started scaling up the number of tiles and significantly increasing the level of CPU overcommitment and utilization.  It’s important to note that while adding tiles does effectively linearly increase the application workload requests being made, the workload caused by infrastructure operations does not scale in the same way, and was constant across all tests.  Cluster #1 scaled to three tiles, at which point it was saturated and unable to support additional tiles while continuing to meet the minimum quality-of-service (QoS) requirements of the benchmark.  For comparison, Cluster #2 achieved increases in normalized VMmark 2.1 scores of 1%, 14%, and 9% for the one-tile, two-tile, and three-tile configurations, respectively.  Cluster #2 was then scaled to seven tiles, beyond which point it was unable to meet the minimum QoS requirements.

The newer generation cluster, with two Dell PowerEdge R815 AMD Opteron™ 6174 based hosts running vSphere 4.1, exhibited excellent scaling as the load was increased up to seven tiles, more than doubling the previous-generation cluster’s performance and work accomplished.  Because VMmark 2.1 not only utilizes heterogeneous applications across a diverse computing environment but also measures the impact of commonplace infrastructure operations, it provided valuable insight into the generational scaling of the two clusters.  VMmark 2.1 proved itself an able benchmark for answering previously difficult datacenter questions.


VMmark 2.1 Released and Other News

VMmark 2.1 has been released and is available here. We had a list of improvements to VMmark 2.0 even as we finished up the initial release of the benchmark last fall. Most of the changes are intended to improve the usability, manageability, and scale-out ability of the benchmark. VMmark 2.0 has already generated tremendous interest from our partners and customers and we expect VMmark 2.1 to add to that momentum.

Only the harness and vclient directories have been refreshed for VMware VMmark 2.1. The notable changes include the following:

  • Uniform scaling of infrastructure operations as tile and cluster sizes increase. Previously, the dynamic storage relocation infrastructure workload was held at a single thread.
  • Allowance for multiple Deploy templates as tile and cluster sizes increase.
  • Addition of conditional support for clients running Windows Server 2008 Enterprise Edition 64-bit.
  • Addition of support for virtual clients, provided all hardware and software requirements are met.
  • Improved host-side reporter functionality.
  • Improved environment time synchronization.
  • Updates to several VMmark 2.0 tools to improve ease of setup and running.
  • Miscellaneous improvements to configuration checking, error reporting, debug output, and user-specified options.

All currently published VMmark 2.0 results are comparable to VMmark 2.1. Beginning with the release of VMmark 2.1, any submission of benchmark results must use the VMmark 2.1 benchmark kit.

In other news, Fujitsu published their first VMmark 2.0 result last week.

Also, Intel has joined the VMmark Review Panel. Other members are AMD, Cisco, Dell, Fujitsu, HP, and VMware. Every result published on the VMmark results page is reviewed for correctness and compliance by the VMmark Review Panel. In most cases this means that a submitter's result will be examined by their competitors prior to publication, which enhances the credibility of the results.

That's all for now, but we should be back soon with more interesting experiments using VMmark 2.1.

Experimenting with Cluster Scale-Out Utilizing VMmark 2

The first article in our VMmark 2 series gave an in-depth introduction to the benchmark while also presenting results on the scaling performance of a cluster based on a matched pair of systems.  The goal of this article is to continue to characterize larger and more diverse cloud configurations by testing scale-out performance of an expanding vSphere cluster.  This blog explores an enterprise-class cluster’s performance as more servers are added and subsequently the amount of work being requested is increased. Determining the impact of adding hosts to a cluster is important because it enables the measurement of the total work being done as cluster capacity and workload demand increases within a controlled environment.  It also assists in identifying the efficiency with which a vSphere managed cluster can utilize an increasing number of hosts.

VMmark 2 Overview:

VMmark 2 is a next-generation, multi-host virtualization benchmark that models not only application performance but also the effects of common infrastructure operations. VMmark 2 is a combination of the application workloads and the infrastructure operations running simultaneously.  Although the application workload levels are scaled up by the addition of tiles, the infrastructure operations scale as the cluster size increases.  In general, the infrastructure operations increase with the number of hosts in an N/2 fashion, where N is the number of hosts.  To calculate the score for VMmark 2, final results are generated from a weighted average of the two kinds of workloads; hence scores will not increase linearly as tiles are added.  For more general information on VMmark 2, including the application and infrastructure workload details, take a look at the expanded overview in my previous blog post.
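As a rough illustration of the scaling and scoring model described above, here is a small sketch.  The ceil(N/2) infrastructure term and the 50/50 weighting are illustrative assumptions only; the benchmark's actual formula is defined in the VMmark benchmarking guide.

```python
# Illustrative sketch only: the real VMmark 2 scoring formula lives in the
# benchmarking guide.  The N/2 infrastructure term and equal weighting here
# are assumptions made for demonstration.
import math

def infra_ops_load(num_hosts):
    """Infrastructure load grows with cluster size in an N/2 fashion."""
    return math.ceil(num_hosts / 2)

def combined_score(app_throughput, infra_throughput,
                   app_weight=0.5, infra_weight=0.5):
    """Weighted average of the two workload components.  Because the
    infrastructure term does not grow with tile count, the combined
    score does not rise linearly as tiles are added."""
    return app_weight * app_throughput + infra_weight * infra_throughput

# Doubling the cluster from two to four hosts roughly doubles infra load.
print(infra_ops_load(2), infra_ops_load(4))   # 1 2
```

This also shows why adding a tile (which raises only the application term) moves the combined score by less than the application gain alone.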

Environment Configuration:

  • Systems Under Test: 2-5 HP ProLiant DL380 G6
  • CPUs: 2 Quad-Core Intel® Xeon® X5570 @ 2.93 GHz with HyperThreading Enabled
  • Memory: 96GB DDR3 Reg ECC
  • Hypervisor: VMware ESX 4.1
  • Virtualization Management: VMware vCenter Server 4.1

Testing Methodology:

To test scale-out performance with VMmark 2, five identically configured HP ProLiant DL380 servers were connected to an EMC CLARiiON CX3-80 storage array.  The minimum configuration for VMmark 2 is a two-host cluster running one tile.  The result from this minimal configuration was used as the baseline, and all VMmark 2 scalability data in this article were normalized to that score.  A series of tests were then conducted on this two-host configuration, increasing the number of tiles being run until the cluster approached saturation.  As shown in the series’ first article, our two-host cluster approached saturation at four tiles but failed QoS requirements when running five tiles.  Starting with a common workload level of four tiles, the three-host, four-host, and five-host configurations were tested in a similar fashion, increasing the number of tiles until each configuration approached saturation.  Saturation was defined to be the point where the cluster was unable to meet the minimum quality-of-service requirements for VMmark 2.  For all testing, we recorded both the VMmark 2 score and the average cluster CPU utilization during the run phase.

Results:

Organizations often outgrow existing hardware capacity, and it can become necessary to add one or more hosts in order to relieve performance bottlenecks and meet increasing demands.  VMmark 2 was used to measure such a scenario by keeping the load constant as new hosts were incrementally added to the cluster.  The starting point for the experiments was four tiles.  At this load level the two hosts had approached saturation, with nearly 90% CPU utilization.  The test then determined the impact on cluster CPU utilization and performance of adding identical hosts to the available cluster resources.

[Figure: VMmark 2 scores and average cluster CPU utilization at a constant four-tile load as hosts are added]

As expected, scoring gains were easily achieved by adding hosts until the environment was generating approximately the maximum scores for the four-tile load level, as CPU resources became more plentiful.  In comparison to the two-host configuration, the normalized scores increased 6%, 12%, and 12% for the three-host, four-host, and five-host configurations, respectively.  The configurations with additional hosts were able to generate more throughput while also reducing the average cluster CPU utilization as the requested work was spread over more systems.  This highlights the additional CPU capacity held in reserve by the cluster at each data point.  By charting two or more points at the same load level, it is much easier to approximate the expected average CPU utilization after adding new hosts into the cluster.  This data, combined with established CPU usage thresholds, can make additional purchasing or system allocation decisions more straightforward.
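That approximation is first-order arithmetic.  A naive sketch, assuming the constant load spreads evenly across identical hosts with no per-host overhead:

```python
def projected_utilization(current_util_pct, current_hosts, new_hosts):
    """Naively project average cluster CPU utilization after adding
    identical hosts while holding the workload constant.  Assumes the
    work spreads evenly and ignores per-host overhead."""
    return current_util_pct * current_hosts / new_hosts

# Two hosts near saturation at ~90% average CPU utilization:
for hosts in (3, 4, 5):
    print(hosts, projected_utilization(90.0, 2, hosts))
```

Real clusters will land above these projections (DRS balancing, infrastructure operations, and host overhead all consume cycles), which is why charting measured points at the same load level is the more reliable approach described above.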

The above analysis looks at scale out performance for an expanding cluster with a fixed amount of work.  To get the whole picture of performance it’s necessary to measure performance and available capacity as the load and the number of hosts increases.  Specifically, as we progress through each of the configurations, does the reduction in cluster CPU utilization and improved performance measured in the previous experiment hold true for varied amounts of load and hosts?  

[Figures: normalized VMmark 2 scores and average cluster CPU utilization as both load and host count increase]

As shown in the above graphs, the vSphere-based cloud effortlessly integrated new hosts into our testing environment and delivered consistent returns on our physical server investments.  It’s important to note that in both the two-host and three-host configurations, the test failed at least one of the quality-of-service (QoS) requirements when the cluster reached saturation.  Also of note, the five-host configuration was not run out to saturation due to a lack of additional client hardware.  During our testing the addition of each host showed expected results with respect to scaling of VMmark 2 scores.  As we went through each of the configurations, the normalized scores increased an average of 13%, 13%, and 16% for the three-host, four-host, and five-host configurations, respectively.  Each of the configurations exhibited nearly linear scaling of CPU utilization as load was increased.  Based on these results, the VMware vSphere managed cluster was able to generate significant performance scaling while also utilizing the additional capacity of newly provisioned hosts quite efficiently.

Thus far all VMmark 2 studies have involved homogeneous clusters of identical servers.  Stay tuned for experimentation utilizing varying storage and/or networking solutions as well as heterogeneous clusters…

 

Cisco Publishes First VMmark 2.0 Result

Our partners at Cisco recently published the first official VMmark 2.0 result using a matched pair of UCS B200 M2 systems. You can find all of the details at the VMmark 2.0 Results Page. Using a matched pair of systems provides a close analogue to single-system benchmarks like VMmark 1.x while providing a more realistic performance profile by including infrastructure operations such as vMotion. Official VMmark 2.0 results are reviewed for accuracy and compliance by the VMmark Review Panel consisting of AMD, Cisco, Dell, Fujitsu, HP, and VMware.

 

Performance Scaling of an Entry-Level Cluster

Performance benchmarking is often conducted on top-of-the-line hardware, including hosts that typically have a large number of cores, maximum memory, and the fastest disks available. Hardware of this caliber is not always accessible to small or medium-sized businesses with modest IT budgets. As part of our ongoing investigation of different ways to benchmark the cloud using the newly released VMmark 2.0, we set out to determine whether a cluster of less powerful hosts could be a viable alternative for these businesses. We used VMmark 2.0 to see how a four-host cluster with a modest hardware configuration would scale under increasing load.

Workload throughput is often limited by disk performance, so the tests were repeated with two different storage arrays to show the effect that upgrading the storage would offer in terms of performance improvement. We tested two disk arrays that varied in both speed and number of disks, an EMC CX500 and an EMC CX3-20, while holding all other characteristics of the testbed constant.

To review, VMmark 2.0 is a next-generation, multi-host virtualization benchmark that models application performance and the effects of common infrastructure operations such as vMotion, Storage vMotion, and a virtual machine deployment. Each tile contains Microsoft Exchange 2007, DVD Store 2.1, and Olio application workloads which run in a throttled fashion. The Storage vMotion and VM deployment infrastructure operations require the user to specify a LUN as the storage destination. The VMmark 2.0 score is computed as a weighted average of application workload throughput and infrastructure operation throughput. For more details about VMmark 2.0, see the VMmark 2.0 website or Joshua Schnee’s description of the benchmark.

Configuration
All tests were conducted on a cluster of four Dell PowerEdge R310 hosts running VMware ESX 4.1 and managed by VMware vCenter Server 4.1.  These are typical of today’s entry-level servers; each server contained a single quad-core Intel Xeon 2.80 GHz X3460 processor (with hyperthreading enabled) and 32 GB of RAM.  The servers also used two 1Gbit NICs for VM traffic and a third 1Gbit NIC for vMotion activity.

To determine the relative impact of different storage solutions on benchmark performance, runs were conducted on two existing storage arrays, an EMC CX500 and an EMC CX3-20. For details on the array configurations, refer to Table 1 below. VMs were stored on identically configured ‘application’ LUNs, while a designated ‘maintenance’ LUN was used for the Storage vMotion and VM deployment operations.

Table 1. Disk Array Configuration

Results
To measure the cluster's performance scaling under increasing load, we started by running one tile, then increased the number of tiles until the run failed to meet Quality of Service (QoS) requirements. As load is increased on the cluster, it is expected that the application throughput, CPU utilization, and VMmark 2.0 scores will increase; the VMmark score increases as a function of throughput. By scaling out the number of tiles, we hoped to determine the maximum load our four-host cluster of entry-level servers could support.  VMmark 2.0 scores will not scale linearly from one to three tiles because, in this configuration, the infrastructure operations load remained constant; infrastructure load increases primarily as a function of cluster size. Although it shows only a two-host cluster, a figure from Joshua Schnee’s recent blog article demonstrates the relationship between application throughput, infrastructure operations throughput, and the number of tiles more clearly. Secondly, we expected to see improved performance when running on the CX3-20 versus the CX500 because the CX3-20 has a larger number of disks per LUN as well as faster individual drives. Figure 1 below details the scale-out performance on the CX500 and the CX3-20 disk arrays using VMmark 2.0.

Figure 1. VMmark 2.0 Scale Out On a Four-Host Cluster


Both configurations saw improved throughput from one to three tiles but at four tiles they failed to meet at least one QoS requirement. These results show that a user wanting to maintain an average cluster CPU utilization of 50% on their four-host cluster could count on the cluster to support a two-tile load. Note that in this experiment, increased scores across tiles are largely due to increased workload throughput rather than an increased number of infrastructure operations.

As expected, runs using the CX3-20 showed consistently higher normalized scores than those on the CX500. Runs on the CX3-20 outperformed the CX500 by 15%, 14%, and 12% on the one, two, and three-tile runs, respectively. The increased performance of the CX3-20 over the CX500 was accompanied by approximately 10% higher CPU utilization, which indicated that the faster CX3-20 disks allowed the CPU to stay busier, increasing total throughput.

The results show that our cluster of entry-level servers with a modest disk array supported approximately 220 DVD Store 2.1 operations per second, 16 send-mail actions, and 235 Olio updates per second. A more robust disk array supported 270 DVD Store 2.1 operations per second, 16 send-mail actions, and 235 Olio updates per second with 20% lower latencies on average and a correspondingly slightly higher CPU utilization.

Note that this type of experiment is possible for the first time with VMmark 2.0; VMmark 1.x was limited to benchmarking a single host but the entry-level servers under test in this study would not have been able to support even a single VMmark 2.0 tile on an individual server. By spreading the load of one tile across a cluster of servers, however, it becomes possible to quantify the load that the cluster as a whole is capable of supporting.  Benchmarking our cluster with VMmark 2.0 has shown that even modest clusters running vSphere can deliver an enormous amount of computing power to run complex multi-tier workloads.

Future Directions
In this study, we scaled out VMmark 2.0 on a four-host entry-level cluster to measure performance scaling and the maximum supported number of tiles. This put a much higher load on the cluster than would be typical for a small or medium-sized business, giving such businesses confidence that their application workloads will fit comfortably within the cluster's capabilities.  An alternate experiment would be to run fewer tiles while measuring the performance of other enterprise-level features, such as VMware High Availability. This ability to benchmark the cloud in many different ways is one benefit of having a well-designed multi-host benchmark. Keep watching this blog for more interesting studies in benchmarking the cloud with VMmark 2.0.

Two Host Matched-Pair Scaling Utilizing VMmark 2

As mentioned in Bruce’s previous blog, VMmark 2.0 has been released.  With its release we can now begin to benchmark an enterprise-class cloud platform in entirely new and interesting ways.  VMmark 2 is based on a multi-host configuration that includes bursty application and infrastructure workloads to drive load against a cluster.  VMmark 2 allows for the analysis of infrastructure operations within a controlled benchmark environment for the first time, distinguishing it from server consolidation benchmarks. 

Leading off a series of new articles introducing VMmark 2, the goal of this article is to provide a bit more detail about VMmark 2 and to test a vSphere enterprise cloud, focusing on the scaling performance of a matched pair of systems.  More simply put, this blog looks at what happens to cluster performance as more load is added to a pair of identical servers.  This is important because it provides a means of gauging the efficiency of a vSphere cluster as demand increases.

VMmark 2 Overview

VMmark 2 is a next-generation, multi-host virtualization benchmark that not only models application performance but also the effects of common infrastructure operations. It models application workloads in the now familiar VMmark 1 tile-based approach, where the benchmarker adds tiles until either a goal is met or the cluster is at saturation.  It’s important to note that while adding tiles does effectively linearly increase the application workload requests being made, the load caused by infrastructure operations does not scale in the same way.  VMmark 2 infrastructure operations scale as the cluster size grows to better reflect modern datacenters.  Greater detail on workload scaling can be found within the benchmarking guide available for download.  To calculate the score for VMmark 2, final results are generated from a weighted average of the two kinds of workloads; hence scores will not linearly increase as tiles are added.  In addition to the throughput metrics, quality-of-service (QoS) metrics are also measured and minimum standards must be maintained for a result to be considered fully compliant.

VMmark 2 contains the combination of the application workloads and infrastructure operations running simultaneously.  This allows for the benchmark to include both of these critical aspects in the results that it reports.  The application workloads that make up a VMmark 2 tile were chosen to better reflect applications in today’s datacenters by employing more modern and diverse technologies.  In addition to the application workloads, VMmark 2 makes infrastructure operation requests of the cluster.  These operations stress the cluster with the use of vMotion, storage vMotion and Deploy operations.  It’s important to note that while the VMmark 2 harness is stressing the cluster through the infrastructure operations, VMware’s Distributed Resource Scheduler (DRS) is dynamically managing the cluster in order to distribute and balance the computing resources available.  The diagrams below summarize the key aspects of the application and infrastructure workloads.

VMmark 2 Workload Details:

[Figure: VMmark 2 application workloads within a tile]

Application Workloads – Each “Tile” consists of the following workloads and VMs:

  • DVD Store 2.1 – multi-tier OLTP workload consisting of a database VM and three webserver VMs driving a bursty load profile
  • Exchange 2007 – mailserver workload
  • Standby Server – lightly-loaded heartbeat server
  • OLIO – multi-tier social networking workload consisting of a web server and a database server

[Figure: VMmark 2 infrastructure operations]

Infrastructure Workloads – Consists of the following operations:

  • User-initiated vMotion
  • Storage vMotion
  • Deploy – VM cloning, OS customization, and updating
  • DRS-initiated vMotion to accommodate host-level load variations
 

Environment Configuration:

  • Systems Under Test: 2 HP ProLiant DL380 G6
  • CPUs: 2 Quad-Core Intel® Xeon® X5570 @ 2.93 GHz with HyperThreading Enabled
  • Memory: 96GB DDR3 Reg ECC
  • Storage Array: EMC CLARiiON CX3-80
  • Hypervisor: VMware ESX 4.1
  • Virtualization Management: VMware vCenter Server 4.1.0

Testing Methodology:

To test scalability as the number of VMmark 2 tiles increases, two HP ProLiant DL380 servers were configured identically and connected to an EMC CLARiiON CX3-80 storage array.  The minimum configuration for VMmark 2 is a two-host cluster running one tile; as such, this was our baseline, and all VMmark 2 scores were normalized to this result.  A series of tests were then conducted on this two-host configuration, increasing the number of tiles being run until the cluster approached saturation, recording both the VMmark 2 score and the average cluster CPU utilization during the run phase.

Results:

In circumstances where demand on a cluster increases, it becomes critical to understand how the environment adapts to these demands in order to plan for future needs.  In many cases it can be especially important for businesses to understand how the application and infrastructure workloads were individually impacted.  By breaking out the distinct VMmark 2 sub-metrics we can get a fine grained view of how the vSphere cluster responded as the number of tiles, and thus work performed, increased.

[Figure: VMmark 2 sub-metric scaling as tiles are added to the two-host cluster]

From the graph above, we see the VMmark 2 scores show significant gains until the two-host cluster reached saturation at 5 tiles.  Delving into this further, we see that, as expected, the infrastructure operations remained nearly constant because the requested infrastructure load did not change during the experimentation.  Continued examination shows that the cluster was able to achieve nearly linear scaling for the application workloads through 4 tiles, which is equivalent to 4 times the application work requested of the 1-tile configuration.  When we reached the 5-tile configuration, the cluster was unable to meet the minimum quality-of-service requirements of VMmark 2; however, this result still helps us to understand the performance characteristics of the cluster.
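One way to quantify "nearly linear" application scaling is a simple efficiency ratio.  A small sketch with invented sub-metric scores (not the measured results):

```python
def scaling_efficiency(score_n, score_1, n_tiles):
    """Ratio of the measured score at n tiles to ideal linear scaling
    from the one-tile result; 1.0 means perfectly linear scaling."""
    return score_n / (score_1 * n_tiles)

# Invented application-workload sub-metric scores for illustration only.
one_tile_score = 1.00
app_scores = {2: 1.98, 3: 2.94, 4: 3.88}

for tiles, score in app_scores.items():
    print(tiles, round(scaling_efficiency(score, one_tile_score, tiles), 2))
```

Values staying close to 1.0 as tiles are added are what "nearly linear" means in practice; a sharp drop in this ratio is one signal that a cluster is approaching saturation.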

Monitoring how the average cluster CPU utilization changed during the course of our experiments is another critical component to understanding cluster behavior as load increases.  The diagram below plots the VMmark 2 scores shown in the above graph and average cluster CPU utilization for each configuration.

[Figure: VMmark 2 scores and average cluster CPU utilization for each tile configuration]

The resulting diagram helps to illustrate the impact on cluster CPU utilization and performance of incrementing the work done by our cluster through the addition of VMmark 2 tiles. The results show that the vSphere matched-pair cluster was able to deliver outstanding scaling of enterprise-class applications while also providing unequaled flexibility in the load balancing, maintenance, and provisioning of our cloud. This is just the beginning of what we’ll see in terms of analysis using the newly released VMmark 2; we plan to explore larger and more diverse configurations next, so stay tuned…

 

VMmark 2.0 Release

VMmark 2.0, VMware’s next-generation multi-host virtualization benchmark, is now generally available here.

We were motivated to create VMmark 2.0 by the revolutionary advancements in virtualization since VMmark 1.0 was conceived. The rapid pace of innovation in both the hypervisor and the hardware has quickly transformed datacenters by enabling easier virtualization of heavy and bursty workloads coupled with dynamic VM relocation (vMotion), dynamic datastore relocation (Storage vMotion), and automation of many provisioning and administrative tasks across large-scale multi-host environments. In this paradigm, a large fraction of the stresses on the CPU, network, disk, and memory subsystems is generated by the underlying infrastructure operations. Load balancing across multiple hosts can also greatly affect application performance. The benchmarking methodology of VMmark 2.0 continues to focus on user-centric application performance while accounting for the effects of infrastructure activity on overall platform performance. This approach provides a much more accurate picture of platform capabilities than less comprehensive benchmarks.

I would like to thank all of our partners who participated in the VMmark 2.0 beta program. Their thorough testing and insightful feedback helped speed the development process while delivering a more robust benchmark. I anticipate a steady flow of benchmark results from partners over the coming months and years.

I should also acknowledge the hard work of my colleagues in the VMmark team that completed VMmark 2.0 on a relatively short timeline. We have performed a wide array of experiments during the development of VMmark 2.0 and will use the data as the basis for a series of upcoming posts in this forum. Some topics likely to be covered are cluster-wide scalability, performance of heterogeneous clusters, and networking tradeoffs between 1Gbit and 10Gbit for vMotion. I hope we can inspire others to use VMmark 2.0 to explore performance characteristics in multi-host environments in novel and interesting ways all the way up to cloud-scale.

 

VMmark 2.0 Beta Overview

As I mentioned in my last blog, we have been developing VMmark 2.0, a next-generation multi-host virtualization benchmark that models not only application performance in a virtualized environment but also the effects of common virtual infrastructure operations. This is a natural progression from single-host virtualization benchmarks like VMmark 1.x and SPECvirt_sc2010. Benchmarks measuring single-host performance, while valuable, do not adequately capture the complexity inherent in modern virtualized datacenters. With that in mind, we set out to construct a meaningfully stressful virtualization benchmark with the following properties:

  • Multi-host to model realistic datacenter deployments
  • Virtualization infrastructure workloads to more accurately capture overall platform performance
  • Heavier workloads than VMmark 1.x to reflect heavier customer usage patterns enabled by the increased capabilities of the virtualization and hardware layers
  • Multi-tier workloads driving both VM-to-VM and external network traffic
  • Workload burstiness to ensure robust performance under variable high loads

The addition of virtual infrastructure operations to measure their impact on overall system performance in a typical multi-host environment is a key departure from traditional single-server benchmarks. VMmark 2.0 includes the execution of the following foundational and commonly-used infrastructure operations:

  • User-initiated vMotion 
  • Storage vMotion
  • VM cloning and deployment
  • DRS-initiated vMotion to accommodate host-level load variations

The VMmark 2.0 tile features a significantly heavier load profile than VMmark 1.x and consists of the following workloads:

  • DVD Store 2 – multi-tier OLTP workload consisting of a 4-vCPU database VM and three 2-vCPU webserver VMs driving a bursty load profile
  • OLIO – multi-tier social networking workload consisting of a 4-vCPU web server and a 2-vCPU database server
  • Exchange2007 – 4-vCPU mailserver workload
  • Standby server – 1 vCPU lightly-loaded server
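A quick tally of the per-tile footprint, using the vCPU counts listed above, gives a sense of how much heavier this load profile is. The VM labels below are just illustrative shorthand, not official benchmark names:

```python
# Per-tile VM inventory based on the workload list above.
# Labels are illustrative; vCPU counts come from the tile description.
tile = {
    "DS2_database": 4,   # DVD Store 2 database VM
    "DS2_web1": 2,       # three DVD Store 2 webserver VMs
    "DS2_web2": 2,
    "DS2_web3": 2,
    "Olio_web": 4,       # Olio web server
    "Olio_db": 2,        # Olio database server
    "Exchange2007": 4,   # mailserver
    "Standby": 1,        # lightly-loaded standby server
}

print(f"{len(tile)} VMs, {sum(tile.values())} vCPUs per tile")
```

Each additional tile adds the same set of VMs, so scaling a result up means multiplying this footprint across the cluster.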

We kicked off an initial partner-only beta program in late June and are actively polishing the benchmark for general release. We will be sharing a number of interesting experiments using VMmark 2.0 in our blog leading up to the general release of the benchmark, so stay tuned.

Surveying Virtualization Performance Trends with VMmark

The trends in published VMmark scores are an ideal illustration of the historical long-term performance gains for virtualized platforms. We began work on what would become VMmark 1.0 almost five years ago. At the time, ESX 2.5 was the state-of-the-art hypervisor. Today’s standard features such as DRS, DPM, and Storage vMotion were in various prototype and development stages. Processors like the Intel Pentium4 5xx series (Prescott) or the single-core AMD 2yy-series Opterons were the high-end CPUs of choice. Second-generation hardware-assisted virtualization features such as AMD’s Rapid Virtualization Indexing (RVI) and Intel’s Extended Page Tables (EPT) were not yet available. Nevertheless, virtualization’s first wave was allowing customers to squeeze much more value from their existing resources via server consolidation. Exactly how much value was difficult to quantify. Our VMmark odyssey began with the overall goal of creating a representative and reliable benchmark capable of providing meaningful comparisons between virtualization platforms.

VMmark 1.0 was released nearly three years ago after two years of painstaking work and multiple beta releases of the benchmark. The reference architecture for VMmark 1.x is a two-processor Pentium4 (Prescott) server running ESX 3.0. That platform was capable of supporting one VMmark tile (six VMs) and by definition achieved a score of 1.0. (All VMmark results are normalized to this reference score.) The graph below shows a sampling of published two-socket VMmark scores for each successive processor generation.
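To make the normalization concrete, here is a simplified sketch of VMmark 1.x-style scoring, not the official formula: each workload's throughput is divided by the reference platform's result for that workload, the ratios within a tile are combined with a geometric mean, and tile scores are summed. The workload names and throughput numbers are made up for illustration:

```python
import math

# Hypothetical reference throughputs (one per workload); the real reference
# values come from the Pentium4 (Prescott) run described above.
REFERENCE = {"mail": 100.0, "web": 200.0, "db": 50.0}

def tile_score(measured: dict) -> float:
    """Geometric mean of per-workload ratios against the reference run."""
    ratios = [measured[w] / REFERENCE[w] for w in REFERENCE]
    return math.prod(ratios) ** (1.0 / len(ratios))

def overall_score(tiles: list) -> float:
    """Sum the per-tile scores across all running tiles."""
    return sum(tile_score(t) for t in tiles)

# A run that exactly matches the reference platform on one tile scores 1.0
# by definition; a server running two tiles at twice the reference
# throughput would score 4.0 under this simplified scheme.
print(overall_score([{"mail": 100.0, "web": 200.0, "db": 50.0}]))
```

The geometric mean keeps one outlier workload from dominating a tile's score, which is why normalized-ratio benchmarks commonly prefer it over an arithmetic mean.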

[Figure Blog_slide_3: published two-socket VMmark scores by processor generation]

ESX 3.0, a vastly more capable hypervisor than ESX 2.5, had arrived by the time of the VMmark 1.0 GA in mid-2007. Greatly improved CPU designs were also available. Two processors commonly in use by that time were the dual-core Xeon 51xx series and the quad-core Xeon 53xx series. ESX 3.5 was released with a number of performance improvements such as TCP Segmentation Offloading (TSO) support for networking in the same timeframe as the Xeon 54xx. Both ESX 4.0 and Intel 55xx (Nehalem) CPUs became available in early 2009. ESX 4.0 was a major new release with a broad array of performance enhancements and support for new hardware features such as EPT and simultaneous multi-threading (SMT), providing a significant boost in overall performance. The recently released hexa-core Intel 56xx CPUs (Westmere) show excellent scaling compared to their quad-core 55xx brethren. (Overall, ESX delivers excellent scaling and takes advantage of increased core counts on all types of servers.) What is most striking to me in this data is the big picture: the performance of virtualized consolidation workloads as measured by VMmark 1.x has roughly doubled every year for the past five years.
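As a quick sanity check on that observation, annual doubling compounds to roughly a 32x gain over five years (illustrative arithmetic only, not measured data):

```python
# Compound growth implied by "roughly doubled every year": 2^5 = 32.
years = 5
growth_factor = 2 ** years
print(f"{growth_factor}x over {years} years")
```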

In fact, the performance of virtualized platforms has increased to the point that the focus has shifted away from consolidating lightly-loaded virtual machines on a single server to virtualizing the entire range of workloads (heavy and light) across a dynamic multi-host datacenter. Not only application performance but also infrastructure responsiveness and robustness must be modeled to characterize modern virtualized environments. With this in mind, we are currently developing VMmark 2.0, a much more complex, multi-host successor to VMmark 1.x. We are rapidly approaching a limited beta release of this new benchmark, so stay tuned for more. But in this post, I’d like to look back and remember how far we’ve come with VMmark 1.x. Let’s hope the next five years are as productive.