Author Archives: Harold Rosenberg

Weathervane, a benchmarking tool for virtualized infrastructure and the cloud, is now open source.

Weathervane is a performance benchmarking tool developed at VMware.  It lets you assess the performance of your virtualized or cloud environment by driving a load against a realistic application and capturing relevant performance metrics.  You might use it to compare the performance characteristics of two different environments, or to understand the performance impact of some change in an existing environment.

Weathervane is very flexible, allowing you to configure almost every aspect of a test, and yet is easy to use thanks to tools that help prepare your test environment and a powerful run harness that automates almost every aspect of your performance tests.  You can typically go from a fresh start to running performance tests with a large multi-tier application in a single day.

Weathervane supports a number of advanced capabilities, such as deploying multiple independent application instances, deploying application services in containers, driving variable loads, and allowing run-time configuration changes for measuring elasticity-related performance metrics.

Weathervane has been used extensively within VMware, and is now open source and available on GitHub at

The rest of this blog gives an overview of the primary features of Weathervane.


Introducing the Weathervane Benchmark

The ways in which we use, design, deploy, and evaluate the performance of large-scale web applications have changed significantly in recent years.  These changes have been driven by the increase in computing capacity and flexibility provided by virtualized and cloud-based computing infrastructures. The majority of these changes are not reflected in current web-application benchmarks.

Weathervane is a new web-application benchmark we have been developing as part of our work on optimizing the performance of VMware products for the next generation of cloud-scale applications. The goal of the Weathervane project has been to develop an application-level benchmark that captures the key characteristics of the workloads, design paradigms, deployment architectures, and performance metrics of the next generation of large-scale web applications. As we approach the initial release of Weathervane, we are starting to use it to understand performance across our product range.  In this post, we will give an overview of Weathervane that will provide context for the performance results that we will be writing about over the coming months.

Weathervane Motivation

There have been many changes in usage patterns and development practices for large-scale web applications.  The design and development of Weathervane has been driven by the goal of capturing these changes in a highly scalable benchmark that includes these key aspects:

  • The effect of increased user interactivity and rich web interfaces on workload patterns
  • New design patterns for decoupled and asynchronous services
  • The use of multiple data sources for data with varying scalability and consistency requirements
  • Flexible architectures that allow for deployment on a wide range of virtual and cloud-based infrastructures

The effect of increased user interactivity and rich web interfaces is one of the most important of these aspects. In current benchmarks, a user is represented by a single thread operating independently from other users. Contrast that to the way we interact with applications as diverse as social media and stock trading. Many user interactions, such as responding to a status update or selling shares of stock, are in direct response to the actions of other users.  In addition, the current generation of script-rich web interfaces performs many operations asynchronously without any action from, or even awareness by, the user.  Examples include web pages and rich client interfaces that update active friend lists, check for messages, or maintain stock tickers.  This leads to a very different model of user behavior than the traditional single-threaded, click-and-think design used by existing benchmarks.  As a result, one of the key design goals for Weathervane was to develop both a benchmark application and a workload generator that would allow us to capture the effect of these new workload patterns.

Weathervane Overview

An application-level benchmark typically consists of two main parts: the benchmark application and the workload driver.  The application is selected and designed to represent characteristics and technology choices that are typical of a certain class of applications.  The workload driver interacts with the benchmark application to simulate the behavior of typical users of the application.   It also captures the performance metrics that are used to quantify the performance of the application/infrastructure combination. Some benchmarks, including Weathervane, also provide a run harness that assists in the set-up and automation of benchmark runs.

The Weathervane benchmark application is Auction, a web application for managing and hosting real-time auctions. An auction hosted by Auction consists of a number of items that are placed up for bid in a set order.  Users are given only a limited time to bid before an item is sold and the next item is placed up for bid.  When an item is up for bid, all users attending the auction are presented with a description and image of the item.  Users see and respond to bids placed by other users. Auction can support thousands of simultaneous auctions with large numbers of active users, with each user possibly attending multiple simultaneous auctions.  The figure below shows the browser application used to interact with the Auction application.  This figure shows the bidding screen for a user who is attending two auctions.  The current item, bid, and bid status for each auction are updated in real time in response to bids placed by other users.

Figure 1. Auction bidding screen

In addition to managing live auctions, Auction provides auction and item search, profile management, historical data queries, image management, auction management, and other services that would be required by a user of the application.

Auction uses a scalable architecture that allows deployments to be easily sized for a large range of user loads.  A full deployment of Auction includes a wide variety of support services, such as load-balancing, caching, and messaging servers, as well as relational, NoSQL, and filesystem-based data stores supporting scalability for data with a variety of consistency requirements.  The figure below shows a full deployment of Auction and the Weathervane workload driver.

Figure 2. Logical layout for full Weathervane deployment

The following is a brief description of the role played by each tier.

Infrastructure Services

TCP Load Balancers: The simulated users on the workload driver address the application through a set of IP addresses mapped to the application’s external hostname.  The TCP load balancers jointly manage these IP addresses to ensure that all IP addresses remain available in the event of a failure. The TCP load balancers distribute the load across the web servers while maintaining SSL/TLS session affinity.

Messaging Servers: The application nodes use the messaging backbone to distribute work and state-change information regarding active auctions.

Application Services

Web Servers: The web servers terminate SSL, serve static content, act as load-balancing reverse proxies for the application servers, and provide a proxy cache for application content, such as images returned by the application servers.

Application Servers: The application servers run Java servlet containers in which the application services are deployed.  The Auction application services use a stateless implementation with a RESTful interface that simplifies scaling.

Data Services

Relational Database: The relational database is used for all data that is involved in transactions.  This includes user account information, as well as auction, item, and high-bid data.

NoSQL Data Server:  The NoSQL Document Store is used to store image metadata as well as activity data such as auction attendance information and bid records. It can also be used to store uploaded images. Using the NoSQL store as an image store allows the application to take advantage of its sharding capabilities to easily scale the I/O capacity for image storage.

File Server: The file server is used exclusively to store item images uploaded by users.  Note that the file server is optional, as the images can be stored and served from the NoSQL document store.

Weathervane currently includes configuration support for deploying Auction using the following services:

  • Virtual IP Address Management: Keepalived
  • TCP Load Balancer: HAProxy
  • Web Server: Apache Httpd and Nginx
  • Application Server: Apache Tomcat with Ehcache for in-memory caching
  • Messaging Server: RabbitMQ
  • Relational Database: MySQL and PostgreSQL
  • NoSQL Data Store: MongoDB
  • Network Filesystem: NFS

Additional implementations will be supported in future releases.

Weathervane can be deployed with different subsets of the infrastructure and application services.  For example, the figure below shows a minimal deployment of Weathervane with a single application server and the supporting data services.  In this configuration, the application server performs the tasks handled by the web server in a larger deployment.

Figure 3. Logical layout for a minimal Weathervane deployment

The Weathervane workload driver has been developed to drive HTTP-based loads for modern scalable web applications.  It can simulate workloads for applications that incorporate asynchronous behaviors using embedded JavaScript, and those requiring complex data-driven behaviors, as in web applications with significant inter-user interaction.  The Weathervane workload driver uses an asynchronous design with a small number of threads supporting a large number of simulated users. Simulated users may have multiple active asynchronous activities which share state information, and complex workload patterns can be specified with control-flow decisions made based on retrieved state and operation history. These features allow us to efficiently simulate workloads that would be presented to web applications by rich web clients using asynchronous JavaScript operations.
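To make the asynchronous design concrete, here is a minimal Python sketch of the general idea: many simulated users run as lightweight tasks on a single thread, and each user decides what to do based on state that other users have changed. This is not Weathervane's driver code. The names simulated_user and run_workload, the shared-state dictionary, and the sleep standing in for a network round trip are all assumptions made for illustration.

```python
import asyncio
import random


async def simulated_user(user_id: int, shared_state: dict, n_ops: int) -> int:
    """One simulated user; returns the number of operations it performed."""
    done = 0
    for _ in range(n_ops):
        # Stand-in for think time or a network round trip. While this user
        # waits, the event loop runs other users on the same thread.
        await asyncio.sleep(random.uniform(0.0, 0.01))
        # Control-flow decision based on state produced by other users:
        # only bid if someone else currently holds the high bid.
        if shared_state["high_bidder"] != user_id:
            shared_state["high_bid"] += 1
            shared_state["high_bidder"] = user_id
        done += 1
    return done


async def run_workload(n_users: int, n_ops: int) -> dict:
    """Run n_users concurrently on one thread and summarize the outcome."""
    shared_state = {"high_bid": 0, "high_bidder": None}
    counts = await asyncio.gather(
        *(simulated_user(u, shared_state, n_ops) for u in range(n_users))
    )
    return {"operations": sum(counts), "high_bid": shared_state["high_bid"]}


if __name__ == "__main__":
    # 1000 simulated users, all multiplexed onto a single thread.
    print(asyncio.run(run_workload(n_users=1000, n_ops=5)))
```

Because every user spends most of its time waiting, one thread can service a large number of them, which is the property the driver's design relies on.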

The Weathervane workload driver also monitors quality-of-service (QoS) metrics for both the Auction application and the overall workload. The application-level QoS requirements are based on the 99th percentile response-times for the individual operations.  An operation represents a single action performed by a user or embedded script, and may consist of multiple HTTP exchanges.  The workload-level QoS requirements define the required mix of operations that must be performed by the users during the workload’s steady state.  This mix must be consistent from run to run in order for the results to be comparable.  In order for a run of the benchmark to pass, all QoS requirements must be satisfied.
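The pass/fail rule described above can be sketched in a few lines of Python. This is an illustration only: the function names, the nearest-rank percentile method, the per-operation limits, and the mix tolerance are assumptions, not Weathervane's actual thresholds or code.

```python
def percentile_99(samples):
    """Nearest-rank 99th percentile of a non-empty list of response times."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def run_passes(response_times, p99_limits, required_mix, mix_tolerance):
    """Return True only if every QoS requirement is satisfied.

    response_times: {operation: [response time in seconds, ...]}
    p99_limits:     {operation: maximum allowed 99th-percentile time}
    required_mix:   {operation: required fraction of all operations}
    mix_tolerance:  allowed absolute deviation from each required fraction
    """
    total = sum(len(samples) for samples in response_times.values())
    if total == 0:
        return False
    # Application-level QoS: 99th-percentile response time per operation.
    for op, limit in p99_limits.items():
        samples = response_times.get(op, [])
        if not samples or percentile_99(samples) > limit:
            return False
    # Workload-level QoS: the operation mix must match the required mix.
    for op, share in required_mix.items():
        actual = len(response_times.get(op, [])) / total
        if abs(actual - share) > mix_tolerance:
            return False
    return True
```

A single slow outlier in a hundred samples does not fail the run under this rule, but two do, because the second one lands at the 99th-percentile rank.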

Weathervane also includes a run harness that automates most of the steps involved in configuring and running the benchmark.  The harness takes as input a configuration file that describes the deployment configuration, the user load, and many service-specific tuning parameters.  The harness is then able to power on virtual machines, configure and start the various software services, deploy the software components of Auction, run the workload, and collect the results, as well as the log, configuration, and statistics files from all of the virtual machines and services.  The harness also manages the tasks involved in loading and preparing the data in the data services before each run.


Scaling to large deployments is a key goal of Weathervane.  Therefore, it will be useful to conclude with some initial scalability data to show how we are doing in achieving that goal. There are many possible ways to scale up a deployment of Auction.  For the sake of providing a straightforward comparison, we will focus on scaling out the number of application server instances in an otherwise fixed deployment configuration.  The CPU utilization of the application server is typically the performance bottleneck in a well-balanced Auction deployment.

The figure below shows the logical layout of the VMs and services in this deployment.  Physically, all VMs reside on the same network subnet on the vSphere hosts, which are connected by a 10Gb Ethernet switch.

Figure 4. Deployment configuration for scaling results

The VMs in the Auction deployment were distributed across three VMware vSphere 6 hosts.  Table 1 gives the hardware details of the hosts.

Host Name   Host Vendor/Model                      Processors                                               Memory
Host1       Dell PowerEdge R720, 2-Socket Server   Intel® Xeon® CPU E5-2690 @ 2.90GHz, 8 Core, 16 Thread
Host2       Dell PowerEdge R720, 2-Socket Server   Intel® Xeon® CPU E5-2690 @ 2.90GHz, 8 Core, 16 Thread
Host3       Dell PowerEdge R720, 2-Socket Server   Intel® Xeon® CPU E5-2680 @ 2.70GHz, 8 Core, 16 Thread

Table 1. vSphere 6 hosts for Auction deployment

Table 2 shows the configuration of the VMs, and their assignment to vSphere hosts.  As the goal of these tests was to examine the scalability of the Auction application, and not the characteristics of vSphere 6, we chose the VM sizing and assignment in part to avoid using more virtual CPUs than physical cores. While we did some tuning of the overall configuration, we did not necessarily obtain the optimal tuning for each of the service configurations.  The configuration was chosen so that the application server was the bottleneck as far as possible within the restrictions of the available physical servers.  In future posts, we will examine the tuning of the individual services, tradeoffs in deployment configurations, and best practices for deploying Auction-like applications on vSphere.

Service                     Host    VM vCPUs (each)   VM Memory
HAProxy 1                   Host1   2                 8GB
HAProxy 2                   Host2   2                 8GB
HAProxy 3                   Host3   2                 8GB
Nginx 1, 2, and 3           Host3   2                 8GB
RabbitMQ 1                  Host2   1                 2GB
RabbitMQ 2                  Host1   1                 2GB
Tomcat 1, 3, 5, 7, and 9    Host1   2                 8GB
Tomcat 2, 4, 6, 8, and 10   Host2   2                 8GB
MongoDB 1 and 3             Host2   1                 32GB
MongoDB 2 and 4             Host1   1                 32GB
PostgreSQL                  Host3   6                 32GB

Table 2. Virtual machine configuration
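The claim that no host is assigned more virtual CPUs than it has physical cores can be checked directly from Tables 1 and 2. The short Python sketch below does the arithmetic. It assumes that "8 Core, 16 Thread" in Table 1 describes each socket, giving 16 physical cores per two-socket host; the variable and function names are illustrative.

```python
# Assumption: "8 Core, 16 Thread" in Table 1 is per socket, and each host
# has two sockets, so each host has 16 physical cores.
CORES_PER_HOST = 2 * 8

# (service group, host, number of VMs, vCPUs per VM), taken from Table 2.
VMS = [
    ("HAProxy 1", "Host1", 1, 2),
    ("HAProxy 2", "Host2", 1, 2),
    ("HAProxy 3", "Host3", 1, 2),
    ("Nginx 1-3", "Host3", 3, 2),
    ("RabbitMQ 1", "Host2", 1, 1),
    ("RabbitMQ 2", "Host1", 1, 1),
    ("Tomcat 1, 3, 5, 7, 9", "Host1", 5, 2),
    ("Tomcat 2, 4, 6, 8, 10", "Host2", 5, 2),
    ("MongoDB 1 and 3", "Host2", 2, 1),
    ("MongoDB 2 and 4", "Host1", 2, 1),
    ("PostgreSQL", "Host3", 1, 6),
]


def vcpus_per_host(vms):
    """Sum the vCPUs assigned to each host."""
    totals = {}
    for _service, host, count, vcpus in vms:
        totals[host] = totals.get(host, 0) + count * vcpus
    return totals


for host, total in sorted(vcpus_per_host(VMS).items()):
    print(f"{host}: {total} vCPUs of {CORES_PER_HOST} physical cores")
```

With all ten application servers deployed, Host1 and Host2 each carry 15 vCPUs and Host3 carries 14, so every host stays within its 16 cores under this assumption.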

Figure 5 shows the peak load that can be supported by this deployment configuration as the number of application servers is scaled from one to ten.  The peak load supported by a configuration is the maximum load at which the configuration can satisfy all of the QoS requirements of the workload.  The dotted line shows linear scaling of the maximum load extrapolated from the single application server result.  The actual scaling is essentially linear up to six application-server VMs.  At that point, the overall utilization of the physical servers starts to affect the ability to maintain linear scaling.  With seven application servers, the web-server tier becomes a scalability bottleneck, but there are not sufficient CPU cores available to add additional web servers.

It would require additional infrastructure to determine how far the linear scaling could be extended.  However, the current results provide strong evidence that with sufficient resources, Weathervane will be able to scale to support very large loads representing large numbers of users.

Figure 5. Maximum supported users for increasing number of application servers
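The comparison against the dotted linear-scaling line amounts to dividing each measured peak load by the single-server peak multiplied by the number of application servers. A minimal Python sketch follows; the peak-load numbers in it are hypothetical placeholders chosen only to show the calculation, not the measurements behind Figure 5.

```python
def linear_projection(single_server_peak, n_servers):
    """Peak load expected if scaling from one server were perfectly linear."""
    return single_server_peak * n_servers


def scaling_efficiency(peak_by_servers):
    """Map each server count to measured peak divided by the linear projection.

    peak_by_servers: {number of application servers: measured peak load}
    and must contain an entry for 1 server.
    """
    base = peak_by_servers[1]
    return {
        n: peak / linear_projection(base, n)
        for n, peak in sorted(peak_by_servers.items())
    }


# Hypothetical peak loads (users). These are NOT the results from Figure 5.
example = {1: 1000, 2: 2000, 6: 6000, 10: 8500}
for n, eff in scaling_efficiency(example).items():
    print(f"{n:2d} app servers: {eff:.0%} of linear")
```

An efficiency of 100% means the point sits on the dotted line; values below 100% show where a bottleneck elsewhere in the deployment starts to limit the application servers.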


The discussion in this post has focused on the use of Weathervane as a traditional single-application benchmark with a focus on throughput and response-time performance metrics.  However, that only scratches the surface of our future plans for Weathervane.  We are currently working on extending Weathervane to capture more cloud-centric performance metrics.  These fall into two broad categories that we call multi-tenancy metrics and elasticity metrics.  Multi-tenancy metrics capture the performance characteristics of a cloud-deployed application in the presence of other applications co-located on the same physical resources.  The relevant performance metrics include isolation and fairness along with the traditional throughput and response-time metrics.  Elasticity metrics capture the performance characteristics of self-scaling applications in the presence of changing loads.  It is also possible to study elasticity metrics in the context of multi-tenancy environments, thus examining the impact of shared resources on the ability of an application to scale in a timely manner to satisfy user demands.  These are all exciting new areas of application performance, and we will have more to say about these subjects as we approach Weathervane 1.0.

Performance of Enterprise Java Applications on VMware vSphere 4.1 and SpringSource tc Server

VMware recently released a whitepaper presenting the results of a performance investigation using a representative enterprise-level Java application on VMware vSphere 4.1. The results of the tests discussed in that paper show that enterprise-level Java applications can provide excellent performance when deployed on VMware vSphere 4.1.  The main topics covered by the paper are a comparison of virtualized and native performance, and an examination of scale-up versus scale-out tradeoffs.

The paper first covers a set of tests that were performed to determine whether an enterprise-level Java application virtualized on VMware vSphere 4.1 can provide equivalent performance to a native deployment configured with the same memory and compute resources.  The tests used response time as the primary metric for comparing the performance of native and virtualized deployments. The results show that at CPU utilization levels commonly found in real deployments, the native and virtual response times are close enough to provide an essentially identical user experience.  Even at peak load, with CPU utilization near the saturation point, the peak throughput of the virtualized application was at least 90% of that of the native deployment.

The paper then discusses the results of an investigation of the performance impact of scaling up the configuration of a single VM (adding more vCPUs) versus scaling out to deploy the application on multiple smaller VMs. At loads below 80% CPU utilization, the response times of scale-up and scale-out configurations using the same number of total vCPUs were effectively equivalent.  At higher loads, the peak-throughput results for the different configurations were also similar, with a slight advantage to scale-out configurations.

The application used in these tests was Olio, a multi-tier enterprise application which implements a complete social-networking website.  Olio was deployed on SpringSource tc Server, running both natively and virtualized on vSphere 4.1.  

For more information, please read the full paper at  In addition, the author will be publishing additional results on his blog at

Performance Troubleshooting for VMware vSphere 4 and ESX 4.0

Performance problems can arise in any computing environment. In a
virtualized computing environment, performance problems can arise due to
new and often subtle interactions occurring in the shared
infrastructure. Uncovering the causes of those problems requires an
understanding of the available performance metrics and their
relationship to underlying configuration issues.

A new guide covering performance troubleshooting for VMware vSphere
4, including ESX 4.0 hosts, is now available. This document uses a guided
approach to lead the reader through the observable manifestations of
complex hardware/software interactions in order to identify specific
performance problems. For each problem covered, it includes a
discussion of the possible root-causes and solutions. Topics covered
include performance problems arising from issues in the CPU, memory,
storage, and network subsystems, as well as in the VM and ESX host

The document is available on the VMware Performance Community at

Java Performance on vSphere 4

VMware ESX is an excellent platform for deploying Java
applications.  Many customers use it to
support Java applications from the desktop to business-critical enterprise
servers.  However, we haven't published
any results recently highlighting the excellent performance of Java
applications on VMware ESX.  As a first
step at remedying this situation, we compared native and virtualized
performance using SPECjvm2008.  This
workload is a benchmark suite containing several real-life applications and
benchmarks focusing on core Java functionality. The results demonstrate that
Java applications run on VMware vSphere at greater than 94% of native performance
over a range of VM sizes.  This is up to
a 9% improvement over VMware ESX 3.5, which already runs this workload at close
to or better than 90% of native performance.

We ran SPECjvm2008 on Red Hat Enterprise Server 5 Update 3 using
the latest JVM from Sun Microsystems, JRE 1.6 Update 13.  Tests were conducted with both 32-bit and
64-bit versions of the OS and JVM.  An HP DL380G5 equipped with two quad-core Intel
Xeon X5460 (Harpertown) processors running at 3.16GHz was used.  This server had 32GB of memory.  For native runs using fewer than the full
number of available CPU cores, we used the kernel boot parameter maxcpus= to limit the OS to a given
number of cores.  We also used the kernel
boot parameter mem= to limit the
memory to 16GB in all 64-bit runs.  The
runs on VMware vSphere 4.0 and VMware ESX 3.5 Update 4 were done in virtual machines
(VMs) using the stated number of virtual CPUs and 16GB of memory.

The runs of SPECjvm2008 were all base runs, meaning that no
Java tuning parameters were used.   All
SPECjvm2008 results are required to include a base run.  Unfortunately, the default heap size of the
Sun JVM in the 1 CPU case is not large enough to run the SPECjvm2008
workload.  As a result, we were not able
to generate 1 CPU results which would be compliant with the run-rules for
SPECjvm2008.  We did generate native and vSphere
4.0 results for 2, 4, and 8 CPUs, and ESX 3.5 results for 2 and 4 CPUs.

Figure 1 shows the SPECjvm2008 results for the native,
VMware vSphere 4.0, and VMware ESX 3.5 cases. 
Figure 2 presents the same results normalized to the native result for
that server and CPU count.  These results
show that VMs running on VMware vSphere 4.0 perform at greater than 95% of
native on this benchmark at all VM sizes. 
Even with 8 vCPUs running on a server with only 8 physical cores, the
vSphere 4.0 VM achieves 99% of native performance.   The
VMware ESX 3.5 VMs ran at close to or greater than 90% of native, which is
still excellent for a virtualized environment. 
However, for 64-bit VMs, vSphere 4.0 gives a performance improvement over
ESX 3.5U4 of 9% in the 4 vCPU case, and about 3% in the 2 vCPU case.

Figure 1. SPECjvm2008 on 8-Core Intel Harpertown Server


Figure 2. SPECjvm2008 performance relative to native


In order to sanity-check the
native results, we compared the 8-Core Harpertown result using the 64-bit OS
and JVM to the closest published result. 
There is no directly comparable result, but there is a result generated
by Sun on a 16-Core Intel Tigerton Server. 
The Tigerton is architecturally similar to the Harpertown, but the
Harpertown has a larger L2 cache.  The
Sun 16-core Tigerton result, using Solaris 10, a special performance build of
the Sun JVM (1.6.0_06p), and 64GB of memory, achieved 260 SPECjvm2008
ops/m.  Our native result on the 8-core Harpertown with 16GB of memory was 145 SPECjvm2008 ops/m.  A
native run on the Harpertown with 32GB and using the Sun 1.6.0_06p JVM achieved
174 SPECjvm2008 ops/m.  This is well over
half of the Tigerton result, and indicates that our native configuration
is producing reasonable results.
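The arithmetic behind this sanity check is simple enough to spell out. The Python snippet below uses only the figures quoted above. The per-core comparison is a rough illustration and ignores the differences in operating system and memory size between the two runs.

```python
# SPECjvm2008 results quoted in the text (ops/m) and each server's core count.
tigerton_ops, tigerton_cores = 260, 16      # Sun result on the 16-core Tigerton
harpertown_ops, harpertown_cores = 174, 8   # our native run with the Sun 1.6.0_06p JVM

ratio = harpertown_ops / tigerton_ops
per_core_tigerton = tigerton_ops / tigerton_cores
per_core_harpertown = harpertown_ops / harpertown_cores

print(f"Harpertown / Tigerton: {ratio:.2f}")                   # 0.67
print(f"ops/m per core, Tigerton:   {per_core_tigerton}")      # 16.25
print(f"ops/m per core, Harpertown: {per_core_harpertown}")    # 21.75
```

With half the cores, the Harpertown run reaches about two-thirds of the Tigerton score, which is the sense in which the native configuration looks reasonable.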

Figure 3 shows the scaling of
the results as we move from 2 to 4 and 8 CPUs for the 64-bit case.  The scaling is essentially the same for
32-bit.  The results are normalized to
the 2 CPU results on the same platform. 
These results show that VMware vSphere 4.0 scales as well as or better
than native for this workload.  VMware
ESX 3.5 scaling is just slightly below native.

Figure 3 SPECjvm2008 Scaling from 2



The SPECjvm2008 results presented here show that core Java
functionality runs extremely well on VMware vSphere 4.0 and VMware ESX
3.5.  No special tuning was required to
get results that are remarkably close to native performance.  We hope to soon produce additional results to
demonstrate that this excellent performance extends to multi-tier Java Enterprise
Edition applications as well.  For comments or questions, please join us in the VMware Performance Community at this thread