Monthly Archives: March 2015

SQL Server VM Performance on VMware vSphere 6

Last October, I blogged about SQL Server performance with vSphere 5.5 using a four-socket Intel Xeon E7-based host.  Now that vSphere 6 is available, I’ve run an updated set of tests using this new release, on an even more powerful host with Xeon E7 v2 processors.  A variety of virtual CPU (vCPU) and virtual machine (VM) quantities were tested to show that vSphere can handle hundreds of thousands of online transaction processing (OLTP) database operations per minute.

DVD Store 2.1, an open-source OLTP database stress tool, was the workload used to stress the VMs.  The first experiment in the paper was a generational performance comparison between the old and new setups; as you can see, there is a dramatic increase in throughput, even though the size of each VM has doubled from 8 vCPUs per VM to 16:

Generational performance improvement from old study to new study

There are also tests using CPU affinity to show the performance differences between physical cores and logical processors (Hyper-Threads), the benefit of “right-sizing” virtual machines, and the impact of the advanced Latency Sensitivity setting.

For more details and the test results, please download the whitepaper: Performance Characterization of Microsoft SQL Server on VMware vSphere 6.

Introducing the Weathervane Benchmark

The ways in which we use, design, deploy, and evaluate the performance of large-scale web applications have changed significantly in recent years.  These changes have been driven by the increase in computing capacity and flexibility provided by virtualized and cloud-based computing infrastructures. The majority of these changes are not reflected in current web-application benchmarks.

Weathervane is a new web-application benchmark we have been developing as part of our work on optimizing the performance of VMware products for the next generation of cloud-scale applications. The goal of the Weathervane project has been to develop an application-level benchmark that captures the key characteristics of the workloads, design paradigms, deployment architectures, and performance metrics of the next generation of large-scale web applications. As we approach the initial release of Weathervane, we are starting to use it to understand performance across our product range.  In this post, we will give an overview of Weathervane that will provide context for the performance results that we will be writing about over the coming months.

Weathervane Motivation

There have been many changes in usage patterns and development practices for large-scale web applications.  The design and development of Weathervane has been driven by the goal of capturing these changes in a highly scalable benchmark that includes these key aspects:

  • The effect of increased user interactivity and rich web interfaces on workload patterns
  • New design patterns for decoupled and asynchronous services
  • The use of multiple data sources for data with varying scalability and consistency requirements
  • Flexible architectures that allow for deployment on a wide range of virtual and cloud-based infrastructures

The effect of increased user interactivity and rich web interfaces is one of the most important of these aspects. In current benchmarks, a user is represented by a single thread operating independently from other users. Contrast that to the way we interact with applications as diverse as social media and stock trading. Many user interactions, such as responding to a status update or selling shares of stock, are in direct response to the actions of other users.  In addition, the current generation of script-rich web interfaces performs many operations asynchronously without any action from, or even awareness by, the user.  Examples include web pages and rich client interfaces that update active friend lists, check for messages, or maintain stock tickers.  This leads to a very different model of user behavior than the traditional single-threaded, click-and-think design used by existing benchmarks.  As a result, one of the key design goals for Weathervane was to develop both a benchmark application and a workload generator that would allow us to capture the effect of these new workload patterns.

Weathervane Overview

An application-level benchmark typically consists of two main parts: the benchmark application and the workload driver.  The application is selected and designed to represent characteristics and technology choices that are typical of a certain class of applications.  The workload driver interacts with the benchmark application to simulate the behavior of typical users of the application.   It also captures the performance metrics that are used to quantify the performance of the application/infrastructure combination. Some benchmarks, including Weathervane, also provide a run harness that assists in the set-up and automation of benchmark runs.

The Weathervane benchmark application is Auction, a web application for managing and hosting real-time auctions. An auction hosted by Auction consists of a number of items that will be placed up for bid in a set order.  Users are given only a limited time to bid before an item is sold and the next item is placed up for bid.  When an item is up for bid, all users attending the auction are presented with a description and image of the item.  Users see and respond to bids placed by other users. Auction can support thousands of simultaneous auctions with large numbers of active users, with each user possibly attending multiple simultaneous auctions.  The figure below shows the browser application used to interact with the Auction application.  This figure shows the bidding screen for a user who is attending two auctions.  The current item, bid, and bid status for each auction are updated in real time in response to bids placed by other users.

Figure 1. Auction bidding screen

In addition to managing live auctions, Auction provides auction and item search, profile management, historical data queries, image management, auction management, and other services that would be required by a user of the application.

Auction uses a scalable architecture that allows deployments to be easily sized for a large range of user loads.  A full deployment of Auction includes a wide variety of support services, such as load-balancing, caching, and messaging servers, as well as relational, NoSQL, and filesystem-based data stores supporting scalability for data with a variety of consistency requirements.  The figure below shows a full deployment of Auction and the Weathervane workload driver.

Figure 2. Logical layout for full Weathervane deployment

The following is a brief description of the role played by each tier.

Infrastructure Services

TCP Load Balancers: The simulated users on the workload driver address the application through a set of IP addresses mapped to the application’s external hostname.  The TCP load balancers jointly manage these IP addresses to ensure that all IP addresses remain available in the event of a failure. The TCP load balancers distribute the load across the web servers while maintaining SSL/TLS session affinity.

Messaging Servers: The application nodes use the messaging backbone to distribute work and state-change information regarding active auctions.

Application Services

Web Servers: The web servers terminate SSL, serve static content, act as load-balancing reverse proxies for the application servers, and provide a proxy cache for application content, such as images returned by the application servers.

Application Servers: The application servers run Java servlet containers in which the application services are deployed.  The Auction application services use a stateless implementation with a RESTful interface that simplifies scaling.

Data Services

Relational Database: The relational database is used for all data that is involved in transactions.  This includes user account information, as well as auction, item, and high-bid data.

NoSQL Data Server:  The NoSQL Document Store is used to store image metadata as well as activity data such as auction attendance information and bid records. It can also be used to store uploaded images. Using the NoSQL store as an image store allows the application to take advantage of its sharding capabilities to easily scale the I/O capacity for image storage.
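
Sharding in MongoDB is standard functionality enabled per database and per collection.  As a hedged illustration only (the database and collection names are hypothetical, not Weathervane’s actual schema), enabling hashed sharding for an image-metadata collection through PyMongo might look like this:

```python
from pymongo import MongoClient

# Hypothetical mongos router address; a sharded MongoDB deployment also needs
# config servers and shard replica sets behind this router.
client = MongoClient("mongodb://mongos.example.local:27017")

# Enable sharding for a (hypothetical) "auction" database, then shard the
# image-metadata collection on a hashed _id key so documents spread evenly
# across shards and image I/O capacity grows with the number of shards.
client.admin.command("enableSharding", "auction")
client.admin.command("shardCollection", "auction.imageInfo",
                     key={"_id": "hashed"})
```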

File Server: The file server is used exclusively to store item images uploaded by users.  Note that the file server is optional, as the images can be stored and served from the NoSQL document store.

Weathervane currently includes configuration support for deploying Auction using the following services:

  • Virtual IP Address Management: Keepalived
  • TCP Load Balancer: HAProxy
  • Web Server: Apache Httpd and Nginx
  • Application Server:  Apache Tomcat with Ehcache for in-memory caching
  • Messaging Server: RabbitMQ
  • Relational Database: MySQL and PostgreSQL
  • NoSQL Data Store: MongoDB
  • Network Filesystem: NFS

Additional implementations will be supported in future releases.

Weathervane can be deployed with different subsets of the infrastructure and application services.  For example, the figure below shows a minimal deployment of Weathervane with a single application server and the supporting data services.  In this configuration, the application server performs the tasks handled by the web server in a larger deployment.

Figure 3. Logical layout for a minimal Weathervane deployment

The Weathervane workload driver has been developed to drive HTTP-based loads for modern scalable web applications.  It can simulate workloads for applications that incorporate asynchronous behaviors using embedded JavaScript, and those requiring complex data-driven behaviors, as in web applications with significant inter-user interaction.  The Weathervane workload driver uses an asynchronous design with a small number of threads supporting a large number of simulated users. Simulated users may have multiple active asynchronous activities which share state information, and complex workload patterns can be specified with control-flow decisions made based on retrieved state and operation history. These features allow us to efficiently simulate workloads that would be presented to web applications by rich web clients using asynchronous JavaScript operations.
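
To make that asynchronous-driver design concrete, here is a minimal Python sketch (not Weathervane code) of a single event loop driving many simulated users, each running multiple concurrent activities that share state and make control-flow decisions based on the state they retrieve:

```python
import asyncio
import random

# Shared state visible to all simulated users, standing in for an auction's
# current high bid that other users react to.
auction_state = {"high_bid": 100}

async def watch_auction(user_id: int) -> None:
    """Background activity: poll for state changes, as a rich client would."""
    for _ in range(5):
        await asyncio.sleep(random.uniform(0.1, 0.3))  # simulated async refresh
        _ = auction_state["high_bid"]                   # e.g., update the UI

async def bid_loop(user_id: int) -> None:
    """Foreground activity: decide whether to bid based on retrieved state."""
    for _ in range(5):
        await asyncio.sleep(random.uniform(0.2, 0.5))   # think time
        current = auction_state["high_bid"]
        if random.random() < 0.5:                       # control-flow decision
            auction_state["high_bid"] = current + 1     # react to other users

async def simulate_user(user_id: int) -> None:
    # Each user has multiple concurrent asynchronous activities.
    await asyncio.gather(watch_auction(user_id), bid_loop(user_id))

async def main(num_users: int = 1000) -> None:
    # One event loop (one thread) drives all simulated users concurrently.
    await asyncio.gather(*(simulate_user(u) for u in range(num_users)))

if __name__ == "__main__":
    asyncio.run(main())
```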

The Weathervane workload driver also monitors quality-of-service (QoS) metrics for both the Auction application and the overall workload. The application-level QoS requirements are based on the 99th percentile response-times for the individual operations.  An operation represents a single action performed by a user or embedded script, and may consist of multiple HTTP exchanges.  The workload-level QoS requirements define the required mix of operations that must be performed by the users during the workload’s steady state.  This mix must be consistent from run to run in order for the results to be comparable.  In order for a run of the benchmark to pass, all QoS requirements must be satisfied.
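
The per-operation QoS check described above amounts to computing a 99th-percentile response time for each operation and comparing it against a limit.  The sketch below shows that calculation; the operation names and limits are made up for illustration, not Weathervane’s actual requirements:

```python
import math
from typing import Dict, List

def percentile_99(samples: List[float]) -> float:
    """Nearest-rank 99th percentile of a list of response times (seconds)."""
    ordered = sorted(samples)
    rank = max(math.ceil(0.99 * len(ordered)) - 1, 0)
    return ordered[rank]

def qos_pass(response_times: Dict[str, List[float]],
             limits: Dict[str, float]) -> bool:
    """True only if every operation meets its 99th-percentile limit."""
    return all(percentile_99(times) <= limits[op]
               for op, times in response_times.items())

# Hypothetical per-operation samples (seconds) and limits, for illustration only.
samples = {"placeBid": [0.05, 0.07, 0.30], "getItemImage": [0.02, 0.04, 0.10]}
limits = {"placeBid": 0.50, "getItemImage": 0.25}
print(qos_pass(samples, limits))  # True for this made-up data
```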

Weathervane also includes a run harness that automates most of the steps involved in configuring and running the benchmark.  The harness takes as input a configuration file that describes the deployment configuration, the user load, and many service-specific tuning parameters.  The harness is then able to power on virtual machines, configure and start the various software services, deploy the software components of Auction, run the workload, and collect the results, as well as the log, configuration, and statistics files from all of the virtual machines and services.  The harness also manages the tasks involved in loading and preparing the data in the data services before each run.
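
Those harness steps can be pictured as a simple orchestration loop.  The sketch below is purely illustrative; the helper functions and configuration keys are hypothetical placeholders, not the actual Weathervane harness API or configuration format:

```python
# Purely illustrative skeleton of the harness steps described above; these
# stub helpers are hypothetical placeholders, not the Weathervane harness API.

def power_on_vms(vms):
    print("powering on:", vms)

def prepare_data_services(stores):
    print("loading and preparing data in:", stores)

def start_services(services):
    print("configuring and starting:", services)

def deploy_application(app):
    print("deploying application components for:", app)

def drive_workload(users):
    print("running workload for", users, "users")
    return {"passed": True}

def collect_artifacts(results, vms):
    print("collecting results, logs, and statistics from:", vms, "->", results)

def run_benchmark(config):
    power_on_vms(config["vms"])                # 1. power on the virtual machines
    prepare_data_services(config["data"])      # 2. load/prepare the data services
    start_services(config["services"])         # 3. configure and start the services
    deploy_application(config["application"])  # 4. deploy the Auction components
    results = drive_workload(config["users"])  # 5. run the workload
    collect_artifacts(results, config["vms"])  # 6. gather results and logs

run_benchmark({
    "vms": ["web1", "app1", "db1"],            # hypothetical inventory
    "data": ["postgresql", "mongodb"],
    "services": ["nginx", "tomcat", "rabbitmq"],
    "application": "auction",
    "users": 5000,
})
```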

Scalability

Scaling to large deployments is a key goal of Weathervane.  Therefore, it will be useful to conclude with some initial scalability data to show how we are doing in achieving that goal. There are many possible ways to scale up a deployment of Auction.  For the sake of providing a straightforward comparison, we will focus on scaling out the number of application server instances in an otherwise fixed deployment configuration.  The CPU utilization of the application server is typically the performance bottleneck in a well-balanced Auction deployment.

The figure below shows the logical layout of the VMs and services in this deployment.  Physically, all VMs reside on the same network subnet on the vSphere hosts, which are connected by a 10Gb Ethernet switch.

Figure 4. Deployment configuration for scaling results

The VMs in the Auction deployment were distributed across three VMware vSphere 6 hosts.  Table 1 gives the hardware details of the hosts.

Host Name | Host Vendor/Model | Processors | Memory
Host1 | Dell PowerEdge R720, 2-socket server | Intel® Xeon® CPU E5-2690 @ 2.90GHz, 8 cores / 16 threads | 256GB
Host2 | Dell PowerEdge R720, 2-socket server | Intel® Xeon® CPU E5-2690 @ 2.90GHz, 8 cores / 16 threads | 256GB
Host3 | Dell PowerEdge R720, 2-socket server | Intel® Xeon® CPU E5-2680 @ 2.70GHz, 8 cores / 16 threads | 192GB

Table 1. vSphere 6 hosts for Auction deployment

Table 2 shows the configuration of the VMs, and their assignment to vSphere hosts.  As the goal of these tests was to examine the scalability of the Auction application, and not the characteristics of vSphere 6, we chose the VM sizing and assignment in part to avoid using more virtual CPUs than physical cores. While we did some tuning of the overall configuration, we did not necessarily obtain the optimal tuning for each of the service configurations.  The configuration was chosen so that the application server was the bottleneck as far as possible within the restrictions of the available physical servers.  In future posts, we will examine the tuning of the individual services, tradeoffs in deployment configurations, and best practices for deploying Auction-like applications on vSphere.

Service | Host | vCPUs (each VM) | VM Memory
HAProxy 1 | Host1 | 2 | 8GB
HAProxy 2 | Host2 | 2 | 8GB
HAProxy 3 | Host3 | 2 | 8GB
Nginx 1, 2, and 3 | Host3 | 2 | 8GB
RabbitMQ 1 | Host2 | 1 | 2GB
RabbitMQ 2 | Host1 | 1 | 2GB
Tomcat 1, 3, 5, 7, and 9 | Host1 | 2 | 8GB
Tomcat 2, 4, 6, 8, and 10 | Host2 | 2 | 8GB
MongoDB 1 and 3 | Host2 | 1 | 32GB
MongoDB 2 and 4 | Host1 | 1 | 32GB
PostgreSQL | Host3 | 6 | 32GB

Table 2. Virtual machine configuration

Figure 5 shows the peak load that can be supported by this deployment configuration as the number of application servers is scaled from one to ten.  The peak load supported by a configuration is the maximum load at which the configuration can satisfy all of the QoS requirements of the workload.  The dotted line shows linear scaling of the maximum load extrapolated from the single application server result.  The actual scaling is essentially linear up to six application-server VMs.  At that point, the overall utilization of the physical servers starts to affect the ability to maintain linear scaling.  With seven application servers, the web-server tier becomes a scalability bottleneck, but there are not sufficient CPU cores available to add additional web servers.

It would require additional infrastructure to determine how far the linear scaling could be extended.  However, the current results provide strong evidence that with sufficient resources, Weathervane will be able to scale to support very large loads representing large numbers of users.

Figure 5. Maximum supported users for increasing number of application servers
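
One way to read Figure 5 is in terms of scaling efficiency: the measured peak load at N application servers divided by N times the single-server peak.  The sketch below shows that calculation; the peak values in it are illustrative placeholders, not the measured Weathervane results:

```python
def scaling_efficiency(single_server_peak: float, measured_peaks: list) -> list:
    """Efficiency vs. linear scaling extrapolated from the 1-server result."""
    return [
        measured / (single_server_peak * n)
        for n, measured in enumerate(measured_peaks, start=1)
    ]

# Illustrative placeholder peaks (supported users) for 1..4 application servers.
peaks = [1000.0, 1990.0, 2950.0, 3800.0]
for n, eff in enumerate(scaling_efficiency(peaks[0], peaks), start=1):
    print(f"{n} app server(s): {eff:.0%} of linear")
```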

Future

The discussion in this post has focused on the use of Weathervane as a traditional single-application benchmark with a focus on throughput and response-time performance metrics.  However, that only scratches the surface of our future plans for Weathervane.  We are currently working on extending Weathervane to capture more cloud-centric performance metrics.  These fall into two broad categories that we call multi-tenancy metrics and elasticity metrics.  Multi-tenancy metrics capture the performance characteristics of a cloud-deployed application in the presence of other applications co-located on the same physical resources.  The relevant performance metrics include isolation and fairness along with the traditional throughput and response-time metrics.  Elasticity metrics capture the performance characteristics of self-scaling applications in the presence of changing loads.  It is also possible to study elasticity metrics in the context of multi-tenancy environments, thus examining the impact of shared resources on the ability of an application to scale in a timely manner to satisfy user demands.  These are all exciting new areas of application performance, and we will have more to say about these subjects as we approach Weathervane 1.0.

Virtual SAN 6.0 Performance: Scalability and Best Practices

A technical white paper about Virtual SAN performance has been published. This paper provides guidelines on how to get the best performance for applications deployed on a Virtual SAN cluster.

We used Iometer to generate several workloads that simulate various I/O encountered in Virtual SAN production environments. These are shown in the following table.

Type of I/O workload | Size (1KiB = 1024 bytes) | Mixed Ratio | Shows / Simulates
All Read | 4KiB | N/A | Maximum random read IOPS that a storage solution can deliver
Mixed Read/Write | 4KiB | 70% / 30% | Typical commercial applications deployed in a VSAN cluster
Sequential Read | 256KiB | N/A | Video streaming from storage
Sequential Write | 256KiB | N/A | Copying bulk data to storage
Sequential Mixed R/W | 256KiB | 70% / 30% | Simultaneous read/write copy from/to storage
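
Iometer generated these access patterns in our tests.  As a rough, hedged illustration of what the 70% / 30% random 4KiB mix means at the I/O level (a toy sketch against a local file, not a substitute for Iometer), the loop below issues 4KiB-aligned reads and writes at random offsets:

```python
import os
import random

BLOCK = 4 * 1024                 # 4KiB, as in the mixed read/write workload
FILE_SIZE = 64 * 1024 * 1024     # small 64MiB test file for illustration
READ_PCT = 0.70                  # 70% reads / 30% writes

# Create a test file to run against (a real test would target VSAN-backed storage).
with open("testfile.bin", "wb") as f:
    f.truncate(FILE_SIZE)

buf = os.urandom(BLOCK)
with open("testfile.bin", "r+b") as f:
    for _ in range(10_000):
        offset = random.randrange(FILE_SIZE // BLOCK) * BLOCK  # 4KiB-aligned
        f.seek(offset)
        if random.random() < READ_PCT:
            f.read(BLOCK)        # random read
        else:
            f.write(buf)         # random write

os.remove("testfile.bin")
```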

In addition to these workloads, we studied Virtual SAN caching tier designs and the effect of Virtual SAN configuration parameters on the Virtual SAN test bed.

Virtual SAN 6.0 can be configured in two ways: Hybrid and All-Flash. Hybrid uses a combination of hard disks (HDDs) to provide storage and a flash tier (SSDs) to provide caching. The All-Flash solution uses all SSDs for storage and caching.

Tests show that the Hybrid Virtual SAN cluster performs extremely well when the working set is fully cached for random access workloads, and also for all sequential access workloads. The All-Flash Virtual SAN cluster, which performs well for random access workloads with large working sets, may be deployed in cases where the working set is too large to fit in a cache. All workloads scale linearly in both types of Virtual SAN clusters—as more hosts and more disk groups per host are added, Virtual SAN sees a corresponding increase in its ability to handle larger workloads. Virtual SAN offers an excellent way to scale up the cluster as performance requirements increase.

You can download Virtual SAN 6.0 Performance: Scalability and Best Practices from the VMware Performance & VMmark Community.

VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines

vSphere 6 introduces the ability to run virtual machines (VMs) with up to 128 virtual CPUs (vCPUs) and 4TB of RAM. This doubles the number of vCPUs supported from the previous version and increases the amount of RAM by four times. This new capability provides the potential for customers to run larger workloads than ever before in a virtual machine.

A series of tests were run with a virtual machine hosting Oracle 12c database instances. The DVD Store 2.1 open-source transactional workload was used to measure the performance of a large “Monster” VM on vSphere 6. The Oracle 12c database VM was scaled from 15 vCPUs all the way up to 120 vCPUs, and the maximum achieved throughput was measured. The full results and test details have been published in a white paper – VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines.

A four-socket server based on the Intel Xeon E7-4890 v2 processor, with 1TB of memory, was used to host the virtual machine for the tests.  Each Xeon E7-4890 v2 processor has 15 cores / 30 threads with Hyper-Threading enabled, for a total of 60 cores / 120 threads for the system. The diagram below shows the basic test configuration.

Figure: Basic test configuration

In all tests, Hyper-Threading was enabled on the server, but in configurations where 60 or fewer vCPUs were assigned to the VM, the VM did not use Hyper-Threads. This is a result of the default scheduling policy, which prefers to schedule vCPUs on one thread per core before using the second thread of any core. The first set of results, shown below, is focused on the tests that scale up to 60 vCPUs; these tests show the scaling of the virtual machine without the use of Hyper-Threads.

Figure: Scale-up results for configurations of up to 60 vCPUs
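
The scheduling preference described above (use one thread per core before using any core’s second thread) can be pictured with a toy placement function.  This is only an illustration of the policy, not ESXi scheduler code:

```python
def place_vcpus(num_vcpus: int, num_cores: int = 60, threads_per_core: int = 2):
    """Assign vCPUs to (core, thread) slots, preferring thread 0 of every core.

    Toy illustration of the default preference described above, not ESXi code.
    """
    slots = [(core, thread)
             for thread in range(threads_per_core)
             for core in range(num_cores)]
    if num_vcpus > len(slots):
        raise ValueError("more vCPUs than hardware threads")
    return slots[:num_vcpus]

# A 60-vCPU VM lands on one thread of each of the 60 cores ...
print(all(thread == 0 for _, thread in place_vcpus(60)))   # True
# ... while a 120-vCPU VM must also use the second thread of every core.
print(any(thread == 1 for _, thread in place_vcpus(120)))  # True
```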

While vSphere 6 supports up to 128 vCPUs per VM, these tests were limited to 120 vCPUs due to the number of threads available on the server. The largest VM configuration used both hardware execution threads (Hyper-Threads) on all the processor cores in order to reach 120 vCPUs. In this case, there is one vCPU per execution thread.

Hyper-Threading doubles the number of execution threads, but it does not double performance. In order to measure the scale-up performance of the 120-vCPU VM, a 60-vCPU VM was configured with CPU affinity so that it was limited to only two of the server’s four sockets. In this configuration the 60-vCPU VM has one vCPU per execution thread, which is the same as the 120-vCPU VM.  Configuring a 60-vCPU VM in this way makes it easy to see the scale up performance at 120 vCPUs on this server with hyper-threads enabled.

The results of the scale-up testing using the 60-vCPU VM configured with CPU affinity to only 2 sockets and the 120-vCPU VM using all four sockets showed approximately linear scaling, as shown in the graph below.

Figure: Scale-up results for the 60-vCPU (two-socket affinity) and 120-vCPU configurations

For full test details and more test results, please see the recently published white paper.

The new larger “Monster” VM support in vSphere 6 allows for virtual machines that can support larger workloads than ever before with excellent performance. These tests show that large virtual machines running on vSphere 6 can scale up as needed to meet extreme performance demands.

 

Improvements in Network I/O Control for vSphere 6

Network I/O Control (NetIOC) in VMware vSphere 6 has been enhanced to support a number of exciting new features, such as bandwidth reservations. A new paper published by the Performance Engineering team shows the performance of these new features. The paper also explores the performance impact of the new NetIOC algorithm. The tests show that NetIOC offers a powerful way to achieve network resource isolation at minimal cost in terms of latency and CPU utilization.

You can read the paper here.

 

Failover Time for vCenter Server 6.0 Protected by vSphere HA

By Jianli Shen, Kimberly Wang, and Lei Chai

There are many advantages to virtualizing vCenter Server rather than installing it on physical hardware; these have been blogged about already. VMware best practice recommends deploying vCenter Server as a virtual machine and using vSphere High Availability (HA) to protect it. The typical scenario is to place vCenter Server in a vSphere HA cluster. HA will protect vCenter Server just like any other virtual machine. If the ESXi server that is running vCenter Server goes down, HA will kick in and power on this VM from the shared storage on another host in the HA cluster. In this blog, we will show the downtime of vCenter Server in this scenario with experiments that simulate a production deployment.

We performed our experiments with a vCenter Server VM deployed in an HA-enabled cluster, with vCenter Server itself managing an inventory of 64 hosts and 6,000 VMs. The experiment shows that if the ESXi server that hosts vCenter Server fails, vCenter Server is powered on by another host in the cluster, as expected in the HA scenario. After vCenter Server is powered on, it takes time to boot up into full function, and the 64 hosts/6,000 VMs inventory is presented when the administrator is able to log into the vSphere Web Client again. In our experiments, the whole procedure, from the failure to the vCenter Server administrator being able to log into the vSphere Web Client, took about 7 minutes and 40 seconds, with most of the time spent booting up vCenter Server into full function. During vCenter Server downtime, customers are still able to access their VMs; only the vCenter Server administrator is impacted.

In the following sections, we describe our experiments and numbers.

Experimental Deployment Scenarios

We created the scenario of vCenter Server in an HA-enabled cluster environment. We used a host simulator to simulate the 64 hosts/6,000 VMs inventory for vCenter Server. The deployment is shown in Figure 1, below.

Figure 1. Test-Bed Setup

In this setup:

  • HostA is the host where we first deploy vCenter Server 6.0.  HostB acts as a backup host for vCenter Server to fail over to if HostA fails.  HostA and HostB are managed by vCenter Server in an HA-enabled cluster, which is Cluster2 in the figure.
  • Both hosts are Dell PowerEdge R620 servers with Intel® Xeon® E5-2650 @ 2.00GHz processors, 128GB memory, and 3.7TB shared storage.
  • vCenter Server is a virtual appliance deployed on HostA with 16 vCPUs and 32GB RAM.
  • vCenter Server also manages a large inventory of 64 hosts and 6,000 VMs, which is in Cluster1.

The figure shows what happens before and after HostA fails. During the experiment, we powered off HostA, so the vCenter Server VM went down as well. Then, the HA agent detected the failure of HostA and initiated the power-on of HostA’s previously running VMs (vCenter Server, in this case) on HostB. Once vCenter Server was powered on, it recovered its inventory as it would during a normal power-on process.

Performance Result

We measured the time from the point the vCenter Server VM stopped responding to the point the vSphere Web Client started responding to user activity again.
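
A measurement like this can be scripted by repeatedly polling the vSphere Web Client endpoint until it answers again.  The sketch below is a hedged illustration (the URL is a placeholder, and certificate verification is disabled for a lab-style self-signed certificate), not the instrumentation used in these experiments:

```python
import ssl
import time
import urllib.request

# Placeholder address for the vCenter Server Appliance's Web Client endpoint.
URL = "https://vcenter.example.local/vsphere-client/"

# Lab-style setup: skip certificate verification for a self-signed certificate.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

def wait_until_responding(url: str, poll_interval: float = 5.0) -> float:
    """Return seconds elapsed until the endpoint answers an HTTP request."""
    start = time.time()
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5, context=ctx):
                return time.time() - start
        except OSError:
            time.sleep(poll_interval)

if __name__ == "__main__":
    downtime = wait_until_responding(URL)
    print(f"Web Client responding again after {downtime:.0f} seconds")
```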

With the 64 hosts/6,000 VMs inventory, the total time is around 460 seconds (that is, 7 minutes and 40 seconds), with about 30-50 seconds for HA to get into action. The rest of the time is spent on the vCenter Server power-on process.

To see the impact of vCenter Server’s inventory size, we compared this result with similar scenarios in which vCenter Server has a 1 host/1 VM or a 32 hosts/4,000 VMs inventory; the downtime is around 410 seconds (6 minutes and 50 seconds) and 439 seconds (7 minutes and 19 seconds), respectively. The downtime does increase as the vCenter Server inventory size increases, but considering the whole power-on process, as shown in Figure 2, the increase is reasonable.

As a reference, we looked at the vCenter Server reboot time. We issued a vCenter Server reboot command, waited until it was back to full function, and measured the latency for this period. We then compared it to the total time it takes to fail over vCenter Server, including its reboot time. As shown in Figure 2 below, there is around a 30-50 second difference when vCenter Server fails over from HostA to HostB, compared to rebooting vCenter Server on HostA. This 30-50 second period is the time needed for the HA agent to detect the failure on HostA and then issue the command to power on vCenter Server on HostB. Figure 3 shows a closer look at the time each failover activity takes.

Figure 2. Difference in seconds between normal reboot and HA failover (reboot + HA) in very small, medium, and large inventories

Figure 3. Total time of failover, including time it takes to reboot during failover and time for vSphere HA activity

Summary

In this article, we have shown that deploying vCenter Server as a VM makes protecting it very easy by using vSphere HA. The solution is practical, and the downtime is within an expected tolerance level: administrators are able to regain control of vCenter Server in less than 10 minutes. Meanwhile, customers still have their services running on their VMs in Cluster1.