Introducing the Zephyr Benchmark

The ways in which we use, design, deploy, and evaluate the performance of large-scale web applications have changed significantly in recent years.  These changes have been driven by the increase in computing capacity and flexibility provided by virtualized and cloud-based computing infrastructures. The majority of these changes are not reflected in current web-application benchmarks.

Zephyr is a new web-application benchmark we have been developing as part of our work on optimizing the performance of VMware products for the next generation of cloud-scale applications. The goal of the Zephyr project has been to develop an application-level benchmark that captures the key characteristics of the workloads, design paradigms, deployment architectures, and performance metrics of the next generation of large-scale web applications. As we approach the initial release of Zephyr, we are starting to use it to understand performance across our product range.  In this post, we will give an overview of Zephyr that will provide context for the performance results that we will be writing about over the coming months.

Zephyr Motivation

There have been many changes in usage patterns and development practices for large-scale web applications.  The design and development of Zephyr has been driven by the goal of capturing these changes in a highly scalable benchmark that includes these key aspects:

  • The effect of increased user interactivity and rich web interfaces on workload patterns
  • New design patterns for decoupled and asynchronous services
  • The use of multiple data sources for data with varying scalability and consistency requirements
  • Flexible architectures that allow for deployment on a wide range of virtual and cloud-based infrastructures

The effect of increased user interactivity and rich web interfaces is one of the most important of these aspects. In current benchmarks, a user is represented by a single thread operating independently from other users. Contrast that to the way we interact with applications as diverse as social media and stock trading. Many user interactions, such as responding to a status update or selling shares of stock, are in direct response to the actions of other users.  In addition, the current generation of script-rich web interfaces performs many operations asynchronously without any action from, or even awareness by, the user.  Examples include web pages and rich client interfaces that update active friend lists, check for messages, or maintain stock tickers.  This leads to a very different model of user behavior than the traditional single-threaded, click-and-think design used by existing benchmarks.  As a result, one of the key design goals for Zephyr was to develop both a benchmark application and a workload generator that would allow us to capture the effect of these new workload patterns.
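
To make this contrast concrete, the sketch below shows the shape of such a user model. This is an illustrative Python sketch, not Zephyr code: the class, method, and endpoint names are hypothetical. The point is that one simulated user owns several concurrent, state-sharing activities multiplexed on an event loop, instead of a single blocking click-and-think thread.

```python
import asyncio
import random

# Hypothetical sketch of an asynchronous simulated user. Endpoints and
# behavior are illustrative; Zephyr's actual driver is not shown here.

class SimulatedUser:
    def __init__(self, user_id):
        self.user_id = user_id
        self.state = {"last_bid_seen": 0}

    async def fetch(self, path):
        # Stand-in for an asynchronous HTTP request against the application.
        await asyncio.sleep(0.01)
        return random.randint(1, 100)

    async def attend_auction(self, auction_id):
        # Foreground activity: watch the current item and bid in direct
        # response to bids placed by other users.
        while True:
            current_bid = await self.fetch(f"/auction/{auction_id}/currentBid")
            if current_bid > self.state["last_bid_seen"]:
                self.state["last_bid_seen"] = current_bid
                await self.fetch(f"/auction/{auction_id}/bid")  # place a bid
            await asyncio.sleep(random.uniform(0.5, 2.0))

    async def poll_messages(self):
        # Background activity: the kind of work a script-rich page performs
        # without any explicit action from, or awareness by, the user.
        while True:
            await self.fetch("/user/messages")
            await asyncio.sleep(5)

    async def run(self):
        # One user attending two auctions plus an asynchronous poller.
        await asyncio.gather(self.attend_auction(1),
                             self.attend_auction(2),
                             self.poll_messages())

async def main():
    tasks = [asyncio.create_task(SimulatedUser(i).run()) for i in range(1000)]
    await asyncio.sleep(3)  # drive load briefly for the example
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)

asyncio.run(main())
```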

Zephyr Overview

An application-level benchmark typically consists of two main parts: the benchmark application and the workload driver.  The application is selected and designed to represent characteristics and technology choices that are typical of a certain class of applications.  The workload driver interacts with the benchmark application to simulate the behavior of typical users of the application.   It also captures the performance metrics that are used to quantify the performance of the application/infrastructure combination. Some benchmarks, including Zephyr, also provide a run harness that assists in the set-up and automation of benchmark runs.

Zephyr’s benchmark application is LiveAuction, which is a web application for managing and hosting real-time auctions. An auction hosted by LiveAuction consists of a number of items that will be placed up for bid in a set order.  Users are given only a limited time to bid before an item is sold and the next item is placed up for bid.  When an item is up for bid, all users attending the auction are presented with a description and image of the item.  Users see and respond to bids placed by other users. LiveAuction can support thousands of simultaneous auctions with large numbers of active users, with each user possibly attending multiple, simultaneous auctions.   The figure below shows the browser application used to interact with the LiveAuction application.  This figure shows the bidding screen for a user who is attending two auctions.  The current item, bid, and bid status for each auction are updated in real-time in response to bids placed by other users.

Figure 1. LiveAuction bidding screen

In addition to managing live auctions, LiveAuction provides auction and item search, profile management, historical data queries, image management, auction management, and other services that would be required by a user of the application.

LiveAuction uses a scalable architecture that allows deployments to be easily sized for a large range of user loads.  A full deployment of LiveAuction includes a wide variety of support services, such as load-balancing, caching, and messaging servers, as well as relational, NoSQL, and filesystem-based data stores supporting scalability for data with a variety of consistency requirements.  The figure below shows a full deployment of LiveAuction and the Zephyr workload driver.

Figure 2. Logical layout for full Zephyr deployment

The following is a brief description of the role played by each tier.

Infrastructure Services

TCP Load Balancers: The simulated users on the workload driver address the application through a set of IP addresses mapped to the application’s external hostname.  The TCP load balancers jointly manage these IP addresses to ensure that all IP addresses remain available in the event of a failure. The TCP load balancers distribute the load across the web servers while maintaining SSL/TLS session affinity.

Messaging Servers: The application nodes use the messaging backbone to distribute work and state-change information regarding active auctions.

Application Services

Web Servers: The web servers terminate SSL, serve static content, act as load-balancing reverse proxies for the application servers, and provide a proxy cache for application content, such as images returned by the application servers.

Application Servers: The application servers run Java servlet containers in which the application services are deployed.  The LiveAuction application services use a stateless implementation with a RESTful interface that simplifies scaling.

Data Services

Relational Database: The relational database is used for all data that is involved in transactions.  This includes user account information, as well as auction, item, and high-bid data.

NoSQL Data Server:  The NoSQL Document Store is used to store image metadata as well as activity data such as auction attendance information and bid records. It can also be used to store uploaded images. Using the NoSQL store as an image store allows the application to take advantage of its sharding capabilities to easily scale the I/O capacity for image storage.

File Server: The file server is used exclusively to store item images uploaded by users.  Note that the file server is optional, as the images can be stored and served from the NoSQL document store.
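
As an illustration of the image-storage path, the snippet below uses MongoDB's GridFS, the idiomatic mechanism for storing binary files in a MongoDB document store. The post does not specify exactly how LiveAuction writes images to the NoSQL store, so treat this as a sketch; the host, database, and file names are hypothetical.

```python
# Hedged sketch only: LiveAuction's actual image-storage code is not shown
# in the post. GridFS illustrates how images can live in a sharded MongoDB.
from pymongo import MongoClient
import gridfs

db = MongoClient("mongodb://mongo1.example.com:27017")["liveauction"]
fs = gridfs.GridFS(db)

# Store an uploaded item image. GridFS splits the file into chunk documents,
# which is what lets a sharded cluster spread image I/O across shards.
with open("item-photo.jpg", "rb") as f:
    image_id = fs.put(f, filename="item-photo.jpg", contentType="image/jpeg")

# Read it back when serving the item page.
image_bytes = fs.get(image_id).read()
```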

Zephyr currently includes configuration support for deploying LiveAuction using the following services:

  • Virtual IP Address Management: Keepalived
  • TCP Load Balancer: HAProxy
  • Web Server: Apache Httpd and Nginx
  • Application Server: Apache Tomcat with Ehcache for in-memory caching
  • Messaging Server: RabbitMQ
  • Relational Database: MySQL and PostgreSQL
  • NoSQL Data Store: MongoDB
  • Network Filesystem: NFS

Additional implementations will be supported in future releases.

Zephyr can be deployed with different subsets of the infrastructure and application services.  For example, the figure below shows a minimal deployment of Zephyr with a single application server and the supporting data services.  In this configuration, the application server performs the tasks handled by the web server in a larger deployment.

Figure 3. Logical layout for a minimal Zephyr deployment

The Zephyr workload driver has been developed to drive HTTP-based loads for modern scalable web applications.  It can simulate workloads for applications that incorporate asynchronous behaviors using embedded JavaScript, and those requiring complex data-driven behaviors, as in web applications with significant inter-user interaction.  The Zephyr workload driver uses an asynchronous design with a small number of threads supporting a large number of simulated users. Simulated users may have multiple active asynchronous activities which share state information, and complex workload patterns can be specified with control-flow decisions made based on retrieved state and operation history. These features allow us to efficiently simulate workloads that would be presented to web applications by rich web clients using asynchronous JavaScript operations.

The Zephyr workload driver also monitors quality-of-service (QoS) metrics for both the LiveAuction application and the overall workload. The application-level QoS requirements are based on the 99th-percentile response times for the individual operations.  An operation represents a single action performed by a user or embedded script, and may consist of multiple HTTP exchanges.  The workload-level QoS requirements define the required mix of operations that must be performed by the users during the workload’s steady state.  This mix must be consistent from run to run in order for the results to be comparable.  In order for a run of the benchmark to pass, all QoS requirements must be satisfied.
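
The following minimal sketch shows the shape of this pass/fail logic. The operation names, response-time limits, and mix tolerance below are illustrative, not Zephyr's actual values.

```python
# Minimal sketch of the QoS check described above: every operation's
# 99th-percentile response time must meet its limit, and the achieved
# operation mix must stay within tolerance of the required mix.

def p99(samples):
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

def run_passes(response_times, p99_limits, achieved_mix, required_mix, tol=0.05):
    for op, samples in response_times.items():
        if p99(samples) > p99_limits[op]:
            return False  # an operation missed its response-time QoS
    for op, share in required_mix.items():
        if abs(achieved_mix.get(op, 0.0) - share) > tol:
            return False  # the steady-state operation mix drifted from spec
    return True

# Example with two hypothetical operations (response times in seconds).
times = {"placeBid": [0.05, 0.07, 0.04] * 100, "getItem": [0.02, 0.03] * 100}
print(run_passes(times,
                 p99_limits={"placeBid": 0.5, "getItem": 0.25},
                 achieved_mix={"placeBid": 0.59, "getItem": 0.41},
                 required_mix={"placeBid": 0.60, "getItem": 0.40}))
```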

Zephyr also includes a run harness that automates most of the steps involved in configuring and running the benchmark.  The harness takes as input a configuration file that describes the deployment configuration, the user load, and many service-specific tuning parameters.  The harness is then able to power on virtual machines, configure and start the various software services, deploy the software components of LiveAuction, run the workload, and collect the results, as well as the log, configuration, and statistics files from all of the virtual machines and services.  The harness also manages the tasks involved in loading and preparing the data in the data services before each run.

Scalability

Scaling to large deployments is a key goal of Zephyr.  Therefore, it will be useful to conclude with some initial scalability data to show how we are doing in achieving that goal. There are many possible ways to scale up a deployment of LiveAuction.  For the sake of providing a straightforward comparison, we will focus on scaling out the number of application server instances in an otherwise fixed deployment configuration.  The CPU utilization of the application server is typically the performance bottleneck in a well-balanced LiveAuction deployment.

The figure below shows the logical layout of the VMs and services in this deployment.  Physically, all VMs reside on the same network subnet on the vSphere hosts, which are connected by a 10Gb Ethernet switch.

Figure 4. Deployment configuration for scaling results

The VMs in the LiveAuction deployment were distributed across three VMware vSphere 6 hosts.  Table 1 gives the hardware details of the hosts.

Host Name | Host Vendor/Model | Processors | Memory
Host1 | Dell PowerEdge R720, 2-socket server | Intel® Xeon® CPU E5-2690 @ 2.90GHz, 8 cores / 16 threads | 256GB
Host2 | Dell PowerEdge R720, 2-socket server | Intel® Xeon® CPU E5-2690 @ 2.90GHz, 8 cores / 16 threads | 256GB
Host3 | Dell PowerEdge R720, 2-socket server | Intel® Xeon® CPU E5-2680 @ 2.70GHz, 8 cores / 16 threads | 192GB

Table 1. vSphere 6 hosts for LiveAuction deployment

Table 2 shows the configuration of the VMs, and their assignment to vSphere hosts.  As the goal of these tests was to examine the scalability of the LiveAuction application, and not the characteristics of vSphere 6, we chose the VM sizing and assignment in part to avoid using more virtual CPUs than physical cores. While we did some tuning of the overall configuration, we did not necessarily obtain the optimal tuning for each of the service configurations.  The configuration was chosen so that the application server was the bottleneck as far as possible within the restrictions of the available physical servers.  In future posts, we will examine the tuning of the individual services, tradeoffs in deployment configurations, and best practices for deploying LiveAuction-like applications on vSphere.

Service | Host | VM vCPUs (each) | VM Memory
HAProxy 1 | Host1 | 2 | 8GB
HAProxy 2 | Host2 | 2 | 8GB
HAProxy 3 | Host3 | 2 | 8GB
Nginx 1, 2, and 3 | Host3 | 2 | 8GB
RabbitMQ 1 | Host2 | 1 | 2GB
RabbitMQ 2 | Host1 | 1 | 2GB
Tomcat 1, 3, 5, 7, and 9 | Host1 | 2 | 8GB
Tomcat 2, 4, 6, 8, and 10 | Host2 | 2 | 8GB
MongoDB 1 and 3 | Host2 | 1 | 32GB
MongoDB 2 and 4 | Host1 | 1 | 32GB
PostgreSQL | Host3 | 6 | 32GB
Table 2. Virtual machine configuration

Figure 5 shows the peak load that can be supported by this deployment configuration as the number of application servers is scaled from one to ten.  The peak load supported by a configuration is the maximum load at which the configuration can satisfy all of the QoS requirements of the workload.  The dotted line shows linear scaling of the maximum load extrapolated from the single application server result.  The actual scaling is essentially linear up to six application-server VMs.  At that point, the overall utilization of the physical servers starts to affect the ability to maintain linear scaling.  With seven application servers, the web-server tier becomes a scalability bottleneck, but there are not sufficient CPU cores available to add additional web servers.

It would require additional infrastructure to determine how far the linear scaling could be extended.  However, the current results provide strong evidence that with sufficient resources, Zephyr will be able to scale to support very large loads representing large numbers of users.

Figure 5. Maximum supported users for increasing number of application servers

Future

The discussion in this post has focused on the use of Zephyr as a traditional single-application benchmark with a focus on throughput and response-time performance metrics.  However, that only scratches the surface of our future plans for Zephyr.  We are currently working on extending Zephyr to capture more cloud-centric performance metrics.  These fall into two broad categories that we call multi-tenancy metrics and elasticity metrics.  Multi-tenancy metrics capture the performance characteristics of a cloud-deployed application in the presence of other applications co-located on the same physical resources.  The relevant performance metrics include isolation and fairness along with the traditional throughput and response-time metrics.  Elasticity metrics capture the performance characteristics of self-scaling applications in the presence of changing loads.  It is also possible to study elasticity metrics in the context of multi-tenancy environments, thus examining the impact of shared resources on the ability of an application to scale in a timely manner to satisfy user demands.  These are all exciting new areas of application performance, and we will have more to say about these subjects as we approach Zephyr 1.0.

Virtual SAN 6.0 Performance: Scalability and Best Practices

A technical white paper about Virtual SAN performance has been published. This paper provides guidelines on how to get the best performance for applications deployed on a Virtual SAN cluster.

We used Iometer to generate several workloads that simulate various I/O patterns encountered in Virtual SAN production environments. These are shown in the following table.

Type of I/O workload | Size (1KiB = 1024 bytes) | Mixed Ratio | Shows / Simulates
All Read | 4KiB | – | Maximum random read IOPS that a storage solution can deliver
Mixed Read/Write | 4KiB | 70% / 30% | Typical commercial applications deployed in a VSAN cluster
Sequential Read | 256KiB | – | Video streaming from storage
Sequential Write | 256KiB | – | Copying bulk data to storage
Sequential Mixed R/W | 256KiB | 70% / 30% | Simultaneous read/write copy from/to storage

In addition to these workloads, we studied Virtual SAN caching tier designs and the effect of Virtual SAN configuration parameters on the Virtual SAN test bed.

Virtual SAN 6.0 can be configured in two ways: Hybrid and All-Flash. Hybrid uses a combination of hard disks (HDDs) to provide storage and a flash tier (SSDs) to provide caching. The All-Flash solution uses all SSDs for storage and caching.

Tests show that the Hybrid Virtual SAN cluster performs extremely well when the working set is fully cached for random access workloads, and also for all sequential access workloads. The All-Flash Virtual SAN cluster, which performs well for random access workloads with large working sets, may be deployed in cases where the working set is too large to fit in a cache. All workloads scale linearly in both types of Virtual SAN clusters—as more hosts and more disk groups per host are added, Virtual SAN sees a corresponding increase in its ability to handle larger workloads. Virtual SAN offers an excellent way to scale up the cluster as performance requirements increase.

You can download Virtual SAN 6.0 Performance: Scalability and Best Practices from the VMware Performance & VMmark Community.

VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines

vSphere 6 introduces the ability to run virtual machines (VMs) with up to 128 virtual CPUs (vCPUs) and 4TB of RAM. This doubles the number of vCPUs supported from the previous version and increases the amount of RAM by four times. This new capability provides the potential for customers to run larger workloads than ever before in a virtual machine.

A series of tests were run with a virtual machine hosting Oracle 12c database instances. The DVD Store 2.1 open-source transactional workload was used to measure the performance of a large “Monster” VM on vSphere 6. The Oracle 12c database VM was scaled from 15 vCPUs all the way up to 120 vCPUs, and the maximum achieved throughput was measured. The full results and test details have been published in a white paper – VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines.

A four-socket server based on the Intel Xeon E7-4890 v2 processor, with 1TB of memory, was used to host the virtual machine for the tests.  Each Xeon E7-4890 v2 processor has 15 cores / 30 threads with Hyper-Threading enabled, for a total of 60 cores / 120 threads for the system. The diagram below shows the basic test configuration.

Figure: Basic test configuration

In all tests, Hyper-Threading was enabled on the server, but in configurations where 60 or fewer vCPUs are assigned to the VM, Hyper-Threads are not used by the VM. This is a result of the default scheduling policy, which prefers to schedule vCPUs on one thread per core before using the second thread of any core. The first set of results, shown below, focuses on the tests that scale up to 60 vCPUs; these tests show the scaling of the virtual machine without the use of Hyper-Threads.

Figure: Oracle 12c VM throughput scaling from 15 to 60 vCPUs

While vSphere 6 supports up to 128 vCPUs per VM, these tests were limited to 120 vCPUs due to the number of threads available on the server. The largest VM configuration used both hardware execution threads (Hyper-Threads) on all the processor cores in order to reach 120 vCPUs. In this case, there is one vCPU per execution thread.

Hyper-Threading doubles the number of execution threads, but it does not double performance. In order to measure the scale-up performance of the 120-vCPU VM, a 60-vCPU VM was configured with CPU affinity so that it was limited to only two of the server’s four sockets. In this configuration the 60-vCPU VM has one vCPU per execution thread, which is the same as the 120-vCPU VM.  Configuring a 60-vCPU VM in this way makes it easy to see the scale up performance at 120 vCPUs on this server with hyper-threads enabled.

The results of the scale-up testing using the 60-vCPU VM configured with CPU affinity to only 2 sockets and the 120-vCPU VM using all four sockets showed approximately linear scaling, as shown in the graph below.

Figure: Scale-up performance from 60 vCPUs (two sockets) to 120 vCPUs (four sockets)

For full test details and more test results, please see the recently published white paper.

The new larger “Monster” VM support in vSphere 6 allows for virtual machines that can support larger workloads than ever before with excellent performance. These tests show that large virtual machines running on vSphere 6 can scale up as needed to meet extreme performance demands.


Improvements in Network I/O Control for vSphere 6

Network I/O Control (NetIOC) in VMware vSphere 6 has been enhanced to support a number of exciting new features, such as bandwidth reservations. A new paper published by the Performance Engineering team shows the performance of these new features and explores the performance impact of the new NetIOC algorithm. The tests show that NetIOC offers a powerful way to achieve network resource isolation at minimal cost in terms of latency and CPU utilization.

You can read the paper here.


Failover Time for vCenter Server 6.0 Protected by vSphere HA

By Jianli Shen, Kimberly Wang, and Lei Chai

There are many advantages to virtualizing vCenter Server rather than installing it on physical hardware; these have been blogged about already. VMware best practice is to deploy vCenter Server as a virtual machine and to protect it with vSphere High Availability (HA). The typical scenario is to place vCenter Server in a vSphere HA cluster, where HA protects vCenter Server just like any other virtual machine. If the ESXi server that is running vCenter Server goes down, HA will kick in and power on this VM from the shared storage on another host in the HA cluster. In this blog, we show the downtime of vCenter Server in this scenario with experiments that simulate a production deployment.

We performed our experiments with a vCenter Server VM deployed in an HA-enabled cluster, with vCenter Server itself managing an inventory of 64 hosts and 6,000 VMs. The experiments show that if the ESXi server hosting vCenter Server fails, vCenter Server is powered on by another host in the cluster, as expected in the HA scenario. After vCenter Server is powered on, it takes time to boot up to full function; the 64 hosts/6,000 VMs inventory is presented once the administrator is able to log into the vSphere Web Client again. In our experiments, the whole procedure, from failure to the vCenter Server administrator being able to log into the vSphere Web Client, took about 7 minutes and 40 seconds, with most of that time spent booting vCenter Server up to full function. During vCenter Server downtime, customers can still access their VMs; only the vCenter Server administrator is impacted.

In the following sections, we describe our experiments and numbers.

Experimental Deployment Scenarios

We created the scenario of vCenter Server in an HA-enabled cluster environment and used a host simulator to simulate the 64 hosts/6,000 VMs inventory for vCenter Server. The deployment is shown in Figure 1 below.

Figure 1. Test-Bed Setup

In this setup:

  • HostA is the host where we first deploy vCenter Server 6.0. HostB is the host that acts as a backup host for vCenter Server to failover in case hostA fails. HostA and HostB are managed by vCenter Server in an HA-enabled cluster, which is Cluster2 in the figure.
  • Both hosts are Dell PowerEdge R620 servers with Intel® Xeon® E5-2650 processors @ 2.00GHz, 128GB of memory, and 3.7TB of shared storage.
  • vCenter Server is a virtual appliance deployed on HostA with 16 vCPUs and 32GB RAM.
  • vCenter Server also manages a large inventory of 64 hosts and 6,000 VMs in Cluster1.

The figure shows what happens before and after HostA fails. During the experiment, we powered off HostA, so the vCenter Server VM went down as well. The HA agent then detected the failure of HostA and powered on the VMs previously running on HostA (vCenter Server, in this case) on HostB. Once vCenter Server was powered on, it recovered its inventory just as it does during a normal power-on.

Performance Result

We measured the time from the point the vCenter Server VM stopped responding to the point the vSphere Web Client started responding to user activity again.
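
The sketch below illustrates how such a downtime window can be measured from the outside: record when the failure is induced, then poll until the Web Client endpoint answers again. The URL and polling interval are illustrative, and a real measurement would also verify that login and inventory display succeed, not just that the endpoint responds.

```python
# Rough, hedged sketch of an external downtime measurement.
import time
import requests

URL = "https://vcenter.example.com/vsphere-client/"  # hypothetical address

def wait_until_responsive(url, interval=5):
    while True:
        try:
            if requests.get(url, timeout=interval, verify=False).ok:
                return time.time()
        except requests.RequestException:
            pass  # still down; keep polling
        time.sleep(interval)

failure_time = time.time()  # HostA is powered off at this point
recovery_time = wait_until_responsive(URL)
print("downtime: %.0f seconds" % (recovery_time - failure_time))
```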

With the 64 hosts/6,000 VMs inventory, the total time is around 460 seconds (that is, 7 minutes and 40 seconds), with about 30-50 seconds for HA to get into action. The rest of the time is spent on the vCenter Server power-on process.

To see the impact of vCenter Server's inventory size, we compared this with similar scenarios in which vCenter Server manages a 1 host/1 VM or a 32 hosts/4,000 VMs inventory; the downtime is around 410 seconds (6 minutes and 50 seconds) and 439 seconds (7 minutes and 19 seconds), respectively. Downtime therefore increases with the size of the vCenter Server inventory, but considering the whole power-on process shown in Figure 2, the increase is reasonable.

As a reference, we looked at the vCenter Server reboot time. We issued a vCenter Server reboot command, waited until it was back to full function, and measured the latency of this period. We then compared it to the total time it takes to fail over the vCenter Server, which includes its reboot time. As shown in Figure 2 below, failing vCenter Server over from HostA to HostB takes around 30-50 seconds longer than rebooting it on HostA. This 30-50 second period is the time the HA agent needs to detect the failure on HostA and issue the command to power on vCenter Server on HostB. Figure 3 shows a closer look at the time each failover activity takes.

Figure 2. Difference in seconds between normal reboot and HA failover (reboot + HA) in very small, medium, and large inventories

Figure 3. Total time of failover, including time it takes to reboot during failover and time for vSphere HA activity

Summary

In this article, we showed that deploying vCenter Server as a VM makes protecting it with vSphere HA straightforward. The solution is practical, and the downtime is within an expected tolerance level: administrators regain control of vCenter Server in less than 10 minutes, while customers' services keep running on their VMs in Cluster1 throughout.

Virtualized Hadoop Performance with vSphere 6

A recently published whitepaper shows that not only can vSphere 6 keep up with newer high-performance servers, it thrives on their capabilities.

Two years ago, Hadoop benchmarks were run with vSphere 5.1 on a cluster of 32 dual-socket, quad-core servers. Very good performance was demonstrated, with the optimal virtualized configuration shown to be actually 2% faster than native for TeraSort (see the previous whitepaper).

These benchmarks were recently run on a cluster of the same size, but with ten-core processors, more disks and memory, dual 10GbE networking, and vSphere 6. The maximum dataset size was almost quadrupled to 30TB, to ensure that it is much bigger than the total memory in the cluster (hence qualifying the test as Big Data, by one definition).

The results, summarized in the chart below, show that the optimal virtualized configuration now delivers 12% better performance than native for TeraSort. The primary reason for this excellent performance is the ability of vSphere to map physical hardware resources to virtual hardware that is optimized for scale-out applications. The observed trend, as well as theory based on processor characteristics, indicates that the importance of being able to do this mapping correctly increases as processors become more powerful. The sub-optimal performance of one of the tests is due to the combination of very small VMs and how Hadoop does replication during data creation. On the other hand, small VMs are very advantageous for read-dominated applications, which are typically more common. Taken together with other best practices discussed in the paper, this information can be used to configure Hadoop clusters for the highest levels of performance. Despite all the hardware and software changes over the past two years, the optimal configuration was still found to be four VMs per dual-socket host.

Figure: Elapsed time ratio of virtualized to native configurations

Please take a look at the whitepaper for more details on how these benchmarks were run and for analyses on why certain virtual configurations perform so well.

Scaling Out Redis Performance with Docker on vSphere 6.0

by Davide Bergamasco

In an earlier VROOM! post we discussed, among other things, the performance of the Redis in-memory key-value store in a Docker/vSphere environment. In that post we focused on a single instance of a Redis server subject to a more or less artificial workload with the goal of assessing the absolute performance of said instance under various deployment scenarios.

In this post we are taking a different point of view, which is maximizing the throughput of multiple Redis instances running on a “large” server under a more realistic workload. Why are we interested in this perspective?  Conceptually, Redis is an extremely simple application, being just a thin layer of code implementing a large hash table on top of system calls.  From the implementation standpoint, a single-threaded event loop services requests from the clients in a polling fashion. The problem with this design is that it is not suitable for “scaling up”; that is, improving performance by using multiple cores. Modern servers have many processing cores (up to 80) and possibly terabytes of memory.  However, Redis can only access that memory at the speed of a single core.

This problem can be solved by “scaling out” Redis; that is, by partitioning the server memory across multiple Redis instances and running each of those on a different core.  This can be achieved by using a set of load balancers to fragment the key space and distribute the load among the various instances. The diagram shown in Figure 1 illustrates this concept.

Figure 1. Redis Scale Out Setup

Host H3 runs the various Redis server instances (red boxes), while Host H2 runs two sets of load balancers:

  • The green boxes are the Redis load balancers, which partition the key space using a consistent hashing algorithm.  We leveraged the Twemproxy OSS project to implement the Redis load balancers.
  • The yellow boxes are TCP load balancers, which distribute the load across the Redis load balancers in a round robin fashion. We used the HAProxy OSS project to implement the TCP load balancers.

Finally, Host H1 runs the load generators (dark blue boxes); that is, the standard benchmark redis-benchmark.
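
To give a feel for how the key space is fragmented, here is a simplified consistent-hashing sketch. Twemproxy implements a production-grade version of this idea (for example, ketama hashing); the instance addresses below are illustrative.

```python
# Simplified sketch of consistent-hash key partitioning across Redis
# instances. This is illustrative, not twemproxy's actual implementation.
import hashlib
from bisect import bisect

class HashRing:
    def __init__(self, instances, points_per_instance=160):
        # Place many virtual points per instance on a hash ring so the key
        # space spreads evenly and adding or removing an instance only
        # remaps a small fraction of the keys.
        self.ring = sorted(
            (self._hash(f"{inst}-{i}"), inst)
            for inst in instances for i in range(points_per_instance))
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def instance_for(self, key):
        # A key belongs to the first instance point at or after its hash,
        # wrapping around the ring.
        return self.ring[bisect(self.hashes, self._hash(key)) % len(self.ring)][1]

ring = HashRing([f"redis-{i}.example.com:6379" for i in range(8)])  # 8 instances
print(ring.instance_for("user:42:cart"))
```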

Deployment Scenarios

We assessed the performance of this design across a set of deployment scenarios analogous to what we considered in the previous post. These are listed below and illustrated in Figure 2:

  • Native: Redis instances are run as 8 separate processes on the Linux OS running directly on Host H3 hardware.
  • VM: Redis instances are run inside 8 2-vCPU VMs running on a pre-release build of vSphere 6.0.0 on Host H3 hardware; the guest OS is the same as in the Native scenario.
  • Native-Docker: Redis instances are run inside 8 Docker containers running on the Native OS.
  • VM-Docker: Redis instances are run inside Docker containers each running inside the same VMs as the VM scenario, with one container per VM.

Figure 2. Different deployment scenarios

Hardware/Software/Workload Configuration

The following are the details about the hardware, software, and workload used in the various experiments discussed in the next section:

Hosts:

  • HP ProLiant DL380e Gen8
  • CPU: 2 x Intel® Xeon® CPU E5-2470 0 @ 2.30GHz (16 cores, 32 hyper-threads total)
  • Memory: 96GB
  • Hardware configuration: Hyper-Threading ON, turbo-boost OFF, power policy: Static High (no power management)
  • Network: 10GbE
  • Storage: 8 x 500GB 15,000 RPM 6Gb SAS disks, HP H220 host bus adapter

Linux OS:

  • CentOS 7
  • Kernel 3.18.1 (CentOS 7 comes with 3.10.0, but we wanted to use the latest kernel available at the time of this writing)
  • Docker 1.2

ESXi:

  • VMware vSphere 6.0.0 (pre-release build)

VM:

  • 8 x 2-vCPU, 11GB (VM scenario)
  • Virtual NIC: vmxnet3
  • Virtual HBA: LSI-SAS

Application:

  • Redis 2.8.13
  • AOF persistency with “everysec” flush policy (every operation that mutates a key is logged into an Append Only File in order to enable data recovery after a crash; the buffer cache is flushed every second, so with this durability policy at most one second worth of data can be lost)

Workload:

  • Keyspace: 250 million keys, value size 1 byte (this size has been chosen to prevent network or storage from becoming bottlenecks from the bandwidth perspective)
  • 8 redis-benchmark instances each simulating 100 clients with a pipeline depth of 30 requests
  • Operations mix: 75% GET, 25% SET 
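
As a hedged sketch of how this load could be driven: stock redis-benchmark runs each command type as a separate test rather than an interleaved 75%/25% mix, so the exact mix presumably required a modified or custom driver. The snippet below just shows eight concurrent instances with the stated client, pipeline, key-space, and value-size parameters; the hostname and request count are illustrative.

```python
# Launch 8 redis-benchmark instances in parallel (flags map to the stated
# workload parameters; host and -n value are assumptions).
import subprocess

procs = [
    subprocess.Popen([
        "redis-benchmark",
        "-h", "haproxy.example.com",  # enter through the TCP load balancers
        "-c", "100",                  # 100 simulated clients per instance
        "-P", "30",                   # pipeline depth of 30 requests
        "-r", "250000000",            # 250 million random keys
        "-d", "1",                    # 1-byte values
        "-n", "10000000",             # illustrative request count
        "-t", "get,set",
    ])
    for _ in range(8)
]
for p in procs:
    p.wait()
```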

Results

We ran two sets of experiments for every scenario listed in the “Deployment Scenarios” section. The first set was meant to establish a baseline by having a single redis-benchmark instance generate requests directly against a single redis-server instance. The second set aimed at assessing the overall performance of the Redis scale-out system we presented earlier. The results of these two sets of experiments are shown in Figure 3, where each bar represents the throughput in operations per second averaged over five trials, and error bars indicate the range of the measured values.

Figure 3. Results of the performance experiments (Y-axis represents throughput in 1,000 operations per second)

Nothing really surprising can be noticed looking at the results of the baseline experiments (labeled “1 Server – 1 Client” in Figure 3).  The Native scenario is obviously the fastest in terms of operations per second, followed by the Docker, VM, and the Docker-VM scenarios. This is expected as both virtualization and containerization add some overhead on top of the bare-metal performance.

Looking at the scale-out experiments (labeled “Scale-Out” in Figure 3), we see a surprisingly different picture. The VM scenario is now the fastest, followed by Docker-VM, while the Native and Docker scenarios come in as a somewhat distant third and fourth.  This unexpected result can be explained by looking at the Host H3 CPU activity during an experiment run.  In the Native and Docker scenarios, notice that the CPU load is spread over the 16 cores.  This means that even though only 8 threads are active (the 8 redis-server instances), the Linux scheduler is continuously migrating them.  This might result in a large number of cross-NUMA node memory accesses, which are substantially more expensive than same-NUMA node accesses. Also, irqbalance is spreading the network card interrupts across all the 16 cores, additionally contributing to the above phenomenon.

In the VM and Docker-VM scenarios, this does not occur because the ESXi scheduler goes to great lengths to keep both the memory and vCPUs of a VM on one NUMA node.  Also, with the PVSCSI virtual device, the virtual interrupts are always routed to the same vCPU(s) that initiated an I/O, and this minimizes interrupt migrations.

We tried to eliminate the cross-NUMA node memory activity in the Native scenario by pinning all the redis-server processes to the cores of the same CPU; that is, to the same NUMA node. We also disabled irqbalance and manually pinned the interrupt vectors to the same set of cores. As expected, with this ad-hoc configuration, the Native scenario was the fastest, reaching 3.408 million operations per second. Without any pinning, the VM result is only 4% slower than the optimized Native performance. (Notice that introducing artificial affinity between processes/interrupt vectors and cores is not a recommended practice as it is error-prone and can, in general, lead to unexpected or suboptimal results.)
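
The sketch below shows the shape of that ad-hoc pinning (Linux-specific, requires root). The core range and IRQ numbers are assumptions: cores 0-7 are taken to be NUMA node 0, and IRQs 64-67 stand in for the NIC's vectors. As noted above, this kind of manual affinity is error-prone and not a recommended general practice.

```python
# Hedged sketch of pinning redis-server processes and NIC interrupts to one
# NUMA node. Core and IRQ numbers are illustrative assumptions.
import os
import subprocess

NODE0_CORES = set(range(8))  # assumption: NUMA node 0 = cores 0-7

def pids_of(name):
    out = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    return [int(pid) for pid in out.stdout.split()]

# Pin every redis-server process to the node-0 cores.
for pid in pids_of("redis-server"):
    os.sched_setaffinity(pid, NODE0_CORES)

# Stop irqbalance, then steer the NIC interrupt vectors to the same cores.
subprocess.run(["systemctl", "stop", "irqbalance"], check=True)
for irq in (64, 65, 66, 67):  # hypothetical NIC IRQ numbers
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write("0-7")
```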

Our initial experiments were conducted with the CentOS 7 stock kernel (3.10.0), which unfortunately is not particularly recent. We thought it was prudent to verify if the Linux scheduler had been improved to avoid the inter-NUMA node thread migrations in more recent kernel versions.  Hence, we re-ran all the experiments with the latest version (at the time of this writing, 3.18.1), but we didn’t notice any significant difference with respect to version 3.10.0.

We thought it would be interesting to look at the performance numbers in terms of speedup; that is, the ratio between the throughput of the scale-out system and the throughput of the baseline 1 Server – 1 Client setup. Figure 4 below shows the speedup for the four scenarios considered in this study.

Figure 4: Speedup (Y-axis represents speedup with respect to baseline)

The speedup essentially tells, in relative terms, how much the performance has improved by deploying 8 Redis instances on the same host instead of on a single one. If the system scaled linearly, it would have achieved a maximum theoretical speedup of 8.  In practice, this limit could not be achieved because of extra overheads introduced by the load-balancers and possible resource contention across the Redis instances running on host H3 (this host is almost running at saturation as the overall CPU utilization is consistently between 75% and 85% during the experiment’s execution). In any case, the scale-out system delivers a performance boost of at least 4x as compared to running a single Redis instance with exactly the same memory capacity.  The VM and Docker-VM scenarios achieve a substantially larger speedup because of the cross-NUMA memory access issue afflicting the Native and Docker scenarios.

Conclusions

The main results of this study are the following:

  1. VMs and Docker containers are truly better together. The Redis scale-out system, using out-of-the-box configuration settings, clearly achieves better performance in the Docker-VM scenario than in the Native or Docker scenarios. Even though its performance is not as high as in the VM scenario, the Docker-VM setup offers the same ease of use and deployment typical of the Docker scenario, at a substantially higher performance.
  2. Using VMs and Docker, we managed to scale out a Redis deployment and extracted a great deal of extra performance (up to 5.6x more) from a large server that would have otherwise been underutilized.


A whitepaper on VMware Horizon 6 RDSH Performance and Best Practices

VMware Horizon 6 delivers remote desktops and applications from a single platform. With Horizon 6, IT can deliver seamless applications, and users can access their desktops or applications from anywhere on devices running the Windows, Linux, Mac, iOS, and Android operating systems.

In this blog, we present the Horizon 6 RDSH Performance whitepaper that shows the performance of remote sessions (desktops and seamless applications) and the performance of the RDSH virtual machines, paying particular attention to the sizing of the Horizon 6 environment. The paper also reveals the competitive results of leading remote display network protocols. Finally, the paper presents some of the best practices to tune the RDSH server VM performance. All performance tests were run using the VMware View Planner 3.5 benchmark.

The setup, detailed results, and best practices can be found in the performance whitepaper provided below:

VMware Horizon 6 RDSH Performance and Best Practices

Docker Containers Performance in VMware vSphere

By Qasim Ali, Banit Agrawal, and Davide Bergamasco


“Containers without compromise” – This was one of the key messages at VMworld 2014 USA in San Francisco. It was presented in the opening keynote, and then the advantages of running Docker containers inside of virtual machines were discussed in detail in several breakout sessions. These include security/isolation guarantees and also the existing rich set of management functionalities. But some may say, “These benefits don’t come for free: what about the performance overhead of running containers in a VM?”

A recent report compared the performance of a Docker container to a KVM VM and showed very poor performance in some micro-benchmarks and real-world use cases: up to 60% degradation. These results were somewhat surprising to those of us accustomed to near-native performance of virtual machines, so we set out to do similar experiments with VMware vSphere. Below, we present our findings of running Docker containers in a vSphere VM and  in a native configuration. Briefly,

  • We find that for most of these micro-benchmarks and Redis tests, vSphere delivered near-native performance with generally less than 5% overhead.
  • Running an application in a Docker container in a vSphere VM incurs overhead very similar to that of running the container on a native OS (directly on a physical server).

Next, we present the configuration and benchmark details as well as the performance results.

Deployment Scenarios

We compare four different scenarios as illustrated below:

  • Native: Linux OS running directly on hardware (Ubuntu, CentOS)
  • vSphere VM: Upcoming release of vSphere with the same guest OS as native
  • Native-Docker: Docker version 1.2 running on a native OS
  • VM-Docker: Docker version 1.2 running in guest VM on a vSphere host

In each configuration all the power management features are disabled in the BIOS and Ubuntu OS.

Figure 1: Different test scenarios

Benchmarks/Workloads

For this study, we used the micro-benchmarks listed below and also simulated a real-world use case.

Micro-benchmarks:

  • LINPACK: This benchmark solves a dense system of linear equations. For large problem sizes it has a large working set and does mostly floating point operations.
  • STREAM: This benchmark measures memory bandwidth across various configurations.
  • FIO: This benchmark is used for I/O benchmarking for block devices and file systems.
  • Netperf: This benchmark is used to measure network performance.

Real-world workload:

  • Redis: In this experiment, many clients perform continuous requests to the Redis server (key-value datastore).

For all of the tests, we run multiple iterations and report the average of multiple runs.

Performance Results

LINPACK

LINPACK solves a dense system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system of N equations, converts that time into a performance rate, and tests the results for accuracy. We used an optimized version of the LINPACK benchmark binary based on the Intel Math Kernel Library (MKL).

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 32 vCPU VM with 45K and 65K problem sizes

Figure 2: LINPACK performance for different test scenarios

We disabled HT for this run as recommended by the benchmark guidelines to get the best peak performance. For the 45K problem size, the benchmark consumed about 16GB memory. All memory was backed by transparent large pages. For VM results, large pages were used both in the guest (transparent large pages) and at the hypervisor level (default for vSphere hypervisor). There was 1-2% run-to-run variation for the 45K problem size. For 65K size, 33.8GB memory was consumed and there was less than 1% variation.

As shown in Figure 2, there is almost negligible virtualization overhead in the 45K problem size. For a bigger problem size, there is some inherent hardware virtualization overhead due to nested page table walk. This results in the 5% drop in performance observed in the VM case. There is no additional overhead of running the application in a Docker container in a VM compared to running the application directly in the VM.

STREAM

We used a NUMA-aware  STREAM benchmark, which is the classical STREAM benchmark extended to take advantage of NUMA systems. This benchmark measures the memory bandwidth across four different operations: Copy, Scale, Add, and Triad.

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, HT enabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 64 vCPU VM (Hyper-Threading ON)

Figure 3: STREAM performance for different test scenarios

We used an array size of 2 billion, which used about 45GB of memory. We ran the benchmark with 64 threads both in the native and virtual cases. As shown in Figure 3, the VM added about 2-3% overhead across all four operations. The small 1-2% overhead of using a Docker container on a native platform is probably in the noise margin.

FIO

We used the Flexible I/O (FIO) tool version 2.1.3 to compare the storage performance for the native and virtual configurations, with Docker containers running in both. We created a 10GB file on a 400GB local SSD drive and used direct I/O for all our tests so that there were no effects of buffer caching inside the OS. We used a 4k I/O size and tested three different I/O profiles: random 100% read, random 100% write, and a mixed case with random 70% read and 30% write. For the 100% random read and write tests, we selected 8 threads and an I/O depth of 16, whereas for the mixed test, we selected an I/O depth of 32 and 8 threads. We used taskset to set the CPU affinity of the FIO threads in all configurations. All the details of the experimental setup are given below:

Hardware: 2 socket Intel Xeon E5-2660 2.2GHz with 392GB RAM, 16 total cores, Hyper-Threading enabled
Guest: 32-vCPU Ubuntu 14.04.1 64-bit server with 256GB RAM, with a separate ext4 disk in the guest (on VMFS5 in the vSphere run)
Benchmark:  FIO, Direct I/O, 10GB file
I/O Profile:  4k I/O, Random Read/Write: depth 16, jobs 8, Mixed: depth 32, jobs 8
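
The fio invocations implied by these parameters can be expressed as a small launcher, sketched below. The paper's exact command lines aren't given, so this is hedged: the flags map directly to the stated profiles, while the file path and the taskset core list are assumptions.

```python
# Hedged sketch: run the three stated FIO profiles under taskset.
import subprocess

PROFILES = {
    "rand-read":   ["--rw=randread", "--iodepth=16", "--numjobs=8"],
    "rand-write":  ["--rw=randwrite", "--iodepth=16", "--numjobs=8"],
    "mixed-70-30": ["--rw=randrw", "--rwmixread=70", "--iodepth=32", "--numjobs=8"],
}

for name, extra in PROFILES.items():
    subprocess.run(
        ["taskset", "-c", "0-7",            # fix CPU affinity of FIO threads (cores assumed)
         "fio", f"--name={name}",
         "--filename=/ssd/fio-test-file",   # 10GB file on the local SSD (path assumed)
         "--size=10g",
         "--bs=4k",                         # 4k I/O size
         "--direct=1",                      # bypass the OS buffer cache
         "--ioengine=libaio",
         "--runtime=60", "--time_based",
         *extra],
        check=True)
```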

Figure 4: FIO benchmark performance for different test scenarios

The figure above shows the normalized maximum IOPS achieved for different configurations and different I/O profiles. For random read in a VM, we see that there is about 2% reduction in maximum achievable IOPS when compared to the native case. However, for the random write and mixed tests, we observed almost the same performance (within the noise margin) compared to the native configuration.

Netperf

Netperf is used to measure throughput and latency of networking operations. All the details of the experimental setup are given below:

Hardware (Server): 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Hardware (Client): 2 socket Intel Xeon X5570 2.93GHz with 64GB RAM, 8 cores total, Hyper-Threading disabled
Networking hardware: Broadcom Corporation NetXtreme II BCM57810
Software on server and Client: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 2 vCPU VM with 4GB RAM

The server machine for the Native configuration has only 2 CPUs online for a fair comparison with the 2-vCPU VM. The client machine is also configured to have 2 CPUs online to reduce variability. We tested four configurations: directly on the physical hardware (Native), in a Docker container (Native-Docker), in a virtual machine (VM), and in a Docker container inside a VM (VM-Docker). For the two Docker deployment scenarios, we also studied the effect of using host networking as opposed to the Docker bridge mode (the default operating mode), resulting in two additional configurations (Native-Docker-HostNet and VM-Docker-HostNet) for a total of six configurations.

We used TCP_STREAM and TCP_RR tests to measure the throughput and round-trip network latency between the server machine and the client machine using a direct 10Gbps Ethernet link between two NICs. We used standard network tuning like TCP window scaling and setting socket buffer sizes for the throughput tests.
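
A hedged sketch of the two netperf tests follows: a bulk TCP_STREAM throughput run and a TCP_RR round-trip latency run against the server under test. The hostname, duration, and socket-buffer sizes are illustrative, not the paper's exact command lines.

```python
# Hedged sketch of the netperf throughput and latency tests.
import subprocess

SERVER = "netperf-server.example.com"  # hypothetical address of the server NIC

# Unidirectional throughput over a single TCP connection; socket buffers
# sized explicitly as part of the standard throughput tuning.
subprocess.run(["netperf", "-H", SERVER, "-t", "TCP_STREAM", "-l", "60",
                "--", "-s", "256K", "-S", "256K"], check=True)

# Round-trip latency measured with 1-byte request/response pairs.
subprocess.run(["netperf", "-H", SERVER, "-t", "TCP_RR", "-l", "60",
                "--", "-r", "1,1"], check=True)
```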

Figure 5: Netperf receive performance for different test scenarios

Figure 6: Netperf transmit performance for different test scenarios

Figures 5 and 6 show the unidirectional throughput over a single TCP connection with the standard 1500-byte MTU for the transmit and receive TCP_STREAM cases. (We used multiple streams in the VM-Docker* transmit cases to reduce run-to-run variability due to Docker bridge overhead and get predictable results.) Throughput numbers for all configurations are identical and equal to the maximum possible 9.40Gbps on a 10GbE NIC.

Figure 7: Netperf TCP_RR performance for different test scenarios (Lower is better)

For the latency tests, we used the latency sensitivity feature introduced in vSphere 5.5 and applied the best practices for tuning latency in a VM as described in this white paper. As shown in Figure 7, latency in a VM with the VMXNET3 device is only 15 microseconds more than in the native case because of the hypervisor networking stack. If users wish to reduce the latency even further for extremely latency-sensitive workloads, pass-through mode or SR-IOV can be configured to allow the guest VM to bypass the hypervisor network stack. This configuration can achieve round-trip latency similar to native, as shown in Figure 8. The Native-Docker and VM-Docker configurations add about 9-10 microseconds of overhead due to the Docker bridge NAT function. A Docker container (running natively or in a VM) configured to use host networking achieves latencies similar to those observed when the workload is not running in a container (native or a VM).

Figure 8: Netperf TCP_RR performance for different test scenarios (VMs in pass-through mode)

Redis

We also wanted to take a look at how Docker in a virtualized environment performs with real world applications. We chose Redis because: (1) it is a very popular application in the Docker space (based on the number of pulls of the Redis image from the official Docker registry); and (2) it is very demanding on several subsystems at once (CPU, memory, network), which makes it very effective as a whole system benchmark.

Our test-bed comprised two hosts connected by a 10GbE network. One of the hosts ran the Redis server in different configurations as mentioned in the netperf section. The other host ran the standard Redis benchmark program, redis-benchmark, in a VM.

The details about the hardware and software used in the experiments are the following:

Hardware: HP ProLiant DL380e Gen8 2 socket Intel Xeon E5-2470 2.3GHz with 96GB RAM, 16 total cores, Hyper-Threading enabled
Guest OS: CentOS 7
VM: 16 vCPU, 93GB RAM
Application: Redis 2.8.13
Benchmark: redis-benchmark, 1000 clients, pipeline: 1 request, operations: SET 1 Byte
Software configuration: Redis thread pinned to CPU 0 and network interrupts pinned to CPU 1

Since Redis is a single-threaded application, we decided to pin it to one of the CPUs and pin the network interrupts to an adjacent CPU in order to maximize cache locality and avoid cross-NUMA node memory accesses.  The workload we used consists of 1000 clients, each with a pipeline of 1 outstanding request, setting a 1-byte value with a randomly generated key in a space of 100 billion keys.  This workload is highly stressful to the system resources because: (1) every operation results in a memory allocation; (2) the payload size is as small as it gets, resulting in a very large number of small network packets; (3) as a consequence of (2), the frequency of operations is extremely high, resulting in complete saturation of the CPU running Redis and a high load on the CPU handling the network interrupts.
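
The benchmark invocation implied by this configuration is sketched below. The post does not give the exact command line, so this is hedged: the redis-benchmark flags map directly to the stated parameters, while the hostname and total request count are illustrative.

```python
# Hedged sketch of the single-instance redis-benchmark run described above.
import subprocess

subprocess.run([
    "redis-benchmark",
    "-h", "redis-server.example.com",  # hypothetical server address
    "-c", "1000",          # 1000 clients
    "-P", "1",             # pipeline of 1 outstanding request
    "-d", "1",             # 1-byte values
    "-r", "100000000000",  # random keys drawn from a 100-billion-key space
    "-t", "set",
    "-n", "50000000",      # illustrative total request count
], check=True)
```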

We ran five experiments for each of the above-mentioned configurations, and we measured the average throughput (operations per second) achieved during each run.  The results of these experiments are summarized in the following chart.

Figure 9: Redis performance for different test scenarios

The results are reported as a ratio with respect to native of the mean throughput over the 5 runs (error bars show the range of variability over those runs).

Redis running in a VM has slightly lower performance than on a native OS because of the network virtualization overhead introduced by the hypervisor. When Redis is run in a Docker container on native, the throughput is significantly lower than native because of the overhead introduced by the Docker bridge NAT function. In the VM-Docker case, the performance drop compared to the Native-Docker case is almost exactly the same small amount as in the VM-Native comparison, again because of the network virtualization overhead.  However, when Docker runs using host networking instead of its own internal bridge, near-native performance is observed for both the Docker on native hardware and Docker in VM cases, reaching 98% and 96% of the maximum throughput respectively.

Based on the above results, we can conclude that virtualization introduces only a 2% to 4% performance penalty.  This makes it possible to run applications like Redis in a Docker container inside a VM and retain all the virtualization advantages (security and performance isolation, management infrastructure, and more) while paying only a small price in terms of performance.

Summary

In this blog, we showed that in addition to the well-known security, isolation, and manageability advantages of virtualization, running an application in a Docker container in a vSphere VM adds very little performance overhead compared to running the application in a Docker container on a native OS. Furthermore, we found that a container in a VM delivers near native performance for Redis and most of the micro-benchmark tests we ran.

In this post, we focused on the performance of running a single instance of an application in a container, VM, or native OS. We are currently exploring scale-out applications and the performance implications of deploying them on various combinations of containers, VMs, and native operating systems.  The results will be covered in the next installment of this series. Stay tuned!


Monster Performance with SQL Server VMs on vSphere 5.5

VMware vSphere provides an ideal platform for customers to virtualize their business-critical applications, including databases, ERP systems, email servers, and even newly emerging technologies such as Hadoop.  I’ve been focusing on the first one (databases), specifically Microsoft SQL Server, one of the most widely deployed database platforms in the world.  Many organizations have dozens or even hundreds of instances deployed in their environments. Consolidating these deployments onto modern multi-socket, multi-core, multi-threaded server hardware is an increasingly attractive proposition for IT administrators.

Achieving optimal SQL Server performance has been a continual focus for VMware; with current vSphere 5.x releases, VMware supports much larger “monster” virtual machines that can scale up to 64 virtual CPUs and 1 TB of RAM, including exposing virtual NUMA architecture to the guest. In fact, the main goal of this blog and accompanying whitepaper is to refresh a 2009 study that demonstrated SQL performance on vSphere 4, given the marked technology advancements on both the software and hardware fronts.

These tests show that large SQL Server 2012 databases run extremely efficiently with VMware, achieving great performance in a variety of virtual machine configurations with only minor tunings to SQL Server and the vSphere ESXi host. These tunings and other best practices for fully optimizing large virtual machines for SQL Server databases are presented in the paper.

One test in the paper shows the maximum host throughput achieved with different numbers of virtual CPUs per VM. This was measured starting with 8 vCPUs per VM, then doubled to 16, then 32, and finally 64 (the maximum supported with vSphere 5.5).  DVD Store, which is a popular database tool and a key workload of the VMmark benchmark, was used to stress the VMs.  Here is a graph from the paper showing the 8 vCPU x 8 VMs case, which achieved an aggregate of 493,804 opm (operations per minute) on the host:

Figure: Aggregate throughput for the 8 VMs x 8 vCPUs case

There are also tests using CPU affinity to show the performance differences between physical cores and logical processors (Hyper-Threads), the impact of various virtual NUMA (vNUMA) topologies, and experiments with the Latency Sensitivity advanced setting.

For more details and the test results, please download the whitepaper: Performance and Scalability of Microsoft SQL Server on VMware vSphere 5.5.