Home > Blogs > VMware VROOM! Blog

VMware vCenter 6.0 Performance Improvements and Best Practices

VMware vCenter Server 6.0 brings many performance improvements over previous vCenter versions, including:

  • Extensive improvements in throughput and latency.
  • vCenter Server Appliance (VCSA) parity with vCenter Server on Windows, for both inventory size and performance.

A new white paper, “VMware vCenter Server Performance and Best Practices,” illustrates these performance improvements and discusses important best practices for getting the best performance out of your vCenter Server environment.

This chart, taken from the white paper, shows the improvement in throughput over vCenter Server 5.5 at various inventory sizes:

tput

Improvements in vCenter Server 6.0 Cluster Performance

VMware vCenter Server 6.0 brings significant improvements over previous vCenter server versions with respect to cluster size and performance. vCenter Server 6.0 supports up to 64 ESXi Hosts and 8000 VMs in a single cluster. A new white paper, ” VMware vCenter Server 6.0 Cluster Performance “, describes the improvements along several dimensions:

  • VMware vCenter 6.0 can support more hosts and more VMs in a cluster.
  • VMware vCenter 6.0 can support higher operational throughput in a cluster.​
  • VMware vCenter 6.0 can support higher operational throughput with ESXi 6.0 hosts.
  • VMware vCenter 6.0 VCSA can support higher operational throughput in a cluster, compared to vCenter Server on Windows.

Here is a chart from the white paper summarizing one of the key improvements: operational throughput in a cluster:

vCenter Server 6.0 Cluster Performance Improvement

vCenter Server 6.0 Cluster Performance Improvement

Project Capstone Shows Monster VM Performance

Project Capstone was put together a few weeks before VMworld 2015 with the goal of being able to show what is possible with Monster VMs today.  VMware worked with HP and IBM to put together an impressive setup using vSphere 6.0, HP Superdome X and an IBM FlashSystem array that was able to support running four 120 vCPU VMs simultaneously.  Putting these massive Virtual Machines under load we found that performance was excellent with great scalability and a high amount of throughput achieved.

vSphere 6 was launched earlier this year and includes support for virtual machines with up to 128 virtual CPUs which is a big increase from the 64 vCPUs supported in vSphere 5.5. “Monster” virtual machines have a new upper limit and it allows for customers to virtualize even the largest of systems with very hungry CPU needs.

The HP Superdome X used for the testing is an impressive system.  It has 16 Intel Xeon E7-2890v2 2.8 GHz processors.  Each processor has 15 cores and 30 logical threads when Hyper Threading is enabled. In total this is 240 cores / 480 threads.

An IBM FlashSystem array with 20TB of superfast low latency storage was used for the project Capstone configuration.  It provided extremely low latency throughout all testing and provided such great performance that storage was never a concern or issue.  The FlashSystem was extremely easy to setup and use.  Within 24 hours of it arriving in the lab, we were actively running four 120 vCPU VMs with sub millisecond latency.

Large Oracle 12c database virtual machines running on Redhat Enterprise Linux 6.5 were created and configured with 256GB of RAM, pvSCSI virtual disk adapters, and vmxnet3 virtual NICs.  The number of VMs and the number of vCPUs for each VM was varied across the tests.

The workload used for the testing was DVD Store 3 (github.com/dvdstore/ds3).  DVD Store simulates a real online store with customers logging onto the site, browsing products and product reviews, rating products, and ultimately purchasing those products.  The benchmark is measured in Orders Per Minute, with each order representing a complete login, browsing, and purchasing process that includes many individual SQL operations against the database.

This large system with 240 cores / 480 threads, an extremely fast and large storage system, and vSphere 6 showed that even with many monster VMs excellent performance and scalability is possible.  Each configuration was first stressed by increasing the DVD Store workload until maximum throughput was achieved for a single virtual machine.  In all cases this was found to be at near CPU saturation.  The number of VM was then increased so that the entire system was fully committed.   A whitepaper to be published soon will have the full set of test results, but here we show the results for four 120 vCPU VMs and sixteen 30 vCPU VMs.capstonePerf4x120graph

 

capstonePerf16x30graph

In both cases the performance of the system when fully loaded with either 4 or 16 virtual machines achieves about 90% of perfect linear scalability when compared to the performance of a single virtual machine.

In order to be able to drive the CPU usage to such high levels all disk IO must be very fast so that the system is not waiting for a response.  The IBM FlashSystem provided .3 ms average disk latency across all tests.  Total disk IO was minimized for these tests to maximize CPU usage and throughput by configuring the database cache size to be equal to the database size.   Total disk IO per second (IOPS) peaked at about 50k and averaged 20k while maintaining the extremely low latency during tests.

These test results show that it is possible to use vSphere 6 to successfully virtualize even the largest systems with excellent performance.

 

How to Efficiently Deploy Virtual Machines from VMware vSphere Content Library

by Joanna Guan and Davide Bergamasco

This post is the first of a series which aims to assess the performance of the VMware vSphere Content Library solution in various scenarios and provide vSphere Administrators with some ideas about how to set up a high performance Content Library environment. After providing an architectural overview of the Content Library components and inner workings, the post delves into the analysis and optimization of the most basic Content Library operation, i.e., the deployment of a virtual machine.

Introduction

The VMware vSphere Content Library empowers vSphere administrators to effectively and efficiently manage virtual machine templates, vApps, ISO images, and scripts. Specifically an administrator can leverage Content Library to:

  • Store and manage content from a central location;
  • Share content across boundaries of vCenter Servers;
  • Deploy virtual machine templates from the Content Library directly onto a host or cluster for immediate use.

Typically, a vSphere datacenter includes a multitude of vCenter servers, ESXi servers, networks, and datastores. In such an environment it can be time-consuming to clone or deploy a virtual machine through all the ESXi servers, vCenter servers, and networks from a source datastore to a destination datastore. Moreover, this problem is compounded by the fact that the size of virtual machines and other content keeps getting larger over time. The objective of Content Library is to address these issues by transferring large amounts of data in the most efficient way.

Architectural Overview

Content Library is composed of three main components which run on a vCenter server:

  • A Content Library Service, which organizes and manages content sitting on various storage locations;
  • A Transfer Service, which oversees the transfer content across said storage locations;
  • A Database which stores all the metadata associated with the content (e.g., type of content, date of creation, author/vendor, etc.)

The architecture diagram in Figure 1 shows how the three components interact with each other and with other vCenter components, along with the control path (depicted as thin black lines) and data path (depicted as thick red lines).

fig1

Figure 1. VMware vSphere Content Library architecture

The Content Library Service implements the control plane that manages storage and handles content operations such as deployment, upload, download, and synchronization. The Transfer Service implements the data plane that is responsible for actual data transfers between content stores, which may be datastores attached to ESXi hosts, NFS file systems mounted on the vCenter Server, or remote HTTP(S) servers.

Data Transfer

The data transfer performance varies depending on the storage type and available connectivity. The Transfer Service can transfer data in two ways: streaming mode and direct copy mode. The diagram in Figure 2 shows how the two modes work in a data transfer between datastores.

fig2

Figure 2. Content Library Data Transfer Flows

If the source and destination hosts have direct connectivity, the Transfer Service asks vCenter to instruct the source host to directly copy the content to the target host. When this is not possible (e.g., if the two hosts are connected to two different vCenter servers) streaming mode is used instead. In streaming mode the data flows through the Transfer Service itself. This involves one extra hop for the data, and also compression/decompression for the VMDK disk files. Also, vCenter appliances are usually connected to a management network, which could become a bottleneck due its limited bandwidth. For these reasons, direct copy mode typically has better performance than streaming mode.

Optimizing Virtual Machines Deployment

Having covered Content Library architecture and transfer modes, we can now discuss how to optimize its performance starting from the most basic operation, the deployment of a virtual machine. Deploying a virtual machine from Content Library creates a new virtual machine by cloning it from a template. We assess the performance of deployment operations by measuring their completion time. This metric is obviously the most visible and important one from an administrator’s perspective.

The experiments discussed in this blog post demonstrate how deployment performance is impacted by the Content Library backing storage configuration and provides some guidelines to help administrators choose the most appropriate configuration based on performance and cost tradeoffs.

Experimental Testbed

We used a total of three servers, one for running the vCenter Appliance and another two to create a cluster over which virtual machines were deployed from the Content Library. The following table summarizes the hardware and software specifications of the testbed.

vCenter Server Host
    Dell PowerEdge R910 server
         CPUs Four 6-core Intel® Xeon® E7530 @ 1.87 GHz, Hyper-Threading enabled.
         Memory: 80GB
    Virtualization Platform VMware vSphere 6.0. (RTM build # 2494585)
         VM Configuration 16 vCPU, 32GB of memory
         vCenter Appliance VMware vCenter Sever Appliance 6.0 (RTM build # 2562625)
ESXi Hosts
    Two Dell PowerEdge R610 servers
         CPUs Two 4-core Intel® Xeon® E5530 @ 2.40 GHz, Hyper-Threading enabled.
         Memory: 32GB
    Virtualization Platform VMware vSphere 6.0. (RTM build # 2494585)
     Storage Adapter QLogic ISP2532 DualPort 8Gb Fibre Channel to PCI Express
     Network Adapter QLogic NetXtreme II BCM5709 1000Base-T (Data Rate: 1Gbps)
  Storage Array EMC VNX5700 Storage Array exposing two 20-disk RAID-5 LUNS with a capacity of 12TB each

Figure 3 illustrates the experimental testbed along with the data transfer flows for the various experiments. We ran a workload that consisted of deploying a virtual machine from a Content Library item onto a cluster. All experiments used the same 39GB OVF template. We conducted various experiments, based on the possible configurations of the source content store (the storage backing the Content Library) and the destination content store (the storage where the new virtual machine was deployed), as shown in the following table.

Experiment 1 An ESXi host is connected to a VAAI-capable storage array (VAAI stands for vStorage API for Array Integration and it is a technology which enables ESXi hosts to offload specific virtual machine and storage management operations to compliant storage hardware.)  Both the source and destination content stores are datastores residing on said array.
Experiment 2 An ESXi host is connected to the same datastores as in Experiment 1. However, these datastores are either hosted on a non-VAAI array or on two different arrays.
Experiment 3 An ESXi hosts is connected to the source datastore while a different host is connected to the destination datastore. The datastores are hosted on different arrays.
Experiment 4 The source content store is an NFS file system mounted on the vCenter server, while the destination content store is a datastore is hosted on a storage array.

 

fig3

Figure 3. Storage configurations and data transfer flows

Experimental Results

Figure 4 shows the results of the four experiments described above in terms of deployment duration (lower is better), while the following table summarizes the main observations for each experiment.

Experiment 1 The best performance was achieved in Experiment 1 (two datastores backed by a VAAI array). This was expected, as in this scenario the actual data transfer occurs internally to the storage array, without any involvement from the ESXi host. This is obviously the most efficient scenario from a deployment perspective.
Experiment 2 In Experiment 2, although the array is not VAAI-capable (or the datastores are hosted on two separate arrays), the source and the destination datastores are connected to the same ESXi host. This means the data transfer occurs through the 8 Gb/s Fibre Channel connection. This scenario is about 20% slower than Experiment 1.
Experiment 3 The scenario of Experiment 3 is significantly slower (about three times) than Experiment 1 because the datastores are attached to two different ESXi hosts. This causes the data transfer to go through the 1Gbps Ethernet connection. We also ran this experiment using a 10Gbps Ethernet network, and found that the deployment duration was similar to the one measured in Experiment 2. This suggests that the 1Gbps Ethernet connection is a significant bottleneck for this scenario.
Experiment 4 In the final scenario, Experiment 4, the template resides on an NFS file system mounted on the vCenter server. Because the template is stored in a compressed format on the NFS file system in order to save network bandwidth, its decompression on the vCenter server slows the data transfer quite noticeably. The network hops between the vCenter Server and the destination ESXi host may further slow the end-to-end data transfer. For these reasons, this scenario was about seven times slower than Experiment 1.  We also ran the same experiment using a 10Gbps network between the NFS server and the vCenter server and measured a completion time only slightly better than with the 1Gbps network (1260s vs. 1380s). Given that compression and decompression are CPU-heavy operations, using a faster network may result in only a marginal performance improvement.

 

fig4

Figure 4. Deployment completion time for the four storage configurations

Conclusions

This blog post explored how different Content Library backing storage configurations can affect the performance of a virtual machine deployment operation. The following guidelines may help an administrator in optimizing the Content Library performance for said operation based on the storage options at her/his disposal:

  1. If no other optimizations are possible, the Content Library should be at least backed by a datastore connected to one of the ESXi hosts (scenario of Experiment 3). Ideally a 10Gbps Ethernet connection should be employed.
  2. A better option is to have each ESXi host connected to both the source datastore (the one backing the Content Library) and the destination datastore(s) (the one(s) where the new virtual machine is being deployed). This is the scenario of Experiment 2.
  3. The best case is when all the ESXi hosts are connected to a VAAI-capable storage array and both the source and destination datastores reside on said array (Experiment 1).

 

Performance Best Practices for vSphere 6.0 is Available

We are pleased to announce the availability of Performance Best Practices for VMware vSphere 6.0. This is a book designed to help system administrators obtain the best performance from vSphere 6.0 deployments.

The book addresses many of the new features in vSphere 6.0 from a performance perspective. These include:

  • A new version of vSphere Network I/O Control
  • A new host-wide performance tuning feature
  • A new version of VMware Fault Tolerance (now supporting multi-vCPU virtual machines)
  • The new vSphere Content Library feature

We’ve also updated and expanded on many of the topics in the book. These include:

  • VMware vStorage APIs for Array Integration (VAAI) features
  • Network hardware considerations
  • Changes in ESXi host power management
  • Changes in ESXi transparent memory sharing
  • Using Receive Side Scaling (RSS) in virtual machines
  • Virtual NUMA (vNUMA) configuration
  • Network performance in guest operating systems
  • vSphere Web Client performance
  • VMware vMotion and Storage vMotion performance
  • VMware Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM) performance

The book can be found here http://www.vmware.com/files/pdf/techpaper/VMware-PerfBest-Practices-vSphere6-0.pdf.

 

 

 

Large Receive Offload (LRO) Support for VMXNET3 Adapters with Windows VMs on vSphere 6

Large Receive Offload (LRO) is a technique to reduce the CPU time for processing TCP packets that arrive from the network at a high rate. LRO reassembles incoming packets into larger ones (but fewer packets) to deliver them to the network stack of the system. LRO processes fewer packets, which reduces its CPU time for networking. Throughput can be improved accordingly since more CPU is available to deliver additional traffic. On Windows, LRO is also referred to as Receive Segment Coalescing (RSC).

LRO has been supported for Linux VMs with kernel 2.6.24 and later since vSphere 4. With the introduction of Windows Server 2012 and Windows 8 supporting LRO, vSphere 6 now adds support for LRO on a VMXNET3 adapter on Windows VMs. LRO is especially beneficial in the virtualized environment in which resources are shared by multiple VMs. This blog shows the performance benefits of using LRO for Windows VMs running on vSphere 6.

Test-Bed Setup

The test bed consists of a vSphere 6.0 host running VMs and a client machine that drives workload generation. Both machines have dual-socket, 6-core 2.9GHz Intel Xeon E5-2667 (Sandy Bridge) processors. The client machine is configured with native Red Hat Enterprise Linux 6 that generates TCP flows. The VMs in the vSphere host run Windows 2012 Server and are configured with 4 vCPUs and 2GB RAM. Both machines have an Intel 82599EB 10Gbps adapter installed, which are connected using a 10 Gigabit Ethernet (GbE) switch.

Performance Results

1. Native to VM Traffic

Figures 1 and 2 show a CPU efficiency and throughput comparison with and without LRO when TCP streams are generated from the client machine to a VM running on the vSphere host. Netperf is used to generate traffic. Three different message sizes are used: 256 bytes, 16KB, 64KB. The socket size is set to 8K, 64K, and 256K respectively. The message size is used to determine the number of bytes that Netperf delivers to the TCP stack in the client machine, which then determines the actual packet sizes. The NIC in the client machine splits packets with a size larger than the MTU (1500 is used for this blog) into smaller MTU-sized ones before sending them out. Once packets are received in the vSphere host, LRO aggregates packets and delivers larger packets (but smaller in number) to the receiving VM running in the host. This process is done in either hardware (that is, the physical NIC) or software (that is, the vSphere networking stack) depending on the NIC type and configuration. In this blog, LRO is performed in the vSphere networking stack before packets are delivered to the VM.

CPU efficiency is calculated by dividing the throughput by the number of CPU cores used for both the hypervisor and the VM, representing how much throughput in gigabits per second (Gbps) a single CPU core can receive. For example, a CPU efficiency of 5 means the system can handle 5Gbps with one core. Therefore, a higher CPU efficiency is desirable.

As shown in Figures 1 and 2, using LRO considerably improves both CPU efficiency and throughput with all three message sizes. Figure 1 shows that CPU efficiency improves by 86%, 25%, and 33% with 256 byte, 16KB, and 64KB messages respectively, when compared to the case without LRO. Figure 2 shows throughput improves 54% for 256 byte messages (0.6Gbps to 0.9Gbps) and 5% for 16KB messages (9.0Gbps to 9.4Gbps). There is not much difference for 64KB packets (9.5Gbps to 9.4Gbps, this is within a rage of normal variance). With 16KB and 64KB messages, throughput with LRO is already at line rate (that is, 10Gbps), which is why the improvement is not as significant as CPU efficiency.

Figure 3 compares the packet rate and the average packet size between the messages with LRO and without LRO. They are measured right before packets are delivered to the VM (but after LRO is performed for the LRO configuration). It is clearly shown that fewer but larger packets are delivered to the VM with LRO. For example, with 64KB messages, the packet rate delivered to the VM decreases from 815K packets per second (pps) to 113Kpps with LRO, while the packet size increases from 1.5KB to 10.5KB. The number of interrupts generated for the guest also becomes smaller accordingly, helping to improve overall CPU efficiency and throughput.

When the client machine generates packets, those with a size larger than the MTU are split into smaller MTU-sized packets before being sent out. With a larger message size, more MTU-sized packets are produced and the packet rate received by the vSphere host increases accordingly. This is why the packet rate becomes higher for 16KB and 64KB messages than 256 byte messages without LRO in Figure 3. LRO aggregates the received packets before they hit the VM so the packet rate remains low regardless of the message size in the figure.

Figure 1

Figure 1. CPU efficiency comparison with and without LRO with Native-VM traffic

Figure 2. Throughput comparison with and without LRO with Native-VM traffic

Figure 2. Throughput comparison with and without LRO with Native-VM traffic

Figure 3. Packet rate and size comparison with and without LRO with Native-VM traffic, when packets are delivered to the VM

Figure 3. Packet rate and size comparison with and without LRO with Native-VM traffic, when packets are delivered to the VM

2. VM to VM Local Traffic

LRO is also beneficial in VM-VM local traffic where VMs are located in the same host, communicating with each other through a virtual switch. Figures 4 and 5 depict CPU efficiency and throughput comparisons with and without LRO with two VMs on the same host sending and receiving TCP flows. The same message and socket sizes as the Native-VM tests above are used.

From Figure 4, LRO improves CPU efficiency by 15%, 92%, and 90% with 256 bytes, 16KB, and 64KB byte messages respectively, when compared to the case without LRO. Figure 5 shows throughput also improves by 20% with 256 byte messages (0.8Gbps to 1.0Gbps), by 103% with 16KB messages (9.0Gbps to 18.4Gbps), and by 142% with 64KB (11.7Gbps to 28.4Gbps).

Without LRO, big packets with a size larger than the MTU need to be split before delivered to the receiving VM, similar to Native-VM traffic. This is because the receiver cannot handle those big packets. LRO saves the time spent in both splitting packets and receiving smaller packets since packet splitting also happens on the vSphere host with VM-VM Local traffic. This explains why the improvement with 16KB and 64KB messages is higher than that of Native-VM traffic. The absolute CPU efficiency in VM-VM local traffic can become lower than that in Native-VM traffic since the CPU time of both sending and receiving VMs are included for this calculation.

As expected, the packet rate decreases while the average packet size increases with LRO as shown in Figure 6. For example, with 64KB messages, the packet rate delivered to the VM becomes reduced from 1009Kpps to 240Kpps, while the packet size gets increased from 1.6KB to 14.9KB.

The packet rate becomes higher for 256 byte messages with LRO, most likely because round-trip time (RTT) gets reduced due to the use of LRO. With the average packet size being similar to each other between LRO and without LRO, this effectively helps to improve throughput and correspondingly CPU efficiency, as seen in Figure 4 and 5.

Figure 4. CPU efficiency with and without LRO with VM-VM Local traffic

Figure 4. CPU efficiency with and without LRO with VM-VM Local traffic

Figure 5. Throughput comparison with and without LRO with VM-VM Local traffic

Figure 5. Throughput comparison with and without LRO with VM-VM Local traffic

Figure 6. Packet rate and size comparison with and without LRO with VM-VM Local traffic, when packets are delivered to the VM

Figure 6. Packet rate and size comparison with and without LRO with VM-VM Local traffic, when packets are delivered to the VM

Enable or Disable LRO on a VMXNET3 Adapter on a Windows VM

LRO is enabled by default for VMXNET3 adapters on vSphere 6.0 hosts, but you must set RSC to be enabled globally for Windows 8 VMs. For more information about configuring this, see the documentation.

Conclusion

This blog shows that enabling LRO for Windows Server 2012 and Windows 8 VMs on a vSphere host using VMXNET3 considerably enhances the CPU efficiency and correspondingly improves throughput for TCP traffic.

Running Transactional Workloads Using Docker Containers on vSphere 6.0

In a series of blogs, we showed that not only can Docker containers seamlessly run inside vSphere 6.0 VMs, but both micro-benchmarks and popular workloads in such configurations perform as well as, and in some cases better than, in native Docker configurations.

See the following blog posts for our past findings:

In this blog, we study a transactional database workload and present the results on how Docker containers perform in a VM when we scale out the number of database instances. To do this experiment, we use DVD Store 2.1, which is an OLTP benchmark that supports and stresses many different back-end databases including Microsoft SQL Server, Oracle Database, MySQL, and PostgreSQL. This benchmark is open source and the latest version 2.1 is available here. It is a 3-tier application with a Web server, an application server, and a backend database server. The benchmark simulates a DVD store, where customers log in, browse, and order DVD products. The tool is designed to utilize a number of advanced database features including transactions, stored procedures, triggers, and referential integrity. The main transactions are (1) add new customers, (2) log in customers, (3) browse DVDs, (4) enter purchase orders, and (5) re-order stock. The client driver is written in C# and is usually run from Windows; however, the client can be run on Linux using the Mono framework. The primary performance metric of the benchmark is orders per minute (OPM).

In our experiments, we used a PostgreSQL database with an Apache Web server, and the application logic was implemented in PHP. In all tests we ran 16 instances of DVD Store, where each instance comprises all 3-tiers. We found that, due to better scheduling, running Docker in a VM in a scale-out scenario can provide better throughput than running a Docker container in a native system.

Next, we present the configurations, benchmarks, detailed setup, and the performance results.

Deployment Scenarios

We compare four different scenarios, as illustrated below:

–   Native: Linux OS running directly on hardware (Ubuntu 14.04.1)

–   vSphere VM: vSphere 6.0 with VMFS5, in 8 VMs, each with the same guest OS as native

–   Native-Docker: Docker 1.5 running on a native OS (Ubuntu 14.04.1)

–   VM-Docker: Docker 1.5 running in each of 8 VMs on a vSphere  host

In each configuration, all of the power management features were disabled in the BIOS.

Hardware/Software/Workload Configuration

Figure 1 shows the hardware setup diagram for the server host below. We used Ubuntu 14.04.1 with Docker 1.5 for all our experiments.  While running Docker configuration, we use bridged networking and host volumes for storing the database.

dvdstore-benchmark-system-setup

Figure 1. Hardware/software configuration

Performance Results

We ran 16 instances of DVD Store, where each instance was running an Apache web server, PHP application logic, and a PostgreSQL database. In the Docker cases, we ran one instance of DVD Store per Docker container. In the non-Docker cases, we used the Virtual Hosts functionality of Apache to run many instances of a Web server listening on different ports. We also used the PostgreSQL command line to create different instances of the database server listening on different ports. In the VM-based experiments, we partitioned the host hardware between 8 VMs, where each VM ran 2 DVD Store instances. The 8 VMs exactly committed the CPUs, and under-committed the memory.

The four configurations for our experiments are listed below.

Configurations:

  • Native-16S: 16 instances of DVD Store running natively (16 separate instances of Apache 2 using virtual hosts and 16 separate instances of PostgreSQL database)
  • Native-Docker-16S: 16 Docker containers running on a native machine with each running one instance of DVD Store.
  • VM-8VMs-16S: Eight 4-vCPU VMs each running 2 DVD Store instances
  • VM-Docker-8VMs-16S: Eight 4-vCPU VMs each running 2 Docker containers,  where each Docker container is running one instance of DVD Store

We ran the DVD Store benchmark for the 4 configurations using 16 client drivers, where each driver process was running 4 threads. The results for these 4 configurations are shown in the figure below.

dvdstore-perf-docker-vsphere

Figure 2. DVD Store performance for different configurations

In the chart above, the y-axis shows the aggregate DVD Store performance metric orders per minute (OPM) for all 16 instances. We have normalized the order per minute results with respect to the native configuration where we saw about 126k orders per minute. From the chart, we see that, the VM configurations achieve higher throughput compared to the corresponding native configurations. As in the case of prior blogs, this is due to better NUMA-aware scheduling in vSphere.  We can also see that running Docker containers either natively or in VMs adds little overhead (2-4%).

To find out why the native configurations were not doing better, we pinned half of the Docker containers to one NUMA node and half to the other. The DVD Store aggregate OPM improved as a result and as expected, we were seeing slightly better than the VM configuration.  However, manually pinning processes to cores or sockets is usually not a recommended practice because it is error-prone and can, in general, lead to unexpected or suboptimal results.

Summary

In this blog, we showed that running a PostgreSQL transactional database in a Docker container in a vSphere VM adds very little performance cost compared to running directly in the VM. We also find that running Docker containers in a set of 8 VMs achieves slightly better throughput than running the same Docker containers natively with an out-of-the-box configuration. This is a further proof that VMs and Docker containers are truly “better together.”

 

 

Virtualized Storage Performance: RAID Groups versus Storage pools

RAID, a redundant array of independent disks, has traditionally been the foundation of enterprise storage. Grouping multiple disks into one logical unit can vastly increase the availability and performance of storage by protecting against disk failure, allowing greater I/O parallelism, and pooling capacity. Storage pools similarly increase the capacity and performance of storage, but are easier to configure and manage than RAID groups.

RAID groups have traditionally been regarded as offering better and more predictable performance than storage pools. Although both technologies were developed for magnetic hard disk drives (HDDs), solid-state drives (SSDs), which use flash memory, have become prevalent. Virtualized environments are also common and tend to create highly randomized I/O given the fact that multiple workloads are run simultaneously.

We set out to see how the performance of RAID group and storage pool provisioning methods compare in today’s virtualized environments.

First, let’s take a closer look at each storage provisioning type.

RAID Groups

A RAID group unifies a number of disks into one logical unit and distributes data across multiple drives. RAID groups can be configured with a particular protection level depending on the performance, capacity, and redundancy needs of the environment. LUNs are then allocated from the RAID group. RAID groups typically contain only identical drives, and the maximum number of disks in a RAID group varies by system model but is generally below fifty. Because drives typically have well defined performance characteristics, the overall RAID group performance can be calculated as the performance of all drives in the group minus the RAID overhead. To provide consistent performance, workloads with different I/O profiles (e.g., sequential vs. random I/O) or different performance needs should be physically isolated in different RAID groups so they do not share disks.

Storage Pools

Storage pools, or simply ‘pools’, are very similar to RAID groups in some ways. Implementation varies by vendor, but generally pools are made up of one or more private RAID groups, which are not visible to the user, or they are composed of user-configured RAID groups which are added manually to the pool. LUNs are then allocated from the pool. Storage pools can contain up to hundreds of drives, often all the drives in an array. As business needs grow, storage pools can be easily scaled up by adding drives or RAID groups and expanding LUN capacity. Storage pools can contain multiple types and sizes of drives and can spread workloads over more drives for a greater degree of parallelism.

Storage pools are usually required for array features like automated storage tiering, where faster SSDs can serve as a data cache among a larger group of HDDs, as well as other array-level data services like compression, deduplication, and thin provisioning. Because of their larger maximum size, storage pools, unlike RAID groups, can take advantage of vSphere 6 maximum LUN sizes of 64TB.

We used two benchmarks to compare the performance of RAID groups and storage pools: VMmark, which is a virtualization platform benchmark, and I/O Analyzer with Iometer, which is a storage microbenchmark.  VMmark is a multi-host virtualization benchmark that uses diverse application workloads as well as common platform level workloads to model the demands of the datacenter. VMs running a complete set of the application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview. Iometer places high levels of load on the disk, but does not stress any other system resources. Together, these benchmarks give us both a ‘real-world’ and a more focused perspective on storage performance.

VMmark Testing

Array Configuration

Testing was conducted on an EMC VNX5800 block storage SAN with Fibre Channel. This was one of the many storage solutions which offered both RAID group and storage pool technologies. Disks were 200GB single-level cell (SLC) SSDs. Storage configuration followed array best practices, including balancing LUNs across Storage Processors and ensuring that RAID groups and LUNs did not span the array bus. One way to optimize SSD performance is to leave up to 50% of the SSD capacity unutilized, also known as overprovisioning. To follow this best practice, 50% of the RAID group or storage pool was not allocated to any LUN. Since overprovisioning SSDs can be an expensive proposition, we also tested the same configuration with 100% of the storage pool or RAID group allocated.

RAID Group Configuration

Four RAID 5 groups were used, each composed of 15 SSDs. RAID 5 was selected for its suitability for general purpose workloads. RAID 5 provides tolerance against a single disk failure. For best performance and capacity, RAID 5 groups should be sized to multiples of five or nine drives, so this group maintains a multiple of the preferred five-drive count. One LUN was created in each of the four RAID groups. The LUN was sized to either 50% of the RAID group (Best Practices) or 100% (Fully Allocated). For testing, the capacity of each LUN was fully utilized by VMmark virtual machines and randomized data.

            RAID Group Configuration VMmark Storage Comparison        VMmark Storage Pool Configuration Storage Comparison

Storage Pool Configuration

A single RAID 5 Storage Pool containing all 60 SSDs was used. Four thick LUNs were allocated from the pool, meaning that all of the storage space was reserved on the volume. LUNs were equivalent in size and consumed a total of either 50% (Best Practices) or 100% (Fully Allocated) of the pool capacity.

Storage Layout

Most of the VMmark storage load was created by two types of virtual machines: database (DVD Store) and mail server (Microsoft Exchange). These virtual machines were isolated on two different LUNs. The remaining virtual machines were spread across the remaining two LUNs. That is, in the RAID group case, storage-heavy workloads were physically isolated in different RAID groups, but in the storage pool case, all workloads shared the same pool.

Systems Under Test: Two Dell PowerEdge R720 servers
Configuration Per Server:  
     Virtualization Platform: VMware vSphere 6.0. VMs used virtual hardware version 11 and current VMware Tools.
     CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled.
     Memory: 256GB ECC DDR3 @ 1866MHz
     Host Bus Adapter: QLogic ISP2532 DualPort 8Gb Fibre Channel to PCI Express
     Network Controller: One Intel 82599EB dual-port 10 Gigabit PCIe Adapter, one Intel I350 Dual-Port Gigabit PCIe Adapter

Each configuration was tested at three different load points: 1 tile (the lowest load level), 7 tiles (an approximate mid-point), and 13 tiles, which was the maximum number of tiles that still met Quality of Service (QoS) requirements. All datapoints represent the mean of two tests of each configuration.

VMmark Results

RAID Group vs. Storage Pool Performance comparison using VMmark benchmark

Across all load levels tested, the VMmark performance score, which is a function of application throughput, was similar regardless of storage provisioning type. Neither the storage type used nor the capacity allocated affected throughput.

VMmark 2.5 performance scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. We wanted to see how storage configuration affected application latency as opposed to the VMmark score. All latencies are normalized to the lowest 1-tile results.

Storage configuration did not affect VMmark application latencies.

Application Latency in VMmark Storage Comparison RAID Group vs Storage Pool

Lastly, we measured read and write I/O latencies: esxtop Average Guest MilliSec/Write and Average Guest MilliSec/Read. This is the round trip I/O latency as seen by the Guest operating system.

VMmark Storage Latency Storage Comparison RAID Group vs Storage Pool

No differences emerged in I/O latencies.

I/O Analyzer with Iometer Testing

In the second set of experiments, we wanted to see if we would find similar results while testing storage using a synthetic microbenchmark. I/O Analyzer is a tool which uses Iometer to drive load on a Linux-based virtual machine then collates the performance results. The benefit of using a microbenchmark like Iometer is that it places heavy load on just the storage subsystem, ensuring that no other subsystem is the bottleneck.

Configuration

Testing used a VNX5800 array and RAID 5 level as in the prior configuration, but all storage configurations spanned 9 SSDs, also a preferred drive count. In contrast to the prior test, the storage pool or RAID group spanned an identical number of disks, so that the number of disks per LUN was the same in both configurations. Testing used nine disks per LUN to achieve greater load on each disk.

The LUN was sized to either 50% or 100% of the storage group. The LUN capacity was fully occupied with the I/O Analyzer worker VM and randomized data.  The I/O Analyzer Controller VM, which initiates the benchmark, was located on a separate array and host.

Storage Configuration Iometer with Storage Pool and RAID Group

Testing used one I/O Analyzer worker VM. One Iometer worker thread drove storage load. The size of the VM’s virtual disk determines the size of the active dataset, so a 100GB thick-provisioned virtual disk on VMFS-5 was chosen to maximize I/O to the disk and minimize caching. We tested at a medium load level using a plausible datacenter I/O profile, understanding, however, that any static I/O profile will be a broad generalization of real-life workloads.

Iometer Configuration

  • 1 vCPU, 2GB memory
  • 70% read, 30% write
  • 100% random I/O to model the “I/O blender effect” in a virtualized environment
  • 4KB block size
  • I/O aligned to sector boundaries
  • 64 outstanding I/O
  • 60 minute warm up period, 60 minute measurement period
Systems Under Test: One Dell PowerEdge R720 server
Configuration Per Server:  
     Virtualization Platform: VMware vSphere 6.0. Worker VM used the I/O Analyzer default virtual hardware version 7.
     CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled.
     Memory: 256GB ECC DDR3 @ 1866MHz
     Host Bus Adapter: QLogic ISP2532 DualPort 8Gb Fibre Channel to PCI Express

Iometer results

Iometer Latency Results Storage Comparison RAID Group vs Storage PoolIometer Throughput Results Storage Comparison RAID Group vs Storage Pool

In Iometer testing, the storage pool showed slightly improved performance compared to the RAID group, and the amount of capacity allocated also did not affect performance.

In both our multi-workload and synthetic microbenchmark scenarios, we did not observe any performance penalty of choosing storage pools over RAID groups on an all-SSD array, even when disparate workloads shared the same storage pool. We also did not find any performance benefit at the application or I/O level from leaving unallocated capacity, or overprovisioning, SSD RAID groups or storage pools. Given the ease of management and feature-based benefits of storage pools, including automated storage tiering, compression, deduplication, and thin provisioning, storage pools are an excellent choice in today’s datacenters.

Network Improvements in vSphere 6 Boost Performance for 40G NICs

Introduced in vSphere 5.5, a Linux-based driver was added to support 40GbE Mellanox adapters on ESXi. Now vSphere 6.0 adds a native driver and Dynamic NetQueue for Mellanox, and these features  significantly improve network performance. In addition to the device driver changes, vSphere 6.0 includes improvements to the vmxnet3 virtual NIC (vNIC) that allows a single vNIC to achieve line-rate performance with 40GbE physical NICs. Another performance feature introduced in 6.0 for high bandwidth NICs is NUMA Aware I/O which improves performance by collocating highly network-intensive workloads with the device NUMA node. In this blog, we highlight these features and the corresponding benefits achieved.

Test Configuration

We used two identical Dell PowerEdge R720 servers with Intel E5-2667 @ 2.90GHz and 64GB of memory and Mellanox Technologies MT27500 Family [ConnectX-3]  /  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network NICs for our tests.

In the single VM test, we used 1 RHEL 6 VM with 4 vCPUs on each ESXi host with 4 netperf TCP streams running. We then measured the cumulative throughput for the test.

For the multi-VM test, we configured multiple RHEL VMs with 1 vCPU each and used an identical number of VMs on the receiver side. Each VM used 4 sessions of netperf for driving traffic, and we measured the cumulative throughput across the VMs.

Single vNIC Performance Improvements

In order to achieve line-rate performance for vmxnet3, changes were made to the virtual NIC adapter for vSphere 6.0 so that multiple hardware queues could push data to vNICs simultaneously. This allows vmxnet3 to use multiple hardware queues from the physical NIC more effectively. This not only increases the throughput a single vNIC can achieve, but also helps in overall CPU utilization.

As we can see from figure 1 below, 1 VM with 1 vNIC on vSphere 6.0 can achieve more than 35Gbps of throughput as compared to 20Gbps achieved in vSphere 5.5 (indicated by the blue bar chart). The CPU used to receive 1Gbps of traffic, on the other hand, is reduced by 50% (indicated by the red line chart).

Single VM throughput

Figure 1. 1VM vmxnet3 Receive throughput

By default, a single vNIC can receive packets from a single hardware queue. To achieve higher throughput, the vNIC has to request more queues. This can be done by setting ethernetX.pnicFeatures = “4” in the .vmx file. This option also requires the physical NIC to have RSS mode turn on. For Mellanox adapters, the RSS feature can be turned on by reloading the driver with num_rings_per_rss_queue=4.

CPU Cost Improvements for Mellanox 40GbE NIC

In addition to scalability improvements for the vmxnet3 adapter, vSphere 6.0 features an improved version of the Mellanox 40GbE NIC driver. The updated driver uses vSphere 6.0 APIs and, as a result, performs better than the earlier Linux-based driver. Native APIs remove the extra CPU overheads of data structure conversion that were earlier present in the Linux-based driver. The driver also has new features like Dynamic NetQueue that improves CPU utilization even further. Dynamic netqueue in vSphere 6.0 intelligently chooses the optimal number of active hardware queues in use according to the network workload and per NUMA-node CPU utilization.

40G Performance

Figure 2: Multi VM CPU usage for 40G traffic

As seen in figure 2 above, the new driver can improve CPU efficiency by up to 22%.  For all these test cases, the Mellanox NIC was achieving line-rate throughput for both vSphere 6.0 and vSphere 5.5. Please note that for the multi-VM tests, we were using a 1-vCPU VM and vmxnet3 was using a single queue. The RSS feature on the Mellanox Adapter was also turned off.

NUMA Aware I/O

In order to achieve the best performance out of 40GbE NICs, it is advisable to place the throughput-intensive workload on the same NUMA system to which the adapter is attached. vSphere 6.0 features a new configuration option that tries to do this automatically and is available through a system-wide option. The configuration will pack all kernel networking threads on the same NUMA node to which the device is connected. The scheduler will then try to place the VMs that use these networking threads the most on the same NUMA node. By default, the configuration is turned off because it may cause uneven workload distribution between multiple NUMA nodes, especially in the cases where all NICs are connected to the same NUMA node.

NUMA_IO

Figure 3: NUMA I/O benefit.

As seen in Figure 3 above, NUMA I/O can result in about 20% reduced CPU consumption and about 20% higher throughput with a 1-vCPU VM for 40GbE NICs. There is no throughput improvement for Intel NICs because we achieve line rate irrespective of where the workloads are placed. We do however see an increase in CPU efficiency of about 7%.

To enable this option, set the value of Net. NetNetqNumaIOCpuPinThreshold in the Advanced System Settings tab for the host. The value is configurable and can vary between 0 and 200. For example, if you set the value to 100, this results in using NUMA I/O as long as the networking load is less than 100% (that is, the networking threads do not use more than 1 core). Once the load increases to 100%, vSphere 6.0 will follow default scheduling behavior and will schedule VMs and networking threads across different NUMA nodes.

Conclusion

vSphere 6.0 includes some great new improvements in network performance. In this blog, we show:

  • Vmxnet3 can now achieve near line-rate performance with a 40GbE NIC.
  • Significant performance improvements were made to the Mellanox driver, which is now up to 25% more efficient.
  • vSphere also features a new option to turn on NUMA I/O that could improve application performance by up to 15%.

 

Scaling Web 2.0 Applications using Docker containers on vSphere 6.0

by Qasim Ali

In a previous VROOM post, we showed that running Redis inside Docker containers on vSphere adds little to no overhead and observed sizeable performance improvements when scaling out the application when compared to running containers directly on the native hardware. This post analyzes scaling Web 2.0 applications using Docker containers on vSphere and compares the performance of running Docker containers on native and vSphere. This study shows that Docker containers add negligible overhead when run on vSphere, and also that the performance using virtual machines is very close to native and, in certain cases, slightly better due to better vSphere scheduling and isolation.

Web 2.0 applications are an integral part of Enterprise and small business IT offerings. We use the CloudStone benchmark, which simulates a typical Web 2.0 technology use in the workplace for our study [1] [2]. It includes a Web 2.0 social-events application (Olio) and a client implemented using the Faban workload generator [3]. It is an open source benchmark that simulates activities related to social events. The benchmark consists of three main components: a Web server, a database backend, and a client to emulate real world accesses to the Web server. The overall architecture of CloudStone is depicted in Figure 1.

Figure 1: CloudStone architecture

The benchmark reports latency for various user actions. These metrics were compared against a fixed threshold. Studies indicate that users are less likely to visit a Web site if the response time is greater than 250 milliseconds [4]. This number can be used as an upper bound for latency for frequent operations (Home-Page, TagSearch, EventDetail, and Login). For the less frequent operations (AddEvent, AddPerson, and PersonDetail), a less restrictive threshold of 500 milliseconds can be used. Table 1 shows the exact mix/frequency of various operations.

Operation Number of Operations Mix
HomePage 141908 26.14%
Login 55473 10.22%
TagSearch 181126 33.37%
EventDetail 134144 24.71%
PersonDetail 14393 2.65%
AddPerson 4662 0.86%
AddEvent 11087 2.04%

Table 1: CloudStone operations frequency for 1500 users

Benchmark Components and Experimental Set up

The test system was installed with a CloudStone implementation of a MySQL database, NGINX Web server with PHP scripts, and a Tomcat application server provided by the Faban harness. The default configuration was used for the workload generator. All components of the application ran on a single host, and the client ran in a separate virtual machine on a separate host. Both hosts were connected using a direct link between a pair of 10Gbps NICs. One client-server pair provided a single, independent CloudStone instance. Scaling was achieved by running additional instances of CloudStone.

Deployment Scenarios

We used the following three deployment scenarios for this study:

  • Native-Docker: One or more CloudStone instances were run inside Docker containers (2 containers per CloudStone instance: one for the Web server and another for the database backend) running on the native OS.
  • VM: CloudStone instances were run inside one or more virtual machines running on vSphere 6.0; the guest OS is the same as the native scenario.
  • VM-Docker: CloudStone instances were run inside Docker containers that were running inside one or more virtual machines.

Hardware/Software/Workload Configuration

The following are the details about the hardware and software used in the various experiments discussed in the next section:

Server Host:

  • Dell PowerEdge R820
  • CPU: 4 x Intel® Xeon® CPU E5-4650 @ 2.30GHz (32 cores, 64 hyper-threads)
  • Memory: 512GB
  • Hardware configuration: Hyper-Threading (HT) ON, Turbo-boost ON, Power policy: Static High (that is, no power management)
  • Network: 10Gbps
  • Storage: 7 x 250GB 15K RPM 4Gb SAS  Disks

Client Host:

  • Dell PowerEdge R710
  • CPU: 2 x Intel® Xeon® CPU X5680 @ 3.33GHz (12 cores, 24 hyper-threads)
  • Memory: 144GB
  • Hardware configuration: HT ON, Turbo-boost ON, Power policy: Static High (that is, no power management)
  • Network: 10Gbps
  • Client VM: 2-vCPU 4GB vRAM

Host OS:

  • Ubuntu 14.04.1
  • Kernel 3.13

Docker Configuration:

  • Docker 1.2
  • Ubuntu 14.04.1 base image
  • Host volumes for database and images
  • Configured with host networking to avoid Docker NAT overhead
  • Device mapper as the storage backend driver

ESXi:

  • VMware vSphere 6.0 (pre-release build)

VM Configurations:

  • Single VM: An 8-vCPU 4GB VM ( Web server and database running in a single VM)
  • Two VMs: One 6-vCPU 2GB Web server VM and one 2-vCPU 2GB database VM (CloudStone instance running in two VMs)
  • Scale-out: Eight 8-vCPU 4GB VMs

Workload Configurations:

  • The NGINX Web server was configured with 4 worker processes and 4096 connections per worker.
  • PHP was configured with a maximum number of 16 child processes.
  • The Web server and the database were preconfigured with 1500 users per CloudStone instance.
  • A runtime of 30 minutes with a 5 minute ramp-up and ramp-down periods (less than 1% run-to-run  variation) was used.

Results

First, we ran a single instance of CloudStone in the various configurations mentioned above. This was meant to determine the raw overhead of Docker containers running on vSphere vs. the native configuration, eliminating scheduling differences. Second, we picked the configuration that performed best in a single instance and scaled it out to run multiple instances.

Figure 2 shows the mean latency of the most frequent operations and Figure 3 shows the mean latency of less frequent operations.

Figure 2: Results of single instance CloudStone experiments for frequently used operations

 

Figure 3: Results of single instance CloudStone experiments for less frequently used
operations

We configured the benchmark to use a single VM and deployed the Web server and database applications in it (configuration labelled VM-1VM in Figure 2 and 3). We then ran the same workload in Docker containers in a single VM (VMDocker-1VM). The latencies are slightly higher than native, which is expected due to some virtualization overhead. However, running Docker containers on a VM showed no additional overhead. In fact, it seems to be slightly better. We believe this might be due to the device mapper using twice as much page cache as the VM (device mapper uses a loopback device to mount the file system and, hence, data ends up being cached twice in the buffer cache). We also tried AuFS as a storage backend for our container images, but that seemed to add some CPU and latency overhead, and, for this reason, we switched to device mapper. We then configured the VM to use the vSphere Latency Sensitivity feature [5] (VM-1VM-lat and VMDocker-1VM-lat labels). As expected, this configuration reduced the latencies even further because each vCPU got exclusive access to a core and this reduced scheduling overhead.  However, this feature cannot be used when the VM (or VMs) has more vCPUs than the number of cores available on the system (that is, the physical CPUs are over-committed) because each vCPU needs exclusive access to a core.

Next, we configured the workload to use two VMs, one for the Web server and the other for the database application. This configuration ended up giving slightly higher latencies because the network packets have to traverse the virtualization layer from one VM to the other, while in the prior experiments they were confined within the same VM.

Finally, we scaled out the CloudStone workload with 12,000 users by using eight 8-vCPU VMs with 1500 users per instance. The VM configurations were the same as the VM-1VM and VMDocker-1VM cases above. The average system CPU core utilization was around 70-75%, which is the typical average CPU utilization for latency sensitive workloads because it allows for headroom to absorb traffic bursts. Figure 4 reports mean latencies of all operations (latencies were averaged across all eight instances of CloudStone for each operation), while Figure 5 reports the 90th percentile latencies (the benchmark reports these latencies in 20 millisecond granularity as evident from Figure 5.)

Scale-out-meanlatency

Figure 4: Scale-out experiments using eight instances of CloudStone (mean latency)

Scale-out-90pctLatency

Figure 5: Scale-out experiments using eight instances of CloudStone (90th percentile latency)

The latencies shown in Figure 4 and 5 are well below the 250 millisecond threshold. We observed that the latencies on vSphere are very close to native or, in certain cases, slightly better than native (for example, Login, AddPerson and AddEvent operations). The latencies were better than native due to better vSphere scheduling and isolation, resulting in better cache/memory locality. We verified this by pinning container instances on specific sockets and making the native scheduler behavior similar to vSphere. After doing that, we observed that latencies in the native case got better and they were similar or slightly better than vSphere.

Note: Introducing artificial affinity between processes and cores is not a recommended practice because it is error-prone and can, in general, lead to unexpected or suboptimal results.

Conclusion

VMs and Docker containers are truly “better together.” The CloudStone scale-out system, using out-of-the-box VM and VM-Docker configurations, clearly achieves very close to, or slightly better than, native performance.

References

[1] W. Sobel, S. Subramanyam, A. Sucharitakul, J. Nguyen, H. Wong, A. Klepchukov, S. Patil, O. Fox and D. Patterson, “CloudStone: Multi-Platform, Multi-Languarge Benchmark and Measurement Tools for Web 2.8,” 2008.
[2] N. Grozev, “Automated CloudStone Setup in Ubuntu VMs / Advanced Automated CloudStone Setup in Ubuntu VMs [Part 2],” 2 June 2014. https://nikolaygrozev.wordpress.com/tag/cloudstone/.
[3] java.net, “Faban Harness and Benchmark Framework,” 11 May 2014. http://java.net/projects/faban/.
[4] S. Lohr, “For Impatient Web Users, an Eye Blink Is Just Too Long to Wait,” 29 February 2012. http://www.nytimes.com/2012/03/01/technology/impatient-web-usersflee-slow-loading-sites.html.
[5] J. Heo, “Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5,” 18 September 2013.   http://blogs.vmware.com/performance/2013/09/deploying-extremely-latency-sensitive-applications-in-vmware-vsphere-5-5.html.