
Docker Containers Performance in VMware vSphere

By Qasim Ali, Banit Agrawal, and Davide Bergamasco

 

“Containers without compromise” – This was one of the key messages at VMworld 2014 USA in San Francisco. It was presented in the opening keynote, and then the advantages of running Docker containers inside of virtual machines were discussed in detail in several breakout sessions. These benefits include security and isolation guarantees as well as the existing rich set of management functionality. But some may say, “These benefits don’t come for free: what about the performance overhead of running containers in a VM?”

A recent report compared the performance of a Docker container to a KVM VM and showed very poor performance in some micro-benchmarks and real-world use cases: up to 60% degradation. These results were somewhat surprising to those of us accustomed to near-native performance of virtual machines, so we set out to do similar experiments with VMware vSphere. Below, we present our findings of running Docker containers in a vSphere VM and in a native configuration. Briefly,

  • We found that, for most of these micro-benchmarks and the Redis tests, vSphere delivered near-native performance with generally less than 5% overhead.
  • Running an application in a Docker container in a vSphere VM has overhead very similar to that of running the container on a native OS (directly on a physical server).

Next, we present the configuration and benchmark details as well as the performance results.

Deployment Scenarios

We compare four different scenarios as illustrated below:

  • Native: Linux OS running directly on hardware (Ubuntu, CentOS)
  • vSphere VM: Upcoming release of vSphere with the same guest OS as native
  • Native-Docker: Docker version 1.2 running on a native OS
  • VM-Docker: Docker version 1.2 running in guest VM on a vSphere host

In each configuration, all power management features were disabled in the BIOS and in the Ubuntu OS.

Test Scenarios

Figure 1: Different test scenarios

Benchmarks/Workloads

For this study, we used the micro-benchmarks listed below and also simulated a real-world use case.

Micro-benchmarks:

  • LINPACK: This benchmark solves a dense system of linear equations. For large problem sizes it has a large working set and does mostly floating point operations.
  • STREAM: This benchmark measures memory bandwidth across various configurations.
  • FIO: This benchmark is used for I/O benchmarking for block devices and file systems.
  • Netperf: This benchmark is used to measure network performance.

Real-world workload:

  • Redis: In this experiment, many clients perform continuous requests to the Redis server (key-value datastore).

For all of the tests, we ran multiple iterations and report the average.

Performance Results

LINPACK

LINPACK solves a dense system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system of N equations, converts that time into a performance rate, and tests the results for accuracy. We used an optimized version of the LINPACK benchmark binary based on the Intel Math Kernel Library (MKL).
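As a sketch of the arithmetic involved (our own illustration, not the benchmark code), the reported rate follows from the problem size and the elapsed time; the conventional LINPACK operation count for factoring and solving an N×N dense system is 2/3·N³ + 2·N²:

```python
def linpack_gflops(n, seconds):
    """Convert a LINPACK solve time into a performance rate using the
    conventional operation count for an n x n dense system."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

def matrix_footprint_gb(n):
    """Memory needed for the n x n double-precision coefficient matrix."""
    return n * n * 8 / 1e9

# The 45K problem size needs ~16GB for the matrix, consistent with the
# memory consumption reported below
print(round(matrix_footprint_gb(45000), 1))   # 16.2
```

A hypothetical 100-second solve at N=45,000 would thus rate at roughly 608 GFLOPS; the timings here are placeholders, not measured values.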

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 32 vCPU VM with 45K and 65K problem sizes


Figure 2: LINPACK performance for different test scenarios

We disabled HT for this run as recommended by the benchmark guidelines to get the best peak performance. For the 45K problem size, the benchmark consumed about 16GB memory. All memory was backed by transparent large pages. For VM results, large pages were used both in the guest (transparent large pages) and at the hypervisor level (default for vSphere hypervisor). There was 1-2% run-to-run variation for the 45K problem size. For 65K size, 33.8GB memory was consumed and there was less than 1% variation.

As shown in Figure 2, virtualization overhead is almost negligible for the 45K problem size. For the bigger problem size, there is some inherent hardware virtualization overhead due to the nested page table walk, which results in the 5% drop in performance observed in the VM case. There is no additional overhead from running the application in a Docker container in a VM compared to running the application directly in the VM.

STREAM

We used a NUMA-aware STREAM benchmark, which is the classical STREAM benchmark extended to take advantage of NUMA systems. This benchmark measures the memory bandwidth across four different operations: Copy, Scale, Add, and Triad.

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, HT enabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 64 vCPU VM (Hyper-Threading ON)


Figure 3: STREAM performance for different test scenarios

We used an array size of 2 billion, which used about 45GB of memory. We ran the benchmark with 64 threads both in the native and virtual cases. As shown in Figure 3, the VM added about 2-3% overhead across all four operations. The small 1-2% overhead of using a Docker container on a native platform is probably in the noise margin.
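The memory footprint and bandwidth arithmetic behind these numbers can be sketched as follows (a back-of-the-envelope calculation, not the benchmark itself); Triad computes a[i] = b[i] + q*c[i], touching three 8-byte elements per iteration:

```python
def stream_footprint_gib(n):
    """Total size of the three double-precision STREAM arrays (a, b, c)."""
    return 3 * n * 8 / 2**30

def triad_bandwidth_gbs(n, seconds):
    """Triad moves 24 bytes per element: read b, read c, write a."""
    return 3 * n * 8 / seconds / 1e9

# An array size of 2 billion gives the ~45GB working set cited above
print(round(stream_footprint_gib(2_000_000_000), 1))   # 44.7
```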

FIO

We used the Flexible I/O (FIO) tool version 2.1.3 to compare the storage performance for the native and virtual configurations, with Docker containers running in both. We created a 10GB file on a 400GB local SSD drive and used direct I/O for all our tests so that there were no effects of buffer caching inside the OS. We used a 4k I/O size and tested three different I/O profiles: random 100% read, random 100% write, and a mixed case with random 70% read and 30% write. For the 100% random read and write tests, we selected 8 threads and an I/O depth of 16, whereas for the mixed test, we selected an I/O depth of 32 and 8 threads. We used taskset to set the CPU affinity of the FIO threads in all configurations. All the details of the experimental setup are given below:

Hardware: 2 socket Intel Xeon E5-2660 2.2GHz with 392GB RAM, 16 total cores, Hyper-Threading enabled
Guest: 32-vCPU Ubuntu 14.04.1 64-bit server with 256GB RAM, with a separate ext4 disk in the guest (on VMFS5 in the vSphere run)
Benchmark:  FIO, Direct I/O, 10GB file
I/O Profile:  4k I/O, Random Read/Write: depth 16, jobs 8, Mixed: depth 32, jobs 8
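For readers who want to reproduce a similar profile, an fio job file roughly equivalent to the mixed test might look like this (a sketch only: the file path, job name, and I/O engine are assumptions, not the exact configuration used in these tests):

```ini
; Hypothetical fio job approximating the 70/30 random read/write test
[randrw-70-30]
filename=/mnt/ssd/fio-test-file   ; placeholder path on the local SSD
size=10g
direct=1            ; direct I/O, bypassing the OS buffer cache
rw=randrw
rwmixread=70        ; 70% reads, 30% writes
bs=4k
iodepth=32
numjobs=8
ioengine=libaio
group_reporting=1
```

Run it with `fio jobfile.fio`; `taskset` can be prepended to pin the FIO threads as described above.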


Figure 4: FIO benchmark performance for different test scenarios

The figure above shows the normalized maximum IOPS achieved for different configurations and different I/O profiles. For random read in a VM, we see that there is about 2% reduction in maximum achievable IOPS when compared to the native case. However, for the random write and mixed tests, we observed almost the same performance (within the noise margin) compared to the native configuration.

Netperf

Netperf is used to measure throughput and latency of networking operations. All the details of the experimental setup are given below:

Hardware (Server): 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Hardware (Client): 2 socket Intel Xeon X5570 2.93GHz with 64GB RAM, 8 cores total, Hyper-Threading disabled
Networking hardware: Broadcom Corporation NetXtreme II BCM57810
Software on server and Client: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 2 vCPU VM with 4GB RAM

The server machine for Native is configured to have only 2 CPUs online for fair comparison with a 2-vCPU VM. The client machine is also configured to have 2 CPUs online to reduce variability. We tested four configurations: directly on the physical hardware (Native), in a Docker container (Native-Docker), in a virtual machine (VM), and in a Docker container inside a VM (VM-Docker). For the two Docker deployment scenarios, we also studied the effect of using host networking as opposed to the Docker bridge mode (the default operating mode), resulting in two additional configurations (Native-Docker-HostNet and VM-Docker-HostNet) for a total of six configurations.

We used TCP_STREAM and TCP_RR tests to measure the throughput and round-trip network latency between the server machine and the client machine using a direct 10Gbps Ethernet link between two NICs. We used standard network tuning like TCP window scaling and setting socket buffer sizes for the throughput tests.
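Netperf's TCP_RR test reports a transaction rate; with a single outstanding request/response pair, the mean round-trip latency is simply the inverse of that rate. A small helper illustrating the conversion (our own sketch, not part of netperf; the rate below is illustrative):

```python
def rr_latency_us(transactions_per_sec):
    """Mean round-trip latency implied by a TCP_RR transaction rate,
    assuming one outstanding request/response pair at a time."""
    return 1e6 / transactions_per_sec

print(rr_latency_us(25000))   # 40.0 microseconds per round trip
```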


Figure 5: Netperf receive performance for different test scenarios


Figure 6: Netperf transmit performance for different test scenarios

Figures 5 and 6 show the unidirectional throughput over a single TCP connection with the standard 1500-byte MTU for both the transmit and receive TCP_STREAM cases. (We used multiple streams in the VM-Docker* transmit cases to reduce run-to-run variability due to Docker bridge overhead and get predictable results.) Throughput numbers for all configurations are identical and equal to the maximum possible 9.40Gbps on a 10GbE NIC.


Figure 7: Netperf TCP_RR performance for different test scenarios (Lower is better)

For the latency tests, we used the latency sensitivity feature introduced in vSphere 5.5 and applied the best practices for tuning latency in a VM as mentioned in this white paper. As shown in Figure 7, latency in a VM with the VMXNET3 device is only 15 microseconds more than in the native case because of the hypervisor networking stack. If users wish to reduce the latency even further for extremely latency-sensitive workloads, pass-through mode or SR-IOV can be configured to allow the guest VM to bypass the hypervisor network stack. This configuration can achieve round-trip latency similar to native, as shown in Figure 8. The Native-Docker and VM-Docker configurations add about 9-10 microseconds of overhead due to the Docker bridge NAT function. When configured to use host networking, a Docker container (running natively or in a VM) achieves latencies similar to those observed when the workload is not run in a container at all (native or VM).
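These per-layer overheads are roughly additive. The sketch below models that with a hypothetical native baseline (the 35µs baseline is an illustrative assumption, not a measured number; only the per-layer increments come from the results above):

```python
NATIVE_RTT_US = 35.0        # hypothetical native round-trip latency
VM_OVERHEAD_US = 15.0       # hypervisor network stack (VMXNET3)
BRIDGE_OVERHEAD_US = 9.5    # midpoint of the 9-10 us Docker bridge NAT cost

def expected_rtt_us(in_vm=False, docker_bridge=False):
    """Additive model of the latency layers discussed above; Docker with
    host networking adds no bridge overhead, so docker_bridge stays False."""
    rtt = NATIVE_RTT_US
    if in_vm:
        rtt += VM_OVERHEAD_US
    if docker_bridge:
        rtt += BRIDGE_OVERHEAD_US
    return rtt

print(expected_rtt_us(in_vm=True, docker_bridge=True))   # 59.5
```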


Figure 8: Netperf TCP_RR performance for different test scenarios (VMs in pass-through mode)

Redis

We also wanted to take a look at how Docker in a virtualized environment performs with real world applications. We chose Redis because: (1) it is a very popular application in the Docker space (based on the number of pulls of the Redis image from the official Docker registry); and (2) it is very demanding on several subsystems at once (CPU, memory, network), which makes it very effective as a whole system benchmark.

Our test-bed comprised two hosts connected by a 10GbE network. One of the hosts ran the Redis server in different configurations as mentioned in the netperf section. The other host ran the standard Redis benchmark program, redis-benchmark, in a VM.

The details about the hardware and software used in the experiments are the following:

Hardware: HP ProLiant DL380e Gen8 2 socket Intel Xeon E5-2470 2.3GHz with 96GB RAM, 16 total cores, Hyper-Threading enabled
Guest OS: CentOS 7
VM: 16 vCPU, 93GB RAM
Application: Redis 2.8.13
Benchmark: redis-benchmark, 1000 clients, pipeline: 1 request, operations: SET 1 Byte
Software configuration: Redis thread pinned to CPU 0 and network interrupts pinned to CPU 1

Since Redis is a single-threaded application, we decided to pin it to one of the CPUs and pin the network interrupts to an adjacent CPU in order to maximize cache locality and avoid cross-NUMA node memory access.  The workload we used consists of 1000 clients with a pipeline of 1 outstanding request, setting a 1 byte value with a randomly generated key in a space of 100 billion keys.  This workload is highly stressful to the system resources because: (1) every operation results in a memory allocation; (2) the payload size is as small as it gets, resulting in a very large number of small network packets; (3) as a consequence of (2), the frequency of operations is extremely high, resulting in complete saturation of the CPU running Redis and a high load on the CPU handling the network interrupts.

We ran five experiments for each of the above-mentioned configurations, and we measured the average throughput (operations per second) achieved during each run.  The results of these experiments are summarized in the following chart.


Figure 9: Redis performance for different test scenarios

The results are reported as the ratio of each configuration's mean throughput over the 5 runs to the native mean (error bars show the range of variability across those runs).
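That ratio-plus-error-bar calculation can be expressed as follows (our own sketch, with made-up sample numbers rather than the measured data):

```python
from statistics import mean

def ratio_to_native(native_runs, config_runs):
    """Mean throughput ratio versus the native mean, plus the (min, max)
    ratio range used for the error bars."""
    base = mean(native_runs)
    ratios = [r / base for r in config_runs]
    return mean(ratios), min(ratios), max(ratios)

# Illustrative operations-per-second numbers over five runs
native = [100_000, 101_000, 99_000, 100_500, 99_500]
vm     = [96_000, 97_000, 95_000, 96_500, 95_500]
print(round(ratio_to_native(native, vm)[0], 3))   # 0.96
```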

Redis running in a VM has slightly lower performance than on a native OS because of the network virtualization overhead introduced by the hypervisor. When Redis is run in a Docker container on native, the throughput is significantly lower than native because of the overhead introduced by the Docker bridge NAT function. In the VM-Docker case, the performance drop compared to the Native-Docker case is almost exactly the same small amount as in the VM-Native comparison, again because of the network virtualization overhead.  However, when Docker runs using host networking instead of its own internal bridge, near-native performance is observed for both the Docker on native hardware and Docker in VM cases, reaching 98% and 96% of the maximum throughput respectively.

Based on the above results, we can conclude that virtualization introduces only a 2% to 4% performance penalty.  This makes it possible to run applications like Redis in a Docker container inside a VM and retain all the virtualization advantages (security and performance isolation, management infrastructure, and more) while paying only a small price in terms of performance.

Summary

In this blog, we showed that in addition to the well-known security, isolation, and manageability advantages of virtualization, running an application in a Docker container in a vSphere VM adds very little performance overhead compared to running the application in a Docker container on a native OS. Furthermore, we found that a container in a VM delivers near native performance for Redis and most of the micro-benchmark tests we ran.

In this post, we focused on the performance of running a single instance of an application in a container, VM, or native OS. We are currently exploring scale-out applications and the performance implications of deploying them on various combinations of containers, VMs, and native operating systems.  The results will be covered in the next installment of this series. Stay tuned!

 

Monster Performance with SQL Server VMs on vSphere 5.5

VMware vSphere provides an ideal platform for customers to virtualize their business-critical applications, including databases, ERP systems, email servers, and even newly emerging technologies such as Hadoop.  I’ve been focusing on the first one (databases), specifically Microsoft SQL Server, one of the most widely deployed database platforms in the world.  Many organizations have dozens or even hundreds of instances deployed in their environments. Consolidating these deployments onto modern multi-socket, multi-core, multi-threaded server hardware is an increasingly attractive proposition for IT administrators.

Achieving optimal SQL Server performance has been a continual focus for VMware; with current vSphere 5.x releases, VMware supports much larger “monster” virtual machines that can scale up to 64 virtual CPUs and 1 TB of RAM, including exposing virtual NUMA architecture to the guest. In fact, the main goal of this blog and accompanying whitepaper is to refresh a 2009 study that demonstrated SQL performance on vSphere 4, given the marked technology advancements on both the software and hardware fronts.

These tests show that large SQL Server 2012 databases run extremely efficiently with VMware, achieving great performance in a variety of virtual machine configurations with only minor tunings to SQL Server and the vSphere ESXi host. These tunings and other best practices for fully optimizing large virtual machines for SQL Server databases are presented in the paper.

One test in the paper shows the maximum host throughput achieved with different numbers of virtual CPUs per VM. This was measured starting with 8 vCPUs per VM, then doubled to 16, then 32, and finally 64 (the maximum supported with vSphere 5.5).  DVD Store, which is a popular database tool and a key workload of the VMmark benchmark, was used to stress the VMs.  Here is a graph from the paper showing the 8 vCPU x 8 VMs case, which achieved an aggregate of 493,804 opm (operations per minute) on the host:

8 x 8 vCPU VM throughput
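For scale, the aggregate figure works out as follows (simple arithmetic on the numbers quoted above):

```python
aggregate_opm = 493_804   # total DVD Store operations per minute on the host
vm_count = 8              # eight 8-vCPU VMs

# Average contribution of each VM to the host's aggregate throughput
print(aggregate_opm / vm_count)   # 61725.5
```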

There are also tests using CPU affinity to show the performance differences between physical cores and logical processors (Hyper-Threads), the impact of various virtual NUMA (vNUMA) topologies, and experiments with the Latency Sensitivity advanced setting.

For more details and the test results, please download the whitepaper: Performance and Scalability of Microsoft SQL Server on VMware vSphere 5.5.

Custom Power Management Settings for Power Savings in vSphere 5.5

VMware vSphere serves as a common virtualization platform for a diverse ecosystem of applications. Every application has different performance demands which must be met, but the power and cooling costs of running these applications are also a concern. vSphere’s default power management policy, “Balanced”, meets both of these goals by effectively preserving system performance while still saving some power.

For those who would like to prioritize energy efficiency even further, vSphere provides additional ways to tweak its power management under the covers. Custom power management settings in ESXi let you create your own power management policy, and your server’s BIOS also typically lets you customize hardware settings which can maximize power savings at a potential cost to performance.

When choosing a low power setting, we need to know whether it is effective at increasing energy efficiency, that is, the amount of work achieved for the power consumed. We also need to know how large of an impact the setting has on application throughput and latencies. A power saving setting that is too aggressive can result in low system performance. The best combination of power saving techniques will be highly individualized to your workload; here, we present one case study.

We used the VMmark virtualization benchmark to measure the effect of ESXi custom power settings and BIOS custom settings on energy efficiency and performance. VMmark 2.5 is a multi-host virtualization benchmark that uses diverse application workloads as well as common platform level workloads to model the demands of the datacenter. VMs running a complete set of the application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview.

In this study, the best custom power setting produced an increase in energy efficiency of 17% with no significant drop in performance at moderate levels of load.

Test Methodology

All tests were conducted on a two-node cluster running VMware vSphere 5.5 U1. Each custom power management setting was tested independently to gauge its effects on energy efficiency and performance while all other settings were left at their defaults. The settings tested fall into two categories: ESXi custom power settings and BIOS custom settings. We discuss how to modify these settings at the end of the article.

Systems Under Test: Two Dell PowerEdge R720 servers
Configuration Per Server  
            CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled
            Memory: 256GB ECC DDR3 @ 1866MHz
            Host Bus Adapter: QLogic ISP2532 Dual Port 8Gb Fibre Channel to PCI Express
            Network Controller: Integrated Intel I350 Quad-Port Gigabit Adapter, one Intel I350 Dual-Port Gigabit PCIe Adapter
            Hypervisor: VMware ESXi 5.5 U1
Shared Resources  
            Virtualization Management: VMware vCenter Server 5.5
            Storage Array: EMC VNX5800
30 Enterprise Flash Drives (SSDs) and 32 HDDs, grouped as two 10-SSD RAID0 LUNs and four 8-HDD RAID0 LUNs. FAST Cache was configured from 10 SSDs.
            Power Meters: One Yokogawa WT210 per server

Each configuration was tested at five different load points: 1 tile (the lowest load level), 4, 7, 10, and 12 tiles, which was the maximum number of tiles that met Quality of Service (QoS) requirements. All datapoints are the mean of three tests in each configuration.

ESXi Custom Power Settings

ESXi custom power settings influence the power state of the processor. We tested the two custom power management settings which had the greatest impact on our workload: Power.MaxFreqPct and Power.CstateResidencyCoef. The advanced ESXi setting Power.MaxFreqPct (default value 100) reduces the processor frequency by placing a cap on the highest operating frequency it can reach. In practice, the processor can operate only at certain set frequencies (P-states), so if the frequency cap requested by ESXi (e.g. 2160MHz) does not match a set frequency state, the processor will run at the nearest lower frequency state (e.g. 2100MHz). Setting Power.MaxFreqPct = 99 put the cap at 99% of the processor’s nominal frequency, which limited Turbo Boost. Power.MaxFreqPct = 80 further limited the maximum frequency of the processor to 80% of its nominal frequency of 2.7GHz, for a maximum of 2.1GHz. Setting Power.CstateResidencyCoef = 0 (default value 5) puts the processor into its deepest available C-state, or lowest power state, when it is idle. As a prerequisite, deep C-states must be enabled in the BIOS. For a more in-depth discussion of power management techniques and other custom options, please see the vSphere documentation and the whitepaper Host Power Management in VMware vSphere 5.5.
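The frequency-capping behavior described above can be sketched as follows (the P-state list is illustrative; the available P-states vary by processor model):

```python
def capped_frequency_mhz(nominal_mhz, max_freq_pct, pstates_mhz):
    """ESXi requests a cap of max_freq_pct percent of nominal frequency;
    the CPU then runs at the nearest P-state at or below that cap."""
    cap = nominal_mhz * max_freq_pct / 100.0
    eligible = [p for p in pstates_mhz if p <= cap]
    return max(eligible) if eligible else min(pstates_mhz)

# Hypothetical P-states for a 2.7GHz part, in MHz
pstates = [1200, 1500, 1800, 2100, 2400, 2700]
print(capped_frequency_mhz(2700, 80, pstates))   # 2100: the 2160MHz cap rounds down
```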

VMmark models energy efficiency as performance score per kilowatt of power consumed. VMmark scores in the graph below have been normalized to the default “Balanced” 1-tile result, which does not use any custom power settings.
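That efficiency metric is straightforward to compute (our own sketch; the score and wattage below are illustrative values, not measured results):

```python
def vmmark_efficiency(score, avg_watts):
    """VMmark energy efficiency: performance score per kilowatt consumed."""
    return score / (avg_watts / 1000.0)

print(vmmark_efficiency(12.0, 600.0))   # 20.0 score units per kW
```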

VMware ESXi Custom Power Management Settings improve efficiency

A major trend can be seen here: an increase in load is correlated with greater energy efficiency. As the CPUs become busier, throughput increases at a faster rate than the required power. This can be understood by noting that an idle server still consumes power, but with no work to show for it. A highly utilized server is typically the most energy efficient per request completed, and the results bear this out.

To more closely examine the relative impact of each custom setting compared to the default setting, we normalized all results within each load level to the default “Balanced” result for that number of tiles. The figure below shows the percent change at each load level.

VMware ESXi Custom Power Management Settings Change in Efficiency and Performance Results

All custom settings showed improvements in efficiency compared to the default “Balanced” setting, and the improvements varied with load. Setting MaxFreqPct to 99 had the greatest benefit to energy efficiency, between 5% and 15% at varying load levels. The greatest improvement was seen at 4 tiles, where efficiency increased by 17% while performance decreased by only 3%. The performance cost increased with load, up to 9% at 12 tiles. However, limiting processor frequency even further, to a maximum of 80% of nominal, did not produce an additive effect: not only did efficiency actually decrease relative to MaxFreqPct=99, but performance was profoundly curtailed, from 96% of baseline at light load down to 84% of baseline for a heavily loaded machine. CstateResidencyCoef=0 produced some modest increases in efficiency for a lightly loaded server, but the effect disappeared at higher load levels.

VMmark 2.5 performance scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. We wanted to see how custom power management settings affected application latency as opposed to the VMmark score. All latencies are normalized to the lowest 1-tile results.

VMware ESXi Custom Power Management Settings Effect on Application Latencies

Naturally, latencies increase as load increases from 1 to 12 tiles. Fortunately, the custom power management policies caused only minimal increases in application latencies, if any, except for the MaxFreqPct=80 setting which did create elevated latencies across the board.

BIOS Custom Power Settings

The Dell PowerEdge R720 BIOS provides another toolbox of power-saving knobs to tweak. Using the BIOS settings, we manually disabled Turbo Boost and reduced memory frequency from its default maximum speed of 1866MT/s (megatransfers per second) to either 1333MT/s or 800MT/s.

BIOS custom power management settings: effect on energy efficiency

The Turbo Boost Disabled configuration produced the largest increase in efficiency, while 800MT/s memory frequency actually decreased efficiency at the higher load levels.

Again, we normalized all results within each load level to its default “Balanced” result. The figure below shows the percent change at each load level.

BIOS custom power management settings: change in efficiency and performance

Disabling Turbo Boost was the most effective setting to increase energy efficiency, with a performance cost of 2% at low load levels to 8% at high load levels. Reducing memory frequency to 1333MT/s had a reliable but small boost to efficiency and no effect on performance, leading us to conclude that a memory speed of 1866MT/s is simply faster than needed for this workload.

BIOS custom power management settings: effect on application latencies

Disabling Turbo Boost and reducing memory frequency to 800MT/s increased DVD Store 2 latencies by 10% at 10 tiles and by 30% at 12 tiles, but all latencies were still well within Quality of Service requirements.  Reducing memory frequency to 1333MT/s had no effect on application latencies.

Reducing the use of Turbo Boost, using either the ESXi custom setting MaxFreqPct or BIOS custom settings, proved to be the most effective way to increase energy efficiency in our VMmark tests. The impact on performance was small, but it increased with load. MaxFreqPct is the preferred setting because, like all ESXi custom power management settings, it takes effect immediately and can easily be reversed without reboots or downtime. Other custom power management settings produced modest gains in efficiency but, taken to the extreme, not only harmed performance but also failed to increase efficiency. In addition, energy efficiency is strongly related to load; the most efficient server is also one that is heavily utilized. Taking steps to increase server utilization, such as server consolidation, is an important part of a power saving strategy. Custom power management settings can produce gains in energy efficiency at a cost to performance, so consider the tradeoff when choosing custom power management settings for your own environment.


How to Configure Custom Power Management Settings

Disclaimer: The results presented above are a case study of the impact of custom power management settings and a starting point only. Results may not apply to your environment and do not represent best practices.

Exercise caution when choosing a custom power management setting. Change settings one at a time to evaluate their impact on your environment. Monitor your server’s power consumption either through its UPS or through the server’s internal power monitoring sensor; for the latter, consult your vendor to find the sensor’s rated accuracy. If it is highly accurate, you can view the server’s power consumption in esxtop (press ‘p’ to view Power Usage).

To customize power management settings, enter your server’s BIOS. Power Management settings vary by vendor but most include “OS Controlled” and “Custom” policies.

In the Dell PowerEdge R720, choosing the “Performance Per Watt (OS)” System Profile allows ESXi to control power management, while leaving hardware settings at their default values.

Screenshot of the R720 BIOS: selecting OS-controlled power management

Choosing the “Custom” System Profile and setting CPU Power Management to “OS DBPM” allows ESXi to control power management while enabling custom hardware settings.


Using ESXi Custom Power Settings

To enable the vSphere custom power management policy,

  1. Browse to the host in the vSphere Web Client navigator.
  2. Click the Manage tab and click Settings.
  3. Under Hardware, select Power Management and click the Edit button.
  4. Select the Custom power management policy and click OK.

The power management policy changes immediately and does not require a server reboot.


To modify ESXi custom power management settings,

  1. Browse to the host in the vSphere Web Client navigator.
  2. Click the Manage tab and click Settings.
  3. Under System, select Advanced System Settings.
  4. Power management parameters that affect the Custom policy have descriptions that begin with In Custom policy. All other power parameters affect all power management policies.
  5. Select the parameter and click the Edit button.

Note: The default values of power management parameters match the Balanced policy.


 

Microsoft Exchange Server Shows Great Performance on VMware Virtual SAN

Email servers are a business-critical component of IT systems, and Exchange Server is one of the most ubiquitous of them. As such, we wanted to see how we could leverage Virtual SAN to serve the storage needs of this application with new technology, and we ran tests to see how Exchange Server would perform on Virtual SAN. We ran five Virtual SAN servers, and each server hosted two virtual machines with the Exchange Server Mailbox and HUB roles. The first host had an additional virtual machine for the AD Server role. A client virtual machine on a separate host ran the load generator.

Benchmarks are an important part of performance testing, so we used Exchange Load Generator to simulate users sending and receiving email through Exchange Server. We then measured the average and 95th-percentile Sendmail latency of these requests for three separate loads of 12,000 users, 16,000 users, and 20,000 users. This shows how Virtual SAN can accommodate the storage needs of additional users and flexibly scale out.
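Given raw latency samples, the average and 95th-percentile statistics can be computed like this (a generic nearest-rank percentile sketch with made-up sample values, not the LoadGen tooling):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest value with at least pct%
    of the samples at or below it."""
    s = sorted(samples)
    rank = math.ceil(pct / 100.0 * len(s))
    return s[rank - 1]

# Illustrative Sendmail latencies in milliseconds
latencies_ms = [120, 90, 250, 180, 300, 140, 95, 210, 160, 130]
print(percentile(latencies_ms, 95))   # 300
```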

The results are shown in the following figure. The industry-standard measure of good latency is anything below 500ms. As shown here, the Sendmail latency is well below 500ms for both the average and 95th-percentile.


 

For more information, read the paper here.

Web 2.0 Applications on VMware Virtual SAN

Here in VMware Performance Engineering, Virtual SAN is a hot topic. This storage solution leverages economical hardware compared to more expensive storage arrays and supports all the vSphere features like vMotion, HA, and DRS. We have been testing Virtual SAN with a number of workloads to characterize their performance. In particular, we found that Web 2.0 applications, modeled with the Cloudstone benchmark, perform exceptionally well, with low application latency, on vSphere 5.5 with Virtual SAN. We give a quick glimpse of the test configuration and results here; the full details can be found in the recently published technical white paper about Web 2.0 applications on VMware Virtual SAN.

We ran the Cloudstone benchmark using Olio server and client virtual machine pairs. Server virtual machines were on a 3-host server cluster, whereas client virtual machines were on a 3-node client cluster. An Olio server virtual machine ran Ubuntu 10.04 with a MySQL database, an NGINX Web server with PHP scripts, and a Tomcat application server. An Olio client virtual machine simulated typical Web 2.0 workloads by exercising 7 different types of user operations that involved web file exchanges and database inquiries and transactions. Virtual SAN was configured on the server cluster. Web files, database files, and OS files were all on the Virtual SAN, each on dedicated virtual disks so the files were stored separately.

fig1-blog

In the paper, we report test results showing that Virtual SAN achieves good application latency. Each server-client virtual machine pair was pre-configured for 500 Olio users. We ran tests with 1,500 and 7,500 Olio users by using 3 and 15 pairs of virtual machines, respectively, and collected the average round-trip time of Olio operations. These operations were divided into frequent ones (EventDetail, HomePage, Login, and TagSearch) and less frequent ones (AddEvent, AddPerson, and PersonDetail) according to how often they were exercised in the tests.

The following graph shows the average round-trip times for the various operations. The thresholds for these operations were defined in the passing criteria: 250 milliseconds for the frequent operations and 500 milliseconds for the less frequent ones. In the 15 VMs/7,500 users case, the server cluster was at 70% CPU utilization, but the round-trip times were still well below the passing thresholds, as shown. The white paper also presents the 95th-percentile round-trip time results.

fig2-blog
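The pass/fail check implied by these criteria is easy to script. The following is a minimal sketch; the operation latencies are hypothetical placeholders, not the paper's measurements:

```python
# Sketch: validate Olio round-trip times against the passing criteria
# (250 ms for frequent operations, 500 ms for less frequent ones).
# The latency values below are hypothetical, not the measured results.

FREQUENT_LIMIT_MS = 250
INFREQUENT_LIMIT_MS = 500

frequent = {"EventDetail": 60, "HomePage": 80, "Login": 45, "TagSearch": 70}
infrequent = {"AddEvent": 180, "AddPerson": 150, "PersonDetail": 120}

def passes(results, limit_ms):
    """True if every operation's round-trip time is within the limit."""
    return all(rtt <= limit_ms for rtt in results.values())

ok = passes(frequent, FREQUENT_LIMIT_MS) and passes(infrequent, INFREQUENT_LIMIT_MS)
print("PASS" if ok else "FAIL")
```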

For the full story of the 15 VMs/7,500 Olio users test, how we further stressed storage with the workload, and the results, see the white paper.

Detailed stats for vSphere Flash Read cache

The Performance of vFlash Read Cache in VMware vSphere 5.5 white paper [http://www.vmware.com/files/pdf/techpaper/vfrc-perf-vsphere55.pdf] provides performance details for several different workloads, along with guidance on how to tune and configure vFRC for best performance. This blog article explains how to obtain the runtime vSphere Flash Read Cache statistics used in the white paper, such as the cache hit rate, the latency of cached I/Os, and the average number of cache blocks evicted.

You can run esxcli to get some of these stats using the following commands:

~ # esxcli storage vflash cache list

This lists the identifiers for the caches currently in use, one per vFlash-enabled VMDK. To retrieve the vFRC statistics for a particular vFlash-enabled VMDK, use the following command:

~ # esxcli storage vflash cache get -c <cache-identifier>

However, a few more advanced statistics, such as the amount of data cached at any point in time, can be obtained by directly accessing the VSI nodes. The process is as follows:

The cache identifier can be obtained either from the esxcli command shown above or with vsish, and then used to fetch the statistics:

~ # cacheID=`vsish -e ls /vmkModules/vflash/module/vfc/cache/`
~ # vsish -e get /vmkModules/vflash/module/vfc/cache/${cacheID}stats

This displays an output similar to the following:

vFlash per cache instance statistics {
   cacheBlockSize:8192
   numBlocks:1270976
   numBlocksCurrentlyCached:222255
   numFailedPrimaryIOs:0
   numFailedCacheIOs:0
   avgNumBlocksOnCache:172494
   read:vFlash per I/O type Statistics {
      numIOs:168016
      avgNumIOPs:61
      maxNumIOPs:1969
      avgNumKBs:42143
      maxNumKBs:227891
      avgLatencyUS:16201
      maxLatencyUS:41070
      numPrimaryIOs:11442
      numCacheIOs:156574
      avgCacheLatencyUS:17130
      avgPrimaryLatencyUS:239961
      cacheHitPercentage:94
   }
   write:vFlash per I/O type Statistics {
      numIOs:102264
      avgNumIOPs:307
      maxNumIOPs:3982
      avgNumKBs:10424
      maxNumKBs:12106
      avgLatencyUS:3248
      maxLatencyUS:31798
      numPrimaryIOs:102264
      numCacheIOs:0
      avgCacheLatencyUS:0
      avgPrimaryLatencyUS:3248
      cacheHitPercentage:0
   }
   rwTotal:vFlash per I/O type Statistics {
      numIOs:270280
      avgNumIOPs:88
      maxNumIOPs:2027
      avgNumKBs:52568
      maxNumKBs:233584
      avgLatencyUS:11300
      maxLatencyUS:40029
      numPrimaryIOs:113706
      numCacheIOs:156574
      avgCacheLatencyUS:17130
      avgPrimaryLatencyUS:27068
      cacheHitPercentage:58
   }
   flush:vFlash per operation type statistics {
      lastOpTimeUS:0
      numBlocksLastOp:0
      nextOpTimeUS:0
      numBlocksNextOp:0
      avgNumBlocksPerOp:0
   }
   evict:vFlash per operation type statistics {
      lastOpTimeUS:0
      numBlocksLastOp:0
      nextOpTimeUS:0
      numBlocksNextOp:0
      avgNumBlocksPerOp:0
   }
}

This output contains all of the metrics discussed in the vSphere Flash Read Cache white paper. This information can be used to size the cache and cache blocks in the most effective way.
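If you save the vsish output to a file, you can also post-process it offline. The following is a minimal sketch, assuming the key:value layout shown above, that parses the counters and recomputes the read cache-hit rate from numCacheIOs and numIOs:

```python
import re

# Sketch: parse a saved vsish stats dump (key:value lines) and recompute
# the read cache-hit rate from the raw counters. The sample text below is
# an excerpt of the read section shown in this article.
sample = """\
read:vFlash per I/O type Statistics {
   numIOs:168016
   numCacheIOs:156574
   cacheHitPercentage:94
}
"""

# Collect every "name:integer" line; section headers and braces are skipped.
stats = {
    m.group(1): int(m.group(2))
    for m in re.finditer(r"^\s*(\w+):(-?\d+)\s*$", sample, re.MULTILINE)
}

hit_rate = 100.0 * stats["numCacheIOs"] / stats["numIOs"]
print("read cache hit rate: %.1f%%" % hit_rate)  # roughly matches cacheHitPercentage
```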

Disclaimer: This interface (vsish) is not officially supported by VMware, so please use at your own risk.

First Certified SAP BW-EML Benchmark on Virtual HANA

The first certified SAP Business Warehouse-Enhanced Mixed Workload (BW-EML) standard application benchmark based on a virtual HANA database was recently published by HP.  We worked with HP to configure and run this benchmark using a virtual HANA database running on vSphere 5.5 in a monster VM of 64 vCPUs and almost 1TB of RAM.  The test was run with a total of 2 billion records and achieved a throughput of 111,850 ad-hoc navigation steps per hour.

The same hardware configuration was used by HP to publish a native-only benchmark with the same number of records. In that test, the result was 126,980 ad-hoc navigation steps per hour, which means the virtual HANA result came within 12% of native.

BW-EML_VirtualHANA_Graph_VROOM

Although the hardware setup was the same, this comparison between native and virtual performance has one wrinkle that gave the native system a slight advantage, estimated to be about 5%.

The estimated 5% advantage for the native system stems from the difference between cores and hardware threads and from the maximum number of vCPUs per VM. In the native test, the BW-EML workload was able to exercise all 120 hardware threads of the physical 60-core server; the number of threads is twice the number of physical cores because these processors use Intel Hyper-Threading technology.

In vSphere 5.5 (the current version), the maximum number of vCPUs in a single VM is 64. Each vCPU is mapped to a hardware thread when scheduled to run, which limits a single VM to 64 hardware threads. For this test, that means only slightly more than half of the server's 120 hardware threads could be used by the HANA virtual machine. As a result, the virtual machine could not directly benefit from Hyper-Threading, although it was able to use all 60 cores.

The benefit of Hyper-Threading can be as much as 20% to 30% for some applications, but for the BW-EML benchmark it is estimated at about 5%. This estimate was obtained by running the native BW-EML benchmark system with and without Hyper-Threading enabled. Because the virtual machine could not use the Hyper-Threads, the native system is estimated to have had a 5% advantage from its ability to use all 120 threads of the physical server.

In theory, the advantage for the native system could be reduced by either creating a bigger virtual machine or running the native system without Hyper-Threading.  If this were done, then the difference between native and virtual should be about 5% smaller and would mean that the difference between native and virtual could shrink to single digits (approximately 7%).
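The back-of-the-envelope arithmetic behind these estimates can be made explicit. Here is a small sketch using the published throughput numbers and the estimated 5% Hyper-Threading benefit:

```python
# Sketch: the rough arithmetic behind the native-vs-virtual comparison.
virtual_steps = 111850   # ad-hoc navigation steps/hour, virtual HANA
native_steps = 126980    # same benchmark, native HANA
ht_benefit = 0.05        # estimated Hyper-Threading benefit for BW-EML

# Raw gap: how far the virtual result falls below the native result.
gap = 1.0 - virtual_steps / native_steps
print("virtual is %.1f%% below native" % (100 * gap))

# Discounting the estimated HT advantage from the native score shrinks
# the gap to single digits, as discussed above.
native_no_ht = native_steps / (1.0 + ht_benefit)
adjusted_gap = 1.0 - virtual_steps / native_no_ht
print("adjusted gap without HT advantage: %.1f%%" % (100 * adjusted_gap))
```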

Additional details about the certified SAP BW-EML benchmark configurations used in the tests: SAP HANA 1.0 on HP DL580 Gen8, 4 processors with 60 cores / 120 threads using Intel Xeon E7-4880 v2 running at 2.5 GHz and 1TB of main memory (each processor has 15 cores / 30 threads).  The application servers were SAP NetWeaver 7.30 on HP BL680 G7, 4 processors with 40 cores / 80 threads using Intel Xeon E7-4870 running at 2.4 GHz and 1TB of main memory (each processor has 10 cores / 20 threads). The OS used for all servers was SUSE Linux Enterprise Server 11 SP2.  The certification number for the native test is 2014009 and the certification number for the virtual test is 2014021.

Reducing Power Consumption in the vSphere 5.5 Datacenter

Today’s virtualized datacenters consist of several servers connected to shared storage; this configuration has been necessary to enable the flexibility that virtualization provides while still allowing for high performance. However, the power consumption of this setup is a major concern because shared storage can consume as much as 2-3x the power of a single mid-range server. In this blog, we look at the performance impact of replacing shared storage with local disks and PCIe flash storage in a vSphere 5.5 datacenter to save power.

We leverage two innovative vSphere features in this performance test:

  • Unified live migration, first introduced with vSphere 5.1, removes the shared storage requirement for vMotion and allows combining traditional vMotion and Storage vMotion into one operation. This combined live migration copies both the virtual machine’s memory and storage over the network to the destination vSphere host. This feature offers administrators significantly more simplicity and flexibility in managing and moving virtual machines across their virtual infrastructures compared to the traditional vMotion and Storage vMotion migration solutions. More information about vMotion can be found in the VMware vSphere 5.1 vMotion Architecture, Performance, and Best Practices white paper.
  • vSphere 5.5 improves server power management by enabling processor C-states, in addition to the previously-used P-states, to improve power savings in the Balanced policy setting. More information about these improvements can be found in the Host Power Management in vSphere 5.5 white paper.

We measure the performance and power savings of these features when replacing shared storage with local disks and PCIe flash storage using a modified version of VMware VMmark 2.5. VMmark is a multi-host virtualization benchmark that uses varied application workloads, as well as common datacenter operations to model the demands of the datacenter. Each VMmark tile contains a set of VMs running diverse application workloads as a unit of load. For more details, see the VMmark 2.5 overview. The benchmark was modified to replace the traditional vMotion workload component with the new shared-nothing, unified live migration.

Testing Methodology

VMmark 2.5 was modified to convert the vMotion workload into a migration without shared storage. All other workloads were unchanged. This allowed a comparison of local, direct attached storage to a traditional Fibre Channel SAN. We measured the power consumption of each configuration using a pair of Yokogawa WT210 power meters, one attached to the servers and the other attached to the external storage.

Configuration

  • Systems Under Test: 2x Dell PowerEdge R710 servers
  • CPUs (per server): 2x Intel Xeon X5670 @ 2.93 GHz
  • Memory (per server): 96 GiB
  • Hypervisor: VMware vSphere 5.5
  • Local Storage (per server): 1x 785GB Fusion-io ioDrive2, 2x 300GB 10K RPM SAS drives in RAID 0
  • SAN: 8Gb Fibre Channel, 30x 200GB SATA Flash drives, 30x 600GB 15K RPM SAS drives
  • Benchmarking software: VMware VMmark 2.5

All I/O-intensive virtual disks were stored on the Fusion-io devices for local storage tests or the SATA flash drives for the SAN tests.  This included the DVD Store database files, the mail server database, and the Olio database.  All remaining virtual machine data was stored on the local SAS drives for the local storage tests and the SAN SAS drives for the SAN tests.

Results
 
VMmark performance using shared-nothing, unified live migration backed by fast local storage showed only minor differences compared to the results with shared storage.  The largest variance was seen in the infrastructure operations, which was expected as the vMotion workload was modified to include a storage migration.  The chart below shows the scores normalized to the 3-tile SAN test results.

scores

When we add the power data to these results and compare the Performance Per Kilowatt (PPKW), we see a much different picture.  The local storage-based PPKW score is much higher than that of shared storage due to greater power efficiency.

ppkw

The reason for this difference is the power consumption of each configuration.  The SAN consumes over 1,000 watts, which is typical of this class of storage solution.  Replacing that power-hungry component with local storage greatly reduces vSphere datacenter power consumption while maintaining good performance.

power
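PPKW itself is just the benchmark score divided by the power draw in kilowatts. A minimal sketch (the scores and wattages below are hypothetical round numbers, not our measured data):

```python
# Sketch: Performance Per Kilowatt (PPKW) = VMmark score / power draw in kW.
# The scores and wattages below are hypothetical, not the measured results.

configs = {
    "SAN":   {"score": 1.00, "watts": 1600.0},   # servers plus a ~1000 W SAN
    "local": {"score": 0.98, "watts": 600.0},    # servers plus local storage
}

for name, c in configs.items():
    ppkw = c["score"] / (c["watts"] / 1000.0)
    print("%-5s PPKW = %.2f" % (name, ppkw))
```

Even with a slightly lower raw score, the local-storage configuration wins on PPKW because the denominator drops by roughly the SAN's entire power draw.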

This SAN should be able to support approximately 25 VMmark tiles (based on the storage capacity of the SSDs), roughly five times the load being supported by the two servers we had available for testing in our lab. However, it should be noted that these servers are two generations old. Current-generation two-socket servers with comparable power usage can support 2-3x the number of tiles based on published VMmark results. This implies that the SAN could support at most four current-generation servers. While an additional two servers would further amortize the power cost of the SAN, significant power savings would still be achieved with an all-local storage architecture.

This is not without a cost.  Removing shared storage reduces the functionality of the datacenter because a number of vSphere features, such as DRS and traditional vMotion, no longer function. The slower infrastructure operations without shared storage also limit this approach to virtual machines with smaller disks, which can be moved between hosts fairly quickly. Virtual machines with large disks would take much longer to move and are better suited to a shared storage environment.

We have shown that it is possible to significantly reduce datacenter power consumption without significantly reducing performance by replacing shared storage with local storage solutions.  Unified live migration enables the use of local storage without a significant infrastructure performance penalty while maintaining application performance comparable to traditional environments using shared storage for the server workloads represented in VMmark.  The resulting elimination of shared storage creates significant power savings and lower operations costs.

Virtual SAP HANA Achieves Production Level Performance

VMware CEO Pat Gelsinger announced production support for SAP HANA on VMware vSphere 5.5 at EMC World this week during his keynote. This is the end result of a very thorough joint testing project over the past year between VMware and SAP.

HANA is an in-memory platform (including database capabilities) from SAP that has enabled huge gains in performance for customers and has been a high priority for SAP over the past few years.  In order for HANA to be supported in a virtual machine on vSphere 5.5 for production workloads, we worked closely with SAP to enable, design, and measure in-depth performance tests.

In order to enable the testing and ongoing production support of SAP HANA on vSphere, two HANA appliance servers were ordered, shipped, and installed in SAP’s labs in Walldorf, Germany.  These systems are dedicated to running SAP HANA on vSphere onsite at SAP.  Each system is an Intel Xeon E7-8870 (Westmere-EX) based four-socket server with 1TB of RAM.  They are used for performance testing and also for ongoing support of HANA on vSphere.  Additionally, VMware has onsite support engineering to assist with the testing and support.

SAP designed an extensive performance test suite that used a large number of test scenarios to stress all functions and capabilities of HANA running on vSphere 5.5.  They included OLAP and OLTP with a wide range of data sizes and query functions. In all, over one thousand individual test cases were used in this comprehensive test suite.  These same tests were run on identical native HANA systems and the difference between native and virtual tests was used as the key performance indicator.

In addition, we also tested vSphere features including vMotion, DRS, and VMware HA with virtual machines running HANA.  These tests were done with the HANA virtual machine under heavy stress.

The test results have been extremely positive and are one of the key factors in the announcement of production support.  The difference between virtual and native HANA across all the performance tests was on average within a few percentage points.

The vMotion, DRS, and VMware HA tests were all completed without issues.  Even with the large memory sizes of HANA virtual machines, we were still able to successfully migrate them with vMotion while under load with no issues.

One of the results of the extensive testing is a best practices guide for HANA on vSphere 5.5. This document includes a performance guide for running HANA on vSphere 5.5 based on this extensive testing.  The document also includes information about how to size a virtual HANA instance and how VMware HA can be used in conjunction with HANA’s own replication technology for high availability.

Power Management and Performance in VMware vSphere 5.1 and 5.5

Power consumption is an important part of the datacenter cost strategy. Physical servers frequently offer a power management scheme that puts processors into low power states when not fully utilized, and VMware vSphere also offers power management techniques. A recent technical white paper describes the testing and results of two performance studies: The first shows how power management in VMware vSphere 5.5 in balanced mode (the default) performs 18% better than the physical host’s balanced mode power management setting. The second study compares vSphere 5.1 performance and power savings in two server models that have different generations of processors. Results show the newer servers have 120% greater performance and 24% improved energy efficiency over the previous generation.

For more information, please read the paper: Power Management and Performance in VMware vSphere 5.1 and 5.5.