Home > Blogs > VMware VROOM! Blog > Category Archives: Web/Tech

Category Archives: Web/Tech

Docker Containers Performance in VMware vSphere

By  Qasim Ali,  Banit Agrawal, and Davide Bergamasco

 

“Containers without compromise” – This was one of the key messages at VMworld 2014 USA in San Francisco. It was presented in the opening keynote, and then the advantages of running Docker containers inside of virtual machines were discussed in detail in several breakout sessions. These include security/isolation guarantees and also the existing rich set of management functionalities. But some may say, “These benefits don’t come for free: what about the performance overhead of running containers in a VM?”

A recent report compared the performance of a Docker container to a KVM VM and showed very poor performance in some micro-benchmarks and real-world use cases: up to 60% degradation. These results were somewhat surprising to those of us accustomed to near-native performance of virtual machines, so we set out to do similar experiments with VMware vSphere. Below, we present our findings of running Docker containers in a vSphere VM and  in a native configuration. Briefly,

  • We find that for most of these micro-benchmarks and Redis tests, vSphere delivered near-native performance with generally less than 5% overhead.
  • Running an application in a Docker container in a vSphere VM has very similar overhead of running containers on a native OS (directly on a physical server).

Next, we present the configuration and benchmark details as well as the performance results.

Deployment Scenarios

We compare four different scenarios as illustrated below:

  • Native: Linux OS running directly on hardware (Ubuntu, CentOS)
  • vSphere VM: Upcoming release of vSphere with the same guest OS as native
  • Native-Docker: Docker version 1.2 running on a native OS
  • VM-Docker: Docker version 1.2 running in guest VM on a vSphere host

In each configuration all the power management features are disabled in the BIOS and Ubuntu OS.

Test Scenarios

Figure 1: Different test scenarios

Benchmarks/Workloads

For this study, we used the micro-benchmarks listed below and also simulated a real-world use case.

-   Micro-benchmarks:

  • LINPACK: This benchmark solves a dense system of linear equations. For large problem sizes it has a large working set and does mostly floating point operations.
  • STREAM: This benchmark measures memory bandwidth across various configurations.
  • FIO: This benchmark is used for I/O benchmarking for block devices and file systems.
  • Netperf: This benchmark is used to measure network performance.

Real-world workload:

  • Redis: In this experiment, many clients perform continuous requests to the Redis server (key-value datastore).

For all of the tests, we run multiple iterations and report the average of multiple runs.

Performance Results

LINPACK

LINPACK solves a dense system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system of N equations, converts that time into a performance rate, and tests the results for accuracy. We used an optimized version of the LINPACK benchmark binary based on the Intel Math Kernel Library (MKL).

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 32 vCPU VM with 45K and 65K problem sizes

linpack

Figure 2: LINPACK performance for different test scenarios

We disabled HT for this run as recommended by the benchmark guidelines to get the best peak performance. For the 45K problem size, the benchmark consumed about 16GB memory. All memory was backed by transparent large pages. For VM results, large pages were used both in the guest (transparent large pages) and at the hypervisor level (default for vSphere hypervisor). There was 1-2% run-to-run variation for the 45K problem size. For 65K size, 33.8GB memory was consumed and there was less than 1% variation.

As shown in Figure 2, there is almost negligible virtualization overhead in the 45K problem size. For a bigger problem size, there is some inherent hardware virtualization overhead due to nested page table walk. This results in the 5% drop in performance observed in the VM case. There is no additional overhead of running the application in a Docker container in a VM compared to running the application directly in the VM.

STREAM

We used a NUMA-aware  STREAM benchmark, which is the classical STREAM benchmark extended to take advantage of NUMA systems. This benchmark measures the memory bandwidth across four different operations: Copy, Scale, Add, and Triad.

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, HT enabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 64 vCPU VM (Hyper-Threading ON)

stream

Figure 3: STREAM performance for different test scenarios

We used an array size of 2 billion, which used about 45GB of memory. We ran the benchmark with 64 threads both in the native and virtual cases. As shown in Figure 3, the VM added about 2-3% overhead across all four operations. The small 1-2% overhead of using a Docker container on a native platform is probably in the noise margin.

FIO

We used Flexible I/O (FIO) tool version 2.1.3 to compare the storage performance for the native and virtual configurations, with Docker containers running in both. We created a 10GB file in a 400GB local SSD drive and used direct I/O for all our tests so that there were no effects of buffer caching inside the OS. We used a 4k I/O size and tested three different I/O profiles: random 100% read, random 100% write, and a mixed case with random 70% read and 30% write. For the 100% random read and write tests, we selected 8 threads and an I/O depth of 16, whereas for the mixed test, we select an I/O depth of 32 and 8 threads. We use the taskset to set the CPU affinity on FIO threads in all configurations. All the details of the experimental setup are given below:

Hardware: 2 socket Intel Xeon E5-2660 2.2GHz with 392GB RAM, 16 total cores, Hyper-Threading enabled
Guest: 32-vCPU  14.04.1 Ubuntu 64-bit server with 256GB RAM, with a separate ext4 disk in the guest (on VMFS5 in vSphere run)
Benchmark:  FIO, Direct I/O, 10GB file
I/O Profile:  4k I/O, Random Read/Write: depth 16, jobs 8, Mixed: depth 32, jobs 8

fio

Figure 4: FIO benchmark performance for different test scenarios

The figure above shows the normalized maximum IOPS achieved for different configurations and different I/O profiles. For random read in a VM, we see that there is about 2% reduction in maximum achievable IOPS when compared to the native case. However, for the random write and mixed tests, we observed almost the same performance (within the noise margin) compared to the native configuration.

Netperf

Netperf is used to measure throughput and latency of networking operations. All the details of the experimental setup are given below:

Hardware (Server): 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Hardware (Client): 2 socket Intel Xeon X5570 2.93GHz with 64GB RAM, 8 cores total, Hyper-Threading disabled
Networking hardware: Broadcom Corporation NetXtreme II BCM57810
Software on server and Client: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 2 vCPU VM with 4GB RAM

The server machine for Native is configured to have only 2 CPUs online for fair comparison with a 2-vCPU VM. The client machine is also configured to have 2 CPUs online to reduce variability. We tested four configurations: directly on the physical hardware (Native), in a Docker container (Native-Docker), in a virtual machine (VM), and in a Docker container inside a VM (VM-Docker). For the two Docker deployment scenarios, we also studied the effect of using host networking as opposed to the Docker bridge mode (default operating mode), resulting in two additional configurations (Native-Docker-HostNet and VM-Docker-HostNet) making total six configurations.

We used TCP_STREAM and TCP_RR tests to measure the throughput and round-trip network latency between the server machine and the client machine using a direct 10Gbps Ethernet link between two NICs. We used standard network tuning like TCP window scaling and setting socket buffer sizes for the throughput tests.

netperf-recieve

Figure 5: Netperf Recieve performance for different test scenarios

netperf-transmit

Figure 6: Netperf transmit performance for different test scenarios

Figure 5 and Figure 6 shows the unidirectional throughput over a single TCP connection with standard 1500 byte MTU for both transmit and receive TCP_STREAM cases (We used multiple Streams in VM-Docker* transmit case to reduce the variability in runs due to Docker bridge overhead and get predictable results). Throughput numbers for all configurations are identical and equal to the maximum possible 9.40Gbps on a 10GbE NIC.

netperf-latency

Figure 7: Netperf TCP_RR performance for different test scenarios (Lower is better)

For the latency tests, we used the latency sensitivity feature introduced in vSphere5.5 and applied the best practices for tuning latency in a VM as mentioned in this white paper. As shown in Figure 7, latency in a VM with VMXNET3 device is only 15 microseconds more than in the native case because of the hypervisor networking stack. If users wish to reduce the latency even further for extremely latency- sensitive workloads, pass-through mode or SR-IOV can be configured to allow the guest VM to bypass the hypervisor network stack. This configuration can achieve similar round-trip latency to native, as shown in Figure 8. The Native-Docker and VM-Docker configuration adds about 9-10 microseconds of overhead due to the Docker bridge NAT function. A Docker container (running natively or in a VM) when configured to use host networking achieves similar latencies compared to the latencies observed when not running the workload in a container (native or a VM).

netperf-latency-passthrough

Figure 8: Netperf TCP_RR performance for different test scenarios (VMs in pass-through mode)

Redis

We also wanted to take a look at how Docker in a virtualized environment performs with real world applications. We chose Redis because: (1) it is a very popular application in the Docker space (based on the number of pulls of the Redis image from the official Docker registry); and (2) it is very demanding on several subsystems at once (CPU, memory, network), which makes it very effective as a whole system benchmark.

Our test-bed comprised two hosts connected by a 10GbE network. One of the hosts ran the Redis server in different configurations as mentioned in the netperf section. The other host ran the standard Redis benchmark program, redis-benchmark, in a VM.

The details about the hardware and software used in the experiments are the following:

Hardware: HP ProLiant DL380e Gen8 2 socket Intel Xeon E5-2470 2.3GHz with 96GB RAM, 16 total cores, Hyper-Threading enabled
Guest OS: CentOS 7
VM: 16 vCPU, 93GB RAM
Application: Redis 2.8.13
Benchmark: redis-benchmark, 1000 clients, pipeline: 1 request, operations: SET 1 Byte
Software configuration: Redis thread pinned to CPU 0 and network interrupts pinned to CPU 1

Since Redis is a single-threaded application, we decided to pin it to one of the CPUs and pin the network interrupts to an adjacent CPU in order to maximize cache locality and avoid cross-NUMA node memory access.  The workload we used consists of 1000 clients with a pipeline of 1 outstanding request setting a 1 byte value with a randomly generated key in a space of 100 billion keys.  This workload is highly stressful to the system resources because: (1) every operation results in a memory allocation; (2) the payload size is as small as it gets, resulting in very large number of small network packets; (3) as a consequence of (2), the frequency of operations is extremely high, resulting in complete saturation of the CPU running Redis and a high load on the CPU handling the network interrupts.

We ran five experiments for each of the above-mentioned configurations, and we measured the average throughput (operations per second) achieved during each run.  The results of these experiments are summarized in the following chart.

redis

Figure 9: Redis performance for different test scenarios

The results are reported as a ratio with respect to native of the mean throughput over the 5 runs (error bars show the range of variability over those runs).

Redis running in a VM has slightly lower performance than on a native OS because of the network virtualization overhead introduced by the hypervisor. When Redis is run in a Docker container on native, the throughput is significantly lower than native because of the overhead introduced by the Docker bridge NAT function. In the VM-Docker case, the performance drop compared to the Native-Docker case is almost exactly the same small amount as in the VM-Native comparison, again because of the network virtualization overhead.  However, when Docker runs using host networking instead of its own internal bridge, near-native performance is observed for both the Docker on native hardware and Docker in VM cases, reaching 98% and 96% of the maximum throughput respectively.

Based on the above results, we can conclude that virtualization introduces only a 2% to 4% performance penalty.  This makes it possible to run applications like Redis in a Docker container inside a VM and retain all the virtualization advantages (security and performance isolation, management infrastructure, and more) while paying only a small price in terms of performance.

Summary

In this blog, we showed that in addition to the well-known security, isolation, and manageability advantages of virtualization, running an application in a Docker container in a vSphere VM adds very little performance overhead compared to running the application in a Docker container on a native OS. Furthermore, we found that a container in a VM delivers near native performance for Redis and most of the micro-benchmark tests we ran.

In this post, we focused on the performance of running a single instance of an application in a container, VM, or native OS. We are currently exploring scale-out applications and the performance implications of deploying them on various combinations of containers, VMs, and native operating systems.  The results will be covered in the next installment of this series. Stay tuned!

 

Monster Performance with SQL Server VMs on vSphere 5.5

VMware vSphere provides an ideal platform for customers to virtualize their business-critical applications, including databases, ERP systems, email servers, and even newly emerging technologies such as Hadoop.  I’ve been focusing on the first one (databases), specifically Microsoft SQL Server, one of the most widely deployed database platforms in the world.  Many organizations have dozens or even hundreds of instances deployed in their environments. Consolidating these deployments onto modern multi-socket, multi-core, multi-threaded server hardware is an increasingly attractive proposition for IT administrators.

Achieving optimal SQL Server performance has been a continual focus for VMware; with current vSphere 5.x releases, VMware supports much larger “monster” virtual machines that can scale up to 64 virtual CPUs and 1 TB of RAM, including exposing virtual NUMA architecture to the guest. In fact, the main goal of this blog and accompanying whitepaper is to refresh a 2009 study that demonstrated SQL performance on vSphere 4, given the marked technology advancements on both the software and hardware fronts.

These tests show that large SQL Server 2012 databases run extremely efficiently with VMware, achieving great performance in a variety of virtual machine configurations with only minor tunings to SQL Server and the vSphere ESXi host. These tunings and other best practices for fully optimizing large virtual machines for SQL Server databases are presented in the paper.

One test in the paper shows the maximum host throughput achieved with different numbers of virtual CPUs per VM. This was measured starting with 8 vCPUs per VM, then doubled to 16, then 32, and finally 64 (the maximum supported with vSphere 5.5).  DVD Store, which is a popular database tool and a key workload of the VMmark benchmark, was used to stress the VMs.  Here is a graph from the paper showing the 8 vCPU x 8 VMs case, which achieved an aggregate of 493,804 opm (operations per minute) on the host:

8 x 8 vCPU VM throughput

There are also tests using CPU affinity to show the performance differences between physical cores and logical processors (Hyper-Threads), the impact of various virtual NUMA (vNUMA) topologies, and experiments with the Latency Sensitivity advanced setting.

For more details and the test results, please download the whitepaper: Performance and Scalability of Microsoft SQL Server on VMware vSphere 5.5.

Custom Power Management Settings for Power Savings in vSphere 5.5

VMware vSphere serves as a common virtualization platform for a diverse ecosystem of applications. Every application has different performance demands which must be met, but the power and cooling costs of running these applications are also a concern. vSphere’s default power management policy, “Balanced”, meets both of these goals by effectively preserving system performance while still saving some power.

For those who would like to prioritize energy efficiency even further, vSphere provides additional ways to tweak its power management under the covers. Custom power management settings in ESXi let you create your own power management policy, and your server’s BIOS also typically lets you customize hardware settings which can maximize power savings at a potential cost to performance.

When choosing a low power setting, we need to know whether it is effective at increasing energy efficiency, that is, the amount of work achieved for the power consumed. We also need to know how large of an impact the setting has on application throughput and latencies. A power saving setting that is too aggressive can result in low system performance. The best combination of power saving techniques will be highly individualized to your workload; here, we present one case study.

We used the VMmark virtualization benchmark to measure the effect of ESXi custom power settings and BIOS custom settings on energy efficiency and performance. VMmark 2.5 is a multi-host virtualization benchmark that uses diverse application workloads as well as common platform level workloads to model the demands of the datacenter. VMs running a complete set of the application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview.

In this study, the best custom power setting produced an increase in energy efficiency of 17% with no significant drop in performance at moderate levels of load.

Test Methodology

All tests were conducted on a two-node cluster running VMware vSphere 5.5 U1. Each custom power management setting was tested independently to gauge its effects on energy efficiency and performance while all other settings were left at their defaults. The settings tested fall into two categories: ESXi custom power settings and BIOS custom settings. We discuss how to modify these settings at the end of the article.

Systems Under Test: Two Dell PowerEdge R720 servers
Configuration Per Server  
            CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled
            Memory: 256GB ECC DDR3 @ 1866Mhz
            Host Bus Adapter: QLogic ISP2532 Dual Port 8Gb Fibre Channel to PCI Express
           Network Controller: Integrated Intel I350 Quad-Port Gigabit Adapter, one Intel I350 Dual-Port Gigabit PCIe Adapter
            Hypervisor: VMware ESXi 5.5 U1
Shared Resources  
            Virtualization Management: VMware vCenter Server 5.5
            Storage Array: EMC VNX5800
30 Enterprise Flash Drives (SSDs) and 32 HDDs, grouped as two 10-SSD RAID0 LUNs and four 8-HDD RAID0 LUNs. FAST Cache was configured from 10 SSDs.
            Power Meters: One Yokogawa WT210 per server

Each configuration was tested at five different load points: 1 tile (the lowest load level), 4, 7, 10, and 12 tiles, which was the maximum number of tiles that met Quality of Service (QoS) requirements. All datapoints are the mean of three tests in each configuration.

ESXi Custom Power Settings

ESXi custom power settings influence the power state of the processor. We tested two custom power management settings which had the greatest impact on our workload: Power.MaxFreqPct and Power.CstateResidencyCoef. The advanced ESXi setting Power.MaxFreqPct (default value 100) reduces the processor frequency by placing a cap on the highest operating frequency it can reach. In practice, the processor can operate only at certain set frequencies (P-states), so if the frequency cap requested by ESXi (e.g. 2160MHz) does not match to a set frequency state, the processor will run at the nearest lower frequency state (e.g. 2100MHz). Setting Power.MaxFreqPct = 99 put the cap at 99% of the processor’s nominal frequency, which limited Turbo Boost. Power.MaxFreqPct = 80 further limited the maximum frequency of the processor to 80% of its nominal frequency of 2.7GHz, for a maximum of 2.1GHz. Setting Power.CstateResidencyCoef = 0 (default value 5) puts the processor into its deepest available C-state, or lowest power state, when it is idle. As a prerequisite, deep C-states must be enabled in the BIOS. For a more in-depth discussion of power management techniques and other custom options, please see the vSphere documentation and the whitepaper Host Power Management in VMware vSphere 5.5.

VMmark models energy efficiency as performance score per kilowatt of power consumed. VMmark scores in the graph below have been normalized to the default “Balanced” 1-tile result, which does not use any custom power settings.

VMware ESXi Custom Power Management Settings improve efficiency

A major trend can be seen here; an increase in load is correlated with greater energy efficiency. As the CPUs become busier, throughput increases at a faster rate than the required power. This can be understood by noting that an idle server will still consume power, but with no work to show for it. A highly utilized server is typically the most energy efficient per request completed, and the results bear this out.

To more closely examine the relative impact of each custom setting compared to the default setting, we normalized all results within each load level to the default “Balanced” result for that number of tiles. The figure below shows the percent change at each load level.

VMware ESXi Custom Power Management Settings Change in Efficiency and Performance Results

All custom settings showed improvements in efficiency compared to the default “Balanced” setting. The improvements varied depending on load. Setting MaxFreqPct to 99 had the greatest benefit to energy efficiency, between 5% and 15% at varying load levels. The greatest improvement was seen at 4 tiles, which increased efficiency by 17%, while resulting in a performance decrease of only 3%. The performance cost increased with load to 9% at 12 tiles. However, limiting processor frequency even further to a maximum of 80% of its nominal frequency does not produce an additive effect. Not only did efficiency actually decrease relative to MaxFreqPct=99, but it profoundly curtailed performance from 96% of baseline at light load to 84% of baseline for a heavily loaded machine. CstateResidency=0 produced some modest increases in efficiency for a lightly loaded server, but the effect disappeared at higher load levels.

VMmark 2.5 performance scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. We wanted to see how custom power management settings affected application latency as opposed to the VMmark score. All latencies are normalized to the lowest 1-tile results.

VMware ESXi Custom Power Management Settings Effect on Application Latencies

Naturally, latencies increase as load increases from 1 to 12 tiles. Fortunately, the custom power management policies caused only minimal increases in application latencies, if any, except for the MaxFreqPct=80 setting which did create elevated latencies across the board.

BIOS Custom Power Settings

The Dell PowerEdge R720 BIOS provides another toolbox of power-saving knobs to tweak. Using the BIOS settings, we manually disabled Turbo Boost and reduced memory frequency from its default maximum speed of 1866MT/s (megatransfers per second) to either 1333MT/s or 800MT/s.

Custom-Power-Management-BIOS-Efficiency

The Turbo Boost Disabled configuration produced the largest increase in efficiency, while 800MT/s memory frequency actually decreased efficiency at the higher load levels.
Again, we normalized all results within each load level to its default “Balanced” result. The figure below shows the percent change at each load level.

Custom-Power-Management-BIOS-Efficiency-and-Perf
Disabling Turbo Boost was the most effective setting to increase energy efficiency, with a performance cost of 2% at low load levels to 8% at high load levels. Reducing memory frequency to 1333MT/s had a reliable but small boost to efficiency and no effect on performance, leading us to conclude that a memory speed of 1866MT/s is simply faster than needed for this workload.

Custom-Power-Management-BIOS-Application-Latencies
Disabling Turbo Boost and reducing memory frequency to 800MT/s increased DVD Store 2 latencies at 10 tiles by 10% and 12 tiles by 30%, but all latencies were still well within Quality of Service requirements.  Reducing memory frequency to 1333MT/s had no effect on application latencies.

Reducing the use of Turbo Boost, using either ESXi custom setting MaxFreqPct or BIOS custom settings, proved to be the most effective way to increase energy efficiency in our VMmark tests. The impact on performance was small, but increased with load. MaxFreqPct is the preferred setting because, like all ESXi custom power management settings, it takes effect immediately and can easily be reversed without reboots or downtime. Other custom power management settings produced modest gains in efficiency, but, if taken to the extreme, not only harm performance but fail to increase efficiency. In addition, energy efficiency is strongly related to load; the most efficient server is also one that is heavily utilized. Taking steps to increase server utilization, such as server consolidation, is an important part of a power saving strategy. Custom power management settings can produce gains in energy efficiency at a cost to performance, so consider the tradeoff when choosing custom power management settings for your own environment.


 How to Configure Custom Power Management Settings

Disclaimer: The results presented above are a case study of the impact of custom power management settings and a starting point only. Results may not apply to your environment and do not represent best practices.

Exercise caution when choosing a custom power management setting. Change settings one at a time to evaluate their impact on your environment. Monitor your server’s power consumption either through its UPS, or consult your vendor to find the rated accuracy of your server’s internal power monitoring sensor. If it is highly accurate, you can view the server’s power consumption in esxtop (press ‘p’ to view Power Usage).

To customize power management settings, enter your server’s BIOS. Power Management settings vary by vendor but most include “OS Controlled” and “Custom” policies.

In the Dell PowerEdge R720, choosing the “Performance Per Watt (OS)” System Profile allows ESXi to control power management, while leaving hardware settings at their default values.

Screenshot of R720 BIOS Selecting OS controlled power managment

Choosing the “Custom” System Profile and setting CPU Power Management to “OS DBPM” allows ESXi to control power management while enabling custom hardware settings.

Screenshot-R720-BIOS

Using ESXi Custom Power Settings

To enable the vSphere custom power management policy,

  1. Browse to the host in the vSphere Web Client navigator.
  2. Click the Manage tab and click Settings.
  3. Under Hardware, select Power Management and click the Edit button.
  4. Select the Custom power management policy and click OK.

The power management policy changes immediately and does not require a server reboot.

Screenshot-VMware-ESXi-Host-Power-Management-SettingScreenshot-VMware-ESXi-Custom-Power-Manangement-Setting

To modify ESXi custom power management settings,

  1. Browse to the host in the vSphere Web Client navigator.
  2. Click the Manage tab and click Settings.
  3. Under System, select Advanced System Settings.
  4. Power management parameters that affect the Custom policy have descriptions that begin with In Custom policy. All other power parameters affect all power management policies.
  5. Select the parameter and click the Edit button.

Note: The default values of power management parameters match the Balanced policy.

Screenshot-VMware-ESXi-Advanced-System-Settings

 

Web 2.0 Applications on VMware Virtual SAN

Here in VMware Performance Engineering, Virtual SAN is a hot topic. This storage solution leverages economical hardware compared to more expensive storage arrays and offers all the vSphere solutions like vMotion, HA, and DRS. We have been testing Virtual SAN with a number of workloads to characterize their performance. In particular we found that Web 2.0 applications, modeled with the Cloudstone benchmark, performs exceptionally with low application latency on vSphere 5.5 with Virtual SAN. We are giving a quick glimpse of the testing configuration and result here and the full detail can be found in the recently published technical white paper about Web 2.0 applications on VMware Virtual SAN.

We ran the Cloundstone benchmark using Olio server and client virtual machine pairs. Server virtual machines were on a 3-host server cluster, whereas client virtual machines were on a 3-node client cluster. An Olio server virtual machine ran Ubuntu 10.04 with a MySQL database, a NGINX Web server with PHP scripts, and a Tomcat application server. An Olio client virtual machine simulated typical Web 2.0 workloads by exercising 7 different types of user operations that involved web file exchanges and database inquiries and transactions. Virtual SAN was configured on the server cluster. Web files, database files, and OS files were all on the Virtual SAN with dedicated virtual disks to store files separately.

fig1-blog

In the paper, we report test results that show Virtual SAN achieves good application latency performance. Each server-client virtual machine pair was pre-configured for 500 Olio users. In one test, we ran 1500 Olio users and 7500 users by having 3 and 15 pairs of virtual machines respectively. We collected the average round-trip time of Olio operations. These operations were divided into frequent ones (EventDetail, HomePage, Login and TagSearch) and less frequent ones (AddEvent, AddPerson, and PersonDetail) according to how often they were exercised in the tests.

The following graph shows the average round-trip times for various operations. The threshold for these operations was defined in the passing critera, which used 250 milliseconds for the frequent operations and 500 milliseconds for the less frequent operations. In the 15VMs/7500 users case, the server cluster was at 70% CPU utilization, but the round-trip time was well below the passing threshold as shown. We also present the 95th-percentile round-trip time results and how it performed in the white paper.

fig2-blog

To learn the full story of the 15VMs/7500 Olio users test and how we further stressed storage with the workload and read the results, see the white paper.

Detailed stats for vSphere Flash Read cache

In the Performance of vFlash Read Cache in VMware vSphere 5.5 whitepaper details about performance in several different workloads is provided [http://www.vmware.com/files/pdf/techpaper/vfrc-perf-vsphere55.pdf].  It also covers details about how to tune and configure vFRC to obtain best performance.  This blog article goes through the details of how to obtain runtime statistics about the vSphere Flash Read Cache like the Cache hit rate, Latency of cached I/Os, and average number of cache blocks evicted as used in the whitepaper.

You can run esxcli to get some of these stats using the following commands:

~ # esxcli storage vflash cache list

This would list the identifiers for the caches that are currently in use.  These are displayed one per vFlash-enabled-VMDKs.   To retrieve the vFRC statistics for a particular vFlash-enabled-VMDK, the following command can be used:

~ # esxcli storage vflash cache get –c <cache-identifier>

However, a few more advanced statistics like the amount of data that is cached at any point of time may be obtained by directly accessing the VSI nodes.  The process to do this is as follows:

Cache identifier may be obtained by either the esxcli command shown above or using

~ # cacheID=`vsish -e ls /vmkModules/vflash/module/vfc/cache/`
~ # vsish -e get /vmkModules/vflash/module/vfc/cache/${cacheID}stats

This displays an output similar to the following:

vFlash per cache instance statistics {
cacheBlockSize:8192
numBlocks:1270976
numBlocksCurrentlyCached:222255
numFailedPrimaryIOs:0
numFailedCacheIOs:0
avgNumBlocksOnCache:172494
read:vFlash per I/O type Statistics {
numIOs:168016
avgNumIOPs:61
maxNumIOPs:1969
avgNumKBs:42143
maxNumKBs:227891
avgLatencyUS:16201
maxLatencyUS:41070
numPrimaryIOs:11442
numCacheIOs:156574
avgCacheLatencyUS:17130
avgPrimaryLatencyUS:239961
cacheHitPercentage:94
}
write:vFlash per I/O type Statistics {
numIOs:102264
avgNumIOPs:307
maxNumIOPs:3982
avgNumKBs:10424
maxNumKBs:12106
avgLatencyUS:3248
maxLatencyUS:31798
numPrimaryIOs:102264
numCacheIOs:0
avgCacheLatencyUS:0
avgPrimaryLatencyUS:3248
cacheHitPercentage:0
}
rwTotal:vFlash per I/O type Statistics {
numIOs:270280
avgNumIOPs:88
maxNumIOPs:2027
avgNumKBs:52568
maxNumKBs:233584
avgLatencyUS:11300
maxLatencyUS:40029
numPrimaryIOs:113706
numCacheIOs:156574
avgCacheLatencyUS:17130
avgPrimaryLatencyUS:27068
cacheHitPercentage:58
}
flush:vFlash per operation type statistics {
lastOpTimeUS:0
numBlocksLastOp:0
nextOpTimeUS:0
numBlocksNextOp:0
avgNumBlocksPerOp:0
}
evict:vFlash per operation type statistics {
lastOpTimeUS:0
numBlocksLastOp:0
nextOpTimeUS:0
numBlocksNextOp:0
avgNumBlocksPerOp:0
}
}

This output contains all of the metrics that are discussed in the vShpere Flash Read Cache whitepaper.  This information can be used for sizing the cache and cache blocks in the most effective way.

Disclaimer: This interface (vsish) is not officially supported by VMware, so please use at your own risk.

Reducing Power Consumption in the vSphere 5.5 Datacenter

Today’s virtualized datacenters consist of several servers connected to shared storage, and this configuration has been necessary to enable the flexibility that virtualization provides and still allow for high performance. However, the power consumption of this setup is a major concern because shared storage can consume as much as 2-3x the power of a single, mid-ranged server. In this blog, we look at the performance impact of replacing shared storage with local disks and PCIe flash storage in a vSphere 5.5 datacenter to save power.

We leverage two innovative vSphere features in this performance test:

  • Unified live migration, first introduced with vSphere 5.1, removes the shared storage requirement for vMotion and allows combining traditional vMotion and Storage vMotion into one operation. This combined live migration copies both the virtual machine’s memory and storageover the network to the destination vSphere host. This feature offers administrators significantly more simplicity and flexibility in managing and moving virtual machines across their virtual infrastructures compared to the traditional vMotion and Storage vMotion migration solutions. More information about vMotion can be found in the VMware vSphere 5.1 vMotion Architecture, Performance, and Best Practices white paper.
  • vSphere 5.5 improves server power management by enabling processor C-states, in addition to the previously-used P-states, to improve power savings in the Balanced policy setting. More information about these improvements can be found in the Host Power Management in vSphere 5.5 white paper.

We measure the performance and power savings of these features when replacing shared storage with local disks and PCIe flash storage using a modified version of VMware VMmark 2.5. VMmark is a multi-host virtualization benchmark that uses varied application workloads, as well as common datacenter operations to model the demands of the datacenter. Each VMmark tile contains a set of VMs running diverse application workloads as a unit of load. For more details, see the VMmark 2.5 overview. The benchmark was modified to replace the traditional vMotion workload component with the new shared-nothing, unified live migration.

Testing Methodology

VMmark 2.5 was modified to convert the vMotion workload into a migration without shared storage. All other workloads were unchanged. This allowed a comparison of local, direct attached storage to a traditional Fibre Channel SAN. We measured the power consumption of each configuration using a pair of Yokogawa WT210 power meters, one attached to the servers and the other attached to the external storage.

Configuration

  • Systems Under Test: 2x Dell PowerEdge R710 servers
  • CPUs (per server): 2x Intel Xeon X5670 @ 2.93 GHz
  • Memory (per server): 96 GiB
  • Hypervisor: VMware vSphere 5.5
  • Local Storage (per server): 1x 785GB Fusion-io ioDrive2, 2x 300GB 10K RPM SAS drives in RAID 0
  • SAN: 8Gb Fibre Channel, 30x 200GB SATA Flash drives, 30x 600GB 15K RPM SAS drives
  • Benchmarking software: VMware VMmark 2.5

All I/O-intensive virtual disks were stored on the Fusion-io devices for local storage tests or the SATA flash drives for the SAN tests.  This included the DVD Store database files, the mail server database, and the Olio database.  All remaining virtual machine data was stored on the local SAS drives for the local storage tests and the SAN SAS drives for the SAN tests.

Results
 
VMmark performance using shared-nothing, unified live migration backed by fast local storage showed only minor differences compared to the results with shared storage.  The largest variance was seen in the infrastructure operations, which was expected as the vMotion workload was modified to include a storage migration.  The chart below shows the scores normalized to the 3-tile SAN test results.

scores

When we add the power data to these results, and compare the Performance Per Killowatt (PPKW), we see a much different picture.  The local storage-based PPKW score is much higher than shared storage due to higher power efficiency.

ppkw

We can see the reason for this difference is due to the power consumption of each configuration.  The SAN is consuming over 1000 watts, which is typical of this storage solution.  Replacing that power-hungry component with local storage greatly reduces vSphere datacenter power consumption while maintaining good performance.

power

This SAN should be able to support approximately 25 VMmark tiles (based on the storage capacity of the SSDs), roughly five times the load being supported by the two servers we had available for testing in our lab. However, it should be noted that these servers are two generations old. Current-generation two-socket servers with a comparable power usage can support 2-3x the number of tiles based on published VMmark results. This would imply that the SAN could support at most four current-generation servers. While an additional two servers will further amortize the power cost of the SAN, significant power savings would still be achieved with an all-local storage architecture.

This is not without a cost.  Removing shared storage reduces the functionality of the datacenter because there are a number of vSphere features which will no longer function, such as DRS and traditional vMotion. The reduction in the infrastructure performance due to no shared storage will limit the workloads that can be run in this manner to virtual machines with smaller disks which can be moved between hosts without shared storage fairly quickly. Virtual machines with large disks would take much longer to move and would be better suited to a shared storage environment.

We have shown that it is possible to significantly reduce datacenter power consumption without significantly reducing performance by replacing shared storage with local storage solutions.  Unified live migration enables the use of local storage without a significant infrastructure performance penalty while maintaining application performance comparable to traditional environments using shared storage for the server workloads represented in VMmark.  The resulting elimination of shared storage creates significant power savings and lower operations costs.

Power Management and Performance in VMware vSphere 5.1 and 5.5

Power consumption is an important part of the datacenter cost strategy. Physical servers frequently offer a power management scheme that puts processors into low power states when not fully utilized, and VMware vSphere also offers power management techniques. A recent technical white paper describes the testing and results of two performance studies: The first shows how power management in VMware vSphere 5.5 in balanced mode (the default) performs 18% better than the physical host’s balanced mode power management setting. The second study compares vSphere 5.1 performance and power savings in two server models that have different generations of processors. Results show the newer servers have 120% greater performance and 24% improved energy efficiency over the previous generation.

For more information, please read the paper: Power Management and Performance in VMware vSphere 5.1 and 5.5.

VDI Performance Benchmarking on VMware Virtual SAN 5.5

In the previous blog series, we presented the VDI performance benchmarking results with VMware Virtual SAN public beta and now we announced the general availability of VMware Virtual SAN 5.5 which is part of VMware vSphere 5.5 U1 GA and VMware Horizon View 5.3.1 which supports Virtual SAN 5.5. In this blog, we present the VDI performance benchmarking results with the Virtual SAN GA bits and highlight the CPU improvements and 16-node scaling results. With Virtual SAN 5.5 with default policy, we could successfully run 1615 heavy VDI users (VDImark) out-of-the-box on a 16-node Virtual SAN cluster and see about 5% more consolidation when compared to Virtual SAN public beta.

virtualsan-view-block-diagram

To simulate the VDI workload, which is typically CPU bound and sensitive to I/O, we use VMware View Planner 3.0.1. We run View Planner and consolidate as many heavy users as we can on a particular cluster configuration while meeting the quality of service (QoS) criteria and we define the score as VDImark. For QoS criteria, View Planner operations are divided into three main groups: (1) Group A for interactive operations, (2) Group B for I/O operations, and (3) Group C for background operations. The score is determined separately for Group A user operations and Group B user operations by calculating the 95th percentile latency of all the operations in a group. The default thresholds are 1.0 second for Group A and 6.0 seconds for Group B. Please refer to the user guide, and the run and reporting guides for more details. The scoring is based on several factors such as the response time of the operations, compliance of the setup and configurations, and other factors.

As discussed in the previous blog, we used the same experimental setup (shown below) where each Virtual SAN host has two disk groups and each disk group has one PCI-e solid-state drive (SSD) of 200GB and six 300GB 15k RPM SAS disks. We use default policy when provisioning the automated linked clones pool with VMware Horizon View for all our experiments.

virtualsan55-setup

CPU Improvements in Virtual SAN 5.5

There were several optimizations done in Virtual SAN 5.5 compared to the previously available public beta version and one of the prominent improvements is the reduction of CPU usage for Virtual SAN. To highlight the CPU improvements, we compare the View Planner score on Virtual SAN 5.5 (vSphere 5.5 U1) and Virtual SAN public beta (vSphere 5.5).  On a 3-node cluster, VDImark (the maximum number of desktop VMs that can run with passing QoS criteria) is obtained for both Virtual SAN 5.5 and Virtual SAN public beta and the results are shown below:

virtualsan55-3node

The results show that with Virtual SAN 5.5, we can scale up to 305 VMs on a 3-node cluster, which is about 5% more consolidation when compared with Virtual SAN public beta. This clearly highlights the new CPU improvements in Virtual SAN 5.5 as a higher number of desktop VMs can be consolidated on each host with a similar user experience.

Linear Scaling in VDI Performance

In the next set of experiments, we continually increase the number of nodes for the Virtual SAN cluster to see how well the VDI performance scales. We collect the VDImark score on 3-node, 5-node, 8-node, 16-node increments, and the result is shown in the chart below.

virtualsan55-scaling

The chart illustrates that there is a linear scaling in the VDImark as we increase the number of nodes for the Virtual SAN cluster. This indicates good performance as the nodes are scaled up. As more nodes are added to the cluster, the number of heavy users that can be added to the workload increases proportionately. In Virtual SAN public beta, a workload of 95 heavy VDI users per host was achieved and now, due to CPU improvements in Virtual SAN 5.5, we are able to achieve 101 to 102 heavy VDI users per host. On a 16-node cluster, a VDImark of 1615 was achieved which accounts for about 101 heavy VDI users per node.

To further illustrate the Group A and Group B response times, we show the average response time of individual operations for these runs for both Group A and Group B, as follows.

virtualsan55-groupA

As seen in the figure above, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience. If we look all the way up to 16 nodes, we don’t see much variance in the response times, and they almost remain constant when scaling up. This clearly illustrates that, as we scale the number of VMs in larger nodes of a Virtual SAN cluster, the user experience doesn’t degrade and scales nicely.

virtualsan55-groupB

Group B is more sensitive to I/O and CPU usage than Group A, so the resulting response times are more important. The above figure shows how VDI performance scales in Virtual SAN. It is evident from the chart that there is not much difference in the response times as the number of VMs are increased from 305 VMs on a 3-node cluster to 1615 VMs on a 16-node cluster. Hence, storage-sensitive VDI operations also scale well as we scale the Virtual SAN nodes from 3 to 16.

To summarize, the test results in this blog show:

  • 5% more VMs can be consolidated on a 3-node Virtual SAN cluster
  • When adding more nodes to the Virtual SAN cluster, the number of heavy users supported increases proportionately (linear scaling)
  • The response times of common user operations (such as opening and saving files, watching a video, and browsing the Web) remain fairly constant as more nodes with more VMs are added.

To see the previous blogs on the VDI benchmarking with Virtual SAN public beta, check the links below:

VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

In part 1 and part 2 of the VDI/VSAN benchmarking blog series, we presented the VDI benchmark results on VSAN for 3-node, 5-node, 7-node, and 8-node cluster configurations. In this blog, we compare the VDI benchmarking performance of VSAN with an all flash storage array. The intent of this experiment is not to compare the maximum IOPS that you can achieve on these storage solutions; instead, we show how VSAN scales as we add more heavy VDI users. We found that VSAN can support a similar number of users as that of an all flash array even though VSAN is using host resources.

The characteristic of VDI workload is that they are CPU bound, but sensitive to I/O which makes View Planner a natural fit for this comparative study. We use VMware View Planner 3.0 for both VSAN and all flash SAN and consolidate as many heavy users as much we can on a particular cluster configuration while meeting the quality of service (QoS) criteria. Then, we find the difference in the number of users we can support before we run out of CPU, because I/O is not a bottleneck here. Since VSAN runs in the kernel and uses CPU on the host for its operation, we find that the CPU usage is quite minimal, and we see no more than a 5% consolidation difference for a heavy user run on VSAN compared to the all flash array.

As discussed in the previous blog, we used the same experimental setup where each VSAN host has two disk groups and each disk group has one PCI-e solid-state drive (SSD) of 200GB and six 300GB 15k RPM SAS disks. We built a 7-node and a 8-node cluster and ran View Planner to get the VDImark™ score for both VSAN and the all flash array. VDImark signifies the number of heavy users you can successfully run and meet the QoS criteria for a system under test. The VDImark for both VSAN and all flash array is shown in the following figure.

View Planner QoS (VDImark)

 

 From the above chart, we see that VSAN can consolidate 677 heavy users (VDImark) for 7-node and 767 heavy users for 8-node cluster. When compared to the all flash array, we don’t see more than 5% difference in the user consolidation. To further illustrate the Group-A and Group-B response times, we show the average response time of individual operations for these runs for both Group-A and Group-B, as follows.

Group-A Response Times

As seen in the figure above for both VSAN and the all flash array, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience.  Similar to the user consolidation, the response time of Group-A operations in VSAN is similar to what we saw with the all flash array.

Group-B Response Times

Group-B operations are sensitive to both CPU and IO and 95% should be less than six seconds to meet the QoS criteria. From the above figure, we see that the average response time for most of the operations is within the threshold and we see similar response time in VSAN when compared to the all flash array.

To see other parts on the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

 

VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2

In part 1, we presented the VDI benchmark results on VSAN for 3-node and 7-node configurations. In this blog, we update the results for 5-node and 8-node VSAN configurations and show how VSAN scales for these configurations.

The View Planner benchmark was run again to find the VDImark for different numbers of nodes (5 and 8 nodes) in a VSAN cluster as described in the previous blog and the results are shown in the following figure.

View Planner QoS (VDImark)

 

In the 5-node cluster, a VDImark score of 473 was achieved and for the 8-node cluster, a VDImark score of 767 was achieved. These results are similar to the ones we saw on the 3-node and 7-node cluster earlier (about 95 VMs per host). So, there is nice scaling in terms of maximum VMs supported as the numbers of nodes were increased in the VSAN from 3 to 8.

To further illustrate the Group-A and Group-B response times, we show the average response time of individual operations for these runs for both Group-A and Group-B, as follows.

Group-A Response Times

As seen in the figure above, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience. If we look at the new results for 5-node and 8-node VSAN, we see that for most of the operations, the response time mostly remains the same across different node configurations.

Group-B Response Times

Since Group-B is more sensitive to I/O and CPU usage, the above chart for Group-B operations is more important to see how View Planner scales. The chart shows that there is not much difference in the response times as the number of VMs were increased from 286 VMs on a 3-node cluster to 767 VMs on an 8-node cluster. Hence, storage-sensitive VDI operations also scale well as we scale the VSAN nodes from 3 to 8 and user experience expectations are met.

To see other parts on the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3