
Docker Containers Performance in VMware vSphere

By  Qasim Ali,  Banit Agrawal, and Davide Bergamasco

 

“Containers without compromise” – this was one of the key messages at VMworld 2014 USA in San Francisco. It was presented in the opening keynote, and the advantages of running Docker containers inside virtual machines were then discussed in detail in several breakout sessions. These advantages include security and isolation guarantees as well as the existing rich set of management functionality. But some may say, “These benefits don’t come for free: what about the performance overhead of running containers in a VM?”

A recent report compared the performance of a Docker container to a KVM VM and showed very poor performance in some micro-benchmarks and real-world use cases: up to 60% degradation. These results were somewhat surprising to those of us accustomed to near-native performance of virtual machines, so we set out to do similar experiments with VMware vSphere. Below, we present our findings from running Docker containers in a vSphere VM and in a native configuration. Briefly:

  • We find that for most of these micro-benchmarks and Redis tests, vSphere delivers near-native performance, with generally less than 5% overhead.
  • Running an application in a Docker container in a vSphere VM has very similar overhead to running the container on a native OS (directly on a physical server).

Next, we present the configuration and benchmark details as well as the performance results.

Deployment Scenarios

We compare four different scenarios as illustrated below:

  • Native: Linux OS running directly on hardware (Ubuntu, CentOS)
  • vSphere VM: Upcoming release of vSphere with the same guest OS as native
  • Native-Docker: Docker version 1.2 running on a native OS
  • VM-Docker: Docker version 1.2 running in guest VM on a vSphere host

In each configuration, all power management features are disabled in the BIOS and in the Ubuntu OS.

Test Scenarios

Figure 1: Different test scenarios

Benchmarks/Workloads

For this study, we used the micro-benchmarks listed below and also simulated a real-world use case.

Micro-benchmarks:

  • LINPACK: This benchmark solves a dense system of linear equations. For large problem sizes it has a large working set and does mostly floating point operations.
  • STREAM: This benchmark measures memory bandwidth across various configurations.
  • FIO: This benchmark is used for I/O benchmarking for block devices and file systems.
  • Netperf: This benchmark is used to measure network performance.

Real-world workload:

  • Redis: In this experiment, many clients perform continuous requests to the Redis server (key-value datastore).

For all of the tests, we ran multiple iterations and report the average across runs.

Performance Results

LINPACK

LINPACK solves a dense system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system of N equations, converts that time into a performance rate, and tests the results for accuracy. We used an optimized version of the LINPACK benchmark binary based on the Intel Math Kernel Library (MKL).
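To make the conversion from solve time to a performance rate concrete, here is a minimal sketch using the standard LINPACK/HPL operation count; the problem size and timing in the example are hypothetical, not measurements from our runs.

    def linpack_gflops(n: int, seconds: float) -> float:
        """Convert a LINPACK solve time into a performance rate (GFLOPS).

        Uses the standard LINPACK/HPL operation count for factoring and solving
        an N x N dense system: 2/3 * N^3 + 2 * N^2 floating-point operations.
        """
        flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
        return flops / seconds / 1e9

    # Hypothetical example: a 45K problem solved in 150 seconds
    print(f"{linpack_gflops(45_000, 150.0):.1f} GFLOPS")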

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 32 vCPU VM with 45K and 65K problem sizes


Figure 2: LINPACK performance for different test scenarios

We disabled HT for this run, as recommended by the benchmark guidelines, to get the best peak performance. For the 45K problem size, the benchmark consumed about 16GB of memory. All memory was backed by transparent large pages. For the VM results, large pages were used both in the guest (transparent large pages) and at the hypervisor level (the default for the vSphere hypervisor). There was 1-2% run-to-run variation for the 45K problem size. For the 65K size, 33.8GB of memory was consumed and there was less than 1% variation.
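For readers who want to verify that transparent huge pages are active inside the guest, a quick check of the standard Linux sysfs knob looks like the sketch below; the exact output format varies slightly by distribution.

    from pathlib import Path

    # The active THP policy is shown in brackets, e.g. "[always] madvise never".
    thp = Path("/sys/kernel/mm/transparent_hugepage/enabled")
    print(thp.read_text().strip())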

As shown in Figure 2, there is almost negligible virtualization overhead for the 45K problem size. For the larger problem size, there is some inherent hardware virtualization overhead due to the nested page table walk, which results in the 5% drop in performance observed in the VM case. There is no additional overhead from running the application in a Docker container in a VM compared to running the application directly in the VM.

STREAM

We used a NUMA-aware STREAM benchmark, which is the classical STREAM benchmark extended to take advantage of NUMA systems. This benchmark measures memory bandwidth across four different operations: Copy, Scale, Add, and Triad.
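As an illustration of what the Triad kernel measures, here is a minimal single-threaded sketch in Python/NumPy. The array size is an arbitrary assumption, and the temporary created by scalar * c adds memory traffic that the real benchmark's 24-bytes-per-element accounting does not include, so this is only a rough approximation of the kernel, not the benchmark itself.

    import time
    import numpy as np

    N = 50_000_000                      # assumed array size (~400MB per array)
    a = np.zeros(N)
    b = np.ones(N)
    c = np.ones(N)
    scalar = 3.0

    start = time.perf_counter()
    np.add(b, scalar * c, out=a)        # Triad: a = b + scalar * c
    elapsed = time.perf_counter() - start

    # STREAM counts 3 arrays x 8 bytes per element for Triad (two reads, one write).
    bytes_moved = 3 * 8 * N
    print(f"Triad bandwidth: {bytes_moved / elapsed / 1e9:.2f} GB/s")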

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, HT enabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 64 vCPU VM (Hyper-Threading ON)


Figure 3: STREAM performance for different test scenarios

We used an array size of 2 billion, which used about 45GB of memory. We ran the benchmark with 64 threads both in the native and virtual cases. As shown in Figure 3, the VM added about 2-3% overhead across all four operations. The small 1-2% overhead of using a Docker container on a native platform is probably in the noise margin.

FIO

We used the Flexible I/O (FIO) tool version 2.1.3 to compare the storage performance for the native and virtual configurations, with Docker containers running in both. We created a 10GB file on a 400GB local SSD drive and used direct I/O for all our tests so that there were no effects of buffer caching inside the OS. We used a 4k I/O size and tested three different I/O profiles: random 100% read, random 100% write, and a mixed case with random 70% read and 30% write. For the 100% random read and write tests, we selected 8 threads and an I/O depth of 16, whereas for the mixed test, we selected an I/O depth of 32 and 8 threads. We used taskset to set the CPU affinity of the FIO threads in all configurations. All the details of the experimental setup are given below:

Hardware: 2 socket Intel Xeon E5-2660 2.2GHz with 392GB RAM, 16 total cores, Hyper-Threading enabled
Guest: 32-vCPU Ubuntu 14.04.1 64-bit server with 256GB RAM, with a separate ext4 disk in the guest (on VMFS5 in the vSphere run)
Benchmark:  FIO, Direct I/O, 10GB file
I/O Profile:  4k I/O, Random Read/Write: depth 16, jobs 8, Mixed: depth 32, jobs 8
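A sketch of how the three FIO profiles above can be launched is shown below. The fio flags mirror the parameters in the text, while the test file path, runtime, and pinned CPU list are illustrative assumptions rather than our exact settings.

    import subprocess

    def run_fio(rw: str, iodepth: int, rwmixread: int | None = None):
        """Launch one FIO profile roughly matching the setup described above."""
        cmd = [
            "taskset", "-c", "0-7",              # pin the FIO threads, as in the tests
            "fio", "--name=docker-vm-compare",
            "--filename=/ssd/fio-testfile",      # assumption: 10GB file on the local SSD
            "--size=10g", "--direct=1", "--ioengine=libaio",
            f"--rw={rw}", "--bs=4k",
            f"--iodepth={iodepth}", "--numjobs=8",
            "--runtime=120", "--time_based", "--group_reporting",
        ]
        if rwmixread is not None:
            cmd.append(f"--rwmixread={rwmixread}")
        subprocess.run(cmd, check=True)

    run_fio("randread", iodepth=16)
    run_fio("randwrite", iodepth=16)
    run_fio("randrw", iodepth=32, rwmixread=70)   # 70% read / 30% write mix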


Figure 4: FIO benchmark performance for different test scenarios

The figure above shows the normalized maximum IOPS achieved for different configurations and different I/O profiles. For random read in a VM, we see that there is about 2% reduction in maximum achievable IOPS when compared to the native case. However, for the random write and mixed tests, we observed almost the same performance (within the noise margin) compared to the native configuration.

Netperf

Netperf is used to measure throughput and latency of networking operations. All the details of the experimental setup are given below:

Hardware (Server): 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Hardware (Client): 2 socket Intel Xeon X5570 2.93GHz with 64GB RAM, 8 cores total, Hyper-Threading disabled
Networking hardware: Broadcom Corporation NetXtreme II BCM57810
Software on server and Client: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 2 vCPU VM with 4GB RAM

For the Native configuration, the server machine has only 2 CPUs online for a fair comparison with the 2-vCPU VM. The client machine is also configured with 2 CPUs online to reduce variability. We tested four configurations: directly on the physical hardware (Native), in a Docker container (Native-Docker), in a virtual machine (VM), and in a Docker container inside a VM (VM-Docker). For the two Docker deployment scenarios, we also studied the effect of using host networking as opposed to Docker bridge mode (the default operating mode), resulting in two additional configurations (Native-Docker-HostNet and VM-Docker-HostNet) and making a total of six configurations.

We used TCP_STREAM and TCP_RR tests to measure the throughput and round-trip network latency between the server machine and the client machine using a direct 10Gbps Ethernet link between two NICs. We used standard network tuning like TCP window scaling and setting socket buffer sizes for the throughput tests.
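The two test types can be driven roughly as in the sketch below. The server address is a placeholder, and the socket-buffer and message sizes are illustrative rather than the exact tuned values we used.

    import subprocess

    SERVER = "192.168.1.2"   # placeholder: netperf server on the direct 10GbE link

    # Throughput: single-stream TCP_STREAM with enlarged socket buffers.
    subprocess.run(["netperf", "-H", SERVER, "-t", "TCP_STREAM", "-l", "60",
                    "--", "-m", "16384", "-s", "262144", "-S", "262144"],
                   check=True)

    # Latency: TCP_RR with a 1-byte request and 1-byte response; netperf reports
    # the transaction rate, from which round-trip latency can be derived.
    subprocess.run(["netperf", "-H", SERVER, "-t", "TCP_RR", "-l", "60",
                    "--", "-r", "1,1"],
                   check=True)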


Figure 5: Netperf receive performance for different test scenarios


Figure 6: Netperf transmit performance for different test scenarios

Figure 5 and Figure 6 show the unidirectional throughput over a single TCP connection with a standard 1500 byte MTU for both the transmit and receive TCP_STREAM cases. (We used multiple streams in the VM-Docker* transmit case to reduce run-to-run variability due to the Docker bridge overhead and get predictable results.) Throughput numbers for all configurations are identical and equal to the maximum possible 9.40Gbps on a 10GbE NIC.


Figure 7: Netperf TCP_RR performance for different test scenarios (Lower is better)

For the latency tests, we used the latency sensitivity feature introduced in vSphere 5.5 and applied the best practices for tuning latency in a VM as mentioned in this white paper. As shown in Figure 7, latency in a VM with the VMXNET3 device is only 15 microseconds more than in the native case because of the hypervisor networking stack. If users wish to reduce the latency even further for extremely latency-sensitive workloads, pass-through mode or SR-IOV can be configured to allow the guest VM to bypass the hypervisor network stack. This configuration can achieve round-trip latency similar to native, as shown in Figure 8. The Native-Docker and VM-Docker configurations add about 9-10 microseconds of overhead due to the Docker bridge NAT function. When configured to use host networking, a Docker container (running natively or in a VM) achieves latencies similar to those observed when the workload is not run in a container at all (native or VM).


Figure 8: Netperf TCP_RR performance for different test scenarios (VMs in pass-through mode)

Redis

We also wanted to take a look at how Docker in a virtualized environment performs with real world applications. We chose Redis because: (1) it is a very popular application in the Docker space (based on the number of pulls of the Redis image from the official Docker registry); and (2) it is very demanding on several subsystems at once (CPU, memory, network), which makes it very effective as a whole system benchmark.

Our test-bed comprised two hosts connected by a 10GbE network. One of the hosts ran the Redis server in different configurations as mentioned in the netperf section. The other host ran the standard Redis benchmark program, redis-benchmark, in a VM.

The details about the hardware and software used in the experiments are the following:

Hardware: HP ProLiant DL380e Gen8 2 socket Intel Xeon E5-2470 2.3GHz with 96GB RAM, 16 total cores, Hyper-Threading enabled
Guest OS: CentOS 7
VM: 16 vCPU, 93GB RAM
Application: Redis 2.8.13
Benchmark: redis-benchmark, 1000 clients, pipeline: 1 request, operations: SET 1 Byte
Software configuration: Redis thread pinned to CPU 0 and network interrupts pinned to CPU 1

Since Redis is a single-threaded application, we decided to pin it to one of the CPUs and pin the network interrupts to an adjacent CPU in order to maximize cache locality and avoid cross-NUMA node memory access. The workload we used consists of 1000 clients with a pipeline of 1 outstanding request, setting a 1 byte value with a randomly generated key in a space of 100 billion keys. This workload is highly stressful to the system resources because: (1) every operation results in a memory allocation; (2) the payload size is as small as it gets, resulting in a very large number of small network packets; (3) as a consequence of (2), the frequency of operations is extremely high, resulting in complete saturation of the CPU running Redis and a high load on the CPU handling the network interrupts.
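A sketch of the client invocation, with flags mapped to the workload parameters above, is shown below. The server address, request count, and key-space size are illustrative placeholders rather than our exact values.

    import subprocess

    SERVER = "192.168.1.2"   # placeholder: address of the host running the Redis server

    # On the server host, Redis was pinned to CPU 0 (network interrupts to CPU 1);
    # a minimal way to start it that way would be:
    #   taskset -c 0 redis-server --port 6379
    #
    # The client side below mirrors the workload described above: 1000 clients,
    # pipeline of 1, SET of a 1-byte value with randomized keys.
    subprocess.run([
        "redis-benchmark",
        "-h", SERVER, "-p", "6379",
        "-c", "1000",          # 1000 concurrent clients
        "-P", "1",             # pipeline depth of 1 (one outstanding request)
        "-t", "set",           # SET-only workload
        "-d", "1",             # 1-byte payload
        "-r", "100000000",     # randomized key space (illustrative size)
        "-n", "10000000",      # total number of requests (illustrative)
    ], check=True)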

We ran five experiments for each of the above-mentioned configurations, and we measured the average throughput (operations per second) achieved during each run.  The results of these experiments are summarized in the following chart.


Figure 9: Redis performance for different test scenarios

The results are reported as the ratio of the mean throughput over the five runs to the native throughput (error bars show the range of variability across those runs).

Redis running in a VM has slightly lower performance than on a native OS because of the network virtualization overhead introduced by the hypervisor. When Redis is run in a Docker container on the native host, the throughput is significantly lower than native because of the overhead introduced by the Docker bridge NAT function. In the VM-Docker case, the performance drop compared to the Native-Docker case is almost exactly the same small amount as in the VM-Native comparison, again because of the network virtualization overhead. However, when Docker runs using host networking instead of its own internal bridge, near-native performance is observed for both the Docker on native hardware and Docker in VM cases, reaching 98% and 96% of the maximum throughput, respectively.
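The difference between the two Docker networking modes comes down to how the container is started. A minimal sketch using the official redis image, with everything else left at defaults, looks like this (in practice you would run one mode or the other, since both would bind host port 6379):

    import subprocess

    # Default bridge mode: the container gets its own network namespace, and
    # traffic to port 6379 is forwarded through the Docker bridge and NAT.
    subprocess.run(["docker", "run", "-d", "--name", "redis-bridge",
                    "-p", "6379:6379", "redis"], check=True)

    # Host networking: the container shares the host's network stack, so the
    # bridge/NAT hop is skipped; this is the configuration that reached 96-98%
    # of native throughput in Figure 9.
    subprocess.run(["docker", "run", "-d", "--name", "redis-hostnet",
                    "--net=host", "redis"], check=True)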

Based on the above results, we can conclude that virtualization introduces only a 2% to 4% performance penalty.  This makes it possible to run applications like Redis in a Docker container inside a VM and retain all the virtualization advantages (security and performance isolation, management infrastructure, and more) while paying only a small price in terms of performance.

Summary

In this blog, we showed that in addition to the well-known security, isolation, and manageability advantages of virtualization, running an application in a Docker container in a vSphere VM adds very little performance overhead compared to running the application in a Docker container on a native OS. Furthermore, we found that a container in a VM delivers near native performance for Redis and most of the micro-benchmark tests we ran.

In this post, we focused on the performance of running a single instance of an application in a container, VM, or native OS. We are currently exploring scale-out applications and the performance implications of deploying them on various combinations of containers, VMs, and native operating systems.  The results will be covered in the next installment of this series. Stay tuned!

 

VDI Performance Benchmarking on VMware Virtual SAN 5.5

In the previous blog series, we presented VDI performance benchmarking results with the VMware Virtual SAN public beta. We have since announced the general availability of VMware Virtual SAN 5.5, which is part of VMware vSphere 5.5 U1 GA, and of VMware Horizon View 5.3.1, which supports Virtual SAN 5.5. In this blog, we present VDI performance benchmarking results with the Virtual SAN GA bits and highlight the CPU improvements and 16-node scaling results. With Virtual SAN 5.5 and the default policy, we successfully ran 1615 heavy VDI users (VDImark) out of the box on a 16-node Virtual SAN cluster and saw about 5% more consolidation when compared to the Virtual SAN public beta.

[Figure: virtualsan-view-block-diagram]

To simulate the VDI workload, which is typically CPU bound and sensitive to I/O, we use VMware View Planner 3.0.1. We run View Planner and consolidate as many heavy users as we can on a particular cluster configuration while meeting the quality of service (QoS) criteria; we call the resulting score the VDImark. For the QoS criteria, View Planner operations are divided into three main groups: (1) Group A for interactive operations, (2) Group B for I/O operations, and (3) Group C for background operations. The score is determined separately for Group A user operations and Group B user operations by calculating the 95th percentile latency of all the operations in a group. The default thresholds are 1.0 second for Group A and 6.0 seconds for Group B. Please refer to the user guide and the run and reporting guides for more details. The scoring is based on several factors, such as the response time of the operations and the compliance of the setup and configurations.
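A minimal sketch of the response-time portion of this scoring is shown below, assuming per-operation latencies have already been collected for each group; the compliance checks and the rest of the methodology are not modeled here.

    import numpy as np

    # A run passes the response-time criteria if the 95th-percentile latency of
    # each group stays under its threshold (1.0s for Group A, 6.0s for Group B).
    THRESHOLDS = {"A": 1.0, "B": 6.0}   # seconds

    def passes_qos(latencies_by_group: dict[str, list[float]]) -> bool:
        for group, threshold in THRESHOLDS.items():
            p95 = np.percentile(latencies_by_group[group], 95)
            if p95 > threshold:
                return False
        return True

    # Hypothetical per-operation latencies collected from a run:
    sample = {"A": [0.4, 0.6, 0.5, 0.9, 0.7], "B": [2.1, 3.5, 4.2, 5.1, 2.8]}
    print(passes_qos(sample))   # True: both groups are under their thresholds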

As discussed in the previous blog, we used the same experimental setup (shown below), where each Virtual SAN host has two disk groups and each disk group has one PCI-e solid-state drive (SSD) of 200GB and six 300GB 15k RPM SAS disks. We use the default policy when provisioning the automated linked-clone pool with VMware Horizon View for all our experiments.

[Figure: virtualsan55-setup]

CPU Improvements in Virtual SAN 5.5

Several optimizations were made in Virtual SAN 5.5 compared to the previously available public beta version, and one of the most prominent improvements is the reduction in CPU usage for Virtual SAN. To highlight the CPU improvements, we compare the View Planner score on Virtual SAN 5.5 (vSphere 5.5 U1) and the Virtual SAN public beta (vSphere 5.5). On a 3-node cluster, the VDImark (the maximum number of desktop VMs that can run while passing the QoS criteria) is obtained for both Virtual SAN 5.5 and the Virtual SAN public beta, and the results are shown below:

[Figure: virtualsan55-3node]

The results show that with Virtual SAN 5.5, we can scale up to 305 VMs on a 3-node cluster, which is about 5% more consolidation when compared with Virtual SAN public beta. This clearly highlights the new CPU improvements in Virtual SAN 5.5 as a higher number of desktop VMs can be consolidated on each host with a similar user experience.

Linear Scaling in VDI Performance

In the next set of experiments, we progressively increase the number of nodes in the Virtual SAN cluster to see how well VDI performance scales. We collect the VDImark score for 3-node, 5-node, 8-node, and 16-node configurations, and the result is shown in the chart below.

[Figure: virtualsan55-scaling]

The chart illustrates that VDImark scales linearly as we increase the number of nodes in the Virtual SAN cluster. This indicates good performance as the cluster is scaled out: as more nodes are added, the number of heavy users that can be added to the workload increases proportionately. With the Virtual SAN public beta, a workload of 95 heavy VDI users per host was achieved; now, due to the CPU improvements in Virtual SAN 5.5, we are able to achieve 101 to 102 heavy VDI users per host. On the 16-node cluster, a VDImark of 1615 was achieved, which works out to about 101 heavy VDI users per node.

To further illustrate the Group A and Group B response times, we show the average response time of individual operations for these runs for both Group A and Group B, as follows.

[Figure: virtualsan55-groupA]

As seen in the figure above, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience. If we look all the way up to 16 nodes, we don’t see much variance in the response times, and they almost remain constant when scaling up. This clearly illustrates that, as we scale the number of VMs in larger nodes of a Virtual SAN cluster, the user experience doesn’t degrade and scales nicely.

[Figure: virtualsan55-groupB]

Group B is more sensitive to I/O and CPU usage than Group A, so the resulting response times are more important. The above figure shows how VDI performance scales in Virtual SAN. It is evident from the chart that there is not much difference in the response times as the number of VMs is increased from 305 VMs on a 3-node cluster to 1615 VMs on a 16-node cluster. Hence, storage-sensitive VDI operations also scale well as we scale the Virtual SAN nodes from 3 to 16.

To summarize, the test results in this blog show:

  • 5% more VMs can be consolidated on a 3-node Virtual SAN cluster
  • When adding more nodes to the Virtual SAN cluster, the number of heavy users supported increases proportionately (linear scaling)
  • The response times of common user operations (such as opening and saving files, watching a video, and browsing the Web) remain fairly constant as more nodes with more VMs are added.

To see the previous blogs on the VDI benchmarking with Virtual SAN public beta, check the links below:

VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

In part 1 and part 2 of the VDI/VSAN benchmarking blog series, we presented the VDI benchmark results on VSAN for 3-node, 5-node, 7-node, and 8-node cluster configurations. In this blog, we compare the VDI benchmarking performance of VSAN with an all flash storage array. The intent of this experiment is not to compare the maximum IOPS that you can achieve on these storage solutions; instead, we show how VSAN scales as we add more heavy VDI users. We found that VSAN can support a similar number of users as an all flash array, even though VSAN uses host resources.

A characteristic of VDI workloads is that they are CPU bound but sensitive to I/O, which makes View Planner a natural fit for this comparative study. We use VMware View Planner 3.0 for both VSAN and the all flash SAN and consolidate as many heavy users as we can on a particular cluster configuration while meeting the quality of service (QoS) criteria. Then, we find the difference in the number of users we can support before we run out of CPU, because I/O is not a bottleneck here. Since VSAN runs in the kernel and uses CPU on the host for its operation, we find that this CPU usage is quite minimal, and we see no more than a 5% difference in consolidation for a heavy-user run on VSAN compared to the all flash array.

As discussed in the previous blog, we used the same experimental setup, where each VSAN host has two disk groups and each disk group has one PCI-e solid-state drive (SSD) of 200GB and six 300GB 15k RPM SAS disks. We built a 7-node and an 8-node cluster and ran View Planner to get the VDImark™ score for both VSAN and the all flash array. VDImark signifies the number of heavy users you can successfully run while meeting the QoS criteria for a system under test. The VDImark for both VSAN and the all flash array is shown in the following figure.

View Planner QoS (VDImark)

 

From the above chart, we see that VSAN can consolidate 677 heavy users (VDImark) on the 7-node cluster and 767 heavy users on the 8-node cluster. When compared to the all flash array, we don’t see more than a 5% difference in user consolidation. To further illustrate the Group-A and Group-B response times, we show the average response time of individual operations for these runs for both Group-A and Group-B, as follows.

Group-A Response Times

As seen in the figure above for both VSAN and the all flash array, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience.  Similar to the user consolidation, the response time of Group-A operations in VSAN is similar to what we saw with the all flash array.

Group-B Response Times

Group-B operations are sensitive to both CPU and I/O, and their 95th percentile response time should be less than six seconds to meet the QoS criteria. From the above figure, we see that the average response time for most of the operations is within the threshold, and we see similar response times on VSAN when compared to the all flash array.

To see other parts on the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

 

VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2

In part 1, we presented the VDI benchmark results on VSAN for 3-node and 7-node configurations. In this blog, we update the results for 5-node and 8-node VSAN configurations and show how VSAN scales for these configurations.

The View Planner benchmark was run again to find the VDImark for different numbers of nodes (5 and 8 nodes) in a VSAN cluster, as described in the previous blog, and the results are shown in the following figure.

View Planner QoS (VDImark)

 

In the 5-node cluster, a VDImark score of 473 was achieved, and in the 8-node cluster, a VDImark score of 767 was achieved. These results are similar to the ones we saw earlier on the 3-node and 7-node clusters (about 95 VMs per host). So, there is nice scaling in terms of maximum VMs supported as the number of nodes in the VSAN cluster was increased from 3 to 8.

To further illustrate the Group-A and Group-B response times, we show the average response time of individual operations for these runs for both Group-A and Group-B, as follows.

Group-A Response Times

As seen in the figure above, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience. If we look at the new results for 5-node and 8-node VSAN, we see that for most of the operations, the response time mostly remains the same across different node configurations.

Group-B Response Times

Since Group-B is more sensitive to I/O and CPU usage, the above chart for Group-B operations is more important for seeing how View Planner scales. The chart shows that there is not much difference in the response times as the number of VMs was increased from 286 VMs on a 3-node cluster to 767 VMs on an 8-node cluster. Hence, storage-sensitive VDI operations also scale well as we scale the VSAN nodes from 3 to 8, and user experience expectations are met.

To see other parts on the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

 

 

VDI Benchmarking Using View Planner on VMware Virtual SAN (VSAN)

VMware vSphere® 5.5 introduces the beta availability of VMware® Virtual SAN (VSAN). This feature provides a new software-defined storage tier that pools compute and direct-attached storage resources and clusters server disks and flash to create resilient shared storage.

This blog showcases Virtual Desktop Infrastructure (VDI) performance on Virtual SAN using VMware View Planner, which is designed to simulate a large-scale deployment of virtualized desktop systems. This is achieved by generating a workload representative of many user-initiated operations that take place in a typical VDI environment. The results allow us to study the effects on an entire virtualized infrastructure including the storage subsystem. View Planner can be downloaded here.

In this blog, we evaluate the performance of VSAN using View Planner with different VSAN node configurations. In this experiment, we build a 3-node VSAN cluster and a 7-node VSAN cluster to determine the maximum number of VDI virtual machines (VMs) we can run while meeting the quality of service (QoS) criteria set for View Planner.  The maximum number of passing VMs is called the VDImark™ for a given system under test. This metric is used for VDI benchmarking and it encapsulates the number of VDI users that can be run on a given system with an application response time less than the set threshold. For response time characterization, View Planner operations are divided into three main groups: (1) Group A for interactive operations, (2) Group B for I/O operations, and (3) Group C for background operations. The score is determined separately for Group A user operations and Group B user operations by calculating the 95th percentile latency of all the operations in a group. The default thresholds are 1.0 second for Group A and 6.0 seconds for Group B. Please refer to the user guide, and the run and reporting guides for more details. Hence, the scoring is based on several factors such as the response time of the operations, compliance of the setup and configurations, and so on.

Experimental Setup

The host running the desktop VMs has 16 Intel Xeon E5-2690 cores running @ 2.9GHz. The host has 256GB physical RAM, which is more than sufficient to run 100 1GB Windows 7 VMs. For VSAN, each host has two disk groups where each disk group has one PCI-e solid-state drive (SSD) of 200GB and six 300GB 15k RPM SAS disks.

View Planner QoS (VDImark)

The View Planner benchmark was run to find the VDImark for both the 3-node and 7-node VSAN clusters, and the results are shown in the chart above. In the 3-node cluster, a VDImark of 286 was achieved, and in the 7-node cluster, a VDImark score of 677 was achieved. So, there is nice scaling in terms of maximum VMs supported as the number of nodes in the VSAN cluster was increased from 3 to 7.

To further illustrate the Group A and Group B response times, we show the average response time of individual operations for these runs for both Group A and Group B, as follows.

Group A Response Times

As seen in the figure above, the average response times of the most interactive operations are less than one second, which is needed to provide good end-user experience. If we look at the 3-node and 7-node run, we don’t see much variance in the response times, and they almost remain constant when scaling up. This clearly illustrates that, as we scale the number of VMs in larger nodes of a VSAN cluster, the user experience doesn’t degrade and scales nicely.

Group B Response Times

Since Group B is more sensitive to I/O and CPU usage, the above chart for Group B operations is more important for seeing how we scale. It is evident from the chart that there is not much difference in the response times as the number of VMs was increased from 286 VMs on a 3-node cluster to 677 VMs on a 7-node cluster. Hence, storage-sensitive VDI operations also scale well as we scale the VSAN nodes from 3 to 7.

To see other parts on the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

Simulating different VDI users with View Planner 3.0

VDI benchmarking is hard. What makes it challenging is getting a good representation or simulation of VDI users. If we look closely at typical office users, we see a spectrum of VDI users: at one end of the spectrum, a user may be running a few simple Microsoft Office applications at a relatively moderate pace, whereas at the other end, a user may be running CPU-heavy multimedia applications and switching between many applications much faster. We classify the fast user as the power user or “heavy” user, and the user at the other end of the spectrum as the task worker or “light” user. In between these two ends of the spectrum, we define one more category: the “medium” user.

To simulate these different categories of users and to make the job of VDI benchmarking much easier, we built VMware View Planner 3.0, which generates a workload representative of many user-initiated operations that take place in a typical VDI environment. The tool simulates typical office applications such as PowerPoint, Outlook, and Word, as well as Adobe Reader, the Internet Explorer Web browser, multimedia applications, and so on. The tool can be downloaded from: http://www.vmware.com/products/desktop_virtualization/view-planner/overview.html.

If we look at the three categories of VDI users outlined above, one of the main differentiating factors across this gamut of users is how fast they act, and this is simulated using the concept of “think time” in the View Planner tool. The tool uses the thinktime parameter to sleep for a randomized interval before starting the next application operation. For the heavy user, the think time is kept very low at 2 seconds, which means that operations happen very fast: users switch across different applications or perform operations within an application every 2 seconds on average. The View Planner 3.0 benchmark defines a score, called “VDImark”, which is based on this “heavy” user workload profile. For a medium user, the think time is set to 5 seconds, and for a light user, the think time is set to 10 seconds. The heavy VDI user also uses a bigger screen resolution than the medium or light user. The simulation of these categories of users in the View Planner tool is summarized in the table below:

To show the capability of View Planner 3.0 to determine the sizing of VDI user VMs per host, we ran a flexible mode of View Planner 3.0, which allowed us to create medium and light user workloads (the heavy workload profile pre-exists), as well as to understand the user density for different types of VDI users on a given system. The flexible mode will be available soon through the Professional Services Organization (PSO) and to selected partners.
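A minimal sketch of the think-time pacing described above is shown below, assuming a simple uniform randomization around the profile's average; View Planner's actual distribution and operation set are not reproduced here.

    import random
    import time

    THINK_TIME = {"heavy": 2.0, "medium": 5.0, "light": 10.0}   # average seconds

    def run_operations(profile: str, operations):
        """Pause for a randomized think time before each user operation."""
        avg = THINK_TIME[profile]
        for op in operations:
            time.sleep(random.uniform(0.5 * avg, 1.5 * avg))   # randomized pause
            op()                                               # next user operation

    # Hypothetical operations standing in for real application actions:
    run_operations("heavy", [lambda: print("open document"),
                             lambda: print("browse web page")])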

The experimental setup we used to compare these different user profiles is shown below:

In this test, we want to determine how many VMs can be run on the system while each VM performs its heavy, medium, or light user profile. In order to do this, we need to set a baseline of acceptable performance, which is given by the quality of service (QoS) criteria defined in the View Planner user guide. The number of VMs that passed the QoS score is shown in the chart below.

The chart shows that we can run about 53 VMs for the heavy user (VDImark), 67 VMs for the medium user, and 91 VMs for light users. So, we could consolidate about 25% more desktops if we used this system to host users with medium workloads instead of heavy workloads. And we could consolidate 35% more desktops if we used this system to host users with light workloads instead of medium workloads. So, it is crucial to fully specify the user profile whenever we talk about the user density.

In this blog, we demonstrated how we used the View Planner 3.0 flexible mode to run different user profiles and to understand the user density for a system under test. If you have any questions or want to know more about View Planner, you can reach out to the team at viewplanner-info@vmware.com.

VDI Benchmarking with View Planner 3.0

Recently we announced the general availability of VMware View Planner 3.0 as a VDI benchmark. VMware View Planner is a tool designed to simulate a large-scale deployment of virtualized desktop systems. This is achieved by generating a workload representative of many user-initiated operations that take place in a typical VDI environment. The results allow us to study the effects on an entire virtualized infrastructure. The tool can be downloaded from http://www.vmware.com/products/desktop_virtualization/view-planner/overview.html

In this blog, we present a high-level overview of the View Planner benchmark and some of its use cases. Finally, we present a simple storage scaling use case using a flash memory storage array from Violin Memory, which partnered with us during the validation of this new benchmark.

With version 3.0, View Planner can be run as a benchmark, which will help VMware partners and customers precisely characterize and compare both the software and hardware solutions in their VDI environments. Using View Planner’s comprehensive standard methodology, VDI architects can compare and contrast different layers of the VDI stack, including processor architectures; the results can be used to objectively show the performance improvements of a next-generation chipset in contrast with the current generation. In addition, various storage solutions like hybrid, all-flash, and vSAN can be compared with different SAN configurations for a given hardware setup.

View Planner 3.0 provides a number of features, which include:

• Application-centric benchmarking of real-world workloads
• Unique and patented client-side performance measurement technology to better understand the end user experience
• High precision scoring methodology for repeatability
• Benchmark metrics to highlight infrastructure efficiencies—density, performance and economics.
• Support for latest VMware vSphere and Horizon View versions
• Better automation and stats reporting for ease of use and performance analysis
• Auto-generated PDF reports providing a summary of the run

View Planner Scoring
The View Planner score is represented as VDImark. This metric encapsulates the number of VDI users that can be run on a given system with application response time less than the set threshold. Hence, the scoring is based on several factors such as the response time of the operations, compliance of the setup and configurations, and so on.

For response time characterization, View Planner operations are divided into three main groups: (1) Group A for interactive operations, (2) Group B for I/O operations, and (3) Group C for background operations. The score is determined separately for Group A user operations and Group B user operations by calculating the 95th percentile latency of all the operations in a group. The default thresholds are 1.0 second for Group A and 6.0 seconds for Group B. Please refer to the user guide, and the run and reporting guides for more details.

View Planner Benchmarking Use Cases
As mentioned earlier, the View Planner 3.0 benchmark can be used to benchmark different CPU architectures, hosts, and storage architectures. Using the tool, vendors and partners can scale the number of VMs on a specific processor architecture to find out how many VMs per core can be supported, and the same can also be done for different server host systems. Similarly, a storage system can be benchmarked to see how many VMs can be supported without a significant increase in I/O latency, and hence without degrading the user experience, for a given storage configuration. The tool can also be used to study the impact of different configurations and optimizations in different layers of both the software and the hardware stack. Next, we look at one such use case of View Planner: storage scaling with View Planner VMs running on multiple hosts.

Use Case Example: Storage Scaling
To illustrate one of the use cases of View Planner 3.0, we look at storage scaling. In this experiment, we scale the number of hosts (3, 5, and 6), with each host running about 90 to 100 VMs, and see how the Violin storage array scales with the increasing IOPS requirements. We didn’t go beyond 6 hosts because of hardware availability. The experimental setup for this use case is shown below.

 

The hosts running the desktop VMs each have 16 Intel Xeon E5-2690 cores running @ 2.9 GHz and 256GB of physical RAM, which is more than sufficient to run 90-100 1GB Win7 VMs. The desktop VMs are connected to a Violin storage array through the Fibre Channel host bus adapter (FC HBA) on each host.

View Planner QoS

We ran 285 VMs (3 hosts), 480 VMs (5 hosts), and 560 VMs (6 hosts) and collected the View Planner response times; the QoS results are shown in the following figure.

In all the runs, we see in the bar chart that the Group A and Group B 95th percentile response times are less than the thresholds of 1 second and 6 seconds, respectively. Also, we don’t see much variation as we increase the number of hosts, and we can clearly see that the Violin storage array easily copes with the greater number of desktop VMs and services their IOPS requirements even when the number of desktops is doubled. To further illustrate the Group A and Group B response times, we show the average response time of individual operations for these three runs for both Group A and Group B, as follows.

Group A Response Times

As seen in the figure above, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience. If we look at all three runs, we don’t see much variance in the response times, and they remain almost constant when scaling up.

Group B Response Times

Since Group B is composed of I/O operations, this will provide good insight for storage-related experiments. In the bar chart shown above, we see that the latency of operations such as PPTx-Open, Word-Open, or Word-Save didn’t change much as we scaled from 285 VMs (3 hosts) to 560 VMs (6 hosts).

IOPS Requirements

The above chart shows the total IOPS seen by the Violin storage array while the benchmark was being executed. (This doesn’t include the IOPS from any management operations such as a boot storm, virus scan, and so on.) For the 560 VM run, we see that the total IOPS from all the hosts goes up to about 10K and then tapers down to about 6K in the steady state. So, as expected, the first iteration has a higher IOPS requirement than the steady state. We see similar behavior for the 285 VM and 480 VM runs: peaks in the first iteration followed by steady IOPS usage in the steady-state iterations.

While we have presented one simple use case of storage scaling in this blog, View Planner 3.0 can be used for many use cases (CPU scaling, processor architecture comparison, host configurations, and so on) as mentioned earlier.
If you have any questions or want to know more about View Planner, you can reach out to the team at viewplanner-info@vmware.com.

If you are attending VMworld this year, please check out our session on “View Planner 3.0 as a benchmark”. Here are the session details:
TEX5760 – View Planner 3.0 as a VDI Benchmark
Tuesday: 3:30 PM
Banit Agrawal & Rishi Bidarkar

VMware Horizon View 5.2 Performance & Best Practices and A Performance Deep Dive on Hardware Accelerated 3D Graphics

VMware Horizon View 5.2 simplifies desktop and application management while increasing security and control, and it delivers a personalized, high-fidelity experience for end users across sessions and devices. It enables higher availability and agility of desktop services, unmatched by traditional PCs, while reducing the total cost of desktop ownership. End users can enjoy new levels of productivity and the freedom to access desktops from more devices and locations, while IT gains greater policy control.

Recently, we published two whitepapers that provide a performance deep dive on Horizon View 5.2 and its hardware-accelerated 3D graphics (vSGA) feature. The links to these whitepapers are as follows:

* VMware Horizon View 5.2 Performance and Best Practices
* VMware Horizon View 5.2 and Hardware Accelerated 3D Graphics

The first whitepaper describes the new features in View 5.2, including access to View desktops with Horizon, space-efficient sparse (SEsparse) disks, hardware-accelerated 3D graphics, and full support for Windows 8 desktops. View 5.2 performance improvements in PCoIP and View management are highlighted. In addition, this paper presents View 5.2 PCoIP performance results, Windows 8 and RDP 8 performance analysis, and a vSGA performance analysis, including how vSGA compares to the software renderer support introduced in View 5.1.

The second whitepaper goes in-depth on the support for hardware accelerated 3D graphics that debuted with VMware vSphere 5.1 and VMware Horizon View 5.2 and presents performance and consolidation results for a number of different workloads, ranging from knowledge workers using 3D desktops to performance-intensive CAD-based workloads. Because the intensity of a 3D workload will vary greatly from user to user and application to application, rather than highlighting specific case studies, we demonstrate how the solution efficiently scales for both light- and heavy-weight 3D workloads, until GPU or CPU resources are fully utilized. This paper also presents key best practices to extract peak performance from a 3D View 5.2 deployment.

Technical deep dive on VMware View Planner

In our prior VMworld sessions and performance white papers, we have presented user experience performance results based on VMware View® Planner, a tool that can generate workloads representative of many user-initiated operations in VDI environments. While we have briefly discussed this tool on prior occasions, there have been many requests for the architectural details and inner workings of the tool. To provide a deeper technical dive into View Planner, we recently published an article in the latest release of the VMware Technical Journal (VMTJ Winter 2012), which can be found here: VMware View Planner: Measuring True Virtual Desktop at Scale.

View Planner supports typical VDI user operations as well as administrators’ management operations, and it can be configured to allow VDI evaluators to more accurately represent their particular environments. In this paper, we describe the challenges in building such a workload generator and the platform around it, as well as the View Planner architecture and use cases. We also explain how we used View Planner to perform platform characterization and consolidation studies, find potential performance optimizations, and address several other use cases.