Home > Blogs > VMware VROOM! Blog > Tag Archives: vmware

Tag Archives: vmware

Docker Containers Performance in VMware vSphere

By  Qasim Ali,  Banit Agrawal, and Davide Bergamasco

 

“Containers without compromise” – This was one of the key messages at VMworld 2014 USA in San Francisco. It was presented in the opening keynote, and then the advantages of running Docker containers inside of virtual machines were discussed in detail in several breakout sessions. These include security/isolation guarantees and also the existing rich set of management functionalities. But some may say, “These benefits don’t come for free: what about the performance overhead of running containers in a VM?”

A recent report compared the performance of a Docker container to a KVM VM and showed very poor performance in some micro-benchmarks and real-world use cases: up to 60% degradation. These results were somewhat surprising to those of us accustomed to near-native performance of virtual machines, so we set out to do similar experiments with VMware vSphere. Below, we present our findings of running Docker containers in a vSphere VM and  in a native configuration. Briefly,

  • We find that for most of these micro-benchmarks and Redis tests, vSphere delivered near-native performance with generally less than 5% overhead.
  • Running an application in a Docker container in a vSphere VM has very similar overhead of running containers on a native OS (directly on a physical server).

Next, we present the configuration and benchmark details as well as the performance results.

Deployment Scenarios

We compare four different scenarios as illustrated below:

  • Native: Linux OS running directly on hardware (Ubuntu, CentOS)
  • vSphere VM: Upcoming release of vSphere with the same guest OS as native
  • Native-Docker: Docker version 1.2 running on a native OS
  • VM-Docker: Docker version 1.2 running in guest VM on a vSphere host

In each configuration all the power management features are disabled in the BIOS and Ubuntu OS.

Test Scenarios

Figure 1: Different test scenarios

Benchmarks/Workloads

For this study, we used the micro-benchmarks listed below and also simulated a real-world use case.

-   Micro-benchmarks:

  • LINPACK: This benchmark solves a dense system of linear equations. For large problem sizes it has a large working set and does mostly floating point operations.
  • STREAM: This benchmark measures memory bandwidth across various configurations.
  • FIO: This benchmark is used for I/O benchmarking for block devices and file systems.
  • Netperf: This benchmark is used to measure network performance.

Real-world workload:

  • Redis: In this experiment, many clients perform continuous requests to the Redis server (key-value datastore).

For all of the tests, we run multiple iterations and report the average of multiple runs.

Performance Results

LINPACK

LINPACK solves a dense system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system of N equations, converts that time into a performance rate, and tests the results for accuracy. We used an optimized version of the LINPACK benchmark binary based on the Intel Math Kernel Library (MKL).

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 32 vCPU VM with 45K and 65K problem sizes

linpack

Figure 2: LINPACK performance for different test scenarios

We disabled HT for this run as recommended by the benchmark guidelines to get the best peak performance. For the 45K problem size, the benchmark consumed about 16GB memory. All memory was backed by transparent large pages. For VM results, large pages were used both in the guest (transparent large pages) and at the hypervisor level (default for vSphere hypervisor). There was 1-2% run-to-run variation for the 45K problem size. For 65K size, 33.8GB memory was consumed and there was less than 1% variation.

As shown in Figure 2, there is almost negligible virtualization overhead in the 45K problem size. For a bigger problem size, there is some inherent hardware virtualization overhead due to nested page table walk. This results in the 5% drop in performance observed in the VM case. There is no additional overhead of running the application in a Docker container in a VM compared to running the application directly in the VM.

STREAM

We used a NUMA-aware  STREAM benchmark, which is the classical STREAM benchmark extended to take advantage of NUMA systems. This benchmark measures the memory bandwidth across four different operations: Copy, Scale, Add, and Triad.

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, HT enabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 64 vCPU VM (Hyper-Threading ON)

stream

Figure 3: STREAM performance for different test scenarios

We used an array size of 2 billion, which used about 45GB of memory. We ran the benchmark with 64 threads both in the native and virtual cases. As shown in Figure 3, the VM added about 2-3% overhead across all four operations. The small 1-2% overhead of using a Docker container on a native platform is probably in the noise margin.

FIO

We used Flexible I/O (FIO) tool version 2.1.3 to compare the storage performance for the native and virtual configurations, with Docker containers running in both. We created a 10GB file in a 400GB local SSD drive and used direct I/O for all our tests so that there were no effects of buffer caching inside the OS. We used a 4k I/O size and tested three different I/O profiles: random 100% read, random 100% write, and a mixed case with random 70% read and 30% write. For the 100% random read and write tests, we selected 8 threads and an I/O depth of 16, whereas for the mixed test, we select an I/O depth of 32 and 8 threads. We use the taskset to set the CPU affinity on FIO threads in all configurations. All the details of the experimental setup are given below:

Hardware: 2 socket Intel Xeon E5-2660 2.2GHz with 392GB RAM, 16 total cores, Hyper-Threading enabled
Guest: 32-vCPU  14.04.1 Ubuntu 64-bit server with 256GB RAM, with a separate ext4 disk in the guest (on VMFS5 in vSphere run)
Benchmark:  FIO, Direct I/O, 10GB file
I/O Profile:  4k I/O, Random Read/Write: depth 16, jobs 8, Mixed: depth 32, jobs 8

fio

Figure 4: FIO benchmark performance for different test scenarios

The figure above shows the normalized maximum IOPS achieved for different configurations and different I/O profiles. For random read in a VM, we see that there is about 2% reduction in maximum achievable IOPS when compared to the native case. However, for the random write and mixed tests, we observed almost the same performance (within the noise margin) compared to the native configuration.

Netperf

Netperf is used to measure throughput and latency of networking operations. All the details of the experimental setup are given below:

Hardware (Server): 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Hardware (Client): 2 socket Intel Xeon X5570 2.93GHz with 64GB RAM, 8 cores total, Hyper-Threading disabled
Networking hardware: Broadcom Corporation NetXtreme II BCM57810
Software on server and Client: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 2 vCPU VM with 4GB RAM

The server machine for Native is configured to have only 2 CPUs online for fair comparison with a 2-vCPU VM. The client machine is also configured to have 2 CPUs online to reduce variability. We tested four configurations: directly on the physical hardware (Native), in a Docker container (Native-Docker), in a virtual machine (VM), and in a Docker container inside a VM (VM-Docker). For the two Docker deployment scenarios, we also studied the effect of using host networking as opposed to the Docker bridge mode (default operating mode), resulting in two additional configurations (Native-Docker-HostNet and VM-Docker-HostNet) making total six configurations.

We used TCP_STREAM and TCP_RR tests to measure the throughput and round-trip network latency between the server machine and the client machine using a direct 10Gbps Ethernet link between two NICs. We used standard network tuning like TCP window scaling and setting socket buffer sizes for the throughput tests.

netperf-recieve

Figure 5: Netperf Recieve performance for different test scenarios

netperf-transmit

Figure 6: Netperf transmit performance for different test scenarios

Figure 5 and Figure 6 shows the unidirectional throughput over a single TCP connection with standard 1500 byte MTU for both transmit and receive TCP_STREAM cases (We used multiple Streams in VM-Docker* transmit case to reduce the variability in runs due to Docker bridge overhead and get predictable results). Throughput numbers for all configurations are identical and equal to the maximum possible 9.40Gbps on a 10GbE NIC.

netperf-latency

Figure 7: Netperf TCP_RR performance for different test scenarios (Lower is better)

For the latency tests, we used the latency sensitivity feature introduced in vSphere5.5 and applied the best practices for tuning latency in a VM as mentioned in this white paper. As shown in Figure 7, latency in a VM with VMXNET3 device is only 15 microseconds more than in the native case because of the hypervisor networking stack. If users wish to reduce the latency even further for extremely latency- sensitive workloads, pass-through mode or SR-IOV can be configured to allow the guest VM to bypass the hypervisor network stack. This configuration can achieve similar round-trip latency to native, as shown in Figure 8. The Native-Docker and VM-Docker configuration adds about 9-10 microseconds of overhead due to the Docker bridge NAT function. A Docker container (running natively or in a VM) when configured to use host networking achieves similar latencies compared to the latencies observed when not running the workload in a container (native or a VM).

netperf-latency-passthrough

Figure 8: Netperf TCP_RR performance for different test scenarios (VMs in pass-through mode)

Redis

We also wanted to take a look at how Docker in a virtualized environment performs with real world applications. We chose Redis because: (1) it is a very popular application in the Docker space (based on the number of pulls of the Redis image from the official Docker registry); and (2) it is very demanding on several subsystems at once (CPU, memory, network), which makes it very effective as a whole system benchmark.

Our test-bed comprised two hosts connected by a 10GbE network. One of the hosts ran the Redis server in different configurations as mentioned in the netperf section. The other host ran the standard Redis benchmark program, redis-benchmark, in a VM.

The details about the hardware and software used in the experiments are the following:

Hardware: HP ProLiant DL380e Gen8 2 socket Intel Xeon E5-2470 2.3GHz with 96GB RAM, 16 total cores, Hyper-Threading enabled
Guest OS: CentOS 7
VM: 16 vCPU, 93GB RAM
Application: Redis 2.8.13
Benchmark: redis-benchmark, 1000 clients, pipeline: 1 request, operations: SET 1 Byte
Software configuration: Redis thread pinned to CPU 0 and network interrupts pinned to CPU 1

Since Redis is a single-threaded application, we decided to pin it to one of the CPUs and pin the network interrupts to an adjacent CPU in order to maximize cache locality and avoid cross-NUMA node memory access.  The workload we used consists of 1000 clients with a pipeline of 1 outstanding request setting a 1 byte value with a randomly generated key in a space of 100 billion keys.  This workload is highly stressful to the system resources because: (1) every operation results in a memory allocation; (2) the payload size is as small as it gets, resulting in very large number of small network packets; (3) as a consequence of (2), the frequency of operations is extremely high, resulting in complete saturation of the CPU running Redis and a high load on the CPU handling the network interrupts.

We ran five experiments for each of the above-mentioned configurations, and we measured the average throughput (operations per second) achieved during each run.  The results of these experiments are summarized in the following chart.

redis

Figure 9: Redis performance for different test scenarios

The results are reported as a ratio with respect to native of the mean throughput over the 5 runs (error bars show the range of variability over those runs).

Redis running in a VM has slightly lower performance than on a native OS because of the network virtualization overhead introduced by the hypervisor. When Redis is run in a Docker container on native, the throughput is significantly lower than native because of the overhead introduced by the Docker bridge NAT function. In the VM-Docker case, the performance drop compared to the Native-Docker case is almost exactly the same small amount as in the VM-Native comparison, again because of the network virtualization overhead.  However, when Docker runs using host networking instead of its own internal bridge, near-native performance is observed for both the Docker on native hardware and Docker in VM cases, reaching 98% and 96% of the maximum throughput respectively.

Based on the above results, we can conclude that virtualization introduces only a 2% to 4% performance penalty.  This makes it possible to run applications like Redis in a Docker container inside a VM and retain all the virtualization advantages (security and performance isolation, management infrastructure, and more) while paying only a small price in terms of performance.

Summary

In this blog, we showed that in addition to the well-known security, isolation, and manageability advantages of virtualization, running an application in a Docker container in a vSphere VM adds very little performance overhead compared to running the application in a Docker container on a native OS. Furthermore, we found that a container in a VM delivers near native performance for Redis and most of the micro-benchmark tests we ran.

In this post, we focused on the performance of running a single instance of an application in a container, VM, or native OS. We are currently exploring scale-out applications and the performance implications of deploying them on various combinations of containers, VMs, and native operating systems.  The results will be covered in the next installment of this series. Stay tuned!

 

Monster Performance with SQL Server VMs on vSphere 5.5

VMware vSphere provides an ideal platform for customers to virtualize their business-critical applications, including databases, ERP systems, email servers, and even newly emerging technologies such as Hadoop.  I’ve been focusing on the first one (databases), specifically Microsoft SQL Server, one of the most widely deployed database platforms in the world.  Many organizations have dozens or even hundreds of instances deployed in their environments. Consolidating these deployments onto modern multi-socket, multi-core, multi-threaded server hardware is an increasingly attractive proposition for IT administrators.

Achieving optimal SQL Server performance has been a continual focus for VMware; with current vSphere 5.x releases, VMware supports much larger “monster” virtual machines that can scale up to 64 virtual CPUs and 1 TB of RAM, including exposing virtual NUMA architecture to the guest. In fact, the main goal of this blog and accompanying whitepaper is to refresh a 2009 study that demonstrated SQL performance on vSphere 4, given the marked technology advancements on both the software and hardware fronts.

These tests show that large SQL Server 2012 databases run extremely efficiently with VMware, achieving great performance in a variety of virtual machine configurations with only minor tunings to SQL Server and the vSphere ESXi host. These tunings and other best practices for fully optimizing large virtual machines for SQL Server databases are presented in the paper.

One test in the paper shows the maximum host throughput achieved with different numbers of virtual CPUs per VM. This was measured starting with 8 vCPUs per VM, then doubled to 16, then 32, and finally 64 (the maximum supported with vSphere 5.5).  DVD Store, which is a popular database tool and a key workload of the VMmark benchmark, was used to stress the VMs.  Here is a graph from the paper showing the 8 vCPU x 8 VMs case, which achieved an aggregate of 493,804 opm (operations per minute) on the host:

8 x 8 vCPU VM throughput

There are also tests using CPU affinity to show the performance differences between physical cores and logical processors (Hyper-Threads), the impact of various virtual NUMA (vNUMA) topologies, and experiments with the Latency Sensitivity advanced setting.

For more details and the test results, please download the whitepaper: Performance and Scalability of Microsoft SQL Server on VMware vSphere 5.5.

Virtual SAP HANA Achieves Production Level Performance

VMware CEO Pat Gelsinger announced production support for SAP HANA on VMware vSphere 5.5 at EMC World this week during his keynote. This is the end result of a very thorough joint testing project over the past year between VMware and SAP.

HANA is an in-memory platform (including database capabilities) from SAP that has enabled huge gains in performance for customers and has been a high priority for SAP over the past few years.  In order for HANA to be supported in a virtual machine on vSphere 5.5 for production workloads, we worked closely with SAP to enable, design, and measure in-depth performance tests.

In order to enable the testing and ongoing production support of SAP HANA on vSphere, two HANA appliance servers were ordered, shipped, and installed into SAP’s labs in Waldorf Germany.  These systems are dedicated to running SAP HANA on vSphere onsite at SAP.  Each system is an Intel Xeon E7-8870 (Westmere-EX) based four-socket server with 1TB of RAM.  They are used for performance testing and also for ongoing support of HANA on vSphere.  Additionally, VMware has onsite support engineering to assist with the testing and support.

SAP designed an extensive performance test suite that used a large number of test scenarios to stress all functions and capabilities of HANA running on vSphere 5.5.  They included OLAP and OLTP with a wide range of data sizes and query functions. In all, over one thousand individual test cases were used in this comprehensive test suite.  These same tests were run on identical native HANA systems and the difference between native and virtual tests was used as the key performance indicator.

In addition, we also tested vSphere features including vMotion, DRS, and VMware HA with virtual machines running HANA.  These tests were done with the HANA virtual machine under heavy stress.

The test results have been extremely positive and are one of the key factors in the announcement of production support.  The difference between virtual and native HANA across all the performance tests was on average within a few percentage points.

The vMotion, DRS, and VMware HA tests were all completed without issues.  Even with the large memory sizes of HANA virtual machines, we were still able to successfully migrate them with vMotion while under load with no issues.

One of the results of the extensive testing is a best practices guide for HANA on vSphere 5.5. This document includes a performance guide for running HANA on vSphere 5.5 based on this extensive testing.  The document also includes information about how to size a virtual HANA instance and how VMware HA can be used in conjunction with HANA’s own replication technology for high availability.

VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

In part 1 and part 2 of the VDI/VSAN benchmarking blog series, we presented the VDI benchmark results on VSAN for 3-node, 5-node, 7-node, and 8-node cluster configurations. In this blog, we compare the VDI benchmarking performance of VSAN with an all flash storage array. The intent of this experiment is not to compare the maximum IOPS that you can achieve on these storage solutions; instead, we show how VSAN scales as we add more heavy VDI users. We found that VSAN can support a similar number of users as that of an all flash array even though VSAN is using host resources.

The characteristic of VDI workload is that they are CPU bound, but sensitive to I/O which makes View Planner a natural fit for this comparative study. We use VMware View Planner 3.0 for both VSAN and all flash SAN and consolidate as many heavy users as much we can on a particular cluster configuration while meeting the quality of service (QoS) criteria. Then, we find the difference in the number of users we can support before we run out of CPU, because I/O is not a bottleneck here. Since VSAN runs in the kernel and uses CPU on the host for its operation, we find that the CPU usage is quite minimal, and we see no more than a 5% consolidation difference for a heavy user run on VSAN compared to the all flash array.

As discussed in the previous blog, we used the same experimental setup where each VSAN host has two disk groups and each disk group has one PCI-e solid-state drive (SSD) of 200GB and six 300GB 15k RPM SAS disks. We built a 7-node and a 8-node cluster and ran View Planner to get the VDImark™ score for both VSAN and the all flash array. VDImark signifies the number of heavy users you can successfully run and meet the QoS criteria for a system under test. The VDImark for both VSAN and all flash array is shown in the following figure.

View Planner QoS (VDImark)

 

 From the above chart, we see that VSAN can consolidate 677 heavy users (VDImark) for 7-node and 767 heavy users for 8-node cluster. When compared to the all flash array, we don’t see more than 5% difference in the user consolidation. To further illustrate the Group-A and Group-B response times, we show the average response time of individual operations for these runs for both Group-A and Group-B, as follows.

Group-A Response Times

As seen in the figure above for both VSAN and the all flash array, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience.  Similar to the user consolidation, the response time of Group-A operations in VSAN is similar to what we saw with the all flash array.

Group-B Response Times

Group-B operations are sensitive to both CPU and IO and 95% should be less than six seconds to meet the QoS criteria. From the above figure, we see that the average response time for most of the operations is within the threshold and we see similar response time in VSAN when compared to the all flash array.

To see other parts on the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

 

VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2

In part 1, we presented the VDI benchmark results on VSAN for 3-node and 7-node configurations. In this blog, we update the results for 5-node and 8-node VSAN configurations and show how VSAN scales for these configurations.

The View Planner benchmark was run again to find the VDImark for different numbers of nodes (5 and 8 nodes) in a VSAN cluster as described in the previous blog and the results are shown in the following figure.

View Planner QoS (VDImark)

 

In the 5-node cluster, a VDImark score of 473 was achieved and for the 8-node cluster, a VDImark score of 767 was achieved. These results are similar to the ones we saw on the 3-node and 7-node cluster earlier (about 95 VMs per host). So, there is nice scaling in terms of maximum VMs supported as the numbers of nodes were increased in the VSAN from 3 to 8.

To further illustrate the Group-A and Group-B response times, we show the average response time of individual operations for these runs for both Group-A and Group-B, as follows.

Group-A Response Times

As seen in the figure above, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience. If we look at the new results for 5-node and 8-node VSAN, we see that for most of the operations, the response time mostly remains the same across different node configurations.

Group-B Response Times

Since Group-B is more sensitive to I/O and CPU usage, the above chart for Group-B operations is more important to see how View Planner scales. The chart shows that there is not much difference in the response times as the number of VMs were increased from 286 VMs on a 3-node cluster to 767 VMs on an 8-node cluster. Hence, storage-sensitive VDI operations also scale well as we scale the VSAN nodes from 3 to 8 and user experience expectations are met.

To see other parts on the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

 

 

Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5

VMware vSphere ensures that virtualization overhead is minimized so that it is not noticeable for a wide range of applications including most business critical applications such as database systems, Web applications, and messaging systems. vSphere also supports well applications with millisecond-level latency constraints, including VoIP services. However, performance demands of latency-sensitive applications with very low latency requirements such as distributed in-memory data management, stock trading, and high-performance computing have long been thought to be incompatible with virtualization.

vSphere 5.5 includes a new feature for setting latency sensitivity in order to support virtual machines with strict latency requirements. This per-VM feature allows virtual machines to exclusively own physical cores, thus avoiding overhead related to CPU scheduling and contention. A recent performance study shows that using this feature combined with pass-through mechanisms such as SR-IOV and DirectPath I/O helps to achieve near-native performance in terms of both response time and jitter.

The paper explains major sources of latency increase due to virtualization in vSphere and presents details of how the latency-sensitivity feature improves performance along with evaluation results of the feature. It also presents some best practices that were concluded from the performance evaluation.

For more information, please read the full paper: Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5.

 

VMware Horizon View 5.2 Performance & Best Practices and A Performance Deep Dive on Hardware Accelerated 3D Graphics

VMware Horizon View 5.2 simplifies desktop and application management while increasing security and control and delivers a personalized high fidelity experience for end-users across sessions and devices. It enables higher availability and agility of desktop services unmatched by traditional PCs while reducing the total cost of desktop ownership and end-users can enjoy new levels of productivity and the freedom to access desktops from more devices and locations while giving IT greater policy control.

Recently, we published two whitepapers to provide a performance deep-dive on Horizon View 5.2 performance and hardware accelerated 3D graphics (vSGA) feature. The links to these whitepapers are as follows:

* VMware Horizon View 5.2 Performance and Best Practices
* VMware Horizon View 5.2 and Hardware Accelerated 3D Graphics

The first whitepaper describes View 5.2 new features, including access of View desktops with Horizon, space efficient sparse (SEsparse) disks, hardware accelerated 3D graphics, and full support of Windows 8 desktops. View 5.2 performance improvements in PCoIP and View management are highlighted. In addition, this paper presents View 5.2 PCoIP performance results, Windows 8 and RDP 8 performance analysis, and a vSGA performance analysis, including how vSGA compares to the software renderer support introduced in View 5.1.

The second whitepaper goes in-depth on the support for hardware accelerated 3D graphics that debuted with VMware vSphere 5.1 and VMware Horizon View 5.2 and presents performance and consolidation results for a number of different workloads, ranging from knowledge workers using 3D desktops to performance-intensive CAD-based workloads. Because the intensity of a 3D workload will vary greatly from user to user and application to application, rather than highlighting specific case studies, we demonstrate how the solution efficiently scales for both light- and heavy-weight 3D workloads, until GPU or CPU resources are fully utilized. This paper also presents key best practices to extract peak performance from a 3D View 5.2 deployment.

1millionIOPS On 1VM

Last year at VMworld 2011 we presented one million I/O operations per second (IOPS) on a single vSphere 5 host (link).  The intent was to demonstrate vSphere 5′s performance by using mutilple VMs to drive an aggregate load of one million IOPS through a single server.   There has recently been some interest in driving similar I/O load through a single VM.  We used a pair of Violin Memory 6616 flash memory arrays, which we connected to a two-socket HP DL380 server, for some quick experiments prior to VMworld.  vSphere 5.1 was able to demonstrate high performance and I/O efficiency by exceeding one million IOPS, doing so with only a modest eight-way VM.  A brief description of our configuration and results is given below.

Configuration:
Hypervisor: vSphere 5.1
Server: HP DL380 Gen8
CPU: 2 x Intel Xeon E5-2690, HyperThreading disabled
Memory: 256GB
HBAs: 5 x QLE2562
Storage: 2 x Violin Memory 6616 Flash Memory Arrays
VM: Windows Server 2008 R2, 8 vCPUs and 48GB.
Iometer Config: 4K IO size w/ 16 workers

Results:
Using the above configuration we achieved 1055896 total sustained IOPS.  Check out the following short video clip from one of our latest runs.

Look out for a more thorough write-up after VMworld.

 

Performance Implications of Storage I/O Control-Enabled NFS Datastores

Storage I/O Control (SIOC) allows administrators to control the amount of access virtual machines have to the I/O queues on a shared datastore. With this feature, administrators can ensure that a virtual machine running a business-critical application has a higher priority to access the I/O queue than that of other virtual machines sharing the same datastore. In vSphere 4.1, SIOC was supported on VMFS-based datastores that used SAN with iSCSI and Fibre Channel. In vSphere 5, SIOC support has been extended to NFS-based datastores.

Recent tests conducted at VMware Performance Engineering lab studied the following aspects of SIOC:

  • The performance impact of SIOC: A fine-grained access management of the I/O queues resulted in a 10% improvement in the response time of the workload used for the tests.
  • SIOC’s ability to isolate the performance of applications with a smaller request size: Some applications like Web and media servers use I/O patterns with a large request size (for example, 32K). But some other applications like OLTP databases request smaller I/Os ≤8K. Test findings show that SIOC helped an OLTP database workload to achieve higher performance when sharing the underlying datastore with a workload that used large-sized I/O requests.
  • The intelligent prioritization of I/O resources: SIOC monitors virtual machines’ usage of the I/O queue at the host and dynamically redistributes any unutilized queue slots to those virtual machines that need them. Tests show that this process happens consistently and reliably.

For the full paper, see Performance Implications of Storage I/O Control–Enabled NFS Datastores in VMware vSphere 5

 

VMware View & PCoIP at VMworld

In recent weeks there’s been growing excitement about the PCoIP enhancements coming to VMware View. For instance, Warren Ponder discussed here how these enhancements reduce bandwidth consumption by up to 75%. Engineers from VMware’s performance team (& Warren) will be talking more about these enhancements and how they translate into real-world performance at the rapidly approaching VMworld 2011 in Las Vegas:

EUC1987: VMware View PC-over-IP Performance and Best Practices
Tuesday, August 30th 12:00
Wednesday, August 31st 1:00

EUC3163: VMware View Performance and Best Practices
Tuesday, August 30th – 4:30
Wednesday, August 31st – 4:00

We will also be blogging additional details and performance results as VMworld progresses, followed by a performance whitepaper.

Stay tuned!