
Line-Rate Performance with 80GbE and vSphere 5.5

With the increasing number of physical cores in a system, the networking bandwidth requirement per server has also increased. We often find many networking-intensive applications are now being placed on a single server, which results in a single vSphere server requiring more than one 10 Gigabit Ethernet (GbE) adapter. Additional network interface cards (NICs) are also deployed to separate management traffic and the actual virtual machine traffic. It is important for these servers to service the connected NICs well and to drive line rate on all the physical adapters simultaneously.

vSphere 5.5 supports eight 10GbE NICs on a single host, and we demonstrate that a host running with vSphere 5.5 can not only drive line rate on all the physical NICs connected to the system, but can do it with a modest increase in overall CPU cost as we add more NICs.

We configured a single host with four dual-port Intel 10GbE adapters for the experiment and connected them back-to-back with an Ixia Application Network Processor Server with eight 10GbE ports to generate traffic. We then measured the send/receive throughput and the corresponding CPU usage of the vSphere host as we increased the number of NICs under test on the system.

Environment Configuration

  • System Under Test: Dell PowerEdge R820
  • CPUs: 4 x Intel Xeon Processors E5-4650 @ 2.70GHz
  • Memory: 128GB
  • NICs: 8 x Intel 82599EB 10GbE, SFP+ Network Connection
  • Client: Ixia Xcellon-Ultra XT80-V2, 2U Application Network Processor Server

Challenges in Getting 80Gbps Throughput

To drive near 80 gigabits of data per second from a single vSphere host, we used a server that has not only the required CPU and memory resources, but also the PCI bandwidth that can perform the necessary I/O operations. We used a Dell PowerEdge Server with an Intel E5-4650 processor because it belongs to the first generation of Intel processors that supports PCI Gen 3.0. PCI Gen 3.0 doubles the PCI bandwidth capabilities compared to PCI Gen 2.0. Each dual-port Intel 10GbE adapter needs at least a PCI Gen 2.0 x8 to reach line rate. Also, the processor has Intel Data Direct I/O Technology where the packets are placed directly in the processor cache rather than going to the memory. This reduces the memory bandwidth consumption and also helps reduce latency.
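The slot-width requirement can be checked with back-of-the-envelope arithmetic. The sketch below assumes the standard per-lane signaling rates and encodings (5 GT/s with 8b/10b for Gen 2.0, 8 GT/s with 128b/130b for Gen 3.0) and ignores PCIe packet overhead, so real usable bandwidth is somewhat lower. The function names are illustrative and do not come from the test setup.

```python
# Raw per-direction PCIe slot bandwidth, ignoring protocol (TLP) overhead.
# Rates and encodings are the standard values for each generation.
PCIE_GENERATIONS = {
    "2.0": {"gt_per_s": 5.0, "encoding": 8 / 10},     # 8b/10b encoding
    "3.0": {"gt_per_s": 8.0, "encoding": 128 / 130},  # 128b/130b encoding
}


def slot_bandwidth_gbps(generation: str, lanes: int) -> float:
    """Return the raw bandwidth of a slot in Gb/s, per direction."""
    gen = PCIE_GENERATIONS[generation]
    return gen["gt_per_s"] * gen["encoding"] * lanes


def can_carry(generation: str, lanes: int, ports: int, port_gbps: float = 10.0) -> bool:
    """True if the slot's raw bandwidth covers all NIC ports at line rate."""
    return slot_bandwidth_gbps(generation, lanes) >= ports * port_gbps


if __name__ == "__main__":
    for generation, lanes in [("2.0", 4), ("2.0", 8), ("3.0", 8)]:
        bandwidth = slot_bandwidth_gbps(generation, lanes)
        ok = can_carry(generation, lanes, ports=2)
        print(f"Gen {generation} x{lanes}: {bandwidth:.1f} Gb/s, dual-port 10GbE fits: {ok}")
```

Under these assumptions a Gen 2.0 x4 slot (16 Gb/s) falls short of the 20 Gb/s a dual-port adapter needs, while a Gen 2.0 x8 slot (32 Gb/s) covers it, and a Gen 3.0 x8 slot offers roughly twice that.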

Experiment Overview

Each 10GbE port of the vSphere 5.5 server was configured with a separate vSwitch, and each vSwitch had two Red Hat 6.0 Linux virtual machines running an instance of Apache web server. The web server virtual machines were configured with 1 vCPU and 2GB of memory with VMXNET3 as the virtual NIC adapter.  The 10GbE ports were then connected to the Ixia Application Server port. Since the server had two x16 slots and five x8 slots, we used the x8 slots for the four 10GbE NICs so that each physical NIC had identical resources. For each physical connection, we then configured 200 web/HTTP connections, 100 for each web server, on an Ixia server that requested or posted the file. We used a high number of connections so that we had enough networking traffic to keep the physical NIC at 100% utilization.

Figure 1. System design of NICs, switches, and VMs

The Ixia Xcellon application server used an HTTP GET request to generate a send workload for the vSphere host. Each connection requested a 1MB file from the HTTP web server.

Figure 2 shows that we could consistently get the available[1] line rate for each physical NIC as we added more NICs to the test. Each physical NIC was transmitting 120K packets per second and the average TSO packet size was close to 10K. The NIC was also receiving 400K packets per second for acknowledgements on the receive side. The total number of packets processed per second was close to 500K for each physical connection.

Figure 2. vSphere 5.5 drives throughput at available line rates. TSO on the NIC resulted in lower packets per second for send.

Similar to the send case, we configured the application server to post a 1MB file using an HTTP POST request to generate receive traffic for the vSphere host. We used the same number of connections and observed similar throughput behavior. Because the NIC does not support hardware LRO, each NIC received about 800K packets per second. With eight 10GbE NICs, the packet rate reached close to 6.4 million packets per second. VMware performs software LRO for Linux, and as a result we see large packets in the guest. The guest packet rate is around 240K packets per second. There was also significant TCP acknowledgement traffic: the host transmitted close to 120K acknowledgement packets per second for each physical NIC, bringing the total packets processed close to 7.5 million packets per second for eight 10GbE ports.
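The receive-side packet totals follow directly from the per-NIC rates quoted above. The short sketch below reproduces that arithmetic; the constant and function names are illustrative, and the inputs are the rounded figures from the text rather than raw measurements.

```python
# Per-NIC packet rates quoted in the text (rounded, packets per second).
NICS = 8
RX_DATA_PPS_PER_NIC = 800_000  # received data packets (no hardware LRO)
TX_ACK_PPS_PER_NIC = 120_000   # transmitted TCP acknowledgements


def aggregate_pps(nics: int, per_nic_pps: int) -> int:
    """Scale a per-NIC packet rate to the whole host."""
    return nics * per_nic_pps


rx_total = aggregate_pps(NICS, RX_DATA_PPS_PER_NIC)
ack_total = aggregate_pps(NICS, TX_ACK_PPS_PER_NIC)
host_total = rx_total + ack_total

print(f"received data packets: {rx_total / 1e6:.2f} M/s")
print(f"transmitted ACKs:      {ack_total / 1e6:.2f} M/s")
print(f"total processed:       {host_total / 1e6:.2f} M/s")
```

With the rounded inputs the total comes to about 7.4 million packets per second, in line with the "close to 7.5 million" figure reported for eight ports.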

Figure 3. Average vSphere 5.5 host CPU utilization for send and receive

We also measured the average CPU reported for each of the tests. Figure 3 shows that the vSphere host’s CPU usage increased linearly as we added more physical NICs to the test for both send and receive. This indicates that performance improves at an expected and acceptable rate.

Test results show that vSphere 5.5 is an excellent platform on which to deploy networking-intensive workloads. vSphere 5.5 makes use of all the physical bandwidth capacity available, and its CPU cost grows only linearly as NICs are added.

 


[1] A 10GbE NIC can achieve only about 9.4 Gbps of throughput with the standard MTU. For a 1500-byte packet, 40 bytes are used by the TCP/IP headers, and a further 38 bytes on the wire are used by the Ethernet frame format.
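The footnote's arithmetic can be written out as a small calculation. This is a sketch using only the byte counts given in the footnote; the function name is illustrative, and the 52-byte variant assumes the TCP timestamps option is enabled (12 extra header bytes), which the post does not state.

```python
def tcp_goodput_gbps(
    link_gbps: float = 10.0,
    mtu: int = 1500,
    tcp_ip_header: int = 40,      # IPv4 + TCP headers, no options
    ethernet_overhead: int = 38,  # framing bytes on the wire per packet
) -> float:
    """Payload throughput achievable on a link running at line rate."""
    payload = mtu - tcp_ip_header
    on_wire = mtu + ethernet_overhead
    return link_gbps * payload / on_wire


print(f"no TCP options:      {tcp_goodput_gbps():.2f} Gbps")
# Assumption: timestamps enabled, adding 12 bytes to the TCP header.
print(f"with TCP timestamps: {tcp_goodput_gbps(tcp_ip_header=52):.2f} Gbps")
```

The payload fraction is 1460/1538, or roughly 95% of the link rate, and slightly less once TCP options are in use, which is why the achievable line rate is quoted as about 9.4 Gbps rather than 10 Gbps.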

Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5

VMware vSphere ensures that virtualization overhead is minimized so that it is not noticeable for a wide range of applications, including most business-critical applications such as database systems, Web applications, and messaging systems. vSphere also works well for applications with millisecond-level latency constraints, including VoIP services. However, the performance demands of latency-sensitive applications with very low latency requirements, such as distributed in-memory data management, stock trading, and high-performance computing, have long been thought to be incompatible with virtualization.

vSphere 5.5 includes a new feature for setting latency sensitivity in order to support virtual machines with strict latency requirements. This per-VM feature allows virtual machines to exclusively own physical cores, thus avoiding overhead related to CPU scheduling and contention. A recent performance study shows that using this feature combined with pass-through mechanisms such as SR-IOV and DirectPath I/O helps to achieve near-native performance in terms of both response time and jitter.

The paper explains the major sources of added latency due to virtualization in vSphere and presents details of how the latency-sensitivity feature improves performance, along with evaluation results for the feature. It also presents some best practices derived from the performance evaluation.

For more information, please read the full paper: Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5.

 

Simulating different VDI users with View Planner 3.0

VDI benchmarking is hard. The challenge is getting a good representation, or simulation, of VDI users. If we look closely at typical office users, we see a spectrum: at one end, the user may be using a few simple Microsoft Office applications at a relatively moderate speed, whereas at the other end, the user may be running CPU-heavy multimedia applications and switching between many applications much faster. We classify the fast user as the power user or “heavy” user, and the user at the other end of the spectrum as the task worker or “light” user. Between these two ends of the spectrum, we define one more category, the “medium” user.

To simulate these different categories of users and to make the job of VDI benchmarking much easier, we built VMware View Planner 3.0, which simulates a workload representative of many user-initiated operations that take place in a typical VDI environment. The tool simulates typical office applications such as PowerPoint, Outlook, and Word, as well as Adobe Reader, the Internet Explorer Web browser, multimedia applications, and so on. The tool can be downloaded from: http://www.vmware.com/products/desktop_virtualization/view-planner/overview.html.

If we look at the three categories of VDI users outlined above, one of the main differentiating factors across this gamut of VDI users is how fast they act, and this is simulated using the concept of “think time” in the View Planner tool. The tool uses the thinktime parameter to sleep for a random interval before starting the next application operation. For the heavy user, the value of thinktime is kept very low at 2 seconds. This means that operations happen very fast: users switch across different applications or perform operations in an application every 2 seconds on average. The View Planner 3.0 benchmark defines a score, called “VDImark,” which is based on this “heavy” user workload profile. For a medium user, the think time is set to 5 seconds, and for a light user, the think time is set to 10 seconds. The heavy VDI user also uses a bigger screen resolution compared to the medium or light user. The simulation of these categories of users in the View Planner tool is summarized in the table below:

In order to show the capability of View Planner 3.0 to determine the sizing for VDI user VMs per host, we ran a flexible mode of View Planner 3.0, which allowed us to create medium and light user workloads (the heavy workload profile pre-exists), as well as to understand the user density for different types of VDI users for a given system. The flexible mode will be available soon through the Professional Services Organization (PSO) and to selected partners.

The experimental setup we used to compare these different user profiles is shown below:

In this test, we want to determine how many VMs can be run on the system while each VM is performing its heavy, medium, or light user profiles. In order to do this, we need to set a baseline of acceptable performance, which is defined by the quality of service (QoS) as defined in the View Planner user guide. The number of VMs that passed the QoS score is shown in the chart below.

The chart shows that we can run about 53 VMs for the heavy user (VDImark), 67 VMs for the medium user, and 91 VMs for the light user. So, we could consolidate about 25% more desktops if we used this system to host users with medium workloads instead of heavy workloads, and about 35% more desktops if we hosted users with light workloads instead of medium workloads. It is therefore crucial to fully specify the user profile whenever we talk about user density.
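The consolidation percentages can be recomputed from the VM counts in the chart. The sketch below uses only those three approximate counts; the dictionary and function names are illustrative.

```python
# Approximate number of VMs that passed QoS for each user profile.
PASSING_VMS = {"heavy": 53, "medium": 67, "light": 91}


def density_gain_pct(baseline: str, target: str, vms=PASSING_VMS) -> float:
    """Percentage of extra desktops when hosting `target` users instead of `baseline`."""
    return (vms[target] / vms[baseline] - 1) * 100


for baseline, target in [("heavy", "medium"), ("medium", "light"), ("heavy", "light")]:
    gain = density_gain_pct(baseline, target)
    print(f"{target} instead of {baseline}: {gain:.0f}% more desktops")
```

From these rounded counts the gains work out to roughly 26% and 36%, consistent with the approximate figures quoted above, and to roughly 72% when comparing light users directly with heavy users.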

In this blog, we demonstrated how we used the View Planner 3.0 flexible mode to run different user profiles and to understand the user density for a system under test. If you have any questions or want to know more about View Planner, you can reach out to the team at viewplanner-info@vmware.com.

IPv6 performance improvements in vSphere 5.5

Many of our customers use IPv6 networks in their datacenters for a variety of reasons. We expect that many more will transition from IPv4 to IPv6 to reap the large address range and other benefits that IPv6 provides. Keeping this in mind, we have worked on a number of performance enhancements for the way that vSphere 5.5 manages IPv6 network traffic. Some new features that we have implemented include:

• TCP Checksum Offload: For Network Interface Cards (NICs) that support this feature, the computation of the TCP checksum of the IPv6 packet is offloaded to the NIC.

• Software Large Receive Offload (LRO): LRO is a technique of aggregating multiple incoming packets from a single stream into a larger buffer before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed and saving CPU. Many NICs do not support LRO for IPv6 packets in hardware. For such NICs, we implement LRO in the vSphere network stack.

• Zero-Copy Receive: This feature prevents an unnecessary copy from the packet frame to a memory space in the vSphere network stack. Instead, the frame is processed directly.

vSphere 5.1 offers the same features, but only for IPv4. So, in vSphere 5.1, services such as vMotion, NFS, and Fault Tolerance had lower bandwidth in IPv6 networks when compared to IPv4 networks. vSphere 5.5 solves that problem—it delivers similar performance over both IPv4 and IPv6 networks. A seamless transition from IPv4 to IPv6 is now possible.

Next, we demonstrate the performance of vMotion over a 40Gb/s network connecting two vSphere hosts. We also demonstrate the performance of networking traffic between two virtual machines created on the vSphere hosts.

System Configuration
We set up a test environment with the following specifications:

• Servers: 2 Dell PowerEdge R720 servers running vSphere 5.5.
• CPU: 2-socket, 12-core Intel Xeon E5-2667 @ 2.90 GHz.
• Memory: 64GB memory; 32GB spread across two NUMA nodes.
• Networking: 1 dual-port Intel 10GbE and 1 dual-port Broadcom 10GbE adapter placed on separate PCI Gen-2 x8 lanes in both machines. We thus had 40Gb/s of network connectivity between the two vSphere hosts.
• Virtual Machine for vMotion: 1 VM running Red Hat Enterprise Linux Server 6.3 assigned 2 virtual CPUs (vCPUs) and 48GB memory. We migrate this VM between the two vSphere hosts.
• Virtual Machines for networking tests: A pair of VMs running Red Hat Enterprise Linux server 6.3, assigned 4 vCPUs and 16GB memory, on each host. We use these VMs to test the performance of networking traffic between two VMs.

We configured each vSphere host with four vSwitches, each vSwitch having one 10GbE uplink port. We created one VMkernel adapter on each vSwitch. Each VMkernel adapter was configured on the same subnet. The MTU of the NICs was set to the default of 1500 bytes. We enabled each VMkernel adapter for vMotion, which allowed vMotion traffic to use the 40Gb/s network connectivity. We created four VMXNET3 virtual adapters on the pair of virtual machines used for networking tests.

Methodology
In order to demonstrate the performance for vMotion, we simulated a heavy memory usage footprint in the virtual machine. The memory-intensive program allocated 48GB memory in the virtual machine and touched one byte in each page in an infinite loop. We migrated this virtual machine between the two vSphere hosts over the 40Gb/s network. We used net-stats to monitor network throughput and CPU utilization on the sending and receiving systems. We also noted the bandwidth achieved in each pre-copy iteration of vMotion from VMkernel logs.

To demonstrate the performance of virtual machine networking traffic, we used Netperf 2.6.0 to simulate traffic from one virtual machine to the other. We created two connections for each virtual adapter. Each connection generated traffic for the TCP_STREAM workload, with a 16KB message size and a 256KB socket buffer size. As in the previous experiment, we used net-stats to monitor network throughput and CPU utilization.
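For readers who want to reproduce a similar workload, the sketch below builds a netperf command line with the message and socket buffer sizes described above. The exact command used in the tests is not given in the post, so the target address, the test duration, and the helper name are assumptions; the flags themselves (-H, -t, -l, and the test-specific -m, -s, and -S) are standard netperf options.

```python
import shlex


def netperf_tcp_stream_cmd(
    remote_host: str,
    duration_s: int = 60,                   # assumption: duration is not stated in the post
    message_bytes: int = 16 * 1024,         # 16KB send message size
    socket_buffer_bytes: int = 256 * 1024,  # 256KB socket buffers
) -> list[str]:
    """Build the argument list for one netperf TCP_STREAM connection."""
    return [
        "netperf",
        "-H", remote_host,               # host running netserver
        "-t", "TCP_STREAM",              # bulk-transfer throughput test
        "-l", str(duration_s),           # test length in seconds
        "--",                            # test-specific options follow
        "-m", str(message_bytes),        # send message size
        "-s", str(socket_buffer_bytes),  # local socket buffer size
        "-S", str(socket_buffer_bytes),  # remote socket buffer size
    ]


# Hypothetical receiver address (from the documentation-only range).
cmd = netperf_tcp_stream_cmd("192.0.2.10")
print(shlex.join(cmd))
```

Two such connections per virtual adapter, each pointed at the address of the matching adapter on the receiving VM, would approximate the traffic pattern described in the methodology.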

Results
Figures 1 and 2 show, for IPv4 and IPv6 traffic, the network throughput and CPU utilization data that we collected over the 40-second duration of the migration. After the guest memory is staged for migration, vMotion begins iterations of pre-copying the memory contents from the source vSphere host to the destination vSphere host.

In the first iteration, the destination vSphere host needs to allocate pages for the virtual machine. Network throughput is below the available bandwidth in this stage as vMotion bandwidth usage is throttled by the memory allocation on the destination host. The average network bandwidth during this phase was 1897 megabytes per second (MB/s) for IPv4 and 1866MB/s for IPv6.

After the first iteration, the source vSphere host sends the delta of changed pages. During this phase, the average network bandwidth was 4301MB/s with IPv4 and 4091MB/s with IPv6.

The peak measured bandwidth in net-stats was 34.5Gb/s for IPv4 and 32.9Gb/s for IPv6. The CPU utilization of both systems followed a similar trend for both IPv4 and IPv6. Note that vMotion is very CPU intensive on the receiving vSphere host, and a high CPU clock speed is necessary to achieve high bandwidth. The results are summarized in Table 1. In all, migration of the virtual machine completed in 40 seconds regardless of IPv4 or IPv6 connectivity.
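To relate the pre-copy averages (reported in MB/s) to the link speed (in Gb/s), and to quantify the IPv4 versus IPv6 gap, the following sketch converts the figures quoted above. It assumes 1 MB means 10^6 bytes; the names are illustrative.

```python
# Average pre-copy bandwidth quoted in the text, in MB/s.
PRECOPY_MB_PER_S = {
    "first iteration": {"IPv4": 1897, "IPv6": 1866},
    "later iterations": {"IPv4": 4301, "IPv6": 4091},
}


def mbytes_to_gbits(mb_per_s: float) -> float:
    """Convert MB/s to Gb/s, assuming decimal units (1 MB = 10**6 bytes)."""
    return mb_per_s * 8 / 1000


def ipv6_gap_pct(ipv4: float, ipv6: float) -> float:
    """How far IPv6 bandwidth falls below IPv4, as a percentage of IPv4."""
    return (ipv4 - ipv6) / ipv4 * 100


for phase, rates in PRECOPY_MB_PER_S.items():
    v4, v6 = rates["IPv4"], rates["IPv6"]
    print(
        f"{phase}: IPv4 {mbytes_to_gbits(v4):.1f} Gb/s, "
        f"IPv6 {mbytes_to_gbits(v6):.1f} Gb/s, "
        f"gap {ipv6_gap_pct(v4, v6):.1f}%"
    )
```

Under that assumption the later iterations average about 34.4Gb/s for IPv4 and 32.7Gb/s for IPv6, just below the measured peaks, and the IPv6 gap stays under 5% in both phases.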

Figure 1. vMotion over an IPv4 network
Figure 2. vMotion over an IPv6 network

Table 1. vMotion results—IPv4 versus IPv6

The results for virtual machine networking traffic are in Table 2. While the throughput with IPv6 is about 2.5% lower, the CPU utilization is the same on both the sending and receiving sides.

Table 2. Virtual machine networking results—IPv4 versus IPv6

Thanks to a number of IPv6 enhancements added to vSphere 5.5, migrations with vMotion over IPv6 networks run at speeds within 5% of those over IPv4 networks. For virtual machine networking performance, the throughput of IPv6 is within 2.5% of IPv4. In addition, testing shows that we can drive bandwidth close to 40Gb/s link speeds with both protocols. Combined, this functionality allows for a seamless transition from IPv4 to IPv6 with little performance impact.

VMware vSphere 5.5 Host Power Management (HPM) saves more power and improves performance

VMware recently released a white paper on the power and performance improvements in the Host Power Management (HPM) feature in vSphere 5.5. With the new improvements in HPM, one can save significant power and still gain performance in many common scenarios. Using industry-standard SPEC benchmarks, the paper shows that power savings of up to 20% can be achieved in vSphere 5.5. The paper also describes some of the best practices to follow when using HPM.

One experiment indicates that you can get around a 10% increase in performance in vSphere 5.5 when deep C-states (greater than C1/halt, e.g., C3 and C6) are enabled along with turbo mode.

For more interesting results and data, please read the full paper.

Note: HPM works at the level of a single host, as opposed to DPM, which works on a cluster of hosts.

VMware vSphere 5.5 Performs Well Running Big Data Scenario with Greenplum

VMware recently released a white paper on the performance and best practices of running a Pivotal Greenplum database cluster in virtual machines. The paper reports the results of two studies. In each study, six physical machines are used in the Greenplum cluster. Three different big data workloads are run on the physical machines, and then on virtual machines in the same configuration.

One experiment compares a physical setup to a virtual configuration for running Greenplum segment servers, one per host. The response times of all the workloads in the virtual environment are within 6% of those measured in the physical environment.

Another test shows the performance impact of deploying multiple, smaller virtual machines instead of a single, large virtual machine on each segment host. The results from this test show that vSphere 5.5 provides a reduction of 11% in workload process time when compared to the same hardware configuration in a physical environment. The main performance gain occurs when each smaller virtual machine fits into a NUMA node on the physical host. For more information, please read the full paper: Greenplum Database Performance on VMware vSphere 5.5.

VDI Benchmarking with View Planner 3.0

Recently we announced the general availability of VMware View Planner 3.0 as a VDI benchmark. VMware View Planner is a tool designed to simulate a large-scale deployment of virtualized desktop systems. This is achieved by generating a workload representative of many user-initiated operations that take place in a typical VDI environment. The results allow us to study the effects on an entire virtualized infrastructure. The tool can be downloaded from http://www.vmware.com/products/desktop_virtualization/view-planner/overview.html

In this blog, we present a high-level overview of the View Planner benchmark and some of its use cases. Finally, we present a simple storage scaling use case using a flash memory storage array from Violin Memory, who has partnered with us during the validation of this new benchmark.

With version 3.0, View Planner can be run as a benchmark which will help VMware partners and customers to precisely characterize and compare both the software and hardware solutions in their VDI environments. Using View Planner’s comprehensive standard methodology, VDI architects can compare and contrast different layers of the VDI stack including processor architectures; the results can be used to objectively show the performance improvements of the next generation chipset in contrast with the current generation. In addition, various storage solutions like hybrid, all-flash, and vSAN can be compared with different SAN configurations for a given hardware setup.

View Planner 3.0 provides a number of features, which include:

• Application-centric benchmarking of real-world workloads
• Unique and patented client-side performance measurement technology to better understand the end user experience
• High precision scoring methodology for repeatability
• Benchmark metrics to highlight infrastructure efficiencies: density, performance, and economics
• Support for latest VMware vSphere and Horizon View versions
• Better automation and stats reporting for ease of use and performance analysis
• Auto-generated PDF reports providing a summary of the run

View Planner Scoring
The View Planner score is represented as VDImark. This metric encapsulates the number of VDI users that can be run on a given system with application response time less than the set threshold. Hence, the scoring is based on several factors such as the response time of the operations, compliance of the setup and configurations, and so on.

For response time characterization, View Planner operations are divided into three main groups: (1) Group A for interactive operations, (2) Group B for I/O operations, and (3) Group C for background operations. The score is determined separately for Group A user operations and Group B user operations by calculating the 95th percentile latency of all the operations in a group. The default thresholds are 1.0 seconds for Group A and 6.0 seconds for Group B. Please refer to the user guide, and the run and reporting guides for more details.
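The scoring rule above can be illustrated with a short sketch that computes the 95th percentile latency for each scored group and compares it with the default thresholds. This is not View Planner's implementation: the nearest-rank percentile method, the data layout, and the function names are assumptions made for illustration.

```python
# Default thresholds from the text, in seconds, for the two scored groups.
THRESHOLDS_S = {"A": 1.0, "B": 6.0}


def percentile_95(latencies: list[float]) -> float:
    """95th percentile using the nearest-rank method (an assumed choice)."""
    if not latencies:
        raise ValueError("no latency samples")
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def passes_qos(latencies_by_group: dict[str, list[float]]) -> bool:
    """True if every scored group's 95th percentile is below its threshold."""
    return all(
        percentile_95(latencies_by_group[group]) < limit
        for group, limit in THRESHOLDS_S.items()
    )


# Made-up samples: 5% slow outliers in Group A do not fail the run.
sample = {"A": [0.2] * 95 + [3.0] * 5, "B": [2.0] * 100}
print(passes_qos(sample))
```

Because the check uses the 95th percentile rather than the maximum, a small fraction of slow operations is tolerated, while a run in which more than 5% of Group A operations exceed one second would fail.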

View Planner Benchmarking Use Cases
As mentioned earlier, the View Planner 3.0 benchmark can be used to benchmark different CPU architectures, hosts, and storage architectures. Using the tool, vendors and partners can scale the VMs on a specific processor architecture to find out how many VMs per core can be supported, and the same can be done for different server host systems. Similarly, a storage system can be benchmarked to see how many VMs can be supported without a significant increase in I/O latency, and hence without degrading the user experience, for a given storage configuration. The tool can also be used to study the impact of different configurations and optimizations in different layers of both the software and the hardware stack. Next, we look at one such use case: storage scaling, by running View Planner VMs on multiple hosts.

Use Case Example: Storage Scaling
To illustrate one of the use cases of View Planner 3.0, we look at storage scaling aspects. In this experiment, we scale the number of hosts (3, 5, 6) and each host runs about 90 to 100 VMs. Then we see how the Violin storage array is able to scale with increasing IOPS requirements. We didn’t go beyond 6 hosts because of hardware availability. The experimental setup for this use case is shown below.

 

The host running the desktop VMs has 16 Intel Xeon E5-2690 cores running @ 2.9 GHz. The host has 256GB physical RAM, which is more than sufficient to run 90-100 1GB Win7 VMs. The desktop is connected to a Violin storage array using the Fibre Channel host bus adapter (FC HBA) on the host.

View Planner QoS

We ran 285 VMs (3 hosts), 480 VMs (5 hosts), and 560 VMs (6 hosts) and collected the View Planner response times. The resulting QoS is shown in the following figure.

In all the runs, the bar chart shows that both the Group A and Group B 95th-percentile response times are below their thresholds of 1 second and 6 seconds, respectively. We also don’t see much variation as we increased the number of hosts: the Violin storage easily copes with a greater number of desktop VMs and services their IOPS requirements even when the number of desktops is doubled. To further illustrate the Group A and Group B response times, we show the average response time of individual operations for these three runs for both Group A and Group B, as follows.

Group A Response Times

As seen in the figure above, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience. If we look at all three runs, we don’t see much variance in the response times; they remained almost constant when scaling up.

Group B Response Times

Since Group B is composed of I/O operations, this will provide good insight for storage-related experiments. In the bar chart shown above, we see that the latency of operations such as PPTx-Open, Word-Open, or Word-Save didn’t change much as we scaled from 285 VMs (3 hosts) to 560 VMs (6 hosts).

IOPS Requirements

The above chart shows the total IOPS seen by the Violin storage array while the benchmark was being executed. (This doesn’t include the IOPS from any management operations such as boot storms, virus scans, and so on.) For the 560 VM run, we see that the total IOPS from all the hosts goes up to 10k and then tapers down to about 6k in the steady state. So, as expected, the first iteration has a higher IOPS requirement than the steady state. We see similar behavior in the 285 VM and 480 VM runs, with peaks in the first iteration followed by steady IOPS usage in the steady-state iterations.

While we have presented one simple use case of storage scaling in this blog, View Planner 3.0 can be used for many use cases (CPU scaling, processor architecture comparison, host configurations, and so on) as mentioned earlier.
If you have any questions or want to know more about View Planner, you can reach out to the team at viewplanner-info@vmware.com.

If you are attending VMworld this year, please check out our session on “View Planner 3.0 as a benchmark”. Here are the session details:
TEX5760 – View Planner 3.0 as a VDI Benchmark
Tuesday: 3:30 PM
Banit Agrawal & Rishi Bidarkar

Comparing Storage Density, Power, and Performance with VMmark 2.5

Datacenters continue to grow as the use of both public and private clouds becomes more prevalent. A comprehensive review of density, power, and performance is becoming more crucial to understanding the tradeoffs when considering new storage technologies as a replacement for legacy solutions. Expanding on previous articles comparing storage technologies and the IOPS performance available when using flash-based storage, in this article we compare the density, power, and performance differences between traditional hard disk drives (HDDs) and flash-based storage. As might be expected, we found that the flash-based storage performed very well in comparison to the traditional hard disk drives. This article quantifies our findings.

In addition to VMmark’s previous performance measurement capability, VMmark 2.5 adds the ability to collect power measurements on servers and storage under test.  VMmark 2.5 is a multi-host virtualization consolidation benchmark that utilizes a combination of application workloads and infrastructure operations running simultaneously to model the performance of a cluster.  For more information on VMmark 2.5, see this overview.

Environment Configuration:
Hypervisor: VMware vSphere 5.1
Servers: Two x Dell PowerEdge R720
BIOS settings: High Performance Profile Enabled
CPU: Two x 2.9GHz Intel Xeon CPU-E5-2690
Memory: 192GB
HBAs: Two x 16Gb QLE2672 per system under test
Storage:
- HDD-Configuration: EMC CX3-80, 120 disks, 8 Trays, 1 SPE, 30U
- Flash-Based-Configuration: Violin Memory 6616, 64 VIMMs, 3U
Workload: VMware VMmark 2.5.1

Testing Methodology:
For this experiment we set up a vSphere 5.1 DRS-enabled cluster consisting of two identically configured Dell PowerEdge R720 servers. A series of VMmark 2.5 tests was then conducted on the cluster with the same VMs being moved to the storage configuration under test, progressively increasing the number of tiles until the cluster reached saturation. Saturation was defined as the point where the cluster was unable to meet the VMmark 2.5 quality-of-service (QoS) requirements. We selected the EMC CX3-80 and the Violin Memory 6616 as representatives of the previous generation of traditional HDD-based and flash-based storage, respectively. We would expect comparable arrays in these generations to have characteristics similar to what we measured in these tests. In addition to the VMmark 2.5 results, esxtop data was collected to provide further statistics. The HDD configuration running a single tile was used as the baseline, and all VMmark 2.5 results in this article (excluding raw Watts metrics, %CPU, and Latency) were normalized to that result.

Average Watts and VMmark 2.5 Performance Per Kilowatt Comparison:
For our comparison of the two technologies, the first point of evaluation was reviewing both the average watts required by the storage arrays and the corresponding VMmark 2.5 Performance Per Kilowatt (PPKW) score.  Note that the HDD configuration reached saturation at 7 tiles. In contrast, the Flash-based configuration was able to support a total of 9 tiles, while still meeting the quality of service requirements for VMmark 2.5.

As can be seen from the above graphs, the difference between the two technologies is extremely obvious.  The average watts drawn by the Flash-based configuration was nearly 50% less than the HDD configuration across all tiles tested.  Additionally, the PPKW score of the Flash-based configuration was on average 3.4 times higher than the HDD configuration, across all runs.

Application Score Comparison:
Due to the very large difference in PPKW, we decided to dig deeper into the potential root causes, beyond just the discrepancy in power consumed.  Because the application workloads exhibit random access patterns, as opposed to the sequential nature of infrastructure operations, we focused on the differences in application scores between the two configurations, as this is where we would expect to see the majority of the gains provided by the Flash-based configuration.

The difference between the scaling of the application workloads is quite obvious.  Although running the same number of tiles, and thus attempting the same amount of work, the flash-based configuration was able to produce application workload scores that were 1.9 times higher than the HDD configuration across 7 tiles.

CPU and Latency Comparison:
After exploring the power consumption and various areas of performance difference, we decided to look into two additional key components behind the performance improvements: CPU utilization and storage latency.


In our final round of data assessment we found that the CPU utilization of the flash-based storage was on average 1.53 times higher than the HDD configuration, across all 7 tiles. Higher CPU utilization might appear to be sub-optimal; however, we determined that the systems were waiting less time for I/O to complete and were thus getting more work done. This is especially visible when reviewing the storage latencies of the two configurations. The flash-based configuration showed extremely flat latencies, which were on average less than one tenth of the HDD configuration’s latencies.

Finally, when comparing the physical space requirements of the two configurations, the flash-based storage was effectively 92% denser than the traditional HDD configuration (achieving 9 tiles in 3U versus 7 tiles in 30U). In addition to physical density advancements, the flash-based storage allowed for a 29% increase in the number of VMs run on the same server hardware, while maintaining the QoS requirements of VMmark 2.5.
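The density figures above can be reproduced with a little arithmetic. The sketch below is one plausible reading, not a confirmed derivation: it assumes the 92% figure refers to the reduction in rack space needed per tile, and that the 29% VM increase follows from the tile counts because each tile contains the same number of VMs. The variable and function names are illustrative only.

```python
# Figures quoted in the text: 9 tiles in 3U (flash) versus 7 tiles in 30U (HDD).
FLASH_TILES, FLASH_RACK_UNITS = 9, 3
HDD_TILES, HDD_RACK_UNITS = 7, 30


def rack_units_per_tile(tiles, rack_units):
    """Rack space (in U) consumed by the storage for each supported tile."""
    return rack_units / tiles


flash_u_per_tile = rack_units_per_tile(FLASH_TILES, FLASH_RACK_UNITS)
hdd_u_per_tile = rack_units_per_tile(HDD_TILES, HDD_RACK_UNITS)

# Assumed interpretation of "92% denser": less rack space per tile.
space_reduction = 1 - flash_u_per_tile / hdd_u_per_tile

# Tiles hold a fixed set of VMs, so the tile ratio equals the VM ratio.
tile_increase = FLASH_TILES / HDD_TILES - 1

print(f"Rack space per tile reduced by {space_reduction:.0%}")
print(f"Tiles (and VMs) supported increased by {tile_increase:.0%}")
```

Under these assumptions the two printed percentages round to the 92% and 29% quoted above.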

The flash-based storage showed wins across the board for power and performance. The flash-based storage consumed half the power while achieving over three times the performance. Although the initial costs of flash-based storage can be somewhat daunting when compared to traditional HDD storage, the reduction in power, increased density, and superior performance of the flash-based storage certainly seem to provide a strong argument for integrating the technology into future datacenters. VMmark 2.5 gives us the ability to look at the larger picture and make an informed decision across a wide variety of today’s concerns.

Power Management and Performance in ESXi 5.1

Powering and cooling are a substantial portion of datacenter costs. Ideally, we could minimize these costs by optimizing the datacenter’s energy consumption without impacting performance. The Host Power Management feature, which has been enabled by default since ESXi 5.0, allows hosts to reduce power consumption while boosting energy efficiency by putting processors into a low-power state when not fully utilized.

Power management can be controlled by either the BIOS or the operating system. In the BIOS, manufacturers provide several types of Host Power Management policies. Although they vary by vendor, most include “Performance,” which does not use any power saving techniques, “Balanced,” which claims to increase energy efficiency with minimal or no impact to performance, and “OS Controlled,” which passes power management control to the operating system. The “Balanced” policy is variably known as “Performance per Watt,” “Dynamic” and other labels; consult your vendor for details. If “OS Controlled” is enabled in the BIOS, ESXi will manage power using one of the policies “High performance,” “Balanced,” “Low power,” or “Custom.” We chose to study Balanced because it is the default setting.

But can the Balanced setting, whether controlled by the BIOS or ESXi, reduce performance relative to the Performance setting? We have received reports from customers who have had performance problems while using the BIOS-controlled Balanced setting. Without knowing the effect of Balanced on performance and energy efficiency, when performance is at a premium users might select the Performance policy to play it safe. To answer this question we tested the impact of power management policies on performance and energy efficiency using VMmark 2.5.

VMmark 2.5 is a multi-host virtualization benchmark that uses varied application workloads as well as common datacenter operations to model the demands of the datacenter. VMs running diverse application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview.

We tested three policies: the BIOS-controlled Performance setting, which uses no power management techniques, the ESXi-controlled Balanced setting (with the BIOS set to OS-Controlled mode), and the BIOS-controlled Balanced setting. The ESXi Balanced and BIOS-controlled Balanced settings cut power by reducing processor frequency and voltage among other power saving techniques.

We found that the ESXi Balanced setting did an excellent job of preserving performance, with no measurable performance impact at all levels of load. Not only was performance on par with expectations, but the setting also produced consistent improvements in energy efficiency, even while idle. By comparison, the BIOS Balanced setting aggressively saved power but created higher latencies and reduced performance. The following results detail our findings.

Testing Methodology
All tests were conducted on a four-node cluster running VMware vSphere 5.1. We compared performance and energy efficiency of VMmark between three power management policies: Performance, the ESXi-controlled Balanced setting, and the BIOS-controlled Balanced setting, also known as “Performance per Watt (Dell Active Power Controller).”

Configuration
Systems Under Test: Four Dell PowerEdge R620 servers
CPUs (per server): One Eight-Core Intel® Xeon® E5-2665 @ 2.4 GHz, Hyper-Threading enabled
Memory (per server): 96GB DDR3 ECC @ 1067 MHz
Host Bus Adapter: Two QLogic QLE2562, Dual Port 8Gb Fibre Channel to PCI Express
Network Controller: One Intel Gigabit Quad Port I350 Adapter
Hypervisor: VMware ESXi 5.1.0
Storage Array: EMC VNX5700
62 Enterprise Flash Drives (SSDs), RAID 0, grouped as 3 x 8 SSD LUNs, 7 x 5 SSD LUNs, and 1 x 3 SSD LUN
Virtualization Management: VMware vCenter Server 5.1.0
VMmark version: 2.5
Power Meters: Three Yokogawa WT210

Results
To determine the maximum VMmark load supported for each power management setting, we increased the number of VMmark tiles until the cluster reached saturation, which is defined as the largest number of tiles that still meet Quality of Service (QoS) requirements. All data points are the mean of three tests in each configuration and VMmark scores are normalized to the BIOS Balanced one-tile score.
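The averaging and normalization described above can be sketched as follows. This is a minimal illustration, not the tooling used for the study: the function name, the dictionary layout, and every raw score in the example are hypothetical placeholders chosen only to show the calculation.

```python
from statistics import mean


def normalized_means(runs_by_config, baseline_key):
    """Average the runs for each configuration, then divide by the baseline mean.

    runs_by_config maps a (policy, tile_count) pair to a list of raw scores.
    baseline_key names the entry that should normalize to exactly 1.0.
    """
    means = {key: mean(scores) for key, scores in runs_by_config.items()}
    baseline = means[baseline_key]
    return {key: value / baseline for key, value in means.items()}


# HYPOTHETICAL raw scores (three runs each); these are not measured results.
example_runs = {
    ("BIOS Balanced", 1): [10, 11, 12],
    ("ESXi Balanced", 1): [12, 13, 14],
}

normalized = normalized_means(example_runs, baseline_key=("BIOS Balanced", 1))
for (policy, tiles), score in normalized.items():
    print(f"{policy}, {tiles} tile(s): {score:.2f}")
```

Normalizing to a single baseline keeps every data point on one relative scale, so results for different policies and tile counts can be compared directly.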

Effects of Power Management on VMmark 2.5 score

The VMmark scores were equivalent between the Performance setting and the ESXi Balanced setting, with less than a 1% difference at all load levels. However, running on the BIOS Balanced setting reduced the VMmark scores by an average of 15%. On the BIOS Balanced setting, the environment was no longer able to support nine tiles and, even at low loads, on average, 31% of runs failed QoS requirements; only passing runs are pictured above.

We also compared the improvements in energy efficiency of the two Balanced settings against the Performance setting. The Performance per Kilowatt metric, which is new to VMmark 2.5, models energy efficiency as VMmark score per kilowatt of power consumed. More efficient results will have a higher Performance per Kilowatt.
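As a rough sketch of the metric as described here (score divided by kilowatts consumed), the calculation might look like the following. The function name and the example inputs are hypothetical, and the official VMmark 2.5 calculation may differ in details such as which components' power draw is included.

```python
def performance_per_kilowatt(vmmark_score, average_watts):
    """Return the score per kilowatt, given average power draw in watts."""
    if average_watts <= 0:
        raise ValueError("average_watts must be positive")
    return vmmark_score / (average_watts / 1000.0)


# HYPOTHETICAL inputs for illustration only; not taken from the test results.
example_score = 10.0
example_watts = 2000.0

ppkw = performance_per_kilowatt(example_score, example_watts)
print(f"Performance per Kilowatt: {ppkw:.2f}")
```

Because power is in the denominator, a configuration that delivers the same score on less power earns a proportionally higher Performance per Kilowatt.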

Effects of Power Management on Energy Efficiency

Two trends are visible in this figure. First, as expected, the Performance setting showed the lowest energy efficiency. At every load level, ESXi Balanced was about 3% more energy efficient than the Performance setting, despite the fact that it delivered an equivalent score to Performance. The BIOS Balanced setting had the greatest energy efficiency, with a 20% average improvement over Performance.

Second, an increase in load is correlated with greater energy efficiency. As the CPUs become busier, throughput increases at a faster rate than the required power. This can be understood by noting that an idle server will still consume power, but with no work to show for it. A highly utilized server is typically the most energy efficient per request completed, which is confirmed in our results. Higher energy efficiency creates cost savings in host energy consumption and in cooling costs.

The bursty nature of most environments leads them to sometimes idle, so we also measured each host’s idle power consumption. The Performance setting showed an average of 128 watts per host, while ESXi Balanced and BIOS Balanced consumed 85 watts per host. Although the Performance and ESXi Balanced settings performed very similarly under load, hosts using ESXi Balanced and BIOS Balanced power management consumed 33% less power while idle.
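The idle savings can be checked directly from the two wattage figures. The short sketch below uses only numbers stated in this article (128 W, 85 W, and the four-node cluster); the variable names are illustrative.

```python
# Idle power per host, as reported in the text.
PERFORMANCE_IDLE_WATTS = 128
BALANCED_IDLE_WATTS = 85  # same value for ESXi Balanced and BIOS Balanced
HOSTS_IN_CLUSTER = 4      # the test bed was a four-node cluster

watts_saved_per_host = PERFORMANCE_IDLE_WATTS - BALANCED_IDLE_WATTS
savings_fraction = watts_saved_per_host / PERFORMANCE_IDLE_WATTS
cluster_watts_saved = watts_saved_per_host * HOSTS_IN_CLUSTER

print(f"Saved per idle host: {watts_saved_per_host} W ({savings_fraction:.1%})")
print(f"Saved across the idle cluster: {cluster_watts_saved} W")
```

The computed fraction is roughly one third, in line with the 33% figure quoted above.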

VMmark 2.5 scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. We wanted to see how power management policies affected application latency as opposed to the VMmark score. All latencies are normalized to the lowest results.

Effects of Power Management on VMmark 2.5 Latencies

Whereas the Performance and ESXi Balanced latencies tracked closely, BIOS Balanced latencies were significantly higher at all load levels. Furthermore, latencies were unpredictable even at low load levels, and for this reason, 31% of runs between one and eight tiles failed; these runs are omitted from the figure above. For example, half of the BIOS Balanced runs did not pass QoS requirements at four tiles. These higher latencies were the result of aggressive power saving by the BIOS Balanced policy.

Our tests showed that ESXi’s Balanced power management policy didn’t affect throughput or latency compared to the Performance policy, but did improve energy efficiency by 3%. While the BIOS-controlled Balanced policy improved power efficiency by an average of 20% over Performance, it was so aggressive in cutting power that it often caused VMmark to fail QoS requirements.

Overall, the BIOS-controlled Balanced policy produced substantial efficiency gains but with unpredictable performance, failed runs, and reduced performance at all load levels. This policy may still be suitable for some workloads that can tolerate this unpredictability, but should be used with caution. On the other hand, the ESXi Balanced policy produced modest efficiency gains while doing an excellent job protecting performance across all load levels. These findings make us confident that the ESXi Balanced policy is a good choice for most types of virtualized applications.

VMware Horizon View 5.2 Performance & Best Practices and A Performance Deep Dive on Hardware Accelerated 3D Graphics

VMware Horizon View 5.2 simplifies desktop and application management while increasing security and control, and delivers a personalized, high-fidelity experience for end users across sessions and devices. It enables higher availability and agility of desktop services unmatched by traditional PCs while reducing the total cost of desktop ownership. End users can enjoy new levels of productivity and the freedom to access desktops from more devices and locations, while IT gains greater policy control.

Recently, we published two whitepapers that provide a deep dive into Horizon View 5.2 performance and the hardware accelerated 3D graphics (vSGA) feature. The links to these whitepapers are as follows:

* VMware Horizon View 5.2 Performance and Best Practices
* VMware Horizon View 5.2 and Hardware Accelerated 3D Graphics

The first whitepaper describes the new features in View 5.2, including access to View desktops with Horizon, space efficient sparse (SEsparse) disks, hardware accelerated 3D graphics, and full support of Windows 8 desktops. View 5.2 performance improvements in PCoIP and View management are highlighted. In addition, this paper presents View 5.2 PCoIP performance results, Windows 8 and RDP 8 performance analysis, and a vSGA performance analysis, including how vSGA compares to the software renderer support introduced in View 5.1.

The second whitepaper goes in-depth on the support for hardware accelerated 3D graphics that debuted with VMware vSphere 5.1 and VMware Horizon View 5.2 and presents performance and consolidation results for a number of different workloads, ranging from knowledge workers using 3D desktops to performance-intensive CAD-based workloads. Because the intensity of a 3D workload will vary greatly from user to user and application to application, rather than highlighting specific case studies, we demonstrate how the solution efficiently scales for both light- and heavy-weight 3D workloads, until GPU or CPU resources are fully utilized. This paper also presents key best practices to extract peak performance from a 3D View 5.2 deployment.