Monthly Archives: August 2011

Single vSphere Host, a Million I/O Operations per Second

One of the essential requirements for a platform supporting enterprise datacenters is the capability to support the extreme I/O demands of applications running in those datacenters. A previous study has shown that vSphere can easily handle demands for high I/O operations per second. Experiments discussed in a recently published paper strengthen this assertion further by demonstrating that a vSphere 5 virtual platform can easily satisfy an extremely high level of I/O demand that originates from the hosted applications.

Results from performance testing conducted in EMC's lab show that:

  • A single vSphere 5 host is capable of supporting a million+ I/O operations per second.
  • 300,000 I/O operations per second can be achieved from a single virtual machine.
  • I/O throughput (bandwidth consumption) scales almost linearly as the request size of an I/O operation increases.
  • I/O operations on vSphere 5 systems with Paravirtual SCSI (PVSCSI) controllers consume fewer CPU cycles than those with LSI Logic SAS virtual SCSI controllers.

For more details, refer to the paper Achieving a Million I/O Operations per Second from a Single VMware vSphere 5 Host.

Performance Implications of Storage I/O Control-Enabled NFS Datastores

Storage I/O Control (SIOC) allows administrators to control the amount of access virtual machines have to the I/O queues on a shared datastore. With this feature, administrators can ensure that a virtual machine running a business-critical application gets higher-priority access to the I/O queue than other virtual machines sharing the same datastore. In vSphere 4.1, SIOC was supported on VMFS-based datastores that used SAN storage with iSCSI or Fibre Channel. In vSphere 5, SIOC support has been extended to NFS-based datastores.
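Conceptually, SIOC behaves like a proportional-share scheduler for the datastore's device queue. The following minimal Python sketch is an illustrative model only, not VMware's implementation, and the share values and queue depth are made up:

```python
# Illustrative model of shares-based queue allocation; this is NOT
# VMware's SIOC implementation, and the share values are made up.

def allocate_queue_slots(shares_by_vm, queue_depth):
    """Divide device-queue slots among VMs in proportion to their shares."""
    total = sum(shares_by_vm.values())
    return {vm: queue_depth * share // total
            for vm, share in shares_by_vm.items()}

# A business-critical OLTP VM with 2000 shares vs. two 1000-share VMs
# contending for a 64-slot device queue:
print(allocate_queue_slots({"oltp": 2000, "web1": 1000, "web2": 1000}, 64))
# -> {'oltp': 32, 'web1': 16, 'web2': 16}
```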

Recent tests conducted in VMware's Performance Engineering lab studied the following aspects of SIOC:

  • The performance impact of SIOC: Fine-grained management of access to the I/O queues resulted in a 10% improvement in the response time of the workload used for the tests.
  • SIOC’s ability to isolate the performance of applications with a smaller request size: Applications such as web and media servers use I/O patterns with a large request size (for example, 32K), while others, such as OLTP databases, issue smaller I/Os (8K or less). Test findings show that SIOC helped an OLTP database workload achieve higher performance when sharing the underlying datastore with a workload that used large I/O requests.
  • The intelligent prioritization of I/O resources: SIOC monitors virtual machines’ usage of the I/O queue at the host and dynamically redistributes any unutilized queue slots to the virtual machines that need them. Tests show that this process happens consistently and reliably.

For the full paper, see Performance Implications of Storage I/O Control–Enabled NFS Datastores in VMware vSphere 5.


Multicast Performance on vSphere 5.0

Multicast is an efficient way of disseminating information and communicating over the network. A single sender can connect to multiple receivers and exchange information while conserving network bandwidth. Financial stock exchanges, multimedia content delivery networks, and commercial enterprises often use multicast as a communication mechanism. VMware virtualization takes multicast efficiency to the next level by enabling multiple receivers on a single ESXi host. Since the receivers are on the same host, the physical network does not have to transfer multiple copies of the same packet. Packet replication is carried out in the hypervisor instead.
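To make the sender/receiver model concrete, here is a minimal Python multicast receiver of the kind that could run inside each receiving VM; the group address and port below are placeholders chosen for illustration:

```python
# Minimal IPv4 multicast receiver; the group address and port are
# placeholders chosen for illustration.
import socket
import struct

GROUP, PORT = "239.1.1.1", 5007

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Join the multicast group on all interfaces; the host (or here, the
# hypervisor) now delivers a copy of each group packet to this socket.
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    data, sender = sock.recvfrom(2048)
    print("received %d bytes from %s" % (len(data), sender[0]))
```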

In releases of vSphere prior to 5.0, multicast packet replication was done in a single context. With a high VM density per host and high packet rates, this replication context could become a bottleneck and cause packet loss. ESXi 5.0 adds a new feature, splitRxMode, that splits the cost of replication across multiple physical CPUs, making vSphere 5.0 a highly scalable and efficient platform for multicast receivers. splitRxMode can be enabled on VMs that use the VMXNET3 virtual NIC. Fanning processing out to multiple contexts causes a slight increase in CPU consumption and is generally not needed for most systems, so the feature is disabled by default. VMware recommends enabling splitRxMode in situations where multiple VMs share a single physical NIC and receive a lot of multicast/broadcast packets.

To enable splitRxMode for the Ethernet device:

  1. Select the virtual machine you wish to change, then click Edit virtual machine settings.
  2. Under the Options tab, select General, then click Configuration Parameters.
  3. Look for ethernetX.emuRxMode. If it's not present, click Add Row and enter the new variable.
  4. Click the value and set it to 1 to enable the feature; the default value, 0, disables it.
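After the change, the virtual machine’s .vmx file contains an entry like the following (ethernet0, the first virtual NIC, is used here for illustration; substitute the index of the adapter you want to change):

```
ethernet0.emuRxMode = "1"
```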

Environment Configuration

  • Systems Under Test: 2 x Dell PowerEdge R810
  • CPUs: 2 x 8-core Intel Xeon E7-8837 @ 2.67 GHz (no hyperthreading)
  • Memory: 64GB
  • NICs: Broadcom NetXtreme II 57711 10Gb Ethernet

Experiment Overview

We tested splitRxMode by scaling the number of VMs on a single ESXi host from 1 to 36, with each VM receiving up to 40K packets per second. With 32 VMs powered on, the consolidation ratio was 2 VMs per physical core. The sender was a 2-vCPU RHEL VM on a separate physical machine transmitting 800-byte multicast packets at a fixed interval. The clients (receiving VMs) were 1-vCPU RHEL VMs running on the same ESXi host. Each receiver used 10-15% of its CPU for processing 10K packets per second, and usage increased linearly with packet rate; no noticeable difference in CPU usage was observed when splitRxMode was enabled. We then measured the total packets received by each client and calculated the average packet loss for the setup.
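As a rough sketch of the loss metric (the packet counts below are placeholders, not measured results), the average loss is simply the mean, across receivers, of the shortfall between packets sent and packets received:

```python
# Sketch of the loss metric: average packet loss across all receivers.
# The counts below are placeholders, not measured results.

def avg_loss_pct(packets_sent, received_per_vm):
    losses = [100.0 * (packets_sent - r) / packets_sent
              for r in received_per_vm]
    return sum(losses) / len(losses)

# e.g. 40K packets/sec sent for 60 seconds to three receiving VMs:
sent = 40000 * 60
print("%.4f%% average loss" % avg_loss_pct(sent, [2400000, 2399800, 2399900]))
```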

Experiment Results

The default ESXi configuration could run up to 20 VMs, each receiving 40K packets per second, with less than 0.01% packet loss. As we powered on more VMs, the networking context became the bottleneck and we started observing packet loss in all the receivers. The loss rate increased substantially as we powered on more VMs, and a similar trend was observed at lower packet rates (30K packets per second).

[Graph: average packet loss vs. number of powered-on VMs, default configuration]

We repeated the experiment after enabling splitRxMode on all VMs. As seen in the graph below, the new feature greatly increases the scalability of the vSphere platform in handling multicast packets. We were now able to power on 40% more VMs (28 VMs) than before, each receiving 40K packets per second with less than 0.01% packet loss. At lower packet rates the improvement was even more noticeable: we could not induce packet loss even with 36 VMs powered on.

[Graph: average packet loss vs. number of powered-on VMs, with splitRxMode enabled]

PCoIP Improvements in VMware View 5.0

PCoIP is VMware View’s VDI display protocol, and one of its key responsibilities is capturing the remote desktop’s audio/video output and conveying it to the user’s client device. With VMware View 5.0 we introduce a variety of important optimizations to the PCoIP protocol that deliver a significant reduction in PCoIP’s resource utilization, benefiting users in almost all usage scenarios. These optimizations fall into two broad categories, bandwidth optimizations and compute optimizations, which are discussed in more detail below.

Bandwidth Improvements

Controlling network bandwidth utilization is obviously a key consideration for VDI display protocols. This is especially true in WAN environments, where network bandwidth can be a relatively scarce and highly shared resource. View 5.0 makes significant improvements in the efficiency with which PCoIP consumes this resource, while maintaining the user experience. In many typical office/knowledge-worker environments, bandwidth consumption is reduced by up to 75% (a 4X improvement). The following sections discuss the optimizations that deliver these gains.

Lossless codec

In the VDI environment, a user’s screen is frequently composed of many forms of content, including icons, graphics, motion video, photos, and text. It is the responsibility of the VDI display protocol to actively monitor the type of content the user is viewing and dynamically manage the compression algorithms applied to each screen region to ensure the best user experience. For instance, naively applying lossy compression techniques to text-oriented content can result in blurred text edging, which can be very noticeable to users. Accordingly, PCoIP uses an efficient lossless compression algorithm, developed with text compression as a key consideration, to minimize both bandwidth and CPU utilization.
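As a generic illustration of the idea (using Python’s zlib; PCoIP’s actual codec is proprietary and not shown here), lossless compression handles repetitive, text-like content very well while reproducing it exactly:

```python
# Lossless compression of repetitive, text-like content using zlib.
# This is a generic illustration; PCoIP's codec is proprietary.
import zlib

text_like = b"The quick brown fox jumps over the lazy dog. " * 200
compressed = zlib.compress(text_like, 9)

print("original: %d bytes, compressed: %d bytes (%.1fx smaller)" %
      (len(text_like), len(compressed),
       len(text_like) / float(len(compressed))))

# The round trip is exact: decompression restores every byte.
assert zlib.decompress(compressed) == text_like
```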

With View 5.0, PCoIP debuts a major enhancement to its lossless compression codec. The improved lossless compression algorithm delivers both greater compression ratios and improved robustness. As an example, the improved algorithm delivers twice the compression of its predecessor when applied to content containing anti-aliased fonts.

The typical knowledge worker’s desktop frequently contains significant text content: text on web pages, in emails, presentations, and PDF documents. Accordingly, a significant proportion of the imaging data transmitted to the client device is compressed using lossless compression algorithms. As a result, View 5.0’s improved lossless compression algorithm delivers a 30% to 40% reduction in bandwidth consumption for typical knowledge-worker workflows.

Client-Side Image Caching

Amongst its many responsibilities, PCoIP is tasked with efficiently communicating desktop screen updates to the client device for local display. In many instances, only a small region of the screen changes. VDI protocols such as PCoIP perform spatial filtering and only send information for the portion of the screen that changed (rather than naively sending the entire screen). However, in addition to spatial filtering, temporal analysis can also be performed. For instance, consider minimizing an application, dragging a window, flicking through a slide set, or scrolling through a document. In all these examples, each successive screen update is largely composed of previously seen (potentially shifted) pixels. As a result, if the client device maintains a cache of previously seen image blocks, PCoIP can deliver significant bandwidth savings by merely encoding these portions of the screen update as a series of cache indices rather than retransmitting the blocks.
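The caching idea can be sketched in a few lines of Python. This is a deliberately simplified model: real PCoIP caching also matches shifted blocks and manages eviction, and the server tracks the client’s cache contents rather than sharing a dictionary with it.

```python
# Simplified model of client-side image-block caching: send a short
# cache index for previously seen blocks, the full block otherwise.
# Real PCoIP caching also matches shifted blocks and manages eviction,
# and the server tracks the client's cache rather than sharing a dict.
import hashlib

cache = {}  # block digest -> cached block

def encode_block(block):
    digest = hashlib.sha1(block).digest()
    if digest in cache:
        return ("index", digest)  # only a small index goes on the wire
    cache[digest] = block
    return ("block", block)       # full block goes on the wire once

# Scrolling by one row: most blocks of the new update were seen before.
for blk in [b"block-A", b"block-B", b"block-C"]:   # first screen update
    encode_block(blk)
kinds = [kind for kind, _ in
         (encode_block(b) for b in [b"block-B", b"block-C", b"block-D"])]
print(kinds)  # ['index', 'index', 'block']
```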

View 5.0 introduces such a client-side image cache, providing bandwidth savings of around 30% in many instances (typical knowledge-worker flows). This cache is not merely a simple fixed-position cache; it captures both spatial and temporal redundancy in the screen updates.

Total Bandwidth Improvements

In combination, the compression improvements and image caching deliver bandwidth savings of around 60% (a 2.5X improvement) out of the box, in both LAN and WAN use cases, for typical knowledge workers.

Additional bandwidth savings can be obtained in View 5.0 by leveraging the new image quality controls. By default, PCoIP builds to a lossless image: when a screen update occurs, PCoIP almost immediately transmits an initial image for display on the client, then, in rapid succession, continues to refine the client’s image until a high-quality lossy image is achieved. In PCoIP vernacular, this is termed building to a “perceptually lossless” image. If the screen remains constant, PCoIP continues, in the background, to refine the image on the client until a lossless image is obtained; that is, PCoIP builds to lossless (BTL). In certain application spaces, building to a lossless image is a key feature. For many knowledge workers, however, BTL can be disabled without impact on image quality, and doing so can deliver significant savings: in many situations, disabling BTL provides up to around a 30% reduction in bandwidth.
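The build-to-lossless progression can be mimicked with generic codecs. The sketch below uses the Python Imaging Library, with JPEG/PNG stages standing in for PCoIP’s proprietary encoder (both are assumptions for illustration): a fast low-quality pass, a higher-quality refinement, and a final lossless encode once the region is static.

```python
# Mimicking PCoIP's build-to-lossless progression with generic codecs.
# The Python Imaging Library and the JPEG/PNG stages are assumptions
# for illustration; PCoIP's actual encoder is proprietary.
import io
from PIL import Image

region = Image.new("RGB", (256, 256), "white")  # stand-in screen region

def build_to_lossless(image):
    stages = [("initial lossy",         "JPEG", {"quality": 40}),
              ("perceptually lossless", "JPEG", {"quality": 90}),
              ("lossless (BTL)",        "PNG",  {})]
    for label, fmt, kwargs in stages:
        buf = io.BytesIO()
        image.save(buf, fmt, **kwargs)
        yield label, len(buf.getvalue())

# If the screen stays static, each successive stage refines the image;
# disabling BTL simply stops after the perceptually lossless stage.
for label, size in build_to_lossless(region):
    print("%-22s %6d bytes" % (label, size))
```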

Combining the compression improvements, client caching and disabling BTL commonly delivers a bandwidth improvement of up to 75% (a 4X improvement), for typical office workloads!

CPU Improvements

In VDI environments, desktop consolidation is a key consideration. The more user desktops that can be handled per system (i.e. the higher the consolidation ratio), the better the cost savings that can be realized. Accordingly, the CPU overheads introduced by the VDI protocol must be carefully constrained. With View 5.0, PCoIP has been further enhanced to minimize its CPU overhead in a number of significant ways.

Idle CPU usage

From the VDI protocol’s perspective, unless the user is viewing a video, the session is idle for a large proportion of the time. For instance, if a user loads a new web page, there is a flurry of activity as the page loads and the screen update is displayed, but many seconds or even minutes may elapse with the screen remaining static while the user reads the content. For a VDI protocol, it is important not only to encode any screen changes efficiently, but also to minimize the overheads associated with the background activities that occur during these idle periods.

With View 5.0, we have significantly optimized these code paths, and PCoIP’s idle CPU consumption is now negligible. Furthermore, the session keep-alive (heartbeat) bandwidth has been reduced by a factor of two for many workloads.

Optimized algorithms and code

In View 5.0, many of the hottest image processing and compression functions have been reexamined, their algorithms tweaked for efficiency and their implementation further optimized – in some situations, even coded in assembly to realize the absolute lowest computational overheads.

Effectively using Hardware Instructions

Image manipulation operations are typically well suited to acceleration via SIMD (Single Instruction, Multiple Data) instructions, such as the SSE instructions supported on recent x86 processors. With View 5.0, PCoIP has been optimized to take even greater advantage of this SSE SIMD support, not only expanding coverage of the code base, but also leveraging, when available, the SSE instructions on the very latest processors (e.g., SSE 4.2 and AES-NI).

Conclusion

With View 5.0, we have spent significant time further optimizing PCoIP to reduce both its bandwidth and CPU consumption, delivering improved responsiveness, higher consolidation ratios, and better WAN scalability.


Performance Best Practices for VMware vSphere 5.0

A new version of Performance Best Practices for vSphere is now available. This book is designed to help system administrators obtain the best performance from their vSphere deployments.

We've addressed many of the new features in vSphere 5.0 from a performance perspective.  These include:

  • Storage Distributed Resource Scheduler (Storage DRS), which performs automatic storage I/O load balancing
  • Virtual NUMA, allowing guests to make efficient use of hardware NUMA architecture
  • Memory compression, which can reduce the need for host-level swapping
  • Swap to host cache, which can dramatically reduce the impact of host-level swapping
  • SplitRx mode, which improves network performance for certain workloads
  • VMX swap, which reduces per-VM memory reservation
  • Multiple vMotion vmknics, allowing for more and faster vMotion operations

We've also significantly updated and expanded many of the topics we've covered in previous editions of the book.  These include:

  • Choosing hardware for a vSphere deployment
  • Power management
  • Configuring ESXi for best performance
  • Guest operating system performance
  • vCenter and vCenter database performance
  • vMotion and Storage vMotion performance
  • Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM) performance
  • High Availability (HA), Fault Tolerance (FT), and VMware vCenter Update Manager performance

The book can be found at: Performance Best Practices for VMware vSphere 5.0.


VMware View & PCoIP at VMworld

In recent weeks there’s been growing excitement about the PCoIP enhancements coming to VMware View. For instance, Warren Ponder discussed here how these enhancements reduce bandwidth consumption by up to 75%. Engineers from VMware’s performance team (& Warren) will be talking more about these enhancements and how they translate into real-world performance at the rapidly approaching VMworld 2011 in Las Vegas:

EUC1987: VMware View PC-over-IP Performance and Best Practices
Tuesday, August 30th – 12:00
Wednesday, August 31st – 1:00

EUC3163: VMware View Performance and Best Practices
Tuesday, August 30th – 4:30
Wednesday, August 31st – 4:00

We will also be blogging additional details and performance results as VMworld progresses, followed by a performance whitepaper.

Stay tuned!

First SAP SD Benchmark on vSphere 5 Shows Performance within 6% of Native

The first SAP Sales and Distribution (SD) benchmark on vSphere 5 has been published. It takes advantage of the new larger VM support in vSphere 5 to reach 4,600 SD users and 25,150 SAPS with a 24-vCPU VM running on a Fujitsu Primergy server. This is the largest two-tier SAP SD benchmark on vSphere to date.

Fujitsu took the extra step of using the same server and test configuration to publish a non-virtualized result. Comparing the two tests shows that the virtual result is only 6% lower than native. This reflects the optimization work that went into vSphere 5 and shows that even large 24-vCPU VMs can perform very close to native.

Fujitsu and VMware are working on a more detailed technical paper, which will be released soon.

Some details of the SAP configuration used in the tests: SAP ECC 6.0 with EHP 4 on a Fujitsu Primergy RX300 S6 with 2 x Intel Xeon X5690 processors and 96 GB RAM. The OS was SUSE Linux Enterprise Server 11 SP1, and the database was MaxDB 7.8. The SAP certification numbers are 2011027 and 2011028.