
Monthly Archives: October 2011

Virtualized Hadoop Performance on vSphere 5

In recent years the amount of data stored worldwide has exploded, giving rise to the term 'Big Data'. While data at this scale brings complexity in storing and handling it, these large datasets often contain business information that is critical to continued growth and success. The last few years have seen the birth of several new tools that can manage and analyze such large datasets in a timely way, where traditional tools have had limitations. A natural question to ask is how these tools perform on vSphere. As the start of an ongoing effort to quantify the performance of Big Data tools on vSphere, we've chosen to test one of the more popular tools – Hadoop.

Hadoop has emerged as a popular platform for the distributed processing of data. It scales to thousands of nodes while maintaining resiliency to disk, node, or even rack failure. It can use any storage, but is most often used with local disks. A whitepaper giving an overview of Hadoop and the details of tests on commodity hardware with local storage is available here. One of the findings in the paper is that running 2 or 4 smaller VMs per physical machine usually resulted in better performance, often exceeding native performance.
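Tests like those in the whitepaper are typically driven with the standard Hadoop example benchmarks. As a rough sketch (jar name, paths, and data sizes are illustrative assumptions, not the whitepaper's exact configuration), a TeraSort-style run on a Hadoop cluster of that era might look like:

```shell
# Generate 10 million 100-byte records (~1 GB) of synthetic input.
# The examples jar name/location varies by Hadoop distribution.
hadoop jar $HADOOP_HOME/hadoop-examples.jar teragen 10000000 /terasort/input

# Sort the generated data across the cluster; elapsed time is the
# headline benchmark result.
hadoop jar $HADOOP_HOME/hadoop-examples.jar terasort /terasort/input /terasort/output

# Verify that the output is globally sorted.
hadoop jar $HADOOP_HOME/hadoop-examples.jar teravalidate /terasort/output /terasort/report
```

Runs like this can be repeated natively and with 1, 2, or 4 VMs per host to make the kind of comparison the paper describes.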

As we continue our performance testing, stay tuned for results on a larger cluster with bigger data, with other Big Data tools, and on shared storage.

Running latency-sensitive applications on vSphere

For those of us interested in running latency-sensitive applications on vSphere, Bhavesh Davda, from the CTO's office, has created a comprehensive guide for tuning vSphere for such applications. 

Some of the tuning options are very familiar to those working with low-latency applications, e.g., interrupt coalescing settings, while others are relatively obscure vSphere-specific options. Using a combination of these options, we saw a noticeable improvement in the performance of some latency-bound benchmarks. As a bonus, the guide provides in-depth reasoning for some of the options.
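To give a flavor of the interrupt coalescing settings mentioned above (the exact option names here are my recollection of the kinds of knobs the guide covers; verify them against the paper before use), coalescing can be addressed both per-VM and inside the guest:

```shell
# In the VM's .vmx configuration file: disable virtual NIC interrupt
# coalescing for the first vmxnet3 adapter (option name as discussed
# in the vSphere latency tuning guidance; treat as an assumption).
# ethernet0.coalescingScheme = "disabled"

# Inside a Linux guest: turn off receive interrupt coalescing so every
# packet raises an interrupt immediately (lower latency, higher CPU cost).
ethtool -C eth0 rx-usecs 0 rx-frames 1
```

The usual trade-off applies: disabling coalescing reduces latency but increases interrupt rate and CPU overhead, so it only makes sense for genuinely latency-sensitive workloads.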

You can find more details here and the complete whitepaper here.

Looking Deeper at vSphere Storage Appliance Performance

One of the new products announced as part of the vSphere 5 launch was the new vSphere Storage Appliance (VSA). It allows you to use the local storage of vSphere 5 ESX hosts to create shared storage that is accessible as NFS datastores. This shared storage automatically mirrors its data across the vSphere 5 hosts in the VSA cluster, ensuring that it is always available, even in the event of the loss of a server.

VSA is integrated into vSphere 5 and is easily installed and managed. It works well and does not require SAN experience or knowledge to get up and running.

We recently published a white paper for those who are interested in looking deeper into VSA to understand its performance aspects. The paper includes information about the key factors in VSA performance, how to do a detailed analysis, and some examples based on testing.

VMware View 5 Network Optimization

PCoIP is an adaptive protocol that works to deliver the best possible user experience for any given network and CPU constraints. In the majority of environments, this is the desired approach. However, there can be times when individual users or group administrators are interested in different resource utilization policies. For instance, administrators may not want users consuming too much corporate LAN bandwidth streaming YouTube videos!

The View PCoIP protocol provides a number of options that can be used to impose these constraints on audio and video streaming operations, while only having a minimal impact on quality:


Setting the maximum frame rate to 15 and the maximum initial image quality to 70 or 80 can reduce the bandwidth associated with video playback by 2 to 4X in a LAN environment. Even with the maximum initial image quality reduced to 70, image quality remains good, even for high-quality MP4 videos.


Setting the session audio bandwidth limit to 100 can reduce the audio bandwidth by around 5X. Even with this change, audio quality is good.
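These PCoIP session variables are normally configured through the View GPO administrative template; as a sketch of the equivalent registry overrides on a Windows virtual desktop (the key path and value names follow the PCoIP GPO template as I understand it — treat them as assumptions to verify against the whitepaper):

```shell
:: Run in an elevated prompt on the View desktop (Windows cmd).
:: Cap the frame rate at 15 frames per second.
reg add "HKLM\SOFTWARE\Policies\Teradici\PCoIP\pcoip_admin" /v pcoip.maximum_frame_rate /t REG_DWORD /d 15 /f

:: Cap the initial (peak) image quality at 70.
reg add "HKLM\SOFTWARE\Policies\Teradici\PCoIP\pcoip_admin" /v pcoip.maximum_initial_image_quality /t REG_DWORD /d 70 /f

:: Limit session audio bandwidth to 100 kbit/s.
reg add "HKLM\SOFTWARE\Policies\Teradici\PCoIP\pcoip_admin" /v pcoip.audio_bandwidth_limit /t REG_DWORD /d 100 /f
```

A PCoIP session restart is needed for changed values to take effect.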

More details on how to apply these settings (and additional global resource constraint settings) can be found in the View 5 network optimization whitepaper (located here).

VMware View 5.0 Performance and Best Practices White Paper

Following up on our presentations at the recent VMworld discussing VMware View 5.0, we are pleased to announce the availability of our VMware View 5.0 Performance and Best Practices white paper. This paper highlights:

- Optimizations to the View PCoIP protocol that deliver bandwidth savings of up to 75% and improved consolidation ratios of up to 30%
- Optimizations to VMware vSphere 5.0 that benefit VDI deployments
- Performance comparisons with View 4.5, Microsoft RDP, and Citrix XenDesktop
- Best practices and tunables for the platform, guest, PCoIP protocol, and network, illustrating how users can optimally configure their View deployments

The full white paper can be found here.

GPGPU Computing in a VM

Periodically on forums I see people asking whether GPGPU computing is supported by ESX, prompting me to write a brief post on the subject:

YES! VMware ESX supports GPGPU computing. Applications can access GPGPUs via CUDA or OpenCL in exactly the same way as when running natively — no changes are required to the application.

Recent versions of ESX (4.0 onwards) support a feature termed VMDirectPath IO, which allows guest operating systems to directly access PCI devices; this feature can be used to achieve GPGPU computing in a VM. As one would expect with direct communication, the performance overheads are minimal, providing close to native performance.

The process for setting-up access to a GPU from a VM is fairly simple:

(1) A VT-d capable system (or a system with an AMD IOMMU) is required

- Nehalem-class processor or later
- VT-d performs the remapping of I/O DMA transfers and device-generated interrupts needed for safe operation in the virtualized environment
- Ensures guest isolation

(2) A "direct passthrough" capable GPGPU is required

- Most NVIDIA Quadro cards are passthrough capable
- Recent AMD cards are also passthrough capable

(3) Ensure the I/O MMU is enabled in the system BIOS

(4) Enable VMDirectPath IO via the vSphere Client or the ESX command line

- Enable the GPGPU as a passthrough device
- Add the GPGPU to the chosen VM

(5) Install the vendor's latest graphics drivers in the VM

(6) Download the GPGPU SDK

(7) Run your CUDA/OpenCL apps!

[N.B. See VMware KB1010789 for more configuration details]
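Once the GPU has been passed through and the driver installed, a quick way to sanity-check the setup from inside a Linux guest is something like the following (the SDK sample path reflects the default install location of the NVIDIA GPU Computing SDK of this era — adjust to your install):

```shell
# Confirm the guest sees the passed-through GPU on its virtual PCI bus.
lspci | grep -i nvidia

# Confirm the vendor driver loaded and can talk to the card.
nvidia-smi

# Build and run the CUDA SDK's deviceQuery sample to verify CUDA access.
cd ~/NVIDIA_GPU_Computing_SDK/C/src/deviceQuery && make
../../bin/linux/release/deviceQuery
```

If deviceQuery reports the expected device name and compute capability, the VM has full CUDA access to the GPU.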

I’ve personally undertaken some testing prior to this post:

- Successfully tested with both Windows and Linux VMs (32- and 64-bit)
- Successfully tested both CUDA and OpenCL frameworks on NVIDIA Quadro GPUs
- Successfully tested OpenCL on AMD GPUs

Additionally, I spent some time investigating the performance overheads associated with performing GPGPU computing from within a VM. Once the data has been loaded into the GPU's local memory and the processing operation initiated, the GPU operates largely independently of the OS. Any performance overheads are therefore likely to be most significant when moving data back and forth between the VM and the GPU and when initiating commands. Accordingly, I spent time looking at these scenarios and found the following:

- Data transfer back and forth between host and GPU is close to native performance
  - Found to be true even for small data chunks (e.g., 1KB)
- Tested a number of short-duration microkernels from a wide range of application spaces and found no noticeable performance impact
  - From DSP to financial to computational fluid dynamics (and beyond)
- VM performance was typically observed at 98%+ of native performance
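The host-to-GPU transfer measurement can be reproduced with the CUDA SDK's standard bandwidthTest sample, run natively and then inside the VM for comparison (flags are from the stock sample; the size range shown is an illustrative choice to expose per-transfer overhead on small chunks):

```shell
# Measure host-to-device and device-to-host copy bandwidth using pinned
# memory, sweeping transfer sizes from 1KB up to 64MB in 1MB steps.
./bandwidthTest --memory=pinned --mode=range --start=1024 --end=67108864 --increment=1048576
```

Comparing the small-transfer end of the native and in-VM curves is where any VMDirectPath IO overhead would show up first.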

Operation close to native performance is not unexpected: the direct communication between VM and GPU provided by VMDirectPath IO ensures the hypervisor doesn't get involved in each and every interaction (which would add overhead), allowing even short-duration operations to run close to native performance.

So, in summary: ESX supports GPGPU computing in a VM, at very close to native performance!