Category Archives: Uncategorized

VMware Horizon View 5.2 Performance & Best Practices and A Performance Deep Dive on Hardware Accelerated 3D Graphics

VMware Horizon View 5.2 simplifies desktop and application management while increasing security and control, and delivers a personalized, high-fidelity experience for end users across sessions and devices. It enables higher availability and agility of desktop services than traditional PCs can match, while reducing the total cost of desktop ownership. End users can enjoy new levels of productivity and the freedom to access desktops from more devices and locations, while IT gains greater policy control.

Recently, we published two whitepapers that provide a deep dive on Horizon View 5.2 performance and on the hardware accelerated 3D graphics (vSGA) feature. The links to these whitepapers are as follows:

* VMware Horizon View 5.2 Performance and Best Practices
* VMware Horizon View 5.2 and Hardware Accelerated 3D Graphics

The first whitepaper describes the new features in View 5.2, including access to View desktops with Horizon, space efficient sparse (SEsparse) disks, hardware accelerated 3D graphics, and full support for Windows 8 desktops. It highlights the View 5.2 performance improvements in PCoIP and View management. In addition, this paper presents View 5.2 PCoIP performance results, a Windows 8 and RDP 8 performance analysis, and a vSGA performance analysis, including how vSGA compares to the software renderer support introduced in View 5.1.

The second whitepaper goes in-depth on the support for hardware accelerated 3D graphics that debuted with VMware vSphere 5.1 and VMware Horizon View 5.2 and presents performance and consolidation results for a number of different workloads, ranging from knowledge workers using 3D desktops to performance-intensive CAD-based workloads. Because the intensity of a 3D workload will vary greatly from user to user and application to application, rather than highlighting specific case studies, we demonstrate how the solution efficiently scales for both light- and heavy-weight 3D workloads, until GPU or CPU resources are fully utilized. This paper also presents key best practices to extract peak performance from a 3D View 5.2 deployment.

Performance Enhancements in View 5.2

View 5.2 became generally available today, and we wanted to take this opportunity to present a high-level overview of some of the performance enhancements that debut with View 5.2 and PCoIP. In this release, PCoIP’s image cache has been significantly improved to allow users on memory constrained devices to run with much smaller cache sizes. Firstly, support was introduced to efficiently handle situations where image content is shifted vertically, as occurs during scroll operations. Secondly, View 5.2 debuts improved cache compression algorithms that provide significant additional compression of the View client’s image cache. Finally, the cache’s handling of progressive build operations has been made significantly more efficient. All of these enhancements combine to allow users to derive significant bandwidth reductions using considerably smaller cache sizes than was achievable with View 5.1:

The above figure illustrates that, for typical office workflows, running View 5.2 with up to a 5X smaller cache can still deliver significant bandwidth savings; a 90MB View 5.2 cache was found to deliver comparable performance to View 5.1 configured with a 250MB cache, and even a 50MB View 5.2 cache delivered the majority of the bandwidth reduction benefits observed from View 5.1 configured with a 250MB cache. This up to 5X reduction in cache size can be a compelling option for memory constrained thin clients or tablet devices. The maximum image cache size can be configured via GPOs or set on the client device.
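As a quick check on the "up to 5X" figure, the sketch below computes the reduction factor for the cache sizes quoted above. The function name is hypothetical and is used only for illustration; the sizes are the ones stated in the text.

```python
# Cache sizes quoted in the post, in MB.
VIEW_51_CACHE_MB = 250
VIEW_52_CACHE_SIZES_MB = [90, 50]


def reduction_factor(baseline_mb, reduced_mb):
    """Return how many times smaller the reduced cache is than the baseline."""
    return baseline_mb / reduced_mb


for size_mb in VIEW_52_CACHE_SIZES_MB:
    factor = reduction_factor(VIEW_51_CACHE_MB, size_mb)
    print(f"{size_mb}MB cache is {factor:.2f}X smaller than {VIEW_51_CACHE_MB}MB")
```

The 50MB configuration is the source of the 5X figure; the 90MB configuration corresponds to a reduction of a little under 3X.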

Alternatively, users can continue to leverage the default 250MB cache size in View 5.2 and will see reduced bandwidth utilization in comparison with View 5.1:

The above figure illustrates the average bandwidth utilization observed for View 5.2 during a VMware View Planner run in two different WAN environments for out-of-the-box PCoIP configurations. The results are normalized to the View 5.1 baseline, and illustrate that in the 2 Mb/s environment, the average session bandwidth is reduced by around 6%. Moreover, in the “extreme WAN” environment, View 5.2 delivers an almost 10% reduction in bandwidth utilization compared with View 5.1. These reductions can be compelling when consolidating View sessions from a branch office onto a limited capacity link, or when users are connecting over congested WiFi connections. Furthermore, as would be expected, reducing the number of image blocks being encoded not only reduces bandwidth utilization, but also improves interactivity (faster transmission of updates and the opportunity for higher frame rates, given the reduced bandwidth utilization) and reduces CPU consumption (less encoding work being done).

Finally, other PCoIP enhancements that debut with View 5.2 include:

1. GPO settings take immediate effect: many of the performance-oriented GPO settings now take effect immediately, allowing users or administrators to closely customize the behavior of their PCoIP sessions.

2. Relative mouse support: previously, support was only provided for absolute mode. However, certain 3D applications require relative mouse mode, and support for it is introduced in View 5.2.

We will cover all of these optimizations in greater detail in an upcoming View 5.2 Performance and Best Practices Whitepaper.

Technical deep dive on VMware View Planner

In our prior VMworld sessions and performance white papers, we have presented user experience performance results based on VMware View® Planner, a tool that can generate workloads that are representative of many user-initiated operations in VDI environments. While we have briefly discussed this tool on prior occasions, there have been many requests for the architectural details and inner workings of the tool. To provide a deeper technical dive on View Planner, we have published an article in the latest issue of the VMware Technical Journal (VMTJ Winter 2012), which can be found here: VMware View Planner: Measuring True Virtual Desktop at Scale.

View Planner supports typical VDI user operations and also administrator’s management operations that can be configured to allow VDI evaluators to more accurately represent their particular environment. In this paper, we describe the challenges in building such a workload generator and the platform around it, as well as the View Planner architecture and use cases. We also explain how we used View Planner to perform platform characterization and consolidation studies, find potential performance optimizations and several other use cases.

vCloud Director 5.1 Performance and Best Practices

VMware vCloud Director 5.1 gives enterprise organizations the ability to build secure private clouds that dramatically increase datacenter efficiency and business agility. Coupled with VMware vSphere, vCloud Director delivers cloud computing for existing datacenters by pooling virtual infrastructure resources and delivering them to users as catalog-based services. vCloud Director 5.1 helps IT professionals build agile infrastructure-as-a-service (IaaS) cloud environments that greatly accelerate the time-to-market for applications and the responsiveness of IT organizations.

This white paper addresses three areas regarding vCloud Director performance:

  • vCloud Director sizing guidelines and software requirements
  • Performance characterization and best practices for key vCloud Director operations and new features
  • Best practices in improving performance and tuning vCloud Director architecture

For more details and performance tips, please refer to VMware vCloud Director 5.1 Performance and Best Practices.

vSphere 5.1 IOPS Performance Characterization on Flash-based Storage

At VMworld 2012 we demonstrated a single eight-way VM running on vSphere 5.1 exceeding one million IOPS.  This testing illustrated the high end IOPS performance of vSphere 5.1.

In a new series of tests we have completed some additional characterization of high I/O performance using a very similar environment. The only difference between the 1 million IOPS test environment and the one used for these tests is that the number of Violin Memory Arrays was reduced from two to one (one of the arrays was a short term loan).

Configuration:
Hypervisor: vSphere 5.1
Server: HP DL380 Gen8
CPU: Two Intel Xeon E5-2690, HyperThreading disabled
Memory: 256GB
HBAs: Five QLogic QLE2562
Storage: One Violin Memory 6616 Flash Memory Array
VM: Windows Server 2008 R2, 8 vCPUs and 48GB.
Iometer Configuration: Random, 4KB I/O size with 16 workers

We continued to characterize the performance of vSphere 5.1 and the Violin array across a wider range of configurations and workload conditions.

Based on the types of questions that we often get from customers, we focused on RDM versus VMFS5 comparisons and the usage of various I/O sizes.  In the first series of experiments we compared RDM versus VMFS5 backed datastores using 100% read workload mix while ramping up the I/O size.

As you can see from the above graph, VMFS5 yielded roughly equivalent performance to that of RDM backed datastores. Comparing the average of the deltas across all data points showed performance within 1% of RDM for both IOPS and MB/s. As expected, the number of IOPS decreased after we exceeded the default array block size of 4KB, but the throughput continued to scale, approaching 4500 MB/s at both 8KB and 16KB sizes.
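The trade-off between IOPS and throughput as the I/O size grows follows from a simple relationship: throughput is IOPS multiplied by the I/O size. The sketch below is only an illustration of that arithmetic. It assumes binary units (1 MB = 1024 KB), which the post does not specify, and the function names are hypothetical. The values it prints are implied by the rounded figures quoted in the text, not measured results.

```python
KIB = 1024
MIB = 1024 * 1024


def iops_from_throughput(mb_per_s, block_kb):
    """IOPS implied by a throughput (MB/s) at a given I/O size (KB)."""
    return mb_per_s * MIB / (block_kb * KIB)


def throughput_from_iops(iops, block_kb):
    """Throughput in MB/s implied by an IOPS rate at a given I/O size (KB)."""
    return iops * block_kb * KIB / MIB


# One million 4KB I/Os per second, as in the VMworld 2012 demonstration.
print(throughput_from_iops(1_000_000, 4))

# IOPS implied by roughly 4500 MB/s at 8KB and 16KB I/O sizes.
print(iops_from_throughput(4500, 8))
print(iops_from_throughput(4500, 16))
```

Under these assumptions, the same throughput at twice the I/O size requires half as many I/Os per second, which is why IOPS fall while MB/s continue to scale.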

For our second series of experiments, we continued to compare RDM versus VMFS5 backed datastores through a progression of block sizes, but this time we altered the workload mix to include 60% reads and 40% writes.

Violin Memory arrays use a 4KB sector size and perform at their optimal level when managing 4KB blocks. This is clearly visible in the above IOPS results at the 4KB block size. In the above graph, comparing RDM and VMFS5 IOPS, you can see that VMFS5 performs very well with a 60% read, 40% write mix. Throughput continued to scale in a similar fashion to the read-only experiments, and VMFS5 performance for both IOPS and MB/s was within 0.01% of RDM performance when comparing the average of the deltas across all data points.

The amount of I/O, with just one eight-way VM running on one Violin storage array, is both considerable and sustainable at many I/O sizes. It’s also worth noting that running a 60% read and 40% write I/O mix still generated substantial IOPS and bandwidth. While in most cases a single VM won’t need to drive nearly this much I/O traffic, these experiments show that vSphere 5.1 is more than capable of handling it.

VMmark 2.5 Released

I am pleased to announce the release of VMmark 2.5, the latest edition of VMware’s multi-host consolidation benchmark. The most notable change in VMmark 2.5 is the addition of optional power measurements for servers and servers plus storage. This capability will assist IT architects who wish to consider trade-offs in performance and power consumption when designing datacenters or evaluating new and emerging technologies, such as flash-based storage.

VMmark 2.5 contains a number of other improvements including:

  • Support for the VMware vCenter Server Appliance.
  • Support for VMmark 2.5 message and results delivery via Growl/Prowl.
  • Support for PowerCLI 5.1.
  • Updated workload virtual machine templates made from SLES for VMware, a free use version of SLES 11 SP2.
  • Improved pre-run initialization checking.

Full release notes can be found here.

Over the past two years since its initial release, VMmark 2.x has become the most widely-published virtualization benchmark with over fifty published results. We expect VMmark 2.5 and its new capabilities to continue that momentum. Keep an eye out for new power and power-performance results from our hardware partners as well as a series of upcoming blog entries presenting interesting power-performance experiments from the VMmark team.

The power measurement capability in VMmark 2.5 utilizes the SPEC®™ PTDaemon (Power Temperature Daemon). The PTDaemon provides a straightforward and reliable building block with support for the many power analyzers that have passed the SPEC Power Analyzer Acceptance Test.

All currently published VMmark 2.0 and 2.1 results are comparable to VMmark 2.5 performance-only results. Beginning on January 8th 2013, any submission of benchmark results must use the VMmark 2.5 benchmark kit.

Turbo-charge View Video Performance

For desktop VMs using VMXNET3 NICs, you can significantly improve the peak video playback performance of your View desktop by simply setting the following registry setting to the value recommended by Microsoft:

HKLM\System\CurrentControlSet\Services\Afd\Parameters\FastSendDatagramThreshold to 1500

[As discussed in a Microsoft KB article here]

[N.B. A reboot of the desktop VM is required after changing this registry setting]
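For anyone who wants to script the change described above, here is a minimal sketch that generates the text of a .reg file for this setting. It is an illustration, not a method from the original post: the helper name is hypothetical, and the value is assumed to be a REG_DWORD, which you should confirm against the Microsoft KB article before importing anything.

```python
# Key path and value taken from the post; HKLM expanded to its full name.
KEY_PATH = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Afd\Parameters"
VALUE_NAME = "FastSendDatagramThreshold"
VALUE = 1500  # bytes, the value recommended in the post


def build_reg_file(key_path, value_name, value):
    """Build .reg file text that sets one DWORD value (assumed type)."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key_path}]",
        # .reg files encode DWORDs as eight hexadecimal digits.
        f'"{value_name}"=dword:{value:08x}',
        "",
    ]
    return "\r\n".join(lines)


print(build_reg_file(KEY_PATH, VALUE_NAME, VALUE))
```

The output can be saved with a .reg extension and reviewed before it is imported on a test desktop. The reboot noted above is still required.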

When running full-screen videos at 1080p resolution on a 2vCPU desktop, we see this deliver frame-rate improvements of up to 1.4X.

So, what does this do and why does it deliver these benefits?

The VMXNET3 adapter is a paravirtualized NIC designed for performance that, as of vSphere 5, supports interrupt coalescing. Virtual interrupt coalescing is similar to a physical NIC’s interrupt moderation and is useful in improving CPU efficiency for high throughput workloads. Unfortunately, out-of-the-box, Windows does not benefit from interrupt coalescing in many scenarios (those sending packets larger than 1024 bytes), because after sending a packet, Windows waits for a completion interrupt to be delivered before sending the next packet. By setting FastSendDatagramThreshold to the Microsoft recommended value of 1500 bytes, you instruct Windows not to wait for the completion interrupt even when sending larger packets. Accordingly, you are allowing View and PCoIP (as well as other applications that send larger packets) to benefit from interrupt coalescing, which reduces CPU load and improves network throughput for PCoIP. This translates into significantly improved video playback performance.

Impact of Enhanced vMotion Compatibility on Application Performance

Enhanced vMotion Compatibility (EVC) is a technique that allows vMotion to proceed even when ESXi hosts with CPUs of different generations exist in the vMotion destination cluster. EVC assigns a baseline to all ESXi hosts in the destination cluster so that all of them will be compatible for vMotion. An example is assigning a Nehalem baseline to a cluster that mixes ESXi hosts with Westmere and Nehalem processors. In this case, the features that are available only in Westmere would be hidden, because it is a newer processor than Nehalem, and all ESXi hosts would “broadcast” that they have Nehalem features.
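One way to picture the baseline idea is as a set operation over CPU feature flags. The sketch below is a simplified model, not how EVC is implemented: in practice an administrator selects the baseline, and the feature lists here are deliberately abbreviated. The function names are hypothetical.

```python
# Abbreviated, illustrative feature sets. Westmere added AES-NI and
# PCLMULQDQ on top of the Nehalem instruction set.
NEHALEM = frozenset({"sse4.1", "sse4.2", "popcnt"})
WESTMERE = NEHALEM | {"aes-ni", "pclmulqdq"}


def evc_baseline(host_feature_sets):
    """Model a baseline as the features common to every host."""
    return frozenset.intersection(*host_feature_sets)


def hidden_features(host_features, baseline):
    """Features a host supports but does not expose under the baseline."""
    return host_features - baseline


cluster = [NEHALEM, WESTMERE]
baseline = evc_baseline(cluster)
print(sorted(baseline))
print(sorted(hidden_features(WESTMERE, baseline)))
```

In this model the Westmere host hides its two newer instructions, so a virtual machine sees the same feature set on either host and can move between them.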

Tests showed how utilizing EVC with different applications affected their performance. Several workloads were chosen to represent typical applications running in enterprise datacenters. The applications represented included database, Java, encryption, and multimedia. To see the results and learn some best practices for performance with EVC, read Impact of Enhanced vMotion Compatibility on Application Performance.

Performance Best Practices for VMware vSphere 5.1

We’re pleased to announce the availability of Performance Best Practices for vSphere 5.1.  This is a book designed to help system administrators obtain the best performance from vSphere 5.1 deployments.

The book addresses many of the new features in vSphere 5.1 from a performance perspective.  These include:

  • Use of a system swap file to reduce VMkernel and related memory usage
  • Flex SE linked clones that can relinquish storage space when it’s no longer needed
  • Use of jumbo frames for hardware iSCSI
  • Single Root I/O Virtualization (SR-IOV), allowing direct guest access to hardware devices
  • Enhancements to SplitRx mode, a feature allowing network packets received in a single network queue to be processed on multiple physical CPUs
  • Enhancements to the vSphere Web Client
  • VMware Cross-Host Storage vMotion, which allows virtual machines to be moved simultaneously across both hosts and datastores

We’ve also updated and expanded on many of the topics in the book.

These topics include:

  • Choosing hardware for a vSphere deployment
  • Power management
  • Configuring ESXi for best performance
  • Guest operating system performance
  • vCenter and vCenter database performance
  • vMotion and Storage vMotion performance
  • Distributed Resource Scheduler (DRS), Distributed Power Management (DPM), and Storage DRS performance
  • High Availability (HA), Fault Tolerance (FT), and VMware vCenter Update Manager performance
  • VMware vSphere Storage Appliance (VSA) and vCenter Single Sign-On Server performance

The book can be found at: http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.1.pdf.

Storage I/O Performance on vSphere 5.1 over 16Gb Fibre Channel

In the vSphere 5.1 release time frame, 16Gb Fibre Channel fabrics and 16Gb FC cards have become generally available. With the release of the 16Gb FC driver, the vSphere platform can now take full advantage of the new 16Gb FC HBAs and thus deliver better storage I/O performance.

As described in the paper “Storage I/O Performance on vSphere 5.1 over 16Gb Fibre Channel”, storage I/O throughput has doubled for larger block I/Os compared to 8Gb FC. The paper uses a single storage I/O worker to show that throughput has improved with better CPU efficiency per I/O. For random I/Os in small block sizes, 16Gb FC can attain much higher I/Os per second than an 8Gb FC connection.
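The doubling for large block I/Os is consistent with the link rates themselves. As a rough sketch, the calculation below derives the payload rate per direction from the commonly cited line rates and encodings for the two Fibre Channel generations (8.5 GBaud with 8b/10b for 8Gb FC, 14.025 GBaud with 64b/66b for 16Gb FC). These figures come from general Fibre Channel references rather than from the paper. The result ignores framing overhead, and the function name is hypothetical.

```python
def payload_mb_per_s(line_rate_gbaud, data_bits, coded_bits):
    """Payload rate in MB/s (10^6 bytes) after removing encoding overhead."""
    bits_per_s = line_rate_gbaud * 1e9 * data_bits / coded_bits
    return bits_per_s / 8 / 1e6


fc_8g = payload_mb_per_s(8.5, 8, 10)        # 8b/10b encoding
fc_16g = payload_mb_per_s(14.025, 64, 66)   # 64b/66b encoding

print(f"8Gb FC:  ~{fc_8g:.0f} MB/s per direction")
print(f"16Gb FC: ~{fc_16g:.0f} MB/s per direction")
print(f"ratio:   {fc_16g / fc_8g:.2f}")
```

The more efficient 64b/66b encoding is why the line rate does not itself need to double for the usable bandwidth to do so.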