
Tag Archives: VDI

VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

In part 1 and part 2 of the VDI/VSAN benchmarking blog series, we presented VDI benchmark results on VSAN for 3-node, 5-node, 7-node, and 8-node cluster configurations. In this blog, we compare the VDI benchmarking performance of VSAN with that of an all-flash storage array. The intent of this experiment is not to compare the maximum IOPS achievable on these storage solutions; instead, we show how VSAN scales as we add more heavy VDI users. We found that VSAN can support a similar number of users as the all-flash array, even though VSAN consumes CPU on the very hosts that run the desktops.

VDI workloads are CPU bound but sensitive to I/O, which makes View Planner a natural fit for this comparative study. We used VMware View Planner 3.0 on both VSAN and the all-flash SAN, consolidating as many heavy users as we could on a particular cluster configuration while still meeting the quality of service (QoS) criteria. We then compared the number of users each solution could support before running out of CPU, since I/O is not the bottleneck here. Although VSAN runs in the kernel and uses host CPU for its operation, that CPU usage is quite minimal: we see no more than a 5% difference in heavy-user consolidation on VSAN compared to the all-flash array.

As discussed in the previous blog, we used the same experimental setup, where each VSAN host has two disk groups and each disk group has one 200GB PCIe solid-state drive (SSD) and six 300GB 15K RPM SAS disks. We built a 7-node and an 8-node cluster and ran View Planner to get the VDImark™ score for both VSAN and the all-flash array. VDImark signifies the number of heavy users a system under test can run while meeting the QoS criteria. The VDImark for both VSAN and the all-flash array is shown in the following figure.

View Planner QoS (VDImark)


From the chart above, we see that VSAN can consolidate 677 heavy users (VDImark) on the 7-node cluster and 767 heavy users on the 8-node cluster. Compared to the all-flash array, we see no more than a 5% difference in user consolidation. To further illustrate the Group-A and Group-B response times, we show the average response time of individual operations for these runs, as follows.

Group-A Response Times

As seen in the figure above, for both VSAN and the all-flash array, the average response times of the interactive operations are less than one second, which is needed to provide a good end-user experience. As with user consolidation, the Group-A response times on VSAN are similar to those on the all-flash array.

Group-B Response Times

Group-B operations are sensitive to both CPU and I/O, and 95% of them must complete in less than six seconds to meet the QoS criteria. The figure above shows that the average response time for most operations is within the threshold, and that VSAN response times are similar to those of the all-flash array.
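For readers who want to sanity-check their own runs, the pass/fail logic is easy to express. Below is a minimal sketch in Python (our illustration, not View Planner's actual implementation) applying the six-second 95th-percentile criterion for Group-B operations and the one-second average guideline for Group-A:

```python
import statistics

def p95(samples):
    """Nearest-rank 95th percentile of a list of response times."""
    ordered = sorted(samples)
    return ordered[max(0, int(round(0.95 * len(ordered))) - 1)]

def meets_qos(group_a, group_b):
    """Illustrative QoS check based on the criteria described above:
    Group-A averages stay under 1s, Group-B 95th percentile under 6s."""
    return statistics.mean(group_a) < 1.0 and p95(group_b) < 6.0

# Example run (times in seconds): passes both criteria.
print(meets_qos([0.4, 0.6, 0.8], [2.0, 3.5, 5.0, 5.8]))  # True
```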

To see the other parts of the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3


VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2

In part 1, we presented the VDI benchmark results on VSAN for 3-node and 7-node configurations. In this blog, we update the results for 5-node and 8-node VSAN configurations and show how VSAN scales for these configurations.

We ran the View Planner benchmark again to find the VDImark for 5-node and 8-node VSAN clusters, as described in the previous blog; the results are shown in the following figure.

View Planner QoS (VDImark)


The 5-node cluster achieved a VDImark score of 473 and the 8-node cluster a score of 767. These results are consistent with the 3-node and 7-node results we saw earlier (roughly 95 VMs per host), so the maximum number of VMs supported scales nearly linearly as the VSAN cluster grows from 3 to 8 nodes.
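Since VDImark is simply the number of heavy users supported, per-host density falls straight out of a division. A quick sanity check of the figures quoted across this series:

```python
# VDImark scores from this blog series, keyed by cluster size.
vdimark = {3: 286, 5: 473, 7: 677, 8: 767}

for nodes, score in sorted(vdimark.items()):
    print(f"{nodes}-node cluster: {score / nodes:.1f} VMs per host")
# Prints roughly 95 VMs per host for every configuration:
# 95.3, 94.6, 96.7, and 95.9 respectively.
```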

To further illustrate the Group-A and Group-B response times, we show the average response time of individual operations for these runs, as follows.

Group-A Response Times

As seen in the figure above, the average response times of the interactive operations are less than one second, which is needed to provide a good end-user experience. Looking at the new results for 5-node and 8-node VSAN, we see that for most operations the response time remains essentially the same across the different node configurations.

Group-B Response Times

Since Group-B is more sensitive to I/O and CPU usage, the chart above for Group-B operations is the more important test of how View Planner scales. It shows little difference in response times as the number of VMs grows from 286 on a 3-node cluster to 767 on an 8-node cluster. Hence, storage-sensitive VDI operations also scale well as we scale VSAN from 3 to 8 nodes, and user experience expectations continue to be met.

To see the other parts of the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

Turbo-charge View Video Performance

For desktop VMs using VMXNET3 NICs, you can significantly improve the peak video playback performance of your View desktop simply by setting the following registry value to the value recommended by Microsoft:

HKLM\System\CurrentControlSet\Services\Afd\Parameters\FastSendDatagramThreshold to 1500

[As discussed in a Microsoft KB article here]

[N.B. A reboot of the desktop VM is required after changing this registry setting]
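If you are applying this change across many desktop images, the value can also be set programmatically. Here is a minimal Python sketch using the standard winreg module; run it inside the guest with administrative privileges (the reboot is still required):

```python
import winreg

# Registry path and value from the Microsoft KB recommendation above.
KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Afd\Parameters"

# Open (or create) the Afd\Parameters key with write access.
key = winreg.CreateKeyEx(
    winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE
)
try:
    # A REG_DWORD of 1500 tells Windows not to wait for the
    # send-completion interrupt for datagrams up to 1500 bytes.
    winreg.SetValueEx(
        key, "FastSendDatagramThreshold", 0, winreg.REG_DWORD, 1500
    )
finally:
    winreg.CloseKey(key)

print("FastSendDatagramThreshold set to 1500; reboot the VM to apply.")
```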

When running full-screen videos at 1080p resolution on a 2vCPU desktop, we see this deliver frame-rate improvements of up to 1.4X.

So, what does this do and why does it deliver these benefits?

The VMXNET3 adapter is a paravirtualized NIC designed for performance that, as of vSphere 5, supports interrupt coalescing. Virtual interrupt coalescing is similar to a physical NIC's interrupt moderation and improves CPU efficiency for high-throughput workloads. Unfortunately, out of the box, Windows does not benefit from interrupt coalescing in many scenarios (those sending packets larger than 1024 bytes), because after sending a packet, Windows waits for a completion interrupt to be delivered before sending the next packet. Setting FastSendDatagramThreshold to the Microsoft-recommended value of 1500 bytes instructs Windows not to wait for the completion interrupt even when sending these larger packets. This allows View and PCoIP (as well as other applications that send larger packets) to benefit from interrupt coalescing, reducing CPU load and improving network throughput for PCoIP, which translates into significantly improved video playback performance.

View 5.1 Performance

In addition to the numerous enhancements detailed here, View 5.1 debuts a number of significant optimizations to the PCoIP protocol. In this blog, we detail some of the most beneficial:

PCoIP efficiency

Continuing refinements to the compression algorithms and general performance optimization deliver further improvements in PCoIP efficiency and corresponding reductions in CPU consumption. While PCoIP already performs better than the VDI competition (as illustrated here and here), these enhancements deliver up to an additional 1.3X reduction in PCoIP overheads.

Client optimizations

For this release, the VMware View clients have been significantly optimized, making their protocol handling much more streamlined. This is especially apparent on thin clients, where video playback performance improves by as much as 3X over previous versions, as illustrated in the figure below. Indeed, even relatively low-performance processors can now deliver excellent 720p video playback. These improvements are available for both x86 and ARM clients.

[Figure: thin-client video playback performance, View 5.1 vs. the previous release]

Network improvements

PCoIP handling of adverse network conditions has been significantly improved. This is especially beneficial for users connecting wirelessly from tablets or laptops over congested, lossy WiFi networks. These enhancements are most apparent during video playback, ensuring fluid, high-frame-rate video; the improvement can be as high as 8X.

Interactivity improvements

Significant improvements have also been made to interactivity, making interaction with the remote desktop noticeably more fluid. As a simple visual test of this improvement, the picture below shows a user rapidly drawing a spiral in mspaint while connected to a remote desktop using both RDP7 and View 5.1. With RDP7, the resulting spiral is visibly formed from rough polygons, whereas with View 5.1 the spiral is significantly smoother. While this test may seem overly simple, it is heavily influenced by the speed and frequency at which the client communicates with the remote desktop, and it clearly conveys the likely differences in scrolling and dragging performance. In a later blog we will deep-dive into interactive performance, using the user experience techniques we discuss here.

[Figure: spiral drawn in mspaint over RDP7 vs. View 5.1]


VMware View 5 resource optimization

In last week's post, we discussed 4 simple settings that we have observed deliver significant resource savings while preserving the user experience for typical desktop users. Having discussed the benefits of each setting in isolation, I just want to illustrate the overall gains. For runs using View Planner (which simulates a typical office user with MS Office apps, browsers, Adobe Reader, video playback, photo albums, and so on; more details can be found here), we observe a significant reduction in bandwidth when these 4 resource-control settings are applied in unison:

[Figure: View Planner bandwidth consumption, default settings vs. all four optimizations applied]

From the above plot it is apparent that the bandwidth reductions resulting from i) disabling build-to-lossless, ii) setting the maximum frame rate to 15, iii) setting the maximum audio bandwidth to 100, and iv) performing simple in-guest operations (such as selecting "optimize for visual performance" and disabling ClearType) are mainly additive, and the cumulative benefit is substantial: around a 1.8X reduction from the default! [Particularly compelling, given that for typical office users there is very little difference in user experience]
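If you prefer to script settings i) through iii) rather than push them through Group Policy, the corresponding PCoIP session variables can be written as registry values in the guest. The sketch below is a Python illustration only: the key path and value names follow the PCoIP GPO template convention, so verify them against your View GPO template before relying on it (setting iv) consists of ordinary in-guest Windows tweaks and is not shown):

```python
import winreg

# PCoIP session variables applied as registry "admin defaults".
# NOTE: key path and value names are assumptions based on the View
# PCoIP GPO template convention -- verify against your template.
KEY_PATH = r"SOFTWARE\Policies\Teradici\PCoIP\pcoip_admin_defaults"

SETTINGS = {
    "pcoip.enable_build_to_lossless": 0,  # i)   disable build-to-lossless
    "pcoip.maximum_frame_rate": 15,       # ii)  cap frame rate at 15 fps
    "pcoip.audio_bandwidth_limit": 100,   # iii) 100 kbit/s audio ceiling
}

key = winreg.CreateKeyEx(
    winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE
)
try:
    for name, value in SETTINGS.items():
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)
finally:
    winreg.CloseKey(key)
```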

VMware View PCoIP & Build-to-lossless

We have talked in previous posts about the ability in View 5 to disable build-to-lossless (BTL). When BTL is disabled, PCoIP rapidly builds the client image to a high-quality, but lossy, image; by default, if the image then remains constant, PCoIP continues to refine it in the background until it reaches a fully lossless state. Stopping the build process once the image reaches the "perceptually lossless" stage can deliver significant bandwidth savings; for typical office workflows, we are seeing around a 30% bandwidth reduction.

Furthermore, in many situations, the difference between fully lossless and perceptually lossless images can be virtually impossible to discern. During our VMworld presentation, we used the following image to emphasize the quality of perceptually lossless:

[Figure: zoomed comparison of fully lossless vs. perceptually lossless image quality]

In this qualitative comparison, we present a zoom-in of two small images. For each, the View fully lossless and View perceptually lossless (no BTL) versions are shown side by side, hopefully conveying how difficult it is, even when zoomed, to find any differences.

To further emphasize the perceptually lossless quality, it's also interesting to examine quantitative data, for example, PSNR (peak signal-to-noise ratio) and RMS (root-mean-square) error. For a fairly complex image, a fall-colors landscape with significant fine detail in the background tree colors, comparing the perceptually lossless build to a fully lossless build (in RGB space) yields a PSNR of 45.8dB and an RMS error of just 1.3! This clearly illustrates how little quality is lost in perceptually lossless images. Consider the RMS error of 1.3: for 32-bit color, each RGBA component has 8 bits of precision, with values ranging from 0 to 255. For this image, perceptually lossless encoding introduces an average error of only +/-1.3 to these values, which is negligible for most use cases!

[While PSNR obviously varies from image to image, I'm seeing ~45dB much of the time]
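For anyone wanting to reproduce this kind of measurement, PSNR follows directly from the RMS error: for 8-bit channels, PSNR = 20*log10(255/RMS), and 20*log10(255/1.3) ≈ 45.8dB, matching the figures above. Here is a small numpy sketch (our measurement harness, not PCoIP code):

```python
import numpy as np

def rms_and_psnr(reference, test):
    """RMS error and PSNR (dB) between two 8-bit images given as
    numpy arrays of identical shape."""
    diff = reference.astype(np.float64) - test.astype(np.float64)
    rms = np.sqrt(np.mean(diff ** 2))
    psnr = 20 * np.log10(255.0 / rms) if rms > 0 else float("inf")
    return rms, psnr

# Synthetic check: a +/-1.3 per-channel error yields ~45.8dB.
ref = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
noise = np.random.choice([-1.3, 1.3], size=ref.shape)
test = np.clip(ref.astype(np.float64) + noise, 0, 255)
print(rms_and_psnr(ref, test))  # approximately (1.3, 45.8)
```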


View 5 PCoIP Client-Side Image Caching

At the recent VMworld we mentioned that VMware View 5 introduces PCoIP support for client-side image caching. In our VMworld presentation, we highlighted that, on average, this caching optimization reduces bandwidth consumption by about 30%. However, there are a number of important scenarios where the ability of the PCoIP image cache to capture spatial, as well as temporal, redundancy delivers even bigger benefits.

For instance, consider scrolling through a PDF document.  As we scroll down, new content appears along the bottom edge of the window, and the oldest content disappears from the top edge. All the other content in the application window remains essentially constant, merely shifted upward. The PCoIP image cache is capable of detecting this spatial and temporal redundancy. As a result, for scrolling operations, the display information sent to the client device is primarily just a sequence of cache indices — delivering significant bandwidth savings.
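To make the mechanism concrete, here is a toy Python illustration (ours, not the actual PCoIP codec) of how a tile-based cache turns a scroll into mostly cache hits: each frame is split into fixed-size tiles, and any tile the client has already seen is sent as a small cache index instead of pixels.

```python
import hashlib

def encode_frame(frame_tiles, cache):
    """frame_tiles: list of bytes, one per screen tile.
    Returns ('hit', index) for tiles already in the client cache,
    ('miss', pixels) for tiles that must be sent in full."""
    wire = []
    for pixels in frame_tiles:
        digest = hashlib.sha1(pixels).digest()
        if digest in cache:
            wire.append(("hit", cache[digest]))  # small index only
        else:
            cache[digest] = len(cache)           # register new tile
            wire.append(("miss", pixels))        # full pixel payload
    return wire

def row(tag):
    """Ten distinct dummy tiles representing one row of a window."""
    return [bytes([tag, i]) * 4096 for i in range(10)]

# Scrolling: frame 2 is frame 1 shifted up one row, plus one new row.
frame1 = row(1) + row(2) + row(3)
frame2 = row(2) + row(3) + row(4)

cache = {}
encode_frame(frame1, cache)  # cold cache: all 30 tiles are misses
wire = encode_frame(frame2, cache)
hits = sum(1 for kind, _ in wire if kind == "hit")
print(f"frame 2: {hits}/{len(wire)} tiles served from cache")  # 20/30
```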

This efficient scrolling has a couple of key benefits:

  • On LAN networks, where bandwidth is relatively unconstrained, there is sufficient bandwidth for high-quality scrolling even when client-side caching is disabled. In these situations, enabling client-side image caching delivers significant bandwidth savings: experimenting with a variety of applications and content types (text heavy, image heavy, and so on), I'm seeing bandwidth reductions of over 4X compared with caching disabled (mileage may vary, but I'm seeing this fairly consistently)!
  • On WAN networks, where bandwidth is scarce, scrolling performance is often degraded to stay within the available bandwidth when client-side caching is disabled. In these situations, in addition to bandwidth reductions (which vary with how far scrolling performance is degraded when caching is disabled), client-side caching ensures smooth, highly responsive scrolling even in WAN environments with very constrained bandwidth.


VMworld 2011 VMware View 5 PCoIP

Slides from our recent VMworld presentation on View 5 PCoIP performance (EUC1987) are now available here. In addition to discussing the latest PCoIP optimizations and best practices in detail, the presentation also includes competitive data.

VMware View & PCoIP at VMworld

In recent weeks there’s been growing excitement about the PCoIP enhancements coming to VMware View. For instance, Warren Ponder discussed here how these enhancements reduce bandwidth consumption by up to 75%. Engineers from VMware’s performance team (& Warren) will be talking more about these enhancements and how they translate into real-world performance at the rapidly approaching VMworld 2011 in Las Vegas:

EUC1987: VMware View PC-over-IP Performance and Best Practices
Tuesday, August 30th – 12:00
Wednesday, August 31st – 1:00

EUC3163: VMware View Performance and Best Practices
Tuesday, August 30th – 4:30
Wednesday, August 31st – 4:00

We will also be blogging additional details and performance results as VMworld progresses, followed by a performance whitepaper.

Stay tuned!