
Tag Archives: vsan

VMware’s AI-based Performance Tool Can Improve Itself Automatically

PerfPsychic, our AI-based performance analysis tool, improves its accuracy rate from 21% to 91% with more data and training when debugging vSAN performance issues. Better still, PerfPsychic can continuously improve itself, and the tuning procedure is automated. Let's examine how we achieve this in the following sections.

How to Improve AI Model Accuracy

Three elements have a huge impact on the training results for deep learning models: the amount of high-quality training data, reasonably configured hyperparameters that control the training process, and sufficient but acceptable training time. In the following examples, we use the same training and testing dataset that we presented in our previous blog.

Amount of Training Data

The key to PerfPsychic is the effectiveness of its deep learning pipeline, so we start by gradually adding more labeled data to the training dataset to demonstrate how the models learn from additional labeled data and improve their accuracy over time. Figure 1 shows the results: we start with only 20% of the training dataset and label 20% more each time. There is a clear trend that, as more properly labeled data is added, our model learns and improves its accuracy without any further human intervention. The accuracy improves from around 50% when we have only about 1,000 data points to 91% when we have the full set of 5,275 data points. Such accuracy is as good as a programmatic analytic rule that took us three months to tune manually.

Figure 1. Accuracy improvement over larger training datasets
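The experiment itself is simple to express: train on a growing fraction of the labeled data and record test accuracy at each step. Here is a minimal sketch, assuming hypothetical train_cnn and evaluate helpers rather than PerfPsychic's actual pipeline:

```python
# Sketch: accuracy vs. amount of labeled training data.
# train_cnn() and evaluate() are stand-ins for the real training and evaluation code.
import random

def accuracy_vs_data_size(labeled_data, test_data, train_cnn, evaluate):
    random.shuffle(labeled_data)
    results = {}
    for fraction in (0.2, 0.4, 0.6, 0.8, 1.0):           # label 20% more each time
        subset = labeled_data[: int(len(labeled_data) * fraction)]
        model = train_cnn(subset)                        # off-line training on the subset
        results[fraction] = evaluate(model, test_data)   # accuracy on held-out escalation data
    return results
```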

Training Hyperparameters

We next vary several other CNN hyperparameters to demonstrate how they were selected for our models. We change only one hyperparameter at a time and train 1,000 CNNs using that configuration. We first vary the number of training iterations, that is, how many times we go through the training dataset. If the number of iterations is too small, the CNNs cannot be trained adequately; if it is too large, training takes much longer and may overfit to the training data. As shown in Figure 2, between 50 and 75 iterations is the best range, with 75 iterations achieving the best accuracy of 91%.

Figure 2. Number of training iterations vs. accuracy

We next vary the step size, which is the granularity of our search for the best model. In practice, with a small step size, the optimization is so slow that it cannot reach the optimal point in a limited time. With a large step size, we risk stepping past optimal points. Figure 3 shows that the model produces good accuracy between 5e-3 and 7.5e-3, with 5e-3 predicting 91% of the labels correctly.

Figure 3. Step size vs. accuracy
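Both sweeps follow the same one-at-a-time pattern: fix everything else, vary a single hyperparameter, retrain, and keep the accuracy. A sketch of that loop, with illustrative grid values and a hypothetical train_and_test helper standing in for training 1,000 CNNs per configuration:

```python
# Sketch: one-at-a-time hyperparameter sweep over iteration count and step size.
BASELINE = {"iterations": 75, "step_size": 5e-3}   # best values reported in the text

def sweep(train_and_test):
    results = []
    for iterations in (25, 50, 75, 100, 150):                 # illustrative grid
        cfg = dict(BASELINE, iterations=iterations)
        results.append((cfg, train_and_test(**cfg)))
    for step_size in (1e-3, 2.5e-3, 5e-3, 7.5e-3, 1e-2):      # illustrative grid
        cfg = dict(BASELINE, step_size=step_size)
        results.append((cfg, train_and_test(**cfg)))
    return max(results, key=lambda pair: pair[1])             # best (config, accuracy)
```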

Last, we evaluate the impact of the issue rate of the training data on accuracy. The issue rate is the percentage of training data that represents performance issues. In an ideal training dataset, all the labels should be equally represented to avoid overfitting; a biased dataset generally results in overfitting models that can barely achieve high accuracy. Figure 4 below shows that when the training data have an issue rate under 20% (that is, under 20% of the components are faulty), the model basically overfits to the "no issue" data points and predicts that all components are issue-free. Because 21.9% of the components in our testing data have no issues, the model stays at 21.9% accuracy. In contrast, when the issue rate is over 80%, the model simply treats all components as faulty and thus achieves 78.1% accuracy. This explains why it is important to ensure every label is equally represented, and why we mix our issue/no-issue data at a ratio between 40% and 60%.

Figure 4. Impact of issue rate
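Keeping the issue rate inside the 40% to 60% band can be enforced with a simple rebalancing step before training. A sketch, assuming each sample carries an is_issue label (not PerfPsychic's actual data schema):

```python
# Sketch: downsample the majority class so the issue rate lands on the target ratio.
import random

def rebalance(samples, target_issue_rate=0.5, seed=42):
    rng = random.Random(seed)
    issues = [s for s in samples if s["is_issue"]]
    no_issues = [s for s in samples if not s["is_issue"]]
    if len(issues) < len(no_issues):
        # "issue" is the minority: shrink the "no issue" side to hit the target ratio.
        keep = int(len(issues) * (1 - target_issue_rate) / target_issue_rate)
        no_issues = rng.sample(no_issues, min(keep, len(no_issues)))
    else:
        # "no issue" is the minority: shrink the "issue" side instead.
        keep = int(len(no_issues) * target_issue_rate / (1 - target_issue_rate))
        issues = rng.sample(issues, min(keep, len(issues)))
    balanced = issues + no_issues
    rng.shuffle(balanced)
    return balanced
```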

Training Duration

Training time is also an important factor in a practical deep learning pipeline design. As we train thousands of CNN models, spending one second longer on each model makes a whole training phase take 1,000 seconds longer. Figure 5 below shows training time versus data size and number of iterations. Both factors form a linear trend: with more data and more iterations, training takes linearly longer. Fortunately, we know from the study above that more than 75 iterations does not help accuracy. By limiting the number of iterations, we can complete a whole phase of training in less than 9 hours. Again, once the off-line training is done, the model can perform real-time prediction in just a few milliseconds. The training time simply affects how often and how fast the models can pick up new feedback from product experts.

Figure 5. Effect of data size and iteration on training time
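Because training cost grows linearly in both data size and iteration count, the length of a training phase can be budgeted up front. A rough sketch of such a budget, where every input (the per-sample cost constant, model count, and degree of parallelism) is an assumption rather than a measurement:

```python
# Sketch: back-of-the-envelope budget for one off-line training phase.
# The per-sample cost constant and the degree of parallelism are assumptions, not measurements.
import math

SECONDS_PER_SAMPLE_PER_ITERATION = 1e-4   # assumed constant; fit this from real timings

def phase_hours(num_samples, iterations, num_models=1000, parallel_workers=16):
    per_model = num_samples * iterations * SECONDS_PER_SAMPLE_PER_ITERATION  # linear in both
    batches = math.ceil(num_models / parallel_workers)   # models train in parallel on GPU servers
    return batches * per_model / 3600.0
```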

Automation

The model selection procedure is fully automated. Thousands of models with different hyperparameter settings are trained in parallel on our GPU-enabled servers. The trained models compete with each other by analyzing our prepared testing data and reporting their results. We then pick the model with the highest accuracy, put it into PerfPsychic, and use it for online analysis. We also keep a record of the hyperparameters of the winning models and use them as initial setups in future training rounds. Therefore, our models keep evolving.
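The selection step reduces to scoring every trained candidate on the testing data and carrying the winner, and its hyperparameters, forward. A minimal sketch with hypothetical model objects and an evaluate helper:

```python
# Sketch: automated model selection; the winner's hyperparameters seed the next training round.
def select_best(trained_models, test_data, evaluate):
    scored = [(evaluate(model, test_data), model) for model in trained_models]
    best_accuracy, best_model = max(scored, key=lambda pair: pair[0])
    next_round_init = best_model.hyperparameters   # assumed attribute; recorded for future trainings
    return best_model, best_accuracy, next_round_init
```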

PerfPsychic in Application

PerfPsychic is not only a research project but also a widely used internal performance analysis tool. It is now used to automatically analyze vSAN performance bugs filed in Bugzilla.

PerfPsychic automatically detects new vSAN performance bugs submitted in Bugzilla and extracts the usable data logs from the bug attachments. It then analyzes the logs with the trained models. Finally, the analysis results, including performance enhancement suggestions, are emailed to the bug submitter and the vSAN developer group.
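End to end, the automation is a watch-extract-analyze-notify loop. The sketch below is schematic only; every helper name is a placeholder for internal tooling, not a real Bugzilla API:

```python
# Sketch: schematic Bugzilla-to-email pipeline; all helpers below are placeholders.
import time

def analysis_loop(find_new_vsan_perf_bugs, fetch_log_attachments,
                  analyze_logs, send_report, poll_seconds=600):
    seen = set()
    while True:
        for bug in find_new_vsan_perf_bugs():
            if bug.id in seen:
                continue
            logs = fetch_log_attachments(bug)   # usable data logs from the bug attachments
            findings = analyze_logs(logs)       # trained models pinpoint suspect components
            send_report(bug, findings)          # mail the submitter and the vSAN developer group
            seen.add(bug.id)
        time.sleep(poll_seconds)
```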

Below is part of a recently received email that gives performance tuning advice on a vSAN bug. Internal information is hidden.

Figure 6. Part of email generated by PerfPsychic to offer performance improvement suggestions

 

VMware Speedily Resolves Customer Issues in vSAN Performance Using AI

We in VMware's Performance team create and maintain various tools to help troubleshoot customer issues. Among these is a new tool that uses artificial intelligence to determine storage problems from vast log data far more quickly: what used to take us days now takes seconds. PerfPsychic analyzes storage system performance and finds performance bottlenecks using deep learning algorithms.

Let's examine the benefit that the artificial intelligence (AI) models in PerfPsychic bring when we troubleshoot vSAN performance issues. It takes our trained AI module less than 1 second to analyze a vSAN log and pinpoint performance bottlenecks at an accuracy rate of more than 91%. In contrast, when analyzed manually, an SR ticket on vSAN takes a seasoned performance engineer about one week to deescalate, with durations ranging from 3 to 14 days. Moreover, AI also wins over traditional analysis algorithms by raising the accuracy rate from around 80% to more than 90%.

Architecture

There are two operation modes in the AI module: an off-line training mode and a real-time prediction mode. In the training mode, sets of training data, labeled with their performance issues, are automatically fed to all potential convolutional neural network (CNN) [1] structures, which we train repeatedly on GPU-enabled servers. We train thousands of models at a time and promote the one that achieves the best accuracy to the real-time system. In the real-time prediction mode, unlabeled user data are sent to the model chosen in the training stage, which provides a prediction of the root cause (the faulty component).

As shown in Figure 1, data in both training and prediction modes are first sent to a data preparation module (Queried Data Preparation), where they are formatted for later stages. The data path then diverges. Let's first follow the dashed line for the path of labeled training data. They are sent to the deep learning training module (DL Model Training) to train an ensemble of thousands of CNNs generated from our carefully designed structures. After going through the training data thousands of times, until the training accuracy converges to a stable value, the trained CNNs compete with each other in the deep learning model selection module (DL Model Selection), where they must predict the root causes of testing data that the models have never seen before. Their predictions are compared to the real root causes, labeled by human engineers, to calculate the testing accuracy. Finally, we provide the ensemble of models (Trained DL Model) that achieves the best testing accuracy to the real-time prediction system.

Figure 1: Deep Learning Module Workflow

As you might expect, this training process is both time consuming and resource hungry, so it is carried out off-line on servers equipped with powerful GPUs. Prediction mode, by contrast, is relatively lightweight and can adapt to real-time applications.

Following the solid line in Figure 1 for prediction mode, the unlabeled, normalized user data are sent to our carefully picked models, and the root cause (Performance Exception) is predicted with a small amount of computation. The prediction is returned to the upper layer, such as our interactive analytic web UI, automatic analysis, or proactive analysis applications. The web UI also offers a means of manually validating the prediction, which automatically triggers the next round of model training. This completes the feedback loop and ensures our models continue to learn from human feedback.
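The feedback loop at the end of that path can be captured in a few lines: serve a prediction, and when an engineer corrects it through the UI, fold the corrected sample back into the training pool and schedule the next training round. A sketch with hypothetical helpers:

```python
# Sketch: real-time prediction plus the human-validation feedback loop (all names are placeholders).
def serve_prediction(model, normalized_user_data):
    return model.predict(normalized_user_data)   # Performance Exception: the suspected faulty component

def on_manual_validation(sample, corrected_label, training_pool, schedule_training):
    training_pool.append((sample, corrected_label))  # validated label becomes new training data
    schedule_training()                              # triggers the next round of off-line model training
```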

AI Wins Over Manual Debugging

Diagnosing performance problems in a software-defined datacenter (SDDC) is difficult due to both the scale of the systems and the scale of the data. The scale of the software and hardware systems results in complicated behaviors that are not only workload-dependent but also interfere with each other. Pinpointing a root cause therefore requires thorough examination of the entire datacenter. However, due to the scale of the data collected across a datacenter, this analysis process requires significant human effort, takes an extremely long time, and is prone to errors. Take vSAN for example: dealing with performance-related escalations typically requires cross-departmental efforts examining the vSAN stack, the ESXi stack, and the physical/virtual network stacks. In some cases, it took many engineers months to pinpoint problems outside of the VMware stack, such as physical network misconfigurations. On average, it takes one week to deescalate a client's service request ticket with the effort of many experienced engineers working together.

PerfPsychic is designed to address the challenges we have faced and to make performance diagnostics more scalable. PerfPsychic builds upon a data infrastructure that is at least 10 times faster and 100 times more scalable than the existing one. It provides an end-to-end interactive analytic UI that allows users to perform the majority of the analysis in one place. The analysis results are then immediately fed back to the deep learning pipeline in the backend, which produces diagnostic models that detect faulty components more accurately as more feedback is collected. These models mostly take only a few hours to train and can detect faulty components in a given dataset in a few milliseconds, with accuracy comparable to rules that took us months to tune manually.

AI Wins Over Traditional Algorithms

To prove the effectiveness of our AI approach, we tested it against traditional machine learning algorithms.

First, we created two datasets: training data and testing data, as summarized in Table 1.

Table 1: Training and Testing Data Property

Training data are generated from our simulated environment: a simple 4-node hybrid vSAN setup. We manually inject performance errors into the test environment to collect training data with accurate labels. For example, to simulate a network issue, we have vmkernel drop a receiving packet at the VMK TCP/IP layer for every N packets, which mimics packet drops in the physical network. We vary N to produce enough data points for training. Although this does not perfectly reproduce what happens in a customer environment, it is still a best practice because it is the only cost-effective way to get a large volume of labeled data that are clean and accurate.
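Conceptually, the injected fault is just a deterministic one-in-N drop filter applied to the receive path. The toy illustration below shows only the drop pattern, not the vmkernel change itself:

```python
# Toy illustration of the "drop one receiving packet for every N" fault injection.
def drop_every_nth(packets, n):
    kept = [p for i, p in enumerate(packets, start=1) if i % n != 0]
    drop_rate = 1.0 - len(kept) / len(packets)
    return kept, drop_rate

_, rate = drop_every_nth(list(range(1000)), n=50)
print(f"dropped {rate:.1%} of packets")   # 2.0% for N=50; vary N to generate more training points
```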

The testing data, in contrast, are all from customer escalations, which have very different system configurations in many respects (number of hosts, types and number of disks, workloads, and so on). In our testing data, 78.1% of the data are labeled with performance issues. Note that a "performance issue" refers to a specific component in the system that is causing the performance problem in the dataset. We define "accuracy" as the percentage of components from the testing datasets to which the CNN model assigns the correct label ("issue" or "no issue").

With the same training data, we trained one CNN and four popular machine learning models: Support Vector Machine (SVM) [2], Logistic Classification (LOG) [3], Multi-layer Perceptron Neural Network (MLP) [4], and Multinomial Naïve Bayes (MNB) [5]. Then we tested the five models against the testing dataset. To quantify model performance, we calculate accuracy as follows.
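In other words, over the labeled components of the testing set:

```latex
\text{accuracy} = \frac{\#\{\text{components given the correct ``issue'' / ``no issue'' label}\}}
                       {\#\{\text{components in the testing set}\}} \times 100\%
```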

Finally, we compared the accuracy achieved by each model, as shown in Figure 2. The result reveals that AI is a clear winner, with 91% accuracy.

Figure 2: Analytic Algorithm Accuracy Comparison
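For readers who want to reproduce a comparison of this shape, the four baseline classifiers are available off the shelf in scikit-learn. The sketch below is illustrative only; it is not the code or the feature extraction used in the study:

```python
# Sketch: baseline comparison with off-the-shelf classifiers (scikit-learn).
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB

def compare_baselines(X_train, y_train, X_test, y_test):
    models = {
        "SVM": SVC(),
        "LOG": LogisticRegression(max_iter=1000),
        "MLP": MLPClassifier(max_iter=500),
        "MNB": MultinomialNB(),   # expects non-negative feature values
    }
    return {name: clf.fit(X_train, y_train).score(X_test, y_test)   # mean accuracy per model
            for name, clf in models.items()}
```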

Acknowledgments

We appreciate the assistance and feedback from Chien-Chia Chen, Amitabha Banerjee, and Xiaobo Huang. We are also grateful for the support of our manager Rajesh Somasundaran. Lastly, we thank Julie Brodeur for her help in reviewing and her recommendations for this blog post.

References

  1. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going Deeper with Convolutions,” CoRR, abs/1409.4842, 2014.
  2. A. J. Smola and B. Schölkopf, “A Tutorial on Support Vector Regression,” Statistics and Computing, Volume 14, Issue 3, August 2004, pp. 199-222.
  3. C. Bishop, “Pattern Recognition and Machine Learning,” Chapter 4.3.4.
  4. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” http://www.iro.umontreal.ca/~pift6266/A06/refs/backprop_old.pdf.
  5. H. Zhang, “The Optimality of Naive Bayes,” Proc. FLAIRS, 2004, http://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf.

How to correctly test the performance of Virtual SAN 6.2 deduplication feature

In VMware Virtual SAN 6.2, we introduced several features highly requested by customers, such as deduplication and compression. An overview of this feature can be found in the blog: Virtual SAN 6.2 – Deduplication And Compression Deep Dive.

The deduplication feature adds the most benefit to an all-flash Virtual SAN environment because, while SSDs are more expensive than spinning disks, the cost is amortized because more workloads can fit on the smaller SSDs. Therefore, our performance testing is performed on an all-flash Virtual SAN cluster with deduplication enabled.

When testing the performance of the deduplication feature for Virtual SAN, we observed the following:

  • Unexpected deduplication ratio
  • High device read latency in the capacity tier, even though the SSD is perfectly fine

In this blog, we discuss the reason behind these two issues and share our testing experience.


VMware Virtual SAN Stretched Cluster Best Practices White Paper

VMware Virtual SAN 6.1 introduced the concept of a stretched cluster, which allows a Virtual SAN customer to configure two geographically separated sites while synchronously replicating data between them. A technical white paper about Virtual SAN stretched cluster performance has now been published. The paper provides guidelines on how to get the best performance for applications deployed in a Virtual SAN stretched cluster environment.

The chart below, borrowed from the white paper, compares the performance of a Virtual SAN 6.1 stretched cluster deployment against a regular Virtual SAN cluster without any fault domains. A nine-node Virtual SAN stretched cluster is considered with two different inter-site latencies: 1ms and 5ms. The DVD Store benchmark is executed on four virtual machines on each host of the nine-node Virtual SAN stretched cluster. The DVD Store performance metrics of cumulative orders per minute in the cluster, read/write IOPS, and average latency are compared with a similar workload on the regular Virtual SAN cluster. Orders per minute (OPM) is lower by 3% and 6% for the 1ms and 5ms inter-site latency stretched clusters compared to the regular Virtual SAN cluster.

Figure 1a.  DVD Store orders per minute in the cluster and guest IOPS comparison

Guest read/write IOPS and latency were also monitored. The read/write mix for the DVD Store workload is roughly 1/3 read and 2/3 write. Write latency shows a clear increasing trend as the inter-site latency grows, while read latency is only marginally impacted. As a result, the average latency increases from 2.4ms to 2.7ms and 5.1ms for the 1ms and 5ms inter-site latency configurations, respectively.

Figure 1b.  DVD Store latency comparison

These results demonstrate that the inter-site latency in a Virtual SAN stretched cluster deployment has a marginal performance impact on a commercial workload like DVD Store. More results are available in the white paper.

Virtual SAN 6.0 Performance with VMware VMmark

Virtual SAN is a storage solution that is fully integrated with VMware vSphere. Virtual SAN leverages flash technology to cache data and improve its access time to and from the disks. We used VMware’s VMmark 2.5 benchmark to evaluate the performance of running a variety of tier-1 application workloads together on Virtual SAN 6.0.

VMmark is a multi-host virtualization benchmark that uses varied application workloads and common datacenter operations to model the demands of the datacenter. Each VMmark tile contains a set of virtual machines running diverse application workloads as a unit of load. For more details, see the VMmark 2.5 overview.

 

Testing Methodology

VMmark 2.5 requires two datastores for its Storage vMotion workload, but Virtual SAN creates only a single datastore. A Red Hat Enterprise Linux 7 virtual machine was created on a separate host to act as an iSCSI target to serve as the secondary datastore. Linux-IO Target (LIO) was used for this.

 

Configuration

Systems Under Test: 8x Supermicro SuperStorage SSG-2027R-AR24 servers
CPUs (per server): 2x Intel Xeon E5-2670 v2 @ 2.50 GHz
Memory (per server): 256 GiB
Hypervisor: VMware vSphere 5.5 U2 and vSphere 6.0
Local Storage (per server): 3x 400GB Intel SSDSC2BA40 SSDs, 12x 900GB 10,000 RPM WD Xe SAS drives
Benchmarking Software: VMware VMmark 2.5.2

 

Workload Characteristics

Storage performance is often measured in IOPS, or I/Os per second. Virtual SAN is a storage technology, so it is worthwhile to look at how many IOPS VMmark is generating.  The most disk-intensive workloads within VMmark are DVD Store 2 (also known as DS2), an E-Commerce workload, and the Microsoft Exchange 2007 mail server workload. The graphs below show the I/O profiles for these workloads, which would be identical regardless of storage type.

Figure 1

The DS2 database virtual machine shows a fairly balanced I/O profile of approximately 55% reads and 45% writes.

Microsoft Exchange, on the other hand, has a very write-intensive load, as shown below.

Figure 2

Exchange sees nearly 95% writes, so the main benefit the SSDs provide is to serve as a write buffer.

The remaining application workloads have minimal disk I/Os, but do exert CPU and networking loads on the system.

 

Results

VMmark measures both the total throughput and the response time of each workload.  The application workloads consist of Exchange, Olio (a Java workload that simulates Web 2.0 applications and measures their performance), and DVD Store 2. All workloads are driven at a fixed throughput level.  A set of workloads is considered a tile.  The load is increased by running multiple tiles.  With Virtual SAN 6.0, we could run up to 40 tiles with acceptable quality of service (QoS). Let's look at how each workload performed as the number of tiles increased.

DVD Store

There are 3 webserver frontends per DVD Store tile in VMmark.  Each webserver is loaded with a different profile.  One is a steady-state workload, which runs at a set request rate throughout the test, while the other two are bursty in nature and run a 3-minute and 4-minute load profile every 5 minutes.  DVD Store throughput, measured in orders per minute, varies depending on the load of the server. The throughput will decrease once the server becomes saturated.

Figure 3

For this configuration, maximum throughput was achieved at 34 tiles, as shown by the graph above.  As the hosts become saturated, the throughput of each DVD Store tile falls, resulting in a total throughput decrease of 4% at 36 tiles. However, the benchmark still passes QoS at 40 tiles.

Olio and Exchange

Unlike DVD Store, the Olio and Exchange workloads operate at a constant throughput regardless of server load, as shown in the table below:

Workload | Simulated Users | Load per Tile
Exchange | 1000 | 320-330 Sendmail actions per minute
Olio | 400 | 4500-4600 operations per minute

 

At 40 tiles, the VMmark clients are sending ~12,000 mail messages per minute and the Olio webservers are serving ~180,000 requests per minute.

As the load increases, the response time of Exchange and Olio increases, which makes them a good demonstration of the end-user experience at various load levels. A response time of over 500 milliseconds is considered to be an unacceptable user experience.

Figure 4

As we saw with DVD Store, performance begins to change dramatically after 34 tiles as the cluster becomes saturated.  This is mostly seen in the Exchange response time.  At 40 tiles, the response time is over 300 milliseconds for the mailserver workload, which is still within the 500 millisecond threshold for a good user experience. Olio shows a smaller increase in response time, since it is more processor intensive, whereas Exchange depends on both CPU and disk performance.

Looking at Virtual SAN performance, we can get a picture of how much I/O is served by the storage at these load levels.  We can see that reads average around 2000 read I/Os per second:

Figure 5

The Read Cache hit rate is 98-99% on all the hosts, so most of these reads are being serviced by the SSDs. Write performance is a bit more varied.

Figure 6

We see a range of 5,000-10,000 write IOPS per node due to the write-intensive Exchange workload. Storage is nowhere close to saturation at these load levels. The magnetic disks are not seeing much more than 100 I/Os per second, while the SSDs are seeing about 3,000 – 6,000 I/Os per second. These disks should be able to handle at least 10x this load level. The real bottleneck is in CPU usage.

Looking at the CPU usage of the cluster, we can see that usage levels off at about 84% at 36 tiles.  There is still some headroom, which explains why the Olio response times are still very acceptable.

Figure 7

As mentioned above, Exchange performance depends on both CPU and storage. The additional CPU that Virtual SAN requires for disk I/O makes Exchange more sensitive to server load.

 

Performance Improvements in Virtual SAN 6.0 (vs. Virtual SAN 5.5)

The Virtual SAN 6.0 release incorporates many improvements to CPU efficiency, as well as other improvements. This translates to increased performance for VMmark.

VMmark performance increased substantially when we ran the tests with Virtual SAN 6.0 as opposed to Virtual SAN 5.5. The Virtual SAN 5.5 tests failed to pass QoS beyond 30 tiles, meaning that at least one workload failed to meet the application latency requirement.  During the Virtual SAN 5.5 32-tile tests, one or more Exchange clients would report a Sendmail latency of over 500ms, which is determined to be a QoS failure.  Version 6.0 was able to achieve passing QoS at up to 40 tiles.

Figure 8

Not only were more virtual machines able to be supported on Virtual SAN 6.0, but the throughput of the workloads increased as well.  By comparing the VMmark score (normalized to 20-tile Virtual SAN 5.5 results) we can see the performance improvement of Virtual SAN 6.0.

Figure 9

Virtual SAN 6.0 achieved a performance improvement of 24% while supporting 33% more virtual machines.

 

Conclusion

Using VMmark, we are able to run a variety of workloads to simulate applications in a production environment.  We were able to demonstrate that Virtual SAN is capable of achieving good performance running heterogeneous real-world applications.  The cluster of 8 hosts presented here shows good performance in VMmark through 40 tiles: ~12,000 mail messages per minute sent through Exchange, ~180,000 requests per minute served by the Olio webservers, and over 200,000 orders per minute processed on the DVD Store database.  Additionally, we measured substantial performance improvements of Virtual SAN 6.0 over Virtual SAN 5.5.

 

VMware View Planner 3.5 and Use Cases

by Banit Agrawal and Nachiket Karmarkar

VMware View Planner 3.5 was recently released; it introduces a slew of new features, enhancements to the user experience, and scalability improvements. In this blog, we present some of these new features and use cases. More details can be found in the whitepaper here.

In addition to retaining all the features available in VMware View Planner 3.0, View Planner 3.5 addresses the following new use cases:

  • Support for VMware Horizon 6 (RDSH sessions and application publishing)
  • Support for Windows 8.1 as desktops
  • Improved user experience
  • Audio-Video sync (AVBench)
  • Drag and Scroll workload (UEBench)
  • Support for Windows 7 as clients

In View Planner 3.5, we augment View Planner's capability to quantify user experience for user sessions and application remoting provided through remote desktop session hosts (RDSH) as a server farm. Starting with this release, we support Windows 8.1 as a guest OS for desktops and Windows 7 as a guest OS for clients.

New Interactive Workloads

We also introduced two advanced workloads: (1) Audio-Video sync (AVBench) and (2) Drag and Scroll workload (UEBench). AVBench determines audio fidelity in a distributed environment where audio and video streams are not tethered. The “Drag and Scroll” workload determines spatial and temporal variance by emulating user events like mouse click, scroll, and drag.


Fig. 1. Mouse-click and drag (UEBench)

As seen in Figure 1, a mouse event is sent to the desktop and the red and black image is dragged across and down the screen.


Fig. 2. Mouse-click and scroll (UEBench)

Similarly, Figure 2 depicts a mouse event sent to the scroll bar of an image that is scrolled up and down.

Better Run Status Reporting

As part of improving the user experience, the UI tracks the current stage of the View Planner run and notifies the user through a color-coded box. The text inside the box is a clickable link that provides a pop-up giving deeper insight into that particular stage.


Fig. 3. View Planner run status shows the intermediate status of the run

Pre-Check Run Profile for Errors

A “check” button provides users a way to verify the correctness of their run-profile parameters.


Fig. 4. ‘Check’ button in Run & Reports tab in View Planner UI

In the past, users needed to optimize the parent VMs used for deploying clients and desktop pools. View Planner 3.5 automates these optimizations as part of installing the View Planner agent service. The agent installer also comes with a UI that tracks the current stage of the installation and highlights the results of each installer stage.

Sample Use Cases

Single Host VDI Scaling

Through this release, we have re-affirmed View Planner as an ideal tool for platform characterization in VDI scenarios.  On a Cisco UCS C240 server, we started with a small number of desktops running the "standard benchmark profile" and increased the count until the Group A and Group B numbers exceeded their thresholds. The results below demonstrate the scalability of a single UCS C240 server as a platform for VDI deployments.


Fig. 5. Single server characterization with hosted desktops for CISCO UCS C240

Single Host RDSH Scaling

We followed the best practices prescribed in the VMware Horizon 6 RDSH Performance & Best Practices whitepaper and set up a number of remote desktop session (RDS) servers that would fully consolidate a single UCS C240 vSphere server. We started with a small number of user sessions per core and then increased them until the Group A and Group B numbers exceeded the threshold level. The results below demonstrate how View Planner can accurately gauge the scalability of a platform (Cisco UCS in this case) when deployed in an RDS scenario.


Fig. 6. Single server characterization with RDS sessions for CISCO UCS C240

Storage Characterization

View Planner can also be used to characterize storage array performance. The scalability of View Planner 3.5 to drive a workload on thousands of virtual desktops and process the results thereafter makes it an ideal candidate to validate storage array performance at scale. The results below demonstrate scalability of VDI desktops obtained on Pure Storage FA-420 all-flash array. View Planner 3.5 could easily scale to 3000 desktops, as highlighted in the results below.


Fig. 7. 3000 Desktops QoS results on Pure Storage FA-420 storage array

Custom Applications Scaling

In addition to characterizing platforms and storage arrays, the custom app framework can achieve targeted, application-specific VDI characterization. The following results show Visio as an example of a custom app scale study on an RDS deployment with a 4-vCPU, 10GB vRAM Windows Server 2008 VM.


Fig. 8 Visio operation response times with View Planner 3.5 when scaling up application sessions

Other Use Cases

With a plethora of features, supported guest OSes, and configurations, it is no wonder that View Planner is capable of characterizing multiple VMware solutions and offerings that work in tandem with VMware Horizon. View Planner 3.5 can also be used to evaluate the following features, which are described in more detail in the whitepaper:

  • VMware Virtual SAN
  • VMware Horizon Mirage
  • VMware App Volumes

For more details about new features, use cases, test environment, and results, please refer to the View Planner 3.5 white paper here.

Virtual SAN 6.0 Performance: Scalability and Best Practices

A technical white paper about Virtual SAN performance has been published. This paper provides guidelines on how to get the best performance for applications deployed on a Virtual SAN cluster.

We used Iometer to generate several workloads that simulate various I/O encountered in Virtual SAN production environments. These are shown in the following table.

Type of I/O workload | Size (1KiB = 1024 bytes) | Mixed Ratio | Shows / Simulates
All Read | 4KiB | n/a | Maximum random read IOPS that a storage solution can deliver
Mixed Read/Write | 4KiB | 70% / 30% | Typical commercial applications deployed in a VSAN cluster
Sequential Read | 256KiB | n/a | Video streaming from storage
Sequential Write | 256KiB | n/a | Copying bulk data to storage
Sequential Mixed R/W | 256KiB | 70% / 30% | Simultaneous read/write copy from/to storage

In addition to these workloads, we studied Virtual SAN caching tier designs and the effect of Virtual SAN configuration parameters on the Virtual SAN test bed.

Virtual SAN 6.0 can be configured in two ways: Hybrid and All-Flash. Hybrid uses a combination of hard disks (HDDs) to provide storage and a flash tier (SSDs) to provide caching. The All-Flash solution uses all SSDs for storage and caching.

Tests show that the Hybrid Virtual SAN cluster performs extremely well when the working set is fully cached for random access workloads, and also for all sequential access workloads. The All-Flash Virtual SAN cluster, which performs well for random access workloads with large working sets, may be deployed in cases where the working set is too large to fit in a cache. All workloads scale linearly in both types of Virtual SAN clusters—as more hosts and more disk groups per host are added, Virtual SAN sees a corresponding increase in its ability to handle larger workloads. Virtual SAN offers an excellent way to scale up the cluster as performance requirements increase.

You can download Virtual SAN 6.0 Performance: Scalability and Best Practices from the VMware Performance & VMmark Community.

Web 2.0 Applications on VMware Virtual SAN

Here in VMware Performance Engineering, Virtual SAN is a hot topic. This storage solution leverages economical hardware compared to more expensive storage arrays and supports vSphere features like vMotion, HA, and DRS. We have been testing Virtual SAN with a number of workloads to characterize their performance. In particular, we found that Web 2.0 applications, modeled with the Cloudstone benchmark, perform exceptionally well with low application latency on vSphere 5.5 with Virtual SAN. We give a quick glimpse of the test configuration and results here; the full details can be found in the recently published technical white paper about Web 2.0 applications on VMware Virtual SAN.

We ran the Cloudstone benchmark using Olio server and client virtual machine pairs. Server virtual machines were on a 3-host server cluster, whereas client virtual machines were on a 3-node client cluster. An Olio server virtual machine ran Ubuntu 10.04 with a MySQL database, an NGINX Web server with PHP scripts, and a Tomcat application server. An Olio client virtual machine simulated typical Web 2.0 workloads by exercising 7 different types of user operations involving web file exchanges and database inquiries and transactions. Virtual SAN was configured on the server cluster. Web files, database files, and OS files were all on the Virtual SAN, with dedicated virtual disks to store each type of file separately.

fig1-blog

In the paper, we report test results that show Virtual SAN achieves good application latency. Each server-client virtual machine pair was pre-configured for 500 Olio users. We ran tests with 1500 and 7500 Olio users by using 3 and 15 pairs of virtual machines, respectively. We collected the average round-trip time of Olio operations. These operations were divided into frequent ones (EventDetail, HomePage, Login, and TagSearch) and less frequent ones (AddEvent, AddPerson, and PersonDetail) according to how often they were exercised in the tests.

The following graph shows the average round-trip times for various operations. The thresholds for these operations were defined in the passing criteria: 250 milliseconds for the frequent operations and 500 milliseconds for the less frequent operations. In the 15 VMs/7500 users case, the server cluster was at 70% CPU utilization, but the round-trip time was well below the passing threshold, as shown. We also present the 95th-percentile round-trip time results in the white paper.

fig2-blog

To learn the full story of the 15 VMs/7500 Olio users test, how we further stressed storage with the workload, and the resulting numbers, see the white paper.

VDI Performance Benchmarking on VMware Virtual SAN 5.5

In the previous blog series, we presented VDI performance benchmarking results with the VMware Virtual SAN public beta. We have now announced the general availability of VMware Virtual SAN 5.5, which is part of VMware vSphere 5.5 U1 GA, and VMware Horizon View 5.3.1, which supports Virtual SAN 5.5. In this blog, we present VDI performance benchmarking results with the Virtual SAN GA bits and highlight the CPU improvements and 16-node scaling results. With Virtual SAN 5.5 and the default policy, we could successfully run 1615 heavy VDI users (VDImark) out of the box on a 16-node Virtual SAN cluster, about 5% more consolidation than with the Virtual SAN public beta.

virtualsan-view-block-diagram

To simulate the VDI workload, which is typically CPU bound and sensitive to I/O, we use VMware View Planner 3.0.1. We run View Planner and consolidate as many heavy users as we can on a particular cluster configuration while meeting the quality of service (QoS) criteria; we call this score the VDImark. For the QoS criteria, View Planner operations are divided into three main groups: (1) Group A for interactive operations, (2) Group B for I/O operations, and (3) Group C for background operations. The score is determined separately for Group A and Group B user operations by calculating the 95th-percentile latency of all the operations in a group. The default thresholds are 1.0 second for Group A and 6.0 seconds for Group B. Please refer to the user guide and the run and reporting guides for more details. The scoring is based on several factors, such as the response time of the operations, compliance of the setup and configurations, and other factors.
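The scoring logic described above boils down to a 95th-percentile check per group against the default thresholds. A rough sketch of that check (not View Planner's actual scoring code):

```python
# Sketch: View Planner-style QoS check; thresholds are the defaults quoted above.
def percentile_95(latencies_seconds):
    ordered = sorted(latencies_seconds)
    index = max(0, int(round(0.95 * len(ordered))) - 1)   # simple nearest-rank estimate
    return ordered[index]

def qos_pass(group_a_latencies, group_b_latencies,
             group_a_threshold=1.0, group_b_threshold=6.0):
    return (percentile_95(group_a_latencies) <= group_a_threshold and
            percentile_95(group_b_latencies) <= group_b_threshold)
```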

As discussed in the previous blog, we used the same experimental setup (shown below) where each Virtual SAN host has two disk groups and each disk group has one PCI-e solid-state drive (SSD) of 200GB and six 300GB 15k RPM SAS disks. We use default policy when provisioning the automated linked clones pool with VMware Horizon View for all our experiments.

virtualsan55-setup

CPU Improvements in Virtual SAN 5.5

Several optimizations were made in Virtual SAN 5.5 compared to the previously available public beta, and one of the prominent improvements is the reduction of Virtual SAN's CPU usage. To highlight the CPU improvements, we compare the View Planner score on Virtual SAN 5.5 (vSphere 5.5 U1) and the Virtual SAN public beta (vSphere 5.5). On a 3-node cluster, VDImark (the maximum number of desktop VMs that can run while passing the QoS criteria) is obtained for both Virtual SAN 5.5 and the Virtual SAN public beta; the results are shown below:

virtualsan55-3node

The results show that with Virtual SAN 5.5, we can scale up to 305 VMs on a 3-node cluster, which is about 5% more consolidation when compared with Virtual SAN public beta. This clearly highlights the new CPU improvements in Virtual SAN 5.5 as a higher number of desktop VMs can be consolidated on each host with a similar user experience.

Linear Scaling in VDI Performance

In the next set of experiments, we increase the number of nodes in the Virtual SAN cluster to see how well VDI performance scales. We collect the VDImark score for 3-node, 5-node, 8-node, and 16-node configurations, and the result is shown in the chart below.

virtualsan55-scaling

The chart illustrates linear scaling in VDImark as we increase the number of nodes in the Virtual SAN cluster, which indicates good performance as nodes are added. As more nodes are added to the cluster, the number of heavy users that can be supported increases proportionately. In the Virtual SAN public beta, a workload of 95 heavy VDI users per host was achieved; now, due to the CPU improvements in Virtual SAN 5.5, we are able to achieve 101 to 102 heavy VDI users per host. On a 16-node cluster, a VDImark of 1615 was achieved, which works out to about 101 heavy VDI users per node.

To further illustrate the Group A and Group B response times, we show the average response time of individual operations for these runs for both Group A and Group B, as follows.

virtualsan55-groupA

As seen in the figure above, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience. Even at 16 nodes, we don't see much variance in the response times; they remain almost constant when scaling up. This clearly illustrates that, as we scale the number of VMs across larger Virtual SAN clusters, the user experience does not degrade.

virtualsan55-groupB

Group B is more sensitive to I/O and CPU usage than Group A, so these response times are more important. The figure above shows how VDI performance scales in Virtual SAN. It is evident from the chart that there is not much difference in the response times as the number of VMs increases from 305 VMs on a 3-node cluster to 1615 VMs on a 16-node cluster. Hence, storage-sensitive VDI operations also scale well as we scale the Virtual SAN cluster from 3 to 16 nodes.

To summarize, the test results in this blog show:

  • 5% more VMs can be consolidated on a 3-node Virtual SAN cluster
  • When adding more nodes to the Virtual SAN cluster, the number of heavy users supported increases proportionately (linear scaling)
  • The response times of common user operations (such as opening and saving files, watching a video, and browsing the Web) remain fairly constant as more nodes with more VMs are added.

To see the previous blogs on the VDI benchmarking with Virtual SAN public beta, check the links below:

VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

In part 1 and part 2 of the VDI/VSAN benchmarking blog series, we presented VDI benchmark results on VSAN for 3-node, 5-node, 7-node, and 8-node cluster configurations. In this blog, we compare the VDI benchmarking performance of VSAN with an all-flash storage array. The intent of this experiment is not to compare the maximum IOPS that you can achieve on these storage solutions; instead, we show how VSAN scales as we add more heavy VDI users. We found that VSAN can support a similar number of users as an all-flash array even though VSAN uses host resources.

The characteristic of the VDI workload is that it is CPU bound but sensitive to I/O, which makes View Planner a natural fit for this comparative study. We use VMware View Planner 3.0 for both VSAN and the all-flash SAN and consolidate as many heavy users as we can on a particular cluster configuration while meeting the quality of service (QoS) criteria. Then, we find the difference in the number of users we can support before we run out of CPU, because I/O is not a bottleneck here. Since VSAN runs in the kernel and uses CPU on the host for its operation, we find that this CPU usage is quite minimal, and we see no more than a 5% consolidation difference for a heavy-user run on VSAN compared to the all-flash array.

As discussed in the previous blog, we used the same experimental setup, where each VSAN host has two disk groups, and each disk group has one 200GB PCI-e solid-state drive (SSD) and six 300GB 15k RPM SAS disks. We built a 7-node and an 8-node cluster and ran View Planner to get the VDImark™ score for both VSAN and the all-flash array. VDImark signifies the number of heavy users you can successfully run while meeting the QoS criteria for a system under test. The VDImark for both VSAN and the all-flash array is shown in the following figure.

View Planner QoS (VDImark)

 

From the above chart, we see that VSAN can consolidate 677 heavy users (VDImark) on the 7-node cluster and 767 heavy users on the 8-node cluster. Compared to the all-flash array, we don't see more than a 5% difference in user consolidation. To further illustrate the Group-A and Group-B response times, we show the average response time of individual operations for these runs for both Group-A and Group-B, as follows.

Group-A Response Times

As seen in the figure above, for both VSAN and the all-flash array, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience. As with user consolidation, the response times of Group-A operations on VSAN are close to what we saw with the all-flash array.

Group-B Response Times

Group-B operations are sensitive to both CPU and I/O, and the 95th-percentile latency should be less than six seconds to meet the QoS criteria. From the above figure, we see that the average response time for most of the operations is within the threshold, and we see similar response times on VSAN compared to the all-flash array.

To see other parts on the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3