
Tag Archives: virtualization

Sharing GPU for Machine Learning/Deep Learning on VMware vSphere with NVIDIA GRID: Why is it needed? And How to share GPU?

By Lan Vu, Uday Kurkure, and Hari Sivaraman 

Data scientists may use GPUs on vSphere that are dedicated to a single virtual machine for their modeling work if they need to. Certain heavier machine learning workloads may well require that dedicated approach. However, many ML workloads and user types do not use a dedicated GPU continuously at its maximum capacity. This presents an opportunity for shared use of a physical GPU by more than one virtual machine/user. This article explores the performance of such a shared-GPU setup, supported by the NVIDIA GRID product on vSphere, and presents performance test results that show sharing is a feasible approach. Other technical reasons for sharing a GPU among multiple VMs are also described here. The article also gives best practices for determining how the sharing of a GPU may be done.

VMware vSphere supports NVIDIA GRID technology for multiple types of workloads. This technology virtualizes GPUs via a mediated passthrough mechanism. Initially, NVIDIA GRID supported GPU virtualization for graphics workloads only, but since the introduction of the Pascal GPU architecture, it has supported GPU virtualization for both graphics and CUDA/machine learning workloads. With this support, multiple VMs running GPU-accelerated workloads like machine learning/deep learning (ML/DL) based on TensorFlow, Keras, Caffe, Theano, Torch, and others can share a single GPU by using a vGPU provided by GRID. This brings benefits in multiple use cases that we discuss in this post.

Each vGPU is allocated a dedicated amount of GPU memory, and a vGPU profile specifies both how much device memory each vGPU has and the maximum number of vGPUs per physical GPU. For example, if you choose the P40-1q vGPU profile for a Pascal P40 GPU, you can have up to 24 VMs with vGPUs because the P40 has a total of 24 GB of device memory. More information about virtualized GPUs on vSphere can be found in our previous blog here.

Figure 1: NVIDIA GRID vGPU 

Why do we need to share GPUs?

Sharing GPUs can increase system consolidation and resource utilization and save deployment costs for ML/DL workloads. GPU-accelerated ML/DL workloads include training and inference tasks, and their GPU usage patterns differ. Training workloads are mostly run by data scientists and machine learning engineers during the research and development phase of an application. Because model training is just one of many tasks in ML application development, each user's need for GPUs is usually irregular. For example, a data scientist does not spend the whole workday just training models; he or she also has other things to do like checking and answering email, attending meetings, researching and developing new ML algorithms, collecting and cleaning data, and so on. Hence, sharing GPUs among a group of users helps increase GPU utilization without giving up much of the GPU's performance benefit.

To illustrate this scenario of using a GPU for training, we conducted an experiment in which 3 VMs (or 3 users) used vGPUs to share a single NVIDIA P40 GPU, and each VM ran the same ML/DL training workload at a different time. The ML workloads inside VM1 and VM2 were run at times t1 and t2, so that about 25% of the GPU execution time of VM1 and VM2 overlapped. VM3 ran its workload at t3, and it was the only GPU-based workload running in that timeframe. Figure 2 depicts this use case, in which the black dashed arrows indicate VMs accessing the GPU concurrently. If you run your applications inside containers, please also check out our previous blog post on running container-based applications inside a VM.

Figure 2: A use case of running multiple ML jobs on VMs with vGPUs

In our experiments, we used CentOS VMs with P40-1q vGPU profiles, 12 vCPUs, 60 GB of memory, and a 96 GB disk, and we ran TensorFlow-based training loads on those VMs: complex language modeling using a recurrent neural network (RNN) with 1500 long short-term memory (LSTM) units per layer on the Penn Treebank (PTB) dataset [1, 2], and handwriting recognition using a convolutional neural network (CNN) on the MNIST dataset [3]. We ran the experiment on a Dell PowerEdge R740 with dual 18-core Intel Xeon Gold 6140 processors and an NVIDIA Pascal P40 GPU.
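To give a concrete sense of the per-VM training load, the sketch below shows a minimal MNIST CNN in TensorFlow/Keras. It is illustrative only; the hyperparameters and layer sizes are assumptions, not the exact benchmark code used in these experiments.

```python
# Minimal MNIST CNN training sketch (illustrative; not the exact benchmark code).
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 5, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Each VM in the experiment runs an identical training job against its own vGPU.
model.fit(x_train, y_train, epochs=5, batch_size=128,
          validation_data=(x_test, y_test))
```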

Figure 3 and Figure 4 show the normalized training times of VM1, VM2, and VM3: VM1 and VM2 see a performance impact of 16%–23%, while VM3 sees no performance impact. In this experiment, we used the Best Effort scheduler of GRID, which means VM3 fully utilized the GPU during its application execution.

Figure 3: Training time of Language Modeling 

Figure 4: Training time of Handwriting Recognition 

For inference workloads, the performance characteristics can vary based on how frequently the GPU-based applications are used in the production environment. Less intensive GPU workloads allow more apps running inside VMs to share a single GPU. For example, a GPU-accelerated database app and other ML/DL apps can share the same GPUs on the same vSphere host if their performance requirements are still met.

How many vGPUs per physical GPU?

The decision to share a GPU among ML/DL workloads running on multiple VMs, and how many VMs to place per physical GPU, depends on the GPU usage of the ML applications. When users or applications do not use the GPU very frequently, as shown in the previous example, sharing the GPU can bring huge benefits because it significantly reduces hardware, operation, and management costs; in this case, you can assign more vGPUs per physical GPU. If your workloads use the GPU intensively and require continuous access to it, sharing can still bring some benefit because GPU-based application execution also includes CPU time, I/O time, and so on, and sharing a GPU helps fill the gaps when applications spend time on CPU or I/O. However, in this case, you need to assign fewer vGPUs per physical GPU.

To determine how many VMs with vGPUs per physical GPU you need, base the decision on your evaluation of usage frequency or on the GPU utilization history of the applications. In the case of a GRID GPU on vSphere, you can monitor GPU utilization by using the nvidia-smi command on the vSphere hypervisor.
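As a rough illustration of collecting such a utilization history, the hedged sketch below polls nvidia-smi periodically from a Linux host or VM where the NVIDIA driver is installed (on ESXi itself you would simply run nvidia-smi from the host shell); the sampling interval and output handling are assumptions, not a prescribed procedure.

```python
# Illustrative sketch: sample GPU utilization over time with nvidia-smi to
# judge how often your workloads actually keep the GPU busy.
import subprocess
import time

def sample_gpu_utilization(interval_s=60, samples=60):
    """Return a list of (timestamp, gpu_util_percent, mem_used_mib) tuples."""
    history = []
    for _ in range(samples):
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=timestamp,utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            text=True)
        history.extend(tuple(line.split(", ")) for line in out.strip().splitlines())
        time.sleep(interval_s)
    return history

if __name__ == "__main__":
    for ts, util, mem in sample_gpu_utilization(interval_s=5, samples=3):
        print(f"{ts}: GPU {util}% busy, {mem} MiB used")
```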

We also evaluated the performance of ML/DL workloads in the worst case, when all VMs use the GPU at the same time. To do this, we ran the same MNIST handwriting recognition training on multiple VMs whose vGPUs concurrently shared a single Pascal P40 GPU. Each VM had a P40-1q vGPU.

The experiment in this scenario is depicted in Figure 5 with the number of concurrent VMs in our test ranging from 1 to 24 VMs.  

Figure 5: Running multiple ML jobs on VMs with vGPUs concurrently 

Figure 6 presents the normalized training time of this experiment. As the number of concurrent ML jobs increases, the training time of each job also increases because they share a single GPU. However, the training time does not grow as fast as the number of VMs. For example, when 24 VMs run concurrently, the execution time increases, at most, 17 times instead of 24 times or more. This means that even in the worst case, where all VMs use the GPU at the same time, we still see the benefits of GPU sharing. Note that in the typical training use case mentioned earlier, not all users or applications use the GPU 24/7. If they do, you can simply reduce the number of vGPUs per GPU until the expected performance and consolidation are reached.

Figure 6: Training time with different numbers of VMs

vGPU scheduling

When all VMs with GPU loads run concurrently, the NVIDIA GRID manager schedules the jobs onto the GPU based on time slicing. NVIDIA GRID supports three vGPU scheduling options: Best Effort, Equal Share, and Fixed Share. The choice of vGPU scheduler depends on the use case. The Best Effort scheduler allocates GPU time to VMs in a round-robin fashion; we used it in the experiments above. In some circumstances, a VM running a GPU-intensive application may affect the performance of a GPU-lightweight application running in another VM. To avoid such a performance impact and ensure quality of service (QoS), you can switch to the Equal Share or Fixed Share scheduler. The Equal Share scheduler gives an equal share of GPU time to each powered-on VM. The Fixed Share scheduler gives a fixed share of GPU time to a VM based on the vGPU profile associated with each VM on the physical GPU.

For a performance comparison, we ran the MNIST handwriting recognition training load under two schedulers, Best Effort and Equal Share, for different numbers of VMs.

Figure 7 presents the normalized training time and Figure 8 presents GPU utilization. As the number of VMs increases, Best Effort shows better performance because when a VM does not use its time slice, that time slice is assigned to another VM that needs the GPU. With Equal Share, by contrast, the time slice remains reserved for a VM even when it does not utilize the GPU at that moment. Therefore, the Best Effort scheduler achieves better GPU utilization, as shown in Figure 8.

Figure 7: Training time of Best Effort vs. Equal Share

Figure 8: GPU utilization of Best Effort vs. Equal Share 

Takeaways

  • Sharing a GPU among VMs using NVIDIA GRID can help increase the consolidation of VMs with vGPUs and reduce hardware, operation, and management costs. 
  • The performance impact of sharing a GPU is small in typical use cases where the GPU is used infrequently by each user. 
  • The right number of vGPUs per GPU depends on the actual ML/DL load. For infrequent and lightweight GPU workloads, you can assign multiple vGPUs per GPU. For workloads that use the GPU frequently, you should lower the number of vGPUs per GPU until the performance requirement is met.  

Acknowledgments

We would like to thank Aravind Bappanadu, Juan Garcia-Rovetta, Bruce Herndon, Don Sullivan, Charu Chaubal, Mohan Potheri, Gina Rosenthal, Justin Murray, Ziv Kalmanovich for their support of this work and thank Julie Brodeur for her help in reviewing and recommendations for this blog post.

References

[1] Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals, “Recurrent Neural Network Regularization,” In arXiv:1409.2329, 2014. 

[2] Ann Taylor, Mitchell Marcus, Beatrice Santorini, “The Penn Treebank: An Overview, Treebanks: the state of the art in syntactically annotated corpora.” ed. / Anne Abeille. Kluwer, 2003.  

[3] Yann LeCun, L. Bottou, Y. Bengio, and P. Haffner. “Gradient-based learning applied to document recognition.” in Proceedings of the IEEE, 86(11):2278-2324, November 1998.  

 

 

vSphere with iSER – How to release the full potential of your iSCSI storage!

By Mark Ma

With the release of vSphere 6.7, VMware added iSER (iSCSI Extensions for RDMA) as a natively supported storage protocol in ESXi. With iSER, the iSCSI protocol runs over RDMA transports, so users can boost their vSphere storage performance just by replacing regular NICs with RDMA-capable NICs. RDMA (Remote Direct Memory Access) allows data to be transferred directly between the memory of one computer and another with minimal CPU/kernel involvement. By bypassing the kernel, we get extremely high I/O bandwidth and low latency. (To use RDMA, you must have an HCA/Host Channel Adapter device on both the source and destination.) In this blog, we compare standard iSCSI performance vs. iSER performance to see how iSER can release the full potential of your iSCSI storage.

Testbed Configuration

The iSCSI/iSER target system is based on an open source Ubuntu 18.04 LTS server with 2 x E5-2403 v2 CPUs, 96 GB RAM, a 120 GB SSD for the OS, 8 x 450 GB (15K RPM) SAS drives, and a Mellanox ConnectX-3 Pro EN 40 GbE NIC (RDMA capable). The file system is ZFS with 4 mirrored pairs across the 8 x 450 GB SAS drives (equivalent to RAID 10). ZFS has advanced in-memory caching that can produce very good random IOPS and read throughput. We did not add any SSDs for caching since the test compares protocols, not disk drives. The iSCSI/iSER target software is the Linux SCSI target framework (TGT).

The iSCSI/iSER initiator is ESXi 6.7.0, build 9214924, running on a host with 2 x Intel Xeon E5-2403 v2 CPUs @ 1.8 GHz, 96 GB RAM, a USB boot drive, and a Mellanox ConnectX-3 Pro EN 40 GbE NIC (RDMA capable); this host was used to benchmark the performance boost that iSER enables over iSCSI.

Both the target and the initiator connect to a 40 GbE switch with QSFP cables for optimal network performance.

Both NICs run the latest firmware, version 2.42.5000.

To measure performance, we used VMware I/O Analyzer, which uses the industry-standard benchmark Iometer.

iSCSI Test

We set the target with the iSCSI driver to ensure the first test measures the standard iSCSI protocol.

Figure 1

For the iSCSI initiator, we simply enable the iSCSI software adapter.

Figure 2

iSCSI test one: Max Read IOPS—this test shows the max read IOPS (4K random read I/Os per second) from the iSCSI storage.

Result: 34,255.18 IOPS

Figure 3

iSCSI test two: Max Write IOPS—this test shows the max write IOPS from the iSCSI storage.

Result: 36,428.26 IOPS

Figure 4

iSCSI test three: Max Read Throughput—this test shows the max read throughput from the iSCSI storage.

Result: 2,740.80 MBPS

Figure 5

iSCSI test four: Max Write Throughput—this test shows the potential max write throughput from the iSCSI storage. (The performance is rather low due to the ZFS RAID configuration and limited disk spindles.)

Result: 112.04 MBPS

Figure 6

iSER Test

We set the target to the iSER driver to ensure the second test measures only iSER connections.

Figure 7

For the iSER initiator, we need to verify that an RDMA-capable NIC is installed. For this, we use the command:
esxcli rdma device list

Figure 8

Then we run the following command from the ESXi host to enable the iSER adapter.

esxcli rdma iser add

 

Figure 9

 

iSER test one: Max Read IOPS—this test shows the max read IOPS from the iSER storage.

Result: 71,108.85 IOPS, which is 207.59% of the iSCSI result (roughly 2.1x).

Figure 10

iSER test two: Max Write IOPS—this test shows the max write IOPS from the iSER storage.

Result: 69,495.70 IOPS, which is 190.77% of the iSCSI result (roughly 1.9x).

Figure 11

iSER test three: Max Read Throughput—this test shows the max read throughput, measured in megabytes per second (MBPS), from the iSER storage.

Result: 4,126.53 MBPS, which is 150.56% of the iSCSI result (roughly 1.5x).

Figure 12

iSER test four: Max Write Throughput—this test shows the max write throughput from the iSER storage. (The performance is rather low due to the ZFS RAID configuration and limited disk spindles.)

Result: 106.48 MBPS, which is about 5% less than iSCSI.

Figure 13

Results

Figure 14

Figure 15

Figure 16

Random-I/O performance roughly doubles (about a 2x gain) for both read and write, and read throughput improves by about 1.5x. Write throughput is about the same. The only difference is the storage protocol. We also performed these tests in an environment made up of older hardware, so just imagine what vSphere with iSER could do when using state-of-the-art, NVMe-based storage and the latest 200 GbE network equipment.

Conclusion

The results seemed too good to be true, so I ran the benchmarks several times to ensure consistency. It's great to see VMware's innovation initiative in action. Who would have thought that "not so exciting" traditional iSCSI storage could roughly double its performance through the efforts of VMware and Mellanox? VMware continues to push the boundary of the software-defined datacenter to better serve our customers in their digital transformation journey!

About the Author

Mark Ma is a senior consultant at VMware Professional Services. He is heavily involved with POC, architecture design, assessment, implementation, and user training. Mark specializes in end-to-end virtualization solutions based on Citrix, Microsoft, and VMware applications.

New white paper: Big Data performance on VMware Cloud on AWS: Spark machine learning and IoT analytics performance on-premises and in the cloud

By Dave Jaffe

A new white paper is available comparing Spark machine learning performance on an 8-server on-premises cluster vs. a similarly configured VMware Cloud on AWS cluster.

Here is what the VMware Cloud on AWS cluster looked like:


VMware Cloud on AWS configuration for performance tests

Three standard analytic programs from the Spark machine learning library (MLlib) were driven using spark-perf: K-means clustering, Logistic Regression classification, and Random Forest decision trees. In addition, a new VMware-developed benchmark, IoT Analytics Benchmark, which models real-time machine learning on Internet-of-Things data streams, was used in the comparison. The benchmark is available from GitHub.
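For readers unfamiliar with these MLlib routines, the hedged sketch below shows the general shape of a Spark K-means job in PySpark; the input path, schema, and parameters are placeholders, and the actual tests in the paper were driven by spark-perf and the IoT Analytics Benchmark rather than this code.

```python
# Illustrative PySpark K-means job (placeholder data path and parameters).
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

# Assume a CSV of numeric feature columns; the path and schema are hypothetical.
df = spark.read.csv("hdfs:///data/points.csv", header=True, inferSchema=True)
features = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)

model = KMeans(k=10, seed=42, featuresCol="features").fit(features)
print("Cluster centers:", model.clusterCenters())
spark.stop()
```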

As seen in the charts below, performance was very similar on-premises and on VMware Cloud on AWS.


Spark machine learning performance


IoT Analytics performance

 

All details are in the paper.

Persistent Memory Performance in vSphere 6.7

We published a paper that shows how VMware is helping advance persistent memory (PMEM) technology by driving the virtualization enhancements in vSphere 6.7. The paper gives a detailed performance analysis of using PMEM technology on vSphere with various workloads and scenarios.

These are the key points that we cover in this white paper:

  • We explain how PMEM can be configured and used in a vSphere environment.
  • We show how applications with different characteristics can take advantage of PMEM in vSphere. Below are some of the use-cases:
    • How PMEM device limits can be achieved under vSphere with little to no overhead of virtualization. We show virtual-to-native ratio along with raw bandwidth and latency numbers from fio, an I/O microbenchmark.
    • How traditional relational databases like Oracle can benefit from using PMEM in vSphere.
    • How scaling-out VMs in vSphere can benefit from PMEM. We used Sysbench with MySQL to show such benefits.
    • How modified (PMEM-aware) applications can get the best performance out of PMEM. We show performance data from such applications, for example, an OLTP database like SQL Server and an in-memory database like Redis. (A minimal sketch of PMEM-aware access follows this list.)
    • How vMotion can migrate VMs with PMEM, which is a host-local device just like an NVMe SSD. We also characterize vMotion performance of VMs with PMEM in detail.
  • We outline some best practices on how to get the most out of PMEM in vSphere.
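As referenced in the list above, here is a minimal, hedged sketch of what PMEM-aware access looks like: the application maps a DAX-backed file directly into its address space and uses ordinary loads and stores instead of block I/O. The path and sizes are assumptions, and real PMEM-aware code (for example, built on PMDK) handles persistence ordering more carefully.

```python
# Minimal illustration of PMEM-aware access via a DAX-mounted file
# (hypothetical path; not taken from the paper).
import mmap
import os

PATH = "/mnt/pmem/datafile"   # assumed fsdax mount point
SIZE = 64 * 1024 * 1024       # 64 MiB region

fd = os.open(PATH, os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, SIZE)
buf = mmap.mmap(fd, SIZE)     # load/store access; DAX bypasses the page cache

buf[0:5] = b"hello"           # ordinary memory writes land in persistent memory
buf.flush()                   # ask the OS to make the range persistent

buf.close()
os.close(fd)
```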

Read the full paper here.

Oracle Database Performance with VMware Cloud on AWS

You’ve probably already heard about VMware Cloud on Amazon Web Services (VMC on AWS). It’s the same vSphere platform that has been running business critical applications for years, but now it’s available on Amazon’s cloud infrastructure. Following up on the many tests that we have done with Oracle databases on vSphere, I was able to get some time on a VMC on AWS setup to see how Oracle databases perform in this new environment.

It is important to note that VMC on AWS is vSphere running on bare-metal servers in Amazon's infrastructure. The expectation is that performance will be very similar to "regular" onsite vSphere, with the added advantage that the hardware provisioning, software installation, and configuration are already done, and the environment is ready to go when you log in. The vCenter interface is the same, except that it references the Amazon instance type for the server.

Our VMC on AWS instance is made up of four ESXi hosts. Each host has two 18-core Intel Xeon E5-2686 v4 (aka Broadwell) processors and 512 GB of RAM. In total, the cluster has 144 cores and 2 TB of RAM, which gives us lots of physical resources to utilize in the cloud.

In our test, the database VMs were running Red Hat Enterprise Linux 7.2 with Oracle 12c. To drive a load against the database VMs, a single 18-vCPU driver VM running Windows Server 2012 R2 and the DVD Store 3 test workload was set up on the cluster. A 100 GB DS3 test database was created on each of the Oracle database VMs. During testing, the number of threads driving load against the databases was increased until maximum throughput was achieved, which occurred at around 95% CPU utilization. The total throughput across all database servers for each test is shown below.

 

In this test, the DB VMs were configured with 16 vCPUs and 128 GB of RAM. In the 8-VM test case, a total of 128 vCPUs were allocated across the 144 cores of the cluster. Additionally, the cluster was also running the 18-vCPU driver VM, vCenter, vSAN, and NSX. This makes the 12-VM test case interesting: there were 192 vCPUs for the DB VMs, plus 18 vCPUs for the driver. The hyperthreads clearly help out, allowing performance to continue to scale even though more vCPUs are allocated than there are physical cores.

The performance itself represents scaling very similar to what we have seen with Oracle and other database workloads with vSphere in recent releases. The cluster was able to achieve over 370 thousand orders per minute with good scaling from 1 VM to 12 VMs. We also recently published similar tests with SQL Server on the same VMC on AWS cluster, but with a different workload and more, smaller VMs.

UPDATE (07/30/2018): The whitepaper detailing these results is now available here.

vCenter performance improvements from vSphere 6.5 to 6.7: What does 2x mean?

In a recent blog, the VMware vSphere team shared the following performance improvements in vSphere 6.7 vs. 6.5:

Moreover, with vSphere 6.7 vCSA delivers phenomenal performance improvements (all metrics compared at cluster scale limits, versus vSphere 6.5):
2X faster performance in vCenter operations per second
3X reduction in memory usage
3X faster DRS-related operations (e.g. power-on virtual machine)

As senior engineers within the VMware Performance and vSphere teams, we are writing this blog to provide more details regarding these numbers and to explain how we measured them. We also briefly explain some of the technical details behind these improvements.

Cluster Scale

Let us first explain what "all metrics compared at cluster scale limits" means. What is cluster scale? Here, it is an environment that includes a vCenter server configured for the largest vSphere cluster that VMware currently supports, namely 64 hosts and 8,000 powered-on VMs. This setup represents a high-consolidation environment with 125 VMs per host. Note that this is different from the setup used in our previous blog about vCenter 6.5 performance improvements. The setup in that blog was our datacenter scale environment, which used the largest number of supported hosts (2,000) and VMs (25,000), so the numbers from that blog should not be compared to these numbers.

2x and 3x

Let us now explain some of the performance numbers quoted. We produced the numbers by measuring workload runs in our cluster scale setup.

2x vCenter Operations Per Second, vSphere 6.7 vs. 6.5, cluster scale limits. We measure operations per second using an internal benchmark called vcbench. We describe vcbench below under “Benchmark Details.” One of the outputs of this workload is management operations (for example, clone, powerOn, vMotion) performed per second.

  • In 6.5, vCenter was capable of performing approximately 8.3 vcbench operations per second (described below under “Benchmark Details”) in the cluster-scale testbed.
  • In 6.7, vCenter is now capable of performing approximately 16.7 vcbench operations per second.

3x reduction in memory usage. In addition to our vcbench workload, we also include a simplified workload that simply executes a standard workflow: create a VM, power it on, power it off, and delete it. The rapid powerOn and powerOff of VMs in this setup puts more load on the DRS subsystem than the typical vcbench test.

  • In 6.5, the core vCenter process (vpxd) used on average about 10 GB to complete the workflow benchmark (described below under “Benchmark Details”).
  • In 6.7, the core vCenter process used approximately 3 GB to complete this run, while also achieving higher churn (that is, more workflow ‘create/powerOn/powerOff/delete’ cycles completed within the same time period).

3x faster DRS-related operations. In our vcbench workload, we measure not just the overall operations per second, but also the average latencies of individual operations like powerOn (which exercises the majority of the DRS software stack). We issue many concurrent operations, and we measure latency under such load.

  • In 6.5, the average latency of an individual powerOn during a vcbench run was 9.5 seconds.
  • In 6.7, the average latency of an individual powerOn during a vcbench run was 2.8 seconds.

The latencies above reflect the fact that a cluster has 8,000 VMs and many operations in flight at once. As a result, individual operations are slower than if they were simply run on a single-host, single-VM environment.

What does this mean to you as a customer?

As a result of these improvements, customers in high-consolidation environments will see reduced resource consumption due to DRS and reduced latency to generate VMotions for load balancing. Customers will also see faster initial placement of VMs.

Brief Deep Dive into Improvements

Before we describe the improvements, let us first briefly explain how DRS works, at a very high level.

When powering on a VM, vCenter must determine where to place the VM. This is called initial placement. Many subsystems, including DRS and policy management, must be consulted to determine valid hosts on which this VM can run. This phase is called constraint check. Once DRS determines the host on which a VM should be powered on, it registers the VM onto that host and issues the powerOn. This initial placement requires a snapshot of the inventory: by snapshot, we mean that DRS records the current configuration of hosts and VMs in the cluster.

In addition to balancing during initial placement, every 5 minutes, DRS re-examines the load of the cluster and performs a series of computations to generate VMotions that help balance the load across hosts. This phase is called periodic rebalancing. Periodic rebalancing requires an examination of the historical utilization statistics for each host and VM (for example, over the previous hour) in order to determine proper placement.

Finally, as VMs get moved around, the used capacity in resource pools changes. The vCenter server periodically exchanges messages called SpecSyncs with each host to push down the most recent resource pool configuration. The SpecSync operation requires traversing a host’s resource pool structure and changing it to make sure it matches vCenter’s configuration.

With this understanding in mind, let us now give some technical details behind the improvements above. As with our previous blog about vCenter performance improvement, we describe changes in terms of rocks (that is, somewhat large changes to entire subsystems) and pebbles (smaller individual changes throughout the code base).

Rocks

The three main rocks that we address in 6.7 are simplified initial placement, faster resource divvying, and faster SpecSyncs.

Simplified initial placement. As mentioned above, initial placement relies on a snapshot of the current state of the inventory. Creating this snapshot can be a heavyweight operation, requiring a number of data copies and locking of host and cluster data structures to ensure a consistent view of the data. In 6.7, we moved to a lightweight online approach that keeps the state up-to-date in a continuous manner, avoiding snapshots altogether. With this approach, we significantly reduce locking demands and significantly reduce the number of times we need to copy data. In some highly-contended clusters, this reduced the initial placement time from seconds down to milliseconds.

Faster (and more frequent) resource divvying. Divvying is the act of determining the resource allocations for each VM. Every five minutes, the state of the cluster is examined and both divvying and then rebalancing (using VMotion) are performed. To make the divvying phase faster, we performed a number of optimizations.

  • We changed the approach to examining historical usage statistics. Instead of storing metrics for every VM and every host over an hour, we aggregated the data, which allowed us to store a smaller number of metrics per host. This dramatically reduced memory usage and simplified the computation of the desired load for each host.
  • We restructured the code to remove compatibility checks (for example, those that help to determine which VMs can run on which hosts) during this divvying phase. In 6.5 and earlier, divvying a load also involved various host/VM compatibility calculations. Now, we store the compatibility matrix and update it only when compatibility changes, which is typically infrequent.
  • We have also done significant code refactoring (described below under “Pebbles”) to this code path.

By implementing these changes and making divvying faster in 6.7, we are now able to perform divvying more frequently: once per minute instead of once every five minutes. As a result, resources flow more quickly between resource pools, and we are better able to enforce fairness guarantees in DRS clusters.

Note that periodic load balancing (through VMotion) still occurs every five minutes.

Faster SpecSyncs. To perform a SpecSync, vCenter sends a resource pool configuration to a host. The host must examine this configuration and create a list of changes required to bring that host in sync with vCenter. In 6.5 and earlier, depending on the number of VMs, creating this list of changes could result in hundreds of operations on a host, and the runtime was highly variable. In 6.7, we made this computation more deterministic, reducing the number of operations and lowering the latency appropriately.

Pebbles

In addition to the changes above, we also performed a number of optimizations throughout our code base.

Code Refactoring. In 6.5 and before, admission control decisions were made by multiple independent subsystems within vCenter (for example, DRS would be responsible for some decisions, and HA would make others). In 6.7, we simplified our code such that all admission control decisions are handled by a module within DRS. Reducing multiple copies of this code simplifies debugging and reduces resource usage.

Finer-grained locks. In 6.7, we continued to make strides in reducing the scope of our locks. We introduced finer-grained locks so that DRS would not have to lock an entire VM to examine certain pieces of state. We made similar improvements to both hosts and clusters.

Removal of unnecessary classes, maps, and sets. In refactoring our code, we were able to remove a number of classes and thereby reduce the number of copies of data in our system. The maps and sets that were needed to support these classes could also be removed.

  • Preferring integers over strings. In many situations, we replaced strings and string comparisons with integers and integer comparisons. This dramatically reduces memory allocation overhead and speeds up comparisons, reducing CPU usage.

Benchmark Details

We measure operations per second (OPS) using a VMware benchmark called vcbench. (For more details about vcbench, see "Benchmarking Details" in vCenter 6.5 Performance: what does 6x mean?) Briefly, vcbench is a Java-based application that takes as input a runlist, which is a list of management operations to perform. These operations include powering on VMs, cloning VMs, reconfiguring VMs, and migrating VMs, among many others. We chose the operations based on an analysis of typical customer management scenarios. The vcbench application uses vSphere APIs to issue these operations to vCenter. The benchmark creates multiple threads and issues operations on those threads in parallel (up to 32). We measure operations per second by taking the total number of operations performed in a given timeframe (say, 1 hour) and dividing it by the time interval.
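A hedged sketch of how such a throughput number can be computed is shown below: worker threads issue management operations, and the completed count is divided by the elapsed time. The issue_operation() helper is a placeholder for a real vSphere API call (for example, via pyVmomi); this is not vcbench itself.

```python
# Toy harness illustrating an operations-per-second measurement.
import time
from concurrent.futures import ThreadPoolExecutor

def issue_operation(op):
    # Placeholder: a real harness would call the vSphere API here
    # (clone, powerOn, vMotion, ...) and block until the task completes.
    time.sleep(0.1)
    return op

def measure_ops_per_second(runlist, threads=32, duration_s=3600):
    completed = 0
    start = time.time()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while time.time() - start < duration_s:
            completed += sum(1 for _ in pool.map(issue_operation, runlist))
    return completed / (time.time() - start)

if __name__ == "__main__":
    runlist = ["powerOn", "clone", "reconfigure", "vMotion"] * 8
    print("ops/sec:", measure_ops_per_second(runlist, threads=32, duration_s=10))
```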

Our workflow benchmark is very similar to vcbench. The main difference is that more operations are issued per host at a time. In vcbench, one operation is issued per host at a time, while in workflow, up to 8 operations are issued per host at a time.

In many cases, the size of the VM has an impact on operational latency. For example, a VM with a lot of memory (say, 32 GB) and large disks (say, 100 GB) will take longer to clone because more memory will need to be snapshotted, and more disk data will need to be copied. To minimize the impact of the disk subsystem in our measurements, we use small VMs (< 4 GB memory, < 8 GB disk).

Because we limit ourselves to 32 threads per vCenter in this single-cluster setup, throughput numbers are smaller than for our datacenter-at-scale setups (2,000 hosts; 25,000 VMs), which use up to 256 concurrent threads per vCenter.

Summary

In this blog, we have described some of the performance improvements in vCenter from 6.5 to 6.7. A variety of improvements to DRS have led to improved throughput and reduced resource usage for our vcbench workload in a cluster scale setup of 64 hosts and 8,000 powered-on VMs. Many of these changes also apply to larger datacenter-scale setups, although the scope of improvement may not be as pronounced.

Acknowledgments

The vCenter improvements described in this blog are the results of thousands of person-hours from vCenter developers, performance engineers, and others throughout VMware. We are deeply grateful to them for making this happen.

Authors

Zhelong Pan is a senior staff engineer in the Distributed Resource Management Team at VMware. He works on cluster management, including shared resource allocation, VM placement, and load balancing. He is interested in performance optimizations, including virtualization performance and management software performance. He has been at VMware since 2006.

Ravi Soundararajan is a principal engineer in the Performance Group at VMware. He works on vCenter performance and scalability, from the UI to the server to the database to the hypervisor management agents. He has been at VMware since 2003, and he has presented on the topic of vCenter Performance at VMworld from 2013-2017. His Twitter handle is @vCenterPerfGuy.

Performance Comparison of Containerized Machine Learning Applications Running Natively with Nvidia vGPUs vs. in a VM – Episode 4

This article is by Hari Sivaraman, Uday Kurkure, and Lan Vu from the Performance Engineering team at VMware.

Performance Comparison of Containerized Machine Learning Applications

Docker containers [6] are rapidly becoming a popular environment in which to run different applications, including those in machine learning [1, 2, 3]. NVIDIA supports Docker containers with their own Docker engine utility, nvidia-docker [7], which is specialized to run applications that use NVIDIA GPUs.

The nvidia-docker container for machine learning includes the application and the machine learning framework (for example, TensorFlow [5]) but, importantly, it does not include the GPU driver or the CUDA toolkit.

Docker containers are hardware agnostic so, when an application uses specialized hardware like an NVIDIA GPU that needs kernel modules and user-level libraries, the container cannot include the required drivers. They live outside the container.

One workaround here is to install the driver inside the container and map its devices upon launch. This workaround is not portable since the versions inside the container need to match those in the native operating system.

The nvidia-docker engine utility provides an alternate mechanism that mounts the user-mode components at launch, but this requires you to install the driver and CUDA in the native operating system before launch. Both approaches have drawbacks, but the latter is clearly preferable.
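Once the container is up, a quick way to confirm that the GPU and the mounted user-mode driver components are actually visible to the framework inside the container is a check like the hedged sketch below (it assumes a recent TensorFlow image; the original experiments used an earlier TensorFlow release).

```python
# Run inside the container: confirm the GPU is visible to the framework.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)
assert gpus, "No GPU visible: check that the driver components were mounted"
```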

In this episode of our series of blogs [8, 9, 10] on machine learning in vSphere using GPUs, we present a comparison of the performance of MNIST [4] running in a container on CentOS executing natively with MNIST running in a container inside a CentOS VM on vSphere. Based on our experiments, we demonstrate that running containers in a virtualized environment, like a CentOS VM on vSphere, suffers no performance penalty, while benefiting from the tremendous management capabilities offered by the VMware vSphere platform.

Experiment Configuration and Methodology

We used MNIST [4] to compare the performance of containers running natively with containers running inside a VM. The configuration of the VM and the vSphere server we used for the “virtualized container” is shown in Table 1. The configuration of the physical machine used to run the container natively is shown in Table 2.

vSphere: 6.0.0, build 3500742
Nvidia vGPU driver: 367.53
Guest OS: CentOS Linux release 7.4.1708 (Core)
CUDA driver: 8.0
CUDA runtime: 7.5
Docker: 17.09-ce-rc2

Table 1. Configuration of VM used to run the nvidia-docker container

Nvidia driver: 384.98
Operating system: CentOS Linux release 7.4.1708 (Core)
CUDA driver: 8.0
CUDA runtime: 7.5
Docker: 17.09-ce-rc2

⇑ Table 2. Configuration of physical machine used to run the nvidia-docker container

The server configuration we used is shown in Table 3 below. In our experiments, we used the NVIDIA M60 GPU in vGPU mode only; we did not use the DirectPath I/O mode. In the scenario in which we ran the container inside the VM, we first installed the NVIDIA vGPU drivers in vSphere and inside the VM, then we installed CUDA (driver 8.0 with runtime version 7.5), followed by Docker and nvidia-docker [7]. In the case where we ran the container natively, we installed the NVIDIA driver in CentOS running natively, followed by CUDA (driver 8.0 with runtime version 7.5), Docker, and finally nvidia-docker [7]. In both scenarios we ran MNIST, and we measured the training run time using a wall clock.

 Figure 1. Testbed configuration for comparison of the performance of containers running natively vs. running in a VM

Model: Dell PowerEdge R730
Processor type: Intel® Xeon® CPU E5-2680 v3 @ 2.50 GHz
CPU cores: 24 CPUs, each @ 2.5 GHz
Processor sockets: 2
Cores per socket: 12
Logical processors: 48
Hyperthreading: Active
Memory: 768 GB
Storage: Local SSD (1.5 TB), storage arrays, local hard disks
GPUs: 2x NVIDIA Tesla M60

⇑ Table 3. Server configuration

Results

The measured wall-clock run times for MNIST are shown in Table 4 for the two scenarios we tested:

  1. Running in an nvidia-docker container in CentOS running natively.
  2. Running in an nvidia-docker container inside a CentOS VM on vSphere.

From the data, we can clearly see that there is no measurable performance penalty for running a container inside a VM as compared to running it natively.

Run time for MNIST as measured by a wall clock:
Nvidia-docker container in CentOS running natively: 44 minutes 53 seconds
Nvidia-docker container running in a CentOS VM on vSphere: 44 minutes 57 seconds

⇑ Table 4. Comparison of the run-time for MNIST running in a container on native CentOS vs. in a container in virtualized CentOS

Takeaways

  • Based on the results shown in Table 4, it is clear that there is no measurable performance impact due to running a containerized application in a virtual environment as opposed to running it natively. So, from a performance perspective, there is no penalty for using a virtualized environment.
  • It is important to note that since containers do not include the GPU driver or the CUDA environment, both of these components need to be installed separately. It is in this aspect that a virtualized environment offers a superior user experience: an nvidia-docker container in CentOS running natively requires that any existing GPU and CUDA drivers be removed if their versions do not match those required by the container. Uninstalling and re-installing the correct drivers is often a challenging and time-consuming task. However, in a virtualized environment, you can create, in advance, a repository of CentOS VMs with different vGPU and CUDA drivers. When you need to run an application in an nvidia-docker container, you just clone the VM with the correct drivers, load the container, and run it with no performance penalty. In such a scenario, running in a virtualized environment does not require you to uninstall and re-install the correct drivers, which saves both time and considerable frustration. This issue of uninstalling and re-installing drivers in a native environment becomes considerably more difficult if there are multiple container users on the system: either all the containers need to be migrated to use the new drivers, or the user who needs a new driver has to wait until all the other users are done before a system administrator can upgrade the GPU drivers on the native CentOS.

Future Work

In this blog, we presented the performance results of running MNIST in a single container. We plan to run MNIST in multiple containers running concurrently in both a virtualized environment and on CentOS executing natively, and report the measured run times. This will provide a comparison of the performance as we scale up the number of containers.

References

  1. Google Cloud Platform: Cloud AI. https://cloud.google.com/products/machine-learning/
  2. Wikipedia: Deep Learning. https://en.wikipedia.org/wiki/Deep_learning
  3. NVIDIA GPUs – The Engine of Deep Learning. https://developer.nvidia.com/deep-learning
  4. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
  5. TensorFlow: An Open-Source Software Library for Machine Intelligence. https://www.tensorflow.org
  6. Wikipedia: Operating-System-Level Virtualization. https://en.wikipedia.org/wiki/Operating-system-level_virtualization
  7. NVIDIA Docker: GPU Server Application Deployment Made Easy. https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/
  8. Episode 1: Performance Results of Machine Learning with DirectPath I/O and GRID vGPU. https://blogs.vmware.com/performance/2016/10/machine-learning-vsphere-nvidia-gpus.html
  9. Episode 2: Machine Learning on vSphere 6 with NVIDIA GPUs. https://blogs.vmware.com/performance/2017/03/machine-learning-vsphere-6-5-nvidia-gpus-episode-2.html
  10. Episode 3: Performance Comparison of Native GPU to Virtualized GPU and Scalability of Virtualized GPUs for Machine Learning. https://blogs.vmware.com/performance/2017/10/episode-3-performance-comparison-native-gpu-virtualized-gpu-scalability-virtualized-gpus-machine-learning.html 

Episode 3: Performance Comparison of Native GPU to Virtualized GPU and Scalability of Virtualized GPUs for Machine Learning

In our third episode on machine learning performance with vSphere 6.x, we look at the virtual GPU vs. the physical GPU. In addition, we extend the performance results of machine learning workloads using VMware DirectPath I/O (passthrough) vs. NVIDIA GRID vGPU that were partially addressed in previous episodes.

Machine Learning with Virtualized GPUs

Performance is one of the biggest concerns that keeps high performance computing (HPC) users from choosing virtualization to deploy HPC applications, despite virtualization benefits such as reduced administration costs, efficient resource utilization, energy savings, and security. However, with the constant evolution of virtualization technologies, the performance gap between bare metal and virtualization has almost disappeared, and, in some use cases, virtualized applications can achieve better performance than bare metal because of the intelligent and highly optimized resource utilization of hypervisors. For example, a prior study [1] shows that vector machine applications running on a virtualized cluster of 10 servers have a better execution time than running on bare metal.

Virtual GPU vs. Physical GPU

To understand the performance impact of machine learning with GPUs under virtualization, we used a complex language modeling application: predicting the next word given a history of previous words, using a recurrent neural network (RNN) with 1500 Long Short-Term Memory (LSTM) units per layer, on the Penn Treebank (PTB) dataset [2, 3], which has:

  • 929,000 training words
  • 73,000 validation words
  • 82,000 test words
  • 10,000 vocabulary words
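A model of roughly this shape can be sketched in TensorFlow/Keras as follows; the sequence length, optimizer, and data pipeline are assumptions rather than the exact PTB benchmark implementation.

```python
# Illustrative word-level LSTM language model of roughly the size described
# above (1500 LSTM units per layer, 10,000-word vocabulary).
import tensorflow as tf

VOCAB_SIZE = 10000
LSTM_UNITS = 1500
SEQ_LEN = 35

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, LSTM_UNITS, input_length=SEQ_LEN),
    tf.keras.layers.LSTM(LSTM_UNITS, return_sequences=True),
    tf.keras.layers.LSTM(LSTM_UNITS, return_sequences=True),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
# model.fit(ptb_inputs, ptb_targets, batch_size=20, epochs=13)  # PTB tensors assumed
```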

We tested three cases:

  • A physical GPU installed on bare metal (this is the “native” configuration)
  • A DirectPath I/O GPU inside a VM on vSphere 6
  • A GRID vGPU (that is, an M60-8Q vGPU profile with 8GB memory) inside a VM on vSphere 6

The VM in the last two cases has 12 virtual CPUs (vCPUs), 60GB RAM, and 96GB SSD storage.

The benchmark was implemented using TensorFlow [4], which we also used to implement the other machine learning benchmarks in our experiments. We used CUDA 7.5, cuDNN 5.1, and CentOS 7.2 for both the native and guest operating systems. These test cases were run on a Dell PowerEdge R730 server with dual 12-core Intel Xeon E5-2680 v3 processors at 2.50 GHz (24 physical cores, 48 logical cores with hyperthreading enabled), 768 GB of memory, and an SSD (1.5 TB). This server also had two NVIDIA Tesla M60 cards (each with two GPUs) for a total of 4 GPUs; each GPU has 2048 CUDA cores and 8 GB of memory, supports 36 H.264 1080p30 video streams, and can host 1–32 GRID vGPUs whose memory profiles range from 512 MB to 8 GB. This experimental setup was used for all tests presented in this blog (Figure 1, below).

Figure 1. Testbed configurations for native GPU vs. virtual GPU comparison

The results in Figure 2 (below) show the relative execution times of DirectPath I/O and GRID vGPU compared to the native GPU. Virtualization introduces a 4% overhead, and the performance of DirectPath I/O and GRID vGPU is similar. These results are consistent with prior studies of virtual GPU performance with passthrough, where overheads in most cases are less than 5% [5, 6].

Figure 2. DirectPath I/O and NVIDIA GRID vs. native GPU

GPU vs. CPU in a Virtualization Environment

One important benefit of using GPUs is shortening the long training times of machine learning tasks, which has boosted AI research and development results in recent years. In many cases, GPUs reduce execution times from weeks or days to hours or minutes. We illustrate this benefit in Figure 3 (below), which shows the training time with and without a vGPU for two applications:

  • RNN with PTB (described earlier)
  • CNN with MNIST—a handwriting recognizer that uses a convolution neural network (CNN) on the MNIST dataset [7].

From the results, we see that the training time for RNN on PTB with CPU only was 7.9 times higher than the vGPU training time (Figure 3-a). The training time for CNN on MNIST with CPU only was 10.1 times higher than the vGPU training time (Figure 3-b). The VM used in this test has 1 vGPU, 12 vCPUs, 60 GB of memory, and 96 GB of SSD storage, and the test setup is similar to that of the above experiment.

Figure 3. Normalized training time of PTB, MNIST with and without vGPU

As the test results show, we can successfully run machine learning applications in a vSphere 6 virtualized environment, and their training times are similar to those of machine learning applications running in a native (non-virtualized) configuration using physical GPUs.

But what about a passthrough scenario? How does a machine learning application run in a vSphere 6 virtual machine using a passthrough to the physical GPU vs. using a virtualized GPU? We present our findings in the next section.

Comparison of DirectPath I/O and GRID vGPU

We evaluated the performance, scalability, and other benefits of DirectPath I/O and GRID vGPU. We also provide some recommendations about the best use cases for each virtual GPU solution.

Performance

To compare the performance of DirectPath I/O and GRID vGPU, we benchmarked them with RNN on PTB, and CNN on MNIST and CIFAR-10. CIFAR-10 [8] is an object classification application that categorizes RGB images of 32×32 pixels into 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. MNIST is a handwriting recognition application. Both CIFAR-10 and MNIST use a convolutional neural network, while the language model that predicts words based on history uses a recurrent neural network on the Penn Treebank (PTB) dataset.

Figure 4. Performance comparison of DirectPath I/O and GRID vGPU

The results in Figure 4 (above) show the comparative performance of the two virtualization solutions, in which DirectPath I/O achieves slightly better performance than GRID vGPU. This is because the passthrough mechanism of DirectPath I/O adds minimal overhead to GPU-based workloads running inside a VM. In Figure 4-a, DirectPath I/O is about 5% faster than GRID vGPU for MNIST, and they have the same performance with PTB. For CIFAR-10, DirectPath I/O can process about 13% more images per second than GRID vGPU. We use images per second for CIFAR-10 because it is a frequently used metric for this dataset. The VM in this experiment has 12 vCPUs, 60 GB of memory, and one GPU (either DirectPath I/O or GRID vGPU).

Scalability

We look at two types of scalability: user and GPU.

User Scalability

In a cloud environment, multiple users can share physical servers, which helps to better utilize resources and save costs. Our test server with 4 GPUs can support up to 4 users that each need a GPU; alternatively, a single user can have four VMs, each with a vGPU. The number of virtual machines run per server in a cloud environment is typically high in order to increase utilization and lower costs [9]. Machine learning workloads are typically much more resource intensive, and using our 4-GPU test system for only up to 4 users reflects this.

Figure 5. Scaling the number of VMs with vGPU on CIFAR-10

Figure 5 (above) presents user scalability on CIFAR-10 from 1 to 4 users, where each user has a VM with one GPU, and we normalize images per second to that of the DirectPath I/O 1-VM case (Figure 5-a). Similar to the previous comparison, DirectPath I/O and GRID vGPU show comparable performance as the number of VMs with GPUs scales. Specifically, the performance difference between them is 6%–10% for images per second and 0%–1.5% for CPU utilization. This difference is not significant when weighed against the benefits that vGPU brings; because of its flexibility and elasticity, it is a good option for machine learning workloads. The results also show that the two solutions scale linearly with the number of VMs, both in terms of execution time and CPU resource utilization. The VMs used in this experiment have 12 vCPUs, 16 GB of memory, and 1 GPU (either DirectPath I/O or GRID vGPU).

GPU Scalability

For machine learning applications that need to build very large models, or whose datasets cannot fit into a single GPU, users can use multiple GPUs to distribute the workload among them and further speed up the training task. On vSphere, applications that require multiple GPUs can use DirectPath I/O passthrough to configure VMs with as many GPUs as required. This capability is limited for CUDA applications using GRID vGPU because only 1 vGPU per VM is allowed for CUDA computations.

We demonstrate the efficiency of using multiple GPUs on vSphere by benchmarking the CIFAR-10 workload, using images per second (images/sec) as the metric to compare the performance of CIFAR-10 on a VM with different numbers of GPUs, scaling from 1 to 4 GPUs.
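As a hedged, present-day illustration of spreading a CIFAR-10-style training job across all GPUs configured into a VM, the sketch below uses TensorFlow's MirroredStrategy; the original experiments used TensorFlow's older multi-GPU mechanism, so this is a sketch of the idea, not the benchmark code.

```python
# Spread training across every GPU visible to the VM (illustrative only).
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("GPUs in use:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Small CIFAR-10 CNN; the benchmarked model and hyperparameters differed.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(128, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
model.fit(x_train.astype("float32") / 255.0, y_train, batch_size=256, epochs=1)
```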

From the results in Figure 6 (below), we found that the number of images processed per second improves almost linearly with the number of GPUs on the host (Figure 6-a). At the same time, CPU utilization also increases linearly (Figure 6-b). These results show that machine learning workloads scale well on the vSphere platform. For machine learning applications that require more GPUs than a single physical server can support, we can use a distributed computing model, with multiple distributed processes using GPUs running on a cluster of physical servers. With this approach, both DirectPath I/O and GRID vGPU can be used to scale out to a very large number of GPUs.

Figure 6. Scaling the number of GPUs per VM on CIFAR-10

How to Choose Between DirectPath I/O and GRID vGPU

For DirectPath I/O

From the above results, we can see that DirectPath I/O and GRID vGPU have similar performance and low overhead compared to a native GPU, which makes both good choices for machine learning applications in virtualized cloud environments. For applications that require short training times and use multiple GPUs to speed up machine learning tasks, DirectPath I/O is the suitable option because it supports multiple GPUs per VM. In addition, DirectPath I/O supports a wider range of GPU devices and so provides a more flexible choice of GPU for users.

For GRID vGPU

When each user needs a single GPU, GRID vGPU can be a good choice. This configuration provides a higher consolidation of virtual machines and leverages the benefits of virtualization:

  • GRID vGPU allows flexible use of the device because vGPU supports both shared GPU (multiple users per physical GPU) and dedicated GPU (one user per physical GPU). Mixing and switching among machine learning, 3D graphics, and video encoding/decoding workloads using GPUs is much easier and allows for more efficient use of the hardware resource. Using GRID solutions for machine learning and 3D graphics allows cloud-based services to multiplex the GPUs among more concurrent users than the number of physical GPUs in the system. This contrasts with DirectPath I/O, the dedicated GPU solution, where the number of concurrent users is limited to the number of physical GPUs.
  • GRID vGPU reduces administration costs because its deployment and maintenance do not require a server reboot, so no downtime is required for end users. For example, changing the vGPU profile of a virtual machine does not require a server reboot, while any change to a DirectPath I/O configuration does. GRID vGPU's ease of management reduces the time and complexity of administering and maintaining the GPUs. This benefit is particularly important in a cloud environment, where the number of managed servers can be very large.

Conclusion

Our tests show that virtualized machine learning workloads on vSphere with vGPUs offer near bare-metal performance.

References

  1. Jaffe, D. Big Data Performance on vSphere 6. (August 2016). http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/bigdata-perf-vsphere6.pdf.
  2. Zaremba, W., Sutskever,I., Vinyals, O.: Recurrent Neural Network Regularization. In: arXiv:1409.2329 (2014).
  3. Taylor, A., Marcus, M., Santorini, B.: The Penn Treebank: An Overview. In: Abeille, A. (ed.). Treebanks: the state of the art in syntactically annotated corpora. Kluwer (2003).
  4. Tensorflow Homepage, https://www.tensorflow.org
  5. Vu, L., Sivaraman, H., Bidarkar, R.: GPU Virtualization for High Performance General Purpose Computing on the ESX hypervisor. In: Proc. of the 22nd High Performance Computing Symposium (2014).
  6. Walters, J.P., Younge, A.J., Kang, D.I., Yao, K.T., Kang, M., Crago, S.P., Fox, G.C.: GPU Passthrough Performance: A Comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL Applications. In: Proceedings of 2014 IEEE 7th International Conference on Cloud Computing (2014).
  7. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, 86(11):2278-2324 (November 1998).
  8. Multiple Layers of Features from Tiny Images, https://www.cs.toronto.edu/~kriz/cifar.html
  9. Pandey, A., Vu, L., Puthiyaveettil, V., Sivaraman, H., Kurkure, U., Bappanadu, A.: An Automation Framework for Benchmarking and Optimizing Performance of Remote Desktops in the Cloud. In: To appear in Proceedings of the 2017 International Conference on High Performance Computing & Simulation (2017).

New White Paper: Fast Virtualized Hadoop and Spark on All-Flash Disks – Best Practices for Optimizing Virtualized Big Data Applications on VMware vSphere 6.5

A new white paper is available showing how to best deploy and configure vSphere 6.5 for Big Data applications such as Hadoop and Spark running on a cluster with fast processors, large memory, and all-flash storage (Non-Volatile Memory Express storage and solid state disks). Hardware, software, and vSphere configuration parameters are documented, as well as tuning parameters for the operating system, Hadoop, and Spark.

The best practices were tested on a 13-server cluster, with Hadoop installed on vSphere as well as on bare metal. Workloads for both Hadoop (TeraSort and TestDFSIO) and Spark Machine Learning Library routines (K-means clustering, Logistic Regression classification, and Random Forest decision trees) were run on the cluster. Configurations with 1, 2, and 4 VMs per host were tested, as well as bare metal. Among the virtualized configurations, 4 VMs per host ran fastest due to the best utilization of storage as well as the highest percentage of data transfer staying within a server. The 4 VMs per host configuration also ran faster than bare metal on all Hadoop and Spark tests but one.


Introducing VMmark3: A highly flexible and easily deployed benchmark for vSphere environments

VMmark 3.0, VMware’s multi-host virtualization benchmark is generally available here.  VMmark3 is a free cluster-level benchmark that measures the performance, scalability, and power of virtualization platforms.

VMmark3 leverages much of previous VMmark generations’ technologies and design.  It continues to utilize a unique tile-based heterogeneous workload application design. It also deploys the platform-level workloads found in VMmark2 such as vMotion, Storage vMotion, and Clone & Deploy.  In addition to incorporating new and updated application workloads and infrastructure operations, VMmark3 also introduces a new fully automated provisioning service that greatly reduces deployment complexity and time.
