
Author Archives: Julie Brodeur


About Julie Brodeur

Julie is a senior technical writer in the Performance Engineering group at VMware.

DRS Enhancements in vSphere 6.7

A new paper describes the DRS enhancements in vSphere 6.7, which include new initial placement, host maintenance mode enhancements, DRS support for non-volatile memory (NVM), and enhanced resource pool reservations.

Resource pool and VM entitlements—old and new models

A summary of the improvements follows:

  • DRS in vSphere 6.7 now provides much faster placement and more accurate recommendations for all DRS configurations. vSphere 6.5 did not support some configurations, such as VMs with fault tolerance (FT) enabled, among others.
  • Starting with vSphere 6.7, DRS uses the new initial placement algorithm to come up with the recommended list of hosts to be placed in maintenance mode. Further, when evacuating the hosts, DRS uses the new initial placement algorithm to find new destination hosts for outgoing VMs.
  • DRS in vSphere 6.7 can handle VMs running on next generation persistent memory devices, also known as Non-Volatile Memory (NVM) devices.
  • There is a new two-pass algorithm that allocates a resource pool's resource reservation to its children (also known as divvying).

For more information about all of these updates, see DRS Enhancements in vSphere 6.7.

vSphere with iSER – How to release the full potential of your iSCSI storage!

By Mark Ma

With the release of vSphere 6.7, VMware added iSER (iSCSI Extensions for RDMA) as a natively supported storage protocol in ESXi. Because iSER carries iSCSI traffic over RDMA, users can boost their vSphere storage performance simply by replacing regular NICs with RDMA-capable NICs. RDMA (Remote Direct Memory Access) transfers data directly between the memory of one computer and the memory of another, minimizing CPU and kernel involvement. By bypassing the kernel, we get extremely high I/O bandwidth and low latency. (To use RDMA, you must have an HCA/Host Channel Adapter device on both the source and destination.) In this blog, we compare standard iSCSI performance with iSER performance to see how iSER can release the full potential of your iSCSI storage.

Testbed Configuration

The iSCSI/iSER target system is an open source Ubuntu 18.04 LTS server with 2 x Intel Xeon E5-2403 v2 CPUs, 96 GB RAM, a 120 GB SSD for the OS, 8 x 450 GB (15K RPM) SAS drives, and a Mellanox ConnectX-3 Pro EN 40 GbE NIC (RDMA capable). The file system is ZFS, with the eight 450 GB SAS drives arranged as four mirrored pairs (equivalent to RAID 10). ZFS has advanced memory caching that can produce very good random IOPS and read throughput. We did not add any SSD for caching because the test compares protocols, not disk drives. The iSCSI/iSER target software is the Linux SCSI target framework (TGT).
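For reference, a ZFS pool with this layout (four mirrored vdevs across the eight SAS drives) could be created roughly as follows; the device names are assumptions, and the exact pool options used for the tests are not shown here:

zpool create tank \
  mirror /dev/sdb /dev/sdc \
  mirror /dev/sdd /dev/sde \
  mirror /dev/sdf /dev/sdg \
  mirror /dev/sdh /dev/sdi
zpool status tank    # confirm the pool contains four mirror vdevs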

The iSCSI/iSER initiator is an ESXi 6.7.0 host (build 9214924) with 2 x Intel Xeon E5-2403 v2 CPUs @ 1.8 GHz, 96 GB RAM, a USB boot drive, and a Mellanox ConnectX-3 Pro EN 40 GbE NIC (RDMA capable). This host was used to benchmark the performance boost that iSER provides over iSCSI.

Both the target and the initiator connect to a 40 GbE switch with QSFP cables for optimal network performance.

Both NICs run the latest firmware, version 2.42.5000.
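To verify the firmware level on each side, commands along these lines can be used; the interface and vmnic names are assumptions:

ethtool -i enp4s0               # on the Ubuntu target, reports the NIC driver and firmware version
esxcli network nic get -n vmnic4   # on the ESXi host, reports the same information for the vmnic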

To measure performance, we used VMware I/O Analyzer, which uses the industry-standard benchmark Iometer.

iSCSI Test

We configured the target to use the iSCSI driver so that the first test measures the standard iSCSI protocol.

Figure 1
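For readers unfamiliar with TGT, a target like the one shown in Figure 1 can be created with tgtadm roughly as follows; the IQN, target ID, and backing device are assumptions, and for the iSER test the target is switched to TGT's iSER transport instead:

tgtadm --lld iscsi --op new --mode target --tid 1 -T iqn.2018-08.lab.local:zfs-target
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /dev/zvol/tank/lun0
tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL   # allow all initiators to connect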

For the iSCSI initiator, we simply enable the iSCSI software adapter.

Figure 2
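On the ESXi side, enabling the software iSCSI adapter and pointing it at the target can also be done from the command line; the adapter name and target IP below are assumptions:

esxcli iscsi software set --enabled=true        # enable the software iSCSI adapter
esxcli iscsi adapter list                       # note the new vmhba name (for example, vmhba64)
esxcli iscsi adapter discovery sendtarget add -A vmhba64 -a 192.168.40.10:3260
esxcli storage core adapter rescan --all        # rescan to discover the LUNs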

iSCSI test one: Max Read IOPS—this test shows the max read IOPS (4K random read I/Os per second) from the iSCSI storage.

Result: 34,255.18 IOPS

Figure 3

iSCSI test two: Max Write IOPS—this test shows the max write IOPS from the iSCSI storage.

Result: 36,428.26 IOPS

Figure 4

iSCSI test three: Max Read Throughput—this test shows the max read throughput from the iSCSI storage.

Result: 2,740.80 MBPS

Figure 5

iSCSI test four: Max Write Throughput—this test shows the potential max write throughput from the iSCSI storage. (The performance is rather low due to the ZFS RAID configuration and the limited number of disk spindles.)

Result: 112.04 MBPS

Figure 6

iSER Test

We configured the target to use the iSER driver so that the second test measures only iSER connections.

Figure 7

For the iSER initiator, we need to verify that an RDMA-capable NIC is installed. For this, we use the command:
esxcli rdma device list

Figure 8

Then we run the following command from the ESXi host to enable the iSER adapter.

esxcli rdma iser add

 

Figure 9
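After the iSER adapter appears, it still needs a VMkernel port bound to the RDMA-capable uplink and a target address, much like a software iSCSI adapter; the adapter, VMkernel port, and target IP names below are assumptions:

esxcli iscsi adapter list                             # the iSER adapter shows up as a new vmhba
esxcli iscsi networkportal add -A vmhba65 -n vmk1     # bind the VMkernel port on the RDMA NIC
esxcli iscsi adapter discovery sendtarget add -A vmhba65 -a 192.168.40.10:3260
esxcli storage core adapter rescan --all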

 

iSER test one: Max Read IOPS—this test shows the max read IOPS from the iSER storage.

Result: 71,108.85 IOPS, which is 207.59% of the iSCSI result.

Figure 10

iSER test two: Max Write IOPS—this test shows the max write IOPS from the iSER storage.

Result: 69,495.70 IOPS, which is 190.77% of the iSCSI result.

Figure 11

iSER test three: Max Read Throughput—this test shows the max read throughput, measured in megabytes per second (MBPS), from the iSER storage.

Result: 4,126.53 MBPS, which is 150.56% of the iSCSI result.

Figure 12

iSER test four: Max Write Throughput—this test shows the max write throughput from the iSER storage. (The performance is rather low due to the ZFS RAID configuration and the limited number of disk spindles.)

Result: 106.48 MBPS, which is about 5% less than iSCSI.

Figure 13

Results

Figure 14

Figure 15

Figure 16

For random I/O, iSER delivers about twice the performance of iSCSI for both reads and writes, and about 1.5 times the read throughput. Write throughput is about the same. The only difference between the two test setups is the storage protocol. We also ran these tests on older hardware, so just imagine what vSphere with iSER could do with state-of-the-art NVMe-based storage and the latest 200 GbE network equipment.

Conclusion

The results seemed too good to be true, so I ran the benchmark several times to ensure consistency. It's great to see VMware's innovation initiative in action. Who would have thought that "not so exciting" traditional iSCSI storage could roughly double in performance through the efforts of VMware and Mellanox? It's great to see VMware continue to push the boundaries of the Software-Defined Data Center to better serve our customers in their digital transformation journey!

About the Author

Mark Ma is a senior consultant at VMware Professional Services. He is heavily involved with POC, architecture design, assessment, implementation, and user training. Mark specializes in end-to-end virtualization solutions based on Citrix, Microsoft, and VMware applications.

New white paper: Big Data performance on VMware Cloud on AWS: Spark machine learning and IoT analytics performance on-premises and in the cloud

By Dave Jaffe

A new white paper is available comparing Spark machine learning performance on an 8-server on-premises cluster vs. a similarly configured VMware Cloud on AWS cluster.

Here is what the VMware Cloud on AWS cluster looked like:


VMware Cloud on AWS configuration for performance tests

Three standard analytic programs from the Spark machine learning library (MLlib) were driven using spark-perf: K-means clustering, logistic regression classification, and random forest decision trees. In addition, the comparison used a new, VMware-developed benchmark, IoT Analytics Benchmark, which models real-time machine learning on Internet-of-Things data streams. The benchmark is available on GitHub.
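As a rough sketch of how the MLlib tests are driven (paths and settings here are assumptions; the repositories document the actual procedure), spark-perf is configured through a Python config file and launched with its run script:

git clone https://github.com/databricks/spark-perf.git
cd spark-perf
cp config/config.py.template config/config.py   # select the MLlib tests (K-means, logistic regression, random forest) and set cluster options
./bin/run                                        # submits the selected tests to the Spark cluster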

As seen in the charts below, performance was very similar on-premises and on VMware Cloud on AWS.


Spark machine learning performance


IoT Analytics performance

 

All details are in the paper.

Persistent Memory Performance in vSphere 6.7

We published a paper that shows how VMware is helping advance PMEM technology by driving the virtualization enhancements in vSphere 6.7. The paper gives a detailed performance analysis of using PMEM technology on vSphere using various workloads and scenarios.

These are the key points that we cover in this white paper:

  • We explain how PMEM can be configured and used in a vSphere environment.
  • We show how applications with different characteristics can take advantage of PMEM in vSphere. Below are some of the use-cases:
    • How PMEM device performance limits can be reached under vSphere with little to no virtualization overhead. We show the virtual-to-native ratio along with raw bandwidth and latency numbers from fio, an I/O microbenchmark (see the sketch after this list).
    • How traditional relational databases like Oracle can benefit from using PMEM in vSphere.
    • How scaling-out VMs in vSphere can benefit from PMEM. We used Sysbench with MySQL to show such benefits.
    • How modified (PMEM-aware) applications can get the best performance out of PMEM. We show performance data from such applications, for example, an OLTP database like SQL Server and an in-memory database like Redis.
    • How vMotion can migrate VMs with PMEM, which is a host-local device just like an NVMe SSD. We also characterize vMotion performance of VMs with PMEM in detail.
  • We outline some best practices on how to get the most out of PMEM in vSphere.
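For example, a fio job against a virtual PMEM device might look roughly like the following; the device path, block size, and queue depth are assumptions and not necessarily what the paper used:

fio --name=pmem-randread --filename=/dev/pmem0 \
    --rw=randread --bs=4k --iodepth=16 --numjobs=4 \
    --ioengine=libaio --direct=1 --time_based --runtime=60 \
    --group_reporting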

Read the full paper here.

Performance Best Practices Guide for vSphere 6.7

We are pleased to announce the availability of Performance Best Practices for VMware vSphere 6.7. This is a comprehensive book designed to help system administrators obtain the best performance from their vSphere 6.7 deployments.

The book covers new features, and it updates and expands on many of the topics covered in previous versions.

These include:

  • Hardware-assisted virtualization
  • Storage hardware considerations
  • Network hardware considerations
  • Memory page sharing
  • Getting the best performance with iSCSI and NFS storage
  • Getting the best performance from NVMe drives
  • vSphere virtual machine encryption recommendations
  • Running storage latency-sensitive workloads
  • Network I/O Control (NetIOC)
  • DirectPath I/O
  • Running network latency-sensitive workloads
  • Microsoft Virtualization-Based Security (VBS)
  • CPU Hot Add
  • 4KB native drives
  • Selecting virtual network adapters
  • The vSphere HTML5 Client
  • vSphere web client configuration
  • Pair-wise balancing in DRS-enabled clusters
  • VMware vSphere Update Manager
  • VMware vSAN performance

The book can be found here.

Also, for a summary of the new performance-related features in vSphere 6.7, refer to What’s New in Performance.

ESX IP Storage Troubleshooting Best Practice White Paper

We have published an ESX IP Storage Troubleshooting Best Practice white paper, in which we recommend that vSphere customers deploying ESX IP storage over 10G networks include 10G packet capture systems as a best practice to ensure network visibility.

The white paper explores the challenges and alternatives for packet capture in a vSphere environment with IP storage (NFS, iSCSI) datastores over a 10G network, and explains why traditional techniques for capturing packet traces on 1G networks will suffer from severe limitations (capture drops and inaccurate timestamps) when used for 10G networks. Although commercial 10G packet capture systems are commonly available, they may be beyond the budget of some vSphere customers. We present the design of a self-assembled 10G packet capture solution that can be built using commercial components relatively inexpensively. The self-assembled solution is optimized for common troubleshooting scenarios where short duration packet captures can satisfy most analysis requirements.
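For short, targeted captures, ESXi's built-in pktcap-uw utility can grab traffic on an uplink or a VMkernel port, though as the paper notes, host-based capture at 10G rates can drop packets, which is exactly the limitation the dedicated capture system is designed to avoid. The uplink, VMkernel port, and output paths below are assumptions:

pktcap-uw --uplink vmnic2 -o /vmfs/volumes/datastore1/uplink.pcap   # capture on the 10G uplink
pktcap-uw --vmk vmk1 -o /vmfs/volumes/datastore1/vmk1.pcap          # capture on the storage VMkernel port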

Our experience troubleshooting a large number of IP storage issues has shown that the ability to capture and analyze packet traces in an ESX IP storage environment can significantly reduce the mean time to resolution for serious functional and performance issues. When reporting an IP storage problem to VMware or to a storage array vendor, an accompanying packet trace file is a great piece of evidence that can significantly reduce the time required by the responsible engineering teams to identify the problem.

Performance Comparison of Containerized Machine Learning Applications Running Natively with Nvidia vGPUs vs. in a VM – Episode 4

This article is by Hari Sivaraman, Uday Kurkure, and Lan Vu from the Performance Engineering team at VMware.

Performance Comparison of Containerized Machine Learning Applications

Docker containers [6] are rapidly becoming a popular environment in which to run different applications, including those in machine learning [1, 2, 3]. NVIDIA supports Docker containers with their own Docker engine utility, nvidia-docker [7], which is specialized to run applications that use NVIDIA GPUs.

The nvidia-docker container for machine learning includes the application and the machine learning framework (for example, TensorFlow [5]) but, importantly, it does not include the GPU driver or the CUDA toolkit.

Docker containers are hardware agnostic so, when an application uses specialized hardware like an NVIDIA GPU that needs kernel modules and user-level libraries, the container cannot include the required drivers. They live outside the container.

One workaround here is to install the driver inside the container and map its devices upon launch. This workaround is not portable since the versions inside the container need to match those in the native operating system.

The nvidia-docker engine utility provides an alternate mechanism that mounts the user-mode components at launch, but this requires you to install the driver and CUDA in the native operating system before launch. Both approaches have drawbacks, but the latter is clearly preferable.
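As a minimal sketch of the latter approach (the image tag and test command are assumptions, not what we ran in the experiments below), the host installs the driver and CUDA, and nvidia-docker then mounts them into the container at launch:

nvidia-docker run --rm tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.test.gpu_device_name())"   # the container sees the GPU without bundling the driver or CUDA toolkit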

In this episode of our series of blogs [8, 9, 10] on machine learning in vSphere using GPUs, we present a comparison of the performance of MNIST [4] running in a container on CentOS executing natively with MNIST running in a container inside a CentOS VM on vSphere. Based on our experiments, we demonstrate that running containers in a virtualized environment, like a CentOS VM on vSphere, suffers no performance penalty, while benefiting from the tremendous management capabilities offered by the VMware vSphere platform.

Experiment Configuration and Methodology

We used MNIST [4] to compare the performance of containers running natively with containers running inside a VM. The configuration of the VM and the vSphere server we used for the “virtualized container” is shown in Table 1. The configuration of the physical machine used to run the container natively is shown in Table 2.

vSphere: 6.0.0, build 3500742
Nvidia vGPU driver: 367.53
Guest OS: CentOS Linux release 7.4.1708 (Core)
CUDA driver: 8.0
CUDA runtime: 7.5
Docker: 17.09-ce-rc2

Table 1. Configuration of VM used to run the nvidia-docker container

Nvidia driver: 384.98
Operating system: CentOS Linux release 7.4.1708 (Core)
CUDA driver: 8.0
CUDA runtime: 7.5
Docker: 17.09-ce-rc2

Table 2. Configuration of physical machine used to run the nvidia-docker container

The server configuration we used is shown in Table 3 below. In our experiments, we used the NVIDIA M60 GPU in vGPU mode only; we did not use the Direct I/O mode. In the scenario in which we ran the container inside the VM, we first installed the NVIDIA vGPU drivers in vSphere and inside the VM, then we installed CUDA (driver 8.0 with runtime version 7.5), followed by Docker and nvidia-docker [7]. In the case where we ran the container natively, we installed the NVIDIA driver in CentOS running natively, followed by CUDA (driver 8.0 with runtime version 7.5), Docker, and finally nvidia-docker [7]. In both scenarios we ran MNIST and measured the wall-clock run time for training.
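Before running MNIST in either scenario, a quick sanity check along these lines confirms that the driver stack and nvidia-docker are wired up correctly; the exact image tag is an assumption:

nvidia-smi                                      # the driver (or vGPU driver inside the VM) sees the GPU
docker --version                                # the Docker engine is installed
nvidia-docker run --rm nvidia/cuda nvidia-smi   # a container can also see the GPU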

 Figure 1. Testbed configuration for comparison of the performance of containers running natively vs. running in a VM

Model: Dell PowerEdge R730
Processor type: Intel® Xeon® CPU E5-2680 v3 @ 2.50 GHz
CPU cores: 24 CPUs, each @ 2.5 GHz
Processor sockets: 2
Cores per socket: 14
Logical processors: 48
Hyperthreading: Active
Memory: 768 GB
Storage: Local SSD (1.5 TB), Storage Arrays, Local Hard Disks
GPUs: 2 x Tesla M60

Table 3. Server configuration

Results

The measured wall-clock run times for MNIST are shown in Table 4 for the two scenarios we tested:

  1. Running in an nvidia-docker container in CentOS running natively.
  2. Running in an nvidia-docker container inside a CentOS VM on vSphere.

From the data, we can clearly see that there is no measurable performance penalty for running a container inside a VM as compared to running it natively.

Run time for MNIST as measured by a wall clock:
Nvidia-docker container in CentOS running natively: 44 minutes 53 seconds
Nvidia-docker container running in a CentOS VM on vSphere: 44 minutes 57 seconds

Table 4. Comparison of the run time for MNIST running in a container on native CentOS vs. in a container in virtualized CentOS

Takeaways

  • Based on the results shown in Table 4, it is clear that there is no measurable performance impact due to running a containerized application in a virtual environment as opposed to running it natively. So, from a performance perspective, there is no penalty for using a virtualized environment.
  • It is important to note that since containers do not include the GPU driver or the CUDA environment, both of these components need to be installed separately. It is in this respect that a virtualized environment offers a superior user experience: an nvidia-docker container in CentOS running natively requires that any existing GPU and CUDA drivers be removed if their versions do not match those required by the container. Uninstalling and re-installing the correct drivers is often a challenging and time-consuming task. In a virtualized environment, however, you can create, in advance, a repository of CentOS VMs with different vGPU and CUDA drivers. When you need to run an application in an nvidia-docker container, you just clone the VM with the correct drivers, load the container, and run it with no performance penalty. In such a scenario, running in a virtualized environment does not require you to uninstall and re-install drivers, which saves both time and considerable frustration. This driver problem becomes considerably more difficult in a native environment when there are multiple container users on the system: either all the containers must be migrated to use the new drivers, or the user who needs a new driver must wait until all the other users are done before a system administrator can upgrade the GPU drivers on the native CentOS.

Future Work

In this blog, we presented the performance results of running MNIST in a single container. We plan to run MNIST in multiple containers running concurrently in both a virtualized environment and on CentOS executing natively, and report the measured run times. This will provide a comparison of the performance as we scale up the number of containers.

References

  1. Google Cloud Platform: Cloud AI. https://cloud.google.com/products/machine-learning/
  2. Wikipedia: Deep Learning. https://en.wikipedia.org/wiki/Deep_learning
  3. NVIDIA GPUs – The Engine of Deep Learning. https://developer.nvidia.com/deep-learning
  4. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
  5. TensorFlow: An Open-Source Software Library for Machine Intelligence. https://www.tensorflow.org
  6. Wikipedia: Operating-System-Level Virtualization. https://en.wikipedia.org/wiki/Operating-system-level_virtualization
  7. NVIDIA Docker: GPU Server Application Deployment Made Easy. https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/
  8. Episode 1: Performance Results of Machine Learning with DirectPath I/O and GRID vGPU. https://blogs.vmware.com/performance/2016/10/machine-learning-vsphere-nvidia-gpus.html
  9. Episode 2: Machine Learning on vSphere 6 with NVIDIA GPUs. https://blogs.vmware.com/performance/2017/03/machine-learning-vsphere-6-5-nvidia-gpus-episode-2.html
  10. Episode 3: Performance Comparison of Native GPU to Virtualized GPU and Scalability of Virtualized GPUs for Machine Learning. https://blogs.vmware.com/performance/2017/10/episode-3-performance-comparison-native-gpu-virtualized-gpu-scalability-virtualized-gpus-machine-learning.html 

Performance of Storage I/O Control (SIOC) with SSD Datastores – vSphere 6.5

With Storage I/O Control (SIOC), vSphere 6.5 administrators can adjust the storage performance of VMs so that VMs with critical workloads will get the I/Os per second (IOPS) they need. Admins assign shares (the proportion of IOPS allocated to the VM), limits (the upper bound of VM IOPS), and reservations (the lower bound of VM IOPS) to the VMs whose IOPS need to be controlled.  After shares, limits, and reservations have been set, SIOC is automatically triggered to meet the desired policies for the VMs.

A recently published paper shows the performance of SIOC meets expectations and successfully controls the number of IOPS for VM workloads.


New Fling released – IOInsight

By Sankaran Sivathanu

VMware IOInsight is a tool to help people understand a VM’s storage I/O behavior. By understanding their VM’s I/O characteristics, customers can make better decisions about storage capacity planning and performance tuning. IOInsight ships as a virtual appliance that can be deployed in any vSphere environment and includes an intuitive web-based UI that allows users to choose VMDKs to monitor and view results.
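The appliance can be deployed through the vSphere client or, as a hedged sketch, with ovftool; the file name, datastore, and target host below are assumptions:

ovftool --name=ioinsight --datastore=datastore1 \
    ioinsight.ova \
    vi://root@esxi01.example.com/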

Where does IOInsight help?

  • Customers may better tune and size their storage.
  • When contacting VMware Support for any vSphere storage issues, including a report from IOInsight can help VMware Support better understand the issues and can potentially lead to faster resolutions.
  • VMware Engineering can optimize products with a better understanding of various customers’ application behavior.

IOInsight captures I/O traces from ESXi and generates various aggregated metrics that represent the I/O behavior. The IOInsight report contains only these aggregated metrics; there is no sensitive information about the application itself. In addition to the built-in metrics computed by IOInsight, users can also write new analyzer plugins for IOInsight and visualize the results. A comprehensive SDK and development guide are included in the download bundle.

The Fling works with vSphere 5.5 and later and can be downloaded at https://labs.vmware.com/flings/ioinsight.

vSphere 6.0 U2 Storage Performance with 32Gb Fibre Channel

We compared the I/O performance of vSphere 6.0 U2 over 16Gb and 32Gb Emulex FC HBAs connected via a Brocade G620 FC switch to an EMC VNX7500 storage array.

Iometer, a common microbenchmark, was used to generate the workload for various block sizes. For single-VM experiments, we measured sequential read and sequential write throughput. For multi-VM experiments, we measured random read IOPS and throughput.

Our experiments showed that vSphere 6 can achieve near line rate with 32Gb FC.
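One way to confirm that the HBAs on each host negotiated the expected link speed is with esxcli; the output fields vary by driver, but the port speed is reported:

esxcli storage san fc list    # lists each FC adapter with its port state and current speed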

For details, please see the whitepaper Storage I/O Performance on VMware vSphere 6.0 U2 over 32 Gigabit Fibre Channel.