
Tag Archives: Performance

Persistent Memory Performance in vSphere 6.7

We published a paper that shows how VMware is helping advance PMEM technology by driving the virtualization enhancements in vSphere 6.7. The paper gives a detailed performance analysis of using PMEM technology on vSphere using various workloads and scenarios.

These are the key points that we cover in this white paper:

  • We explain how PMEM can be configured and used in a vSphere environment.
  • We show how applications with different characteristics can take advantage of PMEM in vSphere. Below are some of the use-cases:
    • How PMEM device limits can be achieved under vSphere with little to no virtualization overhead. We show the virtual-to-native ratio along with raw bandwidth and latency numbers from fio, an I/O microbenchmark (see the fio sketch after this list).
    • How traditional relational databases like Oracle can benefit from using PMEM in vSphere.
    • How scaling-out VMs in vSphere can benefit from PMEM. We used Sysbench with MySQL to show such benefits.
    • How modified (PMEM-aware) applications can get the best performance out of PMEM. We show performance data from such applications, e.g., an OLTP database like SQL Server and an in-memory database like Redis.
    • How vMotion can migrate VMs with PMEM, which is a host-local device just like an NVMe SSD. We also characterize the vMotion performance of VMs with PMEM in detail.
  • We outline some best practices on how to get the most out of PMEM in vSphere.
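
As a rough illustration of the kind of fio measurement the paper describes, here is a minimal Python sketch that runs fio against a file on a guest filesystem assumed to be backed by a vPMEMDisk and reports bandwidth and latency. The mount point, job parameters, and JSON field names (fio 3.x layout) are assumptions, not details taken from the paper.

```python
# Hypothetical sketch: drive fio against a file on an assumed vPMEMDisk-backed
# filesystem inside the guest and report bandwidth/latency. Paths, job
# parameters, and JSON field names are illustrative only.
import json
import subprocess

FIO_TARGET = "/mnt/pmem/fio-test"   # assumed mount point of a PMEM-backed filesystem

def run_fio(rw="randread", block_size="4k", runtime_s=60, iodepth=16, numjobs=4):
    """Run one fio job and return (bandwidth_MiBps, mean_completion_latency_us)."""
    cmd = [
        "fio", "--name=pmem-test",
        f"--filename={FIO_TARGET}",
        f"--rw={rw}", f"--bs={block_size}",
        "--size=4G", "--direct=1",
        "--ioengine=libaio", f"--iodepth={iodepth}", f"--numjobs={numjobs}",
        "--time_based", f"--runtime={runtime_s}",
        "--group_reporting", "--output-format=json",
    ]
    out = json.loads(subprocess.run(cmd, check=True, capture_output=True, text=True).stdout)
    job = out["jobs"][0]["read" if "read" in rw else "write"]
    # Field names vary slightly across fio versions; fio 3.x reports bw in KiB/s
    # and completion latency in nanoseconds.
    bw_mibps = job["bw"] / 1024.0
    lat_us = job["clat_ns"]["mean"] / 1000.0
    return bw_mibps, lat_us

if __name__ == "__main__":
    bw, lat = run_fio()
    print(f"bandwidth: {bw:.1f} MiB/s, mean completion latency: {lat:.1f} us")
```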

Read the full paper here.

Performance Best Practices Guide for vSphere 6.7

We are pleased to announce the availability of Performance Best Practices for VMware vSphere 6.7. This is a comprehensive book designed to help system administrators obtain the best performance from their vSphere 6.7 deployments.

The book covers new features and also updates and expands on many of the topics covered in previous versions.

These include:

  • Hardware-assisted virtualization
  • Storage hardware considerations
  • Network hardware considerations
  • Memory page sharing
  • Getting the best performance with iSCSI and NFS storage
  • Getting the best performance from NVMe drives
  • vSphere virtual machine encryption recommendations
  • Running storage latency-sensitive workloads
  • Network I/O Control (NetIOC)
  • DirectPath I/O
  • Running network latency-sensitive workloads
  • Microsoft Virtualization-Based Security (VBS)
  • CPU Hot Add
  • 4KB native drives
  • Selecting virtual network adapters
  • The vSphere HTML5 Client
  • vSphere web client configuration
  • Pair-wise balancing in DRS-enabled clusters
  • VMware vSphere Update Manager
  • VMware vSAN performance

The book can be found here.

Also, for a summary of the new performance-related features in vSphere 6.7, refer to What’s New in Performance.

Extreme Performance Series at VMworld 2018


I’m excited to announce that the Extreme Performance Series is back for its 6th year with 14 sessions created and presented by VMware’s best and most distinguished performance engineers, principals, architects, and gurus. You do not want to miss this year’s program, as it’s chock-full of advanced content, practical advice, and exciting technical details!

Sessions

Spread across 5 different VMworld tracks, you’ll find these sessions full of performance details that you won’t get anywhere else at VMworld. They’ll also be recorded, so if the sessions you want to see aren’t being hosted in your region, you’ll still get access to them.

See the VMworld catalog for a more detailed abstract and list of speakers!

SessionID LAS BCN Title
VIN2275BU X Extreme Performance Series: vSphere DRS 6.7 Performance and Best Practices
VIN2685BU X Extreme Performance Series: Benchmarking 101
VIN2677BU Extreme Performance Series: Performance Best Practices
VIN2183BU Extreme Performance Series: vSphere PMEM = Storage at Memory Speed
VIN1782BU Extreme Performance Series: vSphere Compute & Memory Schedulers
VIN1759BU Extreme Performance Series: vCenter Performance Deep Dive
VAP2760BU X Machine Learning & Deep Learning on vSphere Using Nvidia Virtualized GPUs
HCI3000BU X Extreme Performance Series: How To Estimate vSAN Performance
VAP1900BU High Performance Big Data and Machine Learning on VMware Cloud on AWS
VAP1492BU Performance of SQL Server, Oracle, and SAP workloads in VMware Cloud on AWS
VAP1620BU Improve App Performance with Micro-Segmentation and Distributed Routing
VIN2572BU X vMotion across Hybrid Cloud : Technical Deep Dive
CTO2390BU X Virtualize and Accelerate HPC/Big Data with SR-IOV, vGPU and RDMA
NFV2917BU X Breaking the Virtual Speed Limit: Data Plane Performance Tuning

TAM Customers

For our customers who are part of the VMware TAM program, Performance will be represented with a dedicated breakout – TAM3597U: vSphere Performance: Deep Dive – on Monday, August 27, 2018, which will be repeated on Tuesday due to high demand. Be sure to add this to your schedule as soon as possible to avoid being waitlisted.

Performance Boot Camp


Introduced last year to a sold-out audience and excellent feedback, the pre-VMworld Performance boot camp returns this year in Las Vegas on Sunday, August 26, 2018, focused on vSphere platform performance.

Boot camps for specific workloads (SQL, Oracle, HPC, etc.) will still be offered, but we have had many requests for a workload-agnostic camp. This boot camp will enable you to confidently support all your virtual workloads and give you an opportunity to interact directly with VMware Performance Engineering.

vSphere Advanced Performance Design, Configuration and Troubleshooting

The VMware vSphere Advanced Performance Boot Camp provides the most advanced performance-oriented technical training available on vSphere performance design, tuning, and troubleshooting. VMware Certified Design Expert Mark Achtemichuk and a team of VMware Performance Engineers will cover a broad range of topics spanning all resource dimensions, including the ESXi scheduler, memory management, and storage and network optimization. Attendees will learn how to identify the location of performance issues, diagnose their root cause, and remediate a wide variety of performance conundrums using the many techniques practiced by the most seasoned vSphere veterans and other VMware experts.

Details: $800, Sunday August 26, 2018 from 8:00am to 5:00pm

Register Now – Registration is open and seating is limited! Don’t miss out – book your seat today!

(Be sure to add the Performance boot camp during your VMworld conference registration, under Educational Offerings, after you’ve selected your conference pass.)

Ask The Experts

This program allows you to book one-on-one time with various highly skilled VMware professionals. Performance experts will be available to sit down and talk with you, and to field your most difficult questions. This is a great forum to get specific questions answered or to explore your particular environment.

Hands On Labs

Lastly, don’t miss the new performance-focused Hands-on Labs.

  • SPL-1904-01-SDC vSphere 6.7 Performance Diagnostics and Benchmarking
    Explore the new features in VMware vSphere v6.7 and how they impact performance, including PMEM (persistent memory) and VBS (virtualization based security). Learn how to right-size VMs for your particular environment with benchmarking tools to characterize database and application performance or host-level performance using tools like esxtop and vCenter performance charts. With over 5 hours of content, you’ll need to take this lab several times.

 

  • SPL-1904-02-CHG VMware vSphere 6.7 – Challenge Lab
    Put your knowledge to the test! Each challenge provides a scenario in which you must fix common problems that vSphere administrators and integration engineers face every day. This is a great way to experience some of the most common performance issues and learn how to resolve them in a fun and prescriptive way.

 

2018 will be a banner year for the Extreme Performance Series and we look forward to you joining us!

See you at VMworld 2018.

@vmMarkA #xPerfSeries #PerfBootcamp

What’s New in Performance – VMware vSphere 6.7

Underlying each release of VMware vSphere are many performance and scalability improvements. The vSphere 6.7 platform continues to provide industry-leading performance and features to ensure the successful virtualization and management of your entire software-defined datacenter.

The What’s New in Performance for vSphere 6.7 white paper highlights significant scale and performance items such as:

  • 2x increase in VMware vCenter Server Operations Per Second over the 6.5 release
  • Support for 1GB Large Memory Pages
  • vmxnet3 enhancements to RSS and VXLAN/Geneve Offloads
  • Performant support of new Persistent Memory offerings

As with each release, I’m excited about these new capabilities and the performance each of them offers.

Document: What’s New in Performance for vSphere 6.7

Got a question? Ask below.


Oracle Database Performance with VMware Cloud on AWS

You’ve probably already heard about VMware Cloud on Amazon Web Services (VMC on AWS). It’s the same vSphere platform that has been running business critical applications for years, but now it’s available on Amazon’s cloud infrastructure. Following up on the many tests that we have done with Oracle databases on vSphere, I was able to get some time on a VMC on AWS setup to see how Oracle databases perform in this new environment.

It is important to note that VMC on AWS is vSphere running on bare-metal servers in Amazon’s infrastructure. The expectation is that performance will be very similar to “regular” on-site vSphere, with the added advantage that the hardware provisioning, software installation, and configuration are already done and the environment is ready to go when you log in. The vCenter interface is the same, except that it references the Amazon instance type for the server.

Our VMC on AWS instance is made up of four ESXi hosts. Each host has two 18-core Intel Xeon E5-2686 v4 (aka Broadwell) processors and 512 GB of RAM. In total, the cluster has 144 cores and 2 TB of RAM, which gives us lots of physical resources to utilize in the cloud.

In our test, the database VMs were running Red Hat Enterprise Linux 7.2 with Oracle 12c. To drive a load against the database VMs, a single 18-vCPU driver VM running Windows Server 2012 R2 with the DVD Store 3 test workload was also set up on the cluster. A 100 GB DS3 test database was created on each of the Oracle database VMs. During testing, the number of threads driving load against the databases was increased until maximum throughput was achieved, which was at around 95% CPU utilization. The total throughput across all database servers for each test is shown below.

[Chart: total DVD Store 3 throughput (orders per minute) for each database VM count]

In this test, the DB VMs were configured with 16 vCPUs and 128 GB of RAM. In the 8-VM test case, a total of 128 vCPUs were allocated across the 144 cores of the cluster. Additionally, the cluster was also running the 18-vCPU driver VM, vCenter, vSAN, and NSX. This makes the 12-VM test case interesting: there were 192 vCPUs for the DB VMs, plus 18 vCPUs for the driver. The hyperthreads clearly help out, allowing performance to continue to scale even though more vCPUs are allocated than there are physical cores.

The performance itself represents scaling very similar to what we have seen with Oracle and other database workloads with vSphere in recent releases. The cluster was able to achieve over 370 thousand orders per minute with good scaling from 1 VM to 12 VMs. We also recently published similar tests with SQL Server on the same VMC on AWS cluster, but with a different workload and more, smaller VMs.

UPDATE (07/30/2018): The whitepaper detailing these results is now available here.

DRS Entitlement Viewer

Ever wondered how DRS distributes resources to VMs? How many resources your VMs are entitled to? How reservations, limits, and shares (RLS) affect your VMs’ resource availability? Our new fling, DRS Entitlement Viewer, is the answer.

DRS Entitlement Viewer is installed as a plugin to the vSphere Client. It is currently supported only for the HTML5-based vSphere Client. Once installed, it gives a hierarchical view of the vCenter DRS cluster inventory with the entitled CPU and memory resources for each resource pool and VM in the cluster.

Entitled resources change with the VMs’ resource demand and with the RLS settings of the VMs and resource pools. Users can therefore get the current entitlements based on the VMs’ current demand and the RLS settings in effect.

DRS Entitlement Viewer also provides three different what-if scenarios:

  1. Changing RLS settings of a VM and/or resource pool
  2. What-if all the VMs’ resource demand is at 100%
  3. Both 1 and 2 happen together

Users can pick one of the three scenarios and can get new entitlements without actually changing RLS settings on the cluster.

Finally, DRS Entitlement Viewer also provides an option to export the new RLS values from a what-if scenario as a vSphere PowerCLI command that customers can execute against their vCenter to apply the new settings.
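
The fling itself emits PowerCLI; purely to illustrate what applying RLS settings programmatically looks like, here is a hedged pyVmomi sketch (not the fling’s output) that sets a VM’s CPU reservation, limit, and shares. The vCenter address, credentials, VM name, and values are hypothetical.

```python
# Illustrative pyVmomi sketch (not the fling's exported PowerCLI): apply CPU
# reservation/limit/shares to a single VM. Host, credentials, and values are
# hypothetical; SmartConnectNoSSL is from pyVmomi releases of this era.
from pyVim.connect import SmartConnectNoSSL, Disconnect
from pyVmomi import vim

def apply_cpu_rls(si, vm_name, reservation_mhz, limit_mhz, shares):
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == vm_name)

    alloc = vim.ResourceAllocationInfo()
    alloc.reservation = reservation_mhz          # MHz guaranteed to the VM
    alloc.limit = limit_mhz                      # -1 means unlimited
    alloc.shares = vim.SharesInfo(level=vim.SharesInfo.Level.custom, shares=shares)

    spec = vim.vm.ConfigSpec(cpuAllocation=alloc)
    return vm.ReconfigVM_Task(spec=spec)         # returns a vCenter task

if __name__ == "__main__":
    si = SmartConnectNoSSL(host="vcenter.example.com",
                           user="administrator@vsphere.local", pwd="***")
    try:
        apply_cpu_rls(si, vm_name="app-vm-01",
                      reservation_mhz=1000, limit_mhz=-1, shares=2000)
    finally:
        Disconnect(si)
```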

 

vCenter performance improvements from vSphere 6.5 to 6.7: What does 2x mean?

In a recent blog, the VMware vSphere team shared the following performance improvements in vSphere 6.7 vs. 6.5:

Moreover, with vSphere 6.7 vCSA delivers phenomenal performance improvements (all metrics compared at cluster scale limits, versus vSphere 6.5):
2X faster performance in vCenter operations per second
3X reduction in memory usage
3X faster DRS-related operations (e.g. power-on virtual machine)

As senior engineers within the VMware Performance and vSphere teams, we are writing this blog to provide more details regarding these numbers and to explain how we measured them. We also briefly explain some of the technical details behind these improvements.

Cluster Scale

Let us first explain what “all metrics compared at cluster scale limits” means. What is cluster scale? Here, it is an environment that includes a vCenter server configured for the largest vSphere cluster that VMware currently supports, namely 64 hosts and 8,000 powered-on VMs. This setup represents a high-consolidation environment with 125 VMs per host. Note that this is different from the setup used in our previous blog about vCenter 6.5 performance improvements. The setup in that blog was our datacenter scale environment, which used the largest number of supported hosts (2,000) and VMs (25,000), so the numbers from that blog should not be compared to these numbers.

2x and 3x

Let us now explain some of the performance numbers quoted. We produced the numbers by measuring workload runs in our cluster scale setup.

2x vCenter Operations Per Second, vSphere 6.7 vs. 6.5, cluster scale limits. We measure operations per second using an internal benchmark called vcbench. We describe vcbench below under “Benchmark Details.” One of the outputs of this workload is management operations (for example, clone, powerOn, vMotion) performed per second.

  • In 6.5, vCenter was capable of performing approximately 8.3 vcbench operations per second (described below under “Benchmark Details”) in the cluster-scale testbed.
  • In 6.7, vCenter is now capable of performing approximately 16.7 vcbench operations per second.

3x reduction in memory usage. In addition to our vcbench workload, we also include a simplified workload that executes a standard workflow: create a VM, power it on, power it off, and delete it. The rapid powerOn and powerOff of VMs in this setup puts more load on the DRS subsystem than the typical vcbench test does.

  • In 6.5, the core vCenter process (vpxd) used on average about 10 GB to complete the workflow benchmark (described below under “Benchmark Details”).
  • In 6.7, the core vCenter process used approximately 3 GB to complete this run, while also achieving higher churn (that is, more workflow ‘create/powerOn/powerOff/delete’ cycles completed within the same time period).

3x faster DRS-related operations. In our vcbench workload, we measure not just the overall operations per second, but also the average latencies of individual operations like powerOn (which exercises the majority of the DRS software stack). We issue many concurrent operations, and we measure latency under such load.

  • In 6.5, the average latency of an individual powerOn during a vcbench run was 9.5 seconds.
  • In 6.7, the average latency of an individual powerOn during a vcbench run was 2.8 seconds.

The latencies above reflect the fact that a cluster has 8,000 VMs and many operations in flight at once. As a result, individual operations are slower than if they were simply run on a single-host, single-VM environment.

What does this mean to you as a customer?

As a result of these improvements, customers in high-consolidation environments will see reduced resource consumption due to DRS and reduced latency to generate VMotions for load balancing. Customers will also see faster initial placement of VMs.

Brief Deep Dive into Improvements

Before we describe the improvements, let us first briefly explain how DRS works, at a very high level.

When powering on a VM, vCenter must determine where to place the VM. This is called initial placement. Many subsystems, including DRS and policy management, must be consulted to determine valid hosts on which this VM can run. This phase is called constraint check. Once DRS determines the host on which a VM should be powered on, it registers the VM onto that host and issues the powerOn. This initial placement requires a snapshot of the inventory: by snapshot, we mean that DRS records the current configuration of hosts and VMs in the cluster.
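
To make the constraint-check-then-place flow concrete, here is a toy Python sketch of the idea; it is not VMware’s DRS code, and the data model and scoring are invented for illustration.

```python
# Toy illustration of the initial-placement flow described above: run constraint
# checks to find valid hosts, then pick a target. This is NOT the actual DRS
# algorithm; the data model and scoring are invented.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_cpu_mhz: int
    free_mem_mb: int
    in_maintenance: bool = False

@dataclass
class VM:
    name: str
    cpu_mhz: int
    mem_mb: int

def constraint_check(vm, hosts):
    """Return the hosts on which the VM is allowed to run (capacity + simple policy)."""
    return [h for h in hosts
            if not h.in_maintenance
            and h.free_cpu_mhz >= vm.cpu_mhz
            and h.free_mem_mb >= vm.mem_mb]

def initial_placement(vm, hosts):
    candidates = constraint_check(vm, hosts)
    if not candidates:
        raise RuntimeError("no compatible host for " + vm.name)
    # Toy scoring: prefer the host with the most free memory after placement.
    return max(candidates, key=lambda h: h.free_mem_mb - vm.mem_mb)

hosts = [Host("esx-01", 8000, 65536), Host("esx-02", 12000, 32768, in_maintenance=True)]
print(initial_placement(VM("web-01", 2000, 8192), hosts).name)   # -> esx-01
```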

In addition to balancing during initial placement, every 5 minutes, DRS re-examines the load of the cluster and performs a series of computations to generate VMotions that help balance the load across hosts. This phase is called periodic rebalancing. Periodic rebalancing requires an examination of the historical utilization statistics for each host and VM (for example, over the previous hour) in order to determine proper placement.

Finally, as VMs get moved around, the used capacity in resource pools changes. The vCenter server periodically exchanges messages called SpecSyncs with each host to push down the most recent resource pool configuration. The SpecSync operation requires traversing a host’s resource pool structure and changing it to make sure it matches vCenter’s configuration.

With this understanding in mind, let us now give some technical details behind the improvements above. As with our previous blog about vCenter performance improvement, we describe changes in terms of rocks (that is, somewhat large changes to entire subsystems) and pebbles (smaller individual changes throughout the code base).

Rocks

The three main rocks that we address in 6.7 are simplified initial placement, faster resource divvying, and faster SpecSyncs.

Simplified initial placement. As mentioned above, initial placement relies on a snapshot of the current state of the inventory. Creating this snapshot can be a heavyweight operation, requiring a number of data copies and locking of host and cluster data structures to ensure a consistent view of the data. In 6.7, we moved to a lightweight online approach that keeps the state up-to-date in a continuous manner, avoiding snapshots altogether. With this approach, we significantly reduce locking demands and significantly reduce the number of times we need to copy data. In some highly-contended clusters, this reduced the initial placement time from seconds down to milliseconds.

Faster (and more frequent) resource divvying. Divvying is the act of determining the resource allocations for each VM. Every five minutes, the state of the cluster is examined and both divvying and then rebalancing (using VMotion) are performed. To make the divvying phase faster, we performed a number of optimizations.

  • We changed the approach to examining historical usage statistics. Instead of storing metrics for every VM and every host over an hour, we aggregated the data, which allowed us to store a smaller number of metrics per host. This dramatically reduced memory usage and simplified the computation of the desired load for each host (see the sketch after this list).
  • We restructured the code to remove compatibility checks (for example, those that help determine which VMs can run on which hosts) from this divvying phase. In 6.5 and earlier, divvying a load also involved various host/VM compatibility calculations. Now, we store the compatibility matrix and update it only when compatibility changes, which is typically infrequent.
  • We have also done significant code refactoring (described below under “Pebbles”) to this code path.
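
As a rough illustration of the first bullet, the toy sketch below collapses per-VM utilization samples into a single aggregate demand figure per host; the data shapes are invented and this is not the DRS implementation.

```python
# Toy sketch of aggregating per-VM historical samples into per-host demand,
# as described in the first bullet above. Data shapes are invented.
from collections import defaultdict

# samples[(host, vm)] = list of CPU-demand samples (MHz) over the last hour
samples = {
    ("esx-01", "vm-a"): [500, 700, 650],
    ("esx-01", "vm-b"): [1200, 1100, 1300],
    ("esx-02", "vm-c"): [300, 250, 400],
}

def aggregate_host_demand(samples):
    """Return one averaged demand value per host instead of a series per VM."""
    per_host = defaultdict(list)
    for (host, _vm), series in samples.items():
        per_host[host].append(sum(series) / len(series))
    return {host: sum(avgs) for host, avgs in per_host.items()}

print(aggregate_host_demand(samples))   # -> {'esx-01': ~1816.7, 'esx-02': ~316.7}
```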

By implementing these changes and making divvying faster in 6.7, we are now able to perform divvying more frequently: once per minute instead of once every five minutes. As a result, resources flow more quickly between resource pools, and we are better able to enforce fairness guarantees in DRS clusters.

Note that periodic load balancing (through VMotion) still occurs every five minutes.

Faster SpecSyncs. To perform a SpecSync, vCenter sends a resource pool configuration to a host. The host must examine this configuration and create a list of changes required to bring that host in sync with vCenter. In 6.5 and earlier, depending on the number of VMs, creating this list of changes could result in hundreds of operations on a host, and the runtime was highly variable. In 6.7, we made this computation more deterministic, reducing the number of operations and lowering the latency appropriately.

Pebbles

In addition to the changes above, we also performed a number of optimizations throughout our code base.

Code Refactoring. In 6.5 and before, admission control decisions were made by multiple independent subsystems within vCenter (for example, DRS would be responsible for some decisions, and HA would make others). In 6.7, we simplified our code such that all admission control decisions are handled by a module within DRS. Reducing multiple copies of this code simplifies debugging and reduces resource usage.

Finer-grained locks. In 6.7, we continued to make strides in reducing the scope of our locks. We introduced finer-grained locks so that DRS would not have to lock an entire VM to examine certain pieces of state. We made similar improvements to both hosts and clusters.
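
As a generic illustration of the idea (not vCenter’s code), the sketch below guards independent pieces of a VM record with separate locks, so reading one field does not block updates to another.

```python
# Illustrative sketch of finer-grained locking: instead of one lock guarding an
# entire VM object, separate locks guard independent pieces of state. Invented
# structure, not vCenter code.
import threading

class VMRecord:
    def __init__(self, name):
        self.name = name
        self._power_lock = threading.Lock()     # guards power state only
        self._config_lock = threading.Lock()    # guards config fields only
        self._power_state = "poweredOff"
        self._num_vcpus = 2

    def power_state(self):
        with self._power_lock:                  # does not block config readers/writers
            return self._power_state

    def set_num_vcpus(self, n):
        with self._config_lock:                 # does not block power-state queries
            self._num_vcpus = n
```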

Removal of unnecessary classes, maps, and sets. In refactoring our code, we were able to remove a number of classes and thereby reduce the number of copies of data in our system. The maps and sets that were needed to support these classes could also be removed.

Preferring integers over strings. In many situations, we replaced strings and string comparisons with integers and integer comparisons. This dramatically reduces memory allocation overhead and speeds up comparisons, reducing CPU usage.
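
A tiny illustration of this technique (invented example, not vCenter code): intern each repeated string identifier to a small integer once, then compare by integer afterward.

```python
# Intern repeated string identifiers to small integers once; later comparisons
# are plain integer equality with no per-comparison string hashing or allocation.
class Interner:
    def __init__(self):
        self._ids = {}
    def intern(self, s):
        return self._ids.setdefault(s, len(self._ids))

interner = Interner()
host_id = interner.intern("host-1234")        # one-time cost
other_id = interner.intern("host-1234")       # same integer comes back
assert host_id == other_id
```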

Benchmark Details

We measure Operations Per Second (OPS) using a VMware benchmark called vcbench. (For more details about vcbench, see “Benchmarking Details” in vCenter 6.5 Performance: what does 6x mean?) Briefly, vcbench is a Java-based application that takes as input a runlist, which is a list of management operations to perform. These operations include powering on VMs, cloning VMs, reconfiguring VMs, and migrating VMs, among many others. We chose the operations based on an analysis of typical customer management scenarios. The vcbench application uses vSphere APIs to issue these operations to vCenter. The benchmark creates multiple threads and issues operations on those threads in parallel (up to 32). We measure operations per second by taking the total number of operations performed in a given timeframe (say, 1 hour) and dividing it by the time interval.
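
vcbench is an internal Java tool, so purely to make the measurement concrete, here is a hedged Python sketch of the same idea: worker threads pull operations from a runlist and issue them in parallel for a fixed interval, and OPS is the number of completed operations divided by the elapsed time. The issue_operation stub is hypothetical.

```python
# Hedged sketch of the OPS calculation described above. The issue_operation()
# stub stands in for real vSphere API calls; vcbench itself is an internal
# Java tool and is not reproduced here.
import itertools
import threading
import time

RUNLIST = ["powerOn", "clone", "reconfigure", "vMotion", "powerOff"]
NUM_THREADS = 32
DURATION_S = 60          # the real runs measure over a longer interval (e.g., 1 hour)

stop = threading.Event()
counts = [0] * NUM_THREADS

def issue_operation(op):
    # Stub: the real benchmark issues `op` through the vSphere API and waits
    # for the resulting task to complete.
    time.sleep(0.05)

def worker(idx):
    for op in itertools.cycle(RUNLIST):
        if stop.is_set():
            return
        issue_operation(op)
        counts[idx] += 1

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
start = time.time()
for t in threads:
    t.start()
time.sleep(DURATION_S)
stop.set()
for t in threads:
    t.join()

print("OPS:", sum(counts) / (time.time() - start))
```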

Our workflow benchmark is very similar to vcbench. The main difference is that more operations are issued per host at a time. In vcbench, one operation is issued per host at a time, while in workflow, up to 8 operations are issued per host at a time.

In many cases, the size of the VM has an impact on operational latency. For example, a VM with a lot of memory (say, 32 GB) and large disks (say, 100 GB) will take longer to clone because more memory will need to be snapshotted, and more disk data will need to be copied. To minimize the impact of the disk subsystem in our measurements, we use small VMs (<4GB memory, < 8GB disk).

Because we limit ourselves to 32 threads per vCenter in this single-cluster setup, throughput numbers are smaller than for our datacenter-at-scale setups (2,000 hosts; 25,000 VMs), which use up to 256 concurrent threads per vCenter.

Summary

In this blog, we have described some of the performance improvements in vCenter from 6.5 to 6.7. A variety of improvements to DRS have led to improved throughput and reduced resource usage for our vcbench workload in a cluster scale setup of 64 hosts and 8,000 powered-on VMs. Many of these changes also apply to larger datacenter-scale setups, although the scope of improvement may not be as pronounced.

Acknowledgments

The vCenter improvements described in this blog are the results of thousands of person-hours from vCenter developers, performance engineers, and others throughout VMware. We are deeply grateful to them for making this happen.

Authors

Zhelong Pan is a senior staff engineer in the Distributed Resource Management Team at VMware. He works on cluster management, including shared resource allocation, VM placement, and load balancing. He is interested in performance optimizations, including virtualization performance and management software performance. He has been at VMware since 2006.

Ravi Soundararajan is a principal engineer in the Performance Group at VMware. He works on vCenter performance and scalability, from the UI to the server to the database to the hypervisor management agents. He has been at VMware since 2003, and he has presented on the topic of vCenter Performance at VMworld from 2013-2017. His Twitter handle is @vCenterPerfGuy.

SQL Server Performance of VMware Cloud on AWS

In the past, I’ve always benchmarked performance of SQL Server VMs on vSphere with “on-premises” infrastructure.  Given the skyrocketing interest in the cloud, I was very excited to get my hands on VMware Cloud on AWS – just in time for Amazon’s AWS Summit!

A key question our customers have is: how well do applications (like SQL Server) perform in our cloud?  Well, I’m happy to report that the answer is great!

VMware Cloud on AWS Environment

First, here is a screenshot of what my vSphere-powered Software-Defined Data Center (SDDC) looks like:

[Screenshot: vSphere Client - VMware Cloud on AWS]

This screenshot shows several notable items:

  • The HTML5-based vSphere Client interface should be very familiar to vSphere administrators, making the move to the cloud extremely easy
  • This SDDC instance was auto-provisioned with 4 ESXi hosts and 2TB of memory, all of which were pre-configured with vSAN storage and NSX networking.
    • Each host is configured with two CPUs (Intel Xeon Processor E5-2686 v4); each socket contains 18 cores running at 2.3GHz, resulting in 144 physical cores in the cluster. For more information, see the VMware Cloud on AWS Technical Overview
  • Virtual machines are provisioned within the customer workload resource pool, and vSphere DRS automatically handles balancing the VMs across the compute cluster.

Benchmark Methodology

To measure SQL Server database performance, I used HammerDB, an open-source database load testing and benchmarking tool.  It implements a TPC-C-like workload and reports throughput in TPM (Transactions Per Minute).

To measure how well performance scaled in this cloud, I started with a single 8 vCPU, 32GB RAM VM for the SQL Server database.  To drive the workload, I created a 4 vCPU, 4GB RAM HammerDB driver VM.  I then cloned these VMs to measure 2 database VMs being driven simultaneously:

[Screenshot: HammerDB and SQL Server VMs in VMware Cloud on AWS]

I then doubled the number of VMs again to 4, 8, and finally 16.  As with any benchmark, these VMs were completely driven up to saturation (100% load) – “pedal to the metal”!

Results

So, how did the results look?  Well, here is a graph of each VM count and the resulting database performance:
[Graph: aggregate HammerDB throughput (TPM) for each database VM count]
As you can see, database performance scaled great; when running 16 8-vCPU VMs, VMware Cloud on AWS was able to sustain 6.7 million database TPM!

I’ll be detailing these benchmarks more in an upcoming whitepaper, but wanted to share these results right away.  If you have any questions or feedback, please leave me a comment!

UPDATE (07/25/2018): The whitepaper detailing these results is now available here.

ESX IP Storage Troubleshooting Best Practice White Paper

We have published an ESX IP Storage Troubleshooting Best Practice white paper in which we recommend that vSphere customers deploying ESX IP storage over 10G networks include 10G packet capture systems as a best practice to ensure network visibility.

The white paper explores the challenges and alternatives for packet capture in a vSphere environment with IP storage (NFS, iSCSI) datastores over a 10G network, and explains why traditional techniques for capturing packet traces on 1G networks will suffer from severe limitations (capture drops and inaccurate timestamps) when used for 10G networks. Although commercial 10G packet capture systems are commonly available, they may be beyond the budget of some vSphere customers. We present the design of a self-assembled 10G packet capture solution that can be built using commercial components relatively inexpensively. The self-assembled solution is optimized for common troubleshooting scenarios where short duration packet captures can satisfy most analysis requirements.
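
As a generic illustration of the short-duration, ring-buffer style of capture the paper targets (not the paper’s actual self-assembled solution), here is a hedged Python wrapper around tcpdump; the interface name, filter, snap length, and buffer sizes are assumptions.

```python
# Hedged illustration only: a short-duration ring-buffer capture of NFS traffic
# using tcpdump, in the spirit of the troubleshooting captures the paper
# discusses. Interface, filter, snap length, and sizes are assumptions.
import subprocess

CAPTURE_IFACE = "eth1"                 # assumed 10G capture port
NFS_SERVER = "10.0.0.5"                # assumed storage array IP

cmd = [
    "tcpdump",
    "-i", CAPTURE_IFACE,
    "-s", "256",                       # snap length: headers plus a little payload
    "-B", "262144",                    # larger kernel capture buffer (KiB) to reduce drops
    "-C", "1000",                      # rotate capture files every ~1000 MB
    "-W", "10",                        # keep a ring of 10 files (~10 GB total)
    "-w", "/captures/ipstorage.pcap",
    f"host {NFS_SERVER} and port 2049",  # capture only NFS traffic to/from the array
]
subprocess.run(cmd, check=True)
```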

Our experience troubleshooting a large number of IP storage issues has shown that the ability to capture and analyze packet traces in an ESX IP storage environment can significantly reduce the mean time to resolution for serious functional and performance issues. When reporting an IP storage problem to VMware or to a storage array vendor, an accompanying packet trace file is a great piece of evidence that can significantly reduce the time required by the responsible engineering teams to identify the problem.

Performance Comparison of Containerized Machine Learning Applications Running Natively with Nvidia vGPUs vs. in a VM – Episode 4

This article is by Hari Sivaraman, Uday Kurkure, and Lan Vu from the Performance Engineering team at VMware.

Performance Comparison of Containerized Machine Learning Applications

Docker containers [6] are rapidly becoming a popular environment in which to run different applications, including those in machine learning [1, 2, 3]. NVIDIA supports Docker containers with their own Docker engine utility, nvidia-docker [7], which is specialized to run applications that use NVIDIA GPUs.

The nvidia-docker container for machine learning includes the application and the machine learning framework (for example, TensorFlow [5]) but, importantly, it does not include the GPU driver or the CUDA toolkit.

Docker containers are hardware agnostic so, when an application uses specialized hardware like an NVIDIA GPU that needs kernel modules and user-level libraries, the container cannot include the required drivers. They live outside the container.

One workaround here is to install the driver inside the container and map its devices upon launch. This workaround is not portable since the versions inside the container need to match those in the native operating system.

The nvidia-docker engine utility provides an alternate mechanism that mounts the user-mode components at launch, but this requires you to install the driver and CUDA in the native operating system before launch. Both approaches have drawbacks, but the latter is clearly preferable.
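
As a hedged illustration of that launch flow: with the NVIDIA driver (and, in this setup, CUDA) installed on the host, nvidia-docker mounts the user-mode driver components into the container at run time. The image tag below is an assumption; any CUDA-enabled image would do.

```python
# Hedged illustration of the launch flow described above: verify the GPU is
# visible from inside a container without baking the driver into the image.
# The image tag is an assumption.
import subprocess

subprocess.run(
    ["nvidia-docker", "run", "--rm", "nvidia/cuda:8.0-runtime", "nvidia-smi"],
    check=True,
)
```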

In this episode of our series of blogs [8, 9, 10] on machine learning in vSphere using GPUs, we present a comparison of the performance of MNIST [4] running in a container on CentOS executing natively with that of MNIST running in a container inside a CentOS VM on vSphere. Based on our experiments, we demonstrate that running containers in a virtualized environment, like a CentOS VM on vSphere, suffers no performance penalty, while benefiting from the tremendous management capabilities offered by the VMware vSphere platform.

Experiment Configuration and Methodology

We used MNIST [4] to compare the performance of containers running natively with containers running inside a VM. The configuration of the VM and the vSphere server we used for the “virtualized container” is shown in Table 1. The configuration of the physical machine used to run the container natively is shown in Table 2.

vSphere  6.0.0, build 3500742
Nvidia vGPU driver 367.53
Guest OS CentOS Linux release 7.4.1708 (Core)
CUDA driver 8.0
CUDA runtime 7.5
Docker 17.09-ce-rc2

Table 1. Configuration of VM used to run the nvidia-docker container

Nvidia driver 384.98
Operating system CentOS Linux release 7.4.1708 (Core)
CUDA driver 8.0
CUDA runtime 7.5
Docker 17.09-ce-rc2

Table 2. Configuration of physical machine used to run the nvidia-docker container

The server configuration we used is shown in Table 3 below. In our experiments, we used the NVIDIA M60 GPU in vGPU mode only; we did not use the Direct I/O mode. In the scenario in which we ran the container inside the VM, we first installed the NVIDIA vGPU drivers in vSphere and inside the VM, then we installed CUDA (driver 8.0 with runtime version 7.5), followed by Docker and nvidia-docker [7]. In the case where we ran the container natively, we installed the NVIDIA driver in CentOS running natively, followed by CUDA (driver 8.0 with runtime version 7.5), Docker, and finally nvidia-docker [7]. In both scenarios we ran MNIST, and we measured the run time for training using a wall clock.
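
The exact MNIST model and scripts used in these experiments are not shown here; as a generic stand-in for that kind of wall-clock measurement, the sketch below trains a small Keras network on MNIST and reports the elapsed training time.

```python
# Minimal stand-in for the timed MNIST training run (not the exact model,
# framework version, or script used in these experiments): train a small
# network on MNIST and report wall-clock training time.
import time
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

start = time.time()
model.fit(x_train, y_train, epochs=5, batch_size=128)   # runs on the GPU if one is visible
print(f"wall-clock training time: {time.time() - start:.1f} s")
```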

 Figure 1. Testbed configuration for comparison of the performance of containers running natively vs. running in a VM

Model Dell PowerEdge R730
Processor type Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz
CPU cores 24 CPUs, each @ 2.5GHz
Processor sockets 2
Cores per socket 14
Logical processors 48
Hyperthreading Active
Memory 768GB
Storage Local SSD (1.5TB), Storage Arrays, Local Hard Disks
GPUs 2x M60 Tesla

Table 3. Server configuration

Results

The measured wall-clock run times for MNIST are shown in Table 4 for the two scenarios we tested:

  1. Running in an nvidia-docker container in CentOS running natively.
  2. Running in an nvidia-docker container inside a CentOS VM on vSphere.

From the data, we can clearly see that there is no measurable performance penalty for running a container inside a VM as compared to running it natively.

Configuration Run time for MNIST as measured by a wall clock
Nvidia-docker container in CentOS running natively 44 minutes 53 seconds
Nvidia-docker container running in a CentOS VM on vSphere 44 minutes 57 seconds

Table 4. Comparison of the run time for MNIST running in a container on native CentOS vs. in a container in virtualized CentOS

Takeaways

  • Based on the results shown in Table 4, it is clear that there is no measurable performance impact due to running a containerized application in a virtual environment as opposed to running it natively. So, from a performance perspective, there is no penalty for using a virtualized environment.
  • It is important to note that since containers do not include the GPU driver or the CUDA environment, both of these components need to be installed separately. It is in this aspect that a virtualized environment offers a superior user experience; an nvidia-docker container in CentOS running natively requires that any existing GPU and CUDA drivers be removed if the version of the drivers does not match that required by the container. Uninstalling and re-installing the correct drivers is often a challenging and time-consuming task. However, in a virtualized environment, you can, in advance, create and store in a repository a number of CentOS VMs with different vGPU and CUDA drivers. When you need to run an application in an nvidia-docker container, just clone the VM with the correct drivers, load the container, and run with no performance penalty. In such a scenario, running in a virtualized environment does not require you to uninstall and re-install the correct drivers, which saves both time and considerable frustration. This issue of uninstalling and re-installing drivers in a native environment becomes considerably more difficult if there are multiple container users on the system; in such a scenario, all the containers need to be migrated to use the new drivers, or the user who needs a new driver will have to wait until all the other users are done before a system administrator can upgrade the GPU drivers on the native CentOS.

Future Work

In this blog, we presented the performance results of running MNIST in a single container. We plan to run MNIST in multiple containers running concurrently in both a virtualized environment and on CentOS executing natively, and report the measured run times. This will provide a comparison of the performance as we scale up the number of containers.

References

  1. Google Cloud Platform: Cloud AI. https://cloud.google.com/products/machine-learning/
  2. Wikipedia: Deep Learning. https://en.wikipedia.org/wiki/Deep_learning
  3. NVIDIA GPUs – The Engine of Deep Learning. https://developer.nvidia.com/deep-learning
  4. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
  5. TensorFlow: An Open-Source Software Library for Machine Intelligence. https://www.tensorflow.org
  6. Wikipedia: Operating-System-Level Virtualization. https://en.wikipedia.org/wiki/Operating-system-level_virtualization
  7. NVIDIA Docker: GPU Server Application Deployment Made Easy. https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/
  8. Episode 1: Performance Results of Machine Learning with DirectPath I/O and GRID vGPU. https://blogs.vmware.com/performance/2016/10/machine-learning-vsphere-nvidia-gpus.html
  9. Episode 2: Machine Learning on vSphere 6 with NVIDIA GPUs. https://blogs.vmware.com/performance/2017/03/machine-learning-vsphere-6-5-nvidia-gpus-episode-2.html
  10. Episode 3: Performance Comparison of Native GPU to Virtualized GPU and Scalability of Virtualized GPUs for Machine Learning. https://blogs.vmware.com/performance/2017/10/episode-3-performance-comparison-native-gpu-virtualized-gpu-scalability-virtualized-gpus-machine-learning.html