Introduction – the Business Application

Enterprises use high performance computing (HPC) applications such as Monte-Carlo simulation for the analysis of data using differing scenarios. This approach is used with portfolios of financial instruments for example, to determine the effects and sensitivities of different market conditions on instrument pricing over periods of time. The Monte-Carlo analysis using multiple scenarios gives the trading desk more inputs for projecting how well a financial position will perform under varying conditions in the market. This applies to pricing trades and calculating risk factors – a core set of applications in finance. The key financial term used for these factors in quantitative analysis work is the “Greeks”, where the critical quantities being calculated are named “Delta”, “Gamma”, etc.  A separate example of Monte-Carlo simulation is seen in insurance companies where actuaries use this form of analysis to calculate risk. A financial adviser may use Monte-Carlo simulation to determine the long-term market effects on a customer’s portfolio of investments, as another example.

DiagramDescription automatically generated with medium confidence

Figure 1: An Example HPC workload with one or more GPUs executing parts of it

Applications in this HPC space are compute intensive and often require acceleration technology such as GPUs to handle the mathematical algorithms that are part of their calculations. As you can see above, the application we worked on here interleaves CPU-intensive work (to prepare and finalize the data) with its GPU-intensive processing, causing performance analysis of it to deal with a mixture of the two.  Parts of the example application in this work were custom-built to use the CUDA APIs to access the GPU. This custom code includes partial differential equations and linear algebra executing at high scale – both of which map well into GPU functionality. The main goal of the business users and architects of this application is to maximize throughput while making the best use of the compute acceleration resources available in their HPC servers.

The application works under a business Service Level Agreement (SLA) for the end users – hence the increased need for acceleration to get the work done in time. These application examples are executed as large batch jobs, such as calculating the end-of-day risk for each trading desk – that must adhere to a strict SLA.

In this article, we describe the infrastructure considerations on VMware vSphere, particularly relating to the use of multiple A100 GPUs on a server. The GPUs were used in an HPC end-user testing scenario using the application as described below.

High Performance Computing – a Brief Overview

A picture containing iconDescription automatically generated

Figure 2: The Throughput Workload Type in High Performance Computing

To understand HPC applications better, we break them down into two major categories, as detailed in the VMware HPC Reference Architecture document  For more technical detail on deploying HPC applications on vSphere, please consult that document.

Figure 2 above shows the “Throughput” form of HPC applications, where each task represented in the blue boxes is executed independently of every other task and does not need to communicate with its neighbors. This type of workload is often referred to as exhibiting “embarrassingly parallel” behavior and it is the type of HPC application we will be focused on in this article. We see that multiple tasks or processes from the left hand side can operate independently in parallel on the compute cores/streaming multiprocessors from a GPU as represented on the right hand side in green.

In contrast to the throughput workload type above, a second type of HPC application is generally categorized as a “Message Passing Interface” (MPI) application. In MPI, the individual tasks/jobs/processes that are executed in parallel need to communicate their results to each other at intervals during their execution, either within one server or across servers over the network. This is a more complex form of HPC application. One example of this is the distributed form of machine learning model training, where multiple servers cooperate and the distributed processes exchange gradient data and other information about the model that is being trained. The Horovod framework for distributed training is one example of this style – it uses many of the MPI designs.

A picture containing chartDescription automatically generated

Figure 3: The Message-passing Interface (MPI) workload type in HPC

This MPI type of workload often requires high-speed networking between the nodes. We will not go into this workload type further in this article.

A Virtualized HPC Throughput Application in Testing

We worked in partnership with a VMware customer on the testing of their HPC analytics application in VMs on vSphere 7.0 U3 over several months. The application is a throughput-type one that uses a mix of CPU-intensive and GPU-intensive parts in its processes.  It executes a Monte-Carlo simulation that runs for several hours before completion. Before this project began, the application used the older Volta V100 class of GPUs on bare metal servers.

The goal of the current work is to improve the overall performance, largely measured by workload throughput, so that more business processing work could be done in less time. An upgrade was planned for the hardware infrastructure underlying the application – particularly moving from the V100 class of GPUs to the more powerful A100 class of GPU. A decision was made to purchase new server hardware with multiple NVIDIA A100 GPUs in each server. In the test setup, there were four A100 GPUs in each server, where the servers chosen were Dell PowerEdge 750 XA. These servers have 2 TB of RAM and 2 CPU sockets with 32 cores per socket.

During the lifetime of the analytics application, many custom-built processes that use the NVIDIA Compute Unified Device Architecture (CUDA) API are started and then they are stopped after they have completed their work. CUDA is a standard set of APIs, originally designed and implemented by NVIDIA, to allow developers to use the power of GPUs. The number of concurrent CUDA processes running in parallel on each GPU can be varied by the end-user. This number is set at application startup time. Each GPU was expected to handle between ten and twenty processes concurrently and these sets of processes are started and stopped frequently during the lifetime of the application. The original V100-based implementation handled 10 concurrent processes per GPU, so it was expected that the A100 would handle more than that – potentially double that number.

The CUDA processes are scheduled onto a GPU by of a layer of software from NVIDIA called MPS or “multi-process service”. MPS schedules those processes onto the set of available GPUs in parallel, providing performance benefits through optimization of the context switching between them. Because of this process parallelism, where each CUDA process could be active in the GPU at the same time, the application can be a very high consumer of the GPU overall. At certain points in the application lifetime, the GPU is at 95% utilization.

The CUDA processes involved here did not require to communicate with each other, either within one server or across servers. For that reason, there was no need for high-speed networking in the infrastructure. Over the lifetime of the application, a significant portion of time is spent in the CPU, doing data preparation work. For certain data input types, only 5-10% of the total time was spent in the GPU, but that section of the code clearly required the benefits of acceleration.

Technical Infrastructure Software Supporting the Application

Figure 4: The NVIDIA and VMware AI Enterprise software suite

Components of the NVIDIA AI Enterprise suite of software, shown above, were used in the test setup that was deployed on vSphere 7.0.3 as part of the VMware Cloud Foundation portfolio. The NVIDIA AI Enterprise suite provides the host vGPU driver (the NVIDIA vGPU Manager) and the guest OS NVIDIA vGPU driver that together enable each VM to access one or more virtualized GPUs, as well as other tooling for the data scientist/analyst. The NVAIE AI Enterprise suite of software is supported on vSphere 7.0.3 and later versions of vSphere by NVIDIA and is certified on VMware vSphere and on VMware Cloud Foundation.

The NVIDIA host vGPU driver from the AI Enterprise suite was installed using the “esxcli software vib install -v <filename>” method. The matching guest operating system NVIDIA vGPU driver was installed in the test VMs using the standard NVIDIA driver.run file that is supplied as part of the suite. RedHat Linux was used as the guest OS in the VM. The guest OS vGPU driver was not run in a container in this testing, although that containerized form of driver installation is an allowed deployment method. An NVIDIA Licensing Server was run separately in a management VM to supply the correct NVIDIA licenses to the guest operating system vGPU driver.

Although the NVIDIA vGPU driver software allows a physical GPU to be shared by two or more VMs on a server, this was not the goal of the initial testing. Sharing a physical GPU across multiple VMs using NVIDIA Multi-Instance GPU (MIG) was pursued in a separate, later phase of testing that is outside the scope of our discussion here.

For this application testing, two VM-based scenarios were tested using the quantitative analytics application with varying numbers of parallel CUDA processes, under MPS: –

(a) A single large VM occupying most of the resources of the physical host server with all four physical GPUs each presented as a vGPU profile to that VM;

(b) A set of four smaller VMs with 1 physical GPU presented as a vGPU to each VM

The VMs used VMware vSAN as the basis for their storage and they were attached to a 10 GigE network using the standard VMXnet3 drivers. The input data for tests was loaded onto the file system for each VM based on the vSAN storage.

Initial Test Findings

The initial test was intended to understand the aplication performance with a single, large VM with 128 vCPUs, with most of the physical RAM on the Dell R750XA server assigned to it, and all 4 GPUs assigned to this one VM. A 5% portion of the physical server’s RAM was assigned for VMware vSphere’s use and therefore this was not used by the resident VM.

The four full GPUs were presented to the large VM using a vGPU profile for each one that fully occupied the framebuffer memory in time-sliced mode. The vGPU profile used is named “grid-a100d_80c”, signifying the full 80 GB of memory on the GPU and “c” at the end of the name signifying “compute-intensive”.

Since this testing was done on vSphere 7.0 U3, the maximum allowed limit was four for the set of multiple vGPU profiles per VM. That limit changes up to 8 GPUs per VM with vSphere 8. The single VM effectively occupied the full server and all 4 GPUs on that server, through having its four vGPU profiles. No other VM was run on the server during testing.

Testing was done on vSphere 7.0.3f so as to be compliant with VMware Cloud Foundation 4.4, which was the standard version in use at the company at this time. The single large VM occupied 95% of the main memory on the server.

Because the Dell R750XA hardware has two sockets with 32 cores per socket and hyper-threading was enabled, the number of vCPUs was set to 128 to match the hyperthread count. The cores-per-socket value for the VM settings was configured to 64, calculated by dividing the number of hyperthreads by the CPU socket count.

Since there are two NUMA nodes on this server, and the large 128 vCPU VM would span those nodes, we set the vSphere Advanced Configuration settings on the VM as follows: –

sched.node0.affinity = 0

sched.node1.affinity = 1

Using the following command at the vSphere ESXi console level

esxcli hardware pci list

we were able to see the hardware-level NUMA node identity for each GPU. On this particular server, two GPUs were associated with NUMA node 0 and two GPUs were associated with NUMA node 1 at the hardware level. Here is a snippet of the output of the above command (some data has been deleted for brevity)

0000:17:00.0

Address: 0000:17:00.0

Segment: 0x0000

Bus: 0x17

Slot: 0x00

Function: 0x0

VMkernel Name: vmgfx3

Vendor Name: NVIDIA Corporation

Device Name: NVIDIA A100-PCIE-80GB

NUMA Node: 0

When the guest OS NVIDIA vGPU driver was installed into the VM, we were able to see that the pair of GPUs in each NUMA node were assigned correctly to their NUMA home node as seen by the VM’s operating system. This is seen in the file located in the guest operating system at

/var/run/nvidia-topologyd/VirtualTopology.xml

No NVLink or NVswitch connections were used in this test, as there was no need for inter-GPU or inter-process communication between the CUDA processes.

The results from this single, large VM test showed that significant throughput overhead was incurred by the application, so it was decided to reduce the size of this monolithic VM and scale out the workload across a set of smaller-sized VMs. Trough profiling, it became clear that the application code makes intensive use of the CUDA API calls named cudaMalloc(), cudaMemcopy() and cudaDeviceSynchronize() respectively. The performance effects of using these calls very heavily, along with certain CUDA process-level scheduling concerns were mitigated by deploying more, smaller-sized VMs on the same server.

A Scale-Out Test with Four VMs

In the second test setup, four smaller VMs each with 16 vCPUs were deployed on the same host server. This scale-out test, including splitting the workload equally across the VMs, was possible because this application exhibits the HPC throughput design discussed earlier. What that means is that each individual within the CUDA processes that constitute the application can run on its own chunk of input data independently of every other – and the processes do not need to communicate with each other to synchronize on the data. This provides an ideal situation for breaking up a monolithic single application running in a large VM into separate, smaller VMs that contain subsets of the application processes. However, it may not suit every HPC application.

Four VMs were used on the test server host and those VMs were sized carefully to fit within the physical and NUMA boundaries of the server hardware. The amount of RAM assigned to each VM was adjusted to one quarter of the size given to the single, monolithic VM. The number of VMs could be expanded to eight even smaller VMs if needed, over time, but this initial phase of testing exercise did not pursue that option.

Each of the four VMs on the test server was assigned a virtual GPU (vGPU) of its own in the vSphere Client UI with a vGPU profile named “grid_a100d-80c”. This was done using the vSphere Client’s “Add PCIe Device” functionality. The effect of the system administrator assigning that vGPU profile to a VM is that the full memory and compute power of one A100 GPU device is entirely available to that one VM. Each VM therefore had a captive GPU of its own.

There was no sharing of the GPUs across multiple VMs here. Sharing physical GPUs across VMs on a single host server is a feature that is supported using different vGPU profiles. In the testing, there were four VMs on the same physical server that earlier held the single large VM. Each VM was equipped with a full GPU through its vGPU profile. No NVLink or NVswitch connections were used in this test, as was also the case with the single large monolithic VM testing.

Chart, line chartDescription automatically generated

Figure 5: Performance comparison of 1 large VM versus 4 smaller VMs

As the application was executed across the four VM setup, the system outperformed the bare metal deployment of the application by double-digit percentages as measured in throughput terms. These numbers differed across the different datasets provided as input, but they were very encouraging for the virtualization platform. Some reasons for that improvement in the application’s throughput are outlined below. At the time of this writing, further tests were being executed using vGPUs and Multi-instance GPU support on even more VMs, to look for further potential gains.

Application Performance Observations

When analyzing multiple test results from the various types of setup above, a number of reasons were found for the performance differences between the single VM case and the four smaller VMs running in separate tests on the same server.

  1. CUDA process scheduling and scaling across the GPUs was found to be optimal at 8 or fewer processes per GPU. At that point, a plateau was evident in the performance graphs. This is an area that is under investigation.
  2. Intensive use of the cudaMalloc (memory allocation) and cudaMemcpy (memory copy) API calls for very small quantities of memory caused the memory subsystem to be under considerable pressure. By dividing the workload into four parts, that pressure was mitigated. The application logic may be re-worked over time in this area.
  3. In general, a collection of smaller VMs give better performance and mobility than a single large one does for this type of application and for other virtualized applications. However, not all application datasets can be split out into four separate and independent units to be run in separate VMs in this way.

In the longer term, the application will be moved onto VMware vSphere 8 in order to exploit the latest vGPU driver software from NVIDIA and the ongoing improvements made to vGPU support on vSphere.

Conclusions

vSphere and VMware Cloud Foundation provide a very capable and high-performing platform for deploying compute-intensive, high-performance applications such as Monte-Carlo Simulation applications, as used in the financial industry and in many others. vSphere/VCF reduces your costs by giving you more efficient management and utilization of your infrastructure – and allows you to gain performance at the same time, as seen in this tested HPC application.