
Virtualizing NVIDIA GPUs Eases the Path to Mainstream AI

Co-authored with Phil Hummel, Distinguished Member of Technical Staff, Dell Technologies

 

Data is in the driver’s seat

The realization that “Data is King” is driving businesses to evolve their capability to tap into their unique sets of data. The potential for high-value business outcomes made possible by AI is motivating many companies to start assessing a more accessible and extensible operating model for data analytics infrastructure in traditional data centers and hybrid clouds. Reducing complexity plays a key role in determining how efficiently a business can leverage these data sets to drive new value. The world of isolated and one-to-one mapping of resources to applications has given way to the concept of mainstream AI.

So, what in the heck is mainstream AI, you ask? It's the concept of adopting solutions that can seamlessly integrate into your existing IT infrastructure while maintaining operational consistency. With mainstream AI solutions you can provide advanced computing resources like GPUs to your end-users with agility, consistency, and curation to accelerate results. In this blog, I will highlight some of the challenges businesses face when looking to adopt AI as a core capability, and outline some of VMware's key partnerships that enable simplified adoption of AI/ML workloads.

Acknowledging the challenges and the reality of implementation

Data science professionals have been pushing the limits of IT infrastructure for decades.  Dell Technologies and NVIDIA have responded with numerous advances in computation, data storage, and high-speed networking to meet that challenge. However, it takes more than a handful of dedicated data scientists using brute force to wrestle data and infrastructure into a mainstream AI solution that solves business problems. Realizing the best ROI from AI investments requires machine learning operations (MLOps) expertise in many areas, including:

  • Data management at TB scale and beyond
  • Agile software development with large teams
  • QA and testing for outcomes expressed as approximate probabilities
  • Rapid provisioning of environments for experimentation
  • Reliable operations of production systems with high-frequency versioning
  • And more…

Any organization with more than a few data science professionals and a couple of models will need to strike a balance between preserving a culture of self-reliant experimentation and innovation, and defining and adhering to processes that make ML development more reliable and production-ready.

Organizations that are successfully developing MLOps capabilities are forming partnerships between data science and IT professionals. There is no single path to success for IT professionals implementing platforms and supporting tools for data professionals working on machine learning projects. Success requires a combination of technology and training that helps organizations break away from a dependency on bespoke platforms and adopt solutions that leverage industry best practices to maintain business continuity for next-gen workloads. Reducing data center complexity is one of the most valuable investments organizations can make to benefit their customers and improve business outcomes. To promote holistic solutions that fit a traditional IT operational model, VMware, Dell, and NVIDIA have united to address these challenges head-on and are delivering on that commitment.

The struggle is real and we are partnering to simplify

Hardware virtualization has arguably been the single most influential technology to impact data center operations in the last 30 years. Modern data centers, from hyperscale down to the smallest server room, operate with virtual machines, virtual networks, and virtual volumes (the rise of software-defined everything). And although GPUs have been used for general-purpose machine learning workloads for 10+ years, a robust form of GPU virtualization is only now coming into focus for mainstream AI. The introduction of NVIDIA vGPU allowed customers to tap into one or more GPU resources through VMware vSphere, with vGPU profiles assigned to guest operating systems. But in many ways, this one element was not enough to make adoption of GPU resources and AI/ML workloads easy.

Fast-forward to today: customers can now mainstream AI into their virtualized data centers with greater agility and confidence thanks to the introduction of NVIDIA AI Enterprise and the new NVIDIA Ampere-architecture GPUs, coupled with VMware vSphere 7 Update 2 and Dell EMC accelerated platforms. Additionally, vSphere is the only hypervisor certified to support highly available (HA) live migration of NVIDIA MIG (Multi-Instance GPU) profiles from one compute node to another using vMotion.
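To give a feel for what MIG partitioning looks like at the driver level, here is a minimal sketch using the standard nvidia-smi MIG commands on an A100. The GPU index and the 3g.20gb profile name are illustrative examples; in a vSphere deployment you would assign MIG-backed vGPU profiles to VMs through vCenter rather than running these directly in a guest.

```shell
# Enable MIG mode on GPU 0 (the GPU must be idle; a reset or reboot may be required)
nvidia-smi -i 0 -mig 1

# List the GPU instance profiles this device supports (e.g. 1g.5gb through 7g.40gb on an A100 40GB)
nvidia-smi mig -lgip

# Create two 3g.20gb GPU instances, each with its default compute instance (-C)
nvidia-smi mig -i 0 -cgi 3g.20gb,3g.20gb -C

# Verify the resulting MIG devices are visible
nvidia-smi -L
```

Each MIG instance gets its own dedicated memory, cache, and compute slices, which is what makes the per-VM isolation story workable for mainstream virtualized environments.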

The co-innovation and collaboration between VMware, NVIDIA, and Dell not only demonstrates each company's commitment to delivering robust solutions, but also enables a better end-to-end (E2E) experience and builds on the already mature virtualization practices of many IT organizations to maintain operational excellence. Caitlin Gordon, VP of Product Management for Dell's Infrastructure Solutions team, recently reinforced this commitment to integrated solutions when she said, "Dell Technologies is focused on helping customers harness the power of AI by providing solutions that make it easier to adopt and use." So, let's dig a little deeper into the platforms that form the foundation for the successful adoption and implementation of mainstream AI in your enterprise.

Building a stable foundation with Dell EMC accelerator platforms

The Dell EMC PowerEdge catalog includes more than 15 intelligent rack-mountable system options that can work together or independently to enable AI. Here we will focus on the use of Dell’s GPU-optimized R750xa system for developing mainstream AI applications.

The R750xa is Dell's mainstream GPU platform, designed to deliver the best performance across the widest range of GPU-based workloads. It's an air-cooled 2U server with an ambient operating temperature of up to 35°C, so it fits in a regular data-center environment. Each server node can be populated with up to 4 double-wide or 6 single-wide GPUs, such as the NVIDIA A100 Tensor Core GPU and the soon-to-be-released NVIDIA A30. There are also options for the newly introduced NVIDIA NVLink bridges to boost performance for deep-learning training workloads, and liquid cooling for CPUs to capture up to 20% of heat dissipation.

This latest generation of a 2U GPU-optimized server features a host of improvements over previous offerings including:

  • Support for the newest NVIDIA PCIe GPUs, with up to 7X performance improvement
  • Support for up to 4 double-wide NVIDIA GPUs per server node: A100, A40, and A30
  • NVLink support using NVIDIA NVLink bridges
  • New 32GB/s PCIe Gen 4 connections that double the bandwidth of previous-generation systems
  • 8 memory channels per CPU and up to 32 DDR4 DIMMs at 3200 MT/s
  • Full performance at ambient temperatures up to 35°C, with no thermal restrictions
  • Up to 8 SSD/NVMe internal drives, plus an optional BOSS card
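The PCIe Gen 4 and memory-bandwidth figures above can be sanity-checked with back-of-the-envelope arithmetic. This is just a sketch using the published link and channel parameters, not a measured benchmark:

```python
# PCIe Gen 4: 16 GT/s per lane, 128b/130b encoding, x16 link.
pcie_gen4_gb_s = 16e9 * (128 / 130) * 16 / 8 / 1e9  # bytes/s per direction
print(f"PCIe Gen 4 x16: {pcie_gen4_gb_s:.1f} GB/s per direction")  # ~31.5, marketed as 32

# DDR4-3200 with 8 channels per CPU: 3200 MT/s x 8 bytes per transfer per channel.
mem_gb_s_per_cpu = 3200e6 * 8 * 8 / 1e9
print(f"DDR4-3200, 8 channels: {mem_gb_s_per_cpu:.1f} GB/s per CPU")  # 204.8
```

The doubling claim follows directly: PCIe Gen 3 runs at 8 GT/s per lane, so a Gen 3 x16 link tops out at roughly half the Gen 4 figure.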


The dual-socket/2U PowerEdge R750xa delivers outstanding performance for the most demanding emerging applications, including artificial intelligence (AI), machine learning and deep learning (ML/DL) training and inferencing, and high-performance computing (HPC). The R750xa also provides the ideal density for the average user rack: 6 R750xa platforms fit in a standard 15kW, 1070mm rack, giving the user a cluster of 24 GPUs in a regular data-center environment.
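The rack-density math works out as follows. This is a quick sketch using the figures quoted above; the per-server power number is the implied budget from the 15kW rack, not a measured draw:

```python
rack_power_w = 15_000    # standard 15 kW rack, per the text
servers_per_rack = 6     # R750xa systems per 1070 mm rack
gpus_per_server = 4      # double-wide GPUs per node

total_gpus = servers_per_rack * gpus_per_server
power_budget_per_server_w = rack_power_w / servers_per_rack

print(f"{total_gpus} GPUs per rack")                          # 24
print(f"{power_budget_per_server_w:.0f} W budget per server")  # 2500 W
```

That 2,500 W per-server budget is what makes standard air cooling at up to 35°C ambient viable for a fully populated four-GPU node.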

  • Dell EMC PowerEdge servers are made with a cyber-resilient architecture that builds in security at every phase of the product lifecycle including
    • Silicon root of trust and secured component verification
    • Signed firmware and drift detection
    • BIOS recovery
  • Dell EMC OpenManage systems management portfolio helps tame the complexity of your IT environment with tools and solutions to
    • Deploy
    • Discover
    • Monitor
    • Manage
    • Update
  • Standard air cooling at an ambient temperature of up to 35°C supports up to 6 servers × 4 A100 GPUs in a 15kW rack, enabling 96+ TFLOPS per rack
  • Optional liquid cooling for CPUs to capture up to 20% of heat dissipation

For a more detailed look at how to benefit from the latest vGPU technologies and implement mainstream AI, I recommend that you read a recently published Design Guide from Dell Technologies, VMware, and NVIDIA. In this design guide, they present an engineering-tested enterprise-class AI infrastructure solution. The solution includes servers, storage and networking from Dell Technologies, virtualization software from VMware, and acceleration, networking, and solution software from NVIDIA. This design guide describes the recommended configurations, network topologies, deployment guidelines, and observed performance. You can also read an overview of the Design Guide and solution by downloading the corresponding Solution Brief.

References:

Request a demo or schedule a PoC in one of the worldwide Dell Technologies Customer Solution Centers

See the PowerEdge R750xa

Learn more about NVIDIA AI Enterprise

Multiple Machine Learning Workloads Using NVIDIA GPUs: New Features in vSphere 7 Update 2

Take advantage of OpenManage Integration for VMware vCenter