Accelerating Hybrid Cloud AI Workloads with VMware Cloud Foundation (Part 2)

[This is the second installment of a two-part blog series that explores the deployment of full-stack HCI infrastructure with acceleration technologies, validated through a set of reference architectures built using VMware Cloud Foundation, VMware Tanzu, Intel® Optane™ persistent memory, and 2nd Generation Intel® Xeon® Scalable processors with Intel Deep Learning Boost technologies. Thanks to the Intel Data Platforms Group and VMware OCTO teams for collaborating on this program.]

In part 1 of this blog series, we explored the market dynamics and drivers for advanced analytics, machine learning, and AI solutions. In this post, we’ll deconstruct the technical building blocks of the reference architecture and provide a snapshot of the performance gains it achieved. Machine learning requires large datasets that are often too large to migrate efficiently to the cloud, and machine-learning models must be continuously retrained and updated to achieve the best outcomes. In addition, data for machine learning can be sensitive, highly regulated, or may contain intellectual property, which raises security concerns and creates the need for intrinsic security on ingress and egress data flows. Intel and VMware have teamed up to build machine-learning solutions that can be deployed today, maximizing your existing VMware and Intel technology investments. VMware Cloud Foundation (VCF), running on Intel hardware, offers an agile platform that provides rapid deployment and configuration of private and hybrid cloud solutions for managing VMs and orchestrating containers.

Agile, high-performance infrastructure that deploys a combination of VMs and containers serves a wide range of use cases, including but not limited to:

Machine-learning training. Image classification is one of the most popular use cases for deep learning. Training such models can be time consuming and, without the right tools, requires specialized skills. VCF works with various machine‐learning frameworks, including DataRobot, which is a popular automated machine‐learning platform that takes advantage of optimizations for Intel® architecture.
Machine-learning inference. Once a model is trained, it can be run on new data sets to uncover hidden insights. Inference is compute-intensive, and can benefit from innovations from Intel such as Intel DL Boost with Vector Neural Network Instructions (VNNI)—available with vSphere 7 and ESXi 7, which are foundational components of VCF.
Data warehousing and analytics. Data warehouses are considered one of the core components of business intelligence, providing a central location to store data from one or more disparate sources as well as current and historical data. VCF supports data warehousing, including industry‐proven solutions based on Microsoft SQL Server 2019 or Oracle Database 19c.

One such analytic model was used to predict which US counties are likely to have confirmed cases of COVID-19, helping federal, state, and local officials allocate budget resources and take pre-emptive measures in response to the threat. The output of this model is also useful to healthcare providers preparing staff and facilities for a predicted growth in infections. The full model and analysis are available on the DataRobot community site.

Figure 1: COVID-19 Predictions by County and Population (courtesy of DataRobot)

In this post, we’ll explore the technical building blocks of the software and hardware components as well as the performance benchmark results that demonstrate the benefits of deploying a scalable, software-defined platform that is highly flexible but delivers measurable performance improvements for these critical analytic services.

Software Overview

The on-premises private cloud was built with VMware Cloud Foundation, which includes the following elements as part of a fully integrated Hyperconverged Infrastructure (HCI) platform: VMware vSphere, VMware Tanzu Kubernetes Grid (TKG) Service for vSphere, VMware vSAN, VMware vRealize Suite, VMware NSX-T Data Center, and VMware SDDC Manager to provide automation for the full stack HCI solution.

VMware Cloud on AWS is used as the destination for the hybrid cloud architecture. VMware Hybrid Cloud Extension (HCX) enables VM migration, workload rebalancing, and protection between on-premises and cloud. In addition to business continuity, it provides network extension for multi-tier applications without changing the VM properties.

Figure 2: Reference Architecture Building Blocks (courtesy of Intel)

Hardware Overview

The hardware stack for the solution is built on Intel® Server Board platforms, which include the latest generation of Intel® Xeon® Gold processors and Intel® Xeon® Platinum processors. These processors support Intel Deep Learning Boost (Intel DL Boost), which uses Vector Neural Network Instructions (VNNI) to boost AI inferencing performance. For high-performance, all-flash software-defined storage, the reference architecture includes Intel Optane SSD DC P4800X and NVMe-based Intel SSD DC P4510 combined with Intel Optane persistent memory (PMem). Intel Optane PMem introduces innovative memory technology that delivers large-capacity system memory and persistence. For an accelerated software-defined network, the platforms use 25 Gb/s Intel Ethernet Converged Network Adapters.

VMware Cloud Infrastructure Software

VMware Cloud Foundation

VMware Cloud Foundation provides a simplified path to hybrid/multi-cloud through an integrated software platform for both private and public cloud environments. It offers a complete set of software-defined services for compute, storage, network, and security, along with application-focused cloud management capabilities. The result is a simple, security-enabled, and agile cloud infrastructure on-premises and in as-a-service public cloud environments. The solution is built from a number of key components:

VMware SDDC Manager

Software-Defined Data Center (SDDC) Manager manages the bring-up of the VMware Cloud Foundation system, creates and manages Workload Domains, and performs lifecycle management to keep the software components up to date. SDDC Manager also monitors the logical and physical resources of VMware Cloud Foundation.

VMware vSphere with Tanzu Workload Management

VMware vSphere extends virtualization to storage and network services and adds automated, policy-based provisioning and management. vSphere is the starting point for building an SDDC platform, and VMware vSphere with Tanzu enables streamlined development, agile operations, and accelerated innovation for all enterprise applications. vSphere consists of two core components: ESXi and vCenter Server. ESXi is the virtualization platform used to create and run VMs and appliances, while vCenter Server manages multiple ESXi hosts as clusters, using shared pool resources.

VMware vSphere with Tanzu workload management enables the deployment and operation of compute, networking, and storage infrastructure for vSphere with Tanzu. It makes it possible to use vSphere as a platform for running Kubernetes workloads natively on the hypervisor layer. Kubernetes workloads can run directly on ESXi hosts, and upstream Kubernetes clusters can be created within dedicated resource pools by using the TKG Service. See the Running Tanzu Kubernetes Clusters in vSphere with Tanzu documentation for details.
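As a rough illustration of what requesting a cluster from the TKG Service can look like, the sketch below builds a TanzuKubernetesCluster manifest as a Python dictionary and prints it as JSON. It assumes the v1alpha1 schema that shipped with early vSphere with Tanzu releases; later releases changed the API, so check the documentation for your version. The cluster name, namespace, VM class, and storage class shown here are hypothetical placeholders, not values from the reference architecture.

```python
import json


def tanzu_cluster_manifest(name, namespace, k8s_version, vm_class,
                           storage_class, control_plane_count=3,
                           worker_count=3):
    """Build a TanzuKubernetesCluster manifest (assumed v1alpha1 schema)."""
    # The same VM class and storage class are reused for both node roles
    # to keep the sketch short; real clusters often size them differently.
    node_settings = {"class": vm_class, "storageClass": storage_class}
    return {
        "apiVersion": "run.tanzu.vmware.com/v1alpha1",
        "kind": "TanzuKubernetesCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "distribution": {"version": k8s_version},
            "topology": {
                "controlPlane": {"count": control_plane_count, **node_settings},
                "workers": {"count": worker_count, **node_settings},
            },
        },
    }


# All values below are hypothetical and site-specific.
manifest = tanzu_cluster_manifest(
    name="ml-inference",
    namespace="analytics",
    k8s_version="v1.18",
    vm_class="best-effort-large",
    storage_class="vsan-default-storage-policy",
)
print(json.dumps(manifest, indent=2))
```

The JSON output could then be saved to a file and applied with kubectl from a session logged in to the Supervisor Cluster, since kubectl accepts JSON as well as YAML manifests.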

Cloud Builder VM

This is the VM appliance used for automated deployment of the entire stack.

VMware NSX-T Data Center

NSX-T Data Center (formerly NSX-T) is the network virtualization platform that enables a virtual cloud network with a software-defined approach. Working like a network hypervisor, it reproduces a complete set of Layer 2 through Layer 7 networking services: routing, switching, access control, firewalls, QoS, and DHCP in software. All these components can be used in any combination to create isolated virtual networks on demand. The services can then be extended to a variety of endpoints within and across clouds. Starting with VMware Cloud Foundation 4.0, both management and VI Workload Domain types support the NSX-T Data Center platform.

VMware vRealize Suite

VMware vRealize Suite is a multi-cloud cloud management solution that provides IT organizations with a modern platform for infrastructure automation, consistent operations, and governance based on DevOps and machine learning principles.

VMware Tanzu Kubernetes Grid (TKG) Service

TKG is available in several offerings and is used to provision and manage the lifecycle of Tanzu Kubernetes clusters, which are proprietary installations of Kubernetes open-source software, built and supported by VMware. To learn more about TKG offerings, visit the VMware Tanzu webpage and see the Running Tanzu Kubernetes Clusters in vSphere with Tanzu documentation.

VMware Cloud on AWS

VMware Cloud on AWS is a hybrid cloud solution that is complementary to VCF, allowing simplified extension, migration, and modernization of applications, and protection of applications in the public cloud. The VMware Cloud on AWS infrastructure is delivered by the same vSphere-based SDDC stack that is used on-premises. The solution takes advantage of existing tools, processes, and familiar VMware technologies, along with native integration with AWS. This makes it easy to adopt, greatly reduces service disruption associated with migrating critical services to the cloud, and eliminates the need for rearchitecting the environment to suit a public cloud infrastructure.

The enterprise-grade infrastructure is delivered as a service, with SDDC provisioning time under two hours, and comes with pre-configured vSAN storage, networking, compute, and security. VMware Cloud on AWS can also auto-scale nodes as needed, depending on CPU, memory, and storage requirements.

VMware Cloud Bare-Metal Types

The latest addition to VMware Cloud on AWS bare-metal infrastructure is a new node type named “i3en.metal.” i3en.metal bare-metal instances aim to address a variety of workloads, including data- or storage-intensive workloads requiring high random I/O access. Such workloads include relational databases and data warehousing. i3en.metal instances are also ideal for workloads that require end-to-end security.

Based on the 2nd Generation Intel Xeon Scalable processors, i3en.metal instances provide 96 logical cores with hyper-threading enabled, 768 GB of memory, and 46 TB raw storage capacity per host, with an additional 6.5 TB cache capacity, delivered with low-latency NVMe-based SSDs. i3en.metal instances extend the security capabilities of VMware Cloud on AWS by providing in-transit hardware-level encryption between instances within the SDDC boundaries. This encryption seamlessly uses the AWS Key Management Service (KMS) to enable security for data both at rest and in transit when using i3en.metal instances.
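To make the per-host figures above easier to reason about when sizing a cluster, here is a minimal sketch that multiplies them by a host count. The four-host cluster size is a hypothetical example, and the totals are raw numbers only: usable vSAN capacity will be lower once storage policies and overheads are applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostSpec:
    """Per-host resources for one bare-metal node type."""
    logical_cores: int
    memory_gb: int
    raw_capacity_tb: float
    cache_tb: float


# Per-host figures for i3en.metal as quoted in this post.
I3EN_METAL = HostSpec(logical_cores=96, memory_gb=768,
                      raw_capacity_tb=46.0, cache_tb=6.5)


def cluster_totals(host, host_count):
    """Return raw aggregate resources for host_count identical hosts."""
    if host_count < 1:
        raise ValueError("host_count must be at least 1")
    return {
        "logical_cores": host.logical_cores * host_count,
        "memory_gb": host.memory_gb * host_count,
        "raw_capacity_tb": host.raw_capacity_tb * host_count,
        "cache_tb": host.cache_tb * host_count,
    }


# Hypothetical four-host cluster.
totals = cluster_totals(I3EN_METAL, 4)
for resource, amount in totals.items():
    print(f"{resource}: {amount}")
```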

Tanzu Kubernetes Grid on VMware Cloud

One of the TKG offerings, TKG Plus, is fully supported by VMware when deployed to an SDDC on VMware Cloud on AWS. TKG Plus includes the core binaries to install TKG clusters on VMware Cloud on AWS, as well as customer reliability engineering support and services to assist customers in successfully planning, deploying, and maintaining their Kubernetes environment. With TKG Plus running on VMware Cloud on AWS, customers can deploy a production-ready infrastructure that delivers single or multiple Kubernetes workload clusters. Refer to the TKG Plus on VMware Cloud on AWS solution brief for more information.

Edge Extensions – Optional Components

VMware NSX Advanced Load Balancer

The VMware NSX Advanced Load Balancer (Avi Networks) provides multi-cloud load balancing, web application firewall, and container ingress services across on-premises data centers and any cloud. Moving from appliance-based load balancers to the software-defined NSX Advanced Load Balancer can enable organizations to modernize load-balancing services with efficient use of standard computing infrastructure and reduce over-provisioning. Because the NSX Advanced Load Balancer can elastically scale load-balancing capacity up or down based on demand, applications can better utilize available compute power from Intel Xeon Scalable processors. For enterprises moving to software-defined data centers, the combination of the NSX Advanced Load Balancer deployed on servers with Intel Xeon Scalable processors represents a high-performance solution to load balance large volumes of encrypted traffic.

VMware SD-WAN by VeloCloud

VeloCloud cloud-delivered software-defined WAN (SD-WAN) enables enterprises to more securely support application growth, network agility, and simplified branch and end-point implementations while delivering high-performance, reliable access to cloud services, private data centers, and software-as-a-service (SaaS)-based enterprise applications. With VeloCloud cloud-delivered SD-WAN, service providers can increase service innovation by delivering elastic transport, performance for cloud applications, and a software-defined edge that can orchestrate multiple services to meet customer needs.

VMware HCX

VMware HCX is an application mobility platform that is designed for simplifying application migration, workload rebalancing, and business continuity across data centers and clouds. It enables customers to migrate workloads between public clouds and data centers without any modification to applications or VM configurations. It provides full compatibility with the VMware software stack and helps make the migration simple, highly secure, and scalable.

Intel Hardware

Intel Optane Persistent Memory

Intel Optane PMem represents a new class of memory and storage technology. It is designed to improve the overall performance of the server by providing large amounts of persistent storage with low-latency access. Intel Optane PMem modules are DDR4-socket compatible and are offered in sizes not available with typical DDR4 DRAM products: 128, 256, and 512 GB per module.

Figure 3: Intel Optane Persistent Memory Operating Modes (Image Courtesy of Intel)

2nd Generation Intel Xeon Scalable Processors

Modern enterprises process ever-increasing volumes of data and require compute power that can meet the data-centric demands of analytics, AI, and in-memory database workloads. 2nd Generation Intel Xeon Scalable processors are workload-optimized for exactly these types of applications, with up to 56 cores per CPU and 12 DDR4 memory channels per socket. What’s more, these processors support Intel Optane PMem, which enables affordable system memory expansion.

Intel SSD Data Center Family: Intel Optane SSDs and Intel 3D NAND SSDs

To obtain the best performance from VMware vSAN, it is recommended that high-performance Intel Optane SSDs be used for the cache layer, while the capacity layer can use large-capacity NVMe-based 3D NAND SSDs. The low latency and high write endurance of Intel Optane SSDs make them well suited to write-heavy cache functions. Faster caching means enterprises can affordably and efficiently process bigger datasets to uncover important business insights.

Intel VMD Technology for NVMe Drivers

Intel Volume Management Device (Intel VMD) enables serviceability of NVMe-based SSDs by supporting hot-swap replacement from the PCIe bus without shutting down the system. It also provides error management and LED management.

Figure 4: Intel Volume Management Device Handles Storage Device Physical Management (Courtesy of Intel)

Data Plane Development Kit

Developed by Intel, the Data Plane Development Kit (DPDK) is a set of Intel architecture-optimized libraries and drivers that accelerate packet processing and make it possible to create packet forwarders without the need for costly custom switches and routers. It gives application developers the ability to address data plane processing needs, all in software and on general-purpose Intel processors. The DPDK can:

Receive and send packets within a minimum number of CPU cycles.
Develop fast packet capture algorithms.
Run third-party fast path stacks.
Provide software pre-fetching, which increases performance by bringing data from memory into cache before it is needed.

DPDK enables NSX-T Edges to increase packet performance for north-south (off-ramp) traffic flows, while the DPDK-enabled Enhanced Datapath mode supports high-performance packet processing for east-west traffic in NSX-T. To learn more about DPDK, visit the Intel Developer Zone and VMware’s Edge Node documentation.

Hybrid Cloud Analytics Platform Performance Metrics

The following sections summarize the performance benchmarks that were performed to validate the performance and scale, while highlighting the benefits achieved through this reference architecture.

Deep-Learning Inference

VMware Cloud Foundation recently introduced Intel DL Boost with VNNI to VMs. Tests were performed to demonstrate the improvement in inference performance with an Intel architecture-optimized container stack that uses the new VNNI instruction set. Because image classification is one of the most popular use cases for deep learning, the tests benchmarked the ResNet50 v1.5 topology with int8 and fp32 precision, using the TensorFlow distribution from the Intel architecture-optimized container stack with Intel’s Model Zoo pre-trained models. The VMs on which the benchmark ran used the entire physical node available through VMware software. The VMs used 80 vCPUs for the Base configuration and 96 vCPUs for the Plus configuration.
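Before running an inference benchmark in a VM, it can help to confirm that the guest actually sees the VNNI instructions. The sketch below is one way to check on a Linux guest, assuming the kernel reports the feature as the `avx512_vnni` flag in /proc/cpuinfo; it is not part of the reference architecture's test procedure, and other guest operating systems would need a different check.

```python
def cpu_flags(cpuinfo_text):
    """Return the CPU feature flags from the first 'flags' line found."""
    for line in cpuinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "flags":
            return set(value.split())
    return set()


def has_vnni(cpuinfo_text):
    """True if the AVX-512 VNNI flag appears in the given cpuinfo text."""
    return "avx512_vnni" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as cpuinfo:
            text = cpuinfo.read()
    except OSError:
        # Not a Linux guest, or /proc is unavailable.
        text = ""
    print("AVX-512 VNNI exposed to this guest:", has_vnni(text))
```

If the flag is missing inside the VM but present on the host, the VM's virtual hardware version or CPU compatibility settings are a reasonable first place to look.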

Three main tests were performed:

Compare throughput from the default TensorFlow container against a container using the Intel Optimization for TensorFlow. As Figure 5 shows, framework optimizations from Intel Optimization for TensorFlow can provide a 2.33X improvement for the Base configuration and a 2.61X improvement for the Plus configuration.

Figure 5: Intel Architecture Optimized ResNet Benchmarks for TensorFlow (image courtesy of Intel)
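The improvement factors reported in these tests are ratios of throughput between two configurations. The following sketch shows one generic way to measure throughput for a batch-processing callable and derive such a ratio. It is not the benchmark harness used for the reference architecture, and the two workloads are stand-ins chosen only so the example runs anywhere.

```python
import time


def measure_throughput(run_batch, batch_size, iterations, warmup=2):
    """Return items processed per second for a callable that runs one batch."""
    # Warm-up runs are excluded so one-time setup costs do not skew timing.
    for _ in range(warmup):
        run_batch()
    start = time.perf_counter()
    for _ in range(iterations):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * iterations / elapsed


def speedup(baseline_throughput, optimized_throughput):
    """Return how many times faster the optimized configuration is."""
    if baseline_throughput <= 0:
        raise ValueError("baseline throughput must be positive")
    return optimized_throughput / baseline_throughput


# Stand-in workloads: the "optimized" one simply does less work per batch.
baseline = measure_throughput(lambda: sum(range(200_000)), 32, 20)
optimized = measure_throughput(lambda: sum(range(50_000)), 32, 20)
print(f"speedup: {speedup(baseline, optimized):.2f}X")
```

In a real comparison, `run_batch` would invoke inference on one batch of images in each container, with the same batch size and iteration count on both sides.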

Compare the results of running VMware Cloud Foundation 4.0.1 (which takes advantage of Intel DL Boost and VNNI) against the reference architecture for VMware Cloud Foundation 3.9 (which does not use Intel DL Boost or VNNI). As shown in Figure 6 below, the VCF 4.0.1 based system (with DL Boost/VNNI) provided a 1.53X improvement over the VCF 3.9 based system (without DL Boost/VNNI) for the Base configuration and a 1.64X improvement for the Plus configuration.

Figure 6: Improvement with VCF 4.0.1 using DL Boost and VNNI (image courtesy of Intel)

Compare the performance improvement of Intel DL Boost with VNNI using int8 precision against fp32 precision. As shown in Figure 7, int8 precision enabled a 4.1X improvement for the Base configuration and a 4.38X improvement for the Plus configuration. For a small decrease in precision, performance roughly quadrupled.

Figure 7: Performance Gains Using Different Precision Metrics (image courtesy of Intel)
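To give a feel for what a small decrease in precision means, the sketch below applies a simple symmetric int8 quantization to a handful of values and measures the round-trip error. This is a generic textbook scheme shown for illustration only; it is not the calibration procedure used by the Intel tooling in these tests, and the sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Hypothetical fp32 weights, not taken from any real model.
weights = [0.82, -1.27, 0.05, 0.40, -0.66]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", quantized)
print("largest round-trip error:", max_error)
```

Each value is stored in 8 bits instead of 32, and the round-trip error stays within half of one quantization step, which is why many image-classification models tolerate int8 inference well.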

As the above results show, hardware and software optimizations have a substantial impact on inference performance. VMware Cloud Foundation 4.0.1 is an excellent example of how software can take advantage of hardware innovations like Intel DL Boost and VNNI to deliver significantly better performance results.

Conclusions and the Path Forward

The joint Hybrid Cloud Analytics Solution reference architecture is a concrete illustration of how VMware Cloud Foundation software-defined infrastructure, coupled with integrated Tanzu Kubernetes Grid container orchestration, provides an agile platform to effectively build, run, and manage complex environments. It also shows how the VMware solutions are complemented by Intel Optane persistent memory, Intel Xeon Scalable processors, and Intel Deep Learning Boost/VNNI technologies to improve the performance and accuracy of data analytics, machine learning, and AI architectures.

Moving to a hybrid cloud environment helps organizations migrate workloads to and from private and public clouds, utilizing their infrastructure seamlessly and efficiently. Using Intel’s reference architecture for VCF 4, enterprises can have a single, easy‐to‐manage architecture, on their own premises or in the cloud. With this end‐to‐end solution that is ready to deploy, enterprises are poised to run both their traditional data analytics workloads and the AI and machine‐learning workloads of the future.

Future iterations of this reference architecture may expand the deployment options for Edge to extend ML/AI to remote sites, and may leverage tight integration with leading OEMs to develop consolidated reference architectures that simplify acquiring and deploying these solutions at scale. As new high-performance technologies and alliances come to market (Project Monterey, for example), there will be ample opportunities to build on this platform for future growth. To learn more, visit the Intel VMware partnership page and the VMware Intel Partnership page.

References:

Intel Reference Architecture – Modernize the Data Center for Hybrid Cloud Oct. 2020

Blog – Accelerating Hybrid Cloud AI Workloads with VMware Cloud Foundation (Part 1)

VMware Cloud Foundation Product Page

VMware Tanzu Product Page

VMware Cloud on AWS Product Page

Figure 1 courtesy of DataRobot – Predicting COVID-19 at the County Level

Figures 2-7 courtesy of Intel Corporation – Modernize the Data Center for Hybrid Cloud Oct. 2020

Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation in the U.S. and/or other countries.
Other names and brands may be claimed as the property of others.
