
Build Scalable, Resource-Efficient, High-Performing Virtualized ML and DL Platforms

In the era of autonomous driving, smart homes and cities, and automated production and services, the concept of artificial intelligence (AI) is getting more than its fair share of attention. Loosely defined as the capability of computer systems to perform tasks that normally require human intelligence, AI isn’t simply a vision of the future—it provides a competitive advantage to business owners and leaders across nearly every aspect of today’s enterprise landscape. From augmented decision-making that drives higher levels of business performance to highly accurate image processing and speech recognition that enable better products, improved customer service and more, AI creates new opportunities for businesses to compete and succeed.

Machine learning (ML), a subset of AI defined as the ability of a system to learn from data without being explicitly programmed, has been around for decades, but has found new life in recent years thanks to concurrent advances in technology. The availability of massive amounts of data, advances in high-performance computing (HPC), and the reemergence of ML and deep learning (DL) are the key drivers of the rapidly increasing deployment of AI in the enterprise. As a form of ML, DL leverages deep neural networks (DNNs) that process data through many successive layers, which is critical for handling huge, unstructured data sets. Of course, outcomes depend on a high-performing ML/DL platform.

Gartner Hype Cycles offer an objective view into the maturation and current adoption of emerging technologies and innovations like AI. From the innovation trigger, where new technologies are first introduced, to the plateau of productivity wherein adoption is widespread and benefits broadly realized, decision-makers use the Hype Cycle to evaluate opportunities for business benefits and growth against possible risk.

Gartner Hype Cycle for Emerging Technologies, 2018

Examining the maturation of technologies like autonomous vehicles, DNN-based virtual assistants, and the connected home suggests that AI has begun its transition from trial-and-error testing and evaluation to early adoption. AI, with ML and DL at its core, is likely to see widespread adoption within the next 2–5 years.

Gartner projects 2018 as the beginning of AI democratization, pointing to the expansion of AI into a far wider range of companies and government organizations, where AI-based technologies offer benefit to broad sections of the population. CIOs view progress with AI initiatives to be a top-five business priority. Gartner’s latest CIO survey of 3,160 CIOs from 98 countries found that 21% of CIOs are already piloting AI initiatives or have short-term plans for them. Another 25% have medium- or long-term plans.

To fully benefit from ML/DL and maximize their competitive advantage, business owners and leaders should explore ways to build a scalable, resource-efficient, high-performing ML/DL platform for the development and production deployment of ML and DL applications in a secure enterprise environment.

Enabling New Workloads: The Convergence of HPC and ML/DL

Real-time, high-level, enterprise-wide data analytics requires a state-of-the-art ML pipeline with the compute power to support it. It’s no surprise that the convergence of high-performance computing (HPC), Big Data, and ML is driving a paradigm shift in computing and analytics. Fundamental advancements in virtualization and cloud are critical to the future of ML workloads as they help businesses build high-performing systems capable of handling the incredible complexity of ML, including those challenges that accumulate and grow over time.

Published by Google engineers, “Hidden Technical Debt in Machine Learning Systems” outlines the ML-specific issues that increase the cost and complexity of long-term ML system maintenance. By virtualizing ML and DL, administrators and IT professionals can centralize and easily manage infrastructure and resources in ways that might otherwise prove impossible.

Figure 1. Only a small fraction of real-world ML systems is composed of the ML code. The required surrounding infrastructure is vast and complex.

The core of virtualization is the virtual machine (VM). A VM is a software abstraction that allows multiple software environments, each composed of an operating system and its applications, to run together on shared hardware. Deploying high-performance VMs for ML/DL workloads as part of a high-performing ML/DL platform delivers a host of administration-level benefits that lead to increased efficiency, flexibility, and agility.

  • Heterogeneity: VMs allow different resource configurations, operating systems, and HPC as well as ML/DL software to be mixed on the same physical hardware. Self-provisioning speeds up time-to-solution for data scientists and engineers.
  • Increased control and research reproducibility: Virtualization offers administrators control and flexibility, enabling them to dynamically resize, pause, take snapshots, back up, replicate to other virtual environments, or simply wipe and redeploy VMs. Administrators can also archive VMs and rerun them as needed for auditing or research purposes.
  • Improved resource-prioritization and balancing: Virtualization allows compute resources for VMs to be prioritized individually or by pool. Also, because VMs are not tied to a specific node, it’s possible to migrate running VMs and their workloads across a cluster to optimize load-balancing.
  • Fault-isolation: VMs enable each job to be run in an isolated environment, free of the potential faults caused by jobs running in other VMs.
  • Security: Rules and policies can be defined and applied based on environment, workflow, VM, physical server, and operator, including control of actions via user permissions and workflow isolation to prohibit sharing of sensitive data with other ML environments.
  • Resilience and redundancy: VMs enable fault-resilience, dynamic recovery, and other capabilities unavailable in traditional unvirtualized environments.

Hardware Accelerators for HPC and DL

With continued efforts to bring performance close to that of bare-metal HPC environments, the trend toward virtualized HPC (vHPC) is growing. This is particularly true for enterprise-grade workloads like ML and DL. Similarly, the early embrace of accelerators within HPC has made it possible for virtual ML workloads to also be accelerated by these technologies. Accelerators further enhance ML adoption by improving hardware utilization efficiency and reducing business costs. By offloading intensive computations from CPUs to accelerators, businesses can significantly reduce the number of commodity CPU servers committed to a system.

As more businesses begin to deploy ML and DL workloads to improve products, services, and strategic decision-making, optimizing for accuracy becomes critically important for maintaining a competitive advantage. Achieving accurate DL models requires many large data sets and training of multiple model variants, both of which drive the need for compute accelerators to match parallel processing and performance demands.

Accelerators commonly used in both HPC and ML include GPUs and field-programmable gate arrays (FPGAs). Advances in DL and accelerators have enabled increased adoption of ML across a wide range of industries and uses, including facial recognition, medical diagnosis, robotics, automobile safety, and text and speech recognition. Unlike HPC, which has a large base of applications that require further programming to be hardware accelerated, ML/DL workloads run easily on accelerators with mainstream ML frameworks, including TensorFlow, PyTorch, and Theano. Moreover, ML/DL workloads can run on emerging accelerators such as Google TPU, Cambricon, and Graphcore, as long as integrations with mainstream ML frameworks are available.

Efficiency and Performance: Fractional/Full/Multi GPUs and High-Speed Interconnects

In deploying ML and/or DL, there are three workflows to consider: development, training, and inference. Each has different compute needs and may call for a fraction of a GPU, an entire GPU, or multiple GPUs. In development, data scientists perform exploratory data analysis, typically from laptop or desktop computers; here, virtual desktops offer secure, server-class resources that can be granted access to fractional GPUs for optimized hardware utilization. Fractional GPU utilization is also appropriate in inference, wherein new inputs are run through trained models, as only a small number of input examples are needed to generate predictive results.
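
The mapping from workflow to GPU allocation described above can be sketched as a small lookup. This is a hypothetical illustration only: the names `Workflow`, `GpuRequest`, and `suggest_gpu_request`, the quarter-GPU slice, and the default of four GPUs per host are assumptions chosen for the example, not settings from any VMware or GPU-vendor product.

```python
from dataclasses import dataclass
from enum import Enum


class Workflow(Enum):
    DEVELOPMENT = "development"
    TRAINING = "training"
    INFERENCE = "inference"


@dataclass(frozen=True)
class GpuRequest:
    gpus: float        # fraction of one GPU (< 1.0) or a whole number of GPUs
    multi_host: bool   # whether the GPUs must span more than one host


def suggest_gpu_request(workflow: Workflow, gpus_per_host: int = 4,
                        training_gpus: int = 1) -> GpuRequest:
    """Return an illustrative GPU request for a workflow (hypothetical policy)."""
    if workflow in (Workflow.DEVELOPMENT, Workflow.INFERENCE):
        # Exploratory work and inference usually need only a slice of a GPU.
        # The 0.25 slice size is an assumption for this sketch.
        return GpuRequest(gpus=0.25, multi_host=False)
    if training_gpus < 1:
        raise ValueError("training needs at least one full GPU")
    # Training gets full GPUs; it spans hosts only when one host is not enough.
    return GpuRequest(gpus=float(training_gpus),
                      multi_host=training_gpus > gpus_per_host)
```

Under these assumptions, `suggest_gpu_request(Workflow.TRAINING, training_gpus=8)` asks for eight full GPUs across hosts, while an inference request stays on a fractional GPU within a single host.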

Training, however, requires repeated forward and backward propagation through the model and repeated passes over the data sets. To improve performance and time-to-value, it’s critical to train on these large data sets in parallel, creating the need for a full or multi-GPU approach, both within and across hosts. That said, spreading GPUs across multiple hosts allows greater scale but may increase latency and negatively impact application performance. High-speed interconnects that support remote direct memory access (RDMA) optimize application performance by enabling direct access from the memory of one computer to another, without involving the OS or host CPU.
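
To make the parallel-training idea concrete, here is a minimal sketch of synchronous data-parallel training in plain Python. Each simulated worker stands in for a GPU: it computes a gradient on its own shard of the data, and the gradients are averaged before every update, which is the step an allreduce performs over the interconnect in a real system. The one-parameter model, the function names, and the learning rate are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def shard(data: Sequence[Sample], workers: int) -> List[List[Sample]]:
    """Split the data set round-robin so each worker gets its own shard."""
    if workers < 1 or workers > len(data):
        raise ValueError("need between 1 and len(data) workers")
    return [list(data[i::workers]) for i in range(workers)]


def local_gradient(w: float, samples: Sequence[Sample]) -> float:
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def allreduce_mean(values: Sequence[float]) -> float:
    """Stand-in for an allreduce: every worker ends up with the average."""
    return sum(values) / len(values)


def train(data: Sequence[Sample], workers: int, steps: int = 200,
          lr: float = 0.01) -> float:
    """Run synchronous data-parallel gradient descent and return the weight."""
    shards = shard(data, workers)
    w = 0.0  # all workers start from the same parameter value
    for _ in range(steps):
        # Each worker runs its forward and backward pass on its own shard.
        grads = [local_gradient(w, s) for s in shards]
        # Averaging keeps every worker's copy of the parameter identical.
        w -= lr * allreduce_mean(grads)
    return w
```

With `data = [(float(x), 3.0 * x) for x in range(1, 9)]`, both `train(data, workers=1)` and `train(data, workers=4)` converge toward a weight of 3.0. The two runs agree because the shards are equal in size, so the averaged gradient matches the full-batch gradient; the per-step averaging is the communication cost that a faster interconnect reduces.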

Consider Horovod, a distributed training framework created to help make DL easier and faster to implement and scale. Horovod effectively leverages multiple GPUs across multiple hosts and, when utilizing RDMA interconnects, achieves higher scaling efficiency than it does over conventional Ethernet networking. Likewise, GPUDirect RDMA further enhances the performance of multi-GPU environments, like HPC and DL, by enabling host channel adapters (HCAs) to read and write GPU memory data buffers directly, without copying data to host memory, which further reduces the load on the OS and CPU.
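
Scaling efficiency, the metric referred to above, is commonly computed as the measured throughput on N GPUs divided by N times the single-GPU throughput. A minimal sketch follows; the function name is hypothetical, and the throughput figures in the usage note are made-up placeholders, not measurements of Horovod, RDMA, or any particular hardware.

```python
def scaling_efficiency(throughput_n: float, n_gpus: int,
                       throughput_1: float) -> float:
    """Measured throughput on n_gpus divided by ideal linear scaling.

    A value of 1.0 means perfect linear scaling; lower values mean that
    communication or other overheads consume part of the added capacity.
    """
    if n_gpus < 1 or throughput_1 <= 0:
        raise ValueError("n_gpus must be >= 1 and throughput_1 must be > 0")
    return throughput_n / (n_gpus * throughput_1)
```

For instance, if one GPU processes 100 images per second and eight GPUs together process 720, `scaling_efficiency(720, 8, 100)` gives 0.9, or 90% of ideal. Comparing this figure for the same job over two different interconnects is one way to quantify the benefit of a faster network.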

VMware Private Cloud Solutions for a High-Performing Virtualized ML/DL Platform

As a result of the convergence of HPC, ML/DL, and Big Data, and the trend toward virtualization and cloud, ML is increasingly run in secure enterprise data centers. Because ML workloads involve massive amounts of data and long computation times, not to mention potential data compliance and security concerns, business leaders are likely to leverage private cloud for virtualized ML hosting.

VMware® offers a couple of different approaches to providing Infrastructure as a Service (IaaS) for virtualized environments. VMware vRealize® Automation™ (vRA) is a cloud-management platform that automates IT by allowing administrators to create, rapidly deploy, and manage infrastructure, applications, and services across multi-vendor, multi-cloud environments. Similarly vendor-agnostic, VMware Integrated OpenStack (VIO) is a VMware-supported OpenStack distribution that simplifies running an enterprise-grade OpenStack cloud on top of VMware virtualization technologies. VIO boosts performance and productivity by providing near-seamless OpenStack API access to VMware infrastructure.

As a complement to vRA and VIO, VMware NSX® enables networking and security entirely in software, abstracted from the underlying physical infrastructure. Through optimized networking and security policies distributed throughout the cluster, NSX provides switching, routing, load-balancing, and firewall capabilities to management and compute VMs.

Summary: Optimizing for Operational Efficiency and Growth

There has never been a better time for leaders and decision-makers to realize the important, enterprise-wide benefits of deploying ML and DL in their businesses. Still in the early stages of its rapid and potentially broad-based adoption, ML/DL delivers a competitive advantage in key areas like operational decision-making, product quality and yield efficiency, and customer service improvement, to name only a few. The brisk evolution of ML/DL is fueled by the convergence of advancements in HPC and virtualization, the availability of massive amounts of data, and the application of existing and emerging acceleration technologies that help meet DNN processing requirements. Business decision-makers are encouraged to explore the ways in which a high-performing virtualized ML/DL platform can help optimize operational efficiency and maximize business growth.

To learn more about the technologies and concepts covered in this blog, check out the following reference materials:



