Architecture

Designing a Shared Virtualized Infrastructure for Business-Critical, Machine Learning and High-Performance Computing Workloads

The University of Groningen – Center for Information Technology

The University of Groningen (UG) is a public research university located in the north of the Netherlands. The university has over 5,000 staff members and over 30,000 students across its undergraduate and postgraduate populations. The Center for Information Technology (CIT) within UG provides computing resources to the administrative staff, lecturers, researchers and students at the university.

CIT supports the business-critical workloads that serve the day-to-day operational needs of the university. CIT wanted to take advantage of the opportunity to host new types of workloads alongside the business-critical ones on its existing infrastructure, using any spare compute capacity available there. This article describes the VMware cluster that hosts these different workloads together.

The Opportunity

CIT saw the opportunity to host workloads such as high-performance computing (HPC) and machine learning (ML) alongside the university’s business-critical workloads. These workloads are used by communities of academic users that are separate from the university’s core business users. CIT refers to these new workload opportunities as “resource computing”. Examples of the applications that fit into this “resource computing” model are:

  • Gromacs (Groningen Machine for Chemical Simulations – originally developed at UG)
  • TensorFlow (An open-source machine learning framework, originally developed by Google)
  • Caffe (A deep learning framework, created at the University of California at Berkeley)
  • Relion (An application for refinement of images in electron cryo-microscopy)
  • AmberTools (A biomolecular simulation application)

While these workloads have traditionally been hosted on dedicated clusters of bare-metal servers, the plan is to share CIT’s virtualized compute resources across different workloads and different groups of users wherever possible.

The Business-Critical Workload Cluster

The first category of virtualized workloads is made up of SQL Server clusters, web servers, Windows servers, Linux servers and others. In the words of the CIT group, “these workloads keep the university running”.

The CIT department makes use of three physical datacenters, two of which the university owns, to support the business-critical workload portfolio. These three datacenters are located within 20 miles of each other and are dual-linked together, giving an effective network bandwidth of 80 Gbit/s between the datacenters. The CIT group supports 40 Gbit networking within each of the three physical datacenters.


Each physical datacenter has 12 Intel x86-based servers that participate in the overall cluster. While the hardware servers are physically located at three sites, the entire set is configured as one vCenter cluster. This brings a number of advantages: CIT can use VMware cluster functionality such as DRS (Distributed Resource Scheduler), HA (High Availability) and EVC (Enhanced vMotion Compatibility).
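
As an illustration of how these cluster-level features are enabled, the PowerCLI sketch below turns on DRS and HA and pins the EVC baseline for a stretched cluster. This is a sketch only; the vCenter address and the cluster name are hypothetical placeholders, not CIT’s actual configuration.

    # Illustrative PowerCLI sketch; the vCenter address and the cluster name
    # "UG-Cluster" are hypothetical placeholders, not CIT's real names.
    Connect-VIServer -Server vcenter.example.edu

    # Enable DRS (fully automated) and HA on the stretched cluster, and pin EVC
    # to the Skylake baseline so all hosts remain vMotion-compatible.
    Set-Cluster -Cluster "UG-Cluster" `
                -DrsEnabled $true -DrsAutomationLevel FullyAutomated `
                -HAEnabled $true `
                -EVCMode "intel-skylake" `
                -Confirm:$false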

Hardware Specifications for the Cluster

The server hardware was updated in September 2018, from a blade platform to a rack-mounted server platform. The new cluster consists of 36 Fujitsu RX2540 M4 servers, each with a Skylake-based Intel Xeon Gold 6150 CPU, 768 GB of memory, a Mellanox ConnectX-4 card (40 Gbit) and one NVIDIA V100 GPU, with space in the server for a second GPU card. There are 36 servers and 36 GPUs in total at the time of this writing.

Each high-performance VM has a boot disk located in the VMware storage environment, which consists of 60 datastores provided by 16 storage devices. Three types of storage are available: replicated SSD storage, replicated bulk storage and bulk storage. The data and scratch disks of the high-performance VMs are mounted over the network from a centralized high-performance Lustre storage system. Lustre is a parallel file system designed specifically for high-performance compute clusters.

Each server host runs VMware vSphere v6.7 Update 1. The VMware vSphere installation and configuration process is completely automated by a Kickstart script that performs configuration tasks such as:

  • Adding 11 VMkernel adapters – and therefore 11 IP addresses
  • Adding all 60 datastores
  • Adding the 150 VLANs
  • Installing VIBs (vSphere Installation Bundles) into the ESXi hypervisor
  • Setting port group policies (promiscuous mode)
  • Configuring the log server, NTP server, SSH keys and access to the ESXi shell

By scripting the installation steps, CIT guarantees that all server nodes are configured in exactly the same way. Other tools used for the automation scripts and for management include Bash and PowerCLI, along with NetFlow and Grafana for monitoring the environment.
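
The Kickstart script itself is not reproduced here, but the PowerCLI sketch below illustrates a few of the per-host tasks listed above: a VMkernel adapter, a VLAN-tagged port group with promiscuous mode, a datastore and NTP. All host names, IP addresses, VLAN IDs and storage details are made-up placeholders, and NFS is assumed purely for the sake of the example.

    # Illustrative PowerCLI sketch of a few per-host tasks that CIT's Kickstart
    # script automates. All names, addresses, VLAN IDs and storage details are
    # hypothetical placeholders.
    $esx = Get-VMHost -Name "esx01.example.edu"
    $vss = Get-VirtualSwitch -VMHost $esx -Name "vSwitch0"

    # One of the 11 VMkernel adapters, placed on its own VLAN.
    New-VMHostNetworkAdapter -VMHost $esx -VirtualSwitch $vss -PortGroup "vmk-storage" `
                             -IP 10.10.110.11 -SubnetMask 255.255.255.0
    Get-VirtualPortGroup -VMHost $esx -Name "vmk-storage" | Set-VirtualPortGroup -VLanId 110

    # A VM port group on one of the ~150 VLANs, with promiscuous mode allowed.
    $pg = New-VirtualPortGroup -VirtualSwitch $vss -Name "vlan-200" -VLanId 200
    Get-SecurityPolicy -VirtualPortGroup $pg | Set-SecurityPolicy -AllowPromiscuous $true

    # One of the 60 datastores (NFS is assumed here only for illustration).
    New-Datastore -VMHost $esx -Name "ds-bulk-01" -Nfs -NfsHost "nas.example.edu" -Path "/export/bulk01"

    # NTP configuration.
    Add-VMHostNtpServer -VMHost $esx -NtpServer "ntp.example.edu"
    Get-VMHostService -VMHost $esx | Where-Object { $_.Key -eq "ntpd" } | Start-VMHostService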

Networking

There are two main core routers based on Juniper infrastructure. All network links run at 40 Gbit/s or higher. The datacenters make use of more than 150 VLANs within the VMware setup. Each ESXi server is connected with dual 40 Gbit links to a redundant set of core switches.

The storage is connected to the ESXi servers over the storage network. The solution is based on two redundant Ethernet fabrics that span all three datacenters, so if one path has an issue, a second path is always available. Each fabric consists of three Dell core switches, one in each datacenter, connected to each other over dark fiber at 2x 40 Gbit. Modular core switches were chosen to remain flexible in the future and to support newer standards such as 25/50/100 Gbit.


How the Workloads Share Compute Resources

The business-critical workloads are hosted in virtual machines running on the 36 vSphere host servers that are spread across the three physical locations. Up to 70% of the total CPU power and memory is used for the business-critical workloads, and there is even some extra capacity within that 70% that is not used today.


The cluster is configured to provide a reserved overcapacity of 30% of its total compute and memory power. This is primarily done to allow the business-critical workload VMs to be failed over from one location to another, should one of the locations fail. By design, this extra 30% is not used by the business-critical workloads, which presents an opportunity for workloads other than the business-critical ones to make use of it.

The 30% of physical host capacity designated as “reserved overcapacity” is carved out by creating a resource pool and setting a vSphere reservation for that portion of the overall host CPU and memory power. The new “resource computing” workload VMs live within that resource pool. By also applying a vSphere limit to the pool, CIT ensures that those high-performance workloads never intrude on the compute power needed by the business-critical workloads. The team now thinks in terms of “resources” rather than “dedicated clusters” for any application type.
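
A minimal PowerCLI sketch of this resource pool approach is shown below. The cluster and pool names are hypothetical, and the absolute CPU and memory figures are placeholders standing in for roughly 30% of the cluster’s aggregate capacity.

    # Illustrative sketch of the reservation-plus-limit approach described above.
    # The cluster name, pool name and the CPU/memory figures are hypothetical
    # placeholders representing roughly 30% of the cluster's total capacity.
    $cluster = Get-Cluster -Name "UG-Cluster"

    New-ResourcePool -Location $cluster -Name "ResourceComputing" `
                     -CpuReservationMhz 500000 -CpuLimitMhz 500000 `
                     -MemReservationGB 8000 -MemLimitGB 8000

    # New "resource computing" VMs are then deployed into this pool, for example:
    # New-VM -Name "hpc-node-01" -ResourcePool (Get-ResourcePool -Name "ResourceComputing") ...

Because the limit equals the reservation, the pool is guaranteed its share of the cluster but can never grow beyond it, so the resource computing VMs cannot spill over into the capacity needed by the business-critical workloads.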

Using GPUs for the Resource Computing Workloads

A key element of the university’s infrastructure for HPC and ML workloads is the use of graphics processing units (GPUs). The goal is to provide one or more GPUs to a single virtual machine over time. Today, two different approaches are used for associating GPUs with a virtual machine. These are:

  • The vSphere Passthrough or DirectPath I/O method that is built into the vSphere hypervisor, and
  • The NVIDIA Quadro Virtual Data Center Workstation software product (we use “NVIDIA Grid” as a short name for the product in this article).

GPUs may be configured using DirectPath I/O or NVIDIA Grid as appropriate. The DirectPath I/O approach allows one or more GPUs to be assigned to a single virtual machine, for high-end machine learning users, for example. This method is ideal for the more demanding workloads that can make full use of one or more GPUs.
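
As an illustration of the DirectPath I/O route (a sketch, not CIT’s actual tooling), a GPU that has been enabled for passthrough on a host can be attached to a powered-off VM with PowerCLI roughly as follows. The VM and host names are hypothetical.

    # Illustrative sketch of assigning a passthrough GPU to a VM. The VM and
    # host names are hypothetical; the GPU must already be enabled for
    # passthrough on the host and the VM must be powered off.
    $vm  = Get-VM -Name "ml-train-01"
    $esx = Get-VMHost -Name "esx01.example.edu"

    # Find the NVIDIA GPU among the host's passthrough-capable PCI devices
    # and attach it to the VM.
    $gpu = Get-PassthroughDevice -VMHost $esx -Type Pci |
           Where-Object { $_.Name -match "NVIDIA" } | Select-Object -First 1
    Add-PassthroughDevice -VM $vm -PassthroughDevice $gpu

    # DirectPath I/O requires the VM's memory to be fully reserved.
    $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
    $spec.MemoryReservationLockedToMax = $true
    ($vm | Get-View).ReconfigVM($spec)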

The NVIDIA Grid method, on the other hand, allows the different virtual machines on a server to share access to a single physical GPU through the concept of a “virtual GPU”, or vGPU. At the time of this writing, a virtual machine using NVIDIA Grid may use at most one virtual GPU, which maps to one physical GPU or to a part of one. That limitation may well change over time. By using vGPUs in this way, CIT maintains the advantages of vSphere functionality such as vMotion and Suspend and Resume, and keeps the cluster maintainable. More information on deploying the NVIDIA Grid software on VMware vSphere may be found here.
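
The vGPU route is normally configured in the vSphere Client by adding a shared PCI device with an NVIDIA vGPU profile to the VM; the same step can be sketched against the vSphere API in PowerCLI, as shown below. The VM name and the “grid_v100-8q” profile are example values only, and the available profiles depend on the installed NVIDIA Grid release.

    # Illustrative sketch of attaching an NVIDIA vGPU profile to a powered-off
    # VM via the vSphere API. The VM name and the "grid_v100-8q" profile are
    # example values, not CIT's actual configuration.
    $vm = Get-VM -Name "ml-notebook-01"

    $dev = New-Object VMware.Vim.VirtualPCIPassthrough
    $dev.Backing = New-Object VMware.Vim.VirtualPCIPassthroughVmiopBackingInfo
    $dev.Backing.Vgpu = "grid_v100-8q"            # an 8 GB slice of a V100

    $change = New-Object VMware.Vim.VirtualDeviceConfigSpec
    $change.Operation = "add"
    $change.Device = $dev

    $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
    $spec.DeviceChange = @($change)
    $spec.MemoryReservationLockedToMax = $true    # vGPU VMs also need a full memory reservation

    ($vm | Get-View).ReconfigVM($spec)

Once the vGPU is attached, standard PowerCLI operations such as Suspend-VM and Start-VM can be used to park a GPU-backed VM temporarily and resume it later, which is the suspend-and-resume capability mentioned above.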

Benefits of Virtualizing the Resource Computing Workloads

  • Each individual university researcher or group can get their own part of the cluster and can use it at any time for an hour, a week or longer.
  • Rather than having a dedicated cluster for each community, the overall costs can be lowered by sharing the spare capacity.
  • Different application toolkits and platforms can be run on different guest operating systems, all on the same hardware, within different virtual machines. These can be changed at will as the user requirements demand.
  • Workloads are isolated from each other from a performance and security point of view. Where data within a particular project is sensitive, virtualization mechanisms can be used to isolate it from view by non-privileged users.
  • Faults in one VM are isolated and do not affect other machines in the same cluster.

Benefits of Virtualization for the University and the CIT Department

Using the vSphere platform, the CIT staff at the University have been able to achieve the following:

  • Costs are lowered through sharing resources and re-using the reserved overcapacity inherent in the design. By combining different workloads on the same servers, compute resources can be utilized more effectively.
  • CIT can easily manage and maintain the physical servers, independently of their resident workloads.
  • Redundancy is built into the cluster, as it is provided by design for the business-critical workloads. A server failure is not a problem, as vSphere High Availability (HA) ensures that the affected VMs are brought back to life quickly on other servers.
  • Failure of an entire physical datacenter has been planned for and can be tolerated by the system, due to the multi-site design.
  • Hardware replacement is done in an incremental fashion rather than in a global replacement manner.
  • GPU-enabled virtual machines with varying workloads can be suspended in mid-operation if another workload needs the GPU temporarily. The original workload can then be resumed later on. This is a feature of vSphere 6.7 combined with the NVIDIA Grid software.

Conclusions

The University of Groningen CIT department has a robust vSphere cluster setup for a variety of workloads, based on replicated environments at three physical datacenters with high-speed networking between them. A shared capacity model, implemented using the vSphere resource reservation and limit features, allows designated reserved overcapacity to be used for extra workloads alongside the business-critical workloads that support the university’s business. Each host server has an NVIDIA GPU card, which may be dedicated to one VM or shared by different VMs using the Passthrough or NVIDIA Grid mechanisms. The CIT management staff has achieved significant business advantages and ease of use by using virtualization technology to support this infrastructure.

Authors

This article was written jointly by the following authors:

Gemma van der Voorst, Virtualization Specialist, CIT, The University of Groningen, The Netherlands

Wietze Albers, IT Architect, CIT, The University of Groningen, The Netherlands

Niels Hagoort, Technical Marketing Architect, VMware, The Netherlands

Justin Murray, Technical Marketing Architect, VMware, Palo Alto, California, USA