Traditionally, HPC workloads have been deployed on bare-metal clusters, but advances in virtualization have paved the way for deploying them on virtualized clusters. However, HPC cluster administrators and providers still face challenges with resource elasticity and large-scale virtual machine (VM) provisioning, because a traditional HPC scheduler and the VM resource management layer do not coordinate with each other. This lack of coordination leads to low cluster utilization and job throughput. Furthermore, VM provisioning delays directly affect the overall performance of jobs in the cluster. Hence, there is a need to provision virtualized HPC clusters effectively, making the best use of the physical hardware with minimal provisioning overhead.
To address this, the VMware OCTO HPC team presents Multiverse, a VM provisioning framework that dynamically spawns VMs for incoming jobs in a virtualized HPC cluster by integrating the HPC scheduler with VMware vCenter. We have implemented this framework on the Slurm scheduler. To reduce VM provisioning overhead, we use instant clone, which shares both disk and memory between the cloned and parent VMs. Measurements with real-world HPC workloads demonstrate that instant clone is 2.5× faster than full clone in terms of VM provisioning time. Further, it improves resource utilization by up to 40% and cluster throughput by up to 1.5× compared to full clone in bursty job arrival scenarios. For more details, please read our paper published in the Proceedings of the 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID).
Figure: A high-level overview of Multiverse
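To give a rough feel for why per-VM provisioning time matters when jobs arrive in a burst, here is a toy Python model. It is only a sketch under stated assumptions and is not the Multiverse implementation: it does not talk to Slurm or vCenter, and the names `Job` and `simulate`, the slot count, the job runtimes, and the absolute clone times are all hypothetical values chosen for illustration. Only the 2.5× ratio between full-clone and instant-clone provisioning time is taken from the measurements reported above.

```python
"""Toy model of per-job VM provisioning. All names and numbers are illustrative."""
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: int
    arrival: float   # seconds since the start of the experiment
    runtime: float   # seconds of compute once the job's VM is ready


def simulate(jobs, provision_time, slots):
    """Return (makespan, utilization) when every job gets a freshly cloned VM.

    provision_time: seconds to clone and boot one VM (hypothetical value).
    slots: number of VMs the physical hosts can run at the same time.
    Jobs are served first-come, first-served on the earliest free slot.
    """
    free_at = [0.0] * slots            # time at which each slot becomes free
    heapq.heapify(free_at)
    useful = 0.0                       # seconds spent on actual job compute
    makespan = 0.0
    for job in sorted(jobs, key=lambda j: j.arrival):
        slot_free = heapq.heappop(free_at)
        start = max(job.arrival, slot_free)          # cloning starts here
        end = start + provision_time + job.runtime   # slot is held until the job ends
        heapq.heappush(free_at, end)
        useful += job.runtime
        makespan = max(makespan, end)
    utilization = useful / (slots * makespan)
    return makespan, utilization


# A burst of 8 jobs arriving at t=0 on hosts that can run 4 VMs at once.
burst = [Job(job_id=i, arrival=0.0, runtime=60.0) for i in range(8)]

FULL_CLONE_S = 50.0                    # assumed absolute value
INSTANT_CLONE_S = FULL_CLONE_S / 2.5   # ratio taken from the text, 20 s here

full_makespan, full_util = simulate(burst, FULL_CLONE_S, slots=4)
inst_makespan, inst_util = simulate(burst, INSTANT_CLONE_S, slots=4)

print(f"full clone:    makespan={full_makespan:.0f}s utilization={full_util:.2f}")
print(f"instant clone: makespan={inst_makespan:.0f}s utilization={inst_util:.2f}")
```

In this model a slot counts as occupied while its VM is being cloned, so shorter provisioning shortens the makespan and raises the share of time spent on useful compute. The numbers it prints depend entirely on the assumed inputs and should not be read as a reproduction of the published results.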
We are working on making Multiverse open source. In the meantime, please feel free to contact Michael (xiaolongc [at] vmware [dot] com) to try it out.