GPUs have proven to be a key component of ML/AI (and data science) infrastructure, in many cases delivering 20x, and sometimes 50x, performance gains over general-purpose processors. It is common to run ML models natively on GPUs through AI frameworks such as TensorFlow and PyTorch. However, there is an architectural limitation here. Unlike general-purpose compute infrastructure (i.e. processors), which can run multiple distinct applications and users concurrently, GPUs cannot. Remember the old days of bare-metal servers: only one user and one application at a time. This is how GPU servers operate today, as bare-metal entities.
There is a significant body of work explaining the inefficiency and low productivity of bare-metal infrastructure, and unsurprisingly the same findings apply to bare-metal GPU servers. Even ML/AI workloads, which are compute-intensive, often consume less than 100% of the GPU's compute and memory, and it is not uncommon to see only 20% or 30% of the GPU in use. The main reason is that ML and AI models are still evolving, and there is a great deal of experimentation with models, batch sizes, parallelism and other factors. In addition, GPU hardware keeps evolving, so a mismatch between GPU hardware resources and workloads is common. Moreover, as new markets develop for AI/ML, such as edge computing and carrier infrastructure, GPUs that were designed for hyperscale data centers will be ill-fitted to these segments in their bare-metal form. It is therefore essential to be able to carve up physical GPU hardware into more than one compute instance, each able to process concurrent workloads independently and in isolation.
Now, with Bitfusion software, vSphere can partition a physical GPU into partial GPUs of any size and assign each one, independently and in isolation, to a distinct user or workload. This set of capabilities enables multiple use cases that address the bare-metal deficiencies.
- When a workload in a certain configuration uses only a small portion of the GPU (e.g. 30%), the user can run three concurrent instances of the workload (three 30% partial GPUs hosted on the same physical GPU), thus using nearly 100% of the GPU and getting a significant throughput increase (it is very common to see 2.5x in this scenario)
- When there is a shortage of GPUs (particularly high-end GPUs in environments with multiple users and experiments), a physical GPU can be carved up into distinct portions (for example 20%, 20% and 60%), each serving a different user and/or a different workload. This is known as partial sharing
- Experimentation: during the Dev/Test cycle, engineers are less sensitive to performance and mainly want to make sure the model works and is compatible with the framework, GPU, libraries, etc. In this case a GPU can be carved up into different portions, each serving a different experiment (different parameters) concurrently
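The arithmetic behind the first use case can be sketched in a few lines. This is only an illustration under simple assumptions: the helper name `pack_instances` is hypothetical, utilization is expressed in whole percent, and the 2.5x figure is the one quoted above rather than something the code measures.

```python
from __future__ import annotations


def pack_instances(workload_pct: int, gpu_pct: int = 100) -> tuple[int, int]:
    """Return (instances, total_pct_used) for identical partials on one GPU.

    Hypothetical helper: it only does the packing arithmetic and does not
    talk to any GPU or to Bitfusion.
    """
    if not 0 < workload_pct <= gpu_pct:
        raise ValueError("workload_pct must be between 1 and gpu_pct")
    instances = gpu_pct // workload_pct  # how many equal partials fit
    return instances, instances * workload_pct


# A workload that uses about 30% of a GPU fits three times on one device.
instances, used = pack_instances(30)
print(instances, used)  # 3 90

# Ideal scaling would be 3x; the text reports about 2.5x in practice,
# which corresponds to roughly 83% scaling efficiency.
observed_speedup = 2.5
print(round(observed_speedup / instances, 2))  # 0.83
```

The gap between the ideal 3x and the reported 2.5x is what one would expect when concurrent instances contend for shared resources on the same device; the exact figure will depend on the workload.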
It is important to note that several architectural principles must hold when GPU hardware is partitioned:
- Any size – as mentioned before, there is a high degree of experimentation in the ML/AI field, and GPU utilization is hard to predict. Hence it is important to be able to split GPU hardware in a granular way (e.g. at 1% resolution), to measure GPU compute and memory utilization accurately, and then to set the right size for a partial GPU (e.g. 32%)
- Independence – the partial GPUs (that belong to the same GPU hardware) should be able to run autonomously, and serve different AI frameworks, users and models
- Size independence – partial GPUs that belong to the same physical GPU can have arbitrary, user-defined sizes, for example 20%, 20% and 60%
- Isolation – the partial GPUs (that belong to the same GPU hardware) run the workloads in isolation, therefore maintaining security and data governance
- Dynamic assignment – partial GPUs can be set up and torn down dynamically, in real time, without affecting the other partial GPUs running on the same physical GPU. For example, if two partial GPUs are running concurrently (45% and 55%), the 55% partial GPU can be torn down and its capacity split further and assigned to two other partials (e.g. 35% and 20%), while the other partial GPU (45%) keeps running its workload uninterrupted
- Partial GPUs from different physical GPUs can be assigned to the same user and workload. For example, two 40% partials from two different physical GPUs can be assigned to a single user/workload. This capability helps in GPU clusters, where inefficiency extends across multiple GPUs. It also helps when users need to develop multi-GPU workloads but assigning many physical GPUs to a single user is too expensive or impractical for development and debug/test
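The principles above can be made concrete with a small bookkeeping model. This is a sketch under stated assumptions, not the Bitfusion implementation or API: the `PhysicalGPU` class, its methods and the job names are all hypothetical, sizes are tracked at 1% resolution, and isolation is represented only by each partial having its own entry.

```python
class PhysicalGPU:
    """Hypothetical model of one physical GPU split into partials (whole percent)."""

    def __init__(self, name):
        self.name = name
        self.partials = {}  # owner -> size in percent

    def free_pct(self):
        return 100 - sum(self.partials.values())

    def allocate(self, owner, pct):
        """Create a partial of arbitrary size; other partials are not touched."""
        if not 1 <= pct <= 100:
            raise ValueError("pct must be between 1 and 100")
        if owner in self.partials:
            raise ValueError(f"{owner} already holds a partial on {self.name}")
        if pct > self.free_pct():
            raise ValueError(f"only {self.free_pct()}% free on {self.name}")
        self.partials[owner] = pct

    def release(self, owner):
        """Tear down one partial and return its size; the rest keep running."""
        return self.partials.pop(owner)


# Dynamic assignment: 45% + 55%, then the 55% partial is re-split as 35% + 20%.
gpu0 = PhysicalGPU("gpu0")
gpu0.allocate("job-a", 45)
gpu0.allocate("job-b", 55)
gpu0.release("job-b")
gpu0.allocate("job-c", 35)
gpu0.allocate("job-d", 20)
print(gpu0.partials)  # {'job-a': 45, 'job-c': 35, 'job-d': 20}

# Partials from different physical GPUs assigned to the same workload.
gpu1, gpu2 = PhysicalGPU("gpu1"), PhysicalGPU("gpu2")
for gpu in (gpu1, gpu2):
    gpu.allocate("job-e", 40)
print(gpu1.free_pct(), gpu2.free_pct())  # 60 60
```

Keeping the sizes as integers avoids floating-point drift when partials are repeatedly created and released, which is one simple way to honor the 1% resolution mentioned above.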
More use cases are now being explored with VMware customers, and it is clear that the ability to abstract GPU hardware and create elastic hardware assignments in real time is a natural evolution of ML/AI infrastructure. Please reach out to us at AskBitfusion@VMware.com with any questions or comments you have about this exciting new area.