In previous blogs we have covered many aspects of composable ML/AI, GPU remoting, virtualization, and elastic AI. In this blog we offer a short introduction for users operating with GPU virtualization. We use shell commands, as they offer the most direct way to run an ML/AI application. Nonetheless, it is straightforward (and left to the reader) to abstract the CLI syntax with scripts, automated bash, or a front-end GUI that generates the CLI calls.
Let’s assume that resnet50 is the model of choice and the dataset is imagenet. For this example, the data scientist uses the TensorFlow framework. A typical training session with two GPUs starts with a command similar to the following:
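A representative version of such a command, assuming the standard tf_cnn_benchmarks script from the TensorFlow benchmarks repository (the dataset path and batch size below are illustrative, not values from the original session):

```shell
# Illustrative tf_cnn_benchmarks invocation; --data_dir and --batch_size
# are example values:
python tf_cnn_benchmarks.py \
    --model=resnet50 \
    --data_name=imagenet \
    --data_dir=/data/imagenet \
    --batch_size=64 \
    --num_gpus=2
```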
Upon execution of this command, the TensorFlow tf_cnn benchmark will initiate resnet50 model training with the imagenet dataset and the specified arguments (such as batch size and floating-point options). While all arguments are important, we highlight only the GPU argument --num_gpus=2, which executes the model with two GPUs attached locally to the PCIe bus. Clearly, the two physical GPUs must be attached to the local host running this command.
Now, switch to the virtual world. The user runs on a vanilla, low-cost CPU server with no local GPUs attached to the PCIe bus. Attempting to run the training command above would therefore fail (no physical GPUs are attached to the host). However, by prefixing the exact same command with the Bitfusion command flexdirect run, the Bitfusion stack will be invoked and will attach two remote GPUs from somewhere in the network in real time (‘somewhere’ might be misleading, as there is certain logic dictating which network GPUs the user is allowed to attach). Attaching the two remote GPUs takes less than one second, and the training will then proceed exactly as before. The command will look like:
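The same invocation, prefixed with flexdirect run (the application command and its flags are the illustrative ones from the example above):

```shell
# -n 2 asks Bitfusion to remote-attach two physical GPUs before the
# application starts; everything after the prefix is unchanged:
flexdirect run -n 2 python tf_cnn_benchmarks.py \
    --model=resnet50 \
    --data_name=imagenet \
    --data_dir=/data/imagenet \
    --batch_size=64 \
    --num_gpus=2
```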
Note that flexdirect run takes arguments of its own. For example, -n 2 invokes a process that remote-attaches two physical GPUs over the network to the user space.
There are three elements that make this seamless – the user experience is as if the GPUs were attached to the user’s local server:
- No changes to the python command or its arguments; it is a simple cut-and-paste
- The user experience is the same (training output, model messages, exceptions if any, etc.) as if the user were operating a local GPU machine with no virtualization
- The Bitfusion command flexdirect run provides a few additional (optional) arguments to direct the execution of the run-time virtualization (per session)
Virtualization performs best when it is abstracted from the user and provides the same experience as a physical host. In our case – executing AI/ML workloads on GPUs – this requirement is met to the fullest degree.
As a second example, we will show how a partial GPU is invoked. Let’s assume that the data scientist has access to a single remote physical GPU and no more (it is a busy workday and there is peak demand for GPUs). The scientist seeks to run multiple GPU environments on this single physical GPU to increase her productivity. The first environment is a TensorFlow training run, the second is experimental model development with PyTorch, and the third is an additional experiment with TensorFlow. All three environments need to run concurrently. In theory this requires three physically isolated GPUs, which unfortunately are not available. With Bitfusion, all three environments can run concurrently on a single physical GPU. All the data scientist has to do is partition the single remote physical GPU into three logical, isolated parts – for example 33% – 33% – 33% – and use each separately. With this action, the scientist has created new ‘GPUs’, each equal in size to a third of the physical GPU. The command is easy enough:
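A sketch of the three concurrent sessions, each launched in its own shell (the application scripts are hypothetical placeholders; only the flexdirect run -n 1 -p 0.33 prefix is from the text):

```shell
# Each session attaches a 33% slice of the same remote physical GPU:
flexdirect run -n 1 -p 0.33 python tf_training.py        # TensorFlow training
flexdirect run -n 1 -p 0.33 python torch_experiment.py   # PyTorch experiment
flexdirect run -n 1 -p 0.33 python tf_experiment.py      # second TensorFlow experiment
```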
Note that the new argument -p 0.33 carves out 33% of a physical GPU and attaches it to the workload (until completion). It runs the same process as before, but with a difference in size: the Bitfusion stack requests 33% of a GPU from a GPU server somewhere in the network, confirms the attachment, and hands the environment over to execute the TensorFlow application. The performance may not equal that of a run using 100% of the physical GPU; nonetheless, it grants the user a spectrum of tradeoffs between performance and productivity/flexibility. Moreover, a single environment often cannot utilize a full GPU, so it is better to allocate only part of it. We arbitrarily picked a 33% ratio for this example; the decision of how to carve up the GPU is left to the engineer (or the ML/AI infrastructure administrator), and fractions such as -p 0.27, -p 0.31, or -p 0.42 will work equally well. We anticipate that the user will profile the workload, gain some insight into how much of the physical GPU is actually used, and then decide on the right percentage for a partial GPU.
Note also that the -n argument is still present in the command invoking partial GPUs: flexdirect run -n 1 -p 0.33. The reason is that the Bitfusion stack allows you to attach multiple partial GPUs to the run-time application. For example, flexdirect run -n 4 -p 0.25 will attach four 25% partial GPUs to the engineer’s workspace. There are many compelling use cases for this configuration. For example, consider a cluster of GPUs serving partial GPUs to many short-lived sessions: the physical GPUs may end up partially consumed – one at 60%, a second at 50%, and so on. The user cannot get a full physical GPU, only fragments. With the multiple-partial command, the user is not restricted to a single partial, but can attach multiple partials that (potentially) yield the performance of a full physical GPU.
Monitoring and visibility features work as well. For example, the well-known nvidia-smi utility will generate the following output.
In this case the output shows a P100 physical GPU with 16GB of memory.
Of course, typing this command on a machine that has no GPUs is pointless – the response will simply report that no physical GPUs are present. However, when it is prefaced with flexdirect run, Bitfusion will, under the hood, execute the remote attach, and nvidia-smi will then run and show the GPU data as if the GPU were connected to the local machine. Example:
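The command combines the partial-GPU flags already shown with an unchanged nvidia-smi call:

```shell
# Attach a single 33% remote GPU, then run nvidia-smi against it:
flexdirect run -n 1 -p 0.33 nvidia-smi
```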
Here the user requests a single, 33% remote-attached GPU. After the command is executed, the Bitfusion stack attaches the 33% GPU to the user’s host, and nvidia-smi runs. The output is shown below:
Note the output: it is exactly the same as for a physical GPU (but it is a virtual GPU that is available to the user!). The memory available here is 5.37GB, which is essentially 33% of a 16GB P100 GPU – with a single command the user created a new logical GPU, with on-demand sizing!
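As a quick sanity check on the arithmetic (assuming the P100’s reported total is about 16.28GB, since nvidia-smi typically reports roughly 16,280MiB for this card):

```shell
# 33% of ~16.28GB is 5.37GB, matching the memory the virtual GPU reports:
awk 'BEGIN { printf "%.2f\n", 16.28 * 0.33 }'
```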
There are many more exciting options, arguments, modes-of-operation and use-cases. Please contact us at AskBitfusion@vmware.com and we will review with you more extensively the virtualization run-time commands.