
Scheduling Myths and Partial Truths

We have all been educated by scheduling and resource-management science. The premise is clear: unique and expensive resources are scarce, jobs run for relatively long durations (hours, days, or more), and demand therefore overwhelms the compute resources. Queues were invented to schedule and prioritize jobs (a proxy for batch processing). This has worked well for years in High Performance Computing, with few complaints: waiting hours, or even days, for a job to complete is acceptable. ML/AI and data science, however, are changing the landscape. While HPC is a niche, specialized market with few users, ML/AI is a broad, fast-growing market; it is a new industry with many users who expect immediate application execution. HPC runs big and ever-bigger jobs that require supercomputers, while ML/AI and data science span a broad spectrum of short-, medium-, and long-lived workloads. HPC jobs can be carefully curated and scheduled by top-notch experts, but ML/AI, with its large base of users, cannot be 'scheduled'; accelerated compute has to offer interactivity and ease of use comparable to storage and networking. And there is a final punch: as accelerated compute hardware (GPUs, FPGAs, and AI ASICs) improves, tasks finish faster, going from days to hours, from hours to sub-hour, and down to minutes. The combination of many end users running relatively short-lived ML/AI tasks is changing scheduling and resource management.

Indeed, the best architectural proxy for assigning GPUs to ML/AI workloads (or users) is storage. When a user needs to save or read files on network-attached storage, she doesn't 'request' resources to be scheduled at some future time at which the job (read/write) will then be performed. It doesn't work like that. The user simply clicks and the task (job) is done. Behind the scenes, the client program (or kernel) goes out to the network, gets the storage assignment, and performs the read/write. This is exactly what GPU scheduling is evolving toward. Instead of scheduling and queues, there will be real-time assignments. Instead of complex resource-management schemes, users will get their assignments when and how they need them. Think of it this way: has a user ever thought to schedule CPU cores for a Linux task? Or schedule an SSD for a mass write? Or an HDD sector for a file system? Clearly not. This is exactly where GPU assignment to tasks (workloads) is heading: real-time, on-demand assignment.
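To make the contrast concrete, here is a toy sketch of the on-demand model in plain Python. The GpuPool class and its acquire() call are invented purely for illustration (they are not any real API): an assignment is granted the moment it is requested, held only while the task runs, and returned immediately afterwards, with no queue and no scheduled start time.

    from contextlib import contextmanager

    class GpuPool:
        """Toy pool that hands out GPU assignments on demand, like a storage mount."""
        def __init__(self, total_gpus):
            self.available = total_gpus

        @contextmanager
        def acquire(self, count):
            # Real-time assignment: grant immediately if capacity exists.
            # There is no queue and no scheduled start time.
            if count > self.available:
                raise RuntimeError("not enough GPUs free right now")
            self.available -= count
            try:
                yield [f"gpu-{i}" for i in range(count)]
            finally:
                # Released as soon as the task finishes, not when a reservation expires.
                self.available += count

    pool = GpuPool(total_gpus=8)
    with pool.acquire(2) as gpus:
        print("running training step on", gpus)  # assignment lasts only for the run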

vSphere is now able to respond in real time to end users' (and application servers') requests for GPU assignments. The end user simply types the TensorFlow (or any other ML/data science application) command with the number of requested GPUs and hits <ENTER>. Under the hood, the Bitfusion stack goes out to the GPU clusters across the organization and makes the optimal assignment to the client, which lasts only for the duration of the run.
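As an illustration of how little of this the end user actually sees, the sketch below is an ordinary TensorFlow script that simply uses whatever GPUs have been attached for the duration of the process. The launch command in the comment is only indicative of the Bitfusion client syntax; treat it as an assumption and check the Bitfusion documentation for the exact flags in your version.

    # Illustrative launch (exact syntax may vary by Bitfusion version):
    #
    #   bitfusion run -n 2 -- python train.py    # request 2 remote GPUs for this run
    #
    # The training script itself is unmodified TensorFlow.

    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    print(f"Visible GPUs for this run: {len(gpus)}")

    strategy = tf.distribute.MirroredStrategy()  # spread work across the assigned GPUs
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")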

So, what happens if there are not enough GPUs? What happens if there is a mid-task failure? And how does it operate at large scale? Let's explore each topic:

  1. Bitfusion's elastic GPU technology allows partitioning of physical GPUs. Therefore, when significant load is detected, end users asking for full GPUs can be scaled down to smaller slices (e.g., 70% of a GPU, 40% of a GPU). This should sound familiar, as it is exactly what happens with network traffic: instead of end users being denied service, each user gets a smaller share of the bandwidth (there may be a policy giving some users higher priority than others, but the concept remains the same).
  2. Because vSphere and Bitfusion can obtain GPU assignments from anywhere on the network, load is statistically distributed across all GPUs in the organization, giving better utilization and minimizing 'no-GPUs-available' events.
  3. With Bitfusion's virtualization technology, GPUs are released as soon as the ML/AI framework completes execution (or fails). Utilization of GPUs is therefore maximized, without artificially locking GPUs to users or applications that might leave them idle.
  4. Even when GPUs or partial GPUs are momentarily unavailable, the client VM backs off for a few seconds (a configurable default) and retries the assignment request, as sketched after this list.
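Points 1 and 4 above can be sketched in a few lines of Python. The request_partial_gpu() function is hypothetical, standing in for whatever allocation call the client stack actually makes; the point is the shape of the logic: ask for a fraction of a GPU and, if nothing is free at that instant, back off briefly and retry.

    import random
    import time

    def request_partial_gpu(fraction):
        """Hypothetical allocation call standing in for the client stack.
        Returns a handle when capacity exists, or None when the cluster is busy."""
        return f"{int(fraction * 100)}%-of-gpu" if random.random() < 0.5 else None

    def acquire_with_backoff(fraction=0.5, retries=5, backoff_seconds=3):
        for _ in range(retries):
            handle = request_partial_gpu(fraction)
            if handle is not None:
                return handle                # e.g. 50% of a physical GPU for this run
            time.sleep(backoff_seconds)      # configurable back-off before retrying
        raise RuntimeError("no GPU capacity available after retries")

    # Example: request 40% of a GPU with a short back-off (may raise if the
    # simulated pool stays busy for all retries).
    handle = acquire_with_backoff(fraction=0.4, backoff_seconds=0.1)
    print("assigned:", handle)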

If vSphere's real-time, dynamic assignment of GPUs to clients strikes you as similar to storage area networks, you are right. Just as VMware is adding ML virtualization alongside compute, storage, and networking, providing GPU attachment to applications offers benefits similar to storage: higher utilization and better efficiency and productivity for the organization.

Please reach out to us at AskBitfusion@VMware.com with any questions or comments, and let's work together to profile your ML infrastructure.
