
ML Analytics – Get Insights into Your GPU Farm Utilization

Are your GPU servers set up as a well-organized, monitored cluster? Or, as for most of us, were they deployed over the last 2-3 years in a scattered, siloed and uncoordinated way, so that no one really knows how they are being utilized?

Organizations are quickly adopting machine learning, AI and data science workloads, which increasingly rely on accelerated compute hardware (such as GPUs, FPGAs and AI ASICs). However, the new ML infrastructure is deployed with no virtualization, no ability to share applications and no abstraction layer. Said differently, ML infrastructure is deployed today as bare metal, very much like CPU servers in the 70s and 80s. As a result, the utilization and efficiency of AI servers (e.g. GPU servers) are very low, and other limiting factors such as organizational silos follow. To make data-driven infrastructure deployment decisions, organizations need to gather enough data to measure, chart and report GPU utilization over time across the organization. With this data, ML infrastructure administrators can get clear insights into the utilization economics, spot where there is waste, and see the most economical way to add capacity, share capacity or even re-package or re-locate capacity for other departments in the organization.

Now with Bitfusion, VMware has such an assessment tool. FlexDirect Analytics is a passive software sensor and monitoring tool that gives the ML infrastructure manager insight and visibility into the utilization of the GPU servers. It provides time-series data of utilization and efficiency for each physical GPU in your network, without any impact on or change to your workloads, workflow or deployment. The utilization and efficiency time series can be exported for analysis with off-line tools. FlexDirect Analytics is a lightweight process with a very low memory and compute footprint. Installing and running it is simple and takes only a few minutes. The monitoring process runs in the background on each GPU server. FlexDirect Analytics can run in a public cloud (with GPU instances), in a private cloud and on GPU servers from any known hardware supplier.

The FlexDirect Analytics tool measures GPU server usage in two dimensions: efficiency and utilization. The two metrics are distinct and independent, and both provide the required guidance for capacity planning assisted by vSphere ML/AI Virtualization.

  1. Utilization – measures whether the GPU is running a workload or not. In other words, at every time tick (10 sec) each physical GPU is polled to check whether any workload is being performed, so each tick has one of two states: “1” (used) or “0” (not used). A key point: the GPU server may be performing AI/ML work, feature engineering, data cleaning, etc. without the GPU taking part in it. Utilization here covers only the time windows in which CUDA APIs are performing some work.
  2. Efficiency – measures how much of the GPU memory and compute are being used. For example, the GPU may be running a certain ML workload (e.g. AlexNet), so the utilization state is “1”, yet only a fraction of the GPU compute and memory is being used (e.g. 32%). Efficiency is measured as a percentage: 0% – 100%.
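To make the two metrics concrete, here is a minimal Python sketch of how they could be aggregated from per-tick samples. It is an illustration only, not FlexDirect Analytics code: the `Sample` type, the function names and the choice to average efficiency over busy ticks only are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

TICK_SECONDS = 10  # polling interval mentioned in the text


@dataclass
class Sample:
    """One poll of one physical GPU (hypothetical structure)."""
    busy: int          # 1 if a workload was running during the tick, else 0
    efficiency: float  # percent of GPU compute and memory in use, 0-100


def utilization_pct(samples: List[Sample]) -> float:
    """Share of ticks in which the GPU was running a workload."""
    if not samples:
        return 0.0
    return 100.0 * sum(s.busy for s in samples) / len(samples)


def mean_efficiency_when_busy(samples: List[Sample]) -> float:
    """Average efficiency over busy ticks only (an assumption of this sketch)."""
    busy = [s.efficiency for s in samples if s.busy]
    if not busy:
        return 0.0
    return sum(busy) / len(busy)


# Four ticks (40 seconds): busy for two of them, idle for two.
samples = [Sample(1, 32.0), Sample(1, 40.0), Sample(0, 0.0), Sample(0, 0.0)]
print(utilization_pct(samples), mean_efficiency_when_busy(samples))
```

Keeping the two numbers separate matters: a GPU can be busy most of the time yet use only a small part of its compute and memory, or the reverse.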

Charting both metrics over time reveals two opportunities to modernize the ML infrastructure. Low utilization of the GPU cluster lends itself to improvement with GPU remote attachment, where more users access and share the GPUs over the network. Efficiency lower than 100% (e.g. 32%) provides the opportunity to use partial GPUs (carving up a physical GPU into smaller vGPU entities), which is another means of modernizing the GPU infrastructure. Remote-attached and partial GPUs are both distinct capabilities of Bitfusion with vSphere, and both improve efficiency and utilization. The FlexDirect Analytics tool can also export the collected data in CSV format. These files can be loaded into post-processing tools (such as Excel) to generate automated ROI figures and business cases that justify an ML/AI infrastructure modernization with vSphere and Bitfusion. Please reach out to us with any questions or comments you have, and let's work together to profile your ML infrastructure.
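As a rough illustration of such post-processing, the Python sketch below summarizes an exported CSV per GPU. The real export layout is not described here, so the column names `gpu_id`, `busy` and `efficiency` are hypothetical and would need to be adapted to the actual file.

```python
import csv
import io
from collections import defaultdict


def summarize(csv_file):
    """Return per-GPU utilization and mean busy-tick efficiency, in percent.

    Assumes one row per GPU per tick with the hypothetical columns
    gpu_id, busy ("1" or "0") and efficiency (0-100).
    """
    ticks = defaultdict(int)
    busy_ticks = defaultdict(int)
    efficiency_sum = defaultdict(float)

    for row in csv.DictReader(csv_file):
        gpu = row["gpu_id"]
        ticks[gpu] += 1
        if row["busy"].strip() == "1":
            busy_ticks[gpu] += 1
            efficiency_sum[gpu] += float(row["efficiency"])

    summary = {}
    for gpu, total in ticks.items():
        busy = busy_ticks[gpu]
        summary[gpu] = {
            "utilization_pct": 100.0 * busy / total,
            "mean_efficiency_pct": efficiency_sum[gpu] / busy if busy else 0.0,
        }
    return summary


# Made-up sample data, for illustration only.
SAMPLE_CSV = """gpu_id,busy,efficiency
gpu0,1,32
gpu0,0,0
gpu1,1,80
gpu1,1,60
"""

for gpu, stats in summarize(io.StringIO(SAMPLE_CSV)).items():
    print(gpu, stats)
```

With a real export, `summarize(open(path, newline=""))` would take the place of the in-memory sample. Low `utilization_pct` values point toward sharing GPUs over the network, and low `mean_efficiency_pct` values point toward partial GPUs.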


