
Recap of the MLOps/Kubeflow User Meetup Beijing Session

The 2023 MLOps/Kubeflow Beijing User Meetup, organized by VMware on May 18th, brought together community and industry experts to explore cloud-native technologies in the machine learning operations (MLOps) field, with a particular focus on the open source project Kubeflow. The event provided a platform for technical sharing and in-depth discussions, where seasoned MLOps experts from the Kubeflow and Kubernetes communities presented a wide range of topics, including the latest progress in the Kubeflow ecosystem, a comprehensive introduction to Project vSphere Machine Learning Extension and its innovative GPU management capabilities, noteworthy achievements of Project Volcano, and insightful case studies featuring Project KubeDL and the Inspur AIStation platform.

Industry leaders and experts from renowned companies, including NVIDIA, Alibaba, Megvii, Lenovo, AsiaInfo, and xFusion, gathered at the meetup to explore the latest trends in MLOps and Kubeflow. The meetup provided a great opportunity to share knowledge and collaborate, encouraging new ideas and advancements in cloud-native technologies for MLOps. The enlightening presentations and interactive discussions sparked exciting conversations, inspiring participants and bringing fresh energy and enthusiasm to the community.

Dr. Xiangjun Song, Sr. Director of VMware Cloud Infrastructure Business Group China and lead of the VMware China AI Lab, delivered the opening speech, reviewing the history of Artificial Intelligence (AI) and highlighting VMware’s contributions to the field. He emphasized VMware’s active involvement in the machine learning open source community and its achievements in advancing Kubeflow and MLOps. Dr. Song’s speech set the stage for the event, showcasing VMware’s commitment to driving cloud-native AI technologies.

Dr. Xiangjun Song kicking off the meetup highlighting VMware’s contribution to AI field

Next, Anna Jung, VMware’s Sr. Open Source Engineer and a Kubeflow community contributor, delivered a captivating video presentation. She highlighted the importance of the Kubeflow community and explored the unique features and benefits of the open source platform. Anna’s presentation deepened attendees’ understanding of Kubeflow while showcasing VMware’s leadership in the cloud-native domain and its commitment to open source technologies.

Anna Jung presenting VMware’s leadership in machine learning open source communities

With a diverse range of sessions covering various aspects of MLOps, the first was an introduction to Project vSphere Machine Learning Extension by Xintong Zhu, a Software Engineer at VMware and a core developer of the project. As AI/ML finds widespread application across industries, more users are developing machine learning projects in hybrid cloud environments. To address this demand, VMware is dedicated to developing MLOps tools that facilitate machine learning operations for vSphere customers. During the session, Xintong highlighted the core features of Project vSphere Machine Learning Extension, walked through its installation on a Tanzu Kubernetes Cluster (TKC), and illustrated its key components.

Xintong Zhu presenting introduction to Project vSphere Machine Learning Extension

Following Xintong’s introduction to Project vSphere Machine Learning Extension, Jinheng Xu, a fellow engineer on the project, led a session focused on advanced GPU management capabilities. As the demand for GPU computing power in AI/ML workloads continues to grow, efficient GPU management for MLOps platforms in Kubernetes environments has become crucial. Jinheng’s presentation covered the advanced GPU management capabilities of Project vSphere Machine Learning Extension, such as virtual GPU configuration, GPU information monitoring, and cross-cluster GPU scheduling, and he provided detailed explanations of the underlying principles and practical applications of each.

Jinheng Xu presenting Project vSphere Machine Learning Extension’s GPU management capabilities

The next presentations featured noteworthy open source projects: Volcano and KubeDL. Leibo Wang, the project lead of Project Volcano, presented an overview of the project. As the first cloud-native batch computing project under the Cloud Native Computing Foundation (CNCF), Volcano has gained significant traction in the community, with over 500 global contributors and widespread adoption across various industries. Volcano supports popular computing frameworks such as TensorFlow, PyTorch, MPI, MindSpore, PaddlePaddle, Spark, and Flink, empowering users to optimize AI distributed training and big data analytics. The session covered the project’s implementation principles, performance enhancements for distributed training and big data use cases, and real-world case studies.

Project KubeDL was presented by Qiukai Chen, one of the maintainers of the KubeDL community. With the booming development of the AI application ecosystem, computing power requirements have grown rapidly, driving up the cost of model training as hardware iterations progress. The presentation addressed the challenge of efficiently scheduling AI workloads in large-scale heterogeneous computing resource pools and stressed KubeDL’s cloud-native approaches to improving model training efficiency, fault tolerance, and resource utilization. The session shared practical experiences from Alibaba Group’s large-scale clusters, providing an overview of KubeDL, practical tips for AI workload management, and real-world examples.

Qiukai Chen presenting KubeDL and practical experience and tips for ML workloads

Finally, Chao Wang, a Senior Product Manager at Inspur, closed the meetup with practical insights into the Inspur AIStation platform. The presentation addressed the need for improvements in cloud-native platforms driven by the rapid growth of AI-generated content (AIGC) and large-scale models, focusing on the challenges infrastructure platforms face in large-scale model training and deployment. Drawing on practical experience, it offered valuable operational and application insights into the Inspur AIStation platform and explored effective strategies for building and optimizing infrastructure platforms in response to the fast-paced advancements in AIGC and large-scale models.

Summary

The 2023 MLOps/Kubeflow Beijing User Meetup was a successful gathering of professionals from various companies who came together to share their expertise and experiences, and it embodied the power of collaboration with industry leaders in discussing the evolving landscape of MLOps. We thank the speakers and attendees for making the event an open, collaborative environment that creates opportunities for future partnerships. With the excitement and momentum of unlocking new possibilities, we eagerly anticipate future gatherings that ignite innovation and collaboration in MLOps and Kubeflow.