Kubeflow users, developers and contributors who rely on Kubeflow and its community as a critical element of their open source MLOps workflows gathered at Kubeflow Summit 2022, sponsored by Arrikto and DKube, on October 18 and 19 at the AMA Conference Center in San Francisco. In a hybrid setting, attendees discussed the operational and administrative aspects of the Kubeflow community and learned about the project’s governance and its future. (Incidentally, the most exciting news about Kubeflow’s future came just after the Summit closed.)
Want to learn more about Kubeflow’s exciting news and what you missed at the Summit? Read on for the details!
Day 1: History of Kubeflow and current use cases
The first day of the conference was dedicated to Kubeflow users and best practices for managing end-to-end machine learning workflows from training to deployment, and it kicked off with a panel discussion featuring Kubeflow founders Jeremy Lewi and David Aronchick.
During the panel, David shared the initial vision for the Kubeflow project: to provide declarative machine learning (ML) pipelines and ML infrastructure on top of Kubernetes that are easy to deploy and run from the start. Jeremy then outlined the three essential pieces of work that made Kubeflow what it is today:
- Moving the control plane into Kubernetes using a controller, which eventually became the TensorFlow job controller
- Notebooks running on Kubernetes
- GPU support on Kubernetes
As Kubernetes evolved, David and Jeremy realized it would change how we run workloads. They saw the trend of legacy systems being rebuilt as operators running on Kubernetes and seized the opportunity to build Kubeflow around these three core pieces.
In addition, David shared his thoughts on the future direction of Kubeflow. He would like to see Kubeflow follow the same strategy of “aggressively staying inside the box” that served Kubernetes well, providing declarative ML infrastructure that is easy to set up, maintain and extend. Jeremy, for his part, shared his thoughts on the ML community, noting that the infrastructure landscape has expanded so much that it has become confusing and challenging for end users. He sees significant gaps between the models being built and the actual business problems those models are meant to solve: many users are struggling to turn machine learning techniques into business solutions.
After the panel discussion with the co-founders, the Summit continued with presentations from Kubeflow end-user companies, including Aurora, Roblox and Samsung.
Aurora presented their use of Kubeflow Pipelines to accelerate ML model development for autonomous vehicles, along with lessons learned from their journey with Pipelines as their ML orchestration engine. Next, Roblox shared their reasons for choosing Kubeflow for their machine learning strategy, citing the importance of scalable compute infrastructure, core components, unified model serving and the strong community built around Kubeflow. Lastly, Samsung shared their internal platform built on Kubeflow, which adds features for container image management, access management and job monitoring. They also expressed interest in contributing their work back to the community.
Day 2: Kubeflow roadmap and how to grow the community
The second day of the Summit was dedicated to the operations and administration of the Kubeflow project and community. The day kicked off with my presentation of the 2022 Kubeflow User Survey results. I discussed the survey goals, top challenges, roadmap items and how the Kubeflow project and community have evolved since last year’s survey. Check out “Key Takeaways from the 2022 Kubeflow Community User Survey” to learn more.
Attendees then divided into working group (WG) breakout rooms to dig deeper into the top challenges and discuss roadmap items within their respective groups. For example, in the Pipelines WG breakout room, WG leads opened the floor for attendees to ask questions and provide feedback about Kubeflow Pipelines. Discussion topics included the release and upgrade process, community governance, v2 pipelines and alternative solutions to MinIO.
Following the breakout sessions, each WG and the current 1.7 release manager shared their updates with all attendees.
The Katib and Training Operator WGs started by sharing their journey and their “Low Bar & High Ceiling” design consideration, which aims to make components easy to get started with while still allowing extensibility and complex customization. In 1.7, distributed training jobs can be created with the Python SDK from within a notebook (available in the main branch), enabling consistent model training both locally and remotely. Following the same pattern, Katib will also support hyperparameter tuning from within a notebook so users can tune their models locally and remotely.
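Under the hood, the training SDK submits Kubeflow custom resources to the cluster. As a rough, illustrative sketch of what launching a distributed PyTorchJob from a notebook boils down to, here is the equivalent using the generic Kubernetes Python client rather than the Kubeflow SDK itself; the job name, namespace, image and replica counts below are placeholders.

```python
from kubernetes import client, config

# Inside a notebook pod; use config.load_kube_config() when running locally.
config.load_incluster_config()

# Placeholder PyTorchJob: one master and two workers running the same image.
pytorchjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "demo-pytorchjob", "namespace": "kubeflow-user"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {"name": "pytorch", "image": "my-registry/train:latest"}
                        ]
                    }
                },
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {"name": "pytorch", "image": "my-registry/train:latest"}
                        ]
                    }
                },
            },
        }
    },
}

# Submit the job to the Training Operator by creating the custom resource.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="kubeflow-user",
    plural="pytorchjobs",
    body=pytorchjob,
)
```

The Python SDK wraps this kind of call in a friendlier interface, which is what lets the same notebook code move between local experimentation and distributed training on the cluster.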
The Pipelines WG focused their session on Kubeflow Pipelines 2.0. During the session, Pipelines WG leads explained why 2.0 is needed and shared several of its goals (a short illustrative sketch follows the list):
- Provide first-class support for ML metadata, allowing users to define artifacts and take control of the lineage graph
- Enable a universal standard for defining a pipeline by moving to an Argo-independent pipeline representation, allowing platform-agnostic pipelines
- Deliver a modern, interactive UI with support for artifacts as nodes, groupings, and parameter and metric comparison across runs
- Offer a unified SDK experience for persisting and reusing components and pipelines
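To make the first two goals a bit more concrete, here is a minimal, illustrative sketch written against the v2 Python SDK (kfp 2.x). The component logic and names are hypothetical; the point is that artifacts flow between steps as first-class, typed objects, and the pipeline compiles to an Argo-independent IR YAML.

```python
from kfp import dsl, compiler
from kfp.dsl import Dataset, Input, Output


@dsl.component
def make_dataset(out_data: Output[Dataset]):
    # Write a tiny CSV; KFP tracks it as a first-class Dataset artifact.
    with open(out_data.path, "w") as f:
        f.write("feature,label\n1.0,0\n2.0,1\n")


@dsl.component
def count_rows(data: Input[Dataset]) -> int:
    # Consume the upstream artifact; the lineage between steps is recorded in ML metadata.
    with open(data.path) as f:
        return sum(1 for _ in f) - 1


@dsl.pipeline(name="artifact-demo")
def artifact_demo():
    step1 = make_dataset()
    count_rows(data=step1.outputs["out_data"])


# Compile to the platform-agnostic IR YAML (no Argo specifics in the output).
compiler.Compiler().compile(artifact_demo, "artifact_demo.yaml")
```

Because the compiler emits an intermediate representation rather than an Argo Workflow manifest, the same pipeline definition can, in principle, be executed by any backend that understands the IR.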
The Notebooks WG shared their plans to migrate the Notebook CRD to a Workbench CRD to improve support for non-Jupyter editors and dynamic workbenches, along with support for user groups and log access. The Manifests WG focused on enhancing testing for Kubeflow and defining the process for integrating new components into the manifests to enable collaboration with other open source projects.
The KServe WG shared their plan to graduate to 1.0. Current requirements for graduation include full support for the REST/gRPC v2 inference protocol across all frameworks and transformers, cloud storage unification across cloud providers, observability improvements, batch inference support and more.
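For context, the v2 inference protocol (also known as the Open Inference Protocol) standardizes request and response shapes across model servers. The sketch below shows what a v2 REST request can look like; the endpoint URL and model name are hypothetical, and in a real KServe deployment the request would typically go through the cluster’s ingress gateway with the appropriate Host header.

```python
import requests

# Hypothetical InferenceService host and model name, for illustration only.
url = "http://sklearn-iris.example.com/v2/models/sklearn-iris/infer"

# v2 protocol requests describe each input tensor by name, shape, datatype and data.
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [[6.8, 2.8, 4.8, 1.4]],
        }
    ]
}

resp = requests.post(url, json=payload, timeout=30)
resp.raise_for_status()
# The response mirrors the request structure: a list of named, typed output tensors.
print(resp.json()["outputs"])
```

The gRPC flavor of the protocol carries the same tensor and metadata structure over protobuf instead of JSON.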
Lastly, the 1.7 release manager shared the release timeline and expected themes for the upcoming release. The release is in progress, with open discussions on supporting multiple Kubernetes versions, improving the installation experience with a kustomize version upgrade, better integration with other popular machine learning tools, and security enhancements to address community concerns.
Afterward, the remaining sessions were problem-specific discussions on improving Kubeflow governance and structure. Notable topics included the following:
- Need for additional working groups to focus on specific domains like security and community outreach
- Definition of core and non-core Kubeflow components
- How to engage other machine learning communities
- How to set the path for different roles and responsibilities within the community
What’s next for Kubeflow
Following the Summit, Google shared the exciting news that Kubeflow has applied to become an incubating project in the Cloud Native Computing Foundation (CNCF).
The community shared their enthusiasm about the project’s next significant milestone. With multiple discussions started at the Summit to make improvements to the project, Kubeflow’s application to CNCF gives the community more incentive to build an even healthier project.
Due to the backlog of ongoing projects, it may take another two to four months for Kubeflow to join CNCF officially. In the meantime, the community plans to continue as before and take action on many discussions from the Summit.
Kubeflow Summit 2022 Session Playlist
Most of the sessions are now available in the Kubeflow Summit 2022 playlist. Unfortunately, some sessions are unavailable due to technical difficulties, but they may be re-recorded and become available to the community in the future. Stay tuned!
Stay tuned to the Open Source Blog and follow us on Twitter for more deep dives into the world of open source contributing.