Hardening Kubeflow Security for Enterprise Environments

The security of every application is vital to protect data, ensure regulatory compliance, maintain business continuity, safeguard reputation and trust, and mitigate risks. Security is an ongoing task and a key requirement for enterprise adoption. Trust takes time and effort to build but can be lost in days, so it is important to continually assess and improve a project's security measures.

This blog post aims to raise awareness of the existing Kubeflow 1.7 security flaws and of the ongoing community efforts to resolve them. Readers will come away with a methodology for applying a security-first mindset to machine learning operations, an understanding of both the risks and the benefits of adopting Kubeflow, and the ability to perform forward-looking risk management.

Kubeflow Introduction

Kubeflow is an open-source, Kubernetes-native Machine Learning Operations (MLOps) platform for building, scaling, and managing machine learning (ML) workflows.

Numerous companies, including Google, IBM, VMware, BGP, and AWS, joined forces to develop the Kubeflow project. The end-user community comprises companies from various industries, such as telecommunications, finance, medical, insurance, and other regulated sectors. With community support, the project continues to grow and is taking its next big step: joining the CNCF landscape as an incubating project.

Major reasons that make Kubeflow a platform of choice for so many users include:

The project is open source and vendor neutral
No alternative open-source ML orchestration platform is available
The on-premises platform is scalable, standardized and secure

Below is the Kubeflow Architecture Diagram

More information about Kubeflow components and architecture can be found here and here.

Kubeflow 1.7 Security Improvements

Many security improvements have been made in the Kubeflow 1.7 release, from setting up a Security Working Group to implementing and fixing various security-related enhancements. This section outlines some of the significant security advancements made up to Kubeflow 1.7.

1. Setting up Security Working Group

Setting up a dedicated security working group was identified as a key step toward applying consistent security efforts across the entire community. Such a group was established during the Kubeflow 1.7 release cycle. Its primary goals include:

Define clear policies and procedures on how to report and disclose vulnerabilities
Constantly enforce the use of security best practices
Provide places for discussion (Slack, bi-weekly meetings, and email)
Identify security risks
Collaborate and tackle security enhancements

Vulnerability Management

Periodic CVE scanning is a security best practice. Efforts have been made to identify all images used by Kubeflow. The table below shares insights from the first CVE scanning results in Kubeflow 1.7: the number of images used by each Working Group (WG) and the number of associated Common Vulnerabilities and Exposures (CVEs). With these insights, the broader Kubeflow community is working to reduce these numbers.
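As a rough illustration of how such a per-WG summary could be produced from raw scanner output, here is a minimal Python sketch. The record layout, the working group names, and the image names are all invented for this example and are not the actual Kubeflow 1.7 scan results. The sketch also assumes that CVEs are counted once per working group even when they appear in several images.

```python
from collections import defaultdict

# Hypothetical scan output: (working group, image, CVE ids found in that image).
# These records are invented for illustration and are NOT the Kubeflow 1.7 results.
SCAN_RESULTS = [
    ("wg-a", "registry.example/component-one:1.0", ["CVE-0000-0001", "CVE-0000-0002"]),
    ("wg-a", "registry.example/component-two:1.0", []),
    ("wg-b", "registry.example/component-three:2.1", ["CVE-0000-0002"]),
]


def summarize(results):
    """Return {working_group: (image_count, unique_cve_count)}."""
    images = defaultdict(set)
    cves = defaultdict(set)
    for group, image, found in results:
        images[group].add(image)
        cves[group].update(found)  # a set, so repeated CVE ids count once
    return {group: (len(images[group]), len(cves[group])) for group in images}


if __name__ == "__main__":
    for group, (n_images, n_cves) in sorted(summarize(SCAN_RESULTS).items()):
        print(f"{group}: {n_images} images, {n_cves} CVEs")
```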

2. Integrating Software Bill of Materials (SBOMs)

SBOM (Software Bill of Materials) has become an industry buzzword, and for good reason. Securing the supply chain by integrating SBOMs into Kubeflow is another initiative launched during the 1.7 release. An SBOM provides an exhaustive list of all software components and their licenses in a standardized, machine-readable format, which improves transparency. It helps keep the project license compliant and enables accurate identification and remediation of security vulnerabilities. An SBOM design document can be found here.
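To make the license-compliance benefit concrete, the following Python sketch reads an SPDX-style JSON document and lists packages whose license is outside an allow-list. The SBOM fragment, the package names, and the allow-list are assumptions made up for this example. Only the field names (packages, name, versionInfo, licenseConcluded) follow the SPDX 2.x JSON layout.

```python
import json

# Assumption: an allow-list of SPDX license identifiers chosen for this example.
ALLOWED_LICENSES = {"Apache-2.0", "MIT", "BSD-3-Clause"}

# Hypothetical SPDX-style SBOM fragment; package names and versions are made up.
SBOM_JSON = """
{
  "spdxVersion": "SPDX-2.3",
  "packages": [
    {"name": "example-lib", "versionInfo": "1.2.3", "licenseConcluded": "Apache-2.0"},
    {"name": "example-store", "versionInfo": "4.5.6", "licenseConcluded": "AGPL-3.0-only"}
  ]
}
"""


def license_violations(sbom, allowed):
    """Return (name, version, license) for packages outside the allow-list."""
    violations = []
    for pkg in sbom.get("packages", []):
        # Packages without a concluded license are treated as needing review.
        lic = pkg.get("licenseConcluded", "NOASSERTION")
        if lic not in allowed:
            violations.append((pkg.get("name"), pkg.get("versionInfo"), lic))
    return violations


if __name__ == "__main__":
    sbom = json.loads(SBOM_JSON)
    for name, version, lic in license_violations(sbom, ALLOWED_LICENSES):
        print(f"{name} {version}: license {lic} needs review")
```

The same traversal could be used to match package names and versions against a vulnerability feed, which is where the remediation benefit comes from.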

3. ServiceAccount-Based Authentication Enabled

One of the great improvements in Kubeflow 1.7 is a new authentication method: Kubernetes ServiceAccounts. Previously, only session-based authentication intended for humans was supported, so machine clients had to simulate a web browser. Now oidc-authservice, which is responsible for authentication, has a proper way to authenticate both human users and machines.
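From a machine client's point of view, the flow might look like the Python sketch below. It assumes that a projected ServiceAccount token is mounted into the pod and that it is presented as a bearer token in the Authorization header. The mount path and the helper name bearer_headers are illustrative, not something this post prescribes.

```python
from pathlib import Path

# Assumption: a projected ServiceAccount token is mounted into the pod.
# This path is illustrative; the real one depends on the pod spec.
DEFAULT_TOKEN_PATH = "/var/run/secrets/kubeflow/pipelines/token"


def bearer_headers(token_path=DEFAULT_TOKEN_PATH):
    """Read the ServiceAccount token and build an Authorization header.

    The file is re-read on every call because projected tokens are
    rotated by the kubelet before they expire.
    """
    token = Path(token_path).read_text().strip()
    if not token:
        raise ValueError(f"empty ServiceAccount token at {token_path}")
    return {"Authorization": f"Bearer {token}"}
```

The returned dictionary can be passed as request headers to whichever HTTP client the workload already uses, so no browser session or cookie handling is involved.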

4. Authenticate Most API Calls

The community worked to authenticate not only external API calls, those that first hit the Istio ingress gateway, but also service-to-service communication.

5. Hardening Istio & Network Policies

Previously, misconfigured Istio sidecars allowed the “userid” header to be easily faked (see the diagram above) and other services to be easily impersonated.

Kubeflow Access Management was also broken (issue), allowing a user to add themselves to other namespaces and take them over. All of this has been addressed by hardening Istio according to Istio security best practices (PR) and by adding NetworkPolicies.
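One way to picture this kind of hardening: a workload that trusts a user-identity header should only accept traffic from the component that sets that header. The Python sketch below builds an Istio AuthorizationPolicy manifest as a dictionary that admits requests only from a given mTLS principal. It is an illustration under stated assumptions, not the policy from the PR. The function name, the app label, and the namespace are hypothetical, and the gateway principal shown depends on how Istio is installed.

```python
import json


def ingress_only_policy(namespace, app_label, gateway_principal):
    """Build an Istio AuthorizationPolicy (as a dict) that only admits
    requests whose mTLS identity matches the given principal.

    With action ALLOW, requests that match no rule are denied, so a pod
    in another namespace cannot call the workload with a forged header.
    """
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": f"{app_label}-ingress-only", "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": {"app": app_label}},
            "action": "ALLOW",
            "rules": [{"from": [{"source": {"principals": [gateway_principal]}}]}],
        },
    }


if __name__ == "__main__":
    # Hypothetical values; adjust to the actual namespace and gateway identity.
    policy = ingress_only_policy(
        namespace="example-user-namespace",
        app_label="example-app",
        gateway_principal="cluster.local/ns/istio-system/sa/istio-ingressgateway-service-account",
    )
    print(json.dumps(policy, indent=2))
```

Principal matching relies on mutual TLS between sidecars, which is why sidecar configuration and header trust are closely linked.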

6. Enforce the Use of Lower Privilege RBACs

Kubeflow follows Kubernetes good practices such as least-privilege RBAC. The community closely monitors that users and ServiceAccounts are assigned only the minimum RBAC rights they need.
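A simple check that supports this kind of monitoring is to flag RBAC rules that use wildcards. The Python sketch below does that for a Role or ClusterRole given as a dictionary. The example role is invented, and treating every wildcard as suspicious is an assumption of this sketch rather than a documented community rule.

```python
def overly_broad_rules(role):
    """Return the rules of a Role/ClusterRole dict that use a '*' wildcard
    in apiGroups, resources, or verbs."""
    fields = ("apiGroups", "resources", "verbs")
    broad = []
    for rule in role.get("rules", []):
        if any("*" in rule.get(field, []) for field in fields):
            broad.append(rule)
    return broad


# Hypothetical role used only to demonstrate the check.
EXAMPLE_ROLE = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "example-role", "namespace": "example"},
    "rules": [
        {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]},
        {"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]},
    ],
}

if __name__ == "__main__":
    for rule in overly_broad_rules(EXAMPLE_ROLE):
        print("wildcard rule:", rule)
```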

7. 99% Rootless Containers

Running containers with root privileges used to be normal practice for Kubeflow, but not anymore. Significant progress has been made in this area: 99% of containers are now rootless. The remaining 1% is the Istio-CNI daemonset, which replaces the functionality of the istio-init container, a container that itself requires significant RBAC permissions. Istio-CNI is therefore the one exception that enterprises may choose to accept.

Rootless containers should be enforced not just in the default Kubeflow namespaces but also for new or customer containers that are unknown at installation time. For that, we use PodSecurityPolicies and their successor, Pod Security Standards.
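To show what "rootless" means at the pod-spec level, here is a small Python sketch that lists containers not guaranteed to run as non-root. It mirrors the Kubernetes rule that a container-level securityContext overrides the pod-level one. The function name and the example pod spec are hypothetical, and the check is deliberately strict: a container is flagged unless runAsNonRoot is explicitly true.

```python
def containers_not_enforcing_non_root(pod_spec):
    """Return names of containers that are not guaranteed to run as non-root.

    A container passes only if runAsNonRoot is true (at container or pod
    level) and runAsUser is not explicitly set to 0.
    """
    pod_ctx = pod_spec.get("securityContext", {})
    flagged = []
    containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    for container in containers:
        ctx = container.get("securityContext", {})
        # Container-level settings take precedence over pod-level ones.
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        user = ctx.get("runAsUser", pod_ctx.get("runAsUser"))
        if not non_root or user == 0:
            flagged.append(container["name"])
    return flagged


# Hypothetical pod spec: one hardened container and one with no settings.
EXAMPLE_POD_SPEC = {
    "containers": [
        {"name": "notebook", "securityContext": {"runAsNonRoot": True, "runAsUser": 1000}},
        {"name": "helper"},
    ]
}

if __name__ == "__main__":
    print(containers_not_enforcing_non_root(EXAMPLE_POD_SPEC))
```

In a cluster, this kind of rule is enforced by the admission layer rather than by a script; the sketch only illustrates which fields such a policy inspects.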

Still Present Security Exploits

Even though a lot has been done in the Kubeflow 1.7 release, some exploits remain unresolved. The Security Working Group's Kubeflow 1.8 roadmap is available here. This section outlines some of the most critical exploits the community is working on and hopes to fix in the next few releases.

1. Profile Controller Permissions

The diagram shows two controllers in the Kubeflow system namespace: the Profile Controller, which creates user namespaces and adds role bindings and service accounts, and the Pipeline Profile Controller, which extends those namespaces with secrets and a deployment for pipelines. Both controllers use the cluster-admin role, which means that a single exploit in the Kubeflow namespace could give an attacker cluster-admin rights. Because the two controllers were developed independently, it would be more efficient to merge them into one controller that uses a reduced cluster role.
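As a sketch of what a reduced cluster role could look like, the Python snippet below builds a ClusterRole manifest limited to the resource types named above: namespaces, role bindings, service accounts, secrets, and deployments. The role name and the verb list are assumptions. A real merged controller would likely need further rules, for example for its own custom resources, and Kubernetes also restricts which role bindings a subject may create, so this is an illustration rather than a proposed design.

```python
# Assumption: the verbs a reconciling controller typically needs.
VERBS = ["get", "list", "watch", "create", "update", "patch", "delete"]


def reduced_controller_role(name="example-merged-profile-controller"):
    """Build a ClusterRole (as a dict) scoped to the resources the two
    profile controllers are described as managing, with no wildcards."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {
                "apiGroups": [""],  # core API group
                "resources": ["namespaces", "serviceaccounts", "secrets"],
                "verbs": list(VERBS),
            },
            {
                "apiGroups": ["rbac.authorization.k8s.io"],
                "resources": ["rolebindings"],
                "verbs": list(VERBS),
            },
            {
                "apiGroups": ["apps"],
                "resources": ["deployments"],
                "verbs": list(VERBS),
            },
        ],
    }


if __name__ == "__main__":
    for rule in reduced_controller_role()["rules"]:
        print(rule["apiGroups"], rule["resources"])
```

Compared with cluster-admin, an exploit of a controller bound to a role like this would be limited to the listed resource types.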

2. Namespace Sharing

To increase collaboration between users and teams, Kubeflow supports namespace sharing. For example, Alice can be added as a contributor to Bob's namespace. However, the current configuration allows Alice to escalate her permissions from default-viewer to default-editor. In addition, all secrets, service accounts, and pods must be regenerated when a collaborator is removed; otherwise, the removed collaborator could still use the service account token to connect to Bob's namespace.

3. Multi-User Artifact Storage Minio/S3

Another major security exploit that still exists concerns the use of Minio as artifact storage by the Kubeflow Pipelines component. The main issue is that all users share the Minio admin secrets. This means that any user can read other users' artifacts and, moreover, a single user can destroy the entire storage. In addition, Minio's license has changed from Apache 2 to AGPL. The community considers it unacceptable that Kubeflow uses a nearly three-year-old, insecure image. An alternative object store is being sought, and a fix for the credentials issue is being planned.
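One possible direction for the credentials fix is to give each user namespace its own credentials with a policy scoped to its own prefix. The Python sketch below generates an S3-style policy document of that kind. The bucket name, the artifacts/<namespace>/ prefix layout, and the function name are assumptions for illustration; the post does not specify the actual planned design.

```python
import json


def namespace_policy(bucket, namespace):
    """Build an S3-style policy (as a dict) that limits access to objects
    under artifacts/<namespace>/ in the given bucket."""
    prefix = f"artifacts/{namespace}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is allowed only within the namespace's own prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Object reads, writes, and deletes are limited to that prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and namespace names.
    print(json.dumps(namespace_policy("example-bucket", "bob"), indent=2))
```

With a policy like this attached to per-namespace credentials, a user could neither read another namespace's artifacts nor delete the whole store, which are the two failure modes described above.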

4. Multi-user ML-MetaData

Orthogonal to the artifact storage is ML-Metadata (MLMD), which is used for lineage tracking of artifacts, including those produced by pipeline runs. Like Minio, MLMD has no multitenancy support: it is shared by all users. The community is actively working toward isolating it per user, and this work is expected to be contributed upstream in upcoming releases.

The Kubeflow community has put a lot of effort into hardening Kubeflow security. A lot has been achieved, but there is still work to be done. Ensuring the security of a project is everyone's responsibility. The Kubeflow Security Working Group constantly monitors and pushes these efforts forward, but we all need to participate: the broader community with its WG domain-specific expertise, and the user community by reporting vulnerabilities and sharing insights. Together we can produce a robust, trustworthy MLOps platform. For more detailed information, watch our KubeCon 2023 talk, “Hardening Kubeflow security for enterprise environments.”

Stay tuned to the Open Source Blog and follow us on Twitter for more deep dives into the world of open source contributing.