The evolving landscape of data collection challenges traditional artificial intelligence (AI) processing methods, because data worldwide increasingly sits in isolated islands. Likewise, data privacy and security regulations impose a significant compliance burden on collecting and using this data. Federated learning has emerged as a distributed learning paradigm that lets enterprises collaborate to solve the isolated-data problem while addressing the critical issues of data privacy and data governance.
FATE (Federated AI Technology Enabler) is an open source project hosted by the Linux Foundation. It provides a secure computing framework to support the federated AI ecosystem, and it implements secure computation protocols based on homomorphic encryption and multi-party computation (MPC).
KubeFATE is designed to provision, orchestrate, operate, and manage FATE-based federated learning systems on Kubernetes in data centers or multi-cloud environments, and it exploits the advantages of the cloud computing delivery model. Using cloud-native technology, it manages the system in the form of a Federated Learning Cluster (FLC), which includes the FATE service, all other services it depends on, and their configurations.
Deploying and operating a federated learning environment within traditional IT infrastructure can be challenging. The KubeFATE on VMware Cloud Foundation with Tanzu solution aims to address the complexity of AI solution integration with IT infrastructure to rapidly deploy federated learning systems and the underlying infrastructure with resiliency and security.
Why KubeFATE on VMware Cloud Foundation with Tanzu?
VMware Cloud Foundation™ with VMware Tanzu™ automates full stack deployment and operation of Kubernetes clusters through integration with VMware Tanzu Kubernetes Grid™ and VMware Tanzu Mission Control. VMware Cloud Foundation with Tanzu helps eliminate manual steps for configuring hosts, creating logical relationships, and managing hypervisors, so applications can be deployed faster at scale.
Here are the top four benefits of deploying and operating KubeFATE for federated learning on VMware Cloud Foundation with Tanzu.
Full stack integration
The VMware Cloud Foundation software stack lifecycle is automated and the complete lifecycle management greatly reduces risks and increases IT operational efficiency. Tanzu Kubernetes cluster deployment is fully integrated with the VMware vSphere SDDC stack, including storage, networking, and authentication.
Consistent operations and infrastructure for hybrid and multi-cloud
VMware Cloud Foundation with Tanzu provides edge, private, and public cloud workload deployment options for a true hybrid cloud solution that maintains the flexibility of networking and topology. It allows enterprises to build, run, manage, connect, and protect any app on any cloud or across clouds with a consistent experience.
Security
FATE applies various security protocols, including homomorphic encryption, secret sharing, RSA, and Diffie-Hellman, to different algorithms to meet security, audit, and legal requirements. Each party's data is stored locally, which protects data privacy, supports compliance with laws and regulations, and prevents data from leaking outside the organization. During modeling and inference, the parties exchange only encrypted intermediate results computed from their local data, so that information security is addressed at the levels of algorithm design, encryption strength, and communication security.
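To make the idea of exchanging only protected intermediate results more concrete, here is a minimal sketch of additive secret sharing, one of the protocol families named above. It is an illustration under stated assumptions, not FATE's implementation: the modulus, the function names `make_shares` and `reconstruct`, and the two-party sum are all hypothetical choices made for this example.

```python
import secrets

# Illustrative modulus for share arithmetic (an assumption, not a FATE parameter).
PRIME = 2**61 - 1


def make_shares(value, n_parties):
    """Split an integer into n additive shares that sum to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares):
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME


# Each party holds a private value that never leaves its site in the clear.
party_a_value = 1200
party_b_value = 345

a_shares = make_shares(party_a_value, 2)
b_shares = make_shares(party_b_value, 2)

# Party A keeps a_shares[0] and receives b_shares[0];
# party B keeps b_shares[1] and receives a_shares[1].
# Each party adds the shares it holds, producing a partial result.
partial_a = (a_shares[0] + b_shares[0]) % PRIME
partial_b = (a_shares[1] + b_shares[1]) % PRIME

# Only the partial results are combined; neither input is revealed on its own.
total = reconstruct([partial_a, partial_b])
print(total)
```

A single share is a uniformly random number, so it reveals nothing about the value it was derived from. Only the combined result, here the sum of the two private values, becomes visible.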
Automated end-to-end lifecycle management
KubeFATE will minimize the federated learning workload impact and downtime during the necessary patching and upgrading of the full private cloud stack using automated and self-managed services within the workload domain.
How to Deploy?
VMware Cloud Foundation with Tanzu automates the full stack deployment and operation of Kubernetes clusters through integration with VMware Tanzu Kubernetes, making it easy to stand up the underlying vSphere infrastructure and set up VMware NSX and the NSX Edge clusters. With Workload Management enabled, we can easily provision Tanzu Kubernetes clusters.
Figure 1 shows the architecture of a two-party KubeFATE cluster on VMware Cloud Foundation with Tanzu. In each organization, we deployed a VMware Cloud Foundation instance consisting of a management domain and a workload domain. The 4-node management domain cluster hosts multiple management virtual machines and appliances. For the workload domain, we created another 4-node cluster with Workload Management enabled and provisioned a Tanzu Kubernetes cluster. For each party, we deployed an FLC in a namespace of its Tanzu Kubernetes cluster, and for one party we also deployed the Exchange service in a separate namespace. Similarly, we can deploy a multi-party KubeFATE cluster across geographies by adding more building blocks equivalent to Organization 2.
As Figure 2 shows, a KubeFATE cluster includes three components:
- A KubeFATE service to deploy/manage several Federated Learning Clusters.
- One or many FLCs to run federated learning workloads. An FLC is a FATE cluster that includes the following components:
  - A FATE Flow service to schedule and manage federated learning jobs
  - A MySQL database to store metadata
  - An Nginx/Pulsar server to synchronize job status and data between different FLCs
  - A Jupyter Notebook for users to build and run federated learning jobs
  - A FATE Board to visualize the status of federated learning workloads
  - A Spark cluster to run the actual federated learning workload
  - An HDFS cluster to store training datasets and intermediate results
- An optional Exchange service to manage the connection information between FLCs. The Exchange service can be deployed on Kubernetes and in a demilitarized zone (DMZ).
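The component breakdown above can be sketched as a small data model. This is only an illustration of how the pieces relate: the class names, namespace names, and service labels are hypothetical and are not taken from KubeFATE's actual configuration format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Services inside one FLC, as listed in the text (labels are illustrative).
FLC_SERVICES: Tuple[str, ...] = (
    "FATE Flow",
    "MySQL",
    "Nginx/Pulsar",
    "Jupyter Notebook",
    "FATE Board",
    "Spark",
    "HDFS",
)


@dataclass
class FederatedLearningCluster:
    """One FLC deployed in a Kubernetes namespace for a single party."""
    party: str
    namespace: str  # hypothetical namespace name
    services: Tuple[str, ...] = FLC_SERVICES


@dataclass
class KubeFateCluster:
    """A KubeFATE service managing FLCs, plus an optional Exchange service."""
    flcs: List[FederatedLearningCluster] = field(default_factory=list)
    exchange_namespace: Optional[str] = None  # None means no Exchange is deployed

    def describe(self) -> List[str]:
        lines = [f"KubeFATE service managing {len(self.flcs)} FLC(s)"]
        for flc in self.flcs:
            lines.append(
                f"FLC for {flc.party} in namespace '{flc.namespace}' "
                f"with {len(flc.services)} services"
            )
        if self.exchange_namespace is not None:
            lines.append(f"Exchange service in namespace '{self.exchange_namespace}'")
        return lines


# Two-party layout similar to the one described for Figure 1.
two_party = KubeFateCluster(
    flcs=[
        FederatedLearningCluster("Organization 1", "fate-party-1"),
        FederatedLearningCluster("Organization 2", "fate-party-2"),
    ],
    exchange_namespace="fate-exchange",
)

for line in two_party.describe():
    print(line)
```

Adding a third party in this model means appending one more `FederatedLearningCluster`, which mirrors the idea of extending the deployment with another building block equivalent to Organization 2.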
Solution and Key Results
The validation showcases VMware Cloud Foundation with Tanzu operating and managing the KubeFATE federated learning platform in a fully integrated SDDC environment.
Key results can be summarized as follows:
- Deploying an instance: Provides a quick guide to multi-party network configuration, KubeFATE cluster deployment, and validation of the multi-party connection.
- Running a federated learning workload: Showcases the federated training workflow using the integrated Jupyter Notebook and FATE Board for job management, model evaluation, and prediction. We chose the credit card fraud dataset (https://www.kaggle.com/mlg-ulb/creditcardfraud) as an example.
- Resilience tests: Demonstrate that the solution maintains the service continuity and stability of KubeFATE in failure scenarios such as disk and host failures.
- Best practices: Provides best practices for deploying the infrastructure, along with sizing guidelines for planning CPU, memory, storage, and network bandwidth for a workload of a given scale.
- Use cases: Categorizes a broad range of use cases in both vertical federated learning and horizontal federated learning.
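The difference between the horizontal and vertical federated learning mentioned in the use cases can be sketched with a toy table. This is a minimal illustration only: the rows, column names, and helper functions are invented for this example and do not come from the dataset or from FATE.

```python
# Toy records; values and column names are made up for illustration.
rows = [
    {"id": 1, "amount": 120.0, "hour": 9, "label": 0},
    {"id": 2, "amount": 9800.0, "hour": 3, "label": 1},
    {"id": 3, "amount": 45.5, "hour": 14, "label": 0},
    {"id": 4, "amount": 310.0, "hour": 22, "label": 0},
]


def horizontal_split(records, n_first):
    """Horizontal FL: parties share the same columns but hold different samples."""
    return records[:n_first], records[n_first:]


def vertical_split(records, key, first_columns):
    """Vertical FL: parties hold the same samples (joined on key) but different columns."""
    first = [
        {c: r[c] for c in r if c == key or c in first_columns} for r in records
    ]
    second = [
        {c: r[c] for c in r if c == key or c not in first_columns} for r in records
    ]
    return first, second


h_a, h_b = horizontal_split(rows, 2)
v_a, v_b = vertical_split(rows, "id", {"amount"})

print("horizontal:", sorted(h_a[0]), "|", sorted(h_b[0]))
print("vertical:  ", sorted(v_a[0]), "|", sorted(v_b[0]))
```

In the horizontal case both parties see the full set of columns for their own samples. In the vertical case the parties must first align on the shared `id` column before any joint training can take place.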
KubeFATE on VMware Cloud Foundation with Tanzu simplifies the deployment and management of federated learning systems and workloads. KubeFATE enables VMware’s partners and customers to provision and manage industrial-grade FL clusters on demand and to run FL workloads according to their business needs.
For more details, visit the KubeFATE on VMware Cloud Foundation with Tanzu reference architecture.