Architecting VMware vSphere Kubernetes Service on VCF: Top Webinar and Field Questions Answered

If you missed our recent webinar on VMware vSphere Kubernetes Service (VKS) on VMware Cloud Foundation, don’t worry: we’ve got you covered. Caleb Washburn (Momentum AI, CIO) and I teamed up to talk about architecture and design, while our Principal Architect for VCF Professional Services, Libby Shen, tackled many great questions from the audience. In this post, we’re bringing those answers directly to you, combining the best Q&A from the live session with the top questions we hear every day while working on customer engagements.

Architecture, Availability Zones, and Deployments

Q: What is the minimum number of VMware ESX hosts required to set up a VMware vSphere Supervisor, VKS cluster, and an Availability Zone (AZ)?
A: The standard minimum requirement for an ESX host cluster to support VMware vSAN and HA is three hosts. An AZ in VCF represents a logical construct of an independent physical failure domain, which typically means a minimum of three hosts per zone to maintain quorum and availability.

Q: Why not use a single stretched ESX metro cluster across two data centers for the management control plane? Is that supported?
A: Stretched clusters are fully supported for both the Management Domain and Workload Domains in VCF 9.x. However, VCF 9.1 heavily promotes the 3-Zone Deployment Model as the modern standard for native Kubernetes HA, as it provides better fault tolerance without the split-brain risks sometimes associated with 2-site metro clusters.

Q: If worker nodes sit in ESX hosts, where do the vSphere Supervisor control plane VMs and workload cluster control plane VMs sit?
A: In VKS, both the vSphere Supervisor control plane and the workload cluster nodes ultimately run on ESX. The vSphere Supervisor control plane VMs are deployed onto the Supervisor-enabled vSphere cluster and managed by vCenter/VCF. When a workload cluster is created, its Kubernetes control plane nodes and worker nodes are also provisioned as VMs on the ESX hosts, typically spread by DRS/HA according to placement and availability rules. So ESX is the common substrate; the distinction is whether a VM belongs to the vSphere Supervisor platform control plane or to a tenant/workload Kubernetes cluster.

Q: Can zones be leveraged to deploy HA for node pools of a K8s cluster?
A: Yes. In VCF 9.1, node pools can be distributed across vSphere Zones. This allows you to achieve true high availability by ensuring that a single cluster’s worker nodes span multiple physical failure domains.
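To make the idea concrete, here is a minimal Python sketch that builds a Cluster API style manifest with one worker node pool per zone. Treat it as an illustration under assumptions rather than a reference manifest: the zone names, the cluster class name, the worker class name, and the version string are hypothetical placeholders, and a real VKS cluster needs additional settings (such as VM class and storage class variables) that are omitted here.

```python
import json

def zoned_cluster(name, namespace, zones, cluster_class, k8s_version,
                  workers_per_zone=1):
    """Build a Cluster manifest (as a dict) with one node pool per zone."""
    pools = [
        {
            "class": "node-pool",        # worker class name: an assumption
            "name": f"pool-{zone}",
            "replicas": workers_per_zone,
            "failureDomain": zone,       # pins this pool to a single zone
        }
        for zone in zones
    ]
    return {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "topology": {
                "class": cluster_class,
                "version": k8s_version,
                "controlPlane": {"replicas": 3},
                "workers": {"machineDeployments": pools},
            }
        },
    }

# All values below are hypothetical placeholders.
cluster = zoned_cluster(
    name="demo-cluster",
    namespace="demo-ns",
    zones=["zone-a", "zone-b", "zone-c"],
    cluster_class="example-cluster-class",
    k8s_version="v0.0.0-placeholder",
)
print(json.dumps(cluster, indent=2))
```

Each pool carries its own failureDomain, so losing one zone removes only the workers in that pool.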

Q: Why would you use the Virtual Machine Service of the vSphere Supervisor to deploy a virtual machine vs. deploying a virtual machine the “classic” way via vCenter?
A: Deploying a VM via the “classic” vCenter UI is an imperative, manual action. The VM Service lets developers provision virtual machines through Kubernetes declarative APIs (YAML files), enabling a self-service GitOps model. This treats VMs like Kubernetes objects, bringing them into the same automated CI/CD deployment pipelines used for containers.
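As a rough illustration of what "VMs as Kubernetes objects" looks like, the Python sketch below builds a VirtualMachine manifest as a plain dictionary and prints it as JSON. The field names follow the v1alpha1 VM Operator API as we understand it and may differ in newer API versions, so check them against your environment. The VM name, namespace, VM class, image, and storage class are hypothetical.

```python
import json

def virtual_machine_manifest(name, namespace, vm_class, image, storage_class):
    """Build a VM Service VirtualMachine manifest as a plain dict."""
    return {
        # API version is an assumption; newer releases expose later versions.
        "apiVersion": "vmoperator.vmware.com/v1alpha1",
        "kind": "VirtualMachine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "className": vm_class,          # sizing (CPU/memory) class
            "imageName": image,             # VM image published to the namespace
            "storageClass": storage_class,  # maps to a vSphere storage policy
            "powerState": "poweredOn",
        },
    }

# Hypothetical values for illustration only.
vm = virtual_machine_manifest(
    name="demo-vm",
    namespace="demo-ns",
    vm_class="example-vm-class",
    image="example-image",
    storage_class="example-storage-policy",
)
print(json.dumps(vm, indent=2))
```

Because kubectl accepts JSON as well as YAML, output like this can be committed to Git and applied by the same pipeline that deploys container workloads.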

VMware Cloud Foundation Automation Integration

Q: Does the External IP Block for vSphere Supervisor in vCenter have to be different from the VCF Automation Organization IP Space Block, or can we partition the subnet?
A: The vSphere Supervisor External IP Block and the VCF Automation Organization IP Space / External IP Block do not necessarily need to come from completely different routed networks. They can be carved from the same larger routed subnet. However, the actual ranges assigned to each consumer should be non-overlapping unless they are managed through the same VMware NSX/VCF IPAM allocation construct. In practice, partition the parent subnet into dedicated sub-blocks: one for vSphere Supervisor/VKS external services and separate sub-blocks for VCF Automation organizations or Virtual Private Clouds (VPCs).
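One way to sanity-check such a plan before entering it anywhere is with Python's standard ipaddress module. The sketch below assumes a hypothetical 10.50.0.0/22 parent subnet and hypothetical consumer names. It is a planning aid, not a VCF or NSX tool.

```python
import ipaddress
from itertools import combinations

# Hypothetical parent routed subnet; substitute your own range.
parent = ipaddress.ip_network("10.50.0.0/22")

# Carve the parent into /24 sub-blocks (four of them for a /22).
sub_blocks = list(parent.subnets(new_prefix=24))

# Hypothetical assignment of sub-blocks to consumers.
allocations = {
    "supervisor-external": sub_blocks[0],
    "vcfa-org-a": sub_blocks[1],
    "vcfa-org-b": sub_blocks[2],
}

def find_overlaps(blocks):
    """Return pairs of consumer names whose IP ranges overlap."""
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(blocks.items(), 2)
        if net_a.overlaps(net_b)
    ]

for name, net in allocations.items():
    print(f"{name}: {net} ({net.num_addresses} addresses)")

print("overlapping pairs:", find_overlaps(allocations))
```

An empty list of overlapping pairs means every consumer has a dedicated range. The fourth /24 stays unallocated as room for another organization or VPC.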

Networking, Container Network Interface (CNI), and Load Balancing

Q: What CNI does VKS use? Can I use Cilium, Antrea, or Calico?
A: Antrea is the default CNI for VKS clusters and features deep integration with the NSX stack via the Antrea-NSX Adapter in VCF 9.1. Because VKS is CNCF-conformant, you can run other CNIs such as Calico or Cilium within the guest clusters, but Antrea provides the most seamless native visibility. ‘Bring your own CNI’ has been allowed since VKS 3.6.

Q: What is the major advantage of using VPCs in a VKS scenario vs traditional vSphere 8 NSX Classic flavor? Is VKS supported without VPCs?
A: The VPC construct is similar to its public cloud counterpart. VPCs provide tenant isolation and self-service, automatically handling routing, Network Address Translation (NAT), and subnets for developers. VKS remains fully supported without VPCs: traditional NSX Classic (Tier-0/Tier-1 segments) is supported through VCF 9.0.2, and VDS (VLAN-backed) networking remains an option. From VCF 9.1 onwards, VDS networking and NSX VPCs are the only supported network configurations.

Q: Is the Foundation Load Balancer fully supported for VKS vSphere Supervisor and workload clusters? How does it compare to VMware AVI Load Balancer?
A: The Foundation Load Balancer is VMware’s default Layer 4 load balancer for VDS networking and is fully supported for simpler (lab/testing) environments. For heavy workloads needing Layer 7 balancing, WAF, or DNS integration, Avi Load Balancer is recommended, with the Avi Kubernetes Operator (AKO) acting as the ingress controller. The NSX load balancer, which is the default in a VCF configuration, supports only Layer 4 load balancing.

Q: Is it possible to forgo a load balancer at the “VMware” level and use load balancers such as kube-vip and metallb within the guests?
A: Yes, you can bypass VMware-level load balancers and deploy in-guest solutions like kube-vip or metallb. However, doing so means you lose the centralized analytics, automation, and IPAM benefits of the VCF fabric. You also take on the extra configuration and maintenance of these components, which sit outside standard VKS.

Q: Is Istio supported on VKS?
A: Yes. Since VCF 9.0, a lightweight Istio Service Mesh has been natively embedded, supporting both Sidecar and Ambient (sidecar-less) modes for mTLS security and traffic management within a single VKS cluster or across multiple clusters.

Storage, Persistent Volume Claims (PVCs), and Container Storage Interface (CSI)

Q: Where does non-persistent/stateless storage come from? What are the recommended storage policies?
A: Stateless or non-persistent storage typically runs on standard vSAN or block storage policies. VCF 9.1 handles storage quotas directly at the vSphere Namespace level to control consumption.

Q: Is ReadWriteMany (RWX) supported with external NFS shares, or can you use something like Portworx if vSAN is not deployed?
A: Native RWX support out-of-the-box requires vSAN File Services. However, if you are not using vSAN, you can absolutely deploy third-party CNCF-conformant storage overlays like Portworx within the VKS clusters to provide RWX PVCs over your existing block storage.

Q: How specifically are storage / PVCs synchronized in a scenario with three Zones?
A: Standard block PVCs (ReadWriteOnce) do not synchronously replicate across three zones at the storage layer due to extreme latency penalties. High availability across three zones is primarily handled at the application/database layer (e.g., database replication). vSAN Data Persistence Platform (vDPP) can also help manage stateful services across zones.
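To illustrate the application-layer approach, the Python sketch below builds a StatefulSet manifest whose replicas are spread across zones with a standard Kubernetes topology spread constraint. Each replica gets its own zone-local ReadWriteOnce volume, and the database itself is expected to replicate data between replicas. The workload name, image, mount path, and volume size are hypothetical, and this is a sketch under those assumptions rather than a production manifest.

```python
import json

ZONE_KEY = "topology.kubernetes.io/zone"  # well-known Kubernetes node label

def replicated_db_statefulset(name, image, replicas=3, size="10Gi"):
    """Build a StatefulSet dict with pods spread evenly across zones."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Keep the per-zone pod count within 1 of every other zone.
                    "topologySpreadConstraints": [{
                        "maxSkew": 1,
                        "topologyKey": ZONE_KEY,
                        "whenUnsatisfiable": "DoNotSchedule",
                        "labelSelector": {"matchLabels": labels},
                    }],
                    "containers": [{
                        "name": "db",
                        "image": image,
                        "volumeMounts": [
                            {"name": "data", "mountPath": "/var/lib/data"}
                        ],
                    }],
                },
            },
            # One ReadWriteOnce volume per replica; no cross-zone storage sync.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": size}},
                },
            }],
        },
    }

# Hypothetical name and image, for illustration only.
sts = replicated_db_statefulset("demo-db", "registry.example.com/demo-db:1.0")
print(json.dumps(sts, indent=2))
```

With three replicas and three zones, the constraint places one replica per zone, so a zone outage leaves two database replicas running.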

Q: Is it possible for vSphere Container Storage Plugin to automatically reclaim space when a persistent volume is deleted in vSAN, or is it manual?
A: VMware’s Container Storage Interface (CSI) driver handles dynamic provisioning and de-provisioning. When a PVC/PV is deleted via Kubernetes and the reclaim policy is set to Delete, the CSI driver automatically reaches out to vSphere to delete the underlying vmdk/volume and reclaim the space on vSAN.
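Since automatic reclamation depends on the reclaim policy, it can be useful to audit which persistent volumes would not be cleaned up automatically. The Python sketch below filters a PV list shaped like the output of `kubectl get pv -o json`. The sample data is made up for illustration; against a real cluster you would load the actual JSON instead.

```python
# Sample shaped like `kubectl get pv -o json`; names are made up.
SAMPLE_PV_LIST = {
    "items": [
        {
            "metadata": {"name": "pvc-aaa"},
            "spec": {"persistentVolumeReclaimPolicy": "Delete"},
        },
        {
            "metadata": {"name": "pvc-bbb"},
            "spec": {"persistentVolumeReclaimPolicy": "Retain"},
        },
    ]
}

def volumes_needing_manual_cleanup(pv_list):
    """Return names of PVs whose reclaim policy is anything but Delete."""
    return [
        pv["metadata"]["name"]
        for pv in pv_list["items"]
        # Treat a missing policy as Retain, the conservative assumption.
        if pv["spec"].get("persistentVolumeReclaimPolicy", "Retain") != "Delete"
    ]

print(volumes_needing_manual_cleanup(SAMPLE_PV_LIST))  # prints ['pvc-bbb']
```

Volumes flagged here keep their backing storage after the claim is deleted, so reclaiming that space is a manual step.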

Backup, DR, and Lifecycle Management

Q: How are vSphere Supervisor clusters backed up and restored, and is Velero used for this process?
A: Yes, you can back up and restore a vSphere Supervisor cluster, but it is not done using Velero. There is a strict dividing line between backing up the infrastructure (the Supervisor) and backing up the applications (the workloads). Velero is used to back up Kubernetes workloads in VKS workload clusters. Broadcom recently contributed Velero as a Sandbox project to the CNCF community.

Q: What happens to Kubernetes guest clusters during a complete data center power failure? Do services come back automatically after power is restored?
A: In a stretched cluster scenario, vSphere HA will automatically restart the control plane and worker node VMs on the surviving site. Because the IP addresses and underlying storage stretch across the sites, Kubernetes services will automatically recover as soon as the VMs boot and the kubelet reconnects. Once the infrastructure layer is restored, VKS and Kubernetes naturally reconcile to restart guest workloads. While stateless services typically recover without manual intervention, stateful applications may require specific storage-level or database-level recovery procedures. To survive a complete data center failure, organizations should implement a comprehensive DR strategy rather than relying solely on local Kubernetes high availability.

Need help?

Looking to add vSphere Kubernetes Service to your existing VCF environment? If you need assistance, VCF Professional Services and our partners can help. Contact your account director for more information.