
New White Paper: High-Performance Virtualized Spark Clusters on Kubernetes for Deep Learning

By Dave Jaffe, VMware Performance Engineering

A new white paper is available showing the advantages of running virtualized Spark Deep Learning workloads on Kubernetes.

Recent versions of Spark include support for Kubernetes. With Spark on Kubernetes, the Kubernetes scheduler takes over the cluster manager role played by Yet Another Resource Negotiator (YARN) in typical Spark on Hadoop clusters. Upon receiving a spark-submit command to start an application, Kubernetes instantiates the requested number of Spark executor pods, each with one or more Spark executors.
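
For readers who want to see what that submission looks like, below is a minimal spark-submit sketch against a Kubernetes master. The API server address, container image, executor count, application class, and jar path are placeholders, not the configuration used in the white paper.

    # Minimal sketch: submitting a Spark application to a Kubernetes cluster (Spark 2.4+).
    # The API server URL, image, executor count, class, and jar path are placeholders.
    bin/spark-submit \
      --master k8s://https://k8s-apiserver.example.com:6443 \
      --deploy-mode cluster \
      --name my-spark-app \
      --class com.example.MyApp \
      --conf spark.executor.instances=16 \
      --conf spark.kubernetes.container.image=registry.example.com/spark:2.4.0 \
      local:///opt/spark/jars/my-app.jar

Kubernetes then creates the driver pod and the requested executor pods from the specified container image.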

The benefits of running Spark on Kubernetes are many: ease of deployment, resource sharing, simplified coordination between developer and cluster administrator, and enhanced security. To quantify any overhead, a standalone Spark cluster on vSphere virtual machines was compared against a Kubernetes-managed Spark cluster on vSphere virtual machines running in the same configuration, using a heavy workload; the performance difference imposed by Kubernetes was found to be insignificant.

Spark applications running in Standalone mode require that every Spark worker node be installed with the correct versions of Spark, Python, Java, etc. This puts a burden on the IT administrator, who may be managing many Spark applications with different requirements, and it requires coordination between the administrator and the application developer. With Kubernetes, the developer only needs to create a container with the correct software, and the IT administrator just needs to manage the cluster, using Kubernetes' fine-grained resource management tools to accommodate the different Spark workloads.
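
As one illustration of that fine-grained resource management, the administrator can cap what a team's Spark applications may consume with a per-namespace ResourceQuota. The sketch below uses a hypothetical namespace and limits, not settings from the white paper.

    # spark-quota.yaml (hypothetical): per-namespace limits for Spark workloads
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: spark-team-quota
      namespace: spark-apps
    spec:
      hard:
        requests.cpu: "64"
        requests.memory: 256Gi
        limits.cpu: "64"
        limits.memory: 256Gi
        pods: "50"

The quota would be applied with kubectl apply -f spark-quota.yaml, after which Kubernetes rejects pod creation that would push the namespace past these limits.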

To compare Spark Standalone performance to Spark on Kubernetes performance, a Deep Learning workload, the Maximum Throughput Spark BigDL ResNet50 image classifier from the VMware IoT Analytics Benchmark, was run on the same 16 worker nodes, first configured as Spark worker nodes, then configured as Kubernetes nodes. The number of nodes was then reduced by four (by removing the four workers on host 4), and the same comparison was made using 12 nodes, then 8, then 4.

The relative results are shown below. The Spark Standalone and Spark on Kubernetes performance in terms of images per second classified was within ~1% of each other for all configurations. Performance scaled well for the Spark tests as the number of VMs increased from 4 (1 server) to 16 (4 servers).

All details are in the paper.

How Does Project Pacific Deliver 8% Better Performance Than Bare Metal?

By Karthik Ganesan and Jared Rosoff 

At VMworld US 2019, VMware announced Project Pacific, an evolution of vSphere into a Kubernetes-native platform. Project Pacific (among other things) introduces a vSphere Supervisor Cluster, which enables you to run Kubernetes pods natively on ESXi (called vSphere Native Pods) with the same level of isolation as virtual machines. At VMworld, we claimed that vSphere Native Pods running on Project Pacific, isolated by the hypervisor, can achieve up to 8% better performance than pods on a bare-metal Linux Kubernetes node. While it may sound a bit counter-intuitive that virtualized performance is better than bare metal, let’s take a deeper look to understand how this is possible.

Why are vSphere Native Pods faster?

This benefit primarily comes from ESXi doing a better job of scheduling the natively run pods on the right CPUs, providing better NUMA locality and dramatically reducing the number of remote memory accesses. The ESXi CPU scheduler knows that these pods are independent entities and goes to great lengths to ensure their memory accesses stay within their respective local NUMA (non-uniform memory access) domains. This results in better performance for the workloads running inside these pods and higher overall CPU efficiency. The process scheduler in Linux, on the other hand, may not provide the same level of isolation across NUMA domains.

Evaluating the performance of vSphere Native Pods

Test setup

To compare the performance, we set up the testbeds shown in Figure 1 on identical hardware with 44 cores running at 2.2 GHz and 512 GB of memory:

  • A Project Pacific–based vSphere Supervisor Cluster with two ESXi Kubernetes nodes
  • A two-node bare-metal cluster running a popular Enterprise Linux distribution with out-of-the-box settings

Figure 1: Testbed configuration

A hyperthreaded processor core typically presents multiple logical cores (hyperthreads) that share the core's hardware resources. In the baseline case (testbed 2), the user specifies the CPU resource requests and the maximum number of logical cores for each pod. By contrast, on a vSphere Supervisor Cluster, the CPU resource specifications for a pod are specified in terms of the number of physical cores. Given this difference in CPU resource specification, we disabled hyperthreading in both testbeds to simplify the comparison and to be able to use an identical pod specification. In each of the clusters, we use one of the two Kubernetes nodes as the system under test, with the Kubernetes Master running on the other node.
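
For reference, on a bare-metal Linux node with a reasonably recent kernel (4.19 or later), hyperthreading can be checked and disabled at runtime as sketched below; on the ESXi hosts, hyperthreading is instead disabled through the server BIOS or the host's advanced system settings.

    # Sketch: checking and disabling SMT (hyperthreading) on a Linux node.
    cat /sys/devices/system/cpu/smt/active                    # 1 = SMT currently enabled
    echo off | sudo tee /sys/devices/system/cpu/smt/control   # disable SMT at runtime
    lscpu | grep 'Thread(s) per core'                         # verify: should now report 1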

Benchmark configuration

Figure 2: Pod configuration

  • We deploy 10 Kubernetes pods, each with a resource limit of 8 CPUs and 42 GB of RAM, and each running a single container with a standard Java transaction benchmark, as shown in Figure 2. Given the complexity and nature of the workload, we used large pods to enable easier run management and score aggregation across pods.
  • The pods are affinitized to the target Kubernetes node in each testbed via the pod specification (a minimal spec sketch appears after this list).
  • We use the benchmark score (maximum throughput) aggregated across all 10 pods to evaluate the performance of the system under test. The benchmark throughput scores used are sensitive to transaction latencies at various service-level agreement targets of 10ms, 25ms, 50ms, 75ms, and 100ms.
  • The benchmark used has little I/O or network activity, and all the experiments are restricted to a single Kubernetes node. Therefore, we do not discuss the I/O or network performance aspects in this article.
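
A minimal pod specification along these lines is sketched below. The pod name, node label value, and benchmark image are hypothetical stand-ins; the 8-CPU/42 GB limits match the configuration described above.

    # benchmark-pod.yaml (hypothetical): one of the 10 benchmark pods, pinned to the system under test
    apiVersion: v1
    kind: Pod
    metadata:
      name: java-benchmark-pod-1
    spec:
      nodeSelector:
        kubernetes.io/hostname: k8s-node-1                   # hypothetical node name
      containers:
      - name: java-benchmark
        image: registry.example.com/java-benchmark:latest    # hypothetical image
        resources:
          requests:
            cpu: "8"
            memory: 42Gi
          limits:
            cpu: "8"
            memory: 42Gi

Each pod would be created with kubectl apply -f benchmark-pod.yaml (with a unique name per pod).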

Results

Figure 3 shows the performance of the vSphere Supervisor Cluster (green bar) normalized to the popular Enterprise Linux–based, bare-metal node case.

Figure 3: Up to 8% better performance on the vSphere Supervisor Cluster vs bare-metal Enterprise Linux

We observe that the vSphere Supervisor Cluster can achieve up to 8% better overall aggregate performance compared to the bare-metal case. We repeated the runs on each testbed many times to average out run-to-run variations.

Analysis and optimizations

Looking at the system statistics, we observe that the workload running on the bare-metal testbed suffers from many more remote NUMA-node memory accesses than in the vSphere Supervisor Cluster case. Note that the vSphere Supervisor Cluster result already includes the performance overheads of virtualization, so its net gain reflects the improvement from better CPU scheduling minus those overheads.

To get deeper insights, we configure a set of pods to run the same workload at a constant throughput and collect hardware performance counter data for both testbeds to see what proportion of last-level cache (L3) misses are served from local DRAM versus remote DRAM. Local DRAM accesses incur much lower memory access latencies than remote DRAM accesses.
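
The exact counters we collected are not listed in this article, but as a generic sketch, similar local-versus-remote load data can be gathered on a Linux node with perf's NUMA-node load events and numastat; <pid> is a placeholder for the benchmark process ID.

    # Sketch: approximating local vs. remote memory-load behavior for a running process.
    # node-loads and node-load-misses are generic perf events for NUMA-node memory loads.
    perf stat -e node-loads,node-load-misses -p <pid> -- sleep 60

    # Per-NUMA-node memory allocation for the same process:
    numastat -p <pid>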

As shown in Figure 4, in the bare-metal Enterprise Linux case only about half of the L3 cache misses hit in local DRAM; the rest are served by remote memory. In the case of the vSphere Supervisor Cluster, however, 99.2% of the L3 misses hit in local DRAM for an identical pod and workload configuration, due to superior CPU scheduling in ESXi. The lower memory access latencies from avoiding remote memory accesses contribute to the improved performance on the vSphere Supervisor Cluster.

Figure 4: Hardware counter data on the vSphere Supervisor Cluster vs bare-metal Enterprise Linux

To mitigate the effects of non-local NUMA accesses on Linux, we tried basic optimizations like toggling the NUMA balancing switches and taskset-based pinning of pods to CPUs, but none of these substantially helped performance. One optimization that did help was numactl-based pinning of the containers, which forces local NUMA memory allocations and ensures that the processes execute on the CPUs belonging to those NUMA domains. However, this is merely a one-time workaround at this point in time to help our investigation.
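
A generic sketch of this kind of pinning is shown below; it is not the exact command line used in our tests, and the NUMA node, CPU range, command, and image name are illustrative values that must be matched to the host topology reported by numactl --hardware.

    # Sketch: forcing local NUMA memory allocation and CPU placement for a process.
    numactl --cpunodebind=0 --membind=0 java -jar benchmark.jar

    # The same idea expressed through a container runtime such as Docker:
    docker run --cpuset-cpus=0-7 --cpuset-mems=0 registry.example.com/java-benchmark:latest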

The chart in Figure 5 shows the improvement that can be achieved using numactl-based pinning on the bare-metal Enterprise Linux testbed.

Figure 5: A 17% performance improvement from numactl-based pod pinning on bare-metal Enterprise Linux

Note that pinning practices like this are not practical when it comes to container deployment at scale and can be error-prone across heterogeneous hardware. For this experiment, we chose a modest configuration for the memory usage of the workloads inside the pods, and the results are specific to these settings. If the workloads have higher or lower memory intensity, the impact of NUMA locality on their performance is expected to vary.

Conclusion and future work

In this test, we show that large pods running memory-bound workloads show better NUMA behavior and superior performance when deployed as vSphere Native Pods on a Project Pacific–based vSphere Supervisor Cluster. We did not explore the impacts on other workloads that are not memory bound, nor did we test other pod sizes. As future work, we would like to test with a spectrum of pod sizes, including smaller pods, and try other workloads that have more I/O and network activity.

Disclaimer

The results presented here are preliminary based on products under active development. The results may not apply to your environment and do not necessarily represent best practices.

 

Introducing VMmark ML

VMmark has been the go-to virtualization benchmark for over 12 years. It’s been used by partners and customers, as well as internally, in a wide variety of technical applications. VMmark1, released in 2007, was the de-facto virtualization consolidation benchmark in a time when the overhead and feasibility of virtualization were still largely in question. In 2010, as server consolidation became less of an “if” and more of a “when,” VMmark2 introduced more of the rich vSphere feature set by incorporating infrastructure workloads (VMotion, Storage VMotion, and Clone & Deploy) alongside complex application workloads like DVD Store. Fast forward to 2017, and we released VMmark3, which builds on the previous versions by integrating an easy, automated deployment service alongside complex multi-tier modern application workloads like Weathervane. To date, across all generations, we’ve had nearly 300 VMmark result publications (297 at the time of this writing) and countless internal performance studies.

Unsurprisingly, tech industry environments have continued to evolve, and so must the benchmarks we use to measure them. It’s in this vein that the VMware VMmark performance team has begun experimenting with other use cases that don’t quite fit the “traditional” VMmark benchmark. One example of a non-traditional use is Machine Learning and its execution within Kubernetes clusters. At the time of this writing, nearly 9% of the VMworld 2019 US sessions are about ML and Kubernetes. As such, we thought this might be a good time to provide an early teaser of VMmark ML and even point you to a couple of other performance-centric Machine Learning opportunities at VMworld 2019 US.

Although it’s very early in the VMmark ML development cycle, we understand that there’s a need for push-button-easy, vSphere-based Machine Learning performance analysis. As an added bonus, our prototype runs within Kubernetes, which we believe to be well-suited for this type of performance analysis.

Our internal-only VMmark ML prototype is currently streamlined to perform a limited number of operations very well while we work with partners, customers, and internal teams on how VMmark ML should be exercised. It is able to:

  1. Rapidly deploy Kubernetes within a vSphere environment.
  2. Deploy a variety of containerized ML workloads within our newly created VMmark ML Kubernetes cluster.
  3. Execute these ML workloads either in isolation or concurrently to determine the performance impact of architectural, hardware, and software design decisions.

VMmark ML development is still very fluid right now, but we decided to test some of these concepts/assumptions in a “real-world” situation. I’m fortunate to work alongside long-time DVD Store author and Big Data guru Dave Jaffe on VMmark ML. As he and Sr. Technical Marketing Architect Justin Murray were preparing for their VMworld US talk, “High-Performance Virtualized Spark Clusters on Kubernetes for Deep Learning [BCA1563BU],” we thought this would be a good opportunity to experiment with VMmark ML. Dave was able to use the VMmark ML prototype to deploy a 4-node Kubernetes cluster onto a single vSphere host with a 2nd-Generation Intel® Xeon® Scalable (“Cascade Lake”) processor. VMmark ML then pulled a previously stored Docker container image containing several MLPerf workloads. Finally, as a concurrent execution exercise, these workloads were run simultaneously, pushing the CPU utilization of the server above 80%. Additionally, Dave is speaking about vSphere Deep Learning performance in his talk “Optimize Virtualized Deep Learning Performance with New Intel Architectures [MLA1594BU],” where he and Intel Principal Engineer Padma Apparao explore the benefits of Vector Neural Network Instructions (VNNI). I definitely recommend either of these talks if you want a deep dive into the details of VNNI or Spark analysis.

Another great opportunity to learn about VMware Performance team efforts within the Machine Learning space is to attend the Hands-on-Lab Expert Lead Workshop, “Launch Your Machine Learning Workloads in Minutes on VMware vSphere [ELW-2048-01-EMT_U],” or take the accompanying lab. This is being led by another VMmark ML team member, Uday Kurkure, along with Staff Global Solutions Consultant Kenyon Hensler. (Sign up for the Expert Lead using the VMworld 2019 mobile application or on my.vmworld.com.)

Our goal after VMworld 2019 US is to continue discussions with partners, customers, and internal teams about how a benchmark like VMmark ML would be most useful. We also hope to complete our integration of Spark within Kubernetes on vSphere and reproduce some of the performance analysis done to date. Stay tuned to the performance blog for additional posts and details as they become available.