Home > Blogs > VMware VROOM! Blog > Tag Archives: Performance

Tag Archives: Performance

New White Paper: High-Performance Virtualized Spark Clusters on Kubernetes for Deep Learning

By Dave Jaffe, VMware Performance Engineering

A new white paper is available showing the advantages of running virtualized Spark Deep Learning workloads on Kubernetes.

Recent versions of Spark include support for Kubernetes. For Spark on Kubernetes, the Kubernetes scheduler provides the cluster manager capability provided by Yet Another Resource Negotiator (YARN) in typical Spark on Hadoop clusters. Upon receiving a spark-submit command to start an application, Kubernetes instantiates the requested number of Spark executor pods, each with one or more Spark executors.

The benefits of running Spark on Kubernetes are many: ease of deployment, resource sharing, simplifying the coordination between developer and cluster administrator, and enhanced security. A standalone Spark cluster on vSphere virtual machines running in the same configuration as a Kubernetes-managed Spark cluster on vSphere virtual machines were compared for performance using a heavy workload, and the difference imposed by Kubernetes was found to be insignificant.

Spark applications running in Standalone mode require that every Spark worker node be installed with the correct version of Spark, Python, Java, etc. This puts a burden on the IT administrator, who may be managing many Spark applications with different requirements, and it requires coordination between the administrator and the application developer. With Kubernetes, the developer only needs to create a container with the correct software, and the IT administrator just needs to manage the cluster using the fine-grained resource management tools to enable the different Spark workloads.

To compare Spark Standalone performance to Spark on Kubernetes performance, a Deep Learning workload, the Maximum Throughput Spark BigDL ResNet50 image classifier from VMware IoT Analytics Benchmark, was run on the same 16 worker nodes, first while configured as Spark worker nodes, then while configured as Kubernetes nodes. Then the number of nodes was reduced by four (by removing the four workers on host 4), and the same comparison was made using 12 nodes, then 8, then 4.

The relative results are shown below. The Spark Standalone and Spark on Kubernetes performance in terms of images per second classified was within ~1% of each other for all configurations. Performance scaled well for the Spark tests as the number of VMs increased from 4 (1 server) to 16 (4 servers).

All details are in the paper.

How Does Project Pacific Deliver 8% Better Performance Than Bare Metal?

By Karthik Ganesan and Jared Rosoff 

At VMworld US 2019, VMware announced Project Pacific, an evolution of vSphere into a Kubernetes-native platform. Project Pacific (among other things) introduces a vSphere Supervisor Cluster, which enables you to run Kubernetes pods natively on ESXi (called vSphere Native Pods) with the same level of isolation as virtual machines. At VMworld, we claimed  that vSphere Native Pods running on Project Pacific, isolated by the hypervisor, can achieve up to 8% better performance than pods on a bare-metal Linux Kubernetes node. While it may sound a bit counter-intuitive that virtualized performance is better than bare metal, let’s take a deeper look to understand how this is possible.

Why are vSphere Native Pods faster?

This benefit primarily comes from ESXi doing a better job at scheduling the natively run pods on the right CPUs, thus providing better localization to dramatically reduce the number of remote memory accesses. The ESXi CPU scheduler knows that these pods are independent entities and takes great efforts to ensure their memory accesses are within their respective local NUMA (non-uniform memory access) domain. This results in better performance for the workloads running inside these pods and higher overall CPU efficiency. On the other hand, the process scheduler in Linux may not provide the same level of isolation across NUMA domains.

Evaluating the performance of vSphere Native Pods

Test setup

To compare the performance, we set up the testbeds shown in Figure 1 on identical hardware with 44 cores running at 2.2 GHz and 512 GB of memory:

  • A Project Pacific–based vSphere Supervisor Cluster with two ESXi Kubernetes nodes
  • A two-node, bare-metal popular Enterprise Linux cluster with out-of-the-box settings

Figure 1: Testbed configuration

Typically, a hyperthreaded processor core has multiple logical cores (hyperthreads) that share hardware resources between them. On the baseline case (testbed 2), the user specifies the CPU resource requests and the maximum number of logical cores for each pod. By contrast, on a vSphere Supervisor Cluster, the CPU resource specifications for a pod are  specified in terms of the number of physical cores. Given this difference in CPU resource specification, we disabled hyperthreading in both testbeds to simplify this comparison experiment and to be able to use an identical pod specification. In each of the clusters, we use one of the two Kubernetes nodes as the system under test, with the Kubernetes Master running on the other node.

Benchmark configuration

Figure 2: Pod configuration

  • We deploy 10 Kubernetes pods, each with a resource limit of 8 CPUs, 42 GB of RAM, and a single container in each running a standard Java transaction benchmark, as shown in Figure 2. Given the complexity and nature of the workload used for our experiments, we used large pods for this experiment to enable easier run management and score aggregation across pods.
  • The pods are affinitized using the pod specification to the target Kubernetes node in each testbed.
  • We use the benchmark score (maximum throughput) aggregated across all 10 pods to evaluate the performance of the system under test. The benchmark throughput scores used are sensitive to transaction latencies at various service-level agreement targets of 10ms, 25ms, 50ms, 75ms, and 100ms.
  • The benchmark used has little I/O or network activity, and all the experiments are restricted to a single Kubernetes node. Therefore, we do not discuss the I/O or network performance aspects in this article.

Results

Figure 3 shows the performance of the vSphere Supervisor Cluster (green bar) normalized based on the popular Enterprise Linux–based, bare-metal node case.

8% better performance on the vSphere Supervisor Cluster vs bare-metal Enterprise Linux

Figure 3: Hardware counter data on a vSphere Supervisor Cluster vs bare-metal Enterprise Linux

We observe that the vSphere Supervisor Cluster can achieve up to 8% better overall aggregate performance compared to the bare-metal case. We repeated the runs on each testbed many times to average out run-to-run variations.

Analysis and optimizations

Looking at the system statistics, we observe the workload running on the bare-metal testbed is suffering from many remote NUMA node memory accesses in comparison to the vSphere Supervisor Cluster case. While the vSphere Supervisor Cluster performance upside includes the improvement from better CPU scheduling, it also includes the performance overheads from virtualization.

To get deeper insights, we configure a set of pods to run the same workload at a constant throughput and collect hardware performance counter data for both testbeds to see the proportion of the last level cache (L3) misses that hit in the local DRAM vs remote DRAM. The local DRAM accesses incur much lower memory access latencies in comparison to remote DRAM accesses.

As shown in Figure 4, only half of the L3 cache misses hit in local DRAM, and the rest are served by remote memory in the case of bare-metal Enterprise Linux. In the case of the vSphere Supervisor Cluster, however, 99.2% of the L3 misses hit in local DRAM for an identical pod and workload configuration due to superior CPU scheduling in ESXi. The lower memory access latencies from avoiding remote memory accesses contributes to improved performance on the vSphere Supervisor Cluster.

About half of L3 cache misses hit Local DRAM on Linux vs most in Project Pacific

Figure 4: Hardware counter data on vSphere Supervisor Cluster vs bare-metal Enterprise Linux

To mitigate the effects of non-local NUMA accesses on Linux, we tried basic optimizations, like toggling NUMA balancing switches and using taskset-based pod pinning to CPUs, but none of this substantially helped performance. Numactl-based pinning of containers that forces local NUMA allocations and ensures that the processes execute in the CPUs belonging to these NUMA domains was one of the optimizations that helped. However, they are merely one-time workarounds at this point in time to help our investigation.

The chart in Figure 5 shows the improvement that can be achieved using numactl-based pinning on the bare-metal Enterprise Linux testbed.

17% performance improvements from pod pinning on bare-metal Enterprise Linux

Figure 5: Upside from better NUMA locality on the performance on bare-metal Enterprise Linux

Note that pinning practices like this are not practical when it comes to container deployment at scale and can be error-prone across heterogenous hardware. For this experiment, we chose a modest configuration for the memory usage of the workloads inside the pods, and the results are specific to these settings. If the workloads have higher or lower memory intensity, the impact of NUMA locality on their performance is expected to vary.

Conclusion and future work

In this test, we show that large pods running memory-bound workloads show better NUMA behavior and superior performance when deployed as vSphere Native Pods on a Project Pacific–based vSphere Supervisor Cluster. We did not explore the impacts on other workloads that are not memory bound, nor did we test other pod sizes. As future work, we would like to test with a spectrum of pod sizes, including smaller pods, and try other workloads that have more I/O and network activity.

Disclaimer

The results presented here are preliminary based on products under active development. The results may not apply to your environment and do not necessarily represent best practices.

 

Performance of VMware vCenter Server 6.7 in Remote Offices and Branch Offices

The VMware Performance team has published an updated paper detailing vCenter Server 6.7 performance in a remote offices and branch offices (ROBO) environment.

Many organizations today have a ROBO environment with local IT infrastructure. These remote locations usually have anywhere from a few servers running a few workloads to support local needs, to numerous servers spanning a large-scale datacenter. The distributed and remote nature of this infrastructure makes it hard to manage, difficult to protect, and costly to maintain. Further, the remote nature of servers makes it more challenging to perform important VM/host-related operations.

vSphere is designed to address these ROBO use cases, including IT infrastructure located in remote, distributed sites. VMware vCenter Server provides a centralized way to control and monitor the virtual infrastructure, including ESXi hosts, virtual machines, storage, and networking resources. It has been widely deployed in a ROBO environment to manage ESXi hosts that are distributed over large geographical distances over a wide range of networks with different network characteristics, including low/high bandwidth, network latency, and packet error rates. In the paper, we test:

  • LAN with high-bandwidth and low-latency links.
  • WAN with low-bandwidth and high-latency links.
  • Various networks in between; for example, DSL, T1, 4G, 5G, …

We demonstrate that vCenter Server performs well in the ROBO environment for both network bandwidth use, as well as virtual machine and ESXi host task execution times. Instead of a bandwidth restriction, we observe that network latency has a bigger impact on the overall performance. As the network latency between vCenter Server and ESXi hosts increases, the average operation latency also increases. The experimental results also show how efficiently vCenter Server executes VM operations in high-latency networks: The average VM operation execution time increases much more slowly when network latency increases by several times.

Read the paper: Performance of VMware vCenter Server 6.7 in Remote Offices and Branch Offices.

Using VMmark 3 as a Performance Analysis Tool

VMmark was originally developed to fill the need for a server consolidation benchmark for a rapidly changing datacenter that was becoming increasingly dominated by virtualization.  The design of VMmark, which is a collection of workloads, gives us the ability to quickly change workload parameters to modify the behavior of the entire benchmark. This allows us to use VMmark to exercise technologies that were not available at the time the benchmark was designed. The VMmark 3 run rules provide for academic or research results publication using a modified version of the benchmark.

VMmark 3 was designed in 2015 when the memory size of a typical high-end 2 socket server was 768 GB.  Each VMmark 3 tile was configured to use 156 GB of memory, allowing multiple tiles to be run on each server.  A new technology, Intel Optane DC Persistent Memory, now allows up to 3 TB of memory in a 2 socket server, with plans to increase that even further.  Testing the performance of this technology with an unmodified version of VMmark 3 wouldn’t be easy as we’d saturate CPU resources long before we could fully exercise this large amount of memory.  Thankfully the flexible nature of VMmark allows us to modify it to consume significantly more memory with minimal changes in CPU usage.

The two primary VMmark workloads are Weathervane and DVD Store.  Each can be modified to consume more memory.  Weathervane, as configured for VMmark 3, uses 14 VMs.  Thus while it would be possible to modify this application, doing so would be a time-consuming process.  We therefore decided to look at DVD Store, which uses only four VMs.  Most of the work is done in the DVD Store database VM which was our target for modification.

Determining the best configuration for DVD Store to utilize a larger amount of memory required multiple iterations of testing.   We modified one test parameter of the DVD Store workload, and then examined the results to determine the effect on the VMmark tile. We were looking for larger memory usage with a minimal increase in CPU usage so that we could exercise the larger memory configuration without requiring additional CPUs. The following table lists the default configuration and the variables we changed:

Parameter Default Configurations Tried
VM Memory Size 32 GB 128, 250 and 385 GB
Think Time 1 second 0.5, 0.9, 1.25, and 1.5 seconds
Number of Threads 24 36 and 48
Number of Searches 3 5, 7, and 9
Batch Search Size 3 5, 7, and 9
Database Size 100 GB 300 and 500 GB

The final configuration that we determined to have the most increased memory usage while keeping the CPU usage moderate was 250 GB DS3DB VM memory size, 1.5 seconds think time, and 300 GB database size.  All other parameters were kept at the default.

The following table lists the CPU and memory utilization of the default configuration and the “increased memory” configuration.

Configuration CPU Utilization Memory Utilization
Default 26.3 126 GB
Increased Memory 24.1 350 GB

We were able to almost triple the memory consumption of a single VMmark tile without increasing the CPU usage. Using this “increased memory” configuration for VMmark we can now see the effect of the additional memory provided by Intel Optane DC Persistent Memory in Memory Mode.

More detailed information about this configuration and the methodology used to refine it can be found in the Intel Optane DC Persistent Memory whitepaper.  Detailed instructions to configure VMmark 3 to increase the memory footprint can be obtained by emailing the VMmark team at vmmark-info@vmware.com.  We encourage you to experiment with VMmark under academic rules for your own studies and to let us know if you have any questions.

 

vSphere 6.7 Update 3 Supports AMD EPYC™ Generation 2 Processors, VMmark Showcases Its Leadership Performance

Two leadership VMmark benchmark results have been published with AMD EPYC™ Generation 2 processors running VMware vSphere 6.7 Update 3 on a two-node two-socket cluster and a four-node cluster. VMware worked closely with AMD to enable support for AMD EPYC™ Generation 2 in the VMware vSphere 6.7 U3 release.

The VMmark benchmark is a free tool used by hardware vendors and others to measure the performance, scalability, and power consumption of virtualization platforms and has become the standard by which the performance of virtualization platforms is evaluated.

The new AMD EPYC™ Generation 2 performance results can be found here and here.

View all VMmark results
Learn more about VMmark
These benchmark result claims are valid as of the date of writing.

Introducing VMmark ML

VMmark has been the go-to virtualization benchmark for over 12 years. It’s been used by partners, customers, and internally in a wide variety of technical applications. VMmark1, released in 2007, was the de-facto virtualization consolidation benchmark in a time when the overhead and feasibility of virtualization was still largely in question. In 2010, as server consolidation became less of an “if” and more of a “when,” VMmark2 introduced more of the rich vSphere feature set by incorporating infrastructure workloads (VMotion, Storage VMotion, and Clone & Deploy) alongside complex application workloads like DVD Store. Fast forward to 2017, and we released VMmark3, which builds on the previous versions by integrating an easy automation deployment service alongside complex multi-tier modern application workloads like Weathervane. To date, across all generations, we’ve had nearly 300 VMmark result publications (297 at the time of this writing) and countless internal performance studies.

Unsurprisingly, tech industry environments have continued to evolve, and so must the benchmarks we use to measure them. It’s in this vein that the VMware VMmark performance team has begun experimenting with other use cases that don’t quite fit the “traditional” VMmark benchmark. One example of a non-traditional use is Machine Learning and its execution within Kubernetes clusters. At the time of this writing, nearly 9% of the VMworld 2019 US sessions are about ML and Kubernetes. As such, we thought this might be a good time to provide an early teaser to VMmark ML and even point you at a couple of other performance-centric Machine Learning opportunities at VMworld 2019 US.

Although it’s very early in the VMmark ML development cycle, we understand that there’s a need for push-button-easy, vSphere-based Machine Learning performance analysis. As an added bonus, our prototype runs within Kubernetes, which we believe to be well-suited for this type of performance analysis.

Our internal-only VMmark ML prototype is currently streamlined to efficiently perform a limited number of operations very well as we work with partners, customers, and internal teams on how VMmark ML should be exercised. It is able to:

  1. Rapidly deploy Kubernetes within a vSphere environment.
  2. Deploy a variety of containerized ML workloads within our newly created VMmark ML Kubernetes cluster.
  3. Execute these ML workloads either in isolation or concurrently to determine the performance impact of architectural, hardware, and software design decisions.

VMmark ML development is still very fluid right now, but we decided to test some of these concepts/assumptions in a “real-world” situation. I’m fortunate to work alongside long-time DVD Store author and Big Data guru Dave Jaffe on VMmark ML.  As he and Sr. Technical Marketing Architect Justin Murray were preparing for their VMworld US talk, “High-Performance Virtualized Spark Clusters on Kubernetes for Deep Learning [BCA1563BU]“, we thought this would be a good opportunity to experiment with VMmark ML. Dave was able to use the VMmark ML prototype to deploy a 4-node Kubernetes cluster onto a single vSphere host with a 2nd-Generation Intel® Xeon® Scalable processor (“Cascade Lake”) CPU. VMmark ML then pulled a previously stored Docker container with several MLperf workloads contained within it. Finally, as a concurrent execution exercise, these workloads were run simultaneously, pushing the CPU utilization of the server above 80%. Additionally, Dave is speaking about vSphere Deep Learning performance in his talk “Optimize Virtualized Deep Learning Performance with New Intel Architectures [MLA1594BU],“ where he and Intel Principal Engineer Padma Apparao explore the benefits of Vector Neural Network Instructions (VNNI). I definitely recommend either of these talks if you want a deep dive into the details of VNNI or Spark analysis.

Another great opportunity to learn about VMware Performance team efforts within the Machine Learning space is to attend the Hands-on-Lab Expert Lead Workshop, “Launch Your Machine Learning Workloads in Minutes on VMware vSphere [ELW-2048-01-EMT_U],” or take the accompanying lab. This is being led by another VMmark ML team member Uday Kurkure along with Staff Global Solutions Consultant Kenyon Hensler. (Sign up for the Expert Lead using the VMworld 2019 mobile application or on my.vmworld.com.)

Our goal after VMworld 2019 US is to continue discussions with partners, customers, and internal teams about how a benchmark like VMmark ML would be most useful. We also hope to complete our integration of Spark within Kubernetes on vSphere and reproduce some of the performance analysis done to date. Stay tuned to the performance blog for additional posts and details as they become available.

Extreme Performance Series at VMworld 2019

 

I’m excited to announce that the “Extreme Performance Series” is back for its 7th year with 14 sessions created and being presented by VMware’s best and most distinguished performance engineers, principals, architects, and gurus. You do not want to miss this years program as it’s chock full of advanced content, practical advice, and exciting technical details!

Breakout Sessions

Spread across 4 different VMworld tracks, you’ll find these sessions full of performance details that you won’t get anywhere else at VMworld. They’ll also be recorded, so if the sessions you want to see aren’t being hosted in your region, you’ll still get access to it.

Continue reading

New Scheduler Option for vSphere 6.7 U2

Along with the recent release of VMware vSphere 6.7 U2, we published a new whitepaper that shows the performance of a new scheduler option that was included in the 6.7 U2 update.  We referred to this new scheduler option internally as the “sibling” scheduler, but the official name is the side-channel aware scheduler version 2, or SCAv2.  The whitepaper includes full details about SCAv1 and SCAv2, the L1TF security vulnerability that made them necessary, and the performance implications with several different workload types.  This blog is a brief overview of the key points, but we recommend that you check out the full document.

In August of 2018, a security vulnerability known as L1TF, affecting systems using Intel processors, was revealed, and patches and remediations were also made available. Intel provided micro-code updates for its processors, operating system patches were made available, and VMware provided an update for vSphere. The full details of the vCenter and ESXi patches are in a VMware security advisory that links to individual KB articles.

Continue reading

First VMmark 3.1 Publications, Featuring New Cascade Lake Processors

VMmark is a free tool used by hardware vendors and others to measure the performance, scalability, and power consumption of virtualization platforms.  If you’re unfamiliar with VMmark 3.x, each tile is a grouping of 19 virtual machines (VMs) simultaneously running diverse workloads commonly found in today’s data centers, including a scalable Web simulation, an E-commerce simulation (with backend database VMs), and standby/idle VMs.

As Joshua mentioned in a recent blog post, we released VMmark 3.1 in February, adding support for persistent memory, improving workload scalability, and better reflecting secure customer environments by increasing side-channel vulnerability mitigation requirements.

I’m happy to announce that today we published the first VMmark 3.1 results.  These results were obtained on systems meeting our industry-leading side-channel-aware mitigation requirements, thus continuing the benchmark’s ability to provide an indication of real-world performance.

Continue reading

vMotion across hybrid cloud: performance and best practices

VMware Cloud on AWS is a hybrid cloud service that runs the VMware software-defined data center (SDDC) stack in the Amazon Web Services (AWS) public cloud. The service automatically provisions and deploys a vSphere environment on a bare-metal AWS infrastructure, and lets you run your applications in a hybrid IT environment across your on-premises data centers and AWS global infrastructure. A key benefit of VMware Cloud on AWS is the ability to vMotion workloads back and forth from your on-premises data center to the AWS public cloud as capacity and data privacy require.

In this blog post, we share the results of our vMotion performance tests across our hybrid cloud environment that consisted of a vSphere on-premises data center located in Wenatchee, Washington and an SDDC hosted in an AWS cloud, in various scenarios including hybrid migration of a database server. We also describe the best practices to follow when migrating virtual machines by vMotion across hybrid cloud.

Continue reading