Home > Blogs > VMware VROOM! Blog > Tag Archives: benchmarking

Tag Archives: benchmarking

SQL Server Performance of VMware Cloud on AWS

In the past, I’ve always benchmarked performance of SQL Server VMs on vSphere with “on-premises” infrastructure.  Given the skyrocketing interest in the cloud, I was very excited to get my hands on VMware Cloud on AWS – just in time for Amazon’s AWS Summit!

A key question our customers have is: how well do applications (like SQL Server) perform in our cloud?  Well, I’m happy to report that the answer is great!

VMware Cloud on AWS Environment

First, here is a screenshot of what my vSphere-powered Software-Defined Data Center (SDDC) looks like:vSphere Client - VMware Cloud on AWSThis screenshot shows several notable items:

  • The HTML5-based vSphere Client interface should be very familiar to vSphere administrators, making the move to the cloud extremely easy
  • This SDDC instance was auto-provisioned with 4 ESXi hosts and 2TB of memory, all of which were pre-configured with vSAN storage and NSX networking.
    • Each host is configured with two CPUs (Intel Xeon Processor E5-2686 v4); each socket contains 18 cores running at 2.3GHz, resulting in 144 physical cores in the cluster. For more information, see the VMware Cloud on AWS Technical Overview
  • Virtual machines are provisioned within the customer workload resource pool, and vSphere DRS automatically handles balancing the VMs across the compute cluster.

Benchmark Methodology

To measure SQL Server database performance, I used HammerDB, an open-source database load testing and benchmarking tool.  It implements a TPC-C like workload, and reports throughput in TPM (Transactions Per Minute).

To measure how well performance scaled in this cloud, I started with a single 8 vCPU, 32GB RAM VM for the SQL Server database.  To drive the workload, I created a 4 vCPU, 4GB RAM HammerDB driver VM.  I then cloned these VMs to measure 2 database VMs being driven simultaneously:HammerDB and SQL Server VMs in VMware Cloud on AWS

I then doubled the number of VMs again to 4, 8, and finally 16.  As with any benchmark, these VMs were completely driven up to saturation (100% load) – “pedal to the metal”!

Results

So, how did the results look?  Well, here is a graph of each VM count and the resulting database performance:

As you can see, database performance scaled great; when running 16 8-vCPU VMs, VMware Cloud on AWS was able to sustain 6.7 million database TPM!

I’ll be detailing these benchmarks more in an upcoming whitepaper, but wanted to share these results right away.  If you have any questions or feedback, please leave me a comment!

Performance Comparison of Containerized Machine Learning Applications Running Natively with Nvidia vGPUs vs. in a VM – Episode 4

This article is by Hari Sivaraman, Uday Kurkure, and Lan Vu from the Performance Engineering team at VMware.

Performance Comparison of Containerized Machine Learning Applications

Docker containers [6] are rapidly becoming a popular environment in which to run different applications, including those in machine learning [1, 2, 3]. NVIDIA supports Docker containers with their own Docker engine utility, nvidia-docker [7], which is specialized to run applications that use NVIDIA GPUs.

The nvidia-docker container for machine learning includes the application and the machine learning framework (for example, TensorFlow [5]) but, importantly, it does not include the GPU driver or the CUDA toolkit.

Docker containers are hardware agnostic so, when an application uses specialized hardware like an NVIDIA GPU that needs kernel modules and user-level libraries, the container cannot include the required drivers. They live outside the container.

One workaround here is to install the driver inside the container and map its devices upon launch. This workaround is not portable since the versions inside the container need to match those in the native operating system.

The nvidia-docker engine utility provides an alternate mechanism that mounts the user-mode components at launch, but this requires you to install the driver and CUDA in the native operating system before launch. Both approaches have drawbacks, but the latter is clearly preferable.

In this episode of our series of blogs [8, 9, 10] on machine learning in vSphere using GPUs, we present a comparison of the performance of MNIST [4] running in a container on CentOS executing natively with MNIST running in a container inside a CentOS VM on vSphere. Based on our experiments, we demonstrate that running containers in a virtualized environment, like a CentOS VM on vSphere, suffers no performance penlty, while benefiting from the tremenduous management capabilities offered by the VMware vSphere platform.

Experiment Configuration and Methodology

We used MNIST [4] to compare the performance of containers running natively with containers running inside a VM. The configuration of the VM and the vSphere server we used for the “virtualized container” is shown in Table 1. The configuration of the physical machine used to run the container natively is shown in Table 2.

vSphere  6.0.0, build 3500742
Nvidia vGPU driver 367.53
Guest OS CentOS Linux release 7.4.1708 (Core)
CUDA driver 8.0
CUDA runtime 7.5
Docker 17.09-ce-rc2

Table 1. Configuration of VM used to run the nvidia-docker container

Nvidia driver 384.98
Operating system CentOS Linux release 7.4.1708 (Core)
CUDA driver 8.0
CUDA runtime 7.5
Docker 17.09-ce-rc2

⇑ Table 2. Configuration of physical machine used to run the nvidia-docker container

The server configuration we used is shown in Table 3 below. In our experiments, we used the NVIDIA M60 GPU in vGPU mode only. We did not use the Direct I/O mode. In the scenario in which we ran the container inside the VM, we first installed the NVIDIA vGPU drivers in vSphere and inside the VM, then we installed CUDA (driver 8.0 with runtime version 7.5), followed by Docker and nvidia-docker [7]. In the case where we ran the container natively, we installed the NVIDIA driver in CentOS running natively, followed by CUDA (driver 8.0 with runtime version 7.5),  Docker and finally, nvidia-docker [7]. In both scenarios we ran MNIST and we measured the run time for training using a wall clock.

 Figure 1. Testbed configuration for comparison of the performance of containers running natively vs. running in a VM

Model Dell PowerEdge R730
Processor type Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz
CPU cores 24 CPUs, each @ 2.5GHz
Processor sockets 2
Cores per socket 14
Logical processors 48
Hyperthreading Active
Memory 768GB
Storage Local SSD (1.5TB), Storage Arrays, Local Hard Disks
GPUs 2x M60 Tesla

⇑ Table 3. Server configuration

Results

The measured wall-clock run times for MNIST are shown in Table 4 for the two scenarios we tested:

  1. Running in an nvidia-docker container in CentOS running natively.
  2. Running in an nvidia-docker container inside a CentOS VM on vSphere.

From the data, we can clearly see that there is no measurable performance penalty for running a container inside a VM as compred to running it natively.

Configuration Run time for MNIST as measured by a wall clock
Nvidia-docker container in CentOS running natively 44 minutes 53 seconds
Nvidia-docker container running in a CentOS VM on vSphere 44  minutes 57 seconds

⇑ Table 4. Comparison of the run-time for MNIST running in a container on native CentOS vs. in a container in virtualized CentOS

Takeaways

  • Based on the results shown in Table 4, it is clear that there is no measurable performance impact due to running a containerized application in a virtual environment as opposed to running it natively. So, from a performance perspective, there is no penalty for using a virtualized environment.
  • It is important to note that since containers do not include the GPU driver or the CUDA environment, both of these components need to be installed separately. It is in this aspect that a virtualized environment offers a superior user experience; an nvidia-docker container in CentOS running natively requires that any existing GPU and CUDA drivers be removed if the version of the drivers does not match that required by the container. Uninstalling and re-installing the correct drivers is often a challenging and time consuming task. However, in a virtualized environment, you can, in advance, create and store in a repository, a number of CentOS VMs with different VGPU and CUDA drivers. When you need to run an application in an nvidia-docker container, just clone the VM with the correct drivers, load the container, and run with no performance penalty. In such a scenario, running in a virtualized environment does not require you to uninstall and re-install the correct drivers, which saves both time and considerable frustration. This issue of uninstalling and re-installing drivers in a native environment becomes considerably more difficult if there are multiple container users on the system; in such a scenario, all the containers need to be migrated to use the new drivers, or the user who needs a new driver will have to wait until all the other users are done before a system administrator can upgrade the GPU drivers on the native CentOS.

Future Work

In this blog, we presented the performance results of running MNIST in a single container. We plan to run MNIST in multiple containers running concurrently in both a virtualized environment and on CentOS executing natively, and report the measured run times. This will provide a comparison of the performance as we scale up the number of containers.

References

  1. Google Cloud Platform: Cloud AI. https://cloud.google.com/products/machine-learning/
  2. Wikipedia: Deep Learning. https://en.wikipedia.org/wiki/Deep_learning
  3. NVIDIA GPUs – The Engine of Deep Learning. https://developer.nvidia.com/deep-learning
  4. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
  5. TensorFlow: An Open-Source Software Library for Machine Intelligence. https://www.tensorflow.org
  6. Wikipedia: Operating-System-Level Virtualization. https://en.wikipedia.org/wiki/Operating-system-level_virtualization
  7. NVIDIA Docker: GPU Server Application Deployment Made Easy. https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/
  8. Episode 1: Performance Results of Machine Learning with DirectPath I/O and GRID vGPU. https://blogs.vmware.com/performance/2016/10/machine-learning-vsphere-nvidia-gpus.html
  9. Episode 2: Machine Learning on vSphere 6 with NVIDIA GPUs. https://blogs.vmware.com/performance/2017/03/machine-learning-vsphere-6-5-nvidia-gpus-episode-2.html
  10. Episode 3: Performance Comparison of Native GPU to Virtualized GPU and Scalability of Virtualized GPUs for Machine Learning. https://blogs.vmware.com/performance/2017/10/episode-3-performance-comparison-native-gpu-virtualized-gpu-scalability-virtualized-gpus-machine-learning.html 

Updated – SQL Server VM Performance with vSphere 6.5, October 2017

Back in March, I published a performance study of SQL Server performance with vSphere 6.5 across multiple processor generations.  Since then, Intel has released a brand-new processor architecture: the Xeon Scalable platform, formerly known as Skylake.

Our team was fortunate enough to get early access to a server with these new processors inside – just in time for generating data that we presented to customers at VMworld 2017.

Each Xeon Platinum 8180 processor has 28 physical cores (pCores), and with four processors in the server, there was a whopping 112 pCores on one physical host!  As you can see, that extra horsepower provides nice database server performance scaling:

Generational SQL Server VM Database Performance

Generational SQL Server VM Database Performance

For more details and the test results, take a look at the updated paper:
Performance Characterization of Microsoft SQL Server on VMware vSphere 6.5

Performance of Enterprise Web Applications in Docker Containers on VMware vSphere 6.5

Docker containers are growing in popularity as a deployment platform for enterprise applications. However, the performance impact of running these applications in Docker containers on virtualized infrastructures is not well understood. A new white paper is available that uses the open source Weathervane performance benchmark to investigate the performance of an enterprise web application running in Docker containers in VMware vSphere 6.5 virtual machines (VMs).  The results show that an enterprise web application can run in Docker on a VMware vSphere environment with not only no degradation of performance, but even better performance than a Docker installation on bare-metal.

Weathervane is used to evaluate the performance of virtualized and cloud infrastructures by deploying an enterprise web application on the infrastructure and then driving a load on the application.  The tests discussed in the paper use three different deployment configurations for the Weathervane application.

  • VMs without Docker containers: The application runs directly in the guest operating systems in vSphere 6.5 VMs, with no Docker containers.
  • VMs with Docker containers: The application runs in Docker containers, which run in guest operating systems in vSphere 6.5 VMs.
  • Bare-metal with Docker containers: The application runs in Docker containers, but the containers run in an operating system that is installed on a bare-metal server.

The figure below shows the peak results achieved when running the Weathervane benchmark in the three configurations.  The results using Docker containers include the impact of tuning options that are discussed in detail in the paper.

Some important things to note in these results:

  • The performance of the application using Docker containers in vSphere 6.5 VMs is almost identical to that of the same application running in VMs without Docker.
  • The application running in Docker containers in VMs outperforms the same application running in Docker containers on bare metal by about 5%. Most of this advantage can be attributed to the sophisticated algorithms employed by the vSphere 6.5 scheduler.

The results discussed in the paper, along with the results of previous investigations of Docker performance on vSphere, show that vSphere 6.5 is an ideal platform for deploying applications in Docker containers.

Introducing TPCx-HS Version 2 – An Industry Standard Benchmark for Apache Spark and Hadoop clusters deployed on premise or in the cloud

Since its release on August 2014, the TPCx-HS Hadoop benchmark has helped drive competition in the Big Data marketplace, generating 23 publications spanning 5 Hadoop distributions, 3 hardware vendors, 2 OS distributions and 1 virtualization platform. By all measures, it has proven to be a successful industry standard benchmark for Hadoop systems. However, the Big Data landscape has rapidly changed over the last 30 months. Key technologies have matured while new ones have risen to prominence in an effort to keep pace with the exponential expansion of datasets. One such technology is Apache Spark.

spark-logo-trademarkAccording to a Big Data survey published by the Taneja Group, more than half of the respondents reported actively using Spark, with a notable increase in usage over the 12 months following the survey. Clearly, Spark is an important component of any Big Data pipeline today. Interestingly, but not surprisingly, there is also a significant trend towards deploying Spark in the cloud. What is driving this adoption of Spark? Predominantly, performance.

Today, with the widespread adoption of Spark and its integration into many commercial Big Data platform offerings, I believe there needs to be a straightforward, industry standard way in which Spark performance and price/performance could be objectively measured and verified. Just like TPCx-HS Version 1 for Hadoop, the workload needs to be well understood and the metrics easily relatable to the end user.

Continuing on the Transaction Processing Performance Council’s commitment to bringing relevant benchmarks to the industry, it is my pleasure to announce TPCx-HS Version 2 for Spark and Hadoop. In keeping with important industry trends, not only does TPCx-HS support traditional on premise deployments, but also cloud.

I envision that TPCx-HS will continue to be a useful benchmark standard for customers as they evaluate Big Data deployments in terms of performance and price/performance, and for vendors in demonstrating the competitiveness of their products.

 

Tariq Magdon-Ismail

(Chair, TPCx-HS Benchmark Committee)

 

Additional Information:  TPC Press Release

Oracle Database Performance on vSphere 6.5 Monster Virtual Machines

We have just published a new whitepaper on the performance of Oracle databases on vSphere 6.5 monster virtual machines. We took a look at the performance of the largest virtual machines possible on the previous four generations of four-socket Intel-based servers. The results show how performance of these large virtual machines continues to scale with the increases and improvements in server hardware.

Oracle Database Monster VM Performance across 4 generations of Intel based servers on vSphere 6.5

Oracle Database Monster VM Performance on vSphere 6.5 across 4 generations of Intel-based  four-socket servers

In addition to vSphere 6.5 and the four-socket Intel-based servers used in the testing, an IBM FlashSystem A9000 high performance all flash array was used. This array provided extreme low latency performance that enabled the database virtual machines to perform at the achieved high levels of performance.

Please read the full paper, Oracle Monster Virtual Machine Performance on VMware vSphere 6.5, for details on hardware, software, test setup, results, and more cool graphs.  The paper also covers performance gain from Hyper-Threading, performance effect of NUMA, and best practices for Oracle monster virtual machines. These best practices are focused on monster virtual machines, and it is recommended to also check out the full Oracle Databases on VMware Best Practices Guide.

Some similar tests with Microsoft SQL Server monster virtual machines were also recently completed on vSphere 6.5 by my colleague David Morse. Please see his blog post  and whitepaper for the full details.

This work on Oracle is in some ways a follow up to Project Capstone from 2015 and the resulting whitepaper Peeking at the Future with Giant Monster Virtual Machines . That project dealt with monster VM performance from a slightly different angle and might be interesting to those who are also interested in this paper and its results.

 

Capturing the Flag for Cloud IaaS Performance with VMware’s vSphere 6.5 and VIO 3.1 on Dell PowerEdge Servers

This week SPEC has published a new SPEC CloudTM IaaS 2016 result for a private cloud configuration built using VMware vSphere 6.5 and VMware Integrated OpenStack 3.1 (VIO 3.1) and Dell PowerEdge Servers. Working with VMware, Dell has pushed their lead in cloud performance even further. This time, the primary metric produced was a Scalability score of 78.5 @ 72 Application Instances (468 VMs). The Elasticity score was 87.4%.

VMware and Dell are active participants in SPEC and have contributed to the development of its industry standard benchmarks including SPEC Cloud IaaS 2016. Both organizations strongly support SPEC’s mission to provide a set of fair and realistic metrics on which to differentiate modern systems and technologies.

Continue reading

Weathervane, a benchmarking tool for virtualized infrastructure and the cloud, is now open source.

Weathervane is a performance benchmarking tool developed at VMware.  It lets you assess the performance of your virtualized or cloud environment by driving a load against a realistic application and capturing relevant performance metrics.  You might use it to compare the performance characteristics of two different environments, or to understand the performance impact of some change in an existing environment.

Weathervane is very flexible, allowing you to configure almost every aspect of a test, and yet is easy to use thanks to tools that help prepare your test environment and a powerful run harness that automates almost every aspect of your performance tests.  You can typically go from a fresh start to running performance tests with a large multi-tier application in a single day.

Weathervane supports a number of advanced capabilities, such as deploying multiple independent application instances, deploying application services in containers, driving variable loads, and allowing run-time configuration changes for measuring elasticity-related performance metrics.

Weathervane has been used extensively within VMware, and is now open source and available on GitHub at https://github.com/vmware/weathervane.

The rest of this blog gives an overview of the primary features of Weathervane.

Continue reading

vCenter 6.5 Performance: what does 6x mean?

At the VMworld 2016 Barcelona keynote, CTO Ray O’Farrell proudly presented the performance improvements in vCenter 6.5. He showed the following slide:

6x_slide

Slide from Ray O’Farrell’s keynote at VMworld 2016 Barcelona, showing 2x improvement in scale from 6.0 to 6.5 and 6x improvement in throughput from 5.5 to 6.5.

As a senior performance engineer who focuses on vCenter, and as one of the presenters of VMworld Session INF8108 (listed in the top-right corner of the slide above), I have received a number of questions regarding the “6x” and “2x scale” labels in the slide above. This blog is an attempt to explain these numbers by describing (at a high level) the performance improvements for vCenter in 6.5. I will focus specifically on the vCenter Appliance in this post.

Continue reading

SQL Server VM Performance with VMware vSphere 6.5

Achieving optimal SQL Server performance on vSphere has been a constant focus here at VMware; I’ve published past performance studies with vSphere 5.5 and 6.0 which showed excellent performance up to the maximum VM size supported at the time.

Since then, there have been quite a few changes!  While this study uses a similar test methodology, it features an updated hypervisor (vSphere 6.5), database engine (SQL Server 2016), OLTP benchmark (DVD Store 3), and CPUs (Intel Xeon v4 processors with 24 cores per socket, codenamed Broadwell-EX).

Continue reading