Home > Blogs > VMware VROOM! Blog > Tag Archives: vsphere

Tag Archives: vsphere

SQL Server VM Performance with VMware vSphere 6.5

Achieving optimal SQL Server performance on vSphere has been a constant focus here at VMware; I’ve published past performance studies with vSphere 5.5 and 6.0 which showed excellent performance up to the maximum VM size supported at the time.

Since then, there have been quite a few changes!  While this study uses a similar test methodology, it features an updated hypervisor (vSphere 6.5), database engine (SQL Server 2016), OLTP benchmark (DVD Store 3), and CPUs (Intel Xeon v4 processors with 24 cores per socket, codenamed Broadwell-EX).

The new tests show large SQL Server databases continue to run extremely efficiently, achieving great performance on vSphere 6.5. Following our best practices was all that was necessary to achieve this scalability – which reminds me, don’t forget to check out Niran’s new SQL Server on vSphere best practices guide, which was also just updated.

In addition to performance, power consumption was measured on each ESXi host. This allowed for a comparison of Host Power Management (HPM) policies within vSphere, performance per watt of each host, and power draw under stress versus idle:

Generational SQL Server DB Host Power and Performance/watt

Generational SQL Server DB Host Power and Performance/watt

Additionally, this new study compares a virtual file-based disk (VMDK) on VMware’s native Virtual Machine File System (VMFS 5) to a physical Raw Device Mapping (RDM). I added this test for two reasons: first, it has been several years since they have been compared; and second, customer feedback from VMworld sessions indicates this is still a debate that comes up in IT shops, particularly with regard to deploying database workloads such as SQL Server and Oracle.

For more details and the test results, download the paper: Performance Characterization of Microsoft SQL Server on VMware vSphere 6.5

Machine Learning on vSphere 6 with Nvidia GPUs – Episode 2

by Hari Sivaraman, Uday Kurkure, and Lan Vu

In a previous blog [1], we looked at how machine learning workloads (MNIST and CIFAR-10) using TensorFlow running in vSphere 6 VMs in an NVIDIA GRID configuration reduced the training time from hours to minutes when compared to the same system running no virtual GPUs.

Here, we extend our study to multiple workloads—3D CAD and machine learning—run at the same time vs. run independently on a same vSphere server.

Performance Impact of Mixed Workloads

Many customers of the NVIDIA GRID vGPU solution on vSphere run 3D CAD workloads. The traditional approach to run 3D CAD and machine learning workloads, typically, is to run the CAD workloads during the day and the machine learning workloads at night, or to have separate infrastructures for each type of workload, but this solution is inflexible and can increase deployment cost. We show, in this section, that this kind of separation is entirely unnecessary. Both workloads can be run concurrently on the same server, and the performance impact on the 3D CAD workload as well as on the machine learning workload is negligible in three out of four vGPU profiles.

vSphere 6.5 supports four vGPU profiles, and the primary difference between them is the amount of VRAM available to each VM:

  • M60-1Q: 1GB VRAM
  • M60-2Q: 2GB VRAM
  • M60-4Q: 4GB VRAM
  • M60-8Q: 8GB VRAM

In this blog, we characterize the performance impact of running 3D CAD and machine learning workloads concurrently using two benchmarks. We chose the SPECapc for 3ds Max 2015 [2] benchmark as a representative for 3D CAD workloads and so did not comply with the benchmark reporting rules, nor do we use or make comparisons to the official SPECapc metrics. We chose MNIST [3] as a representative for machine learning workloads. The performance metric in this comparison is simply the run time for the benchmark.

Our results show that the performance impact on the 3D CAD workload due to sharing the server and GPUs with the machine learning workload is below 5% (in the M60-2Q, M60-4Q, and M60-8Q profiles) when compared to running only the 3D CAD workload on the same hardware. Correspondingly, the performance impact on the machine learning workload when sharing the hardware resources with the 3D CAD workload compared to running all by itself is under 15% in the M60-2Q, M60-4Q, and M60-8Q profiles. In other words, the run time for the 3D CAD benchmark increases by less than 5% when sharing the hardware with the machine learning workload when compared to when it does not share the hardware. The increase in run time for machine learning was under 15% when sharing compared to not sharing the hardware.

Experimental Configuration and Methodology

We installed the 3D CAD benchmark in a 64-bit Windows 7 (SP1) VM with 4 vCPUs and 16GB RAM. The benchmark uses Autodesk 3ds Max 2015 software. The NVIDIA vGPU driver (#369.4) was used in the VM. We configured the vGPU profiles as M60-1Q, M60-2Q, M60-4Q, or M60-8Q for different runs. We used this VM as the golden master from which we made linked clones so that we could run the 3D CAD benchmark at scale with 1, 2, 4, …, 24 VMs running 3D CAD simultaneously. The software configurations used for the 3D CAD workload are shown in Table 2, below.

ESXi 6.5   #4240417
Guest OS CentOS Linux release 7.2.1511 (Core)
CUDA Driver & Runtime 7.5
TensorFlow 0.1

Table 1. Configuration of VM used for machine learning benchmarks

vGPU profile Total # VMs running concurrently in the test
M60-8Q 3
M60-4Q 6
M60-2Q 12
M60-1Q 24

Table 2. Software configuration used to run the 3D CAD benchmark

Our experiments includes three sets of runs. In the first set, we ran only the 3D CAD benchmark for each of the four configurations listed in Table 2, above and measured the run time, the ESXi CPU utilization, and the GPU utilization. Once this set of runs was completed, we did a second set in which we ran the 3D CAD benchmark concurrently with the machine learning benchmark. To do this, we first installed the MNIST benchmark, CUDA, cuDNN and TensorFlow in a CentOS VM with the configuration shown in Table 1, above.  Since CUDA only works with an M60-8Q profile, we used it for the VM that runs the machine learning benchmark. For the runs in this second set, we used the configurations shown in Table 4 and measured the run time for the 3D CAD benchmark, the run time for MNIST, the total ESXi CPU utilization, and the total GPU utilization. The server configuration used in our experiments is shown in Table 3, below.

Model Dell PowerEdge R730
CPU Intel Xeon Processor E5-2680 v4 @ 2.40GHz
CPU cores 28 CPUs, each @ 2.40GHz
Processor sockets 2
Cores per socket 14
Logical processors 56
Hyperthreading Active
Memory 768GB
Storage Local SSD (1.5TB), storage arrays, local hard disks
GPUs 2x NVIDIA Tesla M60
ESXi 6.5   #4240417
NVIDIA vGPU driver in ESXi host 367.38
NVIDIA vGPU driver inside VM 369.04

Table 3. Server configuration

vGPU profile used for 3D CAD VMs only #VMs running 3D CAD # VMs running MNIST Total # VMs running concurrently in test
M60-8Q 3 1 4
M60-4Q 6 1 7
M60-2Q 12 1 13
M60-1Q 24 1 25

Table 4. Software configuration used to run the mixed workloads (please note that that the machine learning VM can be configured only using the M60-8Q profile; no other profiles are supported)

We did a third set of runs in which we ran only MNIST on the server. Since MNIST runs only in the M60-8Q profile, only one run was done in this third set.

Results

We compared the run times of the first set of runs (3D CAD benchmark only) with the ones of the second set (3D CAD and MNIST are run concurrently) as well as computed the percentage increase in the run time for 3D CAD when it shares the server with MNIST compared to when it ran without MNIST. So specifically, say, in the M60-4Q profile,  we computed the percentage increase in run time for 3D+ML compared to 3D only in the M60-4Q profile. We also measured the run time of MNIST running concurrently with 3D CAD with the run time of MNIST running by itself on the server and computed the percentage increase in run time for MNIST. We call this increase in run time the performance drop or change. The computed values are shown in Figure 1 below.

3d-ml-fig3_001

Figure 1. Percentage increase in run time for 3D graphics (3D) and machine learning (ML) workloads due to running concurrently compared to running in isolation

From the figure we can see that in the M60-8Q, M60-4Q, and M60-2Q profiles, the run times for 3D CAD when it shares the server with machine learning compared to when 3D CAD runs by itself is less than 5%. For the MNIST machine learning workload, the performance penalty due to sharing compared to no sharing is under 15% in the M60-2Q, M60-4Q, and M60-8Q profiles. Only the M60-1Q profile that can support up to 24 VMs running 3D CAD and one VM running MNIST show any significant performance penalty due to sharing. Now, if the workloads were run sequentially, the total time to complete the tasks would be the sum of the run time for 3D CAD and the machine learning workloads.

A comparison of the total run time for ML and 3D CAD workloads is shown in Figure 2. From the figure, we can see that the total time to completion of the workloads is always less when run concurrently as opposed to when run sequentially.

3d-ml-fig2_001

⇑ Figure 2. It takes a longer time to sequentially run a 3D plus ML (machine learning) mixed workload when compared to the time to run a concurrent mixed workload. The original time was in seconds, but we have normalized the concurrent time to 1 so that the change in sequential time stands out.

 

3d-ml-fig3_002

Figure 3. CPU utilization on server for mixed workload configuration (3D+ML) and for 3D graphics only (3D)

Further, running the workloads concurrently results in higher server utilization, which could result in higher revenues for a cloud service provider. The M60-1Q profile does show a higher time to complete when workloads are run concurrently when compared to being run sequentially, but it does achieve very high consolidation (measured as number of VMs per core) and high server utilization. So, if in the M60-1Q profile, the higher time to complete the workload run can be tolerated, the configuration that runs the workloads concurrently would achieve higher revenues for a cloud service provider because of higher server utilization. The CPU utilization on the server for the M60-8Q, M60-4Q, M60-2Q, and M60-1Q profiles with only 3D CAD (3D) and with 3D CAD plus machine learning (3D+ML) are shown in Figure 3, above.

Conclusions

  • Simultaneously running 3D CAD and machine learning workloads reduces the total time to complete the runs with the M60-2Q, M60-4Q, and M60-8Q profiles compared to running the workloads sequentially. This is a radical departure from traditional approaches to scheduling machine learning and 3D CAD workloads.
  • Running 3D graphics and machine learning workloads concurrently increases server utilization, which could result in higher revenues for a cloud service provider.

References

[1] Machine Learning on VMware vSphere 6 with NVIDIA GPUs
https://blogs.vmware.com/performance/2016/10/machine-learning-vsphere-nvidia-gpus.html

[2] The MNIST Database of Handwritten Digits
http://yann.lecun.com/exdb/mnist/ 

[3] SPECapc for 3ds Max 2015
https://www.spec.org/gwpg/apc.static/max2015info.html

 

 

Understanding vSphere DRS Performance – A White Paper

VMware vSphere Distributed Resource Scheduler (DRS) is responsible for placement of Virtual Machines and balancing of resources in a cluster. The key driver for DRS is VM/Application happiness, and it achieves this by effective VM placement and efficient load balancing. We have a new white paper, which tries to explain how DRS works in basic scenarios and how it can be tuned to behave differently for specific scenarios.

The white paper talks about the factors that influence DRS decisions and provides some useful insights into different parameters that can be tuned in specific scenarios to make DRS more effective. It also explains how to monitor DRS to better understand its behavior.

It covers DRS behavior in specific scenarios with some case studies. Some of these studies are around

  •  VM Consumed vs. Active Memory – How it impacts DRS behavior.
  •  Impact of VM overrides on cluster balance.
  •  Prerequisite moves during initial placement.
  •  Using shares to prioritize cluster resources.

The paper provides knowledge about the factors that affect DRS behavior and helps understand how DRS does what it does. This knowledge, along with monitoring and troubleshooting tips, including real case studies, will help tune DRS clusters for optimum performance.

Machine Learning on VMware vSphere 6 with NVIDIA GPUs

by Uday Kurkure, Lan Vu, and Hari Sivaraman

Machine learning is an exciting area of technology that allows computers to behave without being explicitly programmed, that is, in the way a person might learn. This tech is increasingly applied in many areas like health science, finance, and intelligent systems, among others.

In recent years, the emergence of deep learning and the enhancement of accelerators like GPUs has brought the tremendous adoption of machine learning applications in a broader and deeper aspect of our lives. Some application areas include facial recognition in images, medical diagnosis in MRIs, robotics, automobile safety, and text and speech recognition.

Machine learning workloads have also become a critical part in cloud computing. For cloud environments based on vSphere, you can even deploy a machine learning workload yourself using GPUs via the VMware DirectPath I/O or vGPU technology.

GPUs reduce the time it takes for a machine learning or deep learning algorithm to learn (known as the training time) from hours to minutes. In a series of blogs, we will present the performance results of running machine learning benchmarks on VMware vSphere using NVIDIA GPUs.

Episode 1: Performance Results of Machine Learning with DirectPath I/O and NVIDIA GPUs

In this episode, we present the performance results of running machine learning benchmarks on VMware vSphere with NVIDIA GPUs in DirectPath I/O mode and on GRID virtual GPU (vGPU) mode.

Training Time Reduction from Hours to Minutes

Training time is the performance metric used in supervised machine learning—it is the amount of time a computer takes to learn how to solve the given problem. In supervised machine learning, the computer is given data in which the answer can be found. So, supervised learning infers a model from the available, or labelled training data.

Our first machine learning benchmark is a simple demo model in the TensorFlow library. The model classifies handwritten digits from the MNIST dataset. Each digit is a handwritten number that is centered within a consistently sized grayscale bitmap. The MNIST database of handwritten digits contains 60,000 training examples and has a test set of 10,000 examples.

First, we compare training times for the model using two different virtual machine configurations:

  • NVIDIA GRID Configuration (vg1c12m60GB): 1 vGPU, 12 vCPUs, 60GB memory, 96GB of SSD storage, CentOS 7.2
  • No GPU configuration (g0c12m60GB): No GPU, 12 vCPUs, 60GB memory, 96GB of SSD storage, CentOS 7.2
MNIST vg1c12m60GB
1 vGPU 
(secs)
g0c12m60GB
No GPU (secs)
Normalized Training Time
(wrt vg1c12)
1.0 10.06
CPU Utilization 8% 43%

The above table shows that vGPU reduces the training time by 10 times. The CPU utilization also goes down 5 times. See the graphs below.

01-training-time-mnist

02-mnist-cpu-util

Scaling from One GPU to Four GPUs

This machine learning benchmark is made up of two components:

We use the metric of images per second (images/sec) to compare the different configurations as we scale from a single GPU to 4 GPUs. The metric of images/second denotes the number of images processed per second in training the model.

Our host has two NVIDIA M60 cards. Each card has 2 GPUs. We present the performance results for scaling up from 1 GPU to 4 GPUs.

You can configure the GPUs in two modes:

  • DirectPath I/O passthrough: In this mode, the host can be configured to have 1 to 4 GPUs in a DirectPath I/O passthrough mode. A virtual machine running on the host will have access to 1 to 4 GPUs in passthrough mode.
  • GRID vGPU mode: For machine learning workloads, each VM should be configured with the highest profile vGPU. Since we have M60 GPUs, we configured VMs with vGPU type M60-8q. M60-8q implies one VM/GPU.

DirectPath I/O

First we focus on DirectPath I/O passthrough mode as we scale from 1 GPU to 4 GPUs.

CIFAR-10 g1c48m60GB g2c48m60GB g4c48m60GB
   1 GPU  2 GPUs  4 GPUs
Normalized Images/sec in Thousands (w.r.t. 1 GPU) 1.0 2.04 3.74
CPU Utilization 25% 44% 71%

As the above table shows, the images processed per second improves almost linearly with the number of GPUs on the host. This means that the number of images processed becomes greater with each increase in the number of GPUs in an amount that is expected. 1 GPU sets the normalized data at 1,000 image/sec. We expect 2 GPUs to handle about double that of 1 GPU, which the graph shows. Next, we see that 4 GPUs can handle nearly 4,000 images/sec.

03-cifar10-images-per-sec

Host CPU utilization also increases linearly, as shown in the following graph.

04-cifar10-cpu-used

Single GPU DirectPath I/O vs GRID vGPU mode

Now, we present comparison of performance results for DirectPath IO and GRID vGPU mode.

Since each VM can have only one vGPU in GRID vGPU mode, we first present the results for 1 GPU configuration in DirectPath IO mode with vGPU mode.

 

MNIST g1c48m60GB vg1c48m60GB
(Lower Is Better) DirectPath I/O GRID vGPU
Normalized Training Times 1.0 1.05

 

CIFAR-10 g1c48m60GB vg1c48m60GB
(Higher  Is Better) DirectPath I/O GRID vGPU
Normalized Images/sec 1.0 0.87

 

The above tables show that one GPU configuration in DirectPath I/O and GRID mode vGPU are very close in performance. We suggest you use GRID vGPU mode because it offers the benefits of virtualization.

Multi-GPU DirectPath I/O vs Multi-VM DirectPath I/O vs Multi-VMs in GRID vGPU mode

Now we move on to multi-GPU performance results for DirectPath I/O and GRID vGPU mode. In DirectPath I/O mode, a VM can be configured with all the GPUs on the host.  In our case, we configured the VM with 4 GPUs. In GRID vGPU mode, each VM can have at most 1 GPU. Therefore, we compare the results of 4 VMs running the same job with a VM using 4 GPUs using Direct Path I/O.

CIFAR-10 g4c48m60GB g1c12m16GB (4-vms) vg1c12m16GB(4-vms)
DirectPath I/O DirectPath I/O (4 VMs) GRID vGPU ( 4 VMs)
Normalized Images/Sec
(Higher Is Better)
1.0 0.98 0.92
CPU Utilization 71% 68% 69%

05-cifar10

06-cifar10

The multi-GPU DirectPath I/O mode configuration performs better. If your workload requirement is low latency or requires a short training time, you should use multi-GPU DirectPath I/O mode. However, other virtual machines will not be able use the GPUs on the host at the same time. If you can tolerate longer latencies or training times, we recommend using a 1-GPU configuration.  GRID vGPU mode enables the benefits of virtualization: flexibility and elasticity.

Takeaways

  • GPUs bring the training times of machine learning algorithms from hours to minutes.
  • You can use NVIDIA GPUs in two modes in the VMware vSphere environment for machine learning applications:
    • DirectPath I/O passthrough mode
    • GRID vGPU mode
  • You should use GRID vGPU mode with the highest vGPU profile. The highest vGPU profile implies 1 VM/GPU, thus giving the virtual machine full access to the entire GPU.
  • For a 1-GPU configuration, the performance of the machine learning applications in GRID vGPU mode is comparable to DirectPath I/O.
  • For the shortest training time, you should use a multi-GPU configuration in DirectPath I/O mode.
  • For running multiple machine learning jobs simultaneously, you should use GRID vGPU mode. This configuration offers a higher consolidation of virtual machines and leverages the flexibility and elasticity benefits of VMware virtualization.

Go to Machine Learning on vSphere 6 with Nvidia GPUs – Episode 2.

References

Configuration Details

Host Configuration

Model Dell PowerEdge R730
Processor Type Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz
CPU Cores 24 CPUs, each @ 2.499GHz
Processor Sockets 2
Cores per Socket 12
Logical Processors 48
Hyperthreading Active
Memory 768GB
Storage Local SSD (1.5TB), Storage Arrays, Local Hard Disks
GPUs 2x M60 Tesla

Software Configuration

ESXi  6.0.0, 3500742
Guest OS CentOS Linux release 7.2.1511 (Core)
CUDA Driver 7.5
CUDA Runtime 7.5

VM Configurations

VM vCPUs Memory Storage GPUs Guest OS Mode
g0xc12m60GB 12 vCPUs 60GB 1x96GB (SSD) 0 CentOS 7.2 No GPU
g1xc12m60GB 12 vCPUs 60GB 1x96GB (SSD) 1 CentOS 7.2 DirectPath I/O
g2xc48m60GB 48 vCPUs 60GB 1x96GB

(SSD)

2 CentOS 7.2 DirectPath I/O
g4xc48m60GB 48 vCPUs 60GB 1x96GB

(SSD)

4 CentOS 7.2 DirectPath I/O
vg1xc12m60GB 12 vCPUs 60GB 1x96GB (SSD) 1 CentOS 7.2 GRID vGPU
g1c12m16GB 12 vCPUs 16GB 1x96GB

(SSD)

1 CentOS 7.2 DirectPath I/O
vg1c12m16GB 12 vCPUs 16GB 1x96GB

(SSD)

1 CentOS 7.2 GRID vGPU

 

 

DRS Doctor is here to diagnose your DRS clusters

Mystery revealed, DRS for VMware vSphere is no more a black box! DRS Doctor will tell you all you need to know about your DRS clusters.

Our latest fling, DRS Doctor, will monitor your DRS clusters for virtual machine and host resource usage data, DRS-recommended migrations, and the reason behind each migration. It also monitors all the cluster-related events, tasks, and cluster balance, and logs all this information into a plain text log file that anyone can read.

Read this blog for more information on how DRS Doctor can monitor and diagnose your clusters.

Download DRS Doctor from our flings site.

Tutorial Session on Performance Debugging on VMware vSphere

Ever wondered what it takes to debug performance issues on a VMware stack? How do you figure out if the performance issue is in your virtual machine, or the network layer, or the storage layer, or the hypervisor layer?

Here’s a handy tutorial that showcases a systematic approach for troubleshooting performance using tools like Esxtop, vSCSI stats and Net stats on a VMware stack. The tutorial also talks about some very useful optimizations and performance best practices.

Thanks to Ramprasad K. S. for putting together the slides based on his vast experience dealing with customer issues. Thanks also to Ramprasad and Sai Inabattini for presenting this at the CMG India 2nd Annual conference in Bangalore in November 2015, which was received very well.

Fault Tolerance Performance in vSphere 6

VMware has published a technical white paper about vSphere 6 Fault Tolerance architecture and performance. The paper describes which types of applications work best in virtual machines with vSphere FT enabled.

VMware vSphere Fault Tolerance (FT) provides continuous availability to virtual machines that require a high amount of uptime. If the virtual machine fails, another virtual machine is ready to take over the job.  vSphere achieves FT by maintaining primary and secondary virtual machines using a new technology named Fast Checkpointing. This technology is similar to Storage vMotion, which copies the virtual machine state (storage, memory, and networking) to the secondary ESXi host. Fast Checkpointing keeps the primary and secondary virtual machines in sync.

vSphere FT works with (and requires) vSphere HA—when an administrator enables FT, vSphere HA selects the secondary VM (admins can vMotion the VM to another server if needed). vSphere HA also creates a new secondary if the primary fails—the original secondary becomes the new primary, and vSphere HA selects an available virtual machine to use as the new secondary.

vSphere 6 FT supports applications with up to 4 vCPUs and 64GB memory on the ESXi host. The performance study shows results for various workloads run on virtual machines with 1, 2, and 4 vCPUs.

The workloads—which tax the virtual machine’s CPU, disk, and network—include:

  • Kernel compile – loads the CPU at 100%
  • Netperf-  measures network throughput and latency
  • Iometer- characterizes the storage I/O of a Microsoft Windows virtual machine
  • Swingbench- drives an OLTP load on a virtual machine running Oracle 11g
  • DVD Store –  drives an OLTP load on a virtual machine running Microsoft SQL Server 2012
  • A brokerage workload – simulates an OLTP load of a brokerage firm
  • vCenterServer workload – simulates actions performed in vCenter Server

Testing shows that vSphere FT can successfully protect a number of workloads like CPU-bound workloads, I/O-bound workloads, servers, and complex database workloads; however, admins should not use vSphere FT to protect highly latency-sensitive applications like voice-over-IP (VOIP) or high-frequency trading (HFT).

For the results of these tests, read the paper. Also useful is the VMware Fault Tolerance FAQ.

Virtualizing Performance-Critical Database Applications in VMware vSphere 6.0

by Priti Mishra

Performance studies have previously shown that there is no doubt virtualized servers can run a variety of applications near, or in some cases even above, that of software running natively (on bare metal). In a new white paper, we raise the bar higher and test “monster” vSphere virtual machines loaded with CPU and running the most taxing databases and transaction processing applications.

The benchmark workload, which we call Order-Entry, is based on an industry-standard online transaction processing (OLTP) benchmark called TPC-C. Both rigorous and demanding, the Order-Entry workload pushes virtual machine performance.

Note: The Order Entry benchmark is derived from the TPC-C workload, but is not compliant with the TPC-C specification, and its results are not comparable to TPC-C results.

The white paper quantifies the:

  • Performance differential between ESXi 6.0 and native
  • Performance differential between ESXi 6.0 and ESXi 5.1
  • Performance gains due to enhancements built into ESXi 6.0

Results from these experiments show that even the most demanding applications can be run, with excellent performance, in a virtualized environment with ESXi 6.0.  For example, our test results show that ESXi 6.0 virtual machines run out of the box at 90% of the performance of native systems. In addition, a 64-vCPU, 475GB VM processes 59.5K DBMS transactions per second while issuing 155K IOPS, capabilities well above even the high-end Oracle database installations. Even for applications that may require 64 or 128 vCPUs, the high-end performance boost of ESXi 6.0 over ESXi 5.1 makes ESXi 6.0 the best platform for virtualizing databases such as Oracle.

ESXi 6.0 Performance Relative to Native

With a 64-vCPU VM running on a 72-pCPU ESXi host, throughput was 90% of native throughput on the same hardware platform. Statistics which give an indication of the load placed on the system in the native and virtual machine configurations are summarized in Table 1.

Metric Native VM
Throughput in transactions per second 66.5K 59.5K
Average CPU utilization of 72 logical CPUs 84.7% 85.1%
Disk IOPS 173K 155K
Disk Megabytes/second 929MB/s 831MB/s
Network packets/second 71K/s receive
71K/s send
63K/s receive
64K/s send
Network Megabytes/second 15MB/s receive
36MB/s send
13MB/s receive
32MB/s send

Table 1. Comparison of Native and Virtual Machine Benchmark Load Profiles

 

The corresponding guest statistics in Table 2 provide another perspective on the resource-intensive nature of the workload. These common Linux performance metrics show that while the benchmark workload was heavy in terms of raw CPU demands, it also placed a heavy load on the operating system, interrupt handling, and the storage subsystem, areas that have traditionally been associated with high virtualization overheads.

 

Metric Amount
Interrupts per second 327K
Disk IOPS 155K
Context switches per second 287K
Load average 231

Table 2. Guest OS Statistics

ESXi 6.0 Performance Relative to ESXi 5.1

Experimental data comparing ESXi 6.0 with ESXi 5.1 (see Figures 1 and 2) show that high-end scale-up with ESXi 6.0 mirrors that of native systems.

fig1-dbapps-perf

Figure 1. Absolute throughput values

With ESXi 5.1, the Order-Entry benchmark throughput of a 64-vCPU VM on a 4-socket, 32 core/64 thread E7- 4870 (Westmere) server was 70% of the throughput of the same server in native mode when both servers were running at 77% CPU utilization (the native server reached a maximum CPU utilization of 88% and throughput of 54.8 transactions per second).

fig2-dbapps-perf

Figure 2. Relative throughput ratios

vSphere has the capability to handle loads far larger than that demanded by most Oracle database applications in production. Support for monster VMs with up to 128 vCPUs, throughput which is 90% of native and a significant performance boost over ESXi 5.1, make ESXi 6.0 an excellent platform for virtualizing very high end Oracle databases.

For details regarding experiments and the performance enhancements in vSphere, please read the paper.

VMware vCloud Air Database Performance Scalability with SQL Server

Previous posts have shown vSphere can easily handle running Microsoft SQL Server on four-socket servers with large numbers of cores—with vSphere 5.5 on Westmere-EX and more recently with vSphere 6 on Ivy Bridge-EX.  We recently ran similar tests on vCloud Air to measure how these enterprise databases with mission critical performance requirements perform in a cloud environment. The tests show that SQL Server databases scale very well on vCloud Air with a variety of virtual machine (VM) counts and virtual CPU (vCPU) sizes.

The benchmark tests were run with vCloud Air using their Virtual Private Cloud (VPC) subscription-based service.  This is a very compelling hybrid cloud service that allows for an on-premises vSphere infrastructure to be expanded into the public cloud in a secure and scalable way. The underlying host hardware consisted of two 8-core CPUs for a total of 16 physical cores, which meant that the maximum number of vCPUs was 16 (although additional processors were available via Hyper-Threading, they were not utilized).

Windows Server 2012 R2 was the guest OS, and SQL Server 2012 Standard edition was the database engine used for all the VMs.  All databases were placed on an SSD Accelerated storage tier for maximum disk I/O performance.  The test configurations are summarized below:

# VMs, # vCPUs, Memory configurations tested

DVD Store 2.1 (an open-source OLTP database stress tool) was the workload used to stress the VMs.  The first experiment was to scale up the number of 4 vCPU VMs.  The graph below shows that as the number of VMs is increased from 1 to 4, the aggregate performance (measured in orders per minute, or OPM) increases correspondingly:
vCA_SQL_4vCPU
When the size of each VM was doubled from 4 to 8 virtual CPUs, the OPM also approximately doubles for the same number of VMs as shown in the chart below.vCA_SQL_8vCPU

This final chart includes a test run with one large 16 vCPU VM.  As expected, the 16 vCPU performance was similar to the four 4vCPU VMs and eight 2vCPU VM test cases.  The slight drop can be attributed to spanning multiple physical processors and thus multiple NUMA nodes within a single VM.

vCA_SQL_16vCPU

In summary, SQL Server was found to perform and scale extremely well running on vCloud Air with 4, 8, and 16 vCPU VMs.  In the future, look for more benchmarks in the cloud as it continues to evolve!

For more information on vCloud Air, check out these third-party studies from Principled Technologies that compare it to competitive offerings, namely Microsoft Azure and Amazon Web Services (AWS):

Scaling Performance for VAIO in vSphere 6.0 U1

by Chien-Chia Chen

vSphere APIs for I/O Filtering (VAIO) is a framework that enables third-party software developers to implement data services, such as caching and replication, to vSphere. Figure 1 below shows the general architecture of VAIO. Once I/O filter libraries are installed to a virtual disk (VMDK), every I/O request generated from the guest operating system to the VMDK will first be intercepted by the VAIO framework at the file device layer. The VAIO framework then hands over the I/O request to the user space I/O filter libraries, where a series of third party data service operations can be performed against the I/O. After processing the I/O, user space I/O filter libraries return the I/O back to the VAIO framework, which continues the rest of the issuing path. Similarly, upon completion, the I/O will first be processed by the user space I/O filter libraries before continuing its original completion path.

There have been questions around the overhead of the VAIO framework due to its extra user-to-kernel communication. In this blog post, we evaluate the performance of vSphere APIs for I/O Filtering using a null I/O filter and demonstrate how VAIO scales with respect to the number of virtual machines and outstanding I/Os (OIOs). The null I/O filter accepts each I/O request and immediately returns it.

fig1-iofilt-arch

Figure 1. vSphere APIs for I/O Filtering Architecture

System Configuration

The configuration of our systems is as follows:

  • One ESXi host
    • Machine: Dell R720 server running vSphere 6.0 Update 1
    • CPU: 16-core, 2-socket (32 hyper-threads) Intel® Xeon® E5-2665 @ 2.4 GHz
    • Memory: 128GB memory
    • Physical Disk: One Intel® S3700 400GB SATA SSD on LSI MegaRAID SAS controller
    • VM: Up to 32 link-cloned I/O Analyzer 1.6.2 VMs (SUSE Linux Enterprise 11 SP2; 1 virtual CPU (VCPU) and 1GB memory each). Each virtual machine has 1 PVSCSI controller hosting two 1GB VMDKs—one has no I/O filter and another has a null filter, both think-provisioned.
  • Workload: Iometer 4K sequential read (4K-aligned) with various number of OIOs

Methodology

We conduct two sets of tests separately—one against VMDK without an I/O filter (referred to as “default”) and another against the null-filter VMDK (referred to as “iofilter”). In each set of tests, every virtual machine has one Iometer disk worker to generate 4K sequential read I/Os to the VMDK under test. We have a 2-minute warm-up time and measure I/Os per second (IOPS), normalized CPU cost, and read latency over the next 2-minute test duration. The latency is the median of the average read latencies reported by all Iometer workers.

Note that I/O sizes and access patterns do not affect the performance of VAIO since it does no additional data copying, maintains the original access patterns, and incurs no extra access to physical disks.

Results

VM Scaling

Figures 2 and 3 below show the IOPS, CPU cost per 1K IOPS, and latency with a different number of virtual machines at 128 OIOs. Except for the single virtual machine test, results show that VAIO achieves similar IOPS and has similar latency compared to the default VMDK. However, VAIO introduces 10%-20% higher CPU overhead per 1K IOPS. The single virtual machine IOPS with iofilter is 80% higher than the default VMDK. This is because, in the default case, the VCPU performs the majority of synchronous I/O work; whereas, in the iofilter case, VAIO contexts take over a big portion of the work and unblock the VCPU from generating more I/Os. With additional VCPUs and Iometer disk workers to mitigate the single core bottleneck, the default VMDK is also able to drive over 70K IOPS.

fig2a-iofilt
Figure 2. IOPS and CPU Cost vs. Number of VMs (128 Outstanding I/Os)

fig3-iofilt

Figure 3. Iometer Read Latency vs. Number of VMs (128 Outstanding I/Os)

 

OIO Scaling

Figures 4 and 5 below show the IOPS, CPU cost per 1K IOPS, and latency with a different number of OIOs at 16 virtual machines. A similar trend again holds that VAIO achieves the same IOPS and has the same latency compared to the default VMDK while it incurs 10%-20% higher CPU overhead per 1K IOPS.

fig4-iofilt

Figure 4. Percent of a Core per 1 Thousand IOPS vs. Outstanding I/Os (16 VMs)

fig5-iofilt

Figure 5. Iometer Read Latency vs. Outstanding I/Os (16 VMs)

Conclusion

Based on our evaluation, VAIO achieves comparable throughput and latency performance at a cost of 10%-20% more CPU cycles. From our experience, when using the VAIO framework, we recommend the following general best practices:

  • Reduce CPU over-commitment. The VAIO framework introduces at least one additional context per VMDK with an active filter. Over-committing CPU can result in intensive CPU contention, thus much worse virtualization efficiency.
  • Avoid blocking when developing I/O filter libraries. Keep in mind that an I/O will be blocked until the user space I/O filter finishes processing. Thus additional processing time will result in higher end-to-end latency.
  • Increase concurrency wisely when developing I/O filter libraries. The user space I/O filter can potentially serve I/Os from all VMDKs. Thus, when developing I/O filter libraries, it is important to be flexible in terms of concurrency to avoid a single core CPU bottleneck and meanwhile without introducing too many unnecessary active contexts that cause higher CPU contention.