
SQL Server VM Performance with VMware vSphere 6.5

Achieving optimal SQL Server performance on vSphere has been a constant focus here at VMware; I’ve published past performance studies with vSphere 5.5 and 6.0, which showed excellent performance up to the maximum VM size supported at the time.

Since then, there have been quite a few changes!  While this study uses a similar test methodology, it features an updated hypervisor (vSphere 6.5), database engine (SQL Server 2016), OLTP benchmark (DVD Store 3), and CPUs (Intel Xeon v4 processors with 24 cores per socket, codenamed Broadwell-EX).

The new tests show large SQL Server databases continue to run extremely efficiently, achieving great performance on vSphere 6.5. Following our best practices was all that was necessary to achieve this scalability – which reminds me, don’t forget to check out Niran’s new SQL Server on vSphere best practices guide, which was also just updated.

In addition to performance, power consumption was measured on each ESXi host. This allowed for a comparison of Host Power Management (HPM) policies within vSphere, performance per watt of each host, and power draw under stress versus idle:

Generational SQL Server DB Host Power and Performance/watt
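
To make the performance-per-watt comparison concrete, here is a minimal sketch of the calculation; the host names, throughput, and power figures are placeholders I made up for illustration, not results from the study.

  # Illustrative only: performance per watt = benchmark throughput / average power draw.
  # Host names, orders/min, and wattage below are hypothetical placeholder values.
  def perf_per_watt(orders_per_min, avg_watts):
      return orders_per_min / avg_watts

  hosts = {
      "broadwell-ex-host": (250_000, 900),      # (DVD Store orders/min, average watts)
      "older-generation-host": (150_000, 800),
  }
  for name, (opm, watts) in hosts.items():
      print(f"{name}: {perf_per_watt(opm, watts):.0f} orders/min per watt")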

Additionally, this new study compares a virtual file-based disk (VMDK) on VMware’s native Virtual Machine File System (VMFS 5) to a physical Raw Device Mapping (RDM). I added this test for two reasons: first, it has been several years since they have been compared; and second, customer feedback from VMworld sessions indicates this is still a debate that comes up in IT shops, particularly with regard to deploying database workloads such as SQL Server and Oracle.

For more details and the test results, download the paper: Performance Characterization of Microsoft SQL Server on VMware vSphere 6.5

Performance of Storage I/O Control (SIOC) with SSD Datastores – vSphere 6.5

With Storage I/O Control (SIOC), vSphere 6.5 administrators can adjust the storage performance of VMs so that VMs with critical workloads will get the I/Os per second (IOPS) they need. Admins assign shares (the proportion of IOPS allocated to the VM), limits (the upper bound of VM IOPS), and reservations (the lower bound of VM IOPS) to the VMs whose IOPS need to be controlled.  After shares, limits, and reservations have been set, SIOC is automatically triggered to meet the desired policies for the VMs.
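
To illustrate how shares, limits, and reservations interact, here is a small, self-contained sketch of a proportional-share allocation. It is my own simplified model, not the SIOC implementation, and the VM names and IOPS numbers are hypothetical.

  # Simplified model of share/limit/reservation semantics (not the SIOC algorithm).
  # All VM names and numbers below are hypothetical.
  def allocate_iops(vms, datastore_iops):
      # Every VM first receives its reservation (lower bound).
      alloc = {name: cfg["reservation"] for name, cfg in vms.items()}
      remaining = datastore_iops - sum(alloc.values())
      active = {n for n, c in vms.items() if alloc[n] < c["limit"]}
      # Distribute what is left in proportion to shares, never exceeding limits.
      while remaining > 1e-6 and active:
          total_shares = sum(vms[n]["shares"] for n in active)
          handed_out = 0.0
          for n in list(active):
              grant = min(remaining * vms[n]["shares"] / total_shares,
                          vms[n]["limit"] - alloc[n])
              alloc[n] += grant
              handed_out += grant
              if alloc[n] >= vms[n]["limit"] - 1e-9:
                  active.discard(n)   # VM has hit its upper bound
          remaining -= handed_out
          if handed_out <= 1e-9:
              break                   # everyone is at their limit
      return alloc

  vms = {
      "critical-db":  {"shares": 2000, "limit": 8000, "reservation": 3000},
      "web-frontend": {"shares": 1000, "limit": 6000, "reservation": 1000},
      "test-vm":      {"shares": 500,  "limit": 2000, "reservation": 0},
  }
  print(allocate_iops(vms, datastore_iops=9000))

Under contention, the critical VM's reservation and larger share count give it the bulk of the available IOPS, which is the general behavior SIOC is designed to enforce.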

A recently published paper shows the performance of SIOC meets expectations and successfully controls the number of IOPS for VM workloads.

Continue reading

Virtual Machine vCPU and vNUMA Rightsizing – Rules of Thumb

Using virtualization, we have all enjoyed the flexibility to quickly create virtual machines with various virtual CPU (vCPU) configurations for a diverse set of workloads.  But as we virtualize larger and more demanding workloads, like databases, on top of the latest generations of processors with up to 24 cores, special care must be taken in vCPU and vNUMA configuration to ensure performance is optimized. Continue reading
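
As a rough illustration of the rule of thumb (keep a VM inside one physical NUMA node when it fits, and split it evenly across nodes when it does not), here is a small sketch. It is not an official sizing tool, it ignores ESXi's vNUMA defaults and advanced settings, and the host geometry in the example is hypothetical.

  # Rule-of-thumb sketch only; real vNUMA behavior depends on ESXi settings not modeled here.
  import math

  def suggest_vcpu_layout(vcpus, host_sockets, host_cores_per_socket):
      if vcpus <= host_cores_per_socket:
          # Fits in one physical NUMA node: present a single virtual socket.
          return {"virtual_sockets": 1, "cores_per_socket": vcpus}
      nodes = math.ceil(vcpus / host_cores_per_socket)
      if nodes > host_sockets or vcpus % nodes != 0:
          raise ValueError("choose a vCPU count that divides evenly across NUMA nodes")
      return {"virtual_sockets": nodes, "cores_per_socket": vcpus // nodes}

  print(suggest_vcpu_layout(16, host_sockets=2, host_cores_per_socket=24))  # 1 socket x 16 cores
  print(suggest_vcpu_layout(32, host_sockets=2, host_cores_per_socket=24))  # 2 sockets x 16 cores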

Machine Learning on vSphere 6 with Nvidia GPUs – Episode 2

by Hari Sivaraman, Uday Kurkure, and Lan Vu

In a previous blog [1], we looked at how machine learning workloads (MNIST and CIFAR-10) using TensorFlow, running in vSphere 6 VMs in an NVIDIA GRID configuration, reduced training time from hours to minutes when compared with the same system running without virtual GPUs.

Here, we extend our study to multiple workloads (3D CAD and machine learning) run at the same time versus run independently on the same vSphere server.

Performance Impact of Mixed Workloads

Many customers of the NVIDIA GRID vGPU solution on vSphere run 3D CAD workloads. The traditional approach is to run the CAD workloads during the day and the machine learning workloads at night, or to maintain separate infrastructures for each type of workload, but this approach is inflexible and can increase deployment cost. In this section, we show that this kind of separation is unnecessary: both workloads can run concurrently on the same server, and the performance impact on both the 3D CAD workload and the machine learning workload is negligible in three out of four vGPU profiles.

vSphere 6.5 supports four vGPU profiles, and the primary difference between them is the amount of VRAM available to each VM:

  • M60-1Q: 1GB VRAM
  • M60-2Q: 2GB VRAM
  • M60-4Q: 4GB VRAM
  • M60-8Q: 8GB VRAM

In this blog, we characterize the performance impact of running 3D CAD and machine learning workloads concurrently, using two benchmarks. We chose SPECapc for 3ds Max 2015 [3] as a representative 3D CAD workload; note that we did not comply with the benchmark reporting rules, nor do we use or make comparisons to the official SPECapc metrics. We chose MNIST [2] as a representative machine learning workload. The performance metric in this comparison is simply the run time of each benchmark.

Our results show that the performance impact on the 3D CAD workload from sharing the server and GPUs with the machine learning workload is below 5% (in the M60-2Q, M60-4Q, and M60-8Q profiles) when compared to running only the 3D CAD workload on the same hardware. Correspondingly, the performance impact on the machine learning workload from sharing the hardware with the 3D CAD workload is under 15% in the same three profiles. In other words, when the two workloads share the hardware, the run time of the 3D CAD benchmark increases by less than 5%, and the run time of the machine learning benchmark increases by less than 15%, compared to running each on its own.

Experimental Configuration and Methodology

We installed the 3D CAD benchmark in a 64-bit Windows 7 (SP1) VM with 4 vCPUs and 16GB RAM. The benchmark uses Autodesk 3ds Max 2015 software. The NVIDIA vGPU driver (369.04) was used in the VM. We configured the vGPU profiles as M60-1Q, M60-2Q, M60-4Q, or M60-8Q for different runs. We used this VM as the golden master from which we made linked clones so that we could run the 3D CAD benchmark at scale with 1, 2, 4, …, 24 VMs running 3D CAD simultaneously. The software configurations used for the 3D CAD workload are shown in Table 2, below.

ESXi | 6.5 #4240417
Guest OS | CentOS Linux release 7.2.1511 (Core)
CUDA Driver & Runtime | 7.5
TensorFlow | 0.1

Table 1. Configuration of VM used for machine learning benchmarks

vGPU profile | Total # VMs running concurrently in the test
M60-8Q | 3
M60-4Q | 6
M60-2Q | 12
M60-1Q | 24

Table 2. Software configuration used to run the 3D CAD benchmark

Our experiments comprised three sets of runs. In the first set, we ran only the 3D CAD benchmark for each of the four configurations listed in Table 2, above, and measured the run time, the ESXi CPU utilization, and the GPU utilization. Once this set of runs was completed, we did a second set in which we ran the 3D CAD benchmark concurrently with the machine learning benchmark. To do this, we first installed the MNIST benchmark, CUDA, cuDNN, and TensorFlow in a CentOS VM with the configuration shown in Table 1, above. Since CUDA works only with the M60-8Q profile, we used that profile for the VM that runs the machine learning benchmark. For the runs in this second set, we used the configurations shown in Table 4 and measured the run time for the 3D CAD benchmark, the run time for MNIST, the total ESXi CPU utilization, and the total GPU utilization. The server configuration used in our experiments is shown in Table 3, below.

Model | Dell PowerEdge R730
CPU | Intel Xeon Processor E5-2680 v4 @ 2.40GHz
CPU cores | 28 CPUs, each @ 2.40GHz
Processor sockets | 2
Cores per socket | 14
Logical processors | 56
Hyperthreading | Active
Memory | 768GB
Storage | Local SSD (1.5TB), storage arrays, local hard disks
GPUs | 2x NVIDIA Tesla M60
ESXi | 6.5 #4240417
NVIDIA vGPU driver in ESXi host | 367.38
NVIDIA vGPU driver inside VM | 369.04

Table 3. Server configuration

vGPU profile used for 3D CAD VMs only | # VMs running 3D CAD | # VMs running MNIST | Total # VMs running concurrently in test
M60-8Q | 3 | 1 | 4
M60-4Q | 6 | 1 | 7
M60-2Q | 12 | 1 | 13
M60-1Q | 24 | 1 | 25

Table 4. Software configuration used to run the mixed workloads (please note that the machine learning VM can be configured only with the M60-8Q profile; no other profiles are supported)

We did a third set of runs in which we ran only MNIST on the server. Since MNIST runs only in the M60-8Q profile, only one run was done in this third set.

Results

We compared the run times from the first set of runs (3D CAD only) with those from the second set (3D CAD and MNIST run concurrently) and computed the percentage increase in run time for 3D CAD when it shares the server with MNIST versus when it runs without MNIST. For example, in the M60-4Q profile, we computed the percentage increase in 3D CAD run time for 3D+ML relative to 3D only. We also compared the run time of MNIST running concurrently with 3D CAD against the run time of MNIST running by itself on the server and computed the percentage increase in run time for MNIST. We call this increase in run time the performance drop or change. The computed values are shown in Figure 1, below.

3d-ml-fig3_001

Figure 1. Percentage increase in run time for 3D graphics (3D) and machine learning (ML) workloads due to running concurrently compared to running in isolation
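
The computation itself is straightforward; the sketch below shows it with made-up run times (the numbers are placeholders, not measurements from this study).

  # Placeholder run times in seconds; only the formula mirrors the comparison described above.
  def pct_increase(shared, isolated):
      return 100.0 * (shared - isolated) / isolated

  runs = {
      # profile: (3D CAD alone, 3D CAD with ML, ML alone, ML with 3D CAD)
      "M60-4Q": (1000.0, 1040.0, 600.0, 660.0),
  }
  for profile, (cad_only, cad_mixed, ml_only, ml_mixed) in runs.items():
      print(profile,
            f"3D CAD drop: {pct_increase(cad_mixed, cad_only):.1f}%",
            f"ML drop: {pct_increase(ml_mixed, ml_only):.1f}%")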

From the figure we can see that in the M60-8Q, M60-4Q, and M60-2Q profiles, the increase in run time for 3D CAD when it shares the server with machine learning, compared to when 3D CAD runs by itself, is less than 5%. For the MNIST machine learning workload, the performance penalty due to sharing is under 15% in the same profiles. Only the M60-1Q profile, which can support up to 24 VMs running 3D CAD plus one VM running MNIST, shows a significant performance penalty due to sharing. Now, if the workloads were run sequentially, the total time to complete the tasks would be the sum of the run times for the 3D CAD and machine learning workloads.

A comparison of the total run time for ML and 3D CAD workloads is shown in Figure 2. From the figure, we can see that the total time to completion of the workloads is always less when run concurrently as opposed to when run sequentially.

3d-ml-fig2_001

Figure 2. Running the 3D plus ML (machine learning) mixed workload sequentially takes longer than running it concurrently. The original times were measured in seconds, but we have normalized the concurrent time to 1 so that the change in sequential time stands out.

3d-ml-fig1_001

Figure 3. CPU utilization on server for mixed workload configuration (3D+ML) and for 3D graphics only (3D)

Further, running the workloads concurrently results in higher server utilization, which could translate into higher revenue for a cloud service provider. The M60-1Q profile does take longer to complete the workloads when they are run concurrently rather than sequentially, but it achieves very high consolidation (measured as the number of VMs per core) and high server utilization. So, if the longer completion time in the M60-1Q profile can be tolerated, running the workloads concurrently would still yield higher revenue for a cloud service provider because of the higher server utilization. The CPU utilization on the server for the M60-8Q, M60-4Q, M60-2Q, and M60-1Q profiles with only 3D CAD (3D) and with 3D CAD plus machine learning (3D+ML) is shown in Figure 3, above.

Conclusions

  • Simultaneously running 3D CAD and machine learning workloads reduces the total time to complete the runs with the M60-2Q, M60-4Q, and M60-8Q profiles compared to running the workloads sequentially. This is a radical departure from traditional approaches to scheduling machine learning and 3D CAD workloads.
  • Running 3D graphics and machine learning workloads concurrently increases server utilization, which could result in higher revenues for a cloud service provider.

References

[1] Machine Learning on VMware vSphere 6 with NVIDIA GPUs
https://blogs.vmware.com/performance/2016/10/machine-learning-vsphere-nvidia-gpus.html

[2] The MNIST Database of Handwritten Digits
http://yann.lecun.com/exdb/mnist/ 

[3] SPECapc for 3ds Max 2015
https://www.spec.org/gwpg/apc.static/max2015info.html

New Fling released – IOInsight

By Sankaran Sivathanu

VMware IOInsight is a tool to help people understand a VM’s storage I/O behavior. By understanding their VM’s I/O characteristics, customers can make better decisions about storage capacity planning and performance tuning. IOInsight ships as a virtual appliance that can be deployed in any vSphere environment and includes an intuitive web-based UI that allows users to choose VMDKs to monitor and view results.

Where does IOInsight help?

  • Customers may better tune and size their storage.
  • When contacting VMware Support for any vSphere storage issues, including a report from IOInsight can help VMware Support better understand the issues and can potentially lead to faster resolutions.
  • VMware Engineering can optimize products with a better understanding of various customers’ application behavior.

IOInsight captures I/O traces from ESXi and generates various aggregated metrics that represent the I/O behavior. The IOInsight report contains only these aggregated metrics; it contains no sensitive information about the application itself. In addition to the built-in metrics computed by IOInsight, users can also write new analyzer plugins for IOInsight and visualize the results. A comprehensive SDK and development guide is included in the download bundle.
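
To give a feel for what "aggregated metrics" means here, the sketch below derives a few such metrics from a toy block I/O trace. It is purely illustrative and is not based on the actual IOInsight SDK or its plugin interface.

  # Toy example of aggregated, non-sensitive I/O metrics; not the IOInsight SDK.
  # Each trace record is (offset_bytes, size_bytes, is_read).
  def summarize(trace):
      reads = sum(1 for _, _, is_read in trace if is_read)
      total_bytes = sum(size for _, size, _ in trace)
      sequential = sum(
          1 for (off, size, _), (next_off, _, _) in zip(trace, trace[1:])
          if next_off == off + size            # next I/O starts where this one ended
      )
      return {
          "read_pct": 100.0 * reads / len(trace),
          "avg_io_size_kb": total_bytes / len(trace) / 1024,
          "sequential_pct": 100.0 * sequential / max(len(trace) - 1, 1),
      }

  trace = [(0, 4096, True), (4096, 4096, True), (1_048_576, 65_536, False)]
  print(summarize(trace))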

The fling works with vSphere 5.5 or above and can be downloaded at https://labs.vmware.com/flings/ioinsight.

vSphere 6.5 Encrypted vMotion Architecture and Performance

With the rise in popularity of hybrid cloud computing, where sensitive VM data leaves the traditional IT environment and traverses public networks, IT administrators and architects need a simple and secure way to protect critical VM data as it crosses clouds and long distances.

The Encrypted vMotion feature available in VMware vSphere® 6.5 addresses this challenge by introducing a software approach that provides end-to-end encryption for vMotion network traffic. The feature encrypts all vMotion data inside the vmkernel using the widely adopted AES-GCM encryption standard, and thereby provides data confidentiality, integrity, and authenticity even if vMotion traffic traverses untrusted network links.

A new white paper, “VMware vSphere 6.5 Encrypted vMotion Architecture, Performance and Best Practices”, is now available. In that paper, we describe the vSphere 6.5 Encrypted vMotion architecture and provide a comprehensive look at the performance of live migrating virtual machines running typical Tier 1 applications using vSphere 6.5 Encrypted vMotion. Tests measure characteristics such as total migration time and application performance during live migration. In addition, we examine vSphere 6.5 Encrypted vMotion performance over a high-latency network, such as that in a long distance network. Finally, we describe several best practices to follow when using vSphere 6.5 Encrypted vMotion.

In this blog, we give a brief overview of vSphere 6.5 Encrypted vMotion technology, and some of the performance highlights from the paper.

Brief Overview of Encrypted vMotion Architecture and Workflow

vMotion uses TCP as the transport protocol for migrating VM data. To secure the migration, vSphere 6.5 encrypts all vMotion traffic, including the TCP payload and vMotion metadata, using the widely adopted AES-GCM encryption algorithm provided by the FIPS-certified vmkernel vmkcrypto module.

Encrypted vMotion does not rely on the Secure Sockets Layer (SSL) and Internet Protocol Security (IPsec) technologies for securing vMotion traffic. Instead, it implements a custom encrypted protocol above the TCP layer. This is done primarily for performance, but also for the usability reasons explained in the paper.


enc-vmotion-blog-workflow-fig

As shown in Figure 1, vCenter Server prepares the migration specification that consists of a 256-bit encryption key and a 64-bit nonce, then passes the migration specification to both source and destination ESXi hosts of the intended vMotion. Both the ESXi hosts communicate over the vMotion network using the key provided by vCenter Server. The key management is simple: vCenter Server generates a new key for each vMotion, and the key is discarded at the end of vMotion. Encryption happens inside the vmkernel, hence there is no need for specialized hardware.
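
For readers who want to see what the AES-GCM building block looks like, here is a minimal sketch using Python's cryptography package. This is not vSphere code: the real encryption runs inside the vmkernel's vmkcrypto module, and the actual protocol framing, nonce handling, and key distribution differ from this simplified illustration.

  import os
  from cryptography.hazmat.primitives.ciphers.aead import AESGCM

  # "vCenter" step: generate a fresh 256-bit key for this one migration.
  key = AESGCM.generate_key(bit_length=256)

  # "Source host" step: encrypt a chunk of VM memory before it goes on the wire.
  aesgcm = AESGCM(key)
  nonce = os.urandom(12)              # the paper describes a 64-bit nonce; a 96-bit
                                      # nonce is used here, as is common for AES-GCM
  vm_page = b"\x00" * 4096            # stand-in for a 4 KB page of VM memory
  ciphertext = aesgcm.encrypt(nonce, vm_page, associated_data=b"vmotion-metadata")

  # "Destination host" step: authenticate and decrypt with the same key.
  assert aesgcm.decrypt(nonce, ciphertext, associated_data=b"vmotion-metadata") == vm_page
  # The key is discarded once the migration completes; it is never reused.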

Brief look at Encrypted vMotion Performance

Encrypted vMotion Duration

The figure below shows the vMotion duration in several test scenarios in which we varied vCPU and memory sizes. Performance is identical in all scenarios, with and without encryption enabled on vMotion traffic.

enc-vmotion-blog-fig2

Encrypted vMotion CPU Overhead

The figures below show the CPU overhead of encrypting vMotion traffic on source and destination hosts, respectively. The CPU usage is plotted in terms of the CPU cores required by vMotion.

enc-vmotion-blog-fig3-src-new

enc-vmotion-blog-fig4-dst-new1

The above figures show that the CPU requirements of encrypted vMotion are very moderate. For every 10Gb/s of vMotion traffic, encrypted vMotion requires less than one core on the source host and less than half a core on the destination host for all the encryption-related overhead.

Encrypted vMotion Performance Over Long Distance

The figure below plots the performance of a SQL Server virtual machine in orders processed per second at a given time—before, during, and after encrypted vMotion on a 150ms round-trip latency network.

enc-vmotion-blog-fig5-ld

As shown in the figure, the impact on SQL Server throughput was minimal during encrypted vMotion. The only noticeable dip in performance was during the switch-over phase (on the order of 1 second) from the source to the destination host. It took only a few seconds for SQL Server to resume its normal level of performance.

In summary, test results show the following:

  • vSphere 6.5 Encrypted vMotion performs nearly the same as regular, unencrypted vMotion.
  • CPU cost of encrypting vMotion traffic is very moderate, thanks to the performance optimizations added to the vSphere 6.5 vMotion code path.
  • vSphere 6.5 Encrypted vMotion can migrate workloads non-disruptively over long distances, such as New York to London.

For the full paper, see “VMware vSphere 6.5 Encrypted vMotion Architecture, Performance and Best Practices”.

vCenter Server 6.5 High Availability Performance and Best Practices

High availability (aka HA) services are important in any platform, and VMware vCenter Server® is no exception. As the main administrative and management tool of vSphere, it is a critical element that requires HA. vCenter Server HA (aka VCHA) delivers protection against software and hardware failures with excellent performance for common customer scenarios, as shown in this paper.

Much work has gone into the high availability feature of VMware vCenter Server® 6.5 to ensure that this service and its operations minimally affect the performance of your vCenter Server and vSphere hosts. We thoroughly tested VCHA with a benchmark that simulates common vCenter Server activities in both regular and worst case scenarios. The result is solid data and a comprehensive performance characterization in terms of:

  • Performance of VCHA failover/recovery time objective (RTO): In case of a failure, vCenter Server HA (VCHA) provides failover/RTO such that users can continue with their work in less than 2 minutes through API clients and less than 4 minutes through UI clients. While failover/RTO depends on the vCenter Server configuration and the inventory size, in our tests it is within the target limit, which is 5 minutes.
  • Performance of enabling VCHA: We observed that enabling VCHA would take around 4 – 9 minutes depending on the vCenter Server configuration and the inventory size.
  • VCHA overhead: When VCHA is enabled, there is no significant performance impact on vCenter Server under typical load conditions. We observed a noticeable but small impact of VCHA when vCenter Server was under extreme load; however, it is unlikely that customers would generate that much load on vCenter Server for extended periods of time.
  • Performance impact of vCenter Server statistics level: With an increasing statistics level, vCenter Server produces less throughput, as expected. When VCHA is enabled for various statistics levels, we observe a noticeable but small impact of 3% to 9% on throughput.
  • Performance impact of a private network: VCHA is designed to support LAN networks with up to 10 ms latency between VCHA nodes. However, this comes with a performance penalty. We study the performance impact of the private network in detail and provide further guidelines about how to configure VCHA for the best performance.
  • External Platform Services Controller (PSC) vs Embedded PSC: We study VCHA performance comparing these two deployment modes and observe a minimal difference between them.

Throughout the paper, our findings show that vCenter Server HA performs well under a variety of circumstances. In addition to the performance study results, the paper describes the VCHA architecture and includes some useful performance best practices for getting the most from VCHA.

For the full paper, see VMware vCenter Server High Availability Performance and Best Practices.

vSphere 6.5 Update Manager Performance and Best Practices

vSphere Update Manager (VUM) is the patch management tool for VMware vSphere 6.5. IT administrators can use VUM to patch and upgrade ESXi hosts, VMware Tools, virtual hardware, and virtual appliances.

In the vSphere 6.5 release, VUM has been integrated into the vCenter Server appliance (VCSA) for the Linux platform. The integration eliminates remote data transfers between VUM and VCSA, and greatly simplifies the VUM deployment process. As a result, certain data-driven tasks achieve a considerable performance improvement over VUM for the Windows platform, as illustrated in the following figure:

vum-blog-fig1

To present the new performance characteristics of VUM in vSphere 6.5, we have published a paper. In particular, the paper describes the following topics:

  • VUM server deployment
  • VUM operations including scan host, scan VM, stage host, remediate host, and remediate VM
  • Remediation concurrency
  • Resource consumption
  • Running VUM operations with vCenter Server provisioning operations

The paper also offers a number of performance tips and best practices for using VUM during patch maintenance. For the full details, read vSphere Update Manager Performance and Best Practices.

Whitepaper on vSphere Virtual Machine Encryption Performance

vSphere 6.5 introduces a feature called vSphere VM encryption.  When this feature is enabled for a VM, vSphere protects the VM data by encrypting all its contents.  Encryption is done both for already existing data and for newly written data. Whenever the VM data is read, it is decrypted within ESXi before being served to the VM.  Because of this, vSphere VM encryption can have a performance impact on application I/O and the ESXi host CPU usage.

We have published a whitepaper, VMware vSphere Virtual Machine Encryption Performance, to quantify this performance impact. We focus on synthetic I/O performance on VMs, as well as VM provisioning operations like clone, snapshot creation, and power on. From analysis of our experimental results, we see that while VM encryption consumes more CPU resources for encryption and decryption, its impact on I/O performance is minimal when using enterprise-class SSD or VMware vSAN storage. However, when using ultra-high performance storage like locally attached NVMe drives capable of handling up to 750,000 IOPS, the minor increase in per-I/O latency due to encryption or decryption adds up quickly to have an impact on IOPS.
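
A quick back-of-the-envelope calculation shows why the same small per-I/O cost matters more on ultra-fast storage; the latency, queue-depth, and encryption-cost values below are hypothetical, chosen only to illustrate the effect.

  # Little's law: IOPS ~= outstanding I/Os / per-I/O latency.
  def iops(latency_us, outstanding_ios=32):
      return outstanding_ios / (latency_us * 1e-6)

  extra_us = 10  # hypothetical added encryption cost per I/O, in microseconds
  for base_us, label in [(40, "NVMe-class"), (400, "SSD-array-class")]:
      before, after = iops(base_us), iops(base_us + extra_us)
      drop = 100.0 * (before - after) / before
      print(f"{label}: {before:,.0f} -> {after:,.0f} IOPS ({drop:.0f}% drop)")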

For more detailed information and data, please refer to the whitepaper.

vSphere 6.5 DRS Performance – A new white-paper

VMware recently announced the general availability of vSphere 6.5. Among the many new features in this release are some DRS-specific ones, like predictive DRS and network-aware DRS. In vSphere 6.5, DRS also comes with a host of performance improvements, like the all-new VM initial placement and the faster, more effective maintenance mode operation.

To learn more about them, read our new white paper on the new DRS features and performance improvements in vSphere 6.5. Here are some highlights from the paper:

65wp-blog-3

65wp-blog-2