Using virtualization, we have all enjoyed the flexibility to quickly create virtual machines with various virtual CPU (vCPU) configurations for a diverse set of workloads. But as we virtualize larger and more demanding workloads, like databases, on top of the latest generations of processors with up to 24 cores, special care must be taken in vCPU and vNUMA configuration to ensure performance is optimized.
With the rise in popularity of hybrid cloud computing, where sensitive VM data leaves the traditional IT environment and traverses public networks, IT administrators and architects need a simple and secure way to protect critical VM data as it moves across clouds and over long distances.
The Encrypted vMotion feature available in VMware vSphere® 6.5 addresses this challenge by introducing a software approach that provides end-to-end encryption for vMotion network traffic. The feature encrypts all the vMotion data inside the vmkernel by using the most widely used AES-GCM encryption standards, and thereby provides data confidentiality, integrity, and authenticity even if vMotion traffic traverses untrusted network links.
A new white paper, “VMware vSphere 6.5 Encrypted vMotion Architecture, Performance and Best Practices”, is now available. In that paper, we describe the vSphere 6.5 Encrypted vMotion architecture and provide a comprehensive look at the performance of live migrating virtual machines running typical Tier 1 applications using vSphere 6.5 Encrypted vMotion. Tests measure characteristics such as total migration time and application performance during live migration. In addition, we examine vSphere 6.5 Encrypted vMotion performance over a high-latency network, such as that in a long distance network. Finally, we describe several best practices to follow when using vSphere 6.5 Encrypted vMotion.
In this blog, we give a brief overview of vSphere 6.5 Encrypted vMotion technology and present some of the performance highlights from the paper.
Brief Overview of Encrypted vMotion Architecture and Workflow
vMotion uses TCP as the transport protocol for migrating the VM data. To secure VM migration, vSphere 6.5 encrypts all the vMotion traffic, including the TCP payload and vMotion metadata, using the most widely used AES-GCM encryption standard algorithms, provided by the FIPS-certified vmkernel vmkcrypto module.
Encrypted vMotion does not rely on the Secure Sockets Layer (SSL) and Internet Protocol Security (IPsec) technologies for securing vMotion traffic. Instead, it implements a custom encrypted protocol above the TCP layer. This is done primarily for performance, but also for the usability reasons explained in the paper.
As shown in Figure 1, vCenter Server prepares the migration specification that consists of a 256-bit encryption key and a 64-bit nonce, then passes the migration specification to both source and destination ESXi hosts of the intended vMotion. Both the ESXi hosts communicate over the vMotion network using the key provided by vCenter Server. The key management is simple: vCenter Server generates a new key for each vMotion, and the key is discarded at the end of vMotion. Encryption happens inside the vmkernel, hence there is no need for specialized hardware.
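The per-migration key lifecycle described above can be sketched as follows. This is an illustrative Python model only (the function and field names are invented for the sketch); the real work happens inside the vmkernel's FIPS-certified vmkcrypto module:

```python
import secrets

def prepare_migration_spec():
    """Model of vCenter Server preparing a migration specification:
    a fresh 256-bit AES-GCM key and a 64-bit nonce for each vMotion."""
    return {"key": secrets.token_bytes(32),    # 256 bits
            "nonce": secrets.token_bytes(8)}   # 64 bits

# vCenter hands the same spec to both the source and destination host,
# which then communicate over the vMotion network using that key.
spec = prepare_migration_spec()
source_spec, destination_spec = spec, spec
assert source_spec["key"] == destination_spec["key"]

# Every migration gets its own key...
assert prepare_migration_spec()["key"] != spec["key"]
# ...which is discarded when the vMotion completes.
spec = None
```

Because the key lives only for the duration of one migration and never leaves the vCenter-to-host control path, there is no long-term key store to manage.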
Brief Look at Encrypted vMotion Performance
Encrypted vMotion Duration
The figure below shows the vMotion duration in several test scenarios in which we varied vCPU and memory sizes. The figure shows identical performance in all the scenarios with and without encryption enabled on vMotion traffic.
Encrypted vMotion CPU Overhead
The figures below show the CPU overhead of encrypting vMotion traffic on source and destination hosts, respectively. The CPU usage is plotted in terms of the CPU cores required by vMotion.
The figures above show that the CPU requirements of encrypted vMotion are modest: for every 10Gb/s of vMotion traffic, encryption requires less than one core on the source host and less than half a core on the destination host.
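As a rough sketch of what those measurements imply (Python; the per-10Gb/s core costs below are the upper bounds quoted above, treated as simple constants for the estimate, not exact figures):

```python
def encryption_cores_needed(vmotion_gbps, src_cores_per_10g=1.0, dst_cores_per_10g=0.5):
    """Back-of-the-envelope upper bound on CPU cores spent on
    encryption at a given vMotion throughput."""
    scale = vmotion_gbps / 10.0
    return {"source": scale * src_cores_per_10g,
            "destination": scale * dst_cores_per_10g}

# For example, saturating a 40GbE vMotion network:
estimate = encryption_cores_needed(40)
assert estimate == {"source": 4.0, "destination": 2.0}
```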
Encrypted vMotion Performance Over Long Distance
The figure below plots the performance of a SQL Server virtual machine in orders processed per second at a given time—before, during, and after encrypted vMotion on a 150ms round-trip latency network.
As shown in the figure, the impact on SQL Server throughput was minimal during encrypted vMotion. The only noticeable dip in performance occurred during the switch-over phase (in the range of 1 second) from the source to the destination host. It took less than a few seconds for SQL Server to resume its normal level of performance.
In summary, test results show the following:
- vSphere 6.5 Encrypted vMotion performs nearly the same as regular, unencrypted vMotion.
- CPU cost of encrypting vMotion traffic is very moderate, thanks to the performance optimizations added to the vSphere 6.5 vMotion code path.
- vSphere 6.5 Encrypted vMotion can migrate workloads non-disruptively over long distances, such as New York to London.
For the full paper, see “VMware vSphere 6.5 Encrypted vMotion Architecture, Performance and Best Practices”.
We compared the I/O performance of vSphere 6.0 U2 over 16Gb and 32Gb Emulex FC HBAs connected via a Brocade G620 FC switch to an EMC VNX7500 storage array.
Iometer, a common microbenchmark, was used to generate the workload for various block sizes. For single-VM experiments, we measured sequential read and sequential write throughput. For multi-VM experiments, we measured random read IOPS and throughput.
Our experiments showed that vSphere 6 can achieve near line rate with 32Gb FC.
For details, please see the whitepaper Storage I/O Performance on VMware vSphere 6.0 U2 over 32 Gigabit Fibre Channel.
Remember that cool project with VMware, HP Enterprise, and IBM where four super huge monster virtual machines (VMs) of 120 vCPUs each were all running at the same time on a single server with great performance?
That was Project Capstone, and it was presented at VMworld San Francisco and VMworld Barcelona last fall as a spotlight session. The follow-up whitepaper is now completed and published, which means that there are lots of great technical details available with testing results and analysis.
In addition to the four 120 vCPU VM test, additional configurations were also run with eight 60 vCPU VMs and sixteen 30 vCPU VMs. This shows that plenty of large VMs can be run on a single host with excellent performance when using a solution that supports tons of CPU capacity and cutting-edge flash storage.
The whitepaper not only contains all of the test results from the original presentation, but also includes additional details on the performance of CPU affinity vs. PreferHT and under-provisioning. There is also a best practices section that is focused on running monster VMs.
by Joanna Guan and Davide Bergamasco
The first two posts in this series assessed the performance of Content Library operations such as virtual machine deployment and library synchronization, import, and export. In this post we discuss how to fine-tune Content Library settings in order to achieve optimal performance under a variety of operational conditions. Note that we only discuss the settings that have the most noticeable impact on overall solution performance. Several other settings may potentially affect Content Library performance; we refer interested readers to the official documentation for the details (the Content Library Service settings can be found here, while the Transfer Service settings can be found here).
Global Network Bandwidth Throttling
Content Library has a global bandwidth throttling control to limit the overall bandwidth consumed by file transfers. This setting, called Maximum Bandwidth Consumption, affects all the streaming mode operations including library synchronization, VM deployment, VM capture, and item import/export. However, it does not affect direct copy operations, i.e., operations where data is directly copied across ESXi hosts.
The purpose of the Maximum Bandwidth Consumption setting is to ensure that while Content Library file transfers are in progress enough network bandwidth remains available to vCenter Server for its own operations.
The following table illustrates the properties of this setting:
| Setting Name | Maximum Bandwidth Consumption |
|---|---|
| vSphere Web Client Path | Administration > System Configuration > Services > Transfer Service > Maximum Bandwidth Consumption |
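To make the effect of a global cap concrete, here is a toy Python model of a shared bandwidth budget. It is not how the Transfer Service is implemented, just an illustration of how a single cap constrains all concurrent streaming-mode transfers:

```python
class BandwidthBudget:
    """Toy model: one global cap shared by all streaming-mode
    transfers (direct copy operations bypass the cap entirely)."""
    def __init__(self, cap_mbps):
        self.cap_mbps = cap_mbps

    def per_transfer_mbps(self, active_transfers):
        # Assume the cap is shared fairly among the active transfers.
        return self.cap_mbps / max(active_transfers, 1)

budget = BandwidthBudget(cap_mbps=1000)
assert budget.per_transfer_mbps(1) == 1000.0  # a lone transfer gets the full cap
assert budget.per_transfer_mbps(4) == 250.0   # four concurrent transfers share it
```

The practical takeaway: the more concurrent streaming operations are running, the less bandwidth each one gets, so the cap should be sized with the expected concurrency in mind.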
Concurrent Data Transfer Control
Content Library has a setting named Maximum Number of Concurrent Transfers that limits the number of concurrent data transfers. This limit applies to all the data transfer operations including import, export, VM deployment, VM capture, and synchronization. When this limit is exceeded, all new operations are queued until the completion of one or more of the operations in progress.
For example, let’s assume the current value of Maximum Number of Concurrent Transfers is 20 and there are 8 VM deployments, 2 VM captures, and 10 item synchronizations in progress. A new VM deployment request will be queued because the maximum number of concurrent operations has been reached. As soon as any of those operations completes, the new VM deployment is allowed to proceed.
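The admission behavior in this example can be sketched as a counter plus a FIFO queue (illustrative Python; the class and method names are invented for the sketch, not vSphere APIs):

```python
from collections import deque

MAX_CONCURRENT_TRANSFERS = 20  # default value of the setting

class TransferScheduler:
    """Toy model of Content Library transfer admission control."""
    def __init__(self, limit=MAX_CONCURRENT_TRANSFERS):
        self.limit = limit
        self.running = 0
        self.queued = deque()

    def request(self, op):
        if self.running < self.limit:
            self.running += 1
            return "running"
        self.queued.append(op)     # over the limit: wait in line
        return "queued"

    def complete_one(self):
        self.running -= 1
        if self.queued:            # a queued operation may now start
            self.running += 1
            return self.queued.popleft()

sched = TransferScheduler()
# 8 VM deployments + 2 VM captures + 10 item synchronizations = 20 running
for i in range(20):
    assert sched.request(f"op-{i}") == "running"
assert sched.request("new-vm-deployment") == "queued"  # limit reached
assert sched.complete_one() == "new-vm-deployment"     # proceeds on completion
```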
This setting can be used to improve Content Library overall throughput (not the performance of each individual operation) by increasing the data transfer concurrency when the network is underutilized.
The following table illustrates the properties of this setting:
| Setting Name | Maximum Number of Concurrent Transfers |
|---|---|
| vSphere Web Client Path | Administration > System Configuration > Services > Transfer Service > Maximum Number of Concurrent Transfers |
A second concurrency control setting, whose properties are shown in the table below, applies to synchronization operations only. This setting, named Library Maximum Concurrent SyncItems, controls the maximum number of items that a subscribed library is allowed to concurrently synchronize.
| Setting Name | Library Maximum Concurrent SyncItems |
|---|---|
| vSphere Web Client Path | Administration > System Configuration > Services > Content Library Service > Library Maximum Concurrent SyncItems |
Given that the default value of Maximum Number of Concurrent Transfers is 20 and the default value of Library Maximum Concurrent SyncItems is 5, a maximum of 5 items can concurrently be transferred to a subscribed library during a synchronization operation, while a published library with 5 or more items can be synchronizing with up to 4 subscribed libraries (see Figure 1). If the number of items or subscribed libraries exceeds these limits, the extra transfers will be queued. Library Maximum Concurrent SyncItems can be used in concert with Maximum Number of Concurrent Transfers to improve the overall synchronization throughput by increasing one or both limits.
Figure 1. Library Synchronization Concurrency Control
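The arithmetic behind these defaults is worth spelling out (Python, using the default values quoted above):

```python
max_concurrent_transfers = 20   # Maximum Number of Concurrent Transfers (default)
max_sync_items_per_library = 5  # Library Maximum Concurrent SyncItems (default)

# A single subscribed library can pull at most this many items at once:
items_per_library = min(max_sync_items_per_library, max_concurrent_transfers)

# A published library whose subscribers each saturate the per-library limit
# can feed this many subscribed libraries before transfers start queueing:
concurrent_libraries = max_concurrent_transfers // max_sync_items_per_library

assert items_per_library == 5
assert concurrent_libraries == 4
```

Raising either limit shifts this ceiling: for instance, doubling Maximum Number of Concurrent Transfers to 40 would allow up to 8 fully busy subscribed libraries at the default per-library limit.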
The following table summarizes the effect of each of the settings described above on each Content Library operation, depending on the data transfer mode.
| Transfer Mode | Operation | Maximum Bandwidth Consumption | Maximum Number of Concurrent Transfers | Library Maximum Concurrent SyncItems |
|---|---|---|---|---|
| Streaming Mode | VM Deployment/Capture | √ | √ | |
| Direct Copy Mode | VM Deployment/Capture | | √ | |
Synchronizing library content across remote sites can be problematic because of the limited bandwidth of typical WAN connections. This problem may be exacerbated when a large number of subscribed libraries concurrently synchronize with a published library because the WAN connections originating from this library can easily cause congestion due to the elevated fan-out (see Figure 2).
Figure 2. Synchronization Fan-out Problem
This problem can be mitigated by creating a mirror server to cache content at the remote sites. As shown in Figure 3, this can significantly decrease the level of WAN traffic by avoiding transferring the same files multiple times across the network.
Figure 3. Data Mirroring Used to Mitigate Fan-Out Problem
The mirror server is a proxy Web server that caches content on local storage. The typical location of mirror servers is between the vCenter Server hosting the published library and the vCenter Server(s) hosting the subscribed library(ies). To be effective, the mirror servers must be as close as possible to the subscribed libraries. When a subscribed library attempts to synchronize with the published library, it requests content from the mirror server. If such content is present on the mirror server, the request is immediately satisfied. Otherwise, the mirror server fetches said content from the published library and stores it locally before satisfying the request from the subscribed library. Any further request for that particular content will be directly satisfied by the mirror server.
A mirror server can be also used in a local environment to offload the data movement load from a vCenter Server or when the backing storage of a published library is not particularly performant. In this case the mirror server is located as close as possible to the vCenter Server hosting the published library as shown in Figure 4.
Figure 4. Data Mirroring Used to Off-Load vCenter Server
Example of Mirror Server Configuration
This section provides step-by-step instructions to assist a vSphere administrator in the creation of a mirror server using the NGINX web server (other web servers, such as Apache and Lighttpd, can be used for this purpose as well). Please refer to the NGINX documentation for additional configuration details.
- Install the NGINX web server in a Windows or Linux virtual machine to be deployed as close as possible to either the subscribed or the published library depending on the desired optimization (fan-out mitigation or vCenter offload).
- Edit the configuration files. The NGINX default configuration file, /etc/nginx/nginx.conf, defines the basic behavior of the web server. Some of the core directives in nginx.conf need to be changed.
- Configure the IP address / name of the vCenter server hosting the published library
- Set the valid time for cached files. In this example we assume the contents to be valid for 6 days (this time can be changed as needed):
proxy_cache_valid any 6d;
- Configure the cache directory path and cache size. In the following example we use /var/www/cache as the cache directory path on the file system where cached data will be stored. 500MB is the size of the shared memory zone, while 18,000MB is the size of the file storage. Files not accessed within 6 days are evicted from the cache.
proxy_cache_path /var/www/cache levels=1:2 keys_zone=my-cache:500m max_size=18000m inactive=6d;
- Define the cache key. Instruct NGINX to use the request URI as a key identifier for a file:
proxy_cache_key $request_uri;
When an OVF or VMDK file is updated, the file URL gets updated as well. When a URL changes, the cache key changes too, hence the mirror server will fetch and store the updated file as a new file during a library re-synchronization.
- Configure HTTP redirect (code 302) handling:
error_page 302 = @handler;
set $foo $upstream_http_location;
proxy_cache_valid any 6d;
- Test the configuration. Run the following command on the mirror server twice:
wget --no-check-certificate -O /dev/null https://<MirrorServer-IP-or-Name>/example.ovf
The first time the command is run, the file example.ovf is fetched from the published content library and copied into a folder under the cache path proxy_cache_path. The second time, there should be no network traffic between the mirror server and the vCenter Server hosting the published library because the file is served from the cache.
- Sync library content through mirror server. Create a new subscribed library using the New Library Wizard from the vSphere Web Client. Copy the published library URL in the Subscription URL box and replace the vCenter IP or host name with the mirror server IP or host name (see Figure 5). Then complete the rest of the steps for creating a new library as usual.
Figure 5. Configuring a Subscribed Library with a Mirror Server
Note: If the network environment is trusted, a simple HTTP proxy can be used instead of an HTTPS proxy in order to improve data transfer performance by avoiding unnecessary data encryption/decryption.
VMware vCenter Server 6.0 brings many performance improvements over previous vCenter versions, including:
- Extensive improvements in throughput and latency.
- vCenter Server Appliance (VCSA) parity with vCenter Server on Windows, for both inventory size and performance.
A new white paper, “VMware vCenter Server Performance and Best Practices,” illustrates these performance improvements and discusses important best practices for getting the best performance out of your vCenter Server environment.
This chart, taken from the white paper, shows the improvement in throughput over vCenter Server 5.5 at various inventory sizes:
VMware vCenter Server 6.0 brings significant improvements over previous vCenter Server versions with respect to cluster size and performance. vCenter Server 6.0 supports up to 64 ESXi hosts and 8,000 VMs in a single cluster. A new white paper, “VMware vCenter Server 6.0 Cluster Performance”, describes the improvements along several dimensions:
- VMware vCenter 6.0 can support more hosts and more VMs in a cluster.
- VMware vCenter 6.0 can support higher operational throughput in a cluster.
- VMware vCenter 6.0 can support higher operational throughput with ESXi 6.0 hosts.
- VMware vCenter 6.0 VCSA can support higher operational throughput in a cluster, compared to vCenter Server on Windows.
Here is a chart from the white paper summarizing one of the key improvements: operational throughput in a cluster:
Project Capstone was put together a few weeks before VMworld 2015 with the goal of showing what is possible with monster VMs today. VMware worked with HP and IBM to put together an impressive setup using vSphere 6.0, an HP Superdome X, and an IBM FlashSystem array that was able to support running four 120 vCPU VMs simultaneously. Putting these massive virtual machines under load, we found that performance was excellent, with great scalability and a high amount of throughput.
vSphere 6 was launched earlier this year and includes support for virtual machines with up to 128 virtual CPUs, a big increase from the 64 vCPUs supported in vSphere 5.5. “Monster” virtual machines have a new upper limit, allowing customers to virtualize even the largest systems with very hungry CPU needs.
The HP Superdome X used for the testing is an impressive system. It has 16 Intel Xeon E7-2890v2 2.8 GHz processors. Each processor has 15 cores and 30 logical threads with Hyper-Threading enabled. In total this is 240 cores / 480 threads.
An IBM FlashSystem array with 20TB of super-fast, low-latency storage was used for the Project Capstone configuration. It provided extremely low latency throughout all testing and performed so well that storage was never a concern or issue. The FlashSystem was extremely easy to set up and use. Within 24 hours of it arriving in the lab, we were actively running four 120 vCPU VMs with sub-millisecond latency.
Large Oracle 12c database virtual machines running on Red Hat Enterprise Linux 6.5 were created and configured with 256GB of RAM, pvSCSI virtual disk adapters, and vmxnet3 virtual NICs. The number of VMs and the number of vCPUs for each VM were varied across the tests.
The workload used for the testing was DVD Store 3 (github.com/dvdstore/ds3). DVD Store simulates a real online store with customers logging onto the site, browsing products and product reviews, rating products, and ultimately purchasing those products. The benchmark is measured in Orders Per Minute, with each order representing a complete login, browsing, and purchasing process that includes many individual SQL operations against the database.
This large system with 240 cores / 480 threads, an extremely fast and large storage system, and vSphere 6 showed that excellent performance and scalability are possible even with many monster VMs. Each configuration was first stressed by increasing the DVD Store workload until maximum throughput was achieved for a single virtual machine. In all cases this occurred at near CPU saturation. The number of VMs was then increased so that the entire system was fully committed. A whitepaper to be published soon will have the full set of test results, but here we show the results for four 120 vCPU VMs and sixteen 30 vCPU VMs.
In both cases the performance of the system when fully loaded with either 4 or 16 virtual machines achieves about 90% of perfect linear scalability when compared to the performance of a single virtual machine.
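Scaling efficiency here is simply aggregate throughput divided by what perfect linear scaling would predict. A quick Python illustration with hypothetical numbers (the OPM figures below are made up for the example; see the whitepaper for the measured results):

```python
def scaling_efficiency(single_vm_opm, n_vms, total_opm):
    """Fraction of perfect linear scaling achieved by n concurrent VMs."""
    return total_opm / (n_vms * single_vm_opm)

# Hypothetical: one VM alone sustains 100,000 OPM; four fully loaded
# VMs together sustain 360,000 OPM.
assert round(scaling_efficiency(100_000, 4, 360_000), 2) == 0.9  # ~90%
```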
To drive CPU usage to such high levels, all disk I/O must be very fast so that the system is not left waiting for a response. The IBM FlashSystem provided 0.3 ms average disk latency across all tests. Total disk I/O was minimized for these tests, to maximize CPU usage and throughput, by configuring the database cache size to be equal to the database size. Total disk I/O per second (IOPS) peaked at about 50,000 and averaged 20,000 while maintaining the extremely low latency during tests.
These test results show that it is possible to use vSphere 6 to successfully virtualize even the largest systems with excellent performance.
We are pleased to announce the availability of Performance Best Practices for VMware vSphere 6.0. This is a book designed to help system administrators obtain the best performance from vSphere 6.0 deployments.
The book addresses many of the new features in vSphere 6.0 from a performance perspective. These include:
- A new version of vSphere Network I/O Control
- A new host-wide performance tuning feature
- A new version of VMware Fault Tolerance (now supporting multi-vCPU virtual machines)
- The new vSphere Content Library feature
We’ve also updated and expanded on many of the topics in the book. These include:
- VMware vStorage APIs for Array Integration (VAAI) features
- Network hardware considerations
- Changes in ESXi host power management
- Changes in ESXi transparent memory sharing
- Using Receive Side Scaling (RSS) in virtual machines
- Virtual NUMA (vNUMA) configuration
- Network performance in guest operating systems
- vSphere Web Client performance
- VMware vMotion and Storage vMotion performance
- VMware Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM) performance
The book can be found here: http://www.vmware.com/files/pdf/techpaper/VMware-PerfBest-Practices-vSphere6-0.pdf.
vSphere 5.5 introduced a Linux-based driver to support 40GbE Mellanox adapters on ESXi. Now vSphere 6.0 adds a native driver and Dynamic NetQueue for Mellanox adapters, and these features significantly improve network performance. In addition to the device driver changes, vSphere 6.0 includes improvements to the vmxnet3 virtual NIC (vNIC) that allow a single vNIC to achieve line-rate performance with 40GbE physical NICs. Another performance feature introduced in 6.0 for high-bandwidth NICs is NUMA Aware I/O, which improves performance by collocating highly network-intensive workloads with the device's NUMA node. In this blog, we highlight these features and the corresponding benefits.
We used two identical Dell PowerEdge R720 servers, each with Intel E5-2667 processors @ 2.90GHz and 64GB of memory, equipped with Mellanox Technologies MT27500 Family (ConnectX-3) 40GbE and Intel 82599EB 10-Gigabit SFI/SFP+ NICs for our tests.
In the single VM test, we used 1 RHEL 6 VM with 4 vCPUs on each ESXi host with 4 netperf TCP streams running. We then measured the cumulative throughput for the test.
For the multi-VM test, we configured multiple RHEL VMs with 1 vCPU each and used an identical number of VMs on the receiver side. Each VM used 4 sessions of netperf for driving traffic, and we measured the cumulative throughput across the VMs.
Single vNIC Performance Improvements
In order to achieve line-rate performance for vmxnet3, changes were made to the virtual NIC adapter for vSphere 6.0 so that multiple hardware queues could push data to vNICs simultaneously. This allows vmxnet3 to use multiple hardware queues from the physical NIC more effectively. This not only increases the throughput a single vNIC can achieve, but also helps in overall CPU utilization.
As we can see from Figure 1 below, 1 VM with 1 vNIC on vSphere 6.0 can achieve more than 35Gbps of throughput, compared to the 20Gbps achieved in vSphere 5.5 (indicated by the blue bars). The CPU cost to receive 1Gbps of traffic, on the other hand, is reduced by 50% (indicated by the red line).
Figure 1. 1VM vmxnet3 Receive throughput
By default, a single vNIC can receive packets from a single hardware queue. To achieve higher throughput, the vNIC has to request more queues. This can be done by setting ethernetX.pnicFeatures = “4” in the .vmx file. This option also requires the physical NIC to have RSS mode turned on. For Mellanox adapters, the RSS feature can be turned on by reloading the driver with num_rings_per_rss_queue=4.
CPU Cost Improvements for Mellanox 40GbE NIC
In addition to scalability improvements for the vmxnet3 adapter, vSphere 6.0 features an improved version of the Mellanox 40GbE NIC driver. The updated driver uses vSphere 6.0 native APIs and, as a result, performs better than the earlier Linux-based driver. Native APIs remove the extra CPU overheads of data structure conversion that were present in the Linux-based driver. The driver also has new features like Dynamic NetQueue that improve CPU utilization even further. Dynamic NetQueue in vSphere 6.0 intelligently chooses the optimal number of active hardware queues in use according to the network workload and per-NUMA-node CPU utilization.
Figure 2: Multi VM CPU usage for 40G traffic
As seen in Figure 2 above, the new driver can improve CPU efficiency by up to 22%. For all these test cases, the Mellanox NIC achieved line-rate throughput on both vSphere 6.0 and vSphere 5.5. Please note that for the multi-VM tests, we used 1-vCPU VMs and vmxnet3 used a single queue. The RSS feature on the Mellanox adapter was also turned off.
NUMA Aware I/O
To achieve the best performance from 40GbE NICs, it is advisable to place the throughput-intensive workload on the same NUMA node to which the adapter is attached. vSphere 6.0 features a new configuration option that tries to do this automatically and is available as a system-wide setting. The configuration packs all kernel networking threads on the NUMA node to which the device is connected. The scheduler will then try to place the VMs that use these networking threads the most on the same NUMA node. By default, the option is turned off because it may cause uneven workload distribution between NUMA nodes, especially in cases where all NICs are connected to the same NUMA node.
Figure 3: NUMA I/O benefit.
As seen in Figure 3 above, NUMA I/O can result in about 20% reduced CPU consumption and about 20% higher throughput with a 1-vCPU VM for 40GbE NICs. There is no throughput improvement for Intel NICs because we achieve line rate irrespective of where the workloads are placed. We do however see an increase in CPU efficiency of about 7%.
To enable this option, set the value of Net.NetNetqNumaIOCpuPinThreshold in the Advanced System Settings tab for the host. The value is configurable and can vary between 0 and 200. For example, if you set the value to 100, NUMA I/O is used as long as the networking load is less than 100% (that is, the networking threads do not use more than one core). Once the load increases beyond 100%, vSphere 6.0 follows its default scheduling behavior and schedules VMs and networking threads across different NUMA nodes.
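The threshold semantics can be summarized in a small sketch (illustrative Python; the function name and return labels are invented for the sketch, only the setting name is real):

```python
def numa_io_placement(network_load_pct, pin_threshold=100):
    """Model of Net.NetNetqNumaIOCpuPinThreshold behavior: keep
    networking threads and their heaviest-using VMs on the device's
    NUMA node while the networking load is below the threshold."""
    if network_load_pct < pin_threshold:
        return "pin-to-device-numa-node"
    return "default-scheduling"

assert numa_io_placement(60) == "pin-to-device-numa-node"   # under threshold
assert numa_io_placement(150) == "default-scheduling"       # over: spread out
```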
vSphere 6.0 includes some great new improvements in network performance. In this blog, we show:
- vmxnet3 can now achieve near line-rate performance with a 40GbE NIC.
- Significant performance improvements were made to the Mellanox driver, which is now up to 25% more efficient.
- vSphere also features a new option to turn on NUMA I/O that could improve application performance by up to 15%.