
VMware pushes the envelope with vSphere 6.0 vMotion

vMotion in VMware vSphere 6.0 delivers breakthrough capabilities that give customers a new level of flexibility and performance in moving virtual machines across their virtual infrastructures. vSphere 6.0 vMotion includes features – long-distance migration, cross-vCenter migration, and a routed vMotion network – that enable seamless migrations across today's management and distance boundaries. For the first time, VMs can be migrated across vCenter Servers separated by cross-continental distances with minimal performance impact. vMotion is fully integrated with the latest vSphere 6 software-defined data center technologies, including Virtual SAN (VSAN) and Virtual Volumes (VVOL). Additionally, the newly re-architected vMotion in vSphere 6.0 enables extremely fast migrations at speeds exceeding 60 gigabits per second.

In this blog, we present the latest vSphere 6.0 vMotion features along with their performance results. We first evaluate vMotion performance across two geographically dispersed data centers connected by a network with 100ms round-trip time (RTT) latency. Following that, we demonstrate vMotion performance when migrating an extremely memory-hungry “Monster” VM.

Long Distance vMotion

vSphere 6.0 introduces a Long-distance vMotion feature that increases the round-trip latency limit for vMotion networks from 10 milliseconds to 150 milliseconds. Long-distance mobility offers a variety of compelling new use cases, including whole data center upgrades, disaster avoidance, government-mandated disaster preparedness testing, and large-scale distributed resource management. Below, we examine vMotion performance under varying network latencies up to 100ms.
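
Before attempting a long-distance migration, it is worth confirming that the measured round-trip latency on the vMotion network is within the supported limit. A minimal check from the ESXi shell might look like the following sketch; the vmknic name (vmk1) and the peer address are placeholders for your own environment.

  # Measure round-trip latency from the source host's vMotion vmknic
  # to the destination host's vMotion IP (vmk1 and the address are examples).
  vmkping -I vmk1 -c 20 192.0.2.50
  # The reported average RTT should stay well under the 150ms vMotion limit.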

Test Configuration

We set up a vSphere 6.0 test environment with the following specifications:

Hardware

  • Two HP ProLiant DL580 G7 servers (32-core Intel Xeon E7-8837 @ 2.67 GHz, 256 GB memory)
  • Storage: Two EMC VNX 5500 arrays, FC connectivity, VMFS 5 volume on a 15-disk RAID-5 LUN
  • Networking: Intel 10GbE 82599 NICs
  • Latency Injector: Maxwell-10GbE appliance to inject latency into the vMotion network

Software

  • VM config: 4 VCPUs, 8GB mem, 2 vmdks (30GB system disk, 20GB database disk)
  • Guest OS/Application: Windows Server 2012 / MS SQL Server 2012
  • Benchmark: DVDStore (DS2) using a 12GB database with 12,000,000 customers, driven by 3 drivers with no think time (a sample driver invocation is sketched below)
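
For reference, a DS2 SQL Server driver invocation along these lines generates the kind of load described above. Treat it as an illustrative sketch: the target address is a placeholder, and the exact flag names can vary between DVD Store 2 releases, so this is not necessarily the exact command line used in our tests.

  # Illustrative DS2 SQL Server driver invocation (flag names may differ by DS2 release):
  # 3 driver threads, no think time, against the 12GB database.
  ds2sqlserverdriver.exe --target=<sql-vm-ip> --n_threads=3 --think_time=0 --db_size=12GB --warmup_time=1 --run_time=60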

[Figure 1 and Figure 2: Logical deployment of the long distance vMotion test-bed, including the Maxwell-10GbE latency injection appliance]

Figure 1 illustrates the logical deployment of the test-bed used for long distance vMotion testing. Long distance vMotion is supported both without shared storage infrastructure and with shared storage solutions such as EMC VPLEX Geo, which enables shared data access across long distances. Our test-bed did not use shared storage, so the entire state of the VM was migrated, including its memory, storage, and CPU/device state. As shown in Figure 2, our test configuration deployed a Maxwell-10GbE network appliance to inject latency into the vMotion network.
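
If a dedicated latency-injection appliance such as the Maxwell-10GbE is not available, a comparable WAN delay can be emulated in a lab with Linux traffic control (netem) on a router or bridge placed in the vMotion path. This is only a stand-in for the appliance used in our tests; the interface names and delay values below are examples.

  # Emulate a 100ms RTT by adding 50ms of delay in each direction on the
  # Linux box bridging the two vMotion networks (eth0/eth1 are examples).
  tc qdisc add dev eth0 root netem delay 50ms
  tc qdisc add dev eth1 root netem delay 50ms

  # Remove the emulated delay when finished.
  tc qdisc del dev eth0 root netem
  tc qdisc del dev eth1 root netem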

Measuring vMotion Performance

The following metrics were used to understand the performance implications of vMotion:

  • Migration Time: Total time taken for the migration to complete
  • Switch-over Time: Time during which the VM is quiesced to enable the switch-over from the source to the destination host
  • Guest Penalty: Performance impact on the applications running inside the VM during and after the migration (a simple way to observe the last two metrics externally is sketched below)
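
Switch-over time and guest impact can also be observed roughly from outside the VM with a simple probe loop like the one below. This is a generic sketch (the VM address is a placeholder) rather than the instrumentation used for the results in this blog, and it relies on the Linux iputils ping with sub-second intervals and timestamps.

  # Probe the VM every 100ms and record a timestamp for each reply; the longest
  # gap between consecutive replies approximates the switch-over window as seen
  # by clients. Stop the probe with Ctrl-C after the migration completes.
  VM_IP=192.0.2.100          # placeholder for the test VM's address
  ping -i 0.1 -D "$VM_IP" | tee /tmp/vm-probe.log

  # Afterwards, compute the largest gap between reply timestamps:
  awk -F'[][]' '/bytes from/ { if (prev != "") { gap = $2 - prev; if (gap > max) max = gap }; prev = $2 }
                END { printf "longest gap: %.3f s\n", max }' /tmp/vm-probe.log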

Test Results

We investigated the impact of long distance vMotion on Microsoft SQL Server online transaction processing (OLTP) performance using the open-source DVD Store workload. The test scenario used a Windows Server 2012 VM configured with 4 VCPUs, 8GB memory, and a SQL Server database size of 12GB. Figure 3 shows the migration time and VM switch-over time when migrating an active SQL Server VM at different network round-trip latencies. In all test scenarios, we used a load of 3 DS2 users with no think time, which generated substantial load on the VM. The migration was initiated during the steady-state period of the benchmark, when the CPU utilization (esxtop %USED counter) of the VM was close to 120% and the average read and write IOPS were about 200 and 150, respectively.
[Figure 3: Migration time and switch-over time at different round-trip latencies]

Figure 3 shows that the impact of round-trip latency was minimal on both the duration of the migration and the switch-over time, thanks to the latency-aware optimizations in vSphere 6.0 vMotion. The difference in migration time among the test scenarios was in the noise range (<5%). The switch-over time increased marginally, from about 0.5 seconds in the 5ms test scenario to 0.78 seconds in the 100ms test scenario.

[Figure 4: SQL Server throughput before, during, and after vMotion over a 100ms RTT network]

Figure 4 plots the performance of the SQL Server virtual machine, in orders processed per second, before, during, and after vMotion on a 100ms round-trip latency network. In our tests, the DVD Store benchmark driver was configured to report performance data at a fine granularity of 1 second (default: 10 seconds). As shown in the figure, the impact on SQL Server throughput was minimal during vMotion. The only noticeable dip in performance was during the switch-over phase (0.78 seconds) from the source host to the destination host. It took less than 5 seconds for SQL Server to return to its normal level of performance.

Faster Migration

Why are we interested in extreme performance? Today's data centers feature modern servers with many processing cores (up to 80), terabytes of memory, and high network bandwidth (10 and 40GbE NICs). VMware supports larger “monster” virtual machines that can scale up to 128 virtual CPUs and 4TB of RAM. Using this higher network bandwidth to complete migrations of monster VMs faster enables a higher level of mobility in private cloud deployments. Reducing the time to move a virtual machine also reduces the total network and CPU overhead of the migration.

Test Configuration

  • Two Dell PowerEdge R920 servers (60-core Intel Xeon E7-4890 v2 @ 2.80GHz, 1TB memory)
  • Networking: Intel 10GbE 82599 NICs, Mellanox 40GbE MT27520 NIC
  • VM config: 12 VCPUs, 500GB mem
  • Guest OS: Red Hat Enterprise Linux Server 6.3

We configured each vSphere host with four Intel 10GbE ports and a single Mellanox 40GbE port, for a total of 80Gb/s of network connectivity between the two vSphere hosts. Each vSphere host was configured with five vSwitches: four vSwitches each had a unique 10GbE uplink port, and the fifth vSwitch had the 40GbE uplink port. The MTU of the NICs was set to the default of 1500 bytes. We created one VMkernel adapter on each of the four vSwitches with a 10GbE uplink port and four VMkernel adapters on the vSwitch with the 40GbE uplink port. All eight VMkernel adapters were configured on the same subnet. We also enabled each VMkernel adapter for vMotion, which allowed vMotion traffic to use the full 80Gb/s of network connectivity.
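
A VMkernel adapter layout like the one described above can be scripted from the ESXi shell. The sketch below shows the general shape for one adapter, assuming the vSwitches and vMotion port groups already exist; the vmknic name, port group name, and addresses are examples, and these are the standard esxcli/vim-cmd equivalents rather than the exact scripts used in our lab.

  # Create a VMkernel adapter on an existing vMotion port group and give it a
  # static address on the shared vMotion subnet (names/addresses are examples).
  esxcli network ip interface add --interface-name=vmk1 --portgroup-name=vMotion-PG1
  esxcli network ip interface ipv4 set --interface-name=vmk1 --ipv4=10.10.0.11 --netmask=255.255.255.0 --type=static

  # Enable the adapter for vMotion so its uplink can carry vMotion traffic,
  # then repeat for the remaining adapters (vmk2 through vmk8).
  vim-cmd hostsvc/vmotion/vnic_set vmk1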

Methodology

To demonstrate extreme vMotion throughput performance, we simulated a very heavy memory usage footprint inside the virtual machine. The memory-intensive program allocated 300GB of memory inside the guest and touched a random byte in each memory page in an infinite loop. We migrated this virtual machine between the two vSphere hosts under different test scenarios: vMotion over a 10Gb/s network, a 20Gb/s network, a 40Gb/s network, and an 80Gb/s network. We used esxtop to monitor network throughput and CPU utilization on the source and destination hosts.
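
If you want to generate a similar load without writing a custom program, an off-the-shelf stressor such as stress-ng can keep a large allocation continuously dirty inside the guest. This is a stand-in for illustration only; its access pattern is not identical to the random per-page writes described above.

  # Keep one worker continuously dirtying a 300GB allocation inside the guest
  # (stress-ng is a substitute for the custom memory-touching program we used).
  stress-ng --vm 1 --vm-bytes 300g --vm-keep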

Test Results

[Figure 5: Peak vMotion network bandwidth in vSphere 5.5 and vSphere 6.0 under different network deployment scenarios]

Figure 5 compares the peak network bandwidth observed in vSphere 5.5 and vSphere 6.0 under different network deployment scenarios. Let us first consider vSphere 5.5 vMotion throughput performance. Figure 5 shows that vSphere 5.5 vMotion reaches line rate in both the 10Gb/s and 20Gb/s network test scenarios. When we increased the available vMotion network bandwidth beyond 20Gb/s, peak vMotion usage was limited to 18Gb/s in vSphere 5.5. This is because in vSphere 5.5, each vMotion is assigned two helper threads by default, and these threads do the bulk of the vMotion processing. Since the vMotion helper threads were CPU saturated, there was no performance gain from adding network bandwidth. When we increased the number of vMotion helper threads from 2 to 4 in the 40Gb/s test scenario, thereby removing the CPU bottleneck, the peak network bandwidth usage of vMotion in vSphere 5.5 increased to 32Gb/s. Tuning the helper threads beyond four hurt vMotion performance in the 80Gb/s test scenario, because vSphere 5.5 vMotion has locking issues that limit the performance gains from adding more helper threads; these are VM-specific locks that protect the VM's memory.

The newly re-architected vMotion in vSphere 6.0 not only removes these lock contention issues but also obviates the need for any manual tuning. During the initial setup phase, vMotion dynamically creates the appropriate number of TCP/IP stream channels between the source and destination hosts based on the configured network ports and their bandwidth. It then instantiates one vMotion helper thread per stream channel, so no manual tuning is required. Figure 5 shows that vMotion reaches line rate in the 10Gb/s, 20Gb/s, and 40Gb/s scenarios, while utilizing a little over 64Gb/s of network throughput in the 80Gb/s scenario. This is more than a 3.5x improvement in performance compared to vSphere 5.5.

[Figure 6: vMotion network throughput and CPU utilization in the vSphere 6.0 80Gb/s test scenario]
Figure 6 shows the network throughput and CPU utilization data in the vSphere 6.0 80Gb/s test scenario. During vMotion, the memory of the VM is copied from the source host to the destination host in an iterative fashion. In the first iteration, vMotion bandwidth usage is throttled by memory allocation on the destination host; the peak vMotion network bandwidth usage was about 28Gb/s during this phase. Subsequent iterations copy only the memory pages that were modified during the previous iteration. The number of pages transferred in these iterations is determined by how actively the guest accesses and modifies its memory pages. The more modified pages there are, the longer it takes to transfer all pages to the destination host, but on the flip side it allows vMotion's advanced performance optimizations to kick in and fully leverage the additional network and compute resources. That is evident in the third pre-copy iteration, when the peak measured bandwidth was about 64Gb/s and the peak CPU utilization (esxtop ‘PCPU Util%’ counter) on the destination host was about 40%.
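
The network throughput and CPU utilization in Figure 6 were read from esxtop. To capture the same counters across an entire migration for offline analysis, esxtop's batch mode is convenient; the sampling interval and iteration count below are just examples.

  # On each host, record esxtop counters (including per-vmnic throughput and
  # 'PCPU Util%') every 2 seconds for 10 minutes, then analyze the CSV offline.
  esxtop -b -d 2 -n 300 > /tmp/esxtop-vmotion.csv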

Conclusions

The main results of this performance study are the following:

  • The dramatic 10x increase in supported round-trip time offered by long-distance vMotion now makes it possible to migrate workloads non-disruptively over long distances, such as New York to London
  • The re-architected vMotion in vSphere 6.0 delivers remarkable improvements in Monster VM migration performance (up to 3.5x over vSphere 5.5) for large-scale private cloud deployments


Comments


  1. This is very interesting information. Thank you! I’m curious how long-distance vMotion works (if at all?) where the network subnet is not also stretched between locations? I see the testing was done with latency injection, but wondering if there are IP address implications that would limit the usage of this.

    Thanks,
    Mark B-

    1. Thank you, Mark. We do expect a stretched Layer 2 for the VM network so the VM can retain the same IP address after the migration and maintain its existing network connections.

    1. Thanks, Karim. Let me clarify the vMotion requirements. Prior to vSphere 6, the ‘vMotion network’ required Layer 2 adjacency, meaning that the vMotion VMkernel NIC (vmknic) on both the source and destination hosts needed to be on the same subnet to perform vMotion. In vSphere 6, the vMotion network requires only Layer 3 access between the source and destination hosts, meaning that vMotion traffic can now traverse an L3 network if you configure a VMkernel gateway for the vMotion network. We do expect a stretched Layer 2 for the ‘VM network’ (the network over which all the VM traffic traverses) so the VM can retain the same IP address after the migration and maintain its existing network connections.

  2. OK, I understand the VM network needs to be stretched to maintain VM connectivity after the vMotion, but this doesn't really do much good and limits the usefulness of vMotion over L3 (if I can stretch my VM network, I would probably stretch my vMotion network as well)?

    Is there a use case where all of this works together perfectly? For example, with the use of NSX (network virtualization) there is no need for a stretched VM network (the VM vMotions over L3, and the VM network talks back to the old site via an NSX gateway, etc.). Has this been tested, and has anybody done it before?

    1. After vMotion, the VM needs to be able to communicate with all the services it was communicating with prior to vMotion. Some of these services may have an L2 dependency, so it is essential to have L2 stretching for the VM network. The vMotion network does not have these constraints. Hence, with vSphere 6.0, customers have a choice to use either an L3 vMotion network or an L2 stretched vMotion network. You will need an NSX-like solution to stretch the VM network. Many of our customers prefer the L3 support for vMotion. Long distance vMotion is being implemented by our customers, although, at this time, we are not able to use them as references yet.

  3. Thanks for the post.

    What is the use case for vMotioning a VM over a 150ms RTT network when its storage is still at Site A? Also, I would expect a large performance hit if storage is now many times 10ms away from compute. Did I miss something?

    What is the equivalent storage technology to go with long distance vMotion? I understand the VPLEX Metro max RTT is 10ms.

    thanks

  4. Long Distance vMotion is basically an enhanced vMotion (storage and compute) with a larger RTT threshold.

  5. Hello Stephane, one of the primary use cases of Long Distance vMotion is multi-site capacity utilization. Many customers have started to implement storage replication across long distances using third-party solutions such as EMC VPLEX Geo. vMotion leverages such replication solutions and thereby avoids the need to transfer VM storage to the new destination site.

  6. Is the graphic in Figure 3 incorrect? I would expect the total duration to increase as latency is increased, rather than your graphic showing a decrease in total time as latency is increased.

    Also, I'm not getting these results in the real world. I have 3 data centers, all connected with 10Gbps pipes. Data center A to B is 1ms latency; data center A to C is 21ms latency. Migrating a VM with 1 vCPU, 8GB RAM, and a 50GB disk takes 4 minutes from data center A to B and 28 minutes from data center A to C. This is a very consistent average over 20+ migrations. FYI: data center C is not in production yet, so the only thing on the wire was my migration. I can copy at near line rate from A to C at 9.8Gbps. The VM also had no workload running. Running 6.0 Update 2. I have a case open with VMware, but does anyone else have any ideas?

  7. Hello David, Figure 3 is showing that the vMotion duration is not impacted by the RTT latency. The runs do tend to have a little variation, mostly less than 5%. The latency-aware optimizations added in vSphere 6 size the socket buffers appropriately to hide the impact of RTT latency on duration. My feeling is that you may be running into some other issue in your environment, such as poor storage I/O throughput in data center C. I suggest that after you finish your migration, you check the vmkernel logs, in which vMotion reports very useful performance statistics. Note that each vMotion has a unique ID (a.k.a. migration ID) across both the source and destination hosts. You can run the following command at the end of the vMotion to see the performance stats logged by vMotion. You should provide this data for the case you already have open with VMware.
    # grep -i vmotion /var/log/vmkernel.log

  8. Hello

    Doesn't storage throughput affect overall vMotion performance? What kind of storage is used to reach approx. 60Gb/s of vMotion throughput?

    thanks

    1. Hello Stephane, vMotion throughput is affected by storage I/O throughput only when transferring the VM's disks. Our test scenario used traditional vMotion, that is, vMotion across hosts with shared storage. In that scenario, only the VM's memory is transferred from the source to the destination, so throughput is not affected by storage I/O.
