
vMotion across hybrid cloud: performance and best practices

VMware Cloud on AWS is a hybrid cloud service that runs the VMware software-defined data center (SDDC) stack in the Amazon Web Services (AWS) public cloud. The service automatically provisions and deploys a vSphere environment on bare-metal AWS infrastructure and lets you run applications in a hybrid IT environment that spans your on-premises data centers and the AWS global infrastructure. A key benefit of VMware Cloud on AWS is the ability to vMotion workloads back and forth between your on-premises data center and the AWS public cloud as capacity and data privacy require.

In this blog post, we share the results of vMotion performance tests across a hybrid cloud environment consisting of an on-premises vSphere data center in Wenatchee, Washington, and an SDDC hosted in the AWS cloud. We cover several scenarios, including the hybrid migration of a database server, and describe the best practices to follow when migrating virtual machines by vMotion across the hybrid cloud.

Test configuration

We set up the hybrid cloud environment with the following specifications:

VMware Cloud on AWS

  • 1-host SDDC instance with Amazon EC2 i3.metal (Intel Xeon E5-2686 @ 2.3 GHz, 36 cores, 512 GB memory)
  • SDDC version: vmc1.6 (M6 – Cycle 17)
  • Auto-provisioned with NSX networking and vSAN storage

On-premises host

  • Dell PowerEdge R730 (Intel Xeon E5-2699 v4 @ 2.2 GHz, 22 cores, 1 TB memory)
  • ESXi and vCenter version: 6.7
  • Storage: Dell NVMe, VMFS 5 volume
  • Networking: Intel 1 GbE NIC (shared 2×10 GbE Direct Connect (DX) links between on-premises and AWS)

Figure 1: Logical layout of the hybrid cloud setup

Figure 1 illustrates the logical layout of our hybrid cloud environment. We deployed a single-host SDDC instance on AWS cloud. The SDDC was the latest M6 version and auto-configured with vSAN storage and NSX networking. Our on-premises data center, located in Washington state, featured hosts running ESXi 6.7.

AWS Direct Connect

We used high-speed AWS Direct Connect links for connectivity between the VMware on-premises data center and the AWS Oregon region. AWS Direct Connect provides a leased line from the AWS environment to the on-premises data center. VMware recommends this type of link because it guarantees sustained bandwidth during vMotion, which isn't possible with VPN connections over the public internet. In our environment, the network had about 40 milliseconds of round-trip latency.
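Given this latency, note that the sustained throughput of any single TCP stream is bounded by its window size divided by the round-trip time. The short Python sketch below computes the bandwidth-delay product for our link; the 40 ms RTT and ~900 Mbps peak are from our setup, while treating vMotion as a single window-limited stream is a simplifying assumption for illustration.

```python
# Bandwidth-delay product (BDP): bytes that must be in flight to keep a
# link full. Values reflect our test setup (40 ms RTT, ~900 Mbps peak);
# this is an illustrative calculation, not a vSphere tunable.

def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Return the bytes in flight needed to sustain the given rate."""
    return bandwidth_bps * rtt_seconds / 8

rtt = 0.040            # 40 ms round-trip latency over Direct Connect
peak_bw = 900e6        # ~900 Mbps peak observed during vMotion

print(f"BDP: {bdp_bytes(peak_bw, rtt) / 1e6:.1f} MB")   # ~4.5 MB
# A stream whose effective window is smaller than this cannot
# sustain the full link rate at 40 ms RTT.
```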

L2 VPN tunnel

We set up a secure L2 VPN tunnel for the compute traffic that spanned the two vCenters. This placed the cloud and on-premises VMs in the same address space (IP subnet), so the VMs remained on the same subnet and kept their IP addresses even as we migrated them from on-premises to cloud and vice versa.

Figure 2: Extending VXLAN across on-premises and cloud using L2 VPN

As shown in figure 2, two NSX Edge VMs provided VPN capabilities and the bridge between the overlay world (VXLAN logical networks) and the physical infrastructure (IP networks). Each NSX Edge VM was equipped with two virtual interfaces (vNICs): one vNIC was used as an uplink to the physical network, and the second vNIC was used as the VXLAN trunk interface.

Hybrid linked mode

Figure 3: A single console to manage resources across on-premises and cloud environments

We configured Hybrid Linked Mode between the cloud vCenter and the on-premises vCenter. This allowed us to use a single console to manage all our inventory across the hybrid cloud. As shown in figure 3, the cloud inventory included a single Client-VM provisioned in the compute workload resource pool, and the on-premises inventory included three VMs: an NSX Edge VM, a Client-VM, and a Server-VM.

Measuring vMotion performance

We used the following metrics to understand the performance implications of vMotion:

  • Migration time: Total time taken for migration to complete
  • Switch-over time: Time during which the VM is quiesced to enable the switch-over from on-premises to cloud, and vice versa (a minimal external probe sketch follows this list)
  • Guest penalty: Performance impact on the applications running inside the VM during and after the migration
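One simple way to approximate switch-over time from outside the VM is to probe it at a fixed interval during the migration and record the longest gap in responsiveness. Below is a minimal Python sketch of such a probe; the host name and port are hypothetical placeholders, and the measurement is only as precise as the probe interval.

```python
# Minimal external switch-over probe: TCP-connect to the VM at a fixed
# interval during vMotion and report the longest unresponsive gap.
# Host and port are placeholders for your own environment.
import socket
import time

HOST, PORT = "sqlserver-vm.example.com", 1433   # hypothetical target
INTERVAL = 0.1                                  # probe every 100 ms

longest_gap = 0.0
last_ok = time.monotonic()
deadline = time.monotonic() + 600               # probe for 10 minutes

while time.monotonic() < deadline:
    try:
        with socket.create_connection((HOST, PORT), timeout=INTERVAL):
            now = time.monotonic()
            longest_gap = max(longest_gap, now - last_ok)
            last_ok = now
    except OSError:
        pass                                    # VM quiesced or unreachable
    time.sleep(INTERVAL)

print(f"Longest unresponsive gap: {longest_gap:.2f} s")
```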

Benchmark methodology

We investigated the impact of hybrid vMotion on Microsoft SQL Server performance using the open-source DVD Store 3 (DS3) benchmark, which simulates many customers performing typical actions in an online DVD store (logging in, browsing, buying, reviewing, and so on).

The test scenario used a Windows Server 2012 VM configured with 8 vCPUs, 8 GB memory, and a 40 GB disk, with a SQL Server database size of 5 GB. As shown in figures 2 and 3, we used two concurrent DS3 clients, one running on-premises and the other running in the cloud. Each client used a load of five DS3 users with 0.02 seconds of think time. We started the migration during the steady-state period of the benchmark, when the CPU utilization (esxtop %USED counter) of the SQL Server VM was close to 275% and the average write IOPS was 80.
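DS3 drives the database in a closed loop: each simulated user issues an order, waits for the response, and then pauses for the think time before issuing the next one. The toy Python stand-in below captures that loop shape for a single user; it is not DS3 code, and place_order is a hypothetical callable representing one order round trip.

```python
# Toy closed-loop load generator in the spirit of one DS3 user:
# issue an order, wait for the response, think, repeat.
# place_order is a hypothetical stand-in for one order round trip.
import time

def run_user(place_order, think_time=0.02, duration=60.0):
    """Run one closed-loop user; return completed orders per second."""
    orders = 0
    start = time.monotonic()
    while time.monotonic() - start < duration:
        place_order()            # blocks for server + network time
        orders += 1
        time.sleep(think_time)   # 0.02 s think time, as in our tests
    return orders / duration

# One user against a fake 55 ms "order" completes about
# 1 / (0.02 + 0.055) ≈ 13 orders per second.
print(f"{run_user(lambda: time.sleep(0.055), duration=5.0):.1f} orders/s")
```

Because each user waits for the previous order to complete before issuing the next, per-order response time directly limits throughput, which is exactly the behavior Little's Law formalizes in the results below.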

Test results

Figure 4: SQL Server throughput over time: before, during, and after hybrid vMotions

Figure 4 plots the performance of the SQL Server VM, in total orders processed per second, during vMotion from on-premises to cloud and vice versa. In our tests, both DS3 benchmark drivers were configured to report performance data at a fine granularity of 1 second (the default is 10 seconds). As shown in figure 4, the impact on SQL Server throughput was minimal during vMotion in both directions. Total throughput remained steady at around 75 orders per second throughout the test period. The vMotion durations from on-premises to cloud and back were 415 seconds and 382 seconds, respectively, with network throughput ranging from 500 to 900 megabits per second (Mbps). The switch-over time was about 0.6 seconds in both vMotions. The few minor dips in throughput shown in the figure were due to variance in the available network bandwidth on the shared AWS Direct Connect link.

Figure 5: Breakdown of SQL Server throughput reported by the on-premises and cloud clients

Figure 5 illustrates the impact of network latency on throughput. While the total SQL Server throughput remained steady during the entire test period, the throughput reported by the on-premises and cloud clients varied based on their proximity to the SQL Server VM. For instance, the throughput reported by the on-premises client dropped from 65 orders per second to 10 when the SQL Server VM was migrated to the cloud, and climbed back to 65 after the VM was migrated back to the on-premises environment.

The throughput variation seen by the two DS3 clients is not unique to our hybrid cloud environment and can be explained by Little’s Law.

Little’s Law

In queueing theory, Little's Law states that the average number (L) of customers in a stable system is equal to the average arrival rate (λ) multiplied by the average time (W) that a customer spends in the system. Expressed algebraically: L = λ × W

Figure 6: Little’s Law applicability in hybrid cloud performance testing

Figure 6 shows how Little's Law can be applied to our hybrid cloud environment to relate the DS3 users, SQL Server throughput, SQL Server processing time, and network latency. The formula derived in figure 6 explains the impact of network latency on throughput (orders per second) when the benchmark load (DS3 users) is fixed. Note, however, that although the throughput reported by each client varied with network latency, the aggregate throughput remained constant, because the throughput decrease seen by one client was offset by the throughput increase seen by the other.
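To make the relationship concrete, the sketch below plugs our numbers into Little's Law: with the user count fixed at L = 5 per client, throughput is λ = L / W, where W is the think time plus the per-order response time. The per-order service time and the number of network round trips per order are illustrative assumptions, chosen so the computed local throughput lands near the observed 65 orders per second.

```python
# Little's Law (L = λ·W) applied to one DS3 client with a fixed user
# count: throughput λ = L / W. The service time and round trips per
# order below are assumptions for illustration only.

def throughput(users, think_time, service_time, rtts_per_order, rtt):
    response_time = service_time + rtts_per_order * rtt
    return users / (think_time + response_time)

users, think = 5, 0.02             # from our benchmark configuration
service, rtts = 0.055, 10          # assumed values, for illustration

local  = throughput(users, think, service, rtts, rtt=0.0005)  # same site
remote = throughput(users, think, service, rtts, rtt=0.040)   # 40 ms WAN

print(f"local:  {local:.0f} orders/s")    # ~62 orders/s
print(f"remote: {remote:.0f} orders/s")   # ~11 orders/s
```

The added 40 ms per round trip inflates W, and because L is fixed, λ must fall, which matches the drop from 65 to 10 orders per second that the on-premises client reported.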

This illustrates how important it is for you to monitor your application dependencies when you migrate workloads to and from the cloud. For example, if your database VM depends on a Java application server VM, you should consider migrating both VMs together; otherwise, the overall application throughput will suffer due to slow responses and timeouts.

One way to monitor your application dependencies is to use VMware vRealize Network Insight, which can mitigate business risk by mapping application dependencies in both private and hybrid cloud environments.

vMotion Stun During Page Send (SDPS)

We also tested vMotion performance after doubling the intensity of the DS3 workload on both the on-premises and cloud clients. Although vMotion succeeded, vmkernel logs indicated that vMotion SDPS kicked in during the test scenarios with the higher benchmark load. SDPS is an advanced feature of vMotion that ensures migrations do not fail due to memory-copy convergence issues. Whenever vMotion detects that the guest memory dirty rate is higher than the available network bandwidth, it injects microsecond latencies into guest execution to throttle the page dirty rate, so the network transfer can catch up with the dirty rate. We therefore recommend you delay the vMotion of heavily loaded VMs in hybrid cloud environments with shared-bandwidth links, which prevents this slowdown in guest execution.
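The convergence condition that SDPS enforces can be illustrated with a toy pre-copy model: each pass resends the pages dirtied during the previous pass, so the remaining set shrinks only if the dirty rate stays below the transfer rate. The Python sketch below is a conceptual model under those assumptions, not VMware's actual algorithm.

```python
# Toy iterative pre-copy model: each pass resends pages dirtied during
# the previous pass. Convergence requires dirty_rate < link rate.
# Conceptual sketch only; not VMware's actual algorithm.

def precopy_passes(memory_gb, dirty_rate_gbps, link_gbps,
                   target_gb=0.1, max_passes=50):
    """Passes needed to shrink the dirty set below target_gb,
    or None if the copy never converges."""
    remaining = memory_gb
    for n in range(1, max_passes + 1):
        copy_time = remaining * 8 / link_gbps        # seconds for this pass
        remaining = dirty_rate_gbps * copy_time / 8  # GB dirtied meanwhile
        if remaining <= target_gb:
            return n
    return None

print(precopy_passes(8, dirty_rate_gbps=0.4, link_gbps=0.9))  # 6 passes
print(precopy_passes(8, dirty_rate_gbps=1.8, link_gbps=0.9))  # None
```

When the dirty rate exceeds the link rate, the remaining set grows on every pass, which is precisely the situation in which SDPS throttles guest execution so the transfer can converge.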

To learn more about SDPS, see “VMware vSphere vMotion Architecture, Performance, and Best Practices.”

vMotion across multiple availability zones in the SDDC

Every AWS region has multiple availability zones (AZs). Amazon does not provide service-level agreements that span availability zones. For reasons such as failover support, VMware Cloud on AWS customers can choose an SDDC deployment that spans multiple availability zones in a single AWS region.

The SDDC deployment configuration has certain implications for vMotion performance.

Figure 7: vMotion peak network throughput in a single availability zone vs. multiple availability zones

As shown in figure 7, vMotion peak network throughput depends on the host placement in the SDDC.

This is because vMotion uses a single TCP stream in the VMware Cloud environment. If the vMotion source and destination hosts are within the same availability zone, vMotion peak throughput can reach as high as 10 gigabits per second (Gbps), limited only by the CPU core speed. However, if the source and destination hosts are across availability zones, vMotion peak throughput is governed by the AWS rate limiter. The throughput of any single TCP or UDP stream across availability zones is limited to 5 Gbps by the AWS rate limiter.
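These limits translate directly into a lower bound on memory-copy time. The rough estimate below ignores dirty-page retransmission, and the VM memory size is a hypothetical example.

```python
# Rough lower bound on vMotion memory-copy time under the two
# throughput limits from figure 7. Ignores dirty-page retransmission;
# the VM memory size is a hypothetical example.

def copy_time_seconds(memory_gb, throughput_gbps):
    return memory_gb * 8 / throughput_gbps

vm_memory_gb = 64   # hypothetical VM

print(f"same AZ  (10 Gbps): {copy_time_seconds(vm_memory_gb, 10):.0f} s")  # ~51 s
print(f"cross-AZ (5 Gbps):  {copy_time_seconds(vm_memory_gb, 5):.0f} s")   # ~102 s
```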

Conclusion

In summary, our performance test results show the following:

  • vMotion lets you migrate workloads seamlessly across traditional, on-premises data centers and software-defined data centers on AWS Cloud.
  • vMotion offers the same standard performance guarantees across hybrid cloud environments, including less than 1 second of switch-over time and minimal impact on guest performance.

References