
Author Archives: Sreekanth Setty

vMotion across hybrid cloud: performance and best practices

VMware Cloud on AWS is a hybrid cloud service that runs the VMware software-defined data center (SDDC) stack in the Amazon Web Services (AWS) public cloud. The service automatically provisions and deploys a vSphere environment on a bare-metal AWS infrastructure, and lets you run your applications in a hybrid IT environment across your on-premises data centers and AWS global infrastructure. A key benefit of VMware Cloud on AWS is the ability to vMotion workloads back and forth from your on-premises data center to the AWS public cloud as capacity and data privacy require.

In this blog post, we share the results of our vMotion performance tests across a hybrid cloud environment consisting of a vSphere on-premises data center located in Wenatchee, Washington, and an SDDC hosted in the AWS cloud. The scenarios include the hybrid migration of a database server. We also describe the best practices to follow when migrating virtual machines by vMotion across the hybrid cloud.

Test configuration

We set up the hybrid cloud environment with the following specifications:

VMware Cloud on AWS

  • 1-host SDDC instance with Amazon EC2 i3.metal (Intel Xeon E5-2686 @ 2.3 GHz, 36 cores, 512 GB)
  • SDDC version: vmc1.6 (M6 – Cycle 17)
  • Auto-provisioned with NSX networking and VSAN storage

On-premises host

  • Dell PowerEdge R730 (Intel Xeon E5-2699 v4 @ 2.2GHz, 22 cores, 1 TB memory)
  • ESXi and vCenter version: 6.7
  • Storage: Dell NVMe, VMFS 5 volume
  • Networking: Intel 1GbE NIC (shared 2*10GbE DX links between on-prem and AWS)

Figure 1: Logical layout of the hybrid cloud setup

Figure 1 illustrates the logical layout of our hybrid cloud environment. We deployed a single-host SDDC instance on AWS cloud. The SDDC was the latest M6 version and auto-configured with vSAN storage and NSX networking. Our on-premises data center, located in Washington state, featured hosts running ESXi 6.7.

AWS Direct Connect

We used high-speed AWS Direct Connect links for connectivity between the VMware on-premises data center and the AWS Oregon region. AWS Direct Connect provides a dedicated connection between the on-premises data center and the AWS environment. VMware recommends you use this type of link because it guarantees sustained bandwidth during vMotion, which isn’t possible with VPN internet connections. In our environment, there was about 40 milliseconds of round-trip latency on the network.

L2 VPN tunnel

We set up a secure L2 VPN tunnel for the compute traffic that spanned the two vCenters. This connected the VMs on cloud and on-premises to the same address space (IP subnet). So, the VMs remained on the same subnet and kept their IP addresses even as we migrated them from on-premises to cloud and vice versa.

Figure 2: Extending VXLAN across on-premises and cloud using L2 VPN

As shown in figure 2, two NSX Edge VMs provided VPN capabilities and the bridge between the overlay world (VXLAN logical networks) and the physical infrastructure (IP networks). Each NSX Edge VM was equipped with two virtual interfaces (vNICs): one vNIC was used as an uplink to the physical network, and the second vNIC was used as the VXLAN trunk interface.

Hybrid linked mode

Figure 3: A single console to manage resources across on-premises and cloud environments

We created a hybrid linked mode between the cloud vCenter and the on-premises vCenter. This allowed us to use a single console to manage all our inventory across the hybrid cloud. As shown in Figure 3, the cloud inventory included a single Client VM provisioned in the compute workload resource pool, and the on-premises inventory included three VMs: an NSX Edge VM, a Client VM, and a Server VM.

Measuring vMotion performance

The following metrics were used to understand the performance implications of vMotion:

  • Migration time: Total time taken for migration to complete
  • Switch-over time: Time during which the VM is quiesced to enable switchover from on-premises to cloud, and vice versa
  • Guest penalty: Performance impact on the applications running inside the VM during and after the migration

Benchmark methodology

We investigated the impact of hybrid vMotion on a Microsoft SQL Server database performance using the open-source DVD Store 3 (DS3) benchmark, which simulates many customers performing typical actions in an online DVD Store (logging in, browsing, buying, reviewing, and so on).

The test scenario used a Windows Server 2012 VM configured with 8 VCPUs, 8 GB memory, 40 GB disk, and a SQL Server database size of 5 GB. As shown in figures 2 and 3, we used two concurrent DS3 clients, one client running on-premises, and a second client running on the cloud. Each client used a load of five DS3 users with 0.02 seconds of think time. We started the migration during the steady-state period of the benchmark when the CPU utilization (esxtop %USED counter) of the SQL Server VM was close to 275%, and the average write IOPS was 80.

Test results

Figure 4: SQL Server throughput at given time: before, during, and after hybrid vMotions

Figure 4 plots the performance of the SQL Server VM, in total orders processed per second, during vMotion from on-premises to cloud and vice versa. In our tests, both DS3 benchmark drivers were configured to report performance data at a fine granularity of 1 second (the default is 10 seconds). As shown in figure 4, the impact on SQL Server throughput was minimal during vMotion in both directions. The total throughput remained steady at around 75 operations per second throughout the test period. The vMotion durations from on-premises to cloud and from cloud to on-premises were 415 seconds and 382 seconds, respectively, with the network throughput ranging between 500 and 900 megabits per second (Mbps). The switch-over time was about 0.6 seconds in both vMotions. The few minor dips in throughput shown in the figure were due to variance in the available network bandwidth on the shared AWS Direct Connect link.

Figure 5: Breakdown of SQL Server throughput reported by the on-premises and cloud clients

Figure 5 illustrates the impact of network latency on throughput. While the total SQL Server throughput remained steady during the entire test period, the throughput reported by the on-premises and cloud clients varied based on their proximity to the SQL Server VM. For instance, the throughput reported by the on-premises client dropped from 65 operations per second to 10 when the SQL Server VM was migrated to the cloud, and jumped back to 65 after the SQL Server VM was migrated back to the on-premises environment.

The throughput variation seen by the two DS3 clients is not unique to our hybrid cloud environment and can be explained by Little’s Law.

Little’s Law

In queueing theory, Little’s Law theorem states that the average number (L) of customers in a stable system is equal to the average arrival rate (λ) multiplied by the average time (W) that a customer spends in the system. Expressed algebraically, the law is: L = λ × W

Figure 6: Little’s Law applicability in hybrid cloud performance testing

Figure 6 shows how Little’s Law can be applied to our hybrid cloud environment to relate the DS3 users, SQL server throughput, SQL Server processing time, and the network latency. The formula derived in figure 6 explains the impact of the network latency on the throughput (orders per second) when the benchmark load (DS3 users) is fixed. It should be noted, however, that although the throughput reported by both the clients varied due to the network latency, the aggregate throughput remained a constant. This is because the throughput decrease seen by one client is offset by the throughput increase seen by the other client.
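
To make this relationship concrete, the following minimal sketch (in Python, using illustrative think-time, processing-time, and latency values rather than our measured numbers) applies Little’s Law to estimate the per-client throughput for a client that sits next to the SQL Server VM versus one that is a 40 ms round trip away:

```python
# Minimal sketch of Little's Law (L = lambda * W) applied to a fixed-load client.
# All numbers below are illustrative assumptions, not our measured values.

def client_throughput(users, think_time_s, server_time_s, network_rtt_s, rtts_per_order=1):
    """Estimate orders/sec a client can drive when each order spends
    W = think time + server processing time + network round trips in the system."""
    w = think_time_s + server_time_s + rtts_per_order * network_rtt_s
    return users / w                 # lambda = L / W

USERS = 5          # DS3 users per client (fixed load)
THINK = 0.02       # seconds of think time per order
SERVER = 0.05      # assumed SQL Server processing time per order

local = client_throughput(USERS, THINK, SERVER, network_rtt_s=0.0005)  # client near the VM
remote = client_throughput(USERS, THINK, SERVER, network_rtt_s=0.040)  # client 40 ms away

print(f"local client : {local:5.1f} orders/sec")
print(f"remote client: {remote:5.1f} orders/sec")
```

With a fixed number of users (L), a larger time in the system (W) directly lowers the throughput (λ) for the distant client, while the client that is now co-located with the VM sees the opposite effect; this is why the aggregate throughput stayed roughly constant in our tests.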

This illustrates how important it is for you to monitor your application dependencies when you migrate workloads to and from the cloud. For example, if your database VM depends on a Java application server VM, you should consider migrating both VMs together; otherwise, the overall application throughput will suffer due to slow responses and timeouts.

One way to monitor your application dependencies is to use VMware vRealize Network Insight, which can mitigate business risk by mapping application dependencies in both private and hybrid cloud environments.

vMotion Stun During Page Send (SDPS)

We also tested vMotion performance after doubling the intensity of the DS3 workload on both the on-premises and cloud clients. Although vMotion succeeded, vmkernel logs indicated that vMotion SDPS kicked in during the test scenarios with the higher benchmark load. SDPS is an advanced feature of vMotion that ensures a migration will not fail due to memory copy convergence issues. Whenever vMotion detects that the guest memory dirty rate is higher than the available network bandwidth, it injects microsecond latencies into guest execution to throttle the page dirty rate, so the network transfer can catch up with the dirty rate. We therefore recommend delaying the vMotion of heavily loaded VMs in hybrid cloud environments with shared bandwidth links, which prevents slowdown of guest execution.
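
Conceptually, the SDPS decision can be modeled as in the sketch below (a simplified illustration with made-up numbers, not the actual vmkernel logic): if the guest dirties memory faster than vMotion can transfer it, guest execution must be slowed by roughly the ratio of the two rates.

```python
# Simplified model of the vMotion SDPS decision; not the actual vmkernel implementation.

MBPS = 1_000_000 / 8   # bytes per second in one megabit per second

def sdps_slowdown(dirty_rate_bytes, xfer_rate_bytes):
    """Return the factor by which guest execution must be slowed (via injected
    microsecond delays) so the memory dirty rate no longer exceeds what the
    vMotion network can transfer."""
    if dirty_rate_bytes <= xfer_rate_bytes:
        return 1.0                       # pre-copy is converging; no throttling needed
    return dirty_rate_bytes / xfer_rate_bytes

# Example with assumed numbers: a guest dirtying ~1.2 GB/s of memory while the
# shared Direct Connect link sustains only ~600 Mb/s for vMotion.
slowdown = sdps_slowdown(dirty_rate_bytes=1.2e9, xfer_rate_bytes=600 * MBPS)
print(f"guest must run ~{slowdown:.0f}x slower for the transfer to catch up")
```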

To learn more about SDPS, see “VMware vSphere vMotion Architecture, Performance, and Best Practices.”

vMotion across multiple availability zones in the SDDC

Every AWS region has multiple availability zones (AZ). Amazon does not provide service level agreements beyond an availability zone. For reasons such as failover support, VMware Cloud on AWS customers can choose an SDDC deployment that spans multiple availability zones in a single AWS region.

There are certain vMotion performance implications with respect to the SDDC deployment configuration.

Figure 7.  vMotion peak network throughput in a single availability zone vs. multiple availability zones

As shown in figure 7, vMotion peak network throughput depends on the host placement in the SDDC.

This is because vMotion uses a single TCP stream in the VMware Cloud environment. If the vMotion source and destination hosts are within the same availability zone, vMotion peak throughput can reach as high as 10 gigabits per second (Gbps), limited only by the CPU core speed. However, if the source and destination hosts are across availability zones, vMotion peak throughput is governed by the AWS rate limiter. The throughput of any single TCP or UDP stream across availability zones is limited to 5 Gbps by the AWS rate limiter.
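
The following back-of-the-envelope sketch (assumed VM memory sizes; it ignores pre-copy iterations, dirty-page retransmission, and protocol overhead) shows how the per-stream cap translates into minimum memory-transfer times:

```python
# Rough estimate of minimum memory-transfer time for a single vMotion stream.
# VM sizes are assumptions; real migrations also retransmit dirtied pages.

def transfer_seconds(memory_gb, link_gbps):
    return (memory_gb * 8) / link_gbps   # GB -> gigabits, divided by gigabits/sec

for memory_gb in (8, 64, 256):
    same_az  = transfer_seconds(memory_gb, 10)  # single stream, same AZ (~10 Gbps)
    cross_az = transfer_seconds(memory_gb, 5)   # single stream, cross AZ (5 Gbps cap)
    print(f"{memory_gb:4d} GB VM: ~{same_az:6.1f}s same AZ vs ~{cross_az:6.1f}s across AZs")
```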

Conclusion

In summary, our performance test results show the following:

  • vMotion lets you migrate workloads seamlessly across traditional, on-premises data centers and software-defined data centers on AWS Cloud.
  • vMotion offers the same standard performance guarantees across hybrid cloud environments, including less than 1 second of vMotion execution switch-over time and minimal impact on guest performance.


vSphere 6.5 Encrypted vMotion Architecture and Performance

With the rise in popularity of hybrid cloud computing, where VM sensitive data leaves the traditional IT environment and traverses over the public networks, IT administrators and architects need a simple and secure way to protect critical VM data that traverses across clouds and over long distances.

The Encrypted vMotion feature available in VMware vSphere® 6.5 addresses this challenge by introducing a software approach that provides end-to-end encryption for vMotion network traffic. The feature encrypts all vMotion data inside the vmkernel using the widely used AES-GCM encryption standard, thereby providing data confidentiality, integrity, and authenticity even when vMotion traffic traverses untrusted network links.

A new white paper, “VMware vSphere 6.5 Encrypted vMotion Architecture, Performance and Best Practices”, is now available. In that paper, we describe the vSphere 6.5 Encrypted vMotion architecture and provide a comprehensive look at the performance of live migrating virtual machines running typical Tier 1 applications using vSphere 6.5 Encrypted vMotion. Tests measure characteristics such as total migration time and application performance during live migration. In addition, we examine vSphere 6.5 Encrypted vMotion performance over a high-latency network, such as that in a long distance network. Finally, we describe several best practices to follow when using vSphere 6.5 Encrypted vMotion.

In this blog, we give a brief overview of vSphere 6.5 Encrypted vMotion technology, and some of the performance highlights from the paper.


VMware pushes the envelope with vSphere 6.0 vMotion

vMotion in VMware vSphere 6.0 delivers breakthrough new capabilities that will offer customers a new level of flexibility and performance in moving virtual machines across their virtual infrastructures. Included with vSphere 6.0 vMotion are features – Long-distance migration, Cross-vCenter migration, Routed vMotion network – that enable seamless migrations across current management and distance boundaries. For the first time ever, VMs can be migrated across vCenter Servers separated by cross-continental distance with minimal performance impact. vMotion is fully integrated with all the latest vSphere 6 software-defined data center technologies including Virtual SAN (VSAN) and Virtual Volumes (VVOL). Additionally, the newly re-architected vMotion in vSphere 6.0 now enables extremely fast migrations at speeds exceeding 60 Gigabits per second.

In this blog, we present the latest vSphere 6.0 vMotion features as well as the performance results. We first evaluate vMotion performance across two geographically dispersed data centers connected by a 100ms round-trip time (RTT) latency network. Following that we demonstrate vMotion performance when migrating an extremely memory hungry “Monster” VM.

Long Distance vMotion

vSphere 6.0 introduces a Long-distance vMotion feature that increases the round-trip latency limit for vMotion networks from 10 milliseconds to 150 milliseconds. Long distance mobility offers a variety of compelling new use cases including whole data center upgrades, disaster avoidance, government mandated disaster preparedness testing, and large scale distributed resource management to name a few. Below, we examine vMotion performance under varying network latencies up to 100ms.

Test Configuration

We set up a vSphere 6.0 test environment with the following specifications:

Hardware

  • Two HP ProLiant DL580 G7 servers (32-core Intel Xeon E7-8837 @ 2.67 GHz, 256 GB memory)
  • Storage: Two EMC VNX 5500 arrays, FC connectivity, VMFS 5 volume on a 15-disk RAID-5 LUN
  • Networking: Intel 10GbE 82599 NICs
  • Latency Injector: Maxwell-10GbE appliance to inject latency into the vMotion network

Software

  • VM config: 4 VCPUs, 8GB mem, 2 vmdks (30GB system disk, 20GB database disk)
  • Guest OS/Application: Windows Server 2012 / MS SQL Server 2012
  • Benchmark: DVDStore (DS2) using a database size of 12GB with 12,000,000 customers, 3 drivers without think-time

Figures 1 and 2: Logical layout of the test bed for long-distance vMotion testing

Figure 1 illustrates the logical deployment of the test bed used for long-distance vMotion testing. Long-distance vMotion is supported both with no shared storage infrastructure and with shared storage solutions such as EMC VPLEX Geo, which enables shared data access across long distances. Our test bed did not use shared storage, so the migration transferred the entire state of the VM, including its memory, storage, and CPU/device state. As shown in Figure 2, our test configuration deployed a Maxwell-10GbE network appliance to inject latency into the vMotion network.

Measuring vMotion Performance

The following metrics were used to understand the performance implications of vMotion:

  • Migration Time: Total time taken for migration to complete
  • Switch-over Time: Time during which the VM is quiesced to enable switchover from source to the destination host
  • Guest Penalty: Performance impact on the applications running inside the VM during and after the migration

Test Results

We investigated the impact of long distance vMotion on Microsoft SQL Server online transaction processing (OLTP) performance using the open-source DVD Store workload. The test scenario used a Windows Server 2012 VM configured with 4 VCPUs, 8GB memory, and a SQL Server database size of 12GB. Figure 3 shows the migration time and VM switch-over time when migrating an active SQL Server VM at different network round-trip latencies. In all the test scenarios, we used a load of 3 DS2 users with no think time that generated substantial load on the VM. The migration was initiated during the steady-state period of the benchmark when the CPU utilization (esxtop %USED counter) of the VM was close to 120%, and the average read IOPS and average write IOPS were about 200 and 150, respectively.
Figure 3: Migration time and switch-over time at different network round-trip latencies

Figure 3 shows that the impact of round-trip latency was minimal on both the duration of the migration and the switch-over time, thanks to the latency-aware optimizations in vSphere 6.0 vMotion. The difference in migration time among the test scenarios was within the noise range (<5%). The switch-over time increased marginally from about 0.5 seconds in the 5ms test scenario to 0.78 seconds in the 100ms test scenario.

Figure 4: SQL Server throughput before, during, and after vMotion on a 100ms round-trip latency network

Figure 4 plots the performance of a SQL Server virtual machine in orders processed per second at a given time: before, during, and after vMotion on a 100ms round-trip latency network. In our tests, the DVD Store benchmark driver was configured to report the performance data at a fine granularity of 1 second (the default is 10 seconds). As shown in the figure, the impact on SQL Server throughput was minimal during vMotion. The only noticeable dip in performance was during the switch-over phase (0.78 seconds) from the source to the destination host. It took less than 5 seconds for SQL Server to resume its normal level of performance.

Faster migration

Why are we interested in extreme performance? Today’s datacenters feature modern servers with many processing cores (up to 80), terabytes of memory and high network bandwidth (10 and 40 GbE NICs). VMware supports larger “monster” virtual machines that can scale up to 128 virtual CPUs and 4TB of RAM. Utilizing higher network bandwidth to complete migrations of these monster VMs faster can enable you to implement high levels of mobility in private cloud deployments. The reduction in time to move a virtual machine can also reduce the overhead on the total network and CPU usage.

Test Config

  • Two Dell PowerEdge R920 servers (60-core Intel Xeon E7-4890 v2 @ 2.80GHz, 1TB memory)
  • Networking: Intel 10GbE 82599 NICs, Mellanox 40GbE MT27520 NIC
  • VM config: 12 VCPUs, 500GB mem
  • Guest OS: Red Hat Enterprise Linux Server 6.3

We configured each vSphere host with four Intel 10GbE ports and a single Mellanox 40GbE port, for a total of 80Gb/s of network connectivity between the two vSphere hosts. Each vSphere host was configured with five vSwitches: four vSwitches each had one unique 10GbE uplink port, and the fifth vSwitch had a 40GbE uplink port. The MTU of the NICs was set to the default of 1500 bytes. We created one VMkernel adapter on each of the four vSwitches with a 10GbE uplink port and four VMkernel adapters on the vSwitch with the 40GbE uplink port. All eight VMkernel adapters were configured on the same subnet. We also enabled each VMkernel adapter for vMotion, which allowed vMotion traffic to use the full 80Gb/s of network connectivity.

Methodology

To demonstrate the extreme vMotion throughput performance, we simulated a very heavy memory usage footprint in the virtual machine. The memory-intensive program allocated 300GB memory inside the guest and touched a random byte in each memory page in an infinite loop. We migrated this virtual machine between the two vSphere hosts under different test scenarios: vMotion over 10Gb/s network, vMotion over 20Gb/s network, vMotion over 40Gb/s network and vMotion over 80Gb/s network. We used esxtop to monitor network throughput and CPU utilization on the source and destination hosts.
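
The memory-intensive program itself is not published; the sketch below (a Python stand-in with a much smaller 1 GB buffer so it can be run anywhere) shows the general idea of allocating a large region and continuously touching a random byte in every page:

```python
# Sketch of a memory-dirtying workload similar in spirit to the one used in our tests.
# The real program allocated 300 GB inside the guest; here we use 1 GB for illustration.
import random

PAGE = 4096
SIZE = 1 * 1024**3                      # 1 GB buffer (assumption; scale up as needed)

buf = bytearray(SIZE)                   # allocate and touch the whole region once
pages = SIZE // PAGE

while True:                             # dirty one random byte in every page, forever
    for p in range(pages):
        offset = p * PAGE + random.randrange(PAGE)
        buf[offset] = (buf[offset] + 1) & 0xFF
```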

Test Results

Figure 5: Peak vMotion network bandwidth in vSphere 5.5 and vSphere 6.0 under different network deployments

Figure 5 compares the peak network bandwidth observed in vSphere 5.5 and vSphere 6.0 under different network deployment scenarios. Let us first consider the vSphere 5.5 vMotion throughput performance. Figure 5 shows vSphere 5.5 vMotion reaches line rate in both 10Gb/s network and 20Gb/s network test scenarios. When we increased the available vMotion network bandwidth to beyond 20 Gb/s, the vMotion peak usage was limited to 18Gb/s in vSphere 5.5. This is because in vSphere 5.5 vMotion, each vMotion is assigned by default two helper threads which do the bulk of vMotion processing. Since the vMotion helper threads are CPU saturated, there is no performance gain when adding additional network bandwidth. When we increased the number of vMotion helper threads from 2 to 4 in the 40Gb/s test scenario, and thereby removed the CPU bottleneck, we saw the peak network bandwidth usage of vMotion in vSphere 5.5 increase to 32Gb/s. Tuning the helper threads beyond four hurt vMotion performance in 80Gb/s test scenario, as vSphere 5.5 vMotion has some locking issues which limit the performance gains when adding more helper threads. These locks are VM specific locks that protect VM’s memory.

The newly re-architected vMotion in vSphere 6.0 not only removes these lock contention issues but also obviates the need to apply any tunings. During the initial setup phase, vMotion dynamically creates the appropriate number of TCP/IP stream channels between the source and destination hosts based on the configured network ports and their bandwidth. It then instantiates a vMotion helper thread per stream channel, thereby removing the need for any manual tuning. Figure 5 shows vMotion reaches line rate in the 10Gb/s, 20Gb/s, and 40Gb/s scenarios, while utilizing a little over 64Gb/s of network throughput in the 80Gb/s scenario. This is more than a 3.5x improvement in performance compared to vSphere 5.5.

Figure 6: Network throughput and CPU utilization in the vSphere 6.0 80Gb/s test scenario

Figure 6 shows the network throughput and CPU utilization data in the vSphere 6.0 80Gb/s test scenario. During vMotion, the memory of the VM is copied from the source host to the destination host in an iterative fashion. In the first iteration, the vMotion bandwidth usage is throttled by the memory allocation on the destination host. The peak vMotion network bandwidth usage is about 28Gb/s during this phase. Subsequent iterations copy only the memory pages that were modified during the previous iteration. The number of pages transferred in these iterations is determined by how actively the guest accesses and modifies the memory pages. The more modified pages there are, the longer it takes to transfer all pages to the destination server, but on the flip side, it enables vMotion’s advanced performance optimizations to kick in and fully leverage the additional network and compute resources. That is evident in the third pre-copy iteration, when the peak measured bandwidth was about 64Gb/s and the peak CPU utilization (esxtop ‘PCPU Util%’ counter) on the destination host was about 40%.
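
The iterative pre-copy behavior described above can be illustrated with a toy simulation (all dirty-rate and bandwidth values below are assumptions, not measurements): each iteration sends the pages dirtied during the previous one, and switch-over happens once the remaining set can be copied within the stun-time budget.

```python
# Toy simulation of vMotion iterative pre-copy; all numbers are assumptions.

def precopy(memory_gb, dirty_gbps, link_gbps, switchover_budget_s=0.5, max_iters=20):
    remaining = memory_gb * 8                    # gigabits left to send this iteration
    for i in range(1, max_iters + 1):
        xfer_time = remaining / link_gbps
        if xfer_time <= switchover_budget_s:     # small enough: stun the VM and finish
            print(f"iter {i}: {xfer_time:.2f}s -> switch over")
            return
        print(f"iter {i}: send {remaining:7.1f} Gb in {xfer_time:6.1f}s")
        remaining = dirty_gbps * xfer_time       # pages dirtied while we were copying
    print("not converging: SDPS would throttle the guest here")

precopy(memory_gb=500, dirty_gbps=20, link_gbps=64)
```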

Conclusions

The main results of this performance study are the following:

  • The dramatic 10x increase in round-trip time support offered in long-distance vMotion now makes it possible to migrate workloads non-disruptively over long distances such as New York to London
  • Remarkable performance enhancements in vSphere 6.0 towards improving Monster VM migration performance (up to 3.5x improvement over vSphere 5.5) in large scale private cloud deployments

Each vSphere release introduces new vMotion functionality, increased reliability and significant performance improvements. vSphere 5.5 continues this trend by offering new enhancements to vMotion to support EMC VPLEX Metro, which enables shared data access across metro distances.

In this blog, we evaluate vMotion performance on a VMware vSphere 5.5 virtual infrastructure that was stretched across two geographically dispersed datacenters using EMC VPLEX Metro.

Test Configuration

The VPLEX Metro test bed consisted of two identical VPLEX clusters, each with the following hardware configuration:

• Dell R610 host, 8 cores, 48GB memory, Broadcom BCM5709 1GbE NIC
• A single engine (two directors) VPLEX Metro IP appliance
• FC storage switch
• VNX array, FC connectivity, VMFS 5 volume on a 15-disk RAID-5 LUN


Figure 1. Logical layout of the VPLEX Metro deployment

Figure 1 illustrates the deployment of the VPLEX Metro system used for vMotion testing. The figure shows two data centers, each with a vSphere host connected to a VPLEX Metro appliance. The VPLEX virtual volumes presented to the vSphere hosts in each data center are synchronous, distributed volumes that mirror data between the two VPLEX clusters using write-through caching. As a result, vMotion views the underlying storage as shared storage, or exactly equivalent to a SAN that both source and destination hosts have access to. Hence, vMotion in a Metro VPLEX environment is as easy as traditional vMotion that live migrates only the memory and device state of a virtual machine.

The two VPLEX Metro appliances in our test configuration used IP-based connectivity. The vMotion network between the two ESXi hosts used a physical network link distinct from the VPLEX network. The Round Trip Time (RTT) latency on both VPLEX and vMotion networks was 10 milliseconds.

Measuring vMotion Performance

The following metrics were used to understand the performance implications of vMotion:

• Migration Time: Total time taken for migration to complete
• Switch-over Time: Time during which the VM is quiesced to enable switchover from source to the destination host
• Guest Penalty: Performance impact on the applications running inside the VM during and after the migration

Test Results


Figure 2. VPLEX Metro vMotion performance in vSphere 5.1 and vSphere 5.5

Figure 2 compares VPLEX Metro vMotion performance results in vSphere 5.1 and vSphere 5.5 environments. The test scenario used an idle VM configured with 2 VCPUs and 2GB memory. The figure shows a minor difference in the total migration time between the two vSphere environments and a significant improvement in vMotion switch-over time in the vSphere 5.5 environment. The switch-over time reduced from about 1.1 seconds to about 0.6 second (a nearly 2x improvement), thanks to a number of performance enhancements that are included in the vSphere 5.5 release.

We also investigated the impact of VPLEX Metro live migration on Microsoft SQL Server online transaction processing (OLTP) performance using the open-source DVD Store workload. The test scenario used a Windows Server 2008 VM configured with 4 VCPUs, 8GB memory, and a SQL Server database size of 50GB.


Figure 3. VPLEX Metro vMotion impact on SQL Server Performance

Figure 3 plots the performance of a SQL Server virtual machine in orders processed per second at a given time—before, during, and after VPLEX Metro vMotion. As shown in the figure, the impact on SQL Server throughput was very minimal during vMotion. The SQL Server throughput on the destination host was around 310 orders per second, compared to the throughput of 350 orders per second on the source host. This throughput drop after vMotion is due to the VPLEX inter-cluster cache coherency interactions and is expected. For some time after the vMotion, the destination VPLEX cluster continued to send cache page queries to the source VPLEX cluster and this has some impact on performance. After all the metadata is fully migrated to the destination cluster, we observed the SQL Server throughput increase to 350 orders per second, the same level of throughput seen prior to vMotion.

These performance test results show the following:

  • Remarkable improvements in vSphere 5.5 towards reducing vMotion switch-over time during metro migrations (for example, a nearly 2x improvement over vSphere 5.1)
  • VMware vMotion in vSphere 5.5 paired with EMC VPLEX Metro can provide workload federation over a metro distance by enabling administrators to dynamically distribute and balance the workloads seamlessly across data centers

To find out more about the test configuration, performance results, and best practices to follow, see our recently published performance study.

VMware vSphere 5.1 vMotion Architecture, Performance, and Best Practices

vMotion and Storage vMotion are key, widely adopted technologies which enable the live migration of virtual machines on the vSphere platform. vMotion provides the ability to live migrate a virtual machine from one vSphere host to another host, with no perceivable impact to the end user. Storage vMotion technology provides the ability to live migrate the virtual disks belonging to a virtual machine across storage elements on the same host.  Together, vMotion and Storage vMotion technologies enable critical datacenter workflows, including automated load-balancing with DRS and Storage DRS, hardware maintenance, and the permanent migration of workloads.

Each vSphere release introduces new vMotion functionality, increased reliability and significant performance improvements. vSphere 5.1 continues this trend by offering new enhancements to vMotion that provide a new level of ease and flexibility for live virtual machine migrations.  vSphere 5.1 vMotion now removes the shared storage requirement for live migration and allows combining traditional vMotion and Storage vMotion into one operation. The combined migration copies both the virtual machine memory and its disk over the network to the destination vSphere host. This shared-nothing unified live migration feature offers administrators significantly more simplicity and flexibility in managing and moving virtual machines across their virtual infrastructures compared to the traditional vMotion and Storage vMotion migration solutions.

A new white paper, “VMware vSphere 5.1 vMotion Architecture, Performance and Best Practices”, is now available. In that paper, we describe the vSphere 5.1 vMotion architecture and its features. Following the overview and feature description of vMotion in vSphere 5.1, we provide a comprehensive look at the performance of live migrating virtual machines running typical Tier 1 applications using vSphere 5.1 vMotion, Storage vMotion, and vMotion. Tests measure characteristics such as total migration time and application performance during live migration. In addition, we examine vSphere 5.1 vMotion performance over a high-latency network, such as that in a metro area network. Test results show the following:

  • During storage migration, vSphere 5.1 vMotion maintains the same performance as Storage vMotion, even when using the network to migrate, due to the optimizations added to the vSphere 5.1 vMotion network data path.
  • During memory migration, vSphere 5.1 vMotion maintains nearly identical performance as the traditional vMotion, due to the optimizations added to the vSphere 5.1 vMotion memory copy path.
  • vSphere 5.1 vMotion retains the proven reliability, performance, and atomicity of the traditional vMotion and Storage vMotion technologies, even at metro area network distances.

Finally, we describe several best practices to follow when using vMotion.

For the full paper, see “VMware vSphere 5.1 vMotion Architecture, Performance and Best Practices”.

 

vMotion Architecture, Performance, and Best Practices in VMware vSphere 5

VMware vSphere vMotion enables the live migration of virtual machines from one VMware vSphere 5 host to another, with no perceivable impact to the end user. vMotion brings invaluable benefits to administrators—it enables load balancing, helps prevent server downtime, and provides flexibility for troubleshooting. vMotion in vSphere 5 incorporates a number of performance enhancements which allow vMotion to be used with minimal overhead on even the largest virtual machines running heavy-duty, enterprise-class applications.

A new white paper, vMotion Architecture, Performance, and Best Practices in VMware vSphere 5, is now available. In that paper, we describe the vMotion architecture and present the features and performance enhancements that have been introduced in vMotion in vSphere 5.  Among these improvements are multiple–network adaptor capability for vMotion, better utilization of 10GbE bandwidth, Metro vMotion, and optimizations to further reduce impact on application performance.

Following the overview and feature description of vMotion in vSphere 5, we provide a comprehensive look at the performance of migrating VMs running typical Tier 1 applications including Rock Web Server, MS Exchange Server, MS SQL Server and VMware View. Tests measure characteristics such as total migration time and application performance during vMotion. Test results show the following:

  • Remarkable improvements in vSphere 5 towards reducing the impact on guest application performance during vMotion
  • Consistent performance gains in the range of 30% in vMotion duration on vSphere 5
  • Dramatic performance improvements over vSphere 4.1 when using the newly added multi–network adaptor feature in vSphere 5 (for example, vMotion duration time is reduced by a factor of more than 3x)

Finally, we describe several best practices to follow when using vMotion.

For the full paper, see vMotion Architecture, Performance, and Best Practices in VMware vSphere 5.

VMware Load-Based Teaming (LBT) Performance

Virtualized data center environments are often characterized by both a wide variety and a sheer number of traffic flows whose network demands fluctuate widely and unpredictably. Provisioning a fixed network capacity for these traffic flows can either result in poor performance (by underprovisioning) or waste valuable capital (by overprovisioning).

NIC teaming in vSphere enables you to distribute (or load balance) the network traffic from different traffic flows among multiple physical NICs by providing a mechanism to logically bind together multiple physical NICs. This results in increased throughput and fault tolerance and alleviates the challenge of network-capacity provisioning to a great extent. Creating a NIC team in vSphere is as simple as adding multiple physical NICs to a vSwitch and choosing a load balancing policy.

vSphere 4 (and prior ESX releases) provides several load balancing choices, which base routing on the originating virtual port ID, an IP hash, or a source MAC hash. While these load balancing choices work fine in the majority of virtual environments, they all share a few limitations. For instance, all these policies statically map the affiliations of the virtual NICs to the physical NICs (based on virtual switch port IDs or MAC addresses) and do not base their load balancing decisions on the current network traffic, so they may not effectively distribute traffic among the physical uplinks. In addition, none of these policies takes into account disparities in physical NIC capacity (such as a mixture of 1GbE and 10GbE physical NICs in a NIC team). In the next section, we describe the teaming policy introduced in vSphere 4.1 that addresses these shortcomings.

Load-Based Teaming (LBT)

vSphere 4.1 introduces a load-based teaming (LBT) policy that is traffic-load-aware and ensures physical NIC capacity in a NIC team is optimized. Note that LBT is supported only with the vNetwork Distributed Switch (vDS). LBT avoids the situation of other teaming policies where some of the distributed virtual uplinks (dvUplinks) in a DV Port Group’s team are idle while others are completely saturated. LBT reshuffles the port binding dynamically, based on load and dvUplink usage, to make efficient use of the available bandwidth.

LBT is not the default teaming policy when creating a DV Port Group, so it is up to you to configure it as the active policy. Because LBT moves flows among uplinks, it may occasionally cause reordering of packets at the receiver. LBT moves a flow only when the mean send or receive utilization on an uplink exceeds 75% of capacity over a 30-second period, and it moves flows no more often than once every 30 seconds. The decision logic is sketched below.
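
As a conceptual illustration of that rule (not the actual vDS implementation; in particular, the choice of which flow to move is our assumption), the rebalancing decision might look like this:

```python
# Conceptual sketch of the LBT rebalancing rule; not the actual vSphere implementation.
import time

SATURATION = 0.75          # default link-saturation threshold (75% of capacity)
WAKEUP_PERIOD_S = 30       # default evaluation / minimum move interval in seconds

def pick_flow_to_move(uplinks, last_move_time):
    """uplinks: dict name -> {'capacity_bps': ..., 'mean_util_bps': ..., 'flows': [...]}.
    Returns (flow, source_uplink, target_uplink) or None if no move is warranted."""
    if time.time() - last_move_time < WAKEUP_PERIOD_S:
        return None                                    # never move more than once per period
    saturated = [u for u, s in uplinks.items()
                 if s['mean_util_bps'] > SATURATION * s['capacity_bps']]
    spare = [u for u, s in uplinks.items()
             if s['mean_util_bps'] <= SATURATION * s['capacity_bps']]
    if not saturated or not spare:
        return None                                    # nothing to rebalance
    src = saturated[0]
    dst = min(spare, key=lambda u: uplinks[u]['mean_util_bps'])   # least-loaded uplink
    flow = max(uplinks[src]['flows'], key=lambda f: f['bps'])     # heaviest flow (assumed heuristic)
    return flow, src, dst
```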

Performance

In this section, we describe in detail the test-bed configuration, the workload used to generate the network traffic flows, and the test results.

Test configuration

In our test configuration, we used an HP DL370 G6 server running the GA release of vSphere 4.1, and several client machines that generated SPECweb®2005 traffic. The server was configured with dual-socket, quad-core 3.1GHz Intel Xeon W5580 processors, 96GB of RAM, and two 10 GbE Intel Oplin NICs. The server hosted four virtual machines and SPECweb2005 traffic was evenly distributed among all four VMs.  Each VM was configured with 4 vCPUs, 16GB memory, 2 vmxnet3 vNICs, and SLES 11 as the guest OS.

SPECweb2005 is an industry-standard web server benchmark defined by the Standard Performance Evaluation Corporation (SPEC). The benchmark consists of three workloads: Banking, Ecommerce, and Support, each with different workload characteristics representing common use cases for web servers. We used the Support workload in our tests, which is the most I/O intensive of the three.

Baseline performance

In our baseline configuration, we configured a vDS with two dvUplinks and two DV Port Groups. We mapped the vNICs of the two VMs to the first dvUplink, and the vNICs of the other two VMs to the second dvUplink through the vDS interface. The SPECweb2005 workload was evenly distributed among all the four VMs and therefore we ensured both the dvUplinks were equally stressed. In terms of the load balancing, this baseline configuration presents the most optimal performance point. With the load of 30,000 SPECweb2005 support users, we observed a little over 13Gbps traffic, that is, about 6.5Gbps per 10 GbE uplink. The %CPU utilization and the percentage of SPECweb2005 user sessions that met the quality-of-service (QoS) requirements were found to be 80% and 99.99% respectively. We chose this load point because customers typically do not stress their systems beyond this level.

LBT performance

We then reconfigured the vDS with two dvUplinks and a single DV Port Group to which all the vNICs of the VMs were mapped. The DV Port Group was configured with the LBT teaming policy. We used the default settings of LBT, which are primarily the wakeup period (30 seconds) and link saturation threshold (75%). Our goal was to evaluate the efficacy of the LBT policy in terms of load balancing and the added CPU cost, if any, when the same benchmark load of 30,000 SPECweb2005 support sessions was applied.

Before the start of the test, we noted that the traffic from all the VMs propagated through the first dvUplink. Note that the initial affiliation of the vNICs to the dvUplinks is made based on the hash of the virtual switch port IDs. To find the current affiliations of the vNICs to the dvUplinks, run the esxtop command and find the port-to-uplink mappings in the network screen. You can also use the “net-lbt” tool to find affiliations as well as to modify LBT settings.

The figure below shows the network bandwidth usage on both of the dvUplinks during the entire benchmark period.

Figure: Network bandwidth usage on the two dvUplinks during the benchmark period

A detailed explanation of the bandwidth usage in each phase follows:

Phase 1: Because all the virtual switch port IDs of the four VMs were hashed to the same dvUplink, only one of the dvUplinks was active. During this phase of the benchmark ramp-up, the total network traffic was below 7.5Gbps. Because the usage on the active dvUplink was lower than the saturation threshold, the second dvUplink remained unused.

Phase 2: The benchmark workload continued to ramp up and when the total network traffic exceeded 7.5Gbps (above the saturation threshold of 75% of link speed), LBT kicked in and dynamically remapped the port-to-uplink mapping of one of the vNIC ports from the saturated dvUplink1 to the unused dvUplink2. This resulted in dvUplink2 becoming active.  The usage on both the dvUplinks remained below the saturation threshold.

Phase 3: As the benchmark workload further ramped up and the total network traffic exceeded 10Gbps (7.5Gbps on dvUplink1 and 2.5Gbps on dvUplink2), LBT kicked in yet again, and dynamically changed port-to-uplink mapping of one of the three active vNIC ports currently mapped to the saturated dvUplink.

Phase 4: As the benchmark reached a steady state with the total network traffic at a little over 13Gbps, both dvUplinks saw the same usage.

We did not observe any spikes in CPU usage or any dip in SPECweb2005 QoS during all the four phases. The %CPU utilization and the percentage of SPECweb2005 user sessions that met the QoS requirements were found to be 80% and 99.99% respectively.

These results show that LBT can serve as a very effective load balancing policy to optimally use all the available dvUplink capacity while matching the performance of a manually load-balanced configuration.

Summary

Load-based teaming (LBT) is a dynamic and traffic-load-aware teaming policy that can ensure physical NIC capacity in a NIC team is optimized.  In combination with VMware Network IO Control (NetIOC), LBT offers a powerful solution that will make your vSphere deployment even more suitable for your I/O-consolidated datacenter.

Achieving High Web Throughput with VMware vSphere 4 on Intel Xeon 5500 series (Nehalem) servers

We just published a SPECweb2005 benchmark score of 62,296 — the highest result published to date on a virtual configuration. This result was obtained on an HP ProLiant DL380 G6 server running VMware vSphere 4 and featuring Intel Xeon 5500 series processors, and Intel 82598EB 10 Gigabit AF network interface cards. While driving the network throughput from a single host to just under 30 Gbps, this benchmark score still stands at 85% of the level achieved in native (non-virtualized) execution on equivalent hardware configurations.

Our latest benchmark results show that VMware, with our partners Intel and HP, is able to provide virtualization solutions that meet the performance and scaling needs of modern data centers. In addition, the simplification achieved through consolidation in a virtual environment, as demonstrated by the configuration used in our benchmark publication, contributes to eliminating complexity in the software environment.

Let me briefly discuss some of the distinctive characteristics of our latest benchmark results:

Use of VMDirectPath for virtualizing network I/O: VMDirectPath is a feature in vSphere 4 that builds upon the Intel VT-d (Virtualization Technology for Directed I/O) capability engineered into recent Intel processors to virtualize network I/O. It allows guest operating systems to directly access an I/O device, bypassing the virtualization layer. The result we just published is notably different from our previous results in that this time we used the VMDirectPath feature to take advantage of the higher performance it makes possible.

High performance and linear scaling with the addition of virtual machines: VMDirectPath bypasses the virtualization layer to a large extent for network interactions, but a measurable number of guest OS and hypervisor interactions still remain. The possibility still exists that the hypervisor can become a scaling limiter in a multi-VM environment. The excellent performance achieved by our benchmark configuration using four virtual machines shows that this should not be a concern.

A highly simplified setup: Results published in the SPECweb2005 website reveal the complexity of “interrupt pinning” that is common in the configurations in a native setting, generally employed in order to make full use of all the cores in today’s multi-core processors. By comparison, our benchmark configuration does not use device interrupt pinning. This is because the virtualization approach divides the load among multiple VMs, each of which is smaller and therefore easier to keep core-efficient.

Virtualization Performance: Our results show that a single vSphere host can handle 30 Gbps real world Web traffic and still reach a performance level of 85% of the native results published on equivalent physical configuration. This demonstrates capabilities several orders of magnitude greater than those needed by typical Web applications, proof-positive that the vast majority of the Web applications can be consolidated, with excellent performance, in a virtualized environment.

For more details, check out the full length article published on the VMware community website in which we elaborate upon each of the characteristics that we briefly discussed here.

VMware breaks the 50,000 SPECweb2005 barrier using VMware vSphere 4

VMware has achieved a SPECweb2005 benchmark score of 50,166 using VMware vSphere 4, a 14% improvement over the world record results previously published on VI3. Our latest results further strengthen the position of VMware vSphere as an industry leader in web serving, thanks to a number of performance enhancements and features that are included in this release. In addition to the measured performance gains, some of these enhancements will help simplify administration in customer environments.

The key highlights of the current results include:

  1. Highly scalable virtual SMP performance.
  2. Over 25% performance improvement for the most I/O intensive SPECweb2005 support component.
  3. Highly simplified setup with no device interrupt pinning.

Let me briefly touch upon each of these highlights.

Virtual SMP performance

The improved scheduler in ESX 4.0 enables usage of large symmetric multiprocessor (SMP) virtual machines for web-centric workloads. Our previous world record results published on ESX 3.5 used as many as fifteen uniprocessor (UP) virtual machines. The current results with ESX 4.0 used just four SMP virtual machines. This is made possible by several improvements that went into the CPU scheduler in ESX 4.0.

From a scheduler perspective, SMP virtual machines present additional considerations such as co-scheduling. This is because, in the case of an SMP virtual machine, it is important for the ESX scheduler to present the applications and the guest OS running in the virtual machine with the illusion that they are running on a dedicated multiprocessor machine. ESX implements this illusion by co-scheduling the virtual processors of an SMP virtual machine. While the requirement to co-schedule all the virtual processors of a VM was relaxed in previous releases of ESX, the relaxed co-scheduling algorithm has been further refined in ESX 4.0. This means the scheduler has more choices in its ability to schedule the virtual processors of a VM. This leads to higher system utilization and better overall performance in a consolidated environment.

ESX 4.0 has also improved its resource locking mechanism. The locking mechanism in ESX 3.5 was based on the cell lock construct. A cell is a logical grouping of physical CPUs in the system within which all the vCPUs of a VM had to be scheduled. This has been replaced with per-pCPU and per-VM locks. This fine-grained locking reduces contention and improves scalability. All these enhancements enable ESX 4.0 to use SMP VMs and achieve this new level of SPECweb2005 performance.

Very high performance gains for workloads with large I/O component

I/O intensive applications highlight the performance enhancements of ESX 4.0. These tests show that high-I/O workloads yield the largest gains when upgrading to this release.

In all our tests, we used the SPECweb2005 workload, which measures the system’s ability to act as a web server. It is designed with three workloads to characterize different web usage patterns: Banking (emulates online banking), E-commerce (emulates an e-commerce site), and Support (emulates a vendor support site that provides downloads). The performance score of each workload is measured in terms of the number of simultaneous sessions the system is able to support while meeting the QoS requirements of the workload. The aggregate metric reported by the SPECweb2005 workload normalizes the performance scores obtained on the three workloads.

The following figure compares the scores of the three workloads obtained on ESX 4.0 to the previous results on ESX 3.5. The figure also highlights the percentage improvements obtained on ESX 4.0 over ESX 3.5. We used an HP ProLiant DL585 G5 server with four Quad-Core AMD Opteron processors as the system under test. The benchmark results have been reviewed and approved by the SPEC committee.

Figure: SPECweb2005 workload scores and percentage improvements, ESX 4.0 vs. ESX 3.5

We used the same HP ProLiant DL585 G5 server and the same physical test infrastructure in the current benchmark submission as in the previous submission on VI3. There were some differences between the two test configurations (for example, ESX 3.5 used UP VMs while SMP VMs were used on ESX 4.0; the ESX 4.0 tests were run on currently available processors that have a slightly higher clock speed). To highlight the performance gains, we will look at the percentage improvements obtained for all three workloads rather than the absolute numbers.

As you can see from the above figure, the biggest percentage gain was seen with the Support workload, which has the largest I/O component. In this test, a 25% gain was seen while ESX drove about 20 Gbps of web traffic. Of the three workloads, the Banking workload has the smallest I/O component, and accordingly had relatively smaller percentage gain.

Highly simplified setup

ESX 4.0 also simplifies customer environments without sacrificing performance. In our previous ESX 3.5 results, we pinned the device interrupts to make efficient use of hardware caches and improve performance. Binding device interrupts to specific processors is a technique commonly used in SPECweb2005 benchmarking tests to maximize performance. Results published on the http://www.spec.org/osg/web2005 website reveal the complex pinning configurations used by benchmark publishers in native environments.

The highly improved I/O processing model in ESX 4.0 obviates the need to do any manual device interrupt pinning. On ESX, the I/O requests issued by the VM are intercepted by the virtual machine monitor (VMM) which handles them in cooperation with the VMkernel. The improved execution model in ESX 4.0 processes these I/O requests asynchronously which allows the vCPUs of the VM to execute other tasks.

Furthermore, the scheduler in ESX 4.0 schedules processing of network traffic based on processor cache architecture, which eliminates the need for manual device interrupt pinning. With the new core-offload I/O system and related scheduler improvements, the results with ESX 4.0 compare favorably to ESX 3.5.

Conclusions

These SPECweb2005 results demonstrate that customers can expect substantial performance gains on ESX 4.0 for web-centric workloads. Our past results published on ESX 3.5 showed world record performance in a scale-out (increasing the number of virtual machines) configuration and our current results on vSphere 4 demonstrate world class performance while scaling up (increasing the number of vCPUs in a virtual machine). With an improved scheduler that required no fine-tuning for these experiments, VMware vSphere 4 can offer these gains while lowering the cost of administration.

VMware Sets Performance Record with SPECweb2005 Result

Introduction

We just published the largest SPECweb2005 score to date on a 16-core server. The benchmark was run on an HP ProLiant DL585 G5 with four Quad-Core AMD Opteron 8382 processors. This record score of 44,000 includes an Ecommerce component demonstrating 69,525 concurrent connections. In the Support component, this single-host workload drove network throughput on the server to just under 16 Gb/s.

This once again proves the capabilities of the VI3 platform and its ability to service workloads with stringent Quality of Service (QoS) requirements along with a large storage and networking footprint. With continuous advancements in virtualization technology (such as hardware assist for MMU virtualization and NetQueue support for 10 Gigabit Ethernet) performance with the VI3 platform can meet the needs of the most demanding, high traffic web sites.

While record-setting performance of web servers proves the capabilities of ESX, the real story of web server virtualization is the gains due to web farm consolidation and improved flexibility. Infrastructure serving as the web front end today is designed around hundreds or even thousands of often underutilized two and four core servers.  Consolidation of these servers onto modern systems with multi-core CPUs reduces costs, simplifies management and eases power and cooling demands. Consolidating web servers makes business sense. This SPECweb2005 result from VMware has shown that the ESX Server can handle loads much more extreme than anticipated in such a consolidated environment.

The Benchmark

The SPECweb2005 benchmark consists of three workloads: Banking, Ecommerce, and Support, each with different workload characteristics representing common use cases for web servers. Each workload measures the number of simultaneous user sessions a web server can support while still meeting stringent quality-of-service and error-rate requirements. The aggregate metric reported by the SPECweb2005 benchmark is a normalized metric based on the performance scores obtained on all three workloads.

Component           Score    Explanation
Banking             80,000   Models online banking. Represents the number of customers accessing accounts at a given time that can be supported with acceptable QoS.
E-commerce          69,525   Models an online retail store. 69,525 is only 75 shy of the highest number reported, which required 50% more processing cores.
Support             33,000   Represents users acquiring patches and downloads from a support web site. In this test, network throughput was 16 Gb/s.
SPECweb2005 Score   44,000   Normalized metric from the three components.

Table 1. Results in the SPECweb2005 submission.

Benchmark Configuration  

Hardware: HP ProLiant DL585 G5 with four Quad-Core AMD Opteron 8382 processors, 128 GB RAM
Disk subsystem: Two EMC CLARiiON CX3-40 Fibre Channel SAN arrays, total of 110 * 133GB (15K RPM) spindles
Network: Four Intel 10 Gigabit XF SR Server Adapters
Hypervisor: ESX Server 3.5 U3
Guest operating system: Red Hat Enterprise Linux 5 Update 1
Virtual hardware: 1 vCPU, 8 GB memory, vmxnet virtual network adapter
Web server software: Rock Web Server v1.4.7, Rock JSP/Servlet Container v1.3.2
Client systems: 30 * Dell PowerEdge 1950, dual-socket dual-core Intel Xeon, 8 GB
Workload: SPECweb2005

Table 2. Benchmark configuration.

Performance Details

Here’s a quick look at what was accomplished with a single ESX host using the SPECweb2005 workload.

Aggregate performance: The aggregate SPECweb2005 performance of 44,000 obtained on our 16-core virtual configuration is higher than any result ever recorded on a 16-core native system. 

Support performance: The support workload is the most I/O intensive of all the workloads. The file-set data used for the support workload was laid out on a little over 100 spindles and consisted of files varying in size from 100 KB to 36 MB. In our test configuration, we used fifteen virtual machines that shared the underlying physical 10Gbps NICs. Together they supported over 33,000 concurrent support user sessions and handled close to sixteen gigabits per second of web traffic on a single ESX host.

Banking performance: This workload emulates online banking that transfers encrypted information using HTTPS. The file-set data used for the banking test was about 1.3 terabytes consisting of some eight million individual files of varied sizes. We laid out all this data in a single VMFS volume that spanned multiple LUNs. We used fifteen virtual machines that shared the same base image. Together, they supported 80,000 concurrent banking user sessions and handled 143,000 HTTP operations/second. 

Ecommerce performance: Of the three workloads, the Ecommerce workload probably fits the profile of most customers. This is because, unlike the Banking and Support workloads, it is a mixture of HTTP and HTTPS requests. Its I/O characteristics fall in between those of the Banking and Support workloads. In our test, ESX supported 69,525 concurrent Ecommerce user sessions on a 16-core server. Our result is the second highest E-commerce result ever published, bested by only another 75 sessions on a system with 50% more cores.

To learn more about the test configuration and tuning descriptions, please see the full disclosure report on the official SPEC website: http://www.spec.org/osg/web2005.