Home > Blogs > VMware VROOM! Blog > Tag Archives: AWS

Tag Archives: AWS

vMotion across hybrid cloud: performance and best practices

VMware Cloud on AWS is a hybrid cloud service that runs the VMware software-defined data center (SDDC) stack in the Amazon Web Services (AWS) public cloud. The service automatically provisions and deploys a vSphere environment on a bare-metal AWS infrastructure, and lets you run your applications in a hybrid IT environment across your on-premises data centers and AWS global infrastructure. A key benefit of VMware Cloud on AWS is the ability to vMotion workloads back and forth from your on-premises data center to the AWS public cloud as capacity and data privacy require.

In this blog post, we share the results of our vMotion performance tests across our hybrid cloud environment that consisted of a vSphere on-premises data center located in Wenatchee, Washington and an SDDC hosted in an AWS cloud, in various scenarios including hybrid migration of a database server. We also describe the best practices to follow when migrating virtual machines by vMotion across hybrid cloud.

Test configuration

We set up the hybrid cloud environment with the following specifications:

VMware Cloud on AWS

  • 1-host SDDC instance with Amazon EC2 i3.metal (Intel Xeon E5-2686 @ 2.3 GHz, 36 cores, 512 GB)
  • SDDC version: vmc1.6 (M6 – Cycle 17)
  • Auto-provisioned with NSX networking and VSAN storage

On-premises host

  • Dell PowerEdge R730 (Intel Xeon E5-2699 v4 @ 2.2GHz, 22 cores, 1 TB memory)
  • ESXi and vCenter version: 6.7
  • Storage: Dell NVMe, VMFS 5 volume
  • Networking: Intel 1GbE NIC (shared 2*10GbE DX links between on-prem and AWS)

Figure 1: Logical layout of the hybrid cloud setup

Figure 1 illustrates the logical layout of our hybrid cloud environment. We deployed a single-host SDDC instance on AWS cloud. The SDDC was the latest M6 version and auto-configured with vSAN storage and NSX networking. Our on-premises data center, located in Washington state, featured hosts running ESXi 6.7.

AWS Direct Connect

We used high-speed AWS Direct Connect links for connectivity between VMware on-prem data center and AWS Oregon region. AWS Direct Connect provides a leased line from the AWS environment to the on-premises data center. VMware recommends you use this type of link because it guarantees sustained bandwidth during vMotion, which isn’t possible with VPN internet connections. In our environment, there was about 40 milliseconds of round-trip latency on the network.

L2 VPN tunnel

We set up a secure L2 VPN tunnel for the compute traffic that spanned the two vCenters. This connected the VMs on cloud and on-premises to the same address space (IP subnet). So, the VMs remained on the same subnet and kept their IP addresses even as we migrated them from on-premises to cloud and vice versa.

Figure 2: Extending VXLAN across on-premises and cloud using L2 VPN

As shown in figure 2, two NSX Edge VMs provided VPN capabilities and the bridge between the overlay world (VXLAN logical networks) and the physical infrastructure (IP networks). Each NSX Edge VM was equipped with two virtual interfaces (vNICs): one vNIC was used as an uplink to the physical network, and the second vNIC was used as the VXLAN trunk interface.

Hybrid linked mode

Figure 3: A single console to manage resources across on-premises and cloud environments

We created a hybrid linked mode between the cloud vCenter and on-premises vCenter. This allowed us to use a single console to manage all our inventory across the hybrid cloud.  As shown in Figure 3, the cloud inventory included a single Client-VM provisioned in the compute workload resource pool and the on-premises inventory included three VMs including NSX-Edge VM, Client-VM and a Server VM.

Measuring vMotion performance

The following metrics were used to understand the performance implications of vMotion:

  • Migration time: Total time taken for migration to complete
  • Switch-over time: Time during which the VM is quiesced to enable switchover from on-premises to cloud, and vice versa
  • Guest penalty: Performance impact on the applications running inside the VM during and after the migration

Benchmark methodology

We investigated the impact of hybrid vMotion on a Microsoft SQL Server database performance using the open-source DVD Store 3 (DS3) benchmark, which simulates many customers performing typical actions in an online DVD Store (logging in, browsing, buying, reviewing, and so on).

The test scenario used a Windows Server 2012 VM configured with 8 VCPUs, 8 GB memory, 40 GB disk, and a SQL Server database size of 5 GB. As shown in figures 2 and 3, we used two concurrent DS3 clients, one client running on-premises, and a second client running on the cloud. Each client used a load of five DS3 users with 0.02 seconds of think time. We started the migration during the steady-state period of the benchmark when the CPU utilization (esxtop %USED counter) of the SQL Server VM was close to 275%, and the average write IOPS was 80.

Test results

Figure 4: SQL Server throughput at given time: before, during, and after hybrid vMotions

Figure 4 plots the performance of a SQL Server VM in total orders processed per second during vMotion from on-premises to cloud, and vice versa. In our tests, both DS3 benchmark drivers were configured to report the performance data at a fine granularity of 1 second (the default is 10 seconds). As shown in figure 4, the impact on SQL Server throughput was minimal during vMotion in both directions. The total throughput remained steady in the range of 75 operations throughout the test period. The vMotion durations from on-premises to cloud, and vice versa were 415 seconds, and 382 seconds, respectively, with the network throughput ranging between 500 to 900 megabits per second (Mbps). The switch-over time was about 0.6 seconds in both vMotions. The few minor dips in throughput shown in the figure were due to the variance in available network bandwidth on the shared AWS Direct Connect link.

Figure 5: Breakdown of SQL Server throughput reported by the on-premises and cloud clients

Figure 5 illustrates the impact of network latency on the throughput. While the total SQL Server throughput remained steady during the entire test period, the throughput reported by both on-premises and cloud clients varied based on their proximity to the SQL Server VM. For instance, throughput reported by the on-premises client drops from 65 operations to 10 operations when the SQL Server VM was onboarded to the cloud and jumps back to 65 operations after the SQL Server VM is migrated back to the on-premises environment.

The throughput variation seen by the two DS3 clients is not unique to our hybrid cloud environment and can be explained by Little’s Law.

Little’s Law

In queueing theory, Little’s Law theorem states that the average number (L) of customers in a stable system is equal to the average arrival rate (λ) multiplied by the average time (W) that a customer spends in the system. Expressed algebraically, the law is: L = λ × W

Figure 6: Little’s Law applicability in hybrid cloud performance testing

Figure 6 shows how Little’s Law can be applied to our hybrid cloud environment to relate the DS3 users, SQL server throughput, SQL Server processing time, and the network latency. The formula derived in figure 6 explains the impact of the network latency on the throughput (orders per second) when the benchmark load (DS3 users) is fixed. It should be noted, however, that although the throughput reported by both the clients varied due to the network latency, the aggregate throughput remained a constant. This is because the throughput decrease seen by one client is offset by the throughput increase seen by the other client.

This illustrates how important it is for you to monitor your application dependencies when you migrate workloads to and from the cloud. For example, if your database VM depends on a Java application server VM, you should consider migrating both VMs together; otherwise, the overall application throughput will suffer due to slow responses and timeouts.

One way to monitor your application dependencies is to use VMware vRealize Network Insight, which can mitigate business risk by mapping application dependencies in both private and hybrid cloud environments.

vMotion Stun During Page Send (SDPS)

We also tested vMotion performance by doubling the intensity of the DS3 workload on both on-premises and cloud clients. Although vMotion succeeded, vmkernel logs indicated vMotion SDPS kicked-in during test scenarios that had a higher benchmark load. SDPS is an advanced feature of vMotion that ensures migration will not fail due to memory copy convergence issues. Whenever vMotion detects that the guest memory dirty rate is higher than the available network bandwidth, it injects microsecond latencies to guest execution to throttle the page dirty rate, so the network transfer can catch up with the dirty rate. So, we recommend you delay the vMotion of a heavily loaded VMs on hybrid cloud environments with shared bandwidth links, which will prevent slowdown in the guest execution.

To learn more about SDPS, see “VMware vSphere vMotion Architecture, Performance, and Best Practices.”

vMotion across multiple availability zones in the SDDC

Every AWS region has multiple availability zones (AZ). Amazon does not provide service level agreements beyond an availability zone. For reasons such as failover support, VMware Cloud on AWS customers can choose an SDDC deployment that spans multiple availability zones in a single AWS region.

There are certain vMotion performance implications with respect to the SDDC deployment configuration.

Figure 7.  vMotion peak network throughput in a single availability zone vs. multiple availability zones

As shown in figure 7, vMotion peak network throughput depends on the host placement in the SDDC.

This is because vMotion uses a single TCP stream in the VMware Cloud environment. If the vMotion source and destination hosts are within the same availability zone, vMotion peak throughput can reach as high as 10 gigabits per second (Gbps), limited only by the CPU core speed. However, if the source and destination hosts are across availability zones, vMotion peak throughput is governed by the AWS rate limiter. The throughput of any single TCP or UDP stream across availability zones is limited to 5 Gbps by the AWS rate limiter.

Conclusion

In summary, our performance test results show the following:

  • vMotion lets you migrate workloads seamlessly across traditional, on-premises data centers and software-defined data centers on AWS Cloud.
  • vMotion offers the same standard performance guarantees across hybrid cloud environment, which includes less than 1 second of vMotion execution switch-over time, and minimal impact on guest performance.

References

New white paper: Big Data performance on VMware Cloud on AWS: Spark machine learning and IoT analytics performance on-premises and in the cloud

By Dave Jaffe

A new white paper is available comparing Spark machine learning performance on an 8-server on-premises cluster vs. a similarly configured VMware Cloud on AWS cluster.

Here is what the VMware Cloud on AWS cluster looked like:

Screenshot of cluster configuration

VMware Cloud on AWS configuration for performance tests

Three standard analytic programs from the Spark machine learning library (MLlib), K-means clustering, Logistic Regression classification, and Random Forest decision trees, were driven using spark-perf. In addition, a new, VMware-developed benchmark, IoT Analytics Benchmark, which models real-time machine learning on Internet-of-Things data streams, was used in the comparison. The benchmark is available from GitHub.

As seen in the charts below, performance was very similar on-premises and on VMware Cloud on AWS.

Spark machine learning performance chart

Spark machine learning performance

IoT Analytics performance chart

IoT Analytics performance

 

All details are in the paper.

Oracle Database Performance with VMware Cloud on AWS

You’ve probably already heard about VMware Cloud on Amazon Web Services (VMC on AWS). It’s the same vSphere platform that has been running business critical applications for years, but now it’s available on Amazon’s cloud infrastructure. Following up on the many tests that we have done with Oracle databases on vSphere, I was able to get some time on a VMC on AWS setup to see how Oracle databases perform in this new environment.

It is important to note that VMC on AWS is vSphere running on bare metal servers in Amazon’s infrastructure. The expectation is that performance will be very similar to “regular” onsite vSphere, with the added advantage that the hardware provisioning, software installation, and configuration is already done and the environment is ready to go when you login. The vCenter interface is the same, except that it references the Amazon instance type for the server.

Our VMC on AWS instance is made up of four ESXi hosts. Each host has two 18-core Intel Xeon E5-2686 v4 (aka Broadwell) processors and 512 GB of RAM. In total, the cluster has 144 cores and 2 TB of RAM, which gives us lots of physical resources to utilize in the cloud.

In our test, the database VMs were running Red Hat Enterprise Linux 7.2 with Oracle 12c. To drive a load against the database VMs, a single 18 vCPU driver VM was running Windows Server 2012 R2, and the DVD Store 3 test workload was also setup on the cluster. A 100 GB test DS3 database was created on each of the Oracle database VMs. During testing, the number of threads driving load against the databases were increased until maximum throughput was achieved, which was around 95% CPU utilization. The total throughput across all database servers for each test is shown below.

 

In this test, the DB VMs were configured with 16 vCPUs and 128 GB of RAM. In the 8 VMs test case, a total of 128 vCPUs were allocated across the 144 cores of the cluster. Additionally the cluster was also running the 18 vCPU driver VM,  vCenter, vSAN, and NSX. This makes the 12 VM test case interesting, where there were 192 vCPUs for the DB VMs, plus 18 vCPUs for the driver. The hyperthreads clearly help out, allowing for performance to continue to scale, even though there are more vCPUs allocated than physical cores.

The performance itself represents scaling very similar to what we have seen with Oracle and other database workloads with vSphere in recent releases. The cluster was able to achieve over 370 thousand orders per minute with good scaling from 1 VM to 12 VMs. We also recently published similar tests with SQL Server on the same VMC on AWS cluster, but with a different workload and more, smaller VMs.

UPDATE (07/30/2018): The whitepaper detailing these results is now available here.

SQL Server Performance of VMware Cloud on AWS

In the past, I’ve always benchmarked performance of SQL Server VMs on vSphere with “on-premises” infrastructure.  Given the skyrocketing interest in the cloud, I was very excited to get my hands on VMware Cloud on AWS – just in time for Amazon’s AWS Summit!

A key question our customers have is: how well do applications (like SQL Server) perform in our cloud?  Well, I’m happy to report that the answer is great!

VMware Cloud on AWS Environment

First, here is a screenshot of what my vSphere-powered Software-Defined Data Center (SDDC) looks like:vSphere Client - VMware Cloud on AWSThis screenshot shows several notable items:

  • The HTML5-based vSphere Client interface should be very familiar to vSphere administrators, making the move to the cloud extremely easy
  • This SDDC instance was auto-provisioned with 4 ESXi hosts and 2TB of memory, all of which were pre-configured with vSAN storage and NSX networking.
    • Each host is configured with two CPUs (Intel Xeon Processor E5-2686 v4); each socket contains 18 cores running at 2.3GHz, resulting in 144 physical cores in the cluster. For more information, see the VMware Cloud on AWS Technical Overview
  • Virtual machines are provisioned within the customer workload resource pool, and vSphere DRS automatically handles balancing the VMs across the compute cluster.

Benchmark Methodology

To measure SQL Server database performance, I used HammerDB, an open-source database load testing and benchmarking tool.  It implements a TPC-C like workload, and reports throughput in TPM (Transactions Per Minute).

To measure how well performance scaled in this cloud, I started with a single 8 vCPU, 32GB RAM VM for the SQL Server database.  To drive the workload, I created a 4 vCPU, 4GB RAM HammerDB driver VM.  I then cloned these VMs to measure 2 database VMs being driven simultaneously:HammerDB and SQL Server VMs in VMware Cloud on AWS

I then doubled the number of VMs again to 4, 8, and finally 16.  As with any benchmark, these VMs were completely driven up to saturation (100% load) – “pedal to the metal”!

Results

So, how did the results look?  Well, here is a graph of each VM count and the resulting database performance:

As you can see, database performance scaled great; when running 16 8-vCPU VMs, VMware Cloud on AWS was able to sustain 6.7 million database TPM!

I’ll be detailing these benchmarks more in an upcoming whitepaper, but wanted to share these results right away.  If you have any questions or feedback, please leave me a comment!

UPDATE (07/25/2018): The whitepaper detailing these results is now available here.

Measuring Cloud Scalability Using the Weathervane Benchmark

Cloud-based deployments continue to be a hot topic in many of today’s corporations.  Often the discussion revolves around workload portability, ease of migration, and service pricing differences.  In an effort to bring performance into the discussion we decided to leverage VMware’s new benchmark, Weathervane.  As a follow-on to Harold Rosenberg’s introductory Weathervane post we decided to showcase some of the flexibility and scalability of our new large-scale benchmark.  Previously, Harold presented some initial scalability data running on three local vSphere 6 hosts.  For this article, we decided to extend this further by demonstrating Weathervane’s ability to run within a non-VMware cloud environment and scaling up the number of app servers.

Weathervane is a new web-application benchmark architected to simulate modern-day web applications.  It consists of a benchmark application and a workload driver.  Combined, they simulate the behavior of everyday users attending a real-time auction.  For more details on Weathervane I encourage you to review the introductory post.

Environment Configuration:
Cloud Environment: Amazon AWS, US West.
Instance Types: M3.XLarge, M3.Large, C3.Large.
Instance Notes: Database instances utilized an additional 300GB io1 tier data disk.
Instance Operating System: Centos 6.5 x64.
Application: Weathervane Internal Build 084.

Testing Methodology:
All instances were run within the same cloud environment to reduce network-induced latencies.  We started with a base configuration consisting of eight instances.  We then  scaled out the number of workload drivers and application servers in an effort to identify how a cloud environment scaled as application workload needs increased.  We used Weathervane’s FindMax functionality which runs a series of tests to determine the maximum number of users the configuration can sustain while still meeting QoS requirements.  It should be noted that the early experimentation allowed us to identify the maximum needs for the other services beyond the workload drivers and application servers to reduce the likelihood of bottlenecks in these services.  Below is a block diagram of the configurations used for the scaled-out Weathervane deployment.

Fig1

Results:
For our analysis of Weathervane cloud scaling we ran multiple iterations for each scale load level and selected the average.  We automated the process to ensure consistency.  Our results show both the number of users sustained as well as the http requests per second as reported by the benchmark harness.

Fig2

As you can see in the above graph, for our cloud environment running Weathervane, scaling the number of applications servers yielded nearly linear scaling up to five application servers. The delta in scaling between the number of users and the http requests per second sustained was less than 1%.  Due to time constraints we were unable to test beyond five application servers but we expect that the scaling would have continued upwards well beyond the load levels presented.

Although just a small sample of what Weathervane and cloud environments can scale to, this brief article highlights both the benchmark and cloud environment scaling.  Though Weathervane hasn’t been released publicly yet, it’s easy to see how this type of controlled, scalable benchmark will assist in performance evaluations of a diverse set of environments.  Look for more Weathervane based cloud performance analysis in the future.