
Monthly Archives: February 2015

VMware pushes the envelope with vSphere 6.0 vMotion

vMotion in VMware vSphere 6.0 delivers breakthrough capabilities that give customers a new level of flexibility and performance in moving virtual machines across their virtual infrastructures. vSphere 6.0 vMotion includes features such as long-distance migration, cross-vCenter migration, and a routed vMotion network, which enable seamless migrations across current management and distance boundaries. For the first time, VMs can be migrated across vCenter Servers separated by cross-continental distances with minimal performance impact. vMotion is fully integrated with all the latest vSphere 6 software-defined data center technologies, including Virtual SAN (VSAN) and Virtual Volumes (VVOL). Additionally, the newly re-architected vMotion in vSphere 6.0 enables extremely fast migrations at speeds exceeding 60 Gigabits per second.

In this blog, we present the latest vSphere 6.0 vMotion features along with performance results. We first evaluate vMotion performance across two geographically dispersed data centers connected by a network with 100ms round-trip time (RTT) latency. Following that, we demonstrate vMotion performance when migrating an extremely memory-hungry “Monster” VM.

Long Distance vMotion

vSphere 6.0 introduces a Long-distance vMotion feature that increases the round-trip latency limit for vMotion networks from 10 milliseconds to 150 milliseconds. Long-distance mobility enables a variety of compelling new use cases, including whole data center upgrades, disaster avoidance, government-mandated disaster preparedness testing, and large-scale distributed resource management. Below, we examine vMotion performance under varying network latencies of up to 100ms.

Test Configuration

We set up a vSphere 6.0 test environment with the following specifications:

Hardware

  • Two HP ProLiant DL580 G7 servers (32-core Intel Xeon E7-8837 @ 2.67 GHz, 256 GB memory)
  • Storage: Two EMC VNX 5500 arrays, FC connectivity, VMFS 5 volume on a 15-disk RAID-5 LUN
  • Networking: Intel 10GbE 82599 NICs
  • Latency Injector: Maxwell-10GbE appliance to inject latency into the vMotion network

Software

  • VM config: 4 VCPUs, 8GB mem, 2 vmdks (30GB system disk, 20GB database disk)
  • Guest OS/Application: Windows Server 2012 / MS SQL Server 2012
  • Benchmark: DVD Store 2 (DS2) using a 12GB database with 12,000,000 customers, driven by 3 drivers without think time (a rough sketch of the driver invocation follows)
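For reference, the driver invocation might look roughly like the Python wrapper below. This is a hedged sketch only: the executable path, target host name, and run length are placeholders, and the parameter names are taken from memory of the publicly distributed DVD Store 2 kit, so treat them as assumptions and verify them against your DS2 version.

```python
# Hedged sketch: path, target, and run length are placeholders; parameter
# names should be checked against the DVD Store 2 documentation.
import subprocess

DRIVER = r"C:\ds2\sqlserverds2\ds2sqlserverdriver.exe"   # hypothetical install path
TARGET = "sqlvm.example.com"                             # hypothetical SQL Server VM

# One driver process with 3 user threads and zero think time, approximating
# the "3 drivers without think time" load described above.
proc = subprocess.Popen([
    DRIVER,
    f"--target={TARGET}",
    "--n_threads=3",
    "--think_time=0",
    "--db_size=12GB",
    "--run_time=60",     # assumed to be minutes in the DS2 driver
])
proc.wait()
```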

Figures 1 and 2. Logical test-bed deployment and latency-injection setup for long-distance vMotion

Figure 1 illustrates the logical deployment of the test-bed used for long-distance vMotion testing. Long-distance vMotion is supported both without any shared storage infrastructure and with shared storage solutions such as EMC VPLEX Geo, which enables shared data access across long distances. Our test-bed did not use shared storage, so the entire state of the VM, including its memory, storage, and CPU/device state, was migrated. As shown in Figure 2, our test configuration deployed a Maxwell-10GbE network appliance to inject latency into the vMotion network.

Measuring vMotion Performance

The following metrics were used to understand the performance implications of vMotion:

  • Migration Time: Total time taken for migration to complete
  • Switch-over Time: Time during which the VM is quiesced to enable the switch-over from the source to the destination host
  • Guest Penalty: Performance impact on the applications running inside the VM during and after the migration (a simple way to compute this from per-second throughput samples is sketched below)
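To make the last metric concrete, the small self-contained Python sketch below (not VMware tooling; the sample numbers are made up) shows one way to derive a guest penalty figure from per-second throughput samples reported by the benchmark driver, given the start and end times of the migration.

```python
# Derive a "guest penalty" figure from per-second throughput samples.
def guest_penalty(samples, mig_start, mig_end):
    """samples: list of (timestamp_sec, ops_per_sec); the migration window
    timestamps come from the vMotion task start/end times."""
    before = [ops for t, ops in samples if t < mig_start]
    during = [ops for t, ops in samples if mig_start <= t <= mig_end]
    baseline = sum(before) / len(before)
    migrating = sum(during) / len(during)
    return 1.0 - migrating / baseline      # fraction of throughput lost

# Made-up example: 5% average throughput loss while the migration was in flight.
samples = [(t, 2000) for t in range(0, 60)] + [(t, 1900) for t in range(60, 100)]
print(round(guest_penalty(samples, 60, 99), 3))
```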

Test Results

We investigated the impact of long-distance vMotion on Microsoft SQL Server online transaction processing (OLTP) performance using the open-source DVD Store workload. The test scenario used a Windows Server 2012 VM configured with 4 vCPUs, 8GB memory, and a SQL Server database size of 12GB. Figure 3 shows the migration time and VM switch-over time when migrating an active SQL Server VM at different network round-trip latencies. In all the test scenarios, we used a load of 3 DS2 users with no think time, which generated a substantial load on the VM. The migration was initiated during the steady-state period of the benchmark, when the CPU utilization (esxtop %USED counter) of the VM was close to 120% and the average read and write IOPS were about 200 and 150, respectively.
Figure 3. Migration time and switch-over time at different network round-trip latencies

Figure 3 shows that the impact of round-trip latency was minimal on both the duration of the migration and the switch-over time, thanks to the latency-aware optimizations in vSphere 6.0 vMotion. The difference in migration time among the test scenarios was within the noise range (<5%). The switch-over time increased only marginally, from about 0.5 seconds in the 5ms test scenario to 0.78 seconds in the 100ms test scenario.

Figure 4. SQL Server throughput (orders per second) before, during, and after vMotion over a 100ms round-trip latency network

Figure 4 plots the performance of the SQL Server virtual machine, in orders processed per second, before, during, and after vMotion on a 100ms round-trip latency network. In our tests, the DVD Store benchmark driver was configured to report performance data at a fine granularity of 1 second (the default is 10 seconds). As shown in the figure, the impact on SQL Server throughput was minimal during vMotion. The only noticeable dip in performance was during the switch-over phase (0.78 seconds) from the source to the destination host. It took less than 5 seconds for SQL Server to return to its normal level of performance.

Faster migration

Why are we interested in extreme performance? Today’s data centers feature modern servers with many processing cores (up to 80), terabytes of memory, and high network bandwidth (10 and 40GbE NICs). VMware supports “monster” virtual machines that can scale up to 128 virtual CPUs and 4TB of RAM. Using this higher network bandwidth to complete migrations of monster VMs faster enables higher levels of mobility in private cloud deployments. Reducing the time to move a virtual machine also reduces the total network and CPU overhead of the migration.

Test Configuration

  • Two Dell PowerEdge R920 servers (60-core Intel Xeon E7-4890 v2 @ 2.80GHz, 1TB memory)
  • Networking: Intel 10GbE 82599 NICs, Mellanox 40GbE MT27520 NIC
  • VM config: 12 VCPUs, 500GB mem
  • Guest OS: Red Hat Enterprise Linux Server 6.3

We configured each vSphere host with four Intel 10GbE ports and a single Mellanox 40GbE port, for a total of 80Gb/s of network connectivity between the two vSphere hosts. Each vSphere host was configured with five vSwitches: four vSwitches each with its own 10GbE uplink port and a fifth vSwitch with the 40GbE uplink port. The MTU of the NICs was set to the default of 1500 bytes. We created one VMkernel adapter on each of the four vSwitches with a 10GbE uplink and four VMkernel adapters on the vSwitch with the 40GbE uplink. All eight VMkernel adapters were configured on the same subnet, and each was enabled for vMotion, which allowed vMotion traffic to use the full 80Gb/s of network connectivity.
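For readers who script this kind of host configuration, the pyVmomi sketch below shows the final step, tagging each VMkernel adapter for vMotion. It is a sketch under stated assumptions rather than the exact procedure we used: the host name and credentials are hypothetical, and the adapters vmk1 through vmk8 are assumed to exist already.

```python
# Minimal pyVmomi sketch: tag existing VMkernel adapters for vMotion.
# Assumptions: host name, credentials, and vmk1..vmk8 already created.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()          # lab setup only
si = SmartConnect(host="esx01.example.com", user="root",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

# Locate the host object (hypothetical host name).
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == "esx01.example.com")

# Enable the vMotion service on every VMkernel adapter so vMotion traffic
# can spread across all uplinks.
vnic_mgr = host.configManager.virtualNicManager
for vmk in [f"vmk{i}" for i in range(1, 9)]:
    vnic_mgr.SelectVnicForNicType("vmotion", vmk)

Disconnect(si)
```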

Methodology

To demonstrate the extreme vMotion throughput performance, we simulated a very heavy memory usage footprint in the virtual machine. The memory-intensive program allocated 300GB memory inside the guest and touched a random byte in each memory page in an infinite loop. We migrated this virtual machine between the two vSphere hosts under different test scenarios: vMotion over 10Gb/s network, vMotion over 20Gb/s network, vMotion over 40Gb/s network and vMotion over 80Gb/s network. We used esxtop to monitor network throughput and CPU utilization on the source and destination hosts.
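The memory-touch tool itself was an in-house program, but its behavior as described above can be approximated in a few lines of Python. The sketch below is a minimal stand-in under that assumption; scale TOTAL_BYTES down if you want to try it on a smaller machine.

```python
# Approximate the memory-intensive workload: allocate a large buffer and
# dirty one random byte in every page, forever.
import mmap
import random

PAGE = 4096
TOTAL_BYTES = 300 * 1024**3   # 300 GB as in the test; reduce to experiment

buf = mmap.mmap(-1, TOTAL_BYTES)          # anonymous memory mapping
npages = TOTAL_BYTES // PAGE

# Walk every page and modify one random byte in it, in an infinite loop.
# This keeps the whole working set hot, forcing vMotion to re-send pages
# that were dirtied during each pre-copy iteration.
while True:
    for p in range(npages):
        buf[p * PAGE + random.randrange(PAGE)] = 0xFF
```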

Test Results

Figure 5. Peak vMotion network bandwidth in vSphere 5.5 and vSphere 6.0 across different network deployment scenarios

Figure 5 compares the peak network bandwidth observed in vSphere 5.5 and vSphere 6.0 under different network deployment scenarios. Let us first consider vSphere 5.5 vMotion throughput. Figure 5 shows that vSphere 5.5 vMotion reaches line rate in both the 10Gb/s and 20Gb/s test scenarios. When we increased the available vMotion network bandwidth beyond 20Gb/s, peak vMotion usage in vSphere 5.5 was limited to 18Gb/s. This is because in vSphere 5.5, each vMotion operation is assigned two helper threads by default, and these threads do the bulk of the vMotion processing. Since the helper threads are CPU-saturated, adding network bandwidth yields no performance gain. When we increased the number of vMotion helper threads from 2 to 4 in the 40Gb/s test scenario, thereby removing the CPU bottleneck, the peak network bandwidth usage of vMotion in vSphere 5.5 increased to 32Gb/s. Increasing the number of helper threads beyond four hurt vMotion performance in the 80Gb/s test scenario, because vSphere 5.5 vMotion suffers contention on VM-specific locks that protect the VM’s memory, which limits the gains from additional helper threads.

The newly re-architected vMotion in vSphere 6.0 not only removes these lock contention issues but also obviates the need for any tuning. During the initial setup phase, vMotion dynamically creates the appropriate number of TCP/IP stream channels between the source and destination hosts based on the configured network ports and their bandwidth. It then instantiates one vMotion helper thread per stream channel, removing the need for manual tuning. Figure 5 shows that vSphere 6.0 vMotion reaches line rate in the 10Gb/s, 20Gb/s, and 40Gb/s scenarios, and achieves a little over 64Gb/s of network throughput in the 80Gb/s scenario, which is more than a 3.5x improvement over vSphere 5.5.

Figure 6. vMotion network throughput and host CPU utilization in the vSphere 6.0 80Gb/s test scenario
Figure 6 shows the network throughput and CPU utilization data for the vSphere 6.0 80Gb/s test scenario. During vMotion, the memory of the VM is copied from the source host to the destination host in an iterative fashion. In the first iteration, vMotion bandwidth usage is throttled by memory allocation on the destination host; the peak vMotion network bandwidth usage is about 28Gb/s during this phase. Subsequent iterations copy only the memory pages that were modified during the previous iteration. The number of pages transferred in these iterations is determined by how actively the guest accesses and modifies its memory pages. The more modified pages there are, the longer it takes to transfer them to the destination host; on the flip side, this is what allows vMotion’s advanced performance optimizations to kick in and fully leverage the additional network and compute resources. That is evident in the third pre-copy iteration, when the peak measured bandwidth was about 64Gb/s and the peak CPU utilization (esxtop ‘PCPU Util%’ counter) on the destination host was about 40%.
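To make the iterative pre-copy behavior easier to follow, here is a toy Python model of the convergence logic described above. It is not VMware code, and the dirty rate, link rate, and switch-over budget are invented numbers chosen only to show how the dirty set shrinks from one iteration to the next.

```python
# Toy model of iterative pre-copy: each pass re-sends only the pages the
# guest dirtied during the previous pass; the VM is switched over once the
# remaining dirty set fits within the switch-over budget.
PAGE = 4096

def precopy(total_pages, dirty_rate_pps, link_pps, switchover_budget_pages):
    """dirty_rate_pps: pages the guest dirties per second;
    link_pps: pages the vMotion network can transfer per second."""
    to_send = total_pages
    iteration = 0
    while to_send > switchover_budget_pages:
        iteration += 1
        seconds = to_send / link_pps                 # time to copy this pass
        dirtied = min(total_pages, int(seconds * dirty_rate_pps))
        print(f"iter {iteration}: sent {to_send} pages in {seconds:.1f}s, "
              f"{dirtied} pages re-dirtied")
        to_send = dirtied                            # next pass re-sends these
    print(f"switch-over: {to_send} pages left to copy while the VM is quiesced")

# 300 GB worth of pages over a link of ~2M pages/s, with an invented dirty rate.
precopy(total_pages=300 * 2**30 // PAGE,
        dirty_rate_pps=500_000, link_pps=2_000_000,
        switchover_budget_pages=50_000)
```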

Conclusions

The main results of this performance study are the following:

  • The dramatic 10x increase in round-trip time support offered by long-distance vMotion now makes it possible to migrate workloads non-disruptively over long distances, such as between New York and London
  • The re-architected vMotion in vSphere 6.0 delivers remarkable improvements in Monster VM migration performance (up to 3.5x over vSphere 5.5), enabling high levels of mobility in large-scale private cloud deployments

Virtualized Hadoop Performance with vSphere 6

A recently published whitepaper shows that not only can vSphere 6 keep up with newer high-performance servers, it thrives on their capabilities.

Two years ago, Hadoop benchmarks were run with vSphere 5.1 on a cluster of 32 dual-socket, quad-core servers. Very good performance was demonstrated, with the optimal virtualized configuration shown to be actually 2% faster than native for TeraSort (see the previous whitepaper).

These benchmarks were recently run on a cluster of the same size, but with ten-core processors, more disks and memory, dual 10GbE networking, and vSphere 6. The maximum dataset size was almost quadrupled to 30TB, to ensure that it is much bigger than the total memory in the cluster (hence qualifying the test as Big Data, by one definition).

The results, summarized in the chart below, show that the optimal virtualized configuration now delivers 12% better performance than native for TeraSort. The primary reason for this excellent performance is the ability of vSphere to map physical hardware resources to virtual hardware that is optimized for scale-out applications. The observed trend, as well as theory based on processor characteristics, indicates that the importance of being able to do this mapping correctly increases as processors become more powerful. The sub-optimal performance of one of the tests is due to the combination of very small VMs and how Hadoop does replication during data creation. On the other hand, small VMs are very advantageous for read-dominated applications, which are typically more common. Taken together with other best practices discussed in the paper, this information can be used to configure Hadoop clusters for the highest levels of performance. Despite all the hardware and software changes over the past two years, the optimal configuration was still found to be four VMs per dual-socket host.

Chart: Elapsed time ratio of virtualized configurations to native

Please take a look at the whitepaper for more details on how these benchmarks were run and for analyses of why certain virtual configurations perform so well.

Scaling Out Redis Performance with Docker on vSphere 6.0

by Davide Bergamasco

In an earlier VROOM! post we discussed, among other things, the performance of the Redis in-memory key-value store in a Docker/vSphere environment. In that post we focused on a single instance of a Redis server subject to a more or less artificial workload with the goal of assessing the absolute performance of said instance under various deployment scenarios.

In this post we are taking a different point of view, which is maximizing the throughput of multiple Redis instances running on a “large” server under a more realistic workload. Why are we interested in this perspective?  Conceptually, Redis is an extremely simple application, being just a thin layer of code implementing a large hash table on top of system calls.  From the implementation standpoint, a single-threaded event loop services requests from the clients in a polling fashion. The problem with this design is that it is not suitable for “scaling up”; that is, improving performance by using multiple cores. Modern servers have many processing cores (up to 80) and possibly terabytes of memory.  However, Redis can only access that memory at the speed of a single core.

This problem can be solved by “scaling out” Redis; that is, by partitioning the server memory across multiple Redis instances and running each of those on a different core.  This can be achieved by using a set of load balancers to fragment the key space and distribute the load among the various instances. The diagram shown in Figure 1 illustrates this concept.

Figure 1. Redis Scale Out Setup

Host H3 runs the various Redis server instances (red boxes), while Host H2 runs two sets of load balancers:

  • The green boxes are the Redis load balancers, which partition the key space using a consistent hashing algorithm.  We leveraged the Twemproxy OSS project to implement the Redis load balancers.
  • The yellow boxes are TCP load balancers, which distribute the load across the Redis load balancers in a round robin fashion. We used the HAProxy OSS project to implement the TCP load balancers.

Finally, Host H1 runs the load generators (dark blue boxes); that is, the standard benchmark redis-benchmark.
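To make the key-space partitioning concrete, here is a minimal consistent-hashing sketch in Python. It is only a conceptual illustration of what the Redis load balancers do; the actual tests used Twemproxy, and the instance addresses below are hypothetical.

```python
# Conceptual consistent-hashing ring: map each key to one of the Redis
# instances so the key space is spread roughly evenly across them.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=160):
        # Place several virtual points per node on the ring for smoother balance.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

# Eight Redis instances on host H3, one per port (hypothetical addresses).
ring = ConsistentHashRing([f"h3.example.com:{6379 + i}" for i in range(8)])
print(ring.node_for("user:42"), ring.node_for("cart:1337"))
```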

Deployment Scenarios

We assessed the performance of this design across a set of deployment scenarios analogous to what we considered in the previous post. These are listed below and illustrated in Figure 2:

  • Native: Redis instances are run as 8 separate processes on the Linux OS running directly on Host H3 hardware.
  • VM: Redis instances are run inside eight 2-vCPU VMs on a pre-release build of vSphere 6.0.0 running on Host H3 hardware; the guest OS is the same as in the Native scenario.
  • Native-Docker: Redis instances are run inside 8 Docker containers running on the Native OS.
  • VM-Docker: Redis instances are run inside Docker containers each running inside the same VMs as the VM scenario, with one container per VM.

Figure 2. Different deployment scenarios

Hardware/Software/Workload Configuration

The following are the details about the hardware, software, and workload used in the various experiments discussed in the next section:

Hosts:

  • HP ProLiant DL380e Gen8
  • CPU: 2 x Intel® Xeon® CPU E5-2470 0 @ 2.30GHz (16 cores, 32 hyper-threads total)
  • Memory: 96GB
  • Hardware configuration: Hyper-Threading ON, turbo-boost OFF, power policy: Static High (no power management)
  • Network: 10GbE
  • Storage: 8 x 500GB 15,000 RPM 6Gb SAS disks, HP H220 host bus adapter

Linux OS:

  • CentOS 7
  • Kernel 3.18.1 (CentOS 7 comes with 3.10.0, but we wanted to use the latest kernel available at the time of this writing)
  • Docker 1.2

ESXi:

  • VMware vSphere 6.0.0 (pre-release build)

VM:

  • 8 x 2-vCPU, 11GB (VM scenario)
  • Virtual NIC: vmxnet3
  • Virtual HBA: LSI-SAS

Application:

  • Redis 2.8.13
  • AOF persistence with the “everysec” flush policy (every operation that mutates a key is logged into an Append Only File in order to enable data recovery after a crash; the buffer cache is flushed every second, so with this durability policy at most one second’s worth of data can be lost)

Workload:

  • Keyspace: 250 million keys, value size 1 byte (this size has been chosen to prevent network or storage from becoming bottlenecks from the bandwidth perspective)
  • 8 redis-benchmark instances each simulating 100 clients with a pipeline depth of 30 requests
  • Operations mix: 75% GET, 25% SET (a rough Python equivalent of one load-generator instance is sketched below)
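The sketch below is a rough redis-py approximation of one load-generator instance; the tests themselves used the stock redis-benchmark binary. It issues pipelined batches of 30 requests with a 75/25 GET/SET mix over the 250-million-key space. The host, port, and batch count are placeholders, and it uses a single connection rather than 100 simulated clients for brevity.

```python
# Rough stand-in for one redis-benchmark load-generator instance.
import random
import redis

KEYSPACE = 250_000_000
PIPELINE_DEPTH = 30

def run_client(host="h2.example.com", port=6379, batches=10_000):
    r = redis.Redis(host=host, port=port)
    for _ in range(batches):
        pipe = r.pipeline(transaction=False)       # plain pipelining, no MULTI
        for _ in range(PIPELINE_DEPTH):
            key = f"key:{random.randrange(KEYSPACE)}"
            if random.random() < 0.75:             # 75% GET, 25% SET
                pipe.get(key)
            else:
                pipe.set(key, b"x")                # 1-byte values, as in the test
        pipe.execute()

if __name__ == "__main__":
    run_client()
```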

Results

We ran two sets of experiments for every scenario listed in the “Deployment Scenarios” section. The first set was meant to establish a baseline by having a single redis-benchmark instance generate requests directly against a single redis-server instance. The second set aimed at assessing the overall performance of the Redis scale-out system presented earlier. The results of these two sets of experiments are shown in Figure 3, where each bar represents the throughput in operations per second averaged over five trials, and error bars indicate the range of the measured values.

Figure 3. Results of the performance experiments (Y-axis represents throughput in 1,000 operations per second)

Nothing really surprising emerges from the results of the baseline experiments (labeled “1 Server – 1 Client” in Figure 3). The Native scenario is obviously the fastest in terms of operations per second, followed by the Docker, VM, and Docker-VM scenarios. This is expected, as both virtualization and containerization add some overhead on top of bare-metal performance.

Looking at the scale-out experiments (labeled “Scale-Out” in Figure 3), we see a surprisingly different picture. The VM scenario is now the fastest, followed by Docker-VM, while the Native and Docker scenarios come in as a somewhat distant third and fourth. This unexpected result can be explained by looking at the Host H3 CPU activity during an experiment run. In the Native and Docker scenarios, the CPU load is spread over all 16 cores. This means that even though only 8 threads are active (the 8 redis-server instances), the Linux scheduler is continuously migrating them. This can result in a large number of cross-NUMA node memory accesses, which are substantially more expensive than same-NUMA node accesses. In addition, irqbalance spreads the network card interrupts across all 16 cores, further contributing to this phenomenon.

In the VM and Docker-VM scenarios, this does not occur because the ESXi scheduler goes to great lengths to keep both the memory and vCPUs of a VM on one NUMA node.  Also, with the PVSCSI virtual device, the virtual interrupts are always routed to the same vCPU(s) that initiated an I/O, and this minimizes interrupt migrations.

We tried to eliminate the cross-NUMA node memory activity in the Native scenario by pinning all the redis-server processes to the cores of the same CPU; that is, to the same NUMA node. We also disabled irqbalance and manually pinned the interrupt vectors to the same set of cores. As expected, with this ad-hoc configuration, the Native scenario was the fastest, reaching 3.408 million operations per second. Without any pinning, the VM result is only 4% slower than the optimized Native performance. (Notice that introducing artificial affinity between processes/interrupt vectors and cores is not a recommended practice as it is error-prone and can, in general, lead to unexpected or suboptimal results.)
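For reference, the sketch below shows what such pinning can look like from Python. It relies on os.sched_setaffinity, which is Linux-only; the core list for NUMA node 0 is an assumption about this host's topology, and the PIDs are placeholders. As noted above, this kind of manual affinity is error-prone and is not a recommended practice; it is shown only to document the ad-hoc experiment.

```python
# Hedged sketch of the ad-hoc pinning experiment: confine each redis-server
# process to the cores of a single NUMA node.
import os

# Assumed core/hyper-thread layout for NUMA node 0 on this host (verify with
# `lscpu` or /sys/devices/system/node before using).
NODE0_CPUS = set(range(0, 8)) | set(range(16, 24))

def pin_to_node0(pids):
    for pid in pids:
        os.sched_setaffinity(pid, NODE0_CPUS)

# PIDs of the eight redis-server processes would be discovered at run time
# (e.g. from /proc); placeholders shown here.
pin_to_node0([1234, 1235, 1236, 1237, 1238, 1239, 1240, 1241])
```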

Our initial experiments were conducted with the CentOS 7 stock kernel (3.10.0), which unfortunately is not particularly recent. We thought it was prudent to verify if the Linux scheduler had been improved to avoid the inter-NUMA node thread migrations in more recent kernel versions.  Hence, we re-ran all the experiments with the latest version (at the time of this writing, 3.18.1), but we didn’t notice any significant difference with respect to version 3.10.0.

We thought it would be interesting to look at the performance numbers in terms of speedup; that is, the ratio between the throughput of the scale-out system and the throughput of the baseline 1 Server – 1 Client setup. Figure 4 below shows the speedup for the four scenarios considered in this study.

Figure 4: Speedup (Y-axis represents speedup with respect to baseline)

The speedup essentially tells, in relative terms, how much performance improved by deploying 8 Redis instances on the same host instead of a single one. If the system scaled linearly, it would achieve a maximum theoretical speedup of 8. In practice, this limit could not be reached because of the extra overhead introduced by the load balancers and possible resource contention across the Redis instances running on host H3 (this host runs close to saturation, with overall CPU utilization consistently between 75% and 85% during the experiments). In any case, the scale-out system delivers a performance boost of at least 4x compared to running a single Redis instance with exactly the same memory capacity. The VM and Docker-VM scenarios achieve a substantially larger speedup because of the cross-NUMA memory access issue afflicting the Native and Docker scenarios.

Conclusions

The main results of this study are the following:

  1. VMs and Docker containers are truly better together. The Redis scale-out system, using out-of-the-box configuration settings, clearly achieves better performance in the Docker-VM scenario than in the Native or Docker scenarios. Even though its performance is not as high as in the VM scenario, the Docker-VM setup offers the same ease of use and deployment typical of the Docker scenario, with substantially higher performance.
  2. Using VMs and Docker, we managed to scale out a Redis deployment and extracted a great deal of extra performance (up to 5.6x more) from a large server that would have otherwise been underutilized.