
Tutorial Session on Performance Debugging on VMware vSphere

Ever wondered what it takes to debug performance issues on a VMware stack? How do you figure out if the performance issue is in your virtual machine, or the network layer, or the storage layer, or the hypervisor layer?

Here’s a handy tutorial that showcases a systematic approach for troubleshooting performance using tools like esxtop, vscsiStats, and net-stats on a VMware stack. The tutorial also covers some very useful optimizations and performance best practices.

Thanks to Ramprasad K. S. for putting together the slides based on his vast experience dealing with customer issues. Thanks also to Ramprasad and Sai Inabattini for presenting this at the CMG India 2nd Annual Conference in Bangalore in November 2015, where it was very well received.

Fault Tolerance Performance in vSphere 6

VMware has published a technical white paper about vSphere 6 Fault Tolerance architecture and performance. The paper describes which types of applications work best in virtual machines with vSphere FT enabled.

VMware vSphere Fault Tolerance (FT) provides continuous availability to virtual machines that require a high amount of uptime. If the primary virtual machine fails, the secondary virtual machine is ready to take over immediately. vSphere achieves FT by maintaining a primary and a secondary virtual machine using a new technology named Fast Checkpointing. Similar to Storage vMotion, Fast Checkpointing copies the virtual machine state (storage, memory, and networking) to the secondary ESXi host and keeps the primary and secondary virtual machines in sync.

vSphere FT works with (and requires) vSphere HA. When an administrator enables FT, vSphere HA selects where the secondary VM runs (admins can vMotion the VM to another server if needed). vSphere HA also creates a new secondary if the primary fails: the original secondary becomes the new primary, and vSphere HA starts a new secondary on an available host.

vSphere 6 FT supports virtual machines with up to 4 vCPUs and 64GB of memory. The performance study shows results for various workloads run on virtual machines with 1, 2, and 4 vCPUs.

The workloads—which tax the virtual machine’s CPU, disk, and network—include:

  • Kernel compile – loads the CPU at 100%
  • Netperf – measures network throughput and latency
  • Iometer – characterizes the storage I/O of a Microsoft Windows virtual machine
  • Swingbench – drives an OLTP load on a virtual machine running Oracle 11g
  • DVD Store – drives an OLTP load on a virtual machine running Microsoft SQL Server 2012
  • A brokerage workload – simulates an OLTP load of a brokerage firm
  • vCenter Server workload – simulates actions performed in vCenter Server

Testing shows that vSphere FT can successfully protect a range of workloads, including CPU-bound workloads, I/O-bound workloads, servers, and complex database workloads; however, admins should not use vSphere FT to protect highly latency-sensitive applications like voice-over-IP (VoIP) or high-frequency trading (HFT).

For the results of these tests, read the paper. Also useful is the VMware Fault Tolerance FAQ.

Virtualizing Performance-Critical Database Applications in VMware vSphere 6.0

by Priti Mishra

Performance studies have previously shown, beyond doubt, that virtualized servers can run a variety of applications at levels of performance near, or in some cases even above, those of software running natively (on bare metal). In a new white paper, we raise the bar higher and test “monster” vSphere virtual machines configured with large numbers of vCPUs and running the most taxing database and transaction processing applications.

The benchmark workload, which we call Order-Entry, is based on an industry-standard online transaction processing (OLTP) benchmark called TPC-C. Both rigorous and demanding, the Order-Entry workload pushes virtual machine performance.

Note: The Order Entry benchmark is derived from the TPC-C workload, but is not compliant with the TPC-C specification, and its results are not comparable to TPC-C results.

The white paper quantifies the:

  • Performance differential between ESXi 6.0 and native
  • Performance differential between ESXi 6.0 and ESXi 5.1
  • Performance gains due to enhancements built into ESXi 6.0

Results from these experiments show that even the most demanding applications can be run, with excellent performance, in a virtualized environment with ESXi 6.0. For example, our test results show that ESXi 6.0 virtual machines run out of the box at 90% of the performance of native systems. In addition, a 64-vCPU, 475GB VM processes 59.5K DBMS transactions per second while issuing 155K IOPS, capabilities well above even high-end Oracle database installations. Even for applications that may require 64 or 128 vCPUs, the high-end performance boost of ESXi 6.0 over ESXi 5.1 makes ESXi 6.0 the best platform for virtualizing databases such as Oracle.

ESXi 6.0 Performance Relative to Native

With a 64-vCPU VM running on a 72-pCPU ESXi host, throughput was 90% of native throughput on the same hardware platform. Statistics that give an indication of the load placed on the system in the native and virtual machine configurations are summarized in Table 1.

Metric                                         Native                         VM
Throughput in transactions per second          66.5K                          59.5K
Average CPU utilization of 72 logical CPUs     84.7%                          85.1%
Disk IOPS                                      173K                           155K
Disk megabytes per second                      929MB/s                        831MB/s
Network packets per second                     71K/s receive, 71K/s send      63K/s receive, 64K/s send
Network megabytes per second                   15MB/s receive, 36MB/s send    13MB/s receive, 32MB/s send

Table 1. Comparison of Native and Virtual Machine Benchmark Load Profiles
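
As a quick cross-check of the 90% figure quoted above, the virtual-to-native ratios implied by Table 1 can be computed directly. The short Python snippet below simply reuses the values from the table:

    # Virtual-to-native ratios for the load profile metrics in Table 1
    table1 = {
        "Throughput (transactions/s)": (66_500, 59_500),   # (native, VM)
        "Disk IOPS":                   (173_000, 155_000),
        "Disk MB/s":                   (929, 831),
    }

    for metric, (native, vm) in table1.items():
        print(f"{metric}: VM runs at {vm / native:.1%} of native")

    # Throughput comes out at roughly 89.5% of native, consistent with the
    # "90% of native" figure quoted in the text.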

 

The corresponding guest statistics in Table 2 provide another perspective on the resource-intensive nature of the workload. These common Linux performance metrics show that while the benchmark workload was heavy in terms of raw CPU demands, it also placed a heavy load on the operating system, interrupt handling, and the storage subsystem, areas that have traditionally been associated with high virtualization overheads.

 

Metric                          Amount
Interrupts per second           327K
Disk IOPS                       155K
Context switches per second     287K
Load average                    231

Table 2. Guest OS Statistics

ESXi 6.0 Performance Relative to ESXi 5.1

Experimental data comparing ESXi 6.0 with ESXi 5.1 (see Figures 1 and 2) show that high-end scale-up with ESXi 6.0 mirrors that of native systems.

fig1-dbapps-perf

Figure 1. Absolute throughput values

With ESXi 5.1, the Order-Entry benchmark throughput of a 64-vCPU VM on a 4-socket, 32-core/64-thread E7-4870 (Westmere) server was 70% of the throughput of the same server in native mode when both servers were running at 77% CPU utilization (the native server reached a maximum CPU utilization of 88% and throughput of 54.8 transactions per second).

fig2-dbapps-perf

Figure 2. Relative throughput ratios

vSphere has the capability to handle loads far larger than those demanded by most Oracle database applications in production. Support for monster VMs with up to 128 vCPUs, throughput that is 90% of native, and a significant performance boost over ESXi 5.1 make ESXi 6.0 an excellent platform for virtualizing very high-end Oracle databases.

For details regarding experiments and the performance enhancements in vSphere, please read the paper.

VMware vCloud Air Database Performance Scalability with SQL Server

Previous posts have shown vSphere can easily handle running Microsoft SQL Server on four-socket servers with large numbers of cores—with vSphere 5.5 on Westmere-EX and more recently with vSphere 6 on Ivy Bridge-EX.  We recently ran similar tests on vCloud Air to measure how these enterprise databases with mission critical performance requirements perform in a cloud environment. The tests show that SQL Server databases scale very well on vCloud Air with a variety of virtual machine (VM) counts and virtual CPU (vCPU) sizes.

The benchmark tests were run with vCloud Air using their Virtual Private Cloud (VPC) subscription-based service.  This is a very compelling hybrid cloud service that allows an on-premises vSphere infrastructure to be expanded into the public cloud in a secure and scalable way.  The underlying host hardware consisted of two 8-core CPUs for a total of 16 physical cores, which meant that the maximum number of vCPUs was 16 (although additional logical processors were available via Hyper-Threading, they were not used).

Windows Server 2012 R2 was the guest OS, and SQL Server 2012 Standard edition was the database engine used for all the VMs.  All databases were placed on an SSD Accelerated storage tier for maximum disk I/O performance.  The test configurations are summarized below:

Table: # VMs, # vCPUs, and memory configurations tested

DVD Store 2.1 (an open-source OLTP database stress tool) was the workload used to stress the VMs.  The first experiment was to scale up the number of 4 vCPU VMs.  The graph below shows that as the number of VMs is increased from 1 to 4, the aggregate performance (measured in orders per minute, or OPM) increases correspondingly:
vCA_SQL_4vCPU
When the size of each VM was doubled from 4 to 8 virtual CPUs, the OPM also approximately doubled for the same number of VMs, as shown in the chart below.

vCA_SQL_8vCPU

This final chart includes a test run with one large 16 vCPU VM.  As expected, the 16 vCPU performance was similar to the four 4-vCPU VM and two 8-vCPU VM test cases.  The slight drop can be attributed to spanning multiple physical processors and thus multiple NUMA nodes within a single VM.

vCA_SQL_16vCPU

In summary, SQL Server was found to perform and scale extremely well running on vCloud Air with 4, 8, and 16 vCPU VMs.  In the future, look for more benchmarks in the cloud as it continues to evolve!

For more information on vCloud Air, check out the third-party studies from Principled Technologies that compare it to competitive offerings, namely Microsoft Azure and Amazon Web Services (AWS).

Scaling Performance for VAIO in vSphere 6.0 U1

by Chien-Chia Chen

vSphere APIs for I/O Filtering (VAIO) is a framework that enables third-party software developers to implement data services, such as caching and replication, for vSphere. Figure 1 below shows the general architecture of VAIO. Once I/O filter libraries are installed on a virtual disk (VMDK), every I/O request generated from the guest operating system to the VMDK will first be intercepted by the VAIO framework at the file device layer. The VAIO framework then hands over the I/O request to the user space I/O filter libraries, where a series of third-party data service operations can be performed against the I/O. After processing the I/O, the user space I/O filter libraries return the I/O to the VAIO framework, which continues the rest of the issuing path. Similarly, upon completion, the I/O will first be processed by the user space I/O filter libraries before continuing its original completion path.

There have been questions around the overhead of the VAIO framework due to its extra user-to-kernel communication. In this blog post, we evaluate the performance of vSphere APIs for I/O Filtering using a null I/O filter and demonstrate how VAIO scales with respect to the number of virtual machines and outstanding I/Os (OIOs). The null I/O filter accepts each I/O request and immediately returns it.

fig1-iofilt-arch

Figure 1. vSphere APIs for I/O Filtering Architecture

System Configuration

The configuration of our systems is as follows:

  • One ESXi host
    • Machine: Dell R720 server running vSphere 6.0 Update 1
    • CPU: 16-core, 2-socket (32 hyper-threads) Intel® Xeon® E5-2665 @ 2.4 GHz
    • Memory: 128GB memory
    • Physical Disk: One Intel® S3700 400GB SATA SSD on LSI MegaRAID SAS controller
    • VM: Up to 32 linked-clone I/O Analyzer 1.6.2 VMs (SUSE Linux Enterprise 11 SP2; 1 virtual CPU (VCPU) and 1GB memory each). Each virtual machine has 1 PVSCSI controller hosting two 1GB VMDKs—one with no I/O filter and the other with the null filter, both thin-provisioned.
  • Workload: Iometer 4K sequential read (4K-aligned) with various numbers of OIOs

Methodology

We conduct two sets of tests separately—one against VMDK without an I/O filter (referred to as “default”) and another against the null-filter VMDK (referred to as “iofilter”). In each set of tests, every virtual machine has one Iometer disk worker to generate 4K sequential read I/Os to the VMDK under test. We have a 2-minute warm-up time and measure I/Os per second (IOPS), normalized CPU cost, and read latency over the next 2-minute test duration. The latency is the median of the average read latencies reported by all Iometer workers.
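
To make the two derived metrics concrete, here is a minimal Python sketch of how they can be computed. The function names and the example numbers are ours, not part of the actual test harness:

    from statistics import median

    def cpu_cost_per_1k_iops(core_pct_used: float, iops: float) -> float:
        """Normalized CPU cost: percent of a physical core consumed per 1,000 IOPS."""
        return core_pct_used / (iops / 1000.0)

    def reported_read_latency_ms(per_worker_avg_latency_ms: list[float]) -> float:
        """Reported latency: the median of the average read latencies of all Iometer workers."""
        return median(per_worker_avg_latency_ms)

    # Hypothetical numbers: a host spending 350% of a core to drive 70,000 IOPS
    # costs 5% of a core per 1K IOPS.
    print(cpu_cost_per_1k_iops(350.0, 70_000))
    print(reported_read_latency_ms([1.8, 1.9, 2.1, 2.0]))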

Note that I/O sizes and access patterns do not affect the performance of VAIO since it does no additional data copying, maintains the original access patterns, and incurs no extra access to physical disks.

Results

VM Scaling

Figures 2 and 3 below show the IOPS, CPU cost per 1K IOPS, and latency for different numbers of virtual machines at 128 OIOs. Except for the single virtual machine test, results show that VAIO achieves similar IOPS and similar latency compared to the default VMDK. However, VAIO introduces 10%-20% higher CPU overhead per 1K IOPS. The single virtual machine IOPS with iofilter is 80% higher than with the default VMDK. This is because, in the default case, the VCPU performs the majority of the synchronous I/O work, whereas in the iofilter case, VAIO contexts take over a big portion of the work and unblock the VCPU to generate more I/Os. With additional VCPUs and Iometer disk workers to mitigate the single-core bottleneck, the default VMDK is also able to drive over 70K IOPS.

fig2a-iofilt
Figure 2. IOPS and CPU Cost vs. Number of VMs (128 Outstanding I/Os)

fig3-iofilt

Figure 3. Iometer Read Latency vs. Number of VMs (128 Outstanding I/Os)

 

OIO Scaling

Figures 4 and 5 below show the IOPS, CPU cost per 1K IOPS, and latency for different numbers of OIOs with 16 virtual machines. A similar trend holds: VAIO achieves the same IOPS and the same latency as the default VMDK while incurring 10%-20% higher CPU overhead per 1K IOPS.

fig4-iofilt

Figure 4. Percent of a Core per 1 Thousand IOPS vs. Outstanding I/Os (16 VMs)

fig5-iofilt

Figure 5. Iometer Read Latency vs. Outstanding I/Os (16 VMs)

Conclusion

Based on our evaluation, VAIO achieves comparable throughput and latency performance at a cost of 10%-20% more CPU cycles. From our experience, when using the VAIO framework, we recommend the following general best practices:

  • Reduce CPU over-commitment. The VAIO framework introduces at least one additional context per VMDK with an active filter. Over-committing CPU can result in intense CPU contention and thus much worse virtualization efficiency.
  • Avoid blocking when developing I/O filter libraries. Keep in mind that an I/O will be blocked until the user space I/O filter finishes processing. Thus additional processing time will result in higher end-to-end latency.
  • Increase concurrency wisely when developing I/O filter libraries. The user space I/O filter can potentially serve I/Os from all VMDKs. Thus, when developing I/O filter libraries, it is important to tune concurrency so that a single-core CPU bottleneck is avoided without introducing so many active contexts that they cause higher CPU contention.

 

Dynamic Host-Wide Performance Tuning in VMware vSphere 6.0

by Chien-Chia Chen

Introduction

The networking stack of vSphere is, by default, tuned to balance the tradeoffs between CPU cost and latency to provide good performance across a wide variety of applications. However, there are some cases where using a tunable provides better performance. One example is Web-farm workloads, or any circumstance where a high consolidation ratio (lots of VMs on a single ESXi host) is preferred over extremely low end-to-end latency. VMware vSphere 6.0 introduces the Dynamic Host-Wide Performance Tuning feature (also known as dense mode), which provides a single configuration option to dynamically optimize individual ESXi hosts for high consolidation scenarios under certain use cases. Later in this blog, we define those use cases. Right now, we take a look at how dense mode works from an internal viewpoint.

Mitigating Virtualization Inefficiency under High Consolidation Scenarios

Figure 1 shows an example of the thread contexts within a high consolidation environment. In addition to the Virtual CPUs (each labeled VCPU) of the VMs, there are per-VM vmkernel threads (device-emulation threads, labeled “Dev Emu” in the figure) and multiple vmkernel threads for each Physical NIC (PNIC) executing physical device virtualization code and virtual switching code. One major source of virtualization inefficiency is the frequent context switches among all these threads. While context switches occur for a variety of reasons, the predominant networking-related reason is Virtual NIC (VNIC) Interrupt Coalescing, namely, how frequently the vmkernel interrupts the guest for new receive packets (or vice versa for transmit packets). More frequent interruptions are likely to result in lower per-packet latency while increasing virtualization overhead. At very high consolidation ratios, the overhead from increased interrupts hurts performance.

Dense mode uses two techniques to reduce the number of context switches:

  • The VNIC coalescing scheme will be changed to a less aggressive scheme called static coalescing.
    With static coalescing, a fixed number of requests is delivered in each batch of communication between the Virtual Machine Monitor (VMM) and the vmkernel. This, in general, reduces the frequency of communication and thus the number of context switches, resulting in better virtualization efficiency.
  • The device emulation vmkernel thread wakeup opportunities are greatly reduced.
    The device-emulation threads now will only be executed either periodically with a longer timer or when the corresponding VCPUs are halted. This optimization greatly reduces the frequency with which device-emulation threads are woken up, so the frequency of context switches is also lowered.

fig1-high-cons

Figure 1. High Consolidation Example

Enabling Dense Mode

Dense mode is disabled by default in vSphere 6.0. To enable it, change Net.NetTuneHostMode in the ESXi host’s Advanced System Settings (shown below in Figure 2) to dense.

fig2-dense-mode-ui

Figure 2. Enabling Dynamic Host-Wide Performance Tuning
“default” is disabled; “dense” is enabled
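
For administrators who prefer to script the change, the same advanced option can also be set through the vSphere API. The following is a minimal pyVmomi sketch; the vCenter Server address, credentials, and ESXi host name are placeholders, and error handling is omitted:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Connect to vCenter Server (placeholder address and credentials).
    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="password", sslContext=ctx)

    # Locate the ESXi host object by name (placeholder name).
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
    host = next(h for h in view.view if h.name == "esxi01.example.com")
    view.DestroyView()

    # Set Net.NetTuneHostMode to "dense" ("default" turns dense mode back off).
    host.configManager.advancedOption.UpdateOptions(changedValue=[
        vim.option.OptionValue(key="Net.NetTuneHostMode", value="dense")])

    Disconnect(si)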

Once dense mode is enabled, the system periodically checks the load of the ESXi host (every 60 seconds by default) based on the following three thresholds:

  • Number of VMs ≥ number of PCPUs
  • Number of VCPUs ≥ 2 * number of PCPUs
  • Total PCPU utilization ≥ 50%

When the system load exceeds all of the above thresholds, these optimizations take effect for all regular VMs that carry default settings. When the system load drops below any of the thresholds, those optimizations are automatically removed from all affected VMs, so that the ESXi host performs identically to when dense mode is disabled.
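
The load check itself is simple enough to express in a few lines. The sketch below mirrors the three thresholds listed above; it illustrates the decision logic only and is not the actual vmkernel implementation:

    def dense_mode_should_engage(num_vms: int, num_vcpus: int,
                                 num_pcpus: int, pcpu_util_pct: float) -> bool:
        """Return True when all three dense-mode thresholds are met."""
        return (num_vms >= num_pcpus and
                num_vcpus >= 2 * num_pcpus and
                pcpu_util_pct >= 50.0)

    # Example: 48 single-VCPU VMs on a 40-PCPU host at 55% utilization does not
    # engage dense mode because the VCPU count (48) is below 2 * PCPUs (80).
    print(dense_mode_should_engage(48, 48, 40, 55.0))   # False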

Applicable Workloads

Enabling dense mode can potentially impact performance negatively for some applications. So, before enabling it, carefully profile the applications to determine whether or not the workload will benefit from this feature. Generally speaking, the feature improves the VM consolidation ratio on an ESXi host running medium network throughput applications that have some latency tolerance and are CPU bound. A good use case is a Web-farm workload, which needs CPU to process Web requests while generating only a medium level of network traffic and tolerating a few milliseconds of end-to-end latency. In contrast, if the bottleneck is not the CPU, enabling this feature only hurts network latency because of the less frequent context switching. For example, the following workloads are NOT good use cases for the feature:

  • Throughput-intensive workloads: Since the network is the bottleneck, reducing the CPU cost would not necessarily improve network throughput.
  • Little or no network traffic: If there is too little network traffic, all the dense mode optimizations barely have any effect.
  • Latency-sensitive workloads: When running latency-sensitive workloads, another set of optimizations is needed; these are documented in the “Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5” performance white paper.

Methodology

To evaluate this feature, we implement a lightweight Web benchmark, which has two lightweight clients and a large number of lightweight Web server VMs. The clients send HTTP requests to all Web servers at a given request rate, wait for responses, and report the response time. The request is for static content and it includes multiple text and JPEG files totaling around 100KB in size. The Web server has memory caching enabled and therefore serves all the content from memory. Two different request rates are used in the evaluation:

  1. Medium request rate: 25 requests per second per server
  2. High request rate: 50 requests per second per server

In both cases, the total packet rate on the ESXi host is around 400 Kilo-Packets/Second (KPPS) to 700 KPPS in each direction, where the receiving packet rate is slightly higher than the transmitting packet rate.

System Configuration

We configured our systems as follows:

  • One ESXi host (running Web server VMs)
    • Machine: HP DL580 G7 server running vSphere 6.0
    • CPU: Four 10-core Intel® Xeon® E7-4870 @ 2.4 GHz
    • Memory: 512 GB memory
    • Physical NIC: Two dual-port Intel X520 with a total of three active 10GbE ports
    • Virtual Switching: One virtual distributed switch (vDS) with three 10GbE uplinks using default teaming policy
    • VM: Red Hat Enterprise Linux Server 6.3 assigned one VCPU, 1GB memory, and one VMXNET3 VNIC
  • Two Clients (generating Web requests)
    • Machine: HP DL585 G7 server running Red Hat Enterprise Linux Server 6.3
    • CPU: Four 8-core AMD Opteron™ 6212 @ 2.6 GHz
    • Memory: 128 GB memory
    • Physical NIC: One dual-port Intel X520 with one active 10GbE port on each client

Results

Medium Request Rate

We first present the evaluation results for the medium request rate workload. Figures 3 and 4 below show the 95th-percentile response time and total host CPU utilization, respectively, as the number of VMs increases. For the 95th-percentile response time, we consider 100ms to be the preferred latency tolerance.

Figure 3 shows that at 100ms, default mode consolidates only about 470 Web server VMs, whereas dense mode consolidates more than 510 VMs, an improvement of over 10%. For CPU utilization, we consider 90% to be the desired maximum utilization.

fig3-med-95

Figure 3. Medium Request Rate 95-Percentile Response Time
(Latency Tolerance 100ms)

Figure 4 shows that at 90% utilization, default mode consolidates around 465 Web server VMs, whereas dense mode consolidates about 495 Web server VMs, which is still a nearly 10% improvement in consolidation ratio. We also notice that dense mode, in fact, also reduces response time. This is because the large reduction in context switching improves virtualization efficiency, which compensates for the increase in latency due to more aggressive batching.

fig4-med-90

Figure 4. Medium Request Rate Host Utilization
(Desired Maximum Utilization 90%)

High Request Rate

Figures 5 and 6 below show the 95th-percentile response time and total host CPU utilization, respectively, for the high request rate as the number of VMs increases. Because the request rate is doubled, we reduce the number of Web server VMs consolidated on the ESXi host. Figure 5 shows that at 100ms response time, dense mode consolidates only about 5% more VMs than default mode (from ~280 VMs to ~290 VMs). However, if we look at the CPU utilization shown in Figure 6, at the 90% desired maximum load, dense mode still consolidates about 10% more VMs (from ~240 VMs to ~260 VMs). Considering both the response time and utilization metrics, because there are fewer active contexts under the high request rate workload, the benefit of reducing context switches is less significant than in the medium request rate case.

fig5-high-95

Figure 5. High Request Rate 95-Percentile Response Time
(Latency Tolerance 100ms)

fig6-high-90

Figure 6. High Request Rate Host Utilization
(Desired Maximum Utilization at 90%)

Conclusion

We presented the Dynamic Host-Wide Performance Tuning feature, also known as dense mode. We showed that a Web-farm-like workload achieves up to a 10% higher consolidation ratio while still meeting the 100ms latency tolerance and the 90% maximum host utilization target. We emphasized that the improvements do not apply to every kind of application. Because of this, you should carefully profile the workloads before enabling dense mode.

VMware Virtual SAN Stretched Cluster Best Practices White Paper

VMware Virtual SAN 6.1 introduced the concept of a stretched cluster, which allows the Virtual SAN customer to configure two geographically separated sites while synchronously replicating data between them. A technical white paper about Virtual SAN stretched cluster performance has now been published. This paper provides guidelines on how to get the best performance for applications deployed in a Virtual SAN stretched cluster environment.

The chart below, borrowed from the white paper, compares the performance of the Virtual SAN 6.1 stretched cluster deployment against a regular Virtual SAN cluster without any fault domains. A nine-node Virtual SAN stretched cluster is considered with two different configurations of inter-site latency: 1ms and 5ms. The DVD Store benchmark is executed on four virtual machines on each host of the nine-node Virtual SAN stretched cluster. The DVD Store performance metrics of cumulative orders per minute in the cluster, read/write IOPS, and average latency are compared with a similar workload on the regular Virtual SAN cluster. The orders per minute (OPM) are lower by 3% and 6% for the 1ms and 5ms inter-site latency stretched clusters, respectively, compared to the regular Virtual SAN cluster.

vsan-stretched-fig1a
Figure 1a.  DVD Store orders per minute in the cluster and guest IOPS comparison

Guest read/write IOPS and latency were also monitored. The read/write mix ratio for the DVD Store workload is roughly 1/3 read and 2/3 write. Write latency shows a clear increasing trend as the inter-site latency grows, while read latency is only marginally impacted. As a result, the average latency increases from 2.4ms to 2.7ms and 5.1ms for the 1ms and 5ms inter-site latency configurations, respectively.

vsan-stretched-fig1b
Figure 1b.  DVD Store latency comparison

These results demonstrate that the inter-site latency in a Virtual SAN stretched cluster deployment has a marginal performance impact on a commercial workload like DVD Store. More results are available in the white paper.

Measuring Cloud Scalability Using the Zephyr Benchmark

Cloud-based deployments continue to be a hot topic in many of today’s corporations.  Often the discussion revolves around workload portability, ease of migration, and service pricing differences.  In an effort to bring performance into the discussion, we decided to leverage VMware’s new benchmark, Zephyr.  As a follow-on to Harold Rosenberg’s introductory Zephyr post, we decided to showcase some of the flexibility and scalability of our new large-scale benchmark.  Previously, Harold presented some initial scalability data running on three local vSphere 6 hosts.  For this article, we decided to extend this further by demonstrating Zephyr’s ability to run within a non-VMware cloud environment and by scaling up the number of application servers.

Zephyr is a new web-application benchmark architected to simulate modern-day web applications.  It consists of a benchmark application and a workload driver.  Combined, they simulate the behavior of everyday users attending a real-time auction.  For more details on Zephyr I encourage you to review the introductory post.

Environment Configuration:
Cloud Environment: Amazon AWS, US West.
Instance Types: M3.XLarge, M3.Large, C3.Large.
Instance Notes: Database instances utilized an additional 300GB io1 tier data disk.
Instance Operating System: CentOS 6.5 x64.
Application: Zephyr Internal Build 084.

Testing Methodology:
All instances were run within the same cloud environment to reduce network-induced latencies.  We started with a base configuration consisting of eight instances.  We then scaled out the number of workload drivers and application servers in an effort to identify how a cloud environment scaled as application workload needs increased.  We used Zephyr’s FindMax functionality, which runs a series of tests to determine the maximum number of users the configuration can sustain while still meeting QoS requirements.  It should be noted that early experimentation allowed us to identify the maximum needs of the services other than the workload drivers and application servers, which reduced the likelihood of bottlenecks in those services.  Below is a block diagram of the configurations used for the scaled-out Zephyr deployment.

Fig1

Results:
For our analysis of Zephyr cloud scaling, we ran multiple iterations for each load level and took the average.  We automated the process to ensure consistency.  Our results show both the number of users sustained and the HTTP requests per second reported by the benchmark harness.

Fig2

As you can see in the above graph, for our cloud environment running Zephyr, scaling the number of application servers yielded nearly linear scaling up to five application servers. The delta in scaling between the number of users and the HTTP requests per second sustained was less than 1%.  Due to time constraints we were unable to test beyond five application servers, but we expect that the scaling would have continued upwards well beyond the load levels presented.

Although this is just a small sample of what Zephyr and cloud environments can scale to, this brief article highlights both the benchmark and cloud environment scaling.  Though Zephyr hasn’t been released publicly yet, it’s easy to see how this type of controlled, scalable benchmark will assist in performance evaluations of a diverse set of environments.  Look for more Zephyr-based cloud performance analysis in the future.

Content Library Performance Tuning

by Joanna Guan and Davide Bergamasco

The first two posts in this series assessed the performance of some Content Library operations like virtual machine deployment and library synchronization, import, and export. In this post we discuss how to fine-tune Content Library settings in order to achieve optimal performance under a variety of operational conditions. Notice that in this post we only discuss the settings that have the most noticeable impact on overall solution performance. There are several other settings which may potentially affect Content Library performance. We refer interested readers to the official documentation for the details (the Content Library Service settings can be found here, while the Transfer Service settings can be found here).

Global Network Bandwidth Throttling

Content Library has a global bandwidth throttling control to limit the overall bandwidth consumed by file transfers. This setting, called Maximum Bandwidth Consumption, affects all the streaming mode operations including library synchronization, VM deployment, VM capture, and item import/export. However, it does not affect direct copy operations, i.e., operations where data is directly copied across ESXi hosts.

The purpose of the Maximum Bandwidth Consumption setting is to ensure that while Content Library file transfers are in progress enough network bandwidth remains available to vCenter Server for its own operations.

The following table illustrates the properties of this setting:

Setting Name               Maximum Bandwidth Consumption
vSphere Web Client Path    Administration → System Configuration → Services → Transfer Service → Maximum Bandwidth Consumption
Default Value              Unlimited
Unit                       Mbit/s

Concurrent Data Transfer Control

Content Library has a setting named Maximum Number of Concurrent Transfers that limits the number of concurrent data transfers. This limit applies to all the data transfer operations including import, export, VM deployment, VM capture, and synchronization. When this limit is exceeded, all new operations are queued until the completion of one or more of the operations in progress.

For example, let’s assume the current value of Maximum Number of Concurrent Transfers is 20 and there are 8 VM deployments, 2 VM captures, and 10 item synchronizations in progress. A new VM deployment request will be queued because the maximum number of concurrent operations has been reached. As soon as any of those operations completes, the new VM deployment is allowed to proceed.
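
Conceptually, this limit behaves like a counting semaphore sized at Maximum Number of Concurrent Transfers. The Python sketch below illustrates the queuing behavior described in the example; it is only an illustration, not how the Transfer Service is actually implemented:

    import threading, time

    MAX_CONCURRENT_TRANSFERS = 20                    # default value of the setting
    slots = threading.BoundedSemaphore(MAX_CONCURRENT_TRANSFERS)

    def transfer(item: str) -> None:
        with slots:                                  # blocks (queues) while 20 transfers are in flight
            print(f"transferring {item}")
            time.sleep(1.0)                          # stand-in for the actual data movement

    # 8 VM deployments + 2 VM captures + 10 item syncs occupy all 20 slots;
    # the 21st request below waits until one of them completes.
    threads = [threading.Thread(target=transfer, args=(f"op-{i}",)) for i in range(21)]
    for t in threads: t.start()
    for t in threads: t.join()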

This setting can be used to improve Content Library overall throughput (not the performance of each individual operation) by increasing the data transfer concurrency when the network is underutilized.

The following table illustrates the properties of this setting:

Setting Name               Maximum Number of Concurrent Transfers
vSphere Web Client Path    Administration → System Configuration → Services → Transfer Service → Maximum Number of Concurrent Transfers
Default Value              20
Unit                       Number

A second concurrency control setting, whose properties are shown in the table below, applies to synchronization operations only. This setting, named Library Maximum Concurrent SyncItems, controls the maximum number of items that a subscribed library is allowed to concurrently synchronize.

Setting Name               Library Maximum Concurrent SyncItems
vSphere Web Client Path    Administration → System Configuration → Services → Content Library Service → Library Maximum Concurrent SyncItems
Default Value              5
Unit                       Number

Given that the default value of Maximum Number of Concurrent Transfers is 20 and the default value of Library Maximum Concurrent SyncItems is 5, a maximum of 5 items can concurrently be transferred to a subscribed library during a synchronization operation, while a published library with 5 or more items can be synchronizing with up to 4 subscribed libraries (see Figure 1).  If the number of items or subscribed libraries exceeds these limits, the extra transfers will be queued. Library Maximum Concurrent SyncItems can be used in concert with Maximum Number of Concurrent Transfers to improve the overall synchronization throughput by increasing one or both limits.
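
Put another way, the interplay of the two default values can be worked out directly (a small illustrative calculation using the defaults listed above):

    max_transfers = 20      # Maximum Number of Concurrent Transfers (default)
    max_sync_items = 5      # Library Maximum Concurrent SyncItems (default)

    # Items a single subscribed library transfers in parallel during a sync:
    items_per_library = max_sync_items                        # 5

    # Subscribed libraries a published library can feed concurrently before the
    # global transfer limit starts queuing additional item transfers:
    concurrent_libraries = max_transfers // max_sync_items    # 4
    print(items_per_library, concurrent_libraries)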

 

post3-fig1

Figure 1. Library Synchronization Concurrency Control

 

The following Table summarizes the effect of each of the settings described above on each of the Content Library operations, depending on the data transfer mode.

                                          Maximum Bandwidth    Maximum Number of       Library Maximum
                                          Consumption          Concurrent Transfers    Concurrent SyncItems
Streaming Mode
    VM Deployment/Capture
    Import/Export
    Synchronization (Published Library)
    Synchronization (Subscribed Library)
Direct Copy Mode
    VM Deployment/Capture
    Import/Export
    Synchronization (Published Library)
    Synchronization (Subscribed Library)

 

Data Mirroring

Synchronizing library content across remote sites can be problematic because of the limited bandwidth of typical WAN connections. This problem may be exacerbated when a large number of subscribed libraries concurrently synchronize with a published library because the WAN connections originating from this library can easily cause congestion due to the elevated fan-out (see Figure 2).

 

post3-fig2

Figure 2. Synchronization Fan-out Problem

 

This problem can be mitigated by creating a mirror server to cache content at the remote sites.  As shown in Figure 3, this can significantly decrease the level of WAN traffic by avoiding transferring the same files multiple times across the network.

 

post3-fig3b

Figure 3. Data Mirroring Used to Mitigate Fan-Out Problem

 

The mirror server is a proxy Web server that caches content on local storage. The typical location of mirror servers is between the vCenter Server hosting the published library and the vCenter Server(s) hosting the subscribed library(ies). To be effective, the mirror servers must be as close as possible to the subscribed libraries. When a subscribed library attempts to synchronize with the published library, it requests content from the mirror server.  If such content is present on the mirror server, the request is immediately satisfied. Otherwise, the mirror server fetches said content from the published library and stores it locally before satisfying the request from the subscribed library. Any further request for that particular content will be directly satisfied by the mirror server.

A mirror server can also be used in a local environment to offload the data movement load from a vCenter Server, or when the backing storage of a published library is not particularly performant. In this case the mirror server is located as close as possible to the vCenter Server hosting the published library, as shown in Figure 4.

 

post3-fig4

Figure 4. Data Mirroring Used to Off-Load vCenter Server

 

Example of Mirror Server Configuration

This section provides step-by-step instructions to assist a vSphere administrator in the creation of a mirror server using the NGINX web server (other web servers, such as Apache and Lighttpd, can be used for this purpose as well). Please refer to the NGINX documentation for additional configuration details.

  1. Install the NGINX web server in a Windows or Linux virtual machine to be deployed as close as possible to either the subscribed or the published library depending on the desired optimization (fan-out mitigation or vCenter offload).
  2. Edit the configuration files. The NGINX default configuration file, /etc/nginx/nginx.conf, defines the basic behavior of the web server. Some of the core directives in the nginx.conf file need to be changed.
    • Configure the IP address / name of the vCenter server hosting the published library
         proxy_pass https://<PublishervCenterServer-name-or-IP>:443;
    • Set the valid time for cached files. In this example we assume the contents to be valid for 6 days (this time can be changed as needed):
         proxy_cache_valid any 6d;
    • Configure the cache directory path and cache size. In the following example we use /var/www/cache as the cache directory path on the file system where cached data will be stored. 500MB is the size of the shared memory zone, while 18,000MB is the size of the file storage. Files not accessed within 6 days are evicted from the cache.
         proxy_cache_path /var/www/cache levels=1:2 keys_zone=my-cache:500m max_size=18000m inactive=6d;
    • Define the cache key. Instruct NGINX to use the request URI as a key identifier for a file:
         proxy_cache_key "$scheme://$host$request_uri";
      When an OVF or VMDK file is updated, the file URL gets updated as well. When a URL changes, the cache key changes too, hence the mirror server will fetch and store the updated file as a new file during a library re-synchronization.
    • Configure HTTP redirect (code 302) handling.
         error_page 302 = @handler;
         upstream up_servers {
            server <PublishervCenterServer-name-or-IP>:443;
         }
         location @handler {
            set $foo $upstream_http_location;
            proxy_pass $foo;
            proxy_cache my-cache;
            proxy_cache_valid any 6d;
         }
  3. Test the configuration.  Run the following command on the mirror server twice:
       wget --no-check-certificate -O /dev/null https://<MirrorServer-IP-or-Name>/example.ovf
    The first time the command is run the file example.ovf will be fetched from the published content library and copied in some folder within the cache path proxy_cache_path. The second time, there should not be any network traffic between the mirror server and the vCenter hosting the published library as the file will be served from the cache.
  4. Sync library content through mirror server. Create a new subscribed library using the New Library Wizard from the vSphere Web Client. Copy the published library URL in the Subscription URL box and replace the vCenter IP or host name with the mirror server IP or host name (see Figure 5). Then complete the rest of the steps for creating a new library as usual.

 

post3-fig5

Figure 5. Configuring a Subscribed Library with a Mirror Server

 

Note: If the network environment is trusted, a simple HTTP proxy can be used instead of an HTTPS proxy in order to improve data transfer performance by avoiding unnecessary data encryption/decryption.

How to Efficiently Synchronize, Import and Export Content in VMware vSphere Content Library

By Joanna Guan and Davide Bergamasco

In a prior post we assessed the performance of VMware vSphere Content Library, focusing on the instantiation of a virtual machine from an existing library.  We considered various scenarios and provided virtual infrastructure administrators with some guidelines about the selection of the most efficient storage backing based on a cost/performance trade-off. In this post we focus on another Content Library operation, namely Synchronization, with the goal of providing similar guidelines. We also cover two Content Library maintenance operations, Import and Export.

Library Synchronization

Once a library is created and published, its content can be shared between different vCenter Servers. This is achieved through the synchronization operation, which clones a published library by downloading all the content to a subscribed library. Multiple scenarios exist based on the vCenter Server connectivity and backing storage configurations. Each scenario will be discussed in detail in the following sections after a brief description of the experimental testbed.

Experimental Testbed

For our experiments we used a total of four ESXi hosts: two for the vCenter Server appliances (one for the published library and a second for the subscribed library) and two to provide the datastore backing for the libraries. These two hosts are separately managed by each vCenter server. The following table summarizes the hardware and software specifications of the test bed.

 

ESXi Hosts Running vCenter Server Appliance
    Machine: Dell PowerEdge R910 server
    CPUs: Four 6-core Intel® Xeon® E7530 @ 1.87 GHz, Hyper-Threading enabled
    Memory: 80GB
    Virtualization Platform: VMware vSphere 6.0 (RTM build # 2494585)
    VM Configuration: VMware vCenter Server Appliance 6.0 (RTM build # 2559277), 16 vCPUs and 32GB RAM
ESXi Hosts Providing Datastore Backing
    Machine: Dell PowerEdge R610 server
    CPUs: Two 4-core Intel® Xeon® E5530 @ 2.40 GHz, Hyper-Threading enabled
    Memory: 32GB
    Virtualization Platform: VMware vSphere 6.0 (RTM build # 2494585)
    Storage Adapter: QLogic ISP2532 dual-port 8Gb Fibre Channel HBA
    Network Adapters: QLogic NetXtreme II BCM5709 1000Base-T (data rate: 1Gbps); Intel Corporation 82599EB 10-Gigabit SFI/SFP+ (data rate: 10Gbps)
Storage Array
    EMC VNX5700 storage array exposing two 20-disk RAID-5 LUNs with a capacity of 12TB each

Single-item Library Synchronization

When multiple vCenter Servers are part of the same SSO domain they can be managed as a single entity. This feature is called Enhanced Linked Mode (see this post for a discussion on how to configure Enhanced Linked Mode). In an environment where Enhanced Linked Mode is available, the contents of a published library residing under one vCenter Server can be synched to a subscribed library residing under another vCenter server by directly copying the files from the source datastore to the destination datastore (this is possible provided that the ESXi hosts connected to those datastores have direct network connectivity).

When Enhanced Linked Mode is not available, the contents of a published library will have to be streamed through the Content Library Transfer Service components residing on each vCenter server (see the prior post in this series for a brief description of the Content Library architecture). In this case, three sub-scenarios exist based on the storage configuration: (a) both published and subscribed library reside on a datastore, (b) both published and subscribed library reside on an NFS file system mounted on the vCenter servers, and (c) the published library resides on an NFS file system while the subscribed library resides on a datastore. The four scenarios discussed above are depicted in Figure 1.

 

post2-fig1

Figure 1. Library synchronization experimental scenarios and related data flows.

 

For each of the four scenarios we synchronized the contents of a published library to a subscribed library and measured the completion time of this operation. The published library contained only one item, a 5.4GB OVF template containing a Red Hat virtual machine in compressed format (the uncompressed size is 15GB). The following table summarizes the four experiments.

 

Experiment 1 The published and subscribed libraries reside under different vCenter Servers with Enhanced Linked Mode; both libraries are backed by datastores.
Experiment 2 The published and subscribed libraries reside under different vCenter Servers without Enhanced Linked Mode; both libraries are backed by datastores.
Experiment 3 The published and subscribed libraries reside under different vCenter Servers without Enhanced Linked Mode; both libraries are backed by NFS file systems, one mounted on each vCenter server.
Experiment 4 The published and subscribed libraries reside under different vCenter Servers without Enhanced Linked Mode; the published library is backed by an NFS file system while the subscribed library is backed by a datastore.

 

For all the experiments above we used both 1GbE and 10GbE network connections to study the effect of network capacity on synchronization performance. In the scenarios of Experiments 2 through 4, the Transfer Service is used to stream data from the published library to the subscribed library. This service leverages a component in vCenter Server called rhttpproxy, whose purpose is to offload the encryption/decryption of SSL traffic from the vCenter Server web server (see Figure 2). To study the performance impact of data encryption/decryption in those scenarios, we ran the experiments twice, the first time with rhttpproxy enabled (the default case), and the second time with rhttpproxy disabled (thus reducing security by transferring the content “in the clear”).

 

post2-fig2

Figure 2. Reverse HTTP Proxy.

 

Results

The results of the experiments outlined above are shown in Figure 3 and summarized in the table below.

 

Experiment 1 The datastore to datastore with Enhanced Linked Mode scenario is the fastest of the four, with a sync completion time of 105 seconds (1.75 minutes).  This is because the data path is the shortest (the two ESXi hosts are directly connected) and there is no data compression/decompression overhead because content on datastores is stored in uncompressed format. When a 10GbE network is used, the library sync completion time is significantly shorter (63 seconds). This suggests that the 1GbE connection between the two hosts is a bottleneck for this scenario.
Experiment 2 The datastore to datastore without Enhanced Linked Mode scenario is the slowest, with a sync completion time of 691 seconds (more than 11 minutes). This is because the content needs to be streamed via the Transfer Service between the two sites, and it also incurs the data compression and decompression overhead for the transfer across the network link between the two vCenter Servers. Using a 10GbE network in this scenario has no measurable effect since most of the overhead comes from data compression/decompression. Also, disabling rhttpproxy has only a marginal effect for the same reason.
Experiment 3 The NFS file system to NFS file system scenario is the second fastest scenario with a sync completion time of 274 seconds (about 4.5 minutes). Although the transfer path has the same number of hops as the previous scenario, it does not incur the data compression and decompression overhead because the content is already stored in a compressed format on the mounted NFS file systems. Using a 10GbE network in this scenario leads to a substantial improvement in the completion time (more than halved).  An even more significant improvement is achieved by disabling rhttpproxy. The combined effect of these two factors yields a 3.7x reduction in the synchronization completion time. These results imply that for this scenario both the 1GbE network and the use of HTTPS for data transfer are substantial performance bottlenecks.
Experiment 4 The NFS file system to datastore is the third fastest scenario with a sync completion time of 298 seconds (just under 5 minutes).  In this scenario the Transfer Service at the subscribed vCenter needs to decompress the files (content on mounted NFS file systems is compressed), but the published vCenter does not need to re-compress them (content on datastores is stored uncompressed). Since data decompression has a substantially smaller overhead than compression, this scenario achieves a much better performance than Experiment 2.  Using a 10GbE network and disabling rhttpproxy in this scenario has the same effects as in Experiment 3 (that is, a 3.7x reduction in completion time).

 

post2-fig3

Figure 3. Library synchronization completion times.

 

The above experiments clearly show that there are a number of factors affecting library synchronization performance:

  • Type of data path: direct connection vs. streaming;
  • Network capacity;
  • Data compression/decompression;
  • Data encryption/decryption.

The following recommendations translate these observations into a set of actionable steps to help vSphere administrators optimize Content Library synchronization performance.

  1. The best performance can be obtained if Enhanced Linked Mode is available and both the published and subscribed libraries are backed by datastores.
  2. When Enhanced Linked Mode is not available, avoid datastore-to-datastore synchronization. If no other optimization is possible, place the published library on an NFS file system (notice that for best deployment performance the subscribed library/libraries should be backed by a datastore as discussed in the prior post).
  3. Using a 10GbE network is always beneficial for synchronization performance (except in the datastore-to-datastore synchronization without Enhanced Linked Mode).
  4. If data confidentiality is not required, the overhead of the HTTPS transport can be avoided by disabling rhttpproxy as described in VMware Knowledge Base article KB2112692.

Concurrent Library Synchronization

To assess the performance of a concurrent library synchronization operation where multiple items are copied in parallel, we devised an experiment where a subscribed library is synchronized with a published library that contains an increasing number of items from 1 to 10. The source and destination vCenter servers support Enhanced Linked Mode and the two libraries are backed by datastores. Each item is an OVF template containing a Windows virtual machine with a 41GB flat VMDK file. Each vCenter server manages one cluster with two ESXi hosts, as shown in Figure 4.

 

post2-fig4

Figure 4. Concurrent library synchronization.

 

Results

We studied two scenarios depending on the network speed. With a 1GbE network, we observed that each file transfer always saturates the network bandwidth between the two ESXi hosts, as we expected. Because each site has two ESXi hosts per vCenter, the library synchronization can use two pairs of ESXi hosts to transfer two files concurrently. As shown by the blue line in Figure 5, the library synchronization completion time is virtually the same for one or two items, suggesting that two items are effectively transferred concurrently. When the number of library items is larger than two, the completion time increases linearly, indicating that the extra file transfers are queued while the network is busy with prior transfers.
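
A back-of-the-envelope calculation shows why each 1GbE link saturates. The estimate below assumes the 41GB flat VMDK moves at line rate and ignores protocol overhead:

    item_size_gb = 41                 # flat VMDK size of each library item
    link_rate_mb_s = 1000 / 8         # 1GbE is roughly 125 MB/s of payload at best

    seconds_per_item = item_size_gb * 1024 / link_rate_mb_s
    print(f"~{seconds_per_item / 60:.1f} minutes per item at 1GbE line rate")

    # With two pairs of ESXi hosts, two such transfers proceed in parallel, which
    # is why the completion time stays flat when going from one item to two.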

With a 10GbE network we observed a different behavior. The synchronization operations were faster than in the prior experiment, but the network bandwidth was not completely saturated. This is because at the higher transfer rate the bottleneck was our storage subsystem. This bottleneck became more pronounced as more and more items were synchronized concurrently, due to an increasingly random access pattern on the disk subsystem. This resulted in a super-linear curve (red line in Figure 5), which should become linear should the network bandwidth eventually become saturated.

The conclusion is that, with a 1GbE network, adding more network interface cards to the ESXi hosts to increase the number of available transfer channels (or alternatively adding more ESXi hosts to each site) will increase the total file transfer throughput and consequently decrease the synchronization completion time. Notice that this approach works only if there is constant bi-sectional bandwidth between the two sites.  Any networking bottleneck between them, like a slower WAN link, will limit, if not defeat, the transfer concurrency.

With a 10GbE network, unless very capable storage subsystems are available both at the published and subscribed library sites, the network capacity should be sufficient to accommodate a large number of concurrent transfers.

 

post2-fig5

Figure 5. Concurrent synchronization completion times.

 

Library Import and Export

The Content Library Import function allows administrators to upload content from a local system or web server to a content library. This function is used to populate a new library or add content to an existing one.  The symmetrical Export function allows administrators to download content from a library to a local system. This function can be used to update content in a library by downloading it, modifying it, and eventually importing it again to the same library.

As for the prior experiments, we studied a few scenarios using different library storage backing and network connectivity configurations to find out which one is the most performant from the completion time perspective. In our experiments we focused on the import/export of a virtual machine template in OVF format with a size of 5.4GB (the VMDK file size is 15GB in uncompressed format). As we did earlier, we assessed the performance impact of the rhttpproxy component by running experiments with and without it.

We consider the six scenarios summarized in the following table and illustrated in Figure 6.

 

Experiment 5 Exporting content from a library backed by a datastore. The OVF template is stored uncompressed, using 15GB of space.
Experiment 6 Exporting content from a library backed by an NFS filesystem mounted on the vCenter Server. The OVF template is stored compressed using 5.4GB of space.
Experiment 7/9 Importing content into a library backed by a datastore. The OVF template is stored either on a Windows system running the upload client (Experiment 7) or on a Web Server (Experiment 9). In both cases the data is stored in compressed format using 5.4GB of space.
Experiment 8/10 Importing content into a library backed by an NFS filesystem mounted on the vCenter Server. The OVF template is stored either on a Windows system running the upload client (Experiment 8) or on a Web Server (Experiment 10). In both cases the data is stored in compressed format using 5.4GB of space.

 

post2-fig6

Figure 6. Import/Export storage configurations and data flows.

 

Results

Figure 7 shows the results of the six experiments described above in terms of Import/Export completion time (lower is better), while the following table summarizes the main observations for each experiment.

 

Experiment 5 This is the most unfavorable scenario for content export because the data goes through the ESXi host and the vCenter Server, where it gets compressed before being sent to the download client.  Using a 10GbE network or disabling rhttpproxy doesn’t help very much because, as we have already observed, data compression is the largest performance limiter.
Experiment 6 Exporting a library item from an NFS filesystem is instead the most favorable scenario. The data is already in compressed format on the NFS filesystem, so no compression is required during the download. Disabling rhttpproxy also has a large impact on the data transfer speed, yielding an improvement of about 44%. Using a 10GbE network, however, does not result in additional improvements because, after removing the encryption/decryption bottleneck, we face another limiter, a checksum operation. In fact, in order to ensure data integrity during the transfer, a checksum is computed on the data as it goes through the Transfer Service. This is another CPU-heavy operation, albeit somewhat lighter than data compression and encryption.
Experiment 7/9 Importing content into a library backed by a datastore from an upload client (Experiment 7) is clearly limited by the network capacity when 1GbE connections are used. In fact, the completion time virtually does not change when rhttpproxy is disabled. Performance improves when 10GbE connections are employed, and further improvements are observed when rhttpproxy is disabled.  This suggests that data encryption/decryption is definitely a bottleneck with the larger network capacity. When a web server is used to host the library item to be imported (Experiment 9), we observe a completion time which is more than halved compared to Experiment 7. There are two reasons for this: (1) the Transfer Service bypasses rhttpproxy when importing content from a web server (this is the reason there are no “rhttpproxy disabled” data points for Experiments 9 and 10 in Figure 7), and (2) the web server is more efficient at transferring data than the Windows client VM. Using a 10GbE connection results in a further improvement. Given that import performance improves by only 10%, this indicates the presence of another limiter.  This limiter is the decompression of the library item while it is being streamed to the destination datastore.
Experiment 8/10 When content is being imported into a library backed by an NFS filesystem, we see a pattern very similar to the one we observed in Experiments 7 and 9. The only difference is that the completion times in Experiments 8 and 10 are slightly better because in this case there is no decompression being performed, as the data is stored in compressed format on the NFS filesystem. The only exception is the “10GbE Network” data point in Experiment 10, which is about 44% better than in Experiment 9. This is because when all the other limiters have been removed, data decompression plays a more significant role in the import performance.

 

post2-fig7

Figure 7. Import/Export completion times.

 

Export Concurrency

In this last experiment, we assess the performance of Content Library in terms of network throughput when multiple users simultaneously export (download) an OVF template with a VMDK file size of 5.4GB. The content is redirected to a NULL device on the download client in order to factor out a potential bottleneck in the client storage stack. The library backing is an NFS filesystem mounted on the vCenter server.

Results

Figure 8 shows the aggregate export throughput (higher is better) as the number of concurrent export operations increases from 1 to 10 in four different scenarios, depending on the network speed and use of rhttpproxy. When export traffic goes through the rhttpproxy component, the speed of the network seems to be irrelevant, as we get exactly the same throughput (which saturates at around 90 MB/s) with both the 1GbE and 10GbE networks. This once again confirms that rhttpproxy, due to the CPU-intensive SSL data encryption, creates a bottleneck on the data transfer path.

When rhttpproxy is turned off, the download throughput increases until the link capacity is completely saturated (about 120MB/s), at least with a 1GbE network. Once again, administrators can trade off security for performance by disabling rhttpproxy as explained earlier.

When a 10GbE network is used, however, throughput saturates at around 450 MB/s instead of climbing all the way up to 1200MB/s (the theoretical capacity of a 10 Gbps Ethernet link).  This is because the data transfer path, when operating at higher rates, hits another bottleneck introduced by the checksum operation performed by the Transfer Agent to ensure data integrity. Generating a data checksum is another computationally intensive operation, even though not as heavy as data encryption.
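
To see why checksumming becomes visible at these rates, consider that an integrity check over a data stream forces every transferred byte through the CPU one more time. The Python sketch below shows the general shape of such a check; the actual algorithm and chunk size used by the Transfer Service are not documented here, so both are assumptions:

    import hashlib

    def stream_checksum(path: str, chunk_size: int = 1 << 20) -> str:
        """Hash a file in 1MB chunks, as an integrity check over a data stream would."""
        h = hashlib.sha256()          # assumed algorithm, for illustration only
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)       # every byte is touched again by the CPU
        return h.hexdigest()

    # print(stream_checksum("example.ovf"))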

 

post2-fig8

Figure 8. Concurrent export throughput