
Measuring Cloud Scalability Using the Zephyr Benchmark

Cloud-based deployments continue to be a hot topic in many of today’s corporations.  Often the discussion revolves around workload portability, ease of migration, and service pricing differences.  To bring performance into the discussion, we leveraged VMware’s new benchmark, Zephyr.  As a follow-on to Harold Rosenberg’s introductory Zephyr post, this article showcases some of the flexibility and scalability of our new large-scale benchmark.  Previously, Harold presented some initial scalability data running on three local vSphere 6 hosts.  Here we extend that work by demonstrating Zephyr’s ability to run within a non-VMware cloud environment while scaling up the number of application servers.

Zephyr is a new web-application benchmark architected to simulate modern-day web applications.  It consists of a benchmark application and a workload driver.  Combined, they simulate the behavior of everyday users attending a real-time auction.  For more details on Zephyr I encourage you to review the introductory post.

Environment Configuration:
Cloud Environment: Amazon AWS, US West.
Instance Types: m3.xlarge, m3.large, c3.large.
Instance Notes: Database instances utilized an additional 300GB io1 tier data disk.
Instance Operating System: CentOS 6.5 x64.
Application: Zephyr Internal Build 084.

Testing Methodology:
All instances were run within the same cloud environment to reduce network-induced latencies.  We started with a base configuration consisting of eight instances and then scaled out the number of workload drivers and application servers to see how the cloud environment scaled as application workload needs increased.  We used Zephyr’s FindMax functionality, which runs a series of tests to determine the maximum number of users the configuration can sustain while still meeting QoS requirements.  Early experimentation allowed us to size the services other than the workload drivers and application servers generously, reducing the likelihood of bottlenecks in those services.  Below is a block diagram of the configurations used for the scaled-out Zephyr deployment.

Figure 1. Block diagram of the scaled-out Zephyr deployment configuration

Results:
For our analysis of Zephyr cloud scaling we ran multiple iterations at each load level and took the average.  We automated the process to ensure consistency.  Our results show both the number of users sustained and the HTTP requests per second reported by the benchmark harness.

Figure 2. Zephyr cloud scaling: users sustained and HTTP requests per second versus number of application servers

As the graph above shows, for our cloud environment running Zephyr, scaling the number of application servers yielded nearly linear scaling up to five application servers. The difference in scaling between the number of users and the HTTP requests per second sustained was less than 1%.  Due to time constraints we were unable to test beyond five application servers, but we expect the scaling would have continued well beyond the load levels presented.

Although this is just a small sample of what Zephyr and cloud environments can scale to, this brief article highlights the scalability of both the benchmark and the cloud environment.  Though Zephyr hasn’t been released publicly yet, it’s easy to see how this type of controlled, scalable benchmark will assist in performance evaluations of a diverse set of environments.  Look for more Zephyr-based cloud performance analysis in the future.

Content Library Performance Tuning

by Joanna Guan and Davide Bergamasco

The first two posts in this series assessed the performance of some Content Library operations such as virtual machine deployment and library synchronization, import, and export. In this post we discuss how to fine-tune Content Library settings in order to achieve optimal performance under a variety of operational conditions. Note that we only discuss the settings that have the most noticeable impact on overall solution performance. Several other settings may potentially affect Content Library performance; we refer interested readers to the official documentation for the details (the Content Library Service settings can be found here, while the Transfer Service settings can be found here).

Global Network Bandwidth Throttling

Content Library has a global bandwidth throttling control to limit the overall bandwidth consumed by file transfers. This setting, called Maximum Bandwidth Consumption, affects all the streaming mode operations including library synchronization, VM deployment, VM capture, and item import/export. However, it does not affect direct copy operations, i.e., operations where data is directly copied across ESXi hosts.

The purpose of the Maximum Bandwidth Consumption setting is to ensure that while Content Library file transfers are in progress enough network bandwidth remains available to vCenter Server for its own operations.

The following table illustrates the properties of this setting:

Setting Name: Maximum Bandwidth Consumption
vSphere Web Client Path: Administration > System Configuration > Services > Transfer Service > Maximum Bandwidth Consumption
Default Value: Unlimited
Unit: Mbit/s

Concurrent Data Transfer Control

Content Library has a setting named Maximum Number of Concurrent Transfers that limits the number of concurrent data transfers. This limit applies to all the data transfer operations including import, export, VM deployment, VM capture, and synchronization. When this limit is exceeded, all new operations are queued until the completion of one or more of the operations in progress.

For example, let’s assume the current value of Maximum Number of Concurrent Transfers is 20 and there are 8 VM deployments, 2 VM captures, and 10 item synchronizations in progress. A new VM deployment request will be queued because the maximum number of concurrent operations has been reached. As soon as any of those operations completes, the new VM deployment is allowed to proceed.

This setting can be used to improve Content Library overall throughput (not the performance of each individual operation) by increasing the data transfer concurrency when the network is underutilized.

The following table illustrates the properties of this setting:

Setting Name: Maximum Number of Concurrent Transfers
vSphere Web Client Path: Administration > System Configuration > Services > Transfer Service > Maximum Number of Concurrent Transfers
Default Value: 20
Unit: Number

A second concurrency control setting, whose properties are shown in the table below, applies to synchronization operations only. This setting, named Library Maximum Concurrent SyncItems, controls the maximum number of items that a subscribed library is allowed to concurrently synchronize.

Setting Name: Library Maximum Concurrent SyncItems
vSphere Web Client Path: Administration > System Configuration > Services > Content Library Service > Library Maximum Concurrent SyncItems
Default Value: 5
Unit: Number

Given that the default value of Maximum Number of Concurrent Transfers is 20 and the default value of Library Maximum Concurrent SyncItems is 5, a maximum of 5 items can concurrently be transferred to a subscribed library during a synchronization operation, while a published library with 5 or more items can be synchronizing with up to 4 subscribed libraries (see Figure 1).  If the number of items or subscribed libraries exceeds these limits, the extra transfers will be queued. Library Maximum Concurrent SyncItems can be used in concert with Maximum Number of Concurrent Transfers to improve the overall synchronization throughput by increasing one or both limits.

 

Figure 1. Library Synchronization Concurrency Control

 

The settings described above affect each Content Library operation as follows, depending on the data transfer mode:

  • Maximum Bandwidth Consumption applies to all streaming mode operations (VM deployment/capture, import/export, and synchronization), but not to direct copy operations.
  • Maximum Number of Concurrent Transfers applies to all data transfer operations, including VM deployment/capture, import/export, and synchronization.
  • Library Maximum Concurrent SyncItems applies only to the synchronization of subscribed libraries.

 

Data Mirroring

Synchronizing library content across remote sites can be problematic because of the limited bandwidth of typical WAN connections. This problem may be exacerbated when a large number of subscribed libraries concurrently synchronize with a published library because the WAN connections originating from this library can easily cause congestion due to the elevated fan-out (see Figure 2).

 

Figure 2. Synchronization Fan-out Problem

 

This problem can be mitigated by creating a mirror server to cache content at the remote sites.  As shown in Figure 3, this can significantly decrease the level of WAN traffic by avoiding transferring the same files multiple times across the network.

 

Figure 3. Data Mirroring Used to Mitigate Fan-Out Problem

 

The mirror server is a proxy Web server that caches content on local storage. The typical location of mirror servers is between the vCenter Server hosting the published library and the vCenter Server(s) hosting the subscribed library(ies). To be effective, the mirror servers must be as close as possible to the subscribed libraries. When a subscribed library attempts to synchronize with the published library, it requests content from the mirror server.  If such content is present on the mirror server, the request is immediately satisfied. Otherwise, the mirror server fetches said content from the published library and stores it locally before satisfying the request from the subscribed library. Any further request for that particular content will be directly satisfied by the mirror server.

A mirror server can also be used in a local environment to offload data movement from a vCenter Server, or when the storage backing a published library is not particularly performant. In this case the mirror server is located as close as possible to the vCenter Server hosting the published library, as shown in Figure 4.

 

Figure 4. Data Mirroring Used to Off-Load vCenter Server

 

Example of Mirror Server Configuration

This section provides step-by-step instructions to assist a vSphere administrator in the creation of a mirror server using the NGINX web server (other web servers, such as Apache and Lighttpd, can be used for this purpose as well). Please refer to the NGINX documentation for additional configuration details.

  1. Install the NGINX web server in a Windows or Linux virtual machine to be deployed as close as possible to either the subscribed or the published library depending on the desired optimization (fan-out mitigation or vCenter offload).
  2. Edit the configuration files. The NGINX default configuration file, /etc/nginx/nginx.conf, defines the basic behavior of the web server. Some of the core directives in nginx.conf need to be changed, as described below (a consolidated sketch of how these directives fit together appears after this list).
    • Configure the IP address / name of the vCenter server hosting the published library
         proxy_pass https://<PublishervCenterServer-name-or-IP>:443;
    • Set the valid time for cached files. In this example we assume the contents to be valid for 6 days (this time can be changed as needed):
         proxy_cache_valid any 6d;
    • Configure the cache directory path and cache size. In the following example we use /var/www/cache as the cache directory path on the file system where cached data will be stored. 500MB is the size of the shared memory zone, while 18,000MB is the size of the file storage. Files not accessed within 6 days are evicted from the cache.
          proxy_cache_path /var/www/cache levels=1:2 keys_zone=my-cache:500m max_size=18000m inactive=6d;
    • Define the cache key. Instruct NGINX to use the request URI as a key identifier for a file:
         proxy_cache_key "$scheme://$host$request_uri";
      When an OVF or VMDK file is updated, the file URL gets updated as well. When a URL changes, the cache key changes too, hence the mirror server will fetch and store the updated file as a new file during a library re-synchronization.
    • Configure HTTP Redirect (code 302) handling:
          error_page 302 = @handler;
          upstream up_servers {
             server <PublishervCenterServer-name-or-IP>:443;
          }
          location @handler {
             set $foo $upstream_http_location;
             proxy_pass $foo;
             proxy_cache my-cache;
             proxy_cache_valid any 6d;
          }
  3. Test the configuration. Run the following command on the mirror server twice:
       wget --no-check-certificate -O /dev/null https://<MirrorServer-IP-or-Name>/example.ovf
    The first time the command is run, the file example.ovf is fetched from the published content library and copied into a folder under the cache path defined by proxy_cache_path. The second time, there should be no network traffic between the mirror server and the vCenter Server hosting the published library because the file is served from the cache.
  4. Synchronize library content through the mirror server. Create a new subscribed library using the New Library wizard in the vSphere Web Client. Copy the published library URL into the Subscription URL box and replace the vCenter Server IP address or host name with the mirror server IP address or host name (see Figure 5). Then complete the rest of the wizard as you would for any new library.
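For reference, the fragments above fit together roughly as follows. This is only a sketch, not a drop-in configuration: the http and server wrapper blocks, the listen directive, and the placeholder names are assumptions on our part, and the example listens for plain HTTP, which (as the note at the end of this section explains) is only appropriate on a trusted network. Always validate the final file with nginx -t before reloading the server.

     # Sketch of a Content Library mirror configuration (nginx.conf).
     # Replace <PublishervCenterServer-name-or-IP> with the vCenter Server hosting the published library.
     http {
         # Cache on local disk: 500MB shared memory zone, 18,000MB of file storage, 6-day eviction.
         proxy_cache_path /var/www/cache levels=1:2 keys_zone=my-cache:500m max_size=18000m inactive=6d;

         # Upstream group pointing at the publishing vCenter Server, as listed in the steps above.
         upstream up_servers {
             server <PublishervCenterServer-name-or-IP>:443;
         }

         server {
             listen 80;                                   # plain HTTP; use only on a trusted network

             location / {
                 # Forward cache misses to the published library and cache the responses.
                 proxy_pass https://<PublishervCenterServer-name-or-IP>:443;
                 proxy_cache my-cache;
                 proxy_cache_valid any 6d;                # cached files remain valid for 6 days
                 proxy_cache_key "$scheme://$host$request_uri";
                 error_page 302 = @handler;               # hand HTTP redirects to the named location below
             }

             location @handler {
                 set $foo $upstream_http_location;        # redirect target returned by the publisher
                 proxy_pass $foo;
                 proxy_cache my-cache;
                 proxy_cache_valid any 6d;
             }
         }
     }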

 

Figure 5. Configuring a Subscribed Library with a Mirror Server

 

Note: If the network environment is trusted, a plain HTTP proxy can be used instead of an HTTPS proxy to improve data transfer performance by avoiding unnecessary data encryption and decryption.

How to Efficiently Synchronize, Import and Export Content in VMware vSphere Content Library

By Joanna Guan and Davide Bergamasco

In a prior post we assessed the performance of VMware vSphere Content Library, focusing on the instantiation of a virtual machine from an existing library.  We considered various scenarios and provided virtual infrastructure administrators with some guidelines about the selection of the most efficient storage backing based on a cost/performance trade-off. In this post we focus on another Content Library operation, namely Synchronization, with the goal of providing similar guidelines. We also cover two Content Library maintenance operations, Import and Export.

Library Synchronization

Once a library is created and published, its content can be shared between different vCenter Servers. This is achieved through the synchronization operation, which clones a published library by downloading all the content to a subscribed library. Multiple scenarios exist based on the vCenter Server connectivity and backing storage configurations. Each scenario will be discussed in detail in the following sections after a brief description of the experimental testbed.

Experimental Testbed

For our experiments we used a total of four ESXi hosts: two to run the vCenter Server appliances (one for the published library and one for the subscribed library) and two to provide the datastore backing for the libraries. Each of the latter two hosts is managed by one of the vCenter Servers. The following table summarizes the hardware and software specifications of the testbed.

 

ESXi Hosts Running the vCenter Server Appliance
    Server: Dell PowerEdge R910
         CPUs: Four 6-core Intel® Xeon® E7530 @ 1.87 GHz, Hyper-Threading enabled
         Memory: 80GB
    Virtualization Platform: VMware vSphere 6.0 (RTM build # 2494585)
         VM Configuration: VMware vCenter Server Appliance 6.0 (RTM build # 2559277), 16 vCPUs and 32GB RAM
ESXi Hosts Providing Datastore Backing
    Server: Dell PowerEdge R610
         CPUs: Two 4-core Intel® Xeon® E5530 @ 2.40 GHz, Hyper-Threading enabled
         Memory: 32GB
    Virtualization Platform: VMware vSphere 6.0 (RTM build # 2494585)
    Storage Adapter: QLogic ISP2532 Dual-Port 8Gb Fibre Channel HBA
    Network Adapters: QLogic NetXtreme II BCM5709 1000Base-T (Data Rate: 1Gbps); Intel 82599EB 10-Gigabit SFI/SFP+ (Data Rate: 10Gbps)
Storage Array
     EMC VNX5700 Storage Array exposing two 20-disk RAID-5 LUNs with a capacity of 12TB each

Single-item Library Synchronization

When multiple vCenter Servers are part of the same SSO domain they can be managed as a single entity. This feature is called Enhanced Linked Mode (see this post for a discussion on how to configure Enhanced Linked Mode). In an environment where Enhanced Linked Mode is available, the contents of a published library residing under one vCenter Server can be synched to a subscribed library residing under another vCenter server by directly copying the files from the source datastore to the destination datastore (this is possible provided that the ESXi hosts connected to those datastores have direct network connectivity).

When Enhanced Linked Mode is not available, the contents of a published library will have to be streamed through the Content Library Transfer Service components residing on each vCenter server (see the prior post in this series for a brief description of the Content Library architecture). In this case, three sub-scenarios exist based on the storage configuration: (a) both published and subscribed library reside on a datastore, (b) both published and subscribed library reside on an NFS file system mounted on the vCenter servers, and (c) the published library resides on an NFS file system while the subscribed library resides on a datastore. The four scenarios discussed above are depicted in Figure 1.

 

Figure 1. Library synchronization experimental scenarios and related data flows.

 

For each of the four scenarios we synchronized the contents of a published library to a subscribed library and measured the completion time of this operation. The published library contained only one item, a 5.4GB OVF template containing a Red Hat virtual machine in compressed format (the uncompressed size is 15GB). The following table summarizes the four experiments.

 

Experiment 1 The published and subscribed libraries reside under different vCenter Servers with Enhanced Linked Mode; both libraries are backed by datastores.
Experiment 2 The published and subscribed libraries reside under different vCenter Servers without Enhanced Linked Mode; both libraries are backed by datastores.
Experiment 3 The published and subscribed libraries reside under different vCenter Servers without Enhanced Linked Mode; both libraries are backed by NFS file systems, one mounted on each vCenter server.
Experiment 4 The published and subscribed libraries reside under different vCenter Servers without Enhanced Linked Mode; the published library is backed by an NFS file system while the subscribed library is backed by a datastore.

 

For all the experiments above we used both 1GbE and 10GbE network connections to study the effect of network capacity on synchronization performance. In the scenarios of Experiment 2 through 4 the Transfer Service is used to stream data from the published library to the subscribed library. This service leverages a component in vCenter Server called rhttpproxy whose purpose is to offload the encryption/decryption of SSL traffic from the vCenter Server web server (see Figure 2). To study the performance impact of data encryption/decryption in those scenarios, we ran the experiments twice, the first time with rhttpproxy enabled (default case), and the second time with rhttpproxy disabled (thus reducing security by transferring the content “in the clear”).

 

Figure 2. Reverse HTTP Proxy.

 

Results

The results of the experiments outlined above are shown in Figure 3 and summarized in the table below.

 

Experiment 1 The datastore to datastore with Enhanced Linked Mode scenario is the fastest of the four, with a sync completion time of 105 seconds (1.75 minutes).  This is because the data path is the shortest (the two ESXi hosts are directly connected) and there is no data compression/decompression overhead because content on datastores is stored in uncompressed format. When a 10GbE network is used, the library sync completion time is significantly shorter (63 seconds). This suggests that the 1GbE connection between the two hosts is a bottleneck for this scenario.
Experiment 2 The datastore to datastore without Enhanced Linked Mode scenario is the slowest, with a sync completion time of 691 seconds (more than 11 minutes). This is because the content needs to be streamed via the Transfer Service between the two sites, and it also incurs data compression and decompression overhead across the network link between the two vCenter Servers. Using a 10GbE network in this scenario has no measurable effect since most of the overhead comes from data compression/decompression. Disabling rhttpproxy also has only a marginal effect for the same reason.
Experiment 3 The NFS file system to NFS file system scenario is the second fastest, with a sync completion time of 274 seconds (about 4.5 minutes). Although the transfer path has the same number of hops as the previous scenario, it does not incur the data compression and decompression overhead because the content is already stored in compressed format on the mounted NFS file systems. Using a 10GbE network in this scenario more than halves the completion time.  An even more significant improvement is achieved by disabling rhttpproxy. The combined effect of these two factors yields a 3.7x reduction in the synchronization completion time. These results imply that for this scenario both the 1GbE network and the use of HTTPS for data transfer are substantial performance bottlenecks.
Experiment 4 The NFS file system to datastore is the third fastest scenario with a sync completion time of 298 seconds (just under 5 minutes).  In this scenario the Transfer Service at the subscribed vCenter needs to decompress the files (content on mounted NFS file systems is compressed), but the published vCenter does not need to re-compress them (content on datastores is stored uncompressed). Since data decompression has a substantially smaller overhead than compression, this scenario achieves a much better performance than Experiment 2.  Using a 10GbE network and disabling rhttpproxy in this scenario has the same effects as in Experiment 3 (that is, a 3.7x reduction in completion time).

 

Figure 3. Library synchronization completion times.

 

The above experiments clearly show that there are a number of factors affecting library synchronization performance:

  • Type of data path: direct connection vs. streaming;
  • Network capacity;
  • Data compression/decompression;
  • Data encryption/decryption.

The following recommendations translate these observations into a set of actionable steps to help vSphere administrators optimize Content Library synchronization performance.

  1. The best performance can be obtained if Enhanced Linked Mode is available and both the published and subscribed libraries are backed by datastores.
  2. When Enhanced Linked Mode is not available, avoid datastore-to-datastore synchronization. If no other optimization is possible, place the published library on an NFS file system (notice that for best deployment performance the subscribed library/libraries should be backed by a datastore as discussed in the prior post).
  3. Using a 10GbE network is always beneficial for synchronization performance (except in the datastore-to-datastore synchronization without Enhanced Linked Mode).
  4. If data confidentiality is not required, the overhead of the HTTPS transport can be avoided by disabling rhttpproxy as described in VMware Knowledge Base article KB2112692.

Concurrent Library Synchronization

To assess the performance of a concurrent library synchronization operation where multiple items are copied in parallel, we devised an experiment where a subscribed library is synchronized with a published library that contains an increasing number of items from 1 to 10. The source and destination vCenter servers support Enhanced Linked Mode and the two libraries are backed by datastores. Each item is an OVF template containing a Windows virtual machine with a 41GB flat VMDK file. Each vCenter server manages one cluster with two ESXi hosts, as shown in Figure 4.

 

Figure 4. Concurrent library synchronization.

 

Results

We studied two scenarios depending on the network speed. With a 1GbE network, we observed that each file transfer always saturates the network bandwidth between the two ESXi hosts, as expected. Because each site has two ESXi hosts, the library synchronization can use two pairs of ESXi hosts to transfer two files concurrently. As shown by the blue line in Figure 5, the library synchronization completion time is virtually the same for one or two items, suggesting that two items are effectively transferred concurrently. When the number of library items is larger than two, the completion time increases linearly, indicating that the extra file transfers are queued while the network is busy with prior transfers.

With a 10GbE network we observed a different behavior. The synchronization operations were faster than in the prior experiment, but the network bandwidth was not completely saturated. This is because at the higher transfer rate the bottleneck was our storage subsystem. This bottleneck became more pronounced as more items were synchronized concurrently, due to an increasingly random access pattern on the disk subsystem. This resulted in a super-linear curve (red line in Figure 5), which should become linear once the network bandwidth is eventually saturated.

The conclusion is that, with a 1GbE network, adding more network interface cards to the ESXi hosts to increase the number of available transfer channels (or, alternatively, adding more ESXi hosts to each site) will increase the total file transfer throughput and consequently decrease the synchronization completion time. Notice that this approach works only if there is sufficient bisection bandwidth between the two sites.  Any networking bottleneck between them, like a slower WAN link, will limit, if not defeat, the transfer concurrency.

With a 10GbE network, unless very capable storage subsystems are available both at the published and subscribed library sites, the network capacity should be sufficient to accommodate a large number of concurrent transfers.

 

Figure 5. Concurrent synchronization completion times.

 

Library Import and Export

The Content Library Import function allows administrators to upload content from a local system or web server to a content library. This function is used to populate a new library or add content to an existing one.  The symmetrical Export function allows administrators to download content from a library to a local system. This function can be used to update content in a library by downloading it, modifying it, and eventually importing it again to the same library.

As for the prior experiments, we studied a few scenarios using different library storage backing and network connectivity configurations to find out which one is the most performant from the completion time perspective. In our experiments we focused on the import/export of a virtual machine template in OVF format with a size of 5.4GB (the VMDK file size is 15GB in uncompressed format). As we did earlier, we assessed the performance impact of the rhttpproxy component by running experiments with and without it.

We consider the six scenarios summarized in the following table and illustrated in Figure 6.

 

Experiment 5 Exporting content from a library backed by a datastore. The OVF template is stored uncompressed, using 15GB of space.
Experiment 6 Exporting content from a library backed by an NFS filesystem mounted on the vCenter Server. The OVF template is stored compressed using 5.4GB of space.
Experiment 7/9 Importing content into a library backed by a datastore. The OVF template is stored either on a Windows system running the upload client (Experiment 7) or on a Web Server (Experiment 9). In both cases the data is stored in compressed format using 5.4GB of space.
Experiment 8/10 Importing content into a library backed by an NFS filesystem mounted on the vCenter Server. The OVF template is stored either on a Windows system running the upload client (Experiment 8) or on a Web Server (Experiment 10). In both cases the data is stored in compressed format using 5.4GB of space.

 

Figure 6. Import/Export storage configurations and data flows.

 

Results

Figure 7 shows the results of the six experiments described above in terms of Import/Export completion time (lower is better), while the following table summarizes the main observations for each experiment.

 

Experiment 5 This is the most unfavorable scenario for content export because the data goes through the ESXi host and the vCenter Server, where it gets compressed before being sent to the download client.  Using a 10GbE network or disabling rhttpproxy doesn’t help very much because, as we have already observed, data compression is the largest performance limiter.
Experiment 6 Exporting a library item from an NFS filesystem is instead the most favorable scenario. The data is already in compressed format on the NFS filesystem, so no compression is required during the download. Disabling rhttpproxy also has a large impact on the data transfer speed, yielding an improvement of about 44%. Using a 10GbE network, however, does not result in additional improvements because, after removing the encryption/decryption bottleneck, we face another limiter, a checksum operation. In fact, in order to ensure data integrity during the transfer, a checksum is computed on the data as it goes through the Transfer Service. This is another CPU-heavy operation, albeit somewhat lighter than data compression and encryption.
Experiment 7/9 Importing content into a library backed by a datastore from an upload client (Experiment 7) is clearly limited by the network capacity when 1GbE connections are used. In fact, the completion time is virtually unchanged when rhttpproxy is disabled. Performance improves when 10GbE connections are employed, and further improvements are observed when rhttpproxy is disabled, which suggests that data encryption/decryption is definitely a bottleneck with the larger network capacity. When a web server is used to host the library item to be imported (Experiment 9), we observe a completion time that is more than halved compared to Experiment 7. There are two reasons for this: (1) the Transfer Service bypasses rhttpproxy when importing content from a web server (this is the reason there are no “rhttpproxy disabled” data points for Experiments 9 and 10 in Figure 7), and (2) the web server is more efficient at transferring data than the Windows client VM. Using a 10GbE connection results in a further improvement, but only of about 10%, which indicates the presence of another limiter: the decompression of the library item while it is being streamed to the destination datastore.
Experiment 8/10 When content is being imported into a library backed by an NFS filesystem we see a pattern very similar to the one we observed in Experiments 7 and 9. The only difference is that the completion times in Experiments 8 and 10 are slightly better because in this case there is no decompression being performed as the data is stored in compressed format on the NFS filesystem. The only exception is the “10GbE Network” data point in experiment 10, which is about 44% better than in Experiment 9. This is because when all the limiters have been removed, data decompression plays a more significant role in the import performance.

 

Figure 7. Import/Export completion times.

 

Export Concurrency

In this last experiment, we assess the performance of Content Library in terms of network throughput when multiple users simultaneously export (download) an OVF template with a VMDK file size of 5.4GB. The content is redirected to a NULL device on the download client in order to factor out a potential bottleneck in the client storage stack. The library backing is an NFS filesystem mounted on the vCenter server.

Results

Figure 8 shows the aggregate export throughput (higher is better) as the number of concurrent export operations increases from 1 to 10, in four different scenarios depending on the network speed and the use of rhttpproxy. When export traffic goes through the rhttpproxy component, the speed of the network is irrelevant: we get exactly the same throughput (which saturates at around 90MB/s) with both the 1GbE and 10GbE networks. This once again confirms that rhttpproxy, due to CPU-intensive SSL data encryption, creates a bottleneck on the data transfer path.

When rhttpproxy is turned off, the download throughput increases until the link capacity is completely saturated (about 120MB/s), at least with a 1GbE network. Once again, administrators can trade off security for performance by disabling rhttpproxy as explained earlier.

When a 10GbE network is used, however, throughput saturates at around 450 MB/s instead of climbing all the way up to 1200MB/s (the theoretical capacity of a 10 Gbps Ethernet link).  This is because the data transfer path, when operating at higher rates, hits another bottleneck introduced by the checksum operation performed by the Transfer Agent to ensure data integrity. Generating a data checksum is another computationally intensive operation, even though not as heavy as data encryption.

 

Figure 8. Concurrent export throughput.

VMware vCenter 6.0 Performance Improvements and Best Practices

VMware vCenter Server 6.0 brings many performance improvements over previous vCenter versions, including:

  • Extensive improvements in throughput and latency.
  • vCenter Server Appliance (VCSA) parity with vCenter Server on Windows, for both inventory size and performance.

A new white paper, “VMware vCenter Server Performance and Best Practices,” illustrates these performance improvements and discusses important best practices for getting the best performance out of your vCenter Server environment.

This chart, taken from the white paper, shows the improvement in throughput over vCenter Server 5.5 at various inventory sizes:

Throughput improvement of vCenter Server 6.0 over vCenter Server 5.5 at various inventory sizes

Improvements in vCenter Server 6.0 Cluster Performance

VMware vCenter Server 6.0 brings significant improvements over previous vCenter Server versions with respect to cluster size and performance. vCenter Server 6.0 supports up to 64 ESXi hosts and 8,000 VMs in a single cluster. A new white paper, “VMware vCenter Server 6.0 Cluster Performance,” describes the improvements along several dimensions:

  • VMware vCenter 6.0 can support more hosts and more VMs in a cluster.
  • VMware vCenter 6.0 can support higher operational throughput in a cluster.
  • VMware vCenter 6.0 can support higher operational throughput with ESXi 6.0 hosts.
  • VMware vCenter 6.0 VCSA can support higher operational throughput in a cluster, compared to vCenter Server on Windows.

Here is a chart from the white paper summarizing one of the key improvements: operational throughput in a cluster:

vCenter Server 6.0 Cluster Performance Improvement

Project Capstone Shows Monster VM Performance

Project Capstone was put together a few weeks before VMworld 2015 with the goal of showing what is possible with Monster VMs today.  VMware worked with HP and IBM to put together an impressive setup using vSphere 6.0, an HP Superdome X, and an IBM FlashSystem array that was able to support running four 120-vCPU VMs simultaneously.  Putting these massive virtual machines under load, we found that performance was excellent, with great scalability and high throughput.

vSphere 6 was launched earlier this year and includes support for virtual machines with up to 128 virtual CPUs, a big increase from the 64 vCPUs supported in vSphere 5.5.  “Monster” virtual machines have a new upper limit, allowing customers to virtualize even the largest, most CPU-hungry systems.

The HP Superdome X used for the testing is an impressive system.  It has 16 Intel Xeon E7-2890v2 2.8 GHz processors.  Each processor has 15 cores and 30 logical threads when Hyper Threading is enabled. In total this is 240 cores / 480 threads.

An IBM FlashSystem array with 20TB of fast, low-latency storage was used for the Project Capstone configuration.  It provided extremely low latency throughout all testing and performed so well that storage was never a concern or an issue.  The FlashSystem was also extremely easy to set up and use: within 24 hours of it arriving in the lab, we were actively running four 120-vCPU VMs with sub-millisecond latency.

Large Oracle 12c database virtual machines running on Red Hat Enterprise Linux 6.5 were created and configured with 256GB of RAM, pvSCSI virtual disk adapters, and vmxnet3 virtual NICs.  The number of VMs and the number of vCPUs per VM were varied across the tests.

The workload used for the testing was DVD Store 3 (github.com/dvdstore/ds3).  DVD Store simulates a real online store with customers logging onto the site, browsing products and product reviews, rating products, and ultimately purchasing those products.  The benchmark is measured in Orders Per Minute, with each order representing a complete login, browsing, and purchasing process that includes many individual SQL operations against the database.

This large system with 240 cores / 480 threads, an extremely fast and large storage system, and vSphere 6 showed that excellent performance and scalability are possible even with many monster VMs.  Each configuration was first stressed by increasing the DVD Store workload until maximum throughput was achieved for a single virtual machine; in all cases this occurred at near CPU saturation.  The number of VMs was then increased so that the entire system was fully committed.  A whitepaper to be published soon will have the full set of test results, but here we show the results for four 120 vCPU VMs and sixteen 30 vCPU VMs.

Chart: DVD Store results for four 120-vCPU VMs

Chart: DVD Store results for sixteen 30-vCPU VMs

In both cases the performance of the system when fully loaded with either 4 or 16 virtual machines achieves about 90% of perfect linear scalability when compared to the performance of a single virtual machine.

In order to drive CPU usage to such high levels, all disk I/O must be very fast so that the system is not waiting for a response.  The IBM FlashSystem provided 0.3 ms average disk latency across all tests.  Total disk I/O was minimized for these tests to maximize CPU usage and throughput by configuring the database cache size to be equal to the database size.  Total disk I/O per second (IOPS) peaked at about 50k and averaged 20k while maintaining extremely low latency during the tests.

These test results show that it is possible to use vSphere 6 to successfully virtualize even the largest systems with excellent performance.

 

How to Efficiently Deploy Virtual Machines from VMware vSphere Content Library

by Joanna Guan and Davide Bergamasco

This post is the first of a series which aims to assess the performance of the VMware vSphere Content Library solution in various scenarios and provide vSphere Administrators with some ideas about how to set up a high performance Content Library environment. After providing an architectural overview of the Content Library components and inner workings, the post delves into the analysis and optimization of the most basic Content Library operation, i.e., the deployment of a virtual machine.

Introduction

The VMware vSphere Content Library empowers vSphere administrators to effectively and efficiently manage virtual machine templates, vApps, ISO images, and scripts. Specifically an administrator can leverage Content Library to:

  • Store and manage content from a central location;
  • Share content across boundaries of vCenter Servers;
  • Deploy virtual machine templates from the Content Library directly onto a host or cluster for immediate use.

Typically, a vSphere datacenter includes a multitude of vCenter Servers, ESXi hosts, networks, and datastores. In such an environment it can be time-consuming to clone or deploy a virtual machine from a source datastore to a destination datastore across ESXi hosts, vCenter Servers, and networks. Moreover, this problem is compounded by the fact that the size of virtual machines and other content keeps growing over time. The objective of Content Library is to address these issues by transferring large amounts of data in the most efficient way.

Architectural Overview

Content Library is composed of three main components which run on a vCenter server:

  • A Content Library Service, which organizes and manages content sitting on various storage locations;
  • A Transfer Service, which oversees the transfer of content across said storage locations;
  • A Database which stores all the metadata associated with the content (e.g., type of content, date of creation, author/vendor, etc.)

The architecture diagram in Figure 1 shows how the three components interact with each other and with other vCenter components, along with the control path (depicted as thin black lines) and data path (depicted as thick red lines).

Figure 1. VMware vSphere Content Library architecture

The Content Library Service implements the control plane that manages storage and handles content operations such as deployment, upload, download, and synchronization. The Transfer Service implements the data plane that is responsible for actual data transfers between content stores, which may be datastores attached to ESXi hosts, NFS file systems mounted on the vCenter Server, or remote HTTP(S) servers.

Data Transfer

The data transfer performance varies depending on the storage type and available connectivity. The Transfer Service can transfer data in two ways: streaming mode and direct copy mode. The diagram in Figure 2 shows how the two modes work in a data transfer between datastores.

Figure 2. Content Library Data Transfer Flows

If the source and destination hosts have direct connectivity, the Transfer Service asks vCenter to instruct the source host to directly copy the content to the target host. When this is not possible (e.g., if the two hosts are connected to two different vCenter Servers), streaming mode is used instead. In streaming mode the data flows through the Transfer Service itself. This involves one extra hop for the data, as well as compression and decompression of the VMDK disk files. Also, vCenter appliances are usually connected to a management network, which could become a bottleneck due to its limited bandwidth. For these reasons, direct copy mode typically has better performance than streaming mode.

Optimizing Virtual Machines Deployment

Having covered Content Library architecture and transfer modes, we can now discuss how to optimize its performance starting from the most basic operation, the deployment of a virtual machine. Deploying a virtual machine from Content Library creates a new virtual machine by cloning it from a template. We assess the performance of deployment operations by measuring their completion time. This metric is obviously the most visible and important one from an administrator’s perspective.

The experiments discussed in this blog post demonstrate how deployment performance is impacted by the Content Library backing storage configuration and provide some guidelines to help administrators choose the most appropriate configuration based on performance and cost tradeoffs.

Experimental Testbed

We used a total of three servers, one for running the vCenter Appliance and another two to create a cluster over which virtual machines were deployed from the Content Library. The following table summarizes the hardware and software specifications of the testbed.

vCenter Server Host
    Server: Dell PowerEdge R910
         CPUs: Four 6-core Intel® Xeon® E7530 @ 1.87 GHz, Hyper-Threading enabled
         Memory: 80GB
    Virtualization Platform: VMware vSphere 6.0 (RTM build # 2494585)
         VM Configuration: 16 vCPUs, 32GB of memory
         vCenter Appliance: VMware vCenter Server Appliance 6.0 (RTM build # 2562625)
ESXi Hosts
    Servers: Two Dell PowerEdge R610
         CPUs: Two 4-core Intel® Xeon® E5530 @ 2.40 GHz, Hyper-Threading enabled
         Memory: 32GB
    Virtualization Platform: VMware vSphere 6.0 (RTM build # 2494585)
     Storage Adapter: QLogic ISP2532 Dual-Port 8Gb Fibre Channel to PCI Express
     Network Adapter: QLogic NetXtreme II BCM5709 1000Base-T (Data Rate: 1Gbps)
  Storage Array: EMC VNX5700 Storage Array exposing two 20-disk RAID-5 LUNs with a capacity of 12TB each

Figure 3 illustrates the experimental testbed along with the data transfer flows for the various experiments. We ran a workload that consisted of deploying a virtual machine from a Content Library item onto a cluster. All experiments used the same 39GB OVF template. We conducted various experiments, based on the possible configurations of the source content store (the storage backing the Content Library) and the destination content store (the storage where the new virtual machine was deployed), as shown in the following table.

Experiment 1 An ESXi host is connected to a VAAI-capable storage array (VAAI stands for vStorage API for Array Integration and it is a technology which enables ESXi hosts to offload specific virtual machine and storage management operations to compliant storage hardware.)  Both the source and destination content stores are datastores residing on said array.
Experiment 2 An ESXi host is connected to the same datastores as in Experiment 1. However, these datastores are either hosted on a non-VAAI array or on two different arrays.
Experiment 3 An ESXi host is connected to the source datastore while a different host is connected to the destination datastore. The datastores are hosted on different arrays.
Experiment 4 The source content store is an NFS file system mounted on the vCenter Server, while the destination content store is a datastore hosted on a storage array.

 

Figure 3. Storage configurations and data transfer flows

Experimental Results

Figure 4 shows the results of the four experiments described above in terms of deployment duration (lower is better), while the following table summarizes the main observations for each experiment.

Experiment 1 The best performance was achieved in Experiment 1 (two datastores backed by a VAAI array). This was expected, as in this scenario the actual data transfer occurs internally to the storage array, without any involvement from the ESXi host. This is obviously the most efficient scenario from a deployment perspective.
Experiment 2 In Experiment 2, although the array is not VAAI-capable (or the datastores are hosted on two separate arrays), the source and the destination datastores are connected to the same ESXi host. This means the data transfer occurs through the 8 Gb/s Fibre Channel connection. This scenario is about 20% slower than Experiment 1.
Experiment 3 The scenario of Experiment 3 is significantly slower (about three times) than Experiment 1 because the datastores are attached to two different ESXi hosts. This causes the data transfer to go through the 1Gbps Ethernet connection. We also ran this experiment using a 10Gbps Ethernet network, and found that the deployment duration was similar to the one measured in Experiment 2. This suggests that the 1Gbps Ethernet connection is a significant bottleneck for this scenario.
Experiment 4 In the final scenario, Experiment 4, the template resides on an NFS file system mounted on the vCenter server. Because the template is stored in a compressed format on the NFS file system in order to save network bandwidth, its decompression on the vCenter server slows the data transfer quite noticeably. The network hops between the vCenter Server and the destination ESXi host may further slow the end-to-end data transfer. For these reasons, this scenario was about seven times slower than Experiment 1.  We also ran the same experiment using a 10Gbps network between the NFS server and the vCenter server and measured a completion time only slightly better than with the 1Gbps network (1260s vs. 1380s). Given that compression and decompression are CPU-heavy operations, using a faster network may result in only a marginal performance improvement.

 

Figure 4. Deployment completion time for the four storage configurations

Conclusions

This blog post explored how different Content Library backing storage configurations can affect the performance of a virtual machine deployment operation. The following guidelines may help an administrator in optimizing the Content Library performance for said operation based on the storage options at her/his disposal:

  1. If no other optimizations are possible, the Content Library should be at least backed by a datastore connected to one of the ESXi hosts (scenario of Experiment 3). Ideally a 10Gbps Ethernet connection should be employed.
  2. A better option is to have each ESXi host connected to both the source datastore (the one backing the Content Library) and the destination datastore(s) (the one(s) where the new virtual machine is being deployed). This is the scenario of Experiment 2.
  3. The best case is when all the ESXi hosts are connected to a VAAI-capable storage array and both the source and destination datastores reside on said array (Experiment 1).

 

Performance Best Practices for vSphere 6.0 is Available

We are pleased to announce the availability of Performance Best Practices for VMware vSphere 6.0. This is a book designed to help system administrators obtain the best performance from vSphere 6.0 deployments.

The book addresses many of the new features in vSphere 6.0 from a performance perspective. These include:

  • A new version of vSphere Network I/O Control
  • A new host-wide performance tuning feature
  • A new version of VMware Fault Tolerance (now supporting multi-vCPU virtual machines)
  • The new vSphere Content Library feature

We’ve also updated and expanded on many of the topics in the book. These include:

  • VMware vStorage APIs for Array Integration (VAAI) features
  • Network hardware considerations
  • Changes in ESXi host power management
  • Changes in ESXi transparent memory sharing
  • Using Receive Side Scaling (RSS) in virtual machines
  • Virtual NUMA (vNUMA) configuration
  • Network performance in guest operating systems
  • vSphere Web Client performance
  • VMware vMotion and Storage vMotion performance
  • VMware Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM) performance

The book can be found here: http://www.vmware.com/files/pdf/techpaper/VMware-PerfBest-Practices-vSphere6-0.pdf.

 

 

 

Large Receive Offload (LRO) Support for VMXNET3 Adapters with Windows VMs on vSphere 6

Large Receive Offload (LRO) is a technique for reducing the CPU time spent processing TCP packets that arrive from the network at a high rate. LRO reassembles incoming packets into larger (and therefore fewer) packets before delivering them to the network stack of the system. With fewer packets to process, less CPU time is spent on networking, and throughput can improve accordingly since more CPU is available to handle additional traffic. On Windows, LRO is also referred to as Receive Segment Coalescing (RSC).

LRO has been supported for Linux VMs with kernel 2.6.24 and later since vSphere 4. Now that Windows Server 2012 and Windows 8 also support LRO, vSphere 6 adds LRO support for VMXNET3 adapters in Windows VMs. LRO is especially beneficial in a virtualized environment in which resources are shared by multiple VMs. This blog shows the performance benefits of using LRO for Windows VMs running on vSphere 6.

Test-Bed Setup

The test bed consists of a vSphere 6.0 host running the VMs and a client machine that drives workload generation. Both machines have dual-socket, 6-core 2.9GHz Intel Xeon E5-2667 (Sandy Bridge) processors. The client machine runs native Red Hat Enterprise Linux 6 and generates the TCP flows. The VMs on the vSphere host run Windows Server 2012 and are configured with 4 vCPUs and 2GB of RAM. Each machine has an Intel 82599EB 10Gbps adapter installed, and the two adapters are connected through a 10 Gigabit Ethernet (GbE) switch.

Performance Results

1. Native to VM Traffic

Figures 1 and 2 show a CPU efficiency and throughput comparison with and without LRO when TCP streams are generated from the client machine to a VM running on the vSphere host. Netperf is used to generate traffic. Three different message sizes are used: 256 bytes, 16KB, 64KB. The socket size is set to 8K, 64K, and 256K respectively. The message size is used to determine the number of bytes that Netperf delivers to the TCP stack in the client machine, which then determines the actual packet sizes. The NIC in the client machine splits packets with a size larger than the MTU (1500 is used for this blog) into smaller MTU-sized ones before sending them out. Once packets are received in the vSphere host, LRO aggregates packets and delivers larger packets (but smaller in number) to the receiving VM running in the host. This process is done in either hardware (that is, the physical NIC) or software (that is, the vSphere networking stack) depending on the NIC type and configuration. In this blog, LRO is performed in the vSphere networking stack before packets are delivered to the VM.
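The exact Netperf command lines are not listed in the post, but a representative invocation for one of the data points above might look like the following sketch, assuming the VM's IP address is 10.0.0.10 (a placeholder), a 60-second test length, and the netserver daemon already running inside the VM:

     # TCP_STREAM test from the native client to the VM (16KB messages, 64KB sockets):
     #   -m       message size handed to the TCP stack by Netperf
     #   -s / -S  local and remote socket buffer sizes
     netperf -H 10.0.0.10 -t TCP_STREAM -l 60 -- -m 16384 -s 65536 -S 65536

Repeating the run with -m 256 -s 8192 -S 8192 and -m 65536 -s 262144 -S 262144 covers the other two message/socket size combinations used above.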

CPU efficiency is calculated by dividing the throughput by the number of CPU cores used for both the hypervisor and the VM, representing how much throughput in gigabits per second (Gbps) a single CPU core can receive. For example, a CPU efficiency of 5 means the system can handle 5Gbps with one core. Therefore, a higher CPU efficiency is desirable.

As shown in Figures 1 and 2, using LRO considerably improves both CPU efficiency and throughput with all three message sizes. Figure 1 shows that CPU efficiency improves by 86%, 25%, and 33% with 256 byte, 16KB, and 64KB messages respectively, when compared to the case without LRO. Figure 2 shows throughput improves 54% for 256 byte messages (0.6Gbps to 0.9Gbps) and 5% for 16KB messages (9.0Gbps to 9.4Gbps). There is not much difference for 64KB messages (9.5Gbps to 9.4Gbps, which is within the range of normal variance). With 16KB and 64KB messages, throughput with LRO is already at line rate (that is, 10Gbps), which is why the improvement is not as significant as the CPU efficiency improvement.

Figure 3 compares the packet rate and the average packet size between the messages with LRO and without LRO. They are measured right before packets are delivered to the VM (but after LRO is performed for the LRO configuration). It is clearly shown that fewer but larger packets are delivered to the VM with LRO. For example, with 64KB messages, the packet rate delivered to the VM decreases from 815K packets per second (pps) to 113Kpps with LRO, while the packet size increases from 1.5KB to 10.5KB. The number of interrupts generated for the guest also becomes smaller accordingly, helping to improve overall CPU efficiency and throughput.

When the client machine generates packets, those with a size larger than the MTU are split into smaller MTU-sized packets before being sent out. With a larger message size, more MTU-sized packets are produced and the packet rate received by the vSphere host increases accordingly. This is why the packet rate becomes higher for 16KB and 64KB messages than 256 byte messages without LRO in Figure 3. LRO aggregates the received packets before they hit the VM so the packet rate remains low regardless of the message size in the figure.

Figure 1. CPU efficiency comparison with and without LRO with Native-VM traffic

Figure 2. Throughput comparison with and without LRO with Native-VM traffic

Figure 3. Packet rate and size comparison with and without LRO with Native-VM traffic, when packets are delivered to the VM

2. VM to VM Local Traffic

LRO is also beneficial in VM-VM local traffic where VMs are located in the same host, communicating with each other through a virtual switch. Figures 4 and 5 depict CPU efficiency and throughput comparisons with and without LRO with two VMs on the same host sending and receiving TCP flows. The same message and socket sizes as the Native-VM tests above are used.

As shown in Figure 4, LRO improves CPU efficiency by 15%, 92%, and 90% with 256 byte, 16KB, and 64KB messages respectively, when compared to the case without LRO. Figure 5 shows throughput also improves: by 20% with 256 byte messages (0.8Gbps to 1.0Gbps), by 103% with 16KB messages (9.0Gbps to 18.4Gbps), and by 142% with 64KB messages (11.7Gbps to 28.4Gbps).

Without LRO, packets larger than the MTU must be split before being delivered to the receiving VM, just as with Native-VM traffic, because the receiver cannot handle such large packets. With VM-VM local traffic, the packet splitting also happens on the vSphere host, so LRO saves the time spent both splitting packets and receiving the resulting smaller packets. This explains why the improvement with 16KB and 64KB messages is higher than for Native-VM traffic. Note that the absolute CPU efficiency for VM-VM local traffic can be lower than for Native-VM traffic because the CPU time of both the sending and receiving VMs is included in the calculation.

As expected, the packet rate decreases while the average packet size increases with LRO, as shown in Figure 6. For example, with 64KB messages, the packet rate delivered to the VM drops from 1009Kpps to 240Kpps, while the average packet size grows from 1.6KB to 14.9KB.

The packet rate for 256 byte messages is higher with LRO, most likely because LRO reduces the round-trip time (RTT). Since the average packet size is similar with and without LRO for this message size, the higher packet rate translates directly into higher throughput and, correspondingly, better CPU efficiency, as seen in Figures 4 and 5.

Figure 4. CPU efficiency with and without LRO with VM-VM Local traffic

Figure 5. Throughput comparison with and without LRO with VM-VM Local traffic

Figure 6. Packet rate and size comparison with and without LRO with VM-VM Local traffic, when packets are delivered to the VM

Enable or Disable LRO on a VMXNET3 Adapter on a Windows VM

LRO is enabled by default for VMXNET3 adapters on vSphere 6.0 hosts, but for Windows 8 and Windows Server 2012 VMs you must also enable receive segment coalescing (RSC) globally in the guest. For more information about configuring this, see the documentation.
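As a rough illustration only, the sketch below wraps the guest-side netsh commands for inspecting and toggling the global RSC state in a small Python helper (run from an elevated prompt inside the Windows VM). The exact commands can vary by Windows version, so treat them as an assumption to verify against the Windows and vSphere documentation rather than as the authoritative procedure.

import subprocess

def show_rsc_state():
    # 'netsh int tcp show global' lists the global TCP settings, including the RSC state.
    subprocess.run(["netsh", "int", "tcp", "show", "global"], check=True)

def set_rsc(enabled):
    # Enable or disable RSC globally in the guest; requires administrator privileges.
    state = "enabled" if enabled else "disabled"
    subprocess.run(["netsh", "int", "tcp", "set", "global", "rsc=" + state], check=True)

if __name__ == "__main__":
    show_rsc_state()
    # set_rsc(True)  # uncomment to enable RSC globally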

Conclusion

This blog shows that enabling LRO for Windows Server 2012 and Windows 8 VMs on a vSphere host using VMXNET3 considerably enhances the CPU efficiency and correspondingly improves throughput for TCP traffic.

Running Transactional Workloads Using Docker Containers on vSphere 6.0

In a series of blogs, we showed that not only can Docker containers run seamlessly inside vSphere 6.0 VMs, but micro-benchmarks and popular workloads in such configurations also perform as well as, and in some cases better than, they do in native Docker configurations.

See our previous blog posts in this series for our past findings.

In this blog, we study a transactional database workload and present the results on how Docker containers perform in a VM when we scale out the number of database instances. To do this experiment, we use DVD Store 2.1, which is an OLTP benchmark that supports and stresses many different back-end databases including Microsoft SQL Server, Oracle Database, MySQL, and PostgreSQL. This benchmark is open source and the latest version 2.1 is available here. It is a 3-tier application with a Web server, an application server, and a backend database server. The benchmark simulates a DVD store, where customers log in, browse, and order DVD products. The tool is designed to utilize a number of advanced database features including transactions, stored procedures, triggers, and referential integrity. The main transactions are (1) add new customers, (2) log in customers, (3) browse DVDs, (4) enter purchase orders, and (5) re-order stock. The client driver is written in C# and is usually run from Windows; however, the client can be run on Linux using the Mono framework. The primary performance metric of the benchmark is orders per minute (OPM).
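To make the transactional nature of the workload concrete, here is a minimal sketch of what a purchase-style transaction against PostgreSQL might look like from Python using psycopg2. The table and column names are invented for illustration and are not the actual DVD Store 2.1 schema; the real benchmark drives its load from the C# client driver described above.

import psycopg2

# Hypothetical connection settings and schema; not the actual DVD Store 2.1 layout.
conn = psycopg2.connect(host="localhost", dbname="ds2", user="ds2user", password="ds2pass")

def place_order(customer_id, product_id, quantity):
    # Record an order and decrement stock atomically, OLTP-style.
    with conn:  # commits on success, rolls back on exception
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE inventory SET quantity = quantity - %s "
                "WHERE product_id = %s AND quantity >= %s",
                (quantity, product_id, quantity),
            )
            if cur.rowcount == 0:
                raise RuntimeError("insufficient stock")
            cur.execute(
                "INSERT INTO orders (customer_id, product_id, quantity) "
                "VALUES (%s, %s, %s)",
                (customer_id, product_id, quantity),
            )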

In our experiments, we used a PostgreSQL database with an Apache Web server, and the application logic was implemented in PHP. In all tests we ran 16 instances of DVD Store, where each instance comprises all three tiers. We found that, due to better scheduling, running Docker containers in VMs in this scale-out scenario can provide better throughput than running the same Docker containers on a native system.

Next, we present the configurations, benchmarks, detailed setup, and the performance results.

Deployment Scenarios

We compare four different scenarios, as illustrated below:

–   Native: Linux OS running directly on hardware (Ubuntu 14.04.1)

–   vSphere VM: vSphere 6.0 with VMFS5, in 8 VMs, each with the same guest OS as native

–   Native-Docker: Docker 1.5 running on a native OS (Ubuntu 14.04.1)

–   VM-Docker: Docker 1.5 running in each of 8 VMs on a vSphere host

In each configuration, all of the power management features were disabled in the BIOS.

Hardware/Software/Workload Configuration

Figure 1 below shows the hardware setup for the server host. We used Ubuntu 14.04.1 with Docker 1.5 for all our experiments. In the Docker configurations, we used bridged networking and host volumes for storing the database.

Figure 1. Hardware/software configuration

Performance Results

We ran 16 instances of DVD Store, where each instance was running an Apache web server, PHP application logic, and a PostgreSQL database. In the Docker cases, we ran one instance of DVD Store per Docker container. In the non-Docker cases, we used the Virtual Hosts functionality of Apache to run multiple instances of the Web server listening on different ports, and we used the PostgreSQL command line to create multiple instances of the database server, also listening on different ports. In the VM-based experiments, we partitioned the host hardware among 8 VMs, each running 2 DVD Store instances. Together, the 8 VMs exactly committed the host's CPUs and under-committed its memory.
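To give a concrete picture of this scale-out, the sketch below shows one way to launch several single-instance DVD Store containers with bridged networking, per-instance port mappings, and host volumes for the database files, mirroring the setup described above. The image name, ports, and volume paths are hypothetical placeholders rather than our actual configuration.

import subprocess

# Hypothetical image bundling Apache, PHP, and PostgreSQL for one DVD Store instance.
IMAGE = "dvdstore-lamp-postgres"
NUM_INSTANCES = 16
BASE_HTTP_PORT = 8000

for i in range(NUM_INSTANCES):
    subprocess.run([
        "docker", "run", "-d",
        "--name", f"ds2-{i}",
        # Bridged networking is Docker's default; each instance gets its own host port.
        "-p", f"{BASE_HTTP_PORT + i}:80",
        # Host volume so the PostgreSQL data lives on the host filesystem.
        "-v", f"/data/ds2/{i}:/var/lib/postgresql/data",
        IMAGE,
    ], check=True)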

The four configurations for our experiments are listed below.

Configurations:

  • Native-16S: 16 instances of DVD Store running natively (16 separate instances of Apache 2 using virtual hosts and 16 separate instances of PostgreSQL database)
  • Native-Docker-16S: 16 Docker containers running on a native machine with each running one instance of DVD Store.
  • VM-8VMs-16S: Eight 4-vCPU VMs each running 2 DVD Store instances
  • VM-Docker-8VMs-16S: Eight 4-vCPU VMs each running 2 Docker containers,  where each Docker container is running one instance of DVD Store

We ran the DVD Store benchmark for the 4 configurations using 16 client drivers, where each driver process was running 4 threads. The results for these 4 configurations are shown in the figure below.

Figure 2. DVD Store performance for different configurations

In the chart above, the y-axis shows the aggregate DVD Store performance metric, orders per minute (OPM), for all 16 instances, normalized to the native configuration, where we measured about 126,000 orders per minute. The chart shows that the VM configurations achieve higher throughput than the corresponding native configurations. As in our prior blogs, this is due to vSphere's better NUMA-aware scheduling. We can also see that running Docker containers, either natively or in VMs, adds little overhead (2-4%).

To find out why the native configurations were not doing better, we pinned half of the Docker containers to one NUMA node and half to the other. As expected, the aggregate DVD Store OPM improved and ended up slightly better than in the VM configuration. However, manually pinning processes to cores or sockets is usually not a recommended practice because it is error-prone and can, in general, lead to unexpected or suboptimal results.
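We do not spell out here exactly how the containers were pinned, but one common way to run this kind of experiment is Docker's CPU affinity option, shown in the hypothetical sketch below. In recent Docker releases the option is --cpuset-cpus (older releases used --cpuset), and the core ranges here are placeholders for a two-socket host with 16 logical CPUs per NUMA node.

import subprocess

# Hypothetical core layout: NUMA node 0 = CPUs 0-15, NUMA node 1 = CPUs 16-31.
NODE_CPUSETS = ["0-15", "16-31"]
NUM_CONTAINERS = 16
IMAGE = "dvdstore-lamp-postgres"   # assumed image name, as in the earlier sketch

for i in range(NUM_CONTAINERS):
    cpuset = NODE_CPUSETS[i % 2]   # alternate containers between the two NUMA nodes
    subprocess.run([
        "docker", "run", "-d",
        "--name", f"ds2-pinned-{i}",
        "--cpuset-cpus", cpuset,   # restrict the container to one NUMA node's cores
        IMAGE,
    ], check=True)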

Summary

In this blog, we showed that running a PostgreSQL transactional database in a Docker container in a vSphere VM adds very little performance cost compared to running it directly in the VM. We also found that running Docker containers in a set of 8 VMs achieves slightly better throughput than running the same Docker containers natively with an out-of-the-box configuration. This is further proof that VMs and Docker containers are truly “better together.”