by Qasim Ali
In a previous VROOM post, we showed that running Redis inside Docker containers on vSphere adds little to no overhead, and we observed sizeable performance improvements when scaling out the application compared to running containers directly on the native hardware. This post analyzes scaling Web 2.0 applications using Docker containers on vSphere and compares the performance of Docker containers running on native hardware and on vSphere. This study shows that Docker containers add negligible overhead when run on vSphere, and also that the performance using virtual machines is very close to native and, in certain cases, slightly better due to better vSphere scheduling and isolation.
Web 2.0 applications are an integral part of enterprise and small business IT offerings. For our study, we use the CloudStone benchmark, which simulates typical Web 2.0 technology use in the workplace. It includes a Web 2.0 social-events application (Olio) and a client implemented using the Faban workload generator. It is an open source benchmark that simulates activities related to social events. The benchmark consists of three main components: a Web server, a database backend, and a client to emulate real-world accesses to the Web server. The overall architecture of CloudStone is depicted in Figure 1.
Figure 1: CloudStone architecture
The benchmark reports latency for various user actions. These metrics were compared against a fixed threshold. Studies indicate that users are less likely to visit a Web site if the response time is greater than 250 milliseconds. This number can be used as an upper bound on latency for the frequent operations (Home-Page, TagSearch, EventDetail, and Login). For the less frequent operations (AddEvent, AddPerson, and PersonDetail), a less restrictive threshold of 500 milliseconds can be used. Table 1 shows the exact mix/frequency of various operations.
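The threshold rule above can be expressed as a small check. This is a minimal sketch, not part of the CloudStone or Faban tooling: the function names are hypothetical, and any latency values passed in are placeholders supplied by the caller rather than measurements from this study.

```python
from typing import Dict, List

# Operation groups and thresholds as described in the text.
FREQUENT_OPS = {"Home-Page", "TagSearch", "EventDetail", "Login"}
INFREQUENT_OPS = {"AddEvent", "AddPerson", "PersonDetail"}


def threshold_ms(operation: str) -> int:
    """Return the latency threshold (in ms) that applies to an operation."""
    if operation in FREQUENT_OPS:
        return 250
    if operation in INFREQUENT_OPS:
        return 500
    raise ValueError(f"unknown operation: {operation}")


def violations(latency_ms: Dict[str, float]) -> List[str]:
    """Return the operations whose latency exceeds their threshold, sorted by name."""
    return sorted(
        op for op, latency in latency_ms.items() if latency > threshold_ms(op)
    )
```

A latency of 300 ms would therefore count as a violation for Login but not for AddEvent.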
|Operation|Number of Operations|Mix|
Table 1: CloudStone operations frequency for 1500 users
Benchmark Components and Experimental Setup
The test system was installed with a CloudStone implementation of a MySQL database, NGINX Web server with PHP scripts, and a Tomcat application server provided by the Faban harness. The default configuration was used for the workload generator. All components of the application ran on a single host, and the client ran in a separate virtual machine on a separate host. Both hosts were connected using a direct link between a pair of 10Gbps NICs. One client-server pair provided a single, independent CloudStone instance. Scaling was achieved by running additional instances of CloudStone.
We used the following three deployment scenarios for this study:
- Native-Docker: One or more CloudStone instances were run inside Docker containers (2 containers per CloudStone instance: one for the Web server and another for the database backend) running on the native OS.
- VM: CloudStone instances were run inside one or more virtual machines running on vSphere 6.0; the guest OS is the same as the native scenario.
- VM-Docker: CloudStone instances were run inside Docker containers that were running inside one or more virtual machines.
The following are the details about the hardware and software used in the various experiments discussed in the next section:
- Dell PowerEdge R820
- CPU: 4 x Intel® Xeon® CPU E5-4650 @ 2.30GHz (32 cores, 64 hyper-threads)
- Memory: 512GB
- Hardware configuration: Hyper-Threading (HT) ON, Turbo-boost ON, Power policy: Static High (that is, no power management)
- Network: 10Gbps
- Storage: 7 x 250GB 15K RPM 4Gb SAS Disks
- Dell PowerEdge R710
- CPU: 2 x Intel® Xeon® CPU X5680 @ 3.33GHz (12 cores, 24 hyper-threads)
- Memory: 144GB
- Hardware configuration: HT ON, Turbo-boost ON, Power policy: Static High (that is, no power management)
- Network: 10Gbps
- Client VM: 2-vCPU 4GB vRAM
- Ubuntu 14.04.1
- Kernel 3.13
- Docker 1.2
- Ubuntu 14.04.1 base image
- Host volumes for database and images
- Configured with host networking to avoid Docker NAT overhead
- Device mapper as the storage backend driver
- VMware vSphere 6.0 (pre-release build)
- Single VM: An 8-vCPU 4GB VM (Web server and database running in a single VM)
- Two VMs: One 6-vCPU 2GB Web server VM and one 2-vCPU 2GB database VM (CloudStone instance running in two VMs)
- Scale-out: Eight 8-vCPU 4GB VMs
- The NGINX Web server was configured with 4 worker processes and 4096 connections per worker.
- PHP was configured with a maximum number of 16 child processes.
- The Web server and the database were preconfigured with 1500 users per CloudStone instance.
- A runtime of 30 minutes with 5-minute ramp-up and ramp-down periods was used (run-to-run variation was less than 1%).
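To make the Docker settings in the list above concrete (host networking and host volumes, with two containers per CloudStone instance), the following sketch builds the corresponding `docker run` argument lists. The image names, container names, and directory paths are hypothetical placeholders, not the ones used in this study. The storage driver is a daemon-level setting and is not shown here.

```python
from typing import List


def docker_run_argv(name: str, image: str, host_dir: str, container_dir: str) -> List[str]:
    """Build a `docker run` command line with host networking and one host volume."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "--net=host",                        # share the host network stack, no Docker NAT
        "-v", f"{host_dir}:{container_dir}", # bind-mount a host directory into the container
        image,
    ]


def instance_commands(instance_id: int) -> List[List[str]]:
    """Commands for one CloudStone instance: a Web server and a database container.

    All names and paths below are illustrative assumptions.
    """
    web = docker_run_argv(
        f"olio-web-{instance_id}", "example/olio-web",
        f"/srv/olio/{instance_id}/images", "/var/olio/images",
    )
    db = docker_run_argv(
        f"olio-db-{instance_id}", "example/olio-db",
        f"/srv/olio/{instance_id}/mysql", "/var/lib/mysql",
    )
    return [web, db]
```

With host networking, containers that share a host also share its port space, so running several instances on one OS would additionally require distinct listening ports, which this sketch does not handle.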
First, we ran a single instance of CloudStone in the various configurations mentioned above. This was meant to determine the raw overhead of Docker containers running on vSphere vs. the native configuration, eliminating scheduling differences. Second, we picked the configuration that performed best in a single instance and scaled it out to run multiple instances.
Figure 2 shows the mean latency of the most frequent operations and Figure 3 shows the mean latency of less frequent operations.
Figure 2: Results of single instance CloudStone experiments for frequently used operations
Figure 3: Results of single instance CloudStone experiments for less frequently used operations
We configured the benchmark to use a single VM and deployed the Web server and database applications in it (the configuration labelled VM-1VM in Figures 2 and 3). We then ran the same workload in Docker containers in a single VM (VMDocker-1VM). The latencies are slightly higher than native, which is expected due to some virtualization overhead. However, running Docker containers in a VM showed no additional overhead. In fact, it seems to perform slightly better. We believe this might be due to the device mapper using twice as much page cache as the VM (device mapper uses a loopback device to mount the file system, so data ends up being cached twice in the buffer cache). We also tried AuFS as a storage backend for our container images, but that seemed to add some CPU and latency overhead, so we switched to device mapper.
We then configured the VM to use the vSphere Latency Sensitivity feature (VM-1VM-lat and VMDocker-1VM-lat labels). As expected, this configuration reduced the latencies even further because each vCPU got exclusive access to a core, which reduced scheduling overhead. However, this feature cannot be used when the VM (or VMs) has more vCPUs than the number of cores available on the system (that is, when the physical CPUs are over-committed), because each vCPU needs exclusive access to a core.
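The constraint on exclusive core access can be checked with simple arithmetic. The sketch below encodes only the rule as stated in the text (total vCPUs must not exceed the available physical cores); the function name is hypothetical, and the 32-core figure assumes the VMs run on the PowerEdge R820 listed earlier.

```python
from typing import Sequence


def can_give_exclusive_cores(vcpus_per_vm: Sequence[int], physical_cores: int) -> bool:
    """True if every vCPU of every VM can be given its own physical core."""
    return sum(vcpus_per_vm) <= physical_cores


# Assumed host: 32 physical cores (the R820 configuration listed earlier).
HOST_CORES = 32

single_vm = can_give_exclusive_cores([8], HOST_CORES)       # one 8-vCPU VM
scale_out = can_give_exclusive_cores([8] * 8, HOST_CORES)   # eight 8-vCPU VMs
```

Under these assumptions, a single 8-vCPU VM fits comfortably, while eight 8-vCPU VMs (64 vCPUs) would over-commit the 32 cores.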
Next, we configured the workload to use two VMs, one for the Web server and the other for the database application. This configuration ended up giving slightly higher latencies because the network packets have to traverse the virtualization layer from one VM to the other, while in the prior experiments they were confined within the same VM.
Finally, we scaled out the CloudStone workload to 12,000 users by using eight 8-vCPU VMs with 1500 users per instance. The VM configurations were the same as in the VM-1VM and VMDocker-1VM cases above. The average system CPU core utilization was around 70-75%, which is typical for latency-sensitive workloads because it leaves headroom to absorb traffic bursts. Figure 4 reports the mean latencies of all operations (averaged across all eight instances of CloudStone for each operation), while Figure 5 reports the 90th percentile latencies (the benchmark reports these at 20-millisecond granularity, as is evident from Figure 5).
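The two aggregations described above can be sketched as follows. This is an illustration only: the nearest-rank percentile and the rounding up to a 20 ms bucket boundary are assumptions about how such a report could be produced, not a description of how the benchmark harness computes it, and the function names are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, List


def mean_across_instances(per_instance: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each operation's mean latency across all CloudStone instances."""
    operations = per_instance[0].keys()
    return {op: mean(inst[op] for inst in per_instance) for op in operations}


def p90_bucketed(samples_ms: List[float], bucket_ms: int = 20) -> int:
    """Nearest-rank 90th percentile, rounded up to a multiple of bucket_ms."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) using integer arithmetic
    value = ordered[rank - 1]
    return int(math.ceil(value / bucket_ms) * bucket_ms)
```

Bucketing explains why percentile values in a chart like Figure 5 fall on multiples of 20 ms, whereas the means in Figure 4 can take any value.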
Figure 4: Scale-out experiments using eight instances of CloudStone (mean latency)
Figure 5: Scale-out experiments using eight instances of CloudStone (90th percentile latency)
The latencies shown in Figures 4 and 5 are well below the 250-millisecond threshold. We observed that the latencies on vSphere are very close to native or, in certain cases, slightly better than native (for example, the Login, AddPerson, and AddEvent operations). The latencies were better than native due to better vSphere scheduling and isolation, resulting in better cache/memory locality. We verified this by pinning container instances to specific sockets, which made the native scheduler behave similarly to vSphere. After doing that, we observed that latencies in the native case improved and were similar to or slightly better than those on vSphere.
Note: Introducing artificial affinity between processes and cores is not a recommended practice because it is error-prone and can, in general, lead to unexpected or suboptimal results.
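For illustration only, the kind of instance-to-socket mapping used in the verification experiment above could be computed as in the sketch below. The post does not say how the pinning was done, so this is a guess at the bookkeeping involved. It assumes logical CPUs are numbered contiguously per socket, which is often not the case on real systems, so the actual topology should be checked before using such a mapping. The function names are hypothetical.

```python
def socket_for_instance(instance: int, n_sockets: int) -> int:
    """Spread instances across sockets round-robin."""
    return instance % n_sockets


def cpu_range(socket: int, logical_cpus_per_socket: int) -> str:
    """CPU list string for one socket, ASSUMING contiguous CPU numbering per socket."""
    start = socket * logical_cpus_per_socket
    end = start + logical_cpus_per_socket - 1
    return f"{start}-{end}"


# Assumed topology: 4 sockets x 16 logical CPUs (64 hyper-threads in total).
N_SOCKETS = 4
CPUS_PER_SOCKET = 16

pinning = {
    i: cpu_range(socket_for_instance(i, N_SOCKETS), CPUS_PER_SOCKET)
    for i in range(8)  # eight CloudStone instances
}
```

With these assumptions, two instances share each socket, so the containers of an instance stay on one socket and benefit from its cache and local memory.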
VMs and Docker containers are truly “better together.” The CloudStone scale-out system, using out-of-the-box VM and VM-Docker configurations, clearly achieves very close to, or slightly better than, native performance.
- W. Sobel, S. Subramanyam, A. Sucharitakul, J. Nguyen, H. Wong, A. Klepchukov, S. Patil, A. Fox and D. Patterson, "CloudStone: Multi-Platform, Multi-Language Benchmark and Measurement Tools for Web 2.0," 2008.
- N. Grozev, "Automated CloudStone Setup in Ubuntu VMs / Advanced Automated CloudStone Setup in Ubuntu VMs [Part 2]," 2 June 2014. https://nikolaygrozev.wordpress.com/tag/cloudstone/.
- java.net, "Faban Harness and Benchmark Framework," 11 May 2014. http://java.net/projects/faban/.
- S. Lohr, "For Impatient Web Users, an Eye Blink Is Just Too Long to Wait," 29 February 2012. http://www.nytimes.com/2012/03/01/technology/impatient-web-usersflee-slow-loading-sites.html.
- J. Heo, "Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5," 18 September 2013. http://blogs.vmware.com/performance/2013/09/deploying-extremely-latency-sensitive-applications-in-vmware-vsphere-5-5.html.