Home > Blogs > VMware VROOM! Blog > Author Archives: Qasim Ali

Author Archives: Qasim Ali

Qasim Ali

About Qasim Ali

Qasim Ali is a Staff Engineer at VMware R&D in the Performance team. His interests include hypervisor performance, optimizing resource management, e.g., power, memory and CPU. Mr. Ali graduated with a PhD in Computer Engineering from Purdue University, West Lafayette, Indiana in 2009.

Scaling Web 2.0 Applications using Docker containers on vSphere 6.0

by Qasim Ali

In a previous VROOM post, we showed that running Redis inside Docker containers on vSphere adds little to no overhead and observed sizeable performance improvements when scaling out the application when compared to running containers directly on the native hardware. This post analyzes scaling Web 2.0 applications using Docker containers on vSphere and compares the performance of running Docker containers on native and vSphere. This study shows that Docker containers add negligible overhead when run on vSphere, and also that the performance using virtual machines is very close to native and, in certain cases, slightly better due to better vSphere scheduling and isolation.

Web 2.0 applications are an integral part of Enterprise and small business IT offerings. We use the CloudStone benchmark, which simulates a typical Web 2.0 technology use in the workplace for our study [1] [2]. It includes a Web 2.0 social-events application (Olio) and a client implemented using the Faban workload generator [3]. It is an open source benchmark that simulates activities related to social events. The benchmark consists of three main components: a Web server, a database backend, and a client to emulate real world accesses to the Web server. The overall architecture of CloudStone is depicted in Figure 1.

Figure 1: CloudStone architecture

The benchmark reports latency for various user actions. These metrics were compared against a fixed threshold. Studies indicate that users are less likely to visit a Web site if the response time is greater than 250 milliseconds [4]. This number can be used as an upper bound for latency for frequent operations (Home-Page, TagSearch, EventDetail, and Login). For the less frequent operations (AddEvent, AddPerson, and PersonDetail), a less restrictive threshold of 500 milliseconds can be used. Table 1 shows the exact mix/frequency of various operations.

Operation Number of Operations Mix
HomePage 141908 26.14%
Login 55473 10.22%
TagSearch 181126 33.37%
EventDetail 134144 24.71%
PersonDetail 14393 2.65%
AddPerson 4662 0.86%
AddEvent 11087 2.04%

Table 1: CloudStone operations frequency for 1500 users

Benchmark Components and Experimental Set up

The test system was installed with a CloudStone implementation of a MySQL database, NGINX Web server with PHP scripts, and a Tomcat application server provided by the Faban harness. The default configuration was used for the workload generator. All components of the application ran on a single host, and the client ran in a separate virtual machine on a separate host. Both hosts were connected using a direct link between a pair of 10Gbps NICs. One client-server pair provided a single, independent CloudStone instance. Scaling was achieved by running additional instances of CloudStone.

Deployment Scenarios

We used the following three deployment scenarios for this study:

  • Native-Docker: One or more CloudStone instances were run inside Docker containers (2 containers per CloudStone instance: one for the Web server and another for the database backend) running on the native OS.
  • VM: CloudStone instances were run inside one or more virtual machines running on vSphere 6.0; the guest OS is the same as the native scenario.
  • VM-Docker: CloudStone instances were run inside Docker containers that were running inside one or more virtual machines.

Hardware/Software/Workload Configuration

The following are the details about the hardware and software used in the various experiments discussed in the next section:

Server Host:

  • Dell PowerEdge R820
  • CPU: 4 x Intel® Xeon® CPU E5-4650 @ 2.30GHz (32 cores, 64 hyper-threads)
  • Memory: 512GB
  • Hardware configuration: Hyper-Threading (HT) ON, Turbo-boost ON, Power policy: Static High (that is, no power management)
  • Network: 10Gbps
  • Storage: 7 x 250GB 15K RPM 4Gb SAS  Disks

Client Host:

  • Dell PowerEdge R710
  • CPU: 2 x Intel® Xeon® CPU X5680 @ 3.33GHz (12 cores, 24 hyper-threads)
  • Memory: 144GB
  • Hardware configuration: HT ON, Turbo-boost ON, Power policy: Static High (that is, no power management)
  • Network: 10Gbps
  • Client VM: 2-vCPU 4GB vRAM

Host OS:

  • Ubuntu 14.04.1
  • Kernel 3.13

Docker Configuration:

  • Docker 1.2
  • Ubuntu 14.04.1 base image
  • Host volumes for database and images
  • Configured with host networking to avoid Docker NAT overhead
  • Device mapper as the storage backend driver

ESXi:

  • VMware vSphere 6.0 (pre-release build)

VM Configurations:

  • Single VM: An 8-vCPU 4GB VM ( Web server and database running in a single VM)
  • Two VMs: One 6-vCPU 2GB Web server VM and one 2-vCPU 2GB database VM (CloudStone instance running in two VMs)
  • Scale-out: Eight 8-vCPU 4GB VMs

Workload Configurations:

  • The NGINX Web server was configured with 4 worker processes and 4096 connections per worker.
  • PHP was configured with a maximum number of 16 child processes.
  • The Web server and the database were preconfigured with 1500 users per CloudStone instance.
  • A runtime of 30 minutes with a 5 minute ramp-up and ramp-down periods (less than 1% run-to-run  variation) was used.

Results

First, we ran a single instance of CloudStone in the various configurations mentioned above. This was meant to determine the raw overhead of Docker containers running on vSphere vs. the native configuration, eliminating scheduling differences. Second, we picked the configuration that performed best in a single instance and scaled it out to run multiple instances.

Figure 2 shows the mean latency of the most frequent operations and Figure 3 shows the mean latency of less frequent operations.

Figure 2: Results of single instance CloudStone experiments for frequently used operations

 

Figure 3: Results of single instance CloudStone experiments for less frequently used
operations

We configured the benchmark to use a single VM and deployed the Web server and database applications in it (configuration labelled VM-1VM in Figure 2 and 3). We then ran the same workload in Docker containers in a single VM (VMDocker-1VM). The latencies are slightly higher than native, which is expected due to some virtualization overhead. However, running Docker containers on a VM showed no additional overhead. In fact, it seems to be slightly better. We believe this might be due to the device mapper using twice as much page cache as the VM (device mapper uses a loopback device to mount the file system and, hence, data ends up being cached twice in the buffer cache). We also tried AuFS as a storage backend for our container images, but that seemed to add some CPU and latency overhead, and, for this reason, we switched to device mapper. We then configured the VM to use the vSphere Latency Sensitivity feature [5] (VM-1VM-lat and VMDocker-1VM-lat labels). As expected, this configuration reduced the latencies even further because each vCPU got exclusive access to a core and this reduced scheduling overhead.  However, this feature cannot be used when the VM (or VMs) has more vCPUs than the number of cores available on the system (that is, the physical CPUs are over-committed) because each vCPU needs exclusive access to a core.

Next, we configured the workload to use two VMs, one for the Web server and the other for the database application. This configuration ended up giving slightly higher latencies because the network packets have to traverse the virtualization layer from one VM to the other, while in the prior experiments they were confined within the same VM.

Finally, we scaled out the CloudStone workload with 12,000 users by using eight 8-vCPU VMs with 1500 users per instance. The VM configurations were the same as the VM-1VM and VMDocker-1VM cases above. The average system CPU core utilization was around 70-75%, which is the typical average CPU utilization for latency sensitive workloads because it allows for headroom to absorb traffic bursts. Figure 4 reports mean latencies of all operations (latencies were averaged across all eight instances of CloudStone for each operation), while Figure 5 reports the 90th percentile latencies (the benchmark reports these latencies in 20 millisecond granularity as evident from Figure 5.)

Scale-out-meanlatency

Figure 4: Scale-out experiments using eight instances of CloudStone (mean latency)

Scale-out-90pctLatency

Figure 5: Scale-out experiments using eight instances of CloudStone (90th percentile latency)

The latencies shown in Figure 4 and 5 are well below the 250 millisecond threshold. We observed that the latencies on vSphere are very close to native or, in certain cases, slightly better than native (for example, Login, AddPerson and AddEvent operations). The latencies were better than native due to better vSphere scheduling and isolation, resulting in better cache/memory locality. We verified this by pinning container instances on specific sockets and making the native scheduler behavior similar to vSphere. After doing that, we observed that latencies in the native case got better and they were similar or slightly better than vSphere.

Note: Introducing artificial affinity between processes and cores is not a recommended practice because it is error-prone and can, in general, lead to unexpected or suboptimal results.

Conclusion

VMs and Docker containers are truly “better together.” The CloudStone scale-out system, using out-of-the-box VM and VM-Docker configurations, clearly achieves very close to, or slightly better than, native performance.

References

[1] W. Sobel, S. Subramanyam, A. Sucharitakul, J. Nguyen, H. Wong, A. Klepchukov, S. Patil, O. Fox and D. Patterson, “CloudStone: Multi-Platform, Multi-Languarge Benchmark and Measurement Tools for Web 2.8,” 2008.
[2] N. Grozev, “Automated CloudStone Setup in Ubuntu VMs / Advanced Automated CloudStone Setup in Ubuntu VMs [Part 2],” 2 June 2014. https://nikolaygrozev.wordpress.com/tag/cloudstone/.
[3] java.net, “Faban Harness and Benchmark Framework,” 11 May 2014. http://java.net/projects/faban/.
[4] S. Lohr, “For Impatient Web Users, an Eye Blink Is Just Too Long to Wait,” 29 February 2012. http://www.nytimes.com/2012/03/01/technology/impatient-web-usersflee-slow-loading-sites.html.
[5] J. Heo, “Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5,” 18 September 2013.   http://blogs.vmware.com/performance/2013/09/deploying-extremely-latency-sensitive-applications-in-vmware-vsphere-5-5.html.

 

 

 

VMware vSphere 5.5 Host Power Management (HPM) saves more power and improves performance

VMware recently released a white paper on the power and performance improvements in the Host Power Management (HPM) feature in vSphere 5.5. With new improvements in HPM, one can save significant power and gain decent performance in many common scenarios. The paper shows that power savings of up to 20% can be achieved in vSphere5.5 by using industry standard SPEC benchmarks. The paper also describes some of the best practices to follow when using HPM.

One experiment indicates that you can get around a 10% increase in performance in vSphere5.5 when deep C-states (greater than C1/halt, e.g., C3 and C6) are enabled along with turbo mode.

For more interesting results and data, please read the full paper

Note: HPM works at a single host level as opposed to DPM which works on a cluster of hosts

Host Power Management in vSphere 5

Host power management (HPM) on ESXi 5.0 saves energy by placing certain parts of a computer system or device into a reduced power state when the system or device is inactive or does not need to run at maximum speed. This feature can be used in conjunction with distributed power management (DPM), which redistributes VMs among physical hosts in a cluster to enable some hosts to be powered off completely.

The default power policy for HPM in vSphere 5 is “balanced,” which reduces host power consumption while having little or no impact on performance for most workloads. The balanced policy uses the processor’s P-states, which save power when the workloads running on the system do not require full CPU capacity.

A technical white paper has been published that describes:

  • What to adjust in your ESXi host’s BIOS settings to achieve the maximum benefit of HPM.
  • The different power policy options in ESXi 5.0 and how to set a custom policy.
  • Using esxtop to obtain and understand statistics related to HPM, including the ESXi host’s power usage, the processor’s P-states, and the effect of HPM on %USED and %UTIL.
  • An evaluation of the power that HPM can save while different power policies are enabled. The amount of power saved varies depending on the CPU load and the power policy.

For the full paper, see Host Power Management in VMware vSphere 5.

Note: Some performance-sensitive workloads might require the "high performance" policy.