
Monthly Archives: June 2008

Scaling real-life Web server workloads

In an earlier blog, we compared the performance aspects (such as latency, throughput and CPU resource utilization) of real-life web server workloads in a native environment and a virtualized data center environment. In this post, we focus on yet another important dimension of performance – scalability.

For our scalability evaluation tests, we used the widely deployed Apache/PHP as the Web serving platform. We used the industry-standard SPECweb2005 as the web server workload. SPECweb2005 consists of three workloads: banking, e-commerce, and support. The three workloads have vastly different characteristics, and we thus evaluated the results from all three.

First, we evaluated the scalability of the Apache/PHP Web serving platform in the native environment, with no virtualization, by varying the number of CPUs available at boot time. Note that in all these native configurations there was a conventional, single operating environment consisting of a single RHEL5 system image and a single Apache/PHP deployment. We applied all the well-documented performance tunings to the Apache/PHP configuration – for example, increasing the number of Apache worker processes, and using an opcode cache to improve PHP performance.
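For reference, tunings of this sort typically look like the following Apache 2.2 prefork-MPM sizing directives (the values shown are illustrative only, not the exact settings from our test configuration; the opcode-cache side is a separate PHP extension such as APC):

```apache
# httpd.conf (prefork MPM) – illustrative sizing only, not our exact values
<IfModule prefork.c>
    StartServers          64
    MinSpareServers       32
    MaxSpareServers      128
    ServerLimit         1024
    MaxClients          1024    # upper bound on concurrent worker processes
    MaxRequestsPerChild    0    # never recycle processes under steady load
</IfModule>
```

The general idea is to keep enough worker processes alive to absorb the benchmark's connection load without paying process-creation costs during the measurement interval.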

The figure below shows the scaling results for the SPECweb2005 workload in the native environment. The scaling curve plots the aggregate SPECweb2005 metric (a normalized metric based on the throughput scores obtained on all three workloads – banking, e-commerce, and support) as the number of processors was increased. In our test configuration, there were no bottlenecks in the hardware environment.


As shown in the figure, scalability was severely limited as we increased the number of processors. In the single-CPU configuration, we achieved processor utilization above 95%. But as we increased the number of processors, we failed to achieve such high utilization. Performance was limited by software serialization points in the Apache/PHP/SPECweb2005 software stack. Analysis using the Intel VTune performance analyzer confirmed increasing hot-spot contention as we increased the number of CPUs. For the same size workload of 1,800 banking sessions, the CPI (Cycles Per Instruction) jumped by a factor of roughly four as we increased the number of CPUs from three to eight, indicating a software scaling issue. As observed in our test configuration, such issues often show up as unacceptable latencies even when there are plenty of compute resources available on the system. More often than not, diagnosing and fixing these issues is not practical in the time available.

Most real-life web server workloads suffer from scalability issues such as those observed in our tests. To circumvent these issues, many businesses choose to deploy web server workloads on a multitude of one-CPU or two-CPU machines. However, such an approach leads to a proliferation of servers in the data center, resulting in higher costs in both power and space usage. Virtualization offers an easier alternative that avoids the software scaling issues while improving power and space efficiency: it enables several complex operating environments that do not scale easily on their own to run concurrently on a single physical machine and exploit the vast compute resources offered by today's power- and space-efficient multi-core systems. To quantify the effectiveness of this approach, we measured SPECweb2005 performance by deploying multiple Apache/PHP configurations in a virtual environment. We have submitted our test results to the SPEC committee, and they are under review.

In our virtualized tests, we configured the virtual machines in accordance with the general performance best practices recommended by VMware. Each VM was assigned one virtual CPU and 4 GB of memory. We then varied the number of simultaneously running virtual machines from one to six. We stopped at six because this workload is highly network-intensive, and ESX offloads some of the network processing to the other available cores. Stopping short of allocating virtual machines to all cores ensured that, with I/O-intensive workloads such as this one, ESX Server had enough resources for virtual machine scheduling, I/O processing, and other housekeeping tasks. The following figure compares the SPECweb2005 scaling results between the native and virtual environments.


As shown in the figure above, we observed good scaling in the virtual environment as we increased the number of virtual machines. The aggregate SPECweb2005 performance obtained in the tests with up to two virtual machines was slightly lower than that observed in the corresponding native configurations. However, as we increased the number of virtual machines further, the cumulative performance of the multi-VM configuration well exceeded the performance of the single native environment.

These results clearly demonstrate the benefit of using VMware Infrastructure to bypass software scalability limitations and improve overall efficiency when running real-life web server workloads.

To find out more about the test configuration, tuning information, and detailed results of all the individual SPECweb2005 workloads, check out our recently published performance study.

ESX scheduler support for SMP VMs: co-scheduling and more

ESX supports virtual machines configured with multiple virtual CPUs (for example, ESX 3.x supports up to 4 vCPUs). Handling mixed loads of uniprocessor and multiprocessor VMs can be challenging for a scheduler to get right. This article answers some common questions about deploying multiprocessor VMs, and describes the algorithms used by the ESX scheduler to provide both high performance and fairness.

When considering multiprocessor VMs, the following questions naturally arise for ESX users:

a) When should I configure multiple vCPUs for a VM?
b) What are the overheads of using multiprocessor VMs? What would I lose by overprovisioning vCPUs for VMs?
c) Does the ESX scheduler co-schedule all of the vCPUs belonging to a VM?
d) Why is co-scheduling necessary and important?
e) How does the ESX scheduler deal with some vCPUs of a VM idling while others actively perform work? Do the idle vCPUs unnecessarily burn CPU?

Let’s answer these questions briefly:

a) It makes sense to configure multiple vCPUs for a VM when:
    1. The application you intend to run within the VM is multi-threaded (Apache Web Server, MS Exchange 2007, etc.) and its threads can actually make good use of the additional processors (multiple threads can be active and running at the same time).
    2. Multiple single-threaded applications are intended to run simultaneously within the VM.

    Running one single-threaded application within a multiprocessor VM will not improve the performance of that application, since only one vCPU will be in use at any given time. Configuring additional vCPUs in such a case is unnecessary.

b) It's best to configure only as many virtual CPUs as the application needs to handle its load. In other words, don't overprovision vCPUs that aren't needed for additional application performance.

    Virtual machines configured with virtual CPUs that go unused do impose resource requirements on the ESX Server. In some guest operating systems, the unused virtual CPUs still take timer interrupts, which consumes a small amount of additional CPU. Please refer to KB articles 1077 and 1730.

c) For scheduling a VM with multiple vCPUs, ESX 2.x used a technique known as 'strict co-scheduling'. With strict co-scheduling, the scheduler keeps track of a "skew" value for each vCPU. A vCPU's skew increases when it is not making progress (running or idling) while at least one of its sibling vCPUs is making progress.

   When the skew of any vCPU in a VM exceeds a threshold, the entire VM is descheduled. The VM is rescheduled only when enough physical processors are available to accommodate all of the VM's vCPUs. Especially on a system with few cores running a mix of UP and SMP VMs, this can lead to CPU 'fragmentation', resulting in relatively lower overall system utilization. As an example, consider a two-core system running a single UP VM and a single two-vCPU SMP VM. While the UP VM's vCPU is scheduled, the other physical processor cannot be used to execute just one of the SMP VM's two vCPUs, so that physical CPU sits idle for that length of time.

   This co-scheduling algorithm was improved to a 'relaxed co-scheduling' scheme in ESX 3.x, wherein only the vCPUs that are skewed need to be scheduled, even when fewer physical processors are available than a skewed VM has vCPUs. This scheme increases the number of scheduling opportunities available to the scheduler and hence improves overall system throughput. Relaxed co-scheduling significantly reduces the possibility of co-scheduling fragmentation, improving overall processor utilization.

d) Briefly, co-scheduling (keeping the skew between the vCPUs' execution times within reasonable limits) is necessary so that the guest operating system and the applications within it run both correctly and with good performance. Significant skew between the vCPUs of a VM can result in severe performance and correctness issues.

    As an example, guest operating systems use spin locks for synchronization. If the vCPU currently holding a lock is descheduled, the other vCPUs belonging to the VM will burn cycles busy-waiting until the lock is released. Similar performance problems can also show up in multi-threaded user applications, which may perform some form of synchronization of their own. On the correctness side, significant skew between the vCPUs of a VM can cause Windows BSODs or Linux kernel panics.

e) Idle vCPUs (vCPUs on which the guest is executing the idle loop) are detected by ESX and descheduled, so that they free up a processor that can be productively utilized by some other active vCPU. Descheduled idle vCPUs are considered to be making progress in the skew detection algorithm. As a result, for co-scheduling decisions, idle vCPUs do not accumulate skew and are treated as if they were running. This optimization ensures that idle guest vCPUs don't waste physical processor resources, which can instead be allocated to other VMs. For example, an ESX Server with two physical cores may run one vCPU each from two different VMs, if their sibling vCPUs are idling, without incurring any co-scheduling overhead. Similarly, in the fragmentation example above, if one of the SMP VM's vCPUs is idling, there is no co-scheduling fragmentation, since its sibling vCPU can be scheduled concurrently with the UP VM.
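The skew-based decisions described in c), d), and e) can be sketched in a few lines of code. This is not ESX source; it is a toy model with made-up names (and an arbitrary threshold) that illustrates how strict and relaxed co-scheduling differ, and why idle vCPUs never accumulate skew:

```python
SKEW_THRESHOLD = 3  # hypothetical threshold, in accounting ticks

class VCpu:
    def __init__(self):
        self.skew = 0
        self.running = False
        self.idle = False   # guest is executing its idle loop

    def making_progress(self):
        # Running vCPUs and descheduled idle vCPUs both count as
        # making progress, so idle vCPUs never accumulate skew.
        return self.running or self.idle

def account_skew(vcpus):
    """One accounting tick: grow skew for any stalled vCPU whose sibling
    is making progress."""
    any_progress = any(v.making_progress() for v in vcpus)
    for v in vcpus:
        if not v.making_progress() and any_progress:
            v.skew += 1

def strict_can_run(vcpus, free_pcpus):
    """ESX 2.x style: once any vCPU is skewed, the whole VM runs only
    when there are enough free physical CPUs for ALL of its vCPUs."""
    if any(v.skew > SKEW_THRESHOLD for v in vcpus):
        return free_pcpus >= len(vcpus)
    return True

def relaxed_can_run(vcpus, free_pcpus):
    """ESX 3.x style: only the skewed vCPUs must be scheduled together,
    so fewer free physical CPUs suffice, reducing fragmentation."""
    skewed = [v for v in vcpus if v.skew > SKEW_THRESHOLD]
    return free_pcpus >= len(skewed)
```

With a two-vCPU VM in which one vCPU has exceeded the threshold and only one physical CPU free, `strict_can_run` refuses to run the VM while `relaxed_can_run` allows the skewed vCPU to proceed, which is exactly the extra scheduling opportunity the relaxed scheme provides.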

To summarize, the ESX scheduler supports and enables SMP VMs with both high performance and fairness in mind. ESX users should leverage this SMP support to improve the performance of their applications by configuring the number of vCPUs a VM really needs for its application load.

For a broader technical overview of the ESX co-scheduling algorithms described above, please also refer to the "Co-scheduling SMP VMs in VMware ESX Server" blog.

Measuring Cluster Reconfiguration with VMmark

In my previous blog entry about running VMmark within a four-server cluster managed by VMware Infrastructure 3 version 3.5 (VI3.5), the results showed that the four servers ran out of CPU resources when running 17 VMmark tiles (102 total virtual servers), and scaling then plateaued due to CPU saturation. I suppose that if one of VMware's customers were in this situation it could be a good thing, since it means their business is successful and growing beyond its current computing infrastructure. The real issue becomes how to add capacity as quickly and painlessly as possible to meet the needs of the business. This is an arena where VI3.5 shines, with its ability to simply add physical hosts on the fly without interruption to the virtual servers.

Experimental Setup

We can easily demonstrate the benefit of adding additional physical resources using the experimental setup described in the previous blog posting with the addition of a second HP DL380G5 configured identically to the one already in use.

We can also take advantage of the underlying VMmark scoring methodology. A VMmark run is three hours long and consists of a half-hour ramp-up period followed by a two-hour measurement interval and a half-hour ramp-down period. The two-hour measurement interval is further divided into three 40-minute periods (just like a marathon hockey game). Every benchmark workload generates its throughput metric during each 40-minute period. For each 40-minute period, the workload metrics are aggregated into an overall throughput measure. The final VMmark score is defined as the median score of the three periods.
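The scoring arithmetic can be sketched as follows. This is a toy model with made-up numbers and our own function names; the real benchmark normalizes each workload's throughput against a reference score before aggregating, which the sketch replaces with a plain sum:

```python
import statistics

def period_score(workload_metrics):
    """Aggregate per-workload throughput metrics for one 40-minute period
    into an overall throughput measure (shown as a plain sum here; the
    real benchmark normalizes each workload first)."""
    return sum(workload_metrics)

def vmmark_score(periods):
    """The final score is the median of the three 40-minute period scores."""
    assert len(periods) == 3
    return statistics.median(period_score(p) for p in periods)
```

Taking the median of the three periods is what lets us vary the cluster configuration per period below: the reported score reflects a representative period rather than being dominated by a transient best or worst one.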

For the purpose of these experiments, we can compare throughput scaling while varying the configuration of the cluster across the three 40-minute scoring periods of the benchmark. Specifically, the original four-node cluster configuration is used during the first 40-minute period. The additional HP DL380G5 server is added to the cluster at the transition between the first and second periods. During the second period, VMware's Distributed Resource Scheduler (DRS) rebalances the cluster. The third period should exhibit the full benefit of the dynamically added fifth server.

Experimental Results

Figure 1 compares the 17-tile throughput scaling achieved both with the default four-node cluster and when augmenting the cluster with a fifth server during the second period (this configuration is labeled “4node + 1” to distinguish it from a configuration that starts with five nodes). As expected, both configurations exhibit similar scaling during Period 0 when four servers are in use and the CPU resources are fully utilized. However, cluster performance improves during Period 1 with the addition of a fifth server. By Period 2, DRS has rebalanced the benchmark workloads and given them some breathing room to achieve a perfect 17x scaling for 17 tiles from the baseline performance of a single tile. In comparison, the CPU-saturated four-node configuration achieved roughly 16x scaling.


Figure 2 shows the results for both 18-tile and 19-tile tests. This data follows the same pattern as the initial 17-tile experiment. With the addition of the fifth server, near-linear scaling has been achieved in both cases by Period 2 in contrast to the flat profile measured in the CPU-saturated four-node cluster.


Figure 3 contains results from three different 20-tile experiments. As usual, the first experiment utilizes the default four-node cluster and exhibits throughput scaling similar to the other CPU-saturated tests. The second experiment shows the behavior when adding a fifth server during the transition between Period 0 and Period 1. It displays the rising performance characteristic of the relieved CPU bottleneck. The final experiment in this series was run using the five-node cluster from start to finish and demonstrates that dynamically relieving the CPU bottleneck produces the same throughput performance as if the bottleneck had never existed. In other words, DRS functions equally well both when balancing resources in a dynamic and heavily-utilized scenario and when beginning from a clean slate.


Swapping Resources Dynamically

The ability of VI3.5 to dynamically reallocate physical resources with zero downtime and without interruption can also be used to remove or replace physical hosts. This greatly simplifies routine maintenance, physical host upgrades, and other tasks. We can easily measure the performance implications of swapping physical hosts with the same methodology used above. In this case, we place the Sun x4150 into Maintenance Mode after the fifth physical host (an HP DL380G5) is added. This will evacuate the virtual machines from the Sun x4150, making it available for either hardware or software upgrades.

Figure 4 shows the scaling results of swapping hosts while running both 17-tile and 18-tile VMmark tests on the cluster. In both cases, the performance improves by a small amount. The HP DL380G5 in this experiment happens to contain faster CPUs than the Sun x4150 we had in our lab (Intel Xeon X5460 vs. Intel Xeon X5355), though the Sun x4150 is also available with Intel Xeon X5460 CPUs. These results clearly demonstrate that the liberating flexibility of VMware's VMotion and DRS comes without performance penalties.


The Big Picture

I think some of the context provided in my previous blog bears repeating. Let’s take a step back and talk about what has been accomplished on this relatively modest cluster by running 17 to 20 VMmark tiles (102 to 120 server VMs). That translates into simultaneously:

  • Supporting 17,000 to 20,000 Exchange 2003 mail users
  • Sustaining more than 35,000 database transactions per minute using MySQL/SysBench
  • Driving more than 350 MB/s of disk IO
  • Serving more than 30,000 web pages each minute
  • Running 17 to 20 Java middle-tier servers

For all of these load levels, we dynamically added CPU resources and relieved CPU resource bottlenecks transparently to the virtual machines. They just ran faster. We also transparently swapped physical hosts while the CPU resources were fully saturated without affecting the performance of the workload virtual machines. VI 3.5 lets you easily add and remove physical hosts while it takes care of managing the load. You can run any mix of applications within the virtual machines and VI 3.5 will transparently balance the resources to achieve near-optimal performance. Our experiments ran these systems past the point they were completely maxed out, and I suspect that living this close to the edge is more than most customers will attempt. But I am certain that they will find it reassuring to know that VI3.5 is up to the task.

VMmark 1.1 Available

Just a quick note to let folks know that VMmark 1.1 is available. It went live last week. You can download it here. The new version has a mix of 32-bit and 64-bit virtual machines to better reflect current environments. Please check the Release Notes for full details.