
Tag Archives: benchmarking

Virtualized Storage Performance: RAID Groups versus Storage Pools

RAID, a redundant array of independent disks, has traditionally been the foundation of enterprise storage. Grouping multiple disks into one logical unit can vastly increase the availability and performance of storage by protecting against disk failure, allowing greater I/O parallelism, and pooling capacity. Storage pools similarly increase the capacity and performance of storage, but are easier to configure and manage than RAID groups.

RAID groups have traditionally been regarded as offering better and more predictable performance than storage pools. Although both technologies were developed for magnetic hard disk drives (HDDs), solid-state drives (SSDs), which use flash memory, have become prevalent. Virtualized environments are also now common, and they tend to create highly randomized I/O because multiple workloads run simultaneously.

We set out to see how the performance of RAID group and storage pool provisioning methods compare in today’s virtualized environments.

First, let’s take a closer look at each storage provisioning type.

RAID Groups

A RAID group unifies a number of disks into one logical unit and distributes data across multiple drives. RAID groups can be configured with a particular protection level depending on the performance, capacity, and redundancy needs of the environment. LUNs are then allocated from the RAID group. RAID groups typically contain only identical drives, and the maximum number of disks in a RAID group varies by system model but is generally below fifty. Because drives typically have well defined performance characteristics, the overall RAID group performance can be calculated as the performance of all drives in the group minus the RAID overhead. To provide consistent performance, workloads with different I/O profiles (e.g., sequential vs. random I/O) or different performance needs should be physically isolated in different RAID groups so they do not share disks.
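
As a rough illustration of how a RAID group's aggregate performance can be estimated from its member drives, the sketch below applies the commonly cited RAID 5 write penalty of four back-end I/Os per front-end write. The per-drive IOPS figure and read/write mix are hypothetical placeholders, and real arrays add controller and cache effects that this simple model ignores.

    # Back-of-the-envelope RAID 5 group IOPS estimate (hypothetical inputs).
    def raid5_group_iops(drives, iops_per_drive, read_fraction, write_penalty=4):
        """Estimate the front-end IOPS a RAID 5 group can sustain.

        Each front-end read costs about one back-end I/O; each front-end write
        costs about four (read data, read parity, write data, write parity).
        """
        backend_iops = drives * iops_per_drive
        cost_per_frontend_io = read_fraction * 1 + (1 - read_fraction) * write_penalty
        return backend_iops / cost_per_frontend_io

    # Example: 15 drives at a hypothetical 5,000 IOPS each, 70% read workload.
    print(round(raid5_group_iops(15, 5000, 0.70)))  # about 39,474 front-end IOPS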

Storage Pools

Storage pools, or simply ‘pools’, are very similar to RAID groups in some ways. Implementation varies by vendor, but generally pools are made up of one or more private RAID groups, which are not visible to the user, or they are composed of user-configured RAID groups which are added manually to the pool. LUNs are then allocated from the pool. Storage pools can contain up to hundreds of drives, often all the drives in an array. As business needs grow, storage pools can be easily scaled up by adding drives or RAID groups and expanding LUN capacity. Storage pools can contain multiple types and sizes of drives and can spread workloads over more drives for a greater degree of parallelism.

Storage pools are usually required for array features like automated storage tiering, where faster SSDs can serve as a data cache among a larger group of HDDs, as well as other array-level data services like compression, deduplication, and thin provisioning. Because of their larger maximum size, storage pools, unlike RAID groups, can take advantage of vSphere 6 maximum LUN sizes of 64TB.

We used two benchmarks to compare the performance of RAID groups and storage pools: VMmark, which is a virtualization platform benchmark, and I/O Analyzer with Iometer, which is a storage microbenchmark.  VMmark is a multi-host virtualization benchmark that uses diverse application workloads as well as common platform level workloads to model the demands of the datacenter. VMs running a complete set of the application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview. Iometer places high levels of load on the disk, but does not stress any other system resources. Together, these benchmarks give us both a ‘real-world’ and a more focused perspective on storage performance.

VMmark Testing

Array Configuration

Testing was conducted on an EMC VNX5800 block storage SAN with Fibre Channel. This array is one of many storage solutions that offer both RAID group and storage pool technologies. Disks were 200GB single-level cell (SLC) SSDs. Storage configuration followed array best practices, including balancing LUNs across Storage Processors and ensuring that RAID groups and LUNs did not span the array bus. One way to optimize SSD performance is to leave up to 50% of the SSD capacity unutilized, also known as overprovisioning. To follow this best practice, 50% of the RAID group or storage pool was not allocated to any LUN. Since overprovisioning SSDs can be an expensive proposition, we also tested the same configuration with 100% of the storage pool or RAID group allocated.

RAID Group Configuration

Four RAID 5 groups were used, each composed of 15 SSDs. RAID 5 was selected for its suitability for general purpose workloads. RAID 5 provides tolerance against a single disk failure. For best performance and capacity, RAID 5 groups should be sized to multiples of five or nine drives, so each 15-drive group maintains a multiple of the preferred five-drive count. One LUN was created in each of the four RAID groups. The LUN was sized to either 50% of the RAID group (Best Practices) or 100% (Fully Allocated). For testing, the capacity of each LUN was fully utilized by VMmark virtual machines and randomized data.
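
To make the 50% versus 100% allocation concrete, here is a small sketch of the arithmetic for one of the 15-drive RAID 5 groups described above. It uses the nominal 200GB drive size and ignores vendor-specific formatting overhead, so the numbers are approximate.

    # Approximate usable capacity and LUN sizing for one 15-drive RAID 5 group.
    drives = 15
    drive_gb = 200                       # nominal SSD size from the test configuration
    raw_gb = drives * drive_gb           # 3,000 GB raw
    usable_gb = (drives - 1) * drive_gb  # RAID 5 reserves one drive's worth of parity: 2,800 GB

    lun_best_practices_gb = 0.5 * usable_gb   # 50% allocated, 50% left unallocated (overprovisioned)
    lun_fully_allocated_gb = 1.0 * usable_gb  # 100% allocated

    print(raw_gb, usable_gb, lun_best_practices_gb, lun_fully_allocated_gb)
    # 3000 2800 1400.0 2800.0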

Figures: RAID group configuration and storage pool configuration used in the VMmark storage comparison

Storage Pool Configuration

A single RAID 5 Storage Pool containing all 60 SSDs was used. Four thick LUNs were allocated from the pool, meaning that all of the storage space was reserved on the volume. LUNs were equivalent in size and consumed a total of either 50% (Best Practices) or 100% (Fully Allocated) of the pool capacity.

Storage Layout

Most of the VMmark storage load was created by two types of virtual machines: database (DVD Store) and mail server (Microsoft Exchange). These virtual machines were isolated on two different LUNs. The remaining virtual machines were spread across the remaining two LUNs. That is, in the RAID group case, storage-heavy workloads were physically isolated in different RAID groups, but in the storage pool case, all workloads shared the same pool.

Systems Under Test: Two Dell PowerEdge R720 servers
Configuration Per Server:  
     Virtualization Platform: VMware vSphere 6.0. VMs used virtual hardware version 11 and current VMware Tools.
     CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled.
     Memory: 256GB ECC DDR3 @ 1866MHz
     Host Bus Adapter: QLogic ISP2532 DualPort 8Gb Fibre Channel to PCI Express
     Network Controller: One Intel 82599EB dual-port 10 Gigabit PCIe Adapter, one Intel I350 Dual-Port Gigabit PCIe Adapter

Each configuration was tested at three different load points: 1 tile (the lowest load level), 7 tiles (an approximate mid-point), and 13 tiles, which was the maximum number of tiles that still met Quality of Service (QoS) requirements. All datapoints represent the mean of two tests of each configuration.

VMmark Results

RAID Group vs. Storage Pool Performance comparison using VMmark benchmark

Across all load levels tested, the VMmark performance score, which is a function of application throughput, was similar regardless of storage provisioning type. Neither the storage type used nor the capacity allocated affected throughput.

VMmark 2.5 performance scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. We wanted to see how storage configuration affected application latency as opposed to the VMmark score. All latencies are normalized to the lowest 1-tile results.
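
The normalization is simple: each result is divided by the smallest latency measured at 1 tile. As a minimal sketch, with made-up latency values:

    # Normalize per-configuration latencies to the lowest 1-tile result (made-up values).
    latencies_ms = {"RAID 50%": 21.0, "RAID 100%": 22.0, "Pool 50%": 21.5, "Pool 100%": 21.8}
    baseline = min(latencies_ms.values())               # lowest 1-tile latency
    normalized = {name: value / baseline for name, value in latencies_ms.items()}
    print(normalized)  # e.g. {'RAID 50%': 1.0, 'RAID 100%': ~1.05, ...}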

Storage configuration did not affect VMmark application latencies.

Application Latency in VMmark Storage Comparison RAID Group vs Storage Pool

Lastly, we measured read and write I/O latencies: esxtop Average Guest MilliSec/Write and Average Guest MilliSec/Read. This is the round-trip I/O latency as seen by the guest operating system.

VMmark Storage Latency Storage Comparison RAID Group vs Storage Pool

No differences emerged in I/O latencies.

I/O Analyzer with Iometer Testing

In the second set of experiments, we wanted to see if we would find similar results while testing storage using a synthetic microbenchmark. I/O Analyzer is a tool that uses Iometer to drive load from a Linux-based virtual machine and then collates the performance results. The benefit of using a microbenchmark like Iometer is that it places heavy load on just the storage subsystem, ensuring that no other subsystem is the bottleneck.

Configuration

Testing again used the VNX5800 array and RAID 5, but this time each storage configuration spanned 9 SSDs, also a preferred drive count. In contrast to the prior test, the storage pool and the RAID group spanned an identical number of disks, so the number of disks per LUN was the same in both configurations. Nine disks per LUN were used to place greater load on each disk.

The LUN was sized to either 50% or 100% of the RAID group or storage pool. The LUN capacity was fully occupied with the I/O Analyzer worker VM and randomized data. The I/O Analyzer Controller VM, which initiates the benchmark, was located on a separate array and host.

Storage Configuration Iometer with Storage Pool and RAID Group

Testing used one I/O Analyzer worker VM. One Iometer worker thread drove storage load. The size of the VM’s virtual disk determines the size of the active dataset, so a 100GB thick-provisioned virtual disk on VMFS-5 was chosen to maximize I/O to the disk and minimize caching. We tested at a medium load level using a plausible datacenter I/O profile, understanding, however, that any static I/O profile will be a broad generalization of real-life workloads.

Iometer Configuration

  • 1 vCPU, 2GB memory
  • 70% read, 30% write
  • 100% random I/O to model the “I/O blender effect” in a virtualized environment
  • 4KB block size
  • I/O aligned to sector boundaries
  • 64 outstanding I/O
  • 60 minute warm up period, 60 minute measurement period
Systems Under Test: One Dell PowerEdge R720 server
Configuration Per Server:  
     Virtualization Platform: VMware vSphere 6.0. Worker VM used the I/O Analyzer default virtual hardware version 7.
     CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled.
     Memory: 256GB ECC DDR3 @ 1866MHz
     Host Bus Adapter: QLogic ISP2532 DualPort 8Gb Fibre Channel to PCI Express
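
For reference, the Iometer access specification above translates directly into expected data rates once an IOPS result is measured. The sketch below performs that conversion for a hypothetical IOPS figure; the actual results appear in the charts that follow.

    # Convert a measured IOPS figure into throughput for the 4KB, 70/30 random profile above.
    block_bytes = 4 * 1024      # 4KB I/O size
    read_fraction = 0.70        # 70% read, 30% write, 100% random

    measured_iops = 50_000      # hypothetical placeholder, not a measured result
    throughput_mb_s = measured_iops * block_bytes / (1024 ** 2)
    print(f"{throughput_mb_s:.0f} MB/s total "
          f"({measured_iops * read_fraction:.0f} read IOPS, "
          f"{measured_iops * (1 - read_fraction):.0f} write IOPS)")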

Iometer results

Figures: Iometer latency and throughput results, RAID group vs. storage pool

In Iometer testing, the storage pool showed slightly better performance than the RAID group, and, as before, the amount of capacity allocated did not affect performance.

In both our multi-workload and synthetic microbenchmark scenarios, we did not observe any performance penalty from choosing storage pools over RAID groups on an all-SSD array, even when disparate workloads shared the same storage pool. We also did not find any performance benefit at the application or I/O level from leaving unallocated capacity in (overprovisioning) SSD RAID groups or storage pools. Given the ease of management and feature-based benefits of storage pools, including automated storage tiering, compression, deduplication, and thin provisioning, storage pools are an excellent choice in today’s datacenters.

SQL Server VM Performance on VMware vSphere 6

Last October, I blogged about SQL Server performance with vSphere 5.5 using a four-socket Intel Xeon processor E7 based host.  Now that vSphere 6 is available, I’ve run an updated set of tests using this new release, on an even more powerful host, with Xeon E7 v2 processors.  A variety of virtual CPU (vCPU) and virtual machine (VM) quantities were tested to show that vSphere can handle hundreds of thousands of online transaction processing (OLTP) database operations per minute.

DVD Store 2.1, an open-source OLTP database stress tool, was the workload used to stress the VMs.  The first experiment in the paper was a generational performance comparison between the old and new setups; as you can see, there is a dramatic increase in throughput, even though the size of each VM has doubled from 8 vCPUs per VM to 16:

Generational performance improvement from old study to new study

There are also tests using CPU affinity to show the performance differences between physical cores and logical processors (Hyper-Threads), the benefit of “right-sizing” virtual machines, and measuring the impact of the advanced Latency Sensitivity setting. 

For more details and the test results, please download the whitepaper: Performance Characterization of Microsoft SQL Server on VMware vSphere 6.

Introducing the Zephyr Benchmark

The ways in which we use, design, deploy, and evaluate the performance of large-scale web applications have changed significantly in recent years.  These changes have been driven by the increase in computing capacity and flexibility provided by virtualized and cloud-based computing infrastructures. The majority of these changes are not reflected in current web-application benchmarks.

Zephyr is a new web-application benchmark we have been developing as part of our work on optimizing the performance of VMware products for the next generation of cloud-scale applications. The goal of the Zephyr project has been to develop an application-level benchmark that captures the key characteristics of the workloads, design paradigms, deployment architectures, and performance metrics of the next generation of large-scale web applications. As we approach the initial release of Zephyr, we are starting to use it to understand performance across our product range.  In this post, we will give an overview of Zephyr that will provide context for the performance results that we will be writing about over the coming months.

Zephyr Motivation

There have been many changes in usage patterns and development practices for large-scale web applications.  The design and development of Zephyr has been driven by the goal of capturing these changes in a highly scalable benchmark that includes these key aspects:

  • The effect of increased user interactivity and rich web interfaces on workload patterns
  • New design patterns for decoupled and asynchronous services
  • The use of multiple data sources for data with varying scalability and consistency requirements
  • Flexible architectures that allow for deployment on a wide range of virtual and cloud-based infrastructures

The effect of increased user interactivity and rich web interfaces is one of the most important of these aspects. In current benchmarks, a user is represented by a single thread operating independently from other users. Contrast that to the way we interact with applications as diverse as social media and stock trading. Many user interactions, such as responding to a status update or selling shares of stock, are in direct response to the actions of other users.  In addition, the current generation of script-rich web interfaces performs many operations asynchronously without any action from, or even awareness by, the user.  Examples include web pages and rich client interfaces that update active friend lists, check for messages, or maintain stock tickers.  This leads to a very different model of user behavior than the traditional single-threaded, click-and-think design used by existing benchmarks.  As a result, one of the key design goals for Zephyr was to develop both a benchmark application and a workload generator that would allow us to capture the effect of these new workload patterns.

Zephyr Overview

An application-level benchmark typically consists of two main parts: the benchmark application and the workload driver.  The application is selected and designed to represent characteristics and technology choices that are typical of a certain class of applications.  The workload driver interacts with the benchmark application to simulate the behavior of typical users of the application.   It also captures the performance metrics that are used to quantify the performance of the application/infrastructure combination. Some benchmarks, including Zephyr, also provide a run harness that assists in the set-up and automation of benchmark runs.

Zephyr’s benchmark application is LiveAuction, which is a web application for managing and hosting real-time auctions. An auction hosted by LiveAuction consists of a number of items that will be placed up for bid in a set order.  Users are given only a limited time to bid before an item is sold and the next item is placed up for bid.  When an item is up for bid, all users attending the auction are presented with a description and image of the item.  Users see and respond to bids placed by other users. LiveAuction can support thousands of simultaneous auctions with large numbers of active users, with each user possibly attending multiple, simultaneous auctions.   The figure below shows the browser application used to interact with the LiveAuction application.  This figure shows the bidding screen for a user who is attending two auctions.  The current item, bid, and bid status for each auction are updated in real-time in response to bids placed by other users.

Figure 1. LiveAuction bidding screen

In addition to managing live auctions, LiveAuction provides auction and item search, profile management, historical data queries, image management, auction management, and other services that would be required by a user of the application.

LiveAuction uses a scalable architecture that allows deployments to be easily sized for a large range of user loads.  A full deployment of LiveAuction includes a wide variety of support services, such as load-balancing, caching, and messaging servers, as well as relational, NoSQL, and filesystem-based data stores supporting scalability for data with a variety of consistency requirements.  The figure below shows a full deployment of LiveAuction and the Zephyr workload driver.

Figure 2. Logical layout for full Zephyr deployment

The following is a brief description of the role played by each tier.

Infrastructure Services

TCP Load Balancers: The simulated users on the workload driver address the application through a set of IP addresses mapped to the application’s external hostname.  The TCP load balancers jointly manage these IP addresses to ensure that all IP addresses remain available in the event of a failure. The TCP load balancers distribute the load across the web servers while maintaining SSL/TLS session affinity.

Messaging Servers: The application nodes use the messaging backbone to distribute work and state-change information regarding active auctions.

Application Services

Web Servers: The web servers terminate SSL, serve static content, act as load-balancing reverse proxies for the application servers, and provide a proxy cache for application content, such as images returned by the application servers.

Application Servers: The application servers run Java servlet containers in which the application services are deployed.  The LiveAuction application services use a stateless implementation with a RESTful interface that simplifies scaling.

Data Services

Relational Database: The relational database is used for all data that is involved in transactions.  This includes user account information, as well as auction, item, and high-bid data.

NoSQL Data Server:  The NoSQL Document Store is used to store image metadata as well as activity data such as auction attendance information and bid records. It can also be used to store uploaded images. Using the NoSQL store as an image store allows the application to take advantage of its sharding capabilities to easily scale the I/O capacity for image storage.

File Server: The file server is used exclusively to store item images uploaded by users.  Note that the file server is optional, as the images can be stored and served from the NoSQL document store.

Zephyr currently includes configuration support for deploying LiveAuction using the following services:

  • Virtual IP Address Management: Keepalived
  • TCP Load Balancer: HAProxy
  • Web Server: Apache Httpd and Nginx
  • Application Server:  Apache Tomcat with EHcache for in-memory caching
  • Messaging Server: RabbitMQ
  • Relational Database: MySQL and PostgreSQL
  • NoSQL Data Store: MongoDB
  • Network Filesystem: NFS

Additional implementations will be supported in future releases.

Zephyr can be deployed with different subsets of the infrastructure and application services.  For example, the figure below shows a minimal deployment of Zephyr with a single application server and the supporting data services.  In this configuration, the application server performs the tasks handled by the web server in a larger deployment.

Figure 3. Logical layout for a minimal Zephyr deployment

The Zephyr workload driver has been developed to drive HTTP-based loads for modern scalable web applications.  It can simulate workloads for applications that incorporate asynchronous behaviors using embedded JavaScript, and those requiring complex data-driven behaviors, as in web applications with significant inter-user interaction.  The Zephyr workload driver uses an asynchronous design with a small number of threads supporting a large number of simulated users. Simulated users may have multiple active asynchronous activities which share state information, and complex workload patterns can be specified with control-flow decisions made based on retrieved state and operation history. These features allow us to efficiently simulate workloads that would be presented to web applications by rich web clients using asynchronous JavaScript operations.

The Zephyr workload driver also monitors quality-of-service (QoS) metrics for both the LiveAuction application and the overall workload. The application-level QoS requirements are based on the 99th percentile response-times for the individual operations.  An operation represents a single action performed by a user or embedded script, and may consist of multiple HTTP exchanges.  The workload-level QoS requirements define the required mix of operations that must be performed by the users during the workload’s steady state.  This mix must be consistent from run to run in order for the results to be comparable.  In order for a run of the benchmark to pass, all QoS requirements must be satisfied.
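
As a minimal sketch of the application-level QoS check described above, the snippet below computes a 99th-percentile response time per operation and compares it against a per-operation limit. The operation names, samples, and thresholds are illustrative only, not Zephyr's actual values.

    import math

    def percentile(samples, pct):
        """Nearest-rank percentile of a list of response times."""
        ordered = sorted(samples)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]

    # Illustrative response-time samples (ms) and QoS limits per operation.
    response_times_ms = {"placeBid": [45, 60, 52, 480, 71], "getItemImage": [12, 15, 22, 300, 18]}
    qos_limit_ms = {"placeBid": 500, "getItemImage": 1000}

    passed = all(percentile(times, 99) <= qos_limit_ms[op]
                 for op, times in response_times_ms.items())
    print("Application-level QoS passed:", passed)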

Zephyr also includes a run harness that automates most of the steps involved in configuring and running the benchmark.  The harness takes as input a configuration file that describes the deployment configuration, the user load, and many service-specific tuning parameters.  The harness is then able to power on virtual machines, configure and start the various software services, deploy the software components of LiveAuction, run the workload, and collect the results, as well as the log, configuration, and statistics files from all of the virtual machines and services.  The harness also manages the tasks involved in loading and preparing the data in the data services before each run.

Scalability

Scaling to large deployments is a key goal of Zephyr.  Therefore, it will be useful to conclude with some initial scalability data to show how we are doing in achieving that goal. There are many possible ways to scale up a deployment of LiveAuction.  For the sake of providing a straightforward comparison, we will focus on scaling out the number of application server instances in an otherwise fixed deployment configuration.  The CPU utilization of the application server is typically the performance bottleneck in a well-balanced LiveAuction deployment.

The figure below shows the logical layout of the VMs and services in this deployment.  Physically, all VMs reside on the same network subnet on the vSphere hosts, which are connected by a 10Gb Ethernet switch.

Figure 4. Deployment configuration for scaling results

The VMs in the LiveAuction deployment were distributed across three VMware vSphere 6 hosts.  Table 1 gives the hardware details of the hosts.

Host Name | Host Vendor/Model | Processors | Memory
Host1 | Dell PowerEdge R720, 2-Socket Server | Intel® Xeon® CPU E5-2690 @ 2.90GHz, 8 Core, 16 Thread | 256GB
Host2 | Dell PowerEdge R720, 2-Socket Server | Intel® Xeon® CPU E5-2690 @ 2.90GHz, 8 Core, 16 Thread | 256GB
Host3 | Dell PowerEdge R720, 2-Socket Server | Intel® Xeon® CPU E5-2680 @ 2.70GHz, 8 Core, 16 Thread | 192GB

Table 1. vSphere 6 hosts for LiveAuction deployment

Table 2 shows the configuration of the VMs, and their assignment to vSphere hosts.  As the goal of these tests was to examine the scalability of the LiveAuction application, and not the characteristics of vSphere 6, we chose the VM sizing and assignment in part to avoid using more virtual CPUs than physical cores. While we did some tuning of the overall configuration, we did not necessarily obtain the optimal tuning for each of the service configurations.  The configuration was chosen so that the application server was the bottleneck as far as possible within the restrictions of the available physical servers.  In future posts, we will examine the tuning of the individual services, tradeoffs in deployment configurations, and best practices for deploying LiveAuction-like applications on vSphere.

Service | Host | VM vCPUs (each) | VM Memory
HAProxy 1 | Host1 | 2 | 8GB
HAProxy 2 | Host2 | 2 | 8GB
HAProxy 3 | Host3 | 2 | 8GB
Nginx 1, 2, and 3 | Host3 | 2 | 8GB
RabbitMQ 1 | Host2 | 1 | 2GB
RabbitMQ 2 | Host1 | 1 | 2GB
Tomcat 1, 3, 5, 7, and 9 | Host1 | 2 | 8GB
Tomcat 2, 4, 6, 8, and 10 | Host2 | 2 | 8GB
MongoDB 1 and 3 | Host2 | 1 | 32GB
MongoDB 2 and 4 | Host1 | 1 | 32GB
PostgreSQL | Host3 | 6 | 32GB

Table 2. Virtual machine configuration

Figure 5 shows the peak load that can be supported by this deployment configuration as the number of application servers is scaled from one to ten.  The peak load supported by a configuration is the maximum load at which the configuration can satisfy all of the QoS requirements of the workload.  The dotted line shows linear scaling of the maximum load extrapolated from the single application server result.  The actual scaling is essentially linear up to six application-server VMs.  At that point, the overall utilization of the physical servers starts to affect the ability to maintain linear scaling.  With seven application servers, the web-server tier becomes a scalability bottleneck, but there are not sufficient CPU cores available to add additional web servers.

It would require additional infrastructure to determine how far the linear scaling could be extended.  However, the current results provide strong evidence that with sufficient resources, Zephyr will be able to scale to support very large loads representing large numbers of users.

Figure 5. Maximum supported users for increasing number of application servers
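
A simple way to quantify how close the measured results are to the dotted linear-scaling line is to divide each measured peak load by the linear extrapolation from the single-server result. The sketch below does this with placeholder numbers; the actual per-configuration loads appear only in Figure 5.

    # Scaling efficiency relative to linear extrapolation from the 1-server result.
    # Peak-load values below are placeholders; see Figure 5 for the measured data.
    single_server_peak = 1_000                       # hypothetical peak users with 1 app server
    measured_peaks = {1: 1_000, 6: 5_900, 10: 8_800}

    for servers, peak in measured_peaks.items():
        linear = single_server_peak * servers
        print(f"{servers} app servers: {peak} users, {100 * peak / linear:.0f}% of linear scaling")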

Future

The discussion in this post has focused on the use of Zephyr as a traditional single-application benchmark with a focus on throughput and response-time performance metrics.  However, that only scratches the surface of our future plans for Zephyr.  We are currently working on extending Zephyr to capture more cloud-centric performance metrics.  These fall into two broad categories that we call multi-tenancy metrics and elasticity metrics.  Multi-tenancy metrics capture the performance characteristics of a cloud-deployed application in the presence of other applications co-located on the same physical resources.  The relevant performance metrics include isolation and fairness along with the traditional throughput and response-time metrics.  Elasticity metrics capture the performance characteristics of self-scaling applications in the presence of changing loads.  It is also possible to study elasticity metrics in the context of multi-tenancy environments, thus examining the impact of shared resources on the ability of an application to scale in a timely manner to satisfy user demands.  These are all exciting new areas of application performance, and we will have more to say about these subjects as we approach Zephyr 1.0.

Virtual SAN 6.0 Performance: Scalability and Best Practices

A technical white paper about Virtual SAN performance has been published. This paper provides guidelines on how to get the best performance for applications deployed on a Virtual SAN cluster.

We used Iometer to generate several workloads that simulate various I/O patterns encountered in Virtual SAN production environments. These are shown in the following table.

Type of I/O workload | Size (1KiB = 1024 bytes) | Mixed Ratio | Shows / Simulates
All Read | 4KiB | – | Maximum random read IOPS that a storage solution can deliver
Mixed Read/Write | 4KiB | 70% / 30% | Typical commercial applications deployed in a VSAN cluster
Sequential Read | 256KiB | – | Video streaming from storage
Sequential Write | 256KiB | – | Copying bulk data to storage
Sequential Mixed R/W | 256KiB | 70% / 30% | Simultaneous read/write copy from/to storage

In addition to these workloads, we studied Virtual SAN caching tier designs and the effect of Virtual SAN configuration parameters on the Virtual SAN test bed.

Virtual SAN 6.0 can be configured in two ways: Hybrid and All-Flash. Hybrid uses a combination of hard disks (HDDs) to provide storage and a flash tier (SSDs) to provide caching. The All-Flash solution uses all SSDs for storage and caching.

Tests show that the Hybrid Virtual SAN cluster performs extremely well when the working set is fully cached for random access workloads, and also for all sequential access workloads. The All-Flash Virtual SAN cluster, which performs well for random access workloads with large working sets, may be deployed in cases where the working set is too large to fit in a cache. All workloads scale linearly in both types of Virtual SAN clusters—as more hosts and more disk groups per host are added, Virtual SAN sees a corresponding increase in its ability to handle larger workloads. Virtual SAN offers an excellent way to scale up the cluster as performance requirements increase.

You can download Virtual SAN 6.0 Performance: Scalability and Best Practices from the VMware Performance & VMmark Community.

VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines

vSphere 6 introduces the ability to run virtual machines (VMs) with up to 128 virtual CPUs (vCPUs) and 4TB of RAM. This doubles the number of vCPUs supported from the previous version and increases the amount of RAM by four times. This new capability provides the potential for customers to run larger workloads than ever before in a virtual machine.

A series of tests were run with a virtual machine hosting Oracle 12c database instances. The DVD Store 2.1 open-source transactional workload was used to measure the performance of a large “Monster” VM on vSphere 6. The Oracle 12c database VM was scaled from 15 vCPUs all the way up to 120 vCPUs, and the maximum achieved throughput was measured. The full results and test details have been published in a white paper – VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines.

A four-socket Intel Xeon E7-4890 v2 processor based server with 1TB of memory was used to host the virtual machine for the tests.  Each Xeon E7-4890 v2 processor has 15 cores / 30 threads with Hyper-Threading enabled, for a total of 60 cores / 120 threads for the system. The diagram below shows the basic test configuration.

Figure: Oracle 12c on vSphere 6 test configuration

 

In all tests Hyper-Threading was enabled on the server, but in configurations where 60 or fewer vCPUs were assigned to the VM, Hyper-Threads were not used by the VM. This is a result of the default scheduling policy, which prefers to schedule vCPUs on one thread per core before using the second thread of any core. The first set of results, shown below, focuses on the tests that scale up to 60 vCPUs; these tests show the scaling for the virtual machine without the use of Hyper-Threads.

Figure: DVD Store throughput scaling the Oracle VM from 15 to 60 vCPUs (Hyper-Threads not used)

While vSphere 6 supports up to 128 vCPUs per VM, these tests were limited to 120 vCPUs due to the number of threads available on the server. The largest VM configuration used both hardware execution threads (Hyper-Threads) on all the processor cores in order to reach 120 vCPUs. In this case, there is one vCPU per execution thread.

Hyper-Threading doubles the number of execution threads, but it does not double performance. In order to measure the scale-up performance of the 120-vCPU VM, a 60-vCPU VM was configured with CPU affinity so that it was limited to only two of the server’s four sockets. In this configuration the 60-vCPU VM has one vCPU per execution thread, which is the same as the 120-vCPU VM.  Configuring a 60-vCPU VM in this way makes it easy to see the scale up performance at 120 vCPUs on this server with hyper-threads enabled.

The results of the scale-up testing using the 60-vCPU VM configured with CPU affinity to only 2 sockets and the 120-vCPU VM using all four sockets showed approximately linear scaling, as shown in the graph below.

Figure: Scale-up from the 60-vCPU VM (CPU affinity to 2 sockets) to the 120-vCPU VM (all 4 sockets)

For full test details and more test results, please see the recently published white paper.

The new larger “Monster” VM support in vSphere 6 allows for virtual machines that can support larger workloads than ever before with excellent performance. These tests show that large virtual machines running on vSphere 6 can scale up as needed to meet extreme performance demands.

 

Virtual SAN and SAP IQ – a Perfect Match

A performance study shows that VMware vSphere 5.5 with Virtual SAN as the storage backend provides an excellent platform for virtualized deployments of SAP IQ Multiplex Servers.

We created four virtual machines with the RHEL 6.3 operating system, and these virtual machines made up the SAP IQ Multiplex Server, which used Virtual SAN as its storage backend. In order to measure performance, we looked at the distributed query processing (DQP) modes of SAP IQ. In DQP, work is performed by threads running on both leader and worker nodes, and intermediate results are transmitted between these nodes through a shared disk space, or over an inter-node network. In the paper, we refer to these modes as storage-transfer and network-transfer.

In a test consisting of concurrent streams of queries designed to emulate a multi-user scenario, we found that the read-heavy I/O profile of this workload takes full advantage of Virtual SAN’s flash acceleration layer. Data read from the magnetic disks in each disk group is cached in the SSD in that disk group. Since 70% of SSD capacity is reserved for the read cache, a significant amount of data is quickly placed in very low latency storage. Once the cache is warmed up, I/O requests are served from the read cache, leading to fast query response times. Add to this SAP IQ’s ability to use network resources to transfer intermediate results, and we get an additional bump in throughput because we no longer have the overhead of writing intermediate, shared results to disk.
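
Since 70% of each disk group's SSD capacity is reserved for the read cache, the cache available to a working set can be estimated directly from the flash capacity in the cluster. The sketch below uses hypothetical SSD sizes and disk-group counts, not the values from this study.

    # Estimate Virtual SAN read-cache capacity across a cluster (hypothetical sizes).
    ssd_gb_per_disk_group = 400    # hypothetical flash device size
    disk_groups_per_host = 1
    hosts = 4
    read_cache_fraction = 0.70     # 70% of SSD capacity is reserved for the read cache

    read_cache_gb = ssd_gb_per_disk_group * disk_groups_per_host * hosts * read_cache_fraction
    print(f"Cluster-wide read cache: {read_cache_gb:.0f} GB")  # 1120 GB in this example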

Read more about Distributed Query Processing in SAP IQ on VMware vSphere and Virtual SAN.

Docker Containers Performance in VMware vSphere

By  Qasim Ali,  Banit Agrawal, and Davide Bergamasco

 

“Containers without compromise” – This was one of the key messages at VMworld 2014 USA in San Francisco. It was presented in the opening keynote, and then the advantages of running Docker containers inside of virtual machines were discussed in detail in several breakout sessions. These include security/isolation guarantees and also the existing rich set of management functionalities. But some may say, “These benefits don’t come for free: what about the performance overhead of running containers in a VM?”

A recent report compared the performance of a Docker container to a KVM VM and showed very poor performance in some micro-benchmarks and real-world use cases: up to 60% degradation. These results were somewhat surprising to those of us accustomed to near-native performance of virtual machines, so we set out to do similar experiments with VMware vSphere. Below, we present our findings of running Docker containers in a vSphere VM and  in a native configuration. Briefly,

  • We find that for most of these micro-benchmarks and Redis tests, vSphere delivered near-native performance with generally less than 5% overhead.
  • Running an application in a Docker container in a vSphere VM has overhead very similar to that of running the container on a native OS (directly on a physical server).

Next, we present the configuration and benchmark details as well as the performance results.

Deployment Scenarios

We compare four different scenarios as illustrated below:

  • Native: Linux OS running directly on hardware (Ubuntu, CentOS)
  • vSphere VM: Upcoming release of vSphere with the same guest OS as native
  • Native-Docker: Docker version 1.2 running on a native OS
  • VM-Docker: Docker version 1.2 running in guest VM on a vSphere host

In each configuration all the power management features are disabled in the BIOS and Ubuntu OS.


Figure 1: Different test scenarios

Benchmarks/Workloads

For this study, we used the micro-benchmarks listed below and also simulated a real-world use case.

–   Micro-benchmarks:

  • LINPACK: This benchmark solves a dense system of linear equations. For large problem sizes it has a large working set and does mostly floating point operations.
  • STREAM: This benchmark measures memory bandwidth across various configurations.
  • FIO: This benchmark is used for I/O benchmarking for block devices and file systems.
  • Netperf: This benchmark is used to measure network performance.

Real-world workload:

  • Redis: In this experiment, many clients perform continuous requests to the Redis server (key-value datastore).

For all of the tests, we run multiple iterations and report the average of multiple runs.

Performance Results

LINPACK

LINPACK solves a dense system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system of N equations, converts that time into a performance rate, and tests the results for accuracy. We used an optimized version of the LINPACK benchmark binary based on the Intel Math Kernel Library (MKL).
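
LINPACK converts the measured solve time into a performance rate using the standard operation count for factoring and solving a dense N-by-N system. The sketch below shows that conversion; the solve time used here is a made-up example, while the memory footprint matches the figures quoted in the next paragraph (N-squared doubles is about 16.2GB at N=45,000 and about 33.8GB at N=65,000).

    # Convert a LINPACK solve time into GFLOP/s using the standard operation count.
    def linpack_gflops(n, seconds):
        flops = (2.0 / 3.0) * n**3 + 2.0 * n**2   # factor + solve operation count
        return flops / seconds / 1e9

    n = 45_000
    print(f"Matrix memory: {n * n * 8 / 1e9:.1f} GB")    # ~16.2 GB for the 45K problem size
    print(f"{linpack_gflops(n, 100.0):.0f} GFLOP/s")     # ~608 GFLOP/s for a hypothetical 100 s solve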

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 32 vCPU VM with 45K and 65K problem sizes


Figure 2: LINPACK performance for different test scenarios

We disabled HT for this run as recommended by the benchmark guidelines to get the best peak performance. For the 45K problem size, the benchmark consumed about 16GB memory. All memory was backed by transparent large pages. For VM results, large pages were used both in the guest (transparent large pages) and at the hypervisor level (default for vSphere hypervisor). There was 1-2% run-to-run variation for the 45K problem size. For 65K size, 33.8GB memory was consumed and there was less than 1% variation.

As shown in Figure 2, there is almost negligible virtualization overhead in the 45K problem size. For a bigger problem size, there is some inherent hardware virtualization overhead due to nested page table walk. This results in the 5% drop in performance observed in the VM case. There is no additional overhead of running the application in a Docker container in a VM compared to running the application directly in the VM.

STREAM

We used a NUMA-aware  STREAM benchmark, which is the classical STREAM benchmark extended to take advantage of NUMA systems. This benchmark measures the memory bandwidth across four different operations: Copy, Scale, Add, and Triad.

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, HT enabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 64 vCPU VM (Hyper-Threading ON)


Figure 3: STREAM performance for different test scenarios

We used an array size of 2 billion, which used about 45GB of memory. We ran the benchmark with 64 threads both in the native and virtual cases. As shown in Figure 3, the VM added about 2-3% overhead across all four operations. The small 1-2% overhead of using a Docker container on a native platform is probably in the noise margin.
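
STREAM's reported bandwidth is simply bytes moved divided by elapsed time. The sketch below shows the arithmetic for the Triad operation at the 2-billion-element array size used here; three 8-byte arrays of that length total 48GB (about 45GiB, consistent with the memory figure above), and the elapsed time is a made-up example.

    # STREAM Triad (a[i] = b[i] + q * c[i]) touches three 8-byte doubles per element.
    n = 2_000_000_000                     # array size used in these tests
    bytes_per_element = 3 * 8             # read b and c, write a
    total_bytes = n * bytes_per_element   # 48 GB moved per Triad pass

    elapsed_s = 0.30                      # hypothetical time for one pass
    print(f"{total_bytes / elapsed_s / 1e9:.0f} GB/s")  # 160 GB/s in this example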

FIO

We used the Flexible I/O (FIO) tool version 2.1.3 to compare the storage performance for the native and virtual configurations, with Docker containers running in both. We created a 10GB file on a 400GB local SSD drive and used direct I/O for all our tests so that there were no effects of buffer caching inside the OS. We used a 4k I/O size and tested three different I/O profiles: random 100% read, random 100% write, and a mixed case with random 70% read and 30% write. For the 100% random read and write tests, we selected 8 threads and an I/O depth of 16, whereas for the mixed test, we selected an I/O depth of 32 and 8 threads. We used taskset to set the CPU affinity of FIO threads in all configurations. All the details of the experimental setup are given below:

Hardware: 2 socket Intel Xeon E5-2660 2.2GHz with 392GB RAM, 16 total cores, Hyper-Threading enabled
Guest: 32-vCPU Ubuntu 14.04.1 64-bit server with 256GB RAM, with a separate ext4 disk in the guest (on VMFS5 in the vSphere run)
Benchmark:  FIO, Direct I/O, 10GB file
I/O Profile:  4k I/O, Random Read/Write: depth 16, jobs 8, Mixed: depth 32, jobs 8
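
For readers who want to run a similar profile, the sketch below assembles an fio command line approximating the 100% random-read case described above (4k blocks, direct I/O, 8 jobs, queue depth 16, 10GB file). The target path is a placeholder, and the taskset pinning used in the study is omitted here.

    import subprocess

    # Sketch of an fio invocation approximating the 100% random-read profile above.
    cmd = [
        "fio",
        "--name=randread-4k",
        "--filename=/data/fio-testfile",   # hypothetical path to the 10GB test file
        "--size=10g",
        "--direct=1",                      # bypass the guest page cache
        "--ioengine=libaio",
        "--rw=randread",
        "--bs=4k",
        "--iodepth=16",
        "--numjobs=8",
        "--group_reporting",
    ]
    subprocess.run(cmd, check=True)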


Figure 4: FIO benchmark performance for different test scenarios

The figure above shows the normalized maximum IOPS achieved for different configurations and different I/O profiles. For random read in a VM, we see that there is about 2% reduction in maximum achievable IOPS when compared to the native case. However, for the random write and mixed tests, we observed almost the same performance (within the noise margin) compared to the native configuration.

Netperf

Netperf is used to measure throughput and latency of networking operations. All the details of the experimental setup are given below:

Hardware (Server): 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Hardware (Client): 2 socket Intel Xeon X5570 2.93GHz with 64GB RAM, 8 cores total, Hyper-Threading disabled
Networking hardware: Broadcom Corporation NetXtreme II BCM57810
Software on server and Client: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 2 vCPU VM with 4GB RAM

The server machine for the Native configuration had only 2 CPUs online for a fair comparison with a 2-vCPU VM. The client machine was also configured to have 2 CPUs online to reduce variability. We tested four configurations: directly on the physical hardware (Native), in a Docker container (Native-Docker), in a virtual machine (VM), and in a Docker container inside a VM (VM-Docker). For the two Docker deployment scenarios, we also studied the effect of using host networking as opposed to the Docker bridge mode (the default operating mode), resulting in two additional configurations (Native-Docker-HostNet and VM-Docker-HostNet) for a total of six configurations.

We used TCP_STREAM and TCP_RR tests to measure the throughput and round-trip network latency between the server machine and the client machine using a direct 10Gbps Ethernet link between two NICs. We used standard network tuning like TCP window scaling and setting socket buffer sizes for the throughput tests.
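
A hedged sketch of how the two netperf tests might be launched from the client is shown below. The server hostname is a placeholder, and the socket-buffer and window tuning mentioned above would be applied on top of this.

    import subprocess

    SERVER = "netperf-server.example.com"   # placeholder for the server machine's address

    # Bulk throughput test (TCP_STREAM) and request/response latency test (TCP_RR),
    # each running for 60 seconds.
    for test in ("TCP_STREAM", "TCP_RR"):
        subprocess.run(["netperf", "-H", SERVER, "-t", test, "-l", "60"], check=True)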


Figure 5: Netperf receive performance for different test scenarios


Figure 6: Netperf transmit performance for different test scenarios

Figures 5 and 6 show the unidirectional throughput over a single TCP connection with a standard 1500-byte MTU for the transmit and receive TCP_STREAM cases. (We used multiple streams in the VM-Docker* transmit cases to reduce run-to-run variability caused by Docker bridge overhead and to get predictable results.) Throughput numbers for all configurations are identical and equal to the maximum possible 9.40Gbps on a 10GbE NIC.


Figure 7: Netperf TCP_RR performance for different test scenarios (Lower is better)

For the latency tests, we used the latency sensitivity feature introduced in vSphere 5.5 and applied the best practices for tuning latency in a VM as mentioned in this white paper. As shown in Figure 7, latency in a VM with the VMXNET3 device is only 15 microseconds more than in the native case because of the hypervisor networking stack. If users wish to reduce the latency even further for extremely latency-sensitive workloads, pass-through mode or SR-IOV can be configured to allow the guest VM to bypass the hypervisor network stack. This configuration can achieve round-trip latency similar to native, as shown in Figure 8. The Native-Docker and VM-Docker configurations add about 9-10 microseconds of overhead due to the Docker bridge NAT function. A Docker container (running natively or in a VM) configured to use host networking achieves latencies similar to those observed when the workload is not running in a container (native or VM).


Figure 8: Netperf TCP_RR performance for different test scenarios (VMs in pass-through mode)

Redis

We also wanted to take a look at how Docker in a virtualized environment performs with real world applications. We chose Redis because: (1) it is a very popular application in the Docker space (based on the number of pulls of the Redis image from the official Docker registry); and (2) it is very demanding on several subsystems at once (CPU, memory, network), which makes it very effective as a whole system benchmark.

Our test-bed comprised two hosts connected by a 10GbE network. One of the hosts ran the Redis server in different configurations as mentioned in the netperf section. The other host ran the standard Redis benchmark program, redis-benchmark, in a VM.

The details about the hardware and software used in the experiments are the following:

Hardware: HP ProLiant DL380e Gen8 2 socket Intel Xeon E5-2470 2.3GHz with 96GB RAM, 16 total cores, Hyper-Threading enabled
Guest OS: CentOS 7
VM: 16 vCPU, 93GB RAM
Application: Redis 2.8.13
Benchmark: redis-benchmark, 1000 clients, pipeline: 1 request, operations: SET 1 Byte
Software configuration: Redis thread pinned to CPU 0 and network interrupts pinned to CPU 1

Since Redis is a single-threaded application, we decided to pin it to one of the CPUs and pin the network interrupts to an adjacent CPU in order to maximize cache locality and avoid cross-NUMA node memory access.  The workload we used consists of 1000 clients with a pipeline of 1 outstanding request setting a 1 byte value with a randomly generated key in a space of 100 billion keys.  This workload is highly stressful to the system resources because: (1) every operation results in a memory allocation; (2) the payload size is as small as it gets, resulting in very large number of small network packets; (3) as a consequence of (2), the frequency of operations is extremely high, resulting in complete saturation of the CPU running Redis and a high load on the CPU handling the network interrupts.
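
Below is a sketch of how the server pinning and the benchmark invocation described above might look. The server hostname and the request count are placeholders, while the client count, pipeline depth, value size, and randomized key space mirror the description in the text.

    import shlex

    REDIS_HOST = "redis-server.example.com"   # placeholder for the server under test

    # Run on the server host: pin the single-threaded Redis process to CPU 0.
    # (Network interrupts were pinned to CPU 1 separately, as described above.)
    server_cmd = ["taskset", "-c", "0", "redis-server"]

    # Run on the client host: 1000 clients, pipeline of 1, 1-byte SET operations,
    # with keys drawn at random from a very large key space.
    client_cmd = [
        "redis-benchmark", "-h", REDIS_HOST,
        "-c", "1000",            # clients
        "-P", "1",               # pipeline of 1 outstanding request
        "-t", "set",             # SET operations only
        "-d", "1",               # 1-byte value
        "-r", "100000000000",    # randomized key space
        "-n", "10000000",        # hypothetical request count per run
    ]

    print(shlex.join(server_cmd))
    print(shlex.join(client_cmd))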

We ran five experiments for each of the above-mentioned configurations, and we measured the average throughput (operations per second) achieved during each run.  The results of these experiments are summarized in the following chart.


Figure 9: Redis performance for different test scenarios

The results are reported as a ratio with respect to native of the mean throughput over the 5 runs (error bars show the range of variability over those runs).

Redis running in a VM has slightly lower performance than on a native OS because of the network virtualization overhead introduced by the hypervisor. When Redis is run in a Docker container on native, the throughput is significantly lower than native because of the overhead introduced by the Docker bridge NAT function. In the VM-Docker case, the performance drop compared to the Native-Docker case is almost exactly the same small amount as in the VM-Native comparison, again because of the network virtualization overhead.  However, when Docker runs using host networking instead of its own internal bridge, near-native performance is observed for both the Docker on native hardware and Docker in VM cases, reaching 98% and 96% of the maximum throughput respectively.

Based on the above results, we can conclude that virtualization introduces only a 2% to 4% performance penalty.  This makes it possible to run applications like Redis in a Docker container inside a VM and retain all the virtualization advantages (security and performance isolation, management infrastructure, and more) while paying only a small price in terms of performance.

Summary

In this blog, we showed that in addition to the well-known security, isolation, and manageability advantages of virtualization, running an application in a Docker container in a vSphere VM adds very little performance overhead compared to running the application in a Docker container on a native OS. Furthermore, we found that a container in a VM delivers near native performance for Redis and most of the micro-benchmark tests we ran.

In this post, we focused on the performance of running a single instance of an application in a container, VM, or native OS. We are currently exploring scale-out applications and the performance implications of deploying them on various combinations of containers, VMs, and native operating systems.  The results will be covered in the next installment of this series. Stay tuned!

 

Monster Performance with SQL Server VMs on vSphere 5.5

VMware vSphere provides an ideal platform for customers to virtualize their business-critical applications, including databases, ERP systems, email servers, and even newly emerging technologies such as Hadoop.  I’ve been focusing on the first one (databases), specifically Microsoft SQL Server, one of the most widely deployed database platforms in the world.  Many organizations have dozens or even hundreds of instances deployed in their environments. Consolidating these deployments onto modern multi-socket, multi-core, multi-threaded server hardware is an increasingly attractive proposition for IT administrators.

Achieving optimal SQL Server performance has been a continual focus for VMware; with current vSphere 5.x releases, VMware supports much larger “monster” virtual machines that can scale up to 64 virtual CPUs and 1 TB of RAM, including exposing virtual NUMA architecture to the guest. In fact, the main goal of this blog and accompanying whitepaper is to refresh a 2009 study that demonstrated SQL performance on vSphere 4, given the marked technology advancements on both the software and hardware fronts.

These tests show that large SQL Server 2012 databases run extremely efficiently with VMware, achieving great performance in a variety of virtual machine configurations with only minor tunings to SQL Server and the vSphere ESXi host. These tunings and other best practices for fully optimizing large virtual machines for SQL Server databases are presented in the paper.

One test in the paper shows the maximum host throughput achieved with different numbers of virtual CPUs per VM. This was measured starting with 8 vCPUs per VM, then doubled to 16, then 32, and finally 64 (the maximum supported with vSphere 5.5).  DVD Store, which is a popular database tool and a key workload of the VMmark benchmark, was used to stress the VMs.  Here is a graph from the paper showing the 8 vCPU x 8 VMs case, which achieved an aggregate of 493,804 opm (operations per minute) on the host:

8 x 8 vCPU VM throughput

There are also tests using CPU affinity to show the performance differences between physical cores and logical processors (Hyper-Threads), the impact of various virtual NUMA (vNUMA) topologies, and experiments with the Latency Sensitivity advanced setting.

For more details and the test results, please download the whitepaper: Performance and Scalability of Microsoft SQL Server on VMware vSphere 5.5.

VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2

In part 1, we presented the VDI benchmark results on VSAN for 3-node and 7-node configurations. In this blog, we update the results for 5-node and 8-node VSAN configurations and show how VSAN scales for these configurations.

The View Planner benchmark was run again to find the VDImark for different numbers of nodes (5 and 8) in a VSAN cluster, as described in the previous blog, and the results are shown in the following figure.

View Planner QoS (VDImark)

 

The 5-node cluster achieved a VDImark score of 473, and the 8-node cluster achieved a score of 767. These results are consistent with the 3-node and 7-node results we saw earlier (about 95 VMs per host). So, the maximum number of VMs supported scales nicely as the number of nodes in the VSAN cluster is increased from 3 to 8.

To further illustrate the Group-A and Group-B response times, we show the average response time of individual operations for these runs for both Group-A and Group-B, as follows.

Group-A Response Times

As seen in the figure above, the average response times of the most interactive operations are less than one second, which is needed to provide a good end-user experience. Looking at the new results for the 5-node and 8-node VSAN configurations, the response time for most operations remains roughly the same across the different node configurations.

Group-B Response Times

Since Group-B is more sensitive to I/O and CPU usage, the above chart for Group-B operations is more important for seeing how View Planner scales. The chart shows that there is not much difference in the response times as the number of VMs is increased from 286 VMs on a 3-node cluster to 767 VMs on an 8-node cluster. Hence, storage-sensitive VDI operations also scale well as the VSAN cluster grows from 3 to 8 nodes, and user experience expectations are met.

To see other parts on the VDI/VSAN benchmarking blog series, check the links below:
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 1
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 2
VDI Benchmarking Using View Planner on VMware Virtual SAN – Part 3

 

 

Line-Rate Performance with 80GbE and vSphere 5.5

With the increasing number of physical cores in a system, the networking bandwidth requirement per server has also increased. We often find many networking-intensive applications are now being placed on a single server, which results in a single vSphere server requiring more than one 10 Gigabit Ethernet (GbE) adapter. Additional network interface cards (NICs) are also deployed to separate management traffic and the actual virtual machine traffic. It is important for these servers to service the connected NICs well and to drive line rate on all the physical adapters simultaneously.

vSphere 5.5 supports eight 10GbE NICs on a single host, and we demonstrate that a host running with vSphere 5.5 can not only drive line rate on all the physical NICs connected to the system, but can do it with a modest increase in overall CPU cost as we add more NICs.

We configured a single host with four dual-port Intel 10GbE adapters for the experiment and connected them back-to-back with an IXIA Application Network Processor Server with eight 10GbE ports to generate traffic. We then measured the send/receive throughput and the corresponding CPU usage of the vSphere host as we increased the number of NICs under test on the system.

Environment Configuration

  • System Under Test: Dell PowerEdge R820
  • CPUs: 4 x  Intel Xeon Processors E5-4650 @ 2.70GHz
  • Memory: 128GB
  • NICs: 8 x Intel 82599EB 10GbE, SFP+ Network Connection
  • Client: Ixia Xcellon-Ultra XT80-V2, 2U Application Network Processor Server

Challenges in Getting 80Gbps Throughput

To drive near 80 gigabits of data per second from a single vSphere host, we used a server that has not only the required CPU and memory resources, but also the PCI bandwidth that can perform the necessary I/O operations. We used a Dell PowerEdge Server with an Intel E5-4650 processor because it belongs to the first generation of Intel processors that supports PCI Gen 3.0. PCI Gen 3.0 doubles the PCI bandwidth capabilities compared to PCI Gen 2.0. Each dual-port Intel 10GbE adapter needs at least a PCI Gen 2.0 x8 to reach line rate. Also, the processor has Intel Data Direct I/O Technology where the packets are placed directly in the processor cache rather than going to the memory. This reduces the memory bandwidth consumption and also helps reduce latency.

Experiment Overview

Each 10GbE port of the vSphere 5.5 server was configured with a separate vSwitch, and each vSwitch had two Red Hat 6.0 Linux virtual machines running an instance of Apache web server. The web server virtual machines were configured with 1 vCPU and 2GB of memory with VMXNET3 as the virtual NIC adapter.  The 10GbE ports were then connected to the Ixia Application Server port. Since the server had two x16 slots and five x8 slots, we used the x8 slots for the four 10GbE NICs so that each physical NIC had identical resources. For each physical connection, we then configured 200 web/HTTP connections, 100 for each web server, on an Ixia server that requested or posted the file. We used a high number of connections so that we had enough networking traffic to keep the physical NIC at 100% utilization.

Figure 1. System design of NICs, switches, and VMs

The Ixia Xcellon application server used an HTTP GET request to generate a send workload for the vSphere host. Each connection requested a 1MB file from the HTTP web server.

Figure 2 shows that we could consistently get the available[1] line rate for each physical NIC as we added more NICs to the test. Each physical NIC was transmitting 120K packets per second and the average TSO packet size was close to 10K. The NIC was also receiving 400K packets per second for acknowledgements on the receive side. The total number of packets processed per second was close to 500K for each physical connection.

Figure 2. vSphere 5.5 drives throughput at available line rates. TSO on the NIC resulted in lower packets per second for send.

Similar to the send case, we configured the application server to post a 1MB file using an HTTP POST request for generating receive traffic for the vSphere host. We used the same number of connections and observed similar behavior for the throughput. Since the NIC does not support hardware LRO, each NIC was receiving 800K packets per second; with eight 10GbE NICs, the packet rate reached close to 6.4 million packets per second. vSphere performs software LRO for Linux guests, so the guest sees large packets, and the guest packet rate is around 240K packets per second. There was also significant TCP acknowledgement traffic: the host was transmitting close to 120K acknowledgement packets per second for each physical NIC, bringing the total packets processed close to 7.5 million per second across the eight 10GbE ports.

Figure 3. Average vSphere 5.5 host CPU utilization for send and receive

We also measured the average CPU reported for each of the tests. Figure 3 shows that the vSphere host’s CPU usage increased linearly as we added more physical NICs to the test for both send and receive. This indicates that performance improves at an expected and acceptable rate.

Test results show that vSphere 5.5 is an excellent platform on which to deploy networking-intensive workloads. vSphere 5.5 makes use of all the physical bandwidth capacity available and does this without incurring excessive CPU cost.

 


[1] A 10GbE NIC can achieve only about 9.4 Gbps of TCP throughput with a standard MTU. For a 1500-byte packet, 40 bytes go to the TCP/IP headers, and the Ethernet frame format adds another 38 bytes on the wire.
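
For completeness, here is the arithmetic behind that figure: with a 1500-byte MTU, per-packet protocol overhead limits TCP payload throughput on a 10GbE link to just under 9.5 Gbps, and TCP options such as timestamps reduce it further toward the 9.4 Gbps cited above.

    # TCP goodput on 10GbE with a standard 1500-byte MTU.
    mtu = 1500
    tcp_ip_headers = 40        # 20-byte IP header plus 20-byte TCP header
    ethernet_overhead = 38     # frame header and CRC plus preamble and inter-frame gap
    payload = mtu - tcp_ip_headers          # 1460 bytes of application data per packet
    wire_bytes = mtu + ethernet_overhead    # 1538 bytes on the wire per packet

    print(f"{10.0 * payload / wire_bytes:.2f} Gbps")  # ~9.49 Gbps; TCP options bring it closer to 9.4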