VMware VROOM! Blog - Author Archives: Bruce Herndon

Performance Best Practices for vSphere 5.5 is Available

We are pleased to announce the availability of Performance Best Practices for vSphere 5.5. This is a book designed to help system administrators obtain the best performance from vSphere 5.5 deployments.

The book addresses many of the new features in vSphere 5.5 from a performance perspective. These include:

  • vSphere Flash Read Cache, a new feature in vSphere 5.5 allowing flash storage resources on the ESXi host to be used for read caching of virtual machine I/O requests.
  • VMware Virtual SAN (VSAN), a new feature (in beta for vSphere 5.5) allowing storage resources attached directly to ESXi hosts to be used for distributed storage and accessed by multiple ESXi hosts.
  • The VMware vFabric Postgres database (vPostgres).

We’ve also updated and expanded on many of the topics in the book. These include:

  • Running storage latency and network latency sensitive applications
  • NUMA and Virtual NUMA (vNUMA)
  • Memory overcommit techniques
  • Large memory pages
  • Receive-side scaling (RSS), both in guests and on 10 Gigabit Ethernet cards
  • VMware vMotion, Storage vMotion, and Cross-host Storage vMotion
  • VMware Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM)
  • VMware Single Sign-On Server

The book can be found here.

VMmark 2.5 Released

I am pleased to announce the release of VMmark 2.5, the latest edition of VMware’s multi-host consolidation benchmark. The most notable change in VMmark 2.5 is the addition of optional power measurements for servers and servers plus storage. This capability will assist IT architects who wish to consider trade-offs in performance and power consumption when designing datacenters or evaluating new and emerging technologies, such as flash-based storage.

VMmark 2.5 contains a number of other improvements including:

  • Support for the VMware vCenter Server Appliance.
  • Support for VMmark 2.5 message and results delivery via Growl/Prowl.
  • Support for PowerCLI 5.1.
  • Updated workload virtual machine templates made from SLES for VMware, a free use version of SLES 11 SP2.
  • Improved pre-run initialization checking.

Full release notes can be found here.

Over the past two years since its initial release, VMmark 2.x has become the most widely-published virtualization benchmark with over fifty published results. We expect VMmark 2.5 and its new capabilities to continue that momentum. Keep an eye out for new power and power-performance results from our hardware partners as well as a series of upcoming blog entries presenting interesting power-performance experiments from the VMmark team.

The power measurement capability in VMmark 2.5 utilizes the SPEC® PTDaemon (Power/Temperature Daemon). The PTDaemon provides a straightforward and reliable building block with support for the many power analyzers that have passed the SPEC Power Analyzer Acceptance Test.
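Once samples have been collected from the power analyzer, the client-side bookkeeping amounts to averaging watt readings over the measurement interval. Below is a hedged sketch of that step; the comma-separated "timestamp,watts" sample format is an assumption for illustration only and is not the actual PTDaemon wire protocol (consult the PTDaemon documentation for the real commands and formats).

```python
# Illustrative sketch: average watt readings gathered from a power analyzer
# daemon such as SPEC's PTDaemon. The "timestamp,watts" line format below is
# an assumed example format, not the real PTDaemon protocol.

def average_watts(sample_lines):
    """Average the watts field out of 'timestamp,watts' sample lines."""
    readings = [float(line.split(",")[1]) for line in sample_lines if line.strip()]
    if not readings:
        raise ValueError("no power samples collected")
    return sum(readings) / len(readings)

# Hypothetical samples over a three-second measurement window.
samples = ["1357630000,151.2", "1357630001,149.8", "1357630002,150.4"]
print(round(average_watts(samples), 2))  # 150.47
```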

All currently published VMmark 2.0 and 2.1 results are comparable to VMmark 2.5 performance-only results. Beginning on January 8th 2013, any submission of benchmark results must use the VMmark 2.5 benchmark kit.

Performance Best Practices for VMware vSphere 5.1

We’re pleased to announce the availability of Performance Best Practices for vSphere 5.1.  This is a book designed to help system administrators obtain the best performance from vSphere 5.1 deployments.

The book addresses many of the new features in vSphere 5.1 from a performance perspective.  These include:

  • Use of a system swap file to reduce VMkernel and related memory usage
  • Flex SE linked clones that can relinquish storage space when it’s no longer needed
  • Use of jumbo frames for hardware iSCSI
  • Single Root I/O virtualization (SR-IOV), allowing direct guest access to hardware devices
  • Enhancements to SplitRx mode, a feature allowing network packets received in a single network queue to be processed on multiple physical CPUs
  • Enhancements to the vSphere Web Client
  • VMware Cross-Host Storage vMotion, which allows virtual machines to be moved simultaneously across both hosts and datastores

We’ve also updated and expanded on many of the topics in the book.

These topics include:

  • Choosing hardware for a vSphere deployment
  • Power management
  • Configuring ESXi for best performance
  • Guest operating system performance
  • vCenter and vCenter database performance
  • vMotion and Storage vMotion performance
  • Distributed Resource Scheduler (DRS), Distributed Power Management (DPM), and Storage DRS performance
  • High Availability (HA), Fault Tolerance (FT), and VMware vCenter Update Manager performance
  • VMware vSphere Storage Appliance (VSA) and vCenter Single Sign on Server performance

The book can be found at: http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.1.pdf.

Updated VMmark 2.1 Benchmarking Guide Available

Just a quick note to inform all benchmarking enthusiasts that we have released an updated VMmark 2.1 Benchmarking Guide. You can get it from the VMmark 2.1 download page. The updated guide contains a new troubleshooting section as well as more comprehensive instructions for using virtual clients and Windows Server 2008 clients.

 

VMmark 2.1 Released and Other News

VMmark 2.1 has been released and is available here. We had a list of improvements to VMmark 2.0 even as we finished up the initial release of the benchmark last fall. Most of the changes are intended to improve the usability, manageability, and scale-out capability of the benchmark. VMmark 2.0 has already generated tremendous interest from our partners and customers, and we expect VMmark 2.1 to add to that momentum.

Only the harness and vclient directories have been refreshed for VMware VMmark 2.1. The notable changes include the following:

  • Uniform scaling of infrastructure operations as tile and cluster sizes increase. Previously, the dynamic storage relocation infrastructure workload was held at a single thread.
  • Allowance for multiple Deploy templates as tile and cluster sizes increase.
  • Addition of conditional support for clients running Windows Server 2008 Enterprise Edition 64-bit.
  • Addition of support for virtual clients, provided all hardware and software requirements are met.
  • Improved host-side reporter functionality.
  • Improved environment time synchronization.
  • Updates to several VMmark 2.0 tools to improve ease of setup and running.
  • Miscellaneous improvements to configuration checking, error reporting, debug output, and user-specified options.

All currently published VMmark 2.0 results are comparable to VMmark 2.1. Beginning with the release of VMmark 2.1, any submission of benchmark results must use the VMmark 2.1 benchmark kit.

In other news, Fujitsu published their first VMmark 2.0 result last week.

Also, Intel has joined the VMmark Review Panel. Other members are AMD, Cisco, Dell, Fujitsu, HP, and VMware. Every result published on the VMmark results page is reviewed for correctness and compliance by the VMmark Review Panel. In most cases this means that a submitter's result will be examined by their competitors prior to publication, which enhances the credibility of the results.

That's all for now, but we should be back soon with more interesting experiments using VMmark 2.1.

Cisco Publishes First VMmark 2.0 Result

Our partners at Cisco recently published the first official VMmark 2.0 result using a matched pair of UCS B200 M2 systems. You can find all of the details at the VMmark 2.0 Results Page. Using a matched pair of systems provides a close analogue to single-system benchmarks like VMmark 1.x while providing a more realistic performance profile by including infrastructure operations such as vMotion. Official VMmark 2.0 results are reviewed for accuracy and compliance by the VMmark Review Panel, consisting of AMD, Cisco, Dell, Fujitsu, HP, and VMware.

 

VMmark 2.0 Release

VMmark 2.0, VMware’s next-generation multi-host virtualization benchmark, is now generally available here.

We were motivated to create VMmark 2.0 by the revolutionary advancements in virtualization since VMmark 1.0 was conceived. The rapid pace of innovation in both the hypervisor and the hardware has quickly transformed datacenters by enabling easier virtualization of heavy and bursty workloads coupled with dynamic VM relocation (vMotion), dynamic datastore relocation (Storage vMotion), and automation of many provisioning and administrative tasks across large-scale multi-host environments. In this paradigm, a large fraction of the stresses on the CPU, network, disk, and memory subsystems is generated by the underlying infrastructure operations. Load balancing across multiple hosts can also greatly affect application performance. The benchmarking methodology of VMmark 2.0 continues to focus on user-centric application performance while accounting for the effects of infrastructure activity on overall platform performance. This approach provides a much more accurate picture of platform capabilities than less comprehensive benchmarks.

I would like to thank all of our partners who participated in the VMmark 2.0 beta program. Their thorough testing and insightful feedback helped speed the development process while delivering a more robust benchmark. I anticipate a steady flow of benchmark results from partners over the coming months and years.

I should also acknowledge the hard work of my colleagues in the VMmark team that completed VMmark 2.0 on a relatively short timeline. We have performed a wide array of experiments during the development of VMmark 2.0 and will use the data as the basis for a series of upcoming posts in this forum. Some topics likely to be covered are cluster-wide scalability, performance of heterogeneous clusters, and networking tradeoffs between 1Gbit and 10Gbit for vMotion. I hope we can inspire others to use VMmark 2.0 to explore performance characteristics in multi-host environments in novel and interesting ways all the way up to cloud-scale.

 

VMmark 2.0 Beta Overview

As I mentioned in my last blog, we have been developing VMmark 2.0, a next-generation multi-host virtualization benchmark that models not only application performance in a virtualized environment but also the effects of common virtual infrastructure operations. This is a natural progression from single-host virtualization benchmarks like VMmark 1.x and SPECvirt_sc2010. Benchmarks measuring single-host performance, while valuable, do not adequately capture the complexity inherent in modern virtualized datacenters. With that in mind, we set out to construct a meaningfully stressful virtualization benchmark with the following properties:

  • Multi-host to model realistic datacenter deployments
  • Virtualization infrastructure workloads to more accurately capture overall platform performance
  • Heavier workloads than VMmark 1.x to reflect heavier customer usage patterns enabled by the increased capabilities of the virtualization and hardware layers
  • Multi-tier workloads driving both VM-to-VM and external network traffic
  • Workload burstiness to ensure robust performance under variable high loads

The addition of virtual infrastructure operations to measure their impact on overall system performance in a typical multi-host environment is a key departure from traditional single-server benchmarks. VMmark 2.0 includes the execution of the following foundational and commonly-used infrastructure operations:

  • User-initiated vMotion 
  • Storage vMotion
  • VM cloning and deployment
  • DRS-initiated vMotion to accommodate host-level load variations

The VMmark 2.0 tile features a significantly heavier load profile than VMmark 1.x and consists of the following workloads:

  • DVD Store 2 – multi-tier OLTP workload consisting of a 4-vCPU database VM and three 2-vCPU webserver VMs driving a bursty load profile
  • OLIO – multi-tier social networking workload consisting of a 4-vCPU web server and a 2-vCPU database server
  • Exchange2007 – 4-vCPU mailserver workload
  • Standby server – 1 vCPU lightly-loaded server
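To make the load profile concrete, the tile described above can be written down as a small data structure. This is an illustrative sketch only: the vCPU counts and tiers come straight from the list, and everything else about the real benchmark kit is omitted.

```python
# Illustrative sketch of the VMmark 2.0 tile composition from the list above.
# This is not part of the actual benchmark kit; only the vCPU counts and
# workload tiers come from the post.
TILE = {
    "DVD Store 2":  {"vcpus": {"database": 4, "web1": 2, "web2": 2, "web3": 2}},
    "OLIO":         {"vcpus": {"web": 4, "database": 2}},
    "Exchange2007": {"vcpus": {"mailserver": 4}},
    "Standby":      {"vcpus": {"standby": 1}},
}

def tile_vcpus(tile):
    """Total vCPUs across all VMs in one tile."""
    return sum(n for wl in tile.values() for n in wl["vcpus"].values())

print(tile_vcpus(TILE))  # 21
```

Summing the per-VM counts shows each VMmark 2.0 tile carries 21 vCPUs, a substantially heavier footprint than a VMmark 1.x tile.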

We kicked off an initial partner-only beta program in late June and are actively polishing the benchmark for general release. We will be sharing a number of interesting experiments using VMmark 2.0 in our blog leading up to the general release of the benchmark, so stay tuned.

Surveying Virtualization Performance Trends with VMmark

The trends in published VMmark scores are an ideal illustration of the historical long-term performance gains for virtualized platforms. We began work on what would become VMmark 1.0 almost five years ago. At the time, ESX 2.5 was the state-of-the-art hypervisor. Today’s standard features such as DRS, DPM, and Storage vMotion were in various prototype and development stages. Processors like the Intel Pentium 4 5xx series (Prescott) or the single-core AMD 2yy-series Opterons were the high-end CPUs of choice. Second-generation hardware-assisted virtualization features such as AMD’s Rapid Virtualization Indexing (RVI) and Intel’s Extended Page Tables (EPT) were not yet available. Nevertheless, virtualization’s first wave was allowing customers to squeeze much more value from their existing resources via server consolidation. Exactly how much value was difficult to quantify. Our VMmark odyssey began with the overall goal of creating a representative and reliable benchmark capable of providing meaningful comparisons between virtualization platforms.

VMmark 1.0 was released nearly three years ago after two years of painstaking work and multiple beta releases of the benchmark. The reference architecture for VMmark 1.x is a two-processor Pentium 4 (Prescott) server running ESX 3.0. That platform was capable of supporting one VMmark tile (six VMs) and by definition achieved a score of 1.0. (All VMmark results are normalized to this reference score.) The graph below shows a sampling of published two-socket VMmark scores for each successive processor generation.
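The normalization just described amounts to dividing a platform's aggregate throughput by that of the reference system. A toy sketch with hypothetical throughput numbers (not real benchmark data):

```python
# Toy sketch of VMmark 1.x score normalization: a score is raw aggregate
# throughput divided by the reference platform's throughput, so the
# two-processor Prescott reference system scores exactly 1.0 by definition.
# The throughput values here are made-up illustration numbers.
REFERENCE = 100.0  # hypothetical raw throughput of the reference platform

def vmmark_score(raw_throughput):
    return raw_throughput / REFERENCE

print(vmmark_score(100.0))   # 1.0  -> the reference platform itself
print(vmmark_score(3200.0))  # 32.0 -> roughly 2x/year compounded over 5 years
```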

[Graph: published two-socket VMmark scores by processor generation]

ESX 3.0, a vastly more capable hypervisor than ESX 2.5, had arrived by the time of the VMmark 1.0 GA in mid-2007. Greatly improved CPU designs were also available. Two processors commonly in use by that time were the dual-core Xeon 51xx series and the quad-core Xeon 53xx series. ESX 3.5 was released with a number of performance improvements, such as TCP Segmentation Offloading (TSO) support for networking, in the same timeframe as the Xeon 54xx. Both ESX 4.0 and Intel 55xx (Nehalem) CPUs became available in early 2009. ESX 4.0 was a major new release with a broad array of performance enhancements and support for new hardware features such as EPT and simultaneous multi-threading (SMT), providing a significant boost in overall performance. The recently released hexa-core Intel 56xx CPUs (Westmere) show excellent scaling compared to their quad-core 55xx brethren. (Overall, ESX delivers excellent scaling and takes advantage of increased core counts on all types of servers.) What is most striking to me in this data is the big picture: the performance of virtualized consolidation workloads as measured by VMmark 1.x has roughly doubled every year for the past five years.

In fact, the performance of virtualized platforms has increased to the point that the focus has shifted away from consolidating lightly-loaded virtual machines on a single server to virtualizing the entire range of workloads (heavy and light) across a dynamic multi-host datacenter. Not only application performance but also infrastructure responsiveness and robustness must be modeled to characterize modern virtualized environments. With this in mind, we are currently developing VMmark 2.0, a much more complex, multi-host successor to VMmark 1.x. We are rapidly approaching a limited beta release of this new benchmark, so stay tuned for more. But in this post, I’d like to look back and remember how far we’ve come with VMmark 1.x. Let’s hope the next five years are as productive.

Measuring the Cost of SMP with Mixed Workloads

It is no secret that vSphere 4.0 delivers excellent performance and provides the capability to virtualize the beefiest of workloads. Several impressive performance studies using ESX 4.0 have already been presented. (My favorite is this database performance whitepaper.) However, I continue to hear questions about the scheduling overhead of larger VMs within a heavily-utilized, mixed-workload environment. We put together a study using simple variations of VMware’s mixed-workload consolidation benchmark VMmark to help answer this question.

For this study we chose two of the VMmark workloads, database and web server, as the vCPU-scalability targets. These VMs represent workloads that typically show the greatest range of load in production environments, so they are natural choices for a scalability assessment. We varied the number of vCPUs in these two VMs between one and four and measured the throughput scaling and CPU utilization of each configuration by increasing the number of benchmark tiles up to and beyond system saturation.

The standard VMmark workload levels were used and were held constant for all tests. Given that the workload is constant, we are measuring the cost of SMP VMs and their impact on the scheduler. This approach places increasing stress on the hypervisor as the vCPU allocations increase and creates a worst-case scenario for the scheduler. The vCPU allocations for the three configurations are shown in the table below:

 

Config     Webserver  Database  Fileserver  Mailserver  Javaserver  Standby  Total vCPUs
Config1        1          1         1           2           2          1          8
Config2        2          2         1           2           2          1         10
Config3        4          4         1           2           2          1         14

 

Config2 uses the standard VMmark vCPU allocation of 10 vCPUs per tile. Config1 contains 20% fewer vCPUs than the standard while Config3 contains 40% more than the standard.
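Those totals and percentage deltas can be recomputed from the table in a few lines. This is a toy check for the reader, not part of the benchmark harness:

```python
# Per-VM vCPU allocations from the table above, one tile each, in the order:
# webserver, database, fileserver, mailserver, javaserver, standby.
configs = {
    "Config1": [1, 1, 1, 2, 2, 1],
    "Config2": [2, 2, 1, 2, 2, 1],  # the standard VMmark allocation
    "Config3": [4, 4, 1, 2, 2, 1],
}
totals = {name: sum(v) for name, v in configs.items()}
print(totals)  # {'Config1': 8, 'Config2': 10, 'Config3': 14}

standard = totals["Config2"]
for name, t in totals.items():
    delta = 100 * (t - standard) / standard
    print(f"{name}: {t} vCPUs ({delta:+.0f}% vs standard)")
```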

We also used Windows Server 2008 instead of Windows Server 2003 where possible to characterize its behavior in anticipation of using Server 2008 in a next-generation benchmark. As a result, we increased the memory in the Javaserver VMs from 1GB to 1.4GB to ensure sufficient memory space for the JVM. The table below provides a summary of each VM’s configuration:

Workload         Memory      Disk       OS
Mailserver       1GB         24GB       Windows 2003 32-bit
Javaserver       1.4GB       12GB (*)   Windows 2008 64-bit
Standby Server   256MB (*)   12GB (*)   Windows 2008 32-bit
Webserver        512MB       8GB        SLES 10 SP2 64-bit
Database         2GB         10GB       SLES 10 SP2 64-bit
Fileserver       256MB       8GB        SLES 10 SP2 32-bit
Below is a basic summary of the hardware used:

  • Dell PowerEdge R905 with 4 x 2.6GHz quad-core AMD Opteron 8382 processors
  • Firmware version 3.0.2 (latest available)
  • 128GB DDR2 memory
  • 2 x Intel E1000 dual-port NICs
  • 2 x QLogic 2462 dual-port 4Gb Fibre Channel HBAs
  • 2 x EMC CX3-80 storage arrays
  • 15 x HP DL360 client systems

Experimental Results

Figure 1 below shows both the CPU utilization and the throughput scaling normalized to the single-tile throughput of Config1. Both throughput and CPU utilization remain roughly equal for all three configurations at load levels of 1, 3, and 6 tiles (6, 18, and 36 VMs, respectively). The cost of using SMP VMs is negligible here. The throughputs remain roughly equal while the CPU utilization curves begin to diverge as the load increases to 9, 10, and 11 tiles (54, 60, and 66 VMs, respectively). Furthermore, all three configurations achieve roughly linear scaling up to 11 tiles (66 VMs). CPU utilization when running 11 tiles was 85%, 90%, and 93% for Config1, Config2, and Config3, respectively. Considering that few customers are comfortable running at overall system utilizations above 85%, this result shows remarkable scheduler performance and limited SMP co-scheduling overhead within a typical operating regime.

[Figure 1: Normalized throughput scaling and CPU utilization vs. number of tiles]

Figure 2 below shows the same normalized throughput as Figure 1, as well as the total number of running vCPUs, to illustrate the additional stresses put on the hypervisor by the progressively larger SMP configurations. For instance, the throughput scaling at nine tiles is equivalent despite the fact that Config1 requires only 72 vCPUs while Config3 uses 126 vCPUs. As expected, Config3, with its heavier resource demands, is the first to transition into system saturation. This occurs at a load of 12 tiles (72 VMs). At 12 tiles, there are 168 vCPUs active, 48 more vCPUs than used by Config2 at 12 tiles. Nevertheless, Config3 scaling only lags Config2 by 9% and Config1 by 8%. Config2 reaches system saturation at 14 tiles (84 VMs), where it lags Config1 by 5%. Finally, Config1 hits the saturation point at 15 tiles (90 VMs).
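The vCPU counts quoted above follow directly from the per-tile totals (8, 10, and 14 vCPUs per tile for Config1 through Config3). A quick illustrative recomputation:

```python
# vCPUs per tile for each configuration, from the allocation table earlier
# in the post.
VCPUS_PER_TILE = {"Config1": 8, "Config2": 10, "Config3": 14}

def total_vcpus(config, tiles):
    """Total running vCPUs for a configuration at a given tile count."""
    return VCPUS_PER_TILE[config] * tiles

print(total_vcpus("Config1", 9))   # 72  vCPUs at nine tiles
print(total_vcpus("Config3", 9))   # 126 vCPUs at nine tiles
print(total_vcpus("Config3", 12))  # 168 vCPUs at twelve tiles
print(total_vcpus("Config3", 12) - total_vcpus("Config2", 12))  # 48 more than Config2
```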

[Figure 2: Normalized throughput scaling and total running vCPUs vs. number of tiles]

Overall, these results show that ESX 4.0 effectively and fairly manages VMs of all shapes and sizes in a mixed-workload environment. ESX 4.0 also exhibits excellent throughput parity and minimal CPU differences between the three configurations throughout the typical operating envelope. ESX continues to demonstrate first-class enterprise stability, robustness, and predictability in all cases. Considering how well ESX 4.0 handles a tough situation like this, users can have confidence when virtualizing their larger workloads within larger VMs.

(*) The spartan memory and disk allocations for the Windows Server 2008 VMs might cause readers to question whether the virtual machines were adequately provisioned. Since our internal testing covers a wide array of virtualization platforms, reducing the memory of the Standby Server enables us to measure the peak performance of the server before encountering memory bottlenecks on virtualization platforms where physical memory is limited and sophisticated memory overcommit techniques are unavailable. Likewise, we want to configure our tests so that the storage capacity doesn’t induce an artificial bottleneck. Neither the Standby Server nor the Javaserver places significant demands on its virtual disks, allowing us to optimize storage usage. We carefully compared this spartan Windows Server 2008 configuration against a richly configured Windows Server 2008 tile and found no measurable difference in stability or performance. Of course, I would not encourage this type of configuration in a live production setting. On the other hand, if a VM gets configured in this way, vSphere users can sleep well knowing that ESX won’t let them down.