Home > Blogs > VMware VROOM! Blog > Tag Archives: vCenter

Tag Archives: vCenter

vCenter performance improvements from vSphere 6.5 to 6.7: What does 2x mean?

In a recent blog, the VMware vSphere team shared the following performance improvements in vSphere 6.7 vs. 6.5:

Moreover, with vSphere 6.7 vCSA delivers phenomenal performance improvements (all metrics compared at cluster scale limits, versus vSphere 6.5):
2X faster performance in vCenter operations per second
3X reduction in memory usage
3X faster DRS-related operations (e.g. power-on virtual machine)

As senior engineers within the VMware Performance and vSphere teams, we are writing this blog to provide more details regarding these numbers and to explain how we measured them. We also briefly explain some of the technical details behind these improvements.

Cluster Scale

Let us first explain what all metrics compared at cluster scale limits means. What is cluster scale? Here, it is an environment that includes a vCenter server that is configured for the largest vSphere cluster that VMware currently supports, namely 64 hosts and 8,000 powered-on VMs. This setup represents a high-consolidation environment with 125 VMs per host. Note that this is different from the setup used in our previous blog about vCenter 6.5 performance improvements. The setup in that blog was our datacenter scale environment, which used the largest number of supported hosts (2000) and VMs (25,000), so the numbers from that blog should not be compared to these numbers.

2x and 3x

Let us now explain some of the performance numbers quoted. We produced the numbers by measuring workload runs in our cluster scale setup.

2x vCenter Operations Per Second, vSphere 6.7 vs. 6.5, cluster scale limits. We measure operations per second using an internal benchmark called vcbench. We describe vcbench below under “Benchmark Details.” One of the outputs of this workload is management operations (for example, clone, powerOn, vMotion) performed per second.

  • In 6.5, vCenter was capable of performing approximately 8.3 vcbench operations per second (described below under “Benchmark Details”) in the cluster-scale testbed.
  • In 6.7, vCenter is now capable of performing approximately 16.7 vcbench operations per second.

3x reduction in memory usage. In addition to our vcbench workload, we also include a simplified workload that simply executes a standard workflow: create a VM, power it on, power it off, and delete it. The rapid powerOn and powerOff of VMs in this setup puts more load on the DRS subsystem than the typical vcbench test.

  • In 6.5, the core vCenter process (vpxd) used on average about 10 GB to complete the workflow benchmark (described below under “Benchmark Details”).
  • In 6.7, the core vCenter process used approximately 3 GB to complete this run, while also achieving higher churn (that is, more workflow ‘create/powerOn/powerOff/delete’ cycles completed within the same time period).

3x faster DRS-related operations. In our vcbench workload, we measure not just the overall operations per second, but also the average latencies of individual operations like powerOn (which exercises the majority of the DRS software stack). We issue many concurrent operations, and we measure latency under such load.

  • In 6.5, the average latency of an individual powerOn during a vcbench run was 9.5 seconds.
  • In 6.7, the average latency of an individual powerOn during a vcbench run was 2.8 seconds.

The latencies above reflect the fact that a cluster has 8,000 VMs and many operations in flight at once. As a result, individual operations are slower than if they were simply run on a single-host, single-VM environment.

What does this mean to you as a customer?

As a result of these improvements, customers in high-consolidation environments will see reduced resource consumption due to DRS and reduced latency to generate VMotions for load balancing. Customers will also see faster initial placement of VMs.

Brief Deep Dive into Improvements

Before we describe the improvements, let us first briefly explain how DRS works, at a very high level.

When powering on a VM, vCenter must determine where to place the VM. This is called initial placement. Many subsystems, including DRS and policy management, must be consulted to determine valid hosts on which this VM can run. This phase is called constraint check. Once DRS determines the host on which a VM should be powered on, it registers the VM onto that host and issues the powerOn. This initial placement requires a snapshot of the inventory: by snapshot, we mean that DRS records the current configuration of hosts and VMs in the cluster.

In addition to balancing during initial placement, every 5 minutes, DRS re-examines the load of the cluster and performs a series of computations to generate VMotions that help balance the load across hosts. This phase is called periodic rebalancing. Periodic rebalancing requires an examination of the historical utilization statistics for each host and VM (for example, over the previous hour) in order to determine proper placement.

Finally, as VMs get moved around, the used capacity in resource pools changes. The vCenter server periodically exchanges messages called SpecSyncs with each host to push down the most recent resource pool configuration. The SpecSync operation requires traversing a host’s resource pool structure and changing it to make sure it matches vCenter’s configuration.

With this understanding in mind, let us now give some technical details behind the improvements above. As with our previous blog about vCenter performance improvement, we describe changes in terms of rocks (that is, somewhat large changes to entire subsystems) and pebbles (smaller individual changes throughout the code base).

Rocks

The three main rocks that we address in 6.7 are simplified initial placement, faster resource divvying, and faster SpecSyncs.

Simplified initial placement. As mentioned above, initial placement relies on a snapshot of the current state of the inventory. Creating this snapshot can be a heavyweight operation, requiring a number of data copies and locking of host and cluster data structures to ensure a consistent view of the data. In 6.7, we moved to a lightweight online approach that keeps the state up-to-date in a continuous manner, avoiding snapshots altogether. With this approach, we significantly reduce locking demands and significantly reduce the number of times we need to copy data. In some highly-contended clusters, this reduced the initial placement time from seconds down to milliseconds.

Faster (and more frequent) resource divvying. Divvying is the act of determining the resource allocations for each VM. Every five minutes, the state of the cluster is examined and both divvying and then rebalancing (using VMotion) are performed. To make the divvying phase faster, we performed a number of optimizations.

  • We changed the approach to examining historical usage statistics. Instead of storing metrics for every VM and every host over an hour, we aggregated the data, which allowed us to store a smaller number of metrics per host. This dramatically reduced memory usage and simplified the computation of the desired load for each host.
  • We restructured the code to remove compatibility checks (for example, those that help to determine which VMs can run on which hosts) during this divvying phase. In 6.5 and earlier, divvying a load also involved various host/VM compatibility calculations. Now, we store the compatible matrix and update it when compatibility changes, which is typically infrequent.
  • We have also done significant code refactoring (described below under “Pebbles”) to this code path.

By implementing these changes and making divvying faster in 6.7, we are now able to perform divvying more frequently: once per minute instead of once every five minutes. As a result, resources flow more quickly between resource pools, and we are better able to enforce fairness guarantees in DRS clusters.

Note that periodic load balancing (through VMotion) still occurs every five minutes.

Faster SpecSyncs. To perform a SpecSync, vCenter sends a resource pool configuration to a host. The host must examine this configuration and create a list of changes required to bring that host in sync with vCenter. In 6.5 and earlier, depending on the number of VMs, creating this list of changes could result in hundreds of operations on a host, and the runtime was highly variable. In 6.7, we made this computation more deterministic, reducing the number of operations and lowering the latency appropriately.

Pebbles

In addition to the changes above, we also performed a number of optimizations throughout our code base.

Code Refactoring. In 6.5 and before, admission control decisions were made by multiple independent subsystems within vCenter (for example, DRS would be responsible for some decisions, and HA would make others). In 6.7, we simplified our code such that all admission control decisions are handled by a module within DRS. Reducing multiple copies of this code simplifies debugging and reduces resource usage.

Finer-grained locks. In 6.7, we continued to make strides in reducing the scope of our locks. We introduced finer-grained locks so that DRS would not have to lock an entire VM to examine certain pieces of state. We made similar improvements to both hosts and clusters.

Removal of unnecessary classes, maps, and sets. In refactoring our code, we were able to remove a number of classes and thereby reduce the number of copies of data in our system. The maps and sets that were needed to support these classes could also be removed.

Preferring integers over strings. In many situations, we replaced strings and string comparisons with integers and integer comparisons. This dramatically reduces memory allocation overhead and speeds up comparisons, reducing CPU.

Benchmark Details

We measure Operations Per Second (OPS) using a VMware benchmark called vcbench. (For more details about vcbench, see “Benchmarking Details” in vCenter 6.5 Performance: what does 6x mean?) Briefly, vcbench is a java-based application that takes as an input a runlist, which is a list of management operations to perform. These operations include powering on VMs, cloning VMs, reconfiguring VMs, and migrating VMs, among many other operations. We chose the operations based on an analysis of typical customer management scenarios. The vcbench application uses vSphere APIs to issue these operations to vCenter. The benchmark creates multiple threads and issues operations on those threads in parallel (up to 32). We measure the operations per second by taking the total number of operations performed in a given timeframe (say, 1 hour), and dividing it by the time interval.

Our workflow benchmark is very similar to vcbench. The main difference is that more operations are issued per host at a time. In vcbench, one operation is issued per host at a time, while in workflow, up to 8 operations are issued per host at a time.

In many cases, the size of the VM has an impact on operational latency. For example, a VM with a lot of memory (say, 32 GB) and large disks (say, 100 GB) will take longer to clone because more memory will need to be snapshotted, and more disk data will need to be copied. To minimize the impact of the disk subsystem in our measurements, we use small VMs (<4GB memory, < 8GB disk).

Because we limit ourselves to 32 threads per vCenter in this single-cluster setup, throughput numbers are smaller than for our datacenter-at-scale setups (2,000 hosts; 25,000 VMs), which use up to 256 concurrent threads per vCenter.

Summary

In this blog, we have described some of the performance improvements in vCenter from 6.5 to 6.7. A variety of improvements to DRS have led to improved throughput and reduced resource usage for our vcbench workload in a cluster scale setup of 64 hosts and 8,000 powered-on VMs. Many of these changes also apply to larger datacenter-scale setups, although the scope of improvement may not be as pronounced.

Acknowledgments

The vCenter improvements described in this blog are the results of thousands of person-hours from vCenter developers, performance engineers, and others throughout VMware. We are deeply grateful to them for making this happen.

Authors

Zhelong Pan is a senior staff engineer in the Distributed Resource Management Team at VMware. He works on cluster management, including shared resource allocation, VM placement, and load balancing. He is interested in performance optimizations, including virtualization performance and management software performance. He has been at VMware since 2006.

Ravi Soundararajan is a principal engineer in the Performance Group at VMware. He works on vCenter performance and scalability, from the UI to the server to the database to the hypervisor management agents. He has been at VMware since 2003, and he has presented on the topic of vCenter Performance at VMworld from 2013-2017. His Twitter handle is @vCenterPerfGuy.

DRS Lens – A new UI dashboard for DRS

DRS Lens provides an alternative UI for a DRS enabled cluster. It gives a simple, yet powerful interface to monitor the cluster real time and provide useful analyses to the users. The UI is comprised of different dashboards in the form of tabs for each cluster being monitored.

Continue reading

vCenter 6.5 Performance: what does 6x mean?

At the VMworld 2016 Barcelona keynote, CTO Ray O’Farrell proudly presented the performance improvements in vCenter 6.5. He showed the following slide:

6x_slide

Slide from Ray O’Farrell’s keynote at VMworld 2016 Barcelona, showing 2x improvement in scale from 6.0 to 6.5 and 6x improvement in throughput from 5.5 to 6.5.

As a senior performance engineer who focuses on vCenter, and as one of the presenters of VMworld Session INF8108 (listed in the top-right corner of the slide above), I have received a number of questions regarding the “6x” and “2x scale” labels in the slide above. This blog is an attempt to explain these numbers by describing (at a high level) the performance improvements for vCenter in 6.5. I will focus specifically on the vCenter Appliance in this post.

Continue reading

vCenter Server 5.1 Database Performance with Large Inventories

 

Better performance, lower latency, and streamlined statistics are just some of the new features you can expect to find in the vCenter Server in version 5.1. The VMware performance team has published a paper about vCenter Server 5.1 database performance in large environments. The paper shows that statistics collection creates the biggest performance impact on the vCenter Server database. In vSphere 5.1, several aspects of statistics collection have been changed to improve the overall performance of the database. There were three sources of I/O to the statistics tables in vCenter Server—inserting statistics, rolling up statistics between different intervals, and deleting statistics when they expire. These activities have been improved by changing the way the relevant data is persisted to the tables, by partitioning the tables instead of using staging tables. In addition, by removing the staging tables, statistics collection is more robust, resolving the issues described in KB 2011523 and KB 1003878. Scalability is also improved by allowing larger inventories to be supported because they don’t take so long to read/write data from the old staging tables. The paper also includes best practices to take advantage of these changes in environments where vCenter Server has a large inventory. For more details, see vCenter Server 5.1 Database Performance in Large-Scale Environments.

Here are the URLs for the paper, “VMware vCenter Server 5.1 Database Performance Improvements and Best Practices for Large-Scale Environments”:

http://www.vmware.com/resources/techresources/10302

http://www.vmware.com/files/pdf/techpaper/VMware-vCenter-DBPerfBestPractices.pdf

 

Performance Best Practices for VMware vSphere 5.0

A new version of Performance Best Practices for vSphere is now available.  This is a book designed to help system administrators obtain the best performance from vSphere deployments.

We've addressed many of the new features in vSphere 5.0 from a performance perspective.  These include:

  • Storage Distributed Resource Scheduler (Storage DRS), which performs automatic storage I/O load balancing
  • Virtual NUMA, allowing guests to make efficient use of hardware NUMA architecture
  • Memory compression, which can reduce the need for host-level swapping
  • Swap to host cache, which can dramatically reduce the impact of host-level swapping
  • SplitRx mode, which improves network performance for certain workloads
  • VMX swap, which reduces per-VM memory reservation
  • Multiple vMotion vmknics, allowing for more and faster vMotion operations

We've also significantly updated and expanded many of the topics we've covered in previous editions of the book.  These include:

  • Choosing hardware for a vSphere deployment
  • Power management
  • Configuring ESXi for best performance
  • Guest operating system performance
  • vCenter and vCenter database performance
  • vMotion and Storage vMotion performance
  • Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM) performance
  • High Availability (HA), Fault Tolerance (FT), and VMware vCenter Update Manager performance

The book can be found at: Performance Best Practices for VMware vSphere 5.0.

 

Troubleshooting Performance Related Problems in vSphere 4.1 Environments

The hugely popular Performance Troubleshooting for VMware vSphere 4 guide is now updated for vSphere 4.1 . This document provides step-by-step approaches for troubleshooting most common performance problems in vSphere-based virtual environments. The steps discussed in the document use performance data and charts readily available in the vSphere Client and esxtop to aid the troubleshooting flows. Each performance troubleshooting flow has two parts:

  1. How to identify the problem using specific performance counters.
  2. Possible causes of the problem and solutions to solve it.

New sections that were added to the document include troubleshooting performance problems in resource pools on standalone hosts and DRS clusters, additional troubleshooting steps for environments experiencing memory pressure (hosts with compressed and swapped memory), high CPU ready time in hosts that are not CPU saturated, environments sharing resources such as storage and network, and environments using snapshots.

The Troubleshooting guide can be found here. Readers are encouraged to provide their feedback and comments in the performance community site at this link

 

VMware vCloud Director 1.0 Performance and Best Practices — Paper Published

Do you want to know how many VMware vCloud Director server instances are needed for your deployment? Do you know how to load balance the VC Listener across multiple vCloud Director instances? Are you curious about how OVF File Upload behaves on a WAN environment? What is the most efficient way to import LDAP users? This white paper, VMware vCloud Director 1.0 Performance and Best Practices, provides insight  to help you answer all the above questions.

In this paper, we discuss VMware vCloud Director 1.0 architecture, server instance sizing, LDAP sync, OVF file upload, vApp clones across vCenter Server instances, inventory sync, and adjusting thread pool and cache limits. The following performance tips are provided:

  • Ensure the inventory cache size is big enough to hold all inventory objects.
  • Ensure JVM heap size is big enough to satisfy the memory requirement for the inventory cache and memory burst  so the vCloud Director server does not run out of memory.
  • Import LDAP users by groups instead of importing individual users one by one.
  • Ensure the system is not running LDAP sync too frequently because the vCloud database is updated at regular intervals.
  • In order to help load balance disk I/O, separate the storage location for OVF uploads from the location of the vCloud Director server logs. 
  • Have a central datastore to hold the most popular vApp templates and media files and have this datastore mounted to at least one ESX host per cluster.
  • Be aware that the latency to deploy a vApp in fence mode has a static cost and does not increase proportionately with the number of VMs in the vApp.
  • Deploy multiple vApps concurrently to achieve high throughput. 
  • For load balancing purposes, it is possible to move a VC Listener to another vCloud Director instance by reconnecting the vCenter Server through the vCloud Director user interface.

Please read the white paper for more performance tips with more details. You can download the full white paper from here.

Virtualizing SQL Server-based vCenter database – Performance Study

vSphere is an industry-leading virtualization platform that enables customers to build private clouds for running enterprise applications such as SQL server databases. Customers can expect near-native performance from their virtualized SQL databases when running in a vSphere environment. VMware vCenter Server, the management component of vSphere, uses a database to store and organize information related to vSphere-based virtual environments. This database can be implemented using SQL server. Based on the previous VMware performance studies involving SQL databases, it is reasonable to expect the performance of a virtualized SQL Server-based vCenter database to be similar to that in native.

A study was conducted in the VMware performance engineering lab to validate the assumption. The results of the study show that:

  • The most resource-intensive operations of a virtualized SQL Server-based vCenter database perform at a level comparable to that in native environment.
  • A SQL Server-based vCenter database managing a vSphere virtual environment of any scale can be virtualized on vSphere.
  • SQL databases, in general, perform at a near-native level when virtualized on vSphere 4.1.

Complete details of the experiments and their results can be found in this technical document.

For comments or questions on this article, please join me at voiceforvirtual.com.