Tag Archives: vCenter

vCenter 6.5 Performance: what does 6x mean?

At the VMworld 2016 Barcelona keynote, CTO Ray O’Farrell proudly presented the performance improvements in vCenter 6.5. He showed the following slide:

Slide from Ray O’Farrell’s keynote at VMworld 2016 Barcelona, showing 2x improvement in scale from 6.0 to 6.5 and 6x improvement in throughput from 5.5 to 6.5.

As a senior performance engineer who focuses on vCenter, and as one of the presenters of VMworld Session INF8108 (listed in the top-right corner of the slide above), I have received a number of questions regarding the “6x” and “2x scale” labels in the slide above. This blog is an attempt to explain these numbers by describing (at a high level) the performance improvements for vCenter in 6.5. I will focus specifically on the vCenter Appliance in this post.

6x and 2x

Let’s start by explaining the “6x” and the “2x” from the keynote slide.

  1. 6x:  We measure performance in operations per second, where operations include powering on VMs, clones, VMotions, etc. More details are presented below under “Benchmarking Details.” The “6x” refers to a sixfold increase in operations per second from vSphere 5.5 to 6.5:
    1. In 5.5, vCenter was capable of approximately 10 operations per second in our testbed.
    2. In 6.0, vCenter was capable of approximately 30 operations per second in our testbed.
    3. In 6.5, vCenter can now perform more than 60 operations per second in our testbed. With faster hardware, vCenter can achieve over 70 operations per second in our testbed.
  2. 2x: The 2x improvement refers to a change in the supported limits for vCenter. The number of hosts has doubled, and the number of VMs has more than doubled:
    1. The supported limits for a single instance of vCenter 6.0 are 1000 hosts, 15000 registered VMs, and 10000 powered-on VMs.
    2. The supported limits for a single instance of vCenter 6.5 are 2000 hosts, 35000 registered VMs, and 25000 powered-on VMs.
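As a quick check on the labels, the ratios can be recomputed from the figures quoted in the list above. This is only an arithmetic sketch; the variable and function names are illustrative and not part of any VMware tool.

```python
# Figures quoted in the list above (testbed throughput and supported limits).
ops_per_sec = {"5.5": 10, "6.0": 30, "6.5": 60}
limits = {
    "6.0": {"hosts": 1000, "registered_vms": 15000, "powered_on_vms": 10000},
    "6.5": {"hosts": 2000, "registered_vms": 35000, "powered_on_vms": 25000},
}


def ratio(new, old):
    """Return the improvement factor of new over old."""
    return new / old


# "6x": throughput in 6.5 relative to 5.5.
throughput_gain = ratio(ops_per_sec["6.5"], ops_per_sec["5.5"])

# "2x": each supported limit in 6.5 relative to 6.0.
scale_gain = {
    name: ratio(limits["6.5"][name], limits["6.0"][name])
    for name in limits["6.0"]
}

print(f"throughput 5.5 -> 6.5: {throughput_gain:.1f}x")
for name, gain in scale_gain.items():
    print(f"{name} 6.0 -> 6.5: {gain:.2f}x")
```

The host limit doubles exactly, while the two VM limits grow by more than 2x, which is why the slide's "2x" is a conservative summary.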

Not only are the supported limits higher in 6.5, but the resources required to support such a limit are dramatically reduced.

What does this mean to you as a customer?

The numbers above represent what we have measured in our labs. Clearly, configurations will vary from customer to customer, and observed improvements will differ. In this section, I will give some examples of what we have observed to illustrate the sorts of gains a customer may experience.

PowerOn VM

Before powering on a VM, DRS must collect some information and determine a host for the VM. In addition, the vCenter server and the ESX host must exchange some information to confirm that the powerOn has succeeded and must record the latest configuration of the VM. Through a series of optimizations in DRS related to choosing hosts, and a large number of code optimizations to reduce CPU usage and critical section time, we have seen improvements of up to 3x for individual powerOns in a DRS cluster. We give an example in the figure below, in which we show the powerOn latency (normalized to the vSphere 6.0 latency, lower is better).

Example powerOn latency for 6.0 vs. 6.5, normalized to 6.0. Lower is better. 6.5 outperforms 6.0. The gains are due primarily to improvements in DRS and general code optimizations.

The benefits are likely to be most prominent in large clusters (e.g., 64 hosts and 8000 VMs in a cluster), although all cluster sizes will benefit from these optimizations.
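For readers unfamiliar with normalized charts, the following minimal sketch shows how a normalized latency and the corresponding speedup relate. The raw latencies are hypothetical placeholders chosen to illustrate a 3x case; they are not measurements from our labs.

```python
def normalize(latencies, baseline):
    """Divide every latency by the baseline release's latency."""
    base = latencies[baseline]
    return {release: value / base for release, value in latencies.items()}


# Hypothetical raw powerOn latencies in seconds (not measured values).
raw = {"6.0": 3.0, "6.5": 1.0}

normalized = normalize(raw, baseline="6.0")
speedup = raw["6.0"] / raw["6.5"]

print(normalized)
print(f"speedup: {speedup:.1f}x")
```

A normalized value below 1.0 means the release is faster than the 6.0 baseline, and the speedup is the reciprocal of the normalized value.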

Clone VM

Prior to cloning a VM, vCenter does a series of calls to check compatibility of the VM on the destination host, and it also validates the input parameters to the clone. The bulk of the latency for a clone is typically in the disk subsystem of the vSphere hosts. For our testing, we use small VMs (as described below) to allow us to focus on the vCenter portion of latency. In our tests, due to efficiency improvements in the compatibility checks and in the validation steps, we see up to 30% improvement in clone latency, as seen in the figure below, which depicts normalized clone latency for one of our tests.

Example clone VM latency for 6.0 vs. 6.5, normalized to 6.0. Lower is better. 6.5 outperforms 6.0. The gains are in part due to code improvements where we determine which VMs can run on which hosts.

These gains will be most pronounced when the inventory is large (several thousand VMs) or when the VMs to be cloned are small (i.e., < 16GB). For larger VMs, the latency to copy the VM over the network and the latency to write the VM to disk will dominate over the vCenter latency.

VMotion VM

For a VMotion of a large VM, the costs of pre-copying memory pages and then transferring dirty pages typically dominate. With small VMs (4GB or less), the costs imposed by vCenter are similar to those in the clone operation: checking compatibility of the VM with the new host, whether it be the datastore, the network, or the processor. In our tests, we see approximately 15% improvement in VMotion latency, as shown here:

Example VMotion latency for 6.0 vs. 6.5, normalized to 6.0. Lower is better. 6.5 is slightly better than 6.0. The gains are due in part to general code optimizations in the vCenter server.

As with clone, the bulk of these improvements is from a large number of code optimizations to improve CPU and memory efficiency in vCenter. Similar to clone, the improvements are most pronounced with large numbers of VMs or when the VMs are less than 4GB.

Reconfigure VM

Our reconfiguration operation changes the memory share settings for a VM. This requires a communication with the vSphere host followed by updates to the vCenter database to store new settings. While there have been improvements along each of these code paths, the overall latency is similar from 6.0 to 6.5, as shown in the figure below.

Example latency for VM reconfigure task for 6.0 vs. 6.5, normalized to 6.0. Lower is better. The performance is approximately the same from 6.0 to 6.5 (the difference is within experimental error).

Note that the slight increase in 6.5 is within the experimental error for our setup, so for this particular test, the reconfigure operation is basically the same from 6.0 to 6.5.

The previous data were for a sampling of operations, but our efficiency improvements should result in speedups for most operations, whether called through the UI or through our APIs.

Resource Usage and Restart Time of vCenter Server

In addition to the sorts of gains shown above, the improvements from 6.0 to 6.5 have also dramatically reduced the resource usage of the vCenter server. These improvements are described in more detail below, and we give one example here. For an environment in our lab consisting of a single vCenter server managing 64 Hosts and 8,000 VMs, the overall vCenter server resource usage dropped from 27GB down to 14GB. The drop is primarily due to removal of inventory service and optimizations in the core vpxd process of vCenter (especially with respect to DRS).

In our labs, the optimizations described below have also reduced the restart time of vCenter (the time from when the machine hosting vCenter is booted until vCenter can accept API or UI requests). The impact depends on the extensions installed and the amount of data to be loaded at startup by the web client (in the case of accepting UI requests), but we have seen improvements greater than 3x in our labs, and anecdotal evidence from the field suggests larger improvements.

Brief Deep Dive into Improvements

The previous section has shown the types of improvements one might expect over different kinds of operations in vCenter. In this section, we briefly describe some of the code changes that resulted in these improvements.

 “Rocks” and “Pebbles”

The changes from 6.0 to 6.5 can be divided into large, architectural-type changes (so-called “Rocks” because of the large size of the changes) and a large number of smaller optimizations (so-called “Pebbles” because the changes themselves are smaller).

Rocks

There are three main “Rocks” that have led to performance improvements from 6.0 to 6.5:

  1. Removal of Inventory Service
  2. Significant optimizations to CPU and memory usage in DRS, specifically with respect to snapshotting inventory for compatibility checks and initial placement of VMs upon powerOn.
  3. Change from SLES11 (6.0) to PhotonOS (6.5).

Inventory Service. The Inventory Service was originally added to vCenter in the 5.0 release in order to provide a caching layer for inventory data. Clients to vCenter (like the web client) could then retrieve data from the Inventory Service instead of going to the vpxd process within vCenter. Second- and third-party solutions (e.g., vROps) could store data in the Inventory Service so that the web client could easily retrieve such data. The Inventory Service was implemented in Java and was backed by an embedded database. While this approach has some benefits with respect to reducing load on vCenter, the cost of maintaining this cache was far higher than its benefits. In particular, in the largest supported setups of vCenter, the memory cost of this service was nearly 16GB, and could be even larger in customer deployments. Maintaining the embedded database also required significant disk IO (nearly doubling the overall IO in vCenter) and CPU. In 6.5, we have removed the Inventory Service and instead have employed a new design that efficiently retrieves data directly from vpxd. With the significant improvements to the vpxd process, this approach is much faster than using the Inventory Service. Moreover, it saves nearly 16GB in our largest setups. Finally, removing the Inventory Service also leads to faster restart times for the vCenter server, since the Inventory Service no longer has to synchronize its data with the core vpxd process of vCenter server before vCenter has finished starting up. In our test labs, the restart times (the time from reboot until vCenter can accept client requests) improved by up to 3x, from a few minutes down to around one minute.

DRS. Our performance results had suggested that DRS adds some overhead when computing initial placement and ongoing placement of VMs. When doing this computation, DRS needs to retrieve the current state of the inventory. A significant effort was undertaken in 6.5 to reduce this overhead. The sizes of the snapshots were reduced, and the overhead of taking such a snapshot was dramatically reduced. One additional source of overhead is doing the compatibility checks required to determine if a VM is able to run on a given host. This code was dramatically simplified while still preserving the appropriate load-balancing capabilities of DRS.

The combination of simplifying DRS and removing the Inventory Service resulted in significant resource usage reductions. To give a concrete example, in our labs, supporting the maximum inventory of a 6.0 setup (1000 hosts and 15000 registered VMs) required approximately 27GB, while the same inventory size required only 14GB in 6.5.

PhotonOS. The final “Rock” that I will describe is the change from SLES11 to PhotonOS. PhotonOS uses a much more recent version of the Linux Kernel (4.4 vs. 3.0 for SLES11). With much newer libraries, and with a slimmer set of default modules installed in the base image, PhotonOS has proven to be a more efficient guest OS for the vCenter Server Appliance. In addition to these changes, we have also tuned some settings that have given us performance improvements in our labs (for example, changing some of the default NUMA parameters and ensuring that we are using the pre-emptive kernel).

Pebbles

The “Pebbles” are really an accumulation of thousands of smaller changes that together improve CPU usage, memory usage, and database utilization. Three examples of such “Pebbles” are as follows:

  1. Code optimizations
  2. Locking improvements
  3. Database improvements

Code optimizations. Some of the code optimizations above include low-level optimizations like replacing strings with integers or refactoring code to significantly reduce the number of mallocs and frees. The vast majority of cycles used by the vpxd process are typically spent in malloc or in string manipulations (for example, serializing data responses from hosts). By reducing these overheads, we have significantly reduced the CPU and memory resources used to manage our largest test setups.
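To make the "strings to integers" idea concrete, here is a small illustrative sketch. It is not vCenter code (vpxd is not written in Python), and the `PowerState` type and helper functions are hypothetical; the point is only that a string is parsed once at the boundary and integer codes are compared afterwards.

```python
from enum import IntEnum


class PowerState(IntEnum):
    """Hypothetical integer codes standing in for state strings."""
    POWERED_OFF = 0
    POWERED_ON = 1
    SUSPENDED = 2


# Lookup table used once, when data arrives (for example, a host response).
_BY_NAME = {
    "poweredOff": PowerState.POWERED_OFF,
    "poweredOn": PowerState.POWERED_ON,
    "suspended": PowerState.SUSPENDED,
}


def parse_state(text):
    """Convert the wire-format string into its integer code."""
    return _BY_NAME[text]


def count_powered_on(states):
    """Hot path: compares small integers instead of strings."""
    return sum(1 for state in states if state == PowerState.POWERED_ON)


states = [parse_state(t) for t in ["poweredOn", "poweredOff", "poweredOn"]]
print(count_powered_on(states))
```

In a lower-level language the same change also avoids allocating and copying string buffers, which is where the malloc savings described above would come from.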

Locking improvements. Some of the locking improvements include reducing the length of critical sections and also restructuring code to enable us to remove some coarse-grained locks. For example, we have isolated situations in which an operation may have required consistent state for a cluster, its hosts, and all of its VMs, and reduced the scope so that only VM-level or host-level locks are required. These optimizations require careful reasoning about the code, but ultimately significantly improve concurrency.  An additional set of improvements involved simplifying the locking primitives themselves so that they are faster to acquire and release. These sorts of changes also improve concurrency. Improving concurrency not only improves performance, but it better enables us to take advantage of newer hardware with more cores: without such improvements, software would be a bottleneck, and the extra cores would otherwise be idle.
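The difference between a coarse-grained lock and narrower per-object locks can be sketched as follows. This is a simplified illustration under assumed names (`Cluster`, `reconfigure_coarse`, `reconfigure_fine`); it does not reflect the actual locking scheme inside vCenter.

```python
import threading


class Cluster:
    """Hypothetical inventory with one coarse lock and one lock per VM."""

    def __init__(self, vm_names):
        self.cluster_lock = threading.Lock()  # coarse-grained
        self.vm_locks = {name: threading.Lock() for name in vm_names}  # fine-grained
        self.memory_shares = {name: 1000 for name in vm_names}

    def reconfigure_coarse(self, vm, shares):
        # Serializes every reconfigure in the cluster, even for unrelated VMs.
        with self.cluster_lock:
            self.memory_shares[vm] = shares

    def reconfigure_fine(self, vm, shares):
        # Only operations on the same VM contend with each other.
        with self.vm_locks[vm]:
            self.memory_shares[vm] = shares


cluster = Cluster([f"vm-{i}" for i in range(8)])
threads = [
    threading.Thread(target=cluster.reconfigure_fine, args=(f"vm-{i}", 2000 + i))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cluster.memory_shares)
```

With the per-VM locks, the eight reconfigure calls never wait on one another; with the cluster lock, they would be processed one at a time. The fine-grained version is only safe when the operation truly needs consistent state for a single VM, which is the careful reasoning the paragraph above refers to.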

Database improvements. The vCenter server stores configuration and statistics data in the database. Any changes to the VM, host, or cluster configuration that occur as a result of an operation (for example, powering on a VM) must be persisted to the database. We have made an active effort to reduce the amount of data that must be stored in the database (for example, storing it on the host instead). By reducing this data, we reduce the network traffic between vCenter server and the hosts, because less data is transferred, and we also reduce disk traffic by the database.

A side benefit of using the vCenter server appliance is that the database (Postgres) is embedded in the appliance. As a result, the latency between the vpxd service and the database is minimized, resulting in performance improvements relative to using a remote database (as is typically used in vCenter Windows installations). This improvement can be 10% or more in environments with lots of operations being performed.

Benchmarking Details

Our benchmark results are based on our vcbench workload generator. A more complete description of vcbench is given in VMware vCenter Server Performance and Best Practices, but briefly, vcbench consists of a Java client that sends management operations to vCenter server. Operations include (but are not limited to) powering on VMs, cloning VMs, migrating VMs, VMotioning VMs, reconfiguring VMs, registering VMs, and snapshotting VMs. The Java client opens up tens to hundreds of concurrent sessions to the vCenter server and issues tasks on each of these sessions. A graphical depiction is given in the “VCBench Architecture” slide, above.
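The general shape of such a client, many concurrent sessions each issuing a stream of operations, can be sketched as below. This is not vcbench itself (which is a Java client); the operation mix, the `fake_operation` stub, and all other names are assumptions made for illustration, and no vCenter API is called.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import itertools

OPERATIONS = ["powerOn", "clone", "vMotion", "reconfigure"]  # assumed mix


def fake_operation(name):
    """Stand-in for a management call; a real client would call vCenter here."""
    return name


def run_session(session_id, ops_per_session):
    """Issue a fixed number of operations, cycling through the mix."""
    mix = itertools.cycle(OPERATIONS)
    return [fake_operation(next(mix)) for _ in range(ops_per_session)]


def run_workload(sessions, ops_per_session):
    """Run all sessions concurrently and count completed operations by type."""
    with ThreadPoolExecutor(max_workers=sessions) as pool:
        results = pool.map(
            run_session, range(sessions), [ops_per_session] * sessions
        )
        return Counter(op for session in results for op in session)


completed = run_workload(sessions=32, ops_per_session=8)
print(sum(completed.values()), dict(completed))
```

A real workload generator would weight the mix by operation frequency and run for a fixed duration rather than a fixed operation count.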

The performance of vcbench is typically given in terms of throughput, for example, operations per second. This number represents the number of management operations that vCenter can complete per second. To compute this value, we run vcbench for a specified amount of time (for example, several hours) and then measure how many operations have completed. We then divide by the runtime of the test. For example, 70 operations per second is 4200 operations per minute, or over 250,000 operations in an hour. We run anywhere from 32 concurrent sessions to 512 concurrent sessions connected to vCenter.
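The throughput calculation is simple enough to write down directly. In this sketch the run length and completed-operation count are hypothetical values chosen to yield 70 operations per second; they are not results from a specific run.

```python
def throughput(completed_ops, runtime_seconds):
    """Operations per second over the whole run."""
    return completed_ops / runtime_seconds


# Hypothetical run: two hours, 504,000 completed operations.
runtime_seconds = 2 * 3600
completed_ops = 504_000

rate = throughput(completed_ops, runtime_seconds)
per_minute = rate * 60
per_hour = rate * 3600

print(f"{rate:.0f} ops/s = {per_minute:.0f} ops/min = {per_hour:.0f} ops/hour")
```

At 70 operations per second this works out to 4,200 operations per minute and 252,000 operations per hour.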

The throughput measured by vcbench is dependent on the types of operations in the workload mix. We have tried to model our workload mix based on the frequency of operations in customer setups. In such setups, often power operations and provisioning operations (e.g., cloning VMs) are prevalent.

Finally, the throughput measured by vcbench also depends on hardware and other vCenter settings. For example, in our “benchmarking” runs, we run with level 1 statistics. We also do performance measurements with higher statistics levels, but our baseline measurements use level 1. In addition, we use SSDs to ensure that the disk is not a bottleneck, and we also provision sufficient CPU and memory so that neither is a constraint. By removing hardware as a constraint, we are able to find and fix bottlenecks in our software. Our benchmarking runs also typically do not have extensions like vROps or NSX connected to vCenter. We do additional runs with these extensions installed so that we can understand their impact and provide guidance to customers, but they are not part of the base performance reports.

Conclusion

vCenter 6.5 can support twice the inventory size of vCenter 6.0. Moreover, vCenter 6.5 provides dramatically higher throughput than 6.0, and can manage the same environment size with less CPU and memory. In this note, I have tried to give some details regarding the source of these improvements. The gains are due to a number of significant architectural improvements (like removing the Inventory Service caching layer) as well as a great many low-level code optimizations (for example, reducing memory allocations and shortening critical sections). I have also provided some details about our benchmarking methodology as well as the hardware and software configuration.

Acknowledgments

The vCenter improvements described in this blog are the results of thousands of person-hours from vCenter developers, performance engineers, and others throughout VMware. I am deeply grateful to them for making this happen.

vCenter Server 5.1 Database Performance with Large Inventories

 

Better performance, lower latency, and streamlined statistics are just some of the improvements you can expect to find in vCenter Server 5.1. The VMware performance team has published a paper about vCenter Server 5.1 database performance in large environments. The paper shows that statistics collection creates the biggest performance impact on the vCenter Server database. In vSphere 5.1, several aspects of statistics collection have been changed to improve the overall performance of the database. There are three sources of I/O to the statistics tables in vCenter Server: inserting statistics, rolling up statistics between different intervals, and deleting statistics when they expire. These activities have been improved by changing the way the relevant data is persisted to the tables: the tables are now partitioned instead of relying on staging tables. In addition, removing the staging tables makes statistics collection more robust, resolving the issues described in KB 2011523 and KB 1003878. Scalability is also improved: larger inventories can be supported because vCenter no longer spends so long reading and writing data in the old staging tables. The paper also includes best practices to take advantage of these changes in environments where vCenter Server has a large inventory. For more details, see vCenter Server 5.1 Database Performance in Large-Scale Environments.

Here are the URLs for the paper, “VMware vCenter Server 5.1 Database Performance Improvements and Best Practices for Large-Scale Environments”:

http://www.vmware.com/resources/techresources/10302

http://www.vmware.com/files/pdf/techpaper/VMware-vCenter-DBPerfBestPractices.pdf

 

Performance Best Practices for VMware vSphere 5.0

A new version of Performance Best Practices for vSphere is now available.  This is a book designed to help system administrators obtain the best performance from vSphere deployments.

We've addressed many of the new features in vSphere 5.0 from a performance perspective.  These include:

  • Storage Distributed Resource Scheduler (Storage DRS), which performs automatic storage I/O load balancing
  • Virtual NUMA, allowing guests to make efficient use of hardware NUMA architecture
  • Memory compression, which can reduce the need for host-level swapping
  • Swap to host cache, which can dramatically reduce the impact of host-level swapping
  • SplitRx mode, which improves network performance for certain workloads
  • VMX swap, which reduces per-VM memory reservation
  • Multiple vMotion vmknics, allowing for more and faster vMotion operations

We've also significantly updated and expanded many of the topics we've covered in previous editions of the book.  These include:

  • Choosing hardware for a vSphere deployment
  • Power management
  • Configuring ESXi for best performance
  • Guest operating system performance
  • vCenter and vCenter database performance
  • vMotion and Storage vMotion performance
  • Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM) performance
  • High Availability (HA), Fault Tolerance (FT), and VMware vCenter Update Manager performance

The book can be found at: Performance Best Practices for VMware vSphere 5.0.

 

Troubleshooting Performance Related Problems in vSphere 4.1 Environments

The hugely popular Performance Troubleshooting for VMware vSphere 4 guide is now updated for vSphere 4.1. This document provides step-by-step approaches for troubleshooting the most common performance problems in vSphere-based virtual environments. The steps discussed in the document use performance data and charts readily available in the vSphere Client and esxtop to aid the troubleshooting flows. Each performance troubleshooting flow has two parts:

  1. How to identify the problem using specific performance counters.
  2. Possible causes of the problem and solutions to solve it.

New sections that were added to the document include troubleshooting performance problems in resource pools on standalone hosts and DRS clusters, additional troubleshooting steps for environments experiencing memory pressure (hosts with compressed and swapped memory), high CPU ready time in hosts that are not CPU saturated, environments sharing resources such as storage and network, and environments using snapshots.

The Troubleshooting guide can be found here. Readers are encouraged to provide their feedback and comments in the performance community site at this link.

 

VMware vCloud Director 1.0 Performance and Best Practices — Paper Published

Do you want to know how many VMware vCloud Director server instances are needed for your deployment? Do you know how to load balance the VC Listener across multiple vCloud Director instances? Are you curious about how OVF File Upload behaves on a WAN environment? What is the most efficient way to import LDAP users? This white paper, VMware vCloud Director 1.0 Performance and Best Practices, provides insight  to help you answer all the above questions.

In this paper, we discuss VMware vCloud Director 1.0 architecture, server instance sizing, LDAP sync, OVF file upload, vApp clones across vCenter Server instances, inventory sync, and adjusting thread pool and cache limits. The following performance tips are provided:

  • Ensure the inventory cache size is big enough to hold all inventory objects.
  • Ensure JVM heap size is big enough to satisfy the memory requirement for the inventory cache and memory burst  so the vCloud Director server does not run out of memory.
  • Import LDAP users by groups instead of importing individual users one by one.
  • Ensure the system is not running LDAP sync too frequently because the vCloud database is updated at regular intervals.
  • In order to help load balance disk I/O, separate the storage location for OVF uploads from the location of the vCloud Director server logs. 
  • Have a central datastore to hold the most popular vApp templates and media files and have this datastore mounted to at least one ESX host per cluster.
  • Be aware that the latency to deploy a vApp in fence mode has a static cost and does not increase proportionately with the number of VMs in the vApp.
  • Deploy multiple vApps concurrently to achieve high throughput. 
  • For load balancing purposes, it is possible to move a VC Listener to another vCloud Director instance by reconnecting the vCenter Server through the vCloud Director user interface.

Please read the white paper for more performance tips and further details. You can download the full white paper from here.

Virtualizing SQL Server-based vCenter database – Performance Study

vSphere is an industry-leading virtualization platform that enables customers to build private clouds for running enterprise applications such as SQL Server databases. Customers can expect near-native performance from their virtualized SQL databases when running in a vSphere environment. VMware vCenter Server, the management component of vSphere, uses a database to store and organize information related to vSphere-based virtual environments. This database can be implemented using SQL Server. Based on previous VMware performance studies involving SQL databases, it is reasonable to expect the performance of a virtualized SQL Server-based vCenter database to be similar to its performance in a native environment.

A study was conducted in the VMware performance engineering lab to validate the assumption. The results of the study show that:

  • The most resource-intensive operations of a virtualized SQL Server-based vCenter database perform at a level comparable to that in a native environment.
  • A SQL Server-based vCenter database managing a vSphere virtual environment of any scale can be virtualized on vSphere.
  • SQL databases, in general, perform at a near-native level when virtualized on vSphere 4.1.

Complete details of the experiments and their results can be found in this technical document.

For comments or questions on this article, please join me at voiceforvirtual.com.