Designing and architecting monster vCloud Air Network service provider environments takes VMware technology to its very limits, in terms of both scalability and complexity. vCenter Server, and its supporting services, such as SSO, are at the heart of the vSphere infrastructure, even in cloud service provider environments where a Cloud Management Platform (CMP) is employed to abstract the service presentation away from vCenter Server.
Meeting service provider scalability requirements with vCenter Server requires optimization at every level of the design, in order to implement a robust technical platform that can scale to its very limits, whilst also maintain operational efficiency and support.
This article outlines design considerations around optimization of Microsoft Windows vCenter Server instances and best practice recommendations, in order to maximize operational performance of your vCenter ecosystem, which is particularly pertinent when scaling over 400 host servers. Each item listed below should be addressed in the context of the target environment, and properly evaluated before implementation, as there is no one solution to optimize all vCenter Server instances.
The following is simply a list of recommendations that should, to some extent, improve performance in large service provider environments. This blog targets the Windows variant of vCenter Server 5.x and 6.x with a Microsoft SQL database, which is still the most commonly deployed configuration.
Warning: Some of the procedures and tasks outlined in this article are potentially destructive to data, and therefore should only be undertaken by experienced personnel once all appropriate safeguards, such as backed up data and a tested recovery procedure, are in place.
Part 1 – vCenter Server Operational Optimization
vCenter Server Sizing
vCloud Air Network service providers must ensure that the vCenter virtual system(s) are sized accordingly, based on their inventory size. Where vCenter components are separated and distributed across multiple virtual machines, ensure that all systems meet the sizing recommendations set out in the installation and configuration documentation.
vSphere 5.5: https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html
vSphere 6.0: https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-6-pubs.html
vSphere 5.1: http://kb.vmware.com/kb/2021202
Distribute vCenter Services across multiple virtual machines (vSphere 5.5)
In vSphere 5.5, depending on inventory size, multiple virtual machines can be used to accommodate different vCenter roles. VMware recommends separating VMware vCenter, SSO Server, Update Manager and SQL for flexibility during maintenance and to improve scalability of the vCenter management ecosystem. The new architecture of vCenter 6 simplifies the deployment model, but also reduces design and scaling flexibility, with only two component roles to deploy.
Dedicated Management Cluster
For anything other than the smallest of environments, VMware recommends separating all vSphere management components onto a separate out-of-band management cluster. The primary benefits of management component separation, include:
- Facilitating quicker troubleshooting and problem resolution as management components are strictly contained in a relatively small and manageable cluster.
- Providing resource isolation between workloads running in the production environment and the actual systems used to manage the infrastructure.
- Separating the management components from the resources they are managing.
vCenter to Host operational latency
The number of network hops between the vCenter Server and the ESXi host affects operational latency. The ESXi host should reside as few network hops away from the vCenter Server as possible.
vCenter to SQL Server operational latency
The number of network hops between the vCenter Server and the SQL database also affects operational latency. Where possible, vCenter should reside on the same network segment as the supporting database. If appropriate, configure a DRS affinity rule to ensure that the vCenter Server and database server reside on the same ESXi host, reducing latency still further.
Java Max Heap Size
vCloud Air Network service providers must ensure that the max heap size for Java virtual machine is set correctly based on the inventory size. Confirm heap size on JVM Heap settings on vCenter, Inventory Service, SSO and Web Client are checked. Monitor Web Services to verify. vSphere 5.1 & 5.5: http://kb.vmware.com/kb/2021302
Concurrent Client Connections
Whilst no always easy, attempt to limit the number of clients connected to vCenter Server, as this affects its performance. This is particularly the case for the traditional Windows C# client.
Employ a performance monitoring tool to ensure the health of the vCenter ecosystem and to help troubleshoot problems when they arise. Where appropriate, configure a vROps Custom Dashboard for vCenter/Management components. Also ensure appropriate alerts and notifications on performance monitoring tools exist.
Virtual disk type
All vCenter Server virtual machine VMDK’s should be provisioned in an eagerZeroedThick format. This provides approximately a 10-20 percent performance improvement over the other two disk formats.
vCenter vNIC type
vCloud Air Network service providers should ensure to employ the VMXNET3 paravirtualized network adaptor to maximise network throughput, efficiency and reduce latency.
Ensure that the vCenter and VUM ODBC connections are configured with the minimum permissions required for daily operations. Additional permissions are typically required during installation and upgrade activities, but not for day to day operations. Please refer to the Service Account Permissions provided below.
vCenter Logs Clean Up
vCenter Server has no automated way of purging old vCenter Log files. These files can grow and consume a significant amount of disk space on the vCenter Server. Consider a 3/6 monthly scheduled task to delete or move log files older than the period of time defined by business requirements.
For instance, the VBscript below can be used to clean up old log files from vCenter. This script deletes files that are older than a fixed number of days, defined in line 9, from the path set in line 6. This VBscript can be configured to run as a scheduled task using the windows task scheduler.
Set Fso = CreateObject("Scripting.FileSystemObject")
Set Directory = Fso.GetFolder("C:\ProgramData\VMware\VMware VirtualCenter\Logs\")
Set Files = Directory.Files
For Each Modified in Files
If DateDiff("D", Modified.DateLastModified, Now) > 180 Then Modified.Delete
For more information, refer to KB article: KB1021804 Location of vCenter Server log files.
For additional information on modifying logging levels in vCenter please refer to KB1004795 and KB1001584.
Note: Once a log file reaches a maximum size it is rotated and numbered similar to component-nnn.log files and they may be compressed.
The statistics collection interval determines the frequency at which statistic queries occur, the length of time statistical data is stored in the database, and the type of statistical data that is collected.
As historical performance statistics can take up to 90% of the vCenter server database size, it is the primary factor in the performance and scalability of the vCenter Server database. Retaining this performance data allow administrators to view the collected historical statistics, through the performance charts in the vSphere Web Client, through the traditional Windows Client or through command-line monitoring utilities, for up to 1 year after the data was first ingested into the database.
You must ensure that statistics collection times are set as conservatively as possible so that the system does not become overloaded. For instance, you could set a new DB Data Retention Period of 60 Days and configure the DB to not retain performance data beyond 60 days. At the same, it is equally important to ensure that the retention of this historical data meets the service provider’s data compliance requirements.
As this statistics data consumes such a large proportion of the database, proper management of these vCenter Server statistics is an important consideration for overall database health. This is achieved by the processing of this data through a series of rollup jobs, which stop the database server from becoming overloaded. This is a key consideration for vCenter Server performance and is addressed in more detail in Part 2 of this article.
Task and Events Retention
Operational teams should ensure that the Task and Events retention levels are set as conservatively as possible, whilst still meeting the service provider’s data retention and compliance requirements. Every time a task or event is executed via vCenter, it is stored in the database. For example, a task is created when an user powers on or off on a virtual machine and an event is generated when something occurs, such as the vCPU usage for a VM changing to red.
vCenter Server has a Database Retention Policy setting that allows you to specify after how long vCenter Server Tasks and Events should be deleted. This correlates to a database rollup job that purges the data from the database after the selected period of time. Whilst compared to statistical data these tables consume a relevantly small amount of database space, it is good practice to consider this option for further database optimization. For Instance, by default, vCenter is configured to store tasks and events data for 180 days. However, it might be possible, based on the service provider’s compliance requirements, to configure vCenter not to retain Event and Task Data in the database beyond 60 days.
vCenter Server Backup Best Practice
In addition to scheduling regular backups of the vCenter Server database, the backups for the vCenter Server should also include the SSL certificates and license key information.
Part 2 – SQL DB Server Operational Optimization (for vCenter Server)
SQL Database Server Disk Configuration
The vCenter Server database data file (mdf) generates mostly random I/O, while database transaction logs (ldf) generate mostly sequential I/O. The traffic for these files is almost always simultaneous so it’s preferable to keep these files on two separate storage resources, that don’t share disks or I/O. Therefore, where a large service provider inventory demands it, operational teams should ensure that the vCenter Server database uses separate drives for data and logs which, in turn, are backed by different physical disks.
For large service provider inventories, place tempDB on a different drive, backed by different physical disks than the vCenter database files or transaction logs.
Reduce Allocation Contention in SQL Server tempDB database
Consider using multiple data files to increase the I/O throughput to tempDB. Configure 1:1 alignment between TempDB files and vCPUs (up to eight) by spreading tempDB across at least as many equal sized files as there are vCPUs.
For instance, where 4 vCPUs exist on the SQL server, create three additional tempDB data files, and make them all equally sized. They should also be configured to grow in equal amounts. After changing the configuration, a restart of the SQL Server instance is required. For more information please refer to: http://support.microsoft.com/kb/2154845
Database Connection Pool
vCenter server starts, by default, with a database connection pool of 50 threads. This pool is then dynamically sized according to the vCenter Server’s workload. If high load is expected due to a large inventory, then the size of the pool can be increased to 128 threads. This will increase memory consumption and load time of the vCenter Server. To change the pool size, edit the vpxd.cfg file, adding, as below, where ‘128’ is the number of connection threads to be configured.
Update statistics of the SQL tables and indexes on a regular basis, for better overall performance of the database. Create an SQL job to carry out this task, or alternatively, it should form part of a vSphere database maintenance plan. http://sqlserverplanet.com/dba/update-statistics
Index Fragmentation (Not Applicable to vCenter 5.1 or newer)
Check for fragmentation of index objects and recreate indexes if needed. This happens with vCenter due to statistic roll ups. Defragment after <30% fragmentation. See this KB1003990.
Note: With the new enhancements and design changes made in the vCenter Server 5.1 database and later, this is no longer applicable or required.
Database Recovery Model
Depending on your vCenter database backup methodology, consider setting the transaction logs to SIMPLE recovery. This model will reduce the disk space needed for the logs as well decrease I/O load.
Choosing the Recovery Model for a Database: http://msdn.microsoft.com/en-us/library/ms175987(SQL.90).aspx
How to view or Change the Recovery Model of a Database in SQL Server Management Studio: http://msdn.microsoft.com/en-us/library/ms189272(SQL.90).aspx
Virtual Disk Type
Where the vCenter Server database server is a virtual machine, ensure that all VMDK’s are provisioned in an eagerZeroedThick format. This option provides approximately 10-20 percent performance improvement over the other two disk formats.
Verify SQL Rollup Jobs
Ensure all the SQL Agent rollup jobs have been created on the SQL server during the vCenter Server Installation. For instance:
- Past Day stats rollup
- Past Week stats rollup
- Past Month stats rollup
For the full set of stored procedures and jobs please refer to the appropriate article below. Where necessary, recreate MSSQL agent rollup jobs. Note that detaching, attaching, importing, and restoring a database to a newer version of MSSQL Server does not automatically recreate these jobs. To recreate these jobs, if missing, please refer to: KB1004382.
KB 2033096 (vSphere 5.1, 5.5 & 6.0): http://kb.vmware.com/kb/2033096
KB 2006097 (vSphere 5.0): http://kb.vmware.com/kb/2006097
Also, ensure that the myDB references the vCenter Server database, and not the master or some other database. If these jobs reference any other database, you must delete and recreate the jobs.
Ensure database jobs are running correctly
Monitor scheduled database jobs to ensure they are running correctly. For more information, refer to KB article: Checking the status of vCenter Server performance rollup jobs: KB2012226
Verify MSSQL Permissions
Ensure that the local SQL and AD permissions required are in place, and align with the principle of least privilege (see below). If necessary, truncate all unrequired performance data from the database (Purging Historical Statistical Performance Data). For more information, refer to KB article: Reducing the size of the vCenter Server database when the rollup scripts take a long time to run KB1007453
Truncate all performance data from vCenter Server
As discussed in Part 1, to truncate all performance data from vCenter Server 5.1 and 5.5:
Warning: This procedure permanently removes all historical performance data. Ensure to take a backup of the database/schema before proceeding.
- Stop the VMware VirtualCenter Server service. Note: Ensure that you have a recent backup of the vCenter Server database before continuing.
- Log in to the vCenter Server database using SQL Management Studio.
- Copy and paste the contents of the SQL_truncate_5.x.sql script (available from the link below) into SQL Management Studio.
- Execute the script to delete the data.
- Restart the vCenter Server services.
For truncating data in vCenter Server and vCenter Server Appliance 5.1, 5.5, and 6.0, see Selective deletion of tasks, events, and historical performance data in vSphere 5.x and 6.x (2110031)
After purging historical data from the database, optionally shrink the database. This is an online procedure to reduce the database size and to free up space on the VMDK, however, this activity will not in itself improve performance. For more information, refer to: Shrinking the size of the VMware vCenter Server SQL database KB1036738
For further information on Shrinking a Database, refer to: http://msdn.microsoft.com/en-us/library/ms189080.aspx
Rebuilding indexes to Optimize the performance of SQL Server
Configure regular maintenance job to rebuild indexes. KB2009918
- To rebuild the vCenter Server database indexes. Note, for a vCenter Server 5.1 and 5.5 database, download and extract the .sql files from the 2009918_rebuild_51.zip file attached to this procedure.
- Backup your vCenter Server database before proceeding. For more information, see Backing up and restoring vCenter Server 4.x and 5.x (1023985).
- These steps must be performed against the vCenter database and not the Master.
- Connect to the vCenter Server database using Management Studio for SQL Server
- Execute the .sql file to create the REBUILD_INDEX stored procedure, available from the above link.
- Execute the stored procedure that was created in the previous step: execute REBUILD_INDEX
VPX_HIST_STAT Table Sizes
VMware recommend a fill factor of 70% for the 4 VPX_HIST_STAT tables. If this recommended fill factor is too high for resources on the database server, then it will need to take time splitting pages, which equates to additional I/O.
If you are experiencing high unexplained I/O in the environment, monitor the SQL Server Access Methods object: Page Splits/sec. Page splits are expensive, and cause your table to perform more poorly due to fragmentation. Therefore, the fewer page splits you have the better your system will perform.
By decreasing the fill factor in your indexes, what you are doing is increasing the amount of empty space on each data page. The more empty space there is, the fewer page splits you will experience. On the other hand, having too much unnecessary empty space can also hurt performance because it means that less data is stored per page, which means it takes more disk I/O to read tables, and less data can be stored in the buffer cache.
High Page Splits/sec will result in the database being larger than necessary and having more pages to read during normal operations.
To determining where growth is occurring in the VMware vCenter Server database refer to: http://kb.vmware.com/kb/1028356
For troubleshooting VPX_HIST_STAT table sizes in VMware vCenter Server 5, refer to: KB2038474
To reduce the size of the vCenter Server database when the rollup scripts take a long time to run, refer to: KB1007453
Monitor Database Growth
Service provider operational teams should monitor vCenter Server database growth over a period of time to ensure the database is functioning as expected. For more information, refer to KB article: Determining where growth is occurring in the vCenter Server database KB1028356
Schedule and verify regular database backups
The vCenter, SSO, VUM and SRM servers are by themselves stateless. The databases are far more critical since they store all the configuration and state information for each of the management components. These databases must be backed-up nightly and the restore process of each database needs to be tested periodically.
Operational teams should ensure that a schedule of regular backups exists of the vCenter database and based on requirements of the business, restore and mount databases from backup periodically onto a non-production system to ensure a clean recovery is possible, should database corruption or data loss occur in the production environment.
Create a Maintenance Plan for vSphere databases
Work with the DBA’s to create a daily and weekly database maintenance plan. For Instance:
- Check Database Integrity
- Rebuild Index
- Update Statistics
- Back Up Database (Full)
- Maintenance Cleanup Task
Warning: DO NOT SHRINK DB IN MAINTENANCE PLAN UNLESS THERE IS A SPECIFIC REQUIREMENT TO RECLAIM DISK SPACE: http://msdn.microsoft.com/en-us/library/ms189080.aspx
Part 3 – Service Account Permissions (Least Privilege)
vCenter Service Account
Required by the ODBC Connection for access to the database, the vCenter service account must be configured with dbo_owner privileges for normal operational use. However, the vCenter database account being used to make the ODBC connection also requires the db_owner role on the MSDB System database, during installation or upgrade of the vCenter Server. This permission facilitates the installation of SQL Agent jobs for vCenter statistic rollups.
Typically, the DBA should only grant the vCenter service account the db_owner role on the MSDB System database when installing or upgrading vCenter, then revoke that role when these activities are complete.
RSA_DBO (vSphere 5.1 Only)
Only Required for SSO 5.1, the RSA_DBA account is a local SQL account which is used for creating the schema (DDL) and requires dbo_owner permissions.
RSA_USER (vSphere 5.1 Only)
Only Required for SSO 5.1, the RSA_USER reads and writes data (only DML).
VUM Service Account
Despite being a 64bit application, VUM requires a 32bit ODBC connection from “C:\Windows\SysWOW64\odbcad32.exe”. The VUM service account must be provide the dbo_owner permission on the VUM DB. The installation of vCenter Update Manager 5.x and 6.x with a Microsoft SQL back end database also requires the ODBC connection account to temporarily have db_owner permissions on the MSDB System database. This was a new requirement in vSphere 5.0.
As with the vCenter service account, typically the DBA would only grant the VUM service account the db_owner role for the MSDB System database during an install or upgrade to the VUM component of vCenter. This permission should then be revoked when that task has been completed.