Introduction – Scaling vRealize Automation for DevOps
VMware publishes documented vRealize Automation throughput limits in the vRealize Automation Reference Architecture. Until recently these limits exceeded the vast majority of our customers' use cases. With the increasing prevalence of high-throughput DevOps use cases, which bring greatly increased provisioning and Day 2 Operation rates, this has started to change. Customers are now demanding more…
In early Q2 of this year we had the chance to partner with a premier customer to push beyond those documented limits. This customer was seeking to provision in excess of 1000 Virtual Machines per hour using vRealize Automation 7.3 and to customize the Virtual Machines using both vRealize Orchestrator and Software Components. vSphere and Amazon EC2 endpoints were used for the test.
After a few weeks of work we had met their goal. Many configuration changes and hot fixes had been applied. All of these changes (with the exceptions called out below) were introduced into 7.5. Simply by upgrading to or installing 7.5 you will be able to enjoy 30% or greater system provisioning throughput.
Read on for more details…
Test
Clone and customize 100 Deployments of 5 Virtual Machines each (500 Virtual Machines) in under 30 minutes. Each machine had between 10 and 15 Software Components and there were dependencies between machines. For vSphere we used 2 sets of Linked Clones (1 base machine per machine per cluster). 2 sets of requests were performed per test: one set of 50 deployments at 00:00 and another set of 50 deployments at 10:00.
Investigation – Amazon EC2 Integration
Starting with Amazon EC2 we quickly encountered multiple limits with vRealize Automation in the Manager Service and the Amazon EC2 integration.
Issue 1: Manager Service hard coded to only process 10 Virtual Machines per Virtual Machine Observer polling interval.
Expose the hard-coded setting as a configurable value. A new setting in ManagerService.exe.config, "VirtualMachineObserverQueryCount", was configured to 100. In 7.5 and beyond this setting can be configured by adding it to ManagerService.exe.config and specifying a value. The default value when the setting is not present is 100.
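As a minimal sketch, assuming the setting lives in the appSettings section of ManagerService.exe.config (the same section used for the event whitelist setting later in this post):

<add key="VirtualMachineObserverQueryCount" value="100" />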
Optimization 1: Modify Manager Service Virtual Machine Observer polling intervals to be 2 seconds.
We modified the values of RepositoryWorkflowTimerCallbackMilliseconds, MachineRequestTimerCallbackMilliseconds and MachineWorkflowCreationTimerCallbackMilliseconds from 10000ms to 2000ms. Changing these values reduces the amount of time machines spend waiting to be processed at various points in the provisioning process.
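Similarly, a sketch of the three timer entries, again assuming they sit in the appSettings section of ManagerService.exe.config alongside the other Manager Service settings discussed in this post:

<add key="RepositoryWorkflowTimerCallbackMilliseconds" value="2000" />
<add key="MachineRequestTimerCallbackMilliseconds" value="2000" />
<add key="MachineWorkflowCreationTimerCallbackMilliseconds" value="2000" />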
Issue 2: Amazon EC2 throttled and eventually rejected API calls, leading to provisioning failures
Introduce and expose configurable custom properties to reduce the number of API calls to Amazon EC2.
Amazon.AmazonEC2LaunchInstance.PollingDelay (Default 30) – Defines the length of time the AmazonEC2LaunchInstance workflow will sleep before polling for the instance state after creating the instance.
Amazon.AmazonEC2LaunchInstance.PollingDelayRange (Default 30) – Defines the range from which a random value will be picked from and added to the PollingDelay. Intended to help spread out concurrent provisions and reduce API throttling.
Amazon.AmazonEC2LaunchInstance.PollingInterval (Default 15) – Defines the length of time between checking if the Amazon EC2 instance has been fully deployed. Note that this property defaults to 15 so it does not need to be set.
Optimization 2: Override the AmazonEC2Config default error retry value of 4.
Set the amazon.AmazonEC2Config.MaxErrorRetry custom property to 10 on the Amazon EC2 endpoint. This enables the Amazon EC2 client to perform 10 exponential backoff retries when throttling occurs.
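As an illustration, the full set of Amazon EC2 custom properties might look like the list below. MaxErrorRetry is set on the Amazon EC2 endpoint as noted above; where you apply the three polling properties (blueprint, property group, and so on) depends on your environment, and the polling values shown are simply the defaults restated, to be tuned to your provisioning volume:

Amazon.AmazonEC2LaunchInstance.PollingDelay = 30
Amazon.AmazonEC2LaunchInstance.PollingDelayRange = 30
Amazon.AmazonEC2LaunchInstance.PollingInterval = 15
amazon.AmazonEC2Config.MaxErrorRetry = 10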
With hot fixes applied and configuration changes made, testing resumed. As provisions were now succeeding under load we uncovered a new issue. Very rarely (once or twice per test of 500 machines) a machine would take exceptionally long to provision. Through extensive debugging we were able to identify that under high load the Manager Service would take an exceptional period of time to restart a machine's lifecycle after the Event Broker completed processing an Event. This operation can occur dozens of times during a provision, as an Event is sent to the Event Broker regardless of whether or not a Subscription for the Event is configured. We also discovered that there were optimizations which could be made to increase the parallelization of both Event Broker Events and state processing within the Virtual Machine Observer.
Issue 3: Event Broker event responses processed serially
Hotfix to change event response processing to be performed in parallel.
Issue 4: Virtual Machine Observer processes work serially
Hotfix to change Virtual Machine Observer state change operations to be performed in parallel.
Workaround 1:
Warning: Modifying this setting will prevent any subscriptions from running that monitor for Events fired by the Manager Service. Be aware this may break your provisioning.
To temporarily work around this issue we disabled the Manager Service → Event Broker integration. This is done by selecting Infrastructure → Administration → Global Settings, changing the value of Disable Extensibility to True, and then restarting the Manager Service.
After executing another provisioning test we observed that this time gap had been eliminated. With the cause of the issue identified we rolled back the workaround and configured event whitelisting instead. As a bonus this optimization removes about a minute from the provisioning time, as the Manager Service only sends Events that have Subscriptions which need to be processed by the Event Broker.
Optimization 3:
Warning: Configuring this setting will prevent any subscriptions from running that monitor for Events which are not whitelisted by the Manager Service. This may break your provisioning unless you have confirmed that all states being used are in the whitelist.
Whitelist events which have subscriptions.
<add key="Extensibility.Lifecycle.IncludeOnlyStates" value="VMPSMasterWorkflow32.BuildingMachine,VMPSMasterWorkflow32.Disposing" />
This comma-separated setting is added to the ManagerService.exe.config appSettings section and will take effect after a service restart. For more information on the available states to whitelist, review Workflow Subscription Life Cycle State Definitions.
With these configuration changes we were able to meet a provisioning rate of about 500 machines in roughly 30 minutes. These machines were submitted in 2 sets: 250 machines at time = 0 and another 250 machines at time = 10 minutes.
Investigation – vSphere Endpoint
Confident in our Amazon EC2 results, we moved on to vSphere, as the infrastructure had been set up and configured. Blueprints were configured in vRealize Automation with Linked Clones being used for the machines in the template.
After a few rounds of testing and iterating we arrived at the following settings for the Manager Service and vSphere Proxy Agent.
Optimization 4: Increase MaxOutstandingResourceIntensiveWorkItems from the default of 8 to 100. This setting determines how many CloneVM Work Items the Manager Service will dispatch to each Endpoint concurrently. We determined that the vSphere instance could handle at least 100 CloneVM operations.
Configure MaxOutstandingResourceIntensiveWorkItems to 100 in ManagerService.exe.config.
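A minimal sketch, again assuming the setting is added to the appSettings section of ManagerService.exe.config:

<add key="MaxOutstandingResourceIntensiveWorkItems" value="100" />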
Optimization 5: Decrease workItemTimerInterval to 5 seconds and increase workItemRetrievalCount and activeQueueSize to 100 in the vSphere Proxy Agent. These settings determine how often the Proxy Agent queries for work, how many work items it obtains per query, and the maximum number of work items it can process.
Configure workitemTimerInterval to "00:00:05", workitemRetrievalCount to "100" and activeQueueSize to "100" in VRMAgent.exe.config.
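A rough sketch of the agent-side changes, assuming these appear as appSettings-style keys in VRMAgent.exe.config (the key names come from above; confirm the exact names and section in your own agent configuration before editing):

<add key="workitemTimerInterval" value="00:00:05" />
<add key="workitemRetrievalCount" value="100" />
<add key="activeQueueSize" value="100" />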
Results
With all of these settings made we were able to continue testing and reached 1250 Virtual Machines in 80 minutes. Please keep in mind this customer utilized Linked Clones in order to drastically reduce clone times. If you utilize Full Clones you may be unable to reach these numbers without the storage and network infrastructure to back them up.
With the exception of Optimization 3, all of these code and configuration changes are now defaults in 7.5. By installing or upgrading to vRealize Automation 7.5 you can expect roughly a 30% improvement in provisioning throughput.
Improvements in provisioning performance have only been made to the core services. Complex customization and Template size will still adversely affect provisioning time.