
Working at the edge of SRM 5 limits

If you are working at the edge of our SRM 5 scalability limits and you are seeing timeout errors, then we can help with that.  SRM tries to start all of the VMs it is recovering as fast as it can, and near our scalability limits that can be too aggressive.  The actual errors may relate to scripts, VMware Tools, or uploads, but in all cases the error message will mention a timeout.

There are a number of settings you can tweak to increase your success during recoveries.  Please be aware that these are advanced changes that most people should not need.  Test them carefully, and make sure you can undo them if necessary.  The numbers below are suggestions, and you may increase or decrease them as you need.  Remember, as always, to test extensively!

You can throttle the startup of recovered VMs in two ways.  You can add a global configuration option to SRM that throttles VM operations in every cluster, or you can make a change to each cluster individually.  This means you can make one change so that all clusters start fewer VMs at once and then, if you need to (for example, if you have a very big cluster with lots of resources), add a configuration option to that cluster alone so it starts more VMs than any other cluster.

The cluster change is made in the DRS Advanced Options for that cluster: add a parameter called srmMaxBootShutdownOps with a value such as 32.  Remember that the cluster setting takes priority over the value in the SRM configuration file.
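To make the priority rule concrete, here is a minimal sketch of how the two settings combine for a single cluster.  This is only a model of the behavior described above, not SRM code; the function name and the use of None for "not configured" are assumptions made for illustration.

```python
from typing import Optional


def effective_boot_shutdown_limit(
    global_default: Optional[int],
    cluster_override: Optional[int],
) -> Optional[int]:
    """Return the limit that applies to one cluster.

    global_default   -- value from vmware-dr.xml, or None if not set
    cluster_override -- srmMaxBootShutdownOps from the cluster's DRS
                        Advanced Options, or None if not set
    None as a result means no throttle has been configured anywhere.
    """
    # The per-cluster setting wins whenever it is present.
    if cluster_override is not None:
        return cluster_override
    return global_default


# One global value of 32, with one large cluster allowed to go higher.
print(effective_boot_shutdown_limit(32, None))  # small cluster uses the global value
print(effective_boot_shutdown_limit(32, 64))    # big cluster uses its own value
```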

To change the value for all clusters, add the section below to the vmware-dr.xml file.  After you change vmware-dr.xml you must restart the SRM service for the change to take effect.  You do not need to use both the PerCluster and PerHost options below; generally just the cluster option is required.

<Config>

<defaultMaxBootAndShutdownOpsPerCluster>32</defaultMaxBootAndShutdownOpsPerCluster>
<defaultMaxBootAndShutdownOpsPerHost>4</defaultMaxBootAndShutdownOpsPerHost>

</Config>
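If you need to make this change on more than one SRM server, you could script it.  The sketch below uses Python's standard xml.etree.ElementTree module and assumes that the root element of vmware-dr.xml is Config, as the snippet above suggests; the helper names are hypothetical.  ElementTree drops XML comments when it parses a file, so treat this as a starting point and always work on a copy of the file.

```python
import xml.etree.ElementTree as ET


def set_option(root: ET.Element, tag: str, value: int) -> None:
    """Create or update a direct child of the root element."""
    node = root.find(tag)
    if node is None:
        node = ET.SubElement(root, tag)
    node.text = str(value)


def throttle_all_clusters(xml_text: str, per_cluster: int) -> str:
    """Return the config text with the global per-cluster limit set."""
    root = ET.fromstring(xml_text)
    # Assumption: the option lives directly under a <Config> root.
    if root.tag != "Config":
        raise ValueError("expected <Config> as the root element")
    set_option(root, "defaultMaxBootAndShutdownOpsPerCluster", per_cluster)
    return ET.tostring(root, encoding="unicode")


# Tiny stand-in for the real file contents.
sample = "<Config><other>1</other></Config>"
print(throttle_all_clusters(sample, 32))
```

Running the function a second time updates the existing element instead of adding a duplicate.  Remember that the SRM service still needs a restart after the file is changed.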

Sometimes you may also see synchronization timeouts in vSphere Replication.  This can happen as a result of slow networks, very large VMs, or large numbers of VMs.  You can avoid this issue by increasing the amount of time allowed for synchronization.

To change the synchronization timeout for vSphere Replication you can make the changes below in the vmware-dr.xml file.  Remember that you need to restart SRM to make this change active.

<Config>

     <hbrProvider>
         <synchronizationTimeout>3600</synchronizationTimeout>
     </hbrProvider>

</Config>
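When picking a timeout value, a rough estimate of how long a synchronization might take can help.  The sketch below is only back-of-the-envelope arithmetic and is not part of SRM or vSphere Replication: the function name, the efficiency factor, and the assumption that the timeout is expressed in seconds are all mine.

```python
def estimated_sync_seconds(
    changed_gib: float,
    link_mbps: float,
    efficiency: float = 0.7,
) -> float:
    """Estimate seconds needed to transfer changed_gib over the link.

    changed_gib -- data to synchronize, in GiB
    link_mbps   -- replication link speed, in megabits per second
    efficiency  -- assumed fraction of the link actually usable (0 to 1)
    """
    bits_to_send = changed_gib * 1024**3 * 8
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return bits_to_send / usable_bits_per_second


# Compare an estimate against a candidate timeout value.
timeout = 3600
needed = estimated_sync_seconds(changed_gib=50, link_mbps=100)
print(f"estimated {needed:.0f}s, fits in timeout: {needed <= timeout}")
```

If the estimate is close to or above the timeout, that is a hint to raise the value, and as always to test.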

Similarly, when you have a large number of LUNs you may see errors when preparing storage during recovery operations.  In that case you can increase the time allowed for SRA storage operations.  The error for this issue is very clear and will mention the commandTimeout parameter.

You can change this in the Storage section of the Advanced Settings, where it is called commandTimeout.  The default value is 5 minutes, but it is recorded in seconds, so it appears as 300.  Sometimes storage vendors will suggest you change this value, and I think the largest I have seen is 1500 seconds.  You access Advanced Settings by right-clicking on the Site in the UI.

In summary, if you are operating at or near the scalability limits of SRM, you can adjust these configuration options to reduce the number of errors and have a more successful recovery.

Good luck!