
SRM 5 recovery plan speed compared to SRM 4

I’ve been working in the lab on upgrades and testing the process of moving from SRM 4 to 5.0. There will be a blog entry soon about this to augment the upgrade guide that will be released with SRM 5, but during the course of testing I’ve come across some very interesting numbers regarding the speed of a recovery plan in version 5 versus previous releases.

To wit, SRM 5 has some very nice changes that can make recovery plans considerably quicker to complete!

In accordance with the practice of testing thoroughly before and after upgrades, I ran through the recovery plans a few times in test mode to make sure they would fail over correctly and that there were no issues with the way I had things configured. As part of that I tracked how long each recovery plan took to complete the test and saw that the times were quite consistent across a few runs. Then it was time to upgrade!

In the lab as it stands now I have a few recovery plans: a couple running on FalconStor NSS iSCSI storage, and a single large recovery plan running on NFS storage on an EMC VNX 5500. These are both great systems, and I’m quite grateful to have the ability to test with different protocols and know that storage is never a bottleneck!

It was immediately apparent on both of these cutting-edge storage systems that SRM 5 completes the test runs considerably faster than SRM 4 did. I don’t mean it was a few percent quicker on a few steps, either: across the board I saw fairly dramatic and exciting speed improvements.

There are a number of reasons for the speed improvement, but the two features with the most impact deal with IP customization and the start sequencing for VMs at the recovery site.

IP customization in previous versions required the use of sysprep and customization specifications. If a system needed IP changes when running at the recovery site, SRM would call the customization spec and use sysprep to apply the network changes in the virtual machine. This often added a few minutes to the start time for each VM that needed these changes, as SRM had to boot the VM, run sysprep to make the changes, and reboot the VM again to complete them.

SRM 5 no longer uses sysprep or customization specs; instead, it injects the networking information through a VIX API call pushed through VMware Tools in the VM. This takes considerably less time to complete than the full sysprep cycle.

So right off the bat with this change, we can reduce the overall time for a recovery plan to complete by shaving down the customization time for each VM.
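To put some rough numbers around that idea, here is a quick back-of-the-envelope sketch in Python. Everything in it (the per-VM timings, the VM count, and the little helper function) is a made-up illustration of the arithmetic, not a measured SRM figure or anything SRM actually calculates.

```python
# Back-of-the-envelope model of IP customization overhead in a recovery plan.
# All timings and counts are hypothetical placeholders, not measured SRM values,
# and the math is purely illustrative, not anything SRM computes internally.

SYSPREP_CYCLE_MIN = 5   # assumed: boot + sysprep + reboot per VM (SRM 4 style)
INJECT_MIN = 1          # assumed: in-guest injection via VMware Tools (SRM 5 style)

def customization_overhead(vm_count, per_vm_minutes, concurrency):
    """Total minutes spent on IP customization when only `concurrency`
    VMs can be customized at the same time."""
    waves = -(-vm_count // concurrency)   # ceiling division
    return waves * per_vm_minutes

# Example: 12 VMs needing re-IP, 4 at a time (2 hosts x 2 power-ons each).
print("sysprep-style:", customization_overhead(12, SYSPREP_CYCLE_MIN, 4), "min")
print("in-guest injection:", customization_overhead(12, INJECT_MIN, 4), "min")
```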

The second area of improvement deals with how VMs are started by SRM at the recovery site. Previously, by default, VMs would be started two at a time on up to 10 hosts, for a maximum of 20 simultaneously booting virtual machines. On my lab systems I have two hosts at the recovery site, so I could do a maximum of four simultaneous VM power-ons. Each one would run its sysprep and had to finish booting before SRM could start the next VM.

With SRM 5 this process is no longer followed. Now, by default, SRM sends a call to the vCenter Server telling it which systems need to be booted, and lets vCenter determine how many VMs it can boot at once, depending on cluster settings, available resources, HA admission control, and so on. This means that in my lab with two hosts, I could start all the VMs in my recovery plan in parallel.
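If you like to see that throttling difference spelled out, here is a small illustrative sketch. The 16-VM example and the assumption that everything fits in a single wave are mine, purely for illustration; the only figures taken from the product are the old defaults of two power-ons per host on up to 10 hosts.

```python
# Illustration of boot "waves" under the two start models. The VM count and
# single-wave assumption are examples only; the SRM 4 defaults (two power-ons
# per host, 20 maximum) come from the description above.

def srm4_max_parallel(hosts):
    # Old default: two simultaneous power-ons per host, capped at 20 total.
    return min(2 * hosts, 20)

def boot_waves(vm_count, max_parallel):
    # Sequential waves needed to get every VM started.
    return -(-vm_count // max_parallel)   # ceiling division

vms, hosts = 16, 2
print("SRM 4 style:", boot_waves(vms, srm4_max_parallel(hosts)), "waves")
# SRM 5 hands the decision to vCenter; if the cluster has the resources,
# everything can start in a single wave.
print("SRM 5 style:", boot_waves(vms, vms), "wave")
```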

Moreover, the sequence is slightly different in terms of customization. SRM 5 will do an initial “customization boot” of any virtual machines that need to be customized: a network-isolated preparatory boot to inject the IP changes, after which the VM is shut down. When the VM is later called on to start according to its place in the recovery plan, it has *already been customized* and can boot “for real” without any extra delay for customization.
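To make the ordering of that two-phase flow concrete, here is a simplified sketch. The VM operations are print-only stubs of my own, not real SRM or vSphere API calls, and the VM names are made up; the point is just the sequence of steps.

```python
# Simplified, runnable sketch of the two-phase flow described above. The VM
# "operations" are print-only stand-ins, not real SRM or vSphere API calls.

def power_on_isolated(vm):  print(f"{vm}: customization boot (network isolated)")
def inject_ip_settings(vm): print(f"{vm}: inject recovery-site IP settings in-guest")
def power_off(vm):          print(f"{vm}: shut down, now pre-customized")
def power_on(vm):           print(f"{vm}: final power-on in recovery-plan order")

def run_recovery(plan_order, vms_needing_reip):
    # Phase 1: pre-customize every VM that needs new IP settings.
    for vm in vms_needing_reip:
        power_on_isolated(vm)
        inject_ip_settings(vm)
        power_off(vm)
    # Phase 2: start VMs in plan order; no customization delay remains.
    for vm in plan_order:
        power_on(vm)

run_recovery(["db01", "app01", "web01"], vms_needing_reip=["app01", "web01"])
```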

So what were my results?

On the FalconStor iSCSI system, with a recovery plan containing a small number of VMs, the test run completed about nine minutes faster. On the VNX NFS system, with three times the number of VMs, it completed almost 24 minutes faster.

Now here is the caveat section: your numbers may be quite different! For example, it may not be possible or advisable for you to start every VM in your recovery plan all at once. Perhaps your VC cannot handle it, or the cluster does not have the resources to do so. I had also not used different priority groups or set any dependencies, so my test scenario very likely looks nothing like your environment.

The speed increase will therefore depend heavily on how your environment is configured, the capabilities of your infrastructure, your dependencies and priorities, and what your recovery plans look like.

Regardless, any improvement of recovery time is something to be happy about, and SRM 5 is looking great in this regard!

Stay tuned for an SRM upgrade blog entry, coming soon.