The ability to non-disruptively test a recovery plan is one of the most powerful features of VMware Site Recovery Manager (SRM). Frequent testing reduces risk by limiting drift between your plans and the behavior you want. It also gives you confidence that if a disaster strikes you are ready, because you know how SRM and your applications will respond.
There are a lot of options when testing recovery with SRM, so in this series I will cover how the SRM recovery plan test works, what the difference is between a test and a failover, some alternatives to SRM recovery plan tests, and some recommendations on SRM and testing. In this post I’ll go over the first two: how an SRM recovery plan test works and how a test differs from a failover.
An SRM recovery plan test is designed to be as close as possible to running an actual recovery plan while not disrupting the protected VMs and not impacting replication. Here are the steps for both a recovery plan and a recovery plan test with the similarities and differences noted:
Recovery Plan | Recovery Plan Test | Notes |
--- | --- | --- |
Pre-synchronize Storage | Synchronize Storage | Optional when running a test. This allows testing of both a planned migration (storage synchronized) and disaster recovery (storage not synchronized). |
Shutdown VMs at Protected Site | Not done during a test, so that production VMs are not disrupted | |
Resume VMs Suspended by Previous Recovery | Not applicable for a test | |
Restore (Recovery Site) Hosts from Standby | Restore (Recovery Site) Hosts from Standby | Same in both |
Restore Protected Site Hosts from Standby | Not applicable – no changes made to protected site | |
Prepare Protected Site VMs for Migration | Not applicable – no changes made to protected site | |
Synchronize Storage | Not applicable – completed earlier | |
Suspend Non-critical VMs at Recovery Site | Suspend Non-critical VMs at Recovery Site | Same in both |
Change Recovery Site Storage to Writable | Create Writable Storage Snapshot | A test uses a snapshot so that replication isn’t interrupted |
(Not applicable for a recovery plan) | Configure Test Networks | VMs are connected to the test network instead of the recovery network |
Power On VMs (priority groups 1-5) | Power On VMs | Arguably the most critical step, and the same in both. It includes IP customization and any scripts that run on the VMs as part of the recovery plan. |
(Any scripts or call-outs that are added to the recovery plan) | (Any scripts or call-outs that are added to the recovery plan) | These run the same in both a test and a regular recovery plan. Note that a script can differentiate between SRM running a test and SRM running a failover if it needs to do so to accomplish the desired result. |
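The last row notes that a script can tell a test from a failover. Here is a minimal Python sketch of that idea. It assumes SRM passes the run mode to command steps in an environment variable named `VMware_RecoveryMode` with the values `test` and `recovery`; check the variable name and values against the documentation for your SRM version. The function names and action names are hypothetical placeholders.

```python
import os

# Assumption: SRM sets this variable for command steps. Verify the name and
# its values ("test" / "recovery") against your SRM version's documentation.
MODE_VARIABLE = "VMware_RecoveryMode"


def recovery_mode(environ=None):
    """Return the lower-cased run mode, or "unknown" if the variable is unset."""
    env = os.environ if environ is None else environ
    return env.get(MODE_VARIABLE, "unknown").strip().lower()


def choose_action(mode):
    """Pick a (hypothetical) action name for the given run mode."""
    if mode == "test":
        return "update-test-dns"
    if mode == "recovery":
        return "update-production-dns"
    # Unrecognized or missing mode: do nothing rather than guess.
    return "skip"
```

A call-out script would then branch on `choose_action(recovery_mode())`. Falling back to "skip" when the mode is missing is a deliberately cautious choice, so that a misconfigured step never touches production by accident.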
One more difference between running a recovery plan and testing it is that the two require different privileges. Running a test requires Site Recovery Manager.Recovery Plans.Test, and running a recovery plan requires Site Recovery Manager.Recovery Plans.Recovery.
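To make the privilege split concrete, here is a small Python sketch that models it. The two privilege names come from the paragraph above; the `can_run` helper and the example role are made up for illustration and are not part of any SRM API.

```python
# Privilege names are taken from the text above. The role defined further
# down is a made-up example, not one of SRM's built-in roles.
REQUIRED_PRIVILEGE = {
    "test": "Site Recovery Manager.Recovery Plans.Test",
    "recovery": "Site Recovery Manager.Recovery Plans.Recovery",
}


def can_run(workflow, granted_privileges):
    """Return True if the granted privileges include the one the workflow needs."""
    return REQUIRED_PRIVILEGE[workflow] in granted_privileges


# Hypothetical role that may test recovery plans but not run them.
tester_role = {"Site Recovery Manager.Recovery Plans.Test"}

print(can_run("test", tester_role))      # True
print(can_run("recovery", tester_role))  # False
```

Because the privileges are separate, a role like this one can be allowed to test plans without also being allowed to run a real recovery.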
Keeping a recovery plan test non-disruptive requires that both storage and network remain isolated. For storage, this is accomplished by using a storage or VM snapshot at the recovery site, which allows the snapshot VM to be added to inventory while replication continues uninterrupted.
The VM network is isolated by connecting the test VMs to one of two kinds of network:
- SRM auto-created network: SRM can automatically create an isolated network on each of the hosts at the recovery site. This doesn’t require any additional configuration. However, VMs on different subnets, or VMs that are recovered onto different hosts, will not be able to communicate with each other. This can limit the usefulness of this option, especially if application testing is desired.
- Specifying a “test” network: This requires creating “test” networks/VLANs that connect to all hosts at the recovery site and are isolated from the rest of the production network. It takes a bit more work to put in place but provides a much more realistic testing environment, with the possibility of including switching, routing, firewall rules, etc. This makes the testing of recovery plans much more accurate and therefore reduces risk. Clear naming conventions are important here to prevent confusion. Future posts will discuss options for using this isolated test environment in more detail.
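The practical difference between the two options is which test VMs can reach each other. The Python sketch below models only the rules stated in the list above. The `RecoveredVM` class, the `can_communicate` function, the `test_network_routes` flag, and the host and subnet values are all illustrative assumptions, not SRM objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveredVM:
    name: str
    host: str    # recovery-site host the VM was recovered onto
    subnet: str  # subnet of the VM's IP address


def can_communicate(a, b, network, test_network_routes=False):
    """Model the reachability rules described in the list above.

    network is "auto" for the SRM auto-created network, or "test" for a
    dedicated test network that connects to all recovery-site hosts.
    """
    same_subnet = a.subnet == b.subnet
    if network == "auto":
        # Created per host, with no routing between subnets.
        return a.host == b.host and same_subnet
    if network == "test":
        # Spans hosts; cross-subnet traffic only if routing was set up.
        return same_subnet or test_network_routes
    raise ValueError(f"unknown network type: {network}")


web = RecoveredVM("web01", "esx-a", "10.0.1.0/24")
db = RecoveredVM("db01", "esx-b", "10.0.2.0/24")

print(can_communicate(web, db, "auto"))                            # False
print(can_communicate(web, db, "test"))                            # False
print(can_communicate(web, db, "test", test_network_routes=True))  # True
```

In this model a two-tier application split across hosts and subnets can only be tested end to end on a dedicated test network that also provides routing, which is why that option gives more realistic results.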
It is important to note that once a recovery plan test has been run, a failover workflow of the same recovery plan can’t be run until the cleanup workflow has completed.
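That ordering rule can be pictured as a tiny state machine. The Python sketch below is a toy model of the rule only: the class and method names are hypothetical and do not correspond to the SRM API.

```python
class RecoveryPlanState:
    """Toy model of the ordering rule: test, then cleanup, then failover."""

    def __init__(self):
        # True from the moment a test runs until cleanup completes.
        self.test_in_place = False

    def run_test(self):
        self.test_in_place = True

    def run_cleanup(self):
        self.test_in_place = False

    def run_failover(self):
        if self.test_in_place:
            raise RuntimeError("cleanup must complete before failover")
        return "failover started"


plan = RecoveryPlanState()
plan.run_test()
try:
    plan.run_failover()
except RuntimeError as err:
    print(f"blocked: {err}")  # blocked: cleanup must complete before failover

plan.run_cleanup()
print(plan.run_failover())    # failover started
```

The takeaway for planning is to run cleanup promptly after each test, so that the plan is always ready for a real failover.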
In the next post I’ll get into alternatives to the SRM recovery plan test.