The previous post in this series discussed the recovery plan test process built into SRM. In this post I’ll talk about some alternative methods of testing SRM recovery plans as well as recommendations around SRM and testing.
SRM has a robust, non-disruptive method of testing the operation of a recovery plan. However, some customers and some use cases need other options. Here are a few.
Run a recovery plan with the sites connected
In some situations a test of the SRM recovery plan isn’t sufficient, and there is a need for a trial run of the DR process itself. SRM fully supports this by running the recovery plan in either Planned Migration or Disaster Recovery mode. Keep in mind that doing this, as the names suggest, will actually fail over the VMs and therefore be disruptive**.
Even though it is disruptive, it is a more realistic test of what an actual recovery would look like. This needs to be done with the sites connected. A planned migration with the sites disconnected will fail because connectivity is required. Running in disaster recovery mode with the sites disconnected would be disruptive to the VMs in the plan and to the SRM environment, and is not recommended. That specific case is addressed below.
Run an SRM Recovery Plan Test with the sites disconnected
SRM, vSphere Replication and some SRAs (check with your array vendor) support running an SRM recovery plan test with the sites disconnected. This allows for a more realistic test of the recovery plan while still being non-disruptive to protected workloads. Keep in mind that disconnecting the sites doesn’t have to mean taking the inter-site link down. It could just mean changing network configurations so that the vCenters at each site can no longer communicate with each other. One benefit of completely severing the link between sites is that it may lead to the discovery of unexpected dependencies between sites.
An unusual variant of this is what I call a “reverse bubble”. This is useful when there is a need to use external systems (e.g. remote sites, customer sites, etc.) to test application operations of the VMs recovered as part of the test, and to discard any changes made during that testing rather than making them permanent. This is accomplished by isolating the production site from the network and exposing the recovery site test environment to external networks.
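If the disconnect is done through network configuration changes rather than by taking the link down, it can be worth confirming from each site that the other site's vCenter really is unreachable before starting the test. The following is a minimal sketch of such a check, not an SRM feature. It assumes that a failed TCP connection to the vCenter HTTPS port (443) is a good enough proxy for "disconnected", and the hostname in the usage comment is hypothetical.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: TCP reachability of the vCenter HTTPS port stands in for
    "the sites can still talk to each other". Other paths between the
    sites (replication traffic, array management) are not checked here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Example usage (hypothetical hostname, run from the protected site):
#   if can_reach("vcenter-recovery.example.com"):
#       print("Sites are still connected; the isolation is not in place")
```

Running the same check from both sites, and against any other systems that should be unreachable during the test, gives some confidence that the isolation is actually in effect.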
Run an SRM Recovery Plan in Disaster Recovery Mode with the sites disconnected
This option comes up very frequently with customers, and it has a really big problem that, to me, rules it out as a viable choice. After you do this, you will very likely have to rebuild your SRM environment. That, to me, defeats a major part of the purpose of running a test.
The reason this damages SRM is that the SRM servers and the underlying replication solutions end up out of sync, as both sites think they “own” the VMs and replicated storage. It is sometimes possible to clear this up, but it usually takes a lot of work and there is no guarantee of success.
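To pull the options discussed so far together, here is a small sketch that encodes them as a lookup table. It only restates what this post says. The mode names, the `Verdict` values and the `assess` function are labels made up for illustration and are not part of any SRM API.

```python
from enum import Enum
from typing import NamedTuple


class Verdict(Enum):
    NON_DISRUPTIVE = "non-disruptive"
    DISRUPTIVE = "disruptive"
    FAILS = "fails"
    AVOID = "avoid"


class Outcome(NamedTuple):
    verdict: Verdict
    note: str


# Keys are (mode, sites_connected). Mode names are illustrative labels only.
OUTCOMES = {
    ("test", True): Outcome(
        Verdict.NON_DISRUPTIVE, "Built-in recovery plan test."),
    ("test", False): Outcome(
        Verdict.NON_DISRUPTIVE,
        "Supported by vSphere Replication and some SRAs; check with your array vendor."),
    # Footnote in the post: with stretched storage on SRM 6.1 or higher, a
    # planned migration is non-disruptive to vMotion-capable VMs.
    ("planned_migration", True): Outcome(
        Verdict.DISRUPTIVE, "Actually fails over the VMs."),
    ("planned_migration", False): Outcome(
        Verdict.FAILS, "Connectivity between the sites is required."),
    ("disaster_recovery", True): Outcome(
        Verdict.DISRUPTIVE, "Actually fails over the VMs."),
    ("disaster_recovery", False): Outcome(
        Verdict.AVOID, "Very likely requires rebuilding the SRM environment."),
}


def assess(mode: str, sites_connected: bool) -> Outcome:
    """Look up what this post says about running a plan in the given way."""
    return OUTCOMES[(mode, sites_connected)]
```

For example, `assess("disaster_recovery", False).verdict` evaluates to `Verdict.AVOID`, which matches the recommendation in this section.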
A few other things that come up around testing
- Getting users connected into the test bubble – how will application testers get access to a network isolated environment?
- Should infrastructure systems (AD, DNS, DHCP, etc) be included in my recovery plans and therefore in my tests?
- Adding external systems into the test bubble (e.g. AD, physical servers) – If infrastructure systems aren’t protected by SRM, how do they get into the test bubble so that systems dependent on them can be tested?
These have been covered previously here and here. Even though these posts are a few years old, the information is still accurate and applicable.
Lastly, here are a few general recommendations around testing recovery plans with SRM:
- Run a test with the sites connected before doing any other kind of alternative testing. During the test SRM actually performs some verification steps at the protected site to make sure the planned failover would work without issues. These steps are optional and will not stop the test failover, but if you have never done them, you may be in for a surprise when you run a planned failover. This may be even more problematic if you first run an unplanned failover and then rerun the plan in planned mode to prepare the sites for the reprotect. If you discover issues at that point, your options for fixing them will be extremely limited.
- Don’t make any changes to SRM configurations at either location with your sites disconnected from each other. If the configurations get out of sync you may be looking at having to rebuild your configuration.
- The longer your test runs and the more VMs are involved, the more storage is used to track changes. Make sure your environment can support this (disk space, journal space, etc).
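For the last recommendation, a rough back-of-the-envelope estimate can help with planning. The sketch below assumes a simple linear model: a constant average change rate per VM multiplied by the length of the test, with a safety factor on top. Real consumption depends on how your array or vSphere Replication tracks changes, so treat the function, its parameter names and the example numbers as illustrative assumptions only.

```python
def estimate_change_storage_gb(
    vm_count: int,
    avg_change_gb_per_vm_per_hour: float,
    test_hours: float,
    safety_factor: float = 1.5,
) -> float:
    """Rough estimate of the storage needed to track changes during a test.

    Assumption: changes accumulate linearly with the number of VMs and the
    duration of the test. The safety factor is an arbitrary cushion.
    """
    if vm_count < 0 or avg_change_gb_per_vm_per_hour < 0 or test_hours < 0:
        raise ValueError("inputs must be non-negative")
    if safety_factor < 1:
        raise ValueError("safety_factor should be at least 1")
    return vm_count * avg_change_gb_per_vm_per_hour * test_hours * safety_factor


# Hypothetical example: 50 VMs changing 0.5 GB per hour each, over an
# 8 hour test, with the default 1.5x safety factor.
estimate = estimate_change_storage_gb(50, 0.5, 8)
print(f"Plan for roughly {estimate:.0f} GB of change-tracking space")
```

Measured change rates from your own environment will always give a better answer than a guess, so use real numbers where you have them.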
** When using stretched storage with SRM 6.1 or higher, a planned migration workflow would be non-disruptive to vMotion-capable VMs.