
Failback? Absolutely! Absolutely!

Since VMware released vCenter Site Recovery Manager (SRM) in June 2008, the product has gained tremendous momentum in customer adoption. Customers are impressed with the SRM capabilities that turn their manual disaster recovery (DR) runbook into automated recovery plans. They execute the recovery plans in test mode as frequently as they desire in order to eliminate any glitches in the plans. When a disaster happens, they are confident that SRM can be entrusted to seamlessly perform disaster recovery for them. (See below for the DR and SRM terminology used in this blog.)

| DR and SRM Terminology | Description |
| --- | --- |
| Failover | Event that occurs when the recovery site takes over operation in place of the protected site after the declaration of a disaster. |
| Failback | Reversal of failover, returning IT operations to the primary site (Site A). |
| Site A | The protected site before failover. |
| Site B | The recovery site before failover. |
| Protection Group | A group of virtual machines that will be failed over together to the recovery site during test or recovery. |
| Recovery Plan | A recovery plan contains the complete set of steps needed to recover (or test recovery of) the protected virtual machines in one or more protection groups. |
| Shadow Virtual Machines | An artifact in the recovery site VC inventory that represents a protected VM from the protected site VC. |

With SRM failover capabilities, customers can successfully recover their workloads on Site B. If you are one of those customers, do you wonder how to fail back to Site A? I bet you do!

Do SRM customers need failback capabilities?  Absolutely!

Many SRM customers have told us at VMware that failback is important to them for reasons such as:

·         They do not want to rely on their recovery site for an extended period of time. When Site A is recovered, customers prefer to have the workloads running on that site instead. Site A (the primary site) is typically allocated more computing resources than Site B (the recovery site), and it is geographically closer to the business units. For performance reasons, it makes sense to fail back to Site A in many cases.

·         They may need to fail over to Site B (in recovery mode) as part of their scheduled disaster recovery testing or maintenance. Afterwards, they need to fail back to Site A.

Does SRM v1.0 make failback easier? Absolutely!

From the customer feedback we gathered, we understand how important failback is to our customers. Although SRM v1.0 does not provide a single automated failback operation, customers still have many options to streamline and expedite their failback process using SRM. Using SRM to perform failback provides significant value:

·         Automated recovery plan(s)

·         Automated testing before recovery

·         Built-in audit trail

Now you probably wonder what is involved in using SRM for failback. In a nutshell, SRM-assisted failback involves two directional reversals of protection: first from Site B to Site A, and then back from Site A to Site B. Reversing the direction of protection so that it runs from Site B to Site A involves the following steps:

1.       Reverse the replication direction in the storage layer to be from Site B to Site A

2.       Clean up the shadow virtual machines and protection groups on Site A

3.       Clean up the Recovery Plans configured on Site B

4.       Configure the protection group(s) on Site B

5.       Configure the Recovery Plans on Site A

6.       Test recovery from Site B to Site A

7.       Perform the recovery from Site B to Site A
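The seven steps above depend only on which site currently runs the workloads and which site they are moving to, so one reversal can be written down as a checklist generator. The sketch below is purely illustrative: `reversal_steps` and its parameters are hypothetical names, and the function calls no SRM or storage vendor API. It only produces the ordered list of operations for a given direction.

```python
def reversal_steps(source: str, target: str) -> list[str]:
    """Return the ordered checklist for one directional reversal.

    `source` is the site where the workloads run now; `target` is the
    site they should be recovered to. Hypothetical helper: it only
    formats text and does not talk to SRM or to the storage array.
    """
    return [
        f"Reverse the replication direction in the storage layer to be from {source} to {target}",
        f"Clean up the shadow virtual machines and protection groups on {target}",
        f"Clean up the Recovery Plans configured on {source}",
        f"Configure the protection group(s) on {source}",
        f"Configure the Recovery Plans on {target}",
        f"Test recovery from {source} to {target}",
        f"Perform the recovery from {source} to {target}",
    ]


if __name__ == "__main__":
    # First reversal of a failback: workloads currently run on Site B.
    for number, step in enumerate(reversal_steps("Site B", "Site A"), start=1):
        print(f"{number}. {step}")
```

Calling `reversal_steps("Site A", "Site B")` produces the same checklist with the site roles swapped, which corresponds to the second directional reversal mentioned in the overview.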

Afterwards, you will need to perform steps 1 – 7 for the directional reversal of protection from Site A to Site B before you can call the failback done.  You can find more details on failback in the resources listed below:

·         Chapter 6 of SRM Evaluator’s Guide: http://www.vmware.com/pdf/srm_10_eval_guide.pdf

·         The storage vendors that implement an SRA (Storage Replication Adapter) for SRM have also individually produced documents that describe how to reverse the replication direction in the storage layer and how to leverage SRM to perform failback. Mike Laverick, author of the book “Administering VMware’s Site Recovery Manager”, has started a thread on the VMware SRM community forum to track a list of documents published by storage vendors on SRM: http://communities.vmware.com/message/1037176#1037176. The list includes some documents that describe how to perform failback using SRM on specific storage platforms.

You may think that the SRM-assisted failback steps illustrated above are not trivial to implement. Once you compare the man-time spent on SRM-assisted failback with the man-time on manual failback, you will appreciate the benefits of SRM-assisted failback. The table below lists the man-time estimates of the failback operations with and without SRM, assuming a protected environment of 100 virtual machines:

| Failback Steps | SRM-assisted Operations | Manual Operations | Man-Time Estimates of SRM-assisted Operations | Man-Time Estimates of Manual Operations |
| --- | --- | --- | --- | --- |
| Reverse the replication direction in the storage layer | Use storage vendor’s configuration tool to reverse the replication direction | Same as SRM-assisted operations | 15 minutes | 15 minutes |
| Clean up the shadow virtual machines and protection groups on Site A | Use SRM Plug-in to perform the clean-up. | Same as SRM-assisted operations | 10 minutes | 10 minutes |
| Clean up the Recovery Plans configured on Site B | Use SRM Plug-in to perform the clean-up. | Same as SRM-assisted operations | 10 minutes | 10 minutes |
| Failover Configuration(s) in SRM | Complete Array Manager configuration on Site B | N/A | 10 minutes | 0 minutes |
| Failover Configuration(s) in SRM | Configure Protection Groups | N/A | 10 minutes to 2 hours depending on the level of customization | 0 minutes |
| Failover Configuration(s) in SRM | Configure Recovery Plans | N/A | 10 minutes to 2 hours depending on the level of customization | 0 minutes |
| Testing | Automated via SRA | Go through all the manual runbook operations | 60 minutes | 3 to 4 man days (8 hour days) depending on the complexity of manual coordination. |
| Stop current cycle of replication | Use storage vendor’s configuration tool to perform this step | Use storage vendor’s configuration tool to perform this step | 10 minutes | 10 minutes |
| Make the target LUN (i.e. remote volume) a primary volume | Automated via SRA | Use storage vendor’s configuration tool to perform this step | 0 minutes | 10 minutes |
| Make the target LUN read-writeable | Automated via SRA | Use storage vendor’s configuration tool to perform this step | 0 minutes | 10 minutes |
| Grant the ESX Server hosts in Site B access to the last good snapshot that was taken | Automated via SRA | Use storage vendor’s configuration tool to perform this step | 0 minutes | 10 minutes |
| Network Mapping | Automated | Edit the VMX file of each virtual machine and map it to the correct network | 0 minutes | 2 minutes for editing each of the 100 virtual machines = 200 minutes |
| Resource Pool Mapping | Automated | For each VM, need to add and tell the VI Client which cluster, folder and resource pool to use | 0 minutes | 2 minutes for adding each of the 100 virtual machines = 200 minutes |
| VM Folder Mapping | Automated | Done in the above step | 0 minutes | N/A |

To perform all the steps listed in the table above in a protected environment of 100 virtual machines, the time estimates for SRM-assisted operations and manual operations are 355 minutes and 1915 minutes respectively. (The SRM figure uses the 2-hour upper bound for configuring protection groups and for configuring recovery plans; the manual figure uses 3 man days of testing.) The ratio is around 1 to 5. In other words, you can expect to spend about 5 times as long on failback if you opt for the manual operations instead of the SRM-assisted operations. This is a significant time saving, and it comes on top of the time saved by avoiding human errors and the audit trail that you get with SRM.
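The two totals can be reproduced directly from the table. The short sketch below sums each column; where the table gives a range, it takes the bound that yields the quoted figures (2 hours for each SRM configuration step, 3 eight-hour days for manual testing). The step labels are abbreviated and the variable names are illustrative only.

```python
HOUR = 60               # minutes
WORK_DAY = 8 * HOUR     # the table counts 8-hour man days
VM_COUNT = 100          # size of the protected environment

# (step, SRM-assisted minutes, manual minutes), taken from the table.
ESTIMATES = [
    ("Reverse replication direction", 15, 15),
    ("Clean up shadow VMs and protection groups on Site A", 10, 10),
    ("Clean up Recovery Plans on Site B", 10, 10),
    ("Array Manager configuration on Site B", 10, 0),
    ("Configure Protection Groups", 2 * HOUR, 0),   # upper bound of the range
    ("Configure Recovery Plans", 2 * HOUR, 0),      # upper bound of the range
    ("Testing", 60, 3 * WORK_DAY),                  # lower bound of 3 to 4 days
    ("Stop current cycle of replication", 10, 10),
    ("Make the target LUN a primary volume", 0, 10),
    ("Make the target LUN read-writeable", 0, 10),
    ("Grant ESX hosts access to the last good snapshot", 0, 10),
    ("Network mapping", 0, 2 * VM_COUNT),           # 2 minutes per VM
    ("Resource pool mapping", 0, 2 * VM_COUNT),     # 2 minutes per VM
    ("VM folder mapping", 0, 0),                    # done in the step above
]

srm_total = sum(srm for _, srm, _ in ESTIMATES)
manual_total = sum(manual for _, _, manual in ESTIMATES)
ratio = manual_total / srm_total

print(srm_total, manual_total, round(ratio, 1))  # prints: 355 1915 5.4
```

With 4 days of manual testing instead of 3, the manual total rises by another 480 minutes, so the quoted ratio is on the conservative side.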

If you still prefer to do it the manual way, you can leverage tools and scripts to make it more streamlined.  Refer to the documents below for more information:

·         Chapter 7 of VMBook titled “A Practical Guide to Business Continuity & Disaster Recovery with VMware Infrastructure”: http://www.vmware.com/files/pdf/practical_guide_bcdr_vmb.pdf. This chapter provides insights on failover and failback. You will learn more about failback considerations after reading it.

·         http://www.rtfm-ed.co.uk/docs/vmwdocs/Chapter%2012_Site_Recovery_without_VMware_SRM.pdf provides insights and technical details on how to use PowerShell to automate the failback process. Even the author of this document, Mike Laverick, recommends that you use SRM and perform the manual operations only as a contingency plan.
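As a small illustration of the kind of scripting that can streamline the manual route, the sketch below automates the "Network Mapping" row of the table: rewriting the network name in a virtual machine's VMX file. It is written in Python rather than the PowerShell used in the document above, and it is not taken from either reference. It assumes the network is stored in lines of the form `ethernetN.networkName = "..."`; the function name `remap_networks` and the network names in the usage example are hypothetical.

```python
import re

# Matches lines such as: ethernet0.networkName = "VM Network"
# Assumption: the VMX file stores each adapter's network under this key.
_NETWORK_LINE = re.compile(
    r'^([ \t]*ethernet\d+\.networkName[ \t]*=[ \t]*)"([^"]*)"',
    re.MULTILINE,
)


def remap_networks(vmx_text: str, mapping: dict[str, str]) -> str:
    """Return VMX text with network names replaced according to `mapping`.

    Networks that do not appear in `mapping` are left unchanged, as are
    all lines that are not ethernetN.networkName entries.
    """
    def replace(match: re.Match) -> str:
        old_name = match.group(2)
        new_name = mapping.get(old_name, old_name)
        return f'{match.group(1)}"{new_name}"'

    return _NETWORK_LINE.sub(replace, vmx_text)


if __name__ == "__main__":
    sample = (
        'displayName = "web01"\n'
        'ethernet0.networkName = "Prod-B"\n'
        'ethernet1.networkName = "Backup"\n'
    )
    # Hypothetical mapping from a Site B network to its Site A equivalent.
    print(remap_networks(sample, {"Prod-B": "Prod-A"}), end="")
```

Looping such a function over 100 VMX files removes most of the per-VM editing time, but every other manual step in the table still applies, which is consistent with the recommendation to treat the manual route as a contingency plan.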