Site Recovery Manager

SRM Protection Group Design Considerations

When designing vCenter Site Recovery Manager (SRM) environments, the question of how to organize protection groups (PGs) frequently comes up. In this post we’ll review what a protection group is, where it fits in the context of SRM, and the factors to keep in mind when deciding how to organize them.

What is a Protection Group?
In SRM, protection groups are a way of grouping VMs that will be recovered together. A protection group contains VMs whose data has been replicated by either array-based replication (ABR) or vSphere Replication (VR). A protection group cannot contain VMs replicated by more than one replication solution (e.g., a VM protected by both vSphere Replication and array-based replication), and a VM can belong to only one protection group.
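
The two rules above (one replication type per protection group, and one protection group per VM) can be checked against a proposed layout before anything is configured. Below is a minimal sketch in plain Python. The `VM` class, `validate_groups` function, and the group and VM names are hypothetical and are not part of any SRM API.

```python
from dataclasses import dataclass


# Hypothetical in-memory model; these names are not part of any SRM API.
@dataclass(frozen=True)
class VM:
    name: str
    replication: str  # "ABR" or "VR"


def validate_groups(groups):
    """Return a list of rule violations for a proposed protection group layout."""
    problems = []
    owner = {}  # VM name -> first protection group it was seen in
    for pg_name, vms in groups.items():
        # Rule 1: a protection group uses a single replication type.
        types = {vm.replication for vm in vms}
        if len(types) > 1:
            problems.append(f"{pg_name}: mixes replication types {sorted(types)}")
        # Rule 2: a VM belongs to at most one protection group.
        for vm in vms:
            if vm.name in owner and owner[vm.name] != pg_name:
                problems.append(f"{vm.name}: in both {owner[vm.name]} and {pg_name}")
            owner.setdefault(vm.name, pg_name)
    return problems


# Made-up layout that breaks both rules.
layout = {
    "PG-Email": [VM("mail01", "ABR"), VM("mail02", "ABR")],
    "PG-Web": [VM("web01", "VR"), VM("mail02", "ABR")],
}
for problem in validate_groups(layout):
    print(problem)
```

An empty result means the layout satisfies both rules. SRM enforces these constraints itself, so a check like this is only useful as a planning aid.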

For VMs protected by SRM using vSphere Replication, deciding which VMs belong to which protection group is simple: because VMs are replicated individually, they can be grouped in whatever way makes sense from a recovery standpoint. vSphere Replication protection groups are not tied to storage type or configuration, except that their VMs cannot be located on storage replicated by array-based replication.

For VMs protected by SRM using array-based replication, the decision is a little more involved. The VMs included in an array-based replication protection group are determined by the storage where the VMs are located.

All the VMs on a datastore have to be protected by SRM, and they all have to belong to the same protection group. Protecting only a subset of the VMs on a datastore is not recommended: doing so triggers alarms in the SRM UI and can result in significant issues with the unprotected VMs. For example, consider 5 VMs located on 2 datastores that map to 2 LUNs.

Within the storage array, the 2 LUNs are normally configured in a consistency group to ensure write-order consistency. The 2 datastores are said to be in a datastore group, which contains all the datastores associated with the VMs in the protection group.
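
To make the storage dependency concrete, here is a rough Python sketch of two planning checks: which datastores end up grouped together, and which datastores hold a mix of protected and unprotected VMs. It illustrates the idea only and is not SRM's actual grouping logic. The function names, the VM-to-datastore placement, and the assumption that datastores are linked by a shared VM or a shared consistency group are all ours.

```python
def datastore_groups(vm_datastores, consistency_groups=()):
    """Group datastores that are linked by a shared VM or consistency group."""
    parent = {}  # union-find forest: datastore -> parent datastore

    def find(ds):
        parent.setdefault(ds, ds)
        while parent[ds] != ds:
            parent[ds] = parent[parent[ds]]  # path halving
            ds = parent[ds]
        return ds

    links = list(vm_datastores.values()) + list(consistency_groups)
    for members in links:
        roots = [find(ds) for ds in members]
        for root in roots[1:]:
            parent[find(root)] = find(roots[0])  # merge into the first member's set

    groups = {}
    for ds in list(parent):
        groups.setdefault(find(ds), []).append(ds)
    return sorted(sorted(group) for group in groups.values())


def partially_protected(vm_datastores, protected_vms):
    """Map each datastore holding both protected and unprotected VMs to its unprotected VMs."""
    by_datastore = {}
    for vm, datastores in vm_datastores.items():
        for ds in datastores:
            by_datastore.setdefault(ds, set()).add(vm)
    result = {}
    for ds, vms in by_datastore.items():
        unprotected = vms - set(protected_vms)
        if unprotected and unprotected != vms:
            result[ds] = sorted(unprotected)
    return result


# Assumed placement for a 5 VM, 2 datastore example; vm3 has disks on both datastores.
vm_datastores = {
    "vm1": ["DS1"],
    "vm2": ["DS1"],
    "vm3": ["DS1", "DS2"],
    "vm4": ["DS2"],
    "vm5": ["DS2"],
}
print(datastore_groups(vm_datastores))
print(partially_protected(vm_datastores, {"vm1", "vm2", "vm3", "vm4"}))
```

In this made-up placement, vm3 spans both datastores, so they fall into one group, and leaving vm5 unprotected flags DS2 as partially protected.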

How do Protection Groups fit into SRM?

Recovery plans in SRM are like an automated runbook, controlling all the steps in the recovery process. The recovery plan is the level at which actions such as failover, planned migration, testing, and reprotect are conducted. A recovery plan contains one or more protection groups, and a protection group can be included in more than one recovery plan. This provides the flexibility to test or recover a single application (email, for example) by itself and also to test or recover a group of applications or the entire site.
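
The many-to-many relationship between recovery plans and protection groups can be pictured as a simple mapping. The sketch below uses made-up plan and group names, and the helper functions are our own illustration rather than anything SRM exposes.

```python
# Hypothetical recovery plans, each listing the protection groups it contains.
recovery_plans = {
    "RP-Email": ["PG-Email"],
    "RP-Collaboration": ["PG-Email", "PG-SharePoint"],
    "RP-EntireSite": ["PG-Email", "PG-SharePoint", "PG-Finance"],
}


def plans_containing(protection_group, plans):
    """List the recovery plans that include the given protection group."""
    return sorted(name for name, groups in plans.items() if protection_group in groups)


def groups_without_plan(all_groups, plans):
    """List protection groups that no recovery plan would recover."""
    used = {group for groups in plans.values() for group in groups}
    return sorted(set(all_groups) - used)


print(plans_containing("PG-Email", recovery_plans))
print(groups_without_plan(["PG-Email", "PG-SharePoint", "PG-Finance", "PG-HR"], recovery_plans))
```

Here PG-Email can be tested on its own through RP-Email or as part of the larger plans, while the second helper points out a group (PG-HR in this example) that has not been added to any plan yet.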


How should protection groups be organized?

When organizing protection groups with array-based replication, there are a few factors to consider. First, what is the smallest unit that you would like to fail over or test? Traditional disaster recovery testing usually involves failing over all applications and services, which can be quite difficult because it requires significant coordination between different groups. It also introduces significant risk and disrupts normal operations.

When using SRM, many organizations create protection groups at the application or service level (e.g., Email, SharePoint). This gives application owners the flexibility to test disaster recovery plans as needed, non-disruptively and with little or no risk.

 

Creating protection groups for each application increases flexibility; however, it also increases complexity because VM storage now has to map directly to those same applications. Also, some larger environments may have more applications or services than the maximum supported number of protection groups in SRM. Alternatives are to create protection groups at the business unit level (finance, HR, accounting) or at the application tier level (web, database, application). Another option is to trade flexibility for simplicity and create protection groups that follow the existing storage layout, leaving storage as is.
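
One way to compare these options is to count how many protection groups each grouping level would produce and check that count against the supported maximum for your SRM version. The sketch below assumes a simple VM inventory with `application`, `business_unit`, and `tier` attributes. The inventory, the function names, and the `MAX_GROUPS` value are placeholders; look up the real limit in the SRM documentation for your release.

```python
def count_groups(inventory, key):
    """Number of protection groups needed if VMs are grouped by the given attribute."""
    return len({vm[key] for vm in inventory})


def compare_layouts(inventory, keys, max_groups):
    """For each grouping attribute, report the group count and whether it fits the limit."""
    report = {}
    for key in keys:
        groups = count_groups(inventory, key)
        report[key] = {"groups": groups, "within_limit": groups <= max_groups}
    return report


# Made-up inventory for illustration only.
inventory = [
    {"name": "mail01", "application": "Email", "business_unit": "IT", "tier": "application"},
    {"name": "sp-web01", "application": "SharePoint", "business_unit": "IT", "tier": "web"},
    {"name": "sp-db01", "application": "SharePoint", "business_unit": "IT", "tier": "database"},
    {"name": "pay-db01", "application": "Payroll", "business_unit": "Finance", "tier": "database"},
]

MAX_GROUPS = 2  # placeholder, NOT the real SRM limit

report = compare_layouts(inventory, ["application", "business_unit", "tier"], MAX_GROUPS)
for key, result in report.items():
    print(key, result)
```

The count is only one input. With array-based replication, each candidate layout also has to match how the VMs are placed on replicated datastores.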

Ultimately, how you create array-based replication protection groups depends on how you want to be able to test, recover, and reprotect your workloads, as well as on your desire for flexibility, your tolerance for complexity, your storage layout, and the number of applications, services, or business units you have.
