A customer recently asked a question that went like this: "I am looking for an active/active configuration for disaster avoidance and disaster recovery with a recovery time objective of 0 minutes and a recovery point objective of 0 minutes. Is this something I can achieve with VMware vSAN and/or Site Recovery Manager (SRM)?" It is not the first time I have heard this question, so it made sense to write a blog article. Let's dig in...
RTO and RPO Defined
Before we address possible solutions, let's define the terms recovery time objective (RTO) and recovery point objective (RPO), as they are often misused. Note that these terms contain the word "objective", which means "something that one's efforts or actions are intended to attain or accomplish; purpose; goal; target" according to Dictionary.com. In other words, these are the goals we are trying to achieve. I'll give you an example: an RTO of two hours means that recovery should take no more than two hours. The actual recovery time might turn out to be longer or shorter, but the target I am aiming for is two hours or less.
With the above information in mind, here are definitions from Wikipedia:
RTO: The targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.
RPO: The maximum targeted period (of time) in which data might be lost from an IT service due to a major incident.
I gave an example of RTO earlier, so here is an example of RPO: An RPO of four hours means it is acceptable to lose up to four hours of the most recent data. One way this could be achieved is by configuring vSphere Replication for a VM with an RPO of four hours. vSphere Replication would then replicate the changed data in the VM's files from the source location to the target location often enough to ensure the data at the target is never more than four hours old.
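To make the two definitions concrete, here is a minimal Python sketch that compares what actually happened during an incident with the stated objectives. The function name, the dictionary keys and the timestamps are all hypothetical and made up for illustration; they do not come from any VMware product.

```python
from datetime import datetime, timedelta


def assess_recovery(last_replica, outage_start, service_restored, rto, rpo):
    """Compare the actual data loss and downtime with the RPO and RTO targets."""
    data_loss = outage_start - last_replica      # age of the newest recoverable data
    downtime = service_restored - outage_start   # actual recovery time
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: last replica at 09:00, outage at 11:30, service back at 13:00.
result = assess_recovery(
    last_replica=datetime(2024, 1, 1, 9, 0),
    outage_start=datetime(2024, 1, 1, 11, 30),
    service_restored=datetime(2024, 1, 1, 13, 0),
    rto=timedelta(hours=2),
    rpo=timedelta(hours=4),
)
print(result["data_loss"], result["downtime"], result["rpo_met"], result["rto_met"])
```

In this made-up incident, 2.5 hours of data is lost and the service is down for 1.5 hours, so both the four-hour RPO and the two-hour RTO are met. The objectives are the targets; the measured values are what the incident actually cost.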
Now let's get back to the original question...
For disaster avoidance - when we have sufficient warning of an impending disaster - vSAN can achieve an RPO and RTO of 0 with a stretched cluster. VMs can be migrated from one site to the other using vMotion. Consider the case of rising flood waters. A flood warning is issued and it is likely flood waters will reach Data Center A in the next 4-6 hours. The decision is made to migrate workloads from Data Center A to Data Center B 40 miles away, which houses the other half of the vSAN stretched cluster. VMs are easily migrated to Data Center B with no downtime and no loss of data. An RTO of 0 (no time required for recovery) is achieved and no data is lost, which satisfies the RPO of 0. When the flood waters recede and any damage sustained at Data Center A is repaired, we simply migrate the VMs back to Data Center A, again, with no downtime and no data loss.
Disaster recovery (DR) is typically required after a disaster strikes without warning. Instead of proactively avoiding a disaster, as discussed above, we are forced to react to one because there was no warning before it occurred.
A vSAN stretched cluster achieves an RPO of 0. This is because vSAN writes to both sides of the stretched cluster synchronously. There is always a mirrored copy of VM data at both sites. In contrast, a true RTO of 0 is very difficult to achieve. Even application clustering solutions typically need several seconds to a minute or two to fail over from one node to another. Recovery of VMs after a host failure or an entire site failure is handled by vSphere HA. It commonly takes a few minutes to restart the affected VMs on the other side of the stretched cluster. In other words, an RTO of a few minutes (depending on how many VMs need to restart) is possible with a vSAN stretched cluster and vSphere HA. This click-through demo shows vSphere HA recovering 50 VMs from a failed host in a 4-node all-flash vSAN cluster in approximately four minutes. Behavior in a stretched cluster will be very similar.
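The link between synchronous writes and an RPO of 0 can be illustrated with a toy model. The sketch below is not how vSAN is implemented; the Site class, both write functions and the block name are hypothetical, and the only point is the ordering of "persist" and "acknowledge".

```python
class Site:
    """Toy stand-in for the storage at one data center."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def persist(self, key, value):
        self.blocks[key] = value


def synchronous_write(key, value, site_a, site_b):
    """Acknowledge the write only after both sites have persisted it."""
    site_a.persist(key, value)
    site_b.persist(key, value)
    return "ack"


def asynchronous_write(key, value, site_a, pending):
    """Acknowledge after the local write; the remote copy is queued for later."""
    site_a.persist(key, value)
    pending.append((key, value))
    return "ack"


# Synchronous case: site A fails right after the acknowledgement.
a, b = Site("A"), Site("B")
synchronous_write("vm1-block7", "v2", a, b)
sync_loss = set(a.blocks) - set(b.blocks)    # writes that exist only at the failed site

# Asynchronous case: site A fails before the queue has been drained.
a2, b2, queue = Site("A"), Site("B"), []
asynchronous_write("vm1-block7", "v2", a2, queue)
async_loss = set(a2.blocks) - set(b2.blocks)

print(sync_loss, async_loss)
```

In the synchronous case every acknowledged write already exists at the surviving site, so nothing is lost. In the asynchronous case, anything still sitting in the queue when the site fails is gone, which is why asynchronous replication always carries an RPO greater than 0.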
SRM can provide recovery times ranging from several minutes to a few hours depending on how many VMs are being recovered, how fast the storage is at the recovery site (hint: all-flash vSAN works very well here), how many priority groups and dependencies are configured in SRM, whether IP addresses need to be changed, and so on. If the replication solution used with SRM is synchronous, the RPO would be 0. Just like a vSAN stretched cluster, SRM cannot achieve an RTO of 0 when disaster takes out an entire protected site without warning. However, SRM can achieve low RTOs - even with a large number of VMs as demonstrated in this video showing the recovery of 1000 VMs in less than 30 minutes.
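As a back-of-the-envelope exercise, the two demos mentioned in this article can be turned into rough restart rates and used to sanity-check an RTO target. This is only a sketch: it assumes recovery time scales linearly with the number of VMs, which real environments will not strictly follow, and the 200-VM scenario and function names are hypothetical. The two demos also ran in different environments, so the rates are not directly comparable.

```python
def estimate_recovery_minutes(vm_count, vms_per_minute):
    """Estimate recovery time, assuming (simplistically) linear scaling."""
    return vm_count / vms_per_minute


def fits_rto(vm_count, vms_per_minute, rto_minutes):
    """Return True if the estimated recovery time is within the RTO target."""
    return estimate_recovery_minutes(vm_count, vms_per_minute) <= rto_minutes


# Rates derived from the figures quoted in this article.
ha_rate = 50 / 4       # vSphere HA demo: 50 VMs in about four minutes
srm_rate = 1000 / 30   # SRM video: 1000 VMs in under 30 minutes, so a lower bound

# Hypothetical question: would 200 VMs restarted at the HA demo's rate fit a 15-minute RTO?
print(estimate_recovery_minutes(200, ha_rate))
print(fits_rto(200, ha_rate, rto_minutes=15))
```

Under these assumptions, 200 VMs at the HA demo's rate would take about 16 minutes, which misses a 15-minute RTO. Treat numbers like this as a starting point for testing in your own environment, not as a prediction.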
As mentioned earlier, achieving a true RTO and RPO of 0 minutes in every scenario is very difficult (nearly impossible?) for any solution. However, vSAN stretched clusters and/or SRM provide viable options to help your organization get very close to this ideal state without breaking the bank.
@jhuntervmware on Twitter