Understanding vSphere Replication (VR) Scheduling and RPO Violations

One of the most common misunderstandings with vSphere Replication (VR) is how it calculates replication schedules and recovery point objective (RPO) violations. Since this question comes up often and takes a few moments to answer, I thought it made sense to cover it in a blog article. I’ll start with the basics: The replication schedule for a VM is determined by the RPO policy set when configuring replication for that VM. The RPO can be set to any number of minutes from 15 minutes to 1440 minutes (24 hours). There is currently no way to schedule replication at specific times. Instead, VR generates its own replication schedule internally by considering all replicated VMs on each vSphere host.

A 48-hour replication schedule is computed initially using historic data change rates. The VR scheduler recalculates this schedule each time certain events occur, such as a change in VM power state or a replication reconfiguration. VR attempts to spread out replication cycles to minimize the number of concurrent replication cycles occurring on a vSphere host. Because replication takes time to complete (especially with large amounts of data and/or slower network connections), each replica is already considered “aged” by the time the current replication cycle completes. To avoid an RPO violation, VR attempts to complete a replication cycle in less than half of the configured RPO. The estimated transfer time is calculated by averaging the transfer times of the previous 15 delta replications and adding 20% to that average as a buffer.
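The estimate described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (average of the last 15 delta replications plus a 20% buffer, compared against half the RPO), not VR's actual scheduler code. The function names and the sample history are hypothetical.

```python
def estimated_transfer_minutes(delta_minutes, window=15, buffer=0.20):
    """Average the most recent `window` delta replication times, then add a buffer.

    Hypothetical helper: mirrors the arithmetic described in the article,
    not any real vSphere Replication API.
    """
    recent = list(delta_minutes)[-window:]
    if not recent:
        raise ValueError("need at least one completed delta replication")
    average = sum(recent) / len(recent)
    return average * (1 + buffer)


def fits_half_rpo(estimate_minutes, rpo_minutes):
    """True if the estimate is under half the RPO, the target the article describes."""
    return estimate_minutes < rpo_minutes / 2


# Made-up history of delta replication times, in minutes.
history = [22, 23, 21, 24, 22]
estimate = estimated_transfer_minutes(history)
print(f"estimate: {estimate:.2f} min, fits half of a 60-minute RPO: {fits_half_rpo(estimate, 60)}")
```

With fewer than 15 completed delta replications, this sketch simply averages whatever history exists; how VR itself handles that case is not covered here.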

To help illustrate this, examples are shown in the diagram below. For both examples, the RPO is configured at 60 minutes.

[Diagram "vrrpo": two example replication timelines, each with a 60-minute RPO]

In the top example, the current replication takes 22 minutes to complete. Once the current replication is completed, the data at the target will look the same as the data at the source when replication started 22 minutes earlier. As the next replication begins, the restore point of the target continues to age. The next replication takes 23 minutes to complete. Just before it finishes, the restore point at the target is 45 minutes old (22 + 23), which is still within the 60-minute RPO. When the next replication cycle is finished, the target has newer data and now looks the same as the source did 23 minutes ago. Data replication starts again and the cycle continues.

Looking at the bottom example, we see that the current replication takes 35 minutes. After replication has finished and the replicated data has been committed to the target, the target looks the same as the source did 35 minutes ago. The next replication starts and the restore point of the target continues to age. Again, target data is not updated until the next replication is completed, which in this case is 33 minutes later. By then, the restore point is 68 minutes old (35 + 33), eight minutes past the RPO of 60 minutes. Assuming that replication continues to take 33-35 minutes to complete, VR will get further behind its RPO policy of 60 minutes with each replication cycle. Alerts will be shown in the user interface, but replication will continue. To resolve the RPO violation, more bandwidth should be provided to VR (which should shorten replication time) or the RPO policy should be increased.
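For readers who want to check the numbers, here is a minimal Python sketch of the arithmetic behind both examples. It assumes, as in the examples, that the next replication starts as soon as the previous one finishes, so the restore point is at its oldest just before the next cycle completes. The function names are hypothetical and are not part of any VR interface.

```python
def peak_restore_point_age(previous_cycle_min, next_cycle_min):
    """Oldest the target's restore point gets before the next cycle completes.

    Assumes back-to-back cycles: the target still reflects the source as of
    the start of the previous cycle until the next cycle finishes.
    """
    return previous_cycle_min + next_cycle_min


def rpo_overrun(previous_cycle_min, next_cycle_min, rpo_min):
    """Minutes past the RPO at the peak, or 0 if the RPO is met."""
    peak = peak_restore_point_age(previous_cycle_min, next_cycle_min)
    return max(0, peak - rpo_min)


RPO_MINUTES = 60
# Cycle durations taken from the two examples in the article.
for label, previous, following in [("top", 22, 23), ("bottom", 35, 33)]:
    peak = peak_restore_point_age(previous, following)
    over = rpo_overrun(previous, following, RPO_MINUTES)
    status = "within RPO" if over == 0 else f"{over} minutes past RPO"
    print(f"{label} example: peak age {peak} minutes, {status}")
```

This also shows why VR aims to finish each cycle in less than half of the RPO: two consecutive cycles that each take under 30 minutes can never add up to more than 60.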

@jhuntervmware