vSphere Replication Bandwidth pt 2

Continuing from the earlier post about vSphere Replication bandwidth…

RPO.

The RPO is another key to figuring out your traffic patterns for replication. We look at how many blocks have changed during the RPO window and transfer only those changed blocks. The data change rate within that RPO gives us the total number of blocks that need to be shipped in each window. This may vary widely throughout the day, however, which will again alter the traffic generated by replication. If we have systems that are busy during business hours but idle at night, our overall figure above of 100GB daily may hold true, but the 4GB per window may drift wildly over the course of the 24 hour period.

From an averages perspective, however, the 4GB number may be good.

Here's another complication: Because we strictly look at which blocks have changed, not how many times those blocks have changed, we may again end up with widely differing bundles for replication from one replication window to the next. If traffic on a VM is 'bursty', meaning for example very busy in one hour and idle during the next, we may have to ship a lot of blocks in one hour and none in the next! Moreover, if we are using VSS to quiesce the VM, our replication traffic cannot be spread out in small trickled sets of bundles throughout the RPO window. Instead we will need to transfer all the changed blocks as one set, as determined when the VM was quiesced. Without VSS, VR can ship a larger number of smaller bundles of changed blocks on an ongoing basis as blocks change, smearing out the traffic throughout the RPO. So the traffic shape will change based on your choice to use VSS, and VR will handle the replication schedule differently based on these factors, leading to varying traffic patterns.
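To make the "changed blocks, not number of changes" point concrete, here is a tiny sketch. It is not how VR is implemented internally; the write log and the function name are made up purely for illustration. The idea is only that a block rewritten many times in a window still counts once.

```python
# Hypothetical log of block numbers written during one replication window.
# Block 7 is rewritten five times, but it only needs to be shipped once.
writes = [7, 7, 7, 12, 7, 12, 40, 7]


def blocks_to_ship(write_log):
    """Return the set of distinct blocks touched in the window (illustrative only)."""
    return set(write_log)


dirty = blocks_to_ship(writes)
print(f"{len(writes)} writes, {len(dirty)} changed blocks to replicate")
```

A VM that hammers the same few blocks generates far less replication traffic than its raw write count would suggest, while a VM that touches many different blocks once each generates far more.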

Lastly, as you change your RPO we will obviously have to ship more or less data per transfer to meet that RPO, which is why figuring out replication bandwidth depends so heavily on both the change rate and the configured RPO.

Link Speed.

Obviously the link speed is the other defining factor in our calculation. If we now know that we have an average replication bundle of 4GB to transfer in a 1 hour period, we need to look at our link speed to determine if the RPO can be met. If we have a 10Mb pipe, we can do the math: 4GB is 32,000 megabits, which at 10Mbps would take about an hour (roughly 53 minutes) to complete on a completely dedicated pipe with little overhead under ideal conditions.
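The math above is easy to get wrong because data sizes are in bytes and link speeds are in bits. A minimal sketch of the conversion, assuming decimal units (1GB = 1000MB) and zero protocol overhead; the function name is just for illustration:

```python
def transfer_seconds(data_gb: float, link_mbps: float) -> float:
    """Seconds to move data_gb gigabytes over a link_mbps (megabits/s) link.

    Assumes decimal units and an ideal, fully dedicated link with no overhead.
    """
    megabits = data_gb * 1000 * 8  # GB -> MB -> megabits
    return megabits / link_mbps


minutes = transfer_seconds(4, 10) / 60
print(f"4GB over 10Mbps: about {minutes:.0f} minutes")
```

Anything that lands this close to the full RPO window on paper leaves no headroom in practice.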

So just to meet our RPO we would be completely saturating a 10Mb WAN connection, under ideal conditions, with no overhead or limiting factors such as retransmits, shared traffic, excessive bursts of data change rates, etc. and presuming a very standard and unchanging data change rate!

Realistically we should expect that only about 70% of a link will be available for actual replication traffic, which means on a 10Mb link we can get around 3GB/h, on a 100Mb link we can get around 30GB/h, etc.
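Here is a rough sketch of that rule of thumb. The 70% figure is the planning estimate from the text, not a measured value, and the function name is only illustrative:

```python
def usable_gb_per_hour(link_mbps: float, usable_fraction: float = 0.7) -> float:
    """Approximate GB per hour a link can carry for replication.

    usable_fraction is the rule-of-thumb share of the link left for
    replication traffic (an assumption; measure your own link).
    """
    usable_mbps = link_mbps * usable_fraction
    return usable_mbps * 3600 / 8 / 1000  # megabits/s -> MB/h -> GB/h


for mbps in (10, 100):
    print(f"{mbps}Mbps link: about {usable_gb_per_hour(mbps):.2f} GB/h")
```

The results come out slightly above 3GB/h and 30GB/h, which is where the rounded figures above come from.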

The real question becomes one of looking at the data churn rate within an RPO and determining if the link speed and conditions will allow for that to be replicated.

Another way to look at it would be to again track what your data churn rate is within your RPO (or average it over a longer period and then divide down to your RPO) and then determine what traffic that will generate and measure that against your link speed.

For example, given 100GB of changed data and the roughly 70% usable link assumed above, you will need approximately 200 hours to replicate it on a T1, 30 hours on 10Mbps, 3 hours on 100Mbps, and so forth.
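Those figures can be reproduced with a few lines. This sketch assumes a T1 at 1.544Mbps and roughly 70% of each link usable for replication, which appears to be the assumption behind the rounded numbers; adjust both to match your own environment.

```python
T1_MBPS = 1.544  # nominal T1 line rate


def hours_to_replicate(data_gb: float, link_mbps: float,
                       usable_fraction: float = 0.7) -> float:
    """Hours needed to ship data_gb over a link, given the usable share of it."""
    megabits = data_gb * 1000 * 8
    seconds = megabits / (link_mbps * usable_fraction)
    return seconds / 3600


for name, mbps in (("T1", T1_MBPS), ("10Mbps", 10), ("100Mbps", 100)):
    print(f"100GB over {name}: about {hours_to_replicate(100, mbps):.0f} hours")
```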

The last major consideration here is that we are assuming a very standard RPO for all our VMs and a churn rate that smears out nicely over the 24 hours. Both of these factors will likely not hold true in your environment.

If we have many tiers of RPOs for our VMs, say groups that have RPOs of 15 minutes, 1 hour, 4 hours and 24 hours, we need to calculate this for each individual tier, and then look at how it is being met by our factors listed above, and then look at the aggregate of each tier!
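A sketch of what that per-tier calculation might look like. Every number here is made up for illustration, and simply summing the averages gives only a rough planning figure, since it ignores bursts and overlapping transfers:

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    rpo_minutes: float
    changed_gb_per_window: float  # hypothetical average churn per RPO window


def required_mbps(tier: Tier, usable_fraction: float = 0.7) -> float:
    """Average link speed needed to ship one window's churn within the RPO."""
    megabits = tier.changed_gb_per_window * 1000 * 8
    return megabits / (tier.rpo_minutes * 60) / usable_fraction


# Example tiers; replace with measurements from your own environment.
tiers = [
    Tier("15 min", 15, 0.5),
    Tier("1 hour", 60, 4),
    Tier("4 hours", 240, 10),
    Tier("24 hours", 1440, 50),
]

for tier in tiers:
    print(f"{tier.name:>8}: about {required_mbps(tier):.1f} Mbps")

total = sum(required_mbps(t) for t in tiers)
print(f"aggregate: about {total:.1f} Mbps")
```

Note how the short-RPO tier needs a meaningful share of the link despite shipping the least data per window.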

So this is the complexity of VR bandwidth calculations… We need to factor in all the different RPOs in the environment, the subset of the environment that is protected, the change rate of the data within that subset, how much of that data changes within each configured RPO, and the links in use.

Ultimately it's just math, but getting to an answer of what bandwidth is required for VR unfortunately always lands us back at the statement "It depends". The biggest dependency is your data change rate, and that's probably the best place to start when looking at figuring this out.

Hope this helps understand VR and bandwidth usage!

-Ken