One of the most common questions about SRM 5 concerns vSphere Replication (VR) and how much bandwidth is used to replicate VMs.

The answer is of course that "it depends" and one of the goals of this post is to help you understand what factors are important in determining what to expect in your environment from a bandwidth perspective.

There are a number of factors involved in replication with VR:

1) Size of dataset

2) Data change rate ("Churn" for lack of better terminology)

3) Recovery Point Objective ("RPO")

4) Link speed

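Fundamentally, at a high level, sizing bandwidth for VR comes down to working out how much data changes, on average, within one RPO window and checking that your link can ship that much within the window. In other words, the average churn per RPO divided by your link speed gives the transfer time, and that time has to fit inside the RPO. Let's take a look at why these factors are important.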

Size of dataset. 

We cannot assume that every VM in your environment will be protected by VR, let alone every VMDK in your VMs. We need to look at our datastores and figure out what percentage of our total storage belongs to virtual machines that we are protecting with VR, and how many of the VMDKs within that subset are actually configured for replication.

Let's take an example in which we have 2TB of VMs on our datastores, and presume that we are protecting half of them with VR. That gives us a maximum amount of data for replication of 1TB. It may be that we are protecting only a subset of the VMDKs represented by that 1TB, but for our example we can assume that the size of the replicated dataset is at most 1TB. For our 'high water' calculations we will use this number.
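As a rough sketch of that high water calculation, here is the same arithmetic in Python. The function name and the optional VMDK fraction are hypothetical, and it assumes decimal units (1TB = 1000GB) to match the round numbers used in this post.

```python
# Hypothetical helper for illustration; not part of SRM or VR.
# Assumes decimal units (1TB = 1000GB), matching the round numbers in this post.
def protected_dataset_gb(total_gb, protected_fraction, replicated_vmdk_fraction=1.0):
    """Upper bound, in GB, on the data configured for replication."""
    return total_gb * protected_fraction * replicated_vmdk_fraction

# 2TB of VMs, half of them protected, every VMDK replicated (worst case).
high_water_gb = protected_dataset_gb(2000, 0.5)
print(high_water_gb)  # 1000.0
```

Passing a `replicated_vmdk_fraction` below 1.0 would lower the figure if only some VMDKs are configured for replication; leaving it at 1.0 keeps the high water mark.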

Data change rate.

This is the absolute key to all calculations with VR. We do not replicate every block of the dataset, only the blocks that have changed. The change rate is very tightly coupled with the RPO: to estimate the transfer size for each replication window, we need to know how many blocks have changed for a VM within a given RPO. That is not always easy to measure, nor is it easy to calculate for each VMDK and then sum, so instead let's step up a level and look at overall averages.

We may estimate, for example, that we have a churn rate of 10% daily for the dataset being used with VR. This will differ radically for each of you, but it gives nice round numbers for these sample calculations. Given the 1TB dataset and a churn rate of 10% daily, we are working with a daily set of blocks that need to be shipped of about 100GB.
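The daily churn estimate is a single multiplication, sketched below. The function name is hypothetical, the 10% figure is only the sample value from this post, and 1TB is treated as 1000GB.

```python
# Hypothetical helper for illustration; the 10% churn rate is a sample value,
# not a measurement. Assumes decimal units (1TB = 1000GB).
def daily_churn_gb(dataset_gb, daily_churn_rate):
    """Upper bound, in GB, on changed data per day for the replicated dataset."""
    return dataset_gb * daily_churn_rate

churn_gb = daily_churn_gb(1000, 0.10)
print(churn_gb)  # 100.0
```

Swap in your own measured change rate; the result is only as good as that estimate.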

Now this is where the complications begin. Even here we cannot assume that all 100GB needs to be shipped.

With VR we ship blocks based on the RPO.  If we have a 1 hour RPO, that means that within each hour any block that has changed will need to be shipped in order to meet that RPO.  This does NOT mean, however, that every time a block changes we need to ship it!  If a given block changes 100 times within that hour, we don't need to ship it 100 times, we only transfer it once at its current state when the block bundle is created for transfer.  All we care about is that it *has* changed within the RPO, not how *many* times it has changed.
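To make the "changed, not how many times" point concrete, here is a toy model in Python. It is only an illustration of the idea, not a description of how VR tracks blocks internally; the write log and block numbers are made up.

```python
# Toy model for illustration only; not how VR is implemented internally.
def blocks_to_ship(write_log):
    """Return the distinct blocks written during one RPO window.

    write_log is a sequence of block numbers, one entry per write.
    A block written many times still appears only once in the result.
    """
    return set(write_log)

# Block 7 is rewritten 100 times within the hour; blocks 3 and 12 change too.
writes = [7] * 100 + [3, 3, 12]
dirty = blocks_to_ship(writes)
print(len(writes), len(dirty))  # 103 3
```

In this example 103 writes collapse into 3 blocks to transfer, which is why a raw write count overstates what actually crosses the wire.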

So, once again, our changed block rate gives us the high water mark, not a realistic estimate of what needs to be shipped or how often. To continue with the calculation, though, we can assume that we have at *most* 100GB to transfer per day, and with an example RPO of 1 hour we need to replicate 24 times throughout the day.

100GB of data divided by 24 "windows" for shipping gives us an average bundle size of a little over 4GB per window.
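The per-window arithmetic can be sketched as follows. The function names are hypothetical, units are decimal (1GB = 8000 megabits), and the last function is my own rough lower bound rather than a VR formula: it assumes churn is spread evenly across the day and ignores protocol overhead and bursts.

```python
# Hypothetical helpers for illustration; decimal units (1GB = 8000 megabits).
def windows_per_day(rpo_hours):
    """Number of replication windows in 24 hours for a given RPO."""
    return 24 / rpo_hours

def average_bundle_gb(daily_churn_gb, rpo_hours):
    """Average data per window, assuming churn is spread evenly over the day."""
    return daily_churn_gb / windows_per_day(rpo_hours)

def min_sustained_mbps(bundle_gb, rpo_hours):
    """Rough lower bound on link rate needed to ship one bundle within one RPO.

    Assumption: ignores protocol overhead, bursts, and competing traffic.
    """
    return bundle_gb * 8000 / (rpo_hours * 3600)

bundle = average_bundle_gb(100, 1)
print(round(bundle, 2))                         # 4.17
print(round(min_sustained_mbps(bundle, 1), 2))  # 9.26
```

Real churn is rarely even, so treat the average bundle as a starting point and size for the busiest windows rather than the mean.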

… This will be continued in a few hours!