
Advanced vSphere Replication Options for Single VM Replication Performance


Posted by Ken Werneburg, Tech Marketing (Twitter: @vmKen)

Lots of people stand up a vSphere Replication-based SRM instance and check to see “how fast” it can replicate a VM.  What I want to do today is talk about some of the factors that go into this, and give you some advice on using unsupported advanced settings to speed up replication of a single VMDK.

Why would I do this?  It’s unsupported!  You can saturate your links!  You can create a vortex that will cause the implosion of the Milky Way into a singularity!  The point is, you really *shouldn’t* do this.  Unfortunately, though, people are drawing incorrect conclusions from a misunderstanding of how VR works, so I want to explain the mechanics and give you a way to do some more interesting testing of your own.

The first point to note is that there are two advanced settings that control, in essence, how much data is buffered before sending and how much is sent in parallel.  These limits are in place to keep an individual large replication from saturating a pipe and causing other replications to fail, and to keep replication traffic alone from saturating the host.

The effect of this, when replicating a single VM, is that you see less data being replicated, more slowly, than you might expect.  Fundamentally, vSphere Replication is optimized for parallel replication of multiple VMs rather than peak throughput for a single disk.

So let’s talk about the key advanced settings:

HBR.TransferDiskMaxBufferCount.  This is the maximum number of 8 KB blocks held in memory during replication, for each replicated disk.  It is basically the number of blocks from a transfer that we can operate on at any one point.  If our delta has, say, 100 blocks that need to be shipped, only this many of them are loaded into memory at any given time.  The default is 8.

HBR.TransferDiskMaxExtentCount.  This is the maximum number of outstanding blocks in flight: blocks that have been sent to the receiving vSphere Replication server (VRS) but whose writes have not yet been acknowledged by the VRS back to the source host.  The default is 16.

So, for example, if we have 20 blocks to send as part of a single light-weight delta (LWD) for a single VMDK, we will take the first 8 blocks into the buffers and send them out.  This gives us 8 buffers in use and 8 active extents “on the wire”.

Once a block has been sent out of a buffer it is no longer needed there, so that buffer gets cleared and reused for the next block that needs to be shipped.  Note, however, that the block may not yet have completed the actual transfer and write at the target site; it only needs to have been sent for its buffer to be cleared.  The 8 buffers are therefore constantly being refilled, each one loading the next block as soon as it has finished shipping the previous one.

The extent count, however, increments each time a buffer is emptied by a block transfer, and it is only reduced once the target-site VRS has replied that it has received and written the block.  So there is a delay between the time a buffer is emptied and the time the extent for that block is cleared.

For example, we have 8 of our 20 blocks in the buffers waiting to be sent.  Let’s imagine blocks 1 through 4 are sent simultaneously and the buffer for each of them is emptied.  We now load the next 4 blocks into those buffers and ship them out too, while the original blocks 5 through 8 finish sending.  At this point the original 8 blocks have been sent, plus another 4, so our extent count is up to 12.  Once the original blocks 5 through 8 complete their send, we load the next blocks into the buffers and ship them as well.

That brings us to 16 blocks in transit at once, which is the maximum parallel extent count.  So the last blocks of the original LWD (17 through 20) may get loaded into the buffers, but will not get shipped until some of our active extents are committed.

As blocks are received and written out at the recovery site by the VRS, it sends back, in essence, an acknowledgement that clears the extents associated with that transfer.  Say the first 7 blocks it received are written at the recovery site: that frees up 7 extents on the sending host, and the final 4 blocks queued in the buffers can now be sent.

So there is an interplay between the number of active blocks that can be shipped at any one time, dictated by the buffer count, and the number of shipped blocks not yet committed, dictated by the extent count.
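To make that interplay concrete, here is a minimal sketch (Python, not VMware code) that mimics the buffer/extent bookkeeping described above for one light-weight delta.  The 20-block delta comes from the example; the step-based pacing and the rtt_steps parameter are made-up stand-ins for network latency.

```python
MAX_BUFFER_COUNT = 8    # HBR.TransferDiskMaxBufferCount (default)
MAX_EXTENT_COUNT = 16   # HBR.TransferDiskMaxExtentCount (default)

def replicate_lwd(total_blocks=20, rtt_steps=3):
    """Walk one light-weight delta through the buffer/extent pipeline."""
    next_block = 1   # next block of the delta to load into a buffer
    buffered = []    # blocks sitting in host memory buffers
    in_flight = []   # (step_sent, block): sent but not yet acknowledged
    acked = 0
    step = 0
    while acked < total_blocks:
        step += 1
        # Acknowledgements from the target-site VRS arrive a round trip after
        # a block was sent, freeing the extents those blocks were using.
        still_out = [(s, b) for (s, b) in in_flight if s + rtt_steps > step]
        acked += len(in_flight) - len(still_out)
        in_flight = still_out
        # Refill any free buffers with the next blocks of the delta.
        while len(buffered) < MAX_BUFFER_COUNT and next_block <= total_blocks:
            buffered.append(next_block)
            next_block += 1
        # Send buffered blocks while extents are available; sending frees the
        # buffer immediately, but the extent stays in use until acknowledged.
        while buffered and len(in_flight) < MAX_EXTENT_COUNT:
            in_flight.append((step, buffered.pop(0)))
        print(f"step {step}: buffered={len(buffered)} "
              f"in-flight extents={len(in_flight)} acknowledged={acked}")

replicate_lwd()
```

Running it shows exactly the stall described above: by the second step all 16 extents are in use, and the remaining buffered blocks sit idle until acknowledgements come back.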

Phew!  So far so good?

If we presume a single VMDK is being replicated, let’s see how much throughput we can expect.  The variables are:

  • Number of replicated disks = n
  • Blocks in flight per disk = b (the extent count; default is 16)
  • Block size = s (8 KB)
  • Latency = l

The calculation is basically (n * b * s / l) * 0.8 = throughput, with s converted to megabits so the result comes out in Mbps.

1 disk * 16 blocks in flight * 8 KB per block / 100 ms = 10.24 Mbps.  Adjust your latency up and down as needed.  If replicating across the street with <1 ms latency, you could see something like 1 Gbps.

This is the top-end, theoretical number, with no overhead and no write latency at the target site.  So let’s keep the 100 ms of distance latency, add about 3 ms of write latency at the target, and knock off ~20% for TCP/IP and congestion overhead, acknowledgement responses, etc…

Long distance: 1 disk * 16 blocks * 8 KB per block / 103 ms = ~10 Mbps * 0.8 = ~8 Mbps.
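If you want to play with the numbers yourself, here is a small back-of-the-envelope calculator for the formula above.  It is just the arithmetic from this post, not anything vSphere Replication actually runs, and the function and parameter names are my own.

```python
# Back-of-the-envelope vSphere Replication throughput, per the formula above.
# 'slots' is the number of blocks that can be in flight per disk (the extent
# count, default 16), 'block_kb' is the 8 KB block size, and 'overhead' is
# the ~20% haircut for TCP/IP, congestion and acknowledgements.
def vr_throughput_mbps(disks=1, slots=16, block_kb=8, latency_ms=100, overhead=0.8):
    bits_in_flight = disks * slots * block_kb * 1000 * 8      # KB -> bits
    return bits_in_flight / (latency_ms / 1000.0) / 1_000_000 * overhead

print(vr_throughput_mbps(latency_ms=100, overhead=1.0))   # 10.24 Mbps, theoretical
print(vr_throughput_mbps(latency_ms=103))                 # ~8 Mbps with overheads
```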

So… how do we speed it up?  By adding more disks!  If we increase the number of disks being replicated we should expect, all other things being equal, a roughly linear increase in aggregate replication traffic.

3 disks * 16 blocks * 8 KB per block / 103 ms = ~30 Mbps * 0.8 = ~24 Mbps.

Very straightforward.  If you look back at the results shared by my friends at Hosting.com a few weeks back, you’ll find they saw a very similar set of results: somewhere on the order of 11 Mbps per VM on a link with 55 ms latency.

1 disk * 16 blocks * 8 KB per block / 55 ms = ~18 Mbps * 0.8 = ~14 Mbps.  Their results were not far off the simplistic theoretical numbers from our formula.

3 disks * 16 blocks * 8 KB per block / 55 ms = ~56 Mbps * 0.8 = ~45 Mbps.  Again, fairly close to what they saw in practice.
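Plugging that 55 ms link into the same vr_throughput_mbps helper from the sketch earlier gives essentially the same figures:

```python
print(vr_throughput_mbps(disks=1, latency_ms=55))   # ~14.9 Mbps
print(vr_throughput_mbps(disks=3, latency_ms=55))   # ~44.7 Mbps
```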

So, back to the original point: how do we make vSphere Replication “go faster”?  The easiest way is to add more VMs to the replication.  It doesn’t speed up any individual VM’s replication, which is capped by latency, block size, and the number of blocks that can be in flight, but it will make better use of your bandwidth.

What if you don’t want to replicate more VMs and just want to make one VM replicate faster?  Change the variables… give each disk more buffers and allow more extents in flight!

Back to the caveat, however… this is *not supported* by VMware.  The variables can be changed, but the changed values have not been through our usual rigorous testing, so you may run into issues if you saturate the network, saturate the VRS that receives the blocks, or consume too much host memory… Do this with great reluctance!

So to make it go faster, we simply adjust the advanced settings listed above on each host from which you are replicating.  Remember that vSphere Replication’s agent runs in the kernel, so these are host advanced settings, under “HBR”, not SRM advanced settings.  You will also need to reboot your hosts after making the change.

(Screenshot: host Advanced Settings, HBR options)
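If you would rather script the change than click through the host advanced settings shown above, here is a rough pyVmomi sketch.  The vCenter hostname, credentials, and new values are placeholders (pick values per the ratio guidance below); the option keys are the two settings named earlier.  Treat it as a starting point rather than a recipe; I have not validated it against every ESXi release.

```python
# Hedged sketch: push new HBR buffer/extent values to every host in a lab
# vCenter using pyVmomi.  Unsupported change; hosts still need a reboot.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

NEW_BUFFER_COUNT = 16   # placeholder: double the default of 8
NEW_EXTENT_COUNT = 32   # placeholder: double the default of 16

ctx = ssl._create_unverified_context()          # lab only, skips cert checks
si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        adv = host.configManager.advancedOption   # the host's OptionManager
        # adv.QueryOptions("HBR.") can be used first to inspect current values.
        adv.UpdateOptions(changedValue=[
            vim.option.OptionValue(key="HBR.TransferDiskMaxBufferCount",
                                   value=NEW_BUFFER_COUNT),
            vim.option.OptionValue(key="HBR.TransferDiskMaxExtentCount",
                                   value=NEW_EXTENT_COUNT),
        ])
        print(f"Updated HBR settings on {host.name}; reboot required")
    view.Destroy()
finally:
    Disconnect(si)
```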

Adjust them cautiously, and by a similar ratio.  If you double one, you should double the other, lest you end up in a situation where, for example, you have lots of free extents but your buffers are saturated, or vice versa.  A cautious approach might be to set your MaxBufferCount to 16 and your MaxExtentCount to 32; this should give you a close-to-linear doubling of throughput.
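Running that doubled extent count through the same back-of-the-envelope math from earlier shows why (again, assuming the extent count is what gates blocks in flight):

```python
# 32 blocks in flight instead of 16, same 8 KB blocks, same 103 ms latency
mbps = 32 * 8 * 1000 * 8 / (103 / 1000.0) / 1_000_000 * 0.8
print(round(mbps, 1))   # ~15.9 Mbps, roughly double the ~8 Mbps default case
```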

Again, the cautions: you could overuse memory on the host (for the buffers), you could saturate the VRS on the receiving side, you could do all sorts of other unpleasant things, and none of this has gone through testing, so it is completely unsupported.  The change also affects all replications on the host, so don’t do this in a production environment!  But in a lab environment, with only one or two VMs being replicated, it may be worth checking out.

Hopefully this helps you understand a bit more about how vSphere Replication works: it is designed for parallel replication of multiple VMs, but it can be tweaked to improve single-VM replication performance in lab environments.

-Ken

**** EDIT ****

I missed a decimal in one of my calculations…