vSphere Replication Appliance Failure Prevention and Recovery

vSphere Replication is an excellent host-based, per-VM replication solution that is included with vSphere Essentials Plus Kit and higher editions. That’s right: if you have vSphere Essentials Plus or higher, you have vSphere Replication. There are several use cases for vSphere Replication, including migrating VMs from old hardware to new hardware, migrating VMs between data centers, and disaster recovery with or without vCenter Site Recovery Manager (SRM). When talking with customers, we tend to start with the features and benefits, move on to how it works, and then cover what happens when issues such as hardware failure or administrative mistakes occur.

In this article, we will not discuss all of the details around how it works, but at a high level, changed data for each protected VM is replicated from vSphere hosts at the source location to one or more vSphere Replication virtual appliance(s) at the target location. The vSphere Replication appliance(s) then write this replicated data to vSphere storage at the target location. This often raises questions about what happens if these vSphere Replication appliances go offline or are lost. That is what we will cover in this article.

Before we get into some scenarios, it is important to understand there are two types of vSphere Replication appliances: The first is the vSphere Replication Management Server (VRMS). This is the first appliance deployed and it performs a variety of management and authentication functions for the vSphere Replication environment. A VRMS appliance also contains the vSphere Replication Server (VRS) component, which receives replicated data from the source vSphere host(s) and writes this replicated data to storage. The replicated data for a VM at the target location is typically referred to as a replica. The second type of vSphere Replication appliance is a VRS appliance, which contains only the VRS component.

There is exactly one VRMS appliance per vCenter Server environment. In many cases, this is the only vSphere Replication appliance you need to deploy since it handles management and also receives replicated data. In other cases, one or more VRS appliances may be deployed to accommodate different use cases, increase scale, and cover various replication topologies. Up to nine VRS appliances can be deployed in addition to the VRMS appliance for a maximum of 10 vSphere Replication appliances per vCenter Server.

Now, let’s look at some potential failure scenarios…

Scenario 1: VRS is temporarily offline. Any replication going to the offline VRS appliance will naturally be interrupted. Replication going to other vSphere Replication appliances (VRS and VRMS) will continue normally. It is possible RPO violations will occur – especially in environments where RPO is low, e.g., 30 minutes.

There is some good news here: Changes occurring to the source VM will continue to be tracked. When the offline VRS appliance comes back online, any replication that is still configured to go to that appliance will automatically resume. However, if an attempt is made to move replication from the offline appliance to another VRMS or VRS appliance, the migration will fail and the status will change to “Error”. Replication will then need to be reconfigured for the affected VMs once the VRS appliance is back online. Recovering affected VMs will also fail while the appliance is offline.
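To make the RPO violation in this scenario concrete, here is a minimal sketch of the underlying arithmetic. It is a simplified model, not how vSphere Replication computes or reports violations internally; the function name `rpo_violated` and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def rpo_violated(last_sync: datetime, now: datetime, rpo_minutes: int) -> bool:
    """Return True when the newest replica is older than the RPO allows.

    Simplified model: a violation exists once the time since the last
    completed sync exceeds the configured RPO.
    """
    return now - last_sync > timedelta(minutes=rpo_minutes)


# Hypothetical timeline: the last sync completed at 12:00 and the RPO is 30 minutes.
last_sync = datetime(2024, 1, 1, 12, 0)

# The VRS appliance has been offline for 20 minutes: still within the RPO.
print(rpo_violated(last_sync, datetime(2024, 1, 1, 12, 20), 30))  # False

# The VRS appliance has been offline for 45 minutes: the RPO is violated.
print(rpo_violated(last_sync, datetime(2024, 1, 1, 12, 45), 30))  # True
```

The lower the RPO, the shorter the outage an appliance can absorb before violations start to appear.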

Scenario 2: VRS is permanently offline. This scenario is more problematic. At first, RPO violations will be observed. Just like the scenario above, if an attempt is made to move replication from the offline appliance to another VRS or VRMS appliance, the migration will fail and the status will change to “Error”. Reconfiguring replication also fails. Recovery of the affected VMs is not possible.

Unfortunately, the solution is to stop replication for all affected VMs, which takes a few minutes longer than usual since the VRS appliance is offline, and then configure replication from scratch. Before you stop replication, however, copy the replica files at the target location to another folder. vSphere Replication automatically cleans up replica files at the target location when replication is stopped unless you used “seeded” copies of vmdk files. Copying the replica files to another folder preserves them so they can be used as “seeds” when replication is configured again. Using “seeds” typically reduces the amount of time and bandwidth consumed, because vSphere Replication compares the source and target files and transmits only the differences.

Last, you will want to unregister the offline VRS appliance from the vSphere Replication environment before deploying a new VRS appliance.
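The step of preserving the replica files before stopping replication can be sketched as follows. This is only an illustration under a loud assumption: that the target datastore folder is reachable as an ordinary filesystem path from wherever the script runs. In practice the copy is often done with the datastore browser instead. The function name `preserve_replica_as_seed` and all folder and file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def preserve_replica_as_seed(replica_dir: Path, seed_dir: Path) -> List[str]:
    """Copy a replica folder to a separate seed folder and list what was copied.

    Refuses to overwrite an existing seed folder so an earlier copy is
    never clobbered by accident. The original replica folder is left as-is.
    """
    if not replica_dir.is_dir():
        raise FileNotFoundError(f"replica folder not found: {replica_dir}")
    if seed_dir.exists():
        raise FileExistsError(f"seed folder already exists: {seed_dir}")
    shutil.copytree(replica_dir, seed_dir)
    return sorted(p.name for p in seed_dir.iterdir())


# Demo against a throwaway directory standing in for the target datastore.
with tempfile.TemporaryDirectory() as tmp:
    replica = Path(tmp) / "app01"
    replica.mkdir()
    (replica / "app01.vmdk").write_text("placeholder")
    copied = preserve_replica_as_seed(replica, Path(tmp) / "app01-seed")
    print(copied)  # ['app01.vmdk']
```

Only after the copy has been verified would you stop replication, unregister the lost VRS appliance, deploy its replacement, and point the new replication configuration at the seed folder.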

Scenario 3: VRMS is temporarily offline. Since the VRMS appliance contains management and authentication components, evidence of this appliance being offline becomes apparent very quickly. vSphere Replication shows as “Not accessible” and the lists of incoming and outgoing replications are empty.

With the VRMS appliance offline, it is not possible to recover VMs with vSphere Replication. There are some third party articles out there that suggest it is possible to manually alter the files at the target location that make up the replica and recover a VM. While this is possible, I do not recommend relying on that recovery method as it is a very manual process and, more importantly, not supported by VMware.

Now the good news in this scenario: Replication to other VRS appliances in the environment will continue, as configured. It is not possible to change the replication configuration until the VRMS appliance is back online. When the VRMS appliance comes back online, replication going to the VRMS appliance automatically resumes, management functionality is restored, and it is possible to recover VMs with vSphere Replication.

Scenario 4: VRMS is permanently offline. This scenario is very similar to the temporary VRMS outage: nearly all functionality is lost. If the VRMS database is hosted outside the VRMS appliance (SQL Server or Oracle Database), you can deploy a new VRMS appliance and configure it from the existing database.

However, VMs replicating to the (old) VRMS will need to be manually reconfigured to replicate to the new VRMS appliance. The majority of customers use the embedded database as it is simple and works well. If that is the case and the VRMS appliance is lost, you are out of luck – you will have to deploy a new VRMS and reconfigure replication for all protected VMs from scratch. That is unless you have been backing up your vSphere Replication appliances!
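The four scenarios can be condensed into a small lookup table, which may be handy as the starting point for a runbook. The sketch below only restates the behavior described in this article; the names `Outcome`, `SCENARIOS`, and `lookup` are invented for illustration and do not correspond to any VMware API.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    auto_resume: bool  # does replication resume without reconfiguration?
    action: str        # what the administrator needs to do


# Keys are (appliance type, outage type); values summarize the article's findings.
SCENARIOS = {
    ("VRS", "temporary"): Outcome(
        True,
        "Bring the appliance back online; do not move replication while it is down.",
    ),
    ("VRS", "permanent"): Outcome(
        False,
        "Copy replicas aside as seeds, stop replication, unregister the old VRS, "
        "deploy a new VRS, and reconfigure replication.",
    ),
    ("VRMS", "temporary"): Outcome(
        True,
        "Bring the appliance back online; management and recovery return with it.",
    ),
    ("VRMS", "permanent"): Outcome(
        False,
        "With an external database, deploy a new VRMS from it and re-point affected VMs; "
        "with the embedded database, restore from backup or reconfigure everything.",
    ),
}


def lookup(appliance: str, outage: str) -> Outcome:
    """Return the summarized outcome for an appliance type and outage type."""
    return SCENARIOS[(appliance.upper(), outage.lower())]


print(lookup("vrs", "temporary").auto_resume)   # True
print(lookup("VRMS", "Permanent").auto_resume)  # False
```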

Backing up the VRMS and VRS appliances. I use VMware vSphere Data Protection (VDP) to back up the VRMS and VRS appliances. After losing the VRMS appliance, I restored it using VDP. I configured the VDP restore job to automatically power on the appliance once the restore was complete. I was able to manage the vSphere Replication environment a few minutes after the appliance was fully booted. The VMs that were being replicated to this appliance were showing “Error”. These items automatically corrected themselves in a matter of minutes. However, the VMs did have to undergo a full sync.

I also restored a VRS appliance using VDP and had similar results. Within minutes of booting up the restored VRS appliance, replication resumed and the environment returned to normal automatically.

To help ensure these types of issues are quickly discovered and remediated, I recommend a few vCenter Server alarms, which can be configured for vSphere Replication:

VR Server disconnected
RPO violated

There are many more alarms that can be configured for vSphere Replication, but these two will likely be the best indicators of an issue with a vSphere Replication appliance. Here are a few more recommendations to help ensure your vSphere Replication environment stays online:

Deploy vSphere Replication appliances to highly available storage such as VMware Virtual SAN or traditional SAN and NAS solutions supported with vSphere.
Protect all vSphere Replication appliances with vSphere HA. This helps minimize vSphere Replication appliance downtime in the event of vSphere host hardware failure. vSphere HA guest OS protection is also recommended.
Back up all vSphere Replication appliances daily. The retention period for these daily backups can be just a few days. Hopefully, any issues with a vSphere Replication appliance are discovered quickly, and typically the most recent backup will be used to perform the restore. Retaining too many restore points across several days or weeks unnecessarily consumes additional backup data capacity. Backup of the VRMS appliance and vCenter Server should happen at the same time.
If using an external database for the VRMS appliance, back up that database in addition to the VRMS appliance. Backup of the appliance and the database should happen at the same time to help ensure consistency.
vSphere Replication requires vCenter Server and the vSphere Web Client to perform VM recoveries. Make sure these items are also well protected with redundant hardware, vSphere HA, and regular backups.
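The backup recommendations above boil down to two simple checks: keep only a few days of restore points, and back up the VRMS appliance and vCenter Server together. Here is a minimal sketch of both, independent of any particular backup product. The function names, the three-day retention default, and the 30-minute tolerance are assumptions chosen for illustration, not values from VDP or vSphere Replication.

```python
from datetime import datetime, timedelta


def split_by_retention(restore_points, now, keep_days=3):
    """Split restore point timestamps into (keep, expire) lists by age."""
    cutoff = now - timedelta(days=keep_days)
    keep = [p for p in restore_points if p >= cutoff]
    expire = [p for p in restore_points if p < cutoff]
    return keep, expire


def backed_up_together(vrms_backup, vcenter_backup, tolerance_minutes=30):
    """Return True when the two backups ran within the tolerance of each other."""
    return abs(vrms_backup - vcenter_backup) <= timedelta(minutes=tolerance_minutes)


now = datetime(2024, 1, 10, 2, 0)
# Seven daily restore points, newest first (hypothetical schedule).
points = [now - timedelta(days=d) for d in range(7)]

keep, expire = split_by_retention(points, now)
print(len(keep), len(expire))  # 4 3

print(backed_up_together(datetime(2024, 1, 10, 1, 0), datetime(2024, 1, 10, 1, 10)))  # True
```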

@jhuntervmware
