Recovering a Replicated vSphere Data Protection (VDP) Virtual Appliance

The question: Can a vSphere Data Protection (VDP) virtual appliance be successfully recovered from a replicated copy?

“Successfully recovered” means not only recovering the appliance, but being able to restore a VM from that recovered appliance. To test out this scenario, I used vSphere Replication to provide a replicated copy of the VDP virtual appliance. If you follow the vSphere Uptime blog, you may remember that we already discussed replicating VDP with vSphere Replication – here is the link for that post:

http://blogs.vmware.com/vsphere/2012/10/vdp-and-vr-interoperability.html

As stated in the article above, I do not recommend using VR to replicate VDP. In this case, I simply needed a way to test whether the VDP appliance could be recovered from the replicated copy at a secondary site – more importantly, to see if I could actually use the recovered VDP appliance at the secondary site to recover a VM that was backed up at the primary site.

I deployed a 2TB VDP virtual appliance at the primary site – a total of 13 .vmdk files, thin provisioned. I created a scheduled backup job for one VM. I protected the VDP appliance with VR and set the RPO to 12 hours. This helped me avoid issues with replicating VDP using VR as discussed in that previous post. Once the initial replication was complete, I immediately started my test.

To begin, I powered off the original VDP appliance and recovered the replicated copy. A VM is recovered by VR with the virtual network interface card (vNIC) disconnected as a safety precaution. I enabled the vNIC and powered on the recovered VDP appliance. I was not too surprised when I saw the following message.

Also notice in the background that the Core Services are reporting as “Unrecoverable”. This was almost surely a result of the replicated VDP appliance being a crash-consistent copy – basically as if someone “pulled the plug” on the server, moved it to another location and then turned it back on. VDP does not take kindly to that. Always shut down the guest OS of a VDP appliance gracefully (do not power it off).

There was not much I could do with the recovered appliance in that state so I deleted it from disk. However, I was not ready to give up. This time, I let the scheduled backup job and replication run for a few days. This was primarily so the VDP appliance would start running integrity checks and create checkpoints. Having a validated checkpoint would give me the opportunity to perform a Rollback in VDP. The Rollback mechanism is designed to protect against exactly what I ran into with the first attempt – a corrupted appliance due to a crash-consistent recovery.

Once again, I recovered the VDP appliance at my secondary site. The VDP appliance came online in good health – I did not have to perform a Rollback. Looking at the VDP Configure user interface (UI), I found the maintenance services were stopped, but all other services were running. I switched over to the VDP appliance console and noticed it was validating an integrity check.

Once the integrity check was complete, I switched back to the VDP Configure UI and found the maintenance services were now running – presumably started after the integrity check was completed. Using the VDP Configure UI, I connected the appliance to the vCenter Server and SSO server at the secondary site and rebooted it.

The VDP appliance went through the automated reconfiguration process. After several minutes had passed, VDP was available for use in the vSphere Web Client. Nice! However, when I tried to connect to the appliance, I received an error stating that “the most recent request has been rejected – most likely a time issue”. How ironic that I just posted an article on this issue a few days ago:

http://blogs.vmware.com/vsphere/2012/11/vdp-time-sync-error.html

After some quick troubleshooting, I found that my vCenter Server virtual appliance, which contains the SSO server, was about 3 minutes behind the VDP appliance and the vSphere host they were running on. I restarted VMware Tools at the command line on the vCenter Server appliance: “service vmware-tools-services restart”. When VMware Tools restarted, the time was updated and all clocks were in sync.
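
The time comparison described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the clock readings are collected by some other means (for example, by running `date` on each system), the system names are made up, and the `tolerance` value is a hypothetical parameter, not a documented SSO or VDP limit.

```python
from datetime import datetime, timedelta


def clock_skew(readings: dict[str, datetime]) -> timedelta:
    """Return the gap between the earliest and the latest clock reading."""
    return max(readings.values()) - min(readings.values())


def drifted(readings: dict[str, datetime], reference: str,
            tolerance: timedelta) -> list[str]:
    """Return the systems whose clock differs from the reference system
    by more than the (hypothetical) tolerance, sorted by name."""
    ref_time = readings[reference]
    return sorted(
        name
        for name, reading in readings.items()
        if abs(reading - ref_time) > tolerance
    )


if __name__ == "__main__":
    # Illustrative readings: vCenter is 3 minutes behind VDP and the host.
    now = datetime(2012, 11, 20, 12, 0, 0)
    readings = {
        "vdp": now,
        "host": now,
        "vcenter": now - timedelta(minutes=3),
    }
    print("Total skew:", clock_skew(readings))
    print("Out of sync:", drifted(readings, "vdp", timedelta(seconds=60)))
```

With readings like the ones in my test, the vCenter Server appliance would be the only system flagged.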

As the earlier “Switch to a new vCenter?” warning stated, no backup jobs were present on the recovered VDP appliance, but I did find restore points for the VM backed up at the original site. Naturally, I kept moving by creating a restore job from the most recent restore point for the VM. The job started, but after a few moments, I received the following error:

“VDP: Failed to restore client db-server, Execution error: E10050:Failed to create Virtual Machine.” (db-server being the name of the VM I was attempting to recover)

I speculated that there might be some corruption, so I went ahead and performed a Rollback in the VDP Configure UI to the validated checkpoint. I tried the restore again and received the same disappointing error message. It turned out the problem was host resource contention, not corruption. I had several VMs running on this same vSphere host – memory was almost completely consumed and CPU utilization was also considerable. I shut down some of the non-essential VMs, which freed up some memory and CPU, and tried the restore again – it worked!
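
A pre-restore sanity check along these lines might have saved me a Rollback. The sketch below is hypothetical: the `HostUsage` type and the threshold defaults are assumptions chosen for illustration, not VDP guidance, and the E10050 error can certainly have other causes.

```python
from dataclasses import dataclass


@dataclass
class HostUsage:
    """Utilization figures for the vSphere host, as percentages (0-100)."""
    memory_used_pct: float
    cpu_used_pct: float


def restore_blockers(usage: HostUsage,
                     max_memory_pct: float = 90.0,
                     max_cpu_pct: float = 80.0) -> list[str]:
    """Return human-readable reasons to free up resources before a restore.

    The thresholds are illustrative assumptions; an empty list means
    neither threshold was exceeded.
    """
    blockers = []
    if usage.memory_used_pct > max_memory_pct:
        blockers.append(
            f"memory at {usage.memory_used_pct:.0f}% (limit {max_memory_pct:.0f}%)"
        )
    if usage.cpu_used_pct > max_cpu_pct:
        blockers.append(
            f"CPU at {usage.cpu_used_pct:.0f}% (limit {max_cpu_pct:.0f}%)"
        )
    return blockers


if __name__ == "__main__":
    for reason in restore_blockers(HostUsage(memory_used_pct=97, cpu_used_pct=70)):
        print("Free up resources first:", reason)
```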

So what does all of this mean? As you can see from the above, I did not do extensive testing with various replication types, different workload sizes, etc. Here are my thoughts:

– vSphere Replication is not recommended for replicating a VDP appliance. The replicated copy is crash-consistent, which presents a risk of data corruption. There is no way to schedule when vSphere Replication performs replication and it is not possible to replicate a powered-off – i.e. completely quiesced – VDP appliance using vSphere Replication.

– Building on the point above, the best way to avoid data corruption is to gracefully shut down the VDP appliance and then replicate it. One approach may be to schedule a weekly shutdown of the appliance, copy or replicate it to a secondary site and then power the appliance back on after sufficient time has passed to complete the copy or replication.

– As with most applications, make sure VDP has the CPU and memory resources it needs. If your VDP appliance is particularly busy (backing up many VMs, frequent restore jobs, long-running integrity checks, etc.), it might make sense to configure CPU and memory reservations for the appliance.

– As with just about any solution, you should thoroughly test the solution prior to putting it into production. If replicating backup data from one site to another is a requirement, it might make sense to look at backup and recovery solutions that have this functionality included.
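
The "shut down, copy, power on" approach from the second point above could be orchestrated roughly as follows. This is a sketch under stated assumptions: the four callables are hypothetical placeholders for whatever tooling you use to shut down the guest OS, query the power state, copy the appliance, and power it back on. None of them correspond to a real VDP or vSphere API.

```python
from typing import Callable


def cold_copy(shutdown_guest: Callable[[], None],
              is_powered_off: Callable[[], bool],
              copy_appliance: Callable[[], None],
              power_on: Callable[[], None],
              max_polls: int = 60,
              wait: Callable[[], None] = lambda: None) -> None:
    """Gracefully shut down the appliance, copy it, then power it back on.

    All callables are hypothetical hooks supplied by the caller. `wait`
    is called between power-state polls (for example, a sleep).
    """
    # Ask the guest OS to shut down cleanly (not a hard power-off).
    shutdown_guest()

    # Poll until the appliance reports powered off, or give up.
    for _ in range(max_polls):
        if is_powered_off():
            break
        wait()
    else:
        # Never copy an appliance that may still be running.
        raise TimeoutError("appliance did not power off in time")

    try:
        copy_appliance()
    finally:
        # Bring the appliance back even if the copy failed.
        power_on()
```

Powering on inside `finally` is a deliberate choice: a failed copy should not leave backups offline until someone notices.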

If you are using VDP and considering ways to create off-site copies of your backup data, I would be very interested in hearing how you architect the solution, test results, etc.! Thank you.

@jhuntervmware