Recovering a Replicated vSphere Data Protection (VDP) Virtual Appliance

The question: Can a vSphere Data Protection (VDP) virtual appliance be successfully recovered from a replicated copy?

“Successfully recovered” means not only recovering the appliance, but being able to restore a VM from that recovered appliance. To test out this scenario, I used vSphere Replication to provide a replicated copy of the VDP virtual appliance. If you follow the vSphere Uptime blog, you may remember that we already discussed replicating VDP with vSphere Replication – here is the link for that post:

http://blogs.vmware.com/vsphere/2012/10/vdp-and-vr-interoperability.html

As stated in the article above, I do not recommend using VR to replicate VDP. In this case, I simply needed a way to test whether the VDP appliance could be recovered from the replicated copy at a secondary site – more importantly, to see if I could actually use the recovered VDP appliance at the secondary site to recover a VM that was backed up at the primary site.

I deployed a 2TB VDP virtual appliance at the primary site – a total of 13 .vmdk files, thin provisioned. I created a scheduled backup job for one VM. I protected the VDP appliance with VR and set the RPO to 12 hours. This helped me avoid issues with replicating VDP using VR as discussed in that previous post. Once the initial replication was complete, I immediately started my test.

To begin, I powered off the original VDP appliance and recovered the replicated copy. A VM is recovered by VR with the virtual network interface card (vNIC) disconnected as a safety precaution. I enabled the vNIC and powered on the recovered VDP appliance. I was not too surprised when I saw the following message.

Also notice in the background that the Core Services are reporting as “Unrecoverable”. This was almost surely a result of the replicated VDP appliance being a crash-consistent copy – basically as if someone “pulled the plug” on the server, moved it to another location and then turned it back on. VDP does not take kindly to that. Always shut down the guest OS of a VDP appliance gracefully (do not power it off).

There was not much I could do with the recovered appliance in that state so I deleted it from disk. However, I was not ready to give up. This time, I let the scheduled backup job and replication run for a few days. This was primarily so the VDP appliance would start running integrity checks and create checkpoints. Having a validated checkpoint would give me the opportunity to perform a Rollback in VDP. The Rollback mechanism is designed to protect against exactly what I ran into with the first attempt – a corrupted appliance due to a crash-consistent recovery.

Once again, I recovered the VDP appliance at my secondary site. The VDP appliance came online in good health – I did not have to perform a Rollback. Looking at the VDP Configure user interface (UI), I found the maintenance services were stopped, but all other services were running. I switched over to the VDP appliance console and noticed it was validating an integrity check.

Once the integrity check was complete, I switched back to the VDP Configure UI and found the Maintenance services were now running – presumably started after the integrity check was completed. Using the VDP Configure UI, I connected it to the vCenter Server and SSO server at the secondary site and rebooted the appliance.

The VDP appliance went through the automated reconfiguration process. After several minutes had passed, VDP was available for use in the vSphere Web Client. Nice! However, when I tried to connect to the appliance, I received an error stating that “the most recent request has been rejected – most likely a time issue”. How ironic that I just posted an article on this issue a few days ago:

http://blogs.vmware.com/vsphere/2012/11/vdp-time-sync-error.html

After some quick troubleshooting, I found that my vCenter Server virtual appliance, which contains the SSO server, was about 3 minutes behind the VDP appliance and the vSphere host they were running on. I restarted VMware Tools at the command line on the vCenter Server appliance: “service vmware-tools-services restart”. When VMware Tools restarted, the time was updated and all clocks were in sync.
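To illustrate the comparison I was doing by hand, here is a minimal Python sketch that flags any system whose clock has drifted from a reference. The function name, the system names, the sample timestamps and the 60-second tolerance are all assumptions made up for this example; the actual tolerance enforced by VDP or SSO is not stated in this article. In practice the readings might come from running `date -u` on each system at roughly the same moment.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def find_clock_drift(
    clocks: Dict[str, datetime],
    reference: str,
    tolerance: timedelta,
) -> Dict[str, timedelta]:
    """Return each system whose clock differs from the reference by more
    than the tolerance, mapped to its offset (negative means behind)."""
    ref_time = clocks[reference]
    drift = {}
    for name, reading in clocks.items():
        offset = reading - ref_time
        if abs(offset) > tolerance:
            drift[name] = offset
    return drift


# Hypothetical readings, all in UTC and taken at the same moment.
noon = datetime(2012, 12, 1, 12, 0, 0, tzinfo=timezone.utc)
clocks = {
    "vdp-appliance": noon,
    "vsphere-host": noon,
    "vcenter-sso": noon - timedelta(minutes=3),
}

# The 60-second tolerance is an assumption, not a documented VDP or SSO limit.
drifted = find_clock_drift(clocks, "vdp-appliance", timedelta(seconds=60))
for name, offset in drifted.items():
    print(f"{name} is off by {offset.total_seconds():.0f} seconds")
```

With these sample values, only the vCenter Server entry is reported, which mirrors the roughly 3-minute lag described above.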

As the earlier “Switch to a new vCenter?” warning stated, no backup jobs were present on the recovered VDP appliance, but I did find restore points for the VM backed up at the original site. Naturally, I kept moving by creating a restore job from the most recent restore point for the VM. The job started, but after a few moments, I received the following error:

“VDP: Failed to restore client db-server, Execution error: E10050:Failed to create Virtual Machine.” (db-server being the name of the VM I was attempting to recover)

I speculated that maybe there was some corruption so I went ahead and performed a Rollback in the VDP Configure UI to the validated checkpoint. I tried the restore again and received the same disappointing error message. It turned out the cause was resource contention, not corruption. I had several VMs running on this same vSphere host – memory was almost completely consumed and CPU utilization was also high. I shut down some of the non-essential VMs, which freed up some memory and CPU, and tried the restore again – it worked!

So what does all of this mean? As detailed above, you can see that I did not do a lot of testing with various replication types, different workload sizes, etc. Here are my thoughts:

- vSphere Replication is not recommended for replicating a VDP appliance. The replicated copy is crash-consistent, which presents a risk of data corruption. There is no way to schedule when vSphere Replication performs replication and it is not possible to replicate a powered off – i.e. completely quiesced – VDP appliance using vSphere Replication.

- Building on the point above, the best way to avoid data corruption is to gracefully shut down the VDP appliance and then replicate it. One approach may be to schedule a weekly shutdown of the appliance, copy or replicate it to a secondary site and then power the appliance back on after sufficient time has passed to complete the copy or replication.

- As with most applications, make sure VDP has the CPU and memory resources it needs. If your VDP appliance is particularly busy (backing up many VMs, frequent restore jobs, long-running integrity checks, etc.), it might make sense to configure CPU and memory reservations for the appliance.

- As with just about any solution, you should thoroughly test the solution prior to putting it into production. If replicating backup data from one site to another is a requirement, it might make sense to look at backup and recovery solutions that have this functionality included.
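The shutdown-then-copy approach from the second bullet above could be sketched roughly as follows. This is only an outline under stated assumptions: the four callables (`shutdown_guest`, `is_powered_off`, `copy_files`, `power_on`) are hypothetical hooks you would have to implement with whatever tooling you use, and the timeout and polling values are arbitrary. The sketch never forces a power-off; if the guest does not shut down in time, it gives up without copying.

```python
import time
from typing import Callable


def cold_copy_appliance(
    shutdown_guest: Callable[[], None],
    is_powered_off: Callable[[], bool],
    copy_files: Callable[[], None],
    power_on: Callable[[], None],
    timeout_s: float = 1800.0,
    poll_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Gracefully shut down the appliance, copy it while it is powered
    off, then power it back on. All four hooks are hypothetical."""
    # Ask the guest OS to shut down cleanly (never a hard power-off).
    shutdown_guest()

    # Wait until the appliance reports powered off, up to the timeout.
    waited = 0.0
    while not is_powered_off():
        if waited >= timeout_s:
            # Still running, so nothing was copied and nothing needs a power-on.
            raise TimeoutError("appliance did not power off; copy skipped")
        sleep(poll_s)
        waited += poll_s

    # Copy only while powered off; power back on even if the copy fails.
    try:
        copy_files()
    finally:
        power_on()
```

A weekly scheduler entry could then call `cold_copy_appliance(...)` with real implementations of the hooks. Keeping the hooks as parameters makes the ordering easy to test with stand-ins before pointing it at a real appliance.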

If you are using VDP and considering ways to create off-site copies of your backup data, I would be very interested in hearing how you architect the solution, test results, etc.! Thank you.

@jhuntervmware

11 thoughts on “Recovering a Replicated vSphere Data Protection (VDP) Virtual Appliance”

  1. Pingback: VMTN Blog: Technical Marketing Update 2012 – Week 48 – #tmupdate | Virtualization

  2. jb42

    Jeff, for my DR efforts, I’m trying to bring up a ZFS-snapped and replicated VDP appliance (iSCSI zvol-backed). Core services came up unrecoverable on the first effort, but I had snapped and replicated without shutting down the appliance. Tried again today. It booted swiftly enough into the console, but networking was down (same thing yesterday). The copy comes up wanting to use eth1, but the original used eth0. I renamed /etc/sysconfig/network/ifcfg-eth0 to ifcfg-eth1, and yesterday that got me into the vdp-configure web GUI after a reboot. Today the appliance is taking hours to boot, though it has advanced from starting VDP appliance systems to initializing proxies. I have about 2TB on the appliance. I’m hoping it’s running integrity checks and just not showing me – but I guess DR won’t be very fast in that case…

    1. jb42

      Finally finished booting after about 4 hours and showed up in the vCenter drop-down – only I couldn’t connect. Got into the vdp-configure GUI, and core services status was stopped, but that seemed a touch better than unrecoverable. Only it wouldn’t start; it just spun on “starting”. And disturbingly, there were no checkpoints, though there definitely had been in my original. The network config was missing a netmask, so I added it and initiated the required reboot – now going on 5 hours…

      Read some VDP v Avamar blogs while waiting, and this one highlights lack of replication as part of the VDP spec: http://chucksblog.emc.com/chucks_blog/2012/08/announcing-vmware-vsphere-data-protection.html, so what are the chances this is designed not to work?

      1. Jeff Hunter (Post author)

        @jb42 – Thank you for documenting your experience around replicating VDP. I am not surprised this has been a challenge. I would not say that a VDP appliance was designed *not* to be replicated. I would definitely say that a VDP appliance was not designed to be replicated. In other words, it is not “broken on purpose”. It is no secret VDP is built on EMC Avamar code. This is the same with Avamar (VE) – you would not replicate an entire Avamar appliance, you would utilize the replication built into Avamar which VDP does not currently have.
        The point of the blog article was to show that it could potentially be accomplished. However, I do not recommend it and it is not supported.
        I am not sure why your appliance experienced the stopped services, long boot times, lack of checkpoints, etc. The fact that it contains ~2TB of data probably contributes to the challenge. The appliance I tested with was small.
        Again, I really appreciate you documenting your experience here. Hopefully, others can take this documented experience into consideration before attempting this exercise themselves.

        1. jb42

          Getting closer! Did a little digging and it looks like the dynamic MAC address is the cause of the eth0->1 interface switch: http://www.cyberciti.biz/tips/vmware-linux-lost-eth0-after-cloning-image.html.

          I must have gotten hit by this since I’m trying to “recover” within my primary site and so pulled a new MAC on boot.

          Deleted the referenced “.rules” file and 1st half of reboot was zippy, but still took a while initializing proxies…

          Core-services came up running but management services were not and wouldn’t start. Trying a rollback now…

  3. jb42

    Alright, it worked! That network interface fix was the key. Note that checkpoint validation took a good while once I had everything lined up right, so figure on that if you’re going to take this DR approach.

    I’ll do more tests now with incremental zfs-snaps and then go through the full off-site recovery which will include bringing up a new vCenter. I’ll continue to log efforts here: http://communities.vmware.com/message/2226772

    Thanks Jeff. I would have given up on VDP as the engine for my free DR solution if it wasn’t for this blog post. I’m really pleased with this solution. Recycled hardware + opensource OS + free (included) software = enterprise-class DR!

  4. lafo

    Hello Jeff,

    I am interested to know the best way to copy that VDP appliance and restore it at another location. For example, we have a Disaster Recovery date next month and I have no idea how to start. If it is possible to copy VDP, start it on the other vCenter Server and somehow still see all the backups, that would be great to know.

    Can you please point me in the right direction? We are not using any VR.

    Regards and Thanks

    1. Jeff Hunter (Post author)

      Hi lafo,

      I do not have a detailed guide other than this blog article:
      https://blogs.vmware.com/vsphere/2012/12/recover-replicated-vdp-appliance.html
      The important item to keep in mind is the VDP appliance must be completely quiesced – powered off, in other words – to help ensure backup data integrity. Once the appliance is powered off, copy all of the files that make up the VDP appliance to the recovery site. After the files are copied, register the copy of the appliance at the recovery site with the vCenter Server you have at the recovery site. Then, register VDP with vCenter Server using the VDP Configure user interface – see the VDP Admin Guide for details on using VDP Configure. After all of these steps have been completed, you should be able to perform restores.
      Also see this community thread:
      http://communities.vmware.com/message/2226772

      1. jb42

        Hi Jeff and Lafo. Just wanted to add that after a second phase of testing, I don’t believe it’s necessary to quiesce the appliance with the data-snap approach I’ve taken; it is enough to schedule daily data-snaps after the VDP integrity check. This is great because we don’t have to work out an automated way to power down the VDP appliance every day!

  5. Jeff Hunter (Post author)

    Yes, theoretically, as long as the appliance does not have any jobs running (e.g. late in the maintenance window), you should be able to replicate the appliance while it is powered on. However, I am not sure I would trust this completely.

  6. Jordan Pangborn

    Greetings everybody! Here is what I had in mind for replicating VDP, and I wanted to see what your thoughts were. We were granted some rack space at our ISP’s DR site, so we have a direct fiber connection to our network from the DR site (our DR host and storage are connected directly to our network). Our VDP appliance and its backup data are stored on a QNAP via NFS. My initial thought was to do a NAS-to-NAS replication of the share on our main-site QNAP to the DR-site QNAP, then see if I could add it to the inventory of the DR vCenter and spin up the VDP appliance out there.
