Product Announcements

Site Recovery Manager with vSphere Replication Multi-Point-in-Time

One of the new 5.5 features in vSphere Replication is the ability to retain historical replications as point-in-time snapshots on the recovered virtual machines.

This feature is quite handy for recovering systems with corrupted data or viruses, or even for auditing system changes and the like.  While VMs protected with vSphere Replication can be recovered manually, one by one, full automation of recovery is of course offered by Site Recovery Manager.

In this post I’ll look at how we configure these multiple points in time (MPIT) during replication, and how we interact with them after failover by SRM.

Since MPIT configuration has been covered previously, we will start with the assumption that VMs are already configured for replication, and that MPIT is being used with some historical point-in-time retention policy.  In this scenario I am retaining 5 replicas per day.

Our virtual machine (“TestSRV36”) is a pretty standard Windows VM with 1 disk, provisioned for 20 GB on disk and using about 11 GB of actual space.  The average replication delta is about 400 KB… it’s not doing much in my lab!
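
If you want to double-check those numbers from a script rather than from the client, a minimal pyVmomi sketch along these lines can pull a VM’s storage summary.  This is illustration only; the vCenter hostname, credentials, and VM name are placeholders for my lab.

```python
# Minimal pyVmomi sketch: report provisioned vs. used space for a VM.
# Hostname, credentials, and VM name are placeholders for a lab setup.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; use verified certs in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "TestSRV36")
    view.DestroyView()

    storage = vm.summary.storage
    used_gb = storage.committed / (1024 ** 3)
    provisioned_gb = (storage.committed + storage.uncommitted) / (1024 ** 3)
    print(f"{vm.name}: provisioned {provisioned_gb:.1f} GB, used {used_gb:.1f} GB")
finally:
    Disconnect(si)
```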

Now I’m going to browse the datastore at the target location, in the recovery site – this is where the MPIT snapshots are held.  We can see a whole bunch of files in the directory:

  • TestSRV36.vmdk – this is the root disk file that will be used for recovery.
  • hbrcfg…vmx.yyyy – these are the VMX config files associated with each point in time.  If we need to revert, we of course want the VMX associated with that timestamp available as well, so that file is retained historically.
  • hbrcfg…nvram.yyyy – these are the BIOS contents associated with the VM at that particular point in time.
  • hbrcfg…vmxf.yyyy – supplemental config data for the VMX, again timestamped to that particular PIT.
  • hbrdisk.RDID… – the actual snapshot for that point in time.  Technically this is not a proper snapshot yet.  When VR populates a replica on the recovery site it creates, in essence, a redo log – that’s what this file is: the redo log for that replication.  Once a redo log has a complete set of data for a given replication, it gets committed to the primary VMDK just like a snapshot.  With MPIT configured, the redo logs are retained (and cleaned up) in accordance with the retention policy.  This file is what gets turned into a ‘proper’ snapshot when recovery is initiated.

So you can see that for each configured historical point in time, we have a corresponding set of files on disk.  There is the ‘root disk’ failover target (the VMDK) plus the vmx, nvram, vmxf, and hbrdisk files for the historical points.
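
If you’d rather enumerate these files from a script than browse the datastore in the client, something like the following pyVmomi sketch will do it.  The datastore name and folder are placeholders for my lab, and the match patterns simply reflect the file naming described above.

```python
# pyVmomi sketch: list the replica's root VMDK and the hbr* point-in-time files.
# The datastore name and folder path below are placeholders for this lab.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only
si = SmartConnect(host="recovery-vc.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.Datastore], True)
    ds = next(d for d in view.view if d.name == "RecoveryDS")
    view.DestroyView()

    # Match the base disk plus the timestamped config and redo-log files.
    spec = vim.host.DatastoreBrowser.SearchSpec(
        matchPattern=["*.vmdk", "hbrcfg*", "hbrdisk*"])
    task = ds.browser.SearchDatastore_Task(
        datastorePath="[RecoveryDS] TestSRV36", searchSpec=spec)
    WaitForTask(task)
    for f in task.info.result.file:
        print(f.path)
finally:
    Disconnect(si)
```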

Now that the VMs are replicated and we are generating historical snapshots, we simply add them to protection groups and recovery plans in SRM like any other VR-protected VM.

When recovering VMs with SRM, however, there may be reasons for *not* wanting to retain every historical snapshot.  It’s fine to fail over individual VMs and then manually select a revert point via snapshot, but there may also be very good reasons not to carry all the snapshots along as part of a mass DR failover.

For example, if you are customizing IP addresses, or running scripts and the like, reverting a VM to a previous state might be a “Very Bad Thing”.  All the customization and scripting changes would be lost, as those items are only applied to the failover snapshot, not to the previous points in time.  Likewise, it may be a DR administrator who performs the failover, a VMware administrator who runs the VMs, and an application owner who needs to make the revert decision.  If a wayward VMware administrator sees a bunch of snapshots, or a script runs that consolidates them, having all those snapshots sitting there may add burden or indeed completely mess things up.  Not to mention the overhead and commit time of consolidating all the snapshots on recovery.

For this reason, SRM gives you the ability to override the MPIT setting and choose *not* to retain the MPIT history.  By default the setting “vrReplication.preserveMpitImagesAsSnapshots” is selected.  If you choose not to retain the history, simply go into the SRM interface, click on a site (or both sites), select “Advanced Settings”, navigate to “vrReplication” and *deselect* the option.  This has the effect of consolidating all the snapshots automatically as part of the failover.  There is no best practice here; it is simply an option if you like using MPIT on a daily basis but might choose not to use it for a full DR failover.

Now say we need to fail over to our recovery site for some reason.  In SRM we run the recovery plan to migrate, and watch as everything recovers.  VR will do a final synchronization if it can after the VMs are powered off, and then as part of the recovery it will ‘promote’ the most recent VMX, turn the placeholder into a ‘real’ VM, attach the VMDK to the placeholder via the ReloadVMFromPath API call, and power it on.

If we browse the datastore now, we see some differences: the VMX file is now a normal VMX file, the vswp file has been created and the VM is up and running, and the redo logs have been turned into full snapshots – not just the corresponding snapshot VMDKs but also the snapshot state held in the VMSN files.

Indeed, if we now manage the snapshots of the recovered VMs, we see the standard Snapshot Manager tree view.  The snapshots are just like any other VMware snapshot, accessed through the same familiar snapshot interface administrators are accustomed to.  They can be reverted to, deleted, or consolidated just like snapshots on any other VM – these were merely created by replication rather than by manual interaction.
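
Because they are ordinary snapshots, you can also handle them programmatically.  As a rough illustration (connection details and names are again placeholders for my lab), a pyVmomi sketch like this walks the tree and reverts the recovered VM to a chosen point in time; picking the oldest root snapshot here is just an example.

```python
# pyVmomi sketch: walk the recovered VM's snapshot tree and revert to one of the
# replicated points in time.  Connection details and names are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim


def walk(snap_list, depth=0):
    """Print the snapshot tree the way Snapshot Manager displays it."""
    for snap in snap_list:
        print("  " * depth + f"{snap.name} (created {snap.createTime})")
        walk(snap.childSnapshotList, depth + 1)


ctx = ssl._create_unverified_context()  # lab only
si = SmartConnect(host="recovery-vc.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "TestSRV36")
    view.DestroyView()

    if vm.snapshot:
        walk(vm.snapshot.rootSnapshotList)
        # Revert to the oldest retained point (the first root snapshot here);
        # in practice, pick whichever timestamp the application owner needs.
        oldest = vm.snapshot.rootSnapshotList[0]
        WaitForTask(oldest.snapshot.RevertToSnapshot_Task())
finally:
    Disconnect(si)
```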

Best practices?

  • Don’t do this for every VM, only the ones where you think you might need it.  Why?  Lots of reasons: extra disk space utilization, possible problems on rollback when using customization, the extra time taken to commit snapshots after recovery, and so forth.
  • If you do use snapshots and fail over with SRM, commit them as soon as possible – it has never been a good idea to leave long chains of snapshots hanging off VMs for a long period of time, and these are no different.  (A minimal scripted example of committing them follows this list.)
  • Speaking of long chains, you can have up to 24 historical points retained, with X per day for up to Y days in any combination, as long as the total is 24 or fewer.  Should you therefore do 24?  Not unless you need it!  Again, committing snapshots is not trivial, particularly if they are large.  Use discretion in your MPIT policy.
  • There is no automated way to preselect a point in time for failover.  SRM will always fail over to the most recent point in time, and you will then need to manually revert to an earlier snapshot if desired.  Make your life simpler and use less frequent MPIT where possible to avoid a long process of moving around through snapshots!

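As mentioned in the second bullet, committing the snapshots can be done from Snapshot Manager or scripted.  For example, a pyVmomi sketch along these lines (names and credentials are placeholders) would commit everything on a recovered VM once you are sure no revert is needed:

```python
# pyVmomi sketch: commit (delete) all retained MPIT snapshots on a recovered VM.
# Hostname, credentials, and VM name are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only
si = SmartConnect(host="recovery-vc.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "TestSRV36")
    view.DestroyView()

    # Removing all snapshots commits every retained redo log back into the base VMDK.
    WaitForTask(vm.RemoveAllSnapshots_Task())
finally:
    Disconnect(si)
```
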
Lastly, keep in mind these snapshots are only created on the recovery side.  They add no overhead to the protected VM: no snapshot is created on it, there is no additional interaction with the production VM, and creating these is a completely nonintrusive process.  All the work is done on the replica object, not the protected VM.

Hopefully this helps you understand the use of VR’s new MPIT feature with Site Recovery Manager!