One of the many capabilities introduced in VMware vSphere 4 is VMware Data Recovery (VDR), a virtual machine backup and recovery product. Market research and customer feedback showed that many people wanted an integrated option for protecting virtual machines in a VMware environment. Further analysis showed that this need was most pronounced among VMware customers that had (or planned to have) fewer than 100 virtual machines in their environment and where IT responsibilities (including VMware) were shared among two or three IT administrators (as opposed to having a dedicated VMware administrator on staff).
VMware has been helping customers address their backup challenges in two ways: by making significant investments in the vStorage APIs for Data Protection, which third-party backup vendors use to integrate their backup/recovery products with vSphere, and by providing an integrated option optimized for vSphere customers with smaller environments. VDR is built using the vStorage APIs for Data Protection and incorporates a user interface, a policy engine and data deduplication (see the diagram below for how it all fits together). I’ll cover these blocks in a series of blog posts, but I wanted to start out by discussing data deduplication (dedupe).
Given that we had made a decision to use only disks as the destination for VDR backups, we had to look for a solution that offered disk storage savings, and this is where dedupe comes in. In a nutshell, dedupe avoids storing the same data twice. And dedupe is HOT: just check out the mergers and acquisitions news!
What VMware decided to implement for VDR dedupe is (take a deep breath) block-based, in-line, destination deduplication. Deconstructing it means the following:
1. We discover data commonality at the disk block level as opposed to the file level.
2. It is done as we stream the backup data to the destination disk as opposed to a post-backup process.
3. The actual dedupe process occurs as we store the data on the destination disk as opposed to when we are scanning the source VM’s virtual disks prior to the backup.
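To make the three points above concrete, here is a minimal Python sketch of the general technique. It is not VDR's actual implementation: the 4 KB fixed block size, the SHA-256 hash, the in-memory dictionary standing in for the destination disk, and the `DedupeStore` class itself are all assumptions made purely for illustration.

```python
import hashlib
import io

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


class DedupeStore:
    """Hypothetical destination store that keeps one copy of each unique block."""

    def __init__(self):
        self.blocks = {}   # digest -> block bytes (stands in for the destination disk)
        self.bytes_in = 0  # total bytes streamed to the store

    def write_stream(self, stream):
        """Consume a backup stream; return its recipe (ordered list of block digests)."""
        recipe = []
        while True:
            block = stream.read(BLOCK_SIZE)  # block level, not file level
            if not block:
                break
            self.bytes_in += len(block)
            digest = hashlib.sha256(block).hexdigest()
            # In-line: the duplicate check runs as each block arrives at the
            # destination, so a repeated block is never written a second time.
            if digest not in self.blocks:
                self.blocks[digest] = block
            recipe.append(digest)
        return recipe

    def restore(self, recipe):
        """Rebuild the original data from its recipe."""
        return b"".join(self.blocks[d] for d in recipe)

    def bytes_stored(self):
        """Bytes actually held at the destination after dedupe."""
        return sum(len(b) for b in self.blocks.values())


# Two made-up "virtual disks" that share some blocks.
store = DedupeStore()
vm1 = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
vm2 = b"A" * BLOCK_SIZE * 2 + b"C" * BLOCK_SIZE
recipe1 = store.write_stream(io.BytesIO(vm1))
recipe2 = store.write_stream(io.BytesIO(vm2))
```

In this toy run, seven blocks are streamed in but only three unique blocks (A, B and C) are kept, and both disks can still be rebuilt from their recipes. The source side does no duplicate detection at all: it simply streams blocks, and the destination decides what to keep.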
When it comes to deduplication, there are different techniques and hash algorithms used to accomplish the result. I am not going to get into a theoretical discussion of the pros and cons of the various types of dedupe technologies available and which approach provides the best disk savings. I personally think that it depends entirely on the customer’s IT environment constraints and overall business goals. Besides, much of the storage savings is going to be data driven anyway (the more data commonality there is, the better the dedupe rate). We chose this dedupe architecture because it fit best with what we were trying to achieve with VDR and what the vSphere platform provided to us. What were these reasons? Stay tuned to this space...