
Virtual SAN 6.2 – Deduplication and Compression Deep Dive

Virtual SAN 6.2 introduced several highly anticipated product features, and in this blog we’ll focus on two of the most interesting: Dedupe and Compression. These features were frequently requested by VMware customers, and I am glad that we listened. When talking about Dedupe and Compression, one first needs to understand why an organization would want to use them and what these features actually do. One of the main reasons is to lower TCO: customers benefit from space efficiency because the Virtual SAN cluster does not consume as much storage as it would without Dedupe and Compression, which saves dollars. It is also important to note that Dedupe and Compression are supported on All-Flash Virtual SAN configurations only.

 

What are Dedupe and Compression?

The basics of deduplication can be seen in the figure below. Blocks of data stay in the cache tier while they are being accessed regularly; once that access pattern stops, the deduplication engine checks whether the block being destaged has already been stored on the capacity tier, so that only unique chunks of data are stored.

[Figure 1: Deduplication stores only unique chunks of data on the capacity tier]

So imagine a customer has lots of VMs sharing a datastore, and these VMs keep writing the same block of data because a certain file is written to frequently. Each time a duplicate copy of that data is stored, space is wasted. These blocks of data should only be stored once to ensure data is stored efficiently. The deduplication and compression operation happens during the destage from the cache tier to the capacity tier.

In case you are wondering how all these blocks of data are tracked, hashing is used. Hashing is the process of creating a short, fixed-length data string from a large block of data. The hash identifies the data chunk and is used in the deduplication process to determine whether the chunk has been stored before.
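As a rough illustration of that idea, here is a minimal Python sketch of hash-based duplicate detection. It is a toy model under simple assumptions (SHA-1 as the fingerprint, an in-memory set standing in for vSAN’s metadata), not vSAN’s actual implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # Virtual SAN 6.2 dedupes at 4K granularity

# Illustrative only: a set of fingerprints stands in for the metadata vSAN keeps
# on the capacity tier to know whether a block has been stored before.
seen_fingerprints = set()

def is_duplicate(block: bytes) -> bool:
    """Hash the 4K block and check whether identical content was stored already."""
    digest = hashlib.sha1(block).hexdigest()   # short, fixed-length string for a 4K block
    if digest in seen_fingerprints:
        return True                            # duplicate: reference the existing copy
    seen_fingerprints.add(digest)              # unique: this copy will be written once
    return False

block = b"A" * BLOCK_SIZE
print(is_duplicate(block))  # False - first time this content is destaged
print(is_duplicate(block))  # True  - identical content is not stored again
```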

Together with Deduplication, Compression is enabled at the cluster level; it is not enabled through Storage Policy Based Management. The default block size for dedupe is 4K. For each unique 4K block, compression is performed only if the output is smaller than the fixed compressed block size; the goal is to get the 4K block down to 2K, as seen below. When that succeeds, a compressed block is allocated and tracked in the translation maps; otherwise the block is stored uncompressed.

 

[Figure 2: A unique 4K block is compressed to 2K before being stored]
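As a rough sketch of that decision, the following Python snippet compresses a unique 4K block and keeps the compressed form only if it fits within the 2K target. zlib is just a stand-in here for whatever algorithm vSAN actually uses, and the names are illustrative.

```python
import os
import zlib

BLOCK_SIZE = 4096
COMPRESSED_TARGET = 2048  # store the compressed form only if the 4K block fits within 2K

def maybe_compress(block: bytes) -> bytes:
    """Compress a unique 4K block only when the result meets the 2K target."""
    candidate = zlib.compress(block)
    return candidate if len(candidate) <= COMPRESSED_TARGET else block

compressible = b"A" * BLOCK_SIZE       # repetitive data easily meets the target
random_block = os.urandom(BLOCK_SIZE)  # random data will not, so it stays uncompressed

print(len(maybe_compress(compressible)))  # well under 2048 bytes
print(len(maybe_compress(random_block)))  # 4096: stored as the original block
```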

 

Enabling Dedupe & Compression

Enabling Dedupe and Compression is not rocket science by any means: simply go to the Virtual SAN cluster and enable it from the Edit Virtual SAN Settings screen. Once dedupe has been enabled, all hosts and disk groups in the cluster participate in deduplication. A dedupe domain is the same as a disk group, so all redundant copies of data within a disk group are reduced to a single copy, but redundant copies across disk groups are not deduped. In other words, the space efficiency is limited to the disk group: if multiple components in a disk group use the same block, they all share one single copy.
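The disk-group scoping is easiest to see with a small toy model. The DiskGroup class below is purely illustrative (it is not a vSAN object): identical blocks destaged to the same disk group are stored once, while the same block landing in a second disk group is stored again.

```python
import hashlib

BLOCK_SIZE = 4096

class DiskGroup:
    """Illustrative model of a dedupe domain: uniqueness is tracked per disk group."""
    def __init__(self, name: str):
        self.name = name
        self.hash_map = {}   # per-disk-group fingerprint table
        self.blocks = []     # physical copies stored in this disk group

    def destage(self, block: bytes) -> int:
        digest = hashlib.sha1(block).hexdigest()
        if digest not in self.hash_map:
            self.blocks.append(block)
            self.hash_map[digest] = len(self.blocks) - 1
        return self.hash_map[digest]

dg1, dg2 = DiskGroup("disk-group-1"), DiskGroup("disk-group-2")
shared = b"B" * BLOCK_SIZE

dg1.destage(shared)
dg1.destage(shared)   # deduped within disk-group-1
dg2.destage(shared)   # stored again: dedupe does not span disk groups

print(len(dg1.blocks), len(dg2.blocks))  # 1 1 -> two physical copies cluster-wide
```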

Dedupe can be enabled and disabled on a live cluster, but there are some implications to doing this. Turning on dedupe means going through every disk group in the cluster, evacuating all of its data, and reformatting the disk group. After this, Virtual SAN performs dedupe on the disk group.

So it’s a rolling operation, one disk group at a time. It’s important to remember that dedupe and compression are coupled: once you enable deduplication, you are also enabling compression, as seen below.

[Figure 3: Deduplication and Compression enabled together in the Edit Virtual SAN Settings screen]
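The rolling nature of the change can be outlined in a few lines of Python. The helper functions below are placeholders that just log each step of the process described above; they are not vSAN APIs.

```python
# Placeholders for the per-disk-group steps; purely illustrative, not vSAN APIs.
def evacuate(disk_group: str) -> None:
    print(f"evacuating components from {disk_group}")

def reformat(disk_group: str) -> None:
    print(f"reformatting {disk_group} with dedupe and compression enabled")

def readmit(disk_group: str) -> None:
    print(f"{disk_group} is back in service")

def enable_space_efficiency(disk_groups) -> None:
    """Handle one disk group at a time, so the cluster stays online throughout."""
    for dg in disk_groups:
        evacuate(dg)
        reformat(dg)
        readmit(dg)

enable_space_efficiency(["disk-group-1", "disk-group-2", "disk-group-3"])
```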

IO Intensity

Dedupe is an IO-intensive operation. In a non-dedupe world, data is simply written from the cache tier to the capacity tier. With dedupe, the first part stays the same, but more operations are inherently performed during destaging: IO goes through an additional dedupe path, and this happens regardless of whether the data is dedupe friendly or not.

Read – When performing a read, extra reads are sent to the capacity SSD in order to look up the logical address and find the corresponding physical capacity (SSD) address.

Write – During destage, extra writes are required to the translation map and hash map tables. These tables are used to keep the overheads down, but the extra writes they incur still need to be accounted for, along with the fact that a 4K block size is being used.

[Figure 4: Extra reads and writes on the deduplication IO path]
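To make the read and write overhead concrete, here is a minimal toy model of the two metadata structures mentioned above, assuming simple dictionary-based translation and hash maps; the names and layout are illustrative, not vSAN’s on-disk format.

```python
import hashlib

BLOCK_SIZE = 4096

# Toy stand-ins for the metadata that vSAN keeps on the capacity SSDs; the extra
# IO described above comes from reading and writing structures like these.
translation_map = {}   # logical block address -> physical block address
hash_map = {}          # block fingerprint -> (physical address, reference count)
physical_blocks = []   # stands in for 4K blocks on the capacity tier

def write(lba: int, block: bytes) -> None:
    """Destage path: extra writes update both the hash map and the translation map."""
    digest = hashlib.sha1(block).hexdigest()
    if digest in hash_map:                      # duplicate content: metadata only
        pba, refs = hash_map[digest]
        hash_map[digest] = (pba, refs + 1)
    else:                                       # unique content: one physical write
        physical_blocks.append(block)
        pba = len(physical_blocks) - 1
        hash_map[digest] = (pba, 1)
    translation_map[lba] = pba

def read(lba: int) -> bytes:
    """Read path: an extra lookup translates the logical address to the physical one."""
    pba = translation_map[lba]                  # the extra metadata read
    return physical_blocks[pba]

write(0, b"C" * BLOCK_SIZE)
write(1, b"C" * BLOCK_SIZE)   # same content at a second logical address
assert read(0) == read(1)
print(len(physical_blocks))   # 1 -> one physical copy backs both logical addresses
```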

Dedupe Ratio

When looking at the Summary screen for the datastore, the different capacities and the dedupe ratio can be viewed. Logical capacity is a new term: it is the capacity footprint you would see if Dedupe and Compression were not turned on. In the example below, the physical space used is 10 GB and the dedupe ratio is 3.2, so the logical capacity is 32 GB.
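The arithmetic behind those numbers is simply physical used multiplied by the dedupe ratio; a short snippet makes it explicit (the variable names are mine, not vSAN’s).

```python
# Logical capacity is what the data would occupy without dedupe and compression.
physical_used_gb = 10
dedupe_ratio = 3.2

logical_capacity_gb = physical_used_gb * dedupe_ratio
print(logical_capacity_gb)  # 32.0 GB, matching the Summary screen example above
```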

[Figure 5: Datastore Summary screen showing used capacity and the dedupe ratio]

Summary

In summary, Dedupe and Compression are fantastic features that will be very useful to customers running all-flash configurations. They reduce TCO and, from a technical standpoint, they are very simple to implement. Customers do not really need to learn anything new, so there is no ramp-up on the technology from a learning perspective.