
The Use of Erasure Coding in VMware vSAN

vSAN is a scale-out, software-defined storage platform that is the key component of VMware’s Hyper-Converged Infrastructure (HCI) offerings, on premises and on VMware Cloud on AWS. vSAN is an enterprise-grade storage product that offers a number of features to ensure the integrity, security and resilience of data. Since the first generally available version of the product (vSAN 5.5) back in April 2014, vSAN has supported synchronous data replication (RAID-1) policies for data resilience within a cluster. A policy of “number of concurrent failures to tolerate” (FTT) indicates how many failures a specified object in the cluster (e.g., a VM or a virtual disk) must survive; with RAID-1, vSAN maintains FTT+1 copies of that object.

Since February of 2016 (vSAN 6.2), the product has also supported data resilience by means of erasure coding. Two specific configurations are currently supported: RAID-5 for protection against one failure and RAID-6 for protection against up to two concurrent failures. Typically, RAID-1 is used for performance-sensitive workloads, while RAID-5 or RAID-6 is used when space efficiency is the top priority.

In the product and marketing material, the terms Erasure Coding and RAID-5/RAID-6 are used pretty much interchangeably. A number of people have asked about the difference between RAID and Erasure Coding, and what is actually implemented in vSAN.


Erasure Codes or RAID?

So, let me set the terminology straight and clarify what we do in vSAN.

Erasure Coding is a general term that refers to *any* scheme of encoding and partitioning data into fragments in a way that allows you to recover the original data even if some fragments are missing. Any such scheme is referred to as an “erasure code”.  For a great primer, see this paper by J. Plank: “Erasure Codes for Storage Systems: A Brief Primer”.

Reed-Solomon is a group of erasure codes based on the idea of augmenting N data values with K new values that are derived from the original values using polynomials. This is a fairly general idea with many possible incarnations.


RAID-5 is an erasure code that is typically described and understood in terms of bit parity. But even this simple code falls under the Reed-Solomon umbrella: we are augmenting N bit values with a new bit value, which is computed using a trivial polynomial under binary arithmetic (XOR).

Figure 1: RAID-5 striping with 3 data + 1 parity fragment per stripe.
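To make the parity idea concrete, here is a minimal Python sketch (illustrative only, not vSAN code) of the XOR parity fragment for a 3+1 stripe:

```python
# Illustrative sketch of RAID-5 XOR parity for a 3+1 stripe; not vSAN code.

def xor_parity(data_fragments):
    """Compute the parity fragment as the byte-wise XOR of all data fragments."""
    parity = bytearray(len(data_fragments[0]))
    for frag in data_fragments:
        for i, b in enumerate(frag):
            parity[i] ^= b
    return bytes(parity)

# 3 data fragments + 1 parity fragment per stripe; each fragment lands on a
# different host, and any single missing fragment can be rebuilt by XOR-ing
# the surviving three.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)
```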


RAID-6 refers to various codes that are similar in function: they augment the data values with two new values and allow recovery if any one or two values are missing. The “classical” RAID-6 implementation is a Reed-Solomon code, which augments the parity in RAID-5 with a second “syndrome” that requires more complex calculations.


Figure 2: RAID-6 striping with 4 data + 1 parity (P) + 1 Reed-Solomon syndrome (Q) fragment per stripe.
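To make the P and Q syndromes concrete, here is a minimal Python sketch of the classic Reed-Solomon RAID-6 construction over GF(2^8) (generator g = 2, reduction polynomial 0x11d). This is a simplified illustration of the math, not vSAN’s implementation:

```python
# Illustrative sketch of RAID-6 P/Q syndromes over GF(2^8); not vSAN code.

def gf_mul2(b: int) -> int:
    """Multiply a GF(2^8) element by the generator g = 2 (polynomial 0x11d)."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11d            # reduce modulo x^8 + x^4 + x^3 + x^2 + 1
    return b & 0xff

def pq_syndromes(data_fragments):
    """P is plain XOR parity; Q = sum of g^i * D_i over GF(2^8), via Horner's rule."""
    length = len(data_fragments[0])
    p, q = bytearray(length), bytearray(length)
    for frag in reversed(data_fragments):     # Horner: q = g*q + D_i
        for i in range(length):
            p[i] ^= frag[i]
            q[i] = gf_mul2(q[i]) ^ frag[i]
    return bytes(p), bytes(q)

# 4 data fragments + P + Q: any two of the six fragments can be lost and rebuilt.
data = [bytes([0x11, 0x22]), bytes([0x33, 0x44]),
        bytes([0x55, 0x66]), bytes([0x77, 0x88])]
p, q = pq_syndromes(data)
```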


Traditionally, the latter calculations were slower, which led to variations designed to avoid them, such as Diagonal Parity. See the original paper by Corbett et al., “Row-Diagonal Parity for Double Disk Failure Correction”. Today, however, the more complex calculations used by Reed-Solomon-based RAID-6 are no longer a problem. Modern CPU instruction sets (specifically SSSE3 and AVX2) can be leveraged in a way that makes these calculations almost as efficient as simple XOR operations. For a reference on this, see the paper by Plank et al., “Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions”.


In fact, we observed that performing Reed-Solomon calculations (Galois Field arithmetic) using AVX2 is *faster* than performing simple XOR calculations without using AVX2! When we leverage AVX2 for both XOR and Reed-Solomon, the difference in cost (CPU cycles) between the two is under 10%. vSAN implements RAID-5 and Reed-Solomon-based RAID-6. It leverages SSSE3, which is present in all CPUs supported by vSphere, and AVX2 (present in Intel Haswell or newer processors).


As of version 6.7, vSAN supports two specific Erasure Codes:

  • RAID-5 in 3+1 configuration, which means 3 data fragments and 1 parity fragment per stripe.
  • RAID-6 in 4+2 configuration, which means 4 data fragments, 1 parity and 1 additional syndrome per stripe.

Note that the vSAN cluster size needs to be at least 4 hosts and 6 hosts, respectively. Of course, it may be larger (much larger) than that. Without making any commitments, I should state that if valid customer use cases emerge that justify additional RAID-5/6 configurations (or perhaps even other erasure codes), the vSAN product team will consider those requirements. The vSAN code base is generic and may support other configurations, if needed.


Space Efficiency vs. Performance

I would also like to highlight the key features and trade-offs of Erasure Coding and how it compares to replication, from a customer’s point of view. Obviously, the main benefit of Erasure Codes is better space efficiency than Replication for the same level of data resilience. For example, when the goal is to tolerate one failure, the space overhead of a 3+1 RAID-5 configuration is 33% as opposed to 100% overhead with 2x replication (RAID-1). The overhead difference is even bigger between 4+2 RAID-6 (50%) and 3x replication (200%), when the goal is to tolerate up to 2 concurrent failures.
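These overhead figures follow directly from the layouts; here is a quick back-of-the-envelope check in Python (assumed formulas for illustration, not a vSAN sizing tool):

```python
# Back-of-the-envelope overhead check (assumed formulas, not a vSAN sizing tool).
# "Overhead" = extra raw capacity consumed, relative to the usable data size.

def erasure_overhead(data_frags: int, parity_frags: int) -> float:
    return parity_frags / data_frags

def replication_overhead(copies: int) -> float:
    return float(copies - 1)   # n copies store the data once plus n-1 extra times

print(f"RAID-5 3+1        : {erasure_overhead(3, 1):.0%}")   # 33%
print(f"RAID-1, FTT=1 (2x): {replication_overhead(2):.0%}")  # 100%
print(f"RAID-6 4+2        : {erasure_overhead(4, 2):.0%}")   # 50%
print(f"RAID-1, FTT=2 (3x): {replication_overhead(3):.0%}")  # 200%
```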

However, the space efficiency benefits come at the price of amplified I/O operations.


First, in the failure-free case, read performance is not affected. However, write operations are amplified, because the parity fragments need to be updated every time data is written. In the general case, a write operation is smaller than the size of a RAID stripe, so one way to handle such a partial-stripe write (sketched in the example after this list) is to:

  • read the part of the fragment that needs to be modified by the write operation;
  • read the relevant parts of the old parity/syndrome fragments to re-calculate their values (need both old and new values to do that);
  • combine the old values with the (new) data from the write operation to calculate the new parity/syndrome values;
  • write the new data;
  • write the new parity/syndrome value.
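Here is a minimal sketch of that read-modify-write parity update for the RAID-5 case (illustrative only, not vSAN code):

```python
# Minimal sketch of the RAID-5 read-modify-write parity update (illustrative only).
# new_parity = old_parity XOR old_data XOR new_data, applied byte-wise to the
# overwritten region.

def rmw_parity_update(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Return the updated parity bytes for a partial-stripe overwrite."""
    assert len(old_data) == len(new_data) == len(old_parity)
    return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

# Read old data + old parity (2 reads), then write new data + new parity (2 writes):
# the "2 reads and 2 writes" per logical write mentioned below for 3+1 RAID-5.
```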


With 3+1 RAID-5, for a typical logical write operation, one needs to perform 2 reads and 2 writes on storage. For 4+2 RAID-6, the numbers are 3 reads and 3 writes, respectively. When Erasure Codes are implemented over the network, as is the case with distributed storage products like vSAN, the amplification also means additional network traffic for write operations.

Moreover, in the presence of failures, while the storage is in “degraded” mode (some data and/or parity fragments missing), even read operations may result in I/O and network amplification. The reason? If the fragment of data the application needs to read is missing, it needs to be reconstructed from the surviving fragments. In other words, Erasure Coding does not come for free. It has a substantial overhead in operations per second (IOPS) and networking. For traditional storage systems that used magnetic disks (which deliver very few IOPS), large caches, often using battery-backed NVRAM, were a prerequisite for reasonable performance. And they often needed very large numbers of spindles, not necessarily for capacity, but to meet the requirements for IOPS.
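To illustrate that reconstruction (RAID-5 case, simplified sketch rather than vSAN code): rebuilding a missing fragment requires reading every surviving fragment of the stripe.

```python
# Illustrative sketch of a degraded-mode read in a 3+1 RAID-5 stripe; not vSAN code.

def reconstruct_missing(surviving_fragments):
    """XOR the surviving data fragments and the parity to rebuild the missing one."""
    missing = bytearray(len(surviving_fragments[0]))
    for frag in surviving_fragments:
        for i, b in enumerate(frag):
            missing[i] ^= b
    return bytes(missing)

# A single application read that lands on the lost fragment turns into 3 reads
# across the surviving hosts, plus the network traffic to ship those fragments.
```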


With Flash devices, RAID-5/6 is viable even with entirely commodity components. Flash devices offer a large number of (cheap) IOPS, so I/O amplification is less of a concern in that case. With vSAN and the data reduction features it offers (deduplication and compression), All-Flash clusters may even result in more cost-effective hardware configurations than Hybrid clusters (Flash and magnetic disks), depending on workload and data properties.


In conclusion, customers must evaluate their options based on their requirements and the use cases at hand. RAID-5/6 may be applicable for some workloads on All-Flash vSAN clusters, especially when capacity efficiency is the top priority. Replication may be the better option when performance (IOPS and latency) is the top priority. As always, there is no such thing as one size fits all. vSAN allows all those properties to be specified by policies, per VM or even per VMDK.

VMware offers design and sizing tools to help our customers determine the best hardware configurations and VM policies for their workload needs. But that’s a topic for another blog post.