One of the slides we showcased during the VMware Virtual SAN 6.1 Launch that got a lot of attention was the following slide:

[Slide: VSAN 9s]

A lot of eyebrows in the audience went up, with people wondering how we concluded that VSAN delivers a 6-9s availability level (less than 32 seconds of downtime a year). While Virtual SAN uses software-based RAID, which differs in implementation from traditional storage solutions, the end result is the same: your data objects are mirrored (RAID-1) for increased reliability and availability. Moreover, with VSAN your data is mirrored across hosts in the cluster, not just across storage devices as is the case with typical hardware RAID controllers.

VSAN users can set their goals for data availability by means of a policy that may be specified for each VM, or even for each VMDK if desired. The relevant policy is called ‘Failures to Tolerate’ (FTT) and refers to the number of concurrent host and/or disk failures a storage object can tolerate. For FTT=n, “n+1” copies of the object are created and “2n+1” hosts are required (to ensure availability even in split-brain situations).
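The sizing rule for FTT=n can be written out as a small sketch. The function and field names are illustrative and not part of any VSAN API, and the witness count of n is an assumption inferred from the FTT=1, FTT=2 and FTT=3 examples given in this post.

```python
def ftt_layout(ftt: int) -> dict:
    """Replica, witness, and minimum host counts for a given FTT value.

    Illustrative only: the witness count is inferred from the examples
    in this post, not taken from a VSAN specification.
    """
    if ftt < 0:
        raise ValueError("FTT must be non-negative")
    return {
        "replicas": ftt + 1,       # n + 1 copies of the object
        "witnesses": ftt,          # assumption: n witnesses, as in the post's examples
        "min_hosts": 2 * ftt + 1,  # 2n + 1 hosts, so a majority survives n failures
    }


for n in range(4):
    print(f"FTT={n}: {ftt_layout(n)}")
```

Note that replicas plus witnesses add up to the 2n+1 hosts, which is why a majority of components can still form quorum after n failures.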

For the end user, it is important to quantify the levels of availability achieved with different values of the FTT policy. With only one copy (FTT=0), the availability of the data equals the availability of the hardware the data resides on. Typically, that is in the range of 2-9s (99%) availability, i.e., 3.65 days of downtime per year. However, for higher values of FTT, more copies of the data are created across hosts, and that exponentially reduces the probability of data unavailability. With FTT=1 (2 replicas), data availability goes up to at least 4-9s (99.99%, or about 53 minutes of downtime per year), and with FTT=2 (3 replicas) it goes up to 6-9s (99.9999%, or 32 seconds of downtime per year). Put simply, for FTT=n, more than n hosts and/or devices have to fail concurrently for one’s data to become unavailable. Many people challenged us to show them how the math actually works to arrive at these conclusions. So let’s get to it.
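To make the "nines" figures concrete, here is a minimal sketch that converts an availability level into downtime per year. It assumes a 365-day year (8,760 hours); the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year, in hours, for a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


for label, availability in [("2-9s", 0.99), ("4-9s", 0.9999), ("6-9s", 0.999999)]:
    hours = downtime_hours_per_year(availability)
    print(f"{label}: {hours / 24:.2f} days = {hours * 60:.2f} minutes = {hours * 3600:.1f} seconds")
```

Each extra nine divides the yearly downtime by ten, so two more nines turn days into minutes, and two more after that turn minutes into seconds.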

Warning! Geeky Calculations ahead, so feel free to skip to the conclusions

First, let’s talk about how the industry defines the probability of a failure. Two prevailing metrics exist, and they can be used interchangeably:

  • AFR – Annualized Failure Rate
  • MTBF – Mean Time Between Failures

Depending on the manufacturer, you’ll see either of these metrics quoted. In case you are curious about the relationship between the two, it is:

AFR = 1/(MTBF/8760) * 100 (expressed in %)

Typical Enterprise HDDs and SSDs have reliability that ranges from AFR of 0.87% to 0.44%, which means 1,000,000 to 2,000,000 hours MTBF.

For the rest of the calculations I’ll use MTBF, but you can use the formula above to translate to AFR if you prefer.
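If you want to convert between the two metrics yourself, the formula above can be sketched as follows. The function names are illustrative, and a year is taken as 8,760 hours as in the formula.

```python
HOURS_PER_YEAR = 8760


def afr_percent_from_mtbf(mtbf_hours: float) -> float:
    """AFR (in %) from MTBF in hours: AFR = 1 / (MTBF / 8760) * 100."""
    return 1.0 / (mtbf_hours / HOURS_PER_YEAR) * 100.0


def mtbf_hours_from_afr(afr_percent: float) -> float:
    """Inverse of the formula above: MTBF in hours from AFR in %."""
    return HOURS_PER_YEAR / (afr_percent / 100.0)


for mtbf in (1_000_000, 2_000_000):
    print(f"MTBF {mtbf:,} h -> AFR {afr_percent_from_mtbf(mtbf):.3f}%")
```

The quoted 0.87% and 0.44% figures are these values (0.876% and 0.438%) rounded to two decimals.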

To see how we get to 5 or 6 nines, let’s use a concrete example for our hardware choices. One popular 10K HDD comes from Seagate (feel free to substitute your preferred vendor): model ST1200MM0088, with an AFR of 0.44% (see page 2 of the data sheet), or 2M hours MTBF. A popular SSD for caching, the Intel 3710, comes with the same MTBF of 2M hours (see page one of the product specification).

To determine data availability in the presence of the failure of such a device, we need to make an assumption on the time it takes to rebuild data after a failure (from backup or some other copy of the data). To be conservative we’ll use 24 hours. So the calculation of the availability of data residing on said HDD and SSD is:

2,000,000/ (2,000,000 + 24) = 0.99998

This means 4 nines (4x9s) for each of these devices. In reality, a failure scenario depends on more than a single drive. A rack, host, or controller failure needs to be taken into account and, in the case of Virtual SAN, so does the SSD used for caching in front of the HDD/SSD in the capacity tier. We asked one of our OEM partners to provide us with availability numbers for rack, host, and controller, and we got the following:

  • Rack: 0.99998
  • Host: 0.9998
  • Controller: 0.9996

So the combined probability of all the above components being available in a typical VSAN deployment with no copies made for protection (FTT=0) is:

0.99998 (Rack) * 0.9998 (Host) * 0.9996 (Controller) * 0.99998 (SSD Cache) * 0.99998 (SSD/HDD Capacity) = 0.9993

So, assuming a typical deployment uses enterprise-grade hardware that delivers the individual availabilities above, we get 3 nines (3x9s) even before taking the VSAN data protection policy into account, which is already decent. 3x9s is the equivalent of 8.76 hours of downtime per year: not ideal, but not bad either.

A caveat in the above calculation is that these probabilities may be optimistic (we depend on OEM data here), and that it assumes the component failures are independent of each other. So, to be conservative, let’s truncate each component’s availability to its whole number of 9s:

0.9999 (Rack) * 0.999 (Host) * 0.999 (Controller) * 0.9999 (SSD Cache) * 0.9999 (SSD/HDD Capacity) = 0.9977

Using these more conservative estimates takes us down a notch to two nines (2x9s). At the 99% floor of that range, that is as much as 3.65 days of downtime a year, something that most companies can’t tolerate, of course.
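The FTT=0 arithmetic so far can be reproduced with a short sketch. It assumes, as the text does, that component failures are independent, so the availabilities of components in series multiply. The function names are illustrative.

```python
from math import prod


def device_availability(mtbf_hours: float, rebuild_hours: float) -> float:
    """Availability of one device: MTBF / (MTBF + time to rebuild)."""
    return mtbf_hours / (mtbf_hours + rebuild_hours)


def series_availability(availabilities) -> float:
    """All components must be up at once, so multiply (independence assumed)."""
    return prod(availabilities)


drive = device_availability(2_000_000, 24)

# Order: rack, host, controller, SSD cache, SSD/HDD capacity
oem_figures = [0.99998, 0.9998, 0.9996, 0.99998, 0.99998]
conservative_figures = [0.9999, 0.999, 0.999, 0.9999, 0.9999]

print(f"single drive:        {drive:.6f}")
print(f"FTT=0, OEM figures:  {series_availability(oem_figures):.6f}")
print(f"FTT=0, conservative: {series_availability(conservative_figures):.6f}")
```

Because the product is dominated by the weakest links (host and controller here), improving the drives alone barely moves the combined figure.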

 

Now let’s see how Virtual SAN protection policies can help the situation. With FTT=1 we have two replicas and one witness (the witness is used to form quorum with one of the data replicas in split-brain situations in the cluster). That means that for the data to become unavailable, we need two of the data components to become unavailable. Using the unrounded result of our last equation (0.997702), the probability of such a scenario is:

(1-0.997702)^2 ≈ 0.00000528

So the data availability per object with FTT=1 is: 1-0.00000528 = 0.999994 – which is five nines (5x9s). To be fair a VM typically contains more than one object, so if we assumed a certain VM has 10 objects we’ll get:

0.999994^10 = 0.99994

In that case the VM availability drops to four nines (4x9s), or the equivalent of 52.56 minutes of downtime a year. So, depending on the number of objects per VM, you’ll get between 4x9s and 5x9s of availability with FTT=1.
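The FTT=1 numbers can be checked with a minimal sketch. It follows the simplification used in this post: replicas fail independently, an object is unavailable only when all of its replicas are down at once, and witnesses are not modeled. The function names and the 10-objects-per-VM figure are illustrative.

```python
from math import prod

# Conservative per-component availabilities used in this post
# (rack, host, controller, SSD cache, SSD/HDD capacity)
replica_availability = prod([0.9999, 0.999, 0.999, 0.9999, 0.9999])


def object_availability(replica_avail: float, ftt: int) -> float:
    """Object is unavailable only if all ftt + 1 replicas are down together."""
    return 1.0 - (1.0 - replica_avail) ** (ftt + 1)


def vm_availability(object_avail: float, num_objects: int) -> float:
    """A VM is available only if every one of its objects is available."""
    return object_avail ** num_objects


per_object = object_availability(replica_availability, ftt=1)
per_vm = vm_availability(per_object, num_objects=10)
print(f"FTT=1 per object: {per_object:.7f}")
print(f"FTT=1 per VM (10 objects): {per_vm:.7f}")
```

The per-VM figure falls by roughly one nine because ten objects give ten independent chances for something to be unavailable.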

 

Now let’s look at the same calculations with FTT=2, which means VSAN will create 3 replicas and use 2 witnesses per object. For the data to become unavailable in this case, we need 3 of these components to be unavailable. Using the unrounded component availability from the conservative estimate (0.997702), the probability of such a scenario is:

(1-0.997702)^3 ≈ 0.00000001214

So the data availability per object with FTT=2 is: 1-0.00000001214 = 0.999999988 – which is seven nines (7x9s). Using the same assumption above for a VM with 10 objects we’ll get:

0.999999988^10 = 0.999999879

Giving us the promised 6x9s availability per VM.
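To see the whole progression in one place, the sketch below tabulates per-object and per-VM availability for FTT=0 through FTT=2 and counts the nines. It uses the same simplifying assumptions as the text (independent failures, only replicas modeled, 10 objects per VM); the helper names are illustrative.

```python
import math
from math import prod

REPLICA_AVAILABILITY = prod([0.9999, 0.999, 0.999, 0.9999, 0.9999])  # conservative figures


def nines(availability: float) -> int:
    """Count of leading 9s, e.g. 0.9993 -> 3. Small epsilon guards float rounding."""
    return math.floor(-math.log10(1.0 - availability) + 1e-9)


def availability_table(replica_avail: float, objects_per_vm: int, max_ftt: int):
    """Return (ftt, object availability, VM availability) rows for FTT=0..max_ftt."""
    rows = []
    for ftt in range(max_ftt + 1):
        obj = 1.0 - (1.0 - replica_avail) ** (ftt + 1)
        vm = obj ** objects_per_vm
        rows.append((ftt, obj, vm))
    return rows


table = availability_table(REPLICA_AVAILABILITY, objects_per_vm=10, max_ftt=2)
for ftt, obj, vm in table:
    print(f"FTT={ftt}: object {obj:.9f} ({nines(obj)}x9s), VM {vm:.9f} ({nines(vm)}x9s)")
```

Each step up in FTT multiplies the per-object unavailability by the same small factor (about 0.0023 here), which is where the exponential improvement comes from.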

To conclude all this convoluted math, it’s easy to see how data availability increases exponentially with every increase in Virtual SAN’s data protection policy (FTT). While FTT=2 delivers 6x9s of availability, which should be enough for almost all workloads, Virtual SAN lets you set an even higher policy of FTT=3 on a per-VM basis (every VM can have a different protection policy), meaning 4 replicas with 3 witnesses. I won’t bother calculating the probability of data unavailability in that scenario, but at that point you probably have a better chance of winning the Powerball than of losing access to your data (in case you’re wondering, the Powerball odds are 0.000000003422, or the equivalent of 8-9s).
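For readers who do want the FTT=3 number, here is a rough sketch under the same assumptions as the earlier calculations (independent failures, conservative component figures, only replicas modeled). The Powerball probability is the one quoted in this post; the names are illustrative.

```python
from math import prod

# Unavailability of a single replica, from the conservative component figures
replica_unavailability = 1.0 - prod([0.9999, 0.999, 0.999, 0.9999, 0.9999])

POWERBALL_ODDS = 0.000000003422  # figure quoted in this post


def object_unavailability(replica_unavail: float, ftt: int) -> float:
    """All ftt + 1 replicas must be down at once (independence assumed)."""
    return replica_unavail ** (ftt + 1)


ftt3 = object_unavailability(replica_unavailability, ftt=3)
print(f"FTT=3 object unavailability: {ftt3:.3e}")
print(f"Powerball odds:              {POWERBALL_ODDS:.3e}")
print(f"Ratio (Powerball / FTT=3):   {POWERBALL_ODDS / ftt3:.0f}x")
```

Under these assumptions the per-object unavailability at FTT=3 comes out well below the quoted Powerball odds, which is consistent with the comparison in the text.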

I’d like to give full credit to our BU CTO, Christos Karamanolis, who walked me through the math above and educated me on the history of Virtual SAN’s data protection policies and how the engineering team thought through this important aspect of the product.