“Arithmetic is where the answer is right and everything is nice and you can look out of the window and see the blue sky – or the answer is wrong and you have to start over and try again and see how it comes out this time.” ~Carl Sandburg
When we architect SAP on VMware deployments, an important topic is how we design for high availability. The VMware environment offers several options: VMware HA, VMware FT, and in-guest clustering software such as Microsoft Cluster Services or Linux-HA. Can we determine a numerical availability for our design, expressed as a fraction or percentage (the same metric used to define uptime Service Level Agreements, such as 99.9%)? Yes, there are ways to estimate this value, and one method is explained in the following paper: http://www.availabilitydigest.com/public_articles/0712/sap_vmware.pdf . The paper develops an equation to estimate the availability of SAP running on an ESXi cluster. The concepts are taken from other papers at http://www.availabilitydigest.com (a digest of topics on high availability) and are based on algebra and probability theory that have previously been used in the IT industry for availability calculations. The availability metric (e.g. 99.9% or 0.999) is essentially a probability, so we use probability techniques to calculate the overall availability of a system.
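To make the "nines" concrete, here is a minimal sketch that converts an availability fraction into expected downtime per year. It is plain arithmetic and is not taken from the whitepaper; the function name and the 8,760-hour year (leap years ignored) are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 8760 hours, leap years ignored


def annual_downtime_hours(availability: float) -> float:
    """Expected downtime per year, in hours, for an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


for label, a in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    hours = annual_downtime_hours(a)
    print(f"{label}: {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

Three nines works out to a little under nine hours of downtime a year, and each extra nine divides that figure by ten.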
The final general equation calculates the availability of an “n” node ESXi cluster sized with “s” spares, i.e. an “n+s” cluster. It also factors in the software failover times of the single points of failure (SPOFs) in the SAP application architecture (database and Central Services). The failover time is the time taken for a SPOF to fail over and restart on another ESXi host or another virtual machine after an ESXi host failure. This period is important because it corresponds to downtime for the SAP system. The final equation gets a bit heavy on the algebra, but that’s because it models a generic use case. Once you replace the variables with practical “real-world” values, the equation gets easier, and that’s when the algebra stops and the spreadsheet takes over.
Let’s look at the following example with the following assumptions:

A five node ESXi cluster running SAP virtual machines, sized with one spare ESXi host i.e. it’s an “n+1” cluster – in the event of one ESXi host failure all impacted virtual machines failover to the remaining four ESXi hosts and all virtual machines continue to run with no loss of performance (the whitepaper covers this example in more detail).

A loss of two simultaneous ESXi hosts may result in serious performance degradation which we will classify conservatively as downtime for the whole cluster (not really true, but we have to start somewhere, see the whitepaper for caveats).

The probability of a failover fault is zero, i.e. if a VMware HA or in-guest cluster switchover event occurs, the impacted SAP SPOF fails over to the remaining ESXi hosts or to another virtual machine with no chance of error.

The availability of a single ESXi host is in the ballpark of 0.999 (i.e. “three nines”). This simplifies the algebra in the general equation (see whitepaper section 4.3.1).
If we apply the above to the general equation from the whitepaper, we get the following “simpler equation” specific to this use case.
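As a rough, independent check on the hardware side of these assumptions, the sketch below computes the probability that two or more of the five hosts are down at the same time. This is a textbook binomial calculation, not the whitepaper's equation. It assumes host failures are independent, and it leaves out failover time entirely; the function and variable names are made up for illustration.

```python
from math import comb


def prob_at_least_k_down(n: int, host_availability: float, k: int) -> float:
    """Probability that k or more of n independent hosts are down at once."""
    u = 1.0 - host_availability  # unavailability of one host
    return sum(
        comb(n, j) * u**j * host_availability ** (n - j) for j in range(k, n + 1)
    )


n_hosts = 5               # five-node cluster from the example
spares = 1                # "n+1" sizing
host_availability = 0.999  # "three nines" ballpark from the assumptions

# The cluster counts as down once more hosts have failed than there are spares.
outage = prob_at_least_k_down(n_hosts, host_availability, spares + 1)

# Leading term only: C(5, 2) * (1 - a)^2, a good approximation when a is near 1.
approx = comb(n_hosts, spares + 1) * (1.0 - host_availability) ** (spares + 1)

print(f"P(2+ hosts down), exact binomial: {outage:.3e}")
print(f"P(2+ hosts down), leading term:   {approx:.3e}")
print(f"Hardware-only cluster availability: {1.0 - outage:.6f}")
```

With hosts at three nines, the double-failure term is on the order of 1 in 100,000, which is why failover time matters so much to the final figure.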
We can use this equation, replacing the variables with practical values, to observe how availability is impacted in different scenarios. The values can be obtained from field experience, data and statistics gathered from actual implementations, reliability specifications from x86 server vendors, and proofs of concept or lab work evaluating failover times. The following example scenarios can then be analyzed:
 How does failover time impact the final availability?
 VMware HA adds some extra time for the OS to reboot compared to an active-passive clustering solution. How does this impact availability? VMware HA and a clustering solution will have different values for mean time to failover.
At this point we can build a spreadsheet to analyze different scenarios.
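As a stand-in for such a spreadsheet, here is a minimal sketch that compares scenarios in code. It is a simplified model and not the whitepaper's equation: it adds the leading double-failure term to a failover term of (mean time to failover) / (host MTBF) per host carrying a SPOF. The repair time, failover times, and all names below are hypothetical placeholders, not measured values.

```python
from math import comb


def cluster_availability(n, spares, host_avail, mttr_hours,
                         failover_minutes, spof_hosts=1):
    """Approximate availability of an n+s cluster (simplified model)."""
    u = 1.0 - host_avail
    # Term 1: more hosts down than spares (leading binomial term only).
    hardware = comb(n, spares + 1) * u ** (spares + 1)
    # Host MTBF implied by availability = MTBF / (MTBF + MTTR).
    mtbf_hours = host_avail * mttr_hours / u
    # Term 2: each failure of a host carrying a SPOF costs one failover interval.
    failover = spof_hosts * (failover_minutes / 60.0) / mtbf_hours
    return 1.0 - hardware - failover


# Hypothetical failover times in minutes; replace with lab or field data.
scenarios = [
    ("VMware HA (includes OS reboot)", 5.0),
    ("active-passive in-guest cluster", 1.0),
]

for host_avail in (0.999, 0.9999):
    for name, minutes in scenarios:
        a = cluster_availability(n=5, spares=1, host_avail=host_avail,
                                 mttr_hours=4.0, failover_minutes=minutes)
        print(f"host={host_avail}  {name:33s} availability={a:.6f}")
```

Under these placeholder values the failover term is comparable to or larger than the double-failure term, and it dominates more strongly as host availability improves.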
Note that this analysis considers only unplanned downtime due to ESXi host or hardware failure. Other parts of the infrastructure, such as network and storage, also affect the final availability experienced by the end user (see section 3 of the whitepaper). The analysis also does not consider downtime due to software corruption, bugs, or operational mistakes caused by human error. Finally, while the formula discussed here is SAP-specific, the mathematical model can be adjusted and applied to any ESXi cluster running business applications.
Thanks for sharing. Coming from a UNIX market, it’s important indeed that we share availability numbers. As we start taking on mission-critical workloads, we need to show the uptime calculation.
We need to have an example. In a 5-node scenario, what’s the uptime? I hope the number is >99.999%.
Another thing, I think this is not quite right –> “Most modern day servers are very resilient so let’s assume they have availability around three nines”. I thought it would be 99.99%, as most Tier 1 servers are reliable enough. It would be good to add a supporting fact on server availability to the excellent paper.
Again, thank you for a good paper. It just needs actual numbers so we can appreciate it.
Respectfully
e1
Hi Iwan
Thanks for the feedback. I agree I need some actual numbers. I plan to research and collect some data and follow up with some examples. You may be right about Tier 1 servers being at 99.99%, which is good as it further justifies the simplification in section 4.3.1 of the whitepaper. As servers become more resilient, the failover times start to have a bigger impact on final availability. That is probably already known intuitively, but an actual example with numbers will show us by how much.
Rgds
Vas
Any news with actual examples of data?
Actually, it is the availability of modern server hardware that achieves four 9s. However, hardware failures are just one mode of failure. Servers fail for other reasons, mainly software bugs and administrator errors. When these are included, several studies by various industry analysts (for instance, Gartner and the Standish Group) indicate that three 9s is a more accurate availability measure.
Pingback: Estimating Availability of SAP on ESXi Clusters – Examples - Virtualize Business Critical Applications - VMware Blogs
The microblog itself has an app version, which makes things easier for you.
I am interested in the expected (software) availability number for a single instance of ESXi (how many nines?). I have not been able to find this overtly stated anywhere. I ran into this article of yours above and your assertion:
“The availability of a single ESXi host is in the ball park of 0.999”
Where is this number derived from? Or is it an empirical number based on some sort of field data? Can you point me to where this is explained or presented?
Is there a number VMware stands by for availability of a single instance of ESXi?
Just the software availability of the ESXi – based on its own software faults, etc.
Thank you
Hi Elliott
The three nines stated here for ESXi hosts is a rough guess only, based on conversations with several contacts working in data centers. Unfortunately I could not find any paper/report that lists the nines of an ESXi host / x86 server so we would have to estimate it based on empirical data recorded in a data center.
Rgds
Vas