“Arithmetic is where the answer is right and everything is nice and you can look out of the window and see the blue sky – or the answer is wrong and you have to start over and try again and see how it comes out this time.” ~Carl Sandburg

When we architect SAP on VMware deployments, an important topic is how we design for high availability. The VMware environment offers several options: VMware HA, VMware FT, and in-guest clustering software such as Microsoft Cluster Services or Linux-HA. So can we determine a numerical availability for our design, expressed as a fraction or percentage (the same metric used to define uptime Service Level Agreements, such as 99.9%)? Yes, there are ways to estimate this value, and one method is explained in the following paper http://www.availabilitydigest.com/public_articles/0712/sap_vmware.pdf . This paper develops an equation to estimate the availability of SAP running on an ESXi cluster. The concepts are taken from other papers at http://www.availabilitydigest.com (a digest of topics on high availability) and are based on algebra and probability theory that have previously been used in the IT industry for availability calculations. The availability metric (e.g. 99.9% or 0.999) is essentially a probability, hence we use probability techniques to calculate the overall availability of a system.
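To make the metric concrete, here is a minimal sketch that converts an availability fraction into expected downtime per year. The function name and the 365-day year are assumptions made for illustration; they do not come from the whitepaper.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Compare "two nines", "three nines" and "four nines".
for a in (0.99, 0.999, 0.9999):
    hours = downtime_hours_per_year(a)
    print(f"{a:.2%} available -> {hours:.2f} hours of downtime per year")
```

For example, an SLA of 99.9% leaves a budget of roughly 8.76 hours of downtime per year.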

The final general equation calculates the availability of an “n” node ESXi cluster sized with “s” spares, i.e. an “n+s” cluster. It also factors in the software failover times of the single points of failure (SPOFs) in the SAP application architecture (database and Central Services). The failover time is the time taken for a SPOF to fail over and restart on another ESXi host or another virtual machine after an ESXi host failure. This period matters because it corresponds to downtime for the SAP system. The final equation gets a bit heavy on the algebra, but that’s because it models a generic use case. Once you replace the variables with practical “real-world” values, the equation gets easier, and that’s when the algebra stops and the spreadsheet work takes over.

Let’s look at the following example with the following assumptions:

  • A five-node ESXi cluster running SAP virtual machines, sized with one spare ESXi host, i.e. an “n+1” cluster. If one ESXi host fails, all impacted virtual machines fail over to the remaining four ESXi hosts and all virtual machines continue to run with no loss of performance (the whitepaper covers this example in more detail).
  • A simultaneous loss of two ESXi hosts may result in serious performance degradation, which we will conservatively classify as downtime for the whole cluster (not really true, but we have to start somewhere; see the whitepaper for caveats).
  • The probability of a failover fault is zero, i.e. if a VMware HA or in-guest cluster switchover event occurs, the impacted SAP SPOF fails over to the remaining ESXi hosts or another virtual machine with no chance of error.
  • The availability of a single ESXi host is in the ballpark of 0.999 (i.e. “three nines”), which simplifies the algebra in the general equation (see whitepaper section 4.3.1).
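As a rough sanity check on the two-host assumption, we can treat host failures as independent and compute the probability that two or more of the five hosts are down at the same time, with single-host availability a = 0.999. The independence assumption and the binomial form are part of this sketch only, not a formula quoted from the whitepaper:

```latex
\begin{aligned}
P(\text{two or more of five hosts down})
  &= 1 - a^{5} - 5a^{4}(1-a) \\
  &\approx \binom{5}{2}(1-a)^{2} = 10 \times (0.001)^{2} = 1 \times 10^{-5}
\end{aligned}
```

Under these assumptions the dual-failure case on its own accounts for an unavailability of about 0.001%, or roughly five minutes per year.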

If we substitute these assumptions into the general equation from the whitepaper, we get the following “simpler equation” specific to this use case.

 

We can use this equation, with practical values in place of the variables, to observe how availability is impacted in different scenarios. The values can be obtained from: field experience; data/statistics gathered from actual implementations; reliability specifications from x86 server vendors; proofs of concept and lab work evaluating failover times. The following example scenarios can then be analyzed:

  • How does failover time impact the final availability?
  • VMware HA adds some extra time for the OS to reboot compared to an active-passive clustering solution. How does this impact availability? VMware HA and a clustering solution will have different values for mean time to failover.

At this point we can build a spreadsheet to analyze different scenarios.
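The same kind of what-if analysis can be sketched in a few lines of Python instead of a spreadsheet. This is not the whitepaper's equation: it is a simplified stand-in that assumes independent host failures (a binomial term for two or more hosts down) plus a failover term of the form "hosts carrying a SPOF × mean time to failover ÷ host MTBF". All function names, and the MTBF, failover-time and SPOF-host values, are hypothetical and chosen only for illustration.

```python
from math import comb


def prob_at_least_k_down(n_hosts, host_availability, k=2):
    """Binomial probability that k or more of n independent hosts are down."""
    f = 1.0 - host_availability  # probability that a single host is down
    return sum(
        comb(n_hosts, i) * f**i * host_availability ** (n_hosts - i)
        for i in range(k, n_hosts + 1)
    )


def estimate_availability(n_hosts, host_availability, host_mtbf_hours,
                          failover_minutes, spof_hosts):
    """Simplified estimate for an n+1 cluster (assumed model, see lead-in)."""
    # Downtime from losing two or more hosts at once.
    multi_host = prob_at_least_k_down(n_hosts, host_availability, k=2)
    # Downtime from SPOF failovers: each failure of a host carrying a SPOF
    # costs one failover period.
    failover = spof_hosts * (failover_minutes / 60.0) / host_mtbf_hours
    return 1.0 - (multi_host + failover)


# Hypothetical inputs: 5 hosts at 0.999, host MTBF of 4000 hours,
# database and Central Services running on 2 different hosts.
scenarios = {
    "longer failover, e.g. restart including OS boot": 10.0,  # minutes, assumed
    "shorter failover, e.g. active-passive switchover": 2.0,  # minutes, assumed
}
for label, minutes in scenarios.items():
    a = estimate_availability(5, 0.999, 4000.0, minutes, 2)
    print(f"{label}: availability ~ {a:.6f}")
```

With these made-up inputs the failover term is larger than the dual-failure term, which is why the mean time to failover is worth measuring in a proof of concept before trusting any estimate.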

It should be noted that this analysis considers only unplanned downtime due to ESXi host or hardware failure. Other parts of the infrastructure, such as network and storage, also affect the availability experienced by the end user (see section 3 of the whitepaper). The analysis also does not consider downtime due to software corruption, bugs, or operational mistakes caused by human error. Finally, while the formula discussed here is SAP specific, the mathematical model can be applied to, and adjusted for, any ESXi cluster running business applications.