Category Archives: SAP

Disaster Recovery for Virtualized Business Critical Applications (Part 3 of 3)

Planned Migration:

One of the relatively newer use cases for SRM is planned migration. With this use case, customers can migrate their business critical workloads to the recovery or cloud provider sites in a planned manner. This could be in planning for an upcoming threat such as a hurricane or other disaster or an actual datacenter migration to a different location or cloud provider.


Disaster Recovery for Virtualized Business Critical Applications (Part 2 of 3)

Protection Groups:

A protection group is a group of virtual machines that fail over together to the recovery site. Protection groups contain virtual machines whose data has been replicated by array-based replication or by vSphere Replication (VR). A protection group typically contains virtual machines that are related in some way, such as:

  • A three-tier application (application server, database server, Web server)
  • Virtual machines whose virtual machine disk files are part of the same datastore group.


Disaster Recovery for Virtualized Business Critical Applications (Part 1 of 3)

The purpose of the exercise was to demonstrate use cases for disaster recovery of real business critical applications (BCA) leveraging VMware solutions such as VMware Site Recovery Manager (SRM). Techniques to protect against disaster for common business critical applications such as Microsoft Exchange, Microsoft SQL Server, SAP and Oracle Databases are discussed.


The Case for SAP Central Services and VMware Fault Tolerance

What’s the CPU Utilization Of Standalone SAP Central Services in a Virtual Machine?

Since VMware came out with VMware Fault Tolerance (FT) we have considered the deployment option of installing SAP Central Services in a 1 x vCPU virtual machine protected by VMware FT. FT creates a live shadow instance of a virtual machine that is always up-to-date with the primary virtual machine. In the event of a hardware outage, VMware FT automatically triggers failover, ensuring zero downtime and preventing data loss.

Central Services is a single point of failure in the SAP architecture. It manages transaction locking and messaging across the SAP system, and failure of this service results in downtime for the whole system. Hence Central Services is a strong candidate for FT. However, FT currently supports only 1 x vCPU (vSphere 5.x), so some guidance is required on how many users we can support in this configuration. VMware has given technical previews of multi-vCPU virtual machines protected by FT at VMworld 2013/2014. In the meantime, and better late than never, here are the results of a lab test demonstrating the performance of standalone Central Services in a 1 x vCPU virtual machine.

We configured the following setup.


We have two virtual machines: the first hosts an SAP dialog instance (which processes OLTP requests) together with the database instance; the second hosts Central Services. The objective was to focus on the performance of the standalone Central Services virtual machine while it serviced an ABAP based workload. For the ABAP stack, Central Services is officially referred to as ABAP SAP Central Services (ASCS). The goal of the test was to scale up to 1000 OLTP users running transactions in the SAP Sales and Distribution module. The following metrics were measured (via vCenter): the maximum CPU usage of the ASCS virtual machine during the workload run, and the maximum network usage of the ASCS virtual machine during the workload run. The following graph shows the results.


Summarizing the results:

  • At 1000 users the maximum CPU utilization of the ASCS virtual machine was 29%. This is a comfortable utilization to run in production. The number of SAP lock requests (referred to by SAP as enqueues and dequeues) was over 10000 per minute during one point in the run (eyeballed via SAP transaction SM12).
  • We can see a fairly linear relationship between CPU usage and the number of users.
  • Lock requests generate network traffic between the SAP dialog instance and the ASCS service – we see that the network usage is fairly linear as users are scaled up. Note: there are no lock requests between the SAP database and Central Services.

It is recommended to validate SAP workloads in a pre-production performance test before go-live. Customer tests/workloads will show different results to those shown above as the frequency of SAP lock requests depends on the transaction think times and customer specific business processes (the transaction think times used in this test were < 20 seconds).  Customer environments should also consider batch jobs running at the same time as OLTP activity that by design are generating a lot of locks e.g. mass update of business documents – this would create additional load on Central Services.

Critical Factors to consider when virtualizing Business Critical Applications: (Part 2 of 2)


Availability of the applications is one of the important requirements for business critical applications. The vSphere platform provides capabilities to protect against hardware and software failures  to meet the stringent SLA requirements of business critical applications.

Virtualized applications can take advantage of features such as vSphere HA, which protects against hardware failures by restarting virtual machines on surviving hosts if a physical host fails. vSphere FT (Fault Tolerance) protects critical servers with a small footprint (1 vCPU), such as load balancers and other central servers, with zero downtime in the event of hardware failures.

vSphere App HA provides application-level protection for supported applications. Third party solutions that leverage the vSphere Application Awareness API, such as Symantec Application HA and NEVERFAIL, layer on top of VMware HA and provide monitoring and availability for most of the common business critical applications.



Traditional availability clusters such as MSCS and Linux based clusters can be layered on top of vSphere HA, but they quite often create restrictions relating to vMotion and other functionality in VMware environments. Currently the only capability these clusters offer that the Application HA solutions available on VMware do not is reduced downtime during patch related activity that requires a system reboot. If downtime for patch maintenance is acceptable, traditional clusters can be avoided.


The ability to proactively manage business critical applications is a very important requirement. Virtualized business critical applications can leverage the vCenter Operations (VCOPS) suite of solutions to effectively manage their environment. VCOPS provides tactical (right now) and strategic (future focused) capabilities to monitor performance and manage the capacity of the environment proactively.

VCOPS provides tactical monitoring and strategic planning


VCOPS also provides the ability to perform root cause analysis on problems in the environment in a time efficient manner to reduce the impact on business critical applications.

VCOPS smart alerts



Changes made to the environment can be correlated to the health, while also helping to diagnose the impact of the change on an application or environment.

Correlate events with changes made to environment



All major x86 applications are supported on VMware and are actively being deployed by our customers.



Most application owners and DBAs virtualize Oracle or SAP for cloning. Creating application instances and refreshing data across environments such as production, development, test and QA is very time consuming and resource intensive.

With vSphere, any application can be templatized and easily cloned to be rolled out into these environments with minimal effort.  This agility provides application owners improved time to market of their solutions and increased productivity.

Easy cloning between production and other environments


Disaster Recovery:

One of the critical requirements for business critical applications is the need for disaster recovery with minimum possible downtime (Recovery Time Objective) and minimum loss of data (Recovery Point Objective).

Traditional disaster recovery for business critical applications with physical servers has many limitations such as the need for identical hardware in recovery site and downtime for production during testing. In a virtual environment, every virtual machine is encapsulated in a series of files that can be easily moved and instantiated in the recovery locations.

VMware Site Recovery Manager (SRM) provides a strong workflow engine that delivers automation through centralized recovery plans, automated failover and failback, and planned migration capabilities. SRM provides the ability to perform non-disruptive testing and leverages vSphere Replication and array-based replication to replicate the data to the disaster recovery site.

The majority of customers virtualizing business critical applications leverage SRM for their disaster recovery needs. SRM drastically reduces the RPO and RTO and helps them meet stringent business availability requirements.

SRM provides automated disaster recovery capabilities



The critical concerns of business critical application owners are adequately addressed by VMware’s suite of products and solutions. There is no reason not to virtualize your business critical applications. The trends shown below clearly attest to these capabilities.

Strong growth in Adoption of BCA virtualization


Part 1 of this series can be found at “Critical Factors to consider when virtualizing Business Critical Applications: (Part 1 of 2)”.


Impact of database licensing on Cluster Design for Production SAP virtualization

Type of Database licensing for SAP:

The type of licensing impacts the cluster design for SAP virtualization. SAP is supported on most common database platforms such as SQL Server, Oracle, DB2 and SYBASE. When customers procure SAP, they can choose to buy the database licensing through SAP or purchase it directly from the database vendor. This decision impacts the cluster design for virtualized SAP environments.

Let us look at these two scenarios and their impact on the design.

Scenario 1: Database License procured from the DB vendor for SAP:

Database vendors have differing but usually very restrictive policies regarding virtual machines running databases. In the extreme case, licensing could force a customer to license the entire cluster, even though the database could be using only a small subset of the resources. Due to the uncertainty and risk involved with DB licensing in this situation, it might be prudent to separate the entire database workload into its own cluster. By separating the entire database workload, the physical hardware used for databases can be isolated and licensed fully. Since only database workloads exist in this cluster, one can achieve consolidation and efficiency for databases. The main disadvantage is the added overhead of having a separate cluster for databases. Since SAP landscapes have many modules, each with its own individual database, creating a separate DB cluster with a good number of hosts is worthwhile and justified.

Dedicated Database Cluster for SAP









When there are no restrictions with licensing,  the typical cluster design methodology in vSphere environments espouses having an N+2 cluster.  An N+2 cluster would provide headroom for doing maintenance (One host at a time) and high availability (One host failure). These additional hosts can be costly for the database cluster due to the need to license all hosts.  In this situation the applications run in their own cluster, which typically is N+2.

SAP Applications in their own cluster

Dedicated APP Cluster for SAP










Most database vendors allow for an unlicensed DB host if the only purpose of that host is to function as a standby in the event of a failure. There are many conditions that need to be met, such as the number of actual days per year that the standby takes over. vSphere clusters have a setting called dedicated failover host, which can be leveraged in database clusters to match the requirements of standby hosts. One can potentially meet these conditions for the standby node by running all workloads on licensed nodes in normal circumstances, with the dedicated failover node used only during an actual failure or maintenance.

Scenario 2:  Database licensed through SAP along with SAP software:

When databases are licensed through SAP, there is no impact of database placement on licensing. This is akin to “named user” based licensing. There is a lot more flexibility to locate the database servers anywhere in the environment. Customers typically collocate the database servers along with the application servers for proximity and ease of management. The commonly used N+2 clusters can be leveraged in this scenario to allow for HA capacity even during maintenance.  All nodes can be fully utilized to run workloads.

SAP Applications and Databases in the same cluster











Cluster design in SAP environments can be impacted by the type of database licensing. Creating a dedicated database cluster in certain situations can help meet many of the stringent licensing requirements, while still providing for consolidation and optimized utilization.



Critical Factors to consider when virtualizing Business Critical Applications: (Part 1 of 2)

Over the past few years, there has been significant acceleration in adoption of the VMware platform for virtualization of business critical applications. When vSphere 5 was introduced with its initial support for up to 32 vCPUs, many of the vertical scalability concerns that existed earlier were addressed. This has been increased to 64 vCPUs with the later vSphere 5.x releases, ensuring that more than 99% of all workloads will fit vertically.

Having personally worked in IT infrastructure for more than 20 years with a strong focus on implementing and managing business critical applications, I see a general reluctance from application owners to virtualize business critical applications. When virtualizing business applications there are many critical factors one should consider.  I seek to address the typical concerns of application owners about Virtualization with this multipart series on Virtualizing BCA.

CIOs and IT operations want to virtualize more because of the following advantages of virtualization:

  1. Infrastructure efficiency
  2. Simpler management
  3. Built-in availability
  4. Greater agility
  5. Simplified DR

But application owners are usually reluctant because of questions such as:

  1. Will my Application perform well? (Performance)
  2. Will it Scale? (Scalability)
  3. Can I meet my application SLAs? (Availability )
  4. Can I manage it effectively? (Manageability)
  5. Will my ISV support me? (Supportability)
  6. What’s in it for me? Will my application run better? (Agility & Time to Market)


Virtual performance is within about 5-6% of physical for most business critical application workloads. The benefits of virtualization and productivity improvements overshadow the small overhead it introduces.

SAP Performance within 6% of Native:

SAP Performance on vSphere vs Physical


Exchange Server Performance:

The Exchange virtualization study discussed in Running Microsoft Apps on FlexPod for VMware shows only a 5% performance difference between virtual and physical.

Exchange performance on vSphere



Database Performance:

SQL Server and Oracle performance is close to native, as seen in the test results below.

SQL Throughput on vSphere vs Native

Oracle RAC performance vs Native on vSphere


The performance of common business critical applications has been shown to be very close to that of their physical counterparts, so performance should not be a concern when virtualizing them. During virtualization there is usually a hardware refresh to the latest hardware with superior performance to existing systems. The small overhead of virtualization is easily offset by moving to this newer hardware as part of the virtualization migration.


With the vSphere 5.x platform, almost all workloads are amenable to virtualization. VMware capabilities such as vNUMA allow NUMA aware operating systems and applications to run optimally even in virtual environments.

Scalability of vSphere


Workloads can scale up dynamically as demand increases, with hot add capabilities for CPU, memory and storage resources as and when needed. One can scale up or scale out based on the application requirements.

“Scale up” versus “Scale out”


These hot add capabilities are available only in virtualized environments and help right size environments and grow them dynamically, when needed without user downtime and loss of productivity. The following graphic shows the effect of increasing CPU on an Oracle DB server by 2 vCPU demonstrating the dynamic scale up capabilities of the vSphere platform.

Oracle Hot Add CPU Illustration



Part 2 of this series can be found at “Critical Factors to consider when virtualizing Business Critical Applications: (Part 2 of 2)”.



SAP on VMware Sizing & Design Example

Recently in partner workshops I have come across some interesting discussions about the impact of hyper-threading and NUMA in sizing business critical applications on VMware. So here is an SAP example based on SAP’s sizing metric “SAPS” (a hardware-independent unit of measurement that equates to SAP OLTP throughput of Sales and Distribution users). The examples here refer to vSphere scheduling concepts in this useful whitepaper, The CPU Scheduler in VMware vSphere 5.1.

SAP sizing requires the SAPS rating of the hardware which for estimation purposes can be obtained from certified SAP benchmarks published at . Let’s use certification 2011027  and assume that we plan to deploy on similar hardware as used in this benchmark. This is a virtual benchmark on vSphere 5 with the following result: 25120 SAPS (at ~100% CPU) for 24 vCPUs running on a server with 2 processors, 6 cores per processor and 24 logical CPUs as hyper-threading was enabled. This is a NUMA system where each processor is referred to as a NUMA node.  (Note cert 2011027 is an older benchmark, the SAPS values for vSphere on newer servers with faster processors would be different/higher, hence work with the server vendors to utilize the most recent and accurate SAPS ratings).

In this example I will design for application server virtual machines which, as they scale out horizontally, give us the flexibility of choosing the number of vCPUs per virtual machine. Now do we go with # of vCPUs = # of cores or # of vCPUs = # of logical CPUs? Let’s show an example for both. I will consider the following:

  • SAP sizing is typically conducted at 60-70% CPU, and normal practice is to scale down the benchmark SAPS results. I will not bother with this and will go with the 25120 SAPS at 100% CPU.
  • Size within the NUMA boundaries. In this two processor NUMA system example, there are two NUMA nodes each with one processor and memory. The access to memory in the same node is local; the access to the other node is remote. The remote access takes more cycles because it involves a multi-hop operation, so keeping the memory access local improves performance.
  • For a 6 core NUMA node the virtual machine vCPU size should be a divisor (or multiple) of 6, giving us 1, 2, 3 or 6 way VMs (see this VMware blog).
  • I assume workloads in all the virtual machines peak at the same time.
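The NUMA sizing rule in the list above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical and is not part of any VMware tool.

```python
def numa_friendly_vcpu_sizes(cores_per_node: int) -> list[int]:
    """Return the vCPU counts that divide evenly into one NUMA node."""
    return [n for n in range(1, cores_per_node + 1) if cores_per_node % n == 0]


# 6 cores per NUMA node, as in the benchmark server used in this example
print(numa_friendly_vcpu_sizes(6))
```

For the 6 core node in this example the function returns the 1, 2, 3 and 6 way sizes listed above.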

Let’s first show a design with # of vCPUs = # of cores i.e. no vCPU over-commit.

Example 1: # of vCPUs = # of cores, 2-way and 6-way app servers


With all the virtual machines under load simultaneously, the ESXi scheduler by default, with no specific tuning, will: allocate a home NUMA node for the memory of each virtual machine; schedule the vCPUs of each virtual machine on its home node, thus maintaining local memory access; and schedule each vCPU on a dedicated core to allow exclusive access to core resources. (Note that in physical environments such NUMA optimizations would require OS commands to localize the processing, e.g. the Linux command “numactl”.)

However, the above configuration does not give us 25120 SAPS, because not all the logical CPUs are being utilized as they were in the benchmark. The hyper-threading performance boost for an SAP OLTP workload is about 24% (based on tests by VMware performance engineering – see this blog), so for # of vCPUs = # of cores we should theoretically drive about 25120/1.24 = 20258 SAPS. We can also estimate about 20258/12 = 1688 SAPS per vCPU, so the 2-way virtual machine is rated at 1688 x 2 = 3376 SAPS and the 6-way at 1688 x 6 = 10128 SAPS (at 100% CPU in this example).

Are we “wasting” SAPS by not utilizing all the logical CPUs? Technically yes, but for practical purposes this is not a major issue because:

  • We have some CPU headroom which can be claimed back later after go-live when the virtual machines can be rebalanced based on the actual workload.  At this point vCPU over-commit may be possible as virtual machine workloads may not peak at the same time.
  • Hyper-threading benefit is dependent on the specific workload and while the 24% hyper-threading boost is based on an OLTP workload profile, the actual workload may be less impacted by hyper-threading for example:
    • CPU intensive online reporting
    • CPU intensive custom programs
    • CPU intensive batch jobs
  • SAP has created another metric in their sizing methodology referred to as SCU – Single Computing Unit of performance. SAP has categorized different workloads/modules based on their ability to take advantage of hyper-threading. So some workloads may experience a hyper-threading benefit lower than 24%.
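The Example 1 arithmetic can be reproduced with a short Python sketch. The constants come from the text (benchmark certification 2011027 and the roughly 24% hyper-threading boost); the function and variable names are my own, and rounding to whole SAPS is an assumption that matches the figures quoted above.

```python
BENCHMARK_SAPS = 25120   # cert 2011027, ~100% CPU, hyper-threading enabled
PHYSICAL_CORES = 12      # 2 processors x 6 cores
HT_BOOST = 0.24          # OLTP hyper-threading gain quoted in the text


def saps_without_ht(benchmark_saps: float, ht_boost: float) -> int:
    """Estimate total SAPS when only one vCPU runs per physical core."""
    return round(benchmark_saps / (1 + ht_boost))


def saps_per_vcpu(total_saps: float, vcpus: int) -> int:
    """Spread the total SAPS evenly over the vCPUs."""
    return round(total_saps / vcpus)


total = saps_without_ht(BENCHMARK_SAPS, HT_BOOST)
per_vcpu = saps_per_vcpu(total, PHYSICAL_CORES)

print(f"total SAPS without hyper-threading: {total}")
print(f"SAPS per vCPU: {per_vcpu}")
for size in (2, 6):
    print(f"{size}-way VM: {per_vcpu * size} SAPS")
```

Swapping in a more recent benchmark result or a workload-specific hyper-threading boost only requires changing the constants.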

Now what if we need to drive the maximum possible SAPS from a single server – this is when we would need to configure # of vCPUs = # of logical CPUs. The following configuration can achieve the maximum possible performance.

Example 2: # of vCPUs = # of logical CPUs



In the above design the virtual machine level parameter “numa.vcpu.preferHT” needs to be set to true to override the default ESXi scheduling behavior. By default, ESXi schedules a virtual machine across NUMA nodes when the number of vCPUs for that virtual machine is greater than the number of cores in the NUMA node. This results in vCPUs of a virtual machine being scheduled on a remote node relative to its memory location. This is avoided in the above example, and performance is maximized, because:

  • ESXi schedules all vCPUs of each virtual machine on the same NUMA node that contains the memory of the virtual machine, thus avoiding the penalty of any remote memory access.
  • All logical CPUs are being used, thus leveraging the hyper-threading benefit (note that vCPUs are sharing core resources, so the SAPS per vCPU in this case is 25120/24 = 1047 at 100% CPU).

This configuration is commonly used in the following situations: running a benchmark to achieve as much performance as possible (as was done for the app server virtual machines in the 3-tier vSphere SAP benchmark certification 2011044); and conducting physical versus virtual performance comparisons.

For practical purposes, designing for # of vCPUs = # of logical CPUs may not be so critical. If we were to design for a 12-way app server (example 2 above), and the actual workload was less than planned with lower CPU utilization, we would have plenty of vCPUs without the added gain from hyper-threading. There are no hard rules, so if desired, during the sizing phase, you can start with # of vCPUs = # of cores or # of vCPUs = # of threads, based on which approach you think best fits your needs.
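As a rough sketch of Example 2, the Python below models the placement rule described above and recomputes the SAPS per vCPU when all logical CPUs are used. The `fits_in_one_numa_node` helper is a deliberately simplified, hypothetical model of the behavior described in this post, not the actual ESXi scheduler logic.

```python
CORES_PER_NODE = 6
THREADS_PER_CORE = 2     # hyper-threading enabled
BENCHMARK_SAPS = 25120   # cert 2011027
LOGICAL_CPUS = 24


def fits_in_one_numa_node(vcpus: int, prefer_ht: bool) -> bool:
    """Simplified model: with preferHT the node capacity counts logical CPUs,
    otherwise it counts physical cores only."""
    capacity = CORES_PER_NODE * (THREADS_PER_CORE if prefer_ht else 1)
    return vcpus <= capacity


def saps_per_vcpu_all_threads() -> int:
    """SAPS per vCPU when every logical CPU hosts a vCPU."""
    return round(BENCHMARK_SAPS / LOGICAL_CPUS)


print(fits_in_one_numa_node(12, prefer_ht=False))  # spans both nodes by default
print(fits_in_one_numa_node(12, prefer_ht=True))   # stays on its home node
print(saps_per_vcpu_all_threads())
```

The per-vCPU figure is lower than in Example 1 because two vCPUs share each core, even though the server as a whole delivers more SAPS.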

Summarizing, I have shown two sizing examples for SAP application server virtual machines in a hyper-threaded environment. In both cases sizing virtual machines within NUMA nodes helps with performance. The SAPS values shown here are based on a specific older benchmark certification and would be different for modern day servers and more recent benchmarks.

Finally a thank-you to Todd Muirhead (VMware performance engineering), for his reviews and inputs.

Estimating Availability of SAP on ESXi Clusters – Examples

This is a follow up to the blog I posted in Jan 2013 which identified a generic formula to estimate the availability, expressed as a percentage/fraction, of SAP virtual machines in an ESXi cluster. The details of the formula are in this whitepaper. This blog provides some example results based on some assumed input data. I used a spreadsheet to model the equation and generate the results – this is shown at the end. The formula is based on mathematical probability techniques. The availability of SAP on an ESXi cluster is dependent on: the probability of failure of multiple ESXi hosts based on the number of spares; and the probability that the SPOFs (database & central services) are failing over due to a VMware HA event (which depends on failover times and the frequency of ESXi host failures).

The example starts with a single 4-node ESXi cluster running multiple SAP database, application server and central services virtual machines (VMs) corresponding to different SAP applications (ERP, BW, CRM etc.). A sizing engagement has determined that 4 ESXi hosts are required to drive the performance of all the SAP VMs (the SAP landscape). We assume the sizing is such that the memory of all the VMs will not fit into the physical memory of three or fewer hosts, and as we typically have memory reservations set (a best practice for mission critical SAP), VMs may not restart after a VMware HA event. So we conservatively treat any host failures that leave fewer than 4 ESXi hosts as downtime for the SAP landscape (this is not true at the individual VM/SAP system level, as some of the VMs can be de-prioritized in the degraded state in favor of others, but we are going with the landscape level approach to provide a worst case estimate). For this reason we design with redundancy by adding extra ESXi hosts to the cluster, so I will compare three options with different degrees of redundancy:

Option 1:  4 node ESXi cluster with no spares i.e. “4+0”   (loss of 1 or more hosts is considered downtime. With this assumption a VMware HA event is moot, so failover times are not considered. As there are no spares, the availability for this scenario = a x a x a x a, where a = availability of a single ESXi host. For the remaining options I use the formula in the spreadsheet below)

Option 2:  5 node ESXi cluster with 1 spare ESXi host i.e. “4+1”   (loss of 2 or more hosts is considered downtime)

Option 3:  6 node ESXi cluster with 2 spare ESXi hosts i.e.  “4+2” (loss of 3 or more hosts is considered downtime)

Following input data is required for the formula:

  • Mean time to failover (via VMware HA) Central Services VM in case of ESXi host failure is 1-2 minutes (source: lab tests). I will use 2 minutes.
  • Mean time to failover database VM in case of ESXi host failure is 5 minutes. Source: POC from a customer who presented their case study at a SAP tradeshow. Includes time for database to start and perform a recovery, latter is dependent on the workload at the time of failure.
  • mtbf – mean time between failures of a single ESXi host (this is the failure rate due to h/w or VMware hypervisor failure)
  • mttr – mean time to repair a failed ESXi host or replace it with another ESXi host in order to get the ESXi cluster back up to full strength.
  • Another term you will come across is mean time to failure (mttf). Note that mtbf = mttf + mttr.
  • From mtbf and mttr we can calculate the availability of a single ESXi host. The definition is availability = (mtbf-mttr)/mtbf (see the whitepaper for details. Note this type of analysis is not new; similar content can be found here and here).
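A minimal Python sketch of the definitions above and of Option 1 (the “4+0” cluster) follows. The input values of 180 days and 4 hours are illustrative assumptions only, and the function names are hypothetical. The “4+1” and “4+2” options need the full whitepaper formula, including failover times, which is not reproduced here.

```python
HOURS_PER_DAY = 24


def host_availability(mtbf_days: float, mttr_hours: float) -> float:
    """availability = (mtbf - mttr) / mtbf, with both terms in hours."""
    mtbf_hours = mtbf_days * HOURS_PER_DAY
    return (mtbf_hours - mttr_hours) / mtbf_hours


def cluster_availability_no_spares(host_avail: float, hosts: int) -> float:
    """With no spares every host must be up, so the availabilities multiply."""
    return host_avail ** hosts


a = host_availability(mtbf_days=180, mttr_hours=4)
print(f"single ESXi host: {a:.6f}")
print(f"4+0 cluster:      {cluster_availability_no_spares(a, 4):.6f}")
```

Because mtbf = mttf + mttr, the same host availability can also be written as mttf/mtbf.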

The following diagram shows the relationship between mtbf, mttr and mttf.

Unfortunately there are no industry standard values for mttr and mtbf for an x86 server running a hypervisor. mtbf depends on the hardware and on the frequency of firmware and hypervisor related issues; the latter in turn is impacted by patch management procedures. So mtbf may vary between different environments.

So how can we estimate these metrics? As SAP is typically virtualized after other non-SAP applications, you can gather operational statistics from existing production or non-production ESXi clusters to get estimates for mtbf and mttr. For mtbf, we would need to determine how often VMware HA events have occurred in any existing operational ESXi clusters. A few informal enquiries I have made show a frequency of failure of around 1-2 times a year for an ESXi host, and in some cases over a year without incident. For mttr, are there any SLAs in place (for example server vendor services contracts), or can IT operations estimate the time in which they can repair or replace a faulty ESXi host in a production SAP cluster? As SAP business processes are mission critical, such an SLA or understanding may be in place or required.

Hence I will show results for a range of mtbf and mttr. The results are shown in the following table for mtbf = 90, 180 and 360 days and mttr = 2, 4, 8 and 24 hrs (I have approximated 1 year as 360 days).

You can read the above table as per the following example:

  • Experience in the datacenter has indicated about 2 failures per ESXi host a year, so we will assume about 180 days for mtbf. So we are interested in the results in the “180 days” section of the above table.
  • Datacenter Operations have procedures in place to restore a failed ESXi host within 4 hrs, so mttr = 4 hrs.
  • Hence the availability estimate is 99.9964% for a “4+1” cluster and 99.9973% for a “4+2” cluster.
  • If any of the input data differs from above or for other sized clusters, recalculate using the spreadsheet /formula (see below).

Some conclusions:

  • Adding more redundancy (two spare ESXi hosts versus one) increases availability and makes availability less sensitive to mttr which makes sense i.e. with more redundancy there is less time pressure to get a failed ESXi host back online. However there is an extra cost with this redundancy which can be mitigated by using the redundant ESXi hosts to run less important virtual machines that have a reduced SLA and can be taken offline if a single ESXi host fails. VMware resource shares can be configured to make sure these less important VMs do not interfere with the production SAP landscape.
  • Reliable servers with redundant components and good patch management policies can help to increase mtbf which increases availability.
  • Having procedures in place to lower the mttr increases availability, for example replacing a failed ESXi host with another from a non-production cluster or some standby pool may be faster than repairing the failed ESXi host.

Note the following about this analysis:

  • Only considers unplanned downtime due to ESXi host failures. Storage, application software/OS and network failures are NOT considered here. These other parts of the overall architecture have their own availability so the final availability (as experienced by the end-user) is the product of the availabilities of each sub-component as they effectively operate in series.
  • Formula assumes single instance database not active-active like RAC and no VMware FT for Central Services – these scenarios effectively reduce the failover times driving availability higher.
  • The availability estimate is for the whole SAP landscape – if you consider the individual VM /SAP system the availability can be higher, for example if we are down to three ESXi hosts, priority could be given to ERP over BW so ERP continues to perform as per normal. In this case ERP VMs would have higher availability than the value calculated for the landscape.
  • Architecture here assumes database and app tier deployed in the same cluster. Often these layers are isolated into two separate ESXi clusters (e.g. for database licensing and/or management purposes). In this case two separate availabilities need to be calculated for the two clusters and the overall availability is the product of the two.
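The “product of availabilities” rule mentioned in the notes above can be sketched as follows. The component values are hypothetical placeholders chosen purely for illustration; they are not measurements or results from this analysis.

```python
from math import prod


def series_availability(availabilities) -> float:
    """Components operating in series: the overall availability is the
    product of the individual availabilities."""
    return prod(availabilities)


# Hypothetical figures, for illustration only
components = {
    "ESXi cluster": 0.9999,
    "storage": 0.9999,
    "network": 0.9999,
}

overall = series_availability(components.values())
print(f"end-to-end availability: {overall:.6f}")
```

The same function covers the split-cluster case: pass the availability of the database cluster and of the application cluster as the two inputs.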

Why Bother with All This?

While this mathematical approach (using probability) is an established method to estimate availability, it does require assumptions, and some input values may have to be estimated when data is limited. This is where empirical data from actual implementations will help with accuracy. So the goals of this availability analysis are:

  • A starting point to generate an estimate during a sizing engagement or during the design phase of a deployment. If actual data is not available, assume some “worst case” values for mttr and mtbf (e.g. 8 hrs and 90 days) to generate a baseline estimate.
  • Enable quantitative analysis of different scenarios like:
    • How is availability impacted by extra spares in the ESXi cluster? If the business cost of downtime is known (currency per unit time), then we could determine if the cost of redundant ESXi hosts is justified.
    • If we need 10 ESXi hosts would it be better for availability if we had one 10-node cluster or two 5-node clusters?
    • How do failover times impact the final availability?
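As a rough illustration of the baseline estimate mentioned above, the sketch below turns the "worst case" mttr and mtbf values (8 hrs and 90 days) into a single-host availability using the standard steady-state relation availability = mtbf / (mtbf + mttr). Applying that relation here is our assumption, and the function and variable names are hypothetical; the whitepaper's full cluster formula is not reproduced.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * HOURS_PER_DAY


def host_availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one host: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# "Worst case" inputs quoted in the text: mtbf = 90 days, mttr = 8 hours.
mtbf = 90 * HOURS_PER_DAY
mttr = 8

a_host = host_availability(mtbf, mttr)
downtime_hours_per_year = (1 - a_host) * HOURS_PER_YEAR

print(f"single-host availability: {a_host:.5f}")
print(f"expected single-host downtime: {downtime_hours_per_year:.1f} hours/year")
```

Varying mtbf and mttr in a sketch like this shows how sensitive the single-host figure is to each input before any cluster redundancy is taken into account.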

Appendix – Create Formula in Excel

The generic availability formula is in the whitepaper SAP on VMware High Availability Analysis – A Mathematical Approach. This formula can be created in Excel as shown below (in this case I am ignoring the chance of any failover faults).

The heaviest part of the formula is in cell G46 (see above). This part calculates the probability that all the spares and one extra ESXi host (s+1 hosts, where s = number of spares) fail simultaneously, resulting in downtime for the cluster. It is based on counting the unique combinations of s+1 nodes in the n-node cluster, as described in the whitepaper, but it should be noted that this comes from standard combination theory, for example see this wiki page.
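The combination counting described above can be sketched outside of Excel as follows. This is not the whitepaper's formula or the contents of cell G46; it is a minimal binomial model that assumes hosts fail independently and share the same availability, and the function names are hypothetical.

```python
from math import comb


def prob_exactly_k_down(total_hosts, k, host_availability):
    """Binomial probability that exactly k of total_hosts are down at once."""
    u = 1 - host_availability  # unavailability of a single host
    return comb(total_hosts, k) * u**k * host_availability ** (total_hosts - k)


def prob_cluster_down(total_hosts, spares, host_availability):
    """Probability that more hosts are down than the spares can absorb."""
    return sum(
        prob_exactly_k_down(total_hosts, k, host_availability)
        for k in range(spares + 1, total_hosts + 1)
    )


# Five hosts in total, sized with one spare, each host at "three nines".
total_hosts, spares, a = 5, 1, 0.999

pairs = comb(total_hosts, spares + 1)  # distinct sets of s+1 hosts: 10
p_down = prob_cluster_down(total_hosts, spares, a)

print(f"combinations of {spares + 1} hosts out of {total_hosts}: {pairs}")
print(f"probability of losing more hosts than spares: {p_down:.3e}")
```

With host availability this close to 1, the sum is dominated by its first term (exactly s+1 hosts down), which is the case the combination count addresses.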







Formula to Determine Availability of SAP running on an ESXi Cluster

“Arithmetic is where the answer is right and everything is nice and you can look out of the window and see the blue sky – or the answer is wrong and you have to start over and try again and see how it comes out this time.” ~Carl Sandburg

When we architect SAP on VMware deployments, an important topic is how we design for high availability. The VMware environment gives us several options: VMware HA, VMware FT and in-guest clustering software like Microsoft Cluster Services or Linux-HA. So can we determine a numerical availability for our design, expressed as a fraction/percentage (the same metric used to define uptime Service Level Agreements, like 99.9%)? Yes, there are ways to estimate this value, and one method is explained in the accompanying whitepaper. The paper develops an equation to estimate the availability of SAP running on an ESXi cluster. The concepts are taken from other papers (collected in a digest of topics on high availability) and are based on algebra and probability theory that have previously been used in the IT industry for availability calculations. The availability metric (e.g. 99.9% or 0.999) is essentially a probability, hence we use probability techniques to calculate the overall availability of a system.

The final general equation calculates the availability of an “n” node ESXi cluster sized with “s” spares, i.e. an “n+s” cluster. It also factors in the software failover times of the single points of failure (SPOF) in the SAP application architecture (database and Central Services). The failover time is the time taken for the SPOF to fail over and restart on another ESXi host or virtual machine after an ESXi host failure. This period is important as it corresponds to downtime for the SAP system. The final equation gets a bit heavy on the algebra, but that’s because it models a generic use case. Once you replace the variables with practical “real-world” values, the equation gets easier, and that’s when the algebra stops and spreadsheeting takes over.

Let’s look at an example with the following assumptions:

  • A five-node ESXi cluster running SAP virtual machines, sized with one spare ESXi host, i.e. it’s an “n+1” cluster. In the event of one ESXi host failure, all impacted virtual machines fail over to the remaining four ESXi hosts and all virtual machines continue to run with no loss of performance (the whitepaper covers this example in more detail).
  • A simultaneous loss of two ESXi hosts may result in serious performance degradation, which we will classify conservatively as downtime for the whole cluster (not really true, but we have to start somewhere; see the whitepaper for caveats).
  • The probability of a failover fault is zero, i.e. if a VMware HA or in-guest cluster switchover event occurs, the impacted SAP SPOF fails over to the remaining ESXi hosts or another virtual machine with no chance of error.
  • The availability of a single ESXi host is in the ballpark of 0.999 (i.e. “three nines”). This simplifies the algebra in the general equation (see whitepaper section 4.3.1).

If we apply the above to the general equation from the whitepaper we get the following “simpler equation” specific to this use case.


We can use this equation, replacing the variables with practical values, to observe how availability is impacted in different scenarios. The values can be obtained from: field experience; data/statistics gathered from actual implementations; reliability specifications from x86 server vendors; proof-of-concepts / lab work evaluating failover times. The following example scenarios can then be analyzed:

  • How does failover time impact the final availability?
  • VMware HA adds some extra time for the OS to reboot compared to an active-passive clustering solution. How does this impact availability? VMware HA and a clustering solution will have different values for mean time to failover.

At this point we can build a spreadsheet to analyze different scenarios.
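As a starting point for such a spreadsheet, the failover-time question can be sketched with a deliberately simplified model. This is not the whitepaper's equation: it assumes every ESXi host failure hits a SPOF (a conservative simplification) and that host failures are rare enough to add up linearly. The failover times and the mtbf value below are hypothetical inputs, not measured figures.

```python
HOURS_PER_YEAR = 365 * 24


def failover_unavailability(hosts, mtbf_hours, failover_minutes):
    """Fraction of time lost to SPOF failovers.

    Assumes each host fails on average once per mtbf_hours and that every
    host failure triggers one SPOF failover (conservative simplification).
    """
    failures_per_hour = hosts / mtbf_hours
    return failures_per_hour * (failover_minutes / 60)


# Hypothetical mean times to failover, in minutes.
scenarios = {
    "VMware HA (includes OS reboot)": 10,
    "active-passive in-guest cluster": 3,
}

results = {}
for name, minutes in scenarios.items():
    u = failover_unavailability(hosts=5, mtbf_hours=90 * 24, failover_minutes=minutes)
    results[name] = u
    print(f"{name}: availability {1 - u:.6f}, "
          f"about {u * HOURS_PER_YEAR:.2f} hours of failover downtime per year")
```

Under these assumptions the downtime contribution scales linearly with the mean time to failover, so the gap between the two scenarios narrows as host mtbf improves.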

Note that this analysis only considers unplanned downtime due to ESXi host/hardware failure. Other parts of the infrastructure, such as network and storage, also affect the final availability as experienced by the end-user (see section 3 of the whitepaper). Nor does it consider downtime due to software corruption, bugs or operational mistakes caused by human error. Finally, while the formula discussed here is SAP-specific, the mathematical model can be adjusted and applied to any ESXi cluster running business applications.