
Just Published – Virtualizing Active Directory Domain Services On VMware vSphere®

Announcing the latest addition to our series of prescriptive guidance for virtualizing Business Critical Applications on the VMware vSphere platform.

Microsoft Windows Active Directory Domain Services (AD DS) is one of the most pervasive directory services platforms in the market today. Because of the importance of AD DS to the operation and availability of other critical services, applications and processes, the stability and availability of AD DS itself is critical to most organizations.

Although the “Virtualization First” concept is becoming a widely accepted operational practice in the enterprise, many IT shops are still reluctant to completely virtualize domain controllers. The most conservative organizations have an absolute aversion to domain controller virtualization, while moderately conservative organizations choose to virtualize a portion of the AD DS environment and retain a portion on physical hardware. Empirical data indicates that this opposition to domain controller virtualization stems from a combination of historical artifacts, misinformation, lack of virtualization experience, and fear of the unknown.

The new Guide – Virtualizing Active Directory Domain Services On VMware vSphere® – is intended to help the reader overcome this inhibition by clearly addressing both the legacy artifacts and the new advancements in Microsoft Windows that help ameliorate the past deficiencies and make AD DS more virtualization-friendly – and safer.

With the release of Windows Server 2012, new features alleviate many of the legitimate concerns that administrators have about virtualizing AD DS. These new features, the latest versions of VMware® vSphere®, and recommended practices help achieve 100 percent virtualization of AD DS.

The Guide includes a number of best practice guidelines for administrators to help them optimally and safely deploy AD DS on VMware vSphere.

The recommendations in this guide are not specific to a particular set of hardware or to the size and scope of a specific AD DS implementation. The examples and considerations in this document provide guidance, but do not represent strict design requirements.

Similar to other “Best Practices” releases from VMware, this Guide is intended to serve as your companion and primary reference if you have any responsibility for planning, designing, implementing, or operating a virtualized Active Directory Domain Services instance in a VMware vSphere infrastructure.

This guide assumes a basic knowledge and understanding of vSphere and AD DS.
Architectural staff can use this document to understand the design considerations for deploying a virtualized AD DS environment on vSphere.
Engineers and administrators can use this document as a catalog of technical capabilities.
Management staff and process owners can use this document to help model business processes that take advantage of the savings and operational efficiencies achieved with virtualization.

You can download Virtualizing Active Directory Domain Services On VMware vSphere® here.


Which vSphere Operation Impacts Windows VM-Generation ID?

In Windows Server 2012 VM-Generation ID Support in vSphere, we introduced you to VMware’s support for Microsoft’s new Windows VM-Generation ID feature, discussing how it helps address some of the challenges facing Active Directory administrators looking to virtualize domain controllers.

One of the common requests from customers in response to the referenced article is a list of events and conditions under which an administrator can expect the VM-Generation ID of a virtual machine to change in a VMware vSphere infrastructure. The table below presents this list. This table will be included in an upcoming Active Directory on VMware vSphere Best Practices Guide.

Scenario                                                                                  VM-Generation ID Change
VMware vSphere vMotion®/VMware vSphere Storage vMotion                                    No
Virtual machine pause/resume                                                              No
Virtual machine reboot                                                                    No
HA restart                                                                                No
FT failover                                                                               No
vSphere host reboot                                                                       No
Import virtual machine                                                                    Yes
Cold clone                                                                                Yes
Hot clone (see note below)                                                                Yes
New virtual machine from a virtual disk (VMDK) copy                                       Yes
Cold snapshot revert (snapshot taken while powered off, or while running without memory)  Yes
Hot snapshot revert (snapshot taken while powered on with a memory snapshot)              Yes
Restore from virtual machine level backup                                                 Yes
Virtual machine replication (both host-based and array-level)                             Yes

Note: Hot cloning of virtual domain controllers is not supported by either Microsoft or VMware. Do not attempt hot cloning under any circumstances.
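If you want to verify the behavior for a particular operation yourself, a simple approach is to read the generation ID from the virtual machine’s configuration before and after the operation. The sketch below is a hedged example: it assumes the ID is surfaced through the vm.genid/vm.genidX VMX entries, and the VM name is a placeholder.

    # Read the VM-Generation ID counters for a domain controller VM via PowerCLI
    Get-AdvancedSetting -Entity (Get-VM -Name "dc01") -Name "vm.genid*" |
        Select-Object Name, Value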

If you have a specific operation or task that is not included in the table above, please be sure to ask in the comments section.

Thank you.

Critical Factors to consider when virtualizing Business Critical Applications: (Part 2 of 2)

Availability:

Availability is one of the most important requirements for business critical applications. The vSphere platform provides capabilities to protect against hardware and software failures and meet the stringent SLA requirements of business critical applications.

Virtualized applications can take advantage of features such as vSphere HA, which protects against hardware failures by restarting virtual machines on surviving hosts if a physical host fails, and vSphere FT (Fault Tolerance), which protects critical small-footprint (1 vCPU) servers such as load balancers and other central servers with zero downtime in the event of a hardware failure.
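As a minimal PowerCLI sketch of turning on the cluster-level protection (the cluster name is a placeholder, and FT is enabled per VM so it is not shown here):

    # Enable vSphere HA with admission control on the cluster hosting critical VMs
    Set-Cluster -Cluster "BCA-Cluster" -HAEnabled:$true -HAAdmissionControlEnabled:$true -Confirm:$false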

vSphere App HA provides application-level protection for supported applications. Third-party solutions that leverage the vSphere Application Awareness API, such as Symantec ApplicationHA and Neverfail, layer on top of vSphere HA and provide monitoring and availability for most common business critical applications.


Traditional availability clusters such as MSCS and Linux-based clusters can be layered on top of vSphere HA, but they quite often create restrictions relating to vMotion and other functionality in VMware environments. Currently, the only use case these solutions address that is not covered by the Application HA solutions available on VMware is reduced downtime during patch-related activity requiring a system reboot. If downtime can be tolerated for patch maintenance, traditional clusters can be avoided.

Manageability:

The ability to proactively manage business critical applications is a very important requirement. Virtualized business critical applications can leverage the vCenter Operations (vCOps) suite of solutions to manage their environment effectively. vCOps provides tactical (right now) and strategic (future-focused) capabilities to monitor performance and manage the capacity of the environment proactively.


vCOps provides tactical monitoring and strategic planning

vCOps also provides the ability to perform root cause analysis on problems in the environment in a time-efficient manner, reducing the impact on business critical applications.

 


vCOps smart alerts

Changes made to the environment can be correlated to its health, which also helps to diagnose the impact of a change on an application or environment.


Correlate events with changes made to the environment

Supportability:

All major x86 applications are supported on VMware and are actively being deployed by our customers.


Agility:

For many application owners and DBAs, cloning is a key reason to virtualize Oracle or SAP. Creating application instances and refreshing data across environments such as production, development, test and QA is very time consuming and resource intensive.

With vSphere, any application can be templatized and easily cloned to be rolled out into these environments with minimal effort. This agility gives application owners improved time to market for their solutions and increased productivity (a short PowerCLI sketch follows the figure below).


Easy cloning between production and other environments
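As a simple sketch of that workflow in PowerCLI (all names here are hypothetical placeholders), rolling a new instance out of a golden template is a one-liner:

    # Clone a new QA app server from a pre-built template
    New-VM -Name "erp-app-qa01" -Template (Get-Template "erp-app-gold") `
           -VMHost (Get-VMHost "esx01.example.com") -Datastore (Get-Datastore "qa-ds01")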

Disaster Recovery:

One of the critical requirements for business critical applications is the need for disaster recovery with minimum possible downtime (Recovery Time Objective) and minimum loss of data (Recovery Point Objective).

Traditional disaster recovery for business critical applications on physical servers has many limitations, such as the need for identical hardware at the recovery site and production downtime during testing. In a virtual environment, every virtual machine is encapsulated in a series of files that can be easily moved and instantiated at the recovery location.

VMware Site Recovery Manager (SRM) includes a strong workflow engine that provides automation through centralized recovery plans, automated failover and failback, and planned migration capabilities. SRM enables non-disruptive testing and leverages vSphere and array-based replication to replicate data to the disaster recovery site.

The majority of customers virtualizing business critical applications leverage SRM for their disaster recovery needs. SRM drastically reduces RPO and RTO and helps them meet stringent business availability requirements.


SRM provides automated disaster recovery capabilities

Conclusion:

The critical concerns of business critical application owners are adequately addressed by VMware’s suite of products and solutions. There is no reason not to virtualize your business critical applications. The trends shown below clearly attest to these capabilities.


Strong growth in adoption of BCA virtualization

Part 1 of this series can be found at “Critical Factors to consider when virtualizing Business Critical Applications: (Part 1 of 2)”.

 

Impact of database licensing on Cluster Design for Production SAP virtualization

Type of Database licensing for SAP:

The type of database licensing impacts the cluster design for SAP virtualization. SAP is supported on the most common database platforms, such as SQL Server, Oracle, DB2 and Sybase. When customers procure SAP, they can choose to buy the database licensing through SAP or purchase it directly from the database vendor. This decision impacts the cluster design for virtualized SAP environments.

Let us look at these two scenarios and their impact on the design.

Scenario 1: Database License procured from the DB vendor for SAP:

Database vendors have differing, but usually very restrictive, policies regarding virtual machines running databases. In the extreme case, the cost of licensing databases could force a customer to license the entire cluster, even though the database might use only a small subset of its resources. Due to the uncertainty and risk involved with DB licensing in this situation, it might be prudent to separate the entire database workload into its own cluster. By separating the entire database workload, the physical hardware used for databases can be isolated and licensed fully. Since only database workloads exist in this cluster, one can achieve consolidation and efficiency for databases. The main disadvantage is the added overhead of having a separate cluster for databases. Since SAP landscapes have many modules, with each module having its own individual database, creating a separate DB cluster with a good number of hosts is worthwhile and justified.


Dedicated Database Cluster for SAP

When there are no restrictions with licensing, the typical cluster design methodology in vSphere environments espouses having an N+2 cluster. An N+2 cluster provides headroom for maintenance (one host at a time) and high availability (one host failure). These additional hosts can be costly for the database cluster due to the need to license all hosts. In this situation the applications run in their own cluster, which typically is N+2.


SAP Applications in their own cluster

Most database vendors allow for a non-licensed DB host if the only purpose of that host is to function as a standby in the event of a failure. There are conditions that need to be met, such as the number of actual days per year the standby takes over, among other requirements. vSphere clusters have a setting called dedicated failover host, which can be leveraged in database clusters to match the requirements of standby hosts. One can potentially meet these conditions for the standby node by running all workloads under normal circumstances on licensed nodes, with the dedicated failover node used only during actual failure or maintenance.
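As an illustrative sketch of that setting (names are placeholders; this goes through the vSphere API from PowerCLI, since there is no dedicated cmdlet parameter for the failover host policy):

    $cluster = Get-Cluster -Name "SAP-DB-Cluster"
    $standby = Get-VMHost -Name "esx-db-standby.example.com"

    $spec = New-Object VMware.Vim.ClusterConfigSpecEx
    $spec.DasConfig = New-Object VMware.Vim.ClusterDasConfigInfo
    $policy = New-Object VMware.Vim.ClusterFailoverHostAdmissionControlPolicy
    $policy.FailoverHosts = @($standby.ExtensionData.MoRef)   # the dedicated failover host(s)
    $spec.DasConfig.AdmissionControlPolicy = $policy

    # Apply the change (second argument = modify the existing configuration)
    $cluster.ExtensionData.ReconfigureComputeResource_Task($spec, $true) | Out-Null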

Scenario 2: Database licensed through SAP along with SAP software:

When databases are licensed through SAP, database placement has no impact on licensing. This is akin to “named user” based licensing. There is a lot more flexibility to locate the database servers anywhere in the environment. Customers typically collocate the database servers with the application servers for proximity and ease of management. The commonly used N+2 clusters can be leveraged in this scenario to allow for HA capacity even during maintenance. All nodes can be fully utilized to run workloads.


SAP Applications and Databases in the same cluster

Conclusion:

Cluster design in SAP environments can be impacted by the type of database licensing. Creating a dedicated database cluster for certain situations can help meet stringent licensing requirements while still providing consolidation and optimized utilization.

 

Critical Factors to consider when virtualizing Business Critical Applications: (Part 1 of 2)

Over the past few years, there has been significant acceleration in adoption of the VMware platform for virtualization of business critical applications. When vSphere 5 was introduced with its initial support for up to 32 vCPUs per virtual machine, many of the vertical scalability concerns that existed earlier were addressed. This has been increased to 64 vCPUs in the later vSphere 5.x releases, ensuring that more than 99% of all workloads fit vertically.

Having personally worked in IT infrastructure for more than 20 years with a strong focus on implementing and managing business critical applications, I see a general reluctance from application owners to virtualize business critical applications. When virtualizing business applications there are many critical factors one should consider. I seek to address the typical concerns of application owners about virtualization with this multipart series on virtualizing BCA.

CIOs and IT operations want to virtualize more because of the following advantages of virtualization:

  1. Infrastructure efficiency
  2. Simpler management
  3. Built-in availability
  4. Greater agility
  5. Simplified DR

But application owners are usually reluctant, asking:

  1. Will my Application perform well? (Performance)
  2. Will it Scale? (Scalability)
  3. Can I meet my application SLAs? (Availability)
  4. Can I manage it effectively? (Manageability)
  5. Will my ISV support me? (Supportability)
  6. What’s in it for me? Will my application run better? (Agility & Time to Market)

Performance:

Virtual performance is typically within 5-6% of physical for most business critical application workloads. The benefits of virtualization and the productivity improvements overshadow the small overhead it introduces.

SAP Performance within 6% of Native:

SAP Performance on vSphere vs Physical

Exchange Server Performance:

The Exchange virtualization study discussed in Running Microsoft Apps on FlexPod for VMware shows only a 5% performance difference between virtual and physical.

Exchange performance on vSphere

 

Database Performance:

SQL Server and Oracle performance is close to native, as seen in the test results below.

SQL Throughput on vSphere vs Native

Oracle RAC performance vs Native on vSphere

It is proven that the performance of common business critical applications is very close to that of their physical counterparts, and performance should not be a concern when virtualizing them. Virtualization is usually accompanied by a hardware refresh to the latest hardware, with performance superior to the existing systems. The small overhead of virtualization is easily offset by moving to this newer hardware as part of the virtualization migration.

Scalability:

With the vSphere 5.x platform, almost all workloads are amenable to virtualization. VMware capabilities such as vNUMA allow NUMA-aware operating systems and applications to run optimally even in virtual environments.


Scalability of vSphere

Workloads can scale up dynamically as demand increases, with hot add capabilities for CPU, memory and storage resources as and when needed. One can scale up or scale out based on the application requirements.

"Scale up" versus "Scale out"

“Scale up” versus “Scale out”

These hot add capabilities are available in virtualized environments and help right-size environments and grow them dynamically when needed, without user downtime or loss of productivity. The following graphic shows the effect of adding 2 vCPUs to an Oracle DB server, demonstrating the dynamic scale-up capability of the vSphere platform; a matching PowerCLI sketch follows the illustration.


Oracle Hot Add CPU Illustration
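A minimal PowerCLI sketch of that hot add operation (the VM name is a placeholder; CPU hot add must be enabled while the VM is powered off, and the guest OS must support it):

    $vm = Get-VM -Name "oracle-db01"
    # One-time prerequisite, set while the VM is powered off:
    New-AdvancedSetting -Entity $vm -Name "vcpu.hotadd" -Value "TRUE" -Confirm:$false
    # Later, with the VM running, grow it by 2 vCPUs as in the illustration above:
    Set-VM -VM $vm -NumCpu ($vm.NumCpu + 2) -Confirm:$false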

Part 2 of this series can be found at “Critical Factors to consider when virtualizing Business Critical Applications: (Part 2 of 2)”.

 

 

SAP on VMware Sizing & Design Example

Recently in partner workshops I have come across some interesting discussions about the impact of hyper-threading and NUMA on sizing business critical applications on VMware. So here is an SAP example based on SAP’s sizing metric “SAPS” (a hardware-independent unit of measurement that equates to SAP OLTP throughput of Sales and Distribution users). The examples here refer to vSphere scheduling concepts in the useful whitepaper The CPU Scheduler in VMware vSphere 5.1.

SAP sizing requires the SAPS rating of the hardware, which for estimation purposes can be obtained from certified SAP benchmarks published at http://www.sap.com/solutions/benchmark/sd2tier.epx. Let’s use certification 2011027 and assume that we plan to deploy on hardware similar to that used in this benchmark. This is a virtual benchmark on vSphere 5 with the following result: 25120 SAPS (at ~100% CPU) for 24 vCPUs running on a server with 2 processors, 6 cores per processor and 24 logical CPUs, as hyper-threading was enabled. This is a NUMA system where each processor is referred to as a NUMA node. (Note that cert 2011027 is an older benchmark; the SAPS values for vSphere on newer servers with faster processors would be different/higher, hence work with the server vendors to utilize the most recent and accurate SAPS ratings.)

In this example I will design for application server virtual machines, which, as they scale out horizontally, give us the flexibility of choosing the number of vCPUs per virtual machine. Now, do we go with # of vCPUs = # of cores, or # of vCPUs = # of logical CPUs? Let’s show an example of both. I will consider the following:

  • SAP sizing is typically conducted at 60-70% CPU, and normal practice is to scale down the benchmark SAPS results; I will not bother with this and will go with the 25120 SAPS at 100% CPU.
  • Size within the NUMA boundaries. In this two processor NUMA system example, there are two NUMA nodes each with one processor and memory. The access to memory in the same node is local; the access to the other node is remote. The remote access takes more cycles because it involves a multi-hop operation so keeping the memory access local improves performance.
  • For a 6-core NUMA node, the virtual machine vCPU size should be a divisor (or multiple) of 6, giving us 1-, 2-, 3- or 6-way VMs (see this VMware blog).
  • I assume workloads in all the virtual machines peak at the same time.

Let’s first show a design with # of vCPUs = # of cores, i.e. no vCPU over-commit.

Example 1: # of vCPUs = # of cores, 2-way and 6-way app servers

 

With all the virtual machines under load simultaneously, the ESXi scheduler by default, with no specific tuning, will: allocate a home NUMA node for the memory of each virtual machine; schedule the vCPUs of each virtual machine on its home node, thus maintaining local memory access; and schedule each vCPU on a dedicated core to allow exclusive access to core resources. (Note that in physical environments such NUMA optimizations would require OS commands to localize the processing, e.g. the Linux command “numactl”.)

However, the above configuration does not give us 25120 SAPS, as not all the logical CPUs are being utilized as was the case in the benchmark. The hyper-threading performance boost for an SAP OLTP workload is about 24% (based on tests by VMware performance engineering – see this blog), so for # of vCPUs = # of cores we should theoretically drive about 25120/1.24 = 20258 SAPS. We can also estimate about 20258/12 = 1688 SAPS per vCPU, so the 2-way virtual machine is rated at 1688 x 2 = 3376 SAPS and the 6-way at 1688 x 6 = 10128 SAPS (at 100% CPU in this example; this arithmetic is also sketched as a quick calculation after the list below). Are we “wasting” SAPS by not utilizing all the logical CPUs? Technically yes, but for practical purposes it is not a major issue because:

  • We have some CPU headroom which can be claimed back later after go-live, when the virtual machines can be rebalanced based on the actual workload. At this point vCPU over-commit may be possible, as virtual machine workloads may not peak at the same time.
  • Hyper-threading benefit is dependent on the specific workload and while the 24% hyper-threading boost is based on an OLTP workload profile, the actual workload may be less impacted by hyper-threading for example:
    • CPU intensive online reporting
    • CPU intensive custom programs
    • CPU intensive batch jobs
    • SAP has created another metric in their sizing methodology referred to as SCU – Single Computing Unit of performance.  SAP has categorized different workloads/modules based on their ability to take advantage of hyper-threading. So some workloads may experience a hyper-threading benefit lower than 24%.
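Here is the Example 1 arithmetic above as a quick PowerShell calculation (the 24% hyper-threading boost is the assumption discussed above):

    $sapsBenchmark = 25120                                    # SAPS at ~100% CPU, all 24 logical CPUs in use
    $htBoost       = 1.24                                     # assumed hyper-threading gain for SAP OLTP
    $sapsCoresOnly = [math]::Round($sapsBenchmark / $htBoost) # ~20258 SAPS with 12 vCPUs on 12 cores
    $sapsPerVcpu   = [math]::Round($sapsCoresOnly / 12)       # ~1688 SAPS per vCPU
    "2-way VM: {0} SAPS" -f (2 * $sapsPerVcpu)                # ~3376 SAPS
    "6-way VM: {0} SAPS" -f (6 * $sapsPerVcpu)                # ~10128 SAPS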

Now, what if we need to drive the maximum possible SAPS from a single server? This is when we would configure # of vCPUs = # of logical CPUs. The following configuration can achieve the maximum possible performance.

Example 2: # of vCPUs = # of logical CPUs

 

 

In the above design the virtual machine level parameter “numa.vcpu.preferHT” needs to be set to true to override the default ESXi scheduling behavior, where ESXi schedules a virtual machine across NUMA nodes when its number of vCPUs is greater than the number of cores in the NUMA node. That default results in vCPUs of a virtual machine being scheduled on a remote node relative to its memory location. This is avoided in the above example and performance is maximized because: ESXi schedules all vCPUs of each virtual machine on the same NUMA node that contains the memory of the virtual machine, thus avoiding the penalty of any remote memory access; and all logical CPUs are being used, thus leveraging the hyper-threading benefit (note vCPUs are sharing core resources, so the SAPS per vCPU in this case is 25120/24 = 1047 at 100% CPU).

This configuration is commonly used in the following situations: running a benchmark to achieve as much performance as possible (as was done for the app server virtual machines in the 3-tier vSphere SAP benchmark certification 2011044); and conducting physical versus virtual performance comparisons. For practical purposes, designing for # of vCPUs = # of logical CPUs may not be so critical. If we were to design for a 12-way app server (Example 2 above) and the actual workload was less than planned with lower CPU utilization, we would have plenty of vCPUs even without the added gain from hyper-threading. There are no hard rules, so if desired, during the sizing phase you can start with # of vCPUs = # of cores or # of vCPUs = # of logical CPUs based on which approach best fits your needs.
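For reference, a minimal PowerCLI sketch of setting that parameter (the VM name is a placeholder; the setting takes effect at the next power-on):

    $vm = Get-VM -Name "sap-app01"     # hypothetical 12-vCPU app server VM
    New-AdvancedSetting -Entity $vm -Name "numa.vcpu.preferHT" -Value "TRUE" -Confirm:$false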

Summarizing, I have shown two sizing examples for SAP application server virtual machines in a hyper-threaded environment. In both cases, sizing virtual machines within NUMA nodes helps with performance. The SAPS values shown here are based on a specific older benchmark certification and would be different for modern servers and more recent benchmarks.

Finally, a thank-you to Todd Muirhead (VMware performance engineering) for his reviews and inputs.

Setting Multi Writer Flag for Oracle RAC on vSphere without any downtime

An Oracle RAC cluster using ASM for storage requires the shared disks to be accessible by all nodes of the RAC cluster.

The multi-writer option in vSphere allows VMFS-backed disks to be shared by multiple VMs simultaneously. By default, multi-writer “protection” is enabled for all .vmdk files, i.e. each VM has exclusive access to its .vmdk files. So in order for all of the Oracle RAC VMs to access the shared VMDKs, the multi-writer protection needs to be disabled.

KB article 1034165 provides more details on how to set the multi-writer option manually to allow VMs to share VMDKs.

The above method requires the VMs to be powered off before the multi-writer option can be set in the .vmx configuration file. This means that the Oracle RAC instance would have to be shut down and the VM completely powered off before the option can be set, leading to an instance outage.

Oracle DBAs would prefer a method where they could add shared disks on the fly to a running Oracle RAC cluster without any downtime.

Some time back, while working for VMware PSO (I have since moved to the Global Center of Excellence group), a colleague and I wrote a PowerCLI script called “addSharedDisk”, which adds a new shared disk to the existing Oracle RAC nodes, sets the multi-writer flag, marks the disks as independent-persistent, and formats them eager-zeroed thick.

This script can be used to add shared disks on the fly while the Oracle RAC cluster is up and running, thus avoiding any outage.

Disclaimer:
- This is a generic script and can be customized to your requirements
- This script is provided for educational purposes only; use it at your own risk
- No support will be provided for this script
- All values provided in this script are examples only; you will need to provide real values

This script accepts at the command prompt:
- The number of shared disks to be created (maximum 55)
- The number of nodes that the shared disk should be added to

Currently:
- The script places all shared VMDK files in the same datastore as the boot disk VMDK files. This can be customized to place shared VMDKs on different datastores for I/O load balancing.
- The script can be further customized to place VMDKs on different SCSI controllers.
- The script has logic to check the network requirements of the Oracle RAC interconnect adapters (the script assumes 2 private adapters are deployed per VM); this can be removed or commented out if not required.

The script can be found at https://communities.vmware.com/docs/DOC-24873.
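For readers who want a feel for the mechanics before downloading, here is a heavily simplified fragment loosely modeled on what the post describes. All names, the SCSI address, and the hot extraConfig reconfigure are illustrative assumptions; the real script at the link above handles controller placement, node counts, and validation.

    $node1 = Get-VM -Name "rac-node1"
    $node2 = Get-VM -Name "rac-node2"

    # Hot-add a new eager-zeroed, independent-persistent disk to the first node
    $disk = New-HardDisk -VM $node1 -CapacityGB 100 -StorageFormat EagerZeroedThick `
                         -Persistence IndependentPersistent

    # Attach the same VMDK to the second node
    New-HardDisk -VM $node2 -DiskPath $disk.Filename -Persistence IndependentPersistent | Out-Null

    # Flip the multi-writer flag on both nodes ("scsi1:0" must match where the disk landed)
    foreach ($vm in $node1, $node2) {
        $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
        $opt  = New-Object VMware.Vim.OptionValue
        $opt.Key   = "scsi1:0.sharing"
        $opt.Value = "multi-writer"
        $spec.ExtraConfig = @($opt)
        $vm.ExtensionData.ReconfigVM($spec)
    }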

Monitoring Business Critical Applications with VMware vCenter Operations Manager

The VMware BCA team recently worked with the vCOps gurus to produce the whitepaper “Monitoring Business Critical Applications with VMware vCenter Operations Manager”.

The paper is available at https://www.vmware.com/files/pdf/solutions/Monitoring-Business-Critical-Applications-VMware-vCenter-Operations-Manager-white-paper.pdf.

The document provides an overview of monitoring the following business critical applications with vCOps: SAP; Exchange; Oracle; and SQL Server. It describes the vCOps adapters, including Hyperic, for these applications. Some key application performance metrics are covered and example dashboards are provided and explained.

Estimating Availability of SAP on ESXi Clusters – Examples

This is a follow up to the blog I posted in Jan 2013, which identified a generic formula to estimate the availability, expressed as a percentage/fraction, of SAP virtual machines in an ESXi cluster. The details of the formula are in this whitepaper. This blog provides some example results based on some assumed input data. I used a spreadsheet to model the equation and generate the results – this is shown at the end. The formula is based on mathematical probability techniques. The availability of SAP on an ESXi cluster depends on: the probability of failure of multiple ESXi hosts given the number of spares; and the probability that the SPOFs (database and central services) are failing over due to a VMware HA event (which depends on failover times and the frequency of ESXi host failures).

The example starts with a single 4-node ESXi cluster running multiple SAP database, application server and central services virtual machines (VMs) corresponding to different SAP applications (ERP, BW, CRM, etc.). A sizing engagement has determined that 4 ESXi hosts are required to drive the performance of all the SAP VMs (the SAP landscape). We assume the sizing is such that the memory of all the VMs will not fit into the physical memory of three or fewer hosts, and as we typically have memory reservations set (a best practice for mission critical SAP), VMs may not restart after a VMware HA event. So we conservatively treat any host failure that leaves fewer than 4 ESXi hosts as downtime for the SAP landscape (not true at the individual VM/SAP system level, as some VMs can be de-prioritized in the degraded state in favor of others, but we are going with the landscape-level approach to provide a worst-case estimate). For this reason we design with redundancy by adding extra ESXi hosts to the cluster, so I will compare three options with different degrees of redundancy:

Option 1: 4-node ESXi cluster with no spares, i.e. “4+0” (loss of 1 or more hosts is considered downtime. With this assumption a VMware HA event is moot, so failover times are not considered. As there are no spares, the availability for this scenario = a x a x a x a, where a = availability of a single ESXi host. For the remaining options I use the formula in the spreadsheet below.)

Option 2: 5-node ESXi cluster with 1 spare ESXi host, i.e. “4+1” (loss of 2 or more hosts is considered downtime)

Option 3: 6-node ESXi cluster with 2 spare ESXi hosts, i.e. “4+2” (loss of 3 or more hosts is considered downtime)

The following input data is required for the formula:

  • Mean time to fail over (via VMware HA) the central services VM in case of ESXi host failure: 1-2 minutes (source: lab tests). I will use 2 minutes.
  • Mean time to fail over the database VM in case of ESXi host failure: 5 minutes. Source: a POC from a customer who presented their case study at an SAP tradeshow. Includes time for the database to start and perform a recovery; the latter is dependent on the workload at the time of failure.
  • mtbf – mean time between failures of a single ESXi host (this is the failure rate due to hardware or VMware hypervisor failure)
  • mttr – mean time to repair a failed ESXi host, or replace it with another ESXi host, in order to get the ESXi cluster back to full strength.
  • Another term you will come across is mean time to failure (mttf). Note that mtbf = mttf + mttr.
  • From mtbf and mttr we can calculate the availability of a single ESXi host. The definition is: availability = (mtbf - mttr)/mtbf (see the whitepaper for details; note this type of analysis is not new, similar content can be found here and here).

The following diagram shows the relationship between mtbf, mttr and mttf.

Unfortunately there are no industry standard values for mttr and mtbf for an x86 server running a hypervisor. mtbf depends on the hardware and the frequency of firmware and hypervisor related issues – the latter in turn is impacted by patch management procedures. So mtbf may vary between different environments. How can we estimate these metrics? As SAP is typically virtualized after other non-SAP applications, you can gather operational statistics from existing production or non-production ESXi clusters to get estimates for mtbf and mttr. For mtbf, we would need to determine how often VMware HA events have occurred in any existing operational ESXi clusters. A few informal enquiries I have made show a frequency of failure of around 1-2 times a year for an ESXi host, and in some cases over a year without incident. For mttr, are there any SLAs in place (for example, server vendor service contracts), or can IT operations estimate the time within which they can repair or replace a faulty ESXi host in a production SAP cluster? As SAP business processes are mission critical, such an SLA or understanding may be in place or required. Hence I will show results for a range of mtbf and mttr. The results are shown in the following table for mtbf = 90, 180 and 360 days and mttr = 2, 4, 8 and 24 hrs (I have estimated 1 year to 360 days).

You can read the above table as per the following example:

  • Experience in the datacenter has indicated about 2 failures per ESXi host a year, so we will assume about 180 days for mtbf. So we are interested in the results in the “180 days” section of the above table.
  • Datacenter Operations have procedures in place to restore a failed ESXi host within 4 hrs, so mttr = 4 hrs.
  • Hence the availability estimate is 99.9964% for a “4+1” cluster and 99.9973% for a “4+2” cluster.
  • If any of the input data differs from the above, or for other sized clusters, recalculate using the spreadsheet/formula (see below).

Some conclusions:

  • Adding more redundancy (two spare ESXi hosts versus one) increases availability and makes availability less sensitive to mttr, which makes sense: with more redundancy there is less time pressure to get a failed ESXi host back online. However, there is an extra cost to this redundancy, which can be mitigated by using the redundant ESXi hosts to run less important virtual machines that have a reduced SLA and can be taken offline if a single ESXi host fails. VMware resource shares can be configured to make sure these less important VMs do not interfere with the production SAP landscape.
  • Reliable servers with redundant components and good patch management policies can help to increase mtbf which increases availability.
  • Having procedures in place to lower the mttr increases availability, for example replacing a failed ESXi host with another from a non-production cluster or some standby pool may be faster than repairing the failed ESXi host.

Note the following about this analysis:

  • Only considers unplanned downtime due to ESXi host failures. Storage, application software/OS and network failures are NOT considered here. These other parts of the overall architecture have their own availability so the final availability (as experienced by the end-user) is the product of the availabilities of each sub-component as they effectively operate in series.
  • Formula assumes single instance database not active-active like RAC and no VMware FT for Central Services – these scenarios effectively reduce the failover times driving availability higher.
  • The availability estimate is for the whole SAP landscape – if you consider the individual VM /SAP system the availability can be higher, for example if we are down to three ESXi hosts, priority could be given to ERP over BW so ERP continues to perform as per normal. In this case ERP VMs would have higher availability than the value calculated for the landscape.
  • Architecture here assumes database and app tier deployed in the same cluster. Often these layers are isolated into two separate ESXi clusters (e.g. for database licensing and/or management purposes). In this case two separate availabilities need to be calculated for the two clusters and the overall availability is the product of the two.

Why Bother with All This?

While this mathematical approach (using probability) is an established method to estimate availability, it does require assumptions and input data values that we may have to estimate when data is limited – this is where empirical data from actual implementations will help with accuracy. The goals of this availability analysis are:

  • A starting point to generate an estimate during a sizing engagement or during the design phase of a deployment. If actual data is not available, assume some “worst case” values for mttr and mtbf (e.g. 8 hrs and 90 days) to generate a baseline estimate.
  • Enabling quantitative analysis of different scenarios, such as:
    • How is availability impacted by extra spares in the ESXi cluster? If the business cost of downtime is known (currency per unit time), then we could determine if the cost of redundant ESXi hosts is justified.
    • If we need 10 ESXi hosts would it be better for availability if we had one 10-node cluster or two 5-node clusters?
    • How do failover times impact the final availability?

Appendix – Create Formula in Excel

The generic availability formula is in the whitepaper SAP on VMware High Availability Analysis – A Mathematical Approach. This formula can be created in Excel as shown below (in this case I am ignoring the chances of any failover faults).

The heaviest part of the formula is in cell G46 (see above). This part calculates the probability that all the spares plus one extra ESXi host (that is, s+1 hosts, where s = number of spares) fail simultaneously, resulting in downtime for the cluster. This is based on calculating the different unique combinations of s+1 nodes in the n-node cluster, which is described in the whitepaper; it should be noted that this comes from standard mathematical combination theory – for example, see this wiki page.
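For readers who prefer code to a spreadsheet, here is a minimal PowerShell sketch of the binomial part of the formula. The failover-time term is deliberately omitted, so it returns slightly higher values than the table above (e.g. ~99.9992% versus 99.9964% for the “4+1” case); the function name and parameters are my own.

    function Get-ClusterAvailability {
        param([int]$Required, [int]$Spares, [double]$MtbfDays, [double]$MttrHours)
        $mtbfHours = $MtbfDays * 24
        $a = ($mtbfHours - $MttrHours) / $mtbfHours   # availability of a single ESXi host
        $n = $Required + $Spares                      # total hosts in the cluster
        $avail = 0.0
        # Sum the probabilities of 0..s simultaneous host failures (C(n,k) combinations each)
        for ($k = 0; $k -le $Spares; $k++) {
            $comb = 1.0
            for ($i = 1; $i -le $k; $i++) { $comb *= ($n - $i + 1) / $i }
            $avail += $comb * [math]::Pow(1 - $a, $k) * [math]::Pow($a, $n - $k)
        }
        return $avail
    }

    # “4+1” cluster with mtbf = 180 days and mttr = 4 hrs (compare with the table above):
    "{0:P4}" -f (Get-ClusterAvailability -Required 4 -Spares 1 -MtbfDays 180 -MttrHours 4)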


Windows Server 2012 VM-Generation ID Support in vSphere

Update 1/25/2013: The vSphere versions required for VM-Generation ID support have been updated below.

Active Directory Domain Services has been one of those applications that, to the naked eye, seemed like it was a no brainer to virtualize. Why not? In most environments it’s a fairly low utilization workload, rarely capable of efficiently using the resources found in many of the enterprise-class servers that have been available for the past few years. Many organizations have adopted this way of thinking and have successfully virtualized all of their domain controllers. What about the hold-outs? What is it about Active Directory that has left so many AD administrators and architects keeping their infrastructure, or at least a portion of it, on physical servers?

Until recently, a couple of limitations, some argued, diminished the advantages of virtualization. These limitations included the lack of support for cloning domain controllers and the inability to use features such as snapshots due to the risk of roll-back.

With the release of Windows Server 2012, Microsoft has validated the role virtualization plays in the data center by adding functionality that effectively lifts these limitations. The feature known as VM-Generation ID allows hypervisor vendors to expose a virtual machine identifier that Windows Server 2012 domain controllers can use to detect the state of a virtual machine and trigger new Active Directory safeguards. These safeguards protect the Active Directory from the dreaded USN roll-back if a virtual machine is reverted to a snapshot or rolled back by other mechanisms.
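From inside the guest, one way to observe this safeguard is to read the msDS-GenerationId attribute that a Windows Server 2012 domain controller stamps on its own computer object. The sketch below is a hedged example: the attribute is not replicated, so the query targets the DC itself, the name “DC01” is a placeholder, and the ActiveDirectory PowerShell module is assumed to be installed.

    # Inspect the generation ID recorded in the directory on a specific DC
    Get-ADComputer -Identity "DC01" -Server "DC01" -Properties "msDS-GenerationId" |
        Select-Object Name, "msDS-GenerationId"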

Besides protecting Active Directory from unintentional roll-back, these new safeguards and VM-Generation ID allow administrators to safely clone Windows Server 2012 domain controllers. When properly prepared, a Windows Server 2012 domain controller may be used as a source for new domain controllers. Not only does this eliminate the additional tasks of preparing a base virtual machine for becoming a domain controller, it also reduces the time required for replication of a new copy of the Active Directory database.

VM-Generation ID functionality requires the hypervisor vendor to create the virtual machine identifier and expose it to the guest. VMware has provided this functionality in the following releases of vSphere:

  • VMware vSphere 5.0 Update 2 (vCenter Server and ESXi must both be at 5.0 Update 2)
  • VMware vSphere 5.1 (ESXi must be at least 5.0 Update 2)

More information on VM-Generation ID, supported methods for cloning domain controllers, and domain controller safeguards can be found at the following TechNet links:

Introduction to Active Directory Domain Services Virtualization (Level 100): http://technet.microsoft.com/en-us/library/hh831734.aspx

Virtualized Domain Controller Technical Reference (Level 300): http://technet.microsoft.com/en-us/library/jj574214.aspx

-alex