Harnessing the Power of Storage Virtualization and Site Recovery Manager to Provide HA and DR Capabilities to Business Critical Databases
How do you simplify and improve the availability of your extended-distance Oracle Real Application Cluster using vSphere Metro Storage Cluster (vMSC)?
Storage virtualization, both host based and appliance based, can pave the way for increased ease of configuration and improved availability of your cluster-based applications. vMSC features including vMotion, HA, DRS and FT, as well as extended-distance Oracle Real Application Clusters (RAC), are greatly simplified and, in some cases, made possible through the use of storage virtualization technologies such as EMC VPLEX, NetApp MetroCluster, IBM SVC, HP 3PAR Peer Persistence or Oracle Automatic Storage Management (ASM) disk groups.
Virtual Volumes (VVOLS) a game changer for running Tier 1 Business Critical Databases:
One of the major components released with vSphere 6 this year was the support for Virtual Volumes (VVOLS). VVOLS has been gaining momentum with storage vendors, who are enabling its capabilities in their arrays.
When virtualizing business databases, there are many critical concerns that need to be addressed, including:
1. Database Performance to meet strict SLAs
2. Daily Operations e.g. Backup & Recovery to complete in set window
3. Cut down time to Clone / Refresh of Databases from Production
4. Meet different IO characteristics and capabilities based on criticality
5. Never-ending debate with DBAs: File Systems vs. Raw Devices, VMFS vs. RDM
Introduction to Queue Sizing
Proper queue sizing is a key element in ensuring that current database workloads can be sustained and all SLAs are met without any processing disruption.
Queues are often misrepresented as the very “bane of our existence,” and yet queues restore some semblance of order to our chaotic lives.
Imagine what would happen if there were no queues.
Disorder, chaos, anarchy: now that’s fun! – DAVID J. SCHOW, The Crow
Much has been written about virtualization and storage queues. Excellent blog articles by Duncan Epping, Cormac Hogan and Chad Sakac cover the various queues in depth.
This article uses a simple example to illustrate the fallacy in presuming that queue tuning only needs to be done at the application level and that it’s okay to ignore the physical limitations of the storage.
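As a rough illustration of why application-level tuning alone is not enough, here is a minimal sketch based on Little's Law (outstanding I/Os = throughput × latency), a general queueing result rather than anything taken from the article. The layer names, queue depths and latency are hypothetical values chosen only for the example.

```python
def max_iops(queue_depth: int, latency_ms: float) -> float:
    """Upper bound on IOPS from Little's Law: outstanding I/Os = IOPS * latency."""
    return queue_depth * 1000.0 / latency_ms


def narrowest_queue(layer_depths: dict[str, int]) -> tuple[str, int]:
    """Return the layer with the smallest queue depth; it caps outstanding I/Os."""
    name = min(layer_depths, key=layer_depths.get)
    return name, layer_depths[name]


# Hypothetical queue depths along the I/O path (not vendor defaults).
layers = {"application": 256, "guest_os": 64, "hba": 32, "array_port": 128}

bottleneck, depth = narrowest_queue(layers)
print(f"bottleneck: {bottleneck}, depth: {depth}")
print(f"IOPS ceiling at 2 ms latency: {max_iops(depth, latency_ms=2.0):.0f}")
```

In this sketch, raising the application queue from 256 to 512 changes nothing, because the smallest queue further down the path still limits how many I/Os can be in flight.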
VMware is glad to see that the Microsoft Exchange Server (and Performance) teams appear to have identified the prevalent cause of performance-related issues in an Exchange Server 2013 infrastructure. We have been aware for several years that Microsoft’s sizing recommendation for Exchange Server 2013 is the number one cause of the performance issues that have been reported to VMware since the release of Exchange Server 2013, and it is gratifying that Microsoft is acknowledging this as well.
In May of 2015, Microsoft released a blog post titled “Troubleshooting High CPU utilization issues in Exchange 2013” in which Microsoft acknowledged (for the first time, to our knowledge) that CPU over-sizing is one of the chief causes of performance issues on Exchange Server 2013. We wish to highlight the fact that the Exchange 2013 Server Role Requirements Calculator is the main culprit in this state of affairs. One thing we noticed with the release of Exchange Server 2013 and its accompanying “Calculator” is the increase in the compute resources it recommends when compared to similar configurations in prior versions of Exchange Server.
As you dive into the inner workings of the new version of VMware vSphere (aka ESXi), one of the gems you will discover to your delight is the enhanced virtual machine portability feature that allows you to vMotion a running pair of clustered Windows workloads that have been configured with shared disks.
I pause here now to let you complete the obligatory jiggy dance. No? You have no idea what I just talked about up there, do you? Let me break it down for you:
In vSphere 6.0, you can configure two or more VMs running Windows Server Failover Clustering (or MSCS for pre-Windows 2012 OSes), using common, shared virtual disks (RDM) among them AND still be able to successfully vMotion any of the clustered nodes without inducing failure in WSFC or the clustered application. What’s the big deal about that? Well, it is the first time VMware has ever officially supported such a configuration without any third-party solution, formal exception, or a number of caveats. Simply put, this is now an official, out-of-the-box feature that does not have any exceptions or special requirements other than the following:
The VMs must be in “Hardware 11” compatibility mode – which means that you are either creating and running the VMs on ESXi 6.0 hosts, or you have converted your old template to Hardware 11 and deployed it on ESXi 6.0
The disks must be connected to virtual SCSI controllers that have been configured for “Physical” SCSI Bus Sharing mode
And the disk type *MUST* be of the “Raw Device Mapping” type. VMDK disks are *NOT* supported for the configuration described in this document.
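The three requirements above can be expressed as a simple pre-flight check. This is only a sketch: the dictionary layout and field names (`hardware_version`, `scsi_bus_sharing`, `shared_disks`, `type`) are hypothetical and do not correspond to any vSphere API. Treating hardware versions newer than 11 as acceptable is also an assumption.

```python
def wsfc_vmotion_blockers(vm: dict) -> list[str]:
    """Return the unmet requirements for vMotion of a clustered VM with shared disks."""
    blockers = []
    # Requirement 1: "Hardware 11" compatibility (newer versions assumed acceptable).
    if vm.get("hardware_version", 0) < 11:
        blockers.append("VM is not in Hardware 11 compatibility mode")
    # Requirement 2: the SCSI controller must use "Physical" bus sharing.
    if vm.get("scsi_bus_sharing") != "physical":
        blockers.append("SCSI bus sharing mode is not Physical")
    # Requirement 3: every shared disk must be a Raw Device Mapping, not a VMDK.
    for disk in vm.get("shared_disks", []):
        if disk.get("type") != "rdm":
            blockers.append(f"shared disk {disk.get('name')} is not an RDM")
    return blockers


# Hypothetical inventory record for one cluster node.
node = {
    "hardware_version": 11,
    "scsi_bus_sharing": "physical",
    "shared_disks": [{"name": "quorum", "type": "rdm"}, {"name": "data1", "type": "vmdk"}],
}
print(wsfc_vmotion_blockers(node))
```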
We at VMware have been fielding a lot of inquiries lately from customers who have virtualized (or are considering virtualizing) their Microsoft Lync Server infrastructure on the VMware vSphere platform. The inquiries center on certain generalized statements contained in the “Planning a Lync Server 2013 Deployment on Virtual Servers” whitepaper published by the Microsoft Lync Server Product Group. In the referenced document, the writers made the following assertions:
You should disable hyper-threading on all hosts.
Disable non-uniform memory access (NUMA) spanning on the hypervisor, as this can reduce guest performance.
Virtualization also introduces a new layer of configuration and optimization techniques for each guest that must be determined and tested for Lync Server. Many virtualization techniques that can lead to consolidation and optimization for other applications cannot be used with Lync Server. Shared resource techniques, including processor oversubscription, memory over-commitment, and I/O virtualization, cannot be used because of their negative impact on Lync scale and call quality.
Virtual machine portability—the capability to move a virtual machine guest server from one physical host to another—breaks the inherent availability functionality in Lync Server pools. Moving a guest server while operating is not supported in Lync Server 2013. Lync Server 2013 has a rich set of application-specific failover techniques, including data replication within a pool and between pools. Virtual machine-based failover techniques break these application-specific failover capabilities.
VMware has contacted the writers of this document and requested corrections to (or clarification of) the statements because they do not, to our knowledge, convey known facts and they reflect a fundamental misunderstanding of vSphere features and capabilities. While we await further information from the writers of the referenced document, it has become necessary for us at VMware to publicly provide a direct clarification to our customers who have expressed confusion at the statements above.
Starting with update releases in December 2014, VMware vSphere will default to a new configuration for the Transparent Page Sharing (TPS) feature. Unlike in prior versions of vSphere up to that point, TPS will be DISABLED by default. TPS will continue to be disabled by default in all future versions of vSphere.
In the interim, VMware has released a Patch for vSphere 5.5 which changes the behavior of (and provides additional configuration options for) TPS. Similar patches will also be released for prior versions at a later date.
Why are we doing this?
In a nutshell, independent research indicates that TPS can be abused to gain unauthorized access to data under certain highly controlled conditions. In line with its “secure by default” security posture, VMware has opted to change the default behavior of TPS and provide customers with a configurable option for selectively and more securely enabling TPS in their environment. Please read “Security considerations and disallowing inter-Virtual Machine Transparent Page Sharing (2080735)” for more detailed discussion of the security issues and VMware’s response.
Announcing the latest addition to our series of prescriptive guidance for virtualizing Business Critical Applications on the VMware vSphere platform.
Microsoft Windows Active Directory Domain Services (AD DS) is one of the most pervasive Directory Services platforms in the market today. Because of the importance of AD DS to the operation and availability of other critical services, applications and processes, the stability and availability of AD DS itself is usually very important to most organizations.
Although the “Virtualization First” concept is becoming a widely accepted operational practice in the enterprise, many IT shops are still reluctant to completely virtualize Domain Controllers. The most conservative organizations have an absolute aversion to domain controller virtualization, while less conservative organizations choose to virtualize a portion of the AD DS environment and retain a portion on physical hardware. Empirical data indicate that the cause of this opposition to domain controller virtualization is a combination of historical artifacts, misinformation, lack of experience in virtualization, or fear of the unknown.
In Windows Server 2012 VM-Generation ID Support in vSphere, we introduced you to VMware’s support for Microsoft’s new Windows VM-Generation ID feature, discussing how it helps address some of the challenges facing Active Directory administrators looking to virtualize domain controllers.
One of the common requests from customers in response to the referenced article is a list of events and conditions under which an administrator can expect the VM-Generation ID of a virtual machine to change in a VMware vSphere infrastructure. The table below presents this list. This table will be included in an upcoming Active Directory on VMware vSphere Best Practices Guide.
Recently in partner workshops I have come across some interesting discussions about the impact of hyper-threading and NUMA in sizing business critical applications on VMware. So here is an SAP example based on SAP’s sizing metric “SAPS” (a hardware-independent unit of measurement that equates to SAP OLTP throughput of Sales and Distribution users). The examples here refer to vSphere scheduling concepts in the useful whitepaper “The CPU Scheduler in VMware vSphere 5.1”.
SAP sizing requires the SAPS rating of the hardware, which for estimation purposes can be obtained from certified SAP benchmarks published at http://www.sap.com/solutions/benchmark/sd2tier.epx . Let’s use certification 2011027 and assume that we plan to deploy on hardware similar to that used in this benchmark. This is a virtual benchmark on vSphere 5 with the following result: 25120 SAPS (at ~100% CPU) for 24 vCPUs running on a server with 2 processors, 6 cores per processor and 24 logical CPUs, as hyper-threading was enabled. This is a NUMA system where each processor is referred to as a NUMA node. (Note that cert 2011027 is an older benchmark. The SAPS values for vSphere on newer servers with faster processors would be different/higher, so work with the server vendors to obtain the most recent and accurate SAPS ratings.)
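The benchmark figures quoted above can be broken down into per-unit ratings with simple arithmetic. The sketch below assumes SAPS scales linearly with vCPUs, which is a simplification. The 5000 SAPS requirement and the 65% target utilization are hypothetical inputs chosen only to show the calculation.

```python
import math

# Figures from certification 2011027 as quoted in the text.
BENCHMARK_SAPS = 25120      # measured at ~100% CPU
VCPUS = 24                  # one vCPU per logical CPU (hyper-threading enabled)
PROCESSORS = 2              # each processor is one NUMA node
CORES_PER_PROCESSOR = 6

cores = PROCESSORS * CORES_PER_PROCESSOR
saps_per_vcpu = BENCHMARK_SAPS / VCPUS            # rating per hyper-thread
saps_per_core = BENCHMARK_SAPS / cores            # rating per physical core
saps_per_numa_node = BENCHMARK_SAPS / PROCESSORS  # rating per NUMA node


def vcpus_needed(required_saps: float, target_utilization: float = 0.65) -> int:
    """Estimate vCPUs for a workload, assuming linear scaling (a simplification)."""
    return math.ceil(required_saps / (saps_per_vcpu * target_utilization))


print(f"SAPS per vCPU:      {saps_per_vcpu:.0f}")
print(f"SAPS per core:      {saps_per_core:.0f}")
print(f"SAPS per NUMA node: {saps_per_numa_node:.0f}")
print(f"vCPUs for a hypothetical 5000 SAPS system: {vcpus_needed(5000)}")
```

Note that the per-vCPU figure is roughly half the per-core figure because, with hyper-threading enabled, two vCPUs share each physical core in this benchmark.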
An Oracle RAC cluster using ASM for storage requires the shared disks to be accessible by all nodes of the RAC cluster.
The multi-writer option in vSphere allows VMFS-backed disks to be shared by multiple VMs simultaneously. By default, simultaneous multi-writer “protection” is enabled for all .vmdk files, i.e., each VM has exclusive access to its .vmdk files. So, for all of the Oracle RAC VMs to access the shared .vmdk files, the multi-writer protection needs to be disabled.
KB Article 1034165 provides more details on how to set the multi-writer option manually to allow VMs to share .vmdk files (link below).
The above method requires the VMs to be powered off before the multi-writer option can be set in the .vmx configuration file. This means that the Oracle RAC instance has to be shut down and the VM completely powered off before the option can be set, leading to an instance outage.
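For reference, the manual method adds one sharing entry per shared disk to the .vmx file. The sketch below only generates those lines. The key format `scsiX:Y.sharing = "multi-writer"` is given as I recall it from the KB article, so verify it against KB 1034165 before use. The controller and unit numbers are hypothetical.

```python
def multi_writer_entries(disks: list[tuple[int, int]]) -> list[str]:
    """Build .vmx lines for (controller, unit) pairs that should be shared."""
    return [f'scsi{controller}:{unit}.sharing = "multi-writer"' for controller, unit in disks]


# Hypothetical layout: two shared ASM disks on SCSI controller 1.
for line in multi_writer_entries([(1, 0), (1, 1)]):
    print(line)
```

The same entries would have to be added to every RAC node's configuration while that VM is powered off, which is the outage described above.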
This is a follow-up to the blog I posted in Jan 2013, which identified a generic formula to estimate the availability, expressed as a percentage/fraction, of SAP virtual machines in an ESXi cluster. The details of the formula are in this whitepaper. This blog provides some example results based on some assumed input data. I used a spreadsheet to model the equation and generate the results, which are shown at the end. The formula is based on mathematical probability techniques. The availability of SAP on an ESXi cluster depends on: the probability of failure of multiple ESXi hosts based on the number of spares; and the probability that the SPOFs (database & central services) are failing over due to a VMware HA event (which depends on failover times and the frequency of ESXi host failures).
The example starts with a single 4-node ESXi cluster running multiple SAP database, application server and central services virtual machines (VMs) corresponding to different SAP applications (ERP, BW, CRM etc.). A sizing engagement has determined that 4 ESXi hosts are required to drive the performance of all the SAP VMs (the SAP landscape). We assume the sizing is such that the memory of all the VMs will not fit into the physical memory of three or fewer hosts and, as we typically have memory reservations set (a best practice for mission critical SAP), VMs may not restart after a VMware HA event. So we conservatively treat any host failures that result in fewer than 4 ESXi hosts as downtime for the SAP landscape (this is not true at the individual VM/SAP system level, as some of the VMs can be de-prioritized in the degraded state in favor of others, but we are going with the landscape-level approach to provide a worst-case estimate). For this reason we design with redundancy by adding extra ESXi hosts to the cluster, so I will compare three options with different degrees of redundancy.
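The landscape-level rule described here (downtime whenever fewer than the required number of hosts are up) can be approximated with a simple binomial model. This is not the formula from the whitepaper: it assumes independent host failures, ignores the failover time of the SPOFs, and uses a hypothetical per-host availability of 0.999.

```python
from math import comb


def landscape_availability(required: int, spares: int, host_availability: float) -> float:
    """Probability that at least `required` of `required + spares` hosts are up."""
    total = required + spares
    p = host_availability
    return sum(
        comb(total, up) * p**up * (1 - p) ** (total - up)
        for up in range(required, total + 1)
    )


# Compare 4 required hosts with 0, 1 and 2 spare hosts (hypothetical input data).
for spares in (0, 1, 2):
    availability = landscape_availability(required=4, spares=spares, host_availability=0.999)
    print(f"4+{spares} hosts: {availability:.9f}")
```

With no spares the result is simply the per-host availability raised to the fourth power, and each added spare host raises the estimate.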
Update 1/25/2013: The vSphere versions required for VM-Generation ID support have been updated below.
Active Directory Domain Services has been one of those applications that, to the naked eye, seemed like it was a no-brainer to virtualize. Why not? In most environments it’s a fairly low-utilization workload, rarely capable of efficiently using the resources found in many of the enterprise-class servers that have been available for the past few years. Many organizations have adopted this way of thinking and have successfully virtualized all of their domain controllers. What about the hold-outs? What is it about Active Directory that has left so many AD administrators and architects keeping their infrastructure, or at least a portion of it, on physical servers?
“Arithmetic is where the answer is right and everything is nice and you can look out of the window and see the blue sky – or the answer is wrong and you have to start over and try again and see how it comes out this time.” ~Carl Sandburg
When we architect SAP on VMware deployments, an important topic is how we design for high availability. We have options in the VMware environment ranging from VMware HA and VMware FT to in-guest clustering software such as Microsoft Cluster Services or Linux-HA. So can we determine a numerical availability for our design, expressed as a fraction/percentage (the same metric used to define uptime Service Level Agreements, such as 99.9%)? Yes, there are ways to estimate this value, and one method is explained in the following paper: http://www.availabilitydigest.com/public_articles/0712/sap_vmware.pdf . This paper develops an equation to estimate the availability of SAP running on an ESXi cluster, expressed as a fraction/percentage. The concepts are taken from other papers at http://www.availabilitydigest.com (a digest of topics on high availability) and are based on algebra and probability theory that have previously been used in the IT industry for availability calculations. The availability metric (e.g. 99.9% or 0.999) is essentially a probability, hence we use mathematical probability techniques to calculate the overall availability of a system.
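Because availability is a probability, two standard calculations follow directly. The sketch below converts an availability figure into expected downtime per year and multiplies the availabilities of components that must all be up at once. It illustrates the general idea only and is not the equation developed in the paper. The component values are hypothetical.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR


def series_availability(*components: float) -> float:
    """Availability of a system that needs every component up (assumes independence)."""
    return prod(components)


print(f"99.9% uptime allows about {annual_downtime_hours(0.999):.2f} hours of downtime per year")

# Hypothetical chain: database, central services and application tier must all be up.
combined = series_availability(0.999, 0.999, 0.9995)
print(f"combined availability: {combined:.6f}")
```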
The original vCenter Server 5.5 Availability Guide was published in December 2014.
With the End of Availability of vCenter Server Heartbeat, guidance was provided on how to monitor and protect vCenter. Due to the need for additional protection, we have internally validated using Windows Server Failover Clustering for protection of vCenter services. Improved SLAs can be attained with this clustering solution. The update provides step-by-step guidance to deploy this solution to protect vCenter 5.5.
One of the relatively newer use cases for SRM is planned migration. With this use case, customers can migrate their business critical workloads to the recovery or cloud provider sites in a planned manner. This could be in preparation for an upcoming threat such as a hurricane or other disaster, or an actual datacenter migration to a different location or cloud provider.
A protection group is a group of virtual machines that fail over together to the recovery site. Protection groups contain virtual machines whose data has been replicated by array-based replication or by VR. A protection group typically contains virtual machines that are related in some way, such as:
A three-tier application (application server, database server, Web server)
Virtual machines whose virtual machine disk files are part of the same datastore group.