Oracle, VMware and Extended Distance Oracle Real Application Clusters on vSphere Metro Storage Cluster

Nowadays a high availability cluster (for any kind of application) in a single data center is rarely enough to provide the availability that a revenue-generating, mission-critical application needs.

Typical threats to a single data center include a local power outage, an airplane crash, or server room flooding.

Do we have a solution?

The answer: a “Stretched” Cluster, one whose nodes are distributed between multiple sites, as long as the sites are within 100 km of each other and the link between them satisfies the 5 ms round trip latency requirement.

This architecture is the best fit when distance, latency, and the degree of protection provided are weighed together.

Site separation is great protection against local disasters like those mentioned above, but not against every disaster. Earthquakes, hurricanes, and regional floods may affect a much greater area.

For comprehensive protection against regional disasters, including protection against data corruption, Oracle Data Guard and VMware Site Recovery Manager (SRM) can be combined with Extended Distance Oracle RAC on vSphere Metro Storage Cluster, giving us both a Disaster Avoidance and a Disaster Recovery solution.

Traditional Cluster and an Extended Distance/Stretched Cluster

Let’s highlight the similarities and differences between a Traditional Cluster at a single site and a cluster with its nodes distributed between two sites, also known as an Extended Distance Cluster or a Stretched Cluster.

Similarities

  1. Layer 2 Adjacency
  2. Shared storage

Differences (only a few minor differences)

  1. Network latency between nodes increases and needs to be kept at 5 ms or less
  2. Storage that is synchronously, bi-directionally replicated in 5 ms or less is required at each site
  3. A witness or quorum site is required for any clustered app or storage to avoid a “split brain”

Network Latency

A Traditional Cluster and a Stretched Cluster are no different, except that in a Stretched Cluster some of the nodes are separated by a long distance (almost always less than 100 km).

The rule of thumb for making this work correctly, for almost all clusters of any kind, is that the round trip time (RTT) between the cluster nodes must be 5 ms or less, whether they are in the same data center or in different ones. This is true of Oracle RAC and vSphere HA clusters in particular (vSphere Enterprise Plus licenses support up to 10 ms).

There is no requirement for any kind of extra licensing or configuration for this.
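To see how the 100 km distance relates to the 5 ms rule, here is a rough sketch in Python. It assumes light travels through optical fiber at about 200,000 km/s (roughly 5 microseconds per km one way); that figure and the function names are illustrative assumptions, not from any vendor tool. Real links add switching and DWDM overhead, and fiber routes are longer than the map distance, so treat the result as a lower bound.

```python
# Rough propagation-delay estimate for a stretched cluster link.
# Assumption: ~5 microseconds per km one way in optical fiber.
# Equipment and protocol overhead are NOT included.

FIBER_US_PER_KM = 5.0   # one-way propagation delay, microseconds per km
RTT_BUDGET_MS = 5.0     # rule-of-thumb round trip limit from the text


def propagation_rtt_ms(distance_km: float) -> float:
    """Round trip propagation delay in milliseconds over a fiber path."""
    return 2 * distance_km * FIBER_US_PER_KM / 1000.0


def remaining_budget_ms(distance_km: float,
                        budget_ms: float = RTT_BUDGET_MS) -> float:
    """Budget left for equipment and protocol overhead after propagation."""
    return budget_ms - propagation_rtt_ms(distance_km)


if __name__ == "__main__":
    for km in (0, 25, 50, 100):
        print(f"{km:>4} km: propagation RTT {propagation_rtt_ms(km):.2f} ms, "
              f"remaining budget {remaining_budget_ms(km):.2f} ms")
```

Under this assumption, propagation alone uses about 1 ms of the 5 ms budget at 100 km, which leaves the rest for network equipment and the cluster protocols themselves.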

Replicated Storage

Storage needs to be synchronously, bi-directionally replicated between sites. In almost every case, this also has to happen in 5 ms or less. There are two ways to do this.

  1. Host based replication. The prime example of this is Oracle Automatic Storage Management (ASM).
    1. Pros: storage agnostic, easy to configure, rock solid
    2. Cons: extra CPU usage; if you don’t already have Oracle, this becomes a little more complicated and expensive
  2. Appliance based replication. Some examples are VMware stretched vSAN, EMC VPLEX, IBM SVC, HP Peer Persistence and NetApp MetroCluster
    1. Pros: no CPU penalty, tightly integrated with storage, rock solid
    2. Cons: Usually not storage agnostic, almost always proprietary, very expensive
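Whichever method is used, synchronous replication means a write is not complete until both sites have it, so inter-site latency shows up directly in write response time. The sketch below is a simplified model, assuming the local and remote copies are written in parallel and the write is acknowledged only when both finish. The function name and the sample numbers are hypothetical; actual behavior depends on the replication product.

```python
# Simplified model of a synchronously mirrored write across two sites.
# Assumption: both copies are issued in parallel and the application sees
# the write complete only when the slower of the two paths has finished.


def mirrored_write_latency_ms(local_write_ms: float,
                              remote_write_ms: float,
                              rtt_ms: float) -> float:
    """Estimated latency of one synchronously replicated write."""
    local_path = local_write_ms
    # The remote copy pays the inter-site round trip on top of its own
    # storage service time.
    remote_path = rtt_ms + remote_write_ms
    return max(local_path, remote_path)


if __name__ == "__main__":
    # Hypothetical figures: 0.5 ms storage service time at each site.
    for rtt in (0.0, 1.0, 5.0):
        latency = mirrored_write_latency_ms(0.5, 0.5, rtt)
        print(f"RTT {rtt:.1f} ms -> write latency about {latency:.1f} ms")
```

In this model the remote path dominates as soon as there is any meaningful distance, which is why write-heavy workloads are the first thing to validate on a stretched cluster.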

Witness Site

All clustered technologies require a tie breaker (a witness or quorum disks) to prevent a cluster fencing / split brain situation. In a split brain, every node loses its network or disk heartbeat with the other nodes in the cluster, assumes that it is the sole surviving member, and proclaims itself the master. This is known as Split Brain Syndrome.

That causes the data, app, etc. at each site to get out of sync, to the point where they cannot be resynchronized.

The presence of a witness site prevents multiple masters from being created and so avoids this situation.
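The tie-breaking idea can be sketched in a few lines of Python. This is a conceptual model only, not the algorithm used by Oracle Clusterware, vSphere, or any storage product. It assumes two data sites named "A" and "B", one witness, and a hypothetical `preferred` site that wins when both sites can still reach the witness.

```python
from typing import Set

# Conceptual witness tie-break for a two-site cluster.
# Assumptions: sites are called "A" and "B"; the witness gives its vote
# to exactly one site when the inter-site link is down.


def surviving_sites(link_up: bool,
                    a_sees_witness: bool,
                    b_sees_witness: bool,
                    preferred: str = "A") -> Set[str]:
    """Return the set of data sites allowed to keep running."""
    if link_up:
        # The two sites still see each other, so there is no partition
        # to resolve and both keep running.
        return {"A", "B"}

    # Partitioned: only a site that can reach the witness may win.
    candidates = [site for site, ok in (("A", a_sees_witness),
                                        ("B", b_sees_witness)) if ok]
    if not candidates:
        # Neither side can prove it holds a majority, so both stop
        # rather than risk two masters.
        return set()

    winner = preferred if preferred in candidates else candidates[0]
    return {winner}


if __name__ == "__main__":
    print(surviving_sites(link_up=False, a_sees_witness=True,
                          b_sees_witness=True))
```

The important property is that at most one site survives a partition, so two independent masters can never accept writes at the same time.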

Stretched Oracle RAC Clusters on vSphere In the Real World

Many business critical apps require five 9s of availability, or 99.999% availability (a little over 5 minutes of downtime per year). This is where the marriage of vSphere HA and Oracle RAC really shines. This combination has been used to great effect by several very large organizations globally.
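For reference, the downtime allowed by a given number of nines is simple arithmetic. The small Python sketch below assumes a 365.25-day year; the function name is illustrative.

```python
# Downtime allowed per year for a given availability percentage.
# Assumption: a year of 365.25 days.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return (1 - availability_pct / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% availability -> {minutes:.2f} minutes per year")
```

Five 9s works out to roughly 5.26 minutes per year, so a single failover that takes minutes rather than seconds can consume the whole annual budget.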

vSphere HA clusters which are distributed across sites, commonly known as vSphere Metro Storage Clusters (vMSC), provide high availability across sites for protected VMs just like their single site counterparts do in a single data center. This protection extends from file and print services, AD Domain Controllers, email servers, load balancers and application servers all the way to Business Critical databases.

Most apps have many application servers for business logic but require one central database. Optimum availability for this type of app is provided by

  1. A Stretched Database Cluster such as Oracle RAC across all sites, providing no data loss and a time to recover in single digit seconds (i.e. the time for an application server to fail over to a surviving node in the database cluster)
  2. All components of that highly available database architecture also being protected by the vMSC, thus providing “hyper availability” for the database
  3. All other app components being protected by the HA functionality of the vMSC

Are You A Candidate For A Stretched Cluster?

Some of the considerations when deciding whether to deploy an Extended RAC Cluster are

  1. Latency requirements of the Workload
  2. Site Distance (0, 25, 50, 100, > 100 km?)
  3. Network Connection / Bandwidth between Sites? Dark Fiber over Dense Wavelength Division Multiplexing (DWDM)
  4. New equipment to be purchased/leased? Extra Cost!!

How will I know whether my application will scale and meet its current SLAs while enjoying the benefits of a Stretched Cluster? Or do I have to pony up and spend “One Hundred Billion Dollars” to find out?

Luckily, almost all of us can now afford to reproduce a stretched cluster in our labs.

Using open source software and trial licenses, you can build a stretched cluster, validate your specific workload over the varying “distances” that you configure on your emulated WAN, and vet it for yourself.

Here is a diagram of how that might look.

[Figure ERAC4: diagram of the lab stretched cluster]

Here is a more detailed architecture diagram using host based storage replication provided by Oracle ASM. All the individual components are contained in a vSphere VM.

[Figure ERAC5: detailed architecture diagram using Oracle ASM host based replication]
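One possible way to configure the emulated “distances” is the Linux `tc` utility with its `netem` queueing discipline. The post does not say which tool the lab uses, so this is an assumption. The Python sketch below only builds the command strings and does not execute anything. The interface name `eth1` is a placeholder, and the delay assumes roughly 5 microseconds per km of fiber in one direction.

```python
# Builds (does not run) Linux `tc netem` commands that add a fixed delay
# on a WAN-emulator interface for each site distance to be tested.
# Assumptions: the emulator is a Linux VM, "eth1" is a placeholder
# interface name, and fiber adds about 5 microseconds per km one way.

FIBER_US_PER_KM = 5


def one_way_delay_us(distance_km: int) -> int:
    """One-way propagation delay in microseconds for the given distance."""
    return distance_km * FIBER_US_PER_KM


def netem_command(interface: str, distance_km: int) -> str:
    """Return a tc command that sets a fixed egress delay on `interface`."""
    delay = one_way_delay_us(distance_km)
    return f"tc qdisc replace dev {interface} root netem delay {delay}us"


if __name__ == "__main__":
    for km in (25, 50, 100):
        print(netem_command("eth1", km))
```

Because `netem` delays traffic leaving an interface, the same delay would need to be applied on the emulator interface facing each site so that both directions are covered. Measuring the resulting round trip time with `ping` before starting a workload test is a sensible sanity check.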

The solution will be discussed in detail at VMworld 2015 Barcelona in session VAPP4634:

Harnessing the Power of Storage Virtualization and Site Recovery Manager to Provide HA and DR Capabilities to Business Critical Databases (VAPP4634)

Thursday, Oct 15, 12:00 PM – 1:00 PM – Hall 8.0, Room 38

Sudhir Balasubramanian – Senior Solution Architect – Data Platforms, VMware

Marlin McNeil – VMware Partner, Yucca Group

VMworld 2015 sessions in both San Francisco and Barcelona

VAPP4634 – Harnessing the Power of Storage Virtualization and Site Recovery Manager to Provide HA and DR Capabilities to Business Critical Databases