TL;DR Version:

Reliable, optimal and effective Enterprise-scale Business Continuity and Disaster Recovery (BCDR) Plans are difficult and costly to design and implement, not just financially, but technically and operationally. By using VMware’s Site Recovery Manager in any of VMware’s Hybrid Cloud infrastructures, businesses can not only design and implement repeatable, automated and simplified BCDR Plans for their Mission-Critical Applications, they can do so in a cost-effective way that eliminates or substantially reduces the errors, complications and stress associated with restoring critical services after a disaster. In this Guide, we show you how.

It is inarguable that very few things rate higher in the order of importance and desirability in the Enterprise than Business Continuity and Disaster Recovery (BCDR). Paradoxically, BCDR is arguably one of the important considerations least planned and prepared for by enterprise Infrastructure Architects and Administrators. In this document, we attempt to address this paradox by providing detailed, step-by-step guidance on how Businesses can leverage VMware Site Recovery Manager to provide a simplified, optimal, comprehensive and repeatable Solution for protecting business-critical infrastructure components and recovering them to restore business continuity in the event of a disaster.

In this Guide, we demonstrate a real-life business operations scenario, using the following 3-tier Enterprise Application configuration and infrastructure:

  • A Microsoft Windows Active Directory infrastructure with multiple Domain Controllers
  • A Clustered 2-Node Microsoft SQL Server instance
  • A Microsoft Windows Client from which we will verify business continuity after restoring services

In discussing Business Continuity and Disaster Recovery (BCDR) as a subject, it is imperative that we not only identify what makes an Application “business-critical”, but also differentiate between BCDR and the closely related concept of “High Availability” (HA).

An Application is deemed Business-critical if its interruption or non-availability impacts a Business entity in such a way (and to such a degree) that the Business is unable to successfully continue its operations until and unless the Application is restored and reintroduced into service. A prolonged outage of such an Application more often than not hurts the Business’ profitability, subjects the Business to contractual liability, and (sometimes) threatens lives and property.

An Application’s non-availability could be triggered by any number of events or causes, within or outside the immediate control of the Business and its Operators/Administrators, and the impact could manifest in several ways, both predictable and unpredictable.

Businesses try to account for the inevitability of an Application’s non-availability by proactively building as much resilience and redundancy as possible into the Application and its components. This is where the distinction between BCDR and HA becomes very important.

Application High Availability is more focused on an Application’s ability to continue to operate and provide services even when one or more of the Application’s components, or the Application itself, has failed. The Application’s ability to survive (and recover from) failures is largely dependent on the resilience built into the Application (either natively or through the use of 3rd-party solutions or add-ons). In the scenario documented in this Guide, the Application-level resilience is provided through the combination of Windows Server Failover Clustering (WSFC) and the Microsoft SQL Server Always On features. These features enable the Services provided by Microsoft SQL Server to continue to be available (after a brief interruption) even after the original Server providing that Service has become unavailable for any reason. When the original Server fails, WSFC brings up its resources on a surviving Node, usually without any administrative intervention. This, in a nutshell, is Microsoft SQL Server’s “High Availability”.

A Disaster event, in contrast with what we just described above, is a catastrophic failure impacting more than just a component or a Server. Without regard to its severity or duration, a Disaster event can be described as a superset of multiple HA events which cannot be easily overcome by the resilience of a specific application, component or service. Because it is hardly ever transient in nature, the effects of a Disaster event are more impactful, disruptive and destructive for an Enterprise. Also, because multiple layers of the infrastructure are negatively impacted in a Disaster event, recovering from such an event is considerably more difficult, more expensive and slower than recovering from an HA event. Consequently, preparing and planning for recovery from a Disaster event is materially and financially more expensive.

For example, imagine a power loss to a Datacenter. In this event, the outage will affect multiple systems simultaneously. A Business would usually prepare for this event by ensuring power redundancy at the Datacenter. In this scenario, the redundancy will minimize the possibility of a Disaster, and such preparation will also help the Business recover faster, should any of the devices in the Datacenter suffer a failure during the brief interruption. A Disaster is avoided or minimized.

How about a more pervasive scenario like, say, flooding in the Datacenter? The possibility of a complete site-wide Disaster is the primary rationale for Businesses spreading their IT Infrastructure over multiple Datacenters: servers and components in the surviving Datacenter continue to provide services (even if at a much reduced performance level) while the Business scrambles to restore services at the failed site.

As you have probably surmised from the two disaster scenarios described in the preceding paragraphs, while a secondary Datacenter provides a better and faster recovery option, the cost and effort required are significantly higher. “Cost” in this case is not limited to the financial cost of operating multiple Datacenters (rent, power, cooling, staffing, etc.); it also includes the technical and technological considerations for the Applications and components in the Datacenter. Take our referenced clustered 2-Node Microsoft SQL Server instance as an example. One would assume that placing one Node in each Datacenter would help achieve a measure of safety in the event of a site outage. Unfortunately, this is not so.

A 2-Node Microsoft SQL Server cluster spread across two Datacenters provides only an illusion of Disaster preparedness and recovery capabilities. WSFC requires that, in order for a Cluster to be operationally functional, a majority of the Nodes in the Cluster must be available. In a 2-Node configuration where one Node is in a Datacenter that is no longer available, what happens?

Yeah, you guessed it. OK, so how about we make it a 3-Node Cluster? Good idea. But, for this to work for our Disaster Recovery scenario, we will either need to have a 3rd site, or pray that the site with 2 of the Nodes is never going to experience a Disaster event. Since we are getting technical (and not spiritual) here, we can agree that a 3rd site is the better option. With the 3rd site comes additional cost. Well, what about a 4-Node configuration then? Oh, wait… 2 Nodes gone, 2 Nodes left, so… “No Cluster for you!”. Let’s even forget the fact that every additional Node incurs additional licensing cost. This is the point at which your Trusted Advisor will be telling you (in very strong words) that you ought to be scaling out a given SQL Server cluster for capacity, load distribution and performance, not necessarily for Disaster avoidance or recovery.
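For the arithmetically inclined, the Node-counting argument above can be sketched in a few lines of Python. This is a deliberately simplified model of the plain “majority of Nodes” rule described in this post; it assumes one vote per Node and ignores witnesses and any other quorum options a real WSFC deployment may use. The function names and site layouts are ours, made up for illustration, and are not part of any Microsoft or VMware tooling.

```python
def has_quorum(total_nodes: int, surviving_nodes: int) -> bool:
    """Simplified node-majority rule: strictly more than half must survive."""
    return surviving_nodes > total_nodes / 2


def survives_site_loss(nodes_per_site: dict[str, int], lost_site: str) -> bool:
    """Would the cluster keep quorum if every Node in `lost_site` disappeared?"""
    total = sum(nodes_per_site.values())
    surviving = total - nodes_per_site[lost_site]
    return has_quorum(total, surviving)


# Hypothetical layouts matching the scenarios discussed above.
layouts = {
    "2 Nodes, 2 sites": {"A": 1, "B": 1},
    "3 Nodes, 2 sites": {"A": 2, "B": 1},
    "3 Nodes, 3 sites": {"A": 1, "B": 1, "C": 1},
    "4 Nodes, 2 sites": {"A": 2, "B": 2},
}

for name, layout in layouts.items():
    # For each layout, check the loss of each site in turn.
    outcome = {site: survives_site_loss(layout, site) for site in layout}
    print(name, outcome)
```

Under these assumptions, only the 3-site layout keeps a majority no matter which single site is lost; every 2-site layout has at least one site whose loss takes the Cluster down with it.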

You know what? We could just have a “Cold” Datacenter, right? Right… Not!

A “Cold” Datacenter BCDR Plan is one in which a Business has a Datacenter to which it replicates its Business-Critical Application Servers and other Services, but these Servers/Services are not powered on – until and unless there is a Disaster. While this is a very good cost-saving BCDR option an Enterprise can live with (depending on its Service Level Agreement and other contractual/legal obligations), it has proven to be a very inefficient “Solution”, especially in situations where the Plan is actually called upon to save the day. Let’s take our referenced Workloads as an example, again. In a “Cold Site” configuration, we will have copies of our Domain Controllers, SQL Server and Client machine somewhere, waiting to be powered on in the event of a failure at our primary (Hot) site. In a Disaster, we would then (theoretically) power them up in the right order and restore services at the “Cold-but-now-Hot” site, pop a Champagne and celebrate. In. Theory.

In practice, we will be counting on many things to go right, in a predetermined sequence, in order for this to work. Given that we designed this BCDR Plan months or years ago, the heat of the scramble to recover from the Disaster event and restore services is not the right time for us to discover that we needed to have powered on the DCs before we brought up the SQL Servers. And, oh, the front-end App Server shouldn’t have come up yet, either. And, when was the last time we replicated the Servers, and which of the Domain Controllers is the Operations Master?
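To make the ordering problem concrete, here is a minimal sketch of what “power them up in the right order” means once you write it down. It uses Python’s standard-library graphlib module to turn a dependency map into power-on “waves”. The VM names and the dependency map are hypothetical, loosely modeled on the Workloads in this post, and this is not how SRM itself is configured or invoked; it only illustrates the kind of sequencing a recovery plan has to capture ahead of time.

```python
from graphlib import TopologicalSorter

# Hypothetical VM names. Each VM maps to the VMs that must already be
# running before it is powered on (assumed dependencies, for illustration).
depends_on = {
    "dc-01": set(),
    "dc-02": set(),
    "sql-node-1": {"dc-01", "dc-02"},
    "sql-node-2": {"dc-01", "dc-02"},
    "app-server": {"sql-node-1", "sql-node-2"},
    "client": {"app-server"},
}


def power_on_waves(dependencies):
    """Group VMs into waves; every VM in a wave can start once prior waves are up."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(power_on_waves(depends_on), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

With this map, the Domain Controllers land in the first wave, the SQL Server Nodes in the second, then the App Server, and finally the Client. The point is less the code than the discipline: the order is decided, recorded and testable long before anyone’s hair is on fire.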

The Cluster’s Listener address has to change? What does that even mean, and what the double-double is this “VM Generation ID” thing you are talking about? No, my hair is not on fire! You are on fire!

“Cold” Site as a BCDR option saves Enterprises money. On the other hand, it makes their BCDR “Plans” less of a “Plan” and more of a “Prayer”. And this is why some Enterprises fail hard when they have a Disaster event.

VMware Site Recovery Manager (SRM) has been VMware’s proven Solution for Enterprises’ BCDR pains for over 10 years now. SRM does only one thing, and does it exceptionally well: it helps Enterprises protect their virtualized Business-Critical VMs and the Applications they host, configure repeatable workflows for their recovery, and execute the associated recovery plans with just a few clicks. With SRM, you not only get to better automate your BCDR Plans, you also get to test, validate and tweak them, on-demand, without interrupting your Production environment.

Because, with SRM, the protected copies of the Workloads are always in a “Cold” state, SRM also helps Enterprises achieve the cost-minimization requirements of their BCDR Plans, without jeopardizing their recovery success rate. Combined with the benefits and flexibility of any of the VMware Hybrid Cloud offerings (from AWS, Microsoft, Google, IBM, Alibaba, etc.), Customers enjoy the benefits of a “Secondary Site” without the associated cost of operating or managing a physical Datacenter solely for the purpose of implementing an optimal Business Continuity and Disaster Recovery Plan.

The narrative continues in “Protecting and Recovering Mission-Critical Applications in a VMware Hybrid Cloud with Site Recovery Manager”, with a step-by-step walk-through and sample scripts added in as a bonus. Download it, read it, use it, and don’t forget to tell us how you think we can improve it.

Thank you.