Even the Best-Laid Disaster Recovery Plans Can Produce Lessons Learned

by: Senior Director–IT Application Operations Varinder Kumar and VMware IT Disaster Recovery Senior Manager Lalit Parashari

VMware IT Prepares for the Unthinkable

VMware understands that emergencies such as weather events, fires, data center failures, cyberattacks and other crises are inevitable. An effective disaster recovery (DR) plan reduces both the risk to an organization and the impact of an emergency, so it is critical to have one in place.

With the support of our leadership and the use of the right technology for each specific requirement, DR planning and testing are now part of our DNA, with plans in place for every IT application.

Most organizations have a documented DR plan. However, there can be trepidation about executing it. The VMware IT philosophy is that a DR plan is not formal unless tested regularly. Therefore, one of our DR key performance indicators (KPIs) is to conduct an actual failover to our DR site every quarter. To keep confidence levels high, we have a strict policy never to skip a quarterly test.

DR Plan Governance and Process

DR requires a straightforward plan and strict governance. The main steps are:

Identify DR stakeholders and executive sponsors for IT and business units
Categorize each application based on its criticality
Define roles and responsibilities within IT teams and business stakeholders
Define availability standards and architectures for applications and services
Define a test strategy
Establish recovery targets: identify RTO and RPO SLAs
Measure and track progress toward goals
Audit compliance
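As an illustration of the "establish recovery targets" and "measure and track" steps, here is a minimal Python sketch that compares measured DR test results against per-tier RTO and RPO targets. The tier names, target values, and application names are hypothetical; the article does not publish VMware IT's actual SLAs or tooling.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical targets per criticality tier (not VMware IT's real SLAs).
TIER_TARGETS = {
    "tier1": {"rto": timedelta(hours=1), "rpo": timedelta(minutes=15)},
    "tier2": {"rto": timedelta(hours=8), "rpo": timedelta(hours=1)},
    "tier3": {"rto": timedelta(hours=24), "rpo": timedelta(hours=4)},
}


@dataclass
class DrTestResult:
    app: str
    tier: str
    measured_rto: timedelta  # time taken to restore service during the test
    measured_rpo: timedelta  # data-loss window observed during the test


def audit(results):
    """Return (app, reason) pairs for every target missed in a DR test."""
    findings = []
    for r in results:
        target = TIER_TARGETS[r.tier]
        if r.measured_rto > target["rto"]:
            findings.append((r.app, "RTO exceeded"))
        if r.measured_rpo > target["rpo"]:
            findings.append((r.app, "RPO exceeded"))
    return findings


# Hypothetical results from one quarterly test.
results = [
    DrTestResult("identity", "tier1", timedelta(minutes=40), timedelta(minutes=5)),
    DrTestResult("reporting", "tier2", timedelta(hours=10), timedelta(minutes=30)),
]
print(audit(results))  # [('reporting', 'RTO exceeded')]
```

Keeping the targets in one table per tier, rather than per application, mirrors the idea of categorizing each application by criticality first and deriving its SLAs from that category.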

DR Test Methodologies that Scale

At VMware, we use active-active, active-passive, and active-standby architectures to test our DR plan. The strategy chosen for each application depends on its criticality.

Bubble Network for DR Testing (Nondisruptive DR Test)

One of the biggest challenges facing business continuity teams is regularly testing the disaster recovery plan with minimal-to-no downtime to the business. VMware IT developed nonintrusive, comprehensive disaster recovery testing using a “bubble network” for various critical applications. In this scenario, IT built an isolated test network for multitier applications using VMware NSX® and VMware Site Recovery Manager™.

During the testing, there is no impact to the production environment because the bubble network is isolated and, with the help of VMware Horizon®, it is accessible by business and IT users for end-to-end testing.

DR tests on enterprise applications are performed quarterly in this environment without disrupting their corresponding production environments. In addition, service virtualization enables DR teams to validate that end-to-end business processes function in the test environment as they do in production. See Figure 1.

Figure 1. VMware NSX design for a nondisruptive DR test.

Active-Active Scenario

Some mission-critical applications, such as identity and authentication for multiple software-as-a-service (SaaS) offerings by VMware, are deployed in multiple data centers. Additionally, most of the microservices deployed on various cloud-native platforms (VMware Cloud Foundation™, VMware Tanzu® Application Service™ and VMware Tanzu® Kubernetes Grid™) are also running active-active across two different data centers. Application traffic is pinned to a single data center, but if there is a glitch in the primary site, traffic is automatically routed to the second data center. See Figure 2. This helped us achieve a 99.99 percent availability SLA for mission-critical applications.

Figure 2. Multiple data center setup for applications.
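The pin-then-fail-over behavior described above can be sketched in a few lines of Python. This is only the selection logic under stated assumptions: the data center names and the health map are hypothetical, and in practice this decision is typically made by a global load balancer running its own health checks, which the article does not detail.

```python
def choose_data_center(preferred, health):
    """Return the data center that should receive traffic.

    preferred: ordered list of data center names; the first is the pinned site.
    health: dict mapping a data center name to True (healthy) or False.
    """
    for dc in preferred:
        # A site missing from the health map is treated as unhealthy.
        if health.get(dc, False):
            return dc
    raise RuntimeError("no healthy data center available")


# Hypothetical site names.
sites = ["dc-primary", "dc-secondary"]

normal = choose_data_center(sites, {"dc-primary": True, "dc-secondary": True})
failed = choose_data_center(sites, {"dc-primary": False, "dc-secondary": True})
print(normal, failed)  # dc-primary dc-secondary
```

Because both sites already run the application, failing over is only a routing change, which is what makes the automatic switch possible.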

Active-Passive Scenario

Some applications have a traditional DR setup in which the application and the database are installed in both the primary and DR data centers, and replication is enabled through the application or database. This setup also reduced downtime during operating system and application patching.

Disaster Strikes

There are always lessons to be learned from an IT outage. In one instance, our IT operations team received outage alerts in our primary data center on a Friday evening. Our servers were offline.

Our core infrastructure teams were engaged immediately and started troubleshooting the problem. The storage team determined that there were issues with the underlying storage arrays, and team members pinpointed a single storage array failure.

Despite our best efforts, we couldn’t recover the storage array, resulting in a complete data loss. There were numerous critical applications hosted on the storage array, including:

Active Directory domain controllers for authentication
DNS for name resolution
VMware vCenter Server® for managing the production VMware vSphere® clusters

In total, approximately 200 applications were impacted.

Of the approximately 200 applications, about 20 had application-level DR in place and were able to fail over immediately. As stated above, only one array was impacted, resulting in a partial DR. Because VMware IT performs application-level DR testing regularly, the team was able to make the DR decision extremely quickly.

Our DR strategy is designed for full data center loss. However, in this case, we had a partial data center loss, which helped identify a few gaps:

While we run quarterly bubble DR tests for all SRM-based applications, we had never tested partial-failure scenarios. We also found specific VLAN extension/mapping gaps between the two data centers, which lengthened overall DR recovery time.
For some applications, the application and database tiers were running on different storage arrays. As a result, we had latency issues when part of an application was running in the primary DC and the rest was running in the secondary DC.
A few core services, like AD domain controllers and DNS, were adversely impacted because these services were not distributed widely across multiple storage arrays.
Even though we had local standby nodes configured, they did not help because the active and standby nodes used the same storage.
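Several of these gaps come down to where each node's storage lives, so they can be checked ahead of time from an inventory. The Python sketch below flags services whose nodes all sit on one storage array and lists which services a given array failure would touch. The inventory rows, service names, and array names are hypothetical and are not taken from VMware IT's environment.

```python
from collections import defaultdict

# Hypothetical inventory rows: (service, node role, storage array).
placements = [
    ("dns", "active", "array-1"),
    ("dns", "standby", "array-1"),
    ("active-directory", "active", "array-1"),
    ("active-directory", "standby", "array-2"),
    ("erp", "app-tier", "array-2"),
    ("erp", "db-tier", "array-3"),
]


def arrays_by_service(rows):
    """Map each service to the set of storage arrays its nodes use."""
    result = defaultdict(set)
    for service, _role, array in rows:
        result[service].add(array)
    return result


def single_array_services(rows):
    """Services whose every node sits on one array (a single point of failure)."""
    grouped = arrays_by_service(rows)
    return sorted(s for s, arrays in grouped.items() if len(arrays) == 1)


def services_hit_by(rows, failed_array):
    """Services with at least one node on the failed array (partial or full impact)."""
    return sorted({service for service, _role, array in rows if array == failed_array})


print(single_array_services(placements))        # ['dns']
print(services_hit_by(placements, "array-1"))   # ['active-directory', 'dns']
```

A service that appears in the second list but not the first is a candidate for the partial-failure scenario described above: some of its nodes survive while others must fail over, so it is worth including in test plans.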

As evidenced by our recent IT outage, lessons learned from regularly testing DR plans are a powerful tool for finding innovative solutions that minimize downtime and data loss, mitigate single points of failure, and promote customer retention and profitability. In addition, a strong recovery plan removes last-second decision-making during a disaster. VMware IT recognizes that proper planning prevents poor performance by technology and its users.

