By Varinder Kumar, IT Director and Lalit Parashari, IT Disaster Recovery Manager
VMware understands that emergencies such as weather events, fire, data center failures, cyber attacks, and other crises are inevitable. An effective disaster recovery (DR) plan reduces the risk and impact of such events for any organization, so it is important to have one in place.
With the support of our leadership and the use of the right technology for specific requirements, VMware IT has developed an effective DR strategy over the last several years. DR planning and testing are now part of our DNA, with a plan in place for every IT application.
Most organizations have a documented DR plan. However, there can be trepidation about executing it. At VMware IT, our philosophy is that a DR plan is not formal unless it is tested regularly. In fact, one of our DR key performance indicators (KPIs) is to perform an actual failover to our DR site every quarter. To keep confidence levels high, we have a strict policy to never skip a quarterly test.
DR Test Objectives
For DR teams, the operational objectives are to validate DR plans, meet recovery time objective (RTO) and recovery point objective (RPO) service level agreements (SLAs) defined by the business, reduce the RTOs by automating redundant tasks, and enable administrators to perform the DR tests. For the business, the objectives are to keep RTOs/RPOs low and to ensure consistent, end-to-end business process testing and audit compliance for enterprise resiliency.
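As a rough illustration of validating RTO and RPO SLAs after a test, the sketch below compares measured values against agreed targets. The class names, field names, and the sample targets (a 4-hour RTO and a 15-minute RPO) are hypothetical and are not taken from VMware IT's tooling.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTarget:
    """SLA agreed with the business for one application (hypothetical values)."""
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of data loss


@dataclass
class DrTestResult:
    """Values measured during one DR test."""
    recovery_time: timedelta     # time from failover start until service is restored
    data_loss_window: timedelta  # age of the newest replicated data at failover


def evaluate(target: RecoveryTarget, result: DrTestResult) -> dict:
    """Compare measured values against the SLA, one verdict per objective."""
    return {
        "rto_met": result.recovery_time <= target.rto,
        "rpo_met": result.data_loss_window <= target.rpo,
        "rto_headroom": target.rto - result.recovery_time,
    }


# Hypothetical example: the test recovered in 2.5 hours with 20 minutes of data loss.
target = RecoveryTarget(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
result = DrTestResult(
    recovery_time=timedelta(hours=2, minutes=30),
    data_loss_window=timedelta(minutes=20),
)
report = evaluate(target, result)
print(report)
```

In this made-up run the RTO is met with 90 minutes to spare, while the RPO is missed, which is the kind of gap a regular test is meant to surface.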
DR Plan Governance and Process
Disaster recovery requires a straightforward plan and strict governance. The main steps are:
- Identify DR stakeholders and executive sponsors for both IT and business units
- Categorize each application based on its criticality
- Clearly define roles and responsibilities within IT teams and business stakeholders
- Define availability standards and architectures for applications and services
- Define a test strategy
- Establish recovery targets: identify RTO and RPO SLAs
- Measure and track progress toward goals
- Audit compliance
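One way to make these steps trackable is to keep a small record per application and flag any application whose last DR test is older than a quarter. The following is a minimal sketch under stated assumptions: the record fields, the sample applications, and the 92-day approximation of a quarter are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class AppDrRecord:
    """Governance data for one application (all fields are illustrative)."""
    name: str
    tier: str            # criticality category assigned to the application
    owner: str           # accountable stakeholder
    rto_hours: float     # recovery time objective
    rpo_minutes: float   # recovery point objective
    last_test: date      # date of the most recent DR test


def overdue_for_test(
    records: List[AppDrRecord],
    today: date,
    max_age: timedelta = timedelta(days=92),  # assumed length of a quarter
) -> List[str]:
    """Return the names of applications not tested within max_age."""
    return [r.name for r in records if today - r.last_test > max_age]


# Hypothetical inventory with made-up dates and targets.
records = [
    AppDrRecord("erp", "essential", "finance-it", 4, 15, date(2021, 1, 15)),
    AppDrRecord("dns", "core", "network-it", 1, 0, date(2021, 5, 10)),
]
overdue = overdue_for_test(records, today=date(2021, 6, 1))
print(overdue)
```

A report like this could feed the "measure and track" and "audit compliance" steps, since it shows at a glance which plans have gone untested.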
VMware IT—Production Computing Scale
VMware IT runs across three data centers (including VMware Cloud on AWS), with 26+ vCenters, 1,200+ ESXi hosts, 7,650+ virtual machines, and storage comprising 2.2+ PB of vSAN, 1.25+ PB of Fibre Channel, and 3+ PB of NFS. We also have 500+ applications hosted in our data centers. We categorized the IT-hosted (on-premises) applications based on business priority and criticality. These categories are:
- Core services (70+): Services that are the backbone of the infrastructure, such as Active Directory (AD), Lightweight Directory Access Protocol (LDAP), Domain Name System (DNS), firewall, and network
- Essential applications (120+): Customer and revenue-impacting applications, such as Enterprise Resource Planning (ERP) and Finance applications
- Standard applications (140+): Applications that are less critical and help optimize business tasks
- Discretionary applications (100+): Monitoring and various tool applications
DR Test Methodologies That Scale
At VMware, we use active-active, active-passive, and active-standby architectures to test our DR plan. The strategy chosen for each application is based on its criticality.
Bubble Network for DR Testing (nondisruptive DR test)
One of the biggest challenges facing business continuity teams is regularly testing the disaster recovery plan with minimal-to-no downtime for the business. VMware IT developed nonintrusive, comprehensive disaster recovery testing using a “bubble network” for various critical applications. In this scenario, we built an isolated test network for multitier applications using VMware NSX and Site Recovery Manager (SRM). During testing, there is no impact on the production environment because the bubble network is isolated, and with the help of VMware Horizon, it is accessible to business and IT users for end-to-end testing.
DR tests on enterprise applications are performed in this environment quarterly without any disruption to their corresponding production environments. Service virtualization enables DR teams to validate that end-to-end business processes in the test environment function the same as they do in the production environment.
Disaster Recovery as a Service (DRaaS)—DR Site on VMware Cloud on AWS
A few of our internet applications (including VMworld.com) run in a hybrid cloud. With this methodology, all application VMs are protected by Site Recovery Manager and replicated through vSphere to VMware Cloud on AWS (VMC). During DR testing, the VMs are recovered in the DR site.
Some mission-critical applications, such as identity and authentication for multiple software-as-a-service (SaaS) offerings by VMware, are deployed in multiple data centers and the services are always up and running in both data centers.
Additionally, most of the microservices, which are deployed on various cloud-native platforms (Pivotal Cloud Foundry, VMware Tanzu Application Service, and VMware Tanzu Kubernetes Grid), also run active-active across two different data centers. Application traffic is pinned to a single data center, but if there is a glitch in the primary site, traffic is automatically routed to the second data center. This helped us achieve a 99.99% availability SLA for mission-critical applications.
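The pinned-with-automatic-failover behavior described above can be sketched as a simple site-selection rule. This is an illustrative model only: the function name, the site names, and the health map are hypothetical, and a real deployment would rely on a load balancer or global traffic manager rather than application code.

```python
from typing import Dict, Optional


def pick_site(primary: str, secondary: str, healthy: Dict[str, bool]) -> Optional[str]:
    """Return the site that should receive traffic.

    Traffic stays pinned to the primary while it passes its health check
    and moves to the secondary only when the primary is unhealthy.
    """
    if healthy.get(primary, False):
        return primary
    if healthy.get(secondary, False):
        return secondary
    return None  # both sites are down, so there is nowhere to route


# Hypothetical site names and health states.
normal = pick_site("dc-a", "dc-b", {"dc-a": True, "dc-b": True})
failover = pick_site("dc-a", "dc-b", {"dc-a": False, "dc-b": True})
print(normal, failover)
```

For context, a 99.99% availability target leaves roughly 52.6 minutes of downtime per year (0.0001 × 525,600 minutes), which is why the failover has to be automatic rather than manual.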
Some of the applications have a traditional DR setup, in which the applications and the database are installed on primary and DR data centers, and the replication is enabled through the application or database. This also helped us reduce downtime during operating system and application patching.
Benefits of DR Testing That Scales
One of the most significant benefits of regular DR testing that can scale is continuous improvement and validation. Frequent testing ensures that our support teams are well versed in DR procedures and are prepared to quickly address a real disaster. As the IT environment expands, DR capacity is scaled along with the production environment, ensuring that the environments remain consistent. Because business teams take part in every regular DR test, the business becomes more and more confident in the IT DR plan.