
Up and Running. How VMware IT Automated the Disaster Recovery Failover/Failback Process 

by VMware Sr. Director of IT Application Operations Varinder Kumar and VMware Senior Manager of IT Applications Operations Lalit Parashari

 

Disaster recovery (DR) is critical to ensure business continuity in modern enterprise operations. Traditional DR plans often rely on manual intervention, a process that is generally time-consuming, error-prone, and inefficient.  

When a business operation is disrupted due to a disaster, every minute of downtime can lead to significant financial losses and damage to your organization’s reputation. Manual DR processes involve multiple steps that require careful coordination and execution by the IT team—actions that may not always be feasible under the pressure of a crisis. 

At VMware, we developed a single-click automated DR solution that incorporates an end-to-end failover/failback process for mission-critical applications.  

Why automation? 

It might seem logical in this day and age to automate as much as possible throughout the enterprise. But like any new technology, DR automation needs to justify its implementation beyond the “it’s cool” aesthetic. The following are some of the distinct advantages VMware IT found that more than justified the investment. 

Speed 

DR automation helps in the fast recovery of systems and data in the event of a disaster. By automating the failover and failback steps, the automation portal can rapidly execute all predefined steps. This is crucial to minimizing downtime and achieving the recovery time objective (RTO) service-level agreement (SLA). Manual recovery processes, on the other hand, can be time-consuming and error-prone. 
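To make the idea of executing predefined steps against an RTO concrete, here is a minimal Python sketch. It is not the portal's actual implementation: the step names, the `Step` and `RunResult` types, the `run_failover` function, and the four-hour RTO are all hypothetical and chosen only for illustration. The sketch runs steps in a fixed order, stops at the first failure, and reports whether the run finished within the RTO.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str                    # hypothetical step label
    action: Callable[[], None]   # raises an exception on failure


@dataclass
class RunResult:
    completed: List[str]         # names of steps that finished
    failed_step: Optional[str]   # first step that raised, if any
    elapsed_seconds: float
    within_rto: bool


def run_failover(steps: List[Step], rto_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> RunResult:
    """Run steps in order, stop at the first failure, compare elapsed time to the RTO."""
    start = clock()
    completed: List[str] = []
    failed: Optional[str] = None
    for step in steps:
        try:
            step.action()
        except Exception:
            failed = step.name
            break
        completed.append(step.name)
    elapsed = clock() - start
    ok = failed is None and elapsed <= rto_seconds
    return RunResult(completed, failed, elapsed, ok)


# Usage with placeholder actions; a real step would call infrastructure APIs.
log: List[str] = []
steps = [
    Step("freeze writes on primary", lambda: log.append("freeze")),
    Step("promote replica at DR site", lambda: log.append("promote")),
    Step("repoint application traffic", lambda: log.append("repoint")),
]
result = run_failover(steps, rto_seconds=4 * 60 * 60)
```

Because the order is encoded once and replayed identically on every run, the same definition can serve both DR tests and a real failover.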

Consistency 

Since recovery procedures are executed the same way time after time, automation enhances reliability and reduces the likelihood of mistakes. Even with repetitive tasks, human error is still a factor. For example, the stage actor Richard Harris performed the show Camelot night after night for many years, yet during one performance he stopped out of the blue and told the audience he had forgotten his line.

Human independence 

As mentioned, the human factor can be the weakest link in the chain. Our team removed the dependence on live personnel by executing the failover through the DR automation portal, which allows a quick response to any disaster without relying on human intervention. This is important during weekends and off-hours, or when skilled technical staff may not be immediately available.

Lower cost 

Even though implementing DR automation requires an initial investment, the long-term savings outweigh it. Automation reduces both human effort and the downtime of critical systems, which translates directly into cost savings because systems are brought back online faster, regardless of the disaster at hand.

How we made it work 

It’s important to conduct a requirements assessment with all the key stakeholders and to get their full buy-in. We knew we had to carve out a well-defined project plan, as well as evaluate potential tools and platforms for development. Additionally, we ran a pilot project before starting a full-scale development effort, beginning with a small application with few dependencies.

Specifics include:

Finding the right tool 

Our team encountered several challenges when evaluating the right tools necessary to automate DR. These included the complexity of the portal (and how to simplify it), integration, ability to scale, a superior user experience (UX), security and compliance, data management, and ease of monitoring and maintenance. 

After we completed a thorough vetting process, the following technologies/tools were chosen. Some were already in use within the VMware ecosystem, making integration a seamless undertaking.

  • Angular and VMware Clarity design for building the user interface (UI)  
  • Node.js for developing the APIs that handle requests from the UI
  • MongoDB on VMware Tanzu® Kubernetes Grid™ as a persistent store for DR schedules, access, authorization, and audit data

Figure 1 outlines this architecture. 


Figure 1: VMware automated DR architecture

Cross-team stakeholder buy-in 

We collaborated with all affected stakeholders, such as infrastructure teams (Linux, storage, network services, etc.) and application owners, to develop a DR automation project plan that would work for everyone. While it was not an easy process, every stakeholder ultimately came on board once they understood the benefits of DR automation for their job function.

In-demand features 

During any DR test, the DR automation portal gives stakeholders detailed insight and a holistic view of progress. Its framework allows DR procedures to be executed with a single click. It also includes tools for managing and preserving the DR test schedule. A notification system is built into the portal to periodically update stakeholders on the DR plan’s schedule. Users can quickly access DR playbooks for comprehensive instructions. The portal also uses role-based access control to guarantee security and keeps track of login audits for accountability and monitoring. Additionally, we deployed the same application in multiple data centers to increase availability.
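As a rough illustration of how role-based access control and an audit trail can fit together, here is a minimal Python sketch. It makes several assumptions that do not come from the portal itself: the role names, the permission names, and the `Portal` and `AuditEntry` types are hypothetical, and the sketch audits every authorization decision, whereas the text above only mentions login audits.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set

# Hypothetical role-to-permission mapping, for illustration only.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "viewer": {"view_playbook"},
    "app_owner": {"view_playbook", "schedule_test"},
    "dr_operator": {"view_playbook", "schedule_test", "execute_failover"},
}


@dataclass
class AuditEntry:
    timestamp: datetime
    user: str
    action: str
    allowed: bool


@dataclass
class Portal:
    user_roles: Dict[str, str]                       # user name -> role name
    audit_log: List[AuditEntry] = field(default_factory=list)

    def authorize(self, user: str, action: str) -> bool:
        """Check the user's role and record the decision, allowed or denied."""
        role = self.user_roles.get(user, "")
        allowed = action in ROLE_PERMISSIONS.get(role, set())
        self.audit_log.append(
            AuditEntry(datetime.now(timezone.utc), user, action, allowed)
        )
        return allowed


# Usage: only the operator role may trigger a failover; every attempt is logged.
portal = Portal(user_roles={"alice": "dr_operator", "bob": "viewer"})
alice_ok = portal.authorize("alice", "execute_failover")
bob_ok = portal.authorize("bob", "execute_failover")
```

Recording denied attempts alongside allowed ones is a design choice made for this sketch, since both are useful for accountability reviews.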

By investing in the right tools, thorough planning, and continuously testing and updating your DR automation processes, you can safeguard your business against unexpected disasters and avoid severe financial loss.  

There’s a lot more to this topic than is presented here. That’s why we encourage you to contact your account team to schedule a briefing with us. No sales pitch, no marketing. Just straightforward peer conversations revolving around your company’s unique requirements. 

VMware on VMware blogs are written by IT subject matter experts sharing stories about our digital transformation using VMware products and services in a global production environment. To learn more about how VMware IT uses VMware products and technology to solve critical challenges, visit our microsite, read our blogs and IT Performance Annual Report, and follow us on SoundCloud, Twitter, and YouTube. All VMware trademarks and registered marks (including logos and icons) referenced in the document remain the property of VMware.