How VMware Manages Disaster Recovery - VMware on VMware Blogs

by: VMware Sr. Director of Application Operations Varinder Kumar and VMware Sr. Manager for Disaster Recovery Lalit Parashari

Disaster recovery, or simply DR, is how an organization restores access and functionality to its IT infrastructure after a human error or natural disaster. The topic is often sidelined or left out of the design conversation when a solution is architected. It generally comes into play only when Day 2 operations are addressed, but at VMware, we do it differently.

VMware Sr. Director of Application Operations Varinder Kumar said that 10 years ago, VMware was no different. But now DR is in our DNA. Right from the drawing board, we raise the topic of DR for any new tech additions or updates. Whenever a new application or technology is added to our IT portfolio, our DR governance team ensures, as part of our production readiness assessment (PRA) program, that all DR processes (architecture, testing, monitoring, etc.) are in place before the application goes live. Our change management team won’t approve the production go-live ticket unless it has been approved by the DR governance team. Any exception needs to be approved by our senior leadership.

DR Testing Process

VMware Sr. Manager for Disaster Recovery Lalit Parashari explained quarterly DR testing at VMware. He said that a DR plan without testing is of no use. At VMware, our DR strategy doesn’t exist only on paper; we prove that it works today and will work when it’s actually required. Frequent DR tests also ensure that all involved teams are trained, aware and have the competency needed to execute our IT DR plan flawlessly.

Many enterprises struggle to meet disaster recovery requirements due to the cost, effort and downtime required for implementation and testing. Scheduling the downtime to run disaster recovery tests on business-critical applications is the primary challenge. To meet this challenge, VMware IT developed a comprehensive approach to allow nondisruptive DR testing of entire production applications, including end-to-end application validation in an isolated environment—without downtime.

Our most critical applications, such as the ones supporting our SaaS business, have an active-active DR setup, and we even run two DR sites for them, which gives those applications a recovery time objective (RTO) of a few seconds.

For some business applications, where we can afford downtime, we perform a real DR test and keep the application running in the DR site for more than 24 hours, so that real business transactions happen on the DR site.

How VMware IT handled a DR scenario

As an operations team, we always hope that we never need to execute DR, but hope is not the best strategy when you are supporting mission-critical applications for multibillion-dollar businesses. In May 2022, we had a scenario in which we had to execute real DR. We had an issue with one of the main storage arrays and the vendor was unable to recover the data. Since we test our DR on a quarterly basis, we did not hesitate to activate DR for our applications. Within a few hours, we were able to run our applications in our secondary data center (DC).

Learning from real life

The leadership at VMware made the decision to activate DR without any hesitation. We were able to recover our services quickly, and few of our business teams faced any disruption to their services. Having said that, this DR scenario also gave us a learning opportunity:

While we do quarterly bubble DR for all Site Recovery Manager-based applications, we had never tested partial scenarios. During this exercise, we found specific VLAN extension/mapping gaps between the two DCs, which increased overall DR recovery time.
For a single application, the application and DB tiers were running on different storage arrays. As a result, we had latency issues when part of the application was running in the primary DC and the rest was running in the secondary DC.
A few core services were impacted because they were not distributed widely across multiple storage arrays.
Even though we had local standby configured, we were impacted because the same storage was used for both active and standby nodes.
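
Two of the storage gaps above lend themselves to an automated placement check. The following is a minimal sketch, assuming an inventory that records which storage array backs each node; the Node fields, the role names "active" and "standby", and the function name are hypothetical and do not reflect VMware IT's actual tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    app: str            # application name (hypothetical inventory field)
    tier: str           # e.g. "app" or "db"
    role: str           # assumed to be "active" or "standby"
    storage_array: str  # array backing this node's disks


def find_storage_gaps(nodes):
    """Return findings for split-tier and shared-array placements."""
    active_arrays = defaultdict(set)                   # app -> arrays of active nodes
    role_arrays = defaultdict(lambda: defaultdict(set))  # (app, tier) -> role -> arrays
    for n in nodes:
        if n.role == "active":
            active_arrays[n.app].add(n.storage_array)
        role_arrays[(n.app, n.tier)][n.role].add(n.storage_array)

    findings = []
    # Active tiers on different arrays: one array failure forces a partial failover.
    for app, arrays in sorted(active_arrays.items()):
        if len(arrays) > 1:
            findings.append(f"{app}: active tiers span arrays {sorted(arrays)}")
    # Active and standby on the same array: the standby fails with the active node.
    for (app, tier), roles in sorted(role_arrays.items()):
        shared = roles["active"] & roles["standby"]
        if shared:
            findings.append(f"{app}/{tier}: active and standby share {sorted(shared)}")
    return findings


if __name__ == "__main__":
    inventory = [
        Node("billing", "app", "active", "array-1"),
        Node("billing", "db", "active", "array-2"),
        Node("billing", "db", "standby", "array-2"),
    ]
    for finding in find_storage_gaps(inventory):
        print(finding)
```

A check like this could run as part of each quarterly test so that placement drift is caught before a real storage failure exposes it.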

This DR scenario was a blessing in disguise because it exposed gaps that we wouldn’t normally see during routine DR tests.

Future plans from a DR perspective

By running frequent DR tests, we learn many lessons and get new ideas for reducing RTO. We are working toward automating all manual tasks required to fail over application services, so that we can avoid human error and further reduce RTO.

This experience clearly shows the importance of DR and why every organization should have a viable DR plan for every critical application. But DR also comes with a cost, so how can that cost be optimized when deciding on a DR strategy?

There is no easy answer, but if you are using VMware Site Recovery Manager (SRM), you can use DR compute for any nonproduction workload. At VMware, we have a lot of nonproduction workloads running in our secondary DC, and we actively use DR compute to run them. During DR testing or an actual DR situation, we shut down our nonproduction workloads. Another option is to use cloud-based DR, such as DR as a service (DRaaS), to run DR using VMware Cloud on AWS. This helped us to reduce costs. For other applications, where we are running application-based DR, we run a few app/DB nodes in DR and expand the capacity only during the DR.
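
The idea of reclaiming DR compute from nonproduction workloads can be sketched as a simple capacity plan. This is an illustrative example only: the Workload fields, the vCPU figures and the priority scheme are assumptions, and the sketch stops workloads selectively until the DR demand fits, which is one possible policy rather than a description of how VMware IT or SRM does it.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    vcpus: int
    priority: int  # hypothetical: lower number is shut down first


def plan_shutdown(nonprod, free_vcpus, dr_required_vcpus):
    """Pick nonproduction workloads to stop until the DR demand fits."""
    to_stop = []
    available = free_vcpus
    for w in sorted(nonprod, key=lambda w: (w.priority, w.name)):
        if available >= dr_required_vcpus:
            break
        to_stop.append(w.name)
        available += w.vcpus
    if available < dr_required_vcpus:
        # Even with every nonproduction workload stopped, DR would not fit.
        raise ValueError(
            f"short by {dr_required_vcpus - available} vCPUs after stopping all workloads"
        )
    return to_stop


if __name__ == "__main__":
    nonprod = [
        Workload("staging", 48, 3),
        Workload("dev-ci", 32, 1),
        Workload("qa-perf", 64, 2),
    ]
    print(plan_shutdown(nonprod, free_vcpus=16, dr_required_vcpus=100))
```

Raising an error when the demand cannot be met makes the capacity shortfall visible during a DR test instead of during a real failover.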

Lastly, we can’t leave automation behind. By implementing various automated startup/shutdown scripts, we reduced the number of resources required during a disaster recovery operation. In the end, always look at the ROI; it will help you decide what type of DR needs to be implemented for each application.
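
One building block of automated startup/shutdown scripts is working out the order in which services must be started and stopped. The sketch below derives that order from a dependency map using Python's standard graphlib module (Python 3.9 or later); the service names and the depends_on structure are hypothetical, and this is not a description of VMware IT's scripts.

```python
from graphlib import TopologicalSorter


def startup_order(depends_on):
    """Return services ordered so each one starts after its dependencies.

    depends_on maps a service name to the set of services it needs.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Stop services in the reverse of the startup order."""
    return list(reversed(startup_order(depends_on)))


if __name__ == "__main__":
    # Hypothetical three-tier application with a cache.
    deps = {
        "web": {"app"},
        "app": {"db", "cache"},
        "db": set(),
        "cache": set(),
    }
    print("start:", startup_order(deps))
    print("stop: ", shutdown_order(deps))
```

Encoding the dependencies as data rather than as a hand-written runbook removes one source of human error during failover.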

The topic continues to evolve, so contact your account team to schedule a briefing with a VMware IT expert to hear the latest. For more about how VMware IT addresses queries related to modern apps, check out more blogs on the topic. For other questions, contact vmwonvmw@vmware.com.

We look forward to hearing from you.

VMware on VMware blogs are written by IT subject matter experts sharing stories about our digital transformation using VMware products and services in a global production environment. To learn more about how VMware IT uses VMware products and technology to solve critical challenges, visit our microsite, read our blogs and IT Performance Annual Report and follow us on SoundCloud, Twitter and YouTube. All VMware trademarks and registered marks (including logos and icons) referenced in the document remain the property of VMware.
