by Velchamy Sankarlingam, Vice President, Cloud Services Development & Operations, VMware
In this two-part blog, VMware IT shares its perspective on its goal of limiting downtime to one hour for disaster recovery (DR) failovers within the enterprise. The first blog explores the IT management perspective on why limiting the impact of DR is critical to any enterprise. The second blog explores IT’s current DR environment and the valuable lessons learned.
In most enterprises, disaster recovery (DR) is an afterthought, partly because other, more immediate priorities demand attention. A DR plan always exists as a document, but confidence in the team’s ability to execute it is low. DR plans are tested only partially, in bits and pieces. This was acceptable when businesses were not global, did not operate 24×7, and were not exposed to the vulnerabilities we face now.
VMware took a similar approach with our legacy systems. The opportunity to change came when we were deploying SAP HANA as our ERP platform. Our engineers needed to treat DR as part of the design, not as an afterthought. One of the mandates given to the team was to perform an actual failover to our DR site every quarter and run from the DR site for the weekend.
Since this idea was new, we met resistance from all sides over its technical feasibility and the risk to the business. Our philosophy was that an untested DR plan is not really a DR plan. The more we tested the DR plan, the more DR muscle we would build.
We designed our DR architecture with Dell EMC RecoverPoint and VMware Site Recovery Manager. (Read more about the technical architecture in this blog.) Our initial target was a three-hour maintenance window to fail over, including all the business validation, and three hours to fail back. We had to get business agreement for this maintenance window for the first four quarters after rollout. The ultimate target was to cut the failover and failback times to less than an hour.
Various teams have asked us to skip a failover test for a quarter because of other priorities. We have made it a strict policy never to skip a test. If we skip a quarter, confidence will be lower the next quarter, which may lead to skipping another quarter and eventually sliding back to the status quo. We have successfully performed the failover and failback every quarter since we started. Our failover and failback times are now trending toward an hour.
We had a major infrastructure failure a few months back. The only management decision the team needed to make was whether to fail over to DR. Once the decision was made, the DR failover executed like clockwork. We faced some issues during the failover, but the team had the bandwidth to deal with them because they were so comfortable with the entire DR failover process.
Skepticism about DR within the business is completely gone. The DR team is proud of its achievement, and every new project that goes live now has a DR plan. DR has become part of our DNA.
Read about IT’s quarterly DR testing and the lessons learned in this blog.
VMware on VMware blogs are written by IT subject matter experts sharing stories about IT’s digital transformation journey using VMware products and services in a global production environment. Visit our portal to learn more.