by Laxminarayan Jayabharthi, Senior Director, IT Application Development, VMware

Avoiding downtime for business users is our highest priority. So when a catastrophic software failure occurred recently in our dev/test environments, it not only affected a large group of users but also put a highly visible product feature release in jeopardy. Because the failure affected a critical component of our development platform, it could have become a major disaster.

The issue occurred when all VMs running critical applications went down. Since our environment hosts 600 critical APIs for our core business, the business impact was significant. The failure presented two immediate challenges:

  • Restore service to our users
  • Identify and fix the underlying issue that caused the failure

To restore service, we redirected the traffic to a test instance. The test instance was part of a separate software upgrade process, with test instances running in parallel to the production instance. The rerouting enabled users to continue working without service interruption.

At the same time, we began troubleshooting the issue. To recreate the failed environment, we used our integrated developer experience (nicknamed the dev portal), which provides features such as a self-service portal and automated instance provisioning. With a few clicks, we were able to dynamically provision an identical instance from the dev portal’s centrally stored blueprints that contain the latest code, configurations, and data. Because of the blueprints, we were able to create an instance for a highly complex system like this one very quickly (within 22 hours) and with a high degree of stability and quality.

Using this instance, we identified the issue, then developed and tested a fix. We then moved the traffic back to the original instance and destroyed the dynamic instance. We resumed the testing process and successfully upgraded to the latest software version without any further service interruptions. The entire incident was virtually invisible to our users.

This was the first time in recent memory that we had four nodes fail at once. The ability to rapidly create a new twin instance for troubleshooting was critical to our fast incident resolution. In our previous environment, before the private cloud, spinning up a new instance would have taken weeks or months, time we did not have. Although this type of failure is rare, we are now much more confident in our ability to handle it while minimizing the user impact.

VMware on VMware blogs are written by IT subject matter experts sharing stories about IT’s transformation journey using VMware products and services in a global production environment. Visit our portal to learn more.