How VMware IT Automated Network Failover Testing to Deliver 99.99 Percent Availability

by: VMware Director Shared Services Edward Lyons and VMware Senior Manager Shared Service Delivery Parmesh Karthik

In 2015, VMware IT network services began deploying a new global network standard, focused on network resiliency, to more than 70 offices and data centers around the world. Redundant WAN, LAN, WLAN, and security footprints keep network availability unaffected during an outage, because service fails over automatically from the active infrastructure to a standby infrastructure.

The challenge

The challenge was developing a method to test whether the redundant infrastructure works as expected, and performs optimally, when the primary service breaks down.

Applying business continuity planning (BCP) principles to network services, VMware IT created a testing program called network verification failover testing (NVFT).

Introduction to NVFT

Network failover is the ability to automatically and seamlessly switch to a backup or secondary network service to enable business continuity (BC).

To achieve redundancy during an abnormal failure of the active network infrastructure, a standby network infrastructure must always be ready to automatically take over the service.

Scope of failover

Network failover testing is performed in all layers of the network service domain, such as:

  • Private WAN—multiprotocol label switching (MPLS)
  • Public WAN—Internet
  • SD-WAN
  • Firewall security
  • LAN core and LAN DMZ
  • Wireless network
  • Network device power supply units

How failover works

Active-active and active-passive are the most common configurations for high availability (HA). Both improve reliability, but each achieves failover in a different way. In an active-active configuration, both nodes carry traffic at the same time; in an active-passive configuration, the standby takes over only when the active node fails.

Failover testing determines whether the standby (passive) network service can carry the service during critical failures. The test shuts down the primary network infrastructure and validates that the standby performs with no impact on the network service.
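The difference between the two HA configurations, and the basic shape of a failover test, can be illustrated with a minimal sketch. The `Node` and `HAPair` classes, the device names, and the `failover_test` function are hypothetical and exist only for illustration; they do not represent VMware IT's tooling.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    up: bool = True


@dataclass
class HAPair:
    """Two redundant nodes; mode is 'active-passive' or 'active-active'."""
    primary: Node
    standby: Node
    mode: str = "active-passive"

    def serving(self) -> list[str]:
        """Return the names of the nodes currently carrying traffic."""
        healthy = [n for n in (self.primary, self.standby) if n.up]
        if self.mode == "active-active":
            # Every healthy node shares the load.
            return [n.name for n in healthy]
        # Active-passive: only the first healthy node serves traffic.
        return [healthy[0].name] if healthy else []


def failover_test(pair: HAPair) -> bool:
    """Shut down the primary, check that service survives, then restore it."""
    pair.primary.up = False
    try:
        return len(pair.serving()) > 0
    finally:
        pair.primary.up = True


pair = HAPair(Node("fw-a"), Node("fw-b"))  # hypothetical firewall pair
print(pair.serving())        # ['fw-a']
print(failover_test(pair))   # True
print(pair.serving())        # ['fw-a']
```

A real test would also measure how the standby performs under load, not only whether it answers.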

Failover testing requirements

The original NVFT program proved successful for quality control and implementation sign-off; however, it was resource intensive and time consuming. Original test windows ranged from six to eight hours and required resources from:

  • PMO—release manager
  • IT Analytics—managing monitoring and alerting tools
  • NetOps—WAN engineer
  • NetOps—security engineer
  • NetOps—LAN engineer
  • Net Services—systems administrator
  • CET—CET resource for colleague experience testing

Total resource hours: 56 (based on an eight-hour test window)

The target of testing each site annually proved challenging because of the resource requirement, the long downtime window, and the restriction of performing NVFT only on weekends.

Despite the time and resource challenges, NVFT proved to be a key tool for identifying configuration errors and for standardizing network configuration as planned changes were made.

Stages of NVFT

  • Pre-NVFT
  • Core NVFT
  • Post-NVFT

Pre-NVFT

The pre-NVFT stage, also called the NVFT readiness check, examines most of the network's physical connectivity to avoid failures during the core activity.

The pre-activity also audits the complete network service, including the hardware model, the running firmware, and the network configuration, which is how network standardization is accomplished.
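A readiness audit of this kind can be sketched as a comparison between each device's recorded attributes and a site standard. The field names, the `STANDARD` values, and the hostnames below are invented for illustration; the article does not describe how the actual audit is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceRecord:
    hostname: str
    model: str
    firmware: str
    config_template: str


# Hypothetical site standard; real values would come from the network standard.
STANDARD = {
    "model": "core-switch-x",
    "firmware": "1.2.3",
    "config_template": "lan-core-v2",
}


def audit(devices, standard=STANDARD):
    """Return {hostname: [mismatch descriptions]} for non-compliant devices."""
    findings = {}
    for device in devices:
        issues = [
            f"{attr}: expected {expected!r}, found {getattr(device, attr)!r}"
            for attr, expected in standard.items()
            if getattr(device, attr) != expected
        ]
        if issues:
            findings[device.hostname] = issues
    return findings


inventory = [
    DeviceRecord("sw-01", "core-switch-x", "1.2.3", "lan-core-v2"),
    DeviceRecord("sw-02", "core-switch-x", "1.1.0", "lan-core-v2"),
]
for host, issues in audit(inventory).items():
    print(host, issues)
```

Devices that pass produce no entry, so an empty result means the site matches the standard and is ready for the core activity.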

Core NVFT

After the pre-NVFT activity completes successfully, the actual failover is performed on all network layers: the primary network services are failed over to check the availability and performance of the secondary network services, and vice versa. Observations made during this activity are captured for later discussion.
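The core activity described above, failing each side of every layer in turn and recording what happens, might be organized along these lines. The layer list follows the scope given earlier; the `probe` callback and the simulated failure are assumptions standing in for whatever checks a real run would perform.

```python
LAYERS = [
    "Private WAN", "Public WAN", "SD-WAN", "Firewall security",
    "LAN core and LAN DMZ", "Wireless network", "Power supply units",
]


def run_core_nvft(layers, probe):
    """Fail each side of every layer in turn and record the outcome.

    probe(layer, failed_side) must return True if the service stayed up
    while failed_side ('primary' or 'secondary') was shut down.
    """
    observations = []
    for layer in layers:
        for failed_side in ("primary", "secondary"):
            observations.append({
                "layer": layer,
                "failed": failed_side,
                "service_up": probe(layer, failed_side),
            })
    return observations


def simulated_probe(layer, failed_side):
    # Hypothetical result: the wireless standby does not take over.
    return not (layer == "Wireless network" and failed_side == "primary")


results = run_core_nvft(LAYERS, simulated_probe)
failures = [o for o in results if not o["service_up"]]
print(failures)
```

Each failed observation is a candidate entry for the post-NVFT risk register.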

Post-NVFT

This stage focuses on documenting the observations made during the core activity and recording them in the “risk register” for future reference.

The risk register is a repository for all identified risks and includes additional information about each one, such as the nature of the risk, its owner, a reference, and mitigation measures.
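As a rough illustration, a risk register with the fields listed above could be modeled as follows. The `status` field, the class names, and the sample entry are assumptions added for the example and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    nature: str          # what the risk is
    owner: str           # who is responsible for it
    reference: str       # pointer to the observation or ticket
    mitigation: str      # planned or applied mitigation measures
    status: str = "open"  # assumption: not mentioned in the article


class RiskRegister:
    """In-memory repository of risks identified during NVFT."""

    def __init__(self):
        self._risks = []

    def add(self, risk: Risk) -> int:
        """Store a risk and return its 1-based identifier."""
        self._risks.append(risk)
        return len(self._risks)

    def close(self, risk_id: int) -> None:
        self._risks[risk_id - 1].status = "closed"

    def open_risks(self) -> list[Risk]:
        return [r for r in self._risks if r.status == "open"]


register = RiskRegister()
rid = register.add(Risk(
    nature="Wireless standby controller did not take over",
    owner="NetOps LAN engineer",
    reference="hypothetical observation from a core NVFT run",
    mitigation="Correct HA configuration and retest",
))
print(len(register.open_risks()))  # 1
register.close(rid)
print(len(register.open_risks()))  # 0
```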

Figure 1: Stages of network verification failover testing.

Evolution of NVFT

NVFT did not mature overnight; it took time to become a more productive, dependable testing system.

Through continual upgrades that introduced new ideas and methods, failover testing has gone through two phases.

NVFT Phase I

This was the initial period of failover testing, when only the core network services were tested. The process was completely manual and resource intensive, needing up to eight engineers on the call, and the downtime window was nearly eight hours, depending on the office size (small, medium, or large). With these challenges, we completed NVFT at an average of 27 locations per year.

NVFT Phase II

This phase brought a large advance in failover testing by incorporating new ideas. As the network topology grew, all network layers became part of the failover mechanism. The entire activity became fully automated, which lowered the resources required to just two engineers and cut total downtime to 120 minutes. This led to a large increase, to about 130 NVFT events per year.

Results

  • Resources decreased by 71 percent
  • Downtime window decreased by 75 percent
  • Fourfold increase in the number of NVFT events
  • Running NVFT during the work week became possible
  • Additional tests were added, so failover testing now covers the whole network infrastructure
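The headline figures can be cross-checked against the numbers given earlier in the article. The short calculation below assumes that the 71 percent figure compares the seven roles listed for the original test with the two engineers needed after automation, and that about 130 is the new yearly total; the article does not state either basis explicitly.

```python
roles_before = 7          # roles listed for the original test window
window_before_hours = 8   # longest original test window
engineers_after = 2       # engineers needed after automation
window_after_minutes = 120
events_before = 27        # average locations tested per year in Phase I
events_after = 130        # approximate NVFT events per year in Phase II

resource_hours_before = roles_before * window_before_hours
resource_reduction = (roles_before - engineers_after) / roles_before
downtime_reduction = 1 - window_after_minutes / (window_before_hours * 60)
event_ratio = events_after / events_before

print(resource_hours_before)         # 56
print(f"{resource_reduction:.0%}")   # 71%
print(f"{downtime_reduction:.0%}")   # 75%
print(f"{event_ratio:.1f}x")         # 4.8x
```

Under these assumptions the reported percentages are reproduced, and "fourfold" is a slightly conservative rounding of the increase in yearly tests.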

As a result of automation, VMware service delivery hit the target of running an NVFT at each site annually, and we now have a new target of running NVFT at each site every quarter. We also managed to completely remove the resource requirement during the NVFT window, much to the delight of our operations teams.

NVFT enabled VMware IT to ensure that new network standards delivered improved availability, performance, and security, and to reduce the number of Severity 1 incidents. Most importantly, it enables us to deliver a delightful experience to our colleagues by ensuring high network uptime, with availability of 99.99 percent throughout the year.
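To put the 99.99 percent figure in perspective, an availability target can be converted into a yearly downtime budget. The function name is illustrative, and a 365-day year is assumed.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 365) -> float:
    """Minutes of downtime allowed per period at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


print(f"{downtime_budget_minutes(99.99):.2f}")  # 52.56
print(f"{downtime_budget_minutes(99.9):.2f}")   # 525.60
```

At 99.99 percent, the whole year allows less than an hour of unplanned unavailability.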

VMware on VMware blogs are written by IT subject matter experts sharing stories about our digital transformation using VMware products and services in a global production environment.