This post was written by Roie Ben Haim and Max Ardica, with a special thanks to Jerome Catrouillet, Michael Haines, Tiran Efrat and Ofir Nissim for their valuable input.
The modern data center design is changing, following a shift in the habits of consumers using mobile devices, the number of new applications that appear every day and the rate of end-user browsing which has grown exponentially. Planning a new data center requires meeting certain fundamental design guidelines. The principal goals in data center design are: Scalability, Redundancy and High-bandwidth.
In this blog we will describe the Equal Cost Multi-Path functionality (ECMP) introduced in VMware NSX release 6.1 and discuss how it addresses the requirements of scalability, redundancy and high bandwidth. ECMP has the potential to offer substantial increases in bandwidth by load-balancing traffic over multiple paths as well as providing fault tolerance for failed paths. This is a feature which is available on physical networks but we are now introducing this capability for virtual networking as well. ECMP uses a dynamic routing protocol to learn the next-hop towards a final destination and to converge in case of failures. For a great demo of how this works, you can start by watching this video, which walks you through these capabilities in VMware NSX.
Scalability and Redundancy and ECMP
To keep pace with the growing demand for bandwidth, the data center must meet scale out requirements, which provide the capability for a business or technology to accept increased volume without redesign of the overall infrastructure. The ultimate goal is avoiding the “rip and replace” of the existing physical infrastructure in order to keep up with the growing demands of the applications. Data centers running business critical applications need to achieve near 100 percent uptime. In order to achieve this goal, we need the ability to quickly recover from failures affecting the main core components. Recovery from catastrophic events needs to be transparent to end user experiences.
ECMP with VMware NSX 6.1 allows you to use upto a maximum of 8 ECMP Paths simultaneously. In a specific VMware NSX deployment, those scalability and resilience improvements are applied to the “on-ramp/off-ramp” routing function offered by the Edge Services Gateway (ESG) functional component, which allows communication between the logical networks and the external physical infrastructure.
External user’s traffic arriving from the physical core routers can use up to 8 different paths (E1-E8) to reach the virtual servers (Web, App, DB).
In the same way, traffic returning from the virtual server’s hit the Distributed Logical Router (DLR), which can choose up to 8 different paths to get to the core network.
How is the path determined:
NSX for vSphere Edge Services Gateway device:
When a traffic flow needs to be routed, the round robin algorithm is used to pick up one of the links as the path for all traffic of this flow. The algorithm ensures to keep in order all the packets related to this flow by sending them through the same path. Once the next-hop is selected for a particular Source IP and Destination IP pair, the route cache stores this. Once a path has been chosen, all packets related to this flow will follow the same path.
There is a default IPv4 route cache timeout, which is 300 seconds. If an entry is inactive for this period of time, it is then eligible to be removed from route cache. Note that these settings can be tuned for your environment.
Distributed Logical Router (DLR):
The DLR will choose a path based on a Hashing algorithm of Source IP and Destination IP.
What happens in case of a failure on one of Edge Devices?
In order to work with ECMP the requirement is to use a dynamic routing protocol: OSPF or BGP. If we take OSPF for example, the main factor influencing the traffic outage experience is the tuning of the OSPF timers.
OSPF will send hello messages between neighbors, the OSPF “Hello” protocol is used and determines the Interval as to how often an OSPF Hello is sent.
Another OSPF timer called “Dead” Interval is used, which is how long to wait before we consider an OSPF neighbor as “down”. The OSPF Dead Interval is the main factor that influences the convergence time. Dead Interval is usually 4 times the Hello Interval but the OSPF (and BGP) timers can be set as low as 1 second (for Hello interval) and 3 seconds (for Dead interval) to speed up the traffic recovery.
In the example above, the E1 NSX Edge has a failure; the physical routers and DLR detect E1 as Dead at the expiration of the Dead timer and remove their OSPF neighborship with him. As a consequence, the DLR and the physical router remove the routing table entries that originally pointed to the specific next-hop IP address of the failed ESG.
As a result, all corresponding flows on the affected path are re-hashed through the remaining active units. It’s important to emphasize that network traffic that was forwarded across the non-affected paths remains unaffected.
Troubleshooting and visibility
With ECMP it’s important to have introspection and visibility tools in order to troubleshoot optional point of failure. Let’s look at the following topology.
A user outside our Data Center would like to access the Web Server service inside the Data Center. The user IP address is 192.168.100.86 and the web server IP address is 172.16.10.10.
This User traffic will hit the Physical Router (R1), which has established OSPF adjacencies with E1 and E2 (the Edge devices). As a result R1 will learn how to get to the Web server from both E1 and E2 and will get two different active paths towards 172.16.10.10. R1 will pick one of the paths to forward the traffic to reach the Web server and will advertise the user network subnet 192.168.100.0/24 to both E1 and E2 with OSPF.
E1 and E2 are NSX for vSphere Edge devices that also establish OSPF adjacencies with the DLR. E1 and E2 will learn how to get to the Web server via OSPF control plane communication with the DLR.
From the DLR perspective, it acts as a default gateway for the Web server. This DLR will form an OSPF adjacency with E1 and E2 and have 2 different OSPF routes to reach the user network.
From the DLR we can verify OSPF adjacency with E1, E2.
We can use the command: “show ip ospf neighbor”
From this output we can see that the DLR has two Edge neighbors: 188.8.131.52 and 192.168.100.10.The next step will be to verify that ECMP is actually working.
We can use the command: “show ip route”
The output from this command shows that the DLR learned the user network 192.168.100.0/24 via two different paths, one via E1 = 192.168.10.1 and the other via E2 = 192.168.10.10.
Now we want to display all the packets which were captured by an NSX for vSphere Edge interface.
In the example below and in order to display the traffic passing through interface vNic_1, and which is not OSPF protocol control packets, we need to type this command: “debug packet display interface vNic_1 not_ip_proto_ospf”
We can see an example with a ping running from host 192.168.100.86 to host 172.16.10.10
If we would like to display the captured traffic to a specific ip address 172.16.10.10, the command capture would look like: “debug packet display interface vNic_1 dst_172.16.10.10”
Useful CLI for Debugging ECMP
- To check which ECMP path is chosen for a flow
- debug packet display interface IFNAME
- To check the ECMP configuration
- show configuration routing-global
- To check the routing table
- show ip route
- To check the forwarding table
- show ip forwarding
Useful CLI for Dynamic Routing
- show ip ospf neighbor
- show ip ospf database
- show ip ospf interface
- show ip bgp neighbors
- show ip bgp
ECMP Deployment Consideration
ECMP currently implies stateless behavior. This means that there is no support for stateful services such as the Firewall, Load Balancing or NAT on the NSX Edge Services Gateway. The Edge Firewall gets automatically disabled on ESG when ECMP is enabled. In the current NSX 6.1 release, the Edge Firewall and ECMP cannot be turned on at the same time on NSX edge device. Note however, that the Distributed Firewall (DFW) is unaffected by this.
For more in-depth information, you can also read our VMware® NSX for vSphere (NSX-V) Network Virtualization Design Guide
About The Authors
Roie Ben Haim works as a professional services consultant at VMware, focusing on design and implementation of VMware’s software-defined data center products. Roie has more than 12 years in data center architecture, with a focus on network and security solutions for global enterprises. An enthusiastic M.Sc. graduate, Roie holds a wide range of industry leading certifications including Cisco CCIE x2 # 22755 (Data Center, CCIE Security), Juniper Networks JNCIE – Service Provider #849, and VMware vExpert 2014, VCP-NV, VCP-DCV. Follow his personal blog at http://roie9876.wordpress.com/
Max Ardica is a senior technical product manager in VMware’s networking and security business unit (NSBU). Certified as VCDX #171, his primary task is helping to drive the evolution of the VMware NSX platform, building the VMware NSX architecture and providing validated design guidance for the software-defined data center, specifically focusing on network virtualization. Prior to joining VMware, Max worked for almost 15 years at Cisco, covering different roles, from software development to product management. Max owns also a CCIE certification (#13808).