Network virtualization has come a long way. NSX has played a key role in redefining and modernizing networking in the datacenter. Providing an optimal routing path for traffic has been one of the topmost priorities of Network Architects. Thanks to NSX distributed routing, the routing between different subnets on an ESXi hypervisor can be done in the kernel, and the traffic never has to leave the hypervisor. With NSX-T, we take a step further and extend this network functionality to a multi-hypervisor and multi-cloud environment. NSX-T is a platform that provides Network and Security virtualization for a plethora of compute nodes such as ESXi, KVM, Bare Metal, Public Clouds and Containers.
This blog series will introduce NSX-T Routing & focus primarily on Distributed Routing. In this part, I will explain Distributed Routing in detail with a packet walk between VMs sitting on the same hypervisor and on different hypervisors. In the next parts of this series, I will discuss connectivity to the physical infrastructure, routing features & multi-tenant routing.
Let’s start with a quick reference to NSX-T architecture.
NSX-T has a built-in separation of the Management Plane (NSX-T Manager), Control Plane (NSX-T Controllers) and Data Plane (Hypervisors, Containers etc.). I highly recommend going through the NSX-T Whitepaper for detailed information on the architecture and on the components and functionality of each of the planes.
A couple of interesting points that I want to highlight about the architecture:
NSX-T Manager is decoupled from vCenter and is designed to run across all these heterogeneous platforms.
NSX-T Controllers serve as the central control point for all the logical switches/routers within a network and maintain information about hosts and logical switches/routers.
NSX-T Manager and NSX-T Controllers can be deployed in a VM form factor on either ESXi or KVM.
In order to provide networking to different types of compute nodes, NSX-T relies on a virtual switch called “hostswitch”. The NSX management plane fully manages the lifecycle of this “hostswitch”. This hostswitch is a variant of the VMware virtual switch on ESXi-based endpoints and of Open vSwitch (OVS) on KVM-based endpoints.
Data Plane stretches across a variety of compute nodes: ESXi, KVM, Containers, and NSX-T edge nodes (on/off ramp to physical infrastructure).
Each of the compute nodes is a transport node & will have a TEP (Tunnel End Point). TEPs are the overlay tunnel endpoints, used to encapsulate and decapsulate packets between the hosts. Depending upon the teaming policy, a host could have one or more TEPs.
NSX-T uses GENEVE as the overlay protocol for these TEPs to carry Layer 2 information across Layer 3. GENEVE gives us the flexibility of inserting Metadata as TLV (Type, Length, Value) fields, which could be used for new features. One example of this Metadata is the VNI (Virtual Network Identifier). We recommend an MTU of 1600 to account for the encapsulation header. GENEVE is a standard under development at the IETF. More details on GENEVE can be found in the following IETF Draft. https://datatracker.ietf.org/doc/draft-ietf-nvo3-geneve/
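To see where the extra bytes go, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes a 1500-byte MTU inside the guest, an untagged inner Ethernet frame and an IPv4 transport network, and the constant and function names are my own for illustration. The header sizes are the standard protocol sizes, not NSX-T-specific values.

```python
# Header sizes in bytes (standard protocol sizes; all names are illustrative).
INNER_PAYLOAD_MTU = 1500  # assumed MTU configured inside the guest VM
INNER_ETHERNET = 14       # inner Ethernet header, assuming no VLAN tag
GENEVE_BASE = 8           # fixed GENEVE header, before any TLV options
OUTER_UDP = 8             # outer UDP header
OUTER_IPV4 = 20           # outer IPv4 header without IP options


def required_underlay_mtu(option_bytes: int = 0) -> int:
    """IP MTU the transport network needs to carry one full-size guest frame."""
    return (INNER_PAYLOAD_MTU + INNER_ETHERNET + GENEVE_BASE
            + option_bytes + OUTER_UDP + OUTER_IPV4)


baseline = required_underlay_mtu()   # no GENEVE options
headroom = 1600 - baseline           # bytes left for TLV options at MTU 1600

print(f"MTU needed without options: {baseline}")
print(f"Headroom for TLV metadata at MTU 1600: {headroom}")
```

Under these assumptions, the fixed headers alone push a full-size guest frame past a 1500-byte transport MTU, and 1600 leaves some room for TLV options on top.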
Before we dive deep into routing, let me define a few key terms.
Logical Switch is a broadcast domain which can span across multiple compute hypervisors. VMs in the same subnet would connect to the same logical switch.
Logical Router provides North-South and East-West routing between different subnets & has two components: a Distributed component that runs as a kernel module in the hypervisor, and a Centralized component that takes care of centralized functions like NAT, DHCP, LB and provides connectivity to the physical infrastructure.
Types of interfaces on a Logical Router
Downlink - Interface connecting to a Logical switch.
Uplink - Interface connecting to the physical infrastructure/physical router.
RouterLink - Interface connecting two Logical routers.
Edge nodes are appliances with a pool of capacity to run the centralized services and would be an on/off ramp to the physical infrastructure. You can think of Edge node as an empty container which would host one or multiple Logical routers to provide centralized services and connectivity to physical routers. Edge node will be a transport node just like compute node and will also have a TEP IP to terminate overlay tunnels.
They are available in two form factors: Bare Metal or VM (both leveraging Linux Foundation’s DPDK Technology).
Moving on, let’s also get familiarized with the topology that I will use throughout this blog series.
I have two hypervisors in the above topology, ESXi and KVM. Both of these hypervisors have been prepared for NSX & have been assigned a TEP (Tunnel End Point) IP: 192.168.140.151 for the ESXi host and 192.168.150.152 for the KVM host. These hosts have L3 connectivity between them via the transport network. I have created 3 Logical switches via NSX Manager & have connected a VM to each one of the switches. I have also created a Logical Router named Tenant 1 Router, which is connected to all the logical switches and is acting as the default gateway for each subnet.
Before we look at the routing table, packet walk etc., let’s look at what the configuration looks like in NSX Manager. Here is the switching configuration, showing 3 Logical switches.
Following is the configuration of ports on Tenant 1 Logical Router.
Once configured via NSX Manager, the logical switches and routers are pushed to both the hosts, ESXi and KVM. Let’s validate that on both hosts. Following is the output from ESXi showing the Logical switches and router.
Following is the output from KVM host showing the Logical switches and router.
NSX Controller MAC learning and advertisement
Before we look at the packet walk, it is important to understand how remote MAC addresses are learnt by the compute hosts. This is done via NSX Controllers. As soon as a VM comes up and connects to a Logical switch, the local TEP registers the VM’s MAC address with the NSX Controller. Following output from NSX Controller shows that the MAC addresses of Web VM1, App VM1 and DB VM1 have been reported by their respective TEPs. NSX Controller publishes this MAC/TEP association to the compute hosts depending upon the type of host.
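Conceptually, the controller ends up holding a per-logical-switch table that maps each VM MAC to the TEP behind which it lives. The toy model below is only a sketch of that idea, not the controller's actual data structure or API. The class and method names are hypothetical, and the MAC addresses are made-up placeholders; only the TEP IPs and VM placement come from the topology above.

```python
class ControllerMacTable:
    """Toy model of a per-logical-switch MAC-to-TEP table (hypothetical names)."""

    def __init__(self):
        self._table = {}  # logical switch name -> {vm_mac: tep_ip}

    def report(self, logical_switch, vm_mac, tep_ip):
        """Called when a TEP reports a locally attached VM MAC."""
        self._table.setdefault(logical_switch, {})[vm_mac] = tep_ip

    def lookup(self, logical_switch, vm_mac):
        """Return the TEP IP behind which the MAC lives, or None if unknown."""
        return self._table.get(logical_switch, {}).get(vm_mac)


# Placeholder MACs (locally administered range); the real ones are not shown here.
WEB_VM1_MAC = "02:00:00:00:00:0a"
APP_VM1_MAC = "02:00:00:00:00:14"
DB_VM1_MAC = "02:00:00:00:00:1e"

controller = ControllerMacTable()
controller.report("Web-LS", WEB_VM1_MAC, "192.168.140.151")  # reported by ESXi TEP
controller.report("App-LS", APP_VM1_MAC, "192.168.140.151")  # reported by ESXi TEP
controller.report("DB-LS", DB_VM1_MAC, "192.168.150.152")    # reported by KVM TEP

print(controller.lookup("DB-LS", DB_VM1_MAC))
```

How this table is actually distributed depends on the host type, which this sketch does not model.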
Now, we will look at the communication between VMs on the same hypervisor.
Distributed Routing for VMs hosted on the same Hypervisor
We have Web VM1 and App VM1 hosted on the same ESXi hypervisor. Since we are discussing the communication between VMs on the same host, I am just showing the relevant topology below.
Following is how traffic would go from Web VM1 to App VM1.
Web VM1 (172.16.10.11) sends traffic to the gateway 172.16.10.1, as the destination (172.16.20.11) is in a different subnet. This traffic traverses Web-LS and goes to the Downlink interface of the local distributed router running as a kernel module on the ESXi host.
Routing lookup happens on the ESXi distributed router. The router has 3 logical interfaces (LIFs) & the 172.16.20.0/24 subnet is a Connected route. The packet is put on the LIF connecting to App-LS.
The destination MAC, i.e. the MAC address of App VM1, is needed to forward the frame. In this case, App VM1 is also hosted on the same ESXi host & we do have a valid local ARP entry for App VM1 in the ARP table.
L2 rewrite is done, and the packet is put on App-LS and sent to App VM1.
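The lookup steps above can be sketched in a few lines. This is an illustrative model only, not how the kernel module is implemented: the function names are my own, the ARP entry uses a made-up placeholder MAC, and the /24 prefix for the Web subnet is an assumption (only the App and DB prefixes are stated explicitly).

```python
import ipaddress

# Connected routes of the distributed router, one per LIF.
CONNECTED_ROUTES = {
    "Web-LS": ipaddress.ip_network("172.16.10.0/24"),  # /24 assumed
    "App-LS": ipaddress.ip_network("172.16.20.0/24"),
    "DB-LS": ipaddress.ip_network("172.16.30.0/24"),
}

# ARP table of the distributed router; the MAC is a hypothetical placeholder.
ARP_TABLE = {"172.16.20.11": "02:00:00:00:00:14"}


def route_lookup(dst_ip):
    """Longest-prefix match over the connected routes; returns the egress LIF."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [(net.prefixlen, name)
               for name, net in CONNECTED_ROUTES.items() if dst in net]
    return max(matches)[1] if matches else None


def forward(dst_ip):
    """Return (egress logical switch, destination MAC) for the L2 rewrite."""
    egress = route_lookup(dst_ip)
    if egress is None:
        raise LookupError(f"no route to {dst_ip}")
    dst_mac = ARP_TABLE.get(dst_ip)
    if dst_mac is None:
        raise LookupError(f"no ARP entry for {dst_ip}; ARP resolution needed")
    return egress, dst_mac


print(forward("172.16.20.11"))
```

With only connected routes in the table, the longest-prefix match reduces to finding the one subnet that contains the destination.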
Please note that the packet didn’t have to leave the hypervisor to get routed. This routing happened in the kernel. Now that we understand the communication between two VMs (in different subnets) on the same hypervisor, let’s take a look at the packet walk from Web VM1 (172.16.10.11) on ESXi to DB VM1 (172.16.30.11) hosted on KVM, by sending an ICMP request from Web VM1 to DB VM1.
Distributed Routing for VMs hosted on different Hypervisors (ESXi & KVM)
Web VM1 (172.16.10.11) sends this traffic to its default gateway 172.16.10.1, as the destination (172.16.30.11) is in a different subnet. This traffic traverses Web-LS and goes to the Downlink interface of the local distributed router on the ESXi host.
Routing lookup happens on the ESXi distributed router (DR). The router has 3 logical interfaces (LIFs) & 172.16.30.0/24 is a directly connected route. The packet is put on the LIF connecting to DB-LS. Following output shows the DR on the ESXi host and its routing table.
Destination MAC, i.e. MAC address of DB VM1 is needed to forward the frame. An ARP entry exists for DB VM1. MAC address of DB VM1 is learnt via remote TEP 192.168.150.152. Again, this MAC/TEP association table was published by NSX Controller to the hosts.
ESXi TEP encapsulates the packet and sends it to the remote TEP with an outer Src IP=192.168.140.151, Dst IP=192.168.150.152.
Packet is received at remote KVM TEP 192.168.150.152, where VNI (21386) is matched. MAC lookup is done and packet is delivered to DB VM1 after removing the encapsulation header.
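To make the VNI match a little more concrete, here is a minimal sketch of building and parsing the fixed 8-byte GENEVE base header that sits between the outer UDP header and the inner Ethernet frame. The field layout (version, option length, flags, protocol type, 24-bit VNI, reserved byte), the protocol type 0x6558 for an Ethernet payload and UDP port 6081 follow the GENEVE specification. The function names are my own, and the sketch carries no TLV options, so it does not claim to reproduce what NSX-T actually puts on the wire.

```python
import struct

GENEVE_UDP_PORT = 6081      # destination UDP port assigned to GENEVE
ETHERNET_PAYLOAD = 0x6558   # protocol type: Transparent Ethernet Bridging


def pack_geneve_base(vni, opt_len_words=0):
    """Build the 8-byte GENEVE base header (version 0, O and C flags clear)."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    if not 0 <= opt_len_words < 64:
        raise ValueError("option length must fit in 6 bits")
    first_byte = (0 << 6) | opt_len_words   # 2-bit version, 6-bit option length
    flags_byte = 0                          # O bit, C bit and reserved bits
    vni_word = vni << 8                     # 24-bit VNI followed by 8 reserved bits
    return struct.pack("!BBHI", first_byte, flags_byte, ETHERNET_PAYLOAD, vni_word)


def unpack_vni(header):
    """Extract the VNI from a GENEVE base header."""
    (vni_word,) = struct.unpack("!I", header[4:8])
    return vni_word >> 8


header = pack_geneve_base(21386)   # VNI matched on the KVM TEP in the walk above
print(header.hex())                # 0000655800538a00
print(unpack_vni(header))          # 21386
```

The receiving TEP uses the VNI to pick the logical switch, and only then does the MAC lookup within that switch.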
A quick traceflow validates the above packet walk.
Let’s take a look at how the ICMP reply comes back from DB VM1 to Web VM1.
DB VM1 sends ICMP echo reply for the ICMP request received from 172.16.10.11. Since the destination IP i.e. 172.16.10.11 is in a different subnet, packet is sent to the default gateway i.e. 172.16.30.1. This traffic traverses DB-LS and goes to Downlink interface (LIF) of Local DR on KVM Host.
Routing lookup happens on the KVM distributed router (DR) this time. Packet is put on the LIF connecting to Web LS.
ARP entry exists for Web VM1 and MAC address of Web VM1 is learnt via remote TEP 192.168.140.151 i.e. ESXi Host.
KVM TEP encapsulates the packet and sends it to the remote TEP with a Src IP=192.168.150.152, Dst IP=192.168.140.151.
Packet is received at remote ESXi TEP 192.168.140.151, where VNI (21384) is matched. MAC lookup is done and packet is delivered to Web VM1 after removing the encapsulation header.
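The request and reply walks can be summarized side by side with a small model. It is a sketch under stated assumptions: the function and table names are hypothetical, and the mapping of VNI 21384 to Web-LS and VNI 21386 to DB-LS is inferred from the two walks above rather than read from a configuration. The VNI of App-LS is not shown in this post, so it is left out.

```python
HOST_OF = {"Web VM1": "ESXi", "App VM1": "ESXi", "DB VM1": "KVM"}
SWITCH_OF = {"Web VM1": "Web-LS", "App VM1": "App-LS", "DB VM1": "DB-LS"}
TEP_OF = {"ESXi": "192.168.140.151", "KVM": "192.168.150.152"}
VNI_OF = {"Web-LS": 21384, "DB-LS": 21386}  # inferred from the packet walks


def walk(src_vm, dst_vm):
    """Summarize where a packet is routed and how it is carried to dst_vm."""
    src_host, dst_host = HOST_OF[src_vm], HOST_OF[dst_vm]
    egress_switch = SWITCH_OF[dst_vm]     # the DR puts the packet on this switch
    result = {
        "routed_on": src_host,            # routing happens on the source host
        "egress_switch": egress_switch,
        "encapsulated": src_host != dst_host,
    }
    if result["encapsulated"]:
        result["vni"] = VNI_OF[egress_switch]
        result["outer_src"] = TEP_OF[src_host]
        result["outer_dst"] = TEP_OF[dst_host]
    return result


print(walk("Web VM1", "DB VM1"))   # ICMP request
print(walk("DB VM1", "Web VM1"))   # ICMP reply
print(walk("Web VM1", "App VM1"))  # same host, never leaves the hypervisor
```

Because the source host's DR routes the packet before it is encapsulated, the tunnel carries the VNI of the destination's logical switch, which is why the request and the reply use different VNIs.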
Please note that the ICMP request packet received from Web VM1 (hosted on ESXi) was routed by the local distributed router on ESXi, and the ICMP reply from DB VM1 (hosted on KVM) was routed by the local distributed router on the KVM host. Routing is done closest to the source. This concludes the routing components part of this blog. In the next blog of this series, I will discuss multi-tenant routing and connectivity to physical infrastructure.