NVMe Over Fabrics Architectural Overview
NVMe™ over Fabrics replaces the PCIe transport with a fabric technology such as an RDMA or Fibre Channel (FC) fabric, as shown in Figure 3. RDMA transports include RoCE (RDMA over Converged Ethernet), InfiniBand, and iWARP. Native TCP (non-RDMA) transport is also possible (the NVMe/TCP transport binding was still work in progress as of July 2018).
Figure 3: RDMA and FC Fabric NVMe Architecture
The NVM subsystem (see note below) shown in Figure 3 is a collection of one or more physical fabric interfaces (ports), with each individual controller usually attached to a single port. Multiple controllers may share a port. Although the ports of an NVM subsystem are allowed to support different NVMe transports, in practice a single port is likely to support only a single transport type.
Note: An NVM subsystem includes one or more controllers, one or more namespaces, one or more PCI Express ports, a non-volatile memory storage medium, and an interface between the controller(s) and non-volatile memory storage medium. Figure 4 shows an example of an array consisting of an NVM subsystem attached via FC fabric to 3 hosts.
Figure 4: Example array consisting of NVM subsystem attached via Fabric to 3 Hosts
In general, an NVM subsystem presents a collection of one or more NVMe controllers (up to about 64K) which are used to access namespaces associated with one or more hosts through one or more NVM subsystem ports (also up to 64K). In practice, the number of subsystem controllers and the number of subsystem ports tend to be very small.
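The containment relationships described above (subsystem → ports, controllers, and namespaces, with each controller attached to one port) can be sketched as a small data model. This is purely illustrative; the class and field names are not from the NVMe specifications, and real subsystems track far more state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Namespace:
    nsid: int    # namespace ID
    nguid: str   # globally unique identifier (illustrative value)

@dataclass
class Controller:
    cntlid: int     # controller ID is a 16-bit value, hence the ~64K ceiling
    port_id: int    # the single subsystem port this controller is attached to
    attached_nsids: Set[int] = field(default_factory=set)

@dataclass
class NVMSubsystem:
    nqn: str                                      # subsystem NVMe Qualified Name
    ports: Set[int] = field(default_factory=set)  # port IDs (up to 64K)
    controllers: List[Controller] = field(default_factory=list)
    namespaces: Dict[int, Namespace] = field(default_factory=dict)

# Example: one subsystem, two ports, two controllers sharing namespace 2.
subsys = NVMSubsystem(nqn="nqn.2018-07.example:subsys1", ports={1, 2})
subsys.namespaces[2] = Namespace(nsid=2, nguid="0x1111")
subsys.controllers.append(Controller(cntlid=1, port_id=1, attached_nsids={2}))
subsys.controllers.append(Controller(cntlid=2, port_id=2, attached_nsids={2}))

# Controllers that can reach NSID 2 — both, so the namespace is shared.
sharers = [c.cntlid for c in subsys.controllers if 2 in c.attached_nsids]
```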
NVMe over Fabrics (NVMe-oF) builds on the base NVMe architecture, i.e., the command set and queuing interface. In addition to Admin and I/O Commands, it also supports Fabrics Commands. It differs from the Base Specification in some distinct ways (e.g., it disallows interrupts).
Note: See the NVMe over Fabrics 1.0 Specification for a complete list of differences between NVMe over Fabrics and the NVMe Base Specification.
A controller is associated with exactly one host at a time, whereas a port may be shared – NVMe allows hosts to connect to multiple controllers in the NVM subsystem through the same port or different ports.
NVMe-oF supports Discovery Services. Using the discovery mechanism, a host may obtain a list of NVM subsystems with namespaces accessible to the host, including the ability to discover multiple paths to an NVM subsystem. The NVMe Identify Admin Command is used to determine the namespaces for a controller.
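The discovery mechanism returns a log of entries, each naming a subsystem (by its NQN) reachable at a particular transport address; seeing the same NQN in more than one entry is how a host recognizes multiple paths. The sketch below models this grouping in Python. The entry values are made up, and the field names (`subnqn`, `trtype`, `traddr`) loosely mirror Discovery Log Page fields but are not an exact representation of the on-wire format.

```python
from collections import defaultdict

# Hypothetical discovery log entries returned to a host.
log_entries = [
    {"subnqn": "nqn.2018-07.example:subsys1", "trtype": "rdma", "traddr": "192.168.1.10"},
    {"subnqn": "nqn.2018-07.example:subsys1", "trtype": "rdma", "traddr": "192.168.2.10"},
    {"subnqn": "nqn.2018-07.example:subsys2", "trtype": "fc",   "traddr": "nn-0x200100110d000000"},
]

# Group entries by subsystem NQN.
paths_by_subsystem = defaultdict(list)
for entry in log_entries:
    paths_by_subsystem[entry["subnqn"]].append((entry["trtype"], entry["traddr"]))

# Subsystems that appear under more than one address have multiple paths.
multipath = {nqn: p for nqn, p in paths_by_subsystem.items() if len(p) > 1}
```

Here `subsys1` is reachable over two RDMA addresses, so the host knows it has two paths to that subsystem before it ever connects.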
Multi-Path I/O, Namespace Sharing, Multi-host Connectivity and Reservations
As noted earlier, the NVMe specifications support both multi-path I/O and namespace sharing. While these are distinct concepts, it is convenient to describe them together as they are somewhat interrelated when it comes to multi-host namespace access and especially when NVMe Reservations are used. The following provides a brief description of these concepts along with the system requirements imposed on the NVM subsystem and host connectivity.
Namespace sharing refers to the ability of two or more hosts to access a common namespace using different NVMe controllers. Namespace sharing requires that the NVM subsystem contain two or more controllers.
Figure 5 shows an example where two NVMe controllers are attached via two NVM subsystem ports; in this example, Namespace B (NS B) is shared by both controllers. The NVMe Compare & Write fused operation could be used to coordinate access to the shared namespace. Controllers associated with a shared namespace may operate on the namespace concurrently. A globally unique identifier or the namespace ID (NSID) associated with the namespace itself may be used to determine when there are multiple paths to the same shared namespace (see Box (Part 1) 'What is an NVMe Namespace?'). An NVM subsystem is not required to have the same namespaces attached to all controllers. In Figure 5, only Namespace B is shared and attached to both controllers.
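The identification step described above can be sketched as follows: a host collects Identify Namespace data through each controller and matches namespaces on their globally unique identifier. The controller names and NGUID values below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical Identify Namespace results, keyed by (controller, NSID).
identify_results = {
    ("ctrl1", 2): {"nguid": "ABCD0000000000000000000000000001"},
    ("ctrl2", 2): {"nguid": "ABCD0000000000000000000000000001"},
    ("ctrl1", 1): {"nguid": "ABCD0000000000000000000000000002"},
}

# Group the (controller, NSID) pairs by globally unique identifier.
paths_by_nguid = defaultdict(list)
for (ctrl, nsid), info in identify_results.items():
    paths_by_nguid[info["nguid"]].append((ctrl, nsid))

# An identifier seen through more than one controller marks a shared
# namespace with multiple paths to it.
shared = {g: paths for g, paths in paths_by_nguid.items() if len(paths) > 1}
```

In this sketch the first NGUID shows up through both controllers, matching the Figure 5 scenario where only Namespace B is shared.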
Note – Current NVMe specifications do not specify namespace sharing across NVM subsystems. This is being addressed in the draft NVMe 1.4 Specifications (see box ‘NVMe Base Specification Roadmap’ and Dispersed Namespaces).
Figure 5: Example NVM subsystem with dedicated Port Access to Shared Namespace
NVMe multi-path I/O refers to two or more completely independent paths between a single host and a namespace. Each path uses its own controller, although multiple controllers may share a subsystem port. Multi-path I/O, like namespace sharing, requires that the NVM subsystem contain two or more controllers. In the example shown in Figure 6, Host A has two paths via Controller 1 and Controller 2. The NVMe Standards Technical Committee is currently working on a draft specification for multi-path I/O (see box 'NVMe Base Specification Roadmap' and ANA).
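One motivation for independent paths is failover: if the path through one controller fails, the host can retry the same I/O through the other. The sketch below is a minimal, hypothetical model of that behavior — the path functions simply stand in for submitting a read over each controller's own queue pair, and none of the names come from any NVMe API.

```python
class PathError(Exception):
    """Stands in for a transport or controller failure on one path."""

def read_block(paths, lba):
    # paths: list of callables, one per controller; try each path in
    # turn and fall back to the next on failure.
    last_err = None
    for submit in paths:
        try:
            return submit(lba)
        except PathError as err:
            last_err = err  # this path failed; try the next controller
    raise last_err

# Controller 1's path is down; Controller 2 serves the read.
def via_ctrl1(lba):
    raise PathError("port 1 unreachable")

def via_ctrl2(lba):
    return f"data@{lba}"

result = read_block([via_ctrl1, via_ctrl2], lba=42)
```

Policies beyond simple ordered failover (e.g., preferring ANA-optimized paths) are what the draft multi-path specification mentioned above addresses.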
Multi-Host Connectivity and Reservations
NVMe Reservations are functionally similar to SCSI-3 Persistent Reservations and may be used by two or more hosts to coordinate access to a shared namespace. An NVMe Reservation on a namespace restricts host access to that namespace. For example, VMware ESXi (when supported by the driver) could use NVMe Reservations to support Microsoft Windows Server Failover Clustering with VMs.
An NVMe Reservation requires an association between a host and a namespace. Each controller in a multi-path I/O and namespace sharing environment is associated with exactly one host as shown in the example in Figure 6. A host may be associated with multiple controllers by registering the same Host ID with each controller it is associated with.
Note: To uniquely identify a host, the controller may support one of two Host Identifier formats: (1) a 64-bit Host Identifier, or (2) an extended 128-bit Host Identifier. NVMe over Fabrics requires the extended 128-bit format.
In the example shown in Figure 6, Host A is associated with 2 controllers, while Host B is associated with a single controller. The Host Identifier (e.g., HostID A) allows the NVM subsystem to identify controllers associated with the same host (e.g., Host A) and preserve reservation properties across these controllers.
Figure 6: Multi-host access to shared Namespace
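The key point above — that reservation state is keyed by Host ID rather than by controller — can be illustrated with a toy model. This is not the specification's reservation state machine; the class, its methods, and the simplified Write Exclusive-style check are assumptions made for the sketch, and `uuid4().hex` merely stands in for a 128-bit Host Identifier.

```python
import uuid

class SharedNamespace:
    """Toy model of per-namespace reservation state kept by the subsystem."""
    def __init__(self):
        self.registrants = {}  # host_id -> registration key
        self.holder = None     # host_id currently holding the reservation

    def register(self, host_id, key):
        # Reached via any controller; keyed by Host ID, not by controller.
        self.registrants[host_id] = key

    def acquire(self, host_id):
        if host_id not in self.registrants:
            raise PermissionError("host must register before acquiring")
        if self.holder not in (None, host_id):
            raise PermissionError("reservation held by another host")
        self.holder = host_id

    def write_allowed(self, host_id):
        # Simplified Write Exclusive semantics: only the holder may write.
        return self.holder is None or self.holder == host_id

ns_b = SharedNamespace()
host_a = uuid.uuid4().hex  # stands in for Host A's 128-bit Host Identifier
host_b = uuid.uuid4().hex

# Host A registers via Controller 1, then acquires via Controller 2;
# since both controllers carry the same Host ID, the subsystem sees one
# host, and the reservation holds across both of Host A's paths.
ns_b.register(host_a, key=0xA)
ns_b.acquire(host_a)
```

After this sequence, writes from Host A succeed on either path, while writes from Host B are rejected — the reservation property Figure 6 illustrates.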
Note: I’ll be covering NVMe Readiness in Part 3.