vSAN

Base NVM Express – Part One

Written by Murali Rajagopal, PhD, VMware Storage Architect, Office of the CTO

There is much NVM Express™ (NVMe™) literature publicly available, especially surrounding SSDs, mostly originating from device manufacturers. This blog provides a broad overview of the NVM Express™ technology, its ecosystem, the NVM Express™ specifications, devices, and interfaces. It is brief by design. Web links provide pointers for the curious reader to dig into more details.

This is a four-part blog. In Part 1, I begin with an overview of the Base NVM Express™ technology; in Part 2, I describe NVM Express™ over Fabrics (NVMe-oF); in Part 3, I provide an overview of NVMe readiness, including the NVMe specifications; and in Part 4, I follow up with an overview of the NVM Express™ Management Interface (NVMe-MI™). The architectural overviews are based on the NVM Express™ specifications developed by the NVM Express™ organization (Standards). I also provide a summary of the current state of the Base, Fabrics, and Management Interface specifications and sprinkle the blog with the current level of support from hardware and software vendors.

Base NVM Express™ Architectural Overview

NVM Express™ (NVMe™) is an interface specification optimized for solid-state storage in both client and enterprise storage systems utilizing the PCI Express® (PCIe®) interface. An NVMe host uses PCIe to access one or more NVMe Solid State Drives (SSDs). An NVMe SSD consists of a PCIe host interface, an SSD controller (e.g., a processor with firmware), and non-volatile memory (e.g., NAND).

NVMe’s host interface does not use a host bus adapter (HBA) of the kind found in other storage technologies such as SAS or SATA; this was an architectural choice to reduce cost.

Note – While the omission of an NVMe HBA was architected to reduce costs, in hindsight the inclusion of an HBA could have been beneficial. Although there is no SCSI-like hardware HBA for NVMe drives, HBA functions such as queuing I/Os and handling interrupts still exist inside the driver, and some SCSI-like HBA functions now reside in the NVMe controller. However, the lack of an HBA means that storage bus events such as drive insertion and removal must now be handled by the OS, the PCIe bus driver, or the BIOS. Not having a dedicated control point may lead to these events being handled with varying degrees of success, and possibly to unreliable support for LED management. One solution to this problem is Intel Volume Management Device (VMD) technology: it places a control point in the PCIe root complex of servers powered by Intel Xeon Scalable processors and acts much like an HBA does for SAS and SATA, providing error management, surprise hot-plug, and LED management.

An NVMe driver in the host utilizes NVMe-specified memory-mapped I/O (MMIO) controller registers and system DRAM for I/O Submission Queues (SQs) and Completion Queues (CQs). The interface supports highly parallel operation: theoretically about 64K I/O SQs and CQs, where each queue can hold about 64K outstanding commands. Practically speaking, most devices support only a fraction of these; SSD vendors have indicated current support for anywhere from 16 to 256 queues, with queue depths ranging from 32 to 2048.
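To make the queue mechanics concrete, here is a minimal C sketch (not a working driver) of how a host might place a 64-byte command into an I/O SQ in system DRAM and then write that queue’s tail doorbell, one of the MMIO controller registers. The structure names and queue depth are illustrative; the doorbell offset follows the stride formula in the Base specification.

```c
#include <stdint.h>
#include <string.h>

#define SQE_SIZE  64        /* submission queue entry size in bytes */
#define DB_BASE   0x1000    /* doorbell registers start at offset 1000h */

typedef struct { uint8_t bytes[SQE_SIZE]; } sqe_t;

struct sq {
    sqe_t    *entries;      /* queue memory allocated in host DRAM */
    uint16_t  tail;         /* host-owned tail index */
    uint16_t  depth;        /* number of entries, e.g. 256 */
    uint16_t  qid;          /* queue identifier */
};

/* SQ y tail doorbell offset = 1000h + (2y * (4 << CAP.DSTRD)) */
static volatile uint32_t *sq_tail_doorbell(uint8_t *regs, uint16_t qid,
                                           uint32_t dstrd)
{
    return (volatile uint32_t *)(regs + DB_BASE +
                                 (2u * qid) * (4u << dstrd));
}

/* Copy a 64-byte command into the SQ, then notify the controller. */
static void submit_cmd(struct sq *q, const sqe_t *cmd,
                       uint8_t *regs, uint32_t dstrd)
{
    memcpy(&q->entries[q->tail], cmd, SQE_SIZE);
    q->tail = (uint16_t)((q->tail + 1) % q->depth);    /* wrap at queue depth */
    *sq_tail_doorbell(regs, q->qid, dstrd) = q->tail;  /* MMIO doorbell write */
}
```

The controller later fetches the command, executes it, posts a 16-byte completion to the paired CQ, and the host acknowledges by writing the CQ head doorbell.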

NVMe uses a small (relative to SCSI) number of optimized commands and command completions. A command uses a fixed-size 64-byte data structure, and a completion uses a fixed-size 16-byte data structure. There are two types of commands in NVMe: Admin commands and I/O commands. Admin commands are sent to the Admin Queue (a single SQ/CQ pair), and I/O commands are sent to I/O queues (each of which has its own SQ/CQ pair, or is part of a structure in which one CQ handles completions for I/O commands submitted via multiple SQs).
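For reference, the C sketch below shows these two fixed-size structures. The field names follow the submission and completion queue entry layouts in the Base specification, but the struct names are mine and the layout is simplified (for example, the data pointer is shown as two PRP entries without distinguishing the SGL case).

```c
#include <stdint.h>

/* 64-byte submission queue entry: common fields; the meaning of
 * CDW10-15 is command-specific and depends on the opcode. */
struct nvme_sqe {
    uint8_t  opcode;   /* Admin or I/O command opcode */
    uint8_t  flags;    /* FUSE and PSDT fields */
    uint16_t cid;      /* command identifier, unique within the SQ */
    uint32_t nsid;     /* namespace the command applies to */
    uint32_t cdw2;
    uint32_t cdw3;
    uint64_t mptr;     /* metadata pointer */
    uint64_t prp1;     /* data pointer: PRP entry 1 (or SGL descriptor) */
    uint64_t prp2;     /* data pointer: PRP entry 2 */
    uint32_t cdw10;    /* command-specific dwords 10..15 */
    uint32_t cdw11;
    uint32_t cdw12;
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};

/* 16-byte completion queue entry. */
struct nvme_cqe {
    uint32_t result;   /* command-specific result (dword 0) */
    uint32_t reserved;
    uint16_t sq_head;  /* current head pointer of the associated SQ */
    uint16_t sq_id;    /* SQ the completed command was submitted to */
    uint16_t cid;      /* identifier of the command being completed */
    uint16_t status;   /* phase tag (bit 0) plus status field */
};

_Static_assert(sizeof(struct nvme_sqe) == 64, "SQE must be 64 bytes");
_Static_assert(sizeof(struct nvme_cqe) == 16, "CQE must be 16 bytes");
```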

Figure 1 illustrates the host-side NVMe architecture. It consists of a host-side interface and a target-side controller. The controller arbitrates queue servicing using Round Robin (RR), Weighted Round Robin (WRR), strict priority levels, and so on; a rough sketch of RR arbitration follows Figure 1 below. Most SSD vendors implement RR today, while a small number have indicated WRR support in the future.


Figure 1 Basic NVMe Architecture
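As a rough illustration of the RR arbitration mentioned above, the following C sketch has the controller visit each I/O submission queue in turn and fetch up to an arbitration-burst number of commands per visit. The queue count, burst value, and helper functions are hypothetical stand-ins for controller internals.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_IO_SQ  4   /* illustrative number of I/O submission queues */
#define ARB_BURST  2   /* commands fetched per queue per visit */

/* Hypothetical helpers standing in for controller internals. */
bool sq_has_commands(uint16_t qid);
void fetch_one_command(uint16_t qid);

/* One pass of round-robin arbitration over the I/O submission queues
 * (queue ID 0 is the Admin SQ; this sketch covers only the I/O SQs). */
void rr_arbitrate(void)
{
    for (uint16_t qid = 1; qid <= NUM_IO_SQ; qid++)
        for (int n = 0; n < ARB_BURST && sq_has_commands(qid); n++)
            fetch_one_command(qid);
}
```

WRR adds per-priority-class weights on top of this loop, so higher-priority queues are visited more often.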

Although the figure shows the I/O SQs and CQs having an affinity to cores, this is not required, and implementations may differ in whether I/O SQs and CQs have a fixed affinity with cores. Also, SQs may share a CQ (the Core C queues in the figure). Note that each NVMe controller has its own Admin Queue (SQ/CQ pair) and one or more I/O queues (SQ/CQ pairs and/or structures in which a single CQ is paired with multiple SQs), as illustrated in the example in Figure 2. In this example, a host interfaces across PCIe with three NVMe controllers, each of which has one NVMe SSD (not shown) attached to it.

Figure 2: Example Host interface with 3 NVMe Controllers

The NVMe namespace is a key concept described in the Base NVMe specification (see “What is an NVMe Namespace?” below). Namespaces may be shared across controllers and hosts. NVMe multi-path I/O is another key concept described in the Base NVMe specification, although both namespace sharing and multi-path I/O are primarily applicable to NVMe over Fabrics (NVMe-oF); in the future, PCIe switches may also make namespace sharing and multi-path I/O relevant in non-fabric configurations. Both namespace sharing and multi-path I/O require that the NVM subsystem contain at least two controllers. We’ll revisit the NVM subsystem, multi-path I/O, and namespace sharing after an overview of NVMe-oF in Part 2.

NVMe physical form factors include M.2, U.2 (2.5-inch drive), and PCIe Add-In Card (AIC).

What is an NVMe Namespace?

An NVMe namespace is a storage volume organized into logical blocks numbered from 0 to one less than the size of the namespace (LBA 0 through n-1) and backed by some capacity of non-volatile memory. Thin provisioning and deallocation of capacity are supported, so the non-volatile memory capacity backing a namespace may be less than the size of the namespace.
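This size/capacity distinction maps directly onto the leading fields of the Identify Namespace data structure. The C sketch below shows only those first three fields; the struct name is mine, and the real structure is 4096 bytes with many more fields.

```c
#include <stdint.h>

/* Leading fields of the Identify Namespace data structure (abbreviated). */
struct nvme_id_ns_head {
    uint64_t nsze;   /* Namespace Size: total logical blocks, LBA 0..nsze-1 */
    uint64_t ncap;   /* Namespace Capacity: blocks that may be allocated */
    uint64_t nuse;   /* Namespace Utilization: blocks currently allocated */
};

/* With thin provisioning, ncap (and nuse) can be smaller than nsze. */
```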

A namespace ID (NSID) is an identifier used by a controller to provide access to a namespace (a handle to the namespace). An NVMe controller may support multiple namespaces, each referenced by its NSID. EUI64 (8 bytes), NGUID (16 bytes), and UUID (16 bytes) are globally unique namespace identifiers defined in the Base specification.
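As an illustration, each of these identifiers is reported in a namespace identification descriptor returned by the Identify command. The C sketch below approximates that descriptor’s layout; the struct name is mine, and the NID field is shown as a fixed 16 bytes for simplicity, whereas the specification sizes it by NIDL.

```c
#include <stdint.h>

/* Approximate layout of one Namespace Identification Descriptor. */
struct ns_id_descriptor {
    uint8_t nidt;         /* identifier type: 1 = EUI64, 2 = NGUID, 3 = UUID */
    uint8_t nidl;         /* identifier length in bytes: 8, 16, or 16 */
    uint8_t reserved[2];
    uint8_t nid[16];      /* the identifier; EUI64 uses only the first 8 bytes */
};
```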

Namespaces may be created and deleted using the Namespace Management command, and attached to or detached from controllers using the Namespace Attachment command.
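For example, on a Linux host the admin command passthrough interface can be used to issue a Namespace Attachment command (opcode 15h) to attach an existing namespace to a controller. This is a minimal sketch that assumes the linux/nvme_ioctl.h passthrough API, omits error handling, and uses illustrative NSID and controller ID values.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
    int fd = open("/dev/nvme0", O_RDWR);       /* controller character device */

    uint16_t ctrl_list[2048] = { 0 };          /* 4 KiB controller list buffer */
    ctrl_list[0] = 1;                          /* one controller ID follows */
    ctrl_list[1] = 0;                          /* attach to controller ID 0 */

    struct nvme_admin_cmd cmd = { 0 };
    cmd.opcode   = 0x15;                       /* Namespace Attachment */
    cmd.nsid     = 1;                          /* namespace to attach */
    cmd.cdw10    = 0;                          /* SEL = 0: controller attach */
    cmd.addr     = (uint64_t)(uintptr_t)ctrl_list;
    cmd.data_len = sizeof(ctrl_list);

    return ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
}
```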

Note: I’ll be covering NVMe™ over Fabrics in Part 2.

If you’re attending VMworld 2018, please check out the breakout sessions on NVMe and the NVMe over Fabrics demos with VMware partners on the show floor.