Performance and Use Cases of VMware DirectPath I/O for Networking


VMware DirectPath I/O is a technology, available in vSphere 4.0 and later, that leverages hardware support (Intel VT-d and AMD-Vi) to allow guests to directly access hardware devices. In the case of networking, a VM with DirectPath I/O can directly access the physical NIC instead of using an emulated (vlance, e1000) or a para-virtualized (vmxnet, vmxnet3) device. While both para-virtualized devices and DirectPath I/O can sustain high throughput (beyond 10 Gbps), DirectPath I/O can additionally save CPU cycles in workloads with very high packet rates (say, more than 50,000 packets per second). However, DirectPath I/O does not support many features, such as physical NIC sharing, memory overcommit, vMotion and Network I/O Control. Hence, VMware recommends using DirectPath I/O only for workloads with very high packet rates, where the CPU savings from DirectPath I/O may be needed to achieve the desired performance.

DirectPath I/O for Networking

VMware vSphere 4.x provides three ways for guests to perform network I/O: device emulation, para-virtualization and DirectPath I/O. A virtual machine using DirectPath I/O interacts directly with the network device through the device's own guest driver. The vSphere host (running ESX or ESXi) is only involved in virtualizing the interrupts of the network device. In contrast, a virtual machine (VM) using an emulated or para-virtualized device (referred to as virtual NIC or virtualized mode henceforth) interacts with a virtual NIC that is completely controlled by the vSphere host. The vSphere host handles the physical NIC interrupts, processes packets, determines the recipient of each packet and copies it into the destination VM, if needed. The vSphere host also mediates packet transmissions over the physical NIC.

In terms of network throughput, a para-virtualized NIC such as vmxnet3 matches the performance of DirectPath I/O in most cases. This includes being able to transmit or receive 9+ Gbps of TCP traffic with a single virtual NIC connected to a 1-vCPU VM. However, DirectPath I/O has some advantages over virtual NICs such as lower CPU costs (as it bypasses execution of the vSphere network virtualization layer) and the ability to use hardware features that are not yet supported by vSphere, but might be supported by guest drivers (e.g., TCP Offload Engine or SSL offload). In the virtualized mode of operation, the vSphere host completely controls the virtual NIC and hence it can provide a host of useful features such as physical NIC sharing, vMotion and Network I/O Control. By bypassing this virtualization layer, DirectPath I/O trades off virtualization features for potentially lower networking-related CPU costs. Additionally, DirectPath I/O needs memory reservation to ensure that the VM’s memory has not been swapped out when the physical NIC tries to access the VM’s memory.

VMware’s Performance Review of DirectPath I/O vs. Emulation

VMware used the netperf [1] microbenchmark to plot the gains of DirectPath I/O as a function of packet rate. For the evaluation, VMware used the following setup:

  • SLES11-SP1 VM on vSphere 4.1. vSphere was running on a dual socket Intel E5520 processor (@2.27 GHz) with a Broadcom 57711 10GbE NIC as the physical NIC.
  • A native Linux machine was used as the traffic source or sink.
  • UDP_STREAM benchmark of netperf, along with the burst and interval functionality to send or receive packets at a controlled rate.
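
As a rough illustration of how burst and interval pacing sets the offered packet rate: netperf sends a burst of messages and then waits for the interval before sending the next burst, so the target rate is approximately the burst size divided by the interval. The sketch below computes that rate and assembles a command line. It assumes a netperf build configured with --enable-intervals, the global -b (burst size) and -w (interval, taken here to be in milliseconds) options, and one message per packet. The host name, message size and pacing values are placeholders, not the settings used in this evaluation.

```python
def paced_packet_rate(burst_size, interval_ms):
    """Approximate offered rate in packets per second for burst/interval pacing.

    Assumes each message fits in a single packet and the sender keeps up.
    """
    if burst_size <= 0 or interval_ms <= 0:
        raise ValueError("burst_size and interval_ms must be positive")
    return burst_size * 1000.0 / interval_ms


def netperf_command(host, burst_size, interval_ms, msg_bytes):
    """Build an argument list for a paced UDP_STREAM run (placeholder values)."""
    return [
        "netperf", "-H", host, "-t", "UDP_STREAM",
        "-b", str(burst_size),    # messages per burst
        "-w", str(interval_ms),   # wait between bursts (assumed milliseconds)
        "--", "-m", str(msg_bytes),
    ]


# Hypothetical pacing: 100 messages every 10 ms targets 10,000 PPS.
rate = paced_packet_rate(100, 10)
cmd = netperf_command("sink.example", 100, 10, 1472)
print(rate, " ".join(cmd))
```

The achieved rate can be lower than the target if packets are dropped or the sender cannot keep up, so the rate reported by netperf is the one to plot.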

Figure 1. Packet rate vs. CPU savings with DirectPath I/O

The above figure plots CPU savings due to DirectPath I/O as a percent of one core against packet rate (Packets per Second – PPS). Immediately, you can see the benefits of DirectPath I/O at high packet rates (100,000 PPS). However, it is equally clear that at lower packet rates, the benefits of DirectPath I/O are not as significant. At 10,000 PPS, DirectPath I/O can only save about 6% of one core. This is an important observation as many enterprise workloads do not have very high networking traffic (see Tables 1 and 2).
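
To put the 10,000 PPS data point in per-packet terms, the following back-of-the-envelope calculation converts "percent of one core" into cycles saved per packet. It uses the 2.27 GHz clock of the test machine and the roughly 6% saving quoted above. Treating the saving as evenly spread across packets is an assumption, and the function name is illustrative.

```python
def cycles_saved_per_packet(core_fraction_saved, clock_hz, packets_per_second):
    """Convert a CPU saving (fraction of one core) into cycles per packet."""
    if packets_per_second <= 0:
        raise ValueError("packets_per_second must be positive")
    return core_fraction_saved * clock_hz / packets_per_second


# About 6% of one 2.27 GHz core saved at 10,000 packets per second.
saved = cycles_saved_per_packet(0.06, 2.27e9, 10_000)
print(f"{saved:,.0f} cycles per packet")  # prints "13,620 cycles per packet"
```

The same conversion can be applied to other points read off the figure, which makes it easier to compare the saving with the CPU budget a workload actually spends per packet.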

Table 1. Performance of enterprise class workloads with DirectPath I/O

To further illustrate the specific use cases and benefits for DirectPath I/O, VMware also compared its performance against that of a virtual NIC with three complex workloads: a web server workload and two database workloads. The web server workload and configuration were similar to SPECweb®2005 (described in reference [2]). We ran a fixed number of users requesting data from a web server and compared the CPU utilization with DirectPath I/O against that with a para-virtualized virtual NIC. Due to the high packet rate of this workload, DirectPath I/O is able to support 15% more users per %CPU used. Note that in a typical web server workload, the packets that a web server receives are smaller than 1500 bytes (an average of 86 bytes in our experiments). Hence, we cannot directly use the receive numbers in Figure 1 to calculate CPU savings.
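
The "users per %CPU used" metric can be sketched as follows. The user count and CPU utilization figures in the example are made up purely to show the arithmetic; they are not measurements from this experiment, and the function names are illustrative.

```python
def users_per_cpu_percent(users, cpu_percent):
    """Efficiency metric: users supported per percentage point of CPU used."""
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return users / cpu_percent


def relative_gain(users, cpu_vnic_percent, cpu_directpath_percent):
    """Fractional gain in users per %CPU of DirectPath I/O over a virtual NIC."""
    baseline = users_per_cpu_percent(users, cpu_vnic_percent)
    directpath = users_per_cpu_percent(users, cpu_directpath_percent)
    return directpath / baseline - 1.0


# Made-up inputs: same number of users, 46% CPU with a virtual NIC
# versus 40% CPU with DirectPath I/O.
gain = relative_gain(1000, 46.0, 40.0)
print(f"{gain:.0%} more users per %CPU")
```

Because the number of users is fixed, the gain depends only on the ratio of the two CPU utilization figures.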

Next, we looked at a database workload that has far lower packet rates. We used the Order Entry benchmark [3], and measured the ratio of number of operations per second. As expected, due to the low packet rate, the performance of virtual NIC and DirectPath I/O was similar.

We also looked at the performance of an OLTP-like workload with SAP and DB2 [4] on a 4-socket Intel Xeon X7550 machine with one 8-vCPU VM. The virtual NIC outperformed DirectPath I/O by about 3% in the default configuration. This performance gap was an artifact of the memory pinning, reservation and NUMA behavior of the platform in the DirectPath I/O configuration. After we set memory reservations for the virtual NIC configuration as well, the two configurations performed equally. Table 2 lists packet rates for some more enterprise-class workloads. Based on the packet rate numbers and the CPU cost saving estimates from Figure 1, we do not expect these workloads to benefit from the use of DirectPath I/O.

Table 2. Packet rates for some enterprise class workloads

Compatibility Matrix

DirectPath I/O requires that the VM be allowed to access a device directly and that the device be allowed to modify the VM’s memory (e.g., to copy a received packet into the VM’s memory). Additionally, the VM and the device can now share essential state information that is invisible to ESX. Hence, the use of DirectPath I/O is incompatible with many core virtualization features. Table 3 presents a compatibility matrix for DirectPath I/O.

Table 3. Feature Compatibility Matrix for DirectPath I/O
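
The recommendation in this post can be sketched as a simple check. Only the four unsupported features named in the introduction are encoded here; Table 3 is the complete list. The 50,000 PPS threshold is the rough figure from the introduction rather than a hard limit, and the function and constant names are illustrative.

```python
# Features this post names as unsupported with DirectPath I/O.
UNSUPPORTED_WITH_DIRECTPATH = frozenset({
    "physical NIC sharing",
    "memory overcommit",
    "vMotion",
    "Network I/O Control",
})

# Rough threshold for a "very high" packet rate (assumption, not a hard limit).
HIGH_PACKET_RATE_PPS = 50_000


def directpath_assessment(packets_per_second, required_features):
    """Report whether a workload looks like a DirectPath I/O candidate."""
    blockers = sorted(UNSUPPORTED_WITH_DIRECTPATH & set(required_features))
    high_rate = packets_per_second > HIGH_PACKET_RATE_PPS
    return {
        "candidate": high_rate and not blockers,
        "high_packet_rate": high_rate,
        "blockers": blockers,
    }


# A high packet rate workload that still needs vMotion is not a candidate.
print(directpath_assessment(100_000, ["vMotion"]))
```

A workload that falls below the threshold, or that needs any of the listed features, is better served by a para-virtualized NIC such as vmxnet3.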


As stated in the beginning of this post, DirectPath I/O is intended for specific use cases. It is another technology VMware users can deploy to boost performance of applications with very high packet rate requirements.

Further Reading

  • VMware DirectPath I/O. http://communities.vmware.com/docs/DOC-11089
  • Configuration Examples and Troubleshooting for DirectPath I/O. http://www.vmware.com/files/pdf/techpaper/vsp_4_vmdirectpath_host.pdf


  1. netperf. http://www.netperf.org/netperf/
  2. Achieving High Web Throughput Scaling with VMware vSphere 4 on Intel Xeon 5500 series (Nehalem) servers. http://communities.vmware.com/docs/DOC-12103
  3. Virtualizing Performance-Critical Database Applications in VMware vSphere. http://www.vmware.com/pdf/Perf_ESX40_Oracle-eval.pdf
  4. SAP Performance on vSphere with IBM DB2 and SUSE Linux Enterprise. http://www.vmware.com/files/pdf/techpaper/vsp_41_perf_SAP_SUSE_DB2.pdf

SPECweb®2005 is a registered trademark of the Standard Performance Evaluation Corporation (SPEC).


11 comments have been added so far

  1. It would be helpful to see discussion and analysis of performance in terms of latency (vs. throughput or cpu utilization): Does directpath offer measurably better latency numbers in, e.g., a netperf TCP_RR test, or a simple ping time test?

  2. We have seen better latency characteristics with DirectPath I/O in some experiments (e.g., up to 20% benefit with ping). Since this is a blog post, we wanted to use benchmarks whose virtualized performance numbers were publicly available and explained in detail. Adding latency would have necessitated a lot of groundwork that is better suited to a whitepaper than a blog post.

  3. A TCP_RR test isn’t any more difficult than an intervals/paced UDP_STREAM test. In many ways, since there is no worry about packet losses affecting the test run time, it is actually easier.
    With that paced, bursted UDP_STREAM test, any loss of traffic is not “made up” by netperf – it isn’t trying to detect lost traffic. So, if there are packet losses, the test will slow down – or in the extreme case (when there are as many losses as the size of the burst) stop entirely. At that point, traffic stops flowing but netperf/netserver is still marking time and measuring CPU util.
    The only “complication” with a TCP_RR test is interrupt coalescing on the NIC (virtual or otherwise). To get a “good” latency measurement – in terms of what is possible at least – the coalescing has to be disabled – although there are some NIC/driver combinations that have (IMO) a really good coalescing mechanism and so don’t need it disabled for good latency.

  4. Rick, as you say, a basic TCP_RR test or a ping test is easy to run even with the complications of interrupt coalescing. This post is narrowly focused on the benefits of DirectPath I/O for some representative enterprise workloads (for which we already have whitepapers explaining performance in virtualized environments). For these workloads, packet rate rather than latency was the biggest differentiating factor.
