When you think of summertime, images of the beach, sun, and vacation come to mind. Not so for the members of the VMware Research Group; they think “INTERNS.” It is the time of year when top-notch computer science Ph.D. students join our team’s quest to disrupt the state of the field. As we welcome a new crop of interns, we thought we would take a moment to highlight some of the groundbreaking results from interns past. As you will see, the bar has been set high, very high.
In summer 2016, intern Heidi Howard of the University of Cambridge came to VMware to work on making systems built on Paxos more efficient and scalable. As she studied the Paxos algorithm, she became perplexed: she did not know why all quorums in Paxos had to intersect. It was evident to her that intersection was required across phases, but she did not understand why it was necessary within the same phase. Heidi took her observation to her mentor, Dahlia Malkhi, who took the time to digest and understand its implications. In a word, Dahlia deemed the observation “brilliant.” Dahlia’s role as mentor took on a whole new direction at that point.
To grasp the significance of this observation, you need to know that Paxos is an algorithm for reaching agreement reliably in distributed systems, first introduced in 1989. It has been studied, tweaked, and implemented for years. Paxos uses two phases: leader election and replication. The heart of Paxos requires a quorum of participants to agree on and accept a value proposed by a leader; at this point, the leader deems the value committed. Heidi’s proposal to generalize Paxos so that quorums need only intersect across Paxos phases, rather than within them, was named Flexible Paxos.
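The weakened requirement boils down to a simple counting condition. Here is a minimal sketch (the helper names are hypothetical, not the paper's implementation) contrasting classic majority quorums with the Flexible Paxos condition:

```python
# Sketch of the quorum condition behind Flexible Paxos (hypothetical helper
# names). Classic Paxos uses majorities in both phases, so any two quorums
# intersect. Flexible Paxos only requires that every phase-1 (leader election)
# quorum intersects every phase-2 (replication) quorum, which holds whenever
# |Q1| + |Q2| > n.

def classic_paxos_ok(n, q):
    """Majority quorums: any two quorums of size q out of n intersect."""
    return 2 * q > n

def flexible_paxos_ok(n, q1, q2):
    """Quorums need only intersect ACROSS phases: |Q1| + |Q2| > n."""
    return q1 + q2 > n

n = 5
assert classic_paxos_ok(n, 3)            # classic: majorities of 3 out of 5
# Flexible Paxos: commit by replicating to only 2 nodes per operation,
# provided leader election contacts 4 nodes: 4 + 2 > 5.
assert flexible_paxos_ok(n, q1=4, q2=2)
# Shrinking both quorums breaks safety: 2 + 2 <= 5 fails the condition.
assert not flexible_paxos_ok(n, q1=2, q2=2)
```

The payoff is on the steady-state path: replication quorums, which are exercised on every operation, can be made small, at the cost of larger quorums for the rare leader elections.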
The goal for the remainder of the summer was to take that aha moment and prove it. Together with her mentor and fellow VMware Research intern Sasha Spiegelman of the Technion, Heidi implemented a prototype of the Flexible Paxos algorithm. The results showed that Flexible Paxos is reliable and efficient: both latency and throughput improved. The work was significant enough to be posted on arXiv. So for your summertime reading list, dig into Flexible Paxos: Quorum intersection revisited.
Berkeley Churchill of Stanford University, VMware Summer Intern 2016, worked with Eric Schkufza on a breakthrough toolchain-insensitive technique for identifying third-party libraries in unmodified binaries. As often happens with discoveries, it was a project that did not succeed that led to a giant step forward. Berkeley’s original project was in the x86 binary size reduction domain, and he was not satisfied with the results. Eric, serving as his mentor, used his relationships with internal VMware product teams to see if there was anything useful or applicable in Berkeley’s work. In the ensuing discussions with the product teams, a recurring pain point emerged: third-party library identification. Re-energized, Berkeley repurposed his original project, and ProcHasher was born.
Techniques for auditing real-world software to determine which third-party libraries a binary contains are theoretically simple: you build a database of all the possible libraries and do a lookup based on a hash code. The not-so-simple part is determining which hash function to use. Current techniques rely on the machine-code text in the binaries. The problem with this approach is its sensitivity to the toolchain: every compiler, with its multitude of options, requires a new database entry for the same library. As you can imagine, the size of the database increases dramatically if you try to cover all possible compiler/option combinations.
Berkeley realized that a hash function based on the side effects a binary has on the heap was both promising and toolchain-insensitive. Now, this is the point in the tale where principled math takes over. Similar to techniques used for plagiarism detection, you can treat successful hash lookups as evidence, not proof, that a component exists in the binary. With mathematical techniques borrowed from the document-classification domain, you can statistically estimate the probability that a given library is in a binary.
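As a toy illustration of this evidence-based lookup (not ProcHasher's actual algorithm or hash function), suppose each known library is represented by the set of hashes of some toolchain-insensitive property of its routines, and we score libraries by the fraction of their hashes found in the target binary:

```python
# Toy sketch of evidence-based component identification (not ProcHasher's
# actual implementation). Each library is a set of hashes of some
# toolchain-insensitive property of its routines (e.g., heap side effects);
# each hash that also appears in the binary is evidence, not proof, that
# the library is present.

def classify(binary_hashes, library_db, threshold=0.5):
    """Return the libraries whose fraction of matched hashes clears threshold."""
    scores = {
        name: sum(h in binary_hashes for h in hashes) / len(hashes)
        for name, hashes in library_db.items()
    }
    return {name for name, score in scores.items() if score >= threshold}

library_db = {
    "zlib":    {0xA1, 0xA2, 0xA3, 0xA4},   # hash values are made up
    "openssl": {0xB1, 0xB2, 0xB3, 0xB4},
}
# A binary containing most of "zlib" plus unrelated code:
binary = {0xA1, 0xA2, 0xA3, 0xC9, 0xCA}
assert classify(binary, library_db) == {"zlib"}
```

Because the score is a fraction of each library's own hashes, unrelated code in the binary does not dilute the evidence, which is also why this style of matching stays robust when several components are linked into one binary.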
ProcHasher’s prototype implementation produced significantly more accurate results than existing techniques, and the size of its database remained manageable. ProcHasher was 4.3X less sensitive to toolchain variation and achieved up to 31X improvement in recall, the percentage of components identified correctly. It also proved robust against the presence of multiple components in a single binary.
We are looking forward to the disruptions this year’s interns will bring to the table. Check back in the Fall to see if they raised the bar even higher.
As adoption of non-uniform memory access (NUMA) architectures has grown, efficient concurrent data structures have become essential to take advantage of all the cores available in NUMA machines. VMware has been actively involved in researching techniques to automatically produce concurrent, thread-safe data structures from their sequential, non-thread-safe implementations. The results, presented at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), won one of the two Best Paper Awards.
The recipe for the success of this endeavor came from integrating techniques from distributed systems and shared-memory algorithms to form a new approach, Node Replication (NR). The secret ingredient is replicating the sequential data structure across the NUMA nodes. The key is to minimize the amount of data shared between nodes, which is achieved using an in-memory log that provides synchronization between replicas. Mix in the Flat Combining technique, which coordinates operations within a node efficiently and decreases contention on the log through batching. The goal of NR is to work well under contention, an area where other methods have fallen short. Building concurrent data structures for NUMA is laborious and error-prone, even for an expert programmer. NR produces concurrent data structures that are linearizable and free from concurrency bugs, starting from black-box sequential data structures, with no special skills needed. This is similar to the way a box of pasta is easier to make than homemade pasta.
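The log-and-replicas idea can be sketched in a few lines. This is a toy, single-threaded illustration only (the real NR uses flat combining, per-node combiner locks, and concurrent log appends), and all names are hypothetical:

```python
# Toy single-threaded sketch of the Node Replication idea: every update is
# appended to a shared in-memory log, and each per-NUMA-node replica of the
# black-box sequential structure replays the log before serving requests.

class NodeReplicated:
    def __init__(self, make_replica, num_nodes=2):
        self.log = []                                  # shared log of operations
        self.replicas = [make_replica() for _ in range(num_nodes)]
        self.applied = [0] * num_nodes                 # log index replayed per node

    def _sync(self, node):
        # Replay any log entries this node's replica has not yet seen.
        for op, args in self.log[self.applied[node]:]:
            getattr(self.replicas[node], op)(*args)
        self.applied[node] = len(self.log)

    def update(self, node, op, *args):
        self.log.append((op, args))                    # the log fixes the order
        self._sync(node)

    def read(self, node, op, *args):
        self._sync(node)                               # reads see all prior updates
        return getattr(self.replicas[node], op)(*args)

# Wrap a plain dict as the "black-box" sequential structure.
nr = NodeReplicated(dict)
nr.update(0, "__setitem__", "k", 1)
assert nr.read(1, "__getitem__", "k") == 1             # replica 1 catches up via the log
```

Note that only the log is shared across nodes; each replica is read and written locally, which is what keeps cross-node traffic low on a NUMA machine.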
A “taste” test was performed on Redis’ data structures to compare NR to five other approaches. The results showed that under high contention, NR outperformed all others except tailor-made NUMA-aware data structures. Just as good recipes put on extra pounds, NR needs additional memory for the log and replicas, so this approach is best suited for small- to medium-sized structures.
For the background of concurrent data structure approaches, NR algorithm details, and more importantly, the evaluation results, see “Black-box Concurrent Data Structures for NUMA Architectures”, by Irina Calciu (VMware Research), Siddhartha Sen (Microsoft Research), Mahesh Balakrishnan (Yale University), and Marcos K. Aguilera (VMware Research). To learn more about what projects VMware researchers are working on, click here.
Remember when your parents would say, “Because I said so!” and you never knew why? Well, David Tennenhouse, VMware’s Chief Research Officer, recently shared his views on artificial intelligence with AI Business, an online community for artificial intelligence in business. David reflected on how VMware uses AI today, the AI needs of customers going forward, and, more importantly, his view on Explainable AI and the opportunities that exist for research. “Early in my career I learned that insights and intuition will only take you so far; in order to have broader impact I needed to work backwards from my intuitive answers and come up with the “chain of reasoning” as to why they were correct. Absent that explanation I couldn’t convince my colleagues in the research community or the executives of large companies that my intuition was correct. AI faces a similar problem.” See the full column at VMware’s Chief Research Officer Believes Explainable AI Will Be the Future.
Programming networks with P4
P4 is a domain-specific programming language designed for expressing how packets are processed by the data plane of a programmable network element, such as a hardware or software switch, network interface card, router, or network function appliance. This is the first blog post in a series that will provide an introduction to the P4 programming language, using the P4_16 version of the language, which is currently a draft specification.
One of the most active areas of computer networking today is Software-Defined Networking (SDN). SDN separates the two core functions of a network-processing element (router): the control plane and the data plane. Traditionally both functions were implemented on the same device; SDN decoupled them and allowed a variety of control-plane implementations for each data plane. The poster child for SDN is the OpenFlow protocol, which specifies the API between the control plane and the data plane. Despite the additional flexibility brought by separating these functions, SDN still assumes that the behavior of the network data plane is fixed. This is a significant impediment to innovation; for example, deploying a new network protocol (such as VXLAN) can take up to 4 years between the initial proposal and its commercial availability in high-speed devices.
What’s in it for me?
As a reaction to this state of affairs, there is a new impetus to make computer networks even more programmable by making the behavior of the data plane expressible in software. Industry and academia are converging on a new domain-specific programming language called P4 (Programming Protocol-Independent Packet Processors). The P4 specification is open and public. The language is developed by the p4.org consortium, which includes more than 60 companies in the areas of networking, cloud systems, and network chip design, as well as academic institutions. While P4 was initially designed for programming network switches, its scope has been broadened to cover a large variety of packet-processing systems.
Compared to the state-of-the-art method of programming network processing systems (e.g., writing microcode on top of custom hardware ASICs), P4 provides significant advantages:
- Flexibility: P4 makes many packet-forwarding policies expressible as programs; contrast this to traditional switches, which expose fixed-function forwarding engines to their users.
- Expressiveness: P4 programs may express sophisticated hardware-independent packet processing algorithms using solely general-purpose operations and table lookups.
- Resource mapping and management: P4 programs express resource usage in symbolic terms (e.g., IPv4 source address); compilers map such user-defined fields to available hardware resources and manage resource allocation and scheduling.
- Software engineering: P4 programs provide important benefits such as type checking, information hiding, and software reuse.
- Component libraries: P4 supports the creation of libraries that wrap hardware-specific functions or standard protocol implementations into portable high-level P4 constructs.
- Decoupling hardware and software evolution: P4 is to a large degree architecture independent, allowing separate hardware and software upgrade cycles.
- Debugging: Manufacturers can provide their customers with software models of target architectures to aid in the development and debugging of P4 programs.
P4 brings programmability to the network data plane, similar to the way the CUDA language brought programmability to graphics cards. We expect that this additional flexibility will unleash a wave of new and unforeseen applications. Here are some of the emerging ones:
- Specification of network protocols: P4 is a formal programming language with precise semantics; it can specify some network protocols more precisely than English prose.
- Implementation of new network protocols: P4 makes the deployment of new protocols much simpler – just a matter of a software upgrade. For example, a VXLAN implementation in P4 requires 175 lines of code.
- Network monitoring: The killer application for P4 is the ability to write software programs that dynamically extract information from network devices. For example, the INT (In-band Network Telemetry) proposal allows network switches and endpoints to insert additional custom headers into the packets that cross the network; these headers carry network measurement information (timestamps, queue occupancy, congestion, etc.).
- Simpler network devices: When using a P4-programmable device, users can not only implement new protocols, they can also remove unused protocols, freeing the available hardware resources for other purposes.
- Intellectual property protection: With P4, network protocols can be written and deployed by network operators, not just by network device manufacturers.
The original paper that proposed the P4 programming language was published in July 2014. The first version of the language, including a specification, a reference implementation of a compiler, and various tools, including a simulator, was released in the fall of 2015. The currently accepted version of the language is 1.0.3, also named P4_14. Most of the initial release was spearheaded by Barefoot Networks, a start-up that is building a high-speed programmable network ASIC.
Following the initial release, the p4.org consortium grew rapidly. The p4.org consortium organized a series of P4 workshops at Stanford and several P4 developer days; hundreds of participants from industry and academia have attended these events. The next workshop is scheduled for May 17, 2017.
Meanwhile, a formal design committee was assembled to design the next version of the language based on initial feedback from users. In December 2016, the committee released a draft specification for the next version of the language, dubbed P4_16, accompanied by a new reference compiler implementation. While P4_16 is syntactically incompatible with P4_14, the new language should be a superset of the old one in expressivity, and the reference compiler can convert P4_14 programs into P4_16 programs. The design committee will attempt to keep future versions of the language backwards-compatible.
P4_16: the newest family member
P4_16 is a relatively simple programming language. It is statically typed, strongly typed, and memory-safe. A one-line description of P4 could be: “it is similar to a simplified C or Java, without support for pointers, dynamic memory allocation, floating-point numbers, or recursive functions, and with very limited support for looping constructs; in addition, it offers built-in hash tables, which are written by the control plane and read by the data plane.”
P4_16 is designed to be implementable on a large variety of target platforms, such as programmable network cards, FPGA switches, software switches, and programmable ASICs. The language is restricted to constructs that are efficiently implementable on all these platforms. P4 also accommodates hybrid packet-processing devices, where some functions are programmable while others are hardwired (e.g., using special hardware acceleration units). Figure 1 shows the workflow for programming a network device using P4; a P4 compiler synthesizes not only the data plane functionality, but also the API between the control plane and the data plane.
The core abstractions provided by the P4 language are:
- Header definitions describe the format (the set of fields and their sizes) of each header within a packet.
- Parsers describe the permitted header sequences within received packets, how to identify those header sequences and the headers to extract from packets.
- Tables associate user-defined keys with actions. P4 tables generalize traditional switch tables; they can be used to implement routing tables, flow lookup tables, access-control lists, and other user-defined table types, including complex multivariable decisions.
- Actions are code fragments that describe how packet header fields and metadata are manipulated. Actions can also include data, which can be supplied by the control plane at run time.
- Match-action units perform the following sequence of operations:
- Construct lookup keys from packet fields or computed metadata,
- Perform table lookup using the constructed key, and selecting an action (including the associated data)
- Finally, execute the selected action
- Control flow expresses an imperative program describing the data-dependent packet processing within a target pipeline, including the data-dependent sequence of match-action unit invocations. Deparsing (packet reassembly) can be performed using a control flow.
- Extern objects are library constructs that can be manipulated by a P4 program through well-defined APIs, but whose internal behavior is hardwired (e.g., checksum units) and hence not programmable using P4.
- User-defined metadata: user-defined data structures associated with each packet.
- Intrinsic metadata: metadata provided or consumed by the architecture associated with each packet (e.g., the input port where a packet has been received).
- Architecture definition: a set of declarations that describes the programmable parts of a network processing device.
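To make these abstractions concrete, here is a tiny illustrative P4_16 fragment with a header definition, a parser, and a control containing one match-action table. All names are hypothetical, the signatures are simplified, and the syntax follows the current draft specification, so details may change; a complete program also needs an architecture-specific package declaration:

```
// Illustrative P4_16 sketch (hypothetical names, simplified signatures).

// Header definition: the format of an Ethernet header.
header ethernet_t {
    bit<48> dstAddr;
    bit<48> srcAddr;
    bit<16> etherType;
}

struct headers_t {
    ethernet_t ethernet;
}

// Parser: which header sequences are permitted and what to extract.
parser MyParser(packet_in pkt, out headers_t hdr) {
    state start {
        pkt.extract(hdr.ethernet);
        transition accept;
    }
}

// Control flow with one match-action table mapping destination MAC to port.
// (standard_metadata_t and mark_to_drop are architecture-supplied constructs.)
control MyIngress(inout headers_t hdr, inout standard_metadata_t meta) {
    action set_port(bit<9> port) { meta.egress_spec = port; }  // action data from the control plane
    action drop() { mark_to_drop(); }
    table l2_forward {
        key = { hdr.ethernet.dstAddr : exact; }                // lookup key from a packet field
        actions = { set_port; drop; }
        default_action = drop();
    }
    apply { l2_forward.apply(); }                              // match-action unit invocation
}
```

Note how the table itself contains no forwarding entries: the control plane populates l2_forward at run time through the API the compiler generates, while the data plane only performs lookups.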
Will it solve world hunger?
Unfortunately, the answer to this question is “no.” Beyond its inability to solve all of the world’s problems, P4 is very limited as a general-purpose programming language. P4 is not Turing-complete; it is narrowly defined for performing data-path packet processing. Surprisingly, there are even many packet-processing tasks that cannot be expressed in P4.
- P4_16 supports extern functions or methods; these are computational functions that are implemented outside of P4 and can be invoked from P4 programs. There is currently an effort to standardize a set of such methods; however, each P4 target platform can provide additional extern methods, e.g., to model hardware accelerators. Invoking extern methods is one way that P4 programs can perform otherwise impossible tasks.
- There is no iteration construct in P4; loops can only be created by the parser state machine, and there is no support for recursive functions. Consequently, the work performed by a P4 program depends only linearly on the header sizes.
- There is no dynamic memory allocation in P4. Resource consumption can be statically estimated (at compile-time).
- There are no pointers or references.
- There is no support for multicast or broadcast. These must be achieved by means external to P4. The typical way a P4 program performs multicast is by setting a special intrinsic metadata field to a “broadcast group”. This triggers a mechanism that is outside of P4, which performs the required packet replication.
- P4 has no built-in support for queueing, scheduling or multiplexing.
- P4 is unsuitable for deep-packet inspection. In general, due to the absence of loops, P4 programs cannot do any interesting processing of the packet payload.
- P4 offers no support for processing packet trailers.
- All the state in a P4 program is created when a packet is received and destroyed when the processing is complete. To maintain state across different packets (e.g., per-flow counters) P4 programs must use extern methods. We expect the standard library to contain support for such persistent arrays (counters, registers, meters). Even given support for registers, one cannot iterate over all counters to compute statistics.
- There is no standard way to communicate between the data plane and the control plane; this is usually offered using extern methods (e.g., to implement “learning”).
- There is no support for performing packet fragmentation or reassembly; thus protocols such as TCP cannot be described in P4.
- There is no support for generating new packets (e.g., an ICMP reply), only for processing existing ones.
Despite these limitations, P4 is a remarkably useful language. We also expect that future evolution will expand its capabilities. To get started go here to download P4. To learn more about other research efforts at VMware look here.
Coming next, Show me the code please!
Abstract: Recent research has shown that applications often incorrectly implement crash consistency. We present ccfs, a file system that improves the correctness of application-level crash consistency protocols while maintaining high performance. A key idea in ccfs is the abstraction of a stream. Within a stream, updates are committed in program order, thus helping correctness; across streams, there are no ordering restrictions, thus enabling scheduling flexibility and high performance. We empirically demonstrate that applications running atop ccfs achieve high levels of crash consistency. Further, we show that ccfs performance under standard file-system benchmarks is excellent, in the worst case on par with the highest performing modes of Linux ext4, and in some cases notably better. Overall, we demonstrate that both application correctness and high performance can be realized in a modern file system.
Authors: Thanumalayan Sankaranarayana Pillai, Ramnatthan Alagappan, Lanyue Lu, Vijay Chidambaram, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
“Agile Paging: Exceeding the Best of Nested and Shadow Paging” by Jayneel Gandhi, has been accepted as an IEEE Micro Top Pick of the year 2016!
Abstract: Virtualization provides benefits for many workloads, but the overheads of virtualizing memory are not universally low. The cost comes from managing two levels of address translation—one in the guest virtual machine (VM) and the other in the host virtual machine monitor (VMM)— with either nested or shadow paging. Nested paging directly performs a two-level page walk that makes TLB misses slower than unvirtualized native, but enables fast page tables changes. Alternatively, shadow paging restores native TLB miss speeds, but requires costly VMM intervention on page table updates.
This paper proposes agile paging that combines both techniques and exceeds the best of both. A virtualized page walk starts with shadow paging and optionally switches in the same page walk to nested paging where frequent page table updates would cause costly VMM interventions. Agile paging enables most TLB misses to be handled as fast as native while most page table changes avoid VMM intervention. It requires modest changes to hardware (e.g., demark when to switch) and VMM policies (e.g., predict good switching opportunities).
We emulate the proposed hardware and prototype the software in Linux with KVM on x86-64. Agile paging performs more than 12% better than the best of the two techniques and comes within 4% of native execution for all workloads.
IEEE Micro Background: “IEEE Micro collects some of this year’s most significant research papers in computer architecture conferences based on novelty and potential for long-term impact. It publishes these papers yearly as “Micro’s Top Picks from the Computer Architecture Conferences” in its May / June issue. The Top Picks committee through this issue recognize those significant and insightful papers that have the potential to influence the work of computer architects for years to come.”