Home > Blogs > VMware Research

Flexible Byzantine Fault Tolerance

Current state-of-the-art Byzantine fault-tolerant protocols state a limit on the number of Byzantine faults that can be tolerated. Of course, there are lower bounds stating that only a certain number of Byzantine faults can be tolerated in a given setting. On the other hand, it’s unclear why Byzantine faults are the primary faults one should consider! For instance, for high-stake transactions, often it is essential to guard its safety against adversaries who intend to earn by double-spending. Moreover, depending on their beliefs and external assumptions, two clients in the system may have entirely different interpretations of the state of the system. There may be conservative clients who wish to tolerate more faults and may wait for the state to be in a certain conservative state to commit. Or there may be clients who believe in stronger message timing assumptions. And just in case these clients are incorrect, all of them they may want to be able to update their assumptions and recover!

In a recent work called Flexible Byzantine Fault Tolerance, we address many of these intriguing questions. First, Flexible BFT introduces a new fault model called alive-but-corrupt faults — adversaries who intend to break safety of the protocol but not liveness. Flexible BFT shows that in the presence of such alive-but-corrupt faults (in addition to Byzantine faults), we can support a higher number of total faults than known in the distributed computing literature. Second, Flexible BFT enables the co-existence of clients with different beliefs and assumptions. This allows supporting clients having their own assumptions or using the additional external information they have access to. If the assumption made by a client is incorrect, some of her commits may be unsafe. But subsequently, the protocol allows updating assumptions and a recovery mechanism.

At a technical level, Flexible BFT makes two key contributions. First, it introduces a synchronous BFT protocol in which only the commit step requires to know the synchrony network delay bound and thus replicas execute the protocol without any synchrony assumption. Second, it introduces a notion called Flexible Byzantine Quorums by deconstructing the roles of different quorums in existing consensus protocols. A preliminary draft of this work is available at ArXiv and a short presentation is available here.

HotStuff: BFT Consensus in the Lens of Blockchain

Blockchain is a Byzantine Fault Tolerant (BFT) replicated state machine, in which each state-update is by itself a Turing machine with bounded resources. The core algorithm for achieving BFT in a Blockchain appears completely different from classical BFT algorithms:

  • Classical solutions like DLS, PBFT solve BFT among a small-to-medium group of known participants. Such algorithms consist of multiple rounds of message exchanges carrying votes and safety-proofs. They are evidently quite intimidating to the non-expert.
  • In contrast, Bitcoin solves BFT among a very large group of unknown users. In each time-period, one user broadcasts a single message carrying a Proof-of-Work (PoW). No other messages or information is exchanged.

What a difference between the two worlds!

Recent advances in blockchain technology blur these boundaries. Namely, hybrid solutions such as Byzcoin, Bitcoin-NG, Hybrid Consensus, Casper and Solida, anchor off-chain BFT decisions inside a PoW chain or the other way around.

Moreover, innovative solutions in the age of blockchains, such as Honeybadger, ALGORAND, Tendermint, and SBFT, revisit the BFT setting with greater scalability and simplicity.

Confused?  The VMware’s HotStuff framework can help!

HotStuff describes Blockchain in the lens of BFT and BFT in the lens of Blockchain, and provide common algorithmic foundations for both. It operates over graphs of blocks (aka blockchains), such that a block becomes committed when votes embedded in other blocks refers to it. A feature of HotStuff is its economy of mechanism, involving only a single message type for proposing, a single message type for voting, and simple Commit and Preferred Branch Rules. The proposer protocol in HotStuff is the same regardless of whether the proposer is being replaced or not, thus relieving of the main source complexity in BFT protocols The Saddest Moment.

At the same time, the reasoning for its safety is subtle and deceptively simple—not unlike that for other designs that have subsequently been identified as problematic (see e.g., Revisit Fast BFT).  This led us to explore model checking to confirm the safety of HotStuff.

Interested in reading more? check the HotStuff whitepaper (“HotStuff: BFT Consensus in the Lens of Blockchain’’, https://arxiv.org/abs/1803.05069).

For more information on VMware Blockchain, go here.


Meet Cascade – A Just-in-time compiler for reprogrammable hardware

fast cars

Okay let’s be honest, nobody strives for slow in the computing world. We want fast computers, fast chips, and fast time to market, all while eating fast food. Field-programmable gate arrays (FPGAs) have been actively used since the mid-80s for the purpose of hardware acceleration. Speed! Unfortunately, one aspect of FPGAs has not been accelerated: compilation. Compiling FPGA programs is painfully slow. FPGAs are programmed using hardware description languages (HDLs). Although similar to software programming languages they differ in both the scope of the language and the output of the compiler.  FPGA compilers need to produce a complex set of layout instructions and connections for a reconfigurable circuit board. Not a simple task. In fact, it’s NP complete!

VMware Researchers Eric Schkufza and Michael Wei wondered if applying the principals behind just-in-time compilation could help reduce the latency of FPGA development. It turns out it can! A proof of concept has been completed, and exciting possibilities have grown out of the initial work. Cascade is now a VMware open source tool which is available to everyone.

How does Cascade work? Starting with a Verilog source file, Cascade generates many smaller Verilog programs that in aggregate have the same semantics as the original program. Cascade then determines which of these smaller programs (specifically those that correspond to I/O peripherals such as LEDs) must be placed on the hardware immediately and loads them from a standard library on to the FPGA. In the meanwhile, the remainder of the programs are interpreted in software. Over time, these programs are compiled to hardware, and when the process is complete they are swapped onto the FPGA as well. A cascading effect. From the developer’s point of view, the code runs immediately and simply gets faster over time. No more waiting endlessly for your code to compile; a significant improvement over the state of the art!

Even if you are not interested in saving development cycle time, there are other advantages to using Cascade. Cascade currently supports software-style printf-debugging and will soon offer users the ability to take advantage of runtime-specific data-dependent optimizations. In the future, Cascade may also find use in the hypervisor. Supporting VMotion for VMs that use FPGAs is a known hard problem. But by using Cascade as the mechanism for providing a VM access to an FPGA, it may be possible to move a program from hardware back into software, move the software pieces from one host to another, and then back down into hardware in the new location.

To try Cascade out, visit us on GitHub, https://github.com/vmware/cascade.

Scaling Computational Biology at VMware

Building Blocks

Computational biology datasets tend to be big.  Really big.  The raw genomic data for a single cell can run into the hundreds of gigabytes. And if you want to search through thousands of datasets looking for a gene or other snippet of DNA that you are interested in, then you can be staring down petabytes of data.  The sheer size of computational biology datasets means that many problems can only be solved on beefy servers with lots of RAM and fast disks.

VMware Senior Researcher Rob Johnson has recently published a string of papers showing how new data structures can be used to cut this data down to size.  By shrinking the data, Johnson’s data structures enable many problems that have typically required a server to now be solved on a laptop or even a high-end phone.  “One of our motivating visions from the beginning was to eliminate the computational, storage, and memory obstacles to creating a portable device that doctors and researchers could take into the field to quickly analyze and diagnose unknown diseases and outbreaks,” says Johnson.  “And our most recent paper shows how a national or international organization, such as NIH or WHO, could use our data structures to create and maintain an efficiently searchable archive of genomic datasets for researchers around the world.”  A searchable archive of genomic data could enable researchers to quickly find instances of a gene they have identified, or to test whether a gene is correlated with some observable characteristic, such as susceptibility to cancer.

But how did VMware get involved in computational biology?

The connection is in the underlying data structures.

Johnson’s computational biology work builds on the counting quotient filter, a data structure he originally developed as a replacement for the widely used, decades-old Bloom filter.  A Bloom filter is a very space-efficient way of representing a set of items.  Applications can insert new items into a Bloom filter, and they can query whether an item is in the filter.  Bloom filters save space by allowing a small false-positive rate, i.e. they will occasionally indicate that an item is in the set even though it is not.  Many applications can tolerate a small false positive rate, and so Bloom filters are now frequently used in networking, databases, file systems, storage systems, and computational biology, among many other domains.

The counting quotient filter improves on the Bloom filter in several ways.  For example it is smaller and faster than a Bloom filter, and it supports several operations that Bloom filters do not, such as counting the number of times that each item appears in the set.  The counting quotient filter also scales out of RAM more efficiently than Bloom filters, since many counting quotient filter operations can be implemented using sequential IO, which is particularly efficient for data stored on disks and SSDs.

These improvements are important because applications often have to work around the performance and functionality limitations of Bloom filters, resulting in greater complexity and reduced performance. Counting quotient filters, on the other hand, enable streamlined, high-performance solutions to many problems that up to now have been solved by using Bloom filters in conjunction with other, less space-efficient data structures.

Johnson’s three recent papers demonstrate how counting quotient filters can pay off in the domain of computational biology.

In the first paper, Johnson describes Squeakr, a tool for performing k-mer counting, which is often the first step in any kind of analysis of raw sequencing data.  State-of-the-art k-mer counting tools use Bloom filters to remove errors from the raw data, but resort to a conventional, large hash table to perform the actual counting. Squeakr exploits the counting ability of the counting quotient filter to perform both tasks faster and in less space.  Squeakr uses as little as one third the RAM as comparable tools and, for some applications, one sixth the time.

The second paper describes deBGR, a tool for compactly representing the de Bruijn Graph of a collection of raw sequencing data.  De Bruijn Graphs are widely used to perform more complex analyses of sequencing data; for example, to identify complex mutations and errors in raw data, to disentangle data from multiple sources, and two compare data from different organisms.  deBGR shows how to use the counting quotient filter to represent not only the basic de Bruijn Graph, but also how often each snippet of DNA in the graph occurs in the underlying raw data set.  This information can be essential to many of the advanced analyses mentioned above.

Finally, Johnson’s third paper shows how to use counting quotient filters to build an index for quickly searching through thousands of raw datasets, constituting terabytes of data.  The ability to perform such searches is so essential that BLAST, an earlier tool for searching through genomic data, is one of the most cited papers in the entire scientific literature.  But BLAST works only on assembled data, not raw data.  And the process of converting raw data into assembled data is expensive, so the vast majority of raw data never gets assembled, making it inaccessible to BLAST-based searches.  Johnson’s system, Mantis, overcomes this limitation.  Mantis outperforms Bloom-filter-based solutions to this problem on several dimensions. For example, previous solutions had large error rates; Mantis has none.  Mantis also builds the index in about one eighth the time, performs queries about 100 times faster, and uses 20% less space than the best Bloom-filter-based systems.

“I believe these projects show the universality of good algorithms and data structures work,” says Johnson.  “Although we started out working on the quotient filter for databases, dedupe systems, and the like, we ended up making a big impact on computational biology, effectively building an entirely new tool-chain around a data structure originally created to solve systems problems.”

The papers describing this research have appeared in top venues for big data and computational biology research, including SIGMOD, RECOMB, ISMB, and Bioinformatics.  This NSF-funded work is in collaboration with  Fatemah Almodaresi, Michael Bender, Mike Ferdman, Prashant Pandey, and Rob Patro of Stony Brook University

Further Reading




VMware’s Mihai Budiu presents “Hillview: A Spreadsheet for Big Data” at Stanford 11/29/17

Dr. Mihai Budiu, a Senior Researcher in VMware’s Research Group, will present on “Hillview:  A Spreadsheet for Big Data”, as part of the Stanford DAWN Project’s seminar series.  The talk will occur on Wednesday, November 29th, 2017 from 3 to 4 p.m., on campus in the Gates Computer Science Building in room 415.

SOSP ’17 Reflections

Vijay Chidambaram, UT Austin Assistant Professor and VMware Affiliated Researcher, wrote a blog post reflecting on the recent Symposium on Operating Systems Principles (SOSP 2017), where two papers from VMware were presented. SOSP is the premier venue for academic research on operating systems and distributed systems. Click here for more:

Big data for the 99% of #enterprises – that’s the motto powering VRG’s Project Hillview

The Hillview project is developing a set of tools for the exploration of very large datasets (“big data”). Our goal is to build a spreadsheet for big data that is easy to use and helps users browse and visualize data quickly. The motto of our project is “Big data for the 99% of enterprises.” The goal is to handle datasets with a billion rows interactively. Take a quick look here to see how we are accomplishing this. A demo of Hillview will be shown at VMworld US next Monday 8/28(session: FUT2634PU)

VMware Research headed to ACM SIGCOMM

The VMware Research team is headed to ACM SIGCOMM.  See where we’ll be and come say hello! https://octo.vmware.com/want-meet-sigcomm-2017/

What Happens When Interns Ask Questions?

When you think of summertime, images of the beach, sun, and vacation come into mind. Not for the members of the VMware Research Group, they think “INTERNS.” It is the time of year when top-notch computer science Ph.D. students join our team’s quest to cause disruption in the state of the field.  As we are welcoming a new crop of interns, we thought we would take a moment to highlight some of the groundbreaking results from interns past. As you will see, the bar has been set high, very high.

Flexible Paxos

In summer 2016, intern Heidi Howard of the University of Cambridge came to VMware to work on optimizing systems built on Paxos to make it more efficient and scalable. As she studied the Paxos algorithm, she became perplexed. She did not know why all quorums in Paxos had to intersect.  It was evident to her that intersection was required across phases, but she just did not understand why they were necessary for the same phase. Heidi decided to take her observation to her mentor, Dahlia Malkhi who took the time to digest and understand the implications. In a word, Dahlia deemed the observation “brilliant.” Dahlia’s role as mentor took on a whole new direction at that point.

To grasp the significance of this observation you need to know that Paxos is an algorithm for the reliability of distributed systems first introduced in 1989. It has been studied, tweaked, and implemented for years. Paxos uses two phases, leader election and replication. The heart of Paxos requires a quorum of participants to agree and accept a proposed value from a leader. At this point, the leader deems the value committed.  Heidi’s proposal to generalize Paxos such that majority quorums are only necessary across Paxos phases was named Flexible Paxos.

The goal for the remainder of the summer was to take the AHA moment and prove it. Working with her mentor and VMware Research intern, Sasha Spiegelman of Technion University, the trio implemented a prototype of the Flexible Paxos algorithm. The results showed that Flexible Paxos is reliable and efficient as both latency and throughput improved. It was significant enough to be archived in arXiv. So for your summertime reading list, dig into Flexible Paxos: Quorum intersection revisited.


Berkeley Churchill of Stanford University, VMware Summer Intern 2016, worked with Eric Schkufza to come up with a breakthrough toolchain insensitive technique to identify third party libraries in unmodified binaries. As often happens with discoveries, it is a project/experiment that does not succeed that leads to a giant step forward. Berkeley’s original project was in the x86 binary size reduction domain, and he was not satisfied with the results. Eric, serving as his mentor, was able to use his internal VMware product team relationships to see if there was anything useful or applicable to Berkley’s work. In the ensuing discussions with the product teams, a reoccurring pain point emerged, third party library identification. Re-energized, Berkeley repurposed his original project, and ProcHasher was born.

Techniques to be able to audit your real world software to determine which third party libraries are contained in the binary are theoretically simple. You build a database of all the possible libraries and do a lookup based on a hash code. The not so simple part is determining which hash function to use. Current techniques rely on using the machine code text in the binaries.  The problem with this approach is its sensitivity to the toolchain.  Every compiler with its multitude of options requires a new database entry for the same library. As you can imagine the size of the database dramatically increases if you try to build a database of all possible compiler/options combinations.

Berkeley realized when looking at the side effects a binary has on the heap, a hash function based on this property was both promising and toolchain insensitive. Now, this is the point in the tale where principled math takes over. Similar to techniques used for plagiarism detection you can treat successful hash lookups as evidence, not proof, that the component exists in the binary. With mathematical techniques used in the document classification domain, you can statistical classify the probability a given library is in a binary.

ProcHasher’s prototype implementation produced significantly more accurate results than existing techniques. The size of the database was also manageable. ProcHasher was 4.3X less sensitive to toolchain variation and achieved up to 31X improvement in recall, the percentage of components that were identified correctly. It is also proved to be robust against the presence of multiple components in a single binary.

Summer 2017

We are looking forward to the disruptions this year’s interns will bring to the table. Check back in the Fall to see if they raised the bar even higher.

A Recipe for Concurrent Data Structures

As adoption of non-uniform memory access (NUMA) architectures has grown, efficient concurrent data structures have become essential to take advantage of all the cores available in NUMA machines. VMware has been actively involved in researching techniques to automatically produce concurrent thread-safe data structures from their sequential, non-thread safe implementations. The results, presented at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), won one of the two Best Paper Awards.
The recipe for the success of this endeavor came from integrating techniques from distributed systems and shared-memory algorithms to form a new approach, Node Replication (NR). The secret ingredient is replicating the sequential data structures across the NUMA nodes. The key is to minimize the amount of data shared between nodes, which is achieved using an in-memory log that provides synchronization between replicas. Mix in the Flat Combining technique, which coordinates operations within a node efficiently and decreases contention on the log through batching. The goal of NR is to work well under contention, an area where other methods have fallen short. Building concurrent data structures for NUMA is laborious and error-prone, even for an expert programmer. NR produces concurrent data structures that are linearizable and free from concurrency bugs from black-box sequential data structures with no special skills needed. This is similar to the way a box of pasta is easier to make than homemade pasta.
A “taste” test was performed on Redis’ data structures to compare NR to five other approaches. The results showed that under high contention, NR outperformed all others except tailored-made NUMA-aware data structures. Just as good recipes put on extra pounds, NR needs additional memory for the log and replicas, so this approach is best suited for small- to medium-sized structures.
For the background of concurrent data structure approaches, NR algorithm details, and more importantly, the evaluation results, see “Black-box Concurrent Data Structures for NUMA Architectures“, Irina Calciu (VMware Research); Siddhartha Sen (Microsoft Research); Mahesh Balakrishnan (Yale University); Marcos K. Aguilera (VMware Research). To learn more about what projects VMware Researchers are working on click here.