Home > Blogs > VMware Research

What’s new? Differential Datalog.

The era of continuous exponential growth in computer performance (known more commonly as Moore’s law) is finally coming to an end. What will be the next driver of growth in computing? Experts forecast a new era in which performance increases will come from more efficient designs, customized to the problem at hand. This is clearly visible in the plethora of proliferating domain-specific hardware devices (as an example, consider the chips optimized for Artificial Intelligence applications). Another important ingredient for efficiency will be domain-specific languages: programming languages customized for a specific problem rather than general-purpose ones. The subject of this short write-up is an example of the latter, a new programming language named Differential Datalog.


Differential Datalog is based on an old programming language, Datalog, but enhanced with many modern features that are very important for programmer productivity (garbage-collection, strong typing, concurrent execution, interoperation with other programming languages). Datalog itself is a very simple, mathematically pure, and well-understood language; it was used mostly by theoreticians who reason about database systems.

Datalog programs compute on database tables (database people call these tables relations; other programming languages call them collections). Consider, as a running example, a table of family relationships.

In Differential Datalog this could be declared as

input relation Relationship(
person: string, relatedTo: string, relationship: string)

This is just a typical database table with three columns. We could also write some code that fills this table with data, but in general the set of tables is fixed, while the contents of the tables change during the program execution:

Relationship("Mike", "John", "Brother").
Relationship("Joan", "John", "Spouse").
Relationship("Frank", "John", "Child").

These are known as “facts” in Datalog parlance.

All DDlog computations take some tables as inputs and produce other tables. For example, we could produce a table containing just the children:

output relation Children(
person: string, child: string)
Children(parent, child) :- Relationship(parent, child, "Child").

The Datalog turnstile symbol :- is read as “if”; this program says: parent and child are in the relation Children if there is a row in the table Relationship having parent in the column person, child in the column relatedTo, and the string “Child” in the column relationship.

When you run this program it will produce a Children table that has the following contents:

Children("Frank", "John").
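In relational terms this rule is just a filter and a projection. As an illustration (not DDlog’s actual execution model, which is compiled and incremental), its semantics can be sketched in Python:

```python
# The input relation as a set of (person, relatedTo, relationship) rows.
relationship = {
    ("Mike", "John", "Brother"),
    ("Joan", "John", "Spouse"),
    ("Frank", "John", "Child"),
}

def children(rows):
    """Children(parent, child) :- Relationship(parent, child, "Child")."""
    return {(p, r) for (p, r, kind) in rows if kind == "Child"}

print(children(relationship))   # {('Frank', 'John')}
```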

We can also define more complicated relations; for example a Descendant relation:

output relation Descendant(ancestor: string, descendant: string)

Descendant(ancestor, descendant) :- Children(ancestor, descendant).
Descendant(ancestor, descendant) :-
    Children(ancestor, child), Descendant(child, descendant).

So you are someone’s descendant either if you are their child, or if you are the descendant of one of their children.
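The recursive rule is evaluated to a least fixpoint: keep deriving new facts until nothing changes. A minimal Python sketch of that semantics, over hypothetical (parent, child) pairs (DDlog itself evaluates such rules far more efficiently):

```python
def descendants(child_pairs):
    """Fixpoint of the two rules: every child is a descendant, and a
    descendant of your child is your descendant."""
    desc = set(child_pairs)             # rule 1
    changed = True
    while changed:                      # rule 2, iterated to a fixpoint
        new = {(a, d)
               for (a, c) in child_pairs
               for (c2, d) in desc if c == c2}
        changed = not new <= desc
        desc |= new
    return desc

pairs = {("John", "Frank"), ("Frank", "Lucy")}   # hypothetical facts
print(sorted(descendants(pairs)))
# [('Frank', 'Lucy'), ('John', 'Frank'), ('John', 'Lucy')]
```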

In a Datalog program you declare some output tables and write formulas describing how the contents of the output tables are computed from the inputs, possibly via some intermediate tables. Then, given a collection of facts that tell us what is in the input tables, the program computes the contents of the output tables.

Differential Datalog

So, what is the Differential part in Differential Datalog? A Differential Datalog program is written in exactly the same way as a Datalog program, but it executes in a very different way: instead of simply computing the output tables from a set of input tables, it actually receives as input changes (or differences) in the input tables and it tells us what the changes are in the output tables. The following diagram illustrates this process:

Now we can explain why Differential Datalog, or DDlog for short, was built on top of Datalog: for tables the notion of “change” is clearly defined: a change consists of rows inserted into and deleted from the input tables. If we were to use a more powerful programming language as a base (e.g., Python), we would have much more trouble defining what a “change” even is.

A DDlog program starts with a set of empty input and output tables; every time some change is made to the input tables, it notifies the interested consumers of all resulting changes to the output tables. This works equally for data being added to or removed from an input table.
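For a non-recursive rule like Children, the idea is easy to see: a change to the input maps directly to a change in the output. A toy Python sketch using hypothetical rows (DDlog, built on differential dataflow, handles recursive rules incrementally as well):

```python
def children_delta(inserted, deleted):
    """Map row-level changes of Relationship to changes of Children,
    without recomputing the whole output table."""
    rule = lambda rows: {(p, r) for (p, r, k) in rows if k == "Child"}
    return rule(inserted), rule(deleted)

ins, dels = children_delta(
    inserted={("Anna", "John", "Child")},    # hypothetical new fact
    deleted={("Frank", "John", "Child")},
)
print(ins, dels)   # {('Anna', 'John')} {('Frank', 'John')}
```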

What is it good for?

Why would you use DDlog instead of a more traditional programming language with a lot of nice bells and whistles? It turns out that computing changes efficiently is in general quite difficult. As a challenge, we invite you to write a program that computes the changes in the Descendant relation in the programming language of your choice. We bet that your implementation will be either slower or significantly more complex than the two-line program written in DDlog.

DDlog is a great tool when you care about processing a constant stream of changes efficiently, and when you can express your problem as a computation on tables. It turns out that there are many application domains where inputs keep changing; we have had a lot of success with DDlog in implementing control software for various distributed systems; in particular, we have rewritten some very complex network management applications, such as OVN – which is more than 14 thousand lines of DDlog.

DDlog is a compiled language; the DDlog compiler generates a Rust program, which can be compiled and executed natively on any platform that supports Rust. Since DDlog programs always interact with their environment (to receive and send changes), we provide support for interoperating with C, C++, Rust, Java and Go programs.


DDlog is being developed as an open-source project with a liberal MIT license. It is available for download at https://github.com/vmware/differential-datalog. It is still under very active development. The language features a powerful type system, a rich – and ever growing – collection of libraries, support for aggregate operations (like finding the largest value in a table), many tools for debugging, a compiler that translates SQL into DDlog, and many other features that make the language a tool for serious software engineering. You should consider using DDlog for your next project if you care about tracking a changing dataset. To learn more about the researchers working on this project please go to research.vmware.com.

Call for Nominations

VMware offers an annual award to an exceptional early-career faculty member conducting computer science systems research. Eligible faculty are those within five years of their first tenure-track appointment. The award provides US $125,000 in support of their research. Prior recipients include:

  • 2016: Professor Matei Zaharia of Stanford University
  • 2017: Professor Tim Kraska of Brown University (now MIT)
  • 2018: Professor Tiark Rompf of Purdue University
  • 2019: Professor Mohammad Alizadeh of MIT

We are pleased to open the nomination window and request nominations by October 23, 2020. Nominations are limited to one per department chair (or equivalent). For full details of the award and nomination process please visit http://www.vmware.com/company/sysaward.html. Nominations will also be automatically considered for a separate, independently evaluated category of early-career faculty awards.

VMware traces its roots to academic research, and we feel a special responsibility to support the growth and vitality of systems research in computer science. We hope you and your colleagues will consider nominating someone whose early-career work you believe best exemplifies originality, impact, and future promise.

Importantly, VMware is committed to diversity in all aspects of our organization. We embrace diversity through our employees, the suppliers and partners we work with, our customers, and the communities in which we work and live.

For questions, or to submit a nomination, please email sysaward@vmware.com.


A VMware Research team has created a tool that enables programmers to specify cluster management logic in a high-level declarative language, and synthesize the code to compute policy-compliant configurations automatically and efficiently. Their new tool, called DCM (Declarative Cluster Management), allows data center developers to use SQL to easily add, remove and modify constraints and policies, the essence of a cluster manager.

“The idea with DCM is to write policies in a declarative style where we use SQL to write what you would like the system to do,” said the DCM team leader, Researcher Lalith Suresh. “The details of how it gets done, the algorithms that you would otherwise have to write by hand, are automated. Behind the scenes there’s a compiler that will take care of a lot of the heavy lifting.”

Modern cluster management systems like Kubernetes, DRS, OpenStack and OpenShift are responsible for configuring a complex distributed system and allocating resources efficiently. Whether juggling containers, virtual machines, micro-services, virtual network appliances, or serverless functions, these systems must enforce numerous cluster management policies.

Currently, developers implement such systems by designing custom application-specific heuristics, an approach that is proving unsustainable: ad-hoc heuristics both perform poorly and introduce overwhelming complexity, making it challenging to add important new features. These heuristics must be continuously adapted to work for arbitrary combinations of policies, making such systems hard to evolve over time.

With DCM, the developer maintains the application state in a relational database and specifies constraints as SQL queries against it. The DCM compiler then generates code that efficiently finds configurations satisfying all these constraints.
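As a rough illustration of the idea (using SQLite and made-up table names, not DCM’s actual schema or API), cluster state lives in tables and a placement constraint is just a query:

```python
import sqlite3

# Toy cluster state; all names here are hypothetical, for illustration only.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE nodes (name TEXT, capacity INT);
    CREATE TABLE pods  (name TEXT, demand INT, node TEXT);
    INSERT INTO nodes VALUES ('n1', 4), ('n2', 2);
    INSERT INTO pods  VALUES ('p1', 3, 'n1');
""")

# A capacity constraint expressed declaratively: nodes that can still
# host a pod demanding 2 units.
feasible = db.execute("""
    SELECT n.name FROM nodes n
    WHERE n.capacity - COALESCE(
        (SELECT SUM(p.demand) FROM pods p WHERE p.node = n.name), 0) >= 2
""").fetchall()
print(feasible)   # [('n2',)]
```

DCM goes much further, compiling such queries into an optimization model for a constraint solver, but the division of labor is the same: the developer states *what* must hold, not *how* to satisfy it.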

“DCM makes it very easy to get started with, for example, writing your own cluster schedulers,” said Suresh. “A lot of the complexity with building such systems is hidden by using DCM. Where these things typically take several years to stabilize, we’re hoping to cut that time significantly.”

“There is no other tool that has this capability today. The level at which we’ve lowered the barrier to building schedulers declaratively, I don’t think there’s any tool out there that gets anywhere close,” he said.

The DCM compiler uses structural information extracted from the SQL specifications. The tool generates code that efficiently translates the state in the database into an optimization model of the problem. At runtime, when a system configuration decision must be made, the generated code extracts the current state of the system from the database, solves the resulting model using an off-the-shelf solver, and produces a new configuration that satisfies all the specified constraints.


In a paper published in 2019, the VMware Research team built a Kubernetes scheduler to show how DCM automates cluster management. The scheduler operates as a drop-in replacement for the default Kubernetes scheduler, supporting all its capabilities and adding new ones.

“We found that it was significantly easier to build our scheduler using DCM than how the Kubernetes scheduler is built today, which has more than 10,000 lines of code,” said Suresh. “With DCM, you basically have about a thousand lines of Java code, and a couple hundred lines of SQL, which we think is a significant benefit.

“Today, there’s really no way to easily build such policy-based cluster managers without having to code all of it from scratch. With DCM, your policies are easily specified using SQL,” he said. “The tool handles how to take all the policies and find optimal decisions for you. So there’s a lot less work involved in using it.”

Lalith Suresh drove the DCM project after many years of thinking about how to simplify developing cluster managers. The VMware Research team working with him includes Senior Researcher Nina Narodytska, Senior Researcher Leonid Ryzhyk, Post-doctoral Researcher Sangeetha Abdu Jyothi, Research Intern João Loff and Research Intern Faria Kalim.


VMware Research released DCM as open source code in summer 2019. It is available for use on the VMware GitHub repository at


For more details on DCM, download the paper Synthesizing Cluster Management Code for Distributed Systems, at




Mitosis: An Efficient Way to Boost Application Performance on Large Machines

Applications running on large multi-socket machines with 1 to 100 TB of memory suffer from non-uniform bandwidth and latency when accessing physical memory. To mitigate these effects, past research has focused only on data-placement policies for large non-uniform memory access (NUMA) machines.

A research team led by VMware researcher Jayneel Gandhi has discovered another performance issue on NUMA machines that was previously overlooked: sub-optimal page-table placement. To resolve the issue, the team proposed a new design, called Mitosis, that boosts application performance on large machines by migrating and replicating page-tables, instead of application data, across sockets.

Their study, “Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines,” is the first to make the case for explicit page-table allocation policy. The study shows that page-table placement is becoming crucial to application performance on large machines, which are becoming increasingly more prevalent.

The team implemented Mitosis in Linux and evaluated its benefits on real hardware, showing that it improves performance for big-data multi-socket applications by up to 1.3 times. Moreover, Mitosis improves application performance by up to 3.2 times in cases where the operating system scheduler migrates a process across sockets.

“A three-times increase in performance is a huge improvement for customers who use Big Data applications like databases on large machines,” said Jayneel Gandhi. “When you’re ordering an airline ticket, for example, your ticket processing becomes a lot faster. And from the airline’s perspective, they can sell a lot more tickets in the same amount of time.”

The research team states that as you scale to large databases on much larger machines, the number of sockets and physical memory size will also scale, making page-table placement a more significant problem. In large multi-socket machines, data placement policies typically partially replicate a database on different sockets so it can be accessed faster. But this has a high memory overhead, which grows with database size as the machines they run on become larger.

The Mitosis study shows that page-table replication incurs negligible memory overhead, can be implemented efficiently and delivers substantial performance improvements. These gains come at a cost of only 0.6% memory overhead, compared to the exorbitant memory cost of data replication.


Mitosis has two components: a mechanism to enable efficient page-table replication and migration, and policies to effectively manage and control page-table replication and migration.

The illustration shows how Mitosis can replicate the page tables on each socket where a multi-socket application is running. Currently, an address translation can result in up to four remote accesses to page-tables. However, with Mitosis-based replication, an address translation results in up to four local accesses to the page-table, precluding the need for any remote memory accesses during page-table walks.

Additionally, single-socket workloads suffer performance losses when processes are migrated across sockets while page-tables are not. The illustration shows that, when a process is migrated from socket 0 to socket 1, the NUMA memory manager transparently migrates data pages, but page-table pages remain on socket 0. In contrast, Mitosis migrates the page-tables along with the data. This eliminates remote memory accesses for page-table walks, improving performance significantly.
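A back-of-envelope model shows why this matters. The latency numbers below are illustrative assumptions, not measurements from the paper; the point is only that a four-level page walk roughly doubles in cost when every level is a remote access:

```python
def walk_cost(levels=4, local_ns=80, remote_ns=160, remote_fraction=1.0):
    """Rough cost of one page-table walk: one memory access per level,
    each either local or remote. All latencies are illustrative guesses."""
    remote = levels * remote_fraction
    return (levels - remote) * local_ns + remote * remote_ns

print(walk_cost(remote_fraction=1.0))   # 640.0  (all levels remote)
print(walk_cost(remote_fraction=0.0))   # 320.0  (Mitosis: all levels local)
```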

Mitosis builds on widely-used operating system mechanisms like page faults and system calls and applies to most commodity operating systems. An important feature of Mitosis is that it requires no changes to applications or hardware, and is easy to use on a per-application basis.

The Mitosis research team includes VMware Research interns Reto Achermann and Ashish Panwar, and academic collaborators: Abhishek Bhattacharjee (Yale University), Timothy Roscoe (ETH Zurich), Arkaprava Basu (IISc Bangalore), and Gopinath Kanchi (IISc Bangalore).


The research team’s next step will be to approach virtualization in the VMware hypervisor. Understanding page-table placement in virtualized systems is a major undertaking and will require additional research and a separate study.

The research team has released Mitosis for native systems. It is open-sourced and available for use here.


Concurrency Course at UT-Austin Is a Hands-On Lab to Learn Parallelism Using Cascade, a New Program for Coding FPGAs


The Concurrency class introduced at the University of Texas in fall 2018 achieved a variety of outcomes. It provided computer science students with hands-on experience in parallel hardware and programming models, to demystify concurrency and convince them that concurrency is a powerful tool at their disposal. In addition, it introduced students to Cascade – a better, faster, more user-friendly compiler for programming FPGAs.

The course also helped Cascade developers improve the program, as students debugged the code while moving through their classwork. Cascade is an open source programming tool for FPGAs from VMware Research.

“People need to graduate with computer science degrees that reflect a high level of comfort, facility and engagement with concurrency and parallelism,” said Christopher J. Rossbach, the Concurrency professor at UT and a member of the Cascade team at VMware Research.

“What we’re trying to accomplish with the course is give students a sense of empowerment, and a willingness to get their hands dirty and try to do things that seem hard.”

“If you look at how we’ve taught parallel programming and concurrency in the past,” said Eric Schkufza of VMware Research, “what we’ve actually done is given them just enough information to be fearful about parallel programming.

“With Cascade, we’re giving students a tool on Day One that makes it easy for them to program concurrent systems,” he said. “And because we open-source Cascade, it’s easy for them to get in the classroom and for students to tell their friends where to download and try it.”

“Cascade was a key enabler for our Concurrency course because it provided a much friendlier interface than students would have exposure to otherwise,” said Rossbach. “Programming an FPGA is hard. When students write a program for an FPGA that takes forever to compile, they get frustrated.

“Programming an FPGA with Cascade might still be hard, but some of the worst challenges are taken off the table by the ability to run transparently in software, on hardware, or on some combination of the two.”

Reprogrammable hardware (FPGAs) can exceed the performance of general-purpose CPUs by several orders of magnitude. However, programming an FPGA can be an extremely slow process. Trivial programs can take several minutes to compile using a traditional compiler, and complex designs can take hours or longer.

Cascade is the world’s first just-in-time compiler for Verilog. It reduces the time to produce working hardware designs, and therefore encourages more frequent compilation. With Cascade, code is executed immediately in a software simulator and compilation is performed in the background. When finished, the code is moved into hardware, and from the user’s perspective the process is much faster.

“Insofar as VMware’s core business has transitioned from virtualizing the server to virtualizing the data center, and data centers today have more FPGAs than ever, VMware needs a solution for virtualizing an FPGA. We’re discovering that Cascade could be the mechanism to do that,” said Rossbach.

“Many technology trends suggest that we’re going to see FPGAs in a lot of shared environments,” said Michael Wei of VMware Research. “If you can’t virtualize an FPGA or you can’t use it or even access it in a virtual environment, that’s fundamentally limiting VMware. Making it more accessible is congruent with VMware’s goals.

“In addition to writing code for multicore or multi-processor platforms, more and more programmers are having to write code that runs on specialized hardware like GPUs and FPGAs, which are super parallel and super concurrent,” said Wei.

“We’re improving Cascade,” said Rossbach. “We’re learning what’s important: making the hardware more programmable and the experience more like software. The availability of a tool like Cascade changes the way people think about parallelism and concurrency.

“Teaching people to program FPGAs by giving them something that takes two hours to compile, and then is wrong because of bugs, is the wrong direction,” he said. “Providing a learning environment in which it’s super easy to experiment and understand the semantics is the right thing to do.

“In our Concurrency course there were students finding bugs in Cascade. When a couple of them got into the code base and tried to fix things themselves, I wanted to do backflips of joy,” said Rossbach.

Download the Cascade technical paper at https://github.com/vmware/cascade/blob/master/doc/asplos19.pdf

Cascade is available to download and test at https://github.com/vmware/cascade

Flexible Byzantine Fault Tolerance

Current state-of-the-art Byzantine fault-tolerant protocols assume a fixed limit on the number of Byzantine faults that can be tolerated. Of course, there are lower bounds showing that only a certain number of Byzantine faults can be tolerated in a given setting. On the other hand, it is unclear why Byzantine faults are the primary faults one should consider. For instance, for high-stakes transactions it is often essential to guard safety against adversaries who intend to profit by double-spending. Moreover, depending on their beliefs and external assumptions, two clients in the system may have entirely different interpretations of the state of the system. Conservative clients may wish to tolerate more faults and may wait for the system to reach a more conservative state before committing; other clients may believe in stronger message-timing assumptions. And in case these clients turn out to be incorrect, all of them may want to be able to update their assumptions and recover!

In recent work called Flexible Byzantine Fault Tolerance, we address many of these intriguing questions. First, Flexible BFT introduces a new fault model called alive-but-corrupt faults: adversaries who intend to break the safety of the protocol but not its liveness. Flexible BFT shows that in the presence of such alive-but-corrupt faults (in addition to Byzantine faults), we can tolerate a higher number of total faults than previously known in the distributed computing literature. Second, Flexible BFT enables the co-existence of clients with different beliefs and assumptions, allowing each client to use its own assumptions or any additional external information it has access to. If the assumptions made by a client are incorrect, some of her commits may be unsafe; but the protocol subsequently allows her to update her assumptions and recover.

At a technical level, Flexible BFT makes two key contributions. First, it introduces a synchronous BFT protocol in which only the commit step requires knowledge of the synchronous network delay bound; replicas thus execute the rest of the protocol without any synchrony assumption. Second, it introduces a notion called Flexible Byzantine Quorums, obtained by deconstructing the roles of the different quorums in existing consensus protocols. A preliminary draft of this work is available on arXiv and a short presentation is available here.
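For context, the classical bound that Flexible BFT generalizes can be stated in a few lines. This sketch covers only the standard n = 3f + 1 setting, not the flexible quorums themselves:

```python
def classical_bft(n):
    """For n replicas: the classical max Byzantine faults f = (n-1)//3,
    the quorum size 2f + 1, and the minimum overlap (by pigeonhole)
    between any two quorums."""
    f = (n - 1) // 3
    quorum = 2 * f + 1
    overlap = 2 * quorum - n
    return f, quorum, overlap

f, quorum, overlap = classical_bft(4)
print(f, quorum, overlap)   # 1 3 2
# With n = 3f + 1 the overlap is f + 1, so any two quorums share at
# least one honest replica -- the root of classical BFT safety.
```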

HotStuff: BFT Consensus in the Lens of Blockchain

Blockchain is a Byzantine Fault Tolerant (BFT) replicated state machine, in which each state-update is by itself a Turing machine with bounded resources. The core algorithm for achieving BFT in a Blockchain appears completely different from classical BFT algorithms:

  • Classical solutions such as DLS and PBFT solve BFT among a small-to-medium group of known participants. Such algorithms consist of multiple rounds of message exchanges carrying votes and safety-proofs. They are evidently quite intimidating to the non-expert.
  • In contrast, Bitcoin solves BFT among a very large group of unknown users. In each time-period, one user broadcasts a single message carrying a Proof-of-Work (PoW). No other messages or information is exchanged.

What a difference between the two worlds!

Recent advances in blockchain technology blur these boundaries. Namely, hybrid solutions such as Byzcoin, Bitcoin-NG, Hybrid Consensus, Casper and Solida, anchor off-chain BFT decisions inside a PoW chain or the other way around.

Moreover, innovative solutions in the age of blockchains, such as Honeybadger, ALGORAND, Tendermint, and SBFT, revisit the BFT setting with greater scalability and simplicity.

Confused? VMware’s HotStuff framework can help!

HotStuff describes Blockchain in the lens of BFT and BFT in the lens of Blockchain, and provides common algorithmic foundations for both. It operates over graphs of blocks (a.k.a. blockchains), such that a block becomes committed when votes embedded in other blocks refer to it. A feature of HotStuff is its economy of mechanism: a single message type for proposing, a single message type for voting, and simple Commit and Preferred Branch rules. The proposer protocol in HotStuff is the same regardless of whether the proposer is being replaced or not, thus removing a main source of complexity in BFT protocols (memorably lampooned in “The Saddest Moment”).
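To give a flavor of the commit rule, here is a deliberately oversimplified sketch of the chained “three-chain” pattern: a block commits once it is extended by blocks certified in consecutive views. This ignores certificates, leader changes, and the voting safety rules; it is an illustration, not the protocol:

```python
def committed_blocks(views):
    """views[i] is the view in which block i was certified. Block i commits
    once blocks i, i+1, i+2 were certified in consecutive views."""
    return [i for i in range(len(views) - 2)
            if views[i + 1] == views[i] + 1 and views[i + 2] == views[i] + 2]

print(committed_blocks([1, 2, 3, 5, 6]))   # [0] -- only block 0 commits
```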

At the same time, the reasoning for its safety is subtle and deceptively simple—not unlike that for other designs that have subsequently been identified as problematic (see e.g., Revisit Fast BFT).  This led us to explore model checking to confirm the safety of HotStuff.

Interested in reading more? Check the HotStuff whitepaper (“HotStuff: BFT Consensus in the Lens of Blockchain”, https://arxiv.org/abs/1803.05069).

For more information on VMware Blockchain, go here.


Meet Cascade – A Just-in-time compiler for reprogrammable hardware


Okay, let’s be honest: nobody strives for slow in the computing world. We want fast computers, fast chips, and fast time to market, all while eating fast food. Field-programmable gate arrays (FPGAs) have been actively used since the mid-80s for the purpose of hardware acceleration. Speed! Unfortunately, one aspect of FPGAs has not been accelerated: compilation. Compiling FPGA programs is painfully slow. FPGAs are programmed using hardware description languages (HDLs). Although similar to software programming languages, they differ in both the scope of the language and the output of the compiler. FPGA compilers need to produce a complex set of layout instructions and connections for a reconfigurable circuit board. Not a simple task. In fact, it’s NP-complete!

VMware researchers Eric Schkufza and Michael Wei wondered if applying the principles behind just-in-time compilation could help reduce the latency of FPGA development. It turns out it can! A proof of concept has been completed, and exciting possibilities have grown out of the initial work. Cascade is now a VMware open-source tool available to everyone.

How does Cascade work? Starting with a Verilog source file, Cascade generates many smaller Verilog programs that in aggregate have the same semantics as the original program. Cascade then determines which of these smaller programs (specifically, those that correspond to I/O peripherals such as LEDs) must be placed on the hardware immediately and loads them from a standard library onto the FPGA. Meanwhile, the remainder of the programs are interpreted in software. Over time, these programs are compiled to hardware, and when the process is complete they are swapped onto the FPGA as well. A cascading effect. From the developer’s point of view, the code runs immediately and simply gets faster over time. No more waiting endlessly for your code to compile; a significant improvement over the state of the art!
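The resulting execution model is easy to mimic in miniature. A toy Python sketch of the swap (a hypothetical class, not Cascade’s actual implementation): serve calls from an interpreter immediately, and swap in the compiled version, with identical semantics, once it is ready:

```python
class JITModule:
    """Run a slow-but-immediate interpreter until a compiled version of
    the same module is available, then swap it in transparently."""
    def __init__(self, interpret, compile_fn):
        self.run = interpret           # usable from the very first call
        self._compile = compile_fn     # in Cascade this runs in the background

    def finish_compilation(self):
        self.run = self._compile()     # swap: same semantics, just faster

double = JITModule(interpret=lambda x: x + x,
                   compile_fn=lambda: (lambda x: 2 * x))
print(double.run(21))       # 42, served by the interpreter
double.finish_compilation()
print(double.run(21))       # 42, same answer from the "compiled" version
```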

Even if you are not interested in saving development cycle time, there are other advantages to using Cascade. Cascade currently supports software-style printf-debugging and will soon offer users the ability to take advantage of runtime-specific data-dependent optimizations. In the future, Cascade may also find use in the hypervisor. Supporting vMotion for VMs that use FPGAs is a known hard problem. But by using Cascade as the mechanism for providing a VM access to an FPGA, it may be possible to move a program from hardware back into software, move the software pieces from one host to another, and then back down into hardware in the new location.

To try Cascade out, visit us on GitHub, https://github.com/vmware/cascade.

Scaling Computational Biology at VMware

Building Blocks

Computational biology datasets tend to be big.  Really big.  The raw genomic data for a single cell can run into the hundreds of gigabytes. And if you want to search through thousands of datasets looking for a gene or other snippet of DNA that you are interested in, then you can be staring down petabytes of data.  The sheer size of computational biology datasets means that many problems can only be solved on beefy servers with lots of RAM and fast disks.

VMware Senior Researcher Rob Johnson has recently published a string of papers showing how new data structures can be used to cut this data down to size.  By shrinking the data, Johnson’s data structures enable many problems that have typically required a server to now be solved on a laptop or even a high-end phone.  “One of our motivating visions from the beginning was to eliminate the computational, storage, and memory obstacles to creating a portable device that doctors and researchers could take into the field to quickly analyze and diagnose unknown diseases and outbreaks,” says Johnson.  “And our most recent paper shows how a national or international organization, such as NIH or WHO, could use our data structures to create and maintain an efficiently searchable archive of genomic datasets for researchers around the world.”  A searchable archive of genomic data could enable researchers to quickly find instances of a gene they have identified, or to test whether a gene is correlated with some observable characteristic, such as susceptibility to cancer.

But how did VMware get involved in computational biology?

The connection is in the underlying data structures.

Johnson’s computational biology work builds on the counting quotient filter, a data structure he originally developed as a replacement for the widely used, decades-old Bloom filter.  A Bloom filter is a very space-efficient way of representing a set of items.  Applications can insert new items into a Bloom filter, and they can query whether an item is in the filter.  Bloom filters save space by allowing a small false-positive rate, i.e. they will occasionally indicate that an item is in the set even though it is not.  Many applications can tolerate a small false positive rate, and so Bloom filters are now frequently used in networking, databases, file systems, storage systems, and computational biology, among many other domains.

The counting quotient filter improves on the Bloom filter in several ways.  For example it is smaller and faster than a Bloom filter, and it supports several operations that Bloom filters do not, such as counting the number of times that each item appears in the set.  The counting quotient filter also scales out of RAM more efficiently than Bloom filters, since many counting quotient filter operations can be implemented using sequential IO, which is particularly efficient for data stored on disks and SSDs.
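To make the comparison concrete, here is a minimal Bloom filter in Python. It exhibits the two classic limitations, occasional false positives and no counting or deletion, that the counting quotient filter addresses (a teaching sketch, not production code):

```python
import hashlib

class BloomFilter:
    """m-bit filter with k hash functions: no false negatives, a small
    false-positive rate, and no way to count or delete items."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("ACGT")
print("ACGT" in bf)   # True: an inserted item is always found
```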

These improvements are important because applications often have to work around the performance and functionality limitations of Bloom filters, resulting in greater complexity and reduced performance. Counting quotient filters, on the other hand, enable streamlined, high-performance solutions to many problems that up to now have been solved by using Bloom filters in conjunction with other, less space-efficient data structures.

Johnson’s three recent papers demonstrate how counting quotient filters can pay off in the domain of computational biology.

In the first paper, Johnson describes Squeakr, a tool for performing k-mer counting, which is often the first step in any kind of analysis of raw sequencing data.  State-of-the-art k-mer counting tools use Bloom filters to remove errors from the raw data, but resort to a conventional, large hash table to perform the actual counting. Squeakr exploits the counting ability of the counting quotient filter to perform both tasks faster and in less space.  Squeakr uses as little as one-third the RAM of comparable tools and, for some applications, one-sixth the time.
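For readers unfamiliar with the term, a k-mer is simply a length-k substring of a sequencing read, and k-mer counting tallies how often each one occurs. A toy, exact version is a few lines of Python; Squeakr does the same job approximately, at scale, inside a counting quotient filter:

```python
from collections import Counter

def kmer_counts(read, k):
    """Exact k-mer counting for a single read: slide a window of
    length k across the read and tally each substring."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

counts = kmer_counts("ACGTACGTAC", k=4)
print(counts["ACGT"])  # 2 (it occurs at positions 0 and 4)
```

The exact version needs a hash-table entry per distinct k-mer, which is exactly the memory cost that the counting quotient filter shrinks.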

The second paper describes deBGR, a tool for compactly representing the de Bruijn Graph of a collection of raw sequencing data.  De Bruijn Graphs are widely used to perform more complex analyses of sequencing data; for example, to identify complex mutations and errors in raw data, to disentangle data from multiple sources, and to compare data from different organisms.  deBGR shows how to use the counting quotient filter to represent not only the basic de Bruijn Graph, but also how often each snippet of DNA in the graph occurs in the underlying raw data set.  This information can be essential to many of the advanced analyses mentioned above.
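The de Bruijn Graph construction itself is simple to state: each k-mer becomes an edge from its (k-1)-character prefix to its (k-1)-character suffix. A minimal exact sketch (our own illustration; deBGR stores this structure, plus per-k-mer abundances, approximately in a counting quotient filter):

```python
from collections import defaultdict

def de_bruijn_edges(kmers):
    """Build a de Bruijn graph: each k-mer contributes an edge from its
    (k-1)-prefix node to its (k-1)-suffix node."""
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return graph

g = de_bruijn_edges(["ACG", "CGT", "GTA"])
print(g["AC"])  # {'CG'}
```

Walks through this graph reconstruct longer stretches of the original sequence, which is why it is the workhorse of genome assembly and the analyses listed above.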

Finally, Johnson’s third paper shows how to use counting quotient filters to build an index for quickly searching through thousands of raw datasets, constituting terabytes of data.  The ability to perform such searches is so essential that BLAST, an earlier tool for searching through genomic data, is one of the most cited papers in the entire scientific literature.  But BLAST works only on assembled data, not raw data.  And the process of converting raw data into assembled data is expensive, so the vast majority of raw data never gets assembled, making it inaccessible to BLAST-based searches.  Johnson’s system, Mantis, overcomes this limitation.  Mantis outperforms Bloom-filter-based solutions to this problem on several dimensions. For example, previous solutions had large error rates; Mantis has none.  Mantis also builds the index in about one eighth the time, performs queries about 100 times faster, and uses 20% less space than the best Bloom-filter-based systems.
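Conceptually, the index Mantis builds resembles an inverted index from k-mers to the datasets containing them; a query sequence hits a dataset when the dataset holds all of the query's k-mers. The toy, exact version below is our own sketch of that idea; Mantis stores the equivalent mapping compactly using counting quotient filters:

```python
from collections import defaultdict

def build_index(datasets, k):
    """Toy inverted index: map each k-mer to the set of datasets
    (by name) in which it appears."""
    index = defaultdict(set)
    for name, read in datasets.items():
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(name)
    return index

def query(index, sequence, k):
    # A dataset matches if it contains every k-mer of the query.
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return set.intersection(*(index.get(km, set()) for km in kmers))

idx = build_index({"sample1": "ACGTACGT", "sample2": "TTGGACGT"}, k=4)
print(query(idx, "ACGTA", 4))  # {'sample1'}
```

Because the index is built directly over k-mers, it works on raw, unassembled data, which is precisely the case that BLAST cannot handle.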

“I believe these projects show the universality of good algorithms and data structures work,” says Johnson.  “Although we started out working on the quotient filter for databases, dedupe systems, and the like, we ended up making a big impact on computational biology, effectively building an entirely new tool-chain around a data structure originally created to solve systems problems.”

The papers describing this research have appeared in top venues for big data and computational biology research, including SIGMOD, RECOMB, ISMB, and Bioinformatics.  This NSF-funded work is in collaboration with Fatemah Almodaresi, Michael Bender, Mike Ferdman, Prashant Pandey, and Rob Patro of Stony Brook University.

VMware’s Mihai Budiu presents “Hillview: A Spreadsheet for Big Data” at Stanford 11/29/17

Dr. Mihai Budiu, a Senior Researcher in VMware’s Research Group, will present on “Hillview: A Spreadsheet for Big Data”, as part of the Stanford DAWN Project’s seminar series.  The talk will occur on Wednesday, November 29th, 2017 from 3 to 4 p.m., on campus in the Gates Computer Science Building in room 415.