
Mitosis: An Efficient Way to Boost Application Performance on Large Machines

Applications running on large multi-socket machines with 1 to 100 TB of memory suffer from non-uniform bandwidth and latency when accessing physical memory. To mitigate these effects, past research has focused only on data placement policies for large non-uniform memory access (NUMA) machines.

A research team led by VMware researcher Jayneel Gandhi has identified another performance issue on NUMA machines that was previously overlooked: sub-optimal page-table placement. To resolve the issue, the team proposed a new design, called Mitosis, that boosts application performance on large machines by migrating and replicating page-tables, rather than application data, across sockets.

Their study, “Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines,” is the first to make the case for an explicit page-table allocation policy. The study shows that page-table placement is becoming crucial to application performance on large machines, which are becoming increasingly prevalent.

The team implemented Mitosis in Linux and evaluated its benefits on real hardware, showing that it improves performance for big-data multi-socket applications by up to 1.3 times. Moreover, Mitosis improves application performance by up to 3.2 times in cases where the operating system scheduler migrates a process across sockets.

“A three-times increase in performance is a huge improvement for customers who use Big Data applications like databases on large machines,” said Jayneel Gandhi. “When you’re ordering an airline ticket, for example, your ticket processing becomes a lot faster. And from the airline’s perspective, they can sell a lot more tickets in the same amount of time.”

The research team notes that as databases scale to much larger machines, the number of sockets and the physical memory size scale with them, making page-table placement an even more significant problem. In large multi-socket machines, data placement policies typically replicate parts of a database on different sockets so it can be accessed faster. But this has a high memory overhead, which grows with database size as the machines that run them become larger.

The Mitosis study shows that page-table replication incurs negligible memory overhead, can be implemented efficiently and delivers substantial performance improvements. These gains come at a cost of only 0.6% memory overhead, compared to the exorbitant memory cost of data replication.

HOW MITOSIS WORKS

Mitosis has two components: a mechanism that enables efficient page-table replication and migration, and policies that effectively manage and control when page-tables are replicated and migrated.

The illustration shows how Mitosis can replicate the page-tables on each socket where a multi-socket application is running. Without replication, an address translation can result in up to four remote accesses to the page-table. With Mitosis-based replication, an address translation results in up to four local accesses to the page-table, precluding the need for any remote memory accesses during page-table walks.

Additionally, single-socket workloads suffer performance losses when processes are migrated across sockets while their page-tables are not. The illustration shows that, when a process is migrated from socket 0 to socket 1, the NUMA memory manager transparently migrates data pages, but the page-table pages remain on socket 0. In contrast, Mitosis migrates the page-tables along with the data. This eliminates remote memory accesses for page-table walks, improving performance significantly.
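
To see why placement matters, consider a rough back-of-the-envelope sketch (not from the paper; the latency numbers are hypothetical placeholders) of a four-level x86-64 page-table walk with local versus remote page-table pages:

```python
# A rough back-of-the-envelope model (not from the paper; the latencies are
# hypothetical placeholders) of why page-table placement matters: an x86-64
# page-table walk touches up to four page-table levels, and every level that
# resides on a remote socket pays the higher cross-socket DRAM latency.

LOCAL_DRAM_NS = 90     # assumed local-socket DRAM latency
REMOTE_DRAM_NS = 150   # assumed remote-socket DRAM latency
WALK_LEVELS = 4        # PGD -> PUD -> PMD -> PTE on x86-64

def walk_latency_ns(levels_remote: int) -> int:
    """Latency of one full page-table walk when `levels_remote` of the four
    page-table accesses land on a remote socket."""
    levels_local = WALK_LEVELS - levels_remote
    return levels_local * LOCAL_DRAM_NS + levels_remote * REMOTE_DRAM_NS

# Page-table pages left behind on the original socket after a migration:
print(walk_latency_ns(levels_remote=4))  # 600 ns per worst-case walk
# Page-tables replicated or migrated to the local socket, as Mitosis does:
print(walk_latency_ns(levels_remote=0))  # 360 ns per worst-case walk
```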

Mitosis builds on widely used operating system mechanisms such as page faults and system calls, and applies to most commodity operating systems. An important feature of Mitosis is that it requires no changes to applications or hardware and is easy to use on a per-application basis.

The Mitosis research team includes VMware Research interns Reto Achermann and Ashish Panwar, and academic collaborators: Abhishek Bhattacharjee (Yale University), Timothy Roscoe (ETH Zurich), Arkaprava Basu (IISc Bangalore), and Gopinath Kanchi (IISc Bangalore).

NEXT STEPS

The research team’s next step is to address virtualization in the VMware hypervisor. Understanding page-table placement in virtualized systems is a major undertaking and will require additional research and a separate study.

The research team has released Mitosis for native systems. It is open-sourced and available for use here.

 

Concurrency Course at UT-Austin Is a Hands-On Lab to Learn Parallelism Using Cascade, a New Program for Coding FPGAs


The Concurrency class introduced at the University of Texas at Austin in fall 2018 achieved a variety of outcomes. It gave computer science students hands-on experience with parallel hardware and programming models, demystifying concurrency and convincing them that it is a powerful tool at their disposal. In addition, it introduced students to Cascade, a better, faster, more user-friendly compiler for programming FPGAs.

The course also helped Cascade developers improve the program, as students debugged the code while moving through their classwork. Cascade is an open source programming tool for FPGAs from VMware Research.

“People need to graduate with computer science degrees that reflect a high level of comfort, facility and engagement with concurrency and parallelism,” said Christopher J. Rossbach, the Concurrency professor at UT and a member of the Cascade team at VMware Research.

“What we’re trying to accomplish with the course is give students a sense of empowerment, and a willingness to get their hands dirty and try to do things that seem hard.”

“If you look at how we’ve taught parallel programming and concurrency in the past,” said Eric Schkufza of VMware Research, “what we’ve actually done is given them just enough information to be fearful about parallel programming.

“With Cascade, we’re giving students a tool on Day One that makes it easy for them to program concurrent systems,” he said. “And because we open-source Cascade, it’s easy for them to get in the classroom and for students to tell their friends where to download and try it.”

“Cascade was a key enabler for our Concurrency course because it provided a much friendlier interface than students would have exposure to otherwise,” said Rossbach. “Programming an FPGA is hard. When students write a program for an FPGA that takes forever to compile, they get frustrated.

“Programming an FPGA with Cascade might still be hard, but some of the worst challenges are taken off the table by the ability to run transparently in software, on hardware, or on some combination of the two.”

Reprogrammable hardware (FPGAs) can exceed the performance of general-purpose CPUs by several orders of magnitude. However, programming an FPGA can be an extremely slow process. Trivial programs can take several minutes to compile using a traditional compiler, and complex designs can take hours or longer.

Cascade is the world’s first just-in-time compiler for Verilog. It reduces the time to produce working hardware designs, and therefore encourages more frequent compilation. With Cascade, code is executed immediately in a software simulator while compilation is performed in the background. When compilation finishes, the code is moved onto the hardware, and from the user’s perspective the whole process is much faster.

“Insofar as VMware’s core business has transitioned from virtualizing the server to virtualizing the data center, and data centers today have more FPGAs than ever, VMware needs a solution for virtualizing an FPGA. We’re discovering that Cascade could be the mechanism to do that,” said Rossbach.

“Many technology trends suggest that we’re going to see FPGAs in a lot of shared environments,” said Michael Wei of VMware Research. “If you can’t virtualize an FPGA or you can’t use it or even access it in a virtual environment, that’s fundamentally limiting VMware. Making it more accessible is congruent with VMware’s goals.

“In addition to writing code for multicore or multi-processor platforms, more and more programmers are having to write code that runs on specialized hardware like GPUs and FPGAs, which are super parallel and super concurrent,” said Wei.

“We’re improving Cascade,” said Rossbach. “We’re learning what’s important: making the hardware more programmable and making the experience more like software. The availability of a tool like Cascade changes the way people think about parallelism and concurrency.

“Teaching people to program FPGAs by giving them something that takes two hours to compile, and then is wrong because of bugs, is the wrong direction,” he said. “Providing a learning environment in which it’s super easy to experiment and understand the semantics is the right thing to do.

“In our Concurrency course there were students finding bugs in Cascade. When a couple of them got into the code base and tried to fix things themselves, I wanted to do backflips of joy,” said Rossbach.

Download the Cascade technical paper at https://github.com/vmware/cascade/blob/master/doc/asplos19.pdf

Cascade is available to download and test at https://github.com/vmware/cascade

Flexible Byzantine Fault Tolerance

Current state-of-the-art Byzantine fault-tolerant protocols fix a limit on the number of Byzantine faults that can be tolerated. Of course, there are lower bounds stating that only a certain number of Byzantine faults can be tolerated in a given setting. On the other hand, it’s unclear why Byzantine faults are the primary faults one should consider! For instance, for high-stakes transactions it is often essential to guard safety against adversaries who stand to profit by double-spending. Moreover, depending on their beliefs and external assumptions, two clients in the system may have entirely different interpretations of the state of the system. There may be conservative clients who wish to tolerate more faults and wait for the system to reach a more conservative state before committing. Or there may be clients who believe in stronger message timing assumptions. And just in case these clients are incorrect, they may want to be able to update their assumptions and recover!

In a recent work called Flexible Byzantine Fault Tolerance, we address many of these intriguing questions. First, Flexible BFT introduces a new fault model called alive-but-corrupt faults: adversaries who intend to break the safety of the protocol but not its liveness. Flexible BFT shows that in the presence of such alive-but-corrupt faults (in addition to Byzantine faults), we can support a higher number of total faults than previously known in the distributed computing literature. Second, Flexible BFT enables the co-existence of clients with different beliefs and assumptions. This allows the protocol to support clients that make their own assumptions or use additional external information they have access to. If the assumption made by a client is incorrect, some of her commits may be unsafe. But the protocol subsequently allows her to update her assumptions and recover.

At a technical level, Flexible BFT makes two key contributions. First, it introduces a synchronous BFT protocol in which only the commit step requires knowledge of the synchronous network delay bound; replicas thus execute the rest of the protocol without any synchrony assumption. Second, it introduces a notion called Flexible Byzantine Quorums, obtained by deconstructing the roles that different quorums play in existing consensus protocols. A preliminary draft of this work is available on arXiv and a short presentation is available here.
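
To give a flavor of the idea, here is a toy Python sketch (our illustration, not the paper's actual quorum rules): each client derives its own commit threshold from the faults it is willing to assume, so two clients watching the same votes can reach commitment at different points.

```python
# Toy illustration (our sketch, not the paper's actual quorum rules): each
# client derives its own commit threshold from the faults it is willing to
# assume, and only treats a value as committed once the votes it has seen
# meet that threshold.

def commit_quorum(n: int, assumed_byzantine: int) -> int:
    """Standard BFT quorum-intersection intuition: any two quorums of this
    size overlap in more than `assumed_byzantine` replicas, so they share at
    least one honest replica."""
    return (n + assumed_byzantine) // 2 + 1

class Client:
    def __init__(self, n: int, assumed_byzantine: int):
        self.threshold = commit_quorum(n, assumed_byzantine)

    def is_committed(self, votes_for_value: int) -> bool:
        return votes_for_value >= self.threshold

# Two clients observe the same replica set but hold different fault assumptions.
conservative = Client(n=100, assumed_byzantine=33)  # needs 67 matching votes
optimistic = Client(n=100, assumed_byzantine=20)    # needs 61 matching votes
print(conservative.is_committed(65))  # False: the conservative client keeps waiting
print(optimistic.is_committed(65))    # True: the optimistic client commits already
```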

HotStuff: BFT Consensus in the Lens of Blockchain

Blockchain is a Byzantine Fault Tolerant (BFT) replicated state machine, in which each state-update is by itself a Turing machine with bounded resources. The core algorithm for achieving BFT in a Blockchain appears completely different from classical BFT algorithms:

  • Classical solutions like DLS, PBFT solve BFT among a small-to-medium group of known participants. Such algorithms consist of multiple rounds of message exchanges carrying votes and safety-proofs. They are evidently quite intimidating to the non-expert.
  • In contrast, Bitcoin solves BFT among a very large group of unknown users. In each time-period, one user broadcasts a single message carrying a Proof-of-Work (PoW). No other messages or information is exchanged.

What a difference between the two worlds!

Recent advances in blockchain technology blur these boundaries. Namely, hybrid solutions such as Byzcoin, Bitcoin-NG, Hybrid Consensus, Casper and Solida, anchor off-chain BFT decisions inside a PoW chain or the other way around.

Moreover, innovative solutions in the age of blockchains, such as Honeybadger, ALGORAND, Tendermint, and SBFT, revisit the BFT setting with greater scalability and simplicity.

Confused? VMware’s HotStuff framework can help!

HotStuff describes Blockchain in the lens of BFT and BFT in the lens of Blockchain, and provides common algorithmic foundations for both. It operates over graphs of blocks (aka blockchains), such that a block becomes committed when votes embedded in other blocks refer to it. A feature of HotStuff is its economy of mechanism, involving only a single message type for proposing, a single message type for voting, and simple Commit and Preferred Branch Rules. The proposer protocol in HotStuff is the same regardless of whether the proposer is being replaced or not, thus removing the main source of complexity in BFT protocols (see The Saddest Moment).
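
To give a flavor of how commitment emerges from votes embedded in descendant blocks, here is a toy Python sketch of a chained commit rule in the spirit of HotStuff's three-chain; it is a simplification for illustration, not the protocol's exact rule.

```python
# Toy chained commit rule (a simplification in the spirit of HotStuff's
# three-chain, not the protocol's exact rule): each block embeds a quorum
# certificate (QC) for its parent, and a block can be committed once it is
# followed by a chain of directly-linked, certified descendants in
# consecutive views.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Block:
    view: int
    parent: Optional["Block"]   # the block this proposal extends
    parent_has_qc: bool         # votes for the parent are embedded in this block

def committed_ancestor(tip: Block, chain_length: int = 3) -> Optional[Block]:
    """Walk back from the newest block: if the last `chain_length` blocks form
    a chain of direct parents in consecutive views, each certified by its
    child's QC, the oldest block of that chain can be committed."""
    chain = [tip]
    while chain[-1].parent is not None and len(chain) < chain_length:
        chain.append(chain[-1].parent)
    if len(chain) < chain_length:
        return None
    for child, parent in zip(chain, chain[1:]):
        if not child.parent_has_qc or child.view != parent.view + 1:
            return None
    return chain[-1]

# b1 <- b2 <- b3 in consecutive views, each child certifying its parent:
b1 = Block(view=1, parent=None, parent_has_qc=False)
b2 = Block(view=2, parent=b1, parent_has_qc=True)
b3 = Block(view=3, parent=b2, parent_has_qc=True)
print(committed_ancestor(b3))   # b1 becomes committed
```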

At the same time, the reasoning for its safety is subtle, and its apparent simplicity can be deceptive, not unlike other designs that were subsequently identified as problematic (see, e.g., Revisit Fast BFT). This led us to explore model checking to confirm the safety of HotStuff.

Interested in reading more? Check the HotStuff whitepaper (“HotStuff: BFT Consensus in the Lens of Blockchain,” https://arxiv.org/abs/1803.05069).

For more information on VMware Blockchain, go here.

 

Meet Cascade – A Just-in-time compiler for reprogrammable hardware


Okay, let’s be honest: nobody strives for slow in the computing world. We want fast computers, fast chips, and fast time to market, all while eating fast food. Field-programmable gate arrays (FPGAs) have been actively used since the mid-80s for the purpose of hardware acceleration. Speed! Unfortunately, one aspect of FPGAs has not been accelerated: compilation. Compiling FPGA programs is painfully slow. FPGAs are programmed using hardware description languages (HDLs). Although similar to software programming languages, they differ in both the scope of the language and the output of the compiler. FPGA compilers need to produce a complex set of layout instructions and connections for a reconfigurable circuit. Not a simple task. In fact, it’s NP-complete!

VMware researchers Eric Schkufza and Michael Wei wondered whether applying the principles behind just-in-time compilation could help reduce the latency of FPGA development. It turns out it can! A proof of concept has been completed, and exciting possibilities have grown out of the initial work. Cascade is now an open-source VMware tool that is available to everyone.

How does Cascade work? Starting with a Verilog source file, Cascade generates many smaller Verilog programs that in aggregate have the same semantics as the original program. Cascade then determines which of these smaller programs (specifically, those that correspond to I/O peripherals such as LEDs) must be placed on the hardware immediately and loads them from a standard library onto the FPGA. In the meantime, the remainder of the programs are interpreted in software. Over time, these programs are compiled to hardware, and when the process is complete they are swapped onto the FPGA as well. A cascading effect. From the developer’s point of view, the code runs immediately and simply gets faster over time. No more waiting endlessly for your code to compile; a significant improvement over the state of the art!
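
The hand-off is easier to picture with a small sketch. The Python below is purely illustrative (Cascade itself operates on Verilog); it shows subprograms starting out in a software interpreter and being hot-swapped onto the FPGA as background compilation completes.

```python
# A conceptual sketch in Python (purely illustrative; Cascade itself operates
# on Verilog): subprograms start out interpreted in software and are
# hot-swapped onto the FPGA as background compilation finishes.

import threading
import time

class Subprogram:
    def __init__(self, name: str, is_io: bool):
        self.name = name
        # I/O peripherals come from a standard library, so they can be placed
        # on the hardware immediately; everything else starts in software.
        self.on_fpga = is_io

    def step(self) -> None:
        where = "FPGA" if self.on_fpga else "software simulator"
        print(f"{self.name}: executing one cycle in {where}")

def compile_in_background(sub: Subprogram) -> None:
    """Stand-in for the slow hardware compiler; when it finishes, the
    subprogram is swapped onto the FPGA without stopping execution."""
    def job():
        time.sleep(0.05)  # place-and-route would take minutes or hours here
        sub.on_fpga = True
    threading.Thread(target=job, daemon=True).start()

design = [Subprogram("leds", is_io=True), Subprogram("pipeline", is_io=False)]
for sub in design:
    if not sub.on_fpga:
        compile_in_background(sub)

for _ in range(3):            # the design runs immediately either way
    for sub in design:
        sub.step()
    time.sleep(0.05)          # later cycles find "pipeline" running on the FPGA
```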

Even if you are not interested in saving development-cycle time, there are other advantages to using Cascade. Cascade currently supports software-style printf debugging and will soon offer users the ability to take advantage of runtime-specific, data-dependent optimizations. In the future, Cascade may also find use in the hypervisor. Supporting vMotion for VMs that use FPGAs is a known hard problem. But by using Cascade as the mechanism for providing a VM access to an FPGA, it may be possible to move a program from hardware back into software, move the software pieces from one host to another, and then move them back down into hardware in the new location.

To try Cascade out, visit us on GitHub, https://github.com/vmware/cascade.

Scaling Computational Biology at VMware


Computational biology datasets tend to be big.  Really big.  The raw genomic data for a single cell can run into the hundreds of gigabytes. And if you want to search through thousands of datasets looking for a gene or other snippet of DNA that you are interested in, then you can be staring down petabytes of data.  The sheer size of computational biology datasets means that many problems can only be solved on beefy servers with lots of RAM and fast disks.

VMware Senior Researcher Rob Johnson has recently published a string of papers showing how new data structures can be used to cut this data down to size.  By shrinking the data, Johnson’s data structures allow many problems that once required a server to be solved on a laptop or even a high-end phone.  “One of our motivating visions from the beginning was to eliminate the computational, storage, and memory obstacles to creating a portable device that doctors and researchers could take into the field to quickly analyze and diagnose unknown diseases and outbreaks,” says Johnson.  “And our most recent paper shows how a national or international organization, such as NIH or WHO, could use our data structures to create and maintain an efficiently searchable archive of genomic datasets for researchers around the world.”  A searchable archive of genomic data could enable researchers to quickly find instances of a gene they have identified, or to test whether a gene is correlated with some observable characteristic, such as susceptibility to cancer.

But how did VMware get involved in computational biology?

The connection is in the underlying data structures.

Johnson’s computational biology work builds on the counting quotient filter, a data structure he originally developed as a replacement for the widely used, decades-old Bloom filter.  A Bloom filter is a very space-efficient way of representing a set of items.  Applications can insert new items into a Bloom filter, and they can query whether an item is in the filter.  Bloom filters save space by allowing a small false-positive rate, i.e. they will occasionally indicate that an item is in the set even though it is not.  Many applications can tolerate a small false positive rate, and so Bloom filters are now frequently used in networking, databases, file systems, storage systems, and computational biology, among many other domains.
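
For readers unfamiliar with the data structure, here is a minimal Python sketch of the Bloom filter interface described above: insertions, membership queries, and the possibility of false positives but never false negatives. It is an illustration only, not production code.

```python
# A minimal Bloom filter sketch (illustration only, not production code):
# items are inserted by setting a few hashed bit positions, and a membership
# query checks those same positions. A query can return a false positive,
# but never a false negative.

import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # all zero initially

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def may_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.insert("ACGTACGT")
print(bf.may_contain("ACGTACGT"))  # True
print(bf.may_contain("TTTTTTTT"))  # almost certainly False
```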

The counting quotient filter improves on the Bloom filter in several ways.  For example, it is smaller and faster than a Bloom filter, and it supports several operations that Bloom filters do not, such as counting the number of times that each item appears in the set.  The counting quotient filter also scales out of RAM more efficiently than Bloom filters, since many counting quotient filter operations can be implemented using sequential IO, which is particularly efficient for data stored on disks and SSDs.

These improvements are important because applications often have to work around the performance and functionality limitations of Bloom filters, resulting in greater complexity and reduced performance. Counting quotient filters, on the other hand, enable streamlined, high-performance solutions to many problems that up to now have been solved by using Bloom filters in conjunction with other, less space-efficient data structures.

Johnson’s three recent papers demonstrate how counting quotient filters can pay off in the domain of computational biology.

In the first paper, Johnson describes Squeakr, a tool for performing k-mer counting, which is often the first step in any analysis of raw sequencing data.  State-of-the-art k-mer counting tools use Bloom filters to remove errors from the raw data but resort to a conventional, large hash table to perform the actual counting. Squeakr exploits the counting ability of the counting quotient filter to perform both tasks faster and in less space.  Squeakr uses as little as one-third the RAM of comparable tools and, for some applications, one-sixth the time.
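
As a concrete picture of what k-mer counting is, the toy Python below slides a window of length k over each read and tallies the substrings. Squeakr keeps these tallies in a counting quotient filter; the plain Counter here only conveys what is being counted, not how the CQF stores it.

```python
# Toy k-mer counting (illustration only): slide a window of length k over each
# read and tally the substrings. Squeakr keeps these tallies in a counting
# quotient filter; the plain Counter below only shows what is being counted.

from collections import Counter

def count_kmers(reads, k: int = 4) -> Counter:
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ACGTACGTGG", "CGTACGTGGA"]
print(count_kmers(reads).most_common(3))
# [('ACGT', 3), ('CGTA', 2), ('GTAC', 2)]
```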

The second paper describes deBGR, a tool for compactly representing the de Bruijn Graph of a collection of raw sequencing data.  De Bruijn Graphs are widely used to perform more complex analyses of sequencing data; for example, to identify complex mutations and errors in raw data, to disentangle data from multiple sources, and to compare data from different organisms.  deBGR shows how to use the counting quotient filter to represent not only the basic de Bruijn Graph, but also how often each snippet of DNA in the graph occurs in the underlying raw data set.  This information can be essential to many of the advanced analyses mentioned above.
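
The construction itself can be sketched in a few lines: nodes are (k-1)-mers, and each k-mer contributes an edge from its prefix to its suffix, annotated with its abundance. The dictionary-based sketch below is illustrative only; deBGR stores this information in a counting quotient filter.

```python
# Toy de Bruijn graph construction (illustration only): nodes are (k-1)-mers
# and each k-mer contributes an edge from its prefix to its suffix, annotated
# with the k-mer's abundance. deBGR stores this information in a counting
# quotient filter rather than the nested dictionaries used here.

from collections import defaultdict

def de_bruijn_edges(kmer_counts):
    graph = defaultdict(dict)            # prefix (k-1)-mer -> {suffix: abundance}
    for kmer, count in kmer_counts.items():
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix][suffix] = count
    return graph

edges = de_bruijn_edges({"ACGT": 3, "CGTA": 2, "CGTG": 2})
print(dict(edges))  # {'ACG': {'CGT': 3}, 'CGT': {'GTA': 2, 'GTG': 2}}
```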

Finally, Johnson’s third paper shows how to use counting quotient filters to build an index for quickly searching through thousands of raw datasets constituting terabytes of data.  The ability to perform such searches is so essential that BLAST, an earlier tool for searching through genomic data, is one of the most cited papers in the entire scientific literature.  But BLAST works only on assembled data, not raw data.  And the process of converting raw data into assembled data is expensive, so the vast majority of raw data never gets assembled, making it inaccessible to BLAST-based searches.  Johnson’s system, Mantis, overcomes this limitation.  Mantis outperforms Bloom-filter-based solutions to this problem on several dimensions. For example, previous solutions had large error rates; Mantis has none.  Mantis also builds the index in about one-eighth the time, performs queries about 100 times faster, and uses 20% less space than the best Bloom-filter-based systems.
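
Conceptually, the index maps each k-mer to the set of datasets it appears in, so a query sequence can be broken into k-mers and the datasets containing all of them returned. The toy Python below captures that idea; Mantis itself stores this mapping far more compactly using counting quotient filters.

```python
# Toy searchable index (illustration only): map each k-mer to the set of
# datasets it appears in, then answer a query by intersecting the sets for the
# query's k-mers. Mantis stores this mapping far more compactly using counting
# quotient filters.

from collections import defaultdict

def build_index(datasets, k: int = 4):
    index = defaultdict(set)                       # k-mer -> {dataset ids}
    for dataset_id, reads in datasets.items():
        for read in reads:
            for i in range(len(read) - k + 1):
                index[read[i:i + k]].add(dataset_id)
    return index

def query(index, sequence: str, k: int = 4):
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    hits = [index.get(kmer, set()) for kmer in kmers]
    return set.intersection(*hits) if hits else set()

index = build_index({"sample1": ["ACGTACGT"], "sample2": ["TTTTACGT"]})
print(query(index, "ACGTA"))  # {'sample1'}: only sample1 contains every k-mer
```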

“I believe these projects show the universality of good algorithms and data structures work,” says Johnson.  “Although we started out working on the quotient filter for databases, dedupe systems, and the like, we ended up making a big impact on computational biology, effectively building an entirely new tool-chain around a data structure originally created to solve systems problems.”

The papers describing this research have appeared in top venues for big-data and computational biology research, including SIGMOD, RECOMB, ISMB, and Bioinformatics.  This NSF-funded work is a collaboration with Fatemah Almodaresi, Michael Bender, Mike Ferdman, Prashant Pandey, and Rob Patro of Stony Brook University.


 

VMware’s Mihai Budiu presents “Hillview: A Spreadsheet for Big Data” at Stanford 11/29/17

Dr. Mihai Budiu, a Senior Researcher in VMware’s Research Group, will present on “Hillview:  A Spreadsheet for Big Data”, as part of the Stanford DAWN Project’s seminar series.  The talk will occur on Wednesday, November 29th, 2017 from 3 to 4 p.m., on campus in the Gates Computer Science Building in room 415.

SOSP ’17 Reflections

Vijay Chidambaram, UT Austin Assistant Professor and VMware Affiliated Researcher, wrote a blog post reflecting on the recent Symposium on Operating Systems Principles (SOSP 2017), where two papers from VMware were presented. SOSP is the premier venue for academic research on operating systems and distributed systems. Click here for more.

Big data for the 99% of #enterprises – that’s the motto powering VRG’s Project Hillview

The Hillview project is developing a set of tools for the exploration of very large datasets (“big data”). Our goal is to build a spreadsheet for big data that is easy to use and helps users browse and visualize data quickly. The motto of our project is “Big data for the 99% of enterprises.” The goal is to handle datasets with a billion rows interactively. Take a quick look here to see how we are accomplishing this. A demo of Hillview will be shown at VMworld US next Monday, 8/28 (session FUT2634PU).

VMware Research headed to ACM SIGCOMM

The VMware Research team is headed to ACM SIGCOMM.  See where we’ll be and come say hello! https://octo.vmware.com/want-meet-sigcomm-2017/