Graph analytics is an important part of enterprise computing. With roots in academia going back many decades, the last 10-15 years have seen a huge surge of interest in this topic to address a wide range of modern use cases, from cybersecurity to social networks to supply distribution chains.
Enterprises have made significant investments in SQL-based infrastructure, software, and training of their employees. Deploying and maintaining a specialized graph processing engine in addition to a relational database requires careful consideration, given the additional expense and effort involved. The ability to execute graph analytics in an MPP relational database allows enterprises to preserve their investments in SQL.
Much of the data that would be used to populate a graph database typically already resides in relational form, and it is expensive to copy data for processing in a separate graph system, then move the result sets back into the relational database for subsequent analyses. So a natural question to ask is if relational database systems are a fit for these types of workloads. In particular, in the case of very large graphs, can massively parallel processing (MPP) databases like Greenplum effectively solve practical graph processing problems at scale?
An additional consideration is the multi-modal nature of many data science problems today. For example, in the case of lateral movement detection in cybersecurity, data scientists generally employ multiple approaches when combating this important threat. In addition to graph models, they may use regression, clustering, user behavioural models, and recommendation systems combined with standard relational operations. If this computation can all occur within one engine, it is much more efficient than solving different parts of the problem in different systems and then trying to combine results. To do this, it’s helpful to have a collection of analytical functions that can be executed within the database cluster to reduce or eliminate data movement between data management and analytical environments. One such library is Apache MADlib.
Apache MADlib
MADlib is a Top Level Project in the Apache Software Foundation. It is an open source library for scalable in-database analytics, providing data-parallel implementations of mathematical, statistical, graph and machine learning methods for structured and unstructured data. It uses shared-nothing, distributed, scale-out architectures to offer data scientists an effective tool set for challenging problems involving very large data sets. MADlib is SQL-based and supports Pivotal Greenplum Database and PostgreSQL.
The graph library currently supports the following algorithms:
- All Pairs Shortest Path (APSP)
- Breadth-First Search
- Hyperlink-Induced Topic Search (HITS)
- PageRank
- Single Source Shortest Path (SSSP)
- Weakly Connected Components
- Measures
- Average Path Length
- Closeness Centrality
- Graph Diameter
- In-Out Degree
The library is being expanded by the community with new algorithms added on a regular basis.
Graph Representation
MADlib supports directed graphs (digraphs) containing vertices, edges and edge weights:
In the database, graphs are represented by a vertex table and an edge table:
For example, in an electrical circuit, vertices could represent diodes, transistors, capacitors, switches, etc., with edges being the wires connecting them.
Familiar SQL Syntax
MADlib functions are SQL based. Here is an example for PageRank, which is a popular link analysis algorithm that measures the importance of a vertex by counting the number and quality of the links to that vertex:
SELECT madlib.pagerank(
vertex_table, -- list of vertices in graph
vertex_id, -- col in vertex table containing vertex IDs
edge_table, -- list of edges in graph
edge_args, -- source, dest, edge weights cols in edge table
out_table -- output table with PageRank distribution
);
For a simple example, here is the output table showing vertices ranked in order of importance:
SELECT * FROM out_table ORDER BY pagerank DESC;
id | pagerank
----+-------------------
0 | 0.28753749341184
3 | 0.21016988901855
2 | 0.14662683454062
4 | 0.10289614384217
1 | 0.10289614384217
6 | 0.09728637768887
5 | 0.05258711765692
(7 rows)
Vertex-Centric Graph Processing
Graph computations can be challenging to parallelize and scale, due to the inherent interdependencies in graph data. Many real-world graph sizes exceed the memory capacity of a single machine, so iterative operations that attempt to reason across a large graph as a whole can be very time and resource intensive on the cluster.
The vertex-centric programming model is designed to address these shortcomings by taking a more local vertex perspective of computation. Vertex functions are decentralized design patterns that operate on data from adjacent vertices and incoming edges, and communicate out along outgoing edges. Vertex programs can be made highly scalable and inherently parallel, with reasonable amounts of serialization and communication across the network.
Many common graph algorithms, when viewed from a vertex-centric perspective, can be translated to standard SQL using scans, joins and aggregates over large tables. MPP databases like Greenplum are well suited for these types of queries because they take advantage of the advances in query optimization and execution that have been made over many years of intensive research and development.
Vertex-centric approaches may not offer all of the expressiveness of semantic query languages associated with graph databases, however, they are a good choice for many common use cases.
Example: Single Source Shortest Path (SSSP)
Consider the single source shortest path (SSSP) algorithm, which finds a path from a starting vertex to every other vertex, such that the sum of the weights of its constituent edges is minimized.
We base our implementation on Bellman-Ford algorithm [1, 2] because it supports digraphs and negative edge weights. Here is the idea: start with a naive approximation for the cost of reaching every vertex, and update the cost at each iteration. If the algorithm does not converge in |V|−1 iterations, where |V| is the number of vertices in the graph, this indicates the existence of a negative cycle in the graph, which is detected and reported.
The high level implementation is as follows:
This algorithm runs in O(|E| X |V|)
with storage O(|V|)
. Note that there are some differences with the standard Bellman-Ford algorithm:
- Line 3: We only check the vertices that have been updated in the last iteration.
- Line 6: At each iteration, we update a given vertex only one time. This means the toupdate set cannot contain multiple records for the same vertex.
For more details how the pseudo code above gets translated into SQL, please refer to the MADlib design document.
Performance
Below are results from scale tests of SSSP, PageRank, and weakly connected components (WCC) in MADlib for graph sizes from 1K vertices to 100M vertices. We use random graphs where the out-degrees of vertices follow the normal distribution, with a mean of 50 edges per vertex, meaning that the largest graph with 100M vertices has 5B edges.
The key point is the linear scale in run time as the size of the graph increases.
These tests were run on a Greenplum cluster with 1 master and 4 segment hosts, each having 12 cores, 24GB of memory, 2TB disk, running on RHEL 5.5. Different cluster sizes and configurations will of course produce different run times, but the scaling behavior will be the same.
Growth Ahead
A growing number modern business problems in analytics are about connections and relationships, not just discrete data, and this lends itself to graph approaches. Given that the majority of business data today resides in a relational form, scale out graph analytics on MPP databases such as Greenplum will continue to grow in importance.
References:
- [1] R. Bellman. “On a routing problem.” In: Quarterly of applied mathematics (1958), pp. 87–90.
- [2] L. R. Ford Jr. Network flow theory. Tech. rep. DTIC Document, 1956.