Databases and GPUs

In previous posts [1, 2] we described how the Greenplum database can be used to train deep neural networks and run inference with GPUs, using the Apache MADlib open source library. In this post we'll expand on the mechanics of connecting GPUs to Greenplum, a massively parallel processing (MPP) database.

There are different ways to leverage GPUs in a database. One approach is to make certain database operations faster by parallelizing portions of the workload (e.g., aggregations, sorting, grouping) and reducing the dependency on indexing and partitioning. This approach results in what are commonly called GPU databases, which are typically "ground up" development projects designed around GPUs. Some examples are BlazingDB, Kinetica, OmniSci (formerly MapD), and SQream. Brytlyt and HeteroDB's PG-Strom take a different route, retrofitting PostgreSQL to make it GPU-aware. AWS Neptune incorporates Blazegraph for GPU-accelerated graph processing.

GPU databases have been on the market for several years. They are technically impressive but remain niche and have seen relatively modest adoption.  One challenge is convincing large enterprise customers to accept the risk of moving critical analytic workloads to a completely new database, especially if that database only accelerates certain types of operations.

A different approach is to employ GPUs to power standard deep learning libraries on an existing database, without making changes to the query processing function of the database server itself. This is the approach we took with Greenplum: combine all of the capabilities of a mature, fully-featured MPP database with GPU acceleration to train deep learning models faster, using all of the data residing in the database.

GPUs are a resource shared by the segments (workers) on each host (Figure 1).

Figure 1:  Greenplum Architecture for Deep Learning

The design is intended to eliminate transport delays across the network interconnect, since each segment uses GPUs local to its own host.

Designed for Business 

Deep learning is effective in domains such as language processing, image recognition, fraud detection, and recommendation systems, and all of these tasks can take advantage of the parallelism of GPUs. Moreover, it's advantageous to bring the computation to where the data resides, rather than moving large datasets between systems depending on the analytical workload. This is particularly important for neural networks, which require large training sets compared to other machine learning methods (Figure 2).

Figure 2:  Importance of Scale for Neural Networks [3]

There are rapid innovations taking place in open-source deep learning libraries like TensorFlow and Keras.  By supporting these standard libraries, Greenplum users can stay up-to-date as these libraries evolve.

From the point of view of IT management, running deep learning algorithms in an existing database means enterprises do not need to deal with yet another vendor, the additional expense and effort that entails, or the risk of creating new data silos. Further, modern data science pipelines involve loading data from multiple sources, joining, cleansing, transforming, and feature engineering, and these are exactly the types of operations that MPP databases excel at.
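To make this concrete, the statement below is a minimal sketch of in-database data preparation; the table and column names (raw_transactions, customers, customer_features) are hypothetical.

-- Join, cleanse, and engineer simple features entirely inside the database.
CREATE TABLE customer_features AS
SELECT
    t.customer_id,
    c.segment,
    COUNT(*)                      AS txn_count,     -- aggregate feature
    AVG(t.amount)                 AS avg_amount,
    MAX(t.amount) - MIN(t.amount) AS amount_range
FROM raw_transactions t
JOIN customers c USING (customer_id)
WHERE t.amount IS NOT NULL                          -- basic cleansing
GROUP BY t.customer_id, c.segment
DISTRIBUTED BY (customer_id);                       -- Greenplum distribution key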

Configuration 

The Greenplum database can run on-premises as well as on all of the major cloud providers. The database cluster can be configured with any number of GPUs, depending on the desired price/performance tradeoff.

Once the database cluster is set up, you specify the number of GPUs to use when calling the training function. For example, the SQL below trains a Keras model on the well-known CIFAR-10 dataset of images [4].
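The call is a minimal sketch based on MADlib's madlib_keras_fit() function; the table names, model architecture id, and compile and fit parameters are illustrative placeholders rather than values from an actual run.

SELECT madlib.madlib_keras_fit(
    'cifar10_train_packed',    -- packed training data table (illustrative name)
    'cifar10_model',           -- output table for the trained model (illustrative name)
    'model_arch_library',      -- table holding the Keras model architecture (illustrative name)
    1,                         -- id of the architecture to train
    $$ loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'] $$,  -- compile parameters
    $$ batch_size=256, epochs=1 $$,  -- fit parameters
    10,                        -- number of training iterations
    4                          -- GPUs per host
);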

The GPUs per host parameter specifies the number of GPUs on each segment host to use for training.  In this example it is set to 4, which means that the segments (workers) on the host will share these 4 GPUs.  A general rule of thumb is to have the same number of segments and GPUs on a host, though other combinations are possible.
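As an aside, one way to see how many primary segments run on each host, and hence how they will map onto the available GPUs, is to query the gp_segment_configuration catalog; the query below is a minimal sketch.

-- Count primary segments per host; compare with the number of GPUs on each host.
SELECT hostname, count(*) AS primary_segments
FROM gp_segment_configuration
WHERE role = 'p' AND content >= 0
GROUP BY hostname
ORDER BY hostname;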

Specifying 0 for this parameter means training with CPUs rather than GPUs. This can be useful for initial runs and for debugging shallow neural networks on smaller datasets, say on PostgreSQL, before moving to more expensive GPUs to train a deep neural network on the whole dataset on Greenplum.
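For instance, the illustrative call above could be turned into a quick CPU-only debugging pass by setting the final argument to 0 (again a sketch, not output from an actual run):

SELECT madlib.madlib_keras_fit(
    'cifar10_train_packed',    -- packed training data table (illustrative name)
    'cifar10_model_cpu',       -- output table for the trained model (illustrative name)
    'model_arch_library',      -- table holding the Keras model architecture (illustrative name)
    1,                         -- id of the architecture to train
    $$ loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'] $$,
    $$ batch_size=256, epochs=1 $$,
    5,                         -- fewer iterations for a quick debugging run
    0                          -- GPUs per host: 0 means train on CPUs
);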

Future Work

As part of the Apache MADlib project, the community plans to add new deep learning capability with each release.  For example, currently, we assume symmetric cluster configurations with the same number of GPUs attached to each segment host.  However, you may wish to have GPUs attached to only certain hosts for cost control. These types of asymmetric configurations will be supported in a future release of MADlib.

NVIDIA GPUs dominate the market today, but as new AI acceleration chipsets and systems develop, we anticipate supporting them as well.

References:

[1] GPU-Accelerated Deep Learning on Greenplum Database, https://content.pivotal.io/engineers/gpu-accelerated-deep-learning-on-greenplum-database

[2] Transfer Learning for Deep Neural Networks on Greenplum Database, https://content.pivotal.io/practitioners/transfer-learning-for-deep-neural-networks-on-greenplum-database

[3] Trends and Developments in Deep Learning Research, Jeff Dean, Jan 2017, https://www.slideshare.net/AIFrontiers/jeff-dean-trends-and-developments-in-deep-learning-research

[4] CIFAR-10 dataset, https://www.cs.toronto.edu/~kriz/cifar.html

Learning More:

Ready to take the next step? Great! We recommend you: