
Massively Parallel Automated Model Building for Deep Learning

This post was co-written by Advitya Gemawat (UC San Diego) and Frank McQuillan (VMware).

The post references joint work between VMware and Dr. Arun Kumar of the Department of Computer Science and Engineering (CSE) and the Halicioglu Data Science Institute (HDSI), and Advitya Gemawat of HDSI at the University of California, San Diego.

Artificial neural networks are seeing increased adoption in the enterprise as they can be used to create accurate models in diverse domains, such as natural language processing and image recognition. However, training deep neural networks is expensive, since many trials are typically needed to select the best model architecture and associated hyperparameters. (Hyperparameters are variables that are set manually, as opposed to parameters like weights, which are computed automatically during training.) Automated machine learning, or AutoML, automates the time-consuming, iterative task of model development in order to make data science workflows more efficient and, in the process, minimize the amount of resources consumed. 
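
To make the distinction concrete, here is a minimal Keras sketch; the layer sizes, learning rate, and batch size are illustrative only, not the configurations used later in this post. The learning rate and batch size are hyperparameters set by hand, while the layer weights are parameters learned automatically during training.

    import tensorflow as tf

    # Hyperparameters: set manually before training starts (illustrative values).
    learning_rate = 0.001
    batch_size = 128

    # A small illustrative model; its weights are parameters,
    # computed automatically by the optimizer during training.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # model.fit(x_train, y_train, batch_size=batch_size, epochs=10)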

In this post, we describe how different AutoML methods can be efficiently run on VMware Tanzu Greenplum, a massively parallel processing (MPP) data platform that can potentially have hundreds of segments (workers) training many model configurations simultaneously [1, 2, 3]. Tanzu Greenplum, based on open source Postgres, features a shared-nothing architecture, which is well suited for analytic methods that exploit parallelism over massive data sets. Let’s first examine how Tanzu Greenplum’s parallelism is used to accelerate the process of model development. 

Grid and random search: Keep it simple 

For hyperparameter tuning, data scientists often start with grid or random search since they are simple, well understood methods. 

Let’s say we want to investigate two parameters for optimizing a function, and have a budget of nine trials. If one parameter matters little and the other matters a lot, grid search tests only three distinct values of the important parameter, whereas random search explores nine. This is relevant because the more we explore the important parameter, the higher the likelihood that we will find an optimal function value (the peak).

 

Grid and random search of nine trials for optimizing a function [4] 

When it comes to higher dimensional hyperparameter optimization for deep nets, the same idea applies, and it can be shown empirically and theoretically that random trials are more efficient than trials based on a grid [4]. Even so, grid search is still commonly used today [5]. 
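
To sketch the difference in code, compare the points each strategy evaluates under the same nine-trial budget; the objective function here is a made-up stand-in in which only one of the two parameters matters much.

    import itertools
    import random

    def objective(important, unimportant):
        # Toy stand-in: the value depends almost entirely on 'important'.
        return -(important - 0.7) ** 2 + 0.001 * unimportant

    # Grid search: a 3 x 3 grid covers only 3 distinct values of 'important'.
    grid = list(itertools.product([0.0, 0.5, 1.0], repeat=2))
    best_grid = max(objective(x, y) for x, y in grid)

    # Random search: 9 random points cover 9 distinct values of 'important'.
    random.seed(42)
    rand = [(random.random(), random.random()) for _ in range(9)]
    best_random = max(objective(x, y) for x, y in rand)

    print(best_grid, best_random)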

To see how random search works in practice, consider a random search over 80 different model configurations drawn from three model architectures, three optimizers, and several different hyperparameters. The search space for the hyperparameters is intentionally chosen to be very broad in order to illustrate the wide range of accuracies that results from training. We used the well-known CIFAR-10 dataset for computer vision and trained each configuration for 10 iterations, which took a total of 91 minutes on the test cluster.*

Random search of 80 model configurations on CIFAR-10 dataset 

Few models achieve more than 80 percent validation accuracy; indeed, the histogram shows a lot of poor performers. That’s because there is no intervention once a configuration is started: it is trained to completion even if it shows no promise. This means that with random search (and with grid search as well), one needs to start with a fairly good search space to get accurate results, or a lot of compute cycles are wasted. Hand-tuning and multiple runs are also typically necessary.
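
The kind of configuration sampling used in this random search can be sketched as follows; the architecture names, optimizer list, and hyperparameter ranges are placeholders rather than the exact search space we used.

    import random

    random.seed(0)

    # Placeholder search space (names and ranges are illustrative only).
    architectures = ['cnn_small', 'cnn_medium', 'cnn_large']
    optimizers = ['sgd', 'adam', 'rmsprop']

    def sample_configuration():
        return {
            'architecture': random.choice(architectures),
            'optimizer': random.choice(optimizers),
            # Deliberately broad, log-uniform learning rate range.
            'learning_rate': 10 ** random.uniform(-5, -1),
            'batch_size': random.choice([32, 64, 128, 256]),
        }

    # 80 random configurations; each is trained for a fixed number of
    # iterations, spread across the Greenplum segments.
    configurations = [sample_configuration() for _ in range(80)]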

Hyperband: If it’s working, keep doing it 

Hyperband improves on grid and random search by continuing to train configurations that are doing well and stopping those that are not, in order to make more efficient use of resources [6]. It’s an example of successive halving, in which the final surviving configurations represent the most accurate models.

Successive halving of model configurations to keep the best performers  

Hyperband allocates exponentially more resources to promising configurations. It also has an exploration aspect from random selection of initial configurations. The training schedule is based on two input parameters: 

  1. R – The maximum resources (here, training iterations) that can be allocated to a single configuration 

  2. 𝜂 – Controls the proportion of configurations discarded in each round of successive halving (only the top 1/𝜂 survive)

An example schedule for R=81 and 𝜂=3 consisting of five brackets that are run top to bottom, left to right, is shown below. In the first bracket on the left, for s=4, 81 configurations are randomly selected from the search space and each is trained for one iteration. The best 27 of these are trained for an additional three iterations each. Then the best nine of those are trained for nine more iterations each, and so on down the bracket. Moving over to the second bracket from the left, for s=3, 27 configurations are randomly selected from the search space, then the process repeats. At the end of the entire schedule, the best model is selected from all runs across all brackets. 

Hyperband schedule for R=81 and 𝜂=3 [6] 
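
For clarity, here is a minimal sketch of the successive-halving arithmetic behind this schedule; the per-bracket starting points follow the schedule from [6], and the actual training step is omitted.

    R = 81    # maximum iterations for any single configuration
    eta = 3   # only the top 1/eta of configurations survive each round

    # Starting points (number of configurations, iterations each) for the
    # five brackets s = 4, 3, 2, 1, 0 in the schedule from [6].
    bracket_starts = {4: (81, 1), 3: (27, 3), 2: (9, 9), 1: (6, 27), 0: (5, 81)}

    for s in sorted(bracket_starts, reverse=True):
        n, r = bracket_starts[s]
        print(f'bracket s={s}')
        for i in range(s + 1):
            n_i = n // eta ** i    # configurations still in the running
            r_i = r * eta ** i     # iterations each one is trained for
            print(f'  round {i}: train {n_i} configs for {r_i} iterations each')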

As an example, we ran the schedule above on the CIFAR-10 dataset on the test cluster, excluding the bottom row of each bracket because training for 81 iterations is quite expensive. Comparing the validation accuracy from this Hyperband run (below) with the previous random search example, there are fewer low-accuracy models, since poor performers were discarded early on. Hyperband can also explore more configurations for a given resource budget, and the best model it found in this example is more accurate than the best found by random search.

Hyperband run on CIFAR-10 for R=81 and 𝜂=3 (skip last row in each bracket) 

To run Hyperband efficiently on an MPP cluster, Apache MADlib (an open source library of analytical methods for Postgres-based databases) uses a novel implementation that runs multiple brackets at the same time. Doing so avoids the situation where machines sit idle toward the bottom of a given bracket.

Hyperopt: Search for the best 

Hyperopt is a meta-modeling approach that uses Bayesian optimization to explore a search space and narrow in on the best estimated set of parameters [7]. First, define the boundaries of the search space, meaning the model architectures and hyperparameter ranges of interest. Next, specify the number of trials to evaluate and the number of training iterations that make up one trial. After all trials complete, pick the configuration that produced the most accurate model.

 Hyperopt implementation on MPP 
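
Here is a minimal sketch of that basic workflow using the open source hyperopt library; the search space and objective below are stand-ins, since in practice the objective would train a model configuration on the cluster and return its validation loss.

    from hyperopt import Trials, fmin, hp, tpe

    # Boundaries of the search space: hyperparameter ranges of interest (illustrative).
    space = {
        'learning_rate': hp.loguniform('learning_rate', -10, -1),
        'batch_size': hp.choice('batch_size', [32, 64, 128, 256]),
    }

    def objective(config):
        # Stand-in for training one configuration for a fixed number of
        # iterations and returning a loss such as 1 - validation accuracy.
        return (config['learning_rate'] - 0.01) ** 2

    trials = Trials()
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=500, trials=trials)
    print(best)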

We ran 500 trials on the test cluster with the CIFAR-10 dataset, which completed in approximately 8.5 hours. We once again started with a broad search space and plotted the best validation accuracy over the 500 trials. As the figure below makes clear, improvements are larger in early trials than in later ones, with the final accuracy ending up lower than that of the Hyperband run above and similar to that of random search. The next step for the practitioner might be to narrow the search space to the most promising configurations, then re-run Hyperopt for additional trials to achieve higher accuracy.

 Hyperopt run on CIFAR-10 with 500 trials 

To make Hyperopt scale on MPP, multiple trials are run in parallel rather than one trial at a time. This means less frequent updates back to Hyperopt, yet it still preserves the information about each trial. We also assume that the model architecture can be represented as a parameter in the search space; preliminary testing indicates that this assumption holds. The alternative would be to run Hyperopt on only one model architecture at a time.
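
One way to express that assumption with hyperopt (the architecture names below are placeholders) is to treat the architecture as just another dimension of the search space:

    from hyperopt import hp

    # Model architecture represented as an ordinary search-space parameter.
    space = {
        'architecture': hp.choice('architecture',
                                  ['cnn_small', 'cnn_medium', 'cnn_large']),
        'optimizer': hp.choice('optimizer', ['sgd', 'adam', 'rmsprop']),
        'learning_rate': hp.loguniform('learning_rate', -10, -1),
    }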

AutoML on Tanzu Greenplum for enterprise deep learning 

AutoML methods for training deep nets have been implemented on Tanzu Greenplum for enterprise workloads. These methods take advantage of the horizontal scale of MPP so that highly accurate models can be found efficiently.

Although AutoML methods can make a data scientist’s work more efficient, it’s still common practice to run a given AutoML method more than once: narrow the search space to the most promising configurations, then do one or more additional runs. Even so, the level of effort required with AutoML is lower than with more manual approaches like grid and random search.

Learn more 

Ready to take the next step? Learn more about Apache MADlib and Greenplum: 

 —————————————- 

*Test infrastructure: 

  • Google Cloud Platform 

  • Five hosts, each with 32 vCPUs, 150 GB memory, and four NVIDIA Tesla P100 GPUs (20 total) 

  • Greenplum 6 with four segments per host (20 total) 

  • Apache MADlib 1.18.0 

  • TensorFlow 1.14 

References 

[1] Model Selection for Deep Neural Networks on Greenplum Database 

[2] “Cerebro: A Data System for Optimized Deep Learning Model Selection,” Proceedings of the VLDB Endowment, Vol. 13, No. 11 

[3] Efficient Model Selection for Deep Neural Networks on Massively Parallel Processing Databases, FOSDEM'20 conference video 

[4] “Random Search for Hyper-Parameter Optimization,” Journal of Machine Learning Research 13 (2012) 281-305 

[5] Survey of machine-learning experimental methods at NeurIPS2019 and ICLR2020 

[6] “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization,” Journal of Machine Learning Research 18 (2018) 1-52 

[7] “Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures,” Proceedings of the 30th International Conference on Machine Learning, PMLR 28(1):115-123, 2013.