
How to Use Temporal Features in an MPP Database

Joint work by Greg Tam and Chris Rawles of Pivotal Data Science.

Feature engineering is a fundamental task in building predictive models. As data scientists, we often build features using data that varies with time. A common approach to creating features from temporal data is to condense the information to an account or individual level so it can be used to train a machine learning model.

In this blog, we will outline how to:

  • Create temporal features in a Massively Parallel Processing (MPP) database given a reference point, a feature window, and a prediction window

  • Set up the data for cross-validation and model building

As an example, suppose a telecommunications company wants to predict which customers will upgrade their phone plans in the next year. A given customer’s calling behavior over time is likely to provide predictive power for determining propensity to upgrade. Such insights can help the company allocate resources more effectively, provide customer-centric targeting, and forecast future revenue.

To model customer call behavior, we will need to create profiles of what customers looked like at specific points in time. For customers who have upgraded, we would like to capture their behavior before upgrading. For customers who have yet to upgrade, we can look at their behavior at any point in time. After deriving features for their behavior, we can then apply machine learning techniques using these features to gain insight into what drives customers to upgrade.

In the following example, we use customer cell phone plan changes and call information to show how to create temporal features for use in a model with an MPP database, such as Pivotal Greenplum (GPDB) or Apache HAWQ (incubating). We hypothesize that call behavior over time could be a powerful predictor for plan upgrades.

We integrate the plan change and call information tables and transform them to create features we can use for modeling. We will call these features temporal features; they can be used to summarize behavior right before a customer upgraded or downgraded.

Key Terms

  • Prediction window: The time period in which we are predicting an upgrade. In this example, a reasonable prediction window could be a year. This means our model is predicting the probability of a customer upgrading within the next year.

  • Reference point: The time point where we are making our prediction.

  • Feature window: The time window before the reference point that we use to derive features.

Figure 1: Feature window, reference point, and prediction window

This framework is very flexible; we can set different feature and prediction window sizes and have different reference points. For example, we may want to look at customer behavior in the past month to predict whether the customer will upgrade in the next 2 years, or look at the past year to predict for the next month.
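To make these windows concrete, here is a minimal sketch of how a reference point and the two windows translate into date filters in SQL. The reference point of 2016-07-01, the 90-day feature window, and the 1-year prediction window are all assumptions chosen for illustration, not values from the pipeline below; the tables referenced are the plan change and call tables introduced in the next sections:

-- Feature window: calls in the 90 days leading up to the reference point
SELECT *
  FROM call_data
 WHERE call_date >= DATE '2016-07-01' - INTERVAL '90 days'
   AND call_date <  DATE '2016-07-01';

-- Prediction window: plan changes in the year after the reference point
SELECT *
  FROM plan_data
 WHERE plan_start_dt >= DATE '2016-07-01'
   AND plan_start_dt <  DATE '2016-07-01' + INTERVAL '1 year';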

 

Creating Temporal Features

We will utilize the plan change table, plan_data, to derive our labels. In this case, our label would indicate whether a given customer’s plan change is an upgrade.

Plan Change Information (plan_data)

cust_id | plan_id | plan_type | plan_start_dt
------- | ------- | --------- | -------------
1       | A01     | A         | 1/1/16
1       | B02     | B         | 6/17/16
2       | B01     | B         | 3/14/16
2       | A02     | A         | 8/12/16

Table 1: Plan change information table

To build our predictive model, we must first assign labels indicating whether each plan change is an upgrade or not. We call this label is_upgrade; it will be the dependent variable in our supervised learning problem.

CREATE TABLE plan_data_label
    AS WITH plan_data_upgrade AS
            (SELECT *,
                    LAG(plan_type)
                        OVER(PARTITION BY cust_id
                                 ORDER BY plan_start_dt
                            ) AS last_plan
               FROM plan_data
            )
     SELECT cust_id,
            plan_id,
            plan_type,
            plan_start_dt,
            COALESCE(plan_type > last_plan, FALSE) AS is_upgrade
       FROM plan_data_upgrade
DISTRIBUTED BY (plan_id);

We utilize the LAG window function to return the previous plan_type for every customer. We then compare whether the current plan_type is an upgrade from the last_plan. If the plan change is the customer’s first plan, then LAG would return NULL. To account for this, we use the COALESCE function to set it to FALSE.
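Applied to the sample data in Table 1, this query produces the following labels (plan types compare alphabetically, so a move from plan type A to B counts as an upgrade, while B to A does not):

cust_id | plan_id | plan_type | plan_start_dt | is_upgrade
------- | ------- | --------- | ------------- | ----------
1       | A01     | A         | 1/1/16        | FALSE
1       | B02     | B         | 6/17/16       | TRUE
2       | B01     | B         | 3/14/16       | FALSE
2       | A02     | A         | 8/12/16       | FALSE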

 

Building Temporal Features

The next step is to derive temporal features from the phone call information table by capturing a given customer’s call behavior over time.

Phone Call Information (call_data)

cust_id | call_date | call_length (min)
------- | --------- | -----------------
1       | 4/23/16   | 23
1       | 5/23/16   | 80
1       | 6/5/16    | 70
2       | 3/15/15   | 5

Table 2: Phone call information table

We must create these features at the plan level, which we do by first inner joining call_data to plan_data_label on cust_id, as shown below.
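A minimal version of this join, following the same conventions as the query above:

CREATE TABLE plan_join_call_data
    AS SELECT cust_id,
              plan_id,
              plan_type,
              plan_start_dt,
              is_upgrade,
              call_date,
              call_length
         FROM plan_data_label
              INNER JOIN call_data
                   USING (cust_id)
DISTRIBUTED BY (plan_id);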

Plan Table Joined with Call Table (plan_join_call_data)

cust_id | plan_id | plan_type | plan_start_dt | is_upgrade | call_date | call_length
------- | ------- | --------- | ------------- | ---------- | --------- | -----------
1       | A01     | A         | 1/1/16        | FALSE      | 4/23/16   | 23
1       | A01     | A         | 1/1/16        | FALSE      | 5/23/16   | 80
1       | A01     | A         | 1/1/16        | FALSE      | 6/5/16    | 70
1       | B02     | B         | 6/17/16       | TRUE       | 4/23/16   | 23
1       | B02     | B         | 6/17/16       | TRUE       | 5/23/16   | 80
1       | B02     | B         | 6/17/16       | TRUE       | 6/5/16    | 70
2       | B01     | B         | 3/14/16       | FALSE      | 3/15/15   | 5
2       | A02     | A         | 8/12/16       | FALSE      | 3/15/15   | 5

Table 3: Joining the 2 tables so we can create features

Using this table, we can create features that encode information such as:

  1. The number of minutes a given customer used in the last 30 days
  2. The number of calls a given customer made in the last 30 days

Creating these features relies on having a reference point, a feature window, and a prediction window. Template code to do this is shown below:

CREATE TABLE [insert table name]
  AS SELECT cust_id,
            is_upgrade,
            plan_id,
            plan_start_dt,
            COUNT(CASE WHEN [if in feature window]
                            THEN 1
                       ELSE NULL
                   END
                 ) AS num_calls,
            SUM(CASE WHEN [if in feature window]
                          THEN call_length
                     ELSE NULL
                 END
               ) AS total_call_len
       FROM plan_data_label
            INNER JOIN call_data
                 USING (cust_id)
      WHERE is_upgrade = [insert boolean for pos/neg example]
        AND [conditions for train/test set]
      GROUP BY cust_id, is_upgrade, plan_id, plan_start_dt
DISTRIBUTED BY (plan_id);

The CASE WHEN statement restricts our COUNT and SUM to only the phone calls that happened within the defined feature window. The WHERE clause determines whether our features are generated for the training or test set and for either positive or negative examples. In the next section, we discuss how to set up the framework for these 4 different combinations.

Setting up the Data for Cross-Validation

The next step in our modeling pipeline is to split up our data set for cross-validation. We will need to utilize the template code shown above 4 times to create 4 different tables—positive training examples, negative training examples, positive test examples, and negative test examples—using different feature windows and conditions for each. Once this is done, we union the respective training and test sets together.
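Assuming the 4 tables created from the template are named pos_train_examples, neg_train_examples, pos_test_examples, and neg_test_examples (hypothetical names), the union step might look like this:

CREATE TABLE train_set
    AS SELECT * FROM pos_train_examples
        UNION ALL
       SELECT * FROM neg_train_examples
DISTRIBUTED BY (plan_id);

CREATE TABLE test_set
    AS SELECT * FROM pos_test_examples
        UNION ALL
       SELECT * FROM neg_test_examples
DISTRIBUTED BY (plan_id);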

We set up our training and test sets to best mimic a real-life scenario by using earlier data for training and more recent data for testing. We call the date that splits these 2 data sets the train/test split date. By doing this, we use the information before this date to train our model and then test it on data received between the train/test split date and the current date. This test set acts as future data: after modeling, we can use it to verify the performance of the model. We can split up the data before the train/test split date into training and validation sets in the same manner.

Figure 2: Cross-validation

 

                  | Training Set                                                                                                        | Test Set
Positive Examples | Plan upgrade date before the train/test split date; reference point at the plan upgrade date                       | Plan upgrade date after the train/test split date; reference point at the train/test split date
Negative Examples | End of the prediction window comes before the train/test split date; no examples shared with the negative test set | Prediction window falls within the test period; no examples shared with the negative training set

Table 4: Reference points for different training and test sets

Table 4 above describes how we can set our reference points for our training and testing examples. Positive example reference points must be placed such that there is an upgrade within the prediction window. Negative example reference points must be placed so there are no upgrades within the prediction window.
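As a concrete illustration, here is one way to instantiate the template for the positive training examples. The train/test split date of 2016-07-01 and the 90-day feature window are assumptions chosen for this sketch; per Table 4, the reference point is the plan upgrade date (plan_start_dt):

CREATE TABLE pos_train_examples
  AS SELECT cust_id,
            is_upgrade,
            plan_id,
            plan_start_dt,
            COUNT(CASE WHEN call_date >= plan_start_dt - INTERVAL '90 days'
                            AND call_date < plan_start_dt
                            THEN 1
                       ELSE NULL
                   END
                 ) AS num_calls,
            SUM(CASE WHEN call_date >= plan_start_dt - INTERVAL '90 days'
                          AND call_date < plan_start_dt
                          THEN call_length
                     ELSE NULL
                 END
               ) AS total_call_len
       FROM plan_data_label
            INNER JOIN call_data
                 USING (cust_id)
      WHERE is_upgrade = TRUE                     -- positive examples only
        AND plan_start_dt < DATE '2016-07-01'     -- upgrade happened before the split date
      GROUP BY cust_id, is_upgrade, plan_id, plan_start_dt
DISTRIBUTED BY (plan_id);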

Figure 3: Reference points diagram

Once we have created our training and test sets in this manner, we can leverage the parallel power of Apache MADlib (incubating) or procedural languages such as PL/R and PL/Python to build a data-driven predictive model.
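For example, a logistic regression model could be trained directly in the database with MADlib’s logregr_train function. This is a minimal sketch: the train_set table and feature columns follow the examples above, and the output table name upgrade_model is an assumption:

SELECT madlib.logregr_train(
    'train_set',                            -- source table
    'upgrade_model',                        -- output table for the fitted model
    'is_upgrade',                           -- dependent variable
    'ARRAY[1, num_calls, total_call_len]'   -- independent variables (1 = intercept)
);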

This blog has laid out the foundation to create temporal features in an MPP database. The features we have generated are only a small taste of what can be created using this framework. In the real world, domain experts can help data scientists build intelligent temporal features by giving them insights into potential key drivers.


Read more about how we do data science at Pivotal.