analytics apache_madlib data_science predictive_analytics products sql technical time_series_analysis

Time Series Analysis #1: Introduction to Window Functions

featured-TimeSeriesTime series data is an ordered sequence of observations of a particular variable, usually at evenly spaced time intervals. It is found in many real world applications, including click stream processing, financial analysis, and sensor data. Modeling time series data within a database presents a challenge, in that the fundamental ordered nature of the data will cause many of the interesting calculations to be outside of the traditional relational calculus. Fortunately, there are tools in the analyst’s toolbox that can aid in solving many common time series related problems.

The first of those tools, and the subject of this article, is the Window Function. First introduced in the SQL 2003 Standards specification, Window Functions enable queries to perform certain types of ordered capabilities. This is primarily accomplished through the “OVER” clause, which enables the specification of a “window” over which the calculation should be performed.

The basic syntactic elements of a window expression is:

SELECT function() OVER ( [ PARTITION BY … ]
                          [ ORDER BY … ]
[ ROWS|RANGE BETWEEN … ] )
FROM …

Or

SELECT function() OVER ( w )
FROM …
WINDOW w as ([ PARTITION BY … ]
             [ ORDER BY … ]
     [ ROWS|RANGE BETWEEN … ]) ;

Where:

  • PARTITION BY separates the data into distinct separable groups, similar to GROUP BY aggregation.
  • ORDER BY describes how the data should be ordered within a partition
  • RANGE|ROWS BETWEEN describes which rows within the partition apply for the current calculation, default is all the rows prior to the current one, e.g. “ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW”
  • WINDOW enables creating an alias for a particular window specification so that it can be simply referenced in multiple places within the query.

This enables many sorts of calculations that otherwise would be difficult or impossible to express within SQL including cross row calculations.

• Lag/Lead between rows:

    SELECT
    ts, x,
    lag(x, 1) OVER (ORDER BY ts),
    lead(x, 1) OVER (ORDER BY ts)
    FROM testdata;

                 ts             | x | lag | lead
    ----------------------------+---+-----+------
     2014-01-08 12:55:19.537497 | 1 |     |    2
     2014-01-08 12:55:22.36167  | 2 |   1 |    3
     2014-01-08 12:55:25.010043 | 3 |   2 |    4
     2014-01-08 12:55:27.321565 | 4 |   3 |

• Running accumulations:

    SELECT
    ts, x,
    count() OVER (ORDER BY ts),
    sum(x)  OVER (ORDER BY ts)
    FROM testdata;

                 ts             | x | count | sum
    ----------------------------+---+-------+-----
     2014-01-08 12:55:19.537497 | 1 |     1 |   1
     2014-01-08 12:55:22.36167  | 2 |     2 |   3
     2014-01-08 12:55:25.010043 | 3 |     3 |   6
     2014-01-08 12:55:27.321565 | 4 |     4 |  10

• Rolling averages:


    SELECT
    ts, x,
    avg(x) OVER (ORDER BY ts
                   rows between 1 preceding
                            and current row) as avg2,
    avg(x) OVER (ORDER BY ts
                   rows between 2 preceding
                            and current row) as avg3
    FROM testdata;

                 ts             | x | avg2 | avg3
    ----------------------------+---+------+------
     2014-01-08 12:55:19.537497 | 1 |    1 |    1
     2014-01-08 12:55:22.36167  | 2 |  1.5 |  1.5
     2014-01-08 12:55:25.010043 | 3 |  2.5 |    2
     2014-01-08 12:55:27.321565 | 4 |  3.5 |    3

 

• Windows over multiple partitions:


     SELECT
     tag, ts, x,
     lag(x) OVER (PARTITION BY tag ORDER BY ts)
     FROM testdata2
     ORDER BY tag, ts;

     tag  |             ts             | x | lag
    ------+----------------------------+---+-----
     tag1 | 2014-01-08 12:55:19.537497 | 1 |
     tag1 | 2014-01-08 12:55:22.36167  | 2 |   1
     tag1 | 2014-01-08 12:55:25.010043 | 3 |   2
     tag1 | 2014-01-08 12:55:27.321565 | 4 |   3
     tag2 | 2014-01-08 12:55:19.537497 | 1 |
     tag2 | 2014-01-08 12:55:22.36167  | 2 |   1
     tag2 | 2014-01-08 12:55:25.010043 | 3 |   2
     tag2 | 2014-01-08 12:55:27.321565 | 4 |   3

 

Two concrete real world use cases include:

  • Top-N queries
  • Sessionization of a Clickstream

Download the Pivotal Data Science Lab datasheet.

Top-N Queries

A common problem across many domains is getting a list of the best and worst of x. There are a couple approaches depending on how you want to treat ties, but both are basically the same—use a window function to partition and order based on your criteria and filter the results to the desired value of N.

Suppose we want to know the top three lowest priced homes of various types, such as condos, single family homes, multifamily homes, etc.

Option #1, ignore ties, and always return 3 results per group

    SELECT *
    FROM (
    SELECT id, type, price, row_number() OVER (w)
    FROM houses
    WINDOW w as (PARTITION BY type ORDER BY price)
    )  houses_with_row_number
    WHERE row_number <= 3
    ORDER BY type, price;

     id |     type      | price  | row_number
    ----+---------------+--------+------------
      2 | condo         |  85000 |          1
     11 | condo         |  87000 |          2
      5 | condo         | 133000 |          3
      1 | multi family  |  50000 |          1
      4 | multi family  |  90000 |          2
     13 | multi family  | 140000 |          3
     15 | single family |  65000 |          1
      3 | single family |  65000 |          2
      6 | single family |  90500 |          3

Option #2 , allow ties, rank is #(rows < x.price) + 1, which may return > N rows per partition.

            SELECT *
            FROM (
              SELECT id, type, price, rank() OVER (w)
              FROM houses
              WINDOW w as (PARTITION BY type ORDER BY price)
            )  houses_with_row_number
            WHERE rank <= 3
            ORDER BY type, price;

         id |     type      | price  | rank
        ----+---------------+--------+------
          2 | condo         |  85000 |    1
         11 | condo         |  87000 |    2
          5 | condo         | 133000 |    3
          1 | multi family  |  50000 |    1
          4 | multi family  |  90000 |    2
         13 | multi family  | 140000 |    3
          3 | single family |  65000 |    1
         15 | single family |  65000 |    1
          6 | single family |  90500 |    3

Option #3, allow ties, rank is #(prices < x.price) + 1, which may return > N rows per partition


    SELECT *
    FROM (
    SELECT id, type, price, dense_rank() OVER (w)
    FROM houses
    WINDOW w as (PARTITION BY type ORDER BY price)
    )  houses_with_row_number
    WHERE dense_rank <= 3
    ORDER BY type, price;

     id |     type      | price  | dense_rank
    ----+---------------+--------+------------
      2 | condo         |  85000 |          1
     11 | condo         |  87000 |          2
      5 | condo         | 133000 |          3
      1 | multi family  |  50000 |          1
      4 | multi family  |  90000 |          2
     13 | multi family  | 140000 |          3
      3 | single family |  65000 |          1
     15 | single family |  65000 |          1
      6 | single family |  90500 |          2
     12 | single family | 118600 |          3

See how the results for single_family is (1,2,3) using row_number(), (1,1,3) using rank(), and (1,1,2,3) using dense_rank(). Deciding which approach is appropriate for you depends on how you want to handle ties within the query results.

Sessionization of a Clickstream

Sessionization can be explained as the process of dividing a clickstream into sessions based on key-based partitioning, and within that subdividing based on click frequency. For example, one frequently used mechanism is to consider two clicks from a particular IP address that take place more than 30 minutes apart as being different sessions.

What makes this interesting is that the dependency on a session is defined based on the difference between the current timestamp and the previous timestamp, which we can calculate easily using the lag() window function:


    SELECT
    ipaddress, ts,
    ts - lag(ts,1) OVER (w) > '30 minutes' as newsession
    FROM clickstream
    WINDOW w as (PARTITION BY ipaddress ORDER BY ts)
    ORDER BY ipaddress, ts;

       ipaddress    |             ts             | newsession
    ----------------+----------------------------+------------
     74.125.239.102 | 2014-01-14 13:42:06.628822 |
     74.125.239.102 | 2014-01-14 13:42:14.65362  | f
     74.125.239.102 | 2014-01-14 13:47:52.497826 | f
     74.125.239.102 | 2014-01-14 13:53:04.643553 | f
     74.125.239.102 | 2014-01-14 16:43:20.533292 | t
     74.125.239.102 | 2014-01-14 16:43:25.174022 | f
     74.125.239.102 | 2014-01-16 16:46:44.109591 | t
     206.190.36.45  | 2014-01-14 13:41:50.923029 |
     206.190.36.45  | 2014-01-14 13:42:08.677066 | f
     206.190.36.45  | 2014-01-14 13:42:12.885571 | f

This gets us partway, but only gives us an indication of when a new session begins. To complete the problem, we need a distinct session number to be generated. This can be done with one extra pass over the result, to add up the new session events as an accumulated sum:


        WITH session_lag as (
          SELECT
            ipaddress, ts,
            ts - lag(ts,1) OVER (w) > '30 minutes' as newsession
          FROM clickstream
          WINDOW w as (PARTITION BY ipaddress ORDER BY ts)
        )
        SELECT
          ipaddress, ts,
          coalesce(sum(case when newsession then 1 end) OVER (w2), 0) as session_id
        FROM session_lag
        WINDOW w2 as (PARTITION BY ipaddress ORDER BY ts)
        ORDER BY ipaddress, ts;

           ipaddress    |             ts             | session_id
        ----------------+----------------------------+------------
         74.125.239.102 | 2014-01-14 13:42:06.628822 |          0
         74.125.239.102 | 2014-01-14 13:42:14.65362  |          0
         74.125.239.102 | 2014-01-14 13:47:52.497826 |          0
         74.125.239.102 | 2014-01-14 13:53:04.643553 |          0
         74.125.239.102 | 2014-01-14 16:43:20.533292 |          1
         74.125.239.102 | 2014-01-14 16:43:25.174022 |          1
         74.125.239.102 | 2014-01-16 16:46:44.109591 |          2
         206.190.36.45  | 2014-01-14 13:41:50.923029 |          0
         206.190.36.45  | 2014-01-14 13:42:08.677066 |          0
         206.190.36.45  | 2014-01-14 13:42:12.885571 |          0

Understanding window functions can be of great help to many sorts of Time Series capabilities. This post has outlined some of the base window function capabilities and provided a couple real world cases where they can be applied. Future blog posts in this series will continue to explore how additional times series capabilities can be realized within the Pivotal data platforms.