Survival Analysis

How to build duration tables from event logs with SQL

Survival modeling tools often require target variables to be specified in duration table format. A duration table lists each unit of a study (e.g. a website user), whether or not they experienced some outcome of interest (e.g. a purchase), and how long it took for that outcome to happen, if known.

Although duration tables are relatively straightforward conceptually, they are a bit tricky to construct from raw data. In my last article, I showed how to build a duration table in Python from an event log, using the example of web browser activity. Check out the first half of that article for a more detailed explanation of duration tables and event logs.

The thing is, event logs are usually stored in a database or data warehouse that we query with SQL, not Python. So in this article, I show how to build a duration table from an event log in SQL.1

The following code—including commands to set up the event log—can be found in The Crosstab Kite's gists repo, in the file events_to_durations.sql. In many of the following snippets, I implicitly include a LIMIT clause to display the output.

The event log

Event logs are a type of fact table. Each row represents an event, defined by the unit (e.g. user), timestamp, and type of event. Our demo data is from e-commerce site Retail Rocket, downloaded from Kaggle. For this demo, suppose we've already copied the data into a table called event_log.

1
SELECT * FROM event_log LIMIT 5;

 visitorid | event_type | itemid | transactionid |          event_at          
-----------+------------+--------+---------------+----------------------------
    257597 | view       | 355908 |               | 2015-06-01 22:02:12.117-07
    992329 | view       | 248676 |               | 2015-06-01 22:50:14.164-07
    111016 | view       | 318965 |               | 2015-06-01 22:13:19.827-07
    483717 | view       | 253185 |               | 2015-06-01 22:12:35.914-07
    951259 | view       | 367447 |               | 2015-06-01 22:02:17.106-07

The fields we're interested in are:

  • visitorid: ID of a Retail Rocket user, presumably. Event logs are long-form tables, with records from all of the units.
  • event_type
  • event_at

Most rows in the log are view events. There are also addtocart and transaction (i.e. purchase) events, but they are much rarer. Most visitors in this dataset do not have any transactions at all.

1
2
3
4
5
6
SELECT
    event_type,
    count(1) as num_observations
FROM event_log
GROUP BY 1
ORDER BY 2 DESC;

 event_type  | num_observations 
-------------+------------------
 view        |          2664312
 addtocart   |            69332
 transaction |            22457

The sequence of events for a visitor who does complete a transaction looks like this:

1
2
3
4
5
SELECT
    *
FROM event_log 
WHERE visitorid='1050575' 
ORDER BY event_at ASC;

 visitorid | event_type  | itemid | transactionid |          event_at          
-----------+-------------+--------+---------------+----------------------------
   1050575 | view        | 116493 |               | 2015-07-31 10:27:21.908-07
   1050575 | view        | 31640  |               | 2015-07-31 14:08:45.248-07
   1050575 | view        | 273877 |               | 2015-07-31 14:09:21.716-07
   1050575 | view        | 31640  |               | 2015-07-31 14:10:18.947-07
   1050575 | addtocart   | 31640  |               | 2015-07-31 14:11:12.772-07
   1050575 | transaction | 31640  | 8354          | 2015-07-31 14:12:56.57-07

The full query and result

To give you a sense of what the final duration table looks like, here's the full query and the first five rows of the output. In the next section, we'll break it down and walk through each part in sequence.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
-- Find the entry time for each unit.
WITH entry_times AS (
    SELECT
        visitorid,
        min(event_at) AS event_at
    FROM event_log
    GROUP BY 1
),

-- Get the earliest endpoint event for units that have an endpoint.
endpoint_events AS (
    SELECT *
    FROM event_log
    WHERE event_type IN ('transaction')
),

first_endpoint_events AS (
    SELECT 
        *
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER(PARTITION BY visitorid ORDER BY event_at ASC) AS row_num
        FROM endpoint_events
    ) AS _
    WHERE row_num = 1
),

-- Define the censoring time to be the latest timestamp in the whole event log.
censoring AS (
    SELECT max(event_at) AS event_at FROM event_log
)

-- Put all the pieces together as a *duration table*.
SELECT 
    entry_times.visitorid,
    entry_times.event_at as entry_at,
    endpt.event_type AS endpoint_type,
    endpt.event_at AS endpoint_at,
    COALESCE(endpt.event_at, censoring.event_at) as final_obs_at,
    COALESCE(endpt.event_at, censoring.event_at) - entry_times.event_at as duration
FROM censoring, entry_times
LEFT JOIN first_endpoint_events AS endpt
    USING(visitorid)

 visitorid |          entry_at          | endpoint_type | endpoint_at |        final_obs_at        |       duration       
-----------+----------------------------+---------------+-------------+----------------------------+----------------------
         0 | 2015-09-11 13:49:49.439-07 |               |             | 2015-09-17 19:59:47.788-07 | 6 days 06:09:58.349
         1 | 2015-08-13 10:46:06.444-07 |               |             | 2015-09-17 19:59:47.788-07 | 35 days 09:13:41.344
         2 | 2015-08-07 10:51:44.567-07 |               |             | 2015-09-17 19:59:47.788-07 | 41 days 09:08:03.221
         3 | 2015-08-01 00:10:35.296-07 |               |             | 2015-09-17 19:59:47.788-07 | 47 days 19:49:12.492
         4 | 2015-09-15 14:24:27.167-07 |               |             | 2015-09-17 19:59:47.788-07 | 2 days 05:35:20.621

Breaking it down

Let's look at each of the Common Table Expressions (CTEs) as though they were standalone queries, then we'll parse the final statement that pulls everything together.

Find the entry time for each unit

The first thing we need to do is to find when each visitor first entered the system. The simplest way is to use the earliest logged timestamp. As with the Python version of this article, the GROUP BY functionality is our workhorse because we need to operate on each unit separately.

1
2
3
4
5
SELECT
    visitorid,
    min(event_at) AS event_at
FROM event_log
GROUP BY 1

 visitorid |          event_at          
-----------+----------------------------
         0 | 2015-09-11 13:49:49.439-07
         1 | 2015-08-13 10:46:06.444-07
         2 | 2015-08-07 10:51:44.567-07
         3 | 2015-08-01 00:10:35.296-07
         4 | 2015-09-15 14:24:27.167-07

Find each unit's endpoint time, if it exists

Finding the endpoint time for each visitor is trickier. First things first, we define our endpoint of interest to be the transaction event. Some visitors don't have an observed transaction, though, while others have multiple transactions.

We first filter the raw data down to just the endpoint events, i.e. transactions, then use a window function to take the earliest endpoint event for each visitor. Keep in mind that only visitors who have an observed endpoint are represented in this result; we'll need that fact in two more paragraphs.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
WITH endpoint_events AS (
    SELECT *
    FROM event_log
    WHERE event_type IN ('transaction')
)

SELECT 
    *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER(PARTITION BY visitorid ORDER BY event_at ASC) AS row_num
    FROM endpoint_events
) AS _
WHERE row_num = 1

 visitorid | event_type  | itemid | transactionid |          event_at          | row_num 
-----------+-------------+--------+---------------+----------------------------+---------
       172 | transaction | 465522 | 9725          | 2015-08-14 18:29:01.23-07  |       1
       186 | transaction | 49029  | 8726          | 2015-08-12 09:34:57.04-07  |       1
       264 | transaction | 459835 | 8445          | 2015-09-07 10:34:45.614-07 |       1
       419 | transaction | 19278  | 16455         | 2015-07-28 22:03:12.695-07 |       1
       539 | transaction | 94371  | 14778         | 2015-06-15 22:39:38.673-07 |       1

Define the censoring time

Our target variable is the time it took for each visitor to make a transaction or reach the censoring time, which is the end of our observation period. In practice, it's usually best to choose either the current time or the latest timestamp in the event log.

1
SELECT max(event_at) AS event_at FROM event_log

          event_at          
----------------------------
 2015-09-17 19:59:47.788-07

Putting it all together

Last but not least, we pull all the pieces together (this snippet doesn't run on its own, it requires the CTEs defined in the full query listed above).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
SELECT 
    entry_times.visitorid,
    entry_times.event_at as entry_at,
    endpt.event_type AS endpoint_type,
    endpt.event_at AS endpoint_at,
    COALESCE(endpt.event_at, censoring.event_at) as final_obs_at,
    COALESCE(endpt.event_at, censoring.event_at) - entry_times.event_at as duration
FROM censoring, entry_times
LEFT JOIN first_endpoint_events AS endpt
    USING(visitorid)

There's a lot going on here:

  • We start with the entry_times CTE because we know it includes all of the visitors, one row for each. This is how we want the duration table to be structured.

  • The censoring CTE does not need to be joined because it's a single value. It is automatically broadcast to every row.

  • We use a LEFT JOIN to bring in fields from the first_endpoint_events CTE because first_endpoint_events only has rows for visitors who had an observed transaction event.

  • The final observation timestamp (final_obs_at) is the time of the first transaction if a visitor has one, or it defaults to the censoring time.

  • Lastly, the duration is the time interval between each visitor's entry time and the final observation time. For PostgreSQL, this is easily computed with the minus operator.

Wrapping up

Now we have a complete duration table, with one row for each unit. Each row has an entry time, a final observation time, and the duration between those two timestamps. We know whether or not a unit has an observed endpoint based on the endpoint_type and endpoint_time columns; missing values indicate the unit has not experienced an endpoint.

With this table, we should be able to use most survival analysis software with only minimal additional tweaks. Switch back to the Python version of this article to see how to use this table with the Scikit-survival and Lifelines packages.

Notes & References

  1. Specifically, I'm using PostgreSQL 12.7 on Ubuntu 20.04. I follow GitLab's SQL style guide as much as possible.