Survival Analysis

# How to convert event logs to duration tables for survival analysis

This article was originally published on June 4, 2021. For clarity and generality, it has been updated to change the terms ‘conversion analysis’, ‘conversion table’, and ‘conversion events’ to ‘survival analysis’, ‘duration table’, and ‘endpoints’, respectively. The code snippets in this article use Python; see also the SQL translation.

In many applications, it is most natural to think about how much time it takes for some outcome to occur. In medicine, for example, we might ask whether a given treatment increases patients' life span. For predictive maintenance, we want to know which pieces of hardware have the shortest time until failure. For customer care, how long until support tickets are resolved?

Survival analysis techniques directly address these kinds of questions, but my sense is that survival modeling is relatively rare in industry data science. One reason might be that setting up the data for survival modeling can be tricky but most survival analysis literature skips the data prep part. I'll try to fill the gap, by showing how to create a duration table from an event log, using the example of web browser events.

All of the code below can be found in the Crosstab Kite's gists repo, at events_to_durations.py.

## What is an event log?

Event logs are a type of transactional fact table. They have a long-form schema where each entry represents an event, defined by the unit (e.g. user), timestamp, and type of event.

Mixpanel's documentation shows an example log of user actions in a web browser. The key columns are USERID (the unit of interest), TIMESTAMP, and EVENT.

For event logs, the event is a categorical variable. Contrast this with sensor output, which is typically numeric.

## What's a duration table?

A duration table is a wide-form table that shows for each unit whether the event of interest occurred, and if so, how long it took to happen. It is a common input data format for survival analysis software.

A side note: the term duration table is based on the Python Lifelines package's term duration matrix.1 Most of the literature I have read assumes we have this kind of data in hand and doesn't bother to give it an explicit name.2

With the Mixpanel example, suppose we define conversion to be the Checkout event. Then the first part of the duration table for that event log would be:

UserID Entry time Conversion time Conversion event
qrt-345 2016-12-15T08:02 NA NA
pnm-321 2016-12-15T08:07 NA NA
Janet 2016-12-15T08:10 2016-12-15T08:15 Checkout
Vijay 2016-12-15T08:10 NA NA

The first two columns are straightforward. Each row corresponds to a user, indicated by the user ID column. Entry time is the first recorded timestamp for each user.

The missing values in the Conversion time and Conversion event columns are where things get interesting. Here's the core insight of survival analysis: those users may still convert in the future; we only know that they had not yet converted when the event log was closed. This is called censoring and it's something that survival models handle gracefully, but we still need to represent it appropriately in our duration table.

Suppose the Mixpanel log was closed at 2016-12-15T08:20. The final duration table would be (scroll right on narrow screens):

UserID Entry time Conversion time Conversion event Final observation time Duration
qrt-345 2016-12-15T08:02 NA NA 2016-12-15T08:20 18 seconds
pnm-321 2016-12-15T08:07 NA NA 2016-12-15T08:20 13 seconds
Janet 2016-12-15T08:10 2016-12-15T08:15 Checkout 2016-12-15T08:15 5 seconds
Vijay 2016-12-15T08:10 NA NA 2016-12-15T08:20 10 seconds

The final observation time is the conversion time if it's known or the timestamp of the last moment of observation (i.e. when the log was closed). The duration is the time delta between the entry time and the final observation time. It will be our target variable in downstream modeling and is the field that is right-censored.3

## The Metamorphosis

A reminder that all of the following code can be found in consolidated, document, and slightly reorganized form at events_to_durations.py.

### Input data

We'll stick with the web browser example and use data from e-commerce site Retail Rocket, available on Kaggle. The dataset has 2.8 million rows and occupies about 250MB in memory. For this snippet, assume the data has been downloaded to a local data/retailrocket folder.

 1 2 3 4 5 6 7 8 9 import pandas as pd event_log = pd.read_csv( "data/retailrocket/events.csv", dtype={"transactionid": "Int64"} ) event_log.index.name = "row" event_log["timestamp"] = pd.to_datetime(event_log["timestamp"], unit="ms") print(event_log.head()) 

                  timestamp  visitorid event  itemid  transactionid
row
0   2015-06-02 05:02:12.117     257597  view  355908           <NA>
1   2015-06-02 05:50:14.164     992329  view  248676           <NA>
2   2015-06-02 05:13:19.827     111016  view  318965           <NA>
3   2015-06-02 05:12:35.914     483717  view  253185           <NA>
4   2015-06-02 05:02:17.106     951259  view  367447           <NA>


### Schema

The first thing we need to do is identify the schema of our event log. For unit of interest, let's go with visitor (visitorid), so we can answer the questions What fraction of users make a purchase and how long does it take them to do so? Item would also be an interesting unit of observation, and in real web browser data, we might also be interested in sessions.

There are only three kinds of events: view, addtocart, and transaction:

 1 event_log["event"].value_counts() 

view           2664312
transaction      22457
Name: event, dtype: int64


The sequence of events for a visitor who completes a transaction looks like this:

 1 event_log.query("visitorid==1050575").sort_values("timestamp") 

                      timestamp  visitorid        event  itemid  transactionid
2751488 2015-07-31 17:27:21.908    1050575         view  116493           <NA>
2740586 2015-07-31 21:08:45.248    1050575         view   31640           <NA>
2750173 2015-07-31 21:09:21.716    1050575         view  273877           <NA>
2743671 2015-07-31 21:10:18.947    1050575         view   31640           <NA>
2743666 2015-07-31 21:11:12.772    1050575    addtocart   31640           <NA>
2755294 2015-07-31 21:12:56.570    1050575  transaction   31640           8354


The visitor views a bunch of items, eventually adds a few to their cart, then purchases the items in the cart in a transaction. Let's define the transaction event to be our endpoint of interest. Endpoints are the events corresponding to outcomes we're interested in; in clinical research, this might be death, disease recurrence, or recovery. In predictive maintenance, it could be hardware failure. In customer lifetime value, it would be reaching a revenue threshold. In principle, we could have multiple endpoints, so we specify this in a list.

Our schema is:

 1 2 3 4 unit = "visitorid" timestamp = "timestamp" event = "event" endpoints = ["transaction"] 

### Find the entry time for each unit

The DataFrame groupby method is our workhorse because we need to treat each visitor separately. The simplest way to define entry time for each visitor is to take the earliest logged timestamp:

 1 2 3 4 grp = event_log.groupby(unit) durations = pd.DataFrame(grp[timestamp].min()) durations.rename(columns={timestamp: "entry_time"}, inplace=True) print(durations.head()) 

                       entry_time
visitorid
0         2015-09-11 20:49:49.439
1         2015-08-13 17:46:06.444
2         2015-08-07 17:51:44.567
3         2015-08-01 07:10:35.296
4         2015-09-15 21:24:27.167


### Find each unit's endpoint time, if it exists

Finding the endpoint time for each visitor is trickier because some visitors have not yet completed a transaction but might in the future. Furthermore, some users have completed multiple transactions, in which case we want the timestamp of the earliest endpoint event.

We first filter the raw data down to just the endpoint events:

 1 endpoint_events = event_log.loc[event_log[event].isin(endpoints)] 

then find the index (in the original DataFrame) of the earliest endpoint event for each visitor and extract just those rows. Remember that only the visitors who have an observed endpoint are represented in the endpoint_events DataFrame.

 1 2 3 4 grp = endpoint_events.groupby(unit) idx_endpoint_events = grp[timestamp].idxmin() # these are indices in the original DataFrame endpoint_events = event_log.iloc[idx_endpoint_events].set_index(unit) print(endpoint_events.head()) 

                        timestamp        event  itemid  transactionid
visitorid
172       2015-08-15 01:29:01.230  transaction  465522           9725
186       2015-08-12 16:34:57.040  transaction   49029           8726
264       2015-09-07 17:34:45.614  transaction  459835           8445
419       2015-07-29 05:03:12.695  transaction   19278          16455
539       2015-06-16 05:39:38.673  transaction   94371          14778


Because Pandas DataFrames have well-crafted indexing, we can merge the endpoints and endpoint times back into our primary output table with simple column assignments. If you're translating this to another language, be very careful about indexing here—remember that these columns are not defined for every user.

 1 2 3 durations["endpoint"] = endpoint_events[event] durations["endpoint_time"] = endpoint_events[timestamp] print(durations.head()) 

                       entry_time endpoint endpoint_time
visitorid
0         2015-09-11 20:49:49.439      NaN           NaT
1         2015-08-13 17:46:06.444      NaN           NaT
2         2015-08-07 17:51:44.567      NaN           NaT
3         2015-08-01 07:10:35.296      NaN           NaT
4         2015-09-15 21:24:27.167      NaN           NaT


### Compute the target variable

Our target variable is the time it took for each unit to reach an endpoint or the censoring time, which is the end of our observation period. In the Mixpanel example above we chose an arbitrary timestamp, but in practice it's usually easier to choose the current time (e.g. import datetime as dt; dt.datetime.now()) or the latest timestamp in the log.

We don't want to lose track of which users have an observed endpoint, so we'll create another column that defaults to the censoring time unless a visitor has an endpoint event.

 1 2 3 4 censoring_time = event_log["timestamp"].max() durations["final_obs_time"] = durations["endpoint_time"].copy() durations["final_obs_time"].fillna(censoring_time, inplace=True) 

The last step is to compute the elapsed time between each visitor's entry into the system and their final observed time. Once again Pandas makes this easy, returning a TimeDelta-type column by subtracting two timestamps.

 1 2 durations["duration"] = durations["final_obs_time"] - durations["entry_time"] print(durations.head()) 

                        entry_time endpoint endpoint_time          final_obs_time                duration
visitorid
0         2015-09-11 20:49:49.439      NaN           NaT 2015-09-18 02:59:47.788  6 days 06:09:58.349000
1         2015-08-13 17:46:06.444      NaN           NaT 2015-09-18 02:59:47.788 35 days 09:13:41.344000
2         2015-08-07 17:51:44.567      NaN           NaT 2015-09-18 02:59:47.788 41 days 09:08:03.221000
3         2015-08-01 07:10:35.296      NaN           NaT 2015-09-18 02:59:47.788 47 days 19:49:12.492000
4         2015-09-15 21:24:27.167      NaN           NaT 2015-09-18 02:59:47.788  2 days 05:35:20.621000


Each row of the durations DataFrame represents a visitor. Each unit has an entry time, a final observation time, and the duration between those two timestamps. We know whether or not a unit has reached an endpoint based on the endpoint and endpoint_time columns.

### Sanity checks

We can sanity check the output using a few simple properties that should always be true, regardless of the input data. To start, we know that there should be one row in the output duration table for every unique visitorid in the event log.

 1 assert len(durations) == event_log[unit].nunique() 

We can bound the duration variable by virtue of the way we constructed it: using the earliest observation in the data for each visitor's entry time and the latest timestamp overall as the censoring time.

 1 2 3 assert durations["duration"].max() <= ( event_log[timestamp].max() - event_log[timestamp].min() ) 

And finally, we know that the number of visitors who have converted so far must be no bigger than the total number of endpoint events in the input log.

 1 2 3 4 assert ( durations["endpoint_time"].notnull().sum() <= event_log[event].isin(endpoints).sum() ) 

## Next steps

Now we should be able to use any survival analysis software, with minimal additional tweaks.

### Example 1

Let's fit a cumulative hazard function to our data with the Lifelines package, probably the most popular Python survival analysis package. To keep things simple, we first convert our durations column from TimeDelta format to floating-point days. The notnull method of a Pandas series gives us a boolean flag that indicates if each unit has an observed endpoint or has a censored duration.

 1 2 3 4 5 6 7 8 durations["endpoint_observed"] = durations["endpoint"].notnull() durations["duration_days"] = ( durations["duration"].dt.total_seconds() / (60 * 60 * 24) # denom is number of seconds in a day ) import lifelines model = lifelines.NelsonAalenFitter() model.fit(durations["duration_days"], durations["endpoint_observed"]) 

### Example 2

Or, we could use the Scikit-survival package to fit a nonparametric survival curve.

 1 2 3 4 5 6 from sksurv.util import Surv from sksurv.nonparametric import SurvivalFunctionEstimator target = Surv().from_dataframe("endpoint_observed", "duration_days", durations) model = SurvivalFunctionEstimator() model.fit(target) 

Let's plot the result, to confirm that something meaningful has happened. For our demo use case, it's more natural to think of the conversion rate, which is just 1 minus the survival function.4

  1 2 3 4 5 6 7 8 9 10 11 import numpy as np import plotly.express as px time_grid = np.linspace(0, 120, 1000) proba_survival = model.predict_proba(time_grid) conversion_rate = 1 - proba_survival fig = px.line(x=time_grid, y=conversion_rate * 100, template="plotly_white") fig.update_traces(line=dict(width=7)) fig.update_layout(xaxis_title="Days elapsed", yaxis_title="Conversion rate (%)") fig.show() 

In upcoming articles, I'll show how to convert other types of raw data into duration tables and how to interpret survival analysis results. Stay tuned! In the meantime, please leave a comment or question below. You can also contact me privately, with the Crosstab Kite contact form.

## Notes & References

1. The Lifelines API documentation says that a duration matrix need not include all units, presumably omitting those with no observed endpoint. My concept of a duration table is different in that it should contain one row for each unit.

2. An exhaustive literature search is beyond the scope of this article, but here are some examples of works that avoid naming the input to survival models.

• The Wikipedia entry for survival analysis shows an example of a duration table but only refers to it generically as a “survival data set”.

• The R package survival uses the class Surv to define a “survival object” target variable, with both the duration and censoring indicator. The survfit.formula function has a data parameter but defines it only by saying it's a data frame that includes the variables named in the Surv object.

• Frank Harrell's Regression Modeling Strategies does not appear to use an explicit term for the duration table. The survival analysis examples use datasets already in duration table format, named for their subject matter (e.g. prostate, p. 521)

• Rich, et al. A practical guide to understanding Kaplan-Meier curves explicitly describes survival inputs as a table but never gives that table a name.

The first step in preparation for [Kaplan-Meier] analysis involves the construction of a table...containing the three key elements required for input...The table is then sorted...Once this initial table is constructed...

• Even the Lifelines package, from which I get the term duration matrix, is not consistent. It also uses the terms survival events and survival data for this concept.

Naming is hard and the field of survival analysis seems especially bad at it. I have more to say about this in a future post, but in the meantime, if you know of an existing, widespread, standard name for what I call duration table, please let me know.