How to convert event logs to duration tables for survival analysis

Survival models describe how much time it takes for some event to occur. This is a natural way to think about many applications but setting up the data can be tricky. In this article, we use Python to turn an event log into a duration table, which is the input format for many survival analysis tools.

Tags: code, python, survival analysis

Author: Brian Kent

Published: June 7, 2021

Tip

See also our article on creating duration tables from event logs in SQL.

In many applications, it is most natural to think about how much time it takes for some outcome to occur. In medicine, for example, we might ask whether a given treatment increases patients’ life span. For predictive maintenance, we want to know which pieces of hardware have the shortest time until failure. For customer care, how long until support tickets are resolved?

Survival analysis techniques directly address these kinds of questions, but my sense is that survival modeling is relatively rare in industry data science. One reason might be that setting up the data for survival modeling can be tricky, and most survival analysis literature skips the data prep step. I’ll try to fill that gap by showing how to create a duration table from an event log, using the example of web browser events.

What is an event log?

Event logs are a type of transactional fact table. They have a long-form schema where each entry represents an event, defined by the unit (e.g. user), timestamp, and type of event.

Mixpanel’s documentation shows an example log of user actions in a web browser. The key columns are USERID (the unit of interest), TIMESTAMP, and EVENT.

For event logs, the event is a categorical variable. Contrast this with sensor output, which is typically numeric.
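
To make the schema concrete, here’s a minimal, hypothetical event log built in Pandas. The column names mirror the Mixpanel example; the specific events are invented for illustration.

Code
import pandas as pd

# A minimal, hypothetical event log: one row per event, defined by the
# unit (USERID), a TIMESTAMP, and a categorical EVENT type.
mini_log = pd.DataFrame(
    {
        "USERID": ["qrt-345", "Janet", "Janet", "Vijay"],
        "TIMESTAMP": pd.to_datetime(
            ["2016-12-15T08:02", "2016-12-15T08:10", "2016-12-15T08:15", "2016-12-15T08:10"]
        ),
        "EVENT": ["Page view", "Page view", "Checkout", "Page view"],
    }
)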

What’s a duration table?

A duration table is a wide-form table that shows for each unit whether the event of interest occurred, and if so, how long it took to happen. It is a common input data format for survival analysis software.

A side note: the term duration table is based on the Python Lifelines package’s term duration matrix.1 Most of the literature I have read assumes we have this kind of data in hand and doesn’t bother to give it an explicit name.2

With the Mixpanel example, suppose we define conversion to be the Checkout event. Then the first part of the duration table for that event log would be:

UserID   Entry time        Conversion time   Conversion event
qrt-345  2016-12-15T08:02  NA                NA
pnm-321  2016-12-15T08:07  NA                NA
Janet    2016-12-15T08:10  2016-12-15T08:15  Checkout
Vijay    2016-12-15T08:10  NA                NA

The first two columns are straightforward. Each row corresponds to a user, indicated by the user ID column. Entry time is the first recorded timestamp for each user.

The missing values in the Conversion time and Conversion event columns are where things get interesting. Here’s the core insight of survival analysis: those users may still convert in the future; we only know that they had not yet converted when the event log was closed. This is called censoring and it’s something that survival models handle gracefully, but we still need to represent it appropriately in our duration table.

Suppose the Mixpanel log was closed at 2016-12-15T08:20. The final duration table would be:

UserID   Entry time        Conversion time   Conversion event  Final observation time  Duration
qrt-345  2016-12-15T08:02  NA                NA                2016-12-15T08:20        18 minutes
pnm-321  2016-12-15T08:07  NA                NA                2016-12-15T08:20        13 minutes
Janet    2016-12-15T08:10  2016-12-15T08:15  Checkout          2016-12-15T08:15        5 minutes
Vijay    2016-12-15T08:10  NA                NA                2016-12-15T08:20        10 minutes

The final observation time is the conversion time if it’s known or the timestamp of the last moment of observation (i.e. when the log was closed). The duration is the time delta between the entry time and the final observation time. It will be our target variable in downstream modeling and is the field that is right-censored.3
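
Here’s a minimal sketch of that computation for the Mixpanel-style table above. The DataFrame and its column names (entry_time, conversion_time) are mine, for illustration.

Code
import pandas as pd

# Hypothetical duration table for the Mixpanel example, before the final columns.
mini = pd.DataFrame(
    {
        "UserID": ["qrt-345", "pnm-321", "Janet", "Vijay"],
        "entry_time": pd.to_datetime(
            ["2016-12-15T08:02", "2016-12-15T08:07", "2016-12-15T08:10", "2016-12-15T08:10"]
        ),
        "conversion_time": pd.to_datetime([None, None, "2016-12-15T08:15", None]),
    }
)

log_closed = pd.Timestamp("2016-12-15T08:20")  # the moment the log was closed

# Final observation time: the conversion time where it's known, else the close time.
mini["final_obs_time"] = mini["conversion_time"].fillna(log_closed)

# Duration: the TimeDelta between entry and final observation (18 minutes, 13 minutes, ...).
mini["duration"] = mini["final_obs_time"] - mini["entry_time"]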

The Metamorphosis

Input data

We’ll stick with the web browser example and use data from e-commerce site Retail Rocket, available on Kaggle. The dataset has 2.8 million rows and occupies about 250MB in memory.

Code
import pandas as pd

event_log = pd.read_csv("retailrocket_events.csv", dtype={"transactionid": "Int64"})
event_log.index.name = "row"
event_log["timestamp"] = pd.to_datetime(event_log["timestamp"], unit="ms")

event_log.head()
                   timestamp  visitorid  event  itemid  transactionid
row
0    2015-06-02 05:02:12.117     257597   view  355908           <NA>
1    2015-06-02 05:50:14.164     992329   view  248676           <NA>
2    2015-06-02 05:13:19.827     111016   view  318965           <NA>
3    2015-06-02 05:12:35.914     483717   view  253185           <NA>
4    2015-06-02 05:02:17.106     951259   view  367447           <NA>

Schema

The first thing we need to do is identify the schema of our event log. For the unit of interest, let’s go with visitor (visitorid), so we can answer the questions: what fraction of visitors make a purchase, and how long does it take them to do so? Item would also be an interesting unit of observation, and in real web browser data, we might also be interested in sessions.

There are only three kinds of events: view, addtocart, and transaction:

Code
event_log["event"].value_counts()
view           2664312
addtocart        69332
transaction      22457
Name: event, dtype: int64

The sequence of events for a visitor who completes a transaction looks like this:

Code
event_log.query("visitorid==1050575").sort_values("timestamp")
                       timestamp  visitorid        event  itemid  transactionid
row
2751488  2015-07-31 17:27:21.908    1050575         view  116493           <NA>
2740586  2015-07-31 21:08:45.248    1050575         view   31640           <NA>
2750173  2015-07-31 21:09:21.716    1050575         view  273877           <NA>
2743671  2015-07-31 21:10:18.947    1050575         view   31640           <NA>
2743666  2015-07-31 21:11:12.772    1050575    addtocart   31640           <NA>
2755294  2015-07-31 21:12:56.570    1050575  transaction   31640           8354

The visitor views a bunch of items, eventually adds a few to their cart, then purchases the items in the cart in a transaction. Let’s define the transaction event to be our endpoint of interest. Endpoints are the events corresponding to outcomes we’re interested in; in clinical research, this might be death, disease recurrence, or recovery. In predictive maintenance, it could be hardware failure. In customer lifetime value, it would be reaching a revenue threshold. In principle, we could have multiple endpoints, so we specify this in a list.

Our schema is:

Code
unit = "visitorid"
timestamp = "timestamp"
event = "event"
endpoints = ["transaction"]

Find the entry time for each unit

The DataFrame groupby method is our workhorse because we need to treat each visitor separately. The simplest way to define entry time for each visitor is to take the earliest logged timestamp:

Code
grp = event_log.groupby(unit)
durations = pd.DataFrame(grp[timestamp].min())
durations.rename(columns={timestamp: "entry_time"}, inplace=True)
durations.head()
                        entry_time
visitorid
0          2015-09-11 20:49:49.439
1          2015-08-13 17:46:06.444
2          2015-08-07 17:51:44.567
3          2015-08-01 07:10:35.296
4          2015-09-15 21:24:27.167

Find each unit’s endpoint time, if it exists

Finding the endpoint time for each visitor is trickier because some visitors have not yet completed a transaction but might in the future. Furthermore, some users have completed multiple transactions, in which case we want the timestamp of the earliest endpoint event.

We first filter the raw data down to just the endpoint events:

Code
endpoint_events = event_log.loc[event_log[event].isin(endpoints)]

then find the index (in the original DataFrame) of the earliest endpoint event for each visitor and extract just those rows. Remember that only the visitors who have an observed endpoint are represented in the endpoint_events DataFrame.

Code
grp = endpoint_events.groupby(unit)
idx_endpoint_events = grp[timestamp].idxmin()  # index labels (in the original DataFrame) of each visitor's earliest endpoint event
endpoint_events = event_log.loc[idx_endpoint_events].set_index(unit)  # .loc, because idxmin returns labels, not positions
print(endpoint_events.head())
                        timestamp        event  itemid  transactionid
visitorid                                                            
172       2015-08-15 01:29:01.230  transaction  465522           9725
186       2015-08-12 16:34:57.040  transaction   49029           8726
264       2015-09-07 17:34:45.614  transaction  459835           8445
419       2015-07-29 05:03:12.695  transaction   19278          16455
539       2015-06-16 05:39:38.673  transaction   94371          14778

Because Pandas DataFrames have well-crafted indexing, we can merge the endpoints and endpoint times back into our primary output table with simple column assignments. If you’re translating this to another language, be very careful about indexing here—remember that these columns are not defined for every user.

Code
durations["endpoint"] = endpoint_events[event]
durations["endpoint_time"] = endpoint_events[timestamp]
durations.head()
                        entry_time endpoint endpoint_time
visitorid
0          2015-09-11 20:49:49.439      NaN           NaT
1          2015-08-13 17:46:06.444      NaN           NaT
2          2015-08-07 17:51:44.567      NaN           NaT
3          2015-08-01 07:10:35.296      NaN           NaT
4          2015-09-15 21:24:27.167      NaN           NaT
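
If you’d rather not rely on index alignment, or you’re translating this step to another tool, here’s a sketch of the same operation as an explicit left join (the alt name is just for illustration):

Code
# Equivalent to the column assignments above: an explicit left join keeps
# every visitor, leaving NaN/NaT for visitors with no observed endpoint.
alt = durations[["entry_time"]].merge(
    endpoint_events[[timestamp, event]].rename(
        columns={timestamp: "endpoint_time", event: "endpoint"}
    ),
    how="left",
    left_index=True,
    right_index=True,
)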

Compute the target variable

Our target variable is the time it took for each unit to reach an endpoint or the censoring time, which is the end of our observation period. In the Mixpanel example above we chose an arbitrary timestamp, but in practice it’s usually easier to choose the current time (e.g. import datetime as dt; dt.datetime.now()) or the latest timestamp in the log.

We don’t want to lose track of which users have an observed endpoint, so we’ll create another column that defaults to the censoring time unless a visitor has an endpoint event.

Code
censoring_time = event_log["timestamp"].max()

durations["final_obs_time"] = durations["endpoint_time"].copy()
durations["final_obs_time"].fillna(censoring_time, inplace=True)

The last step is to compute the elapsed time between each visitor’s entry into the system and their final observed time. Once again, Pandas makes this easy: subtracting two timestamp columns returns a TimeDelta-type column.

Code
durations["duration"] = durations["final_obs_time"] - durations["entry_time"]
durations.head()
                        entry_time endpoint endpoint_time          final_obs_time                duration
visitorid
0          2015-09-11 20:49:49.439      NaN           NaT 2015-09-18 02:59:47.788  6 days 06:09:58.349000
1          2015-08-13 17:46:06.444      NaN           NaT 2015-09-18 02:59:47.788 35 days 09:13:41.344000
2          2015-08-07 17:51:44.567      NaN           NaT 2015-09-18 02:59:47.788 41 days 09:08:03.221000
3          2015-08-01 07:10:35.296      NaN           NaT 2015-09-18 02:59:47.788 47 days 19:49:12.492000
4          2015-09-15 21:24:27.167      NaN           NaT 2015-09-18 02:59:47.788  2 days 05:35:20.621000

Each row of the durations DataFrame represents a visitor. Each unit has an entry time, a final observation time, and the duration between those two timestamps. We know whether or not a unit has reached an endpoint based on the endpoint and endpoint_time columns.
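
Before modeling, it’s worth a quick descriptive look at the result. Here’s a minimal sketch; the exact numbers will depend on the data.

Code
# Quick descriptive look: how many visitors converted, and the spread of durations.
n_observed = durations["endpoint"].notnull().sum()
print(f"{n_observed} of {len(durations)} visitors converted ({n_observed / len(durations):.2%})")
print(durations["duration"].describe())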

Sanity checks

We can sanity check the output using a few simple properties that should always be true, regardless of the input data. To start, we know that there should be one row in the output duration table for every unique visitorid in the event log.

Code
assert len(durations) == event_log[unit].nunique()

We can bound the duration variable by virtue of the way we constructed it: using the earliest observation in the data for each visitor’s entry time and the latest timestamp overall as the censoring time.

Code
assert durations["duration"].max() <= (
    event_log[timestamp].max() - event_log[timestamp].min()
)

And finally, we know that the number of visitors who have converted so far must be no bigger than the total number of endpoint events in the input log.

Code
assert (
    durations["endpoint_time"].notnull().sum() 
    <= event_log[event].isin(endpoints).sum()
)
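
Here’s one more check I’d suggest adding: an observed endpoint can never precede a visitor’s entry time, because the entry time is the earliest timestamp we have for each visitor.

Code
# Observed endpoints must come at or after each visitor's entry time.
observed = durations["endpoint_time"].notnull()
assert (
    durations.loc[observed, "endpoint_time"]
    >= durations.loc[observed, "entry_time"]
).all()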

Next steps

Now we should be able to use any survival analysis software, with minimal additional tweaks.

Example 1

Let’s fit a cumulative hazard function to our data with the Lifelines package, probably the most popular Python survival analysis package. To keep things simple, we first convert our duration column from TimeDelta format to floating-point days. The notnull method of a Pandas series gives us a boolean flag indicating whether each unit has an observed endpoint (True) or a censored duration (False).

Code
durations["endpoint_observed"] = durations["endpoint"].notnull()
durations["duration_days"] = (
    durations["duration"].dt.total_seconds() / (60 * 60 * 24)  # denom is number of seconds in a day
)

import lifelines
model = lifelines.NelsonAalenFitter()
model.fit(durations["duration_days"], durations["endpoint_observed"])
model.plot()
<AxesSubplot: xlabel='timeline'>

Example 2

Or, we could use the Scikit-survival package to fit a nonparametric survival curve.

Code
from sksurv.util import Surv
from sksurv.nonparametric import SurvivalFunctionEstimator

target = Surv.from_dataframe("endpoint_observed", "duration_days", durations)
model = SurvivalFunctionEstimator()
model.fit(target)
model.predict_proba([0, 20, 40, 60, 80, 100, 120])
array([0.99991617, 0.99211362, 0.99185032, 0.99167823, 0.99151507,
       0.99138487, 0.9912852 ])

Let’s plot the result, to confirm that something meaningful has happened. For our demo use case, it’s more natural to think of the conversion rate, which is just 1 minus the survival function.4

Code
import numpy as np
import plotly.express as px

time_grid = np.linspace(0, 120, 1000)
proba_survival = model.predict_proba(time_grid)
conversion_rate = 1 - proba_survival

fig = px.line(x=time_grid, y=conversion_rate * 100, template="plotly_white")
fig.update_traces(line=dict(width=7))
fig.update_layout(xaxis_title="Days elapsed", yaxis_title="Conversion rate (%)")
fig.show()

In upcoming articles, I’ll show how to convert other types of raw data into duration tables and how to interpret survival analysis results. Stay tuned! In the meantime, please leave a comment or question below. You can also contact me privately via the Crosstab Kite contact form.


Footnotes

  1. The Lifelines API documentation says that a duration matrix need not include all units, presumably omitting those with no observed endpoint. My concept of a duration table is different in that it should contain one row for each unit.↩︎

  2. An exhaustive literature search is beyond the scope of this article, but here are some examples of works that avoid naming the input to survival models.

    • The Wikipedia entry for survival analysis shows an example of a duration table but only refers to it generically as a “survival data set”.

    • The R package survival uses the class Surv to define a “survival object” target variable, with both the duration and censoring indicator. The survfit.formula function has a data parameter but defines it only by saying it’s a data frame that includes the variables named in the Surv object.

    • Frank Harrell’s Regression Modeling Strategies does not appear to use an explicit term for the duration table. The survival analysis examples use datasets already in duration table format, named for their subject matter (e.g. prostate, p. 521).

    • Rich, et al. A practical guide to understanding Kaplan-Meier curves explicitly describes survival inputs as a table but never gives that table a name.

      The first step in preparation for [Kaplan-Meier] analysis involves the construction of a table…containing the three key elements required for input…The table is then sorted…Once this initial table is constructed…

    • Even the Lifelines package, from which I get the term duration matrix, is not consistent. It also uses the terms survival events and survival data for this concept.

    Naming is hard and the field of survival analysis seems especially bad at it. I have more to say about this in a future post, but in the meantime, if you know of an existing, widespread, standard name for what I call duration table, please let me know.↩︎

  3. Conversion data can also be left- or interval-censored. This article is only about right-censored data.↩︎

  4. Technically, this is the cumulative incidence function.↩︎