How to convert event logs to duration tables for survival analysis
This article was originally published on June 4, 2021. For clarity and generality, it has been updated to change the terms ‘conversion analysis’, ‘conversion table’, and ‘conversion events’ to ‘survival analysis’, ‘duration table’, and ‘endpoints’, respectively. The code snippets in this article use Python; see also the SQL translation.
In many applications, it is most natural to think about how much time it takes for some outcome to occur. In medicine, for example, we might ask whether a given treatment increases patients' life span. For predictive maintenance, we want to know which pieces of hardware have the shortest time until failure. For customer care, how long until support tickets are resolved?
Survival analysis techniques directly address these kinds of questions, but my sense is that survival modeling is relatively rare in industry data science. One reason might be that setting up the data for survival modeling can be tricky but most survival analysis literature skips the data prep part. I'll try to fill the gap, by showing how to create a duration table from an event log, using the example of web browser events.
All of the code below can be found in the Crosstab Kite's gists repo, at events_to_durations.py.
What is an event log?
Mixpanel's documentation shows an example log of user actions in a web
browser. The key columns are
USERID (the unit of interest),
For event logs, the event is a categorical variable. Contrast this with sensor output, which is typically numeric.
What's a duration table?
A duration table is a wide-form table that shows for each unit whether the event of interest occurred, and if so, how long it took to happen. It is a common input data format for survival analysis software.
A side note: the term duration table is based on the Python Lifelines package's term duration matrix.1 Most of the literature I have read assumes we have this kind of data in hand and doesn't bother to give it an explicit name.2
With the Mixpanel example, suppose we define conversion to be the Checkout event. Then the first part of the duration table for that event log would be:
|UserID||Entry time||Conversion time||Conversion event|
The first two columns are straightforward. Each row corresponds to a user, indicated by the user ID column. Entry time is the first recorded timestamp for each user.
The missing values in the Conversion time and Conversion event columns are where things get interesting. Here's the core insight of survival analysis: those users may still convert in the future; we only know that they had not yet converted when the event log was closed. This is called censoring and it's something that survival models handle gracefully, but we still need to represent it appropriately in our duration table.
Suppose the Mixpanel log was closed at
final duration table would be (scroll right on narrow screens):
|UserID||Entry time||Conversion time||Conversion event||Final observation time||Duration|
The final observation time is the conversion time if it's known or the timestamp of the last moment of observation (i.e. when the log was closed). The duration is the time delta between the entry time and the final observation time. It will be our target variable in downstream modeling and is the field that is right-censored.3
A reminder that all of the following code can be found in consolidated, document, and slightly reorganized form at events_to_durations.py.
We'll stick with the web browser example and use data from e-commerce site Retail
Rocket, available on Kaggle. The dataset has 2.8 million rows and
occupies about 250MB in memory. For this snippet, assume the data has been downloaded to
1 2 3 4 5 6 7 8 9
timestamp visitorid event itemid transactionid row 0 2015-06-02 05:02:12.117 257597 view 355908 <NA> 1 2015-06-02 05:50:14.164 992329 view 248676 <NA> 2 2015-06-02 05:13:19.827 111016 view 318965 <NA> 3 2015-06-02 05:12:35.914 483717 view 253185 <NA> 4 2015-06-02 05:02:17.106 951259 view 367447 <NA>
The first thing we need to do is identify the schema of our event log. For unit of
interest, let's go with visitor (
visitorid), so we can
answer the questions What fraction of users make a purchase and how long does it take
them to do so? Item would also be an interesting unit of observation, and in real web
browser data, we might also be interested in sessions.
There are only three kinds of events:
view 2664312 addtocart 69332 transaction 22457 Name: event, dtype: int64
The sequence of events for a visitor who completes a transaction looks like this:
timestamp visitorid event itemid transactionid 2751488 2015-07-31 17:27:21.908 1050575 view 116493 <NA> 2740586 2015-07-31 21:08:45.248 1050575 view 31640 <NA> 2750173 2015-07-31 21:09:21.716 1050575 view 273877 <NA> 2743671 2015-07-31 21:10:18.947 1050575 view 31640 <NA> 2743666 2015-07-31 21:11:12.772 1050575 addtocart 31640 <NA> 2755294 2015-07-31 21:12:56.570 1050575 transaction 31640 8354
The visitor views a bunch of items, eventually adds a few to their cart, then purchases
the items in the cart in a
transaction. Let's define the
transaction event to be our endpoint of interest. Endpoints are the events
corresponding to outcomes we're interested in; in clinical research, this might be
death, disease recurrence, or recovery. In predictive maintenance, it could be hardware
failure. In customer lifetime value, it would be reaching a revenue threshold. In
principle, we could have multiple endpoints, so we specify this in a list.
Our schema is:
1 2 3 4
Find the entry time for each unit
groupby method is our workhorse because we
need to treat each visitor separately. The simplest way to define entry time for each
visitor is to take the earliest logged timestamp:
1 2 3 4
entry_time visitorid 0 2015-09-11 20:49:49.439 1 2015-08-13 17:46:06.444 2 2015-08-07 17:51:44.567 3 2015-08-01 07:10:35.296 4 2015-09-15 21:24:27.167
Find each unit's endpoint time, if it exists
Finding the endpoint time for each visitor is trickier because some visitors have not yet completed a transaction but might in the future. Furthermore, some users have completed multiple transactions, in which case we want the timestamp of the earliest endpoint event.
We first filter the raw data down to just the endpoint events:
then find the index (in the original DataFrame) of the earliest endpoint event for each
visitor and extract just those rows. Remember that only the visitors who have an
observed endpoint are represented in the
1 2 3 4
timestamp event itemid transactionid visitorid 172 2015-08-15 01:29:01.230 transaction 465522 9725 186 2015-08-12 16:34:57.040 transaction 49029 8726 264 2015-09-07 17:34:45.614 transaction 459835 8445 419 2015-07-29 05:03:12.695 transaction 19278 16455 539 2015-06-16 05:39:38.673 transaction 94371 14778
Because Pandas DataFrames have well-crafted indexing, we can merge the endpoints and endpoint times back into our primary output table with simple column assignments. If you're translating this to another language, be very careful about indexing here—remember that these columns are not defined for every user.
1 2 3
entry_time endpoint endpoint_time visitorid 0 2015-09-11 20:49:49.439 NaN NaT 1 2015-08-13 17:46:06.444 NaN NaT 2 2015-08-07 17:51:44.567 NaN NaT 3 2015-08-01 07:10:35.296 NaN NaT 4 2015-09-15 21:24:27.167 NaN NaT
Compute the target variable
Our target variable is the time it took for each unit to reach an endpoint or the
censoring time, which is the end of our observation period. In the Mixpanel example
above we chose an arbitrary timestamp, but in practice it's usually easier to choose the
current time (e.g.
import datetime as dt; dt.datetime.now())
or the latest timestamp in the log.
We don't want to lose track of which users have an observed endpoint, so we'll create another column that defaults to the censoring time unless a visitor has an endpoint event.
1 2 3 4
The last step is to compute the elapsed time between each visitor's entry into the
system and their final observed time. Once again Pandas makes this easy, returning a
TimeDelta-type column by subtracting two timestamps.
entry_time endpoint endpoint_time final_obs_time duration visitorid 0 2015-09-11 20:49:49.439 NaN NaT 2015-09-18 02:59:47.788 6 days 06:09:58.349000 1 2015-08-13 17:46:06.444 NaN NaT 2015-09-18 02:59:47.788 35 days 09:13:41.344000 2 2015-08-07 17:51:44.567 NaN NaT 2015-09-18 02:59:47.788 41 days 09:08:03.221000 3 2015-08-01 07:10:35.296 NaN NaT 2015-09-18 02:59:47.788 47 days 19:49:12.492000 4 2015-09-15 21:24:27.167 NaN NaT 2015-09-18 02:59:47.788 2 days 05:35:20.621000
Each row of the
durations DataFrame represents a visitor.
Each unit has an entry time, a final observation time, and the duration between those
two timestamps. We know whether or not a unit has reached an endpoint based on the
We can sanity check the output using a few simple properties that should always be true,
regardless of the input data. To start, we know that there should be one row in the
output duration table for every unique
visitorid in the
We can bound the
duration variable by virtue of the way we
constructed it: using the earliest observation in the data for each visitor's entry time
and the latest timestamp overall as the censoring time.
1 2 3
And finally, we know that the number of visitors who have converted so far must be no bigger than the total number of endpoint events in the input log.
1 2 3 4
Now we should be able to use any survival analysis software, with minimal additional tweaks.
Let's fit a cumulative hazard function to our data with the Lifelines
package, probably the most popular Python survival analysis package. To keep things
simple, we first convert our
durations column from TimeDelta
format to floating-point days. The
notnull method of a
Pandas series gives us a boolean flag that indicates if each unit has an observed
endpoint or has a censored duration.
1 2 3 4 5 6 7 8
Or, we could use the Scikit-survival package to fit a nonparametric survival curve.
1 2 3 4 5 6
Let's plot the result, to confirm that something meaningful has happened. For our demo use case, it's more natural to think of the conversion rate, which is just 1 minus the survival function.4
1 2 3 4 5 6 7 8 9 10 11
In upcoming articles, I'll show how to convert other types of raw data into duration tables and how to interpret survival analysis results. Stay tuned! In the meantime, please leave a comment or question below. You can also contact me privately, with the Crosstab Kite contact form.
Notes & References
The Lifelines API documentation says that a duration matrix need not include all units, presumably omitting those with no observed endpoint. My concept of a duration table is different in that it should contain one row for each unit. ↩
An exhaustive literature search is beyond the scope of this article, but here are some examples of works that avoid naming the input to survival models. ↩
The Wikipedia entry for survival analysis shows an example of a duration table but only refers to it generically as a “survival data set”.
The R package
survivaluses the class
Survto define a “survival object” target variable, with both the duration and censoring indicator. The
survfit.formulafunction has a
dataparameter but defines it only by saying it's a data frame that includes the variables named in the
Frank Harrell's Regression Modeling Strategies does not appear to use an explicit term for the duration table. The survival analysis examples use datasets already in duration table format, named for their subject matter (e.g.
prostate, p. 521)
Rich, et al. A practical guide to understanding Kaplan-Meier curves explicitly describes survival inputs as a table but never gives that table a name.
The first step in preparation for [Kaplan-Meier] analysis involves the construction of a table...containing the three key elements required for input...The table is then sorted...Once this initial table is constructed...
Even the Lifelines package, from which I get the term duration matrix, is not consistent. It also uses the terms survival events and survival data for this concept.
Naming is hard and the field of survival analysis seems especially bad at it. I have more to say about this in a future post, but in the meantime, if you know of an existing, widespread, standard name for what I call duration table, please let me know.
Conversion data can also be left- or interval-censored. This article is only about right-censored data. ↩
Technically, this is the cumulative incidence function. ↩