The Margin

Data before models, but problem formulation first

Several prominent AI thought leaders recently tweeted on the theme that data is more important in applied machine learning than model architecture and optimization. François Chollet wrote:

Andrew Ng chimed in that he agrees with Chollet and that more work needs to be done to disseminate best practices for creating and organizing data. Shreya Shankar tweeted a slide about the limitations of deep learning models in practice:

Shreya Shankar slide: 'But when to use deep learning in practice?'
Source: Shreya Shankar


Getting the data right is certainly critical, and we'll talk a lot more about data over the coming months. But Chollet and Ng seem to be speaking from an "ML-first" perspective, where it's axiomatic that a predictive model—and nothing else—is how applied ML projects work. In this mindset, data may be important, but it's only the means to the end that is the predictive model.

I think problem formulation is even more important than either data or models.

Shankar's first point hints at this: the task comes first, and dictates which types of models are appropriate. Christoph Molnar hit the nail more squarely on the head:

What is problem formulation?

Problem formulation is the process of devising a data science solution to a business problem. In this post, I assume the business problem is defined and given; a different but related problem is to create business value from existing analytics or modeling work.

Molnar lists some elements of problem formulation in a second tweet:

  • "choice of prediction target"
  • "which data to use"
  • "what to do with the prediction"

My only quibble would be to move the last item to the top: the first step of problem formulation is to plan how your system will be used and how it will solve the business problem.

Take the problem of churn prevention, for example. From the ML-first perspective, it looks straightforward; in a given month, each user either churns or they don't. Bam, let's train a binary boosted trees classifier to predict churn for the coming month. Done.
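That ML-first recipe is easy to sketch. Here's a minimal illustration with synthetic data; the features and the churn mechanism are invented purely for demonstration:

```python
# Illustrative sketch of the "ML-first" churn formulation: a boosted
# trees classifier trained on synthetic user features. All feature
# names and data here are made up for demonstration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_users = 1000

# Hypothetical features: logins last month, support tickets, account age.
X = np.column_stack([
    rng.poisson(10, n_users),    # logins_last_month
    rng.poisson(1, n_users),     # support_tickets
    rng.uniform(0, 36, n_users), # account_age_months
])

# Synthetic churn labels: less engaged users churn more often.
churn_prob = 1 / (1 + np.exp(0.3 * X[:, 0] - 1.5))
y = rng.binomial(1, churn_prob)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
scores = model.predict_proba(X_val)[:, 1]  # one "churn score" per user
```

Simple enough. But the apparent simplicity of this formulation hides the real questions, which come next.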

But then what?

How will the output be interpreted?

The first issue we face is calibration. Many binary classifiers are trained and tuned in ways that depend only on the rank of output prediction scores, not their actual values.1 Suppose we have 3 users in our validation set:

| User | Outcome  | Model A Prediction | Model B Prediction |
|------|----------|--------------------|--------------------|
| 1    | retained | 0.05               | 0.80               |
| 2    | retained | 0.06               | 0.81               |
| 3    | churned  | 0.07               | 0.82               |


For many binary classifiers, the predictions from models A and B would be equally good, because for each we can pick a threshold that perfectly separates the retained and churned users.2
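We can verify this with scikit-learn: both models rank the churned user above both retained users, so a ranking metric like AUC cannot tell them apart.

```python
# Both models give the churned user the highest score, so ranking
# metrics see them as identical.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1]             # retained, retained, churned
model_a = [0.05, 0.06, 0.07]
model_b = [0.80, 0.81, 0.82]

auc_a = roc_auc_score(y_true, model_a)
auc_b = roc_auc_score(y_true, model_b)
# Both AUCs are 1.0: perfect ranking, despite wildly different scores.
```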

In applications like churn prevention, though, the values do matter! The customer success team will read the scores directly and be far more worried about a user with an 80% chance of churning than one with a 7% chance of churning, even if we know those scores shouldn't be compared to each other. To make a churn prediction model useful, we should calibrate the scores so they have real-world meaning.
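One common post-hoc fix is Platt scaling: fit a logistic regression that maps raw classifier scores to calibrated probabilities. Here's a sketch with synthetic scores and outcomes (the data is invented for illustration):

```python
# Sketch of Platt scaling: a logistic regression maps raw scores to
# calibrated probabilities. Scores and outcomes are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
raw_scores = rng.uniform(0, 1, 500)

# Synthetic outcomes: the true churn rate grows with the raw score.
outcomes = rng.binomial(1, 0.1 + 0.8 * raw_scores)

calibrator = LogisticRegression().fit(raw_scores.reshape(-1, 1), outcomes)
calibrated = calibrator.predict_proba(raw_scores.reshape(-1, 1))[:, 1]
```

In practice, scikit-learn's `CalibratedClassifierCV` wraps this pattern (and the isotonic-regression alternative) around a real classifier.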

In general, we need to understand how our audience will interpret our output, whether it's model predictions, BI dashboards, experiment results, or any other artifact.

Does the method truly solve the problem?

OK, we solved the calibration issue and we now have a flawless churn prediction model. What should the customer success team do with it?

  • Should they reach out to the users who are at 10% churn probability, or 50%, or 90%?

  • What if each user responds differently to the customer success team's intervention? Does it matter whether a user's churn probability is 20% or 80% if our intervention has no effect on that user?

  • What if the customer success team contacts every customer anyway? In this case, our predictive model is useless; we should have learned instead which intervention is most effective.

There are no right answers to these questions, because the business problem is to prevent churn, not predict it. A predictive model by itself can't solve the problem.

Choose the right target variable

Our article on the tradeoffs of conversion rate modeling shows an example of the importance of choosing a prediction target. The ultimate task is to understand how well users convert from one stage of a sequence (e.g. a sales funnel) to the next. We can formulate this as a data science task in two ways:

  1. Treat conversion as a binary target, by choosing a fixed-length window of time to observe conversions. If the user converts within the window, it's a success; otherwise, a failure.

  2. Model how long it takes users to convert. This is a right-censored, numeric target.

The choice of target variable dictates the data we need to collect and what types of models are appropriate.
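To make the difference concrete, here's a sketch of deriving both targets from the same raw conversion data. The records and window length are invented for illustration:

```python
# Deriving both conversion targets from the same raw data.
# days_to_convert is None for users who have not converted yet;
# observed_days is how long we have watched each user (censoring time).
users = [
    {"days_to_convert": 3,    "observed_days": 30},
    {"days_to_convert": 45,   "observed_days": 60},
    {"days_to_convert": None, "observed_days": 20},
]

WINDOW = 30  # fixed observation window, in days

# Formulation 1: binary target over a fixed window. Only users observed
# for the full window can be labeled; the third user must be dropped.
binary_target = [
    u["days_to_convert"] is not None and u["days_to_convert"] <= WINDOW
    for u in users
    if u["observed_days"] >= WINDOW
]

# Formulation 2: right-censored time-to-event target, expressed as
# (duration, event_observed) pairs, the shape survival models expect.
# The third user contributes 20 censored days instead of being dropped.
survival_target = [
    (u["days_to_convert"], True) if u["days_to_convert"] is not None
    else (u["observed_days"], False)
    for u in users
]
```

Note that the binary formulation throws away the partially observed user entirely, while the time-to-event formulation keeps that user's 20 churn-free days as information.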

Problem formulation is a foundational data science skill

Excellent problem formulation is one of the clearest hallmarks of a strong data scientist, and one of the key things I look for when interviewing. What makes somebody good at it?

  • Curiosity, to learn how the business works.

  • Honesty, to ourselves and to our audience. We should be up-front about the limitations of our methods and the correct interpretations of our results, especially when we know our approach doesn't fully solve a given problem.

  • Breadth of knowledge, through lots of hands-on experience with real, applied problems and reading about other data scientists' experiences. To evaluate alternative formulations we need to be aware of the options.

  • Foresight, to imagine the roadmap and architecture of each potential solution, and to identify the pros and cons before committing to one option.

Problem formulation is at the very heart of the chasm between data science training and applied practice. This gap is what motivated us to start The Crosstab Kite, so this theme is going to come up frequently. Join our community and send us examples from your own practice—where have you seen problem formulation fail or succeed in notable ways?

Notes

  1. Logistic regression is a notable exception that is well-calibrated by construction.

  2. The AUC is identical for the two model outputs as well, because the true positive and false positive rates are the same for all relevant thresholds.