In defense of statistical modeling

Data science has been hot for many years now, attracting attention and talent. There has been a persistent thread, though, that says data science’s core skill of statistical modeling is overhyped and that managers and aspiring data scientists should focus on engineering instead. Vicki Boykis’ 2019 blog post was the first article I remember along these lines. She wrote:

…data science is asymptotically moving closer to engineering, and the skills that data scientists need moving forward are less visualization and statistics-based, and more in line with traditional computer science curricula…

With that premise, her reasonable advice was:

Don’t do a degree in data science, don’t do a bootcamp…It’s much easier to come into a data science and tech career through the “back door”, i.e. starting out as a junior developer, or in DevOps, project management, and, perhaps most relevant, as a data analyst, information manager, or similar…

Her list of skills an aspiring data scientist should learn consisted entirely of data engineering, MLOps, and tools, and she intentionally omitted modeling, saying:

While tuning models, visualization, and analysis make up some component of your time as a data scientist, data science is and has always been primarily about getting clean data in a single place to be used for interpolation.

More recently, Gartner’s 2020 AI hype cycle report acknowledges the role of data scientists but says:

Gartner foresees developers being the major force in AI.

Chris I. said it more bluntly, with an article titled “Don’t Become a Data Scientist”.

Everyone and their grandmother wants to be a data scientist…I often get messages from new grads and career changers asking me for advice on getting into data science. I tell them to become a software engineer instead.

Mihail Eric echoed the thought with an article titled “We Don’t Need Data Scientists, We Need Data Engineers”.

Today, the bottleneck in helping companies get machine learning and modelling insights to production center on data problems…This may sound boring and unsexy, but old-school software engineering with a bend toward data may be what we really need right now…There are going to be fewer positions available for what is looking to be an abundance of newcomers to the market trained to do data science.

I agree with these articles that data engineering and MLOps are important for applied industry data science work, but I also believe data science’s core skill—statistical modeling—is becoming more, not less, important. Since we don’t have many chances for in-person debates in this Covid era, here’s how I imagine a debate would go with these skeptics.

The term data science is diluted

Skeptic: What does the term “data science” even mean? It’s such a broad, vague title, plus everybody calls themselves a data scientist these days, so it’s totally diluted.

Me: Data science is a big tent. When people talk about what the term should mean, it usually revolves around the core skill of statistical modeling. Boykis, for example, cites “machine learning, deep learning, and Bayesian simulations” as the things junior data scientists expect to work on vs. the “cleaning, shaping data, and moving it from place to place” work they end up doing. Eric describes a data scientist as someone “responsible for building models to probe what can be learned from some data source, though often at a prototype rather than production level.”

Statistical modeling is what’s taught in most statistics, machine learning, and data science courses. It includes, among other things:

Traditional predictive models, i.e. regression and classification. All the biggest hits (linear models, boosted trees, neural nets, etc.) fall into this category.
Time series forecasting
Experiment design and analysis
Causal inference

Model training will be obsolete; software engineers can do the job

Skeptic: So data science is just training models? Isn’t that becoming obsolete anyway with the rise of AutoML and massive pre-trained models like GPT-3? As model building becomes commoditized, software engineers will do the work, not methodologists.

Me: Statistical modeling involves a lot more than pushing the button on a generic scikit-learn or PyTorch script. AutoML tools can help with some parts like hyperparameter search and feature selection, but there’s so much more to it.

As I wrote a few weeks ago, the first thing a data scientist needs to do is to understand business problems and formulate them as modeling tasks. You want to reduce churn, for example, but should you treat it as a binary classification or a time-to-event problem? Will a predictive model suffice, or do you need to draw causal conclusions? How will you run experiments to verify the model works?
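To make the churn question concrete, here is a minimal sketch of how the same raw records turn into two different modeling targets. The customer records, the 30-day horizon, and the observation end date are all hypothetical, made up purely for illustration.

```python
from datetime import date

# Hypothetical end of the observation window; nothing after this date is known.
OBSERVATION_END = date(2020, 12, 31)

# Hypothetical customer records. cancel=None means "still active when we looked".
customers = [
    {"id": 1, "signup": date(2020, 1, 1), "cancel": date(2020, 1, 20)},
    {"id": 2, "signup": date(2020, 6, 1), "cancel": None},
    {"id": 3, "signup": date(2020, 12, 15), "cancel": None},
]


def binary_label(customer, horizon_days=30):
    """Binary classification target: did the customer churn within the horizon?

    Returns 1 (churned), 0 (survived the horizon), or None when the customer
    has not been observed long enough to know either way.
    """
    cancel = customer["cancel"]
    if cancel is not None and (cancel - customer["signup"]).days <= horizon_days:
        return 1
    if (OBSERVATION_END - customer["signup"]).days >= horizon_days:
        return 0
    return None


def time_to_event(customer):
    """Time-to-event target: (days observed, whether churn was actually seen).

    Customers who are still active are kept as censored observations
    instead of being dropped.
    """
    end = customer["cancel"] or OBSERVATION_END
    return (end - customer["signup"]).days, customer["cancel"] is not None


binary_targets = [binary_label(c) for c in customers]
survival_targets = [time_to_event(c) for c in customers]
```

The binary framing has to discard customer 3, who signed up too recently to be labeled, while the time-to-event framing keeps that customer as a censored observation. Which trade-off is right depends on the business problem, which is exactly the judgment call described above.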

The next step in modeling is to understand and clean the data thoroughly. This work often creates a ton of value on its own because data scientists are often uniquely qualified to translate between business logic and data engineering and to spot problems.

The model fitting process is evolving, especially as model training platforms like MLflow, Comet, and Weights & Biases (among others) mature. Many components still cannot be automated or abstracted away, though. Data scientists must decide how to evaluate model performance, for example. For a predictive model, should we use a random or temporal train-test split? What evaluation metric best matches the business use case?
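Here is a rough sketch of the difference between the two kinds of split, using only the standard library. The toy data, the 80/20 proportions, and the function names are my own assumptions for illustration, not a recommendation for any particular project.

```python
import random
from datetime import date, timedelta


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction, ignoring time order."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(rows, cutoff):
    """Train on rows strictly before the cutoff date, test on the rest."""
    train = [r for r in rows if r["day"] < cutoff]
    test = [r for r in rows if r["day"] >= cutoff]
    return train, test


# Hypothetical toy data: one observation per day for 100 days.
start = date(2020, 1, 1)
rows = [{"day": start + timedelta(days=i), "y": i} for i in range(100)]

rand_train, rand_test = random_split(rows)
time_train, time_test = temporal_split(rows, cutoff=start + timedelta(days=80))

# Temporal split: every training row precedes every test row.
temporal_is_leak_free = (
    max(r["day"] for r in time_train) < min(r["day"] for r in time_test)
)

# Random split: some training rows come from after the earliest test row,
# so the model gets to "see the future" relative to part of the test set.
random_sees_future = (
    max(r["day"] for r in rand_train) > min(r["day"] for r in rand_test)
)
```

If the model will be used to predict future outcomes, the temporal split usually mimics production more faithfully. If observations are closer to independent draws, the random split may be fine. No tool can make that call without knowing the use case.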

The last piece of the modeling process is communication. Data engineering and MLOps need to know how to implement the model in production (if that’s not also the data scientist’s job). Business units need at least a basic intuition for how the model works and explanations for unexpected predictions.

As far as the massive pre-trained models like GPT-3 go, sure, most data scientists at most companies should not waste time trying to build them from scratch. But these models cover a tiny fraction of real-world use cases; the vast majority of applications have no pre-trained model to build from.

Data scientists spend most of their time doing other things

Skeptic: Fair enough. But I hear data scientists say over and over that modeling work takes up only a small fraction of their time. Even you said that data work should come before modeling. So if I were a hiring manager, shouldn’t I focus first on data engineers and MLOps engineers? If I were choosing my career, wouldn’t data engineering be the safer choice?

Me: Let’s first get on the same page. Problem formulation, data exploration, and data cleaning are part of statistical modeling. Understanding how data engineering and model deployment pipelines work is part of statistical modeling (although designing and implementing these systems is not). Even data scientists who only want to do statistical modeling should embrace these tasks.

I agree that from an organizational perspective data engineering is a higher priority than statistical modeling. Even experiment analyses—which don’t need to be deployed—depend completely on good instrumentation and data pipelines.

Data scientists at smaller, scrappier companies spend more of their time on data engineering and MLOps. People who prefer to focus on statistical modeling should look to larger, better-funded companies with more specialized teams. I would caution against premature career specialization, though, because knowing some engineering allows data scientists to act as a highly valuable bridge between the technical and business sides of an organization. It also leaves open the option to move to a more engineering-heavy role down the road.

Most data science projects fail

Skeptic: Score a point for the skeptics. I’ve also read that most data science projects fail, so I don’t see why a company—especially a small, scrappy one—should waste the resources on data scientists.

Me: I’ve seen those sources that say 85% or 87% of projects fail, but they seem to just make up the numbers out of thin air. Where’s the data? I’m skeptical of your skepticism!

More seriously, what does it mean for a data science project to fail? Kohavi, Tang, and Xu point out that most experiments do fail in the sense that a proposed change turns out not to be better than the existing system. This is not failure in the business sense, though, because these experiments still lead to good decisions and a fast innovation cadence.

More generally, the most valuable thing statistical modelers bring to the table is their culture. Data scientists insist on justifying ideas with evidence instead of intuition, especially by quantifying model performance. Before we run an experiment, we need to know what metrics we use to evaluate a new idea. Before we deploy a complex predictive model, we need to know what the baseline is. It’s probably the current deterministic, hard-coded system, which you don’t even think of as a model, let alone measure! So even if some projects fail, strong data scientists raise the bar for the whole organization.
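As a small illustration of the baseline point, here is a sketch that scores an existing hard-coded rule and a candidate model on the same holdout set with the same metric. The holdout data, both decision rules, and the choice of accuracy as the metric are hypothetical stand-ins.

```python
def accuracy(predict, examples):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(predict(x) == label for x, label in examples)
    return correct / len(examples)


# Hypothetical holdout set: (days since last login, whether the user churned).
holdout = [
    (2, False), (5, False), (40, True), (31, True),
    (12, False), (29, False), (45, True), (8, False),
]


def current_rule(days_inactive):
    """The existing hard-coded system: flag users inactive for over 30 days."""
    return days_inactive > 30


def candidate_model(days_inactive):
    """Stand-in for a newly trained model (here just a different threshold)."""
    return days_inactive > 20


baseline_accuracy = accuracy(current_rule, holdout)
candidate_accuracy = accuracy(candidate_model, holdout)
lift = candidate_accuracy - baseline_accuracy
```

On this toy data the hard-coded rule actually scores higher than the candidate, which is the kind of thing you only find out if you measure the baseline in the first place.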

Modeling experts also increase the pace of innovation by spotting potential modeling problems ahead of time. In recommender systems, for example, it’s important to think upfront about how to avoid closed feedback loops, how to address the cold start problem, and how to ensure algorithmic fairness.

Not everything can be planned in a data science project, however. Unlike other engineering disciplines, we cannot a priori promise concrete results or even solid road maps to our partners, because we don’t know what we’ll find in the data. Having a distinct data science role helps to communicate this limitation.

There are too many data scientists

Skeptic: Maybe really good senior data scientists add value but there’s now a glut of junior data scientists. These poor folks end up at companies that aren’t ready for statistical models, wasting their training and talent.

Me: As I said above, data cleaning is part of statistical modeling, and data science training programs should emphasize this more. People who want to specialize in modeling should look for jobs at larger companies, although even this isn’t a panacea; understanding data and engineering pipelines is always important.

It may be true that data science labor supply exceeds demand, at least in terms of jobs with an explicit “data scientist” title. But this perspective misses the forest for the trees. Aspiring data scientists may have to broaden their job search to other titles or target roles in specific business units, but statistical modeling can and should be applied to virtually every industry job. No matter the title, people with good modeling skills will be more effective and rise to the top.

It’s important to actually learn statistical modeling, though, whether through a degree, bootcamp, or self-study. One cannot focus entirely on data engineering and MLOps to land a job, then hope to switch to the data science team later without any modeling experience.

Conclusion

The field of data science has certainly received a lot of hype over the past 10 years, and a certain amount of pushback is inevitable, even productive. But let’s not forget the value that its core skill of statistical modeling brings to the table.