Google Form Parser, a review and how-to

Google Form Parser is a new challenger in the information extraction arena, offering general-purpose off-the-shelf form extraction. We compared Form Parser to Amazon Textract and Microsoft Form Recognizer in terms of accuracy, speed, and ease of use. Take our code snippets and try it yourself!

How to build duration tables from event logs with SQL

Duration tables are a common input format for survival analysis but they are not trivial to construct. In our last article, we used Python to convert a web browser activity log into a duration table. Event logs are usually stored in databases, however, so in this article we do the same conversion with SQL.
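The conversion the article performs can be sketched in SQL directly. This is a minimal illustration with hypothetical table and column names (`events`, `user_id`, `event`, `ts`), run against an in-memory SQLite database; the article's actual schema and dialect may differ. Each user's duration runs from first visit to first conversion, or to the study's end date for users who never converted (right-censored).

```python
# A hedged sketch, not the article's exact query: build a duration table
# from an event log entirely in SQL, using SQLite via Python's stdlib.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id INTEGER, event TEXT, ts TEXT);
INSERT INTO events VALUES
  (1, 'visit',   '2021-01-01'),
  (1, 'convert', '2021-01-04'),
  (2, 'visit',   '2021-01-02'),
  (3, 'visit',   '2021-01-05');
""")

rows = conn.execute("""
WITH starts AS (
    SELECT user_id, MIN(ts) AS start_ts
    FROM events WHERE event = 'visit' GROUP BY user_id
),
ends AS (
    SELECT user_id, MIN(ts) AS end_ts
    FROM events WHERE event = 'convert' GROUP BY user_id
)
SELECT s.user_id,
       CAST(julianday(COALESCE(e.end_ts, '2021-01-10'))  -- censor at study end
            - julianday(s.start_ts) AS INTEGER) AS duration_days,
       e.end_ts IS NOT NULL AS observed                   -- 0 => censored
FROM starts s LEFT JOIN ends e ON s.user_id = e.user_id
ORDER BY s.user_id
""").fetchall()
```

The `LEFT JOIN` plus `COALESCE` is what handles censoring: users with no conversion event keep a row, with the duration measured to the end of the observation window.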

How to convert event logs to duration tables for survival analysis

Survival models describe how much time it takes for some event to occur. This is a natural way to think about many applications but setting up the data can be tricky. In this article, we use Python to turn an event log into a duration table, which is the input format for many survival analysis tools.
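The kind of conversion described above can be sketched in a few lines of pandas. This is an illustrative example with hypothetical column names (`user_id`, `event`, `timestamp`), not the article's exact code: one row per user, the days from first visit to conversion, and a flag marking users who never converted (right-censored at the end of the study).

```python
# A minimal sketch of turning an event log into a duration table.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "event":   ["visit", "convert", "visit", "visit", "visit"],
    "timestamp": pd.to_datetime([
        "2021-01-01", "2021-01-04",
        "2021-01-02", "2021-01-03",
        "2021-01-05",
    ]),
})

study_end = pd.Timestamp("2021-01-10")  # end of the observation window

first_visit = events[events["event"] == "visit"].groupby("user_id")["timestamp"].min()
converted = events[events["event"] == "convert"].groupby("user_id")["timestamp"].min()

durations = pd.DataFrame({"start": first_visit, "end": converted})
durations["observed"] = durations["end"].notna()       # False => censored
durations["end"] = durations["end"].fillna(study_end)  # censor at study end
durations["duration_days"] = (durations["end"] - durations["start"]).dt.days
```

The resulting `duration_days`/`observed` pair is the shape most survival libraries expect: an elapsed time per subject plus an event indicator distinguishing observed events from censored ones.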

A checklist for professionalizing machine learning models

Data scientists are drawn to the latest and greatest machine learning tasks and models, even though tabular binary classification remains the industry workhorse. We should take more pride in professionalizing the models that we know to work, rather than reflexively chasing every new thing.

Review: Statistical Rethinking, by Richard McElreath

This is an absolute gem of a book. McElreath has found an elusive combination: Statistical Rethinking is not only one of the best intro textbooks for both causal and Bayesian modeling, it's also highly readable, even entertaining.

Streamlit review and demo: best of the Python data app tools

Streamlit has quickly become the hot thing in data app frameworks. We put it to the test to see how well it stands up to the hype. Come for the review, stay for the code demo, including detailed examples of Altair plots.

How to analyze a staged rollout experiment

Recently we argued that confidence intervals are a poor choice for analyzing staged rollout experiments. In this article, we show a better way: a Bayesian approach that gives decision-makers the answers they really need. Check out our interactive Streamlit app first, then the article for the details.

Research digest: what does cross-validation really estimate?

A new paper by Bates, Hastie, and Tibshirani reminds us that estimating a model's predictive performance is tricky. For linear models at least, cross-validation does not estimate the generalization error of a specific model, as you might assume. How much does this matter for data science in practice?

What we're reading, April edition

The data science content firehose can be overwhelming; these are the pieces we think might be worth your time to check out. This month we're focusing on causal inference.

No, your confidence interval is not a worst-case analysis

Confidence intervals are one of the most misunderstood concepts in statistics. Common sense says the lower bound of a confidence interval is a good estimate of the worst-case outcome, but the definition of the confidence interval doesn't allow us to make this claim. Or does it? Let's take a look.

Announcing ABGlossary, an experiment vocab translator

Defining terms is a key part of experimentation culture. The consequence, however, is that every community has its own experiment jargon, which makes it hard to spot patterns, let alone communicate across groups. We've created a lightweight tool called ABGlossary to help translate experiment vocab.

Choose your experiment platform wisely

To build a culture of fast, reliable, evidence-based innovation, you need an experiment platform. These tools support each stage of the experiment process and, done well, become the beating heart of your infrastructure.

In defense of statistical modeling

Data science remains hot, but a persistent stream of articles claims the field is overhyped and urges hiring managers and aspiring data scientists to focus more on engineering. Let's remember why data science's core skill of statistical modeling is so valuable.

Our experimentation roadmap

Experimentation is the gold standard for predicting the impact of a new idea. Simple designs like A/B testing sound easy but are fiendishly hard to get right. Data scientists are uniquely positioned to solve the challenge and help their companies develop the experimentation muscle.

What we're reading: February 22

What we're reading for the week of February 22nd. The data science content firehose can be overwhelming; these are the pieces we think might be worth your time to read and study.

Data before models, but problem formulation first

Recent tweets highlighted that data annotation and curation matter more in applied machine learning than model perfection. Data is indeed critical, but formulating a business problem as a data science task is even more foundational.

Conversion rate modeling: worth the effort?

Conversion rates are essential for understanding and optimizing a business. In this article, we compare conversion rate modeling to a common analytics approach and show how to decide between the two methods.

Modeling the customer journey: our roadmap

Modeling the customer journey can be one of the best ways for industry data scientists to deliver value. We break the customer journey metaphor down into smaller pieces and lay out our roadmap for covering them.