How to analyze a staged rollout experiment
This article is the second part of a two-part series. The first part, on why confidence intervals are a bad idea for worst-case analysis, is here. The Streamlit app that accompanies this article is also explained in the post Streamlit review and demo: best of the Python data app tools.
We recently argued that confidence intervals are a bad way to analyze staged rollout experiments. In this article, we describe a better way. Better yet, we've built a live, interactive Streamlit app to show the better way. Check it out, then follow along as we talk through the methodology, the app, and the decision-making process.
The scenario
Continuing from our previous article, here's the scene: a staged rollout of a new website feature, with session click rate as our primary metric. What decision-makers often want to know in this scenario is that the new site is no worse than the existing site. We formalize this by asking: what's the probability that the true click rate is less than some business-critical threshold? If we use \(\theta \) to represent the true click rate, \( \kappa \) for the business-critical threshold that defines the “worst-case” outcome, and \( \alpha \) for the maximum worst-case probability we can tolerate, then our decision rule is \[ \mathbb{P}(\theta \leq \kappa) < \alpha \]
If this statement is true, then we're GO for the next phase of the rollout. Otherwise, it's NO GO; we roll the new feature back and try something else. For example, our company execs might say that as long as the click rate on our site doesn't fall below 0.08, we're good to keep going with the rollout. Specifically, as long as the probability of that low click rate is less than 10%, we'll go forward.
Confidence intervals don't cut it
Data scientists sometimes resort to confidence intervals in this situation. It's an attractive option, at first blush. For starters, anybody who's taken an intro stats class knows how to do it. More importantly, a confidence interval gives us both the worst-case threshold \( \kappa \) (the lower bound of the interval) and a probability \( \alpha \) (the significance level). A naïve data scientist might find, for example, an 80% confidence interval of \([0.079, 0.088] \) and conclude there's only a 10% chance the true click rate is less than the lower bound of 0.079.
As we showed previously, this is an incorrect interpretation of the confidence interval; the logic for constructing the interval assumes the true click rate is fixed at some unknown value, so it's not possible to then make probabilistic statements about the distribution of that same value. The closest we can get to a worst-case analysis with the confidence interval is to assume we didn't draw an unusually weird sample (a correct assumption 80% of the time), assume the true click rate is within our interval, and conclude that a click rate of 0.079 is the smallest value that's compatible with our data. Because this is smaller than 0.08, we should make a NO GO decision.
No matter how hard we try, we can't transmogrify this result into the form we set out above as our decision criterion. What's more, the confidence interval approach doesn't give our decision-makers the freedom to choose both the business threshold and the confidence level. In this example, we're forced to use the lower bound of the interval at 0.079 as our threshold, even though we know the more meaningful threshold would be 0.08.
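For concreteness, this is roughly how such an interval is computed. The following is a minimal sketch using the normal-approximation (Wald) interval; the helper name `wald_interval` and the session totals are hypothetical and are not the data from the previous article.

```python
import math

from scipy import stats


def wald_interval(clicks, sessions, confidence=0.80):
    """Normal-approximation confidence interval for a click rate."""
    p_hat = clicks / sessions
    # Two-sided interval: split the leftover probability between both tails.
    z = stats.norm.ppf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / sessions)
    return p_hat - half_width, p_hat + half_width


# Hypothetical totals, for illustration only.
lower, upper = wald_interval(clicks=580, sessions=7000)
print(f"80% confidence interval: [{lower:.4f}, {upper:.4f}]")
```

Notice that the lower bound lands wherever the data put it. The function has no argument through which a stakeholder could supply a threshold such as 0.08.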
Bayesian data analysis to the rescue!
So how can we answer our decision-makers' question? With a Bayesian analysis. This method has several key advantages over the confidence interval approach:

- It lets us directly estimate the quantity that decision-makers intuitively want.
- Our stakeholders can set both the business-critical threshold that defines the worst-case scenario and the maximum allowable chance of that scenario happening.
- We can incorporate prior information, which is especially handy for staged rollouts, where we don't want to ignore the results from previous phases of the experiment.
- It allows us to check how robust the results are to our prior beliefs about the click rate and to the decision-making criteria. This is super important when different stakeholders have different priors and decision criteria because it identifies those differences as the crux of the debate, rather than the red herring of statistical significance.
- It provides a complete picture of our updated posterior belief about the click rate that we can use to compute all sorts of interesting things. In addition to our GO/NO GO decision criteria, we might also be interested in the expected value of the click rate if the worst-case situation does happen, for example.
This article is not a general introduction to Bayesian analysis; there are better resources for that. A few we recommend in particular:
- McElreath, Statistical Rethinking
- Gelman et al., Bayesian Data Analysis
- Jake VanderPlas, Frequentism and Bayesianism: A Python-driven Primer
We encourage you to check these references out but we hope you can use this article and our interactive app without any further background.
The data
Simulating data
As with any statistical analysis, we start with the data. For this article and our demo app, we simulate data for a hypothetical staged rollout. We imagine 500 web sessions per day on average with daily variation according to a Poisson distribution.^{1} Each day's sessions yield some number of clicks, according to a binomial distribution. The true click rate is the parameter of that binomial; we set it at 0.085. The Python 3.8 code for this and the first three days of data:^{2}
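A minimal sketch of such a simulation, assuming NumPy and pandas. The seed and variable names are illustrative assumptions, so the output will not reproduce the exact rows in the table below.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(17)  # arbitrary seed (assumption)

num_days = 14               # length of the experiment discussed later on
avg_daily_sessions = 500    # mean of the Poisson session count
true_click_rate = 0.085     # binomial parameter we will pretend not to know

# Daily traffic varies around the average according to a Poisson distribution.
sessions = rng.poisson(lam=avg_daily_sessions, size=num_days)
# Each day's clicks are binomial, given that day's number of sessions.
clicks = rng.binomial(n=sessions, p=true_click_rate)

data = pd.DataFrame(
    {
        "day": np.arange(num_days),
        "sessions": sessions,
        "clicks": clicks,
        "misses": sessions - clicks,
    }
)
data["click_rate"] = data["clicks"] / data["sessions"]
print(data.head(3))
```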

| day | sessions | clicks | misses | click_rate |
| --- | --- | --- | --- | --- |
| 0 | 484 | 37 | 447 | 0.076446 |
| 1 | 523 | 32 | 491 | 0.061185 |
| 2 | 506 | 46 | 460 | 0.090909 |
In the app, the data is represented in the top middle plot. The number of daily sessions is shown in orange and the click rates in blue.
Modeling the data
Now we force ourselves to forget that we simulated the data...
Each day we observe \( n \) sessions, and we model the number of clicks \( y \) as a binomial random variable. \[ y \sim \mathrm{Binomial}(n, \theta) \] We choose the binomial distribution not because we know it's the true generating distribution, but because it's an eminently reasonable likelihood function for most binary outcome data, even when we don't know the underlying generating process. This distribution's only unknown parameter is the true click rate \( \theta \), which would be unknowable in a real problem. That is the thing we need to learn about from the data.
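To make the likelihood concrete, here is a minimal sketch that evaluates the binomial likelihood of day 0's data from the table above (484 sessions, 37 clicks) over a grid of candidate click rates. The grid bounds and variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

n, y = 484, 37  # day 0: sessions and clicks

# Candidate values of theta, from 0.010 to 0.200 in steps of 0.001.
theta_grid = np.linspace(0.01, 0.20, 191)

# Likelihood of seeing y clicks in n sessions under each candidate theta.
likelihood = stats.binom.pmf(y, n, theta_grid)

best_theta = theta_grid[np.argmax(likelihood)]
print(f"Most likely grid value: {best_theta:.3f}; observed rate: {y / n:.4f}")
```

The likelihood peaks next to the observed click rate for that day, and falls off on either side; how quickly it falls off is what the posterior will later turn into a statement of uncertainty.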
What do we already know about the click rate?
In most experiments we don't start with a blank slate; data scientists and business units both have intuition about how well new features and products are likely to work. This is especially true for the later phases of a staged rollout, where we have data from the previous phases.
We're going to describe these prior beliefs about the click rate with a beta distribution. There are two reasons for this. First, its two parameters \( a \) and \( b \) are highly interpretable. Think of \( a \) as the number of clicks we've already seen and \( b \) as the prior number of sessions without a click (let's call them misses). For a staged rollout, these could be real data from previous phases, but they could also be hypothetical. The stronger your stakeholder's belief, the larger the prior number of sessions should be, and the higher they think the click rate is, the larger the number of clicks should be.
The initial setting in our interactive app, for example, is 100 prior sessions, 10 of which resulted in clicks.
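As a minimal sketch, that prior can be built with `scipy.stats.beta`; the variable names here are assumptions.

```python
from scipy import stats

prior_sessions = 100
prior_click_rate = 0.1

# Translate "sessions and click rate" into the beta parameters a and b.
prior_clicks = prior_sessions * prior_click_rate   # a: prior clicks
prior_misses = prior_sessions - prior_clicks       # b: prior misses

prior = stats.beta(prior_clicks, prior_misses)

print(f"Prior mean click rate: {prior.mean():.3f}")
print(f"Prior P(click rate <= 0.08): {prior.cdf(0.08):.3f}")
```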

The second reason why we use the beta distribution is that it hugely simplifies how we derive our updated posterior belief about the click rate.
Putting it together: the posterior
The fundamental goal of our analysis is to describe our belief about the parameter given both our prior assumptions and the data we've observed. This quantity is the posterior distribution; in math notation, it's
\[ \mathbb{P}( \theta \mid \mathcal{D} ) \] where \( \mathcal{D} \) represents the data. We obtain this by multiplying the prior distribution and the likelihood of the data given a parameter value. The posterior is proportional to this result, and that's good enough because we know it's a probability distribution and must sum or integrate to 1. \[ \mathbb{P}( \theta \mid \mathcal{D} ) \propto \mathbb{P}(\theta) \cdot \mathcal{L}(\mathcal{D} \mid \theta) \]
We've chosen a beta distribution as our prior with \( a \) clicks and \( b \) misses. The data likelihood is a binomial distribution with \( y \) clicks and \( n - y \) misses. Here's where the second big advantage of the beta prior comes in; the result of multiplying the beta prior and the binomial likelihood is another beta distribution! The posterior beta distribution has hyperparameters that can be interpreted as \( a + y \) clicks and \( b + (n - y) \) misses.
This property is called conjugacy. A good source for the mathematical details is section 2.4 of Gelman et al.'s Bayesian Data Analysis, on informative prior distributions. Crucially, using a conjugate prior means we don't have to do complicated math or sampling to derive our posterior.
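For readers who want to see why this works, here is a short sketch of the algebra. It drops every factor that does not depend on \( \theta \), namely the binomial coefficient and the beta normalizing constant:

\[ \mathbb{P}( \theta \mid \mathcal{D} ) \propto \underbrace{\theta^{a - 1} (1 - \theta)^{b - 1}}_{\text{beta prior}} \cdot \underbrace{\theta^{y} (1 - \theta)^{n - y}}_{\text{binomial likelihood}} = \theta^{(a + y) - 1} (1 - \theta)^{(b + (n - y)) - 1} \]

The right-hand side is the kernel of a beta distribution with parameters \( a + y \) and \( b + (n - y) \), so the posterior must be that beta distribution.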
Another cool thing about Bayesian updating is that we can update our posterior after observing each day's data and the result is the same as if we had updated with all observed data at once. In our analysis and app, we do daily updates to see how the posterior evolves over time, and because this is a more realistic way to process the data.
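A minimal sketch of the daily update loop, using only the first three days from the table above as stand-in data. The column names `a`, `b` and `width`, and the exact statements, are assumptions. The listing is laid out so that the update step sits on lines 16-19 and the summary statistics on lines 22-26, to line up with the discussion that follows.

```python
import pandas as pd
from scipy import stats

# Stand-in data: the first three days from the table above.
data = pd.DataFrame({"sessions": [484, 523, 506], "clicks": [37, 32, 46]})
data["misses"] = data["sessions"] - data["clicks"]

# Prior: 100 sessions at a 0.1 click rate, stored at index -1.
prior_clicks, prior_misses = 10, 90
results = pd.DataFrame(
    {"a": [prior_clicks], "b": [prior_misses]}, index=[-1], dtype=float
)

# Conjugate update, one day at a time.
for day, row in data.iterrows():
    results.loc[day] = [
        results.loc[day - 1, "a"] + row["clicks"],
        results.loc[day - 1, "b"] + row["misses"],
    ]

# Summarize the shape of each day's posterior.
posterior = stats.beta(results["a"].to_numpy(), results["b"].to_numpy())
results["mean"] = posterior.mean()
results["p10"] = posterior.ppf(0.1)
results["p90"] = posterior.ppf(0.9)
results["width"] = results["p90"] - results["p10"]

print(results)
```

The last row holds \( a = 10 + 115 = 125 \) and \( b = 90 + 1398 = 1488 \), which is what a single update with all three days' totals would give. That illustrates the point that daily and all-at-once updating agree.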

Each row of the results DataFrame represents a day's update, with index -1 representing our prior. The posterior update step is lines 16-19. In lines 22-26 we compute functions of the distribution to describe its shape because we can't easily plot the entire distribution over time. The mean is the expected value of the posterior, while p10 and p90 together form the 80% credible interval. There is nothing special about the credible interval; it is not equivalent to a confidence interval. It's just a way of characterizing the shape of the posterior in a condensed way.
The posterior time series plot shows a few important things:

- The credible interval shrinks as we get more data, i.e. our certainty about the true click rate increases.
- Intervals are useful. Due to pure randomness, the posterior mean drops pretty far below the true click rate (even though our prior is higher than the true click rate). The wide credible interval shows how much uncertainty we still had at that point in the experiment.
- By day 7 of the experiment (counting from 0, as good Pythonistas do), the posterior distribution seems to have stabilized. If we use the app to imagine a stronger and higher prior (say 1,000 sessions with a click rate of 0.2), the posterior time series shows that 14 days of data isn't enough for the posterior to stabilize, and we need to collect more data.
- It's hard to get a sense just from the posterior mean and credible interval how certain we are that the true click rate is above our worst-case threshold, which is why it's important to look at the full posterior distribution in detail.
The full posterior looks like this:
In the app, this is the bottom middle plot. The bottom left plot also shows the posterior, but on the same scale as the prior, to illustrate how the posterior distribution narrows as we observe data and become more certain about the result.
Making a decision
The posterior distribution for click rate is not by itself the final answer. We need to guide our stakeholders toward the best action, or at least the probable consequences of different actions.
In our hypothetical scenario, we said our decision-makers want to go forward with the new feature rollout if there's less than a 10% chance that the true click rate is less than 0.08, or \( \mathbb{P}(\theta \leq 0.08) < 0.1 \). We use the cdf method of the posterior distribution (a scipy.stats continuous random variable object) to compute the total probability of the worst-case scenario, i.e. the scenario where the click rate is less than 0.08.
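A minimal sketch of that computation. The posterior used here is a stand-in built from only the first three days of data shown earlier (115 clicks, 1,398 misses), so its numbers will differ from the results table below, which reflects the full experiment. Variable names are assumptions.

```python
from scipy import stats

worst_case_threshold = 0.08   # kappa
max_worst_case_prob = 0.10    # alpha

prior = stats.beta(10, 90)
# Stand-in posterior: prior updated with the first three days only.
posterior = stats.beta(10 + 115, 90 + 1398)

# cdf(kappa) is the total probability that the click rate is at or below kappa.
prior_worst_case = prior.cdf(worst_case_threshold)
posterior_worst_case = posterior.cdf(worst_case_threshold)

decision = "GO" if posterior_worst_case < max_worst_case_prob else "NO GO"

print(f"Prior mean click rate: {prior.mean():.4f}")
print(f"Prior worst-case probability: {prior_worst_case:.2%}")
print(f"Posterior mean click rate: {posterior.mean():.4f}")
print(f"Posterior worst-case probability: {posterior_worst_case:.2%}")
print(f"FINAL DECISION: {decision}")
```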

| Result | Value |
| --- | --- |
| Prior mean click rate | 0.1 |
| Prior worst-case probability | 26.8% |
| Posterior mean click rate | 0.0837 |
| Posterior worst-case probability | 13.12% |
| FINAL DECISION | NO GO |
In this example, there's still a 13% chance that the click rate is less than 8%, given our prior assumptions and the observed data, so we recommend to our execs that the rollout should stop. It's interesting to see that the posterior mean of 0.0837 is pretty close to the true click rate of 0.085. The problem is that we haven't observed enough data to be sure that it's truly above 0.08.
Stress-testing and building intuition with the interactive app
The interactive aspect of our Streamlit app is not just for fun. It helps us—and our stakeholders—to gain intuition about the method and the results, and to check the robustness of the outcome.
The first two widgets on the app control panel specify our prior belief about the click rate. The higher we think the click rate is beforehand, the higher our posterior belief will be too. If we came into our experiment thinking the click rate was 0.15 instead of 0.1, then the posterior probability of the worst-case scenario would fall to just 9% and our decision would switch from NO GO to GO.
Similarly, the more prior sessions we have (real or imagined), the more the prior influences the result. If we had just 310 prior sessions at a click rate of 0.1 instead of 100, our decision would also flip from NO GO to GO.
These are surprisingly small changes that cause the final decision to reverse completely. This lack of robustness shows that specifying the prior can be critical, particularly when multiple stakeholders are involved, each of whom has their own intuition and assumptions. While this may seem like a bug in the methodology, it is in fact a major feature. The ability to plug in the perspectives of different stakeholders sometimes reveals the crux of a decision-making debate.
The issue of different stakeholder assumptions is even more clear when it comes to the decision criteria, which are the bottom two widgets on the app control panel. The worst-case click rate threshold is \( \kappa \), and it appears as a red line on the plots. The max acceptable worst-case probability is \( \alpha \); it only affects the final decision of GO or NO GO.
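Outside the app, the same kind of stress test can be sketched as a small sweep over prior settings. The helper `worst_case_prob` and the experiment totals below are hypothetical and are not the article's simulated data, so the printed probabilities will not match the figures quoted above.

```python
from scipy import stats


def worst_case_prob(prior_sessions, prior_rate, clicks, misses, kappa):
    """Posterior P(theta <= kappa) for a beta prior and binomial data."""
    a = prior_sessions * prior_rate + clicks
    b = prior_sessions * (1 - prior_rate) + misses
    return stats.beta(a, b).cdf(kappa)


# Hypothetical experiment totals, for illustration only.
clicks, misses = 580, 6420
kappa, alpha = 0.08, 0.10

# (prior sessions, prior click rate) pairs to compare.
scenarios = [(100, 0.10), (100, 0.15), (310, 0.10)]
for prior_sessions, prior_rate in scenarios:
    prob = worst_case_prob(prior_sessions, prior_rate, clicks, misses, kappa)
    decision = "GO" if prob < alpha else "NO GO"
    print(
        f"{prior_sessions:>4} prior sessions at {prior_rate:.2f}: "
        f"P(worst case) = {prob:.3f} -> {decision}"
    )
```

Changing `kappa` or `alpha` in the same loop shows how sensitive the decision is to the criteria rather than to the prior.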
Final thoughts
It's understandable that data scientists would use confidence intervals for worst-case analyses. It's a technique taught in every intro stats class, the procedure seems to provide all the information we need, and the result is often very similar to the credible interval from the Bayesian approach.
The Bayesian approach that we've shown here is much better, for many, many reasons. Not only does it give the interpretation that our stakeholders want, it allows them to define both the worst-case scenario and the maximum allowable chances of that scenario occurring. And while the confidence interval method may yield the right answer for the wrong reason, once we use an informative prior (a very natural thing to do in a staged rollout where we have data from past phases of the experiment), all bets are off. Why not do the right thing from the start?
Furthermore, the Bayesian method lets us compare different prior distributions, to check the robustness of the results or to reflect the different assumptions of different stakeholders. Lastly, this approach gives us a fuller picture of the result in the form of a posterior distribution. With the posterior, we can compute all sorts of useful things, like the expected revenue of the new feature given revenue-per-click data, or the expected value of the click rate if the worst-case scenario does come to pass.
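As one example of such a derived quantity, here is a minimal sketch of the expected click rate given that the worst case happens, \( \mathbb{E}[\theta \mid \theta \leq \kappa] \), for a beta posterior. It relies on the identity \( \int_0^{\kappa} \theta \, \mathrm{Beta}(\theta; a, b) \, d\theta = \frac{a}{a + b} \, F_{a+1,\,b}(\kappa) \), where \( F_{a+1,\,b} \) is the CDF of a Beta\((a + 1, b)\) distribution. The function name and the example parameters (the three-day stand-in posterior from the first rows of data) are assumptions.

```python
from scipy import stats


def expected_rate_given_worst_case(a, b, kappa):
    """E[theta | theta <= kappa] for theta ~ Beta(a, b)."""
    # Numerator: the part of the posterior mean contributed by theta <= kappa.
    partial_mean = a / (a + b) * stats.beta(a + 1, b).cdf(kappa)
    # Denominator: the worst-case probability itself.
    return partial_mean / stats.beta(a, b).cdf(kappa)


# Stand-in posterior from the first three days of data: Beta(125, 1488).
value = expected_rate_given_worst_case(125, 1488, kappa=0.08)
print(f"Expected click rate if the worst case occurs: {value:.4f}")
```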
One thing a Bayesian analysis does not do is eliminate the need for thoughtful data science and decision-making. Even in our relatively simple click rate example, there is plenty of room for argument. While the posterior distribution seems to have stabilized, our posterior results are quite different from what we expected a priori.
Maybe the final decision shouldn't be a hard NO GO; maybe it should be PAUSE, while we investigate what caused such a big surprise. Maybe it's an engineering bug and we can resume the feature rollout after it's resolved. In the end, critical thinking is, well, critical for every statistical method.
Notes & references

1. A more realistic simulation would probably include day-of-week effects too. ↩

2. All code snippets in this article are taken from the Streamlit app script, at https://github.com/CrosstabKite/worstcaseanalysis, although I've tweaked some of the code slightly to better stand on its own in this context. ↩