
Announcing ABGlossary, an experiment vocab translator

A Netflix engineer, an academic statistician, and a Google Optimize user walk into a bar...

The bar is in rough shape after many empty nights during the Covid pandemic, and the plucky data scientists decide to buy it and spruce it up. An argument quickly ensues over details like the new name and paint color. The industry practitioners want to rename the bar Malty-Armed Bandit, but the academic prefers Stoutistical Significance and wants to change the paint from green to blue.

Like good data scientists, they begin to plan an experiment to help them decide how to proceed.

Optimize user: It's all pretty straightforward, we have two sections each with two variants, so four combinations in total.

Academic: Sections? Combinations? I thought we were talking about an experiment; it sounds like you're designing seating arrangements!

Netflix engineer: Hmmm, I worry we don't have the time to collect enough data for each of the cells. Plus, what's the default cell going to be for a brand new bar?

Academic: Cells!? I thought seating arrangements were bad, now we're doing biology?? What happened to plain old factors and levels? I'm going back to my beer...

Experiment jargon is cultural and variable

The story is silly, but the jargon is real. Netflix (uniquely, as far as we can tell) uses the term cell to refer to an experimental group. Ron Kohavi's well-known work goes with variant, while the popular experiment platform Google Optimize calls the same thing a combination and uses variant to mean what statisticians usually call a level.

Experiment language varies from community to community because experimentation is as much about culture as it is about science. This can be a real source of friction when translating ideas from one group to another, whether in a job interview, a guest lecture, or a decision about which experiment platform to buy.

ABGlossary, to the rescue!

As we've been reading and writing about experimentation over the past few weeks, we've found ourselves frequently asking the same questions:

  • Company X uses the term T in its experimentation blog post or product documentation. What do they mean?

  • Give me a list of all of Company X's experimentation vernacular, in preparation for an upcoming interview or tech talk (or Crosstab Kite article).

  • I want to use the term T to mean something specific. Is that standard? Tell me what various organizations mean when they use the term T.

To help answer these questions, we've created a tool called ABGlossary. It has two main parts:

  1. A compilation of experiment terms used by various organizations in written sources, like book chapters, blog posts, and product documentation.

  2. A command-line utility for querying the compiled data.

Our goals for this project are very modest. In particular, there are several things ABGlossary does not do. It does not attempt to create standard definitions; it's purely descriptive. As a result, ABGlossary cannot currently answer the question "give me all the terms that relate to experiment concept Y". We'll leave that for the to-do list, if there's sufficient interest.

ABGlossary's scope is limited to experimentation terms that lack precise, universal mathematical definitions. P-values and confidence intervals mean the same thing for every organization (even if they're sometimes used incorrectly), so there's no need to list them in the glossary.

How do I use it?

The easy option: browse the data

ABGlossary lives in a GitHub repo. The easiest way to use it is to simply browse the terminology.yaml file in the repo. This file contains the compiled terms, listed by source. For example, the entry for a 2017 Lyft blog post is:

- title: Ramblings on Experimentation Pitfalls, Part 1
  link: https://eng.lyft.com/ramblings-on-experimentation-pitfalls-dd554ff87c0e
  orgs:
    - Lyft
  authors:
    - "timothybrownsf"
  terms:
    rollout: >- 
      A protocol for a staged, test-based launch of a new idea that uses a "holdout"
      group in addition to control and treatment groups. In the first stage, for
      example, the allocation of subjects might be (10% control, 10% treatment, 80%
      holdout). The next stage might be (50% control, 50% treatment, 0% holdout). The
      goal is to avoid the analysis mistake of weighting each observational subject
      equally across all stages, rather than (correctly) doing the analysis for each
      stage separately.
    variant:
    control:
    treatment:
    split:
    holdout: >-
      a group of subjects (users) who are in the target population for an A/B
      test, but not part of either the control or treatment group.

Use Ctrl+F to search for organizations, source titles, authors, or terms of interest. Notice that not all terms in the Lyft blog post entry have a definition; we think it's useful to know that an organization uses a term, even if a particular source assumes the definition is known.
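For readers who would rather script against the data than search by hand, here is a minimal sketch of how the Lyft entry above might look once parsed into Python. It assumes that a YAML loader turns the entry into a dict and maps terms with no definition to None. The helper name split_terms is hypothetical and is not part of ABGlossary, and the two definitions are shortened.

```python
# Hypothetical in-memory form of the Lyft entry shown above, roughly what a
# YAML loader would produce. Definitions are shortened for brevity.
lyft_entry = {
    "title": "Ramblings on Experimentation Pitfalls, Part 1",
    "link": "https://eng.lyft.com/ramblings-on-experimentation-pitfalls-dd554ff87c0e",
    "orgs": ["Lyft"],
    "authors": ["timothybrownsf"],
    "terms": {
        "rollout": "A protocol for a staged, test-based launch of a new idea ...",
        "variant": None,
        "control": None,
        "treatment": None,
        "split": None,
        "holdout": "a group of subjects (users) who are in the target population ...",
    },
}


def split_terms(entry):
    """Return (defined, undefined) term names for one source entry."""
    terms = entry.get("terms") or {}
    defined = sorted(name for name, text in terms.items() if text)
    undefined = sorted(name for name, text in terms.items() if not text)
    return defined, undefined


defined, undefined = split_terms(lyft_entry)
print("Defined in this source:", defined)
print("Used without a definition:", undefined)
```

Separating the two groups makes it easy to see which terms a source explains and which it simply assumes the reader knows.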

The more useful option: command-line utility

A faster and more powerful way to query the data is to use the command-line tool. To use it, clone the GitHub repo locally and follow the short instructions in readme.md to install the required Python packages.

The CLI is the Python script abglossary.py. It has two sub-commands: list, which lists all the terms, organizations, or sources in the data file, and query, which filters the data to a specified organization and/or term and sorts the results by any of the output fields.
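The sub-command layout described above can be sketched with Python's standard argparse module. This is an illustration, not the actual contents of abglossary.py. The list and query sub-commands and the -o, -t, and --verbose options come from the usage shown in this post; the function name build_parser, the positional name field, and its three allowed values are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with `list` and `query` sub-commands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="abglossary.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # `list` takes the kind of thing to enumerate (assumed argument name).
    list_cmd = subparsers.add_parser("list", help="list terms, orgs, or sources")
    list_cmd.add_argument("field", choices=["terms", "orgs", "sources"])

    # `query` filters by organization and/or term.
    query_cmd = subparsers.add_parser("query", help="filter the compiled data")
    query_cmd.add_argument("-o", dest="org", help="organization to filter on")
    query_cmd.add_argument("-t", dest="term", help="term to filter on")
    query_cmd.add_argument(
        "--verbose",
        action="store_true",
        help="print results in the original non-tabular format",
    )
    return parser


# Parse an example command line instead of sys.argv so the sketch runs as-is.
args = build_parser().parse_args(["query", "-o", "Google", "-t", "variant"])
print(args.command, args.org, args.term, args.verbose)
```

Giving each sub-command its own parser keeps the options for list and query separate, so query -o makes sense while list -o is rejected.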

Open a terminal with a bash-like shell. To list all organizations with at least one entry in the data file:

$ python abglossary.py list orgs

[
    'Amazon',
    'DoorDash',
    'Facebook',
    'Google',
    'LinkedIn',
    'Lyft',
    'Microsoft',
    'Netflix',
    'Optimizely',
    'Tubi',
    'Uber',
    'VWO'
]

Try listing sources and terms as well. To query all entries for a particular organization, let's say Lyft:

$ python abglossary.py query -o Lyft
ABGlossary results for Lyft

The results are printed in a nice table with the Python package Rich (see note 1). The source title is hyperlinked, so you can open the original document in a browser where possible. To get the query results in the original non-tabular format, use the --verbose flag.

Suppose we want to find how various organizations use a particular term, let's say variant. To get that info:

$ python abglossary.py query -t variant
ABGlossary results for 'variant'

Finally, we can filter by both organization and term to answer the first question in the intro:

$ python abglossary.py query -o Google -t variant

ABGlossary results for Google and 'variant'
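To make the filtering logic concrete, here is a rough sketch of how a combined organization-and-term query could work over entries shaped like the ones in terminology.yaml. The function name query, the row layout, and the case-insensitive matching are our assumptions for illustration, not a description of the real script. The two toy entries use placeholder definitions rather than text from real sources.

```python
# Two toy entries in the shape of the data file. The titles and definitions
# are placeholders, not quotations from real sources.
entries = [
    {
        "title": "Source A",
        "orgs": ["Google"],
        "terms": {"variant": "placeholder definition", "combination": None},
    },
    {
        "title": "Source B",
        "orgs": ["Lyft"],
        "terms": {"variant": None, "holdout": "placeholder definition"},
    },
]


def query(entries, org=None, term=None):
    """Return (org, term, definition, title) rows that match both filters.

    A filter left as None matches everything. Matching is case-insensitive,
    which is an assumption made for this sketch.
    """
    rows = []
    for entry in entries:
        for entry_org in entry.get("orgs", []):
            if org is not None and entry_org.lower() != org.lower():
                continue
            for name, definition in (entry.get("terms") or {}).items():
                if term is not None and name.lower() != term.lower():
                    continue
                rows.append((entry_org, name, definition, entry["title"]))
    # Sort by organization, then term, so the output order is stable.
    return sorted(rows, key=lambda row: (row[0], row[1]))


for row in query(entries, org="Google", term="variant"):
    print(row)
```

Leaving either filter unset reproduces the single-filter queries shown earlier: query(entries, org="Lyft") returns every Lyft term, and query(entries, term="variant") returns the term across all organizations.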

Feedback and submissions

Please feel free to add entries to the raw data by submitting a pull request in the GitHub repo. Bug reports to that repo are welcome as well.

ABGlossary has a modest scope and ambition, but we've found it useful in our reading and writing, and we hope it might help others in a small way as well. If you do find it useful, please let us know; this will help us prioritize a second version with our own taxonomy of key concepts.

Notes

  1. Learning how to use Rich was admittedly one of the reasons to create the ABGlossary CLI.