Online controlled experiments (OCEs) have become popular among digital technology organizations for measuring the impact of their products and services and for guiding business decisions Kohavi et al. (2020); Liu et al. (2020); Thomke (2020). Large tech companies including Google Hohnhold et al. (2015), LinkedIn Xu et al. (2015), and Microsoft Kohavi et al. (2013) report running thousands of experiments on any given day, and there are multiple companies established solely to manage OCEs for other businesses Browne and Swarbrick Jones (2017); Johari et al. (2017). Running OCEs is also considered a key step in the machine learning development lifecycle Bernardi et al. (2019); Zhang et al. (2020).
OCEs are essentially randomized controlled trials run on the Web. The simplest example, commonly known as an A/B test, splits a group of entities (e.g. users of a website) randomly into two groups, where one group is exposed to some treatment (e.g. showing a "free delivery" banner on the website) while the other acts as the control (e.g. seeing the original website without any mention of free delivery). We calculate the decision metric(s) (e.g. the proportion of users who bought something) based on responses from both groups, and compare the metrics using a statistical test to draw causal statements about the treatment.
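As a sketch of the comparison step: the decision metric here is a proportion, so the two groups can be compared with a two-proportion z-test. The counts below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics from a "free delivery" banner A/B test:
# number of users and number of purchasers in each group.
n_c, conv_c = 10_000, 510   # control
n_t, conv_t = 10_000, 590   # treatment

p_c, p_t = conv_c / n_c, conv_t / n_t

# Two-proportion z-test with a pooled variance estimate under H0: p_c == p_t.
p_pool = (conv_c + conv_t) / (n_c + n_t)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = (p_t - p_c) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
```

Here a small `p_value` would be taken as evidence that the banner changed the conversion rate.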
The ability to run experiments on the Web allows one to interact with a large number of subjects within a short time frame and collect a large number of responses. This, together with the scale of experimentation carried out by tech organizations, should lead to a wealth of datasets describing the result of an experiment. However, there are not many publicly available OCE datasets, and we believe they were never systematically reviewed nor categorized. This is in contrast to the machine learning field, which also enjoyed its application boom in the past decade yet already has established archives and detailed categorizations for its datasets Dua and Graff (2019); Vanschoren et al. (2013).
We argue the lack of relevant datasets arising from real experiments hinders the further development of OCE methods (e.g. new statistical tests, bias correction, and variance reduction methods). Many newly proposed statistical tests rely on simulated data that imposes restrictive distributional assumptions and thus may not be representative of real-world scenarios. Moreover, it is difficult to understand how methods differ from each other, and to assess their relative strengths and weaknesses, without a common dataset on which to compare them.
To address this problem, we present the first ever survey and taxonomy for OCE datasets. Our survey identified 13 datasets, including standalone experiment archives, accompanying datasets from scholarly works, and demo datasets from online courses on the design and analysis of experiments. We also categorize these datasets based on dimensions such as the number of experiments each dataset contains, how granular each data point is time-wise and subject-wise, and whether it includes results from real experiment(s).
The taxonomy enables us to engage in a discussion on the data requirements for an experiment by systematically mapping out which data dimension is required for which statistical test and/or for learning the hyperparameter(s) associated with the test. We also recognize that in practice data are often used for purposes beyond those for which they were originally collected Kerr (1998). Hence, we posit the mapping is equally useful in allowing one to understand the options available when choosing statistical tests given the format of the data one possesses. Together with the survey, the taxonomy helps us to identify what types of datasets are required for commonly used statistical tests, yet are missing from the public domain.
One of the gaps the survey and taxonomy identify is datasets that can support the design and running of experiments with adaptive stopping (a.k.a. continuous monitoring / optional stopping). We motivate their use below. Traditionally, experimenters analyze experiments using Null Hypothesis Statistical Tests (NHST, e.g. a Student's $t$-test). These tests require one to calculate and commit to a required sample size, based on some expected treatment effect size, prior to starting the experiment. Making extra decisions during the experiment, be it stopping the experiment early due to seeing favorable results or extending the experiment as it teeters "on the edge of statistical significance" Munroe (2015), is discouraged, as doing so risks more false discoveries than intended Greenland et al. (2016); Munroe (2011).
Clearly, the restrictions above are incompatible with modern decision-making processes. Businesses operating online are incentivized to deploy any beneficial changes, and roll back any damaging ones, as quickly as possible. Using the "free delivery" banner example above, the business may have calculated that they require four weeks to observe enough users based on an expected 1% change in the decision metric. If the experiment shows, two weeks in, that the banner is leading to a 2% improvement, it would be unwise not to deploy the banner to all users simply because the experiment needs to run for another two weeks. Likewise, if the banner is shown to be leading to a 2% loss, it makes perfect sense to immediately terminate the experiment and roll back the banner to stem further losses.
As a result, more experimenters are moving away from NHST and adopting adaptive stopping techniques. Experiments with adaptive stopping allow one to decide when to stop an experiment (i.e. stopping it earlier or prolonging it) based on the sample responses observed so far without compromising the statistical validity of false positive/discovery rate control. To encourage further development in this area, both in methods and data, we release the ASOS Digital Experiments Dataset, which contains daily checkpoints of decision metrics from multiple, real OCEs run on the global online fashion retail platform.
The dataset design is guided by the requirements identified by the mapping between the taxonomy and statistical tests, and to the best of our knowledge, it is the first public dataset that can support the end-to-end design and running of online experiments with adaptive stopping. We demonstrate it can indeed do so by (1) running a sequential test and a Bayesian hypothesis test on all the experiments in the dataset, and (2) estimating the value of the hyperparameters associated with the tests. While the notion of ground truth does not exist in real OCEs, we show the dataset can also act as a quasi-benchmark for statistical tests by comparing the results from the tests above with those of a $t$-test.
To summarize, our contributions are:
(Section 4) We map the relationship between the taxonomy and statistical tests commonly used in experiments by identifying the minimally sufficient set of statistics and dimensions required by each test. The mapping, which also applies to offline and non-randomized controlled experiments, enables experimenters to quickly identify the data collection requirements for their experiment design (and conversely the test options available given the data at hand); and
2 A Taxonomy for Online Controlled Experiment Datasets
We begin by presenting a taxonomy on OCE datasets, which is necessary to characterize and understand the results of the survey. To the best of our knowledge, there are no surveys nor taxonomies specifically on this topic prior to this work. While there is a large volume of work concerning the categorization of datasets in machine learning Dua and Graff (2019); Vanschoren et al. (2013), of research work on online randomized controlled experiment methods Auer and Felderer (2018); Auer et al. (2021); Fabijan et al. (2018); Ros and Runeson (2018), and of general experiment design Hopewell et al. (2010); Liu and McCoy (2020), our search on Google Scholar and Semantic Scholar using combinations of the keywords "online controlled experiment"/"A/B test", "dataset", and "taxonomy"/"categorization" yields no relevant results.
The taxonomy focuses on the following four main dimensions:
Experiment count A dataset can contain the data collected from a single experiment or multiple experiments. Results from a single experiment are useful for demonstrating how a test works, though any learning should ideally involve multiple experiments. Two closely related but relatively minor dimensions are the variant count (number of control/treatment groups in the experiment) and the metric count (number of performance metrics the experiment is tracking). Having an experiment with multiple variants and metrics enables the demonstration of methods such as false discovery rate control procedures Benjamini and Hochberg (1995) and learning the correlation structure within an experiment.
Response granularity Depending on the experiment analysis requirements and constraints imposed by the online experimentation platform, the dataset may contain data aggregated to various levels. Consider the “free delivery” banner example in Section 1, where the website users are randomly allocated to the treatment (showing the banner) and control (not showing the banner) groups to understand whether the banner changes the proportion of users who bought something. In this case, each individual user is considered a randomization unit Kohavi et al. (2020).
A dataset may contain, for each experiment, only summary statistics at the group level, e.g. the proportion of users who have bought something in the control and treatment groups respectively. It can also record one response per randomization unit, with each row containing the user ID and whether the user bought something. An even more detailed dataset may record activity logs at a sub-randomization-unit level, with each row containing information about a particular page view from a particular user.
Time granularity An experiment can last anywhere from a week to many months Kohavi et al. (2020), which provides many possibilities in recording the result. A dataset can opt to record the overall result only, showing the end state of an experiment. It may or may not come with a timestamp for each randomization unit or experiment if there are multiple instances of them. It can also record intermediate checkpoints for the decision metrics, ideally at regular intervals such as daily or hourly. These checkpoints can either be snapshots of each interval (recording activities between time $t_0$ and $t_1$, time $t_1$ and $t_2$, etc.) or cumulative from the start of the experiment (recording activities between time $t_0$ and $t_1$, time $t_0$ and $t_2$, etc.).
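When checkpoints are recorded at every interval, the two styles carry the same information. A short sketch (with hypothetical daily figures) shows how snapshot checkpoints convert to cumulative ones:

```python
from itertools import accumulate

# Hypothetical daily snapshot checkpoints of (users, purchasers) for one group.
snapshot_users = [1000, 1100, 900, 1200]
snapshot_conv = [52, 61, 47, 66]

# Cumulative-from-start checkpoints: each entry covers time t_0 up to t_k.
cum_users = list(accumulate(snapshot_users))
cum_conv = list(accumulate(snapshot_conv))

# The decision metric (conversion rate) at each cumulative checkpoint.
cum_rate = [c / n for c, n in zip(cum_conv, cum_users)]
```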
Syntheticity A dataset can record data generated from a real process. It can also be synthetic—generated via simulations with distributional assumptions applied. A dataset can also be semi-synthetic if it is generated from a real-life process and subsequently augmented with synthetic data.
Note we can also describe datasets arising from any experiments (including offline and non-randomized controlled experiments) using these four dimensions. We will discuss in Section 4 how these dimensions map to common statistical tests used in online experimentation.
In addition, we also record the application domain, target demographics, and the temporal coverage of the experiment(s) featured in a dataset. In an age when data are often reused, it is crucial for one to understand the underlying context, and that learnings from a dataset created under a certain context may not translate to another context. We also see the surfacing of such context as a way to promote considerations in fairness and transparency for experimenters as more experiment datasets become available Jiang et al. (2019). For example, having target demographics information on a meta-level helps experimenters to identify who were involved, or perhaps more importantly, who were not involved in experiments and could be adversely impacted by a treatment that is effectively untested.
Finally, two datasets can also differ in their medium of documentation and the presence/absence of a data management / long-term preservation plan. The latter includes the hosting location, the presence/absence of a DOI, and the type of license. We record these attributes for the datasets surveyed below for completeness.
3 Public Online Controlled Experiment Datasets
Here we discuss our approach to produce the first ever survey on OCE datasets and present its results. The survey is compiled via two search directions, which we describe below. For both directions, we conduct a first round search in May 2021, with follow up rounds in August and October 2021 to ensure we have the most updated results.
We first search on the vanilla Google search engine using the keywords “Online controlled experiment "dataset"”, “A/B test "dataset"”, and “Multivariate test "dataset"”. For each keyword, we inspect the first 10 pages of the search result (top 100 results) for scholarly articles, web pages, blog posts, and documents that may host and/or describe a publicly available OCE dataset. The search term “dataset” is in double quotes to limit the search results to those with explicit mention of dataset(s). We also search on specialist data search engines/hosts, namely on Google Dataset Search (GDS) and Kaggle, using the keywords “Online controlled experiment(s)” and “A/B test(s)”. We inspect the metadata and description for all the results returned (except for GDS, where we inspect the first 100 results for “A/B test(s)”) for relevant datasets as defined below.[2]

[2] Searching for the keyword “Online controlled experiment” on GDS and Kaggle returned 42 and 7 results respectively, and that for “A/B test” returned “100+” and 286 results respectively. Curiously, replacing “experiment” and “test” in the keywords with their plural forms changes the number of results, with the former returning 6 and 10 results on GDS and Kaggle respectively, and the latter returning “100+” and 303 results respectively.
A dataset must record the result arising from a randomized controlled experiment run online to be included in the survey. The criterion excludes experimental data collected from offline experiments, e.g. those in agriculture Wright (2020), medicine Doyle et al. (2018), and economics Dupas et al. (2016). It also excludes datasets used to perform quasi-experiments and observational studies, e.g. the LaLonde dataset used in econometrics Dehejia and Wahba (1999) and datasets constructed for uplift modeling tasks Diemert et al. (2018); Hillstrom (2008).[3]

[3] Uplift modeling (UM) tasks for online applications often start with an OCE Radcliffe (2007), and thus we can consider UM datasets as OCE datasets with extra randomization-unit-level features. The nature of the tasks is different though: OCEs concern validating the average treatment effect across the population using a statistical test, whereas UM concerns modeling the conditional average treatment effect for each individual user, making it more a general causal inference task that is outside the scope of this survey.
The result is presented in Table 1. We place the 13 OCE datasets identified in this exercise along the four taxonomy dimensions defined in Section 2 and record the additional features. These datasets include two standalone archives for online media and education experiments respectively Matias et al. (2021); Selent et al. (2016), plus two accompanying datasets for peer-reviewed research articles Tenório et al. (2017); Young (2014). There are also tens of Kaggle datasets, blog posts, and code repositories that describe and/or duplicate one of the five example datasets used in five different massive open online courses on online controlled experiment design and analysis Bååth and Romero; Campos et al.; Grimes et al.; Grossman; Udacity. Finally, we identify three standalone datasets hosted on Kaggle with relatively light documentation Ay (2020); Emmanuel (2020); Klimonova (2020).
From the table, we observe a number of gaps in OCE dataset availability, the most obvious one being the lack of datasets that record responses at a sub-randomization unit level. In the sections below, we will identify more of these gaps and discuss their implications for OCE analysis.
4 Matching Dataset Taxonomy with Statistical Tests
Specifying the data requirements (or structure) and performing statistical tests are perhaps two of the most common tasks carried out by data scientists. However, the link between the two processes is seldom mapped out explicitly. It is all too common to consider from scratch the question “I need to run this statistical test, how should I format my dataset?” (or more controversially, “I have this dataset, what statistical tests can I run?” Kerr (1998)) for every new project/application, despite the list of possible dataset dimensions and statistical tests remaining largely the same.
We aim to speed up the process above by describing what summary statistics are required to perform common statistical tests in OCE, and linking the statistics back to the taxonomy dimensions defined in Section 2. The exercise is similar to identifying the sufficient statistic(s) for a statistical model Fisher (1922), though here the identification is done for the encapsulating statistical inference procedure, with a practical focus on data dimension requirements. We do so by stating the formulae used to calculate the corresponding effect sizes and test statistics and observing the summary statistics they require in common. This general approach enables one to apply the resultant mapping to any experiments that involve a two-sample statistical test, including offline experiments and experiments without a randomized control. For brevity, we refrain from discussing the full model assumptions as well as their applicability; instead, we point readers to the relevant work in the literature.
4.1 Effect size and Welch's $t$-test
We consider a two-sample setting and let $X_1, \ldots, X_{n_X}$ and $Y_1, \ldots, Y_{n_Y}$ be i.i.d. samples from the distributions $\mathcal{X}$ and $\mathcal{Y}$ respectively. We assume the first two moments exist for the distributions $\mathcal{X}$ and $\mathcal{Y}$, with their means and variances denoted $\mu_X, \mu_Y$ and $\sigma^2_X, \sigma^2_Y$ respectively. We also denote the sample means and variances of the two samples $\bar{X}, \bar{Y}$ and $s^2_X, s^2_Y$ respectively.

Often we are interested in the difference between the means of the two distributions, $\mu_X - \mu_Y$, commonly known as the effect size (of the difference in means) or the average treatment effect. A standardized effect size enables us to compare the difference across many experiments and is thus useful in meta-analyses. One commonly used effect size is Cohen's $d$, defined as the difference in sample means divided by the pooled sample standard deviation Cohen (1988):

$$d = \frac{\bar{X} - \bar{Y}}{s_{\text{pooled}}}, \quad \text{where} \quad s_{\text{pooled}} = \sqrt{\frac{(n_X - 1)\,s^2_X + (n_Y - 1)\,s^2_Y}{n_X + n_Y - 2}}. \tag{1}$$
We are also interested in whether the samples carry sufficient evidence to indicate $\mu_X - \mu_Y$ is different from a prescribed value $\theta$. This can be done via a hypothesis test with $H_0: \mu_X - \mu_Y = \theta$ and $H_1: \mu_X - \mu_Y \neq \theta$.[4] One of the most common statistical tests used in online controlled experiments is Welch's $t$-test Welch (1947), in which we calculate the test statistic as follows:

$$t = \frac{\bar{X} - \bar{Y} - \theta}{\sqrt{s^2_X / n_X + s^2_Y / n_Y}}.$$

[4] There are other ways to specify the hypotheses, such as those in superiority and non-inferiority tests Committee for Proprietary Medicinal Products (2001), though they are unlikely to change the data requirements as long as the hypotheses remain anchored on $\mu_X - \mu_Y$.
We observe that in order to calculate the two stated quantities above, we require six quantities: two means ($\bar{X}$, $\bar{Y}$), two (sample) variances ($s^2_X$, $s^2_Y$), and two counts ($n_X$, $n_Y$). We call these quantities Dimension Zero (D0) quantities as they are the bare minimum required to run a statistical test—these quantities will be expanded along the taxonomy dimensions defined in Section 2.
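Both Cohen's $d$ and Welch's $t$ can be computed from the six D0 quantities alone; a minimal sketch with hypothetical numbers:

```python
from math import sqrt

def cohens_d(mean_x, mean_y, var_x, var_y, n_x, n_y):
    """Cohen's d: difference in sample means over the pooled standard deviation."""
    s_pooled = sqrt(((n_x - 1) * var_x + (n_y - 1) * var_y) / (n_x + n_y - 2))
    return (mean_x - mean_y) / s_pooled

def welch_t(mean_x, mean_y, var_x, var_y, n_x, n_y, theta=0.0):
    """Welch's t statistic for H0: mu_x - mu_y = theta."""
    return (mean_x - mean_y - theta) / sqrt(var_x / n_x + var_y / n_y)

# Hypothetical D0 quantities for the two groups (means, variances, counts).
d = cohens_d(5.2, 5.0, 4.0, 3.8, 1_000, 1_000)
t = welch_t(5.2, 5.0, 4.0, 3.8, 1_000, 1_000)
```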
Cluster randomization / dependent data The sample variance estimates ($s^2_X$, $s^2_Y$) may be biased in the case where cluster randomization is involved. Using again the "free delivery" banner example, instead of randomly assigning each individual user to the control and treatment groups, the business may randomly assign postcodes to the two groups, with all users from the same postcode seeing the same version of the website. In this case, user responses may become correlated, which violates the independence assumptions in statistical tests. Common workarounds, including the use of the bootstrap Bakshy and Eckles (2013) and the Delta method Deng et al. (2018), generally require access to sub-randomization-unit responses.
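A cluster bootstrap is one such workaround: it resamples whole clusters rather than individual responses, preserving the within-cluster correlation. A minimal sketch under our own naming and data assumptions:

```python
import random
from statistics import mean

def cluster_bootstrap_se(clusters, n_boot=2000, seed=0):
    """Standard error of the mean response under cluster randomization,
    estimated by resampling whole clusters (e.g. postcodes) with replacement.
    `clusters` is a list of lists of sub-randomization-unit responses."""
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_boot):
        resampled = [rng.choice(clusters) for _ in clusters]
        responses = [r for cluster in resampled for r in cluster]
        boot_means.append(mean(responses))
    m = mean(boot_means)
    # Sample standard deviation of the bootstrap means.
    return (sum((b - m) ** 2 for b in boot_means) / (n_boot - 1)) ** 0.5
```

A naive i.i.d. variance estimate would understate the standard error here whenever responses within a cluster are positively correlated.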
4.2 Experiments with adaptive stopping
As discussed in Section 1, experiments with adaptive stopping are getting increasingly popular among the OCE community. Here we motivate the data requirements for statistical tests in this domain by looking at the quantities required to calculate the test statistics for a mixture Sequential Probability Ratio Test (mSPRT) Johari et al. (2017) and a Bayesian hypothesis test using the Bayes factor Deng et al. (2016), two popular approaches in online experimentation. There are many other tests that support adaptive stopping Miller (2010); Wald (1945), though their data requirements, in terms of the dimensions defined in Section 2, should be largely identical.
We first observe that running a mSPRT with a normal mixing distribution $\mathcal{N}(\theta_0, \tau^2)$ involves calculating the following test statistic upon observing the first $n$ $X$s and $Y$s (see Eq. (11) in (Johari et al., 2017)):

$$\Lambda_n = \sqrt{\frac{V_n}{V_n + \tau^2}} \exp\left(\frac{\tau^2 \left((\bar{Y}_n - \bar{X}_n) - \theta_0\right)^2}{2\, V_n (V_n + \tau^2)}\right), \quad V_n = \frac{\sigma^2_X + \sigma^2_Y}{n},$$

where $\bar{X}_n$ and $\bar{Y}_n$ represent the sample means of the $X$s and $Y$s up to sample $n$ respectively, and $\tau^2$ is a hyperparameter to be specified or learned from data.
For the Bayesian hypothesis test, we compute

$$\hat{\delta} = \frac{\bar{Y} - \bar{X}}{s_{\text{pooled}}} \quad \text{and} \quad n_e = \frac{n_X\, n_Y}{n_X + n_Y},$$

where $\hat{\delta}$ and $n_e$ are the effect size (standardized by the pooled variance) and the effective sample size of the test respectively. In OCE it is common to appeal to the central limit theorem and assume a normal likelihood for the effect size, i.e. $\hat{\delta} \mid \delta \sim \mathcal{N}(\delta, 1/n_e)$. We then compare the hypotheses $H_0: \delta = 0$ and $H_1: \delta \neq 0$ by calculating the Bayes factor (Kass and Raftery, 1995):

$$\mathrm{BF}_{01} = \frac{p(\hat{\delta} \mid H_0)}{p(\hat{\delta} \mid H_1)} = \frac{\phi(\hat{\delta};\, 0,\, 1/n_e)}{\phi(\hat{\delta};\, 0,\, 1/n_e + \tilde{\tau}^2)},$$

where $\phi(\cdot\,; \mu, \sigma^2)$ is the PDF of a normal distribution with mean $\mu$ and variance $\sigma^2$, and $\tilde{\tau}^2$ is a hyperparameter that we specify or learn from data.
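A sketch of the Bayes factor under a normal likelihood for the standardized effect size and a normal prior on the effect under $H_1$ (function names are our own):

```python
from math import sqrt
from statistics import NormalDist

def bayes_factor_01(delta_hat, n_e, tau_tilde_sq):
    """Bayes factor in favour of H0: delta = 0 over H1: delta != 0,
    assuming delta_hat ~ N(delta, 1/n_e) and delta ~ N(0, tau_tilde_sq)
    under H1, so the marginal under H1 is N(0, 1/n_e + tau_tilde_sq)."""
    pdf_h0 = NormalDist(0.0, sqrt(1 / n_e)).pdf(delta_hat)
    pdf_h1 = NormalDist(0.0, sqrt(1 / n_e + tau_tilde_sq)).pdf(delta_hat)
    return pdf_h0 / pdf_h1

def posterior_h0(bf01, prior_h0):
    """Posterior belief in H0 given the Bayes factor and the prior p(H0)."""
    post_odds = bf01 * prior_h0 / (1 - prior_h0)
    return post_odds / (1 + post_odds)

bf = bayes_factor_01(0.0, 1000, 0.01)   # zero observed effect favours H0
post = posterior_h0(bf, 0.5)
```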
During an experiment with adaptive stopping, we calculate the test statistics stated above many times for different values of $n$. This means a dataset can only support the running of such experiments if it contains intermediate checkpoints for the counts ($n_X$, $n_Y$) and the means ($\bar{X}_n$, $\bar{Y}_n$), ideally cumulative from the start of the experiment. Often one also requires the variances at the same time points (see below). The only exception to the dimensional requirement above is the case where the dataset contains responses at a randomization unit or finer level of granularity, and despite recording the overall results only, has a timestamp per randomization unit. Under this special case, we will still be able to construct the cumulative means ($\bar{X}_n$, $\bar{Y}_n$) for all relevant values of $n$ by ordering the randomization units by their associated timestamps.
Learning the effect size distribution (hyper)parameters The two tests introduced above feature some hyperparameters ($\tau^2$ and $\tilde{\tau}^2$) that have to be specified or learned from data. These parameters characterize the prior belief of the effect size distribution, which will be the most effective if it "matches the distribution of true effects across the experiments a user runs" Johari et al. (2017). Common parameter estimation procedures Adams (2018); Azevedo et al. (2019); Guo et al. (2020) require results from multiple related experiments.
Estimating the response variance In the equations above, the response variances of the two samples, $\sigma^2_X$ and $\sigma^2_Y$, are assumed to be known. In practice we often use the plug-in empirical estimates $s^2_{X,n}$ and $s^2_{Y,n}$—the sample variances of the first $n$ $X$s and first $n$ $Y$s respectively—and thus the data dimensional requirement is identical to that of the counts and means as discussed above. In the case where the plug-in estimate may be biased due to dependent data, we will also require a sub-randomization-unit response granularity (see Section 4.1).
4.3 Non-parametric tests
We also briefly discuss the data requirements for non-parametric tests, where we do not impose any distributional assumptions on the responses but compare the hypotheses $H_0: \mathcal{X} = \mathcal{Y}$ and $H_1: \mathcal{X} \neq \mathcal{Y}$, where we recall $\mathcal{X}$ and $\mathcal{Y}$ are the distributions of the two samples.
One of the most commonly used (frequentist) non-parametric tests in OCE, the Mann–Whitney $U$-test Mann and Whitney (1947), calculates the following test statistic:

$$U = \sum_{i=1}^{n_X} \sum_{j=1}^{n_Y} \left( \mathbb{1}[X_i > Y_j] + \tfrac{1}{2}\, \mathbb{1}[X_i = Y_j] \right).$$
While a rank-based method is available for large $n_X$ and $n_Y$, both methods require knowledge of all the $X_i$ and $Y_j$. This requirement is the same for other non-parametric tests, e.g. the Wilcoxon signed-rank Wilcoxon (1945), Kruskal–Wallis Kruskal and Wallis (1952), and Kolmogorov–Smirnov Dodge (2008) tests. This suggests a dataset can only support a non-parametric test if it provides responses at least at a randomization unit level.
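A direct pairwise sketch of the $U$ statistic makes the data requirement explicit: every individual response must be available, not just summary statistics.

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic by direct pairwise comparison:
    count 1 for each pair with x > y and 1/2 for each tie."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

u = mann_whitney_u([3, 5, 7], [2, 4, 6])
```

The O(n_X n_Y) double loop is for exposition; in practice the equivalent rank-based computation scales far better.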
We conclude by showing how we can combine the individual data requirements above to obtain the requirement to design and/or run experiments for more complicated statistical tests. This is possible due to the orthogonal design of the taxonomy dimensions. Consider an experiment with adaptive stopping using Bayesian non-parametric tests (e.g. with a Pólya Tree prior Chen and Hanson (2014); Holmes et al. (2015)). It involves a non-parametric test and hence requires responses at a randomization unit level. It computes multiple Bayes factors for adaptive stopping and hence requires intermediate checkpoints for the responses (or a timestamp for each randomization unit). Finally, to learn the hyperparameters of the Pólya Tree prior it requires multiple related experiments. The substantial data requirement along 3+ dimensions perhaps explains the lack of relevant OCE datasets and the tendency for experimenters to use simpler statistical tests for the day-to-day design and/or running of OCEs.
5 A Novel Dataset for Experiments with Adaptive Stopping
We finally introduce the ASOS Digital Experiments Dataset, which we believe is the first public dataset that supports the end-to-end design and running of OCEs with adaptive stopping. We motivate why this is the case, provide a light description of the dataset (and a link to the more detailed accompanying datasheet),[5] and showcase the capabilities of the dataset via a series of experiments. We also discuss the ethical implications of releasing this dataset.
Recall from Section 4.2 that in order to support the end-to-end design and running of experiments with adaptive stopping, we require a dataset that (1) includes multiple related experiments; (2) is real, so that any parameters learned are reflective of the real-world scenario; and either (3a) contains intermediate checkpoints for the summary statistics during each experiment (i.e. time-granular), or (3b) contains responses at a randomization unit granularity with a timestamp for each randomization unit (i.e. response-granular with timestamps).
None of the datasets surveyed in Section 3 meet all three criteria. While the Upworthy Matias et al. (2021), ASSISTments Selent et al. (2016), and MeuTutor Tenório et al. (2017) datasets meet the first two criteria, they all fail to meet the third.[5] The Udacity Free Trial Screener Experiment dataset meets the last two criteria by having results from a real experiment with daily snapshots of the decision metrics (and hence time-granular), which supports the running of an experiment with adaptive stopping. However, the dataset only contains a single experiment, which is not helpful for learning the effect size distribution (the design).

[5] All three report the overall results only and hence are not time-granular. The Upworthy dataset reports group-level statistics and hence is not response-granular. The ASSISTments and MeuTutor datasets are response-granular but lack the timestamps needed to order the samples.
The ASOS Digital Experiments Dataset contains results from OCEs run by a business unit within ASOS.com, a global online fashion retail platform. In terms of the taxonomy defined in Section 2, the dataset contains multiple (78) real experiments, with two to five variants in each experiment and four decision metrics based on binary, count, and real-valued responses. The results are aggregated at the group level, with daily or 12-hourly checkpoints of the metric values cumulative from the start of the experiment. The dataset design meets all three criteria stated above and hence differentiates itself from other public datasets.
We provide readers with an accompanying datasheet (based on Gebru et al. (2020)) containing further information about the dataset. We also host the dataset on the Open Science Framework to ensure it is easily discoverable and can be preserved long-term.[1] It is worth noting that the dataset is released with the intent to support development in the statistical methods required to run OCEs. The experiment results shown in the dataset are not representative of ASOS.com's overall business operations, product development, or experimentation program operations, and no conclusion of such should be drawn from this dataset.
5.1 Potential use cases
Meta-analyses The multi-experiment nature of the dataset enables one to perform meta-analyses. A simple example is to characterize the distribution of $p$-values (under a Welch's $t$-test) across all experiments (see Figure 1). We observe that roughly a quarter of the experiments in this dataset attain a statistically significant result, and attribute this to the fact that what we test in OCEs is often guided by what domain experts think may have an impact. That said, we invite external validation on whether there is evidence for data dredging using e.g. Miller and Hosanagar (2020).
Design and running of experiments with adaptive stopping We then demonstrate the dataset can indeed support OCEs with adaptive stopping by performing a mixture Sequential Probability Ratio Test (mSPRT) and a Bayesian hypothesis test via the Bayes factor for each experiment and metric. This requires learning the hyperparameters $\tau^2$ and $\tilde{\tau}^2$. We learn, for each metric, a naïve estimate for $\tau^2$ by collating the end-of-experiment effect size estimates (see (4)) across experiments and taking their sample variance. This yields the estimates 1.30e-05, 1.07e-05, 6.49e-06, and 5.93e-06 for the four metrics respectively. For $\tilde{\tau}^2$, we learn near-identical naïve estimates by collating the values of Cohen's $d$ (see (1)) instead. However, as $\tau^2$ captures the spread of unstandardized effect sizes, we specify in each test a $\tau^2$ scaled by $s^2_n$, the sample variance of all responses up to the $n$th observation in that particular experiment. The Bayesian tests also require a prior belief in the null hypothesis being true ($p(H_0)$)—we set it based on what we observed in the $t$-tests above.
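The naïve $\tau^2$ estimate can be sketched as follows (the effect sizes below are hypothetical, not taken from the dataset):

```python
from statistics import variance

# Hypothetical end-of-experiment unstandardized effect size estimates
# (mean_y - mean_x) for one metric, collated across related experiments.
end_effects = [0.004, -0.002, 0.001, 0.006, -0.003, 0.000, 0.002, -0.001]

# Naive estimate of the mixing variance tau^2: the sample variance of the
# observed effects across experiments (ignoring each estimate's own noise,
# hence "naive"—it overstates the spread of true effects).
tau_sq_hat = variance(end_effects)
```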
We then calculate the $p$-value in the mSPRT and the posterior belief in the null hypothesis ($p(H_0 \mid \text{data})$) in the Bayesian test for each experiment and metric at each daily/12-hourly checkpoint, following the procedures stated in Johari et al. (2017) and in Deng et al. (2016); Kass and Raftery (1995) respectively. We plot the results for five experiments selected at random in Figure 2, which shows the $p$-value for a mSPRT is monotonically non-increasing, while the posterior belief for a Bayesian test can fluctuate depending on the effect size observed so far.
A quasi-benchmark for adaptive stopping methods Real online controlled experiments, unlike machine learning tasks, generally do not have a notion of ground truth. The use of quasi-ground truth enables the comparison between two hyperparameter settings of the same adaptive stopping method, or between two adaptive stopping methods. Taking the significant / not significant verdict from a Welch's $t$-test at the end of an experiment as the quasi-ground truth, we can compare it to the significant / not significant verdict of a mSPRT at different stages of individual experiments. This yields many "confusion matrices" over different stages of individual experiments, where a "Type I error" corresponds to cases where the Welch's $t$-test gives a not significant result while the mSPRT reports a significant result; a confusion matrix for the end of each experiment can be seen in Table 2. As the dataset was collected without early stopping, it allows us to perform sensitivity analysis and optimization on the hyperparameters of the mSPRT, under what can be construed as a "precision-recall" tradeoff for statistically significant treatments.
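A sketch of building such a confusion matrix from per-experiment verdicts (the verdicts below are hypothetical):

```python
from collections import Counter

def confusion(welch_sig, msprt_sig):
    """Cross-tabulate end-of-experiment Welch's t-test verdicts (the
    quasi-ground truth) against mSPRT verdicts across experiments."""
    return Counter(zip(welch_sig, msprt_sig))

# Hypothetical per-experiment verdicts (True = significant).
welch = [True, True, False, False, True, False]
msprt = [True, False, False, True, True, False]
cm = confusion(welch, msprt)

# cm[(False, True)] counts the "Type I error"-like cases: the t-test is not
# significant but the mSPRT reports a significant result.
```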
Other use cases The time series nature of this dataset enables one to detect bias (of the estimator) across time, e.g. bias caused by concept drift or feedback loops. In the context of OCEs, Chen et al. (2019) described a number of methods for detecting invalid experiments over time that could be run on this dataset. Moreover, being both a multi-experiment and a time series dataset, it also enables one to learn the correlation structure across experiments, decision metrics, and time Pouget-Abadie et al. (2019); Wang et al. (2020).
5.2 Ethical considerations
We finally discuss the ethical implications of releasing the dataset, touching on data protection and anonymization, potential misuses, and the ethical considerations for running OCEs in general.
[Table 2: confusion matrix of mSPRT verdicts (significant / not significant) against end-of-experiment Welch’s t-test verdicts (significant / not significant).]
Data protection and anonymization The dataset records the aggregated activities of hundreds of thousands to millions of website users, collected for business measurement purposes, and hence it is impossible to identify any particular user. Moreover, to minimize the risk of disclosing business-sensitive information, all experiment context is either removed or anonymized, such that one should not be able to tell who was in an experiment, when it was run, what treatment it involved, or what decision metrics were used. We refer readers to the accompanying datasheet for further details in this area.
Potential misuses An OCE dataset, no matter how thoroughly anonymized, reflects the behavior of its participants in a particular application domain at a particular time. We urge potential users of this dataset to exercise caution when attempting to generalize its learnings. It is important to distinguish these learnings from the statistical methods and processes demonstrated on the dataset. We believe the latter are generalizable, i.e. they can be applied to other datasets with similar data dimensions regardless of those datasets’ application domain, target demographics, and temporal coverage, and we appeal to potential users of the dataset to focus on them.
One example of over-generalizing the learnings is using this dataset as a full performance benchmark. As discussed above, this dataset does not have a notion of ground truth, and any quasi-ground truth constructed from it is itself a source of bias to estimators. Thus, comparisons between experiment designs need to be considered at a theoretical level Liu and McCoy (2020). Another example would be directly applying hyperparameter value(s) obtained by training a model on this dataset to another dataset. While this may work for similar application domains, the less similar the domains are, the less likely the learned hyperparameters are to transfer, risking bias both in the estimator and in fairness.
Running OCEs in general The dataset is released with the aim of supporting experiments with adaptive stopping, which enables a faster experimentation cycle. As we run more OCEs, which are ultimately human subjects research, ethical concerns naturally mount. We reiterate the importance of the following three principles when designing and running experiments National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, United States (1978); Office for Human Research Protections, U.S. Department of Health and Human Services et al. (2018): respect for persons, beneficence (properly assess and balance the risks and benefits), and justice (ensure participants are not exploited). We refer readers to Chapter 9 of Kohavi et al. (2020) and its references for further discussion in this area.
Online controlled experiments (OCEs) are a powerful tool for online organizations to assess the impact of their digital products and services. To safeguard future methodological development in the area, it is vital to have access to, and a systematic understanding of, relevant datasets arising from real experiments. We described the results of the first survey of publicly available OCE datasets, and provided a dimensional taxonomy that links data collection and statistical test requirements. We also released the first dataset that can support OCEs with adaptive stopping, whose design is grounded in the theoretical links between the taxonomy and the statistical tests. Via extensive experiments, we showed that the dataset is capable of addressing the identified gap in the literature.
Our work surveying, categorizing, and enriching publicly available OCE datasets is just the beginning, and we invite the community to join in the effort. As discussed above, we have yet to see a dataset that can support methods dealing with data correlated due to cluster randomization, or the end-to-end design and running of experiments with adaptive stopping using Bayesian non-parametric tests. We also see ample opportunity to generalize the survey to cover datasets arising from uplift modeling tasks, quasi-experiments, and observational studies. Finally, as the area matures, we can further expand the taxonomy, which already supports datasets from all the surveyed experiments, with extra dimensions (e.g. the number of features, to support stratification, control variate, and uplift modeling methods).
CHBL is part-funded by the EPSRC CDT in Modern Statistics and Statistical Machine Learning at Imperial College London and University of Oxford (StatML.IO) and ASOS.com. The authors thank their colleagues, participants in CODE@MIT 2021, and the anonymous reviewers for suggesting many improvements to the original manuscript.
- Empirical Bayesian estimation of treatment effects. SSRN Electronic Journal.
- Analyze AB test results. Code repository.
- Current state of research on continuous experimentation: a systematic mapping study. In 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 335–344.
- Controlled experimentation in continuous experimentation: knowledge and challenges. Information and Software Technology 134, pp. 106551.
- Empirical Bayes estimation of treatment effects with many A/B tests: an overview. AEA Papers and Proceedings 109, pp. 43–47.
- Uncertainty in online experiments with dependent data: an evaluation of bootstrap methods. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, New York, NY, USA, pp. 1303–1311.
- Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57 (1), pp. 289–300.
- 150 successful machine learning models: 6 lessons learned at Booking.com. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, New York, NY, USA, pp. 1743–1751.
- What works in e-commerce - a meta-analysis of 6700 online experiments. White paper: http://www.qubit.com/wp-content/uploads/2017/12/qubit-research-meta-analysis.pdf
- How A/B tests could go wrong: automatic diagnosis of invalid online experiments. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, New York, NY, USA, pp. 501–509.
- A/B tests project — Udacity DAND P2. Blog post.
- Bayesian nonparametric k-sample tests for censored and uncensored data. Computational Statistics & Data Analysis 71, pp. 335–346.
- Statistical power analysis for the behavioral sciences. Taylor & Francis.
- Points to consider on switching between superiority and non-inferiority. British Journal of Clinical Pharmacology 52 (3), pp. 223–228.
- Causal effects in nonexperimental studies: reevaluating the evaluation of training programs. Journal of the American Statistical Association 94 (448), pp. 1053–1062.
- Applying the delta method in metric analytics: a practical guide with novel ideas. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, New York, NY, USA, pp. 233–242.
- Continuous monitoring of A/B tests without pain: optional stopping in Bayesian testing. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 243–252.
- Objective Bayesian two sample hypothesis testing for online controlled experiments. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, New York, NY, USA, pp. 923–928.
- A large scale benchmark for uplift modeling. In 2018 AdKDD & TargetAd Workshop (in conjunction with KDD ’18).
- Kolmogorov–Smirnov Test. In The Concise Encyclopedia of Statistics, pp. 283–287.
- UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
- Online controlled experimentation at scale: an empirical survey on the current state of A/B testing. In 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 68–72.
- On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A 222 (594-604), pp. 309–368.
- Datasheets for datasets. arXiv:1803.09010 [cs.DB].
- Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology 31 (4), pp. 337–350.
- Empirical Bayes for large-scale randomized experiments: a spectral approach. arXiv:2002.02564 [stat.ME].
- The MineThatData e-mail analytics and data mining challenge.
- Focusing on the long-term: it’s good for users and business. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, New York, NY, USA, pp. 1849–1858.
- Two-sample Bayesian nonparametric hypothesis testing. Bayesian Analysis 10 (2), pp. 297–320.
- The quality of reports of randomised trials in 2000 and 2006: comparative study of articles indexed in PubMed. BMJ 340, pp. c723.
- Who’s the guinea pig? Investigating online A/B/n tests in-the-wild. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, New York, NY, USA, pp. 201–210.
- Peeking at A/B tests: why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, New York, NY, USA, pp. 1517–1525.
- Bayes factors. Journal of the American Statistical Association 90 (430), pp. 773–795.
- HARKing: hypothesizing after the results are known. Personality and Social Psychology Review 2 (3), pp. 196–217.
- Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, New York, NY, USA, pp. 1168–1176.
- Trustworthy online controlled experiments: a practical guide to A/B testing. 1st edition, Cambridge University Press.
- Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association 47 (260), pp. 583–621.
- What is the value of experimentation and measurement? Quantifying the value and risk of reducing uncertainty to make better decisions. Data Science and Engineering 5, pp. 152–167.
- An evaluation framework for personalization strategy experiment designs. arXiv:2007.11638 [stat.ME]. Presented in the AdKDD 2020 Workshop (in conjunction with KDD ’20).
- Analyze A/B test results. Code notebook.
- On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18 (1), pp. 50–60.
- The Upworthy Research Archive, a time series of 32,487 experiments in U.S. media. Scientific Data 8 (1), pp. 195.
- Udacity A/B testing — final course project, Version 11. Code notebook.
- A meta-analytic investigation of p-hacking in e-commerce experimentation. Working paper.
- How not to run an A/B test. Blog post.
- xkcd: Significant.
- xkcd: P-values.
- The Belmont report: ethical principles and guidelines for the protection of human subjects of research. Vol. 2, The Commission.
- Federal policy for the protection of human subjects (’Common Rule’).
- Udacity data analysis nanodegree | Analyze AB test results. Code repository.
- Variance reduction in bipartite experiments through correlation clustering. In Advances in Neural Information Processing Systems, Vol. 32.
- Using control groups to target on predicted lift: building and assessing uplift model. Direct Marketing Analytics Journal, pp. 14–21.
- Analyze A/B test results, Version 1. Code notebook.
- Continuous experimentation and A/B testing: a mapping study. In Proceedings of the 4th International Workshop on Rapid Continuous Software Engineering, RCoSE ’18, New York, NY, USA, pp. 35–41.
- AB testing with Python - walkthrough of Udacity’s course final project, Version 29. Code notebook.
- ASSISTments dataset from multiple randomized controlled experiments. In Proceedings of the Third (2016) ACM Conference on Learning @ Scale, L@S ’16, New York, NY, USA, pp. 181–184.
- Dataset of two experiments of the application of gamified peer assessment model into online learning environment MeuTutor. Data in Brief 12, pp. 433–437.
- A gamified peer assessment model for on-line learning environments in a competitive context. Computers in Human Behavior 64, pp. 247–263.
- Experimentation works: the surprising power of business experiments. Harvard Business Review Press.
- OpenML: networked science in machine learning. SIGKDD Explorations 15 (2), pp. 49–60.
- Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics 16 (2), pp. 117–186.
- A/B testing for a pre-enrollment intervention in Udacity. Code repository.
- Causal meta-mediation analysis: inferring dose-response function from summary statistics of many randomized experiments. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2625–2635.
- The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34 (1-2), pp. 28–35.
- Individual comparisons by ranking methods. Biometrics Bulletin 1 (6), pp. 80.
- agridat: agricultural datasets. R package version 1.17.
- From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, New York, NY, USA, pp. 2227–2236.
- Improving library user experience with A/B testing: principles and process. Weave: Journal of Library User Experience 1 (1).
- A/B testing for Udacity free trial screener. Code notebook.
- Machine learning testing: survey, landscapes and horizons. IEEE Transactions on Software Engineering.