Controlled experiments (A/B tests or randomized field experiments) are the de facto standard for making data-driven decisions and for driving innovation, because they make it possible to evaluate new ideas cheaply and quickly. Online experiments are widely used to evaluate the impact of changes made in marketing campaigns, software products, or websites. These days, practically every large technology company continuously conducts experiments: Facebook for user interface changes in their News Feed; Google, which states that “experimentation is practically a mantra; we evaluate almost every change that potentially affects what our users experience”; and Microsoft for optimizing Bing ads. In all cases, the key reason to conduct controlled experiments is the possibility to establish a causal relationship (with high probability) between a change on a website or a new feature in a product and changes in visitor or user behavior.
“It’s all about data these days. Leaders don’t want to make decisions unless they have evidence,” states the Harvard Business Review in a 2017 article, “A Refresher on A/B Testing”, with the subtitle “Spoiler: Many people are doing it wrong.” The article stresses the importance of experiments for businesses, but also the difficulties in communicating Frequentist inference-based results like significance, confidence interval, and p-value to stakeholders like marketing and product managers. In recent years, Bayesian inference has gained popularity as it has several advantages over Frequentist inference. One key argument in our business setting is that results expressed as the probability to be the best variant and the expected uplift in conversion rates are easier to communicate to stakeholders and, consequently, make it easier for them to reach business decisions.
In our company, we are continuously running A/B tests in marketing, website, and product development. In this paper we show the different experimentation scenarios and how we model them in terms of Bayesian inference; the paper is organized as follows. Section II discusses the general structure of experiments and the main business questions to be answered with them. Section III reviews the basics of Bayesian inference and introduces relevant concepts and distributions needed in this paper. Section IV introduces the three business scenarios we tackle and the corresponding models. Section V describes the methodology to draw business decisions based on the introduced models. Section VI presents three real-world cases where we used this methodology to come to a business decision. We provide empirical verification of the correctness by comparing the results to a state-of-the-art commercial experimentation tool. Section VII concludes this paper with a summary and a discussion of future work. The open source implementation of the methodology is freely available to the community from https://github.com/Avira/prayas.
The main objective of a business is to maximize revenue. As described in the previous section, conducting controlled experiments is an effective way to do so. Let us take the example of an online e-commerce business with the objective of increasing its transactions; Fig. 1 illustrates the scenario. It currently has a green “purchase” button on its webpage and wants to experiment by changing the color to blue to see if this change has an effect on the number of transactions. We call the page with the green button variant A (the control) and the one with the blue button variant B (the treatment). In this paper, we focus on conversion-based experiments. Therefore, every experiment has trials and successes. Each visitor to the webpage counts as a trial and each purchase counts as a success (we also refer to a success as a conversion). The measures of interest are the uplift in conversion rate and revenue-based metrics like average order value and customer lifetime value.
In the execution of the experiment, each visitor is randomly assigned to variant A or variant B. Given a valid design and a correct execution, the only difference between the two variants is the color of the button; all other factors such as seasonality or moves by the competition occur in both control and treatment. This means that a difference in the measures of interest is due either to the change of the button color or to random chance. To determine the “relevance” of the difference, statistical methods are used: either a statistical test in terms of Frequentist inference or, as we motivate here, a model-driven approach in terms of Bayesian inference.
In detail, we present a Bayesian approach to answer the quintessential questions that typically arise in the mind of a business owner:
What is the probability that variant B is better than variant A?
How much uplift in conversions will variant B bring?
What is the risk of switching to variant B from variant A?
Note that in the course of this paper we define the approach with experiments executed on websites, with visitors as the population, the purchase of a product as the success event, and additional revenue-based metrics as measures of interest. However, the approach is generally applicable, e.g., also in the domain of product testing.
III Bayesian statistics
In this section we provide an introduction to relevant concepts of Bayesian statistics. We mainly follow “Bayesian Data Analysis” by Gelman et al., and for specific details we refer to this seminal book.
Bayesian statistics is based on the principle of making probability statements and updating those probabilities after obtaining new data. Bayes’ theorem is the fundamental model describing such updating:
$$P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{P(X)} \tag{1}$$
It relates the posterior distribution $P(\theta \mid X)$ of a parameter $\theta$ after seeing data $X$ to the prior distribution of the parameter $P(\theta)$, the likelihood of the data given the parameter $P(X \mid \theta)$, and the marginal likelihood of the data $P(X)$. In many cases, also in this paper, we can ignore the marginal likelihood and work with $P(\theta \mid X) \propto P(X \mid \theta)\, P(\theta)$. With this, the primary task in developing Bayesian models for specific applications is to define proper models for the prior and the likelihood.
A very important concept in Bayesian statistics is conjugacy, meaning that the posterior distribution is in the same probability distribution family as the prior distribution. The prior is then called a conjugate prior. This allows us to write the posterior in a closed-form expression, which is mathematically as well as computationally convenient.
III-A Relevant distributions
To describe the models in this paper we need six distributions that we briefly introduce.
The binomial distribution $\text{Binomial}(n, p)$ is a discrete distribution defined on the set $\{0, 1, \dots, n\}$. It describes the number of successes in $n$ trials with success probability $p$ in each trial. Fig. 2 (left) shows the probability mass function for different parameter values.
The Beta distribution $\text{Beta}(\alpha, \beta)$ is a continuous distribution defined on the interval $[0, 1]$. The parameters $\alpha > 0$ and $\beta > 0$ define the shape of the distribution, e.g., $\alpha = \beta = 1$ gives the standard uniform distribution. Fig. 2 (right) shows the probability density function for different parameter values. The Beta distribution is the conjugate prior for the binomial distribution.
The multinomial distribution is the generalization of the binomial distribution. Whereas the binomial distribution counts successes for a single option, the multinomial distribution describes the number of successes in $n$ trials for $k$ options. It gives the probability of any particular combination of numbers of successes for the $k$ options with fixed success probabilities $p_1, \dots, p_k$ and $\sum_{j=1}^{k} p_j = 1$.
The Dirichlet distribution is the generalization of the Beta distribution for $k$ options. The parameter is defined as $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_k)$ with $\alpha_j > 0$. The Dirichlet distribution is the conjugate prior for the multinomial distribution.
The exponential distribution is a continuous distribution defined on the interval $[0, \infty)$ with the parameter $\theta$ called the scale (another common parametrization uses the rate parameter $1/\theta$). We use the exponential distribution to model revenue, i.e., the higher the price of a product, the lower the probability that it is purchased.
The Gamma distribution is a generalization of the exponential distribution. The parameters $k$ and $\theta$ define the shape and the scale of the distribution; with the shape parameter $k = 1$ it is an exponential distribution with scale $\theta$. The Gamma distribution is the conjugate prior for the exponential distribution.
IV Experiment scenarios and models
In our company, the majority of experiments fall into one of three scenarios: 1) compare variants where the visitor has one option to choose, 2) compare variants where the visitor has multiple options to choose from, and 3) compare variants where the visitor has multiple options to choose from but we only observe the overall success. In the following we describe a model for each scenario.
IV-A One option model
The first model we describe is applicable in the basic scenario of experiments described in Section II and illustrated in Fig. 1; Fig. 3 shows a real-world variant of one of our experiments where the visitor has only the option of purchasing one product. One option scenarios are formalized as follows.
The experiment consists of variants $V_i$ with $i = 1, \dots, N$, where $N$ is the total number of variants. Each variant displays only one option to the visitor. For a given variant $V_i$, $n_i$ is the total number of visitors that saw the variant (trials), and $c_i$ the number of conversions (successes). Success is defined by the occurrence of the event of interest, for example, the visitor clicks a button or purchases a product. A success also has an associated value $v_i$, such as revenue or customer lifetime value, which can be different for every variant. In a frequentist approach, the conversion rate $\lambda_i = c_i / n_i$ is the point estimate, the total revenue earned is $c_i v_i$, and the revenue earned per visitor per variant is $\lambda_i v_i$.
Executing the experiment results in collected data $X_i = (x_{i1}, \dots, x_{i n_i})$ for each variant, with $x_{ij} \in \{0, 1\}$, where $0$ indicates a non-conversion and $1$ a conversion. In contrast to the frequentist approach explained above, we model the data generating process and the distribution of the conversion rate $\lambda_i$. Since the conversions are Boolean-valued, the number of conversions $c_i = \sum_{j} x_{ij}$ follows the binomial distribution and the generative process of the data is:
$$c_i \sim \text{Binomial}(n_i, \lambda_i) \tag{2}$$
The prior distribution of the conversion rate $\lambda_i$ is modeled as
$$\lambda_i \sim \text{Beta}(\alpha, \beta) \tag{3}$$
with the hyperparameters $\alpha$ and $\beta$ defined in a way to represent our prior belief. Fig. 2 (right) shows different prior beliefs about the conversion rate: with $\alpha = \beta = 1$ we have an uninformed prior and all conversion rates are equally likely; with large hyperparameter values the prior concentrates around a particular conversion rate with small variance, meaning that we have strong prior knowledge; and with smaller hyperparameter values the prior is centered on a conversion rate with wider variance, indicating that we have only weaker knowledge about the conversion rate.
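To compute the posterior, we leverage the fact that the conjugate prior for a likelihood in the binomial family is in the Beta family. The posterior distribution is therefore in the same family as the prior, with parameter values updated by the $c_i$ conversions observed among the $n_i$ visitors:
$$\lambda_i \mid X_i \sim \text{Beta}(\alpha + c_i,\; \beta + n_i - c_i) \tag{4}$$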
The posterior gives the probability distribution of all possible values of the conversion rate $\lambda_i$ given the evidence $X_i$. We can draw random samples from the posterior to obtain a set of possible values the conversion rate can take. The possible values for the revenue (or any other value $v_i$) per visitor are defined as $\lambda_i v_i$. There are cases where businesses incur a loss for each non-conversion. One example is the pay-per-click model followed in the online advertising industry, where businesses must pay the ad platform for each click generated regardless of conversion. The above model can be adjusted to penalize each non-conversion with such a loss.
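The conjugate update and the posterior sampling described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation from the open source package; the function name, the fixed seed, and the value per conversion are assumptions made for the example, while the counts are those reported for the baseline variant in Section VI-A.

```python
import random


def sample_conversion_rate(conversions, visitors, alpha=1.0, beta=1.0,
                           n_samples=10_000, seed=0):
    """Draw samples from the Beta posterior of a conversion rate.

    alpha and beta are the prior hyperparameters; the default
    Beta(1, 1) corresponds to the uninformed (uniform) prior.
    """
    rng = random.Random(seed)
    post_alpha = alpha + conversions             # prior + successes
    post_beta = beta + visitors - conversions    # prior + non-conversions
    return [rng.betavariate(post_alpha, post_beta) for _ in range(n_samples)]


# Baseline counts from Section VI-A: 139 conversions among 15144 visitors.
rate_samples = sample_conversion_rate(conversions=139, visitors=15144)

value = 10.0  # hypothetical value per conversion, for illustration only
revenue_samples = [rate * value for rate in rate_samples]

posterior_mean = sum(rate_samples) / len(rate_samples)
print(f"posterior mean conversion rate: {posterior_mean:.4f}")
```

Every downstream quantity (revenue per visitor, uplift, loss) can then be computed sample by sample from `rate_samples`, which is what makes the sampling view convenient.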
Simulation of the Bayesian updating
To illustrate the Bayesian updating from prior to posterior when gathering new data, we simulate a one option experiment and take a closer look at a single variant. For the data we draw 100 samples from a binomial distribution with a fixed true conversion rate. Starting from the priors shown in Fig. 2 (right), we look at the posteriors after seeing the first 10, 50, and all 100 samples. Fig. 4 shows the simulation. In the first row, we start with an uninformative prior, and after seeing all data the posterior is near the true conversion rate. In the second row, we start with strong prior knowledge about the conversion rate; here 100 samples are not enough to estimate the true conversion rate. In the third row, we start with weaker prior knowledge about the conversion rate, and in this case, the posterior is near the true conversion rate after seeing all samples.
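A simulation of this kind can be sketched as follows. The true conversion rate, the seed, and the uniform prior used here are hypothetical choices for illustration and are not the values behind Fig. 4; the sketch only shows how the posterior parameters evolve as more observations arrive.

```python
import random

rng = random.Random(42)   # arbitrary seed for reproducibility
true_rate = 0.3           # hypothetical true conversion rate

# 100 simulated visitors: 1 = conversion, 0 = non-conversion.
trials = [1 if rng.random() < true_rate else 0 for _ in range(100)]


def posterior_params(data, alpha, beta):
    """Return the Beta posterior parameters after observing `data`."""
    successes = sum(data)
    return alpha + successes, beta + len(data) - successes


# Posterior after the first 10, 50, and all 100 observations,
# starting from the uniform prior Beta(1, 1).
for n_seen in (10, 50, 100):
    a, b = posterior_params(trials[:n_seen], alpha=1.0, beta=1.0)
    print(n_seen, a, b, round(a / (a + b), 3))  # last value: posterior mean
```

Repeating the loop with a prior that has larger hyperparameters shows the effect described for the second row: the stronger the prior, the more data is needed before the posterior moves toward the true rate.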
IV-B Multi-options model
The second model we describe is the generalization of Scenario 1. Scenario 1 assumes that there can be only one possible event of interest in each variant. However, in the real world, this is often too strict a restriction. For example, many websites showcase multiple product purchase buttons on a single page. Fig. 5 shows a variant of one of our experiments where visitors can choose from three different product options. Multi-options scenarios are formalized as follows.
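Each variant $V_i$ displays $k$ different options, from which the visitor can choose either one of the options or none. The number of conversions among the $n_i$ visitors is given by the successes $\mathbf{c}_i = (c_{i1}, \dots, c_{ik})$ for each option per variant, and $c_{i0} = n_i - \sum_{j=1}^{k} c_{ij}$ is the number of visitors who chose none. The value of a success is $\mathbf{v}_i = (v_{i1}, \dots, v_{ik})$, since each of the multiple options can have a different value (e.g., due to different prices per product). The conversion rate is defined as $\boldsymbol{\lambda}_i = (\lambda_{i0}, \lambda_{i1}, \dots, \lambda_{ik})$ with the probabilities $\lambda_{i1}, \dots, \lambda_{ik}$ of choosing the individual options and the probability $\lambda_{i0}$ of choosing none of the options. Since each visitor can choose at most one of these options, the outcomes are mutually exclusive and therefore for every variant it holds that $\sum_{j=0}^{k} \lambda_{ij} = 1$.
Executing the experiment results in collected data $X_i = (x_{i1}, \dots, x_{i n_i})$ for each variant, with $x_{il} \in \{0, 1, \dots, k\}$ indicating which option the visitor chose ($0$ meaning none). Since we consider multiple events within a variant, the conversion counts follow a multinomial distribution and the generative process of the data is:
$$(c_{i0}, c_{i1}, \dots, c_{ik}) \sim \text{Multinomial}(n_i, \boldsymbol{\lambda}_i) \tag{5}$$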
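The prior distribution of the vector of conversion rates $\boldsymbol{\lambda}_i$ (one probability per option plus the probability of choosing none) is modeled as
$$\boldsymbol{\lambda}_i \sim \text{Dirichlet}(\boldsymbol{\alpha}) \tag{6}$$
where $\boldsymbol{\alpha} = (\alpha_0, \alpha_1, \dots, \alpha_k)$ are the hyperparameters.
Once again, leveraging the conjugacy relationship between the multinomial and Dirichlet distributions, we compute the posterior distribution by adding the observed count of each outcome (option $j$ chosen $c_{ij}$ times, no option chosen $c_{i0}$ times) to the corresponding hyperparameter:
$$\boldsymbol{\lambda}_i \mid X_i \sim \text{Dirichlet}(\alpha_0 + c_{i0},\, \alpha_1 + c_{i1},\, \dots,\, \alpha_k + c_{ik}) \tag{7}$$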
From this posterior distribution of conversion rates, we can draw random samples; for each sample, the sum of the per-option probabilities is the overall conversion rate over all options available in the variant. Similarly to the one option model, we can penalize the non-conversions, and we can compute a revenue-based metric as the sum of the element-wise multiplication of the per-option values and the sampled per-option conversion rates.
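The multi-options posterior can be sampled with the standard library alone by normalizing independent Gamma draws, which is a standard way to generate Dirichlet samples. The sketch below is illustrative only: the function names, the symmetric prior, and the toy counts and prices are assumptions for the example, not data from our experiments.

```python
import random


def sample_dirichlet(alphas, rng):
    """One Dirichlet draw, obtained by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_multi_option(conversions, visitors, values, prior=1.0,
                        n_samples=5_000, seed=0):
    """Posterior samples of overall conversion rate and revenue per visitor.

    conversions: conversions per option; values: value per option.
    The last Dirichlet component models visitors who chose no option.
    """
    rng = random.Random(seed)
    non_conversions = visitors - sum(conversions)
    posterior = [prior + c for c in conversions] + [prior + non_conversions]
    rates, revenues = [], []
    for _ in range(n_samples):
        probs = sample_dirichlet(posterior, rng)[:-1]  # drop the "none" outcome
        rates.append(sum(probs))                       # overall conversion rate
        revenues.append(sum(p * v for p, v in zip(probs, values)))
    return rates, revenues


# Hypothetical toy data: three options, 1000 visitors.
rates, revenues = sample_multi_option(
    conversions=[30, 10, 5], visitors=1000, values=[20.0, 40.0, 60.0])
mean_rate = sum(rates) / len(rates)
mean_revenue = sum(revenues) / len(revenues)
print(round(mean_rate, 4), round(mean_revenue, 3))
```

A cost per visitor, as in the pay-per-click case, could be subtracted from each entry of `revenues` to obtain a gain-per-visitor sample.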
IV-C Aggregated model
This scenario is a special case of Scenarios 1 and 2 where only the aggregated revenue and the sum of conversions per variant are observed. This can occur because of the data generation process or simply because there are too many options within a variant, making it cumbersome to model under Scenario 2. Some of our experiments have up to 81 different options the visitor can choose from, due to the different products listed together with the options for license runtimes and the number of devices on which the product can be installed. This aggregated scenario is formalized as follows.
For a given variant $V_i$, there are $k$ unknown options. $n_i$ is the number of visitors and $c_i$ is the total number of conversions over all options of the multi-option model. The individual revenue per option is unknown. Executing the experiment results in collected data $X_i$ and the aggregated revenue $s_i$ over all successes and implicitly over all unknown options, i.e., $s_i = \sum_{l=1}^{c_i} r_{il}$ with $r_{il}$ the revenue of each success.
The posterior for the conversion rate can be obtained by following the one option model and the therein defined posterior in Equation 4. To get an estimate of the revenue per visitor we model the average revenue per visitor given the observed aggregated revenue . For that we assume that follows an exponential distribution
where is the scale parameter of the distribution . The exponential distribution means that lower revenue has more probability of occurrence than higher revenue. This assumption fits our observation of the visitors’ money spent curve on our website. The prior distribution of is modeled as
Taking advantage of the conjugacy relationship between exponential and gamma distributions, we compute the posterior distribution as
The expected value per visitor per variant is . Since we have the posterior distributions for both, and , we draw random samples from for the conversion rate , random samples from for the revenue.
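One way to sketch the sampling step of the aggregated model is shown below. It rests on explicit assumptions that may differ from our exact formulation: the revenue per conversion is taken to be exponential with a rate parameter, the Gamma prior is placed on that rate (the textbook conjugate pair), and the prior hyperparameters are hypothetical. The counts and the aggregated revenue are those of the “Original” variant in Section VI-C.

```python
import random


def sample_revenue_per_visitor(conversions, visitors, total_revenue,
                               alpha=1.0, beta=1.0,    # Beta prior (uniform)
                               shape=1.0, rate=0.01,   # hypothetical Gamma prior
                               n_samples=10_000, seed=0):
    """Posterior samples of revenue per visitor under the stated assumptions."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Conversion rate: Beta posterior as in the one option model.
        conv_rate = rng.betavariate(alpha + conversions,
                                    beta + visitors - conversions)
        # Exponential rate: Gamma(shape + c, rate + s) posterior.
        # random.gammavariate expects (shape, scale), so scale = 1 / rate.
        exp_rate = rng.gammavariate(shape + conversions,
                                    1.0 / (rate + total_revenue))
        # Mean revenue per conversion is 1 / exp_rate.
        samples.append(conv_rate / exp_rate)
    return samples


rpv = sample_revenue_per_visitor(conversions=127, visitors=8067,
                                 total_revenue=6905.65)
mean_rpv = sum(rpv) / len(rpv)
print(round(mean_rpv, 3))  # close to the observed 6905.65 / 8067
```

Under these assumptions the posterior mean stays close to the observed revenue per visitor, while the spread of `rpv` reflects the uncertainty in both the conversion rate and the revenue per conversion.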
V Decision making
Every experiment has one measure of interest defined beforehand, such as conversion rate or revenue per visitor, on which the variants are evaluated. In Section IV, we described a way of modelling the measures of interest and obtaining the samples from the posterior distribution. Here, we describe the method of comparing different posterior samples in order to answer the motivating questions from Section II.
V-A Probability to be the best
After running an experiment with multiple variants, we want to know which variant is the most favorable and should be implemented. Given a set of posterior samples of the measure of interest from different variants, the probability to be the best is defined as the probability that a variant has a higher measure than all other variants. For two variants $A$ and $B$ with posterior samples $S_A$ and $S_B$, the probability that $B$ is better than $A$ is the mean, taken over the samples, of the indicator:
$$\mathbb{1}[S_B > S_A] \tag{11}$$
The extension to more than two variants is to simply compare against all others. To find the winner variant we compute all combinations and select the one with the highest probability.
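The comparison of posterior samples can be sketched as follows. This is one straightforward reading of the definition (count, draw by draw, which variant has the highest sampled measure), written with hypothetical function names and toy samples; it is not taken from the open source package.

```python
def probability_b_beats_a(samples_b, samples_a):
    """Share of paired posterior draws in which B exceeds A."""
    return sum(b > a for a, b in zip(samples_a, samples_b)) / len(samples_a)


def probability_to_be_best(samples_by_variant):
    """For each variant, the share of draws in which it has the highest value."""
    names = list(samples_by_variant)
    columns = [samples_by_variant[name] for name in names]
    n_draws = len(columns[0])
    wins = dict.fromkeys(names, 0)
    for draw in zip(*columns):                      # one joint draw per iteration
        wins[names[draw.index(max(draw))]] += 1
    return {name: wins[name] / n_draws for name in names}


# Toy posterior samples (hypothetical values) for two variants.
toy = {"A": [0.10, 0.12, 0.11, 0.09],
       "B": [0.11, 0.10, 0.13, 0.12]}
print(probability_b_beats_a(toy["B"], toy["A"]))   # 0.75
print(probability_to_be_best(toy))                 # {'A': 0.25, 'B': 0.75}
```

With real posterior samples the lists would contain thousands of draws per variant; the logic stays the same for any number of variants.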
V-B Expected uplift
The second motivating question we want to answer, once we have the best variant, is to determine the increase in the measure we expect after its implementation. Given a set of posterior samples $S_A$ and $S_B$ for variants $A$ and $B$, the expected uplift of choosing $B$ over $A$ is defined as the mean (or the credible interval) of the percentage increase:
$$\frac{S_B - S_A}{S_A} \tag{12}$$
As we are interested in the expected uplift compared to the control variant, we typically only compute this for all treatment variants against the control variant. However, the equation can be extended to compare every variant against each other.
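A possible sketch of the uplift computation on paired posterior samples is given below. The function name, the nearest-rank rule used for the 95% credible interval, and the toy samples are assumptions for illustration.

```python
import math


def expected_uplift(samples_b, samples_a, mass=0.95):
    """Mean and credible interval of the relative increase of B over A."""
    rel = sorted((b - a) / a for a, b in zip(samples_a, samples_b))
    n = len(rel)
    mean = sum(rel) / n
    tail = (1.0 - mass) / 2.0
    lower = rel[max(0, math.ceil(tail * n) - 1)]           # nearest-rank quantile
    upper = rel[min(n - 1, math.ceil((1.0 - tail) * n) - 1)]
    return mean, (lower, upper)


# Toy posterior samples (hypothetical values): control A, treatment B.
control = [0.10, 0.12, 0.11, 0.09]
treatment = [0.11, 0.10, 0.13, 0.12]
uplift_mean, uplift_ci = expected_uplift(treatment, control)
print(round(uplift_mean, 4), uplift_ci)
```

With only four samples the interval is just the observed range; with thousands of posterior draws it approximates the credible interval reported in the result tables.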
V-C Expected loss
Suppose Variant 1 has the highest probability to be the best, but this probability is smaller than 1. Then, there is still a chance that another variant is the true best performing one. In such a case we want to know the risk of implementing Variant 1. Given a set of posterior samples $S_A$ and $S_B$ for variants $A$ and $B$, the expected loss when choosing $B$ over $A$ is the mean (or the credible interval) of:
$$\max\left(\frac{S_A - S_B}{S_B},\; 0\right) \tag{13}$$
Similar to the expected uplift, the expected loss is typically calculated between the treatment variant and the control.
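The loss computation can be sketched in the same sample-based style. The relative form used here (the shortfall of the chosen variant relative to its own value, floored at zero) is an assumption about the exact definition, chosen because the losses in our result tables are reported as percentages; the function name and toy samples are hypothetical.

```python
def expected_loss(samples_chosen, samples_other):
    """Mean relative shortfall of the chosen variant against the other one.

    Draws in which the chosen variant is at least as good contribute zero.
    """
    losses = [max((other - chosen) / chosen, 0.0)
              for chosen, other in zip(samples_chosen, samples_other)]
    return sum(losses) / len(losses)


# Toy posterior samples (hypothetical values): control A, treatment B.
control = [0.10, 0.12, 0.11, 0.09]
treatment = [0.11, 0.10, 0.13, 0.12]

# Risk of switching to the treatment: only the second draw favors the control.
loss_b = expected_loss(treatment, control)
print(round(loss_b, 4))  # 0.05
```

A small expected loss means that even in the draws where the chosen variant is worse, it is not worse by much, which is the sense in which the metric quantifies the risk of switching.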
VI Business cases
In this section we illustrate the application of Bayesian inference and decision making while experimenting with different discounts on our product prices.
VI-A Single product discount test
| Variant | Probability to be best | Probability to beat BL | Improvement (mean) | Improvement (CI95) | Loss from BL |
|---|---|---|---|---|---|
| Discount 10 | 0.26 | 0.68 | 0.06 | [-0.16, 0.33] | 0.02 |
| Discount 40 | 0.58 | 0.83 | 0.12 | [-0.11, 0.40] | 0.01 |
| Discount 50 | 0.06 | 0.42 | -0.02 | [-0.23, 0.24] | 0.06 |
We ran an experiment on our main product page, which displayed a 20% discount (see Fig. 3), with the objective of measuring the effect of changing the discount on the conversion rate. We chose the three treatment variants to display 10%, 40%, and 50% discounts. Technically, we ran the experiment with Google Optimize, and we use this experiment to validate our approach against an industry-standard tool. Google Optimize also uses Bayesian inference for the analysis of results; however, their concrete models are, to the best of our knowledge, unknown.
Since there was only one option on the page we used our one option model, described in Section IV-A, for Bayesian inference and decision making. After running the experiment for 53 days, we gathered the following data: for the variants “Discount 20%”, “Discount 10%”, “Discount 40%”, and “Discount 50%”, the total conversions were 139, 147, 149 and 134 respectively, and the total visitors were 15144, 15176, 14553 and 14948 respectively.
Fig. 6 shows the posteriors of the conversion rates for all four variants obtained from the model using an uninformed prior. In order to make decisions based on the posteriors, we use the equations defined in Section V to find the probability to be best, the probability to beat the baseline (Discount 20%), the expected uplift from the baseline, and the expected loss for each variant. The results in Table I show that the variant “Discount 40%” had the highest probability to be best (58%), the highest average improvement (12%), and the lowest expected loss (1%). Hence, we decided to display a 40% discount instead of 20% on the product page.
As a comparison we show the results from Google Optimize in Fig. 7. All the metrics from Google Optimize and our model are identical up to some insignificant rounding differences.
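As a rough cross-check, the analysis of this experiment can be re-run from the reported counts with a few lines of standard-library Python. This is a sketch and not the code used to produce Table I; the seed and the number of samples are arbitrary, so the estimates should land close to, but not exactly on, the tabulated values.

```python
import random

# Conversions and visitors per variant as reported in this section.
data = {
    "Discount 20%": (139, 15144),   # baseline
    "Discount 10%": (147, 15176),
    "Discount 40%": (149, 14553),
    "Discount 50%": (134, 14948),
}

rng = random.Random(2020)   # arbitrary fixed seed
n_samples = 20_000
names = list(data)

# Posterior samples under a uniform Beta(1, 1) prior.
samples = {
    name: [rng.betavariate(1 + c, 1 + n - c) for _ in range(n_samples)]
    for name, (c, n) in data.items()
}

# Probability to be best: share of joint draws won by each variant.
wins = dict.fromkeys(names, 0)
for draw in zip(*(samples[name] for name in names)):
    wins[names[draw.index(max(draw))]] += 1
prob_best = {name: wins[name] / n_samples for name in names}

# Probability to beat the baseline.
baseline = samples["Discount 20%"]
prob_beat_baseline = {
    name: sum(s > b for s, b in zip(samples[name], baseline)) / n_samples
    for name in names if name != "Discount 20%"
}

for name in names:
    print(name, round(prob_best[name], 2), prob_beat_baseline.get(name))
```

Because the estimates are Monte Carlo averages, increasing `n_samples` reduces the remaining sampling noise at the cost of run time.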
VI-B Multi-product discount test
| Variant | Probability to be best | Probability to beat BL | Improvement (mean) | Improvement (CI95) | Loss from BL |
The second discount experiment was run on our Microsoft Bing Ads landing page, which displayed three products, each with three different license options and a 20% discount each (see Fig. 5). Since visitors land on this webpage after clicking our ad, we incur a cost. The treatment displayed a progressive discount of 0%, 15%, and 30%, in order to nudge the visitors towards our premium product. The objective was to measure the effect on gain per visitor, given as revenue per visitor minus the cost per visitor.
Since there were a total of nine options on each variant, we used our multi-option scenario described in Section IV-B. After running the experiment for 58 days, we collected the following data: for the variants “Original” and “Progressive”, the total conversions were [50, 5, 5, 28, 7, 5, 20, 1, 6] and [28, 3, 6, 30, 6, 5, 27, 6, 3], the total visitors were 8067 and 8082, and the revenue for each option per variant was [27.95, 47.95, 63.95, 35.95, 63.95, 79.95, 79.95, 151.95, 223.95] and [34.95, 59.95, 79.95, 37.95, 67.95, 84.95, 69.95, 132.95, 195.95] respectively. The Bing ad cost per click was slightly higher (by 15 cents) for the original variant.
Fig. 8 (top row) shows the posteriors of conversion rates, revenue per visitor, and gain per visitor for both variants, using an uninformed prior, and Table II shows the result of the experiment. The progressive variant has lower conversion rates than the original variant (-10% on average), but the revenue per visitor is almost the same for both variants: the probability to be best is almost 50% for each, so both variants are equally likely to be the best. The progressive variant has fewer conversions for the lower priced option compared to the original, but makes up for the lost revenue with higher conversions on the premium product. For the metric gain per visitor, the progressive discount had the highest probability to be best with a low expected loss of 2%. Hence the progressive variant was used in production after the experiment.
VI-C Multi-product discount test with aggregated data
| Variant | Probability to be best | Probability to beat BL | Improvement (mean) | Improvement (CI95) | Loss from BL |
Here we illustrate that the previous example can be modeled under the aggregated model scenario described in Section IV-C to obtain similar results. We suppose that we only obtained the total conversions, the total visitors, and the overall revenue for the two variants from the above experiment: 127 and 114 (sum of conversions from all options), 8067 and 8082, and 6905.65 and 6883.30 (sum of element-wise multiplication of individual conversions and revenue), respectively.
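The aggregated figures follow directly from the per-option data of the previous experiment, which the following short check reproduces. The variable names are chosen for illustration; the numbers are the per-option conversions and revenues listed in Section VI-B.

```python
conversions = {
    "Original":    [50, 5, 5, 28, 7, 5, 20, 1, 6],
    "Progressive": [28, 3, 6, 30, 6, 5, 27, 6, 3],
}
revenue_per_option = {
    "Original":    [27.95, 47.95, 63.95, 35.95, 63.95, 79.95, 79.95, 151.95, 223.95],
    "Progressive": [34.95, 59.95, 79.95, 37.95, 67.95, 84.95, 69.95, 132.95, 195.95],
}

# Total conversions: sum over all options of a variant.
total_conversions = {name: sum(c) for name, c in conversions.items()}

# Aggregated revenue: element-wise product of conversions and revenue, summed.
total_revenue = {
    name: round(sum(c * r for c, r in zip(conversions[name],
                                          revenue_per_option[name])), 2)
    for name in conversions
}

print(total_conversions)  # {'Original': 127, 'Progressive': 114}
print(total_revenue)      # {'Original': 6905.65, 'Progressive': 6883.3}
```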
Fig. 8 (bottom row) shows our posteriors of conversion rates, revenue per visitor, and gain per visitor for both variants, using an uninformed prior, and Table III shows the experiment results. Since we assumed that revenue follows an exponential distribution, we see that the posteriors for revenue and gain have shifted slightly to the left in the aggregated model when comparing Fig. 8 (bottom row) and Fig. 8 (top row). The results from Table II and Table III for conversion and revenue are almost identical. The aggregated model shows more uncertainty in gain; however, the credible interval of improvement for the aggregated model includes the credible interval of improvement for the multi-options model.
Decision making is central in running a business, with data-driven decisions being the ones having the highest impact on output and productivity. To support decision making, we (and many other companies) are continuously running experiments. For easier interpretability we use a Bayesian approach for the analysis of the experiments. In this paper, we introduced the three most common scenarios in our company: the one option scenario for testing websites where the visitor has only the option to purchase one product; the multi-option scenario for testing websites where the visitor has multiple products to choose from; and the aggregated scenario, which is a multi-option scenario in which we only observe aggregated data. For all three scenarios, we presented the Bayesian formulation of the models and how to draw decisions based on sampling the estimated posteriors.
To showcase the presented approach in production, we showed a real-world experiment for each of the scenarios. In the first experiment, we wanted to find out the effect of different discounts. The analysis made us switch from our baseline of a 20% discount to a 40% discount. We validated the results of our models against the industry-standard tool Google Optimize, which also uses Bayesian inference but as a black box. In the second experiment, we wanted to investigate the nudging effect of progressive discounts. The result showed us that nudging did not have an effect (the probability to be the best based on revenue is the same for both options). However, as the costs for showing the ads for the progressive discounts were lower (probability to be the best based on gain), we switched to the progressive discounts. The third experiment is based on the second one, and showed that even when observing only aggregated data, we come to the same conclusion.
We thank our colleagues Alin Secareanu and Matthew Bonick for providing helpful comments during the writing process.
- (2014) Designing and deploying online field experiments. In Proceedings of the 23rd ACM Conference on the World Wide Web. Cited by: §I.
- (2011) Strength in numbers: how does data-driven decisionmaking affect firm performance?. Social Science Research Network (SSRN). Cited by: §VII.
- (2015) Probabilistic programming & Bayesian methods for hackers. Addison-Wesley Professional. Cited by: §IV-A, §IV-B.
- (2017) A refresher on A/B testing. Harvard Business Review. Cited by: §I.
- (2013) Bayesian data analysis. 3rd edition, Chapman and Hall/CRC. Cited by: §III, §IV-A, §IV-B.
- General methodology. Note: https://support.google.com/optimize/answer/7405543; accessed on 2020-01-31. Cited by: Fig. 7, §VI-A.
- (2019) Top challenges from the first practical online controlled experiments summit. SIGKDD Explorations 21 (1). Cited by: §I, §II.
- (2013) Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Cited by: §I.
- (2017) A primer on Bayesian analysis for experimental psychopathologists. Journal of Experimental Psychopathology 8 (2). Cited by: §I.
- (2015) Bayesian A/B testing at VWO. Visual Website Optimizer. Note: Whitepaper, accessed on 2020-01-31. Cited by: §IV-C.
- (2010) Overlapping experiment infrastructure: more, better, faster experimentation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 17–26. Cited by: §I.
- (2008) Nudge: improving decisions about health, wealth, and happiness. Yale University Press. Cited by: §VI-B.