A/B testing is the technique of collecting data from two parallel experiments and comparing them by probabilistic inference.
A particularly important case is assessing which of two success-counting experiments achieves a higher success rate. This naturally applies to evaluating conversion rates on two different versions of a webpage. A typical question being asked is
whether there is a difference (also called a non-zero effect) in conversion between the groups: $\theta_1,\theta_2$ are the unknown conversion rates in the two experiments, and the task is to compare the hypotheses
$$H_0=\{\theta_1=\theta_2\}\quad\text{and}\quad H_a=\{\theta_1\neq\theta_2\},$$
given observed data $D$. In the frequentist approach, one falsifies $H_0$ by the two-sample t-test [Welch, 1938]. In the Bayesian approach one evaluates the strength of both $H_0$ and $H_a$ and decides based on the ratio called the Bayes factor
$$K=\frac{\Pr[D\mid H_a]}{\Pr[D\mid H_0]},$$
which, converted by Bayes' theorem to
$$\frac{\Pr[H_a\mid D]}{\Pr[H_0\mid D]}=K\cdot\frac{\Pr[H_a]}{\Pr[H_0]},$$
quantifies the posterior odds and allows a researcher to choose the model that is more plausible given the data (usually one gives $H_0$ and $H_a$ the same chance of being considered and sets $\Pr[H_0]=\Pr[H_a]$). The decision rule and confidence depend on the magnitude of $K$ [Kass and Raftery, 1995, Jeffreys, 1998]. In the Bayesian approach a hypothesis assigns an arbitrary distribution to the parameters, which is more general.
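The conversion from a Bayes factor to posterior odds can be illustrated with a minimal sketch. It assumes the Bayes factor is oriented as the likelihood of the data under the alternative divided by the likelihood under the null; the function name and the `prior_alt` parameter are hypothetical and chosen only for this illustration.

```python
def posterior_probability_alt(bayes_factor, prior_alt=0.5):
    """Posterior probability of the alternative hypothesis.

    Assumes bayes_factor = Pr[D | H_a] / Pr[D | H_0] (orientation is an
    assumption of this sketch) and prior_alt = Pr[H_a] = 1 - Pr[H_0].
    """
    if not 0.0 < prior_alt < 1.0:
        raise ValueError("prior_alt must lie strictly between 0 and 1")
    prior_odds = prior_alt / (1.0 - prior_alt)
    # Bayes' theorem in odds form: posterior odds = Bayes factor * prior odds.
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


for k in (1.0, 3.0, 10.0, 100.0):
    print(k, posterior_probability_alt(k))
```

With equal prior chances the posterior odds coincide with the Bayes factor, so only the magnitude of the factor matters for the decision.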
Testing counts proportions
Suppose that the empirical data $D$ has $n_i$ runs and $r_i$ successes in the $i$-th experiment, $i=1,2$. Under the binomial counting model, the data likelihood under a hypothesis $H$ equals
$$\Pr[D\mid H]=\int\prod_{i=1,2}\binom{n_i}{r_i}\theta_i^{r_i}(1-\theta_i)^{n_i-r_i}\,\mathrm{d}P_H(\theta_1,\theta_2),$$
where the prior distribution $P_H$ reflects what is assumed prior to seeing data (and what will be tested); one can for example choose $\theta_1=\theta_2$ for $H_0$ and $\theta_1\neq\theta_2$ for $H_a$, uniformly over all valid values of $\theta_1,\theta_2$, but in practice more informative priors are used because some configurations of values are unrealistic (e.g. extremely low or high conversion). The corresponding Bayes factor can be computed for example by the R package BayesFactor [Morey and Rouder, 2018].
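For the uniform priors mentioned above the integrals have closed forms through the Beta function, which the following minimal sketch uses. It assumes a null hypothesis with a common rate drawn uniformly from $(0,1)$ and an alternative with two independent uniform rates; the function names are hypothetical, and this is not the method implemented by the BayesFactor package.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor_uniform(r1, n1, r2, n2):
    """log of Pr[D | H_a] / Pr[D | H_0] under uniform priors (an assumption).

    H_a: theta_1 and theta_2 independent and uniform on (0, 1).
    H_0: theta_1 = theta_2 = theta with theta uniform on (0, 1).
    The binomial coefficients are the same under both hypotheses and cancel.
    """
    # Integral of theta^r (1 - theta)^(n - r) over (0, 1) is B(r + 1, n - r + 1).
    log_alt = log_beta(r1 + 1, n1 - r1 + 1) + log_beta(r2 + 1, n2 - r2 + 1)
    log_null = log_beta(r1 + r2 + 1, (n1 - r1) + (n2 - r2) + 1)
    return log_alt - log_null


# Example with made-up counts: 50/1000 versus 100/1000 conversions.
print(exp(log_bayes_factor_uniform(50, 1000, 100, 1000)))
```

Working on the log scale avoids underflow for large sample sizes; exponentiate only at the end if the factor itself is needed.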
Problem: Bayesian A/B testing power
Estimates, whether frequentist or Bayesian, will not be conclusive without sufficiently many samples. Frequentists widely use rules of thumb derived from t-tests. Under the Bayesian methodology this is a little more complicated because hypotheses can be arbitrary priors over parameters. Under the binomial A/B model, we will answer the following questions:
when, given data, may a Bayesian hypothesis of zero effect be rejected ($K\gg 1$ for some $H_a$)?
what is the relation to the classical t-test?
This will allow us to understand data limitations when doing Bayesian inference, and relate them to widespread frequentist rules of thumb.
1.2 Related Works and Contribution
Our problem, as stated, is a question about maximizing the minimal Bayes factor. It is known that for certain problems Bayes factors can be related to frequentist p-values [Edwards et al., 1963, Kass and Raftery, 1995, Goodman, 1999] and thus bridge the Bayesian and frequentist worlds (this should be contrasted with a widespread belief that the two methods are largely incompatible [Kruschke and Liddell, 2018]). The novel contributions of this paper are (a) bounding the Bayes factor for binomial distributions and (b) a discussion of sample bounds for binomial A/B testing in relation to the frequentist approach.
Main result: Bayes factor and Welch’s statistic
The following theorem shows that no “zero-effect” hypothesis can be falsified unless the number of samples is large in relation to a certain dataset statistic. This statistic turns out to be the Jensen-Shannon divergence, well known in information theory. It is in turn bounded by Welch's t-statistic.
Theorem 1 (Bayes Factors for Binomial Testing).
Consider two independent experiments, each with $n$ independent trials with unknown success probabilities $\theta_1$ and $\theta_2$ respectively. Let the observed data $D$ have $r_i$ successes and $n-r_i$ failures for group $i$, and write $p_i=r_i/n$. Then
$$\max_{H_0}\min_{H_a}\frac{\Pr[D\mid H_0]}{\Pr[D\mid H_a]}=e^{-2n\,\mathrm{JS}(p_1,p_2)},$$
where the maximum is over null hypotheses (priors) over $\theta_1,\theta_2$ such that $\theta_1=\theta_2$, the minimum is over all valid alternative hypotheses (priors) over $\theta_1,\theta_2$, and $\mathrm{JS}$ denotes the Jensen-Shannon divergence.
Moreover, the Jensen-Shannon divergence is bounded by Welch's $t$-statistic (on $D$)
$$\mathrm{JS}(p_1,p_2)\le\frac{t^2}{2n},$$
so that we can bound
$$\max_{H_0}\min_{H_a}\frac{\Pr[D\mid H_0]}{\Pr[D\mid H_a]}\ge e^{-t^2}.$$
Remark 1 (Most favorable hypotheses).
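The maximally favorable alternative $H_a$ (which maximizes $K$) is $\theta_1=r_1/n$ and $\theta_2=r_2/n$.
The maximally favorable null of the form $\theta_1=\theta_2=\theta$ is $\theta=\frac{r_1+r_2}{2n}$.
If the null is of the form $\theta_1=\theta_2=\theta$ for a fixed $\theta$, then the bound becomes $e^{-n\,\mathrm{KL}(r_1/n\,\|\,\theta)-n\,\mathrm{KL}(r_2/n\,\|\,\theta)}$.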
Application: sample bounds
The main result implies the following sample rule
Corollary 1 (Bayesian Sample Bound).
To confirm the non-zero effect () the number of samples for the bayesian method should be
Under the frequentist method the rule of thumb is , which gives (see Section 2)
Note that both formulas need assumptions on the locations of the parameters. In particular, testing smaller effects or effects with higher variance requires more samples.
Bounds Equation 6 and Equation 7 are close to each other up to a constant factor (a different small factor is necessary to make the bound small in both the Bayesian credibility and p-value sense). The difference (under the normalized constant) is illustrated in Figure 1, for the case when one wants to test a relative uplift of .
Since high values of $t$ mean small p-values, we conclude that frequentist p-values bound the Bayes factor and are indeed evidence against a null hypothesis in a well-defined Bayesian sense. However, because the bound grows exponentially in $t^2$, this is true only for p-values much lower than the standard threshold of 0.05. In some sense, the Bayesian approach is more conservative, that is more reluctant to reject, than frequentist tests; this conclusion is shared with other works [Goodman, 1999].
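The two kinds of sample rules can be compared numerically with a rough sketch. It assumes that the best attainable Bayes factor against the null grows as $e^{2n\,\mathrm{JS}(p_1,p_2)}$, as the proof below obtains for point-mass hypotheses, and that the t-statistic uses the plug-in variance $p_i(1-p_i)$. The thresholds `k0` and `t0` and the function names are hypothetical, and the resulting numbers are necessary sample sizes, not guarantees.

```python
from math import ceil, log


def entropy(p):
    """Binary Shannon entropy in nats, with 0 * log(0) = 0."""
    return -sum(x * log(x) for x in (p, 1.0 - p) if x > 0.0)


def js(p1, p2):
    """Binary Jensen-Shannon divergence in nats."""
    return entropy(0.5 * (p1 + p2)) - 0.5 * (entropy(p1) + entropy(p2))


def bayes_min_samples(p1, p2, k0):
    """Smallest n per group with exp(2 n JS(p1, p2)) >= k0 (requires p1 != p2)."""
    return ceil(log(k0) / (2.0 * js(p1, p2)))


def t_test_min_samples(p1, p2, t0):
    """Smallest n per group with plug-in t^2 >= t0^2 (requires p1 != p2)."""
    pooled_var = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil(t0 ** 2 * pooled_var / (p1 - p2) ** 2)


# Made-up scenario: baseline 10% conversion, 12% in the variant.
print(bayes_min_samples(0.10, 0.12, k0=10.0))
print(t_test_min_samples(0.10, 0.12, t0=1.96))
```

Both functions grow as the two rates approach each other, which mirrors the remark that smaller effects require more samples.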
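The binary cross-entropy of $p$ and $q$ is defined by
$$H(p,q)=-p\log q-(1-p)\log(1-q),$$
which becomes the standard (Shannon) binary entropy when $p=q$, denoted as $H(p)$. The Kullback-Leibler divergence is defined as
$$\mathrm{KL}(p\,\|\,q)=H(p,q)-H(p)=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q},$$
and the Jensen-Shannon divergence [Lin, 1991] is defined as
$$\mathrm{JS}(p,q)=H\!\left(\frac{p+q}{2}\right)-\frac{H(p)+H(q)}{2}$$
(always non-negative because the entropy is concave).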
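These quantities are straightforward to compute. The following sketch assumes natural logarithms and the standard binary forms of cross-entropy, entropy, KL and Jensen-Shannon divergence; the function names are hypothetical.

```python
from math import log


def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0.0 else x * log(y)


def cross_entropy(p, q):
    """Binary cross-entropy H(p, q) in nats; q must be in (0, 1) unless p == q."""
    return -_xlogy(p, q) - _xlogy(1.0 - p, 1.0 - q)


def entropy(p):
    """Binary Shannon entropy H(p) = H(p, p)."""
    return cross_entropy(p, p)


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)


def js(p, q):
    """Jensen-Shannon divergence: entropy of the midpoint minus mean entropy."""
    return entropy(0.5 * (p + q)) - 0.5 * (entropy(p) + entropy(q))


print(js(0.1, 0.2), 0.5 * (kl(0.1, 0.15) + kl(0.2, 0.15)))
```

The two printed numbers agree: the Jensen-Shannon divergence equals the average KL divergence of the two rates from their midpoint.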
The following lemma shows that the cross-entropy function is convex in the second argument. This should be contrasted with the fact that the entropy function (of one argument) is concave.
Lemma 1 (Convexity of cross-entropy).
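For any $p\in[0,1]$ the mapping $q\mapsto H(p,q)$ is convex in $q$.
Since $u\mapsto-\log u$ for $u>0$ is convex we obtain
$$-p\log(\gamma_1q_1+\gamma_2q_2)\le-\gamma_1p\log q_1-\gamma_2p\log q_2$$
for any $q_1,q_2\in(0,1)$ and any $\gamma_1,\gamma_2\ge0$, $\gamma_1+\gamma_2=1$. Replacing $p$ by $1-p$ and $q_i$ by $1-q_i$ in the above inequality gives us also
$$-(1-p)\log(1-\gamma_1q_1-\gamma_2q_2)\le-\gamma_1(1-p)\log(1-q_1)-\gamma_2(1-p)\log(1-q_2).$$
Adding side by side yields
$$H(p,\gamma_1q_1+\gamma_2q_2)\le\gamma_1H(p,q_1)+\gamma_2H(p,q_2),$$
which finishes the proof. This argument also works for the multivariate case, when $p,q$ are probability vectors. ∎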
Lemma 2 (Quadratic bounds on KL/cross-entropy).
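For any $p,q\in(0,1)$ it holds that
$$\mathrm{KL}(p\,\|\,q)\le\frac{(p-q)^2}{q(1-q)}.$$
We will prove a general version. Let $(p_i)_i$ and $(q_i)_i$ be probability vectors of the same length. By the elementary inequality
$$\log\frac{p_i}{q_i}\le\frac{p_i}{q_i}-1,$$
multiplying both sides by $p_i$ and adding the inequalities side by side we obtain
$$\sum_ip_i\log\frac{p_i}{q_i}\le\sum_i\frac{p_i^2}{q_i}-1=\sum_i\frac{(p_i-q_i)^2}{q_i},$$
which means that the KL divergence is bounded by the chi-square divergence. Our lemma follows by specializing to the vectors $(p,1-p)$ and $(q,1-q)$. ∎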
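A quick numerical sanity check of a quadratic bound of this kind is sketched below. It assumes the bound takes the chi-square form $\mathrm{KL}(p\,\|\,q)\le(p-q)^2/(q(1-q))$ that follows from $\log x\le x-1$; the grid and the function names are arbitrary choices for illustration.

```python
from math import log


def kl(p, q):
    """Binary KL divergence KL(p || q) in nats, for p, q strictly inside (0, 1)."""
    return p * log(p / q) + (1.0 - p) * log((1.0 - p) / (1.0 - q))


def chi_square_bound(p, q):
    """Binary chi-square divergence (p - q)^2 / (q (1 - q))."""
    return (p - q) ** 2 / (q * (1.0 - q))


# Smallest slack of the bound over a grid of interior points.
grid = [i / 100 for i in range(1, 100)]
worst_gap = min(chi_square_bound(p, q) - kl(p, q) for p in grid for q in grid)
print(worst_gap)
```

The smallest gap is attained on the diagonal $p=q$, where both sides vanish.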
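To decide whether the means in two groups are equal, under the assumption of unequal variances, one performs Welch's t-test with the statistic
$$t=\frac{\bar X_1-\bar X_2}{\sqrt{s_1^2/n_1+s_2^2/n_2}},$$
where $s_i^2$ are the sample variances, $\bar X_i$ the sample means and $n_i$ the sample sizes for group $i$. The null hypothesis is rejected when the statistic is sufficiently large in absolute value. In our case the formula simplifies as follows.
If $r_1$ and $r_2$ successes out of $n$ trials have been observed respectively in the first and the second group then, with $p_i=r_i/n$ and the variance of the 0/1 outcomes in group $i$ taken as $p_i(1-p_i)$,
$$t=\sqrt{n}\cdot\frac{p_1-p_2}{\sqrt{p_1(1-p_1)+p_2(1-p_2)}}.$$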
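The simplification for success counts can be checked against the generic Welch statistic with a short sketch. It assumes the unbiased sample variance (divisor $n-1$) for the generic statistic and the plug-in variance $p(1-p)$ for the simplified one, so the two differ by a factor $\sqrt{n/(n-1)}$; all function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_from_samples(x, y):
    """Welch's t-statistic computed from raw observations."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def welch_t_from_counts(r1, n1, r2, n2):
    """Same statistic from success counts, using unbiased variances of 0/1 data."""
    p1, p2 = r1 / n1, r2 / n2
    s1 = p1 * (1.0 - p1) * n1 / (n1 - 1)  # sample variance of n1 zeros and ones
    s2 = p2 * (1.0 - p2) * n2 / (n2 - 1)
    return (p1 - p2) / sqrt(s1 / n1 + s2 / n2)


def welch_t_plugin(r1, r2, n):
    """Simplified form for equal group sizes with plug-in variances p (1 - p)."""
    p1, p2 = r1 / n, r2 / n
    return sqrt(n) * (p1 - p2) / sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))


x = [1] * 3 + [0] * 7   # 3 successes out of 10
y = [1] * 6 + [0] * 4   # 6 successes out of 10
print(welch_t_from_samples(x, y), welch_t_from_counts(3, 10, 6, 10), welch_t_plugin(3, 6, 10))
```

For large $n$ the difference between the two variance conventions is negligible.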
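We change the notation slightly: the unknown success rates will be $\theta_1$ and $\theta_2$, and the corresponding observed success rates $p_1=r_1/n$ and $p_2=r_2/n$.
Maximizing over all possible priors over pairs $(\theta_1,\theta_2)$ we get
$$\max_{H_a}\Pr[D\mid H_a]=C\cdot\max_{\theta_1,\theta_2}\prod_{i=1,2}\theta_i^{r_i}(1-\theta_i)^{n-r_i},$$
where $C=\binom{n}{r_1}\binom{n}{r_2}$ is a normalizing constant, which equals
$$C\cdot e^{-nH(p_1)-nH(p_2)},$$
achieved for $H_a$ being a unit mass at $(\theta_1,\theta_2)=(p_1,p_2)$.
Let $H_0$ state that the baseline is $\theta$ and the effect is $0$, that is $\theta_1=\theta_2=\theta$. Then we obtain
$$\Pr[D\mid H_0]=C\cdot e^{-nH(p_1,\theta)-nH(p_2,\theta)}$$
with the same normalizing constant $C$.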
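If neither of the two hypotheses is a priori preferred, that is when $\Pr[H_0]=\Pr[H_a]$, then the posterior odds equal the Bayes factor, i.e. the likelihood ratio (by Bayes' theorem)
$$\frac{\Pr[H_a\mid D]}{\Pr[H_0\mid D]}=\frac{\Pr[D\mid H_a]}{\Pr[D\mid H_0]}.$$
In turn, for the most favorable alternative and the null with baseline $\theta$, the likelihood ratio (in favor of $H_a$) equals
$$\frac{\Pr[D\mid H_a]}{\Pr[D\mid H_0]}=\exp\bigl(nH(p_1,\theta)+nH(p_2,\theta)-nH(p_1)-nH(p_2)\bigr)$$
(the normalizing constant cancels). Using the relation between the KL divergence and cross-entropy we obtain
$$\frac{\Pr[D\mid H_a]}{\Pr[D\mid H_0]}=\exp\bigl(n\,\mathrm{KL}(p_1\,\|\,\theta)+n\,\mathrm{KL}(p_2\,\|\,\theta)\bigr).$$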
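We will use the following observation.
The expression $\mathrm{KL}(p_1\,\|\,\theta)+\mathrm{KL}(p_2\,\|\,\theta)$ is minimized under $\theta=\frac{p_1+p_2}{2}$, and achieves value $2\,\mathrm{JS}(p_1,p_2)$.
The derivative of $H(p_1,\theta)+H(p_2,\theta)$ with respect to $\theta$, namely $-\frac{p_1+p_2}{\theta}+\frac{2-p_1-p_2}{1-\theta}$, vanishes at $\theta=\frac{p_1+p_2}{2}$. Now the existence of the minimum at $\theta=\frac{p_1+p_2}{2}$ follows by convexity of $\theta\mapsto H(p,\theta)$, proved in Lemma 1. We note that $\mathrm{KL}(p\,\|\,\theta)=H(p,\theta)-H(p)$ for any $p,\theta$ (by definition), and thus for $\theta=\frac{p_1+p_2}{2}$ we obtain $H(p_1,\theta)+H(p_2,\theta)=2H(\theta)$ and $\mathrm{KL}(p_1\,\|\,\theta)+\mathrm{KL}(p_2\,\|\,\theta)=2H(\theta)-H(p_1)-H(p_2)$. This combined with the definition of the Jensen-Shannon divergence finishes the proof. ∎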
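The key identity of this argument can be verified numerically: the log-likelihood ratio between the best point alternative and the best point null equals $2n$ times the Jensen-Shannon divergence of the observed rates. The sketch below assumes equal group sizes $n$ and counts strictly between $0$ and $n$; the function names are hypothetical.

```python
from math import log


def entropy(p):
    """Binary Shannon entropy in nats, with 0 * log(0) = 0."""
    return -sum(x * log(x) for x in (p, 1.0 - p) if x > 0.0)


def js(p1, p2):
    """Binary Jensen-Shannon divergence in nats."""
    return entropy(0.5 * (p1 + p2)) - 0.5 * (entropy(p1) + entropy(p2))


def log_lik(r, n, theta):
    """Binomial log-likelihood without the binomial coefficient (0 < theta < 1)."""
    return r * log(theta) + (n - r) * log(1.0 - theta)


def null_log_lik(r1, r2, n, theta):
    """Log-likelihood of both groups under a common rate theta."""
    return log_lik(r1, n, theta) + log_lik(r2, n, theta)


def max_log_likelihood_ratio(r1, r2, n):
    """Best point alternative (r1/n, r2/n) versus best point null (pooled rate)."""
    pooled = (r1 + r2) / (2 * n)
    best_alt = log_lik(r1, n, r1 / n) + log_lik(r2, n, r2 / n)
    return best_alt - null_log_lik(r1, r2, n, pooled)


# Made-up counts: 100 and 150 successes out of 1000 trials each.
print(max_log_likelihood_ratio(100, 150, 1000), 2 * 1000 * js(0.10, 0.15))
```

The constant in front of the likelihoods is omitted because it is identical under both hypotheses and cancels in the ratio.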
Connecting t-statistic and bayes factor exponent
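Recall that by Claim 1 under the t-test we have
$$t^2=n\cdot\frac{(p_1-p_2)^2}{p_1(1-p_1)+p_2(1-p_2)}.$$
It remains to connect $t$ and $\mathrm{JS}(p_1,p_2)$. By Lemma 2, applied to $\mathrm{KL}(p_i\,\|\,\theta)$ with $\theta=\frac{p_1+p_2}{2}$, we have the following refinement of Pinsker's inequality.
We have $\mathrm{JS}(p_1,p_2)\le\frac{(p_1-p_2)^2}{(p_1+p_2)(2-p_1-p_2)}$.
Claim 3 implies, since $(p_1+p_2)(2-p_1-p_2)\ge2p_1(1-p_1)+2p_2(1-p_2)$,
$$\mathrm{JS}(p_1,p_2)\le\frac{(p_1-p_2)^2}{2\bigl(p_1(1-p_1)+p_2(1-p_2)\bigr)};$$
we recognize Welch's statistic and write
$$\mathrm{JS}(p_1,p_2)\le\frac{t^2}{2n}.$$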
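The resulting relation between the divergence and the statistic can be probed on a grid. This sketch assumes the plug-in form $t^2=n(p_1-p_2)^2/(p_1(1-p_1)+p_2(1-p_2))$ and checks the inequality $2n\,\mathrm{JS}(p_1,p_2)\le t^2$, which is the constant obtained by chaining the chi-square bound with concavity of $x(1-x)$; the constants stated in the original displays may differ.

```python
from math import log


def entropy(p):
    """Binary Shannon entropy in nats, with 0 * log(0) = 0."""
    return -sum(x * log(x) for x in (p, 1.0 - p) if x > 0.0)


def js(p1, p2):
    """Binary Jensen-Shannon divergence in nats."""
    return entropy(0.5 * (p1 + p2)) - 0.5 * (entropy(p1) + entropy(p2))


def t_squared_plugin(p1, p2, n):
    """Squared t-statistic for equal group sizes with plug-in variances."""
    return n * (p1 - p2) ** 2 / (p1 * (1.0 - p1) + p2 * (1.0 - p2))


n = 1000  # arbitrary group size for the check
grid = [i / 50 for i in range(1, 50)]
slack = min(t_squared_plugin(p1, p2, n) - 2 * n * js(p1, p2)
            for p1 in grid for p2 in grid)
print(slack)
```

A non-negative slack over the grid is consistent with the Bayes factor exponent being controlled by the squared t-statistic.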
The author thanks Evan Miller for inspiring discussions.
- [Edwards et al., 1963] Edwards, W., Lindman, H., and Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3):193–242.
- [Goodman, 1999] Goodman, S. N. (1999). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12):1005–1013.
- [Jeffreys, 1998] Jeffreys, H. (1998). The Theory of Probability. Oxford Classic Texts in the Physical Sciences. OUP Oxford.
- [Kass and Raftery, 1995] Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795.
- [Kruschke and Liddell, 2018] Kruschke, J. K. and Liddell, T. M. (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1):178–206.
- [Lin, 1991] Lin, J. (1991). Divergence measures based on the shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
- [Morey and Rouder, 2018] Morey, R. D. and Rouder, J. N. (2018). BAYESFACTOR: computation of bayes factors for common designs. r package version 0.9.12-4.2. http://CRAN.R-project.org/package=BayesFactor.
- [Welch, 1938] Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29(3/4):350–362.