1 Introduction
1.1 Motivation
A/B testing
A/B testing is the technique of collecting data from two parallel experiments and comparing them by probabilistic inference. A particularly important case is assessing which of two success-counting experiments achieves a higher success rate. This naturally applies to evaluating conversion rates on two different versions of a webpage. A typical question is whether there is a difference (also called a nonzero effect) in conversion between the groups: if $\theta_1, \theta_2$ are the unknown conversion rates in the two experiments, the task is to compare the hypotheses $H_0: \theta_1 = \theta_2$ and $H_a: \theta_1 \neq \theta_2$, given the observed data. In the frequentist approach, one falsifies $H_0$ by the two-sample t-test [Welch, 1938]. In the Bayesian approach one evaluates the strength of both $H_0$ and $H_a$ and decides based on the ratio of the data likelihoods under the two hypotheses, called the Bayes factor
(1) 
which, converted by Bayes' theorem to
quantifies the posterior odds and allows a researcher to choose the model that is more plausible given the data (usually one gives $H_0$ and $H_a$ the same chance of being considered and sets $\Pr[H_0] = \Pr[H_a]$). The decision rule and confidence depend on the magnitude of the Bayes factor [Kass and Raftery, 1995, Jeffreys, 1998]. In the Bayesian approach a hypothesis assigns an arbitrary distribution to the parameters, which is more general.
Testing counts proportions
Suppose that empirical data has runs and successes in the th experiment, . Under the binomial counting model, the data likelihood under a hypothesis equals
(2) 
where the prior distribution reflects what is assumed prior to seeing data (and what will be tested); one can for example choose for and for uniformly over all valid values of , but in practice more informative priors are used because some configurations of values are unrealistic (e.g. extremely low or high conversion). The corresponding factor can be computed for example by the R package BayesFactor [Morey and Rouder, 2018].
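As a concrete illustration of the likelihood computation, the sketch below evaluates a Bayes factor in closed form under one particular choice of priors. That choice is an assumption made here for illustration only: the alternative puts independent uniform priors on the two rates, and the null puts a uniform prior on a single shared rate. The argument names `k1, n1, k2, n2` (successes and runs per group) and the function names are hypothetical, and the factor is oriented as alternative over null.

```python
from math import lgamma, exp


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_bayes_factor_uniform(k1: int, n1: int, k2: int, n2: int) -> float:
    """Log Bayes factor of 'different rates' against 'equal rates'.

    Assumed priors (an illustrative choice, not prescribed by the text):
      alternative: the two rates are independent and uniform on [0, 1];
      null:        one shared rate, uniform on [0, 1].
    The binomial coefficients occur in both marginal likelihoods and cancel.
    """
    # Integral of t^k (1 - t)^(n - k) over [0, 1] equals B(k + 1, n - k + 1).
    log_alt = log_beta(k1 + 1, n1 - k1 + 1) + log_beta(k2 + 1, n2 - k2 + 1)
    log_null = log_beta(k1 + k2 + 1, (n1 - k1) + (n2 - k2) + 1)
    return log_alt - log_null


if __name__ == "__main__":
    # Identical outcomes favor the null; very different outcomes favor the alternative.
    for data in [(50, 100, 50, 100), (10, 100, 90, 100)]:
        print(data, exp(log_bayes_factor_uniform(*data)))
```

Working on the log scale avoids underflow for large sample sizes. More informative priors, as recommended in the text, would generally require numerical integration instead of this closed form.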
Problem: Bayesian A/B testing power
Neither frequentist nor Bayesian estimates are conclusive without sufficiently many samples. Frequentists widely use rules of thumb derived from t-tests. Under the Bayesian methodology this is a little more complicated, because hypotheses can be arbitrary priors over the parameters. Under the binomial A/B model, we will answer the following questions:

when, given the data, may a Bayesian hypothesis of zero effect be rejected ( for some )?

what is the relation to the classical t-test?
This will allow us to understand data limitations when doing Bayesian inference, and relate them to widespread frequentist rules of thumb.
1.2 Related Works and Contribution
Our problem, as stated, is a question about maximizing the minimal Bayes factor. It is known that for certain problems Bayes factors can be related to frequentist p-values [Edwards et al., 1963, Kass and Raftery, 1995, Goodman, 1999], which bridges the Bayesian and frequentist worlds (this should be contrasted with a widespread belief that the two methods are highly incompatible [Kruschke and Liddell, 2018]). The novel contributions of this paper are (a) bounding the Bayes factor for binomial distributions and (b) a discussion of sample bounds for binomial A/B testing in relation to the frequentist approach.
Main result: Bayes factor and Welch’s statistic
The following theorem shows that no “zero-effect” hypothesis can be falsified unless the number of samples is large in relation to a certain dataset statistic. This statistic turns out to be the Jensen-Shannon divergence, well known in information theory. It is in turn bounded by Welch's t-statistic.
Theorem 1 (Bayes Factors for Binomial Testing).
Consider two independent experiments, each with independent trials with unknown success probabilities and respectively. Let the observed data have successes and failures for group . Then
(3) 
where the maximum is over null hypotheses (priors) over such that , the minimum is over all valid alternative hypotheses (priors) over , and denotes the Jensen-Shannon divergence. Moreover, the Jensen-Shannon divergence is bounded by Welch's statistic (on )
(4) 
so that we can bound
(5) 
Remark 1 (Most favorable hypotheses).
Note that

The maximally favorable alternative ( which maximizes ) is and

The maximally favorable null of the form is
If the null is of the form then the bound becomes .
Application: sample bounds
The main result implies the following sample rule
Corollary 1 (Bayesian Sample Bound).
To confirm the nonzero effect () the number of samples for the Bayesian method should be
(6) 
Under the frequentist method the rule of thumb is , which gives (see Section 2)
(7) 
Note that both formulas need assumptions on the locations of the parameters. In particular, testing smaller effects or effects with higher variance requires more samples.
The bounds in Equation 6 and Equation 7 agree up to a constant factor (a different small factor is necessary to make the bound small in both the Bayesian credibility and the p-value sense). The difference (under the normalized constant) is illustrated in Figure 1, for the case when one wants to test a relative uplift of .
Since high values of mean small p-values, we conclude that frequentist p-values bound the Bayes factor and indeed are evidence against the null hypothesis in a well-defined Bayesian sense. However, because of the scaling , this holds only for p-values much lower than the standard threshold of 0.05. In some sense, the Bayesian approach is more conservative, that is, more reluctant to reject, than frequentist tests; this conclusion is shared with other works [Goodman, 1999].
2 Preliminaries
Entropy, Divergence
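The binary cross-entropy of $p$ and $q$ is defined by
$$H(p, q) = -p \log q - (1 - p) \log (1 - q) \tag{8}$$
which becomes the standard (Shannon) binary entropy when $p = q$, denoted as $H(p)$. The Kullback-Leibler divergence is defined as
$$\mathrm{KL}(p \,\|\, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q} \tag{9}$$
and the Jensen-Shannon divergence [Lin, 1991] is defined as
$$\mathrm{JS}(p, q) = H\left(\frac{p + q}{2}\right) - \frac{H(p) + H(q)}{2} \tag{10}$$
(always nonnegative because the entropy is concave, and zero when $p = q$).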
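The definitions above can be sketched in a few lines of code. This is a minimal illustration assuming the standard binary forms with natural logarithms and the convention $0 \log 0 = 0$; the function names are hypothetical choices made here.

```python
from math import log


def _xlogy(x: float, y: float) -> float:
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0.0 else x * log(y)


def cross_entropy(p: float, q: float) -> float:
    """Binary cross-entropy of p and q (q must not put zero mass where p has mass)."""
    return -_xlogy(p, q) - _xlogy(1.0 - p, 1.0 - q)


def entropy(p: float) -> float:
    """Shannon binary entropy, i.e. the cross-entropy of p with itself."""
    return cross_entropy(p, p)


def kl_divergence(p: float, q: float) -> float:
    """Kullback-Leibler divergence, written as cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)


def js_divergence(p: float, q: float) -> float:
    """Jensen-Shannon divergence with equal weights on p and q."""
    return entropy((p + q) / 2.0) - (entropy(p) + entropy(q)) / 2.0


if __name__ == "__main__":
    p, q = 0.10, 0.12
    print(cross_entropy(p, q), kl_divergence(p, q), js_divergence(p, q))
```

The Jensen-Shannon divergence also equals the average of the two Kullback-Leibler divergences to the midpoint $(p+q)/2$, which is the form that appears when a common rate is fitted to both groups.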
The following lemma shows that the cross-entropy function $H(p, q)$ is convex in its second argument $q$. This should be contrasted with the fact that the entropy function (of one argument) is concave.
Lemma 1 (Convexity of cross-entropy).
For any $p \in [0, 1]$ the mapping $q \mapsto H(p, q)$ is convex in $q$.
Proof.
Since for is convex we obtain
for any and any , . Replacing by and by in the above inequality gives us also
Adding side by side yields
which finishes the proof. This argument also works in the multivariate case, when
are probability vectors. ∎
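The convexity claim of Lemma 1 can be sanity-checked numerically. The sketch below is not a proof; it only tests midpoint convexity of the binary cross-entropy in its second argument on a grid. The grid resolution and function names are arbitrary choices made here.

```python
from math import log


def cross_entropy(p: float, q: float) -> float:
    """Binary cross-entropy -p*log(q) - (1-p)*log(1-q), for q strictly inside (0, 1)."""
    return -p * log(q) - (1.0 - p) * log(1.0 - q)


def max_convexity_gap(p: float, steps: int = 50) -> float:
    """Largest value of H(p, mid) - (H(p, a) + H(p, b)) / 2 over grid pairs (a, b).

    Convexity in the second argument means this is never positive,
    up to floating-point rounding.
    """
    grid = [i / steps for i in range(1, steps)]
    worst = float("-inf")
    for a in grid:
        for b in grid:
            mid = (a + b) / 2.0
            gap = cross_entropy(p, mid) - (cross_entropy(p, a) + cross_entropy(p, b)) / 2.0
            worst = max(worst, gap)
    return worst


if __name__ == "__main__":
    for p in (0.0, 0.2, 0.5, 0.9, 1.0):
        print(p, max_convexity_gap(p))
```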
Lemma 2 (Quadratic bounds on KL/crossentropy).
For any it holds that
(11) 
Proof.
We will prove a general version. Let and be probability vectors of the same length. By the elementary inequality
(12) 
we obtain
(13)  
(14) 
multiplying both sides by and adding inequalities side by side we obtain
(15)  
(16) 
which means . Our lemma follows by specializing to the vectors and . ∎
2-Sample test
To decide whether the means in two groups are equal, under the assumption of unequal variances, one performs Welch's t-test with the statistic
$$t = \frac{\mu_1 - \mu_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}} \tag{17}$$
where $s_i^2$ are the sample variances, $\mu_i$ are the sample means, and $n_i$ are the sample sizes for group $i$. The null hypothesis is rejected when the statistic is sufficiently high (in absolute terms). In our case the formula simplifies as follows.
Claim 1.
If and successes out of trials have been observed in the first and the second group, respectively, then
(18) 
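For 0/1 outcomes Welch's statistic can be computed from the success counts alone. The sketch below assumes the unbiased sample variance, which for a group with $k$ successes in $n$ trials equals $\frac{n}{n-1}\hat{p}(1-\hat{p})$ with $\hat{p} = k/n$; the simplified formula in the claim may use a slightly different normalization. Function and argument names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_from_counts(k1: int, n1: int, k2: int, n2: int) -> float:
    """Welch's t-statistic for two groups of 0/1 outcomes, given success counts.

    Assumes n1, n2 >= 2 and that at least one group has both successes and
    failures (otherwise both sample variances vanish and t is undefined).
    """
    p1, p2 = k1 / n1, k2 / n2
    # Unbiased sample variance of a 0/1 sample: n / (n - 1) * p * (1 - p).
    v1 = p1 * (1.0 - p1) * n1 / (n1 - 1)
    v2 = p2 * (1.0 - p2) * n2 / (n2 - 1)
    return (p1 - p2) / sqrt(v1 / n1 + v2 / n2)


def welch_t_from_samples(x: list, y: list) -> float:
    """Welch's t-statistic computed directly from raw samples, for cross-checking."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


if __name__ == "__main__":
    # Hypothetical data: 30/100 conversions in group 1, 45/100 in group 2.
    x = [1.0] * 30 + [0.0] * 70
    y = [1.0] * 45 + [0.0] * 55
    print(welch_t_from_counts(30, 100, 45, 100), welch_t_from_samples(x, y))
```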
3 Proof
We change the notation slightly: the unknown success rates will be and , and the corresponding successes .
Alternatives
Maximizing over all possible priors over pairs we get
(19) 
where is a normalizing constant, which equals
(20) 
achieved for being a unit mass at .
Null
Let state that the baseline is and the effect is . Then we obtain
(21) 
with the same normalizing constant .
Bayes factor
If neither of the two hypotheses is a priori preferred, that is when , then the Bayes factor equals the likelihood ratio (by Bayes' theorem)
(22) 
In turn the likelihood ratio (in favor of ) equals
(23) 
(the normalizing constant cancels). Using the relation between the KL divergence and cross-entropy we obtain
(24) 
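The step from the likelihood ratio to Kullback-Leibler divergences can be checked numerically. The sketch below assumes a point-mass alternative located at the empirical rates (the maximum-likelihood choice) and a point-mass null at rates `theta1`, `theta2`; under these assumptions the log-likelihood ratio equals $\sum_i n_i \,\mathrm{KL}(\hat{p}_i \,\|\, \theta_i)$. All names are hypothetical, and the ratio is oriented as alternative over null.

```python
from math import log


def _xlogy(x: float, y: float) -> float:
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def log_likelihood(k: int, n: int, theta: float) -> float:
    """Binomial log-likelihood without the binomial coefficient (it cancels in ratios)."""
    return _xlogy(k, theta) + _xlogy(n - k, 1.0 - theta)


def kl_divergence(p: float, q: float) -> float:
    """Binary Kullback-Leibler divergence KL(p || q), for q strictly inside (0, 1)."""
    return _xlogy(p, p / q) + _xlogy(1.0 - p, (1.0 - p) / (1.0 - q))


def log_likelihood_ratio(k1, n1, k2, n2, theta1, theta2) -> float:
    """log of [likelihood at the empirical rates] / [likelihood at (theta1, theta2)]."""
    p1, p2 = k1 / n1, k2 / n2
    best = log_likelihood(k1, n1, p1) + log_likelihood(k2, n2, p2)
    null = log_likelihood(k1, n1, theta1) + log_likelihood(k2, n2, theta2)
    return best - null


if __name__ == "__main__":
    # Hypothetical data and a null that assigns the same rate 0.4 to both groups.
    k1, n1, k2, n2, theta = 30, 100, 45, 100, 0.4
    direct = log_likelihood_ratio(k1, n1, k2, n2, theta, theta)
    via_kl = n1 * kl_divergence(k1 / n1, theta) + n2 * kl_divergence(k2 / n2, theta)
    print(direct, via_kl)
```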
We will use the following observation
Claim 2.
The expression is minimized under , and achieves value .
Proof.
We have
Now the existence of the minimum at follows by convexity of , proved in Lemma 1. We note that for any (by definition), and thus for we obtain and . This combined with the definition of the JensenShannon divergence finishes the proof. ∎
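The minimization in Claim 2 can be illustrated with a simple grid search. The sketch assumes equal group sizes and a null that assigns one common rate $\theta$ to both groups, so that the quantity being minimized is $\mathrm{KL}(p_1 \,\|\, \theta) + \mathrm{KL}(p_2 \,\|\, \theta)$; the minimizer should then be the midpoint $(p_1 + p_2)/2$ and the minimum should equal twice the Jensen-Shannon divergence. The rates `p1`, `p2` and the grid size are hypothetical.

```python
from math import log


def _xlogy(x: float, y: float) -> float:
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0.0 else x * log(y)


def entropy(p: float) -> float:
    """Shannon binary entropy."""
    return -_xlogy(p, p) - _xlogy(1.0 - p, 1.0 - p)


def kl_divergence(p: float, q: float) -> float:
    """Binary Kullback-Leibler divergence KL(p || q), for q strictly inside (0, 1)."""
    return _xlogy(p, p / q) + _xlogy(1.0 - p, (1.0 - p) / (1.0 - q))


def js_divergence(p: float, q: float) -> float:
    """Jensen-Shannon divergence with equal weights."""
    return entropy((p + q) / 2.0) - (entropy(p) + entropy(q)) / 2.0


def best_common_rate(p1: float, p2: float, steps: int = 10_000):
    """Grid search for the common rate minimizing KL(p1 || t) + KL(p2 || t)."""
    grid = (i / steps for i in range(1, steps))
    theta = min(grid, key=lambda t: kl_divergence(p1, t) + kl_divergence(p2, t))
    return theta, kl_divergence(p1, theta) + kl_divergence(p2, theta)


if __name__ == "__main__":
    p1, p2 = 0.30, 0.45
    theta, value = best_common_rate(p1, p2)
    print(theta, (p1 + p2) / 2.0)           # minimizer vs. midpoint
    print(value, 2.0 * js_divergence(p1, p2))  # minimum vs. twice the JS divergence
```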
Connecting the t-statistic and the Bayes factor exponent
Recall that by Claim 1, for the t-test we have
(26) 
It remains to connect and . By Lemma 2 we have the following refinement of Pinsker’s inequality
Claim 3.
We have .
Using , the inequality from Claim 3, and Welch's formula in Equation 26 we obtain
Claim 4.
We have
(27) 
Proof.
Combining Equation 25 and Equation 27 implies the second part of the theorem.
Acknowledgments
The author thanks Evan Miller for inspiring discussions.
References
 [Edwards et al., 1963] Edwards, W., Lindman, H., and Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3):193–242.
 [Goodman, 1999] Goodman, S. N. (1999). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12):1005–1013.
 [Jeffreys, 1998] Jeffreys, H. (1998). The Theory of Probability. Oxford Classic Texts in the Physical Sciences. OUP Oxford.
 [Kass and Raftery, 1995] Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795.

 [Kruschke and Liddell, 2018] Kruschke, J. K. and Liddell, T. M. (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1):178–206.
 [Lin, 1991] Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
 [Morey and Rouder, 2018] Morey, R. D. and Rouder, J. N. (2018). BayesFactor: Computation of Bayes factors for common designs. R package version 0.9.12-4.2. http://CRAN.R-project.org/package=BayesFactor.
 [Welch, 1938] Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29(3/4):350–362.