 # Bounds on Bayes Factors for Binomial A/B Testing

Bayes factors, in many cases, have been proven to bridge the classic p-value based significance testing and the Bayesian analysis of posterior odds. This paper discusses this phenomenon within the binomial A/B testing setup (applicable for example to conversion testing). It is shown that the Bayes factor is controlled by the Jensen-Shannon divergence of success ratios in the two tested groups, which can be further bounded by the Welch statistic. As a result, Bayesian sample bounds almost match frequentist sample bounds. The link between the Jensen-Shannon divergence and Welch's test, as well as the derivation, are an elegant application of tools from information geometry.


## 1 Introduction

### 1.1 Motivation

#### A/B testing

A/B testing is the technique of collecting data from two parallel experiments and comparing them by probabilistic inference. A particularly important case is assessing which of two success-counting experiments achieves a higher success rate. This naturally applies to evaluating conversion rates on two different versions of a webpage. A typical question being asked is whether there is a difference (also called a non-zero effect) in conversion between groups: if $p_1, p_2$ are the unknown conversion rates in the two experiments, the task is to compare the hypotheses $H_0 = \{p_1 = p_2\}$ and $H_a = \{p_1 \neq p_2\}$, given observed data $D$. In the frequentist approach, one falsifies $H_0$ by the two-sample t-test [Welch, 1938]. In the Bayesian approach one evaluates the strength of both $H_0$ and $H_a$ and decides based on the ratio called the Bayes factor

$$K = \frac{\Pr[D|H_0]}{\Pr[D|H_a]} \tag{1}$$

which, converted by Bayes' theorem to

$$\frac{\Pr[H_0|D]}{\Pr[H_a|D]} = K\cdot\frac{\Pr[H_0]}{\Pr[H_a]},$$

quantifies the posterior odds and allows a researcher to choose the model that is more plausible given the data (usually one gives $H_0$ and $H_a$ the same chance of being considered and sets $\Pr[H_0]=\Pr[H_a]$). The decision rule and confidence depend on the magnitude of $K$ [Kass and Raftery, 1995, Jeffreys, 1998]. In the Bayesian approach a hypothesis assigns an arbitrary distribution to the parameters, which is more general.

#### Testing counts proportions

Suppose that the empirical data $D$ has $r_i$ runs and $\bar{p}_i r_i$ successes in the $i$-th experiment, $i=1,2$ (so that $\bar{p}_i$ is the observed success rate). Under the binomial counting model, the data likelihood under a hypothesis $H$ equals

$$\Pr[D|H] = \int \prod_{i=1}^{2} p_i^{\bar{p}_i r_i}(1-p_i)^{(1-\bar{p}_i) r_i}\, dP_H(p_1,p_2) \tag{2}$$

where the prior distribution $P_H$ reflects what is assumed prior to seeing data (and what will be tested); one can for example choose $p_1 = p_2$ for $H_0$ and $p_1 \neq p_2$ for $H_a$, uniformly over all valid values of $p_1, p_2$, but in practice more informative priors are used because some configurations of values are unrealistic (e.g. extremely low or high conversion). The corresponding Bayes factor can be computed for example by the R package BayesFactor [Morey and Rouder, 2018].
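As an illustration of the integral in (2), the following minimal Python sketch evaluates the Bayes factor for the uniform-prior example, assuming that the null puts a uniform prior on the common rate and the alternative puts independent uniform priors on the two rates. Under these assumptions both integrals reduce to Beta functions. The function names are hypothetical and this is not the method used by the BayesFactor package.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bayes_factor_uniform(s1: int, n1: int, s2: int, n2: int) -> float:
    """Bayes factor K = Pr[D|H0] / Pr[D|Ha] for s_i successes in n_i runs.

    Assumed priors (illustrative only):
      H0: p1 = p2 = p with p uniform on [0, 1]
      Ha: p1 and p2 independent, each uniform on [0, 1]
    Binomial coefficients are identical in both likelihoods and cancel.
    """
    f1, f2 = n1 - s1, n2 - s2
    # Integral of p^(s1+s2) * (1-p)^(f1+f2) over p in [0, 1].
    log_m0 = log_beta(s1 + s2 + 1, f1 + f2 + 1)
    # Product of two one-dimensional integrals, one per group.
    log_ma = log_beta(s1 + 1, f1 + 1) + log_beta(s2 + 1, f2 + 1)
    return exp(log_m0 - log_ma)


if __name__ == "__main__":
    # Hypothetical counts: 100 runs per group.
    print(bayes_factor_uniform(50, 100, 50, 100))  # identical rates observed
    print(bayes_factor_uniform(30, 100, 70, 100))  # very different rates
```

Working in log space avoids underflow for large counts, since each marginal likelihood is extremely small on its own.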

#### Problem: Bayesian A/B testing power

Neither frequentist nor Bayesian estimates are conclusive without sufficiently many samples. Frequentists widely use rules of thumb derived from t-tests. Under the Bayesian methodology this is a little more complicated, because hypotheses can be arbitrary priors over parameters. Under the binomial A/B model, we will answer the following questions:

• when, given data, may a Bayesian hypothesis of zero effect be rejected (that is, when is the Bayes factor $K$ small for some alternative $H_a$)?

• what is the relation to the classical t-test?

This will allow us to understand data limitations when doing Bayesian inference, and relate them to the widespread frequentist rules of thumb.

### 1.2 Related Works and Contribution

Our problem, as stated, is a question about maximizing the minimal Bayes factor. It is known that for certain problems Bayes factors can be related to frequentist p-values [Edwards et al., 1963, Kass and Raftery, 1995, Goodman, 1999] and thus bridge the Bayesian and frequentist worlds (this should be contrasted with the widespread belief that both methods are largely incompatible [Kruschke and Liddell, 2018]). The novel contributions of this paper are (a) bounding the Bayes factor for binomial distributions and (b) a discussion of sample bounds for binomial A/B testing in relation to the frequentist approach.

#### Main result: Bayes factor and Welch’s statistic

The following theorem shows that no “zero-effect” hypothesis can be falsified unless the number of samples is large in relation to a certain dataset statistic. This statistic turns out to be the Jensen-Shannon divergence, well known in information theory. It is in turn bounded by Welch’s t-statistic.

###### Theorem 1 (Bayes Factors for Binomial Testing).

Consider two independent experiments, each with $r$ independent trials, with unknown success probabilities $p_1$ and $p_2$ respectively. Let the observed data $D$ have $r\bar{p}_i$ successes and $r(1-\bar{p}_i)$ failures for group $i$. Then

$$\max_{H_0:\{p_1=p_2\}}\ \min_{H_a}\ \frac{\Pr[H_0|D]}{\Pr[H_a|D]} = e^{-2r\cdot \mathrm{JS}(\bar{p}_1,\bar{p}_2)} \tag{3}$$

where the maximum is over null hypotheses (priors) $H_0$ over $(p_1,p_2)$ such that $p_1=p_2$, the minimum is over all valid alternative hypotheses (priors) $H_a$ over $(p_1,p_2)$, and $\mathrm{JS}$ denotes the Jensen-Shannon divergence.

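Moreover, the Jensen-Shannon divergence is bounded by Welch’s $t$-statistic (computed on $D$)

$$\mathrm{JS}(\bar{p}_1,\bar{p}_2) \geq \frac{t_{\mathrm{Welch}}(t,\bar{p}_1,\bar{p}_2)^2}{4r} \tag{4}$$

so that we can bound

$$\max_{H_0:\{p_1=p_2\}}\ \min_{H_a}\ \frac{\Pr[H_0|D]}{\Pr[H_a|D]} \leq e^{-t_{\mathrm{Welch}}(t,\bar{p}_1,\bar{p}_2)^2/2} \tag{5}$$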
###### Remark 1 (Most favorable hypotheses).

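Note that

• The maximally favorable alternative ($H_a$ which maximizes $\Pr[D|H_a]$) is $p_1=\bar{p}_1$ and $p_2=\bar{p}_2$.

• The maximally favorable null of the form $p_1=p_2=p$ is $p=\frac{\bar{p}_1+\bar{p}_2}{2}$.

If the null is of the form $p_1=p_2=p$ for a fixed $p$, then the bound becomes $e^{-r\,\mathrm{KL}(\bar{p}_1,p)-r\,\mathrm{KL}(\bar{p}_2,p)}$.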
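To make the statement concrete, here is a small Python sketch that evaluates the right-hand sides of (3) and (5) and, separately, the likelihood ratio between the most favorable point null (the pooled rate) and the most favorable alternative (the observed rates). It assumes natural logarithms and equal group sizes $r$. All function names are hypothetical.

```python
from math import exp, log


def entropy(p: float) -> float:
    """Binary Shannon entropy in nats, with 0*log(0) taken as 0."""
    return -sum(x * log(x) for x in (p, 1.0 - p) if x > 0.0)


def js(p: float, q: float) -> float:
    """Jensen-Shannon divergence between Bernoulli(p) and Bernoulli(q)."""
    return entropy((p + q) / 2.0) - 0.5 * entropy(p) - 0.5 * entropy(q)


def log_likelihood(theta: float, p: float, r: int) -> float:
    """log of p^(r*theta) * (1-p)^(r*(1-theta)); binomial coefficient omitted."""
    return r * (theta * log(p) + (1.0 - theta) * log(1.0 - p))


def bayes_factor_bound(theta1: float, theta2: float, r: int) -> float:
    """Right-hand side of (3): exp(-2 r JS)."""
    return exp(-2.0 * r * js(theta1, theta2))


def welch_exponent_bound(theta1: float, theta2: float, r: int) -> float:
    """Right-hand side of (5): exp(-t^2 / 2) with the binomial Welch statistic."""
    t_sq = r * (theta1 - theta2) ** 2 / (
        theta1 * (1.0 - theta1) + theta2 * (1.0 - theta2)
    )
    return exp(-t_sq / 2.0)


def best_null_vs_best_alternative(theta1: float, theta2: float, r: int) -> float:
    """Likelihood ratio: pooled-rate null against the observed-rate alternative."""
    pooled = (theta1 + theta2) / 2.0
    log_null = log_likelihood(theta1, pooled, r) + log_likelihood(theta2, pooled, r)
    log_alt = log_likelihood(theta1, theta1, r) + log_likelihood(theta2, theta2, r)
    return exp(log_null - log_alt)


if __name__ == "__main__":
    r, theta1, theta2 = 1000, 0.10, 0.12  # hypothetical observed rates
    print(bayes_factor_bound(theta1, theta2, r))
    print(best_null_vs_best_alternative(theta1, theta2, r))
    print(welch_exponent_bound(theta1, theta2, r))
```

The first two printed values should agree up to rounding, which is the content of (3) restricted to point hypotheses. The third value is printed only for comparison.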

#### Application: sample bounds

The main result implies the following sample rule

###### Corollary 1 (Bayesian Sample Bound).

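To confirm the non-zero effect ($p_1 \neq p_2$) the number of samples for the Bayesian method should be

$$r \gg \frac{1}{2\,\mathrm{JS}(p_1,p_2)} \tag{6}$$

Under the frequentist method the rule of thumb is $t_{\mathrm{Welch}}^2 \gg 2$, which gives (see Section 2)

$$r \gg \frac{2\left(p_1(1-p_1)+p_2(1-p_2)\right)}{(p_1-p_2)^2} \tag{7}$$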

Note that both formulas need assumptions on the locations of the parameters. In particular, testing smaller effects or effects with higher variance requires more samples.
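The two sample rules can be compared numerically. The sketch below evaluates the right-hand sides of (6) and (7) for a hypothetical baseline rate and relative uplift. The function names and the example values are assumptions made for illustration, and the thresholds are the raw quantities without any extra confidence factor.

```python
from math import log


def entropy(p: float) -> float:
    """Binary Shannon entropy in nats, with 0*log(0) taken as 0."""
    return -sum(x * log(x) for x in (p, 1.0 - p) if x > 0.0)


def js(p: float, q: float) -> float:
    """Jensen-Shannon divergence between Bernoulli(p) and Bernoulli(q)."""
    return entropy((p + q) / 2.0) - 0.5 * entropy(p) - 0.5 * entropy(q)


def bayesian_sample_bound(p1: float, p2: float) -> float:
    """Right-hand side of (6): 1 / (2 JS(p1, p2))."""
    return 1.0 / (2.0 * js(p1, p2))


def frequentist_sample_bound(p1: float, p2: float) -> float:
    """Right-hand side of (7): 2 (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2."""
    return 2.0 * (p1 * (1.0 - p1) + p2 * (1.0 - p2)) / (p1 - p2) ** 2


if __name__ == "__main__":
    p, delta = 0.05, 0.1  # hypothetical baseline rate and 10% relative uplift
    p2, p1 = p, p * (1.0 + delta)
    print(bayesian_sample_bound(p1, p2))
    print(frequentist_sample_bound(p1, p2))
```

Halving the uplift roughly quadruples both thresholds, since both scale like the inverse squared difference of the rates.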

The bounds in Equation 6 and Equation 7 are within a constant factor of each other (a different small factor is necessary to make the bound small in both the Bayesian credibility and the p-value sense). The difference (under the normalized constant) is illustrated in Figure 1, for the case when one wants to test a relative uplift of 10%.

Figure 1: Comparison of the Bayesian (6) and the frequentist (7) sample lower bounds, where $p_2=p$ and $p_1=p\cdot(1+\delta)$ for $\delta=0.1$ (10% uplift). Both formulas are multiplied by a factor of 2 to accommodate meaningful confidence.

Since high values of $t_{\mathrm{Welch}}$ mean small p-values, we conclude that frequentist p-values bound the Bayes factor and indeed are evidence against a null hypothesis in a well-defined Bayesian sense. However, because of the scaling $e^{-t_{\mathrm{Welch}}^2/2}$, this is true for p-values much lower than the standard threshold of 0.05. In some sense, the Bayesian approach is more conservative and more reluctant to reject the null than frequentist tests; this conclusion is shared with other works [Goodman, 1999].

## 2 Preliminaries

#### Entropy, Divergence

The binary cross-entropy of $p$ and $q$ is defined by

$$H(p,q) = -p\log q-(1-p)\log(1-q) \tag{8}$$

which becomes the standard (Shannon) binary entropy when $p=q$, denoted as $H(p)$. The Kullback-Leibler divergence is defined as

$$\mathrm{KL}(p,q) = H(p,q)-H(p) \tag{9}$$

and the Jensen-Shannon divergence [Lin, 1991] is defined as

$$\mathrm{JS}(p,q) = H\left(\frac{p+q}{2}\right)-\frac{1}{2}H(p)-\frac{1}{2}H(q) \tag{10}$$

(always non-negative because the entropy is concave).
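The definitions above translate directly into code. The following Python sketch assumes natural logarithms and the standard mid-point form of the Jensen-Shannon divergence, $H\left(\frac{p+q}{2}\right)-\frac{1}{2}H(p)-\frac{1}{2}H(q)$. The helper `xlogy` is a hypothetical convenience for the convention $0\log 0 = 0$.

```python
from math import log


def xlogy(x: float, y: float) -> float:
    """Return x*log(y), using the convention 0*log(0) = 0."""
    return 0.0 if x == 0.0 else x * log(y)


def cross_entropy(p: float, q: float) -> float:
    """Binary cross-entropy H(p, q) in nats.

    q must lie strictly inside (0, 1) unless the matching mass of p is zero.
    """
    return -xlogy(p, q) - xlogy(1.0 - p, 1.0 - q)


def entropy(p: float) -> float:
    """Binary Shannon entropy H(p) = H(p, p)."""
    return cross_entropy(p, p)


def kl(p: float, q: float) -> float:
    """Kullback-Leibler divergence KL(p, q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)


def js(p: float, q: float) -> float:
    """Jensen-Shannon divergence in its mid-point entropy form."""
    return entropy((p + q) / 2.0) - 0.5 * entropy(p) - 0.5 * entropy(q)


if __name__ == "__main__":
    print(entropy(0.5), kl(0.2, 0.7), js(0.2, 0.7))
```

With these definitions, $2\,\mathrm{JS}(p,q)$ equals the sum of the two KL divergences to the mid-point $\frac{p+q}{2}$, which is the form used later in the proof.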

The following lemma shows that the cross-entropy function is convex in the second argument. This should be contrasted with the fact that the entropy function (of one argument) is concave.

###### Lemma 1 (Convexity of cross-entropy).

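For any $p\in[0,1]$ the mapping $x\to H(p,x)$ is convex in $x$.

###### Proof.

Since $x\to -\log x$ is convex for $x>0$, we obtain

$$-\gamma_1 p\log x_1-\gamma_2 p\log x_2 \geq -p\log(\gamma_1x_1+\gamma_2x_2)$$

for any $x_1,x_2\in(0,1)$ and any $\gamma_1,\gamma_2\geq 0$, $\gamma_1+\gamma_2=1$. Replacing $p$ by $1-p$ and $x_i$ by $1-x_i$ in the above inequality gives us also

$$-\gamma_1(1-p)\log(1-x_1)-\gamma_2(1-p)\log(1-x_2) \geq -(1-p)\log(\gamma_1(1-x_1)+\gamma_2(1-x_2)) = -(1-p)\log(1-\gamma_1x_1-\gamma_2x_2)$$

Adding the two inequalities side by side yields

$$\gamma_1H(p,x_1)+\gamma_2H(p,x_2) \geq H(p,\gamma_1x_1+\gamma_2x_2)$$

which finishes the proof. This argument also works in the multivariate case, when $p$ and $x$ are probability vectors. ∎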

###### Lemma 2 (Quadratic bounds on KL/cross-entropy).

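For any $p,x\in(0,1)$ it holds that

$$\mathrm{KL}(p,x) \geq \frac{1}{2}\left(\frac{1}{p}+\frac{1}{1-p}\right)\cdot(x-p)^2 \tag{11}$$
###### Proof.

We will prove a general version. Let $p=(p_i)_i$ and $x=(x_i)_i$ be probability vectors of the same length. By the elementary inequality

$$\log(1+u) \geq u-\frac{1}{2}u^2 \tag{12}$$

we obtain

$$-\log(x_i/p_i) = -\log\left(1-\frac{p_i-x_i}{p_i}\right) \tag{13}$$

$${}\geq -\frac{x_i-p_i}{p_i}+\frac{1}{2}\left(\frac{p_i-x_i}{p_i}\right)^2 \tag{14}$$

Multiplying both sides by $p_i$ and adding the inequalities side by side we obtain

$$-\sum_i p_i\log(x_i/p_i) \geq -\sum_i(x_i-p_i)+\sum_i\frac{(p_i-x_i)^2}{2p_i} \tag{15}$$

$${}= \sum_i\frac{(p_i-x_i)^2}{2p_i} \tag{16}$$

where the first sum vanishes because both vectors sum to one. This means $\mathrm{KL}(p,x)\geq\sum_i\frac{(p_i-x_i)^2}{2p_i}$. Our lemma follows by specializing to the vectors $(p,1-p)$ and $(x,1-x)$. ∎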

#### 2-Sample test

To decide whether the means in two groups are equal, under the assumption of unequal variances, one performs Welch’s t-test with the statistic

$$t_{\mathrm{Welch}} = \frac{\mu_1-\mu_2}{\sqrt{\frac{s_1^2}{r_1}+\frac{s_2^2}{r_2}}} \tag{17}$$

where $s_i^2$ are the sample variances, $\mu_i$ are the sample means and $r_i$ are the sample sizes for group $i$. The null hypothesis is rejected when the statistic is sufficiently high (in absolute terms). In our case the formula simplifies as follows.

###### Claim 1.

If $r\theta_1$ and $r\theta_2$ successes out of $r$ trials have been observed respectively in the first and the second group then

$$t_{\mathrm{Welch}}(t,\theta_1,\theta_2) = r^{\frac{1}{2}}\cdot\frac{\theta_1-\theta_2}{\sqrt{\theta_1(1-\theta_1)+\theta_2(1-\theta_2)}} \tag{18}$$
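The simplification in Claim 1 can be checked numerically. The sketch below evaluates the general statistic (17) with the plug-in Bernoulli variances $\theta_i(1-\theta_i)$ and equal group sizes $r$, next to the closed form with the $\sqrt{r}$ factor. Using the plug-in variance (without the small-sample correction factor $r/(r-1)$) is an assumption of this sketch, and the function names are hypothetical.

```python
from math import sqrt


def welch_t(mean1: float, var1: float, n1: int,
            mean2: float, var2: float, n2: int) -> float:
    """General Welch statistic (17) from group means, variances and sizes."""
    return (mean1 - mean2) / sqrt(var1 / n1 + var2 / n2)


def welch_t_binomial(theta1: float, theta2: float, r: int) -> float:
    """Closed form for success rates theta1, theta2 with r trials per group."""
    return sqrt(r) * (theta1 - theta2) / sqrt(
        theta1 * (1.0 - theta1) + theta2 * (1.0 - theta2)
    )


if __name__ == "__main__":
    r, theta1, theta2 = 1000, 0.12, 0.10  # hypothetical observed rates
    general = welch_t(theta1, theta1 * (1 - theta1), r,
                      theta2, theta2 * (1 - theta2), r)
    print(general, welch_t_binomial(theta1, theta2, r))
```

For large $r$ the correction factor $r/(r-1)$ is close to one, so the plug-in variance changes the statistic only marginally.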

## 3 Proof

We change the notation slightly: the unknown success rates will be $p$ and $q$, and the corresponding numbers of successes $r\theta_1$ and $r\theta_2$.

#### Alternatives

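Maximizing over all possible priors $P_a$ over pairs $(p,q)$ we get

$$\max_{P_a}\Pr[D|H_a] = c\cdot\max_{P_a}\int_{[0,1]^2} e^{-rH(\theta_1,p)-rH(\theta_2,q)}\,P_a(p,q)\,d(p,q) \tag{19}$$

where $c$ is a normalizing constant. This maximum equals

$$\max_{P_a}\Pr[D|H_a] = c\cdot e^{-rH(\theta_1)-rH(\theta_2)} \tag{20}$$

achieved for $P_a$ being a unit mass at $(p,q)=(\theta_1,\theta_2)$.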

#### Null

Let $H_0$ state that the baseline is $p$ and the effect is zero, that is $q=p$. Then we obtain

$$\Pr[D|H_0] = c\cdot e^{-rH(\theta_1,p)-rH(\theta_2,p)} \tag{21}$$

with the same normalizing constant $c$.

#### Bayes factor

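If neither of the two hypotheses is a priori preferred, that is when $\Pr[H_0]=\Pr[H_a]$, then the posterior odds equal the likelihood ratio, that is the Bayes factor (by Bayes' theorem)

$$\frac{\Pr[H_0|D]}{\Pr[H_a|D]} = \frac{\Pr[D|H_0]}{\Pr[D|H_a]}. \tag{22}$$

In turn the likelihood ratio (in favor of $H_0$) equals

$$\min_{H_a}\frac{\Pr[D|H_0]}{\Pr[D|H_a]} = e^{-r\cdot\left(H(\theta_1,p)+H(\theta_2,p)-H(\theta_1)-H(\theta_2)\right)} \tag{23}$$

(the normalizing constant cancels). Using the relation between the KL divergence and cross-entropy we obtain

$$\min_{H_a}\frac{\Pr[D|H_0]}{\Pr[D|H_a]} = e^{-r\,\mathrm{KL}(\theta_1,p)-r\,\mathrm{KL}(\theta_2,p)} \tag{24}$$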

We will use the following observation

###### Claim 2.

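The expression $\mathrm{KL}(\theta_1,p)+\mathrm{KL}(\theta_2,p)$ is minimized under $p=\theta^*=\frac{\theta_1+\theta_2}{2}$, and achieves value $2\,\mathrm{JS}(\theta_1,\theta_2)$.

###### Proof.

We have

$$\mathrm{KL}(\theta_1,p)+\mathrm{KL}(\theta_2,p) = H(\theta_1,p)+H(\theta_2,p)-H(\theta_1)-H(\theta_2)$$

Now the existence of the minimum follows by convexity of $p\to H(\theta_1,p)+H(\theta_2,p)$, proved in Lemma 1. We note that $H(\theta_1,p)+H(\theta_2,p)=2H(\theta^*,p)$ for any $p$ (by definition, since the cross-entropy is linear in its first argument), and $H(\theta^*,p)-H(\theta^*)=\mathrm{KL}(\theta^*,p)\geq 0$, so the minimum is at $p=\theta^*$. Thus for $p=\theta^*$ we obtain $H(\theta_1,\theta^*)+H(\theta_2,\theta^*)=2H(\theta^*)$ and $\mathrm{KL}(\theta_1,\theta^*)+\mathrm{KL}(\theta_2,\theta^*)=2H(\theta^*)-H(\theta_1)-H(\theta_2)$. This combined with the definition of the Jensen-Shannon divergence finishes the proof. ∎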
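Claim 2 can be illustrated with a brute-force search. The sketch below minimizes $\mathrm{KL}(\theta_1,p)+\mathrm{KL}(\theta_2,p)$ over a grid of candidate rates $p$ and compares the result with the mid-point $\frac{\theta_1+\theta_2}{2}$ and with twice the Jensen-Shannon divergence (in its mid-point entropy form). The grid resolution, function names and example rates are assumptions made for illustration.

```python
from math import log


def entropy(p: float) -> float:
    """Binary Shannon entropy in nats, with 0*log(0) taken as 0."""
    return -sum(x * log(x) for x in (p, 1.0 - p) if x > 0.0)


def kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * log(p / q) + (1.0 - p) * log((1.0 - p) / (1.0 - q))


def js(p: float, q: float) -> float:
    """Jensen-Shannon divergence in its mid-point entropy form."""
    return entropy((p + q) / 2.0) - 0.5 * entropy(p) - 0.5 * entropy(q)


def argmin_on_grid(theta1: float, theta2: float, steps: int = 10_000) -> float:
    """Grid point p in (0, 1) minimizing KL(theta1, p) + KL(theta2, p)."""
    grid = (i / steps for i in range(1, steps))
    return min(grid, key=lambda p: kl(theta1, p) + kl(theta2, p))


if __name__ == "__main__":
    theta1, theta2 = 0.2, 0.5  # hypothetical observed rates
    best = argmin_on_grid(theta1, theta2)
    print(best, (theta1 + theta2) / 2.0)
    print(kl(theta1, best) + kl(theta2, best), 2.0 * js(theta1, theta2))
```

A grid search is used only because it makes no assumption about the shape of the objective. Since the objective is convex, any one-dimensional convex minimizer would serve equally well.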

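We can now bound Equation 24 as

$$\min_{H_a}\frac{\Pr[D|H_0]}{\Pr[D|H_a]} \leq e^{-2r\cdot\mathrm{JS}(\theta_1,\theta_2)} \tag{25}$$

with equality for $p=\theta^*=\frac{\theta_1+\theta_2}{2}$. This proves the first part of Theorem 1.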

#### Connecting t-statistic and bayes factor exponent

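Recall that by Claim 1 under the t-test we have

$$T \approx r^{\frac{1}{2}}\cdot|\theta_1-\theta_2|\cdot\left(\theta_1(1-\theta_1)+\theta_2(1-\theta_2)\right)^{-\frac{1}{2}} \tag{26}$$

It remains to connect $\mathrm{JS}(\theta_1,\theta_2)$ and $T$. By Lemma 2 we have the following refinement of Pinsker’s inequality

###### Claim 3.

We have $\mathrm{KL}(\theta_i,\theta^*)\geq\frac{(\theta_1-\theta_2)^2}{8\,\theta_i(1-\theta_i)}$ for $i=1,2$, where $\theta^*=\frac{\theta_1+\theta_2}{2}$.

Using $\frac{1}{a}+\frac{1}{b}\geq\frac{4}{a+b}$ for $a,b>0$, the inequality from Claim 3, and Welch’s formula in Equation 26 we obtain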

###### Claim 4.

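We have

$$\mathrm{JS}(\theta_1,\theta_2) \geq \frac{t_{\mathrm{Welch}}(t,\theta_1,\theta_2)^2}{4r} \tag{27}$$
###### Proof.

Claim 3 implies

$$2\,\mathrm{JS}(\theta_1,\theta_2) \geq (\theta_1-\theta_2)^2\cdot\left(\frac{1}{8\,\theta_1(1-\theta_1)}+\frac{1}{8\,\theta_2(1-\theta_2)}\right) \tag{28}$$

We recognize Welch’s statistic and write

$$2\,\mathrm{JS}(\theta_1,\theta_2) \geq \frac{t_{\mathrm{Welch}}(t,\theta_1,\theta_2)^2}{2r} \tag{29}$$

Combining Equation 25 and Equation 27 implies the second part of the theorem.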
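For readers who want to inspect the two sides of (27) on concrete rates, the following Python sketch evaluates the Jensen-Shannon divergence (in its mid-point entropy form) and the quantity $t_{\mathrm{Welch}}^2/(4r)$, in which $r$ cancels. It only computes both sides so that they can be compared for rates of interest, and makes no claim about their order for a given pair. For nearby rates the two values nearly coincide, as a second-order expansion suggests. The function names are hypothetical.

```python
from math import log


def entropy(p: float) -> float:
    """Binary Shannon entropy in nats, with 0*log(0) taken as 0."""
    return -sum(x * log(x) for x in (p, 1.0 - p) if x > 0.0)


def js(p: float, q: float) -> float:
    """Jensen-Shannon divergence in its mid-point entropy form."""
    return entropy((p + q) / 2.0) - 0.5 * entropy(p) - 0.5 * entropy(q)


def welch_term(theta1: float, theta2: float) -> float:
    """t_Welch^2 / (4 r) for the binomial Welch statistic; r cancels out."""
    return (theta1 - theta2) ** 2 / (
        4.0 * (theta1 * (1.0 - theta1) + theta2 * (1.0 - theta2))
    )


def compare_sides(theta1: float, theta2: float) -> tuple[float, float]:
    """Return (JS, t^2 / (4 r)) for the given observed rates."""
    return js(theta1, theta2), welch_term(theta1, theta2)


if __name__ == "__main__":
    # Hypothetical pairs of observed rates, from close to well separated.
    for theta1, theta2 in [(0.10, 0.11), (0.10, 0.15), (0.10, 0.30)]:
        print(theta1, theta2, compare_sides(theta1, theta2))
```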

## Acknowledgments

The author thanks Evan Miller for inspiring discussions.

## References

• [Edwards et al., 1963] Edwards, W., Lindman, H., and Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3):193–242.
• [Goodman, 1999] Goodman, S. N. (1999). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12):1005–1013.
• [Jeffreys, 1998] Jeffreys, H. (1998). The Theory of Probability. Oxford Classic Texts in the Physical Sciences. OUP Oxford.
• [Kass and Raftery, 1995] Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795.
• [Kruschke and Liddell, 2018] Kruschke, J. K. and Liddell, T. M. (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1):178–206.
• [Lin, 1991] Lin, J. (1991). Divergence measures based on the shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
• [Morey and Rouder, 2018] Morey, R. D. and Rouder, J. N. (2018). BayesFactor: Computation of Bayes factors for common designs. R package version 0.9.12-4.2.
• [Welch, 1938] Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29(3/4):350–362.