Bounds on Bayes Factors for Binomial A/B Testing

02/28/2019 ∙ by Maciej Skorski, et al. ∙ 0

Bayes factors, in many cases, have been proven to bridge classic p-value based significance testing and Bayesian analysis of posterior odds. This paper discusses this phenomenon within the binomial A/B testing setup (applicable, for example, to conversion testing). It is shown that the Bayes factor is controlled by the Jensen-Shannon divergence of success ratios in the two tested groups, which can be further bounded by the Welch statistic. As a result, Bayesian sample bounds almost match frequentist sample bounds. The link between the Jensen-Shannon divergence and Welch's test, as well as the derivation, are an elegant application of tools from information geometry.


1 Introduction

1.1 Motivation

A/B testing

A/B testing is the technique of collecting data from two parallel experiments and comparing them by probabilistic inference. A particularly important case is assessing which of two success-counting experiments achieves a higher success rate. This naturally applies to evaluating conversion rates on two different versions of a webpage. A typical question is whether there is a difference (also called a non-zero effect) in conversion between groups: $p_1, p_2$ are the unknown conversion rates in the two experiments, and the task is to compare the hypotheses $H_0 = \{p_1 = p_2\}$ and $H_a = \{p_1 \neq p_2\}$, given observed data $D$. In the frequentist approach, one falsifies $H_0$ by the two-sample t-test [Welch, 1938]. In the Bayesian approach one evaluates the strength of both $H_0$ and $H_a$ and decides based on the ratio called the Bayes factor

$$K = \frac{\Pr[D \mid H_a]}{\Pr[D \mid H_0]} \qquad (1)$$

which, converted by the Bayes theorem to

$$\frac{\Pr[H_a \mid D]}{\Pr[H_0 \mid D]} = K \cdot \frac{\Pr[H_a]}{\Pr[H_0]},$$

quantifies posterior odds and allows a researcher to choose the model that is more plausible given the data (usually one gives $H_0$ and $H_a$ the same chance of being considered and sets $\Pr[H_0] = \Pr[H_a]$). The decision rule and confidence depend on the magnitude of $K$ [Kass and Raftery, 1995, Jeffreys, 1998]. In the Bayesian approach a hypothesis assigns an arbitrary distribution to the parameters, which is more general.

Testing counts proportions

Suppose that the empirical data $D$ has $n_i$ runs and $r_i$ successes in the $i$-th experiment, $i = 1, 2$. Under the binomial counting model, the data likelihood under a hypothesis $H$ equals

$$\Pr[D \mid H] = \int \prod_{i=1}^{2} \binom{n_i}{r_i} p_i^{r_i} (1 - p_i)^{n_i - r_i} \, \mathrm{d}P_H(p_1, p_2) \qquad (2)$$

where the prior distribution $P_H$ reflects what is assumed prior to seeing data (and what will be tested); one can for example choose $p_1 = p_2$ for $H_0$ and $p_1 \neq p_2$ for $H_a$, uniformly over all valid values of $p_i$, but in practice more informative priors are used because some configurations of values are unrealistic (e.g. extremely low or high conversion). The corresponding Bayes factor can be computed, for example, by the R package BayesFactor [Morey and Rouder, 2018].
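As a rough illustration of the marginal likelihoods just described, the following sketch evaluates the Bayes factor in closed form. It assumes the uniform priors mentioned in the text (a single rate uniform on [0, 1] under the null, two independent uniform rates under the alternative); the function names are hypothetical and this is not the computation performed by the BayesFactor package.

```python
from math import lgamma, exp


def log_binom(n, r):
    """Log of the binomial coefficient C(n, r)."""
    return lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal_null(n1, r1, n2, r2):
    # Assumption: shared rate p1 = p2 = p with p ~ Uniform(0, 1).
    # The integral of p^(successes) * (1-p)^(failures) is a Beta function.
    failures = (n1 - r1) + (n2 - r2)
    return (log_binom(n1, r1) + log_binom(n2, r2)
            + log_beta(r1 + r2 + 1, failures + 1))


def log_marginal_alt(n1, r1, n2, r2):
    # Assumption: independent rates, each ~ Uniform(0, 1),
    # so the integral factorizes over the two groups.
    return sum(log_binom(n, r) + log_beta(r + 1, n - r + 1)
               for n, r in ((n1, r1), (n2, r2)))


def bayes_factor(n1, r1, n2, r2):
    """Likelihood ratio Pr[D | alternative] / Pr[D | null]."""
    return exp(log_marginal_alt(n1, r1, n2, r2)
               - log_marginal_null(n1, r1, n2, r2))


if __name__ == "__main__":
    # 1000 visitors per variant, 100 vs 130 conversions (made-up numbers).
    print(bayes_factor(1000, 100, 1000, 130))
```

Working in log space avoids overflow in the binomial coefficients for realistic sample sizes.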

Problem: Bayesian A/B testing power

Estimates, whether frequentist or Bayesian, will not be conclusive without sufficiently many samples. Frequentists widely use rules of thumb derived from t-tests. Under the Bayesian methodology this is a little more complicated because hypotheses can be arbitrary priors over parameters. Under the binomial A/B model, we will answer the following questions:

  • when, given data, may a Bayesian hypothesis of zero effect be rejected ($K \gg 1$ for some $H_a$)?

  • what is the relation to the classical t-test?

This will allow us to understand data limitations when doing Bayesian inference, and relate them to widespread frequentist rules of thumb.

1.2 Related Works and Contribution

Our problem, as stated, is a question about maximizing the minimal Bayes factor. It is known that for certain problems Bayes factors can be related to frequentist p-values [Edwards et al., 1963, Kass and Raftery, 1995, Goodman, 1999] and thus bridge the Bayesian and frequentist worlds (this should be contrasted with the widespread belief that the two methods are largely incompatible [Kruschke and Liddell, 2018]). The novel contributions of this paper are (a) bounding the Bayes factor for binomial distributions and (b) a discussion of sample bounds for binomial A/B testing in relation to the frequentist approach.

Main result: Bayes factor and Welch’s statistic

The following theorem shows that no “zero-effect” hypothesis can be falsified unless the number of samples is large in relation to a certain dataset statistic. This statistic turns out to be the Jensen-Shannon divergence, well known in information theory. It is in turn bounded by Welch's t-statistic.

Theorem 1 (Bayes Factors for Binomial Testing).

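Consider two independent experiments, each with $n$ independent trials, with unknown success probabilities $p_1$ and $p_2$ respectively. Let the observed data $D$ have $r_i$ successes and $n - r_i$ failures for group $i$. Then

$$\max_{H_0} \min_{H_a} \frac{\Pr[D \mid H_0]}{\Pr[D \mid H_a]} = \exp\left(-2n \cdot \mathrm{JS}\left(\tfrac{r_1}{n}, \tfrac{r_2}{n}\right)\right) \qquad (3)$$

where the maximum is over null hypotheses (priors) $H_0$ over $p_1, p_2$ such that $p_1 = p_2$, the minimum is over all valid alternative hypotheses (priors) $H_a$ over $p_1, p_2$, and $\mathrm{JS}$ denotes the Jensen-Shannon divergence.

Moreover, the Jensen-Shannon divergence is bounded by Welch's $t$-statistic (on $D$)

$$\mathrm{JS}\left(\tfrac{r_1}{n}, \tfrac{r_2}{n}\right) \leq \frac{t^2}{2n} \qquad (4)$$

so that we can bound

$$\max_{H_0} \min_{H_a} \frac{\Pr[D \mid H_0]}{\Pr[D \mid H_a]} \geq \exp\left(-t^2\right) \qquad (5)$$

Remark 1 (Most favorable hypotheses).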

Note that

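  • The maximally favorable alternative ($H_a$ which maximizes $\Pr[D \mid H_a]$) is $p_1 = r_1/n$ and $p_2 = r_2/n$

  • The maximally favorable null of the form $p_1 = p_2 = p$ is $p = \frac{r_1 + r_2}{2n}$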

If null is of the form then the bound becomes .
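The relation between the extremal likelihood ratio, the Jensen-Shannon divergence and Welch's statistic can be checked numerically. The sketch below makes several assumptions that are not spelled out here: natural logarithms, equal group sizes n, the Jensen-Shannon divergence taken as the entropy of the midpoint minus the average entropy, and a t-statistic with plug-in variances p(1-p). All function names are hypothetical.

```python
from math import log


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def entropy(p):
    """Binary Shannon entropy in nats."""
    return -xlogy(p, p) - xlogy(1 - p, 1 - p)


def js(p, q):
    """Jensen-Shannon divergence of two Bernoulli parameters (assumed form)."""
    m = (p + q) / 2
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def log_lik(n, r, p):
    """Binomial log-likelihood without the binomial coefficient."""
    return xlogy(r, p) + xlogy(n - r, 1 - p)


def extremal_log_ratio(n, r1, r2):
    """log(Pr[D|H0] / Pr[D|Ha]) for the most favorable point null
    (common rate = pooled success rate) and the most favorable
    alternative (rates = empirical success rates)."""
    pooled = (r1 + r2) / (2 * n)
    best_null = log_lik(n, r1, pooled) + log_lik(n, r2, pooled)
    best_alt = log_lik(n, r1, r1 / n) + log_lik(n, r2, r2 / n)
    return best_null - best_alt


def welch_t_squared(n, r1, r2):
    """Squared t-statistic with plug-in variances (an assumption)."""
    p1, p2 = r1 / n, r2 / n
    return n * (p1 - p2) ** 2 / (p1 * (1 - p1) + p2 * (1 - p2))


if __name__ == "__main__":
    n, r1, r2 = 1000, 100, 130  # made-up data
    print(extremal_log_ratio(n, r1, r2))      # equals -2n * JS
    print(-2 * n * js(r1 / n, r2 / n))
    print(-welch_t_squared(n, r1, r2))        # lower bound on the above
```

The two first printed values agree up to rounding, and the third is smaller, in line with the chain of bounds in the theorem under the stated assumptions.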

Application: sample bounds

The main result implies the following sample rule

Corollary 1 (Bayesian Sample Bound).

To confirm the non-zero effect ($K \gg 1$) the number of samples for the Bayesian method should be

$$n \gg \frac{1}{\mathrm{JS}(p_1, p_2)} \qquad (6)$$

Under the frequentist method the rule of thumb is $t \gg 1$, which gives (see Section 2)

$$n \gg \frac{p_1(1 - p_1) + p_2(1 - p_2)}{(p_1 - p_2)^2} \qquad (7)$$

Note that both formulas need assumptions on the locations of the parameters. In particular, testing smaller effects or effects with higher variance requires more samples.

The bounds in Equation 6 and Equation 7 are within a constant factor of each other (a different small factor is necessary to make the bound small in both the Bayesian credibility and p-value sense). The difference (under the normalized constant) is illustrated in Figure 1, for the case when one wants to test a relative uplift of 10%.

Figure 1: Comparison of the Bayesian (6) and the frequentist (7) sample lower bounds, where $p_1 = p$ and $p_2 = p(1 + \delta)$ for $\delta = 0.1$ (10% uplift). Both formulas are multiplied by a factor of 2 to accommodate meaningful confidence.
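A comparison of this kind can be sketched numerically. The snippet below assumes that the Bayesian rule scales as the inverse Jensen-Shannon divergence of the two rates and the frequentist rule as the summed Bernoulli variances over the squared difference, both multiplied by the factor of 2 mentioned in the caption; the exact constants and the function names are assumptions of this sketch, not values read off the figure.

```python
from math import log


def entropy(p):
    """Binary Shannon entropy in nats (0 * log 0 treated as 0)."""
    return -sum(x * log(x) for x in (p, 1 - p) if x > 0)


def js(p, q):
    """Jensen-Shannon divergence of two Bernoulli parameters (assumed form)."""
    m = (p + q) / 2
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def bayesian_bound(p1, p2, factor=2.0):
    # Assumed rule: n should exceed factor / JS(p1, p2).
    return factor / js(p1, p2)


def frequentist_bound(p1, p2, factor=2.0):
    # Assumed rule: n should exceed factor * (sum of variances) / difference^2.
    return factor * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2


# Baseline rates with a 10% relative uplift in the second group.
for p in (0.01, 0.05, 0.10, 0.20):
    q = 1.1 * p
    print(p, round(bayesian_bound(p, q)), round(frequentist_bound(p, q)))
```

Under these assumptions the two columns differ by a roughly constant factor across baseline rates, which is the qualitative behavior the text describes.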

Since high values of $t$ mean small p-values, we conclude that frequentist p-values bound the Bayes factor and indeed are evidence against a null hypothesis in the well-defined Bayesian sense. However, because of the scaling $K \leq e^{t^2}$, this is true only for p-values much lower than the standard threshold of 0.05. In some sense, the Bayesian approach is more conservative and more reluctant to reject than frequentist tests; this conclusion is shared with other works [Goodman, 1999].

2 Preliminaries

Entropy, Divergence

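The binary cross-entropy of $p$ and $q$ is defined by

$$H(p, q) = -p \log q - (1 - p) \log(1 - q) \qquad (8)$$

which becomes the standard (Shannon) binary entropy when $p = q$, denoted as $H(p)$. The Kullback-Leibler divergence is defined as

$$\mathrm{KL}(p \parallel q) = H(p, q) - H(p) \qquad (9)$$

and the Jensen-Shannon divergence [Lin, 1991] is defined as

$$\mathrm{JS}(p, q) = H\left(\frac{p + q}{2}\right) - \frac{H(p) + H(q)}{2} \qquad (10)$$

(always non-negative because the entropy is concave).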
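These quantities are easy to evaluate directly. The sketch below assumes natural logarithms and the entropy-based form of the Jensen-Shannon divergence (entropy of the midpoint minus the average entropy); the helper names are hypothetical. It also checks the standard identity that this form equals the average of the two Kullback-Leibler divergences to the midpoint.

```python
from math import log


def cross_entropy(p, q):
    """Binary cross-entropy in nats; assumes 0 < q < 1 unless p == q."""
    total = 0.0
    if p > 0:
        total -= p * log(q)
    if p < 1:
        total -= (1 - p) * log(1 - q)
    return total


def entropy(p):
    """Binary Shannon entropy: cross-entropy of p with itself."""
    return cross_entropy(p, p)


def kl(p, q):
    """Kullback-Leibler divergence as cross-entropy minus entropy."""
    return cross_entropy(p, q) - entropy(p)


def js(p, q):
    """Jensen-Shannon divergence, entropy form (assumed definition)."""
    m = (p + q) / 2
    return entropy(m) - (entropy(p) + entropy(q)) / 2


if __name__ == "__main__":
    p, q = 0.2, 0.5
    m = (p + q) / 2
    print(js(p, q))
    print(0.5 * kl(p, m) + 0.5 * kl(q, m))  # same value, different route
```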

The following lemma shows that the cross-entropy function is convex in the second argument. This should be contrasted with the fact that the entropy function (of one argument) is concave.

Lemma 1 (Convexity of cross-entropy).

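For any $p \in [0, 1]$ the mapping $q \mapsto H(p, q)$ is convex in $q$.

Proof.

Since $u \mapsto -\log u$ for $u > 0$ is convex we obtain

$$-p \log(\gamma q_1 + (1 - \gamma) q_2) \leq -\gamma\, p \log q_1 - (1 - \gamma)\, p \log q_2$$

for any $\gamma \in [0, 1]$ and any $p \in [0, 1]$, $q_1, q_2 \in (0, 1)$. Replacing $p$ by $1 - p$ and $q_i$ by $1 - q_i$ in the above inequality gives us also

$$-(1 - p) \log(1 - \gamma q_1 - (1 - \gamma) q_2) \leq -\gamma (1 - p) \log(1 - q_1) - (1 - \gamma)(1 - p) \log(1 - q_2)$$

Adding side by side yields

$$H(p, \gamma q_1 + (1 - \gamma) q_2) \leq \gamma H(p, q_1) + (1 - \gamma) H(p, q_2)$$

which finishes the proof. This argument also works for the multivariate case, when $p, q$ are probability vectors. ∎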

Lemma 2 (Quadratic bounds on KL/cross-entropy).

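For any $p, q \in (0, 1)$ it holds that

$$\mathrm{KL}(p \parallel q) \leq \frac{(p - q)^2}{q(1 - q)} \qquad (11)$$

Proof.

We will prove a general version. Let $p = (p_i)_i$ and $q = (q_i)_i$ be probability vectors of the same length. By the elementary inequality

$$\log(1 + u) \leq u \qquad (12)$$

we obtain

$$\log \frac{p_i}{q_i} = \log\left(1 + \frac{p_i - q_i}{q_i}\right) \qquad (13)$$
$$\leq \frac{p_i - q_i}{q_i} \qquad (14)$$

Multiplying both sides by $p_i$ and adding the inequalities side by side we obtain

$$\sum_i p_i \log \frac{p_i}{q_i} \leq \sum_i \frac{p_i (p_i - q_i)}{q_i} \qquad (15)$$
$$= \sum_i \frac{(p_i - q_i)^2}{q_i} \qquad (16)$$

where the last equality uses $\sum_i (p_i - q_i) = 0$. This means $\mathrm{KL}(p \parallel q) \leq \chi^2(p, q)$. Our lemma follows by specializing to the vectors $(p, 1 - p)$ and $(q, 1 - q)$. ∎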
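A quick numerical sanity check of the vector version of this argument is sketched below. It assumes the bound takes the form "Kullback-Leibler divergence is at most the chi-square divergence", with natural logarithms; the function names are hypothetical.

```python
from math import log


def kl_vec(p, q):
    """Kullback-Leibler divergence of probability vectors (nats)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi2_vec(p, q):
    """Chi-square divergence: sum of (p_i - q_i)^2 / q_i."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def binary(p):
    """Embed a Bernoulli parameter as a two-point probability vector."""
    return (p, 1 - p)


if __name__ == "__main__":
    p, q = (0.2, 0.3, 0.5), (0.4, 0.4, 0.2)
    print(kl_vec(p, q), "<=", chi2_vec(p, q))

    # Binary specialization: chi-square collapses to (a-b)^2 / (b(1-b)).
    a, b = 0.10, 0.13
    print(chi2_vec(binary(a), binary(b)), (a - b) ** 2 / (b * (1 - b)))
```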

2-Sample test

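To decide whether the means in two groups are equal, under the assumption of unequal variances, one performs Welch's t-test with the statistic

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \qquad (17)$$

where $s_i^2$ are the sample variances and $\bar{x}_i$ are the sample means for group $i$ of size $n_i$. The null hypothesis is rejected when the statistic is sufficiently high (in absolute terms). In our case the formula simplifies to

Claim 1.

If $r_1$ and $r_2$ successes out of $n$ trials have been observed respectively in the first and the second group then

$$t = \sqrt{n} \cdot \frac{\frac{r_1}{n} - \frac{r_2}{n}}{\sqrt{\frac{r_1}{n}\left(1 - \frac{r_1}{n}\right) + \frac{r_2}{n}\left(1 - \frac{r_2}{n}\right)}} \qquad (18)$$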
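The simplification for success counts can be illustrated on raw 0/1 samples. The sketch below assumes the closed form uses the plug-in variance p(1-p) for each group, which matches Welch's statistic computed with variances normalized by n; with the unbiased (n-1) normalization the two differ by a factor of sqrt((n-1)/n). Function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance, variance


def welch_t(x, y, var=variance):
    """Welch's t-statistic from raw samples; `var` selects the
    variance estimator (unbiased by default)."""
    return (mean(x) - mean(y)) / sqrt(var(x) / len(x) + var(y) / len(y))


def binomial_t(n, r1, r2):
    """Closed form for success counts, assuming plug-in variances."""
    p1, p2 = r1 / n, r2 / n
    return sqrt(n) * (p1 - p2) / sqrt(p1 * (1 - p1) + p2 * (1 - p2))


# Made-up data: 30 and 50 successes out of 200 trials each.
n, r1, r2 = 200, 30, 50
x = [1.0] * r1 + [0.0] * (n - r1)
y = [1.0] * r2 + [0.0] * (n - r2)

if __name__ == "__main__":
    print(welch_t(x, y, var=pvariance))  # matches the closed form
    print(binomial_t(n, r1, r2))
    print(welch_t(x, y))                 # unbiased variances, slightly smaller
```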

3 Proof

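We change the notation slightly: the unknown success rates will be $p$ and $q$, and the corresponding successes $r$ and $s$.

Alternatives

Maximizing over all possible priors over pairs $(p, q)$ we get

$$\max_{H_a} \Pr[D \mid H_a] = C \cdot \exp\left(-n H\left(\tfrac{r}{n}\right) - n H\left(\tfrac{s}{n}\right)\right) \qquad (19)$$

where $C$ is a normalizing constant, which equals

$$C = \binom{n}{r} \binom{n}{s} \qquad (20)$$

The maximum is achieved for $H_a$ being a unit mass at $\left(\tfrac{r}{n}, \tfrac{s}{n}\right)$.

Null

Let $H_0$ state that the baseline is $p$ and the effect is $0$, that is $q = p$. Then we obtain

$$\Pr[D \mid H_0] = C \cdot \exp\left(-n H\left(\tfrac{r}{n}, p\right) - n H\left(\tfrac{s}{n}, p\right)\right) \qquad (21)$$

with the same normalizing constant $C$.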

Bayes factor

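If neither of the two hypotheses is a priori preferred, that is when $\Pr[H_0] = \Pr[H_a]$, then the posterior odds equal the likelihood ratio (by the Bayes theorem)

$$\frac{\Pr[H_a \mid D]}{\Pr[H_0 \mid D]} = \frac{\Pr[D \mid H_a]}{\Pr[D \mid H_0]} \qquad (22)$$

In turn, writing $\hat{p} = r/n$ and $\hat{q} = s/n$ for the empirical success rates, the likelihood ratio (in favor of $H_a$) for the maximally favorable alternative and the null with baseline $p$ equals

$$\frac{\Pr[D \mid H_a]}{\Pr[D \mid H_0]} = \exp\left(-n H(\hat{p}) - n H(\hat{q}) + n H(\hat{p}, p) + n H(\hat{q}, p)\right) \qquad (23)$$

(the normalizing constant cancels). Using the relation between the KL divergence and cross-entropy we obtain

$$\frac{\Pr[D \mid H_a]}{\Pr[D \mid H_0]} = \exp\left(n\, \mathrm{KL}(\hat{p} \parallel p) + n\, \mathrm{KL}(\hat{q} \parallel p)\right) \qquad (24)$$

We will use the following observation

Claim 2.

The expression $\mathrm{KL}(\hat{p} \parallel p) + \mathrm{KL}(\hat{q} \parallel p)$ is minimized under $p = \frac{\hat{p} + \hat{q}}{2}$, and achieves value $2\, \mathrm{JS}(\hat{p}, \hat{q})$.

Proof.

We have

$$\mathrm{KL}(\hat{p} \parallel p) + \mathrm{KL}(\hat{q} \parallel p) = H(\hat{p}, p) + H(\hat{q}, p) - H(\hat{p}) - H(\hat{q})$$

Now the existence of the minimum at $p = \frac{\hat{p} + \hat{q}}{2}$ follows by convexity of $p \mapsto H(\hat{p}, p) + H(\hat{q}, p)$, proved in Lemma 1, since the derivative in $p$ vanishes at this point. We note that $H(\hat{p}, p) + H(\hat{q}, p) = 2 H\left(\frac{\hat{p} + \hat{q}}{2}, p\right)$ for any $p$ (by definition), and thus for $p = \frac{\hat{p} + \hat{q}}{2}$ we obtain $H(\hat{p}, p) + H(\hat{q}, p) = 2 H\left(\frac{\hat{p} + \hat{q}}{2}\right)$ and $\mathrm{KL}(\hat{p} \parallel p) + \mathrm{KL}(\hat{q} \parallel p) = 2 H\left(\frac{\hat{p} + \hat{q}}{2}\right) - H(\hat{p}) - H(\hat{q})$. This combined with the definition of the Jensen-Shannon divergence finishes the proof. ∎

We can now bound Equation 24 as

$$\frac{\Pr[D \mid H_a]}{\Pr[D \mid H_0]} \geq \exp\left(2n \cdot \mathrm{JS}(\hat{p}, \hat{q})\right) \qquad (25)$$

with equality for the baseline $p = \frac{\hat{p} + \hat{q}}{2}$. This proves the first part of Theorem 1.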
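The minimization step in this part of the proof can be checked by brute force. The sketch below assumes the objective is the sum of the two Kullback-Leibler divergences from the empirical rates to a common baseline, with natural logarithms, and searches a grid of baselines; the names and the example rates are hypothetical.

```python
from math import log


def cross_entropy(p, q):
    """Binary cross-entropy in nats; assumes 0 < q < 1."""
    total = 0.0
    if p > 0:
        total -= p * log(q)
    if p < 1:
        total -= (1 - p) * log(1 - q)
    return total


def entropy(p):
    return cross_entropy(p, p)


def kl(p, q):
    return cross_entropy(p, q) - entropy(p)


def js(p, q):
    m = (p + q) / 2
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def objective(a, b, baseline):
    """Sum of divergences from both empirical rates to a common baseline."""
    return kl(a, baseline) + kl(b, baseline)


a, b = 0.10, 0.13                        # made-up empirical rates
grid = [i / 1000 for i in range(1, 1000)]
best = min(grid, key=lambda t: objective(a, b, t))

if __name__ == "__main__":
    print(best, (a + b) / 2)                 # grid minimizer vs midpoint
    print(objective(a, b, best), 2 * js(a, b))
```

A grid search is used only because it needs no external optimizer; it confirms the midpoint as minimizer to the grid resolution.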

Connecting t-statistic and bayes factor exponent

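Recall that by Claim 1 under the t-test we have (with $\hat{p} = r/n$ and $\hat{q} = s/n$ the empirical success rates)

$$t^2 = n \cdot \frac{(\hat{p} - \hat{q})^2}{\hat{p}(1 - \hat{p}) + \hat{q}(1 - \hat{q})} \qquad (26)$$

It remains to connect $\mathrm{JS}(\hat{p}, \hat{q})$ and $t$. By Lemma 2 we have the following refinement of Pinsker’s inequality

Claim 3.

We have $\mathrm{JS}(\hat{p}, \hat{q}) \leq \frac{(\hat{p} - \hat{q})^2}{(\hat{p} + \hat{q})(2 - \hat{p} - \hat{q})}$.

Using $(\hat{p} + \hat{q})(2 - \hat{p} - \hat{q}) \geq 2\hat{p}(1 - \hat{p}) + 2\hat{q}(1 - \hat{q})$, the inequality from Claim 3, and Welch’s formula in Equation 26 we obtain

Claim 4.

We have

$$\mathrm{JS}(\hat{p}, \hat{q}) \leq \frac{t^2}{2n} \qquad (27)$$

Proof.

Claim 3 implies

$$\mathrm{JS}(\hat{p}, \hat{q}) \leq \frac{(\hat{p} - \hat{q})^2}{2\hat{p}(1 - \hat{p}) + 2\hat{q}(1 - \hat{q})} \qquad (28)$$

We recognize Welch’s statistic and write

$$\mathrm{JS}(\hat{p}, \hat{q}) \leq \frac{t^2}{2n} \qquad (29)$$

∎

Combining Equation 25 and Equation 27 implies the second part of the theorem.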
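The chain of inequalities linking the Jensen-Shannon divergence to the squared difference of rates can be verified on a grid. The intermediate bound used below, the squared difference divided by 4m(1-m) with m the midpoint, is one way to instantiate the chi-square step and is an assumption of this sketch, as are the function names and the natural-logarithm convention.

```python
from math import log


def entropy(p):
    """Binary Shannon entropy in nats (0 * log 0 treated as 0)."""
    return -sum(x * log(x) for x in (p, 1 - p) if x > 0)


def js(p, q):
    """Jensen-Shannon divergence, entropy form (assumed definition)."""
    m = (p + q) / 2
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def midpoint_bound(p, q):
    """Chi-square style bound around the midpoint m = (p + q) / 2."""
    m = (p + q) / 2
    return (p - q) ** 2 / (4 * m * (1 - m))


def variance_bound(p, q):
    """Bound with summed Bernoulli variances; equals t^2 / (2n)
    for a t-statistic with plug-in variances."""
    return (p - q) ** 2 / (2 * (p * (1 - p) + q * (1 - q)))


def chain_holds(p, q, tol=1e-12):
    """Check JS <= midpoint bound <= variance bound."""
    a, b = midpoint_bound(p, q), variance_bound(p, q)
    return js(p, q) <= a + tol and a <= b + tol


grid = [i / 20 for i in range(1, 20)]
all_ok = all(chain_holds(p, q) for p in grid for q in grid)

if __name__ == "__main__":
    print(all_ok)
```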

Acknowledgments

The author thanks Evan Miller for inspiring discussions.

References

  • [Edwards et al., 1963] Edwards, W., Lindman, H., and Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70(3):193–242.
  • [Goodman, 1999] Goodman, S. N. (1999). Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine, 130(12):1005–1013.
  • [Jeffreys, 1998] Jeffreys, H. (1998). The Theory of Probability. Oxford Classic Texts in the Physical Sciences. OUP Oxford.
  • [Kass and Raftery, 1995] Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795.
  • [Kruschke and Liddell, 2018] Kruschke, J. K. and Liddell, T. M. (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1):178–206.
  • [Lin, 1991] Lin, J. (1991). Divergence measures based on the shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
  • [Morey and Rouder, 2018] Morey, R. D. and Rouder, J. N. (2018). BAYESFACTOR: computation of bayes factors for common designs. r package version 0.9.12-4.2. http://CRAN.R-project.org/package=BayesFactor.
  • [Welch, 1938] Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29(3/4):350–362.