 # Measuring and Controlling Bias for Some Bayesian Inferences and the Relation to Frequentist Criteria

A common concern with Bayesian methodology in scientific contexts is that inferences can be heavily influenced by subjective biases. As presented here, there are two types of bias for some quantity of interest: bias against and bias in favor. Based upon the principle of evidence, it is shown how to measure and control these biases for both hypothesis assessment and estimation problems. Optimality results are established for the principle of evidence as the basis of the approach to these problems. A close relationship is established between measuring bias in Bayesian inferences and frequentist properties that hold for any proper prior. This leads to a possible resolution to an apparent conflict between these approaches to statistical reasoning. Frequentism is seen as establishing a figure of merit for a statistical study, while Bayesianism plays the key role in determining inferences based upon statistical evidence.


## 1 Introduction

A serious concern with Bayesian methodology is that the choice of the prior could result in conclusions that to some degree are predetermined before seeing the data. In certain circumstances this is correct. This can be seen by considering the problem associated with what is known as the Jeffreys-Lindley paradox where posterior probabilities of hypotheses, as well as associated Bayes factors, will produce increasing support for the hypothesis as the prior becomes more diffuse. So, while one may feel that a very diffuse prior is putting in very little information, it is in fact biasing the results in favor of the hypothesis. It has been argued, see Baskurt and Evans (2013) and Evans (2015), that the measurement and control of bias is a key element of a Bayesian analysis as without it, and the assurance that bias is minimal, the validity of any inference is suspect.

While attempts have been made to avoid the Jeffreys-Lindley paradox through the choice of the prior, modifying the prior to avoid bias is contrary to the ideals of a Bayesian analysis which requires the elicitation of a prior based upon knowledge of the phenomenon under study. Why should one change such a prior because of bias? Indeed, as will be discussed, there is bias in favor and bias against and typically choosing a prior to minimize one type of bias simply increases the other. The real method for controlling bias of both types is through the amount of data collected. So controlling bias is an aspect of design. Bias can be measured post-hoc and it then provides a way to assess the weight that should be given the results of an analysis. For example, if a study concludes that there is evidence in favor of a hypothesis, but it can be shown that there was a high prior probability that such evidence would be obtained, then the results of such an analysis can’t be considered to be reliable.

Previous discussion concerning bias was focused on hypothesis assessment and in many ways this is a natural starting point. This paper is concerned with adding some aspects to those developments and to extending the approach to estimation and prediction problems. Furthermore, it is shown here that measuring and controlling bias establishes close links between a frequentist approach to statistics and Bayesian inference. In essence frequentism is concerned with design while inferences are Bayesian. Bayesian inference is based upon the evidence in the observed data and is unconcerned, at least for inference, about data sets that could have been obtained. Frequentism is concerned with the behavior of inferences as applied to unobserved data sets and this is entirely appropriate before the data is observed. So consideration of bias leads to a degree of unification between different ways of thinking about statistical reasoning.

The measurement of bias, and thus its control, is dependent upon measuring evidence. The principle of evidence is adopted here: evidence in favor of a specific value of an unknown occurs when the posterior probability of the value is greater than its prior probability, evidence against occurs when the posterior probability of the value is less than its prior probability and there is no evidence either way when these are equal. The major part of what is discussed here depends only on this simple principle but sometimes a numerical measure of evidence is needed and for this we use the relative belief ratio defined as the ratio of the posterior to prior probability. The relative belief ratio is related to the Bayes factor but has some nicer properties such as providing a measure of the evidence for each value of a parameter without the need to modify the prior.

There is not much discussion in the Bayesian literature of the notion of bias in the sense that is meant here. There is considerable discussion, however, concerning the Jeffreys-Lindley paradox and our position is that bias plays a key role in the issues that arise. Relevant recent papers on this include Shafer (1982), Spanos (2013), Sprenger (2013), Robert (2014), Cousins (2017) and Villa and Walker (2017) and these contain extensive background references. Gu et al. (2019) is concerned with the validation of quantum theory using Bayesian methodology applied to well-known data sets and the principle of evidence and an assessment of the bias in the prior plays a key role in the argument.

In Section 2 the concepts are defined, their properties are considered and illustrated via a simple example where the Jeffreys-Lindley paradox is relevant. Also, it is seen that a well-known p-value does not satisfy the principle of evidence but can still be used to characterize evidence for or against but requires significance levels that go to 0 with increasing sample size or increasing diffuseness of the prior. In Section 3 the relationship with frequentism is discussed and a number of optimality results are established for the approach taken here to measuring and controlling bias, namely, via the principle of evidence. In Section 4, a variety of examples are considered and analyzed from the point-of-view of bias. All proofs of theorems are in the Appendix.

## 2 Evidence and Bias

For the discussion here there is a model $\{f_\theta : \theta\in\Theta\}$ given by densities for data $x$ and a proper prior probability distribution given by density $\pi$ on $\Theta$. It is supposed that interest is in inferences about $\psi=\Psi(\theta)$, where $\Psi:\Theta\to\Psi$ is onto and, for economy, the same notation is used for the function and its range. For the most part it is safe to assume all the probability distributions are discrete, with results for the continuous case obtained by taking limits.

A measure of the evidence that $\psi\in\Psi$ is the true value is given by the relative belief ratio

$$RB_\Psi(\psi\,|\,x)=\lim_{\delta\to 0}\frac{\Pi_\Psi(N_\delta(\psi)\,|\,x)}{\Pi_\Psi(N_\delta(\psi))}=\frac{\pi_\Psi(\psi\,|\,x)}{\pi_\Psi(\psi)} \tag{1}$$

where $\Pi_\Psi$ and $\Pi_\Psi(\cdot\,|\,x)$ are the prior and posterior probability measures of $\Psi$ with densities $\pi_\Psi$ and $\pi_\Psi(\cdot\,|\,x)$, respectively, and $N_\delta(\psi)$ is a sequence of sets converging nicely to $\{\psi\}$. The last equality in (1) requires some conditions, but it is enough that the prior density be positive and continuous at $\psi$. So $RB_\Psi(\psi\,|\,x)>1$ implies evidence for the true value being $\psi$, etc. Any valid measure of evidence should satisfy the principle of evidence, namely, the existence of a cut-off value that determines evidence for and against as prescribed by the principle. Naturally, this cut-off is 1 for the relative belief ratio. The Bayes factor is also a valid measure of evidence, with the same cut-off. When $\Pi_\Psi(\{\psi\})>0$, the Bayes factor of $\psi$ equals $RB_\Psi(\psi\,|\,x)/RB_\Psi(\{\psi\}^c\,|\,x)$ and so can be defined in terms of the relative belief ratio, but not conversely. Also, $RB_\Psi(\psi\,|\,x)>1$ iff $RB_\Psi(\{\psi\}^c\,|\,x)<1$, and so the Bayes factor is not really a comparison of the evidence for $\psi$ being true with the evidence for its negation. In the continuous case, if we define the Bayes factor for $\psi$ as a limit as in (1), then this limit equals $RB_\Psi(\psi\,|\,x)$. Further discussion on the choice of a measure of evidence can be found in Evans (2015), as there are other candidates beyond these two. It is important to note, however, that the discussion of bias depends only on the principle of evidence and is the same no matter what valid measure of evidence is used.

The following example is carried along as it illustrates a number of things.

Example 1. Location normal.

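Suppose $x=(x_1,\ldots,x_n)$ is i.i.d. $N(\mu,\sigma_0^2)$, with $\sigma_0^2$ known, and $\pi$ is a $N(\mu_0,\tau_0^2)$ prior. Then $\mu\,|\,x\sim N\big((n/\sigma_0^2+1/\tau_0^2)^{-1}(n\bar{x}/\sigma_0^2+\mu_0/\tau_0^2),\,(n/\sigma_0^2+1/\tau_0^2)^{-1}\big)$, so

$$RB(\mu\,|\,x)=\left(1+\frac{n\tau_0^2}{\sigma_0^2}\right)^{1/2}\exp\left\{-\frac{1}{2}\left(1+\frac{\sigma_0^2}{n\tau_0^2}\right)^{-1}\left(\frac{\sqrt{n}(\bar{x}-\mu)}{\sigma_0}+\frac{\sigma_0(\mu_0-\mu)}{\sqrt{n}\,\tau_0^2}\right)^{2}+\frac{(\mu-\mu_0)^2}{2\tau_0^2}\right\}.$$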
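The closed form for the location normal relative belief ratio can be checked numerically against its definition as posterior density over prior density. The following is a minimal sketch; the function names are illustrative and not from the paper. Note that agreement with the density ratio requires the factor $1+\sigma_0^2/(n\tau_0^2)$ to enter the exponent as a reciprocal, which is how it is coded here.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def rb_density_ratio(mu, xbar, n, sigma0, mu0, tau0):
    """RB(mu | x) computed directly as posterior density / prior density."""
    post_var = 1.0 / (n / sigma0**2 + 1.0 / tau0**2)
    post_mean = post_var * (n * xbar / sigma0**2 + mu0 / tau0**2)
    return normal_pdf(mu, post_mean, post_var) / normal_pdf(mu, mu0, tau0**2)

def rb_closed_form(mu, xbar, n, sigma0, mu0, tau0):
    """RB(mu | x) from the closed-form expression for the location normal model."""
    r = sigma0**2 / (n * tau0**2)  # sigma0^2 / (n tau0^2)
    w = (math.sqrt(n) * (xbar - mu) / sigma0
         + sigma0 * (mu0 - mu) / (math.sqrt(n) * tau0**2))
    return math.sqrt(1.0 + 1.0 / r) * math.exp(
        -0.5 * w**2 / (1.0 + r) + (mu - mu0) ** 2 / (2.0 * tau0**2))

if __name__ == "__main__":
    # Hypothetical data summary: xbar = 0.4 from n = 10 observations.
    for mu in (-1.0, 0.0, 0.3, 2.0):
        print(mu,
              rb_density_ratio(mu, 0.4, 10, 1.0, 0.0, 1.0),
              rb_closed_form(mu, 0.4, 10, 1.0, 0.0, 1.0))
```

Values above 1 indicate evidence in favor of the corresponding $\mu$ and values below 1 indicate evidence against.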

### 2.1 Bias in Hypothesis Assessment Problems

The value of $RB_\Psi(\psi_*\,|\,x)$ tells us whether we have evidence for or against the hypothesis $H_0:\Psi(\theta)=\psi_*$.

Example 1. Location normal (continued).

Observe that as $\tau_0^2\to\infty$, then $RB(\mu\,|\,x)\to\infty$ for every $\mu$ and in particular for a hypothesized value $\mu_*$. So it would appear that overwhelming evidence is obtained for the hypothesis when the prior is very diffuse, and this holds irrespective of what the data says. Also, when the standardized value $\sqrt{n}\,|\bar{x}-\mu_*|/\sigma_0$ is fixed, then $RB(\mu_*\,|\,x)\to\infty$ as $n\to\infty$. This phenomenon also occurs if a Bayes factor (which equals $RB(\mu_*\,|\,x)$ in this case) or a posterior probability based upon a discrete prior mass at $\mu_*$ is used to assess $H_0$. Accordingly, all these measures lead to a sharp disagreement with the frequentist p-value $2(1-\Phi(\sqrt{n}\,|\bar{x}-\mu_*|/\sigma_0))$ when it is small. This is the Jeffreys-Lindley paradox and it arises quite generally.
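The paradox can be seen numerically. The sketch below assumes, purely for illustration, that the prior is centered at the hypothesized value ($\mu_0=\mu_*$), so that the relative belief ratio depends on the data only through the standardized value $z=\sqrt{n}(\bar{x}-\mu_*)/\sigma_0$. The chosen $z$, $n$ and prior standard deviations are hypothetical.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rb_at_hypothesis(z, n, sigma0, tau0):
    """RB(mu_* | x) when mu0 = mu_* and z = sqrt(n) * (xbar - mu_*) / sigma0."""
    r = sigma0**2 / (n * tau0**2)
    return math.sqrt(1.0 + 1.0 / r) * math.exp(-0.5 * z**2 / (1.0 + r))

def two_sided_p_value(z):
    """Frequentist p-value 2 * (1 - Phi(|z|))."""
    return 2.0 * (1.0 - Phi(abs(z)))

if __name__ == "__main__":
    z, n, sigma0 = 2.5, 50, 1.0
    print("p-value:", two_sided_p_value(z))
    for tau0 in (1.0, 10.0, 100.0, 1000.0):
        # The data summary z is held fixed; only the prior becomes more diffuse.
        print("tau0 =", tau0, "RB =", rb_at_hypothesis(z, n, sigma0, tau0))
```

With the data held fixed, the p-value stays small while the relative belief ratio moves from below 1 to far above 1 as the prior standard deviation grows.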

The Jeffreys-Lindley paradox shows that the strength of evidence cannot be measured strictly by the size of the measure of evidence. A logical way to assess this is to compare the evidence for $\psi_*$ with the evidence for the other possible values of $\psi$. The strength of the evidence can then be measured by

$$\Pi_\Psi\big(RB_\Psi(\psi\,|\,x)\le RB_\Psi(\psi_*\,|\,x)\,\big|\,x\big), \tag{2}$$

the posterior probability that the true value has evidence no greater than the evidence for $\psi_*$. So if $RB_\Psi(\psi_*\,|\,x)<1$ and (2) is small, then there is strong evidence against $\psi_*$ while, if $RB_\Psi(\psi_*\,|\,x)>1$ and (2) is large, then there is strong evidence in favor of $\psi_*$. The inequalities $\Pi_\Psi(\{\psi_*\}\,|\,x)\le\Pi_\Psi(RB_\Psi(\psi\,|\,x)\le RB_\Psi(\psi_*\,|\,x)\,|\,x)\le RB_\Psi(\psi_*\,|\,x)$ hold, and so when $RB_\Psi(\psi_*\,|\,x)$ is small there is strong evidence against $\psi_*$, and when $RB_\Psi(\psi_*\,|\,x)>1$ and $\Pi_\Psi(\{\psi_*\}\,|\,x)$ is big, then there is strong evidence in favor of $\psi_*$. Note, however, that $RB_\Psi(\psi_*\,|\,x)<1$ does not guarantee that (2) is small, and if (2) is large this means that there is weak evidence against $\psi_*$. There is no reason why multiple measures of the strength of the evidence can’t be used (see the discussion in Section 2.2). There are some issues with (2) in the continuous case that require a modification and we refer to Evans (2015) for this, as the strength does not play a key role in the discussion here. The important point is to somehow calibrate the measure of evidence using probability to measure how strong belief in the evidence is.
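In the discrete case the strength (2) is a finite sum and is easy to compute. The following sketch uses a hypothetical three-point parameter space with made-up prior probabilities and likelihood values at the observed data; only the formulas come from the text. It also checks that, for every value, the strength lies between the posterior probability of that value and its relative belief ratio, which holds because the posterior probability of any set on which $RB_\Psi\le c$ is at most $c$ times its prior probability.

```python
def relative_belief(prior, likelihood):
    """Return (posterior, rb) for a discrete parameter.

    prior:      {psi: prior probability}
    likelihood: {psi: density of the observed data under psi}
    """
    m = sum(prior[p] * likelihood[p] for p in prior)  # prior predictive at x
    posterior = {p: prior[p] * likelihood[p] / m for p in prior}
    rb = {p: posterior[p] / prior[p] for p in prior}
    return posterior, rb

def strength(psi_star, posterior, rb):
    """Posterior probability that the true value has RB no greater than RB(psi_star | x)."""
    return sum(posterior[p] for p in rb if rb[p] <= rb[psi_star])

if __name__ == "__main__":
    prior = {"a": 0.5, "b": 0.3, "c": 0.2}          # hypothetical prior
    likelihood = {"a": 0.10, "b": 0.40, "c": 0.25}  # hypothetical f_psi(x)
    posterior, rb = relative_belief(prior, likelihood)
    for p in prior:
        side = "in favor of" if rb[p] > 1 else "against"
        print(p, "RB =", rb[p], "evidence", side, "strength =", strength(p, posterior, rb))
```

In this toy setup the value with the smallest relative belief ratio has evidence against it, and a small strength marks that evidence as strong.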

Example 1. Location normal (continued).

A simple calculation shows that, with $\sqrt{n}\,|\bar{x}-\mu_*|/\sigma_0$ fixed, (2) converges to $2(1-\Phi(\sqrt{n}\,|\bar{x}-\mu_*|/\sigma_0))$ as $\tau_0^2\to\infty$. So, if the p-value is small, this indicates that a large value of $RB(\mu_*\,|\,x)$ is only weak evidence in favor of $\mu_*$. It is to be noted that the p-value is not a valid measure of evidence as described here, because there is no cut-off that corresponds to evidence for and evidence against. So its appearance as a measure of the strength of the evidence is not in any sense circular.

Simple algebra shows, however, that a difference of two p-values, is a valid measure of evidence via the cut-off 0. From this it is seen that the values of the first p-value that lead to evidence against generally become smaller as $n\to\infty$ or $\tau_0^2\to\infty$. For example, with and then the p-value equals Setting and the second p-value equals and so there is evidence against, with the second term equals and with it equals so there is evidence in favor in both cases. When increases these values become smaller as with the first p-value equal to is always evidence in favor. Similar results are obtained with a uniform prior on reflecting perhaps a desire to treat many values equivalently, as or For example, with and then the second p-value equals and there is evidence in favor These conclusions are similar to those found in Berger and Sellke (1987) and Berger and Delampady (1987).
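As an illustration of how a p-value can be calibrated to the principle of evidence, consider the special case where the prior is centered at the hypothesized value ($\mu_0=\mu_*$). This is an assumption made for this sketch only. Under it, $RB(\mu_*\,|\,x)>1$ exactly when the usual two-sided p-value exceeds $2(1-\Phi(b_0))$ with $b_0=\{(1+\sigma_0^2/(n\tau_0^2))\log(1+n\tau_0^2/\sigma_0^2)\}^{1/2}$. The helper names below are hypothetical.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value(z):
    """Two-sided p-value for the standardized value z."""
    return 2.0 * (1.0 - Phi(abs(z)))

def p_value_cutoff(n, sigma0, tau0):
    """p-value level separating evidence in favor (above) from against (below), mu0 = mu_*."""
    r = sigma0**2 / (n * tau0**2)
    b0 = math.sqrt((1.0 + r) * math.log(1.0 + 1.0 / r))
    return 2.0 * (1.0 - Phi(b0))

def rb_at_hypothesis(z, n, sigma0, tau0):
    """RB(mu_* | x) when mu0 = mu_* and z = sqrt(n) * (xbar - mu_*) / sigma0."""
    r = sigma0**2 / (n * tau0**2)
    return math.sqrt(1.0 + 1.0 / r) * math.exp(-0.5 * z**2 / (1.0 + r))

if __name__ == "__main__":
    for n in (5, 50, 500, 5000):
        print("n =", n, "cutoff =", p_value_cutoff(n, 1.0, 1.0))
```

The cut-off shrinks as the sample size or the prior variance grows, so a fixed significance level such as 0.05 does not correspond to evidence against for all $n$.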

It is very simple to elicit $(\mu_0,\tau_0^2)$ by prescribing an interval that contains the true $\mu$ with some high probability and taking $\mu_0$ to be the mid-point, so that $\tau_0^2$ is determined. There is no reason to take $\tau_0^2$ to be arbitrarily large. Still, one wonders whether the choice made is inducing some kind of bias into the problem, as taking $\tau_0^2$ too large clearly does.

Certainly default choices of priors should be avoided when possible, but even when eliciting, how can we know if the chosen prior is inducing bias? To assess this a numerical measure is required. The principle of evidence suggests that bias against is measured by

$$M\big(RB_\Psi(\psi_*\,|\,X)\le 1\,\big|\,\psi_*\big) \tag{3}$$

where $M(\cdot\,|\,\psi_*)$ is the prior predictive distribution of the data given that the hypothesis is true. So (3) is the prior probability that evidence in favor of $\psi_*$ will not be obtained when $\psi_*$ is the true value. If (3) is large, then there is an a priori bias against $\psi_*$.

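For the bias in favor of $\psi_*$ it is necessary to assess whether evidence against $\psi_*$ will fail to be obtained with high prior probability even when $\psi_*$ is false. One possibility is to measure bias in favor by

$$\int_{\Psi\setminus\{\psi_*\}}M\big(RB_\Psi(\psi_*\,|\,X)\ge 1\,\big|\,\psi\big)\,\Pi_\Psi(d\psi)=M\big(RB_\Psi(\psi_*\,|\,X)\ge 1\big)-M\big(RB_\Psi(\psi_*\,|\,X)\ge 1\,\big|\,\psi_*\big)\,\Pi_\Psi(\{\psi_*\}) \tag{4}$$

which is the prior probability of not obtaining evidence against $\psi_*$ when it is false. When $\Pi_\Psi(\{\psi_*\})=0$, then (4) equals $M(RB_\Psi(\psi_*\,|\,X)\ge 1)$ where $M$ is the prior predictive for the data. For continuous parameters it can be argued that it does not make sense to consider values of $\psi$ so close to $\psi_*$ that they are practically speaking indistinguishable. Suppose then there is a measure of distance $d_\Psi$ on $\Psi$ and a value $\delta>0$ such that, if $d_\Psi(\psi_*,\psi)<\delta$, then $\psi_*$ and $\psi$ are indistinguishable in the application. The bias in favor of $\psi_*$ can then be measured by replacing $\Psi\setminus\{\psi_*\}$ in (4) by $\{\psi:d_\Psi(\psi_*,\psi)\ge\delta\}$, which has upper bound

$$\sup_{\psi:\,d_\Psi(\psi_*,\psi)\ge\delta}M\big(RB_\Psi(\psi_*\,|\,X)\ge 1\,\big|\,\psi\big). \tag{5}$$

Typically $M(RB_\Psi(\psi_*\,|\,X)\ge 1\,|\,\psi)$ decreases as $\psi$ moves away from $\psi_*$, so (5) can be computed by finding the supremum over the set $\{\psi:d_\Psi(\psi_*,\psi)=\delta\}$ and, when $\psi$ is real-valued and $d_\Psi$ is Euclidean distance, this set equals $\{\psi_*-\delta,\psi_*+\delta\}$.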

It is to be noted that the measures of bias given by (3), (4) and (5) do not depend on using the relative belief ratio to measure evidence. Any valid measure of evidence will determine the same values when the relevant cut-off is substituted for 1. It is only (2) that depends on the specific choice of the relative belief ratio as the measure of evidence.

Under general circumstances, see Evans (2015), both biases will converge to 0 as the amount of data increases and so both biases can be controlled by design. Clearly there is no point in reporting the results of an analysis when there is a lot of bias unless the evidence actually contradicts the bias.

Example 1. Location normal (continued).

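Under $M(\cdot\,|\,\mu_*)$, $\bar{x}\sim N(\mu_*,\sigma_0^2/n)$. So, putting

$$a(\mu_*,\mu_0,\tau_0^2,\sigma_0^2,n)=\frac{\sigma_0(\mu_*-\mu_0)}{\sqrt{n}\,\tau_0^2},\qquad b(\mu_*,\mu_0,\tau_0^2,\sigma_0^2,n)=\left\{\left(1+\frac{\sigma_0^2}{n\tau_0^2}\right)\left[\log\left(1+\frac{n\tau_0^2}{\sigma_0^2}\right)+\frac{(\mu_*-\mu_0)^2}{\tau_0^2}\right]\right\}^{1/2},$$

then (3) is given by

$$M\big(RB(\mu_*\,|\,X)\le 1\,\big|\,\mu_*\big)=1-\Phi\big(a(\mu_*,\mu_0,\tau_0^2,\sigma_0^2,n)+b(\mu_*,\mu_0,\tau_0^2,\sigma_0^2,n)\big)+\Phi\big(a(\mu_*,\mu_0,\tau_0^2,\sigma_0^2,n)-b(\mu_*,\mu_0,\tau_0^2,\sigma_0^2,n)\big). \tag{6}$$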

This goes to 0 as $n\to\infty$ or as $\tau_0^2\to\infty$. So bias against can be controlled by sample size or by the diffuseness of the prior although, as subsequently shown, a diffuse prior induces bias in favor. It is also the case that (6) converges to 0 when $|\mu_*-\mu_0|\to\infty$ or when is fixed and So it would appear that using a prior with location quite different from the hypothesized value, or a prior that is much more concentrated than the sampling distribution, can lower bias against. These are situations, however, where one can expect to have prior-data conflict after observing the data.
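Expression (6) is straightforward to evaluate, and it can be checked by simulating $\bar{x}$ under the hypothesized mean and counting how often evidence in favor is not obtained. The sketch below does both; the function names, the seed and the number of draws are illustrative choices, and the relative belief ratio is coded on the log scale with the factor $1+\sigma_0^2/(n\tau_0^2)$ entering the exponent as a reciprocal.

```python
import math
import random

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bias_against(mu_star, mu0, tau0, sigma0, n):
    """Closed form (6): M(RB(mu_*|X) <= 1 | mu_*) = 1 - Phi(a + b) + Phi(a - b)."""
    r = sigma0**2 / (n * tau0**2)
    a = sigma0 * (mu_star - mu0) / (math.sqrt(n) * tau0**2)
    b = math.sqrt((1.0 + r) * (math.log(1.0 + 1.0 / r)
                               + (mu_star - mu0) ** 2 / tau0**2))
    return 1.0 - Phi(a + b) + Phi(a - b)

def log_rb(mu, xbar, n, sigma0, mu0, tau0):
    """log RB(mu | x) for the location normal model."""
    r = sigma0**2 / (n * tau0**2)
    w = (math.sqrt(n) * (xbar - mu) / sigma0
         + sigma0 * (mu0 - mu) / (math.sqrt(n) * tau0**2))
    return (0.5 * math.log(1.0 + 1.0 / r) - 0.5 * w**2 / (1.0 + r)
            + (mu - mu0) ** 2 / (2.0 * tau0**2))

def bias_against_mc(mu_star, mu0, tau0, sigma0, n, draws=20000, seed=0):
    """Monte Carlo estimate of (3): simulate xbar under mu_* and count RB <= 1."""
    rng = random.Random(seed)
    sd = sigma0 / math.sqrt(n)
    misses = 0
    for _ in range(draws):
        xbar = rng.gauss(mu_star, sd)
        if log_rb(mu_star, xbar, n, sigma0, mu0, tau0) <= 0.0:
            misses += 1
    return misses / draws

if __name__ == "__main__":
    # N(0, 1) prior, sigma0 = 1, hypothesized mean 0.5 (hypothetical values).
    for n in (5, 20, 100):
        print(n, bias_against(0.5, 0.0, 1.0, 1.0, n),
              bias_against_mc(0.5, 0.0, 1.0, 1.0, n))
```

Running the loop for increasing $n$ shows the bias against shrinking, which is the sense in which it is controlled by design.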

The entries in Table 1 record the bias against for a specific case and illustrate that increasing $n$ does indeed reduce bias. The entries also show that bias against can be greater when the prior is centered on the hypothesis. Figure 1 contains a plot of the bias against as a function of $\mu_*$ when using a $N(0,1)$ prior. Note that the maximum bias against occurs at the mean of the prior (and equals ) and this typically occurs when $\sigma_0^2/n<\tau_0^2$, namely, when the data is more concentrated than the prior. Figure 1 also contains a plot of the bias against when using a $N(0,0.01)$ prior, which is more concentrated than the data distribution. That the bias against is maximized, as a function of the hypothesized mean $\mu_*$, when $\mu_*$ equals the value associated with the strongest belief under the prior seems odd. This phenomenon arises quite often, and the mathematical explanation for this is that the greater the amount of prior probability assigned to a value, the harder it is for the posterior probability to increase, and so it is quite logical when considering evidence. It will be seen that this phenomenon is very convenient for the control of bias in estimation problems and could be used as an argument for using a prior centered on the hypothesis, although this is not necessary as beliefs may be different.

Figure 1: Plot of bias against $H_0=\{\mu\}$ with a $N(0,1)$ prior (dashed line) and a $N(0,0.01)$ prior (solid line) with $n=5$, $\sigma_0=1$.

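Now consider (5), namely, bias in favor of $\mu_*$. Putting

$$c(\mu_*,\mu,\mu_0,\tau_0^2,\sigma_0^2,n)=\frac{\sqrt{n}(\mu_*-\mu)}{\sigma_0}+a(\mu_*,\mu_0,\tau_0^2,\sigma_0^2,n),$$

then (5) equals $\sup M(RB(\mu_*\,|\,X)\ge 1\,|\,\mu_*\pm\delta)$ where

$$M\big(RB(\mu_*\,|\,X)\ge 1\,\big|\,\mu\big)=\Phi\big(c(\mu_*,\mu,\mu_0,\tau_0^2,\sigma_0^2,n)+b(\mu_*,\mu_0,\tau_0^2,\sigma_0^2,n)\big)-\Phi\big(c(\mu_*,\mu,\mu_0,\tau_0^2,\sigma_0^2,n)-b(\mu_*,\mu_0,\tau_0^2,\sigma_0^2,n)\big) \tag{7}$$

which, for $\mu\neq\mu_*$, converges to 0 as $n\to\infty$ and also as $\mu\to\pm\infty$. But (7) converges to 1 as $\tau_0^2\to\infty$, so if the prior is too diffuse there will be bias in favor of $\mu_*$. So resolving the Jeffreys-Lindley paradox requires choosing the sample size $n$, after choosing the prior, so that (7) is suitably small. Note that choosing $\tau_0^2$ larger reduces bias against but increases bias in favor, and so generally bias cannot be avoided by choice of prior. Figure 2 is a plot of $M(RB(\mu_*\,|\,X)\ge 1\,|\,\mu)$ for a particular case and this strictly decreases as $\mu$ moves away from $\mu_*$.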
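The trade-off between the two biases can be tabulated directly from (6) and (7). The sketch below assumes, for illustration only, a prior centered at the hypothesized value ($\mu_0=\mu_*=0$), $\sigma_0=1$, $n=20$ and $\delta=0.5$, and evaluates the bias in favor at $\mu_*\pm\delta$. The function names are hypothetical.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def _a_b(mu_star, mu0, tau0, sigma0, n):
    """The quantities a and b used in (6) and (7)."""
    r = sigma0**2 / (n * tau0**2)
    a = sigma0 * (mu_star - mu0) / (math.sqrt(n) * tau0**2)
    b = math.sqrt((1.0 + r) * (math.log(1.0 + 1.0 / r)
                               + (mu_star - mu0) ** 2 / tau0**2))
    return a, b

def bias_against(mu_star, mu0, tau0, sigma0, n):
    """(6): prior probability of not getting evidence in favor of mu_* when it is true."""
    a, b = _a_b(mu_star, mu0, tau0, sigma0, n)
    return 1.0 - Phi(a + b) + Phi(a - b)

def bias_in_favor(mu_star, mu, mu0, tau0, sigma0, n):
    """(7): prior probability of not getting evidence against mu_* when mu is true."""
    a, b = _a_b(mu_star, mu0, tau0, sigma0, n)
    c = math.sqrt(n) * (mu_star - mu) / sigma0 + a
    return Phi(c + b) - Phi(c - b)

def bias_in_favor_at_delta(mu_star, delta, mu0, tau0, sigma0, n):
    """Larger of (7) evaluated at the two boundary values mu_* - delta and mu_* + delta."""
    return max(bias_in_favor(mu_star, mu_star - delta, mu0, tau0, sigma0, n),
               bias_in_favor(mu_star, mu_star + delta, mu0, tau0, sigma0, n))

if __name__ == "__main__":
    for tau0 in (0.5, 1.0, 2.0, 5.0, 10.0):
        print("tau0 =", tau0,
              "against =", bias_against(0.0, 0.0, tau0, 1.0, 20),
              "in favor =", bias_in_favor_at_delta(0.0, 0.5, 0.0, tau0, 1.0, 20))
```

In this setup a more diffuse prior lowers the bias against and raises the bias in favor, so only a larger $n$ reduces both.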

In Table 2 we have recorded some specific values of the bias in favor using (4) and using (5) where $d_\Psi$ is Euclidean distance. It is seen that bias in favor can be quite serious for small samples. When using (5) this can be mitigated by making $\delta$ larger. For example, with the bias in favor equals Note, however, that $\delta$ is not chosen to make the bias in favor small; rather, it is determined in an application as the difference from the null that is just practically important. The virtues of determining a suitable value of $\delta$ are also readily apparent as (5) is much smaller than (4) for larger

A comparison of Tables 1 and 2 shows that a study whose purpose is to demonstrate evidence in favor of $H_0$ is much more demanding than one whose purpose is to determine whether or not there is evidence against $H_0$.

### 2.2 Bias in Estimation Problems

The relative belief estimate of $\psi$ is the value that maximizes the measure of evidence, namely, $\psi(x)=\arg\sup_\psi RB_\Psi(\psi\,|\,x)$. It is easy to show that $RB_\Psi(\psi(x)\,|\,x)\ge 1$, with the inequality strict except in trivial contexts. The accuracy of this estimate can be measured by the "size" of the plausible region $Pl_\Psi(x)=\{\psi:RB_\Psi(\psi\,|\,x)>1\}$, the set of values of $\psi$ that have evidence in their favor, and note $\psi(x)\in Pl_\Psi(x)$. To say that $\psi(x)$ is an accurate estimate requires that $Pl_\Psi(x)$ be "small", perhaps as measured by $Vol(Pl_\Psi(x))$ where $Vol$ is some measure of volume, and also have high posterior content $\Pi_\Psi(Pl_\Psi(x)\,|\,x)$, which measures the belief that the true value is in $Pl_\Psi(x)$. Note that $Pl_\Psi(x)$ does not depend on the specific measure of evidence chosen, in this case the relative belief ratio. Any valid estimator must satisfy the principle of evidence and so be in $Pl_\Psi(x)$. It is argued that in an estimation problem, bias is measured by various coverage probabilities for the plausible region.

Note too that if there is evidence in favor of $\psi_*$ then $\psi_*\in Pl_\Psi(x)$, and so $\psi_*$ represents the natural estimate of $\psi$ provided there was a clear reason for assessing the evidence for this value. The strength of the evidence in favor of $\psi_*$ can then also be measured by the size of $Pl_\Psi(x)$. Similarly, if evidence against $\psi_*$ is obtained then $\psi_*\in Im_\Psi(x)=\{\psi:RB_\Psi(\psi\,|\,x)<1\}$, the implausible region, and then there is strong evidence against $\psi_*$ provided $Pl_\Psi(x)$ has small volume and large posterior probability. A virtue of this approach to measuring the strength of the evidence is that it does not depend upon using the relative belief ratio to measure evidence.

#### 2.2.1 Bias Against

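The prior probability that the plausible region does not cover the true value measures bias against when estimating $\psi$. For if this probability is large, then the estimate and the plausible region are a priori likely to be misleading as to the true value. The prior probability that $Pl_\Psi(X)$ doesn’t contain $\psi$ when $\psi\sim\Pi_\Psi$ and $X\sim M(\cdot\,|\,\psi)$ is

$$E_{\Pi_\Psi}\big(M(\psi\notin Pl_\Psi(X)\,|\,\psi)\big)=E_{\Pi_\Psi}\big(M(RB_\Psi(\psi\,|\,X)\le 1\,|\,\psi)\big) \tag{8}$$

which is also the average bias against over all hypothesis testing problems $H_0:\Psi(\theta)=\psi$. Note that $1-E_{\Pi_\Psi}(M(\psi\notin Pl_\Psi(X)\,|\,\psi))=E_{\Pi_\Psi}(M(\psi\in Pl_\Psi(X)\,|\,\psi))=E_M(\Pi_\Psi(Pl_\Psi(X)\,|\,X))$, which is the prior coverage probability of $Pl_\Psi$ and also its prior expected posterior content. Also,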

$$\sup_\psi M\big(\psi\notin Pl_\Psi(X)\,\big|\,\psi\big)=\sup_\psi M\big(RB_\Psi(\psi\,|\,X)\le 1\,\big|\,\psi\big) \tag{9}$$

is an upper bound on (8). Therefore, controlling (9) controls the bias against in estimation and in all hypothesis assessment problems involving $\psi$. Also, $E_M(\Pi_\Psi(Pl_\Psi(X)\,|\,X))\ge 1-\sup_\psi M(\psi\notin Pl_\Psi(X)\,|\,\psi)$, so using (9) implies lower bounds for the coverage probability and for the expected posterior content of the plausible region. In general, both (8) and (9) converge to 0 with increasing amounts of data. So it is possible to control for bias against in estimation problems by design.
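For a finite parameter space and a finite sample space these quantities can be computed by exact enumeration. The sketch below uses a hypothetical setup that is not from the paper: $X\sim\text{Binomial}(k,\theta)$ with $\theta$ restricted to three values and a made-up prior. It computes the prior coverage probability of the plausible region, its prior expected posterior content, and the average and worst-case bias against. The first two agree because both are the joint prior probability of the event that the plausible region contains the parameter.

```python
from math import comb

def binom_pmf(x, trials, p):
    """Binomial(trials, p) probability of x successes."""
    return comb(trials, x) * p**x * (1.0 - p) ** (trials - x)

def plausible_region_summaries(prior, trials):
    """prior: {theta: prob}, theta a success probability; X ~ Binomial(trials, theta).

    Returns (prior coverage, expected posterior content, average bias against,
    worst-case bias against) for Pl(x) = {theta : RB(theta | x) > 1}.
    """
    xs = range(trials + 1)
    f = {(t, x): binom_pmf(x, trials, t) for t in prior for x in xs}
    m = {x: sum(prior[t] * f[(t, x)] for t in prior) for x in xs}  # prior predictive
    # RB(theta | x) = f_theta(x) / m(x), so theta is plausible iff f_theta(x) > m(x).
    in_pl = {(t, x): f[(t, x)] > m[x] for t in prior for x in xs}
    coverage = {t: sum(f[(t, x)] for x in xs if in_pl[(t, x)]) for t in prior}
    prior_coverage = sum(prior[t] * coverage[t] for t in prior)
    expected_content = sum(
        m[x] * sum(prior[t] * f[(t, x)] / m[x] for t in prior if in_pl[(t, x)])
        for x in xs)
    average_bias = 1.0 - prior_coverage               # analogue of (8)
    worst_bias = max(1.0 - c for c in coverage.values())  # analogue of (9)
    return prior_coverage, expected_content, average_bias, worst_bias

if __name__ == "__main__":
    prior = {0.2: 0.3, 0.5: 0.4, 0.8: 0.3}  # hypothetical prior
    for trials in (5, 20, 50):
        print(trials, plausible_region_summaries(prior, trials))
```

Here one minus the worst-case bias against is a lower bound on the coverage probability at every parameter value.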

Example 1. Location normal (continued).

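The value of $M(RB(\mu\,|\,X)\le 1\,|\,\mu)$ is given in (6) and examples are plotted in Figure 1. When $\mu\sim N(\mu_0,\tau_0^2)$, then $Z=(\mu-\mu_0)/\tau_0\sim N(0,1)$, so

$$E_\Pi\big(M(RB(\mu\,|\,X)\le 1\,|\,\mu)\big)=1-E\left[\Phi\left(\frac{\sigma_0}{\sqrt{n}\,\tau_0}Z+\left\{\left(1+\frac{\sigma_0^2}{n\tau_0^2}\right)\left[\log\left(1+\frac{n\tau_0^2}{\sigma_0^2}\right)+Z^2\right]\right\}^{1/2}\right)-\Phi\left(\frac{\sigma_0}{\sqrt{n}\,\tau_0}Z-\left\{\left(1+\frac{\sigma_0^2}{n\tau_0^2}\right)\left[\log\left(1+\frac{n\tau_0^2}{\sigma_0^2}\right)+Z^2\right]\right\}^{1/2}\right)\right]$$

which is notably independent of the prior mean $\mu_0$. The dominated convergence theorem implies $E_\Pi(M(RB(\mu\,|\,X)\le 1\,|\,\mu))\to 0$ as $n\to\infty$ or as $\tau_0^2\to\infty$. So provided $n\tau_0^2/\sigma_0^2$ is large enough, there is no estimation bias against. Table 3 illustrates some values of this bias measure. Subtracting the probabilities in Table 3 from 1 gives the prior probability that the plausible region covers the true value and the expected posterior content of the plausible region. So when the prior probability of the plausible region containing the true value is so is a Bayesian confidence interval for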
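The average bias against in the location normal example can be approximated by integrating (6) against the prior. The sketch below uses a simple trapezoidal rule over the standardized prior variable $z=(\mu-\mu_0)/\tau_0$; the integration range and step count are arbitrary choices made for illustration. Evaluating it for two different prior means shows the independence from $\mu_0$.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bias_against(mu_star, mu0, tau0, sigma0, n):
    """(6): M(RB(mu_*|X) <= 1 | mu_*) = 1 - Phi(a + b) + Phi(a - b)."""
    r = sigma0**2 / (n * tau0**2)
    a = sigma0 * (mu_star - mu0) / (math.sqrt(n) * tau0**2)
    b = math.sqrt((1.0 + r) * (math.log(1.0 + 1.0 / r)
                               + (mu_star - mu0) ** 2 / tau0**2))
    return 1.0 - Phi(a + b) + Phi(a - b)

def average_bias_against(mu0, tau0, sigma0, n, half_width=8.0, steps=4000):
    """Trapezoidal approximation to E_Pi[M(RB(mu|X) <= 1 | mu)] with mu = mu0 + tau0 * z."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        z = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * density * bias_against(mu0 + tau0 * z, mu0, tau0, sigma0, n)
    return total * h

if __name__ == "__main__":
    for n in (5, 20, 100):
        print(n,
              average_bias_against(0.0, 1.0, 1.0, n),   # prior mean 0
              average_bias_against(3.7, 1.0, 1.0, n),   # prior mean 3.7, same answer
              bias_against(0.0, 0.0, 1.0, 1.0, n))      # bias against at the prior mean
```

The last column is the bias against at the prior mean, which in these cases sits above the average.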

To use (9) it is necessary to maximize $M(RB(\mu\,|\,X)\le 1\,|\,\mu)$ as a function of $\mu$, and it is seen that, at least when the prior is not overly concentrated, this maximum occurs at $\mu=\mu_0$. Figure 1 shows that when using the $N(0,1)$ prior the maximum occurs at $\mu=0$ when $n=5$ and from the second column of Table 1, the maximum equals . The average bias against is given by as recorded in Table 3. Note that the maximum also occurs at $\mu=0$ for the other values of $n$ recorded in Table 1.

#### 2.2.2 Bias in Favor

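Bias in favor occurs when the prior probability that $Im_\Psi(X)$ does not cover a false value is large, namely, when

$$\int_\Psi\int_{\Psi\setminus\{\psi_*\}}M\big(\psi_*\notin Im_\Psi(X)\,\big|\,\psi\big)\,\Pi_\Psi(d\psi)\,\Pi_\Psi(d\psi_*)=\int_\Psi\int_{\Psi\setminus\{\psi_*\}}M\big(RB_\Psi(\psi_*\,|\,X)\ge 1\,\big|\,\psi\big)\,\Pi_\Psi(d\psi)\,\Pi_\Psi(d\psi_*) \tag{10}$$

is large, as this would seem to imply that the plausible region will cover a randomly selected false value from the prior with high prior probability. Note that (10) is the prior mean of (4) and in the continuous case equals $E_{\Pi_\Psi}(M(RB_\Psi(\psi_*\,|\,X)\ge 1))$. As previously discussed, however, it often doesn’t make sense to distinguish values of $\psi$ that are close to $\psi_*$. The bias in favor for estimation can then be measured by

$$E_{\Pi_\Psi}\Big(\sup_{\psi:\,d_\Psi(\psi,\psi_*)\ge\delta}M\big(\psi_*\notin Im_\Psi(X)\,\big|\,\psi\big)\Big)=E_{\Pi_\Psi}\Big(\sup_{\psi:\,d_\Psi(\psi,\psi_*)\ge\delta}M\big(RB_\Psi(\psi_*\,|\,X)\ge 1\,\big|\,\psi\big)\Big). \tag{11}$$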

An upper bound on (11) is commonly equal to 1 as illustrated in Figure 3 and so is not useful.

It is the size and posterior content of $Pl_\Psi(x)$ that provides a measure of the accuracy of the estimate $\psi(x)$. As discussed in Section 2.2.1, the a priori expected posterior content of $Pl_\Psi(x)$ can be controlled by bias against. The a priori expected volume of $Pl_\Psi(x)$ satisfies

$$E_M\big(Vol(Pl_\Psi(X))\big)=\int_\Psi\int_\Psi M\big(\psi_*\in Pl_\Psi(X)\,\big|\,\psi\big)\,\Pi_\Psi(d\psi)\,Vol(d\psi_*). \tag{12}$$

Notice that when $\Pi_\Psi(\{\psi_*\})=0$ for every $\psi_*$, this can be interpreted as a kind of average of the prior probabilities of the plausible region covering a false value.

Example 1. Location normal (continued).

It follows from (7) that

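$$\sup M\big(RB(\mu_*\,|\,X)\ge 1\,\big|\,\mu_*\pm\delta\big)=\sup\Big\{\Phi\big(c(\mu_*,\mu_*\pm\delta,\mu_0,\tau_0^2,\sigma_0^2,n)+b(\mu_*,\mu_0,\tau_0^2,\sigma_0^2,n)\big)-\Phi\big(c(\mu_*,\mu_*\pm\delta,\mu_0,\tau_0^2,\sigma_0^2,n)-b(\mu_*,\mu_0,\tau_0^2,\sigma_0^2,n)\big)\Big\}.$$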

Note that as , then when see Figure 3, and converges to 0 if so it would appear that the better circumstance for guarding against bias in favor is when the prior is putting in more information than the data. As previously noted, however, this is a situation where we might expect prior-data conflict to arise and, except in exceptional circumstances, it should be avoided.

Figure 3: Bias in favor of $\mu$ maximized over $\mu\pm\delta$ based on a $N(0,1)$ prior and $\sigma_0=1$, $n=20$, $\delta=0.5$.

Table 4 contains values of (7) for this situation with different values of .

Some elementary calculations give with

 w(¯x,n,σ20,μ0,τ20)

where ) under It is notable that the prior distribution of the width is independent of the prior mean. Table 5 contains some expected half-widths together with the coverage probabilities of

## 3 Frequentist and Optimal Properties

Consider now the bias against $\psi_*$, namely, $M(RB_\Psi(\psi_*\,|\,X)\le 1\,|\,\psi_*)$. If we repeatedly generate $X\sim M(\cdot\,|\,\psi_*)$, then this probability is the long-run proportion of times that $RB_\Psi(\psi_*\,|\,X)\le 1$. This frequentist interpretation depends on the conditional prior $\pi(\cdot\,|\,\psi_*)$ and when $\Psi(\theta)=\theta$, so there are no nuisance parameters, this is a "pure" frequentist probability. Even in the latter case there is some dependence on the prior, however, as $RB(\theta\,|\,x)=f_\theta(x)/m(x)$, so $x$ satisfies $RB(\theta\,|\,x)\le 1$ iff $f_\theta(x)\le m(x)$ where $m(x)=\int_\Theta f_\theta(x)\,\Pi(d\theta)$. So in general the region $\{x:RB_\Psi(\psi_*\,|\,x)\le 1\}$ depends on $\pi$, but the probability of this region depends only on the conditional prior predictive given $\psi_*$, namely, $M(\cdot\,|\,\psi_*)$, and not on the marginal prior on $\psi$. We refer to probabilities that depend only on $M(\cdot\,|\,\psi)$ as frequentist, for example, coverage probabilities are called confidences, and those that depend on the full prior as Bayesian confidences. The frequentist label is similar to the use of the confidence terminology when dealing with random effects models, as nuisance parameters have been integrated out.

Suppose now that some other general rule, not necessarily the principle of evidence, is used to determine whether there is evidence for or against  and this leads to the set as those data sets that do not give evidence in favor of The rules of potential interest will satisfy since this implies better performance a priori in terms of identifying when data has evidence in favor of via the set than the principle of evidence For example, for some satisfies this but note that a value satisfying violates the principle of evidence if it is claimed there is evidence in favor of . Putting leads to the following result.

Theorem 1. (i) The prior probability is maximized among all satisfying by (ii) If then maximizes the prior probability of not obtaining evidence in favor of when it is false and otherwise maximizes this probability among all rules satisfying

When rules may exist having greater prior probability of not getting evidence in favor of when it is false but the price paid for this is the violation of the principle of evidence. Also, when comparing rules based on their ability to distinguish falsity it only seems fair that the rules perform the same under the truth. So Theorem 1 is a general optimality result for the principle of evidence applied to hypothesis assessment when considering bias against.

Now consider which is the set of values for which there is evidence in their favor after observing according to some alternative evidence rule. Since then
and so the Bayesian coverage of is at least as large as that of and so represents a viable alternative to using

The following establishes an optimality result for .

Theorem 2. (i) The prior probability that the region doesn’t cover a value generated from the prior, namely, is maximized among all regions satisfying for every by (ii) If for all then maximizes the prior probability of not covering a false value and otherwise maximizes this probability among all satisfying for all

Again when the existence of a region with better properties with respect to not covering false values than can’t be ruled out but, when considering such a property, it seems only fair to compare regions with the same coverage probability and in that case is optimal. So Theorem 2 is also a general optimality result for the principle of evidence applied to estimation when considering bias against. Also, if there is a value then serves as a lower bound on the coverage probabilities, and thus is a -confidence region for and this is a pure frequentist -confidence region when Since then Example 1 shows that it is reasonable to expect that such a exists.

The principle of evidence leads to the following satisfying properties, which connect the concept of bias as discussed here with the frequentist concept.

Theorem 3. (i) Using the principle of evidence, the prior probability of getting evidence in favor of when it is true is greater than or equal to the prior probability of getting evidence in favor of given that is false. (ii) The prior probability of  covering the true value is always greater than or equal to the prior probability of covering a false value.

The properties stated in Theorem 3 are similar to a property called unbiasedness for frequentist procedures. For example, a test is unbiased if the probability of rejecting a null is always larger when it is false than when it is true, and a confidence region is unbiased if the probability of covering the true value is always greater than the probability of covering a false value. While the inferences discussed here are "unbiased" in this generalized sense, they could still be biased against or in favor in the practical sense of this paper, as it is the amount of data that controls this.

Now consider bias in favor and suppose there is an alternative characterization of evidence that leads to the region consisting of all data sets that do not lead to evidence against Putting we restrict attention to regions satisfying