Geometric Conditions for the Discrepant Posterior Phenomenon and Connections to Simpson's Paradox

01/23/2020 ∙ by Yang Chen, et al. ∙ University of Michigan

The discrepant posterior phenomenon (DPP) is a counterintuitive phenomenon that occurs in the Bayesian analysis of multivariate parameters. It arises when an estimate of a marginal parameter obtained from the posterior is more extreme than both of those obtained using either the prior or the likelihood alone. Inferential claims that exhibit the DPP defy intuition, and the phenomenon can be surprisingly ubiquitous in well-behaved Bayesian models. Using point estimation as an example, we derive conditions under which the DPP occurs in Bayesian models with exponential quadratic likelihoods, including Gaussian models and those with the local asymptotic normality property, with conjugate multivariate Gaussian priors. We also examine the DPP for the Binomial model, in which the posterior mean is not a linear combination of that of the prior and the likelihood. We provide an intuitive geometric interpretation of the phenomenon and show that there exists a non-trivial space of marginal directions such that the DPP occurs. We further relate the phenomenon to Simpson's paradox and uncover a deep-rooted connection between the two that is associated with marginalization. We also draw connections with Bayesian computational algorithms when difficult geometry exists. Theoretical results are complemented by numerical illustrations. Scenarios covered in this study have implications for parameterization, sensitivity analysis, and prior choice for Bayesian modeling.






1 Introduction

In Bayesian analysis, the posterior distribution provides a probabilistic summary that incorporates both the prior knowledge and what can be learned from data. Statistical inferential statements on model parameters are derived solely from the posterior distribution. In many applications for which the model parameter is multi-dimensional, we are only interested in inference about a certain marginal parameter, say , where is the full model parameter of dimension , and is the nuisance parameter. For such inference problems, Bayesian theory suggests solving them “as one coherent whole, including assigning priors and conducting analyses with nuisance parameters” (Wasserman, 2007); and inference for the target parameter , is obtained from the marginal posterior of . Efron (1986) and Wasserman (2007) contrasted Bayesian and frequentist approaches using an illustrative example of estimating population quantiles. While a frequentist approach directly uses the sample quantiles to estimate the population quantiles, a Bayesian approach requires a prior to be assigned to a full parameter , and an estimate of the population quantiles can then be obtained from the marginal posterior of . This Bayesian inference approach is coherent and well supported by probability theory, and it has been extensively used in practice; see discussions on multi-parameter models in Gelman et al. (2013) and references therein.

The posterior serves as a combination of information coming from the prior and likelihood (Gelman et al., 2013); thus we generally expect it to be a compromise between the two. Estimates based on the posterior are expected to be more moderate than either of the corresponding estimates from the prior or the likelihood. For example, in a Gaussian conjugate model with unknown mean and known variance, the posterior mean is a weighted average of the prior mean and the maximum likelihood estimate (MLE). Therefore, the posterior estimate lies between the estimates based on the prior and the likelihood. What is less well known is that, when we have multiple model parameters (parameter of interest plus nuisance parameters) and the prior is informative, the practice of marginalizing a full Bayesian posterior to the parameter of interest can lead to counter-intuitive posterior inference. The discrepant posterior phenomenon (DPP) occurs when a point estimate derived from the (marginal) posterior takes a value that is more extreme than those based on either the prior or the data; see Xie et al. (2013) for an example. The DPP is counterintuitive because it defies the general expectation of a prior-data compromise.
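The one-dimensional compromise described above can be sketched numerically. This is a minimal illustration under the standard Gaussian conjugate formulas; the function name and the numbers are hypothetical and chosen only for this example.

```python
def posterior_mean_1d(prior_mean, prior_var, xbar, sigma2, n):
    """Posterior mean of a Gaussian mean with known sampling variance sigma2."""
    prior_prec = 1.0 / prior_var       # precision carried by the prior
    data_prec = n / sigma2             # precision carried by the sample mean (the MLE)
    w = prior_prec / (prior_prec + data_prec)   # weight on the prior mean, always in (0, 1)
    return w * prior_mean + (1.0 - w) * xbar


# Hypothetical numbers: prior N(0, 1), four observations with variance 4, sample mean 2.
post = posterior_mean_1d(prior_mean=0.0, prior_var=1.0, xbar=2.0, sigma2=4.0, n=4)
print(post)  # falls between the prior mean 0.0 and the MLE 2.0
```

Because the weight always lies strictly between 0 and 1, a single-parameter model of this kind cannot produce a posterior mean outside the range of the prior mean and the MLE; the DPP needs more than one dimension.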

The DPP was first reported in a study of a Binomial clinical trial conducted by Johnson & Johnson (J&J) Inc. in Xie et al. (2013). Both expert opinions and data from the clinical trial agreed that the improvement , from the control success rate () to the treatment success rate (), is around . However, the marginal posteriors of from several candidate full Bayesian models on suggested that the improvement is over ; cf., Table 3 of Xie et al. (2013). Figure 1 is a reproduction of a simpler example reported in Figure 2 of Xie and Singh (2013), where independent Beta priors are used for and . As can be seen from the figure, the marginal posterior of peaks outside of the marginal prior distribution of and the profile likelihood function of . Here, is the joint likelihood of , and the marginal prior distribution of and the profile likelihood function are more or less in agreement.

Figure 1: (a) Contour plots of the joint prior , likelihood function , and posterior function ; (b) Projections (marginals) of ,, and onto the direction of . The marginal posterior of in (b) lies to the right of both the marginal/projection of the prior and that of the likelihood . This figure is a reproduction of Figure 2 (a) and (b) of Xie and Singh (2013) with data and an independent beta prior with for and for .

The DPP is not a mathematical oversight. By strictly following Bayes' formula and probability theory, conclusions on the marginal posterior of from a full Bayes analysis are necessarily correct, with or without DPP, provided that both the prior and likelihood specifications are correct and the prior is proper. Practically, however, the DPP can lead to undesirable complications. For instance, in the example from Xie et al. (2013), should we trust the conclusion that the improvement is over based on the marginal posterior of ? Many may choose to question the prior or the data model specifications. However, as we will see in later sections, the phenomenon is surprisingly ubiquitous in well-behaved Bayesian models. An added complication to this J&J clinical trial is that the prior information of is not completely given: we only have the marginal prior information on from a trial approved by FDA and information from medical experts. It remains an open question how to use a Bayesian method to analyze this clinical trial while avoiding the DPP. Follow-up investigations, e.g. Xie and Singh (2013, Section 6.2), Robert (2013) and Xie (2013), suggest that the DPP is commonplace in multivariate Bayesian analysis. As long as point estimates (such as the mean or mode) computed from the prior, likelihood and posterior are not on the same line, there exists at least one linear margin on which the marginal posterior appears to be more extreme than both the prior and the data likelihood. This observation generated further discussions on whether it is necessary to require some alignment of the prior given the likelihood, which in turn raised questions and disagreement about whether data-dependent priors should be used.

To be concrete and mathematically specific, but without loss of generality, we illustrate and study the DPP using point estimation in this paper. Consider the parameter of interest , for a given , . The DPP then refers to the situation in which the posterior mean does not lie between the prior mean and a point estimate derived from the likelihood. In the cases that we consider, the point estimator is taken to be the maximum likelihood estimate (MLE). In particular, technical examinations of the DPP are performed for two classes of models, in order to facilitate concrete probabilistic statements. In Section 3, observations are assumed to exhibit an exponential quadratic likelihood, that is, , where is a quadratic function of , and Section 5 does so for the Binomial likelihood. The family of exponential quadratic likelihoods includes Gaussian models and also those with the local asymptotic normality (LAN) property as special cases. For ease of mathematical derivation and without loss of generality, the prior distribution is assumed to be fully specified as a multivariate Gaussian distribution, which is conjugate to the Gaussian and LAN likelihoods.

The development in this paper provides a fuller picture and further understanding of the DPP and can provide intuitions on how to mitigate and interpret the DPP in the presented cases. The results can serve as precautions for practitioners of Bayesian inference when highly informative priors are desired, and more importantly provide practical guidance towards prior specification (including dispersed priors, hierarchical priors, and re-parameterization for prior-likelihood curvature alignment) for Bayesian inference, to mitigate and possibly avoid the DPP.

The remainder of the paper is organized as follows. First, a precise definition of the DPP is given in Section 2, with brief discussions on its prevalence: there always exist certain marginal directions along which the DPP occurs. Second, in Section 3 we derive specific conditions under which the DPP occurs (and does not occur) in a model from the exponential quadratic likelihood family, and provide numerical examples to illustrate the prevalence of the DPP and the theoretical conditions. Third, in Section 4 we give geometric and intuitive interpretations of the conditions under which the DPP occurs or not, and establish a connection between the DPP and Simpson’s paradox. Fourth, in Section 5 we revisit the Binomial example given in Xie et al. (2013). In this highly nonlinear model, the DPP is much more complicated than in the Gaussian case. Instead of giving analytical solutions, we examine the DPP for the Binomial model from both theoretical and numerical perspectives. Finally, we conclude the paper in Section 6 with discussions on the DPP and its implications for parameterization, sensitivity analysis, and choice of priors in Bayesian analysis.

2 Definition and Existence of DPP

The DPP occurs whenever an estimate from the marginal posterior does not lie between the corresponding estimates from the prior and the likelihood. In this paper, to capture the essence of the DPP in a simple and precise mathematical form, we restrict our attention to a point-wise definition of the DPP, that is, a definition based on point estimators for the parameter of interest .

Definition 2.1 (Point-wise DPP).

Denote by the prior mean of , and let be an estimate of derived from the likelihood function. We say that the discrepant posterior phenomenon (DPP) occurs, if and only if


where is the posterior mean of .
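As a minimal sketch of the point-wise definition, the check below reads it as "the posterior estimate falls outside the closed interval spanned by the prior estimate and the likelihood-based estimate", which is how Section 4 restates it. The function and argument names are hypothetical and introduced only for illustration.

```python
def dpp_occurs(prior_est, lik_est, post_est):
    """True when the posterior estimate lies outside the closed interval
    between the prior estimate and the likelihood-based estimate."""
    lo, hi = min(prior_est, lik_est), max(prior_est, lik_est)
    return not (lo <= post_est <= hi)


print(dpp_occurs(0.0, 1.0, 0.6))  # False: the posterior is a compromise
print(dpp_occurs(0.0, 1.0, 1.4))  # True: the posterior is more extreme than both
```

The order of the prior and likelihood estimates does not matter, so the same check covers the case in which the prior estimate is the larger of the two.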

Under the same notational convention, write where is the posterior mean of . Similarly, denote the MLE of by and the prior mean of by . Under both classes of data generating models considered in this paper, the MLE is a consistent and asymptotically efficient estimator of . It follows that , and the MLE of is given by , which we take to be the likelihood-based point estimator. From Equation (1), DPP occurs if and only if . Therefore, if , then for any , DPP occurs. If for , i.e. ; then DPP does not occur when and DPP always occurs when and .

One may wish to also define the discrepant posterior phenomenon for more general types of estimators, such as point estimators other than expectations and the MLE, as well as interval estimators. To maintain clarity of the current paper, we defer discussions about alternative definitions of DPP to future work, noting here that such definitions are conceivable.

Remark 2.1.

Directly from the definition of DPP given above, if and are not collinear (in the sense that there do not exist non-zero real numbers and such that ), then there exists a non-trivial space of possible values for which the DPP occurs. Note that if , and are collinear, they only span a proper subspace of . For most well-behaved Bayesian models with continuous prior and data spaces, the probability of this happening is 0 over repeated sampling of the data. Thus, in this sense, the DPP occurs with probability 1.
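The existence claim in the remark can be made constructive. The sketch below, with hypothetical names and numbers, takes the part of (posterior mean minus prior mean) that is orthogonal to (MLE minus prior mean). Along that direction the prior and likelihood estimates coincide while the posterior estimate differs, so the DPP occurs.

```python
import numpy as np


def dpp_direction(mu0, mle, post, tol=1e-12):
    """Unit vector c with c @ mu0 == c @ mle but c @ post different,
    or None when the three points are collinear."""
    d = mle - mu0                      # prior-to-MLE direction
    r = post - mu0                     # prior-to-posterior direction
    if np.allclose(d, 0.0):
        c = r                          # prior mean and MLE coincide
    else:
        c = r - (r @ d) / (d @ d) * d  # strip the component of r along d
    norm = np.linalg.norm(c)
    return None if norm < tol else c / norm


# Hypothetical three points in the plane that are not on one line.
mu0 = np.array([0.0, 0.0])
mle = np.array([2.0, 2.0])
post = np.array([1.5, 0.5])
c = dpp_direction(mu0, mle, post)
print(c @ mu0, c @ mle, c @ post)  # the first two agree, the third is larger
```

Small perturbations of this direction keep the posterior projection outside the (now narrow) interval between the other two, which is why the set of affected directions is a neighborhood rather than a single line.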

Although a consequence of probabilistic calculations, at the crux of the DPP lies a puzzle of geometry. In Section 3, we articulate the conditions for the DPP in Gaussian conjugate models, and in Section 4 we elaborate on the geometry behind the problem, including drawing a connection to the infamous Simpson’s paradox to demonstrate its structure as well as its prevalence in statistical applications.

3 Conditions for DPP in Exponential-Quadratic Likelihoods

3.1 Theoretical Results

In this section, we investigate conditions under which DPP occurs for models with multivariate Gaussian priors and exponential-quadratic likelihoods. The latter can be regarded as the asymptotic likelihood in large samples; see Lemma 3.1. Theorem 3.1 and Examples 3.1 and 3.2 focus on cases in which the parameter of interest is a linear marginal. Examples 3.3 and 3.4 consider the linear contrast between means in the two-dimensional setup.

We adopt the following notation for the remainder of this section. Let the prior for be and the likelihood be proportional to , where denotes a multivariate Gaussian density with mean and variance . In this case, it is easy to derive that the posterior distribution of is Gaussian, with mean and variance-covariance matrix denoted by and respectively. Suppose the parameter of interest is a linear margin , for a given . In what follows, Lemma 3.1 gives two examples of exponential-quadratic likelihoods: one is an exact exponential-quadratic likelihood from independently and identically distributed (i.i.d.) Gaussian observations with unknown mean and known variance, which can be easily adapted to simple linear regression models with unknown regression coefficient and known variance; the other is an asymptotically Gaussian likelihood based on the theory of local asymptotic normality (LAN). The lemma summarizes these known results, gives concrete examples of exponential-quadratic likelihoods, establishes the notation, and showcases the generality of our analysis of the DPP.
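The Gaussian posterior referred to above follows the standard precision-weighted formulas. The sketch below assumes a prior N(mu0, Sigma0) and a likelihood proportional to a Gaussian density in the parameter centered at theta_hat with covariance Sigma_lik; these names are illustrative assumptions, not necessarily the paper's symbols.

```python
import numpy as np


def gaussian_posterior(mu0, Sigma0, theta_hat, Sigma_lik):
    """Posterior mean and covariance for a Gaussian prior combined with an
    exponential-quadratic (Gaussian-shaped) likelihood."""
    P0 = np.linalg.inv(Sigma0)        # prior precision
    PL = np.linalg.inv(Sigma_lik)     # likelihood precision
    Sigma_post = np.linalg.inv(P0 + PL)
    mu_post = Sigma_post @ (P0 @ mu0 + PL @ theta_hat)
    return mu_post, Sigma_post


# Hypothetical inputs: identical identity covariances give the midpoint.
mu_post, Sigma_post = gaussian_posterior(
    np.array([0.0, 0.0]), np.eye(2), np.array([2.0, 4.0]), np.eye(2)
)
print(mu_post)     # midpoint of the prior mean and the MLE
print(Sigma_post)  # half of the common covariance
```

For i.i.d. Gaussian data with known covariance, Sigma_lik would be that covariance divided by the sample size and theta_hat the sample mean.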

Lemma 3.1.

(a) [Gaussian Population] Let random sample vector , . Assume that is known and is the unknown parameter. Then the likelihood is proportional to where denotes Gaussian density and , ,

(b) [Local Asymptotic Normality (LAN)] Let , , where is the unknown parameter and is the density function with regularity conditions given in Le Cam and Yang (2012, Chapter 6). Let be the true value of . Then in an open neighborhood of of radius , with probability converging to as , the likelihood , as a function of , is proportional to for some that only depend on the data and .

The proof of Lemma 3.1 (a) is trivial. Lemma 3.1 (b) follows directly from the locally asymptotically quadratic property that is satisfied by a large family of probability distributions (Hájek, 1972). We use the definition given in Le Cam and Yang (2012, Chapter 6) to give a proof of Lemma 3.1 in Appendix H. Geyer et al. (2013) also consider quadratic log-likelihoods.

Theorem 3.1 below provides a necessary and sufficient condition for not observing a DPP in exponential-quadratic likelihoods with a multivariate Gaussian prior.

Theorem 3.1 (necessary and sufficient condition for DPP).

The DPP does not occur if and only if and are both positive () or both negative (), where .

See Appendix B for the proof of the theorem. Note that and define two hyperplanes that divide the space of . When the samples give a that lies on the same side of the two hyperplanes, then

thus DPP does not occur; otherwise, the DPP occurs. This scenario is demonstrated via repeated simulations in Section 3.3.
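One way to see where two hyperplanes come from is the following sketch. The symbols are assumptions introduced here for illustration and need not match the paper's notation: prior $N(\mu_0, \Sigma_0)$, likelihood proportional to a Gaussian density in $\theta$ centered at the MLE $\hat\theta$ with covariance $\Sigma$, linear margin $c^\top\theta$, and $A = (\Sigma_0^{-1} + \Sigma^{-1})^{-1}$.

```latex
\begin{align*}
\mu_{\mathrm{post}} &= A\left(\Sigma_0^{-1}\mu_0 + \Sigma^{-1}\hat\theta\right),\\
\mu_{\mathrm{post}} - \mu_0 &= A\,\Sigma^{-1}\,(\hat\theta - \mu_0),\\
\hat\theta - \mu_{\mathrm{post}} &= A\,\Sigma_0^{-1}\,(\hat\theta - \mu_0).
\end{align*}
% c^T mu_post lies strictly between c^T mu_0 and c^T theta_hat exactly when
% the two linear forms below share the same nonzero sign:
\[
c^\top A\,\Sigma^{-1}\,(\hat\theta - \mu_0)
\quad\text{and}\quad
c^\top A\,\Sigma_0^{-1}\,(\hat\theta - \mu_0).
\]
```

Setting each linear form to zero gives a hyperplane through $\mu_0$ in the space of $\hat\theta$, which is consistent with the picture of two hyperplanes separating the samples that produce the DPP from those that do not.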

The probability of the DPP for the Gaussian population as given in Lemma 3.1 (a) is given in Theorem 3.2 below. We will show this probability numerically for special cases in Section 3.3.

Theorem 3.2 (probability of DPP, Gaussian case).

Using the same notation as in Lemma 3.1 (a), the DPP occurs with probability , where the probability is taken with respect to the true data generating model.

The probability of DPP can be computed using Monte Carlo simulations. For example, in the following data generating model for Gaussian conjugate models as defined in Lemma 3.1 (a):


we have . We can simulate from this Gaussian repeatedly and record the frequency with which the inequality that defines the DPP holds; this frequency is a Monte Carlo estimate of the probability of the DPP.
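A minimal Monte Carlo sketch of this procedure follows. It assumes the data are i.i.d. Gaussian with true mean theta_true and known covariance Sigma, so that the sample mean is Gaussian with covariance Sigma / n; the function name, argument names, and numerical settings are hypothetical.

```python
import numpy as np


def dpp_probability(c, mu0, Sigma0, theta_true, Sigma, n, reps=20000, seed=0):
    """Monte Carlo estimate of the probability that the DPP occurs for c @ theta."""
    rng = np.random.default_rng(seed)
    Sigma_lik = Sigma / n                       # covariance of the sample mean
    P0 = np.linalg.inv(Sigma0)
    PL = np.linalg.inv(Sigma_lik)
    A = np.linalg.inv(P0 + PL)                  # posterior covariance
    xbar = rng.multivariate_normal(theta_true, Sigma_lik, size=reps)
    post = (P0 @ mu0 + xbar @ PL.T) @ A.T       # one posterior mean per row
    eta0, eta_hat, eta_post = c @ mu0, xbar @ c, post @ c
    lo, hi = np.minimum(eta0, eta_hat), np.maximum(eta0, eta_hat)
    return float(np.mean((eta_post < lo) | (eta_post > hi)))


c = np.array([1.0, -1.0])
mu0 = np.zeros(2)
theta_true = np.array([1.0, 1.0])
# Heterogeneous prior variances against a homogeneous likelihood (hypothetical values).
p_het = dpp_probability(c, mu0, np.diag([1.0, 0.04]), theta_true, np.eye(2), n=5)
# Prior and likelihood covariances proportional to each other.
p_iso = dpp_probability(c, mu0, np.eye(2), theta_true, np.eye(2), n=5)
print(p_het, p_iso)
```

When the prior covariance is a multiple of the likelihood covariance, the posterior mean is a scalar convex combination of the prior mean and the sample mean, so the estimated probability should be zero up to floating-point effects.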

We have the following corollary.

Corollary 3.1 (possibility of DPP for all contrasts, Gaussian case).

For the Gaussian model in Theorem 3.2 and under the data generating model specified in Equation (2), for any , i.e. taking any margin, the probability of DPP occurring is positive except when for some positive constant .

Now we briefly give a proof of the corollary. Since both and follow univariate Gaussian distributions under the data generating model, the probability of DPP is equal to zero if and only if they are perfectly positively correlated, i.e. there exists a positive constant such that holds with probability 1. This implies that .

So far we have considered the case with a specific target parameter of interest with a fixed direction on . Now suppose we do not have a fixed direction and may be interested in several , possibly along multiple or even all directions of . The next theorem states that in the multivariate Gaussian conjugate model, as long as for a constant , with probability 1 there is always a non-trivial space of such that the DPP is seen in these directions.

Theorem 3.3 (certainty of DPP for some contrast, Gaussian case).

For the multivariate Gaussian conjugate model given in Lemma 3.1 (a), there always exists a non-trivial space for possible such that the DPP occurs with probability , unless there exists a non-zero constant such that .

The proof of the theorem is straightforward, and we briefly explain it here. From Remark 2.1 in Section 2, if , , do not lie on the same line, there exists a non-trivial space for possible such that the DPP occurs. In the Gaussian setting, , where . Note that the vectors , , lying on the same line implies that there exist some constants and such that . This is equivalent to the following linear equation:


This is a linear equation for the sample mean . Thus, for a given and as long as , the probability that equation (3) holds is zero. Consequently, with probability 1, , , do not lie on the same line, yielding the existence of a non-trivial space for possible such that the DPP occurs. Thus the DPP is prevalent unless .

3.2 Several Special Cases of Theorems in Section 3.1

We consider in this subsection several examples that are special cases of Theorem 3.1. Example 3.1 concerns when both the prior and likelihood covariance matrices are diagonal.

Example 3.1 (diagonal covariances).

Assume that and . Define for . In this case, the DPP does not occur if and only if

where and . When for all , thus DPP does not occur as long as and .

When and , special cases of the above for which DPP does not occur include (1) when for all where , that is, the prior and likelihood have the same pattern of heterogeneity; or (2) when and for all , that is, the prior and likelihood both have homogeneous, independent dimensions. In other words, when the parameters are orthogonal in both the prior and likelihood, the DPP does not occur when the prior and the likelihood contours are nicely “aligned” in the sense of elongated directions/dimensions. Example 3.2 concerns the situation in which both the prior and the likelihood employ a homogeneous correlation (or equicorrelation) structure across all dimensions and equal marginal variances.

Example 3.2 (equicorrelation with homogeneous variances).

Assume that and , where is a diagonal matrix with diagonal elements equal to , is a column vector of s, and ; i.e. we have

Then, DPP does not occur if and only if , where

and , , , and

When and , special cases of the above for which DPP never occurs include (1) when (thus ), that is, the prior and likelihood have the same correlation pattern; or (2) when (thus ), that is, the prior and likelihood have similar correlation patterns and the parameter of interest is a ‘contrast’. The special case for which DPP would always occur is when (thus ) and , . This corresponds to when the quantity of interest lies along the direction () that is orthogonal to the direction of the prior-likelihood mean contrast (), which is the farthest away from being a weighted average of the prior mean and the mean given by the data likelihood.

Remark 3.1.

The situation above in which the DPP always occurs is less of a concern than the seemingly weaker statement that the DPP occurs with positive probability. This is because the linear equation that defines this situation, , holds with probability 0 under the Gaussian conjugate model. Therefore, the more interesting discussions in the paper relate to the cases in which the DPP occurs with positive probability, where the geometry of the prior and likelihood contours ( in the Gaussian model) plays an important role.

We now examine linear contrasts of the two-dimensional posterior mean, that is, , special cases of the previous examples to gain more intuition. Example 3.3 shows that the DPP does not occur as long as the two component dimensions of the parameter have the same variance within the prior and the likelihood specifications, regardless of correlation structure. On the contrary, Example 3.4 shows that when the variances of the two dimensions differ, it creates the possibility for DPP even if the two dimensions are independent within both the prior and likelihood. In what follows, , and .

Example 3.3 (two-dimensional contrast, homogeneous variance).

If , , where , then for , the DPP does not occur.

A special case of Example 3.3 is when . The posterior distribution for is , where . Note that the posterior mean must lie between and , regardless of which one is larger. Another special case is when only , i.e. for correlated parameters’ likelihood, we set an independent prior; or similarly when only , i.e. for uncorrelated parameters’ likelihood, we set a correlated prior. Again, we can write the posterior mean for the contrast as a convex combination of the prior contrast and the MLE . Thus the DPP does not occur. The detailed result and proof are given in Appendix D. Example 3.3 shows that in practice, if we can make the marginal variances of the parameters in both the prior and likelihood close to being homogeneous, the DPP could be mitigated or even avoided. In fact, the homogeneity of marginal variances is a nice property to have not only for avoiding the DPP, but also for the efficiency of computational algorithms, which we discuss in more detail in Section 6.
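The claim of Example 3.3 can be checked numerically. The sketch below assumes the setting is a bivariate prior and likelihood that each have equal marginal variances and an arbitrary correlation, with the contrast direction (1, -1); the helper name and the sampling ranges are hypothetical. In this setting the contrast direction is an eigenvector of both covariance matrices, which is why the posterior contrast reduces to a scalar convex combination.

```python
import numpy as np


def equicorr(var, rho):
    """2x2 covariance with common marginal variance var and correlation rho."""
    return var * np.array([[1.0, rho], [rho, 1.0]])


rng = np.random.default_rng(1)
c = np.array([1.0, -1.0])
tol = 1e-9                      # slack for floating-point error
violations = 0
for _ in range(1000):
    Sigma0 = equicorr(rng.uniform(0.1, 5.0), rng.uniform(-0.9, 0.9))
    Sigma = equicorr(rng.uniform(0.1, 5.0), rng.uniform(-0.9, 0.9))
    mu0, theta_hat = rng.normal(size=2), rng.normal(size=2)
    P0, PL = np.linalg.inv(Sigma0), np.linalg.inv(Sigma)
    mu_post = np.linalg.solve(P0 + PL, P0 @ mu0 + PL @ theta_hat)
    eta0, eta_hat, eta_post = c @ mu0, c @ theta_hat, c @ mu_post
    lo, hi = min(eta0, eta_hat), max(eta0, eta_hat)
    if eta_post < lo - tol or eta_post > hi + tol:
        violations += 1
print(violations)  # number of random configurations in which the DPP appeared
```

The same loop with unequal marginal variances in either matrix is an easy way to see the contrast leave the interval.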

In contrast to Example 3.3, Example 3.4 gives the condition under which DPP occurs when the two component dimensions of the parameter are uncorrelated, but the marginal variances are not the same.

Example 3.4 (two-dimensional contrast, heterogeneous variance).

Let , , , and denote


Then the posterior mean for is .

  1. When , i.e. the relative curvature between the prior and likelihood is the same for the two dimensions, always holds and the equality holds if and only if . In the special case when , we have , which is perfect alignment of prior/MLE/posterior. Thus the DPP does not occur in this case.

  2. When , without loss of generality, we assume that , then DPP occurs if and only if and


Example 3.4 sends a somewhat surprising message, compared with the common understanding of “difficult geometry” as misalignment between the likelihood and the prior. As it turns out, the DPP is not a consequence of parameter dependence in either the prior or the likelihood specifications. Merely assuming nonhomogeneous variances in the component dimensions of the parameter is enough to create the unsettling phenomenon. The geometry behind Example 3.4 is the subject of detailed analysis in Section 4.
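A small worked instance shows how independent dimensions with unequal variances already produce the DPP. All numbers below are hypothetical and are not the values used in the paper's figures; with diagonal covariances, each posterior coordinate is a convex combination of the corresponding prior and MLE coordinates, but the weights differ across coordinates.

```python
import numpy as np

mu0 = np.array([0.0, 0.0])          # prior mean (hypothetical)
theta_hat = np.array([2.0, 1.0])    # MLE (hypothetical)
prior_var = np.array([4.0, 1.0])    # diagonal of the prior covariance
lik_var = np.array([1.0, 4.0])      # diagonal of the likelihood covariance

# Per-coordinate weight on the MLE: precision of the likelihood over total precision.
w = (1.0 / lik_var) / (1.0 / prior_var + 1.0 / lik_var)
mu_post = (1.0 - w) * mu0 + w * theta_hat

c = np.array([1.0, -1.0])           # linear contrast of the two coordinates
eta0, eta_hat, eta_post = c @ mu0, c @ theta_hat, c @ mu_post
dpp = not (min(eta0, eta_hat) <= eta_post <= max(eta0, eta_hat))
print(w, mu_post, eta0, eta_hat, eta_post, dpp)
```

Here the first coordinate is pulled mostly toward the MLE and the second mostly toward the prior, so the posterior contrast ends up above both the prior contrast and the likelihood contrast.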

3.3 Numerical Results

Numerical results based on repeated simulations under various multivariate Gaussian models corresponding to the heterogeneous variance case with and without correlation structures (the latter corresponds to Example 3.4) are collected in Figure 2. The parameter of interest is , the difference of the two Gaussian marginal means. In columns 1 and 3, and in columns 2 and 4, . In columns 1 and 2, ; and in columns 3 and 4, . Monte Carlo estimates of the probabilities of DPP under each model are given. We can see that the DPP gradually vanishes as we increase the sample size, although at a slower rate for some models than others. For the examples shown here, models with uncorrelated parameter components (in both the likelihood and the prior) seem less prone to DPP than models with highly correlated dimensions. However, DPP is not eliminated in these cases, and the extent of reduction is a function of the parameter values used for the simulations shown here. Example 3.4 shows that heterogeneous variances with independent dimensions are already related to the DPP. The simulations here show that heterogeneous variances plus correlation among parameters make the situation even worse. Models with priors not geometrically aligned with the likelihood, e.g. with heterogeneous marginal variances and/or correlation structure, are more likely to suffer from DPP than otherwise.

Figure 2: DPP in bivariate Gaussian models for linear contrast. Each subplot contains independently simulated datasets from Lemma 3.1(a), with , , and samples per dataset for each of the three rows. Each dataset is represented by a point whose x and y coordinates are the sample averages of the first and second dimensions respectively. The red dots are data occurrences for which DPP occurs, and the blue dots are those for which DPP does not occur. Separating the two are hyperplanes and defined in Theorem 3.1.

4 The geometry of DPP and relation to Simpson’s paradox: An illustration in () Case

In this section, we take a closer look at the geometry behind DPP, and illustrate its connection with the famous Simpson’s paradox, one that occurs due to inconsistently aggregating sources of conditional information. For simplicity, the analysis below focuses on the scenario described in Example 3.4 with . We assume a bivariate Gaussian conjugate model for a pair of drug and placebo treatment efficacy, where efficacy is measured on the real line. Both the prior and the sampling distribution have independent and heterogeneous covariance structures. The inferential target is again the posterior linear contrast between the efficacy of the drug and placebo treatments.

For the purpose of illustration, assume the prior mean and the MLE , with respective diagonal covariance matrices and . The model is depicted in Figure 3. Note that the MLE is greater than the prior mean element-wise, that is, is to the northeast of .

Denote the posterior mean. Since both the prior and likelihood covariances are diagonal, the light blue rectangle with and as vertices is the region in which could take value. Three lines of slope 1 pass through , and , and intersect the -axis at , and respectively. The -coordinates of the three intersections are respectively the prior, likelihood, and posterior linear contrasts, that is, , , and where .

By Definition 2.1, DPP occurs if falls outside the closed interval between and . Equivalently stated, the occurrence of DPP can be determined by examining the location of relative to the dark blue parallelogram sandwiched between the two lines that pass through and , as well as and . DPP occurs if falls within the light blue rectangle but outside the parallelogram, and it does not occur if falls within the parallelogram.

Having fixed and , the location of is a function of the prior and likelihood covariances and . The specific values depicted in Figure 3 are and . The posterior mean is then , and posterior covariance . The three covariances are illustrated by their respective concentration ellipses around , and . Notice that falls within the light blue region outside the dark blue band. As a consequence, the posterior contrast () is larger than both the prior () and likelihood contrasts (), suggesting that the efficacy of the drug is assessed to be more than the placebo a posteriori, at a scale larger than that of either the prior or the data alone.

To understand the geometry of Example 3.4, define such that

The two angles and are annotated in Figure 3. Equation (5) can be re-expressed as


That is, given that , DPP occurs if and only if . This happens precisely when sits to the left of the line that passes through and . For the specific values of and in this example, the weights satisfy with values equal to and respectively. Should it be the case that , the same argument applies once the roles of and are flipped. The necessary and sufficient condition for DPP to occur is then , or equivalently, for to sit to the right of the line that passes through and .

Figure 3: The geometry of DPP: posterior drug efficacy (C) exceeds the range spanned by that assessed from the prior (A) and the data (B). That is because the posterior mean () is only an element-wise convex combination of the prior mean () and the MLE (), but itself is not collinear with them. In gray are concentration ellipses of the covariance matrices , and .

For conjugate normal models, the posterior mean is an element-wise convex combination of the prior mean and the MLE . That is, each dimension of is a convex combination of the corresponding dimensions of and , with weights determined by the prior and the sampling distribution covariances. If the weights applied to and are not balanced across dimensions, the resulting posterior mean may not be an overall convex combination of and , which is to say that it may not be collinear with and . Indeed, when the weights are heavily imbalanced, can be far from collinear with and , so much so that it creates ample triangularization among the three quantities for a collection of marginal directions to render the projection of outside the range of those of and . Referring again to Figure 3, any value of outside the dark blue parallelogram is considered far from collinear with and , giving rise to the DPP.

To put this formally, let denote the linear margin of interest, and consider the two angles it forms with and respectively, namely and such that


By Definition 2.1, the DPP occurs if and only if the cosine quantities in (7) are both positive or both negative. If is not exactly collinear with and , the marginal direction orthogonal to is always vulnerable to the DPP. Indeed, any departure in from the convex combination of and can be picked up by the marginal direction orthogonal to the difference of the latter two, however slight the departure may be. In addition, the neighborhood of marginal directions whose polar angles are between and is also vulnerable to the DPP. As long as (6) holds, this neighborhood is nonempty.
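The neighborhood of vulnerable directions can be explored by brute force. The sketch below sweeps unit vectors over polar angles in [0, 180) degrees for three hypothetical, non-collinear points (not the values in Figure 3) and records the angles at which the projected posterior mean falls outside the interval spanned by the projected prior mean and MLE.

```python
import numpy as np

mu0 = np.array([0.0, 0.0])          # prior mean (hypothetical)
theta_hat = np.array([2.0, 1.0])    # MLE (hypothetical)
mu_post = np.array([1.6, 0.2])      # posterior mean (hypothetical, not collinear)

angles = np.linspace(0.0, np.pi, 181)[:-1]               # 1-degree grid on [0, pi)
C = np.column_stack([np.cos(angles), np.sin(angles)])    # one unit direction per row
p0, pl, pp = C @ mu0, C @ theta_hat, C @ mu_post         # projections onto each direction
is_dpp = (pp < np.minimum(p0, pl)) | (pp > np.maximum(p0, pl))
dpp_degrees = np.degrees(angles[is_dpp])
print(dpp_degrees.min(), dpp_degrees.max())  # a contiguous band of vulnerable angles
```

For these points the band contains both the direction orthogonal to the MLE-minus-prior difference and the contrast direction at 135 degrees; moving the posterior mean closer to the segment joining the other two points narrows the band but does not empty it.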

The geometry described here is not limited to two-dimensional situations. The same intuition applies when the Bayesian model of concern invokes a parameter space of higher dimensions. In fact, the higher the dimension of the parameter space, the more “prevalent” the DPP in the sense that the nonempty neighborhood of marginal directions that can result in a DPP is also of higher dimension, and can be harder to avoid.

The DPP is closely related to Simpson’s paradox, another puzzling phenomenon said to occur when the marginal expectation of a random variable seemingly takes a value outside the range of the conditional expectations of the same variable from which it is aggregated. Simpson’s paradox is a consequence of incoherent marginalization: sources of conditional information are aggregated against different, as opposed to the same, marginal distributions of the conditioning variable. When the difference is substantial, the marginal expectation may appear out of range, which would be mathematically impossible had the marginalization been done coherently.

When the prior and posterior means in conjugate normal models are regarded as point estimators of a quantity of interest, as has been the case in our investigation, the DPP is precisely a manifestation of Simpson’s paradox in the following sense. Let and be the estimators of the drug and placebo efficacy respectively. Let be the indicator variable of whether the observation is made through a pilot study (), which corresponds to the prior, or a full clinical trial (), which corresponds to the likelihood. Write

with the understanding that the expectations are each taken with respect to a distinct and independent sample drawn from a (possibly finite) population.

Let be the sample marginal distribution of , that is, the fraction of subjects assigned to the pilot study versus the clinical trial. Write and . If is independent of and , the marginal expected linear contrast can be written as

As varies in , it is guaranteed that


That is, the marginal expected linear contrast is bounded within the range of the conditional expected linear contrasts from the pilot study and the clinical trial. In other words, Simpson’s paradox does not occur regardless of . However, if is not independent of and , that is, if the assignment probabilities to the pilot study versus the clinical trial depend on the outcomes, the guarantee in (8) does not hold. In particular, suppose possesses two distinct marginal distributions, one pertinent to the drug () and the other to the placebo (), given respectively by


then the “marginal expected linear contrast” of the drug’s efficacy is written as


The phrase “marginal expected linear contrast” is in quotes here because the marginalization of and used different marginal distributions of , hence the result is incoherent for comparison purposes. In this case, Simpson’s paradox is said to occur whenever

which coincides with the definition of the DPP in Equation 1. In the special case of Example 3.4 illustrated here, when and as defined in (9) take values according to the respective posterior variance component coefficients of (4), then (10) is precisely the posterior expected linear contrast with respect to the independent heterogeneous variance model. The DPP occurs precisely when Simpson’s paradox occurs.
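The contrast between coherent and incoherent marginalization can be made concrete with a small numerical sketch. The conditional expectations and mixing weights below are hypothetical values chosen for illustration, and the function and variable names are ours rather than the notation of this section.

```python
def coherent_contrast(p, pilot, trial):
    """Marginal contrast when drug and placebo share the same weight p
    on the pilot study (and 1 - p on the clinical trial)."""
    x0, y0 = pilot
    x1, y1 = trial
    return p * (x0 - y0) + (1.0 - p) * (x1 - y1)


def incoherent_contrast(p_x, p_y, pilot, trial):
    """'Marginal' contrast when the drug arm uses weight p_x and the
    placebo arm uses a different weight p_y on the pilot study."""
    x0, y0 = pilot
    x1, y1 = trial
    drug = p_x * x0 + (1.0 - p_x) * x1
    placebo = p_y * y0 + (1.0 - p_y) * y1
    return drug - placebo


def out_of_range(value, pilot, trial):
    """True if value lies strictly outside the two conditional contrasts."""
    c0 = pilot[0] - pilot[1]
    c1 = trial[0] - trial[1]
    return value < min(c0, c1) or value > max(c0, c1)


# Hypothetical conditional expectations (drug, placebo) in each study.
pilot = (0.6, 0.5)   # conditional contrast of about 0.1
trial = (0.8, 0.6)   # conditional contrast of about 0.2

same_weight = coherent_contrast(0.5, pilot, trial)
mixed_weight = incoherent_contrast(0.9, 0.1, pilot, trial)
```

With a common weight the aggregate contrast stays between the two conditional contrasts for any weight in the unit interval. With weights 0.9 and 0.1 applied to the drug and the placebo respectively, the aggregate contrast falls to roughly 0.03, below both conditional contrasts, which is the out-of-range behavior described above.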

5 DPP in Binomial Model Revisited

When the “numbers of trials” go to infinity, the two-by-two table Binomial model used in Xie et al. (2013) is covered by the LAN property in Section 3, so we do not repeat the discussion for that case. In practice, however, as shown in Xie et al. (2013), we care about finite-sample properties, especially when the prior is moderately or highly informative. The case of finite numbers of trials does not fall into the realm of exponential-quadratic likelihoods, and we study the DPP for this case in this section. We are curious to see whether the results obtained for exponential-quadratic likelihoods, though not directly applicable here, can be adopted to guide the choice of priors and re-parameterization.

Let , , where both and are finite. The parameter of interest is , for which we have “some prior information”. We also have “some prior information” for . This is an example given in Xie et al. (2013). The likelihood is

The MLEs of and are and , respectively. Let be the prior mean and be the posterior mode of . Then the DPP occurs if and only if


The (independent) conjugate prior for this model is given by

, .

5.1 Theoretical Results

When both and are finite, the likelihood of the Binomial model is very different from exponential-quadratic likelihoods, so we no longer have neat analytical conditions for the DPP as in Section 3. To stay as close as possible to the results obtained in Section 3, we still choose a bivariate Gaussian prior for the parameters in the Binomial model (and live with the fact that this prior is not rigorously appropriate for parameters with bounded support), and we still try to express the posterior as a weighted average of the prior mean and the MLE. By doing so, we can examine exactly how the Binomial model differs from the exponential-quadratic models, namely through an extra residue term in the weighted average.

Let the prior for be bivariate Gaussian with means and variance-covariance matrix . Assume that are known constants and assign a uniform prior for on . Proposition 5.1 rewrites the posterior mean as a weighted average of the prior mean and the MLE, plus an extra term, without which the DPP would not occur.

Proposition 5.1.

For any , the posterior mode satisfies


where , , and

Let without loss of generality, then DPP occurs if and only if