Exact and efficient inference for Partial Bayes problems

02/12/2018, by Yixuan Qiu et al., Purdue University

Bayesian methods are useful for statistical inference. However, applying Bayesian methods to real-world problems can be challenging when the data analyst has only limited prior knowledge. In this paper we consider a class of problems, called Partial Bayes problems, in which the prior information is only partially available. Taking the recently proposed Inferential Model approach, we develop a general inference framework for Partial Bayes problems and derive both exact and efficient solutions. In addition to the theoretical investigation, numerical results and real applications are used to demonstrate the superior performance of the proposed method.


1 Introduction

In many real-world statistical problems, the information available to the data analyst can be organized in a hierarchical structure. That is, there exists some past experience about the parameter(s) of interest, and data relevant to the parameter(s) are also collected. For this type of problem, the standard approach to statistical inference is the Bayesian framework. However, in many applications, the data analyst has only limited prior knowledge. For instance, the prior information may be insufficient to form a known distribution, so that the analyst needs to assume some unknown distributional components in the Bayesian setting. This class of problems has brought many challenges to statisticians; see for example Lambert and Duncan (1986); Meaux et al. (2002); Moreno et al. (2003). To study such problems systematically, in this article we refer to them as Partial Bayes problems, highlighting the fact that only partial information about the Bayesian prior distribution is available.

Partial Bayes problems have drawn a lot of attention in the statistics literature. One popular type of Partial Bayes problem refers to the case where there exists an unknown prior distribution, either parametric or non-parametric, in a Bayesian hierarchical model. A very popular approach to this type of model is known as Empirical Bayes, which was first proposed by Robbins (1956) for handling the case of non-parametric prior distributions, and later developed by Efron and Morris (1971, 1972a, 1972b, 1973, 1975) for parametric prior distributions. Another kind of Partial Bayes problem was studied by Xie et al. (2013), in which the joint prior distribution of a parameter vector is missing, but some marginal distributions are known. For clarity, we will refer to this type as the marginal prior problem. In Xie et al. (2013), the solution to the marginal prior problem is based on the Confidence Distribution approach (Xie et al., 2011), which provides a unified framework for meta-analysis.

The Empirical Bayes and Confidence Distribution approaches both have successful real-world applications. However, one fundamental problem in scientific research, exact inference about the parameter of interest, remains an open question for Partial Bayes problems. As pointed out by many authors (Morris, 1983; Laird and Louis, 1987; Carlin and Gelfand, 1990), Empirical Bayes in general underestimates the uncertainty associated with its interval estimators, and these authors have proposed various methods to correct the bias of the coverage rate. However, even though such methods show improved performance, they achieve the target coverage rates only approximately. The same issue arises in the Confidence Distribution framework: Confidence Distribution provides a novel way to combine different inference results, but the individual inferences being combined may or may not be exact. All of this indicates that exact inference for Partial Bayes problems is highly non-trivial.

Recently, the Inferential Model (Martin and Liu, 2013, 2015a, 2015c) has been proposed as a new framework for statistical inference; it not only provides Bayesian-like probabilistic measures of uncertainty about the parameter, but also has an automatic long-run frequency calibration property. In this paper, we use this framework to derive interval estimators for the parameters of interest in Partial Bayes problems, and demonstrate their important statistical properties, including exactness and efficiency. When comparing with other approaches, we refer to the proposed estimators as Partial Bayes solutions for brevity.

The remaining part of this article is organized as follows. In Section 2 we study a hierarchical normal-means model as a motivating example of Partial Bayes problems. In Section 3 we provide a brief review of the Inferential Model framework as the theoretical foundation of our analysis. Section 4 is the main part of this article, where we introduce a general framework for studying Partial Bayes problems and deliver our major theoretical results. We revisit some popular Partial Bayes models in Section 5, and conduct simulation studies in Section 6 to numerically compare the proposed solutions with other methods. In Section 7 we consider an application to a basketball game dataset, and finally in Section 8 we conclude with a few remarks. Proofs of theoretical results are given in the appendix.

2 A Motivating Example

In this section, we use a motivating example to demonstrate what a typical Partial Bayes problem is, and how its solution differs from the existing method. Consider the well-known normal hierarchical model for the observed data $X_1, \ldots, X_n$. The model introduces unobservable means $\theta_1, \ldots, \theta_n$, one for each observation, and assumes that conditional on the $\theta_i$'s, the $X_i$'s are mutually independent with $X_i \mid \theta_i \sim N(\theta_i, \sigma^2)$ for $i = 1, \ldots, n$, where the common variance $\sigma^2$ is known. In addition, all the $\theta_i$'s are i.i.d. with $\theta_i \sim N(\mu, \tau^2)$, where the variance $\tau^2$ is known but the mean $\mu$ is an unknown hyper-parameter.

The problem of interest here is to make inference about the individual means $\theta_i$, and for simplicity we focus on $\theta_1$ without loss of generality. The aim of inference is to construct an interval estimator for $\theta_1$ that satisfies the following condition: using the terminology in Morris (1983), a sample-based interval $[l(X), u(X)]$ is an interval estimator for $\theta_1$ with $1-\alpha$ confidence level if it satisfies $P\{l(X) \le \theta_1 \le u(X)\} \ge 1-\alpha$ for all $\mu$, where the probability that indicates the coverage rate is computed over the joint distribution of $(X, \theta)$.

The standard Empirical Bayes approach to this problem can be found in Efron (2010). It computes the MLE of $\mu$, denoted $\hat{\mu}$, from the observed data; since marginally $X_i \sim N(\mu, \sigma^2 + \tau^2)$, the MLE is $\hat{\mu} = \bar{X}$. Plugging $\hat{\mu}$ back into the prior in place of $\mu$, Empirical Bayes proceeds with the standard Bayesian procedure to provide an approximate posterior distribution of $\theta_1$, namely $\theta_1 \mid X \mathrel{\dot\sim} N\big((1-B) X_1 + B \hat{\mu},\ (1-B)\sigma^2\big)$, where $B = \sigma^2 / (\sigma^2 + \tau^2)$, and the notation "$\dot\sim$" indicates that the distribution is approximate. Accordingly, the Empirical Bayes interval estimator for $\theta_1$ is obtained as $(1-B) X_1 + B \hat{\mu} \pm z_{1-\alpha/2} \sqrt{(1-B)\sigma^2}$, where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution.

The Partial Bayes solution, derived in Section 5.1.1, has a slightly different formula:

(1)

Compared with Empirical Bayes, the proposed interval has the same center but is slightly wider for small $n$. For a numerical illustration, we fix $\alpha$ to be 0.05 and take fixed values of $\sigma^2$ and $\tau^2$. Figure 1 shows the theoretical coverage rates of both the Empirical Bayes solution and the Partial Bayes solution as a function of $n$. It can be seen that the coverage probability of the Empirical Bayes interval is less than the nominal value $1-\alpha = 0.95$, and is close to the target only when $n$ is sufficiently large. On the contrary, the Partial Bayes solution matches the nominal coverage rate exactly for all $n$.

Figure 1: The coverage probabilities of Empirical Bayes (blue dashed curve) and Partial Bayes (red solid line) as a function of $n$. The line for Partial Bayes is exactly positioned at the 0.95 level, indicating that it achieves the nominal coverage rate exactly for all $n$.
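The undercoverage of the plug-in Empirical Bayes interval can also be checked by simulation. The following sketch is a minimal illustration, assuming the normal-normal model above with hypothetical values $\sigma^2 = \tau^2 = 1$, $\mu = 0$, $n = 10$, and $\alpha = 0.05$; these values are chosen for illustration only and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, tau2, mu, alpha = 10, 1.0, 1.0, 0.0, 0.05
z = 1.959963984540054                 # 0.975 quantile of N(0, 1)
B = sigma2 / (sigma2 + tau2)          # shrinkage weight sigma^2 / (sigma^2 + tau^2)

covered, n_rep = 0, 100_000
for _ in range(n_rep):
    theta = rng.normal(mu, np.sqrt(tau2), size=n)   # unobserved means
    x = rng.normal(theta, np.sqrt(sigma2))          # observed data
    mu_hat = x.mean()                               # MLE of mu from the marginal model
    center = (1 - B) * x[0] + B * mu_hat            # plug-in posterior mean of theta_1
    half = z * np.sqrt((1 - B) * sigma2)            # plug-in posterior half-width
    covered += (center - half <= theta[0] <= center + half)

print("Empirical Bayes coverage:", covered / n_rep)  # typically below the nominal 0.95
```

The gap between the printed coverage and 0.95 shrinks as $n$ grows, consistent with the behavior shown in Figure 1.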

3 A Brief Review of Inferential Models

Since our inference for Partial Bayes problems is based on the recently developed Inferential Models, in this section we provide a brief introduction to this framework, with more details given in Martin and Liu (2013). The Inferential Model is a new framework designed for exact and efficient statistical inference. The exactness of Inferential Models guarantees that, under a particular definition, the inference made by Inferential Models has a controlled probability of error; for example, in hypothesis testing problems the Type I error is no greater than a pre-specified level. In addition, Inferential Models provide a systematic way to combine information in the data for efficient statistical inference.

Formally, Inferential Models draw statistical conclusions about the parameter of interest $\theta$ through an assertion $A$, a subset of the parameter space. For example, a singleton subset $A = \{\theta_0\}$ stands for the assertion $\theta = \theta_0$, and a subset such as $A = (0, \infty)$ corresponds to the assertion that $\theta$ is positive. In the Inferential Model framework, two quantities are used to represent the knowledge about $\theta$ contained in the data: the belief function, which describes how much evidence in the data supports the claim that "$A$ is true", and the plausibility function, which quantifies how much evidence does not support the claim that "$A$ is false".

Like Fisher's fiducial inference, Inferential Models make use of auxiliary (unobserved) random variables to represent the sampling model. In order to obtain meaningful probabilistic inferential results, unlike Fisher's fiducial inference, Inferential Models predict the unobserved realizations of the auxiliary variables using random sets, and propagate this uncertainty to the parameter space. Technically, an Inferential Model is formulated as a three-step procedure to produce the inferential results:

Association step

This step specifies an association function to connect the parameter $\theta$, the observed data $X$, and an unobserved auxiliary random variable $U$ that follows a known distribution $P_U$, written as $X = a(\theta, U)$. This relationship implies that the randomness in the data is represented by the auxiliary variable $U$.

Prediction step

Let $u^\star$ be the true but unobserved value of $U$ that "generates" the data. This step constructs a valid predictive random set, $\mathcal{S}$, to predict $u^\star$. $\mathcal{S}$ is valid if the quantity $\gamma_{\mathcal{S}}(u) = P_{\mathcal{S}}(\mathcal{S} \ni u)$, interpreted as the probability that $\mathcal{S}$ successfully covers $u$, satisfies the condition $P_U\{\gamma_{\mathcal{S}}(U) \le \gamma\} \le \gamma$ for all $\gamma \in (0, 1)$, where $U \sim P_U$.

Combination step

This step transforms the uncertainty from the auxiliary variable space to the parameter space by defining $\Theta_x(\mathcal{S}) = \bigcup_{u \in \mathcal{S}} \{\theta : x = a(\theta, u)\}$, a mapping from $\mathcal{S}$ back to the parameter space after incorporating the uncertainty represented by $\mathcal{S}$. Then for an assertion $A$, its belief function is defined as $\mathrm{bel}_x(A) = P_{\mathcal{S}}\{\Theta_x(\mathcal{S}) \subseteq A\}$, and similarly, its plausibility function is defined as $\mathrm{pl}_x(A) = P_{\mathcal{S}}\{\Theta_x(\mathcal{S}) \cap A \neq \emptyset\} = 1 - \mathrm{bel}_x(A^c)$.

The plausibility function is very useful for deriving frequentist-like confidence regions for the parameter of interest (Martin, 2015). If we let $A = \{\theta\}$ be a singleton assertion and denote $\mathrm{pl}_x(\theta) = \mathrm{pl}_x(\{\theta\})$, then a frequentist-like confidence region, which is termed a plausibility region in the Inferential Model framework (or a plausibility interval as a special case), is given by $\{\theta : \mathrm{pl}_x(\theta) > \alpha\}$. In Inferential Models, the exactness of the inference is formally termed validity. For example, the validity property of Inferential Models guarantees that the above region has at least $1-\alpha$ long-run coverage probability.
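As a concrete illustration of the three steps and the resulting plausibility interval (a toy example, not one taken from the paper), consider a single observation $X \sim N(\theta, 1)$ with association $X = \theta + \Phi^{-1}(U)$, $U \sim \mathrm{Unif}(0, 1)$. With the default two-sided predictive random set, the singleton plausibility is $\mathrm{pl}_x(\theta) = 1 - |2\Phi(x - \theta) - 1|$, and the plausibility interval coincides with the familiar $x \pm z_{1-\alpha/2}$ interval.

```python
import numpy as np
from scipy.stats import norm

def plausibility(theta, x):
    """Singleton plausibility for X ~ N(theta, 1) under the default
    two-sided predictive random set: pl_x(theta) = 1 - |2 Phi(x - theta) - 1|."""
    return 1.0 - np.abs(2.0 * norm.cdf(x - theta) - 1.0)

def plausibility_interval(x, alpha=0.05):
    """Plausibility region {theta : pl_x(theta) > alpha}, found on a grid."""
    grid = np.linspace(x - 5.0, x + 5.0, 10_001)
    kept = grid[plausibility(grid, x) > alpha]
    return kept.min(), kept.max()

x_obs = 1.3
print(plausibility_interval(x_obs))                        # approx (-0.66, 3.26)
print(x_obs - norm.ppf(0.975), x_obs + norm.ppf(0.975))    # the usual z interval
```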

It is worth mentioning that Inferential Models also have a number of extensions for efficient inference. When the model has multiple parameters but only some of them are of interest, the Marginal Inferential Models (MIM, Martin and Liu, 2015c) appropriately integrate out the nuisance parameters. For models where the dimension of auxiliary variables is higher than that of the parameters, the Conditional Inferential Models (CIM, Martin and Liu, 2015a) could be used to combine information in the data such that efficient inference can be achieved. Both MIM and CIM are used extensively in our development of exact and efficient inference for Partial Bayes problems.

4 Inference for Partial Bayes Problems

In this section we build a general model framework for studying Partial Bayes problems. The derivation of our interval estimator is described in detail using the Inferential Model framework, and some of its key statistical properties are also studied.

4.1 Model Specification

Our aim here is to provide a simple model framework that is general enough to describe the broad range of Partial Bayes problems introduced in Section 1.

Let $X$ be the observed data, whose distribution relies on an unknown parameter vector $\theta$. The information about $\theta$ that comes from the collected data is expressed by the conditional distribution of $X$ given the parameter, $p(x \mid \theta)$. In many cases, we have prior knowledge about $\theta$ that can be characterized as a prior distribution $\pi(\theta)$. When $\pi(\theta)$ is fully specified, standard Bayesian methods can be used to derive the posterior distribution of $\theta$. In other cases, there is only partial prior information available. Formally, assume that the parameter can be partitioned into two blocks, $\theta = (\lambda, \eta)$, so that the desired fully-specified prior of $\theta$ can be accordingly decomposed as $\pi(\theta) = \pi(\lambda \mid \eta)\,\pi(\eta)$, where $\pi(\lambda \mid \eta)$ is the conditional density function of $\lambda$ given $\eta$, and $\pi(\eta)$ is the marginal distribution of $\eta$. We call the prior information partial if only the conditional distribution $\pi(\lambda \mid \eta)$ is available, but $\pi(\eta)$ is missing. In general, inference is made on $\lambda$ or a component of it; that is, $\lambda$ can be further partitioned into $\lambda = (\psi, \zeta)$, with $\psi$ denoting the parameter of interest and $\zeta$ denoting the additional nuisance parameters. In this article we focus on the case where $\psi$ is a scalar, which is of interest in many practical problems. For better presentation, we summarize these concepts and the proposed model structure in the following table:

Sampling model: $X \sim p(x \mid \theta)$
Parameter partition: $\theta = (\lambda, \eta)$, $\lambda = (\psi, \zeta)$
Partial prior: $\lambda \mid \eta \sim \pi(\lambda \mid \eta)$
Component without prior: $\eta$
Parameter of interest: $\psi$ (a scalar)

Despite its simplicity, the above model includes the well-known hierarchical models as an important class of practically useful models. Moreover, the formulation goes beyond the hierarchical models, and also includes the marginal prior problem. As described in Section 1, our target of inference is to construct a sample-based interval that satisfies some validity conditions. Specifically, the following two types of validity properties are considered:

Definition 1.

A sample-based interval $C(X)$ is said to be an unconditionally valid interval estimator for $\psi$ with $1-\alpha$ confidence level if $P\{\psi \in C(X)\} \ge 1-\alpha$ for all $\eta$, where the probability is computed over the joint distribution of $(X, \lambda)$.

Definition 2.

A sample-based interval $C(X)$ is said to be a conditionally valid interval estimator for $\psi$ given $T(X)$ with $1-\alpha$ confidence level if $P\{\psi \in C(X) \mid T(X) = t\} \ge 1-\alpha$ for all $\eta$ and all $t$, where $T(X)$ is a statistic of the data, and the probability is computed over the joint distribution of $(X, \lambda)$ given $T(X) = t$.

Definition 1 is a rephrasing of the validity condition in Morris (1983), and Definition 2 comes from Carlin and Gelfand (1990). It should be noted that the second condition is stronger than the first, since it reduces to Definition 1 by averaging over $T(X)$. In this article, we aim to produce the second type of interval estimator, but the first validity property is also studied when different interval estimators for $\psi$ are compared with each other.
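Both definitions can be examined by Monte Carlo. The sketch below reuses the normal-means example of Section 2 with the plug-in Empirical Bayes interval; the conditioning statistic is taken to be $\bar{X}$ purely for illustration (an assumption, not the statistic used in the theory later). Unconditional coverage averages over all replications, while conditional coverage restricts to replications whose $\bar{X}$ falls near a chosen value.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, tau2, mu, alpha = 10, 1.0, 1.0, 0.0, 0.05
z, B = 1.96, sigma2 / (sigma2 + tau2)

def one_rep():
    theta = rng.normal(mu, np.sqrt(tau2), size=n)
    x = rng.normal(theta, np.sqrt(sigma2))
    center = (1 - B) * x[0] + B * x.mean()
    half = z * np.sqrt((1 - B) * sigma2)
    return x.mean(), abs(center - theta[0]) <= half

stats, hits = map(np.array, zip(*(one_rep() for _ in range(100_000))))

print("unconditional coverage:", hits.mean())        # Definition 1
near = np.abs(stats - 0.5) < 0.02                    # condition on xbar near 0.5
print("conditional coverage:  ", hits[near].mean())  # Definition 2, approximated by binning
```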

4.2 Inferential Models for Partial Bayes Problems

In this section we describe a procedure to analyze Partial Bayes problems in the Inferential Model framework, and develop intermediate results that are used to derive the proposed interval estimator in Section 4.3. The procedure consists of the three steps introduced in Section 3, and outputs a plausibility function for $\psi$, the parameter of interest.

4.2.1 The Association Step

The association step has three sub-steps, and we highlight their tasks at the beginning of each sub-step.

Constructing data and prior associations

The first association equation comes from the data sampling model $p(x \mid \theta)$, for which we write $X = a(\theta, U)$, where $a$ is the "data association" function and $U$ is an unobservable auxiliary variable that has a known distribution. Since $\theta$ can be partitioned into $(\lambda, \eta)$ with the conditional prior $\pi(\lambda \mid \eta)$ known, the equation that represents this partial information can be written as $\lambda = b(\eta, V)$, where $b$ is the "prior association" function and $V$ is another auxiliary variable independent of $U$. Substituting the prior association into the data association, we get $X = a\big((b(\eta, V), \eta), U\big)$. To avoid over-complicated notation, we simply write this relation as $X = g(\eta, W)$, where $W = (U, V)$.

As described in Section 4.1, we are only interested in one element of the $\lambda$ vector, so we assume that the prior association can be equivalently decomposed as $\psi = b_1(\eta, V_1)$ and $\zeta = b_2(\eta, V_2)$, where $b_1$ and $b_2$ are the decomposed associations and $V = (V_1, V_2)$. Therefore, the model for Partial Bayes problems can be summarized by the following system of three equations:

$X = a\big((b(\eta, V), \eta),\, U\big), \qquad \psi = b_1(\eta, V_1), \qquad \zeta = b_2(\eta, V_2). \qquad (2)$

Note that $\zeta$ can be regarded as a nuisance parameter, and (2) is "regular" in the sense of Definition 2 of Martin and Liu (2015c). Then, according to the general theory of MIM in that paper (Theorems 2 and 3), the third equation in (2) can be ignored without loss of efficiency.

Decomposing data association

Next, since the sample usually contains multiple observations, the dimension of the auxiliary variable $W = (U, V)$ can often be very high. In order to reduce the number of auxiliary variables, assume that the data relationship admits a decomposition

(3)

for one-to-one mappings of the data and of the auxiliary variable. Martin and Liu (2015a) show that this decomposition exists for a large number of models, and in case (3) is not available, we simply take the trivial decomposition in which no component of the auxiliary variable is observed. Equation (3) implies that when the collected data have a realization $x$, one component of the auxiliary variable is fully observed. By conditioning on this observed component, we obtain the following two conditional associations

(4)
(5)

where the distributions of the remaining auxiliary variables are taken conditional on the observed component. In the rest of Section 4.2, when we discuss the distribution of a random variable that depends on the auxiliary variables in (4) or (5), this conditioning is implicitly added.

Obtaining the final association

Finally, to make inference about $\psi$, the unknown quantity $\eta$ needs to be marginalized out of the equations. We seek a real-valued continuous function of two arguments such that, when its first argument is fixed to some value, the resulting mapping of the second argument is one-to-one. At the current stage we simply take an arbitrary function satisfying this requirement, and we defer the discussion of its optimal choice to Section 4.4. As a result, associations (4) and (5) are equivalent to

(6)
(7)

Conditional on the observed component of the auxiliary variable, the auxiliary variable in (7) is a random variable whose c.d.f. is indexed by the unknown parameter $\eta$. If the function is chosen such that $\psi$ has only a small effect on equation (6), the first equation (6) provides little or even no information about $\psi$, and hence it can be ignored according to the theory of MIM. The final association equation (7) thus completes the association step.

4.2.2 The Prediction Step

The aim of this step is to introduce a predictive random set, conditional on the observed component of the auxiliary variable, that can predict the auxiliary variable in (7) with high probability. The following two situations are considered.

The first situation is that this c.d.f. is in fact free of $\eta$. This can easily be achieved if $\eta$ has the same dimension as the auxiliary variable in (4), and if the corresponding mapping can be inverted. To verify this, plug the inverted relation into (4); the problem then reduces to a univariate Inferential Model problem that has a well-defined solution.

The second situation is more general and thus more challenging: the c.d.f. relies on the unknown parameter $\eta$. Typically this occurs when the dimension of the auxiliary variable is higher than that of $\eta$. To deal with this issue, we generalize Definition 5 of Martin and Liu (2015c) to define the concept of stochastic bounds in tails.

Definition 3.

Let $Z_1$ and $Z_2$ be two random variables with c.d.f.s $F_1$ and $F_2$, respectively, and denote by $m$ the median of $Z_2$. $Z_1$ is said to be stochastically bounded by $Z_2$ in tails if $F_1(z) \le F_2(z)$ for $z \le m$, and $F_1(z) \ge F_2(z)$ for $z \ge m$.

The difference between this definition and the one in the literature is that here the medians of $Z_1$ and $Z_2$ are not required to be zero.

Assume that we have found a random variable $Z^*$ such that, given the observed component of the auxiliary variable, the auxiliary variable in (7) is stochastically bounded by $Z^*$ in tails for any value of $\eta$. Note that the first situation discussed earlier can be viewed as a special case, since any random variable is stochastically bounded by itself in tails. To shorten the argument, we only consider this more general case in the later discussion. There are various ways to construct such a random variable $Z^*$; see the examples in Martin and Liu (2015c). Here we provide a simple approach, defining the c.d.f. of $Z^*$ directly from the family of c.d.f.s indexed by $\eta$, provided that the resulting function is a c.d.f.
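The display defining this c.d.f. did not survive extraction, so the sketch below only illustrates one natural construction consistent with Definition 3, and it should be read as an assumption rather than the paper's exact formula: take the pointwise supremum of the family of c.d.f.s in the left tail and the pointwise infimum in the right tail, so that both tails of the bound are at least as heavy as those of every member of the family.

```python
import numpy as np
from scipy.stats import norm

def bounding_cdf(z, etas, cdf):
    """Candidate c.d.f. of a variable that stochastically bounds in tails every
    member of the family {cdf(., eta)}: supremum below the reference median,
    infimum above it.  The result must still be checked to be a valid c.d.f."""
    vals = np.array([cdf(z, eta) for eta in etas])   # family of c.d.f.s on the grid
    sup, inf = vals.max(axis=0), vals.min(axis=0)
    m = z[np.argmin(np.abs(sup - 0.5))]              # reference median (an assumption)
    return np.where(z <= m, sup, inf)

# Toy family: N(0, eta^2) with eta in [0.5, 2]; the bound is simply the widest member.
z_grid = np.linspace(-8.0, 8.0, 2001)
F = bounding_cdf(z_grid, etas=np.linspace(0.5, 2.0, 16),
                 cdf=lambda z, eta: norm.cdf(z / eta))
print(np.all(np.diff(F) >= -1e-12))   # monotonicity check: is F a valid c.d.f.?
```

For families that are not ordered in this way, the construction can fail to be monotone, which is exactly why the caveat "provided that the resulting function is a c.d.f." is needed.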

Given $Z^*$, a standard conditional predictive random set can be chosen for the prediction of the auxiliary variable in (7). For the purpose of constructing two-sided interval estimators, we first define the generalized c.d.f. of a random variable, and then construct the predictive random set as follows:

(8)

This completes the prediction step, and other choices of the predictive random set for different purposes are discussed in Martin and Liu (2013).

4.2.3 The Combination Step

In what follows, to avoid notational confusion, we use $\Psi$ to represent the parameter of interest as a random variable, and denote by $\psi$ its possible values. In the final combination step, denote by $\Theta_x(z)$ the set of $\psi$ values that satisfy the association equation (7) with the data fixed at their observed value and the auxiliary variable fixed at $z$, and define $\Theta_x(\mathcal{S})$ as the union of $\Theta_x(z)$ over $z$ in the predictive random set $\mathcal{S}$. Then the conditional plausibility function for $\psi$ is obtained as

(9)

which completes the combination step.

4.3 Interval Estimator and Validity of Inference

In Section 4.2.3 a conditional plausibility function for the parameter $\psi$ has been derived under the Inferential Model framework, and in this section it is used to construct the proposed interval estimator. Similar to the construction of the plausibility region introduced in Section 3, we define the following set-valued function of the data:

(10)

From (9) it can be seen that this set-valued function depends on the data in two respects: the predictive random set depends on the conditioning component of the auxiliary variable, and the association function depends on the observed data. As a result, we define our Partial Bayes interval estimator for $\psi$ to be the set obtained by plugging the random sample into this set-valued function.
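In practice, once the conditional plausibility function in (9) is available, analytically or numerically, the interval estimator in (10) can be obtained by inverting it over a grid. The sketch below is a generic illustration of this inversion with a user-supplied plausibility function; it is not tied to any particular model in the paper, and the toy plausibility used for the demonstration is the single-observation normal example from the Section 3 sketch.

```python
import numpy as np
from scipy.stats import norm

def plausibility_interval(pl, grid, alpha=0.05):
    """Invert a plausibility function: keep all values psi with pl(psi) > alpha."""
    values = np.array([pl(psi) for psi in grid])
    kept = grid[values > alpha]
    if kept.size == 0:
        raise ValueError("grid too coarse or misplaced: no point has pl > alpha")
    return kept.min(), kept.max()

# Toy usage: single-observation normal-mean plausibility with x_obs = 1.3.
x_obs = 1.3
pl = lambda psi: 1.0 - abs(2.0 * norm.cdf(x_obs - psi) - 1.0)
print(plausibility_interval(pl, np.linspace(-4.0, 6.0, 4001)))   # approx (-0.66, 3.26)
```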

In the typical case where the parameter is a fixed value, the Inferential Model theory guarantees that the resulting region is a valid frequentist confidence interval. However, in our case the joint distribution of the parameter and the data is considered, as in Definitions 1 and 2. Therefore, the validity of the proposed interval estimator does not automatically follow from the Inferential Model theory, and hence needs to be studied separately. The result is summarized as Theorem 1.

Theorem 1.

With the conditioning statistic defined through the decomposition (3), the proposed interval estimator is a conditionally valid interval estimator for $\psi$ given that statistic, with $1-\alpha$ confidence level.

Recall that if the decomposition (3) is unavailable, we take the trivial decomposition, so that no conditioning is performed. In such cases, Theorem 1 reduces to the unconditional result corresponding to Definition 1.

4.4 Optimality and Efficiency

Theorem 1 states that the proposed interval estimator defined in (10) satisfies the validity condition. Another important property, the efficiency of the estimator, is discussed in this section. We claim two facts about the proposed interval estimator:

  1. If the marginal distribution of $\eta$ is known, then with a slight modification to the predictive random set, the optimal interval estimator can be constructed.

  2. If the marginal distribution of $\eta$ is unknown, then under some mild conditions, the proposed interval estimator can approximate the optimal one well. The discussion also guides the choice of the function used in (7).

First consider the ideal scenario where $\pi(\eta)$, the marginal distribution of $\eta$, is known, in which case a full prior distribution for $\theta$ is available. On the one hand, it is well known that given a fully-specified prior distribution, the optimal inference for the parameter is via its posterior distribution given the data. On the other hand, given this new information, the approach introduced in Section 4.2 can still be used to derive an interval estimator, with some slight modifications shown below. This result is then compared with the Bayesian solution.

Let the marginal distribution of $\eta$ be represented by its own association equation. Combining it with (6) and (7), we obtain the following three associations:

(11)

Again, the second equation implies that, given the data, its auxiliary variable is fully observed, so the remaining auxiliary variable can be predicted using its conditional distribution given the observed quantities. Similar to the prediction step in Section 4.2.2, we construct a predictive random set by using this conditional distribution in place of the bounding distribution in formula (8), and proceed with the same combination step to obtain

As a result, an interval estimator for $\psi$ is obtained in the same way as before. Comparing the plausibility function that defines it with the one in (9), it can be seen that they differ only in the distributions assigned to the predictive random sets. The following theorem shows that, with this slight change, the resulting interval matches the Bayesian posterior credible interval.

Theorem 2.

Assume that $\pi(\eta)$ is known and that the parameter of interest has a continuous distribution function given the data. Then the interval estimator derived above is optimal in the sense that it matches the Bayesian $1-\alpha$ posterior credible interval.

Theorem 2 implies that, by choosing a proper predictive random set for the auxiliary variable, the inference result can attain optimality. It also implies that, even when $\pi(\eta)$ is missing, as long as there exists a predictive random set close to the optimal one, the resulting interval estimator is, at least approximately, as efficient as the optimal one.

Recall that the optimal predictive random set is induced by the conditional distribution available when $\pi(\eta)$ is known, whereas when $\pi(\eta)$ is missing only the bounding distribution from Section 4.2.2 is available. Therefore, the next question is to find the conditions under which these two distributions are close. Since they are both conditional on the same observed quantities, to simplify the analysis we remove this conditioning from both distributions, and then study the closeness between the c.d.f. of the auxiliary variable defined in (7) and the distribution of the corresponding auxiliary variable defined in (11).

In most real applications, the association relation changes with the data size $n$. To emphasize the dependence on $n$, in what follows we attach a subscript $n$ to the relevant quantities. The following definition, from Xiong and Li (2008), is needed to study the large-sample property of a conditional distribution.

Definition 4.

Given two sequences of random variables $\{A_n\}$ and $\{B_n\}$, the conditional distribution function of $A_n$ given $B_n$, a random c.d.f. denoted by $F_n(\cdot \mid B_n)$, is said to converge weakly to a non-random c.d.f. $F$ in probability if, for every continuity point $x$ of $F$, $F_n(x \mid B_n)$ converges to $F(x)$ in probability as $n \to \infty$.

This definition is a generalization of the usual concept of weak convergence. We then have the following result:

Theorem 3.

Let , , and denote the densities of , , and , respectively. Also define . If (a) for fixed , , (b) , and (c) pointwise, then and , where in is seen as a fixed value.

Remark 1.

Conditions (a) and (b) are intentionally expressed in a simple form. In fact, they can be relaxed by applying one-to-one transformations to the quantities involved, with the limiting distribution changed accordingly.

Remark 2.

The three conditions are easy to check. Condition (a) states that the relevant plug-in quantity should be a consistent estimator of $\eta$ when $\eta$ is seen as fixed. Condition (b) guides the choice of the function used in (7). For condition (c), a sufficient condition, shown in the proof, is that the corresponding density also converges to the limiting density, which is satisfied by most parametric models.

To summarize, Theorem 3 indicates that the two distributions converge to the same limit, in which sense the two predictive random sets have approximately identical distributions when $n$ is sufficiently large. As a result, the proposed interval estimator defined in (10) can be seen as an approximation to the optimal solution. Combining Theorems 1 and 3, it can be concluded that the proposed interval estimator possesses the favorable properties of both validity and efficiency.

5 Popular Models Viewed as Partial Bayes Problems

In this section we apply the methodology in Section 4 to a collection of popular models viewed as Partial Bayes problems, and show how their Partial Bayes solutions are developed.

5.1 The Normal Hierarchical Model

The normal hierarchical model is extremely popular in the Empirical Bayes literature, partly due to its simplicity and flexibility; see for example Efron and Morris (1975); Morris (1983); Casella (1985); Efron (2010). The model setting has been given in Section 2, and without loss of generality we set $\sigma^2 = 1$, since the $X_i$'s can always be scaled by a constant to achieve an arbitrary variance. We will consider both the case where $\tau$ is known and the case where it is unknown, and our parameter of interest is $\theta_1$. To summarize, we write

Sampling model: $X_i \mid \theta_i \sim N(\theta_i, 1)$, $i = 1, \ldots, n$
Partial prior: $\theta_i \mid \mu \sim N(\mu, \tau^2)$, i.i.d., $i = 1, \ldots, n$
Component without prior: $\mu$
Parameter of interest: $\theta_1$

As a first step, this model can be expressed by the following association equations: $X_i = \theta_i + U_i$ and $\theta_i = \mu + \tau V_i$ for $i = 1, \ldots, n$, where $U_i \sim N(0, 1)$, $V_i \sim N(0, 1)$, and the $U_i$'s and $V_i$'s are independent. An equivalent expression for these associations is $X_i = \mu + \tau V_i + U_i$, in which the data are directly linked to the unknown $\mu$. Since the focus is on $\theta_1$, the prior associations for $\theta_2, \ldots, \theta_n$ can be ignored. In the following two subsections we discuss the cases with known and unknown $\tau$.

5.1.1 The case with a known $\tau$

This case corresponds to the motivating example presented in Section 2, and we are going to derive formula (1) with $\sigma^2 = 1$. Since $\tau$ is known, let $W_i = \tau V_i + U_i$, and then the system of associations can be rewritten as $X_i = \mu + W_i$ for $i = 1, \ldots, n$ and $\theta_1 = \mu + \tau V_1$, where $W_i \sim N(0, 1 + \tau^2)$. Therefore, by denoting $T(X) = \bar{X}$ and $H(X) = X - \bar{X}\mathbf{1}_n$, where $\bar{X}$ is the sample mean and $\mathbf{1}_n$ is a vector of all ones, the decomposition in equation (3) is achieved. The associated auxiliary variable for $T(X)$ is $\bar{W} \sim N(0, (1+\tau^2)/n)$.

Next, we keep the two associations corresponding to (4) and (5), with the distributions of their auxiliary variables taken conditional on the observed component. The last step is to apply the marginalizing function from Section 4.2.1, which yields the final association equation. It can be verified that the conditional distribution of its auxiliary variable given the observed data is

(12)

and the predictive random set (8) can be constructed accordingly. As a result, the conditional plausibility function for $\theta_1$ is obtained as

(13)

where $\Phi(\cdot)$ is the standard normal c.d.f., and hence the interval estimator for $\theta_1$ is

(14)

5.1.2 The case with an unknown $\tau$

Similar to the previous case, the starting point is to decompose the data associations into the form of (3), which can be done in two stages as described below. In the first stage, we keep the association for $\theta_1$ and decompose the data associations instead. Consider the ancillary statistics $A_i = (X_i - \bar{X})/S$ for $i = 1, \ldots, n$, where $\bar{X}$ and $S^2$ are the sample mean and sample variance of $X_1, \ldots, X_n$. It is clear that $(\bar{X}, S, A_1, \ldots, A_n)$ has a one-to-one mapping to $(X_1, \ldots, X_n)$. Since marginally $X_i \sim N(\mu, 1 + \tau^2)$, it is well known that $(\bar{X}, S^2)$ is a complete sufficient statistic for $(\mu, \tau)$, and thus it is independent of the ancillary statistics according to Basu's theorem. Therefore, conditioning on the ancillary statistics does not change the distribution of $(\bar{X}, S^2)$, and we obtain four associations (a)-(d) with mutually independent auxiliary variables. Equations (c) and (d) are derived from the well-known facts that $\bar{X} \sim N(\mu, (1+\tau^2)/n)$ and $(n-1)S^2/(1+\tau^2) \sim \chi^2_{n-1}$; the independence claim can also be checked numerically, as in the sketch below.
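The sketch is illustrative only: the marginal variance $1 + \tau^2$ is set to a hypothetical value, and the sample correlation between one ancillary coordinate $(X_1 - \bar{X})/S$ and the components of the complete sufficient statistic $(\bar{X}, S^2)$ is confirmed to be negligible, as Basu's theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
n, mu, marg_var = 10, 0.3, 2.0          # marginal variance 1 + tau^2 (hypothetical)

xbar, s2, anc = [], [], []
for _ in range(50_000):
    x = rng.normal(mu, np.sqrt(marg_var), size=n)
    xbar.append(x.mean())
    s2.append(x.var(ddof=1))
    anc.append((x[0] - x.mean()) / x.std(ddof=1))   # one ancillary coordinate

xbar, s2, anc = map(np.array, (xbar, s2, anc))
# By Basu's theorem the ancillary statistic is independent of (xbar, s2):
print(np.corrcoef(anc, xbar)[0, 1], np.corrcoef(anc, s2)[0, 1])   # both close to 0
```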

Then, in the second stage, we condition on the following equation, as its auxiliary variable is known to follow a Student $t$-distribution with $n-1$ degrees of freedom:

(15)

As a result, we keep the remaining associations, with the distributions of their auxiliary variables taken conditional on the quantity observed through (15) and on the ancillary statistics. Combined with the first stage, this completes the decomposition in the form of (3).

Next, by observing that one of the auxiliary variables is free of the unknown parameters, we can take the marginalizing function to depend only on the remaining quantities, so that the corresponding auxiliary variable is indexed by only one unknown parameter. Specifically, let

(16)

and then define , where and are chosen such that