1 Introduction, Definition, and a Motivating Example
For a given inferential problem, if all unknowns are represented by $\theta$ and knowns are represented by $y$, then the objective of Bayesian inference is to solve for the posterior distribution $p(\theta \mid y)$. This is proportional to the product of the prior function (the marginal probability model for the unknowns) and the likelihood function (a probability model for the observed data given the unknowns). [Footnote 2: The analysis is simplified temporarily by bundling all unknowns into $\theta$, which can include both missing data as well as inferential parameters of interest. In general this may not be an appropriate assumption.] Since these functions are the key assumptions embedded into a Bayesian model, it is natural to quantify their strength. In other words, in a Bayesian analysis just how much does $y$, the data collected in an experiment, influence the inference of $\theta$ in comparison to a prior function? The overarching objective of this paper is to define and critically examine two data dependent metrics that attempt to answer this question.
The Bayesian viewpoint has been criticized due to the challenge of appropriately selecting a prior distribution, and a variety of approaches have been taken to construct a default prior or bypass the prior completely, such as the classic Jeffreys prior (Jeffreys 1946), Berger and Bernardo’s reference priors (Bernardo 1979, Berger et al. 2009), Martin and Liu’s concept of prior free inference (Martin and Liu 2012), and other approaches reviewed by Kass and Wasserman (1996). Furthermore, Yuan and Clarke (1999) introduce information theoretic arguments to determine a default likelihood. In contrast, I do not attempt to introduce a default prior, a default likelihood, or a fundamentally new inferential procedure. Instead I suggest using standard parametric Bayesian inference with an arbitrarily specified prior and likelihood, and utilizing a pair of data dependent information theoretic metrics to quantify the information of both the prior and likelihood functions chosen. While this approach does not provide an explicit set of rules to define a likelihood or prior, it allows one to use a quantity to determine if prior or data assumptions are too strong or weak, similarly to how an analyst may use a p-value as a measure or tolerance of extremity under an assumed probabilistic model.
The ideas presented to quantify the information of the prior and likelihood are certainly not the only routes for addressing the issues raised; see for instance Reimherr et al. (2014), Evans and Jang (2011), and Clarke (1996). The metrics examined in this paper differ from said approaches in two notable ways. First, it is not necessary to assume a true sampling distribution for their construction, and second, they are not defined with respect to a default prior, such as the Jeffreys prior. Finally, I make explicit connections between the Lindley (1956) definition of the information provided by an experiment and the likelihood information in the next section.
1.2 Definition of Prior and Likelihood Information
Before setting forth the definitions of prior and likelihood information, it is important to clarify the notation that will be used throughout this paper. I adopt the notational conventions from BDA3 (Gelman et al. 2013) unless stated otherwise. Additionally, $D_{KL}(\cdot \,\|\, \cdot)$ refers to the KL-divergence from the first random variable (or probability density) to the second random variable (or probability density); that is, the expectation in the definition of the KL-divergence is with respect to the first random variable’s distribution. Let $\theta$ be the particular parameter(s) of interest.
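Definition 1.1: Normalized likelihood
The normalized likelihood is defined as $L_\theta(\theta \mid y) = \frac{p(y \mid \theta)}{\int p(y \mid \theta)\,d\theta}$. By definition this is dependent on the particular parameterization, $\theta$, that is chosen for the analysis, which is emphasized by the subscript $\theta$.
Definition 1.2: Prior information: the KL-divergence from the posterior to the normalized likelihood, $D_{KL}\big(p(\theta \mid y) \,\|\, L_\theta(\theta \mid y)\big)$.
Definition 1.3: Likelihood information: the KL-divergence from the posterior to the prior, $D_{KL}\big(p(\theta \mid y) \,\|\, p(\theta)\big)$.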
For these quantities to exist, it is necessary that the posterior and normalized likelihood, and posterior and prior, are absolutely continuous with respect to each other (Kullback 1959). Qualitatively the information of the likelihood function is judged by the distance from the posterior to the prior relative to the posterior, and the information of the prior is the distance from the posterior to the normalized likelihood relative to the posterior. The influence of the prior distribution is thus quantified by the deviation of the posterior distribution from the likelihood function, which is often used if a prior distribution is absent; i.e., in a likelihood based method of inference. [Footnote 3: It is possible that the likelihood is not integrable, in which case the prior information is not defined, limiting its applicability. However, in many cases likelihood integrability may be achieved by assuming a compact parameter space a priori, which is often a scientifically plausible assumption.] The arguments in the KL-divergence can be transposed for a qualitatively similar metric, yet I have chosen these definitions so that the likelihood and prior information can be compared meaningfully against each other, since the posterior is the common measure in either case. Additionally, while other (pseudo) metrics on a space of probability distributions can be employed in theory (such as Hellinger or Wasserstein), the KL-divergence is chosen primarily because: i) it has closed form analytical solutions for common statistical distributions such as those from the exponential family and ii) it is straightforward to show that the likelihood information metric is invariant to reparameterization, as will be discussed in the second section. A Monte Carlo algorithm for computing these metrics is given in Appendix A. Furthermore, it is not necessary to assume that the prior and likelihood are integrable functions for the likelihood information to exist; this technicality is addressed in Appendix B; however, in such a scenario the guarantee of non-negativity that comes with the KL-divergence is lost.
It is crucial to point out that the likelihood information metric introduced is closely related to the information provided by an experiment defined by Lindley (Definition 1) in 1956, but the two are not equivalent. The latter is $\int p(\theta \mid y)\log p(\theta \mid y)\,d\theta - \int p(\theta)\log p(\theta)\,d\theta$. In contrast, the likelihood information, defined as the KL-divergence from the posterior to prior, is $\int p(\theta \mid y)\log\frac{p(\theta \mid y)}{p(\theta)}\,d\theta = \int p(\theta \mid y)\log p(\theta \mid y)\,d\theta - \int p(\theta \mid y)\log p(\theta)\,d\theta$. As Lindley states in his 1956 paper, his information measure is not invariant to change of parameterization. [Footnote 4: This is in contrast to the average amount of information in Definition 2 of Lindley (1956), which is invariant to reparameterization since it is equivalent to the mutual information between $\theta$ and $y$, similarly to the average likelihood information.] In contrast the likelihood information is invariant to 1-1 reparameterization as is shown in section 2.
1.3 A Motivating Example Regarding Prior and Likelihood Information
As a motivating example, I consider a scenario where a “noninformative” Jeffreys prior and “noninformative” reference prior are more informative than a flat prior.
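Consider data that is collected according to a bivariate binomial model, whose probability mass function is given by:
$$f(r, s \mid p, q) = \binom{m}{r} p^{r}(1-p)^{m-r}\binom{r}{s} q^{s}(1-q)^{r-s}.$$
The observed data are $r$ and $s$, and the inferential parameters of interest are $p$ and $q$. The reference prior according to Yang and Berger (1998) is:
$$\pi_R(p, q) = \pi^{-2}\,p^{-1/2}(1-p)^{-1/2}\,q^{-1/2}(1-q)^{-1/2}.$$
The Jeffreys prior is:
$$\pi_J(p, q) = (2\pi)^{-1}\,(1-p)^{-1/2}\,q^{-1/2}(1-q)^{-1/2}.$$
(Note the distinction between subscripted $\pi$ to refer to a function and the numerical constant $\pi$; also the notation $p(\cdot)$ is avoided to reduce confusion with the parameter $p$.) See figures 1 and 2 for an illustration of these priors.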
Assume that the following data are collected in an experiment: m = 30, r = 29, and s = 2. Most of the likelihood’s mass occupies one of the four corners of the unit square where both priors approach $\infty$, as is illustrated in Figure 3. Hence the reference prior will have more information than a flat prior, and intuitively the likelihood will have more information when the prior is flat, conditioning on this particular data set. Indeed the (numerically) computed likelihood information is 3.52805 with a flat prior, 2.97012 for the Berger and Bernardo reference prior, and 2.55152 for the Jeffreys prior. [Footnote 5: As defined, the prior information is 0 for a flat prior and strictly positive for the other two priors, but of course to cite this without computing the likelihood information would seem to be tautological.] How is this apparent inconsistency resolved? While there exists utility and established theoretical footing for default priors, this example illustrates it is still important to quantify the prior and likelihood information in one’s particular experiment or data set nonetheless.
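The reported values can be cross-checked in closed form. The following is a minimal sketch, assuming the bivariate binomial likelihood factorizes as $p^{r}(1-p)^{m-r}q^{s}(1-q)^{r-s}$ and that each of the three priors is a product of Beta densities (flat: Beta(1, 1) for both $p$ and $q$; reference: Beta(1/2, 1/2) for both; Jeffreys: Beta(1, 1/2) for $p$ and Beta(1/2, 1/2) for $q$). Under these assumptions the posterior is also a product of Betas and the likelihood information is a sum of two Beta-to-Beta KL divergences. The helper names (`digamma`, `kl_beta`, `likelihood_information`) are illustrative and not taken from the paper's supplementary code.

```python
import math

def digamma(x):
    """Digamma function via upward recurrence plus an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kl_beta(a, b, c, d):
    """KL divergence from Beta(a, b) to Beta(c, d), in nats."""
    return (log_beta(c, d) - log_beta(a, b)
            + (a - c) * digamma(a) + (b - d) * digamma(b)
            + (c + d - a - b) * digamma(a + b))

def likelihood_information(m, r, s, prior_p, prior_q):
    """KL from posterior to prior under independent Beta priors on p and q."""
    (ap, bp), (aq, bq) = prior_p, prior_q
    post_p = (ap + r, bp + m - r)      # conjugate update for p
    post_q = (aq + s, bq + r - s)      # conjugate update for q
    return kl_beta(*post_p, ap, bp) + kl_beta(*post_q, aq, bq)

m, r, s = 30, 29, 2
flat = likelihood_information(m, r, s, (1.0, 1.0), (1.0, 1.0))
reference = likelihood_information(m, r, s, (0.5, 0.5), (0.5, 0.5))
jeffreys = likelihood_information(m, r, s, (1.0, 0.5), (0.5, 0.5))
print(flat, reference, jeffreys)
```

If the assumptions above match the intended model, the three printed values should land close to the numerically computed figures quoted in the text, with the flat prior giving the largest likelihood information.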
The structure of the paper is as follows: in section 2, I state and prove some theoretical properties of the prior and likelihood information metrics; in section 3, I illustrate that the information metrics can be computed analytically in some common conjugate models; and in section 4, I apply the metrics to a few prediction problems, illustrating that setting the prior information to be a small value may be a reasonable guiding principle for building a Bayesian model that predicts future data well. I conclude by discussing limitations of these metrics in addition to potential next steps in this line of research.
2 Some Theoretical Properties of the Prior and Likelihood Information
In this section, I set forth and prove some key theoretical properties of the prior and likelihood information which may help justify their utility; namely I demonstrate an invariance property of the likelihood information, interpret the likelihood information as observed mutual information between $\theta$ and $y$, and prove that prior information goes to 0 as more data is collected in commonly used Bayesian models. Additionally, a key motivation of the use of the KL-divergence in defining the prior and likelihood information is that it is non-negative and 0 if and only if the probability measures compared are identical almost surely (Cover and Thomas 1991). [Footnote 6: Appendix B considers the case when the prior or likelihood are not necessarily integrable, in which case it is still possible that the likelihood information metric is defined but the guarantee of non-negativity is lost.]
2.1 Property 1: Invariance of the Likelihood Information to 1-1 Reparameterization
A key property of the likelihood information is that it is invariant to 1-1 reparameterization.
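Let $\phi = f(\theta)$ be a 1-1 transformation of $\theta$, and let $|J|$ denote the absolute Jacobian determinant of the inverse map $\theta = f^{-1}(\phi)$. Then:
$$D_{KL}\big(p(\phi \mid y) \,\|\, p(\phi)\big) = \int p(\phi \mid y)\log\frac{p(\phi \mid y)}{p(\phi)}\,d\phi = \int p(\theta \mid y)\log\frac{p(\theta \mid y)\,|J|}{p(\theta)\,|J|}\,d\theta = D_{KL}\big(p(\theta \mid y) \,\|\, p(\theta)\big).$$
However, the same property does not hold for prior information since a Jacobian is not necessary for the normalized likelihood when starting with a different parameterization; in other words $L_\phi(\phi \mid y)$ is proportional to $p(y \mid \phi)$.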
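The invariance can be checked numerically. The sketch below is an illustration under assumed values, not part of the original analysis: it takes a hypothetical Normal posterior $N(1, 0.5^2)$ and Normal prior $N(0, 2^2)$ for $\theta$, applies the 1-1 map $\phi = e^{\theta}$, and integrates the KL integrand directly over $\phi$ with a trapezoidal rule. The result is compared against the closed-form Normal-to-Normal KL divergence in the $\theta$ parameterization.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def kl_normal_sd(m1, s1, m2, s2):
    """Closed-form KL divergence from N(m1, s1^2) to N(m2, s2^2)."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5

def kl_after_exp_transform(m1, s1, m2, s2, n_grid=20001):
    """KL divergence computed in phi = exp(theta) by trapezoidal quadrature over phi."""
    lo, hi = m1 - 12.0 * s1, m1 + 12.0 * s1
    phis = [math.exp(lo + (hi - lo) * i / (n_grid - 1)) for i in range(n_grid)]

    def integrand(phi):
        # Densities of phi: both pick up the same Jacobian factor 1 / phi.
        post = normal_pdf(math.log(phi), m1, s1) / phi
        prior = normal_pdf(math.log(phi), m2, s2) / phi
        return post * math.log(post / prior)

    values = [integrand(phi) for phi in phis]
    return sum(0.5 * (values[i] + values[i + 1]) * (phis[i + 1] - phis[i])
               for i in range(n_grid - 1))

kl_theta = kl_normal_sd(1.0, 0.5, 0.0, 2.0)
kl_phi = kl_after_exp_transform(1.0, 0.5, 0.0, 2.0)
print(kl_theta, kl_phi)
```

The two numbers should agree up to quadrature error, because the Jacobian factor cancels inside the logarithm. No such cancellation occurs for the prior information, since the normalized likelihood carries no Jacobian.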
2.2 Property 2: The Likelihood Information is “Observed Mutual Information” between $\theta$ and $y$
Typically, the reference prior due to Bernardo and Berger maximizes the average likelihood information, which is equivalent to maximizing the mutual information between $\theta$ and $y$. This is because:
$$E_{y}\Big[D_{KL}\big(p(\theta \mid y) \,\|\, p(\theta)\big)\Big] = \int p(y)\int p(\theta \mid y)\log\frac{p(\theta \mid y)}{p(\theta)}\,d\theta\,dy = \int\!\!\int p(\theta, y)\log\frac{p(\theta, y)}{p(\theta)\,p(y)}\,d\theta\,dy = I(\theta; y),$$
where $I(\theta; y)$ is mutual information. From this perspective the likelihood information can be considered the observed mutual information between $\theta$ and $y$. Therefore, the relationship between the likelihood information and the mutual information between data and the parameter of interest in a parametric Bayesian model is analogous to the relationship between the observed and expected Fisher information, where the critical distinction between these quantities is that the expected Fisher information is an average over the data. Furthermore, this property connects the information measure in Definition 1 of Lindley (1956) with the likelihood information metric, since both measures yield the mutual information between $\theta$ and $y$ when averaged over $y$. In contrast to the information measure in Definition 1 of Lindley (1956), the likelihood information is invariant to 1-1 reparameterization without taking an average over $y$.
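The identity between the average likelihood information and the mutual information can be verified exactly on a small discrete model. The sketch below assumes a hypothetical three-point parameter space with a Binomial sampling model; the specific values of `thetas`, `prior` and `n_trials` are arbitrary choices for illustration.

```python
import math

thetas = [0.2, 0.5, 0.8]       # hypothetical finite parameter space
prior = [0.5, 0.3, 0.2]
n_trials = 5                   # y ~ Binomial(n_trials, theta)

def binom_pmf(y, n, theta):
    return math.comb(n, y) * theta ** y * (1.0 - theta) ** (n - y)

ys = range(n_trials + 1)
# joint[i][y] = p(theta_i) * p(y | theta_i)
joint = [[prior[i] * binom_pmf(y, n_trials, thetas[i]) for y in ys]
         for i in range(len(thetas))]
marginal_y = [sum(row[y] for row in joint) for y in ys]

def likelihood_information(y):
    """KL divergence from the posterior p(theta | y) to the prior."""
    posterior = [row[y] / marginal_y[y] for row in joint]
    return sum(q * math.log(q / p) for q, p in zip(posterior, prior))

# Average the observed quantity over the marginal distribution of y.
average_li = sum(marginal_y[y] * likelihood_information(y) for y in ys)

# Mutual information computed directly from the joint distribution.
mutual_information = sum(
    joint[i][y] * math.log(joint[i][y] / (prior[i] * marginal_y[y]))
    for i in range(len(thetas)) for y in ys)

print(average_li, mutual_information)
```

The per-outcome values `likelihood_information(y)` differ across `y`, which is the sense in which the likelihood information is an observed rather than an expected quantity.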
2.3 Property 3: In Common Bayesian Models the Prior Information Goes to 0 in Probability as More Data is Collected.
What follows is a proof that the prior information goes to zero in probability when the parameter space is finite, the model is correctly specified, and data are generated i.i.d from this model.
Assume the parameter space is a finite set which contains the true parameter , , , and , and data are generated i.i.d according to a true model where the model is correctly specified. Then the prior information approaches 0 in probability.
By posterior consistency as in Appendix B of BDA3 (Gelman et al. 2013), the probability masses on $\Theta$ governed by the posterior and normalized likelihood (which is a posterior under a constant improper prior) both converge in probability to mass of 1 on $\theta_0$ and 0 elsewhere. The KL divergence between the posterior and normalized likelihood is a continuous map which is a function of these masses and so the continuous mapping theorem applies. In particular the sequence of KL divergences between the posterior and likelihood converges in probability to the KL divergence between two point masses with 0 mass everywhere except $\theta_0$, which is 0.
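The finite-parameter-space argument can be illustrated with a small computation. This is a sketch under simplifying assumptions: the parameter space has three hypothetical values, the prior deliberately favors a wrong value, and instead of random draws the data are idealized so that the observed success fraction equals the true value 0.5 exactly. That makes the output deterministic while still showing the posterior and the normalized likelihood concentrating on the same point.

```python
import math

def prior_information(log_lik, prior):
    """KL divergence from the posterior to the normalized likelihood on a finite set."""
    top = max(log_lik)                          # subtract the max for numerical stability
    lik = [math.exp(v - top) for v in log_lik]
    lik_total = sum(lik)
    norm_lik = [v / lik_total for v in lik]
    unnorm_post = [p * v for p, v in zip(prior, lik)]
    post_total = sum(unnorm_post)
    post = [v / post_total for v in unnorm_post]
    return sum(q * math.log(q / nl) for q, nl in zip(post, norm_lik) if q > 0.0)

def bernoulli_log_lik(theta, successes, n):
    return successes * math.log(theta) + (n - successes) * math.log(1.0 - theta)

thetas = [0.2, 0.5, 0.8]      # finite parameter space; 0.5 plays the role of the truth
prior = [0.6, 0.2, 0.2]       # deliberately favors a wrong value
info = {}
for n in (20, 200, 2000):
    successes = n // 2        # idealized data whose success fraction equals the truth
    log_lik = [bernoulli_log_lik(t, successes, n) for t in thetas]
    info[n] = prior_information(log_lik, prior)
    print(n, info[n])
```

With random Bernoulli draws in place of the idealized counts, the same function could be used to examine the convergence in probability described above.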
Additionally, a conjecture in the continuous case is as follows: due to posterior consistency, both the log-likelihood and log-posterior can be reasonably approximated with a Taylor expansion about the truth, so both $p(\theta \mid y)$ and $L_\theta(\theta \mid y)$ can be approximated as a Normal distribution with $\theta_0$ as the mean and the inverse of the Fisher information at $\theta_0$ as the variance (at least when the observed data are generated i.i.d. conditioned on the underlying parameters and sufficient regularity conditions are met, as in the Bernstein-von Mises theorem (van der Vaart 2000)). Therefore the KL divergence between $p(\theta \mid y)$ and $L_\theta(\theta \mid y)$ approaches 0; techniques used by Clarke (1999) may be used to rigorously prove this claim.
3 Prior and Likelihood Information in Common Conjugate Models
To illustrate that the prior and likelihood information can be used in practice, I analytically compute closed form expressions for the prior and likelihood information in the Normal-Normal model with known variance, Poisson-Gamma and Multinomial-Dirichlet models. Note that a random variable denoted by $\theta_L$ is distributed according to the normalized likelihood, $L_\theta(\theta \mid y)$. These results may be useful in deriving the prior and likelihood information in models which use conjugate priors, such as ‘Naive Bayes Classification’, which can be thought of as a pair of Multinomial-Dirichlet models for each of two classes. Additionally, the third part of the Appendix presents a result which demonstrates why the normalized likelihood, and hence prior information, may be defined in models where the data generating process is assumed to come from the exponential family.
3.1 Normal-Normal Model With Known Variance
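Assume i.i.d. samples $y_1, \dots, y_n \mid \mu \sim N(\mu, \sigma^2)$ with $\sigma^2$ known, and $\mu \sim N(\mu_0, \tau_0^2)$. I calculate the KL-divergence between the posterior and prior making use of the fact that the KL-divergence from $N(\mu_1, \sigma_1^2)$ to $N(\mu_2, \sigma_2^2)$ is given by
$$\frac{1}{2}\log\frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$
as in Penny (2001). In the Normal-Normal model the posterior is given by:
$$\mu \mid y \sim N(\mu_n, \tau_n^2), \qquad \mu_n = \frac{\mu_0/\tau_0^2 + n\bar{y}/\sigma^2}{1/\tau_0^2 + n/\sigma^2}, \qquad \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}.$$
Hence substituting the latter equation into the former and simplifying I derive:
$$D_{KL}\big(p(\mu \mid y) \,\|\, p(\mu)\big) = \frac{1}{2}\log\frac{\tau_0^2}{\tau_n^2} + \frac{\tau_n^2 + (\mu_n - \mu_0)^2}{2\tau_0^2} - \frac{1}{2},$$
which gives the KL divergence between the posterior and prior, or likelihood information. To calculate the prior information I first note that the normalized likelihood in $\mu$ is the $N(\bar{y}, \sigma^2/n)$ density. Hence I derive the prior information to be:
$$D_{KL}\big(p(\mu \mid y) \,\|\, L_\mu(\mu \mid y)\big) = \frac{1}{2}\log\frac{\sigma^2}{n\tau_n^2} + \frac{n\big(\tau_n^2 + (\mu_n - \bar{y})^2\big)}{2\sigma^2} - \frac{1}{2}.$$
In the Normal-Normal model I can use the law of iterated expectation to show that:
$$E\big[(\bar{y} - \mu_0)^2\big] = E\big[\mathrm{Var}(\bar{y} \mid \mu)\big] + E\big[(\mu - \mu_0)^2\big] = \frac{\sigma^2}{n} + \tau_0^2.$$
Thus after algebraic manipulation, using $\mu_n - \bar{y} = (\mu_0 - \bar{y})\,\tau_n^2/\tau_0^2$, I derive that the average prior information is $\frac{1}{2}\log\big(1 + \frac{\sigma^2}{n\tau_0^2}\big)$. If I set $\sigma^2 = 1$ and $\tau_0^2 = 1$ (for computational convenience) I derive the average prior information to be:
$$\frac{1}{2}\log\Big(1 + \frac{1}{n}\Big).$$
Since the limit of this is 0 as $n$ approaches $\infty$, by Markov’s inequality the prior information approaches 0 in probability. It is interesting that in the case when the data are generated according to the marginal distribution of $y$, the posterior does not always contract to a fixed point even though the prior information approaches 0 in probability, since the variance of $\bar{y}$ is non-zero as $n$ approaches $\infty$.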
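As a minimal sketch of how these quantities could be computed, the function below assumes the standard conjugate update (posterior precision equal to prior precision plus $n/\sigma^2$) and a normalized likelihood equal to the $N(\bar{y}, \sigma^2/n)$ density. The function and argument names are illustrative. Because the prior information is linear in $(\bar{y} - \mu_0)^2$, plugging in the marginal expectation of that squared deviation returns the average prior information, which gives a deterministic check of the $\frac{1}{2}\log(1 + 1/n)$ expression for unit variances.

```python
import math

def kl_normal(m1, v1, m2, v2):
    """KL divergence from N(m1, v1) to N(m2, v2); v1 and v2 are variances."""
    return 0.5 * math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / (2.0 * v2) - 0.5

def normal_normal_information(ybar, n, sigma2, mu0, tau0_sq):
    """Return (prior information, likelihood information) for the Normal-Normal model."""
    post_var = 1.0 / (1.0 / tau0_sq + n / sigma2)
    post_mean = post_var * (mu0 / tau0_sq + n * ybar / sigma2)
    likelihood_info = kl_normal(post_mean, post_var, mu0, tau0_sq)
    prior_info = kl_normal(post_mean, post_var, ybar, sigma2 / n)
    return prior_info, likelihood_info

n = 10
# Choose ybar so that (ybar - mu0)^2 equals its marginal expectation sigma2/n + tau0_sq.
ybar_typical = math.sqrt(1.0 / n + 1.0)
avg_prior_info, _ = normal_normal_information(ybar_typical, n, sigma2=1.0, mu0=0.0, tau0_sq=1.0)
print(avg_prior_info, 0.5 * math.log(1.0 + 1.0 / n))
```

For an observed $\bar{y}$ far from $\mu_0$ the prior information exceeds this average, which is one reason a data dependent metric can differ from summaries based only on sample sizes.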
In univariate exponential family models with a conjugate prior, prior sample size and data sample size are an alternate pair of metrics that can be used to quantify the information of the prior and data. In the Normal model, the prior precision serves as an indicator of prior sample size and by definition $n$ is the data sample size. Hence using these metrics, prior information can be taken as $1/n$ (for unit prior precision and unit data variance), which decays to 0 in $n$, which is consistent with the definition of prior information since the average prior information is $\frac{1}{2}\log\big(1 + \frac{1}{n}\big)$, which is approximately $\frac{1}{2n}$ by Taylor’s expansion. However, it is important to realize that this is with respect to an average over the data, which in general may not be appropriate as alluded to in the motivating example. Moreover, since the prior sample size and data sample size metrics do not actually involve the observed data, they may not be adequate measures of prior and likelihood information.
3.2 Poisson-Gamma Model
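Assume i.i.d. samples $y_1, \dots, y_n \mid \lambda \sim \text{Poisson}(\lambda)$ and $\lambda \sim \text{Gamma}(\alpha, \beta)$, with shape $\alpha$ and rate $\beta$. Note that the KL divergence from $\text{Gamma}(a_1, b_1)$ to $\text{Gamma}(a_2, b_2)$ is given by:
$$(a_1 - a_2)\,\psi(a_1) - \log\Gamma(a_1) + \log\Gamma(a_2) + a_2\log\frac{b_1}{b_2} + a_1\,\frac{b_2 - b_1}{b_1}$$
(Penny 2001), where $\psi$ is the digamma function. Using this expression, and the posterior $\lambda \mid y \sim \text{Gamma}(\alpha + n\bar{y}, \beta + n)$, I may derive the likelihood information to be:
$$n\bar{y}\,\psi(\alpha + n\bar{y}) - \log\Gamma(\alpha + n\bar{y}) + \log\Gamma(\alpha) + \alpha\log\frac{\beta + n}{\beta} - \frac{n(\alpha + n\bar{y})}{\beta + n}.$$
To derive the prior information I note that the normalized likelihood in $\lambda$ is the $\text{Gamma}(n\bar{y} + 1, n)$ density.
Hence I may calculate prior information:
$$(\alpha - 1)\,\psi(\alpha + n\bar{y}) - \log\Gamma(\alpha + n\bar{y}) + \log\Gamma(n\bar{y} + 1) + (n\bar{y} + 1)\log\frac{\beta + n}{n} - \frac{\beta(\alpha + n\bar{y})}{\beta + n}.$$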
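A short sketch of the Poisson-Gamma computation follows, assuming the shape-rate parameterization of the Gamma distribution, a posterior of Gamma(alpha + total, beta + n) where `total` is the sum of the counts, and a normalized likelihood equal to the Gamma(total + 1, n) density. The function names and the example hyperparameters are illustrative only.

```python
import math

def digamma(x):
    """Digamma function via upward recurrence plus an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def kl_gamma(a1, b1, a2, b2):
    """KL divergence from Gamma(a1, b1) to Gamma(a2, b2), shape-rate parameterization."""
    return ((a1 - a2) * digamma(a1) - math.lgamma(a1) + math.lgamma(a2)
            + a2 * (math.log(b1) - math.log(b2)) + a1 * (b2 - b1) / b1)

def poisson_gamma_information(total, n, alpha, beta):
    """Return (prior information, likelihood information) given the sum of n counts."""
    post = (alpha + total, beta + n)
    likelihood_info = kl_gamma(*post, alpha, beta)
    prior_info = kl_gamma(*post, total + 1.0, float(n))
    return prior_info, likelihood_info

# Same sample mean (3.0) at two sample sizes, with a hypothetical Gamma(2, 1) prior.
pi_small, li_small = poisson_gamma_information(total=30, n=10, alpha=2.0, beta=1.0)
pi_large, li_large = poisson_gamma_information(total=300, n=100, alpha=2.0, beta=1.0)
print(pi_small, li_small)
print(pi_large, li_large)
```

In this example the prior information shrinks as the sample size grows at a fixed sample mean, while the likelihood information grows.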
3.3 Multinomial-Dirichlet Model
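Assume a vector $y = (y_1, \dots, y_k)$ is drawn from a Multinomial distribution with probabilities $\theta = (\theta_1, \dots, \theta_k)$ and those probabilities are drawn from a Dirichlet distribution with parameters $\alpha = (\alpha_1, \dots, \alpha_k)$, canonically referred to as the Dirichlet-Multinomial conjugate model. To derive the prior and likelihood information I note that the KL-divergence from $\text{Dirichlet}(\alpha)$ to $\text{Dirichlet}(\beta)$ is given by:
$$\log\Gamma(\alpha_0) - \sum_{i=1}^{k}\log\Gamma(\alpha_i) - \log\Gamma(\beta_0) + \sum_{i=1}^{k}\log\Gamma(\beta_i) + \sum_{i=1}^{k}(\alpha_i - \beta_i)\big(\psi(\alpha_i) - \psi(\alpha_0)\big),$$
where $\alpha_0 = \sum_i \alpha_i$ and $\beta_0 = \sum_i \beta_i$ (Penny 2001). Also note that in the Dirichlet-Multinomial model, $\theta \sim \text{Dirichlet}(\alpha)$ and $\theta \mid y \sim \text{Dirichlet}(\alpha + y)$. Hence, writing $n = \sum_i y_i$, I may derive the likelihood information to be:
$$\log\Gamma(\alpha_0 + n) - \sum_{i=1}^{k}\log\Gamma(\alpha_i + y_i) - \log\Gamma(\alpha_0) + \sum_{i=1}^{k}\log\Gamma(\alpha_i) + \sum_{i=1}^{k} y_i\big(\psi(\alpha_i + y_i) - \psi(\alpha_0 + n)\big).$$
Noting that the normalized likelihood in $\theta$ is the $\text{Dirichlet}(y + 1)$ density, I can compute the prior information:
$$\log\Gamma(\alpha_0 + n) - \sum_{i=1}^{k}\log\Gamma(\alpha_i + y_i) - \log\Gamma(n + k) + \sum_{i=1}^{k}\log\Gamma(y_i + 1) + \sum_{i=1}^{k}(\alpha_i - 1)\big(\psi(\alpha_i + y_i) - \psi(\alpha_0 + n)\big).$$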
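The Dirichlet-Multinomial expressions can be sketched in a few lines, assuming the posterior is Dirichlet(alpha + y) and the normalized likelihood is Dirichlet(y + 1). The helper names and the example counts are illustrative assumptions. A flat Dirichlet(1, ..., 1) prior should give zero prior information, which serves as a basic sanity check.

```python
import math

def digamma(x):
    """Digamma function via upward recurrence plus an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def kl_dirichlet(alpha, beta):
    """KL divergence from Dirichlet(alpha) to Dirichlet(beta)."""
    a0, b0 = sum(alpha), sum(beta)
    return (math.lgamma(a0) - sum(math.lgamma(a) for a in alpha)
            - math.lgamma(b0) + sum(math.lgamma(b) for b in beta)
            + sum((a - b) * (digamma(a) - digamma(a0)) for a, b in zip(alpha, beta)))

def dirichlet_multinomial_information(counts, alpha):
    """Return (prior information, likelihood information) for observed counts."""
    posterior = [a + y for a, y in zip(alpha, counts)]
    normalized_likelihood = [y + 1.0 for y in counts]
    return (kl_dirichlet(posterior, normalized_likelihood),
            kl_dirichlet(posterior, alpha))

counts = [12, 5, 3]                                   # hypothetical observed counts
pi_flat, li_flat = dirichlet_multinomial_information(counts, [1.0, 1.0, 1.0])
pi_strong, li_strong = dirichlet_multinomial_information(counts, [10.0, 10.0, 10.0])
print(pi_flat, li_flat)
print(pi_strong, li_strong)
```

The second call uses a more concentrated symmetric prior, for which the posterior is pulled away from the normalized likelihood and the prior information is strictly positive.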
4 Experiments Investigating the Relationship Between Predictive Accuracy and Prior Information
In this section, I present the results of two prediction experiments with Bayesian models as an attempt to understand the relationship between prior information and predictive accuracy; intuitively, as suggested by Gelman et al. (2008), “weak information” (or regularization) of the prior ought to improve the out of sample classification error of a model. [Footnote 7: The prior_inf_experiment.R script included in the supplementary materials allows one to re-run this experiment.] Furthermore, the regularized regression estimates that have been widely used in recent decades and are useful for prediction, such as LASSO (Tibshirani 1996), can be seen as a departure from typical linear regression estimates which simply maximize a likelihood. Hence prior information, which quantifies a discrepancy of the posterior from the likelihood, ought to be reflected in prediction accuracy.
First, I use the UCI machine learning diabetes classification dataset and consider a logistic regression model trained with independent Normal priors on the coefficients with varying standard deviations, hence varying the prior and likelihood information content. The dataset consists of two labels (diabetes or no diabetes), 8 continuous predictors, and 758 data points, of which 500 are randomly chosen for training and the remaining are chosen for the test set. Let $y^{test}$ and $y^{train}$ correspond to the test and training labels for diabetes outcome, respectively. Let $x_i$ represent the background covariates for individual $i$ and $\beta$ be the vector of model coefficients. Then, the model used is:
$$\beta \sim N(0, \sigma^2 I).$$
The variance parameter $\sigma^2$ is fixed, and can be toggled to control the prior and likelihood information.
$$y_i \mid \beta, x_i \sim \text{Bernoulli}\big(\text{logit}^{-1}(x_i^{T}\beta)\big),$$
where the $y_i$ are independent conditional on $\beta$, for all $i$.
Then by the definition of conditional probability and marginalization, the posterior predictive distribution for $y^{test}$ (conditional on $y^{train}$) is given by $p(y^{test} \mid y^{train}) = \int p(y^{test} \mid \beta)\,p(\beta \mid y^{train})\,d\beta$. Hence for a fixed value of $\sigma^2$, I generate 100 samples from $p(\beta \mid y^{train})$, the posterior distribution of model coefficients, using the elliptical slice sampling algorithm (Murray et al. 2010) due to the multivariate normal prior on $\beta$, and use these samples to draw from the posterior predictive distribution of the diabetes label for the remaining 258 individuals in the study using the logistic link function and Bernoulli random draws. Retaining the samples for $y^{test}$ thus generates samples from the posterior prediction of $y^{test}$ given $y^{train}$, as discussed by Gelman et al. (2013). The posterior predictive mode is used for the predictive classifications for all of the remaining units, which is a prediction that is justifiable from a decision theoretic viewpoint for a 0-1 loss function. Average 0-1 loss is used as the measure of predictive error and the results of the experiment are shown in the tables below.
To estimate the prior and likelihood information I use the Monte Carlo algorithm from the first appendix. Note that the normalized likelihood is approximated as the posterior under the same prior with a variance much larger than the variances used in the experiment. Consistent with the intuition of regularization, I note that the largest classification accuracy is achieved for a small value of the prior information (0.87 nats), with classification accuracy diminishing with smaller and larger values; more interpretation is needed to assess if there is any particular significance of this value.
Second, I use the prostate cancer regression dataset from the lasso2 R package with a continuous outcome of interest (log-cancer volume) and 8 predictors. There are 97 data points in this data set, of which 75 are randomly chosen for training and the remaining are chosen for the test set. The model used is:
$$y_i \mid \beta, x_i \sim N(x_i^{T}\beta, \sigma_y^2),$$
where $\beta \sim N(0, \sigma^2 I)$, $x_i$ are the background covariates for individual $i$ in the study, and $y_i$ is the log-cancer volume for individual $i$. The $y_i$ are independent conditional on $\beta$, for all $i$.
I generate 100 samples of the posterior distribution of model coefficients and subsequently draw from the posterior predictive distribution of log-cancer volume for the test individuals in the study. The posterior predictive mean is taken as the final prediction, again justifiable from a decision theoretic standpoint using a squared error loss function, and I use mean square error as the measure of predictive accuracy. As in the previous experiment, to estimate the prior and likelihood information, I use the Monte Carlo algorithm from the first appendix. Note that the normalized likelihood is approximated as the posterior under the same prior with a variance much larger than the variances used in the experiment. I see a similar phenomenon in that predictive error is minimized for a small value of the prior information and grows as the prior information gets smaller and larger. However, I also note that for one of the design points, the prior information is estimated to be negative, indicating Monte Carlo and/or numerical error.
The tables below delineate the results of these experiments, followed by plots which show test error (classification error in the first experiment, mean square error in the second) as a function of estimated prior information. In either case test error is smallest for a small value of prior information. While follow up studies and more repetitions are needed, this suggests that tuning hyperparameters to allow for small prior information may be a reasonable guiding principle to build a Bayesian model for the purposes of prediction. This may be useful particularly if one would like to use all available data for the purposes of fitting a Bayesian model as opposed to splitting it up for training, validation, and testing.
| Estimated Prior Information | Estimated Likelihood Information | Classification Error |
| Estimated Prior Information | Estimated Likelihood Information | MSE |
5 Conclusions and Future Directions
The two metrics constructed appear to be reasonable measures of prior and likelihood information as evidenced by their theoretical properties, analytical tractability in common conjugate models, computability, and use in applied contexts. Additionally, one may want to develop and apply these metrics to hierarchical models. In the current formulation, the metrics do not disentangle various hierarchical levels, but instead collapse all prior parameters into a generic $\theta$. Such an approach does not quantify the extent to which each individual hierarchical level impacts final inferences, and so it would be fruitful to consider a general method of disentangling the contribution of each hierarchical level. Secondly, in the context of causal inference from a Bayesian perspective, the final inferential statements of the causal effect are defined in terms of the posterior predictive distribution of the potential outcomes, which in the current formulation are bundled with all of the unknowns, including inferential and nuisance parameters. Thus another potential future direction is to determine how these metrics can be extended to apply to sensitivity analysis in (Bayesian) causal inference.
I am greatly appreciative of Professor Joe Blitzstein’s encouragement of the development of the ideas within this paper; his feedback, suggestions, and conversations have been instrumental in their development. Additionally, I must thank Professor Xiao-Li Meng for his stimulating insights and connections between the work of Berger and Bernardo and information theoretic concepts. Finally, I am extremely grateful for the time and effort spent by an anonymous individual who provided many useful comments and corrections for this paper.
7.1 A Monte Carlo Algorithm for Estimating the Prior and Likelihood Information
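Here I develop a Monte Carlo algorithm to estimate the prior and likelihood information and quantify the variance of the resultant estimator with the delta method. Without essential loss of generality, I develop the algorithm to estimate the prior information. First note the following identities:
$$p(\theta \mid y) = \frac{p(\theta)\,p(y \mid \theta)}{c_P}, \qquad L_\theta(\theta \mid y) = \frac{p(y \mid \theta)}{c_L}, \qquad D_{KL}\big(p(\theta \mid y) \,\|\, L_\theta(\theta \mid y)\big) = E_{\theta \mid y}\big[\log p(\theta)\big] + \log\frac{c_L}{c_P},$$
where $c_P$ and $c_L$ are the normalizing constants for the posterior and likelihood respectively. Assume i.i.d. samples $\theta_1, \dots, \theta_N$ from the posterior. Then by the identity
$$\frac{c_L}{c_P} = E_{\theta \mid y}\left[\frac{1}{p(\theta)}\right]$$
a natural Monte Carlo estimator for the prior information is:
$$\frac{1}{N}\sum_{i=1}^{N}\log p(\theta_i) + \log\left(\frac{1}{M}\sum_{j=1}^{M}\frac{1}{p(\theta_j)}\right),$$
where $M$ of the $N$ posterior samples have been chosen to estimate $c_L/c_P$. By the WLLN and the continuous mapping theorem this estimator converges in probability to the prior information, assuming that both $N$ and $M$ approach $\infty$ (so, for instance, $M$ could be some fraction of $N$). Since each component of the sum is consistent, the sum must also be consistent by basic results on convergence of sums in probability to the sum of their individual limits.
Furthermore, the asymptotic variance of the estimated log ratio of normalizing constants can be approximated using the delta method with the Log(.) transformation, yielding $\frac{\mathrm{Var}_{\theta \mid y}\left(1/p(\theta)\right)}{M\,\big(E_{\theta \mid y}\left[1/p(\theta)\right]\big)^2}$. In some cases it may be easy to generate normalized-likelihood samples, in which case an asymptotically unbiased estimator for $\log(c_L/c_P)$ is $-\log\Big(\frac{1}{M}\sum_{j=1}^{M} p(\theta^{L}_j)\Big)$, where $\theta^{L}_j$ are draws from the normalized likelihood, whose asymptotic variance can be approximated with the delta method as $\frac{\mathrm{Var}_{L}\left(p(\theta)\right)}{M\,\big(E_{L}\left[p(\theta)\right]\big)^2}$.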
More efficient Monte Carlo estimators may be derived by using better methods for estimating the ratio of normalizing constants; for instance see Meng and Wong (1996). An application of this algorithm is illustrated in Figure 7 comparing Monte Carlo estimates to the ground truth in a Multinomial-Dirichlet model, for which I have previously derived analytical results for the prior information. While the mean of the Monte Carlo estimates seems to be close to the ground truth, for small values of the hyperparameter there appears to be quite a bit of bias, as well as variation in the standard deviation of the Monte Carlo estimate by hyperparameter choice.
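A minimal sketch of a posterior-sample estimator of this kind is given below. It assumes the decomposition of the prior information into the posterior mean of the log prior density plus the log ratio of the likelihood and posterior normalizing constants, with that ratio estimated by the posterior mean of the reciprocal prior density. The Beta-Binomial setting (a hypothetical Beta(2, 2) prior with 7 successes and 3 failures), the sample sizes, and all function names are illustrative assumptions; the reference value comes from simple midpoint quadrature rather than from the paper.

```python
import math
import random

def log_beta_pdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    return ((a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)
            - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))

def mc_prior_information(posterior_draws, log_prior, m):
    """Estimate KL(posterior || normalized likelihood) from posterior draws."""
    log_p = [log_prior(t) for t in posterior_draws]
    mean_log_prior = sum(log_p) / len(log_p)
    # The first m draws estimate the ratio of normalizing constants, E[1 / prior].
    inv_prior_mean = sum(math.exp(-v) for v in log_p[:m]) / m
    return mean_log_prior + math.log(inv_prior_mean)

def quadrature_prior_information(a_post, b_post, a_lik, b_lik, n_grid=20000):
    """Midpoint-rule reference value for the same KL divergence."""
    total = 0.0
    for i in range(n_grid):
        x = (i + 0.5) / n_grid
        lp = log_beta_pdf(x, a_post, b_post)
        total += math.exp(lp) * (lp - log_beta_pdf(x, a_lik, b_lik)) / n_grid
    return total

rng = random.Random(0)                      # fixed seed for reproducibility
successes, failures = 7, 3
a0, b0 = 2.0, 2.0
draws = [rng.betavariate(a0 + successes, b0 + failures) for _ in range(50000)]
estimate = mc_prior_information(draws, lambda t: log_beta_pdf(t, a0, b0), m=25000)
truth = quadrature_prior_information(a0 + successes, b0 + failures,
                                     successes + 1.0, failures + 1.0)
print(estimate, truth)
```

The reciprocal prior density can have heavy tails under the posterior, so in less benign settings the ratio term may dominate the Monte Carlo error; this is one motivation for the more efficient ratio estimators mentioned above.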
7.2 Conditions for When Likelihood Information is Well Defined.
Here is a set of sufficient (but not necessary) conditions under which the likelihood information is well defined, with the consequence that the likelihood information is bounded but possibly negative. Precisely:
Without essential loss of generality, assume the prior and posterior are continuous, the posterior is integrable, the prior and (unnormalized) likelihood are bounded and the Shannon differential entropy of the posterior exists. Then the likelihood information is bounded (and possibly negative).
Proof of Lemma:
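Let $M_1$ be an upper bound for $p(\theta)$ and $M_2$ be an upper bound for $p(y \mid \theta)$, and let $c = \int p(\theta)\,p(y \mid \theta)\,d\theta$. An upper bound for the likelihood information is derived as follows:
$$\int p(\theta \mid y)\log\frac{p(\theta \mid y)}{p(\theta)}\,d\theta = \int p(\theta \mid y)\log\frac{p(y \mid \theta)}{c}\,d\theta \le \int p(\theta \mid y)\log\frac{M_2}{c}\,d\theta,$$
which is bounded above by $\log M_2 - \log c$. This exists because the normalizing constant, $c$, exists since the posterior is assumed to be integrable. To derive a lower bound, since $p(\theta) \le M_1$, $\int p(\theta \mid y)\log\frac{p(\theta \mid y)}{p(\theta)}\,d\theta \ge \int p(\theta \mid y)\log p(\theta \mid y)\,d\theta - \log M_1$, where the first term is the negative of the Shannon differential entropy of the posterior, which is assumed to exist.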
7.3 An Observation Connecting the Exponential Family (EF) of Distributions to the Natural Exponential Family (NEF) of Distributions
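Let $p(y \mid \theta) = h(y)\exp\{\eta(\theta)^{T}T(y) - A(\theta)\}$ be a distribution in the exponential family, where $\theta \in \Theta$. The likelihood function is this density as a function of the underlying parameters, that is, $L(\theta \mid y) \propto \exp\{\eta(\theta)^{T}T(y) - A(\theta)\}$.
Assuming $\int_\Theta L(\theta \mid y)\,d\theta < \infty$, one can form a probability distribution $L_\theta(\theta \mid y)$. Then $L_\theta(\theta \mid y)$ is proportional to $\exp\{T(y)^{T}\eta(\theta) - A(\theta)\}$, which is in the NEF with parameter $T(y)$.
Hence when the data generating process of the model is of the exponential family form above, there may be a closed form expression for $L_\theta(\theta \mid y)$ and consequently the prior information. Moreover, on a loosely related note, this could be useful for sampling from posterior distributions where the underlying form of the data generating process is in the exponential family, since a reasonable trial distribution for importance or accept-reject sampling is $L_\theta(\theta \mid y)$; posterior asymptotics imply that the posterior and $L_\theta(\theta \mid y)$ get closer to each other as more data is collected. There may be a regime where the number of data samples is not sufficient to approximate the posterior with $L_\theta(\theta \mid y)$ directly, but $L_\theta(\theta \mid y)$ is a good trial distribution for importance or accept-reject sampling.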
Berger, J.O., Bernardo, J.M, and Sun, D. (2009) The formal definition of reference priors. The Annals of Statistics 37, 2, 905-938.
Bernardo, J.M. (1979) Reference Posterior Distributions for Bayesian Inference. JRSS B.
Cover, T. and Thomas, J. (1991) Elements of Information Theory. Wiley-Interscience, New York, NY, USA.
Clarke, B. (1996) Implications of Reference Priors for Prior Information and for Sample Size Journal of the American Statistical Association Vol. 91, No. 433.
Clarke, B. (1999) Asymptotic normality of the posterior in relative entropy. Information Theory, IEEE Transactions on (Volume:45 , Issue: 1 ).
Evans, M. and Jang, G.(2011) Weak Informativity and the Information in One Prior Relative to Another. Statistical Science. Volume 26, Number 3, 423-439.
Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y-S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics 2, 4, 1360-1383.
Gelman et al. (2013) Bayesian Data Analysis.
Jeffreys, H. (1946) An Invariant Form for the Prior Probability in Estimation Problems. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, Vol. 186, No. 1007 (Sep. 24, 1946), pp. 453-461
Kass, R. and Wasserman, L. (1996) The Selection of Prior Distributions by Formal Rules. Journal of the American Statistical Association, 91, 435.
Kullback, S. (1959) Information Theory and Statistics.
Lasso2 R package https://cran.r-project.org/web/packages/lasso2/index.html
Lindley, D.V. (1956) On a Measure of the Information Provided by an Experiment. The Annals of Mathematical Statistics.
Martin, R. and Liu, C. (2012) Inferential models: A framework for prior-free posterior probabilistic inference. arXiv:1206.4091.
Meng, X.L. and Wong, W.H. (1996) Simulating Ratios of Normalizing Constants via a Simple Identity: A Theoretical Exploration, Statistica Sinica 6, 831-860.
Murray, I., Adams, R.P., MacKay, D.J.C. (2010) Elliptical Slice Sampling. JMLR
Penny, W.D.. Kullback-Liebler Divergences of Normal, Gamma, Dirichlet and Wishart Densities. (2001) Technical report, Wellcome Department of Cognitive Neurology.
Reimherr, M., Meng, X-L., Nicolae, D.L. (2014) Being an informed Bayesian: Assessing prior informativeness and prior-likelihood conflict. arXiv:1406.5958.
UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/
van der Vaart, A. W. (2000) Asymptotic Statistics
Tibshirani, R. (1996), “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Ser. B, 58, 267–288.
Yang, R. and Berger, J. (1996) A Catalog of Noninformative Priors.
Yuan, A. and Clarke, B. (1999) A minimally informative likelihood for decision analysis: Illustration and robustness. Canadian Journal of Statistics, Volume 27, Issue 3, pages 649-665.