Reference Bayesian analysis for hierarchical models

04/25/2019 ∙ by Thaís C. O. Fonseca, et al. ∙ UFRJ

This paper proposes an alternative approach for constructing invariant Jeffreys prior distributions tailored for hierarchical or multilevel models. In particular, our proposal is based on a flexible decomposition of the Fisher information for hierarchical models which avoids the marginalization step of the likelihood of model parameters. The Fisher information matrix for the hierarchical model is derived from the Hessian of the Kullback-Leibler (KL) divergence for the model in a neighborhood of the parameter value of interest. Properties of the KL divergence are used to prove the proposed decomposition. Our proposal takes advantage of the hierarchy and leads to an alternative way of computing Jeffreys priors for the hyperparameters and an upper bound for the prior information. While the Jeffreys prior gives the minimum information about parameters, the proposed bound gives an upper limit for the information put in any prior distribution. A prior with information above that limit may be considered too informative. From a practical point of view, the proposed prior may be evaluated computationally as part of an MCMC algorithm. This property might be essential for modelling setups with many levels in which analytic marginalization is not feasible. We illustrate the usefulness of our proposal with examples in mixture models, in model selection priors such as the lasso and in the Student-t model.


1 Introduction

There has been an increasing interest over the last decade in the construction of flexible and computationally efficient models such as neural networks, clustering, multilevel, spatial random effects and mixture models. These models are set up in a hierarchical fashion which naturally describes several real data characteristics. The hierarchy allows for flexibility, and complexity is taken into account by adding extra levels to the model. In this context, subjective prior elicitation for the hyperparameters is not always trivial, as parameters are very often defined in low levels of the hierarchy and lack practical interpretation. Frequently, the parameters are calibrated or estimated using empirical Bayes approaches. Recent work has focused attention on the specification of hyperpriors for parameters in low levels of the hierarchy, such as the penalizing complexity priors of Simpson et al. (2017), which are weakly informative and penalize parameter values far from the base model specification. In the context of full Bayesian analysis, objective prior specifications may be considered instead of calibration, empirical Bayes or weakly informative priors.

Consider the vector of observed responses and a probabilistic parametric model. Direct inference about using the integrated likelihood is not always trivial, and levels of hierarchy are often introduced in the modelling to allow for feasible inferences regarding due to conditional independence given the latent variables. In what follows, we assume that the mechanism which has generated the data is a data augmentation process: given , a value of is selected from and, given , a value of is selected from . The model for depends on the application and is often chosen to allow easier inferences about or for easier interpretability. In this setup the model has two elements: the extended likelihood and the model for the latent variable . The marginal likelihood for given the data is obtained by integration

(1)

with the cumulative distribution associated with . Notice, however, that the integration step is not usually desirable since, in flexible hierarchical modelling, interest generally lies in making inference for the hidden effects as well as for the parameters . The introduction of latent variables in the inferential problem brings great benefits, as very often the complete conditional distributions of have nice explicit forms. In an iterative algorithm such as the one proposed by Tanner and Wong (1987), a sample is obtained from and a sample from is obtained conditional on the sampled value of . Both complete conditional distributions are often easy to sample from.
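As a concrete illustration of this data augmentation scheme, the sketch below implements a two-block Gibbs sampler for a toy two-level Gaussian model; the model, variable names and fixed variances are illustrative choices and are not taken from this paper.

```python
import numpy as np

# Data augmentation / Gibbs sketch in the spirit of Tanner and Wong (1987) for a
# toy model: y_i | x_i ~ N(x_i, sigma2), x_i | theta ~ N(theta, tau2), flat prior
# on theta, with sigma2 and tau2 known. All names and values are illustrative.
rng = np.random.default_rng(0)
sigma2, tau2 = 1.0, 2.0
y = rng.normal(loc=1.5, scale=np.sqrt(sigma2 + tau2), size=50)  # synthetic data

theta, draws = 0.0, []
for _ in range(5000):
    # Complete conditional of the latent effects: x_i | y_i, theta is Gaussian.
    v = 1.0 / (1.0 / sigma2 + 1.0 / tau2)
    x = rng.normal(v * (y / sigma2 + theta / tau2), np.sqrt(v))
    # Complete conditional of the parameter: theta | x is Gaussian under a flat prior.
    theta = rng.normal(x.mean(), np.sqrt(tau2 / len(x)))
    draws.append(theta)

print("posterior mean of theta:", np.mean(draws[1000:]))
```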

Another important aspect of the hierarchical approach is that the model is usually a flexible version of a base model, as discussed in Simpson et al. (2017). This naturally leads to ill-behaved likelihoods, as the lower levels in the hierarchy may converge to a constant and the model converges to the base model. This may happen with quite high probability if the sample size is not too large. This bad behaviour of the likelihood function is not corrected by reparametrization, and the use of informative priors will mean that the inference made a priori and a posteriori will be approximately the same. Examples with this characteristic are the Student-t model (Fernandez and Steel, 1999; Fonseca et al., 2008), the skewed normal distribution (Liseo and Loperfido, 2006), mixture models (Bernardo and Giron, 1988; Grazian and Robert, 2018) and the hyperbolic model (Fonseca et al., 2012). In these cases, the use of Jeffreys priors corrects the ill-behaved likelihood and provides better frequentist properties than the ones achieved by using informative priors.

The pioneering attempts to provide default inferences were based on setting uniform priors for unknown parameters (Bayes, 1763). However, these proposals are not invariant to transformations. Jeffreys (1946) introduced invariant priors based on the Riemannian geometry of the statistical model. In particular, the divergences considered for prior derivation were the squared Hellinger and the Kullback-Leibler distances. Both divergences behave locally as the Fisher information. More recently, George and McCulloch (1993) considered a large class of invariant priors based on discrepancy measures. The motivation of these authors was to think in terms of the parametric family instead of the parameter domain . Thus, the prior weight assigned to a neighbouring value of , say , depends directly on the discrepancy between the parametric family members and .

A way to measure information in an inferential problem is through discrepancy functions such as Fisher information. In the context of hierarchical models, it is natural that the Fisher information about obtained from the complete data problem must be larger than the one obtained from the incomplete data problem . Indeed, if and were both known it would be easier to make inference about .

In this work, we present a Fisher information decomposition for a general hierarchical problem which allows for computing the information about in the incomplete model and in the extended model while avoiding the marginalization step. From the proposed decomposition we obtain an alternative way to compute the Jeffreys prior for and an upper bound for this prior. The Jeffreys rule prior gives the minimum prior information for inference about . On the other hand, any prior which gives more information than the upper bound may be too informative.

Section 2 proposes the Fisher information decomposition, the upper bound for the Jeffreys prior of and an alternative way to compute the usual Jeffreys rule prior for a hierarchical model. Some hierarchical models often explored in the current literature are presented in the following sections. In Section 3 the discrete mixture model (Bernardo and Giron, 1988) is presented and the Jeffreys prior is obtained without the usual marginalization step. An alternative proof of integrability of the resulting Jeffreys prior is also presented. Section 4 considers the two-level hierarchical Gaussian model as a prototypical random effects example. Section 5 discusses the Student-t model with unknown degrees of freedom; the model is written as a Gaussian mixture model and the Jeffreys rule prior is obtained directly for the hierarchical model. Section 6 presents the Jeffreys prior for the Lasso parameter in a regression model, again avoiding the marginalization step. Section 7 illustrates the proposal for the hyperbolic model, in which all levels of the hierarchy depend on the parameter of interest.

2 Fisher matrix decomposition and proposed reference prior

We recall that the Fisher information matrix for a probabilistic parametric model is defined as the expected value of the observed information matrix,

(2)

It can also be seen as the variance of the score function.
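As a quick illustration of this identity, the sketch below uses a toy exponential model of our own choosing (not one of the examples in this paper), for which the score and the Fisher information are available in closed form.

```python
import numpy as np

# Monte Carlo check that the Fisher information equals the variance of the score.
# Toy example: y ~ Exponential(rate=theta), for which the score is
# d/dtheta log f(y | theta) = 1/theta - y and I(theta) = 1/theta^2.
rng = np.random.default_rng(1)
theta = 2.0
y = rng.exponential(scale=1.0 / theta, size=200_000)
score = 1.0 / theta - y

print("Monte Carlo estimate:", score.var())   # approximately 0.25
print("exact I(theta):      ", 1.0 / theta ** 2)
```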

In many situations, obtaining analytical or even numerical expressions for the Fisher information matrix is prohibitive. Moreover, the use of Monte Carlo methods to approximate (2) can yield high variance. So, in order to reduce variance, we can appeal to the Rao-Blackwell theorem, carrying out as much analytical computation as possible. Based on this, we propose a Rao-Blackwell-type theorem to be used as an alternative method to compute the Fisher information matrix on hierarchical structures of the model (1). For the next result, we consider

  1. the Fisher information matrix of the extended likelihood ;

  2. the Fisher information matrix of the latent variable ;

  3. the Fisher information matrix of the complete model ;

  4. the Fisher information matrix of the complete conditional .

We obtain the following

Theorem 2.1.

The Fisher information matrices of the hierarchical model (1) can be decomposed as

(3)

For the sake of simplicity, in the expression above, denotes and denotes . If is not easily computed then we may take advantage of the hierarchy, which leads to .
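To fix ideas, the sketch below checks a decomposition of this type numerically, reading the identity as "marginal information = complete-model information minus the expected information of the complete conditional". The toy Poisson-Gamma hierarchy, the parameter values and the finite-difference check are illustrative choices and are not among the examples treated in this paper.

```python
import numpy as np
from scipy.stats import nbinom

# Toy hierarchy: y | x ~ Poisson(x), x | theta ~ Gamma(shape=alpha, rate=theta).
# The marginal of y is negative binomial, so the decomposition can be verified.
rng = np.random.default_rng(5)
alpha, theta, n_mc = 3.0, 2.0, 200_000

# Simulate (x, y) from the hierarchical model.
x = rng.gamma(shape=alpha, scale=1.0 / theta, size=n_mc)
y = rng.poisson(x)

# Complete-model information: only the Gamma(alpha, theta) level depends on theta.
info_complete = alpha / theta ** 2

# Complete conditional x | y, theta ~ Gamma(alpha + y, theta + 1); its information
# about theta is (alpha + y) / (theta + 1)^2, averaged over the marginal of y.
info_cond = np.mean((alpha + y) / (theta + 1) ** 2)

info_decomposition = info_complete - info_cond

# Check against the marginal negative binomial model through the variance of a
# finite-difference score in theta.
eps = 1e-5
p_hi, p_lo = (theta + eps) / (1 + theta + eps), (theta - eps) / (1 + theta - eps)
score = (nbinom.logpmf(y, alpha, p_hi) - nbinom.logpmf(y, alpha, p_lo)) / (2 * eps)

print(info_decomposition, score.var())   # both close to alpha / (theta^2 (theta + 1)) = 0.25
```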

To prove Theorem 2.1, we observe that the Fisher information matrix can be derived from the Kullback-Leibler (KL) divergence. Namely, for any parametric model it holds that

(4)

In other words, the Fisher information matrix can be obtained as the Hessian of the KL divergence

(5)

The proposition below is the main ingredient to derive (3).

Proposition 2.1.

Consider the probability distributions and , where are random vectors. The Kullback-Leibler divergence satisfies:

(6)

Proposition 2.1 follows by a direct computation. In fact, the Kullback-Leibler divergence of and is given by

(7)

The first term of the third equality holds since . The second equality in (2.1) holds similarly. ∎

Now, we will prove Theorem 2.1. For the hierarchical model (1), we consider the likelihood of given the joint random vector and the complete conditional posterior distribution . Applying Proposition 2.1, we obtain

Taking the Hessian of the above equalities at , we obtain (3). ∎

We can use Theorem 2.1 to compute the Fisher information matrices of high-level hierarchical models:

(8)

We write and , for . Applying Theorem 2.1 recursively, it holds that

As a particular case, consider the hierarchical model:

(9)

Since does not depend on , it holds that . Applying Theorem 2.1, we obtain . Thus, the following holds.

Corollary 2.1.

For the hierarchical model given by and , the Fisher information matrices satisfy

(10)

In particular, if we assume the Jeffreys’ prior of the model is proper then the Jeffreys’ prior of the marginal likelihood distribution is also proper.

It is worth emphasizing that both parameters and can be multivariate. In this case, we can also use Theorem 2.1 to give an upper bound for the Jeffreys rule prior in terms of the hierarchy. Namely, we prove the following:

Theorem 2.2.

Under the hypotheses of Theorem 2.1, the usual Jeffreys rule priors satisfy the following inequalities. If the parameter is multivariate (dimension ), then

(11)
Proof.

Assume the dimension of the parameter space is . The Minkowski determinant theorem (see, for instance, Theorem 4.1.8, p. 115, of Marcus and Minc (1964)) states that, for any symmetric positive semi-definite matrices and ,

(12)

Fisher information matrices are symmetric and positive semi-definite. By (3) and (12),

(13)

And, since ,

Theorem 2.2 is proved. ∎
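The determinant inequality used in the proof is easy to check numerically; the snippet below verifies it for randomly generated positive semi-definite matrices (illustrative only).

```python
import numpy as np

# Minkowski determinant inequality: for symmetric positive semi-definite A, B of
# order n, det(A + B)^(1/n) >= det(A)^(1/n) + det(B)^(1/n).
rng = np.random.default_rng(6)
n = 4
M1, M2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
A, B = M1 @ M1.T, M2 @ M2.T      # random positive semi-definite matrices

lhs = np.linalg.det(A + B) ** (1 / n)
rhs = np.linalg.det(A) ** (1 / n) + np.linalg.det(B) ** (1 / n)
print(lhs >= rhs, lhs, rhs)
```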

The examples below show, in several different situations, how to conveniently decompose the model in order to compute the Fisher information matrix of hierarchical models.

3 Mixture model with known components

Mixture models provide a useful representation in several applications, such as gene expression analysis and classification. A simple mixture model with known components (Bernardo and Giron, 1988) may be written as

(14)

with a simplex defining the weights of the mixture and , completely specified density functions with known parameters. We set assuming values in with categorical distribution , with . For each , we fix a probability distribution . The two-level hierarchical model and yields the mixture of probabilities . The Fisher information matrix of the mixture model is

(15)

with . It is not a simple task to show the propriety of the Jeffreys prior derived from (15). In fact, this result was obtained by Bernardo and Giron (1988) and Grazian and Robert (2018). Corollary 2.1 gives an alternative proof of this result, as we discuss below.

The Fisher information matrix based on the model denoted by , with , is given by

where is the Kronecker delta. It is well known that the Jeffreys’ prior of the categorical model is proper with normalizing constant , where denotes the round sphere in of radius . By Corollary 2.1, the Jeffreys’ prior of the incomplete model based on is also proper.

In the simpler model , resulting in . First, we take advantage of the hierarchy to obtain a conjugate posterior for . The complete conditional posterior distribution of is

The Fisher information of about is given by

The expectation of with respect to the model is obtained computationally within the estimation algorithm.

(16)

The resulting prior must be integrable since is also integrable. This result is immediate from Corollary 2.1.
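For a concrete check of the decomposition in the mixture setting, the sketch below considers a two-component mixture with known Gaussian components N(0,1) and N(2,1) and unknown weight (our own illustrative choice of components). It estimates the Fisher information of the weight both directly from the marginal mixture and through the decomposition, using the Bernoulli complete conditional of the latent allocation.

```python
import numpy as np
from scipy.stats import norm

# Two-component mixture with known components f1 = N(0, 1), f2 = N(2, 1) and
# unknown weight w; the latent allocation z is Bernoulli(w).
rng = np.random.default_rng(2)
w, n = 0.3, 200_000
z = rng.random(n) < w
y = np.where(z, rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n))

f1, f2 = norm.pdf(y, 0.0, 1.0), norm.pdf(y, 2.0, 1.0)
m = w * f1 + (1 - w) * f2                     # marginal mixture density at y

# Direct Monte Carlo estimate of the marginal information E[((f1 - f2) / m)^2].
info_direct = np.mean(((f1 - f2) / m) ** 2)

# Decomposition: categorical (Bernoulli) information minus the expected
# information of the complete conditional z | y, w ~ Bernoulli(p), p = w f1 / m.
p = w * f1 / m
dp_dw = f1 * f2 / m ** 2
info_decomposition = 1.0 / (w * (1 - w)) - np.mean(dp_dw ** 2 / (p * (1 - p)))

print(info_direct, info_decomposition)        # should agree up to Monte Carlo error
```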

4 Two level Hierarchical Gaussian model

The usual random effects model depends strongly on the estimation of the unknown variances at each level of the hierarchy. The signal-to-noise ratio has been studied in several areas of application. In what follows we consider the two-level hierarchical random effects model in its simplest version as a prototypical example.

with , and . Assume that is known. In this setting , and integrating out yields the marginal model . This model may be seen as the simplest version of several hierarchical models which depend on random effects estimation. It is worth noting that in this example , so the second term on the left-hand side of the information identity is null.

The prior and posterior densities of are both Gaussian distributions with the same general form

with and in the prior distribution and and in the posterior distribution, where, without loss of generality, we suppose that and . The log conditional density is

Differentiating twice and taking expectation with respect to the distribution of results in

In the prior distribution and , and it follows that . In the posterior distribution , and . Taking expectation with respect to the marginal data distribution leads to

(17)

Finally, applying the proposed Fisher decomposition

(18)

In this example, the marginal model is easily obtained, and the usual differentiation of the log-likelihood leads to the same result obtained in (18). However, for more general random effects models marginalization might not be feasible, and this procedure would still apply to obtain the required Fisher information.
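A short numerical check of this agreement, for a toy version of the model above with observation-level variance sigma2 and random-effect variance tau2 (our own notation; the parameter of interest is taken to be the common mean, an illustrative assumption):

```python
# Both pieces of the decomposition are available in closed form here, so the
# identity can be verified directly (illustrative values).
sigma2, tau2 = 1.0, 2.0

info_complete = 1.0 / tau2                    # only the random-effect level depends on the mean
v = 1.0 / (1.0 / sigma2 + 1.0 / tau2)         # variance of the Gaussian complete conditional
info_cond = (v / tau2) ** 2 / v               # its information about the mean (free of the data)

print(info_complete - info_cond, 1.0 / (sigma2 + tau2))   # both equal 1 / (sigma2 + tau2)
```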

5 Student-t model with unknown degrees of freedom

Let be the standard Student-t model with fixed mean and precision and unknown degrees of freedom . This model robustifies the Gaussian model by allowing for fatter tails. Several authors have dealt with reference prior specification for the tail parameter (Fonseca et al., 2008; Villa and Walker, 2014; Simpson et al., 2017). This model may be rewritten in a hierarchical setting as

This model has two levels of hierarchy with appearing in the second level only, as in (9). In this case, by Corollary 2.1, the Fisher decomposition simplifies to

(19)

The first term is obtained from the latent variable prior distribution , which is a gamma distribution; differentiating twice and taking expectation with respect to leads to

(20)

The second part in (19) is computed based on the complete conditional posterior distribution of the latent variable, which is . Differentiating twice and taking expectation with respect to the distribution of results in

Now we take expectations with respect to the model . For this model the result is analytic and given by

(21)

Equations (20) and (21) together result in the usual Jeffreys rule prior for the degrees of freedom of the Student-t model. The Jeffreys prior is bounded above by the function , which is not integrable. However, as . Thus, the Jeffreys prior for the degrees of freedom of the Student-t model is proper.
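The calculation above can also be reproduced numerically. In the sketch below (our own worked example, assuming the latent precision has a Gamma(nu/2, nu/2) prior in the rate parameterization), the two terms of the simplified decomposition are evaluated, the expectation over the Student-t marginal is done by Monte Carlo, and the result is checked against a finite-difference estimate of the marginal information about nu.

```python
import numpy as np
from scipy.stats import t
from scipy.special import polygamma

# Hierarchical form assumed here: x | lam ~ N(0, 1/lam), lam ~ Gamma(nu/2, rate=nu/2).
rng = np.random.default_rng(3)
nu = 5.0
x = t.rvs(df=nu, size=400_000, random_state=rng)

# Information about nu from the latent prior Gamma(nu/2, nu/2).
info_prior = 0.25 * polygamma(1, nu / 2) - 1.0 / (2 * nu)

# Information of the complete conditional lam | x ~ Gamma((nu+1)/2, (nu+x^2)/2),
# averaged over the marginal Student-t distribution of x by Monte Carlo.
a, b = (nu + 1) / 2, (nu + x ** 2) / 2
info_cond = np.mean(0.25 * polygamma(1, a) + a / (4 * b ** 2) - 1.0 / (2 * b))

info_decomposition = info_prior - info_cond

# Sanity check: variance of the marginal score, with the score obtained by a
# central finite difference of the Student-t log-density in nu.
eps = 1e-4
score = (t.logpdf(x, df=nu + eps) - t.logpdf(x, df=nu - eps)) / (2 * eps)
print(info_decomposition, score.var())   # should agree up to Monte Carlo error
```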

6 Hierarchical priors for variable selection in regression analysis

This example illustrates the use of hierarchical models for variable selection in regression analysis and a convenient use of our proposed Fisher decomposition. Consider the usual linear regression model for predictor variables and responses

(22)

where is the vector of responses, is the matrix of covariates, and is the vector of regression coefficients. Consider the problem of variable selection in this context; that is, if is large it is desirable to find some 's equal or close to zero. Tibshirani (1996) proposed a method called the lasso (least absolute shrinkage and selection operator), which is able to produce coefficients exactly equal to zero in regression models. The lasso estimate is defined by

(23)

The parameter , the Lasso parameter, controls the amount of shrinkage applied to the estimates. Let be the full least squares estimates and let . Values of will cause shrinkage of the solutions towards zero. This method aims to improve prediction accuracy and interpretability, as we may focus only on the strongest effects. However, estimation of is not an easy task; Tibshirani (1996) comments that, since the lasso estimate is a non-linear and non-differentiable function of the response, it is difficult to obtain an accurate estimate of its standard error.

The lasso constraint is equivalent to the addition of a penalty term to the residual sum of squares, that is, we should minimize

(24)

In the Bayesian context, this is equivalent to the use of independent Laplace priors for the regression coefficients (Park and Casella, 2008), that is,

(25)

We may derive the Lasso estimate as the posterior mode under the prior (25) for the 's. The choice of the Bayesian Lasso parameter is also not trivial, and several methods have been proposed to estimate this parameter, such as cross-validation and empirical Bayes; however, these methods are often unstable. It is pointed out by Park and Casella (2008) that the standard errors obtained for the lasso parameter are not fully satisfactory.

Mallick and Yi (2014) rewrite the Laplace prior for in a convenient hierarchical way which allows for analytical full conditional distributions of the latent variables and model parameters. The Lasso prior is obtained as a uniform scale mixture by considering the conditional setting

(26)
(27)

In what follows we obtain the Jeffreys prior for based on the Fisher decomposition in (9), which simplifies to

(28)

with . The Fisher information for the bottom level in the hierarchy is easily obtained as . In order to obtain , we take advantage of the known complete conditional distribution of , which is presented in Mallick and Yi (2014) as

(29)
(30)

with the ordinary least squares estimator of in the usual regression model (22). The distribution of does not depend on . Thus, depends on the distribution of only and is trivially obtained as . The final Jeffreys prior has a closed form given by

(31)

Notice that this example also illustrates the potential usefulness of our proposed decomposition in other shrinkage regression models, such as ridge regression, the elastic net and the horseshoe model.
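As a small self-contained check of the logic of this section, consider one standard uniform scale-mixture representation of the Laplace prior, beta | u ~ Uniform(-u, u) with u ~ Gamma(shape 2, rate lambda); the exact parameterization used by Mallick and Yi (2014) may differ. Treating beta as the "data" and u as the latent variable, the decomposition reproduces the Fisher information of the marginal Laplace prior about lambda:

```python
import numpy as np

lam = 1.5

# Joint (complete) information about lam: only the Gamma(2, lam) level depends on lam.
info_complete = 2.0 / lam ** 2

# Complete conditional u | beta, lam is a shifted exponential |beta| + Exp(lam),
# whose information about lam is 1 / lam^2 regardless of beta.
info_cond = 1.0 / lam ** 2

# Decomposition versus the direct Laplace(lam) information; both equal 1 / lam^2.
print(info_complete - info_cond, 1.0 / lam ** 2)

# Simulation check that this mixture really marginalizes to a Laplace(lam) prior.
rng = np.random.default_rng(4)
u = rng.gamma(shape=2.0, scale=1.0 / lam, size=500_000)
beta = rng.uniform(-u, u)
print(np.mean(np.abs(beta)), 1.0 / lam)   # Laplace(lam) has E|beta| = 1/lam
```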

7 Hyperbolic model

This example illustrates the use of our proposed decomposition when all levels of the hierarchy depend on the parameter of interest. In particular, we present results for the Hyperbolic model, which is an extension of the Gaussian model allowing for asymmetric behaviour. The Jeffreys rule prior for this model has been proposed by Fonseca et al. (2012); however, the prior is only obtained computationally and propriety of the resulting proposal is not proved.

A random variable is said to have a Hyperbolic distribution with parameters , , and when its density is given by

(32)

with and the modified Bessel function of the third kind with index 1. This model can be written alternatively as a mixture of a Gaussian and a GIG distribution (if then its density is given by , with ; Jørgensen (1982) presents details about this class), that is,

(33)

with the Gaussian distribution with mean and variance and the Generalized Inverse Gaussian distribution (GIG) with parameters 1, and .

In particular, consider the standard Hyperbolic model (32) for , and which corresponds to an asymmetric density function for . Let . This model may be rewritten as

Now we illustrate how to compute the Jeffreys prior using the proposed decomposition when both levels of hierarchy depend on . By Theorem 2.1, the Fisher decomposition is given by

(34)

In this case the latent variable has a GIG prior distribution and differentiating and taking expectations with respect to leads to

(35)

with , and . Notice that in this example the first level of the hierarchy also depends on ; thus, to obtain the upper bound for the Jeffreys prior, we need to compute . In the first level, has a Gaussian distribution, and differentiating and taking expectations with respect to leads to

(36)

The resulting upper bound for the Jeffreys rule prior is given by the function

This prior is not integrable, and the propriety of the prior distribution must be investigated using the resulting Jeffreys prior. The second term on the right-hand side of (34) is computed based on the complete conditional posterior distribution of the latent variable, which in this model is also a GIG distribution given by

Differentiating twice and taking expectation with respect to the distribution of results in

Now we take expectations with respect to the model to obtain the Jeffreys rule prior for the asymmetry parameter in the Hyperbolic model .

(37)

The resulting prior has a term that must be computed numerically. Notice that, instead of computing directly by numerical integration, part of the expression is obtained analytically, which leads to a reduction in variance.
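Expectations of a GIG latent variable and of its reciprocal, which appear in calculations of this kind, are available analytically as ratios of modified Bessel functions. The snippet below (with arbitrary illustrative parameter values, not those of the hyperbolic example above) checks these Bessel-ratio identities against numerical moments from scipy:

```python
import numpy as np
from scipy.stats import geninvgauss
from scipy.special import kv

# GIG(p, chi, psi) with density proportional to w^(p-1) exp(-(psi*w + chi/w)/2).
p, chi, psi = 0.5, 2.0, 1.5
omega, scale = np.sqrt(chi * psi), np.sqrt(chi / psi)

# Bessel-ratio formulas: E[w^a] = (chi/psi)^(a/2) K_{p+a}(omega) / K_p(omega).
mean_bessel = scale * kv(p + 1, omega) / kv(p, omega)
inv_mean_bessel = (1.0 / scale) * kv(p - 1, omega) / kv(p, omega)

# Same moments from scipy's geninvgauss, whose density is w^(p-1) exp(-b(w + 1/w)/2)
# with b = omega, after rescaling by sqrt(chi/psi).
mean_scipy = geninvgauss.mean(p, omega, scale=scale)
inv_mean_scipy = geninvgauss.expect(lambda w: 1.0 / w, args=(p, omega), scale=scale)

print(mean_bessel, mean_scipy)
print(inv_mean_bessel, inv_mean_scipy)
```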

8 Conclusions and further developments

This paper presented a convenient Fisher decomposition which facilitates the computation of Jeffreys priors even when marginalization of the model is not available analytically. We have illustrated our proposal with well-known examples from the literature which either present ill-behaved likelihoods or have hyperparameters for which it is difficult to set informative priors. Many other hierarchical models could be explored, and properties of the resulting priors could be investigated using the proposed Fisher decomposition. In particular, variable selection models with mixing distributions other than the Lasso prior could be studied, and propriety of the resulting Jeffreys priors could be proved from our theoretical results.

References

  • Bayes (1763) Bayes, T. R. (1763). “An Essay Towards Solving a Problem in the Doctrine of Chances.” Philosophical Transactions of the Royal Society, 53, 370–418. Reprinted in Biometrika, 45, 243–315, 1958.
  • Bernardo and Giron (1988) Bernardo, J. M. and Giron, F. J. (1988). “A Bayesian analysis of simple mixture problems.” Bayesian Statistics, 3, 67–78.
  • Fernandez and Steel (1999) Fernandez, C. and Steel, M. F. J. (1999). “Multivariate Student-t regression models: Pitfalls and inference.” Biometrika, 86, 153–167.
  • Fonseca et al. (2008) Fonseca, T. C. O., Ferreira, M. A. R., and Migon, H. S. (2008). “Objective Bayesian analysis for the Student-t regression model.” Biometrika, 95, 325–333.
  • Fonseca et al. (2012) Fonseca, T. C. O., Migon, H. S., and Ferreira, M. A. R. (2012). “Bayesian analysis based on the Jeffreys prior for the hyperbolic distribution.” Brazilian Journal of Probability and Statistics, 26, 327–343.
  • George and McCulloch (1993) George, E. I. and McCulloch, R. (1993). “On obtaining invariant prior distributions.” Journal of statistical planning and inference, 37, 169–179.
  • Grazian and Robert (2018) Grazian, C. and Robert, C. P. (2018). “Jeffreys priors for mixture estimation: Properties and alternatives.” Computational Statistics & Data Analysis, 121, 149–163.
  • Jeffreys (1946) Jeffreys, H. (1946). “An invariant form for the prior probability in estimation problems.” Proceedings of the Royal Society of London, Series A, 186, 453–461.
  • Liseo and Loperfido (2006) Liseo, B. and Loperfido, N. (2006). “Default Bayesian analysis of the skew-normal distribution.” Journal of Statistical Planning and Inference, 136, 373–389.
  • Mallick and Yi (2014) Mallick, H. and Yi, N. (2014). “A New Bayesian Lasso.” Statistics and Its Interface, 7, 571–582.
  • Marcus and Minc (1964) Marcus, M. and Minc, H. (1964). A Survey of Matrix Theory and Matrix Inequalities. Allyn and Bacon Inc., Boston.
  • Park and Casella (2008) Park, T. and Casella, G. (2008). “The Bayesian Lasso.” Journal of the American Statistical Association, 103(482), 681–686.
  • Simpson et al. (2017) Simpson, D., Rue, H., Riebler, A., Martins, T. G., and Sorbye, S. H. (2017). “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statistical Science, 32, 1–28.
  • Tanner and Wong (1987) Tanner, M. A. and Wong, W. H. (1987). “The Calculation of Posterior Distributions by Data Augmentation.” Journal of the American Statistical Association, 82, 528–540.
  • Tibshirani (1996) Tibshirani, R. (1996). “Regression shrinkage and selection via the Lasso.” Journal of the Royal Statistical Society, Series B, 58, 267–288.
  • Villa and Walker (2014) Villa, C. and Walker, S. G. (2014). “Objective Prior for the Number of Degrees of Freedom of a t Distribution.” Bayesian Analysis, 9, 197–220.