There has been increasing interest over the last decade in the construction of flexible and computationally efficient models such as neural networks, clustering, multilevel, spatial random effects and mixture models. These models are set up in a hierarchical fashion which naturally describes several characteristics of real data. The hierarchy allows for flexibility, and complexity is taken into account by adding extra levels to the model. In this context, subjective prior elicitation for the hyperparameters is not always trivial, as parameters are very often defined in low levels of the hierarchy and lack practical interpretation. Frequently, the parameters are calibrated or estimated using empirical Bayes approaches. Recent work has focused attention on the specification of hyperpriors for parameters in low levels of the hierarchy, such as the penalising complexity priors of Simpson et al. (2017), which are weakly informative and penalize parameter values far from the base model specification. In the context of fully Bayesian analysis, objective prior specifications may be considered instead of calibration, empirical Bayes or weakly informative priors.
Consider a model with data $y$, latent variables $x$ and parameters $\theta$. Direct inference about $\theta$ using the integrated likelihood is not always trivial, and levels of hierarchy are often introduced in the modelling to allow for feasible inferences regarding $\theta$ due to conditional independence given the latent variables. In what follows, we assume that the mechanism which generated the data is a data augmentation process: given $\theta$, a value of $x$ is selected from $p(x \mid \theta)$ and, given $x$, a value of $y$ is selected from $p(y \mid x, \theta)$. The model for $x$ depends on the application and is often chosen to allow easier inferences about $\theta$ or for its easier interpretability. In this setup the model has two elements: the extended likelihood $p(y \mid x, \theta)$ and the model for the latent variable $p(x \mid \theta)$. The marginal likelihood for $\theta$ given the data $y$ is obtained by the integration
$$p(y \mid \theta) = \int p(y \mid x, \theta) \, dF(x \mid \theta), \qquad (1)$$
with $F(x \mid \theta)$ the cumulative distribution associated with $p(x \mid \theta)$. Notice, however, that the integration step is not always desirable, as in flexible hierarchical modelling interest generally lies in making inference for the hidden effects $x$ as well as the parameters $\theta$. The introduction of latent variables in the inferential problem brings great benefits, as very often the complete conditional distributions of $x$ and $\theta$ have nice explicit forms. In an iterative algorithm such as the one proposed by Tanner and Wong (1987), a sample is obtained from $p(x \mid y, \theta)$ and a sample from $p(\theta \mid y, x)$ is then obtained conditional on the sampled value of $x$. Both complete conditional distributions are often easy to sample from.
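As an illustration of this two-block scheme, the sketch below is a minimal toy of our own (not from the paper), assuming the Gaussian hierarchy $y_i \mid x \sim N(x, 1)$, $x \mid \theta \sim N(\theta, 1)$ with a flat prior on $\theta$, so that both complete conditionals are Gaussian:

```python
import numpy as np

# Toy data-augmentation (Tanner-Wong style) Gibbs sampler for the hierarchy
#   y_i | x ~ N(x, 1),  x | theta ~ N(theta, 1),  flat prior on theta.
rng = np.random.default_rng(0)
n, theta_true = 50, 2.0
x_true = rng.normal(theta_true, 1.0)
y = rng.normal(x_true, 1.0, size=n)

theta, draws = 0.0, []
for it in range(5000):
    # x | y, theta: precision n + 1, mean (n*ybar + theta)/(n + 1)
    x = rng.normal((n * y.mean() + theta) / (n + 1), np.sqrt(1.0 / (n + 1)))
    # theta | x: N(x, 1) under the flat prior
    theta = rng.normal(x, 1.0)
    if it >= 500:               # discard burn-in
        draws.append(theta)

post_mean = np.mean(draws)      # should be close to ybar, the posterior mean
```

Under this toy model the marginal posterior is $\theta \mid y \sim N(\bar{y}, 1 + 1/n)$, so the Gibbs average of the $\theta$ draws should track $\bar{y}$ closely.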
Another important aspect of the hierarchical approach is that the model is usually a flexible version of a base model, as discussed in Simpson et al. (2017). This naturally leads to ill-behaved likelihoods, as the lower levels in the hierarchy may converge to a constant so that the model converges to the base model. This may happen with quite high probability if the sample size is not large. This bad behaviour of the likelihood function is not corrected by reparametrization, and the use of informative priors will mean that the inferences made a priori and a posteriori are approximately the same. Examples with this characteristic are the Student-t model (Fernandez and Steel, 1999; Fonseca et al., 2008), the skew-normal model (Liseo and Loperfido, 2006), mixture models (Bernardo and Giron, 1988; Grazian and Robert, 2018) and the hyperbolic model (Fonseca et al., 2012). In these cases, the use of Jeffreys priors corrects the ill-behaved likelihood and provides better frequentist properties than the ones achieved by using informative priors.
The pioneering attempts to provide default inferences were based on setting uniform priors for unknown parameters (Bayes, 1763). However, these proposals are not invariant to transformations. Jeffreys (1946) introduced invariant priors based on the Riemannian geometry of the statistical model. In particular, the divergences considered for prior derivation were the squared Hellinger and the Kullback-Leibler distances. Both divergences behave locally as the Fisher information. More recently, George and McCulloch (1993) considered a large class of invariant priors based on discrepancy measures. The motivation of these authors was to think in terms of the parametric family $\{p(\cdot \mid \theta)\}$ instead of the parameter domain $\Theta$. Thus, the prior weight assigned to a neighbouring value of $\theta$, say $\theta'$, depends directly on the discrepancy between the parametric family members $p(\cdot \mid \theta)$ and $p(\cdot \mid \theta')$.
A way to measure information in an inferential problem is through discrepancy functions such as the Fisher information. In the context of hierarchical models, it is natural that the Fisher information about $\theta$ obtained from the complete-data problem $p(y, x \mid \theta)$ must be larger than the one obtained from the incomplete-data problem $p(y \mid \theta)$. Indeed, if $y$ and $x$ were both known it would be easier to make inference about $\theta$.
In this work, we present a Fisher information decomposition for a general hierarchical problem which allows for computing the information about $\theta$ in the incomplete model and in the extended model while overcoming the marginalization step. From the proposed decomposition we obtain an alternative way to compute the Jeffreys prior for $\theta$ and an upper bound for this prior. The Jeffreys rule prior gives the minimum prior information for inference about $\theta$. On the other hand, any prior which gives more information than the upper bound may be too informative.
Section 2 proposes the Fisher information decomposition, the upper bound for the Jeffreys prior of $\theta$ and an alternative way to compute the usual Jeffreys rule prior for a hierarchical model. Some hierarchical models often explored in the current literature are presented in the following sections. In Section 3 the discrete mixture model (Bernardo and Giron, 1988) is presented and the Jeffreys prior is obtained without the usual marginalization step; an alternative proof of the integrability of the resulting Jeffreys prior is also presented. Section 4 considers the two-level hierarchical Gaussian model. Section 5 discusses the Student-t model with unknown degrees of freedom: the model is written as a Gaussian mixture and the Jeffreys rule prior is obtained directly for the hierarchical model. Section 6 presents the Jeffreys prior for the Lasso parameter in a regression model, again overcoming the marginalization step, and Section 7 considers the hyperbolic model.
2 Fisher matrix decomposition and proposed reference prior
We recall that the Fisher information matrix for a probabilistic parametric model $p(y \mid \theta)$ is defined by the expected matrix of the observed information,
$$\mathcal{I}(\theta) = E_{y \mid \theta}\left[-\frac{\partial^2}{\partial \theta \, \partial \theta^\top} \log p(y \mid \theta)\right]. \qquad (2)$$
It can also be seen as the variance of the score function, i.e., $\mathcal{I}(\theta) = \mathrm{Var}_{y \mid \theta}\left[\frac{\partial}{\partial \theta} \log p(y \mid \theta)\right]$.
In many situations, obtaining analytical or even numerical expressions for the Fisher information matrix is prohibitive. Moreover, the use of Monte Carlo methods to approximate (2) can yield high variance. So, in order to reduce variance, we can appeal to the Rao-Blackwell theorem by carrying out analytical computations as much as possible. Based on this, we propose a Rao-Blackwell-type theorem to be used as an alternative method to compute the Fisher information matrix in hierarchical structures of the model (1). For the next result, we consider:
$\mathcal{I}_{y \mid x}(\theta)$, the Fisher information matrix of the extended likelihood $p(y \mid x, \theta)$;
$\mathcal{I}_{x}(\theta)$, the Fisher information matrix of the latent variable model $p(x \mid \theta)$;
$\mathcal{I}_{y,x}(\theta)$, the Fisher information matrix of the complete model $p(y, x \mid \theta)$;
$\mathcal{I}_{x \mid y}(\theta)$, the Fisher information matrix of the complete conditional $p(x \mid y, \theta)$.
We obtain the following
The Fisher information matrices of the hierarchical model (1) can be decomposed as
$$\mathcal{I}_{x}(\theta) + \mathcal{I}_{y \mid x}(\theta) = \mathcal{I}_{y}(\theta) + \mathcal{I}_{x \mid y}(\theta), \qquad (3)$$
where $\mathcal{I}_{y}(\theta)$ denotes the Fisher information matrix of the marginal model $p(y \mid \theta)$.
For the sake of simplicity, in the expression above, $\mathcal{I}_{y \mid x}(\theta)$ denotes the information of $p(y \mid x, \theta)$ averaged over $p(x \mid \theta)$ and $\mathcal{I}_{x \mid y}(\theta)$ denotes the information of $p(x \mid y, \theta)$ averaged over $p(y \mid \theta)$. If $\mathcal{I}_{y,x}(\theta)$ is not easily computed then we may take advantage of the hierarchy, which leads to $\mathcal{I}_{y,x}(\theta) = \mathcal{I}_{x}(\theta) + \mathcal{I}_{y \mid x}(\theta)$.
To prove Theorem 2.1, we observe that the Fisher information matrix can be derived from the Kullback-Leibler (KL) divergence. Namely, for any parametric model $p(y \mid \theta)$ it holds that
$$KL\big(p(\cdot \mid \theta_0) \,\|\, p(\cdot \mid \theta)\big) = \tfrac{1}{2}(\theta - \theta_0)^\top \mathcal{I}(\theta_0)(\theta - \theta_0) + o(\|\theta - \theta_0\|^2).$$
In other words, the Fisher information matrix can be obtained as the Hessian of the KL divergence,
$$\mathcal{I}(\theta_0) = \left.\frac{\partial^2}{\partial \theta \, \partial \theta^\top} KL\big(p(\cdot \mid \theta_0) \,\|\, p(\cdot \mid \theta)\big)\right|_{\theta = \theta_0}.$$
The proposition below is the main ingredient to derive (3).
Consider the probability distributions $p(y, x \mid \theta)$ and $p(y, x \mid \theta_0)$, where $y$ and $x$ are random vectors. The Kullback-Leibler divergence satisfies
$$KL\big(p(y, x \mid \theta_0) \,\|\, p(y, x \mid \theta)\big) = KL\big(p(y \mid \theta_0) \,\|\, p(y \mid \theta)\big) + E_{y \mid \theta_0}\Big[KL\big(p(x \mid y, \theta_0) \,\|\, p(x \mid y, \theta)\big)\Big],$$
and analogously with the roles of $y$ and $x$ interchanged.
Proposition 2.1 follows by a direct computation. In fact, the Kullback-Leibler divergence of $p(y, x \mid \theta_0)$ and $p(y, x \mid \theta)$ is given by
$$KL\big(p(y, x \mid \theta_0) \,\|\, p(y, x \mid \theta)\big) = E_{y, x \mid \theta_0}\left[\log \frac{p(y \mid \theta_0)\, p(x \mid y, \theta_0)}{p(y \mid \theta)\, p(x \mid y, \theta)}\right] = KL\big(p(y \mid \theta_0) \,\|\, p(y \mid \theta)\big) + E_{y \mid \theta_0}\Big[KL\big(p(x \mid y, \theta_0) \,\|\, p(x \mid y, \theta)\big)\Big].$$
The factorization in the first equality holds since $p(y, x \mid \theta) = p(y \mid \theta)\, p(x \mid y, \theta)$. The second equality in (2.1) holds similarly. ∎
Now we prove Theorem 2.1. For the hierarchical model (1), we consider the likelihood of $\theta$ given the joint random vector $(y, x)$ and the complete conditional posterior distribution $p(x \mid y, \theta)$. Applying Proposition 2.1 in both directions of the factorization of $p(y, x \mid \theta)$, we obtain
Taking the Hessian of the above equalities at $\theta = \theta_0$, we obtain (3). ∎
We can use Theorem 2.1 to compute the Fisher information matrices of multi-level hierarchical models:
We write $x = (x_1, \dots, x_k)$ for the latent levels. Applying Theorem 2.1 recursively, it holds that
As a particular case, consider the hierarchical model:
Since the first level does not depend on the parameter, the corresponding information term vanishes. Applying Theorem 2.1 we obtain the simplified decomposition. Thus, it holds that
For the hierarchical model with levels $p(y \mid x, \theta)$ and $p(x \mid \theta)$, the Fisher information matrices satisfy
In particular, if we assume that the Jeffreys prior of the complete model is proper, then the Jeffreys prior of the marginal likelihood is also proper.
It is worth emphasizing that the parameters involved can be multivariate. In this case, we can also use Theorem 2.1 to give an upper bound for the Jeffreys rule prior in terms of the hierarchy. Namely, we prove the following:
Under the hypotheses of Theorem 2.1, the usual Jeffreys rule priors satisfy the following inequalities. If the parameter $\theta$ is multivariate (of dimension $k$), then
Assume the dimension of the parameter space is $k$. The Minkowski determinant theorem (see for instance Theorem 4.1.8, p. 115, of Marcus and Minc (1964)) states that, for any symmetric positive semi-definite $k \times k$ matrices $A$ and $B$,
$$\det(A + B)^{1/k} \geq \det(A)^{1/k} + \det(B)^{1/k}.$$
And, since $\mathcal{I}_{y,x}(\theta) = \mathcal{I}_{y}(\theta) + \mathcal{I}_{x \mid y}(\theta)$ with both summands positive semi-definite, we obtain $\det\big(\mathcal{I}_{y}(\theta)\big)^{1/2} \leq \det\big(\mathcal{I}_{y,x}(\theta)\big)^{1/2}$.
Theorem 2.2 is proved. ∎
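The Minkowski determinant inequality used in the proof is easy to check numerically; the sketch below (our own toy check, not part of the proof) builds random symmetric positive semi-definite matrices and verifies $\det(A+B)^{1/k} \geq \det(A)^{1/k} + \det(B)^{1/k}$:

```python
import numpy as np

# Numerical check of the Minkowski determinant inequality
#   det(A + B)^(1/k) >= det(A)^(1/k) + det(B)^(1/k)
# for symmetric positive semi-definite k x k matrices A and B.
rng = np.random.default_rng(1)
k = 4
ok = True
for _ in range(1000):
    Ma = rng.normal(size=(k, k))
    Mb = rng.normal(size=(k, k))
    A, B = Ma @ Ma.T, Mb @ Mb.T          # PSD by construction
    lhs = np.linalg.det(A + B) ** (1 / k)
    rhs = np.linalg.det(A) ** (1 / k) + np.linalg.det(B) ** (1 / k)
    ok = ok and (lhs >= rhs - 1e-9)      # small tolerance for round-off
```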
The examples below show, in several different situations, how to conveniently decompose the model in order to compute the Fisher information matrix of hierarchical models.
3 Mixture model with known components
Mixture models provide a useful representation in several applications such as gene expression analysis and classification. A simple mixture model with known components (Bernardo and Giron, 1988) may be written as
$$p(y \mid w) = \sum_{j=1}^{k} w_j f_j(y),$$
with $w = (w_1, \dots, w_k)$ in the simplex defining the weights of the mixture and $f_1, \dots, f_k$ completely specified density functions with known parameters. We set the latent variable $x$ assuming values in $\{1, \dots, k\}$ with categorical distribution $P(x = j \mid w) = w_j$, $j = 1, \dots, k$. For each $j$, we fix the probability distribution $f_j$ for $y \mid x = j$. The two-level hierarchical model $x \mid w$ and $y \mid x$ yields the mixture of probabilities $p(y \mid w)$. The Fisher information matrix of the mixture model is
$$\mathcal{I}_{jl}(w) = E_{y \mid w}\left[\frac{\big(f_j(y) - f_k(y)\big)\big(f_l(y) - f_k(y)\big)}{p(y \mid w)^2}\right], \quad j, l = 1, \dots, k-1, \qquad (15)$$
with $w_k = 1 - \sum_{j=1}^{k-1} w_j$. It is not a simple task to show the properness of the Jeffreys prior derived from (15). In fact, this result was obtained by Bernardo and Giron (1988) and Grazian and Robert (2018). Corollary 2.1 gives an alternative proof of this result, as we discuss in the following.
The Fisher information matrix based on the model $p(x \mid w)$, denoted by $\mathcal{I}_x(w)$, with entries indexed by $j, l = 1, \dots, k-1$, is given by
$$[\mathcal{I}_x(w)]_{jl} = \frac{\delta_{jl}}{w_j} + \frac{1}{w_k},$$
where $\delta_{jl}$ is the Kronecker delta. It is well known that the Jeffreys prior of the categorical model is the Dirichlet$(1/2, \dots, 1/2)$ distribution, which is proper with normalizing integral $\pi^{k/2}/\Gamma(k/2)$, half the surface area of $S^{k-1}$, the round sphere in $\mathbb{R}^k$ of radius $1$. By Corollary 2.1, the Jeffreys prior of the incomplete model based on $p(y \mid w)$ is also proper.
In the simpler case $k = 2$ we write $p(y \mid w) = w f_1(y) + (1 - w) f_2(y)$, resulting in $\mathcal{I}_x(w) = 1/(w(1 - w))$. Firstly, we take advantage of the hierarchy to obtain a conjugate complete conditional for the latent variable. The complete conditional posterior distribution of $x$ is
a two-point distribution with $P(x = 1 \mid y, w) = w f_1(y)/p(y \mid w)$. The Fisher information of this complete conditional about $w$ is given by
The expectation of this quantity with respect to the model $p(y \mid w)$ is obtained computationally in the estimation algorithm.
The resulting prior must be integrable, as the Jeffreys prior of the categorical model also is. This result is immediate from Corollary 2.1.
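As a numerical illustration of this bound, the sketch below (our own toy, assuming a two-component mixture with components $N(0,1)$ and $N(3,1)$) checks by quadrature that the marginal Fisher information about the weight never exceeds the complete-data information $1/(w(1-w))$ of the latent categorical variable:

```python
import numpy as np

# Marginal Fisher information about the weight w of the mixture
#   p(y | w) = w f1(y) + (1 - w) f2(y),  f1 = N(0,1), f2 = N(3,1),
# computed by quadrature as I_y(w) = integral (f1 - f2)^2 / p dy,
# compared with the complete-data information I_x(w) = 1/(w(1-w)).
def norm_pdf(y, m):
    return np.exp(-0.5 * (y - m) ** 2) / np.sqrt(2 * np.pi)

y = np.linspace(-12.0, 15.0, 200001)
dy = y[1] - y[0]
f1, f2 = norm_pdf(y, 0.0), norm_pdf(y, 3.0)

bounded = True
for w in (0.1, 0.3, 0.5, 0.7, 0.9):
    p = w * f1 + (1 - w) * f2
    info_y = ((f1 - f2) ** 2 / p).sum() * dy   # marginal information
    info_x = 1.0 / (w * (1 - w))               # latent categorical information
    bounded = bounded and (info_y <= info_x + 1e-8)
```

The gap between the two quantities is exactly the missing information carried by the unobserved labels; it shrinks as the components become better separated.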
4 Two-level hierarchical Gaussian model
The usual random effects model depends strongly on the estimation of the unknown variances at each level of the hierarchy. The signal-to-noise ratio has been studied in several areas of application. In what follows we consider the two-level hierarchical random effects model in its simplest version as a prototypical example,
with $y_i \mid x \sim N(x, \sigma^2)$, $i = 1, \dots, n$, and $x \sim N(0, \tau^2)$. Assume that $\sigma^2$ is known. In this setting the parameter of interest is $\tau^2$, and integrating out $x$ yields the marginal model $y \sim N(0, \sigma^2 I_n + \tau^2 \mathbf{1}\mathbf{1}^\top)$. This model may be seen as the simplest version of several hierarchical models which depend on random effect estimation. It is worth noting that in this example $p(y \mid x)$ does not depend on $\tau^2$, so the second term in the left-hand side of the information identity is null.
The prior and posterior densities of the latent effect $x$ are both Gaussian distributions with the same general form
$$p(x) \propto \exp\left\{-\tfrac{a}{2}(x - b)^2\right\},$$
with precision $a = 1/\tau^2$ and mean $b = 0$ in the prior distribution, and $a = n/\sigma^2 + 1/\tau^2$ and $b = n\bar{y}/(\sigma^2 a)$ in the posterior distribution, where, without loss of generality, we suppose that the prior mean of $x$ is zero. The log conditional density is
$$\log p(x) = \tfrac{1}{2}\log a - \tfrac{a}{2}(x - b)^2 + \text{const}.$$
Twice differentiating with respect to $\tau^2$ and taking expectation in the distribution of $x$ results in
In the prior distribution $a = 1/\tau^2$ and $b = 0$, and it follows that the prior information is $\mathcal{I}_x(\tau^2) = 1/(2\tau^4)$. In the posterior distribution $a = n/\sigma^2 + 1/\tau^2$ and $b = n\bar{y}/(\sigma^2 a)$, and taking expectation in the marginal data distribution leads to
Finally, applying the proposed Fisher decomposition
In this example the marginal model is easily obtained, and the usual differentiation of the log-likelihood leads to the same result obtained in (18). However, for more general random effects models marginalization might not be feasible, and this procedure would still apply to obtain the required Fisher information.
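For this Gaussian example the identity can also be verified by simulation through the law of total variance of the complete-data score. The sketch below is our own check, in our notation, for a single observation $y \mid x \sim N(x, \sigma^2)$, $x \sim N(0, \tau^2)$, where the marginal information about $\tau^2$ is $1/(2(\sigma^2 + \tau^2)^2)$:

```python
import numpy as np

# Monte Carlo check of the information identity for the hierarchy
#   y | x ~ N(x, sigma2),  x ~ N(0, tau2),  parameter of interest: tau2.
# Complete-data score: S = d/dtau2 log p(x | tau2) = -1/(2 tau2) + x^2/(2 tau2^2).
# By the law of total variance, Var(S) = I_x(tau2) splits into
#   E[Var(S | y)] (missing information) + Var(E[S | y]) (marginal information),
# with marginal information equal to 1/(2 (sigma2 + tau2)^2).
rng = np.random.default_rng(2)
sigma2, tau2, N = 1.0, 2.0, 2_000_000

x = rng.normal(0.0, np.sqrt(tau2), size=N)
y = rng.normal(x, np.sqrt(sigma2))

# Rao-Blackwellized (conditional-mean) score, using x | y in closed form
a = 1.0 / sigma2 + 1.0 / tau2              # posterior precision of x | y
m, v = (y / sigma2) / a, 1.0 / a           # posterior mean and variance
cond_mean_score = -1.0 / (2 * tau2) + (m**2 + v) / (2 * tau2**2)

complete_score = -1.0 / (2 * tau2) + x**2 / (2 * tau2**2)
info_marginal = cond_mean_score.var()                 # Var(E[S | y])
missing_info = complete_score.var() - info_marginal   # E[Var(S | y)] >= 0
info_exact = 1.0 / (2 * (sigma2 + tau2) ** 2)         # analytic marginal value
```

The Rao-Blackwellized score uses the closed-form conditional $x \mid y$, which is exactly the variance-reduction idea behind the proposed decomposition.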
5 Student-t model with unknown degrees of freedom
Let $y \sim t_\nu$ be the standard Student-t model with fixed mean and precision and unknown degrees of freedom $\nu$. This model robustifies the Gaussian model by allowing for fatter tails. Several authors have dealt with reference prior specification for the tail parameter (Fonseca et al., 2008; Villa and Walker, 2014; Simpson et al., 2017). This model may be rewritten in a hierarchical setting as
$$y \mid x \sim N(0, 1/x), \qquad x \mid \nu \sim \mathrm{Gamma}(\nu/2, \nu/2).$$
The first term is obtained from the latent variable prior distribution $p(x \mid \nu)$, which is a gamma distribution; twice differentiating and taking expectation with respect to $p(x \mid \nu)$ leads to
$$\mathcal{I}_x(\nu) = \tfrac{1}{4}\psi'(\nu/2) - \tfrac{1}{2\nu}, \qquad (20)$$
with $\psi'$ the trigamma function.
The second part in (19) is computed based on the complete conditional posterior distribution of the latent variable, which is $x \mid y, \nu \sim \mathrm{Gamma}\big((\nu + 1)/2, (\nu + y^2)/2\big)$. Twice differentiating and taking expectation in the distribution of $x \mid y, \nu$ results in
Now we take expectations with respect to the model $p(y \mid \nu)$. For this model the result is analytic and given by
Equations (20) and (21) together result in the usual Jeffreys rule prior for the degrees of freedom of the Student-t model. The Jeffreys prior is bounded above by the function $[\mathcal{I}_x(\nu)]^{1/2}$, which is not integrable. However, the Jeffreys prior itself decays fast enough as $\nu \rightarrow \infty$ to be integrable. Thus, the Jeffreys prior for the degrees of freedom of the Student-t model is proper.
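The gamma-mixing term can be checked by simulation. The sketch below is our own verification (using finite differences of `math.lgamma` for the digamma and trigamma functions) that the variance of the score of the $\mathrm{Gamma}(\nu/2, \nu/2)$ mixing density equals $\tfrac{1}{4}\psi'(\nu/2) - \tfrac{1}{2\nu}$:

```python
import math
import numpy as np

# For the mixing density x | nu ~ Gamma(nu/2, rate nu/2), the score is
#   d/dnu log p(x | nu) = 0.5*log(nu/2) + 0.5 - 0.5*digamma(nu/2)
#                         + 0.5*log(x) - 0.5*x,
# whose variance (the Fisher information) is 0.25*trigamma(nu/2) - 1/(2*nu).
def digamma(a, h=1e-5):
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

def trigamma(a, h=1e-4):
    return (math.lgamma(a + h) - 2 * math.lgamma(a) + math.lgamma(a - h)) / h**2

rng = np.random.default_rng(3)
nu, N = 5.0, 2_000_000
x = rng.gamma(shape=nu / 2, scale=2.0 / nu, size=N)   # rate nu/2 -> scale 2/nu

score = (0.5 * math.log(nu / 2) + 0.5 - 0.5 * digamma(nu / 2)
         + 0.5 * np.log(x) - 0.5 * x)
info_mc = score.var()
info_exact = 0.25 * trigamma(nu / 2) - 1.0 / (2 * nu)
```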
6 Hierarchical priors for variable selection in regression analysis
This example illustrates the use of hierarchical models for variable selection in regression analysis and a convenient use of our proposed Fisher decomposition. Consider the usual linear regression model for $p$ predictor variables and $n$ responses,
$$y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_n), \qquad (22)$$
where $y$ is the $n \times 1$ vector of responses, $X$ is the $n \times p$ matrix of covariates, and $\beta$ is the $p \times 1$ vector of regression coefficients. Consider the problem of variable selection in that context; that is, if $p$ is large it is desirable to find some $\beta_j$'s equal or close to zero. Tibshirani (1996) proposed a method called the lasso, the least absolute shrinkage and selection operator, which is able to produce coefficients exactly equal to zero in regression models. The lasso estimate is defined by
$$\hat{\beta}^{\,lasso} = \arg\min_{\beta} \sum_{i=1}^{n}\big(y_i - x_i^\top \beta\big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq t.$$
The parameter $t$, the lasso parameter, controls the amount of shrinkage that is applied to the estimates. Let $\hat{\beta}^{\,0}$ be the full least squares estimate and let $t_0 = \sum_j |\hat{\beta}^{\,0}_j|$. Values of $t < t_0$ will cause shrinkage of the solutions towards zero. This method aims to improve prediction accuracy and to be more interpretable, as we may focus only on the strongest effects. However, assessing the uncertainty of the estimate is not an easy task: Tibshirani (1996) comments that, since the lasso estimate is a non-linear and non-differentiable function of the response, it is difficult to obtain an accurate estimate of its standard error.
The lasso constraint is equivalent to the addition of an $L_1$ penalty term to the residual sum of squares; that is, we should minimize
$$\sum_{i=1}^{n}\big(y_i - x_i^\top \beta\big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.$$
In the Bayesian context, this is equivalent to the use of independent Laplace priors for the regression coefficients (Park and Casella, 2008); that is,
$$\pi(\beta \mid \lambda) = \prod_{j=1}^{p} \frac{\lambda}{2} e^{-\lambda |\beta_j|}. \qquad (25)$$
We may derive the lasso estimate as the posterior mode under the prior (25) for the $\beta_j$'s. The choice of the Bayesian lasso parameter $\lambda$ is also not trivial, and several methods have been proposed to estimate this parameter, such as cross-validation and empirical Bayes; however, these methods are often unstable. It is pointed out by Park and Casella (2008) that the standard errors obtained for the lasso parameter are not fully satisfactory.
Mallick and Yi (2014) rewrite the Laplace prior for $\beta$ in a convenient hierarchical way which allows for analytical full conditional distributions of the latent variables and model parameters. The lasso prior is obtained as a uniform scale mixture by considering the conditional setting
$$\beta_j \mid u_j \sim \mathrm{Uniform}(-u_j, u_j), \qquad u_j \sim \mathrm{Gamma}(2, \lambda), \qquad j = 1, \dots, p.$$
In what follows we obtain the Jeffreys prior for $\lambda$ based on the Fisher decomposition (9), which simplifies to
with $u = (u_1, \dots, u_p)$. The Fisher information for the bottom level in the hierarchy is easily obtained as $\mathcal{I}_u(\lambda) = 2p/\lambda^2$. In order to obtain the remaining term we take advantage of the known complete conditional distribution of $u$, which is presented in Mallick and Yi (2014) as
with $\hat{\beta}$ the ordinary least squares estimator of $\beta$ in the usual regression model (22). The distribution of this conditional term does not depend on $\lambda$; thus, the required information depends on the distribution of $u$ only and is trivially obtained. The final Jeffreys prior has the closed form given by
Notice that this example illustrates the usefulness of our proposed decomposition for other shrinkage regression models such as ridge regression, the elastic net and the horseshoe model.
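A quick sanity check of the information carried by the Laplace prior level itself (our own sketch, not the paper's full derivation): for a single coefficient with density $(\lambda/2)e^{-\lambda|\beta|}$, the score with respect to $\lambda$ is $1/\lambda - |\beta|$, and since $|\beta| \sim \mathrm{Exp}(\lambda)$ its variance is $1/\lambda^2$, so this level alone contributes a Jeffreys prior proportional to $1/\lambda$:

```python
import numpy as np

# For beta | lambda ~ Laplace(lambda), density (lambda/2) exp(-lambda |beta|),
# the score is d/dlambda log p = 1/lambda - |beta|.  Since |beta| ~ Exp(lambda),
# Var(score) = 1/lambda^2, i.e. the Fisher information of this level is 1/lambda^2.
rng = np.random.default_rng(4)
lam, N = 1.5, 2_000_000
beta = rng.laplace(loc=0.0, scale=1.0 / lam, size=N)   # Laplace with rate lam

score = 1.0 / lam - np.abs(beta)
info_mc = score.var()
info_exact = 1.0 / lam**2
```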
7 Hyperbolic model
This example illustrates the use of our proposed decomposition when all levels of the hierarchy depend on the parameter of interest. In particular, we present results for the hyperbolic model, which is an extension of the Gaussian model allowing for asymmetric behaviour. The Jeffreys rule prior for this model has been proposed by Fonseca et al. (2012); however, the prior is only obtained computationally and properness of the resulting proposal is not proved.
A random variable is said to have a hyperbolic distribution with parameters $\mu$, $\delta$, $\alpha$ and $\beta$ when its density is given by
with $K_1$ the modified Bessel function of the third kind with index 1. This model can be written alternatively as a mixture of a Gaussian and a GIG distribution: if $x \sim GIG(\lambda, \chi, \psi)$ then its density is given by
$$f(x) = \frac{(\psi/\chi)^{\lambda/2}}{2 K_\lambda(\sqrt{\chi\psi})}\, x^{\lambda - 1} \exp\left\{-\tfrac{1}{2}\big(\chi x^{-1} + \psi x\big)\right\}, \quad x > 0.
$$
The mixture representation is
with $N(\mu + \beta x, x)$ the Gaussian distribution with mean $\mu + \beta x$ and variance $x$, and $GIG(1, \delta^2, \alpha^2 - \beta^2)$ the Generalized Inverse Gaussian distribution (GIG) with parameters $1$, $\delta^2$ and $\alpha^2 - \beta^2$.
In particular, consider the standard hyperbolic model (32) for fixed location and scale, which corresponds to an asymmetric density function for $y$ indexed by the asymmetry parameter $\beta$. This model may be rewritten as
Now we illustrate how to compute the Jeffreys prior using the proposed decomposition when both levels of the hierarchy depend on $\beta$. By Theorem 2.1, the Fisher decomposition is given by
In this case the latent variable has a GIG prior distribution, and differentiating and taking expectations with respect to it leads to
with the corresponding expectations of functions of $x$ taken under the GIG distribution. Notice that in this example the first level of the hierarchy also depends on $\beta$; thus, to obtain the upper bound for the Jeffreys prior, we need to compute the information of the first level as well. In the first level, $y$ given $x$ has a Gaussian distribution; thus differentiating and taking expectations with respect to it leads to
The resulting upper bound for the Jeffreys rule prior is given by the function
This upper bound is not integrable, and the properness of the prior distribution must be investigated using the resulting Jeffreys prior itself. The second term in the right-hand side of (34) is computed based on the complete conditional posterior distribution of the latent variable, which in this model is also a GIG distribution given by
Twice differentiating and taking expectation in the distribution of $x \mid y, \beta$ results in
Now we take expectations with respect to the model $p(y \mid \beta)$ to obtain the Jeffreys rule prior for the asymmetry parameter $\beta$ in the hyperbolic model.
The resulting prior has a term that must be computed numerically. Notice that, instead of computing the whole information directly by numerical integration, part of it is obtained analytically, which leads to a reduction of variance.
8 Conclusions and further developments
This paper presented a convenient Fisher decomposition which facilitates the computation of Jeffreys priors even when model marginalization is not available analytically. We have illustrated our proposal with well-known examples from the literature which either present ill-behaved likelihoods or have hyperparameters for which it is difficult to set informative priors. Many other hierarchical models could be explored, and properties of the resulting priors could be investigated using the proposed Fisher decomposition. In particular, variable selection models with mixing distributions other than the Lasso prior could be studied, and properness of the resulting Jeffreys priors could be proved from our theoretical results.
- Bayes (1763) Bayes, T. (1763). “An Essay Towards Solving a Problem in the Doctrine of Chances.” Philosophical Transactions of the Royal Society, 53, 370–418. Reprinted in Biometrika, 45, 243–315, 1958.
- Bernardo and Giron (1988) Bernardo, J. M. and Giron, F. J. (1988). “A Bayesian analysis of simple mixture problems.” Bayesian Statistics, 3, 67–78.
- Fernandez and Steel (1999) Fernandez, C. and Steel, M. F. J. (1999). “Multivariate Student-t regression models: Pitfalls and inference.” Biometrika, 86, 153–167.
- Fonseca et al. (2008) Fonseca, T. C. O., Ferreira, M. A. R., and Migon, H. S. (2008). “Objective Bayesian analysis for the Student-t regression model.” Biometrika, 95, 325–333.
- Fonseca et al. (2012) Fonseca, T. C. O., Migon, H. S., and Ferreira, M. A. R. (2012). “Bayesian analysis based on the Jeffreys prior for the hyperbolic distribution.” Brazilian Journal of Probability and Statistics, 26, 327–343.
- George and McCulloch (1993) George, E. I. and McCulloch, R. (1993). “On obtaining invariant prior distributions.” Journal of Statistical Planning and Inference, 37, 169–179.
- Grazian and Robert (2018) Grazian, C. and Robert, C. P. (2018). “Jeffreys priors for mixture estimation: Properties and alternatives.” Computational Statistics & Data Analysis, 121, 149–163.
- Jeffreys (1946) Jeffreys, H. (1946). “An invariant form for the prior probability in estimation problems.” Proceedings of the Royal Society of London, Series A, 186, 453–461.
- Liseo and Loperfido (2006) Liseo, B. and Loperfido, N. (2006). “Default Bayesian analysis of the skew-normal distribution.” Journal of Statistical Planning and Inference, 136, 373–389.
- Mallick and Yi (2014) Mallick, H. and Yi, N. (2014). “A New Bayesian Lasso.” Statistics and Its Interface, 7, 571–582.
- Marcus and Minc (1964) Marcus, M. and Minc, H. (1964). A Survey of Matrix Theory and Matrix Inequalities. Allyn and Bacon Inc., Boston.
- Park and Casella (2008) Park, T. and Casella, G. (2008). “The Bayesian Lasso.” Journal of the American Statistical Association, 103, 681–686.
- Simpson et al. (2017) Simpson, D., Rue, H., Riebler, A., Martins, T. G., and Sorbye, S. H. (2017). “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statistical Science, 32, 1–28.
- Tanner and Wong (1987) Tanner, M. A. and Wong, W. H. (1987). “The Calculation of Posterior Distributions by Data Augmentation.” Journal of the American Statistical Association, 82, 528–540.
- Tibshirani (1996) Tibshirani, R. (1996). “Regression shrinkage and selection via the Lasso.” Journal of the Royal Statistical Society, Series B, 58, 267–288.
- Villa and Walker (2014) Villa, C. and Walker, S. G. (2014). “Objective Prior for the Number of Degrees of Freedom of a t Distribution.” Bayesian Analysis, 9, 197–220.