Log In Sign Up

A Gentle Introduction to Bayesian Hierarchical Linear Regression Models

Considering the flexibility and applicability of Bayesian modeling, in this work we revise the main characteristics of two hierarchical models in a regression setting. We study the full probabilistic structure of the models along with the full conditional distribution for each model parameter. Under our hierarchical extensions, we allow the mean of the second stage of the model to have a linear dependency on a set of covariates. The Gibbs sampling algorithms used to obtain samples when fitting the models are fully described and derived. In addition, we consider a case study in which the plant size is characterized as a function of nitrogen soil concentration and a grouping factor (farm).


The Tilted Beta Binomial Linear Regression Model: a Bayesian Approach

This paper proposes new linear regression models to deal with overdisper...

Variable fusion for Bayesian linear regression via spike-and-slab priors

In linear regression models, a fusion of the coefficients is used to ide...

On Gibbs Sampling for Structured Bayesian Models Discussion of paper by Zanella and Roberts

This article is a discussion of Zanella and Roberts' paper: Multilevel l...

Bayesian linear regression models with flexible error distributions

This work introduces a novel methodology based on finite mixtures of Stu...

Variational Hierarchical Mixtures for Learning Probabilistic Inverse Dynamics

Well-calibrated probabilistic regression models are a crucial learning c...

Bayesian Hierarchical Factor Regression Models to Infer Cause of Death From Verbal Autopsy Data

In low-resource settings where vital registration of death is not routin...

Hierarchical Distribution Matching: a Versatile Tool for Probabilistic Shaping

The hierarchical distribution matching (Hi-DM) approach for probabilisti...

1 Introduction

A key characteristic of many problems is that the observed data can be used to estimate aspects of the population even though they are never observed. Often, it is quite natural to model such a problem hierarchically, with observable outcomes modeled conditionally on certain parameters, which themselves are assigned a probabilistic specification in terms of further random quantities.

Hierarchical models can have enough parameters to fit the data well, while using a population distribution to structure some dependence into the parameters, thereby avoiding problems of over-fitting. In addition, by establishing hierarchies we are not forced to choose between complete pooling and not pooling at all as the classic analysis of variance does


In this work we analyze observational continuous data arranged in groups. Specifically, we discuss hierarchical models for the comparison of group-specific parameters across groups in a regression setting. Our hierarchical approach is conceptually a straightforward generalization of a standard Normal model. Emulating hoff2009first, we use an ordinary regression model to describe within-group heterogeneity of observations, and also, describe between-group heterogeneity using a sampling model for the group-specific regression parameters. Then, we take a step further and develop another hierarchical model adding a specific layer for carrying out clustering tasks.

Even though our modeling approach is quite conventional under the Bayesian paradigm, our main contribution mainly relies in how models are structured and developed. Thus, we provide for each model all the details regarding model and (hierarchical) prior specification, careful hyperparameter selection under little external information, and simulation-based algorithms for computation. Finally, we also consider several metrics to quantify the performance aspects of our models by means of both goodness-of-fit and in-sample model checking metrics. In every instance along the way, we go over several buildying blocks of the literature about hierarchical modeling techniques and clustering tasks.

This paper is structured as follows: Section 2 revises all the details related to the Normal linear regression model. Sections 3 and 4 provide a full development of the hierarchical extensions. Section 5 discusses in depth other modeling approaches. Section 7

presents several specifics about model checking through tests statistics, and also, model section through information criteria. Section

8 makes a complete analyzes of a case study. Finally, Section 9 discusses our findings and future developments.

2 Normal Linear Regression Model

Here, we show some relevant aspects about linear regression modeling in a Bayesian setting, which is a powerful data analysis tool quite useful for carrying out many inferential tasks such as data characterization and prediction. Roughly speaking, our goal is to find a model for predicting the dependent variable (response) given one or more independent (predictor) variables.

2.1 Model Specification

First, we consider a simple scenario in which we want to characterize the sampling distribution of a random variable

through a set of explanatory variables . Thus, we look upon a Normal linear regression model of the form

where , , and

are the response variable, the covariates, and the random error, respectively, corresponding to the

-th observation from the -th group, and are the regression parameters of the model. Note that the previous model can be re-expressed as


is the response vector given by

, with , and is the design matrix arranged in a similar fashion.

In order to perform full Bayesian analysis using the likelihood given above, we consider a semi-conjugate prior distribution for

and of the form

as in most Normal sampling problems.

2.2 Prior Elicitation

In the absence of convincing external information to the data, it is customary using a defuse prior distribution in order to be as minimally informative as possible. In the same spirit of the unit information prior as in kass1995reference, we let and , and set , , with , where

stands for the ordinary least squares (OLS) estimate of

. This choice of makes the ratio very close to 1, and therefore, is practically centered around ; similarly, the prior distribution of is weakly centered around since . Such a prior distribution cannot be strictly considered a real prior distribution, as it requires knowledge of to be constructed. However, it only uses a small amount of the information given in , and can be loosely thought of as the prior distribution of a researcher with unbiased but weak prior information.

2.3 Posterior Inference

The posterior distribution can be explored using Markov chain Monte Carlo (MCMC) methods

(gamerman2006markov) such as the Gibbs sampling. Implementing such an algorithm under the previous model is quite simple since the full conditional distributions are


See for example christensen2011bayesian for details about this result.

Under the previous setting, we need to carefully choose values for the set of model hyperparameters, namely, , , , and . Often, the analysis must be done in the absence of prior information, so we should use a prior distribution as minimally informative as possible. The so-called -priors (see for example albert2009bayesian for a brief discussion) offer this possibility and the desirable feature of invariance to changes in the scale of the regressors. A popular alternative in this direction consists in letting and for some positive value that reflects the amount of information in the data relative to the prior distribution (choosing a large value of naturally induces a diffuse prior). It can be shown that under the -prior specification,

is an Inverse Gamma distribution, which means that we can use direct sampling as follows (see

hoff2009first for details):

  1. Sample .

  2. Sample .

3 Hierarchical Normal Linear Regression Model

We present the treatment of a hierarchical model, in which the observed data is assumed to be normally distributed with both group-specific fixed effects (and therefore subject-specif means) and group-specific variances. The model specification provided below is quite convenient because in addition to a global assessment of the mean relationship between the covariates and the response variable (as allowed by a standard linear regression model), it gives the means to carry out specific inferences within each group as well as comparisons among groups.

3.1 Model Specification

Figure 1: DAG representation of the hierarchical Normal linear regression model.

We consider independent groups, each one of them with independent normally distributed data points (i.e., a balanced experiment), , each of which with subject-specific mean , with , and group-specific variance ; i.e.,

In addition, we propose a hierarchical prior distribution with the following stages:


where , , , , and are the unknown model parameters, and , , , , , , and are hyperparameters carefully selected according to external information. Figure 1) provides a directed acyclic graph (DAG) representation of the model. As a final remark, note that fitting this hierarchical model is not equivalent to fitting regular regression models to each group independently since the information shared across groups (shrinkage effect) would be lost. See hoff2009first for more details in this regard.

3.2 Prior Elicitation

Following the same unit-information-prior-inspired approach considered to select the hyperparameters of the Normal linear regression model, we let again and , and set , , with (see Section 2 for details). In addition, aiming to establishing a diffuse and reasonable centered prior for , we let and because such a specification produces a mean vague concentration of around since a priori. Finally, we let and because this choice leads to a diffuse prior for such that with , which clearly emulates the prior elicitation in a regular regression setting for which .

3.3 Posterior Inference

Joint posterior inference for the model parameters can be achieved by constricting a Gibbs sampling algorithm (gamerman2006markov), which requires iteratively sampling each parameter from its full conditional distribution.

Let be the full set of parameters in the model. The posterior distribution of is

which leads to

Although this is an abuse of standard mathematical notation, the full conditional distribution (fcd) of parameter given the rest of the parameters, the design matrix , and the data is denoted by . We derived these distributions looking at the dependencies in the full posterior distribution. Thus, we have that:

  • The fcd of , for , is

    where .

  • The fcd for is

  • The fcd of is

  • The fcd of , for , is

  • The fcd of is

Let denote the state of parameter in the -th iteration of the Gibbs sampling algorithm, for . Then, such an algorithm in this case is as follows:

  1. Choose an initial configuration for each parameter in the model, say , , , , and .

  2. Update , , , , and until convergence:

    1. Sample from the fcd , for .

    2. Sample from the fcd

    3. Sample from the fcd

    4. Sample from the fcd , for .

    5. Sample from the fcd

  3. Cycle until achieve convergence.

4 Clustering Hierarchical Normal Linear Regression Model

Here we extend the hierarchical approach provided in the previous section by adding more structure to the model in order to perform clustering task at a group level (clusters whose elements are groups). Specifically, we consider a mixture model instead of a regular Normal distribution in the likelihood, and then, introduce a new set of parameters known as cluster assignments to “break” the mixture and then be able to identify those groups belonging to the same cluster.

4.1 Model Specification

A natural way to extend the hierarchical model consists in relaxing the normality assumption about the response variable by replacing it with a finite mixture of Normal components in such a way that


is a positive fixed integer that represents the number of clusters in which groups can be classified,

and are the cluster-specific regression parameters and cluster-specific variances of the mixture components, and

are mixture probabilities such that

and . Note that under this formulation we recover the Normal linear regression model by setting .

According to the previous mixture, the probability that the group is part of the cluster is , i.e., , for and , where

is a categorical variable (known as either cluster assignment or cluster indicator) that takes integer values in

with probabilities , respectively. Thus, we can use the cluster assignments to “break” the mixture and write the model as

In addition, a parsimonious way to formulate a hierarchical prior distribution can be achieved by letting


where , , , , , , and are the unknown model parameters, and , , , , , , , and are hyperparameters carefully selected according to external information. Finally, note that the DAG representation of the model is very similar to that displayed in Figure 1, but including a few extra nodes corresponding to the clustering process.

4.2 Prior Elicitation

Under the same approach as before, we let once again and , and set , , with , , , , and . Thus, the only hyperparameter that remains unspecified is . To do so in a sensible way, we let , which has a direct connection with a Chinese restaurant process prior (ishwaran2000markov) and places a diffuse prior distribution for the number of occupied clusters in the data.

4.3 Posterior Inference

Once again we recur to MCMC methods as in Section 3 to explore the posterior distribution of the model parameters . The posterior distribution of is such that

which leads to

where is the Iverson bracket. Such a posterior distribution is quite reminiscent of the one that we derived for the hierarchical model in Section 3, but this time it includes a portion related with the clustering process, and also, group-specific parameters cycle over terms instead of .

Once again, we derive the fcd’s from the posterior distribution, obtaining that:

  • The fcs of , for , is a Categorical distribution such that

  • The fcs of is

    where is the number of elements in cluster , for .

  • The fcd of , for , is

    where and . Note that if cluster is empty, then the fcd of is just .

  • The fcd for is

    where is the number of non-empty clusters.

  • The fcd of is

  • The fcd of , for , is

    Again, note that if cluster is empty, then the fcd of is just .

  • The fcd of is

Thus, the Gibbs sampling algorithm in this case is as follows:

  1. For a given value of , choose an initial configuration for each parameter in the model, say , , , , , , and .

  2. Update , , , , , , and until convergence:

    1. Sample from the fcd , for .

    2. Sample from the fcd .

    3. Sample from the fcd , for .

    4. Sample from the fcd

    5. Sample from the fcd

    6. Sample from the fcd , for .

    7. Sample from the fcd

  3. Cycle until achieve convergence.

5 Further Extensions

In order to gain stochastic flexibility, yet another way of extending the model provided in Section 3 consists in completely relaxing the normality assumption about the response variable by assigning a prior distribution to it. Specifically, we consider a Dirichlet process (DP) mixture model of the form

where is a positive scalar parameter and

is a base distribution function. In this case, the DP generates cumulative distribution functions on

(see for example muller2015bayesian for a formal treatment of the DP). The model given above can be also written as

which makes evident why the can be interpreted as subject-specific random effects. Such an extension is beyond the scope of this work and will be discussed in detail elsewhere.

Other straightforward parametric extensions are considering group-specific effects in a way that , for , , as well as extra model hierarchies such as letting to be a integer random value ranging from 1 to a fixed large upper bound in a way that , where is a hyperparameter. For clustering tasks, more sophisticated extensions require the specification of nonparametric priors of the form , for , where are weights constructed from a sequence , with for and

. The joint distribution of the set of weights

is called a stick-breaking prior with parameters and . This formulation is connected to the stick-breaking construction of the Poisson-Dirichlet process (pitman1997two). The stick-breaking representation associated with the Dirichlet process is a special case with .

(a) Linear Regression Model
(b) Hierarchical Linear Regression Model
(c) Clustering Hierarchical Linear Regression Model
Figure 2: Log-likelihood traceplots when fitting the models for the plant size data analyzed in Section 8.

6 Computation

We implement all the models above following the algorithms provided in the corresponding sections entitled as Posterior Inference. Every time our results are based on samples of the posterior distribution obtained after thinning the original chains every 10 observations and a burn-in period of 10,000 iterations. In addition, before using the MCMC samples with inferential purposes, we determine first if there is any evidence of lack of convergence of any chain to its stationary distribution. Following standard practices, we produce log-likelihood traceplots of each model. Fitting the models to the data provided in Section 8, such plots strongly suggest that there are no stationary concerns since the log-likelihoods move in a consistent direction (see Figure 2). Further, autocorrelation plots of model parameters (not shown here) indicate that there are no signs of strong dependence in the chains. Thus, we are very confident of our MCMC samples to perform inductive tasks.

7 Model Checking and Goodness-of-Fit

We check the in-sample performance of the model by generating new data from the posterior predictive distribution, and then, calculating a battery of test statistics (such as the mean), aiming to compere the corresponding empirical distributions with the actual values observed in the sample

(gelman2013bayesian). In this spirit, for each quantity of interest, we also compute the posterior predictive -vale (ppp), which can be calculated as

where is a predictive dataset and a test statistic, in order to measure how good the model is in fitting the actual sample.

Finally, in order to asses the goodness-of-fit of each model as a measure of their predictive performance, in what follows we consider two metrics that account for both model fit and model complexity. The goal here is not necessarily picking the model with lowest estimated prediction error but to determine if improvements in fitting the model are large enough to justify the additional difficulty. That why such measures also serve as model-selection tools. The model-based literature has largely focused on the Bayesian Information Criteria (BIC) as a mechanism for model selection. However, the BIC is typically inappropriate for hierarchical models since the hierarchical structure implies that the effective number of parameters will typically be lower than the actual number of parameters in the likelihood (gelman2014understanding). Two popular alternatives to the BIC that address such an issue are the Deviance Information Criterion (spiegelhalter2002bayesian; spiegelhalter-2014, DIC),

with , and the Watanabe-Akaike Information Criterion (watanabe2010asymptotic; watanabe2013widely, WAIC),

with , where is the posterior mean of model parameters, and and are penalty terms accounting for model complexity. Note that in the previous expressions all expectations, which are computed with respect to the posterior distribution, can be approximated by averaging over Markov chain Monte Carlo (MCMC) samples (see Section 6 for details). Next section we have adopt a standard approach and use the DIC.

8 Illustration

Nitrogen is an essential macronutrient deeded by all plants to thrive. It is an important component of many structural, genetic, and metabolic compounds in plant cells. Increasing the levels of nitrogen during the vegetative stage can strengthen and support the plant roots, enabling them to take in more water nutrients. This allows a plant to grow more rapidly and produce large amounts of succulent, green foliage, which in turn can generate bigger yields, tastier vegetables, and a crop that is more resistant to pets, diseases, and other adverse conditions. Using too much nitrogen, however, can be just as harmful to plants as to little. A researcher took measurements of nitrogen soil concentration () and plant sizes () within each of farms. Thus, we have that and are the plant size and the nitrogen soil concentration values, respectively, associated the -th plant from the -th farm, and .

Figure 3: Descriptive plots. The second panel exhibit the ordinary least squares regression line for this data. Colors in the third panel correspond to different farms.

8.1 Exploratory Data Analysis

A histogram of the plant size is shown in the first panel of Figure 3. The plant size ranges from 76.56 to 117.50 which seems quite large in comparison with the plant size range within each farm (see the bottom panel). The second and third panels show the relationship between the plant size and the nitrogen soil concentration. In particular, the second panel exhibits the ordinary least squares (OLS) regression line for these data, which is given by

with a standard error of

with 118 degrees of freedom, and an adjusted R-squared of

. The third panel presents the same plot but taking into account the farm where the measurements belong to. These panels along with the OLS fit indicate two main features about this experiment: (1) there is an important relationship between the nitrogen soil concentration and the plant size (in our OLS fit the slope turns out to be significant, -value ); and (2) there is a clear farm effect on plant size since colors in the scatter plot reveal clustering patters, and also, the corresponding boxplots strongly suggest differences in terms of mean plant growth among farms.

8.2 Normal Linear Regression Model

We fit the linear regression model given in Section 2 to these data without taking into account the farm information. Posterior summaries of the model parameters are provided in Table 1. Even though the posterior mean of the model parameters practically coincide with their corresponding OLS estimates, these results are again quite limited since they do not allow us to isolate any kind of effect over the plant size arising from the grouping factor. Such an impossibility strongly motivates the hierarchical approaches that we present in this paper.

Parameter Mean SD Q2.5% Q97.5%
92.59 3.60 85.53 99.61
0.36 0.18 0.01 0.70
72.89 9.67 56.35 94.07
Table 1: Posterior summaries of the model parameters in the linear regression model.

8.3 Hierarchical Normal Linear Regression Model

Now we go further and fit the hierarchical linear regression model given in Section 3 to the plant size data considering both group-specific fixed effects and group-specific variances. Such a group-specif approach is very convenient because it allow us to carry out separate-group inferences as opposed to its non-hierarchical counterpart.

Figure 4:

95% quantile-based credible intervals and posterior means (squares) for the regressor fixed-effects regressor parameters in the hierarchical Normal linear regression model. Colored thicker lines correspond to credible intervals that do not contain zero. Top panel:

(intercepts). Bottom panel: (slopes).
Figure 5: 95% quantile-based credible intervals and posterior means (squares) for the variance parameters in the hierarchical Normal linear regression model.

We present our main results in Figures 4 and 5 where we display 95% quantile-based credible intervals for and , respectively. At this point we are capable of making evident some important findings. The uncertainty about the group-specific parameters is not even since the amplitude of the credible intervals clearly varies across farms. This effect is particularly evident for the variance components. In addition, point estimates (posterior means) of the group-specific parameters are also quite variable, which strongly suggests that for these data considering this approach is beneficial because it allows us to characterize for each farm its own unique features. However, some farms show some signs of similar features, which was also evident before in Figure 3. We explore clustering patterns in the next subsection.

As expected, we see that all the intercepts are statistically significant, but also highly variable (ranging from 61.48 to 106.81). On the other hand, the story behind the slopes (ranging from 0.08 to 1.04) is quite different. We see that just 9 out of 24 (37.5%) of such parameters turn out to be significant, namely, for farms 1, 7, 9, 15, 18, 20, 21, 22, and 23 (Figure 6 shows the corresponding estimated regression lines for these farms). The previous fact strongly suggests that the relevance of the relationship between the nitrogen soil concentration and the plant size is not consistent across farms. Therefore, in this case, considering farms as a grouping factor makes a substantial impact on the analysis.

Figure 6: Estimated lines of the hierarchical Normal linear regression model for those farms whose nitrogen soil concentration is statistically significant. Colors are used to represent different farms.

8.4 Clustering Hierarchical Normal Linear Regression Model

It is quite reasonable attempting to identify clusters composed of farms given the abundant evidence of similarities among groups and cluster formation detected in both the exploratory data analysis and the hierarchical modeling stage of the analysis. Here we fit here the clustering hierarchical linear regression given in Section 4 model in order to provide a formal partition of farms. To this end, we fit the model using as a default number of communities. Such an extreme case represents the prior believe that there are no clustering patterns at all. A moderate large value of is convenient in situations like this because it allows the data to tell by itself how many non-empty clusters should be considered. We will see in what follows that out from the clusters, many turn out to be empty and only a few remain.

Let be the number of non-empty clusters in the partition induced by . The left panel of Figure 7 shows the posterior distribution of . We see that the most highly probable values are and . In fact, , which means that around three quarters of the posterior partitions are composed of either 7 or 8 clusters of farms. The estimated number of non-empty clusters in the data is obviously (maximum a posteriori with a reference value very close to 0.4).

(a) Posterior distr. of .
(b) Incidence matrix .
Figure 7: Posterior inference on the cluster asingments .

On the other hand, the right panel in Figure 7 shows the incidence matrix obtained from the posterior distribution of the cluster assignments . The incidence matrix is a pairwise-probability matrix whose elements are given by , for (note that for all ). Thus,

simply represents the posterior probability that farms

and belong to the same community. Such probabilities are indeed identifiable, however labels themselves are not since the likelihood is invariant to relabelling of the mixture components (this is known as the label switching problem; see stephens2000dealing and references therein). On top the incidence matrix, we also present a point estimate of the partition induced by such a matrix (represented by black lines), which can be obtained by employing the clustering methodology proposed in lau2007bayesian with a relative error cost of 0.5. As expected, we see that eight clusters are formed from the data. The corresponding cluster sizes are 6, 6, 5, 2, 2, 1, 1, and 1. Specifically, the clusters are composed of the following farms: , , and , , , , , and . These clusters along with their estimated regression lines are also represented in Figure 8. We see that the estimated clusters make sense spatially according to the data points. The slope parameters turn out to be significant for every cluster.

(a) Cluster 1.
(b) Cluster 2.