1 Introduction
A key characteristic of many problems is that the observed data can be used to estimate aspects of the population even though they are never observed. Often, it is quite natural to model such a problem hierarchically, with observable outcomes modeled conditionally on certain parameters, which themselves are assigned a probabilistic specification in terms of further random quantities.
Hierarchical models can have enough parameters to fit the data well, while using a population distribution to structure some dependence into the parameters, thereby avoiding problems of overfitting. In addition, by establishing hierarchies we are not forced to choose between complete pooling and not pooling at all as the classic analysis of variance does
(gelman2013bayesian). In this work we analyze observational continuous data arranged in groups. Specifically, we discuss hierarchical models for the comparison of group-specific parameters across groups in a regression setting. Our hierarchical approach is conceptually a straightforward generalization of a standard Normal model. Emulating hoff2009first, we use an ordinary regression model to describe within-group heterogeneity of observations, and describe between-group heterogeneity using a sampling model for the group-specific regression parameters. Then, we take a step further and develop another hierarchical model that adds a specific layer for carrying out clustering tasks.
Even though our modeling approach is quite conventional under the Bayesian paradigm, our main contribution lies in how the models are structured and developed. Thus, for each model we provide all the details regarding model and (hierarchical) prior specification, careful hyperparameter selection under little external information, and simulation-based algorithms for computation. Finally, we also consider several metrics to quantify the performance of our models by means of both goodness-of-fit and in-sample model checking metrics. Along the way, we go over several building blocks of the literature on hierarchical modeling techniques and clustering tasks.
This paper is structured as follows: Section 2 reviews all the details related to the Normal linear regression model. Sections 3 and 4 provide a full development of the hierarchical extensions. Section 5 discusses in depth other modeling approaches. Section 7 presents several specifics about model checking through test statistics, as well as model selection through information criteria. Section 8 presents a complete analysis of a case study. Finally, Section 9 discusses our findings and future developments.
2 Normal Linear Regression Model
Here, we review some relevant aspects of linear regression modeling in a Bayesian setting, a powerful data-analysis tool for carrying out many inferential tasks such as data characterization and prediction. Roughly speaking, our goal is to find a model for predicting the dependent variable (response) given one or more independent (predictor) variables.
2.1 Model Specification
First, we consider a simple scenario in which we want to characterize the sampling distribution of a random variable
through a set of explanatory variables . Thus, we consider a Normal linear regression model of the form
where , , and
are the response variable, the covariates, and the random error, respectively, corresponding to the
th observation from the th group, and are the regression parameters of the model. Note that the previous model can be re-expressed as
where
is the response vector given by
, with , and is the design matrix arranged in a similar fashion. In order to perform a full Bayesian analysis using the likelihood given above, we consider a semi-conjugate prior distribution for
and of the form
as in most Normal sampling problems.
2.2 Prior Elicitation
In the absence of convincing information external to the data, it is customary to use a diffuse prior distribution in order to be as minimally informative as possible. In the same spirit of the unit information prior as in kass1995reference, we let and , and set , , with , where
stands for the ordinary least squares (OLS) estimate of
. This choice of makes the ratio very close to 1, and therefore, is practically centered around ; similarly, the prior distribution of is weakly centered around since . Such a prior distribution cannot strictly be considered a real prior distribution, as it requires knowledge of to be constructed. However, it only uses a small amount of the information given in , and can be loosely thought of as the prior distribution of a researcher with unbiased but weak prior information.
2.3 Posterior Inference
The posterior distribution can be explored using Markov chain Monte Carlo (MCMC) methods
(gamerman2006markov) such as Gibbs sampling. Implementing such an algorithm under the previous model is quite simple since the full conditional distributions are
and
See for example christensen2011bayesian for details about this result.
Under the previous setting, we need to carefully choose values for the set of model hyperparameters, namely, , , , and . Often, the analysis must be done in the absence of prior information, so we should use a prior distribution that is as minimally informative as possible. The so-called priors (see for example albert2009bayesian for a brief discussion) offer this possibility along with the desirable feature of invariance to changes in the scale of the regressors. A popular alternative in this direction consists of letting and for some positive value that reflects the amount of information in the data relative to the prior distribution (choosing a large value of naturally induces a diffuse prior). It can be shown that under this prior specification,
is an Inverse Gamma distribution, which means that we can use direct sampling as follows (see
hoff2009first for details):
Sample .

Sample .
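The two direct-sampling steps above can be sketched as follows. This is only an illustrative sketch of the non-iterative scheme for the g-prior setup described in the text; the function name, the default choice g = n, and the use of the OLS residual variance for the prior scale are our own assumptions, not prescriptions from the paper.

```python
import numpy as np

def sample_g_prior_posterior(y, X, g=None, nu0=1.0, s20=None, n_draws=1000, rng=None):
    """Direct (non-iterative) Monte Carlo sampling of (sigma^2, beta).

    Sketch of the two-step scheme for the g-prior; the defaults g = n and
    s20 = OLS residual variance are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    g = float(n) if g is None else g
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T                                  # hat matrix
    if s20 is None:
        s20 = float(y @ (np.eye(n) - H) @ y) / (n - p)     # OLS residual variance
    shrink = g / (g + 1.0)
    SSRg = float(y @ y - shrink * (y @ H @ y))             # g-prior residual sum of squares
    # Step 1: sigma^2 | y ~ Inverse-Gamma((nu0 + n)/2, (nu0*s20 + SSRg)/2)
    sigma2 = 1.0 / rng.gamma((nu0 + n) / 2.0, 2.0 / (nu0 * s20 + SSRg), size=n_draws)
    # Step 2: beta | sigma^2, y ~ Normal(m, shrink * sigma^2 * (X'X)^{-1})
    m = shrink * (XtX_inv @ X.T @ y)
    L = np.linalg.cholesky(shrink * XtX_inv)
    z = rng.standard_normal((n_draws, p))
    beta = m + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return sigma2, beta
```

Because the draws are independent (no Markov chain), no burn-in or convergence assessment is needed for this model.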
3 Hierarchical Normal Linear Regression Model
We present the treatment of a hierarchical model in which the observed data are assumed to be normally distributed with both group-specific fixed effects (and therefore subject-specific means) and group-specific variances. The model specification provided below is quite convenient because, in addition to a global assessment of the mean relationship between the covariates and the response variable (as allowed by a standard linear regression model), it provides the means to carry out specific inferences within each group as well as comparisons among groups.
3.1 Model Specification
We consider independent groups, each of them with independent normally distributed data points (i.e., a balanced experiment), , each with subject-specific mean , with , and group-specific variance ; i.e.,
In addition, we propose a hierarchical prior distribution with the following stages:
with
where , , , , and are the unknown model parameters, and , , , , , , and are hyperparameters carefully selected according to external information. Figure 1 provides a directed acyclic graph (DAG) representation of the model. As a final remark, note that fitting this hierarchical model is not equivalent to fitting regular regression models to each group independently, since the information shared across groups (shrinkage effect) would be lost. See hoff2009first for more details in this regard.
3.2 Prior Elicitation
Following the same unit-information-prior-inspired approach considered to select the hyperparameters of the Normal linear regression model, we again let and , and set , , with (see Section 2 for details). In addition, aiming to establish a diffuse and reasonably centered prior for , we let and , because such a specification produces a vague concentration of around , since a priori. Finally, we let and because this choice leads to a diffuse prior for such that with , which clearly emulates the prior elicitation in a regular regression setting for which .
3.3 Posterior Inference
Joint posterior inference for the model parameters can be achieved by constructing a Gibbs sampling algorithm (gamerman2006markov), which requires iteratively sampling each parameter from its full conditional distribution.
Let be the full set of parameters in the model. The posterior distribution of is
which leads to
Although this is an abuse of standard mathematical notation, the full conditional distribution (fcd) of a parameter given the rest of the parameters, the design matrix , and the data is denoted by . We derived these distributions by inspecting the dependencies in the full posterior distribution. Thus, we have that:

The fcd of , for , is
where .

The fcd of is

The fcd of is

The fcd of , for , is

The fcd of is
Let denote the state of parameter in the th iteration of the Gibbs sampling algorithm, for . Then, such an algorithm in this case is as follows:

Choose an initial configuration for each parameter in the model, say , , , , and .

Update , , , , and until convergence:

Sample from the fcd , for .

Sample from the fcd

Sample from the fcd

Sample from the fcd , for .

Sample from the fcd


Cycle until convergence is achieved.
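The Gibbs scheme above can be illustrated with a minimal runnable sketch. Since the regression fcd's are given in the text in general form, we specialize here to the intercept-only case (group means with a common within-group variance), which is a simplified special case of the hierarchical model; all function names and default hyperparameter values are our own illustrative assumptions.

```python
import numpy as np

def gibbs_hierarchical_normal(y, n_iters=2000, mu0=0.0, g20=100.0,
                              eta0=1.0, t20=1.0, nu0=1.0, s20=1.0, rng=None):
    """Gibbs sampler for y_ij ~ N(theta_j, sigma2), theta_j ~ N(mu, tau2),
    with conjugate priors on mu, tau2, and sigma2 (hyperparameter names
    mu0, g20, eta0, t20, nu0, s20 are illustrative).

    y : list of 1-d arrays, one per group.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = len(y)
    n = sum(len(yj) for yj in y)
    ybar = np.array([yj.mean() for yj in y])
    nj = np.array([len(yj) for yj in y], dtype=float)
    theta, mu = ybar.copy(), float(ybar.mean())     # initial configuration
    tau2 = sigma2 = 1.0
    out = {"theta": [], "mu": [], "tau2": [], "sigma2": []}
    for _ in range(n_iters):
        # fcd of theta_j: precision-weighted combination of data and prior
        prec = nj / sigma2 + 1.0 / tau2
        theta = rng.normal((nj * ybar / sigma2 + mu / tau2) / prec, np.sqrt(1.0 / prec))
        # fcd of mu
        prec_mu = m / tau2 + 1.0 / g20
        mu = rng.normal((theta.sum() / tau2 + mu0 / g20) / prec_mu, np.sqrt(1.0 / prec_mu))
        # fcd of tau2 (Inverse-Gamma)
        tau2 = 1.0 / rng.gamma((eta0 + m) / 2.0,
                               2.0 / (eta0 * t20 + np.sum((theta - mu) ** 2)))
        # fcd of sigma2 (Inverse-Gamma)
        ssr = sum(np.sum((yj - th) ** 2) for yj, th in zip(y, theta))
        sigma2 = 1.0 / rng.gamma((nu0 + n) / 2.0, 2.0 / (nu0 * s20 + ssr))
        for k, v in zip(out, (theta.copy(), mu, tau2, sigma2)):
            out[k].append(v)
    return {k: np.array(v) for k, v in out.items()}
```

The full regression version follows the same cycle, replacing the normal-means updates with the regression fcd's listed above.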
4 Clustering Hierarchical Normal Linear Regression Model
Here we extend the hierarchical approach provided in the previous section by adding more structure to the model in order to perform clustering tasks at the group level (clusters whose elements are groups). Specifically, we consider a mixture model instead of a regular Normal distribution in the likelihood, and then introduce a new set of parameters known as cluster assignments to “break” the mixture and thereby identify those groups belonging to the same cluster.
4.1 Model Specification
A natural way to extend the hierarchical model consists in relaxing the normality assumption about the response variable by replacing it with a finite mixture of Normal components in such a way that
where
is a positive fixed integer that represents the number of clusters in which groups can be classified,
and are the clusterspecific regression parameters and clusterspecific variances of the mixture components, andare mixture probabilities such that
and . Note that under this formulation we recover the Normal linear regression model by setting . According to the previous mixture, the probability that the group is part of the cluster is , i.e., , for and , where
is a categorical variable (known as either cluster assignment or cluster indicator) that takes integer values in
with probabilities , respectively. Thus, we can use the cluster assignments to “break” the mixture and write the model as
In addition, a parsimonious way to formulate a hierarchical prior distribution can be achieved by letting
with
where , , , , , , and are the unknown model parameters, and , , , , , , , and are hyperparameters carefully selected according to external information. Finally, note that the DAG representation of the model is very similar to that displayed in Figure 1, but including a few extra nodes corresponding to the clustering process.
4.2 Prior Elicitation
Under the same approach as before, we once again let and , and set , , with , , , , and . Thus, the only hyperparameter that remains unspecified is . To specify it in a sensible way, we let , which has a direct connection with a Chinese restaurant process prior (ishwaran2000markov) and places a diffuse prior distribution on the number of occupied clusters in the data.
4.3 Posterior Inference
Once again we resort to MCMC methods as in Section 3 to explore the posterior distribution of the model parameters . The posterior distribution of is such that
which leads to
where is the Iverson bracket. Such a posterior distribution is quite reminiscent of the one we derived for the hierarchical model in Section 3, but this time it includes a portion related to the clustering process, and the group-specific parameters cycle over terms instead of .
Once again, we derive the fcd’s from the posterior distribution, obtaining that:

The fcd of , for , is a Categorical distribution such that

The fcd of is
where is the number of elements in cluster , for .

The fcd of , for , is
where and . Note that if cluster is empty, then the fcd of is just .

The fcd of is
where is the number of nonempty clusters.

The fcd of is

The fcd of , for , is
Again, note that if cluster is empty, then the fcd of is just .

The fcd of is
Thus, the Gibbs sampling algorithm in this case is as follows:

For a given value of , choose an initial configuration for each parameter in the model, say , , , , , , and .

Update , , , , , , and until convergence:

Sample from the fcd , for .

Sample from the fcd .

Sample from the fcd , for .

Sample from the fcd

Sample from the fcd

Sample from the fcd , for .

Sample from the fcd


Cycle until convergence is achieved.
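The key new step relative to Section 3 is the update of the cluster assignments from their Categorical fcd. A sketch of that single step is given below, working in log scale for numerical stability; the function and argument names are our own, and we assume the mixture weights and cluster-specific parameters are available from the current Gibbs iteration.

```python
import numpy as np

def sample_assignments(y_groups, X_groups, beta, sigma2, omega, rng=None):
    """One Gibbs update of the cluster assignments xi_1, ..., xi_m.

    p(xi_j = k | rest) is proportional to omega_k times the Normal likelihood
    of group j's data under the cluster-k regression parameters.

    y_groups, X_groups : lists (one entry per group) of responses and designs.
    beta   : (K, p) array of cluster-specific regression parameters.
    sigma2 : (K,) array of cluster-specific variances.
    omega  : (K,) array of mixture probabilities.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(omega)
    xi = np.empty(len(y_groups), dtype=int)
    for j, (yj, Xj) in enumerate(zip(y_groups, X_groups)):
        logp = np.log(omega).copy()
        for k in range(K):
            resid = yj - Xj @ beta[k]
            logp[k] += (-0.5 * len(yj) * np.log(2.0 * np.pi * sigma2[k])
                        - 0.5 * (resid @ resid) / sigma2[k])
        logp -= logp.max()              # stabilize before exponentiating
        p = np.exp(logp)
        p /= p.sum()
        xi[j] = rng.choice(K, p=p)
    return xi
```

Given the sampled assignments, the remaining parameters are updated from their fcd's exactly as listed above.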
5 Further Extensions
In order to gain stochastic flexibility, yet another way of extending the model provided in Section 3 consists in completely relaxing the normality assumption about the response variable by assigning a prior distribution to it. Specifically, we consider a Dirichlet process (DP) mixture model of the form
where is a positive scalar parameter and
is a base distribution function. In this case, the DP generates cumulative distribution functions on
(see for example muller2015bayesian for a formal treatment of the DP). The model given above can also be written as
which makes evident why the can be interpreted as subject-specific random effects. Such an extension is beyond the scope of this work and will be discussed in detail elsewhere.
Other straightforward parametric extensions include considering group-specific effects in a way that , for , , as well as extra model hierarchies, such as letting be an integer random variable ranging from 1 to a fixed large upper bound in a way that , where is a hyperparameter. For clustering tasks, more sophisticated extensions require the specification of nonparametric priors of the form , for , where are weights constructed from a sequence , with for and
. The joint distribution of the set of weights
is called a stick-breaking prior with parameters and . This formulation is connected to the stick-breaking construction of the Poisson-Dirichlet process (pitman1997two). The stick-breaking representation associated with the Dirichlet process is a special case with .
6 Computation
We implement all the models above following the algorithms provided in the corresponding sections entitled Posterior Inference. Throughout, our results are based on samples of the posterior distribution obtained after a burn-in period of 10,000 iterations and thinning the original chains every 10 iterations. In addition, before using the MCMC samples for inferential purposes, we first determine whether there is any evidence of lack of convergence of any chain to its stationary distribution. Following standard practices, we produce log-likelihood traceplots for each model. Fitting the models to the data provided in Section 8, such plots strongly suggest that there are no stationarity concerns since the log-likelihoods stabilize quickly (see Figure 2). Further, autocorrelation plots of model parameters (not shown here) indicate that there are no signs of strong dependence in the chains. Thus, we are confident in using our MCMC samples for inferential tasks.
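The post-processing just described can be sketched in a few lines; the function names are ours, and the defaults mirror the burn-in of 10,000 iterations and thinning every 10 draws used in the text.

```python
import numpy as np

def thin_chain(chain, burn_in=10000, thin=10):
    """Discard the burn-in period and keep every `thin`-th draw."""
    return np.asarray(chain)[burn_in::thin]

def lag1_autocorr(x):
    """Lag-1 sample autocorrelation, a quick check for residual dependence."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(x[:-1] @ x[1:] / (x @ x))
```

Low lag-1 autocorrelation of the thinned chains is consistent with the weak-dependence behavior reported above.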
7 Model Checking and GoodnessofFit
We check the in-sample performance of each model by generating new data from the posterior predictive distribution, and then calculating a battery of test statistics (such as the mean), aiming to compare the corresponding empirical distributions with the actual values observed in the sample
(gelman2013bayesian). In this spirit, for each quantity of interest, we also compute the posterior predictive p-value (ppp), which can be calculated as
where is a predictive dataset and a test statistic, in order to measure how well the model fits the actual sample.
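The Monte Carlo approximation of the ppp described above amounts to the fraction of replicated datasets whose test statistic is at least as extreme as the observed one. A sketch (function and argument names are illustrative):

```python
import numpy as np

def posterior_predictive_pvalue(y_obs, y_reps, stat=np.mean):
    """Monte Carlo estimate of ppp = P(T(y_rep) >= T(y_obs) | y).

    y_obs  : observed data (1-d array).
    y_reps : iterable of replicated datasets drawn from the posterior
             predictive distribution, one per MCMC draw.
    stat   : test statistic T (the mean by default).
    """
    t_obs = stat(y_obs)
    t_rep = np.array([stat(yr) for yr in y_reps])
    return float(np.mean(t_rep >= t_obs))
```

Values near 0 or 1 flag aspects of the data that the model fails to reproduce, while values near 0.5 indicate adequate in-sample behavior for that statistic.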
Finally, in order to assess the goodness-of-fit of each model as a measure of its predictive performance, in what follows we consider two metrics that account for both model fit and model complexity. The goal here is not necessarily to pick the model with the lowest estimated prediction error, but to determine whether improvements in fit are large enough to justify the additional complexity. That is why such measures also serve as model-selection tools. The model-based literature has largely focused on the Bayesian Information Criterion (BIC) as a mechanism for model selection. However, the BIC is typically inappropriate for hierarchical models since the hierarchical structure implies that the effective number of parameters will typically be lower than the actual number of parameters in the likelihood (gelman2014understanding). Two popular alternatives to the BIC that address this issue are the Deviance Information Criterion (spiegelhalter2002bayesian; spiegelhalter2014, DIC),
with , and the WatanabeAkaike Information Criterion (watanabe2010asymptotic; watanabe2013widely, WAIC),
with , where is the posterior mean of the model parameters, and and are penalty terms accounting for model complexity. Note that in the previous expressions all expectations, which are computed with respect to the posterior distribution, can be approximated by averaging over Markov chain Monte Carlo (MCMC) samples (see Section 6 for details). In the next section we adopt a standard approach and use the DIC.
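Assuming per-draw log-likelihood evaluations are available from the MCMC output, the criteria above can be approximated as follows. This is a sketch using the standard definitions (DIC with p_DIC = 2[log p(y | posterior mean) − E log p(y | theta)], and WAIC with the variance-based penalty); the function and argument names are ours.

```python
import numpy as np

def dic(loglik_at_mean, loglik_draws):
    """DIC from MCMC output.

    loglik_at_mean : total log-likelihood evaluated at the posterior mean.
    loglik_draws   : total log-likelihood, one value per posterior draw.
    """
    p_dic = 2.0 * (loglik_at_mean - np.mean(loglik_draws))
    return -2.0 * loglik_at_mean + 2.0 * p_dic

def waic(pointwise_loglik):
    """WAIC from an (S draws x n observations) matrix of log-likelihoods."""
    ll = np.asarray(pointwise_loglik)
    m = ll.max(axis=0)
    # log pointwise predictive density via a stabilized log-mean-exp
    lppd = np.sum(m + np.log(np.mean(np.exp(ll - m), axis=0)))
    p_waic = np.sum(ll.var(axis=0, ddof=1))   # effective number of parameters
    return -2.0 * (lppd - p_waic)
```

Both functions only require quantities already produced by the Gibbs samplers of Sections 2-4, so no extra model fitting is needed.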
8 Illustration
Nitrogen is an essential macronutrient needed by all plants to thrive. It is an important component of many structural, genetic, and metabolic compounds in plant cells. Increasing the levels of nitrogen during the vegetative stage can strengthen and support the plant roots, enabling them to take in more water and nutrients. This allows a plant to grow more rapidly and produce large amounts of succulent, green foliage, which in turn can generate bigger yields, tastier vegetables, and a crop that is more resistant to pests, diseases, and other adverse conditions. Using too much nitrogen, however, can be just as harmful to plants as too little. A researcher took measurements of nitrogen soil concentration () and plant sizes () within each of farms. Thus, we have that and are the plant size and the nitrogen soil concentration values, respectively, associated with the th plant from the th farm, and .
8.1 Exploratory Data Analysis
A histogram of the plant size is shown in the first panel of Figure 3. The plant size ranges from 76.56 to 117.50, which seems quite large in comparison with the plant size range within each farm (see the bottom panel). The second and third panels show the relationship between the plant size and the nitrogen soil concentration. In particular, the second panel exhibits the ordinary least squares (OLS) regression line for these data, which is given by
with a standard error of
with 118 degrees of freedom and an adjusted R-squared of
. The third panel presents the same plot but taking into account the farm to which the measurements belong. These panels along with the OLS fit indicate two main features of this experiment: (1) there is an important relationship between the nitrogen soil concentration and the plant size (in our OLS fit the slope turns out to be significant, value ); and (2) there is a clear farm effect on plant size, since the colors in the scatter plot reveal clustering patterns, and the corresponding boxplots strongly suggest differences in mean plant growth among farms.
8.2 Normal Linear Regression Model
We fit the linear regression model given in Section 2 to these data without taking into account the farm information. Posterior summaries of the model parameters are provided in Table 1. Even though the posterior mean of the model parameters practically coincide with their corresponding OLS estimates, these results are again quite limited since they do not allow us to isolate any kind of effect over the plant size arising from the grouping factor. Such an impossibility strongly motivates the hierarchical approaches that we present in this paper.
Parameter    Mean    SD     Q2.5%   Q97.5%
             92.59   3.60   85.53   99.61
              0.36   0.18    0.01    0.70
             72.89   9.67   56.35   94.07
8.3 Hierarchical Normal Linear Regression Model
Now we go further and fit the hierarchical linear regression model given in Section 3 to the plant size data, considering both group-specific fixed effects and group-specific variances. Such a group-specific approach is very convenient because it allows us to carry out separate-group inferences, as opposed to its nonhierarchical counterpart.
95% quantile-based credible intervals and posterior means (squares) for the fixed-effects regression parameters in the hierarchical Normal linear regression model. Colored thicker lines correspond to credible intervals that do not contain zero. Top panel: (intercepts). Bottom panel: (slopes).
We present our main results in Figures 4 and 5, where we display 95% quantile-based credible intervals for and , respectively. At this point we are able to highlight some important findings. The uncertainty about the group-specific parameters is not uniform, since the amplitude of the credible intervals clearly varies across farms. This effect is particularly evident for the variance components. In addition, point estimates (posterior means) of the group-specific parameters are also quite variable, which strongly suggests that for these data this approach is beneficial because it allows us to characterize each farm's own unique features. However, some farms show signs of similar features, which was also evident in Figure 3. We explore clustering patterns in the next subsection.
As expected, we see that all the intercepts are statistically significant, but also highly variable (ranging from 61.48 to 106.81). On the other hand, the story behind the slopes (ranging from 0.08 to 1.04) is quite different. We see that just 9 out of 24 (37.5%) of such parameters turn out to be significant, namely, for farms 1, 7, 9, 15, 18, 20, 21, 22, and 23 (Figure 6 shows the corresponding estimated regression lines for these farms). The previous fact strongly suggests that the relevance of the relationship between the nitrogen soil concentration and the plant size is not consistent across farms. Therefore, in this case, considering farms as a grouping factor makes a substantial impact on the analysis.
8.4 Clustering Hierarchical Normal Linear Regression Model
It is quite reasonable to attempt to identify clusters composed of farms, given the abundant evidence of similarities among groups and cluster formation detected in both the exploratory data analysis and the hierarchical modeling stage of the analysis. Here we fit the clustering hierarchical linear regression model given in Section 4 in order to provide a formal partition of farms. To this end, we fit the model using as a default number of clusters. Such an extreme case represents the prior belief that there are no clustering patterns at all. A moderately large value of is convenient in situations like this because it allows the data to determine how many nonempty clusters should be considered. We will see in what follows that out of the clusters, many turn out to be empty and only a few remain.
Let be the number of nonempty clusters in the partition induced by . The left panel of Figure 7 shows the posterior distribution of . We see that the most probable values are and . In fact, , which means that around three quarters of the posterior partitions are composed of either 7 or 8 clusters of farms. The estimated number of nonempty clusters in the data is therefore (maximum a posteriori, with a reference value very close to 0.4).
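The posterior distribution of the number of nonempty clusters is obtained directly from the sampled assignment vectors; a small sketch (the function name is ours):

```python
import numpy as np

def n_occupied_clusters(xi_draws):
    """Posterior draws of K = number of nonempty clusters.

    xi_draws : (S, m) array of sampled cluster assignments,
               one row per MCMC draw.
    """
    return np.array([len(np.unique(xi)) for xi in np.asarray(xi_draws)])
```

Tabulating the resulting draws (e.g., with a frequency table) yields the posterior probabilities displayed in the left panel of Figure 7.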

On the other hand, the right panel in Figure 7 shows the incidence matrix obtained from the posterior distribution of the cluster assignments . The incidence matrix is a pairwise-probability matrix whose elements are given by , for (note that for all ). Thus,
simply represents the posterior probability that farms
and belong to the same cluster. Such probabilities are indeed identifiable; however, the labels themselves are not, since the likelihood is invariant to relabeling of the mixture components (this is known as the label switching problem; see stephens2000dealing and references therein). On top of the incidence matrix, we also present a point estimate of the partition induced by such a matrix (represented by black lines), which can be obtained by employing the clustering methodology proposed in lau2007bayesian with a relative error cost of 0.5. As expected, we see that eight clusters are formed from the data. The corresponding cluster sizes are 6, 6, 5, 2, 2, 1, 1, and 1. Specifically, the clusters are composed of the following farms: , , and , , , , , and . These clusters along with their estimated regression lines are also represented in Figure 8. We see that the estimated clusters make sense spatially according to the data points. The slope parameters turn out to be significant for every cluster.
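The incidence matrix just described can be estimated from the MCMC draws of the cluster assignments by averaging pairwise agreement indicators over draws; note that this estimate is immune to label switching, since it only asks whether two farms share a label, not which label. A sketch (the function name is ours):

```python
import numpy as np

def incidence_matrix(xi_draws):
    """Pairwise co-clustering probabilities Pr(xi_j = xi_j' | data).

    xi_draws : (S, m) integer array of sampled cluster assignments,
               one row per MCMC draw.
    """
    xi = np.asarray(xi_draws)
    S, m = xi.shape
    A = np.zeros((m, m))
    for s in range(S):
        # outer comparison: 1 where two groups share a label in this draw
        A += (xi[s][:, None] == xi[s][None, :])
    return A / S
```

The resulting matrix is what the point-estimation methodology of lau2007bayesian takes as input to produce the partition shown in Figure 7.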