Bayesian linear regression models with flexible error distributions

This work introduces a novel methodology based on finite mixtures of Student-t distributions to model the errors' distribution in linear regression models. The novelty lies on a particular hierarchical structure for the mixture distribution in which the first level models the number of modes, responsible to accommodate multimodality and skewness features, and the second level models tail behavior. Moreover, the latter is specified in a way that no degrees of freedom parameters are estimated and, therefore, the known statistical difficulties when dealing with those parameters is mitigated, and yet model flexibility is not compromised. Inference is performed via Markov chain Monte Carlo and simulation studies are conducted to evaluate the performance of the proposed methodology. The analysis of two real data sets are also presented.



There are no comments yet.


page 1

page 2

page 3

page 4


Moment-based Estimation of Mixtures of Regression Models

Finite mixtures of regression models provide a flexible modeling framewo...

Flexible and objective time series analysis: a loss-based approach with two-piece location-scale distributions

Two-piece location-scale models are used for modeling data presenting de...

Semiparametric inference for the scale-mixture of normal partial linear regression model with censored data

In the framework of censored data modeling, the classical linear regress...

A Bayesian Ensemble Regression Framework on the Angry Birds Game

An ensemble inference mechanism is proposed on the Angry Birds domain. I...

A Gentle Introduction to Bayesian Hierarchical Linear Regression Models

Considering the flexibility and applicability of Bayesian modeling, in t...

Hunting for Discriminatory Proxies in Linear Regression Models

A machine learning model may exhibit discrimination when used to make de...

Inference for Network Regression Models with Community Structure

Network regression models, where the outcome comprises the valued edge i...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Density estimation and modeling of heterogeneous populations through finite mixture models have been highly explored in the literature (Frühwirth-Schnatter, 2006; McLachlan and Peel, 2000)

. In linear regression models, the assumption of identically distributed errors is not appropriate for applications where unknown heterogeneous latent groups are presented. A simple and natural extension to capture mean differences between the groups would be to add covariates in the linear predictor capable of characterizing them. However, that may not be enough to explain the source of heterogeneity often presented in the data. Furthermore, differences may also be presented in skewness, variance and tail behavior. Naturally, the usual normality assumption is not appropriate in those cases.

Over the years, many extensions of the classical normal linear regression model, such the Student-t regression (Lange et al., 1989), have been proposed. In practice, the true distribution of the errors is unknown and it may be the case that single parametric family is unable to satisfactorily model their behavior. One possible solution is to consider a finite mixture of distributions, which are a natural way to detect and model some unobserved heterogeneity. Conceptually, this may be seen as a semiparametric approach for linear regression modelling. Initial works in this direction were proposed in Bartolucci and Scaccia (2005) and Soffritti and Galimberti (2011)

by assuming a mixture of Gaussian distributions to model the errors. A more flexible approach was presented in

Galimberti and Soffritti (2014) with a finite mixture of (multivariate) Student-t distributions.

From a modelling perspective, mixtures of more flexible distributions like the Student-t, skew-normal, skew-t, may be preferred over mixtures of Gaussians when the data exhibit mutimodality with significant departure from symmetry and Gaussian tails. Basically, the number of Gaussian components in the mixture to achieve a good fit may defy model parsimony. On the other hand, more flexible distributions usually have specific parameters with statistical properties that hinder the inference procedure (see Fernandez and Steel, 1999; Fonseca et al., 2008; Villa et al., 2014). A common solution for that is to impose restrictions on the parametric space, which goes in the opposite direction of model flexibility. For example, in Galimberti and Soffritti (2014), inference is performed via maximum likelihood using the EM algorithm with restrictions on the degrees of freedom parameters.

The aim of this paper is to propose a flexible model for the errors in a linear regression context but that, at the same time, is parsimonious and does not suffer from the known inference problems related to some of the parameters in the distributions described above. Flexibility and parsimony are achieved by considering a finite mixtures of Student-t distributions that, unlike previous works, considers a separate structure to model multimodality/skewness and tail behavior. Inference problems are avoided by a model specification that does not require the estimation of the degrees of freedom parameter and, at the same time, does not lose model flexibility. That is achieved due to the fact that arbitrary tail behaviors of the Student-t can be well mimicked by mixtures of well-chosen Student-t distributions with fixed degrees of freedom. Inference for the proposed model is performed under the Bayesian paradigm through an efficient MCMC algorithm.

This paper is organised as follows: the proposed model is presented in Section 2 and the MCMC algorithm for inference is presented in Section 3 along with the discussion of some computational aspects of the algorithm. Simulated examples are presented in Section 4 and the results of the analysis of two real data sets are shown in Section 5. Finally, Section 6 presents some concluding remarks and proposals of future works.

2 Linear regression with flexible errors’s distribution

2.1 Motivation

Finite mixture models have great flexibility to capture specific properties of real data such as multimodality, skewness and heavy tails and has recently been used in a linear regression context to model the errors’ distribution. Bartolucci and Scaccia (2005) and Soffritti and Galimberti (2011) were the first to consider a finite mixture to model the errors in linear regression models. More specifically, they proposed a finite mixture of -dimensional Gaussian distributions. Galimberti and Soffritti (2014) extended that model by assuming that the errors follow a -dimensional Student-t distribution, i.e.

where denotes the density of the

-dimensional Student-t distribution with mean vector

, positive definite dispersion matrix and degrees of freedom .

Recently, Benites et al. (2016) considered regression models in which the errors follow a finite mixture of scale mixtures of skew-normal (SMSN) distributions, i.e.

where corresponds to some distribution that belongs to the SMSN class, with location parameter , dispersion , skewness and degree of freedom , for .

Mixtures of Gaussian distributions may not be the most adequate choice when the error terms present heavy tails or skewness. The simultaneous occurrence of multimodality, skewness and heavy tails may require a large number of components in the mixture which, in turn, may compromise model parsimony. The mixture models proposed by Galimberti and Soffritti (2014) and Benites et al. (2016)

are more flexible and remedies the problem of dealing with outliers, heavy tails and skewness in the errors. However, both models have a constraint with respect to the estimation of the degrees of freedom parameters, once the estimation of these parameters are known to be difficult and computationally expensive. Thus, for computational convenience, they assume that

. In practice assuming the same degree of freedom for all components of the mixture can be quite restrictive, since a single may not be sufficient to model the tail structure in the different components of the model. Another important point to notice is that the number of components needed to accommodate multimodality and skewness may be different from the number of components to model tail behavior. This is our motivation to propose a flexible and parsimonious mixture model that does not suffer from estimation problems and have an easier interpretation for the model components.

2.2 The proposed model

Define the dimensional response vector , the design matrix and the dimensional coefficient vector . Consider a -dimensional weight vector and -dimensional weight vectors , such that and , ; and , with fixed and known for all .

Consider the linear regression model


where . We propose the following finite mixture model for the error terms distribution:


where denotes the Student-t distribution with mean , dispersion and degrees of freedom . Model identifiability is achieved by setting mean zero to the erros, i.e. .

The model in (2) is quite general and includes as a submodel the one propose in Galimberti and Soffritti (2014), in a univariate context. That occurs for and

being the identity matrix. Moreover, if

, the model in (2) results in the linear regression model with Student-t errors propose by Lange et al. (1989).

Given the expressions (1) and (2

), the probability density function of the response vector

is given by


where . For computational reasons, we reparametrise the model by making , with the ’s the unrestricted means of the erros’ mixture. The original parametrisation can be easily recovered by making and .

We also consider a set of auxiliary variables to easy the MCMC computations, leading to the following hierarchical representation of the proposed model:


where , and denote the Gaussian, gamma and multinomial distributions, respectively; ; and are the latent mixture component indicator variables. Variable defines the mixture component of the th individual in terms of the location and scale parameters and defines the component in terms of the degrees of freedom parameter.

Interpretation of the model in (3) may be facilitated by considering the following hierarchical representation of the mixture distribution in which the first level models multimodality/skewness and the second level models tail behavior.


The mixture of gamma distributions for

in (9) gives a nonparametric flavor to the model since particular choices of this distribution lead to specific marginal distributions for (see Andrews and Mallows, 1974). The 2-level representation allows visualising the two possible sources of unobserved heterogeneity in . The two sources of heterogeneity come, respectively, from (8) and (9).

The main advantages of the proposed model are:

  1. The mixture in models the multimodal and/or skewness behavior and can, therefore, have the number of mixture components fixed at reasonable values, based on an empirical analysis of the data;

  2. The separate modelling for tail bahavior may avoid over-parametrisation of the model (too many degrees of freedom parameters) due to the need for several modes to capture multimodality/skewness;

  3. Mixing different degrees of freedom, and estimating the respective weights, brings flexibility to the model in the sense that it is capable of capturing a variety of tail structures;

  4. The tail behavior is estimated without the need to estimate degrees freedom parameters. Besides avoiding the known inference difficulties associated to those parameters, our model’s parsimony is less penalised with a increase in .

A particular case of the model defined in (3) occurs when (model without covariates) and no restriction is imposed to the ’s. In this case, has the following probability density function:


This iid mixture model can be used for cluster analysis or density estimation. Moreover, for

and being the identity matrix, the model results in an ordinary mixture of Student-t distributions (Peel and McLachlan, 2000).

2.3 The choice of

In this section we present a strategy on how to choose the values of the degrees of freedom parameters , which are fixed in the proposed model, based on the Kullback-Liebler divergence (KLD, Kullback and Leibler, 1951). The KLD between two continuous probability measures is defined as

We consider the KLD between the standard Gaussian and Student-t distributions for different degrees of freedom. Figure 1 shows the KLD for different values of . The similarity between those distributions when the degrees of freedom increase gives the intuition to why this is an statistically complicated parameter.

Figure 1: Kullback-Liebler divergence between the standard Gaussian Student-t distributions with degrees of freedom.

An exploratory study, not reported here, strongly suggests that a mixture of two standard Student-t distributions with d.f. and can approximate quite well a standard Student-t distribution with d.f. if and are not too far from each other in the KLD scale. Based on that we suggest the following strategy to choose the values of :

  1. Choose the minimum and maximum values of the parameters - . The maximum should typically be between 8 and 15 and the minimum should be chosen based on how heavy the tails are believed to be. For example, one should choose this to be greater than 2 in most of the cases, so that the model has finite variance.

  2. Choose the value of . Based on the exploratory study just mentioned, we suggest or .

  3. Compute the values of the remaining ’s such that they are equally spaced in the KLD scale, i.e. , . For example, for , and , we get .

3 Bayesian Inference

The Bayesian model is fully specified by (4)-(7) and the prior distribution for the parameter vector . Prior specification assumes independence among all the components of , except for . The following prior distributions are adopted: ; ; , ; and , where and denote the normal-inverse-gamma and Dirichlet distributions, respectively.

3.1 Mcmc

The blocking scheme for our Gibbs sampler is chosen in a way to get the larger possible blocks for which we can directly simulate from the respective full conditional distributions. The following blocks are chosen:


where and are matrices of dimensions and , respectively.

All the full conditional distributions are derived from the joint distribution density of all random components in the model, which is given by

Details about how to sample from each of those distributions are presented in in Appendix A of the Supplementary Material.

3.2 Prediction

Prediction for unobserved configurations of the covariates is a typical aim in a regression analysis. Suppose we want to make prediction for the covariate values in the matrix

. In an MCMC context, the output of the algorithm can be directly used to obtain a sample from the posterior predictive distribution of

. That is achieved by adding the following steps to each iteration of the Gibbs sampler after the burn-in:

  • Let be the state of the chain at the th iteration. For each :

  1. Sample from (5);

  2. Sample .

4 Simulated Studies

Two simulated studies are conducted aiming at evaluating the performance of the proposed approach. For both studies we assume or and set and , which leads to and .

In the first study the data is simulated from the proposed model and, in the second one, from an ordinary mixture of Skew-t distributions. We consider sample sizes , and with objective to evaluate if exist impact of sample size when we study the tail behavior of the distribution that generated the data. Three models are fit to each simulated sample: 1. the proposed model with (Mt-p1); 2. the proposed model with (Mt-p2); 3. an ordinary mixture of Student-t distributions with 2 components in the first study and 4 in the second one (Mt). Inference for the third model is performed via MCMC by appropriately adapting our algorithm in terms of , and , and by adding a step to sample the degrees of freedom parameters. This sampling step is proposed in Gonçalves et al. (2015). Furthermore, the penalised complexity prior (PC prior) from Simpson et al. (2017) is adopted for (see Appendix B of the Supplementary Material).

Comparison among the three models is performed in terms of posterior statistics of the regression coefficients and of the errors’ variance, which is given by:

where and and are the mean and variance of the Student-t distribution in th component of the mixture. We consider the following three statistics: bias, and , where is the true value of .

We also define the following percentual variation measure to compare the fitted error distribution in each model:

For a grid in the sample space of the error, where is to the true density of the data and is the posterior mean of the error’s density under the fitted model. The comparison is performed globally and in the tails of the distribution. In the latter, we use consider a grid of points below th and above th percentiles.

Initial values for , and are obtained through the R package mixsmsn (Prates et al., 2013). Matrix is initialised assuming the same probability for all elements and for

we use the ordinary least square estimates. We also set the hyperparameter values

, , and

. A uniform distribution on the respective simplex is assumed for

and .

All the MCMC chains run for iterations with a burn-in of . A lag was defined to maximise the effective sample size of the log-posterior density. The algorithm was implemented using the R software (R Core Team, 2017).

4.1 Errors generated from the proposed mixture model

Data is generated from the proposed mixture model with , where , , , , and . We consider , generated from and , and . Parameters and are chosen to satisfy the identifiability constraint . The following models are fitted:

  1. Mt-p1: , and ;

  2. Mt-p2: , and ;

  3. Mt: 2 mixture components.

Table 1 shows that the estimates for the regression coefficients are quite similar to the true values for both models for sample size . Therefore, it is reasonable to focus on the comparison between both models on the variance of the errors. Similar results are obtained for sample sizes and .

Mt-p1 1.019 [0.903, 1.144] -1.993 [-2.060, -1.924] - 0.612 (0.642, 0.244, 0.114)
0.877 [ 0.635, 1.114] - 0.388 (0.497, 0.329, 0.174)
Mt-p2 1.018 [0.889, 1.133] -1.995 [-2.060, -1.926] - 0.615 (0.465, 0.275, 0.161, 0.099)
0.875 [0.647, 1.134] - 0.385 (0.339, 0.284, 0.239, 0.138)
Mt 1.025 [0.896, 1.140] -1.995 [-2.058, -1.931] 2.913 [2.441, 3.366] 0.611 -
0.866 [0.604, 1.080] 2.910 [2.359, 3.406] 0.389 -
Table 1: Posterior results for sample of size 2500 and 95% high posterior density (HPD) when data are generated from the proposed model.

Table 2 shows the results concerning the comparison criteria previously described. The proposed model performs better than ordinary mixture of Student-t model for all the sample size. The posterior results for variance and MSE for Mt model are greatly impacted by the sample size and, even for the largest sample size, it still presents a considerable difference in the variance and MSE in comparison to Mt-p models. The same thing happens regarding the distances and . This result is explained by the fact that the posterior variance of the degrees of freedom parameters in the Mt model inflates the posterior variability of all the quantities that depend on those parameters. Finally, note that differences between the results for Mt-p1 and Mt-p2 are small enough to suggest that there is no great impact on the choice of K.

n Model bias var MSE
Mt-p1 -0.486 0.074 0.313 0.240 0.285
500 Mt-p2 -0.533 0.065 0.349 0.269 0.325
Mt -0.129 5.266 5.283 0.325 0.403
Mt-p1 -0.115 0.063 0.076 0.082 0.095
1000 Mt-p2 -0.148 0.056 0.078 0.104 0.130
Mt -0.295 0.338 0.425 0.268 0.388
Mt-p1 0.122 0.031 0.045 0.116 0.155
2500 Mt-p2 0.044 0.032 0.034 0.054 0.068
Mt 0.404 0.282 0.445 0.213 0.287
Table 2: Bias, variance and MSE of the posterior results for the error variance in each mixture model when data are generated from the proposed model

4.2 Errors generated from a mixture of Skew-t distributions

Data is generated from a regression model in which the erros follow a mixture of Skew-t distributions with 2 components and , , , (skewness parameter) and . We consider , generated from and , and . The following models are fitted:

  1. Mt-p1: , and ;

  2. Mt-p2: , and ;

  3. Mt: 4 mixture components.

The same conclusion obtained from Tables 1 and 2 in the first study are valid for Table 3 and 4, respectively.

0.626 [0.563, 0.686] -2.032 [-2.071, -1.987] - 0.148 (0.506, 0.355, 0.139)
Mt-p1 1.009 [0.926, 1.096] - 0.291 (0.255, 0.305, 0.440)
- 0.304 (0.237, 0.257, 0.506)
- 0.257 (0.414, 0.454, 0.132)
0.621 [0.559, 0.679] -2.034 [-2.079, -1.992] - 0.091 (0.309, 0.284, 0.258, 0.149)
Mt-p2 1.002 [0.923, 1.087] - 0.211 (0.240, 0.230, 0.253, 0.277)
- 0.332 (0.168, 0.223, 0.204, 0.405)
- 0.366 (0.388, 0.295, 0.209, 0.108)
0.633 [0.572, 0.700] -2.031 [-2.063, -1.991] 7.443 [2.324, 13.485] 0.174 -
Mt 1.009 [0.934, 1.092] 7.546 [2.545, 13.711] 0.369 -
7.580 [2.442, 13.575] 0.211 -
7.534 [2.168, 13.482] 0.246 -
Table 3: Posterior results for sample of size 2500 and 95% high posterior density (HPD) when data are generated from the mixture of Skew-t.

We highlight the fact that the proposed methodology is efficient to simultaneously accommodate multimodality, skewness and heavy tails, with the advantage of not having to estimate the degrees of freedom parameters.

n Model bias var MSE
Mt-p1 -0.255 0.134 0.198 1.286 1.894
500 Mt-p2 -0.289 0.125 0.209 1.353 2.361
Mt 1.096 11.285 12.486 1.571 2.634
Mt-p1 0.103 0.197 0.208 0.873 0.935
1000 Mt-p2 -0.061 0.165 0.169 0.671 0.650
Mt 0.065 5.158 5.162 0.972 1.107
Mt-p1 -0.036 0.053 0.054 0.465 0.438
2500 Mt-p2 -0.052 0.051 0.053 0.272 0.262
Mt -0.144 0.086 0.107 0.764 0.918
Table 4: Bias, variance and mean square error (MSE) of the posterior results for the error variance in each mixture model when data are generated from a mixture of Skew-t.

Figure 2 shows the computational cost of the MCMC for each model as a function of the sample size, for both simulated studies. Note how the cost was uniformly smaller for the proposed model.

Figure 2: Computational cost (in seconds) of the MCMC for each model in Study 1 (a) and Study 2 (b).

5 Application

We analyse two real data sets. The first one has been considered in several previous works and consists on the velocity of 82 galaxies (in thousands of kilometers per second) located in the Corona Borealis constellation. The second one refers to the national health and nutrition examination (NHANES) survey conducted every year by US National Center for Health Statistics.

For the first application we assume the proposed model without covariates and, in both analysis, we also fit an ordinary mixture of Students-t distributions. We consider to fit the proposed model with .

Model comparison is performed via DIC (Spiegelhalter et al., 2002). An approximation of the DIC can be obtained using the MCMC sample from the posterior distribution. We have that , where and .

5.1 Galaxies velocity

The data set is available in the R package MASS (Ripley et al., 2013) and displayed in Figure 3. This data set was previously modelled in the literature by a mixture of 6 Gaussian components (Carlin and Chib, 1995; Richardson and Green, 1997; Stephens, 1997). The data set has mean and variance equal to and exhibit some clear outliers.

Figure 3: Histogram of the galaxy data.

Table 5 shows the DIC criterion for both models assuming and , which refers to the total number of mixture components in the Mt models. Based on the DIC the best fit is with in both models, with the MT-p model having the smaller DIC.

Table 5: DIC selection criterion for different values of in the fitting of the Mt-p and Mt models to galaxy data.

Table 6 shows some posterior statistics when . Results are similar for both models regarding , and . The Mt model presented estimated values of between 3.88 and 4.64. For the Mt-p model, the values contributes with approximately 25% of the weight in all four components. This result suggests that it is important to consider values that characterise heavy tails. Additionally, values and 14.4 contribute together with approximately 50% of the weight in all components.

9.715 [ 9.086, 10.373] 0.740 [0.186, 1.648] - 0.086 (0.233, 0.243, 0.247, 0.277)
Mt-p 19.901 [19.329, 20.415] 0.817 [0.220, 1.691] - 0.434 (0.264, 0.263, 0.247, 0.226) 21.604
22.978 [22.118, 23.682] 1.898 [0.562, 3.788] - 0.443 (0.266, 0.249, 0.255, 0.230)
32.772 [30.820, 34.904] 2.733 [0.285, 10.144] - 0.037 (0.246, 0.248, 0.252, 0.254)
9.722 [9.162, 10.462] 0.734 [0.146, 1.709] 3.882 [2.014, 7.008] 0.086 -
Mt 19.901 [19.506, 20.483] 0.798 [0.197, 1.764] 3.896 [2.014, 7.475] 0.433 - 24.917
22.892 [22.023, 23.837] 1.947 [ 0.470, 4.041] 3.930 [2.011, 7.565] 0.445 -
32.759 [29.487, 35.194] 2.713 [0.291, 9.458] 4.644 [2.012, 10.537] 0.036 -
Table 6:

Posterior results for the galaxy analysis. The posterior mean and the 95% HPD credibility interval are presented.

Figure 4 confirms the information in Table 6 that the results are quite similar for both models. Greater differences were observed for the estimates of the variance of and in the computational cost - seconds for Mt-p and for Mt).

Figure 4: Histogram of empirical distribution of the galaxy data with fitting curves.

5.2 US National Health and Nutrition Examination Survey

The data set is available in the R package NHANES (Pruim, 2015) and refers to the survey carried out between 2011/2012. The data contain information on 76 variables describing demographic, physical, health, and lifestyle characteristics of 5000 participants. Lin et al. (2007) and Cabral et al. (2008) analysed the data from this study for 1999/2000 and 2001/2002 and restricted the sample to male participants only. Lin et al. (2007) assumed a mixture of skew-t distributions to estimate the density of the participants body mass index, whereas Cabral et al. (2008) used a mixture of skew-t-normal for density estimation.

We consider the information regarding the weight in kilograms (response variable), age in years, sex and diabetes information (0-No; 1-Yes) of the participants. Participants who did not have information on at least one of the considered variables were previously removed from the database. The sample used for analysis contains 4905 participants, however 5% of the observations are randomly selected for prediction. Thus, the final sample contains 4660 participants. Figure

5(a) shows the empirical distribution of the weight of participants and exhibits a multimodal behavior. Residuals from a normal linear regression fit are positively skewed (Figure 5(b)).

Figure 5: (a)Histogram of empirical distribution of the weight in kilograms e (b) Histogram of ordinary residuals.

To model the source of unobserved heterogeneity present in Figure 5(b) we applied the proposed approach to model the errors distribution and us comparer the posterior results with the mixture of t components (Mt). We assumed the model with , 3 and 4 components and based on the DIC the best fit to the data is considering a mixture with components (Table 7).

Model J=2 J=3 J=4
Table 7: DIC selection criterion for different values of in the fitting of the Mt-p and Mt models to NHANES data.

Table 8 shows the posterior statistics for . Results are substantially different for both models. The posterior results suggest that the average weight of participants in the fourth component is around to 95 kg () for the Mt-p model and 84 kg for the Mt model. The posterior mean for in the Mt model is almost times larger than in the Mt-p model. Note that coefficient is less significant in model Mt-p. Results are also quite different for the tail behavior estimation.

-30.586 [-31.949, -29.303] 27.479 [15.473, 41.217] 33.154 [28.241, 38.611] 0.642 [0.609, 0.675] - 0.141 (0.146, 0.196, 0.231, 0.427)
Mt-p -4.064 [-6.263, -1.986] 173.528 [121.409, 225.631] 6.098 [4.561, 7.742] - 0.580 (0.027, 0.036, 0.055, 0.882)
22.012 [17.254, 26.459] 190.682 [77.107, 284.778] 4.319 [-0.305, 8.090] - 0.261 (0.204, 0.233, 0.313, 0.250)
62.097 [43.174, 70.384] 118.679 [1.420, 393.123] - 0.018 (0.245, 0.255, 0.264, 0.236)
-30.341 [-32.073, -28.768] 36.726 [20.970, 50.130] 31.785 [26.787, 37.580] 0.629 [0.592, 0.664] 7.485 [2.948, 12.535] 0.162 -
Mt -5.202 [-7.428, -3.046] 124.174 [84.606, 169.551] 7.146 [5.260, 8.713] 7.456 [3.056, 12.463] 0.473 -
17.501 [12.917, 22.885] 198.705 [95.912, 305.835] 4.683 [0.260, 8.462] 7.409 [2.979, 12.492] 0.327 -
51.834 [39.405, 68.744] 436.010 [1.801, 742.636] 7.341 [2.290, 12.761] 0.038 -
Table 8: Posterior results for the NHANES analysis. The posterior mean and 95% HPD credibility interval are presented.

We also compare the models in terms of prediction capability. We consider the root of the prediction mean square error (RMSE), absolute mean error (MAE) and the relative error (RE). Results are presented in Table 9 and indicate a slightly better performance for the Mt-p model. We also consider the HPD predictive intervals which are, on average, smaller for the Mt-p model (see Figure 6 and range in Table 9). The computational cost for the Mt-p model is smaller than the cost for the Mt model.

Model RMSE MAE RE range (median)
Mt-p 22.991 17.571 0.356 123.0 (122.7)
Mt 23.007 17.574 0.359 123.4 (123.1)
Table 9: Prediction analysis for the Mt-p and Mt models to NHANES data.

Figure 6: HPD intervals of 99% for the first 50 observations predicted by Mt-p and Mt models.

6 Conclusions

This paper proposes a Bayesian model based on finite mixtures of Student-t distributions to model the errors in linear regression models. The proposed methodology considers separate structures to model multimodality/skewness and tail behavior. The two-level mixture facilitates interpretation and data modeling since it considers that the required number of components to accommodate multimodality and skewness may differ from the number of components to model the tail structure. In addition, the tail modeling does not involve degree of freedom parameters estimation, which improves the precision of estimates and the computational cost. The methodology also includes the case with no covariates for density estimation and clustering. Morevoer, the proposed MCMC algorithm may be adapted to perform inference in ordinary mixture of Student-t models including the estimation of the degrees of freedom parameters.

The performance of the proposed methodology was evaluated through simulation studies and applications to real data sets. Results illustrated the flexibility of the model to simultaneously capture the different structures presented in the errors of the regression model. It is important to emphasise that the complexity resulting from the estimation of the weights associated to the parameters is much lower in comparison to the estimation of the degrees of freedom parameter, whose estimation process is computationally expensive and problematic.

Future work may consider estimation of the number of components and the extension for multivariate and censored data with heavy tails.


Appendix A MCMC details

The sampling step for is based on the following factorisation:

which suggests the following algorithm:

  1. Sample the ’s independently from

    where , with .

  2. Sample the ’s independently from

    where , with and .

  3. Sample the ’s independently from

The full conditional distributions of and are given by

The full conditional distributions of and are given by:

where with


where ; is the dimensional vector with entries ; is the Hadamard product and is the identity matrix with dimension .

Appendix B PC priors for

The PC prior from Simpson et al. (2017) was constructed to prefer a simpler model and penalise the more complex one . To do so, it defines a measure of complexity

, based on the Kullback-Leibler divergence. The exponential prior is set for the measure

and thus the prior distribution of is given by

A nice feature of this prior is that the selection of an appropriate is done by allowing the researcher to control the prior tail behavior of the model. For more details see Simpson et al. (2017).


  • Andrews and Mallows (1974)

    Andrews, D. F., Mallows, C. L., 1974. Scale mixtures of normal distributions. Journal of the Royal Statistical Society. Series B - Methodological

    36 (1), 99–102.
  • Bartolucci and Scaccia (2005) Bartolucci, F., Scaccia, L., 2005. The use of mixtures for dealing with non-normal regression errors. Computational Statistics & Data Analysis 48 (4), 821–834.
  • Benites et al. (2016) Benites, L., Maehara, R., Lachos, V. H., 2016. Linear regression models with mixture of skew heavy-tailed errors. Preprint 06/2016, Universidade Estadual de Campinas.
  • Cabral et al. (2008) Cabral, C. B., Bolfarine, H., Pereira, J. R. G., 2008. Bayesian density estimation using skew student-t-normal mixtures. Computational Statistics & Data Analysis 52 (12), 5075–5090.
  • Carlin and Chib (1995) Carlin, B. P., Chib, S., 1995. Bayesian model choice via markov chain monte carlo methods. Journal of the Royal Statistical Society, series B (Methodological) 57, 473–484.
  • Fernandez and Steel (1999) Fernandez, C., Steel, M. F. J., 1999. Multivariate student-t regression models: Pitfalls and inference. Biometrika 86 (1), 153–167.
  • Fonseca et al. (2008) Fonseca, T. C., Ferreira, M. A. R., Migon, H. S., 2008. Objective bayesian analysis for the student-t regression model. Biometrika 95 (2), 325–333.
  • Frühwirth-Schnatter (2006) Frühwirth-Schnatter, S., 2006. Finite mixture and Markov switching models: Modeling and applications to random processes. Springer Science & Business Media.
  • Galimberti and Soffritti (2014) Galimberti, G., Soffritti, G., 2014. A multivariate linear regression analysis using finite mixtures of t distributions. Computational Statistics & Data Analysis 71, 138–150.
  • Gonçalves et al. (2015) Gonçalves, F. B., Prates, M. O., Lachos, V. H., 2015. Robust bayesian model selection for heavy-tailed linear regression using finite mixtures. arXiv preprint arXiv:1509.00331.
  • Kullback and Leibler (1951) Kullback, S., Leibler, R. A., 1951. On information and sufficiency. The annals of mathematical statistics 22 (1), 79–86.
  • Lange et al. (1989) Lange, K. L., Little, R. J., Taylor, J. M., 1989. Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84 (408), 881–896.
  • Lin et al. (2007) Lin, T., Lee, J., Hsieh, W., 2007. Robust mixture modeling using the skew-t distribution. Statistics and Computing 17 (2), 81–92.
  • McLachlan and Peel (2000) McLachlan, G. J., Peel, D., 2000. Finite mixture models. John Wiley & Sons.
  • Peel and McLachlan (2000) Peel, D., McLachlan, G. J., 2000. Robust mixture modelling using the t distribution. Statistics and computing 10 (4), 339–348.
  • Prates et al. (2013) Prates, M. O., Lachos, V. H., Cabral, C., 2013. mixsmsn: Fitting finite mixture of scale mixture of skew-normal distributions. Journal of Statistical Software 54 (12), 1–20.
  • Pruim (2015) Pruim, R., 2015. Nhanes: Data from the us national health and nutrition examination study.
  • R Core Team (2017) R Core Team, 2017. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Richardson and Green (1997) Richardson, S., Green, P., 1997. On bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society. Series B - Methodological 59 (4), 731–792.
  • Ripley et al. (2013) Ripley, B., Venables, B., Bates, D. M., Hornik, K., Gebhardt, A., Firth, D., Ripley, M. B., 2013. Package mass.
  • Simpson et al. (2017) Simpson, D., Rue, H., Riebler, A., Martins, T. G., Sørbye, S. H., et al., 2017. Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science 32 (1), 1–28.
  • Soffritti and Galimberti (2011) Soffritti, G., Galimberti, G., 2011. Multivariate linear regression with non-normal errors: A solution based on mixture models. Statistics and Computing 21 (4), 523–536.
  • Spiegelhalter et al. (2002) Spiegelhalter, D. J., Best, N. G., Carlin, B. P., Van Der Linde, A., 2002. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (4), 583–639.
  • Stephens (1997) Stephens, M., 1997. Bayesian methods for mixtures of normal distributions. Ph.D. thesis, University of Oxford.
  • Villa et al. (2014) Villa, C., Walker, S. G., et al., 2014. Objective prior for the number of degrees of freedom of at distribution. Bayesian Analysis 9 (1), 197–220.