Log-Regularly Varying Scale Mixture of Normals for Robust Regression

05/06/2020 · Yasuyuki Hamura, et al.

Linear regression with the classical normality assumption for the error distribution may lead to undesirable posterior inference on the regression coefficients due to potential outliers. This paper considers a finite mixture of two components with thin and heavy tails as the error distribution, a structure that has been routinely employed in applied statistics. For the heavily-tailed component, we introduce a novel class of distributions; their densities are log-regularly varying and have heavier tails than those of the Cauchy distribution, yet they are expressed as a scale mixture of normal distributions and enable efficient posterior inference by the Gibbs sampler. We prove the robustness to outliers of the posterior distributions under the proposed models with a minimal set of assumptions, which justifies the use of shrinkage priors with unbounded densities for the high-dimensional coefficient vector in the presence of outliers. An extensive comparison with existing methods via simulation studies shows the improved performance of our model in point and interval estimation, as well as its computational efficiency. Further, we confirm the posterior robustness of our method in empirical studies with shrinkage priors for the regression coefficients.


1 Introduction

The robustness to outliers in linear regression models has been well studied for its importance, and research on the theory and methodology of robust statistics has accumulated over the years. Yet, the modeling of error distributions to accommodate outliers in practice has not advanced significantly beyond Student's $t$-distribution. In modern applied statistics, where data are enriched by massive observations, more extreme outliers are expected to arrive, and the inference on the regression coefficients and scale parameter is more likely, and more significantly, affected by such outliers. Our research aims to contribute to the development of novel error distributions for outlier robustness, which we believe are still in demand.

In full posterior inference, the concept of robustness is not limited to point estimation, but targets the whole posterior distributions of the parameters of interest. Also known as outlier-proneness or outlier-rejection, posterior robustness is the property of posterior distributions that the difference between the posteriors with and without outliers diminishes as the values of the outliers become extreme (O’Hagan, 1979). The series of research on posterior robustness has revealed both the (sufficient) conditions for error distributions to achieve the robustness and the specific models that meet such conditions; see the detailed review by O’Hagan and Pericchi (2012). Recent studies introduced the concept of regularly varying density functions (Andrade and O’Hagan, 2006, 2011), which was later extended to log-regularly varying functions (Desgagné, 2015; Desgagné and Gagnon, 2019), and provided the robustness conditions for the partial and whole posteriors of interest to be unaffected by outliers. As an error distribution whose density function is log-regularly varying, Gagnon et al. (2019) proposed the log-Pareto truncated normal (LPTN) distribution, which replaces the thin tails of the normal distribution with those of the heavily-tailed log-Pareto distribution. Despite its desirable robustness property, the class of LPTN distributions has hyperparameters that are difficult to tune or estimate, such as the truncation point of the Gaussian tail, which could result in efficiency loss in practice. Another issue with this distribution is the difficulty in posterior computation; unlike the $t$-distribution, direct sampling from the conditional posteriors is infeasible, and one has to rely on the Metropolis-Hastings algorithm, which may result in increased computational cost.

We, in contrast, explore a different class of error distributions that has received less attention in the literature. Following Box and Tiao (1968), we model the error distribution by a finite mixture of two components; one has thinner tails, such as the normal distribution, and the other is extremely heavily tailed to accommodate potential outliers. While remaining in the general class of scale mixtures of normals (West, 1984), this simple, intuitive approach to the modeling of outliers contrasts with the literature listed above, where the error is modeled by a single, continuous distribution. The structure of the finite mixture helps control the effect of outliers on the posteriors of the parameters of interest, while allowing conditional conjugacy for posterior computation. For these theoretical and practical utilities, finite mixture models have been routinely used in applied statistics (see, for example, Carter and Kohn 1994, West 1997, Frühwirth-Schnatter 2006 and Tak et al. 2019). In this research, we specifically focus on this class of error distributions in proving the posterior robustness.

For the heavily-tailed distribution that comprises the finite mixture, Student's $t$-distribution is still regarded as thin-tailed because of its sensitivity to outliers. We instead propose the use of a class of distributions that has been utilized in robust inference for high-dimensional count data (Hamura et al., 2019) for its extremely heavy tails. This is another scale mixture of normals, obtained by mixing over a gamma distribution with a hierarchical structure on the shape parameters, which allows posterior inference by a simple but efficient Gibbs sampler. The tails of these distributions are heavier than those of the Cauchy distribution; this tail property is consistent with those of other heavily-tailed distributions considered for posterior robustness, including the LPTN distributions.

The finite mixture of the thinly-tailed and heavily-tailed distributions, used as the error distribution in linear models and named here the extremely heavily-tailed error (EHE) distribution, is proved to achieve the whole posterior robustness. We also consider a wider class of error distributions that includes the EHE distributions, and show that the error distribution attaining the posterior robustness within this class is the proposed EHE distribution only. The posterior robustness realized by the EHE distributions is extensively compared with the other alternatives in a simulation study, showing its competence in point and interval estimation.

Another notable feature of the EHE distributions is that the posterior robustness is guaranteed for a variety of priors on the regression coefficients and scale parameter. The assumptions for the posterior robustness do not exclude unbounded prior densities for the regression coefficients. Such prior distributions include the shrinkage priors for high-dimensional regression, e.g., the horseshoe priors (Carvalho et al., 2009, 2010). We illustrate the utility of this robustness with the shrinkage prior for the regression coefficients in the empirical study of the Boston housing dataset, which is suspected to be contaminated with possible outliers. Likewise, in another example with the famous diabetes data, we confirm that the loss of efficiency caused by introducing the heavily-tailed distribution is minimal even in the absence of outliers.

The rest of the paper is organized as follows. In Section 2, we introduce the new error distribution and describe its use in linear regression models. We also provide theoretical robustness properties regarding the posterior distribution. The algorithm for posterior computation is provided in Section 3 with the discussion on its computational efficiency. In Section 4, we carry out simulation studies to compare the proposed method with existing ones. In Section 5, we illustrate the proposed method using two famous datasets. Finally, we conclude with further discussions in Section 6.

2 A new error distribution for robust Bayesian regression

2.1 Extremely Heavy-tailed error distribution

Let $y_i$ be a response variable and $x_i$ be an associated $p$-dimensional vector of covariates, for $i = 1, \dots, n$. We consider a linear regression model, $y_i = x_i^{\top}\beta + \sigma\varepsilon_i$, where $\beta$ is a $p$-dimensional vector of regression coefficients and $\sigma$ is an unknown scale parameter. The error terms, $\varepsilon_1, \dots, \varepsilon_n$, are directly linked to the posterior robustness; it is well known that modeling these errors simply by Gaussian distributions makes the posterior inference very sensitive to outliers.

For posterior robustness, we introduce a local random variable $u_i$ and assume that the error distribution is conditionally Gaussian, as $\varepsilon_i \mid u_i \sim N(0, u_i)$. Under this setting, when an outlier arrives, a higher value of the local variable $u_i$ explains the outlier and keeps the posterior distribution of $(\beta, \sigma)$ unchanged. A typical choice for the distribution of $u_i$ is the inverse-gamma distribution, which leads to the marginal distribution of $\varepsilon_i$ being the $t$-distribution. However, as shown in Gagnon et al. (2019) and in our main theorem, this choice does not yield the desirable robustness properties of the posterior distribution, even when the distribution of $\varepsilon_i$ is the Cauchy distribution.

The model for the local scale variable $u_i$ studied in this research is given by the mixture of two components as follows:

$u_i = (1 - z_i)\,v_i + z_i\,w_i, \qquad z_i \sim \mathrm{Ber}(s),$

with mixing probability $s \in (0, 1)$. These variables independently follow the different distributions defined below:

$v_i \sim \mathrm{Ga}(a, a), \qquad w_i \sim H(\cdot\,; \gamma), \qquad (1)$

with fixed value $a$ and unknown parameter $\gamma$, where $\mathrm{Ga}(a, b)$ denotes the gamma distribution with shape $a$ and rate $b$. The second, newly introduced $H$-distribution is defined by the proper density

$H(u; \gamma) = \frac{\gamma}{(1 + u)\{1 + \log(1 + u)\}^{1 + \gamma}}, \qquad u > 0.$

Preparing two distributions in modeling the variance structure in the form (1) follows the same modeling philosophy as Box and Tiao (1968): the first component generates non-outlying errors, and the second component is supposed to absorb outlying errors. For the non-outlying part, we set $a$ to a large value so that the variance of $v_i$ is very small; the point mass on unity is included in our model as the limit $a \to \infty$. In what follows, we adopt a large fixed value of $a$ as the default choice. In contrast, as the model for outlying errors, the second component is extremely heavily tailed, since $H(u; \gamma) \asymp u^{-1}(\log u)^{-1-\gamma}$ as $u \to \infty$, which is known as a log-regularly varying density (Desgagné, 2015). This property is inherited by the error distribution and plays an important role in the robustness properties of the posterior distribution.

Under the formulation (1), the marginal density of $\varepsilon_i$ is obtained as

$f_{\mathrm{EHE}}(\varepsilon) = (1 - s)\int_0^\infty \phi(\varepsilon; u)\,\mathrm{Ga}(u; a, a)\,\mathrm{d}u + s\int_0^\infty \phi(\varepsilon; u)\,H(u; \gamma)\,\mathrm{d}u, \qquad (2)$

where $\phi(\varepsilon; u)$ is the normal density with mean zero and variance $u$. Both components are scale mixtures of normals; the first component is the normal-gamma distribution in general (Griffin and Brown, 2010), but in our application it is essentially the standard normal distribution because $a$ is set to a large value. The second component does not admit any closed-form expression. To handle this component in posterior computation, as we see later in Section 3.1, we utilize an augmentation of the $H$-distribution by a couple of gamma-distributed state variables. With this augmentation, the posterior inference for this model is straightforward.
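Both integrals in (2) are one-dimensional and easy to approximate by simulation. The sketch below (ours, with illustrative parameter values for $s$, $a$ and $\gamma$) evaluates (2) by Monte Carlo integration over the two mixing distributions, drawing from $H$ exactly by inverting its closed-form CDF, $F(u) = 1 - \{1 + \log(1 + u)\}^{-\gamma}$.

```python
# Monte Carlo evaluation of the marginal EHE density (2); a sketch under
# the H density stated above, with illustrative parameter values.
import numpy as np

def ehe_density(eps_grid, s=0.1, a=10.0, gam=1.0, n_mc=200_000, seed=0):
    rng = np.random.default_rng(seed)
    # First mixing component: u ~ Ga(a, a) (shape a, rate a).
    u1 = rng.gamma(a, 1.0 / a, size=n_mc)
    # Second component: u ~ H(.; gam), sampled exactly by inverting the CDF
    # F(u) = 1 - (1 + log(1 + u))^(-gam)  =>  u = exp(t^(-1/gam) - 1) - 1.
    t = rng.uniform(size=n_mc)
    u2 = np.expm1(t ** (-1.0 / gam) - 1.0)  # far-tail draws may overflow to
    # inf; they contribute zero density below, which is harmless here.

    def mc_mix(u):  # average of N(0, u) densities over the mixing draws
        return np.array([np.mean(np.exp(-e * e / (2 * u)) / np.sqrt(2 * np.pi * u))
                         for e in eps_grid])

    return (1 - s) * mc_mix(u1) + s * mc_mix(u2)

print(ehe_density(np.array([0.0, 1.0, 3.0, 10.0])))
```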

A notable property of the new error distribution is its extremely heavy tails shown in the following proposition, with the proof left in the Appendix.

Proposition 1.

The density (2) satisfies

$f_{\mathrm{EHE}}(x) \asymp |x|^{-1}\{\log |x|\}^{-1-\gamma}$

for large $|x|$ if $s > 0$.

The above proposition indicates that the density of the EHE distribution is a log-regularly varying function. In addition, the tails of the EHE density are heavier than those of the Cauchy distribution; $f_{\mathrm{EHE}}(x)/f_C(x) \to \infty$ as $|x| \to \infty$, where $f_C$ denotes the Cauchy density. This property follows from the fact that the EHE distribution directly inherits the heavy tails of the mixing $H$-distribution in the second component of the density (2). In what follows, we call the new error distribution (2) the extremely heavily-tailed error (EHE) distribution.
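As a quick numeric illustration of this tail comparison (ours, with an illustrative value of $\gamma$), the tail order in the proposition dominates the Cauchy tail by an unbounded factor:

```python
# The EHE tail order |x|^-1 (log|x|)^(-1-gamma) versus the Cauchy tail
# 1/(pi (1 + x^2)); the ratio grows without bound as x increases.
import numpy as np

x = np.array([1e1, 1e2, 1e4, 1e8])
gam = 1.0
ehe_tail = 1.0 / (x * np.log(x) ** (1.0 + gam))   # up to a constant
cauchy = 1.0 / (np.pi * (1.0 + x**2))
print(ehe_tail / cauchy)
```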

The density in (2) is shown in Figure 1 for selected values of $s$ and $\gamma$, together with the standard normal density. It is observed that the shape of the EHE distribution is very similar to that of the standard normal distribution around the origin, whereas the tails are much heavier. Figure 2 shows the cumulative distribution functions (CDFs) of the $H$-distributions and the EHE distributions to emphasize their tail property. The tails of the proposed EHE distributions are heavier than those of the Cauchy distribution, as seen in the right panel. This fact is also confirmed via the comparison of the CDFs of the $H$- and inverse-gamma distributions in the left panel. Owing to these properties of the EHE density, we can achieve robustness for the entire posterior distribution, as shown in Theorem 1.

Figure 1: Densities of the proposed error distribution for selected values of $s$ and $\gamma$, together with the standard normal error distribution. The intractable integral of the second component is computed by Monte Carlo integration.
Figure 2: Left: cumulative distribution functions of the scale distributions $H(u; \gamma)$ for selected values of $\gamma$, and of the inverse-gamma distribution with shape and scale $1/2$. Right: empirical cumulative distribution functions of the EHE distributions computed by Monte Carlo integration, compared with the distribution function of the Cauchy distribution.
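The closed-form CDF of $H$ makes a comparison like the left panel easy to reproduce; the following sketch (our parameter choices) contrasts it with the CDF of the inverse-gamma mixing distribution $\mathrm{IG}(1/2, 1/2)$, which is the mixing distribution that yields Cauchy errors.

```python
# Tail comparison of the mixing CDFs; the slower approach of F_H to one
# reflects its heavier tail. Gamma values are illustrative.
import numpy as np
from scipy.stats import invgamma

u = np.array([10.0, 1e2, 1e4, 1e8])
for gam in (0.5, 1.0):
    print(f"H(gamma={gam}):", 1.0 - (1.0 + np.log1p(u)) ** (-gam))
print("IG(1/2, 1/2):", invgamma.cdf(u, a=0.5, scale=0.5))
```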

2.2 Robustness properties

We here consider theoretical robustness properties of the posterior distribution based on the proposed EHE distribution. To this end, we consider a wider class of error distributions which includes the proposed distribution as a special case, defined by replacing $H(u; \gamma)$ in (2) with

$H(u; \gamma, \delta) = C_{\gamma, \delta}\,\frac{1}{(1 + u)^{1 + \delta}\{1 + \log(1 + u)\}^{1 + \gamma}}, \qquad (3)$

where $C_{\gamma, \delta}$ is a normalizing constant and $\delta \ge 0$ is an additional shape parameter. Note that the distribution in (3) reduces to the proposed distribution in (2) under $\delta = 0$. This parameter is also related to the decay of the density tail of (3), that is, $H(u; \gamma, \delta) \asymp u^{-1-\delta}(\log u)^{-1-\gamma}$ as $u \to \infty$. Hence, the tail gets heavier as $\delta$ decreases, and the EHE distribution with $\delta = 0$, in fact, has the heaviest tail in this class of distributions. Among this general class in (3), we show later in Theorem 1 that only the proposed error distribution with $\delta = 0$ attains the robustness property. This theorem also clarifies the difference from $t$-distributions with $\nu$ degrees of freedom, the density tails of which are of order $|x|^{-1-\nu}$ and lighter than those of the proposed distribution even when $\nu = 1$ (Cauchy tail).

For simplicity, we fix $s$ and $\gamma$ in what follows, but the same property holds if the support of $(s, \gamma)$ is compact. Let $D$ be the set of observed data. To discuss the posterior robustness, we target the unnormalized posterior distribution of $(\beta, \sigma)$ under the general error distribution with (3),

$\pi(\beta, \sigma \mid D) \propto \pi(\beta, \sigma)\prod_{i=1}^{n}\frac{1}{\sigma}\,f\Big(\frac{y_i - x_i^{\top}\beta}{\sigma}\Big), \qquad (4)$

where $f$ is the error density based on the mixing distribution in (3) and $\pi(\beta, \sigma)$ is a joint prior distribution of $(\beta, \sigma)$. Next, to analyze the effect of outliers explicitly, we assume that each outlier goes to infinity at its own specific rate. More precisely, the observed value of each outlying response is parametrized by $\omega$ as $y_i = a_i + b_i\omega$ for some constants $a_i$ and $b_i > 0$, and $\omega \to \infty$. Let $D^*$ be the set of non-outlying observations; $y_i$ is independent of $\omega$ for $y_i \in D^*$. The posterior robustness is defined as the diminishing difference between the posteriors conditional on $D$ and on $D^*$ as $\omega \to \infty$. The formal statement of posterior robustness for our model is given below. For the detailed proof, see the Appendix.

Theorem 1.

For the unnormalized posterior density given in (4), it holds that

$\sup_{(\beta, \sigma) \in K}\left|\frac{\pi(\beta, \sigma \mid D)}{\pi(\beta, \sigma \mid D^*)} - C\right| \to 0 \quad \text{as } \omega \to \infty \qquad (5)$

for any compact set $K$ if and only if $\delta = 0$, where $C$ is a positive quantity that does not depend on $(\beta, \sigma)$.

We note again that the general error distribution with $\delta = 0$ is exactly the proposed EHE distribution, so the above theorem indicates that the desirable robustness property is achieved only under the proposed EHE distribution among the general class of error distributions with the mixing distribution in (3).

As clarified in the proof of Theorem 1, the ratio of the two unnormalized posteriors converges to a non-constant function of $\beta$ and $\sigma$ if $\delta > 0$. The same asymptotic ratio is obtained for the $t$-distribution with any fixed degrees of freedom. In other words, the posterior robustness cannot be attained by the finite mixture with the $t$-distribution.

The main theorem shows the uniform convergence of the posterior distribution with outliers to the one without outliers on a compact set. Although this result is proved with almost no assumption other than the model structure, we can also prove other variations of posterior robustness seen in the literature under appropriate conditions. Examples include convergence with the normalizing constant and convergence in distribution, by introducing additional assumptions on the models and priors. The explicit benefit of the version of posterior robustness in our theorem is the minimal set of assumptions required for the priors on $\beta$ and $\sigma$: the posterior robustness is valid for any proper priors, even if the density is unbounded. In fact, unbounded density functions are common in some advanced but widely adopted shrinkage priors, such as the horseshoe priors (Carvalho et al., 2010). Thus, the theoretical framework of this research guarantees the posterior robustness for a broader and important class of statistical problems, including high-dimensional regression by shrinkage as an important example.

3 Posterior Computation

3.1 Gibbs sampler by augmentation

An important property of the proposed EHE distribution (2) is its computational tractability; that is, we can easily construct a simple Gibbs sampler for posterior inference. Note that the error distribution contains two unknown parameters, $s$ and $\gamma$, and we adopt conditionally conjugate priors given by $s \sim \mathrm{Beta}(a_s, b_s)$ and $\gamma \sim \mathrm{Ga}(a_\gamma, b_\gamma)$. Conditionally conjugate priors can also be found for the main parameters, $\beta$ and $\sigma^2$, and we use $\beta \sim N(0, A)$ and $\sigma^2 \sim \mathrm{IG}(c, d)$. The multivariate normal prior for $\beta$ can be replaced with a scale mixture of normals, such as shrinkage priors, which is discussed later in Section 3.3.

To derive tractable conditional posteriors, we need to keep the likelihood conditionally Gaussian with scale $\sigma^2 u_i$. For this purpose, we rely on a set of latent variables, $z_i$ and $(v_i, w_i)$, to obtain a hierarchical expression of $u_i$. The local scale parameter is written as $u_i = (1 - z_i)v_i + z_i w_i$, where $v_i$ and $w_i$ are mutually independent and distributed as $v_i \sim \mathrm{Ga}(a, a)$ and $w_i \sim H(\cdot\,; \gamma)$, with $z_i \sim \mathrm{Ber}(s)$ as in (1). The conditional conjugacy for $(\beta, \sigma^2)$ follows immediately from the Gaussian likelihoods, and the conditional posteriors are normal and inverse-gamma, given $(u_1, \dots, u_n)$.

The full conditional distributions of the other parameters and latent variables in the EHE distribution are not of any well-known form, but we can utilize the following integral expression of the density $H$:

$H(u; \gamma) = \int_0^\infty\!\!\int_0^\infty \mathrm{Ga}(u; 1, \zeta)\,\mathrm{Ga}(\zeta; \eta, 1)\,\mathrm{Ga}(\eta; \gamma, 1)\,\mathrm{d}\eta\,\mathrm{d}\zeta;$

namely, a random variable $u$ following the density $H(\cdot\,; \gamma)$ admits the mixture representation $u \mid \zeta \sim \mathrm{Ga}(1, \zeta)$, $\zeta \mid \eta \sim \mathrm{Ga}(\eta, 1)$ and $\eta \sim \mathrm{Ga}(\gamma, 1)$, which enables us to easily generate samples from the full conditional distributions of $u_i$ and $\gamma$.
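The following quick simulation check (ours, not from the paper's repository) confirms that this three-stage hierarchy reproduces the $H$-distribution, whose CDF is available in closed form as $F(u) = 1 - \{1 + \log(1 + u)\}^{-\gamma}$ under the density stated above.

```python
# Simulation check of the mixture representation of H(u; gamma):
# eta ~ Ga(gamma, 1), zeta | eta ~ Ga(eta, 1), u | zeta ~ Ga(1, zeta).
import numpy as np

rng = np.random.default_rng(1)
gam, n = 1.5, 500_000
eta = rng.gamma(gam, 1.0, size=n)
zeta = rng.gamma(eta, 1.0)          # shape eta, rate 1
u = rng.exponential(1.0 / zeta)     # Ga(1, zeta) is Exp(zeta)

for q in (1.0, 10.0, 1000.0):
    exact = 1.0 - (1.0 + np.log1p(q)) ** (-gam)
    print(f"P(u <= {q}): empirical {np.mean(u <= q):.4f}, exact {exact:.4f}")
```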

The latent state $(\eta_i, \zeta_i)$ is useful in deriving the conditional posterior of $u_i$, and one can derive a Gibbs sampler with this latent state as part of the Markov chain, although it is totally redundant in the posterior sampling of the other parameters. We, instead, marginalize out $(\eta_i, \zeta_i)$ wherever possible when sampling the remaining parameters and latent variables from their conditional posteriors. This modification of the original Gibbs sampler simplifies the sampling procedure and even facilitates the mixing, while targeting the same stationary distribution (partially collapsed Gibbs sampler; Van Dyk and Park 2008). The algorithm for posterior sampling is summarized as follows; a runnable sketch of the full cycle is given after the list.

Summary of the posterior sampling

  • Sample $\beta$ from the full conditional distribution $N(Bb, B)$, where $B = (A^{-1} + \sigma^{-2}\sum_{i=1}^{n} x_i x_i^{\top}/u_i)^{-1}$ and $b = \sigma^{-2}\sum_{i=1}^{n} x_i y_i/u_i$.

  • Sample $\sigma^2$ from $\mathrm{IG}(c + n/2,\ d + \sum_{i=1}^{n}(y_i - x_i^{\top}\beta)^2/(2u_i))$.

  • Sample each $z_i$ from the Bernoulli distribution; the probabilities of $z_i = 0$ and $z_i = 1$ are proportional to $(1 - s)\,\mathrm{Ga}(u_i; a, a)$ and $s\,H(u_i; \gamma)$, respectively.

  • The full conditional distributions of $s$ and $\gamma$ are given by $\mathrm{Beta}(a_s + n_1,\ b_s + n - n_1)$ and $\mathrm{Ga}(a_\gamma + n_1,\ b_\gamma + \sum_{i: z_i = 1}\log\{1 + \log(1 + u_i)\})$, respectively, where $n_1 = \sum_{i=1}^{n} z_i$.

  • For each $i$ with $z_i = 1$, independently, sample $\zeta_i$ in a compositional way: sample $\eta_i$ from $\mathrm{Ga}(\gamma + 1,\ 1 + \log(1 + u_i))$ and then $\zeta_i \mid \eta_i$ from $\mathrm{Ga}(\eta_i + 1,\ 1 + u_i)$.

  • For each $i$, independently, sample $u_i$ from $\mathrm{GIG}(a - 1/2,\ 2a,\ r_i^2/\sigma^2)$ if $z_i = 0$, or from $\mathrm{GIG}(1/2,\ 2\zeta_i,\ r_i^2/\sigma^2)$ if $z_i = 1$, where $r_i = y_i - x_i^{\top}\beta$ and $\mathrm{GIG}(\lambda, \psi, \chi)$ denotes the generalized inverse Gaussian distribution with density proportional to $u^{\lambda - 1}\exp\{-(\psi u + \chi/u)/2\}$.
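To make the whole cycle concrete, the following condensed sketch assembles the steps above into a runnable sampler. It is our own reconstruction for illustration: the hyperparameter names and defaults (tau2 for the prior variance of $\beta$, (c, d) for the inverse-gamma prior on $\sigma^2$, (a_s, b_s) and (a_g, b_g) for the priors on $s$ and $\gamma$) are placeholders rather than the authors' settings, and SciPy's geninvgauss is used for the GIG draws after rescaling.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import geninvgauss

def rgig(rng, lam, psi, chi):
    # GIG(lam, psi, chi) with density ~ x^(lam-1) exp(-(psi x + chi/x)/2),
    # obtained by rescaling scipy's two-parameter geninvgauss.
    return np.sqrt(chi / psi) * geninvgauss.rvs(lam, np.sqrt(psi * chi),
                                                random_state=rng)

def gibbs_ehe(y, X, n_iter=2000, a=10.0, tau2=100.0, c=1.0, d=1.0,
              a_s=1.0, b_s=1.0, a_g=1.0, b_g=1.0, seed=1):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sigma2, s, gam = 1.0, 0.1, 1.0
    u, z = np.ones(n), np.zeros(n, dtype=int)
    draws = []
    for _ in range(n_iter):
        # beta | rest: multivariate normal (weighted ridge form)
        Q = np.eye(p) / tau2 + (X.T * (1.0 / (sigma2 * u))) @ X
        m = np.linalg.solve(Q, X.T @ (y / (sigma2 * u)))
        beta = m + np.linalg.solve(np.linalg.cholesky(Q).T,
                                   rng.standard_normal(p))
        r = y - X @ beta
        # sigma^2 | rest: inverse gamma
        sigma2 = 1.0 / rng.gamma(c + 0.5 * n,
                                 1.0 / (d + 0.5 * np.sum(r**2 / u)))
        # z_i | u_i: the Gaussian likelihood cancels, so only the mixing
        # densities Ga(u_i; a, a) and H(u_i; gamma) enter the odds
        llog = np.log1p(np.log1p(u))
        log_ga = a * np.log(a) - gammaln(a) + (a - 1) * np.log(u) - a * u
        log_h = np.log(gam) - np.log1p(u) - (1 + gam) * llog
        z = rng.binomial(1, 1.0 / (1.0 + np.exp(np.log1p(-s) + log_ga
                                                - np.log(s) - log_h)))
        n1 = z.sum()
        # s | z and gamma | z, u (with eta_i, zeta_i marginalized out)
        s = rng.beta(a_s + n1, b_s + n - n1)
        gam = rng.gamma(a_g + n1, 1.0 / (b_g + llog[z == 1].sum()))
        # u_i | rest via GIG, drawing (eta_i, zeta_i) for the H component
        for i in range(n):
            chi = max(r[i] ** 2 / sigma2, 1e-12)
            if z[i] == 0:
                u[i] = rgig(rng, a - 0.5, 2.0 * a, chi)
            else:
                eta = rng.gamma(gam + 1.0, 1.0 / (1.0 + np.log1p(u[i])))
                zeta = rng.gamma(eta + 1.0, 1.0 / (1.0 + u[i]))
                u[i] = rgig(rng, 0.5, 2.0 * zeta, chi)
        draws.append(np.concatenate([beta, [sigma2, s, gam]]))
    return np.asarray(draws)
```

The per-observation draws in the final loop are mutually independent given the other parameters, which is the parallelization point discussed in Section 3.2 below.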

3.2 Efficiency in computation

A possible reason that the finite mixture has attracted less attention in the past research on posterior robustness is, as mentioned in Desgagné and Gagnon (2019), the increased number of latent state variables introduced by augmentation, and the concern about potential inefficiency in posterior computation. The same concern is seen in Bayesian variable selection (George and McCulloch, 1993): the finite mixture model for the prior on regression coefficients necessitates a stochastic search in the high-dimensional model space, hence causing slow convergence of the Markov chains and costly computation. It is clear from the above algorithm, however, that the use of the finite mixture as an error distribution is completely different from the variable selection in terms of the model structure and is free from such computational problems. Unlike in variable selection, the membership of each observation to either of the two components in our model is independent of the others, which facilitates the stochastic search over the possible combinations in the model space. This fact also shows that the sampling of $z_i$ can be done completely in parallel across $i$'s; hence, our algorithm scales well and is computationally feasible for datasets with extremely large $n$.

We, again, emphasize that the use of the finite mixture is designed for controlling the effect of outliers on the other parameters of interest, and we focus on the inference for the regression coefficients and scale parameter, not on outlier detection. Although this view has already been clarified, and supported, by the posterior robustness in Theorem 1, we further discuss the utility of the finite mixture approach through an extensive comparison with other models in the simulation study of Section 4.

3.3 Robust Bayesian variable selection with shrinkage priors

When the dimension $p$ of $\beta$ is moderate or large, it is desirable to select a suitable subset of covariates to achieve efficient estimation. This procedure of variable selection would also be seriously affected by possible outliers, by which we may fail to select suitable subsets of covariates. For a robust Bayesian variable selection procedure, we introduce shrinkage priors for the regression coefficients. Here we rewrite the regression model to explicitly express an intercept term as $y_i = \beta_0 + x_i^{\top}\beta + \sigma\varepsilon_i$, and consider a normal prior $\beta_0 \sim N(0, A_0)$ with fixed hyperparameter $A_0$. For the regression coefficients $\beta_j$, we consider a class of independent priors expressed as a scale mixture of normals given by

$\beta_j \mid \lambda_j^2, \tau^2 \sim N(0, \tau^2\lambda_j^2), \qquad \lambda_j^2 \sim g(\lambda_j^2), \qquad (6)$

where $g$ is a mixing distribution and $\tau^2$ is an unknown global parameter that controls the strength of the shrinkage effects. Examples of the mixing distribution $g$ include the exponential distribution, leading to the Laplace prior of $\beta_j$ (Bayesian lasso; Park and Casella 2008), and the half-Cauchy distribution for $\lambda_j$, which results in the horseshoe prior (Carvalho et al., 2009, 2010). The robustness property of the resulting posterior distributions is guaranteed for those shrinkage priors; Theorem 1 does not require any conditions other than the prior propriety.

In terms of posterior computation, the key property is that the conditional distribution of $\beta_j$ given $(\lambda_j^2, \tau^2)$ under (6) is a normal distribution, so the sampler given in Section 3.1 is still valid with minor modification. Specifically, the sampling steps for the full conditional distributions of $\beta_0$, $\beta$, $\tau^2$ and $\lambda_j^2$ are modified or added as follows:

  • Sample $\beta_0$ from its normal full conditional distribution, with variance $(A_0^{-1} + \sigma^{-2}\sum_{i=1}^{n} 1/u_i)^{-1}$ and mean equal to that variance times $\sigma^{-2}\sum_{i=1}^{n}(y_i - x_i^{\top}\beta)/u_i$.

  • Sample $\beta$ from $N(Bb, B)$, where $B = (\Lambda^{-1}/\tau^2 + \sigma^{-2}\sum_{i=1}^{n} x_i x_i^{\top}/u_i)^{-1}$ and $b = \sigma^{-2}\sum_{i=1}^{n} x_i(y_i - \beta_0)/u_i$, with $\Lambda = \mathrm{diag}(\lambda_1^2, \dots, \lambda_p^2)$.

  • Sample $\sigma^2$ from its inverse-gamma full conditional as in Section 3.1, with the residuals now computed as $y_i - \beta_0 - x_i^{\top}\beta$.

  • Sample $\tau^2$ and each $\lambda_j^2$ from their full conditionals, whose densities are proportional to $\pi(\tau^2)\prod_{j=1}^{p} N(\beta_j; 0, \tau^2\lambda_j^2)$ and $g(\lambda_j^2)\,N(\beta_j; 0, \tau^2\lambda_j^2)$, respectively, where $\pi(\tau^2)$ is a prior density for $\tau^2$.

The full conditional distributions of $\beta_0$ and $\beta$ are of familiar forms owing to the normal mixture representation of the EHE distribution as well as the shrinkage priors. The sampling of $\tau^2$ and $\lambda_j^2$ depends on the choice of shrinkage priors, but the existing algorithms in the literature can be directly imported into our method.

In Section 5, we adopt the horseshoe prior for the regression coefficients with the EHE distribution for the error terms. We here provide the details of the sampling algorithm under the horseshoe model. The horseshoe prior assumes that $\lambda_j \sim C^{+}(0, 1)$ independently for $j = 1, \dots, p$ and $\tau \sim C^{+}(0, 1)$, where $C^{+}(0, 1)$ is the standard half-Cauchy distribution with probability density function given by $p(x) = 2/\{\pi(1 + x^2)\}$ for $x > 0$. Note that these admit hierarchical expressions given by $\lambda_j^2 \mid \nu_j \sim \mathrm{IG}(1/2, 1/\nu_j)$ and $\nu_j \sim \mathrm{IG}(1/2, 1)$ for $j = 1, \dots, p$, and $\tau^2 \mid \xi \sim \mathrm{IG}(1/2, 1/\xi)$ and $\xi \sim \mathrm{IG}(1/2, 1)$. Then, the full conditional distributions of $\lambda_j^2$ and $\tau^2$, as well as the latent parameters $\nu_j$ and $\xi$, are all inverse-gamma, as sketched below.
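The following sketch writes out the inverse-gamma updates implied by this augmentation. It is ours for illustration: the variable names are placeholders, and it assumes the prior scale of $\beta_j$ is $\tau^2\lambda_j^2$ exactly as in (6); if the prior is specified with an additional $\sigma^2$ factor, the scales below change accordingly.

```python
import numpy as np

def horseshoe_updates(rng, beta, lam2, nu, tau2, xi):
    # Inverse-gamma full conditionals implied by the hierarchical
    # half-Cauchy expressions above, under beta_j ~ N(0, tau2 * lam2_j);
    # IG(a, b) draws are generated as reciprocals of Ga(a, rate b) draws.
    p = beta.size
    lam2 = 1.0 / rng.gamma(1.0, 1.0 / (1.0 / nu + beta**2 / (2.0 * tau2)))
    nu = 1.0 / rng.gamma(1.0, 1.0 / (1.0 + 1.0 / lam2))
    tau2 = 1.0 / rng.gamma(0.5 * (p + 1.0),
                           1.0 / (1.0 / xi + np.sum(beta**2 / lam2) / 2.0))
    xi = 1.0 / rng.gamma(1.0, 1.0 / (1.0 + 1.0 / tau2))
    return lam2, nu, tau2, xi
```

These draws slot into the sampler of Section 3.1 in place of the fixed prior covariance for $\beta$.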

4 Simulation studies

We here carry out simulation studies to investigate the performance of the proposed method together with existing methods. We generated observations from the linear regression model with $p$ covariates,

$y_i = x_i^{\top}\beta + \sigma\varepsilon_i, \qquad i = 1, \dots, n,$

where a small number of leading coefficients are set to nonzero values and the other coefficients are set to $0$. Here the vectors of covariates were generated from a multivariate normal distribution with zero mean vector and a variance-covariance matrix whose $(j, k)$-element equals $\rho^{|j-k|}$ for some $\rho \in (0, 1)$. Regarding the error term, we adopted the following contamination structure: $\varepsilon_i$ is drawn from the standard normal distribution with probability $1 - \omega$ and from an outlier-generating distribution located at $\mu$ with probability $\omega$, where $\omega$ is the contamination ratio and $\mu$ is the location of the outliers. We considered all the combinations of $\omega \in \{0, 0.05, 0.1\}$ and $\mu \in \{5, 10, 15, 20\}$, which leads to 9 scenarios in total since $\omega = 0$ with arbitrary $\mu$ leads to the same structure of $\varepsilon_i$, namely no contamination.
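For concreteness, a data-generating sketch consistent with our reading of this design is given below. The nonzero coefficient values, the correlation $\rho$, and the $N(\mu, 1)$ outlier component are illustrative placeholders, since the exact settings are not recoverable from the text.

```python
import numpy as np

def simulate_data(n=200, p=20, omega=0.05, mu=10.0, rho=0.2, seed=0):
    rng = np.random.default_rng(seed)
    # AR(1)-type covariate correlation: (j, k)-element rho^|j - k|.
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    beta = np.zeros(p)
    beta[:3] = (2.0, -1.0, 1.5)           # placeholder nonzero coefficients
    eps = rng.standard_normal(n)          # non-outlying N(0, 1) errors
    is_out = rng.uniform(size=n) < omega  # contamination indicators
    eps[is_out] = mu + rng.standard_normal(is_out.sum())  # stand-in outliers
    return X, X @ beta + eps, beta

X, y, beta_true = simulate_data(omega=0.1, mu=20.0)
```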

For each simulated dataset, we applied the robust regression methods with the EHE distribution, the LPTN distribution (Gagnon et al., 2019), and the $t$-distribution with various degrees of freedom. When using the EHE distribution, we adopted a simple method with fixed $\gamma$ (denoted by EH) and an adaptive version with estimated $\gamma$ (aEH) obtained by assigning a prior distribution to $\gamma$. For the LPTN distribution, we need to specify the tuning parameter $\rho$, and we adopted two choices, denoted by LP1 and LP2, respectively. Regarding the $t$-distribution, we considered $\nu = 1$, corresponding to the Cauchy distribution (denoted by C), $\nu = 3$ (T3), and an adaptive version obtained by assigning a discrete prior to $\nu$ (denoted by aT). We also employed the standard normal distribution (denoted by N). We implemented all the methods in a Bayesian way by assigning prior distributions to $\beta$ and $\sigma^2$. Under the EHE distribution, the $t$-distributions and the normal distribution, we generated the posterior samples of $(\beta, \sigma^2)$ by the Gibbs sampler. On the other hand, we generated posterior samples under the LPTN distribution by the random-walk Metropolis-Hastings algorithm adopted in Gagnon et al. (2019), with suitably tuned step sizes. For each model, we generated 3000 posterior samples after discarding the first 1000 samples as burn-in.

Based on the posterior samples, we computed posterior means as well as credible intervals of the regression coefficients. The performance of the point and interval estimation was assessed using the square root of the mean squared error (RMSE), the coverage probability (CP) and the average length (AL) of the credible intervals, based on 500 replications of the simulation, and these values were averaged over the coefficients. In addition, we evaluated the efficiency of the sampling schemes by computing the average inefficiency factors (IF) of the posterior samples.

In Table 1, we report the values of these performance measures in the 9 scenarios. When $\omega = 0$ (no outlier), as expected, the normal distribution provides the most efficient results in all measures, while the other methods are slightly less efficient. However, the proposed method (EH and aEH in the table) performs almost in the same way as the normal distribution. This is empirical evidence that the efficiency loss of the EHE distribution is very limited owing to the normal component in the mixture. For the other robust methods, the RMSEs are slightly higher than that of the normal distribution and the CPs are smaller than the nominal level.

In the other scenarios, where outliers are incorporated into the data-generating process, the performance of the normal distribution breaks down, and the robustness property is highlighted in the performance measures of the other models. In particular, the EHE distribution with fixed $\gamma$ (EH) performs quite stably in both point and interval estimation. The adaptive version (aEH) also works reasonably well, but its performance is slightly worse due to the cost of estimating $\gamma$; hence, estimating $\gamma$ may not be beneficial. The LPTN model LP1 shows reasonable performance, but its CPs tend to be smaller than the nominal level. The other LPTN model, LP2, greatly worsens the accuracy of point estimation, implying the sensitivity of the posteriors to the choice of this hyperparameter. The other models (C, T3 and aT) also suffer from larger RMSE values, which might relate to the lack of posterior robustness in the $t$-distribution family. The results of interval estimation severely depend on the degrees-of-freedom parameter, as the Cauchy and $t$-distributions produce credible intervals that are too narrow or too wide.

In terms of computational efficiency, it is remarkable that the IF values of the EHE methods are small and comparable with those of the $t$-distribution methods, which shows the efficiency of the proposed Gibbs sampling algorithm. On the other hand, the IFs of the LPTN models are very large due to the use of the Metropolis-Hastings algorithm. To obtain reliable posterior analysis under the LPTN models, one needs to increase the number of iterations drastically, or to spend more effort tuning the step-size parameter. The performance of the LPTN models improves under simpler settings with fewer predictors, but the overall result of the comparison of the 8 models remains almost the same. See the Appendix for this additional experiment.

(ω (%), μ) EH aEH LP1 LP2 C T3 aT N

RMSE
(0, –) 6.25 6.26 6.61 7.92 7.76 6.70 6.48 6.25
(5, 5) 6.99 7.60 7.07 8.22 8.04 7.17 7.42 10.68
(10, 5) 9.09 8.63 8.82 9.46 8.32 8.27 9.63 15.73
(5, 10) 6.53 6.77 6.76 8.03 7.85 6.85 7.14 18.56
(10, 10) 7.03 7.54 7.08 8.27 7.98 7.30 9.73 29.20
(5, 15) 6.58 6.74 6.79 8.15 7.88 6.84 7.00 26.76
(10, 15) 6.99 7.26 7.02 8.32 7.90 7.09 10.07 43.70
(5, 20) 6.50 6.63 6.70 8.02 7.78 6.75 6.90 35.56
(10, 20) 6.94 7.12 6.96 8.29 7.79 6.94 10.19 58.22

CP
(0, –) 95.0 95.0 89.6 72.6 88.3 93.3 94.4 95.1
(5, 5) 94.9 92.7 92.1 78.2 89.5 94.5 95.7 91.5
(10, 5) 93.3 91.9 91.6 80.1 90.5 93.8 94.4 90.1
(5, 10) 95.0 94.3 92.1 77.4 90.0 95.6 97.8 90.6
(10, 10) 94.8 93.5 93.4 78.7 92.0 97.1 98.2 90.6
(5, 15) 95.1 94.6 92.2 76.2 90.0 95.6 98.4 90.6
(10, 15) 94.7 93.8 93.2 78.6 92.3 97.7 99.2 90.3
(5, 20) 95.0 94.7 92.0 76.2 90.5 95.9 98.7 90.3
(10, 20) 94.6 94.1 93.3 78.0 92.5 98.0 99.6 90.3

AL
(0, –) 24.6 24.6 23.0 18.5 24.6 24.6 25.0 24.6
(5, 5) 27.6 27.5 26.1 21.7 26.2 27.7 30.4 36.3
(10, 5) 31.7 30.6 31.1 24.9 28.1 31.9 37.2 44.2
(5, 10) 25.8 26.0 25.1 20.6 26.1 27.8 33.9 58.6
(10, 10) 27.3 27.8 27.4 22.1 28.0 32.6 49.1 77.3
(5, 15) 25.8 25.9 25.1 20.3 26.1 27.9 35.9 83.1
(10, 15) 27.1 27.3 26.9 22.1 27.9 32.8 60.1 113.3
(5, 20) 25.6 25.7 24.8 20.2 26.0 27.7 37.2 109.2
(10, 20) 27.0 27.1 26.7 21.7 27.9 32.9 69.4 149.4

IF
(0, –) 1.01 1.44 45.25 54.19 4.65 2.11 1.86 0.98
(5, 5) 2.23 5.03 42.73 52.94 4.30 1.96 1.80 0.99
(10, 5) 3.73 5.36 40.53 51.92 3.98 1.86 1.82 0.98
(5, 10) 1.99 3.46 43.56 53.41 4.26 1.90 1.79 0.98
(10, 10) 3.10 5.35 41.73 52.69 3.86 1.70 1.93 0.98
(5, 15) 1.98 3.13 43.58 53.52 4.23 1.88 1.76 0.98
(10, 15) 3.13 4.62 42.30 52.80 3.84 1.66 2.07 0.98
(5, 20) 1.97 2.93 43.84 53.50 4.21 1.88 1.75 0.98
(10, 20) 3.11 4.23 42.45 52.84 3.80 1.65 2.18 0.98

Table 1: Average values of RMSEs, CPs, ALs and IFs for the proposed extremely heavily-tailed error distribution with fixed $\gamma$ (EH) and estimated $\gamma$ (aEH), the LPTN distribution with two choices of the tuning parameter (LP1 and LP2), the Cauchy distribution (C), the $t$-distribution with 3 degrees of freedom (T3) and with estimated degrees of freedom (aT), and the normal distribution (N), based on 500 replications in 9 combinations of $(\omega, \mu)$. All values except IFs are multiplied by 100.

5 Real data examples

The posterior robustness of the proposed EHE distribution is demonstrated via the analysis of two real datasets: the Boston housing data and the diabetes data. The goal of the statistical analysis here is variable selection with 29 and 64 predictors, respectively, in the presence of outliers. Our robustness scheme is a prominent part of such analysis, allowing the use of unbounded prior densities for strong shrinkage effects (specifically, the horseshoe priors discussed in Section 3.3) while protecting the posteriors from potential outliers. The former dataset is suspected to be contaminated by some outliers, where the difference between the proposed EHE distribution and the traditional $t$-distribution is emphasized. In contrast, the latter dataset is free from extreme outliers, by which we discuss the possible efficiency loss caused by the use of the EHE distribution.

In our examples, we consider robust Bayesian inference using the proposed method while taking account of variable selection, since the number of covariates is not small in the two cases. Specifically, we employed the horseshoe prior as described in Section 3.3. For comparison, we also applied Bayesian regression with the normal and $t$-error distributions, where the degrees of freedom of the latter are also estimated, while using the horseshoe prior for the regression coefficients. In these three models, we assign the same prior distributions as in Section 4. Note that the horseshoe prior can be easily incorporated into the regression models with both the normal and $t$-distributions, and efficient Gibbs sampling methods can be used. On the other hand, it is not straightforward to incorporate such priors into the robust method with the LPTN distribution, and we therefore omitted it from the comparison. In all the methods, we generated 5000 posterior samples after discarding the first 2000 posterior samples as burn-in.

5.1 Boston housing data

We first consider the famous Boston housing dataset (Harrison and Rubinfeld, 1978). The response variable is the corrected median value of owner-occupied homes (in 1,000 USD). The covariates in the original dataset consist of 14 continuous-valued variables containing information about the houses, such as longitude and latitude, and 1 binary covariate. After standardizing the continuous covariates, we also create their squared values, which results in 29 covariates in our models. The sample size is $n = 506$.

To see the presence of outliers, we first applied a simple linear regression model with a Gaussian error distribution to the dataset and computed the standardized residuals, which are shown in the left panel of Figure 3. The large residuals in the figure imply possible outliers in the dataset, which would affect the inference on the regression coefficients and make the analysis by the standard Gaussian regression model implausible.

In the proposed error distribution, the effect of possible outliers is reflected in the posterior of $s$, i.e., the mixing proportion of the extremely heavily-tailed distribution. The trace plot of the posterior samples of $s$ under the EHE model is presented in the right panel of Figure 3. Since all the sampled values are bounded away from zero, a certain proportion of the heavy-tailed distribution is suggested to take account of the outliers shown in the left panel. Other than the default prior, we also applied slightly more informative priors based on the prior belief that $s$ should be small, but the results were almost the same for all the parameters.

The posterior means and credible intervals of the regression coefficients based on the three methods are shown in Figure 4. The results of the normal error model are quite different from those of the EHE and $t$-distributions. The difference in the estimates is visually clear especially for the significant covariates (if we define significance in the sense that the credible interval does not contain zero), as the result of sensitivity to the outliers observed in Figure 3. Comparing the models with the EHE and $t$-distributions, they select the same set of covariates by significance, but the posterior credible intervals in the EHE model are shorter on average than those in the $t$-distribution model, showing the efficiency of the EHE model. This finding is consistent with the simulation results in Section 4.

Figure 3: Standardized residuals (left) and trace plot of $s$ (mixing proportion) in the proposed EHE distribution (right), obtained from the Boston housing data.
Figure 4: Posterior means and credible intervals of the regression coefficients under the normal error distribution (N), the proposed EHE distribution, and the $t$-distribution (T) with estimated degrees of freedom, applied to the Boston housing data.

5.2 Diabetes data

We next consider another famous dataset known as the diabetes data (Efron et al., 2004). The data contain information on $n = 442$ individuals, with 10 covariates on individual information (age and sex) and several medical measures. We consider the same formulation of the linear model as in Efron et al. (2004); the set of predictors consists of the 10 main effects, 45 interactions, and 9 squared values, which results in $p = 64$ predictors in the model.

Similarly to the analysis of the Boston housing data, we check the standardized residuals computed under the standard linear regression model, which are presented in the left panel of Figure 5. Few outliers are found in the dataset, as most of the residuals are contained in a narrow interval around zero, which strongly supports the standard normal assumption in this example. In the main analysis by the regression models with the horseshoe prior and the three error distributions (normal, $t$- and EHE distributions), we generated 5000 posterior samples after discarding the first 2000 posterior samples as burn-in.

The right panel of Figure 5 shows the trace plot of the posterior samples of $s$. All the sampled values are very close to zero, as expected from the residual plot in the left panel of Figure 5. Because a small weight $s$ is inferred from the data, the heavy-tailed component of the finite mixture is regarded as “redundant” for this dataset. The same sensitivity analysis on the choice of priors for $s$ is done as in the previous section, and we find no significant change in the results.

To see the possible inefficiency of using the EHE models for a dataset without outliers, the posterior means and credible intervals of the regression coefficients are reported in Figure 6. The results of the three models are comparable; the predictors selected by significance are almost the same under the three models. The only notable difference is that the credible intervals produced by the $t$-distribution model are slightly wider than those of the other two methods. This indicates the loss of efficiency in using the $t$-distribution under no outliers, as also confirmed in the simulation results in Section 4. In contrast, the difference in the credible intervals of the Gaussian and EHE models is hardly visible in the figure. We conclude from this finding that the EHE model is a safe option; even if no outlier exists, the efficiency loss in estimation is minimal.

Figure 5: Standardized residuals (left) and trace plot of $s$ (mixing proportion) in the proposed EHE distribution (right), obtained from the diabetes data.
Figure 6: Posterior means and credible intervals of the regression coefficients under the normal error distribution (N), the proposed EHE distribution, and the $t$-distribution (T) with estimated degrees of freedom, applied to the diabetes data.

6 Discussions

While we focused on the inference for the regression coefficients and scale parameter in this research, it is also of great interest to employ predictive analysis based on the proposed model. Because the EHE distribution, like many log-regularly varying distributions, is too heavily tailed to have finite moments, the posterior predictive mean under the EHE models does not exist. In predictive analysis, one therefore needs to consider the posterior predictive median or other alternatives for point prediction. In uncertainty quantification, the second component of the EHE distribution could have a significant impact on the posterior predictive credible intervals because of its heavy tails. In practice, it is important to monitor the posterior of the mixing weight $s$ to interpret the predictive analysis.

The use of the proposed method is not limited to linear regression models; it can be immediately applied to other Gaussian models such as graphical models or state-space models. Even under these highly structured models, we are able to develop efficient posterior computation algorithms by utilizing the hierarchical representation of the proposed error distribution. Similar theoretical robustness properties may also be confirmed for those models.

Acknowledgement

This work is supported by the Japan Society for the Promotion of Science (grant numbers 18K12757, 17K17659 and 18H00835).

Appendix

A1 Lemmas

We provide two lemmas used in the proofs of Proposition 1 and Theorem 1.

Lemma A1.

Let and be continuous, positive, and integrable functions defined on . Suppose that . Then

Proof.

We can assume that ; if , then we can exchange the definitions of and , and this reduces to the case of . Let be either or . We can also assume without loss of generality that and are integrable. To see this, observe that, for any , there exist satisfying

and, for these and , there also exists such that . Hence, for all , the covariance inequality implies

where is the indicator function ( if and otherwise) and the density of random variable is proportional to . Finally, we have

which shows that the difference of and is ignorable in . This result verifies that, if is not integrable, then we can replace it by .

Again, assume and both and are integrable. Let . Then we have

as since is assumed to be integrable on . Therefore,

(A1)

as . Furthermore, uniformly in ,

(A2)

as by assumption. Combining (A1) and (A2) gives the desired result. ∎

Lemma A2.

Let . Then we have

Proof.

The inequality in part (a) is trivial when ; the left-hand side is bounded by . For the case of , first observe that

for all . Then it is immediate from this expression that

for . For part (b), we use the same expression to obtain

by the dominated convergence theorem. ∎

A2 Proof of Proposition 1

Here we prove Proposition 1. We show that

for some constant . Since

by Lemma A1, we can assume . Then we have for sufficiently large

where the last equality follows by making the change of variables . Now, by part (a) of Lemma A2, the integrand is bounded by