Robust Bayesian Regression with Synthetic Posterior

10/02/2019 · by Shintaro Hashimoto, et al.

Regression models are fundamental tools in statistics, but they typically suffer from outliers. While several robust methods have been proposed based on frequentist approaches, Bayesian methods are preferable in terms of the ease of uncertainty quantification of estimation results. In this article, we propose a robust Bayesian method for regression models by introducing a synthetic posterior distribution based on robust divergence. We also introduce shrinkage priors for the regression coefficients to carry out variable selection. We provide an efficient posterior computation algorithm using the Bayesian bootstrap within Gibbs sampling. We carry out simulation studies to evaluate the performance of the proposed method, and apply it to real datasets.

1 Introduction

The maximum likelihood and least squares methods are basic strategies for inference in parametric statistical models. However, these methods are strongly affected by outliers, and several alternative robust methods have been developed, e.g. M-estimation, least absolute deviation (LAD) regression, and so on. In addition to this problem, we often face data with a large number of covariates. In this case, shrinkage estimation of the coefficients in regression models, such as the Lasso (Tibshirani, 1996), is known to be useful, but these methods also suffer from the problem of outliers. To overcome these problems, we may use LAD regression or a t-distribution as the error distribution together with shrinkage methods, while it is known that such methods do not work well when the contamination ratio is not small or the outliers concentrate on one side of the tail region. On the other hand, robust estimation methods based on divergence have attracted attention because of their automatic rejection of outliers. The pioneering work by Basu et al. (1998) proposed the density power divergence, and they studied robustness in the sense of the influence function. As related work, Jones et al. (2001) and Fujisawa and Eguchi (2008) considered a slightly modified version of the density power divergence, which is now called the γ-divergence, and the γ-divergence is shown to be robust against heavy contamination. Furthermore, Kawashima and Fujisawa (2017) employed the γ-divergence to propose robust and sparse linear regression methods.

In the Bayesian framework, we can also construct sparse linear regression models by using shrinkage prior distributions which induce sparsity in the regression coefficients. They are roughly divided into two types of priors. The first type is discrete mixture priors known as spike-and-slab priors (e.g. Ishwaran and Rao, 2005; Rockova and George, 2018). Although these priors lead some of the coefficients to be exactly zero, the complexity of posterior computation grows explosively with the number of covariates. The second type is continuous shrinkage priors such as the Laplace prior (Park and Casella, 2008) and the horseshoe prior (Carvalho et al., 2010). The Bayesian linear regression based on the Laplace prior is known as the Bayesian lasso. While estimators based on such priors cannot be exactly zero, there exist efficient Markov chain Monte Carlo (MCMC) algorithms for posterior computation. In this article, we adopt the latter type of priors for reasons of practical use. The Bayesian approach to sparse regression models is very attractive because it can take the uncertainty of estimation into account; that is, we can easily compute credible intervals, whereas this is not necessarily easy in frequentist approaches. Also, we can include the uncertainty of a tuning parameter in the estimation results by averaging over the posterior distribution. However, these Bayesian linear regression models also suffer from outliers for the same reasons as the frequentist approaches. As a remedy, we may use the Bayesian lasso with heavy-tailed error distributions such as the t-distribution for robustness, but this strategy is not necessarily a good solution since it is not easy to extend to other regression models such as generalized linear models. Moreover, the use of a heavy-tailed distribution does not necessarily work well under a variety of contamination structures.

In this paper, we propose the use of a synthetic posterior based on robust divergence combined with shrinkage priors for the regression coefficients. Recently, robust Bayesian methods via synthetic posteriors have been proposed (e.g. Bissiri et al., 2016; Bhattacharya et al., 2019; Miller and Dunson, 2019; Nakagawa and Hashimoto, 2019), but such methodologies have been demonstrated mainly in low-dimensional parametric models to show their good robustness properties through numerical studies. In contrast, this paper is concerned with regression models with a relatively large number of covariates. As discussed later, the main difficulty with the proposed synthetic posterior is its complicated functional form in the model parameters, so that the main bottleneck is efficient posterior computation. To solve this problem, we provide an efficient algorithm which generates random samples from the complicated full conditional distribution via optimization of a randomized objective function within a Gibbs sampler. This idea is related to the Bayesian bootstrap (Rubin, 1981), the weighted likelihood bootstrap (Newton and Raftery, 1994) and nonparametric Bayesian updating (Lyddon et al., 2018). In the optimization step, we use a modified version of the Majorization-Minimization algorithm given in Kawashima and Fujisawa (2017). The advantage of the proposed algorithm is that there are no rejection steps, and its performance in terms of mixing and autocorrelation is quite reasonable.
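
As a minimal illustration of the randomization behind these schemes (a generic property of Rubin's Bayesian bootstrap rather than anything specific to this paper), Dirichlet(1, ..., 1) weights can be generated by normalizing independent standard exponential draws; the Python snippet below is an illustrative sketch.

```python
import numpy as np

def bayesian_bootstrap_weights(n, rng=None):
    """Rubin (1981)-style Bayesian bootstrap weights: a Dirichlet(1, ..., 1)
    vector over the n observations, obtained by normalizing Exp(1) draws."""
    rng = np.random.default_rng() if rng is None else rng
    e = rng.exponential(size=n)
    return e / e.sum()
```

Each such weight vector defines one randomized objective, and its minimizer serves as one approximate posterior draw.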

The paper is organized as follows. In Section 2, we introduce the setting of our model and formulate the synthetic posterior distribution. Next, we consider the Bayesian linear regression model under shrinkage priors constructed as scale mixtures of normals. Then a new posterior computation algorithm for the synthetic posterior with shrinkage priors is proposed. In Section 3, we provide results of numerical studies and real data analysis. Under several types of contamination, it is shown that the proposed method outperforms the original Bayesian lasso and the Bayesian lasso with t-distributed errors in terms of mean squared error, average length of credible intervals and coverage probability. We also examine the mixing properties of the proposed algorithm compared with other methods.

2 Robust Bayesian regression via synthetic posterior

2.1 Settings and synthetic posterior

Suppose we have independent observations (y_i, x_i), i = 1, ..., n, where y_i is a continuous response and x_i is a p-dimensional vector of covariates. We consider fitting the regression model y_i = x_i'β + ε_i with ε_i ~ N(0, σ²). Let π(β, σ²) be a prior distribution for the model parameters β and σ². Then the standard posterior distribution of the model parameters is given by

π(β, σ² | D) ∝ π(β, σ²) ∏_{i=1}^n f(y_i; x_i'β, σ²),        (1)

where f(·; μ, σ²) is the density function of N(μ, σ²), and D = {(y_1, x_1), ..., (y_n, x_n)} denotes the set of sampled data. When there exist outliers, namely observations with large residuals y_i − x_i'β, the posterior distribution (1) is known to be sensitive to them, and it could produce biased or inefficient posterior inference.
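
For reference, the following Python sketch evaluates the unnormalized log of the standard posterior (1) for the normal linear model; the normal prior on β, the inverse-gamma prior on σ², and the hyperparameter values b0, A0, a0, c0 are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from scipy import stats

def log_standard_posterior(beta, sigma2, y, X, b0=None, A0=1e-2, a0=1.0, c0=1.0):
    """Unnormalized log of the standard posterior (1): normal likelihood with a
    N(b0, (1/A0) I) prior on beta and an IG(a0, c0) prior on sigma2.
    Hyperparameter values here are illustrative, not taken from the paper."""
    if b0 is None:
        b0 = np.zeros(X.shape[1])
    resid = y - X @ beta
    loglik = np.sum(stats.norm.logpdf(resid, scale=np.sqrt(sigma2)))
    logprior = (np.sum(stats.norm.logpdf(beta, loc=b0, scale=np.sqrt(1.0 / A0)))
                + stats.invgamma.logpdf(sigma2, a0, scale=c0))
    return loglik + logprior
```

Because the Gaussian log-likelihood grows quadratically in the residuals, a single gross outlier can dominate this quantity, which is the sensitivity discussed above.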

To overcome this problem, we propose replacing the log-likelihood function in (1) with a robust alternative. Specifically, we employ the γ-divergence (Jones et al., 2001; Fujisawa and Eguchi, 2008). Note that γ > 0 is a tuning parameter that controls the degree of robustness, and the γ-divergence-based objective reduces to the log-likelihood function as γ → 0. Straightforward calculation shows that

(2)

where the constant term is irrelevant in the sense that it does not depend on β and σ². Based on (2), we define the following synthetic posterior:

(3)

Since the synthetic posterior reduces to the standard posterior (1) as γ → 0, the synthetic posterior (3) can be regarded as a natural extension of the standard posterior (1).
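
To give a concrete, hedged reading of the γ-based objective behind (2), the sketch below implements one common empirical form of the γ-cross entropy for the normal linear model, namely −(1/γ) log of the average of f(y_i; x_i'β, σ²)^γ plus the term (1/(1+γ)) log ∫ f(t; μ, σ²)^{1+γ} dt, which for the normal density equals −(γ/(2(1+γ))) log(2πσ²) − (1/(2(1+γ))) log(1+γ). The paper's equation (2) may differ in additive constants or in scaling by n, so treat this as an illustrative assumption.

```python
import numpy as np

def gamma_loss(beta, sigma2, y, X, gam):
    """One empirical form of the gamma-cross-entropy loss for y_i ~ N(x_i'beta, sigma2).
    Assumed form (the paper's (2) may differ by constants or scaling with n):
      -(1/gam) * log( mean_i f_i^gam ) + (1/(1+gam)) * log( int f(t)^{1+gam} dt ),
    with int f^{1+gam} dt = (2*pi*sigma2)^(-gam/2) * (1+gam)^(-1/2)."""
    resid = y - X @ beta
    logf = -0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2)  # log f(y_i)
    # stable log of the average of f_i^gam
    log_mean = np.logaddexp.reduce(gam * logf) - np.log(len(y))
    first = -log_mean / gam
    second = (-0.5 * gam * np.log(2 * np.pi * sigma2) - 0.5 * np.log(1 + gam)) / (1 + gam)
    return first + second
```

As γ → 0 this loss tends to the average negative log-likelihood, consistent with the statement that the synthetic posterior (3) reduces to (1).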

2.2 Posterior computation

In what follows, we consider standard priors for (β, σ²), that is, a normal prior for β and an inverse-gamma prior for σ², independently, with fixed hyperparameters. Since the synthetic posterior distribution (3) does not have a familiar form, posterior computation to generate random samples of (β, σ²) is not straightforward. We may use crude approaches such as a Metropolis-Hastings algorithm, but the acceptance probabilities might be very small even when the dimension of the covariates is moderate. To avoid this undesirable situation, we adopt an approximate sampling strategy using the Bayesian bootstrap (Rubin, 1981) or nonparametric Bayesian updating (Lyddon et al., 2018), which generates posterior samples as the minimizers of the following weighted objective function:

(4)

where w_1, ..., w_n are random weights generated as in the Bayesian bootstrap. The minimization of (4) can be efficiently carried out by a Majorization-Minimization (MM) algorithm (Hunter and Lange, 2004) obtained by a slight modification of the one given in Kawashima and Fujisawa (2017). From Jensen's inequality, with current values of the model parameters, an upper bound of the objective function (4) can be obtained as follows:

(5)

where the additive constant is irrelevant and the new weights in the current iteration are defined as

(6)

noting that these new weights are normalized to sum to one. Since the upper bound given in (5) is a familiar function of β and σ², its minimizer is easily obtained, and the updating process is given by

(7)

Therefore, the MM algorithm for obtaining the minimizer of (4) repeats the computation of the weights in (6) and the updating of the parameter values via (7) until convergence. Since all the steps are available in analytical form, the MM algorithm is very efficient and easy to implement.
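
The following Python sketch implements a prior-free version of this MM scheme for the normal linear model and fixed bootstrap weights w (summing to one): the surrogate weights are proportional to w_i f(y_i; x_i'β^(t), σ²^(t))^γ, β is updated by weighted least squares, and σ² has the closed-form update (1+γ) Σ_i w̃_i (y_i − x_i'β)², which minimizes the Jensen upper bound of the γ-loss sketched earlier. The paper's updates (6)-(7) additionally involve the prior term, so the exact expressions may differ.

```python
import numpy as np

def mm_gamma_fit(y, X, w, gam, n_iter=100, tol=1e-8):
    """MM minimization of a weighted, prior-free gamma-objective for
    y_i ~ N(x_i'beta, sigma2); w are fixed bootstrap weights summing to one.
    A sketch of the surrogate around (5)-(7), without the paper's prior terms."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS start
    sigma2 = np.mean((y - X @ beta) ** 2)
    for _ in range(n_iter):
        resid = y - X @ beta
        logf = -0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2)
        # surrogate weights: proportional to w_i * f_i^gamma (cf. (6)/(10))
        logw = np.log(w) + gam * logf
        wt = np.exp(logw - np.logaddexp.reduce(logw))   # normalized, sums to one
        # weighted least squares update of beta
        WX = X * wt[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y)
        resid = y - X @ beta_new
        # closed-form sigma2 update minimizing this surrogate
        sigma2_new = (1 + gam) * np.sum(wt * resid**2)
        converged = (np.max(np.abs(beta_new - beta)) < tol
                     and abs(sigma2_new - sigma2) < tol)
        beta, sigma2 = beta_new, sigma2_new
        if converged:
            break
    return beta, sigma2
```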

To generate random samples from the synthetic posterior distribution (3), we repeatedly generate the random weights and solve the minimization problem (4) for each set of weights via the MM algorithm. The resulting minimizers can be approximately regarded as posterior samples from (3), and the approximation error would be negligible when the sample size is large (e.g. Newton and Raftery, 1994; Lyddon et al., 2018; Newton et al., 2018).
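
Putting the pieces together, a hedged sketch of the resulting sampler (still without shrinkage priors) draws Dirichlet(1, ..., 1) weights and runs the MM routine from the previous sketch once per draw, so that each optimization yields one approximate draw from (3). The default values of γ and the number of draws below are illustrative, not the paper's settings.

```python
import numpy as np

def synthetic_posterior_samples(y, X, gam=0.2, n_draws=2000, seed=1):
    """Approximate draws from the synthetic posterior via the Bayesian bootstrap:
    one MM optimization per Dirichlet weight vector. gam and n_draws are
    illustrative defaults. Assumes mm_gamma_fit from the previous sketch."""
    rng = np.random.default_rng(seed)
    n = len(y)
    betas, sigma2s = [], []
    for _ in range(n_draws):
        w = rng.dirichlet(np.ones(n))          # Bayesian-bootstrap weights
        b, s2 = mm_gamma_fit(y, X, w, gam)     # MM minimizer for this draw
        betas.append(b)
        sigma2s.append(s2)
    return np.array(betas), np.array(sigma2s)
```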

2.3 Incorporating shrinkage priors

When the dimension of x_i is moderate or large, it is desirable to select a subset of covariates that are associated with the response, which corresponds to shrinkage estimation of the regression coefficients. Here we rewrite the regression model to explicitly express an intercept term. We consider a normal prior with known variance for the intercept. For the regression coefficients, we introduce shrinkage priors expressed as scale mixtures of normals given by

(8)

where the mixing distribution may depend on some tuning parameter. Many existing shrinkage priors are included in this class by appropriately choosing the mixing distribution. Among many others, we consider two such priors: an exponential mixing distribution for the local scales, which results in the Laplace prior on the coefficients known as the Bayesian lasso (Park and Casella, 2008), and a half-Cauchy mixing distribution for the local scales, which results in the horseshoe prior (Carvalho et al., 2010). The detailed settings of these two priors are given as follows:

where the hyperparameters of the mixing distributions are fixed, the gamma distribution is parametrized by shape and rate, and the inverse-gamma distribution is parametrized by shape and scale. Note that the additional latent variable in the horseshoe prior is introduced to make posterior computation easier: marginalizing it out gives a density for the local scale proportional to a half-Cauchy density, so that the local scale follows a half-Cauchy distribution.
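
To illustrate the scale-mixture representation (8) for the two priors, the snippet below draws prior samples of a single coefficient: an exponential mixing distribution on the variance yields a Laplace marginal (the Bayesian lasso prior), while the two-level inverse-gamma construction yields a half-Cauchy local scale (the horseshoe prior). The rate and global-scale values are illustrative placeholders, not the paper's hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_laplace_prior(n, lam=1.0):
    """Laplace prior via a scale mixture: tau2 ~ Exp(rate=lam^2/2),
    beta | tau2 ~ N(0, tau2)  =>  beta ~ Laplace(scale=1/lam)."""
    tau2 = rng.exponential(scale=2.0 / lam**2, size=n)
    return rng.normal(0.0, np.sqrt(tau2))

def draw_horseshoe_prior(n, tau=1.0):
    """Horseshoe prior via the inverse-gamma auxiliary representation:
    nu ~ IG(1/2, 1), lam2 | nu ~ IG(1/2, 1/nu)  =>  sqrt(lam2) ~ half-Cauchy(0, 1),
    beta | lam2 ~ N(0, tau^2 * lam2)."""
    nu = 1.0 / rng.gamma(0.5, 1.0, size=n)            # IG(1/2, 1)
    lam2 = (1.0 / rng.gamma(0.5, 1.0, size=n)) / nu   # IG(1/2, 1/nu)
    return rng.normal(0.0, tau * np.sqrt(lam2))
```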

Using the mixture representation (8), the full conditional distribution of the regression coefficients and error variance, given the other parameters including the latent scale variables, is given by

(9)

Similarly to the previous section, we can generate approximate posterior samples of these parameters by minimizing the weighted objective function obtained by introducing the random bootstrap weights into the above expression, with the weights defined in the same way as in the previous section. The objective function can then be minimized by an MM algorithm similar to the one given in the previous section. Given the regression coefficients, the full conditional distributions of the latent scale variables and hyperparameters are the same as in the standard linear regression, so we can use existing Gibbs sampling methods. We summarize our MCMC algorithm under the two shrinkage priors in what follows.

MCMC algorithm under shrinkage prior

  • (Sampling of the regression coefficients and error variance)    Generate the random bootstrap weights, and set initial values of the parameters. Repeat the following procedure until convergence:

    • Compute the following weight:

      (10)
    • Update the parameter values:

    We adopt the final values as sampled values from the full conditional distribution.

  • (Sampling of the latent scale variables and hyperparameters)

    • (Laplace prior)   The full conditional distribution of each latent scale variable is inverse-Gaussian, in the parametrization of the inverse-Gaussian density given by

      f(x; μ', λ') = {λ'/(2πx³)}^{1/2} exp{ −λ'(x − μ')² / (2(μ')²x) },   x > 0.

      The full conditional distribution of the remaining hyperparameter is also of a standard form.

    • (Horseshoe prior)   The full conditional distributions of the latent scale variables and the auxiliary variables in the horseshoe representation are all inverse-gamma distributions.

Note that the sampling scheme for the main parameters does not use the previously sampled values of the regression coefficients and there are no rejection steps; thereby, the autocorrelation of the samples generated from the above MCMC algorithm is expected to be very small, which will be demonstrated in the numerical studies in Section 3.
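
For completeness, here are textbook forms of the latent-scale updates that the second sampling block refers to, written as a hedged Python sketch: the Park and Casella (2008) inverse-Gaussian update for the Bayesian lasso scales, and inverse-gamma updates for the horseshoe auxiliaries in the style of the usual half-Cauchy augmentation. The paper's own conditionals are not reproduced verbatim in this excerpt, so the exact parametrization (e.g., whether σ² enters) may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def update_lasso_scales(beta, sigma2, lam):
    """Textbook Park-Casella update for the Bayesian lasso latent scales:
    1/tau_j^2 | . ~ InverseGaussian(mean = sqrt(lam^2*sigma2/beta_j^2), shape = lam^2).
    The paper's exact parametrization may differ."""
    mu = np.sqrt(lam**2 * sigma2 / beta**2)
    inv_tau2 = rng.wald(mu, lam**2)      # numpy's wald(mean, scale) is IG(mean, shape)
    return 1.0 / inv_tau2

def update_horseshoe_scales(beta, sigma2, tau2, nu):
    """Inverse-gamma updates for the horseshoe auxiliaries, in the style of the
    usual half-Cauchy augmentation (a sketch; not the paper's exact conditionals):
      lam_j^2 | . ~ IG(1, 1/nu_j + beta_j^2/(2*tau2*sigma2)),
      nu_j    | . ~ IG(1, 1 + 1/lam_j^2)."""
    lam2_new = 1.0 / rng.gamma(1.0, 1.0 / (1.0 / nu + beta**2 / (2 * tau2 * sigma2)))
    nu_new = 1.0 / rng.gamma(1.0, 1.0 / (1.0 + 1.0 / lam2_new))
    return lam2_new, nu_new
```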

3 Numerical studies

3.1 Simulation study

We here evaluate the performance of the proposed methods through simulation studies. We first compare the point and interval estimation performance of the proposed methods with that of some existing methods. To this end, we consider a sparse linear regression model in which a few regression coefficients were set to nonzero values and the remaining coefficients were set to zero. The covariates were generated from a multivariate normal distribution with a specified correlation structure. For the error term, we adopted a contaminated structure in which each error is drawn from a contamination distribution with a certain contamination probability, which might be heterogeneous over the samples, and from the standard error distribution otherwise. We considered two settings for the contamination distribution, (I) and (II), noting that the contamination distribution has very large variance in scenario (I), while it tends to produce large values in scenario (II). Regarding the contamination probability, we considered the following scenarios:

Note that the contamination probability in the first setting is constant over the samples, which is referred to as homogeneous contamination. In the second setting, the probability depends on the covariates and differs over the samples, so it is referred to as heterogeneous contamination.

For each simulated dataset, we applied the proposed robust methods with Laplace and horseshoe priors, denoted by RBL and RHS, respectively, as well as the standard (non-robust) Bayesian lasso (BL). Moreover, we also applied the regression model with error terms following a t-distribution with a fixed number of degrees of freedom, denoted by tBL, as a possible competitor among robust Bayesian methods. The tuning parameter γ in the proposed methods was fixed at a prespecified value. In applying all the methods, we generated 2000 posterior samples after discarding the first 500 samples as burn-in. As point estimates of the regression coefficients, we computed the posterior median of each coefficient, and their performance is evaluated via the mean squared error (MSE). We also computed credible intervals of the coefficients and calculated their average length (AL) and coverage probability (CP). These values were averaged over 300 replications of the simulated datasets.
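
For clarity, the following sketch computes these evaluation criteria from a matrix of posterior draws: posterior medians, MSE of the medians against the true coefficients, average credible-interval length (AL) and coverage probability (CP). The 95% level used below is an assumed default, since the nominal level is not restated in this excerpt.

```python
import numpy as np

def summarize_posterior(beta_draws, beta_true, level=0.95):
    """beta_draws: (n_draws, p) posterior samples; beta_true: (p,) true values.
    Returns posterior medians, MSE, average interval length (AL) and coverage (CP).
    The 95% level is an assumed default, not stated in this excerpt."""
    med = np.median(beta_draws, axis=0)
    mse = np.mean((med - beta_true) ** 2)
    lo, hi = np.quantile(beta_draws, [(1 - level) / 2, (1 + level) / 2], axis=0)
    al = np.mean(hi - lo)
    cp = np.mean((lo <= beta_true) & (beta_true <= hi))
    return med, mse, al, cp
```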

In Figure 1, we present the results for the logarithm of MSE (log-MSE) with error bars corresponding to three times the estimated Monte Carlo errors. When there are no outliers, the standard BL method performs quite well, and the proposed two robust methods (RBL, RHS) perform almost in the same way. On the other hand, the performance of BL gets worse than that of the other robust methods as the ratio of outliers increases. Comparing the proposed methods with tBL, we observe that they perform similarly when the contaminated error distribution is symmetric and has large variance as in Scenario (I), possibly because the t-distribution can effectively adapt to such a structure. However, when the contamination distribution has a different mean as in Scenario (II), the performance of tBL gets worse than that of the proposed robust methods as the contamination increases. Comparing the proposed two robust methods, RHS using the horseshoe prior is slightly better than RBL using the Laplace prior in all the scenarios, which is reasonable since the horseshoe prior is known to perform better than the Laplace prior as a shrinkage prior for regression coefficients. Regarding interval estimation, the results for AL and CP are given in Figure 2 and Table 1, respectively. From Table 1, the CPs seem reasonable for all the methods, while tBL shows slight over-coverage. On the other hand, Figure 2 reveals a trend similar to that observed for log-MSE, so the credible intervals of BL are shown to be very inefficient when there exist outliers. It should also be pointed out that the credible intervals of tBL are inefficient compared with those of the proposed methods even under Scenario (I). Comparing the proposed two robust methods, RHS provides slightly more efficient interval estimation than RBL, which is consistent with the results for log-MSE in Figure 1.

We next checked the mixing properties of the proposed MCMC algorithm (denoted by RBL-proposal) compared with standard methods. We consider the same setting as in the above simulations. In particular, we show only the results for the (II)-Homo case; similar results were obtained in the other cases, i.e., (I)-Homo, (I)-Hetero and (II)-Hetero. In addition to the previous setting, we also consider a higher-dimensional case, in which we set the same true non-zero regression coefficients as before. For comparison, we employed the (non-robust) Bayesian lasso (BL) and the robust Bayesian lasso sampled with a Langevin algorithm (e.g. Welling and Teh, 2011) with a fixed step size (denoted by RBL-Langevin). Note that the Langevin algorithm is applied to the full conditional distribution of the regression coefficients and error variance, so this algorithm differs from the proposed algorithm only in the sampling from that full conditional distribution. Figures 3 and 4 show mixing and autocorrelation results for a single posterior simulation run. As is well known, the ordinary BL has an efficient Gibbs sampling algorithm, so its mixing and autocorrelation are not bad; however, the sample paths of BL are affected by outliers. The second rows show results of the Langevin algorithm for our proposed model: from Figures 3 and 4, its mixing is not good and the autocorrelations are very high. The bottom rows of Figures 3 and 4 show results based on the proposed algorithm. In this case, the sample paths are not affected by outliers and there appears to be essentially no autocorrelation among the posterior samples, because the posterior samples of the regression coefficients are obtained as solutions of the optimization problem at each iteration. Hence, the proposed method is robust against outliers and enables us to generate posterior samples without worrying about autocorrelation. Furthermore, our algorithm does not depend on any tuning parameters of the sampler (for example, step sizes must be chosen in Langevin or Metropolis algorithms). Figures 5 and 6 show the higher-dimensional case. Similarly to the lower-dimensional case, our algorithm performs well, while the Langevin algorithm shows even higher autocorrelation. We tried other possible choices of the tuning parameters in the Langevin algorithm, but the results did not change much.

Figure 1: Average values of log-MSE with Monte Carlo error bands based on 300 replications in four simulation scenarios.
Figure 2: Average values of average length of credible intervals with Monte Carlo error bands based on 300 replications in four simulation scenarios.
(I)-Homo (II)-Homo
BL tBL RBL RHS BL tBL RBL RHS
0 95.9 95.0 93.7 93.3 95.9 95.0 93.7 93.3
0.05 96.7 96.5 94.1 93.9 96.2 96.9 94.1 93.7
0.10 96.6 97.7 94.5 94.5 96.3 98.6 94.1 94.0
0.15 96.2 98.3 95.6 95.5 96.3 99.1 94.8 94.5
0.20 96.7 99.0 96.0 95.9 96.5 98.7 95.3 94.8
(I)-Hetero (II)-Hetero
BL tBL RBL RHS BL tBL RBL RHS
0 95.9 95.0 93.7 93.3 95.9 95.0 93.7 93.3
1 96.4 96.5 94.3 94.2 96.1 97.1 93.8 93.8
2 96.1 97.7 94.5 94.3 94.9 98.6 94.1 93.5
3 96.4 98.2 95.6 95.3 94.1 98.8 94.5 94.3
4 96.4 98.4 96.2 96.1 93.4 96.7 95.5 94.8
Table 1: Coverage probability of credible intervals averaged over 300 replications.
Figure 3: Trace plots of posterior samples for the three methods (BL, RBL-Langevin and RBL-proposal) in the (II)-Homo case (lower-dimensional setting). The red line corresponds to the true value.
Figure 4: Autocorrelation of posterior samples for the three methods (BL, RBL-Langevin and RBL-proposal) in the (II)-Homo case (lower-dimensional setting).
Figure 5: Trace plots of posterior samples for the three methods (BL, RBL-Langevin and RBL-proposal) in the (II)-Homo case (higher-dimensional setting). The red line corresponds to the true value.
Figure 6: Autocorrelation of posterior samples for the three methods (BL, RBL-Langevin and RBL-proposal) in the (II)-Homo case (higher-dimensional setting).

3.2 Real data examples

We compare the results of the proposed methods with those of the non-robust Bayesian lasso through applications to two real datasets: the well-known Boston Housing data (Harrison and Rubinfeld, 1978) and the Diabetes data from Efron et al. (2004). In our applications, we standardized all the covariates so that they have mean 0 and standard deviation 1. We used the log-transformed response values in the Boston Housing data and the original response values in the Diabetes data. For both datasets, we applied the proposed robust Bayesian methods with the Laplace prior (RBL) and the horseshoe prior (RHS), using the same value of γ in both methods. For comparison, we also applied the standard non-robust Bayesian lasso (BL). Based on 4000 posterior samples obtained after discarding the first 1000 samples as burn-in, we computed posterior medians as well as credible intervals of the regression coefficients, which are shown in Figure 7. It is observed that the two robust methods, RBL and RHS, produce similar results, while the standard BL provides quite different results from the others for some coefficients. In particular, in the Diabetes dataset, the two robust methods detected the 5th and 7th variables as significant based on their credible intervals, while the credible intervals of the BL method contain zero, which shows that the robust methods may be able to detect significant variables that the non-robust method cannot. A similar phenomenon is observed for the 3rd covariate in the Boston Housing data. Comparing the results for the two datasets, the difference between the two robust methods and the non-robust method is smaller in the Diabetes data than in the Boston Housing data; for example, the posterior medians of the three methods are almost identical in the Diabetes data. This might suggest that the ratio of outliers in the Diabetes dataset is smaller than that in the Boston Housing data.

Figure 7: Credible intervals and posterior medians of the regression coefficients based on the standard Bayesian lasso (BL) and the proposed robust Bayesian methods with Laplace prior (RBL) and horseshoe prior (RHS), applied to the Boston Housing data and the Diabetes data.

4 Conclusions and discussion

We proposed a new robust Bayesian regression method based on a synthetic posterior derived from the γ-divergence. Using a technique of nonparametric Bayesian updating that optimizes a weighted objective function within Gibbs sampling, we developed an efficient posterior computation algorithm for obtaining posterior samples of the regression coefficients under shrinkage priors. The numerical performance of the proposed method compared with existing methods is investigated through simulation studies and real data examples.

Although our presentation in this paper focuses on linear regression, the proposed method can conceptually be extended to other models such as generalized linear models (GLMs). However, under a general GLM, the corresponding objective function derived from the γ-divergence is not tractable since it includes integrals or infinite sums (Kawashima and Fujisawa, 2019), so posterior computation would not be feasible even with techniques similar to those used in this paper. Only logistic regression for binary outcomes could be a tractable regression model for which the γ-divergence is obtained in analytical form, and outliers in logistic regression are typically related to a mislabeling problem (e.g. Hung et al., 2018). A detailed investigation, including the development of an efficient posterior computation algorithm, would be valuable.

Finally, while γ controls the degree of robustness, it has been fixed throughout this paper. In our numerical studies, γ was fixed, and the simulation results are quite supportive: the proposed method performs significantly better than the other methods, and its performance even in the case without any outliers is comparable with that of the non-robust method. However, it would be interesting to consider a data-dependent choice of γ.

Acknowledgement

This work is partially supported by Japan Society for Promotion of Science (KAKENHI) grant numbers 18K12757 and 17K14233.

References

  • Basu et al. (1998) Basu, A., I. R. Harris, N. L. Hjort, and M. C. Jones (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika 85, 549–559.
  • Bhattacharya et al. (2019) Bhattacharya, A., D. Pati, and Y. Yang (2019). Bayesian fractional posteriors. The Annals of Statistics 47, 39–66.
  • Bissiri et al. (2016) Bissiri, P. G., C. Holmes, and S. G. Walker (2016). A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B 78, 1103–1130.
  • Carvalho et al. (2010) Carvalho, C. M., N. G. Polson, and J. G. Scott (2010). The horseshoe estimator for sparse signals. Biometrika 97, 465–480.
  • Efron et al. (2004) Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32, 407–499.
  • Fujisawa and Eguchi (2008) Fujisawa, H. and S. Eguchi (2008). Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis 99, 2053–2081.
  • Harrison and Rubinfeld (1978) Harrison, D. and D. L. Rubinfeld (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management 5, 81–102.
  • Hung et al. (2018) Hung, H., Z. Jou, and S. Huang (2018). Robust mislabel logistic regression without modeling mislabel probabilities. Biometrics 74, 145–154.
  • Hunter and Lange (2004) Hunter, D. R. and K. Lange (2004). A tutorial on MM algorithms. The American Statistician 58, 30–37.
  • Ishwaran and Rao (2005) Ishwaran, H. and J. S. Rao (2005). Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics 33, 730–773.
  • Jones et al. (2001) Jones, M. C., N. L. Hjort, I. R. Harris, and A. Basu (2001). A comparison of related density-based minimum divergence estimators. Biometrika 88, 865–873.
  • Kawashima and Fujisawa (2017) Kawashima, T. and H. Fujisawa (2017). Robust and sparse regression via gamma-divergence. Entropy 19, 608.
  • Kawashima and Fujisawa (2019) Kawashima, T. and H. Fujisawa (2019). Robust and sparse regression in GLM by stochastic optimization. Japanese Journal of Statistics and Data Science, to appear.
  • Lyddon et al. (2018) Lyddon, S., S. G. Walker, and C. Holmes (2018). Nonparametric learning from Bayesian models with randomized objective functions. 32nd Conference on Neural Information Processing Systems (NIPS 2018).
  • Miller and Dunson (2019) Miller, J. W. and D. B. Dunson (2019). Robust Bayesian inference via coarsening. Journal of the American Statistical Association 114, 1113–1125.
  • Nakagawa and Hashimoto (2019) Nakagawa, T. and S. Hashimoto (2019). Robust Bayesian inference via γ-divergence. Communications in Statistics: Theory and Methods, to appear.
  • Newton et al. (2018) Newton, M. A., N. G. Polson, and J. Xu (2018). Weighted Bayesian bootstrap for scalable Bayes. arXiv preprint.
  • Newton and Raftery (1994) Newton, M. A. and A. E. Raftery (1994). Approximate Bayesian inference with weighted likelihood bootstrap. Journal of the Royal Statistical Society: Series B 56, 3–48.
  • Park and Casella (2008) Park, T. and G. Casella (2008). The Bayesian lasso. Journal of the American Statistical Association 103, 681–686.
  • Rockova and George (2018) Rockova, V. and E. I. George (2018). The spike-and-slab lasso. Journal of the American Statistical Association 113, 431–444.
  • Rubin (1981) Rubin, D. B. (1981). The Bayesian bootstrap. The Annals of Statistics 9, 130–134.
  • Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 58, 267–288.
  • Welling and Teh (2011) Welling, M. and Y. W. Teh (2011). Bayesian learning via stochastic gradient Langevin dynamics. International Conference on Machine Learning (ICML 2011).