Extended Stochastic Gradient MCMC for Large-Scale Bayesian Variable Selection

02/07/2020 · Qifan Song et al. (Purdue University)

Stochastic gradient Markov chain Monte Carlo (MCMC) algorithms have received much attention in Bayesian computing for big data problems, but they are only applicable to a small class of problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. This paper proposes an extended stochastic gradient MCMC algorithm which, by introducing appropriate latent variables, can be applied to more general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. Numerical studies show that the proposed algorithm is highly scalable and much more efficient than traditional MCMC algorithms, substantially alleviating the computational burden of Bayesian methods for big data.


1 Introduction

After six decades of continual development, MCMC has proven to be a powerful, and often the only feasible, computational tool for analyzing data with complex structures. However, for large datasets, its computational cost can be prohibitive, as it requires all of the data to be processed at each iteration. To tackle this difficulty, a variety of scalable algorithms have been proposed in the recent literature. According to the strategies they employ, these algorithms can be grouped into a few categories: stochastic gradient MCMC algorithms (Welling & Teh, 2011; Ding et al., 2014; Ahn et al., 2012; Chen et al., 2014; Betancourt, 2015; Ma et al., 2015; Nemeth & Fearnhead, 2019), split-and-merge algorithms (Scott et al., 2016; Srivastava et al., 2018; Xue & Liang, 2019), mini-batch Metropolis-Hastings algorithms (Chen et al., 2016; Korattikara et al., 2014; Bardenet et al., 2014; Maclaurin & Adams, 2014; Bardenet et al., 2017), nonreversible Markov process-based algorithms (Bierkens et al., 2019; Bouchard-Côté et al., 2018), and discrete sampling algorithms based on the multi-armed bandit (Chen & Ghahramani, 2016).

Although scalable algorithms have been developed for both continuous and discrete sampling problems, they are difficult to apply to dimension jumping problems. Dimension jumping arises, for example, in variable selection, where the number of parameters changes from iteration to iteration in MCMC simulations. Under their current settings, the stochastic gradient MCMC and nonreversible Markov process-based algorithms are only applicable to problems for which the parameter space has a fixed dimension and the log-posterior density is differentiable with respect to the parameters. For the split-and-merge algorithms, it is unclear how to aggregate samples of different dimensions drawn from the posterior distributions based on different subsets of the data. The multi-armed bandit algorithms are only applicable to problems with a small discrete domain and can be extremely inefficient for high-dimensional variable selection problems. The mini-batch Metropolis-Hastings algorithms are in principle applicable to dimension jumping problems, but they are generally difficult to use. For example, the algorithms of Chen et al. (2016), Korattikara et al. (2014), and Bardenet et al. (2014) perform approximate acceptance tests using subsets of the data; the amount of data consumed by each test varies significantly from one iteration to another, which compromises their scalability. The algorithms of Maclaurin & Adams (2014) and Bardenet et al. (2017) perform exact tests but require a lower bound on the parameter distribution across its domain, and such a lower bound is usually difficult to obtain.

This paper proposes an extended stochastic gradient Langevin dynamics algorithm which, by introducing appropriate latent variables, extends the stochastic gradient Langevin dynamics algorithm to more general large-scale Bayesian computing problems, such as variable selection and missing data. The extended algorithm is highly scalable and much more efficient than traditional MCMC algorithms. Compared to the mini-batch Metropolis-Hastings algorithms, it is also much easier to use: it involves only a fixed amount of data at each iteration and does not require any lower bound on the parameter distribution.

2 A Brief Review of Stochastic Gradient Langevin Dynamics

Let $X_n=\{x_1,\ldots,x_n\}$ denote a set of independent and identically distributed samples drawn from the distribution $f(x\mid\theta)$, where $n$ is the sample size and $\theta\in\Theta\subset\mathbb{R}^p$ is the parameter. Let $\pi(X_n\mid\theta)$ denote the likelihood function, let $\pi(\theta)$ denote the prior distribution of $\theta$, and let $\log\pi(\theta\mid X_n)$ denote the log-posterior density function. If $\theta$ has a fixed dimension and $\log\pi(\theta\mid X_n)$ is differentiable with respect to $\theta$, then the stochastic gradient Langevin dynamics algorithm (Welling & Teh, 2011) can be applied to simulate from the posterior, which iterates by

$$\theta^{(t+1)}=\theta^{(t)}+\frac{\epsilon_{t+1}}{2}\,\widehat{\nabla_\theta\log\pi(\theta^{(t)}\mid X_n)}+\sqrt{\epsilon_{t+1}\tau}\,e_{t+1},\qquad e_{t+1}\sim N(0,I_p), \tag{1}$$

where $p$ is the dimension of $\theta$, $I_p$ is a $p\times p$ identity matrix, $\epsilon_{t+1}$ is the step size (also known as the learning rate), $\tau$ is the temperature, and $\widehat{\nabla_\theta\log\pi(\theta\mid X_n)}$ denotes an estimate of $\nabla_\theta\log\pi(\theta\mid X_n)$ based on a mini-batch of samples. The learning rate can be decreasing or kept constant. For the former, the convergence of the algorithm was studied in Teh et al. (2016); for the latter, it was studied in Sato & Nakagawa (2014) and Dalalyan & Karagulyan (2017). Refer to Nemeth & Fearnhead (2019) for more discussion of the theory, implementation and variants of this algorithm.
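As a concrete illustration of update (1), the following is a minimal SGLD sketch for a toy Bayesian linear regression; the data-generating settings, the $N(0,\tau^2 I)$ prior, and all tuning constants are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data y = X beta + noise; all settings here are illustrative.
n, p, sigma2 = 10_000, 5, 1.0
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

def grad_log_post(theta, idx, tau2=10.0):
    """Mini-batch estimate of grad_theta log pi(theta | X_n): the
    mini-batch log-likelihood gradient is rescaled by n/m, and the
    N(0, tau2 I) prior contributes -theta / tau2."""
    Xb, yb = X[idx], y[idx]
    g_lik = Xb.T @ (yb - Xb @ theta) / sigma2
    return (n / len(idx)) * g_lik - theta / tau2

def sgld(T=2000, m=100, eps=1e-5, tau=1.0):
    theta = np.zeros(p)
    draws = []
    for _ in range(T):
        idx = rng.choice(n, size=m, replace=False)  # draw a mini-batch
        g = grad_log_post(theta, idx)
        noise = rng.normal(scale=np.sqrt(eps * tau), size=p)
        theta = theta + 0.5 * eps * g + noise       # update (1)
        draws.append(theta.copy())
    return np.array(draws)

draws = sgld()
print(draws[1000:].mean(axis=0))  # crude posterior-mean estimate of beta
```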

3 An Extended Stochastic Gradient Langevin Dynamics Algorithm

To extend the applications of the stochastic gradient Langevin dynamics algorithm to varying-dimensional problems such as variable selection and missing data, we first establish an identity for evaluating the gradient of the log-posterior, $\nabla_\theta\log\pi(\theta\mid X_n)$, in the presence of latent variables. As illustrated below, the latent variables can be the model indicator in variable selection problems or the missing values in missing data problems.

Lemma 1

For any latent variable $u$,

$$\nabla_\theta\log\pi(\theta\mid X_n)=\mathbb{E}_{u\mid\theta,X_n}\!\left[\nabla_\theta\log\pi(\theta\mid X_n,u)\right], \tag{2}$$

where $\pi(u\mid\theta,X_n)$ and $\pi(\theta\mid X_n,u)$ denote the conditional distribution of $u$ and the conditional posterior of $\theta$, respectively.

Lemma 1 provides a Monte Carlo estimator for $\nabla_\theta\log\pi(\theta\mid X_n)$, obtained by averaging $\nabla_\theta\log\pi(\theta\mid X_n,u)$ over samples drawn from the conditional distribution $\pi(u\mid\theta,X_n)$. The identity (2) is similar to Fisher's identity, which has been used to evaluate the gradient of the log-likelihood function in the presence of latent variables; see, e.g., Cappé et al. (2005). When $n$ is large, the computation can be accelerated by subsampling. Let $X_n^{(s)}=\{x_{s_1},\ldots,x_{s_m}\}$ denote a subsample, where $m$ denotes the subsample size. Without loss of generality, we assume that $n$ is a multiple of $m$, i.e., $n/m$ is an integer. Let $\tilde X_n$ denote the dataset formed by duplicating the subsample $n/m$ times, whose total sample size is also $n$. Following from (2), we have

$$\nabla_\theta\log\pi(\theta\mid\tilde X_n)=\mathbb{E}_{u\mid\theta,\tilde X_n}\!\left[\nabla_\theta\log\pi(\theta\mid\tilde X_n,u)\right]. \tag{3}$$

Since (3) holds and $\nabla_\theta\log\pi(\theta\mid\tilde X_n)$ is unbiased for $\nabla_\theta\log\pi(\theta\mid X_n)$, $\mathbb{E}_{u\mid\theta,\tilde X_n}[\nabla_\theta\log\pi(\theta\mid\tilde X_n,u)]$ forms an unbiased estimator of $\nabla_\theta\log\pi(\theta\mid X_n)$. Sampling from $\pi(u\mid\theta,\tilde X_n)$ can be much faster than sampling from $\pi(u\mid\theta,X_n)$, as for the former the likelihood only needs to be evaluated on a mini-batch of samples.
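In code, the estimator in (3) amounts to rescaling the mini-batch complete-data log-likelihood gradient by $n/m$ and averaging over draws of the latent variable; a hypothetical sketch (the helper names and signatures are ours, not the paper's):

```python
def esgld_gradient(theta, minibatch, n, sample_latent,
                   grad_loglik, grad_logprior, L=10):
    """Monte Carlo estimate of grad_theta log pi(theta | X_n) via (2)-(3).

    sample_latent(theta, minibatch) draws u ~ pi(u | theta, duplicated data);
    grad_loglik(theta, minibatch, u) is the complete-data log-likelihood
    gradient on the mini-batch, which is rescaled by n/m below to mimic the
    n/m-fold duplication; grad_logprior(theta) is the prior gradient.
    """
    m = len(minibatch)
    g = sum(grad_loglik(theta, minibatch, sample_latent(theta, minibatch))
            for _ in range(L)) / L          # average over L latent draws
    return (n / m) * g + grad_logprior(theta)
```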

3.1 Bayesian Variable Selection

As an illustrative example, we consider the problem of variable selection for the linear regression

$$y = x^{T}\beta + \epsilon, \tag{4}$$

where $\epsilon$ is a zero-mean Gaussian random error with variance $\sigma^2$, $\beta=(\beta_1,\ldots,\beta_p)^{T}$ is the vector of regression coefficients, and $x=(x_1,\ldots,x_p)^{T}$ is the vector of explanatory variables. Let $\xi\in\{0,1\}^p$ be a binary vector indicating the variables included in model $\xi$, and let $\beta_\xi$ be the vector of regression coefficients associated with the model $\xi$. From the perspective of Bayesian statistics, we are interested in estimating the posterior probability $\pi(\xi\mid X_n)$ for each model $\xi$ and the posterior mean $\mathbb{E}[h(\beta_\xi)\mid X_n]$ for some integrable function $h$, where the model space $\Xi=\{0,1\}^p$ comprises $2^p$ models. Both quantities can be estimated using the reversible jump Metropolis-Hastings algorithm (Green, 1995) by sampling from the posterior distribution $\pi(\xi,\beta_\xi\mid X_n)$. However, when $n$ is large, the algorithm can be extremely slow due to repeated scans of the full dataset in simulations.

As mentioned above, the existing stochastic gradient MCMC algorithms cannot be directly applied to simulate from $\pi(\xi,\beta_\xi\mid X_n)$ because of the dimension jumping involved in model transitions. To address this issue, we introduce an auxiliary variable $\theta\in\mathbb{R}^p$, which links $\beta_\xi$ and $\theta$ through

$$\beta=\theta\circ\xi, \tag{5}$$

where $\circ$ denotes elementwise multiplication. Let $\theta_\xi$ and $\theta_{-\xi}$ be the subvectors of $\theta$ corresponding to the nonzero and zero elements of $\xi$, respectively. Note that $\beta$ is sparse, with all elements corresponding to $\theta_{-\xi}$ being zero, while $\theta$ can be dense. Based on the relation (5), we suggest simulating $\theta$ from $\pi(\theta\mid X_n)$ using the stochastic gradient Langevin dynamics algorithm, for which the gradient $\nabla_\theta\log\pi(\theta\mid X_n)$ can be evaluated using Lemma 1 by treating $\xi$ as the latent variable. Let $\pi(\theta)$ denote the prior of $\theta$. To simplify the computation of $\nabla_\theta\log\pi(\theta\mid X_n,\xi)$, we further assume the a priori independence $\pi(\theta)=\prod_{i=1}^{p}\pi(\theta_i)$. Then it is easy to derive

$$\nabla_{\theta_i}\log\pi(\theta\mid X_n,\xi)=\begin{cases}\nabla_{\theta_i}\log\pi(X_n\mid\beta_\xi)+\nabla_{\theta_i}\log\pi(\theta_i), & \xi_i=1,\\ \nabla_{\theta_i}\log\pi(\theta_i), & \xi_i=0,\end{cases}$$

which can be used to evaluate $\nabla_\theta\log\pi(\theta\mid X_n)$ by Lemma 1. If a mini-batch of data is used, the gradient can be evaluated based on (3). This leads to an extended stochastic gradient Langevin dynamics algorithm.
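For the linear model, the complete-data gradient given a model $\xi$ decomposes coordinate-wise as above; the following sketch assumes a $N(0,\tau^2)$ prior on each $\theta_i$ (an illustrative choice) and evaluates the gradient on a duplicated mini-batch as in (3).

```python
import numpy as np

def grad_given_model(theta, xi, Xb, yb, n, sigma2=1.0, tau2=10.0):
    """grad_theta log pi(theta | duplicated mini-batch, xi) for (4)-(5).

    Coordinates with xi_i = 1 receive the (n/m-rescaled) likelihood
    gradient plus the prior gradient; coordinates with xi_i = 0 receive
    the prior gradient only, since beta = theta ∘ xi ignores them.
    """
    m = len(yb)
    resid = yb - Xb @ (theta * xi)                 # beta_xi = theta ∘ xi
    grad_lik = (n / m) * (Xb.T @ resid) / sigma2   # duplicated-data part
    grad_prior = -theta / tau2                     # N(0, tau2) prior
    return xi * grad_lik + grad_prior
```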

Algorithm 3.1 (Extended Stochastic Gradient Langevin Dynamics for Bayesian Variable Selection)

  • (Subsampling) Draw a subsample of size $m$ (with or without replacement) from the full dataset at random, and denote the subsample by $X_n^{(t)}$, where $t$ indexes the iteration.

  • (Simulating models) Simulate models $\xi^{(t,1)},\ldots,\xi^{(t,L)}$ from the conditional posterior $\pi(\xi\mid\tilde X_n^{(t)},\theta^{(t)})$ by running a short Markov chain, where $\tilde X_n^{(t)}$ is the dataset formed by duplicating $X_n^{(t)}$ $n/m$ times and $\theta^{(t)}$ is the sample of $\theta$ at iteration $t$.

  • (Updating $\theta$) Update $\theta$ by setting $\theta^{(t+1)}=\theta^{(t)}+\frac{\epsilon_{t+1}}{2}\widehat{\nabla_\theta\log\pi(\theta^{(t)}\mid X_n)}+\sqrt{\epsilon_{t+1}\tau}\,e_{t+1}$, where $\epsilon_{t+1}$ is the learning rate, $\widehat{\nabla_\theta\log\pi(\theta^{(t)}\mid X_n)}=L^{-1}\sum_{l=1}^{L}\nabla_\theta\log\pi(\theta^{(t)}\mid\tilde X_n^{(t)},\xi^{(t,l)})$, $e_{t+1}\sim N(0,I_p)$, $\tau$ is the temperature, and $p$ is the dimension of $\theta$. A runnable sketch of the full loop is given below.
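Putting the three steps together, the following is a minimal end-to-end sketch of Algorithm 3.1 for the linear model (4)-(5), using a single-flip Metropolis chain on $\xi$ as the short Markov chain; the independent Bernoulli($q$) prior on $\xi$, the Gaussian prior on $\theta$, and all tuning constants are our illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def loglik_mb(theta, xi, Xb, yb, n, sigma2=1.0):
    """Duplicated mini-batch log-likelihood, log pi(X~_n | theta, xi)."""
    m = len(yb)
    resid = yb - Xb @ (theta * xi)
    return -(n / m) * 0.5 * np.sum(resid**2) / sigma2

def sample_models(theta, Xb, yb, n, xi, L=10, q=0.1, sigma2=1.0):
    """Short Metropolis chain on xi targeting pi(xi | X~_n, theta),
    proposing a single-coordinate flip at each step."""
    p = len(xi)
    models = []
    cur = loglik_mb(theta, xi, Xb, yb, n, sigma2) + np.sum(
        np.where(xi == 1, np.log(q), np.log(1 - q)))
    for _ in range(L):
        j = rng.integers(p)
        prop = xi.copy()
        prop[j] = 1 - prop[j]
        new = loglik_mb(theta, prop, Xb, yb, n, sigma2) + np.sum(
            np.where(prop == 1, np.log(q), np.log(1 - q)))
        if np.log(rng.uniform()) < new - cur:     # Metropolis acceptance
            xi, cur = prop, new
        models.append(xi.copy())
    return xi, models

def esgld(X, y, T=1000, m=200, L=10, eps=1e-6, tau=1.0,
          tau2=10.0, sigma2=1.0):
    n, p = X.shape
    theta, xi = np.zeros(p), np.zeros(p, dtype=int)
    theta_draws, model_draws = [], []
    for _ in range(T):
        idx = rng.choice(n, size=m, replace=False)           # subsampling
        Xb, yb = X[idx], y[idx]
        xi, models = sample_models(theta, Xb, yb, n, xi, L)  # simulating models
        g = np.mean([(n / m) * (Xb.T @ (yb - Xb @ (theta * mdl))) / sigma2 * mdl
                     - theta / tau2 for mdl in models], axis=0)
        theta = theta + 0.5 * eps * g \
            + rng.normal(scale=np.sqrt(eps * tau), size=p)   # updating theta
        theta_draws.append(theta.copy())
        model_draws.extend(models)
    return np.array(theta_draws), model_draws
```

In this sketch the model chain is warm-started from the previous iteration's $\xi$, so a short chain of length $L$ per iteration suffices in practice.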

Theorem 1 justifies the validity of this algorithm with the proof given in the Appendix.

Theorem 1

Assume that the conditions (A.1)-(A.3) (given in the Appendix) hold, the subsample size $m$, the number of simulated models $L$ and the iteration number $T$ are increasing with $n$ such that $m\to\infty$ and $L\to\infty$, and a constant learning rate $\epsilon$ is used. Then, as $n\to\infty$,

  • (i) $W_2(\mu_T,\pi_*)\to 0$ as $T\to\infty$, where $\mu_T$ denotes the distribution of $\theta^{(T)}$, $\pi_*$ denotes the posterior $\pi(\theta\mid X_n)$, and $W_2(\cdot,\cdot)$ denotes the second order Wasserstein distance between two distributions.

  • (ii) If $h$ is $\alpha$-Lipschitz for some constant $\alpha$, then $T^{-1}\sum_{t=1}^{T}h(\theta^{(t)})\to^{p}\mathbb{E}_{\pi_*}[h(\theta)]$ as $T\to\infty$, where $\to^{p}$ denotes convergence in probability.

  • (iii) If (A.4) further holds, $\hat\pi(\xi\mid X_n)=(TL)^{-1}\sum_{t=1}^{T}\sum_{l=1}^{L}\mathbb{1}(\xi^{(t,l)}=\xi)\to^{p}\pi(\xi\mid X_n)$ as $T\to\infty$.

Part (i) establishes the weak convergence of $\theta^{(t)}$: if the total sample size $n$ and the iteration number $T$ are sufficiently large, and the subsample size $m$ and the number of models $L$ simulated at each iteration are reasonably large, then the distribution of $\theta^{(T)}$ converges to the true posterior in 2-Wasserstein distance. Refer to Gibbs & Su (2002) for discussion of the relation between the Wasserstein distance and other probability metrics. Parts (ii) and (iii) address how to estimate the posterior mean and the posterior probability, respectively, based on the samples simulated by Algorithm 3.1. For parts (i), (ii) and (iii), explicit convergence rates are given in equations (8), (10) and (15), respectively.

For the choice of $L$, it can be approximately treated as the maximum size of the models under consideration, which is of the same order as the size of the true model. Therefore, $L$ can be fairly small under the model sparsity assumption. Theorem 1 is established with a constant learning rate. In practice, one may use a decaying learning rate; see, e.g., Teh et al. (2016), where it is suggested to set $\epsilon_t\propto t^{-\omega}$ for some $\omega>0$. For decaying learning rates, Teh et al. (2016) recommended weighted averaging estimators for the posterior mean. Theorem 2 shows that the unweighted averaging estimators used above still work if the learning rate decays slowly enough; for fast-decaying learning rates, the weighted averaging estimators are still needed. The proof of Theorem 2 is given in the supplementary material.

Theorem 2

Assume the conditions of Theorem 1 hold. If a slowly decaying learning rate $\epsilon_t\propto t^{-\omega}$ is used with $\omega>0$ sufficiently small, then parts (i), (ii) and (iii) of Theorem 1 are still valid.
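A small sketch of the decaying schedule and the two averaging estimators discussed above; the schedule constants are illustrative.

```python
import numpy as np

def learning_rate(t, a=1e-5, b=100.0, omega=0.33):
    """Polynomially decaying schedule eps_t = a (b + t)^(-omega);
    a, b and omega are illustrative constants."""
    return a * (b + t) ** (-omega)

def posterior_mean(h_vals, eps_seq, weighted=False):
    """Estimate E[h(theta) | X_n] from the iterates h_vals = [h(theta_t)].
    Unweighted averaging corresponds to the estimator of Theorem 2;
    step-size-weighted averaging follows Teh et al. (2016)."""
    h = np.asarray(h_vals, dtype=float)
    w = np.asarray(eps_seq, dtype=float) if weighted else None
    return np.average(h, axis=0, weights=w)
```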

3.2 Missing Data

Missing data are ubiquitous across fields from science to technology. However, under the big data scenario, how to conduct Bayesian analysis in the presence of missing data remains unclear. The existing data-augmentation algorithm (Tanner & Wong, 1987) is based on the full data and thus can be extremely slow. In this context, we let $X_n$ denote the incomplete data and let $\theta$ denote the model parameters. If we treat the missing values as latent variables, then Lemma 1 can be used to evaluate the gradient $\nabla_\theta\log\pi(\theta\mid X_n)$. However, Algorithm 3.1 cannot be directly applied to missing data problems, since the imputation of the missing data would then depend on the subsample only. To address this issue, we propose Algorithm S1 (given in the supplementary material), where the missing values are imputed from their conditional distribution given the observed data and the current parameter values at each iteration. Theorems 1 and 2 are still applicable to this algorithm.
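As a concrete instance of the imputation step in the missing-data variant, suppose each data row follows a multivariate Gaussian model (an illustrative assumption, not the paper's setting); then the missing coordinates are drawn from their Gaussian full conditional given the observed coordinates and the current parameter values. A hypothetical sketch of that single step:

```python
import numpy as np

def impute_row(x, miss, mu, Sigma, rng):
    """Draw x_mis ~ pi(x_mis | x_obs, theta) for one row, where
    theta = (mu, Sigma) parameterizes a multivariate Gaussian model."""
    obs = ~miss
    S_mo = Sigma[np.ix_(miss, obs)]
    S_oo = Sigma[np.ix_(obs, obs)]
    S_mm = Sigma[np.ix_(miss, miss)]
    # Conditional mean and covariance of the missing block given the
    # observed block, by the standard Gaussian conditioning formulas.
    cond_mu = mu[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mu[obs])
    cond_S = S_mm - S_mo @ np.linalg.solve(S_oo, S_mo.T)
    x = x.copy()
    x[miss] = rng.multivariate_normal(cond_mu, cond_S)
    return x
```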

4 An Illustrative Example

This section illustrates the performance of Algorithm 3.1 using a simulated example; more numerical examples are presented in the supplementary material. Ten synthetic datasets were generated from model (4), where $\sigma^2$ is assumed to be known and the explanatory variables are normally distributed with a mutual correlation coefficient of 0.5. A hierarchical prior was assumed for the model and parameters, with the details given in the supplementary material. For each dataset, Algorithm 3.1 was run for 5000 iterations, where the first 2000 iterations were discarded as burn-in and the samples generated from the remaining iterations were used for inference. At each iteration, the reversible jump Metropolis-Hastings algorithm (Green, 1995) was used to simulate the models $\xi^{(t,1)},\ldots,\xi^{(t,L)}$, with the details given in the supplementary material.

Table 1 summarizes the performance of the algorithms, where the false selection rate (FSR), negative selection rate (NSR), and the mean squared errors for false predictors (MSE_f) and true predictors (MSE_t) are defined in the supplementary material. The variables were selected according to the median probability rule (Barbieri & Berger, 2004), which selects only the variables with marginal inclusion probability greater than 0.5; a short sketch of this rule is given after Table 1. The Bayesian estimates of the parameters were obtained by averaging over a set of thinned (by a factor of 10) posterior samples. For comparison, several existing algorithms were applied to this example, with the results given in Table 1 and the implementation details given in the supplementary material. The comparison shows that the proposed algorithm substantially reduces the computational cost of Bayesian variable selection for big data.

Algorithm   FSR            NSR              MSE_f    MSE_t    CPU (min)
eSGLD       0 (0)          0 (0)            ( )      ( )      3.3
RJMH        0.50 (0.10)    0.16 (0.042)     ( )      ( )      180.1
SaM         0.05 (0.05)    0.013 (0.013)    ( )      ( )      150.4
B-Lasso     0 (0)          0 (0)            ( )      ( )      32.8

Table 1: Bayesian variable selection with the extended stochastic gradient Langevin dynamics (eSGLD), reversible jump Metropolis-Hastings (RJMH), split-and-merge (SaM) and Bayesian Lasso (B-Lasso) algorithms, where FSR, NSR, MSE_f and MSE_t are reported as averages over 10 datasets with standard deviations given in parentheses, and the CPU time (in minutes) was recorded for one dataset on a Linux machine with an Intel® Core™ i7-3770 CPU @ 3.40GHz.
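The median probability rule used above reduces to thresholding the estimated marginal inclusion probabilities at 0.5; a short sketch using the models simulated by Algorithm 3.1:

```python
import numpy as np

def select_variables(model_draws, threshold=0.5):
    """Median probability rule (Barbieri & Berger, 2004): estimate each
    variable's marginal inclusion probability by the average of xi over
    the sampled models, and keep variables whose probability exceeds
    the threshold (0.5)."""
    incl = np.mean(np.asarray(model_draws), axis=0)
    return np.flatnonzero(incl > threshold), incl
```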

5 Discussion

This paper has extended the stochastic gradient Langevin dynamics algorithm to general large-scale Bayesian computing problems, such as those involving dimension jumping and missing data. To the best of our knowledge, this paper provides the first Bayesian method and theory for high-dimensional discrete parameter estimation with mini-batch samples; the existing methods work only for continuous parameters or for very low-dimensional discrete problems. Beyond generalized linear models, the proposed algorithm has many potential applications in data science. For example, it can be used for sparse deep learning and for accelerating computation for statistical models in which latent variables are involved, such as hidden Markov models, random coefficient models, and model-based clustering.

Algorithm 3.1 can be further extended by updating using a variant of stochastic gradient Langevin dynamics, such as stochastic gradient Hamiltonian Monte Carlo (Chen et al., 2014), stochastic gradient thermostats (Ding et al., 2014), stochastic gradient Fisher scoring (Ahn et al., 2012), or preconditioned stochastic gradient Langevin dynamics (Li et al., 2016). We expect that the advantages of these variants (over stochastic gradient Langevin dynamics) can be carried over to the extension.

Acknowledgements

This work was partially supported by the grants DMS-1811812, DMS-1818674, and R01-GM126089. The authors thank the editor, associate editor and referees for their insightful comments/suggestions.

Appendix

A.1 Proof of Lemma 1

Proof

Let $\pi(\theta)$ denote the prior density of $\theta$, and let $\pi(\theta,u\mid X_n)$ denote the joint posterior density of $(\theta,u)$. Then

$$\begin{aligned}
\nabla_\theta\log\pi(\theta\mid X_n)
&=\frac{\nabla_\theta\pi(\theta\mid X_n)}{\pi(\theta\mid X_n)}
=\frac{\nabla_\theta\int\pi(\theta,u\mid X_n)\,du}{\pi(\theta\mid X_n)}
=\frac{\int\nabla_\theta\pi(\theta,u\mid X_n)\,du}{\pi(\theta\mid X_n)}\\
&=\int\nabla_\theta\log\pi(\theta,u\mid X_n)\,\frac{\pi(\theta,u\mid X_n)}{\pi(\theta\mid X_n)}\,du
=\int\nabla_\theta\log\pi(\theta\mid X_n,u)\,\pi(u\mid\theta,X_n)\,du\\
&=\mathbb{E}_{u\mid\theta,X_n}\!\left[\nabla_\theta\log\pi(\theta\mid X_n,u)\right],
\end{aligned}$$

where the first and fourth equalities follow from the relation $\nabla g=g\,\nabla\log g$ (for an appropriate function $g$), and the fifth equality follows by direct calculation of the conditional distributions, using $\pi(\theta,u\mid X_n)=\pi(\theta\mid X_n,u)\pi(u\mid X_n)$ and $\pi(\theta,u\mid X_n)/\pi(\theta\mid X_n)=\pi(u\mid\theta,X_n)$.

A.2 Proof of Theorem 1

Let $\pi_*$ denote the posterior density function of $\theta$, and let $\mu_t$ denote the density of $\theta^{(t)}$ generated by Algorithm 3.1 at iteration $t$. We are interested in the discrepancy between $\mu_t$ and $\pi_*$ in the second order Wasserstein distance. The following conditions are assumed.

  • (A.1) The posterior is strongly log-concave and gradient-Lipschitz; that is, for $U(\theta)=-\log\pi(\theta\mid X_n)$,

    $U(\theta)-U(\theta')-\langle\nabla U(\theta'),\theta-\theta'\rangle\ge\frac{m_U}{2}\|\theta-\theta'\|^2$,  (6)
    $\|\nabla U(\theta)-\nabla U(\theta')\|\le M_U\|\theta-\theta'\|$,  (7)

    where $\theta,\theta'\in\mathbb{R}^p$, for some positive constants $m_U$ and $M_U$.

  • (A.2) The posterior has a bounded second moment: $\mathbb{E}_{\pi_*}\|\theta\|^2<\infty$.

  • (A.3) The stochastic gradient estimator has uniformly bounded mean squared error: $\sup_{\theta}\mathbb{E}\|\widehat{\nabla_\theta\log\pi(\theta\mid X_n)}-\nabla_\theta\log\pi(\theta\mid X_n)\|^2<\infty$, where $\mathbb{E}$ denotes expectation with respect to the distribution of $(\tilde X_n,\xi^{(t,1)},\ldots,\xi^{(t,L)})$, and $\Xi$ denotes the set of all possible models.

  • (A.4) Let $\rho_\xi=\pi(\xi\mid X_n,\theta)$ for $\xi\in\Xi$, and let $\rho_{(1)}\ge\rho_{(2)}\ge\cdots$ be the descending order statistics of $\{\rho_\xi:\xi\in\Xi\}$. Assume that there exists a constant $c>0$ such that $\rho_{(1)}-\rho_{(2)}\ge c$.

Proof

Part (i). In Algorithm 3.1, the gradient is estimated by running a short Markov chain with a mini-batch of data. Since the initial distribution of the Markov chain might not coincide with its equilibrium distribution, the resulting gradient estimate can be biased. Let $\delta(\theta)$ denote the bias of the gradient estimate at $\theta$. Following from (A.3), the bias and the variance of the gradient estimate are uniformly bounded. Following from Lemma S2 in the supplementary material, if the learning rate $\epsilon$ is sufficiently small, then

$$W_2(\mu_T,\pi_*)\le(1-m_U\epsilon)^{T}W_2(\mu_0,\pi_*)+C_1(\epsilon p)^{1/2}+C_2\sup_{\theta}\|\delta(\theta)\| \tag{8}$$

for some constants $C_1,C_2>0$, since the strong log-concavity and the bounded second moment hold by conditions (A.1) and (A.2).

Part (ii). Since $h$ is $\alpha$-Lipschitz, we have $|h(\theta)-h(\theta')|\le\alpha\|\theta-\theta'\|$ for any $\theta,\theta'$. Further, $\pi_*$ is strongly log-concave, so $\mathbb{E}_{\pi_*}\|\theta\|^2<\infty$, i.e., $h$ is $\pi_*$-integrable. On the other hand,

$$\big|\mathbb{E}_{\mu_t}h(\theta)-\mathbb{E}_{\pi_*}h(\theta)\big|\le\alpha\,\mathbb{E}\|\theta_{\mu_t}-\theta_{\pi_*}\|\le\alpha\,W_2(\mu_t,\pi_*), \tag{9}$$

where $\theta_{\mu_t}$ and $\theta_{\pi_*}$ are two random variables whose marginal distributions follow $\mu_t$ and $\pi_*$, respectively, and $\mathbb{E}$ denotes expectation with respect to an optimal coupling of $\mu_t$ and $\pi_*$. This implies that $h$ is also $\mu_t$-integrable and $\mathbb{E}_{\mu_t}h(\theta)\to\mathbb{E}_{\pi_*}h(\theta)$ as $t\to\infty$.

Further, by the Markov chain property, the weak law of large numbers applies and thus $T^{-1}\sum_{t=1}^{T}\{h(\theta^{(t)})-\mathbb{E}_{\mu_t}h(\theta)\}\to^{p}0$. Combining this with the above result leads to

$$\frac{1}{T}\sum_{t=1}^{T}h(\theta^{(t)})\to^{p}\mathbb{E}_{\pi_*}h(\theta). \tag{10}$$

Part (iii). To establish the convergence of $\hat\pi(\xi\mid X_n)$, define $\rho_\xi(\theta,\tilde X_n)=\pi(\xi\mid\tilde X_n,\theta)$ for $\xi\in\Xi$. For each $\xi$, the subsampled log-likelihood underlying $\rho_\xi$ is approximately Gaussian, so, by the Gaussian tail probability, its deviation from the full-data log-likelihood is bounded with high probability. It implies, with high probability, that if $\xi$ is the most likely model, the deviation of $\rho_\xi(\theta,\tilde X_n)$ from $\pi(\xi\mid X_n,\theta)$ inherits this bound by the mean-value theorem, viewing the log-likelihoods as the arguments of $\rho_\xi$; if $\xi$ is not the most likely model, the same argument applies with the most likely model taken as the reference. In conclusion, with high probability, the deviation is uniformly small for all $\xi\in\Xi$, any iteration and any $\theta$. Then, for some $\eta_m$ with $\eta_m\to0$ as $m\to\infty$, with high probability,

$$\big|\pi(\xi\mid\tilde X_n^{(t)},\theta^{(t)})-\pi(\xi\mid X_n,\theta^{(t)})\big|\le\eta_m \tag{11}$$

for any iteration $t$. Conditioned on $\theta^{(t)}$, the subsamples $\tilde X_n^{(t)}$ drawn at different iterations are independent, and each summand below is bounded by 1, so the weak law of large numbers applies. Therefore, for any $\xi\in\Xi$, by the WLLN,

$$\frac{1}{T}\sum_{t=1}^{T}\Big\{\pi(\xi\mid\tilde X_n^{(t)},\theta^{(t)})-\mathbb{E}_{\tilde X}\,\pi(\xi\mid\tilde X_n,\theta^{(t)})\Big\}\to^{p}0, \tag{12}$$

provided $T\to\infty$. Since $\{\theta^{(t)}\}$ forms a time-homogeneous Markov chain, whose convergence is measured by (8), and the function $\theta\mapsto\mathbb{E}_{\tilde X}\,\pi(\xi\mid\tilde X_n,\theta)$ is bounded and continuous in $\theta$,

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}_{\tilde X}\,\pi(\xi\mid\tilde X_n,\theta^{(t)})\to^{p}\int\mathbb{E}_{\tilde X}\,\pi(\xi\mid\tilde X_n,\theta)\,\pi_*(d\theta) \tag{13}$$

holds for any $\xi\in\Xi$. Combining (13) with (12), together with (11), leads to

$$\frac{1}{T}\sum_{t=1}^{T}\pi(\xi\mid\tilde X_n^{(t)},\theta^{(t)})\to^{p}\pi(\xi\mid X_n). \tag{14}$$

Conditioned on $\tilde X_n^{(t)}$ and $\theta^{(t)}$, by the standard theory of MCMC, $L^{-1}\sum_{l=1}^{L}\mathbb{1}(\xi^{(t,l)}=\xi)$ forms a consistent estimator of $\pi(\xi\mid\tilde X_n^{(t)},\theta^{(t)})$ with an asymptotic bias of order $O(1/L)$. Since $L$ is increasing with $n$ and $L\to\infty$, the estimator is asymptotically unbiased. Combining this result with (14) leads to

$$\hat\pi(\xi\mid X_n)=\frac{1}{TL}\sum_{t=1}^{T}\sum_{l=1}^{L}\mathbb{1}(\xi^{(t,l)}=\xi)\to^{p}\pi(\xi\mid X_n), \tag{15}$$

with the approximation error converging to 0 as $T\to\infty$ and $L\to\infty$.

References

  • Ahn et al. (2012) Ahn, S., Balan, A. K. & Welling, M. (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML.
  • Barbieri & Berger (2004) Barbieri, M. & Berger, J. (2004). Optimal predictive model selection. Annals of Statistics 32, 870–897.
  • Bardenet et al. (2014) Bardenet, R., Doucet, A. & Holmes, C. (2014). Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In ICML.
  • Bardenet et al. (2017) Bardenet, R., Doucet, A. & Holmes, C. C. (2017). On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research 18, 47:1–47:43.
  • Betancourt (2015) Betancourt, M. (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In ICML.
  • Bierkens et al. (2019) Bierkens, J., Fearnhead, P. & Roberts, G. (2019). The zig-zag process and super-efficient Monte Carlo for Bayesian analysis of big data. Annals of Statistics 47, 1288–1320.
  • Bouchard-Côté et al. (2018) Bouchard-Côté, A., Vollmer, S. & Doucet, A. (2018). The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association 113, 855–867.
  • Cappé et al. (2005) Cappé, O., Moulines, E. & Ryden, T. (2005). Inference in Hidden Markov Models. New York: Springer.
  • Chen et al. (2016) Chen, H., Seita, D., Pan, X. & Canny, J. (2016). An efficient minibatch acceptance test for Metropolis-Hastings. arXiv:1610.06848 .
  • Chen et al. (2014) Chen, T., Fox, E. B. & Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In ICML.
  • Chen & Ghahramani (2016) Chen, Y. & Ghahramani, Z. (2016). Scalable discrete sampling as a multi-armed bandit problem. In ICML.
  • Dalalyan & Karagulyan (2017) Dalalyan, A. S. & Karagulyan, A. G. (2017). User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv:1710.00095.
  • Ding et al. (2014) Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D. & Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. In NIPS.
  • Gibbs & Su (2002) Gibbs, A. & Su, F. (2002). On choosing and bounding probability metrics. International Statistical Review 70, 419–435.
  • Green (1995) Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732.
  • Korattikara et al. (2014) Korattikara, A., Chen, Y. & Welling, M. (2014). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In ICML.
  • Li et al. (2016) Li, C., Chen, C., Carlson, D. E. & Carin, L. (2016). Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI.
  • Ma et al. (2015) Ma, Y.-A., Chen, T. & Fox, E. B. (2015). A complete recipe for stochastic gradient MCMC. In NIPS.
  • Maclaurin & Adams (2014) Maclaurin, D. & Adams, R. P. (2014). Firefly Monte Carlo: Exact MCMC with subsets of data. In IJCAI.
  • Nemeth & Fearnhead (2019) Nemeth, C. & Fearnhead, P. (2019). Stochastic gradient Markov chain Monte Carlo. arXiv:1907.06986 .
  • Sato & Nakagawa (2014) Sato, I. & Nakagawa, H. (2014). Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. In ICML.
  • Scott et al. (2016) Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I. & McCulloch, R. E. (2016). Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management 11, 78–88.
  • Srivastava et al. (2018) Srivastava, S., Li, C. & Dunson, D. B. (2018). Scalable Bayes via Barycenter in Wasserstein space. Journal of Machine Learning Research 19, 1–35.
  • Tanner & Wong (1987) Tanner, M. & Wong, W. (1987). The calculation of posterior distributions by data augmentation (with Discussions). Journal of the American Statistical Association 82, 528–540.
  • Teh et al. (2016) Teh, W., Thiery, A. & Vollmer, S. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research 17, 1–33.
  • Welling & Teh (2011) Welling, M. & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In ICML.
  • Xue & Liang (2019) Xue, J. & Liang, F. (2019). Double-parallel Monte Carlo for Bayesian analysis of big data. Statistics and Computing 29, 23–32.