Big data problems present a computational challenge for iterative updates of global and local parameters. On-line algorithms that fit model parameters on sub-sampled data provide a stochastic-gradient approximation to the posterior (Hoffman et al., 2013). We focus instead on Gibbs sampling algorithms, which provide asymptotically exact posterior computation in large-scale inference problems.
On-line variational inference algorithms are fast but yield only an approximate posterior. MCMC algorithms, on the other hand, draw samples from the true posterior but can take a long time to converge. Recent work on speeding up MCMC and improving variational approximations includes split-merge methods (Chang & Fisher III, 2014) and structured variational inference (Hoffman & Blei, 2014).
Adaptive MCMC algorithms (Andrieu & Thoms, 2008; Atchade et al., 2009; Roberts & Rosenthal, 2009) have been developed to improve mixing and convergence rate by automatically adjusting MCMC parameters. The goal of this paper is to adapt the batch size in order to improve the asymptotic convergence of MCMC. Our algorithm resembles stochastic-approximation schemes that have been found both theoretically valid and effective in practice (Mahendran et al., 2012).
In this work, we show the importance of selecting the right batch size for big data inference problems. The main contribution of this paper is an adaptive batch-size Gibbs sampling algorithm that yields an optimal update strategy, demonstrated on the Bayesian Lasso, Dirichlet Process Mixture Model (DPMM), and Latent Dirichlet Allocation (LDA) graphical models.
2 Local and Global Updates
Consider the graphical model in Figure 1, where the observed data depend on both the global parameters and the local parameters. The posterior distribution contains a product over a very large number of terms:
We can re-write the posterior in (1) as
where we choose exponential family representation for convenience. In the traditional Gibbs sampling algorithm, the full conditional updates alternate between:
The global update is inefficient, since it requires an update of every local parameter and a summation over a very large number of terms. Since the global parameters appear in each local Gibbs sampling update, we can achieve smaller variance by updating them more often. Thus, we can find an optimum trade-off between the frequency of latent-variable updates and Gibbs sampling performance.
2.1 Optimizing Mini-Batch Size
In evaluating MCMC performance we examine the mixing time and the rate of convergence to the stationary distribution (Murphy, 2012). Consider the following update frequencies after burn-in for the graphical model in Figure 1:
Frequent updates in (5) increase the number of samples and help reduce the variance in (8). On the other hand, a larger mini-batch size in (6) results in lower autocorrelation in (10) and therefore greater information content per sample. Finally, a fractional mini-batch size in (7) updates the global parameters multiple times before updating the local parameters.
Consider an average of MCMC samples of a function of the global parameter. Suppose we are interested in estimating its expectation; a natural estimator is the sample mean. Under squared-error loss, our objective is to minimize the variance:
Alternatively, we can represent the variance of the estimator as:
where the lag autocorrelation is:
Let the integrated autocorrelation time of the MCMC be defined as above. It is a measure of the efficiency of the estimator: the number of samples divided by the integrated autocorrelation time is the effective sample size. The autocorrelation function is commonly used to assess convergence and is therefore a good candidate for optimizing the batch size.
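As a concrete illustration, the integrated autocorrelation time and effective sample size can be estimated from a single chain as follows. This is a minimal sketch; the function names and the simple rule of truncating the sum at the first negative autocorrelation estimate are our own choices, not the paper's.

```python
import numpy as np

def integrated_autocorr_time(x, max_lag=100):
    """Estimate the integrated autocorrelation time tau = 1 + 2*sum_t rho_t
    from a 1-D chain x, truncating at the first negative rho estimate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    var = xc.var()
    tau = 1.0
    for t in range(1, min(max_lag, n - 1)):
        rho = np.dot(xc[:-t], xc[t:]) / ((n - t) * var)  # lag-t autocorrelation
        if rho < 0:
            break
        tau += 2.0 * rho
    return tau

def effective_sample_size(x, max_lag=100):
    """Number of samples divided by the integrated autocorrelation time."""
    return len(x) / integrated_autocorr_time(x, max_lag)
```

The variance of the sample mean is then approximately the marginal variance times tau over the chain length, which is why tau appears directly in the batch-size objective below.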
Consider a Gibbs sampling algorithm with a given mini-batch size for the graphical model in Figure 1, and let the times of each local and global update be fixed. Given a fixed time budget, the number of samples we can obtain is
Therefore, we can re-write the variance in (9) as
Because the time budget and update times are fixed, we can formulate our mini-batch objective function as:
Note that as the mini-batch size decreases, the per-iteration time cost decreases linearly, while the autocorrelation, and therefore the integrated autocorrelation time, increases. This leads to Algorithm 1 for selecting an optimum mini-batch size.
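The trade-off can be sketched directly: with a time budget T the chain yields roughly T / (m·t_local + t_global) samples at batch size m, so minimizing the estimator's variance is equivalent to minimizing tau(m)·(m·t_local + t_global). The following is a hypothetical sketch under that assumption; the names t_local, t_global, and tau_of are ours.

```python
def minibatch_objective(tau_m, m, t_local, t_global):
    """Quantity proportional to the asymptotic variance of the time-budgeted
    estimator: integrated autocorrelation time times per-iteration cost."""
    return tau_m * (m * t_local + t_global)

def select_batch_size(candidates, tau_of, t_local, t_global):
    """Pick the candidate mini-batch size minimizing the objective.
    tau_of(m) returns the estimated autocorrelation time at batch size m."""
    return min(candidates,
               key=lambda m: minibatch_objective(tau_of(m), m, t_local, t_global))
```

When the global update dominates the cost, a small batch wins; when autocorrelation decays quickly with batch size, a larger batch wins, which is exactly the tension Algorithm 1 resolves empirically.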
The adaptive batch-size algorithm consists of two phases: adaptation and sampling. During the adaptation phase, the optimum mini-batch size is selected after burn-in, with time and space costs determined by the maximum autocorrelation lag. During the sampling phase, the optimum mini-batch size is used to achieve fast convergence.
3 Experimental Results
To evaluate the performance of the adaptive batch-size Gibbs sampler, we consider the Bayesian lasso and Dirichlet Process Gaussian Mixture Model (DPGMM) graphical models.
3.1 Regression with a Non-conjugate Prior
Consider the model
where we have a response vector, a known design matrix, and an unknown vector of regression coefficients. Alternatively, we can re-write the expression as:
where the distribution has the density shown above. The Laplace prior in (18) can be represented as a scale mixture of normal distributions (Park & Casella, 2008):
where the mixing density is exponential. This framework is known as the Bayesian lasso. Figure 2 shows the graphical model for the Bayesian lasso. The generative model is defined as follows:
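The scale-mixture construction can be sketched in code: each coefficient's latent scale is drawn from an exponential distribution, and the coefficient from the resulting normal, which is marginally Laplace. This is a hypothetical sketch following Park & Casella (2008); the names lam, sigma2, and the function itself are ours.

```python
import numpy as np

def sample_bayesian_lasso_prior(p, lam, sigma2, rng):
    """Draw p regression coefficients from the Bayesian lasso prior via the
    scale mixture of normals:
        tau2_j ~ Exp(lam^2 / 2),   beta_j | tau2_j ~ N(0, sigma2 * tau2_j).
    Marginally, each beta_j / sqrt(sigma2) is Laplace with scale 1/lam."""
    tau2 = rng.exponential(scale=2.0 / lam**2, size=p)   # latent scales
    beta = rng.normal(0.0, np.sqrt(sigma2 * tau2))       # conditional normals
    return beta, tau2
```

The latent scales are exactly the local variables that the mini-batch sampler updates in sub-sampled blocks, while the global hyper-parameters are updated more frequently.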
To reduce the dimensionality for big data applications, a one-sided Laplacian (exponential) prior is introduced on the covariance of the hyperplane. However, this prior is non-conjugate and requires an efficient sampling algorithm. Conditioned on the prior hyper-parameters, the label updates are identical to Algorithm 4. The hyper-parameters are updated as derived in (Park & Casella, 2008):
The full conditional updates above, when combined with incremental updates, lead to the incremental Gibbs sampling procedure of Algorithm 4.
The mini-batch Gibbs sampler updates are analogous to stochastic-gradient updates and are expected to give a computational advantage for big data applications with a limited time budget.
3.1.1 Experimental Results
The experimental results were obtained by averaging samples of three parallel MCMC chains with Estimated Potential Scale Reduction (EPSR) used as a convergence criterion:
where the statistic combines the between-chain and within-chain variances, respectively (Gelman et al., 2013).
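A minimal sketch of the EPSR computation is shown below. The split-chain refinement of Gelman et al. is omitted for brevity, and the function name is ours.

```python
import numpy as np

def epsr(chains):
    """Estimated Potential Scale Reduction (Gelman-Rubin R-hat).
    chains: array of shape (m, n) -- m parallel chains of n samples each."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_plus = (n - 1) / n * W + B / n      # pooled variance estimate
    return np.sqrt(var_plus / W)
```

Values close to 1 indicate the parallel chains have converged to the same distribution; a common stopping rule declares convergence when EPSR drops below roughly 1.1.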
Figure 3 shows the MSE for the dataset, averaged over multiple trials. We can see that the lowest MSE for a fixed time budget is achieved by two of the mini-batch sizes, outperforming the full-conditional Bayesian lasso Gibbs sampler. Similarly, the objective plot shows which mini-batch size will be selected during the sampling phase. Thus, the adaptive batch-size Gibbs sampler first cycles through the candidate mini-batch sizes during the adaptation phase and then fixes the mini-batch size during the sampling phase.
3.2 Dirichlet Process Mixture Model
A Dirichlet Process Mixture Model (DPMM) uses a non-parametric prior for modeling mixtures with an infinite number of clusters. A Dirichlet process is a distribution over probability measures and defines a conjugate prior for arbitrary measurable spaces:
Applied to mixture modeling, we can describe the graphical model for DPMM in Figure 7 as follows:
The simplest way to fit a DPMM is to modify the collapsed Gibbs sampler for a finite mixture model. By exchangeability, we can treat the current point's assignment as the most recent one, and therefore,
The rest of the algorithm can be summarized as follows:
3.2.1 Experimental Results
Figure 5 shows the posterior DPMM clustering of the data points, with the concentration parameter and the true number of clusters fixed. We can see that the mini-batch Gibbs samplers recovered the true number of clusters in fewer iterations than the collapsed Gibbs sampler.
Figure 6 shows the MSE and purity scores, computed at fixed iterations, comparing the Gibbs sampling posterior clustering with the ground truth. The MSE scores were computed by matching inferred clusters to true clusters with the Hungarian algorithm, using Euclidean distance as the cost matrix. The purity score assigns each inferred cluster its majority ground-truth class: it is the sum over clusters of the number of objects belonging to the majority class, divided by the total number of objects.
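The purity computation can be sketched as follows (the function name is ours): each inferred cluster is credited with its majority ground-truth class, and the majority counts are summed and normalized by the number of points.

```python
import numpy as np

def purity(pred, truth):
    """Fraction of points whose inferred cluster's majority true class
    matches their own class; 1.0 means a perfect clustering."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    total = 0
    for k in np.unique(pred):
        _, counts = np.unique(truth[pred == k], return_counts=True)
        total += counts.max()               # majority class count in cluster k
    return total / len(pred)
```

Note that purity alone rewards over-segmentation (singleton clusters score 1.0), which is why it is reported alongside the Hungarian-matched MSE.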
3.3 Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a generative admixture topic model in which every word is assigned its own topic drawn from a document-specific distribution (Blei et al., 2003).
The generative LDA model can be described as follows:
It is straightforward to derive a full conditional Gibbs sampling algorithm for LDA. However, one can obtain better performance by analytically integrating out the topic and word distributions. This leads to the following collapsed Gibbs sampler expression:
where the first count is the number of words in the entire corpus assigned to the given topic, and the second is the number of words in the current document assigned to that topic, both excluding the current word.
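The resulting conditional can be sketched numerically as below. The argument names are ours, and all counts must already exclude the word being resampled.

```python
import numpy as np

def lda_topic_probs(word_topic_counts, topic_counts, doc_topic_counts,
                    alpha, beta, V):
    """Collapsed-Gibbs conditional p(z = k | rest) for one word.
    word_topic_counts[k]: corpus-wide count of this word type in topic k;
    topic_counts[k]: total words assigned to topic k;
    doc_topic_counts[k]: words in this document assigned to topic k;
    alpha, beta: Dirichlet hyper-parameters; V: vocabulary size."""
    wt = np.asarray(word_topic_counts, dtype=float)
    t = np.asarray(topic_counts, dtype=float)
    dt = np.asarray(doc_topic_counts, dtype=float)
    p = (wt + beta) / (t + V * beta) * (dt + alpha)   # unnormalized weights
    return p / p.sum()
```

Sampling a new topic from these probabilities and incrementing the counts completes one collapsed-Gibbs step for a single word.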
Rather than waiting for the collapsed Gibbs sampler to iterate over all documents before updating the global topic parameters, we can divide the corpus into a set of mini-batches. This leads to the following mini-batch Gibbs sampling algorithm:
3.3.1 Experimental Results
Figure 8 shows the experimental results on the Brown corpus. The perplexity of the test set was used to evaluate performance:
Perplexity is computed by evaluating the expression below over independent chains:
where the counts are the number of words assigned to each topic in a given document and across the entire corpus, respectively. Alternatively, the held-out probability can be computed using the Chib-style estimator and other methods described in (Wallach et al., 2009).
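Given point estimates of the document-topic and topic-word distributions, held-out perplexity can be sketched as follows (the names theta and phi are ours):

```python
import numpy as np

def lda_perplexity(docs, theta, phi):
    """Perplexity exp(-log p(w_test) / N) of held-out documents.
    docs: list of word-id lists; theta[d, k]: document-topic proportions;
    phi[k, w]: topic-word probabilities."""
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(docs):
        for w in words:
            log_lik += np.log(theta[d] @ phi[:, w])  # mixture prob of word w
            n_words += 1
    return np.exp(-log_lik / n_words)
```

Lower perplexity indicates a better predictive model; a uniform model over a vocabulary of size V scores exactly V.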
The objective function shows a preference for a particular mini-batch size, which also yields the smallest perplexity for a fixed time budget.
4.1 Mini-Batch Gibbs Sampler
The mini-batch Gibbs sampler uses a sequential sampling scheme, alternating between sampling the global parameters and the local parameters. The optimum sampling frequency is selected by optimizing the mini-batch objective function. An MSE gain is achieved by updating the global parameters more frequently than the collapsed Gibbs sampler does. This is evident for graphical models with a hierarchical structure, such as the Bayesian Lasso, DPMM, and LDA.
The mini-batch Gibbs sampler is well suited to large-scale inference settings that rely on methods such as Stochastic Gradient Descent (SGD) (Li et al., 2014). A mini-batch size of one corresponds to standard SGD, while the full batch size recovers the collapsed Gibbs sampler. Thus, we can achieve better MSE by choosing the mini-batch size according to the mini-batch objective function.
The intuition is that one can obtain a fairly good initial estimate of the local hidden variables, given the global parameters, by evaluating just a few data points. A noisy estimate of the global parameters thus enables rapid movement through the parameter space in a Gibbs sampling framework. In addition to the improvement in speed, the mini-batch Gibbs sampler is less likely to get stuck in local minima because of the noise added to the parameter estimates.
The mini-batch Gibbs sampler algorithm consists of two phases: adaptation and sampling. During the adaptation phase, the optimum mini-batch size is chosen by a random shuffle through a fixed range of mini-batch sizes, after which the optimum mini-batch size is used during the sampling phase. While the algorithm introduces a time overhead during adaptation, the chain continues to mix during that phase. To reduce the overhead, a logarithmic search over mini-batch sizes was used to minimize the objective.
In addition, the mini-batch objective was built from commonly used MCMC diagnostics that add little computational overhead, such as the integrated autocorrelation time and the empirical local and global sampling times. Figure 9 illustrates the two phases of the mini-batch Gibbs sampling algorithm for the Bayesian Lasso. Figure 10 shows the EPSR convergence metrics comparing a full conditional Gibbs sampler (left) with the collapsed Gibbs sampler at the optimum mini-batch size (right), for different numbers of trials and dimensions of the global parameter.
Notice that the adaptation phase acts as initialization for the sampling phase. In general, the mini-batch Gibbs sampling algorithm can be used to initialize batch optimization methods, which converge faster in the vicinity of a potentially global optimum.
We developed a mini-batch Gibbs sampling algorithm that outperforms the state-of-the-art collapsed Gibbs sampler for hierarchical graphical models in large-scale inference settings. The computational overhead of the algorithm is negligible compared to the reduction in asymptotic variance achieved by optimum mini-batch selection. We demonstrated the performance of the algorithm on the Bayesian Lasso, DPMM, and LDA graphical models.
- Andrieu & Thoms (2008) Andrieu, C. and Thoms, J. A tutorial on adaptive mcmc. Statistics and Computing, 18(4):343–373, December 2008. ISSN 0960-3174. doi: 10.1007/s11222-008-9110-y. URL http://dx.doi.org/10.1007/s11222-008-9110-y.
- Atchade et al. (2009) Atchade, Y., Fort, G., Moulines, E., and Priouret, P. Adaptive Markov Chain Monte Carlo: Theory and Methods. Technical Report, University of Michigan, 2009.
- Blei et al. (2003) Blei, D., Ng, A., and Jordan, M. Latent Dirichlet Allocation. In JMLR, 2003.
- Chang & Fisher III (2014) Chang, J. and Fisher III, J. W. MCMC Sampling in HDPs using Sub-Clusters. In Proceedings of the Neural Information Processing Systems (NIPS), Dec 2014.
- Gelman et al. (2013) Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. Bayesian Data Analysis. CRC Press, 2013.
- Hoffman & Blei (2014) Hoffman, M. and Blei, D. Structured Stochastic Variational Inference. In arXiv, 2014.
- Hoffman et al. (2013) Hoffman, M., Blei, D., Wang, C., and Paisley, J. Stochastic Variational Inference. The Journal of Machine Learning Research (JMLR), 2013.
- Li et al. (2014) Li, Mu, Zhang, Tong, Chen, Yuqiang, and Smola, Alexander J. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’14, 2014.
- Mahendran et al. (2012) Mahendran, N., Wang, Z., Hamze, F., and de Freitas, N. Adaptive MCMC with Bayesian Optimization. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 751–760, 2012. URL http://jmlr.csail.mit.edu/proceedings/papers/v22/mahendran12.html.
- Murphy (2012) Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. ISBN 0262018020, 9780262018029.
- Park & Casella (2008) Park, T. and Casella, G. The Bayesian Lasso. In Journal of the American Statistical Association, 2008.
- Roberts & Rosenthal (2009) Roberts, G. O. and Rosenthal, J. S. Examples of adaptive MCMC. In Journal of Computational and Graphical Statistics, 2009.
- Wallach et al. (2009) Wallach, H., Murray, I., Salakhutdinov, R., and Mimno, D. Evaluation Methods for Topic Models. In International Conference on Machine Learning, 2009.