1 Introduction
Gradient-based Markov Chain Monte Carlo (MCMC) methods for computing Bayesian posteriors, which exploit geometric information, have become increasingly popular in recent years. Exact gradient evaluations, however, can be prohibitive even for moderately large data sets. Following the success of stochastic gradients in large-scale optimization, and inspired by the seminal work of Welling and Teh [1], many gradient-based MCMC algorithms have now been adapted to capitalize on fast but noisy gradient evaluations computed on minibatches of data [2]. Examples include stochastic gradient Langevin dynamics (SGLD) [1] and stochastic gradient Hamiltonian Monte Carlo [3], which have established themselves as popular choices for scalable MCMC sampling.

Complementary to data subsampling, which underlies the use of stochastic gradients, another strategy for scaling up MCMC is distributed computation, where the data are partitioned and distributed across a number of workers [13, 15, 18, 20]. This strategy is particularly relevant when the data set is too large to fit into the memory of a single machine. The need for distributed methods also arises in settings where the data set is originally collected in a distributed manner, and constraints due to, e.g., communication or privacy prevent it from being centralized in a single location. In this work, we focus on stochastic gradient MCMC in distributed settings and use distributed SGLD (DSGLD), introduced in [4], as our starting point.
While stochastic gradients reduce computational cost, and thus scale better, they intrinsically suffer from high variance, leading to poor mixing rates and slow convergence [7]. In distributed settings, minibatches are chosen within the individual data subsets held by workers, which intuitively amplifies the variance of stochastic gradients, especially if the distributed data sets are heterogeneous. However, no formal analysis of the stochastic gradient variance and its influence on convergence in distributed settings currently appears to be available, and we provide one in this paper. In serial settings, extensive effort has gone into analyzing the convergence of SGLD [6, 11, 5, 17, 19], which, due to its relatively straightforward formulation, is a good starting point for analyses and novel developments in stochastic gradient MCMC.
We further introduce a method that decreases variance and yields tighter convergence bounds. As its central mechanism, we define the notion of a conducive gradient, a zero-mean stochastic gradient constructed using a surrogate likelihood. The role of the conducive gradient is to provide a global estimate of the target posterior which, when added to an unbiased SGLD estimator, decreases variance and tightens convergence bounds.
Previous works [7, 5] have proposed strategies to alleviate the effect of high variance in SGLD in serial settings. The authors of [7] proposed two algorithms, SAGA-LD and SVRG-LD, both of which use previously evaluated gradients to approximate gradients for data points not visited in a given iteration. The first, SAGA-LD, requires maintaining a record of individual gradients for each data point. In the second, SVRG-LD, the gradient over the entire data set must be periodically evaluated. The recently proposed SGLD-CV [5] uses posterior mode estimates to build control variates, which are added to the gradient estimates to speed up convergence. For distributed settings, the algorithms proposed in [7] may be inadequate: SAGA-LD because of the high memory cost of storing gradients, and SVRG-LD because of the high computational cost of updating the full gradients. Like SGLD-CV, our algorithm can be seen as a control variate method [14]. However, applying SGLD-CV in distributed scenarios may incur significant communication overhead, as finding a mode of the posterior would require employing distributed optimization algorithms.
After introducing some notation and providing a brief review of serial and distributed SGLD in Section 2, we derive a convergence bound for DSGLD and contrast it with the analogous bound for serial SGLD in Section 3. We then introduce the concept of conducive gradients in Section 4, and use it to construct a novel gradient estimator, which is shown to improve on the convergence bound for DSGLD. In Section 5, we show experimental results. Finally, we summarize our conclusions in Section 6.
2 Background and notation
Let $X = \{x_1, \ldots, x_N\}$ be a data set of size $N$ and let $p(\theta \mid X) \propto p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta)$ be the density of a posterior distribution from which we wish to draw samples. Langevin dynamics [12] is a family of MCMC methods which utilizes the gradient of the log-posterior,
$$\nabla_\theta \log p(\theta \mid X) = \nabla_\theta \log p(\theta) + \sum_{i=1}^{N} \nabla_\theta \log p(x_i \mid \theta),$$
to generate proposals in a Metropolis-Hastings sampling scheme. For large data sets, computing the gradient of the log-likelihood with respect to the entire data set becomes expensive. To mitigate this problem, stochastic gradient Langevin dynamics (SGLD) [1] uses stochastic gradients to approximate the full-data gradient.
Denoting by $B_t$ a minibatch of size $n$ drawn uniformly from the data, SGLD draws samples from the target distribution using a stochastic gradient update of the form

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \left( \nabla_\theta \log p(\theta_t) + \frac{N}{n} \sum_{i \in B_t} \nabla_\theta \log p(x_i \mid \theta_t) \right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon_t I), \quad (1)$$

with step size determined by $\epsilon_t$, and additional Gaussian noise $\eta_t$ injected into each step. The step size is annealed according to a schedule satisfying $\sum_{t=1}^{\infty} \epsilon_t = \infty$ and $\sum_{t=1}^{\infty} \epsilon_t^2 < \infty$. Note that as $\epsilon_t \to 0$, the Metropolis-Hastings acceptance rate goes asymptotically to one, and thus the accept-reject step is typically ignored in SGLD. While a proper annealing schedule yields an asymptotically exact algorithm, constant step sizes are often used in practice.
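As a concrete sketch, one SGLD step can be written in a few lines of NumPy. The callback names (`grad_log_prior`, `grad_log_lik`) and the conjugate-Gaussian example in the test are our illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

def sgld_step(theta, data, grad_log_prior, grad_log_lik, step_size, batch_size, rng):
    """One SGLD update (Welling & Teh, 2011): the minibatch gradient is scaled by
    N/n to estimate the full-data gradient, and Gaussian noise with variance equal
    to the step size is injected."""
    N = len(data)
    batch = data[rng.choice(N, size=batch_size, replace=False)]
    # Unbiased estimate of the gradient of the log-posterior.
    grad = grad_log_prior(theta) + (N / batch_size) * sum(grad_log_lik(x, theta) for x in batch)
    noise = rng.normal(0.0, np.sqrt(step_size), size=np.shape(theta))
    return theta + 0.5 * step_size * grad + noise
```

On a conjugate Gaussian model, averaging the post-burn-in samples should recover the analytical posterior mean, which is a quick sanity check for any implementation.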
2.1 Distributed SGLD
In this paper, our focus is on SGLD in distributed settings, where the data are partitioned into non-overlapping shards $X_1, \ldots, X_S$, such that $\bigcup_{s=1}^{S} X_s = X$. For simplicity, we will assume that each shard corresponds to a single worker. An adaptation of Equation (1) to distributed settings has been presented in [4]. The main idea is that in each iteration, a minibatch is sampled within a shard, say $X_{s_t}$, and the shard itself is sampled by a scheduler with probability $w_{s_t}$, with $\sum_{s=1}^{S} w_s = 1$ and $w_s > 0$ for all $s$. This results in the update

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \left( \nabla_\theta \log p(\theta_t) + \frac{N_{s_t}}{w_{s_t} n} \sum_{i \in B_t} \nabla_\theta \log p(x_i \mid \theta_t) \right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon_t I), \quad (2)$$
where $N_{s_t}$ denotes the size of the shard $X_{s_t}$ chosen at time $t$. Intuitively, if a minibatch of $n$ data points is chosen uniformly at random from $X_{s_t}$, then the factor $N_{s_t}/n$ scales the minibatch gradient to be an unbiased estimator of the gradient over shard $X_{s_t}$, while the factor $1/w_{s_t}$ further scales this gradient to be an unbiased estimator of the full-data gradient.

In order to reduce the overhead of workers communicating after every iteration, [4] further proposed a modified version where multiple update steps are taken within a shard before moving to another worker, at the cost of some loss in asymptotic accuracy. It is worth noting that, while the data are distributed across multiple workers, the above sampling procedure may still be understood as entirely serial. In practice, however, distributed settings are naturally amenable to running multiple chains in parallel.
3 Convergence of DSGLD
We begin by analyzing the convergence of DSGLD under the same framework used for the analysis of SGLD in [6] and subsequently adopted in [7], which directly ties convergence bounds to the variance of the gradient estimators. Besides certain regularity conditions adopted in these works, which we outline in the supplementary material, we make the following assumption:
Assumption 1
The gradient of the log-likelihood of each individual element within each shard is bounded, i.e., $\|\nabla_\theta \log p(x_i \mid \theta)\| \leq \sigma_s$ for all $x_i \in X_s$ and all $\theta$, for each shard $s$.
We then derive the following bound on the convergence, in mean squared error (MSE), of the Monte Carlo estimate of a smooth test function $\phi$ with respect to its true posterior expectation $\bar{\phi}$.
Theorem 3.1
Let the step size be fixed, $\epsilon_t = \epsilon$ for all $t$. Under standard regularity conditions and Assumption 1, the MSE of DSGLD for a smooth test function $\phi$ at time $T$ is bounded, for some constant $C$ independent of $T$ and $\epsilon$, in the following manner:
The bound in Theorem 3.1 (proved in the supplement) depends explicitly on the ratio between squared shard sizes and their selection probabilities. This matches the intuition that both shard sizes and their availability play a role in the convergence of DSGLD. Note also that the above bound for DSGLD generalizes previous results for SGLD [7]. More specifically, if we combine all shards into a single data set held by one worker, we recover the bound for SGLD:
4 Variance reduction using conducive gradients
In DSGLD, stochastic gradient updates are computed on minibatches sampled within local shards, which adds bias to the updates and increases variance globally. This is especially true if the shards are heterogeneous. To counteract this, we would like the local updates to make use of information across shards, without significantly increasing either computational cost or memory requirements.
Our strategy for achieving this goal is to augment the local updates with an auxiliary gradient computed on a tractable surrogate $\tilde{p}(\theta)$ for the full-data likelihood $p(X \mid \theta)$. We assume here that the surrogate factorizes over shards as $\tilde{p}(\theta) = \prod_{s=1}^{S} \tilde{p}_s(\theta)$, where each $\tilde{p}_s$ is itself a surrogate for $p(X_s \mid \theta)$, i.e., the likelihood with respect to the entire shard. Given a surrogate $\tilde{p}$, we define the conducive gradient with respect to shard $s$ as

$$g_s(\theta) = \nabla_\theta \log \tilde{p}(\theta) - \frac{1}{w_s} \nabla_\theta \log \tilde{p}_s(\theta).$$
The following result states that when the conducive gradient is added to the stochastic gradient in a DSGLD setting, the resulting estimator remains a valid estimator for the gradient of the full-data log-posterior.
Lemma 1
Assume $\log \tilde{p}_1, \ldots, \log \tilde{p}_S$ are Lipschitz continuous. Given a data set $X$ partitioned into shards $X_1, \ldots, X_S$, with respective sizes $N_1, \ldots, N_S$ and shard selection probabilities $w_1, \ldots, w_S$, the following gradient estimator,

$$G(\theta) = \nabla_\theta \log p(\theta) + \frac{N_{s_t}}{w_{s_t} n} \sum_{i \in B_t} \nabla_\theta \log p(x_i \mid \theta) + g_{s_t}(\theta),$$

is an unbiased estimator of $\nabla_\theta \log p(\theta \mid X)$ with finite variance.
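To make the zero-mean property of conducive gradients concrete, here is a sketch with hypothetical Gaussian surrogates for each shard; the Gaussian choice and all names are our illustration, not a prescribed part of the method:

```python
import numpy as np

# Hypothetical Gaussian surrogates: tilde_p_s(theta) = N(theta; mu_s, Sigma_s), so
# grad log tilde_p_s(theta) = -Sigma_s^{-1} (theta - mu_s). The full surrogate is
# their product, whose grad-log is the sum of the per-shard terms.
def conducive_gradient(theta, shard_idx, mus, precisions, shard_probs):
    """Conducive gradient for shard s: grad log tilde_p(theta) minus
    (1 / w_s) * grad log tilde_p_s(theta). Averaged over shard draws with
    probabilities w_s, the two terms cancel, so the estimator has zero mean."""
    grad_surrogate = sum(-P @ (theta - mu) for mu, P in zip(mus, precisions))
    grad_local = -precisions[shard_idx] @ (theta - mus[shard_idx])
    return grad_surrogate - grad_local / shard_probs[shard_idx]
```

Because the correction is zero-mean, adding it to the unbiased DSGLD gradient leaves the estimator unbiased while reshaping its variance.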
The conducive-gradient DSGLD update can now be written as

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\, G(\theta_t) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon_t I). \quad (3)$$
Together with the setting and Assumption 1 previously adopted for the analysis of DSGLD, Lemma 2, which follows directly from the Lipschitz continuity of our surrogates, is sufficient for deriving a tighter bound for our method in Theorem 4.1.
Lemma 2
If are everywhere differentiable and Lipschitz continuous, then the average value of , taken over , is bounded by some , for each .
Theorem 4.1
Put simply, Theorem 4.1 (proved in the supplement) states that, through conducive gradients, tractable approximations to the local likelihood factors can be employed to improve distributed SGLD. Algorithm 1 describes our method, conducive gradient DSGLD (CG-DSGLD), for a single chain.
Remark 1 (Controlling exploration)
Note that conducive gradients can alternatively be written as

$$g_s(\theta) = \frac{1}{w_s} \nabla_\theta \log \frac{\tilde{p}(\theta)^{w_s}}{\tilde{p}_s(\theta)}, \quad (4)$$

making it explicit that these terms encourage the exploration of regions in which we believe, based on the approximations $\tilde{p}$ and $\tilde{p}_s$, the posterior density to be high but the density within shard $s$ to be low. We can explicitly control the extent of this exploration by multiplying the conducive gradient by a constant $\lambda \geq 0$ to obtain the modified gradient estimator:

$$G_\lambda(\theta) = \nabla_\theta \log p(\theta) + \frac{N_{s_t}}{w_{s_t} n} \sum_{i \in B_t} \nabla_\theta \log p(x_i \mid \theta) + \lambda\, g_{s_t}(\theta). \quad (5)$$
4.1 Choice of the surrogate
The key idea in choosing $\tilde{p}_s$ is to obtain an approximation of the shard likelihood with a parametric form that keeps the computation of the conducive gradient relatively inexpensive, i.e., such that it can be computed in a single gradient evaluation instead of iterating over all $N_s$ data points of shard $s$. Exponential family distributions are particularly convenient for this purpose, as they are closed under products, enabling $\nabla_\theta \log \tilde{p}(\theta)$ to be computed in a single gradient evaluation, which keeps the additional cost of our method negligible even when the number of shards is large.
In this work, we use a simulation-based approach to compute $\tilde{p}_s$: we first draw samples locally using SGLD, and then use the resulting samples to compute the parameters of an exponential family approximation. To avoid communication overhead, each $\tilde{p}_s$ can be computed independently in parallel for each of the data shards and then communicated to the coordinating server once, before the CG-DSGLD steps take place.
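A minimal sketch of this simulation-based construction, assuming a multivariate Gaussian fit to locally drawn samples; `fit_gaussian_surrogate` is a hypothetical helper name, not from the paper:

```python
import numpy as np

def fit_gaussian_surrogate(samples):
    """Fit a multivariate Gaussian to posterior samples drawn locally (e.g. via
    SGLD on one shard) and return grad log tilde_p_s, computable in O(1) per call
    instead of iterating over the N_s points of the shard."""
    mu = samples.mean(axis=0)
    precision = np.linalg.inv(np.cov(samples, rowvar=False))

    def grad_log_surrogate(theta):
        # gradient of log N(theta; mu, Sigma) = -Sigma^{-1} (theta - mu)
        return -precision @ (theta - mu)

    return mu, precision, grad_log_surrogate
```

Since the gradient of a Gaussian log-density is linear in the parameter, each conducive-gradient evaluation costs a single matrix-vector product, independent of the shard size.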
5 Experiments
In this section we demonstrate the performance of CG-DSGLD under different conditions. In Subsection 5.1, we provide a visual illustration of a pathology that afflicts DSGLD when the number of within-shard updates is increased. In Subsection 5.2, we consider inference in a Bayesian metric learning model, a non-conjugate model which can be used to learn metric matrices from similarity information in the data. Finally, in Subsection 5.3 we show how our method can be employed to learn Bayesian neural networks in a distributed fashion, and analyse how its behaviour changes as the amount of heterogeneity between shards increases, compared to DSGLD.
While these models are increasingly complex, we highlight that, using multivariate Gaussians as approximations to each of the within-shard likelihood functions, we were still able to obtain good results for CG-DSGLD. For the first set of experiments, we derive analytic forms for the approximations, which is possible due to the simplicity of the target model. For the remaining ones, we run SGLD independently on each shard and use the samples obtained to compute the mean vector and covariance matrix that parameterize each approximation.

All experiments were implemented using PyTorch (https://pytorch.org) and code will be publicly released upon publication of the manuscript.

5.1 Heterogeneous shards
An ideal sampling scheme would, in theory, update the chain state at a worker and immediately pass it to another worker for the next iteration. However, such a short communication cycle would result in a large overhead. To ameliorate the problem, [4] proposed making a number of chain updates before moving to another worker. However, the authors reported that as the number of iterations within each worker and shard increases, the algorithm tends to lose sample efficiency and effectively samples from a mixture of local posteriors instead of the true posterior. With heterogeneous data shards, the effect is particularly noticeable, as illustrated below.
5.1.1 Model:
In this experiment, we illustrate how information can be shared between shards through the approximations introduced in CG-DSGLD, to avoid the aforementioned pathology and achieve better performance. To this end, we consider inference for the mean vector of normally distributed data under a simple conjugate Gaussian model.
5.1.2 Setting:
We generate disjoint data subsets, each drawn from a Gaussian whose mean is uniformly sampled from a square. We then perform inference on the overall mean using the model above. We draw the same number of posterior samples using both DSGLD and CG-DSGLD, with fixed step size and minibatch size. The first samples were discarded as burn-in and the remaining ones were thinned by 100. The surrogate for each shard was set analytically.
5.1.3 Results:
Figure 1 shows the posterior samples as a function of the number of local updates the method takes before jumping to the next shard. For comparison, samples from the analytical posterior are shown in Figure 1(a). As the results show, the proposed method (CG-DSGLD) converges adequately to the true posterior, while DSGLD diverges towards a mixture of local approximations. Figure 1(b) compares convergence in terms of the number of posterior samples. While DSGLD with 100 local updates converges as fast as CG-DSGLD, it plateaus at a higher MSE. Note that, in contrast with DSGLD, CG-DSGLD is insensitive to the number of local updates in this experiment.
5.2 Metric learning
Given sets of similar and dissimilar pairs of vectors, metric learning concerns the task of learning a distance metric matrix $A$ such that the Mahalanobis distance $d_A(x_1, x_2) = (x_1 - x_2)^{\top} A\, (x_1 - x_2)$ is low for similar pairs and high for dissimilar ones.
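One common low-rank parameterization writes the metric as $A = W W^{\top}$, which guarantees positive semi-definiteness; the factor name `W` and this particular factorization are our illustration of the idea, not necessarily the exact construction of [21]:

```python
import numpy as np

def mahalanobis_sq(x1, x2, W):
    """Squared Mahalanobis distance under A = W @ W.T (W has d rows, K columns).
    Computed as ||W.T (x1 - x2)||^2 without forming A explicitly, which also
    makes nonnegativity evident."""
    diff = x1 - x2
    proj = W.T @ diff
    return float(proj @ proj)
```

Working with the factor rather than the full matrix reduces the cost per pair from O(d^2) to O(dK) when K < d.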
5.2.1 Model:
We consider inference in the Bayesian metric learning model proposed by Yang et al. [21], in which it is assumed that $A$ can be expressed in terms of its top eigenvectors. The likelihood function for each pair from the similar or dissimilar set is given by a function of the Mahalanobis distance, where the pair label equals one if the pair is similar and zero otherwise. While this construction is enough to guarantee that $A$ is a distance metric, the constraint is further relaxed and a diagonal Gaussian prior is placed on the eigenvector weights.
5.2.2 Setting:
We devised a data set for metric learning based on the Spoken Letter Recognition (isolet) data (https://archive.ics.uci.edu/ml/datasets/isolet), which encompasses 7797 examples split among 26 classes. We created pairs of similar and dissimilar vectors using the labels of the isolet examples, i.e., samples are considered similar if they belong to the same class and dissimilar otherwise. We split these pairs into data shards of identical size, such that there is no overlap in the classes used, to create the sets of similar and dissimilar pairs in each shard. Additionally, we created another thousand pairs, equally split between similar and dissimilar examples, which we held out for testing.
We ran both DSGLD and CG-DSGLD with constant step size and minibatch size. The conducive terms are computed independently by drawing three thousand samples from a density proportional to each local likelihood using SGLD and imposing a multivariate normal approximation.
5.2.3 Results:
Figure 3 shows results in terms of average log-likelihood as a function of the number of samples. The curves show the average of ten repetitions of the experiment with different random seeds. The results show that CG-DSGLD converges faster than DSGLD while also achieving better performance, both on observed samples and on held-out data.
5.3 Bayesian neural networks
In this experiment, we assess the performance of our method for posterior inference in a deep Multi-Layer Perceptron (MLP) for: 1) evenly distributed data; and 2) different levels of heterogeneity between shards.
5.3.1 Model:
We consider an MLP with three hidden layers. All hidden nodes are equipped with the rectified linear unit (ReLU) activation function, and a softmax function is applied to the output of the network. Since we employ this network for a classification task, we use the cross-entropy loss function.
5.3.2 Homogeneous setting:
We perform posterior inference for the SUSY binary classification data set (https://archive.ics.uci.edu/ml/datasets/SUSY), which comprises five million samples with 18 covariates each. We split the data set into equally sized shards and set aside an additional shard for evaluation purposes.

We ran CG-DSGLD and DSGLD, with even shard selection probabilities, for several rounds of communication, between which shard-local updates take place. For both methods, we adopt a fixed step size and use minibatches of fixed size. The first samples were discarded for each method and the remaining ones were thinned by two. Similarly to the experiments in Subsection 5.2, we computed the conducive terms by drawing independently from densities proportional to the local likelihoods and imposing diagonal multivariate normal approximations based on these samples.
5.3.3 Results:
Figure 4 reports the results in terms of average log-likelihood, evaluated on both observed and held-out data, as a function of the number of samples drawn. Values reported correspond to the average over ten rounds of experiments, each with a different set of data points. Note that in this case CG-DSGLD converged to clearly better average log-likelihood values.
5.3.4 Heterogeneous setting:
To simulate different degrees of heterogeneity, we create a series of label-imbalanced data shards based on the SUSY data. We control the amount of heterogeneity by drawing the proportion of positive samples in each shard from a symmetric Beta distribution. For large values of its parameter, the data shards are approximately balanced; for small values, half of the shards tend to have mostly positive labels and the other half mostly negative labels, enforcing diversity between shards. All shards comprise the same number of samples, and an additional balanced shard is held out for evaluation. All remaining implementation details for CG-DSGLD and DSGLD are kept the same as in the homogeneous setting.

5.3.5 Results:
Figure 5 shows results in terms of average log-likelihood, evaluated on held-out data, for both values of the Beta parameter. Note that in the case where most of the shards are balanced, CG-DSGLD and DSGLD exhibit similar behaviour. However, for more heterogeneous shards, CG-DSGLD clearly outperforms DSGLD.
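The label-imbalanced shard construction described in Subsection 5.3.4 can be sketched as follows; the function name and parameter values are illustrative, not the paper's exact settings:

```python
import numpy as np

def make_imbalanced_shards(y, num_shards, shard_size, beta_param, rng):
    """Build label-imbalanced shards: the positive-class proportion of each shard
    is drawn from Beta(b, b). Small b pushes shards toward all-positive or
    all-negative; large b keeps them roughly balanced."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    shards = []
    for _ in range(num_shards):
        p = rng.beta(beta_param, beta_param)
        n_pos = int(round(p * shard_size))
        idx = np.concatenate([rng.choice(pos, n_pos, replace=False),
                              rng.choice(neg, shard_size - n_pos, replace=False)])
        rng.shuffle(idx)
        shards.append(idx)
    return shards
```

A symmetric Beta makes the imbalance knob one-dimensional: its variance shrinks as the parameter grows, so a single scalar interpolates between near-balanced and strongly polarized shards.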
6 Conclusion
We proposed CG-DSGLD, a novel method which incorporates coarse information about the likelihood contribution of each data shard to improve the convergence of distributed SGLD. Experiments show that our method outperforms DSGLD, especially in cases where there is considerable variety across data shards, causing their likelihood contributions to be significantly distinct.
Our method can be seen as a variance reduction strategy for DSGLD, and we provided a theoretical analysis of its impact on convergence. To the best of our knowledge, we also present the first discussion of the convergence of SGLD in distributed settings. In contrast to existing variance-reduction strategies for SGLD, given suitable surrogates, such as exponential family distributions, CG-DSGLD can simultaneously be made efficient both in terms of memory and computation.
We leave open the possibility of employing more expressive or computationally cheaper surrogates, such as nonparametric methods or variational approximations, respectively.
Appendix 0.A Background on convergence analysis for SGLD
Let be the functional that solves the Poisson equation . Assume it is bounded up to its third-order derivative by a function , such that , with denoting the kth-order derivative. Assume as well that the expectation of w.r.t. is bounded and that is smooth, such that , for some . Under the regularity conditions above, [6] showed the following result.
Theorem 0.A.1 (See [6])
Let the stochastic gradient be an unbiased estimate of the gradient of the unnormalized negative log-posterior, with a fixed step size across iterations. Under the assumptions above, for a smooth test function, the MSE of SGLD at time T is bounded, for some constant independent of T, in the following manner:
(6a) 
To the regularity conditions mentioned above, we add for further analysis that for some .
Appendix 0.B Theorem 1: Convergence of DSGLD
Here, we follow the approach of [6], later adopted by [7]. Thus, we focus on bounding the variance of the DSGLD gradient estimator. We define auxiliary notation to avoid congestion in the following. For some , we have:
(8a)  
(8b)  
(8c)  
(8d)  
(8e)  
(8f)  
(8g)  
(8h) 
In the above, the inner expectation is taken w.r.t. a minibatch of size n with elements drawn with replacement and equal probability from the selected shard. To advance from Equation (8c) to (8d), we use the law of iterated expectations and the fact that the variance of a sum of zero-mean independent terms is the sum of their variances. To advance from Equation (8d) to (8e), we apply the bound from Assumption 1. Substituting Equation (8h) into Equation (7a) yields the desired result.
Appendix 0.C Lemma 1: Unbiasedness and finite variance
Recall that for the DSGLD update [4] we have that:
Furthermore, for conducive gradients we have that:
Thus, the CG-DSGLD gradient estimator, the sum of the DSGLD estimator and the conducive gradient, is unbiased.
A sufficient condition for the DSGLD estimator to have finite variance is that the unnormalized log-posterior is Lipschitz continuous. Similarly, since the surrogate log-densities are also Lipschitz continuous, their first derivatives are bounded, so the conducive gradient is a combination of bounded functions and has finite variance. Thus, their sum, the CG-DSGLD estimator, has finite variance.
Appendix 0.D Theorem 2: Convergence of CGDSGLD
Since conducive gradients have zero mean, the bound derived in Proposition 1 also holds. We are left with the task of proving the alternative bound. We again define auxiliary notation to avoid congestion in the following.
(9a)  
(9b)  
(9c)  
(9d)  
(9e)  
(9f)  
(9g) 
We proceed from Equation (9b) to (9c) using the law of iterated expectations and the fact that the variance of a sum of zero-mean independent terms is the sum of their variances. To advance from Equation (9c) to (9d), we apply the bound from Assumption 1. The last line is obtained using Lemma 2. The desired result is obtained by taking the minimum of the bound derived here and the one in Equation (7a).
Appendix 0.E Additional experiments
0.e.1 Linear regression
In this set of experiments, we compare vanilla DSGLD with our method, CG-DSGLD, on real data sets. We apply CG-DSGLD to Bayesian linear regression, studying the behavior of the test MSE over the posterior samples obtained.
0.e.1.1 Model
The inputs of our model are feature–response pairs. The likelihood of each output, given its input vector and the regression weights, is Gaussian, and we place a Gaussian prior on the weights.
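The log-posterior gradient used by the samplers in this model has a simple closed form; the variance values below are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def grad_log_posterior_linreg(beta, X, y, noise_var=1.0, prior_var=10.0):
    """Gradient of the log-posterior for Bayesian linear regression with Gaussian
    likelihood y_i ~ N(x_i^T beta, noise_var) and prior beta ~ N(0, prior_var * I).
    (noise_var and prior_var are assumed values for illustration.)"""
    residual = y - X @ beta
    return X.T @ residual / noise_var - beta / prior_var
```

In the distributed setting, the likelihood term would be computed on a minibatch from the current shard and rescaled as in the DSGLD update; the prior term is always evaluated exactly.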
0.e.1.2 Setting
We ran experiments on four data sets from the UCI repository (https://archive.ics.uci.edu/ml/index.html): Concrete, Noise, Conductivity, and Localization. We normalized the data and partitioned each data set into training (80%) and test (20%) sets. We report the cumulative averages of the test MSE and its variance. In all experiments, both DSGLD and CG-DSGLD use the same hyperparameters. We sample from disjoint data shards over multiple rounds, each with a fixed number of iterations per round, fixed step size, and fixed minibatch size. All shards are chosen with the same probability. We discard the first ten thousand samples as burn-in and thin the remaining ones by a hundred.
0.e.1.3 Results
Figure 6 shows the cumulative test MSE and its variance. We can see that CG-DSGLD converges faster and exhibits lower variance in these experiments.
0.e.2 Logistic regression
0.e.2.1 Model
0.e.2.2 Setting
We carry out posterior inference for the Magic and Credit data sets from the UCI repository (https://archive.ics.uci.edu/ml/index.html).
We compare the proposed CG-DSGLD to the earlier DSGLD, both taking a hundred shard-local chain updates, with fixed batch size and step size. The data are normalized so that the features have mean zero and variance one. We set the likelihood surrogates as Gaussian approximations computed from samples obtained by running SGLD independently with each of the local likelihoods as a target. Using normal approximations allows us to compute the gradient of the log-surrogate in a single gradient evaluation, at a negligible, approximately O(1), cost. The data are partitioned into equally weighted shards. We discard the first three thousand samples from each chain and thin the remaining ones. Experiments were repeated several times, in each of which a thousand data points were held out for testing.
0.e.2.3 Results
Figure 7 shows that our method performs better than DSGLD in terms of cumulative average accuracy, achieving faster convergence.
References

[1] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
 [2] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
 [3] Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1683–1691. PMLR, 2014.
 [4] Sungjin Ahn, Babak Shahbaba, and Max Welling. Distributed stochastic gradient MCMC. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1044–1052. PMLR, 2014.
 [5] Jack Baker, Paul Fearnhead, Emily B. Fox, and Christopher Nemeth. Control variates for stochastic gradient MCMC. Statistics and Computing, 29(3):599–615, 2019.
 [6] Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient MCMC algorithms with highorder integrators. In Proceedings of the 28th International Conference on Neural Information Processing Systems  Volume 2, NIPS’15, pages 2278–2286, Cambridge, MA, USA, 2015. MIT Press.
 [7] Kumar Avinava Dubey, Sashank J. Reddi, Sinead A. Williamson, Barnabás Póczos, Alexander J. Smola, and Eric P. Xing. Variance reduction in stochastic gradient Langevin dynamics. In Proceedings of the International Conference on Neural Information Processing Systems, pages 1154–1162, 2016.
 [8] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtarik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. In NIPS Workshop on Private MultiParty Machine Learning, 2016.

[9] Chunyuan Li, Changyou Chen, Yunchen Pu, Ricardo Henao, and Lawrence Carin. Communication-efficient stochastic gradient MCMC for neural networks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI). AAAI Press, 2019.
 [10] Zhize Li, Tianyi Zhang, Shuyu Cheng, Jun Zhu, and Jian Li. Stochastic gradient Hamiltonian Monte Carlo with variance reduction for Bayesian inference. arXiv preprint arXiv:1803.11159, 2019.
 [11] Tigran Nagapetyan, Andrew B. Duncan, Leonard Hasenclever, Sebastian J. Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1706.02692, 2017.
 [12] Radford Neal. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin Jones, and XiaoLi Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, New York, 2011.
 [13] Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, pages 623–632, Arlington, Virginia, United States, 2014. AUAI Press.
 [14] B. D. Ripley. Stochastic Simulation. Wiley, Hoboken, 2009.
 [15] Steven L Scott, Alexander W Blocker, Fernando V Bonassi, Hugh A Chipman, Edward I George, and Robert E McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88, 2016.
 [16] Umut Şimşekli, Hazal Koptagel, Hakan Güldaş, A Taylan Cemgil, Figen Öztoprak, and Ş İlker Birbil. Parallel stochastic gradient Markov Chain Monte Carlo for matrix factorisation models. arXiv preprint arXiv:1506.01418, 2015.
 [17] Yee Whye Teh, Alexandre H Thiery, and Sebastian J Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(1):193–225, 2016.
 [18] Alexander Terenin, Daniel Simpson, and David Draper. Asynchronous Gibbs sampling. arXiv preprint arXiv:1509.08999, 2015.
 [19] Sebastian J Vollmer, Konstantinos C Zygalakis, and Yee Whye Teh. Exploration of the (non) asymptotic bias and variance of stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(1):5504–5548, 2016.
 [20] Xiangyu Wang, Fangjian Guo, Katherine A. Heller, and David B. Dunson. Parallelizing MCMC with random partition trees. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 451–459, Cambridge, MA, USA, 2015. MIT Press.
 [21] Liu Yang, Rong Jin, and Rahul Sukthankar. Bayesian active distance metric learning. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, UAI'07, pages 442–449, Arlington, Virginia, USA, 2007. AUAI Press.
 [22] Yuan Yang, Jianfei Chen, and Jun Zhu. Distributing the stochastic gradient sampler for large-scale LDA. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1975–1984, New York, NY, USA, 2016. ACM.