Variance reduction for distributed stochastic gradient MCMC

04/23/2020
by Khaoula El Mekkaoui, et al.

Stochastic gradient MCMC methods, such as stochastic gradient Langevin dynamics (SGLD), have emerged as one of the dominant approaches for posterior sampling in large-scale settings. While gradient evaluations based on only a small fraction of the data significantly reduce the computational cost, they may suffer from high variance, leading to slow convergence. In distributed settings, where the data lie scattered across a number of workers, the problem of high variance is particularly pronounced, and is even worse if the data subsets held by the workers are very heterogeneous. The impact of variance reduction has been studied in serial settings but not, so far, in distributed scenarios. In this work, we derive variance bounds for distributed SGLD and introduce the concept of conducive gradients, zero-mean stochastic gradients that serve as a mechanism for sharing probabilistic information between workers. We introduce a novel stochastic gradient estimator which incorporates the conducive gradients, and show both theoretically and empirically that it reduces variance, and hence improves convergence.



1 Introduction

Gradient-based Markov chain Monte Carlo (MCMC) methods for computing Bayesian posteriors, which exploit geometric information, have become increasingly popular in recent years. Exact gradient evaluations, however, can be prohibitive even for moderately large datasets. Following the success of using stochastic gradients in large-scale optimization problems, and inspired by the seminal work of Welling and Teh [1], many gradient-based MCMC algorithms have now been adapted to capitalize on the idea of using fast but noisy gradient evaluations computed on mini-batches of data [2]. Examples include stochastic gradient Langevin dynamics (SGLD) [1] and stochastic gradient Hamiltonian Monte Carlo [3], which have established themselves as popular choices for scalable MCMC sampling.

Complementary to data subsampling, which underlies the use of stochastic gradients, another strategy for scaling up MCMC is distributed computation, where the data are partitioned and distributed across a number of workers [13, 15, 18, 20]. This strategy is particularly relevant in cases where the data set is too large to fit into the memory of a single machine. The need for distributed methods also arises in settings where the data set is originally collected in a distributed manner, and constraints due to, e.g., communication or privacy prevent it from being centralized into a single location. In this work, we focus on stochastic gradient MCMC in distributed settings and use distributed SGLD (DSGLD), introduced in [4], as our starting point.

While stochastic gradients provide a reduction in computational cost, and thus better scaling properties, they suffer intrinsically from high variance, leading to poor mixing rates and slow convergence [7]. In distributed settings, mini-batches are chosen within the individual data subsets held by workers, which intuitively has an amplifying effect on the variance of stochastic gradients, especially if the distributed data sets are heterogeneous. However, no formal analysis of the stochastic gradient variance and its influence on convergence in distributed settings currently appears to be available; we provide one in this paper. In serial settings, extensive effort has been devoted to analyzing the convergence of SGLD [6, 11, 5, 17, 19], which, due to its relatively straightforward formulation, is a good starting point for analyses and novel developments in stochastic gradient MCMC.

We further introduce a method that decreases variance and yields tighter convergence bounds. As the central mechanism of our method, we define the notion of a conducive gradient, a zero-mean stochastic gradient constructed using a surrogate likelihood. The role of the conducive gradient is to inject a global estimate of the target posterior which, when added to an unbiased SGLD estimator, decreases variance and provides tighter convergence bounds.

Previous works [7, 5] have proposed strategies to alleviate the effect of high variance in SGLD in serial settings. The authors of [7] proposed two algorithms, SAGA-LD and SVRG-LD, both of which use previously evaluated gradients to approximate gradients for data points not visited in a given iteration. The first algorithm, SAGA-LD, requires a record of individual gradients for each data point to be maintained. In the second one, SVRG-LD, the gradient on the entire data set needs to be periodically evaluated. The recently proposed SGLD-CV [5] uses posterior mode estimates to build control variates, which are added to the gradient estimates to speed up convergence. For distributed settings, the algorithms proposed in [7] may not be adequate: on the one hand, because of the high memory cost SAGA-LD incurs for storing gradients; on the other, because of the high computational cost of the periodic full-gradient updates in SVRG-LD. Similarly to SGLD-CV, our algorithm can be seen as a control variate method [14]. However, applying SGLD-CV to distributed scenarios may result in a significant communication overhead, as finding a mode of the posterior would require employing distributed optimization algorithms.

After introducing some notation and providing a brief review of serial and distributed SGLD in Section 2, we derive a convergence bound for DSGLD and contrast it with the analogous bound for serial SGLD in Section 3. We then introduce the concept of conducive gradients in Section 4, and use it to construct a novel gradient estimator, which is shown to improve on the convergence bound for DSGLD. In Section 5, we show experimental results. Finally, we summarize our conclusions in Section 6.

2 Background and notation

Let $X = \{x_1, \ldots, x_N\}$ be a data set of size $N$ and let $p(\theta \mid X) \propto p(\theta)\, p(X \mid \theta)$ be the density of a posterior distribution from which we wish to draw samples. Langevin dynamics [12] is a family of MCMC methods which utilizes the gradient of the log-posterior,

$\nabla_\theta \log p(\theta \mid X) = \nabla_\theta \log p(\theta) + \sum_{i=1}^{N} \nabla_\theta \log p(x_i \mid \theta),$

to generate proposals in a Metropolis-Hastings sampling scheme. For large data sets, computing the gradient of the log-likelihood with respect to the entire data set becomes expensive. To mitigate this problem, stochastic gradient Langevin dynamics (SGLD) [1] uses stochastic gradients to approximate the full-data gradient.

Denoting by $X_t \subset X$ a mini-batch of size $n$, SGLD draws samples from the target distribution using a stochastic gradient update of the form

$\theta_{t+1} = \theta_t + \frac{h_t}{2}\Big(\nabla_\theta \log p(\theta_t) + \frac{N}{n} \sum_{x \in X_t} \nabla_\theta \log p(x \mid \theta_t)\Big) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, h_t I), \qquad (1)$

with step size determined by $h_t$, and additional noise $\eta_t$ injected into each step. The step size is annealed according to a schedule satisfying $\sum_t h_t = \infty$ and $\sum_t h_t^2 < \infty$. Note that as $h_t \to 0$, the Metropolis-Hastings acceptance rate goes asymptotically to one and thus the accept-reject step is typically ignored in SGLD. While a proper annealing schedule yields an asymptotically exact algorithm, constant step sizes are often used in practice.
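To make the update concrete, the following is a minimal sketch of Equation (1) in Python; grad_log_prior and grad_log_lik are hypothetical callables standing in for the model at hand.

```python
import numpy as np

def sgld_step(theta, X, grad_log_prior, grad_log_lik, step, batch_size, rng):
    """One SGLD update (Eq. 1): mini-batch gradient estimate plus noise."""
    N = len(X)
    idx = rng.choice(N, size=batch_size, replace=False)
    # Unbiased estimate of the gradient of the log-posterior.
    grad = grad_log_prior(theta) + (N / batch_size) * sum(
        grad_log_lik(x, theta) for x in X[idx])
    # Injected Gaussian noise with variance equal to the step size.
    noise = rng.normal(scale=np.sqrt(step), size=theta.shape)
    return theta + 0.5 * step * grad + noise
```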

2.1 Distributed SGLD

In this paper, our focus is on SGLD in distributed settings, where the data are partitioned into $S$ non-overlapping shards $X_1, \ldots, X_S$, such that $\bigcup_{s=1}^{S} X_s = X$. For simplicity, we will assume that each shard corresponds to a single worker. An adaptation of Equation (1) to distributed settings has been presented in [4]. The main idea is that in each iteration, a mini-batch $X_t$ is sampled within a shard, say $X_{s_t}$, and the shard itself is sampled by a scheduler with probability $p_{s_t}$, with $p_s > 0$ for all $s$ and $\sum_{s=1}^{S} p_s = 1$. This results in the update

$\theta_{t+1} = \theta_t + \frac{h_t}{2}\Big(\nabla_\theta \log p(\theta_t) + \frac{N_{s_t}}{n\, p_{s_t}} \sum_{x \in X_t} \nabla_\theta \log p(x \mid \theta_t)\Big) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, h_t I), \qquad (2)$

where $N_{s_t}$ denotes the size of shard $X_{s_t}$, chosen at time $t$. Intuitively, if a mini-batch of $n$ data points is chosen uniformly at random from $X_{s_t}$, then the factor $N_{s_t}/n$ scales the mini-batch gradient to be an unbiased estimator for $\nabla_\theta \log p(X_{s_t} \mid \theta)$, while $1/p_{s_t}$ further scales this gradient to be an unbiased estimator for $\nabla_\theta \log p(X \mid \theta)$.

In order to reduce the communication overhead between iterations, [4] further proposed a modified version, where multiple update steps are taken within a shard before moving to another worker, at the cost of some loss in asymptotic accuracy. It is worth noting that, while the data are distributed across multiple workers, the above sampling procedure may still be understood as entirely serial. In practice, however, distributed settings are naturally amenable to running multiple chains in parallel.
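A sketch of the resulting scheme, with a scheduler that draws a shard and then takes several local steps of the form (2), might look as follows; shard contents, selection probabilities, and gradient callables are placeholders.

```python
def dsgld(theta, shards, probs, grad_log_prior, grad_log_lik,
          step, batch_size, local_iters, n_rounds, rng):
    """Sketch of DSGLD [4]: pick a shard with probability p_s, then take
    several local updates (Eq. 2) before moving to another worker."""
    samples = []
    for _ in range(n_rounds):
        s = rng.choice(len(shards), p=probs)   # scheduler picks a shard
        X_s, N_s = shards[s], len(shards[s])
        for _ in range(local_iters):
            idx = rng.choice(N_s, size=batch_size, replace=False)
            grad = grad_log_prior(theta) + N_s / (batch_size * probs[s]) * sum(
                grad_log_lik(x, theta) for x in X_s[idx])
            theta = theta + 0.5 * step * grad + rng.normal(
                scale=np.sqrt(step), size=theta.shape)
            samples.append(theta)
    return samples
```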

3 Convergence of DSGLD

We begin by analyzing the convergence of DSGLD under the same framework used for the analysis of SGLD in [6] and subsequently adopted in [7], which directly ties convergence bounds to the variance of the gradient estimators. Besides certain regularity conditions adopted in these works, which we outline in the supplementary material, we make the following assumption:

Assumption 1

The gradient of the log-likelihood of individual elements within each shard is bounded, i.e., $\|\nabla_\theta \log p(x \mid \theta)\| \leq \sigma_s$, for all $x \in X_s$ and all $\theta$, for each $s = 1, \ldots, S$.

We then proceed to derive the following bound on the convergence in mean squared error (MSE) of the Monte Carlo expectation $\hat{\phi} = \frac{1}{T}\sum_{t=1}^{T} \phi(\theta_t)$ of a test function $\phi$ with respect to its expected value $\bar{\phi} = \int \phi(\theta)\, p(\theta \mid X)\, d\theta$.

Theorem 3.1

Let $h_t = h$ for all $t = 1, \ldots, T$. Under standard regularity conditions and Assumption 1, the MSE of DSGLD for a smooth test function $\phi$ at time $T$ is bounded, for some constant $C$ independent of $T$ and $h$, in the following manner:

$\mathbb{E}\big[(\hat{\phi} - \bar{\phi})^2\big] \leq C \Bigg( \frac{1}{T} \sum_{s=1}^{S} \frac{N_s^2 \sigma_s^2}{p_s} + \frac{1}{Th} + h^2 \Bigg).$

The bound in Theorem 3.1 (proved in the supplement) depends explicitly on the ratio between squared shard sizes $N_s^2$ and their selection probabilities $p_s$. This follows the intuition that both shard sizes and their availability play a role in the convergence of DSGLD. Note also that the above bound for DSGLD generalizes previous results for SGLD [7]. More specifically, if we combine all shards into a single data set and let $p_1 = 1$, we recover the bound for SGLD:

$\mathbb{E}\big[(\hat{\phi} - \bar{\phi})^2\big] \leq C \Bigg( \frac{N^2 \sigma^2}{T} + \frac{1}{Th} + h^2 \Bigg).$

4 Variance reduction using conducive gradients

In DSGLD, stochastic gradient updates are computed on mini-batches sampled within local shards, which biases individual updates toward the local posteriors and increases variance globally. This is especially true if the shards are heterogeneous. To counteract this, we would like the local updates to make use of information across shards, without significantly increasing either computational cost or memory requirements.

Our strategy for achieving this goal is to augment the local updates with an auxiliary gradient computed on a tractable surrogate for the full-data likelihood $p(X \mid \theta)$. We assume here that the surrogate, denoted as $q(\theta)$, factorizes over shards as $q(\theta) = \prod_{s=1}^{S} q_s(\theta)$, where each $q_s$ is itself a surrogate for $p(X_s \mid \theta)$, i.e., the likelihood w.r.t. the entire shard $s$. Given a surrogate $q$, we define the conducive gradient w.r.t. shard $s$ as

$g_s(\theta) = \nabla_\theta \log q(\theta) - \frac{1}{p_s} \nabla_\theta \log q_s(\theta).$

The following result states that when $g_s$ is added to the stochastic gradient in a DSGLD setting, the resulting estimator remains a valid estimator for the gradient of the full-data log-likelihood.

Lemma 1

Assume $\log q_s$, $s = 1, \ldots, S$, are Lipschitz continuous. Given a dataset $X$ partitioned into shards $X_1, \ldots, X_S$, with respective sample sizes $N_1, \ldots, N_S$ and shard selection probabilities $p_1, \ldots, p_S$, the following gradient estimator, computed on a mini-batch $X_t$ of size $n$ drawn within a shard $X_s$ selected with probability $p_s$,

$G(\theta) = \frac{N_s}{n\, p_s} \sum_{x \in X_t} \nabla_\theta \log p(x \mid \theta) + g_s(\theta),$

is an unbiased estimator of $\nabla_\theta \log p(X \mid \theta)$ with finite variance.

The conducive-gradient DSGLD update can now be written as

$\theta_{t+1} = \theta_t + \frac{h_t}{2}\Big(\nabla_\theta \log p(\theta_t) + \frac{N_{s_t}}{n\, p_{s_t}} \sum_{x \in X_t} \nabla_\theta \log p(x \mid \theta_t) + g_{s_t}(\theta_t)\Big) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, h_t I). \qquad (3)$

Together with the same setting and Assumption 1 previously adopted for the analysis of DSGLD, Lemma 2, which follows directly from the Lipschitz continuity of our surrogates, suffices for deriving a tighter bound for our method, given in Theorem 4.1.

Lemma 2

If $\log q_s$, $s = 1, \ldots, S$, are everywhere differentiable and Lipschitz continuous, then the average value of $\big\|\nabla_\theta \log p(x \mid \theta) - \frac{1}{N_s}\nabla_\theta \log q_s(\theta)\big\|^2$, taken over $x \in X_s$, is bounded by some $\gamma_s < \infty$, for each $s = 1, \ldots, S$.

Theorem 4.1

Let $h_t = h$ for all $t = 1, \ldots, T$. Assume $\log q_s$, $s = 1, \ldots, S$, are Lipschitz continuous. Under standard regularity conditions and Assumption 1, the MSE of CG-DSGLD (defined in Algorithm 1) for a smooth test function $\phi$ at time $T$ is bounded, for some constant $C$ independent of $T$ and $h$, in the following manner:

$\mathbb{E}\big[(\hat{\phi} - \bar{\phi})^2\big] \leq C \Bigg( \frac{1}{T} \min\Bigg\{ \sum_{s=1}^{S} \frac{N_s^2 \sigma_s^2}{p_s},\ \sum_{s=1}^{S} \frac{N_s^2 \gamma_s}{p_s} \Bigg\} + \frac{1}{Th} + h^2 \Bigg).$

Put simply, Theorem 4.1 (proved in the supplement) states that through conducive gradients, tractable approximations to the local likelihood factors can be employed to improve distributed SGLD. Algorithm 1 describes our method, conducive gradient DSGLD (CG-DSGLD), for a single chain.

Remark 1 (Controlling exploration)

Note that conducive gradients can alternatively be written as

$g_s(\theta) = \nabla_\theta \Big[ \log q(\theta) - \frac{1}{p_s} \log q_s(\theta) \Big], \qquad (4)$

making it explicit that these terms encourage the exploration of regions in which we believe, based on the approximations $q$ and $q_s$, the posterior density to be high but the density within shard $s$ to be low. We can explicitly control the extent of this exploration by multiplying the conducive gradient by a constant $\lambda \geq 0$ to obtain the modified gradient estimator:

$G_\lambda(\theta) = \frac{N_s}{n\, p_s} \sum_{x \in X_t} \nabla_\theta \log p(x \mid \theta) + \lambda\, g_s(\theta). \qquad (5)$
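As an illustration, a sketch of the resulting estimator follows; grad_log_q and grad_log_q_shard are hypothetical callables standing in for the surrogate gradients $\nabla \log q$ and $\nabla \log q_s$.

```python
def cg_dsgld_grad(theta, X_s, s, probs, batch_size,
                  grad_log_prior, grad_log_lik,
                  grad_log_q, grad_log_q_shard, lam, rng):
    """Sketch of the CG-DSGLD gradient estimator (Eqs. 3 and 5):
    the DSGLD term plus a conducive gradient, optionally scaled by lam."""
    N_s = len(X_s)
    idx = rng.choice(N_s, size=batch_size, replace=False)
    dsgld = grad_log_prior(theta) + N_s / (batch_size * probs[s]) * sum(
        grad_log_lik(x, theta) for x in X_s[idx])
    # Zero-mean over shard selection (Lemma 1), so unbiasedness is preserved.
    conducive = grad_log_q(theta) - grad_log_q_shard[s](theta) / probs[s]
    return dsgld + lam * conducive
```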

4.1 Choice of q

The key idea in choosing $q$ is to obtain an approximation of $p(X \mid \theta)$ with a parametric form which still enables $\nabla_\theta \log q$ to be computed relatively inexpensively, i.e., in a single gradient evaluation instead of iterating over all $N_s$ data points of shard $s$. Exponential family distributions are particularly convenient for this purpose, as they are closed under product operations, enabling $\nabla_\theta \log q$ to be computed in a single gradient evaluation, which keeps the additional cost of our method negligible even when $S$ is large.
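For instance, if each $q_s$ is a multivariate Gaussian (the choice used in our experiments), the product is again Gaussian:

$q_s(\theta) = \mathcal{N}(\theta \mid \mu_s, \Sigma_s) \ \Longrightarrow\ q(\theta) \propto \mathcal{N}(\theta \mid \mu, \Sigma), \qquad \Sigma^{-1} = \sum_{s=1}^{S} \Sigma_s^{-1}, \qquad \mu = \Sigma \sum_{s=1}^{S} \Sigma_s^{-1} \mu_s,$

so that $\nabla_\theta \log q(\theta) = -\Sigma^{-1}(\theta - \mu)$ is a single matrix-vector product, regardless of $S$ or the shard sizes.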

In this work, we use a simulation-based approach to compute $q_s$, by first drawing from a density proportional to the local likelihood $p(X_s \mid \theta)$ using SGLD, and then using the resulting samples to compute the parameters of an exponential family approximation. To avoid communication overhead, each $q_s$ can be computed independently in parallel for each of the $S$ data shards and then communicated to the coordinating server once, before the CG-DSGLD steps take place.
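A minimal sketch of this fitting step, assuming the shard-local SGLD samples are collected in a NumPy array (one row per sample):

```python
import numpy as np

def fit_gaussian_surrogate(local_samples):
    """Fit a multivariate normal q_s to shard-local SGLD samples and
    return grad log q_s as a closure (one matrix-vector product)."""
    mu = local_samples.mean(axis=0)
    prec = np.linalg.inv(np.cov(local_samples, rowvar=False))
    return lambda theta: -prec @ (theta - mu)  # grad log N(theta | mu, Sigma)
```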

1: Input: total number of iterations $T$, step sizes $\{h_t\}$, number of shards $S$, shard selection probabilities $p_1, \ldots, p_S$, number of local iterations $L$, initial state $\theta_0$.
2: for $t = 0, \ldots, T-1$ do
3:     if $t \bmod L = 0$ then
4:         Sample $s_t \sim \mathrm{Categorical}(p_1, \ldots, p_S)$    ▹ After $L$ iterations, change shard
5:     end if
6:     Sample a mini-batch $X_t$ of size $n$ from $X_{s_t}$
7:     $d_t \leftarrow \nabla \log p(\theta_t) + \frac{N_{s_t}}{n\, p_{s_t}} \sum_{x \in X_t} \nabla \log p(x \mid \theta_t)$    ▹ DSGLD estimator
8:     $g_t \leftarrow \nabla \log q(\theta_t) - \frac{1}{p_{s_t}} \nabla \log q_{s_t}(\theta_t)$    ▹ Conducive gradient
9:     $\theta_{t+1} \leftarrow \theta_t + \frac{h_t}{2}(d_t + g_t) + \eta_t$, with $\eta_t \sim \mathcal{N}(0, h_t I)$    ▹ According to Eq. (3)
10: end for
11: Output: set of samples $\{\theta_t\}_{t=1}^{T}$.
Algorithm 1 Single-chain distributed SGLD with conducive gradients (CG-DSGLD)

5 Experiments

In this section we demonstrate the performance of CG-DSGLD under different conditions. In Subsection 5.1, we provide a visual illustration of a pathology that afflicts DSGLD when the number of within-shard updates is increased. In Subsection 5.2, we consider inference on a Bayesian metric learning model, a non-conjugate model which can be used to learn metric matrices from similarity information in the data. Finally, in Subsection 5.3, we show how our method can be employed to learn Bayesian neural networks in a distributed fashion, and analyse how its behaviour changes, compared to DSGLD, as the amount of heterogeneity between shards is increased.

While these models are increasingly complex, we highlight that, using multivariate Gaussians as the approximations $q_s$ to each of the within-shard likelihood functions, we were still able to obtain good results for CG-DSGLD. For the first set of experiments, we derive analytic forms for the approximations $q_s$, which is possible due to the simplicity of the target model. For the remaining ones, we run SGLD independently for each shard and use the samples obtained to compute the mean vector and covariance matrix that parameterize $q_s$.

All experiments were implemented using PyTorch (https://pytorch.org); code will be publicly released upon publication of the manuscript.

5.1 Heterogeneous shards

An ideal sampling scheme would, in theory, update the chain state at a worker and immediately pass it over to another worker in the next iteration. However, such a short communication cycle would result in a large overhead. To ameliorate the problem, [4] proposed making a number of chain updates before moving to another worker. However, the authors reported that, as the number of iterations within each worker and shard increases, the algorithm tends to lose sample efficiency and effectively samples from a mixture of local posteriors, $\sum_s p_s\, p(\theta \mid X_s)$, instead of the true posterior. With heterogeneous data shards, the effect is particularly noticeable, as illustrated below.

5.1.1 Model:

In this experiment, we illustrate how information can be shared between shards through the approximations $q_s$ introduced in CG-DSGLD, to avoid the aforementioned pathology and achieve better performance. To this end, we consider inference for the mean vector $\mu$ of normally distributed data under a simple conjugate model: a Gaussian likelihood with known covariance and a Gaussian prior on $\mu$, for which the posterior is available analytically.

5.1.2 Setting:

We generate disjoint data subsets of equal size, each drawn from a Gaussian $\mathcal{N}(\mu_s, I)$ with mean $\mu_s$ sampled uniformly from a square. We then perform inference on the overall mean $\mu$ using the model above. We sample the same number of posterior samples using both DSGLD and CG-DSGLD, with fixed step size, fixed mini-batch size, and even shard selection probabilities. An initial segment of each chain was discarded as burn-in and the remaining samples were thinned by 100. We set each $q_s$ to its analytically derived Gaussian form.
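A sketch of this synthetic setup (the subset sizes and the extent of the square are placeholders here):

```python
def make_toy_shards(n_shards, shard_size, half_width, rng):
    """Each shard is drawn from N(mu_s, I) with mu_s uniform on a square,
    so the shards are deliberately heterogeneous."""
    mus = rng.uniform(-half_width, half_width, size=(n_shards, 2))
    return [rng.normal(mu, 1.0, size=(shard_size, 2)) for mu in mus]
```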

5.1.3 Results:

Figure 1 shows the posterior samples as a function of the number of local updates the method takes before jumping to the next shard. For comparison, samples from the analytical posterior are shown in Figure 2(a). As can be seen from the results, the proposed method (CG-DSGLD) converges adequately to the true posterior, while DSGLD drifts towards a mixture of local approximations. Figure 2(b) compares the convergence in terms of the number of posterior samples. While DSGLD with 100 local updates converges as fast as CG-DSGLD, it plateaus at a higher MSE. Note that, in contrast to DSGLD, CG-DSGLD is insensitive to the number of local updates in this experiment.

Figure 1: Posterior samples of the global mean (black) in DSGLD and CG-DSGLD, as a function of the number of shard-local updates (shown in parentheses in each title). The colored dots are the data samples, with different shards in different colors.
Figure 2: (a) Samples from the analytical posterior, for comparison with Figure 1. (b) Quantitative comparison of MSE in estimating the global mean, as a function of the number of posterior samples, showing that CG-DSGLD clearly outperforms DSGLD. Numbers in parentheses indicate the number of local shard updates; for CG-DSGLD the curves are almost identical, and only one is shown.

5.2 Metric learning

Given sets $\mathcal{S}$ and $\mathcal{D}$ of similar and dissimilar pairs of vectors from $\mathbb{R}^d$, metric learning concerns the task of learning a distance metric matrix $A$ such that the Mahalanobis distance

$d_A(x_i, x_j) = (x_i - x_j)^\top A\, (x_i - x_j)$

is low if $(x_i, x_j) \in \mathcal{S}$ and high if $(x_i, x_j) \in \mathcal{D}$.

5.2.1 Model:

We consider inference on the Bayesian metric learning model proposed by Yang et al. [21], in which it is assumed that $A$ can be expressed as $A = \sum_{k=1}^{K} \gamma_k v_k v_k^\top$, where $v_1, \ldots, v_K$ are the top $K$ eigenvectors of the observed data and $\gamma = (\gamma_1, \ldots, \gamma_K)$ are their weights. The likelihood function for each pair $(x_i, x_j)$ from $\mathcal{S}$ or $\mathcal{D}$ is given by

$p(y_{ij} \mid x_i, x_j) = \Big(1 + \exp\big((-1)^{y_{ij}}\,(\mu - d_A(x_i, x_j))\big)\Big)^{-1},$

where $y_{ij}$ equals one if $(x_i, x_j) \in \mathcal{S}$ and equals zero otherwise, i.e., if $(x_i, x_j) \in \mathcal{D}$. While having $\gamma_k \geq 0$ is enough to guarantee that $d_A$ is a distance metric, this constraint is further relaxed and a diagonal Gaussian prior is put on $\gamma$.
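Under this reading of the model, the pair log-likelihood can be sketched as follows; the names and the threshold value mu are illustrative.

```python
import numpy as np

def pair_log_lik(gamma, V, x_i, x_j, y, mu=1.0):
    """Log-likelihood of one pair under the logistic metric-learning model
    (a sketch of our reading of [21]); y = 1 for similar pairs, 0 otherwise.
    A = V diag(gamma) V^T, so d_A needs only K projections."""
    diff = V.T @ (x_i - x_j)        # K projections of the difference vector
    d_A = float(gamma @ diff**2)    # Mahalanobis distance under A
    sign = -1.0 if y == 1 else 1.0  # (-1)^y in the likelihood exponent
    return -np.log1p(np.exp(sign * (mu - d_A)))
```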

5.2.2 Setting:

We devised a data set for metric learning based on the Spoken Letter Recognition (isolet) data (https://archive.ics.uci.edu/ml/datasets/isolet), which encompasses 7797 examples split among 26 classes. We created pairs of similar and dissimilar vectors using the labels of the isolet examples, i.e., samples are considered similar if they belong to the same class and dissimilar otherwise. We split these pairs into data shards of identical size, in such a way that there is no overlap in the classes used, to create the sets of pairs $\mathcal{S}_s$ and $\mathcal{D}_s$ in each shard $s$. Additionally, we created another thousand pairs, equally split between similar and dissimilar examples, which we held out for testing.

We ran both DSGLD and CG-DSGLD with constant step size and fixed mini-batch size. The conducive terms are computed independently for each shard, by drawing three thousand samples from a density proportional to the local likelihood using SGLD and imposing a multivariate normal approximation.

5.2.3 Results:

Figure 3 shows results in terms of average log-likelihood as a function of the number of samples. The curves show the average of ten repetitions of the experiment with different random seeds. The results show that CG-DSGLD converges faster than DSGLD, while also achieving better performance, both on observed samples and on held-out data.

Figure 3: Average log-likelihood values as a function of the number of samples for both DSGLD and CG-DSGLD, measured on (a) observed samples and (b) a held-out set of samples.

5.3 Bayesian neural networks

In this experiment, we assess the performance of our method for posterior inference on a deep Multi-Layer Perceptron (MLP) for: 1) evenly distributed data; and 2) different levels of heterogeneity between shards.

5.3.1 Model:

We consider an MLP with three hidden layers, the last one narrower than the first two. All hidden nodes are equipped with the rectified linear unit (ReLU) activation function, and a softmax function is applied to the output of the network. Since we employ this network for a classification task, we use the cross-entropy loss function.
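A sketch of such a network in PyTorch, with placeholder hidden widths since the exact sizes are not reproduced here:

```python
import torch.nn as nn

# Hidden widths are placeholders; SUSY has 18 covariates and 2 classes.
mlp = nn.Sequential(
    nn.Linear(18, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 2),            # class logits
)
loss_fn = nn.CrossEntropyLoss()  # applies log-softmax to the logits internally
```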

5.3.2 Homogeneous setting:

We perform posterior inference for the SUSY binary classification data set (https://archive.ics.uci.edu/ml/datasets/SUSY), which comprises five million samples with 18 covariates each. We split the data set into equally sized shards and set aside an additional shard for evaluation purposes.

We ran CG-DSGLD and DSGLD, with even shard selection probabilities, for a fixed number of communication rounds, between which shard-local updates take place. For both methods, we adopt a fixed step size and use mini-batches of fixed size. An initial segment of each chain was discarded and the remaining samples were thinned by two. In a similar fashion to the experiments in Subsection 5.2, we computed the conducive terms by drawing independently from densities proportional to the local likelihoods and imposing diagonal multivariate normal approximations based on these samples.

5.3.3 Results:

Figure 4 reports the results in terms of average log-likelihood, evaluated on both observed and held-out data, as a function of the number of samples drawn. Values reported correspond to the average over ten rounds of experiments, each using a different set of data points. Note that in this case CG-DSGLD converged to clearly better average log-likelihood values.

Figure 4: Average log-likelihood values as a function of the number of samples for both DSGLD and CG-DSGLD, measured on (a) observed samples and (b) a held-out set of samples.

5.3.4 Heterogeneous setting:

To simulate different degrees of heterogeneity, we create a series of label-imbalanced data shards based on the SUSY data. We control the amount of heterogeneity by drawing the proportion of positive samples in shard $s$ from a symmetric Beta distribution with parameter $\alpha$, for two values of $\alpha$. When $\alpha$ is large, the data shards are approximately balanced. When $\alpha$ is small, half of the shards tend to have mostly positive labels and the other half mostly negative labels, enforcing diversity between shards. All the aforementioned shards comprise the same number of samples, and an additional balanced shard is held out for evaluation. All remaining implementation details for CG-DSGLD and DSGLD are kept the same as in the homogeneous setting.
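A sketch of this imbalance mechanism (array names are placeholders):

```python
import numpy as np

def label_imbalanced_shards(X_pos, X_neg, n_shards, shard_size, alpha, rng):
    """Shard s receives a fraction r_s ~ Beta(alpha, alpha) of positive
    examples; small alpha pushes shards toward all-positive or all-negative."""
    shards = []
    for _ in range(n_shards):
        n_pos = int(round(rng.beta(alpha, alpha) * shard_size))
        pos = X_pos[rng.choice(len(X_pos), size=n_pos, replace=False)]
        neg = X_neg[rng.choice(len(X_neg), size=shard_size - n_pos, replace=False)]
        shards.append(np.concatenate([pos, neg]))
    return shards
```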

5.3.5 Results:

Figure 5 shows results in terms of average log-likelihood, evaluated on held-out data, for both values of $\alpha$. Note that for the larger $\alpha$, the case in which most of the shards are balanced, CG-DSGLD and DSGLD exhibit similar behaviour. However, for more heterogeneous shards, i.e., the smaller $\alpha$, CG-DSGLD clearly outperforms DSGLD.

Figure 5: Average log-likelihood values on held-out data as a function of the number of samples, for different levels of heterogeneity, for (a) DSGLD and (b) CG-DSGLD.

6 Conclusion

We proposed CG-DSGLD, a novel method which incorporates coarse information about the likelihood contribution of each data shard to improve the convergence of distributed SGLD. Experiments show that our method outperforms DSGLD, especially in cases where there is considerable variation across data shards, causing their likelihood contributions to be significantly distinct.

Our method can be seen as a variance reduction strategy for DSGLD, and we provided a theoretical analysis of its impact on convergence. To the best of our knowledge, we also present the first discussion of the convergence of SGLD in distributed settings. In contrast to previous variance-reduction strategies applied to SGLD, given suitable surrogates $q_s$, such as exponential family distributions, CG-DSGLD can simultaneously be made efficient in terms of both memory and computation.

We leave open the possibility of employing more expressive or computationally cheaper surrogates $q_s$, such as non-parametric methods or variational approximations, respectively.

Appendix 0.A Background on convergence analysis for SGLD

Let $\psi$ be the functional that solves the Poisson equation $\mathcal{L}\psi(\theta_t) = \phi(\theta_t) - \bar{\phi}$, where $\mathcal{L}$ is the generator of the Langevin diffusion. Assume $\psi$ is bounded up to its third-order derivative by a function $V$, such that $\|D^k \psi\| \leq C_k V^{p_k}$ for $k = 0, \ldots, 3$, with $D^k$ denoting the $k$th-order derivative. Assume as well that the expectation of $V^p$ w.r.t. the iterates is bounded ($\sup_t \mathbb{E}[V^p(\theta_t)] < \infty$) and that $V$ is smooth, such that $\sup_{u \in (0,1)} V^p(u\theta + (1-u)\theta') \leq C\,(V^p(\theta) + V^p(\theta'))$ for all $\theta$, $\theta'$ and $p \leq 2\max_k p_k$, for some $C > 0$. Under the regularity conditions above, [6] showed the following result.

Theorem 0.A.1 (See [6])

Let $\nabla\tilde{U}_t$ be an unbiased estimate of $\nabla U$, the gradient of the unnormalized negative log-posterior, and let $h_t = h$ for all $t = 1, \ldots, T$. Let $\Delta V_t = (\nabla U(\theta_t) - \nabla\tilde{U}_t(\theta_t)) \cdot \nabla$. Under the assumptions above, for a smooth test function $\phi$, the MSE of SGLD at time $T$ is bounded, for some $C > 0$ independent of $T$ and $h$, in the following manner:

$\mathbb{E}\big[(\hat{\phi} - \bar{\phi})^2\big] \leq C \Bigg( \frac{1}{T^2} \sum_{t=1}^{T} \mathbb{E}\|\Delta V_t\|^2 + \frac{1}{Th} + h^2 \Bigg). \qquad (6)$

Equation (6) can also be written as

$\mathbb{E}\big[(\hat{\phi} - \bar{\phi})^2\big] \leq C \Bigg( \frac{1}{T} \max_t \mathbb{E}\|\Delta V_t\|^2 + \frac{1}{Th} + h^2 \Bigg), \qquad (7)$

a form which will be useful for us later.

To the regularity conditions mentioned above, we add for further analysis that $\|\nabla\psi\| \leq C_\psi$ for some $C_\psi > 0$, so that $\mathbb{E}\|\Delta V_t\|^2$ can be bounded through the variance of the gradient estimators.

Appendix 0.B Theorem 1: Convergence of DSGLD

Here, we follow in the footsteps of [6], later adopted by [7]. Thus, we focus on bounding $\mathbb{E}\|\Delta V_t\|^2$. To reduce clutter in the following, we define

$G(\theta_t) = \frac{N_s}{n\, p_s} \sum_{x \in X_t} \nabla \log p(x \mid \theta_t),$

so that $\mathbb{E}[G(\theta_t)] = \nabla \log p(X \mid \theta_t)$. For some $C' > 0$, we have:

$\mathbb{E}\|\Delta V_t\|^2 \leq C'\, \mathbb{E}\big\|G(\theta_t) - \nabla \log p(X \mid \theta_t)\big\|^2$
$\leq C'\, \mathbb{E}\|G(\theta_t)\|^2$
$= C'\, \mathbb{E}_s\big[\mathbb{E}_{X_t}\|G(\theta_t)\|^2\big]$
$\leq C'\, \sum_{s=1}^{S} p_s\, \frac{N_s^2 \sigma_s^2}{p_s^2} = C' \sum_{s=1}^{S} \frac{N_s^2 \sigma_s^2}{p_s}.$

In the above, $\mathbb{E}_{X_t}$ denotes that the expectation is taken w.r.t. a mini-batch of size $n$ with elements drawn with replacement and equal probability from shard $X_s$, and $\mathbb{E}_s$ w.r.t. the shard selection probabilities. The second line uses $\mathbb{E}\|Z - \mathbb{E}Z\|^2 \leq \mathbb{E}\|Z\|^2$; the third uses the law of iterated expectations; and the last line uses the triangle inequality together with Assumption 1, which gives $\|G(\theta_t)\| \leq N_s\sigma_s/p_s$. Substituting this bound into Equation (7) yields the desired result.

Appendix 0.C Lemma 1: Unbiasedness and finite variance

Recall that for the DSGLD gradient estimator [4] we have

$\mathbb{E}\Big[\frac{N_s}{n\, p_s} \sum_{x \in X_t} \nabla \log p(x \mid \theta)\Big] = \nabla \log p(X \mid \theta).$

Furthermore, for conducive gradients we have

$\mathbb{E}_s\big[g_s(\theta)\big] = \sum_{s=1}^{S} p_s \Big(\nabla \log q(\theta) - \frac{1}{p_s}\nabla \log q_s(\theta)\Big) = \nabla \log q(\theta) - \sum_{s=1}^{S} \nabla \log q_s(\theta) = 0.$

Thus, the CG-DSGLD gradient estimator, the sum of the DSGLD estimator and the conducive gradient, is unbiased.

A sufficient condition for the DSGLD estimator to have finite variance is that the unnormalized log-posterior is Lipschitz continuous. Similarly, since the $\log q_s$ are also Lipschitz continuous, their first derivatives are bounded, so the conducive gradient is a linear combination of bounded functions and has finite variance. Thus, their sum, the CG-DSGLD estimator, has finite variance.

Appendix 0.D Theorem 2: Convergence of CG-DSGLD

Since conducive gradients have zero mean, the bound derived in Theorem 3.1 also holds for CG-DSGLD. We are left with the task of proving the alternative bound. To reduce clutter, we again write $G(\theta_t) = \frac{N_s}{n p_s}\sum_{x \in X_t} \nabla \log p(x \mid \theta_t)$ and note that the CG-DSGLD estimator can be rearranged as

$G(\theta_t) + g_s(\theta_t) = \frac{N_s}{n\, p_s} \sum_{x \in X_t} \Big(\nabla \log p(x \mid \theta_t) - \frac{1}{N_s}\nabla \log q_s(\theta_t)\Big) + \nabla \log q(\theta_t).$

For some $C' > 0$, we then have:

$\mathbb{E}\|\Delta V_t\|^2 \leq C'\, \mathbb{E}\big\|G(\theta_t) + g_s(\theta_t) - \nabla \log p(X \mid \theta_t)\big\|^2$
$\leq C'\, \mathbb{E}_s\Bigg[\mathbb{E}_{X_t}\Big\|\frac{N_s}{n\, p_s} \sum_{x \in X_t} \Big(\nabla \log p(x \mid \theta_t) - \frac{1}{N_s}\nabla \log q_s(\theta_t)\Big)\Big\|^2\Bigg]$
$\leq C'\, \sum_{s=1}^{S} p_s\, \frac{N_s^2 \gamma_s}{p_s^2} = C' \sum_{s=1}^{S} \frac{N_s^2 \gamma_s}{p_s}.$

The second line uses the law of iterated expectations together with $\mathbb{E}\|Z - \mathbb{E}Z\|^2 \leq \mathbb{E}\|Z\|^2$, since the mini-batch term above equals the estimator up to the constant shift $\nabla \log q(\theta_t)$; the last line uses the Cauchy-Schwarz inequality and Lemma 2. The desired result is obtained by taking the minimum of the bound derived here and the one in Equation (7).

Appendix 0.E Additional experiments

0.e.1 Linear regression

In this set of experiments, we compare vanilla DSGLD with our method, CG-DSGLD, on real datasets. We apply CG-DSGLD to Bayesian linear regression, studying the behavior of the test MSE over the posterior samples obtained.

0.e.1.1 Model

The inputs of our model are pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. The likelihood of the $i$th output $y_i$, given the input vector $x_i$, is Gaussian with mean $\theta^\top x_i$, and we place a Gaussian prior on the weights $\theta$.
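For reference, the corresponding per-observation log-likelihood gradient used by the samplers can be sketched as follows; the noise variance is a placeholder.

```python
def grad_log_lik_linreg(theta, x, y, noise_var=1.0):
    """Gradient w.r.t. theta of log N(y | theta^T x, noise_var)."""
    return (y - theta @ x) * x / noise_var
```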

0.e.1.2 Setting

We ran experiments on four different datasets from the UCI repository (https://archive.ics.uci.edu/ml/index.html): Concrete, Noise, Conductivity and Localization. We normalized the data and partitioned each dataset into training (80%) and test (20%) sets. We report the cumulative averages of both the test MSE and its variance. In all our experiments, both DSGLD and CG-DSGLD use the same hyper-parameters. We sample disjoint data subsets and run multiple rounds of shard-local iterations, with fixed step size and mini-batch size. All shards are chosen with the same probability. We burn in the first ten thousand samples and thin the remaining by a hundred. The surrogates $q_s$ are multivariate Gaussian approximations, one for each shard $s$.

0.e.1.3 Results

Figure 6 shows the cumulative MSE and its variance. We can see that CG-DSGLD converges faster and exhibits lower variance in these experiments.

Figure 6: Average MSE and variance along time computed for DSGLD and CG-DSGLD samples for four different datasets.

0.e.2 Logistic regression

0.e.2.1 Model

We consider inference on the logistic regression model with likelihood

$p(y_i \mid x_i, \theta) = \big(1 + \exp(-y_i\, \theta^\top x_i)\big)^{-1}, \qquad y_i \in \{-1, +1\},$

with a Gaussian prior on $\theta$.

0.e.2.2 Setting

We carry out posterior inference for the Magic and Credit datasets (available from https://archive.ics.uci.edu/ml/index.html).

We compare the proposed CG-DSGLD to the earlier DSGLD, both taking a hundred shard-local chain updates, with fixed batch size and step size. The data are normalized so that the features have mean zero and variance one. We set the likelihood surrogates $q_s$ as Gaussian approximations computed from samples obtained by running SGLD independently with each of the local likelihoods as a target. Using normal approximations allows us to compute the grad-log surrogate $\nabla \log q$ in a single gradient evaluation, whose cost is negligible, approximately $O(1)$ per update. The data are partitioned into shards with even selection probabilities. We discard the first three thousand samples from each chain and thin the remaining ones. Experiments were repeated several times, in each of which a thousand data points were used for testing.

0.e.2.3 Results

Figure 7 shows that our method performs better than DSGLD in terms of cumulative average accuracy, converging faster.

Figure 7: Average accuracy along time computed using DSGLD and CG-DSGLD samples for the Magic and Credit datasets.

References

  • [1] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
  • [2] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
  • [3] Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1683–1691. PMLR, 2014.
  • [4] Sungjin Ahn, Babak Shahbaba, and Max Welling. Distributed stochastic gradient MCMC. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1044–1052. PMLR, 2014.
  • [5] Jack Baker, Paul Fearnhead, Emily B. Fox, and Christopher Nemeth. Control variates for stochastic gradient MCMC. Statistics and Computing, 29(3):599–615, 2019.
  • [6] Changyou Chen, Nan Ding, and Lawrence Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 2278–2286, Cambridge, MA, USA, 2015. MIT Press.
  • [7] Kumar Avinava Dubey, Sashank J. Reddi, Sinead A. Williamson, Barnabás Póczos, Alexander J. Smola, and Eric P. Xing. Variance reduction in stochastic gradient langevin dynamics. In Proceedings of the International Conference on Neural Information Processing Systems, pages 1154–1162, 2016.
  • [8] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtarik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. In NIPS Workshop on Private Multi-Party Machine Learning, 2016.
  • [9] Chunyuan Li, Changyou Chen, Yunchen Pu, Ricardo Henao, and Lawrence Carin. Communication-efficient stochastic gradient MCMC for neural networks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI). AAAI Press, 2019.
  • [10] Zhize Li, Tianyi Zhang, Shuyu Cheng, Jun Zhu, and Jian Li. Stochastic gradient Hamiltonian Monte Carlo with variance reduction for Bayesian inference. arXiv preprint arXiv:1803.11159, 2019.
  • [11] Tigran Nagapetyan, Andrew B. Duncan, Leonard Hasenclever, Sebastian J. Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient Langevin dynamics. arXiv e-prints, page arXiv:1706.02692, 2017.
  • [12] Radford Neal. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, New York, 2011.
  • [13] Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, pages 623–632, Arlington, Virginia, United States, 2014. AUAI Press.
  • [14] B. D. Ripley. Stochastic Simulation. Wiley, Hoboken, 2009.
  • [15] Steven L Scott, Alexander W Blocker, Fernando V Bonassi, Hugh A Chipman, Edward I George, and Robert E McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88, 2016.
  • [16] Umut Şimşekli, Hazal Koptagel, Hakan Güldaş, A Taylan Cemgil, Figen Öztoprak, and Ş İlker Birbil. Parallel stochastic gradient Markov Chain Monte Carlo for matrix factorisation models. arXiv preprint arXiv:1506.01418, 2015.
  • [17] Yee Whye Teh, Alexandre H Thiery, and Sebastian J Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(1):193–225, 2016.
  • [18] Alexander Terenin, Daniel Simpson, and David Draper. Asynchronous Gibbs sampling. arXiv preprint arXiv:1509.08999, 2015.
  • [19] Sebastian J Vollmer, Konstantinos C Zygalakis, and Yee Whye Teh. Exploration of the (non-) asymptotic bias and variance of stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(1):5504–5548, 2016.
  • [20] Xiangyu Wang, Fangjian Guo, Katherine A. Heller, and David B. Dunson. Parallelizing MCMC with random partition trees. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 451–459, Cambridge, MA, USA, 2015. MIT Press.
  • [21] Liu Yang, Rong Jin, and Rahul Sukthankar. Bayesian active distance metric learning. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, UAI’07, page 442–449, Arlington, Virginia, USA, 2007. AUAI Press.
  • [22] Yuan Yang, Jianfei Chen, and Jun Zhu. Distributing the stochastic gradient sampler for large-scale LDA. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1975–1984, New York, NY, USA, 2016. ACM.