Quasi-Newton (QN) methods are powerful optimization techniques that are able to attain fast convergence rates by incorporating local geometric information through an approximation of the inverse of the Hessian matrix. The L-BFGS algorithm (Nocedal & Wright, 2006) is a well-known limited-memory QN method that aims at solving the following optimization problem:
is a twice continuously differentiable function that can be convex or non-convex, and is often referred to as the empirical risk. In a typical machine learning context, a datasetwith independent and identically distributed (i.i.d.) data points is considered, which renders the function as a sum of different functions .
In large scale applications, the number of data points often becomes prohibitively large and therefore using a ‘batch’ L-BFGS algorithm becomes computationally infeasible. As a remedy, stochastic L-BGFS methods have been proposed (Byrd et al., 2016; Schraudolph et al., 2007; Moritz et al., 2016; Zhou et al., 2017; Yousefian et al., 2017; Zhao et al., 2017), which aim to reduce the computational requirements of L-BFGS by replacing (i.e. the full gradients that are required by L-BFGS) with some stochastic gradients that are computed on small subsets of the dataset. However, using stochastic gradients within L-BFGS turns out to be a challenging task since it brings additional technical difficulties, which we will detail in Section 2.
In a very recent study, Berahas et al. (2016)
proposed a parallel stochastic L-BFGS algorithm, called multi-batch L-BFGS (mb-L-BFGS), which is suitable for synchronous distributed architectures. This work illustrated that carrying out L-BFGS in a distributed setting introduces further theoretical and practical challenges; however, if these challenges are addressed, stochastic L-BFGS can be powerful in a distributed setting as well, and outperform conventional algorithms such as distributed stochastic gradient descent (SGD), as shown by their experimental results.
Despite the fact that synchronous parallel algorithms have clear advantages over serial optimization algorithms, the computational efficiency of synchronous algorithms is often limited by the overhead induced by the synchronization and coordination among the worker processes. Inspired by asynchronous parallel stochastic optimization techniques (Agarwal & Duchi, 2011; Lian et al., 2015; Zhang et al., 2015; Zhao & Li, 2015; Zheng et al., 2017), in this study, we propose an asynchronous parallel stochastic L-BFGS algorithm for large-scale non-convex optimization problems. The proposed approach aims at speeding up the synchronous algorithm presented in (Berahas et al., 2016) by allowing all the workers work independently from each other and circumvent the inefficiencies caused by synchronization and coordination.
Extending stochastic L-BFGS to asynchronous settings is a highly non-trivial task and brings several challenges. In our strategy, we first reformulate the optimization problem (1) as a sampling problem where the goal becomes drawing random samples from a distribution whose density is concentrated around . We then build our algorithm upon the recent stochastic gradient Markov Chain Monte Carlo (SG-MCMC) techniques (Chen et al., 2015, 2016b) that have close connections with stochastic optimization techniques (Dalalyan, 2017; Raginsky et al., 2017; Zhang et al., 2017), and have proven successful in large-scale Bayesian machine learning. We provide formal theoretical analysis and prove non-asymptotic guarantees for the proposed algorithm. Our theoretical results show that the proposed algorithm achieves an ergodic global convergence with rate , where denotes the total number of iterations. Our results further imply that the algorithm can achieve a linear speedup under ideal conditions.
For evaluating the proposed method, we conduct several experiments on synthetic and real datasets. The experimental results support our theory: our experiments on a large-scale matrix factorization problem show that the proposed algorithm provides a significant speedup over the synchronous parallel L-BFGS algorithm.
2 Technical Background
Preliminaries: As opposed to the classical optimization perspective, we look at the optimization problem (1) from a maximum a-posteriori
(MAP) estimation point of view, where we consider
as a random variable inand as the optimum of a Bayesian posterior whose density is given as , where is a set of i.i.d. observed data points. Within this context, is often called the potential energy and defined as , where is the likelihood function and is the prior density. In a classical optimization context, would correspond to the data-loss and would correspond to a regularization term. Throughout this study, we will assume that the problem (1) has a unique solution in .
We define a stochastic gradient
, that is an unbiased estimator of, as follows: , where denotes a random data subsample that is drawn with replacement, is the cardinality of . In the sequel, we will occasionally use the notation and to denote the stochastic gradient computed at iteration of a given algorithm, or on a specific data subsample , respectively.
The L-BFGS algorithm: The L-BFGS algorithm iteratively applies the following equation in order to find the MAP estimate given in (1):
where denotes the iterations. Here, is an approximation to the inverse Hessian at and is computed by using the past values of the ‘iterate differences’ , and ‘gradient differences’ . The collection of the iterate and gradient differences is called the L-BFGS memory
. The matrix-vector productis often implemented by using the two-loop recursion (Nocedal & Wright, 2006), which has linear time and space complexities .
In order to achieve computational scalability, stochastic L-BFGS algorithms replace with . This turns out to be problematic, since the gradient differences would be inconsistent, meaning that the stochastic gradients in different iterations will be computed on different data subsamples, i.e. and . On the other hand, in the presence of the stochastic gradients, L-BFGS is no longer guaranteed to produce positive definite approximations even in convex problems, therefore more considerations should be taken in order to make sure that is positive definite.
Stochastic Gradient Markov Chain Monte Carlo: Along with the recent advances in MCMC techniques, diffusion-based algorithms have become increasingly popular due to their applicability in large-scale machine learning applications. These techniques, so called the Stochastic Gradient MCMC (SG-MCMC) algorithms, aim at generating samples from the posterior distribution as opposed to finding the MAP estimate, and have strong connections with stochastic optimization techniques (Dalalyan, 2017). In this line of work, Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011) is one of the pioneering algorithms and generates an approximate sample from by iteratively applying the following update equation:
where is the step-size and is a collection of standard Gaussian random variables in . Here, is called the inverse temperature: it is fixed to in vanilla SGLD and when the algorithm is called ‘tempered’. In an algorithmic sense, SGLD is identical to SGD, except that it injects a Gaussian noise at each iteration and it coincides with SGD when goes to infinity.
SGLD has been extended in several directions (Ma et al., 2015; Chen et al., 2015; Şimşekli et al., 2016b; Şimşekli, 2017). In (Şimşekli et al., 2016a), we proposed an L-BFGS-based SGLD algorithm with computational complexity, which aimed to improve the convergence speed of the vanilla SGLD. We showed that a straightforward way of combining L-BFGS in SGLD would incur an undesired bias; however, the remedy to prevent this bias resulted in numerical instability, which would limit the applicability of the algorithm. In other recent studies, SGLD has also been extended to synchronous (Ahn et al., 2014) and asynchronous (Chen et al., 2016b; Springenberg et al., 2016) distributed MCMC settings.
SGLD can be seen as a discrete-time simulation of a continuous-time Markov process that is the solution of the following stochastic differential equation (SDE):
where denotes the standard Brownian motion in . Under mild regularity conditions on , the solution process attains a unique stationary distribution with a density that is proportional to (Roberts & Stramer, 2002). An important property of this distribution is that, as goes to infinity, this density concentrates around the global minimum of (Hwang, 1980; Gelfand & Mitter, 1991). Therefore, for large enough , a random sample that is drawn for the stationary distribution of would be close to . Due to this property, SG-MCMC methods have recently started drawing attention from the non-convex optimization community. Chen et al. (2016a) developed an annealed SG-MCMC algorithm for non-convex optimization and it was recently extended by Ye et al. (2017). Raginsky et al. (2017) and Xu et al. (2017) provided finite-time guarantees for SGLD to find an ‘approximate’ global minimizer that is close to , which imply that the additive Gaussian noise in SGLD can help the algorithm escape from poor local minima. In a complementary study, Zhang et al. (2017) showed that SGLD enters a neighborhood of a local minimum of in polynomial time, which shows that even if SGLD fails to find the global optimum, it will still find a point that is close to one of the local optima. Even though these results showed that SG-MCMC is promising for optimization, it is still not clear how an asynchronous stochastic L-BFGS method could be developed within an SG-MCMC framework.
3 Asynchronous Stochastic L-BFGS
In this section, we propose a novel asynchronous L-BFGS-based (tempered) SG-MCMC algorithm that aims to provide an approximate optimum that is close to by generating samples from a distribution that has a density that is proportional to . We call the proposed algorithm asynchronous parallel stochastic L-BFGS (as-L-BFGS). Our method is suitable for both distributed and shared-memory settings. We will describe the algorithm only for the distributed setting; the shared-memory version is almost identical to the distributed version as long as the updates are ensured to be atomic.
We consider a classical asynchronous optimization architecture, which is composed of a master node, several worker nodes, and a data server. The main task of the master node is to maintain the newest iterate of the algorithm. At each iteration, the master node receives an additive update vector from a worker node, it adds this vector to the current iterate in order to obtain the next iterate, and then it sends the new iterate to the worker node which has sent the update vector. On the other hand, the worker nodes work in a completely asynchronous manner. A worker node receives the iterate from the master node, computes an update vector, and sends the update vector to the master node. However, since the iterate would be possibly modified by another worker node which runs asynchronously in the mean time, the update vector that is sent to the server will thus be computed on an old iterate, which causes both practical and theoretical challenges. Such updates are aptly called ‘delayed’ or ‘stale’. The full data is kept in the data server and we assume that all the workers have access to the data server.
The proposed algorithm iteratively applies the following update equations in the master node:
where is the iteration index, is called the momentum variable, and and are the update vectors that are computed by the worker nodes. A worker node runs the following equations in order to compute the update vectors:
where is the step-size, is the friction parameter that determines the weight of the momentum, is the inverse temperature, denotes standard Gaussian random variables, and denotes the L-BFGS matrix at iteration . Here, denotes the ‘staleness’ of a particular update and measures the delay between the current update and the up-to-date iterate that is stored in the master node. We assume that the delays are bounded, i.e. . Note that the matrix-vector products have time-space complexity.
Due to the asynchrony, the stochastic gradients and the L-BFGS matrices will be computed on the delayed variables and . As opposed to the asynchronous stochastic gradient algorithms, where the main difficulty stems from the delayed gradients, our algorithm faces further challenges since it is not straightforward to obtain the gradient and iterate differences that are required for the L-BFGS computations in an asynchronously parallel setting.
We propose the following approach for the computation of the L-BFGS matrices. As opposed to the mb-L-BFGS algorithm, which uses a central L-BFGS memory (i.e. the collection of the gradient and iterate differences) that is stored in the master node, we let each worker have their own local L-BFGS memories since the master node would not be able to keep track of the gradient and iterate differences, which are received in an asynchronous manner. In our strategy, each worker updates its own L-BFGS memory right after sending the update vector to the master node. The overall algorithm is illustrated in Algorithms LABEL:algo:master and LABEL:algo:worker ( denotes the number of workers).
In order to be able to have consistent gradient differences, each worker applies a multi-batch subsampling strategy that is similar to mb-L-BFGS. We divide the data subsample into two subsets, i.e. with , , and . Here the main idea is to choose and use as an overlapping subset for the gradient differences. In this manner, in addition to the gradients that are computed on and , we also perform an extra gradient computation on the previous overlapping subset, at the end of each iteration. As will be small, this extra cost will not be significant. Finally, in order to ensure the L-BFGS matrices are positive definite, we use a ‘cautious’ update mechanism that is useful for non-convex settings (Li & Fukushima, 2001; Zhang & Sutton, 2011; Berahas et al., 2016) as shown in Algorithm LABEL:algo:worker.
Note that, in addition to asynchrony, the proposed algorithm also extends the current stochastic L-BFGS methods by introducing momentum. This brings two critical practical features: (i) without the existence of the momentum variables, the injected Gaussian noise must depend on the L-BFGS matrices, as shown in (Şimşekli et al., 2016a), which results in an algorithm with time complexity whereas our algorithm has time complexity, (ii) the use of the momentum significantly repairs the numerical instabilities caused by the asynchronous updates, since inherently encapsulates a direction for , which provides additional information to the algorithm besides the gradients and L-BFGS computations. Furthermore, in a very recent study (Loizou & Richtárik, 2017) the use of momentum variables has been shown to be useful in other second-order optimization methods. On the other hand, despite their advantages, the momentum variable also drifts apart the proposed algorithm from the original L-BFGS formulation. However, even such approximate approaches have proven useful in various scenarios (Zhang & Sutton, 2011; Fu et al., 2016). Also note that, when , , and for all , the algorithm coincides with SGD with momentum. A more detailed illustration is given in the supplementary document.
4 Theoretical Analysis
In this section, we will provide non-asymptotic guarantees for the proposed algorithm. Our analysis strategy is different from the conventional analysis approaches for stochastic optimization and makes use of tools from analysis of SDEs. In particular, we will first develop a continuous-time Markov process whose marginal stationary measure admits a density that is proportional to . Then we will show that (5)-(7) form an approximate Euler-Maruyama integrator that approximately simulates this continuous process in discrete-time. Finally, we will analyze this approximate numerical scheme and provide a non-asymptotic error bound. All the proofs are given in the supplementary document.
We start by considering the following stochastic dynamical system:
where is also called the momentum variable, denotes the L-BFGS matrix at time and is a vector that is defined as follows:
where denotes the component of a vector and similarly denotes a single element of a matrix .
In order to analyze the invariant measure of the SDE defined in (8), we need certain conditions to hold. First, we have two regularity assumptions on and :
The gradient of the potential is Lipschitz continuous, i.e. .
The L-BFGS matrices have bounded second-order derivatives and they are Lipschitz continuous, i.e. .
The assumptions H 1 and H 2 are standard conditions in analysis of SDEs (Duan, 2015) and similar assumptions have also been considered in stochastic gradient (Moulines & Bach, 2011) and stochastic L-BFGS algorithms (Zhou et al., 2017). Besides, H 2 provides a direct control on the partial derivatives of , which will be useful for analyzing the overall numerical scheme. We now present our first result that establishes the invariant measure of the SDE (8).
This result shows that, if the SDE (8) could be exactly simulated, the marginal distribution of the samples would converge to a measure which has a density that is proportional to . Therefore, for large enough and , would be close to the global optimum .
We note that when , the SDE (8) shares similarities with the SDEs presented in (Fu et al., 2016; Ma et al., 2015). While the main difference being the usage of the tempering scheme, (Fu et al., 2016) further differs from our approach as it directly discard the term since is in a Metropolis-Hastings framework, which is not adequate for large-scale applications. On the other hand, the stochastic gradient Riemannian Hamiltonian Monte Carlo algorithm given in (Ma et al., 2015), chooses as the Fisher information matrix; a quantity that requires space-time complexity and is not analytically available in general.
We will now show that the proposed algorithm (5)-(7) form an approximate method for simulating (8) in discrete-time. For illustration, we first consider the Euler-Maruyama integrator for (8), given as follows:
Here, the term introduces an additional computational burden and its importance is very insignificant (i.e. its magnitude is of order due to H 2). Therefore, we discard , define , , , and use these quantities in (S3) and (S1). We then obtain the following re-parametrized Euler integrator:
The detailed derivation is given in the supplementary document. Finally, we replace with the stochastic gradients, replace the variables and with stale variables and in the update vectors, and obtain the ultimate update equations, given in (5). Note that, due to the negligence of , the proposed approach would require a large and would not be suitable for classical posterior sampling settings, where .
In this section, we will analyze the ergodic error , where we define and . This error resembles the bias of a statistical estimator; however, as opposed to the bias, it directly measures the expected discrepancy to the global optimum. Similar ergodic error notions have been considered in the analysis of non-convex optimization methods (Lian et al., 2015; Chen et al., 2016a; Berahas et al., 2016).
In our proof strategy, we decompose the error into two terms: , where , and . We then upper-bound these terms separately.
The term turns out to be the bias of a statistical estimator, which we can analyze by using ideas from recent SG-MCMC studies. However, existing tools cannot be directly used because of the additional difficulties introduced by the L-BFGS matrices. In order to bound , we first require the following smoothness and boundedness condition.
Let be a functional that is the unique solution of a Poisson equation that is defined as follows:
where , is the generator of (8) at and is formally defined in the supplementary document. The functional and its up to third-order derivatives are bounded by a function , such that for and . Furthermore, and is smooth such that for all , , and .
Assumption H 3 is also standard in SDE analysis and SG-MCMC (Mattingly et al., 2010; Teh et al., 2016; Chen et al., 2015; Durmus et al., 2016) and gives us control over the weak error of the numerical integrator. We further require the following regularity conditions in order to have control over the error induced by the delayed stochastic gradients.
The variance of the stochastic gradients is bounded, i.e.
The variance of the stochastic gradients is bounded, i.e.for some .
For a smooth and bounded function , the remainder in the following Taylor expansion is bounded:
The following lemma presents an upper-bound for .
and the following moment condition on.
The second-order moments of are bounded and satisfies the following inequality: for some .
This assumption is mild since concentrates around as tends to infinity. The order is arbitrary, hence the assumption can be further relaxed. The following lemma establishes an upper-bound for .
More explicit constants and a discussion on the relation of the theorem to other recent theoretical results are provided in the supplementary document.
Theorem 1 provides a non-asymptotic guarantee for convergence to a point that is close to the global optimizer even when is non-convex, thanks to the additive Gaussian noise. The bound suggests an optimal rate of convergence of , which is in line with the current rates of the non-convex asynchronous algorithms (Lian et al., 2015). Furthermore, if we assume that the total number of iterations is a linear function of the number of workers, e.g. , where is the number of iterations executed by a single worker, Theorem 1 implies that, in the ideal case, the proposed algorithm can achieve a linear speedup with increasing , provided that .
Despite their nice theoretical properties, it is well-known that tempered sampling approaches also often get stuck near a local minimum. In our case, this behavior would be mainly due to the hidden constant in (14), which can be exponential in dimension , as illustrated in (Raginsky et al., 2017) for SGLD. On the other hand, Theorem 1 does not guarantee that the proposed algorithm will converge to a neighborhood of a local minimum; however, we believe that we can also prove local convergence guarantees by using the techniques provided in (Zhang et al., 2017; Tzen et al., 2018), which we leave as a future work.
The performance of asynchronous stochastic gradient methods has been evaluated in several studies, where the advantages and limitations have been illustrated in various scenarios, to name a few (Dean et al., 2012; Zhang et al., 2015; Zheng et al., 2017). In this study, we will explore the advantages of using L-BFGS in an asynchronous environment. In order to illustrate the advantages of asynchrony, we will compare as-L-BFGS with mb-L-BFGS (Berahas et al., 2016); and in order to illustrate the advantages that are brought by using higher-order geometric information, we will compare as-L-BFGS to asynchronous SGD (a-SGD) (Lian et al., 2015). We will also explore the speedup behavior of as-L-BFGS for increasing .
We conduct experiments on both synthetic and real datasets. For real data experiments, we have implemented all the three algorithms in C++ by using a low-level message passing protocol for parallelism, namely the OpenMPI library. This code can be used both in a distributed environment or a single computer with multiprocessors. For the experiments on synthetic data, we have implemented the algorithms in MATLAB, by developing a realistic discrete-event simulator. This simulated environment is particularly useful for understanding the behaviors of the algorithms in detail since we can explicitly control the computation time that is spent at the master or worker nodes, and the communication time between the nodes. This simulation strategy also enables us to explicitly control the variation among the computational powers of the worker nodes; a feature that is much harder to control in real distributed environments.
Linear Gaussian model: We conduct our first set of experiments on synthetic data where we consider a rather simple convex quadratic problem whose optimum is analytically available. The problem is formulated as finding the MAP estimate of the following linear Gaussian probabilistic model:
We assume that and are known and we aim at computing . For these experiments, we develop a parametric discrete event simulator that aims to simulate the algorithms in a controllable yet realistic way. The simulator simulates a distributed optimization algorithm once it is provided four parameters: (i) : the average computational time spent by the master node at each iteration, (ii) : the average computational time spent by a single worker at each iteration, (iii) : the standard deviation of the computational time spent by a single worker per iteration, and (iv) : the time spent for communications per iteration. All these parameters are in a generic base time unit
. Once these parameters are provided for one of the three algorithms, the simulator simulates the (a)synchronous distributed algorithm by drawing random computation times from a log-normal distribution whose mean and variance is specified byand . Figure 1 illustrates a typical outcome of the real and the simulated implementations of as-L-BFGS, where we observe that the simulator is able to provide realistic simulations that can even very well reflect the fluctuations of the algorithm.
In our first experiment, we set , , , we randomly generate and fix the vectors in such a way that there will be a strong correlation in the posterior distribution, and we finally generate a true and the observations by using the generative model.
For each algorithm, we fix , , and to realistic values and investigate the effect of the variation among the workers by comparing the running time of the algorithms for achieving -accuracy (i.e., ) for different values of when . We repeat each experiment times. In all our experiments, we have tried several values for the hyper-parameters of each algorithm and we report the best results. All the hyper-parameters are provided in the supplementary document.
Figure 2 visualizes the results for the first experiment. We can observe that, for smaller values as-L-BFGS and mb-L-BFGS perform similarly, where a-SGD requires more computational time to achieve -accuracy. However, as we increase the value of , mb-L-BFGS requires more computational time in order to be able to collect sufficient amount of stochastic gradients. The results show that both asynchronous algorithms turn out to be more robust to the variability of the computational power of the workers, where as-L-BFGS shows a better performance.
In our second experiment, we investigate the speedup behavior of as-L-BFGS within the simulated setting. In this setting, we consider a highly varying set of workers and set and vary the number of workers . As illustrated in Figure 3, as increases, increases as well and the algorithm hence requires more iterations in order to achieve -accuracy, since a smaller step-size needs to be used. However, this increment in the number of iterations is compensated by the increased number of workers, as we observe that the required computational time gracefully decreases with increasing . We observe a similar behavior for different values of , where the speedup is more prominent for smaller .
Large-scale matrix factorization: In our next set of experiments, we consider a large-scale matrix factorization problem (Gemulla et al., 2011; Şimşekli et al., 2015, 2017), where the goal is to obtain the MAP solution of the following probabilistic model: . Here, is the data matrix, and and are the factor matrices to be estimated.
In this context, we evaluate the algorithms on three large-scale movie ratings datasets, namely MovieLens Million (ML-M), Million (ML-M), and Million (ML-M) (grouplens.org). The ML-M dataset contains million non-zero entries, where (movies) and (users). The ML-M dataset contains million non-zero entries, resulting in a data matrix. Finally, the ML-M dataset contains million ratings, resulting in a data matrix. We have conducted these experiments on a cluster of more than interconnected computers, each of which is equipped with variable quality CPUs and memories. In these experiments, we have found that the numerical stability is improved when is replaced with for small . This small modification does not violate our theoretical results. The hyper-parameters are provided in the supplementary document.
Figure 4 shows the performance of the three algorithms on the MovieLens datasets in terms of the root-mean-squared-error (RMSE), which is a standard metric for recommendation systems, and the norm of the gradients through iterations. In these experiments, we set for all the three datasets and we set the number of workers to . The results show that, in all datasets, as-L-BFGS provides a significant speedup over mb-L-BFGS thanks to asynchrony. We can observe that even when the speed of convergence of mb-L-BFGS is comparable to a-SGD and as-L-BFGS (cf. the plots showing the norm of the gradients), the final RMSE yielded by mb-L-BFGS is poorer than the two other methods, which is an indicator that the asynchronous algorithms are able to find a better local minimum. On the other hand, the asynchrony causes more fluctuations in as-L-BFGS when compared to a-SGD.
As opposed to the synthetic data experiments, in all the three MovieLens datasets, we observe that as-L-BFGS provides a slight improvement in the convergence speed when compared to a-SGD. This indicates that a-SGD is able to achieve a comparable convergence speed by taking more steps while as-L-BFGS is computing the matrix-vector products. However, this gap can be made larger by considering a more efficient, yet more sophisticated implementation for L-BFGS computations (Chen et al., 2014).
In our last experiment, we investigate the speedup properties of as-L-BFGS in the real distributed setting. In this experiment, we only consider the ML-M dataset and run the as-L-BFGS algorithm for different number of workers. Figure 5 illustrates the results of this experiment. As we increase from to , we obtain a decent speedup that is close to a linear speedup. However, when we set the algorithm becomes unstable, since the term in (15) dominates. Therefore, for we need to decrease the step-size , which requires the algorithm to be run for a longer amount of time in order to achieve the same error as we achieved when was smaller. On the other hand, the algorithm achieves a linear speedup in terms of iterations; however, the corresponding result is provided in the supplementary document due to the space constraints.
In this study, we proposed an asynchronous parallel L-BFGS algorithm for non-convex optimization. We developed the algorithm within the SG-MCMC framework, where we reformulated the problem as sampling from a concentrated probability distribution. We proved non-asymptotic guarantees and showed that as-L-BFGS achieves an ergodic global convergence with rateand it can achieve a linear speedup. Our experiments supported our theory and showed that the proposed algorithm provides a significant speedup over the synchronous parallel L-BFGS algorithm.
The authors would like to thank to Murat A. Erdoğdu for fruitful discussions. This work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX project (ANR-16-CE23-0014), by the Scientific and Technological Research Council of Turkey (TÜBİTAK) grant number 116E580, and by the industrial chair Machine Learning for Big Data from Télécom ParisTech.
- Agarwal & Duchi (2011) Agarwal, A. and Duchi, J. C. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pp. 873–881, 2011.
- Ahn et al. (2014) Ahn, S., Shahbaba, B., and Welling, M. Distributed stochastic gradient MCMC. In International conference on machine learning, pp. 1044–1052, 2014.
- Berahas et al. (2016) Berahas, A. S., Nocedal, J., and Takác, M. A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pp. 1055–1063, 2016.
- Byrd et al. (2016) Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
- Chen et al. (2015) Chen, C., Ding, N., and Carin, L. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pp. 2269–2277, 2015.
- Chen et al. (2016a) Chen, C., Carlson, D., Gan, Z., Li, C., and Carin, L. Bridging the gap between stochastic gradient MCMC and stochastic optimization. In AISTATS, 2016a.
- Chen et al. (2016b) Chen, C., Ding, N., Li, C., Zhang, Y., and Carin, L. Stochastic gradient MCMC with stale gradients. In Advances In Neural Information Processing Systems, pp. 2937–2945, 2016b.
- Chen et al. (2014) Chen, W., Wang, Z., and Zhou, J. Large-scale L-BFGS using MapReduce. In Advances in Neural Information Processing Systems, pp. 1332–1340, 2014.
- Şimşekli et al. (2016a) Şimşekli, U., Badeau, R., Cemgil, A. T., and Richard, G. Stochastic quasi-Newton Langevin Monte Carlo. In ICML, 2016a.
- Şimşekli et al. (2016b) Şimşekli, U., Badeau, R., Richard, G., and Cemgil, A. T. Stochastic thermodynamic integration: efficient Bayesian model selection via stochastic gradient MCMC. In ICASSP, 2016b.
- Şimşekli et al. (2017) Şimşekli, U., Durmus, A., Badeau, R., Richard, G., Moulines, E., and Cemgil, A. T. Parallelized stochastic gradient Markov Chain Monte Carlo algorithms for non-negative matrix factorization. In ICASSP, 2017.
- Dalalyan (2017) Dalalyan, A. S. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. Proceedings of the 2017 Conference on Learning Theory, 2017.
- Dean et al. (2012) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.
- Duan (2015) Duan, J. An Introduction to Stochastic Dynamics. Cambridge University Press, New York, 2015.
- Durmus et al. (2016) Durmus, A., Şimşekli, U., Moulines, E., Badeau, R., and Richard, G. Stochastic gradient Richardson-Romberg Markov Chain Monte Carlo. In NIPS, 2016.
- Fu et al. (2016) Fu, T., Luo, L., and Zhang, Z. Quasi-Newton Hamiltonian Monte Carlo. In UAI, 2016.
- Gelfand & Mitter (1991) Gelfand, S. B. and Mitter, S. K. Recursive stochastic algorithms for global optimization in R^d. SIAM Journal on Control and Optimization, 29(5):999–1018, 1991.
- Gemulla et al. (2011) Gemulla, R., Nijkamp, E., J., H. P., and Sismanis, Y. Large-scale matrix factorization with distributed stochastic gradient descent. In ACM SIGKDD, 2011.
- Hwang (1980) Hwang, C. Laplace’s method revisited: weak convergence of probability measures. The Annals of Probability, pp. 1177–1182, 1980.
- Li & Fukushima (2001) Li, D.-H. and Fukushima, M. On the global convergence of the BFGS method for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 11(4):1054–1064, 2001.
- Lian et al. (2015) Lian, X., Huang, Y., Li, Y., and Liu, J. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 2737–2745, 2015.
- Loizou & Richtárik (2017) Loizou, N. and Richtárik, P. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.
- Ma et al. (2015) Ma, Y. A., Chen, T., and Fox, E. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pp. 2899–2907, 2015.
- Mattingly et al. (2010) Mattingly, J. C., Stuart, A. M., and Tretyakov, M. V. Convergence of numerical time-averaging and stationary measures via Poisson equations. SIAM Journal on Numerical Analysis, 48(2):552–577, 2010.
- Moritz et al. (2016) Moritz, P., Nishihara, R., and Jordan, M. A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics, pp. 249–258, 2016.
Moulines & Bach (2011)
Moulines, E. and Bach, F. R.
Non-asymptotic analysis of stochastic approximation algorithms for machine learning.In Advances in Neural Information Processing Systems, pp. 451–459, 2011.
- Nocedal & Wright (2006) Nocedal, J. and Wright, S. J. Numerical optimization. Springer, Berlin, 2006.
- Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pp. 1674–1703, 2017.
- Roberts & Stramer (2002) Roberts, G. O. and Stramer, O. Langevin Diffusions and Metropolis-Hastings Algorithms. Methodology and Computing in Applied Probability, 4(4):337–357, December 2002. ISSN 13875841.
- Schraudolph et al. (2007) Schraudolph, N. N., Yu, J., and Günter, S. A stochastic quasi-Newton method for online convex optimization. In Artificial Intelligence and Statistics, pp. 436–443, 2007.
- Şimşekli (2017) Şimşekli, U. Fractional Langevin Monte carlo: Exploring Levy driven stochastic differential equations for Markov chain Monte Carlo. In ICML, pp. 3200–3209, 2017.
- Şimşekli et al. (2015) Şimşekli, U., Koptagel, H., Güldaş, H., Cemgil, A. T., Öztoprak, F., and Birbil, Ş. İ. Parallel stochastic gradient Markov Chain Monte Carlo for matrix factorisation models. arXiv preprint arXiv:1506.01418, 2015.
- Springenberg et al. (2016) Springenberg, J. T., Klein, A., Falkner, S., and Hutter, F. Asynchronous stochastic gradient MCMC with elastic coupling. arXiv preprint arXiv:1612.00767, 2016.
- Teh et al. (2016) Teh, Y. W., Thiery, A. H., and Vollmer, S. J. Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17:1–33, 2016.
- Tzen et al. (2018) Tzen, B., Liang, T., and Raginsky, M. Local optimality and generalization guarantees for the langevin algorithm via empirical metastability. In Proceedings of the 2018 Conference on Learning Theory, 2018.
- Welling & Teh (2011) Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pp. 681–688, 2011.
- Xu et al. (2017) Xu, P., Chen, J., and Gu, Q. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. arXiv preprint arXiv:1707.06618, 2017.
Ye et al. (2017)
Ye, N., Zhu, Z., and Mantiuk, R.
Langevin dynamics with continuous tempering for training deep neural networks.In Advances in Neural Information Processing Systems, pp. 618–626. 2017.
- Yousefian et al. (2017) Yousefian, F., Nedić, A., and Shanbhag, U. On stochastic and deterministic quasi-Newton methods for non-strongly convex optimization: convergence and rate analysis. arXiv preprint arXiv:1710.05509, 2017.
- Zhang et al. (2015) Zhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, pp. 685–693, 2015.
- Zhang & Sutton (2011) Zhang, Y. and Sutton, C. A. Quasi-Newton methods for Markov Chain Monte Carlo. In Advances in Neural Information Processing Systems, pp. 2393–2401, 2011.
- Zhang et al. (2017) Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient langevin dynamics. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pp. 1980–2022, 2017.
- Zhao et al. (2017) Zhao, R., Haskell, W. B., and Tan, V. Y. F. Stochastic L-BFGS: Improved convergence rates and practical acceleration strategies. IEEE Transactions on Signal Processing, 2017.
- Zhao & Li (2015) Zhao, S.-Y. and Li, W.-J. Fast asynchronous parallel stochastic gradient decent. arXiv preprint arXiv:1508.05711, 2015.
- Zheng et al. (2017) Zheng, S., Meng, Q., Wang, T., Chen, W., Yu, N., Ma, Z., and Liu, T. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 4120–4129, 2017.
- Zhou et al. (2017) Zhou, C., Gao, W., and Goldfarb, D. Stochastic adaptive quasi-Newton methods for minimizing expected values. In International Conference on Machine Learning, pp. 4150–4159, 2017.
1 The Approximate Euler-Maruyama Scheme
1.1 Connection with gradient descent with momentum
The standard Euler-Maruyama scheme for the SDE (8) can be developed as follows:
where is the step-size and is a collection of standard Gaussian random variables.
We can obtain simplified update rules if we define and use it in (S3). The modified update rules are given as follows:
where . If we use the modified Euler scheme as described in neal2010 and replace with in (S1), we obtain the following update equation:
Note that, when we have the following update rules:
which coincides with Gradient descent with momentum when for all .
1.2 Numerical integration with stale stochastic gradients
We now focus on (S1) and (S2). We first drop the term , replace the gradients with the stochastic gradients, and then modify the update rules by using stale parameters and . The resulting scheme is given as follows:
By using a similar argument to the one used in Section 1.1, we define , , , and obtain the following update equations:
2 Proof of Proposition 1
We start by rewriting the SDE given in (8) as follows:
Here, we observe that is positive semi-definite, is anti-symmetric. Furthermore, for all we observe that
The assumptions H 2 and 1 directly imply that the function is locally Lipschitz continuous in for all . Then, the desired result is obtained by applying Theorem 1 of ma2015complete and Proposition 4.2.2 of kunze. ∎
3 Proof of Lemma 1
In the rest of this document, if there is no specification, the notation will denote the expectation taken over all the random sources contained in .
Before providing the proof of Lemma 1, let us consider the following Itô diffusion:
where , , , and is Brownian motion in . The generator for (S18) is formally defined as follows:
where is any twice differentiable function whose support is compact. Here, denotes the inner product between vectors and , by definition is equal to for some matrices and . In our study, the generator for the diffusion (S16) is then defined as follows: (define and use (S19))
We also define the following operator for the approximate Euler-Maruyama scheme with delayed updates:
By using the definitions and , we obtain the following identity:
where . This operator essentially captures all the errors induced by the approximate integrator.
3.2 Proof of Lemma 1
where is a standard Gaussian random variable in , h is the step-size, , , and are defined in (S16). We then modify this scheme such that we replace the gradient with the stale stochastic gradients and we discard the term . The resulting numerical integrator is given as follows:
In the sequel, we follow a similar strategy to chen2016stochastic. However, we have additional difficulties caused by the usage of L-BFGS matrices, which are reflected in the operator . Since we are using the Euler-Maruyama integrator, we have the following inequality chen2015convergence:
By rearranging the terms and dividing all the terms by , we obtain:
By using the Poisson equation given in (12) for each and rearranging the terms, we obtain:
as desired. ∎
Theorem 1 significantly differentiates from other recent results. First of all, none of the references we are aware of provides an analysis for an asynchronous stochastic L-BFGS algorithm. Aside from this fact, when compared to chen2016bridging, our bound handles the case of delayed updates and provides an explicit dependence on . When compared to chen2016stochastic, our analysis considers the tempered case and handles the additional difficulties brought by the L-BFGS matrices and their derivatives. On the other hand, our analysis is also significantly different than the ones presented in raginsky17a and xu2017global, as it considers the asynchrony and L-BFGS matrices, and provides a bound for the ergodic error.
4 Proof of Lemma 2
We use the same proof technique given in raginsky17a[Proposition 11]. We assume that admits a density with respect to the Lebesgue measure, denoted as , where is the normalization constant: . We start by using the definition of , as follows:
where is the differential entropy, defined as follows:
We now aim at upper-bounding and lower-bounding . By Assumption H 6, the distribution
has a finite second order moment, therefore its differential entropy is upper-bounded by the differential entropy of a Gaussian distribution that has the same second order moment. Then, we obtain
where denotes the covariance matrix of the Gaussian distribution. In (S32
) we used the relation between the arithmetic and geometric means, and in (S33) we used Assumption H 6.
We now lower-bound . By definition, we have