Asynchronous Stochastic Proximal Methods for Nonconvex Nonsmooth Optimization

02/24/2018 ∙ by Rui Zhu, et al. ∙ Wuhan University

We study stochastic algorithms for solving non-convex optimization problems with a convex yet possibly non-smooth regularizer, which find wide applications in many practical machine learning problems. However, compared to asynchronous parallel stochastic gradient descent (AsynSGD), an algorithm targeting smooth optimization, the understanding of the behavior of stochastic algorithms for non-smooth regularized optimization problems is limited, especially when the objective function is non-convex. To fill this gap, in this paper, we propose and analyze asynchronous parallel proximal stochastic gradient (Asyn-ProxSGD) methods, including a full update version and a block-wise version, for non-convex problems. We establish an ergodic convergence rate of O(1/√K) for the proposed Asyn-ProxSGD, K being the number of updates made on the model, matching the convergence rate currently known for AsynSGD (for smooth problems). To our knowledge, this is the first work that provides convergence rates of asynchronous parallel proximal stochastic gradient algorithms for non-convex problems. Furthermore, our results are also the first to prove convergence of any stochastic proximal methods without assuming an increasing batch size or the use of additional variance reduction techniques. We implement the proposed algorithms on Parameter Server and demonstrate their convergence behavior and near-linear speedup, as the number of workers increases, for sparse learning problems on real-world datasets.


1 Introduction

With rapidly growing data volumes and variety, the need to scale up machine learning has sparked broad interest in developing efficient parallel optimization algorithms. A typical parallel optimization algorithm usually decomposes the original problem into multiple subproblems, each handled by a worker node. Each worker iteratively downloads the global model parameters and computes its local gradients, which are sent to the master node or servers for model updates. Recently, asynchronous parallel optimization algorithms (Niu et al., 2011; Li et al., 2014b; Lian et al., 2015), exemplified by the Parameter Server architecture (Li et al., 2014a), have been widely deployed in industry to solve practical large-scale machine learning problems. Asynchronous algorithms can largely reduce overhead and speed up training, since each worker may individually perform model updates in the system without synchronization. Another trend in dealing with large volumes of data is the use of stochastic algorithms. As the number of training samples increases, the cost of updating the model taking into account all error gradients becomes prohibitive. To tackle this issue, stochastic algorithms make it possible to update the model using only a small subset of all training samples at a time.

Stochastic gradient descent (SGD) is one of the first algorithms widely implemented in an asynchronous parallel fashion; its convergence rates and speedup properties have been analyzed for both convex (Agarwal and Duchi, 2011; Mania et al., 2017) and nonconvex (Lian et al., 2015) optimization problems. Nevertheless, SGD is mainly applicable to the case of smooth optimization and is not suitable for problems with a nonsmooth term in the objective function, e.g., an $\ell_1$ norm regularizer. In fact, such nonsmooth regularizers are commonplace in many practical machine learning problems or constrained optimization problems. In these cases, SGD becomes ineffective, as it is hard to obtain gradients for a nonsmooth objective function.

We consider the following nonconvex regularized optimization problem:

$$\min_{x \in \mathbb{R}^d} \ \Psi(x) := f(x) + h(x), \qquad (1)$$

where $f$ takes a finite-sum form $f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$, and each $f_i$ is a smooth (but not necessarily convex) function. The second term $h$ is a convex (but not necessarily smooth) function. This type of problem is prevalent in machine learning, as exemplified by deep learning with regularization (Dean et al., 2012; Chen et al., 2015; Zhang et al., 2015), LASSO (Tibshirani et al., 2005), sparse logistic regression (Liu et al., 2009), robust matrix completion (Xu et al., 2010; Sun and Luo, 2015), and sparse support vector machines (SVM) (Friedman et al., 2001). In these problems, $f$ is a loss function of the model parameters $x$, possibly in a nonconvex form (e.g., in neural networks), while $h$ is a convex regularization term, which is, however, possibly nonsmooth, e.g., the $\ell_1$ norm regularizer.

Many classical deterministic (non-stochastic) algorithms are available to solve problem (1), including the proximal gradient (ProxGD) method (Parikh et al., 2014) and its accelerated variants (Li and Lin, 2015) as well as the alternating direction method of multipliers (ADMM) (Hong et al., 2016). These methods leverage the so-called proximal operators (Parikh et al., 2014) to handle the nonsmoothness in the problem. Although implementing these deterministic algorithms in a synchronous parallel fashion is straightforward, extending them to asynchronous parallel algorithms is much more complicated than it appears. In fact, existing theory on the convergence of asynchronous proximal gradient (PG) methods for nonconvex problem (1) is quite limited. An asynchronous parallel proximal gradient method has been presented in (Li et al., 2014b) and has been shown to converge to stationary points for nonconvex problems. However, (Li et al., 2014b) has essentially proposed a non-stochastic algorithm and has not provided its convergence rate.

In this paper, we propose and analyze an asynchronous parallel proximal stochastic gradient descent (ProxSGD) method for solving the nonconvex and nonsmooth problem (1), with provable convergence and speedup guarantees. The analysis of ProxSGD has attracted much attention in the community recently. Under the assumption of an increasing mini-batch size used in the stochastic algorithm, the non-asymptotic convergence of ProxSGD to stationary points has been shown in (Ghadimi et al., 2016) for problem (1) with a convergence rate of $O(1/\sqrt{K})$, $K$ being the number of times the model is updated. Moreover, additional variance reduction techniques have been introduced (Reddi et al., 2016) to guarantee the convergence of ProxSGD, which is different from the stochastic method we discuss here. The stochastic algorithm considered in this paper assumes that each worker selects a mini-batch of randomly chosen training samples to calculate the gradients at a time, which is a scheme widely used in practice. To the best of our knowledge, the convergence behavior of ProxSGD under a constant mini-batch size without variance reduction is still unknown (even for the synchronous or sequential version).

Our main contributions are summarized as follows:

  • We propose asynchronous parallel ProxSGD (a.k.a. Asyn-ProxSGD) and prove that it can converge to stationary points of the nonconvex and nonsmooth problem (1) with an ergodic convergence rate of $O(1/\sqrt{K})$, where $K$ is the number of times that the model is updated. This rate matches the convergence rate known for asynchronous SGD, which, however, is suitable only for smooth problems. To our knowledge, this is the first work that offers convergence rate guarantees for any stochastic proximal method in an asynchronous parallel setting.

  • Our result also suggests that the sequential (or synchronous parallel) ProxSGD can converge to stationary points of problem (1) with a convergence rate of $O(1/\sqrt{K})$. To the best of our knowledge, this is also the first work that provides convergence rates of any stochastic algorithm for the nonsmooth problem (1) under a constant batch size, while prior literature on such stochastic proximal methods assumes an increasing batch size or relies on variance reduction techniques.

  • We provide a linear speedup guarantee as the number of workers increases, provided that the number of workers is bounded by $O(K^{1/4})$. This result lays a theoretical ground for the scalability and performance of our Asyn-ProxSGD algorithm in practice.

2 Preliminaries

In this paper, we use $f(x)$ as the function defined in (1), and $F(x; \xi)$ as a function whose stochastic nature comes from the random variable $\xi$, representing a random index selected from the training set $\{1, \ldots, n\}$. We use $\|x\|$ to denote the $\ell_2$ norm of the vector $x$, and $\langle x, y \rangle$ to denote the inner product of two vectors $x$ and $y$. We use $\nabla f(x)$ to denote the "true" gradient and $\nabla F(x; \xi)$ to denote the stochastic gradient for a function $F(\cdot\,; \xi)$. For a random variable or vector $X$, let $\mathbb{E}[X \mid \mathcal{F}]$ be the conditional expectation of $X$ w.r.t. a sigma algebra $\mathcal{F}$. We denote $\partial h$ as the subdifferential of $h$. A point $x$ is a critical point of $\Psi$ iff $0 \in \nabla f(x) + \partial h(x)$.

2.1 Stochastic Optimization Problems

In this paper, we consider the following stochastic optimization problem instead of the original deterministic version (1):

$$\min_{x \in \mathbb{R}^d} \ \Psi(x) = \mathbb{E}_{\xi}[F(x; \xi)] + h(x), \qquad (2)$$

where the stochastic nature comes from the random variable $\xi$, which in our problem settings represents a random index selected from the training set $\{1, \ldots, n\}$. Therefore, (2) attempts to minimize the expected loss $\mathbb{E}_{\xi}[F(x; \xi)]$ of a random training sample plus a regularizer $h(x)$. In this work, we assume the function $h$ is proper, closed and convex, yet not necessarily smooth.

2.2 Proximal Gradient Descent

The proximal operator is fundamental to many algorithms to solve problem (1) as well as its stochastic variant (2).

Definition 1 (Proximal operator).

The proximal operator $\mathrm{prox}_{\gamma h}(x)$ of a point $x \in \mathbb{R}^d$ under a proper and closed function $h$ with parameter $\gamma > 0$ is defined as:

$$\mathrm{prox}_{\gamma h}(x) := \arg\min_{y \in \mathbb{R}^d} \left\{ h(y) + \frac{1}{2\gamma} \|y - x\|^2 \right\}. \qquad (3)$$

In its vanilla version, proximal gradient descent performs the following iterative updates:

$$x_{k+1} = \mathrm{prox}_{\gamma_k h}\big( x_k - \gamma_k \nabla f(x_k) \big)$$

for $k = 1, 2, \ldots$, where $\gamma_k > 0$ is the step size at iteration $k$.
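As a concrete illustration (not taken from the paper), the following NumPy sketch instantiates the proximal operator for an assumed $\ell_1$ regularizer $h(x) = \lambda \|x\|_1$, for which (3) reduces to soft thresholding, and runs a few proximal gradient steps on a placeholder least-squares loss.

# Minimal sketch of the proximal operator and one ProxGD step (illustrative only).
# Assumptions: h(x) = lam * ||x||_1, whose prox is soft thresholding, and a
# least-squares loss f(x) = 0.5 * ||A x - b||^2 as a placeholder smooth term.
import numpy as np

def prox_l1(x, t):
    """prox_{t*||.||_1}(x): soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proxgd_step(x, A, b, gamma, lam):
    """One (deterministic) proximal gradient step."""
    grad = A.T @ (A @ x - b)                 # full gradient of f
    return prox_l1(x - gamma * grad, gamma * lam)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(100, 10)), rng.normal(size=100)
    x = np.zeros(10)
    for _ in range(200):
        x = proxgd_step(x, A, b, gamma=0.005, lam=0.1)
    print("nonzeros in solution:", int(np.count_nonzero(x)))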

To solve stochastic optimization problem (2), we need a variant called proximal stochastic gradient descent (ProxSGD), with its update rule at each (synchronized) iteration given by

$$x_{k+1} = \mathrm{prox}_{\gamma_k h}\left( x_k - \frac{\gamma_k}{N} \sum_{\xi \in \Xi_k} \nabla F(x_k; \xi) \right), \qquad (4)$$

where $N = |\Xi_k|$ is the mini-batch size. In ProxSGD, the aggregate gradient over all $n$ samples is replaced by the gradients from a random subset of training samples, denoted by $\Xi_k$ at iteration $k$. Since $\xi$ is a random variable indicating a random index in $\{1, \ldots, n\}$, $F(x; \xi)$ is a random loss function for the random sample $\xi$, such that $f(x) = \mathbb{E}_{\xi}[F(x; \xi)]$.
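Under the same illustrative assumptions ($\ell_1$ regularizer, least-squares loss), a sketch of the ProxSGD update (4) with a constant mini-batch size $N$ might look as follows; only the gradient computation changes relative to the deterministic step above.

# Minimal sketch of the ProxSGD update (4) with a constant mini-batch size N.
# Same assumptions as above: l1 regularizer and a placeholder least-squares loss.
import numpy as np

def prox_l1(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proxsgd_step(x, A, b, batch_idx, gamma, lam):
    Ab, bb = A[batch_idx], b[batch_idx]
    grad = Ab.T @ (Ab @ x - bb) / len(batch_idx)   # mini-batch stochastic gradient
    return prox_l1(x - gamma * grad, gamma * lam)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, N = 1000, 10, 32
    A, b = rng.normal(size=(n, d)), rng.normal(size=n)
    x = np.zeros(d)
    for k in range(3000):
        batch = rng.integers(0, n, N)              # sample with replacement
        x = proxsgd_step(x, A, b, batch, gamma=0.01, lam=0.1)
    print("nonzeros:", int(np.count_nonzero(x)))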

2.3 Parallel Stochastic Optimization

Recent years have witnessed the rapid development of parallel and distributed computation frameworks for large-scale machine learning problems. One popular architecture is called the parameter server (Dean et al., 2012; Li et al., 2014a), which consists of worker nodes and server nodes. In this architecture, one or multiple master machines play the role of parameter servers, which maintain the model $x$. Since these machines serve the same purpose, we can simply treat them as one server node for brevity. All other machines are worker nodes that communicate with the server for training machine learning models. In particular, each worker issues two types of requests: pull the current model from the server, and push the computed gradients to the server.

Before proposing an asynchronous ProxSGD algorithm in the next section, let us first introduce its synchronous version with an example. Suppose we execute ProxSGD with a mini-batch of 128 random samples on 8 workers. We can let each worker randomly take 16 samples, compute a summed gradient over these 16 samples, and push it to the server. In the synchronous case, the server will eventually receive 8 summed gradients (containing information of all 128 samples) in each iteration. The server then updates the model by performing the proximal gradient descent step. In general, if we have $p$ workers, each worker will be assigned $N/p$ random samples in an iteration.

Note that in this scenario, all workers contribute to the computation of the sum of gradients over $N$ random samples in parallel, which corresponds to data parallelism in the literature (e.g., (Agarwal and Duchi, 2011; Ho et al., 2013)). Another type of parallelism is called model parallelism, in which each worker uses all random samples in the batch to compute a partial gradient on a specific block of $x$ (e.g., (Niu et al., 2011; Pan et al., 2016)). Typically, data parallelism is more suitable when $n \gg d$, i.e., a large dataset with a moderate model size, while model parallelism is more suitable when $d \gg n$. We focus on data parallelism, and the synchronous scheme just described is sketched below.
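The following toy simulation mirrors the synchronous data-parallel example above: a mini-batch of 128 samples is split across 8 workers, each worker pushes a summed gradient over its 16 samples, and the server averages them and applies the proximal update. The least-squares loss and $\ell_1$ regularizer are placeholder assumptions, not the paper's setting.

# Toy simulation of one synchronous data-parallel ProxSGD iteration.
import numpy as np

def prox_l1(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sync_parallel_proxsgd_step(x, A, b, gamma, lam, rng,
                               num_workers=8, batch_size=128):
    batch = rng.integers(0, A.shape[0], batch_size)
    shards = np.array_split(batch, num_workers)      # 16 samples per worker
    pushed = []
    for shard in shards:                             # each "worker" in turn
        As, bs = A[shard], b[shard]
        pushed.append(As.T @ (As @ x - bs))          # summed (not averaged) gradient
    g = sum(pushed) / batch_size                     # server-side average over 128 samples
    return prox_l1(x - gamma * g, gamma * lam)       # proximal model update

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(2000, 30)), rng.normal(size=2000)
    x = np.zeros(30)
    for _ in range(300):
        x = sync_parallel_proxsgd_step(x, A, b, gamma=0.02, lam=0.05, rng=rng)
    print("nonzeros:", int(np.count_nonzero(x)))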

Server executes:

1:Initialize $x$.
2:Initialize $G \leftarrow 0$. Gradient accumulator
3:Initialize $s \leftarrow 0$. Request counter
4:loop
5:     if Pull Request from worker $w$ is received: then
6:         Send $x$ to worker $w$.
7:     end if
8:     if Push Request (gradient $G_w$) from worker $w$ is received: then
9:         $G \leftarrow G + G_w$.
10:         $s \leftarrow s + 1$.
11:         if  $s = p$ then
12:              $x \leftarrow \mathrm{prox}_{\gamma h}\left(x - \gamma G / N\right)$.
13:              $G \leftarrow 0$.
14:              $s \leftarrow 0$.
15:         end if
16:     end if
17:end loop

Worker asynchronously performs:

1:Pull $x$ from the server to initialize.
2:for  $t = 1, 2, \ldots$ do
3:     Randomly choose $N/p$ training samples indexed by $\Xi_{t,w}$.
4:     Calculate $G_w \leftarrow \sum_{\xi \in \Xi_{t,w}} \nabla F(x; \xi)$.
5:     Push $G_w$ to the server.
6:     Pull the current model $x$ from the server.
7:end for
Algorithm 1 Asyn-ProxSGD: Asynchronous Proximal Stochastic Gradient Descent

3 Asynchronous Proximal Gradient Descent

We now present our asynchronous proximal gradient descent (Asyn-ProxSGD) algorithm, which is the main contribution in this paper. In the asynchronous algorithm, different workers may be in different local iterations due to random delays in computation and communication.

For ease of presentation, let us first assume each worker uses only one random sample at a time to compute its stochastic gradient, which naturally generalizes to using a mini-batch of random samples to compute a stochastic gradient. In this case, each worker will independently and asynchronously repeat the following steps:

  • Pull the latest model from the server;

  • Calculate a gradient based on a random sample locally;

  • Push the gradient to the server.

Here we use the notation $x_{k-\tau}$ to emphasize that the gradients computed by workers may be delayed. For example, suppose all workers but worker $w$ have completed their tasks for iteration $k$, while worker $w$ is still working on an earlier iteration. In this case, the gradient pushed by worker $w$ is not computed based on the current model $x_k$ but on a delayed one $x_{k-\tau}$.

In our algorithm, the server performs an averaging over the received sample gradients as soon as $N$ of them have been received, and then performs a proximal gradient descent update on the model $x$, no matter which workers these gradients come from. This means that it is possible for the server to have received multiple gradients from one worker while not receiving any from another worker.

In general, when each mini-batch has $N$ samples and each worker processes $N/p$ random samples to calculate a stochastic gradient to be pushed to the server, the proposed Asyn-ProxSGD algorithm is described in Algorithm 1, leveraging a parameter server architecture. The server maintains a counter $s$. Once $s$ reaches $p$, the server has received gradients that contain information about $N$ random samples (no matter where they come from) and will perform a proximal model update.
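To make the protocol concrete, below is a small thread-based simulation of the server and worker behavior described in this section. It is only a sketch under assumed ingredients (a least-squares loss, an $\ell_1$ regularizer, and arbitrary hyper-parameters); in a real deployment the server and workers would run on separate machines and communicate through pull/push requests rather than shared memory.

# Thread-based sketch of Asyn-ProxSGD (illustrative only, not the paper's code).
import threading
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t * ||x||_1 (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

class Server:
    """Holds the model x, a gradient accumulator G, and a push counter s."""
    def __init__(self, dim, gamma, lam, p, batch_size):
        self.x = np.zeros(dim)
        self.G = np.zeros(dim)
        self.s = 0
        self.gamma, self.lam, self.p, self.N = gamma, lam, p, batch_size
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.x.copy()

    def push(self, grad_sum):
        with self.lock:
            self.G += grad_sum
            self.s += 1
            if self.s == self.p:                      # p pushes cover N samples
                self.x = prox_l1(self.x - self.gamma * self.G / self.N,
                                 self.gamma * self.lam)
                self.G[:] = 0.0
                self.s = 0

def worker(server, A, b, samples_per_push, steps, seed):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x = server.pull()                             # pull the current model
        idx = rng.integers(0, A.shape[0], samples_per_push)
        grad_sum = A[idx].T @ (A[idx] @ x - b[idx])   # summed, not averaged
        server.push(grad_sum)                         # push to the server

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, p, N = 1000, 20, 4, 64
    A, b = rng.normal(size=(n, d)), rng.normal(size=n)
    server = Server(dim=d, gamma=0.01, lam=0.01, p=p, batch_size=N)
    threads = [threading.Thread(target=worker, args=(server, A, b, N // p, 500, i))
               for i in range(p)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    print("model (first 5 coordinates):", server.x[:5])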

4 Convergence Analysis

To facilitate the analysis of Algorithm 1, we rewrite it in an equivalent global view (from the server's perspective), as described in Algorithm 2. In this algorithm, we use an iteration counter $k$ to keep track of how many times the model has been updated on the server; $k$ is incremented every time a push request (model update request) is completed. Note that such a counter is not required by workers to compute gradients, and it is different from the counter $t$ in Algorithm 1, which is maintained by each worker to count how many sample gradients have been computed locally.

In particular, for every $N$ stochastic sample gradients received, the server simply aggregates them by averaging:

$$\bar{G}_k = \frac{1}{N} \sum_{i=1}^{N} \nabla F\big(x_{k-\tau_{k,i}}; \xi_{k,i}\big), \qquad (5)$$

where the delay $\tau_{k,i}$ indicates that the $i$-th stochastic gradient received at iteration $k$ could have been computed based on an older model $x_{k-\tau_{k,i}}$ due to communication delay and asynchrony among workers. Then, the server updates $x_k$ to $x_{k+1}$ using proximal gradient descent.

1:Initialize $x_1$.
2:for  $k = 1, 2, \ldots, K$ do
3:     Randomly select $N$ training samples indexed by $\xi_{k,1}, \ldots, \xi_{k,N}$.
4:     Calculate the averaged delayed gradient $\bar{G}_k$ according to (5).
5:     $x_{k+1} \leftarrow \mathrm{prox}_{\gamma_k h}\left(x_k - \gamma_k \bar{G}_k\right)$.
6:end for
Algorithm 2 Asyn-ProxSGD (from a Global Perspective)
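For intuition about this global view, the following sketch simulates Algorithm 2 directly: each of the $N$ stochastic gradients at iteration $k$ is evaluated at a stale iterate whose delay is drawn at random up to a bound. The least-squares loss, $\ell_1$ regularizer, hyper-parameters, and uniform delay distribution are assumptions made purely for illustration.

# Sketch of Algorithm 2 with randomly delayed (stale) gradients.
import numpy as np

def prox_l1(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def asyn_proxsgd_global_view(A, b, K=500, N=32, gamma=0.01, lam=0.01,
                             max_delay=4, seed=0):
    """Simulate Algorithm 2: each sample gradient may use a stale iterate."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    history = [np.zeros(d)]                          # x_1, x_2, ... kept for stale reads
    for k in range(K):
        g_bar = np.zeros(d)
        for _ in range(N):
            delay = rng.integers(0, min(max_delay, k) + 1)   # bounded delay (Assumption 4)
            x_stale = history[k - delay]
            i = rng.integers(0, n)                           # independent sample (Assumption 5)
            g_bar += A[i] * (A[i] @ x_stale - b[i])
        g_bar /= N                                   # averaged delayed gradient, cf. (5)
        history.append(prox_l1(history[k] - gamma * g_bar, gamma * lam))
    return history[-1]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A, b = rng.normal(size=(500, 20)), rng.normal(size=500)
    x = asyn_proxsgd_global_view(A, b)
    print("nonzeros in final iterate:", int(np.count_nonzero(x)))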

4.1 Assumptions and Metrics

We make the following assumptions for convergence analysis. We assume that $f$ is a smooth function with the following property:

Assumption 1 (Lipschitz Gradient).

For the function $f$, there is a Lipschitz constant $L > 0$ such that

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^d. \qquad (6)$$

As discussed above, we assume that $h$ is a proper, closed and convex function, which is yet not necessarily smooth. If the algorithm has been executed for $k$ iterations, we let $\mathcal{F}_k$ denote the sigma algebra generated by all the samples used up to iteration $k$. Since $\mathcal{F}_k \subseteq \mathcal{F}_{k+1}$ for all $k$, the collection of all such $\mathcal{F}_k$ forms a filtration. Under such settings, we can restrict our attention to stochastic gradients that are unbiased estimates of the true gradient and have bounded variance, which are common assumptions in the analysis of stochastic gradient descent and stochastic proximal gradient algorithms, e.g., (Lian et al., 2015; Ghadimi et al., 2016).

Assumption 2 (Unbiased gradient).

For any $x$ and any sample index $\xi$ drawn uniformly at random from the training set, we have $\mathbb{E}_{\xi}[\nabla F(x; \xi)] = \nabla f(x)$.

Assumption 3 (Bounded variance).

The variance of the stochastic gradient is bounded by a constant $\sigma^2$, i.e., $\mathbb{E}_{\xi}\big[\|\nabla F(x; \xi) - \nabla f(x)\|^2\big] \le \sigma^2$ for all $x$.

We make the following assumptions on the delay and independence:

Assumption 4 (Bounded delay).

All delay variables $\tau_{k,i}$ are bounded by $T$: $\max_{k,i} \tau_{k,i} \le T$.

Assumption 5 (Independence).

All random variables $\xi_{k,i}$ for all $k$ and $i$ in Algorithm 2 are mutually independent.

The assumption of bounded delay is to guarantee that gradients from workers are not too old. Note that the maximum delay $T$ is roughly proportional to the number of workers in practice. This is also known as stale synchronous parallel (Ho et al., 2013) in the literature. The independence assumption can be met by selecting samples with replacement, which can be implemented using distributed file systems like HDFS (Borthakur et al., 2008). These two assumptions are common in convergence analysis for asynchronous parallel algorithms, e.g., (Lian et al., 2015; Davis et al., 2016).
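Convergence for this nonconvex, nonsmooth problem class is commonly measured through the proximal gradient mapping (as in Ghadimi et al., 2016); a standard definition consistent with our notation is

$$P(x, g, \gamma) := \frac{1}{\gamma}\Big( x - \mathrm{prox}_{\gamma h}\big(x - \gamma g\big) \Big),$$

which satisfies $P(x, \nabla f(x), \gamma) = 0$ if and only if $x$ is a critical point of $\Psi$. Ergodic bounds such as (8) are then typically stated as bounds on averages of $\mathbb{E}\big[\|P(x_k, \nabla f(x_k), \gamma_k)\|^2\big]$ over $k = 1, \ldots, K$.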

4.2 Theoretical Results

We present our main convergence theorem as follows:

Theorem 1.

If Assumptions 4 and 5 hold and the step length sequence $\{\gamma_k\}$ in Algorithm 2 satisfies

(7)

for all $k$, we have the following ergodic convergence rate for Algorithm 2:

(8)

where the expectation is taken with respect to all random variables in Algorithm 2.

Taking a closer look at Theorem 1, we can properly choose the learning rate as a constant value and derive the following convergence rate:

Corollary 1.

Let the step length be a constant, i.e.,

(9)

If the delay bound $T$ satisfies

(10)

then the output of Algorithm 1 satisfies the following ergodic convergence rate:

(11)
Remark 1 (Consistency with ProxSGD).

When $T = 0$, our proposed Asyn-ProxSGD reduces to the vanilla ProxSGD (e.g., (Ghadimi et al., 2016)). Thus, the iteration complexity is $O(1/\epsilon^2)$ according to (11), attaining the same result as that in (Ghadimi et al., 2016), yet without assuming increasing mini-batch sizes.

Remark 2 (Linear speedup w.r.t. the staleness).

From (11) we can see that linear speedup is achievable as long as the delay bound $T$ is $O(K^{1/4})$ (if other parameters are treated as constants). The reason is that, by (10) and (11), as long as $T$ is no more than $O(K^{1/4})$, the iteration complexity (from a global perspective) to achieve $\epsilon$-optimality is $O(1/\epsilon^2)$, which is independent of $T$.
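As a quick check of the arithmetic behind this remark, suppose (consistent with the $O(1/\sqrt{K})$ rate stated earlier) that the right-hand side of (11) takes the form $C/\sqrt{K}$ for a constant $C$ that does not depend on $T$ once (10) holds. Then reaching $\epsilon$-optimality requires

$$\frac{C}{\sqrt{K}} \le \epsilon \quad\Longleftrightarrow\quad K \ge \frac{C^2}{\epsilon^2} = O\!\left(\frac{1}{\epsilon^2}\right),$$

which involves neither $T$ nor the number of workers.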

Remark 3 (Linear speedup w.r.t. number of workers).

As the iteration complexity to achieve $\epsilon$-optimality is $O(1/\epsilon^2)$, it is also independent of the number of workers $p$, assuming other parameters are constants. It is worth noting that the delay bound $T$ is roughly proportional to the number of workers. As the iteration complexity is independent of $p$, we can conclude that the total number of iterations performed by each worker will be shortened to $1/p$ of a single worker's iterations if $p$ workers work in parallel, achieving nearly linear speedup.

Remark 4 (Comparison with Asyn-SGD).

Compared with asynchronous SGD (Lian et al., 2015), in which the delay (or the number of workers) should be bounded by $O(\sqrt{K})$ to achieve linear speedup, Asyn-ProxSGD is more sensitive to delays and is thus more suitable for a smaller cluster.

5 Experiments

(a) a9a (b) mnist (c) a9a (d) mnist
Figure 1: Performance of ProxGD and Asyn-ProxSGD on the a9a (left) and mnist (right) datasets. The x-axis represents the number of sample gradients computed (divided by the total number of samples $n$), and the y-axis is the function suboptimality $\Psi(x) - \Psi(\hat{x})$, where $\hat{x}$ is obtained by running gradient descent for many iterations with multiple restarts. Note that all values on the y-axis are normalized.

We now present experimental results to confirm the capability and efficiency of our proposed algorithm in solving challenging non-convex non-smooth machine learning problems. We implemented our algorithm on TensorFlow (Abadi et al., 2016), a flexible and efficient deep learning library, and executed it on Ray (Moritz et al., 2017), a general-purpose framework that enables parallel and distributed execution of Python as well as TensorFlow functions. A key feature of Ray is that it provides both a task-parallel abstraction, whose tasks can serve as workers, and an actor abstraction, which stores state and can act as a parameter server.

We use a cluster of 9 instances on Google Cloud. Each instance has one CPU core with 3.75 GB RAM, running 64-bit Ubuntu 16.04 LTS. Each server or worker uses only one core, with 9 CPU cores and 60 GB RAM used in total. Only one instance is the server node, while the other nodes are workers.

Setup: In our experiments, we consider the problem of non-negative principal component analysis (NN-PCA) (Reddi et al., 2016). Given a set of $n$ samples $\{z_i\}_{i=1}^{n}$, NN-PCA solves the following optimization problem:

$$\min_{\|x\| \le 1,\ x \ge 0} \ -\frac{1}{2} x^\top \left( \sum_{i=1}^{n} z_i z_i^\top \right) x. \qquad (12)$$

This NN-PCA problem is NP-hard in general. To apply our algorithm, we can rewrite it as a finite sum with $f_i(x) = -\frac{1}{2}(x^\top z_i)^2$ for all samples $z_i$. Since the feasible set $C = \{x : \|x\| \le 1, x \ge 0\}$ is convex, we can replace the optimization constraint by a regularizer in the form of an indicator function $h(x) = I_C(x)$, such that $I_C(x) = 0$ if $x \in C$ and $I_C(x) = +\infty$ otherwise.
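For this particular $h$, the proximal operator in (4) is simply the Euclidean projection onto $C$. Since $C$ is the intersection of a closed convex cone (the nonnegative orthant) and the unit ball centered at the origin, projecting onto the orthant first and then rescaling into the ball gives the exact projection. The sketch below (with placeholder data, batch size, and step size, not the paper's configuration) illustrates one ProxSGD step for this problem.

# Sketch of ProxSGD for NN-PCA, where prox of the indicator I_C is the projection onto C.
import numpy as np

def project_nn_ball(x):
    """Projection onto {x >= 0, ||x||_2 <= 1}, i.e., prox of the indicator I_C."""
    x = np.maximum(x, 0.0)                       # project onto the nonnegative orthant
    norm = np.linalg.norm(x)
    return x / norm if norm > 1.0 else x         # then onto the unit l2 ball

def nn_pca_proxsgd_step(x, Z, batch_idx, gamma):
    """One ProxSGD step for f_i(x) = -(1/2)(z_i^T x)^2 with h = I_C."""
    Zb = Z[batch_idx]                            # mini-batch of samples
    grad = -(Zb.T @ (Zb @ x)) / len(batch_idx)   # averaged stochastic gradient
    return project_nn_ball(x - gamma * grad)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(1000, 50))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # normalize samples as in Section 5
    x = project_nn_ball(rng.normal(size=50))
    for k in range(2000):
        batch = rng.integers(0, Z.shape[0], 8)
        x = nn_pca_proxsgd_step(x, Z, batch, gamma=0.05)
    print("averaged objective:", -0.5 * np.sum((Z @ x) ** 2) / Z.shape[0])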

Dataset    Dimension ($d$)    Sample size ($n$)
a9a        123                32,561
mnist      780                60,000
Table 1: Description of the two classification datasets used.

The hyper-parameters are set as follows. The step size is set using the popular $t$-inverse step size schedule, the same as the one used in (Reddi et al., 2016); its parameters determine the initial learning rate, how much the learning rate decays, and after how many steps the learning rate changes.

We conduct experiments on two datasets (available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/), with their information summarized in Table 1. All samples have been normalized, i.e., $\|z_i\| = 1$ for all $i$. In our experiments, we use a constant batch size in order to evaluate the performance and speedup behavior of the algorithm under constant batches.

We consider the function suboptimality value as our performance metric. In particular, we run proximal gradient descent (ProxGD) for a large number of iterations with multiple random initializations and obtain a solution $\hat{x}$. For all experiments, we evaluate the function suboptimality, i.e., the gap $\Psi(x) - \Psi(\hat{x})$, against the number of sample gradients processed by the server (divided by the total number of samples $n$), and then against the running time.

(a) a9a (b) mnist (c) a9a (d) mnist
Figure 2: Performance of ProxGD and Asyn-ProxSGD on the a9a (left) and mnist (right) datasets. The x-axis represents the actual running time, and the y-axis is the function suboptimality. Note that all values on the y-axis are normalized.

Results: Empirically, Assumption 4 (bounded delays) is observed to hold on this cluster. For our proposed Asyn-ProxSGD algorithm, we are particularly interested in the speedup in terms of iterations and running time. In particular, if we need $K_1$ iterations (i.e., sample gradients processed by the server) to achieve a certain suboptimality level using one worker, and $K_p$ iterations to achieve the same suboptimality with $p$ workers, the iteration speedup is defined as $p \cdot K_1 / K_p$ (Lian et al., 2015). Note that all iterations are counted on the server side, i.e., by how many sample gradients are processed by the server. On the other hand, the running time speedup is defined as the ratio between the running time of using one worker and that of using $p$ workers to achieve the same suboptimality.

The iteration and running time speedups on both datasets are shown in Fig. 1 and Fig. 2, respectively. The speedups achieved at a fixed suboptimality level are presented in Tables 2 and 3. We observe that nearly linear speedup can be achieved, although there is some loss of efficiency due to communication as the number of workers increases.

Workers 1 2 4 8
Iteration Speedup 1.000 1.982 3.584 5.973
Time Speedup 1.000 2.219 3.857 5.876
Table 2: Iteration speedup and time speedup of Asyn-ProxSGD at a fixed suboptimality level (a9a dataset).
Workers 1 2 4 8
Iteration Speedup 1.000 2.031 3.783 7.352
Time Speedup 1.000 2.285 4.103 5.714
Table 3: Iteration speedup and time speedup of Asyn-ProxSGD at a fixed suboptimality level (mnist dataset).

6 Related Work

Stochastic optimization problems have been studied since the seminal work in 1951 (Robbins and Monro, 1951), in which a classical stochastic approximation algorithm was proposed for solving a class of strongly convex problems. Since then, a series of studies on stochastic programming has focused on convex problems using SGD (Bottou, 1991; Nemirovskii and Yudin, 1983; Moulines and Bach, 2011). The convergence rates of SGD for convex and strongly convex problems are known to be $O(1/\sqrt{K})$ and $O(1/K)$, respectively. For nonconvex optimization problems using SGD, Ghadimi and Lan (Ghadimi and Lan, 2013) proved an ergodic convergence rate of $O(1/\sqrt{K})$, which is consistent with the convergence rate of SGD for convex problems.

When $h$ in (1) is not necessarily smooth, there are other methods to handle the nonsmoothness. One approach is closely related to mirror descent stochastic approximation, e.g., (Nemirovski et al., 2009; Lan, 2012). Another approach is based on proximal operators (Parikh et al., 2014), and is often referred to as the proximal stochastic gradient descent (ProxSGD) method. Duchi and Singer (Duchi and Singer, 2009) prove convergence rates for ProxSGD under a diminishing learning rate for strongly convex objective functions. For a nonconvex problem like (1), rather limited studies on ProxSGD exist so far. The approach closest to the one we consider here is (Ghadimi et al., 2016), in which the convergence analysis is based on the assumption of an increasing mini-batch size. Furthermore, Reddi et al. (Reddi et al., 2016) prove convergence for nonconvex problems under a constant mini-batch size, yet relying on additional mechanisms for variance reduction. We fill the gap in the literature by providing convergence rates for ProxSGD under constant batch sizes without variance reduction.

To deal with big data, asynchronous parallel optimization algorithms have been heavily studied. Recent work on asynchronous parallelism is mainly limited to the following categories: stochastic gradient descent for smooth optimization, e.g., (Niu et al., 2011; Agarwal and Duchi, 2011; Lian et al., 2015; Pan et al., 2016; Mania et al., 2017), and deterministic ADMM, e.g., (Zhang and Kwok, 2014; Hong, 2017). A non-stochastic asynchronous proximal gradient algorithm is presented in (Li et al., 2014b), which, however, does not provide convergence rates for nonconvex problems.

7 Concluding Remarks

In this paper, we study asynchronous parallel implementations of stochastic proximal gradient methods for solving nonconvex optimization problems with a convex yet possibly nonsmooth regularizer. Compared to asynchronous parallel stochastic gradient descent (Asyn-SGD), which targets smooth optimization, the understanding of the convergence and speedup behavior of stochastic algorithms for nonsmooth regularized optimization problems is quite limited, especially when the objective function is nonconvex. To fill this gap, we propose an asynchronous proximal stochastic gradient descent (Asyn-ProxSGD) algorithm and provide convergence rates for nonconvex problems. Our theoretical analysis suggests that the same order of convergence rate can be achieved by asynchronous ProxSGD for nonsmooth problems as by asynchronous SGD for smooth problems, under constant mini-batch sizes and without additional variance reduction assumptions. Moreover, a linear speedup is proven to be achievable for asynchronous ProxSGD when the number of workers is bounded by $O(K^{1/4})$.

References

  • Abadi et al. [2016] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283. USENIX Association, 2016.
  • Agarwal and Duchi [2011] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
  • Borthakur et al. [2008] D. Borthakur et al. Hdfs architecture guide. Hadoop Apache Project, 53, 2008.
  • Bottou [1991] L. Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nımes, 91(8), 1991.
  • Chen et al. [2015] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
  • Davis et al. [2016] D. Davis, B. Edmunds, and M. Udell. The sound of apalm clapping: Faster nonsmooth nonconvex optimization with stochastic asynchronous palm. In Advances in Neural Information Processing Systems, pages 226–234, 2016.
  • Dean et al. [2012] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
  • Duchi and Singer [2009] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899–2934, 2009.
  • Friedman et al. [2001] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics Springer, Berlin, 2001.
  • Ghadimi and Lan [2013] S. Ghadimi and G. Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • Ghadimi et al. [2016] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
  • Ho et al. [2013] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems, pages 1223–1231, 2013.
  • Hong [2017] M. Hong. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An admm approach. IEEE Transactions on Control of Network Systems, PP(99):1–1, 2017. ISSN 2325-5870. doi: 10.1109/TCNS.2017.2657460.
  • Hong et al. [2016] M. Hong, Z.-Q. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
  • Lan [2012] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.
  • Li and Lin [2015] H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in neural information processing systems, pages 379–387, 2015.
  • Li et al. [2014a] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, 2014a.
  • Li et al. [2014b] M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014b.
  • Lian et al. [2015] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
  • Liu et al. [2009] J. Liu, J. Chen, and J. Ye. Large-scale sparse logistic regression. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 547–556. ACM, 2009.
  • Mania et al. [2017] H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202–2229, jan 2017. doi: 10.1137/16m1057000. URL https://doi.org/10.1137%2F16m1057000.
  • Moritz et al. [2017] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, W. Paul, M. I. Jordan, and I. Stoica. Ray: A distributed framework for emerging ai applications. arXiv preprint arXiv:1712.05889, 2017.
  • Moulines and Bach [2011] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
  • Nemirovski et al. [2009] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574–1609, 2009.
  • Nemirovskii and Yudin [1983] A. Nemirovskii and D. B. Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.
  • Niu et al. [2011] F. Niu, B. Recht, C. Re, and S. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
  • Pan et al. [2016] X. Pan, M. Lam, S. Tu, D. Papailiopoulos, C. Zhang, M. I. Jordan, K. Ramchandran, and C. Ré. Cyclades: Conflict-free asynchronous machine learning. In Advances in Neural Information Processing Systems, pages 2568–2576, 2016.
  • Parikh et al. [2014] N. Parikh, S. Boyd, et al. Proximal algorithms. Foundations and Trends® in Optimization, 1(3):127–239, 2014.
  • Reddi et al. [2016] S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pages 1145–1153, 2016.
  • Robbins and Monro [1951] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • Sun and Luo [2015] R. Sun and Z.-Q. Luo. Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 270–289. IEEE, 2015.
  • Tibshirani et al. [2005] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
  • Xu et al. [2010] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems, pages 2496–2504, 2010.
  • Zhang and Kwok [2014] R. Zhang and J. Kwok. Asynchronous distributed admm for consensus optimization. In International Conference on Machine Learning, pages 1701–1709, 2014.
  • Zhang et al. [2015] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, pages 685–693, 2015.

Appendix A Auxiliary Lemmas

Lemma 1 ([Ghadimi et al., 2016]).

For all , we have:

(13)

Due to slightly different notations and definitions in [Ghadimi et al., 2016], we provide a proof here for completeness. We refer readers to [Ghadimi et al., 2016] for more details.

Proof.

By the optimality condition in the definition of the proximal operator, there exists a subgradient such that:

which proves the lemma.    

Lemma 2 ([Ghadimi et al., 2016]).

For all $x$ and $y$, if $h$ is a convex function, we have

(14)
Proof.

Let denote and denote . By definition of the proximal operator, for all , we have

where and . Let substitute in the first inequality and in the second one, we have

Then, we have

(15)
(16)
(17)

and

(18)
(19)
(20)

By adding (17) and (20), we obtain

which proves the lemma.    

Lemma 3 ([Ghadimi et al., 2016]).

For any and , we have

(21)
Proof.

It can be obtained by directly applying Lemma 2 and the definition of gradient mapping.    

Lemma 4 ([Reddi et al., 2016]).

Suppose we define for some . Then for , the following inequality holds:

(22)

for all .

We recall and define some notation for the subsequent convergence analysis. We denote by $\bar{G}_k$ the average of the delayed stochastic gradients and by $\bar{g}_k$ the average of the corresponding delayed true gradients, respectively:

$$\bar{G}_k = \frac{1}{N} \sum_{i=1}^{N} \nabla F\big(x_{k-\tau_{k,i}}; \xi_{k,i}\big), \qquad \bar{g}_k = \frac{1}{N} \sum_{i=1}^{N} \nabla f\big(x_{k-\tau_{k,i}}\big).$$

Moreover, we denote by $\delta_k := \bar{G}_k - \bar{g}_k$ the difference between these two averages.

Appendix B Convergence Analysis for Asyn-ProxSGD

B.1 Milestone Lemmas

We present some key results of the convergence analysis as milestone lemmas below; the detailed proofs are given in B.4.

Lemma 5 (Descent Lemma).
(23)
Lemma 6.

Suppose we have a sequence $\{x_k\}$ generated by Algorithm 2; then we have:

(24)

for all $k$.

Lemma 7.

Suppose we have a sequence $\{x_k\}$ generated by Algorithm 2; then we have:

(25)

B.2 Proof of Theorem 1

Proof.

From the fact , we have

which implies that

We start the proof from Lemma 5. According to our condition of , we have and therefore

Applying Lemma 7, we have

By taking the telescoping sum, we have

where , and we have

When for all as the condition of Theorem 1, we have

which proves the theorem.    

B.3 Proof of Corollary 1

Proof.

From the condition of Corollary 1, we have

It is clear that the above inequality also satisfies the condition in Theorem 1. By doing so, we can have Furthermore, we have

Since , we have and thus

Following Theorem 1 and the above inequality, we have