1 Introduction
Many problems in machine learning and network optimization can be formulated as
(1.1) 
where , , is differentiable, is Lipschitz continuous with , for , and is proximable. A state-of-the-art method for this problem is the proximal gradient method [10], which requires computing the full gradient of in each iteration. However, when the number of component functions is very large, i.e. , obtaining the full gradient is costly; moreover, in some network settings, calculating the full gradient is not even allowed. Incremental gradient algorithms were therefore developed to avoid computing the full gradient.
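As a point of reference, the full-gradient baseline mentioned above can be sketched as follows. This is a minimal illustration, not the paper's formulation: the smooth part is assumed to be a least-squares loss, the proximable part is assumed to be an l1 penalty, and the names `soft_threshold`, `proximal_gradient`, and `objective` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal map of t * ||.||_1 (componentwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, lam, num_iters=1000):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 (assumed test problem)."""
    x = np.zeros(A.shape[1])
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the smooth gradient
    step = 1.0 / L
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)       # full gradient: touches every row of A
        x = soft_threshold(x - step * grad, step * lam)
    return x

def objective(A, b, lam, x):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

# Small synthetic instance (hypothetical data).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
x_hat = proximal_gradient(A, b, lam=0.5)
```

Every iteration evaluates the gradient over all rows of `A`, which is exactly the cost that incremental methods try to avoid.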
The main idea of incremental gradient descent lies in computing the gradients of only some of the components of to refresh the full gradient. Precisely, in each iteration, it selects an index set from , and then computes to update the full gradient. This requires much less computation than gradient descent without losing too much accuracy relative to the true gradient. It is natural to consider two index selection strategies: deterministic and stochastic. In fact, every incremental gradient algorithm for solving problem (1.1) falls into one of these two classes.
1.1 The general PIAG algorithm
Let denote the th iterate. We first define a σ-algebra . Consider a general proximal incremental aggregated gradient algorithm which proceeds as
(1.2) 
where is the delay associated with and is the noise in the th iteration. The first equation in (1.2) indicates that, in expectation, is an approximation of the full gradient with delays and noise. For simplicity, we call this algorithm the general PIAG algorithm.
1.2 Literature review
As mentioned above, the literature can be divided into two classes according to the index selection strategy.
On the deterministic road: Bertsekas proposed the Incremental Gradient (IG) method for problem (1.1) when [4]. To obtain convergence, the IG method requires diminishing stepsizes, even for smooth and strongly convex functions [5]. A special condition was proposed in [26] to relax this requirement. A second-order IG method was also developed in [13]. An improved version of IG is the Incremental Aggregated Gradient (IAG) method [6, 29]. When is a quadratic function, convergence was established in [6] through a perturbation analysis of the eigenvalues of a periodic linear dynamical system. Global convergence is proved in [29]; if the local Lipschitzian error condition and local strong convexity are satisfied, local linear convergence can also be proved. Lower complexity bounds for IAG on problem (1.1) with were established in [1]. Linear convergence rates are proved under the strong convexity assumption in [12, 30].
On the stochastic road: The pioneer of this class is the Stochastic Gradient Descent (SGD) method [20], which picks from in each iteration with uniform probability and uses to replace . However, SGD requires diminishing stepsizes, which makes its performance poor in both theory and practice. Due to the large deviation of from , "variance reduction" schemes were proposed later, such as the SVRG method [14], the SAG method [25], and the SAGA method [11]. With suitably chosen constant stepsizes, linear convergence has been proved in the strongly convex case, and ergodic sublinear convergence has been proved in the non-strongly convex case.
1.3 Relations with existing algorithms
In this part, we will present several popular existing algorithms which can be covered by the general PIAG.
E.1. (Inexact) Proximal Gradient Descent Algorithm: When and the expectation vanishes, the general PIAG is equivalent to
E.2. (Inexact) Proximal Incremental Aggregated Gradient Algorithm: In the th iteration, pick essentially cyclically and then update as where
(1.3) 
In each iteration, one only needs to compute , while the term is kept in shared memory.
E.3. Deterministic SAG (SAGA): Let be defined as (1.3), pick essentially cyclically, then update as
E.4. Deterministic SVRG: Pick cyclically, i.e., , let , and take to be any element of . Then update as
E.5. Decentralized Parallel Stochastic Gradient Descent (DPSGD): The algorithm was proposed in [17] to solve where , is a mixing matrix, and is the neighbour set of . In each iteration, DPSGD computes a stochastic gradient of at node , and then computes the neighborhood weighted average to update the local variable.
E.6. Forward-backward splitting by Parameter Server Computing: The forward-backward splitting for problem (1.1) can be implemented on a parameter server computing model, in which the worker nodes communicate synchronously with the parameter server but not directly with each other. Node just computes and sends the result to the parameter server, where comes from the shared memory with bounded delay . In the parameter server, the gradients are collected and the iterate is updated (by computing the proximal map). The algorithm is a special case of the general PIAG. More details about parameter server computing can be found in [24].
E.7. LAG (Lazily Aggregated Gradient): This algorithm is also designed for parameter server computing. Unlike E.6, its main motivation is to reduce communication. In this setting, broadcasting the current iterate from the parameter server to the workers is cheap, while transmitting data from a worker to the parameter server is expensive. The authors of [9] therefore propose LAG, whose core idea is to skip the upload from a worker to the parameter server if its gradient changes only slightly. It is not difficult to verify that the general PIAG contains LAG.
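To make the aggregated-gradient idea behind item E.2 concrete, the following is a minimal sketch of a noise-free cyclic PIAG loop with a gradient memory table. It is an illustration under stated assumptions, not the update (1.2) itself: the components are assumed to be squared residuals of single rows, the regularizer is assumed to be an l1 penalty, the stepsize rule is a conservative hypothetical choice, and `cyclic_piag` is a hypothetical name.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def cyclic_piag(A, b, lam, step, num_iters):
    """Cyclic proximal incremental aggregated gradient (sketch).

    Component i is assumed to be f_i(x) = 0.5 * (a_i^T x - b_i)^2.
    """
    m, n = A.shape
    x = np.zeros(n)
    # Gradient memory: row i holds the last computed gradient of component i.
    table = A * (A @ x - b)[:, None]
    aggregate = table.sum(axis=0)
    for k in range(num_iters):
        i = k % m                               # cyclic index selection
        new_grad = A[i] * (A[i] @ x - b[i])     # only one component is recomputed
        aggregate += new_grad - table[i]        # refresh the aggregated gradient
        table[i] = new_grad
        x = soft_threshold(x - step * aggregate, step * lam)
    return x

# Toy separable instance with a closed-form minimizer (hypothetical data).
A = np.vstack([np.eye(3), np.eye(3)])
x_true = np.array([1.0, -2.0, 0.0])
b = A @ x_true
m = A.shape[0]
L = np.sum(A ** 2)                   # sum of component Lipschitz constants ||a_i||^2
tau = m - 1                          # maximal staleness of a stored gradient
step = 1.0 / (L * (2 * tau + 1))     # conservative, assumed stepsize rule
x_piag = cyclic_piag(A, b, lam=0.2, step=step, num_iters=3000)
```

On this toy instance each coordinate appears in exactly two components, so the minimizer is `soft_threshold(x_true, lam / 2)`, which gives a simple way to check the iterate.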
1.4 Contribution
From the algorithmic perspective, this paper proposes the general PIAG, which not only covers various classical PIAG-like algorithms, including their inexact variants, but also yields novel algorithms. From the theoretical perspective, we establish improved and novel results compared with the previous literature. Specifically, the contributions of this paper can be summarized as follows:
I. General scheme: We propose a general PIAG algorithm, which covers various classical algorithms in network optimization, distributed optimization, and machine learning. We unify all these algorithms into one mathematical scheme.
II. Novel algorithm: We apply the line search strategy to PIAG and prove its convergence. The numerical results demonstrate its efficiency.
III. Novel proof techniques: Compared with previous convergence analyses of PIAG, we use a new proof technique: Lyapunov function analysis. This allows us to establish much stronger theoretical results for a more general scheme. The Lyapunov function analysis is used throughout the paper for both the convex and nonconvex cases.
IV. Better theoretical results: The previous convergence results for PIAG are restricted to the strongly convex case, and the stepsize depends on the strong convexity constant. We remove this dependence and still guarantee linear convergence with much larger stepsizes, even under a weaker assumption. For the cyclic PIAG, the stepsize can be half of that of the gradient descent algorithm.
V. Novel theoretical results: Many new results are proved in this paper. We list them as follows:

V.1. The convergence of nonconvex PIAG is studied. In the expectation-free case, sequence convergence is proved under the semialgebraic property.
V.2. The convergence of the inexact PIAG is proved for both convex and nonconvex cases. In the convex case, convergence rates are derived provided that the noise satisfies certain assumptions. In the nonconvex case, sequence convergence is also proved under the semialgebraic property and an assumption on the noise.
V.3. We prove the sublinear convergence of PIAG under general convex assumptions. To the best of our knowledge, this is the first proof of a nonergodic convergence rate for PIAG. We also prove a nonergodic convergence rate for the inexact PIAG.
V.4. The convergence of PIAG with line search is proved for both convex and nonconvex cases. The convergence rates are also presented in the convex case.
[Table: summary of contributions. Novel results: algorithms and theory. Proof technique: Lyapunov function analysis. Better results.]

2 Preliminaries
Throughout the paper, we use the notation and Assume that is differentiable and is Lipschitz continuous. Then, is Lipschitz continuous with . The maximal delay is . The convergence analysis in the paper depends on the square summability assumption on , i.e., This is why the general PIAG contains only the deterministic versions of SAGA and SVRG, in which case . SAGA and SVRG in general, however, may not satisfy the summability assumption. In the deterministic case, according to the general PIAG defined in (1.2). Then we only need . Further, if the noise vanishes, the assumption certainly holds. Besides the deterministic case discussed above, the stochastic coordinate descent algorithm (including its asynchronous parallel version) can also satisfy this assumption. Take the stochastic coordinate descent algorithm as an example: in this algorithm, In the stochastic coordinate descent algorithm, it is easy to prove if the stepsize is well chosen. For the asynchronous parallel algorithm, by assuming the independence between with , we can prove the same result given in [27, Lemma 1].
We now introduce the definitions of subdifferentials. The details can be found in [19, 22, 23].
Definition 1.
Let be a proper and lower semicontinuous function. The subdifferential of at , written as , is defined as
3 Convergence analysis
The analysis in this section is heavily based on the following Lyapunov function:
(3.1) 
where will be determined later, based on the step size and (the bound for ). We discuss the convergence separately for the cases where (the regularization function in (1.1)) is convex and nonconvex. The main difference between the two cases is the upper bound on the stepsize. Due to the convexity of , the upper bound on the stepsize in the first case is twice that in the second.
3.1 is convex
When is convex, we consider three different types of convergence: the first is convergence in expectation, the second is almost sure convergence, and the last relies on the semialgebraic property [18, 15, 7].
Convergence in expectation:
Lemma 1.
Let be a (possibly nonconvex) function with Lipschitz gradient, let be convex, and let be finite. Let be generated by the general PIAG, and , and . Choose the step size for arbitrary fixed . Then we can choose to obtain
(3.2) 
With the Lipschitz continuity of , we are prepared to present the convergence result.
Theorem 1.
Assume the conditions of Lemma 1 hold and , and is generated by general PIAG. Then, we have
Remark 1.
For the cyclic PIAG, . If we apply gradient descent to (1.1), the stepsize should be for some . In this case, the stepsize of cyclic PIAG is half that of the gradient descent algorithm for this problem.
Almost sure convergence: Almost sure convergence is proved in this part. We consider a Lyapunov function which is a modification of (3.1):
(3.3) 
where we assume and
(3.4) 
A lemma on nonnegative almost supermartingales [21], whose details are included in the appendix, is needed to prove the almost sure convergence.
Theorem 2.
Assume the conditions of Lemma 1 hold and , and is generated by general PIAG. Then we have
Convergence under the semialgebraic property: If the function satisfies the semialgebraic property (a property used in nonconvex optimization; more details can be found in [8, 2]), we can obtain more results for the inexact proximal incremental aggregated gradient algorithm. In this case, the expectation of (Proof of Lemma 1) and (Proof of Theorem 1) can both be removed. Similar to [28, Theorem 1], we can derive the following result.
Theorem 3.
Assume the conditions of Lemma 1 hold, and satisfies the semialgebraic property, and is generated by (in)exact PIAG, and (), then, converges to a critical point of .
3.2 is nonconvex
In this subsection, we consider the case when is nonconvex. Under this weaker assumption, the stepsizes must be reduced to guarantee convergence. As in the previous subsection, we consider three kinds of convergence and list them in sequence.
Proposition 1.
Assume the conditions of Theorem 1 hold except that is nonconvex and for arbitrary fixed . Then, we have
Proposition 2.
Assume the conditions of Proposition 1 hold, then, we have
Proposition 3.
Assume the conditions of Theorem 3 hold except that is nonconvex and for arbitrary fixed , then, converges to a critical point of .
4 Convergence rates in convex case
In this part, we prove the sublinear convergence rates of the general proximal incremental aggregated gradient algorithm in the general convex case, i.e., both and are convex. The analysis in this part uses a slightly modified Lyapunov function
(4.1) 
where is given in (3.4) and and is a nonnegative sequence. Here, we assume .
4.1 Technical lemma
This part presents a technical lemma. Both the sublinear and the linear convergence results are derived from it.
Lemma 2.
Assume the gradient of is Lipschitz with and is convex. Choose the step size for arbitrary fixed . For any positive sequence satisfying for some . Let denote the projection of to , assumed to exist, and let
Then, there exist such that:
(4.2) 
4.2 Sublinear convergence rate under general convexity
In this subsection, we present the sublinear convergence of the general proximal incremental aggregated gradient algorithm.
Theorem 4.
Assume the gradient of is Lipschitz continuous with and is convex, and is bounded. Choose the step size for arbitrary fixed . Let be generated by the general proximal incremental aggregated gradient algorithm. And the , where . Then, we have
(4.3) 
In many cases, may be unbounded. However, we can slightly modify the algorithm. For example, in the LASSO problem
(4.4) 
we can easily see that This means that the solution set of (4.4) is bounded by . We can then turn to solving and set rather than . Fortunately, is still proximable: by [31, Theorem 2], we have for . In the deterministic case, the sublinear convergence still holds even if is unbounded. The boundedness of is used to derive the boundedness of the sequence . In fact, in the deterministic case this boundedness can be obtained from the coercivity of the function .
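The boundedness argument for the LASSO example can be written out as a short derivation. This is a sketch under an assumption: problem (4.4) is taken in the standard form below, with hypothetical symbols $A$, $b$, $\lambda$, $R$; the scaling and the norm used for the bound in the original may differ.

```latex
% Assumption: (4.4) has the standard LASSO form; the original scaling may differ.
\min_{x}\; F(x) := \tfrac{1}{2}\|Ax-b\|_2^2 + \lambda\|x\|_1 .
% Any minimizer x^* satisfies F(x^*) \le F(0), and the quadratic term is nonnegative, hence
\lambda\|x^*\|_1 \;\le\; F(x^*) \;\le\; F(0) \;=\; \tfrac{1}{2}\|b\|_2^2
\quad\Longrightarrow\quad
\|x^*\|_1 \;\le\; R := \frac{\|b\|_2^2}{2\lambda}.
% Adding the indicator of \{x : \|x\|_1 \le R\} to the regularizer therefore
% leaves the solution set unchanged while making the effective domain bounded.
```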
Proposition 4.
Assume the conditions of Theorem 4 hold. Let be generated by the (in)exact PIAG. Then
To the best of our knowledge, this is the first proof of a sublinear convergence rate for the proximal incremental aggregated gradient algorithm.
4.3 Linear convergence with larger stepsize
Assume that the function satisfies the following condition
(4.5) 
where is the projection of to the set , and . This property is weaker than strong convexity. If is further differentiable, condition (4.5) is equivalent to restricted strong convexity [16].
Theorem 5.
Assume the gradient of is Lipschitz with and is convex, and the function satisfies condition (4.5). Choose the step size for arbitrary fixed . And the , where . Then, we have
(4.6) 
for some .
5 Line search of the proximal incremental gradient algorithm
In this part, we consider a line search version of the deterministic proximal incremental gradient algorithm. First, we set if is nonconvex, and if is convex. The scheme of the algorithm can be presented as follows:
Step 1. Compute the point
Step 2. Find as the smallest integer which obeys and where and are the parameters. Set if and otherwise. The point is generated by
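The "smallest integer" search in Step 2 follows the usual backtracking pattern. The sketch below illustrates that pattern only: the acceptance test is the standard sufficient-decrease condition for proximal gradient steps with an exact gradient, swapped in as a stand-in because the paper's own conditions are not reproduced here. The names `backtracking_prox_step`, `step0`, `shrink`, and `max_backtracks` are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def backtracking_prox_step(f, grad_f, prox, x, step0=1.0, shrink=0.5, max_backtracks=50):
    """Return (x_new, step, j), where j is the smallest integer such that
    step = step0 * shrink**j passes the stand-in sufficient-decrease test."""
    g = grad_f(x)
    fx = f(x)
    for j in range(max_backtracks):
        step = step0 * shrink ** j
        x_new = prox(x - step * g, step)
        d = x_new - x
        # Stand-in test: quadratic upper model of f with curvature 1/step.
        if f(x_new) <= fx + g @ d + (d @ d) / (2.0 * step):
            return x_new, step, j
    return x_new, step, max_backtracks - 1

# Toy problem (hypothetical data): least squares plus an l1 penalty.
A = np.diag([1.0, 2.0, 3.0])
b = np.ones(3)
lam = 0.1
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)
prox = lambda v, t: soft_threshold(v, lam * t)
F = lambda x: f(x) + lam * np.sum(np.abs(x))

x0 = np.zeros(3)
x_new, step, j = backtracking_prox_step(f, grad_f, prox, x0)
```

Because the test is guaranteed to pass once the trial step drops below the reciprocal of the gradient's Lipschitz constant, the loop stops after finitely many reductions, and the accepted point does not increase the composite objective `F`.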
Without noise, one parameter of the Lyapunov function can be dropped in the analysis (we can get rid of ). Thus, the Lyapunov function used in this part is
(5.1) 
Lemma 3.
Let be a function (may be nonconvex) with Lipschitz gradient and is nonconvex, and finite . Let be generated by the proximal incremental aggregated gradient algorithm with line search, and . Choose the parameter and . It then holds that
In the previous result, if is convex, the lower bound of can be halved. This is because (7.51) in the Appendix can be improved as This result is proved by (Proof of Lemma 1). Thus, we can obtain the following result.
Lemma 4.
Assume conditions of Lemma 3 hold except that both and are convex and . It then holds that
In fact, we can also derive the convergence rate for the line search version in the convex case. The proof is very similar to the one in Section 4, so we only present a sketch. As in the previous analysis, a modified Lyapunov function is needed where With this Lyapunov function and a suitable , we prove the following two inequalities
(5.2) 
and
(5.3) 
Theorem 6.
Let be a convex function with Lipschitz gradient and is convex, and finite . Let be generated by the proximal incremental aggregated gradient algorithm with line search, and . Choose the parameter and . Then, there exist such that:
(5.4) 
where Furthermore, if is coercive, If satisfies condition (4.5), for some .
6 Numerical results
We now use numerical experiments to show how the line search strategy can accelerate the PIAG algorithms. We consider the following two updating rules:

scheme I:

scheme II:
where ,, . We tested binary classifiers on MNIST and ijcnn1. To cover both convex and nonconvex cases, we choose logistic regression (convex) and squared logistic loss (nonconvex) for , and regularization (convex) and MCP (nonconvex) for . The results of schemes I and II with and without line search are shown in Figure 6. In our experiments, we choose when is convex and when is nonconvex, , . Our numerical results show that the line search strategy can speed up the PIAG algorithm considerably.
7 Conclusion
In this paper, we consider a general proximal incremental aggregated gradient algorithm and prove several novel results. Stronger results than in previous work are proved under more general conditions. The core of the analysis is the use of a Lyapunov function. We also consider a line search variant of the proximal incremental aggregated gradient algorithm and prove its convergence rate.
References
 [1] Alekh Agarwal and Leon Bottou. A lower bound for the optimization of finite sums. arXiv preprint arXiv:1410.0723, 2014.
 [2] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
 [3] Amir Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185–209, 2015.
 [4] Dimitri P Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
 [5] Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(138):3, 2011.
 [6] Doron Blatt, Alfred O Hero, and Hillel Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.
 [7] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
 [8] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.
 [9] Tianyi Chen, Georgios B Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. NIPS 2018, 2018.
 [10] Patrick L Combettes and Valérie R Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.
 [11] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
 [12] Mert Gurbuzbalaban, Asuman Ozdaglar, and PA Parrilo. On the convergence rate of incremental aggregated gradient algorithms. SIAM Journal on Optimization, 27(2):1035–1048, 2017.
 [13] Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo Parrilo. A globally convergent incremental newton method. Mathematical Programming, 151(1):283–313, 2015.
 [14] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
 [15] Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. In Annales de l’institut Fourier, volume 48, pages 769–784. Chartres: L’Institut, 1950, 1998.
 [16] Ming-Jun Lai and Wotao Yin. Augmented $\ell_1$ and nuclear-norm models with a globally linearly convergent algorithm. SIAM Journal on Imaging Sciences, 6(2):1059–1091, 2013.
 [17] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1705.09056, 2017.
 [18] Stanislas Łojasiewicz. Sur la géométrie semi- et sous-analytique. Ann. Inst. Fourier, 43(5):1575–1595, 1993.
 [19] Boris S Mordukhovich. Variational analysis and generalized differentiation I: Basic theory, volume 330. Springer Science & Business Media, 2006.
 [20] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
 [21] Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer, 1985.
 [22] R Tyrrell Rockafellar and Roger JB Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
 [23] Ralph Tyrell Rockafellar. Convex analysis. Princeton university press, 2015.
 [24] Ernest K Ryu and Wotao Yin. Proximal-proximal-gradient method. arXiv preprint arXiv:1708.06908, 2017.
 [25] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
 [26] Mikhail V Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
 [27] Tao Sun, Robert Hannah, and Wotao Yin. Asynchronous coordinate descent under more realistic assumptions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6183–6191. 2017.
 [28] Tao Sun, Hao Jiang, Lizhi Cheng, and Wei Zhu. A convergence frame for inexact nonconvex and nonsmooth algorithms and its applications to several iterations. arXiv preprint arXiv:1709.04072, 2017.
 [29] Paul Tseng and Sangwoon Yun. Incrementally updated gradient methods for constrained and regularized optimization. Journal of Optimization Theory and Applications, 160(3):832–853, 2014.
 [30] Nuri Denizcan Vanli, Mert Gurbuzbalaban, and Asu Ozdaglar. Global convergence rate of proximal incremental aggregated gradient methods. arXiv preprint arXiv:1608.01713, 2016.
 [31] Yao-Liang Yu. On decomposing the proximal map. In Advances in Neural Information Processing Systems, pages 91–99, 2013.