 # General Proximal Incremental Aggregated Gradient Algorithms: Better and Novel Results under General Scheme

The incremental aggregated gradient algorithm is popular in network optimization and machine learning research. However, the current convergence results require the objective function to be strongly convex, and the existing convergence rates are limited to linear convergence. Moreover, owing to the mathematical techniques used, the stepsize in the algorithm is restricted by the strong convexity constant, which may force the stepsize to be very small (the strong convexity constant may be small). In this paper, we propose a general proximal incremental aggregated gradient algorithm, which contains various existing algorithms, including the basic incremental aggregated gradient method. Better and new convergence results are proved even for this general scheme. The novel results presented in this paper, which have not appeared in previous literature, include: a general scheme, a nonconvex analysis, sublinear convergence rates for the function values, much larger stepsizes that still guarantee convergence, convergence in the presence of noise, and a line search strategy for the proximal incremental aggregated gradient algorithm together with its convergence.


## 1 Introduction

Many problems in machine learning and network optimization can be formulated as

 min_x { F(x) = f(x) + g(x) },   (1.1)

where f(x) = Σ_{i=1}^m f_i(x), each f_i is differentiable, ∇f_i is Lipschitz continuous with constant L_i for i ∈ {1, 2, …, m}, and g is proximable. A state-of-the-art method for this problem is the proximal gradient method, which requires computing the full gradient of f in each iteration. However, when the number m of component functions is very large, it is costly to obtain the full gradient ∇f; on the other hand, in some network settings, calculating the full gradient is not possible either. Thus, incremental gradient algorithms have been developed to avoid computing the full gradient.
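To make the roles of ∇f and the proximal map concrete, here is a minimal sketch (ours, not the paper's) of the full proximal gradient method on a toy LASSO-type instance; the helper name `soft_threshold` and the toy data are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    # prox of t*||.||_1: componentwise shrinkage
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(grad_f, prox_g, x0, gamma, iters=500):
    """Full proximal gradient method: x+ = prox_{gamma*g}(x - gamma*grad_f(x))."""
    x = x0
    for _ in range(iters):
        x = prox_g(x - gamma * grad_f(x), gamma)
    return x

# toy instance: f(x) = 0.5*||Ax - b||^2, i.e. a sum of m = 20 components
# f_i(x) = 0.5*(a_i^T x - b_i)^2, with g = lam*||x||_1
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((20, 5)), rng.standard_normal(20), 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda v, gamma: soft_threshold(v, gamma * lam)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L with L = ||A||_2^2
x_star = proximal_gradient(grad_f, prox_g, np.zeros(5), gamma)
```

The incremental methods discussed next replace the exact full gradient `grad_f(x)` in this loop with a cheaper, possibly delayed, aggregate.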

The main idea of incremental gradient descent lies in computing the gradients of only part of the components of f to refresh the full gradient. Precisely, in each iteration, it selects an index set S_k ⊆ {1, 2, …, m} and then computes ∇f_i for i ∈ S_k to update the aggregated gradient. This requires much less computation than gradient descent without losing too much accuracy relative to the true gradient. It is natural to consider two index selection strategies: deterministic and stochastic. In fact, all incremental gradient algorithms for solving problem (1.1) can be labeled as one of these two routines.

### 1.1 The general PIAG algorithm

Let x^k denote the k-th iterate and let χ^k denote the σ-algebra generated by x^0, x^1, …, x^k. Consider a general proximal incremental aggregated gradient algorithm which performs as

 E(v^k | χ^k) = Σ_{i=1}^m ∇f_i(x^{k−τ_{i,k}}) + e_k,
 x^{k+1} = prox_{γ_k g}[x^k − γ_k v^k],   (1.2)

where τ_{i,k} is the delay associated with f_i in the k-th iteration and e_k is the noise in the k-th iteration. The first equation in (1.2) indicates that v^k is, in expectation, an approximation of the full gradient with delays and noise. For simplicity, we call this algorithm the general PIAG algorithm.
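Scheme (1.2) with zero noise and cyclic index selection can be sketched as follows (a toy illustration, not the paper's implementation; the quadratic test problem and the conservative stepsize are our assumptions):

```python
import numpy as np

def piag(grads, prox_g, x0, gamma, steps):
    """Deterministic PIAG sketch of scheme (1.2): the direction v_k aggregates
    one fresh component gradient and m-1 stale ones kept in a table
    (noise e_k = 0, cyclic index selection)."""
    m, x = len(grads), x0.copy()
    table = [g(x) for g in grads]         # stored grad f_i(x^{k - tau_{i,k}})
    for k in range(steps):
        i = k % m                         # cyclic choice, so the delay <= m
        table[i] = grads[i](x)            # refresh only one component gradient
        v = np.sum(table, axis=0)         # delayed aggregated full gradient
        x = prox_g(x - gamma * v, gamma)  # proximal step
    return x

# toy instance: f_i(x) = 0.5*(a_i^T x - b_i)^2, g = 0 (prox is the identity)
rng = np.random.default_rng(1)
A, b = rng.standard_normal((6, 3)), rng.standard_normal(6)
grads = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(6)]
prox_id = lambda v, gamma: v
L = sum(np.linalg.norm(A[i]) ** 2 for i in range(6))  # L = sum_i L_i
tau = 6                                                # maximal delay here
gamma = 1.0 / ((2 * tau + 1) * L)                      # conservative stepsize
x_hat = piag(grads, prox_id, np.zeros(3), gamma, steps=30000)
```

On this least-squares toy the iterates approach the minimizer of f even though every step uses a stale aggregate.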

### 1.2 Literature review

As mentioned before, by the strategies of index selection, the literature can also be divided into two classes.

On the deterministic road: Bertsekas proposed the Incremental Gradient (IG) method for problem (1.1) when g ≡ 0. To obtain convergence, the IG method requires diminishing stepsizes, even for smooth and strongly convex functions . A special condition was proposed in  to relax this requirement. A second-order IG method was also developed in . An improved version of the IG is the Incremental Aggregated Gradient (IAG) method [6, 29]. When F is a quadratic function, convergence based on a perturbation analysis of the eigenvalues of a periodic dynamic linear system was given in . The global convergence is proved in ; and if a local Lipschitzian error condition and local strong convexity are satisfied, local linear convergence can also be proved. Lower complexity bounds for IAG for problem (1.1) with g ≡ 0 were established in . Linear convergence rates are proved under the strong convexity assumption [12, 30].

On the stochastic road: The pioneer of this class is the Stochastic Gradient Descent (SGD) method , which suggests picking i_k from {1, 2, …, m} in each iteration with uniform probability, and using ∇f_{i_k}(x^k) to replace the full gradient. However, SGD requires diminishing stepsizes, which makes its performance poor in both theory and practice. Due to the large deviation of the sampled gradient from the full gradient, "variance reduction" schemes were proposed later, such as the SVRG method , the SAG method , and the SAGA method . With suitably selected constant stepsizes, linear convergence has been proved in the strongly convex case, and ergodic sublinear convergence has been proved in the non-strongly convex case.

### 1.3 Relations with existing algorithms

In this part, we present several popular existing algorithms that are covered by the general PIAG.

E.1. (Inexact) Proximal Gradient Descent Algorithm: When all delays vanish (τ_{i,k} ≡ 0) and v^k is deterministic, the general PIAG is equivalent to the (inexact) proximal gradient descent x^{k+1} = prox_{γ_k g}[x^k − γ_k(∇f(x^k) + e_k)].

E.2. (Inexact) Proximal Incremental Aggregated Gradient Algorithm: In the k-th iteration, pick i_k essentially cyclically and then update the delays as

 τ_{i,k+1} = τ_{i,k} + 1  if i ≠ i_k,
 τ_{i,k+1} = 1            if i = i_k.   (1.3)

In each iteration, one only needs to compute ∇f_{i_k}(x^k); the remaining terms of the aggregated gradient are shared through the memory.
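The memory sharing described above can be made explicit: keep the aggregate as a running sum and swap a single stored gradient per iteration, so the per-iteration cost beyond one component gradient is O(dim). A sketch (the least-squares test problem is our assumption):

```python
import numpy as np

def iag_running_sum(grads, x0, gamma, steps):
    """IAG with the bookkeeping of (1.3): rather than re-summing all stored
    gradients, update the shared aggregate in O(dim) per iteration."""
    m, x = len(grads), x0.copy()
    table = [g(x) for g in grads]
    v = np.sum(table, axis=0)        # aggregated gradient kept in memory
    for k in range(steps):
        i = k % m                    # essentially cyclic choice of i_k
        fresh = grads[i](x)
        v = v + fresh - table[i]     # replace the stale term in the sum
        table[i] = fresh
        x = x - gamma * v            # g = 0 here, so the prox is the identity
    return x

# least-squares components f_i(x) = 0.5*(a_i^T x - b_i)^2
rng = np.random.default_rng(2)
A, b = rng.standard_normal((8, 4)), rng.standard_normal(8)
grads = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(8)]
L = sum(np.linalg.norm(A[i]) ** 2 for i in range(8))
gamma = 1.0 / ((2 * 8 + 1) * L)
x_hat = iag_running_sum(grads, np.zeros(4), gamma, steps=40000)
```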

E.3. Deterministic SAG (SAGA): Let τ_{i,k} be defined as in (1.3), pick i_k essentially cyclically, then update x^{k+1} with the corresponding SAG (SAGA) step.

E.4. Deterministic SVRG: Pick i_k cyclically, fix a snapshot point, take the reference gradient from it, and then update x^{k+1} with the corresponding variance-reduced step.

E.5. Decentralized Parallel Stochastic Gradient Descent (DPSGD): The algorithm is proposed in  to solve a consensus problem over a network, where W is a mixing matrix and N(i) is the neighbour set of node i. In each iteration, the DPSGD computes a stochastic gradient of the local function at node i, and then computes the neighborhood weighted average to update the local variable.
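One DPSGD-style round (mix with the neighbors, then take a local gradient step) can be sketched as follows. This is our toy illustration: it uses exact local gradients instead of stochastic ones, and a 3-node complete graph with an assumed doubly stochastic mixing matrix; with a constant stepsize the node average converges to the true minimizer while individual nodes keep an O(γ) consensus error.

```python
import numpy as np

def dpsgd_step(X, W, grads, gamma):
    """One round: neighborhood weighted average, then a local gradient step
    at every node (exact local gradients in this sketch)."""
    X_avg = W @ X                                    # mixing / consensus step
    G = np.stack([grads[i](X[i]) for i in range(len(grads))])
    return X_avg - gamma * G

# doubly stochastic mixing matrix for 3 fully connected nodes
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
# node i holds f_i(x) = 0.5*(x - c_i)^2 with scalar x; the minimizer of the
# sum f_1 + f_2 + f_3 is the average of the c_i
c = np.array([1.0, 2.0, 6.0])
grads = [lambda x, ci=ci: x - ci for ci in c]
X = np.zeros(3)
for _ in range(2000):
    X = dpsgd_step(X, W, grads, gamma=0.05)
```

Because W is doubly stochastic, the mixing step preserves the average of the local variables, so the mean of `X` is driven exactly to the global minimizer.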

E.6. Forward-backward splitting by Parameter Server Computing: The forward-backward splitting for problem (1.1) can be implemented on a parameter-server computing model, in which the worker nodes communicate synchronously with the parameter server but not directly with each other. Node i just computes its component gradient at a point read from the shared memory with bounded delay, and sends the result to the parameter server. In the parameter server, the gradients are collected and the iterate is updated (the proximal map computation is implemented there). The algorithm is a special case of the general PIAG. More details about parameter server computing can be found in .

E.7. LAG (lazily aggregated gradient): This algorithm is also designed for parameter server computing. Different from E.6, its main motivation is to reduce communication. In this setting, the parameter server broadcasts the current iterate to the workers, which is cheap, while data transmission from the workers to the parameter server is expensive. The authors of  therefore propose LAG, whose core idea is to disable the data feedback from a worker to the parameter server when its gradient changes only slightly. It is not difficult to verify that the general PIAG contains LAG.

### 1.4 Contribution

From the algorithmic perspective, this paper proposes the general PIAG, which not only covers various classical PIAG-like algorithms, including inexact schemes, but also yields novel algorithms. From the theoretical perspective, we establish better and novel results compared with the previous literature. Specifically, the contributions of this paper can be summarized as follows:

I. General scheme: We propose a general PIAG algorithm, which covers various classical algorithms in network optimization, distributed optimization, and machine learning. We unify all these algorithms into one mathematical scheme.

II. Novel algorithm: We apply the line search strategy to PIAG and prove its convergence. The numerical results demonstrate its efficiency.

III. Novel proof techniques: Compared with previous convergence analyses of PIAG, we use a new proof technique: Lyapunov function analysis. Thanks to this, we can establish much stronger theoretical results for a more general scheme. The Lyapunov function analysis is used throughout the paper for both convex and nonconvex cases.

IV. Better theoretical results: The previous convergence results for PIAG are restricted to the strongly convex case, and the stepsize depends on the strong convexity constant. We get rid of this constant and still guarantee linear convergence with much larger stepsizes, even under a weaker assumption. For the cyclic PIAG, the stepsize can be half of that of the gradient descent algorithm.

V. Novel theoretical results: Many new results are proved in this paper. We list them as follows:

• V.1. The convergence of nonconvex PIAG is studied, and in the expectation-free case, sequence convergence is proved under the semi-algebraic property.

• V.2. The convergence of the inexact PIAG is proved for both convex and nonconvex cases. In the convex case, convergence rates are derived when the noise satisfies certain assumptions. In the nonconvex case, sequence convergence is also proved under the semi-algebraic property and an assumption on the noise.

• V.3. We prove the sublinear convergence of PIAG under general convexity assumptions. To the best of our knowledge, this is the first non-ergodic convergence rate proved for PIAG. We also prove the non-ergodic convergence rate for inexact PIAG.

• V.4. The convergence of PIAG with line search is proved for both convex and nonconvex cases. Convergence rates are also presented in the convex case.

## 2 Preliminaries

Throughout the paper, we use the notation Δ_k := x^{k+1} − x^k. Assume that each f_i is differentiable and ∇f_i is L_i-Lipschitz continuous. Then, ∇f is Lipschitz continuous with L := Σ_{i=1}^m L_i. The maximal delay is τ := max_{i,k} τ_{i,k}. The convergence analysis in the paper depends on a square summability assumption on (σ_k), i.e., Σ_k σ_k² < +∞. This is why the general PIAG only contains the deterministic SAGA and SVRG, in which case the variance vanishes; for the stochastic SAGA and SVRG the summability assumption may fail to hold. In the deterministic case, v^k = Σ_{i=1}^m ∇f_i(x^{k−τ_{i,k}}) + e_k according to the general PIAG defined in (1.2), so we only need Σ_k ‖e_k‖² < +∞. Further, if the noise vanishes, the assumption certainly holds. Besides the deterministic case discussed above, the stochastic coordinate descent algorithm (with asynchronous parallelism) can also satisfy this assumption: in that algorithm, the square summability is easy to prove if the stepsize is well chosen. For the asynchronous parallel algorithm, by assuming independence between the sampled index and the iterate, we can prove the same result as given in [Lemma 1, ]. We now introduce the definition of subdifferentials; details can be found in [19, 22, 23].

###### Definition 1.

Let J : R^N → (−∞, +∞] be a proper and lower semicontinuous function. The subdifferential of J at x ∈ dom J, written as ∂J(x), is defined as

 ∂J(x) := { u ∈ R^N : ∃ x^k → x, J(x^k) → J(x), u^k → u, such that
     liminf_{y → x^k, y ≠ x^k} [ J(y) − J(x^k) − ⟨u^k, y − x^k⟩ ] / ‖y − x^k‖ ≥ 0 }.

## 3 Convergence analysis

The analysis in this section is heavily based on the following Lyapunov function:

 ξ_k(ε, δ) := F(x^k) + (L/(2ε)) Σ_{d=k−τ}^{k−1} (d − (k − τ) + 1) ‖Δ_d‖² + (1/(2δ)) Σ_{i=k}^{+∞} σ_i² − min F,   (3.1)

where ε, δ > 0 will be determined later, based on the step size γ and σ_k (the bound associated with the noise). We discuss the convergence when g (the regularizer in (1.1)) is convex or nonconvex separately. The main difference between the two cases is the upper bound on the stepsize: due to the convexity of g, the upper bound in the first case is twice that of the second.

### 3.1 g is convex

When g is convex, we consider three different types of convergence: the first is in expectation, the second concerns almost sure convergence, and the last exploits the semi-algebraic property [18, 15, 7].

Convergence in the perspective of expectation:

###### Lemma 1.

Let f be a function (possibly nonconvex) with L-Lipschitz gradient, let g be convex, and assume min F is finite. Let (x^k) be generated by the general PIAG with Σ_k σ_k² < +∞. Choose the step size γ_k ≡ γ = 2c/((2τ + 1)L) for arbitrary fixed 0 < c < 1. Then we can choose ε, δ to obtain

 E ξ_k(ε, δ) − E ξ_{k+1}(ε, δ) ≥ (1/4) (1/γ − L/2 − τL) · E‖Δ_k‖²,    lim_k E‖Δ_k‖ = 0.   (3.2)

With the Lipschitz continuity of ∇f, we are prepared to present the convergence result.

###### Theorem 1.

Assume the conditions of Lemma 1 hold and (x^k) is generated by the general PIAG. Then, we have

###### Remark 1.

For the cyclic PIAG, τ = m. If we apply gradient descent to (1.1), the stepsize should be chosen below a fixed multiple of 1/L. In this case, the stepsize of cyclic PIAG is half of that of the gradient descent algorithm for this problem.

Convergence in the almost sure sense: The almost sure convergence is proved in this part. We consider a Lyapunov function which is a modification of (3.1):

 ξ̂_k(ε, δ) := F(x^k) + κ · Σ_{d=k−τ}^{k−1} (d − (k − τ) + 1) ‖Δ_d‖² + (1/(2δ)) Σ_{i=k}^{+∞} σ_i² − min F,   (3.3)

where we assume the conditions of Lemma 1 hold and

 κ := L/(2ε) + (1/(4τ)) (1/γ − L/2 − τL).   (3.4)

A lemma on nonnegative almost supermartingales , whose details are included in the appendix, is needed to prove the almost sure convergence.

###### Theorem 2.

Assume the conditions of Lemma 1 hold and (x^k) is generated by the general PIAG. Then we have

Convergence under the semi-algebraic property: If the function F satisfies the semi-algebraic property (widely used in nonconvex optimization; more details can be found in [8, 2]), we can obtain more results for the inexact proximal incremental aggregated gradient algorithm. In this case, the expectations in the proof of Lemma 1 and in the proof of Theorem 1 can both be removed. Similar to [Theorem 1, ], we can derive the following result.

###### Theorem 3.

Assume the conditions of Lemma 1 hold, F satisfies the semi-algebraic property, and (x^k) is generated by the (in)exact PIAG. Then, (x^k) converges to a critical point of F.

### 3.2 g is nonconvex

In this subsection, we consider the case when g is nonconvex. Under this weaker assumption, the stepsizes must be reduced to ensure convergence. As in the previous subsection, we consider three kinds of convergence, listed in sequence.

###### Proposition 1.

Assume the conditions of Theorem 1 hold except that g is nonconvex and γ_k ≡ γ = c/((2τ + 1)L) for arbitrary fixed 0 < c < 1. Then, we have

###### Proposition 2.

Assume the conditions of Proposition 1 hold; then, we have

###### Proposition 3.

Assume the conditions of Theorem 3 hold except that g is nonconvex and γ_k ≡ γ = c/((2τ + 1)L) for arbitrary fixed 0 < c < 1; then, (x^k) converges to a critical point of F.

## 4 Convergence rates in convex case

In this part, we prove sublinear convergence rates for the general proximal incremental aggregated gradient algorithm in the general convex case, i.e., when both f and g are convex. The analysis in this part uses a slightly modified Lyapunov function

 F_k(ε, δ) := F(x^k) + κ · Σ_{d=k−τ}^{k−1} (d − (k − τ) + 1) ‖Δ_d‖² + λ_k − min F,   (4.1)

where κ is given in (3.4) and (λ_k) is a nonnegative sequence.

### 4.1 Technical lemma

This part presents a technical lemma. The sublinear and linear convergence results are both derived from it.

###### Lemma 2.

Assume the gradient of f is Lipschitz with constant L and g is convex. Choose the step size γ = 2c/((2τ + 1)L) for arbitrary fixed 0 < c < 1. Let (λ_k) be a positive sequence satisfying λ_k ≤ D λ_{k+1} for some D > 0. Let x̄^{k+1} denote the projection of x^{k+1} onto argmin F, assumed to exist, and let

 α := max{ 1/γ + L + κτ, 2D } / min{ (1/(8τ)) (1/γ − L/2 − τL), 1 },
 β := (τ + 1)(1/γ + L) + 1.

Then, there exist ε, δ such that:

 (E F_{k+1}(ε, δ))² ≤ α (E F_k(ε, δ) − E F_{k+1}(ε, δ)) × ( κτ Σ_{d=k−τ}^{k−1} E‖Δ_d‖² + β E‖x^{k+1} − x̄^{k+1}‖² + λ_k ).   (4.2)

### 4.2 Sublinear convergence rate under general convexity

In this subsection, we present the sublinear convergence of the general proximal incremental aggregated gradient algorithm.

###### Theorem 4.

Assume the gradient of f is Lipschitz continuous with constant L, g is convex, and argmin F is bounded. Choose the step size γ = 2c/((2τ + 1)L) for arbitrary fixed 0 < c < 1. Let (x^k) be generated by the general proximal incremental aggregated gradient algorithm. Then, we have

 E F(x^k) − min F ∼ O(1/k).   (4.3)

In many cases, argmin F may be unbounded. However, we can slightly modify the algorithm. For example, in the LASSO problem

 min_x { ‖b − Ax‖²₂ + ‖x‖₁ },   (4.4)

we can easily see that any minimizer x* satisfies ‖x*‖₁ ≤ F(x*) ≤ F(0) = ‖b‖²₂. That means the solution set of (4.4) is bounded by ‖b‖²₂ in the ℓ₁ norm. Then, we can turn to solving the problem with the additional constraint ‖x‖₁ ≤ ‖b‖²₂, and set g to be ‖·‖₁ plus the indicator of this ℓ₁ ball rather than ‖·‖₁ alone. Luckily, this new g is still proximable: with [Theorem 2, ], its proximal map decomposes. In the deterministic case, the sublinear convergence still holds even if argmin F is unbounded: the boundedness of argmin F is only used to derive the boundedness of the sequence (x^k), and in the deterministic case this boundedness can instead be obtained from the coercivity of the function F.

###### Proposition 4.

Assume the conditions of Theorem 4 hold. Let (x^k) be generated by the (in)exact PIAG; then

To the best of our knowledge, this is the first sublinear convergence rate proved for the proximal incremental aggregated gradient algorithm.

### 4.3 Linear convergence with larger stepsize

Assume that the function F satisfies the following condition:

 F(x) − min F ≥ ν ‖x − x̄‖²,   (4.5)

where x̄ is the projection of x onto the set argmin F, and ν > 0. This property is weaker than strong convexity. If F is further differentiable, condition (4.5) is equivalent to restricted strong convexity .
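As a concrete illustration (our example, not the paper's): for the least-squares objective F(x) = ½‖Ax − b‖², condition (4.5) holds with ν = λ⁺_min(AᵀA)/2, where λ⁺_min denotes the smallest nonzero eigenvalue, even when A is rank-deficient and F is therefore not strongly convex:

```latex
% \bar{x} is the projection of x onto \operatorname{argmin}F
%   = \{x : A^{\top}Ax = A^{\top}b\}; note A^{\top}(A\bar{x}-b)=0
%   and x-\bar{x}\perp\operatorname{null}(A).
F(x)-\min F
  = \tfrac12\|A(x-\bar{x})\|^{2}
    + \langle x-\bar{x},\,A^{\top}(A\bar{x}-b)\rangle
  = \tfrac12\|A(x-\bar{x})\|^{2}
  \;\ge\; \tfrac{\lambda^{+}_{\min}(A^{\top}A)}{2}\,\|x-\bar{x}\|^{2}.
```

The inequality in the last step uses that x − x̄ lies in the row space of A, which is orthogonal to the null space of A.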

###### Theorem 5.

Assume the gradient of f is Lipschitz with constant L, g is convex, and the function F satisfies condition (4.5). Choose the step size γ = 2c/((2τ + 1)L) for arbitrary fixed 0 < c < 1. Then, we have

 E F(x^k) − min F ∼ O(ω^k),   (4.6)

for some 0 < ω < 1.

Compared with the existing linear convergence results in [30, 12], our theoretical findings enjoy two advantages: 1. we generalize strong convexity to the much weaker condition (4.5); 2. the stepsize gets rid of the strong convexity parameter, which allows much larger steps when that parameter is small.

## 5 Line search of the proximal incremental gradient algorithm

In this part, we consider a line search version of the deterministic proximal incremental gradient algorithm. First, one parameter value is set when g is nonconvex and another when g is convex. The scheme of the algorithm can be presented as follows. Step 1: compute the trial point. Step 2: find j as the smallest nonnegative integer that satisfies the sufficient-decrease conditions, where the remaining constants are the line search parameters; the stepsize is then set according to whether the test is passed. The point x^{k+1} is generated by the corresponding proximal step.
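Since the exact test constants are elided above, the following sketch uses a standard backtracking sufficient-decrease test for one proximal-gradient step (an illustrative stand-in, not the paper's precise rule; the toy problem is our assumption):

```python
import numpy as np

def prox_grad_linesearch(f, grad_f, prox_g, x, gamma0=1.0, eta=0.5):
    """One backtracking proximal-gradient step: shrink the trial stepsize
    until the quadratic upper model of f at x majorizes f at the new point."""
    gamma = gamma0
    while True:
        x_new = prox_g(x - gamma * grad_f(x), gamma)
        d = x_new - x
        # sufficient decrease test; always passes once gamma <= 1/L
        if f(x_new) <= f(x) + grad_f(x) @ d + (0.5 / gamma) * (d @ d):
            return x_new, gamma
        gamma *= eta                      # backtrack

# toy run on f(x) = 0.5*||Ax - b||^2 with g = 0
rng = np.random.default_rng(3)
A, b = rng.standard_normal((10, 4)), rng.standard_normal(10)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad_f = lambda x: A.T @ (A @ x - b)
prox_id = lambda v, gamma: v
x = np.zeros(4)
for _ in range(1000):
    x, gamma = prox_grad_linesearch(f, grad_f, prox_id, x)
```

The accepted stepsize never needs knowledge of the Lipschitz constant L, which is the practical appeal of the line search variant.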

Without the noise, the Lyapunov function can drop one parameter in the analysis (we can get rid of δ). Thus, the Lyapunov function used in this part is

 ξ_k(ε) := F(x^k) + (L/(2ε)) Σ_{d=k−τ}^{k−1} (d − (k − τ) + 1) ‖Δ_d‖² − min F.   (5.1)
###### Lemma 3.

Let f be a function (possibly nonconvex) with L-Lipschitz gradient, let g be nonconvex, and assume min F is finite. Let (x^k) be generated by the proximal incremental aggregated gradient algorithm with line search, with the parameters chosen as above. It then holds that

In the previous result, if g is convex, the lower bound can be halved. This is because (7.51) in the Appendix can be improved, as in the proof of Lemma 1. Thus, we obtain the following result.

###### Lemma 4.

Assume the conditions of Lemma 3 hold except that both f and g are convex. It then holds that

In fact, we can also derive the convergence rate for the line search version in the convex case. The proof is very similar to that in Section 4, so we only present a sketch. As in the previous analysis, a modified Lyapunov function F_k(ε) is needed. With this Lyapunov function and a suitable ε, we prove the following two inequalities:

 F_k(ε) − F_{k+1}(ε) ≥ min{ ((1 − c)/(8cτ)) (L + 2τL), 1 } · Σ_{d=k−τ}^{k} ‖Δ_d‖²,   (5.2)

and

 (F_{k+1}(ε))² ≤ (1/γ + L + κ̃τ) × ( Σ_{d=k−τ}^{k} ‖Δ_d‖² ) × ( [(τ + 1)(1/γ + L) + 1] ‖x^{k+1} − x̄^{k+1}‖² + κ̃τ Σ_{d=k−τ}^{k−1} ‖Δ_d‖² ).   (5.3)

With (5.2) and (5.3), we then derive the following theorem.

###### Theorem 6.

Let f be a convex function with L-Lipschitz gradient, let g be convex, and assume min F is finite. Let (x^k) be generated by the proximal incremental aggregated gradient algorithm with line search, with the parameters chosen as in Lemma 4. Then, there exist ε, α̃ such that:

 (F_{k+1}(ε))² ≤ α̃ (F_k(ε) − F_{k+1}(ε)) × ( κ̃τ Σ_{d=k−τ}^{k−1} ‖Δ_d‖² + β ‖x^{k+1} − x̄^{k+1}‖² ),   (5.4)

where α̃ and κ̃ are defined analogously to α and κ. Furthermore, if F is coercive, the sublinear rate F(x^k) − min F ∼ O(1/k) holds; if F satisfies condition (4.5), then F(x^k) − min F ∼ O(ω^k) for some 0 < ω < 1.

## 6 Numerical results

We now use numerical experiments to show how the line search strategy can accelerate PIAG algorithms. We consider the following two updating rules,

1. Scheme I:

2. Scheme II:

where the quantities above are as defined in Section 1. We tested binary classifiers on MNIST and ijcnn1. To cover both convex and nonconvex cases, we choose logistic regression (convex) and the squared logistic loss (nonconvex) for f, and ℓ₁ regularization (convex) and MCP (nonconvex) for g. The results obtained with schemes I and II, with and without line search, are shown in Figure 6. In our experiments, we choose one parameter setting when g is convex and another when g is nonconvex. Our numerical results show that the line search strategy can speed up the PIAG algorithm considerably.

## 7 Conclusion

In this paper, we considered a general proximal incremental aggregated gradient algorithm and proved several novel results; much better results are proved under more general conditions. The core of the analysis is the Lyapunov function technique. We also considered the line search version of the proximal incremental aggregated gradient algorithm and proved its convergence rate.

## References

•  Alekh Agarwal and Leon Bottou. A lower bound for the optimization of finite sums. arXiv preprint arXiv:1410.0723, 2014.
•  Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
•  Amir Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185–209, 2015.
•  Dimitri P Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
•  Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.
•  Doron Blatt, Alfred O Hero, and Hillel Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.
•  Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
•  Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.
•  Tianyi Chen, Georgios B Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Advances in Neural Information Processing Systems, 2018.
•  Patrick L Combettes and Valérie R Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.
•  Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
•  Mert Gurbuzbalaban, Asuman Ozdaglar, and PA Parrilo. On the convergence rate of incremental aggregated gradient algorithms. SIAM Journal on Optimization, 27(2):1035–1048, 2017.
•  Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo Parrilo. A globally convergent incremental Newton method. Mathematical Programming, 151(1):283–313, 2015.
•  Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
•  Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. In Annales de l'institut Fourier, volume 48, pages 769–784, 1998.
•  Ming-Jun Lai and Wotao Yin. Augmented ℓ₁ and nuclear-norm models with a globally linearly convergent algorithm. SIAM Journal on Imaging Sciences, 6(2):1059–1091, 2013.
•  Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1705.09056, 2017.
•  Stanislas Łojasiewicz. Sur la géométrie semi-et sous-analytique. Ann. Inst. Fourier, 43(5):1575–1595, 1993.
•  Boris S Mordukhovich. Variational analysis and generalized differentiation I: Basic theory, volume 330. Springer Science & Business Media, 2006.
•  Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
•  Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer, 1985.
•  R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
•  Ralph Tyrell Rockafellar. Convex analysis. Princeton university press, 2015.
•  Ernest K Ryu and Wotao Yin. Proximal-proximal-gradient method. arXiv preprint arXiv:1708.06908, 2017.
•  Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
•  Mikhail V Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.
•  Tao Sun, Robert Hannah, and Wotao Yin. Asynchronous coordinate descent under more realistic assumptions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6183–6191. 2017.
•  Tao Sun, Hao Jiang, Lizhi Cheng, and Wei Zhu. A convergence frame for inexact nonconvex and nonsmooth algorithms and its applications to several iterations. arXiv preprint arXiv:1709.04072, 2017.
•  Paul Tseng and Sangwoon Yun. Incrementally updated gradient methods for constrained and regularized optimization. Journal of Optimization Theory and Applications, 160(3):832–853, 2014.
•  Nuri Denizcan Vanli, Mert Gurbuzbalaban, and Asu Ozdaglar. Global convergence rate of proximal incremental aggregated gradient methods. arXiv preprint arXiv:1608.01713, 2016.
•  Yao-Liang Yu. On decomposing the proximal map. In Advances in Neural Information Processing Systems, pages 91–99, 2013.