# Non-ergodic Convergence Analysis of Heavy-Ball Algorithms

In this paper, we revisit the convergence of the Heavy-ball method, and present improved convergence complexity results in the convex setting. We provide the first non-ergodic O(1/k) rate result of the Heavy-ball algorithm with constant step size for coercive objective functions. For objective functions satisfying a relaxed strongly convex condition, the linear convergence is established under weaker assumptions on the step size and inertial parameter than made in the existing literature. We extend our results to multi-block version of the algorithm with both the cyclic and stochastic update rules. In addition, our results can also be extended to decentralized optimization, where the ergodic analysis is not applicable.


## Introduction

In this paper, we study the Heavy-ball algorithm first proposed by Polyak (1964), for solving the following unconstrained minimization problem

$$\min_{x\in\mathbb{R}^m} f(x), \tag{1}$$

where $f$ is convex and differentiable, and $\nabla f$ is Lipschitz continuous with constant $L$. The Heavy-ball method iterates

$$x^{k+1}=x^k-\gamma_k\nabla f(x^k)+\beta_k(x^k-x^{k-1}), \tag{2}$$

where $\gamma_k$ is the step size and $\beta_k$ is the inertial parameter. Unlike gradient descent, the sequence generated by the Heavy-ball method is not Fejér monotone, because of the inertial term $\beta_k(x^k-x^{k-1})$. This makes it challenging to prove a convergence rate for $f(x^k)$ in the convex case. In the existing literature, the sublinear convergence rate of the Heavy-ball method has been proved only in the ergodic sense.
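As a concrete illustration of iteration (2), the following minimal sketch runs the Heavy-ball update with a constant step size and a constant inertial parameter. The toy quadratic objective, the helper names `heavy_ball` and `grad_f`, and the parameter values are assumptions chosen for illustration only; they are not taken from the experiments of this paper.

```python
def heavy_ball(grad, x0, gamma, beta, num_iters):
    """Iterate x^{k+1} = x^k - gamma * grad(x^k) + beta * (x^k - x^{k-1})."""
    x_prev = list(x0)  # convention x^{-1} := x^0, so the first step is a gradient step
    x = list(x0)
    for _ in range(num_iters):
        g = grad(x)
        x_next = [xi - gamma * gi + beta * (xi - pi)
                  for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x


# Toy objective (an assumption): f(x) = 0.5 * (x_1^2 + 10 * x_2^2),
# whose gradient is Lipschitz with constant L = 10 and whose minimizer is 0.
def grad_f(x):
    return [x[0], 10.0 * x[1]]


L, beta, c = 10.0, 0.5, 0.9
gamma = 2.0 * (1.0 - beta) * c / L  # step size of the form 2(1 - beta)c / L
x_final = heavy_ball(grad_f, [1.0, 1.0], gamma, beta, 500)
```

Setting `beta = 0.0` in this sketch recovers plain gradient descent, which makes it easy to compare the two methods on the same problem.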

When the objective function is twice continuously differentiable and strongly convex (i.e., almost quadratic), the Heavy-ball method provably converges linearly. Under the weaker assumption that the objective function is nonconvex but Lipschitz differentiable, Zavriev and Kostyuk (1993) proved that the sequence generated by the Heavy-ball method converges to a critical point, yet without specifying the convergence rate. The smoothness of the objective function is crucial for the convergence of the Heavy-ball method. Indeed, it can diverge for a strongly convex but nonsmooth function, as suggested by Lessard, Recht, and Packard (2016). Unlike classical gradient descent methods, the Heavy-ball algorithm fails to generate a Fejér monotone sequence. In the convex and smooth case, the only result about the convergence rate, to our knowledge, is the ergodic rate in terms of the objective value (Ghadimi, Feyzmahdavian, and Johansson, 2015), i.e., $f\big(\frac{1}{k}\sum_{i=1}^{k}x^i\big)-\min f=O(1/k)$. The linear convergence of the Heavy-ball algorithm was proved under the strong convexity assumption by Ghadimi, Feyzmahdavian, and Johansson (2015), but the authors imposed a restrictive assumption on the inertial parameter $\beta_k$. Specifically, when the strong convexity constant is tiny, the convergence result holds only for a small range of $\beta_k$ values.

By incorporating the idea of the proximal mapping, the inertial proximal gradient algorithm (iPiano) was proposed in (Ochs et al., 2014), whose convergence in the nonconvex case was thoroughly discussed. Local linear convergence of iPiano and the Heavy-ball method was later proved in (Ochs, 2016). In the strongly convex case, linear convergence was proved for iPiano with fixed $\beta_k$ (Ochs, Brox, and Pock, 2015). In (Pock and Sabach, 2016), inertial Proximal Alternating Linearized Minimization (iPALM) was introduced as a variant of iPiano for solving the two-block regularized problem. Xu and Yin (2013) analyzed the Heavy-ball algorithm in tensor minimization problems. Stochastic versions of Heavy-ball have also been introduced (Loizou and Richtárik, 2017b, a). A multi-step Heavy-ball algorithm was analyzed in (Liang, Fadili, and Peyré, 2016). Inertial methods have also been developed and studied in the operator research by Combettes and Glaudin (2017). None of the aforementioned Heavy-ball based algorithms, however, provides a non-ergodic convergence rate.

### Contributions

In this paper, we establish the first non-ergodic convergence result in the general convex case. More precisely, we prove that $f(x^k)-\min f=O(1/k)$ for convex and coercive $f$ (we say $f$ is coercive if $f(x)\to+\infty$ as $\|x\|\to+\infty$). Compared with the existing result in (Ghadimi, Feyzmahdavian, and Johansson, 2015), ours allows a larger step size $\gamma_k$. We also prove a linear convergence result under a restricted strongly convex condition, which is weaker than the strong convexity assumption. In short, we make weaker assumptions on the step size, on the inertial parameter, and on the convexity of the objective function. We also study the convergence of multi-block extensions of the Heavy-ball method, and prove sublinear and linear convergence rates for both the cyclic and the stochastic update rules. In addition, we extend our analysis to the decentralized Heavy-ball method, where the ergodic analysis is not applicable. Our theoretical results are based on a novel Lyapunov function, which is motivated by a modified dynamical system.

### A dynamical system interpretation

It has been long known that the Heavy-ball method is equivalent to the discretization of the following second-order ODE (Alvarez, 2000):

$$\ddot{x}(t)+\alpha\dot{x}(t)+\nabla f(x(t))=0,\quad t\ge 0, \tag{3}$$

for some $\alpha>0$. In the case $\beta_k\equiv 0$, the Heavy-ball method boils down to standard gradient descent, which is known to be the discretization of the following first-order ODE

$$\alpha\dot{x}(t)+\nabla f(x(t))=0,\quad t\ge 0. \tag{4}$$

The dynamical system (3), however, misses essential information about the relation between $\ddot{x}(t)$ and $\dot{x}(t)$. Specifically, if we replace $\ddot{x}(t)$ by $\frac{x^{k+1}-2x^k+x^{k-1}}{h^2}$, with $h$ being the discretization step size, then it holds that

$$\left\|\frac{x^{k+1}-2x^k+x^{k-1}}{h^2}\right\|\le\frac{1}{h}\cdot\left(\left\|\frac{x^{k+1}-x^k}{h}\right\|+\left\|\frac{x^k-x^{k-1}}{h}\right\|\right).$$

Since both $\frac{x^{k+1}-x^k}{h}$ and $\frac{x^k-x^{k-1}}{h}$ can be viewed as discretizations of $\dot{x}(t)$, we propose to modify (3) by adding the following constraint

$$\|\ddot{x}(t)\|\le\theta\|\dot{x}(t)\|, \tag{5}$$

where $\theta>0$. In the next section, we will devise a useful Lyapunov function by exploiting the additional constraint (5) and establish the asymptotic non-ergodic sublinear convergence rate in the continuous setting. Finally, we will "translate" this analysis into the discretized setting.

## Analysis of the dynamical system

We analyze the modified dynamical system (3) + (5). The existence of the solution is beyond the scope of this paper and will not be discussed here. Let us assume that $f$ is coercive, $\alpha>\theta$, and . We consider the Lyapunov function

$$\xi(t):=f(x(t))+\frac{1}{2}\|\dot{x}(t)\|^2-\min f\ \ge 0, \tag{6}$$

and refer the readers to the relevant equations (3)-(6). A direct calculation gives

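$$\dot{\xi}(t)=\langle\nabla f(x(t)),\dot{x}(t)\rangle+\langle\ddot{x}(t),\dot{x}(t)\rangle=-\alpha\|\dot{x}(t)\|^2, \tag{7}$$

which means $\xi$ is non-increasing. As a result,

$$\sup_t\{f(x(t))-\min f\}\le\sup_t\xi(t)\le\xi(0).$$

By the coercivity of $f$, $x(t)$ is bounded. Then by the continuity of $\nabla f$, $\nabla f(x(t))$ is bounded; using (3), $\ddot{x}(t)+\alpha\dot{x}(t)$ is also bounded. By the triangle inequality and (5), we have

$$\|\ddot{x}(t)+\alpha\dot{x}(t)\|\ge\alpha\|\dot{x}(t)\|-\|\ddot{x}(t)\|\ge(\alpha-\theta)\|\dot{x}(t)\|. \tag{8}$$

Since $\alpha>\theta$, we obtain the boundedness of $\dot{x}(t)$; by (5), $\ddot{x}(t)$ is also bounded. Let $x^*\in\arg\min f$; we have

$$\begin{aligned}
0\le f(x(t))-f(x^*)&\overset{a)}{\le}\langle\nabla f(x(t)),x(t)-x^*\rangle\\
&\overset{b)}{\le}\|\nabla f(x(t))\|\cdot\|x(t)-x^*\|\\
&\overset{c)}{=}\|\ddot{x}(t)+\alpha\dot{x}(t)\|\cdot\|x(t)-x^*\|\\
&\overset{d)}{\le}(\|\ddot{x}(t)\|+\alpha\|\dot{x}(t)\|)\cdot\|x(t)-x^*\|\\
&\overset{e)}{\le}(\alpha+\theta)\|\dot{x}(t)\|\cdot\|x(t)-x^*\|,
\end{aligned} \tag{9}$$

where a) is due to the convexity of $f$; b) is due to the Cauchy-Schwarz inequality; c) is due to (3); d) is due to the triangle inequality; e) is because of (5). Denote

$$r:=\sup_{t\ge 0}\left\{(\alpha+\theta)\cdot\|x(t)-x^*\|+\frac{\|\dot{x}(t)\|}{2}\right\}.$$

Since $x(t)$ and $\dot{x}(t)$ are both bounded, we have $r<+\infty$. Using (9), we have

$$\begin{aligned}
\xi(t)^2&=\Big(f(x(t))-f(x^*)+\frac{1}{2}\|\dot{x}(t)\|^2\Big)^2\\
&\le\Big((\alpha+\theta)\cdot\|x(t)-x^*\|\cdot\|\dot{x}(t)\|+\frac{1}{2}\|\dot{x}(t)\|^2\Big)^2\\
&=\Big(\big((\alpha+\theta)\cdot\|x(t)-x^*\|+\frac{1}{2}\|\dot{x}(t)\|\big)\cdot\|\dot{x}(t)\|\Big)^2\\
&\le r^2\|\dot{x}(t)\|^2.
\end{aligned} \tag{10}$$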

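Combining (7) and (10), we have $\dot{\xi}(t)\le-\frac{\alpha}{r^2}\xi(t)^2$, or equivalently,

$$\frac{d\xi}{\xi^2}\le-\frac{\alpha}{r^2}dt. \tag{11}$$

Taking the integral of both sides from $0$ to $t$, which is legitimate as long as $\xi>0$ on $[0,t]$ (if $\xi$ vanishes somewhere on $[0,t]$, then $\xi(t)=0$ by monotonicity and the bound below is trivial), we get

$$\frac{1}{\xi(0)}-\frac{1}{\xi(t)}\le-\frac{\alpha}{r^2}t,$$

and thus $\xi(t)\le\frac{1}{\frac{\alpha}{r^2}t+\frac{1}{\xi(0)}}$. Since $f(x(t))-\min f\le\xi(t)$, we have

$$f(x(t))-\min f\le\frac{1}{\frac{\alpha}{r^2}t+\frac{1}{\xi(0)}}.$$

Thus we have derived the asymptotic sublinear convergence rate for $f(x(t))-\min f$.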
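For readers who want the integration step spelled out, the passage from the differential inequality to the $O(1/t)$ rate can be written as the following routine calculation. It starts from $\dot{\xi}\le-\frac{\alpha}{r^2}\xi^2$, which follows from (7) and (10), and it assumes $\xi(s)>0$ for all $s\in[0,t]$ so that $1/\xi$ is well defined:

$$\frac{d}{ds}\Big(\frac{1}{\xi(s)}\Big)=-\frac{\dot{\xi}(s)}{\xi(s)^2}\ge\frac{\alpha}{r^2}
\ \Longrightarrow\
\frac{1}{\xi(t)}-\frac{1}{\xi(0)}\ge\frac{\alpha}{r^2}t
\ \Longrightarrow\
\xi(t)\le\frac{1}{\frac{\alpha}{r^2}t+\frac{1}{\xi(0)}}\le\frac{r^2}{\alpha t}.$$

The last inequality simply drops the nonnegative term $\frac{1}{\xi(0)}$, which makes the $O(1/t)$ decay explicit.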

## Convergence analysis of Heavy-ball

In this section, we prove convergence rates of the Heavy-ball method. The core of the proof is to construct a proper Lyapunov function. The expression of $\xi(t)$ in (6) suggests that the Lyapunov function be of the form $f(x^k)+\delta_k\|x^k-x^{k-1}\|^2-\min f$ for some $\delta_k>0$. In fact, we have the following sufficient descent lemma. All technical proofs for the rest of the paper are provided in the supplementary materials.

###### Lemma 1

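Suppose $f$ is convex with $L$-Lipschitz gradient and $\min f>-\infty$. Let $(x^k)_{k\ge 0}$ be generated by the Heavy-ball method with non-increasing $(\beta_k)_{k\ge 0}\subseteq[0,1)$. By choosing the step size

$$\gamma_k=\frac{2(1-\beta_k)c}{L}$$

with fixed $0<c<1$, we have

$$\Big[f(x^k)+\frac{\beta_k}{2\gamma_k}\|x^k-x^{k-1}\|^2\Big]-\Big[f(x^{k+1})+\frac{\beta_{k+1}}{2\gamma_{k+1}}\|x^{k+1}-x^k\|^2\Big]\ge\frac{(1-c)L}{2c}\|x^{k+1}-x^k\|^2. \tag{12}$$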
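The descent inequality (12) can be checked numerically. The sketch below runs the Heavy-ball iteration on a small diagonal quadratic and records the smallest value of (left-hand side minus right-hand side) of (12) over the run. The objective, the particular non-increasing schedule `beta(k)`, and the function names are assumptions made for illustration; the step size follows the rule stated in Lemma 1.

```python
def f(x, lam):
    """Diagonal quadratic f(x) = 0.5 * sum_i lam_i * x_i^2 (convex, min f = 0)."""
    return 0.5 * sum(l * xi * xi for l, xi in zip(lam, x))


def grad(x, lam):
    return [l * xi for l, xi in zip(lam, x)]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def check_lemma1(lam, x0, c, num_iters):
    """Return the minimum of LHS - RHS of inequality (12) along the iterates."""
    L = max(lam)  # Lipschitz constant of the gradient

    def beta(k):  # hypothetical non-increasing schedule with values in [0, 1)
        return 0.8 / (1.0 + 0.01 * k)

    def gamma(k):  # step-size rule of Lemma 1
        return 2.0 * (1.0 - beta(k)) * c / L

    x_prev, x = list(x0), list(x0)
    worst_gap = float("inf")
    for k in range(num_iters):
        g = grad(x, lam)
        x_next = [xi - gamma(k) * gi + beta(k) * (xi - pi)
                  for xi, gi, pi in zip(x, g, x_prev)]
        lhs = (f(x, lam) + beta(k) / (2.0 * gamma(k)) * sq_dist(x, x_prev)) \
            - (f(x_next, lam)
               + beta(k + 1) / (2.0 * gamma(k + 1)) * sq_dist(x_next, x))
        rhs = (1.0 - c) * L / (2.0 * c) * sq_dist(x_next, x)
        worst_gap = min(worst_gap, lhs - rhs)
        x_prev, x = x, x_next
    return worst_gap


gap = check_lemma1([0.1, 1.0, 10.0], [1.0, 1.0, 1.0], c=0.5, num_iters=200)
```

A nonnegative `gap` (up to rounding error) means that (12) held at every iteration of this particular run; it is a sanity check, not a proof.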

According to Lemma 1, a potentially useful Lyapunov function is $f(x^k)+\frac{\beta_k}{2\gamma_k}\|x^k-x^{k-1}\|^2$, as it has the descent property shown in (12). However, it does not fulfill the relation in (11) (that is, we are not able to build a useful error relation for it). Therefore, we rewrite (12) so that the new right-hand side contains something like $\|x^{k+1}-x^k\|^2+\|x^k-x^{k-1}\|^2$. It turns out that a better Lyapunov function reads

$$\xi_k:=f(x^k)+\delta_k\|x^k-x^{k-1}\|^2-\min f, \tag{13}$$

where

$$\delta_k:=\frac{\beta_k}{2\gamma_k}+\frac{1}{2}\Big(\frac{1-\beta_k}{\gamma_k}-\frac{L}{2}\Big). \tag{14}$$

We can see that $\xi_k$ is in line with the discretization of (6).
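To see why this choice of $\delta_k$ is natural, the following short calculation (a sketch derived here from Lemma 1, not copied from the supplementary proofs) shows the descent property of $\xi_k$. Under the step-size rule $\gamma_k=\frac{2(1-\beta_k)c}{L}$ we have $\frac{1-\beta_k}{\gamma_k}=\frac{L}{2c}$, hence $\delta_k=\frac{\beta_k}{2\gamma_k}+\frac{(1-c)L}{4c}$, and (12) gives

$$\begin{aligned}
\xi_k-\xi_{k+1}
&=\Big[f(x^k)+\frac{\beta_k}{2\gamma_k}\|x^k-x^{k-1}\|^2\Big]-\Big[f(x^{k+1})+\frac{\beta_{k+1}}{2\gamma_{k+1}}\|x^{k+1}-x^k\|^2\Big]\\
&\quad+\frac{(1-c)L}{4c}\big(\|x^k-x^{k-1}\|^2-\|x^{k+1}-x^k\|^2\big)\\
&\ge\frac{(1-c)L}{4c}\big(\|x^{k+1}-x^k\|^2+\|x^k-x^{k-1}\|^2\big).
\end{aligned}$$

Both consecutive differences now appear on the right-hand side, which is what a discrete counterpart of the continuous argument requires.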

Given the Lyapunov function in (13), we present a key technical lemma.

###### Lemma 2

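Suppose the assumptions of Lemma 1 hold. Let $\overline{x^k}$ denote the projection of $x^k$ onto $\arg\min f$, assumed to exist, and define

$$\varepsilon_k:=\frac{4c\delta_k^2}{(1-c)L}+\frac{4c}{(1-c)L\gamma_k^2}. \tag{15}$$

Then it holds that

$$(\xi_k)^2\le\varepsilon_k\times(\xi_k-\xi_{k+1})\times\big(2\|x^k-\overline{x^k}\|^2+\|x^k-x^{k-1}\|^2\big). \tag{16}$$

We see that (16) is the discretization of (11) if .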

### Sublinear convergence

We present the non-ergodic convergence rate of the function value. This rate holds when . We define

$$R:=\sup_{k\ge 0}\ \sup_{x^*\in\arg\min f}\{\|x^k-x^*\|^2\}. \tag{17}$$

In our following settings, we can see it actually holds that $R<+\infty$.

###### Theorem 1

Under the assumptions of Lemma 1 and assumptions that and $f$ is coercive, we have

$$f(x^k)-\min f\le\frac{4R\cdot\sup_k\{\varepsilon_k\}}{k}. \tag{18}$$

To the best of our knowledge, this is the first non-ergodic result established for the Heavy-ball algorithm in the convex case. The definition of $\varepsilon_k$ implies $\sup_k\{\varepsilon_k\}=O(L)$, so it holds that

$$f(x^k)-\min f=O\Big(\frac{R\cdot L}{k}\Big),$$

which is of the same order of complexity as that of gradient descent.

The coercivity assumption on $f$ is crucial for Theorem 1. When the function fails to be coercive, we need to assume that $(\beta_k)_{k\ge 0}$ is summable instead.

###### Corollary 1

Suppose the assumptions of Lemma 1 hold, and $\sum_k\beta_k<+\infty$ (a classical example is $\beta_k=1/k^{\theta}$, where $\theta>1$). Let $(x^k)_{k\ge 0}$ be generated by the Heavy-ball algorithm and be coercive. Then,

$$f(x^k)-\min f\le\frac{4R\cdot\sup_k\{\varepsilon_k\}}{k}.$$

### Linear convergence with restricted strong convexity

We say the function $f$ satisfies a restricted strongly convex condition (Lai and Yin, 2013), if

$$f(x)-\min f\ge\nu\|x-\overline{x}\|^2, \tag{19}$$

where $\overline{x}$ is the projection of $x$ onto the set $\arg\min f$, and $\nu>0$. Restricted strong convexity is weaker than strong convexity. For example, let us consider the function with . When fails to be full row-rank, $f$ is not strongly convex but is restricted strongly convex.

###### Theorem 2

Suppose the assumptions of Theorem 1 hold, and that $f$ satisfies condition (19). Then we have

$$f(x^k)-\min f\le\omega^k,$$

for and .

Our result improves the linear convergence established by Ghadimi, Feyzmahdavian, and Johansson (2015) in two aspects. First, the strong convexity assumption is weakened to (19). Second, the step size and inertial parameter are chosen independently of the strong convexity constant.

## Cyclic coordinate descent Heavy-ball algorithm

In this section, we consider the multi-block version of Heavy-ball algorithm and prove its convergence rates under convexity assumption. The minimization problem reads

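$$\min_{x_1,x_2,\ldots,x_m}f(x_1,x_2,\ldots,x_m). \tag{20}$$

The function $f$ is assumed to satisfy

$$\|\nabla_i f(x)-\nabla_i f(y)\|\le L_i\|x-y\|. \tag{21}$$

With (21), we can easily obtain

$$f(x_1,x_2,\ldots,x_i^1,\ldots,x_m)\le f(x_1,x_2,\ldots,x_i^2,\ldots,x_m)+\langle\nabla_i f(x_1,x_2,\ldots,x_i^2,\ldots,x_m),x_i^1-x_i^2\rangle+\frac{L_i}{2}\|x_i^1-x_i^2\|^2. \tag{22}$$

The proof is similar to [Lemma 1.2.3, Nesterov (2013)], and we shall skip it here. We denote

$$\nabla_i^k f:=\nabla_i f(x_1^{k+1},\ldots,x_{i-1}^{k+1},x_i^k,\ldots,x_m^k),\qquad x^k:=(x_1^k,x_2^k,\ldots,x_m^k),\qquad L:=\sum_{i=1}^m L_i,$$

with the convention . The cyclic coordinate descent inertial algorithm iterates: for $i$ from $1$ to $m$,

$$x_i^{k+1}=x_i^k-\gamma_{k,i}\nabla_i^k f+\beta_{k,i}(x_i^k-x_i^{k-1}), \tag{23}$$

where . Our analysis relies on the following assumption:

A1: for any $i\in[1,2,\ldots,m]$, the sequence of parameters $(\beta_{k,i})_{k\ge 0}$ is non-increasing.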
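The cyclic scheme (23) can be sketched in a few lines. The example below applies it, with scalar blocks, to a small strongly convex quadratic. The matrix `H`, the vector `b`, the constant inertial parameter, and all function names are assumptions for illustration; the block constants `L_blocks` are taken as the Euclidean norms of the rows of `H`, which satisfy (21) for this quadratic, and the step sizes have the form $2(1-\beta)c/L_i$.

```python
import math

# Hypothetical test problem: f(x) = 0.5 * x^T H x - b^T x with H positive definite.
H = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]


def f(x):
    quad = sum(x[i] * H[i][j] * x[j] for i in range(3) for j in range(3))
    return 0.5 * quad - sum(bi * xi for bi, xi in zip(b, x))


def partial_grad(x, i):
    """i-th partial derivative of f at x."""
    return sum(H[i][j] * x[j] for j in range(len(x))) - b[i]


def cyclic_heavy_ball(x0, beta, c, num_epochs):
    m = len(x0)
    # Row norms of H are valid constants L_i in (21) for this quadratic.
    L_blocks = [math.sqrt(sum(h * h for h in H[i])) for i in range(m)]
    gammas = [2.0 * (1.0 - beta) * c / Li for Li in L_blocks]
    x_old, x = list(x0), list(x0)
    lyapunov = []  # f(x^k) + sum_i beta / (2 gamma_i) * (x_i^k - x_i^{k-1})^2
    for _ in range(num_epochs):
        lyapunov.append(f(x) + sum(beta / (2.0 * gammas[i]) * (x[i] - x_old[i]) ** 2
                                   for i in range(m)))
        x_new = list(x)
        for i in range(m):
            # Blocks 0..i-1 of x_new are already updated; blocks i..m-1 are old.
            g = partial_grad(x_new, i)
            x_new[i] = x[i] - gammas[i] * g + beta * (x[i] - x_old[i])
        x_old, x = x, x_new
    return x, lyapunov


x_cyc, lyap = cyclic_heavy_ball([0.0, 0.0, 0.0], beta=0.5, c=0.9, num_epochs=2000)
```

The recorded `lyapunov` values correspond to the bracketed quantity that appears in the sufficient descent estimate for the cyclic scheme, so they should be non-increasing along the run.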

###### Lemma 3

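Let $f$ be a convex function satisfying (21), with finite $\min f$. Let $(x^k)_{k\ge 0}$ be generated by scheme (23) and let Assumption A1 hold. Choosing the step size

$$\gamma_{k,i}=\frac{2(1-\beta_{k,i})c}{L_i},\quad i\in[1,2,\ldots,m]$$

for arbitrary fixed $0<c<1$, we have

$$\Big[f(x^k)+\sum_{i=1}^m\frac{\beta_{k,i}}{2\gamma_{k,i}}\|x_i^k-x_i^{k-1}\|^2\Big]-\Big[f(x^{k+1})+\sum_{i=1}^m\frac{\beta_{k+1,i}}{2\gamma_{k+1,i}}\|x_i^{k+1}-x_i^k\|^2\Big]\ge\frac{(1-c)\underline{L}}{2c}\|x^{k+1}-x^k\|^2, \tag{24}$$

where $\underline{L}:=\min_{1\le i\le m}\{L_i\}$.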

We consider the following similar Lyapunov function in the analysis of cyclic coordinate descent Heavy-ball algorithm

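$$\hat{\xi}_k:=f(x^k)+\sum_{i=1}^m\delta_{k,i}\|x_i^k-x_i^{k-1}\|^2-\min f, \tag{25}$$

where

$$\delta_{k,i}:=\frac{\beta_{k,i}}{2\gamma_{k,i}}+\frac{1}{2}\Big(\frac{1-\beta_{k,i}}{\gamma_{k,i}}-\frac{L_i}{2}\Big). \tag{26}$$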

Then we have the following lemma.

###### Lemma 4

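Suppose the conditions of Lemma 3 hold. Let $\overline{x^k}$ denote the projection of $x^k$ onto $\arg\min f$, assumed to exist, and define

$$\hat{\varepsilon}_k:=\max\left\{\frac{4c\cdot\sum_{i=1}^m\big(\delta_{k+1,i}^2+\frac{1}{\gamma_{k,i}^2}\big)}{(1-c)\underline{L}},\ \frac{4c\cdot m\cdot L}{(1-c)\underline{L}}\right\}. \tag{27}$$

It holds that

$$(\hat{\xi}_k)^2\le\hat{\varepsilon}_k(\hat{\xi}_k-\hat{\xi}_{k+1})\times\big(2\|x^k-\overline{x^k}\|^2+\|x^k-x^{k-1}\|^2\big). \tag{28}$$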

### Sublinear convergence of cyclic coordinate descent Heavy-ball algorithm

We show the convergence rate of the cyclic coordinate descent Heavy-ball algorithm for coercive $f$.

###### Theorem 3

Suppose the conditions of Lemma 3 hold, $f$ is coercive, and

 0

Then we have

$$f(x^k)-\min f=O\Big(\frac{4R\cdot\sup_k\{\hat{\varepsilon}_k\}}{k}\Big), \tag{29}$$

where $R$ is given by (17).

We readily check that , where $m$ is the number of blocks. Therefore, the cyclic inertial algorithm converges with the rate . Compared with the results in (Sun and Hong, 2015), this rate is of the same order as that of cyclic block coordinate descent in the general convex setting.

### Linear convergence of cyclic coordinate descent Heavy-ball algorithm

Under the same assumption of restricted strong convexity, we derive the linear convergence rate for cyclic coordinate descent Heavy-ball algorithm.

###### Theorem 4

Suppose the conditions of Lemma 3 hold, $f$ satisfies (19), and

 0

Then we have

$$f(x^k)-\min f\le(\hat{\omega})^k \tag{30}$$

for some , and .

This result can be extended to the essentially cyclic Heavy-ball algorithm. The essentially cyclic index selection strategy (Sun, Hannah, and Yin, 2017), which generalizes the cyclic update rule, is defined as follows: there is an , , such that each block is updated at least once in a window of .

## Stochastic coordinate descent Heavy-ball algorithm

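For the stochastic index selection strategy, in the $k$-th iteration, we pick $i_k$ uniformly from $[1,2,\ldots,m]$ and iterate

$$\begin{cases}x_{i_k}^{k+1}=x_{i_k}^k-\gamma_k\nabla_{i_k}f(x^k)+\beta_k(x_{i_k}^k-x_{i_k}^{k-1}),\\ x_i^{k+1}=x_i^k,\quad\text{if } i\ne i_k.\end{cases} \tag{31}$$

In this section, we make the following assumption:

A2: the sequence of parameters $(\beta_k)_{k\ge 0}$ is non-increasing.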

Assumption A2 is quite different from previous requirement that which are constrained on . This difference comes from the uniformly stochastic selection of the index.
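A minimal sketch of scheme (31) with scalar blocks is given below. The quadratic test problem, the upper bound `L = 5.0` on the gradient Lipschitz constant (the largest absolute row sum of `H`), the constant inertial parameter, and the function names are assumptions for illustration; the step size has the form $2(1-\beta/\sqrt{m})c/L$.

```python
import math
import random

# Hypothetical test problem: f(x) = 0.5 * x^T H x - b^T x with H positive definite.
H = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]


def f(x):
    quad = sum(x[i] * H[i][j] * x[j] for i in range(3) for j in range(3))
    return 0.5 * quad - sum(bi * xi for bi, xi in zip(b, x))


def grad_i(x, i):
    """i-th partial derivative of f at x."""
    return sum(H[i][j] * x[j] for j in range(len(x))) - b[i]


def stochastic_heavy_ball(x0, L, beta, c, num_iters, seed=0):
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    m = len(x0)
    gamma = 2.0 * (1.0 - beta / math.sqrt(m)) * c / L
    x_prev, x = list(x0), list(x0)
    for _ in range(num_iters):
        i = rng.randrange(m)  # uniform index selection
        x_next = list(x)      # all coordinates except i are kept unchanged
        x_next[i] = x[i] - gamma * grad_i(x, i) + beta * (x[i] - x_prev[i])
        x_prev, x = x, x_next
    return x


x_start = [0.0, 0.0, 0.0]
x_sto = stochastic_heavy_ball(x_start, L=5.0, beta=0.5, c=0.5, num_iters=3000)
```

Note that the inertial term `x[i] - x_prev[i]` is nonzero only when coordinate `i` was also the one updated in the previous iteration, since `x` and `x_prev` differ in a single coordinate.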

###### Lemma 5

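Let $f$ be a convex function whose gradient is Lipschitz continuous with constant $L$, with finite $\min f$. Let $(x^k)_{k\ge 0}$ be generated by scheme (31) and let Assumption A2 be satisfied. Choose the step size

$$\gamma_k=\frac{2(1-\beta_k/\sqrt{m})c}{L}$$

for arbitrary fixed $0<c<1$. Then, we can obtain

$$\Big[\mathbb{E}f(x^k)+\frac{\beta_k}{2\sqrt{m}\gamma_k}\mathbb{E}\|x^k-x^{k-1}\|^2\Big]-\Big[\mathbb{E}f(x^{k+1})+\frac{\beta_{k+1}}{2\sqrt{m}\gamma_{k+1}}\mathbb{E}\|x^{k+1}-x^k\|^2\Big]\ge\frac{(1-c)L}{2c}\mathbb{E}\|x^{k+1}-x^k\|^2. \tag{32}$$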

Similarly, we consider the following function

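$$\bar{\xi}_k:=f(x^k)+\bar{\delta}_k\|x^k-x^{k-1}\|^2-\min f, \tag{33}$$

where

$$\bar{\delta}_k:=\frac{\beta_k}{2\sqrt{m}\gamma_k}+\frac{1}{2}\Big(\frac{1-\beta_k/\sqrt{m}}{\gamma_k}-\frac{L}{2}\Big). \tag{34}$$

Different from the previous analyses, the Lyapunov function considered here is $\mathbb{E}\bar{\xi}_k$ instead of $\bar{\xi}_k$. Naturally, the sufficient descent property is established in the sense of expectation.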

###### Lemma 6

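Suppose the conditions of Lemma 5 hold. Let $\overline{x^k}$ denote the projection of $x^k$ onto $\arg\min f$, assumed to exist, and define

$$\bar{\varepsilon}_k:=\frac{4c\delta_k^2}{(1-c)L}+\frac{8cm}{(1-c)L\gamma_k^2}. \tag{35}$$

Then it holds that

$$(\mathbb{E}\bar{\xi}_k)^2\le\bar{\varepsilon}_k\cdot(\mathbb{E}\bar{\xi}_k-\mathbb{E}\bar{\xi}_{k+1})\times\big(\mathbb{E}\|x^k-\overline{x^k}\|^2+\mathbb{E}\|x^k-x^{k-1}\|^2\big). \tag{36}$$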

### Sublinear convergence of stochastic coordinate descent Heavy-ball algorithm

Because the sufficient descent condition involves expectations, we cannot obtain the boundedness of the generated points even when $f$ is coercive. Therefore, we first present a result that assumes only the smoothness of $f$.

###### Theorem 5

Suppose that the assumptions of Lemma 5 hold. Then we have

$$\min_{0\le i\le k}\mathbb{E}\|\nabla f(x^i)\|=o\Big(\frac{1}{\sqrt{k}}\Big). \tag{37}$$

We remark that Theorem 5 also holds for nonconvex functions. To obtain the sublinear convergence rate on the function values, we need a boundedness assumption. Precisely, the assumption is

A3: the sequence $(x^k)_{k\ge 0}$ satisfies

$$\bar{R}:=\sup_k\{\mathbb{E}\|x^k-\overline{x^k}\|^2\}<+\infty.$$

Under assumption A3, we are able to show the non-ergodic sublinear convergence rate of the expected objective values.

###### Theorem 6

Suppose that the assumptions of Lemma 5 and A3 hold. Then we have

$$\mathbb{E}f(x^k)-\min f=O\Big(\frac{4\bar{R}\cdot\sup_k\{\bar{\varepsilon}_k\}}{k}\Big). \tag{38}$$

### Linear convergence of stochastic coordinate descent Heavy-ball algorithm

The linear convergence rate of stochastic coordinate descent Heavy-ball algorithm is similar to previous ones. By assuming the restricted strongly convex condition, the linear convergence rate of the expected objective values can be proved.

###### Theorem 7

Suppose that the assumptions of Lemma 5 hold, and the function $f$ satisfies the restricted strongly convex condition (19). Let $(x^k)_{k\ge 0}$ be generated by the scheme (31). Then we have

$$\mathbb{E}f(x^k)-\min f\le(\bar{\omega})^k, \tag{39}$$

where , and .

While we only consider the uniform probability selection strategy here, the same convergence results can be easily extended to the non-uniform probability selection strategy.

## Applications to decentralized optimization

We apply the analysis to the following decentralized optimization problem

$$\min_{x\in\mathbb{R}^n}\Big\{\sum_{i=1}^m f_i(x)\Big\},$$

where $f_i$ is differentiable and $\nabla f_i$ is $L_i$-Lipschitz. Denote by $x(i)$ the local copy of $x$ at node $i$ and $X:=[x(1),x(2),\ldots,x(m)]^\top$. In the community of decentralized algorithms, rather than directly solving the problem, the following penalty formulation has been proposed instead:

$$\min_{X\in\mathbb{R}^{m\times n}}\Big\{F(X)=f(X)+\frac{X^\top(I-W)X}{2\alpha}\Big\}, \tag{40}$$

where $W$ is the mixing matrix, $f(X):=\sum_{i=1}^m f_i(x(i))$, and $I$ is the unit matrix. It is easy to see that $\nabla F$ is Lipschitz with the constant $L_F:=\max_i\{L_i\}+\frac{1-\lambda_{\min}(W)}{\alpha}$, where $\lambda_{\min}(W)$ is the minimum eigenvalue of $W$. Researchers consider the decentralized gradient descent (DGD) (Nedic and Ozdaglar, 2009), which is essentially gradient descent applied to (40) with stepsize equal to $\alpha$. This algorithm can be implemented over a connected network, in which the agents communicate with their neighbors and make full use of the computing resources of all nodes. Alternatively, we can use the Heavy-ball method with stepsize $\alpha$, that is,

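$$X^{k+1}=X^k-\alpha\nabla F(X^k)+\beta(X^k-X^{k-1})=WX^k-\alpha\nabla f(X^k)+\beta(X^k-X^{k-1}).$$

For node $i$, the local scheme is then

$$x^{k+1}(i)=\sum_{j\in\mathcal{N}(i)}w_{i,j}x^k(j)-\alpha\nabla f_i(x^k(i))+\beta(x^k(i)-x^{k-1}(i)),$$

where $x^k(i)$ is the copy of the variable $x$ at node $i$ in the $k$th iteration and $\mathcal{N}(i)$ denotes the neighbors of node $i$. In the global scheme, this is basically the Heavy-ball algorithm. Thus, we can apply our theoretical findings to this algorithm. To guarantee the convergence, we just need

$$0<\alpha<\frac{2(1-\beta)}{L_F}=\frac{2(1-\beta)}{\max_i\{L_i\}+\frac{1-\lambda_{\min}(W)}{\alpha}}.$$

After simplification, we then get

$$\alpha\cdot\max_i\{L_i\}<1+\lambda_{\min}(W)-2\beta.$$

In a word, we need the requirements

$$0\le\beta<\frac{1+\lambda_{\min}(W)}{2},\qquad 0<\alpha<\frac{1-2\beta+\lambda_{\min}(W)}{\max_i\{L_i\}}.$$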
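The local scheme and the parameter requirements above can be sketched as follows for scalar local variables. The three-node mixing matrix `W` (whose eigenvalues are 1, 0.25 and 0.25), the local quadratics `f_i(x) = 0.5 * a_i * (x - t_i)^2` with `L_i = a_i`, and all function names are assumptions made for this illustration. Every node is a neighbor of every other node here, so the neighbor sum runs over all `j`.

```python
# Hypothetical symmetric, doubly stochastic mixing matrix; eigenvalues 1, 0.25, 0.25.
W = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
LAMBDA_MIN = 0.25
a = [1.0, 2.0, 4.0]    # f_i(x) = 0.5 * a_i * (x - t_i)^2, so L_i = a_i
t = [1.0, -1.0, 3.0]


def admissible(alpha, beta, lam_min, L_max):
    """Check 0 <= beta < (1 + lam_min)/2 and 0 < alpha < (1 - 2 beta + lam_min)/L_max."""
    return (0.0 <= beta < (1.0 + lam_min) / 2.0
            and 0.0 < alpha < (1.0 - 2.0 * beta + lam_min) / L_max)


def penalized_objective(x, alpha):
    """F(X) = sum_i f_i(x(i)) + X^T (I - W) X / (2 alpha) for scalar local copies."""
    local = sum(0.5 * a[i] * (x[i] - t[i]) ** 2 for i in range(3))
    consensus = sum(x[i] * (x[i] - sum(W[i][j] * x[j] for j in range(3)))
                    for i in range(3))
    return local + consensus / (2.0 * alpha)


def decentralized_heavy_ball(x0, alpha, beta, num_iters):
    x_prev, x = list(x0), list(x0)
    for _ in range(num_iters):
        x_next = []
        for i in range(len(x)):
            mixed = sum(W[i][j] * x[j] for j in range(len(x)))  # neighbor averaging
            local_grad = a[i] * (x[i] - t[i])                   # gradient of f_i
            x_next.append(mixed - alpha * local_grad + beta * (x[i] - x_prev[i]))
        x_prev, x = x, x_next
    return x


alpha, beta = 0.1, 0.3
ok = admissible(alpha, beta, LAMBDA_MIN, max(a))
x_dec = decentralized_heavy_ball([0.0, 0.0, 0.0], alpha, beta, 2000)
```

Each node only needs its own gradient and the current values of its neighbors, which is why the update can be carried out without a central coordinator.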

The convergence result for decentralized Heavy-ball method directly follows from our previous theoretical findings and can be summarized as below.

###### Corollary 2

Assume that $f_i$ is convex and differentiable, and $\nabla f_i$ is Lipschitz with constant $L_i$. Let $0\le\beta<\frac{1+\lambda_{\min}(W)}{2}$, and let the sequence $(X^k)_{k\ge 0}$ be generated by the decentralized Heavy-ball method. For any fixed stepsize $0<\alpha<\frac{1-2\beta+\lambda_{\min}(W)}{\max_i\{L_i\}}$, we have

$$F(X^k)-\min F=O\Big(\frac{1}{k}\Big). \tag{41}$$

This justifies the superiority of our non-ergodic analysis. As mentioned in the introduction, all the existing convergence results are about the ergodic sequence $\big\{f\big(\frac{1}{k}\sum_{i=1}^{k}x^i\big)\big\}$. For the decentralized Heavy-ball algorithm, however, it is meaningless to discuss ergodic rates, because the nodes only communicate with their neighbors. Our results, in contrast, still hold in this case.

## Experimental results

We report the numerical simulations of Heavy-ball method applied to the linear regression problem

$$\min_{x\in\mathbb{R}^n}\Big\{\frac{1}{2}\sum_{i=1}^m(y_i-A_i^\top x)^2\Big\}, \tag{42}$$

and the logistic regression problem

$$\min_{x\in\mathbb{R}^n}\Big\{\sum_{i=1}^m\log\big(1+\exp(-y_iA_i^\top x)\big)+\frac{\lambda}{2}\|x\|^2\Big\}, \tag{43}$$

where , . All experiments were performed using MATLAB on a desktop with an Intel 3.4 GHz CPU. We tested the three Heavy-ball algorithms with different inertial parameters. We fixed the stepsize as in all numerical tests. For the stepsize, we need , i.e., . Therefore, inertial parameters are set to . For the linear regression problem, , where denotes the largest eigenvalue of a matrix; whereas for logistic regression, we have . For the cyclic coordinate descent scheme, the function values are recorded after each whole epoch; for the stochastic coordinate descent scheme, the function values are recorded after every iteration. The special case $\beta=0$ corresponds to gradient descent, cyclic coordinate gradient descent, or stochastic coordinate gradient descent, respectively. And we set and . The data and were generated from Gaussian and Bernoulli random distributions, respectively. The maximum number of iterations was set to . For logistic regression, we set . We tested the three Heavy-ball algorithms on both regression tasks with Gaussian and Bernoulli data.

As illustrated by Figure 1, larger $\beta$ leads to faster convergence for both the Heavy-ball algorithm and the cyclic coordinate descent algorithm when . However, for the stochastic block coordinate descent scheme, the inertial term helps only insignificantly. This is because, in the stochastic case, the inertial term contributes in the $k$th iteration only when $i_k=i_{k-1}$. This case, however, happens with probability $1/m$, where $m$ is the number of blocks; when $m$ is large, $i_k=i_{k-1}$ happens with low probability for a single iteration, let alone for all iterations. Therefore, the inertial term is actually inactive in most iterations of the stochastic block coordinate descent scheme.

To improve the practical performance of the stochastic block coordinate descent Heavy-ball algorithm, another inertial scheme proposed in Xu and Yin (2013) can be used, in which a new storage is used. In each iteration, the algorithm employs to replace in scheme (31) and then updates while keeping the other coordinates of . In this scheme, the inertial term can be active in all iterations. However, the convergence of such an algorithm is beyond the proof techniques proposed in this paper and deserves further study.
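For reference, the logistic regression objective (43) and its gradient can be evaluated as in the sketch below, which is the quantity a Heavy-ball implementation would query at each iteration. The experiments in this paper were run in MATLAB; this Python version, the function names, and the tiny data set are assumptions for illustration. The loss $\log(1+\exp(-s))$ is computed in a numerically stable form.

```python
import math


def logistic_objective(x, A, y, lam):
    """sum_i log(1 + exp(-y_i * <A_i, x>)) + (lam / 2) * ||x||^2."""
    total = 0.0
    for a_i, y_i in zip(A, y):
        s = y_i * sum(aj * xj for aj, xj in zip(a_i, x))
        # Stable evaluation of log(1 + exp(-s)).
        total += max(-s, 0.0) + math.log1p(math.exp(-abs(s)))
    return total + 0.5 * lam * sum(xj * xj for xj in x)


def logistic_gradient(x, A, y, lam):
    grad = [lam * xj for xj in x]
    for a_i, y_i in zip(A, y):
        s = y_i * sum(aj * xj for aj, xj in zip(a_i, x))
        # Derivative of log(1 + exp(-s)) with respect to s is -1 / (1 + exp(s)).
        if s >= 0.0:
            e = math.exp(-s)
            weight = -e / (1.0 + e)
        else:
            weight = -1.0 / (1.0 + math.exp(s))
        for j, aj in enumerate(a_i):
            grad[j] += weight * y_i * aj
    return grad


# Tiny hypothetical data set: rows A_i and labels y_i in {-1, +1}.
A_demo = [[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7]]
y_demo = [1.0, -1.0, 1.0]
x_demo = [0.2, -0.4]
g_demo = logistic_gradient(x_demo, A_demo, y_demo, lam=0.1)
```

A finite-difference comparison against `logistic_objective` is a simple way to validate such a gradient routine before plugging it into an optimizer.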

## Conclusion

In this paper, we studied the non-ergodic computational complexity of the Heavy-ball methods in the convex setting. Under different assumptions, we proved non-ergodic sublinear and linear convergence rates for the algorithm, respectively. In both cases, we made much more relaxed assumptions than those in the existing literature. Our proof was motivated by the analysis of a novel dynamical system. We extended our results to the multi-block coordinate descent Heavy-ball algorithm for both cyclic and stochastic update rules. The application to decentralized optimization demonstrated the advantage of our analysis techniques.

Acknowledgments: The authors are indebted to anonymous referees for their useful suggestions. We are grateful for the support from the Major State Research Development Program (2016YFB0201305), the National Key Research and Development Program of China (2017YFB0202003), and the National Natural Science Foundation of Hunan (2018JJ3616).

## References

• Alvarez (2000) Alvarez, F. 2000. On the minimizing property of a second order dissipative system in Hilbert spaces. SIAM Journal on Control and Optimization 38(4):1102–1119.
• Combettes and Glaudin (2017) Combettes, P. L., and Glaudin, L. E. 2017. Quasinonexpansive iterations on the affine hull of orbits: From Mann's mean value algorithm to inertial methods. SIAM Journal on Optimization 27(4).
• Ghadimi, Feyzmahdavian, and Johansson (2015) Ghadimi, E.; Feyzmahdavian, H. R.; and Johansson, M. 2015. Global convergence of the heavy-ball method for convex optimization. In Control Conference (ECC), 2015 European, 310–315. IEEE.
• Lai and Yin (2013) Lai, M.-J., and Yin, W. 2013. Augmented and nuclear-norm models with a globally linearly convergent algorithm. SIAM Journal on Imaging Sciences 6(2):1059–1091.
• Lessard, Recht, and Packard (2016) Lessard, L.; Recht, B.; and Packard, A. 2016. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization 26(1):57–95.
• Liang, Fadili, and Peyré (2016) Liang, J.; Fadili, J.; and Peyré, G. 2016. A multi-step inertial forward-backward splitting method for non-convex optimization. In Advances in Neural Information Processing Systems, 4035–4043.
• Loizou and Richtárik (2017a) Loizou, N., and Richtárik, P. 2017a. Linearly convergent stochastic heavy ball method for minimizing generalization error. arXiv preprint arXiv:1710.10737.
• Loizou and Richtárik (2017b) Loizou, N., and Richtárik, P. 2017b. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677.
• Nedic and Ozdaglar (2009) Nedic, A., and Ozdaglar, A. 2009. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control 54(1):48–61.
• Nesterov (2013) Nesterov, Y. 2013. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media.
• Ochs et al. (2014) Ochs, P.; Chen, Y.; Brox, T.; and Pock, T. 2014. ipiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences 7(2):1388–1419.
• Ochs, Brox, and Pock (2015) Ochs, P.; Brox, T.; and Pock, T. 2015. ipiasco: Inertial proximal algorithm for strongly convex optimization. Journal of Mathematical Imaging and Vision 53(2):171–181.
• Ochs (2016) Ochs, P. 2016. Local convergence of the heavy-ball method and ipiano for non-convex optimization. arXiv preprint arXiv:1606.09070.
• Pock and Sabach (2016) Pock, T., and Sabach, S. 2016. Inertial proximal alternating linearized minimization (ipalm) for nonconvex and nonsmooth problems. SIAM Journal on Imaging Sciences 9(4):1756–1787.
• Polyak (1964) Polyak, B. T. 1964. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5):1–17.
• Sun and Hong (2015) Sun, R., and Hong, M. 2015. Improved iteration complexity bounds of cyclic block coordinate descent for convex problems. In Advances in Neural Information Processing Systems, 1306–1314.
• Sun, Hannah, and Yin (2017) Sun, T.; Hannah, R.; and Yin, W. 2017. Asynchronous coordinate descent under more realistic assumptions. NIPS.
• Xu and Yin (2013) Xu, Y., and Yin, W. 2013. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences 6(3):1758–1789.
• Zavriev and Kostyuk (1993) Zavriev, S., and Kostyuk, F. 1993. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling 4(4):336–341.

## Proof of Lemma 1

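By the scheme for updating $x^{k+1}$,

$$\frac{x^k-x^{k+1}}{\gamma_k}+\frac{\beta_k}{\gamma_k}(x^k-x^{k-1})=\nabla f(x^k). \tag{44}$$

By the Lipschitz continuity of $\nabla f$,

$$f(x^{k+1})-f(x^k)\le\langle\nabla f(x^k),x^{k+1}-x^k\rangle+\frac{L}{2}\|x^{k+1}-x^k\|^2. \tag{45}$$

Combining (44) and (45), we have

$$f(x^{k+1})-f(x^k)\overset{(44)+(45)}{\le}\frac{\beta_k}{\gamma_k}\langle x^k-x^{k-1},x^{k+1}-x^k\rangle+\Big(\frac{L}{2}-\frac{1}{\gamma_k}\Big)\|x^{k+1}-x^k\|^2 \tag{46}$$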

where uses the Cauchy-Schwarz inequality . A simple calculation gives

 [f(xk)+βk