Momentum and Stochastic Momentum for Stochastic Gradient, Newton, Proximal Point and Subspace Descent Methods

12/27/2017 · by Nicolas Loizou, et al.

In this paper we study several classes of stochastic optimization algorithms enriched with heavy ball momentum. Among the methods studied are: stochastic gradient descent, stochastic Newton, stochastic proximal point and stochastic dual subspace ascent. This is the first time momentum variants of several of these methods are studied. We choose to perform our analysis in a setting in which all of the above methods are equivalent. We prove global non-asymptotic linear convergence rates for all methods and various measures of success, including primal function values, primal iterates (in the L2 sense), and dual function values. We also show that the primal iterates converge at an accelerated linear rate in the L1 sense. This is the first time a linear rate is shown for the stochastic heavy ball method (i.e., stochastic gradient descent with momentum). Under somewhat weaker conditions, we establish a sublinear convergence rate for Cesaro averages of the primal iterates. Moreover, we propose a novel concept, which we call stochastic momentum, aimed at decreasing the cost of performing the momentum step. We prove linear convergence of several stochastic methods with stochastic momentum, and show that in some sparse data regimes and for sufficiently small momentum parameters, these methods enjoy better overall complexity than methods with deterministic momentum. Finally, we perform extensive numerical testing on artificial and real datasets, including data coming from average consensus problems.


1 Introduction

Two of the most popular algorithmic ideas for solving optimization problems involving big volumes of data are stochastic approximation and momentum. By stochastic approximation we refer to the practice, pioneered by Robbins and Monro [65], of replacing costly-to-compute quantities (e.g., the gradient of the objective function) by cheap-to-compute stochastic approximations thereof (e.g., an unbiased estimate of the gradient). By momentum we refer to the heavy ball technique originally developed by Polyak [54] to accelerate the convergence rate of gradient-type methods.

While much is known about the effects of stochastic approximation and momentum in isolation, surprisingly little is known about the combined effect of these two popular algorithmic techniques. For instance, to the best of our knowledge, there is no context in which a method combining stochastic approximation with momentum is known to have a linear convergence rate. One of the contributions of this work is to show that there are important problem classes for which a linear rate can indeed be established for a range of stepsize and momentum parameters.

1.1 Setting

In this paper we study three closely related problems:

  1. stochastic optimization,

  2. best approximation, and

  3. (bounded) concave quadratic maximization.

These problems and the relationships between them are described in detail in Section 3. Here we only briefly outline some of the key relationships. By stochastic optimization we refer to a problem of the form

$\min_{x \in \mathbb{R}^n} f(x) := \mathbb{E}_{S \sim \mathcal{D}}[f_S(x)],$ (1)

where the expectation is over random matrices $S$ drawn from an arbitrary distribution $\mathcal{D}$, and $f_S$ is a stochastic convex quadratic function of a least-squares type, depending on $S$, and in addition on a matrix $A \in \mathbb{R}^{m \times n}$, a vector $b \in \mathbb{R}^m$, and an $n \times n$ positive definite matrix $B$ (see Section 3 for full details). The problem is constructed in such a way that the set of minimizers of $f$ is identical to the set of solutions of a given (consistent) linear system

$Ax = b,$ (2)

where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. In this sense, (1) can be seen as the reformulation of the linear system (2) into a stochastic optimization problem. Such reformulations provide an explicit connection between the fields of linear algebra and stochastic optimization, which may inspire future research by enabling the transfer of knowledge, techniques, and algorithms from one field to another. For instance, the randomized Kaczmarz method of Strohmer and Vershynin [69] for solving (2) is equivalent to the stochastic gradient descent method applied to (1), with $\mathcal{D}$ corresponding to a discrete distribution over unit coordinate vectors in $\mathbb{R}^m$. However, the flexibility of being able to choose $\mathcal{D}$ arbitrarily allows for numerous generalizations of the randomized Kaczmarz method [64]. Likewise, provably faster variants of the randomized Kaczmarz method (for instance, by utilizing importance sampling) can be designed using this connection.
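To make this connection concrete, here is a minimal sketch (ours, not taken from the paper) of the randomized Kaczmarz method viewed as SGD applied to (1) with $B = I$ and $\mathcal{D}$ a discrete distribution over unit coordinate vectors, sampled with probabilities proportional to the squared row norms of $A$, as in Strohmer and Vershynin [69]:

```python
import numpy as np

def randomized_kaczmarz(A, b, x0, iters=1000, seed=0):
    """Randomized Kaczmarz = SGD on the stochastic reformulation (1)
    with S a unit coordinate vector e_i and B = I; row i is sampled
    with probability proportional to ||A_i||^2 (Strohmer-Vershynin)."""
    rng = np.random.default_rng(seed)
    m, _ = A.shape
    probs = np.sum(A**2, axis=1)
    probs /= probs.sum()
    x = x0.astype(float).copy()
    for _ in range(iters):
        i = rng.choice(m, p=probs)                  # S_k = e_i
        a_i = A[i]
        # project current iterate onto the hyperplane {x : a_i^T x = b_i}
        x -= (a_i @ x - b[i]) / (a_i @ a_i) * a_i
    return x

# usage: a consistent system with a redundant row
A = np.array([[1., 2.], [3., 4.], [4., 6.]])
x_true = np.array([1., -1.])
b = A @ x_true
print(randomized_kaczmarz(A, b, np.zeros(2)))       # approx. x_true
```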

1.2 Three stochastic methods

Problem (1) has several peculiar characteristics which are of key importance to this paper. For instance, the Hessian of $f_S$ is a (random) projection matrix in the $B$-inner product, which can be used to show that $f_S(x) = \frac{1}{2}\|\nabla f_S(x)\|_B^2$. Moreover, it follows that the Hessian of $f$ has all eigenvalues bounded by 1, and so on. These characteristics can be used to show that several otherwise distinct stochastic algorithms for solving the stochastic optimization problem (1) are identical [64]. In particular, the following optimization methods for solving (1) are identical. (In addition, these three methods are identical to a stochastic fixed point method, with relaxation, for solving the fixed point problem $x = \mathbb{E}[\Pi_{L_S}(x)]$, where $L_S := \{x : S^\top A x = S^\top b\}$ is the set of solutions of the sketched linear system $S^\top A x = S^\top b$, which is a sketched version of the linear system (2) and can be seen as a stochastic approximation of the set $L$.)

  • Method 1: stochastic gradient descent (SGD),

  • Method 2: stochastic Newton method (SN), and

  • Method 3: stochastic proximal point method (SPP);

all with a fixed stepsize $\omega > 0$. The methods will be described in detail in Section 3; see also Table 2 for a quick summary.

The equivalence of these methods is useful for the purposes of this paper, as it allows us to study their variants with momentum by studying a single algorithm only. We are not aware of any successful attempts to analyze momentum variants of SN and SPP; and, as mentioned above, there are no linearly convergent variants of SGD with momentum in any setting.

1.3 Best approximation, duality and stochastic dual subspace ascent

It was shown in [24] in the $\omega = 1$ case, and in [64] in the general case, that SGD, SN and SPP converge to a very particular minimizer of $f$: the projection of the starting point $x^0$ onto the solution set of the linear system (2). This naturally leads to the best approximation problem, which is the problem of projecting (in the rest of the paper we consider projection with respect to an arbitrary Euclidean norm) a given vector $x^0$ onto the solution space of the linear system (2):

$\min_{x \in \mathbb{R}^n} \tfrac{1}{2}\|x - x^0\|_B^2 \quad \text{subject to} \quad Ax = b.$ (3)

The dual of the best approximation problem is an unconstrained concave quadratic maximization problem [24]. Consistency of $Ax = b$ implies that the dual is bounded. It follows from the results of [24] that for $\omega = 1$, the random iterates of SGD, SN and SPP arise as affine images of the random iterates produced by an algorithm for solving the dual of the best approximation problem (3), known as

  • Method 4: stochastic dual subspace ascent (SDSA).

In this paper we show that this equivalence extends beyond the $\omega = 1$ case, specifically to all stepsizes $\omega > 0$, and further study SDSA with momentum. We then show that SGD, SN and SPP with momentum arise as affine images of SDSA with momentum. SDSA proceeds by taking steps in a random subspace spanned by the columns of the matrix $S$ drawn in each iteration from $\mathcal{D}$. In this subspace, the method moves to the point which maximizes the dual objective $D$. Since $\mathcal{D}$ is an arbitrary distribution of random matrices, SDSA moves in arbitrary random subspaces, and as such, can be seen as a vast generalization of randomized coordinate descent methods and their minibatch variants [16, 58].

1.4 Structure of the paper

The remainder of this work is organized as follows. In Section 2 we summarize our contributions in the context of existing literature. In Section 3 we provide a detailed account of the stochastic optimization problem, the best approximation and its dual. Here we also describe the SGD, SN and SPP methods. In Section 4 we describe and analyze primal methods with momentum (mSGD, mSN and mSPP), and in Section 5 we describe and analyze the dual method with momentum (mSDSA). In Section 6 we describe and analyze primal methods with stochastic momentum (smSGD, smSN and smSPP). Numerical experiments are presented in Section 8. Proofs of all key results can be found in the appendix.

1.5 Notation

The following notational conventions are used in this paper. Boldface upper-case letters denote matrices; $I$ is the identity matrix. By $L$ we denote the solution set of the linear system $Ax = b$. By $L_S$, where $S$ is a random matrix, we denote the solution set of the sketched linear system $S^\top A x = S^\top b$. Throughout the paper, $B$ is an $n \times n$ positive definite matrix giving rise to the inner product $\langle x, y \rangle_B := x^\top B y$ and norm $\|x\|_B := \sqrt{x^\top B x}$ on $\mathbb{R}^n$. Unless stated otherwise, $x^*$ is the projection of $x^0$ onto $L$ in the $B$-norm: $x^* = \Pi_L^B(x^0) := \arg\min_{x \in L} \|x - x^0\|_B$. We write $[n] := \{1, 2, \ldots, n\}$.

2 Momentum Methods and Our Contributions

In this section we give a brief review of the relevant literature, and provide a summary of our contributions.

2.1 Heavy ball method

The baseline first-order method for minimizing a differentiable function $f$ is the gradient descent (GD) method, $x^{k+1} = x^k - \omega \nabla f(x^k)$, where $\omega > 0$ is a stepsize. For convex functions with Lipschitz-continuous gradient, GD converges at the rate $O(1/k)$. When, in addition, $f$ is strongly convex, the rate is linear [48]. To improve the convergence behavior of the method, Polyak proposed to modify GD by the introduction of a (heavy ball) momentum term, $\beta(x^k - x^{k-1})$. (Arguably a much more popular, and certainly theoretically much better understood, alternative to Polyak's momentum is the momentum introduced by Nesterov [46, 48], leading to the famous accelerated gradient descent (AGD) method. AGD converges non-asymptotically and globally, with the optimal sublinear rate $O(1/k^2)$ [45] when applied to minimizing a smooth convex objective function, and with the optimal linear rate when minimizing smooth strongly convex functions. Both Nesterov's and Polyak's update rules are known in the literature as "momentum" methods; in this paper, however, we focus exclusively on Polyak's heavy ball momentum.) This leads to the gradient descent method with momentum (mGD), popularly known as the heavy ball method:

$x^{k+1} = x^k - \omega \nabla f(x^k) + \beta (x^k - x^{k-1}).$

More specifically, Polyak proved that with the correct choice of the stepsize parameter $\omega$ and momentum parameter $\beta$, a local accelerated linear convergence rate, depending on the square root of the condition number, can be achieved for twice continuously differentiable, strongly convex objective functions with Lipschitz-continuous gradient. See the first line of Table 1.
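For intuition, the following sketch (ours, not from the paper) runs the heavy ball method on a simple strongly convex quadratic, using the classical parameter choice $\omega = 4/(\sqrt{L} + \sqrt{\mu})^2$ and $\beta = ((\sqrt{L} - \sqrt{\mu})/(\sqrt{L} + \sqrt{\mu}))^2$ associated with the accelerated rate:

```python
import numpy as np

def heavy_ball(grad, x0, L, mu, iters=200):
    """Polyak's heavy ball method with the classical parameter choice
    omega = 4/(sqrt(L)+sqrt(mu))^2, beta = ((sqrt(L)-sqrt(mu))/(sqrt(L)+sqrt(mu)))^2."""
    omega = 4.0 / (np.sqrt(L) + np.sqrt(mu))**2
    beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu)))**2
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        # simultaneous update: new x uses the old (x, x_prev) pair
        x, x_prev = x - omega * grad(x) + beta * (x - x_prev), x
    return x

# usage: strongly convex quadratic f(x) = 1/2 x^T Q x with mu = 1, L = 100
Q = np.diag([1.0, 100.0])
x = heavy_ball(lambda x: Q @ x, np.array([1.0, 1.0]), L=100.0, mu=1.0)
print(np.linalg.norm(x))   # approx. 0
```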

Recently, Ghadimi et al. [20] performed a global convergence analysis of the heavy ball method. In particular, the authors showed that for a certain combination of the stepsize and momentum parameter, the method converges sublinearly to the optimum when the objective function is convex and has a Lipschitz-continuous gradient, and linearly when the function is also strongly convex. A particular selection of the parameters $\omega$ and $\beta$ that gives the desired accelerated linear rate was not provided.

To the best of our knowledge, despite a considerable amount of work on the heavy ball method, there is still no global convergence analysis which would guarantee an accelerated linear rate in the general smooth strongly convex case. However, in the special case of a strongly convex quadratic, an elegant proof was recently proposed in [35]. Using the notion of integral quadratic constraints from robust control theory, the authors proved that with a proper choice of $\omega$ and $\beta$, the heavy ball method enjoys a global asymptotic accelerated convergence rate. The aforementioned results are summarized in the first part of Table 1.

Extensions of the heavy ball method have been recently proposed in the proximal setting [51], non-convex setting [52, 78] and for distributed optimization [21].

2.2 Stochastic heavy ball method

In contrast to the recent advances in our theoretical understanding of the (classical) heavy ball method, there has been less progress in understanding the convergence behavior of stochastic variants of the heavy ball method. The key method in this category is stochastic gradient descent with momentum (mSGD; aka the stochastic heavy ball method):

$x^{k+1} = x^k - \omega g(x^k) + \beta (x^k - x^{k-1}),$

where $g(x^k)$ is an unbiased estimator of the true gradient $\nabla f(x^k)$. While mSGD is used extensively in practice, especially in deep learning [70, 71, 33, 74], its convergence behavior is not very well understood.

In fact, we are aware of only two papers, both recent, which set out to study the complexity of mSGD: the work of Yang et al. [77] and the work of Gadat et al. [18]. In the former paper, a unified convergence analysis for stochastic gradient methods with momentum (heavy ball and Nesterov's momentum) was proposed, and an analysis for both convex and non-convex functions was performed. For a general Lipschitz-continuous convex objective function with bounded variance, a rate of $O(1/\sqrt{k})$ was proved, for which the authors employed a decreasing stepsize strategy. In [18], the authors first describe several almost sure convergence results in the case of general non-convex coercive functions, and then provide a complexity analysis for the case of quadratic strongly convex functions. However, the established rate is slow: for strongly convex quadratic and coercive functions, mSGD with diminishing stepsizes was shown to converge at a sublinear rate whose exponent depends on the schedule of the (time-varying) momentum parameter. The convergence rates established in both of these papers are sublinear. In particular, no insight is provided into whether the inclusion of the momentum term provides what it was aimed to provide: acceleration.

The above results are summarized in the second part of Table 1. From this perspective, our contribution lies in providing an in-depth analysis of mSGD (and, additionally, of SGD with stochastic momentum). Our contributions are discussed next.

Method | Paper | Rate | Assumptions on $f$ | Convergence
Heavy Ball (mGD) | Polyak, 1964 [54] | accelerated linear | twice continuously differentiable, strongly convex, Lipschitz gradient | local
Heavy Ball (mGD) | Ghadimi et al, 2014 [20] | sublinear | convex, Lipschitz gradient | global
Heavy Ball (mGD) | Ghadimi et al, 2014 [20] | linear | strongly convex, Lipschitz gradient | global
Heavy Ball (mGD) | Lessard et al, 2016 [35] | accelerated linear | strongly convex + quadratic | global, asymptotic
Stochastic Heavy Ball (mSGD) | Yang et al, 2016 [77] | sublinear | convex, Lipschitz + bounded variance | global, non-asymptotic
Stochastic Heavy Ball (mSGD) | Gadat et al, 2016 [18] | sublinear | strongly convex quadratic/coercive + other assumptions | global, non-asymptotic
Stochastic Heavy Ball (mSGD) | THIS PAPER | see Table 3 | convex + quadratic (problem (1)) | global, non-asymptotic

Table 1: Known complexity results for gradient descent with momentum (mGD, aka: heavy ball method), and stochastic gradient descent with momentum (mSGD, aka: stochastic heavy ball method). We give the first linear and accelerated rates for mSGD. For full details on the iteration complexity results we obtain, refer to Table 3.

2.3 Connection to incremental gradient methods

Assuming $\mathcal{D}$ is a discrete distribution (i.e., we sample from $r$ matrices $S_1, \ldots, S_r$, where $S_i$ is chosen with probability $p_i > 0$), we can write the stochastic optimization problem (1) in the finite-sum form

$\min_{x \in \mathbb{R}^n} f(x) = \sum_{i=1}^r p_i f_{S_i}(x).$ (4)

Choosing $x^0 = x^1$, mSGD with fixed stepsize $\omega > 0$ applied to (4) can be written in the form

$x^{k+1} = x^k - \omega \sum_{t=0}^k \beta^{k-t} \nabla f_{S_{i_t}}(x^t),$ (5)

where $i_t = i$ with probability $p_i$. Problem (4) can also be solved using incremental average/aggregated gradient methods, such as the IAG method of Blatt et al. [3]. These methods have a similar form to (5), the main difference being in the way the past gradients are aggregated: (5) uses a geometric weighting of the gradients, while the incremental average gradient methods use a uniform/arithmetic weighting. The stochastic average gradient (SAG) method of Schmidt et al. [66] can also be written in a similar form. Incremental and incremental average gradient methods are widely studied algorithms for minimizing objective functions which can be expressed as a sum of finitely many convex functions. For a review of key works on incremental methods and a detailed presentation of their connections with stochastic gradient descent, we refer the interested reader to the excellent survey of Bertsekas [2]; see also the work of Tseng [72]. The snippet below contrasts the two weighting schemes.
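The following toy snippet (ours, purely illustrative) prints the geometric weights implicit in the unrolled recursion (5) next to the uniform weights used by incremental/stochastic average gradient methods:

```python
import numpy as np

# mSGD step direction after k steps: d_k = -omega * sum_t beta^{k-t} grad_t,
# i.e., a geometrically weighted sum of past stochastic gradients; IAG/SAG-type
# methods instead weight past gradients uniformly.
beta, k = 0.5, 5
geometric = np.array([beta**(k - t) for t in range(k + 1)])   # mSGD weights
arithmetic = np.full(k + 1, 1.0 / (k + 1))                    # IAG/SAG-style weights
print(geometric)    # recent gradients dominate
print(arithmetic)   # all past gradients weighted equally
```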

In [27], an incremental average gradient method with momentum was proposed for minimizing strongly convex functions, and it was proved that the method converges to the optimum at a linear rate. The rate is always worse than that of the no-momentum variant. However, it was shown experimentally that in practice the method is faster, especially for problems with a high condition number. In our setting, the objective function has a very specific structure (1): it is not a finite-sum problem, as the distribution $\mathcal{D}$ could be continuous, and we also do not assume strong convexity. Thus, the convergence analysis of [27] cannot be directly applied to our problem.

2.4 Summary of contributions

We now summarize the contributions of this paper.

New momentum methods. We study several classes of stochastic optimization algorithms (SGD, SN, SPP and SDSA) with momentum, which we call mSGD, mSN, mSPP and mSDSA, respectively (see the first and second columns of Table 2). We do this in a simplified setting with quadratic objectives where all of these algorithms are equivalent. These methods can be seen as solving three related optimization problems: the stochastic optimization problem (1), the best approximation problem (3) and its dual. To the best of our knowledge, momentum variants of SN, SPP and SDSA were not analyzed before.

no momentum ($\beta = 0$) | momentum ($\beta > 0$) | stochastic momentum ($\beta > 0$)
SGD [23 ($\omega = 1$)], [64 ($\omega > 0$)] | mSGD [Sec 4] | smSGD [Sec 6]
SN [64] | mSN [Sec 4] | smSN [Sec 6]
SPP [64] | mSPP [Sec 4] | smSPP [Sec 6]
SDSA [24 ($\omega = 1$)] | mSDSA [Sec 5] | --

Table 2: All methods analyzed in this paper. The methods with momentum and stochastic momentum are new. SGD = Stochastic Gradient Descent, SN = Stochastic Newton, SPP = Stochastic Proximal Point, SDSA = Stochastic Dual Subspace Ascent. At iteration $k$, a matrix $S_k$ is drawn in an i.i.d. fashion from distribution $\mathcal{D}$, and a stochastic step is performed.

Linear rate. We prove several (global and non-asymptotic) linear convergence results for our primal momentum methods mSGD/mSN/mSPP. First, we establish a linear rate for the decay of $\mathbb{E}[\|x^k - x^*\|_B^2]$ to zero (i.e., L2 convergence), for a range of stepsizes $\omega$ and momentum parameters $\beta$. We show that the same rate holds for the decay of the expected function values of (1) to zero. Further, the same rate holds for mSDSA; in particular, for the convergence of the dual objective to the optimum. For a summary of these results, and pointers to the relevant theorems, refer to lines 1, 2 and 6 of Table 3. Unfortunately, the theoretical rate for all our momentum methods is optimized for $\beta = 0$, and gets worse as the momentum parameter increases. However, no prior linear rates for any of these methods with momentum are known. We give the first linear convergence rate for SGD with momentum (i.e., for the stochastic heavy ball method).

Algorithm | momentum $\beta$ | Quantity converging to 0 | Rate (all: global, non-asymptotic) | Theorem
mSGD/mSN/mSPP | $\beta \geq 0$ | $\mathbb{E}[\|x^k - x^*\|_B^2]$ | linear | 1
mSGD/mSN/mSPP | $\beta \geq 0$ | $\mathbb{E}[f(x^k)]$ | linear | 1
mSGD/mSN/mSPP | $\beta \geq 0$ | $\mathbb{E}[f(\hat{x}^k)]$ | sublinear: $O(1/k)$ | 3
mSGD/mSN/mSPP | tuned, with $\omega = 1$ | $\|\mathbb{E}[x^k - x^*]\|_B^2$ | accelerated linear | 4
mSGD/mSN/mSPP | tuned, with tuned $\omega$ | $\|\mathbb{E}[x^k - x^*]\|_B^2$ | accelerated linear (better than the row above for $\lambda_{\max} < 1$) | 4
mSDSA | $\beta \geq 0$ | $D(y^*) - \mathbb{E}[D(y^k)]$ | linear | 6
smSGD/smSN/smSPP | $\beta \geq 0$ | $\mathbb{E}[\|x^k - x^*\|_B^2]$ | linear | 7
smSGD/smSN/smSPP | $\beta \geq 0$ | $\mathbb{E}[f(x^k)]$ | linear | 7

Table 3: Summary of the iteration complexity results obtained in this paper. Parameters of the methods: $\omega$ (stepsize) and $\beta$ (momentum term). In all cases, $x^* = \Pi_L^B(x^0)$ is the solution of the best approximation problem (3). Theorem 3 refers to Cesaro averages: $\hat{x}^k := \frac{1}{k}\sum_{t=1}^k x^t$. Theorem 6 refers to suboptimality in dual function values ($D$ is the dual function).

Accelerated linear rate. We then study the decay of the quantity $\|\mathbb{E}[x^k - x^*]\|_B^2$ to zero (i.e., L1 convergence). In this case, we establish an accelerated linear rate, which depends on the square root of the condition number (of the Hessian of $f$). This is a quadratic speedup when compared to the no-momentum methods, as these depend on the condition number itself. See lines 4 and 5 of Table 3. To the best of our knowledge, this is the first time an accelerated rate has been obtained for the stochastic heavy ball method (mSGD). Note that there are no global non-asymptotic accelerated linear rates proved even in the non-stochastic setting (i.e., for the heavy ball method). Moreover, we are not aware of any accelerated linear convergence results for the stochastic proximal point method.

Sublinear rate for Cesaro averages. We show that the Cesaro averages, $\hat{x}^k := \frac{1}{k}\sum_{t=1}^k x^t$, of all primal momentum methods enjoy a sublinear $O(1/k)$ rate (see line 3 of Table 3). This holds under weaker assumptions than those which lead to the linear convergence rate.

Primal-dual correspondence. We show that SGD, SN and SPP with momentum arise as affine images of SDSA with momentum (see Theorem 5). This extends the result of [24], where this was shown for the no-momentum methods ($\beta = 0$) in the special case of the unit stepsize ($\omega = 1$).

Stochastic momentum. We propose a new momentum strategy, which we call stochastic momentum. Stochastic momentum is a stochastic (coordinate-wise) approximation of the deterministic momentum term, and hence is much less costly, which in some situations leads to computational savings in each iteration. On the other hand, the additional noise introduced this way increases the number of iterations needed for convergence. We analyze the SGD, SN and SPP methods with stochastic momentum, and prove linear convergence rates. We prove that in some settings the overall complexity of SGD with stochastic momentum is better than the overall complexity of SGD with (deterministic) momentum. For instance, this is the case if we consider the randomized Kaczmarz (RK) method as a special case of SGD, and if the matrix $A$ is sparse. A sketch of one natural form of such an update follows.
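Here is a hedged sketch (ours; the precise update analyzed in the paper is given in Section 6) of one natural way to implement a coordinate-wise stochastic momentum term: replace the full vector $\beta(x^k - x^{k-1})$ by an unbiased single-coordinate estimate of it.

```python
import numpy as np

def stochastic_momentum_term(x, x_prev, beta, rng):
    """Illustrative (assumed) stochastic momentum: instead of the full
    vector beta*(x - x_prev), update one uniformly chosen coordinate,
    scaled by n so the step is unbiased: E[n * e_i e_i^T v] = v."""
    n = x.shape[0]
    i = rng.integers(n)
    m = np.zeros(n)
    m[i] = n * beta * (x[i] - x_prev[i])
    return m
```

The design rationale is that the term costs $O(1)$ arithmetic per iteration instead of $O(n)$, which is what creates room for overall complexity gains on sparse data.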

Space for generalizations. We hope that the present work can serve as a starting point for the development of SN, SPP and SDSA methods with momentum for more general classes (beyond special quadratics) of convex and perhaps also nonconvex optimization problems. In such more general settings, however, the symmetry which implies equivalence of these algorithms will break, and hence a different analysis will be needed for each method.

2.5 No need for variance reduction

SGD is arguably one of the most popular algorithms in machine learning. Unfortunately, SGD suffers from slow convergence, which is due to the fact that the variance of the stochastic gradient as an estimator of the gradient does not naturally diminish. For this reason, SGD is typically used with a decreasing stepsize rule, which ensures that the variance converges to zero. However, this has an adverse effect on the convergence rate. For instance, SGD has a sublinear rate even if the function to be minimized is strongly convex. To overcome this problem, a new class of so-called variance-reduced methods was developed over the last 2-5 years, including SAG [66], SDCA [68, 62], SVRG/S2GD [29, 32], minibatch SVRG/S2GD [31], and SAGA [12, 11].

Since we assume that the linear system (2) is feasible, it follows that the stochastic gradient vanishes at the optimal point (i.e., $\nabla f_S(x^*) = 0$ for any $x^* \in L$ and any $S$). This suggests that additional variance reduction techniques are not necessary, since the variance of the stochastic gradient drops to zero as we approach the optimal point $x^*$. In particular, in our context, SGD with fixed stepsize enjoys a linear rate without any variance reduction strategy [42, 23, 64]. Hence, in this paper we can bypass the development of variance reduction techniques, which allows us to focus on the momentum term. The small check below illustrates this point.
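The following numerical check (ours, with randomly generated data) verifies that the stochastic gradient $\nabla f_S(x) = B^{-1} A^\top H (Ax - b)$ of the reformulation vanishes at a solution of a consistent system:

```python
import numpy as np

# grad f_S(x) = B^{-1} A^T H (Ax - b) = 0 whenever Ax = b, for any sketch S
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
x_star = rng.normal(size=4)
b = A @ x_star                       # consistent by construction
Binv = np.eye(4)                     # B = I for simplicity
S = rng.normal(size=(6, 2))          # an arbitrary sketching matrix
H = S @ np.linalg.pinv(S.T @ A @ Binv @ A.T @ S) @ S.T
print(np.linalg.norm(Binv @ A.T @ (H @ (A @ x_star - b))))   # 0.0
```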

3 Technical Preliminaries

A general framework for studying consistent linear systems via carefully designed stochastic reformulations was recently proposed by Richtárik and Takáč [64]. In particular, given the consistent linear system (2), they provide four reformulations: a stochastic optimization problem, a stochastic linear system, a stochastic fixed point problem and a stochastic intersection problem. These reformulations are equivalent in the sense that their solution sets are identical. That is, the set of minimizers of the stochastic optimization problem is equal to the set of solutions of the stochastic linear system, and so on. Under a certain assumption, for which the term exactness was coined in [64], the solution sets of these reformulations are equal to the solution set of the linear system.

3.1 Stochastic optimization

Stochasticity enters the reformulations via a user-defined distribution $\mathcal{D}$ of matrices $S$ (all with $m$ rows). In addition, the reformulations utilize an $n \times n$ positive definite matrix $B$ as a parameter, used to define an inner product in $\mathbb{R}^n$ via $\langle x, y \rangle_B := x^\top B y$ and the induced norm $\|x\|_B := \sqrt{x^\top B x}$. In particular, the stochastic optimization reformulation (1), i.e., $\min_x f(x) = \mathbb{E}_{S \sim \mathcal{D}}[f_S(x)]$, is defined by setting

$f_S(x) := \tfrac{1}{2}(Ax - b)^\top H (Ax - b),$ (6)

where $H$ is a random symmetric positive semidefinite matrix defined as $H := S(S^\top A B^{-1} A^\top S)^\dagger S^\top$. By $\dagger$ we denote the Moore-Penrose pseudoinverse.
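A direct, dense and purely illustrative evaluation of $f_S$ from these definitions (ours, not an optimized implementation):

```python
import numpy as np

def f_S(x, A, b, Binv, S):
    """Evaluate f_S(x) = 1/2 (Ax - b)^T H (Ax - b), as in (6),
    with H = S (S^T A B^{-1} A^T S)^+ S^T (Moore-Penrose pseudoinverse)."""
    H = S @ np.linalg.pinv(S.T @ A @ Binv @ A.T @ S) @ S.T
    r = A @ x - b
    return 0.5 * r @ (H @ r)
```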

Hessian and its eigenvalues. Note that the Hessian of $f_S$ (while not self-adjoint with respect to the standard inner product, it is self-adjoint with respect to the $\langle \cdot, \cdot \rangle_B$ inner product, which we use as the canonical inner product in $\mathbb{R}^n$) is given by $\nabla^2 f_S = B^{-1} Z$, where

$Z := A^\top H A.$ (7)

Note that the Hessian of $f$, namely $B^{-1}\mathbb{E}[Z]$, and

$W := B^{-1/2} \mathbb{E}[Z] B^{-1/2}$ (8)

have the same spectrum. Matrix $W$ is symmetric and positive semidefinite (with respect to the standard inner product). Let

$W = U \Lambda U^\top$

be the eigenvalue decomposition of $W$, where $U$ is an orthonormal matrix of eigenvectors, and $\Lambda = \mathrm{Diag}(\lambda_1, \ldots, \lambda_n)$ contains the corresponding eigenvalues. Let $\lambda_{\min}^+$ be the smallest nonzero eigenvalue, and $\lambda_{\max} = \lambda_1$ the largest eigenvalue. It was shown in [64] that $\lambda_i \leq 1$ for all $i$.

Exactness.

Note that $f_S$ is a convex quadratic, and that $f_S(x) = 0$ whenever $x \in L_S$; that is, $f_S$ can be zero also for points outside of $L$. Clearly, $f$ is nonnegative, and $f(x) = 0$ for $x \in L$. However, without further assumptions, the set of minimizers of $f$ can be larger than $L$. The exactness assumption mentioned above ensures that this does not happen. For necessary and sufficient conditions for exactness, we refer the reader to [64]. Here it suffices to remark that a sufficient condition for exactness is to require $\mathbb{E}[H]$ to be positive definite. This is easy to see by observing that $f(x) = \tfrac{1}{2}(Ax - b)^\top \mathbb{E}[H](Ax - b)$.

3.2 Three algorithms for solving the stochastic optimization problem

The authors of [64] consider solving the stochastic optimization problem (1) via stochastic gradient descent (SGD), where the gradient is computed with respect to the $\langle \cdot, \cdot \rangle_B$ inner product:

$x^{k+1} = x^k - \omega \nabla f_{S_k}(x^k),$ (9)

where $\omega > 0$ is a fixed stepsize and $S_k$ is sampled afresh in each iteration from $\mathcal{D}$. Note that the gradient of $f_S$ with respect to the $\langle \cdot, \cdot \rangle_B$ inner product is equal to

$\nabla f_S(x) = B^{-1} A^\top H (Ax - b) = B^{-1} Z (x - x^*),$ (10)

where $Z = A^\top H A$ as in (7), and $x^*$ is any vector in $L$.

They observe that, surprisingly, SGD is in this setting equivalent to several other methods; in particular, to the stochastic Newton method (in this method we take the $B$-pseudoinverse of the Hessian of $f_{S_k}$ instead of the classical inverse, as the inverse does not exist; when $B = I$, the $B$-pseudoinverse specializes to the standard Moore-Penrose pseudoinverse),

$x^{k+1} = x^k - \omega \left(\nabla^2 f_{S_k}(x^k)\right)^{\dagger_B} \nabla f_{S_k}(x^k),$ (11)

and to the stochastic proximal point method (in this case, the equivalence only works for $0 < \omega \leq 1$),

$x^{k+1} = \arg\min_{x \in \mathbb{R}^n} \left\{ f_{S_k}(x) + \frac{1 - \omega}{2\omega} \|x - x^k\|_B^2 \right\}.$ (12)
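Written out explicitly using (6) and (10), the common update of all three methods is $x^{k+1} = x^k - \omega B^{-1} A^\top S_k (S_k^\top A B^{-1} A^\top S_k)^\dagger S_k^\top (A x^k - b)$. A minimal sketch (ours, dense and illustrative) of this shared step:

```python
import numpy as np

def sketch_and_project_step(x, A, b, Binv, S, omega):
    """One relaxed step shared by SGD (9), SN (11) and SPP (12):
    x+ = x - omega * B^{-1} A^T S (S^T A B^{-1} A^T S)^+ S^T (Ax - b)."""
    lam = np.linalg.pinv(S.T @ A @ Binv @ A.T @ S) @ (S.T @ (A @ x - b))
    return x - omega * (Binv @ (A.T @ (S @ lam)))
```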

3.3 Stochastic fixed point problem

The stochastic fixed point problem considered in [64] as one of the four stochastic reformulations has the form

$x = \mathbb{E}_{S \sim \mathcal{D}}\left[\Pi_{L_S}^B(x)\right],$ (13)

where the expectation is taken with respect to $\mathcal{D}$, and where $\Pi_{L_S}^B(x)$ is the projection of $x$, in the $B$-norm, onto the sketched system $L_S$. An explicit formula for the projection onto $L_S$ is given by

$\Pi_{L_S}^B(x) = x - B^{-1} A^\top H (Ax - b);$ (14)

a formula for the projection onto $L$ is obtained by replacing $S$ with the identity matrix everywhere (in which case $H$ becomes $(A B^{-1} A^\top)^\dagger$).

The stochastic fixed point method (with relaxation parameter $\omega$) for solving (13) is defined by

$x^{k+1} = \omega \, \Pi_{L_{S_k}}^B(x^k) + (1 - \omega) x^k.$ (15)

3.4 Best approximation problem, its dual and SDSA

It was shown in [64] that the above methods converge linearly to $x^* = \Pi_L^B(x^0)$: the projection of the initial iterate onto the solution set of the linear system. Hence, besides solving problem (1), they solve the best approximation problem

$\min_{x \in \mathbb{R}^n} \tfrac{1}{2}\|x - x^0\|_B^2 \quad \text{subject to} \quad Ax = b.$ (16)

The Fenchel dual of (16) is the (bounded) unconstrained concave quadratic maximization problem

$\max_{y \in \mathbb{R}^m} D(y) := (b - A x^0)^\top y - \tfrac{1}{2}\|A^\top y\|_{B^{-1}}^2.$ (17)

Boundedness follows from consistency. It turns out that by varying $A$, $b$ and $B$ (but keeping consistency of the linear system), the dual problem in fact captures all bounded unconstrained concave quadratic maximization problems.

In the special case of unit stepsize ($\omega = 1$), method (15) was first proposed by Gower and Richtárik [23] under the name "sketch-and-project method", motivated by the structure of the iteration, which proceeds in two steps: (i) replace the set $L$ by its sketched variant $L_{S_k}$, and then (ii) project the last iterate $x^k$ onto $L_{S_k}$. The analysis in [23] was done under the assumption that $A$ be of full column rank. This assumption was lifted in [24], and a duality theory for the method developed. In particular, for $\omega = 1$ and $y^0 = 0$, the iterates $\{x^k\}$ arise as images of the iterates $\{y^k\}$ produced by a specific dual method for solving (17) under the mapping $\varphi : \mathbb{R}^m \to \mathbb{R}^n$ given by

$\varphi(y) := x^0 + B^{-1} A^\top y.$ (18)

The dual method, stochastic dual subspace ascent (SDSA), has the form

$y^{k+1} = y^k + \omega S_k \lambda_k,$ (19)

where $S_k$ is in each iteration sampled afresh from $\mathcal{D}$, and $\lambda_k$ is chosen greedily, maximizing the dual objective $D$: $\lambda_k \in \arg\max_\lambda D(y^k + S_k \lambda)$. Such a $\lambda_k$ might not be unique, however. SDSA is defined by picking the solution with the smallest (standard Euclidean) norm. This leads to the formula

$\lambda_k = \left(S_k^\top A B^{-1} A^\top S_k\right)^\dagger S_k^\top \left(b - A(x^0 + B^{-1} A^\top y^k)\right).$

SDSA proceeds by moving in random subspaces spanned by the random columns of $S_k$. In the special case when $\omega = 1$ and $y^0 = 0$, Gower and Richtárik [24] established the following relationship between the iterates $\{x^k\}$ produced by the primal methods (9), (11), (12), (15) (which are equivalent), and the dual method (19):

$x^k = \varphi(y^k) = x^0 + B^{-1} A^\top y^k.$ (20)

3.5 Other related work

Variants of the sketch-and-project method have been recently proposed for solving several other problems. Xiang and Zhang [76] show that the sketch-and-project framework is capable of expressing, as special cases, randomized variants of 16 classical algorithms for solving linear systems. Gower and Richtárik [26, 25] use similar ideas to develop linearly convergent randomized iterative methods for computing/estimating the inverse and the pseudoinverse of a large matrix, respectively. A limited-memory variant of the stochastic block BFGS method for solving the empirical risk minimization problem arising in machine learning was proposed by Gower et al. [22]. Tu et al. [73] utilize the sketch-and-project framework to show that breaking block locality can accelerate block Gauss-Seidel methods; in addition, they develop an accelerated variant of the method for a specific distribution $\mathcal{D}$. Loizou and Richtárik [38] use the sketch-and-project method to solve the average consensus problem; and Hanzely et al. [28] design new variants of sketch-and-project methods for the average consensus problem with privacy considerations (see Section 8.3 for more details regarding the average consensus problem).

4 Primal Methods with Momentum

Applied to problem (1), i.e., to $\min_x f(x) = \mathbb{E}_{S \sim \mathcal{D}}[f_S(x)]$, the gradient descent method with momentum (also known as the heavy ball method) of Polyak [54, 55] takes the form

$x^{k+1} = x^k - \omega \nabla f(x^k) + \beta (x^k - x^{k-1}),$ (21)

where $\omega > 0$ is a stepsize and $\beta \geq 0$ is a momentum parameter. Instead of marrying the momentum term with gradient descent, we can marry it with SGD. This leads to SGD with momentum (mSGD), also known as the stochastic heavy ball method:

$x^{k+1} = x^k - \omega \nabla f_{S_k}(x^k) + \beta (x^k - x^{k-1}).$ (22)

Since SGD is equivalent to SN and SPP, in this way we obtain momentum variants of the stochastic Newton (mSN) and stochastic proximal point (mSPP) methods. The method is formally described below:

mSGD / mSN / mSPP

Parameters: distribution $\mathcal{D}$ from which the method samples matrices; $n \times n$ positive definite matrix $B$; stepsize/relaxation parameter $\omega > 0$; heavy ball/momentum parameter $\beta$.
Initialize: choose initial points $x^0, x^1 \in \mathbb{R}^n$.
For $k = 1, 2, \ldots$ do:
  Draw a fresh $S_k \sim \mathcal{D}$.
  Set $x^{k+1} = x^k - \omega \nabla f_{S_k}(x^k) + \beta (x^k - x^{k-1})$.
Output: last iterate $x^k$.
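A compact reference sketch (ours, assuming dense NumPy arrays and a user-supplied sampler for $\mathcal{D}$) of the loop above:

```python
import numpy as np

def mSGD(A, b, Binv, sample_S, x0, omega, beta, iters, rng):
    """Sketch of mSGD/mSN/mSPP (the three coincide in this setting):
    x^{k+1} = x^k - omega * grad f_{S_k}(x^k) + beta * (x^k - x^{k-1})."""
    x_prev, x = x0.copy(), x0.copy()          # x^0 = x^1
    for _ in range(iters):
        S = sample_S(rng)                     # S_k ~ D (user-supplied sampler)
        lam = np.linalg.pinv(S.T @ A @ Binv @ A.T @ S) @ (S.T @ (A @ x - b))
        grad = Binv @ (A.T @ (S @ lam))       # grad f_{S_k}(x^k), see (10)
        x, x_prev = x - omega * grad + beta * (x - x_prev), x
    return x

# usage: minibatches of unit coordinate vectors (a block Kaczmarz-type D)
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5)); x_true = rng.normal(size=5); b = A @ x_true
sample = lambda r: np.eye(20)[:, r.choice(20, size=3, replace=False)]
x = mSGD(A, b, np.eye(5), sample, np.zeros(5),
         omega=1.0, beta=0.1, iters=500, rng=rng)
print(np.linalg.norm(x - x_true))             # small
```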

To the best of our knowledge, momentum variants of SN and SPP were not considered in the literature before. Moreover, as far as we know, there are no momentum variants of even deterministic variants of (11), (12) and (15), such as incremental or batch Newton method, incremental or batch proximal point method and incremental or batch projection method; not even for a problem formulated differently.

In the rest of this section we state our convergence results for mSGD/mSN/mSPP.

4.1 L2 convergence and function values: linear rate

In this section we study L2 convergence of mSGD/mSN/mSPP; that is, we study the convergence of the quantity $\mathbb{E}[\|x^k - x^*\|_B^2]$ to zero. We show that for a range of stepsize parameters $\omega > 0$ and momentum terms $\beta \geq 0$, the method enjoys a global linear convergence rate. To the best of our knowledge, these results are the first of their kind for the stochastic heavy ball method. As a corollary of L2 convergence, we obtain convergence of the expected function values.

Theorem 1.

Choose $x^0 = x^1 \in \mathbb{R}^n$. Assume exactness. Let $\{x^k\}_{k \geq 0}$ be the sequence of random iterates produced by mSGD/mSN/mSPP. Assume $\omega > 0$ and $\beta \geq 0$ and that the expressions

$a_1 := 1 + 3\beta + 2\beta^2 - \left(\omega(2 - \omega) + \omega\beta\right)\lambda_{\min}^+ \qquad \text{and} \qquad a_2 := \beta + 2\beta^2 + \omega\beta\lambda_{\max}$

satisfy $a_1 + a_2 < 1$. Let $x^* = \Pi_L^B(x^0)$. Then

$\mathbb{E}\left[\|x^k - x^*\|_B^2\right] \leq q^k (1 + \delta) \|x^0 - x^*\|_B^2$ (23)

and

$\mathbb{E}[f(x^k)] \leq \frac{\lambda_{\max}}{2} \, q^k (1 + \delta) \|x^0 - x^*\|_B^2,$

where $q = \frac{1}{2}\left(a_1 + \sqrt{a_1^2 + 4a_2}\right)$ and $\delta = q - a_1$. Moreover, $a_1 + a_2 \leq q < 1$.

Proof.

See Appendix A. ∎

In the above theorem we obtain a global linear rate. To the best of our knowledge, this is the first time that a linear rate has been established for a stochastic variant of the heavy ball method (mSGD) in any setting: all existing results are sublinear. These are also the first momentum variants of the SN and SPP methods.

If we choose $\omega \in (0, 2)$, then the condition $a_1 + a_2 < 1$ reduces to a quadratic inequality in $\beta$, and is satisfied for all

$0 \leq \beta < \beta^*(\omega),$ (24)

where $\beta^*(\omega) > 0$ is a threshold depending on $\omega$, $\lambda_{\min}^+$ and $\lambda_{\max}$. If $\beta = 0$, mSGD reduces to the SGD method analyzed in [64]. In this special case, $q = 1 - \omega(2 - \omega)\lambda_{\min}^+$, which is the rate established in [64]. Hence, our result is more general.

Let $q(\beta)$ be the rate from Theorem 1 viewed as a function of $\beta$. Note that since $q \geq a_1 + a_2$, we have

$q(\beta) \geq 1 - \omega(2 - \omega)\lambda_{\min}^+ + \beta\left(4 + 4\beta + \omega(\lambda_{\max} - \lambda_{\min}^+)\right).$ (25)

Clearly, the lower bound on $q$ is an increasing function of $\beta$. Also, for any $\beta > 0$ the rate is always inferior to that of SGD ($\beta = 0$). It is an open problem whether one can prove a strictly better rate for mSGD than for SGD.

Our next theorem states that $\Pi_L^B(x^k) = x^*$ for all iterations $k$ of mSGD. This invariance is important, as it allows the algorithm to converge to $x^*$.

Theorem 2.

Let $x^0 = x^1 \in \mathbb{R}^n$ be the starting points of the mSGD method and let $\{x^k\}$ be the random iterates generated by mSGD. Then $\Pi_L^B(x^k) = \Pi_L^B(x^0) = x^*$ for all $k \geq 0$.

Proof.

Note that in view of (6), $\nabla f_S(x) = B^{-1} A^\top H (Ax - b) \in \mathrm{Range}(B^{-1} A^\top)$. Since

$x^{k+1} = x^k - \omega \nabla f_{S_k}(x^k) + \beta (x^k - x^{k-1})$

and since $x^0 = x^1$, it can be shown by induction that $x^k \in x^0 + \mathrm{Range}(B^{-1} A^\top)$ for all $k$. However, $\mathrm{Range}(B^{-1} A^\top)$ is the orthogonal complement to $\mathrm{Null}(A)$ in the $B$-inner product. Since $L$ is parallel to $\mathrm{Null}(A)$, the vectors $x^k$ must have the same $B$-projection onto $L$ for all $k$: $\Pi_L^B(x^k) = \Pi_L^B(x^0) = x^*$. ∎

4.2 Cesaro average: sublinear rate without exactness assumption

In this section we present a convergence analysis of the function values evaluated at the Cesaro average of the iterates. Again, our results are global in nature. To the best of our knowledge, these are the first results that show $O(1/k)$ convergence of the stochastic heavy ball method. Existing results apply in more general settings, at the expense of slower rates. In particular, [77] and [18] obtain $O(1/\sqrt{k})$ and momentum-schedule-dependent sublinear rates, respectively; only for a particular choice of its parameters does [18] obtain an $O(1/k)$ rate.

Theorem 3.

Choose $x^0 = x^1 \in \mathbb{R}^n$ and let $\{x^k\}$ be the random iterates produced by mSGD/mSN/mSPP, where the momentum parameter $0 \leq \beta < 1$ and relaxation parameter (stepsize) $\omega > 0$ satisfy a joint bound specified in Appendix B. Let $x^*$ be any vector satisfying $f(x^*) = 0$. If we let $\hat{x}^k := \frac{1}{k}\sum_{t=1}^k x^t$, then $\mathbb{E}[f(\hat{x}^k)] = O(1/k)$.

Proof.

See Appendix B. ∎

In the special case of $\beta = 0$, the above theorem recovers the $O(1/k)$ convergence rate for Cesaro averages of the "basic method" (i.e., SGD) established in [64].

Our proof strategy is similar to that of [20], in which the first global convergence analysis of the (deterministic) heavy ball method was presented. There it was shown that when the objective function has a Lipschitz-continuous gradient, the Cesaro averages of the iterates converge to the optimum at a rate of $O(1/k)$. To the best of our knowledge, there are no results in the literature that prove the same rate of convergence in the stochastic case for any class of objective functions.

In [77] the authors analyzed mSGD for general Lipschitz-continuous convex objective functions (with bounded variance) and proved the sublinear rate $O(1/\sqrt{k})$. In [18], a complexity analysis is provided for the case of quadratic strongly convex smooth coercive functions, and a sublinear convergence rate, depending on the momentum schedule, was proved. In contrast to our results, where we assume a fixed stepsize $\omega$, both papers analyze mSGD with diminishing stepsizes.

4.3 L1 convergence: accelerated linear rate

In this section we show that by a proper combination of the relaxation (stepsize) parameter $\omega$ and the momentum parameter $\beta$, mSGD/mSN/mSPP enjoy an accelerated linear convergence rate in mean.

Theorem 4.

Assume exactness. Let $\{x^k\}$ be the sequence of random iterates produced by mSGD/mSN/mSPP, started with $x^0, x^1 \in \mathbb{R}^n$ satisfying the relation $x^0 = x^1$, with relaxation parameter (stepsize) $\omega > 0$ and momentum parameter $\beta$ chosen as in one of the two options below. Let $x^* = \Pi_L^B(x^0)$. Then there exists a constant $C > 0$ such that for all $k \geq 0$ we have $\|\mathbb{E}[x^k - x^*]\|_B^2 \leq \beta^k C$.

  • If we choose $\omega = 1$ and $\beta = \left(1 - \sqrt{\lambda_{\min}^+}\right)^2$, then the convergence factor is $\beta$ and the iteration complexity becomes $O\!\left(\sqrt{1/\lambda_{\min}^+} \log(1/\epsilon)\right)$.

  • If we choose $\omega = 1/\lambda_{\max}$ and $\beta = \left(1 - \sqrt{\lambda_{\min}^+/\lambda_{\max}}\right)^2$, then the convergence factor is $\beta$ and the iteration complexity becomes $O\!\left(\sqrt{\lambda_{\max}/\lambda_{\min}^+} \log(1/\epsilon)\right)$, an improvement on the previous choice whenever $\lambda_{\max} < 1$.

Proof.

See Appendix C. ∎

Note that the convergence factor is precisely equal to the value of the momentum parameter $\beta$. Let $x$ be any random vector in $\mathbb{R}^n$ with finite mean $\mathbb{E}[x]$, and let $x^* \in \mathbb{R}^n$ be any reference vector (for instance, any solution of $Ax = b$). Then we have the identity (see, for instance, [23])

$\mathbb{E}\left[\|x - x^*\|_B^2\right] = \left\|\mathbb{E}[x] - x^*\right\|_B^2 + \mathbb{E}\left[\|x - \mathbb{E}[x]\|_B^2\right].$ (26)

This means that the quantity $\mathbb{E}[\|x^k - x^*\|_B^2]$ appearing in our L2 convergence result (Theorem 1) is larger than the quantity $\|\mathbb{E}[x^k - x^*]\|_B^2$ appearing in the L1 convergence result (Theorem 4), and hence harder to push to zero. As a corollary, L2 convergence implies L1 convergence. However, note that in Theorem 4 we have established an accelerated rate. A similar theorem, also obtaining an accelerated rate in the L1 sense, was established in [64] for an accelerated variant of SGD in the sense of Nesterov.

5 Dual Methods with Momentum

In the previous sections we focused on methods for solving the stochastic optimization problem (1) and the best approximation problem (3). In this section we focus on the dual of the best approximation problem, and propose a momentum variant of SDSA, which we call mSDSA.

Stochastic Dual Subspace Ascent with Momentum (mSDSA)

Parameters: distribution $\mathcal{D}$ from which the method samples matrices; $n \times n$ positive definite matrix $B$; stepsize/relaxation parameter $\omega > 0$; heavy ball/momentum parameter $\beta$. SDSA is obtained as a special case of mSDSA for $\beta = 0$.
Initialize: choose initial points $y^0 = y^1 \in \mathbb{R}^m$.
For $k = 1, 2, \ldots$ do:
  Draw a fresh $S_k \sim \mathcal{D}$.
  Set $\lambda_k = \left(S_k^\top A B^{-1} A^\top S_k\right)^\dagger S_k^\top \left(b - A(x^0 + B^{-1} A^\top y^k)\right)$.
  Set $y^{k+1} = y^k + \omega S_k \lambda_k + \beta (y^k - y^{k-1})$.
Output: last iterate $y^k$.
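A matching sketch (ours, under the same dense-NumPy assumptions as the primal sketch in Section 4) of the mSDSA loop, which also recovers the primal iterate via the mapping $\varphi(y) = x^0 + B^{-1}A^\top y$:

```python
import numpy as np

def mSDSA(A, b, Binv, sample_S, x0, omega, beta, iters, rng):
    """Sketch of mSDSA: dual ascent in the random subspace range(S_k),
    y^{k+1} = y^k + omega * S_k lam_k + beta * (y^k - y^{k-1}),
    with the primal iterate recovered as x^k = x^0 + B^{-1} A^T y^k."""
    m = A.shape[0]
    y_prev, y = np.zeros(m), np.zeros(m)      # y^0 = y^1 = 0
    for _ in range(iters):
        S = sample_S(rng)                     # S_k ~ D
        x = x0 + Binv @ (A.T @ y)             # affine image, see (18)
        lam = np.linalg.pinv(S.T @ A @ Binv @ A.T @ S) @ (S.T @ (b - A @ x))
        y, y_prev = y + omega * (S @ lam) + beta * (y - y_prev), y
    return x0 + Binv @ (A.T @ y)              # primal output x^k
```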

5.1 Correspondence between primal and dual methods

In our first result we show that the random iterates of the mSGD/mSN/mSPP methods arise as an affine image of the iterates of mSDSA under the mapping $\varphi$ defined in (18).

Theorem 5 (Correspondence Between Primal and Dual Methods).

Let $x^0 = x^1 \in \mathbb{R}^n$ and let $\{x^k\}$ be the iterates of mSGD/mSN/mSPP. Let $y^0 = y^1 \in \mathbb{R}^m$ be such that $x^0 = \varphi(y^0)$, and let $\{y^k\}$ be the iterates of mSDSA. Assume that the methods use the same stepsize $\omega > 0$, momentum parameter $\beta \geq 0$, and the same sequence of random matrices $\{S_k\}$. Then

$x^k = \varphi(y^k) = x^0 + B^{-1} A^\top y^k$

for all $k \geq 0$. That is, the primal iterates arise as affine images of the dual iterates.

Proof.

First note that

$\omega B^{-1} A^\top S_k \lambda_k = -\omega \nabla f_{S_k}(\varphi(y^k)),$

which follows by comparing the formula for $\lambda_k$ with the gradient formula (10). We now use this to show that

$\varphi(y^{k+1}) = \varphi(y^k) - \omega \nabla f_{S_k}(\varphi(y^k)) + \beta\left(\varphi(y^k) - \varphi(y^{k-1})\right).$

So, the sequence of vectors $\{\varphi(y^k)\}$ produced by mSDSA satisfies the same recursion of degree 2 as the sequence $\{x^k\}$ produced by mSGD. It remains to check that the first two elements of both recursions coincide. Indeed, since $y^0 = y^1$ and $x^0 = x^1 = \varphi(y^0)$, we have $\varphi(y^0) = x^0$ and $\varphi(y^1) = \varphi(y^0) = x^1$. ∎

5.2 Convergence

We are now ready to state a linear convergence result describing the behavior of mSDSA in terms of the dual function values $\mathbb{E}[D(y^k)]$.

Theorem 6 (Convergence of dual objective).

Choose $y^0 = y^1 \in \mathbb{R}^m$. Assume exactness. Let $\{y^k\}$ be the sequence of random iterates produced by mSDSA. Assume $\omega > 0$ and $\beta \geq 0$ and that the expressions