# Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization

Stochastic gradient methods (SGMs) have been extensively used for solving stochastic problems or large-scale machine learning problems. Recent works employ various techniques to improve the convergence rate of SGMs for both convex and nonconvex cases, but most of them require a large number of samples in some or all iterations. In this paper, we propose a new SGM, named PStorm, for solving nonconvex nonsmooth stochastic problems. With a momentum-based variance reduction technique, PStorm achieves a near-optimal complexity result Õ(ε⁻³) for producing a stochastic ε-stationary solution, provided that a mean-squared smoothness condition holds. Different from existing near-optimal methods, PStorm requires only one or O(1) samples in every update. With this property, PStorm can be applied to online learning problems that favor real-time decisions based on one or O(1) new observations. In addition, for large-scale machine learning problems, PStorm can generalize better through small-batch training than other near-optimal methods, which require large-batch training, and than the vanilla SGM, as we demonstrate on training a sparse fully-connected neural network.


## 1 Introduction

The stochastic approximation method first appeared in robbins1951stochastic for solving a root-finding problem. Nowadays, its first-order version, the stochastic gradient method (SGM), is extensively used both for machine learning problems that involve huge amounts of given data and for stochastic problems that involve uncertain streaming data. Complexity results of SGMs have been well established for convex problems, and much recent research on SGMs focuses on nonconvex cases.

In this paper, we consider the regularized nonconvex stochastic programming

 Φ∗ = minimize_{x∈ℝⁿ} Φ(x) := {F(x) ≡ E_ξ[f(x; ξ)]} + r(x),   (1.1)

where f(·; ξ) is a smooth nonconvex function almost surely for ξ, and r is a closed convex function on ℝⁿ. Examples of (1.1) include the sparse online matrix factorization mairal2010online , the online nonnegative matrix factorization zhao2016online , and the streaming PCA (by a unit-ball constraint) mitliagkas2013memory . In addition, when ξ follows a uniform distribution on a finite set, (1.1) recovers the so-called finite-sum structured problem, which includes most regularized machine learning problems such as the sparse bilinear logistic regression shi2014sparse and the sparse convolutional neural network liu2015sparse .

### 1.1 Background

When r ≡ 0, the recent work arjevani2019lower gives an Ω(ε⁻³) lower complexity bound of SGMs to produce a stochastic ε-stationary solution of (1.1) (see Definition 2 below), by assuming the so-called mean-squared smoothness condition (see Assumption 2). Several variance-reduced SGMs tran2019hybrid ; wang2018spiderboost ; fang2018spider ; cutkosky2019momentum have achieved a near-optimal Õ(ε⁻³) complexity result (throughout the paper, we use Õ to suppress an additional polynomial term of |log ε|). Among them, fang2018spider ; cutkosky2019momentum only consider smooth cases, i.e., r ≡ 0 in (1.1), while tran2019hybrid ; wang2018spiderboost study nonsmooth problems in the form of (1.1). To reach an Õ(ε⁻³) complexity result, the Hybrid-SGD method in tran2019hybrid needs a large batch of samples at the initial step and then two samples at each update, while wang2018spiderboost ; fang2018spider require a large batch of samples after every fixed number of updates. The STORM method in cutkosky2019momentum requires one single sample of ξ at each update, but it only applies to smooth problems. Practically, on training a (deep) machine learning model, small-batch training is often used to obtain better generalization masters2018revisiting ; keskar2016large . In addition, for certain applications such as reinforcement learning sutton2018reinforcement , one single sample can usually be obtained, depending on the stochastic environment and the current decision. Furthermore, regularization terms can improve the generalization of a machine learning model, even for training a neural network wei2019regularization . We aim at designing a new SGM for solving the nonconvex nonsmooth problem (1.1) that achieves a near-optimal complexity result by using only O(1) (possibly one) samples at each update.

### 1.2 Mirror-prox algorithm

Our algorithm is a mirror-prox SGM, and we adopt the momentum technique to reduce the variance of the stochastic gradient estimator in order to achieve a near-optimal complexity result.

Let w be a continuously differentiable and 1-strongly convex function on ℝⁿ, i.e.,

 w(y) ≥ w(x) + ⟨∇w(x), y − x⟩ + ½∥y − x∥², ∀x, y ∈ ℝⁿ.

The Bregman divergence induced by w is defined as

 V(x,z)=w(x)−w(z)−⟨∇w(z),x−z⟩. (1.2)
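For instance (our illustration, not from the paper), the choice w(x) = ½∥x∥² is 1-strongly convex and gives V(x, z) = ½∥x − z∥², which a short numerical check confirms:

```python
import numpy as np

def bregman(w, grad_w, x, z):
    """Bregman divergence V(x, z) = w(x) - w(z) - <grad w(z), x - z> as in (1.2)."""
    return w(x) - w(z) - np.dot(grad_w(z), x - z)

# the choice w(x) = 0.5 * ||x||^2 is 1-strongly convex; then V(x, z) = 0.5 * ||x - z||^2
w = lambda x: 0.5 * np.dot(x, x)
grad_w = lambda x: x
```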

At each iteration of our algorithm, we obtain one or a few samples of ξ, compute stochastic gradients at the previous and current iterates using the same samples, and then perform a mirror-prox momentum stochastic gradient update. The pseudocode is shown in Algorithm 1. We name the method PStorm, as it can be viewed as a proximal version of the Storm method in cutkosky2019momentum .
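To make the iteration concrete, one such update with the Euclidean Bregman divergence (w(x) = ½∥x∥²) and r(x) = λ∥x∥₁ can be sketched as follows. This is our own sketch, not the pseudocode of Algorithm 1: the function names, argument order, and single-sample setting are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pstorm_step(x, d, eta, lam, beta, grad_f, xi):
    """One sketched PStorm-style update with r(x) = lam * ||x||_1.

    grad_f(x, xi): stochastic gradient oracle; xi is the single fresh sample.
    d is the current momentum-based variance-reduced gradient estimator."""
    # mirror-prox step: with the Euclidean Bregman divergence this is a prox-gradient step
    x_new = soft_threshold(x - eta * d, eta * lam)
    # momentum-based variance reduction: both gradients use the SAME fresh sample xi
    d_new = grad_f(x_new, xi) + (1.0 - beta) * (d - grad_f(x, xi))
    return x_new, d_new
```

With β = 1 the estimator reduces to a plain stochastic gradient, and with β = 0 it reduces to a SARAH-style recursion, which is the hybrid that drives the variance reduction.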

### 1.3 Related works

Many efforts have been made to analyze the convergence and complexity of SGMs for solving nonconvex stochastic problems, e.g., ghadimi2016accelerated ; ghadimi2013stochastic ; xu2015block-sg ; davis2019stochastic ; davis2020stochastic ; wang2018spiderboost ; cutkosky2019momentum ; fang2018spider ; allen2018natasha ; tran2019hybrid . We list comparison results on the complexity in Table 1.

The work ghadimi2013stochastic appears to be the first one that conducts complexity analysis of an SGM for nonconvex stochastic problems. It introduces a randomized SGM that, for a smooth nonconvex problem, produces a stochastic ε-stationary solution within O(ε⁻⁴) stochastic gradient (SG) iterations. The same-order complexity result is then extended in ghadimi2016accelerated to nonsmooth nonconvex stochastic problems in the form of (1.1). To achieve an O(ε⁻⁴) complexity result, the accelerated prox-SGM in ghadimi2016accelerated needs to take a mini-batch of samples whose size grows with the iteration number k at the k-th update. Assuming a weak-convexity condition and using the tool of the Moreau envelope, davis2019stochastic establishes an O(ε⁻⁴) complexity result of the stochastic subgradient method for solving more general nonsmooth nonconvex problems, to produce a near-ε stochastic stationary solution (see davis2019stochastic for the precise definition).

In general, the O(ε⁻⁴) complexity result cannot be improved for smooth nonconvex stochastic problems, as arjevani2019lower shows that for the problem min_x F(x) with a smooth F, any SGM that can only access unbiased SGs with bounded variance needs Ω(ε⁻⁴) SGs to produce a solution x̄ such that E[∥∇F(x̄)∥] ≤ ε. However, with one additional mean-squared smoothness condition on each unbiased SG, the complexity result can be improved to Õ(ε⁻³), which has been reached by a few variance-reduced SGMs tran2019hybrid ; wang2018spiderboost ; fang2018spider ; cutkosky2019momentum . These methods are closely related to ours, and we briefly review them below.

Spider. To find a stochastic ε-stationary solution of (1.1) with r ≡ 0, fang2018spider proposes the Spider method with the update x^{k+1} = x^k − η_k v^k for each k ≥ 0. Here, v^k is set to

 v^k = (1/|B_k|) Σ_{ξ∈B_k} (∇f(x^k; ξ) − ∇f(x^{k−1}; ξ)) + v^{k−1},  if mod(k, q) ≠ 0,
 v^k = (1/|C_k|) Σ_{ξ∈C_k} ∇f(x^k; ξ),  otherwise,   (1.5)

where B_k and C_k are sampled mini-batches, and q is the number of updates between two large-batch steps. Under the mean-squared smoothness condition (see Assumption 2), the Spider method can produce a stochastic ε-stationary solution within O(ε⁻³) updates, by choosing appropriate mini-batch sizes and an appropriate learning rate (roughly in the order of ε).
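The recursion (1.5) can be sketched as follows; the function and batch names are ours, and the batch-size schedule is left to the caller.

```python
import numpy as np

def spider_estimator(k, q, v_prev, x, x_prev, grad_f, small_batch, large_batch):
    """Spider-style recursive gradient estimator, in the spirit of (1.5).

    Every q iterations the estimator restarts from a large mini-batch average;
    in between, it adds a small-batch gradient difference to the previous estimate."""
    if k % q == 0:
        # restart step: plain mini-batch gradient at x over the large batch C_k
        return np.mean([grad_f(x, xi) for xi in large_batch], axis=0)
    # recursive step: correct v^{k-1} with a gradient difference over the small batch B_k
    diff = np.mean([grad_f(x, xi) - grad_f(x_prev, xi) for xi in small_batch], axis=0)
    return v_prev + diff
```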

Storm. cutkosky2019momentum focuses on a smooth nonconvex stochastic problem, i.e., (1.1) with r ≡ 0. It proposes the Storm method, which can be viewed as a special case of Algorithm 1 (with a single sample per update) applied to the smooth problem. However, its analysis and also its algorithm design rely on the knowledge of a uniform bound on ∥∇f(x; ξ)∥. In addition, because the learning rate of Storm is set dependently on the sampled stochastic gradient, its analysis needs almost-sure uniform smoothness of f(·; ξ). This assumption is significantly stronger than the mean-squared smoothness condition, and the uniform smoothness constant can also be much larger than an averaged one.

Spiderboost. wang2018spiderboost extends Spider to solving a nonsmooth nonconvex stochastic problem in the form of (1.1) by proposing the so-called Spiderboost method. Spiderboost iteratively performs the update

 x^{k+1} = argmin_x ⟨v^k, x⟩ + (1/η)V(x, x^k) + r(x),   (1.6)

where V denotes the Bregman divergence induced by a strongly convex function, and v^k is set by (1.5) with appropriately chosen mini-batch sizes and period q. Under the mean-squared smoothness condition, Spiderboost reaches a complexity result of O(ε⁻³) by choosing a learning rate η proportional to 1/L, where L is the smoothness constant.

Hybrid-SGD. tran2019hybrid considers a nonsmooth nonconvex stochastic problem in the form of (1.1). It proposes a proximal stochastic method, called Hybrid-SGD, as a hybrid of SARAH nguyen2017sarah and an unbiased SGD. At each update, Hybrid-SGD performs a proximal gradient step along a gradient estimator v^k, where, for a given momentum weight β_{k−1} ∈ (0, 1), the sequence {v^k} is set by

 v^k = β_{k−1} v^{k−1} + β_{k−1}(∇f(x^k; ξ_k) − ∇f(x^{k−1}; ξ_k)) + (1 − β_{k−1}) ∇f(x^k; ζ_k),   (1.7)

where ξ_k and ζ_k are two independent samples of ξ. A mini-batch version of Hybrid-SGD is also given in tran2019hybrid . By choosing appropriate parameters, Hybrid-SGD can reach an Õ(ε⁻³) complexity result. Although each update of v^k requires only two or O(1) samples, the initial estimator needs a large mini-batch of samples. As explained in (tran2019hybrid, Remark 4.1), if the initial mini-batch size is O(1), then the complexity result of Hybrid-SGD becomes worse than Õ(ε⁻³).
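A minimal sketch of the estimator (1.7), with `grad_f`, `xi`, and `zeta` standing for the stochastic gradient oracle and the two independent samples (our naming, single-sample version):

```python
import numpy as np

def hybrid_sgd_estimator(beta, v_prev, x, x_prev, grad_f, xi, zeta):
    """Hybrid-SGD estimator in the spirit of (1.7): a convex combination of a
    SARAH-style recursive term (sample xi, reused at both iterates) and a plain
    unbiased SGD term (independent sample zeta)."""
    sarah = v_prev + grad_f(x, xi) - grad_f(x_prev, xi)
    sgd = grad_f(x, zeta)
    return beta * sarah + (1.0 - beta) * sgd
```

Setting β = 1 recovers a pure SARAH recursion, while β = 0 recovers plain SGD; intermediate β trades the bias of the recursion against the variance of the unbiased term.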

More. There are many other works analyzing complexity results of SGMs for solving nonconvex finite-sum structured problems, e.g., allen2016variance ; reddi2016stochastic ; lei2017non ; huo2016asynchronous . These results often emphasize the dependence on the number of component functions as well as on the target error tolerance ε. In addition, several works have analyzed adaptive SGMs for nonconvex finite-sum or stochastic problems, e.g., chen2018convergence ; zhou2018convergence ; xu2020-APAM . An exhaustive review of all these works is impossible and also beyond the scope of this paper. We refer interested readers to those papers and the references therein.

### 1.4 Contributions

Our main contributions are on the algorithm design and analysis. We design a momentum-based variance-reduced mirror-prox stochastic gradient method for solving nonconvex nonsmooth stochastic problems. The proposed method generalizes Storm in cutkosky2019momentum from smooth cases to nonsmooth cases, and in addition, it achieves the same near-optimal complexity result under a mean-squared smoothness condition, which is weaker than the almost-sure uniform smoothness condition assumed in cutkosky2019momentum . While Spiderboost wang2018spiderboost and Hybrid-SGD tran2019hybrid can also achieve an Õ(ε⁻³) complexity result for stochastic nonconvex nonsmooth problems, they need large batches of data samples in some or all iterations. Our new method is the first one that requires only one or O(1) samples per iteration, and thus it can be applied to online learning problems that need real-time decisions based on possibly one or several new data samples. Furthermore, the proposed method only needs an estimate of the smoothness parameter and is easy to tune for good performance. Empirically, we observe that it converges faster than a vanilla SGD and performs more stably than Spiderboost and Hybrid-SGD on training sparse neural networks.

### 1.5 Notation, definitions, and outline

We use bold lowercase letters for vectors. E_{B_k} denotes the expectation over the mini-batch B_k conditional on all previous history, and E denotes the full expectation. |B| counts the number of elements in a set B. We use ∥·∥ for the Euclidean norm. A differentiable function f is called L-smooth if ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥ for all x and y.

###### Definition 1 (proximal gradient mapping)

Given x ∈ dom(r), d ∈ ℝⁿ, and η > 0, we define P(x, d, η) = (1/η)(x − x⁺), where

 x⁺ = argmin_y {⟨d, y⟩ + (1/η)V(y, x) + r(y)}.

By the proximal gradient mapping, if a point x is an optimal solution of (1.1), then it must satisfy P(x, ∇F(x), η) = 0 for any η > 0. Based on this observation, we define a near-stationary solution as follows. This definition is standard and has been adopted in other papers, e.g., wang2018spiderboost .

###### Definition 2 (stochastic ε-stationary solution)

Given ε > 0, a random vector x̄ is called a stochastic ε-stationary solution of (1.1) if for some η > 0, it holds E[∥P(x̄, ∇F(x̄), η)∥²] ≤ ε².

From (ghadimi2016mini, Lemma 1), it holds

 ⟨d, P(x, d, η)⟩ ≥ ∥P(x, d, η)∥² + (1/η)(r(x⁺) − r(x)),   (1.8)

 ∥P(x, d₁, η) − P(x, d₂, η)∥ ≤ ∥d₁ − d₂∥, ∀d₁, d₂, ∀x ∈ dom(r), ∀η > 0.   (1.9)
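In the Euclidean case w = ½∥·∥² (so V(y, x) = ½∥y − x∥²), the mapping of Definition 1 can be computed with a standard proximal operator; the sketch below uses our own names, with `prox_r` a user-supplied proximal operator of r.

```python
import numpy as np

def prox_grad_mapping(x, d, eta, prox_r):
    """Euclidean proximal gradient mapping P(x, d, eta) = (x - x_plus) / eta,
    where x_plus = prox_{eta * r}(x - eta * d)."""
    x_plus = prox_r(x - eta * d, eta)
    return (x - x_plus) / eta

# with r = 0 the proximal operator is the identity, and then P(x, d, eta) = d
identity_prox = lambda z, t: z
```

With r = 0 the mapping returns d itself, so ∥P(x, ∇F(x), η)∥ reduces to the usual gradient norm ∥∇F(x)∥ used for smooth problems.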

For each k ≥ 0, we denote

 g^k = P(x^k, d^k, η_k),  ḡ^k = P(x^k, ∇F(x^k), η_k).   (1.10)

Notice that ∥ḡ^k∥ measures the violation of stationarity at x^k. The gradient error is represented by

 e^k = d^k − ∇F(x^k).   (1.11)

Outline. The rest of the paper is organized as follows. In section 2, we establish the complexity result of Algorithm 1. Numerical experiments are conducted in section 3, and we conclude the paper in section 4.

## 2 Convergence analysis

In this section, we analyze the complexity result of Algorithm 1. Our analysis is inspired by that in cutkosky2019momentum and wang2018spiderboost . Throughout our analysis, we make the following assumptions.

###### Assumption 1 (finite optimal objective)

The optimal objective value of (1.1) is finite.

###### Assumption 2 (mean-squared smoothness)

The function F is L-smooth, and f satisfies the mean-squared smoothness condition:

 E_ξ[∥∇f(x; ξ) − ∇f(y; ξ)∥²] ≤ L²∥x − y∥², ∀x, y ∈ dom(r).
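As a quick sanity check (our illustration, not from the paper), the least-squares loss f(x; ξ) = ½(aᵀx − b)² with ξ = (a, b) satisfies this condition: ∇f(x; ξ) − ∇f(y; ξ) = aaᵀ(x − y), so E∥∇f(x; ξ) − ∇f(y; ξ)∥² = (x − y)ᵀ E[(aaᵀ)²] (x − y) ≤ L²∥x − y∥² with L² = λ_max(E[(aaᵀ)²]). A numerical verification over an empirical distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# empirical distribution of a over 200 draws; b does not affect the gradient difference
A = rng.normal(size=(200, 5))
# M approximates E[(a a^T)^2]; its largest eigenvalue is the (empirical) L^2
M = np.mean([np.outer(a, a) @ np.outer(a, a) for a in A], axis=0)
L2 = np.linalg.eigvalsh(M)[-1]  # eigvalsh returns eigenvalues in ascending order

x, y = rng.normal(size=5), rng.normal(size=5)
# mean-squared gradient difference over the same empirical distribution
lhs = np.mean([np.linalg.norm(np.outer(a, a) @ (x - y)) ** 2 for a in A])
assert lhs <= L2 * np.linalg.norm(x - y) ** 2 + 1e-9
```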
###### Assumption 3 (unbiasedness and variance boundedness)

There is σ > 0 such that for each k ≥ 0,

 E_{B_k}[v^k] = ∇F(x^k),  E_{B_k}[u^k] = ∇F(x^{k−1}),   (2.1)
 E[∥v^k − ∇F(x^k)∥²] ≤ σ².   (2.2)

We first show a few lemmas. The lemma below estimates one-iteration progress. Its proof follows from wang2018spiderboost .

###### Lemma 1 (one-iteration progress)

Let {x^k} be generated from Algorithm 1. Then

 Φ(x^{k+1}) − Φ(x^k) ≤ (η_k/2)(2 − η_kL)∥e^k∥² − (η_k/4)(1 − η_kL)∥ḡ^k∥², ∀k ≥ 1.
###### Proof

By the L-smoothness of F and the update x^{k+1} = x^k − η_k g^k (see (1.10)), we have

 F(x^{k+1}) − F(x^k) ≤ ⟨∇F(x^k), x^{k+1} − x^k⟩ + (L/2)∥x^{k+1} − x^k∥² = −η_k⟨∇F(x^k), g^k⟩ + (η_k²L/2)∥g^k∥².   (2.3)

Using the definition of e^k in (1.11) and the inequality in (1.8), we have

 −⟨∇F(x^k), g^k⟩ = ⟨e^k, g^k⟩ − ⟨d^k, g^k⟩ ≤ ⟨e^k, g^k⟩ − ∥g^k∥² + (1/η_k)(r(x^k) − r(x^{k+1})).

Plugging the above inequality into (2.3) and rearranging terms give

 Φ(x^{k+1}) − Φ(x^k) ≤ η_k⟨e^k, g^k⟩ − η_k∥g^k∥² + (η_k²L/2)∥g^k∥².

By the Cauchy–Schwarz and Young inequalities, it holds η_k⟨e^k, g^k⟩ ≤ (η_k/2)∥e^k∥² + (η_k/2)∥g^k∥², which together with the above inequality implies

 Φ(x^{k+1}) − Φ(x^k) ≤ (η_k/2)∥e^k∥² − (η_k/2)(1 − η_kL)∥g^k∥².   (2.4)

From (1.9) and the definitions of g^k and ḡ^k in (1.10), it follows that

 −∥g^k∥² ≤ −½∥ḡ^k∥² + ∥g^k − ḡ^k∥² ≤ −½∥ḡ^k∥² + ∥d^k − ∇F(x^k)∥² = −½∥ḡ^k∥² + ∥e^k∥².   (2.5)

Now plug the above inequality into (2.4) to give the desired result.

The next lemma gives a recursive bound on the gradient-error sequence {e^k}. Its proof follows that of (cutkosky2019momentum, Lemma 2).

###### Lemma 2 (recursive bound on gradient error)

For each k ≥ 0, it holds

 E[∥e^{k+1}∥²] ≤ (2β_k²σ²)/m + 4(1 − β_k)²η_k²L² E[∥ḡ^k∥²] + (1 − β_k)²(1 + 4η_k²L²) E[∥e^k∥²],

where m denotes the number of samples in each mini-batch B_k.
###### Proof

First, notice that

 E_{B_{k+1}}[⟨v^{k+1} − ∇F(x^{k+1}), e^k⟩] = 0,  E_{B_{k+1}}[⟨u^{k+1} − ∇F(x^k), e^k⟩] = 0.   (2.6)

Hence, by writing e^{k+1} = [v^{k+1} − ∇F(x^{k+1}) + (1 − β_k)(∇F(x^k) − u^{k+1})] + (1 − β_k)e^k, we have

 E_{B_{k+1}}[∥e^{k+1}∥²] = E_{B_{k+1}}[∥v^{k+1} − ∇F(x^{k+1}) + (1 − β_k)(∇F(x^k) − u^{k+1})∥²] + (1 − β_k)²∥e^k∥².   (2.7)

By Young’s inequality, it holds

 ∥v^{k+1} − ∇F(x^{k+1}) + (1 − β_k)(∇F(x^k) − u^{k+1})∥²
 = ∥β_k(v^{k+1} − ∇F(x^{k+1})) + (1 − β_k)(v^{k+1} − ∇F(x^{k+1}) + ∇F(x^k) − u^{k+1})∥²
 ≤ 2β_k²∥v^{k+1} − ∇F(x^{k+1})∥² + 2(1 − β_k)²∥v^{k+1} − ∇F(x^{k+1}) + ∇F(x^k) − u^{k+1}∥².   (2.8)

From Assumption 3, we have

 E_{B_{k+1}}[∥v^{k+1} − ∇F(x^{k+1}) + ∇F(x^k) − u^{k+1}∥²] ≤ E_{B_{k+1}}[∥v^{k+1} − u^{k+1}∥²].

Hence, taking the conditional expectation on both sides of (2.8) and substituting into (2.7) yield

 E_{B_{k+1}}[∥e^{k+1}∥²] ≤ 2β_k² E_{B_{k+1}}[∥v^{k+1} − ∇F(x^{k+1})∥²] + 2(1 − β_k)² E_{B_{k+1}}[∥v^{k+1} − u^{k+1}∥²] + (1 − β_k)²∥e^k∥².

Now taking the full expectation of the above inequality and using Assumptions 2 and 3, we have

 E[∥e^{k+1}∥²] ≤ (2β_k²σ²)/m + 2(1 − β_k)²L² E[∥x^{k+1} − x^k∥²] + (1 − β_k)² E[∥e^k∥²].   (2.11)

By similar arguments as those in (2.5), it holds

 ∥g^k∥² ≤ 2∥ḡ^k∥² + 2∥g^k − ḡ^k∥² ≤ 2∥ḡ^k∥² + 2∥e^k∥².

Now notice x^{k+1} − x^k = −η_k g^k and plug the above inequality into (2.11) to obtain the desired result.

Using Lemmas 1 and 2, we first show a convergence rate result by choosing the parameters that satisfy a general condition. Then we specify the choice of the parameters.

###### Theorem 2.1

Under Assumptions 1 through 3, let {x^k} be the iterate sequence from Algorithm 1, with the parameters {η_k} and {β_k} satisfying the condition

 (1/4)(1 − η_kL) − (η_k/(5η_{k+1}))(1 − β_k)² > 0, and (η_k/2)(2 − η_kL) − 1/(20η_kL²) + ((1 − β_k)²(1 + 4η_k²L²))/(20η_{k+1}L²) ≤ 0, ∀k ≥ 0.   (2.12)

Let ḡ^k be defined in (1.10). Then

 Σ_{k=0}^{K−1} ((η_k/4)(1 − η_kL) − (η_k²/(5η_{k+1}))(1 − β_k)²) E[∥ḡ^k∥²] ≤ Φ(x⁰) − Φ∗ + E[∥e⁰∥²]/(20η₀L²) + Σ_{k=0}^{K−1} (β_k²σ²)/(10mη_{k+1}L²).   (2.13)
###### Proof

From Lemmas 1 and 2, it follows that

 E[Φ(x^{k+1}) + ∥e^{k+1}∥²/(20η_{k+1}L²) − Φ(x^k) − ∥e^k∥²/(20η_kL²)]
 ≤ E[(η_k/2)(2 − η_kL)∥e^k∥² − (η_k/4)(1 − η_kL)∥ḡ^k∥² − ∥e^k∥²/(20η_kL²)]
 + (1/(20η_{k+1}L²)) E[(2β_k²σ²)/m + 4(1 − β_k)²η_k²L²∥ḡ^k∥² + (1 − β_k)²(1 + 4η_k²L²)∥e^k∥²].   (2.14)

We have from the condition (2.12) that the coefficient of ∥e^k∥² on the right-hand side of (2.14) is nonpositive, and thus we obtain from (2.14) that

 E[Φ(x^{k+1}) + ∥e^{k+1}∥²/(20η_{k+1}L²) − Φ(x^k) − ∥e^k∥²/(20η_kL²)] ≤ (β_k²σ²)/(10mη_{k+1}L²) − ((η_k/4)(1 − η_kL) − (η_k²/(5η_{k+1}))(1 − β_k)²) E[∥ḡ^k∥²].

Summing up the above inequality from k = 0 through K − 1 gives

 E[Φ(x^K) + ∥e^K∥²/(20η_KL²) − Φ(x⁰) − ∥e⁰∥²/(20η₀L²)] ≤ Σ_{k=0}^{K−1} (β_k²σ²)/(10mη_{k+1}L²) − Σ_{k=0}^{K−1} ((η_k/4)(1 − η_kL) − (η_k²/(5η_{k+1}))(1 − β_k)²) E[∥ḡ^k∥²],

which, together with Φ(x^K) ≥ Φ∗ and ∥e^K∥² ≥ 0 from Assumption 1, implies the inequality in (2.13).

Now we specify the choice of parameters and establish a complexity result of Algorithm 1.

###### Theorem 2.2 (convergence rate)

Under Assumptions 1 through 3, let {x^k} be the iterate sequence from Algorithm 1, with the parameters {η_k} and {β_k} set to

 η_k = η/(k+4)^{1/3},  β_k = (1 + 20η_k²L² − η_{k+1}/η_k)/(1 + 4η_k²L²), ∀k ≥ 0,   (2.15)

where η is a positive number such that η_kL ≤ 1/8 for all k ≥ 0, and τ is randomly selected from {0, 1, …, K−1} with probabilities proportional to the coefficients on the left-hand side of (2.13). Then

 E[∥ḡ^τ∥²] ≤ [2(Φ(x⁰) − Φ∗ + (∛4/(20ηL²)) E[∥e⁰∥²] + (σ²/(10mL²))(1152η³L⁴(5/4)^{1/3}(log(K+3) − log 3) + 13/(∛9 η)))] / [3(7/32 − (1/5)(5/4)^{1/3})η((K+4)^{2/3} − 4^{2/3})].   (2.16)
###### Proof

Since η_k = η/(k+4)^{1/3}, it holds η_k/η_{k+1} = ((k+5)/(k+4))^{1/3} ≤ (5/4)^{1/3}. Also, notice η_{k+1} ≤ η_k, or equivalently η_{k+1}/η_k ≤ 1, for all k ≥ 0. Hence, it is straightforward to have 0 < β_k < 1 for each k ≥ 0. Now notice (1/4)(1 − η_kL) ≥ 7/32 > (1/5)(5/4)^{1/3} ≥ (η_k/(5η_{k+1}))(1 − β_k)², so the first inequality in (2.12) holds. In addition, after multiplying by 20η_{k+1}L² > 0, the second inequality in (2.12) is equivalent to (1 − β_k)²(1 + 4η_k²L²) ≤ η_{k+1}/η_k − 10η_kη_{k+1}L²(2 − η_kL), which can be verified for the choice of β_k in (2.15) by using η_kL ≤ 1/8 and (4/5)^{1/3} ≤ η_{k+1}/η_k ≤ 1. Therefore, both conditions in (2.12) hold, and thus we have (2.13).

Next we bound the coefficients in (2.13). First, from (1/4)(1 − η_kL) ≥ 7/32 and η_k/η_{k+1} ≤ (5/4)^{1/3} for all k ≥ 0, we have

 Σ_{k=0}^{K−1} ((η_k/4)(1 − η_kL) − (η_k²/(5η_{k+1}))(1 − β_k)²) ≥ c Σ_{k=0}^{K−1} η_k ≥ cη ∫₀^K (x+4)^{−1/3} dx = (3cη/2)((K+4)^{2/3} − 4^{2/3}),   (2.17)

where c = 7/32 − (1/5)(5/4)^{1/3} > 0. Second,

 Σ_{k=0}^{K−1} β_k²/η_{k+1} ≤ (1/η) Σ_{k=0}^{K−1} (k+5)^{1/3}(1 + 24η_k²L² − η_{k+1}/η_k)² = (1/η) Σ_{k=0}^{K−1} (k+5)^{1/3}(1 + 24η_k²L² − (k+4)^{1/3}/(k+5)^{1/3})².   (2.18)

Note that

 Σ_{k=0}^{K−1} (k+5)^{1/3} η_k⁴ = η⁴ Σ_{k=0}^{K−1} (k+5)^{1/3}(k+4)^{−4/3} ≤ η⁴(5/4)^{1/3} Σ_{k=0}^{K−1} (k+4)^{−1} ≤ η⁴(5/4)^{1/3}(log(K+3) − log 3).