# Momentum Schemes with Stochastic Variance Reduction for Nonconvex Composite Optimization

Two new stochastic variance-reduced algorithms named SARAH and SPIDER have been recently proposed, and SPIDER has been shown to achieve a near-optimal gradient oracle complexity for nonconvex optimization. However, the theoretical advantage of SPIDER does not lead to substantial improvement of practical performance over SVRG. To address this issue, momentum technique can be a good candidate to improve the performance of SPIDER. However, existing momentum schemes used in variance-reduced algorithms are designed specifically for convex optimization, and are not applicable to nonconvex scenarios. In this paper, we develop novel momentum schemes with flexible coefficient settings to accelerate SPIDER for nonconvex and nonsmooth composite optimization, and show that the resulting algorithms achieve the near-optimal gradient oracle complexity for achieving a generalized first-order stationary condition. Furthermore, we generalize our algorithm to online nonconvex and nonsmooth optimization, and establish an oracle complexity result that matches the state-of-the-art. Our extensive experiments demonstrate the superior performance of our proposed algorithm over other stochastic variance-reduced algorithms.


## 1 Introduction

In the era of machine learning, optimization problems associated with practical applications have a rapidly increasing data volume. In many scenarios, such optimization problems take the following composite form:

$$\min_{x\in\mathbb{R}^d} F(x) := f(x) + g(x), \quad \text{where } f(x) := \frac{1}{n}\sum_{i=1}^{n} \ell_i(x), \tag{P}$$

where $x\in\mathbb{R}^d$ is the optimization variable, the integer $n$ denotes the total sample size, $\ell_i$ is a differentiable function that corresponds to the loss on the $i$-th data sample, and $g$ denotes a possibly nonsmooth regularizer function. Solving problem (P) can be demanding because of the tremendous data size and the complex machine learning models (e.g., neural networks) that result in a highly nonconvex and nonsmooth loss landscape (Goodfellow et al., 2016). Therefore, stochastic gradient-like algorithms are commonly used in practice for their sample efficiency and implementation simplicity, while they maintain provable convergence guarantees in nonconvex optimization.

A variety of stochastic algorithms have been proposed in the literature for solving problem (P) without the regularizer $g$ (i.e., smooth nonconvex optimization). The simplest is the stochastic gradient descent (SGD) algorithm (Robbins & Monro, 1951; Bottou, 2010), which approximates the full gradient by one mini-batch of stochastic samples. Although SGD has a low per-iteration complexity, its convergence rate can deteriorate significantly because of the intrinsic variance of its stochastic estimator. This issue has been successfully resolved by more advanced stochastic variance-reduced gradient estimators that induce a smaller variance, leading to a variety of stochastic variance-reduced algorithms such as SAG (Schmidt et al., 2017), SAGA (Defazio et al., 2014), SVRG (Johnson & Zhang, 2013), etc. To further handle the nonsmooth regularizer $g$, proximal versions of these algorithms have been developed (Xiao & Zhang, 2014; Ghadimi et al., 2016; Reddi et al., 2016). However, these algorithms do not yield an optimal stochastic gradient oracle complexity for generic nonconvex optimization.

Recently, (Nguyen et al., 2017a, b) and (Fang et al., 2018) proposed a new type of stochastic variance-reduced algorithms called SARAH and SPIDER, respectively. Specifically, under an accuracy-dependent stepsize, it has been shown in (Fang et al., 2018) that a natural gradient descent step taken in SPIDER yields the optimal stochastic gradient oracle complexity for solving problem (P) without the regularizer $g$ (optimality holds in the parameter regime $n\le\mathcal{O}(\epsilon^{-4})$, where $\epsilon$ corresponds to the desired accuracy). In a subsequent work (Wang et al., 2018a), the authors proposed an improved algorithm scheme called Proximal SpiderBoost that allows a much larger constant-level stepsize and achieves the same order-level stochastic gradient oracle complexity for solving problem (P) under a convex regularizer $g$.

Although the aforementioned SPIDER-based algorithms achieve the optimal stochastic gradient oracle complexity in nonconvex optimization, recent works (Nguyen et al., 2017b; Fang et al., 2018) found that their practical performance is hardly better than that of the traditional SVRG. It is therefore important to exploit the structure of the SPIDER estimator in other algorithmic dimensions to further improve the practical performance of SPIDER-based algorithms, and momentum is a promising direction. However, two major challenges remain in designing momentum schemes for variance-reduced algorithms in nonconvex optimization. First, while momentum schemes have been well studied for (stochastic) gradient algorithms in nonconvex optimization (Ghadimi & Lan, 2016), the convergence guarantee of stochastic variance-reduced algorithms with momentum has only been explored for SVRG in certain convex scenarios (Nitanda, 2016; Allen-Zhu, 2017, 2018; Shang et al., 2018). It is therefore unclear whether a momentum scheme can be applied to SPIDER-based stochastic variance-reduced algorithms and still yield the optimal gradient oracle complexity for nonconvex optimization. Second, the existing momentum scheme for stochastic algorithms that solve the nonconvex problem (P) has a convergence guarantee only for convex regularizers with a bounded domain (Ghadimi & Lan, 2016), and hence does not apply to the many application scenarios where regularizers with unbounded domain (e.g., the $\ell_1$ norm) are commonly used.

In this paper, we explore momentum schemes for SPIDER-based variance reduction algorithms that can solve the nonconvex and nonsmooth problem (P) under a much broader choice of regularizers with convergence guarantee. We summarize our contributions as follows.

### Summary of Contributions

We consider solving problem (P) with nonconvex loss functions and an arbitrary convex regularizer (possibly nonsmooth). We propose Proximal SPIDER-M, a proximal stochastic algorithm that exploits both the SPIDER variance-reduction scheme and a momentum scheme for solving problem (P). We show that the output point generated by Proximal SPIDER-M satisfies a generalized $\epsilon$-first-order stationary condition within $\mathcal{O}(\epsilon^{-2})$ iterations, and the corresponding stochastic gradient oracle complexity is in the order of $\mathcal{O}(n+n^{1/2}\epsilon^{-2})$, matching the complexity lower bound for nonconvex optimization. To the best of our knowledge, this is the first known theoretical guarantee for stochastic variance-reduced algorithms with momentum in nonconvex optimization. We also note that our momentum scheme is applicable to arbitrary convex regularizers, which significantly relaxes the constraint of the existing momentum scheme that requires the regularizer to have a bounded domain in order to have a convergence guarantee for nonconvex optimization (Ghadimi & Lan, 2016).

We further propose two variants of the momentum scheme for Proximal SPIDER-M, namely epochwise-diminishing momentum and epochwise-restart momentum, and establish the same order-level oracle complexity result in nonconvex optimization as mentioned above. To the best of our knowledge, this is the first formal theoretical guarantee for epochwise-diminishing and restart momentum schemes in nonconvex optimization. Moreover, we generalize Proximal SPIDER-M to solve problem (P) in an online setting, and show that the algorithm satisfies the generalized $\epsilon$-first-order stationary condition within $\mathcal{O}(\epsilon^{-2})$ iterations, with an associated stochastic gradient oracle complexity in the order of $\mathcal{O}(\epsilon^{-3})$, matching the state-of-the-art result. Our numerical experiments demonstrate that the momentum scheme substantially improves the practical performance of SPIDER and outperforms other momentum-based variance-reduced algorithms.

### Related Work

Stochastic algorithms for nonconvex optimization: For nonconvex optimization, SGD has been shown to achieve an $\epsilon$-first-order stationary condition with an overall stochastic gradient oracle complexity of $\mathcal{O}(\epsilon^{-4})$ (Ghadimi et al., 2016). Convergence guarantees for various stochastic variance-reduced algorithms have also been established in nonconvex optimization. Specifically, SAGA and SVRG have been shown to yield an overall stochastic gradient oracle complexity of $\mathcal{O}(n+n^{2/3}\epsilon^{-2})$ (Reddi et al., 2016; Allen-Zhu & Hazan, 2016) to achieve an $\epsilon$-first-order stationary condition. More recently, (Nguyen et al., 2017a, b) proposed a novel stochastic variance reduction algorithm named SARAH and established its stochastic gradient oracle complexity for attaining an $\epsilon$-first-order stationary point. The SPIDER algorithm (Fang et al., 2018) is a variant of SARAH that uses the same gradient estimator but adopts a natural gradient descent update. (Fang et al., 2018) showed that SPIDER achieves an overall $\mathcal{O}(n+n^{1/2}\epsilon^{-2})$ stochastic gradient oracle complexity, which is optimal within the regime $n\le\mathcal{O}(\epsilon^{-4})$. (Wang et al., 2018a) further proposed an improved SPIDER scheme that allows a constant-level stepsize and can solve composite nonconvex optimization problems. In (Zhou et al., 2018), the authors proposed a nested stochastic variance reduction scheme for nonconvex optimization and achieved the same order-level oracle complexity as that of SPIDER. More recently, (Zhou et al., 2019; Zhang et al., 2018) applied the SARAH and SPIDER estimators to nonconvex optimization problems over manifolds.

Momentum schemes for nonconvex optimization: Momentum schemes were originally designed to accelerate gradient algorithms so that they achieve an optimal convergence rate in convex optimization (Nesterov, 2014; Beck & Teboulle, 2009; Tseng, 2010; Ghadimi & Lan, 2016). For nonconvex optimization, (Ghadimi & Lan, 2016) established convergence of stochastic gradient algorithms with momentum to an $\epsilon$-first-order stationary point with an overall stochastic gradient oracle complexity of $\mathcal{O}(\epsilon^{-4})$. The convergence guarantee of SVRG with momentum has been explored under a certain local gradient dominance geometry in nonconvex optimization (Li et al., 2017). However, the momentum scheme there needs to compare the objective function value (and hence compute the total loss) at each iteration, and is therefore not sample efficient. A similar momentum scheme has also been explored in second-order algorithms for nonconvex optimization (Wang et al., 2018b).

## 2 Preliminaries

In this section, we introduce some definitions and assumptions that are used throughout the paper. Recall that we are interested in solving the following optimization problem with composite objective function

$$\min_{x\in\mathbb{R}^d} F(x) := f(x) + g(x), \quad \text{where } f(x) := \frac{1}{n}\sum_{i=1}^{n} \ell_i(x), \tag{P}$$

where the function $f$ denotes the total loss on the training data and the function $g$ corresponds to the regularizer that penalizes the violation of a desired structure (e.g., sparsity, low-rankness, etc.). We adopt the following standard assumptions on problem (P).

###### Assumption 1.

The objective function in problem (P) satisfies:

1. The function $F$ is bounded below, i.e.,

 $$F^* := \inf_{x\in\mathbb{R}^d} F(x) > -\infty; \tag{1}$$
2. The loss functions $\ell_i$, $i=1,\ldots,n$, are $L$-smooth, i.e., there exists an $L>0$ such that for all $i$,

 $$\|\nabla\ell_i(x)-\nabla\ell_i(y)\| \le L\|x-y\|, \quad \forall x,y\in\mathbb{R}^d; \tag{2}$$
3. The regularizer function $g$ is proper and convex. (An extended-valued function is called proper if its domain is non-empty.)

Intuitively, item 1 of Assumption 1 guarantees the feasibility of the optimization problem (P) and item 2 imposes smoothness on the individual loss functions. Many convex regularizers (e.g., the $\ell_1$ norm, elastic net, etc.) are not differentiable, and therefore one cannot use the gradient to evaluate the first-order stationary condition for such a nonsmooth composite objective function. This motivates us to introduce a generalized notion of gradient, as we elaborate below.

We first introduce the following proximal mapping that is useful to handle the nonsmoothness of a function.

###### Definition 1 (Proximal mapping).

For any proper and convex function $g$, its proximal mapping evaluated at $x\in\mathbb{R}^d$ with parameter $\eta>0$ is the unique point defined as

$$\mathrm{prox}_{\eta g}(x) := \arg\min_{u\in\mathbb{R}^d}\Big\{g(u)+\frac{1}{2\eta}\|u-x\|^2\Big\}.$$

The proximal mapping is uniquely defined for convex functions. In particular, in the special case where $g$ is the indicator function of a convex set, its proximal mapping reduces to the projection operator onto that set. More importantly, the proximal mapping can be used to characterize the first-order stationary condition of nonsmooth composite functions in the following way.
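To make the definition concrete, the following minimal sketch evaluates two proximal mappings that have closed forms: soft-thresholding for a weighted $\ell_1$ norm, and projection for the indicator of a box. The function names, the box $[-1,1]^d$, and the numerical values are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def prox_l1(x, eta, weight=1.0):
    """Prox of g(u) = weight * ||u||_1 with parameter eta (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - eta * weight, 0.0)

def prox_box_indicator(x, eta, lo=-1.0, hi=1.0):
    """Prox of the indicator of the box [lo, hi]^d: the Euclidean projection.
    The result does not depend on eta; the argument is kept for a uniform signature."""
    return np.clip(x, lo, hi)

# Hypothetical input vector, chosen only for illustration.
x = np.array([1.5, -0.2, 0.7])
shrunk = prox_l1(x, eta=0.5)                # entries shrink toward zero by 0.5
projected = prox_box_indicator(x, eta=0.5)  # entries are clipped to [-1, 1]
```

Both mappings cost $\mathcal{O}(d)$ per call, which is why proximal steps for such regularizers add little overhead to a gradient step.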

###### Fact 1.

(Bauschke & Combettes, 2011) Let $g$ be a proper and convex function. Define the following notion of generalized gradient

$$G_\eta(x,\nabla f(x)) := \frac{1}{\eta}\Big(x-\mathrm{prox}_{\eta g}\big(x-\eta\nabla f(x)\big)\Big). \tag{3}$$

Then, $x$ is a critical point of the function $F$ (i.e., $0\in\nabla f(x)+\partial g(x)$) if and only if $G_\eta(x,\nabla f(x))=0$.

Intuitively, $G_\eta(x,\nabla f(x))$ can be understood as a generalized notion of gradient for the composite objective function. In the special case where $g\equiv 0$, the generalized gradient reduces to the usual gradient $\nabla f(x)$.

Based on the above definition, throughout the paper, we say that a point $x$ satisfies an $\epsilon$-first-order stationary condition of problem (P) if $\|G_\eta(x,\nabla f(x))\|\le\epsilon$.
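As a sanity check on the generalized gradient, the sketch below evaluates it on a toy problem with $f(x)=\frac{1}{2}\|x-b\|^2$ and an $\ell_1$ regularizer. The toy objective, the vector `b`, and the constants are assumptions made only for this illustration. It checks two properties: with no regularizer the generalized gradient coincides with $\nabla f(x)$, and at the minimizer of the composite objective it vanishes.

```python
import numpy as np

def prox_l1(x, eta, weight):
    """Prox of g(u) = weight * ||u||_1 with parameter eta (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - eta * weight, 0.0)

def generalized_gradient(x, grad, eta, prox):
    """G_eta(x, grad) = (x - prox_{eta g}(x - eta * grad)) / eta."""
    return (x - prox(x - eta * grad, eta)) / eta

# Toy smooth part f(x) = 0.5 * ||x - b||^2, so grad f(x) = x - b (illustrative choice).
b = np.array([2.0, 0.05, -1.0])
grad_f = lambda x: x - b
weight, eta = 0.1, 0.5

# Case g = 0: the prox is the identity, so G reduces to the ordinary gradient.
x = np.ones(3)
G_smooth = generalized_gradient(x, grad_f(x), eta, lambda u, e: u)

# Case g = weight * ||.||_1: the minimizer of f + g is the soft-thresholding of b,
# and the generalized gradient vanishes there.
x_star = prox_l1(b, 1.0, weight)
G_star = generalized_gradient(x_star, grad_f(x_star), eta,
                              lambda u, e: prox_l1(u, e, weight))
```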

## 3 Proximal SPIDER-M for Nonconvex Composite Optimization

In this section, we propose a proximal SPIDER algorithm that incorporates a momentum scheme (referred to as Proximal SPIDER-M) for solving the composite problem (P), and study its theoretical guarantee as well as the oracle complexity.

### 3.1 Algorithm Design

We present the detailed update rule of the Proximal SPIDER-M in Algorithm 1, where “Unif” denotes the uniform sampling scheme with replacement.

To elaborate on the algorithm design, note that Proximal SPIDER-M generates a tuple of variable sequences $\{x_k, y_k, z_k\}_k$ according to the momentum scheme. Specifically, the variables $x_k$ and $y_k$ are updated via proximal gradient-like steps using the gradient estimate $v_k$ proposed for SARAH in (Nguyen et al., 2017a, b), with different stepsizes $\lambda_k$ and $\beta_k$, respectively. Their convex combination with momentum coefficient $\alpha_{k+1}$ then yields the variable $z_k$. We choose the standard diminishing momentum coefficient $\alpha_k=\frac{2}{k+1}$, which serves to prove the convergence guarantee in nonconvex optimization. We also note that the two updates for $x_{k+1}$ and $y_{k+1}$ do not introduce extra computation overhead compared with a single update, since they both depend on the same stochastic gradient $v_k$.

We highlight the difference between our momentum scheme for Proximal SPIDER-M and the existing momentum schemes for proximal SGD in (Ghadimi & Lan, 2016) and proximal SVRG in (Allen-Zhu, 2017). These works use the following proximal gradient steps to update the variables $x_k$ and $y_k$:

$$x_{k+1} = \mathrm{prox}_{\lambda_k g}(x_k-\lambda_k v_k), \tag{4}$$

$$y_{k+1} = \mathrm{prox}_{\beta_k g}(z_k-\beta_k v_k). \tag{5}$$

Note that eq. 4 and eq. 5 use different proximal gradient updates that are based on $x_k$ and $z_k$, respectively. In comparison, our momentum scheme in Algorithm 1 applies the same proximal gradient term to update both variables $x_k$ and $y_k$, and therefore requires less computation for evaluating the proximal mapping. Moreover, our update for the variable $y_{k+1}$ is not a single proximal gradient update (as opposed to eq. 5), and it couples the variables $z_k$ and $x_k$.

The momentum scheme introduced in (Ghadimi & Lan, 2016) based on eq. 4 and eq. 5 was shown to have a convergence guarantee in nonconvex composite optimization only for convex regularizers that have a bounded domain. Therefore, it cannot yield a provable convergence guarantee for regularizers with unbounded domain, which are commonly used in practical applications, e.g., the $\ell_1$ and $\ell_2$ norms, elastic net, etc. On the other hand, the momentum scheme introduced in (Allen-Zhu, 2017) was not proven to have a convergence guarantee in nonconvex optimization. In the next subsection, we prove that our momentum scheme in Algorithm 1 has a provable convergence guarantee for nonconvex composite optimization with arbitrary convex regularizers, thereby eliminating the restriction on the regularizers in (Ghadimi & Lan, 2016).
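The update rule discussed in this subsection can be sketched in a few lines. The listing of Algorithm 1 is not reproduced in this text, so the sketch below is a reconstruction under stated assumptions rather than the authors' exact algorithm: it uses the convex combination for $z_k$, one shared proximal step for $x_{k+1}$, the update $y_{k+1}=z_k-\beta_k G_{\lambda_k}(x_k,v_k)$ as written in the proof outline, the SARAH/SPIDER recursion evaluated at $z_k$ with a full gradient every $q$ iterations, and $\alpha_k=2/(k+1)$. The toy least-squares problem, the choice $\lambda_k=\beta_k=1/(4L)$, the batch size, and returning the last iterate instead of a randomly selected one are all illustrative assumptions.

```python
import numpy as np

def prox_l1(x, eta, weight):
    """Prox of g(u) = weight * ||u||_1 with parameter eta (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - eta * weight, 0.0)

def proximal_spider_m(batch_grad, n, x0, prox, K, q, batch_size, beta, lam, rng):
    """Sketch of Proximal SPIDER-M with iterationwise momentum alpha_k = 2/(k+1).

    batch_grad(x, idx) returns the average gradient of the losses l_i, i in idx.
    Returns the last iterate x_K (the randomized output rule is omitted here).
    """
    x, y = x0.copy(), x0.copy()
    z_prev, v = None, None
    for k in range(K):
        alpha = 2.0 / (k + 2)                      # alpha_{k+1}
        z = (1.0 - alpha) * y + alpha * x          # convex combination (momentum step)
        if k % q == 0:
            v = batch_grad(z, np.arange(n))        # full gradient at the start of an epoch
        else:
            idx = rng.integers(0, n, size=batch_size)   # uniform sampling with replacement
            v = batch_grad(z, idx) - batch_grad(z_prev, idx) + v  # SARAH/SPIDER recursion
        x_next = prox(x - lam * v, lam)            # the only proximal step in this iteration
        G = (x - x_next) / lam                     # generalized gradient G_lam(x_k, v_k)
        y = z - beta * G                           # y reuses the same G, no second prox
        x, z_prev = x_next, z
    return x

# Toy problem (assumption): l_i(x) = 0.5 * (a_i^T x - b_i)^2, g = weight * ||x||_1.
rng = np.random.default_rng(0)
n, d, weight = 200, 10, 0.01
A = rng.normal(size=(n, d))
x_true = np.zeros(d)
x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true + 0.01 * rng.normal(size=n)

batch_grad = lambda x, idx: A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)
objective = lambda x: 0.5 * np.mean((A @ x - b) ** 2) + weight * np.abs(x).sum()

L = np.max(np.sum(A ** 2, axis=1))   # smoothness constant of the individual losses
beta = 1.0 / (4.0 * L)               # hypothetical stepsize, not the paper's setting
x0 = np.zeros(d)
x_out = proximal_spider_m(batch_grad, n, x0,
                          lambda u, eta: prox_l1(u, eta, weight),
                          K=300, q=14, batch_size=14, beta=beta, lam=beta, rng=rng)
```

Note that each iteration calls `prox` exactly once, which is the computational saving relative to eq. 4 and eq. 5 that the text describes.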

### 3.2 Convergence and Complexity Analysis

In this subsection, we study the convergence guarantee as well as the stochastic gradient oracle complexity of Proximal SPIDER-M for solving the problem (P). We obtain the following main theorem.

###### Theorem 1.

Let Assumption 1 hold. Apply the Proximal SPIDER-M (see Algorithm 1) to solve the problem (P) with parameters , , and . Then, the output produced by the algorithm satisfies for any provided that the total number of iterations satisfies

$$K \ge \Theta\Big(\frac{L(F(x_0)-F^*)}{\epsilon^2}\Big). \tag{6}$$

Moreover, the total number of stochastic gradient oracle calls is at most $\mathcal{O}(n+n^{1/2}\epsilon^{-2})$ and the total number of proximal mapping oracle calls is at most $\mathcal{O}(\epsilon^{-2})$.

Theorem 1 establishes the convergence rate of Proximal SPIDER-M to the generalized first-order stationary condition and the corresponding oracle complexity. Specifically, the iteration complexity to achieve the generalized $\epsilon$-first-order stationary condition is in the order of $\mathcal{O}(\epsilon^{-2})$, which matches the state-of-the-art result for stochastic nonsmooth nonconvex optimization (Wang et al., 2018a). Furthermore, the corresponding stochastic gradient oracle complexity matches the lower bound for nonconvex optimization (Fang et al., 2018). Therefore, Proximal SPIDER-M enjoys the same optimal convergence guarantee as Proximal SpiderBoost (Wang et al., 2018a) in nonconvex optimization, and it further benefits from the momentum scheme, which can lead to significant acceleration in practical applications (as we demonstrate via experiments in Section 6). We also note that the design of Proximal SPIDER-M allows constant stepsizes, as opposed to the accuracy-dependent stepsize adopted by the original SPIDER (Fang et al., 2018), which also facilitates the convergence of the algorithm in practice.

###### Outline of Proof for Theorem 1.

As the technical proof is involved, we briefly outline the key intermediate steps below to convey some intuition on the analysis. The detailed proof is provided in the supplementary materials.

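Based on the definition of the generalized gradient (see Fact 1), we can rewrite the updates for $x_k, y_k, z_k$ in Algorithm 1 as follows:

$$\begin{aligned} z_k &= (1-\alpha_{k+1})y_k+\alpha_{k+1}x_k,\\ x_{k+1} &= x_k-\lambda_k G_{\lambda_k}(x_k,v_k),\\ y_{k+1} &= z_k-\beta_k G_{\lambda_k}(x_k,v_k). \end{aligned}$$

It can be seen that the term $G_{\lambda_k}(x_k,v_k)$ serves as a generalized gradient in the updates. Then, under the above momentum scheme, we can characterize the per-iteration progress of Proximal SPIDER-M by bounding the progressive function value gap as

$$F(x_{k+1})-F(x_k) \le -\Theta\big(\lambda_k\|G_{\lambda_k}(x_k,v_k)\|^2\big)+\Theta\Big(\Gamma_k\sum_{t=0}^{k-1}\frac{\lambda_t-\beta_t}{\alpha_{t+1}\Gamma_{t+1}}\|G_{\lambda_t}(x_t,v_t)\|^2\Big),$$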

where and we have hidden the constant factors for simplicity of presentation. The next key step is to bound the estimation error term in terms of the generalized gradient term as

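$$\mathbb{E}\big[\|\nabla f(z_k)-v_k\|^2\big] \le \sum_{i=(\tau(k)-1)q}^{k-1}\frac{L^2}{|\xi_i|}\Big[2\beta_i^2\,\mathbb{E}\|G_{\lambda_i}(x_i,v_i)\|^2+2\alpha_{i+2}^2\Gamma_{i+1}\sum_{t=0}^{i}\frac{(\lambda_t-\beta_t)^2}{\alpha_{t+1}\Gamma_{t+1}}\,\mathbb{E}\|G_{\lambda_t}(x_t,v_t)\|^2\Big],$$

where $\tau(k)$ denotes the index of the period that iteration $k$ belongs to. Then, combining the above two inequalities, telescoping, and simplifying yields

$$\mathbb{E}[F(x_K)] \le F(x_0)-\sum_{k=0}^{K-1}\frac{\beta_k}{16}\,\mathbb{E}\|G_{\lambda_k}(x_k,v_k)\|^2.$$

Based on the above result, we further exploit the randomized output strategy and finally obtain

$$\mathbb{E}\|G_{\lambda_\zeta}(x_\zeta,v_\zeta)\| \le \Theta\Big(\sqrt{\frac{L(F(x_0)-F^*)}{K}}\Big),$$

where $\zeta$ is selected from $\{0,\ldots,K-1\}$ uniformly at random. Then, the desired convergence rate and oracle complexity results follow. ∎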
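The last step of the outline can be spelled out as a short calculation. The sketch below assumes a constant stepsize $\beta_k\equiv\beta=\Theta(1/L)$ and an index $\zeta$ drawn uniformly from $\{0,\ldots,K-1\}$; both are assumptions made for this illustration, chosen to be consistent with the range of the telescoped sum. Using $\mathbb{E}[F(x_K)]\ge F^*$ and Jensen's inequality,

$$\begin{aligned}
\frac{\beta}{16}\sum_{k=0}^{K-1}\mathbb{E}\|G_{\lambda_k}(x_k,v_k)\|^2 &\le F(x_0)-\mathbb{E}[F(x_K)] \le F(x_0)-F^*,\\
\mathbb{E}\|G_{\lambda_\zeta}(x_\zeta,v_\zeta)\| &\le \sqrt{\mathbb{E}\|G_{\lambda_\zeta}(x_\zeta,v_\zeta)\|^2} = \sqrt{\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\|G_{\lambda_k}(x_k,v_k)\|^2} \le \sqrt{\frac{16\,(F(x_0)-F^*)}{\beta K}}.
\end{aligned}$$

With $\beta=\Theta(1/L)$ the right-hand side is $\Theta\big(\sqrt{L(F(x_0)-F^*)/K}\big)$, and requiring it to be at most $\epsilon$ gives $K\ge\Theta\big(L(F(x_0)-F^*)/\epsilon^2\big)$, in agreement with the iteration bound in Theorem 1.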

From a technical perspective, we highlight three major new developments in the proof of Theorem 1 that differ from the proof for the basic stochastic gradient algorithm with momentum (Ghadimi & Lan, 2016) in nonconvex optimization: 1) Our proof exploits the martingale structure of the SPIDER estimate, which allows us to bound the mean-square error term $\mathbb{E}\|\nabla f(z_k)-v_k\|^2$ tightly under the momentum scheme. In the traditional analysis of stochastic algorithms with momentum (Ghadimi & Lan, 2016), such an error term corresponds to the variance of the stochastic estimator and is assumed to be bounded by a universal constant. 2) Our proof requires a very careful bounding strategy to handle the accumulation of the mean-square error over the entire optimization path. 3) Our design of the momentum scheme allows us to prove convergence under arbitrary convex regularizers, whereas the proof of (Ghadimi & Lan, 2016) requires the regularizer to have a bounded domain.

## 4 Other Momentum Scheduling Schemes for Proximal SPIDER-M

It turns out that the design of Proximal SPIDER-M in Algorithm 1 accommodates more flexible momentum schemes in nonconvex optimization. In this section, we explore two variants of the momentum scheme for Proximal SPIDER-M that can be useful in practice, and we study the corresponding convergence guarantees.

### 4.1 Epochwise-diminishing Momentum

The Proximal SPIDER-M in Algorithm 1 uses a momentum coefficient $\alpha_k=\frac{2}{k+1}$ that diminishes to zero iterationwise. As an epoch of length $q$ usually consists of many inner iterations (e.g., multiple passes over the data), the momentum coefficient can become very small after several epochs and hence lead to limited acceleration. One strategy to alleviate this problem is to let the momentum coefficient diminish epochwise, i.e., to set

(Epochwise-diminishing momentum):

$$\alpha_k=\frac{2}{\lceil k/q\rceil+1}, \quad k=1,\ldots,K-1,$$

where $q$ corresponds to the number of inner iterations within each epoch and $\lceil\cdot\rceil$ denotes the ceiling function. Under this setting, the momentum coefficient remains constant within each epoch and diminishes slowly across epochs. We note that a similar momentum coefficient setting is adopted in (Allen-Zhu, 2017; Shang et al., 2018) for accelerating SVRG. However, the focus there is on convex optimization problems, and no convergence guarantee was established for nonconvex optimization.

### 4.2 Epochwise-restart Momentum

Another widely used momentum setting is to restart the momentum scheme after a fixed number of iterations. Specifically, in the context of Proximal SPIDER-M, after every epoch we synchronize the variables $x_{k+1}$ and $y_{k+1}$ to be the $x_k$ obtained in the previous iteration, i.e., we add the following step to the Proximal SPIDER-M in Algorithm 1:

$$\text{If } \mathrm{mod}(k,q-1)=0 \text{ then set } y_{k+1}=x_{k+1}=x_k.$$

This can be understood as an epochwise reinitialization of the variables. In addition, we restart the momentum coefficient after every epoch as

$$\alpha_k=\bar{\alpha}_{\mathrm{mod}(k,q)}, \quad \text{where } \bar{\alpha}_t=\frac{2}{t+1}, \quad t=1,\ldots,q-1.$$

Under such a momentum scheme, the momentum coefficient reboots at the beginning of every epoch, injecting a periodic momentum into the algorithm dynamic consistently. Finally, the algorithm outputs the point where is selected from uniformly at random.

The momentum scheme with restart has been applied to the gradient descent algorithm in (O'Donoghue & Candès, 2015), where it was shown that a proper restart scheme can significantly accelerate the practical convergence of the algorithm. However, it is unclear whether a restart momentum scheme can have a convergence guarantee in nonconvex and nonsmooth optimization, especially under the more sample-efficient SPIDER scheme. We establish such a theoretical result in the next subsection.

To further illustrate the differences among these three momentum schemes, we plot and compare the scheduling of the momentum coefficient of these momentum schemes in Figure 1. The area below each curve roughly corresponds to the total momentum that is injected into the algorithm dynamic by the corresponding momentum scheme. One can see that the original momentum scheme that diminishes iterationwisely has the smallest total momentum, whereas the epochwise-diminishing momentum scheme has the largest total momentum (within a considerable number of epochs). We further demonstrate that the practical performance of these momentum schemes is highly correlated with the accumulative momentum via numerical experiments in Section 6.
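The comparison of total momentum can be reproduced numerically. The sketch below evaluates the three coefficient schedules and sums them over the run. The values `q = 100` and `K = 1000` are arbitrary illustrative choices, and the restart coefficient at the first iteration of each epoch (where `k % q == 0`) is not specified by the formula above, so setting it to 1 is an assumption of this sketch.

```python
def alpha_iterationwise(k):
    """alpha_k = 2 / (k + 1): diminishes at every iteration."""
    return 2.0 / (k + 1)

def alpha_epochwise_diminishing(k, q):
    """alpha_k = 2 / (ceil(k / q) + 1): constant within an epoch."""
    epoch = -(-k // q)  # integer ceiling of k / q
    return 2.0 / (epoch + 1)

def alpha_epochwise_restart(k, q):
    """alpha_k = 2 / (t + 1) with t = k mod q: resets every epoch."""
    t = k % q
    # The text defines the restart coefficient only for t = 1, ..., q-1;
    # using 1.0 at t = 0 is an assumption made for this illustration.
    return 1.0 if t == 0 else 2.0 / (t + 1)

q, K = 100, 1000           # hypothetical epoch length and iteration budget
ks = range(1, K)           # k = 1, ..., K-1

total_iter = sum(alpha_iterationwise(k) for k in ks)
total_epoch = sum(alpha_epochwise_diminishing(k, q) for k in ks)
total_restart = sum(alpha_epochwise_restart(k, q) for k in ks)

print(f"iterationwise: {total_iter:.1f}, "
      f"epochwise-restart: {total_restart:.1f}, "
      f"epochwise-diminishing: {total_epoch:.1f}")
```

For these settings the sums are ordered as the text describes: the iterationwise schedule accumulates the least momentum and the epochwise-diminishing schedule the most.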

### 4.3 Convergence and Complexity Analysis

In this subsection, we present the convergence result and the corresponding oracle complexity of Proximal SPIDER-M under the variants of momentum schemes introduced in the previous subsections. We obtain the following main result.

###### Theorem 2.

Let Assumption 1 hold. Apply the Proximal SPIDER-M with either epochwise-diminishing momentum or epochwise-restart momentum to solve the problem (P). Set parameters , and . Then, the output of the algorithm satisfies for any under the same complexity requirements as those in Theorem 1.

From Theorem 2, it can be seen that the Proximal SPIDER-M maintains the optimal stochastic gradient oracle complexity in nonconvex optimization under the more flexible epochwise-diminishing and the epochwise restart momentum schemes. Therefore, this demonstrates that the algorithmic structure of SPIDER provides much flexibility in designing compatible momentum schemes in the nonconvex regime.

## 5 Proximal SPIDER-M for Online Nonconvex Composite Optimization

The objective function in the optimization problem (P) contains a finite number of data samples that are typically drawn from an underlying data distribution. Therefore, it can be viewed as a finite-sample approximation of the population risk $\mathbb{E}_{u\sim U}[\ell_u(x)]$, where the data sample $u$ is generated from an underlying distribution $U$. In this section, we study the following online composite optimization problem that involves the population risk:

$$\min_{x\in\mathbb{R}^d} F(x) := f(x)+g(x), \quad \text{where } f(x) := \mathbb{E}_{u\sim U}[\ell_u(x)], \tag{R}$$

where the function $g$ corresponds to the regularizer. As problem (R) depends on the population risk, which involves infinitely many samples, we propose a variant of Proximal SPIDER-M that can solve it in an online setting. We summarize the detailed steps of the algorithm in Algorithm 2, and refer to it as Online Proximal SPIDER-M.

Note that unlike the Proximal SPIDER-M for the finite-sum case, the Online Proximal SPIDER-M keeps drawing new data samples from the underlying distribution (uniformly at random) to construct the gradient estimate $v_k$. To study its convergence guarantee, we make the following standard assumption on the variance of the random sampling.

###### Assumption 2.

There exists a constant $\sigma>0$ such that for all $x\in\mathbb{R}^d$ and all random samples $u\sim U$, it holds that $\mathbb{E}\|\nabla\ell_u(x)-\nabla f(x)\|^2\le\sigma^2$.

Based on Assumption 2, we obtain the following convergence guarantee for Online Proximal SPIDER-M.

###### Theorem 3.

Let Assumptions 1 and 2 hold. Apply Online Proximal SPIDER-M (see Algorithm 2) to solve the problem (R). Choose any desired accuracy and set parameters , and . Then, the output of the algorithm satisfies provided that the total number of iterations satisfies

$$K \ge \Theta\Big(\frac{L(F(x_0)-F^*)}{\epsilon^2}\Big). \tag{7}$$

Moreover, the total number of stochastic gradient calls is at most $\mathcal{O}(\epsilon^{-3})$ and the total number of proximal mapping calls is at most $\mathcal{O}(\epsilon^{-2})$.

The orders of the results in Theorem 3 match those of the state of the art (Fang et al., 2018; Wang et al., 2018a). Our result demonstrates that the momentum scheme can be applied to facilitate the convergence of Proximal SPIDER for solving online nonsmooth and nonconvex problems with a provable convergence guarantee. Moreover, the following theorem gives a similar complexity result for Online Proximal SPIDER-M under the other two momentum schemes proposed in Section 4.

###### Theorem 4.

Let Assumptions 1 and 2 hold. Apply Online Proximal SPIDER-M with either epochwise-diminishing momentum or epochwise-restart momentum to solve the problem (R). Choose any desired accuracy and set parameters , and . Then, the output of the algorithm satisfies under the same complexity requirements as those in Theorem 3.

## 6 Experiments

In this section, we compare the practical performance of the following stochastic variance-reduced algorithms: SVRG in (Johnson & Zhang, 2013), SpiderBoost in (Wang et al., 2018a), Katyusha in (Allen-Zhu, 2017), ASVRG in (Shang et al., 2018), RSAG in (Ghadimi & Lan, 2016), Proximal SPIDER-M (Algorithm 1 in this paper), Proximal SPIDER-MED (epochwise-diminishing momentum) and Proximal SPIDER-MER (epochwise-restart momentum). We note that all algorithms use certain momentum schemes except for SVRG and SpiderBoost. For all algorithms considered, we set their learning rates to be . For each experiment, we initialize all the algorithms at the same point that is generated randomly from the normal distribution. Also, we choose a fixed mini-batch size and set the epoch length to be such that all algorithms pass over the entire dataset twice in each epoch.

### 6.1 Un-regularized Nonconvex Optimization

We first apply these algorithms to solve an un-regularized nonconvex optimization problem. The first problem is the following nonconvex logistic regression problem

$$\min_{w\in\mathbb{R}^d} f(w) := \frac{1}{n}\sum_{i=1}^{n}\ell(w^\top x_i, y_i)+\alpha\sum_{i=1}^{d}\frac{w_i^2}{1+w_i^2},$$

where $x_i\in\mathbb{R}^d$ denotes the features and $y_i$ corresponds to the labels, and we set the loss $\ell$ to be the cross-entropy loss and . For this problem, we use two different datasets from LIBSVM (Chang & Lin, 2011): the a9a dataset ($n=32561$, $d=123$) and the w8a dataset ($n=49749$, $d=300$). We report the learning curves on the function value gap of these algorithms in Figure 2.
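For readers who wish to experiment with this objective, the sketch below evaluates it together with its gradient. It assumes labels in $\{-1,+1\}$ (the LIBSVM convention) so that the cross-entropy loss takes the form $\log(1+\exp(-y_i w^\top x_i))$. The synthetic data and the value `alpha = 0.1` are placeholders for illustration and are not the paper's experimental setting.

```python
import numpy as np

def nonconvex_logreg(w, X, y, alpha):
    """Value and gradient of the logistic loss plus the nonconvex regularizer
    alpha * sum_i w_i^2 / (1 + w_i^2). Labels y are assumed to be in {-1, +1}."""
    margins = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins))       # stable log(1 + exp(-m))
    reg = alpha * np.sum(w ** 2 / (1.0 + w ** 2))
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m))
    coeff = -y / (1.0 + np.exp(margins))
    grad_loss = X.T @ coeff / len(y)
    # d/dw [w^2 / (1 + w^2)] = 2 w / (1 + w^2)^2
    grad_reg = alpha * 2.0 * w / (1.0 + w ** 2) ** 2
    return loss + reg, grad_loss + grad_reg

# Synthetic placeholder data (not a9a or w8a).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = np.where(rng.normal(size=50) >= 0.0, 1.0, -1.0)
w = rng.normal(size=5)
alpha = 0.1  # hypothetical regularization weight

value, grad = nonconvex_logreg(w, X, y, alpha)
```

The regularizer is bounded between 0 and $\alpha d$ and is nonconvex in each coordinate, which is what makes the overall objective nonconvex even though the logistic loss itself is convex.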

In this experiment, one can see from Figure 2 that our SPIDER-MED with epochwise-diminishing momentum achieves the best performance and significantly outperforms the other algorithms. We also note that Katyusha and ASVRG do not achieve much acceleration in this nonconvex case, as these algorithms were originally developed to accelerate the solution of convex problems. This demonstrates that our design of SPIDER-M has a stable performance in nonconvex optimization as well as a provable theoretical guarantee. The curve of SpiderBoost overlaps with that of SVRG (both algorithms have similar performance). Among all SPIDER-M-type algorithms, the one that uses the epochwise-diminishing momentum (SPIDER-MED) has the best performance, whereas the one that uses the iterationwise-diminishing momentum (SPIDER-M) is the slowest. This corroborates the comparison of the total momentum that we illustrate in Figure 1.

Next, we compare these algorithms in solving the following nonconvex robust linear regression problem

$$\min_{w\in\mathbb{R}^d} f(w) := \frac{1}{n}\sum_{i=1}^{n}\ell(y_i-w^\top x_i),$$

where we use the nonconvex loss function . We report the learning curves on the function value gap of these algorithms in Figure 3. One can see that our SPIDER-MED with epochwise diminishing momentum has a comparable performance to that of Katyusha, and they both outperform other algorithms.

### 6.2 Nonsmooth and Nonconvex Optimization

We further add an $\ell_1$ nonsmooth regularizer with weight coefficient to the objective functions of the above two optimization problems, and apply the corresponding proximal versions of these algorithms to solve the nonconvex composite optimization problems. All the results are presented in Figures 4 and 5. One can see that our Proximal SPIDER-MED still significantly outperforms all the other algorithms in these nonconvex and nonsmooth scenarios. This demonstrates that our design of the coupled update for $y_{k+1}$ in the momentum scheme is efficient in the nonsmooth and nonconvex setting. Also, Katyusha and ASVRG suffer from slow convergence (they converge only after around 40 epochs). Together with the first two experiments, this suggests that their performance is not stable and that they may not be generally suitable for solving nonsmooth and nonconvex problems.

## 7 Conclusion

In this paper, we design an efficient proximal stochastic variance-reduced algorithm with momentum to solve nonconvex composite optimization problems with a provable convergence guarantee. Under a basic momentum scheme, we show that our Proximal SPIDER-M achieves the best possible stochastic gradient oracle complexity for nonconvex optimization. Our algorithm design further allows applying other momentum schemes and solving online composite optimization problems with an optimal oracle complexity. We anticipate that our algorithm design will inspire the development of more advanced momentum acceleration schemes for stochastic nonconvex optimization. It is also interesting to explore whether our algorithm can achieve the best possible convergence rate in convex optimization.

## References

• Allen-Zhu (2017) Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research, 18(1):8194–8244, January 2017.
• Allen-Zhu (2018) Allen-Zhu, Z. Katyusha X: Simple momentum method for stochastic sum-of-nonconvex optimization. In Proc. International Conference on Machine Learning (ICML), volume 80, pp. 179–185, 10–15 Jul 2018.
• Allen-Zhu & Hazan (2016) Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In Proc. International Conference on Machine Learning (ICML), pp. 699–707, 2016.
• Bauschke & Combettes (2011) Bauschke, H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.
• Beck & Teboulle (2009) Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, March 2009.
• Bottou (2010) Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proc. International Conference on Computational Statistics (COMPSTAT), pp. 177–186, 2010.
• Chang & Lin (2011) Chang, C. and Lin, C. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1–27, 2011.
• Defazio et al. (2014) Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 1646–1654, 2014.
• Fang et al. (2018) Fang, C., Li, C., Lin, Z., and Zhang, T. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 687–697. 2018.
• Ghadimi & Lan (2016) Ghadimi, S. and Lan, G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, March 2016.
• Ghadimi et al. (2016) Ghadimi, S., Lan, G., and Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, January 2016.
• Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.
• Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. International Conference on Neural Information Processing Systems (NIPS), pp. 315–323, 2013.
• Li et al. (2017) Li, Q., Zhou, Y., Liang, Y., and Varshney, P. K. Convergence analysis of proximal gradient with momentum for nonconvex optimization. In Proc. International Conference on Machine Learning (ICML), volume 70, pp. 2111–2119, 2017.
• Nesterov (2014) Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2014.
• Nguyen et al. (2017a) Nguyen, L., Liu, J., Scheinberg, K., and Takáč, M. Stochastic recursive gradient algorithm for nonconvex optimization. ArXiv:1705.07261, May 2017a.
• Nguyen et al. (2017b) Nguyen, L., Liu, J., Scheinberg, K., and Takáč, M. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proc. International Conference on Machine Learning (ICML), 2017b.
• Nitanda (2016) Nitanda, A. Accelerated stochastic gradient descent for minimizing finite sums. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), volume 51, pp. 195–203, May 2016.
• O’Donoghue & Candès (2015) O’Donoghue, B. and Candès, E. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, June 2015.
• Reddi et al. (2016) Reddi, S. J., Sra, S., Poczos, B., and Smola, A. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 1145–1153. 2016.
• Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, September 1951.
• Schmidt et al. (2017) Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, March 2017.
• Shang et al. (2018) Shang, F., Jiao, L., Zhou, K., Cheng, J., Ren, Y., and Jin, Y. ASVRG: Accelerated proximal SVRG. In Proc. Asian Conference on Machine Learning, volume 95, pp. 815–830, 2018.
• Tseng (2010) Tseng, P. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Mathematical Programming, 125(2):263–295, October 2010.
• Wang et al. (2018a) Wang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. ArXiv:1810.10690, October 2018a.
• Wang et al. (2018b) Wang, Z., Zhou, Y., Liang, Y., and Lan, G. Cubic regularization with momentum for nonconvex optimization. ArXiv:1810.03763, October 2018b.
• Xiao & Zhang (2014) Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
• Zhang et al. (2018) Zhang, J., Zhang, H., and Sra, S. R-SPIDER: A fast Riemannian stochastic optimization algorithm with curvature independent rate. ArXiv:1811.04194, 2018.
• Zhou et al. (2018) Zhou, D., Xu, P., and Gu, Q. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 3925–3936, 2018.
• Zhou et al. (2019) Zhou, P., Yuan, X., and Feng, J. Faster First-Order Methods for Stochastic Non-Convex Optimization on Riemannian Manifolds. 2019.

## Appendix A Auxiliary Lemmas for Analysis of Algorithm 1

In this section, we collect some auxiliary results that facilitate the analysis of Algorithm 1. Throughout, for any , denote the unique integer such that . We also define and for . Since we set , it is easy to check that .

We first present an auxiliary lemma from (Fang et al., 2018).

###### Lemma 1.

(Fang et al., 2018) Under Assumption 1, the gradient estimate constructed by SPIDER satisfies that for all ,

$$\mathbb{E}\|v_k - \nabla f(z_k)\|^2 \le \frac{L^2}{|\xi_k|}\, \mathbb{E}\|z_k - z_{k-1}\|^2 + \mathbb{E}\|v_{k-1} - \nabla f(z_{k-1})\|^2.$$

Telescoping Lemma 1 and noting that for all such that , we obtain the following bound.

###### Lemma 2.

Under Assumption 1, the gradient estimate constructed by SPIDER satisfies that for all ,

$$\mathbb{E}\|v_k - \nabla f(z_k)\|^2 \le \sum_{i=(\tau(k)-1)q}^{k-1} \frac{L^2}{|\xi_i|}\, \mathbb{E}\|z_{i+1} - z_i\|^2. \qquad (8)$$
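To make the recursive structure behind Lemmas 1 and 2 concrete, the following is a minimal sketch of a SPIDER-style gradient estimate. The least-squares objective, the function names, and the rule of refreshing with a full gradient whenever `k % q == 0` are illustrative assumptions and are not taken from the paper's experiments.

```python
import numpy as np

def full_grad(A, b, z):
    """Gradient of the assumed objective f(z) = (1/2n) * ||A z - b||^2."""
    return A.T @ (A @ z - b) / A.shape[0]

def batch_grad(A, b, z, idx):
    """Mini-batch gradient of the same objective over the sampled rows idx."""
    return A[idx].T @ (A[idx] @ z - b[idx]) / len(idx)

def spider_estimate(A, b, z, z_prev, v_prev, k, q, idx):
    """Return v_k: a full gradient at epoch starts, a recursive correction otherwise."""
    if k % q == 0:
        # Epoch start: the estimation error is reset to zero.
        return full_grad(A, b, z)
    # Reuse v_{k-1} and correct it with a gradient difference on the SAME batch.
    return v_prev + batch_grad(A, b, z, idx) - batch_grad(A, b, z_prev, idx)
```

Within an epoch the error of `v_k` can only accumulate through the gradient differences, which is why the bound in (8) is a sum over the steps taken since the last full-gradient refresh.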

Next, recall the following definition of the gradient mapping for some and :

$$G_\eta(x, u) := \frac{1}{\eta}\left(x - \mathrm{prox}_{\eta g}(x - \eta u)\right).$$

Based on this definition, we can rewrite the updates of Algorithm 1 as follows:

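$$
\begin{aligned}
z_k &= (1 - \alpha_{k+1})\, y_k + \alpha_{k+1} x_k, \\
x_{k+1} &= x_k - \lambda_k G_{\lambda_k}(x_k, v_k), \\
y_{k+1} &= z_k - \beta_k G_{\lambda_k}(x_k, v_k).
\end{aligned}
$$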
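The rewritten updates can be sketched in a few lines of code. The snippet below assumes, purely for illustration, that the regularizer is $g = \text{reg} \cdot \|\cdot\|_1$, whose proximal map is soft-thresholding; the function names and the parameter `reg` are hypothetical.

```python
import numpy as np

def prox_l1(x, t):
    """Soft-thresholding: proximal map of t * ||.||_1 (assumed choice of g)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def grad_mapping(x, u, eta, reg):
    """G_eta(x, u) = (x - prox_{eta g}(x - eta u)) / eta with g = reg * ||.||_1."""
    return (x - prox_l1(x - eta * u, eta * reg)) / eta

def extrapolate(x, y, alpha_next):
    """z_k = (1 - alpha_{k+1}) y_k + alpha_{k+1} x_k."""
    return (1.0 - alpha_next) * y + alpha_next * x

def momentum_step(x, z, v, lam, beta, reg):
    """Given x_k, z_k and a gradient estimate v_k, return (x_{k+1}, y_{k+1})."""
    G = grad_mapping(x, v, lam, reg)
    x_next = x - lam * G   # equals prox_{lam g}(x - lam v)
    y_next = z - beta * G  # shorter step taken from the extrapolated point
    return x_next, y_next
```

Both sequences move along the same gradient mapping, so with `reg = 0` the mapping reduces to `v` and the two updates become plain gradient steps with step sizes `lam` and `beta`.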

Next, we prove the following auxiliary lemma.

###### Lemma 3.

Let the sequences be generated by Algorithm 1. Then, the following inequalities hold

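$$
\begin{aligned}
y_k - x_k &= \Gamma_k \sum_{t=1}^{k} \frac{\lambda_{t-1} - \beta_{t-1}}{\Gamma_t}\, G_{\lambda_{t-1}}(x_{t-1}, v_{t-1}), && (9) \\
\|y_k - x_k\|^2 &\le \Gamma_k \sum_{t=1}^{k} \frac{(\lambda_{t-1} - \beta_{t-1})^2}{\alpha_t \Gamma_t}\, \|G_{\lambda_{t-1}}(x_{t-1}, v_{t-1})\|^2, && (10) \\
\|z_{k+1} - z_k\|^2 &\le 2\beta_k^2 \|G_{\lambda_k}(x_k, v_k)\|^2 + 2\alpha_{k+2}^2 \Gamma_{k+1} \sum_{t=1}^{k+1} \frac{(\lambda_{t-1} - \beta_{t-1})^2}{\alpha_t \Gamma_t}\, \|G_{\lambda_{t-1}}(x_{t-1}, v_{t-1})\|^2. && (11)
\end{aligned}
$$
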
###### Proof.

We prove the first equality. By the update rule of the momentum scheme, we obtain that

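$$
\begin{aligned}
y_k - x_k &= z_{k-1} - \beta_{k-1} G_{\lambda_{k-1}}(x_{k-1}, v_{k-1}) - \left(x_{k-1} - \lambda_{k-1} G_{\lambda_{k-1}}(x_{k-1}, v_{k-1})\right) \\
&= (1 - \alpha_k)(y_{k-1} - x_{k-1}) + (\lambda_{k-1} - \beta_{k-1})\, G_{\lambda_{k-1}}(x_{k-1}, v_{k-1}). && (12)
\end{aligned}
$$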

Dividing both sides by $\Gamma_k$ and noting that $\frac{1 - \alpha_k}{\Gamma_k} = \frac{1}{\Gamma_{k-1}}$, we further obtain that

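$$\frac{y_k - x_k}{\Gamma_k} = \frac{y_{k-1} - x_{k-1}}{\Gamma_{k-1}} + \frac{\lambda_{k-1} - \beta_{k-1}}{\Gamma_k}\, G_{\lambda_{k-1}}(x_{k-1}, v_{k-1}). \qquad (13)$$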

Telescoping the above equality over yields the first desired equality.

Next, we prove the second inequality. Based on the first equality, we obtain that

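$$
\begin{aligned}
\|y_k - x_k\|^2 &= \left\| \Gamma_k \sum_{t=1}^{k} \frac{\lambda_{t-1} - \beta_{t-1}}{\Gamma_t}\, G_{\lambda_{t-1}}(x_{t-1}, v_{t-1}) \right\|^2 \\
&= \left\| \Gamma_k \sum_{t=1}^{k} \frac{\alpha_t}{\Gamma_t} \cdot \frac{\lambda_{t-1} - \beta_{t-1}}{\alpha_t}\, G_{\lambda_{t-1}}(x_{t-1}, v_{t-1}) \right\|^2 \\
&\overset{(i)}{\le} \Gamma_k \sum_{t=1}^{k} \frac{\alpha_t}{\Gamma_t} \cdot \frac{(\lambda_{t-1} - \beta_{t-1})^2}{\alpha_t^2}\, \|G_{\lambda_{t-1}}(x_{t-1}, v_{t-1})\|^2 \\
&= \Gamma_k \sum_{t=1}^{k} \frac{(\lambda_{t-1} - \beta_{t-1})^2}{\Gamma_t \alpha_t}\, \|G_{\lambda_{t-1}}(x_{t-1}, v_{t-1})\|^2, && (14)
\end{aligned}
$$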

where (i) uses the facts that is a decreasing sequence, and Jensen’s inequality.

Finally, we prove the third inequality. By the update rule of the momentum scheme, we obtain that $z_{k+1} - z_k = (y_{k+1} - z_k) + \alpha_{k+2}(x_{k+1} - y_{k+1})$. Then, we further obtain that

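$$
\begin{aligned}
\|z_{k+1} - z_k\| &\le \|y_{k+1} - z_k\| + \alpha_{k+2} \|x_{k+1} - y_{k+1}\| \\
&\le \beta_k \|G_{\lambda_k}(x_k, v_k)\| + \alpha_{k+2} \sqrt{\|x_{k+1} - y_{k+1}\|^2} \\
&\le \beta_k \|G_{\lambda_k}(x_k, v_k)\| + \alpha_{k+2} \sqrt{\Gamma_{k+1} \sum_{t=1}^{k+1} \frac{(\lambda_{t-1} - \beta_{t-1})^2}{\Gamma_t \alpha_t}\, \|G_{\lambda_{t-1}}(x_{t-1}, v_{t-1})\|^2}.
\end{aligned}
$$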

The desired result follows by taking the square on both sides of the above inequality and using the fact that $(a + b)^2 \le 2a^2 + 2b^2$. ∎
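Identity (9) is purely algebraic, so it can be sanity-checked numerically with arbitrary vectors standing in for the gradient mappings. The sketch below assumes $\alpha_k = 2/(k+1)$, $\Gamma_1 = 1$, $\Gamma_k = (1 - \alpha_k)\Gamma_{k-1}$ and $y_0 = x_0$; these choices are assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np

def alpha(k):
    """Assumed momentum coefficient, with alpha(1) = 1."""
    return 2.0 / (k + 1)

def check_identity(K=8, dim=3, seed=0):
    """Return the largest deviation between both sides of (9) over k = 1..K."""
    rng = np.random.default_rng(seed)
    lam = rng.uniform(0.5, 1.0, size=K)    # lambda_0, ..., lambda_{K-1}
    beta = rng.uniform(0.1, 0.5, size=K)   # beta_0, ..., beta_{K-1}
    G = rng.normal(size=(K, dim))          # stand-ins for the gradient mappings
    gamma = [None, 1.0]                    # gamma[k] = Gamma_k, with Gamma_1 = 1
    for k in range(2, K + 1):
        gamma.append((1.0 - alpha(k)) * gamma[-1])
    x = rng.normal(size=dim)
    y = x.copy()                           # assumed initialization y_0 = x_0
    max_err = 0.0
    for k in range(1, K + 1):
        z = (1.0 - alpha(k)) * y + alpha(k) * x          # z_{k-1}
        x, y = x - lam[k - 1] * G[k - 1], z - beta[k - 1] * G[k - 1]
        rhs = gamma[k] * sum(
            (lam[t - 1] - beta[t - 1]) / gamma[t] * G[t - 1]
            for t in range(1, k + 1)
        )
        max_err = max(max_err, float(np.max(np.abs((y - x) - rhs))))
    return max_err
```

Calling `check_identity()` should return a value at the level of floating-point round-off, which reflects that (9) holds for any sequence of update directions under the stated recursion for $\Gamma_k$.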

We also need the following lemma, which was established as Lemma 1 and Proposition 1 in (Ghadimi et al., 2016).

###### Lemma 4 (Lemma 1 and Proposition 1, (Ghadimi et al., 2016)).

Let $g$ be a proper and closed convex function. Then, for all $x, u, v \in \mathbb{R}^d$ and $\eta > 0$, the following statements hold:

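$$
\begin{aligned}
\langle u, G_\eta(x, u) \rangle &\ge \|G_\eta(x, u)\|^2 + \frac{1}{\eta}\left(g(\mathrm{prox}_{\eta g}(x - \eta u)) - g(x)\right), \\
\|G_\eta(x, u) - G_\eta(x, v)\| &\le \|u - v\|.
\end{aligned}
$$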
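Both statements of Lemma 4 can be checked numerically for a concrete regularizer. The sketch below assumes $g = \text{reg} \cdot \|\cdot\|_1$ (an illustrative choice, with hypothetical function names) and returns the slack of each inequality, which should be nonnegative up to round-off.

```python
import numpy as np

def prox_l1(x, t):
    """Soft-thresholding: proximal map of t * ||.||_1 (assumed choice of g)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def grad_mapping(x, u, eta, reg):
    """G_eta(x, u) for g = reg * ||.||_1."""
    return (x - prox_l1(x - eta * u, eta * reg)) / eta

def lemma4_slack(x, u, v, eta, reg):
    """Return (slack1, slack2) = right-minus-left gaps of the two inequalities."""
    def g(w):
        return reg * np.sum(np.abs(w))
    Gu = grad_mapping(x, u, eta, reg)
    Gv = grad_mapping(x, v, eta, reg)
    x_plus = prox_l1(x - eta * u, eta * reg)
    # First statement: <u, G> - ||G||^2 - (g(x_plus) - g(x)) / eta >= 0.
    slack1 = u @ Gu - Gu @ Gu - (g(x_plus) - g(x)) / eta
    # Second statement: ||u - v|| - ||G(x, u) - G(x, v)|| >= 0.
    slack2 = np.linalg.norm(u - v) - np.linalg.norm(Gu - Gv)
    return slack1, slack2
```

The first slack comes from convexity of $g$ at the proximal point and the second from nonexpansiveness of the proximal map, so neither depends on the particular $\ell_1$ choice beyond convexity.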

## Appendix B Proof of Theorem 1

Consider any iteration $k$ of the algorithm. By smoothness of $f$, we obtain that

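$$
\begin{aligned}
f(x_k) &\le f(x_{k-1}) + \langle \nabla f(x_{k-1}), -\lambda_{k-1} G_{\lambda_{k-1}}(x_{k-1}, v_{k-1}) \rangle + \frac{L \lambda_{k-1}^2}{2} \|G_{\lambda_{k-1}}(x_{k-1}, v_{k-1})\|^2 \\
&= f(x_{k-1}) - \lambda_{k-1} \langle \nabla f(x_{k-1}) - v_{k-1}, G_{\lambda_{k-1}}(x_{k-1}, v_{k-1}) \rangle - \lambda_{k-1} \langle v_{k-1}, G_{\lambda_{k-1}}(x_{k-1}, v_{k-1}) \rangle \\
&\quad + \frac{L \lambda_{k-1}^2}{2} \|G_{\lambda_{k-1}}(x_{k-1}, v_{k-1})\|^2 \\
&\overset{(i)}{\le} f(x_{k-1}) - \lambda_{k-1} \langle \nabla f(x_{k-1}) - v_{k-1}, G_{\lambda_{k-1}}(x_{k-1}, v_{k-1}) \rangle - \lambda_{k-1} \|G_{\lambda_{k-1}}(x_{k-1}, v_{k-1})\|^2 \\
&\quad - \left(g(\mathrm{prox}_{\lambda_{k-1} g}(x_{k-1} - \lambda_{k-1} v_{k-1})) - g(x_{k-1})\right) + \frac{L \lambda_{k-1}^2}{2} \|G_{\lambda_{k-1}}(x_{k-1}, v_{k-1})\|^2 \\
&= f(x_{k-1}) - \lambda_{k-1} \langle \nabla f(x_{k-1}) - v_{k-1}, G_{\lambda_{k-1}}(x_{k-1}, v_{k-1}) \rangle - \lambda_{k-1} \|G_{\lambda_{k-1}}(x_{k-1}, v_{k-1})\|^2 \\
&\quad - \left(g(x_k) - g(x_{k-1})\right) + \frac{L \lambda_{k-1}^2}{2} \|G_{\lambda_{k-1}}(x_{k-1}, v_{k-1})\|^2,
\end{aligned}
$$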

where (i) follows from Lemma 4. Rearranging the above inequality and using the Cauchy-Schwarz inequality yields that