The stochastic approximation method first appeared in robbins1951stochastic for solving a root-finding problem. Nowadays, its first-order version, the stochastic gradient method (SGM), is extensively used both for machine learning problems that involve huge amounts of given data and for stochastic problems that involve uncertain streaming data. Complexity results of SGMs have been well established for convex problems, and much recent research on SGMs focuses on the nonconvex case.
In this paper, we consider the regularized nonconvex stochastic program
$$\min_{\mathbf{x}\in\mathbb{R}^n}\ \phi(\mathbf{x}) := f(\mathbf{x}) + r(\mathbf{x}), \quad \text{with } f(\mathbf{x}) = \mathbb{E}_\xi\big[F(\mathbf{x};\xi)\big], \tag{1.1}$$
where $F(\cdot\,;\xi)$ is a smooth nonconvex function almost surely for $\xi$, and $r$ is a closed convex function on $\mathbb{R}^n$. Examples of (1.1) include the sparse online matrix factorization mairal2010online , the online nonnegative matrix factorization zhao2016online , and the streaming PCA (by a unit-ball constraint) mitliagkas2013memory . In addition, as $\xi$ follows a uniform distribution on a finite set, (1.1) recovers the so-called finite-sum structured problem. The latter includes most regularized machine learning problems such as the sparse bilinear logistic regression shi2014sparse and the sparse convolutional neural network liu2015sparse .
When $r \equiv 0$, the recent work arjevani2019lower gives an $\Omega(\varepsilon^{-3})$ lower complexity bound for SGMs to produce a stochastic $\varepsilon$-stationary solution of (1.1) (see Definition 2 below), by assuming the so-called mean-squared smoothness condition (see Assumption 2). Several variance-reduced SGMs tran2019hybrid ; wang2018spiderboost ; fang2018spider ; cutkosky2019momentum have achieved an $O(\varepsilon^{-3})$ or $\widetilde O(\varepsilon^{-3})$ complexity result (throughout the paper, we use $\widetilde O$ to suppress an additional polynomial term of $\log\frac{1}{\varepsilon}$). Among them, fang2018spider ; cutkosky2019momentum only consider smooth cases, i.e., $r \equiv 0$ in (1.1), while tran2019hybrid ; wang2018spiderboost study nonsmooth problems in the form of (1.1). To reach an $O(\varepsilon^{-3})$ complexity result, the Hybrid-SGD method in tran2019hybrid needs $O(\varepsilon^{-1})$ samples at the initial step and then two samples at each update, while wang2018spiderboost ; fang2018spider require $O(\varepsilon^{-1})$ or more samples after every fixed number of updates. The STORM method in cutkosky2019momentum requires one single sample of $\xi$ at each update, but it only applies to smooth problems. Practically, when training a (deep) machine learning model, small-batch training is often used to obtain better generalization masters2018revisiting ; keskar2016large
. In addition, for certain applications such as reinforcement learning sutton2018reinforcement , often only a single sample can be obtained at a time, depending on the stochastic environment and the current decision. Furthermore, regularization terms can improve the generalization of a machine learning model, even for training a neural network wei2019regularization . We aim at designing a new SGM for solving the nonconvex nonsmooth problem (1.1) that achieves a near-optimal complexity result by using $O(1)$ (possibly just one) samples at each update.
1.2 Mirror-prox algorithm
Our algorithm is a mirror-prox SGM, and we adopt the momentum technique to reduce the variance of the stochastic gradient in order to achieve a near-optimal complexity result.
Let $\psi$ be a continuously differentiable and 1-strongly convex function on $\mathbb{R}^n$, i.e.,
$$\psi(\mathbf{y}) \ge \psi(\mathbf{x}) + \langle \nabla\psi(\mathbf{x}), \mathbf{y} - \mathbf{x}\rangle + \tfrac{1}{2}\|\mathbf{y} - \mathbf{x}\|^2, \quad \forall\, \mathbf{x}, \mathbf{y}.$$
The Bregman divergence induced by $\psi$ is defined as
$$V(\mathbf{x}, \mathbf{y}) = \psi(\mathbf{y}) - \psi(\mathbf{x}) - \langle \nabla\psi(\mathbf{x}), \mathbf{y} - \mathbf{x}\rangle.$$
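As a concrete illustration (ours, not the paper's), the sketch below evaluates the Bregman divergence $V(\mathbf{x},\mathbf{y}) = \psi(\mathbf{y}) - \psi(\mathbf{x}) - \langle\nabla\psi(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle$ for two standard distance-generating functions: the squared Euclidean norm, which induces $\frac12\|\mathbf{y}-\mathbf{x}\|^2$, and negative entropy, which induces the KL divergence on the probability simplex.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Bregman divergence V(x, y) = psi(y) - psi(x) - <grad psi(x), y - x>."""
    return psi(y) - psi(x) - np.dot(grad_psi(x), y - x)

# psi(x) = 0.5 ||x||^2 (1-strongly convex w.r.t. the Euclidean norm):
# the induced divergence is 0.5 ||y - x||^2.
sq = lambda v: 0.5 * np.dot(v, v)
sq_grad = lambda v: v

# psi(x) = sum_i x_i log x_i (negative entropy, 1-strongly convex on the
# simplex w.r.t. the l1 norm): the induced divergence is KL(y || x).
ne = lambda v: np.sum(v * np.log(v))
ne_grad = lambda v: np.log(v) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
v_sq = bregman(sq, sq_grad, x, y)   # = 0.5 * ||y - x||^2
v_kl = bregman(ne, ne_grad, x, y)   # = KL(y || x) since sum(x) = sum(y) = 1
```

Both choices satisfy the 1-strong-convexity requirement above, each with respect to its own norm.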
At each iteration of our algorithm, we obtain one or a few samples of $\xi$, compute stochastic gradients at the previous and current iterates using the same samples, and then perform a mirror-prox momentum stochastic gradient update. The pseudocode is shown in Algorithm 1. We name it PStorm, as it can be viewed as a proximal version of the Storm method in cutkosky2019momentum .
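To make the update described above concrete, here is a minimal Euclidean-case sketch of one such iteration with $r = \lambda\|\cdot\|_1$, so the proximal step is soft-thresholding. The function names, the estimator symbol `d`, and the momentum weight `beta` are our own illustration, not the paper's pseudocode.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (the soft-thresholding map)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def pstorm_step(x, x_prev, d_prev, grad, xi, eta, beta, lam):
    """One momentum variance-reduced proximal step (Euclidean case).

    grad(x, xi) is a stochastic gradient of F(.; xi) at x; the same
    sample xi is evaluated at both the current and previous iterates,
    as described in the text above.
    """
    d = grad(x, xi) + (1.0 - beta) * (d_prev - grad(x_prev, xi))
    x_new = soft_threshold(x - eta * d, eta * lam)
    return x_new, d
```

With `beta = 1` the estimator reduces to a plain stochastic gradient, so the momentum term is what recycles information from the previous iterate.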
1.3 Related works
Many efforts have been made to analyze the convergence and complexity of SGMs for solving nonconvex stochastic problems, e.g., ghadimi2016accelerated ; ghadimi2013stochastic ; xu2015block-sg ; davis2019stochastic ; davis2020stochastic ; wang2018spiderboost ; cutkosky2019momentum ; fang2018spider ; allen2018natasha ; tran2019hybrid . We list comparison results on the complexity in Table 1.
| Method | Key assumptions | #samples at $k$-th iter. | Complexity |
|---|---|---|---|
| accelerated prox-SGM ghadimi2016accelerated | $F(\cdot;\xi)$ is smooth | $O(k)$ | $O(\varepsilon^{-4})$ |
| stochastic subgradient davis2019stochastic | $F(\cdot;\xi)$ is weakly convex; bounded stochastic subgrad. | $O(1)$ | $O(\varepsilon^{-4})$ |
| Spider fang2018spider | mean-squared smoothness (see Assumption 2) | $O(\varepsilon^{-1})$, or $O(\varepsilon^{-2})$ once every $O(\varepsilon^{-1})$ iterations | $O(\varepsilon^{-3})$ |
| Storm cutkosky2019momentum | $F(\cdot;\xi)$ is smooth a.s.; bounded stochastic grad. | 1 | $\widetilde O(\varepsilon^{-3})$ |
| Spiderboost wang2018spiderboost | mean-squared smoothness | $O(\varepsilon^{-1})$, or $O(\varepsilon^{-2})$ once every $O(\varepsilon^{-1})$ iterations | $O(\varepsilon^{-3})$ |
| Hybrid-SGD tran2019hybrid | mean-squared smoothness; $r$ is convex | $O(\varepsilon^{-1})$ if $k=1$, but at least 2 if $k\ge 2$ | $O(\varepsilon^{-3})$ |
| This paper | mean-squared smoothness; $r$ is convex | $O(1)$, and can be 1 | $\widetilde O(\varepsilon^{-3})$ |
The work ghadimi2013stochastic appears to be the first that conducts complexity analysis of an SGM for nonconvex stochastic problems. It introduces a randomized SGM. For a smooth nonconvex problem, the randomized SGM can produce a stochastic $\varepsilon$-stationary solution within $O(\varepsilon^{-4})$ SG iterations. The same-order complexity result is then extended in ghadimi2016accelerated to nonsmooth nonconvex stochastic problems in the form of (1.1). To achieve an $O(\varepsilon^{-4})$ complexity result, the accelerated prox-SGM in ghadimi2016accelerated needs to take $O(k)$ samples at the $k$-th update for each $k \ge 1$. Assuming a weak-convexity condition and using the tool of the Moreau envelope, davis2019stochastic establishes an $O(\varepsilon^{-4})$ complexity result of the stochastic subgradient method for solving more general nonsmooth nonconvex problems to produce a near-$\varepsilon$ stochastic stationary solution (see davis2019stochastic for the precise definition).
In general, the $O(\varepsilon^{-4})$ complexity result cannot be improved for smooth nonconvex stochastic problems, as arjevani2019lower shows that for the problem $\min_{\mathbf{x}} \mathbb{E}_\xi[F(\mathbf{x};\xi)]$ with a smooth $F(\cdot\,;\xi)$, any SGM that can access unbiased SGs with bounded variance needs $\Omega(\varepsilon^{-4})$ SGs to produce a solution $\bar{\mathbf{x}}$ such that $\mathbb{E}\|\nabla f(\bar{\mathbf{x}})\| \le \varepsilon$. However, with one additional mean-squared smoothness condition on each unbiased SG, the complexity result can be improved to $O(\varepsilon^{-3})$, which has been reached by a few variance-reduced SGMs tran2019hybrid ; wang2018spiderboost ; fang2018spider ; cutkosky2019momentum . These methods are closely related to ours, and we briefly review them below.
Spider. fang2018spider focuses on the smooth case of (1.1), i.e., $r \equiv 0$. It maintains the recursive stochastic gradient estimator
$$\mathbf{v}^t = \nabla F(\mathbf{x}^t;\mathcal{S}_2) - \nabla F(\mathbf{x}^{t-1};\mathcal{S}_2) + \mathbf{v}^{t-1},$$
which is refreshed by a large mini-batch $\mathcal{S}_1$ after every fixed number of updates. Under the mean-squared smoothness condition (see Assumption 2), the Spider method can produce a stochastic $\varepsilon$-stationary solution within $O(\varepsilon^{-3})$ updates, by choosing an appropriate learning rate (roughly in the order of $\frac{\varepsilon}{L}$).
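The Spider-style recursive estimator can be sketched as follows; this is our schematic illustration rather than the authors' pseudocode, with a toy objective $F(\mathbf{x};\xi) = \frac12\|\mathbf{x}-\xi\|^2$, an illustrative refresh period `q`, and arbitrary small batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))           # samples xi; F(x; xi) = 0.5||x - xi||^2

def grad(x, batch):                          # stochastic gradient on a mini-batch
    return x - data[batch].mean(axis=0)

q, eta = 10, 0.1                             # refresh period and learning rate
x = np.zeros(5)
v = grad(x, np.arange(len(data)))            # initial large-batch estimate
for t in range(1, 50):
    x_prev, x = x, x - eta * v               # gradient step using the estimator
    if t % q == 0:
        v = grad(x, np.arange(len(data)))    # periodic large-batch refresh
    else:
        batch = rng.integers(0, len(data), size=4)
        # recursive update: same small batch evaluated at both iterates
        v = v + grad(x, batch) - grad(x_prev, batch)
```

For this quadratic toy problem the gradient difference does not depend on the batch, so the estimator tracks the full gradient exactly; in general it only does so in expectation.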
Storm. cutkosky2019momentum focuses on a smooth nonconvex stochastic problem, i.e., (1.1) with $r \equiv 0$. It proposes the Storm method, which can be viewed as a special case of Algorithm 1 with the Euclidean Bregman divergence, applied to the smooth problem. However, its analysis and also its algorithm design rely on the knowledge of a uniform bound on the stochastic gradients. In addition, because the learning rate of Storm is set dependent on the sampled stochastic gradient, its analysis needs almost-sure uniform smoothness of $F(\cdot\,;\xi)$. This assumption is significantly stronger than the mean-squared smoothness condition, and the uniform smoothness constant can also be much larger than an averaged one.
Spiderboost. wang2018spiderboost extends Spider to the nonsmooth nonconvex stochastic problem in the form of (1.1) by proposing the so-called Spiderboost method. Spiderboost iteratively performs the update
$$\mathbf{x}^{t+1} = \operatorname*{arg\,min}_{\mathbf{x}} \Big\{\langle \mathbf{v}^t, \mathbf{x}\rangle + \tfrac{1}{\eta}V(\mathbf{x}^t, \mathbf{x}) + r(\mathbf{x})\Big\},$$
where $V$ denotes the Bregman divergence induced by a strongly-convex function, and $\mathbf{v}^t$ is the same recursive estimator as in Spider with appropriate mini-batch sizes. Under the mean-squared smoothness condition, Spiderboost reaches a complexity result of $O(\varepsilon^{-3})$ by choosing $\eta = \frac{1}{2L}$, where $L$ is the smoothness constant.
Hybrid-SGD. tran2019hybrid considers a nonsmooth nonconvex stochastic problem in the form of (1.1). It proposes a proximal stochastic method, called Hybrid-SGD, as a hybrid of SARAH nguyen2017sarah and an unbiased SGD. Hybrid-SGD performs the update $\mathbf{x}^{t+1} = \mathbf{prox}_{\eta r}(\mathbf{x}^t - \eta \mathbf{v}^t)$ for each $t \ge 1$, where the estimator is the convex combination
$$\mathbf{v}^t = \beta\big(\mathbf{v}^{t-1} + \nabla F(\mathbf{x}^t;\xi_t) - \nabla F(\mathbf{x}^{t-1};\xi_t)\big) + (1-\beta)\nabla F(\mathbf{x}^t;\zeta_t)$$
for a given $\beta \in (0,1)$, where $\xi_t$ and $\zeta_t$ are two independent samples of $\xi$. A mini-batch version of Hybrid-SGD is also given in tran2019hybrid . By choosing appropriate parameters, Hybrid-SGD can reach an $O(\varepsilon^{-3})$ complexity result. Although the update of $\mathbf{v}^t$ requires only two or $O(1)$ samples, its initial estimate $\mathbf{v}^0$ needs $O(\varepsilon^{-1})$ samples. As explained in (tran2019hybrid, Remark 4.1), if the initial minibatch size is $O(1)$, then the complexity result of Hybrid-SGD will worsen to $O(\varepsilon^{-4})$.
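The hybrid estimator above can be sketched in a few lines (our illustration in the Euclidean case; the helper name and toy gradient are assumptions, not the paper's code):

```python
import numpy as np

def hybrid_sgd_estimator(v_prev, grad, x, x_prev, xi, zeta, beta):
    """Convex combination of a SARAH-style recursive term (sample xi)
    and a plain unbiased stochastic gradient (independent sample zeta)."""
    sarah = v_prev + grad(x, xi) - grad(x_prev, xi)
    return beta * sarah + (1.0 - beta) * grad(x, zeta)

# toy usage with F(x; s) = 0.5 ||x - s||^2, so grad(x, s) = x - s
grad = lambda x, s: x - s
v = hybrid_sgd_estimator(np.zeros(2), grad, np.ones(2), np.zeros(2),
                         np.zeros(2), np.ones(2), 0.5)
```

Setting $\beta = 1$ recovers the SARAH recursion, while $\beta = 0$ recovers plain SGD; the interpolation is what lets the method trade bias for variance.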
More. There are many other works analyzing complexity results of SGMs on solving nonconvex finite-sum structured problems, e.g., allen2016variance ; reddi2016stochastic ; lei2017non ; huo2016asynchronous . These results often emphasize the dependence on the number of component functions as well as the target error tolerance $\varepsilon$. In addition, several works have analyzed adaptive SGMs for nonconvex finite-sum or stochastic problems, e.g., chen2018convergence ; zhou2018convergence ; xu2020-APAM . An exhaustive review of all these works is impossible and also beyond the scope of this paper; we refer interested readers to those papers and the references therein.
1.4 Contributions
Our main contributions are about the algorithm design and analysis. We design a momentum-based variance-reduced mirror-prox stochastic gradient method for solving nonconvex nonsmooth stochastic problems. The proposed method generalizes Storm in cutkosky2019momentum from smooth cases to nonsmooth cases, and in addition, it achieves the same near-optimal complexity result $\widetilde O(\varepsilon^{-3})$ under a mean-squared smoothness condition, which is weaker than the almost-sure uniform smoothness condition assumed in cutkosky2019momentum . While Spiderboost wang2018spiderboost and Hybrid-SGD tran2019hybrid can also achieve an $O(\varepsilon^{-3})$ complexity result for stochastic nonconvex nonsmooth problems, they need $O(\varepsilon^{-1})$ or more data samples in some or all iterations. Our new method is the first that requires only $O(1)$ (possibly just one) samples per iteration, and thus it can be applied to online learning problems that need real-time decisions based on possibly one or several new data samples. Furthermore, the proposed method only needs an estimate of the smoothness parameter and is easy to tune for good performance. Empirically, we observe that it converges faster than a vanilla SGD and performs more stably than Spiderboost and Hybrid-SGD on training sparse neural networks.
1.5 Notation, definitions, and outline
We use bold lowercase letters $\mathbf{x}, \mathbf{y}, \ldots$ for vectors. $\mathbb{E}_t$ denotes the expectation about the mini-batch at the $t$-th iteration, conditional on all previous history, and $\mathbb{E}$ denotes the full expectation. $|S|$ counts the number of elements in a set $S$. We use $\|\cdot\|$ for the Euclidean norm. A differentiable function $f$ is called $L$-smooth if $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \le L\|\mathbf{x} - \mathbf{y}\|$ for all $\mathbf{x}$ and $\mathbf{y}$.
Definition 1 (proximal gradient mapping)
Given $\mathbf{x}$, $\mathbf{g}$, and $\eta > 0$, we define $P_\eta(\mathbf{x}, \mathbf{g}) = \frac{1}{\eta}(\mathbf{x} - \mathbf{x}^+)$, where
$$\mathbf{x}^+ = \operatorname*{arg\,min}_{\mathbf{y}} \Big\{\langle \mathbf{g}, \mathbf{y}\rangle + \tfrac{1}{\eta}V(\mathbf{x}, \mathbf{y}) + r(\mathbf{y})\Big\}.$$
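In the Euclidean case with $r = \lambda\|\cdot\|_1$, the proximal gradient mapping has a closed form via soft-thresholding. The sketch below (ours; the toy $f$ and all names are illustrative) also checks the defining property used next: the mapping vanishes at a minimizer.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_grad_mapping(x, g, eta, lam):
    """Euclidean proximal gradient mapping with r = lam * ||.||_1:
    (x - x_plus) / eta, where x_plus = prox_{eta * r}(x - eta * g)."""
    x_plus = soft_threshold(x - eta * g, eta * lam)
    return (x - x_plus) / eta

# For f(x) = 0.5 ||x - m||^2, the point soft_threshold(m, lam) minimizes
# f + lam * ||.||_1, so the mapping vanishes there for any eta > 0.
m, lam, eta = np.array([1.0, -0.1, 0.5]), 0.3, 0.7
x_star = soft_threshold(m, lam)
G = prox_grad_mapping(x_star, x_star - m, eta, lam)   # gradient of f at x_star
```

The norm of this mapping at a general point is exactly the stationarity measure used in Definition 2 below.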
By the proximal gradient mapping, if a point $\mathbf{x}^*$ is an optimal solution of (1.1), then it must satisfy $P_\eta(\mathbf{x}^*, \nabla f(\mathbf{x}^*)) = \mathbf{0}$ for any $\eta > 0$. Based on this observation, we define a near-stationary solution as follows. This definition is standard and has been adopted in other papers, e.g., wang2018spiderboost .
Definition 2 (stochastic -stationary solution)
Given $\varepsilon > 0$, a random vector $\bar{\mathbf{x}}$ is called a stochastic $\varepsilon$-stationary solution of (1.1) if for some $\eta > 0$, it holds $\mathbb{E}\|P_\eta(\bar{\mathbf{x}}, \nabla f(\bar{\mathbf{x}}))\| \le \varepsilon$.
From (ghadimi2016mini, Lemma 1), it holds
$$\langle \mathbf{g}, P_\eta(\mathbf{x}, \mathbf{g})\rangle \ \ge\ \|P_\eta(\mathbf{x}, \mathbf{g})\|^2 + \tfrac{1}{\eta}\big(r(\mathbf{x}^+) - r(\mathbf{x})\big).$$
In addition, the proximal gradient mapping is nonexpansive with respect to $\mathbf{g}$ from (ghadimi2016mini, Proposition 1), i.e.,
$$\|P_\eta(\mathbf{x}, \mathbf{g}_1) - P_\eta(\mathbf{x}, \mathbf{g}_2)\| \ \le\ \|\mathbf{g}_1 - \mathbf{g}_2\|.$$
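The nonexpansiveness with respect to the gradient argument can be checked numerically; the sketch below (ours, Euclidean case with an $\ell_1$ regularizer, all parameter values arbitrary) records the worst observed ratio over random trials.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def mapping(x, g, eta, lam):
    # Euclidean proximal gradient mapping with r = lam * ||.||_1
    return (x - soft_threshold(x - eta * g, eta * lam)) / eta

rng = np.random.default_rng(1)
eta, lam = 0.4, 0.2
worst = 0.0
for _ in range(1000):
    x, g1, g2 = rng.normal(size=(3, 6))
    num = np.linalg.norm(mapping(x, g1, eta, lam) - mapping(x, g2, eta, lam))
    worst = max(worst, num / np.linalg.norm(g1 - g2))
```

The ratio never exceeds 1, reflecting that the proximal operator itself is nonexpansive and the $\frac{1}{\eta}$ scaling cancels the $\eta$ inside it.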
For each $t$, we denote by $P_\eta(\mathbf{x}^t, \nabla f(\mathbf{x}^t))$ the proximal gradient mapping at $\mathbf{x}^t$; notice that its norm measures the violation of stationarity of $\mathbf{x}^t$. The gradient error is represented by $\boldsymbol{\varepsilon}^t = \mathbf{d}^t - \nabla f(\mathbf{x}^t)$, where $\mathbf{d}^t$ is the stochastic gradient estimator used in Algorithm 1.
2 Convergence analysis
In this section, we analyze the complexity of Algorithm 1. Our analysis is inspired by that in cutkosky2019momentum and wang2018spiderboost . Throughout our analysis, we make the following assumptions.
Assumption 1 (finite optimal objective)
The optimal objective value $\phi^*$ of (1.1) is finite, i.e., $\phi^* > -\infty$.
Assumption 2 (mean-squared smoothness)
The function $f = \mathbb{E}_\xi[F(\cdot\,;\xi)]$ is $L$-smooth, and $F$ satisfies the mean-squared smoothness condition:
$$\mathbb{E}_\xi\big\|\nabla F(\mathbf{x};\xi) - \nabla F(\mathbf{y};\xi)\big\|^2 \ \le\ L^2 \|\mathbf{x} - \mathbf{y}\|^2, \quad \forall\, \mathbf{x}, \mathbf{y}.$$
Assumption 3 (unbiasedness and variance boundedness)
There is $\sigma \ge 0$ such that for each $t$,
$$\mathbb{E}_t\big[\nabla F(\mathbf{x}^t;\xi_t)\big] = \nabla f(\mathbf{x}^t), \qquad \mathbb{E}_t\big\|\nabla F(\mathbf{x}^t;\xi_t) - \nabla f(\mathbf{x}^t)\big\|^2 \ \le\ \sigma^2.$$
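As a toy sanity check (ours, not from the paper), take $F(\mathbf{x};\xi) = \frac12\|\mathbf{x}-\xi\|^2$, so $\nabla F(\mathbf{x};\xi) = \mathbf{x} - \xi$. Then the gradient difference $\mathbf{x}-\mathbf{y}$ does not depend on $\xi$, so mean-squared smoothness holds with $L = 1$, and the unbiasedness and variance conditions hold with $\sigma^2 = \mathbb{E}\|\xi - \mathbb{E}\xi\|^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
xi = rng.normal(size=(100000, 4))       # samples of xi (std normal, dim 4)
x, y = rng.normal(size=(2, 4))

gF = lambda x, xi: x - xi                # grad of F(x; xi) = 0.5||x - xi||^2

# Mean-squared smoothness: E||gF(x) - gF(y)||^2 = ||x - y||^2, i.e. L = 1.
ms = np.mean(np.sum((gF(x, xi) - gF(y, xi)) ** 2, axis=1))

# Unbiasedness: the sample mean of gF(x; xi) equals x - mean(xi).
bias = gF(x, xi).mean(axis=0) - (x - xi.mean(axis=0))

# Variance: E||gF(x; xi) - grad f(x)||^2, here approximately dim = 4.
var = np.mean(np.sum((gF(x, xi) - (x - xi.mean(axis=0))) ** 2, axis=1))
```

Nonlinear models do not enjoy this exactness, which is precisely why the condition is stated in expectation rather than almost surely.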
We first show a few lemmas. The lemma below estimates the one-iteration progress. Its proof follows arguments in wang2018spiderboost .
Lemma 1 (one-iteration progress)
Let be generated from Algorithm 1. Then
By the -smoothness of and the definition of in (1.10), we have
Plugging the above inequality into (2.3) and rearranging terms give
By the Cauchy-Schwarz inequality, it holds , which together with the above inequality implies
Now plug the above inequality into (2.4) to give the desired result.
The next lemma gives a recursive bound on the gradient error vector sequence $\{\boldsymbol{\varepsilon}^t\}$. Its proof follows that of (cutkosky2019momentum, Lemma 2).
Lemma 2 (recursive bound on gradient error)
For each , it holds
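For context, a typical Storm-style recursion of this kind reads as follows, in the notation assumed here ($\boldsymbol{\varepsilon}^t$ the gradient error, $\beta$ the momentum weight, and $L$, $\sigma$ from Assumptions 2 and 3; with a mini-batch of size $m$, the last two terms would be divided by $m$). This is a sketch of the general shape of such bounds, not a quotation of the lemma itself:

```latex
\mathbb{E}_t\|\boldsymbol{\varepsilon}^t\|^2
  \;\le\; (1-\beta)^2\,\|\boldsymbol{\varepsilon}^{t-1}\|^2
      + 2(1-\beta)^2 L^2\,\|\mathbf{x}^t-\mathbf{x}^{t-1}\|^2
      + 2\beta^2\sigma^2 .
```

The $(1-\beta)^2$ contraction factor on the old error is what allows the momentum estimator to reduce variance without periodic large batches.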
First, notice that
Hence, by writing , we have
By Young’s inequality, it holds
From Assumption 3, we have
By similar arguments as those in (2.5), it holds
Now notice and plug the above inequality into (2.11) to obtain the desired result.
Now we specify the choice of parameters and establish a complexity result of Algorithm 1.
Theorem 2.2 (convergence rate)
Since , it holds . Also, notice or equivalently for all . Hence, it is straightforward to have and thus for each . Now notice , so the first inequality in (2.12) holds. In addition, to ensure the second inequality in (2.12), it suffices to have . Because , this inequality is implied by , which is further implied by the choice of in (2.15). Therefore, both conditions in (2.12) hold, and thus we have (2.13).
Next we bound the coefficients in (2.13). First, from and for all , we have
where . Second,