1 Introduction
Nonconvex optimization has recently received increasing attention due to its popularity in emerging machine learning tasks, particularly for learning deep neural networks. One of the keys to the success of deep learning for big data problems is the employment of simple stochastic algorithms such as
Sgd or AdaGrad [25, 9]. Analysis of these stochastic algorithms for nonconvex optimization is an important and interesting research topic, which has already attracted much attention from the community of theoreticians [14, 15, 16, 42, 7, 37, 27]. However, one issue that has been largely ignored in existing theoretical results is that the algorithms employed in practice usually differ from the plain versions that are well understood in theory. Below, we mention several important heuristics used in practice that have not been well understood for nonconvex optimization, which motivates this work.
First, a heuristic for setting the step size in training deep neural networks is to change it in a stagewise manner from a large value to a small value (i.e., a constant step size is used within a stage for a number of iterations and is decreased for the next stage) [25], which lacks theoretical analysis to date. In the existing literature [14, 7], Sgd with an iteratively decreasing step size or a small constant step size has been well analyzed for nonconvex optimization problems with guaranteed convergence to a stationary point. For example, the existing theory usually suggests an iteratively decreasing step size proportional to $1/\sqrt{t}$ at the $t$-th iteration or a small constant step size, e.g., proportional to $\epsilon^2$ with $\epsilon\ll 1$, for finding an $\epsilon$-stationary solution whose gradient's magnitude (in expectation) is smaller than $\epsilon$.
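The two schedules can be contrasted with a small sketch (constants are made up for illustration):

```python
import math

def stepsize_iterative(t, eta0=1.0):
    # classical theory: step size proportional to 1/sqrt(t) at the t-th iteration
    return eta0 / math.sqrt(t)

def stepsize_stagewise(t, eta0=1.0, stage_len=100, decay=2.0):
    # practical heuristic: constant within each stage, divided by `decay`
    # when moving to the next stage
    stage = t // stage_len
    return eta0 / (decay ** stage)
```

Here `stage_len` and `decay` are illustrative knobs; in practice the stage lengths and the decay factor are tuned per task.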
Second, the averaging heuristic is usually used in practice, i.e., an averaged solution is returned for prediction [3], which could yield improved stability and generalization [18]. However, existing theory for many stochastic nonconvex optimization algorithms only provides guarantees on a uniformly sampled solution or a nonuniformly sampled solution with decreasing probabilities for the latest solutions [14, 42, 7]. In particular, if an iteratively decreasing step size proportional to $1/\sqrt{t}$ at the $t$-th iteration is employed, the convergence guarantee was provided for a random solution that is nonuniformly selected from all iterates with a sampling probability proportional to $1/\sqrt{t}$ for the $t$-th iterate. This means that the latest solution always has the smallest probability of being selected as the final solution, which contradicts common wisdom. If a small constant step size is used, then usually a uniformly sampled solution is returned with a convergence guarantee. However, both options are seldom used in practice.
A third common heuristic in practice is the adaptive coordinate-wise step size of AdaGrad [9]. Although the adaptive step size has been well analyzed for convex problems (i.e., when it can yield faster convergence than Sgd) [12, 5], it still remains a mystery for nonconvex optimization, with insights from theory missing. Several recent studies have attempted to analyze AdaGrad for nonconvex problems [37, 27, 4, 45]. Nonetheless, none of them are able to exhibit the adaptive convergence of AdaGrad to data as in the convex case, or its advantage over Sgd for nonconvex problems.
To overcome the shortcomings of existing theories for stochastic nonconvex optimization, this paper analyzes new algorithms that employ some or all of these commonly used heuristics in a systematic framework, aiming to fill the gap between theory and practice. The main results and contributions are summarized below:

We propose a universal stagewise optimization framework for solving a family of nonconvex problems, i.e., weakly convex problems, which is broader than smooth nonconvex problems and includes some nonsmooth nonconvex problems. At each stage, any suitable stochastic convex optimization algorithm (e.g., Sgd, AdaGrad) with a constant step size parameter can be employed for optimizing a regularized convex problem for a number of iterations, and usually returns an averaged solution. The step size parameter is decreased in a stagewise manner following a polynomially decaying scheme.

We analyze several variants of the proposed framework by employing different basic algorithms, including Sgd, the stochastic heavy-ball (Shb) method, the stochastic Nesterov's accelerated gradient (Snag) method, the stochastic alternating direction method of multipliers (ADMM), and AdaGrad. We prove the convergence of their stagewise versions for an averaged solution that is randomly selected from all stagewise averaged solutions.

To justify the heuristic approach that returns the last averaged solution in stagewise learning, we present and analyze a nonuniform sampling strategy over stagewise averaged solutions with sampling probabilities that increase with the stage number.
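A minimal sketch of this sampling rule, with weights increasing in the stage number (the weight form $w_s = s^\alpha$ matches the framework analyzed in this paper; the code itself is illustrative):

```python
import random

def sample_final_solution(stage_averages, alpha=1.0):
    # sampling probability of stage s is proportional to w_s = s^alpha,
    # so later averaged solutions are more likely to be returned
    weights = [(s + 1) ** alpha for s in range(len(stage_averages))]
    return random.choices(stage_averages, weights=weights, k=1)[0]
```

With `alpha` large, the rule concentrates on the last averaged solution, recovering the practical heuristic of simply returning it.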

Regarding the convergence results, for stagewise Sgd, Shb, and Snag, we establish the same order of iteration complexity for finding a nearly stationary point as the existing theories of their non-stagewise variants. For stagewise AdaGrad, we establish an adaptive convergence for finding a nearly stationary point, which is provably better than that of (stagewise) Sgd, Shb, and Snag when the cumulative growth of the stochastic gradients is slow.

Besides theoretical contributions, we also empirically verify the effectiveness of the proposed stagewise algorithms. In particular, our empirical studies show that (i) stagewise AdaGrad dramatically improves the generalization performance of existing variants of AdaGrad; (ii) stagewise Sgd, Shb, and Snag also outperform their plain variants with an iteratively decreasing step size; and (iii) the proposed stagewise algorithms achieve similar if not better generalization performance than their heuristic variants implemented in existing libraries on standard benchmark datasets.
2 Related Work
We review some theoretical results for stochastic nonconvex optimization in this section.
Sgd for unconstrained smooth nonconvex problems was first analyzed by Ghadimi and Lan [14], who established an iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary point $x$ in expectation satisfying $\mathbb{E}[\|\nabla f(x)\|]\le\epsilon$, where $f$ denotes the objective function. As mentioned earlier, the returned solution is either a uniformly sampled solution or a nonuniformly sampled one with sampling probabilities proportional to the decreasing step sizes. Similar results were established for the stochastic momentum variants of Sgd (i.e., Shb, Snag) by [42, 15]. Recently, Sgd was also analyzed for (constrained) weakly convex problems, whose objective function is nonconvex and not necessarily smooth, by Davis and Drusvyatskiy [7]. Since the objective function could be nonsmooth, the convergence guarantee is provided on the magnitude of the gradient of the Moreau envelope, with the same order of iteration complexity as in the smooth case. However, none of these studies provide results for algorithms that return an averaged solution.
Although adaptive variants of Sgd, e.g., AdaGrad [12] and Adam [23, 32], are widely used for training deep neural networks, there are few theoretical analyses of these algorithms for nonconvex problems. Several recent studies attempted to analyze AdaGrad for nonconvex problems [37, 27, 4, 45]. Ward et al. [37] only analyzed a variant of AdaGrad that uses a global adaptive step size instead of the coordinate-wise adaptive step size of the original AdaGrad used in practice. Li and Orabona [27] gave two results about the convergence of variants of AdaGrad: one is an asymptotic convergence result for the coordinate-wise adaptive step size, and the other is a nonasymptotic convergence result for a global adaptive step size. While preparing this manuscript, we noted two recent studies [4, 45] that appeared online, which also analyzed the convergence of AdaGrad with a coordinate-wise adaptive step size and its momentum variants. Although all of these studies established an iteration complexity of $O(1/\epsilon^4)$ for different variants of AdaGrad for finding an $\epsilon$-stationary solution of a stochastic nonconvex optimization problem, none of them can exhibit the potential adaptive advantage of AdaGrad over Sgd as in the convex case. To the best of our knowledge, our result is the first one that explicitly shows that a coordinate-wise adaptive step size can yield faster convergence than a nonadaptive step size for nonconvex problems, similar to the convex case. Besides that, these studies also suffer from the following shortcomings: (i) they all assume smoothness of the problem, while we consider nonsmooth and nonconvex problems; (ii) their convergence is provided on a solution with the minimum magnitude of gradient, which is expensive to compute, though their results also imply convergence on a random solution selected from all iterates with decreasing sampling probabilities. In contrast, neither shortcoming exists in this paper.
Stagewise step sizes have been employed in stochastic algorithms and analyzed for convex optimization problems [19, 39]. Hazan and Kale [19] proposed an epoch-GD method for stochastic strongly convex problems, in which a stagewise step size decreases geometrically and the number of iterations for each stage increases geometrically. Xu et al. [39] proposed an accelerated stochastic subgradient method for optimizing convex objectives satisfying a local error bound condition, which also employs a stagewise scheme with a constant number of iterations per stage and a geometrically decreasing stagewise step size. The difference from the present work is that they focus on convex problems. The proposed stagewise algorithm is similar in design to several existing algorithms [39, 8], which originate from the proximal point algorithm [35]. That is, at each stage a proximal strongly convex subproblem is formed, and a stochastic algorithm is then employed to optimize the proximal subproblem inexactly for a number of iterations. Xu et al. [39] used this idea for solving problems that satisfy a local error bound condition, aiming to achieve faster convergence than vanilla Sgd. Davis and Grimmer [8] followed this idea to solve weakly convex problems. In these two papers, Sgd with decreasing step sizes for a strongly convex problem is employed at each stage for solving the proximal subproblem. Our stagewise algorithm is developed following a similar idea. The key differences from [39, 8] are that (i) we focus on weakly convex problems instead of the convex problems considered in [39]; (ii) we use nonuniform sampling probabilities that increase with the stage number to select an averaged solution as the final solution, unlike the uniform sampling used in [8]; and (iii) we present a unified algorithmic framework and convergence analysis, which enable one to employ any suitable stochastic convex optimization algorithm at each stage. This gives us several interesting variants, including stagewise stochastic momentum methods, stagewise AdaGrad, and stagewise stochastic ADMM. For stagewise AdaGrad, which employs AdaGrad as the basic algorithm for solving the proximal subproblem, we derive an adaptive convergence that is faster than Sgd when the cumulative growth of stochastic gradients is slow.
Finally, we refer readers to several recent papers for other algorithms for weakly convex problems [6, 11]. For example, Drusvyatskiy and Paquette [11] studied a subclass of weakly convex problems whose objective consists of a composition of a convex function and a smooth map, and proposed a prox-linear method that could enjoy a lower iteration complexity by smoothing the objective of each subproblem. Davis and Drusvyatskiy [6] studied a more general algorithm that successively minimizes a proximally regularized stochastic model of the objective function. When the objective function is smooth and has a finite-sum form, variance-reduction based methods have also been studied [31, 33, 2, 1, 26], which have provably faster convergence than Sgd. However, in all of these studies the convergence is provided on an impractical solution, which is either a solution that gives the minimum value of the (proximal) subgradient's norm [11] or a uniformly sampled solution from all iterations [31, 33, 2, 1].
3 Preliminaries
The problem of interest in this paper is:
(1) $\min_{x\in\Omega} \phi(x) := \mathbb{E}_\xi[\phi(x;\xi)],$
where $\Omega$ is a closed convex set, $\xi$ is a random variable, and $\phi(x)$, $\phi(x;\xi)$ are nonconvex functions, with the basic assumptions on the problem given in Assumption 1. To state the convergence property of an algorithm for solving the above problem, we need to introduce some definitions. These definitions can also be found in related literature, e.g., [8, 7]. In the sequel, we let $\|\cdot\|$ denote the Euclidean norm and $1_\Omega(\cdot)$ denote the indicator function of a set $\Omega$.
Definition 1.
(Fréchet subgradient) For a nonsmooth and nonconvex function $f$,
$\partial f(x) = \{ v : f(y) \ge f(x) + v^\top (y - x) + o(\|y - x\|), \ \forall y \}$
denotes the Fréchet subgradient of $f$ at $x$.
Definition 2.
(First-order stationarity) For problem (1), a point $x\in\Omega$ is a first-order stationary point if $0 \in \partial(\phi + 1_\Omega)(x)$,
where $1_\Omega(\cdot)$ denotes the indicator function of $\Omega$. Moreover, a point $x$ is said to be $\epsilon$-stationary if
(2) $\mathrm{dist}(0, \partial(\phi + 1_\Omega)(x)) \le \epsilon,$
where dist denotes the Euclidean distance from a point to a set.
Definition 3.
(Moreau Envelope and Proximal Mapping) For any function $\phi$ and $\lambda > 0$, the following function is called a Moreau envelope of $\phi$:
$\phi_\lambda(x) = \min_y \phi(y) + \frac{1}{2\lambda}\|y - x\|^2.$
Further, the optimal solution to the above problem, denoted by
$\mathrm{prox}_{\lambda\phi}(x) = \arg\min_y \phi(y) + \frac{1}{2\lambda}\|y - x\|^2,$
is called a proximal mapping of $\phi$.
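For intuition, a scalar sketch of the proximal mapping and Moreau envelope for the simple convex choice $\phi(y) = |y|$, whose proximal mapping has the well-known soft-thresholding closed form (this concrete example is ours, not the paper's):

```python
def prox_abs(x, lam):
    # prox of phi(y) = |y|: argmin_y |y| + (y - x)^2 / (2 * lam)
    # closed form: soft-thresholding of x at level lam
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def moreau_env_abs(x, lam):
    # Moreau envelope phi_lam(x) = min_y |y| + (y - x)^2 / (2 * lam),
    # evaluated at the minimizer returned by prox_abs
    y = prox_abs(x, lam)
    return abs(y) + (y - x) ** 2 / (2.0 * lam)
```

Note that `moreau_env_abs` is the Huber function: smooth everywhere even though $|y|$ is not, which is exactly the smoothing property the analysis relies on.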
Definition 4.
(Weakly convex) A function $\phi$ is $\rho$-weakly convex if $\phi(x) + \frac{\rho}{2}\|x\|^2$ is convex.
It is known that if $\phi$ is $\rho$-weakly convex and $\lambda < \rho^{-1}$, then its Moreau envelope $\phi_\lambda$ is smooth with gradient given by (see, e.g., [7])
$\nabla\phi_\lambda(x) = \lambda^{-1}(x - \mathrm{prox}_{\lambda\phi}(x)).$
A small norm of $\nabla\phi_\lambda(x)$ has the interpretation that $x$ is close to a point that is nearly stationary. In particular, for any $x$, let $\hat{x} = \mathrm{prox}_{\lambda\phi}(x)$; then we have
(3) $\|\hat{x} - x\| = \lambda\|\nabla\phi_\lambda(x)\|, \quad \phi(\hat{x}) \le \phi(x), \quad \mathrm{dist}(0, \partial\phi(\hat{x})) \le \|\nabla\phi_\lambda(x)\|.$
This means that a point $x$ satisfying $\|\nabla\phi_\lambda(x)\| \le \epsilon$ is within distance $\lambda\epsilon$ of a point $\hat{x}$ that is $\epsilon$-stationary.
It is notable that for a nonsmooth nonconvex function $\phi$, there could exist a sequence of solutions $\{x_k\}$ converging to a stationary point while $\mathrm{dist}(0, \partial\phi(x_k))$ may not converge to zero [11]. To handle this challenging issue for nonsmooth nonconvex problems, we follow existing works [6, 11, 8] and prove near stationarity in terms of $\|\nabla\phi_\lambda(x)\|$. In the case when $\phi$ is smooth, $\|\nabla\phi_\lambda(x)\|$ is closely related to the magnitude of the projected gradient defined below, which has been used as a criterion for constrained nonconvex optimization [33]:
(4) $\mathcal{G}_\eta(x) = \frac{1}{\eta}\left(x - \Pi_\Omega(x - \eta\nabla\phi(x))\right),$
where $\Pi_\Omega$ denotes the Euclidean projection onto $\Omega$. It was shown that when $\phi$ is smooth with a Lipschitz continuous gradient, $\|\nabla\phi_\lambda(x)\|$ and $\|\mathcal{G}_\eta(x)\|$ are of the same order up to constant factors [10].
Thus, near stationarity in terms of $\|\nabla\phi_\lambda(x)\|$ implies near stationarity in terms of $\|\mathcal{G}_\eta(x)\|$ for a smooth function $\phi$.
Now, we are ready to state the basic assumptions on the considered problem (1).
Assumption 1.

A. There is a measurable mapping $g(x;\xi)$ such that $\mathbb{E}_\xi[g(x;\xi)] \in \partial\phi(x)$ for any $x\in\Omega$.

B. For any $x\in\Omega$, $\mathbb{E}_\xi[\|g(x;\xi)\|^2] \le G^2$.

C. The objective function $\phi$ is $\rho$-weakly convex.

D. There exists $\Delta > 0$ such that $\phi(x) - \phi^* \le \Delta$ for any $x\in\Omega$, where $\phi^* = \min_{x\in\Omega}\phi(x)$.
Remark: Assumptions 1A and 1B assume that a stochastic subgradient is available for the objective function and that its squared Euclidean norm is bounded in expectation, which are standard assumptions for nonsmooth optimization. Assumption 1C assumes weak convexity of the objective function, which is weaker than assuming smoothness. Assumption 1D assumes that the gap between the objective value and the optimal value is bounded. Below, we present some examples of objective functions in machine learning that are weakly convex.
Ex. 1: Smooth Nonconvex Functions.
If $\phi$ is an $L$-smooth function (i.e., its gradient is $L$-Lipschitz continuous), then it is $L$-weakly convex.
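This standard fact can be verified from the quadratic lower bound implied by $L$-smoothness:

```latex
% If \nabla\phi is L-Lipschitz, the descent lemma gives, for all x, y:
\phi(y) \;\ge\; \phi(x) + \nabla\phi(x)^\top (y - x) - \frac{L}{2}\,\|y - x\|^2 .
% Adding \frac{L}{2}\|y\|^2 to both sides and expanding \|y - x\|^2 shows that
% g(x) := \phi(x) + \frac{L}{2}\|x\|^2 satisfies the first-order convexity condition
% g(y) \;\ge\; g(x) + \nabla g(x)^\top (y - x),
% i.e., g is convex and hence \phi is L-weakly convex.
```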
Ex. 2: Additive Composition.
Consider
(6) $\phi(x) = \mathbb{E}_\xi[f(x;\xi)] + r(x),$
where $f(\cdot;\xi)$ is a $\rho$-weakly convex function and $r$ is a closed convex function. In this case, $\phi$ is $\rho$-weakly convex. This class includes many interesting regularized problems in machine learning with smooth losses and convex regularizers. For smooth nonconvex loss functions, one can consider a truncated square loss for robust learning, i.e., $f(x; a, b) = \phi_\alpha((x^\top a - b)^2)$, where $a$ denotes a random data point, $b$ denotes its corresponding output, and $\phi_\alpha$ is a smooth nonconvex truncation function (e.g., $\phi_\alpha(z) = \alpha\log(1 + z/\alpha)$). Such truncated nonconvex losses have been considered in [28], and it was proved that the resulting loss is a smooth function with a Lipschitz continuous gradient [41]. For $r(x)$, one can consider any existing convex regularizer, e.g., the $\ell_1$ norm, the group-lasso regularizer [43], or the graph-lasso regularizer [22].
Ex. 3: Convex and Smooth Composition.
Consider
$\phi(x) = h(c(x)),$
where $h$ is closed convex and Lipschitz continuous, and $c$ is a nonlinear smooth mapping with a Lipschitz continuous gradient. This class of functions has been considered in [11], where it was proved that $\phi$ is weakly convex. An interesting example is phase retrieval, $\phi(x) = \frac{1}{m}\sum_{i=1}^m |(a_i^\top x)^2 - b_i|$, where $a_i$ are measurement vectors and $b_i$ are observed magnitudes. More examples of this class can be found in [6].
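A concrete sketch of this composition class, assuming the standard phase-retrieval form $\frac{1}{m}\sum_i |(a_i^\top x)^2 - b_i|$ (vectors represented as plain lists for self-containment):

```python
def phase_retrieval_loss(x, a_list, b_list):
    # phi(x) = (1/m) * sum_i | (a_i^T x)^2 - b_i |
    # the convex Lipschitz |.| is composed with the smooth map
    # c_i(x) = (a_i^T x)^2 - b_i, hence phi is weakly convex
    m = len(a_list)
    total = 0.0
    for a, b in zip(a_list, b_list):
        inner = sum(ai * xi for ai, xi in zip(a, x))
        total += abs(inner * inner - b)
    return total / m
```

The loss is zero exactly when every measurement magnitude is matched, and it is nonsmooth at those kinks, which is why the Moreau-envelope criterion rather than the gradient norm is used for such objectives.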
Ex. 4: Smooth and Convex Composition.
Consider
$\phi(x) = \phi_\alpha(h(x)),$
where $\phi_\alpha$ is a smooth truncation function and $h$ is convex and Lipschitz continuous. This class of functions has been considered in [41] for robust learning, where it was proved that $\phi$ is weakly convex. An interesting example is a truncated Lipschitz continuous loss $\phi_\alpha(h(x))$, where $\phi_\alpha$ is a smooth truncation function (e.g., $\phi_\alpha(z) = \alpha\log(1 + z/\alpha)$) and $h$ is a convex Lipschitz continuous loss (e.g., the absolute loss $|x^\top a - b|$ with bounded data).
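A one-dimensional sketch of such a truncated Lipschitz loss, taking $\phi_\alpha(z) = \alpha\log(1+z/\alpha)$ as the (assumed) truncation and the absolute loss as $h$:

```python
import math

def truncated_abs_loss(pred, target, alpha=1.0):
    # smooth truncation phi_alpha(z) = alpha * log(1 + z / alpha) composed
    # with the convex Lipschitz loss h = |pred - target|; large residuals
    # are damped, trading convexity for robustness to outliers
    z = abs(pred - target)
    return alpha * math.log(1.0 + z / alpha)
```

Because the truncation saturates, a single corrupted sample contributes at most a logarithmic penalty instead of a linear one.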
Ex. 5: Weakly Convex Sparsity-Promoting Regularizers.
Consider
$\phi(x) = f(x) + r(x),$
where $f$ is a convex or weakly convex loss function and $r$ is a weakly convex sparsity-promoting regularizer. Examples of weakly convex sparsity-promoting regularizers include, e.g., the minimax concave penalty (MCP) and the smoothly clipped absolute deviation (SCAD) penalty.
4 Stagewise Optimization: Algorithms and Analysis
In this section, we will present the proposed algorithms and the analysis of their convergence. We will first present a Meta algorithmic framework highlighting the key features of the proposed algorithms and then present several variants of the Meta algorithm by employing different basic algorithms.
The Meta algorithmic framework is described in Algorithm 1. Several key features differentiate Algorithm 1 from existing stochastic algorithms that come with theoretical guarantees. First, the algorithm is run in multiple stages. At each stage, a stochastic algorithm (SA) is called to inexactly optimize a proximal problem that consists of the original objective function and a quadratic term, and that is guaranteed to be convex due to the weak convexity of $\phi$ and an appropriate choice of the proximal parameter $\gamma < \rho^{-1}$. The convexity of the subproblem allows one to employ any suitable existing stochastic algorithm (cf. Theorem 1) that has a convergence guarantee for convex problems. It is notable that SA usually returns an averaged solution at each stage. Second, a decreasing sequence of step size parameters $\eta_s$ is used. At each stage, the SA uses a constant step size parameter and runs the updates for a number of iterations. We do not fix the initialization of the step size, as it might be adaptive to the data as in stagewise AdaGrad. Third, the final solution is selected from the stagewise averaged solutions with nonuniform sampling probabilities proportional to a sequence of nondecreasing positive weights $w_s$. In the sequel, we are particularly interested in $w_s = s^\alpha$ with $\alpha > 0$. The setup of the step sizes and the number of iterations per stage will depend on the specific choice of SA, which will be exhibited later for different variants.
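A minimal scalar sketch of the Meta framework with Sgd as the basic solver SA; the schedules $\eta_s = \eta_0/s$, $T_s = s T_0$, and the weights $w_s = s^\alpha$ are illustrative choices, not necessarily the paper's exact setup:

```python
import random

def stagewise_meta(grad_oracle, x0, num_stages, gamma, eta0, T0, alpha=1.0):
    """Sketch of a stagewise Meta framework with SGD as the basic solver.
    grad_oracle(x) returns a (stochastic) subgradient of the objective."""
    x_prev = x0
    averages = []
    for s in range(1, num_stages + 1):
        eta_s = eta0 / s          # polynomially decreasing step size parameter
        T_s = T0 * s              # number of inner iterations at stage s
        x = x_prev
        running_sum = 0.0
        for _ in range(T_s):
            # gradient of the proximal subproblem
            #   f(x) + ||x - x_prev||^2 / (2 * gamma)
            g = grad_oracle(x) + (x - x_prev) / gamma
            x = x - eta_s * g
            running_sum += x
        x_prev = running_sum / T_s  # averaged solution of stage s
        averages.append(x_prev)
    # return a stage average sampled with probability proportional to s^alpha
    weights = [s ** alpha for s in range(1, num_stages + 1)]
    return random.choices(averages, weights=weights, k=1)[0]
```

Each stage warm-starts from the previous stage's averaged solution, and the quadratic term keeps every subproblem convex so that the inner SGD loop enjoys standard convex guarantees.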
To illustrate that Algorithm 1 is a universal framework in which any suitable SA algorithm can be employed, we present the following result by assuming that SA has an appropriate convergence guarantee for a convex problem.
Theorem 1.
Remark: It is notable that the convergence guarantee is provided on a stagewise averaged solution. To justify the heuristic approach that returns the final averaged solution for prediction, we analyze a new sampling strategy that samples a solution among all stagewise averaged solutions with sampling probabilities increasing with the stage number. This sampling strategy is better than the uniform sampling strategy or the strategies with decreasing sampling probabilities in the existing literature. The convergence upper bound in (7) of SA covers the results of a broad family of stochastic convex optimization algorithms. For Sgd, the upper bound can be improved by a constant factor. Moreover, we do not optimize the value of $\alpha$; indeed, any $\alpha > 0$ will work and only affects a constant factor in the convergence upper bound.
Next, we present several variants of the Meta algorithm by employing Sgd, AdaGrad, and stochastic momentum methods as the basic SA algorithm, to which we refer as stagewise Sgd, stagewise AdaGrad, and stagewise stochastic momentum methods, respectively. It is worth mentioning that one can follow a similar analysis to analyze other stagewise algorithms by using their basic convergence for stochastic convex optimization, including RMSProp [29] and AMSGrad [32], which is omitted in this paper.
Proof.
Below, we use $\mathbb{E}_s$ to denote the expectation over randomness in the $s$-th stage given all history before the $s$-th stage. Define
(9) 
Next, we apply Lemma 1 to each call of Sgd in stagewise Sgd:
Then
On the other hand, we have that
where the inequality follows from Young's inequality. Thus, we have that
(10) 
Next, we bound given that is fixed. According to the definition of , we have
Taking the expectation over randomness in the $s$-th stage on both sides, we have
Thus,
Assuming that , we have
Plugging this upper bound into (4), we have
(11) 
By setting and assuming , we have
Multiplying both sides by , we have that
By summing over all stages, we have
Taking the expectation w.r.t. the randomly sampled stage index, we have that
For the first term on the R.H.S., we have that
Then,
Standard calculus tells us that
Combining these facts and the assumption , we have that
In order to find an $\epsilon$-nearly stationary solution, we can set $S = O(1/\epsilon^2)$. The total number of iterations is then $O(1/\epsilon^4)$.
∎
4.1 Stagewise Sgd
In this subsection, we analyze the convergence of stagewise Sgd, in which Sgd (shown in Algorithm 2) is employed in the Meta framework. Besides Assumption 1, we impose the following assumption in this subsection.
Assumption 2.
The domain $\Omega$ is bounded, i.e., there exists $D > 0$ such that $\|x - y\| \le D$ for any $x, y \in \Omega$.
It is worth mentioning that the bounded domain assumption is imposed for simplicity and is usually assumed in convex optimization. For machine learning problems, one usually imposes a bounded-norm constraint to achieve a regularization effect. Recently, several studies have found that imposing a norm constraint is more effective than an additive norm regularization term in the objective for deep learning [17, 30]. Nevertheless, the bounded domain assumption is not essential for the proposed algorithm; we present a more involved analysis in the next subsection for an unbounded domain. The following is a basic convergence result of Sgd, whose proof can be found in the literature and is omitted.
Lemma 1.
For Algorithm 2, assume that $f$ is convex and $\mathbb{E}[\|g_t\|^2] \le G^2$; then for any $x \in \Omega$ we have
$\mathbb{E}\left[f(\hat{x}) - f(x)\right] \le \frac{\mathbb{E}[\|x_1 - x\|^2]}{2\eta T} + \frac{\eta G^2}{2},$
where $\hat{x} = \frac{1}{T}\sum_{t=1}^T x_t$ is the averaged solution.
To state the convergence, we introduce the notation
(12) $\nabla\phi_\gamma(x) = \gamma^{-1}\left(x - \mathrm{prox}_{\gamma\phi}(x)\right),$
which is the gradient of the Moreau envelope of the objective function $\phi$. The following theorem exhibits the convergence of stagewise Sgd.
Theorem 2.
4.2 Stagewise stochastic momentum (SM) methods
In this subsection, we present stagewise stochastic momentum methods and their analysis. In the literature, there are two popular variants of stochastic momentum methods, namely, the stochastic heavy-ball method (Shb) and the stochastic Nesterov's accelerated gradient method (Snag). Both methods have been used for training deep neural networks [25, 36] and have been analyzed by [42] for nonconvex optimization. To contrast with the results in [42], we consider the same unified stochastic momentum methods that subsume Shb, Snag, and Sgd as special cases for particular choices of the momentum parameters. The updates are presented in Algorithm 3.
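One step of such a unified update can be sketched as follows (our reading of the unified scheme in [42]; the exact parameterization in Algorithm 3 may differ):

```python
def unified_momentum_step(x, ys_prev, g, eta, beta, s):
    # unified update subsuming plain Sgd (beta = 0), a heavy-ball-style
    # update (s = 0), and a Nesterov-style update (s = 1):
    #   y_{t+1}  = x_t - eta * g_t
    #   ys_{t+1} = x_t - s * eta * g_t
    #   x_{t+1}  = y_{t+1} + beta * (ys_{t+1} - ys_t)
    y_next = x - eta * g
    ys_next = x - s * eta * g
    x_next = y_next + beta * (ys_next - ys_prev)
    return x_next, ys_next
```

Setting `beta = 0` collapses the update to a single Sgd step, which is the sense in which Sgd is a special case.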
To present the analysis of stagewise SM methods, we first provide a convergence result for minimizing the convex proximal subproblem at each stage.
Lemma 2.
For Algorithm 3, assume that $f$ is a strongly convex function; then we have that