Non-convex optimization has recently received increasing attention due to its popularity in emerging machine learning tasks, particularly for learning deep neural networks. One of the keys to the success of deep learning for big data problems is the employment of simple stochastic algorithms such as Sgd or AdaGrad [25, 9]. Analysis of these stochastic algorithms for non-convex optimization is an important and interesting research topic, which has already attracted much attention from the community of theoreticians [14, 15, 16, 42, 7, 37, 27]. However, one issue that has been largely ignored in existing theoretical results is that the algorithms employed in practice usually differ from their plain versions that are well understood in theory. Below, we mention several important heuristics used in practice that have not been well understood for non-convex optimization, which motivates this work.
First, a heuristic for setting the step size in training deep neural networks is to change it in a stagewise manner from a large value to a small value (i.e., a constant step size is used in a stage for a number of iterations and is decreased for the next stage), which lacks theoretical analysis to date. In existing literature [14, 7], Sgd with an iteratively decreasing step size or a small constant step size has been well analyzed for non-convex optimization problems with guaranteed convergence to a stationary point. For example, the existing theory usually suggests an iteratively decreasing step size proportional to $1/\sqrt{t}$ at the $t$-th iteration or a small constant step size, e.g., proportional to $\epsilon^2$ with $\epsilon \ll 1$, for finding an $\epsilon$-stationary solution whose gradient's magnitude (in expectation) is smaller than $\epsilon$.
Second, the averaging heuristic is usually used in practice, i.e., an averaged solution is returned for prediction, which could yield improved stability and generalization. However, existing theory for many stochastic non-convex optimization algorithms only provides a guarantee on a uniformly sampled solution or a non-uniformly sampled solution with decreasing probabilities for the latest solutions [14, 42, 7]. In particular, if an iteratively decreasing step size proportional to $1/\sqrt{t}$ at the $t$-th iteration is employed, the convergence guarantee is provided for a random solution that is non-uniformly selected from all iterates with a sampling probability proportional to $1/\sqrt{t}$ for the $t$-th iterate. This means that the latest solution always has the smallest probability of being selected as the final solution, which contradicts common wisdom. If a small constant step size is used, then usually a uniformly sampled solution is returned with a convergence guarantee. However, both options are seldom used in practice.
A third common heuristic in practice is to use the adaptive coordinate-wise step size of AdaGrad. Although the adaptive step size has been well analyzed for convex problems (i.e., when it can yield faster convergence than Sgd) [12, 5], it remains a mystery for non-convex optimization, where insights from theory are still missing. Several recent studies have attempted to analyze AdaGrad for non-convex problems [37, 27, 4, 45]. Nonetheless, none of them is able to exhibit the adaptive convergence of AdaGrad to data as in the convex case, or its advantage over Sgd for non-convex problems.
To overcome the shortcomings of existing theories for stochastic non-convex optimization, this paper analyzes new algorithms that employ some or all of these commonly used heuristics in a systematic framework, aiming to fill the gap between theory and practice. The main results and contributions are summarized below:
We propose a universal stagewise optimization framework for solving a family of non-convex problems, i.e., weakly convex problems, which is broader than smooth non-convex problems and includes some non-smooth non-convex problems. At each stage, any suitable stochastic convex optimization algorithm (e.g., Sgd, AdaGrad) with a constant step size parameter can be employed for optimizing a regularized convex problem for a number of iterations, and it usually returns an averaged solution. The step size parameter is decreased in a stagewise manner following a polynomial decaying scheme.
We analyze several variants of the proposed framework by employing different basic algorithms, including Sgd, stochastic heavy-ball (Shb) method, stochastic Nesterov’s accelerated gradient (Snag) method, stochastic alternating direction methods of multipliers (ADMM), and AdaGrad. We prove the convergence of their stagewise versions for an averaged solution that is randomly selected from all stagewise averaged solutions.
To justify a heuristic approach that returns the last averaged solution in stagewise learning, we present and analyze a non-uniform sampling strategy over stagewise averaged solutions with sampling probabilities that increase with the stage number.
Regarding the convergence results, for stagewise Sgd, Shb, Snag, we establish the same order of iteration complexity for finding a nearly stationary point as the existing theories of their non-stagewise variants. For stagewise AdaGrad, we establish an adaptive convergence for finding a nearly stationary point, which is provably better than (stagewise) Sgd, Shb, and Snag when the cumulative growth of stochastic gradient is slow.
Besides theoretical contributions, we also empirically verify the effectiveness of the proposed stagewise algorithms. In particular, our empirical studies show that (i) stagewise AdaGrad dramatically improves the generalization performance of existing variants of AdaGrad; (ii) stagewise Sgd, Shb, and Snag also outperform their plain variants with an iteratively decreasing step size; (iii) the proposed stagewise algorithms achieve similar if not better generalization performance than their heuristic variants implemented in existing libraries on standard benchmark datasets.
2 Related Work
We review some theoretical results for stochastic non-convex optimization in this section.
Sgd for unconstrained smooth non-convex problems was first analyzed by Ghadimi and Lan, who established an $O(1/\epsilon^4)$ iteration complexity for finding an $\epsilon$-stationary point in expectation satisfying $\mathbb{E}[\|\nabla \phi(x)\|] \le \epsilon$, where $\phi(\cdot)$ denotes the objective function. As mentioned earlier, the returned solution is either a uniformly sampled solution or a non-uniformly sampled one with sampling probabilities proportional to the decreasing step size. Similar results were established for the stochastic momentum variants of Sgd (i.e., Shb, Snag) by [42, 15]. Recently, Sgd was also analyzed for (constrained) weakly convex problems, whose objective function is non-convex and not necessarily smooth, by Davis and Drusvyatskiy. Since the objective function could be non-smooth, the convergence guarantee is provided on the magnitude of the gradient of the Moreau envelope, with the same order of iteration complexity as in the smooth case. However, none of these studies provides results for algorithms that return an averaged solution.
Although adaptive variants of Sgd, e.g., AdaGrad and Adam [23, 32], are widely used for training deep neural networks, there are few studies on the theoretical analysis of these algorithms for non-convex problems. Several recent studies attempted to analyze AdaGrad for non-convex problems [37, 27, 4, 45]. Ward et al. only analyzed a variant of AdaGrad that uses a global adaptive step size instead of the coordinate-wise adaptive step size of the original AdaGrad used in practice. Li and Orabona gave two results about the convergence of variants of AdaGrad. One is given in terms of asymptotic convergence for the coordinate-wise adaptive step size, and the other is given in terms of non-asymptotic convergence for the global adaptive step size. While preparing this manuscript, we noted that two recent studies [4, 45] appeared online, which also analyze the convergence of AdaGrad with coordinate-wise adaptive step size and its momentum variants. Although all of these studies established an iteration complexity of $O(1/\epsilon^4)$ for different variants of AdaGrad for finding an $\epsilon$-stationary solution of a stochastic non-convex optimization problem, none of them can exhibit the potential adaptive advantage of AdaGrad over Sgd as in the convex case. To the best of our knowledge, our result is the first one that explicitly shows that a coordinate-wise adaptive step size could yield faster convergence than a non-adaptive step size for non-convex problems, similar to the convex case. Besides that, these studies also suffer from the following shortcomings: (i) they all assume smoothness of the problem, while we consider non-smooth and non-convex problems; (ii) their convergence is provided on a solution with the minimum magnitude of gradient, which is expensive to compute, though their results also imply convergence on a random solution selected from all iterates with decreasing sampling probabilities. In contrast, these shortcomings do not exist in this paper.
Hazan and Kale proposed an epoch-GD method for stochastic strongly convex problems, in which a stagewise step size is used that decreases geometrically and the number of iterations for each stage increases geometrically. Xu et al. proposed an accelerated stochastic subgradient method for optimizing convex objectives satisfying a local error bound condition, which also employs a stagewise scheme with a constant number of iterations per stage and a geometrically decreasing stagewise step size. The difference from the present work is that they focus on convex problems.
The proposed stagewise algorithm is similar in design to several existing algorithms [39, 8], which originate from the proximal point algorithm. That is, at each stage a proximal strongly convex subproblem is formed and then a stochastic algorithm is employed to optimize the proximal subproblem inexactly for a number of iterations. Xu et al. used this idea for solving problems that satisfy a local error bound condition, aiming to achieve faster convergence than vanilla Sgd. Davis and Grimmer followed this idea to solve weakly convex problems. In these two papers, Sgd with decreasing step sizes for a strongly convex problem is employed at each stage to solve the proximal subproblem. Our stagewise algorithm is developed following a similar idea. The key differences from [39, 8] are that (i) we focus on weakly convex problems instead of the convex problems considered by Xu et al.; (ii) we use non-uniform sampling probabilities that increase with the stage number to select an averaged solution as the final solution, unlike the uniform sampling used by Davis and Grimmer; (iii) we present a unified algorithmic framework and convergence analysis, which enables one to employ any suitable stochastic convex optimization algorithm at each stage. This gives us several interesting variants, including stagewise stochastic momentum methods, stagewise AdaGrad, and stagewise stochastic ADMM. For stagewise AdaGrad, which employs AdaGrad as the basic algorithm for solving the proximal subproblem, we derive an adaptive convergence that is faster than that of Sgd when the cumulative growth of stochastic gradients is slow.
Finally, we refer readers to several recent papers for other algorithms for weakly convex problems [6, 11]. For example, Drusvyatskiy and Paquette studied a subclass of weakly convex problems whose objective consists of a composition of a convex function and a smooth map, and proposed a prox-linear method that could enjoy a lower iteration complexity than $O(1/\epsilon^4)$ by smoothing the objective of each subproblem. Davis and Drusvyatskiy studied a more general algorithm that successively minimizes a proximal regularized stochastic model of the objective function. When the objective function is smooth and has a finite-sum form, variance-reduction based methods have also been studied [31, 33, 2, 1, 26], which have provably faster convergence than Sgd in terms of $\epsilon$. However, in all of these studies the convergence is provided on an impractical solution, which is either a solution that gives the minimum value of the (proximal) subgradient's norm or a uniformly sampled solution from all iterations [31, 33, 2, 1].
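The problem of interest in this paper is:
$$\min_{x \in \Omega} \phi(x) := \mathbb{E}_{\xi}[\phi(x; \xi)], \qquad (1)$$
where $\Omega \subseteq \mathbb{R}^d$ is a closed convex set, $\xi$ is a random variable, and $\phi(x)$ and $\phi(x; \xi)$ are non-convex functions, with the basic assumptions on the problem given in Assumption 1.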
To state the convergence property of an algorithm for solving the above problem, we need to introduce some definitions. These definitions can also be found in related literature, e.g., [8, 7]. In the sequel, we let $\|\cdot\|$ denote the Euclidean norm, $[S] = \{1, \ldots, S\}$ denote a set, and $\delta_\Omega(\cdot)$ denote the indicator function of the set $\Omega$.
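(Fréchet subgradient) For a non-smooth and non-convex function $f(\cdot)$,
$$\partial_F f(x) = \left\{ v \in \mathbb{R}^d : f(y) \ge f(x) + v^\top (y - x) + o(\|y - x\|), \ \forall y \in \mathbb{R}^d \right\}$$
denotes the Fréchet subgradient of $f$.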
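(First-order stationarity) For problem (1), a point $x \in \Omega$ is a first-order stationary point if
$$0 \in \partial_F (\phi + \delta_\Omega)(x),$$
where $\delta_\Omega$ denotes the indicator function of $\Omega$. Moreover, a point $x$ is said to be $\epsilon$-stationary if
$$\mathrm{dist}\big(0, \partial_F (\phi + \delta_\Omega)(x)\big) \le \epsilon,$$
where dist denotes the Euclidean distance from a point to a set.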
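(Moreau Envelope and Proximal Mapping) For any function $f$ and $\lambda > 0$, the following function is called a Moreau envelope of $f$:
$$f_\lambda(x) = \min_{z} \left\{ f(z) + \frac{1}{2\lambda}\|z - x\|^2 \right\}.$$
Further, the optimal solution to the above problem, denoted by
$$\mathrm{prox}_{\lambda f}(x) = \arg\min_{z} \left\{ f(z) + \frac{1}{2\lambda}\|z - x\|^2 \right\},$$
is called a proximal mapping of $f$.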
(Weakly convex) A function $f$ is $\rho$-weakly convex if $f(x) + \frac{\rho}{2}\|x\|^2$ is convex.
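It is known that if $f$ is $\rho$-weakly convex and $\lambda < \rho^{-1}$, then its Moreau envelope $f_\lambda$ is $C^1$-smooth with the gradient given by
$$\nabla f_\lambda(x) = \lambda^{-1}\big(x - \mathrm{prox}_{\lambda f}(x)\big).$$
A small norm of $\nabla f_\lambda(x)$ has an interpretation that $x$ is close to a point that is nearly stationary. In particular, for any $x \in \mathbb{R}^d$, let $\hat{x} = \mathrm{prox}_{\lambda f}(x)$; then we have
$$f(\hat{x}) \le f(x), \qquad \|x - \hat{x}\| = \lambda \|\nabla f_\lambda(x)\|, \qquad \mathrm{dist}\big(0, \partial_F f(\hat{x})\big) \le \|\nabla f_\lambda(x)\|.$$
This means that a point $x$ satisfying $\|\nabla f_\lambda(x)\| \le \epsilon$ is within a distance of $O(\epsilon)$ of a point that is $\epsilon$-stationary.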
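As a small numerical illustration of the Moreau envelope, the proximal mapping, and the three relations above, the following sketch uses the one-dimensional convex function $f(x) = |x|$, whose proximal mapping is the well-known soft-thresholding operator. The function names, the test points, and the choice $\lambda = 0.5$ are assumptions made purely for illustration and do not come from the paper.

```python
def prox_abs(x, lam):
    """Proximal mapping of f(z) = |z|: argmin_z |z| + (z - x)^2 / (2 * lam)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x, lam):
    """Moreau envelope of |.| at x, evaluated through its proximal point."""
    z = prox_abs(x, lam)
    return abs(z) + (z - x) ** 2 / (2.0 * lam)


def moreau_grad_abs(x, lam):
    """Gradient of the envelope: (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


def dist_zero_to_subdiff_abs(z):
    """Distance from 0 to the subdifferential of |.| at z ([-1, 1] at z = 0)."""
    return 0.0 if z == 0.0 else 1.0


lam = 0.5
points = (-2.0, -0.3, 0.0, 0.2, 1.5)
for x in points:
    x_hat = prox_abs(x, lam)
    grad = moreau_grad_abs(x, lam)
    # Columns: x, prox point, envelope value, envelope gradient.
    print(x, x_hat, moreau_env_abs(x, lam), grad)
```

For points with $|x| \le \lambda$ the proximal point is exactly the stationary point $0$, while the envelope gradient $x/\lambda$ measures how far $x$ is from it.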
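It is notable that for a non-smooth non-convex function $f(\cdot)$, there could exist a sequence of solutions $\{x_k\}$ such that $\nabla f_\lambda(x_k)$ converges while $\mathrm{dist}(0, \partial_F f(x_k))$ may not converge. To handle such a challenging issue for non-smooth non-convex problems, we will follow existing works [6, 11, 8] to prove the near stationarity in terms of $\nabla f_\lambda(x)$. In the case when $f$ is smooth, the norm of the gradient of the Moreau envelope of $f + \delta_\Omega$ is closely related to the magnitude of the projected gradient $\mathcal{G}_\lambda(x)$ defined below, which has been used as a criterion for constrained non-convex optimization:
$$\mathcal{G}_\lambda(x) = \frac{1}{\lambda}\Big(x - \Pi_\Omega\big[x - \lambda \nabla f(x)\big]\Big),$$
where $\Pi_\Omega$ denotes the Euclidean projection onto $\Omega$.
It was shown that when $f$ is smooth with an $L$-Lipschitz continuous gradient, the two quantities bound each other up to factors that depend only on $L$ and $\lambda$.
Thus, the near stationarity in terms of the gradient of the Moreau envelope implies the near stationarity in terms of $\mathcal{G}_\lambda(x)$ for a smooth function $f$.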
Now, we are ready to state the basic assumptions of the considered problem (1).
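(A) There is a measurable mapping $g(x; \xi)$ such that $\mathbb{E}_\xi[g(x; \xi)] \in \partial_F \phi(x)$ for any $x \in \Omega$.
(B) For any $x \in \Omega$, $\mathbb{E}[\|g(x; \xi)\|^2] \le G^2$.
(C) The objective function $\phi$ is $\mu$-weakly convex.
(D) There exists $\Delta_\phi > 0$ such that $\phi(x) - \min_{z \in \Omega} \phi(z) \le \Delta_\phi$ for any $x \in \Omega$.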
Remark: Assumptions 1-A and 1-B assume that a stochastic subgradient is available for the objective function and that its squared Euclidean norm is bounded in expectation, which are standard assumptions for non-smooth optimization. Assumption 1-C assumes weak convexity of the objective function, which is weaker than assuming smoothness. Assumption 1-D assumes that the gap between the objective value and the optimal value is bounded. Below, we present some examples of objective functions in machine learning that are weakly convex.
Ex. 1: Smooth Non-Convex Functions.
If $\phi(\cdot)$ is an $L$-smooth function (i.e., its gradient is $L$-Lipschitz continuous), then it is $L$-weakly convex.
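One way to see why smoothness implies weak convexity is the short derivation below. It is a sketch under the standard assumption that an $L$-smooth function $\phi$ satisfies the quadratic lower bound in the first line; the auxiliary function $\psi$ is notation introduced here for illustration.

```latex
\begin{align*}
\phi(y) &\ge \phi(x) + \langle \nabla \phi(x), y - x \rangle - \frac{L}{2}\|y - x\|^2
  && \text{($L$-smoothness)} \\
\|y\|^2 - \|y - x\|^2 &= \|x\|^2 + 2\langle x, y - x \rangle
  && \text{(expand } y = x + (y - x)\text{)} \\
\psi(y) := \phi(y) + \frac{L}{2}\|y\|^2
  &\ge \phi(x) + \frac{L}{2}\|x\|^2 + \langle \nabla \phi(x) + L x,\ y - x \rangle \\
  &= \psi(x) + \langle \nabla \psi(x), y - x \rangle .
\end{align*}
```

The last line is the first-order characterization of convexity for $\psi$, so $\phi + \frac{L}{2}\|\cdot\|^2$ is convex, which is exactly $L$-weak convexity.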
Ex. 2: Additive Composition.
$$\phi(x) = \mathbb{E}_\xi[\ell(x; \xi)] + r(x),$$
where $\mathbb{E}_\xi[\ell(x; \xi)]$ is a $\mu$-weakly convex function and $r(x)$ is a closed convex function. In this case, $\phi(x)$ is $\mu$-weakly convex. This class includes many interesting regularized problems in machine learning with smooth losses and convex regularizers. For smooth non-convex loss functions, one can consider a truncated square loss for robust learning, in which a smooth non-convex truncation function is applied to the square loss between the prediction on a random data point and its corresponding output. Such truncated non-convex losses have been considered in the literature, where it was proved that, under suitable conditions, the resulting loss is a smooth function with a Lipschitz continuous gradient. For $r(x)$, one can consider any existing convex regularizer, e.g., the $\ell_1$ norm, the group-lasso regularizer, or the graph-lasso regularizer.
Ex. 3: Convex and Smooth Composition
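$$\phi(x; \xi) = h(c(x; \xi)),$$
where $h$ is closed convex and $L$-Lipschitz continuous, and $c$ is a nonlinear smooth mapping with a $\beta$-Lipschitz continuous gradient. This class of functions has been considered in the literature, and it was proved that $\phi(x; \xi)$ is $\beta L$-weakly convex. An interesting example is phase retrieval, where $\phi(x; a, b) = |\langle x, a \rangle^2 - b|$. More examples of this class can be found in the literature.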
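To make the weak convexity of such compositions concrete, the following sketch checks it numerically for a one-dimensional phase-retrieval-style loss $|(ax)^2 - b|$. Here the outer function $|\cdot|$ is 1-Lipschitz and the inner map $a^2 x^2 - b$ has a $2a^2$-Lipschitz derivative, so the weak convexity parameter should be $2a^2$. The values of `a` and `b`, the grid, and the helper names are assumptions chosen for illustration.

```python
def phase_loss(x, a=1.5, b=2.0):
    """One-dimensional phase-retrieval-style loss |(a * x)^2 - b|."""
    return abs((a * x) ** 2 - b)


def regularized(x, rho, a=1.5, b=2.0):
    """The loss plus the quadratic (rho / 2) * x^2 from the weak convexity definition."""
    return phase_loss(x, a, b) + 0.5 * rho * x * x


def is_midpoint_convex(func, points, tol=1e-9):
    """Check func((u + v) / 2) <= (func(u) + func(v)) / 2 on all pairs of grid points."""
    for u in points:
        for v in points:
            mid = 0.5 * (u + v)
            if func(mid) > 0.5 * (func(u) + func(v)) + tol:
                return False
    return True


a, b = 1.5, 2.0
rho = 2.0 * a * a  # Lipschitz constant of |.| (1) times that of the inner derivative (2 a^2)
grid = [i / 20.0 for i in range(-60, 61)]

loss_is_convex = is_midpoint_convex(lambda x: phase_loss(x, a, b), grid)
shifted_is_convex = is_midpoint_convex(lambda x: regularized(x, rho, a, b), grid)
print(loss_is_convex, shifted_is_convex)
```

The grid check is only a necessary condition for convexity, but in this example the shifted function equals $\max(b,\, 2a^2x^2 - b)$, which is convex in closed form.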
Ex. 4: Smooth and Convex Composition
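$$\phi(x; \xi) = h(c(x; \xi)),$$
where $h$ is an $L$-smooth function satisfying $h'(\cdot) \ge 0$, and $c$ is convex and $M$-Lipschitz continuous. This class of functions has been considered in the literature for robust learning, and it was proved that $\phi(x; \xi)$ is $M^2 L$-weakly convex. An interesting example is the truncated Lipschitz continuous loss $\phi(x; \xi) = \alpha(\ell(x; \xi))$, where $\alpha$ is a smooth truncation function with $\alpha'(\cdot) \ge 0$ and $\ell$ is a convex and Lipschitz-continuous function (e.g., $|x^\top a - b|$ with bounded $\|a\|$).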
Ex. 5: Weakly Convex Sparsity-Promoting Regularizers
where is a convex or a weakly-convex function, and is a weakly-convex sparsity-promoting regularizer. Examples of weakly-convex sparsity-promoting regularizers include:
4 Stagewise Optimization: Algorithms and Analysis
In this section, we will present the proposed algorithms and the analysis of their convergence. We will first present a Meta algorithmic framework highlighting the key features of the proposed algorithms and then present several variants of the Meta algorithm by employing different basic algorithms.
The Meta algorithmic framework is described in Algorithm 1. There are several key features that differentiate Algorithm 1 from existing stochastic algorithms that come with a theoretical guarantee. First, the algorithm is run in multiple stages. At each stage $s$, a stochastic algorithm (SA) is called to inexactly optimize a proximal problem $f_s(x) = \phi(x) + \frac{1}{2\gamma}\|x - x_{s-1}\|^2$ that consists of the original objective function and a quadratic term, which is guaranteed to be convex due to the weak convexity of $\phi$ and $\gamma < \mu^{-1}$. The convexity of $f_s$ allows one to employ any suitable existing stochastic algorithm (cf. Theorem 1) that has a convergence guarantee for convex problems. It is notable that SA usually returns an averaged solution $x_s$ at each stage. Second, a decreasing sequence of step size parameters $\eta_s$ is used. At each stage, the SA uses a constant step size parameter $\eta_s$ and runs the updates for a number of iterations $T_s$. We do not initialize $T_s$ as it might be adaptive to the data, as in stagewise AdaGrad. Third, the final solution is selected from the stagewise averaged solutions $\{x_s\}$ with non-uniform sampling probabilities proportional to a sequence of non-decreasing positive weights $\{w_s\}$. In the sequel, we are particularly interested in $w_s = s^\alpha$ with $\alpha > 0$. The setup of $\eta_s$ and $T_s$ will depend on the specific choice of SA, which will be exhibited later for different variants.
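The three features described above can be sketched in a few lines of code. The following is a minimal illustration, not a reproduction of Algorithm 1: it assumes a toy one-dimensional objective $|x^2 - 1|$ (which is 2-weakly convex), projected Sgd with iterate averaging as the SA, a step size decaying as `eta0 / s`, a stage length growing as `iters0 * s`, and sampling weights `s ** alpha`. All function names, constants, and the Gaussian noise model are hypothetical choices made for this sketch.

```python
import random


def phi_subgrad(x):
    """A subgradient of the toy objective phi(x) = |x^2 - 1| (2-weakly convex)."""
    if x * x > 1.0:
        return 2.0 * x
    if x * x < 1.0:
        return -2.0 * x
    return 0.0


def project(x, lo=-3.0, hi=3.0):
    """Euclidean projection onto the interval [lo, hi], playing the role of the domain."""
    return max(lo, min(hi, x))


def sgd_stage(center, gamma, eta, num_iters, rng, noise_std=0.1):
    """Run Sgd with constant step eta on phi(x) + (x - center)^2 / (2 * gamma)
    and return the average of the iterates produced in this stage."""
    x = center
    total = 0.0
    for _ in range(num_iters):
        g = phi_subgrad(x) + rng.gauss(0.0, noise_std)  # noisy subgradient of phi
        g += (x - center) / gamma                       # gradient of the quadratic term
        x = project(x - eta * g)
        total += x
    return total / num_iters


def stagewise_sgd(x0, num_stages, mu=2.0, eta0=0.1, iters0=20, alpha=2.0, seed=0):
    rng = random.Random(seed)
    gamma = 1.0 / (2.0 * mu)          # any gamma < 1 / mu keeps each stage problem convex
    averages = []
    x_prev = x0
    for s in range(1, num_stages + 1):
        eta_s = eta0 / s              # step size is constant within a stage, smaller across stages
        iters_s = iters0 * s          # later stages run for more iterations
        x_prev = sgd_stage(x_prev, gamma, eta_s, iters_s, rng)
        averages.append(x_prev)
    # Select the returned solution with probability proportional to s ** alpha.
    weights = [s ** alpha for s in range(1, num_stages + 1)]
    tau = rng.choices(list(range(num_stages)), weights=weights, k=1)[0]
    return averages[tau], averages


x_tau, stage_averages = stagewise_sgd(x0=2.0, num_stages=10)
print(x_tau, stage_averages[-1])
```

With increasing weights, late-stage averages are the most likely to be returned, which mirrors the common practice of reporting the last averaged solution. In this toy run the stage averages move from the starting point toward the stationary point at $x = 1$.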
To illustrate that Algorithm 1 is a universal framework such that any suitable SA algorithm can be employed, we present the following result by assuming that SA has an appropriate convergence for a convex problem.
Remark: It is notable that the convergence guarantee is provided on a stagewise averaged solution. To justify a heuristic approach that returns the final averaged solution for prediction, we analyze a new sampling strategy that samples a solution among all stagewise averaged solutions with sampling probabilities that increase with the stage number. This sampling strategy is better than the uniform sampling strategy or a strategy with decreasing sampling probabilities used in the existing literature. The convergence upper bound in (7) of SA covers the results of a broad family of stochastic convex optimization algorithms. In special cases (as in Sgd), the upper bound can be improved by a constant factor. Moreover, we do not optimize the value of $\gamma$. Indeed, any $\gamma < \mu^{-1}$ will work, and the choice only affects the constant factor in the convergence upper bound.
Next, we present several variants of the Meta algorithm by employing Sgd, AdaGrad, and stochastic momentum methods as the basic SA algorithm, to which we refer as stagewise Sgd, stagewise AdaGrad, and stagewise stochastic momentum methods, respectively. It is worth mentioning that one can follow a similar analysis to study other stagewise algorithms by using their basic convergence results for stochastic convex optimization, including RMSProp and AMSGrad; these analyses are omitted in this paper.
Below, we use to denote expectation over randomness in the -th stage given all history before -th stage. Define
Then . Then we have . Next, we apply Lemma 1 to each call of Sgd in stagewise Sgd,
On the other hand, we have that
where the inequality follows from Young's inequality with . Thus we have that
Next, we bound given that is fixed. According to the definition of , we have
Taking expectation over randomness in the -th stage on both sides, we have
Assuming that , we have
Plugging this upper bound into (4), we have
By setting and assuming , we have
Multiplying both sides by , we have that
By summing over , we have
Taking the expectation w.r.t. , we have that
For the first term on the R.H.S, we have that
Standard calculus shows that
Combining these facts and the assumption , we have that
In order to have , we can set . The total number of iterations is
4.1 Stagewise Sgd
In this subsection, we analyze the convergence of stagewise Sgd, in which Sgd shown in Algorithm 2 is employed in the Meta framework. Besides Assumption 1, we impose the following assumption in this subsection.
The domain $\Omega$ is bounded, i.e., there exists $D > 0$ such that $\|x - y\| \le D$ for any $x, y \in \Omega$.
It is worth mentioning that the bounded domain assumption is imposed for simplicity and is commonly made in convex optimization. For machine learning problems, one usually imposes a bounded norm constraint to achieve regularization. Recently, several studies have found that imposing a norm constraint is more effective than an additive norm regularization term in the objective for deep learning [17, 30]. Nevertheless, the bounded domain assumption is not essential for the proposed algorithm. We present a more involved analysis in the next subsection for an unbounded domain. The following is a basic convergence result of Sgd, whose proof can be found in the literature and is omitted.
For Algorithm 2, assume that is convex and , then for any we have
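To state the convergence, we introduce the notation
$$\nabla \phi_\gamma(x) = \gamma^{-1}\big(x - \mathrm{prox}_{\gamma(\phi + \delta_\Omega)}(x)\big),$$
which is the gradient of the Moreau envelope of the objective function $\phi + \delta_\Omega$. The following theorem exhibits the convergence of stagewise Sgd.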
4.2 Stagewise stochastic momentum (SM) methods
In this subsection, we present stagewise stochastic momentum methods and their analysis. In the literature, there are two popular variants of stochastic momentum methods, namely, the stochastic heavy-ball method (Shb) and the stochastic Nesterov's accelerated gradient method (Snag). Both methods have been used for training deep neural networks [25, 36], and have been analyzed for non-convex optimization in prior work. To contrast with those results, we will consider the same unified stochastic momentum method, which subsumes Shb, Snag and Sgd as special cases for particular choices of its parameter. The updates are presented in Algorithm 3.
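To illustrate how a single update rule can subsume heavy-ball and plain gradient steps, the sketch below implements one unified momentum recursion known from the momentum literature. Algorithm 3 is not reproduced in this text, so the exact form used here is an assumption: $y_{t+1} = x_t - \eta g_t$, $y^s_{t+1} = x_t - s\eta g_t$, $x_{t+1} = y_{t+1} + \beta (y^s_{t+1} - y^s_t)$, with $s = 0$ giving heavy-ball and $s = 1/(1-\beta)$ giving a gradient step of size $\eta/(1-\beta)$. Deterministic gradients of a toy quadratic are used so that the trajectories can be compared exactly; all names and constants are illustrative.

```python
def unified_momentum(grad, x0, eta, beta, s, num_iters):
    """Assumed unified update: y = x - eta*g, ys = x - s*eta*g, x_next = y + beta*(ys - ys_prev)."""
    x = x0
    ys_prev = x0  # the auxiliary sequence starts at the initial point
    trajectory = [x]
    for _ in range(num_iters):
        g = grad(x)
        y = x - eta * g
        ys = x - s * eta * g
        x = y + beta * (ys - ys_prev)
        ys_prev = ys
        trajectory.append(x)
    return trajectory


def heavy_ball(grad, x0, eta, beta, num_iters):
    """Classical heavy-ball: x_next = x - eta*g + beta*(x - x_prev)."""
    x_prev, x = x0, x0
    trajectory = [x]
    for _ in range(num_iters):
        x_next = x - eta * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
        trajectory.append(x)
    return trajectory


def gradient_descent(grad, x0, step, num_iters):
    """Plain gradient descent with a constant step."""
    x = x0
    trajectory = [x]
    for _ in range(num_iters):
        x = x - step * grad(x)
        trajectory.append(x)
    return trajectory


def quad_grad(x):
    """Gradient of the toy objective (x - 3)^2."""
    return 2.0 * (x - 3.0)


eta, beta, steps = 0.02, 0.9, 50
as_heavy_ball = unified_momentum(quad_grad, 0.0, eta, beta, s=0.0, num_iters=steps)
as_gradient = unified_momentum(quad_grad, 0.0, eta, beta, s=1.0 / (1.0 - beta), num_iters=steps)
reference_hb = heavy_ball(quad_grad, 0.0, eta, beta, steps)
reference_gd = gradient_descent(quad_grad, 0.0, eta / (1.0 - beta), steps)
print(as_heavy_ball[-1], as_gradient[-1])
```

Replacing `quad_grad` with a noisy gradient oracle would give the stochastic versions; the deterministic oracle is used here only so that the special cases can be compared exactly.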
To present the analysis of stagewise SM methods, we first provide a convergence result for minimizing at each stage.
For Algorithm 3, assume is a -strongly convex function, where such that , and , then we have that