We consider optimization problems of the form
Here, each function f_i is smooth, while the regularization ψ may be nonsmooth. By considering extended-real-valued functions, this
composite setting also encompasses constrained minimization by letting ψ
be the indicator function of the constraints on x.
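For concreteness, the composite objective can be instantiated as follows. This is a hypothetical toy instance of our own choosing (least-squares data-fitting terms and an ℓ1 regularizer), not an example taken from the paper:

```python
import numpy as np

# Toy instance of the composite problem: minimize
#   f(x) = (1/n) * sum_i f_i(x) + psi(x),
# with smooth terms f_i(x) = 0.5 * (a_i^T x - b_i)^2 and a nonsmooth
# l1 regularizer psi(x) = lam * ||x||_1. Using an indicator function
# for psi would instead model constrained minimization.
rng = np.random.default_rng(0)
n, p, lam = 50, 10, 0.1
A, b = rng.standard_normal((n, p)), rng.standard_normal(n)

def smooth_part(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def objective(x):
    return smooth_part(x) + lam * np.sum(np.abs(x))

print(objective(np.zeros(p)))  # psi(0) = 0, so this equals the smooth part alone
```

Replacing the ℓ1 term with a nonconvex penalty, or the quadratic losses with a neural-network fit, yields the nonconvex instances discussed below.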
Minimization of regularized empirical risk objectives of the form (1) is central in machine learning. Whereas a significant amount of work has been devoted to this composite setting for convex problems, leading in particular to fast incremental algorithms [see, e.g., saga, conjugategradient, miso, sag, woodworth:srebro:2016, proxsvrg], the question of efficiently minimizing (1) when the functions f_i and the regularization ψ may be nonconvex is still largely open today.
Yet, nonconvex problems in machine learning are of high interest. For
instance, the variable x may represent the parameters of a neural network,
where each term f_i measures the fit between x and a data point indexed
by i, or (1) may correspond to a nonconvex matrix
factorization problem (see Section 6).
Besides, even when the data-fitting functions f_i are convex, it is also typical
to consider nonconvex regularization functions ψ, for example for
feature selection in signal processing [htw:2015]. In this work, we address two questions from nonconvex optimization:
How to apply a method for convex optimization to a nonconvex problem?
How to design an algorithm which does not need to know whether the objective function is convex while obtaining the optimal convergence guarantee if the function is convex?
Several pioneering works attempted to transfer ideas from
the convex world to the nonconvex one, see, e.g., [GL1, GL2].
Our paper has a similar goal and studies the extension of Nesterov’s
acceleration for convex problems [nesterov1983] to nonconvex composite ones.
Unfortunately, the concept of acceleration for nonconvex problems is unclear
from a worst-case complexity point of view: gradient descent requires
O(ε^{-2}) iterations to guarantee a gradient norm
smaller than ε [Cartis2010, Cartis2014]. Under the stronger assumption that the objective function is C²-smooth,
state-of-the-art methods [e.g., CDHS]
achieve a marginal gain, with complexity O(ε^{-7/4} log(1/ε)), and do not appear to generalize to composite
or finite-sum settings.
For this reason, our work fits within a broader stream of recent research on
methods that do not perform worse than gradient descent in the nonconvex case
(in terms of worst-case complexity), while automatically accelerating for
minimizing convex functions. The hope when applying such methods to nonconvex problems
is to see acceleration in practice, by heuristically exploiting convexity that is “hidden”
in the objective (for instance, local convexity near the optimum, or convexity along the trajectory of iterates).
The main contribution of this paper is a generic meta-algorithm, dubbed 4WD-Catalyst, which takes a gradient-based optimization method, originally designed for convex problems, and turns it into an accelerated scheme that also applies to nonconvex objective functions. The proposed 4WD-Catalyst can be seen as a 4-Wheel-Drive extension of Catalyst [catalyst] to all optimization “terrains” (convex and nonconvex), whereas Catalyst was originally proposed for convex optimization. Specifically, without knowing whether the objective function is convex or not, our algorithm may take a method designed for convex optimization problems with the same structure as (1), e.g., SAGA [saga], SVRG [proxsvrg], and apply it to a sequence of sub-problems such that it asymptotically provides a stationary point of the nonconvex objective. Overall, the number of iterations of the method to obtain a gradient norm smaller than ε is Õ(ε^{-2}) in the worst case, while automatically reducing to Õ(ε^{-2/3}) if the function is convex.¹ (¹In this section, the Õ notation only displays the polynomial dependency with respect to ε for the clarity of exposition.)
Inspired by Nesterov’s acceleration method for convex optimization [nesterov], the first accelerated method performing universally well for nonconvex and convex problems was introduced in [GL1]. Specifically, [GL1] addresses composite problems such as (1) and, provided the iterates are bounded, performs no worse than gradient descent on nonconvex instances, with complexity O(ε^{-2}) on the gradient norm. When the problem is convex, it accelerates, with complexity O(ε^{-2/3}). Extensions to accelerated Gauss-Newton type methods were also recently developed in [accel_prox_comp]. In a follow-up work [GL2], a new scheme is proposed, which monotonically interlaces proximal gradient descent steps and Nesterov’s extrapolation, thereby achieving guarantees similar to those of [GL1], but without the need to assume the iterates to be bounded. Extensions to the case where the gradient is only Hölder continuous can also be devised.
In [NIPS2015_5728], a similar strategy is proposed, focusing instead on convergence guarantees under the so-called Kurdyka-Łojasiewicz inequality—a property corresponding to polynomial-like growth of the function, as shown by [error_KL]. Our scheme is in the same spirit as these previous papers, since it monotonically interlaces proximal-point steps (instead of proximal-gradient as in [GL2]) and extrapolation/acceleration steps. A fundamental difference is that our method is generic and accommodates inexact computations, since we allow the subproblems to be approximately solved by any method we wish to accelerate.
By considering C²-smooth nonconvex objective functions with Lipschitz continuous gradient and Hessian, Carmon et al. [CDHS] propose an algorithm with complexity O(ε^{-7/4} log(1/ε)), based on iteratively solving convex subproblems closely related to the original problem. It is not clear whether the complexity of their algorithm improves in the convex setting. Note also that the algorithm proposed in [CDHS] is inherently designed for C²-smooth minimization and requires exact gradient evaluations. This implies that the scheme neither allows incorporating nonsmooth regularizers nor exploits finite-sum structure.
Finally, a stochastic method related to SVRG [JT_SVRG] for
minimizing large sums while automatically adapting to the weak
convexity constant of the objective function is proposed in [natasha]. When the weak convexity constant is small (i.e., the function is
nearly convex), the proposed method enjoys an improved efficiency estimate.
This algorithm, however, does not automatically accelerate for convex problems,
in the sense that the overall rate is slower in terms of the target accuracy ε on the gradient norm.
Organization of the paper.
Section 2 presents mathematical tools for nonconvex and nonsmooth analysis, which are used throughout the paper. In Sections 3 and 4, we introduce the main algorithm and important extensions, respectively. Section 5 shows how to apply our approach to existing algorithms. Finally, we present experimental results on matrix factorization and training of neural networks in Section 6.
2 Tools for nonconvex and nonsmooth optimization
Convergence results for nonsmooth optimization typically rely on the concept of subdifferential, which does not admit a unique definition in a nonconvex context [borwein:lewis:2006]. In this paper, we circumvent this issue by focusing on a broad class of nonconvex functions known as weakly convex (or lower-C²) functions, for which all these constructions coincide. Weakly convex functions cover most cases of interest in machine learning and resemble convex functions in many aspects. In this section, we formally introduce them and discuss their subdifferential properties.
Definition 2.1 (Weak convexity).
A function f is ρ-weakly convex if for any points x, y and any t in [0, 1], the approximate secant inequality holds:
f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y) + (ρ/2) t(1 − t)‖x − y‖².
Notice that ρ-weak convexity with ρ = 0 is exactly the definition of a convex function. An elementary algebraic manipulation shows that f is ρ-weakly convex if and only if the function x ↦ f(x) + (ρ/2)‖x‖² is convex. In particular, a C¹-smooth function f is ρ-weakly convex if the gradient ∇f is ρ-Lipschitz, while a C²-smooth function f is ρ-weakly convex if and only if ∇²f(x) ⪰ −ρI for all x. This closely resembles an equivalent condition for C²-smooth and μ-strongly convex functions, namely ∇²f(x) ⪰ μI with μ > 0.
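As a quick numerical sanity check (not a proof), one can verify the approximate secant inequality on a grid for f(x) = sin(x), which is 1-weakly convex since f''(x) = −sin(x) ≥ −1; the ρ/2 scaling below matches the convention in which f + (ρ/2)‖·‖² is convex:

```python
import numpy as np

# f(x) = sin(x) satisfies f''(x) >= -1, so it should be rho-weakly convex
# with rho = 1, i.e. g(x) = sin(x) + (rho/2) x^2 is convex. We test
#   f(t*x + (1-t)*y) <= t f(x) + (1-t) f(y) + (rho/2) t (1-t) (x - y)^2
# on a grid of points; the small tolerance guards against rounding.
f = np.sin
rho = 1.0
xs = np.linspace(-3, 3, 25)
ts = np.linspace(0, 1, 11)
ok = all(
    f(t * x + (1 - t) * y)
    <= t * f(x) + (1 - t) * f(y) + 0.5 * rho * t * (1 - t) * (x - y) ** 2 + 1e-12
    for x in xs for y in xs for t in ts
)
print(ok)  # expect True
```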
Useful characterizations of -weakly convex functions rely on differential properties. Since the functions we consider in the paper are nonsmooth, we use a generalized derivative construction. We mostly follow the standard monograph on the subject by Rockafellar and Wets [rock_wets].
Definition 2.2 (Subdifferential).
Consider a function f and a point x with f(x) finite. The subdifferential of f at x, denoted ∂f(x), is the set
Thus, a vector v lies in ∂f(x) whenever the linear function y ↦ f(x) + ⟨v, y − x⟩ is a lower-model of f, up to first-order, around x. In particular, the subdifferential of a differentiable function f is the singleton {∇f(x)}, while for a convex function f it coincides with the subdifferential in the sense of convex analysis [see rock_wets, Exercise 8.8]. It is useful to keep in mind that the sum rule ∂(f + g)(x) = ∂f(x) + ∇g(x) holds for any differentiable function g.
We are interested in deriving complexity bounds on the number of iterations required by a method to guarantee
dist(0, ∂f(x)) ≤ ε.
Recall that when 0 ∈ ∂f(x), the point x is stationary and satisfies first-order optimality conditions. In our convergence analysis, we will also use the following differential characterization of ρ-weakly convex functions, which generalizes classical properties of convex functions. A proof follows directly from Theorem 12.17 of [rock_wets] by taking into account that f is ρ-weakly convex if and only if f + (ρ/2)‖·‖² is convex.
Theorem 2.3 (Differential characterization of ρ-weakly convex functions).
For any lower-semicontinuous function f, the following properties are equivalent:
f is ρ-weakly convex.
(subgradient inequality). For all x, y and all v in ∂f(x), we have
f(y) ≥ f(x) + ⟨v, y − x⟩ − (ρ/2)‖y − x‖².
(hypo-monotonicity). For all x, y, all v in ∂f(x), and all w in ∂f(y),
⟨v − w, x − y⟩ ≥ −ρ‖x − y‖².
Weakly convex functions have appeared in a wide variety of contexts, and under different names. Some notable examples are globally lower-C² [rock_subsmooth], prox-regular [prox_reg_var_anal], proximally smooth functions [prox_smooth_equiv_rock], and functions whose epigraph has positive reach [pos_reach].
3 The Basic 4WD-Catalyst algorithm for non-convex optimization
We now present a generic scheme (Algorithm 1) for applying a convex optimization method to minimize the objective (1),
where f is only assumed to be ρ-weakly convex. Our goal is to develop a unified framework that automatically accelerates in convex settings. Consequently, the scheme must be agnostic to the constant ρ.
3.1 Basic 4WD-Catalyst : a meta algorithm
At the center of our meta algorithm (Algorithm 1) are two sequences of subproblems obtained by adding simple quadratics to . The proposed approach extends the Catalyst acceleration of [catalyst] and comes with a simplified convergence analysis. We next describe in detail each step of the scheme.
The proposed acceleration scheme builds two main sequences of iterates, obtained by approximately solving two subproblems. These subproblems are simple quadratic perturbations of the original problem, having the form:
Here, κ > 0 is a regularization parameter, and the point around which the quadratic is centered is called the prox-center. By adding the quadratic, we make the problem more “convex”: when f is nonconvex, with a large enough κ, the subproblem will be convex; when f is convex, we improve the conditioning of the problem.
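The effect of the added quadratic can be seen on a one-dimensional toy example; the function, the value of κ, and the plain gradient-descent inner solver below are our illustrative choices, not the paper's:

```python
import numpy as np

# f(x) = sin(x) is 1-weakly convex; adding kappa/2 * (x - x_center)^2 with
# kappa = 2 > 1 makes the subproblem strongly convex, hence easy to solve.
def subproblem_grad(x, x_center, kappa):
    # d/dx [ sin(x) + kappa/2 * (x - x_center)^2 ]
    return np.cos(x) + kappa * (x - x_center)

def solve_subproblem(x_center, kappa, steps=200, lr=0.2):
    # plain gradient descent on the strongly convex subproblem
    x = x_center
    for _ in range(steps):
        x = x - lr * subproblem_grad(x, x_center, kappa)
    return x

x_star = solve_subproblem(x_center=0.0, kappa=2.0)
# at the subproblem's minimizer the subproblem gradient is (near) zero
print(abs(subproblem_grad(x_star, 0.0, 2.0)) < 1e-6)
```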
At the k-th iteration, given a previous iterate and the extrapolation term, we construct the following two subproblems.
Proximal point step. We first perform an inexact proximal point step with the current iterate as the prox-center:
Accelerated proximal point step. Then we build the next prox-center as the convex combination
Next, we use this point as the prox-center and update the next extrapolation term:
where (α_k) is a sequence of extrapolation coefficients in (0, 1]. Essentially, the sequences are built upon the extrapolation principles of Nesterov [nesterov].
Picking the best.
At the end of iteration k, we have at hand two candidate iterates, obtained respectively from the proximal point step and the accelerated step. Following [GL1], we simply choose the best of the two in terms of their objective values, that is, we pick the candidate with the smaller value of f.
The proposed scheme blends the two steps in a synergistic way, allowing us to recover near-optimal rates of convergence in both worlds: convex and nonconvex. Intuitively, when the proximal point iterate is chosen, it means that Nesterov’s extrapolation step “fails” to accelerate convergence.
[Algorithm 1 (Basic 4WD-Catalyst): at each iteration, approximately solve the two quadratically perturbed subproblems above, with prox-centers given by the current iterate and the extrapolated point respectively; update the extrapolation sequence; and output the candidate with the smaller objective value.]
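To make the scheme concrete, here is a loose sketch of the outer loop on a one-dimensional toy function, with the subproblems solved by plain gradient descent. The variable names and the simplified momentum coefficient k/(k+3) are our own choices, not the paper's exact Algorithm 1:

```python
import numpy as np

def prox_step(center, kappa, f_grad, steps=200, lr=0.2):
    # approximately minimize f(x) + kappa/2 * (x - center)^2 by gradient descent
    x = center
    for _ in range(steps):
        x = x - lr * (f_grad(x) + kappa * (x - center))
    return x

def basic_4wd_catalyst_sketch(f, f_grad, x0, kappa=2.0, iters=30):
    x, y = x0, x0
    for k in range(iters):
        x_bar = prox_step(x, kappa, f_grad)      # proximal point step at x
        x_tilde = prox_step(y, kappa, f_grad)    # accelerated step at extrapolated y
        x_new = x_bar if f(x_bar) <= f(x_tilde) else x_tilde  # pick the best
        y = x_new + (k / (k + 3)) * (x_new - x)  # simplified Nesterov extrapolation
        x = x_new
    return x

f = lambda x: np.sin(x) + 0.1 * x ** 2           # a 1-weakly convex toy objective
f_grad = lambda x: np.cos(x) + 0.2 * x
x_final = basic_4wd_catalyst_sketch(f, f_grad, x0=0.0)
print(abs(f_grad(x_final)) < 1e-3)               # near-stationary point
```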
Stopping criterion for the subproblems.
In order to derive complexity bounds, it is important to properly define the stopping criterion for the proximal subproblems. When the subproblem is convex, a functional gap may be used to control the inexactness, as in [catalyst]. Without convexity, this criterion cannot be used, since such quantities cannot be easily bounded. First-order methods instead seek points whose subgradients are small. In a nonconvex setting, small subgradients do not necessarily imply small function values, so a first-order method can only test for small subgradients; in the convex setting, by contrast, small subgradients do imply small function values, so a first-order method can “test” for small function values. Hence, we cannot directly apply Catalyst [catalyst], which uses the functional gap as a stopping criterion. Because we are working in the nonconvex setting, we include a stationarity stopping criterion.
We propose to use jointly the following two types of stopping criteria:
Descent condition: the candidate does not increase the subproblem value;
Adaptive stationarity condition: the candidate’s subgradient norm is small relative to the length of the step taken.
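For a smooth toy subproblem, the two checks can be sketched as follows; the exact constants and subgradient-based criteria in the paper may differ, this only shows the shape of the tests:

```python
import numpy as np

# Subproblem h(x) = f(x) + kappa/2 * (x - c)^2 with f(x) = sin(x);
# names and constants are illustrative.
kappa, c = 2.0, 0.0
f, f_grad = np.sin, np.cos
h = lambda x: f(x) + 0.5 * kappa * (x - c) ** 2
h_grad = lambda x: f_grad(x) + kappa * (x - c)

def accept(x_candidate, x_prev):
    descent = h(x_candidate) <= h(x_prev)                 # descent condition
    # stationarity relative to the length of the step taken
    stationary = abs(h_grad(x_candidate)) <= kappa * abs(x_candidate - x_prev)
    return descent and stationary

# an accurate solve of the subproblem should pass both checks
x_acc = c
for _ in range(300):
    x_acc -= 0.2 * h_grad(x_acc)
print(accept(x_acc, c))  # expect True
```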
Without the descent condition, the stationarity condition alone is an insufficient stopping criterion because of the existence of local maxima in nonconvex problems: both local maxima and local minima satisfy it. The descent condition ensures that the iterates generated by the algorithm always decrease the value of the objective function f, thus ensuring we move away from local maxima. The second criterion, the adaptive stationarity condition, provides a flexible relative tolerance for terminating the algorithm used to solve the subproblems; a detailed analysis is forthcoming.
In Basic 4WD-Catalyst, we use both the stationarity condition and the descent condition as stopping criteria to produce the proximal point iterate:
For our “acceleration” point, we use a modified stationarity condition:
The additional factor in this condition guarantees that Basic 4WD-Catalyst accelerates in the convex setting. To be precise, Equation (27) in the proofs of Theorems 3.1 and 3.2 uses this factor to ensure convergence. Note that we do not need the descent condition for the acceleration point, as the functional decrease enforced on the proximal point iterate is enough to ensure that the sequence of objective values is monotonically decreasing.
3.2 Convergence analysis.
We present here the theoretical properties of Algorithm 1. In this first stage, we do not take into account the complexity of solving the subproblems (5) and (7). For the next two theorems, we assume that the stopping criteria for the proximal subproblems are satisfied at each iteration of Algorithm 1.
Theorem 3.1 (Outer-loop complexity for Basic 4WD-Catalyst; non-convex case).
For any κ > 0 and any number of iterations N ≥ 1, the iterates generated by Algorithm 1 satisfy
It is important to notice that this convergence result is valid for any κ and does not require it to be larger than the weak convexity parameter ρ. As long as the stopping criteria for the proximal subproblems are satisfied, the relevant stationarity measures tend to zero. The proof is inspired by that of inexact proximal algorithms [bertsekas:2015, guler:1991, catalyst] and appears in Appendix B.
If the function turns out to be convex, the scheme achieves a faster convergence rate both in function values and in stationarity:
Theorem 3.2 (Outer-loop complexity, convex case).
If the function f is convex, then for any κ > 0 and any N ≥ 1, the iterates generated by Algorithm 1 satisfy
where x* is any minimizer of the function f.
The proof of Theorem 3.2 appears in Appendix B. This theorem establishes an O(1/N²) rate for suboptimality in function value, together with a convergence rate for the minimal norm of subgradients. The first rate is optimal in terms of information-based complexity for the minimization of a convex composite function [nesterov, nesterov2013gradient]. The second can be improved through a regularization technique, if one knew in advance that the function is convex and had an estimate of the distance from the initial point to an optimal solution [nest_optima].
Towards an automatically adaptive algorithm.
So far, our analysis has not taken into account the cost of obtaining the iterates by the given optimization method.
We emphasize again that the two results above do not require any assumption on κ, which leaves us a degree of freedom. In order to derive the global complexity, we need to evaluate the total number of iterations performed by the method throughout the process. Clearly, this complexity heavily depends on the choice of κ, since it controls the magnitude of the regularization we add to improve the convexity of the subproblem. This is the point where a careful analysis is needed, because our algorithm must adapt to ρ without knowing it in advance. The next section is entirely dedicated to this issue. In particular, we will explain how to automatically adapt the parameter κ (Algorithm 2).
4 The 4WD-Catalyst algorithm
In this section, we work towards understanding the global efficiency of Algorithm 1, which automatically adapts to the weak convexity parameter. For this, we must take into account the cost of approximately solving the proximal subproblems to the desired stopping criteria. We expect that once the subproblem becomes strongly convex, the given optimization method can solve it efficiently. For this reason, we first focus on the computational cost for solving the sub-problems, before introducing a new algorithm with known worst-case complexity.
4.1 Solving the sub-problems efficiently
When κ is large enough, the subproblems become strongly convex, and thus globally solvable. Henceforth, we will assume that the given method satisfies the following natural linear convergence assumption.
Linear convergence of the method for strongly convex problems.
We assume that for any κ, there exist a rate and a constant such that the following hold:
For any prox-center and any initial point, the iterates generated by the method on the subproblem satisfy a linear convergence bound on the subgradient norm. If the method is randomized, we require the same inequality to hold in expectation.
The rates and the constants are increasing in κ.
The linear convergence we assume here differs from the one considered in [catalyst], which was given in terms of function values. However, if the problem is composite, both points of view are near-equivalent, as discussed in Appendix A, and the precise relationship is given in Appendix C. We choose the norm of the subgradient as our measure because it simplifies the complexity analysis.
Then, a straightforward analysis bounds the computational complexity of achieving an ε-stationary point.
Let us consider a strongly convex problem and a linearly convergent method generating a sequence of iterates; let ε denote the target accuracy. Then:
If the method is deterministic,
If the method is randomized, then
see Lemma C.1 of [catalyst].
As we can see, we only lose a factor inside the logarithmic term by switching from deterministic to randomized algorithms. For the sake of simplicity, we perform our analysis only for deterministic algorithms; the analysis for randomized algorithms holds in the same way, in expectation.
Bounding the required iterations when κ > ρ, and the restart strategy.
Recall that we add a quadratic to f in the hope of making each subproblem convex. Thus, if ρ were known, we would simply set κ > ρ. In this first stage, we show that whenever κ > ρ, the number of inner calls to the method can be bounded with a proper initialization. Consider the subproblem
and define the initialization point by:
if f is smooth, then initialize at the prox-center;
if f is composite, with an L-smooth part, then initialize with one proximal-gradient step from the prox-center, with step size 1/(L + κ).
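The composite warm start can be sketched as follows, assuming an ℓ1 regularizer whose proximal operator is soft-thresholding; the step size 1/(L + κ) reflects that the smooth part of the subproblem has (L + κ)-Lipschitz gradient, and the names below are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def warm_start(x_center, smooth_grad, L, kappa, lam):
    # one proximal-gradient step from the prox-center; the added quadratic
    # kappa/2 * ||x - x_center||^2 has zero gradient at x_center, so only
    # the gradient of the smooth part of f enters the step.
    eta = 1.0 / (L + kappa)
    return soft_threshold(x_center - eta * smooth_grad(x_center), eta * lam)

x0 = np.array([1.0, -0.2, 0.0])
g = lambda x: x                       # e.g. smooth part 0.5 * ||x||^2, gradient x
z0 = warm_start(x0, g, L=1.0, kappa=2.0, lam=0.3)
print(z0)
```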
Consider the subproblem (15) and suppose κ > ρ. Then, with the initialization above, the method generates a sequence of iterates such that:
in at most a bounded number of iterations,
the output satisfies the descent condition and the adaptive stationarity condition;
in at most a bounded number of iterations,
the output satisfies the modified adaptive stationarity condition.
The proof is technical and is presented in Appendix D. The lesson we learn here is that as soon as the subproblem becomes strongly convex, it can be solved in an almost constant number of iterations. Herein arises a problem: the choice of the smoothing parameter κ. On one hand, when f is already convex, we may want to choose κ small in order to obtain the desired optimal complexity. On the other hand, when the problem is nonconvex, a small κ may not ensure the strong convexity of the subproblems. Because of this different behavior depending on the convexity of the function, we introduce an additional parameter to handle the regularization of the extrapolation step. Moreover, in order to choose κ in the nonconvex case, we would need to know in advance an estimate of ρ. This is not an easy task for large-scale machine learning problems such as neural networks. Thus, we propose an adaptive step to handle it automatically.
[Algorithm 2 (4WD-Catalyst): at each iteration, compute the proximal point iterate by running the adaptive procedure of Algorithm 3; compute the accelerated iterate by applying a predefined number of iterations of the method, using the initialization strategy described below (15); update the extrapolation sequence; and pick the candidate with the smaller objective value.]
4.2 4WD-Catalyst: adaptation to weak convexity
We now introduce 4WD-Catalyst, presented in Algorithm 2, which can automatically adapt to the unknown weak convexity constant of the objective. The algorithm relies on a procedure to automatically adapt to , described in Algorithm 3.
The idea is to fix in advance a number of iterations T, let the method run on the subproblem for T iterations, output the resulting point, and check whether a sufficient decrease occurs. We show that if T is chosen large enough, with only logarithmic dependencies on the problem parameters (among them the Lipschitz constant L of the smooth part of f), then, whenever the subproblem is convex, the following conditions are guaranteed:
Descent condition: the output does not increase the subproblem value;
Adaptive stationarity condition: the output’s subgradient norm is small relative to the length of the step taken.
Thus, if either condition is not satisfied, the subproblem is deemed not convex, and we double κ and repeat. The procedure yields an estimate of ρ after a logarithmic number of increases; see Lemma D.3.
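The doubling procedure can be sketched as follows; this is our rendering of the idea on a smooth toy function, not the paper's exact Algorithm 3, and the inner solver, budget, and constants are illustrative:

```python
import numpy as np

def auto_adapt(f, f_grad, x_center, kappa0, T=100, lr=0.05, max_doublings=20):
    # Run a fixed budget T of gradient descent on the smoothed subproblem
    # h(x) = f(x) + kappa/2 (x - x_center)^2; if either the descent or the
    # stationarity check fails, deem the subproblem nonconvex, double kappa,
    # and retry.
    kappa = kappa0
    for _ in range(max_doublings):
        h = lambda z, k=kappa: f(z) + 0.5 * k * (z - x_center) ** 2
        h_grad = lambda z, k=kappa: f_grad(z) + k * (z - x_center)
        x = x_center
        for _ in range(T):
            x = x - lr * h_grad(x)
        if h(x) <= h(x_center) and abs(h_grad(x)) <= kappa * abs(x - x_center):
            return x, kappa            # both checks pass: accept
        kappa *= 2.0                   # deemed nonconvex: double and retry
    return x, kappa

# f = sin is 1-weakly convex; starting from a deliberately small kappa0,
# the procedure increases kappa until both checks pass.
x_out, kappa_out = auto_adapt(np.sin, np.cos, x_center=0.0, kappa0=0.25)
print(kappa_out >= 0.25)
```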
Relative stationarity and predefining S.
One of the main differences between our approach and the Catalyst algorithm of [catalyst] is the use of predefined numbers of iterations, T and S, for solving the subproblems. We introduce κ_cvx, an L-dependent smoothing parameter, and set it in the same way as the smoothing parameter in [catalyst]. The automatic acceleration of our algorithm when the problem is convex is due to the extrapolation steps in Steps 2-3 of Basic 4WD-Catalyst. We show that if S is chosen large enough, with only logarithmic dependencies on the problem parameters, then we can be sure that, for convex objectives,
This relative stationarity of the accelerated iterate, including the choice of κ_cvx, is crucial to guarantee that the scheme accelerates in the convex setting. An additional factor appears compared to the previous adaptive stationarity condition because we need higher accuracy in solving the subproblem to achieve the accelerated rate.
We shall see in the experiments that our strategy of predefining T and S works quite well. The theoretical bounds we derive are, in general, too conservative; we observe in our experiments that one may choose T and S significantly smaller than the theory suggests and still satisfy the stopping criteria.
To derive global complexity results for 4WD-Catalyst that match the optimal convergence guarantees, we make a distinction between the regularization parameter used in the proximal point step and the one used in the extrapolation step. For the proximal point step, we apply Algorithm 3 to adaptively produce a sequence of smoothing parameters, initialized at an initial guess of ρ. The resulting iterate and parameter satisfy both of the following inequalities:
For the extrapolation step, we introduce the parameter κ_cvx, which essentially depends on the Lipschitz constant L. The choice is the same as that of the smoothing parameter in [catalyst] and depends on the method. With a similar predefined iteration strategy, the resulting iterate satisfies the following inequality if the original objective is convex:
4.3 Convergence analysis
Let us next postulate that T and S are chosen large enough to guarantee that the two iterates satisfy conditions (18) and (19) for the corresponding subproblems, and see how the outer-loop complexity resembles the guarantees of Theorems 3.1 and 3.2. The main technical difference is that the smoothing parameter changes at each iteration k, which requires keeping track of the effects of both smoothing parameters throughout the proof.
Theorem 4.4 (Outer-loop complexity, 4WD-Catalyst).
If, in addition, the function f is convex and the accelerated iterate is computed so that it satisfies (19), then
where x* is any minimizer of the function f.
In light of Theorem 4.4, we must now understand how to choose T and S as small as possible, while guaranteeing that the two iterates satisfy (18) and (19) at each iteration. The quantities T and S depend on the method’s linear convergence rate parameter, which itself depends only on the smoothing parameter and the problem constants; for example, this rate parameter is known in closed form for gradient descent and for SVRG. The values of T and S must be set beforehand, without knowing the true value of the weak convexity constant ρ. Using Theorem 4.3, we assert the following choices for T and S.
Theorem 4.5 (Inner complexity for 4WD-Catalyst: determining the values T and S).
Suppose T and S are chosen as above for all k. In particular,
the following hold for any index k:
Generating the proximal point iterate in Algorithm 2 requires at most Õ(T) iterations of the method;
Generating the accelerated iterate in Algorithm 2 requires at most Õ(S) iterations of the method;
where Õ hides universal constants and logarithmic dependencies on the parameters of the problem.
We summarize the proof of Theorem 4.5 as follows:
When κ > ρ, we compute the number of iterations of the method needed to produce a point satisfying (18). Such a point becomes the proximal point iterate.
When the function f is convex, we compute the number of iterations of the method needed to produce a point satisfying condition (19). Such a point becomes the accelerated iterate.
We compute the smallest number of times we must double κ until it becomes larger than ρ, so that the condition κ > ρ eventually occurs.
The next proposition shows that Auto-adapt terminates with a suitable choice of κ after a bounded number of iterations.
Proposition 4.6 (Inner complexity for Auto-adapt).
Suppose κ > ρ, and initialize the method using the strategy suggested in Algorithm 2 for solving the subproblem. If we run the method for sufficiently many iterations,
then the output satisfies both the descent condition and the adaptive stationarity condition.
Under the additional assumption that the function f is convex, we can produce a point satisfying (19), provided the number of iterations is chosen sufficiently large.
Proposition 4.7 (Inner-loop complexity for the accelerated step).
Theorem 4.8 (Global complexity bounds for 4WD-Catalyst).
Choose T and S as in Theorem 4.5. We let Õ hide universal constants and logarithmic dependencies on the parameters of the problem. Then, the following statements hold.
In general, the linear convergence parameter of the method depends on the condition number of the subproblem. The two relevant values are obtained by plugging the two smoothing parameters into this rate; to clarify, when the method is SVRG, this yields an explicit complexity, and a more detailed computation is given in Table 2. For all the incremental methods we consider, the resulting inner iteration counts are on the order of n.
If the method is a first-order method, the convergence guarantee in the convex setting is near-optimal, up to logarithmic factors, when compared to [catalyst, woodworth:srebro:2016]. In the nonconvex setting, our approach matches, up to logarithmic factors, the best known rate for this class of functions, namely O(ε^{-2}) [Cartis2010, Cartis2014]. Moreover, our rates’ dependence on the dimension and the Lipschitz constants equals, up to logarithmic factors, the best known dependencies in both the convex and nonconvex settings. These logarithmic factors may be the price we pay for having a generic algorithm.
5 Applications to Existing Algorithms
We now show how to accelerate existing algorithms and compare the convergence guarantees before and after applying 4WD-Catalyst. In particular, we focus on the gradient descent algorithm and on the incremental methods SAGA and SVRG. For all the algorithms considered, we state the convergence guarantees in terms of the total number of iterations (in expectation, if appropriate) needed to reach an accuracy of ε; in the convex setting, the accuracy is stated in terms of functional error, and in the nonconvex setting, the appropriate measure is stationarity, namely the norm of the subgradients. All the algorithms considered have formulations for the composite setting with analogous convergence rates. Table 1 presents convergence rates for SAGA [saga], (prox) SVRG [proxsvrg], and gradient descent (FG).
[Table 1 appears here; for SVRG [proxsvrg], the corresponding rate without 4WD-Catalyst is marked “not avail.”]