Catalyst Acceleration for Gradient-Based Non-Convex Optimization

We introduce a generic scheme to solve nonconvex optimization problems using gradient-based algorithms originally designed for minimizing convex functions. When the objective is convex, the proposed approach enjoys the same properties as the Catalyst approach of Lin et al. [22]. When the objective is nonconvex, it achieves the best known convergence rate to stationary points for first-order methods. Specifically, the proposed algorithm does not require knowledge about the convexity of the objective; yet, it obtains an overall worst-case efficiency of Õ(ε^-2) and, if the function is convex, the complexity reduces to the near-optimal rate Õ(ε^-2/3). We conclude the paper by showing promising experimental results obtained by applying the proposed approach to SVRG and SAGA for sparse matrix factorization and for learning neural networks.


1 Introduction

We consider optimization problems of the form

 min_{x∈R^p} { f(x) := f_0(x) + ψ(x) },  where  f_0(x) := (1/n) ∑_{i=1}^n f_i(x). (1)

Here, each function f_i is smooth, the regularization ψ may be nonsmooth, and both may potentially be nonconvex. By considering extended-real-valued functions, this composite setting also encompasses constrained minimization by letting ψ be the indicator function of the constraints on x. Minimization of regularized empirical risk objectives of form (1) is central in machine learning. Whereas a significant amount of work has been devoted to this composite setting for convex problems, leading in particular to fast incremental algorithms [see, e.g., saga, conjugategradient, miso, sag, woodworth:srebro:2016, proxsvrg], the question of minimizing (1) efficiently when the functions f_i and ψ may be nonconvex is still largely open today.

Yet, nonconvex problems in machine learning are of high interest. For instance, the variable x may represent the parameters of a neural network, where each term f_i(x) measures the fit between x and a data point indexed by i, or (1) may correspond to a nonconvex matrix factorization problem (see Section 6). Besides, even when the data-fitting functions f_i are convex, it is also typical to consider nonconvex regularization functions ψ, for example for feature selection in signal processing [htw:2015]. In this work, we address two questions from nonconvex optimization:

1. How to apply a method for convex optimization to a nonconvex problem?

2. How to design an algorithm which does not need to know whether the objective function is convex while obtaining the optimal convergence guarantee if the function is convex?

Several pioneering works attempted to transfer ideas from the convex world to the nonconvex one, see, e.g., [GL1, GL2]. Our paper has a similar goal and studies the extension of Nesterov's acceleration for convex problems [nesterov1983] to nonconvex composite ones. Unfortunately, the concept of acceleration for nonconvex problems is unclear from a worst-case complexity point of view: gradient descent requires O(ε^-2) iterations to guarantee a gradient norm smaller than ε [Cartis2010, Cartis2014]. Under the stronger assumption that the objective function is C²-smooth, state-of-the-art methods [e.g., CDHS] achieve a marginal gain with complexity O(ε^-7/4 log(1/ε)), and do not appear to generalize to composite or finite-sum settings. For this reason, our work fits within a broader stream of recent research on methods that do not perform worse than gradient descent in the nonconvex case (in terms of worst-case complexity), while automatically accelerating when minimizing convex functions. The hope when applying such methods to nonconvex problems is to see acceleration in practice, by heuristically exploiting convexity that is "hidden" in the objective (for instance, local convexity near the optimum, or convexity along the trajectory of iterates).

The main contribution of this paper is a generic meta-algorithm, dubbed 4WD-Catalyst, which is able to use a gradient-based optimization method M, originally designed for convex problems, and turn it into an accelerated scheme that also applies to nonconvex objective functions. The proposed 4WD-Catalyst can be seen as a 4-Wheel-Drive extension of Catalyst [catalyst] to all optimization "terrains" (convex and nonconvex), while Catalyst was originally proposed for convex optimization. Specifically, without knowing whether the objective function is convex or not, our algorithm may take a method M designed for convex optimization problems with the same structure as (1), e.g., SAGA [saga], SVRG [proxsvrg], and apply M to a sequence of sub-problems such that it asymptotically provides a stationary point of the nonconvex objective. Overall, the number of iterations of M to obtain a gradient norm smaller than ε is Õ(ε^-2) in the worst case, while automatically reducing to Õ(ε^-2/3) if the function is convex. (In this section, the notation Õ only displays the polynomial dependency with respect to ε for the clarity of exposition.)

Related work.

Inspired by Nesterov's acceleration method for convex optimization [nesterov], the first accelerated method performing universally well for nonconvex and convex problems was introduced in [GL1]. Specifically, the work [GL1] addresses composite problems such as (1) with n = 1, and, provided the iterates are bounded, it performs no worse than gradient descent on nonconvex instances, with complexity O(ε^-2) on the gradient norm. When the problem is convex, it accelerates with complexity O(ε^-2/3). Extensions to accelerated Gauss-Newton type methods were also recently developed in [accel_prox_comp]. In a follow-up work [GL2], a new scheme is proposed, which monotonically interlaces proximal gradient descent steps and Nesterov's extrapolation, thereby achieving similar guarantees as [GL1] but without the need to assume the iterates to be bounded. Extensions when the gradient of f_0 is only Hölder continuous can also be devised.

In [NIPS2015_5728], a similar strategy is proposed, focusing instead on convergence guarantees under the so-called Kurdyka-Łojasiewicz inequality, a property corresponding to polynomial-like growth of the function, as shown by [error_KL]. Our scheme is in the same spirit as these previous papers, since it monotonically interlaces proximal-point steps (instead of proximal-gradient steps as in [GL2]) and extrapolation/acceleration steps. A fundamental difference is that our method is generic and accommodates inexact computations, since we allow the subproblems to be approximately solved by any method we wish to accelerate.

By considering C²-smooth nonconvex objective functions f with Lipschitz continuous gradient and Hessian, Carmon et al. [CDHS] propose an algorithm with complexity O(ε^-7/4 log(1/ε)), based on iteratively solving convex subproblems closely related to the original problem. It is not clear whether the complexity of their algorithm improves in the convex setting. Note also that the algorithm proposed in [CDHS] is inherently designed for C²-smooth minimization and requires exact gradient evaluations. This implies that the scheme does not allow incorporating nonsmooth regularizers and cannot exploit finite-sum structure.

Finally, a stochastic method related to SVRG [JT_SVRG] for minimizing large sums while automatically adapting to the weak convexity constant of the objective function is proposed in [natasha]. When the weak convexity constant is small (i.e., the function is nearly convex), the proposed method enjoys an improved efficiency estimate. This algorithm, however, does not automatically accelerate for convex problems, in the sense that the overall rate is slower than Õ(ε^-2/3) in terms of target accuracy ε on the gradient norm.

Organization of the paper.

Section 2 presents mathematical tools for nonconvex and nonsmooth analysis, which are used throughout the paper. In Sections 3 and 4, we introduce the main algorithm and important extensions, respectively. Finally, we present experimental results on matrix factorization and training of neural networks in Section 6.

2 Tools for nonconvex and nonsmooth optimization

Convergence results for nonsmooth optimization typically rely on the concept of subdifferential, which does not admit a unique definition in a nonconvex context [borwein:lewis:2006]. In this paper, we circumvent this issue by focusing on a broad class of nonconvex functions known as weakly convex (or lower-C²) functions, for which all these constructions coincide. Weakly convex functions cover most cases of interest in machine learning and resemble convex functions in many aspects. In this section, we formally introduce them and discuss their subdifferential properties.

Definition 2.1 (Weak convexity).

A function f : R^p → R is ρ-weakly convex if for any points x, y ∈ R^p and any λ ∈ [0, 1], the approximate secant inequality holds:

 f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y) + (ρ/2)·λ(1−λ)∥x−y∥².

Notice that ρ-weak convexity with ρ = 0 is exactly the definition of a convex function. An elementary algebraic manipulation shows that f is ρ-weakly convex if and only if the function x ↦ f(x) + (ρ/2)∥x∥² is convex. In particular, an L-smooth function (that is, with an L-Lipschitz gradient) is L-weakly convex, while a C²-smooth function f is ρ-weakly convex if and only if ∇²f(x) ⪰ −ρI for all x. This closely resembles an equivalent condition for C²-smooth and μ-strongly convex functions, namely ∇²f(x) ⪰ μI with μ > 0.
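As a quick numeric sanity check, consider the toy function f(x) = x² + 3 sin(x) (our own example, not from the paper). Since f''(x) = 2 − 3 sin(x) ≥ −1, this f is 1-weakly convex but not convex. The sketch below verifies that the plain secant inequality (ρ = 0) fails near x = π/2, while the ρ = 1 approximate secant inequality holds on random triples:

```python
import math
import random

# Toy 1-weakly convex function: f(x) = x^2 + 3 sin(x), with f'' >= -1.
def f(x):
    return x ** 2 + 3.0 * math.sin(x)

def secant_gap(x, y, lam, rho):
    """Slack of the approximate secant inequality; >= 0 iff it holds."""
    lhs = f(lam * x + (1.0 - lam) * y)
    rhs = (lam * f(x) + (1.0 - lam) * f(y)
           + 0.5 * rho * lam * (1.0 - lam) * (x - y) ** 2)
    return rhs - lhs

# Plain convexity (rho = 0) fails around x = pi/2, where f'' < 0 ...
violated = secant_gap(math.pi / 2 - 0.5, math.pi / 2 + 0.5, 0.5, rho=0.0) < 0

# ... but the rho = 1 secant inequality holds on random triples.
random.seed(0)
holds = all(
    secant_gap(random.uniform(-5, 5), random.uniform(-5, 5),
               random.uniform(0, 1), rho=1.0) >= -1e-12
    for _ in range(10000)
)
```

Equivalently, x ↦ f(x) + (1/2)x² has second derivative 3 − 3 sin(x) ≥ 0, confirming the characterization via adding the quadratic (ρ/2)∥x∥².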

Useful characterizations of -weakly convex functions rely on differential properties. Since the functions we consider in the paper are nonsmooth, we use a generalized derivative construction. We mostly follow the standard monograph on the subject by Rockafellar and Wets [rock_wets].

Definition 2.2 (Subdifferential).

Consider a function f : R^p → R ∪ {+∞} and a point x with f(x) finite. The subdifferential of f at x is the set

 ∂f(x) := { v ∈ R^p : f(y) ≥ f(x) + vᵀ(y−x) + o(∥y−x∥)  ∀y ∈ R^p }.

Thus, a vector v lies in ∂f(x) whenever the linear function y ↦ f(x) + vᵀ(y−x) is a lower-model of f, up to first-order terms, around x. In particular, the subdifferential ∂f(x) of a differentiable function f is the singleton {∇f(x)}, while for a convex function f it coincides with the subdifferential in the sense of convex analysis [see rock_wets, Exercise 8.8]. It is useful to keep in mind that the sum rule ∂(f + g)(x) = ∂f(x) + ∇g(x) holds for any differentiable function g.

We are interested in deriving complexity bounds on the number of iterations required by a method M to guarantee

 dist(0, ∂f(x)) ≤ ε.

Recall that when 0 ∈ ∂f(x), the point x is stationary and satisfies first-order optimality conditions. In our convergence analysis, we will also use the following differential characterization of ρ-weakly convex functions, which generalizes classical properties of convex functions. A proof follows directly from Theorem 12.17 of [rock_wets] by taking into account that f is ρ-weakly convex if and only if f + (ρ/2)∥·∥² is convex.

Theorem 2.3 (Differential characterization of ρ-weakly convex functions).

For any lower-semicontinuous function f, the following properties are equivalent:

1. f is ρ-weakly convex.

2. (subgradient inequality). For all x, y in R^p and v in ∂f(x), we have

 f(y) ≥ f(x) + vᵀ(y−x) − (ρ/2)∥y−x∥².

3. (hypo-monotonicity). For all x, y in R^p, v in ∂f(x), and w in ∂f(y),

 (v−w)ᵀ(x−y) ≥ −ρ∥x−y∥².
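Property 3 is easy to probe numerically. A minimal check on the same toy 1-weakly convex function f(x) = x² + 3 sin(x) (our example, not the paper's), whose subdifferential is the singleton {f'(x)} since f is differentiable:

```python
import math
import random

# Toy 1-weakly convex function; its gradient is single-valued.
def grad_f(x):
    return 2.0 * x + 3.0 * math.cos(x)

# Hypo-monotonicity with rho = 1:
# (f'(x) - f'(y)) (x - y) >= -rho (x - y)^2 for all x, y.
# This follows from |cos x - cos y| <= |x - y|; we sample random pairs.
random.seed(0)
rho = 1.0
hypo_monotone = all(
    (grad_f(x) - grad_f(y)) * (x - y) >= -rho * (x - y) ** 2 - 1e-12
    for x, y in ((random.uniform(-5, 5), random.uniform(-5, 5))
                 for _ in range(10000))
)
```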

Weakly convex functions have appeared in a wide variety of contexts, and under different names. Some notable examples are globally lower-C² functions [rock_subsmooth], prox-regular functions [prox_reg_var_anal], proximally smooth functions [prox_smooth_equiv_rock], and functions whose epigraph has positive reach [pos_reach].

3 The Basic 4WD-Catalyst algorithm for non-convex optimization

We now present a generic scheme (Algorithm 1) for applying a convex optimization method M to minimize

 min_{x∈R^p} f(x), (2)

where f is only ρ-weakly convex. Our goal is to develop a unified framework that automatically accelerates in convex settings. Consequently, the scheme must be agnostic to the constant ρ.

3.1 Basic 4WD-Catalyst: a meta-algorithm

At the center of our meta algorithm (Algorithm 1) are two sequences of subproblems obtained by adding simple quadratics to . The proposed approach extends the Catalyst acceleration of [catalyst] and comes with a simplified convergence analysis. We next describe in detail each step of the scheme.

Two-step subproblems.

The proposed acceleration scheme builds two main sequences of iterates (x̄_k) and (x̃_k), obtained from approximately solving two subproblems. These subproblems are simple quadratic perturbations of the original problem, of the form:

 min_x { f_κ(x; y) := f(x) + (κ/2)∥x−y∥² }.

Here, κ > 0 is a regularization parameter and y is called the prox-center. By adding the quadratic, we make the problem more "convex": when f is nonconvex, with a large enough κ, the subproblem will be convex; when f is convex, we improve the conditioning of the problem.

At the k-th iteration, given a previous iterate x_{k−1} and an extrapolation term v_{k−1}, we construct the two following subproblems.

1. Proximal point step. We first perform an inexact proximal point step with prox-center x_{k−1}:

 x̄_k ≈ argmin_x f_κ(x; x_{k−1}) [Proximal-point step]

2. Accelerated proximal point step. Then we build the next prox-center y_k as the convex combination

 y_k = α_k v_{k−1} + (1−α_k) x_{k−1}. (3)

Next, we use y_k as a prox-center and update the next extrapolation term:

 x̃_k ≈ argmin_x f_κ(x; y_k) [Accelerated proximal-point step]
 v_k = x_{k−1} + (1/α_k)(x̃_k − x_{k−1}) [Extrapolation] (4)

where (α_k) is a sequence of coefficients with α_1 = 1 satisfying the Nesterov recursion (1−α_{k+1})/α²_{k+1} = 1/α²_k. Essentially, the sequences (x̃_k) and (v_k) are built upon the extrapolation principles of Nesterov [nesterov].

Picking the best.

At the end of iteration k, we have at hand two iterates, resp. x̄_k and x̃_k. Following [GL1], we simply choose the best of the two in terms of objective value; that is, we choose x_k such that

 f(x_k) ≤ min{ f(x̄_k), f(x̃_k) }.

The proposed scheme blends the two steps in a synergistic way, allowing us to recover the near-optimal rates of convergence in both worlds: convex and nonconvex. Intuitively, when x̄_k is chosen, it means that Nesterov's extrapolation step "fails" to accelerate convergence.
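The two-step scheme above can be sketched in a few lines. The following is an illustrative implementation on a hypothetical 1-D problem (f(x) = x² + 3 sin(x), which is 1-weakly convex with 5-Lipschitz gradient), with plain gradient descent playing the role of M and a fixed inner iteration budget standing in for the formal stopping criteria; the α-recursion is the standard Nesterov one assumed above. None of the constants below come from the paper:

```python
import math

# Toy instance: f is 1-weakly convex, grad f is 5-Lipschitz.
def f(x):      return x ** 2 + 3.0 * math.sin(x)
def grad_f(x): return 2.0 * x + 3.0 * math.cos(x)

L, KAPPA = 5.0, 2.0   # kappa > rho = 1, so f_kappa(.; y) is strongly convex

def inexact_prox(center, inner_iters=50):
    """Approximately minimize f_kappa(.; center) = f + (kappa/2)|.-center|^2
    with gradient descent (our stand-in for the method M)."""
    z = center
    for _ in range(inner_iters):
        z -= (grad_f(z) + KAPPA * (z - center)) / (L + KAPPA)
    return z

x, v = 2.0, 2.0          # x_0 and the initial extrapolation term v_0
alpha = 1.0              # alpha_1 = 1
history = [f(x)]
for k in range(1, 31):
    x_bar = inexact_prox(x)                        # proximal-point step
    y = alpha * v + (1.0 - alpha) * x              # prox-center y_k, eq. (3)
    x_til = inexact_prox(y)                        # accelerated step
    v = x + (x_til - x) / alpha                    # extrapolation, eq. (4)
    x = x_bar if f(x_bar) <= f(x_til) else x_til   # pick the best
    history.append(f(x))
    # alpha_{k+1} solves (1 - a)/a^2 = 1/alpha_k^2 (assumed recursion)
    alpha = (math.sqrt(alpha ** 4 + 4 * alpha ** 2) - alpha ** 2) / 2.0
```

Because x̄_k is obtained by monotone gradient descent on f_κ(·; x_{k−1}) starting from x_{k−1}, and x_k is the better of the two candidates, the objective values in `history` decrease monotonically, illustrating the "picking the best" mechanism.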

Stopping criterion for the subproblems.

In order to derive complexity bounds, it is important to properly define the stopping criterion for the proximal subproblems. When the subproblem is convex, a functional gap such as f_κ(z; y) − min_x f_κ(x; y) may be used as a measure of inexactness, as in [catalyst]. Without convexity, this criterion cannot be used, since such quantities cannot be easily bounded. First-order methods instead seek points with small subgradients; in a nonconvex setting, small subgradients do not necessarily imply small function values, so a first-order method can only test for small subgradients. In contrast, in the convex setting, small subgradients imply small function values, so a first-order method can there "test" for small function values. Hence, we cannot directly apply Catalyst [catalyst], which uses the functional gap as a stopping criterion. Because we are working in the nonconvex setting, we include a stationarity-based stopping criterion.

We propose to use jointly the following two types of stopping criteria:

1. Descent condition: f_κ(z; y) ≤ f_κ(y; y);

2. Adaptive stationary condition: dist(0, ∂f_κ(z; y)) < κ∥z − y∥.

Without the descent condition, the stationarity condition alone is insufficient as a stopping criterion, because of the existence of local maxima in nonconvex problems: in the nonconvex setting, both local maxima and local minima satisfy the stationarity condition. The descent condition ensures that the iterates generated by the algorithm always decrease the value of the objective f, thus moving away from local maxima. The second criterion, the adaptive stationary condition, provides a flexible relative tolerance on the termination of the algorithm used for solving the subproblems; a detailed analysis is forthcoming.

In Basic 4WD-Catalyst, we use both the stationary condition and the descent condition as stopping criteria to produce the point x̄_k:

 dist(0, ∂f_κ(x̄_k; x_{k−1})) < κ∥x̄_k − x_{k−1}∥ and f_κ(x̄_k; x_{k−1}) ≤ f_κ(x_{k−1}; x_{k−1}). (11)

For the point x̃_k, our "acceleration" point, we use a modified stationary condition:

 dist(0, ∂f_κ(x̃_k; y_k)) < (κ/(k+1))∥x̃_k − y_k∥. (12)

The factor 1/(k+1) guarantees that Basic 4WD-Catalyst accelerates in the convex setting. To be precise, Equation (27) in the proofs of Theorem 3.1 and Theorem 3.2 uses this factor to ensure convergence. Note that we do not need the descent condition for x̃_k, as the functional decrease at x̄_k is enough to ensure the sequence (f(x_k)) is monotonically decreasing.
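As an illustration of criterion (11), the sketch below runs gradient descent (a stand-in for M) on a toy subproblem until both the adaptive stationarity and descent conditions hold; the function f(x) = x² + 3 sin(x), the constants L = 5, κ = 2, and the step size 1/(L+κ) are our own assumptions, not the paper's:

```python
import math

# Hypothetical 1-D instance: f is rho = 1 weakly convex, L = 5 smooth.
def f(x):      return x ** 2 + 3.0 * math.sin(x)
def grad_f(x): return 2.0 * x + 3.0 * math.cos(x)

L, KAPPA = 5.0, 2.0   # kappa > rho, so the subproblem is strongly convex

def solve_subproblem(x_prev, max_iter=1000):
    """Gradient descent on f_kappa(.; x_prev) until criteria (11) hold:
    |df_kappa(z)| < kappa * |z - x_prev|    (adaptive stationarity)
    f_kappa(z)   <= f_kappa(x_prev)         (descent condition)."""
    f_k = lambda z: f(z) + 0.5 * KAPPA * (z - x_prev) ** 2
    g_k = lambda z: grad_f(z) + KAPPA * (z - x_prev)
    z = x_prev
    for _ in range(max_iter):
        z -= g_k(z) / (L + KAPPA)   # standard 1/(L + kappa) step size
        if abs(g_k(z)) < KAPPA * abs(z - x_prev) and f_k(z) <= f_k(x_prev):
            break
    return z

z_out = solve_subproblem(2.0)
```

Note that the relative tolerance κ∥z − x_prev∥ starts at zero (z is initialized at the prox-center) and grows as the iterate moves away, while the subgradient shrinks, so the criterion becomes satisfiable after a few steps.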

3.2 Convergence analysis.

We present here the theoretical properties of Algorithm 1. In this first stage, we do not take into account the complexity of solving the subproblems (5) and (7). For the next two theorems, we assume that the stopping criteria for the proximal subproblems are satisfied at each iteration of Algorithm 1.

Theorem 3.1 (Outer-loop complexity for Basic 4WD-Catalyst; non-convex case).

For any κ > 0 and N ≥ 1, the iterates generated by Algorithm 1 satisfy

 min_{j=1,…,N} dist²(0, ∂f(x̄_j)) ≤ (8κ/N)·(f(x₀) − f*).

It is important to notice that this convergence result is valid for any κ > 0 and does not require it to be larger than the weak convexity parameter ρ. As long as the stopping criteria for the proximal subproblems are satisfied, the quantities dist(0, ∂f(x̄_j)) tend to zero. The proof is inspired by that of inexact proximal algorithms [bertsekas:2015, guler:1991, catalyst] and appears in Appendix B.

If the function f turns out to be convex, the scheme achieves a faster convergence rate both in function values and in stationarity:

Theorem 3.2 (Outer-loop complexity, convex case).

If the function f is convex, then for any κ > 0 and N ≥ 1, the iterates generated by Algorithm 1 satisfy

 f(x_N) − f(x*) ≤ (4κ/(N+1)²)·∥x* − x₀∥², (13)

and

 min_{j=1,…,2N} dist²(0, ∂f(x̄_j)) ≤ (32κ²/(N(N+1)²))·∥x* − x₀∥²,

where x* is any minimizer of the function f.

The proof of Theorem 3.2 appears in Appendix B. This theorem establishes a rate of O(1/N²) for suboptimality in function value and a rate of O(1/N^{3/2}) for the minimal norm of subgradients. The first rate is optimal in terms of information-based complexity for the minimization of a convex composite function [nesterov, nesterov2013gradient]. The second can be improved to O(1/N²) through a regularization technique, if one knew in advance that the function is convex and had an estimate of the distance of the initial point to an optimal solution [nest_optima].

So far, our analysis has not taken into account the cost of obtaining the iterates x̄_k and x̃_k by the algorithm M. We emphasize again that the two results above do not require any assumption on κ, which leaves us a degree of freedom. In order to develop the global complexity, we need to evaluate the total number of iterations performed by M throughout the process. Clearly, this complexity heavily depends on the choice of κ, since it controls the magnitude of the regularization we add to improve the convexity of the subproblems. This is the point where a careful analysis is needed, because our algorithm must adapt to the weak convexity parameter ρ without knowing it in advance. The next section is entirely dedicated to this issue. In particular, we will explain how to automatically adapt the parameter κ (Algorithm 2).

4 The 4WD-Catalyst algorithm

In this section, we work towards understanding the global efficiency of Algorithm 1 and how to adapt it automatically to the weak convexity parameter. For this, we must take into account the cost of approximately solving the proximal subproblems to the desired stopping criteria. We expect that once the subproblem becomes strongly convex, the given optimization method M can solve it efficiently. For this reason, we first focus on the computational cost of solving the sub-problems, before introducing a new algorithm with known worst-case complexity.

4.1 Solving the sub-problems efficiently

When κ is large enough, the subproblems become strongly convex, and thus globally solvable. Henceforth, we will assume that M satisfies the following natural linear convergence assumption.

Linear convergence of M for strongly-convex problems.

We assume that for any κ > 0, there exist A_κ ≥ 0 and τ_κ ∈ (0, 1) so that the following hold:

1. For any prox-center y and initial iterate z₀, the iterates (z_t) generated by M on the problem min_z f_κ(z; y) satisfy

 dist²(0, ∂f_κ(z_t; y)) ≤ A_κ(1−τ_κ)ᵗ·(f_κ(z₀; y) − f_κ*(y)), (14)

where f_κ*(y) := min_z f_κ(z; y). If the method M is randomized, we require the same inequality to hold in expectation.

2. The rates τ_κ and the constants A_κ are increasing in κ.

Remark 4.1.

The linear convergence we assume here for M differs from the one considered by [catalyst], which was given in terms of function values. However, if the problem is a composite one, the two points of view are near-equivalent, as discussed in Appendix A; the precise relationship is given in Appendix C. We choose the norm of the subgradient as our measure because it makes the complexity analysis easier.

Then, a straightforward analysis bounds the computational complexity to achieve an ε-stationary point.

Lemma 4.2.

Let us consider a strongly convex subproblem f_κ(·; y) and a linearly convergent method M generating a sequence of iterates (z_t)_{t≥0}. Define T(ε) := inf{ t ≥ 0 : dist(0, ∂f_κ(z_t; y)) ≤ ε }, where ε is the target accuracy; then,

1. If M is deterministic,

 T(ε) ≤ (1/τ_κ)·log( A_κ(f_κ(z₀; y) − f_κ*(y))/ε² ).

2. If M is randomized, then

 E[T(ε)] ≤ (1/τ_κ)·log( A_κ(f_κ(z₀; y) − f_κ*(y))/(τ_κ ε²) );

see Lemma C.1 of [catalyst].

As we can see, we only lose a logarithmic factor log(1/τ_κ) by switching from deterministic to randomized algorithms. For the sake of simplicity, we perform our analysis only for deterministic algorithms; the analysis for randomized algorithms holds in the same way in expectation.
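For concreteness, here is a numeric instantiation of both bounds of Lemma 4.2 with made-up constants (τ, A, and the initial gap are illustrative assumptions); the gap between the randomized and deterministic bounds is exactly (1/τ)·log(1/τ), i.e., only the argument of the logarithm changes:

```python
import math

# Hypothetical constants: a method with linear rate tau and constant A on a
# subproblem with initial gap f_kappa(z0; y) - f_kappa*(y).
tau, A, gap, eps = 0.01, 10.0, 1.0, 1e-3

t_det  = math.log(A * gap / eps ** 2) / tau          # deterministic M
t_rand = math.log(A * gap / (tau * eps ** 2)) / tau  # randomized M (in expectation)
```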

Bounding the required iterations when κ>ρ and restart strategy.

Recall that we add a quadratic to f with the hope of making each subproblem convex. Thus, if ρ is known, we should set κ > ρ. In this first stage, we show that whenever κ > ρ, the number of inner calls to M can be bounded with a proper initialization. Consider the subproblem

 min_{x∈R^p} { f_κ(x; y) = f(x) + (κ/2)∥x−y∥² }, (15)

and define the initialization point z₀ by:

1. if f is smooth, then set z₀ = y;

2. if f = f₀ + ψ is composite, with f₀ L-smooth, then set z₀ = prox_{ηψ}(y − η∇f₀(y)) with η = 1/(L+κ).

Theorem 4.3.

Consider the subproblem (15) and suppose κ > ρ. Then, initializing M at the previous z₀ generates a sequence of iterates (z_t) such that

1. in at most T_κ iterations, where

 T_κ = (1/τ_κ)·log( 8A_κ(L+κ)/(κ−ρ)² ),

the output z_T satisfies f_κ(z_T; y) ≤ f_κ(z₀; y) (descent condition) and dist(0, ∂f_κ(z_T; y)) < κ∥z_T − y∥ (adaptive stationary condition);

2. in at most S_κ log(k+1) iterations, where

 S_κ log(k+1) = (1/τ_κ)·log( 8A_κ(L+κ)(k+1)²/(κ−ρ)² ),

the output z_S satisfies dist(0, ∂f_κ(z_S; y)) < (κ/(k+1))∥z_S − y∥ (modified adaptive stationary condition).

The proof is technical and is presented in Appendix D. The lesson we learn here is that as soon as the subproblem becomes strongly convex, it can be solved in a nearly constant number of iterations. Herein arises a problem: the choice of the smoothing parameter κ. On one hand, when f is already convex, we may want to choose κ small in order to obtain the desired optimal complexity. On the other hand, when the problem is nonconvex, a small κ may not ensure the strong convexity of the subproblems. Because of this different behavior depending on the convexity of the function, we introduce an additional parameter κ_cvx to handle the regularization of the extrapolation step. Moreover, in order to choose a κ in the nonconvex case, we would need to know in advance an estimate of ρ. This is not an easy task for large-scale machine learning problems such as neural networks. Thus we propose an adaptive step to handle it automatically.

4.2 4WD-Catalyst: adaptation to weak convexity

We now introduce 4WD-Catalyst, presented in Algorithm 2, which can automatically adapt to the unknown weak convexity constant ρ of the objective. The algorithm relies on a procedure, Auto-adapt, described in Algorithm 3, to perform this adaptation.

The idea is to fix in advance a number of iterations T, let M run on the subproblem for T iterations, output the point z_T, and check whether a sufficient decrease occurs. We show that if we set T = Õ(1/τ_L), where the notation Õ hides logarithmic dependencies in L and κ, and where L is the Lipschitz constant of the smooth part of f, then, if the subproblem were convex, the following conditions would be guaranteed:

1. Descent condition: f_κ(z_T; x) ≤ f_κ(x; x);

2. Adaptive stationary condition: dist(0, ∂f_κ(z_T; x)) < κ∥z_T − x∥.

Thus, if either condition is not satisfied, the subproblem is deemed not convex; we then double κ and repeat. The procedure yields an estimate of ρ in a logarithmic number of increases; see Lemma D.3.
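The doubling procedure can be sketched as follows, again on our toy 1-weakly convex example f(x) = x² + 3 sin(x), with gradient descent standing in for M; the constants, initial κ, and iteration budget are illustrative assumptions, not the paper's choices:

```python
import math

# Sketch of an Auto-adapt-style procedure: run M for T fixed iterations on
# the subproblem; if either check fails, declare the subproblem nonconvex,
# double kappa, and retry.
def f(x):      return x ** 2 + 3.0 * math.sin(x)   # rho = 1 weakly convex
def grad_f(x): return 2.0 * x + 3.0 * math.cos(x)
L = 5.0

def run_M(center, kappa, T):
    """T steps of gradient descent on f_kappa(.; center)."""
    z = center
    for _ in range(T):
        z -= (grad_f(z) + kappa * (z - center)) / (L + kappa)
    return z

def auto_adapt(center, kappa=0.25, T=100):
    while True:
        z = run_M(center, kappa, T)
        f_k = lambda u: f(u) + 0.5 * kappa * (u - center) ** 2
        g_k = lambda u: grad_f(u) + kappa * (u - center)
        descent    = f_k(z) <= f_k(center)
        stationary = abs(g_k(z)) < kappa * abs(z - center)
        if descent and stationary:
            return z, kappa
        kappa *= 2.0   # subproblem deemed nonconvex: double and retry

z_out, kappa_out = auto_adapt(2.0)
```

Termination is guaranteed because once κ is large enough the subproblem is well-conditioned and T iterations of the inner method suffice to meet both checks.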

Relative stationarity and predefining S.

One of the main differences of our approach with the Catalyst algorithm of [catalyst] is the use of a pre-defined number of iterations, T and S, for solving the subproblems. We introduce κ_cvx, an L-dependent smoothing parameter, and set it in the same way as the smoothing parameter in [catalyst]. The automatic acceleration of our algorithm when the problem is convex is due to the extrapolation steps in Steps 2-3 of Basic 4WD-Catalyst. We show that if we set S = Õ(1/τ_{κ_cvx}), where Õ hides logarithmic dependencies in L, κ_cvx, and A_{κ_cvx}, then we can be sure that, for convex objectives,

 dist(0, ∂f_{κ_cvx}(x̃_k; y_k)) < (κ_cvx/(k+1))∥x̃_k − y_k∥. (17)

This relative stationarity of x̃_k, including the choice of κ_cvx, shall be crucial to guarantee that the scheme accelerates in the convex setting. An additional factor log(k+1) appears here compared to the previous adaptive stationary condition, because we need higher accuracy for solving the subproblem to achieve the accelerated rate in the convex case.

We shall see in the experiments that our strategy of predefining T and S works quite well. The theoretical bounds we derive are, in general, too conservative; we observe in our experiments that one may choose T and S significantly smaller than the theory suggests and still satisfy the stopping criteria.

To derive global complexity results for 4WD-Catalyst that match optimal convergence guarantees, we make a distinction between the regularization parameter κ in the proximal point step and in the extrapolation step. For the proximal point step, we apply Algorithm 3 to adaptively produce a sequence of parameters (κ_k), initialized at κ₀, an initial guess of ρ. The resulting x̄_k and κ_k satisfy both of the following inequalities:

 dist(0, ∂f_{κ_k}(x̄_k; x_{k−1})) < κ_k∥x̄_k − x_{k−1}∥ and f_{κ_k}(x̄_k; x_{k−1}) ≤ f_{κ_k}(x_{k−1}; x_{k−1}). (18)

For the extrapolation step, we introduce the parameter κ_cvx, which essentially depends on the Lipschitz constant L. The choice is the same as the smoothing parameter in [catalyst] and depends on the method M. With a similar predefined iteration strategy, the resulting x̃_k satisfies the following inequality if the original objective is convex:

 dist(0, ∂f_{κ_cvx}(x̃_k; y_k)) < (κ_cvx/(k+1))∥x̃_k − y_k∥. (19)

4.3 Convergence analysis

Let us next postulate that T and S are chosen large enough to guarantee that x̄_k and x̃_k satisfy conditions (18) and (19) for the corresponding subproblems, and see how the outer-loop complexity resembles the guarantees of Theorem 3.1 and Theorem 3.2. The main technical difference is that κ_k changes at each iteration k, which requires keeping track of the effects of κ_k and κ_cvx throughout the proof.

Theorem 4.4 (Outer-loop complexity, 4WD-Catalyst).

Fix real constants κ₀, κ_cvx > 0, and a starting point x₀ ∈ dom f. Set κ_max := max_k κ_k and let f* := inf f. Suppose that the number of iterations T is such that x̄_k satisfies (18). Then for any N ≥ 1, the iterates generated by Algorithm 2 satisfy

 min_{j=1,…,N} dist²(0, ∂f(x̄_j)) ≤ (8κ_max/N)·(f(x₀) − f*).

If in addition the function f is convex and S is chosen so that x̃_k satisfies (19), then

 min_{j=1,…,2N} dist²(0, ∂f(x̄_j)) ≤ (32 κ_max κ_cvx/(N(N+1)²))·∥x* − x₀∥²,

and

 f(x_N) − f(x*) ≤ (4κ_cvx/(N+1)²)·∥x* − x₀∥², (20)

where x* is any minimizer of the function f.

Inner-loop Complexity

In light of Theorem 4.4, we must now understand how to choose T and S as small as possible, while guaranteeing that x̄_k and x̃_k satisfy (18) and (19) for each k. The quantities T and S depend on the method M's convergence rate parameter τ_κ, which in turn only depends on κ and L; for instance, for gradient descent and for SVRG, τ_κ is governed by the condition number (L+κ)/(κ−ρ) of the subproblem. The values of T and S must be set beforehand without knowing the true value of the weak convexity constant ρ. Using Theorem 4.3, we assert the following choices for T and S.

Theorem 4.5 (Inner complexity for 4WD-Catalyst : determining the values T and S).

Suppose the stopping criteria are (18) and (19) as in Theorem 4.4, and choose T and S in Algorithm 2 to be the smallest numbers satisfying

 T ≥ (1/τ_L)·log( 40 A_{4L} L ),

and

 S log(k+1) ≥ (1/τ_{κ_cvx})·log( 8A_{κ_cvx}(κ_cvx+L)(k+1)²/κ_cvx² ),

for all k. In particular,

 T = O( (1/τ_L)·log(A_{4L}, L) ),  S = O( (1/τ_{κ_cvx})·log(A_{κ_cvx}, L, κ_cvx) ).

Then the parameters κ_k remain bounded, and the following hold for any index k:

1. Generating x̄_k in Algorithm 2 requires at most Õ(T) iterations of M;

2. Generating x̃_k in Algorithm 2 requires at most Õ(S log(k+1)) iterations of M;

where Õ hides universal constants and logarithmic dependencies on L, κ_cvx, κ₀, A_{4L}, and A_{κ_cvx}.

Appendix D is devoted to proving Theorem 4.5, but we outline below the general procedure and state the two main propositions (see Proposition 4.6 and Proposition 4.7).

We summarize the proof of Theorem 4.5 as follows:

1. When κ > ρ, we compute the number of iterations of M needed to produce a point satisfying (18). Such a point will become x̄_k.

2. When the function f is convex, we compute the number of iterations of M needed to produce a point satisfying condition (19). Such a point will become the point x̃_k.

3. We compute the smallest number of times we must double κ₀ until it becomes larger than ρ. Thus the condition κ > ρ eventually occurs.

4. We always set the number of iterations of M used to produce x̄_k and x̃_k as in Step 1 and Step 2, respectively, regardless of whether f is convex.

The next proposition shows that Auto-adapt terminates with a suitable choice for κ_k after a bounded number of iterations.

Proposition 4.6 (Inner complexity for ¯xk).

Suppose κ > ρ. By initializing the method M using the strategy suggested in Algorithm 2 for solving

 min_z { f_κ(z; x) := f(z) + (κ/2)∥z−x∥² },

we may run the method M for at least T iterations, where

 T ≥ (1/τ_L)·log( 40 A_{4L} L );

then, the output z_T satisfies f_κ(z_T; x) ≤ f_κ(x; x) and dist(0, ∂f_κ(z_T; x)) < κ∥z_T − x∥.

Under the additional assumption that the function f is convex, we produce a point x̃_k satisfying (19) when the number of iterations S is chosen sufficiently large.

Proposition 4.7 (Inner-loop complexity for ~xk).

Consider the method M with the initialization strategy suggested in Algorithm 2 for minimizing f_{κ_cvx}(·; y_k), with linear convergence rate of the form (14). Suppose the function f is convex. If the number of iterations of M is greater than S log(k+1), where

 S = O( (1/τ_{κ_cvx})·log(A_{κ_cvx}, L, κ_cvx) )

is such that

 S log(k+1) ≥ (1/τ_{κ_cvx})·log( 8A_{κ_cvx}(κ_cvx+L)(k+1)²/κ_cvx² ), (21)

then the output x̃_k satisfies dist(0, ∂f_{κ_cvx}(x̃_k; y_k)) < (κ_cvx/(k+1))∥x̃_k − y_k∥ for all k ≥ 1.

We can now derive global complexity bounds by combining Theorem 4.4 and Theorem 4.5, together with a good choice for the constant κ_cvx.

Theorem 4.8 (Global complexity bounds for 4WD-Catalyst).

Choose T and S as in Theorem 4.5. We let Õ hide universal constants and logarithmic dependencies in L, κ_cvx, κ₀, A_{4L}, A_{κ_cvx}, ε, and ∥x* − x₀∥. Then, the following statements hold.

1. Algorithm 2 generates a point x satisfying dist(0, ∂f(x)) ≤ ε after at most

 Õ( (τ_L^-1 + τ_{κ_cvx}^-1) · L(f(x₀) − f*)/ε² )

iterations of the method M.

2. If f is convex, then Algorithm 2 generates a point x satisfying dist(0, ∂f(x)) ≤ ε after at most

 Õ( (τ_L^-1 + τ_{κ_cvx}^-1) · L^{1/3}(κ_cvx∥x* − x₀∥²)^{1/3}/ε^{2/3} )

iterations of the method M.

3. If f is convex, then Algorithm 2 generates a point x satisfying f(x) − f(x*) ≤ ε after at most

 Õ( (τ_L^-1 + τ_{κ_cvx}^-1) · √(κ_cvx∥x* − x₀∥²)/√ε )

iterations of the method M.

Remark 4.9.

In general, the linear convergence parameter τ_κ of M depends on the condition number of the subproblem. Here, τ_L and τ_{κ_cvx} are precisely given by plugging κ = L and κ = κ_cvx, respectively, into τ_κ. To clarify, if M is SVRG, τ_κ is of the order 1/(n + (L+κ)/(κ−ρ)), where (L+κ)/(κ−ρ) is the condition number of the subproblem. A more detailed computation is given in Table 2. For all the incremental methods we consider, the parameters τ_L and τ_{κ_cvx} are on the order of 1/n.

Remark 4.10.

If M is a first-order method, the convergence guarantee in the convex setting is near-optimal, up to logarithmic factors, when compared to the optimal rate O(1/√ε) [catalyst, woodworth:srebro:2016]. In the nonconvex setting, our approach matches, up to logarithmic factors, the best known rate for this class of functions, namely O(ε^-2) [Cartis2010, Cartis2014]. Moreover, our rate's dependence on the dimension and the Lipschitz constant equals, up to log factors, the best known dependencies in both the convex and nonconvex settings. These logarithmic factors may be the price we pay for having a generic algorithm.

5 Applications to Existing Algorithms

We now show how to accelerate existing algorithms and compare the convergence guarantees before and after applying 4WD-Catalyst. In particular, we focus on the gradient descent algorithm and on the incremental methods SAGA and SVRG. For all the algorithms considered, we state the convergence guarantees in terms of the total number of iterations (in expectation, if appropriate) to reach an accuracy of ε; in the convex setting, the accuracy is stated in terms of functional error, and in the nonconvex setting, the appropriate measure is stationarity, namely dist(0, ∂f(x)) ≤ ε. All the algorithms considered have formulations for the composite setting with analogous convergence rates. Table 1 presents convergence rates for SAGA [saga], (prox) SVRG [proxsvrg], and gradient descent (FG).