A large number of machine learning and signal processing problems are formulated as the minimization of a convex objective function:
$$\min_{x \in \mathbb{R}^p} \left\{ f(x) \triangleq f_0(x) + \psi(x) \right\}, \tag{1}$$
where $f_0$ is convex and $L$-smooth, and $\psi$ is convex but may not be differentiable. We call a function $L$-smooth when it is differentiable and its gradient is $L$-Lipschitz continuous.
In statistics or machine learning, the variable $x$ may represent model parameters, and the role of $f_0$ is to ensure that the estimated parameters fit some observed data. Specifically, $f_0$ is often a large sum of functions and (1) is a regularized empirical risk which writes as
$$\min_{x \in \mathbb{R}^p} \left\{ f(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \psi(x) \right\}. \tag{2}$$
Each term $f_i(x)$ measures the fit between $x$ and a data point indexed by $i$, whereas the function $\psi$ acts as a regularizer; it is typically chosen to be the squared $\ell_2$-norm, which is smooth, or to be a non-differentiable penalty such as the $\ell_1$-norm or another sparsity-inducing norm (Bach et al., 2012).
We present a unified framework allowing one to accelerate gradient-based or first-order methods, with a particular focus on problems involving large sums of functions. By “accelerating”, we mean generalizing a mechanism invented by Nesterov (1983) that improves the convergence rate of the gradient descent algorithm. When $\psi = 0$, gradient descent steps produce iterates $(x_k)_{k \ge 0}$ such that $f(x_k) - f^* \le \varepsilon$ in $O(1/\varepsilon)$ iterations, where $f^*$ denotes the minimum value of $f$. Furthermore, when the objective $f$ is $\mu$-strongly convex, the previous iteration-complexity becomes $O((L/\mu)\log(1/\varepsilon))$, which is proportional to the condition number $L/\mu$. However, these rates were shown to be suboptimal for the class of first-order methods, and a simple strategy of taking the gradient step at a well-chosen point different from $x_k$ yields the optimal complexity: $O(1/\sqrt{\varepsilon})$ for the convex case and $O(\sqrt{L/\mu}\log(1/\varepsilon))$ for the $\mu$-strongly convex one (Nesterov, 1983). Later, this acceleration technique was extended to deal with non-differentiable penalties $\psi$ for which the proximal operator defined below is easy to compute (Beck and Teboulle, 2009; Nesterov, 2013):
$$\text{prox}_{\psi}(x) \triangleq \arg\min_{z \in \mathbb{R}^p} \left\{ \psi(z) + \frac{1}{2}\|x - z\|^2 \right\}, \tag{3}$$
where $\|\cdot\|$ denotes the Euclidean norm.
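To make the proximal operator concrete, here is a minimal sketch for the two regularizers mentioned above, assuming the standard definition $\text{prox}_{\psi}(x) = \arg\min_z \{\psi(z) + \frac{1}{2}\|x - z\|^2\}$. The function names `prox_l1` and `prox_sq_l2` and the weight `lam` are illustrative choices, not notation from the paper.

```python
def prox_l1(x, lam):
    """Proximal operator of psi(z) = lam * ||z||_1 (soft-thresholding).

    Each coordinate is shrunk towards zero by lam and clipped at zero.
    """
    out = []
    for xi in x:
        shrunk = max(abs(xi) - lam, 0.0)
        out.append(shrunk if xi >= 0 else -shrunk)
    return out


def prox_sq_l2(x, lam):
    """Proximal operator of psi(z) = (lam / 2) * ||z||_2^2.

    Setting the derivative lam * z + (z - x) to zero gives z = x / (1 + lam).
    """
    return [xi / (1.0 + lam) for xi in x]
```

Both operators cost one pass over the coordinates, which is what "easy to compute" means in practice for these penalties.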
For machine learning problems involving a large sum of $n$ functions, a recent effort has been devoted to developing fast incremental algorithms such as SAG (Schmidt et al., 2017), SAGA (Defazio et al., 2014a), SDCA (Shalev-Shwartz and Zhang, 2012), SVRG (Johnson and Zhang, 2013; Xiao and Zhang, 2014), or MISO/Finito (Mairal, 2015; Defazio et al., 2014b), which can exploit the particular structure (2). Unlike full gradient approaches, which require computing and averaging $n$ gradients $(1/n)\sum_{i=1}^{n} \nabla f_i(x)$ at every iteration, incremental techniques have a cost per-iteration that is independent of $n$. The price to pay is the need to store a moderate amount of information regarding past iterates, but the benefits may be significant in terms of computational complexity. In order to achieve an $\varepsilon$-accurate solution for a $\mu$-strongly convex objective, the number of gradient evaluations required by the methods mentioned above is bounded by $O\left(\left(n + \bar{L}/\mu\right)\log(1/\varepsilon)\right)$, where $\bar{L}$ is either the maximum Lipschitz constant across the gradients $\nabla f_i$, or the average value, depending on the algorithm variant considered. Unless there is a big mismatch between $\bar{L}$ and $L$ (global Lipschitz constant for the sum of gradients), incremental approaches significantly outperform the full gradient method, whose complexity in terms of gradient evaluations is bounded by $O\left(n (L/\mu) \log(1/\varepsilon)\right)$.
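The gap between these two bounds is easy to underestimate, so a small numerical sketch may help. It assumes the bounds take the usual forms $(n + \bar{L}/\mu)\log(1/\varepsilon)$ for incremental methods and $n (L/\mu) \log(1/\varepsilon)$ for the full gradient method, with all hidden constants set to one; the function names are illustrative.

```python
import math


def full_gradient_cost(n, L, mu, eps):
    """Gradient evaluations for the full gradient method, constants dropped."""
    return n * (L / mu) * math.log(1.0 / eps)


def incremental_cost(n, L_bar, mu, eps):
    """Gradient evaluations for SAG/SAGA/SVRG-type methods, constants dropped."""
    return (n + L_bar / mu) * math.log(1.0 / eps)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    n, L, mu, eps = 10_000, 100.0, 1e-4, 1e-6
    print("full gradient:", full_gradient_cost(n, L, mu, eps))
    print("incremental  :", incremental_cost(n, L, mu, eps))
```

With $\bar{L} = L$, the ratio of the two bounds is $n (L/\mu) / (n + L/\mu)$, which is close to $\min(n, L/\mu)$ up to a factor of two.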
Yet, these incremental approaches do not use Nesterov’s extrapolation steps and whether or not they could be accelerated was an important open question when these methods were introduced. It was indeed only known to be the case for SDCA (Shalev-Shwartz and Zhang, 2016) for strongly convex objectives. Later, other accelerated incremental algorithms were proposed such as Katyusha (Allen-Zhu, 2017), or the method of Lan and Zhou (2017).
We give here a positive answer to this open question. By analogy with substances that increase chemical reaction rates, we call our approach “Catalyst”. Given an optimization method $\mathcal{M}$ as input, Catalyst outputs an accelerated version of it, possibly the same algorithm if the method $\mathcal{M}$ is already optimal. The sole requirement on the method in order to achieve acceleration is that it should have a linear convergence rate for strongly convex problems. This is the case for full gradient methods (Beck and Teboulle, 2009; Nesterov, 2013) and block coordinate descent methods (Nesterov, 2012; Richtárik and Takáč, 2014), which already have well-known accelerated variants. More importantly, it also applies to the previous incremental methods, whose complexity is then bounded by $\tilde{O}\left(\sqrt{n \bar{L}/\mu}\,\log(1/\varepsilon)\right)$ after Catalyst acceleration, where $\tilde{O}$ hides some logarithmic dependencies on the condition number $\bar{L}/\mu$. This improves upon the non-accelerated variants when the condition number is larger than $n$. Besides, acceleration occurs regardless of the strong convexity of the objective (that is, even if $\mu = 0$), which brings us to our second achievement.
Some approaches such as MISO, SDCA, or SVRG are only defined for strongly convex objectives. A classical trick to apply them to general convex functions is to add a small regularization term $\varepsilon\|x\|^2$ in the objective (Shalev-Shwartz and Zhang, 2012). The drawback of this strategy is that it requires choosing in advance the parameter $\varepsilon$, which is related to the target accuracy. The approach we present here provides direct support for non-strongly convex objectives, thus removing the need to select $\varepsilon$ beforehand. Moreover, we can immediately establish a faster rate for the resulting algorithm. Finally, some methods such as MISO are numerically unstable when they are applied to strongly convex objective functions with a small strong convexity constant. By defining better conditioned auxiliary subproblems, Catalyst also provides better numerical stability to these methods.
A short version of this paper has been published at the NIPS conference in 2015 (Lin et al., 2015a); in addition to simpler convergence proofs and more extensive numerical evaluation, we extend the conference paper with a new Moreau-Yosida smoothing interpretation with significant theoretical and practical consequences as well as new practical stopping criteria and warm-start strategies.
The paper is structured as follows. We complete this introductory section with some related work in Section 1.1, and give a short description of the two-loop Catalyst algorithm in Section 1.2. Then, Section 2 introduces the Moreau-Yosida smoothing and its inexact variant. In Section 3, we introduce formally the main algorithm, and its convergence analysis is presented in Section 4. Section 5 is devoted to numerical experiments and Section 6 concludes the paper.
1.1 Related Work
Catalyst can be interpreted as a variant of the proximal point algorithm (Rockafellar, 1976; Güler, 1991), which is a central concept in convex optimization, underlying augmented Lagrangian approaches, and composite minimization schemes (Bertsekas, 2015; Parikh and Boyd, 2014). The proximal point algorithm consists of solving (1) by minimizing a sequence of auxiliary problems involving a quadratic regularization term. In general, these auxiliary problems cannot be solved with perfect accuracy, and several notions of inexactness were proposed by Güler (1992); He and Yuan (2012) and Salzo and Villa (2012). The Catalyst approach hinges upon (i) an acceleration technique for the proximal point algorithm originally introduced in the pioneering work of Güler (1992); (ii) a more practical inexactness criterion than those proposed in the past. (Note that our inexact criterion was also studied, among others, by Salzo and Villa (2012), but their analysis led to the conjecture that this criterion was too weak to warrant acceleration. Our analysis refutes this conjecture.) As a result, we are able to control the rate of convergence for approximately solving the auxiliary problems with an optimization method $\mathcal{M}$. In turn, we are also able to obtain the computational complexity of the global procedure, which was not possible with previous analysis (Güler, 1992; He and Yuan, 2012; Salzo and Villa, 2012). When instantiated in different first-order optimization settings, our analysis yields systematic acceleration.
Beyond Güler (1992), several works have inspired this work. In particular, accelerated SDCA (Shalev-Shwartz and Zhang, 2016) is an instance of an inexact accelerated proximal point algorithm, even though this was not explicitly stated in the original paper. Catalyst can be seen as a generalization of their algorithm, originally designed for stochastic dual coordinate ascent approaches. Yet their proof of convergence relies on different tools than ours. Specifically, we introduce an approximate sufficient descent condition, which, when satisfied, grants acceleration to any optimization method, whereas the direct proof of Shalev-Shwartz and Zhang (2016), in the context of SDCA, does not extend to non-strongly convex objectives. Another useful methodological contribution was the convergence analysis of inexact proximal gradient methods of Schmidt et al. (2011) and Devolder et al. (2014). Finally, similar ideas appeared in the independent work (Frostig et al., 2015). Their results partially overlap with ours, but the two papers adopt rather different directions. Our analysis is more general, covering both strongly-convex and non-strongly convex objectives, and comprises several variants including an almost parameter-free variant.
Then, beyond accelerated SDCA (Shalev-Shwartz and Zhang, 2016), other accelerated incremental methods have been proposed, such as APCG (Lin et al., 2015b), SDPC (Zhang and Xiao, 2015), RPDG (Lan and Zhou, 2017), Point-SAGA (Defazio, 2016) and Katyusha (Allen-Zhu, 2017). Their techniques are algorithm-specific and cannot be directly generalized into a unified scheme. However, we should mention that the complexity obtained by applying Catalyst acceleration to incremental methods matches the optimal bound up to a logarithmic factor, which may be the price to pay for a generic acceleration scheme.
A related recent line of work has also combined smoothing techniques with outer-loop algorithms such as Quasi-Newton methods (Themelis et al., 2016; Giselsson and Fält, 2016). Their purpose was not to accelerate existing techniques, but rather to derive new algorithms for nonsmooth optimization.
To conclude this survey, we mention the broad family of extrapolation methods (Sidi, 2017), which allow one to extrapolate to the limit sequences generated by iterative algorithms for various numerical analysis problems. Scieur et al. (2016) proposed such an approach for convex optimization problems with smooth and strongly convex objectives. The approach we present here allows us to obtain global complexity bounds for strongly convex and non strongly convex objectives, which can be decomposed into a smooth part and a non-smooth proximal-friendly part.
1.2 Overview of Catalyst
Before introducing Catalyst precisely in Section 3, we give a quick overview of the algorithm and its main ideas. Catalyst is a generic approach that wraps an algorithm $\mathcal{M}$ into an accelerated one $\mathcal{A}$, in order to achieve the same accuracy as $\mathcal{M}$ with reduced computational complexity. The resulting method $\mathcal{A}$ is an inner-outer loop construct, presented in Algorithm 1, where in the inner loop the method $\mathcal{M}$ is called to solve an auxiliary strongly-convex optimization problem, and where in the outer loop the sequence of iterates produced by $\mathcal{M}$ is extrapolated for faster convergence.
There are therefore three main ingredients in Catalyst: a) a smoothing technique that produces strongly-convex sub-problems; b) an extrapolation technique to accelerate the convergence; c) a balancing principle to optimally tune the inner and outer computations.
Smoothing by infimal convolution
Catalyst can be used on any algorithm $\mathcal{M}$ that enjoys a linear-convergence guarantee when minimizing strongly-convex objectives. However, the objective at hand may be poorly conditioned or might not even be strongly convex. In Catalyst, we use $\mathcal{M}$ to approximately minimize an auxiliary objective $h_k$ at iteration $k$, defined in (4), which is strongly convex and better conditioned than $f$. Smoothing by infimal convolution allows one to build a well-conditioned convex function $F$ from a poorly-conditioned convex function $f$ (see Section 2 for a refresher on Moreau envelopes). We shall show in Section 2 that a notion of approximate Moreau envelope allows us to define precisely the information collected when approximately minimizing the auxiliary objective.
Extrapolation by Nesterov acceleration
Catalyst uses an extrapolation scheme “à la Nesterov” to build a sequence $(y_k)_{k \ge 0}$ updated as
$$y_k = x_k + \beta_k (x_k - x_{k-1}),$$
where $(\beta_k)_{k \ge 0}$ is a positive decreasing sequence, which we shall define in Section 3.
We shall show in Section 4 that we can get faster rates of convergence thanks to this extrapolation step when the smoothing parameter $\kappa$, the inner-loop stopping criterion, and the sequence $(\beta_k)_{k \ge 0}$ are carefully built.
Balancing inner and outer complexities
The optimal balance between inner loop and outer loop complexity derives from the complexity bounds established in Section 4. Given an estimate of the condition number of $f$, our bounds dictate a choice of $\kappa$ that gives the optimal setting for the inner-loop stopping criterion and all technical quantities involved in the algorithm. We shall demonstrate in particular the power of an appropriate warm-start strategy to achieve near-optimal complexity.
Overview of the complexity results
Finally, we provide in Table 1 a brief overview of the complexity results obtained from the Catalyst acceleration, when applied to various optimization methods for minimizing a large finite sum of functions. Note that the complexity results obtained with Catalyst are optimal, up to some logarithmic factors (see Agarwal and Bottou, 2015; Arjevani and Shamir, 2016; Woodworth and Srebro, 2016).
[Table 1: complexity of each optimization method, Without Catalyst | With Catalyst]
2 The Moreau Envelope and its Approximate Variant
In this section, we recall a classical tool from convex analysis called the Moreau envelope or Moreau-Yosida smoothing (Moreau, 1962; Yosida, 1980), which plays a key role for understanding the Catalyst acceleration. This tool can be seen as a smoothing technique, which can turn any convex lower semicontinuous function into a smooth function, and an ill-conditioned smooth convex function into a well-conditioned smooth convex function.
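The Moreau envelope results from the infimal convolution of $f$ with a quadratic penalty:
$$F(x) \triangleq \min_{z \in \mathbb{R}^p} \left\{ f(z) + \frac{\kappa}{2}\|z - x\|^2 \right\}, \tag{5}$$
where $\kappa$ is a positive regularization parameter. The proximal operator is then the unique minimizer of the problem, that is,
$$p(x) \triangleq \text{prox}_{f/\kappa}(x) = \arg\min_{z \in \mathbb{R}^p} \left\{ f(z) + \frac{\kappa}{2}\|z - x\|^2 \right\}.$$
Note that $p(x)$ does not admit a closed form in general. Therefore, computing it requires solving the sub-problem to high accuracy with some iterative algorithm.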
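One of the few cases where everything is available in closed form is the absolute value in one dimension, which makes it a convenient sanity check. The sketch below assumes the definition $F(x) = \min_z \{ f(z) + \frac{\kappa}{2}(z - x)^2 \}$ with $f(z) = |z|$; the example and the function names are illustrative and not taken from the paper. The envelope is then the well-known Huber function.

```python
def prox_abs(x, kappa):
    """p(x) = argmin_z |z| + (kappa / 2) * (z - x)^2: soft-thresholding at 1/kappa."""
    t = 1.0 / kappa
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def moreau_abs(x, kappa):
    """Moreau envelope of |.|, obtained by plugging the minimizer p(x) back in."""
    p = prox_abs(x, kappa)
    return abs(p) + 0.5 * kappa * (p - x) ** 2


def huber(x, kappa):
    """Huber function: quadratic near zero, linear (shifted) in the tails."""
    t = 1.0 / kappa
    if abs(x) <= t:
        return 0.5 * kappa * x * x
    return abs(x) - 0.5 * t
```

The kink of $|x|$ at zero is replaced by a quadratic of curvature $\kappa$, while the minimizer ($x = 0$) and the minimum value are unchanged.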
2.1 Basic Properties of the Moreau Envelope
The smoothing effect of the Moreau regularization can be characterized by the next proposition (see Lemaréchal and Sagastizábal, 1997, for elementary proofs).
Proposition (Regularization properties of the Moreau Envelope). Given a convex continuous function $f$ and a regularization parameter $\kappa > 0$, consider the Moreau envelope $F$ defined in (5). Then,
1. $F$ is convex and minimizing $f$ and $F$ are equivalent in the sense that
$$\min_{x \in \mathbb{R}^p} F(x) = \min_{x \in \mathbb{R}^p} f(x).$$
Moreover, the solution sets of the two problems above coincide.
2. $F$ is continuously differentiable even when $f$ is not and
$$\nabla F(x) = \kappa (x - p(x)).$$
Moreover, the gradient $\nabla F$ is Lipschitz continuous with constant $L_F = \kappa$.
3. If $f$ is $\mu$-strongly convex, then $F$ is $\mu_F$-strongly convex with $\mu_F = \frac{\mu\kappa}{\mu + \kappa}$.
Interestingly, $F$ is friendly from an optimization point of view as it is convex and differentiable. Besides, $F$ is $\kappa$-smooth with condition number $\frac{\mu + \kappa}{\mu}$ when $f$ is $\mu$-strongly convex. Thus $F$ can be made arbitrarily well conditioned by choosing a small $\kappa$. Since both functions $f$ and $F$ admit the same solutions, a naive approach to minimize a non-smooth function $f$ is to first construct its Moreau envelope and then apply a smooth optimization method on it. As we will see next, Catalyst can be seen as an accelerated gradient descent technique applied to $F$ with inexact gradients.
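As a worked illustration of these properties (an example chosen here for its closed form, not one taken from the paper), consider the isotropic quadratic $f(z) = \frac{\mu}{2}\|z\|^2$ with a smoothing parameter $\kappa > 0$. Writing $p(x)$ for the proximal operator and $F$ for the Moreau envelope, a direct computation gives:

```latex
\begin{aligned}
p(x) &= \arg\min_{z} \Big\{ \tfrac{\mu}{2}\|z\|^2 + \tfrac{\kappa}{2}\|z - x\|^2 \Big\}
      = \frac{\kappa}{\mu + \kappa}\, x, \\
F(x) &= \tfrac{\mu}{2}\|p(x)\|^2 + \tfrac{\kappa}{2}\|p(x) - x\|^2
      = \frac{1}{2}\,\frac{\mu\kappa}{\mu + \kappa}\,\|x\|^2, \\
\nabla F(x) &= \frac{\mu\kappa}{\mu + \kappa}\, x = \kappa\,\big(x - p(x)\big).
\end{aligned}
```

The curvature of $F$ is exactly $\frac{\mu\kappa}{\mu + \kappa}$, which matches the strong convexity constant of the envelope and never exceeds the smoothness constant $\kappa$, whatever the value of $\mu$.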
2.2 A Fresh Look at Catalyst
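First-order methods applied to $F$ provide us with several well-known algorithms.
The proximal point algorithm.
Consider gradient descent steps on $F$ with step size $1/\kappa$, that is, $x_{k+1} = x_k - \frac{1}{\kappa}\nabla F(x_k)$. By rewriting the gradient $\nabla F(x_k)$ as $\kappa(x_k - p(x_k))$, we obtain
$$x_{k+1} = p(x_k) = \arg\min_{z \in \mathbb{R}^p} \left\{ f(z) + \frac{\kappa}{2}\|z - x_k\|^2 \right\},$$
which is the proximal point algorithm (Rockafellar, 1976).
Accelerated proximal point algorithm.
If gradient descent steps on $F$ yield the proximal point algorithm, it is then natural to consider the following sequence
$$x_{k+1} = y_k - \frac{1}{\kappa}\nabla F(y_k) \quad \text{and} \quad y_{k+1} = x_{k+1} + \beta_{k+1}(x_{k+1} - x_k),$$
where $\beta_{k+1}$ is Nesterov’s extrapolation parameter (Nesterov, 2004). Again, by using the closed form of the gradient, this is equivalent to the update
$$x_{k+1} = p(y_k) \quad \text{and} \quad y_{k+1} = x_{k+1} + \beta_{k+1}(x_{k+1} - x_k),$$
which is known as the accelerated proximal point algorithm of Güler (1992).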
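The two schemes can be compared on a toy problem where the proximal operator is exact. The sketch below uses a diagonal quadratic $f(x) = \frac{1}{2}\sum_i a_i x_i^2$, for which $p(x)_i = \kappa x_i / (a_i + \kappa)$. The extrapolation weight $\beta = k/(k+3)$ is a common textbook choice for convex problems and is an assumption of this sketch, not the sequence used by Catalyst; all function names are illustrative.

```python
def prox_quadratic(x, a, kappa):
    """Exact proximal operator p(x) for f(z) = 0.5 * sum(a_i * z_i^2)."""
    return [kappa * xi / (ai + kappa) for xi, ai in zip(x, a)]


def f_quadratic(x, a):
    """Objective value; its minimum is 0, attained at the origin."""
    return 0.5 * sum(ai * xi * xi for xi, ai in zip(x, a))


def proximal_point(x0, a, kappa, num_iters):
    """Plain proximal point iterations x_{k+1} = p(x_k)."""
    x = list(x0)
    for _ in range(num_iters):
        x = prox_quadratic(x, a, kappa)
    return x


def accelerated_proximal_point(x0, a, kappa, num_iters):
    """Iterations x_{k+1} = p(y_k), followed by an extrapolation step."""
    x_prev = list(x0)
    y = list(x0)
    for k in range(num_iters):
        x = prox_quadratic(y, a, kappa)
        beta = k / (k + 3.0)  # assumed extrapolation weight (textbook choice)
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev = x
    return x_prev


if __name__ == "__main__":
    a, x0 = [0.01, 0.001], [1.0, 1.0]
    print(f_quadratic(proximal_point(x0, a, 1.0, 100), a))
    print(f_quadratic(accelerated_proximal_point(x0, a, 1.0, 100), a))
```

On this poorly conditioned example, the extrapolated variant reaches a noticeably lower objective value after the same number of proximal evaluations.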
While these algorithms are conceptually elegant, they suffer from a major drawback in practice: each update requires evaluating the proximal operator $p(x)$. Unless a closed form is available, which is almost never the case, we are not able to evaluate $p(x)$ exactly. Hence an iterative algorithm is required for each evaluation of the proximal operator, which leads to the inner-outer construction (see Algorithm 1). Catalyst can then be interpreted as an accelerated proximal point algorithm that calls an optimization method $\mathcal{M}$ to compute inexact solutions to the sub-problems. The fact that such a strategy could be used to solve non-smooth optimization problems was well-known, but the fact that it could be used for acceleration is more surprising. The main challenge that will be addressed in Section 3 is how to control the complexity of the inner-loop minimization.
2.3 The Approximate Moreau Envelope
Since Catalyst uses inexact gradients of the Moreau envelope, we start with specifying the inexactness criteria.
Inexactness through absolute accuracy.
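Given a proximal center $x$, a smoothing parameter $\kappa$, and an accuracy $\varepsilon > 0$, we denote the set of $\varepsilon$-approximations of the proximal operator $p(x)$ by
$$p^{\varepsilon}(x) \triangleq \left\{ z \in \mathbb{R}^p \ \text{s.t.} \ h(z) - h^* \le \varepsilon \right\} \quad \text{where} \quad h(z) = f(z) + \frac{\kappa}{2}\|x - z\|^2, \tag{C1}$$
and $h^*$ is the minimum function value of $h$.
Checking whether $h(z) - h^* \le \varepsilon$ may be impractical since $h^*$ is unknown in many situations. We may then replace $h^*$ by a lower bound that can be computed more easily. We may use the Fenchel conjugate for instance. Then, given a point $z$ and a lower bound $d(z) \le h^*$, we can guarantee $z \in p^{\varepsilon}(x)$ if $h(z) - d(z) \le \varepsilon$. There are other choices for the lower bounding function $d$ which result from the specific construction of the optimization algorithm. For instance, dual type algorithms such as SDCA (Shalev-Shwartz and Zhang, 2012) or MISO (Mairal, 2015) maintain a lower bound along the iterations, allowing one to check $h(z) - d(z) \le \varepsilon$.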
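When none of the options mentioned above are available, we can use the following fact, based on the notion of gradient mapping; see Section 2.3.2 of (Nesterov, 2004). The intuition comes from the smooth case: when $h$ is smooth, the strong convexity yields
$$h(z) - \frac{1}{2\kappa}\|\nabla h(z)\|^2 \le h^*.$$
In other words, the norm of the gradient provides enough information to assess how far we are from the optimum. From this perspective, the gradient mapping can be seen as an extension of the gradient for the composite case where the objective decomposes as a sum of a smooth part and a non-smooth part (Nesterov, 2004).
Lemma (Checking the absolute accuracy criterion). Consider a proximal center $x$, a smoothing parameter $\kappa$ and an accuracy $\varepsilon > 0$. Consider an objective with the composite form (1) and set the function $h$ as
$$h(z) = f(z) + \frac{\kappa}{2}\|x - z\|^2 = \underbrace{f_0(z) + \frac{\kappa}{2}\|x - z\|^2}_{\triangleq\, h_0(z)} + \psi(z).$$
For any $z \in \mathbb{R}^p$, we define
$$[z]_{\eta} = \text{prox}_{\eta\psi}\left(z - \eta \nabla h_0(z)\right), \quad \text{with} \quad \eta = \frac{1}{\kappa + L}.$$
Then, the gradient mapping of $h$ at $z$ is defined by $\frac{1}{\eta}(z - [z]_{\eta})$ and
$$\frac{1}{\eta}\|z - [z]_{\eta}\| \le \sqrt{2\kappa\varepsilon} \quad \text{implies} \quad [z]_{\eta} \in p^{\varepsilon}(x).$$
The proof is given in Appendix B. The lemma shows that it is sufficient to check the norm of the gradient mapping to ensure condition (C1). However, this requires an additional full gradient step and proximal step at each iteration.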
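The check described by the lemma can be sketched as follows for an $\ell_1$ penalty $\psi = \lambda\|\cdot\|_1$. The sketch assumes the gradient mapping $\frac{1}{\eta}(z - [z]_\eta)$ with $\eta = 1/(\kappa + L)$ and the threshold $\sqrt{2\kappa\varepsilon}$; the function names, the argument `grad_f0` (a user-supplied gradient of the smooth part) and the choice of penalty are illustrative assumptions.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied coordinate-wise."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]


def gradient_mapping(z, x, grad_f0, lam, kappa, L):
    """Return ([z]_eta, norm of the gradient mapping) for
    h(z) = f0(z) + lam * ||z||_1 + (kappa / 2) * ||z - x||^2."""
    eta = 1.0 / (kappa + L)
    # Gradient of the smooth part h0(z) = f0(z) + (kappa / 2) * ||z - x||^2.
    grad_h0 = [g + kappa * (zi - xi) for g, zi, xi in zip(grad_f0(z), z, x)]
    # One proximal gradient step from z.
    z_eta = soft_threshold([zi - eta * gi for zi, gi in zip(z, grad_h0)], eta * lam)
    norm = math.sqrt(sum((zi - wi) ** 2 for zi, wi in zip(z, z_eta))) / eta
    return z_eta, norm


def satisfies_c1(z, x, grad_f0, lam, kappa, L, eps):
    """Sufficient test: True means [z]_eta is an eps-approximation of p(x)."""
    _, norm = gradient_mapping(z, x, grad_f0, lam, kappa, L)
    return norm <= math.sqrt(2.0 * kappa * eps)
```

Note that the point certified by the test is $[z]_\eta$, not $z$ itself, so a caller would typically keep the first return value of `gradient_mapping` as the approximate solution.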
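As soon as we have an approximate proximal operator $z$ in $p^{\varepsilon}(x)$ in hand, we can define an approximate gradient of the Moreau envelope,
$$g(z) \triangleq \kappa (x - z),$$
by mimicking the exact gradient formula $\nabla F(x) = \kappa(x - p(x))$. As a consequence, we may immediately draw a link
$$z \in p^{\varepsilon}(x) \quad \Longrightarrow \quad \|z - p(x)\| \le \sqrt{\frac{2\varepsilon}{\kappa}} \quad \Longleftrightarrow \quad \|g(z) - \nabla F(x)\| \le \sqrt{2\kappa\varepsilon}, \tag{9}$$
where the first implication is a consequence of the strong convexity of $h$ at its minimum $p(x)$. We will then apply the approximate gradient $g(z)$ instead of $\nabla F(x)$ to build the inexact proximal point algorithm. Since the inexactness of the approximate gradient can be bounded by an absolute value $\sqrt{2\kappa\varepsilon}$, we call (C1) the absolute accuracy criterion.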
Relative error criterion.
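Another natural way to bound the gradient approximation is by using a relative error, namely in the form $\|g(z) - \nabla F(x)\| \le \delta' \|\nabla F(x)\|$ for some $\delta' > 0$. This leads us to the following inexactness criterion.
Given a proximal center $x$, a smoothing parameter $\kappa$ and a relative accuracy $\delta$ in $[0, 1)$, we denote the set of $\delta$-relative approximations by
$$g^{\delta}(x) \triangleq \left\{ z \in \mathbb{R}^p \ \text{s.t.} \ h(z) - h^* \le \frac{\delta\kappa}{2}\|x - z\|^2 \right\}. \tag{C2}$$
At first glance, we may interpret the criterion (C2) as (C1) by setting $\varepsilon = \frac{\delta\kappa}{2}\|x - z\|^2$. But we should then notice that the accuracy depends on the point $z$ and is no longer an absolute constant. In other words, the accuracy varies from point to point, and is proportional to the squared distance between $z$ and $x$. First, one may wonder whether $g^{\delta}(x)$ is an empty set. It is not: it is easy to see that $p(x) \in g^{\delta}(x)$ since $h(p(x)) - h^* = 0 \le \frac{\delta\kappa}{2}\|x - p(x)\|^2$. Moreover, by continuity, $g^{\delta}(x)$ is a closed set around $p(x)$. Then, by following similar steps as in (9), we have
$$z \in g^{\delta}(x) \quad \Longrightarrow \quad \|z - p(x)\| \le \sqrt{\delta}\,\|x - z\| \le \sqrt{\delta}\left(\|x - p(x)\| + \|p(x) - z\|\right).$$
Defining the approximate gradient in the same way, $g(z) = \kappa(x - z)$, yields
$$z \in g^{\delta}(x) \quad \Longrightarrow \quad \|g(z) - \nabla F(x)\| \le \delta' \|\nabla F(x)\| \quad \text{with} \quad \delta' = \frac{\sqrt{\delta}}{1 - \sqrt{\delta}},$$
which is the desired relative gradient approximation.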
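For completeness, the chain of inequalities behind this relative bound can be spelled out as follows. The derivation assumes that $h(z) = f(z) + \frac{\kappa}{2}\|x - z\|^2$ is $\kappa$-strongly convex with minimizer $p(x)$ and minimum value $h^*$, that the criterion reads $h(z) - h^* \le \frac{\delta\kappa}{2}\|x - z\|^2$ with $\delta \in [0, 1)$, and that $g(z) = \kappa(x - z)$ and $\nabla F(x) = \kappa(x - p(x))$.

```latex
\begin{aligned}
\tfrac{\kappa}{2}\|z - p(x)\|^2 \;\le\; h(z) - h^* \;\le\; \tfrac{\delta\kappa}{2}\|x - z\|^2
&\;\Longrightarrow\; \|z - p(x)\| \le \sqrt{\delta}\,\big(\|x - p(x)\| + \|p(x) - z\|\big) \\
&\;\Longrightarrow\; \big(1 - \sqrt{\delta}\big)\,\|z - p(x)\| \le \sqrt{\delta}\,\|x - p(x)\| \\
&\;\Longrightarrow\; \|g(z) - \nabla F(x)\| = \kappa\,\|z - p(x)\|
   \le \frac{\sqrt{\delta}}{1 - \sqrt{\delta}}\,\|\nabla F(x)\|.
\end{aligned}
```

The first line combines strong convexity with the triangle inequality; the last one uses $g(z) - \nabla F(x) = \kappa(p(x) - z)$ and $\|\nabla F(x)\| = \kappa\|x - p(x)\|$.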
Finally, the discussion about bounding $h(z) - h^*$ still holds here. In particular, Lemma 2.3 may be used by setting the value $\varepsilon = \frac{\delta\kappa}{2}\|x - z\|^2$. The price to pay is an additional gradient step and an additional proximal step per iteration.
A few remarks on related works.
Inexactness criteria with respect to subgradient norms have been investigated in the past, starting from the pioneering work of Rockafellar (1976) in the context of the inexact proximal point algorithm. Later, different works have been dedicated to more practical inexactness criteria (Auslender, 1987; Correa and Lemaréchal, 1993; Solodov and Svaiter, 2001; Fuentes et al., 2012). These criteria include duality gap, $\varepsilon$-subdifferential, or decrease in terms of function value. Here, we present a more intuitive point of view using the Moreau envelope.
While the proximal point algorithm has caught a lot of attention, very few works have focused on its accelerated variant. The first accelerated proximal point algorithm with inexact gradients was proposed by Güler (1992). Then, Salzo and Villa (2012) proposed a more rigorous convergence analysis, and more inexactness criteria, which are typically stronger than ours. In the same way, a more general inexact oracle framework has been proposed later by Devolder et al. (2014). To achieve the Catalyst acceleration, our main effort was to propose and analyze criteria that allow us to control the complexity for finding approximate solutions of the sub-problems.
3 Catalyst Acceleration
Catalyst is presented in Algorithm 2. As discussed in Section 2, this scheme can be interpreted as an inexact accelerated proximal point algorithm, or equivalently as an accelerated gradient descent method applied to the Moreau envelope of the objective with inexact gradients. Since an overview has already been presented in Section 1.2, we now present important details to obtain acceleration in theory and in practice.
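Since the algorithm box itself is not reproduced here, the following is a minimal sketch of the inner-outer structure under explicit assumptions: the objective is smooth ($\psi = 0$); the inner method is plain gradient descent run for a fixed budget of iterations and warm-started at the current proximal center; and the extrapolation weights follow the recursion $\alpha_k^2 = (1 - \alpha_k)\alpha_{k-1}^2 + q\alpha_k$, $\beta_k = \alpha_{k-1}(1 - \alpha_{k-1})/(\alpha_{k-1}^2 + \alpha_k)$ with $q = \mu/(\mu + \kappa)$, as recalled from the published algorithm. The function name `catalyst` and its signature are illustrative.

```python
import math


def catalyst(grad_f, L, mu, x0, kappa, outer_iters, inner_iters):
    """Sketch of the Catalyst outer loop wrapped around gradient descent.

    grad_f: gradient of an L-smooth, mu-strongly convex function (mu may be 0).
    """
    q = mu / (mu + kappa)
    alpha = math.sqrt(q) if mu > 0 else 1.0  # assumed initialization
    x_prev = list(x0)
    y = list(x0)
    step = 1.0 / (L + kappa)  # h is (L + kappa)-smooth

    for _ in range(outer_iters):
        # Inner loop: approximately minimize h(z) = f(z) + (kappa/2)||z - y||^2.
        z = list(y)  # warm start at the proximal center (assumption)
        for _ in range(inner_iters):
            g = grad_f(z)
            z = [zi - step * (gi + kappa * (zi - yi))
                 for zi, gi, yi in zip(z, g, y)]
        x = z

        # Positive root of alpha_new^2 = (1 - alpha_new) * alpha^2 + q * alpha_new.
        a2 = alpha * alpha
        alpha_new = 0.5 * (q - a2 + math.sqrt((a2 - q) ** 2 + 4.0 * a2))
        beta = alpha * (1.0 - alpha) / (a2 + alpha_new)

        # Outer loop: extrapolation step.
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev, alpha = x, alpha_new

    return x_prev


if __name__ == "__main__":
    # Toy quadratic f(x) = 0.5 * sum(a_i * (x_i - b_i)^2), minimized at b.
    a, b = [1.0, 0.01], [1.0, -2.0]
    grad = lambda x: [ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]
    print(catalyst(grad, L=1.0, mu=0.01, x0=[0.0, 0.0], kappa=0.98,
                   outer_iters=200, inner_iters=10))
```

A fixed inner budget is used only because it is the simplest option to write down; replacing the inner `for` loop by a loop that stops on an accuracy test would give the other variants.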
Requirement: linear convergence of the method $\mathcal{M}$.
One of the main characteristics of Catalyst is to apply the method $\mathcal{M}$ to strongly-convex sub-problems, without requiring strong convexity of the objective $f$. As a consequence, Catalyst provides direct support for convex but non-strongly convex objectives to $\mathcal{M}$, which may be useful to extend the scope of application of techniques that need strong convexity to operate. Yet, Catalyst requires solving these sub-problems efficiently enough in order to control the complexity of the inner-loop computations. When applying $\mathcal{M}$ to minimize a strongly-convex function $h$, we assume that $\mathcal{M}$ is able to produce a sequence of iterates $(z_t)_{t \ge 0}$ such that
$$h(z_t) - h^* \le C_{\mathcal{M}} (1 - \tau_{\mathcal{M}})^t \left(h(z_0) - h^*\right), \tag{12}$$
where $z_0$ is the initial point given to $\mathcal{M}$, and $\tau_{\mathcal{M}}$ in $(0, 1)$, $C_{\mathcal{M}} > 0$ are two constants. In such a case, we say that $\mathcal{M}$ admits a linear convergence rate. The quantity $\tau_{\mathcal{M}}$ controls the speed of convergence for solving the sub-problems: the larger $\tau_{\mathcal{M}}$ is, the faster the convergence. For a given algorithm $\mathcal{M}$, the quantity $\tau_{\mathcal{M}}$ usually depends on the condition number of $h$. For instance, for the proximal gradient method and many first-order algorithms, we simply have $\tau_{\mathcal{M}} = O\left((\mu + \kappa)/(L + \kappa)\right)$, as $h$ is $(\mu + \kappa)$-strongly convex and $(L + \kappa)$-smooth. Catalyst can also be applied to randomized methods $\mathcal{M}$ that satisfy (12) in expectation:
$$\mathbb{E}\left[h(z_t) - h^*\right] \le C_{\mathcal{M}} (1 - \tau_{\mathcal{M}})^t \left(h(z_0) - h^*\right). \tag{13}$$
Then, the complexity results of Section 4 also hold in expectation. This allows us to apply Catalyst to randomized block coordinate descent algorithms (see Richtárik and Takáč, 2014, and references therein), and some incremental algorithms such as SAG, SAGA, or SVRG. For other methods that admit a linear convergence rate in terms of duality gap, such as SDCA and MISO/Finito, Catalyst can also be applied as explained in Appendix C.
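The linear convergence requirement can be checked empirically on a simple instance. The sketch below runs gradient descent with step size $1/L$ on a diagonal quadratic $h(z) = \frac{1}{2}\sum_i a_i z_i^2$ (so that $h^* = 0$) and tests a bound of the form $h(z_t) - h^* \le C (1 - \tau)^t (h(z_0) - h^*)$ with the assumed constants $C = 1$ and $\tau = \mu/L$, where $\mu = \min_i a_i$ and $L = \max_i a_i$. Function names are illustrative.

```python
def gradient_descent_gaps(a, z0, num_iters):
    """Return the optimality gaps h(z_t) - h* along gradient descent (h* = 0)."""
    L = max(a)
    h = lambda z: 0.5 * sum(ai * zi * zi for ai, zi in zip(a, z))
    z = list(z0)
    gaps = [h(z)]
    for _ in range(num_iters):
        z = [zi - (ai * zi) / L for zi, ai in zip(z, a)]  # step size 1 / L
        gaps.append(h(z))
    return gaps


def satisfies_linear_rate(gaps, C, tau, slack=1e-15):
    """Check gaps[t] <= C * (1 - tau)^t * gaps[0] for every recorded iteration."""
    return all(gap <= C * (1.0 - tau) ** t * gaps[0] + slack
               for t, gap in enumerate(gaps))


if __name__ == "__main__":
    a = [0.1, 1.0]
    gaps = gradient_descent_gaps(a, [1.0, 1.0], 50)
    print(satisfies_linear_rate(gaps, C=1.0, tau=min(a) / max(a)))
```

Such a check is only a sanity test on one instance; the constants of a given method come from its own convergence analysis.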
Catalyst may be used with three types of stopping criteria for solving the inner-loop problems. We detail them below, with $q = \mu/(\mu + \kappa)$.
(a) absolute accuracy: we predefine a sequence $(\varepsilon_k)_{k \ge 0}$ of accuracies, and stop the method $\mathcal{M}$ by using the absolute stopping criterion (C1). Our analysis suggests
- if $f$ is $\mu$-strongly convex, $\varepsilon_k = \frac{1}{2}(1 - \rho)^k \left(f(x_0) - f^*\right)$ with $\rho < \sqrt{q}$,
- if $f$ is convex but not strongly convex, $\varepsilon_k = \frac{f(x_0) - f^*}{2(k+2)^{4+\gamma}}$ with $\gamma > 0$.
(b) relative accuracy: to use the relative stopping criterion (C2), our analysis suggests the following choice for the sequence $(\delta_k)_{k \ge 0}$:
- if $f$ is $\mu$-strongly convex, $\delta_k = \frac{\sqrt{q}}{2 - \sqrt{q}}$,
- if $f$ is convex but not strongly convex, $\delta_k = \frac{1}{(k+1)^2}$.
(c) fixed budget: finally, the simplest way of using Catalyst is to fix in advance the number $T_{\mathcal{M}}$ of iterations of the method $\mathcal{M}$ for solving the sub-problems without checking any optimality criterion. Whereas our analysis provides theoretical budgets that are compatible with this strategy, we found them to be pessimistic and impractical. Instead, we propose an aggressive strategy for incremental methods that simply consists of setting $T_{\mathcal{M}} = n$. This setting was called the “one-pass” strategy in the original Catalyst paper (Lin et al., 2015a).
Warm-starts in inner loops.
Besides linear convergence rate, an adequate warm-start strategy needs to be used to guarantee that the sub-problems will be solved in reasonable computational time. The intuition is that the previous solution may still be a good approximation of the current subproblem. Specifically, the following choices arise from the convergence analysis that will be detailed in Section 4.
Consider the minimization of the $(k+1)$-th subproblem $h_{k+1}(z) = f(z) + \frac{\kappa}{2}\|z - y_k\|^2$; we warm start the optimization method $\mathcal{M}$ at $z_0$ as follows:
when using a fixed budget $T_{\mathcal{M}}$, choose the same warm start strategy as in (b).
Note that the earlier conference paper (Lin et al., 2015a) considered the warm start rule $z_0 = x_{k-1}$. That variant is also theoretically validated but it does not perform as well in practice as the ones proposed here.
Optimal balance: choice of parameter $\kappa$.
Finally, the last ingredient is to find an optimal balance between the inner-loop (for solving each sub-problem) and outer-loop computations. To do so, we minimize our global complexity bounds with respect to the value of $\kappa$. As we shall see in Section 5, this strategy turns out to be reasonable in practice. Then, as shown in the theoretical section, the resulting rule of thumb is: we select $\kappa$ by maximizing the ratio $\tau_{\mathcal{M}}/\sqrt{\mu + \kappa}$. We recall that $\tau_{\mathcal{M}}$ characterizes how fast $\mathcal{M}$ solves the sub-problems, according to (12); typically, $\tau_{\mathcal{M}}$ depends on the condition number $\frac{L + \kappa}{\mu + \kappa}$ and is a function of $\kappa$. (Note that the rule for the non strongly convex case, denoted here by $\mu = 0$, slightly differs from Lin et al. (2015a) and results from a tighter complexity analysis.) In Table 2, we illustrate the choice of $\kappa$ for different methods. Note that the resulting rule for incremental methods is very simple for the practitioner: select $\kappa$ such that the condition number $\frac{\bar{L} + \kappa}{\mu + \kappa}$ is of the order of $n$; then, the inner-complexity becomes $\tilde{O}(n \log(1/\varepsilon))$.
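The rule of thumb can be explored numerically. The sketch below maximizes the ratio $\tau_{\mathcal{M}}(\kappa)/\sqrt{\mu + \kappa}$ over a grid of values of $\kappa$. The two rate models are assumptions made for illustration, with constants dropped: $\tau(\kappa) = (\mu + \kappa)/(L + \kappa)$ for gradient descent and $\tau(\kappa) = 1/\left(n + (\bar{L} + \kappa)/(\mu + \kappa)\right)$ for incremental methods. Under these models a short calculation gives the maximizers $\kappa = L - 2\mu$ and $\kappa = (\bar{L} - \mu)/(n + 1) - \mu$ respectively, which the grid search recovers.

```python
import math


def best_kappa(tau, mu, kappa_grid):
    """Return the grid value of kappa maximizing tau(kappa) / sqrt(mu + kappa)."""
    return max(kappa_grid, key=lambda kappa: tau(kappa) / math.sqrt(mu + kappa))


def tau_gradient_descent(L, mu):
    """Assumed rate model for gradient descent on the sub-problem."""
    return lambda kappa: (mu + kappa) / (L + kappa)


def tau_incremental(L_bar, mu, n):
    """Assumed rate model for SAG/SAGA/SVRG-type methods on the sub-problem."""
    return lambda kappa: 1.0 / (n + (L_bar + kappa) / (mu + kappa))


if __name__ == "__main__":
    grid = [0.01 * i for i in range(1, 2001)]  # kappa in (0, 20]
    print(best_kappa(tau_gradient_descent(10.0, 0.1), 0.1, grid))
    print(best_kappa(tau_incremental(1000.0, 0.01, 100), 0.01, grid))
```

In the incremental case, the selected $\kappa$ makes $(\bar{L} + \kappa)/(\mu + \kappa)$ equal to $n + 2$, in line with the statement that this condition number should be of the order of $n$.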
4 Convergence and Complexity Analysis
We now present the complexity analysis of Catalyst. In Section 4.1, we analyze the convergence rate of the outer loop, regardless of the complexity for solving the sub-problems. Then, we analyze the complexity of the inner-loop computations for our various stopping criteria and warm-start strategies in Section 4.2. Section 4.3 combines the outer- and inner-loop analysis to provide the global complexity of Catalyst applied to a given optimization method $\mathcal{M}$.
4.1 Complexity Analysis for the Outer-Loop
The complexity analysis of the first variant of Catalyst we presented in (Lin et al., 2015a) used a tool called “estimate sequence”, which was introduced by Nesterov (2004). Here, we provide a simpler proof. We start with criterion (C1), before extending the result to (C2).
4.1.1 Analysis for Criterion (C1)
The next theorem describes how the errors accumulate in Catalyst. Theorem (Convergence of outer-loop for criterion (C1)). Consider the sequences $(x_k)_{k \ge 0}$ and $(y_k)_{k \ge 0}$ produced by Algorithm 2, assuming that $x_k$ is in $p^{\varepsilon_k}(y_{k-1})$ for all $k \ge 1$. Then,
Before we prove this theorem, we note that by setting $\varepsilon_k = 0$ for all $k$, the speed of convergence of $f(x_k) - f^*$ is driven by the sequence $(A_k)_{k \ge 0}$. Thus we first show the speed of $A_k$ by recalling Lemma 2.2.4 of Nesterov (2004). Lemma (Lemma 2.2.4 of Nesterov 2004). Consider the quantities defined in (14) and the $\alpha_k$’s defined in Algorithm 2. Then, if ,
For non-strongly convex objectives, $A_k$ follows the classical accelerated $O(1/k^2)$ rate of convergence, whereas it achieves a linear convergence rate for the strongly convex case. Intuitively, we are applying an inexact Nesterov method on the Moreau envelope $F$, thus the convergence rate naturally depends on the inverse of its condition number, which is $q = \frac{\mu}{\mu + \kappa}$. We now provide the proof of the theorem below.
We start by defining an approximate sufficient descent condition inspired by a remark of Chambolle and Pock (2015) regarding accelerated gradient descent methods. A related condition was also used by Paquette et al. (2018) in the context of non-convex optimization.
Approximate sufficient descent condition.
Let us define the function
Since is the unique minimizer of , the strong convexity of yields: for any , for all in and any ,
where the -strong convexity of is used in the first inequality; Lemma A is used in the second inequality, and the last one uses the relation . Moreover, when , the last term is positive and we have
If instead , the coefficient is non-negative and we have
In this case, we have
As a result, we have for all values of ,
After expanding the expression of , we then obtain the approximate descent condition
Definition of the Lyapunov function.
We introduce a sequence that will act as a Lyapunov function, with
where is a minimizer of , is a sequence defined by and
and is an auxiliary quantity defined by
The way we introduce these variables allows us to write the following relationship,
which follows from a simple calculation. Then by setting the following relations hold for all .
and also the following one
where we used the convexity of the norm and the fact that . Using the previous relations in (15) with , gives for all ,
Remark that for all ,
and the quadratic terms involving cancel each other. Then, after noticing that for all ,
which allows us to write
We are left, for all , with
Control of the approximation errors for criterion (C1).
Using the fact that