1 Introduction
Convex composite optimization arises in many scientific fields, such as image and signal processing or machine learning. It consists of minimizing a real-valued function composed of two convex terms:

(1)  $\min_{x \in \mathbb{R}^d} \left\{ f(x) \triangleq f_0(x) + \psi(x) \right\},$

where $f_0$ is smooth with Lipschitz continuous derivatives, and $\psi$ is a regularization function which is not necessarily differentiable. A typical example from the signal and image processing literature is the $\ell_1$-norm $\psi(x) = \|x\|_1$, which encourages sparse solutions [19, 40]; composite minimization also encompasses constrained minimization when considering extended-valued indicator functions that may take the value $+\infty$ outside of a convex set and $0$ inside (see [28]). In general, algorithms that are dedicated to composite optimization only require to be able to compute efficiently the proximal operator of $\psi$:

$\mathrm{prox}_{\psi}(y) \triangleq \operatorname*{arg\,min}_{x \in \mathbb{R}^d} \left\{ \psi(x) + \frac{1}{2} \|x - y\|^2 \right\},$

where $\|\cdot\|$ denotes the Euclidean norm. Note that when $\psi$ is an indicator function, the proximal operator corresponds to the simple Euclidean projection.
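For concreteness, here is a minimal Python sketch (our own illustration, not from the paper) of two classical proximal operators: soft-thresholding for a weighted $\ell_1$-norm, and the Euclidean projection for the indicator of a box constraint.

```python
import numpy as np

def prox_l1(y, lam):
    """Proximal operator of lam * ||.||_1: coordinate-wise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_box(y, lo, hi):
    """Proximal operator of the indicator of the box [lo, hi]^d:
    a plain Euclidean projection."""
    return np.clip(y, lo, hi)

y = np.array([1.5, -0.2, 0.7])
print(prox_l1(y, 0.5))         # small coefficients are set exactly to zero
print(prox_box(y, 0.0, 1.0))   # projection onto [0, 1]^3
```

Note how the $\ell_1$ proximal operator produces exact zeros, which is the mechanism behind the sparsity-inducing behavior mentioned above.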
To solve (1), significant efforts have been devoted to (i) extending techniques for smooth optimization to deal with composite terms; (ii) exploiting the underlying structure of the problem—is $f_0$ a finite sum of independent terms? Is $\psi$ separable in different blocks of coordinates?—and (iii) exploiting the local curvature of the smooth term to achieve faster convergence than gradient-based approaches when the dimension $d$ is large. Typically, the first point is well understood in the context of optimal first-order methods, see [2, 48], and the third point is tackled with effective heuristics such as L-BFGS when the problem is smooth [35, 49]. Yet, tackling all these challenges at the same time is difficult, which is precisely the focus of this paper.

In particular, a problem of interest that initially motivated our work is that of empirical risk minimization (ERM); the problem arises in machine learning and can be formulated as the minimization of a composite function $f$:

(2)  $\min_{x \in \mathbb{R}^d} \left\{ f(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \psi(x) \right\},$

where the functions $f_i$ are convex and smooth with Lipschitz continuous derivatives, and $\psi$ is a composite term, possibly nonsmooth. The function $f_i$ measures the fit of some model parameters $x$ to a specific data point indexed by $i$, and $\psi$ is a regularization penalty to prevent overfitting. To exploit the sum structure of $f$, a large number of randomized incremental gradient-based techniques have been proposed, such as SAG [56], SAGA [15], SDCA [58], SVRG [60], Finito [16], or MISO [38]. These approaches access a single gradient $\nabla f_i$ at every iteration instead of the full gradient and achieve lower computational complexity in expectation than optimal first-order methods [2, 48] under a few assumptions. Yet, these methods are unable to exploit the curvature of the objective function; this is indeed also the case for variants that are accelerated in the sense of Nesterov [21, 33, 58].
To tackle (2), dedicated first-order methods are often the default choice in machine learning, but it is also known that standard quasi-Newton approaches can sometimes be surprisingly effective in the smooth case—that is, when $\psi = 0$—see, e.g., [56] for extensive benchmarks. Since the dimension $d$ of the problem is typically very large, "limited-memory" variants of these algorithms, such as L-BFGS, are necessary to achieve the desired scalability [35, 49]. The theoretical guarantees offered by L-BFGS are somewhat limited, meaning that it does not outperform accelerated first-order methods in terms of worst-case convergence rate, and it is also not guaranteed to correctly approximate the Hessian of the objective. Yet, L-BFGS remains one of the greatest practical successes of smooth optimization. Adapting L-BFGS to composite and structured problems, such as the finite sum of functions (2), is therefore of utmost importance nowadays.
For instance, there have been several attempts to develop a proximal quasi-Newton method [10, 31, 54, 62]. These algorithms typically require computing many times the proximal operator of $\psi$ with respect to a variable metric. Quasi-Newton steps were also incorporated as local search steps into accelerated first-order methods to further enhance their numerical performance [24]. More related to our work, L-BFGS is combined with SVRG for minimizing smooth finite sums in [26]. The scope of our approach is broader than the case of SVRG. We present a generic quasi-Newton scheme, applicable to a large class of first-order methods for composite optimization, including other incremental algorithms [15, 16, 38, 56, 58] and block coordinate descent methods [51, 52].
More precisely, the main contribution of this paper is a generic meta-algorithm, called QNing (the letters "Q" and "N" stand for Quasi-Newton), which uses a given optimization method to solve a sequence of auxiliary problems up to some appropriate accuracy, resulting in faster global convergence in practice. QNing falls into the class of inexact proximal point algorithms with variable metric and may be seen as applying a quasi-Newton algorithm with inexact (but accurate enough) gradients to the Moreau-Yosida regularization of the objective. As a result, our approach is (i) generic, as stated previously; (ii) despite the smoothing of the objective, the subproblems that we solve are composite ones, which may lead to exactly sparse iterates when a sparsity-inducing regularization is involved, e.g., the $\ell_1$-norm; (iii) when used with L-BFGS rules, it admits a worst-case linear convergence rate for strongly convex problems similar to that of gradient descent, which is typically the best guarantee obtained for L-BFGS schemes in the literature.
The idea of combining second-order or quasi-Newton methods with Moreau-Yosida regularization is in fact relatively old. It may be traced back to variable metric proximal bundle methods [14, 23, 41], which use BFGS updates on the Moreau-Yosida smoothing of the objective and bundle methods to approximately solve the corresponding subproblems. Our approach revisits this principle with a limited-memory variant (to deal with large dimension $d$), with a simple line search scheme, with warm start strategies for the subproblems, and with a global complexity analysis that is more relevant than convergence rates that do not take into account the cost per iteration.
To demonstrate the effectiveness of our scheme in practice, we evaluate QNing on regularized logistic regression and regularized least-squares, with smooth and nonsmooth regularization penalties such as the Elastic-Net [63]. We use large-scale machine learning datasets and show that QNing performs at least as well as the recently proposed Catalyst [33] and as the classical L-BFGS scheme in all numerical experiments, and significantly outperforms them in many cases.

The paper is organized as follows: Section 2 presents related work on quasi-Newton methods such as L-BFGS; we introduce QNing as well as basic properties of the Moreau-Yosida regularization in Section 3, and we provide a convergence analysis in Section 4; Section 5 is devoted to numerical experiments and Section 6 concludes the paper.
2 Related work and preliminaries
The history of quasi-Newton methods can be traced back to the 1950s [6, 29, 50]. Quasi-Newton methods often lead to significantly faster convergence in practice compared to simpler gradient-based methods for solving smooth optimization problems [55]. Yet, a theoretical analysis of quasi-Newton methods that explains their impressive empirical behavior on a wide range of problems is still an open topic. Here, we briefly review the well-known BFGS algorithm in Section 2.1, its limited-memory variant [49], and a few recent extensions. Then, we present earlier works that combine proximal point and quasi-Newton algorithms in Section 2.3.
2.1 Quasi-Newton methods for smooth optimization
The most popular quasi-Newton method is BFGS, named after its inventors (Broyden-Fletcher-Goldfarb-Shanno), and its limited-memory variant L-BFGS [50]. These approaches will be the workhorses of the QNing meta-algorithm in practice. Consider a smooth convex objective $f$ to be minimized; the BFGS method constructs at iteration $k$ a couple $(x_k, B_k)$ with the following update:

(3)  $x_{k+1} = x_k - \eta_k B_k^{-1} \nabla f(x_k), \qquad B_{k+1} = B_k - \frac{B_k s_k s_k^\top B_k}{s_k^\top B_k s_k} + \frac{y_k y_k^\top}{y_k^\top s_k},$

where $\eta_k$ is a suitable stepsize and

$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k).$

The condition $y_k^\top s_k > 0$ and the positive definiteness of $B_{k+1}$ are guaranteed as soon as $f$ is strongly convex. To determine the stepsize $\eta_k$, Wolfe's line search is a simple choice which provides a linear convergence rate in the worst case. In addition, if the objective is twice differentiable and the Hessian is Lipschitz continuous, the convergence is asymptotically superlinear [50].
The limited-memory variant L-BFGS [49] overcomes the issue of storing $B_k$ for large $d$, by replacing it by another positive definite matrix—say $\tilde{B}_k$—which can be built from a "generating list" of at most $l$
pairs of vectors $(s_i, y_i)$,
along with an initial diagonal matrix $\tilde{B}_k^0$. Formally, $\tilde{B}_k$ can be computed by applying at most $l$ times a recursion similar to (3) involving all pairs of the generating list. Between iteration $k$ and $k+1$, the generating list is incrementally updated, by removing the oldest pair in the list (when the list is full) and adding a new one. What makes the approach appealing is the ability of computing $\tilde{B}_k^{-1} z$ for any vector $z$ with only $O(ld)$ floating-point operations, instead of a naive implementation with matrix inversion. The price to pay is that superlinear convergence becomes out of reach, in contrast to BFGS.

L-BFGS is thus appropriate for high-dimensional problems (when $d$ is large), but it still requires computing the full gradient at each iteration, which may be cumbersome in the large-sum setting (2). This motivated stochastic counterparts of the quasi-Newton method (SQN) [57, 42, 8]. Unfortunately, substituting the full gradient by its stochastic counterpart does not lead to a convergent scheme. Instead, the SQN method [8] uses the product of a subsampled Hessian with $s_k$ to approximate the gradient difference $y_k$. SQN can be complemented by a variance reduction scheme such as SVRG [26, 44].
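As an illustration, the two-loop recursion mentioned above can be sketched as follows in Python; the function name and the use of the standard diagonal scaling $\tilde{B}_k^0 = (s^\top y / y^\top y)\, I$ are our own choices, not prescriptions of the paper.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: returns an approximation of B^{-1} grad built
    from the generating list of (s_i, y_i) pairs, in O(l d) operations."""
    q = grad.copy()
    alphas = []
    rhos = [1.0 / y.dot(s) for s, y in zip(s_list, y_list)]
    # first loop: newest pair to oldest
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * s.dot(q)
        alphas.append(a)
        q -= a * y
    # initial scaling B_0^{-1} = gamma * I with gamma = s'y / y'y (standard)
    s, y = s_list[-1], y_list[-1]
    q *= s.dot(y) / y.dot(y)
    # second loop: oldest pair to newest
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * y.dot(q)
        q += (a - b) * s
    return q

# with a single pair s = y, the implicit metric is the identity
g = np.array([1.0, 2.0])
s = np.array([0.5, -0.3])
print(lbfgs_direction(g, [s], [s]))  # recovers g
```

Only the $l$ stored pairs are touched, which is what makes the per-iteration cost linear in $d$.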
2.2 Quasi-Newton methods for composite optimization
Different approaches have been proposed to extend quasi-Newton methods to composite optimization problems. A first approach consists in minimizing successive quadratic approximations, also called proximal quasi-Newton methods [10, 25, 30, 31, 36, 54]. More concretely, a quadratic approximation $q_k$ is minimized at each iteration:

(4)  $\min_{x \in \mathbb{R}^d} \left\{ q_k(x) \triangleq f_0(x_k) + \nabla f_0(x_k)^\top (x - x_k) + \frac{1}{2} (x - x_k)^\top B_k (x - x_k) + \psi(x) \right\},$

where $B_k$ is a Hessian approximation based on quasi-Newton methods. The minimizer of $q_k$ provides a descent direction, which is subsequently used to build the next iterate. However, since $B_k$ is dense and changes over the iterations, a closed-form solution of (4) is usually not available, and one needs to apply an optimization algorithm to approximately solve (4). Even though local superlinear convergence may be guaranteed under mild assumptions when (4) is solved with "high accuracy" [31], the composite structure naturally leads to choosing a first-order algorithm for solving (4). Then, superlinear complexity becomes out of reach. The global convergence rate of this inexact variant has been analyzed for instance in [54], where a sublinear convergence rate is obtained for convex problems by using a randomized coordinate descent solver applied to (4); later, a linear convergence rate was obtained by [36] for strongly convex problems.
A second approach to extend quasi-Newton methods to composite optimization problems is based on a smoothing technique. More precisely, any quasi-Newton method may be applied to a smoothed version of the objective. For instance, one may use the forward-backward envelope [4, 59] to build forward-backward quasi-Newton methods. The idea is to mimic forward-backward splitting methods and apply quasi-Newton methods instead of gradient methods on top of the envelope. Another well-known smoothing technique is the Moreau-Yosida regularization [43, 61], which leads to the variable metric proximal point algorithm [7, 14, 22, 23]. Our method pursues this line of work by developing a practical inexact variant with global complexity guarantees.
2.3 Combining the proximal point algorithm and quasi-Newton methods
We briefly recall the definition of the Moreau-Yosida regularization and its basic properties.
Definition 1.
Given an objective function $f$ and a smoothing parameter $\kappa > 0$, the Moreau-Yosida regularization of $f$ is defined as the infimal convolution

(5)  $F(x) \triangleq \min_{z \in \mathbb{R}^d} \left\{ f(z) + \frac{\kappa}{2} \|x - z\|^2 \right\}.$

When $f$ is convex, the subproblem defined in (5) is strongly convex, which provides a unique minimizer, called the proximal point of $x$, which we denote by $p(x)$.
Proposition 1 (Basic properties of the Moreau-Yosida regularization).
If $f$ is convex, the Moreau-Yosida regularization $F$ defined in (5) satisfies:

1. Minimizing $f$ and $F$ are equivalent in the sense that
$\min_{x \in \mathbb{R}^d} F(x) = \min_{x \in \mathbb{R}^d} f(x),$
and the solution sets of the two above problems coincide with each other.

2. $F$ is continuously differentiable even when $f$ is not, and
(6)  $\nabla F(x) = \kappa (x - p(x)).$
Moreover the gradient $\nabla F$ is Lipschitz continuous with constant $L_F = \kappa$.

3. $F$ is convex; moreover, when $f$ is $\mu$-strongly convex with respect to the Euclidean norm, $F$ is strongly convex with constant $\mu_F = \frac{\mu \kappa}{\mu + \kappa}$.

Interestingly, $F$ inherits all the convex properties of $f$ and, more importantly, it is always continuously differentiable; see [32] for elementary proofs. Moreover, the condition number of $F$ is given by

(7)  $q_F \triangleq \frac{L_F}{\mu_F} = 1 + \frac{\kappa}{\mu},$

which is driven by the regularization parameter $\kappa$. Naturally, a naive approach for minimizing a possibly nonsmooth function $f$ is to apply an optimization method on $F$, since both functions admit the same solutions. This yields the following well-known algorithms.
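These properties are easy to check numerically. The sketch below is our own one-dimensional illustration with $f = |\cdot|$, whose proximal point is soft-thresholding and whose envelope is a Huber function; it verifies the gradient formula (6) against a finite-difference derivative.

```python
import numpy as np

kappa = 2.0

def f(z):              # nonsmooth objective: absolute value
    return abs(z)

def prox_point(x):     # p(x) = argmin_z |z| + kappa/2 (x - z)^2: soft-thresholding
    return np.sign(x) * max(abs(x) - 1.0 / kappa, 0.0)

def F(x):              # Moreau-Yosida envelope of |.| (a Huber function)
    p = prox_point(x)
    return f(p) + 0.5 * kappa * (x - p) ** 2

x = 0.8
grad = kappa * (x - prox_point(x))          # formula (6)
fd = (F(x + 1e-6) - F(x - 1e-6)) / 2e-6    # finite-difference check
print(grad, fd)  # the two values agree even though f is nonsmooth at 0
```

The envelope is differentiable everywhere, including across the kink of $f$ at the origin, in line with Proposition 1.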
The proximal point algorithm.
Consider gradient descent with step size $1/\kappa$ to minimize $F$:
$x_{k+1} = x_k - \frac{1}{\kappa} \nabla F(x_k).$
By rewriting the gradient $\nabla F(x_k)$ as $\kappa (x_k - p(x_k))$, we obtain the proximal point algorithm [53]:

(8)  $x_{k+1} = p(x_k).$

Accelerated proximal point algorithm.
Since gradient descent on $F$ yields the proximal point algorithm, it is also natural to apply an accelerated first-order method to get faster convergence. To that effect, Nesterov's algorithm [45] uses a two-stage update with a specific extrapolation parameter $\beta_{k+1}$:
$x_{k+1} = y_k - \frac{1}{\kappa} \nabla F(y_k) \quad \text{and} \quad y_{k+1} = x_{k+1} + \beta_{k+1} (x_{k+1} - x_k),$
and, given (6), we obtain that $x_{k+1} = p(y_k)$. This is known as the accelerated proximal point algorithm introduced by Güler [27], which was recently extended in [33, 34].
Variable metric proximal point algorithm.
It is also natural to apply quasi-Newton updates on $F$ instead of plain gradient steps, which yields the variable metric proximal point algorithm:

(9)  $x_{k+1} = x_k - \eta_k B_k^{-1} \nabla F(x_k),$

where $B_k$ is a Hessian approximation of $F$ based on quasi-Newton rules.
Towards an inexact variable metric proximal point algorithm.
Quasi-Newton approaches have been applied after inexact Moreau-Yosida smoothing in various ways [7, 14, 22, 23]. In particular, it is shown in [14] that if the subproblems (5) are solved up to high enough accuracy, then the inexact variable metric proximal point algorithm preserves the superlinear convergence rate. However, the complexity for solving the subproblems with high accuracy is typically not taken into account in such previous work. The main contribution of our paper is to close this gap by providing a global analysis and algorithmic choices that allow the use of a first-order method in the inner loop. More precisely, in the proposed QNing algorithm, we provide (i) a simple line search strategy which guarantees sufficient descent in terms of function value; (ii) a practical stopping criterion for the subproblems; and (iii) several warm start strategies. These three components together yield a global convergence analysis that takes into account the inner-loop complexity.
Explicit vs. implicit gradient methods.
The classical quasi-Newton rule (3) and the variable metric proximal point update (9) are related, since they only differ by the point chosen to evaluate the gradient of $f$. The first rule indeed performs an explicit gradient step $x_{k+1} = x_k - \eta_k B_k^{-1} \nabla f(x_k)$, whereas, when $f$ is smooth, the optimality condition of (5) gives $\nabla F(x_k) = \nabla f(p(x_k))$, so that (9) is equivalent to $x_{k+1} = x_k - \eta_k B_k^{-1} \nabla f(z_k)$, where $z_k = p(x_k)$. The latter is often referred to as an implicit gradient step, since the point $z_k$ is not known in advance and requires solving a subproblem.
In the unrealistic case where $p(x)$ can be obtained at no cost, implicit gradient steps can afford much larger step sizes than explicit ones and are more effective. For instance, when $f$ is strongly convex, it is possible to get arbitrarily close to the optimum in a single step $x_{k+1} = p(x_k)$ by making $\kappa$ arbitrarily small. In practice, however, subproblems are solved only approximately, and whether or not one should prefer explicit or inexact implicit steps is less clear. A small $\kappa$ makes the smoothed function $F$ better conditioned, while a large $\kappa$ is needed to improve the conditioning of the subproblem (5).
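The effect of $\kappa$ on an exact implicit step can be illustrated on a toy quadratic, where the proximal point has a closed form; this is our own example, not from the paper.

```python
import numpy as np

# strongly convex quadratic f(x) = 0.5 x'Ax - b'x, minimized at x* = A^{-1} b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(A, b)
x0 = np.zeros(2)

for kappa in (10.0, 1.0, 0.01):
    # one exact implicit step: x1 = argmin_z f(z) + kappa/2 ||x0 - z||^2,
    # which has the closed form (A + kappa I)^{-1} (b + kappa x0)
    x1 = np.linalg.solve(A + kappa * np.eye(2), b + kappa * x0)
    print(kappa, np.linalg.norm(x1 - x_star))
# the distance to the optimum vanishes as kappa -> 0: a single exact implicit
# step can be made arbitrarily accurate, at the price of a harder subproblem
```

One can check that $x_1 - x^* = (A + \kappa I)^{-1} \kappa (x_0 - x^*)$, whose norm is at most $\frac{\kappa}{\lambda_{\min}(A) + \kappa} \|x_0 - x^*\|$, making the trade-off in $\kappa$ explicit.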
In the composite case, both approaches require approximately solving subproblems, namely (4) and (5), respectively. In the general case, when a generic firstorder method—e.g., proximal gradient descent—is used, our worstcase complexity analysis does not provide a clear winner, and our experiments in Section 5.4 confirm that both approaches perform similarly. However, when it is possible to exploit the specific structure of the subproblems in one case, but not in the other one, the conclusion may differ.
For instance, the implicit strategy applied to a finite sum (2) leads to subproblems that can be solved in $\tilde{O}(n)$ iterations with SVRG [60], SAGA [15] or MISO [38], by using the same choice of $\kappa$ as Catalyst [34]. Assuming that computing a gradient of a function $f_i$ and computing the proximal operator of $\psi$ are both feasible in $O(d)$ floating-point operations, our approach solves each subproblem with enough accuracy in $\tilde{O}(nd)$ operations.^1 On the other hand, we cannot naively apply SVRG to solve the proximal quasi-Newton update (4) at the same cost: (i) assuming that $B_k$ has rank $l$, computing a single gradient of a sum's component will cost $O(ld)$, resulting in an $l$-fold increase per iteration in terms of computational complexity; (ii) the previous iteration-complexity for solving the subproblems would require a condition on the spectrum of $B_k$, forcing the quasi-Newton metric to be potentially more isotropic. For this reason, existing attempts to combine SVRG with quasi-Newton principles have adopted other directions [26, 44].

^1 The notation $\tilde{O}(\cdot)$ hides logarithmic quantities.
3 QNing: a Quasi-Newton meta-algorithm
We now present the QNing method in Algorithm 1, which consists of applying variable metric algorithms on the smoothed objective $F$ with inexact gradients. Each gradient approximation is the result of a minimization problem tackled with the algorithm $\mathcal{M}$, used as a subroutine. The outer loop of the algorithm performs quasi-Newton updates. The method $\mathcal{M}$ can be any algorithm of the user's choice, as long as it enjoys a linear convergence rate for strongly convex problems. More details about the choice of the parameter $\kappa$ and about the inexactness criterion to use will be given next.

(10)  $z_k \approx \operatorname*{arg\,min}_{z \in \mathbb{R}^d} \left\{ h_k(z) \triangleq f(z) + \frac{\kappa}{2} \|x_k - z\|^2 \right\}$

Option 1: stop the method $\mathcal{M}$ when the approximate solution $z$ satisfies

(11)  $h_k(z) - \min_{w \in \mathbb{R}^d} h_k(w) \leq \frac{\kappa}{36} \|z - x_k\|^2.$

Option 2: simply use a predefined constant budget of iterations of $\mathcal{M}$ (for instance one pass over the data).
3.1 The main algorithm
We now discuss the main components of the algorithm and its key features.
Outer loop: variable metric inexact proximal point algorithm.
We apply variable metric algorithms with a simple line search strategy similar to [54] on the Moreau-Yosida regularization. Given a positive definite matrix $H_k$ and a step size $\eta_k$ in $[0, 1]$, the algorithm computes the update

(LS)  $x_{k+1} = x_k - \eta_k H_k g_k - (1 - \eta_k) \frac{1}{\kappa} g_k,$

where $g_k$ is the approximate gradient of $F$ at $x_k$. When $\eta_k = 1$, the update uses the metric $H_k$, and when $\eta_k = 0$, it uses an inexact proximal point update $x_{k+1} = x_k - \frac{1}{\kappa} g_k$. In other words, when the quality of the metric is not good enough, due to the inexactness of the gradients used for its construction, the update is corrected towards that of a simple proximal point update, whose convergence is well understood when the gradients are inexact.
In order to choose the stepsize, we introduce the following descent condition:

(12)  $F_a(x_{k+1}) \leq F_a(x_k) - \frac{1}{4\kappa} \|g_k\|^2,$

where $F_a(x_k)$ denotes the approximate value of $F$ at $x_k$. In our experiments, we observed empirically that the stepsize $\eta_k = 1$ was almost always selected. In practice, we try a few candidate values of $\eta_k$, starting from the largest one and stopping whenever condition (12) is satisfied, which can be shown to be the case for $\eta_k = 0$.
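To make the structure of the scheme concrete, here is a heavily simplified Python sketch of the outer loop; it is our own illustration, with plain gradient descent as the inner solver $\mathcal{M}$, a full BFGS inverse matrix instead of L-BFGS, a fixed inner budget, and an illustrative sufficient-decrease constant.

```python
import numpy as np

def qning(f, grad_f, x0, L, kappa, inner_iters=50, outer_iters=100, tol=1e-8):
    """Simplified QNing sketch for a smooth convex f (illustration only)."""
    d = x0.size
    H = np.eye(d)  # inverse metric H_k (full BFGS here, not L-BFGS)

    def approx_gradient(x):
        # inner solver M: gradient descent on h(z) = f(z) + kappa/2 ||x - z||^2,
        # warm started at w0 = x (smooth case), with a fixed iteration budget
        z = x.copy()
        for _ in range(inner_iters):
            z -= (grad_f(z) + kappa * (z - x)) / (L + kappa)
        g = kappa * (x - z)                              # approximates grad F(x)
        Fa = f(z) + 0.5 * kappa * np.dot(x - z, x - z)   # approximates F(x)
        return g, Fa

    x = x0.copy()
    g, Fa = approx_gradient(x)
    for _ in range(outer_iters):
        if np.linalg.norm(g) < tol:
            break
        for eta in (1.0, 0.5, 0.25, 0.0):          # line search on (LS)
            x_new = x - eta * H @ g - (1 - eta) * g / kappa
            g_new, Fa_new = approx_gradient(x_new)
            if Fa_new <= Fa - 0.25 / kappa * g.dot(g):  # sufficient decrease
                break
        s, y = x_new - x, g_new - g
        if s.dot(y) > 1e-12:                        # BFGS update with skipping
            rho = 1.0 / s.dot(y)
            V = np.eye(d) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)
        x, g, Fa = x_new, g_new, Fa_new
    return x
```

On a least-squares toy problem, the sketch recovers the minimizer to high accuracy, with every gradient of $F$ obtained only approximately through the inner solver.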
Example of variable metric: inexact L-BFGS method.
The L-BFGS rule we consider is the standard one and consists in incrementally updating a generating list of vector pairs $(s_i, y_i)$, which implicitly defines the L-BFGS matrix. We use here the two-loop recursion detailed in [50, Algorithm 7.4] and use skipping steps when the condition $s_k^\top y_k > 0$ is not satisfied, in order to ensure the positive definiteness of the L-BFGS matrix (see [20]).
Inner loop: approximate Moreau envelope.
The inexactness of our scheme comes from the approximation of the Moreau envelope, where a minimization algorithm is used. The ApproxGradient procedure calls the minimization algorithm $\mathcal{M}$ to solve the subproblem (10). When the problem is solved exactly, the procedure returns the exact values $p(x_k)$, $F(x_k)$, and $\nabla F(x_k)$. However, this is infeasible in practice and we can only expect approximate solutions. In particular, a stopping criterion should be specified. We consider the following variants:

we define an adaptive stopping criterion based on function values and stop when the approximate solution satisfies inequality (11). In contrast to a standard stopping criterion where the accuracy is an absolute constant, our stopping criterion is adaptive, since the right-hand side of (11) also depends on the current iterate. More detailed theoretical insights will be given in Section 4. Typically, checking whether or not the criterion is satisfied requires computing a duality gap, as in Catalyst [34];

we use a predefined budget $T_{\mathcal{M}}$ in terms of number of iterations of the method $\mathcal{M}$, where $T_{\mathcal{M}}$ is a constant independent of $k$.
As we will see later in Section 4, when $T_{\mathcal{M}}$ is large enough, criterion (11) is guaranteed. Note that such an adaptive stopping criterion is relatively classical in the literature of inexact gradient-based methods [9].
Requirements on $\mathcal{M}$.
To apply QNing, the optimization method $\mathcal{M}$ needs to have a linear convergence rate for strongly convex problems. More precisely, for any strongly convex objective $h$, the method $\mathcal{M}$ should be able to generate a sequence of iterates $(w_t)_{t \geq 0}$ such that

(13)  $h(w_t) - h^* \leq C_{\mathcal{M}} (1 - \tau_{\mathcal{M}})^t \left( h(w_0) - h^* \right),$

where $w_0$ is the initial point given to $\mathcal{M}$, $h^*$ is the minimum value of $h$, and $C_{\mathcal{M}} > 0$ and $\tau_{\mathcal{M}}$ in $(0, 1)$ are constants. The notion of linearly convergent methods extends naturally to non-deterministic methods where (13) is satisfied in expectation:

(14)  $\mathbb{E}\left[ h(w_t) - h^* \right] \leq C_{\mathcal{M}} (1 - \tau_{\mathcal{M}})^t \left( h(w_0) - h^* \right).$

The linear convergence condition typically holds for many primal gradient-based optimization techniques, including classical full gradient descent methods, block coordinate descent algorithms [47, 52], and variance-reduced incremental algorithms [15, 56, 60]. In particular, our method provides a generic way to combine incremental algorithms with quasi-Newton methods, which is suitable for large-scale optimization problems. For simplicity of presentation, we only consider the deterministic variant (13) in the analysis. However, it is possible to show that the same complexity results still hold for non-deterministic methods in expectation, as discussed in Section 4.5. We emphasize that we do not assume any convergence guarantee of $\mathcal{M}$ on non-strongly convex problems, since we will always apply $\mathcal{M}$ to strongly convex subproblems.
Warm starts for the subproblems.
Using the right starting point for initializing the method $\mathcal{M}$ when solving each subproblem is important to guarantee that the accuracy required for global convergence of the algorithm can be achieved with a constant number of iterations. We show that it is indeed the case with the following choices. Consider the minimization of a subproblem

$\min_{w \in \mathbb{R}^d} \left\{ h(w) \triangleq f(w) + \frac{\kappa}{2} \|x - w\|^2 \right\}.$

Then, our warm start strategy depends on the nature of $f$:

when $f$ is smooth, we use $w_0 = x$;

when $f = f_0 + \psi$ is composite, we use $w_0 = \mathrm{prox}_{\psi/(L+\kappa)}\left( x - \frac{1}{L+\kappa} \nabla f_0(x) \right)$, which corresponds to one proximal gradient step on the subproblem from the prox center $x$.
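The two strategies can be sketched as follows; this is our own illustration, with a hypothetical $\ell_1$ penalty of weight `lam`, where `L` stands for the Lipschitz constant of $\nabla f_0$.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def warm_start(x, grad_f0, L, kappa, lam=None):
    """Initial point for the subproblem min_w f(w) + kappa/2 ||x - w||^2.

    Smooth case (lam is None): start at the prox center itself, w0 = x.
    Composite case f = f0 + lam ||.||_1: take one proximal gradient step on the
    subproblem from x; the smooth part of the subproblem has gradient
    grad_f0(x) at the prox center (the kappa-term vanishes at w = x)."""
    if lam is None:
        return x.copy()
    step = 1.0 / (L + kappa)
    return soft_threshold(x - step * grad_f0(x), lam * step)
```

Note that the composite warm start already sets small coordinates exactly to zero, in line with the discussion of sparse iterates below.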
Handling composite objective functions.
In machine learning or signal processing, convex composite objectives (1) with a nonsmooth penalty $\psi$ are typically formulated to encourage solutions with specific characteristics; in particular, the $\ell_1$-norm is known to provide sparsity. Smoothing techniques [46] may allow us to solve the optimization problem up to some chosen accuracy, but they provide solutions that do not inherit the properties induced by the nonsmoothness of the objective. To illustrate what we mean by this statement, we may consider smoothing the $\ell_1$-norm, leading to a solution vector with small coefficients, but not with exact zeroes. When the goal is to perform model selection—that is, understanding which variables are important to explain a phenomenon—exact sparsity is seen as an asset, and optimization techniques dedicated to composite problems, such as FISTA [2], are often preferred (see [40]).
Then, one might be concerned that our scheme operates on the smoothed objective $F$, leading to iterates that may suffer from the above "non-sparse" issue when $\psi$ is the $\ell_1$-norm. Yet, our approach also provides iterates—the approximate proximal points—that are computed using the original optimization method $\mathcal{M}$ we wish to accelerate. When $\mathcal{M}$ handles composite problems without smoothing, typically when $\mathcal{M}$ is a proximal block-coordinate or incremental method, these iterates may be sparse. For this reason, our theoretical analysis presented in Section 4 studies the convergence of this sequence to the solution.
4 Convergence and complexity analysis
In this section, we study the convergence of the QNing algorithm—that is, the rate of convergence of the objective values along the outer-loop iterates and along the approximate proximal points—as well as the computational complexity due to solving the subproblems (10). We start by stating the main properties of the gradient approximation in Section 4.1. Then, we analyze the convergence of the outer loop algorithm in Section 4.2, and Section 4.3 is devoted to the properties of the line search strategy. After that, we provide the cost of solving the subproblems in Section 4.4 and derive the global complexity analysis in Section 4.5.
4.1 Properties of the gradient approximation
The next lemma is classical and provides approximation guarantees about the quantities returned by the ApproxGradient procedure (Algorithm 2); see [5, 23]. We recall here the proof for completeness.
Lemma 1 (Approximation quality of the gradient approximation).
Consider a vector $x$ in $\mathbb{R}^d$, a positive scalar $\varepsilon$ and an approximate proximal point

$z \approx \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \left\{ h(w) \triangleq f(w) + \frac{\kappa}{2} \|x - w\|^2 \right\}$

such that $h(z) - h^* \leq \varepsilon$, where $h^* = \min_{w \in \mathbb{R}^d} h(w)$. As in Algorithm 2, define $g = \kappa (x - z)$ and $F_a = f(z) + \frac{\kappa}{2} \|x - z\|^2$. Then, the following inequalities hold:

(15)  $\|z - p(x)\|^2 \leq \frac{2\varepsilon}{\kappa},$

(16)  $F(x) \leq F_a \leq F(x) + \varepsilon,$

(17)  $\|g - \nabla F(x)\| \leq \sqrt{2 \kappa \varepsilon}.$

Moreover, $F_a$ is related to $f$ by the following relationship:

(18)  $F_a = f(z) + \frac{1}{2\kappa} \|g\|^2.$
Proof.
This lemma allows us to quantify the quality of the gradient and function value approximations, which is crucial to control the error accumulation of inexact proximal point methods. Moreover, the relation (18) establishes a link between the approximate function value of $F$ and the function value of the original objective $f$; as a consequence, it is possible to deduce the convergence rate of $f$ from the convergence rate of $F$. Finally, the following result is a direct consequence of Lemma 1:
Lemma 2 (Bounding the exact gradient by its approximation).
Consider the same quantities introduced in Lemma 1. Then,

(19)  $\|g\| - \sqrt{2 \kappa \varepsilon} \leq \|\nabla F(x)\| \leq \|g\| + \sqrt{2 \kappa \varepsilon}.$

Proof.
The right-hand side of Eq. (19) follows from the triangle inequality and (17):
$\|\nabla F(x)\| \leq \|g\| + \|\nabla F(x) - g\| \leq \|g\| + \sqrt{2 \kappa \varepsilon}.$
Interchanging $g$ and $\nabla F(x)$ gives the left-hand side inequality. ∎
Corollary 1.
If $\varepsilon \leq \frac{1}{8\kappa} \|g\|^2$, then

(20)  $\frac{1}{2} \|g\| \leq \|\nabla F(x)\| \leq \frac{3}{2} \|g\|.$

This corollary is important since it allows us to replace the unknown exact gradient $\nabla F(x)$ by its approximation $g$, at the cost of a constant factor, as long as the condition $\varepsilon \leq \frac{1}{8\kappa} \|g\|^2$ is satisfied.
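The approximation bounds $\|g - \nabla F(x)\| \leq \sqrt{2\kappa\varepsilon}$ and the resulting two-sided bound on $\|\nabla F(x)\|$ can be checked numerically on the one-dimensional example $f = |\cdot|$; the sketch below is our own illustration.

```python
import numpy as np

kappa, x = 2.0, 0.8
p = np.sign(x) * max(abs(x) - 1.0 / kappa, 0.0)   # exact proximal point of f = |.|
grad_F = kappa * (x - p)                          # exact gradient of the envelope

h = lambda w: abs(w) + 0.5 * kappa * (w - x) ** 2
z = p + 0.05                                      # an inexact proximal point
eps = h(z) - h(p)                                 # its accuracy epsilon
g = kappa * (x - z)                               # approximate gradient

bound = np.sqrt(2 * kappa * eps)
assert abs(g - grad_F) <= bound + 1e-9                         # gradient error
assert abs(g) - bound <= abs(grad_F) <= abs(g) + bound + 1e-9  # two-sided bound
print(grad_F, g, bound)
```

On this example the gradient-error bound is tight, since the subproblem is an exact quadratic around its minimizer in the region explored.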
4.2 Convergence analysis of the outer loop
We are now in shape to establish the convergence of the QNing meta-algorithm, without considering yet the cost of solving the subproblems (10). At iteration $k$, an approximate proximal point is evaluated:

(21)  $z_k \approx \operatorname*{arg\,min}_{z \in \mathbb{R}^d} \left\{ f(z) + \frac{\kappa}{2} \|x_k - z\|^2 \right\},$

with accuracy $\varepsilon_k$ in the sense of Lemma 1.
The following lemma characterizes the expected descent in terms of objective function value when the gradient is approximately computed.
Lemma 3 (Approximate descent property).
This lemma gives us a first intuition about the natural choice of the accuracy $\varepsilon_k$, which should be of the same order as $\|\nabla F(x_k)\|^2$. In particular, if

(23)  $\varepsilon_k \leq \frac{1}{18\kappa} \|\nabla F(x_k)\|^2,$

then we have

(24)  $F(x_{k+1}) \leq F(x_k) - \frac{1}{8\kappa} \|\nabla F(x_k)\|^2,$

which is a typical inequality used for analyzing gradient descent methods. Before presenting the convergence result, we remark that condition (23) cannot be used directly since it requires knowing the exact gradient $\nabla F(x_k)$. A more practical choice consists of replacing it by the approximate gradient.
Lemma 4 (Practical choice of $\varepsilon_k$).
The following condition implies inequality (23):

(25)  $\varepsilon_k \leq \frac{1}{36\kappa} \|g_k\|^2.$

Whereas the gradient $\nabla F(x_k)$ is unknown in practice, we have access to the estimate $g_k$, which allows us to use condition (25). Finally, we obtain the following convergence result for strongly convex problems, which is relatively classical in the literature of inexact gradient methods (see Section 4.1 of [9] for a similar result).
Proposition 2 (Convergence of Algorithm 1, strongly convex objectives).
Proof.
The proof follows directly from (24) and the standard analysis of the gradient descent algorithm for the strongly convex and smooth function $F$, by remarking that $L_F = \kappa$ and $\mu_F = \frac{\mu \kappa}{\mu + \kappa}$. ∎
Corollary 2.
Under the conditions of Proposition 2, we have
(26) 
It is worth pointing out that our analysis establishes a linear convergence rate, whereas one would expect a superlinear convergence rate as for classical variable metric methods with infinite memory. The trade-off lies in the choice of the accuracies $\varepsilon_k$. In order to achieve a superlinear convergence rate, the approximation errors also need to decrease superlinearly, as shown in [14]. However, a fast decreasing sequence $(\varepsilon_k)$ requires an increasing effort in solving the subproblems, which will dominate the global complexity. In other words, the global complexity may become worse even though we achieve faster convergence in the outer loop. This will become clearer when we discuss the inner-loop complexity in Section 4.4.
Next, we show the classical sublinear convergence rate of QNing under a bounded level set condition, when the objective is convex but not necessarily strongly convex.
Proposition 3 (Convergence of Algorithm 1 for convex, but not strongly convex objectives).
Proof.
We defer the proof and the proper definition of the bounded level set assumption to Appendix A. ∎
So far, the analysis has assumed that the line search always produces an iterate that satisfies the descent condition (12), which naturally holds for the step size $\eta_k = 0$. In the next section, we study classical conditions under which a nonzero step size is selected.
4.3 Conditions for nonzero step sizes and termination of the line search
At iteration $k$, the line search is performed on the stepsize $\eta_k$ to find the next iterate

$x_{k+1} = x_k - \eta_k H_k g_k - (1 - \eta_k) \frac{1}{\kappa} g_k,$

such that $x_{k+1}$ satisfies the descent condition (12). Intuitively, when $\eta_k$ goes to zero, $x_{k+1}$ will be close to the classical gradient step $x_k - \frac{1}{\kappa} g_k$, where the descent condition holds. This observation leads us to consider the following sufficient condition for the descent condition (12).
Lemma 5 (A sufficient condition for the descent condition (12)).
If $x_{k+1} = x_k - \eta_k H_k g_k - (1 - \eta_k) \frac{1}{\kappa} g_k$, where $H_k$ is a positive definite matrix with bounded spectrum, $\eta_k$ is small enough, and the subproblems are solved up to accuracies $\varepsilon_k$ satisfying (25), then the descent condition (12) holds, i.e.,

(27)  $F_a(x_{k+1}) \leq F_a(x_k) - \frac{1}{4\kappa} \|g_k\|^2.$

Therefore, a line search strategy consisting of finding the largest $\eta_k$ of the form $\beta^i$, with $i \geq 0$ and $\beta$ in $(0, 1)$, always terminates in a bounded number of iterations if the sequence $(H_k)_{k \geq 0}$ is also bounded, meaning that there exist $0 < \lambda \leq \Lambda$ such that $\lambda I \preceq H_k \preceq \Lambda I$ for any $k$. Note that in practice, we instead consider a small finite set of candidate step sizes, including $\eta = 1$ and $\eta = 0$, which naturally upper-bounds the number of line search iterations by the cardinality of this set.
Proof of Lemma 5.
First, we recall that and
(28) 
which means the step size naturally satisfies the descent condition. Considering now , we have,
and we are going to bound the two error terms and by some factors of .
By denoting with , we obtain by construction
(29) 
where the last inequality comes from Corollary 1. Moreover,
(30) 
Thus,
(31) 
Second, by the smoothness of , we have
Since , we have . Furthermore, similarly to (30), we can bound
Therefore,
(32) 
Combining (31) and (32) yields
(33) 
When and , we have . This together with (28) completes the proof. ∎
In practice, the unit stepsize is very often sufficient for the descent condition to hold, as empirically studied in Appendix C.2. The following result shows that, under a specific assumption on the Moreau-Yosida envelope $F$, the unit stepsize is indeed always selected when the iterates are close to the optimum. The condition, called the Dennis-Moré criterion [17], is classical in the literature of quasi-Newton methods, even though we cannot formally show that it holds for the Moreau-Yosida envelope $F$. Indeed, the criterion requires $F$ to be twice continuously differentiable, which is not true in general, see [32]. Therefore, the lemma below should not be seen as a formal explanation for the choice of unit step size, which we often observe in practice, but simply as a reasonable condition that leads to this choice.
Lemma 6 (A sufficient condition for unit stepsize).
We remark that the Dennis-Moré criterion we use here is slightly different from the standard one, since the criterion is based on the approximate gradients $g_k$. The proof is close to that of similar lemmas appearing in the proximal quasi-Newton literature [31], and is relegated to the appendix. Interestingly, this proof also suggests that a stronger stopping criterion, with accuracies $\varepsilon_k$ decreasing sufficiently fast, could lead to superlinear convergence. However, such a choice of $\varepsilon_k$ would significantly increase the complexity for solving the subproblems, and overall degrade the global complexity.
4.4 Complexity analysis of the inner loop
In this section, we evaluate the complexity of solving the subproblems (10) up to the desired accuracy using a linearly convergent method $\mathcal{M}$. Our main result is that all subproblems can be solved in a constant number of iterations (in expectation if the method $\mathcal{M}$ is non-deterministic) using a warm start strategy.
Let us consider the subproblem with an arbitrary prox center $x$,

(34)  $\min_{w \in \mathbb{R}^d} \left\{ h(w) \triangleq f(w) + \frac{\kappa}{2} \|x - w\|^2 \right\},$

and denote by $h^*$ its minimum value. The number of iterations needed is determined by the ratio between the initialization gap $h(w_0) - h^*$ and the desired accuracy. We are going to bound this ratio by a constant factor.
Lemma 7 (Warm start for primal methods, smooth case).
If $f$ is differentiable with $L$-Lipschitz continuous gradients, we initialize the method $\mathcal{M}$ with $w_0 = x$. Then, we have the guarantee that

(35)  $h(w_0) - h^* \leq \frac{L + \kappa}{2 \kappa^2} \|\nabla F(x)\|^2.$

Proof.
Denote by $w^* = p(x)$ the minimizer of $h$. Then, we have the optimality condition $\nabla h(w^*) = 0$. As a result, since $h$ is $(L + \kappa)$-smooth,
$h(w_0) - h^* \leq \frac{L + \kappa}{2} \|x - w^*\|^2 = \frac{L + \kappa}{2} \|x - p(x)\|^2 = \frac{L + \kappa}{2 \kappa^2} \|\nabla F(x)\|^2.$ ∎
The inequality in the proof of Lemma 7 relies on the smoothness of $h$, which does not hold for composite problems. The next lemma addresses this issue.
Lemma 8 (Warm start for primal methods, composite case).
Consider the composite subproblem $\min_{w \in \mathbb{R}^d} \{ h(w) \triangleq h_0(w) + \psi(w) \}$, where $h_0(w) = f_0(w) + \frac{\kappa}{2} \|x - w\|^2$ is smooth with $(L + \kappa)$-Lipschitz continuous gradients. By initializing the method $\mathcal{M}$ with

(36)  $w_0 = \mathrm{prox}_{\psi/(L+\kappa)}\left( x - \frac{1}{L + \kappa} \nabla h_0(x) \right),$

we have,
$h(w_0) - h^* \leq \frac{L + \kappa}{2 \kappa^2} \|\nabla F(x)\|^2.$
Proof.
We use the inequality corresponding to Lemma 2.3 in [2]: for any $w$,

(37)  $h(w) - h(w_0) \geq \frac{L'}{2} \|w_0 - x\|^2 + L' \langle x - w, w_0 - x \rangle,$

with $L' = L + \kappa$. Then, we apply this inequality to $w = w^* = p(x)$, and obtain
$h(w_0) - h^* \leq -\frac{L'}{2} \|w_0 - x\|^2 - L' \langle x - p(x), w_0 - x \rangle \leq \frac{L'}{2} \|x - p(x)\|^2 = \frac{L + \kappa}{2 \kappa^2} \|\nabla F(x)\|^2,$
where the second inequality uses $-\frac{1}{2}\|t\|^2 - \langle a, t \rangle \leq \frac{1}{2}\|a\|^2$ with $t = w_0 - x$ and $a = x - p(x)$. ∎
We get an initialization of the same quality in the composite case as in the smooth case by performing an additional proximal step. It is important to remark that the above analysis does not require strong convexity of $f$, which allows us to derive the desired inner-loop complexity.
Proposition 4 (Inner-loop complexity for Algorithm 1).
Proof.
Consider iteration $k$, where we apply $\mathcal{M}$ to approximate the proximal mapping, warm-started according to the previous strategies. With the given accuracy $\varepsilon_k$ (which we abbreviate by $\varepsilon$), we have