 # A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization

We propose a generic approach to accelerate gradient-based optimization algorithms with quasi-Newton principles. The proposed scheme, called QuickeNing, can be applied to incremental first-order methods such as stochastic variance-reduced gradient (SVRG) or incremental surrogate optimization (MISO). It is also compatible with composite objectives, meaning that it has the ability to provide exactly sparse solutions when the objective involves a sparsity-inducing regularization. QuickeNing relies on limited-memory BFGS rules, making it appropriate for solving high-dimensional optimization problems. Besides, it enjoys a worst-case linear convergence rate for strongly convex problems. We present experimental results where QuickeNing gives significant improvements over competing methods for solving large-scale high-dimensional machine learning problems.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Convex composite optimization arises in many scientific fields, such as image and signal processing or machine learning. It consists of minimizing a real-valued function composed of two convex terms:

 minx∈Rd{f(x)≜f0(x)+ψ(x)}, (1)

where  is smooth with Lipschitz continuous derivatives, and is a regularization function which is not necessarily differentiable. A typical example from the signal and image processing literature is the -norm , which encourages sparse solutions [19, 40]; composite minimization also encompasses constrained minimization when considering extended-valued indicator functions  that may take the value outside of a convex set  and  inside (see ). In general, algorithms that are dedicated to composite optimization only require to be able to compute efficiently the proximal operator of :

 pψ(y)≜argminx∈Rd{ψ(x)+12∥x−y∥2},

where denotes the Euclidean norm. Note that when is an indicator function, the proximal operator corresponds to the simple Euclidean projection.

To solve (1), significant efforts have been devoted to (i) extending techniques for smooth optimization to deal with composite terms; (ii) exploiting the underlying structure of the problem—is a finite sum of independent terms? Is  separable in different blocks of coordinates? (iii) exploiting the local curvature of the smooth term  to achieve faster convergence than gradient-based approaches when the dimension  is large. Typically, the first point is well understood in the context of optimal first-order methods, see [2, 48]

, and the third point is tackled with effective heuristics such as L-BFGS when the problem is smooth

[35, 49]. Yet, tackling all these challenges at the same time is difficult, which is precisely the focus of this paper.

In particular, a problem of interest that initially motivated our work is that of empirical risk minimization (ERM); the problem arises in machine learning and can be formulated as the minimization of a composite function :

 minx∈Rd{f(x)≜1nn∑i=1fi(x)+ψ(x)}, (2)

where the functions are convex and smooth with Lipschitz continuous derivatives, and is a composite term, possibly non-smooth. The function measures the fit of some model parameters  to a specific data point indexed by , and is a regularization penalty to prevent over-fitting. To exploit the sum structure of , a large number of randomized incremental gradient-based techniques have been proposed, such as SAG , SAGA , SDCA , SVRG , Finito , or MISO . These approaches access a single gradient  at every iteration instead of the full gradient and achieve lower computational complexity in expectation than optimal first-order methods [2, 48] under a few assumptions. Yet, these methods are unable to exploit the curvature of the objective function; this is indeed also the case for variants that are accelerated in the sense of Nesterov [21, 33, 58].

To tackle (2), dedicated first-order methods are often the default choice in machine learning, but it is also known that standard Quasi-Newton approaches can sometimes be surprisingly effective in the smooth case—that is when , see, e.g.,  for extensive benchmarks. Since the dimension of the problem  is typically very large (), “limited memory” variants of these algorithms, such as L-BFGS, are necessary to achieve the desired scalability [35, 49]. The theoretical guarantees offered by L-BFGS are somewhat limited, meaning that it does not outperform accelerated first-order methods in terms of worst-case convergence rate, and also it is not guaranteed to correctly approximate the Hessian of the objective. Yet, L-BFGS remains one of the greatest practical success of smooth optimization. Adapting L-BFGS to composite and structured problems, such as the finite sum of functions (2), is of utmost importance nowadays.

For instance, there have been several attempts to develop a proximal Quasi-Newton method [10, 31, 54, 62]. These algorithms typically require computing many times the proximal operator of  with respect to a variable metric. Quasi-Newton steps were also incorporated as local search steps into accelerated first-order methods to further enhance their numerical performance . More related to our work, L-BFGS is combined with SVRG for minimizing smooth finite sums in . The scope of our approach is broader beyond the case of SVRG. We present a generic Quasi-Newton scheme, applicable to a large-class of first-order methods for composite optimization, including other incremental algorithms [15, 16, 38, 56, 58] and block coordinate descent methods [51, 52]

More precisely, the main contribution of this paper is a generic meta-algorithm, called QNing (the letters “Q” and “N” stand for Quasi-Newton), which uses a given optimization method to solve a sequence of auxiliary problems up to some appropriate accuracy, resulting in faster global convergence in practice. QNing falls into the class of inexact proximal point algorithms with variable metric and may be seen as applying a Quasi-Newton algorithm with inexact (but accurate enough) gradients to the Moreau-Yosida regularization of the objective. As a result, our approach is (i) generic, as stated previously; (ii) despite the smoothing of the objective, the sub-problems that we solve are composite ones, which may lead to exactly sparse iterates when a sparsity-inducing regularization is involved, e.g., the -norm; (iii) when used with L-BFGS rules, it admits a worst-case linear convergence rate for strongly convex problems similar to that of gradient descent, which is typically the best guarantees obtained for L-BFGS schemes in the literature.

The idea of combining second-order or quasi-Newton methods with Moreau-Yosida regularization is in fact relatively old. It may be traced back to variable metric proximal bundle methods [14, 23, 41], which use BFGS updates on the Moreau-Yosida smoothing of the objective and bundle methods to approximately solve the corresponding sub-problems. Our approach revisits this principle with a limited-memory variant (to deal with large dimension ), with a simple line search scheme, and with warm start strategies for the sub-problems with a global complexity analysis that is more relevant than convergence rates that do not take into account the cost per iteration.

To demonstrate the effectiveness of our scheme in practice, we evaluate QNing on regularized logistic regression and regularized least-squares, with smooth and nonsmooth regularization penalities such as the Elastic-Net

. We use large-scale machine learning datasets and show that QNing performs at least as well as the recently proposed Catalyst  and as the classical L-BFGS scheme in all numerical experiments, and significantly outperforms them in many cases.

The paper is organized as follows: Section 2 presents related work on Quasi-Newton methods such as L-BFGS; we introduce QNing as well as basic properties of the Moreau-Yosida regularization in Section 3, and we provide a convergence analysis in Section 4; Section 5 is devoted to numerical experiments and Section 6 concludes the paper.

## 2 Related work and preliminaries

The history of quasi-Newton methods can be traced back to the 1950’s [6, 29, 50]. Quasi-Newton methods often lead to significantly faster convergence in practice compared to simpler gradient-based methods for solving smooth optimization problems . Yet, a theoretical analysis of quasi-Newton methods that explains their impressive empirical behavior on a wide range of problems is still an open topic. Here, we briefly review the well-known BFGS algorithm in Section 2.1, its limited memory variant  , and a few recent extensions. Then, we present earlier works that combine proximal point and Quasi-Newton algorithms in Section 2.3.

### 2.1 Quasi-Newton methods for smooth optimization

The most popular Quasi-Newton method is BFGS, named after its inventors (Broyden-Fletcher-Goldfarb-Shanno), and its limited variant L-BFGS . These approaches will be the workhorses of the QNing meta-algorithm in practice. Consider a smooth convex objective to be minimized, the BFGS method constructs at iteration  a couple with the following update:

 xk+1=xk−αkB−1k∇f(xk) and Bk+1=Bk−Bksks⊤kBks⊤kBksk+yky⊤ky⊤ksk, (3)

where is a suitable stepsize and

 sk=xk+1−xk,yk=∇f(xk+1)−∇f(xk).

The condition and the positive definiteness of  are guaranteed as soon as is strongly convex. To determine the stepsize , Wolfe’s line-search is a simple choice which provides linear convergence rate in the worst case. In addition, if the objective is twice differentiable and the Hessian is Lipschitz continuous, the convergence is asymptotically superlinear .

The limited memory variant L-BFGS  overcomes the issue of storing for large , by replacing it by another positive definite matrix—say —which can be built from a “generating list” of at most

pairs of vectors

along with an initial diagonal matrix . Formally, can be computed by applying at most  times a recursion similar to (3) involving all pairs of the generating list. Between iteration  and , the generating list is incrementally updated, by removing the oldest pair in the list (when ) and adding a new one. What makes the approach appealing is the ability of computing for any vector with only floating point operations instead of for a naive implementation with matrix inversion. The price to pay is that superlinear convergence becomes out of reach in contrast to BFGS.

L-BFGS is thus appropriate for high-dimensional problems (when is large), but it still requires computing the full gradient at each iteration, which may be cumbersome in the large sum setting (2). This motivated stochastic counterparts of the Quasi-Newton method (SQN) [57, 42, 8]. Unfortunately substituting the full gradient by its stochastic counterpart does not lead to a convergent scheme. Instead, the SQN method  uses a product of a sub-sampled Hessian and to approximate . SQN can be complemented by a variance reduction scheme such as SVRG [26, 44].

### 2.2 Quasi-Newton methods for composite optimization

Different approaches have been proposed to extend Quasi-Newton methods to composite optimization problems. A first approach consists in minimizing successive quadratic approximations, also called proximal quasi-Newton methods [10, 25, 30, 31, 36, 54]. More concretely, a quadratic approximation is minimized at each iteration:

 qk(x)≜f0(xk)+⟨∇f0(xk),x−xk⟩+12(x−xk)TBk(x−xk)+ψ(x), (4)

where is a Hessian approximation based on quasi-Newton methods. The minimizer of provides a descent direction, which is subsequently used to build the next iterate. However, since is dense and changes over the iterations, a closed form solution of (4) is usually not available, and one needs to apply an optimization algorithm to approximately solve (4). Even though local superlinear convergence may be guaranteed under mild assumptions when (4) is solved with “high accuracy” , the composite structure naturally leads to choosing a first-order algorithm for solving (4). Then, superlinear complexity becomes out of reach. The global convergence rate of this inexact variant has been for instance analyzed in , where a sublinear convergence rate is obtained for convex problems by using a randomized coordinate descent solver applied to (4); later, a linear convergence rate was obtained by  for strongly convex problems.

A second approach to extend Quasi-Newton methods to composite optimization problems is based on a smoothing technique. More precisely, any Quasi-Newton method may be applied to a smoothed version of the objective. For instance, one may use the forward-backward envelope [4, 59] to build forward-backward quasi Newton methods. The idea is to mimic forward-backward splitting methods and apply quasi-Newton methods instead of gradient methods on top of the envelope. Another well known smoothing technique is the Moreau-Yosida regularization [43, 61], which leads to the variable metric proximal point algorithm [7, 14, 22, 23]. Our method pursues this line of work by developing a practical inexact variant with global complexity guarantees.

### 2.3 Combining the proximal point algorithm and Quasi-Newton methods

We briefly recall the definition of the Moreau-Yosida regularization and its basic properties.

###### Definition 1.

Given an objective function  and a smoothing parameter , the Moreau-Yosida regularization of is defined as the infimal convolution

 F(x)≜minz∈Rd{f(z)+κ2∥z−x∥2}. (5)

When is convex, the sub-problem defined in (5) is strongly convex which provides an unique minimizer, called the proximal point of , which we denote by .

###### Proposition 1 (Basic properties of the Moreau-Yosida regularization).

If is convex, the Moreau-Yosida regularization  defined in (5) satisfies

1. Minimizing and are equivalent in the sense that

 minx∈RdF(x)=minx∈Rdf(x),

and the solution set of the two above problems coincide with each other.

2. is continuously differentiable even when is not and

 ∇F(x)=κ(x−p(x)). (6)

Moreover the gradient is Lipschitz continuous with constant .

3. is convex; moreover, when is -strongly convex with respect to the Euclidean norm, is -strongly convex with

Interestingly, inherits all the convex properties of and more importantly it is always continuously differentiable, see  for elementary proofs. Moreover, the condition number of is given by

 q=LFμF=μ+κμ, (7)

which is driven by the regularization parameter . Naturally, a naive approach for minimizing a possibly non-smooth function  is to apply an optimization method on  since both functions admit the same solutions. This yields the following well-known algorithms.

#### The proximal point algorithm.

Consider gradient descent with step size to minimize :

 xk+1=xk−1κ∇F(xk).

By rewriting the gradient as , we obtain the proximal point algorithm :

 xk+1=p(xk)=argminz∈Rd{f(z)+κ2∥z−xk∥2}. (8)

#### Accelerated proximal point algorithm.

Since gradient descent on  yields the proximal point algorithm, it is also natural to apply an accelerated first-order method to get faster convergence. To that effect, Nesterov’s algorithm  uses a two-stage update, using a specific extrapolation parameter :

 xk+1=yk−1κ∇F(yk) and yk+1=xk+1+βk+1(xk+1−xk),

and, given (6), we obtain that . This is known as the accelerated proximal point algorithm introduced by Güler , which was recently extended in [33, 34].

#### Variable metric proximal point algorithm.

One can also apply Quasi-Newton methods on top of the Moreau-Yosida regularization, which yields

 xk+1=xk−αkB−1k∇F(xk), (9)

where  is the Hessian approximation based on Quasi-Newton methods. This is known as the variable metric proximal point algorithm [7, 14, 22, 23].

#### Towards an inexact variable metric proximal point algorithm.

Quasi-Newton approaches have been applied after inexact Moreau-Yosida smoothing in various ways [7, 14, 22, 23]. In particular, it is shown in  that if the sub-problems (5) are solved up to high enough accuracy, then the inexact variable metric proximal point algorithm preserves the superlinear convergence rate. However, the complexity for solving the sub-problems with high accuracy is typically not taken into account in such previous work. The main contribution of our paper is to close this gap by providing a global analysis and algorithmic choices allowing to use a first-order method in the inner-loop. More precisely, in the proposed QNing algorithm, we provide i) a simple line-search strategy which guarantees sufficient descent in terms of function value; ii) a practical stopping criterion for the sub-problems; iii) several warm-start strategies. These three components together yields the global convergence analysis which takes into account the inner-loop complexity.

#### Explicit vs. implicit gradient methods.

The classical Quasi-Newton rule (3) and the variable metric proximal point update (9) are related since they only differ by the point chosen to evaluate the gradient of . The first rule performs indeed an explicit gradient step , whereas it is possible to show that (9) is equivalent to , where . The latter is often referred to as an implicit gradient step, since the point is not known in advance and requires solving a sub-problem.

In the unrealistic case where can be obtained at no cost, implicit gradient steps can afford much larger step sizes than explicit ones and are more effective. For instance, when is strongly convex, by choosing , it is possible to get arbitrarily close to the optimum in a single gradient step by making arbitrarily small. In practice, however, sub-problems are solved only approximately, and whether or not one should prefer explicit or inexact implicit steps is less clear. A small makes the smoothed function  better conditioned, while a large is needed to improve the conditioning of the sub-problem (5).

In the composite case, both approaches require approximately solving sub-problems, namely (4) and (5), respectively. In the general case, when a generic first-order method—e.g., proximal gradient descent—is used, our worst-case complexity analysis does not provide a clear winner, and our experiments in Section 5.4 confirm that both approaches perform similarly. However, when it is possible to exploit the specific structure of the sub-problems in one case, but not in the other one, the conclusion may differ.

For instance, the implicit strategy applied to a finite sum (2) leads to sub-problems that can be solved in iterations with SVRG , SAGA  or MISO , by using the same choice as Catalyst . Assuming that computing a gradient of a function and computing the proximal operator of  are both feasible in floating point operations, our approach solves each sub-problem with enough accuracy in operations.111 The notation hides logarithmic quantities. On the other hand, we cannot naively apply SVRG to solve the proximal Quasi-Newton update (4) at the same cost: (i) assuming that has rank , computing a single gradient of a sum’s component will cost , resulting in -fold increase per iteration in terms of computational complexity; (ii) the previous iteration-complexity for solving the sub-problems would require the condition , forcing the Quasi-Newton metric to be potentially more isotropic. For this reason, existing attempts to combine SVRG with Quasi-Newton principles have adopted other directions [26, 44].

## 3 QNing: a Quasi-Newton meta-algorithm

We now present the QNing method in Algorithm 1, which consists of applying variable metric algorithms on the smoothed objective  with inexact gradients. Each gradient approximation is the result of a minimization problem tackled with the algorithm , used as a sub-routine. The outer loop of the algorithm performs quasi-Newton updates. The method  can be any algorithm of the user’s choice, as long as it enjoys linear convergence rate for strongly convex problems. More details about the choice of the parameter  and about the inexactness criterion to use will be given next.

### 3.1 The main algorithm

We now discuss the main algorithm components and its main features.

#### Outer-loop: variable metric inexact proximal point algorithm.

We apply variable metric algorithms with a simple line search strategy similar to  on the Moreau-Yosida regularization. Given a positive definite matrix and a step size in , the algorithm computes the update

 xk+1=xk−(ηkHk+(1−ηk)H0)gk, (LS)

where . When , the update uses the metric , and when , it uses an inexact proximal point update . In other words, when the quality of the metric is not good enough, due to the inexactness of the gradients used for its construction, the update is corrected towards that of a simple proximal point update, whose convergence is well understood when the gradients are inexact.

In order to choose the stepsize, we introduce the following descent condition,

 Fk+1≤Fk−14κ∥gk∥2. (12)

In our experiments, we observed empirically that the stepsize was almost always selected. In practice, we try the values in starting from the largest one and stopping whenever condition (12) is satisfied, which can be shown to be the case for .

#### Example of variable metric: inexact L-BFGS method.

The L-BFGS rule we consider is the standard one and consists in updating incrementally a generating list of vectors , which implicitly defines the L-BFGS matrix. We use here the two-loop recursion detailed in [50, Algorithm 7.4] and use skipping steps when the condition is not satisfied, in order to ensure the positive-definiteness of the L-BFGS matrix  (see ).

#### Inner-loop: approximate Moreau-envelope.

The inexactness of our scheme comes from the approximation of the Moreau-envelope where a minimization algorithm is used. The procedure calls the minimization algorithm to minimize the sub-problem (10). When the problem is solved exactly, the function returns the exact values , , and . However, this is infeasible in practice and we can only expect approximate solutions. In particular, a stopping criterion should be specified. We consider the following variants:

• we define an adaptive stopping criterion based on function values and stop  when the approximate solution satisfies the inequality (11). In contrast to standard stopping criterion where the accuracy is an absolute constant, our stopping criterion is adaptive since the righthand side of (11) also depends on the current iterate . More detailed theoretical insights will be given in Section 4. Typically, checking whether or not the criterion is satisfied requires computing a duality gap, as in Catalyst .

• using a pre-defined budget in terms of number of iterations of the method , where is a constant independent of .

As we will see later in Section 4, when is large enough, criterion (11) is guaranteed. Note that such an adaptive stopping criterion is relatively classical in the literature of inexact gradient-based methods .

#### Requirements on M.

To apply QNing, the optimization method needs to have linear convergence rates for strongly-convex problems. More precisely, for any any strongly-convex objective , the method  should be able to generate a sequence of iterates such that

 h(wt)−h∗≤CM(1−τM)t(h(w0)−h∗)  for some constants  CM>0 and 1>τM>0, (13)

where  is the initial point given to . The notion of linearly convergent methods extends naturally to non-deterministic methods where (13) is satisfied in expectation:

 E[h(wt)−h∗]≤CM(1−τM)t(h(w0)−h∗). (14)

The linear convergence condition typically holds for many primal gradient-based optimization techniques, including classical full gradient descent methods, block-coordinate descent algorithms [47, 52], or variance reduced incremental algorithms [15, 56, 60]. In particular, our method provides a generic way to combine incremental algorithms with Quasi-Newton methods which are suitable for large scale optimization problems. For the simplicity of the presentation, we only consider the deterministic variant (13) in the analysis. However, it is possible to show that the same complexity results still hold for non-deterministic methods in expectation, as discussed in Section 4.5. We emphasize that we do not assume any convergence guarantee of  on non-strongly convex problems since we will always apply to strongly convex sub-problems.

#### Warm starts for the sub-problems.

Using the right starting point for initializing the method  when solving each sub-problem is important to guarantee that the accuracy to ensure global convergence of the algorithm can be achieved with a constant number of iterations. We show that it is indeed the case with the following choices: Consider the minimization of a sub-problem

 minw∈Rd{h(w)≜f(w)+κ2∥w−x∥2}.

Then, our warm start strategy depends on the nature of :

• when is smooth, we use ;

• when is composite, we use

 w0=argminw∈Rd{f0(x)+⟨∇f0(x),w−x⟩+L+κ2∥w−x∥2+ψ(w)}.

#### Handling composite objective functions.

In machine learning or signal processing, convex composite objectives (1) with a non-smooth penalty  are typically formulated to encourage solutions with specific characteristics; in particular, the -norm is known to provide sparsity. Smoothing techniques  may allow us to solve the optimization problem up to some chosen accuracy, but they provide solutions that do not inherit the properties induced by the non-smoothness of the objective. To illustrate what we mean by this statement, we may consider smoothing the -norm, leading to a solution vector with small coefficients, but not with exact zeroes. When the goal is to perform model selection—that is, understanding which variables are important to explain a phenomenon, exact sparsity is seen as an asset, and optimization techniques dedicated to composite problems such as FISTA  are often preferred (see ).

Then, one might be concerned that our scheme operates on the smoothed objective , leading to iterates  that may suffer from the above “non-sparse” issue, assuming that  is the -norm. Yet, our approach also provides iterates  that are computed using the original optimization method  we wish to accelerate. When  handles composite problems without smoothing, typically when  is a proximal block-coordinate, or incremental method, the iterates  may be sparse. For this reason, our theoretical analysis presented in Section 4 studies the convergence of the sequence to the solution .

## 4 Convergence and complexity analysis

In this section, we study the convergence of the QNing algorithm—that is, the rate of convergence of the quantities and , and also the computational complexity due to solving the sub-problems (10). We start by stating the main properties of the gradient approximation in Section 4.1. Then, we analyze the convergence of the outer loop algorithm in Section 4.2, and Section 4.3 is devoted to the properties of the line search strategy. After that, we provide the cost of solving the sub-problems in Section 4.4 and derive the global complexity analysis in Section 4.5.

### 4.1 Properties of the gradient approximation

The next lemma is classical and provides approximation guarantees about the quantities returned by the ApproxGradient procedure (Algorithm 2); see [5, 23]. We recall here the proof for completeness.

###### Lemma 1 (Approximation quality of the gradient approximation).

Consider a vector  in , a positive scalar  and an approximate proximal point

 z≈argminw∈Rd{h(w)≜f(w)+κ2∥w−x∥2},

such that , where . As in Algorithm 2, define and . Then, the following inequalities hold

 F(x)≤Fa≤F(x)+ε, (15) ∥z−p(x)∥≤√2εκ, (16) ∥g−∇F(x)∥≤√2κε. (17)

Moreover, is related to by the following relationship

 f(z)=Fa−12κ∥g∥2. (18)
###### Proof.

(15) and (18) are straightforward by definition of . Since is convex, the function is -strongly convex, and (16) follows from

 κ2∥z−p(x)∥2≤h(z)−h(p(x))=h(z)−h∗≤ε,

where we recall that minimizes . Finally, we obtain (17) from

 g−∇F(x)=κ(x−z)−κ(x−p(x))=κ(p(x)−z),

by using the definitions of  and the property (6). ∎

This lemma allows us to quantify the quality of the gradient and function value approximations, which is crucial to control the error accumulation of inexact proximal point methods. Moreover, the relation (18) establishes a link between the approximate function value of and the function value of the original objective ; as a consequence, it is possible to relate the convergence rate of  from the convergence rate of . Finally, the following result is a direct consequence of Lemma 1:

###### Lemma 2 (Bounding the exact gradient by its approximation).

Consider the same quantities introduced in Lemma 1. Then,

 12∥g∥2−2κε≤∥∇F(x)∥2≤2(∥g∥2+2κε). (19)
###### Proof.

The right-hand side of Eq. (19) follows from

 ∥∇F(x)∥2 ≤ 2(∥∇F(x)−g∥2+∥g∥2) ≤ 2(2κε+∥g∥2)        (% from (???)).

Interchanging and gives the left-hand side inequality. ∎

###### Corollary 1.

If with , then

 1−4c2≤∥∇F(x)∥2∥g∥2≤2(1+2c). (20)

This corollary is important since it allows to replace the unknown exact gradient by its approximation , at the cost of a constant factor, as long as the condition is satisfied.

### 4.2 Convergence analysis of the outer loop

We are now in shape to establish the convergence of the QNing meta-algorithm, without considering yet the cost of solving the sub-problems (10). At iteration , an approximate proximal point is evaluated:

The following lemma characterizes the expected descent in terms of objective function value when the gradient is approximately computed.

###### Lemma 3 (Approximate descent property).

At iteration , if the sub-problem (21) is solved up to accuracy in the sense of Lemma 1 and the next iterate satisfies the descent condition (12), then,

 F(xk+1)≤F(xk)−18κ∥∇F(xk)∥2+32εk. (22)
###### Proof.

From (15) and (12),

 F(xk+1) ≤ Fk+1≤Fk−14κ∥gk∥2 ≤ F(xk)+εk−(18κ∥∇F(xk)∥2−εk2)(from (???) and (% ???)) = F(xk)−18κ∥∇F(xk)∥2+32εk.

This lemma gives us a first intuition about the natural choice of the accuracy , which should be in the same order as . In particular, if

 εk≤116κ∥∇F(xk)∥2, (23)

then we have

 F(xk+1)≤F(xk)−132κ∥∇F(xk)∥2, (24)

which is a typical inequality used for analyzing gradient descent methods. Before presenting the convergence result, we remark that condition (23) cannot be used directly since it requires knowing the exact gradient . A more practical choice consists of replacing it by the approximate gradient.

###### Lemma 4 (Practical choice of εk).

The following condition implies inequality (23):

 εk≤172κ∥gk∥2. (25)
###### Proof.

From Corollary 1, Equation (25) implies

 ∥gk∥2≤92∥∇F(xk)∥2and thus εk≤172κ∥gk∥2≤116κ∥∇F(xk)∥2.

Whereas the gradient is unknown in practice, we have access to the estimate , which allows us to use condition (25). Finally, we obtain the following convergence result for strongly convex problems, which is relatively classical in the literature of inexact gradient methods (see Section 4.1 of  for a similar result).

###### Proposition 2 (Convergence of Algorithm 1, strongly-convex objectives).

Assume that  is -strongly convex. Let be the sequences generated by Algorithm 1 where the stopping criterion (11) is used. Then,

 F(xk)−F∗≤(1−116q)k(F(x0)−F∗),withq=μ+κμ.
###### Proof.

The proof follows directly from (24) and the standard analysis of the gradient descent algorithm for the -strongly convex and -smooth function  by remarking that and . ∎

###### Corollary 2.

Under the conditions of Proposition 2, we have

 f(zk)−f∗≤(1−116q)k(f(x0)−f∗). (26)
###### Proof.

From (18) and (25), we have

 f(zk)=Fk−12κ∥gk∥2≤F(xk)+εk−12κ∥gk∥2≤F(xk).

Moreover,

It is worth pointing out that our analysis establishes a linear convergence rate whereas one would expect a superlinear convergence rate as for classical classical variable metric methods with infinite memory. The tradeoff lies in the choice of the accuracy . In order to achieve superlinear convergence rate, the approximation error also needs to decrease superlinearly as shown in . However, a fast decreasing sequence  requires an increasing effort in solving the sub-problems, which will dominate the global complexity. In other words, the global complexity may become worse even though we achieve faster convergence in the outer-loop. This will become clearer when we discuss the inner loop complexity in Section 4.4.

Next, we show the classical sublinear convergence rate of QNing under a bounded level set condition, when the objective is convex but not necessarily strongly convex.

###### Proposition 3 (Convergence of Algorithm 1 for convex, but not strongly-convex objectives).

Let be a convex function with bounded level sets. Then, there exists a constant , which depends on , such that the sequences and generated by Algorithm 1 with stopping criterion (11), satisfies

 F(xk)−F∗≤8κR2k   and   f(zk)−f∗≤8κR2k.
###### Proof.

We defer the proof and the proper definition of the bounded level set assumption to Appendix A. ∎

So far, the analysis has assumed that the line search would always produce an iterate that satisfies the descent condition (12), which naturally holds for a step size . In the next section, we study classical conditions under which a non-zero step size is selected.

### 4.3 Conditions for non-zero step sizes ηk and termination of the line search

At iteration , the line search is performed on the stepsize to find the next iterate

 xk+1=xk−(ηkHk+(1−ηk)H0)gk,

such that satisfies the descent condition (12). Intuitively, when goes to zero, will be close to the classical gradient step where the descent condition holds. This observation leads us to consider the following sufficient condition for the descent condition (12).

###### Lemma 5 (A sufficient condition for the descent condition (12)).

If where is a positive definite matrix such that with and the sub-problems are solved up to accuracy , then the sufficient condition (12) holds, i.e.

 Fk+1≤Fk−14κ∥gk∥2. (27)

Therefore, a line search strategy consisting of finding the largest of the form , with and in always terminates in a bounded number of iterations if the sequence is also bounded, meaning there exists such that for any , . Note that in practice, we consider a set of step sizes for or , which naturally upper-bounds the number of line search iterations to . More precisely, all experiments performed in this paper use and .

###### Proof of Lemma 5.

First, we recall that and

 F(zk)≤f(zk)=Fk−12κ∥gk∥2, (28)

which means the step size naturally satisfies the descent condition. Considering now , we have,

 Fk+1=Fk+1−F(xk+1)≜E1+F(xk+1)−F(zk)≜E2+F(zk),

and we are going to bound the two error terms and by some factors of .

By denoting with , we obtain by construction

 E1=Fk+1−F(xk+1)≤εk+1≤cκ∥gk+1∥2≤2c(1−4c)κ∥∇F(xk+1)∥2, (29)

where the last inequality comes from Corollary 1. Moreover,

 ∥∇F(xk+1)∥ ≤∥∇F(xk+1)−∇F(xk)∥+∥∇F(xk)−gk∥+∥gk∥ ≤κ∥xk+1−xk∥+√2κεk+∥gk∥ ≤(2+α+√2c)∥gk∥. (30)

Thus,

 E1≤2c(2+α+√2c)21−4c1κ∥gk∥2. (31)

Second, by the -smoothness of , we have

 E2 =F(xk+1)−F(zk) ≤⟨∇F(zk),xk+1−zk⟩+κ2∥xk+1−zk∥2 ≤∥∇F(zk)∥∥xk+1−zk∥+κ2∥xk+1−zk∥2.

Since , we have . Furthermore, similarly to (30), we can bound

 ∥∇F(zk)∥≤(2+√2c)∥gk∥.

Therefore,

 E2≤(2+α+√2c)ακ∥gk∥2. (32)

Combining (31) and (32) yields

 E1+E2≤[2c(2+α+√2c)21−4c+(2+α+√2c)α]1κ∥gk∥2. (33)

When and , we have . This together with (28) completes the proof. ∎

In practice, the unit stepsize is very often sufficient for the descent condition to hold, as empirically studied in Appendix C.2. The following result shows that under a specific assumption on the Moreau-Yosida envelope , indeed the unit stepsize is always selected when the iterate are close to the optimum. The condition, called Dennis-Moré criterion , is classical in the literature of Quasi-Newton methods, even though we cannot formally show that it holds for the Moreau-Yosida envelope . Indeed, the criterion requires to be twice continuously differentiable, which is not true in general, see . Therefore, the lemma below should not be seen as an formal explanation for the choice of step size , which we often observe in practice, but simply as a reasonable condition that leads to this choice.

###### Lemma 6 (A sufficient condition for unit stepsize).

Assume that is twice-continuously differentiable and is Lipschitz. If is bounded and the Dennis-Moré criterion  is satisfied, i.e.

 ∥(B−1k−∇2F(x∗)−1)gk∥∥gk∥→0,whereBk=H−1k, (DM)

then, the descent condition (12) is satisfied with when is large enough.

We remark that the Dennis-Moré criterion we use here is slightly different from the standard one since the criterion is based on approximate gradients . The proof is close to that of similar lemmas appearing in the proximal quasi-Newton literature , and is relegated to the appendix. Interestingly, this proof also suggests that a stronger stopping criterion such that could lead to superlinear convergence. However, such a choice of would significantly increase the complexity for solving the sub-problems, and overall degrade the global complexity.

### 4.4 Complexity analysis of the inner loop

In this section, we evaluate the complexity of solving the sub-problems (10) up to the desired accuracy using a linearly convergent method . Our main result is that all sub-problems can be solved in a constant number  of iterations (in expectation if the method is non-deterministic) using a warm start strategy.

Let us consider the sub-problem with an arbitrary prox center ,

 minw∈Rd{h(w)=f(w)+κ2∥w−x∥2}. (34)

The number of iterations needed is determined by the ratio between the initial gap and the desired accuracy. We are going to bound this ratio by a constant factor.

###### Lemma 7 (Warm start for primal methods - smooth case).

If is differentiable with -Lipschitz continuous gradients, we initialize the method  with . Then, we have the guarantee that

 h(w0)−h∗≤L+κ2κ2∥∇F(x)∥2. (35)
###### Proof.

Denote by the minimizer of . Then, we have the optimality condition . As a result,

 h(w0)−h∗=f(x)−(f(w∗)+κ2∥w∗−x∥2)≤f(w∗)+⟨∇f(w∗),x−w∗⟩+L2∥x−w∗∥2−(f(w∗)+κ2∥w∗−x∥2)=L+κ2∥w∗−x∥2=L+κ2κ2∥∇F(x)∥2.

The inequality in the proof of Lemma 7 relies on the smoothness of , which does not hold for composite problems. The next lemma addresses this issue.

###### Lemma 8 (Warm start for primal methods - composite case).

Consider the composite optimization problem , where is -smooth. By initializing the method  with

 w0=argminw∈Rd{f0(x)+⟨∇f0(x),w−x⟩+L+κ2∥w−x∥2+ψ(w)}, (36)

we have,

 h(w0)−h∗≤L+κ2κ2∥∇F(x)∥2.
###### Proof.

We use the inequality corresponding to Lemma 2.3 in : for any ,

 h(w)−h(w0)≥L′2∥w0−x∥2+L′⟨w0−x,x−w⟩, (37)

with . Then, we apply this inequality to , and

 h(w0)−h∗≤−L′2∥w0−x∥2−L′⟨w0−x,x−w∗⟩≤L′2∥x−w∗∥2=L+κ2κ2∥∇F(x)∥2.

We get an initialization of the same quality in the composite case as in the smooth case, by performing an additional proximal step. It is important to remark that the above analysis do not require strong convexity of , which allows us to derive the desired inner-loop complexity.

###### Proposition 4 (Inner-loop complexity for Algorithm 1).

Consider Algorithm 1 with the warm start strategy described in Lemma 7 or in Lemma 8. Assume that the optimization method  applied in the inner loop produces a sequence for each sub-problem (34) such that

 h(wt)−h∗≤CM(1−τM)t(h(w0)−h∗)  for some constants  CM,τM>0. (38)

Then, the stopping criterium is achieved in at most iterations with

 TM=1τMlog(74CML+κκ).
###### Proof.

Consider at iteration , we apply to approximate the proximal mapping according to . With the given (which we abbreviate by ), we have

 h(wT)−h∗≤CM(1−τM)T(h(w0)−h∗)≤CMe−τMT(h(w0)−h∗)≤C