Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients

06/01/2018 ∙ by Sai Praneeth Karimireddy, et al. ∙ EPFL

We show that Newton's method converges globally at a linear rate for objective functions whose Hessians are stable. This class of problems includes many functions which are not strongly convex, such as logistic regression. Our linear convergence result is (i) affine-invariant, and holds even if (ii) an approximate Hessian is used, and if (iii) the subproblems are only solved approximately. Thus we theoretically demonstrate the superiority of Newton's method over first-order methods, which would only achieve a sublinear O(1/t^2) rate under similar conditions.

1 Introduction

Newton’s method is one of the earliest algorithms for the minimization of an unconstrained convex objective function $f\colon \mathbb{R}^n \to \mathbb{R}$,

$$\min_{x \in \mathbb{R}^n} f(x), \tag{1}$$

and iteratively performs the following update for some step-size $\gamma > 0$,

$$x_{t+1} = x_t - \gamma\, \nabla^2 f(x_t)^{-1} \nabla f(x_t). \tag{2}$$

Here $f$ is assumed to be a twice differentiable convex function. In contrast to the classical literature we do not assume that the function is smooth (i.e. that the gradient is Lipschitz continuous), nor do we assume strong convexity.
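The update (2) can be illustrated with a short sketch. The objective below is a hypothetical test function chosen for this example, not one taken from the paper, and the step-size 0.5 is an arbitrary choice; the names `newton_step`, `f`, `grad` and `hess` are likewise introduced only for illustration.

```python
import numpy as np

def newton_step(grad, hess, x, gamma):
    """One damped Newton update: x - gamma * pinv(H(x)) @ grad(x)."""
    g = grad(x)
    H = hess(x)
    # lstsq returns the minimum-norm least-squares solution, which equals
    # pinv(H) @ g, so a singular Hessian is handled as well.
    direction = np.linalg.lstsq(H, g, rcond=None)[0]
    return x - gamma * direction

# Hypothetical objective (an assumption of this sketch):
# f(x) = sum_i (exp(x_i) - x_i), convex with minimizer x = 0, but neither
# strongly convex nor Lipschitz-smooth on all of R^n.
f = lambda x: np.sum(np.exp(x) - x)
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))

x = np.array([2.0, -1.5])
values = [f(x)]
for _ in range(60):
    x = newton_step(grad, hess, x, gamma=0.5)
    values.append(f(x))
x_final = x
```

On this example the function values decrease at every iteration and the iterates approach the minimizer at the origin.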

Popularized in its present form by Bennett [6] and Kantorovich [20], Newton’s method has been an immensely important algorithm for optimization. Though there has been significant work analyzing and extending the standard scheme (2), global convergence results remain few and unsatisfactory (cf. [27] and references therein). In a seminal result, Nesterov and Nemirovski [26] show that Newton’s algorithm achieves local quadratic convergence. However, the conditions under which quadratic convergence occurs are too restrictive: they require both the function to be self-concordant and the starting point to be almost at the optimum. Neither of these conditions is typically satisfied when applying Newton’s method to minimize functions of the form (1) in applications. Most of the global convergence results either i) are hard to compare with gradient descent and make strong assumptions on the objective (e.g. [26, 28]), or ii) have a rate which is slower than vanilla gradient descent (e.g. [19, 22]). An exception to this is the breakthrough result by Nesterov and Polyak [27], who obtain an O(1/t^2) rate, later improved to O(1/t^3) ([24]), by solving cubic sub-problems. These rates do not assume strong convexity or Lipschitz gradients. However, solving cubic sub-problems is impractical even for medium sized problems.

On the other hand, there have been recent efforts to perform efficient approximations of (2) in time comparable to that required for a gradient update ([2, 17, 22]). So far, these methods do not enjoy any global convergence rates better than those of first-order methods.

A new regularity condition.

Most analyses of Newton-type algorithms assume that for nearby points $x$ and $y$, the Hessians at $x$ and $y$ are also close. In particular the results on cubic regularization (e.g. [27]) assume that the Hessian is Lipschitz. This is equivalent to assuming that the closeness condition holds with an additive error whose magnitude depends on the distance between $x$ and $y$. We instead assume that the Hessian is stable, which means that the error is multiplicative. This is sufficient for a simple proof of the global linear convergence of Newton’s method. Further, since our condition is multiplicative, stability is also a scale-free (i.e. affine invariant) condition.

The assumption of a stable Hessian was previously used to analyze the statistical properties of logistic regression in [3], and to analyze the convergence of SGD on logistic regression in [5, 4]. We were inspired by [10], who obtain an efficient algorithm for matrix scaling using ideas very similar to those used here.

Our contributions.

Our main contribution is a straightforward affine-invariant proof for global linear convergence of Newton’s algorithm, without resorting to strong convexity or Lipschitz continuous gradient (Section 3). We instead rely only on a natural multiplicative notion of stability of the Hessian (Section 2). This shows an exponential gap between global convergence rates of first-order and second-order methods for a wide class of functions, placing Newton-type methods on a strong theoretical footing. Further, in Section 4, we relax stability and show that a local notion of stability is sufficient to guarantee linear convergence for trust-region Newton methods. Finally, we show in Section 5 that linear convergence persists when using inexact and proximal Newton steps.

Related work.

Newton’s method with backtracking has been shown to be globally convergent for self-concordant functions ([26]) but the resulting rate is difficult to compare directly to gradient-based methods due to its two-phase additive structure. Otherwise, global convergence results of second-order algorithms were known when the objective has both strong convexity and Lipschitz gradients ([28, 22]), or by solving cubic subproblems ([27, 24, 9]). Similar convergence rates are shown for the inexact Newton method in ([30, 22]). Empirically, ([23]) show that the trust region Newton method significantly outperforms other methods, and it is hence the default optimization algorithm for a variety of problems in the widely used LIBLINEAR library ([14]). Although in this work we restrict ourselves to convex functions, Newton-type algorithms ([27, 1, 29]) as well as trust region methods ([11, 12]) have been successfully used to escape saddle points and converge to a local minimum in non-convex settings.

2 Stability of the Hessian

We now formally define our notion of a stable Hessian and show that it is implied by many other standard assumptions. We will also demonstrate that for a large class of problems on which Newton’s method is usually applied, our condition is satisfied. As is standard, we will assume that the level set of the function is bounded. In particular, the set $Q$ has a bounded diameter $D$, where $Q$ is defined as

$$Q := \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}. \tag{3}$$

2.1 Definition of stability

Here we present an affine invariant definition of a stable Hessian. For any vector $v \in \mathbb{R}^n$ and a positive semi-definite matrix $M$, let $\|v\|_M$ denote the semi-norm $\sqrt{v^\top M v}$.

Assumption A (stable Hessian).

For any and , we assume and that there exists a constant such that [1]

[1] This assumption can be relaxed: instead of for all of , we only need the condition to hold for , and as well as , for all and .

Assumption A allows us to derive global upper and lower bounds on the function over the level set defined in (3). In contrast to standard assumptions such as strong convexity, smoothness or Lipschitz Hessian, stability is affine invariant:

Lemma 1.

The stability constant defined in Assumption A is invariant under any non-singular linear transformation of the variables.
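The invariance can be checked numerically. The sketch below assumes that stability compares quadratic forms of the Hessian at two points, i.e. ratios of the form $\|v\|^2_{\nabla^2 f(z)} / \|v\|^2_{\nabla^2 f(x)}$; the exact statement of Assumption A is not recoverable from the text above, so this form, the test function and the matrix `A` are all assumptions made for illustration.

```python
import numpy as np

n = 4
# Upper-triangular matrix with diagonal entries 2, hence invertible.
A = np.triu(np.ones((n, n))) + np.eye(n)

def hess_f(x):
    # Hessian of the illustrative function f(x) = sum_i exp(x_i).
    return np.diag(np.exp(x))

def hess_f_tilde(y):
    # Hessian of the transformed function f_tilde(y) = f(A y), by the chain rule.
    return A.T @ hess_f(A @ y) @ A

def hessian_ratio(hess, x, z, v):
    # Ratio of the squared Hessian semi-norms of v at the points z and x.
    return (v @ hess(z) @ v) / (v @ hess(x) @ v)

rng = np.random.default_rng(0)
x, z, v = rng.normal(size=(3, n))
A_inv = np.linalg.inv(A)

ratio_original = hessian_ratio(hess_f, x, z, v)
# Map points and direction into the coordinates of the transformed function.
ratio_transformed = hessian_ratio(hess_f_tilde, A_inv @ x, A_inv @ z, A_inv @ v)
```

The two ratios agree up to rounding, because the factors of `A` introduced by the chain rule cancel against the change of coordinates of the direction `v`.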

2.2 Sufficient conditions

Here we will discuss a host of standard assumptions and see how they imply a stable Hessian. The formal definitions of the conditions, as well as the proof of Theorem I, are presented in Appendix A. We also assume that the domain is bounded with a diameter $D$.

Theorem I.

The following are sufficient conditions for ensuring the stability of the Hessian as defined in Assumption A:

  1. Lipschitz gradient and strong convexity imply a stable Hessian,

  2. Lipschitz Hessian and strong convexity imply a stable Hessian,

  3. self-concordance and Lipschitz gradient imply a stable Hessian, and

  4. quasi-self-concordance implies a stable Hessian.

2.3 Applications

For a given matrix we consider functions of the form where is coordinate-wise separable. For learning applications is typically the data matrix. The objective function may further be regularized for an arbitrary (e.g. regularizer), as we will discuss in Section 5. We can assume that is full-rank, otherwise one can restrict the domain to the range of . Further let us also assume that each row of the matrix is normalized and . Then the affine-invariance of stability allows us to transform into a sum of one-dimensional functions where . Since is normalized, . Thus without loss of generality, we can focus on discussing the stability of one-dimensional functions with a domain diameter less than . Many of the following applications have been adapted from [33].

  1. Logistic regression: The loss function is shown to be -quasi-self-concordant in [3], and so is -stable.

  2. Wasserstein distance: Functions of the form are also -stable. The dual of the entropy-regularized Wasserstein distance is of this form [13].

  3. Boosting: AdaBoost can be seen as a first-order algorithm on an exponential loss function (cf. Chapter 6, [31]).

  4. Self-concordant functions: As was shown in Theorem I, all self-concordant functions (e.g. logarithmic barriers) with bounded domain and Lipschitz gradients are stable.

  5. Entropy regularizer: The standard entropy function also fits into our framework, assuming bounded domain for . The Hessian of the entropy function is and so is -stable.

  6. Robust regression: Instead of the standard least-squares loss, [35] consider a more robust version which is for with a Hessian . Assuming a bounded domain for , the function is -stable.

While some of these constants may seem large (e.g. the one for logistic regression), in Section 4 we will see a local notion of stability which gets around the super-linear dependence on the diameter.
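To get a feel for how large such constants can be, the following sketch measures, for the one-dimensional logistic loss $\log(1 + e^{-t})$, the largest ratio between second derivatives on an interval of diameter $D$. Treating this ratio as the one-dimensional stability constant is an assumption about the form of Assumption A; the interval $[0, D]$ and the helper names are chosen only for illustration. Since the logistic loss satisfies $|h'''| \le h''$, the ratio is at most $e^{D}$.

```python
import numpy as np

def logistic_second_derivative(t):
    # h(t) = log(1 + exp(-t)) has h''(t) = s(t) * (1 - s(t)), s the sigmoid.
    s = 1.0 / (1.0 + np.exp(-t))
    return s * (1.0 - s)

def empirical_stability(lo, hi, num=2001):
    # Largest ratio h''(a) / h''(b) over a grid of points a, b in [lo, hi].
    h2 = logistic_second_derivative(np.linspace(lo, hi, num))
    return h2.max() / h2.min()

diameters = [1.0, 2.0, 4.0, 8.0]
ratios = [empirical_stability(0.0, D) for D in diameters]
for D, c in zip(diameters, ratios):
    print(f"D = {D:4.1f}   ratio = {c:10.3f}   exp(D) = {np.exp(D):10.3f}")
```

The measured ratio grows roughly exponentially with the diameter while staying below $e^{D}$, which is the kind of dependence the local notion of stability is meant to avoid.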

3 Convergence of exact Newton’s method

The convergence of Newton’s method follows in a straightforward manner from the definition of a stable Hessian. To demonstrate the core idea, let us look at the simplest case—Newton’s algorithm on a twice differentiable function using the exact inversion of the Hessian (or its pseudo-inverse), as presented in Algorithm 1. We will later extend the algorithm and relax many of these assumptions. The algorithm uses a fixed step-size . This can easily be made adaptive (see Appendix B) at a mild additional cost.

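1: Input: $x_0$ and $\gamma$.
2: for $t = 0, 1, \dots$ do
3:      $x_{t+1} = x_t - \gamma\, \nabla^2 f(x_t)^{\dagger} \nabla f(x_t)$
4: end for
Algorithm 1 Exact Newton Descent

Theorem II.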

Given Assumption A, for any iteration of Algorithm 1 with ,

As we noted before, the assumption that the Hessian is stable allows us to provide global upper and lower bounds on the function value (the proof is given in Appendix C.2).

Lemma 2.

Given Assumption A, for any ,

(4)
(5)

The bounds above only hold on the level set defined in (3). To use the Lemma, we need the iterates to remain in this set for all iterations. For this, it suffices to show that Algorithm 1 is a descent method. For now, let us assume this technicality; the proof can be found in Appendix C.3.

Lemma 3.

Under Assumption A, for any of Algorithm 1 with , the update is well-defined and further .

Proof of Theorem II. By Lemma 3, Algorithm 1 is well-defined and is a descent method. This means that both and lie in and we can apply Lemma 2. The upper bound (4) implies that for ,

Here note that . Now minimizing both sides of the lower bound (5) gives

Using the above bound with , we get

Subtracting from both sides, and iterating from to proves the theorem. ∎
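The chain of inequalities in this proof can be sketched as follows. This is a reconstruction under stated assumptions, not the paper's verbatim statement: it assumes that Assumption A reads $\|y-x\|^2_{\nabla^2 f(z)} \le c\,\|y-x\|^2_{\nabla^2 f(x)}$ for points in the level set $Q$, that the step-size is $\gamma = 1/c$, and it writes $g = \nabla f(x_t)$, $H = \nabla^2 f(x_t)$ and $x^\star$ for a minimizer.

```latex
% Assumed form of the bounds (4) and (5), obtained from a second-order
% Taylor expansion combined with the assumed stability condition:
\begin{align*}
f(y) &\le f(x) + \langle \nabla f(x), y - x \rangle
        + \tfrac{c}{2}\, \|y - x\|^2_{\nabla^2 f(x)}, \\
f(y) &\ge f(x) + \langle \nabla f(x), y - x \rangle
        + \tfrac{1}{2c}\, \|y - x\|^2_{\nabla^2 f(x)}.
\end{align*}
% Upper bound with x = x_t and y = x_{t+1} = x_t - (1/c) H^\dagger g:
\begin{align*}
f(x_{t+1}) &\le f(x_t) - \tfrac{1}{c}\, g^\top H^\dagger g
              + \tfrac{c}{2} \cdot \tfrac{1}{c^2}\, g^\top H^\dagger g
            = f(x_t) - \tfrac{1}{2c}\, \|g\|^2_{H^\dagger}.
\end{align*}
% Minimizing both sides of the lower bound over y
% (the right-hand side is minimized at y - x_t = -c H^\dagger g):
\begin{align*}
f(x^\star) &\ge f(x_t) - \tfrac{c}{2}\, \|g\|^2_{H^\dagger}.
\end{align*}
% Combining the two displays:
\begin{align*}
f(x_{t+1}) - f(x^\star)
  &\le \Bigl(1 - \tfrac{1}{c^2}\Bigr) \bigl(f(x_t) - f(x^\star)\bigr).
\end{align*}
```

Under these assumptions the suboptimality contracts by a factor $1 - 1/c^2$ per iteration, which is a linear rate whose constant depends only on the stability constant.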

4 Trust region Newton’s method

The convergence rate of Newton’s algorithm in Theorem II critically depends on the stability constant, which is a global measure bounding the relative change of the Hessian around the current point. Often, the value of this constant depends on the diameter of the domain. E.g., as we discussed in Section 2.3, the stability constants of logistic regression and exponential loss can be large. In this section, we design an algorithm whose convergence depends only on a local measure of stability, getting around the potentially exponential dependence on the diameter.

4.1 Local stability

We introduce a local measure of stability, which is typically much smaller than the global constant. This notion captures the multiplicative change in the Hessian in a small ball of a given radius around the current point, measured in an arbitrary norm.

Assumption B (-locally stable with respect to ).

For any such that and , we assume that and that there exists a constant for which the following holds

Since the norm may not be affine invariant, the resulting constant is also not necessarily affine invariant. It is, however, possible to circumvent this limitation (see Section D in the Appendix).

4.2 Trust-region Algorithm

Trust-region methods restrict each update to a small ball of fixed radius around the current iterate, and so are more ‘local’ algorithms.

1:Input: , , and .
2:for  do
3:     
4:end for
Algorithm 2 Trust-region Newton Descent
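The exact subproblem of Algorithm 2 is not recoverable from the listing above, so the following is only a generic sketch of one trust-region Newton step. It assumes the Euclidean norm, a positive definite Hessian, and a quadratic model $\langle g, d\rangle + \tfrac{s}{2} d^\top H d$ with a hypothetical scaling parameter `scale` standing in for the stability constant; the constrained minimizer is found by bisection on the Lagrange multiplier.

```python
import numpy as np

def trust_region_step(g, H, radius, scale=1.0, iters=100):
    """Minimize <g, d> + (scale / 2) d^T H d subject to ||d||_2 <= radius.

    Assumes H is positive definite. The minimizer has the form
    d(lam) = -(scale * H + lam * I)^{-1} g for some multiplier lam >= 0.
    """
    n = g.size
    M = scale * H

    def d_of(lam):
        return -np.linalg.solve(M + lam * np.eye(n), g)

    d = d_of(0.0)
    if np.linalg.norm(d) <= radius:
        return d                      # unconstrained minimizer is feasible
    # ||d(lam)|| decreases in lam and ||d(lam)|| <= ||g|| / lam, so the
    # multiplier lies in (0, ||g|| / radius]; locate it by bisection.
    lo, hi = 0.0, np.linalg.norm(g) / radius
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(d_of(mid)) > radius:
            lo = mid
        else:
            hi = mid
    return d_of(hi)                   # feasible by construction

H = np.diag([1.0, 10.0])
g = np.array([4.0, -3.0])
d_small = trust_region_step(g, H, radius=0.5)    # constraint is active
d_large = trust_region_step(g, H, radius=10.0)   # plain Newton step
model_value = g @ d_small + 0.5 * d_small @ H @ d_small
```

When the radius is large the step reduces to the plain Newton step, and when it is small the step lies on the boundary of the ball while still decreasing the quadratic model.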

4.3 Convergence analysis

Theorem III.

Given Assumption B, for any iteration of Algorithm 2 with ,

where is the trust region radius and is the diameter of the level set i.e.  .

Proof.

The proof of Theorem III is very similar to that of Theorem II. The main deviation is the derivation of tighter lower and upper bounds that depend on the local bound instead of the global parameter . This is detailed in Lemma 5 in Appendix C.4. At any iteration , we get that for any such that the following holds

(6)
(7)

The upper bound (6) combined with the update in Step 3 implies that for any ,

The last inequality is trivial (with an equality) when minimizing unbounded quadratics, but is also valid when minimizing over convex domains (see Lemma 6 in Appendix C.4). Let us define and the point . Then

Combining this with our previous observation gives

The last inequality used the lower bound from (7). Now we will have to relate the term to the actual minimum value . This we will do by using the convexity of the function .

Adding and subtracting from the left side, rearranging the terms, and iterating over finishes the proof. ∎

4.4 Improvement in the rate of convergence

In a number of applications we saw in Section 2.3, the dependence of the stability constant on the diameter was super-linear (and even exponential). Local stability gets around this and ensures that the rate of convergence of Algorithm 2 depends at most linearly on the diameter.

For in Theorem II gives a rate depending on . In contrast, using , Theorem III gives a rate depending on . Thus the optimal can be computed as

As an illustrative example, consider logistic regression or exponential losses. The local-stability scales as for . The rate of convergence of Newton’s method would depend on . On the other hand, using the optimal trust region radius , the rate for the trust-region method becomes . This result makes a very strong case for using trust-region Newton methods.

There are two points to note here. First, one might ask if a similar improvement could be shown for the simpler Newton’s method equipped with a line search. We answer in the negative in Section 6. Next, as we noted before, trust region methods are not affine-invariant, and moreover require solving the Newton step with an additional constraint. In the appendix (Section D), we show an affine-invariant algorithm only requiring minimizing quadratics over the domain .

5 Approximate and proximal extensions

We can extend our analysis of Newton’s method to the proximal setting to minimize a composite objective function, i.e.

(8)

where as before is a twice differentiable convex function, and is a possibly non-differentiable, extended valued convex function.

5.1 Inexact Newton steps

In this section we also relax two requirements: that an exact Hessian is used, and that the quadratic subproblem is solved exactly. At each iteration, we assume access to the exact gradient at the current iterate, and only an approximation of the Hessian.

Approximate Hessian.

Below we list a few scenarios where this notion of an approximate Hessian is useful:

  1. Sketched Hessian. In machine learning and signal processing applications, the function is typically of the form where is a simple, separable function and is a data matrix. In such cases, the Hessian where is very cheap to compute (same cost as computing the gradient). Instead of using the full matrix , a low dimensional sketch is used. This provides guarantees satisfying (C) while ensuring cheap update steps (cf. [18, 17]).

  2. Hessian free inexact methods. If we use first order algorithms to minimize the quadratic subproblem, we would only require products of the Hessian with a vector. Such products can be computed without computing or storing the entire Hessian matrix. The resulting algorithms are inexpensive and their costs are comparable to first order methods (cf. [7, Section 6.1]).

  3. Block diagonal approximation. For distributed and parallel computation, it is crucial that we are able to create subproblems such that they are separable, i.e. we can decompose the subproblem into multiple subproblems which can be solved independently (e.g. [32, 21, 16]).
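The Hessian-free idea can be sketched for an objective of the form $\sum_i h(a_i^\top x)$ with data matrix $A$ whose rows are $a_i$. The logistic loss is used for $h$ purely as an example, and the function names and random data are assumptions of this sketch. The Hessian is $A^\top \mathrm{diag}(h''(Ax))\, A$, so its product with a vector needs only two matrix-vector products with $A$.

```python
import numpy as np

def logistic_weights(z):
    # Second derivative of the logistic loss log(1 + exp(-z)).
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def hessian_vector_product(A, x, v):
    """Compute (A^T diag(w) A) v without forming the d-by-d Hessian."""
    w = logistic_weights(A @ x)   # one weight per data point
    return A.T @ (w * (A @ v))    # two products with A, O(n d) work

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))      # hypothetical data matrix
x = rng.normal(size=5)
v = rng.normal(size=5)

hv = hessian_vector_product(A, x, v)
# Reference computation that does form the full Hessian, for comparison only.
H_explicit = A.T @ np.diag(logistic_weights(A @ x)) @ A
```

A first-order or conjugate-gradient solver for the quadratic subproblem can call such a routine in place of storing the Hessian.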

Approximate subproblems.

Using , we form a subproblem as in Step 3. Then we assume that at each iteration, our subproblem is solved to an arbitrary multiplicative accuracy, and only in expectation over some randomness of the subproblem algorithm. In particular, we assume the update is computed as in Step 5 for any fixed . Note that if , this means and that the subproblem was solved exactly.

1:Input: and .
2:for  do
3:     Define subproblem:
4:     Approximately minimize subproblem: Find such that
5:        
6:     Update:
7:end for
Algorithm 3 Approximate and Proximal Newton Descent

5.2 Convergence analysis

We need to quantify the approximation quality of the Hessian estimate

.

Assumption C.

We assume that there exists a constant such that for any , and as well as the following holds

(C)

Unfortunately the definition of is not necessarily affine invariant, but it does enable efficient approximations of the Hessian.

Theorem IV.

Given Assumptions B and C, for any iteration of Algorithm 3 with ,

where is the trust region radius and is the diameter of the level set .

6 Optimality of results

Linear vs. quadratic convergence.

When is self-concordant, or strongly convex and smooth, Newton’s method is known to converge quadratically when close enough to the optimum ([8, Section 9.5]). This was crucial in designing generic interior point algorithms ([34]), and so one might ask if we can show similar local quadratic convergence for functions with stable Hessians. We give a simple counterexample for which Newton’s method only achieves linear convergence. Consider for some large and . The function has a minimum value of 0 achieved at 0, and and . The Newton step on this function is , with a decrease in the function of

While is not globally stable, it is locally stable if at each step we restrict the trust region around   to lie within . Thus running Newton on with this varying trust region would also result in linear convergence, showing that our analysis can in general not be improved.
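The formula of the counterexample did not survive extraction. One function with the stated properties (minimum value 0 at 0, and only linear convergence of Newton's method) is $x^p$ on $x \ge 0$ for a large exponent $p$; whether this is the exact example intended is an assumption of the sketch below. For it, the Newton update is $x - x/(p-1)$, so each step shrinks the iterate by the constant factor $(p-2)/(p-1)$.

```python
p = 10  # hypothetical exponent; the text only says "large"

def newton_update(x):
    # f(x) = x**p, so f'(x) = p x^(p-1) and f''(x) = p (p-1) x^(p-2).
    grad = p * x ** (p - 1)
    hess = p * (p - 1) * x ** (p - 2)
    return x - grad / hess

xs = [1.0]
for _ in range(20):
    xs.append(newton_update(xs[-1]))

# Ratio of consecutive iterates: constant, hence linear convergence only.
contraction = [b / a for a, b in zip(xs, xs[1:])]
expected = (p - 2) / (p - 1)
```

The contraction factor does not improve as the iterates approach the minimizer, in contrast to the quadratic convergence seen for strongly convex smooth functions.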

Superiority of trust region.

We saw in Section 4 that trust-region Newton methods converge at a rate depending on the local stability of the Hessian. One might question if Newton’s method equipped with line search could potentially have similar advantages. We provide a negative answer to this question. Consider the two dimensional function

The minimum of this function is 0 achieved at . Let us pick a starting point . The Newton update with step-size can be computed to be

Suppose we perform an exact line search to pick the best . To simplify computations, we will look at the case where i.e. when is large. In this setting, the predominant term in the objective is . The optimal in this case is approximately . This means that

Thus we cannot hope to obtain a global linear convergence for this case. However if we instead solve the quadratic problem defined by the Hessian as in Step 3

with the trust region and , then the Hessian changes only by a factor of . This means that the constant as defined in Assumption B for and using the norm is at most . Thus we can use a constant step-size , independent of . As before if we look at what happens when , we get that

This shows that trust region methods can be superior to line search methods, especially with a careful choice of the trust region.

7 Conclusion

A predominant focus of past work on Newton methods has been to show local quadratic convergence under very restrictive assumptions—both on the function class, as well as on the starting point. Such assumptions are almost never satisfied in practice, especially in machine learning applications. We believe the notion of stability recasts the analysis of Newton-type methods in a manner much more suitable to such applications. Using stability, we show strong global linear convergence rates under conditions in which first-order methods would only achieve sublinear rates—thereby providing a fresh perspective on the performance of a host of classical Newton’s methods.

There are a number of follow-up questions which arise out of this work. Using the estimate sequence framework of [25], it is possible to accelerate the exact Newton’s method. However it is unclear if such an acceleration could also be achieved for the trust-region methods, or for the approximate and proximal extensions. Further, our theory indicates that the radius of the trust region is crucial for ensuring fast convergence. Although adaptive methods exist for picking the step-size (Appendix B), designing and evaluating theoretically justified adaptive schemes for picking the trust-region radius would be a fruitful direction. Finally, the notion of stability is restricted to convex functions—generalizing insights here to the non-convex setting remains a challenging open problem.

References

Appendix A Sufficient conditions for stability

Here follow the definitions of the various conditions on discussed in Section 2. First some notation:

We will restate the definitions of these conditions using our new notation. For any ,

  1. -stable Hessian: .

  2. -Lipschitz gradients: .

  3. -strongly convex: .

  4. -Lipschitz Hessian: .

  5. -self-concordant: .

  6. -quasi-self-concordant: .

Also, recall the diameter of the level set .

Proof of Theorem I.

Let us prove the Theorem case by case.

  1. -Lipschitz gradient and -strongly convex -stable Hessian.
    Using the definitions of the three terms,

  2. -Lipschitz Hessian and -strongly convex -stable Hessian.
    The definition of -Lipschitz Hessian implies that

    Now combining this with the definition of stability and strong convexity,

  3. -self-concordant and -Lipschitz gradient -stable Hessian.
    We use the proof technique from [15, Lemma 3.2]. Define . Assuming is thrice differentiable, using the definition of self-concordance

    This means that the definition of self-concordance implies that

    Since is -Lipschitz, this means

    Now setting in the definition of and multiplying the above equation by we get

    Using the definition of Lipschitz gradient, and the bound on the diameter of , we get that for all ,

  4. -quasi-self-concordant -stable Hessian.
    This statement is directly taken from [3, Proposition 1]. Define as before . The definition of -quasi-self-concordance implies that

    If we consider the function , the above equation shows that

    which in turn means

    Again setting in the definition of gives us that

Appendix B Line search strategies

All algorithms we have discussed in this paper assume that the value of is set correctly. This assumption can easily be relaxed by using line search strategies. There has been a significant amount of work on different line-search strategies and we will not attempt to provide a complete survey. Instead we point to ([11]). Among those methods, the backtracking strategy employed in making the cubic regularization techniques adaptive by ([9]) is especially suited to our strategy.

1:Input: , , , and
2:for  do
3:     Define quadratic subproblem:
4:     Compute update: Let be the update based on
5:     Check progress: Compute and
6:     
7:     
8:end for
Algorithm 4 Back tracking strategy

It is easy to adapt the theoretical guarantees and techniques used in ([9, 30]) for analysis of this backtracking strategy to our setting. This way we are able to remove both the necessity of knowing as well as make it an adaptive method. The details are summarized in Algorithm 4.
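A backtracking scheme of this kind might look as follows. This is not a transcription of Algorithm 4, whose details are not recoverable here; it is a sketch that assumes the unknown parameter is a stability constant $c$ used through the step-size $1/c$, and that a step is accepted once it achieves the decrease $\tfrac{1}{2c}\, g^\top H^\dagger g$ implied by a stability-type upper bound. The doubling and halving rule, the function names and the test objective are all illustrative choices.

```python
import numpy as np

def adaptive_newton(f, grad, hess, x0, c_init=1.0, iters=200, tol=1e-12):
    x, c_est = np.asarray(x0, dtype=float), c_init
    for _ in range(iters):
        g, H = grad(x), hess(x)
        d = np.linalg.lstsq(H, g, rcond=None)[0]   # Newton direction pinv(H) @ g
        decrement = g @ d                          # squared Newton decrement
        if decrement <= tol:
            break
        for _ in range(60):                        # bounded backtracking loop
            x_new = x - d / c_est
            if f(x_new) <= f(x) - decrement / (2.0 * c_est):
                break                              # sufficient decrease reached
            c_est *= 2.0                           # estimate too small: double it
        x = x_new
        c_est = max(1.0, c_est / 2.0)              # be optimistic next iteration
    return x, c_est

# Hypothetical objective (an assumption): f(x) = sum_i (exp(x_i) - x_i).
f = lambda x: np.sum(np.exp(x) - x)
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))

x0 = np.array([-3.0, 2.0])
# For comparison: the undamped Newton step from x0 increases the objective.
full_step_value = f(x0 - grad(x0) / np.exp(x0))
x_star, c_last = adaptive_newton(f, grad, hess, x0)
```

From this starting point the undamped step overshoots badly, while the backtracking variant finds a workable estimate after a few doublings and then converges to the minimizer at the origin.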

Throughout Algorithm 4, we always assumed that the only unknown parameter is . However when we are running trust region algorithms, we would perhaps like to adapt both the trust region radius as well as . While it is possible to design such an adaptive trust region strategy using insights from our proof, we leave the analysis and evaluation of such strategies for future work.

Necessity of step-size.

Theorems II, III and IV show that choosing the appropriate ensures global linear convergence. In the case where , this corresponds to using a step-size of . Here we show that this is not simply an artifact of the analysis—the use of is actually necessary to ensure global convergence. Consider the univariate function

This function is convex with gradient , second derivative and minimum value 0 achieved at . It satisfies our condition (Assumption A) of stable Hessian with where is the diameter of the level set. Suppose we start at for . Applying a Newton update with step-size gives . Let us assume the step-size , and to simplify computations. When is large, the predominant term of is if and if . In the setting where , and : we have veered too far to the left. Instead, using a step-size would ensure a descent step. In fact this example also showcases the advantage of adaptive step-sizes. Using a fixed step-size of either or even would require an exponential (in ) number of iterations to converge. Instead, using an adaptive step size of , where is the current position, would give convergence in a polynomial number of steps.
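The effect can be reproduced numerically. Since the formula of the univariate function was lost, the sketch below uses $e^{-x} + x - 1$ as a stand-in: it is convex, has minimum value 0 at 0, and shows the same overshoot to the left. Its choice, the starting point 5 and the damping $e^{-x_0}$ are assumptions made for illustration.

```python
import math

def f(x):
    # Stand-in objective (an assumption): convex, minimum value 0 at x = 0.
    return math.exp(-x) + x - 1.0

def newton_direction(x):
    # f'(x) = 1 - exp(-x) and f''(x) = exp(-x).
    return (1.0 - math.exp(-x)) / math.exp(-x)

x0 = 5.0
# Undamped Newton step: jumps far to the left, where exp(-x) dominates.
x_full = x0 - newton_direction(x0)
# Damped step with step-size exp(-x0): moves left by slightly less than 1.
x_damped = x0 - math.exp(-x0) * newton_direction(x0)

print(f"f(x0)       = {f(x0):.4f}")
print(f"f(x_full)   = {f(x_full):.4e}")
print(f"f(x_damped) = {f(x_damped):.4f}")
```

The undamped step increases the objective by many orders of magnitude, whereas the damped step is a descent step.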

Appendix C Additional proofs

C.1 Proof of affine invariance of stability (Lemma 1)

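Suppose we had a transformed function $\tilde f(y) := f(Ay)$ for an invertible matrix $A$. Its Hessian would be $\nabla^2 \tilde f(y) = A^\top \nabla^2 f(Ay)\, A$, using the chain rule. Let $\tilde Q$ denote the transformed domain of $Q$ as defined in (3), so that $y \in \tilde Q$ if $Ay \in Q$. The definition of the stability constant of $\tilde f$ would be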

C.2 Proof of lower and upper bounds (Lemma 2)

The proof of the Lemma follows from the second-order Taylor expansion of around . Taylor’s theorem gives us that for any there exists a such that for ,

(9)

Since is convex, and by substituting , in Assumption A,

Substituting , we have

This proves (4). On the other hand, by substituting , and in Assumption A, we get a lower bound

Again by substituting , we can finish the proof as

C.3 Proof of descent (Lemma 3)

For some , let us assume that . The base case, is trivially true. If , we are already at an optimum and , proving our Lemma. Otherwise we proceed as below.

We know that is a descent direction [8, Section 9.2]. This means there exists a small enough such that for , meaning . Applying Assumption A with and , we get that . In particular this implies that is in the range of and so the update is well-defined.

Now we are left with the task of proving . Note that also implies . This is a sufficient condition to ensure that the Newton’s step is a descent direction [8, Section 9.2]. This means there exists , such that for , we have . Hence . Let us define the auxiliary function for . The function is continuous in since is a continuous function and is a continuous map. Moreover we have that