Newton’s method is one of the earliest algorithms for the minimization of an unconstrained convex objective function, and iteratively performs the following update for some step-size:
Here the objective is assumed to be a twice differentiable convex function. In contrast to the classical literature, we do not assume that the function is smooth (i.e. that the gradient is Lipschitz continuous), nor do we assume strong convexity.
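In code, the update reads as follows; this is a minimal numerical sketch (the quadratic test function and all names below are illustrative, not from the text):

```python
import numpy as np

def newton_step(grad, hess, x, eta=1.0):
    # One damped Newton update: x <- x - eta * pinv(H(x)) @ g(x).
    # The pseudo-inverse keeps the step well defined even when the
    # Hessian is singular (strong convexity is not assumed).
    return x - eta * np.linalg.pinv(hess(x)) @ grad(x)

# On a quadratic f(x) = 0.5 x^T A x, a full step (eta = 1) lands
# exactly on the minimizer x = 0.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
x_next = newton_step(lambda x: A @ x, lambda x: A, np.array([3.0, -4.0]))
```

On quadratics the method is exact in one step; the analysis in this paper concerns how far this behavior degrades for general convex functions.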
Popularized in its present form by Bennett  and Kantorovich , Newton’s method has been an immensely important algorithm for optimization. Though there has been significant work analyzing and extending the standard scheme (2), global convergence results remain few and unsatisfactory (cf.  and references therein). In a seminal result, Nesterov and Nemirovski  show that Newton’s algorithm achieves local quadratic convergence. However, the conditions under which quadratic convergence occurs are too restrictive—they require both that the function be self-concordant and that the starting point be almost at the optimum. Neither of these conditions is typically satisfied when applying Newton’s method to minimize functions of the form (1) in applications. Most global convergence results either i) are hard to compare with gradient descent and make strong assumptions on the function (e.g. [26, 28]), or ii) have a rate slower than that of vanilla gradient descent (e.g. [19, 22]). An exception is the breakthrough result of Nesterov and Polyak , who obtain a global rate, later improved (), by solving cubic sub-problems. These rates do not assume strong convexity or Lipschitz gradients. However, solving cubic sub-problems is impractical even for medium-sized problems.
On the other hand, there have been recent efforts to perform efficient approximations of (2) in time comparable to that required for a gradient update ([2, 17, 22]). These methods have so far not enjoyed global convergence rates better than those of first-order methods.
A new regularity condition.
Most analyses of Newton-type algorithms assume that when two points are close, their Hessians are also close. In particular, the results on cubic regularization (e.g. ) assume that the Hessian is Lipschitz. This is equivalent to assuming that the condition holds with an additive error whose magnitude depends on the distance between the points. We instead assume that the Hessian is stable, which means that the error is multiplicative. This is sufficient to give a simple proof of the global linear convergence of Newton’s method. Further, since our condition is multiplicative, stability is also a scale-free (i.e. affine-invariant) condition.
The assumption of a stable Hessian was previously used to analyze the statistical properties of logistic regression in , and to analyze the convergence of SGD on logistic regression in [5, 4]. We were inspired by , who obtain an efficient algorithm for matrix scaling using ideas very similar to ours.
Our main contribution is a straightforward affine-invariant proof of the global linear convergence of Newton’s algorithm, without resorting to strong convexity or Lipschitz continuous gradients (Section 3). We instead rely only on a natural multiplicative notion of stability of the Hessian (Section 2). This shows an exponential gap between the global convergence rates of first-order and second-order methods for a wide class of functions, placing Newton-type methods on a strong theoretical footing. Further, in Section 4, we relax stability and show that a local notion of stability is sufficient to guarantee linear convergence for trust-region Newton methods. Finally, we show in Section 5 that linear convergence persists when using inexact and proximal Newton steps.
Newton’s method with backtracking has been shown to be globally convergent for self-concordant functions (), but the resulting rate is difficult to compare directly to gradient-based methods due to its two-phase additive structure. Otherwise, global convergence results for second-order algorithms were known when the function has both strong convexity and Lipschitz gradients ([28, 22]), or by solving cubic subproblems ([27, 24, 9]). Similar convergence rates are shown for the inexact Newton method in [30, 22]. Empirically, () show that the trust-region Newton method significantly outperforms other methods, and it is hence the default optimization algorithm for a variety of problems in the widely used LIBLINEAR library (). Although in this work we restrict ourselves to convex functions, Newton-type algorithms ([27, 1, 29]) as well as trust-region methods ([11, 12]) have been successfully used to escape saddle points and converge to a local minimum in non-convex settings.
2 Stability of the Hessian
We now formally define our notion of a stable Hessian and show that it is implied by many other standard assumptions. We will also demonstrate that our condition is satisfied for a large class of problems to which Newton’s method is usually applied. As is standard, we will assume that the level set of the function is bounded. In particular, the level set has a bounded diameter , which is defined as
2.1 Definition of stability
Here we present an affine-invariant definition of a stable Hessian. For any vector and positive semi-definite matrix , let denote the corresponding semi-norm .
Assumption A (-stable Hessian).
For any and , we assume and that there exists a constant such that¹

¹This assumption can be relaxed—instead of holding for all of , we only need the condition to hold for , and as well as , for all and .
Assumption A allows us to derive global upper and lower bounds on the function for . In contrast to standard assumptions such as strong convexity, smoothness, or a Lipschitz Hessian, stability is affine invariant:
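For intuition, a multiplicative stability bound of this kind can be sketched as follows, where $f$, the constant $c \ge 1$, and the level set $\mathcal{Q}$ are our illustrative notation (not necessarily the paper's displayed form):

```latex
\frac{1}{c}\,\nabla^2 f(\mathbf{y}) \;\preceq\; \nabla^2 f(\mathbf{x}) \;\preceq\; c\,\nabla^2 f(\mathbf{y})
\qquad \text{for all } \mathbf{x}, \mathbf{y} \in \mathcal{Q}.
```

Replacing $f$ by $f \circ A$ for an invertible affine map $A$ transforms both sides by the same congruence $A^\top \nabla^2 f(A\mathbf{x})\, A$, so the bound and hence the constant $c$ are unchanged; this is the affine invariance referred to above.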
2.2 Sufficient conditions
Here we discuss a host of standard assumptions and see how they imply a stable Hessian. The formal definitions of the conditions, as well as the proof of Theorem I, are presented in Appendix A. We also assume that the domain is bounded with diameter .
The following are sufficient conditions for ensuring the stability of the Hessian as defined in Assumption A:
-Lipschitz gradient and -strongly convex ⇒ -stable Hessian,
-Lipschitz Hessian and -strongly convex ⇒ -stable Hessian,
-self-concordant and -Lipschitz gradient ⇒ -stable Hessian, and
-quasi-self-concordant ⇒ -stable Hessian.
For a given matrix we consider functions of the form where is coordinate-wise separable. For learning applications, is typically the data matrix. The objective function may further be regularized by an arbitrary (e.g. regularizer), as we will discuss in Section 5. We can assume that is full-rank; otherwise one can restrict the domain to the range of . Further, let us assume that each row of the matrix is normalized and . Then the affine invariance of stability allows us to transform into a sum of one-dimensional functions where . Since is normalized, . Thus, without loss of generality, we can focus on discussing the stability of one-dimensional functions with a domain diameter less than . Many of the following applications have been adapted from .
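As a concrete sketch of this separable structure, consider the logistic loss, one standard choice of the coordinate-wise function; the Hessian then takes the form $A^\top D A$ with a diagonal $D$ (all names and the random data below are illustrative):

```python
import numpy as np

def logistic_objective(A, b, x):
    # f(x) = sum_i log(1 + exp(-b_i * a_i^T x)): a coordinate-wise
    # separable loss applied to the rows of the data matrix A.
    z = b * (A @ x)
    return np.sum(np.log1p(np.exp(-z)))

def logistic_hessian(A, b, x):
    # Hessian A^T D A with D_ii = sigma(z_i) (1 - sigma(z_i)).
    z = b * (A @ x)
    s = 1.0 / (1.0 + np.exp(-z))
    D = s * (1.0 - s)
    return A.T @ (D[:, None] * A)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))
A /= np.linalg.norm(A, axis=1, keepdims=True)  # normalized rows
b = np.sign(rng.standard_normal(20))
f0 = logistic_objective(A, b, np.zeros(3))
H = logistic_hessian(A, b, np.zeros(3))
```

Since the diagonal weights depend on the point only through the one-dimensional projections, stability of the scalar loss transfers to the full objective.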
Wasserstein distance: Functions of the form are also -stable. The dual of the entropy-regularized Wasserstein distance is of this form .
Boosting: Ada-boost can be seen as a first-order algorithm on an exponential loss function (cf. Chapter 6, ).
Self-concordant functions: As was shown in Theorem I, all self-concordant functions (e.g. logarithmic barriers) with bounded domain and Lipschitz gradients are stable.
Entropy regularizer: The standard entropy function also fits into our framework, assuming bounded domain for . The Hessian of the entropy function is and so is -stable.
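As a sketch of this computation (in our own notation), for the negative entropy $f(\mathbf{x}) = \sum_i x_i \log x_i$ the Hessian is diagonal, and the multiplicative comparison between two points is controlled by coordinate ratios:

```latex
\nabla^2 f(\mathbf{x}) = \operatorname{diag}\!\left(\frac{1}{x_1}, \dots, \frac{1}{x_n}\right),
\qquad
\frac{\min_i y_i}{\max_i x_i}\, \nabla^2 f(\mathbf{y})
\;\preceq\; \nabla^2 f(\mathbf{x}) \;\preceq\;
\frac{\max_i y_i}{\min_i x_i}\, \nabla^2 f(\mathbf{y}).
```

On a domain where every coordinate is bounded away from zero and from above, these ratios are bounded, which is exactly the multiplicative control that stability requires.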
Robust regression: Instead of the standard least-squares loss,  consider a more robust version which is for with a Hessian . Assuming a bounded domain for , the function is -stable.
While some of these constants may seem large (e.g. the in logistic regression), in Section 4 we will see a local notion of stability which gets around the super-linear dependence on .
3 Convergence of exact Newton’s method
The convergence of Newton’s method follows in a straightforward manner from the definition of a stable Hessian. To demonstrate the core idea, let us look at the simplest case—Newton’s algorithm on a twice differentiable function using exact inversion of the Hessian (or its pseudo-inverse), as presented in Algorithm 1. We will later extend the algorithm and relax many of these assumptions. The algorithm uses a fixed step-size , which can easily be made adaptive (see Appendix B) at a mild additional cost.
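A sketch of this fixed step-size iteration on a toy convex problem (the test function, the step-size 0.5, and all names are illustrative assumptions, not Algorithm 1 verbatim):

```python
import numpy as np

def newton_method(grad, hess, x0, eta, n_steps=50):
    # Exact Newton iterations with a fixed step-size eta; the
    # pseudo-inverse keeps the update defined for singular Hessians.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * np.linalg.pinv(hess(x)) @ grad(x)
    return x

# f(x) = sum_i (exp(x_i) + exp(-x_i)): convex, minimized at 0,
# with gradients that are not globally Lipschitz.
grad = lambda x: np.exp(x) - np.exp(-x)
hess = lambda x: np.diag(np.exp(x) + np.exp(-x))
x_min = newton_method(grad, hess, [1.0, -2.0], eta=0.5)
```

With a damped step-size the iterates contract toward the minimizer linearly, matching the flavor of the global guarantee proved in this section.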
As we noted before, the assumption that the Hessian is stable allows us to provide global upper and lower bounds on the function value (the proof is given in Appendix C.2).
Given Assumption A, for any ,
The bounds above only hold for as defined in (3). To use the lemma, we need that for all . For this, it suffices to show that Algorithm 1 is a descent method. For now, let us set aside this technicality—the proof can be found in Appendix C.3.
Here note that . Now minimizing both sides of the lower bound (5) gives
Using the above bound with , we get
Subtracting from both sides, and iterating from to proves the theorem. ∎
4 Trust region Newton’s method
The convergence rate of Newton’s algorithm in Theorem II critically depends on the constant , which is a global measure bounding the relative change of the Hessian of around the current point . Often, the value of depends on the diameter of the domain . For example, as we discussed in Section 2.3, logistic regression and the exponential loss are -stable, which can be a large value. In this section, we design an algorithm whose convergence depends only on a local measure of stability, getting around the potentially exponential dependence on .
4.1 Local stability
We introduce a local measure of stability , which is typically much smaller than . This notion captures the multiplicative change in the Hessian in a small ball of radius around the current point , measured in an arbitrary norm .
Assumption B (-locally stable with respect to ).
For any such that and , we assume that and that there exists a constant for which the following holds
Since the norm may not be affine invariant, the resulting constant is also not necessarily affine invariant. It is, however, possible to circumvent this limitation (see Section D in the appendix).
4.2 Trust-region Algorithm
Trust-region methods restrict each update to a small ball of radius around , and are therefore more ‘local’ algorithms.
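A minimal sketch of the constrained quadratic step such methods solve at each iteration; the bisection on the Lagrange multiplier below is one standard way to solve the Euclidean-ball subproblem, not necessarily the solver assumed by Algorithm 2:

```python
import numpy as np

def trust_region_step(g, H, radius):
    # Solve min_s g^T s + 0.5 s^T H s  subject to ||s||_2 <= radius.
    # Take the full Newton step if it fits in the ball; otherwise find
    # the boundary step -(H + lam I)^{-1} g via bisection on lam >= 0.
    s = -np.linalg.pinv(H) @ g
    if np.linalg.norm(s) <= radius:
        return s
    lo, hi = 0.0, 1.0
    # grow hi until the regularized step fits inside the ball
    while np.linalg.norm(np.linalg.solve(H + hi * np.eye(len(g)), g)) > radius:
        hi *= 2.0
    for _ in range(100):  # bisection on the multiplier lam
        lam = 0.5 * (lo + hi)
        s = -np.linalg.solve(H + lam * np.eye(len(g)), g)
        if np.linalg.norm(s) > radius:
            lo = lam
        else:
            hi = lam
    return s

g, H = np.array([4.0, 0.0]), np.eye(2)
s = trust_region_step(g, H, radius=1.0)
```

For the toy data above the unconstrained step has norm 4, so the solver returns the boundary point of the unit ball along the steepest-descent direction.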
4.3 Convergence analysis
The proof of Theorem III is very similar to that of Theorem II. The main deviation is the derivation of tighter lower and upper bounds that depend on the local bound instead of the global parameter . This is detailed in Lemma 5 in Appendix C.4. At any iteration , we get that for any such that , the following holds
The last inequality is trivial (with equality) when minimizing unbounded quadratics, but is also valid when minimizing over convex domains (see Lemma 6 in Appendix C.4). Let us define and the point . Then
Combining this with our previous observation gives
The last inequality used the lower bound from (7). We now relate the term to the actual minimum value , which we do by using the convexity of the function .
Adding and subtracting from the left side, rearranging the terms, and iterating over finishes the proof. ∎
4.4 Improvement in the rate of convergence
In a number of the applications we saw in Section 2.3, the dependence of on the diameter was super-linear (and even exponential). Local stability gets around this and ensures that the rate of convergence of Algorithm 2 depends at most linearly on .
As an illustrative example, consider logistic regression or the exponential loss. The local stability scales as for . The rate of convergence of Newton’s method would depend on . On the other hand, using the optimal trust-region radius , the rate for the trust-region method becomes . This result makes a very strong case for using trust-region Newton methods.
There are two points to note here. First, one might ask if a similar improvement could be shown for the simpler Newton’s method equipped with a line search. We answer this in the negative in Section 6. Second, as we noted before, trust-region methods are not affine invariant, and moreover require solving the Newton step with an additional constraint. In the appendix (Section D), we show an affine-invariant algorithm requiring only the minimization of quadratics over the domain .
5 Approximate and proximal extensions
We can extend our analysis of Newton’s method to the proximal setting to minimize a composite objective function, i.e.
where, as before, is a twice differentiable convex function, and is a possibly non-differentiable, extended-valued convex function.
5.1 Inexact Newton steps
In this section we make two relaxations: first, we no longer require the exact Hessian, and second, we no longer require the quadratic subproblem to be solved exactly. At each iteration with iterate , we assume access to the exact gradient , but only an approximation of the Hessian .
Below we list a few scenarios where this notion of an approximate Hessian is useful:
In machine learning and signal processing applications, the function is typically of the form where is a simple, separable function and is a data matrix. In such cases, the Hessian , where is very cheap to compute (the same cost as computing the gradient). Instead of using the full matrix , a low-dimensional sketch is used. This provides guarantees satisfying (C) while ensuring cheap update steps (cf. [18, 17]).
Hessian-free inexact methods. If we use first-order algorithms to minimize , we only require products of the Hessian with a vector. Such products can be computed without forming or storing the entire Hessian matrix. The resulting algorithms are inexpensive, with costs comparable to first-order methods (cf. [7, Section 6.1]).
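The Hessian-free idea above can be sketched as follows, assuming a finite-difference Hessian-vector product (automatic differentiation is the other common route) inside the conjugate gradient method; all names and the quadratic test problem are illustrative:

```python
import numpy as np

def hvp(grad, x, v, eps=1e-6):
    # Finite-difference Hessian-vector product:
    # H(x) v ~ (g(x + eps v) - g(x - eps v)) / (2 eps),
    # so the Hessian is never formed or stored.
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

def cg_newton_direction(grad, x, n_iter=50, tol=1e-10):
    # Conjugate gradient on H s = -g using only Hessian-vector products.
    b = -grad(x)
    s = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(n_iter):
        Hp = hvp(grad, x, p)
        alpha = rs / (p @ Hp)
        s = s + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return s

# Quadratic check: for f(x) = 0.5 x^T A x - b^T x, the Newton
# direction satisfies x + s = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b_vec = np.array([1.0, -1.0])
grad = lambda x: A @ x - b_vec
x = np.array([0.5, 0.5])
s = cg_newton_direction(grad, x)
```

Each CG iteration costs one Hessian-vector product, i.e. two extra gradient evaluations, which is what keeps the per-step cost comparable to first-order methods.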
Using , we form a subproblem as in Step 3. We then assume that at each iteration our subproblem is solved to an arbitrary multiplicative accuracy, and only in expectation over some randomness of the subproblem algorithm. In particular, we assume the update is computed as in Step 5 for any fixed . Note that if , this means and that the subproblem was solved exactly.
5.2 Convergence analysis
We need to quantify the approximation quality of the Hessian estimate.
We assume that there exists a constant such that for any , and as well as the following holds
Unfortunately the definition of is not necessarily affine invariant, but it does enable efficient approximations of the Hessian.
6 Optimality of results
Linear vs. quadratic convergence.
When is self-concordant, or strongly convex and smooth, Newton’s method is known to converge quadratically when close enough to the optimum ([8, Section 9.5]). This was crucial in designing generic interior-point algorithms (), and so one might ask if we can show similar local quadratic convergence for functions with stable Hessians. We give a simple counterexample for which Newton’s method only achieves linear convergence. Consider for some large and . The function has a minimum value of 0 achieved at 0, and and . The Newton step on this function is , with a decrease in the function value of
While is not globally stable, it is locally stable if at each step we restrict the trust region around to lie within . Thus, running Newton’s method on with this varying trust region would also result in linear convergence, showing that our analysis cannot in general be improved.
Superiority of trust region.
We saw in Section 4 that trust-region Newton methods converge at a rate depending on the local stability of the Hessian. One might wonder whether Newton’s method equipped with a line search could have similar advantages. We provide a negative answer to this question. Consider the two-dimensional function
The minimum of this function is 0, achieved at . Let us pick a starting point . The Newton update with step-size can be computed to be
Suppose we perform an exact line search to pick the best . To simplify computations, we will look at the case where , i.e. when is large. In this setting, the predominant term in the objective is . The optimal in this case is approximately . This means that
Thus we cannot hope to obtain global linear convergence in this case. However, if we instead solve the quadratic problem defined by the Hessian as in Step 3
with the trust region and , then the Hessian changes only by a factor of . This means that the constant , as defined in Assumption B for and using the norm, is at most . Thus we can use a constant step-size , independent of . As before, if we look at what happens when , we get that
This shows that trust region methods can be superior to line search methods, especially with a careful choice of the trust region.
A predominant focus of past work on Newton methods has been to show local quadratic convergence under very restrictive assumptions—both on the function class, as well as on the starting point. Such assumptions are almost never satisfied in practice, especially in machine learning applications. We believe the notion of stability recasts the analysis of Newton-type methods in a manner much more suitable to such applications. Using stability, we show strong global linear convergence rates under conditions in which first-order methods would only achieve sublinear rates—thereby providing a fresh perspective on the performance of a host of classical Newton’s methods.
There are a number of follow-up questions which arise out of this work. Using the estimate sequence framework of , it is possible to accelerate the exact Newton’s method. However it is unclear if such an acceleration could also be achieved for the trust-region methods, or for the approximate and proximal extensions. Further, our theory indicates that the radius of the trust region is crucial for ensuring fast convergence. Although adaptive methods exist for picking the step-size (Appendix B), designing and evaluating theoretically justified adaptive schemes for picking the trust-region radius would be a fruitful direction. Finally, the notion of stability is restricted to convex functions—generalizing insights here to the non-convex setting remains a challenging open problem.
-  Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195–1199. ACM, 2017.
-  Naman Agarwal, Brian Bullins, and Elad Hazan. Second-Order Stochastic Optimization for Machine Learning in Linear Time. arXiv:1602.03943 [cs, stat], February 2016.
-  Francis Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.
-  Francis Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15:595–627, 2014.
-  Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 773–781. Curran Associates, Inc., 2013.
-  Albert A Bennett. Newton’s method in general analysis. Proceedings of the National Academy of Sciences, 2(10):592–598, 1916.
-  L. Bottou, F. Curtis, and J. Nocedal. Optimization Methods for Large-Scale Machine Learning. SIAM Review, 60(2):223–311, January 2018.
-  Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
-  Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, April 2011.
-  Michael B. Cohen, Aleksander Madry, Dimitris Tsipras, and Adrian Vladu. Matrix Scaling and Balancing via Box Constrained Newton’s Method and Interior Point Methods. arXiv:1704.02310 [cs], April 2017.
-  Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust-Region Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
-  Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of O(ε^{-3/2}) for nonconvex optimization. Mathematical Programming, 162(1-2):1–32, 2017.
-  Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, pages 2292–2300, USA, 2013. Curran Associates Inc.
-  Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. Journal of machine learning research, 9(Aug):1871–1874, 2008.
-  Wenbo Gao and Donald Goldfarb. Quasi-Newton Methods: Superlinear Convergence Without Line Search for Self-Concordant Functions. arXiv:1612.06965 [math], December 2016.
-  Matilde Gargiani, Celestine Dunner, and Martin Jaggi. Hessian-CoCoA: A general parallel and distributed framework for non-strongly convex regularizers. June 2017.
-  Robert M. Gower, Filip Hanzely, Peter Richtárik, and Sebastian Stich. Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order Optimization. arXiv:1802.04079 [cs, math], February 2018.
-  Robert M. Gower and Peter Richtárik. Randomized Quasi-Newton Updates are Linearly Convergent Matrix Inversion Algorithms. arXiv:1602.01768 [cs, math], February 2016.
-  Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo Parrilo. A globally convergent incremental newton method. Mathematical Programming, 151(1):283–313, 2015.
-  Leonid Vital’evich Kantorovich. Functional analysis and applied mathematics. Uspekhi Matematicheskikh Nauk, 3(6):89–185, 1948.
-  Sai Praneeth Reddy Karimireddy, Sebastian Stich, and Martin Jaggi. Adaptive balancing of gradient and update computation times using global geometry and approximate subproblems. In International Conference on Artificial Intelligence and Statistics, pages 1204–1213, March 2018.
-  Ching-pei Lee and Stephen J. Wright. Inexact Successive Quadratic Approximation for Regularized Optimization. arXiv:1803.01298 [math], March 2018.
-  Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust Region Newton Method for Logistic Regression. Journal of Machine Learning Research, 9(Apr):627–650, 2008.
-  Yurii Nesterov. Accelerating the Cubic Regularization of Newton’s Method on Convex Problems. SSRN Scholarly Paper ID 885933, Social Science Research Network, Rochester, NY, September 2005.
-  Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1 edition, 2014.
-  Yurii Nesterov and Arkadii Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, 1994.
-  Yurii Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, August 2006.
-  B. T. Polyak. Newton-Kantorovich Method and Its Global Convergence. Journal of Mathematical Sciences, 133(4):1513–1523, March 2006.
-  Clément W Royer and Stephen J Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM Journal on Optimization, 28(2):1448–1477, 2018.
-  Katya Scheinberg and Xiaocheng Tang. Practical inexact proximal quasi-newton method with global complexity analysis. Mathematical Programming, 160(1-2):495–529, 2016.
-  Shai Shalev-Shwartz and Yoram Singer. Online learning: Theory, algorithms, and applications. PhD thesis, Hebrew University, 2007.
-  Virginia Smith, Simone Forte, Chenxin Ma, Martin Takac, Michael I. Jordan, and Martin Jaggi. CoCoA: A General Framework for Communication-Efficient Distributed Optimization. arXiv:1611.02189 [cs], November 2016.
-  Tianxiao Sun and Quoc Tran-Dinh. Generalized Self-Concordant Functions: A Recipe for Newton-Type Methods. arXiv:1703.04599 [math, stat], March 2017.
-  Stephen J Wright. Primal-dual interior-point methods. Siam, 1997.
-  Huan Xu, Constantine Caramanis, and Shie Mannor. Robust regression and lasso. In Advances in Neural Information Processing Systems, pages 1801–1808, 2009.
Appendix A Sufficient conditions for stability
Here we give the definitions of the various conditions discussed in Section 2. First, some notation:
We will restate the definitions of these conditions using our new notation. For any ,
-stable Hessian: .
-Lipschitz gradients: .
-strongly convex: .
-Lipschitz Hessian: .
Also, recall the diameter of the level set .
Proof of Theorem I.
Let us prove the Theorem case by case.
-Lipschitz gradient and -strongly convex ⇒ -stable Hessian.
Using the definitions of the three terms,
-Lipschitz Hessian and -strongly convex ⇒ -stable Hessian.
The definition of -Lipschitz Hessian implies that
Now combining this with the definition of stability and strong convexity,
-self-concordant and -Lipschitz gradient ⇒ -stable Hessian.
We use the proof technique from [15, Lemma 3.2]. Define . Assuming is thrice differentiable, using the definition of self-concordance
This means that the definition of self-concordance implies that
Since is -Lipschitz, this means
Now setting in the definition of and multiplying the above equation by we get
Using the definition of Lipschitz gradient, and the bound on the diameter of , we get that for all ,
-quasi-self-concordant ⇒ -stable Hessian.
This statement is directly taken from [3, Proposition 1]. Define as before . The definition of -quasi-self-concordance implies that
If we consider the function , the above equation shows that
which in turn means
Again setting in the definition of gives us that
Appendix B Line search strategies
All algorithms we have discussed in this paper assume that the value of is set correctly. This assumption can easily be relaxed by using line-search strategies. There has been a significant amount of work on different line-search strategies, and we will not attempt to provide a complete survey; instead we point to (). Among these methods, the backtracking strategy employed by () to make cubic regularization techniques adaptive is especially suited to our setting.
It is easy to adapt the theoretical guarantees and techniques used in [9, 30] for the analysis of this backtracking strategy to our setting. In this way we both remove the need to know and make the method adaptive. The details are summarized in Algorithm 4.
Throughout Algorithm 4, we assumed that the only unknown parameter is . However, when running trust-region algorithms, we would perhaps like to adapt both the trust-region radius and . While it is possible to design such an adaptive trust-region strategy using insights from our proof, we leave the analysis and evaluation of such strategies for future work.
Necessity of step-size.
Theorems II, III and IV show that choosing the appropriate ensures global linear convergence. In the case where , this corresponds to using a step-size of . Here we show that this is not simply an artifact of the analysis—the use of is actually necessary to ensure global convergence. Consider the univariate function
This function is convex with gradient , second derivative , and minimum value 0 achieved at . It satisfies our condition (Assumption A) of a stable Hessian with , where is the diameter of the level set. Suppose we start at for . Applying a Newton update with step-size gives . Let us assume the step-size , and to simplify computations. When is large, the predominant term of is if and if . In the setting where , and —we have veered too far to the left. Instead, using a step-size would ensure a descent step. In fact, this example also showcases the advantage of adaptive step-sizes. Using a fixed step-size of either or even would require an exponential (in ) number of iterations to converge. Instead, using an adaptive step-size of , where is the current position, would give convergence in a polynomial number of steps.
Appendix C Additional proofs
C.1 Proof of affine invariance of stability (Lemma 1)
C.2 Proof of lower and upper bounds (Lemma 2)
The proof of the Lemma follows from the second-order Taylor expansion of around . Taylor’s theorem gives us that for any there exists a such that for ,
Since is convex, and by substituting , in Assumption A,
Substituting , we have
Again by substituting , we can finish the proof as
C.3 Proof of descent (Lemma 3)
For some , let us assume that . The base case is trivially true. If , we are already at an optimum and , proving our lemma. Otherwise we proceed as below.
We know that is a descent direction [8, Section 9.2]. This means there exists a small enough such that for , meaning . Applying Assumption A with and , we get that . In particular, this implies that is in the range of , and so the update is well-defined.
We are now left with the task of proving . Note that also implies . This is a sufficient condition to ensure that the Newton step is a descent direction [8, Section 9.2]. This means there exists such that for , we have . Hence . Let us define the auxiliary function for . The function is continuous in since is a continuous function and is a continuous map. Moreover, we have that