1 Introduction
Newton’s method is one of the earliest algorithms for the minimization of an unconstrained convex objective function $f$,
(1) $\min_{x \in \mathbb{R}^d} f(x)\,,$
and iteratively performs the following update for some stepsize $\eta > 0$,
(2) $x_{t+1} = x_t - \eta\,[\nabla^2 f(x_t)]^{-1} \nabla f(x_t)\,.$
Here $f$ is assumed to be a twice-differentiable convex function. In contrast to the classical literature, we do not assume that the function is smooth (i.e., that the gradient is Lipschitz continuous), nor do we assume strong convexity.
Popularized in its present form by Bennett [6] and Kantorovich [20], Newton’s method has been an immensely important algorithm for optimization. Though there has been significant work analyzing and extending the standard scheme (2), global convergence results remain few and unsatisfactory (cf. [27] and references therein). In a seminal result, Nesterov and Nemirovskii [26] show that Newton’s algorithm achieves local quadratic convergence. However, the conditions under which quadratic convergence occurs are too restrictive—they require both that the function be self-concordant and that the starting point be almost at the optimum. Neither of these conditions is typically satisfied when applying Newton’s method to minimize functions of the form (1) in applications. Most global convergence results either i) are hard to compare with gradient descent and make strong assumptions on $f$ (e.g. [26, 28]), or ii) have a rate which is slower than that of vanilla gradient descent (e.g. [19, 22]). An exception to this is the breakthrough result by Nesterov and Polyak [27], who obtain an $O(1/t^2)$ rate (later accelerated to $O(1/t^3)$ in [24]) by solving cubic subproblems. These rates do not assume strong convexity or Lipschitz gradients. However, solving cubic subproblems is impractical even for medium-sized problems.
On the other hand, there have been recent efforts toward efficient approximations of (2) computable in time comparable to that required for a gradient update ([2, 17, 22]). So far, these methods have not enjoyed any global convergence rates better than those of first-order methods.
A new regularity condition.
Most analyses of Newton-type algorithms assume that for nearby points $x$ and $y$, the Hessians $\nabla^2 f(x)$ and $\nabla^2 f(y)$ are also close. In particular, the results on cubic regularization (e.g. [27]) assume that the Hessian is Lipschitz. This is equivalent to assuming that the Hessians agree up to an additive error whose magnitude depends on the distance $\|x - y\|$. We instead assume that the Hessian is stable, which means that the error is multiplicative. This is sufficient to give a simple proof of the global linear convergence of Newton’s method. Further, since our condition is multiplicative, stability is also a scale-free (i.e. affine-invariant) condition.
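Concretely, the two conditions can be contrasted as follows (the symbols here are our reconstruction of the notation: $M$ denotes a Hessian-Lipschitz constant and $c \ge 1$ a stability constant):

```latex
% Additive closeness (implied by an M-Lipschitz Hessian):
\nabla^2 f(y) - M\|x - y\|\, I \;\preceq\; \nabla^2 f(x) \;\preceq\; \nabla^2 f(y) + M\|x - y\|\, I
% Multiplicative closeness (stable Hessian):
\tfrac{1}{c}\, \nabla^2 f(y) \;\preceq\; \nabla^2 f(x) \;\preceq\; c\, \nabla^2 f(y)
```

The multiplicative form is unchanged under any invertible affine reparametrization $x \mapsto Bx$, whereas the additive form is not.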
The assumption of a stable Hessian was previously used to analyze the statistical properties of logistic regression in [3], and to analyze the convergence of SGD on logistic regression in [4, 5]. We were inspired by [10], who obtain an efficient algorithm for matrix scaling using ideas very similar to ours.
Our contributions.
Our main contribution is a straightforward affine-invariant proof of global linear convergence of Newton’s algorithm, without resorting to strong convexity or Lipschitz-continuous gradients (Section 3). We instead rely only on a natural multiplicative notion of stability of the Hessian (Section 2). This shows an exponential gap between global convergence rates of first-order and second-order methods for a wide class of functions, placing Newton-type methods on a strong theoretical footing. Further, in Section 4, we relax stability and show that a local notion of stability is sufficient to guarantee linear convergence for trust-region Newton methods. Finally, we show in Section 5 that linear convergence persists when using inexact and proximal Newton steps.
Related work.
Newton’s method with backtracking has been shown to be globally convergent for self-concordant functions ([26]), but the resulting rate is difficult to compare directly to gradient-based methods due to its two-phase additive structure. Otherwise, global convergence results for second-order algorithms were known when $f$ is both strongly convex and has Lipschitz gradients ([28, 22]), or by solving cubic subproblems ([27, 24, 9]). Similar convergence rates are shown for the inexact Newton method in ([30, 22]). Empirically, ([23]) show that the trust-region Newton method significantly outperforms other methods, and it is hence the default optimization algorithm for a variety of problems in the widely used LIBLINEAR library ([14]). Although in this work we restrict ourselves to convex functions, Newton-type algorithms ([27, 1, 29]) as well as trust-region methods ([11, 12]) have been successfully used to escape saddle points and converge to a local minimum in nonconvex settings.
2 Stability of the Hessian
We now formally define our notion of a stable Hessian and show that it is implied by many other standard assumptions. We will also demonstrate that our condition is satisfied for a large class of problems to which Newton’s method is usually applied. As is standard, we will assume that the level set of the function is bounded. In particular, the set $\mathcal{X}$ has a bounded diameter $D$, where $\mathcal{X}$ is defined as
(3) $\mathcal{X} := \{x \in \mathbb{R}^d : f(x) \le f(x_0)\}\,, \qquad D := \sup_{x, y \in \mathcal{X}} \|x - y\|\,.$
2.1 Definition of stability
Here we present an affine-invariant definition of a stable Hessian. For any vector $v$ and positive semidefinite matrix $M$, let $\|v\|_M$ denote the seminorm $\sqrt{v^\top M v}$.
Assumption A (stable Hessian).
For any $x, y \in \mathcal{X}$, we assume $\ker(\nabla^2 f(x)) = \ker(\nabla^2 f(y))$ and that there exists a constant $c \ge 1$ such that $\frac{1}{c}\,\nabla^2 f(y) \preceq \nabla^2 f(x) \preceq c\,\nabla^2 f(y)$.^{1}
^{1} This assumption can be relaxed—instead of holding for all of $\mathcal{X}$, we only need the condition to hold along the iterates and segments actually visited by the algorithm.
Assumption A allows us to derive global upper and lower bounds on the function $f$ over $\mathcal{X}$. In contrast to standard assumptions such as strong convexity, smoothness, or a Lipschitz Hessian, stability is affine invariant:
Lemma 1.
Stability (Assumption A) is invariant under invertible affine transformations of the domain.
2.2 Sufficient conditions
Here we discuss a host of standard assumptions and see how they imply a stable Hessian. The formal definitions of the conditions, as well as the proof of Theorem I, are presented in Appendix A. We also assume that the domain is bounded with a diameter $D$.
Theorem I.
The following are sufficient conditions for ensuring the stability of the Hessian as defined in Assumption A:
(a) Lipschitz gradient and strongly convex $\Rightarrow$ stable Hessian,
(b) Lipschitz Hessian and strongly convex $\Rightarrow$ stable Hessian,
(c) self-concordant and Lipschitz gradient $\Rightarrow$ stable Hessian, and
(d) quasi-self-concordant $\Rightarrow$ stable Hessian.
2.3 Applications
For a given matrix $A$ we consider functions of the form $f(x) = \Phi(Ax)$, where $\Phi$ is coordinate-wise separable. For learning applications, $A$ is typically the data matrix. The objective function may further be regularized by an arbitrary convex regularizer, as we will discuss in Section 5. We can assume that $A$ is full-rank; otherwise one can restrict the domain to the range of $A$. Further, let us also assume that each row of the matrix $A$ is normalized. Then the affine invariance of stability allows us to transform $f$ into a sum of one-dimensional functions, and since the rows are normalized, the effective one-dimensional domain has diameter at most $D$. Thus, without loss of generality, we can focus on discussing the stability of one-dimensional functions with a domain diameter less than $D$. Many of the following applications have been adapted from [33].

(a) Logistic regression: The loss function $\phi(t) = \log(1 + e^{-t})$ is shown to be quasi-self-concordant in [3], and so is stable.
(b) Wasserstein distance: Losses of an exponential type are also stable. The dual of the entropy-regularized Wasserstein distance is of this form [13].
(c) Boosting: AdaBoost can be seen as a first-order algorithm on an exponential loss function (cf. Chapter 6, [31]).
(d) Self-concordant functions: As was shown in Theorem I, all self-concordant functions (e.g. logarithmic barriers) with bounded domain and Lipschitz gradients are stable.
(e) Entropy regularizer: The standard entropy function $x \mapsto x \log x$ also fits into our framework, assuming the domain is bounded away from zero. The Hessian of the entropy function is $1/x$, and so the function is stable.
(f) Robust regression: Instead of the standard least-squares loss, [35] consider a more robust version. Its second derivative is bounded above and below on a bounded domain, and so the function is stable.
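The quasi-self-concordance of the logistic loss can be checked directly: for $\phi(t) = \log(1 + e^{-t})$ one has $\phi''(t) = \sigma(t)\sigma(-t)$ and $\phi'''(t) = \phi''(t)\,(1 - 2\sigma(t))$, so $|\phi'''| \le \phi''$. A small numerical sanity check of this bound (our own illustration, not part of the paper):

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_second(t):
    # phi''(t) for phi(t) = log(1 + exp(-t))
    return sigma(t) * sigma(-t)

def logistic_third(t):
    # phi'''(t) = phi''(t) * (1 - 2*sigma(t))
    return logistic_second(t) * (1.0 - 2.0 * sigma(t))

# |phi'''(t)| <= phi''(t) on a grid: quasi-self-concordance with constant 1.
ts = np.linspace(-20, 20, 2001)
assert np.all(np.abs(logistic_third(ts)) <= logistic_second(ts) + 1e-15)
```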
While some of these stability constants may seem large (e.g. the exponential dependence on the diameter $D$ for logistic regression), in Section 4 we will see a local notion of stability which gets around the superlinear dependence on $D$.
3 Convergence of exact Newton’s method
The convergence of Newton’s method follows in a straightforward manner from the definition of a stable Hessian. To demonstrate the core idea, let us look at the simplest case—Newton’s algorithm on a twice-differentiable function using the exact inverse of the Hessian (or its pseudo-inverse), as presented in Algorithm 1. We will later extend the algorithm and relax many of these assumptions. The algorithm uses a fixed stepsize $\eta$. This can easily be made adaptive (see Appendix B) at a mild additional cost.
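As a concrete reference point, a minimal sketch of such an iteration might look as follows (our own illustration: `grad` and `hess` are assumed user-supplied oracles, and the stepsize `eta` is left as a free parameter, whereas the paper ties it to the stability constant):

```python
import numpy as np

def newton_fixed_step(grad, hess, x0, eta, n_iters=50):
    # Damped Newton with a fixed stepsize eta, in the spirit of Algorithm 1.
    # The pseudo-inverse handles a possibly singular Hessian.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - eta * (np.linalg.pinv(hess(x)) @ grad(x))
    return x
```

On a one-dimensional logistic-type objective such as $f(x) = \log(1 + e^{x}) - 0.3\,x$, this converges rapidly to the minimizer $\log(0.3/0.7)$.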
As we noted before, the assumption that the Hessian is stable allows us to provide global upper and lower bounds on the function value (the proof is given in Appendix C.2).
Lemma 2.
Given Assumption A, for any $x, y \in \mathcal{X}$,
(4) $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{c}{2}\, \|y - x\|^2_{\nabla^2 f(x)}\,,$
(5) $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2c}\, \|y - x\|^2_{\nabla^2 f(x)}\,.$
The bounds above only hold for $x, y \in \mathcal{X}$ as defined in (3). To use the lemma, we need that $x_t \in \mathcal{X}$ for all $t$. For this, it suffices to show that Algorithm 1 is a descent method. For now, let us assume this technicality—the proof can be found in Appendix C.3.
Proof of Theorem II. By Lemma 3, Algorithm 1 is well-defined and is a descent method. This means that both $x_t$ and $x_{t+1}$ lie in $\mathcal{X}$, and we can apply Lemma 2. The upper bound (4), applied to the update step, shows that the function value decreases proportionally to the Newton decrement. Minimizing both sides of the lower bound (5) over $y$ relates this decrement to the suboptimality $f(x_t) - f(x^\star)$. Combining the two bounds, subtracting $f(x^\star)$ from both sides, and iterating over $t$ proves the theorem. ∎
4 Trust region Newton’s method
The convergence rate of Newton’s algorithm in Theorem II critically depends on the stability constant $c$, which is a global measure bounding the relative change of the Hessian around the current point. Often, the value of $c$ depends on the diameter $D$ of the domain. E.g., as we discussed in Section 2.3, for logistic regression and the exponential loss the stability constant can grow exponentially with $D$, and can hence be very large. In this section, we design an algorithm whose convergence depends only on a local measure of stability, getting around the potentially exponential dependence on $D$.
4.1 Local stability
We introduce a local measure of stability, which is typically much smaller than the global constant $c$. This notion captures the multiplicative change of the Hessian in a small ball of radius $r$ around the current point, measured in an arbitrary norm $\|\cdot\|$.
Assumption B (locally stable with respect to ).
For any $x, y \in \mathcal{X}$ such that $\|y - x\| \le r$, we assume $\ker(\nabla^2 f(x)) = \ker(\nabla^2 f(y))$ and that there exists a constant $c_r \ge 1$ for which the following holds
Since the norm $\|\cdot\|$ may not be affine invariant, the resulting constant $c_r$ is also not necessarily affine invariant. It is, however, possible to circumvent this limitation (see Section D in the Appendix).
4.2 Trustregion Algorithm
Trust-region methods restrict each update to a small ball of radius $r$ around the current iterate, and so are more ‘local’ algorithms.
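A minimal sketch of such a method (our own simplification: rather than minimizing the quadratic model exactly over the ball, as the paper's Algorithm 2 and its analysis assume, the Newton step is simply clipped back to the trust-region boundary when it is too long):

```python
import numpy as np

def trust_region_newton(grad, hess, x0, radius, n_iters=100):
    # Sketch of a trust-region Newton iteration: compute the Newton
    # direction, but clip its length back to the trust radius.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g, H = grad(x), hess(x)
        step = -np.linalg.pinv(H) @ g
        norm = np.linalg.norm(step)
        if norm > radius:
            step *= radius / norm  # stay inside the trust region
        x = x + step
    return x
```

Exactly solving the constrained quadratic requires a dedicated subproblem solver (cf. [11]); the clipping here is only meant to convey the structure of the iteration.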
4.3 Convergence analysis
Theorem III.
Proof.
The proof of Theorem III is very similar to that of Theorem II. The main deviation is the derivation of tighter lower and upper bounds that depend on the local constant $c_r$ instead of the global parameter $c$. This is detailed in Lemma 5 in Appendix C.4. At any iteration $t$, we get that for any $y$ such that $\|y - x_t\| \le r$ the following holds
(6) $f(y) \le f(x_t) + \langle \nabla f(x_t), y - x_t \rangle + \frac{c_r}{2}\, \|y - x_t\|^2_{\nabla^2 f(x_t)}\,,$
(7) $f(y) \ge f(x_t) + \langle \nabla f(x_t), y - x_t \rangle + \frac{1}{2 c_r}\, \|y - x_t\|^2_{\nabla^2 f(x_t)}\,.$
The upper bound (6), combined with the update in Step 3, implies that for any $y$ in the trust region,
The last inequality is trivial (with an equality) when minimizing unbounded quadratics, but is also valid when minimizing over convex domains (see Lemma 6 in Appendix C.4). Then
Combining this with our previous observation gives
The last inequality used the lower bound from (7). Now we have to relate this term to the actual minimum value $f(x^\star)$, which we do using the convexity of the function $f$.
Adding and subtracting $f(x^\star)$ on the left-hand side, rearranging the terms, and iterating over $t$ finishes the proof. ∎
4.4 Improvement in the rate of convergence
In a number of the applications we saw in Section 2.3, the dependence of the global constant $c$ on the diameter $D$ was superlinear (and even exponential). Local stability gets around this and ensures that the rate of convergence of Algorithm 2 depends at most linearly on $D$.
Theorem II gives a rate depending on the global stability constant, whereas Theorem III gives a rate depending on the local constant $c_r$ and the radius $r$. Thus the optimal radius $r$ can be computed by balancing these two quantities.
As an illustrative example, consider logistic regression or exponential losses, whose local stability grows exponentially with the radius $r$ rather than with the diameter $D$. The rate of convergence of plain Newton’s method would therefore depend exponentially on $D$. On the other hand, using the optimal trust-region radius, the rate for the trust-region method depends on $D$ only linearly. This result makes a very strong case for using trust-region Newton methods.
There are two points to note here. First, one might ask if a similar improvement could be shown for the simpler Newton method equipped with a line search. We answer in the negative in Section 6. Second, as we noted before, trust-region methods are not affine invariant, and moreover they require solving the Newton step with an additional constraint. In the appendix (Section D), we show an affine-invariant algorithm which only requires minimizing quadratics over the domain.
5 Approximate and proximal extensions
We can extend our analysis of Newton’s method to the proximal setting to minimize a composite objective function, i.e.
(8) $\min_{x}\; F(x) := f(x) + \psi(x)\,,$
where as before $f$ is a twice-differentiable convex function, and $\psi$ is a possibly non-differentiable, extended-valued convex function.
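For intuition, one proximal Newton step for a composite objective of this form with $\psi = \lambda\|\cdot\|_1$ can be sketched as follows (our own illustration: the scaled-prox subproblem is solved approximately by plain proximal-gradient (ISTA) iterations, standing in for whatever inexact subproblem solver is used):

```python
import numpy as np

def prox_l1(v, lam):
    # Soft-thresholding: the prox operator of lam * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def proximal_newton_step(g, H, x, lam, n_inner=200):
    # One proximal Newton step for F(x) = f(x) + lam*||x||_1:
    # approximately minimize the model m(d) = g.d + 0.5 d'Hd + lam*||x+d||_1
    # by proximal gradient on m.
    d = np.zeros_like(x)
    L = np.linalg.norm(H, 2)  # Lipschitz constant of the quadratic part
    for _ in range(n_inner):
        grad_m = g + H @ d
        d = prox_l1(x + d - grad_m / L, lam / L) - x
    return x + d
```

For a one-dimensional example with $f(x) = \tfrac12 (x - 3)^2$ and $\lambda = 1$, a single step from $x = 0$ (where $g = -3$, $H = 1$) lands on the composite minimizer $x = 2$.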
5.1 Inexact Newton steps
In this section we make two relaxations: first, the exact Hessian is no longer required, and second, the quadratic subproblem need not be solved exactly. At each iteration $t$ with iterate $x_t$, we assume access to the exact gradient $\nabla f(x_t)$, and only an approximation $H_t$ of the Hessian $\nabla^2 f(x_t)$.
Approximate Hessian.
Below we list a few scenarios where this notion of an approximate Hessian is useful:
(a) Sketched Hessian. In machine learning and signal processing applications, the function is typically of the form $f(x) = \Phi(Ax)$, where $\Phi$ is a simple, separable function and $A$ is a data matrix. In such cases, the Hessian has the form $A^\top D A$ for a diagonal matrix $D$ which is very cheap to compute (same cost as computing the gradient). Instead of using the full matrix $A$, a low-dimensional sketch of it can be used. This provides guarantees satisfying (C) while ensuring cheap update steps (cf. [18, 17]).
(b) Hessian-free inexact methods. If we use first-order algorithms to minimize the quadratic subproblem, we only require products of the Hessian with a vector. Such products can be computed without forming or storing the entire Hessian matrix. The resulting algorithms are inexpensive, with costs comparable to first-order methods (cf. [7, Section 6.1]).
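The Hessian-free idea can be made concrete with a conjugate-gradient solver that touches the Hessian only through matrix-vector products (a generic sketch, not tied to the paper's algorithm; `hvp` is an assumed user-supplied callable $v \mapsto Hv$):

```python
import numpy as np

def cg_solve(hvp, g, tol=1e-12, max_iters=200):
    # Conjugate gradient for H d = g, where the Hessian H is accessed only
    # through the product hvp(v) = H @ v; H is never formed or stored.
    d = np.zeros_like(g)
    r = g.copy()   # residual g - H d (d = 0 initially)
    p = r.copy()   # search direction
    rs = r @ r
    for _ in range(max_iters):
        if rs < tol:
            break
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```

In practice `hvp` would be implemented via automatic differentiation or a finite difference of gradients, so the full $d \times d$ Hessian is never materialized.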
Approximate subproblems.
Using $H_t$, we form a quadratic subproblem as in Step 3. We then assume that at each iteration this subproblem is solved only to a multiplicative accuracy, and only in expectation over any randomness of the subproblem solver. In particular, we assume the update is computed as in Step 5 for some fixed accuracy parameter. Note that if this parameter is zero, the subproblem is solved exactly.
5.2 Convergence analysis
We need to quantify the approximation quality of the Hessian estimate $H_t$.
Assumption C.
We assume that there exists a constant $c_H \ge 1$ such that for any $x_t \in \mathcal{X}$ the following holds
(C) $\frac{1}{c_H}\, \nabla^2 f(x_t) \preceq H_t \preceq c_H\, \nabla^2 f(x_t)\,.$
Unfortunately this definition is not necessarily affine invariant, but it does enable efficient approximations of the Hessian.
6 Optimality of results
Linear vs. quadratic convergence.
When $f$ is self-concordant, or strongly convex and smooth, Newton’s method is known to converge quadratically when close enough to the optimum ([8, Section 9.5]). This was crucial in designing generic interior-point algorithms ([34]), and so one might ask if we can show similar local quadratic convergence for functions with stable Hessians. We give a simple counterexample for which Newton’s method achieves only linear convergence. Consider the following univariate function, for suitably large parameters. The function has a minimum value of 0 achieved at 0. The Newton step on this function can be computed in closed form, and yields only a constant-factor decrease in the function value per step.
While the function is not globally stable, it is locally stable if at each step we restrict the trust region around the current iterate to lie within its stable neighborhood. Thus running Newton’s method on it with this varying trust region would also result in linear convergence, showing that our analysis can, in general, not be improved.
Superiority of trust region.
We saw in Section 4 that trust-region Newton methods converge at a rate depending on the local stability of the Hessian. One might wonder whether Newton’s method equipped with a line search could have similar advantages. We provide a negative answer to this question. Consider the following two-dimensional function.
The minimum of this function is 0. Let us pick a suitable starting point. The Newton update with stepsize $\eta$ can be computed in closed form.
Suppose we perform an exact line search to pick the best stepsize $\eta$. To simplify computations, we look at the regime where the iterate is far from the optimum. In this setting, a single term dominates the objective, and the optimal stepsize can be approximated in closed form. This means that
Thus we cannot hope to obtain global linear convergence in this case. However, if we instead solve the quadratic problem defined by the Hessian as in Step 3
with an appropriately chosen trust region, then the Hessian changes only by a constant factor over the region. This means that the constant defined in Assumption B is bounded for this radius, and we can use a constant stepsize independent of the starting point. Examining the same asymptotic regime as before, we get that
This shows that trust-region methods can be superior to line-search methods, especially with a careful choice of the trust region.
7 Conclusion
A predominant focus of past work on Newton-type methods has been to show local quadratic convergence under very restrictive assumptions—both on the function class, as well as on the starting point. Such assumptions are almost never satisfied in practice, especially in machine learning applications. We believe the notion of stability recasts the analysis of Newton-type methods in a manner much more suitable to such applications. Using stability, we show strong global linear convergence rates under conditions in which first-order methods would only achieve sublinear rates—thereby providing a fresh perspective on the performance of a host of classical Newton-type methods.
There are a number of followup questions which arise out of this work. Using the estimate sequence framework of [25], it is possible to accelerate the exact Newton’s method. However it is unclear if such an acceleration could also be achieved for the trustregion methods, or for the approximate and proximal extensions. Further, our theory indicates that the radius of the trust region is crucial for ensuring fast convergence. Although adaptive methods exist for picking the stepsize (Appendix B), designing and evaluating theoretically justified adaptive schemes for picking the trustregion radius would be a fruitful direction. Finally, the notion of stability is restricted to convex functions—generalizing insights here to the nonconvex setting remains a challenging open problem.
References

[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195–1199. ACM, 2017.
[2] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. arXiv:1602.03943 [cs, stat], February 2016.
[3] Francis Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.
[4] Francis Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15:595–627, 2014.
[5] Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26, pages 773–781. Curran Associates, Inc., 2013.
[6] Albert A. Bennett. Newton’s method in general analysis. Proceedings of the National Academy of Sciences, 2(10):592–598, 1916.
[7] L. Bottou, F. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, January 2018.
[8] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, April 2011.
[10] Michael B. Cohen, Aleksander Madry, Dimitris Tsipras, and Adrian Vladu. Matrix scaling and balancing via box constrained Newton’s method and interior point methods. arXiv:1704.02310 [cs], April 2017.
[11] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust-Region Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
[12] Frank E. Curtis, Daniel P. Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of $\mathcal{O}(\epsilon^{-3/2})$ for nonconvex optimization. Mathematical Programming, 162(1-2):1–32, 2017.
[13] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, pages 2292–2300, USA, 2013. Curran Associates Inc.
[14] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.
[15] Wenbo Gao and Donald Goldfarb. Quasi-Newton methods: Superlinear convergence without line search for self-concordant functions. arXiv:1612.06965 [math], December 2016.
[16] Matilde Gargiani, Celestine Dunner, and Martin Jaggi. Hessian-CoCoA: A general parallel and distributed framework for non-strongly convex regularizers. June 2017.
[17] Robert M. Gower, Filip Hanzely, Peter Richtárik, and Sebastian Stich. Accelerated stochastic matrix inversion: General theory and speeding up BFGS rules for faster second-order optimization. arXiv:1802.04079 [cs, math], February 2018.
[18] Robert M. Gower and Peter Richtárik. Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. arXiv:1602.01768 [cs, math], February 2016.
[19] Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo Parrilo. A globally convergent incremental Newton method. Mathematical Programming, 151(1):283–313, 2015.
[20] Leonid Vital’evich Kantorovich. Functional analysis and applied mathematics. Uspekhi Matematicheskikh Nauk, 3(6):89–185, 1948.

[21] Sai Praneeth Reddy Karimireddy, Sebastian Stich, and Martin Jaggi. Adaptive balancing of gradient and update computation times using global geometry and approximate subproblems. In International Conference on Artificial Intelligence and Statistics, pages 1204–1213, March 2018.
[22] Ching-pei Lee and Stephen J. Wright. Inexact successive quadratic approximation for regularized optimization. arXiv:1803.01298 [math], March 2018.
[23] Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust region Newton method for logistic regression. Journal of Machine Learning Research, 9(Apr):627–650, 2008.
[24] Yurii Nesterov. Accelerating the cubic regularization of Newton’s method on convex problems. SSRN Scholarly Paper ID 885933, Social Science Research Network, Rochester, NY, September 2005.
[25] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1st edition, 2014.
[26] Yurii Nesterov and Arkadii Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics, 1994.
[27] Yurii Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, August 2006.
[28] B. T. Polyak. Newton–Kantorovich method and its global convergence. Journal of Mathematical Sciences, 133(4):1513–1523, March 2006.
[29] Clément W. Royer and Stephen J. Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM Journal on Optimization, 28(2):1448–1477, 2018.
[30] Katya Scheinberg and Xiaocheng Tang. Practical inexact proximal quasi-Newton method with global complexity analysis. Mathematical Programming, 160(1-2):495–529, 2016.
[31] Shai Shalev-Shwartz and Yoram Singer. Online learning: Theory, algorithms, and applications. PhD thesis, Hebrew University, 2007.
[32] Virginia Smith, Simone Forte, Chenxin Ma, Martin Takac, Michael I. Jordan, and Martin Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. arXiv:1611.02189 [cs], November 2016.
[33] Tianxiao Sun and Quoc Tran-Dinh. Generalized self-concordant functions: A recipe for Newton-type methods. arXiv:1703.04599 [math, stat], March 2017.
[34] Stephen J. Wright. Primal-Dual Interior-Point Methods. SIAM, 1997.
[35] Huan Xu, Constantine Caramanis, and Shie Mannor. Robust regression and Lasso. In Advances in Neural Information Processing Systems, pages 1801–1808, 2009.
Appendix A Sufficient conditions for stability
Here follow the definitions of the various conditions on $f$ discussed in Section 2. First some notation: for $x, y \in \mathcal{X}$, let $\varphi(t) := f(x + t(y - x))$ for $t \in [0, 1]$.
We will restate the definitions of these conditions using our new notation. For any $x, y \in \mathcal{X}$,
- $c$-stable Hessian: $\frac{1}{c}\, \nabla^2 f(y) \preceq \nabla^2 f(x) \preceq c\, \nabla^2 f(y)$.
- $L$-Lipschitz gradients: $\nabla^2 f(x) \preceq L\, I$.
- $\mu$-strongly convex: $\nabla^2 f(x) \succeq \mu\, I$.
- $M$-Lipschitz Hessian: $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le M\, \|x - y\|$.
- self-concordant: $|\varphi'''(t)| \le 2\, \varphi''(t)^{3/2}$.
- quasi-self-concordant: $|\varphi'''(t)| \le M_q\, \|y - x\|\, \varphi''(t)$.
Also, recall the diameter $D$ of the level set $\mathcal{X}$.
Proof of Theorem I.
Let us prove the theorem case by case.
(a) Lipschitz gradient and strongly convex $\Rightarrow$ stable Hessian. Using the definitions of the three terms,
(b) Lipschitz Hessian and strongly convex $\Rightarrow$ stable Hessian. The definition of a Lipschitz Hessian implies that
Now combining this with the definition of stability and strong convexity,

(c) Self-concordant and Lipschitz gradient $\Rightarrow$ stable Hessian.
We use the proof technique from [15, Lemma 3.2]. Define $\varphi(t) := f(x + t(y - x))$. Assuming $f$ is thrice differentiable, the definition of self-concordance implies that
Since the gradient is Lipschitz, this means
Now setting $t = 1$ in the definition of $\varphi$ and multiplying the above equation through, we get
Using the definition of Lipschitz gradients, and the bound on the diameter of $\mathcal{X}$, we get that for all $x, y \in \mathcal{X}$,

(d) Quasi-self-concordant $\Rightarrow$ stable Hessian.
This statement is directly taken from [3, Proposition 1]. Define $\varphi$ as before. The definition of quasi-self-concordance implies that
If we consider the function $\log \varphi''(t)$, the above equation shows that
which in turn means
Again setting $t = 1$ in the definition of $\varphi$ gives us that
Appendix B Line search strategies
All algorithms we have discussed in this paper assume that the value of the stepsize $\eta$ is set correctly. This assumption can easily be relaxed by using line-search strategies. There has been a significant amount of work on different line-search strategies, and we will not attempt to provide a complete survey; instead we point the reader to [11]. Among those methods, the backtracking strategy employed to make the cubic regularization techniques adaptive in [9] is especially suited to our setting.
It is easy to adapt the theoretical guarantees and techniques used in [9, 30] for the analysis of this backtracking strategy to our setting. This way we are able to both remove the necessity of knowing the stepsize in advance and make the method adaptive. The details are summarized in Algorithm 4.
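A minimal sketch of Newton's method with such a backtracking strategy (our own illustration: a standard Armijo sufficient-decrease condition with step halving, rather than the specific acceptance rule of Algorithm 4):

```python
import numpy as np

def newton_backtracking(f, grad, hess, x0, beta=0.5, c1=1e-4, n_iters=50):
    # Newton's method with a backtracking (Armijo) line search: start from
    # a unit step and halve it until the sufficient-decrease condition holds.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g, H = grad(x), hess(x)
        d = -np.linalg.pinv(H) @ g  # Newton direction
        eta = 1.0
        while eta > 1e-12 and f(x + eta * d) > f(x) + c1 * eta * (g @ d):
            eta *= beta  # backtrack
        x = x + eta * d
    return x
```

On a poorly scaled one-dimensional logistic objective, the unit Newton step initially overshoots wildly; backtracking rejects it and still converges globally.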
Throughout Algorithm 4, we assumed that the only unknown parameter is the stepsize. However, when running trust-region algorithms, we would perhaps like to adapt both the trust-region radius as well as the stepsize. While it is possible to design such an adaptive trust-region strategy using insights from our proof, we leave the analysis and evaluation of such strategies for future work.
Necessity of stepsize.
Theorems II, III and IV show that choosing the appropriate stepsize ensures global linear convergence. Here we show that this is not simply an artifact of the analysis—a damped stepsize is actually necessary to ensure global convergence. Consider the following univariate function.
This function is convex, with minimum value 0 achieved at the origin. It satisfies our condition (Assumption A) of a stable Hessian, with a constant depending on the diameter $D$ of the level set. Suppose we start sufficiently far from the optimum. Applying a Newton update with a unit stepsize then overshoots: when the starting point is large, the dominant term of the objective flips and we veer too far to the left. Instead, a damped stepsize would ensure a descent step. In fact, this example also showcases the advantage of adaptive stepsizes: any fixed stepsize would require an exponential (in $D$) number of iterations to converge, whereas an adaptive stepsize depending on the current position would give convergence in polynomially many steps.
Appendix C Additional proofs
C.1 Proof of affine invariance of stability (Lemma 1)
Suppose we had a transformed function $\tilde f(x) := f(Bx)$ for an invertible matrix $B$. Using the chain rule, its Hessian would be $\nabla^2 \tilde f(x) = B^\top \nabla^2 f(Bx)\, B$. Let $\tilde{\mathcal{X}}$ denote the transformed domain of $\tilde f$ as defined in (3), so that $x \in \tilde{\mathcal{X}}$ if and only if $Bx \in \mathcal{X}$. The stability condition for $\tilde f$ then follows directly by conjugating the condition for $f$ with $B$. ∎
C.2 Proof of lower and upper bounds (Lemma 2)
The proof of the lemma follows from the second-order Taylor expansion of $f$ around $x$. Taylor’s theorem gives us that for any $x, y \in \mathcal{X}$ there exists a $z$ on the segment between $x$ and $y$ such that
(9) $f(y) = f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2}\, \|y - x\|^2_{\nabla^2 f(z)}\,.$
Since $\mathcal{X}$ is convex, $z \in \mathcal{X}$, and by substituting the pair $(z, x)$ in Assumption A,
Substituting this into (9), we have
This proves (4). On the other hand, substituting the other direction of Assumption A, we get a lower bound
Again substituting into (9), we can finish the proof as
∎
C.3 Proof of descent (Lemma 3)
For some $t$, let us assume that $x_t \in \mathcal{X}$. The base case $t = 0$ is trivially true. If the gradient vanishes at $x_t$, we are already at an optimum and the iterate does not move, proving our lemma. Otherwise we proceed as below.
We know that the Newton step is a descent direction [8, Section 9.2]. This means there exists a small enough stepsize for which the function value decreases, so the corresponding point also lies in $\mathcal{X}$. Applying Assumption A at this pair of points, we get that the gradient lies in the range of the Hessian, and so the update is well-defined.
Now we are left with the task of proving that $f(x_{t+1}) \le f(x_t)$. As noted, there exists a threshold such that for all smaller stepsizes the function value decreases. Let us define the auxiliary function mapping the stepsize to the function value at the corresponding iterate. This function is continuous in the stepsize, since $f$ is a continuous function and the update is a continuous map. Moreover we have that