Consider the generic unconstrained optimization problem
where is smooth and non-convex. Faced with the large-scale nature of modern “big-data” problems, many of the classical optimization algorithms might prove to be inefficient, if applicable at all. In this light, many of the recent research efforts have been centered around designing variants of classical algorithms which, by employing suitable approximations of the gradient and/or Hessian, improve upon the cost-per-iteration, while maintaining the original iteration complexity. In this light, we focus on trust-region (TR) [conn2000trust] and cubic regularization (CR) [cubic1981Griewank], two algorithms which are considered as among the most elegant and theoretically sound general-purpose Newton-type methods for non-convex problems.
In doing so, we first consider (P0), and study the theoretical convergence properties of variants of these two algorithms in which, under favorable conditions, Hessian is suitably approximated. We show that our Hessian approximation conditions, in many cases, are weaker than the existing ones in the literature. In addition, and in contrast to some prior works, our conditions allow for efficient constructions of the inexact Hessian with a priori guarantees via various approximation methods, of which Randomized Numerical Linear Algebra (RandNLA), [mahoney2011randomized, drineas2016randnla], techniques are shown to be highly effective.
Subsequently, to showcase the application of randomized techniques for construction of the approximate Hessian, we consider an important instance of (P0), i.e., large-scale finite-sum minimization, of the form
and its special case
) arise very often in machine learning, e.g.,[shalev2014understanding] as well as scientific computing, e.g., [rodoas1, rodoas2]. In big-data regime where , operations with the Hessian of
, e.g., matrix-vector products, typically constitute the main bottleneck of computations. Here, we show that our relaxed Hessian approximation conditions allow one to draw upon thesub-sampling ideas of [roosta2018sub, xu2016sub, bollapragada2016exact], to design variants of TR and CR algorithms where the Hessian is (non-)uniformly sub-sampled. We then present the theoretical convergence properties of these variants for non-convex finite-sum problems of the form (P1) and (P2).
The rest of this paper is organized as follows. In Section 1.1, we first introduce the notation and definitions used throughout the paper. For completeness, in Section 1.2, we give a brief review of trust region (Section 1.2.1) and cubic regularization (Section 1.2.2) along with related prior works. Our main contributions are summarized in Section 1.3. Theoretical analysis of the proposed algorithms for solving generic non-convex problem (P0) are presented in Section 2. Various randomized sub-sampling strategies as well as theoretical properties of the proposed algorithms for finite-sum minimization problems (P1) and (P2) are given in Section 3. Conclusions and further thoughts are gathered in Section 4.
1.1 Notation and Definitions
Throughout the paper, vectors are denoted by bold lowercase letters, e.g.,
, and matrices or random variables are denoted by bold upper case letters, e.g.,. denotes the transpose of a real vector . We use regular lower-case and upper-case letters to denote scalar constants, e.g., or . For two vectors, , their inner-product is denoted as . For a vector , and a matrix , and denote the vector norm and the matrix spectral norm, respectively, while is the matrix Frobenius norm. and are the gradient and the Hessian of at , respectively, and
denotes the identity matrix. For two symmetric matricesand , indicates that is symmetric positive semi-definite. The subscript, e.g., , denotes iteration counter and is the natural logarithm of . The inexact Hessian is denoted by , but for notational simplicity, we may use to, instead, denote the approximate Hessian evaluated at the iterate in iteration , i.e., . Throughout the paper, denotes a collection of indices from , with potentially repeated items and its cardinality is denoted by .
Unlike convex functions for which “local optimality” and “global optimality” are in fact the same, in non-convex settings, we are often left with designing algorithms that can guarantee convergence to approximate local optimality. In this light, throughout this paper, we make use of the following definition of -Optimality:
Definition 1 (-Optimality).
Given , is an -optimal solution to the problem (P0), if
We note that -Optimality (even with ) does not necessarily imply closeness to any local minimum, neither in iterate nor in the objective value. However, if the saddle points satisfy the strict-saddle property [ge2015escaping, lee2016gradient], then an -optimality guarantees vicinity to a local minimum for sufficiently small and .
1.2 Background and Related Work
Arguably, the most straightforward approach for globalization of many Newton-type algorithms is the application of line-search. However, near saddle points where the gradient magnitude can be small, traditional line search methods can be very ineffective and in fact produce iterates that can get stuck at a saddle point [nocedal2006numerical]. Trust region and cubic regularization methods are two elegant globalization alternatives that, specially recently, have attracted much attention. The main advantage of these methods is that they are reliably able to take advantage of the direction of negative curvature and escape saddle points. In this section we briefly review these algorithms as they pertain to the present paper and mention the relevant prior works.
1.2.1 Trust Region
TR methods [sorensen1982newton, conn2000trust] encompass a general class of iterative methods which specifically define a region around the current iterate within which they trust the model to be a reasonable approximation of the true objective function. The most widely used approximating model, which we consider here, is done via a quadratic function. More specifically, using the current iterate , the quadratic variant of TR algorithm finds the next iterate as where is a solution of the constrained sub-problem equationparentequation
|Here, is the region in which we “trust” our quadratic model to be an acceptable approximation of the true objective for the current iteration. The major bottleneck of computations in TR algorithm is the minimization of the constrained quadratic sub-problem (2a), for which numerous approaches have been proposed, e.g., [steihaug1983conjugate, more1983computing, sorensen1997minimization, gould1999solving, lenders2016trlib, erway2009iterative, gould2010solving, hazan2015linear].|
For a smooth non-convex objective and in order to obtain approximate first-order criticality, i.e., for some , the complexity of an (inexact) trust-region method, which ensures at least a Cauchy (steepest-descent-like) decrease at each iteration, is shown to be of the same order as that of steepest descent, i.e., ; e.g., [gratton2008recursiveI, gratton2008recursiveII, blanchet2016convergence, gratton2017complexity, cartis2012complexity]. Recent non-trivial modifications of the classical TR methods have also been proposed which improve upon the complexity to ; see [curtis2017trust] and further extensions to a more general framework in [curtis2017inexact]. These bounds can be shown to be tight [cartis2010complexity] in the worst case. Under a more general algorithmic framework and in terms of objective function sub-optimality, i.e., , better complexity bounds, in the convex and strongly-convex settings, have been obtained which are of the orders of and , respectively [grapiglia2016worst].
For non-convex problems, however, it is more desired to obtain complexity bounds for achieving approximate second-order criticality, i.e., Definition 1. For this, bounds in the orders of and have been obtained in [cartis2012complexity] and [grapiglia2016worst], respectively. Similar bounds were also given in [gratton2017complexity] under probabilistic model. Bounds of this order have shown to be optimal in certain cases [cartis2012complexity].
More closely related to the present paper, there have been several results which study the role of derivative-free and probabilistic models in general, and Hessian approximation in particular, e.g., see [cartis2012complexity, conn2009global, chen2015stochastic, blanchet2016convergence, bandeira2014convergence, larson2016stochastic, shashaani2016astro, gratton2017complexity] and references therein.
1.2.2 Cubic Regularization
An alternative to the traditional line-search and TR for globalization of Newton-type methods is the application of cubic regularization. Such class of methods is characterized by generating iterates as where is a solution of the following unconstrained sub-problem
where is the cubic regularization parameter chosen for the current iteration. As in the case of TR, the major bottleneck of CR involved solving the sub-problem (2b), for which various techniques have been proposed, e.g., [cartis2011adaptiveI, carmon2016gradient, bianconcini2015use, agarwal2016finding].
To the best of our knowledge, the use of such regularization, was first introduced in the pioneering work of [cubic1981Griewank], and subsequently further studied in the seminal works of [nesterov2006cubic, cartis2011adaptiveI, cartis2011adaptiveII].From the worst-case complexity point of view, CR has a better dependence on compared to TR. More specifically, [nesterov2006cubic] showed that, under global Lipschitz continuity assumption on the Hessian, if the sub-problem (2b) is solved exactly, then the resulting CR algorithm achieves the approximate first-order criticality with complexity of . These results were extended by the pioneering and seminal works of [cartis2011adaptiveI, cartis2011adaptiveII] to an adaptive variant, which is often referred to as ARC (Adaptive Regularization with Cubics). In particular, the authors showed that the worst case complexity of can be achieved without requiring the knowledge of the Hessian’s Lipschitz constant, access to the exact Hessian, or multi-dimensional global optimization of the sub-problem (2b). These results were further refined in [cartis2012complexity] where it was shown that, not only, multi-dimensional global minimization of (2b) is unnecessary, but also the same complexity can be achieved with mere one or two dimensional search. This bound has been shown to be tight [cartis2011optimal]. As for the approximate second-order criticality, [cartis2012complexity] showed that at least is required. With further assumptions on the inexactness of sub-problem solution, [cartis2011adaptiveII, cartis2012complexity] also show that one can achieve , which is shown to be tight [cartis2010complexity]. Better dependence on can be obtained if one assumes additional structure, such as convexity, e.g., see [nesterov2006cubic, cartis2012evaluation] as well as the acceleration scheme of [nesterov2008accelerating].
Recently, for (strongly) convex problems, [ghadimi2017second] obtained sub-optimal complexity for ARC and its accelerated variants using Hessian approximations. In the context of stochastic optimization problems, [tripuraneni2017stochastic] considers cubic regularization with a priori chosen fixed regularization parameter using both approximations of the gradients and Hessian. Specific to the finite-sum problem (P1), and by a direct application of the theoretical results of [cartis2011adaptiveI, cartis2011adaptiveII], [kohler2017subsampledcubic] presents a sub-sampled variant of ARC, in which the exact Hessian and the gradient are replaced by sub-samples. However, unfortunately, their analysis suffers from a rather vicious circle: the approximate Hessian and gradient are formed based on an a priori unknown step which can only be determined after such approximations are formed.
In this section, we summarize the key aspects of our contributions. In Section 2, we consider (P0) and establish the worst-case iteration complexities for variants of trust-region and adaptive cubic regularization methods in which the Hessian is suitably approximated. More specifically, our entire analysis is based on the following key condition on the approximate Hessian :
Condition 1 (Inexact Hessian Regularity).
For some , , the approximating Hessian, , satisfies equationparentequation
where and are, respectively, the iterate and the update at iteration .
Under Condition 1, we show that our proposed algorithms (Algorithms 1 and 2) achieve the same worst-case iteration complexity to obtain approximate second order critical solution as that of the exact variants (Theorems 1, 2, and 3).
In Section 3, we describe schemes for constructing to satisfy Condition 1. Specifically, in the context of finite-sum optimization framework, i.e., problems (P1) and (P2), we present various sub-sampling schemes to probabilistically ensure Condition 1 (Lemmas 16 and 17
). Our proposed randomized sub-sampling strategies guarantee, with high probability, a stronger condition than (3a), namely
It is clear that (4) implies (3a). We then give optimal iteration complexities for Algorithms 1 and 2 for optimization of non-convex finite-sum problems where the Hessian is approximated by means of appropriate sub-sampling (Theorems 4, 5 and 6).
To establish optimal second-order iteration complexity, many previous works considered Hessian approximation conditions that, while enjoying many advantages, come with certain disadvantages. Our proposed Condition 1 aims to remedy some of these disadvantages. We first briefly review the conditions used in the prior works, and subsequently highlight the merits of Condition 1 in comparison.
1.3.1 Conditions Used in Prior Works
For the analysis of trust-region, many authors have considered the following condition equationparentequation
|for some , where is the current trust-region radius, e.g., [bandeira2014convergence, gratton2017complexity]. In [blanchet2016convergence], condition (5a) is replaced with|
|for some . In fact, by assuming Lipschitz continuity of Hessian, it is easy to show that (5a) and (5b) are equivalent, in that one implies the other, albeit with modified constants. We also note that [bandeira2014convergence, gratton2017complexity, blanchet2016convergence] study a more general framework under which the entire sub-problem model is probabilistically constructed and approximation extends beyond just the Hessian.|
For cubic regularization, the condition imposed on the inexact Hessian is often considered as
for some , e.g., [cartis2011adaptiveI, cartis2011adaptiveII, cartis2012complexity] and other follow-up works. In fact, [cartis2012complexity] has also established optimal iteration complexity for trust-region algorithm under (5c). Both of (5a) and (5c), are stronger than the celebrated Dennis-Moré [dennis1974characterization] condition, i.e.,
1.3.2 Merits of Condition 1
For our trust-region analysis, we require Condition 1 with ; see (11) in Theorem 1. Hence, when is large, e.g., at the beginning of iterations, all the conditions (3a), (5a), and (5b) are equivalent, up to some constants. However, the constants in (5a) and (5b) can be larger than what is implied by (3a), amounting to cruder approximations in practice for when is large. As iterations progress, the trust-region radius will get smaller, and in fact it is expected that will eventually shrink to be . In prior works, e.g., [blanchet2016convergence, gratton2017complexity], the convergence analysis is derived using , whereas here we allow . As a result, the requirements (5a) and (5b) can eventually amount to stricter conditions than (3a).
As for (5c), the main drawback lies in the difficulty of enforcing it. Despite the fact that for certain values of and , e.g., , (5c) can be less restrictive than (3a), a priori enforcing (5c) requires one to have already computed the search direction , which itself can be done only after
is constructed, hence creating a vicious circle. A posteriori guarantees can be given if one obtains a lower-bound estimate on the yet-to-be-computed step-size, i.e., to havesuch that . This allows one to consider a stronger condition as , which can be enforced using a variety of methods such as those described in Section 3. However, to obtain such a lower-bound estimate on the next step-size, one has to resort to a recursive procedure, which necessitates repeated constructions of the approximate Hessian and subsequent solutions of the corresponding subproblems. Consequently, this procedure may result in a significant computational overhead and will lead to undesirable theoretical complexities.
In sharp contrast to (5c), the condition (3a) allows for theoretically principled use of many practical techniques to construct . For example, under (3a), the use of quasi-Newton methods to approximate the Hessian is theoretically justified. Further, by considering the stronger condition (4), many randomized matrix approximation techniques can be readily applied, e.g., [woodruff2014sketching, mahoney2011randomized, tropp2015introduction, tropp_concentration]; see Section 3. To the best of our knowledge, the only successful attempt at guaranteed a priori construction of using (5c) is done in [cartis2015global]. Specifically, by considering probabilistic models, which are “sufficiently accurate” in that they are partly based on (5c), [cartis2015global] studies first-order complexity of a large class of methods, including ARC, and discusses ways to construct such probabilistic models as long as the gradient is large enough, i.e., before first-order approximate-optimality is achieved. Here, by considering (3a), we are able to provide an alternative analysis, which allows us to obtain second-order complexity results.
Requiring (4), as a way of enforcing (3a), offers a variety of other practical advantages, which are not readily available with other conditions. For example, consider distributed/parallel environments where the data is distributed across a network and the main bottleneck of computations is the communications across the nodes. In such settings, since (4) allows for the Hessian accuracy to be set a priori and to remain fixed across all iterations, the number of samples in each node can stay the same throughout iterations. This prevents unnecessary communications to re-distribute the data at every iteration.
Furthermore, in case of failed iterations, i.e., when the computed steps are rejected, the previous may seamlessly be used in the next iteration, which avoids repeating many such, potentially expensive, computations throughout the iterations. For example, consider approximate solutions to the underlying sub-problems by means of dimensionality reduction, i.e., is projected onto a lower dimensional sub-space as for some with , resulting in a smaller dimensional sub-problem. Now if the current iteration leads to a rejected step, the projection of the from the previous iteration can be readily re-used in the next iteration. This naturally amounts to saving further Hessian computations.
2 Algorithms and Convergence Analysis
We are now ready to present our main algorithms for solving the generic non-convex optimization (P0) along with their corresponding iteration complexity results to obtain a -optimal solution as in (1). More precisely, in Section 2.1 and 2.2, respectively, we present modifications of the TR and ARC methods which incorporate inexact Hessian information, according to Condition 1.
We remind that, though not specifically mentioned in the statement of the theorems or the algorithms, when the computed steps are rejected and an iteration needs to be repeated with different or , the previous may seamlessly be used in the next iteration. This can be a desirable feature in many practical situations and is directly the result of enforcing (4); see also the discussion in Section 1.3.2.
For our analysis throughout the paper, we make the following standard assumption regarding the regularity of the exact Hessian of the objective function .
Assumption 1 (Hessian Regularity).
is twice differentiable and has bounded and Lipschitz continuous Hessian on the piece-wise linear path generated by the iterates, i.e. for some and all iterations equationparentequation
where and are, respectively, the iterate and the update step at iteration .
2.1 Trust Region with Inexact Hessian
Algorithm 1 depicts a trust-region algorithm where at each iteration , instead of the true Hessian , only an inexact approximation, , is used. For Algorithm 1, the accuracy tolerance in (3a) is adaptively chosen as , where is the trust region in the t-th iteration and is some fixed threshold. This allows for a very crude approximation at the beginning of iterations, when is large. As iterations progress towards optimality and gets small, the threshold can prevent from getting unnecessarily too small.
In Algorithm 1, we require that the sub-problem (8) is solved only approximately. Indeed, in large-scale problems, where the exact solution of the sub-problem is the main bottleneck of the computations, this is a very crucial relaxation. Such approximate solution of the sub-problem (8) has been adopted in many previous work. Here, we follow the inexactness conditions discussed in [conn2000trust], which are widely known as Cauchy and Eigenpoint conditions. Recall that the Cauchy and Eigen directions correspond, respectively, to one dimensional minimization of the sub-problem (8) along the directions given by the gradient and negative curvature.
Condition 2 (Sufficient Descent Cauchy and Eigen Directions [conn2000trust]).
Assume that we solve the sub-problem (8) approximately to find such that equationparentequation
Here, is defined in (8), (Cauchy point) is along negative gradient direction and is along approximate negative curvature direction such that , for some (see Appendix B for a way to efficiently compute ).
One way to ensure that an approximate solution to the sub-problem (8) satisfies (9), is by replacing (8) with the following reduced-dimension problem, in which the search space is a two-dimensional sub-space containing vectors , and , i.e.,
We now set out to provide iteration complexity for Algorithm 1. Our analysis follows similar line of reasoning as that in [cartis2011adaptiveI, cartis2011adaptiveII, cartis2012complexity]. First, we show the discrepancy between the quadratic model and objective function in Lemma 1.
Applying Mean Value Theorem on at gives , for some in the segment of . We have
By assumption on , (9a), and since , we have
where the last inequality follows by assumption on . So , which means the iteration is successful. ∎∎
Lemma 4 gives a lower bound for the trust region radius before the algorithm terminates, i.e., this ensures that the trust region never shrinks to become too small.
We prove by contradiction. Assume that the t-th iteration is the first unsuccessful iteration such that , i.e., we have
Suppose . By Lemma 2, since , iteration must have been accepted and we must have , which is a contradiction. Now suppose . By assumption on , we have that , which implies that . Since the function , for any fixed , is decreasing in , and for any fixed , is increasing in , we have
As a result, since , it must satisfy the condition of Lemma 3. This implies that iteration must have been accepted, which is a contradiction. ∎∎
The following lemma follows closely the line of reasoning in [cartis2012complexity, Lemma 4.5].
Lemma 5 (Successful Iterations).
Consider any such that and let for some . Further, suppose Condition 1 is satisfied with , where is the trust region at the t-th iteration. Let denote the set of all the successful iterations before Algorithm 1 stops. Then, under Assumption 1, Condition 2, the number of successful iterations is upper bounded by,
where , and is as defined in Lemma 4.
Now we are ready to present the final complexity in Theorem 1.
Theorem 1 (Optimal Complexity of Algorithm 1).
Consider any such that and let for some where is a hyper-parameter in Algorithm 1, and is as in (9b). Suppose the inexact Hessian, , satisfies Condition 1 with the approximation tolerance, , in (3a) as
Suppose Algorithm 1 terminates at the t-th iteration. Let and denote the sets of all the successful and unsuccessful iterations, respectively. Then and , where is a hyper-parameter of Algorithm 1. From Lemma 4, we have . Hence, , which implies . Combine the result from Lemma 5, we have the total iteration complexity as
As it can be seen, the worst-case total number of iterations required by Algorithm 1 before termination, matches the optimal iteration complexity obtained in [cartis2012complexity]. Furthermore, from (3a), it follows that upon termination of Algorithm 1 after iterations, in addition to , we have , i.e., the obtained solution satisfies -Optimality as in (1).
For Algorithm 1, the Hessian approximation tolerance is allowed to be chosen per-iteration as . This way, when is large (e.g., at the beginning of iterations), one can employ crude Hessian approximations. As iterations progress towards optimality, can get very small, in which case Hessian accuracy is set in the order of . Note that by Lemma 4, we are always guaranteed to have . As a result, when , e.g., , we can have that . In such cases, the choice ensures that the Hessian approximation tolerance never gets unnecessarily too small.
2.2 Adaptive Cubic Regularization with Inexact Hessian
Similar to Section 2.1, in this section, we present the algorithm and its corresponding convergence results for the case of adaptive cubic regularization with inexact Hessian. In particular, Algorithm 2 depicts a variant of ARC algorithm where at each iteration , the inexact approximation, , is constructed according to Condition 1. Here, unlike Section 2.1, we were unable to provide convergence guarantees with adaptive tolerance in (3a) and as result, is set fixed a priori to a sufficiently small value, i.e., to guarantee -optimality.