We consider the problem of optimizing an objective function of the form
is a not necessarily convex loss function defined overdatapoints.
Our setting is one where access to the exact function gradient is computationally expensive (e.g. large-scale setting where is large) and one therefore wants to access only stochastic evaluations
, potentially over a mini-batch. In such settings, stochastic gradient descent (SGD) has long been the method of choice in the field of machine learning. Despite the uncontested empirical success of SGD to solve difficult optimization problems – including training deep neural networks – the convergence speed of SGD is known to slow down close to saddle points or in ill-conditioned landscapes[38, 23]. While gradient descent requires oracle evaluations111In ”oracle evaluations” we include the number of function and gradient evaluations as well as evaluations of higher-order derivatives. to reach an -approximate first-order critical point, the complexity worsens to for SGD. In contrast, high-order methods – including e.g. regularized Newton methods and trust-region methods – exploit curvature information, allowing them to enjoy faster convergence to a second-order critical point.
In this work, we focus our attention on regularized high-order methods to optimize Eq. (1), which construct and optimize a local Taylor model of the objective in each iteration with an additional step length penalty term that depends on how well the model approximates the real objective. This paradigm goes back to Trust-Region and Cubic Regularization methods, which also make use of regularized models to compute their update step [20, 40, 16]. For the class of second-order Lipschitz smooth functions,  showed that the Cubic Regularization framework finds an -approximate second-order critical point in at most iterations222The complexity ignores the cost of optimizing the model at each iteration. This is an issue we will revisit later on., thus achieving the best possible rate in this setting . Recently, stochastic extensions of these methods have appeared in the literature such as [19, 34, 46, 51]. These will be discussed in further details in Section 2.
Since the use of second derivatives can provide significant theoretical speed-ups, a natural question is whether higher-order derivatives can result in further improvements. This questions was answered affirmatively in  who showed that using derivatives up to order allows convergence to an -approximate first-order critical point in at most evaluations. This result was extended to -second-order stationarity in , which proves an rate. Yet, these results assume a deterministic setting where access to exact evaluations of the function derivatives is needed and – to the best of our knowledge – the question of using high-order () derivatives in a stochastic setting has received little consideration in the literature so far. We focus our attention on the case of computing derivative information of up to order . It has recently been shown in  that, while optimizing degree four polynomials is NP-hard in general , the specific models that arise from a third-order Taylor expansion with a quartic regularizer can still be optimized efficiently.
The main contribution of this work (Thm. 6) is to demonstrate that a stochastic fourth-order regularized method finds an second-order stationary point in iterations . Therefore, it matches the results obtained by deterministic methods while relying only on inexact derivative information. In order to prove this result, we develop a novel tensor concentration inequality (Thm. 7) for sums of tensors of any order and make explicit use of the finite-sum structure given in Eq. (1). Together with existing matrix and vector concentration bounds , this allows us to define a sufficient amount of samples needed for convergence. We thereby provide theoretically motivated sampling schemes for the derivatives of the objective – for both sampling with and without replacement – and discuss a practical implementation of the resulting approach, which we name STM (Stochastic Tensor Method). We also compute the total computational complexity of this algorithm (including the complexity of optimizing the model at each iteration) and find that the complexity in terms of is superior to other state-of-the-art stochastic algorithms such as [52, 46, 3]. Importantly, the implementation of this approach does not require computing high-order derivatives directly, which would potentially render it as too computationally and memory expensive. Instead, the necessary derivative information is accessed only indirectly via tensor-vector and matrix-vector products.
2 Related work
Sampling techniques for first-order methods.
In large-scale learning (
) most of the computational cost of traditional deterministic optimization methods is spent in computing the exact gradient information. A common technique to address this issue is to use sub-sampling to compute an unbiased estimate of the gradient. The simplest instance is SGD whose convergence does not depend on the number of datapoints
. However, the variance in the stochastic gradient estimates slows its convergence down. The work of explored a sub-sampling technique in the case of convex functions, showing that it is possible to maintain the same convergence rate as full-gradient descent by carefully increasing the sample size over time. Another way to recover a linear rate of convergence for strongly-convex functions is to use variance reduction [33, 24, 43, 32, 22]. The convergence of SGD and its variance-reduced counterpart has also been extended to non-convex functions [28, 42] but the guarantees these methods provide are only in terms of first-order stationarity. However, the work of [27, 44, 21] among others showed that SGD can achieve stronger guarantees in the case of strict-saddle functions. Yet, the convergence rate has a polynomial dependency to the dimension
and the smallest eigenvalue of the Hessian which can make this method fairly impractical.
For second-order methods that are not regularized or that make use of positive definite Hessian approximations (e.g. Gauss-Newton), the problem of avoiding saddle points is even worse as they might be attracted by saddle points or even local maximima . Another predominant issue is the computation (and storage) of the Hessian matrix, which can be partially addressed by Quasi-Newton methods such as (L-)BFGS. An increasingly popular alternative is to use sub-sampling techniques to approximate the Hessian matrix, such as in  and . The latter method uses a low-rank approximation of the Hessian to reduce the complexity per iteration. However, this yields a composite convergence rate: quadratic at first but only linear near the minimizer.
Cubic regularization and trust region methods.
Trust region methods are among the most effective algorithmic frameworks to avoid pitfalls such as local saddle points in non-convex optimization. Classical versions iteratively construct a local quadratic model and minimize it within a certain radius wherein the model is trusted to be sufficiently similar to the actual objective function. This is equivalent to minimizing the model function with a suitable quadratic penalty term on the stepsize. Thus, a natural extension is the cubic regularization method introduced by  that uses a cubic over-estimator of the objective function as a regularization technique for the computation of a step to minimize the objective function. The drawback of their method is that it requires computing the exact minimizer of the cubic model, thus requiring the exact gradient and Hessian matrix. However finding a global minimizer of the cubic model may not be essential in practice and doing so might be prohibitively expensive from a computational point of view.  introduced a method named ARC which relaxed this requirement by letting be an approximation to the minimizer. The model defined by the adaptive cubic regularization method introduced two further changes. First, instead of computing the exact Hessian it allows for a symmetric approximation . Second, it introduces a dynamic step-length penalty parameter instead of using the global Lipschitz constant. Our approach relies on the same adaptive framework.
There have been efforts to further reduce the computational complexity of optimizing the model. For example,  refined the approach of  to return an approximate local minimum in time which is linear in the input. Similar improvements have been made by  and . These methods provide alternatives to minimize the cubic model and can thus be seen as complementary to our approach. Finally,  proposed a stochastic trust region method but their analysis does not specify any accuracy level required for the estimation of the stochastic Hessian.  also analyzed a probabilistic cubic regularization variant that allows for approximate second-order models.  provided an explicit derivation of sampling conditions to preserve the worst-case complexity of ARC. Other works also derived similar stochastic extensions to cubic regularization, including  and . The worst-case rate derived in the latter includes the complexity of a specific model solver introduced in .
A hybrid algorithm suggested in  adds occasional third-order steps to a cubic regularization method, thereby obtaining provable convergence to some type of third-order local minima. Yet, high-order derivatives can also directly be applied within the regularized Newton framework, as done e.g. by  that extended cubic regularization to a -th high-order model (Taylor approximation of order ) with a -th order regularization, proving iteration complexities of order for first-order stationarity. These convergence guarantees are extended to the case of inexact derivatives in  but the latter only discusses a practical implementation for the case . The convergence guarantees of -th order models are extended to second-order stationarity in . Notably, all these approaches require optimizing a -th order polynomial, which is known to be a difficult problem. Recently, [39, 29] introduced an implementable method for the case in the deterministic case and for convex functions. We are not aware of any work that relies on high-order derivatives () to construct a model to optimize smooth functions and we therefore focus our work on the case .
3.1 Notation & Assumptions
First, we lay out some standard assumptions regarding the function as well as the required approximation quality of the high-order derivatives.
Assumption 1 (Continuity).
The functions , , are Lipschitz continuous for all , with Lipschitz constants and respectively.
In the following, we will denote the directional derivative of the function at along the directions as
For instance, and .
Assumption 1 implies that for each ,
for all and where .
As in , is the tensor norm recursively induced by the Euclidean norm on the space of -th order tensors.
3.2 Stochastic surrogate model
We construct a surrogate model to optimize based on a truncated Taylor approximation as well as a power prox function weighted by a sequence that is controlled adaptively according to the fit of the model to the function . Since the full Taylor expansion of requires computing high-order derivatives that are expensive, we instead use an inexact model defined as
where and approximate the derivatives and through sampling as follows. Three sample sets and are drawn and the derivatives are then estimated as
We will later see that the implementation of the algorithm we analyze does not require the computation of the Hessian or the third-order tensor – both of which would require significant computational resources – but instead directly compute Tensor-vector products with a complexity of order .
We will make use of the following condition in order to reach an -critical point:
For a given accuracy, one can choose the size of the sample sets for sufficiently small such that:
In Lemma 8, we prove that we can choose the size of each sample set and to satisfy the conditions above, without requiring knowledge of the length of the step . We will present a convergence analysis of STM, proving that the convergence properties of the deterministic methods [8, 39] can be retained by a sub-sampled version at the price of slightly worse constants.
The optimization algorithm we consider is detailed in Algorithm 1. At iteration step , we sample three sets of datapoints from which we compute stochastic estimates of the derivatives of so as to satisfy Cond. 1. We then obtain the step by solving the problem
either exactly or approximately (details will follow shortly) and update the regularization parameter depending on , which measures how well the model approximates the real objective. This is accomplished by differentiating between different types of iterations. Successful iterations (for which ) indicate that the model is, at least locally, an adequate approximation of the objective such that the penalty parameter is decreased in order to allow for longer steps. We denote the index set of all successful iterations between 0 and by . We also denote by its complement in which corresponds to the index set of unsuccessful iterations.
Exact model minimization
Solving Eq. (4) requires minimizing a nonconvex multivariate polynomials. As pointed out in [39, 5], this problem is computationally expensive to solve in general and further research is needed to establish whether one could design a practical minimization method. In , the authors demonstrated that an appropriately regularized Taylor approximation of convex functions is a convex multivariate polynomial, which can be solved using the framework of relatively smooth functions developed in . As long as the involved models are convex, this solver could be used in Algorithm 1 but it is unclear how to generalize this method to the non-convex case. Fortunately, we will see next that exact model minimizers is not even needed to establish global convergence guarantees for our method.
Approximate model minimization
Exact minimization of the model in Eq. (4) is often computationally expensive, especially given that it is required for every parameter update in Algorithm 1. In the following, we explore an approach to approximately minimize the model while retaining the convergence guarantees of the exact minimization. Instead of requiring the exact optimality conditions to hold, we use weaker conditions that were also used in prior work [8, 18]. First, we define two criticality measures based on first- and second-order information:
where is the minimum eigenvalue of the Hessian matrix .
The same criticality measures are defined for the model ,
Finally, we state the approximate optimality condition required to find the step .
For each outer iteration , the step is computed so as to approximately minimize the model in the sense that the following conditions hold:
4.1 Worst-case complexity
In the following, we provide a proof of convergence of STM to a second-order critical point, i.e. a point such that
The high-level idea of the analysis is to first show that the model decreases proportionally to the criticality measures at each iteration and then relate the model decrease to the function decrease. Since the function is lower bounded, it can only decrease a finite number of times, which therefore implies convergence. We start with a bound on the model decrease in terms of the step length .
For any , the step (satisfying Cond. 2) is such that
In order to complete our claim of model decrease, we prove that the length of the step can not be arbitrarily small compared to the gradient and the Hessian of the objective function.
A key lemma to derive a worst-case complexity bound on the total number of iterations required to reach a second-order critical point is to bound the number of unsuccessful iterations as a function of the number of successful ones , that have occurred up to some iteration .
The steps produced by Algorithm 1 guarantee that if for , then where
We are now ready to state the main result of this section that provides a bound on the number of iterations required to reach a second-order critical point.
Theorem 6 (Outer worst-case complexity).
As in , we obtain a worst-complexity bound of the order , except that we do not require the exact computation of the function derivatives but instead rely on approximate sampled quantities. By using third-order derivatives, STM also obtains a faster rate (in terms of outer iterations) than the one achieved by a sampled variant of cubic regularization  (at most iterations for and to reach approximate nonnegative curvature). The worst-case complexity bound stated in Theorem 6 does not take into account the complexity of the subsolver used to find the (approximate) solution of . This is discussed in detail in Section 4.3.
4.2 Sampling conditions
We now show that one can use random sampling without replacement and choose the size of the sample sets in order to satisfy Cond. 1
with high probability. The complete proof is available in AppendixB.1. The case of sampling with replacement is also discussed in Appendix B.2.
First, we need to develop a new concentration bound for tensors based on the spectral norm. Existing tensor concentration bounds are not applicable to our setting. Indeed,  relies on a different norm which can not be translated to the spectral norm required in our analysis while the bound derived in  relies on a specific form of the input tensor.
Formally, let be a probability space and let be a real random tensor, i.e. a measurable map from to (the space of real tensors of order and dimension ).
Our goal is to derive a concentration bound for a sum of identically distributed tensors sampled without replacement, i.e. we consider
where each tensor is sampled from a population of size .
The concentration result derived in the next theorem is based on the proof technique introduced in 
which we adapt for sums of random variables.
Theorem 7 (Tensor Hoeffding-Serfling Inequality).
Let be a sum of tensors sampled without replacement from a finite population of size . Let be such that and assume that for each tensor , . Let , then we have
Based on Theorem 7 as well as standard concentration bounds for vectors and matrices (see e.g. ), we prove the required sampling conditions of Cond. 1 hold for the specific sample sizes given in the next lemma.
Consider the sub-sampled gradient, Hessian and third-order tensor defined in Eq. (5). Under Assumption 1, the sampling conditions in Eqs. (6), (7) and (8) are satisfied with probability for the following choice of the size of the sample sets and :
where hides poly-logarithmic factors and a polynomial dependency to .
4.3 Complexity subproblem
We first discuss the scenario where we only require an -first-order critical point. In the next theorem, we state the total complexity of Algorithm 1, including the sampling complexities given in Lemma 8 as well as the subsolver complexity, which we denote by .
Theorem 9 (Total worst-case complexity).
Let be a lower bound on . Denote by the number of outer iterations defined in Eq. (21) and by the complexity of the model subsolver, both specialized to the case of first-order criticality. Assume Condition 1 holds. Then, given , Algorithm 1 needs at most
(stochastic) oracle calls to reach an iterate such that .
We now derive a corollary for the case where the desired accuracy is high, i.e. . We suppose the subsolver is a non-convex version of AGD whose worst-case complexity under Assumption 1 is proven to be in .
Corollary 10 (Total worst-case complexity - first-order stationarity).
The final total complexity in terms of is an improvement over state-of-the-art methods such as NEON + SCSG  (), SCR  and Natasha2  (). We expect that the dependency on could be further reduced using the variance reduction technique introduced in . Furthermore, we want to point out that the use of a specialized subproblem solver instead of AGD – as was done in  for the cubic model – could significantly improve the rate. For comparison, while the rate of AGD to reach is , the cubic solver from  achieves .333  reports a rate in terms of function suboptimality, which we here naively convert using smoothness.
Unlike the first-order guarantees, the condition imposed on the subproblem solver now requires finding a second-order stationary point. A common solver used for trust-region methods and ARC is to apply a Lanczos method to build up evolving Krylov spaces, which can be constructed in a Hessian-free manner, i.e. by accessing the Hessians only indirectly via matrix-vector products. We are however not aware of any existing work analyzing the worst-case complexity of (a generalization of) this approach for higher-order polynomials. Instead, we suggest using the solver presented in  which is an accelerated gradient method for non-convex optimization that requires to find a point such that .
Corollary 11 (Total worst-case complexity – second-order stationarity).
As for the first-order guarantees, an interesting direction to improve the total complexity would be a subproblem solver specialized to the specific characteristics of the fourth-order model.
We presented a stochastic fourth-order optimization algorithm that preserves the guarantees of its deterministic counterpart for finding an approximate second-order critical point. There are numerous extensions that could follow from this work, including the following.
Prior work such as [3, 52] has relied on using variance reduction to achieve faster rates. A variance-reduced variant of cubic regularization has also been shown in  to reduce the per-iteration sample complexity and one would therefore except similar improvements can be made to the quartic model. Finally, one could incorporate acceleration as in .
Perhaps the most promising direction for future work is to design efficient algorithms to solve high-order polynomial models. While there has been some significant work published for cubic models [11, 12], the problem has been relatively unexplored for higher-order models.
Finally, one relevant application for the type of high-order algorithms we developed is training deep neural networks as in [46, 1]. An interesting direction for future research would therefore be to design a practical implementation of STM for training neural networks based on efficient tensor-vector products similarly to the fast Hessian-vector products proposed in .
The authors would like to thank Coralia Cartis for helpful discussions on an early draft of this paper, as well as for pointing out additional relevant work.
-  Leonard Adolphs, Jonas Kohler, and Aurelien Lucchi. Ellipsoidal trust region methods and the marginal value of hessian information for neural network training. arXiv preprint arXiv:1905.09201, 2019.
-  Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.
-  Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than sgd. In Advances in Neural Information Processing Systems, pages 2675–2686, 2018.
-  Anima Anandkumar and Rong Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. arXiv preprint arXiv:1602.05908, 2016.
-  Michel Baes. Estimate sequence methods: extensions and approximations. Institute for Operations Research, ETH, Zürich, Switzerland, 2009.
-  Rémi Bardenet, Odalric-Ambrym Maillard, et al. Concentration inequalities for sampling without replacement. Bernoulli, 21(3):1361–1385, 2015.
-  Stefania Bellavia, Gianmarco Gurioli, Benedetta Morini, and Philippe Toint. Adaptive regularization algorithms with inexact evaluations for nonconvex optimization. 2018.
-  Ernesto G Birgin, JL Gardenghi, José Mario Martínez, Sandra Augusta Santos, and Ph L Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, 163(1-2):359–368, 2017.
-  Jose Blanchet, Coralia Cartis, Matt Menickelly, and Katya Scheinberg. Convergence rate analysis of a stochastic trust region method for nonconvex optimization. arXiv preprint arXiv:1609.07428, 2016.
-  Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
-  Yair Carmon and John C. Duchi. Gradient descent efficiently finds the cubic-regularized non-convex newton step. https://arxiv.org/abs/1612.00547, 2016.
-  Yair Carmon and John C Duchi. Analysis of krylov subspace solutions of regularized non-convex quadratic problems. In Advances in Neural Information Processing Systems, pages 10705–10715, 2018.
-  Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Convex until proven guilty: Dimension-free acceleration of gradient descent on non-convex functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 654–663. JMLR. org, 2017.
-  Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. arXiv preprint arXiv:1710.11606, 2017.
-  Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2):1751–1772, 2018.
-  Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
-  Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part ii: worst-case function-and derivative-evaluation complexity. Mathematical programming, 130(2):295–319, 2011.
-  Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Improved second-order evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. arXiv preprint arXiv:1708.04044, 2017.
-  Coralia Cartis and Katya Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, pages 1–39, 2015.
-  Andrew R Conn, Nicholas IM Gould, and Philippe L Toint. Trust region methods. SIAM, 2000.
-  Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. arXiv preprint arXiv:1803.05999, 2018.
-  Hadi Daneshmand, Aurélien Lucchi, and Thomas Hofmann. Starting small - learning with adaptive sample sizes. In International Conference on Machine Learning, 2016.
-  Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.
-  Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
-  Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled newton methods. In Advances in Neural Information Processing Systems, pages 3052–3060, 2015.
-  Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.
-  Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points-online stochastic gradient for tensor decomposition. In COLT, pages 797–842, 2015.
-  Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
-  Geovani Nunes Grapiglia and Yurii Nesterov. On inexact solution of auxiliary problems in tensor methods for convex optimization. arXiv preprint arXiv:1907.13023, 2019.
-  Elad Hazan and Tomer Koren. A linear-time algorithm for trust region problems. Mathematical Programming, 158(1-2):363–381, 2016.
-  Christopher J Hillar and Lek-Heng Lim. Most tensor problems are np-hard. Journal of the ACM (JACM), 60(6):45, 2013.
-  Thomas Hofmann, Aurelien Lucchi, Simon Lacoste-Julien, and Brian McWilliams. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems 28, pages 2296–2304. Curran Associates, Inc., 2015.
-  Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
-  Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1895–1904. JMLR. org, 2017.
-  Haihao Lu, Robert M Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
-  Ziyan Luo, Liqun Qi, and Ph L Toint. Bernstein concentration inequalities for tensors via einstein products. arXiv preprint arXiv:1902.03056, 2019.
-  Yu Nesterov. Accelerating the cubic regularization of newton’s method on convex problems. Mathematical Programming, 112(1):159–181, 2008.
-  Yurii Nesterov. Introductory lectures on convex optimization. applied optimization, vol. 87, 2004.
-  Yurii Nesterov. Implementable tensor methods in unconstrained convex optimization. Technical report, CORE Discussion Paper, Université Catholique de Louvain, Belgium, 2015.
-  Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
-  Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147–160, 1994.
-  Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. arXiv preprint arXiv:1603.06160, 2016.
-  Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
-  Ju Sun, Qing Qu, and John Wright. When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096, 2015.
-  Ryota Tomioka and Taiji Suzuki. Spectral norm of random tensors. arXiv preprint arXiv:1407.1870, 2014.
-  Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2899–2908, 2018.
-  Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
-  Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.
-  Roman Vershynin. Concentration inequalities for random tensors. arXiv preprint arXiv:1905.00802, 2019.
-  Zhe Wang, Yi Zhou, Yingbin Liang, and Guanghui Lan. Stochastic variance-reduced cubic regularization for nonconvex optimization. arXiv preprint arXiv:1802.07372, 2018.
-  Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Newton-type methods for non-convex optimization under inexact hessian information. arXiv preprint arXiv:1708.07164, 2017.
-  Yi Xu, Jing Rong, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pages 5530–5540, 2018.
Appendix A Main analysis
In the following section, we provide detailed proofs for all the lemmas and theorems presented in the main paper. First, we present some basic results regarding the surrogate model that will be handy throughout the proofs.
Recall that we use the following surrogate model:
The first derivative of the model w.r.t. is
For the second-order derivative , we get:
We now start with a bound on the model decrease in terms of the step length .
Lemma 12 (Lemma 4.1 restated).
For any , the step (satisfying Cond. 2) is such that
Note that . Using the optimality conditions introduced in Condition 2, we get
which directly implies the desired result.
In order to complete our claim of model decrease started in Lemma 2, we prove that the length of the step can not be arbitrarily small compared to the gradient of the objective function.
Lemma 13 (Lemma 4.2 restated).
We start with the following bound,
For the first term in the RHS of Eq. (29) , we have
Next, we apply the Young’s inequality for products which states that if and such that then
Choosing , , ,
We now prove that the length of the step can not be arbitrarily small compared to . We first need an additional lemma that relates the step length to the second criticality measure . Proving such result requires the following auxiliary lemma.
Suppose that Condition 1 holds. For all ,
We again apply the Young’s inequality for products stated in Eq. 32 which yields