1 Introduction
We consider the following finitesum minimization problem
(1) 
where each (nonconvex) component function is assumed to have Lipschitz continuous gradient and Lipschitz continuous Hessian. Since firstorder stationary points could be saddle points and thus lead to inferior generalization performance (Dauphin et al., 2014), in this work we are particularly interested in computing approximate secondorder stationary points, SOSP:
(2) 
To find the local minimum in problem (1), the cubic regularization approach (Nesterov and Polyak, 2006) and the trust region algorithm (Conn et al., 2000; Curtis et al., 2017) are two classical methods. Specifically, cubic regularization forms a cubic surrogate function for the objective by adding a thirdorder regularization term to the secondorder Taylor expansion, and minimizes it iteratively. Such a method is proved to achieve an global convergence rate and thus needs stochastic first and secondorder oracle queries, namely the evaluation number of stochastic gradient and Hessian, to achieve a point that satisfies (2). On the other hand, trust region algorithms estimate the objective with its secondorder Taylor expansion but minimize it only within a local region. Recently, Curtis et al. (2017) propose a trust region variant to achieve the same convergence rate as the cubic regularization approach. But both methods require computing full gradients and Hessians of the objective and thus suffer from high computational cost in largescale problems.
To avoid coslty exact differential evaluations, many works explore the finitesum structure of problem (1) and develop stochastic cubic regularization approaches. Both Kohler and Lucchi (2017b) and Xu et al. (2017) propose to directly subsample the gradient and Hessian in the cubic surrogate function, and achieve and
stochastic first and secondorder oracle complexities respectively. By plugging a stochastic variance reduced estimator
(Johnson and Zhang, 2013) and the Hessian tracking technique (Gower et al., 2018) into the gradient and Hessian estimation, the approach in Zhou et al. (2018a) improves both the stochastic first and secondorder oracle complexities to . Recently, (Zhang et al., 2018; Zhou et al., 2018b) develop more efficient stochastic cubic regularization variants, which further reduce the stochastic secondorder oracle complexities to at the cost of increasing the stochastic firstorder oracle complexity to .Contributions: In this paper we propose and exploit a formulation in which we make explicit control of the step size in the trust region method. This idea is leveraged to develop two efficient stochastic trust region (STR) approaches. We tailor our methods to achieve stateoftheart oracle complexities under the following two measurements: (i) the stochastic secondorder oracle complexity is prioritized; (ii) the stochastic first and secondorder oracle complexities are treated equally. Specifically, in Setting (i), our method STR employs a newly proposed estimator to approximate the Hessian and adopts the estimator in (Fang et al., 2018) for gradient approximation. Our novel Hessian estimator maintains a high accuracy secondorder differential approximation with lower amortized oracle complexity. In this way, STR achieves stochastic secondorder oracle complexity. This is lower than existing results for solving problem (1). In Setting (ii), our method STR substitutes the gradient estimator in STR with one that integrates stochastic gradient and Hessian together to maintain an accurate gradient approximation. As a result, STR achieves convergence in overall stochastic first and secondorder oracle queries.
1.1 Related Work
Computing local minimum to a nonconvex optimization problem is gaining considerable amount of attentions in recent years. Both cubic regularization (CR) approaches (Nesterov and Polyak, 2006) and trust region (TR) algorithms (Conn et al., 2000; Curtis et al., 2017)
can escape saddle points and find a local minimum by iterating the variable along the direction related to the eigenvector of the Hessian with the most negative eigenvalue. As the CR heavily depends the regularization parameter for the cubic term,
Cartis et al. (2011) propose an adaptive cubic regularization (ARC) approach to boost the efficiency by adaptively tunes the regularization parameter according to the current objective decrease. Noting the high cost of full gradient and Hessian computation in ARC, subsampled cubic regularization (SCR) (Kohler and Lucchi, 2017a) is developed for sampling partial data points to estimate the full gradient and Hessian. Recently, by exploring the finitesum structure of the target problem, many works incorporate variancereduced technique (Johnson and Zhang, 2013) into CR and propose stochastic variancereduced methods. For example, Zhou et al. (2018c) propose stochastic variancereduced cubic (SVRC) in which they integrate the stochastic variancereduced gradient estimator (Johnson and Zhang, 2013) and the Hessian tracking technique (Gower et al., 2018) with CR. Such a method is proved to be at least faster than CR and TR. Then Zhou et al. (2018b) suggest to use adaptive gradient batch size and constant Hessian batch size, and develop LiteSVRC to further reduce the stochastic secondorder oracle of SVRC to at the cost of higher gradient computation cost. Similarly, except turning the gradient batch size, Zhang et al. (2018) further adaptively sample a certain number of data points to estimate the Hessian and prove the proposed method to have the same stochastic secondorder oracle complexity as LiteSVRC.Algorithm  SFO  SSO 

TR  
CR  
SCR  
SVRC  
LiteSVRC  
STR  
STR 
2 Preliminary
Notation. We use
to denote the Euclidean norm of vector
and use to denote the spectral norm of matrix . Let be the set of component indices. We define the batch average of component function byThen we specify the assumptions that are necessary to the analysis of our methods.
Assumption 2.1.
is bounded from below and its global optimal is achieved at . We further denote
Assumption 2.2.
Each component function has Lipschitz continuous gradient: for any
(3) 
Clearly, the objective as the average of component functions also has Lipschitz continuous gradient.
Assumption 2.3.
Each component function has Lipschitz continuous Hessian: for any
(4) 
Similarly, the objective has Lipschitz continuous Hessian, which implies the following: for any ,
2.1 Trust Region Method
The trust region method has a long history (Conn et al., 2000). In each step, it solves the Quadratic Constraint Quadratic Program (QCQP)
(5) 
where is the trustregion radius, and updates
(6) 
Since is indefinite, the trustregion subproblem (5) is nonconvex, but its global optimizer can be characterized by the following lemma.
Lemma 2.1 (Corollary 7.2.2 in (Conn et al., 2000)).
Any global minimizer of problem (5) satisfies the equation
(7) 
where the dual variable should satisfy and .
In particular, the standard QCQP solver returns both the minimizer as well as the corresponding dual variable of subproblem (5). While it is known that the trustregion update (5) and (6) converges at the rate , recently (Curtis et al., 2017) proposes a trustregion variant which converges at the optimal rate (Carmon et al., 2017). In this paper, we show that the vanilla trustregion update (5) and (6) already achieves the optimal convergence rate as the byproduct of our novel argument.
3 Methodology
In this section, we first introduce a general inexact trust region method which is summarized in MetaAlgorithm 1. It accepts inexact gradient estimation and Hessian estimation as input to the QCQP subproblem
(8) 
Similar to (5), the global solutions to (8) are characterized by Lemma 2.1 and we further denote the dual variable corresponding to the minimizer by . In practice, (8) can be efficiently solved by Lanczos method (Gould et al., 1999).
We prove that such inexact trustregion method achieves the optimal convergence rate when the estimation and at each iteration are sufficient close to their full (exact) counterparts and respectively:
(9) 
Such result allows us to derive stochastic trustregion variants with novel differential estimators that are tailored to ensure the optimal convergence rate. We state our formal results in Theorem 3.1.
Theorem 3.1.
Proof.
For simplicity of notation, we denote
From Assumption 2.3 we have
Use the Cauchy–Schwarz inequality to obtain
(10) 
The requirement (9) together with the trust region allow us to bound
(11) 
The optimality of (5) indicates that there exists dual variable so that (Corollary 7.2.2 in (Conn et al., 2000))
First Order  (12)  
Second Order  (13)  
Complementary  (14) 
Multiplying (12) by , we have
(15) 
Additionally, using (13) we have
which together with (15) gives
(16) 
Moreover, the complementary property (14) indicates as we have in MetaAlgorithm. Plug (11), (15), and (16) into (10) and use :
(17) 
Therefore, if we have , then
(18) 
Using Assumption 2.1, we find in no more than iterations.
We now show that once , then is already an SOSP: From (12), we have
(19) 
The assumptions and together with the trust region imply
(20)  
On the other hand use Assumption 2.3 to bound
Combining these two results gives .
Besides use Assumption 2.3, , and (13) we derive the Hessian lower bound
Hence is a stationary point.
Therefore, we have according to the complementary condition (14) for all but the last iteration. ∎
Remark 3.1.
We emphasize that MetaAlgorithm 1 degenerates to the exact trust region method by taking and . Such result is of its own interest because this is the first proof to show that the vanilla trust region method has the optimal convergence rate. Similar rate is achieved by (Curtis et al., 2017) but with a complicated trust region variant.
Remark 3.2.
We note that MetaAlgorithm 1 uses the dual variable as stopping criterion, which enables the lastterm convergence analysis in Theorem 3.1. In our appendix, we present a variant of MetaAlgorithm 1 without accessing to the exact dual variable. Further, we show such variant enjoys a similar convergence guarantee in expectation.
Theorem 3.1 shows the explicit step size control of the trustregion method: Since the dual variable satisfies for all but the last iteration, we always find the solution to the trustregion subproblem (8) in the boundary, i.e. , according to the complementary condition (14). Such exact stepsize control property is missing in the cubicregularzation method where the stepsize is implicitly decided by the cubic regularization parameter.
More importantly, we emphasize that such explicit step size control is crucial to the sample efficiency of our variance reduced differential estimators. The essence of variance reduction is to exploit the correlations between the differentials in consecutive iterations. Intuitively, when two neighboring iterates are close, so are their differentials due to the Lipschitz continuity, and hence a smaller number of samples suffice to maintain the accuracy of the estimators. On the other hand, smaller step size reduces the periteration objective decrease which harms the convergence rate of the algorithm (see proof of Theorem 3.1). Therefore, the explicit stepsize control in trustregion method allows us to well tradeoff the periteration sample complexity and convergence rate, from which we can derive stochastic trust region approaches with stateoftheart sample efficiency.
4 Stochastic Trust Region Method: Type I
Having the inexact trustregion method as prototype, we now derive our first sampleefficient stochastic trust region methods, namely STR, which emphasises more cheaper stochastic secondorder oracle complexity. As Theorem 3.1 already guarantees the optimal convergence rate of MetaAlgorithm 1 when the gradient estimator and the Hessian estimator meet requirements (9), here we focus on constructing such novel differential estimators. Specifically, we first present our Hessian estimator in Estimator 3 and our first gradient estimator in Estimator 4, both of which exploit the trust region radius to reduce their variances. Further, by plugging Estimator 3 and Estimator 4 in MetaAlgorithm 1, we present STR in Algorithm 2 with stateoftheart stochastic Hessian complexity.
4.1 Hessian Estimator
Our epochwise Hessian estimator is given in Estimator 3, where controls the epoch length and (and optionally ) controls the minibatch size. At the beginning of each epoch Estimator 3 has two options, designed for different target accuracy: Option I is preferable for the high accuracy case () where we compute the full Hessian to avoid approximation error, and Option II is designed for the moderate accuracy case () where we only need to an approximate Hessian estimator. Then, iterations follow with defined in a recurrent manner. These recurrent estimators exist for the firstorder case (Nguyen et al., 2017; Fang et al., 2018), but their bound only holds under the vector norm. Here we generalize them into Hessian estimation with matrix spectrum norm bound.
The following lemma analyzes the amortized stochastic secondorder oracle (Hessian) complexity for Algorithm 3 to meet the requirement in Theorem 3.1. As we need an approximation error bound under the spectrum norm, we will appeal to the matrix Azuma’s inequality (Tropp, 2012).
Lemma 4.1.
Assume Algorithm 2 takes the trust region radius as in Theorem 3.1. For any , Estimator 3 produces estimators for the second order differentials such that
with probability at least
if we set1. and in option I, or
2. , , and in option II .
Consequently the amortized periteration stochastic secondorder oracle complexity to construct is no more than
Proof.
Without loss of generality, we analyze the case for ease of notation.
We first focus on Option II.
The proof for Option I follows the similar argument.
Option II:
Define for and
and define for and
is a martingale difference. We have for all and ,
Besides, using Assumption 2.2 for to bound
(21) 
and using Assumption 2.3 for to bound
From the construction of , we have
Thus using the matrix Azuma’s Inequality in Theorem 7.1 of (Tropp, 2012) and , we have
Consequently, we have
by taking , , , and .
Option I: The proof is similar to the one of Option II except that we replace
with zero matrix. In such case, the matrix Azuma’s Inequality implies
Thus by taking , , and , we have the result.
Amortized Complexity: In option I, the choice of parameter ensures that: and in option II: . Consequently the amortized stochastic secondorder oracle is bounded by . ∎
4.2 Gradient Estimator: Case (1)
When stochastic secondorder oracle complexity is prioritized, we directly employ the SPIDER gradient estimator to construct (Fang et al., 2018). Similar to the construction for , the estimator is also construct in an epochwise manner as presented in Estimator 4, where controls the epoch length and controls the minibatch size.
We now analyze the necessary stochastic firstorder oracle complexity to meet the requirement in Theorem 3.1.
Lemma 4.2.
The proof for Lemma 4.2 is similar to the one of Lemma 4.1 and is deferred to Appendix 7.1.
Lemma 4.2 and Lemma 4.1 only guarantee the differential estimators satisfy the requirement (9) in a single iteration and can be extended to hold for all by using the union bound with .
Combining such lifted result with Theorem 3.1, we can establish the bound of computational complexity as follows.
Corollary 4.1.
Assume Algorithm 2 will use Estimator 4 to construct the firstorder differential estimator and use Estimator 3 to construct the secondorder differential estimator . To find an SOSP with probability at least , the overall stochastic firstorder oracle complexity is and the overall stochastic secondorder oracle complexity is .
5 Stochastic Trust Region Method: Type II
In the previous section, we focus on the setting where the stochastic secondorder oracle complexity is prioritized over the stochastic firstorder oracle complexity and STR achieves the stateoftheart efficiency. In this section, we consider a different complexity measure where the firstorder and secondorder oracle complexities are treated equally and our goal is to minimize the maximum of them. We note that, currently the best result is of the SVRC method (Zhou et al., 2018c).
Since the Hessian estimator of STR already delivers the superior stochastic Hessian complexity, in STR (see Algorithm 5), we retain Estimator 3 for secondorder differential estimation and use Estimator 6 to further reduce the stochastic gradient complexity.
(a) a9a  (b) ijcnn  
(c) codrna  (d) phishing 
Comparison on the logistic regression with nonconvex regularizer.
(a) a9a  (b) w8a  
(c) codrna 
5.1 Gradient Estimator: Case (2)
When the maximum of stochastic gradient and Hessian complexities, is prioritized, we use Hessian to improve the gradient estimation in Algorithm 4. Intuitively, the Mean Value Theorem gives
with for some , which under Assumption 2.3 allow us to bound
Such property can be used to improve Lemma 4.2 of Estimator 4. Specifically, define the correction term
where is some reference point updated in a epochwise manner. Estimator 6 adds to the estimator in Estimator 4. Note that in Estimator 6, the number of firstorder and secondorder oracle complexities are the same.
We now analyze the necessary firstorder (and secondorder) oracle complexity to meet requirement (9).
Lemma 5.1.
Assume Algorithm 5 takes the trust region size as in Theorem 3.1. For any , Estimator 6 produces estimator for the first order differential such that with probability at least , if we set and , with . Consequently the amortized periteration stochastic firstorder oracle complexity to construct is .
Similar to the previous section, Lemma 5.1 only guarantee the gradient estimator satisfies the requirement 9 in a single iteration. We extended such result to hold for all by using the union bound with , which together with Theorem 3.1 gives the following corollary.
Corollary 5.1.
Assume Algorithm 5 will use Estimator 6 to construct the firstorder differential estimator and use Estimator 3 to construct the secondorder differential estimator . To find an SOSP with probability at least , the overall stochastic firstorder oracle complexity is and the overall stochastic secondorder oracle complexity is .
6 Experiments
In this section, we compare the proposed STR with several stateoftheart (stochastic) cubic regularized algorithms and trust region approaches, including trust region (TR) algorithm (Conn et al., 2000), adaptive cubic regularization (ARC) (Cartis et al., 2011), subsampled cubic regularization (SCR) (Kohler and Lucchi, 2017a), stochastic variancereduced cubic (SVRC) (Zhou et al., 2018c) and LiteSVRC (Zhou et al., 2018b). For STR, we estimate the gradient as the way in case (1). This is because such a method enjoys lower Hessian computational complexity over the way in case (2) and for most problems, computing their Hessian matrices is much more timeconsuming than computing their gradients. For the subproblems in these compared methods, we use Lanczos method (Gould et al., 1999; Kohler and Lucchi, 2017a) to solve the subproblem approximately in a Hessianrelated Krylov subspace. We run simulations on five datasets from LibSVM (a09, ijcnn, codrna, phishing, and w08). The details of these datasets are described in Appendix 9. For all the considered algorithms, we tune their hyperparameters optimally.
Two evaluation nonconvex problems. Following (Kohler and Lucchi, 2017a; Zhou et al., 2018c), we evaluate all considered algorithms on two learning tasks: the logistic regression with nonconvex regularizer and the nonlinear least square. Given data points where is the sample vector and is the label, logistic regression with nonconvex regularizer aims at distinguishing these two kinds of samples by solving the following problem
where the nonconvex regularizer is defined as . The nonlinear least square problem fits the nonlinear data by minimizing
For both these two kinds of problems, we set the parameters and for all testing datasets.
Figure 1 summarizes the testing results on the nonconvex logistic regression problems. For each dataset, we report the function value gap v.s. the overall algorithm running time which can reflect the overall computational complexity of an algorithm, and also show the function value gap v.s. Hessian sample complexity which reveals the complexity of Hessian computation. From Fig. 1, one can observe that our proposed STR algorithm runs faster than the compared algorithms in terms of the algorithm running time, showing the overall superiority of STR. Furthermore, STR also reveals much sharper convergence curves in terms of the Hessian sample complexity which is consistent with our theory. This is because to achieve an accuracy local minimum, the Hessian sample complexity of the proposed STR is and is superior over the complexity of the compared methods (see the comparison in Sec. 4.2). Indeed, this also explains why our algorithm is also faster in terms of algorithm running time, since for most optimization problems, Hessian matrix is much more computationally expensive than the gradient and thus more efficient Hessian sample complexity means faster overall convergence speed.
Figure 2 displays the results of the compared algorithms on the nonlinear least square problems. STR shows very similar behaviors as those in Figure 1. More specifically, STR achieves fastest convergence rate in terms of both algorithm running time and Hessian sample complexity. On the codrna dataset (the bottom of Figure 2) we further plot the function value gap versus runningtime curves and Hessian sample complexity. One can obverse that the gradient in STR vanishes significantly faster than other algorithms which means that STR can find the stationary point with high efficiency. See Figure 3 in Appendix 9.2 for more experimental results on running time comparison. All these results confirm the superiority of the proposed STR.
Conclusion
We proposed two stochastic trust region variants. Under two efficiency measurement settings (whether the stochastic first and secondorder oracle complexity are treated equally), the proposed method achieve stateoftheart oracle complexity. Experimental results well testify our theoretical implications and the efficiency of the proposed algorithm.
References
 Carmon et al. [2017] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. arXiv preprint arXiv:1710.11606, 2017.
 Cartis et al. [2011] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
 Conn et al. [2000] Andrew R Conn, Nicholas IM Gould, and Ph L Toint. Trust region methods, volume 1. Siam, 2000.
 Curtis et al. [2017] Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worstcase iteration complexity of (epsilon 3/2) for nonconvex optimization. Mathematical Programming: Series A and B, 162(12):1–32, 2017.
 Dauphin et al. [2014] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in highdimensional nonconvex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.
 Fang et al. [2018] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Nearoptimal nonconvex optimization via stochastic pathintegrated differential estimator. In Advances in Neural Information Processing Systems, pages 686–696, 2018.
 Gould et al. [1999] Nicholas IM Gould, Stefano Lucidi, Massimo Roma, and Philippe L Toint. Solving the trustregion subproblem using the lanczos method. SIAM Journal on Optimization, 9(2):504–525, 1999.

Gower et al. [2018]
Robert Gower, Nicolas Le Roux, and Francis Bach.
Tracking the gradients using the hessian: A new look at variance
reducing stochastic methods.
In
Proceedings of the TwentyFirst International Conference on Artificial Intelligence and Statistics
, 2018. 
Johnson and Zhang [2013]
Rie Johnson and Tong Zhang.
Accelerating stochastic gradient descent using predictive variance reduction.
In Advances in neural information processing systems, pages 315–323, 2013.  Kohler and Lucchi [2017a] Jonas Moritz Kohler and Aurelien Lucchi. Subsampled cubic regularization for nonconvex optimization. arXiv preprint arXiv:1705.05933, 2017a.
 Kohler and Lucchi [2017b] Jonas Moritz Kohler and Aurelien Lucchi. Subsampled cubic regularization for nonconvex optimization. In Proceedings of the 34th International Conference on Machine Learning, 2017b.
 Nesterov and Polyak [2006] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
 Nguyen et al. [2017] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621, 2017.
 Tropp [2012] Joel A Tropp. Userfriendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
 Xu et al. [2017] Peng Xu, Farbod RoostaKhorasani, and Michael W Mahoney. Newtontype methods for nonconvex optimization under inexact hessian information. arXiv preprint arXiv:1708.07164, 2017.
 Zhang et al. [2018] Junyu Zhang, Lin Xiao, and Shuzhong Zhang. Adaptive stochastic variance reduction for subsampled newton method with cubic regularization. arXiv preprint arXiv:1811.11637, 2018.
 Zhou et al. [2018a] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic variancereduced cubic regularized Newton methods. In Proceedings of the 35th International Conference on Machine Learning, 2018a.
 Zhou et al. [2018b] Dongruo Zhou, Pan Xu, and Quanquan Gu. Sample efficient stochastic variancereduced cubic regularization method. arXiv preprint arXiv:1811.11989, 2018b.
 Zhou et al. [2018c] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic variancereduced cubic regularized newton method. In ICML, 2018c.
7 Appendix
7.1 Proof of Lemma 4.2
Without loss of generality, we analyze the case for ease of notation. Define for and
is a martingale difference: for all and
Besides has bounded norm: Using Assumption 2.3
(22) 
From the construction of , we have
Recall the Azuma’s Inequality. Using , we have
Take and denote . To ensure that