1 Introduction
We consider the following finite-sum minimization problem
(1) |
where each (non-convex) component function is assumed to have -Lipschitz continuous gradient and -Lipschitz continuous Hessian. Since first-order stationary points could be saddle points and thus lead to inferior generalization performance (Dauphin et al., 2014), in this work we are particularly interested in computing -approximate second-order stationary points, -SOSP:
(2) |
To find the local minimum in problem (1), the cubic regularization approach (Nesterov and Polyak, 2006) and the trust region algorithm (Conn et al., 2000; Curtis et al., 2017) are two classical methods. Specifically, cubic regularization forms a cubic surrogate function for the objective by adding a third-order regularization term to the second-order Taylor expansion, and minimizes it iteratively. Such a method is proved to achieve an global convergence rate and thus needs stochastic first- and second-order oracle queries, namely the evaluation number of stochastic gradient and Hessian, to achieve a point that satisfies (2). On the other hand, trust region algorithms estimate the objective with its second-order Taylor expansion but minimize it only within a local region. Recently, Curtis et al. (2017) propose a trust region variant to achieve the same convergence rate as the cubic regularization approach. But both methods require computing full gradients and Hessians of the objective and thus suffer from high computational cost in large-scale problems.
To avoid coslty exact differential evaluations, many works explore the finite-sum structure of problem (1) and develop stochastic cubic regularization approaches. Both Kohler and Lucchi (2017b) and Xu et al. (2017) propose to directly subsample the gradient and Hessian in the cubic surrogate function, and achieve and
stochastic first- and second-order oracle complexities respectively. By plugging a stochastic variance reduced estimator
(Johnson and Zhang, 2013) and the Hessian tracking technique (Gower et al., 2018) into the gradient and Hessian estimation, the approach in Zhou et al. (2018a) improves both the stochastic first- and second-order oracle complexities to . Recently, (Zhang et al., 2018; Zhou et al., 2018b) develop more efficient stochastic cubic regularization variants, which further reduce the stochastic second-order oracle complexities to at the cost of increasing the stochastic first-order oracle complexity to .Contributions: In this paper we propose and exploit a formulation in which we make explicit control of the step size in the trust region method. This idea is leveraged to develop two efficient stochastic trust region (STR) approaches. We tailor our methods to achieve state-of-the-art oracle complexities under the following two measurements: (i) the stochastic second-order oracle complexity is prioritized; (ii) the stochastic first- and second-order oracle complexities are treated equally. Specifically, in Setting (i), our method STR employs a newly proposed estimator to approximate the Hessian and adopts the estimator in (Fang et al., 2018) for gradient approximation. Our novel Hessian estimator maintains a high accuracy second-order differential approximation with lower amortized oracle complexity. In this way, STR achieves stochastic second-order oracle complexity. This is lower than existing results for solving problem (1). In Setting (ii), our method STR substitutes the gradient estimator in STR with one that integrates stochastic gradient and Hessian together to maintain an accurate gradient approximation. As a result, STR achieves convergence in overall stochastic first- and second-order oracle queries.
1.1 Related Work
Computing local minimum to a non-convex optimization problem is gaining considerable amount of attentions in recent years. Both cubic regularization (CR) approaches (Nesterov and Polyak, 2006) and trust region (TR) algorithms (Conn et al., 2000; Curtis et al., 2017)
can escape saddle points and find a local minimum by iterating the variable along the direction related to the eigenvector of the Hessian with the most negative eigenvalue. As the CR heavily depends the regularization parameter for the cubic term,
Cartis et al. (2011) propose an adaptive cubic regularization (ARC) approach to boost the efficiency by adaptively tunes the regularization parameter according to the current objective decrease. Noting the high cost of full gradient and Hessian computation in ARC, sub-sampled cubic regularization (SCR) (Kohler and Lucchi, 2017a) is developed for sampling partial data points to estimate the full gradient and Hessian. Recently, by exploring the finite-sum structure of the target problem, many works incorporate variance-reduced technique (Johnson and Zhang, 2013) into CR and propose stochastic variance-reduced methods. For example, Zhou et al. (2018c) propose stochastic variance-reduced cubic (SVRC) in which they integrate the stochastic variance-reduced gradient estimator (Johnson and Zhang, 2013) and the Hessian tracking technique (Gower et al., 2018) with CR. Such a method is proved to be at least faster than CR and TR. Then Zhou et al. (2018b) suggest to use adaptive gradient batch size and constant Hessian batch size, and develop Lite-SVRC to further reduce the stochastic second-order oracle of SVRC to at the cost of higher gradient computation cost. Similarly, except turning the gradient batch size, Zhang et al. (2018) further adaptively sample a certain number of data points to estimate the Hessian and prove the proposed method to have the same stochastic second-order oracle complexity as Lite-SVRC.Algorithm | SFO | SSO |
---|---|---|
TR | ||
CR | ||
SCR | ||
SVRC | ||
Lite-SVRC | ||
STR | ||
STR |
2 Preliminary
Notation. We use
to denote the Euclidean norm of vector
and use to denote the spectral norm of matrix . Let be the set of component indices. We define the batch average of component function byThen we specify the assumptions that are necessary to the analysis of our methods.
Assumption 2.1.
is bounded from below and its global optimal is achieved at . We further denote
Assumption 2.2.
Each component function has -Lipschitz continuous gradient: for any
(3) |
Clearly, the objective as the average of component functions also has -Lipschitz continuous gradient.
Assumption 2.3.
Each component function has -Lipschitz continuous Hessian: for any
(4) |
Similarly, the objective has -Lipschitz continuous Hessian, which implies the following: for any ,
2.1 Trust Region Method
The trust region method has a long history (Conn et al., 2000). In each step, it solves the Quadratic Constraint Quadratic Program (QCQP)
(5) |
where is the trust-region radius, and updates
(6) |
Since is indefinite, the trust-region subproblem (5) is non-convex, but its global optimizer can be characterized by the following lemma.
Lemma 2.1 (Corollary 7.2.2 in (Conn et al., 2000)).
Any global minimizer of problem (5) satisfies the equation
(7) |
where the dual variable should satisfy and .
In particular, the standard QCQP solver returns both the minimizer as well as the corresponding dual variable of subproblem (5). While it is known that the trust-region update (5) and (6) converges at the rate , recently (Curtis et al., 2017) proposes a trust-region variant which converges at the optimal rate (Carmon et al., 2017). In this paper, we show that the vanilla trust-region update (5) and (6) already achieves the optimal convergence rate as the byproduct of our novel argument.
3 Methodology
In this section, we first introduce a general inexact trust region method which is summarized in MetaAlgorithm 1. It accepts inexact gradient estimation and Hessian estimation as input to the QCQP subproblem
(8) |
Similar to (5), the global solutions to (8) are characterized by Lemma 2.1 and we further denote the dual variable corresponding to the minimizer by . In practice, (8) can be efficiently solved by Lanczos method (Gould et al., 1999).
We prove that such inexact trust-region method achieves the optimal convergence rate when the estimation and at each iteration are sufficient close to their full (exact) counterparts and respectively:
(9) |
Such result allows us to derive stochastic trust-region variants with novel differential estimators that are tailored to ensure the optimal convergence rate. We state our formal results in Theorem 3.1.
Theorem 3.1.
Proof.
For simplicity of notation, we denote
From Assumption 2.3 we have
Use the Cauchy–Schwarz inequality to obtain
(10) |
The requirement (9) together with the trust region allow us to bound
(11) |
The optimality of (5) indicates that there exists dual variable so that (Corollary 7.2.2 in (Conn et al., 2000))
First Order | (12) | |||
Second Order | (13) | |||
Complementary | (14) |
Multiplying (12) by , we have
(15) |
Additionally, using (13) we have
which together with (15) gives
(16) |
Moreover, the complementary property (14) indicates as we have in MetaAlgorithm. Plug (11), (15), and (16) into (10) and use :
(17) |
Therefore, if we have , then
(18) |
Using Assumption 2.1, we find in no more than iterations.
We now show that once , then is already an -SOSP: From (12), we have
(19) |
The assumptions and together with the trust region imply
(20) | ||||
On the other hand use Assumption 2.3 to bound
Combining these two results gives .
Besides use Assumption 2.3, , and (13) we derive the Hessian lower bound
Hence is a -stationary point.
Therefore, we have according to the complementary condition (14) for all but the last iteration. ∎
Remark 3.1.
We emphasize that MetaAlgorithm 1 degenerates to the exact trust region method by taking and . Such result is of its own interest because this is the first proof to show that the vanilla trust region method has the optimal convergence rate. Similar rate is achieved by (Curtis et al., 2017) but with a complicated trust region variant.
Remark 3.2.
We note that MetaAlgorithm 1 uses the dual variable as stopping criterion, which enables the last-term convergence analysis in Theorem 3.1. In our appendix, we present a variant of MetaAlgorithm 1 without accessing to the exact dual variable. Further, we show such variant enjoys a similar convergence guarantee in expectation.
Theorem 3.1 shows the explicit step size control of the trust-region method: Since the dual variable satisfies for all but the last iteration, we always find the solution to the trust-region subproblem (8) in the boundary, i.e. , according to the complementary condition (14). Such exact step-size control property is missing in the cubic-regularzation method where the step-size is implicitly decided by the cubic regularization parameter.
More importantly, we emphasize that such explicit step size control is crucial to the sample efficiency of our variance reduced differential estimators. The essence of variance reduction is to exploit the correlations between the differentials in consecutive iterations. Intuitively, when two neighboring iterates are close, so are their differentials due to the Lipschitz continuity, and hence a smaller number of samples suffice to maintain the accuracy of the estimators. On the other hand, smaller step size reduces the per-iteration objective decrease which harms the convergence rate of the algorithm (see proof of Theorem 3.1). Therefore, the explicit step-size control in trust-region method allows us to well trade-off the per-iteration sample complexity and convergence rate, from which we can derive stochastic trust region approaches with state-of-the-art sample efficiency.
4 Stochastic Trust Region Method: Type I
Having the inexact trust-region method as prototype, we now derive our first sample-efficient stochastic trust region methods, namely STR, which emphasises more cheaper stochastic second-order oracle complexity. As Theorem 3.1 already guarantees the optimal convergence rate of MetaAlgorithm 1 when the gradient estimator and the Hessian estimator meet requirements (9), here we focus on constructing such novel differential estimators. Specifically, we first present our Hessian estimator in Estimator 3 and our first gradient estimator in Estimator 4, both of which exploit the trust region radius to reduce their variances. Further, by plugging Estimator 3 and Estimator 4 in MetaAlgorithm 1, we present STR in Algorithm 2 with state-of-the-art stochastic Hessian complexity.
4.1 Hessian Estimator
Our epoch-wise Hessian estimator is given in Estimator 3, where controls the epoch length and (and optionally ) controls the minibatch size. At the beginning of each epoch Estimator 3 has two options, designed for different target accuracy: Option I is preferable for the high accuracy case () where we compute the full Hessian to avoid approximation error, and Option II is designed for the moderate accuracy case () where we only need to an approximate Hessian estimator. Then, iterations follow with defined in a recurrent manner. These recurrent estimators exist for the first-order case (Nguyen et al., 2017; Fang et al., 2018), but their bound only holds under the vector norm. Here we generalize them into Hessian estimation with matrix spectrum norm bound.
The following lemma analyzes the amortized stochastic second-order oracle (Hessian) complexity for Algorithm 3 to meet the requirement in Theorem 3.1. As we need an approximation error bound under the spectrum norm, we will appeal to the matrix Azuma’s inequality (Tropp, 2012).
Lemma 4.1.
Assume Algorithm 2 takes the trust region radius as in Theorem 3.1.
For any , Estimator 3 produces estimators for the second order differentials such that with probability at least
1. and in option I, or
2. , , and in option II .
Consequently the amortized per-iteration stochastic second-order oracle complexity to construct is no more than
Proof.
Without loss of generality, we analyze the case for ease of notation.
We first focus on Option II.
The proof for Option I follows the similar argument.
Option II:
Define for and
and define for and
is a martingale difference. We have for all and ,
Besides, using Assumption 2.2 for to bound
(21) |
and using Assumption 2.3 for to bound
From the construction of , we have
Thus using the matrix Azuma’s Inequality in Theorem 7.1 of (Tropp, 2012) and , we have
Consequently, we have
by taking , , , and .
Option I: The proof is similar to the one of Option II except that we replace
with zero matrix. In such case, the matrix Azuma’s Inequality implies
Thus by taking , , and , we have the result.
Amortized Complexity: In option I, the choice of parameter ensures that: and in option II: . Consequently the amortized stochastic second-order oracle is bounded by . ∎
4.2 Gradient Estimator: Case (1)
When stochastic second-order oracle complexity is prioritized, we directly employ the SPIDER gradient estimator to construct (Fang et al., 2018). Similar to the construction for , the estimator is also construct in an epoch-wise manner as presented in Estimator 4, where controls the epoch length and controls the minibatch size.
We now analyze the necessary stochastic first-order oracle complexity to meet the requirement in Theorem 3.1.
Lemma 4.2.
The proof for Lemma 4.2 is similar to the one of Lemma 4.1 and is deferred to Appendix 7.1.
Lemma 4.2 and Lemma 4.1 only guarantee the differential estimators satisfy the requirement (9) in a single iteration and can be extended to hold for all by using the union bound with .
Combining such lifted result with Theorem 3.1, we can establish the bound of computational complexity as follows.
Corollary 4.1.
Assume Algorithm 2 will use Estimator 4 to construct the first-order differential estimator and use Estimator 3 to construct the second-order differential estimator . To find an -SOSP with probability at least , the overall stochastic first-order oracle complexity is and the overall stochastic second-order oracle complexity is .
5 Stochastic Trust Region Method: Type II
In the previous section, we focus on the setting where the stochastic second-order oracle complexity is prioritized over the stochastic first-order oracle complexity and STR achieves the state-of-the-art efficiency. In this section, we consider a different complexity measure where the first-order and second-order oracle complexities are treated equally and our goal is to minimize the maximum of them. We note that, currently the best result is of the SVRC method (Zhou et al., 2018c).
Since the Hessian estimator of STR already delivers the superior stochastic Hessian complexity, in STR (see Algorithm 5), we retain Estimator 3 for second-order differential estimation and use Estimator 6 to further reduce the stochastic gradient complexity.
![]() |
![]() |
![]() |
![]() |
(a) a9a | (b) ijcnn | ||
![]() |
![]() |
![]() |
![]() |
(c) codrna | (d) phishing |
Comparison on the logistic regression with nonconvex regularizer.
![]() |
![]() |
![]() |
![]() |
(a) a9a | (b) w8a | ||
![]() |
![]() |
![]() |
![]() |
(c) codrna |
5.1 Gradient Estimator: Case (2)
When the maximum of stochastic gradient and Hessian complexities, is prioritized, we use Hessian to improve the gradient estimation in Algorithm 4. Intuitively, the Mean Value Theorem gives
with for some , which under Assumption 2.3 allow us to bound
Such property can be used to improve Lemma 4.2 of Estimator 4. Specifically, define the correction term
where is some reference point updated in a epoch-wise manner. Estimator 6 adds to the estimator in Estimator 4. Note that in Estimator 6, the number of first-order and second-order oracle complexities are the same.
We now analyze the necessary first-order (and second-order) oracle complexity to meet requirement (9).
Lemma 5.1.
Assume Algorithm 5 takes the trust region size as in Theorem 3.1. For any , Estimator 6 produces estimator for the first order differential such that with probability at least , if we set and , with . Consequently the amortized per-iteration stochastic first-order oracle complexity to construct is .
Similar to the previous section, Lemma 5.1 only guarantee the gradient estimator satisfies the requirement 9 in a single iteration. We extended such result to hold for all by using the union bound with , which together with Theorem 3.1 gives the following corollary.
Corollary 5.1.
Assume Algorithm 5 will use Estimator 6 to construct the first-order differential estimator and use Estimator 3 to construct the second-order differential estimator . To find an -SOSP with probability at least , the overall stochastic first-order oracle complexity is and the overall stochastic second-order oracle complexity is .
6 Experiments
In this section, we compare the proposed STR with several state-of-the-art (stochastic) cubic regularized algorithms and trust region approaches, including trust region (TR) algorithm (Conn et al., 2000), adaptive cubic regularization (ARC) (Cartis et al., 2011), sub-sampled cubic regularization (SCR) (Kohler and Lucchi, 2017a), stochastic variance-reduced cubic (SVRC) (Zhou et al., 2018c) and Lite-SVRC (Zhou et al., 2018b). For STR, we estimate the gradient as the way in case (1). This is because such a method enjoys lower Hessian computational complexity over the way in case (2) and for most problems, computing their Hessian matrices is much more time-consuming than computing their gradients. For the subproblems in these compared methods, we use Lanczos method (Gould et al., 1999; Kohler and Lucchi, 2017a) to solve the sub-problem approximately in a Hessian-related Krylov subspace. We run simulations on five datasets from LibSVM (a09, ijcnn, codrna, phishing, and w08). The details of these datasets are described in Appendix 9. For all the considered algorithms, we tune their hyper-parameters optimally.
Two evaluation nonconvex problems. Following (Kohler and Lucchi, 2017a; Zhou et al., 2018c), we evaluate all considered algorithms on two learning tasks: the logistic regression with nonconvex regularizer and the nonlinear least square. Given data points where is the sample vector and is the label, logistic regression with nonconvex regularizer aims at distinguishing these two kinds of samples by solving the following problem
where the nonconvex regularizer is defined as . The nonlinear least square problem fits the nonlinear data by minimizing
For both these two kinds of problems, we set the parameters and for all testing datasets.
Figure 1 summarizes the testing results on the nonconvex logistic regression problems. For each dataset, we report the function value gap v.s. the overall algorithm running time which can reflect the overall computational complexity of an algorithm, and also show the function value gap v.s. Hessian sample complexity which reveals the complexity of Hessian computation. From Fig. 1, one can observe that our proposed STR algorithm runs faster than the compared algorithms in terms of the algorithm running time, showing the overall superiority of STR. Furthermore, STR also reveals much sharper convergence curves in terms of the Hessian sample complexity which is consistent with our theory. This is because to achieve an -accuracy local minimum, the Hessian sample complexity of the proposed STR is and is superior over the complexity of the compared methods (see the comparison in Sec. 4.2). Indeed, this also explains why our algorithm is also faster in terms of algorithm running time, since for most optimization problems, Hessian matrix is much more computationally expensive than the gradient and thus more efficient Hessian sample complexity means faster overall convergence speed.
Figure 2 displays the results of the compared algorithms on the nonlinear least square problems. STR shows very similar behaviors as those in Figure 1. More specifically, STR achieves fastest convergence rate in terms of both algorithm running time and Hessian sample complexity. On the codrna dataset (the bottom of Figure 2) we further plot the function value gap versus running-time curves and Hessian sample complexity. One can obverse that the gradient in STR vanishes significantly faster than other algorithms which means that STR can find the stationary point with high efficiency. See Figure 3 in Appendix 9.2 for more experimental results on running time comparison. All these results confirm the superiority of the proposed STR.
Conclusion
We proposed two stochastic trust region variants. Under two efficiency measurement settings (whether the stochastic first- and second-order oracle complexity are treated equally), the proposed method achieve state-of-the-art oracle complexity. Experimental results well testify our theoretical implications and the efficiency of the proposed algorithm.
References
- Carmon et al. [2017] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. arXiv preprint arXiv:1710.11606, 2017.
- Cartis et al. [2011] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Adaptive cubic regularisation methods for unconstrained optimization. part i: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
- Conn et al. [2000] Andrew R Conn, Nicholas IM Gould, and Ph L Toint. Trust region methods, volume 1. Siam, 2000.
- Curtis et al. [2017] Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of (epsilon 3/2) for nonconvex optimization. Mathematical Programming: Series A and B, 162(1-2):1–32, 2017.
- Dauphin et al. [2014] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.
- Fang et al. [2018] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 686–696, 2018.
- Gould et al. [1999] Nicholas IM Gould, Stefano Lucidi, Massimo Roma, and Philippe L Toint. Solving the trust-region subproblem using the lanczos method. SIAM Journal on Optimization, 9(2):504–525, 1999.
-
Gower et al. [2018]
Robert Gower, Nicolas Le Roux, and Francis Bach.
Tracking the gradients using the hessian: A new look at variance
reducing stochastic methods.
In
Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics
, 2018. -
Johnson and Zhang [2013]
Rie Johnson and Tong Zhang.
Accelerating stochastic gradient descent using predictive variance reduction.
In Advances in neural information processing systems, pages 315–323, 2013. - Kohler and Lucchi [2017a] Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. arXiv preprint arXiv:1705.05933, 2017a.
- Kohler and Lucchi [2017b] Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning, 2017b.
- Nesterov and Polyak [2006] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
- Nguyen et al. [2017] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621, 2017.
- Tropp [2012] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
- Xu et al. [2017] Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Newton-type methods for non-convex optimization under inexact hessian information. arXiv preprint arXiv:1708.07164, 2017.
- Zhang et al. [2018] Junyu Zhang, Lin Xiao, and Shuzhong Zhang. Adaptive stochastic variance reduction for subsampled newton method with cubic regularization. arXiv preprint arXiv:1811.11637, 2018.
- Zhou et al. [2018a] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced cubic regularized Newton methods. In Proceedings of the 35th International Conference on Machine Learning, 2018a.
- Zhou et al. [2018b] Dongruo Zhou, Pan Xu, and Quanquan Gu. Sample efficient stochastic variance-reduced cubic regularization method. arXiv preprint arXiv:1811.11989, 2018b.
- Zhou et al. [2018c] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic variance-reduced cubic regularized newton method. In ICML, 2018c.
7 Appendix
7.1 Proof of Lemma 4.2
Without loss of generality, we analyze the case for ease of notation. Define for and
is a martingale difference: for all and
Besides has bounded norm: Using Assumption 2.3
(22) |
From the construction of , we have
Recall the Azuma’s Inequality. Using , we have
Take and denote . To ensure that