1 Introduction
Many machine learning problems can be formulated as empirical risk minimization, which takes the form of the following finite-sum optimization problem:
(1) $\min_{\mathbf{x}\in\mathbb{R}^d} F(\mathbf{x}) := \frac{1}{n}\sum_{i=1}^n f_i(\mathbf{x}),$
where each $f_i$ can be a convex or nonconvex function. In this paper, we are particularly interested in nonconvex finite-sum optimization, where each $f_i$ is nonconvex. This is often the case for deep learning (LeCun et al., 2015). In principle, it is hard to find the global minimum of (1) because of the NP-hardness of the problem (Hillar and Lim, 2013); it is therefore reasonable to resort to finding local minima (a.k.a., second-order stationary points). It has been shown that local minima can be the global minima in certain machine learning problems, such as low-rank matrix factorization (Ge et al., 2016; Bhojanapalli et al., 2016; Zhang et al., 2018b) and training deep linear neural networks (Kawaguchi, 2016; Hardt and Ma, 2016). Therefore, developing algorithms to find local minima is important both in theory and in practice. More specifically, we define an $(\epsilon_g, \epsilon_H)$-approximate local minimum $\mathbf{x}$ of $F$ as follows:
(2) $\|\nabla F(\mathbf{x})\| \le \epsilon_g, \qquad \lambda_{\min}(\nabla^2 F(\mathbf{x})) \ge -\epsilon_H,$
where $\epsilon_g, \epsilon_H > 0$ are predefined precision parameters.
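Given exact gradient and Hessian oracles, the two conditions in (2) can be checked numerically; the function below is an illustrative sketch (the names are ours, not from the paper):

```python
import numpy as np

def is_approx_local_min(grad, hess, eps_g, eps_H):
    """Check the (eps_g, eps_H)-approximate local minimum conditions in (2):
    ||grad F(x)|| <= eps_g  and  lambda_min(hess F(x)) >= -eps_H."""
    grad_small = np.linalg.norm(grad) <= eps_g
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
    curvature_ok = np.linalg.eigvalsh(hess)[0] >= -eps_H
    return bool(grad_small and curvature_ok)
```

For instance, a point with zero gradient and positive definite Hessian satisfies the conditions for any positive tolerances.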
The most classic algorithm for finding an approximate local minimum is the cubic-regularized (CR) Newton method, originally proposed in a seminal paper by Nesterov and Polyak (2006). Generally speaking, in the $k$-th iteration, the cubic regularization method solves a subproblem that minimizes a cubic-regularized second-order Taylor expansion at the current iterate $\mathbf{x}_k$. The update rule can be written as follows:
(3) $\mathbf{h}_k = \operatorname{argmin}_{\mathbf{h}\in\mathbb{R}^d}\ \langle \nabla F(\mathbf{x}_k), \mathbf{h}\rangle + \tfrac{1}{2}\langle \nabla^2 F(\mathbf{x}_k)\mathbf{h}, \mathbf{h}\rangle + \tfrac{\theta_k}{6}\|\mathbf{h}\|^3,$
(4) $\mathbf{x}_{k+1} = \mathbf{x}_k + \mathbf{h}_k,$
where $\theta_k > 0$ is a penalty parameter. Nesterov and Polyak (2006) proved that to find an $(\epsilon, \sqrt{\epsilon})$-approximate local minimum of a nonconvex function $F$, cubic regularization requires at most $O(\epsilon^{-3/2})$ iterations. However, when applying cubic regularization to nonconvex finite-sum optimization in (1), a major bottleneck is that it needs to compute $n$ individual gradients $\nabla f_i(\mathbf{x}_k)$ and Hessian matrices $\nabla^2 f_i(\mathbf{x}_k)$ at each iteration, which leads to a total $O(n\epsilon^{-3/2})$ gradient complexity (i.e., number of queries to the stochastic gradient oracle $\nabla f_i(\mathbf{x})$ for some $i$ and $\mathbf{x}$) and $O(n\epsilon^{-3/2})$ Hessian complexity (i.e., number of queries to the stochastic Hessian oracle $\nabla^2 f_i(\mathbf{x})$ for some $i$ and $\mathbf{x}$). Such computational overhead is extremely expensive when $n$ is large, as is the case in many large-scale machine learning applications.
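As a concrete illustration of the update rule (3)-(4), the following sketch performs one cubic-regularized Newton step, minimizing the cubic model with plain gradient descent as a simple stand-in for an exact subproblem solver (the step size and iteration count are illustrative, not analyzed values):

```python
import numpy as np

def cubic_newton_step(grad, hess, theta, lr=0.1, iters=500):
    """One cubic-regularized Newton step: minimize the cubic model
        m(h) = <g, h> + 0.5 <H h, h> + (theta / 6) ||h||^3
    over h by gradient descent and return the step h_k.
    Note that grad m(h) = g + H h + (theta / 2) ||h|| h."""
    h = np.zeros_like(grad)
    for _ in range(iters):
        h -= lr * (grad + hess @ h + 0.5 * theta * np.linalg.norm(h) * h)
    return h  # the update is x_{k+1} = x_k + h
```

For well-conditioned models this inner loop drives the model gradient close to zero; production solvers instead use Lanczos- or Krylov-based methods.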
To overcome the aforementioned computational burden of cubic regularization, Kohler and Lucchi (2017); Xu et al. (2017) used subsampled gradients and subsampled Hessians, which achieve $\tilde{O}(\epsilon^{-7/2})$ gradient complexity and $\tilde{O}(\epsilon^{-5/2})$ Hessian complexity. Zhou et al. (2018d) proposed a stochastic variance-reduced cubic regularization method (SVRC), which uses novel semi-stochastic gradient and semi-stochastic Hessian estimators inspired by variance reduction for first-order finite-sum optimization (Johnson and Zhang, 2013; Reddi et al., 2016a; Allen-Zhu and Hazan, 2016), and attains $\tilde{O}(n^{4/5}\epsilon^{-3/2})$ second-order oracle (SO) complexity.^1 Zhou et al. (2018b); Wang et al. (2018); Zhang et al. (2018a) used a simpler semi-stochastic gradient than Zhou et al. (2018d), together with a semi-stochastic Hessian, which attains a better Hessian complexity, i.e., $\tilde{O}(n^{2/3}\epsilon^{-3/2})$. However, it is unclear whether the gradient and Hessian complexities of the aforementioned SVRC algorithms can be further improved. Furthermore, all these algorithms need to use the semi-stochastic Hessian estimator, which is not compatible with Hessian-vector product-based cubic subproblem solvers (Agarwal et al., 2017; Carmon and Duchi, 2016, 2018). Therefore, the cubic subproblem (4) in each iteration of existing SVRC algorithms has to be solved by computing the inverse of the Hessian matrix, whose computational complexity is at least $O(d^\omega)$.^2 This makes existing SVRC algorithms impractical for high-dimensional problems.

^1 The second-order oracle (SO) returns the triple $(f_i(\mathbf{x}), \nabla f_i(\mathbf{x}), \nabla^2 f_i(\mathbf{x}))$ for some $i$ and $\mathbf{x}$; hence the SO complexity can be seen as the maximum of the gradient and Hessian complexities.
^2 $\omega$ is the matrix multiplication constant, where $2 \le \omega < 2.373$.
In this paper, we first show that the gradient and Hessian complexities of SVRC-type algorithms can be further improved. The core idea is to use novel recursively updated semi-stochastic gradient and Hessian estimators, inspired by the recursive semi-stochastic gradient estimators used in Nguyen et al. (2017); Fang et al. (2018) for first-order finite-sum optimization. We show that this kind of estimator can also reduce the Hessian complexity, which was not known before. In addition, to reduce the runtime complexity of existing SVRC algorithms, we further propose a Hessian-free SVRC method that can not only use the novel semi-stochastic gradient estimator, but also leverage Hessian-vector product-based fast cubic subproblem solvers. Experiments on benchmark nonconvex finite-sum optimization problems illustrate the superiority of our newly proposed SVRC algorithms over the state of the art.
In detail, our contributions are summarized as follows:


We propose a new SVRC algorithm, namely SRVRC, which can find an approximate local minimum with improved gradient and Hessian complexities. The gradient and Hessian complexities of SRVRC are strictly better than those of the algorithms in Zhou et al. (2018b); Wang et al. (2018); Zhang et al. (2018a), and better than that of Zhou et al. (2018d) in a wide regime.

We further propose a new Hessian-free algorithm, SRVRC$_{\text{free}}$, which finds an approximate local minimum with a runtime that is strictly better than that of Agarwal et al. (2017); Carmon and Duchi (2016); Tripuraneni et al. (2018) in a wide regime. The runtime complexity of SRVRC$_{\text{free}}$ is also better than that of SRVRC when $d$ is large.
For ease of comparison, we list the gradient and Hessian complexity results of our algorithms as well as the baseline algorithms in Table 1, and the runtime complexity results in Table 2.
Algorithm | Gradient Complexity | Hessian Complexity
Cubic regularization (Nesterov and Polyak, 2006) | $O(n\epsilon^{-3/2})$ | $O(n\epsilon^{-3/2})$
Subsampled cubic regularization (Kohler and Lucchi, 2017; Xu et al., 2017) | $\tilde{O}(\epsilon^{-7/2})$ | $\tilde{O}(\epsilon^{-5/2})$
SVRC (Zhou et al., 2018d) | $\tilde{O}(n^{4/5}\epsilon^{-3/2})$ | $\tilde{O}(n^{4/5}\epsilon^{-3/2})$
Lite-SVRC (Zhou et al., 2018b) | – | $\tilde{O}(n^{2/3}\epsilon^{-3/2})$
SVRC (Wang et al., 2018) | – | $\tilde{O}(n^{2/3}\epsilon^{-3/2})$
SVRC (Zhang et al., 2018a) | – | $\tilde{O}(n^{2/3}\epsilon^{-3/2})$
SRVRC (This work) | see Corollary 4.2 | see Corollary 4.2
Algorithm | Runtime
FastCubic (Agarwal et al., 2017) | –
GradientCubic (Carmon and Duchi, 2016) | –
Stochastic Fast Cubic (Tripuraneni et al., 2018) | –
SRVRC$_{\text{free}}$ (This work) | see Corollary 5.2
2 Other Related Work
In this section, we review additional related work that is not discussed in the introduction.
Cubic Regularization and Trust-Region Methods. Since cubic regularization was first proposed by Nesterov and Polyak (2006), there has been a line of follow-up research. It was extended to adaptive regularized cubic methods (ARC) by Cartis et al. (2011a, b), which enjoy the same iteration complexity as standard cubic regularization while exhibiting better empirical performance. The first attempt to make cubic regularization Hessian-free was made by Carmon and Duchi (2016), who solve the cubic subproblem by gradient descent with a total runtime guarantee. Agarwal et al. (2017) solved the cubic subproblem by fast matrix inversion based on accelerated gradient descent. In the pure stochastic optimization setting, Tripuraneni et al. (2018) proposed a stochastic cubic regularization method, which uses a subsampled gradient and a Hessian-vector product-based cubic subproblem solver. Trust-region methods (Conn et al., 2000; Cartis et al., 2009, 2012, 2013) are a class of second-order methods closely related to cubic regularization. Recent studies (Blanchet et al., 2016; Curtis et al., 2017; Martínez and Raydan, 2017) proved that trust-region methods can achieve the same iteration complexity as cubic regularization. Xu et al. (2017) also extended the trust-region method to a subsampled trust-region method for nonconvex finite-sum optimization.
Local Minima Finding. Besides cubic regularization and trust-region type methods, there is another line of research for finding approximate local minima based on first-order optimization. Ge et al. (2015); Jin et al. (2017a) proved that (stochastic) gradient methods with additive noise are able to escape from nondegenerate saddle points and find approximate local minima. Carmon et al. (2018); Royer and Wright (2017); Allen-Zhu (2017); Xu et al. (2018); Allen-Zhu and Li (2018); Jin et al. (2017b); Yu et al. (2017b, a); Zhou et al. (2018a); Fang et al. (2018) showed that by alternating first-order optimization and Hessian-vector product-based negative curvature descent, one can find approximate local minima even more efficiently.
Variance Reduction. Variance reduction techniques play an important role in our proposed algorithms. They were first proposed for convex finite-sum optimization, where a semi-stochastic gradient is used to reduce the variance of the stochastic gradient and improve the gradient complexity. Representative algorithms include Stochastic Average Gradient (SAG) (Roux et al., 2012), Stochastic Variance Reduced Gradient (SVRG) (Johnson and Zhang, 2013; Xiao and Zhang, 2014), SAGA (Defazio et al., 2014) and SARAH (Nguyen et al., 2017), to mention a few. For nonconvex finite-sum optimization, Garber and Hazan (2015); Shalev-Shwartz (2016) studied the case where each individual function may be nonconvex but their sum is still (strongly) convex. Reddi et al. (2016a); Allen-Zhu and Hazan (2016) extended SVRG to nonconvex finite-sum optimization, converging to a first-order stationary point with better gradient complexity than vanilla gradient descent. Recently, Fang et al. (2018); Zhou et al. (2018c) further improved the gradient complexity for nonconvex finite-sum optimization to be (near) optimal.
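To make the semi-stochastic gradient idea concrete, the classic SVRG estimator can be sketched as follows; the per-component gradient oracle `grad_i` is hypothetical, introduced only for illustration:

```python
import numpy as np

def svrg_gradient(grad_i, x, x_ref, full_grad_ref, batch):
    """Semi-stochastic gradient of SVRG (Johnson & Zhang, 2013):
        v = (1/|S|) sum_{i in S} [grad f_i(x) - grad f_i(x_ref)] + grad F(x_ref).
    The correction term keeps v unbiased while shrinking its variance
    as x approaches the reference point x_ref."""
    corr = np.mean([grad_i(i, x) - grad_i(i, x_ref) for i in batch], axis=0)
    return corr + full_grad_ref
```

With the full batch, the estimator recovers the exact gradient; with a small batch it trades a little bias-free noise for a large saving in oracle calls.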
3 Notation and Preliminaries
In this work, all index subsets are multisets. We use $a \wedge b$ to represent $a$ if $a \le b$ and $b$ otherwise, and $a \vee b$ to represent $a$ if $a \ge b$ and $b$ otherwise. For a vector $\mathbf{v}$, we denote its $i$-th coordinate by $v_i$ and its Euclidean norm by $\|\mathbf{v}\|$. For any matrix $\mathbf{A}$, we denote its $(i, j)$ entry by $A_{ij}$, its Frobenius norm by $\|\mathbf{A}\|_F$, and its spectral norm by $\|\mathbf{A}\|$. For a symmetric matrix $\mathbf{H}$, we denote its minimum eigenvalue by $\lambda_{\min}(\mathbf{H})$. For symmetric matrices $\mathbf{A}$ and $\mathbf{B}$, we say $\mathbf{A} \succeq \mathbf{B}$ if $\mathbf{A} - \mathbf{B}$ is positive semidefinite. We use $a = O(b)$ to denote that $a \le C b$ for some constant $C > 0$, and use $\tilde{O}(\cdot)$ to hide logarithmic factors. For $a, b > 0$, $a \lesssim b$ means $a = O(b)$.

We begin with a few assumptions that are needed for the later theoretical analyses of our algorithms.
The following assumption says that there is a bounded gap between the function value at the initial point and the minimal function value: for the function $F$ and an initial point $\mathbf{x}_0$, there exists a constant $\Delta_F > 0$ such that $F(\mathbf{x}_0) - \inf_{\mathbf{x}\in\mathbb{R}^d} F(\mathbf{x}) \le \Delta_F$.
We also need the following gradient Lipschitz and Hessian Lipschitz assumption: for each $i$, we assume that $f_i$ is $L_1$-gradient Lipschitz continuous and $L_2$-Hessian Lipschitz continuous, i.e., for all $\mathbf{x}$ and $\mathbf{y}$,
$\|\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y})\| \le L_1\|\mathbf{x} - \mathbf{y}\|, \qquad \|\nabla^2 f_i(\mathbf{x}) - \nabla^2 f_i(\mathbf{y})\| \le L_2\|\mathbf{x} - \mathbf{y}\|.$
Note that gradient Lipschitz continuity is not required by the original cubic regularization algorithm (Nesterov and Polyak, 2006) or by the SVRC algorithm in Zhou et al. (2018d). However, most other SVRC algorithms (Zhou et al., 2018b; Wang et al., 2018; Zhang et al., 2018a) do require the gradient Lipschitz assumption.
In addition, we also need the difference between the stochastic gradient and the full gradient to be bounded.
We assume that $F$ has bounded stochastic gradient error: there exists a constant $M > 0$ such that for all $i$ and $\mathbf{x}$, $\|\nabla f_i(\mathbf{x}) - \nabla F(\mathbf{x})\| \le M$.
It is worth noting that this assumption is weaker than the assumption that each $f_i$ is Lipschitz continuous, which has been made in Kohler and Lucchi (2017); Zhou et al. (2018b); Wang et al. (2018); Zhang et al. (2018a). We would also like to point out that we could make additional assumptions on the variances of the stochastic gradient and Hessian, such as the ones made in Tripuraneni et al. (2018). Nevertheless, making these additional assumptions does not improve the dependence of the gradient and Hessian complexities or the runtime complexity on the problem parameters. Therefore we choose not to make these additional assumptions.
4 The Proposed SRVRC Algorithm
In this section, we present SRVRC, a novel algorithm which utilizes new semi-stochastic gradient and Hessian estimators compared with previous SVRC algorithms. We also provide a convergence analysis of the proposed algorithm.
4.1 Algorithm Description
In order to reduce the computational complexity of calculating the full gradient and full Hessian in (3), several ideas such as subsampled/stochastic gradients and Hessians (Kohler and Lucchi, 2017; Xu et al., 2017; Tripuraneni et al., 2018) and variance-reduced semi-stochastic gradients and Hessians (Zhou et al., 2018d; Wang et al., 2018; Zhang et al., 2018a) have been used in previous work. SRVRC follows this line of work. The key idea is a new construction of semi-stochastic gradient and Hessian estimators, which are recursively updated in each iteration and reset periodically after a certain number of iterations (i.e., an epoch). To be more specific, SRVRC takes different construction strategies at iteration $t$ depending on whether $\operatorname{mod}(t, L) = 0$, where $L$ is the epoch length. At the $t$-th iteration with $\operatorname{mod}(t, L) = 0$, SRVRC calculates a subsampled gradient and a subsampled Hessian at the point $\mathbf{x}_t$ and sets the semi-stochastic gradient $\mathbf{v}_t$ and Hessian $\mathbf{U}_t$ to these subsampled estimates. At the $t$-th iteration with $\operatorname{mod}(t, L) \neq 0$, SRVRC constructs the semi-stochastic gradient $\mathbf{v}_t$ and Hessian $\mathbf{U}_t$ recursively, based on the previous estimators $\mathbf{v}_{t-1}$ and $\mathbf{U}_{t-1}$. More specifically, SRVRC generates index sets $J_t$ and $I_t$, calculates two subsampled gradients over $J_t$ at $\mathbf{x}_t$ and $\mathbf{x}_{t-1}$, and two subsampled Hessians over $I_t$ at $\mathbf{x}_t$ and $\mathbf{x}_{t-1}$. Then SRVRC sets $\mathbf{v}_t$ and $\mathbf{U}_t$ as
(5) $\mathbf{v}_t = \frac{1}{|J_t|}\sum_{i \in J_t}\big[\nabla f_i(\mathbf{x}_t) - \nabla f_i(\mathbf{x}_{t-1})\big] + \mathbf{v}_{t-1},$
(6) $\mathbf{U}_t = \frac{1}{|I_t|}\sum_{i \in I_t}\big[\nabla^2 f_i(\mathbf{x}_t) - \nabla^2 f_i(\mathbf{x}_{t-1})\big] + \mathbf{U}_{t-1}.$
Note that this kind of $\mathbf{v}_t$ has been used in first-order optimization algorithms before (Nguyen et al., 2017; Fang et al., 2018), while such a $\mathbf{U}_t$ is new and, to our knowledge, has not been used before. With the semi-stochastic gradient $\mathbf{v}_t$, the semi-stochastic Hessian $\mathbf{U}_t$ and the $t$-th cubic penalty parameter $M_t$, SRVRC constructs the $t$-th cubic subproblem and solves it for the $t$-th update direction $\mathbf{h}_t$, which is defined as
(7) $\mathbf{h}_t = \operatorname{argmin}_{\mathbf{h}\in\mathbb{R}^d}\ \langle \mathbf{v}_t, \mathbf{h}\rangle + \tfrac{1}{2}\langle \mathbf{U}_t\mathbf{h}, \mathbf{h}\rangle + \tfrac{M_t}{6}\|\mathbf{h}\|^3.$
If $\|\mathbf{h}_t\|$ is less than a given threshold, SRVRC returns $\mathbf{x}_t + \mathbf{h}_t$ as its output. Otherwise, SRVRC updates $\mathbf{x}_{t+1} = \mathbf{x}_t + \mathbf{h}_t$ and continues the loop.
The main difference between SRVRC and previous stochastic cubic regularization algorithms (Kohler and Lucchi, 2017; Xu et al., 2017; Zhou et al., 2018d, b; Wang et al., 2018; Zhang et al., 2018a) is that SRVRC adopts new semi-stochastic gradient and semi-stochastic Hessian estimators, which are defined recursively and have smaller asymptotic variance. The use of such a semi-stochastic gradient has been proved to reduce the gradient complexity of first-order nonconvex finite-sum optimization for finding stationary points (Fang et al., 2018). Our work takes one step further and applies the idea to the Hessian, and we will later show that it helps reduce both the gradient and Hessian complexities of second-order nonconvex finite-sum optimization for finding local minima (i.e., second-order stationary points).
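The recursive updates (5) and (6) described above can be sketched as follows; the oracles `grad_i`/`hess_i` and the shared batch are illustrative assumptions, not the exact sampling scheme of SRVRC:

```python
import numpy as np

def recursive_update(grad_i, hess_i, x_t, x_prev, v_prev, U_prev, batch):
    """One recursive update of the semi-stochastic estimators:
        v_t = (1/|S|) sum_{i in S} [grad f_i(x_t) - grad f_i(x_{t-1})] + v_{t-1}
        U_t = (1/|S|) sum_{i in S} [hess f_i(x_t) - hess f_i(x_{t-1})] + U_{t-1}
    (a single batch is reused for both estimators here for brevity)."""
    v_t = v_prev + np.mean(
        [grad_i(i, x_t) - grad_i(i, x_prev) for i in batch], axis=0)
    U_t = U_prev + np.mean(
        [hess_i(i, x_t) - hess_i(i, x_prev) for i in batch], axis=0)
    return v_t, U_t
```

At the start of each epoch (when $\operatorname{mod}(t, L) = 0$) the recursion is reset with fresh subsampled estimates, which prevents the accumulated error from growing.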
4.2 Convergence Analysis
In this subsection, we present our theoretical results for SRVRC. While the idea of using variance reduction for cubic regularization is not new, the new semi-stochastic gradient and Hessian estimators in (5) and (6) bring new technical challenges to the convergence analysis.
To describe whether a point is a local minimum, we follow the original cubic regularization work (Nesterov and Polyak, 2006) and use the following criterion: for any $\mathbf{x}$, let $\mu(\mathbf{x})$ be
(8) $\mu(\mathbf{x}) = \max\Big\{\epsilon_g^{-3/2}\,\|\nabla F(\mathbf{x})\|^{3/2},\ \epsilon_H^{-3}\,\big[-\lambda_{\min}(\nabla^2 F(\mathbf{x}))\big]_+^3\Big\}.$
It is easy to see that $\mu(\mathbf{x}) \le 1$ if and only if $\mathbf{x}$ is an $(\epsilon_g, \epsilon_H)$-approximate local minimum. Thus, in order to find an approximate local minimum, it suffices to find a point $\mathbf{x}$ that satisfies $\mu(\mathbf{x}) \le 1$.
The following theorem provides the convergence guarantee of SRVRC for finding an approximate local minimum.
Under Assumptions 1-3, choose the cubic penalty parameter $M_t$ for each $t$ and the total iteration number $T$ appropriately. For $t$ such that $\operatorname{mod}(t, L) = 0$, set the gradient sample size and Hessian sample size as
(9)  
(10) 
For $t$ such that $\operatorname{mod}(t, L) \neq 0$, set the gradient sample size and Hessian sample size as
(11)  
(12) 
Then with probability at least $1 - \delta$, SRVRC outputs a point $\hat{\mathbf{x}}$ satisfying $\mu(\hat{\mathbf{x}}) \le 1$, i.e., an approximate local minimum. Here, the constant in the sample-size choices above is universal.
The next corollary spells out the exact gradient and Hessian complexities of SRVRC for finding an approximate local minimum. Under the same conditions as Theorem 4.2, if the total iteration number and penalty parameters are set as above, and the sample sizes are set to their lower bounds in (9)-(12), then with probability at least $1 - \delta$, SRVRC will output an approximate local minimum within the stated numbers of stochastic Hessian evaluations and stochastic gradient evaluations. If we further assume that all problem-dependent parameters other than $n$, $d$, $\epsilon_g$ and $\epsilon_H$ are constants, the gradient and Hessian complexities of SRVRC simplify accordingly.
Regarding the Hessian complexity: compared with existing SVRC algorithms (Zhou et al., 2018b; Zhang et al., 2018a; Wang et al., 2018), SRVRC outperforms the best-known Hessian sample complexity in a wide regime of $n$ and $\epsilon$. In terms of gradient complexity, SRVRC outperforms the algorithm in Zhang et al. (2018a) in certain regimes, and also outperforms the algorithm in Zhou et al. (2018d) in a wide regime.
5 Hessian-Free SRVRC
While SRVRC adopts novel semi-stochastic gradient and Hessian estimators to reduce both the gradient and Hessian complexities, it has three limitations for high-dimensional problems with large $d$: (1) it needs to compute and store the Hessian matrix, which requires $O(d^2)$ computation and storage; (2) it needs to solve the cubic subproblem exactly, which requires $O(d^\omega)$ computation because it needs to compute the inverse of a Hessian matrix (Nesterov and Polyak, 2006); and (3) it cannot leverage Hessian-vector product-based cubic subproblem solvers (Agarwal et al., 2017; Carmon and Duchi, 2016, 2018) because of its use of the semi-stochastic Hessian estimator.
5.1 Algorithm Description
We present a Hessian-free algorithm, SRVRC$_{\text{free}}$, to address the above limitations of SRVRC for high-dimensional problems. Its runtime complexity is linear in $d$, and it therefore works well in the high-dimensional regime. SRVRC$_{\text{free}}$ uses the same semi-stochastic gradient as SRVRC. As opposed to SRVRC, which has to construct the semi-stochastic Hessian explicitly, SRVRC$_{\text{free}}$ only accesses Hessian-vector products. In detail, at each iteration $t$, SRVRC$_{\text{free}}$ subsamples an index set $I_t$ and defines the Hessian-vector product function $\mathbf{h} \mapsto |I_t|^{-1}\sum_{i\in I_t}\nabla^2 f_i(\mathbf{x}_t)\mathbf{h}$.
Note that although the subproblem depends on the subsampled Hessian, SRVRC$_{\text{free}}$ never explicitly computes this matrix. Instead, it only provides the subproblem solver access to it through the Hessian-vector product function, and the solver performs gradient-based optimization, as the subproblem depends on the Hessian only via Hessian-vector products. In detail, following Tripuraneni et al. (2018), SRVRC$_{\text{free}}$ uses CubicSubsolver (see Algorithms 3 and 4 in Appendix G) and CubicFinalsolver from Carmon and Duchi (2016) to find an approximate solution to the cubic subproblem in (7). Both CubicSubsolver and CubicFinalsolver only need access to the gradient and the Hessian-vector product function, along with other problem-dependent parameters. With the output of CubicSubsolver, SRVRC$_{\text{free}}$ decides either to update $\mathbf{x}_{t+1} = \mathbf{x}_t + \mathbf{h}_t$ or to exit the loop. In the latter case, SRVRC$_{\text{free}}$ calls CubicFinalsolver to compute a final direction and takes the resulting point as its final output.
The main differences between SRVRC and SRVRC$_{\text{free}}$ are twofold. First, SRVRC$_{\text{free}}$ only needs to compute stochastic gradients and Hessian-vector products, both of which take $O(d)$ time (Rumelhart et al., 1986). Second, instead of solving the cubic subproblem exactly, SRVRC$_{\text{free}}$ adopts the approximate solvers CubicSubsolver and CubicFinalsolver, both of which only need access to the gradient and the Hessian-vector product function, and again take only $O(d)$ time per access. Thus, SRVRC$_{\text{free}}$ is computationally more efficient than SRVRC when $d$ is large.
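To see concretely why a Hessian-vector product costs only $O(d)$ per sample, the sketch below builds a subsampled Hessian-vector product oracle for a binary logistic loss (an illustrative choice of loss; the names are hypothetical) without ever forming the $d \times d$ matrix:

```python
import numpy as np

def make_hvp(X, w, batch):
    """Return v -> (1/|S|) sum_{i in S} sigma'(x_i^T w) (x_i^T v) x_i,
    the subsampled logistic-loss Hessian-vector product, computed in
    O(|S| d) time and O(d) extra memory; the Hessian itself is never formed."""
    Xs = X[batch]                              # rows of the subsample
    s = 1.0 / (1.0 + np.exp(-(Xs @ w)))        # sigmoid(x_i^T w)
    coef = s * (1.0 - s)                       # per-example curvature weight
    def hvp(v):
        return Xs.T @ (coef * (Xs @ v)) / len(batch)
    return hvp
```

A gradient-based subproblem solver only ever calls `hvp(v)`, so the quadratic term of the cubic model never requires storing a matrix.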
5.2 Convergence Analysis
We now provide the convergence guarantee of SRVRC$_{\text{free}}$, which ensures that it will output an approximate local minimum. Under Assumptions 1-3, choose the cubic penalty parameter $M_t$ for each $t$ and the total iteration number $T$ appropriately. Set the Hessian-vector product sample size as
(13) 
For $t$ such that $\operatorname{mod}(t, L) = 0$, set the gradient sample size as
(14) 
For $t$ such that $\operatorname{mod}(t, L) \neq 0$, set the gradient sample size as
(15) 
Then with probability at least $1 - \delta$, SRVRC$_{\text{free}}$ outputs a point $\hat{\mathbf{x}}$ satisfying $\mu(\hat{\mathbf{x}}) \le 1$, i.e., an approximate local minimum. Here, the constant in the sample-size choices above is universal. The following corollary gives the runtime complexity of SRVRC$_{\text{free}}$ for finding an approximate local minimum.
Under the same conditions as Theorem 5.2, if the total iteration number and penalty parameters are set as above, and the sample sizes are set to their lower bounds in (13)-(15), then with probability at least $1 - \delta$, SRVRC$_{\text{free}}$ will output an approximate local minimum within
(16) 
runtime. For SRVRC$_{\text{free}}$, if we assume all problem-dependent parameters other than $n$, $d$ and $\epsilon$ are constants, then its runtime complexity is
(17) 
For stochastic algorithms, the large-$n$ regime is of most interest, and in this regime (17) simplifies further. Compared with other local minimum finding algorithms based on stochastic gradients and Hessian-vector products, SRVRC$_{\text{free}}$ outperforms the results achieved by Tripuraneni et al. (2018) and Allen-Zhu (2018). SRVRC$_{\text{free}}$ also matches the best-known result achieved by a recent first-order algorithm proposed in Fang et al. (2018).
Figure 1: Plots of the logarithmic function value gap with respect to runtime (in seconds) for nonconvex regularized binary logistic regression on (a) a9a and (b) covtype, and for nonconvex regularized multiclass logistic regression on (c) MNIST.
5.3 Discussions
We would like to further compare the runtime complexities of SRVRC and SRVRC$_{\text{free}}$. Specifically, SRVRC needs $O(d)$ time per sampled component to construct the semi-stochastic gradient and $O(d^2)$ time per sampled component to construct the semi-stochastic Hessian. SRVRC also needs $O(d^\omega)$ time to solve the cubic subproblem at each iteration. Thus, combined with the iteration count from Corollary 4.2, the runtime of SRVRC to find an approximate local minimum follows if we regard all problem-dependent parameters other than $n$, $d$ and $\epsilon$ as constants. Compared with (17), we conclude that SRVRC$_{\text{free}}$ outperforms SRVRC when $d$ is sufficiently large, which is in accordance with the fact that Hessian-free methods are superior for high-dimensional machine learning tasks. On the other hand, a careful calculation shows that the runtime of SRVRC can be less than that of SRVRC$_{\text{free}}$ when $d$ is moderately small. This is also reflected in our experiments.
6 Experiments
In this section, we present numerical experiments on different nonconvex empirical risk minimization (ERM) problems and on different datasets to validate the advantage of our proposed SRVRC and SRVRC$_{\text{free}}$ algorithms for finding approximate local minima. We use runtime as the performance measure.
Baselines: We compare our algorithms with the following algorithms: subsampled cubic regularization (Subsampled Cubic) (Kohler and Lucchi, 2017), stochastic cubic regularization (Stochastic Cubic) (Tripuraneni et al., 2018), stochastic variance-reduced cubic regularization (SVRC) (Zhou et al., 2018d), and sample-efficient stochastic variance-reduced cubic regularization (Lite-SVRC) (Zhou et al., 2018b; Wang et al., 2018; Zhang et al., 2018a).
Parameter Settings and Subproblem Solver. For each algorithm, we set the cubic penalty parameter adaptively, based on how well the model approximates the real objective, as suggested in Cartis et al. (2011a, b); Kohler and Lucchi (2017). For SRVRC, we set the gradient and Hessian batch sizes according to the theoretical prescription. For SRVRC$_{\text{free}}$, we set the gradient batch sizes the same as for SRVRC and the Hessian batch sizes according to (13). We tune the remaining hyperparameters over grids for the best performance. For Subsampled Cubic, SVRC, Lite-SVRC and SRVRC, we solve the cubic subproblem using the solver discussed in Nesterov and Polyak (2006). For Stochastic Cubic and SRVRC$_{\text{free}}$, we use CubicSubsolver (Algorithm 3 in Appendix G) to approximately solve the cubic subproblem. All algorithms are carefully tuned for a fair comparison.
Datasets and Optimization Problems. We use three datasets, a9a, covtype and MNIST, from Chang and Lin (2011). For a9a and covtype, we study a binary logistic regression problem with a nonconvex regularizer (Reddi et al., 2016b). For MNIST, we study multiclass logistic regression over $K$ classes with a nonconvex regularizer.
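As a concrete sketch, the binary objective used on a9a and covtype has the following form, where the regularizer follows Reddi et al. (2016b) and the constants `lam` and `alpha` are illustrative, not the tuned values:

```python
import numpy as np

def nonconvex_logistic_objective(w, X, y, lam=1e-3, alpha=10.0):
    """Binary logistic loss with the nonconvex regularizer
        lam * sum_j alpha * w_j^2 / (1 + alpha * w_j^2),
    for labels y in {-1, +1}. Each regularizer term is bounded by lam,
    which is what makes the overall objective nonconvex but well behaved."""
    margins = y * (X @ w)
    loss = np.mean(np.log1p(np.exp(-margins)))   # log(1 + exp(-y x^T w))
    reg = lam * np.sum(alpha * w ** 2 / (1.0 + alpha * w ** 2))
    return loss + reg
```

At `w = 0` the regularizer vanishes and the loss equals $\log 2$, which is a useful sanity check when implementing the gradient and Hessian oracles.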
We plot the logarithmic function value gap with respect to runtime in Figure 1. From Figures 1(a), 1(b) and 1(c), we can see that for the lower-dimensional optimization tasks on a9a and covtype, our SRVRC outperforms all the other algorithms with respect to runtime. For the high-dimensional optimization task on MNIST, only Stochastic Cubic and SRVRC$_{\text{free}}$ are able to make progress, and SRVRC$_{\text{free}}$ outperforms Stochastic Cubic. This is consistent with our discussion in Section 5.3.
7 Conclusions and Future Work
In this work we presented two faster SVRC algorithms, namely SRVRC and SRVRC$_{\text{free}}$, to find approximate local minima for nonconvex finite-sum optimization problems. SRVRC outperforms existing SVRC algorithms in terms of gradient and Hessian complexities, while SRVRC$_{\text{free}}$ further outperforms the best-known runtime complexity of existing CR-based algorithms. Whether our algorithms have achieved the optimal complexities under the current assumptions is still an open problem, and we leave it for future work.
Appendix A Proofs in Section 4
We define the filtration $\mathcal{F}_t$ as the $\sigma$-algebra generated by $\mathbf{x}_0, \ldots, \mathbf{x}_t$. When no confusion arises, we write $\mathbf{v}_t$ and $\mathbf{U}_t$ for the semi-stochastic gradient and Hessian, $\mathbf{h}_t$ for the update direction, and $M_t$ for the cubic penalty parameter appearing in Algorithm 1 and Algorithm 2, and we use additional shorthand defined in this section for simplicity.
A.1 Proof of Theorem 4.2
To prove Theorem 4.2, we need the following lemmas from Zhou et al. (2018d), which show that the optimality measure at the next iterate can be bounded by $\|\mathbf{h}_t\|$ and the norms of the differences between the semi-stochastic gradient and Hessian and their exact counterparts.
(Zhou et al., 2018d) Suppose that and . If , then for any , we have
The next lemma gives upper bounds on the inner product terms that will appear in our main proof. (Zhou et al., 2018d) For any $t$, we have
(18)  
(19) 
We also need the following two lemmas, which show that the semi-stochastic gradient and Hessian are good approximations to the true gradient and Hessian.
Suppose that the gradient sample sizes satisfy (9) and (11). Then, conditioned on the filtration, with probability at least $1 - \delta$, we have that for all $t$,
(20) 
Suppose that the Hessian sample sizes satisfy (10) and (12). Then, conditioned on the filtration, with probability at least $1 - \delta$, we have that for all $t$,
(21) 
Given all the above lemmas, we are ready to prove Theorem 4.2.
Proof of Theorem 4.2.
Suppose that SRVRC terminates at iteration $T$; then the stopping criterion is not met at any earlier iteration. We have