Many machine learning problems can be formulated as empirical risk minimization, which takes the form of the following finite-sum optimization problem:
$$\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$
where each $f_i$ can be a convex or nonconvex function. In this paper, we are particularly interested in nonconvex finite-sum optimization, where each $f_i$ is nonconvex. This is often the case for deep learning (LeCun et al., 2015). In principle, it is hard to find the global minimum of (1) because of the NP-hardness of the problem (Hillar and Lim, 2013), so it is reasonable to resort to finding local minima (a.k.a. second-order stationary points). It has been shown that local minima can be the global minima in certain machine learning problems, such as low-rank matrix factorization (Ge et al., 2016; Bhojanapalli et al., 2016; Zhang et al., 2018b) and training deep linear neural networks (Kawaguchi, 2016; Hardt and Ma, 2016). Therefore, developing algorithms to find local minima is important both in theory and in practice. More specifically, we define an $(\epsilon, \epsilon_H)$-approximate local minimum $x$ of $F$ as follows:
$$\|\nabla F(x)\|_2 \le \epsilon, \qquad \lambda_{\min}\big(\nabla^2 F(x)\big) \ge -\epsilon_H, \qquad (2)$$
where $\epsilon, \epsilon_H > 0$ are predefined precision parameters.
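To make the definition concrete, the following minimal sketch checks the two standard second-order conditions numerically: a gradient norm of at most `eps` and a minimum Hessian eigenvalue of at least `-eps_h`. The function name, the tolerances and the two toy Hessians are hypothetical and chosen only for illustration.

```python
import numpy as np

def is_approx_local_min(grad, hess, eps, eps_h):
    """Check ||grad||_2 <= eps and lambda_min(hess) >= -eps_h (assumed definition)."""
    grad_norm = np.linalg.norm(grad)
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    lam_min = np.linalg.eigvalsh(hess)[0]
    return bool(grad_norm <= eps and lam_min >= -eps_h)

# Toy example: F(x, y) = x^2 - y^2 has zero gradient at the origin but is a saddle,
# while F(x, y) = x^2 + y^2 has a true minimum there.
zero_grad = np.zeros(2)
saddle_hess = np.diag([2.0, -2.0])
bowl_hess = np.diag([2.0, 2.0])

saddle_ok = is_approx_local_min(zero_grad, saddle_hess, eps=1e-3, eps_h=1e-1)
bowl_ok = is_approx_local_min(zero_grad, bowl_hess, eps=1e-3, eps_h=1e-1)
```

The saddle point passes the first-order test but fails the eigenvalue test, which is exactly the case that first-order stationarity alone cannot rule out.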
The most classic algorithm for finding an approximate local minimum is the cubic-regularized (CR) Newton method, which was originally proposed in a seminal paper by Nesterov and Polyak (2006). Generally speaking, in the $t$-th iteration, cubic regularization solves a subproblem that minimizes a cubic-regularized second-order Taylor expansion at the current iterate $x_t$. The update rule can be written as follows:
$$h_t = \operatorname*{argmin}_{h \in \mathbb{R}^d} \; \langle \nabla F(x_t), h \rangle + \frac{1}{2} \langle \nabla^2 F(x_t) h, h \rangle + \frac{M}{6} \|h\|_2^3, \qquad x_{t+1} = x_t + h_t, \qquad (3)$$
where $M > 0$ is a penalty parameter. Nesterov and Polyak (2006) proved that to find an $(\epsilon, \epsilon_H)$-approximate local minimum of a nonconvex function $F$ with $\epsilon_H = O(\sqrt{\epsilon})$, cubic regularization requires at most $O(\epsilon^{-3/2})$ iterations. However, when applying cubic regularization to the nonconvex finite-sum optimization in (1), a major bottleneck is that it needs to compute $n$ individual gradients $\nabla f_i(x_t)$ and Hessian matrices $\nabla^2 f_i(x_t)$ at each iteration, which leads to a total gradient complexity of $O(n/\epsilon^{3/2})$ (i.e., number of queries to the stochastic gradient oracle $\nabla f_i(x)$ for some $i$ and $x$) and a Hessian complexity of $O(n/\epsilon^{3/2})$ (i.e., number of queries to the stochastic Hessian oracle $\nabla^2 f_i(x)$ for some $i$ and $x$). Such computational overhead is extremely expensive when $n$ is large, as is the case in many large-scale machine learning applications.
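A single cubic-regularized Newton step can be sketched as follows. This is an illustrative solver, not the one used by the authors: it assumes the cubic term is $(M/6)\|h\|_2^3$, uses a full eigendecomposition (so it is only sensible for small dimensions), and does not handle the so-called hard case in which the gradient is orthogonal to the eigenvector of the smallest eigenvalue. The names `cubic_step`, `g`, `H` and `M` are hypothetical.

```python
import numpy as np

def cubic_step(g, H, M, tol=1e-12, max_iter=200):
    """Minimise <g, h> + 0.5 <H h, h> + (M / 6) ||h||^3 over h.

    Sketch only: assumes g is not orthogonal to the eigenvector of the
    smallest eigenvalue of H (the hard case is not handled).
    """
    lam, Q = np.linalg.eigh(H)          # H = Q diag(lam) Q^T, lam ascending
    g_rot = Q.T @ g

    def step_norm(r):
        # Norm of h(r) = -(H + (M r / 2) I)^{-1} g for a trial radius r.
        return np.linalg.norm(g_rot / (lam + 0.5 * M * r))

    # The optimal radius r = ||h|| solves step_norm(r) = r subject to
    # H + (M r / 2) I being positive semidefinite, i.e. r >= -2 lam_min / M.
    lo = max(0.0, -2.0 * lam[0] / M)
    hi = max(2.0 * lo, 1.0)
    while step_norm(hi) > hi:
        hi *= 2.0
    for _ in range(max_iter):           # bisection on the radius
        mid = 0.5 * (lo + hi)
        if step_norm(mid) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return -Q @ (g_rot / (lam + 0.5 * M * hi))

# Toy example with an indefinite Hessian (a saddle-like model).
g = np.array([1.0, 0.5])
H = np.diag([1.0, -1.0])
M = 1.0
h = cubic_step(g, H, M)
residual = g + H @ h + 0.5 * M * np.linalg.norm(h) * h   # stationarity of the model
model_value = g @ h + 0.5 * h @ H @ h + M / 6.0 * np.linalg.norm(h) ** 3
```

Even though `H` has a negative eigenvalue, the cubic term keeps the model bounded below, and the returned step decreases the model value below zero.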
To overcome the aforementioned computational burden of cubic regularization, Kohler and Lucchi (2017); Xu et al. (2017) used subsampled gradient and subsampled Hessian, which achieve gradient complexity and Hessian complexity. Zhou et al. (2018d) proposed a stochastic variance-reduced cubic regularization method (SVRC), which uses novel semi-stochastic gradient and semi-stochastic Hessian estimators inspired by variance reduction for first-order finite-sum optimization (Johnson and Zhang, 2013; Reddi et al., 2016a; Allen-Zhu and Hazan, 2016), and attains Second-order Oracle (SO) complexity. (A Second-order Oracle returns the triple $[f_i(x), \nabla f_i(x), \nabla^2 f_i(x)]$ for some $i$ and $x$; hence the SO complexity can be seen as the maximum of the gradient and Hessian complexities.) Zhou et al. (2018b); Wang et al. (2018); Zhang et al. (2018a) used a simpler semi-stochastic gradient compared with Zhou et al. (2018d), together with a semi-stochastic Hessian, which achieves a better Hessian complexity. However, it is unclear whether the gradient and Hessian complexities of the aforementioned SVRC algorithms can be further improved. Furthermore, all these algorithms need to use the semi-stochastic Hessian estimator, which is not compatible with Hessian-vector product-based cubic subproblem solvers (Agarwal et al., 2017; Carmon and Duchi, 2016, 2018). Therefore, the cubic subproblem (4) in each iteration of existing SVRC algorithms has to be solved by computing the inverse of the Hessian matrix, whose computational complexity is at least $O(d^{\omega})$, where $\omega$ is the matrix multiplication constant. This makes existing SVRC algorithms not very practical for high-dimensional problems.
In this paper, we first show that the gradient and Hessian complexities of SVRC-type algorithms can be further improved. The core idea is to use novel recursively updated semi-stochastic gradient and Hessian estimators, which are inspired by the recursive semi-stochastic gradient estimators used in Nguyen et al. (2017); Fang et al. (2018) for first-order finite-sum optimization. We show that this kind of estimator can also reduce the Hessian complexity, which has never been discovered before. In addition, in order to reduce the runtime complexity of existing SVRC algorithms, we further propose a Hessian-free SVRC method that can not only use the novel semi-stochastic gradient estimator, but also leverage the Hessian-vector product-based fast cubic subproblem solvers. Experiments on benchmark nonconvex finite-sum optimization problems illustrate the superiority of our newly proposed SVRC algorithms over the state of the art.
In detail, our contributions are summarized as follows:
We propose a new SVRC algorithm, namely SRVRC, which can find an -approximate local minimum with gradient complexity and Hessian complexity. Compared with previous work, the gradient and Hessian complexities of SRVRC are strictly better than those of the algorithms in Zhou et al. (2018b); Wang et al. (2018); Zhang et al. (2018a), and better than those in Zhou et al. (2018d) in a wide regime.
We further propose a new algorithm, Hessian-free SRVRC, which requires runtime to find an -approximate local minimum. The runtime of Hessian-free SRVRC is strictly better than that of Agarwal et al. (2017); Carmon and Duchi (2016); Tripuraneni et al. (2018) when . The runtime complexity of Hessian-free SRVRC is also better than that of SRVRC when $d$ is large.
|Algorithm||Gradient Complexity||Hessian Complexity|
|(Nesterov and Polyak, 2006)|
|Subsampled cubic regularization|
|(Kohler and Lucchi, 2017; Xu et al., 2017)|
|(Zhou et al., 2018d)|
|(Zhou et al., 2018b)|
|(Wang et al., 2018)|
|(Zhang et al., 2018a)|
|(Zhou et al., 2018b)|
|(Agarwal et al., 2017)|
|(Carmon and Duchi, 2016)|
|Stochastic Fast Cubic|
|(Tripuraneni et al., 2018)|
2 Other Related Work
In this section, we review additional related work that is not discussed in the introduction.
Cubic Regularization and Trust-Region Methods: Since cubic regularization was first proposed in Nesterov and Polyak (2006), there has been a line of follow-up research. It was extended to adaptive regularized cubic methods (ARC) by Cartis et al. (2011a, b), which enjoy the same iteration complexity as standard cubic regularization while having better empirical performance. The first attempt to make cubic regularization a Hessian-free method was made by Carmon and Duchi (2016), who solve the cubic subproblem by gradient descent, requiring total runtime. Agarwal et al. (2017) solved the cubic subproblem by fast matrix inversion based on accelerated gradient descent, which requires runtime. In the pure stochastic optimization setting, Tripuraneni et al. (2018) proposed a stochastic cubic regularization method, which uses a subsampled gradient and a Hessian-vector product-based cubic subproblem solver, and requires runtime. Closely related to cubic regularization methods are the second-order trust-region methods (Conn et al., 2000; Cartis et al., 2009, 2012, 2013). Recent studies (Blanchet et al., 2016; Curtis et al., 2017; Martínez and Raydan, 2017) proved that the trust-region method can achieve the same iteration complexity as the cubic regularization method. Xu et al. (2017) also extended the trust-region method to a subsampled trust-region method for nonconvex finite-sum optimization.
Local Minima Finding: Besides cubic regularization and trust-region type methods, there is another line of research for finding approximate local minima, which is based on first-order optimization. Ge et al. (2015); Jin et al. (2017a) proved that (stochastic) gradient methods with additive noise are able to escape from nondegenerate saddle points and find approximate local minima. Carmon et al. (2018); Royer and Wright (2017); Allen-Zhu (2017); Xu et al. (2018); Allen-Zhu and Li (2018); Jin et al. (2017b); Yu et al. (2017b, a); Zhou et al. (2018a); Fang et al. (2018) showed that by alternating first-order optimization and Hessian-vector product-based negative curvature descent, one can find approximate local minima even more efficiently.
Variance Reduction: Variance reduction techniques play an important role in our proposed algorithms. Variance reduction techniques were first proposed for convex finite-sum optimization; they use a semi-stochastic gradient to reduce the variance of the stochastic gradient and improve the gradient complexity. Representative algorithms include Stochastic Average Gradient (SAG) (Roux et al., 2012), Stochastic Variance Reduced Gradient (SVRG) (Johnson and Zhang, 2013; Xiao and Zhang, 2014), SAGA (Defazio et al., 2014) and SARAH (Nguyen et al., 2017), to mention a few. For nonconvex finite-sum optimization problems, Garber and Hazan (2015); Shalev-Shwartz (2016) studied the case where each individual function is nonconvex, but their sum is still (strongly) convex. Reddi et al. (2016a); Allen-Zhu and Hazan (2016) extended SVRG to nonconvex finite-sum optimization, where it converges to a first-order stationary point with better gradient complexity than vanilla gradient descent. Recently, Fang et al. (2018); Zhou et al. (2018c) further improved the gradient complexity for nonconvex finite-sum optimization to be (near) optimal.
3 Notation and Preliminaries
In this work, all index subsets are multisets. We use to represent if and otherwise. We use to represent if and otherwise. For a vector , we denote its -th coordinate by . We denote the Euclidean norm of a vector by . For any matrix , we denote its entry by , its Frobenius norm by , and its spectral norm by . For a symmetric matrix , we denote its minimum eigenvalue by . For symmetric matrices , we say if . We use to denote that for some constant and use to hide the logarithmic factors of . For , means .
We begin with a few assumptions that are needed for later theoretical analyses of our algorithms.
The following assumption says that there is a bounded gap between the function value at the initial point and the minimal function value. For any function and an initial point , there exists a constant such that
We also need the following -gradient Lipschitz and -Hessian Lipschitz assumption.
For each , we assume that is -gradient Lipschitz continuous and -Hessian Lipschitz continuous, where we have
Note that the -gradient Lipschitz condition is not required by the original cubic regularization algorithm (Nesterov and Polyak, 2006) or by the SVRC algorithm in Zhou et al. (2018d). However, most other SVRC algorithms (Zhou et al., 2018b; Wang et al., 2018; Zhang et al., 2018a) need the -gradient Lipschitz assumption.
In addition, we also need the difference between the stochastic gradient and the full gradient to be bounded.
We assume that has -bounded stochastic gradient, where we have
It is worth noting that this assumption is weaker than the assumption that each is Lipschitz continuous, which has been made in Kohler and Lucchi (2017); Zhou et al. (2018b); Wang et al. (2018); Zhang et al. (2018a). We would also like to point out that we can make additional assumptions on the variances of the stochastic gradient and Hessian, such as the ones made in Tripuraneni et al. (2018). Nevertheless, making these additional assumptions does not improve the dependency of the gradient and Hessian complexities or the runtime complexity on and . Therefore, we chose not to make these additional assumptions on the variances.
4 The Proposed SRVRC Algorithm
In this section, we present SRVRC, a novel algorithm which utilizes new semi-stochastic gradient and Hessian estimators compared with previous SVRC algorithms. We also provide a convergence analysis of the proposed algorithm.
4.1 Algorithm Description
In order to reduce the computational cost of calculating the full gradient and full Hessian in (3), several ideas such as subsampled/stochastic gradient and Hessian (Kohler and Lucchi, 2017; Xu et al., 2017; Tripuraneni et al., 2018) and variance-reduced semi-stochastic gradient and Hessian (Zhou et al., 2018d; Wang et al., 2018; Zhang et al., 2018a) have been used in previous work. SRVRC follows this line of work. The key idea is to use a new construction of semi-stochastic gradient and Hessian estimators, which are recursively updated in each iteration and reset periodically after a certain number of iterations (i.e., an epoch). To be more specific, SRVRC takes different construction strategies for iteration depending on whether or not, where is the epoch length. In the -th iteration when , SRVRC will calculate a subsampled gradient and Hessian at point and set the semi-stochastic gradient and Hessian as follows:
In the -th iteration when , SRVRC constructs semi-stochastic gradient and Hessian and based on previous estimators , recursively. More specifically, SRVRC generates index sets , and calculates two subsampled gradients , and two subsampled Hessians . Then SRVRC sets and as
Note that this kind of has been used in first-order optimization algorithms before (Nguyen et al., 2017; Fang et al., 2018), while such is new and to our knowledge has never been used before. With semi-stochastic gradient , semi-stochastic Hessian and -th Cubic penalty parameter , SRVRC constructs the -th Cubic subproblem and solves for the solution to as -th update direction, which is defined as
If is less than a given threshold, which we set as , SRVRC returns as its output. Otherwise, SRVRC updates and continues the loop.
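The recursive construction described above can be sketched in a few lines. The sketch assumes the SARAH-style form (fresh minibatch average on a reset step; otherwise the previous estimate plus the minibatch difference between the current and previous iterates) and applies the same rule to gradients and Hessians. The toy components $f_i(x) = \frac{1}{4}(a_i^\top x)^4$, the helper name `recursive_estimate`, and the batch sizes are hypothetical; the paper's actual sample-size schedule is not reproduced here.

```python
import numpy as np

def recursive_estimate(oracle, x, x_prev, prev_est, idx, reset):
    """Semi-stochastic estimate of the average of oracle(i, .) over all i.

    oracle(i, x) returns the gradient (or Hessian) of the i-th component at x.
    On a reset step the estimate is a plain minibatch average; otherwise the
    previous estimate is corrected by the minibatch difference between x and x_prev.
    """
    fresh = np.mean([oracle(i, x) for i in idx], axis=0)
    if reset:
        return fresh
    stale = np.mean([oracle(i, x_prev) for i in idx], axis=0)
    return fresh - stale + prev_est

# Hypothetical toy components f_i(x) = 0.25 * (a_i^T x)^4.
rng = np.random.default_rng(0)
n, d = 8, 3
A = rng.normal(size=(n, d))
grad_i = lambda i, x: (A[i] @ x) ** 3 * A[i]
hess_i = lambda i, x: 3.0 * (A[i] @ x) ** 2 * np.outer(A[i], A[i])

x0, x1 = rng.normal(size=d), rng.normal(size=d)
all_idx = np.arange(n)

# Epoch start: reset both estimators with a (here: full) batch at x0.
v0 = recursive_estimate(grad_i, x0, None, None, all_idx, reset=True)
U0 = recursive_estimate(hess_i, x0, None, None, all_idx, reset=True)

# Next iteration: recursive update at x1 using a minibatch (a multiset of indices).
batch = rng.integers(0, n, size=4)
v1 = recursive_estimate(grad_i, x1, x0, v0, batch, reset=False)
U1 = recursive_estimate(hess_i, x1, x0, U0, batch, reset=False)

# Sanity check: with the full batch, the recursive update is exact.
v_full = recursive_estimate(grad_i, x1, x0, v0, all_idx, reset=False)
true_grad = np.mean([grad_i(i, x1) for i in all_idx], axis=0)
```

When consecutive iterates are close, the correction term is small for every sampled index, which is the intuition behind the reduced variance of the recursive estimators.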
The main difference between SRVRC and previous stochastic cubic regularization algorithms (Kohler and Lucchi, 2017; Xu et al., 2017; Zhou et al., 2018d, b; Wang et al., 2018; Zhang et al., 2018a) is that SRVRC adopts new semi-stochastic gradient and semi-stochastic Hessian estimators, which are defined recursively and have smaller asymptotic variance. The use of such a semi-stochastic gradient has been proved to help reduce the gradient complexity in first-order nonconvex finite-sum optimization for finding stationary points (Fang et al., 2018). Our work takes one step further and applies it to the Hessian, and we will later show that it helps reduce the gradient and Hessian complexities in second-order nonconvex finite-sum optimization for finding local minima (i.e., second-order stationary points).
4.2 Convergence Analysis
In this subsection, we present our theoretical results about SRVRC. While the idea of using variance reduction technique for cubic regularization is hardly new, the new semi-stochastic gradient and Hessian estimators in (5) and (6) bring new technical challenges in the convergence analysis.
To describe whether a point is a local minimum, we follow the original cubic regularization work (Nesterov and Polyak, 2006) to use the following criterion : For any , let be
It is easy to note that if and only if is an -approximate local minimum. Thus, in order to find an -approximate local minimum, it suffices to find a point which satisfies .
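The equivalence stated above can be checked numerically under an assumed form of the criterion. The sketch below assumes the Nesterov-Polyak style quantity $\mu(x) = \max\{\|\nabla F(x)\|_2^{3/2},\ -\lambda_{\min}^3(\nabla^2 F(x))/\rho^{3/2}\}$ with Hessian Lipschitz constant $\rho$, so that $\mu(x) \le \epsilon^{3/2}$ corresponds to an $(\epsilon, \sqrt{\rho\epsilon})$-approximate local minimum. Both the formula and the test values are assumptions for illustration, since the displayed definition is not available in this text.

```python
import numpy as np

def mu(grad, hess, rho):
    """Assumed criterion: max(||grad||^{3/2}, -lambda_min(hess)^3 / rho^{3/2})."""
    lam_min = np.linalg.eigvalsh(hess)[0]
    return max(np.linalg.norm(grad) ** 1.5, -(lam_min ** 3) / rho ** 1.5)

def is_approx_local_min(grad, hess, eps, rho):
    """Direct check of the (eps, sqrt(rho * eps)) conditions."""
    lam_min = np.linalg.eigvalsh(hess)[0]
    return bool(np.linalg.norm(grad) <= eps and lam_min >= -np.sqrt(rho * eps))

grad = np.array([0.05, 0.0])
eps, rho = 0.1, 1.0
mild = np.diag([1.0, -0.2])    # lambda_min = -0.2 >= -sqrt(0.1)
steep = np.diag([1.0, -0.5])   # lambda_min = -0.5 <  -sqrt(0.1)

mild_by_mu = mu(grad, mild, rho) <= eps ** 1.5
steep_by_mu = mu(grad, steep, rho) <= eps ** 1.5
```

In both toy cases the single scalar test on `mu` agrees with the direct two-condition check, which is what makes it convenient as a stopping criterion in the analysis.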
The following theorem provides the convergence guarantee of SRVRC for finding an -approximate local minimum.
For such that , set the gradient sample size and Hessian sample size as
Then with probability at least , SRVRC outputs satisfying , i.e., an -approximate local minimum. is a universal constant.
The next corollary spells out the exact gradient complexity and Hessian complexity of SRVRC to find an -approximate local minimum. Under the same conditions as Theorem 4.2, if set as
stochastic Hessian evaluations and
stochastic gradient evaluations. For SRVRC, if we assume are constants, then its gradient complexity is
and its Hessian complexity is
Regarding Hessian complexity, suppose that , then the Hessian complexity of SRVRC can be simplified as . Compared with existing SVRC algorithms (Zhou et al., 2018b; Zhang et al., 2018a; Wang et al., 2018), SRVRC outperforms the best-known Hessian sample complexity by a factor of . In terms of gradient complexity, SRVRC outperforms the algorithm in Zhang et al. (2018a) by a factor of when , and by a factor of when . The gradient complexity of SRVRC also outperforms that of the algorithm in Zhou et al. (2018d) by a factor of when .
5 Hessian-Free SRVRC
While SRVRC adopts novel semi-stochastic gradient and Hessian estimators to reduce both the gradient and Hessian complexities, it has three limitations for high-dimensional problems with : (1) it needs to compute and store the Hessian matrix, which needs computational time and storage space; (2) it needs to solve the cubic subproblem exactly, which requires computational time because it needs to compute the inverse of a Hessian matrix (Nesterov and Polyak, 2006); and (3) it cannot leverage the Hessian-vector product-based cubic subproblem solvers (Agarwal et al., 2017; Carmon and Duchi, 2016, 2018) because of the use of the semi-stochastic Hessian estimator.
5.1 Algorithm Description
We present a Hessian-free algorithm, Hessian-free SRVRC, to address the above limitations of SRVRC for high-dimensional problems. Its runtime complexity is linear in $d$, and it therefore works well in the high-dimensional regime. Hessian-free SRVRC uses the same semi-stochastic gradient as SRVRC. As opposed to SRVRC, which has to construct the semi-stochastic Hessian explicitly, Hessian-free SRVRC only accesses Hessian-vector products. In detail, at each iteration , Hessian-free SRVRC subsamples an index set and defines a Hessian-vector product function as follows:
Note that although the subproblem depends on , Hessian-free SRVRC never explicitly computes this matrix. Instead, it only provides the subproblem solver with access to through the Hessian-vector product function . The subproblem solver performs gradient-based optimization to solve the subproblem, as depends on only via . In detail, following Tripuraneni et al. (2018), Hessian-free SRVRC uses Cubic-Subsolver (see Algorithms 3 and 4 in Appendix G) and Cubic-Finalsolver from Carmon and Duchi (2016) to find an approximate solution to the cubic subproblem in (7). Both Cubic-Subsolver and Cubic-Finalsolver only need access to the gradient and the Hessian-vector product function , along with other problem-dependent parameters. With the output from Cubic-Subsolver, Hessian-free SRVRC decides either to update as or to exit the loop. In the latter case, Hessian-free SRVRC calls Cubic-Finalsolver to output , and takes as its final output.
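The idea of solving the cubic subproblem through Hessian-vector products alone can be sketched as follows. This is a simplified stand-in, not the Cubic-Subsolver or Cubic-Finalsolver of Appendix G: it runs plain gradient descent on the cubic model with a fixed, hand-picked step size and iteration count. The toy components $f_j(x) = \frac{1}{4}(a_j^\top x)^4$, the helper names `make_hvp` and `cubic_gd`, and the vector `g` (standing in for the semi-stochastic gradient) are all hypothetical.

```python
import numpy as np

def make_hvp(A, x, idx):
    """Return v -> (1/|idx|) * sum_{j in idx} Hess f_j(x) v for f_j(x) = 0.25 (a_j^T x)^4.

    Each term equals 3 (a_j^T x)^2 (a_j^T v) a_j, so no d-by-d matrix is ever formed.
    """
    rows = A[idx]
    weights = 3.0 * (rows @ x) ** 2
    return lambda v: rows.T @ (weights * (rows @ v)) / len(idx)

def cubic_gd(g, hvp, M, step=0.1, iters=500):
    """Plain gradient descent on m(h) = <g, h> + 0.5 <U h, h> + (M / 6) ||h||^3,
    where the matrix U is only accessed through hvp."""
    h = np.zeros_like(g)
    for _ in range(iters):
        model_grad = g + hvp(h) + 0.5 * M * np.linalg.norm(h) * h
        h = h - step * model_grad
    return h

# Toy data chosen so that the subsampled Hessian is the identity matrix.
A = np.eye(3)
x = np.ones(3)
idx = np.array([0, 1, 2])
hvp = make_hvp(A, x, idx)

g = np.array([1.0, -2.0, 0.5])   # stand-in for the semi-stochastic gradient
M = 1.0
h = cubic_gd(g, hvp, M)
residual = g + hvp(h) + 0.5 * M * np.linalg.norm(h) * h
```

Each call to `hvp` costs time proportional to the batch size times the dimension, which is the property that lets a Hessian-free method avoid the quadratic storage and the matrix inversion of an explicit Hessian.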
The main differences between SRVRC and Hessian-free SRVRC are two-fold. First, Hessian-free SRVRC only needs to compute stochastic gradients and Hessian-vector products, and both of these operations take time linear in $d$ (Rumelhart et al., 1986). Second, instead of solving the cubic subproblem exactly, Hessian-free SRVRC adopts the approximate subproblem solvers Cubic-Subsolver and Cubic-Finalsolver, both of which only need access to the gradient and the Hessian-vector product function, and whose cost is again linear in $d$. Thus, Hessian-free SRVRC is computationally more efficient than SRVRC when $d$ is large.
5.2 Convergence Analysis
We now provide the convergence guarantee of Hessian-free SRVRC, which ensures that it outputs an -approximate local minimum. Under the assumptions in Section 3, suppose . Set the cubic penalty parameter for any and the total iteration number . Set the Hessian-vector product sample size as
For such that , set the gradient sample size as
For such that , set the gradient sample size as
Then with probability at least , Hessian-free SRVRC outputs satisfying , i.e., an -approximate local minimum. Here is a universal constant. The following corollary calculates the runtime complexity of Hessian-free SRVRC to find an -approximate local minimum.
Under the same conditions as Theorem 5.2, if set as
runtime. For Hessian-free SRVRC, if we assume are constants, then its runtime complexity is
For stochastic algorithms, the regime is of most interest. In this regime, (17) becomes . Compared with other local minimum finding algorithms based on stochastic gradients and Hessian-vector products, Hessian-free SRVRC outperforms the results achieved by Tripuraneni et al. (2018) and Allen-Zhu (2018) by a factor of . Hessian-free SRVRC also matches the best-known result achieved by a recent first-order algorithm proposed in Fang et al. (2018).
Figure 1: Plots of logarithmic function value gap with respect to runtime (in seconds) for nonconvex regularized binary logistic regression on (a) a9a and (b) covtype, and for nonconvex regularized multiclass logistic regression on (c) MNIST.
We would like to further compare the runtime complexity of SRVRC and Hessian-free SRVRC. Specifically, SRVRC needs time to construct the semi-stochastic gradient and time to construct the semi-stochastic Hessian. SRVRC also needs time to solve the cubic subproblem in each iteration. Thus, with the fact that the total number of iterations is by Corollary 4.2, SRVRC needs
runtime to find an -approximate local minimum if we regard as constants. Compared with (17), we conclude that Hessian-free SRVRC outperforms SRVRC when $d$ is sufficiently large, which is in accordance with the fact that Hessian-free methods are superior for high-dimensional machine learning tasks. On the other hand, a careful calculation shows that the runtime of SRVRC can be less than that of Hessian-free SRVRC when $d$ is moderately small. This is also reflected in our experiments.
6 Experiments
In this section, we present numerical experiments on different nonconvex empirical risk minimization (ERM) problems and on different datasets to validate the advantage of our proposed SRVRC and Hessian-free SRVRC algorithms for finding approximate local minima. We use runtime as the performance measure.
Baselines: We compare our algorithms with the following algorithms: subsampled cubic regularization (Subsample Cubic) (Kohler and Lucchi, 2017), stochastic cubic regularization (Stochastic Cubic) (Tripuraneni et al., 2018), stochastic variance-reduced cubic regularization (SVRC) (Zhou et al., 2018d), sample efficient stochastic variance-reduced cubic regularization (Lite-SVRC) (Zhou et al., 2018b; Wang et al., 2018; Zhang et al., 2018a).
Parameter Settings and Subproblem Solver: For each algorithm, we set the cubic penalty parameter adaptively based on how well the model approximates the real objective as suggested in Cartis et al. (2011a, b); Kohler and Lucchi (2017). For SRVRC, we set gradient and Hessian batch sizes and as follows:
For Hessian-free SRVRC, we set the gradient batch sizes the same as for SRVRC and the Hessian batch sizes . We tune over the grid , over the grid , and over the grid for the best performance. For Subsample Cubic, SVRC, Lite-SVRC and SRVRC, we solve the cubic subproblem using the cubic subproblem solver discussed in Nesterov and Polyak (2006). For Stochastic Cubic and Hessian-free SRVRC, we use Cubic-Subsolver (Algorithm 3 in Appendix G) to approximately solve the cubic subproblem. All algorithms are carefully tuned for a fair comparison.
Datasets and Optimization Problems: We use three datasets, a9a, covtype and MNIST, from Chang and Lin (2011). For a9a and covtype, we study the binary logistic regression problem with a nonconvex regularizer (Reddi et al., 2016b). For MNIST, we study multiclass logistic regression with a nonconvex regularizer , where is the number of classes.
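For concreteness, a binary objective of this kind can be sketched as below. The exact regularizer is not reproduced in this text, so the sketch assumes the commonly used nonconvex penalty $\lambda \sum_j w_j^2 / (1 + w_j^2)$, labels in $\{-1, +1\}$, and a hypothetical value of `lam`; the synthetic data stand in for a9a or covtype and are not the real datasets.

```python
import numpy as np

def objective(w, X, y, lam):
    """Average logistic loss (labels in {-1, +1}) plus lam * sum_j w_j^2 / (1 + w_j^2)."""
    margins = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins))      # stable log(1 + exp(-m))
    reg = lam * np.sum(w ** 2 / (1.0 + w ** 2))      # assumed nonconvex regularizer
    return loss + reg

def gradient(w, X, y, lam):
    """Analytic gradient of the objective above."""
    margins = y * (X @ w)
    # sigmoid(-m) = 1 / (1 + exp(m)), written with tanh to avoid overflow.
    sig_neg = 0.5 * (1.0 - np.tanh(0.5 * margins))
    loss_grad = -(X.T @ (y * sig_neg)) / len(y)
    reg_grad = lam * 2.0 * w / (1.0 + w ** 2) ** 2
    return loss_grad + reg_grad

# Synthetic stand-in data (not a9a or covtype).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = np.where(rng.normal(size=50) > 0, 1.0, -1.0)
w = rng.normal(size=4)
lam = 1e-3

# Central finite differences as a sanity check of the analytic gradient.
step = 1e-6
numeric = np.array([
    (objective(w + step * e, X, y, lam) - objective(w - step * e, X, y, lam)) / (2 * step)
    for e in np.eye(4)
])
analytic = gradient(w, X, y, lam)
```

The logistic loss is convex in `w`, so any nonconvexity of this objective comes from the regularizer, whose per-coordinate second derivative becomes negative once $|w_j| > 1/\sqrt{3}$.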
We plot the logarithmic function value gap with respect to runtime in Figure 1. From Figures 1(a), 1(b) and 1(c), we can see that for the low-dimensional optimization tasks on a9a and covtype, our SRVRC outperforms all the other algorithms with respect to runtime. For the high-dimensional optimization task on MNIST, only Stochastic Cubic and Hessian-free SRVRC are able to make progress, and Hessian-free SRVRC outperforms Stochastic Cubic. This is consistent with our discussions in Section 5.3.
7 Conclusions and Future Work
In this work, we presented two faster SVRC algorithms, namely SRVRC and Hessian-free SRVRC, to find approximate local minima for nonconvex finite-sum optimization problems. SRVRC outperforms existing SVRC algorithms in terms of gradient and Hessian complexities, while Hessian-free SRVRC further outperforms the best-known runtime complexity for existing CR-based algorithms. Whether our algorithms have achieved the optimal complexity under the current assumptions is still an open problem, and we leave it as future work.
Appendix A Proofs in Section 4
We define the filtration as the $\sigma$-algebra of to . Without confusion, we assume and as the semi-stochastic gradient and Hessian, as the update parameter, as the cubic penalty parameter appearing in Algorithm 1 and Algorithm 2. We denote and . In this section, we define for simplicity.
A.1 Proof of Theorem 4.2
(Zhou et al., 2018d) Suppose that and . If , then for any , we have
The next lemma gives upper bounds on the inner product terms which will appear in our main proof. (Zhou et al., 2018d) For any , we have
We also need the following two lemmas, which show that semi-stochastic gradient and Hessian are good approximations to true gradient and Hessian.
Given all the above lemmas, we are ready to prove Theorem 4.2.
Proof of Theorem 4.2.
Suppose that SRVRC breaks at iteration , then for all . We have