A Stochastic Tensor Method for Non-convex Optimization

11/23/2019 · by Aurelien Lucchi, et al.

We present a stochastic optimization method that uses a fourth-order regularized model to find local minima of smooth and potentially non-convex objective functions. The algorithm uses sub-sampled derivatives instead of exact quantities, and its implementation relies on tensor-vector products only. The proposed approach is shown to find an (ϵ_1, ϵ_2)-second-order critical point in at most O(max(ϵ_1^(-4/3), ϵ_2^(-2))) iterations, thereby matching the rate of deterministic approaches. Furthermore, we discuss a practical implementation of this approach for objective functions with a finite-sum structure, as well as characterize the total computational complexity, for both sampling with and without replacement. Finally, we identify promising directions of future research to further improve the complexity of the discussed algorithm.


1 Introduction

We consider the problem of optimizing an objective function of the form

min_{x∈ℝ^d} f(x) := (1/n) ∑_{i=1}^n f_i(x),   (1)

where each f_i is a not necessarily convex loss function defined over n datapoints.

Our setting is one where access to the exact function gradient ∇f(x) is computationally expensive (e.g. the large-scale setting where n is large) and one therefore wants to access only stochastic evaluations ∇f_i(x), potentially over a mini-batch. In such settings, stochastic gradient descent (SGD) has long been the method of choice in the field of machine learning. Despite the uncontested empirical success of SGD at solving difficult optimization problems – including training deep neural networks – the convergence of SGD is known to slow down close to saddle points or in ill-conditioned landscapes [38, 23]. While gradient descent requires O(ϵ^(-2)) oracle evaluations¹ to reach an ϵ-approximate first-order critical point, the complexity worsens to O(ϵ^(-4)) for SGD. In contrast, high-order methods – including e.g. regularized Newton methods and trust-region methods – exploit curvature information, allowing them to enjoy faster convergence to a second-order critical point.

¹ In "oracle evaluations" we include the number of function and gradient evaluations as well as evaluations of higher-order derivatives.

In this work, we focus our attention on regularized high-order methods to optimize Eq. (1), which construct and optimize a local Taylor model of the objective in each iteration, with an additional step-length penalty term that depends on how well the model approximates the real objective. This paradigm goes back to Trust-Region and Cubic Regularization methods, which also make use of regularized models to compute their update step [20, 40, 16]. For the class of second-order Lipschitz smooth functions, [40] showed that the Cubic Regularization framework finds an ϵ-approximate second-order critical point in at most O(ϵ^(-3/2)) iterations², thus achieving the best possible rate in this setting [14]. Recently, stochastic extensions of these methods have appeared in the literature, such as [19, 34, 46, 51]. These will be discussed in further detail in Section 2.

² The complexity ignores the cost of optimizing the model at each iteration. This is an issue we will revisit later on.

Since the use of second derivatives can provide significant theoretical speed-ups, a natural question is whether higher-order derivatives can result in further improvements. This question was answered affirmatively in [8], who showed that using derivatives up to order p allows convergence to an ϵ-approximate first-order critical point in at most O(ϵ^(-(p+1)/p)) evaluations. This result was extended to (ϵ_1, ϵ_2)-second-order stationarity in [18], which proves an O(max(ϵ_1^(-(p+1)/p), ϵ_2^(-(p+1)/(p-1)))) rate. Yet, these results assume a deterministic setting where access to exact evaluations of the function derivatives is needed and – to the best of our knowledge – the question of using high-order (p > 2) derivatives in a stochastic setting has received little consideration in the literature so far. We focus our attention on the case of computing derivative information up to order p = 3. It has recently been shown in [39] that, while optimizing degree-four polynomials is NP-hard in general [31], the specific models that arise from a third-order Taylor expansion with a quartic regularizer can still be optimized efficiently.

The main contribution of this work (Thm. 6) is to demonstrate that a stochastic fourth-order regularized method finds an (ϵ_1, ϵ_2)-second-order stationary point in at most O(max(ϵ_1^(-4/3), ϵ_2^(-2))) iterations. Therefore, it matches the results obtained by deterministic methods while relying only on inexact derivative information. In order to prove this result, we develop a novel tensor concentration inequality (Thm. 7) for sums of tensors of any order and make explicit use of the finite-sum structure given in Eq. (1). Together with existing matrix and vector concentration bounds [48], this allows us to define a sufficient number of samples needed for convergence. We thereby provide theoretically motivated sampling schemes for the derivatives of the objective – for both sampling with and without replacement – and discuss a practical implementation of the resulting approach, which we name STM (Stochastic Tensor Method). We also compute the total computational complexity of this algorithm (including the complexity of optimizing the model at each iteration) and find that the complexity in terms of ϵ_1 is superior to that of other state-of-the-art stochastic algorithms such as [52, 46, 3]. Importantly, the implementation of this approach does not require computing high-order derivatives directly, which would potentially render it too expensive in terms of computation and memory. Instead, the necessary derivative information is accessed only indirectly via tensor-vector and matrix-vector products.

2 Related work

Sampling techniques for first-order methods.

In large-scale learning (when n is large), most of the computational cost of traditional deterministic optimization methods is spent computing the exact gradient information. A common technique to address this issue is to use sub-sampling to compute an unbiased estimate of the gradient. The simplest instance is SGD, whose convergence does not depend on the number of datapoints n. However, the variance in the stochastic gradient estimates slows its convergence down. The work of [26] explored a sub-sampling technique in the case of convex functions, showing that it is possible to maintain the same convergence rate as full-gradient descent by carefully increasing the sample size over time. Another way to recover a linear rate of convergence for strongly-convex functions is to use variance reduction [33, 24, 43, 32, 22]. The convergence of SGD and its variance-reduced counterparts has also been extended to non-convex functions [28, 42], but the guarantees these methods provide are only in terms of first-order stationarity. However, the work of [27, 44, 21], among others, showed that SGD can achieve stronger guarantees in the case of strict-saddle functions. Yet, the convergence rate has a polynomial dependency on the dimension d and on the smallest eigenvalue of the Hessian, which can make this method fairly impractical.

Second-order methods.

For second-order methods that are not regularized or that make use of positive definite Hessian approximations (e.g. Gauss-Newton), the problem of avoiding saddle points is even worse, as they might be attracted by saddle points or even local maxima [23]. Another predominant issue is the computation (and storage) of the Hessian matrix, which can be partially addressed by Quasi-Newton methods such as (L-)BFGS. An increasingly popular alternative is to use sub-sampling techniques to approximate the Hessian matrix, such as in [10] and [25]. The latter method uses a low-rank approximation of the Hessian to reduce the complexity per iteration. However, this yields a composite convergence rate: quadratic at first but only linear near the minimizer.

Cubic regularization and trust region methods.

Trust region methods are among the most effective algorithmic frameworks to avoid pitfalls such as saddle points in non-convex optimization. Classical versions iteratively construct a local quadratic model and minimize it within a certain radius wherein the model is trusted to be sufficiently similar to the actual objective function. This is equivalent to minimizing the model function with a suitable quadratic penalty term on the stepsize. Thus, a natural extension is the cubic regularization method introduced by [40], which uses a cubic over-estimator of the objective function as a regularization technique for the computation of a step to minimize the objective function. The drawback of their method is that it requires computing the exact minimizer of the cubic model, thus requiring the exact gradient and Hessian matrix. However, finding a global minimizer of the cubic model may not be essential in practice, and doing so might be prohibitively expensive from a computational point of view. [16] introduced a method named ARC which relaxed this requirement by letting the step be an approximation to the minimizer. The model defined by the adaptive cubic regularization method introduced two further changes. First, instead of computing the exact Hessian, it allows for a symmetric approximation. Second, it introduces a dynamic step-length penalty parameter instead of using the global Lipschitz constant. Our approach relies on the same adaptive framework.

There have been efforts to further reduce the computational complexity of optimizing the model. For example, [2] refined the approach of [40] to return an approximate local minimum in time which is linear in the input. Similar improvements have been made by  [11] and  [30]. These methods provide alternatives to minimize the cubic model and can thus be seen as complementary to our approach. Finally, [9] proposed a stochastic trust region method but their analysis does not specify any accuracy level required for the estimation of the stochastic Hessian.  [19] also analyzed a probabilistic cubic regularization variant that allows for approximate second-order models.  [34] provided an explicit derivation of sampling conditions to preserve the worst-case complexity of ARC. Other works also derived similar stochastic extensions to cubic regularization, including [51] and [46]. The worst-case rate derived in the latter includes the complexity of a specific model solver introduced in [11].

High-order models.

A hybrid algorithm suggested in [4] adds occasional third-order steps to a cubic regularization method, thereby obtaining provable convergence to some type of third-order local minima. Yet, high-order derivatives can also directly be applied within the regularized Newton framework, as done e.g. by [8], which extended cubic regularization to a p-th order model (Taylor approximation of order p) with a (p+1)-th order regularization, proving iteration complexities of order O(ϵ^(-(p+1)/p)) for first-order stationarity. These convergence guarantees are extended to the case of inexact derivatives in [7], but the latter only discusses a practical implementation for the case p = 2. The convergence guarantees of p-th order models are extended to second-order stationarity in [18]. Notably, all these approaches require optimizing a p-th order polynomial, which is known to be a difficult problem. Recently, [39, 29] introduced an implementable method for the case p = 3, in the deterministic setting and for convex functions. We are not aware of any work that relies on high-order (p > 2) derivatives to construct a model to optimize smooth non-convex functions, and we therefore focus our work on the case p = 3.

Finally, another line of work [3, 52] considers methods that do not use high-order derivatives explicitly within a regularized Newton framework but rather rely on other routines – such as Oja's algorithm – to explicitly find negative curvature directions and couple those with SGD steps.

3 Formulation

3.1 Notation & Assumptions

First, we lay out some standard assumptions regarding the function f, as well as the required approximation quality of the high-order derivatives.

Assumption 1 (Continuity).

The derivatives ∇f_i, ∇²f_i and ∇³f_i are Lipschitz continuous for all i, with Lipschitz constants L_1, L_2 and L_3, respectively.

In the following, we will denote the j-th directional derivative of the function f at x along the directions h_1, …, h_j as

∇^j f(x)[h_1, …, h_j].   (2)

For instance, ∇f(x)[h] = ∇f(x)ᵀh and ∇²f(x)[h]² = hᵀ∇²f(x)h.
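To make this notation concrete, the following sketch (a toy example with a hypothetical objective, not taken from the paper) checks the first two directional derivatives against finite differences of the scalar function φ(t) = f(x + t h):

```python
import numpy as np

# Toy smooth objective f(x) = sum_i x_i^4 with analytic derivatives.
f = lambda x: np.sum(x**4)
grad = lambda x: 4 * x**3               # ∇f(x)
hess = lambda x: np.diag(12 * x**2)     # ∇²f(x)

x = np.linspace(0.5, 1.5, 5)
h = np.linspace(1.0, 2.0, 5)

# The j-th directional derivative ∇^j f(x)[h]^j equals the j-th derivative
# of the scalar function phi(t) = f(x + t h) at t = 0.
eps = 1e-4
d1 = (f(x + eps * h) - f(x - eps * h)) / (2 * eps)          # ≈ ∇f(x)[h]
d2 = (f(x + eps * h) - 2 * f(x) + f(x - eps * h)) / eps**2  # ≈ ∇²f(x)[h]²

assert np.isclose(d1, grad(x) @ h, rtol=1e-6)        # ∇f(x)[h] = ∇f(x)'h
assert np.isclose(d2, h @ hess(x) @ h, rtol=1e-4)    # ∇²f(x)[h]² = h'∇²f(x)h
```

The same identification extends to j = 3, where ∇³f(x)[h]³ is the third derivative of φ at 0.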

Assumption 1 implies that, for each j ∈ {1, 2, 3},

‖∇^j f(x) − ∇^j f(y)‖ ≤ L_j ‖x − y‖,   (3)

for all x, y ∈ ℝ^d. As in [18], ‖·‖ here denotes the tensor norm recursively induced by the Euclidean norm on the space of j-th order tensors.

3.2 Stochastic surrogate model

We construct a surrogate model to optimize f based on a truncated Taylor approximation, as well as a power prox function weighted by a sequence σ_k that is controlled adaptively according to the fit of the model to the function f. Since the full Taylor expansion of f requires computing high-order derivatives that are expensive, we instead use an inexact model defined as

m_k(s) = f(x_k) + g_kᵀs + (1/2) sᵀB_k s + (1/6) T_k[s]³ + (σ_k/4) ‖s‖⁴,   (4)

where g_k, B_k and T_k approximate the derivatives ∇f(x_k), ∇²f(x_k) and ∇³f(x_k) through sampling as follows. Three sample sets S_g, S_B and S_T are drawn and the derivatives are then estimated as

g_k = (1/|S_g|) ∑_{i∈S_g} ∇f_i(x_k),   B_k = (1/|S_B|) ∑_{i∈S_B} ∇²f_i(x_k),   T_k = (1/|S_T|) ∑_{i∈S_T} ∇³f_i(x_k).   (5)

We will later see that the implementation of the algorithm we analyze does not require the computation of the Hessian or the third-order tensor – both of which would require significant computational resources – but instead directly computes Hessian-vector and tensor-vector products, whose cost scales with the problem dimension rather than with the size of the full derivative tensors.
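To illustrate Eq. (5) and the point about derivative-vector products, here is a small sketch with a hypothetical finite-sum objective f_i(x) = (a_iᵀx)⁴ (illustrative only, not from the paper), in which the sub-sampled gradient and the actions of the sub-sampled Hessian and third-order tensor on a direction s are all built from O(d) per-sample quantities, without ever storing a d × d × d tensor:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 10
A = rng.standard_normal((n, d))   # one datapoint a_i per row (hypothetical)
x = rng.standard_normal(d)
s = rng.standard_normal(d)

# Toy finite sum f(x) = (1/n) sum_i (a_i'x)^4. Per-sample actions along s:
#   grad_i          = 4  (a_i'x)^3 a_i
#   hess_i @ s      = 12 (a_i'x)^2 (a_i's) a_i
#   T_i[s]^2        = 24 (a_i'x) (a_i's)^2 a_i   (a vector, O(d) per sample)
def estimates(idx):
    Ax, As = A[idx] @ x, A[idx] @ s
    g = 4 * (Ax**3) @ A[idx] / len(idx)
    Bs = 12 * (Ax**2 * As) @ A[idx] / len(idx)
    Tss = 24 * (Ax * As**2) @ A[idx] / len(idx)
    return g, Bs, Tss

exact = estimates(np.arange(n))                       # full sample
sub = estimates(rng.choice(n, 500, replace=False))    # sub-sampled estimates

errs = [np.linalg.norm(e - a) / np.linalg.norm(e) for e, a in zip(exact, sub)]
assert all(err < 0.75 for err in errs)  # estimates track the exact actions
```

The sub-sampled quantities concentrate around the exact ones as the sample sizes grow; the sampling conditions below quantify exactly how accurate they must be.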

We will make use of the following condition in order to reach an (ϵ_1, ϵ_2)-critical point:

Condition 1.

For a given accuracy, one can choose the sizes of the sample sets S_g, S_B and S_T such that, for sufficiently small constants κ_g, κ_B, κ_T > 0:

‖g_k − ∇f(x_k)‖ ≤ κ_g ‖s_k‖³,   (6)
‖(B_k − ∇²f(x_k)) s_k‖ ≤ κ_B ‖s_k‖³,   (7)
‖(T_k − ∇³f(x_k))[s_k]²‖ ≤ κ_T ‖s_k‖³.   (8)

In Lemma 8, we prove that we can choose the size of each sample set S_g, S_B and S_T to satisfy the conditions above, without requiring knowledge of the length of the step s_k. We will present a convergence analysis of STM, proving that the convergence properties of the deterministic methods [8, 39] can be retained by a sub-sampled version at the price of slightly worse constants.

3.3 Algorithm

1:  Input: Starting point x_0 (e.g. x_0 = 0), γ > 1, η ∈ (0, 1) and σ_0 > 0
2:  for k = 0, 1, 2, … do
3:     Sample gradient g_k, Hessian B_k and third-order tensor T_k such that Eq. (6), Eq. (7) & Eq. (8) hold.
4:     Obtain s_k by minimizing the model m_k (Eq. (4)) such that Condition 2 holds.
5:     Compute f(x_k + s_k) and
       ρ_k = (f(x_k) − f(x_k + s_k)) / (f(x_k) − m_k(s_k))   (9)
6:     Set
       x_{k+1} = x_k + s_k if ρ_k ≥ η, and x_{k+1} = x_k otherwise   (10)
7:     Set
       σ_{k+1} = σ_k / γ if ρ_k ≥ η, and σ_{k+1} = γ σ_k otherwise   (11)
8:  end for
Algorithm 1 Stochastic Tensor Method (STM)

The optimization algorithm we consider is detailed in Algorithm 1. At iteration step k, we sample three sets of datapoints from which we compute stochastic estimates of the derivatives of f so as to satisfy Cond. 1. We then obtain the step s_k by solving the problem

s_k ∈ argmin_s m_k(s)   (12)

either exactly or approximately (details will follow shortly) and update the regularization parameter σ_k depending on the ratio ρ_k, which measures how well the model approximates the real objective. This is accomplished by differentiating between different types of iterations. Successful iterations (for which ρ_k ≥ η) indicate that the model is, at least locally, an adequate approximation of the objective, such that the penalty parameter is decreased in order to allow for longer steps. We denote the index set of all successful iterations between 0 and T by S_T, and by U_T its complement in {0, …, T}, which corresponds to the index set of unsuccessful iterations.
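For concreteness, the following is a minimal sketch of these mechanics, using exact derivatives of a toy separable objective, a plain gradient-descent subsolver, and illustrative constants; it is a didactic sketch, not the paper's implementation:

```python
import numpy as np

# Toy objective f(x) = sum(x_i^4/4 - x_i^2/2): non-convex, minima at x_i = ±1.
f = lambda x: np.sum(x**4 / 4 - x**2 / 2)
g = lambda x: x**3 - x                  # gradient
Bv = lambda x, s: (3 * x**2 - 1) * s    # Hessian-vector product (diagonal)
Tvv = lambda x, s: 6 * x * s**2         # third-order tensor applied twice

def model(x, s, sigma):
    # m(s) = f + g's + 1/2 s'Bs + 1/6 T[s,s,s] + sigma/4 ||s||^4
    return (f(x) + g(x) @ s + 0.5 * s @ Bv(x, s)
            + Tvv(x, s) @ s / 6 + sigma / 4 * (s @ s)**2)

def model_grad(x, s, sigma):
    return g(x) + Bv(x, s) + 0.5 * Tvv(x, s) + sigma * (s @ s) * s

x, sigma, eta, gamma = np.array([1.5, -0.4]), 1.0, 0.1, 2.0
for k in range(50):
    # Inner subsolver: plain gradient descent on the model, started at s = 0.
    s = np.zeros_like(x)
    for _ in range(300):
        s -= 0.02 * model_grad(x, s, sigma)
    pred = f(x) - model(x, s, sigma)                    # predicted decrease
    rho = (f(x) - f(x + s)) / pred if pred > 0 else -1.0
    if rho >= eta:                                       # successful step
        x, sigma = x + s, max(sigma / gamma, 1e-4)
    else:                                                # unsuccessful step
        sigma *= gamma

assert np.linalg.norm(g(x)) < 1e-3              # near a critical point
assert np.allclose(np.abs(x), 1.0, atol=1e-2)   # at a local minimum, not 0
```

The floor on σ and the specific constants (η = 0.1, γ = 2) are arbitrary choices for the sketch; the paper's analysis only requires the accept/decrease, reject/increase logic shown here.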

Exact model minimization

Solving Eq. (4) requires minimizing a non-convex multivariate polynomial. As pointed out in [39, 5], this problem is computationally expensive to solve in general, and further research is needed to establish whether one could design a practical minimization method. In [39], the authors demonstrated that an appropriately regularized Taylor approximation of a convex function is a convex multivariate polynomial, which can be minimized using the framework of relatively smooth functions developed in [35]. As long as the involved models are convex, this solver could be used in Algorithm 1, but it is unclear how to generalize this method to the non-convex case. Fortunately, we will see next that exact model minimization is not even needed to establish global convergence guarantees for our method.

Approximate model minimization

Exact minimization of the model in Eq. (4) is often computationally expensive, especially given that it is required for every parameter update in Algorithm 1. In the following, we explore an approach to approximately minimize the model while retaining the convergence guarantees of the exact minimization. Instead of requiring the exact optimality conditions to hold, we use weaker conditions that were also used in prior work [8, 18]. First, we define two criticality measures based on first- and second-order information:

ν_1^f(x) = ‖∇f(x)‖,   ν_2^f(x) = max(0, −λ_min(∇²f(x))),   (13)

where λ_min(∇²f(x)) is the minimum eigenvalue of the Hessian matrix ∇²f(x).

The same criticality measures are defined for the model m_k,

ν_1^m(s) = ‖∇_s m_k(s)‖,   ν_2^m(s) = max(0, −λ_min(∇²_s m_k(s))).   (14)

Finally, we state the approximate optimality condition required to find the step s_k.

Condition 2.

For each outer iteration k, the step s_k is computed so as to approximately minimize the model, in the sense that the first- and second-order criticality measures of the model satisfy, for some constant θ > 0,

ν_1^m(s_k) ≤ θ ‖s_k‖³   and   ν_2^m(s_k) ≤ θ ‖s_k‖².   (15)
Practical implementation

In Section 4.3, we will discuss how a gradient-based method can be used to approximately optimize the model defined in Eq. (4). Such an algorithm only needs to access the first derivative of the model w.r.t. s, defined as

∇_s m_k(s) = g_k + B_k s + (1/2) T_k[s]² + σ_k ‖s‖² s,   (16)

and it therefore has a low computational complexity per iteration.
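This derivative can be sketched and sanity-checked numerically. The block below (with synthetic, randomly generated g, B, T and σ, all purely illustrative) assumes a quartic-regularized third-order model m(s) = gᵀs + ½sᵀBs + (1/6)T[s,s,s] + (σ/4)‖s‖⁴ and verifies the analytic gradient g + Bs + ½T[s,s] + σ‖s‖²s against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
g = rng.standard_normal(d)
B = rng.standard_normal((d, d)); B = (B + B.T) / 2   # symmetric Hessian proxy
T = rng.standard_normal((d, d, d))
T = (T + T.transpose(1, 0, 2) + T.transpose(2, 1, 0) + T.transpose(0, 2, 1)
       + T.transpose(1, 2, 0) + T.transpose(2, 0, 1)) / 6  # fully symmetric
sigma = 3.0

# Model value and its gradient (for a fully symmetric T, the gradient of
# (1/6) T[s,s,s] is (1/2) T[s,s]).
m = lambda s: (g @ s + 0.5 * s @ B @ s
               + np.einsum('ijk,i,j,k', T, s, s, s) / 6
               + sigma / 4 * np.linalg.norm(s)**4)
grad_m = lambda s: (g + B @ s + 0.5 * np.einsum('ijk,j,k', T, s, s)
                    + sigma * np.linalg.norm(s)**2 * s)

# Finite-difference check of the analytic gradient.
s, eps = rng.standard_normal(d), 1e-6
fd = np.array([(m(s + eps * e) - m(s - eps * e)) / (2 * eps) for e in np.eye(d)])
assert np.allclose(fd, grad_m(s), atol=1e-5)
```

Note that the gradient only requires the tensor-vector product T[s,s] (here an einsum on a stored tensor; in the finite-sum setting it is assembled from per-sample directional derivatives as discussed above).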

4 Analysis

4.1 Worst-case complexity

In the following, we provide a proof of convergence of STM to a second-order critical point, i.e. a point x such that

‖∇f(x)‖ ≤ ϵ_1   and   λ_min(∇²f(x)) ≥ −ϵ_2.   (17)

The high-level idea of the analysis is to first show that the model decreases proportionally to the criticality measures at each iteration, and then to relate the model decrease to the function decrease. Since the function f is lower bounded, it can only decrease a finite number of times, which therefore implies convergence. We start with a bound on the model decrease in terms of the step length ‖s_k‖.

Lemma 2.

For any k, the step s_k (satisfying Cond. 2) is such that

(18)

In order to complete our claim of model decrease, we prove that the length of the step cannot be arbitrarily small compared to the gradient and the Hessian of the objective function.

Lemma 3.

Suppose that Condition 1 holds with an appropriate choice of its constants. For any k, the length of the step s_k (satisfying Cond. 2) is such that

(19)

where .

Lemma 4.

Suppose that Condition 1 holds with an appropriate choice of its constants. For any k, the length of the step s_k (satisfying Cond. 2) is such that

(20)

where .

A key step in deriving a worst-case complexity bound on the total number of iterations required to reach a second-order critical point is to bound the number of unsuccessful iterations as a function of the number of successful ones that have occurred up to some iteration T.

Lemma 5.

The steps produced by Algorithm 1 guarantee that if σ_k ≤ σ_max for all k ≤ T, then |U_T| ≤ (|S_T| + 1) log(σ_max/σ_0)/log γ, where S_T and U_T denote the index sets of successful and unsuccessful iterations up to iteration T, respectively.

The proof of this lemma can be found in [16] (Theorem 2.1) and is also restated in the Appendix. A closed-form expression for σ_max is provided in Lemma 16 in the Appendix.

We are now ready to state the main result of this section that provides a bound on the number of iterations required to reach a second-order critical point.

Theorem 6 (Outer worst-case complexity).

Let f_low be a lower bound on f and assume Conditions 1 and 2 hold. Then, given ϵ_1, ϵ_2 > 0, Algorithm 1 needs at most

O(max(ϵ_1^(-4/3), ϵ_2^(-2)))

successful iterations and

(21)

total outer iterations to reach an iterate x such that both ‖∇f(x)‖ ≤ ϵ_1 and λ_min(∇²f(x)) ≥ −ϵ_2.

As in [18], we obtain a worst-case complexity bound of the order O(max(ϵ_1^(-4/3), ϵ_2^(-2))), except that we do not require the exact computation of the function derivatives but instead rely on approximate sampled quantities. By using third-order derivatives, STM also obtains a faster rate (in terms of outer iterations) than the one achieved by a sampled variant of cubic regularization [34] (at most O(max(ϵ_1^(-3/2), ϵ_2^(-3))) iterations to reach an ϵ_1-approximate first-order point with approximately nonnegative curvature). The worst-case complexity bound stated in Theorem 6 does not take into account the complexity of the subsolver used to find the (approximate) solution of the model minimization subproblem. This is discussed in detail in Section 4.3.

4.2 Sampling conditions

We now show that one can use random sampling without replacement and choose the size of the sample sets in order to satisfy Cond. 1 with high probability. The complete proof is available in Appendix B.1. The case of sampling with replacement is also discussed in Appendix B.2.

First, we need to develop a new concentration bound for tensors based on the spectral norm. Existing tensor concentration bounds are not applicable to our setting. Indeed, [36] relies on a different norm which cannot be translated to the spectral norm required in our analysis, while the bound derived in [49] relies on a specific form of the input tensor.

Formally, let (Ω, F, P) be a probability space and let X be a real random tensor, i.e. a measurable map from Ω to the space of real tensors of a given order and dimension.

Our goal is to derive a concentration bound for a sum of identically distributed tensors sampled without replacement, i.e. we consider sums of the form

X_1 + ⋯ + X_k,

where each tensor X_i is sampled without replacement from a population of size n.

The concentration result derived in the next theorem is based on the proof technique introduced in [45], which we adapt to sums of tensor-valued random variables.

Theorem 7 (Tensor Hoeffding-Serfling Inequality).

Let X_1, …, X_k be tensors sampled without replacement from a finite population of size n with mean X̄, and assume that each tensor satisfies ‖X_i‖ ≤ b. Then, for any ϵ > 0, the probability that ‖(1/k) ∑_{i=1}^k X_i − X̄‖ ≥ ϵ is bounded by a term that decays exponentially in kϵ²/b², improved by the without-replacement factor (1 − (k − 1)/n).
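The qualitative content of the bound is easy to check numerically. The sketch below (scalars, i.e. order-1 "tensors", purely for illustration) compares the deviation of a sub-sampled mean with and without replacement; without replacement the deviation shrinks with the factor (1 − (k − 1)/n) and vanishes exactly once the whole population is used:

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials = 200, 2000
pop = rng.standard_normal(N)   # finite population of scalars (illustrative)
mu = pop.mean()

# Empirical mean absolute deviation of a k-sample mean from the population
# mean, sampling with vs without replacement.
def dev(k, replace):
    draws = np.array([pop[rng.choice(N, k, replace=replace)].mean()
                      for _ in range(trials)])
    return np.abs(draws - mu).mean()

k = 180  # close to exhausting the population
assert dev(k, replace=False) < dev(k, replace=True)  # Serfling factor helps
assert dev(N, replace=False) < 1e-12                 # k = N: exact mean
```

With replacement, the deviation keeps decaying only at the usual 1/√k rate; without replacement it is driven to zero as k approaches n, which is exactly what the factor (1 − (k − 1)/n) captures.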

Based on Theorem 7, as well as standard concentration bounds for vectors and matrices (see e.g. [48]), we prove that the sampling conditions of Cond. 1 hold for the specific sample sizes given in the next lemma.

Lemma 8.

Consider the sub-sampled gradient, Hessian and third-order tensor defined in Eq. (5). Under Assumption 1, the sampling conditions in Eqs. (6), (7) and (8) are satisfied with high probability for the following choice of the size of the sample sets S_g, S_B and S_T:

where the stated sample sizes hide poly-logarithmic factors and a polynomial dependency on the dimension d.

4.3 Complexity subproblem

First-order guarantees

We first discuss the scenario where we only require an ϵ_1-approximate first-order critical point. In the next theorem, we state the total complexity of Algorithm 1, including the sampling complexities given in Lemma 8 as well as the complexity of the subsolver.

Theorem 9 (Total worst-case complexity).

Let f_low be a lower bound on f. Denote by T_out the number of outer iterations defined in Eq. (21) and by W_sub the complexity of the model subsolver, both specialized to the case of first-order criticality. Assume Condition 1 holds. Then, given ϵ_1 > 0, Algorithm 1 needs at most

(stochastic) oracle calls to reach an iterate x such that ‖∇f(x)‖ ≤ ϵ_1.

We now derive a corollary for the case where the desired accuracy is high, i.e. ϵ_1 is small. We suppose the subsolver is a non-convex version of AGD whose worst-case complexity under Assumption 1 is proven to be Õ(ϵ^(-7/4)) in [13].

Corollary 10 (Total worst-case complexity – first-order stationarity).

Under the same assumptions as in Theorem 9, Algorithm 1 with the non-convex AGD variant presented in [13] as subsolver needs at most the resulting number of (stochastic) oracle calls to reach an iterate x such that ‖∇f(x)‖ ≤ ϵ_1.

The final total complexity in terms of ϵ_1 is an improvement over state-of-the-art methods such as NEON + SCSG [52], SCR [46] and Natasha2 [3]. We expect that the dependency on the number of datapoints could be further reduced using the variance reduction technique introduced in [50]. Furthermore, we want to point out that the use of a specialized subproblem solver instead of AGD – as was done in [11] for the cubic model – could significantly improve the rate. For comparison, the cubic solver from [11] achieves a better dependency on the accuracy than generic AGD.³

³ [11] reports a rate in terms of function suboptimality, which we here naively convert using smoothness.

Second-order guarantees

Unlike the first-order guarantees, the condition imposed on the subproblem solver now requires finding a second-order stationary point. A common solver used for trust-region methods and ARC is the Lanczos method, which builds up evolving Krylov spaces and can be implemented in a Hessian-free manner, i.e. by accessing the Hessian only indirectly via matrix-vector products. We are however not aware of any existing work analyzing the worst-case complexity of (a generalization of) this approach for higher-order polynomials. Instead, we suggest using the solver presented in [15], which is an accelerated gradient method for non-convex optimization that finds a point satisfying approximate second-order optimality conditions.
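A minimal sketch of the Hessian-free Krylov idea mentioned above (standard Lanczos with full reorthogonalization on a small synthetic symmetric matrix, accessed only through matrix-vector products; illustrative, not the solver analyzed in this paper):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 12
H = rng.standard_normal((d, d)); H = (H + H.T) / 2  # stand-in "Hessian"
hvp = lambda v: H @ v   # in practice: a Hessian-free matrix-vector product

# Lanczos: build an orthonormal Krylov basis using only hvp calls; the
# eigenvalues of the small tridiagonal matrix approximate extreme ones of H.
m = d
Q = np.zeros((d, m + 1))
alpha, beta = np.zeros(m), np.zeros(m + 1)
q0 = rng.standard_normal(d); Q[:, 0] = q0 / np.linalg.norm(q0)
for j in range(m):
    w = hvp(Q[:, j]) - beta[j] * Q[:, j - 1]
    alpha[j] = w @ Q[:, j]
    w -= alpha[j] * Q[:, j]
    w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)   # full reorthogonalization
    beta[j + 1] = np.linalg.norm(w)
    if beta[j + 1] < 1e-10 or j == m - 1:      # Krylov space exhausted
        m_eff = j + 1
        break
    Q[:, j + 1] = w / beta[j + 1]

Ttri = (np.diag(alpha[:m_eff]) + np.diag(beta[1:m_eff], 1)
        + np.diag(beta[1:m_eff], -1))
lam_est = np.linalg.eigvalsh(Ttri)[0]   # estimate of lambda_min(H)
lam_true = np.linalg.eigvalsh(H)[0]
assert abs(lam_est - lam_true) < 1e-8
```

In practice far fewer than d Lanczos steps already give accurate extreme eigenvalues, which is what makes Hessian-free (and, prospectively, tensor-free) subsolvers attractive.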

Corollary 11 (Total worst-case complexity – second-order stationarity).

Under the same assumptions as in Theorem 9, Algorithm 1 with the non-convex AGD variant presented in [15] as subsolver needs at most the corresponding number of (stochastic) oracle calls to reach an iterate x such that ‖∇f(x)‖ ≤ ϵ_1 and λ_min(∇²f(x)) ≥ −ϵ_2.

As for the first-order guarantees, an interesting direction to improve the total complexity would be a subproblem solver specialized to the specific characteristics of the fourth-order model.

5 Conclusion

We presented a stochastic fourth-order optimization algorithm that preserves the guarantees of its deterministic counterpart for finding an approximate second-order critical point. There are numerous extensions that could follow from this work, including the following.

Algorithmic improvements.

Prior work such as [3, 52] has relied on using variance reduction to achieve faster rates. A variance-reduced variant of cubic regularization has also been shown in [50] to reduce the per-iteration sample complexity, and one would therefore expect that similar improvements can be made to the quartic model. Finally, one could incorporate acceleration as in [37].

Model solver.

Perhaps the most promising direction for future work is to design efficient algorithms to solve high-order polynomial models. While there has been some significant work published for cubic models [11, 12], the problem has been relatively unexplored for higher-order models.

Finally, one relevant application for the type of high-order algorithms we developed is training deep neural networks as in [46, 1]. An interesting direction for future research would therefore be to design a practical implementation of STM for training neural networks based on efficient tensor-vector products similarly to the fast Hessian-vector products proposed in [41].

Acknowledgements

The authors would like to thank Coralia Cartis for helpful discussions on an early draft of this paper, as well as for pointing out additional relevant work.

References

Appendix A Main analysis

In the following section, we provide detailed proofs for all the lemmas and theorems presented in the main paper. First, we present some basic results regarding the surrogate model that will be handy throughout the proofs.

Surrogate model

Recall that we use the following surrogate model:

m_k(s) = f(x_k) + g_kᵀs + (1/2) sᵀB_k s + (1/6) T_k[s]³ + (σ_k/4) ‖s‖⁴.   (22)

The first derivative of the model w.r.t. s is

∇_s m_k(s) = g_k + B_k s + (1/2) T_k[s]² + σ_k ‖s‖² s.   (23)

For the second-order derivative ∇²_s m_k(s), we get:

∇²_s m_k(s) = B_k + T_k[s] + σ_k (‖s‖² I + 2ssᵀ),   (24)

where T_k[s] denotes the matrix with entries

(T_k[s])_{ij} = ∑_l (T_k)_{ijl} s_l.   (25)

We now start with a bound on the model decrease in terms of the step length .

Lemma 12 (Lemma 4.1 restated).

For any k, the step s_k (satisfying Cond. 2) is such that

(26)
Proof.

Note that m_k(0) = f(x_k). Using the optimality conditions introduced in Condition 2, we get

(27)

which directly implies the desired result.

In order to complete our claim of model decrease started in Lemma 2, we prove that the length of the step cannot be arbitrarily small compared to the gradient of the objective function.

Lemma 13 (Lemma 4.2 restated).

Suppose that Condition 1 holds with an appropriate choice of its constants. For any k, the length of the step s_k (satisfying Cond. 2) is such that

(28)

where .

Proof.

We start with the following bound,

(29)

We bound the second term in the RHS of Eq. (29) using the termination condition presented in Eq. (15). We get

(30)

For the first term on the RHS of Eq. (29), we have

(31)

where the last inequality uses Lemma 32 and Condition 1.

Next, we apply Young's inequality for products, which states that for a, b ≥ 0 and p, q > 1 such that 1/p + 1/q = 1,

ab ≤ a^p/p + b^q/q.   (32)

We obtain

(33)

therefore

(34)

Choosing the constants appropriately,

(35)

We now prove that the length of the step cannot be arbitrarily small compared to the second criticality measure. We first need an additional lemma that relates the step length to this measure. Proving such a result requires the following auxiliary lemma.

Lemma 14.

Suppose that Condition 1 holds. Then, for all s,

(36)
Proof.

Recall that

(37)

therefore

(38)

where the last inequality uses Lemma 32 and Condition 1.

We again apply Young's inequality for products, stated in Eq. (32), which yields

(39)

Lemma 15 (Lemma 4.3 restated).

Suppose that Condition 1 holds with an appropriate choice of its constants. For any k, the length of the step s_k (satisfying Cond. 2) is such that

(40)

where .

Proof.

Using the definition of the model in Eq. (4), we find that

(41)

Considering each term in turn, and using Lemma 14, we see that