Deep neural nets are among the most powerful tools in supervised learning, having shown outstanding performance in various application areas. Unfortunately, training such nets remains a time-consuming task. It is thus of primary importance to design new optimization algorithms that allow one to perform this training more efficiently.
Stochastic gradient descent (SGD) is one of the most popular algorithms for neural network training. However, being a first-order optimization scheme, SGD presents a number of pitfalls due to the nonconvex nature of the problem at hand. First, a proper learning rate can be difficult to select, causing SGD to slow down, or even diverge if the stepsize is chosen too large. Additionally, the same learning rate applies to all weight updates, which may be suboptimal in a deep net because of vanishing/exploding gradient problems: it is well known that, in many cases, the average norm of the gradient decays for earlier layers. Lastly, the algorithm can be trapped in one of the plateaus where the gradient magnitude is low, which slows down the learning process.
Numerous variants of SGD have been developed to circumvent the aforementioned issues. Several of them are grounded on well-known accelerated first-order schemes, such as momentum and Nesterov accelerated gradient, whereas others revolve around adaptive learning rate strategies, such as Adagrad, Adadelta, RMSProp, and Adam, the latter being one of the fastest first-order schemes. It was also shown that deep learning is possible with first-order methods, given a suitable initialization and a proper schedule for the momentum.
Recently, renewed attention has been paid to second-order optimization schemes, because of their ability to reach lower values of the error function compared with first-order methods, in particular for deep autoencoders and recurrent neural nets. Martens proposed a Hessian-free approach based on conjugate gradient descent for minimizing a local second-order approximation of the error function restricted to a data minibatch, resorting to damping to avoid overly large steps, coupled with a Levenberg-Marquardt style heuristic to update the damping parameter. The author successfully applied this method to deep autoencoder and recurrent net training. Vinyals and Povey proposed to optimize the objective function within the Krylov subspace delineated by the previous iterate, the gradient, and products of powers of the Hessian with the gradient. Typically, the chosen dimensionality of the subspace remains moderate. The resulting quadratic function in this low-dimensional space is minimized using iterations of BFGS. The authors reported significant speed-ups compared with Hessian-free optimization. In contrast with the two previous methods, Dauphin et al. proposed a saddle-free Newton approach that uses the exact Hessian matrix (instead of a Gauss-Newton approximation) within a Krylov subspace of moderate dimensionality. The authors show that the Hessian matrix in this subspace is usually not positive definite, but it suffices to replace the negative eigenvalues with their absolute values in order to make this Newton-like method saddle-point repellent. The authors reported some improvements in the context of autoencoder training.
In this work, we propose a new stochastic second-order algorithm for neural network training with the following features:
-  it exploits the second-order information in order to move in directions of low curvature;
-  it uses as many learning rates as network layers, so as to update different blocks of weights at different speeds;
-  it relies on an automatic procedure to optimally adjust the learning rates at each iteration.
In Section IV, we explain how to estimate the learning rates within the trust region. In Section V, we put all these techniques together and detail the resulting algorithm. In Section VI, we show the numerical results obtained with our approach. Finally, conclusions are drawn in Section VII.
II General Idea
Training a neural network amounts to finding a weight vector $w \in \mathbb{R}^N$ that minimizes a global error function $E$, so yielding the optimization problem:
$$\underset{w \in \mathbb{R}^N}{\operatorname{minimize}}\; E(w).$$
The error function $E$ is a sum of many nonconvex twice-differentiable terms, one for each input-output pair available in the training set. In a stochastic setting, the data are decomposed into minibatches. We denote by $E_j$ the error function evaluated over the $j$-th minibatch, which can be viewed as a stochastic approximation of $E$ as minibatches are randomly selected throughout the optimization process. In the following, we denote the batch (resp. minibatch) gradients by
$$g = \nabla E(w) \quad (\text{resp. } g_j = \nabla E_j(w)),$$
and the batch (resp. minibatch) Hessians by
$$H = \nabla^2 E(w) \quad (\text{resp. } H_j = \nabla^2 E_j(w)).$$
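As an aside, the Hessian-vector products used throughout can be formed without ever building $H$ explicitly. Our algorithm relies on the exact $\mathcal{R}$-operator technique discussed later; the sketch below instead uses a central finite difference of the gradient, purely as an illustration (the function names are ours):

```python
import numpy as np

# Illustrative check: for a quadratic error E(w) = 0.5 w^T A w, the gradient
# is A w, so every Hessian-vector product must equal A v.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
grad = lambda w: A @ w

def hvp_fd(grad, w, v, eps=1e-5):
    """Approximate the Hessian-vector product H v by a central finite
    difference of the gradient: (g(w + eps*v) - g(w - eps*v)) / (2 eps)."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2.0 * eps)

w = np.array([0.3, -0.7])
v = np.array([1.0, 2.0])
print(hvp_fd(grad, w, v))   # ≈ A @ v = [4., 7.]
```

For a quadratic function the central difference is exact up to rounding; for a general error function the $\mathcal{R}$-operator avoids the truncation error of this approximation.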
Consider the $K$-dimensional subspace
$$\mathcal{D} = \operatorname{span}(d_1, \dots, d_K)$$
spanned by some orthonormal vectors $d_1, \dots, d_K$ in $\mathbb{R}^N$, and let $D = [d_1 \cdots d_K] \in \mathbb{R}^{N \times K}$. Our proposal consists of updating the weight vector according to the following rule:
$$w^{+} = w + D\theta,$$
where $\theta = (\theta_1, \dots, \theta_K) \in \mathbb{R}^K$ is a vector of learning rates.
A local quadratic Taylor expansion of the error function around the current point $w$ reads:
$$E(w + \delta w) \approx E(w) + g^\top \delta w + \tfrac{1}{2}\, \delta w^\top H\, \delta w.$$
By substituting $\delta w = D\theta$, we get
$$q(\theta) = E(w) + \tilde{g}^\top \theta + \tfrac{1}{2}\, \theta^\top \tilde{H}\, \theta,$$
by introducing $\tilde{g} = D^\top g$ and $\tilde{H} = D^\top H D$.
Note that although $q$ is a quadratic function of $\theta$, the curvature matrix $\tilde{H}$ is not necessarily positive definite when $E$ is nonconvex.
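For illustration, the reduced quantities $\tilde{g} = D^\top g$ and $\tilde{H} = D^\top H D$ can be assembled from $K$ Hessian-vector products only. The following sketch (the naming is ours; `hess_vec` stands for any Hessian-vector oracle) builds the model $q$ up to the constant $E(w)$:

```python
import numpy as np

def reduced_model(g, hess_vec, D):
    """Form the subspace model q(theta) = g~^T theta + 0.5 theta^T H~ theta
    (up to the constant E(w)), with g~ = D^T g and H~ = D^T H D, using only
    K Hessian-vector products."""
    g_t = D.T @ g
    HD = np.column_stack([hess_vec(D[:, i]) for i in range(D.shape[1])])
    H_t = D.T @ HD
    H_t = 0.5 * (H_t + H_t.T)   # symmetrize against numerical noise
    q = lambda theta: g_t @ theta + 0.5 * theta @ H_t @ theta
    return g_t, H_t, q

# Toy example: explicit Hessian, subspace = first two canonical directions.
H = np.diag([1.0, 2.0, 3.0])
g = np.ones(3)
D = np.eye(3)[:, :2]            # orthonormal columns
g_t, H_t, q = reduced_model(g, lambda v: H @ v, D)
print(q(np.array([1.0, 0.0]))) # 1 + 0.5 = 1.5
```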
The classical trust region method consists of minimizing a quadratic approximation to the cost function within a ball around the current point $w$, defined as
$$\{ w + \delta w \;:\; \|\delta w\| \le \Delta \},$$
for some radius $\Delta > 0$.
In the proposed subspace approach, the trust region corresponds to a Euclidean ball for the coefficients $\theta$, defined as
$$\{\theta \in \mathbb{R}^K \;:\; \|\theta\| \le \Delta\}.$$
Since the vectors $d_1, \dots, d_K$ are orthonormal, $\|D\theta\| = \|\theta\|$, so the two definitions are consistent.
Then, the main step of our approach is the minimization of the quadratic function inside the trust region, namely
$$\underset{\|\theta\| \le \Delta}{\operatorname{minimize}}\; q(\theta).$$
The trust region bound $\Delta$ is then determined with backtracking and linesearch, with the aim of maximizing the decay of the error function.
Unfortunately, preliminary experiments suggested to us that such a classical approach results in a relatively slow minimization process. In our view, a possible explanation is as follows. The location of the minimizer of $q$ inside the trust region is mostly determined by the negative curvature directions of the quadratic form (along these directions, $q$ decreases most rapidly). Negative curvature directions are however not very reliable, because the objective function is usually bounded from below. In practice, this yields very small steps: since the trust region size is chosen so as to decrease the function (backtracking), the resulting norm of the update is small. One more observation confirming this fact is that, when $\tilde{H}$ becomes positive definite, the decrease of the function is an order of magnitude larger than when negative curvature is present.
As suggested in previous work, a possible solution could be to ignore all negative curvature directions. But in that case, the algorithm will hardly escape from saddle points. We experimented with this strategy, and it already showed better results than the classical trust region approach, especially for deep nets.
In this work, we propose instead a two-stage approach that combines the above strategies. At the first stage, we ignore negative curvature directions and address the trust region problem only in the subspace generated by the positive curvature eigenvectors. With backtracking, we find a trust region size that allows us to decrease the function, and we move to the point just found. At the new point, we re-compute the gradient and make a gradient descent step. The stepsize is determined with linesearch and backtracking.
Thanks to the gradient descent step in the second stage, the proposed algorithm possesses the capability to move away from saddle points. Moreover, it should make fast progress, because of the large steps performed at the first stage in the subspace of positive curvature eigenvectors.
III Choice of the Subspace
Although previous works [12, 15] employ subspaces of relatively high dimensions (from 20 to 500), our experiments show that a much lower dimensionality is beneficial in terms of error decrease versus time. Indeed, each additional vector in the subspace requires the evaluation of a different Hessian-vector product at each iteration, which is obviously time consuming.
Consider the minimalistic subspace generated by two vectors, namely the gradient and the previous iterate. (For simplicity, the iteration index will be omitted in the following.) For logistic regression problems, this subspace is enough to achieve relatively fast convergence, but for neural nets the situation is different. A possible explanation may be related to the vanishing/exploding gradient problem. Thus, we would like to allow distinct learning rates for weights from different layers.
Consider a feed-forward neural net with $L$ layers and some basis vector $d_i$ of the $K$-dimensional subspace $\mathcal{D}$. We can block-decompose $d_i$ and the weight vector $w$ as follows
$$d_i = \big(d_i^{(1)}, \dots, d_i^{(L)}\big), \qquad w = \big(w^{(1)}, \dots, w^{(L)}\big),$$
where blocks $d_i^{(\ell)}$ and $w^{(\ell)}$ correspond to layer $\ell$ of connections of the neural net. Then, the separation of learning rates for each layer and vector results in the updates
$$w^{(\ell),+} = w^{(\ell)} + \sum_{i=1}^{K} \theta_i^{(\ell)} d_i^{(\ell)}, \qquad \ell = 1, \dots, L,$$
which is equivalent to considering a larger subspace of dimension $KL$ generated by the block-sparse vectors $\big(0, \dots, 0, d_i^{(\ell)}, 0, \dots, 0\big)$, for $i = 1, \dots, K$ and $\ell = 1, \dots, L$.
The resulting update scheme is similar to SGD with momentum, but with separate learning rates for each layer, chosen automatically at each iteration.
Note that our algorithm requires the vectors forming the subspace to be orthonormal. Since these vectors are nonzero only in one block, the task of orthonormalization splits into separate lower-dimensional subtasks. A similar argument applies to the efficient computation of the Hessian-vector products: the vectors to be multiplied are nonzero only in one block, so we can avoid redundant calculations by carefully extending the popular $\mathcal{R}$-operator technique to the sparse case.
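A minimal sketch of this blockwise orthonormalization, under the assumption that each layer's pieces of the $K$ base directions are available as one small matrix per block (names and shapes are ours):

```python
import numpy as np

def block_orthonormalize(blocks):
    """Given, for each layer, a matrix whose columns are that layer's pieces
    of the K base directions, orthonormalize each block independently (QR)
    and embed the results as block-sparse vectors of the full space. Vectors
    from different blocks have disjoint supports, hence are orthogonal."""
    sizes = [B.shape[0] for B in blocks]
    offsets = np.concatenate([[0], np.cumsum(sizes)])
    N = offsets[-1]
    cols = []
    for l, B in enumerate(blocks):
        Q, _ = np.linalg.qr(B)            # orthonormal columns within a block
        for j in range(Q.shape[1]):
            v = np.zeros(N)
            v[offsets[l]:offsets[l + 1]] = Q[:, j]
            cols.append(v)
    return np.column_stack(cols)

# Two layers (block sizes 4 and 3), K = 2 base directions -> 4 sparse vectors.
rng = np.random.default_rng(0)
D = block_orthonormalize([rng.standard_normal((4, 2)),
                          rng.standard_normal((3, 2))])
print(np.allclose(D.T @ D, np.eye(4)))   # globally orthonormal
```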
IV Minimization within the trust region
The problem of finding the minimizer of a quadratic function inside a Euclidean ball has been well investigated. Here, we modify the classical algorithm in order to apply our two-stage approach. Let us recall that the quadratic function is expressed as
$$q(\theta) = E(w) + \tilde{g}^\top \theta + \tfrac{1}{2}\, \theta^\top \tilde{H}\, \theta,$$
with $\tilde{g}$ and $\tilde{H}$ as in (2). Suppose that the eigenvalues of $\tilde{H}$ are $\lambda_1 \le \dots \le \lambda_K$, and that the corresponding eigenvectors are denoted by $v_1, \dots, v_K$. Assume that the first positive eigenvalue in the list is $\lambda_p$ (we assume that there exists at least one positive eigenvalue, otherwise the first stage is not performed at all). Let us define the vector $b$ with components $b_i = v_i^\top \tilde{g}$, $i = 1, \dots, K$. When the trust region is given by $\|\theta\| \le \Delta$, we need to find $\lambda \ge 0$ such that the matrix $\tilde{H} + \lambda I$ is positive definite, and $\|(\tilde{H} + \lambda I)^{-1} \tilde{g}\| = \Delta$. Then, the minimal value inside the trust region is reached at $\theta^\star = -(\tilde{H} + \lambda I)^{-1} \tilde{g}$. The matrix $(\tilde{H} + \lambda I)^{-1}$ can be represented as:
$$(\tilde{H} + \lambda I)^{-1} = \sum_{i=1}^{K} \frac{v_i v_i^\top}{\lambda_i + \lambda}.$$
When we need to compute the minimum in the subspace spanned by the positive curvature eigenvectors only, we just restrict the sum to $i = p, \dots, K$.
Then, in order to find $\lambda$, we solve the nonlinear equation:
$$\phi(\lambda) = \sum_{i=p}^{K} \frac{b_i^2}{(\lambda_i + \lambda)^2} = \Delta^2,$$
subject to $\lambda \ge 0$. Since $\phi$ is monotonically decreasing and convex, we resort to the Newton method. It is important to initialize it at a point $\lambda^{(0)}$ such that $\phi(\lambda^{(0)}) \ge \Delta^2$. In this way, the sequence $(\lambda^{(t)})$ will be monotonically increasing and will not jump out of the region of interest.
One more point to pay attention to is that this algorithm (for the positive part) is applicable when the global minimizer of the quadratic function, given by
$$\hat{\theta} = -\sum_{i=p}^{K} \frac{b_i}{\lambda_i}\, v_i,$$
lies outside the trust region under consideration. For our algorithm this is always the case (see the next section for details), so we can initialize the Newton iterations with $\lambda^{(0)} = 0$.
When the smallest eigenvalue is negative and $b_1 = 0$, the initial problem is more difficult, as the solution is not guaranteed to be unique. Fortunately, Nesterov and Polyak suggested a simple way to avoid the difficulties arising in this case: choose an index $i$ with $b_i = 0$ (in practice, the one corresponding to the eigenvalue of maximum absolute value) and make the assignment $b_i \leftarrow \varepsilon$ for a small $\varepsilon > 0$. It can be proven that, as $\varepsilon \to 0$, the minimum point of this perturbed problem converges to a minimum point of the initial one.
When the value $\lambda$ is found, the minimizer (over the positive curvature directions) in the trust region is given by
$$\theta^\star = -\sum_{i=p}^{K} \frac{b_i}{\lambda_i + \lambda}\, v_i.$$
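Putting the pieces of this section together, a compact sketch of the first-stage solver (our own naming): it restricts the problem to the positive-curvature eigenvectors and runs the Newton iteration on $\phi$ starting from $\lambda^{(0)} = 0$, as described above.

```python
import numpy as np

def tr_step_positive(H_t, g_t, delta, tol=1e-12, max_iter=100):
    """Minimize g~^T theta + 0.5 theta^T H~ theta over ||theta|| <= delta,
    restricted to positive-curvature eigenvectors, by solving
    phi(lam) = sum_i b_i^2 / (lam_i + lam)^2 = delta^2 with Newton's method."""
    lam, V = np.linalg.eigh(H_t)          # eigenvalues in increasing order
    pos = lam > 0
    lam_p, b_p, V_p = lam[pos], V[:, pos].T @ g_t, V[:, pos]
    theta = -V_p @ (b_p / lam_p)          # unconstrained minimizer
    if np.linalg.norm(theta) <= delta:
        return theta
    s = 0.0                               # phi(0) > delta^2 here, so the
    for _ in range(max_iter):             # Newton iterates increase monotonically
        d = lam_p + s
        phi = np.sum(b_p**2 / d**2) - delta**2
        if abs(phi) < tol:
            break
        s -= phi / (-2.0 * np.sum(b_p**2 / d**3))
    return -V_p @ (b_p / (lam_p + s))

H_t = np.diag([2.0, 1.0, -1.0])           # one negative-curvature direction
g_t = np.ones(3)
theta = tr_step_positive(H_t, g_t, 0.1)
print(np.linalg.norm(theta))              # ≈ 0.1 (solution on the boundary)
```

Note that the returned step has no component along the negative-curvature eigenvector, as required by the first stage.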
V Details of the main algorithm
Algorithm 1 describes the general procedure in more detail. It starts by randomly selecting a minibatch and dividing it into roughly equal sub-minibatches. Minibatches and sub-minibatches contain an equal number of training elements from each class. The gradient is computed using the whole minibatch, while the sub-minibatches are used for the Hessian-vector products. In this way, the calculation of all Hessian-vector products requires only about twice the time needed to compute the gradient. Note that a similar idea was implemented in the Hessian-free optimizer. Then, the algorithm performs the two-stage procedure explained earlier.
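A possible sketch of the class-balanced splitting step (the function name and seed handling are ours):

```python
import numpy as np

def split_balanced(labels, n_sub, seed=0):
    """Split the indices of a minibatch into n_sub sub-minibatches containing
    a (roughly) equal number of training elements from each class."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_sub)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        for k, chunk in enumerate(np.array_split(idx, n_sub)):
            parts[k].extend(chunk.tolist())
    return [np.sort(p) for p in parts]

labels = np.repeat([0, 1], 8)             # 8 samples of each of 2 classes
parts = split_balanced(labels, n_sub=4)
print([len(p) for p in parts])            # [4, 4, 4, 4], 2 per class each
```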
We found that backtracking and linesearch greatly increase the convergence speed. In both stages, after the first guess of the trust region size, we perform tentative updates of the weights and compute the actual value of the minibatch function at the new point. When the value does not decrease, we start shrinking the trust region size by a constant factor. This shrinking process is finite because the gradient of the minibatch function is calculated exactly, and it can be shown that all updates produced by the trust region algorithm make an obtuse angle with the gradient. After a decrease of the function is obtained, we continue reducing the trust region by the same factor as long as this further improves the function decay.
In the second stage, we start from the point obtained in the first stage and use the previous gradient descent step length as the initial guess. Moreover, once a step length yielding a decrease of the function is found, we also try increasing it by a constant factor, and stop when the resulting decrease becomes smaller than the one obtained at the previously tested size.
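The backtracking/linesearch logic of both stages can be sketched as follows. This is a simplification of the actual procedure: the true algorithm re-solves the trust region subproblem for each tested radius, whereas here a fixed direction is merely rescaled.

```python
def backtrack_radius(f, w, direction, delta0, shrink=0.5, max_tries=50):
    """Shrink the step size until the function decreases, then keep shrinking
    as long as the decrease improves (a greedy linesearch on the radius).
    Simplification: the step is scaled linearly instead of re-solving the
    trust region subproblem at each radius."""
    f0, delta = f(w), delta0
    for _ in range(max_tries):            # phase 1: obtain any decrease
        if f(w + delta * direction) < f0:
            break
        delta *= shrink
    best = f(w + delta * direction)
    while True:                           # phase 2: shrink while decay improves
        cand = f(w + shrink * delta * direction)
        if cand >= best:
            break
        delta, best = shrink * delta, cand
    return delta

f = lambda x: x * x                       # toy 1-D objective
print(backtrack_radius(f, 1.0, -4.0, 1.0))  # 0.25: w + 0.25*(-4) = 0 minimizes f
```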
VI Experimental results
In order to assess the performance of our algorithm, we considered the training of fully-connected multilayer neural networks with MNIST, a benchmark dataset of handwritten digits. First, we performed some experiments to show the validity of the proposed two-stage trust region procedure. Secondly, we compared our approach with two popular methods: Adam  and RMSProp .
VI-A Two-stage trust region assessment
In our first experiments, we investigate how to handle negative eigenvalues of the matrix $\tilde{H}$, by testing the following approaches:
Trust region - A minimizer of $q$ is found subject to $\|\theta\| \le \Delta$. Backtracking and linesearch are used to determine the optimal value of $\Delta$.
Only positive - The coefficients $\theta$ are chosen from the subspace generated by the eigenvectors of $\tilde{H}$ corresponding to positive eigenvalues. Negative eigenvalues are ignored.
Saddle free - Negative eigenvalues of $\tilde{H}$ are replaced with their absolute values; then the trust region method is applied.
Positive-negative - A minimum inside the trust region for the positive eigenvector subspace is found. We then move to this new point, recompute the gradient, and consider the subspace generated by the negative eigenvectors. The trust region sizes at both stages are determined with linesearch.
Negative-positive - Same procedure as Positive-negative, except that the order of the two stages is inverted.
Two-stage - The proposed approach: a step in the positive subspace is followed by a gradient descent step.
We tested the above approaches on a 2-layer net with 50 hidden units (784-50-10) and a 3-layer net with a 784-50-50-10 architecture. The same subspace and backtracking/linesearch algorithms were used for all tested methods. We used a softmax output, tanh hidden activations, and the cross-entropy error function. This error function was measured after each epoch on the whole training set. We also used quadratic regularization with a fixed coefficient. The results are shown in Figs. 1 and 2, indicating that the proposed two-stage strategy exhibits the best performance.
VI-B State-of-the-art comparison
We also compared the proposed approach with two popular first-order methods, Adam and RMSProp, on an 8-layer net with a 784-80-70-60-50-40-30-20-10 architecture. We again used tanh units, a softmax output, and the cross-entropy error function. We also used sparse initialization, which prevents saturation of learning at the beginning. Fig. 3 reports the value of the error function versus the elapsed time. This experiment shows that our approach performs much better than first-order methods on such a deep network. This happens because the second-order information exploited by our algorithm, despite requiring more computation per iteration than first-order methods, pays off with larger updates on deep networks.
VII Conclusion
In this work, we have proposed a two-stage trust region subspace approach for training neural networks. According to our preliminary results, our algorithm appears to be faster than first-order methods for deep network training. This was made possible by carefully taking into account the local geometrical structure of the graph of the nonconvex error function within a suitable subspace. This also allowed us to use different learning rates for each network layer, automatically adjusted at each iteration. In the future, we plan to extend our algorithm to other deep architectures.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.
-  Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. on Neural Networks, vol. 5, no. 2, March 1994.
-  L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization methods for large-scale machine learning,” Tech. Rep., 2016, http://arxiv.org/pdf/1606.04838v1.pdf.
-  N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural Networks, vol. 12, no. 1, pp. 145–151, Jan. 1999.
-  Y. Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2),” in Soviet Mathematics Doklady, 1983, vol. 27, pp. 372–376.
-  J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” J. Mach. Learn. Res., vol. 12, pp. 2121–2159, July 2011.
-  M. D. Zeiler, “ADADELTA: An adaptive learning rate method,” Tech. Rep., 2012, https://arxiv.org/abs/1212.5701.
-  T. Tieleman and G. Hinton, “Lecture 6.5 – RMSProp: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Int. Conf. Learn. Representations, Banff, Canada, 14-16 April 2014.
-  I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in Int. Conf. Mach. Learn., Atlanta, GA, 16-21 June 2013, vol. 28, pp. 1139–1147.
-  G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507, 2006.
-  J. Martens, “Deep learning via Hessian-free optimization,” in Int. Conf. Mach. Learn., Haifa, Israel, 21-24 June 2010, pp. 735–742.
-  J. Martens and I. Sutskever, “Learning recurrent neural networks with Hessian-free optimization,” in Int. Conf. Mach. Learn., 2011.
-  O. Vinyals and D. Povey, “Krylov subspace descent for deep learning,” in Int. Conf. Artif. Intell. Statist., La Palma, Canary Islands, 21-23 April 2012, pp. 1261–1268.
-  Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Ann. Conf. Neur. Inform. Proc. Syst., Montreal, Canada, 8-11 Dec. 2014, pp. 2933–2941.
-  J. J. Moré and D. C. Sorensen, “Computing a trust region step,” SIAM J. Sci. Stat. Comp., vol. 4, no. 3, pp. 553–572, 1983.
-  Y. Yuan, “Recent advances in trust region algorithms,” Math. Prog., vol. 151, no. 1, pp. 249–281, 2015.
-  E. Chouzenoux and J.-C. Pesquet, “A stochastic majorize-minimize subspace algorithm for online penalized least squares estimation,” IEEE Trans. Sig. Process., 2017, (to appear).
-  J. B. Erway and P. E. Gill, “A subspace minimization method for the trust-region step,” SIAM J. Optim., vol. 20, no. 3, pp. 1439–1461, 2010.
-  B. A. Pearlmutter, “Fast exact multiplication by the Hessian,” Neural Computation, vol. 6, no. 1, pp. 147–160, Jan. 1994.
-  Y. Nesterov and B.T. Polyak, “Cubic regularization of Newton method and its global performance,” Math. Prog., vol. 108, no. 1, pp. 177–205, 2006.