Can we devise a learning algorithm for general/nonsmooth deep neural networks (DNNs) featuring inertia and Newtonian directional intelligence only by means of a backpropagation oracle?
In optimization jargon: can we use second-order ideas in time and space for nonsmooth nonconvex optimization using only a subgradient oracle?
Before providing answers to this daring question, let us have a glimpse at some of the fundamental optimization algorithms for training deep networks.
The backpropagation algorithm is, to this day, the fundamental building block for training DNNs. It is an instance of the Stochastic Gradient Descent algorithm (SGD, (34)) and as such is powerful, flexible, capable of handling very large problems and noise, and comes with theoretical guarantees of many kinds. We refer to (13; 31) in a convex machine learning context and to (14) for a recent account highlighting the importance of deep learning (DL) applications and their challenges. In the nonconvex setting, the recent works (2; 20) follow the "Ordinary Differential Equations (ODE) approach" introduced in (29) and further developed in (8; 26; 9; 12). SGD is however a raw first-order algorithm that requires manual tuning and whose convergence can be slow on some DL instances.
In the recent literature two lines of improvement have been explored:
- use the local geometry of empirical losses to improve over steepest descent directions,
- use the history of past steps to design clever steps in the present.
The first approach is akin to quasi-Newton methods while the second revolves around Polyak’s inertial method (33). The latter is inspired by the following appealing mechanical thought experiment. Consider a heavy ball evolving on the graph of the loss (the loss "landscape"), subject to gravity and stabilized by some friction effects. Friction generates energy dissipation, so that the particle will eventually reach a steady state, which one hopes to be a local minimum. These two approaches are already present in the DL literature: among the most popular algorithms for training DNNs, ADAGRAD (22) features local geometrical aspects while ADAM (24) combines inertial ideas with stepsizes similar to those of ADAGRAD. Stochastic Newton and quasi-Newton algorithms have been considered by (30; 15; 16) and were recently reported to perform efficiently on several problems (10; 39). The work of (38) demonstrates that carefully tuned SGD and heavy-ball algorithms are competitive with alternative methods.
But deviating from the simplicity of SGD also comes with major challenges because of the size of DL problems and their severe lack of regularity (differentiability is generally absent, and even weaker forms of regularity such as semi-convexity or Clarke regularity are not available). All sorts of practical and theoretical hardships are met: defining and computing Hessians is delicate, inverting them is out of reach at present, first and second order Taylor approximations are useless, and one has to deal with shocks, which are inherent to inertial approaches in a nonsmooth context ("corners/walls" indeed generate velocity discontinuities). This makes the study of ADAGRAD and ADAM in full generality quite difficult. Some recent progress is reported in (7).
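Our approach also blends inertial ideas with Newton’s method. It is inspired by the following dynamical system introduced in (3):
$$\ddot{\theta}(t) + \alpha\,\dot{\theta}(t) + \beta\,\nabla^2 J(\theta(t))\,\dot{\theta}(t) + \nabla J(\theta(t)) = 0, \qquad (1)$$
where
- $t$ is the time parameter, which acts as a continuous epoch counter,
- $J$ is a given loss function (usually the empirical loss in DL applications), while $\nabla J$ and $\nabla^2 J$ denote respectively the gradient of $J$ and its Hessian.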
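Figure 1: Illustration of the dynamics on a 2D nonsmooth nonconvex function: (a)-(c) trajectories for three hyperparameter settings, (d) objective function (in log-scale).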
To adapt this dynamics to DL and overcome the computational difficulties generated by the second-order objects occurring in (1), we combine a phase space lifting method with a small step discretization process typical of the stochastic approach. An important difficulty arises when dealing with networks having nonsmooth activation functions such as ReLU (23). Indeed, subsampled versions of the algorithm, which are absolutely necessary in practice, must be treated with great care since the sum of subdifferentials no longer coincides with the subdifferential of the sum. We address this delicate issue by using new notions of steady states and by providing adequate calculus rules.
The resulting algorithm, called INDIAN, shows great efficiency in practice. For the same computational cost as the other tested methods (including ADAM and ADAGRAD), INDIAN avoids parasitic oscillations, often achieves better training accuracy and shows robustness to hyperparameter settings. A first illustration of the behaviour of the induced dynamics is given in Figure 1 for a simple nonsmooth and nonconvex function in $\mathbb{R}^2$.
Our theoretical results are also strong and simple. Using the Lyapunov analysis from (3), we combine tame nonsmooth results à la Sard and the differential inclusion approximation method (9) to characterize the asymptotics of our algorithm, similarly to (20; 2). This provides a strong theoretical ground to our study since we can prove that our method converges to a connected component of the set of steady states even in the ReLU case, where the optimization problem is nonsmooth. The algorithm is described in detail in Section 2 and its convergence proof is given in Section 3. Section 4 describes experimental results on synthetic and real datasets. Details of proofs and additional experiments can be found in the appendices. Python notebooks that reproduce our experiments are available at https://github.com/camcastera/Code-for-an-Inertial-Newton-algorithm-for-DL/.
2 INDIAN: an Inertial Newton algorithm for Deep Neural Networks
2.1 Neural networks with Lipschitz continuous prediction function and losses
We consider DNNs of a very general type represented by a locally Lipschitz function $f : (x, \theta) \in \mathbb{R}^M \times \mathbb{R}^P \mapsto y \in \mathbb{R}^D$ (e.g., a composition of feed-forward, convolutional, or recurrent networks with ReLU, sigmoid, or tanh activation functions). The variable $\theta \in \mathbb{R}^P$ is the parameter of the model ($P$ can be very large), while $x \in \mathbb{R}^M$ and $y \in \mathbb{R}^D$ represent input and output data. For instance, the vector $x$ may embody an image while $y$ is a label explaining its content. Consider further a dataset of $N$ samples $(x_n, y_n)_{n=1,\dots,N}$. Training amounts to finding a value of the parameter $\theta$ such that, for each input $x_n$ of the dataset, the output $f(x_n, \theta)$ of the model predicts the real value $y_n$ with good accuracy.
To do so, we follow the traditional approach of minimizing an empirical risk loss function
$$J(\theta) = \sum_{n=1}^{N} l(f(x_n, \theta), y_n), \qquad (2)$$
where $l : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ is a locally Lipschitz continuous dissimilarity measure.
2.2 Neural networks and tameness in a nutshell
Tameness refers to a ubiquitous geometrical property of losses/constraints encompassing most finite dimensional optimization problems met in practice. Prominent classes of tame objects are piecewise linear objects (with finitely many pieces) and semi-algebraic objects, but the notion is much more general, as we intend to convey below.
Sets or functions are called tame when they can be described by a finite number of basic formulas/inequalities/Boolean operations involving standard functions such as polynomial, exponential, or max functions. We refer to (5) for illustrations, recipes and examples within a general optimization setting and to (20) for illustrations in the context of neural networks. The reader is referred to (21; 19; 36) for foundational material. To appreciate the strength of tameness, it is convenient to remember that it models nonsmoothness by confining the study to sets/functions which are, by construction, unions of smooth pieces. This is the so-called stratification property of tame sets/functions. It is this property which motivated the term tame topology, “la topologie modérée” wished for by Grothendieck, see (21).
All finite dimensional deep learning optimization models we are aware of yield tame losses . To understand this assertion and convey the universality of tameness assumptions, let us provide concrete examples (see also (20)).
If one assumes that the neural networks under consideration are built from the following traditional components:
the function has an arbitrary number of layers of arbitrary dimensions,
the activation functions are among classical ones: ReLU, sigmoid, SQNL, RReLU, tanh, APL, soft plus, soft clipping, and many others,
the dissimilarity function is a standard loss such as norms, logistic loss or cross-entropy,
then one easily shows, by elementary quantifier elimination arguments, that the corresponding loss is tame. (Footnote 1: From now on we fix an o-minimal structure, so that an object is said to be tame if it belongs to this structure.)
2.3 INDIAN and its generalized stochastic form INDIANg
Given a locally Lipschitz continuous function $F$ between finite dimensional spaces, we define, for each $x$ in the domain of $F$, its Clarke subdifferential $\partial F(x)$ as the closed convex envelope of the limits of neighboring gradients (see (18) for a formal definition). This set is compact, convex and nonempty.
In order to compute the subdifferential of $J$ and to cope with large datasets, $J$ can be approximated by mini-batches, reducing the memory footprint and computational cost of evaluation. For any $B \subset \{1, \dots, N\}$, set
$$J_B : \theta \mapsto \sum_{n \in B} l(f(x_n, \theta), y_n), \qquad D J_B : \theta \mapsto \sum_{n \in B} \partial \big[ l(f(x_n, \cdot), y_n) \big](\theta). \qquad (3)$$
Observe that, for each $B$, we have $\partial J_B \subset D J_B$ and that $J_B$ is differentiable almost everywhere with $D J_B = \partial J_B = \{\nabla J_B\}$, see (18). When $J$ is tame the equalities hold on the complement of a finite union of manifolds of dimension strictly lower than $P$, the dimension of the parameter space, see (19). For convenience, a point $\theta$ satisfying $0 \in D J(\theta)$ will be called $D$-critical. This terminology is motivated by favourable properties whose statements and proofs are postponed to an appendix: a good calculus along curves (see Lemmas 3 and 4) and the existence of a tame Sard’s theorem (see Lemma 5). To our knowledge, this notion of steady state has not previously been used in the literature. The definition of $D J$ stems from the unavoidable absence of a sum rule for Clarke subdifferentials (think about $|\cdot| - |\cdot|$). This lack of linearity makes the traditional "subgradient plus centered noise" approach unfit for the study of mini-batch subsampling methods in DL.
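The failure of the sum rule can be checked by hand on a one-dimensional example. The sketch below is an illustration under our own assumptions: it takes the two terms $J_1 = |\cdot|$ and $J_2 = -|\cdot|$, represents a Clarke subdifferential on the real line as a closed interval `(lo, hi)`, and compares the sum of the subdifferentials with the subdifferential of the sum. The helper names are hypothetical.

```python
def clarke_abs(x):
    """Clarke subdifferential of the absolute value at x, as an interval (lo, hi)."""
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)  # at the kink, all slopes between -1 and 1


def negate(interval):
    """Subdifferential of -f obtained from that of f (Clarke: d(-f) = -d(f))."""
    lo, hi = interval
    return (-hi, -lo)


def minkowski_sum(a, b):
    """Sum of two intervals, i.e. the set {u + w : u in a, w in b}."""
    return (a[0] + b[0], a[1] + b[1])


def sum_of_subdifferentials(x):
    """Sum of the subdifferentials of J_1 = |.| and J_2 = -|.| at x."""
    return minkowski_sum(clarke_abs(x), negate(clarke_abs(x)))


def subdifferential_of_sum(x):
    """J_1 + J_2 is identically zero, so its subdifferential is {0} everywhere."""
    return (0.0, 0.0)


if __name__ == "__main__":
    for x in (-1.0, 0.0, 2.0):
        print(x, sum_of_subdifferentials(x), subdifferential_of_sum(x))
```

Away from the kink the two sets coincide, while at the origin the sum of subdifferentials is strictly larger than the subdifferential of the sum. This is why a mini-batch oracle naturally returns elements of the larger set.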
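We consider a sequence $(B_k)_{k \in \mathbb{N}}$ of nonempty subsets of $\{1, \dots, N\}$ chosen independently, uniformly at random with replacement, and a sequence of positive stepsizes $(\gamma_k)_{k \in \mathbb{N}}$. Starting from initial values $\theta_0$ and $\psi_0$, we consider the following iterative process:
$$\text{(INDIAN)} \quad \begin{cases} v_k \in D J_{B_k}(\theta_k), \\ \theta_{k+1} = \theta_k + \gamma_k \left( \left(\tfrac{1}{\beta} - \alpha\right) \theta_k - \tfrac{1}{\beta} \psi_k - \beta v_k \right), \\ \psi_{k+1} = \psi_k + \gamma_k \left( \left(\tfrac{1}{\beta} - \alpha\right) \theta_k - \tfrac{1}{\beta} \psi_k \right), \end{cases} \qquad (4)$$
where $\alpha \geq 0$ and $\beta > 0$ are parameters of the algorithm. Empirical experiments suggest good default choices of these damping parameters for training DNNs. See the last section and the appendix for further details and explanation.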
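To make the iterative process concrete, here is a minimal NumPy sketch. It assumes that one step is the explicit Euler discretization, with stepsize `gamma`, of the first-order form of the dynamics of (3), written with an auxiliary variable `psi` and damping parameters `alpha` and `beta`. The helper names (`indian_step`, `init_psi`, `run_indian`), the toy least-absolute-deviation loss and the numerical values used in the usage example are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def indian_step(theta, psi, v, gamma, alpha, beta):
    """One assumed INDIAN step: explicit Euler on the first-order system."""
    drift = (1.0 / beta - alpha) * theta - psi / beta
    theta_next = theta + gamma * (drift - beta * v)
    psi_next = psi + gamma * drift
    return theta_next, psi_next


def init_psi(theta0, v0, alpha, beta):
    """Choose psi_0 so that the first displacement of theta is -gamma * v0."""
    return (1.0 - alpha * beta) * theta0 - (beta**2 - beta) * v0


def minibatch_subgradient(theta, A, b, batch):
    """Element of the sum of subdifferentials for the toy loss sum_n |a_n . theta - b_n|.

    np.sign(0) = 0 is a valid selection in the interval [-1, 1] at a kink.
    """
    residual = A[batch] @ theta - b[batch]
    return A[batch].T @ np.sign(residual)


def run_indian(A, b, theta0, alpha, beta, gamma0, n_iter, batch_size, seed=0):
    """Run the sketch with stepsizes gamma0 / sqrt(k + 1) (an assumed schedule)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    theta = theta0.astype(float)
    batch = rng.choice(n, size=batch_size, replace=False)
    psi = init_psi(theta, minibatch_subgradient(theta, A, b, batch), alpha, beta)
    for k in range(n_iter):
        batch = rng.choice(n, size=batch_size, replace=False)  # fresh mini-batch
        v = minibatch_subgradient(theta, A, b, batch)
        gamma = gamma0 / np.sqrt(k + 1)
        theta, psi = indian_step(theta, psi, v, gamma, alpha, beta)
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.normal(size=(200, 5))
    b = A @ np.ones(5)  # the vector of ones is a global minimizer of the toy loss
    theta = run_indian(A, b, np.zeros(5), alpha=0.5, beta=0.1,
                       gamma0=0.01, n_iter=2000, batch_size=20)
    print("final loss:", np.abs(A @ theta - b).sum())
```

Note that the Hessian never appears: the second-order information enters only through the coupling between `theta` and `psi`, so each step costs one mini-batch subgradient, as for SGD.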
In practice is usually computed with a backpropagation algorithm, as in the seminal work of (35). The whole process is a stochastic approximation of the deterministic dynamics obtained by choosing , that is (batch version). This can be seen by observing that the vectors above may be written , where and compensates for the missing subgradients and can be seen as a zero-mean noise.
Hence, INDIAN admits the following general abstract stochastic formulation:
where is a martingale difference noise sequence adapted to the filtration induced by (random) iterates up to , and are arbitrary initial conditions.
2.4 INDIAN and INDIANg converge
In order to establish convergence, we start with the following standing assumption.
Assumption 1 (Vanishing stepsizes).
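The stepsize sequence $(\gamma_k)_{k \in \mathbb{N}}$ is positive, its sum diverges ($\sum_{k} \gamma_k = +\infty$) and it satisfies $\gamma_k = o(1/\log k)$, that is $\limsup_{k \to +\infty} |\gamma_k \log k| = 0$.
Typical admissible choices are $\gamma_k = C k^{-a}$ with $a \in (0, 1]$, $C > 0$. The main theoretical result of this paper follows.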
Theorem 1 (INDIAN converges to the set of -critical points of ).
Assume that is locally Lipschitz continuous, tame and that the stepsizes satisfy Assumption 1. Set an initial condition and assume that there exists such that almost surely.
Then, almost surely, any accumulation point of a realization of the sequence satisfies . In addition converges.
Remark 1. (a) [Stepsizes]: Assumption 1 offers much more flexibility than the assumption usually made for SGD. We leverage the boundedness assumption, local Lipschitz continuity and the finite sum structure of the loss, so that the noise is actually uniformly bounded, hence sub-Gaussian, allowing for much larger stepsizes than in the more common bounded second moment setting. See (9, Remark 1.5) and (8) for more details.
(b) [Convergence of INDIANg]: Apart from the uniform boundedness of the noise, we do not use the specific structure of DL losses. Thus our result actually holds for general locally Lipschitz continuous tame functions with finite sum structure and for the general stochastic algorithm INDIANg under uniformly bounded martingale increment noise. Other variants could be considered depending on the assumptions on the noise, see (9).
(c) [Convergence to critical points]: Observe that when is differentiable, limit points are simply critical points.
(d) [Local minima]: Let us mention that, for a general loss, being D-critical is a necessary condition for being a local minimum.
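The stepsize condition discussed in remark (a) can be explored numerically. The sketch below assumes, in line with (9, Remark 1.5), that the requirement is a divergent sum together with $\gamma_k \log k \to 0$, and it uses the polynomial family $\gamma_k = c\,k^{-a}$ as an example. The function names and default values are hypothetical.

```python
import math


def stepsize(k, c=1.0, a=0.5):
    """Polynomial stepsize gamma_k = c * k**(-a), defined for k >= 1."""
    return c * k ** (-a)


def log_scaled(k, c=1.0, a=0.5):
    """The quantity gamma_k * log(k), which should vanish as k grows."""
    return stepsize(k, c, a) * math.log(k)


def partial_sum(n, c=1.0, a=0.5):
    """Partial sum gamma_1 + ... + gamma_n, which should grow without bound."""
    return sum(stepsize(k, c, a) for k in range(1, n + 1))


if __name__ == "__main__":
    for k in (10**2, 10**4, 10**6):
        print(k, log_scaled(k), log_scaled(k, a=0.25))
    print(partial_sum(10**2), partial_sum(10**4))
```

Slowly decaying schedules such as `a=0.25` still drive `log_scaled` to zero, only later, which is the flexibility the remark refers to. A constant stepsize, by contrast, makes `gamma_k * log(k)` diverge.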
3 Convergence proof
3.1 Underlying differential inclusion
To study algorithms with vanishing stepsizes such as (4), a powerful approach is to view them as the time discretization of a differential equation/inclusion, see (20) in the context of DL. Our algorithms can be seen as discretizations of the following dynamical system akin to the one considered by (3):
Given any initial condition , general results ensure the existence of an absolutely continuous solution to this system satisfying and (6). Recall that absolute continuity amounts to the fact that are differentiable almost everywhere with
As explained in (3), when is twice differentiable, is equivalent to a second order system (avoiding the explicit use of the second order derivatives of ):
The steady points of DIN are given by
Observe that the first coordinates of these points are -critical for and that conversely any -critical point of corresponds to a unique rest point in .
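For the reader's convenience, the equivalence with the second-order system and the form of the rest points can be verified in a few lines. The derivation below is a sketch under the assumption that, for twice differentiable $J$, the first-order system reads as in (3), with auxiliary variable $\psi$ and parameters $\alpha \geq 0$, $\beta > 0$; the symbol names are ours where the surrounding text leaves them implicit.

```latex
\begin{align*}
\dot\theta(t) &= -\beta \nabla J(\theta(t)) - \Big(\alpha - \tfrac{1}{\beta}\Big)\theta(t) - \tfrac{1}{\beta}\psi(t), \\
\dot\psi(t)   &= -\Big(\alpha - \tfrac{1}{\beta}\Big)\theta(t) - \tfrac{1}{\beta}\psi(t).
\end{align*}
% Subtracting the two lines:
\begin{equation*}
\dot\theta(t) = \dot\psi(t) - \beta \nabla J(\theta(t)).
\end{equation*}
% Differentiating the second line and substituting \dot\psi = \dot\theta + \beta\nabla J(\theta):
\begin{align*}
\ddot\psi(t) &= -\Big(\alpha - \tfrac{1}{\beta}\Big)\dot\theta(t) - \tfrac{1}{\beta}\big(\dot\theta(t) + \beta \nabla J(\theta(t))\big)
              = -\alpha\,\dot\theta(t) - \nabla J(\theta(t)).
\end{align*}
% Differentiating \dot\theta = \dot\psi - \beta\nabla J(\theta):
\begin{align*}
\ddot\theta(t) &= \ddot\psi(t) - \beta \nabla^2 J(\theta(t))\,\dot\theta(t) \\
\Longrightarrow\quad
0 &= \ddot\theta(t) + \alpha\,\dot\theta(t) + \beta \nabla^2 J(\theta(t))\,\dot\theta(t) + \nabla J(\theta(t)).
\end{align*}
% Rest points: \dot\theta = \dot\psi = 0 gives
\begin{equation*}
\nabla J(\theta) = 0, \qquad \psi = (1 - \alpha\beta)\,\theta .
\end{equation*}
```

In particular, each critical point of the loss corresponds to exactly one rest point of the lifted system, since $\psi$ is determined by $\theta$.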
3.2 Proof of convergence for INDIANg
Definition 1 (Lyapunov function).
Let be a subset of , we say that is a Lyapunov function for the set and the dynamics (6) if
for any solution of (DIN) with initial condition , we have:
a.e. on .
for any solution of (DIN) with initial condition , we have:
a.e. on .
In practice, to establish that a functional is Lyapunov, one can simply use differentiation through chain rule results, see the appendices (in particular Lemma 3, first stated in (20, Theorem 5.8) and based on the projection formula in (11)).
To build a Lyapunov function for the dynamics (6) and the set , consider the two following energy-like functions:
Then the following lemma applies.
Lemma 1 (Differentiation along DIN trajectories).
Define and recall that . By a direct integration argument, we obtain the following.
Lemma 2 ( is Lyapunov function for (INDIAN) with respect to ).
For any and any solution with initial condition ,
Proof of Theorem 1.
Lemmas 1 and 2 entail that is a Lyapunov function for the set and the dynamics (6). Set which is actually the set of D-critical points of . Using Lemma 5 in Appendix A, is finite. Moreover, since for all , takes a finite number of values on , and in particular, has empty interior.
Denote by the set of accumulation points of a realization of the sequence produced by (4) starting at and its projection on . We have the following three properties:
By assumption, we have almost surely, for all .
By local Lipschitz continuity is uniformly bounded for and any , hence the centered noise is a uniformly bounded martingale difference sequence.
By Assumption 1, the stepsizes are chosen such that $\gamma_k = o(1/\log k)$ (see Remark 1 (a)).
Combining Theorem 3.6, Remark 1.5 and Proposition 3.27 of (9), we obtain that and is a singleton. Hence is also a singleton and the theorem follows. ∎
4 Experiments
In this section we provide some explanations for the 2D examples displayed in the introduction; these are meant to illustrate the versatility of INDIAN and the effect of its hyperparameters. We then compare the performance of INDIAN with that of other algorithms on a DNN training problem for image recognition. Experiments were conducted with Python 3.6. For the DL experiment, we used Keras 2.2.4 (17) with TensorFlow 1.13.1 (1) as backend.
4.1 Understanding the role of hyperparameters
Both hyperparameters $\alpha$ and $\beta$ can be seen as damping coefficients from the viewpoint of mechanics, as discussed by (3) and sketched in the introduction. Recall the second-order time-continuous dynamics which serves as a model for INDIAN for twice differentiable $J$:
$$\ddot{\theta}(t) + \alpha\,\dot{\theta}(t) + \beta\,\nabla^2 J(\theta(t))\,\dot{\theta}(t) + \nabla J(\theta(t)) = 0. \qquad (7)$$
This differential equation was inspired by Newton’s Second Law of dynamics, asserting that the acceleration of a material point coincides with the sum of forces applied to the particle. As recalled in the introduction, three forces are at stake: gravity and two friction terms. The parameter $\alpha$ calibrates the viscous damping intensity as in the Heavy Ball friction method of (33). It acts as a dissipation term but it can also be seen as a proximity parameter of the system with the usual gradient descent: the higher $\alpha$ is, the more DIN behaves as a pure gradient descent. (Footnote 2: This is easier to see after a suitable rescaling.) On the other hand, the parameter $\beta$ can be seen as a “Newton damping” which takes into account the geometry of the landscape to brake or accelerate the dynamics in an adaptive anisotropic fashion, see (4; 3) for further insights.
We now turn our attention to the cousin dynamics INDIAN, and illustrate the versatility of the hyperparameters and in this case. We proceed on a 2D visual nonsmooth ill-conditioned example à la Rosenbrock, see Figure 1. For this example, we aim at finding the minimum of the function . This function has a V-shaped valley, and a unique critical point at which is also the global minimum. Starting from the point (the black cross), we apply INDIAN with constant steps . Figure 1 shows that when is too small, the trajectory presents many transverse oscillations as well as longitudinal ones close to the critical point (subplot (a)). Then, increasing significantly reduces transverse/parasite oscillations (subplot (b)). Finally, the longitudinal oscillations are reduced by choosing a higher (subplot (c)). In addition, these behaviors are also reflected in the values of the objective function (subplot (d)).
The orange curve (first setting) presents large oscillations. Moreover, looking at the red curve, corresponding to plot (c), there is a short period between and iterations when the decrease is slower than for the other values of and , but still it presents fewer oscillations. In the longer term, the third setting (, ) provides remarkably good performance.
4.2 Training a DNN with INDIAN
We now compare INDIAN with other popular algorithms used in deep learning (SGD, ADAM, ADAGRAD). We train a DNN for classification using the CIFAR-10 dataset (25). This dataset is composed of 60,000 small colored images of size 32 × 32, each associated with a label (airplane, cat, etc.). We split the dataset into 50,000 images for the training part and 10,000 for the test. Regarding the network, we use a slightly modified version of the LeNet-5 network of (27) as implemented by (37). It consists of two 2D-convolutional layers with max-pooling and three dense layers with ReLU activation function. The loss function used is the categorical cross-entropy. We compare our algorithm to both the classical stochastic gradient descent algorithm and the very popular ADAGRAD (22) and ADAM (24) algorithms. At each iteration, we compute the approximation of the subdifferential of the loss on a mini-batch, that is, a subset of the training samples of fixed size. To ensure a fair comparison, each algorithm is initialized with the same random weights (following a normal distribution). To obtain more relevant results, this process is repeated for five different random initializations of the weights. Given a mini-batch subgradient $v_0$ at the initial weights $\theta_0$, the auxiliary variable $\psi_0$ is initialized such that the initial velocity is in the direction of $-v_0$; we use
$$\psi_0 = (1 - \alpha\beta)\,\theta_0 - (\beta^2 - \beta)\,v_0.$$
The sequence of steps $(\gamma_k)_{k \in \mathbb{N}}$ has to meet Assumption 1 and we chose the classical schedule $\gamma_k = \gamma_0 / \sqrt{k+1}$ for both INDIAN and SGD, where $\gamma_0$ is the initial stepsize. Starting from $\gamma_0$, the steps of ADAGRAD and ADAM follow an adaptive procedure based on past gradients, see (22; 24). For all four algorithms, choosing the right initial step length is often critical in terms of efficiency. We use a grid search for each algorithm and choose the initial stepsize that most decreases the loss function over four epochs (four complete passes over the data). Besides performance on the empirical loss, we assess the accuracy of the estimated DNNs using the test dataset, which contains 10,000 images.
LeNet network is a reasonable benchmark that is however behind current state-of-the-art architectures. We may only expect accuracy with this network on CIFAR-10 (32; 28), as obtained in our results. Obtaining higher accuracy would involve more complex networks with careful tuning strategies, beyond the scope of this paper.
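As a rough sanity check on the size of such a network, the sketch below counts the trainable parameters of a LeNet-5-style model on 32 × 32 RGB inputs. The layer sizes (6 and 16 filters with 5 × 5 kernels, 2 × 2 max-pooling, dense layers of 120, 84 and 10 units) are those of the classical LeNet-5 and are an assumption here; the slightly modified version used in the experiments may differ. The function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def lenet5_parameter_count(in_size=32, in_channels=3, n_classes=10):
    """Count weights and biases of an assumed LeNet-5-style architecture."""
    total = 0
    size, channels = in_size, in_channels
    for filters in (6, 16):                         # two convolutional layers
        total += (5 * 5 * channels + 1) * filters   # 5x5 kernels plus one bias each
        size = conv_out(size, 5)                    # no padding, stride 1
        size = size // 2                            # 2x2 max-pooling halves the size
        channels = filters
    features = size * size * channels               # flattened feature vector
    for units in (120, 84, n_classes):              # three dense layers
        total += (features + 1) * units
        features = units
    return total


if __name__ == "__main__":
    print(lenet5_parameter_count())
```

With these assumed sizes the model has a few tens of thousands of parameters, which is small by current standards and consistent with the remark that LeNet is a reasonable but modest benchmark.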
Figure 2 shows that SGD is significantly slower than INDIAN whereas ADAM is faster in early training. However, the training loss of ADAM then stops decreasing below a certain level (see Figure 2); this behavior has been observed before (38). On the contrary, INDIAN performs much better in the long run and can reach very low training error. ADAM can be interpreted as a special case of the Heavy Ball algorithm, see reference (7), which coincides with equation (7) when $\beta = 0$ and the friction coefficient is not constant. Therefore we expect INDIAN to keep the favourable properties of ADAM, while introducing additional advantages thanks to the parameter $\beta$. That said, one may want to tune INDIAN so that it becomes as efficient as ADAM in early training. To do so, it is necessary to take into account the fact that ADAM decreases the stepsizes in an unusual way. We therefore tried slowly decreasing steps in INDIAN (with a smaller decay power, see Appendix C.2), which preserves convergence as stated in Theorem 1 because such schedules meet Assumption 1. Figure 4 of Appendix C.2 shows that slower stepsize decay may lead to faster loss decrease in early training using INDIAN.
Finally, regarding validation accuracy, it appears that for the choice of parameters $\alpha$ and $\beta$ made in Figure 2, validation is not as good as for the other algorithms (especially ADAGRAD), though it remains comparable. Different values of the hyperparameters $\alpha$ and $\beta$ make it possible to tune INDIAN to obtain performance similar to that of ADAGRAD on our example (see Figure 3 (b) of Appendix C.1). This suggests that a trade-off between generalization and training can be found by tuning these hyperparameters.
5 Conclusion
We introduced a novel stochastic optimization algorithm featuring inertial and Newtonian behaviour, motivated by applications to DL. We provided a powerful convergence analysis under weak hypotheses applicable to most DL problems. We would like to point out that, apart from SGD (20), the convergence of competing methods in such a general setting is still an open question. Our result moreover seems to be the first able to handle the mini-batch subsampling approach for ReLU DNNs. Our experiments show that INDIAN is very competitive with state-of-the-art algorithms for DL. We stress that these numerical experiments were performed on substantial DL benchmarks with only minimal algorithm tuning (very classical stepsizes with a simple grid search over a few epochs to set the initial stepsize). This facilitates reproducibility and allows us to stay as close as possible to the reality of DL applications in machine learning.
This work has been supported by the European Research Council (ERC FACTORY-CoG-6681839). Support from the ANR-3IA Artificial and Natural Intelligence Toulouse Institute is also gratefully acknowledged. Jérôme Bolte was partially supported by Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant number FA9550-18-1-0226.
Part of the numerical experiments were done on the OSIRIM platform of IRIT, supported by the CNRS, the FEDER, the Occitanie region and the French government (http://osirim.irit.fr/site/fr).
We thank H. Attouch for useful discussions.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In Proceedings of USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265–283, 2016.
-  S. Adil. Opérateurs monotones aléatoires et application à l’optimisation stochastique. PhD Thesis, Paris Saclay, 2018.
-  F. Alvarez, H. Attouch, J. Bolte, and P. Redont. A second-order gradient-like dissipative dynamical system with Hessian-driven damping: Application to optimization and mechanics. Journal de Mathématiques Pures et Appliquées, 81(8):747–779, 2002.
-  F. D. Alvarez and J. C. Pérez. A dynamical system associated with Newton’s method for parametric approximations of convex minimization problems. Applied Mathematics and Optimization, 38:193–217, 1998.
-  H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
-  J.-P. Aubin and A. Cellina. Differential inclusions: set-valued maps and viability theory. Springer, 2012.
-  A. Barakat and P. Bianchi. Convergence of the ADAM algorithm from a dynamical system viewpoint. arXiv:1810.02263, 2018.
-  M. Benaïm. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités XXXIII, pages 1–68. Springer, 1999.
-  M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348, 2005.
-  A. S. Berahas, R. Bollapragada, and J. Nocedal. An investigation of Newton-sketch and subsampled newton methods. arXiv:1705.06211, 2017.
-  J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18(2):556–572, 2007.
-  V. S. Borkar. Stochastic approximation: A dynamical systems viewpoint. Springer, 2009.
-  L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), pages 161–168, 2008.
-  L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
-  R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
-  R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
-  F. Chollet. Keras. https://github.com/fchollet/keras, 2015.
-  F. H. Clarke. Optimization and nonsmooth analysis. SIAM, 1990.
-  M. Coste. An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali Pisa, 2000.
-  D. Davis, D. Drusvyatskiy, S. Kakade, and J. D. Lee. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, 2019. (in press).
-  L. van den Dries. Tame topology and o-minimal structures. Cambridge university press, 1998.
-  J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7):2121–2159, 2011.
-  X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 315–323, 2011.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
-  A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Canadian Institute for Advanced Research, 2009.
-  H. Kushner and G. G. Yin. Stochastic approximation and recursive algorithms and applications. Springer, 2003.
-  Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  W. Li. Convolutional neural networks for CIFAR-10. https://github.com/BIGBALLON/cifar-10-cnn/tree/master/1_Lecun_Network, 2018.
-  L. Ljung. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4):551–575, 1977.
-  J. Martens. Deep learning via Hessian-free optimization. In Proceedings of the International Conference on Machine Learning (ICML), pages 735–742, 2010.
-  E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS), pages 451–459, 2011.
-  A. A. Pandey. LeNet-Cifar10. https://github.com/amitpandey2194/LeNet-Cifar10/, 2017.
-  B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
-  H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
-  M. Shiota. Geometry of subanalytic and semialgebraic sets, volume 150. Springer Science & Business Media, 2012.
-  T. Thaman. LeNet-5 with Keras. https://github.com/TaavishThaman/LeNet-5-with-Keras, 2018.
-  A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems (NIPS), pages 4148–4158, 2017.
-  P. Xu, F. Roosta-Khorasani, and M. W. Mahoney. Second-order optimization for non-convex machine learning: An empirical study. arXiv:1708.07827, 2017.
Appendix A Some variational results under sum rule failures
Lemma 3 (Chain rule).
Let be a locally Lipschitz continuous, tame function, then admits a chain rule, meaning that for all absolutely continuous curves , and for almost all ,
Consider now a function with an additive/composite structure (such as in deep learning):
where each is locally Lipschitz and tame. We set for any
The following lemma applies and is a direct generalization of the above chain rule.
Lemma 4 (Chain rule).
Let be an absolutely continuous curve so that is differentiable almost everywhere. If is tame then, for almost all , and for all ,
By local Lipschitz continuity and absolute continuity, each is differentiable almost everywhere and Lemma 3 can be applied:
for any , for all , and for almost all . This proves the desired result. ∎
Lemma 5 (A Sard’s theorem for tame D-critical values).
then is finite.
The set is tame and hence it has a finite number of connected components. It is sufficient to prove that is constant on each connected component of . Without loss of generality, assume that is connected and consider . By Whitney regularity (21, 4.15), there exists a tame continuous path joining to . Because of the tame nature of the result, we should here conclude with only tame arguments and use the projection formula in (11), but for the convenience of readers who are not familiar with this result we use Lemma 3. Since is tame, the monotonicity lemma gives the existence of a finite collection of real numbers , such that is on each segment , . Applying Lemma 3 to each , we see that is constant save perhaps on a finite number of points; it is thus constant by continuity. ∎
Appendix B Proof of Lemma 1
Define . We aim at choosing so that is a Lyapunov function. Because is tame and locally Lipschitz continuous, using Lemma 3 we know that for any absolutely continuous trajectory and for almost all ,
Let and be solutions of (DIN). For almost all , we can differentiate to obtain
for all . Using (6), we get and a.e. Choosing yields:
Then, expressing everything as a function of and , one can show that a.e. on :
We aim at choosing so that is decreasing that is . This holds whenever . We choose , and , for these two values we obtain for almost all :
Remark finally that by definition and . ∎
Appendix C Additional experiments
All the results displayed in this section are related to INDIAN applied to the learning problem described in Section 4. The experimental setup is the same and we investigate the influence of hyperparameters and stepsize schedule.
Figure 3: Influence of the hyperparameters on INDIAN: (a) training loss, (b) test accuracy.
Figure 3 (a) illustrates the influence of the hyperparameters $\alpha$ and $\beta$ in training. While some choices appear better for training, others provide better generalization, as shown by the validation performance displayed in Figure 3 (b). It seems that tuning these parameters makes it possible to achieve a compromise between training and testing, as discussed at the end of Section 4.2.
C.2 Stepsize decay
Using INDIAN with fixed hyperparameters on the DL problem described in Section 4, we now investigate the influence of the stepsize schedule. We use polynomially decaying schedules whose initial stepsize is found using a grid search over four epochs. First, to link this experiment with the previous ones, notice that the blue curve in Figure 4 (a) corresponds to the exact same experiment as the blue curve in Figure 2. The most striking result is the red curve, which shows that a slower decay power makes it possible to reach the same speed as ADAM in early training and to achieve a better training loss in the long term. This illustrates that INDIAN with moderate tuning can outperform state-of-the-art algorithms.
Figure 4: Influence of the stepsize schedule on INDIAN: (a) training loss, (b) test accuracy.
Appendix D Python Codes
The Python notebooks that reproduce the experiments of this paper are available at https://github.com/camcastera/Code-for-an-Inertial-Newton-algorithm-for-DL/.
Ready-to-use code to train other networks with the INDIAN algorithm is available at https://github.com/camcastera/Indian-for-DeepLearning/.