1 Introduction
Equilibration rules the longterm fate of many macroscopic dynamical systems. For instance, as we pour water into a glass and let it be, the stationary state of tranquility is eventually attained. Zooming into the tranquil water with a microscope would reveal, however, a turmoil of stochastic fluctuations that maintain the apparent stationarity in balance. This is vividly exemplified by the Brownian motion (Brown, 1828): a pollen immersed in water is constantly bombarded by jittery molecular movements, resulting in the macroscopically observable diffusive motion of the solute. Out of the effort in bridging microscopic and macroscopic realms through the Brownian movement came a prototype of fluctuationdissipation relations (Einstein, 1905; Von Smoluchowski, 1906). These relations quantitatively link degrees of noisy microscopic fluctuations to smooth macroscopic dissipative phenomena and have since been codified in the linear response theory for physical systems (Onsager, 1931; Green, 1954; Kubo, 1957), a cornerstone of statistical mechanics.
Machine learning begets another form of equilibration. As a model learns patterns in data, its performance first improves and then plateaus, again reaching apparent stationarity. This dynamical process naturally comes equipped with stochastic fluctuations as well: often given data too gigantic to consume at once, training proceeds in small batches and random selections of these minibatches consequently give rise to the noisy dynamical excursion of the model parameters in the lossfunction landscape, reminiscent of the Brownian motion. It is thus natural to wonder if there exist analogous fluctuationdissipation relations that quantitatively link the noise in minibatched data to the observable evolution of the model performance and that in turn facilitate the learning process.
Here, we derive such fluctuationdissipation relations for the stochastic gradient descent algorithm. The only assumption made is stationarity of the probability distribution that governs the model parameters at sufficiently long time. Our results thus apply to generic cases with nonGaussian minibatch noises and nonconvex lossfunction landscapes. Practically, the first relation (FDR1) offers the metric for assessing equilibration and yields an adaptive algorithm that sets learningrate schedule on the fly. The second relation (FDR2) further helps us determine the properties of the lossfunction landscape, including the strength of its Hessian and the degree of anharmonicity, i.e., the deviation from the idealized harmonic limit of a quadratic loss surface and a constant noise matrix.
Our approach should be contrasted with recent attempts to import the machinery of stochastic differential calculus into the study of the stochastic gradient descent algorithm (Li et al., 2015; Mandt et al., 2017; Li et al., 2017; Smith & Le, 2018; Chaudhari & Soatto, 2017; Jastrzebski et al., 2017; Zhu et al., 2018; An et al., 2018). This line of work all assumes Gaussian noises and sometimes additionally employs the quadratic harmonic approximation for lossfunction landscapes. The more severe drawback, however, is the usage of the analogy with continuoustime stochastic differential equations, which is inconsistent in general (see Section 2.3.3). Instead, the stochastic gradient descent algorithm can be properly treated within the framework of the KramersMoyal expansion (Van Kampen, 1992; Gardiner, 2009; Risken, 1984; Radons et al., 1990; Leen & Moody, 1993).
The paper is organized as follows. In Section 2, after setting up notations and deriving a stationary fluctuationdissipation theorem (FDT), we derive two specific fluctuationdissipation relations. The first relation (FDR1) can be used to check stationarity and the second relation (FDR2) to delineate the shape of the lossfunction landscape, as empirically borne out in Section 3. An adaptive scheduling method is proposed and tested in Section 3.3. We conclude in Section 4 with future outlooks.
2 Fluctuationdissipation relations
A model is parametrized by a weight coordinate, . The training set of examples is utilized by the model to learn patterns in the data and the model’s overall performance is evaluated by a fullbatch loss function, , with quantifying the performance of the model on a particular sample : the smaller the loss is, the better the model is expected to perform. The learning process can thus be cast as an optimization problem of minimizing the loss function. One of the most commonly used optimization schemes is the stochastic gradient descent (SGD) algorithm (Robbins & Monro, 1951) in which a minibatch of size is stochastically chosen for training at each time step. Specifically, the update equation is given by
(1) 
where is a learning rate and a minibatch loss . Note that
(2) 
with denoting the average over minibatch realizations. For later purposes, it is convenient to define a full twopoint noise matrix through^{1}^{1}1A connected noise covariant matrix, , will not appear in fluctuationdissipation relations below but scales nicely with minibatch sizes as (Li et al., 2017).
(3) 
and, more generally, higherpoint noise tensors
(4) 
Below, we shall not make any assumptions on the distribution of the noise vector
– other than that a minibatch is independent and identically distributed from thetraining samples at each time step – and the noise distribution is therefore allowed to have nontrivial higher connected moments indicative of nonGaussianity.
It is empirically often observed that the performance of the model plateaus after some training through SGD. It is thus natural to hypothesize the existence of a stationarystate distribution, , that dictates the SGD sampling at long time (see Section 2.3.4 for discussion on this assumption). For any observable quantity, , – something that can be measured during training such as and – its stationarystate average is then defined as
(5) 
Stationarity as seen by the observable is tantamount to
and thus follows the master equation
(FDT) 
In the next two subsections, we apply this general formula to simple observables in order to derive various stationary fluctuationdissipation relations. Incidentally, the discrete version of the FokkerPlanck equation can be derived through the KramersMoyal expansion, considering the more general nonstationary version of the above equation and performing the Taylor expansion in and repeated integrations by parts (Van Kampen, 1992; Gardiner, 2009; Risken, 1984; Radons et al., 1990; Leen & Moody, 1993).
2.1 First fluctuationdissipation relation
Applying the master equation (FDT) to the linear observable,
(7) 
We thus have
(8) 
This is natural because there is no particular direction that the gradient picks on average as the model parameter stochastically bounces around the local minimum or, more generally, wanders around the lossfunction landscape according to the stationary distribution.
Performing similar algebra for the quadratic observable yields
(9) 
In particular, taking the trace of this matrixform relation, we obtain
(FDR1) 
More generally, in the case of SGD with momentum and dampening , whose update equation is given by
(10)  
(11) 
a similar derivation yields (see Appendix A)
(FDR1’) 
The last equation reduces to the equation (FDR1) when with . Also note that for an arbitrary constant vector because of the equation (8).
This first fluctuationdissipation relation is easy to evaluate on the fly during training, exactly holds without any approximation if sampled well from the stationary distribution, and can thus be used as the standard metric to check if learning has plateaued, just as similar relations can be used to check equilibration in Monte Carlo simulations of physical systems (Santen & Krauth, 2000). [It should be cautioned, however, that the fluctuationdissipation relations are necessary but not sufficient to ensure stationarity (Odriozola & Berthier, 2011).] Such a metric can in turn be used to schedule changes in hyperparameters, as shall be demonstrated in Section 3.3.
2.2 Second fluctuationdissipation relation
Applying the master equation (FDT) on the fullbatch loss function yields the closedform expression
where we introduced
(13) 
In particular, is the Hessian matrix. Reorganizing terms, we obtain
(FDR2) 
In the case of SGD with momentum and dampening, the lefthand side is replaced by and by more hideous expressions (see Appendix A).
We can extract at least two types of information on the lossfunction landscape by evaluating the dependence of the lefthand side, , on the learning rate . First, in the small learning rate regime, the value of approximates around a local ravine. Second, nonlinearity of at higher indicates discernible effects of anharmonicity. In such a regime, the Hessian matrix cannot be approximated as constant (which also implies that are nontrivial) and/or the noise twopoint matrix cannot be regarded as constant. Such nonlinearity especially indicates the breakdown of the harmonic approximation, that is, the quadratic truncation of the lossfunction landscape, often used to analyze the regime explored at small learning rates.
2.3 Remarks
2.3.1 Intuition within the harmonic approximation
In order to gain some intuition about the fluctuationdissipation relations, let us momentarily employ the harmonic approximation . Within this approximation, . The relation (FDR1) then becomes , linking the height of the noise ball to the noise amplitude. This is in line with, for instance, the theorem 4.6 of the reference Bottou et al. (2018) and substantiates the analogy between SGD and simulated annealing, with the learning rate – multiplied by – playing the role of temperature (Bottou, 1991).
2.3.2 Higherorder relations
Additional relations can be derived by repeating similar calculations for higherorder observables. For example, at the cubic order,
(14) 
The systematic investigation of higherorder relations is relegated to future work.
2.3.3 SgdSde
There is no limit in which SGD asymptotically reduces to the stochastic differential equation (SDE). In order to take such a limit with continuous time differential , each SGD update must become infinitesimal. One may thus try , as in recent work adapting the view that SGDSDE (Li et al., 2015; Mandt et al., 2017; Li et al., 2017; Smith & Le, 2018; Chaudhari & Soatto, 2017; Jastrzebski et al., 2017; Zhu et al., 2018; An et al., 2018). But this in turn forces the noise vector with zero mean, , to be multiplied by . This is in contrast to the scaling needed for the standard machinery of SDE – ItôStratonovich calculus and all that – to apply; the additional factor of makes the effective noise covariance be suppressed by
and the resulting equation in the continuoustime limit, if anything, would just be an ordinary differential equation without noise
^{2}^{2}2One may try to evade this by employing the scaling of the connected noise covariant matrix, but that would then enforces as , which is unphysical. [unless noise with the proper scaling is explicitly added as in stochastic gradient Langevin dynamics (Welling & Teh, 2011; Teh et al., 2016) and natural Langevin dynamics (MarceauCaron & Ollivier, 2017; Nado et al., 2018)].In short, the recent work views and sends while pretending that is finite, which is inconsistent. This is not just a technical subtlety. When unjustifiably passing onto the continuoustime FokkerPlanck equation, the diffusive term is incorrectly governed by the connected twopoint noise matrix rather than the full twopoint noise matrix that appears herein.^{3}^{3}3Heuristically, for small due to the relation FDR2, and one may thus neglect the difference between and , and hence justify the naive use of SDE, when and the Gaussiannoise assumption holds. In the similar vein, the reference Li et al. (2015) proves faster convergence between SGD and SDE when the term proportional to is added to the gradient. We must instead employ the discretetime version of the FokkerPlanck equation derived in references Van Kampen (1992); Gardiner (2009); Risken (1984); Radons et al. (1990); Leen & Moody (1993), as has been followed in the equation (2).
2.3.4 On stationarity
In contrast to statistical mechanics where an equilibrium state is dictated by a handful of thermodynamic variables, in machine learning a stationary state generically depends not only on hyperparameters but also on a part of its learning history. The stationarity assumption made herein, which is codified in the equation (2), is weaker than the typicality assumption underlying statistical mechanics and can hold even in the presence of lingering memory. The empirical verification of the stationary fluctuationdissipation relations in the next section supports this view.
One caveat worth mentioning, however, is that in many practical implementations of deep learning, or online machine learning with a timeevolving sample distribution, the model parameters keep evolving even at sufficiently long times. As long as such systematic cascading process is slow compared to shorttime stochastic processes, it is reasonable to expect the notion of stationarity approximately holds. Nonetheless, in order to disentangle from the present study this potential complication with quasistationarity, whose proper treatment is outside the scope of the stationary framework developed herein, we explicitly impose
regularization in empirically verifying our claims in Section 3.3 Empirical tests
In this section we empirically bear out our theoretical claims in the last section. To this end, two simple models of supervised learning are used (see Appendix
Bfor full specifications): a multilayer perceptron (MLP) learning patterns in the MNIST training data
(LeCun et al., 1998)through SGD without momentum and a convolutional neural network (CNN) learning patterns in the CIFAR10 training data
(Krizhevsky & Hinton, 2009) through SGD with momentum . For both models, the minibatch size is set to be, and the training data are shuffled at each epoch
with . As alluded to in Section 2.3.4, the regularization term with the weight decay is included in the loss function .Before proceeding further, let us define the halfrunning average of an observable as
(15) 
This is the average of the observable up to the time step , with the initial half discarded as containing transient. If SGD drives the distribution of the model parameters to stationarity at long time, then
(16) 
3.1 First fluctuationdissipation relation and equilibration
In order to assess the proximity to stationarity, define
(17) 
(with replaced by for SGD without momentum).^{4}^{4}4If the model parameter happens to fluctuate around large values, for numerical accuracy, one may want to replace by where a constant vector approximates the vector around which fluctuates at long time. Both of these observables can easily be measured on the fly at each time step during training and, according to the relation (FDR1’), the running averages of these two observables should converge to each other upon equilibration.
In order to verify this claim, we first train the model with the learning rate for epochs, that is, for time steps. As shown in the figure 1, the observables and converge to each other.
We then take the model at the end of the initial epoch training and sequentially train it further at various learning rates (see Appendix B). The observables and again converge to each other, as plotted in the figure 2. Note that the smaller the learning rate is, the longer it takes to equilibrate.
3.2 Second fluctuationdissipation relation and shape of lossfunction landscape
In order to assess the lossfunction landscape information from the relation (FDR2), define
(18) 
(with the second term nonexistent for SGD without momentum).^{5}^{5}5For the second term, in order to ensure that , we measure the halfrunning average of and not . Note that is a fullbatch – not minibatch – quantity. Given its computational cost, here we measure this first term only at the end of each epoch and take the halfrunning average over these sparse sample points, discarding the initial half of the run.
The halfrunning average of the fullbatch observable at the end of sufficiently long training, which is a good proxy for , is plotted in the figure 3 as a function of the learning rate . As predicted by the relation (FDR2), at small learning rates , the observable approaches zero; its slope – divided by if preferred – measures the magnitude of the Hessian matrix, componentwise averaged over directions in which the noise preferentially fluctuates. Meanwhile, nonlinearity at higher learning rates measures the degree of anharmonicity experienced over the distribution . We see that anharmonic effects are pronounced especially for the CNN on the CIFAR10 data even at moderately small learning rates. This invalidates the use of the quadratic harmonic approximation for the lossfunction landscape and/or the assumption of the constant noise matrix for this model except at very small learning rates.
, estimated through halfrunning averages. Dots and error bars denote mean values and
confidence intervals over several distinct runs, respectively. The straight red line connects the origin and the point with the smallest explored. (a) For the MLP on the MNIST data, linear dependence on for supports the validity of the harmonic approximation there. (b) For the CNN on the CIFAR10 data, anharmonicity is pronounced even down to .3.3 First fluctuationdissipation relation and learningrate schedules
Saturation of the relation (FDR1) suggests the learning stationarity, at which point it might be wise to decrease the learning rate . Such scheduling is often carried out in an ad hoc manner but we can now algorithmize this procedure as follows:

Evaluate the halfrunning averages and at the end of each epoch.

If , then decrease the learning rate as and also set for the purpose of evaluating halfrunning averages.
Here, two scheduling hyperparameters and are introduced, which control the threshold for saturation of the relation (FDR1) and the amount of decrease in the learning rate, respectively.
Plotted in the figure 4 are results for SGD without momentum, with the Xavier initialization (Glorot & Bengio, 2010) and training (i) through preset training schedule with decrease of the learning rate by a factor of for each epochs and (ii) through an adaptive scheduler with ( threshold) and ( decrease). The adaptive scheduler attains comparable results.
These two scheduling methods span different subspaces of all the possible schedules. The adaptive scheduling method proposed herein has a theoretical grounding and in practice much less dimensionality for tuning of scheduling hyperparameters than the presetting method, thus ameliorating the optimization of scheduling hyperparameters. The systematic comparison between the two scheduling methods for stateofthearts architectures could be a worthwhile avenue to pursue in the future.
4 Conclusion
In this paper, we have derived the fluctuationdissipation relations with no assumptions other than stationarity of the probability distribution. These relations hold exactly even when the noise is nonGaussian and the loss function is nonconvex. The relations have been empirically verified and used to probe the properties of the lossfunction landscapes for the simple models. The relations further have resulted in the algorithm to adaptively set learningrate schedule on the fly rather than presetting it in an ad hoc manner. In addition to systematically testing the performance of this adaptive scheduling algorithm, it would be interesting to investigate nonGaussianity and noncovexity in more details through higherpoint observables, both analytically and numerically. It would also be interesting to further elucidate the physics of machine learning by extending our formalism to incorporate nonstationary dynamics, linearly away from stationarity (Onsager, 1931; Green, 1954; Kubo, 1957) and beyond (Jarzynski, 1997; Crooks, 1999), so that it can in particular properly treat overfitting cascading dynamics and timedependent sample distributions.
Acknowledgments
The author thanks Ludovic Berthier, Léon Bottou, Guy GurAri, Kunihiko Kaneko, Dheevatsa Mudigere, Yann Ollivier, Yuandong Tian, and Mark Tygert for discussions. Special thanks go to Daniel Adam Roberts who prompted the practical application of the fluctuationdissipation relations, leading to the adaptive method in Section 3.3.
References
 An et al. (2018) Jing An, Jianfeng Lu, and Lexing Ying. Stochastic modified equations for the asynchronous stochastic gradient descent. arXiv:1805.08244, 2018.
 Bottou (1991) Léon Bottou. Stochastic gradient learning in neural networks. Proceedings of NeuroNımes, 91(8):12, 1991.
 Bottou et al. (2018) Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for largescale machine learning. SIAM Review, 60(2):223–311, 2018.
 Brown (1828) Robert Brown. XXVII. A brief account of microscopical observations made in the months of June, July and August 1827, on the particles contained in the pollen of plants; and on the general existence of active molecules in organic and inorganic bodies. The Philosophical Magazine, 4(21):161–173, 1828.
 Chaudhari & Soatto (2017) Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv:1710.11029, 2017.
 Crooks (1999) Gavin E Crooks. Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences. Physical Review E, 60(3):2721–2726, 1999.
 Einstein (1905) Albert Einstein. Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. Annalen der physik, 322(8):549–560, 1905.
 Gardiner (2009) Crispin Gardiner. Stochastic methods, volume 4. Springer Berlin, 2009.

Glorot & Bengio (2010)
Xavier Glorot and Yoshua Bengio.
Understanding the difficulty of training deep feedforward neural
networks.
In
Proceedings of the thirteenth international conference on artificial intelligence and statistics
, pp. 249–256, 2010.  Green (1954) Melville S Green. Markoff random processes and the statistical mechanics of timedependent phenomena. II. Irreversible processes in fluids. The Journal of Chemical Physics, 22(3):398–413, 1954.
 Jarzynski (1997) Christopher Jarzynski. Nonequilibrium equality for free energy differences. Physical Review Letters, 78(14):2690–2693, 1997.
 Jastrzebski et al. (2017) Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. arXiv:1711.04623, 2017.
 Karpathy (2014) Andrej Karpathy. ConvNetJS CIFAR10 demo. https://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html, 2014. Last accessed on 20180925.
 Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Kubo (1957) Ryogo Kubo. Statisticalmechanical theory of irreversible processes. I. General theory and simple applications to magnetic and conduction problems. Journal of the Physical Society of Japan, 12(6):570–586, 1957.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Leen & Moody (1993) Todd K Leen and John E Moody. Weight space probability densities in stochastic learning: I. Dynamics and equilibria. In Advances in neural information processing systems, pp. 451–458, 1993.
 Li et al. (2017) Chris Junchi Li, Lei Li, Junyang Qian, and JianGuo Liu. Batch size matters: A diffusion approximation framework on nonconvex stochastic gradient descent. arXiv:1705.07562v1, 2017.
 Li et al. (2015) Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and adaptive stochastic gradient algorithms. arXiv:1511.06251, 2015.

Mandt et al. (2017)
Stephan Mandt, Matthew D Hoffman, and David M Blei.
Stochastic gradient descent as approximate Bayesian inference.
The Journal of Machine Learning Research, 18(1):4873–4907, 2017.  MarceauCaron & Ollivier (2017) Gaétan MarceauCaron and Yann Ollivier. Natural Langevin dynamics for neural networks. In International Conference on Geometric Science of Information, pp. 451–459. Springer, 2017.
 Nado et al. (2018) Zachary Nado, Jasper Snoek, Roger Grosse, David Duvenaud, Bowen Xu, and James Martens. Stochastic gradient Langevin dynamics that exploit neural network structure, 2018. URL https://openreview.net/forum?id=rySe9kvG.
 Odriozola & Berthier (2011) Gerardo Odriozola and Ludovic Berthier. Equilibrium equation of state of a hard sphere binary mixture at very large densities using replica exchange Monte Carlo simulations. The Journal of Chemical Physics, 134(5):054504, 2011.
 Onsager (1931) Lars Onsager. Reciprocal relations in irreversible processes. I. Physical Review, 37(4):405–426, 1931.

Radons et al. (1990)
Günter Radons, Heinz Georg Schuster, and D Werner.
FokkerPlanck description of learning in backpropagation networks.
In International Neural Network Conference, pp. 993–996. Springer, 1990.  Risken (1984) Hannes Risken. The FokkerPlanck equation: methods of solution and applications. Springer, 1984.
 Robbins & Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

Santen & Krauth (2000)
Ludger Santen and Werner Krauth.
Absence of thermodynamic phase transition in a model glass former.
Nature, 405(6786):550–551, 2000.  Smith & Le (2018) Samuel L Smith and Quoc V Le. A Bayesian perspective on generalization and stochastic gradient descent. arXiv:1710.06451, 2018.
 Teh et al. (2016) Yee Whye Teh, Alexandre H Thiery, and Sebastian J Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. The Journal of Machine Learning Research, 17(1):193–225, 2016.
 Van Kampen (1992) Nicolaas Godfried Van Kampen. Stochastic processes in physics and chemistry, volume 1. Elsevier, 1992.
 Von Smoluchowski (1906) Marian Von Smoluchowski. Zur kinetischen theorie der brownschen molekularbewegung und der suspensionen. Annalen der physik, 326(14):756–780, 1906.
 Welling & Teh (2011) Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML11), pp. 681–688, 2011.
 Zhu et al. (2018) Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from minima and regularization effects. arXiv:1803.00195, 2018.
Appendix A SGD with momentum and dampening
For SGD with momentum and dampening , the update equation is given by
(19)  
(20) 
Here is the velocity and the learning rate; SGD without momentum is the special case with . Again hypothesizing the existence of a stationarystate distribution , the stationarystate average of an observable is defined as
(21) 
Just as in the main text, from the assumed stationarity follows the master equation for SGD with momentum and dampening
(22) 
For the linear observables,
(23) 
and
(24) 
thus
(25) 
For the quadratic observables
(26) 
(27) 
and
(28) 
Note that the relations (26) and (27) are trivially satisfied at each time step if the lefthand side observables are evaluated at one step ahead and thus their being satisfied for running averages has nothing to do with equilibration [the same can be said about the relation (23)]; the only nontrivial relation is the equation (28), which is a consequence of setting constant of time. After taking traces and some rearrangement, we obtain the relation (FDR1’) in the main text.
For the fullbatch loss function, the algebra similar to the one in the main text yields
Appendix B Models and simulation protocols
b.1 MLP on MNIST through SGD without momentum
The MNIST training data consist of blackwhite images of handwritten digits with by pixels (LeCun et al., 1998)
. We preprocess the data through an affine transformation such that their mean and variance (over both the training data and pixels) are zero and one, respectively.
Our multilayer perceptron (MLP) consists of a dimensional input layer followed by a hidden layer of neurons with ReLU activations, another hidden layer of neurons with ReLU activations, and a dimensional output layer with the softmax activation. The model performance is evaluated by the crossentropy loss supplemented by the regularization term with the weight decay .
Throughout the paper, the MLP is trained on the MNIST data through SGD without momentum. The data are shuffled at each epoch with the minibatch size .
The MLP is initialized through the Xavier method (Glorot & Bengio, 2010) and trained for epochs with the learning rate . We then sequentially train it with . This sequentialrun protocol is carried out with distinct seeds for the randomnumber generator used in data shuffling, all starting from the common model parameter attained at the end of the initial epoch run. The figure 2 depicts trajectories for one particular seed, while the figure 3 plots means and error bars over these distinct seeds.
b.2 CNN on CIFAR10 through SGD with momentum
The CIFAR10 training data consist of color images of objects – divided into ten categories – with by pixels in each of color channels, each pixel ranging in (Krizhevsky & Hinton, 2009). We preprocess the data through uniformly subtracting and multiplying by so that each pixel ranges in .
In order to describe the architecture of our convolutional neural network (CNN) in detail, let us associate a tuple to a convolutional layer with filter width , a number of channels
, stride
, and padding
, followed by ReLU activations and a maxpooling layer of width
. Then, as in the demo at Karpathy (2014), our CNN consists of a input layer followed by a convolutional layer with , another convolutional layer with , yet another convolutional layer with , and finally a fullyconnected dimensional output layer with the softmax activation. The model performance is evaluated by the crossentropy loss supplemented by the regularization term with the weight decay .Throughout the paper (except in Section 3.3 where the adaptive scheduling method is tested for SGD without momentum), the CNN is trained on the CIFAR10 data through SGD with momentum and dampening . The data are shuffled at each epoch with the minibatch size .
The CNN is initialized through the Xavier method (Glorot & Bengio, 2010) and trained for epochs with the learning rate . We then sequentially train it with . At each junction of the sequence, the velocity is zeroed. This sequentialrun protocol is carried out with distinct seeds for the randomnumber generator used in data shuffling, all starting from the common model parameter attained at the end of the initial epoch run. The figure 2 depicts trajectories for one particular seed, while the figure 3 plots means and error bars over these distinct seeds.
Comments
There are no comments yet.