Fluctuation-dissipation relations for stochastic gradient descent

09/28/2018 ∙ by Sho Yaida, et al. ∙ Facebook 0

The notion of the stationary equilibrium ensemble has played a central role in statistical mechanics. In machine learning as well, training serves as generalized equilibration that drives the probability distribution of model parameters toward stationarity. Here, we derive stationary fluctuation-dissipation relations that link measurable quantities and hyperparameters in the stochastic gradient descent algorithm. These relations hold exactly for any stationary state and can in particular be used to adaptively set training schedule. We can further use the relations to efficiently extract information pertaining to a loss-function landscape such as the magnitudes of its Hessian and anharmonicity. Our claims are empirically verified.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Equilibration rules the long-term fate of many macroscopic dynamical systems. For instance, as we pour water into a glass and let it be, the stationary state of tranquility is eventually attained. Zooming into the tranquil water with a microscope would reveal, however, a turmoil of stochastic fluctuations that maintain the apparent stationarity in balance. This is vividly exemplified by the Brownian motion (Brown, 1828): a pollen immersed in water is constantly bombarded by jittery molecular movements, resulting in the macroscopically observable diffusive motion of the solute. Out of the effort in bridging microscopic and macroscopic realms through the Brownian movement came a prototype of fluctuation-dissipation relations (Einstein, 1905; Von Smoluchowski, 1906). These relations quantitatively link degrees of noisy microscopic fluctuations to smooth macroscopic dissipative phenomena and have since been codified in the linear response theory for physical systems (Onsager, 1931; Green, 1954; Kubo, 1957), a cornerstone of statistical mechanics.

Machine learning begets another form of equilibration. As a model learns patterns in data, its performance first improves and then plateaus, again reaching apparent stationarity. This dynamical process naturally comes equipped with stochastic fluctuations as well: often given data too gigantic to consume at once, training proceeds in small batches and random selections of these mini-batches consequently give rise to the noisy dynamical excursion of the model parameters in the loss-function landscape, reminiscent of the Brownian motion. It is thus natural to wonder if there exist analogous fluctuation-dissipation relations that quantitatively link the noise in mini-batched data to the observable evolution of the model performance and that in turn facilitate the learning process.

Here, we derive such fluctuation-dissipation relations for the stochastic gradient descent algorithm. The only assumption made is stationarity of the probability distribution that governs the model parameters at sufficiently long time. Our results thus apply to generic cases with non-Gaussian mini-batch noises and nonconvex loss-function landscapes. Practically, the first relation (FDR1) offers the metric for assessing equilibration and yields an adaptive algorithm that sets learning-rate schedule on the fly. The second relation (FDR2) further helps us determine the properties of the loss-function landscape, including the strength of its Hessian and the degree of anharmonicity, i.e., the deviation from the idealized harmonic limit of a quadratic loss surface and a constant noise matrix.

Our approach should be contrasted with recent attempts to import the machinery of stochastic differential calculus into the study of the stochastic gradient descent algorithm (Li et al., 2015; Mandt et al., 2017; Li et al., 2017; Smith & Le, 2018; Chaudhari & Soatto, 2017; Jastrzebski et al., 2017; Zhu et al., 2018; An et al., 2018). This line of work all assumes Gaussian noises and sometimes additionally employs the quadratic harmonic approximation for loss-function landscapes. The more severe drawback, however, is the usage of the analogy with continuous-time stochastic differential equations, which is inconsistent in general (see Section 2.3.3). Instead, the stochastic gradient descent algorithm can be properly treated within the framework of the Kramers-Moyal expansion (Van Kampen, 1992; Gardiner, 2009; Risken, 1984; Radons et al., 1990; Leen & Moody, 1993).

The paper is organized as follows. In Section 2, after setting up notations and deriving a stationary fluctuation-dissipation theorem (FDT), we derive two specific fluctuation-dissipation relations. The first relation (FDR1) can be used to check stationarity and the second relation (FDR2) to delineate the shape of the loss-function landscape, as empirically borne out in Section 3. An adaptive scheduling method is proposed and tested in Section 3.3. We conclude in Section 4 with future outlooks.

2 Fluctuation-dissipation relations

A model is parametrized by a weight coordinate, . The training set of examples is utilized by the model to learn patterns in the data and the model’s overall performance is evaluated by a full-batch loss function, , with quantifying the performance of the model on a particular sample : the smaller the loss is, the better the model is expected to perform. The learning process can thus be cast as an optimization problem of minimizing the loss function. One of the most commonly used optimization schemes is the stochastic gradient descent (SGD) algorithm (Robbins & Monro, 1951) in which a mini-batch of size is stochastically chosen for training at each time step. Specifically, the update equation is given by

(1)

where is a learning rate and a mini-batch loss . Note that

(2)

with denoting the average over mini-batch realizations. For later purposes, it is convenient to define a full two-point noise matrix through111A connected noise covariant matrix, , will not appear in fluctuation-dissipation relations below but scales nicely with mini-batch sizes as  (Li et al., 2017).

(3)

and, more generally, higher-point noise tensors

(4)

Below, we shall not make any assumptions on the distribution of the noise vector

– other than that a mini-batch is independent and identically distributed from the

training samples at each time step – and the noise distribution is therefore allowed to have nontrivial higher connected moments indicative of non-Gaussianity.

It is empirically often observed that the performance of the model plateaus after some training through SGD. It is thus natural to hypothesize the existence of a stationary-state distribution, , that dictates the SGD sampling at long time (see Section 2.3.4 for discussion on this assumption). For any observable quantity, , – something that can be measured during training such as and – its stationary-state average is then defined as

(5)

Stationarity as seen by the observable is tantamount to

and thus follows the master equation

(FDT)

In the next two subsections, we apply this general formula to simple observables in order to derive various stationary fluctuation-dissipation relations. Incidentally, the discrete version of the Fokker-Planck equation can be derived through the Kramers-Moyal expansion, considering the more general nonstationary version of the above equation and performing the Taylor expansion in and repeated integrations by parts (Van Kampen, 1992; Gardiner, 2009; Risken, 1984; Radons et al., 1990; Leen & Moody, 1993).

2.1 First fluctuation-dissipation relation

Applying the master equation (FDT) to the linear observable,

(7)

We thus have

(8)

This is natural because there is no particular direction that the gradient picks on average as the model parameter stochastically bounces around the local minimum or, more generally, wanders around the loss-function landscape according to the stationary distribution.

Performing similar algebra for the quadratic observable yields

(9)

In particular, taking the trace of this matrix-form relation, we obtain

(FDR1)

More generally, in the case of SGD with momentum and dampening , whose update equation is given by

(10)
(11)

a similar derivation yields (see Appendix A)

(FDR1’)

The last equation reduces to the equation (FDR1) when with . Also note that for an arbitrary constant vector because of the equation (8).

This first fluctuation-dissipation relation is easy to evaluate on the fly during training, exactly holds without any approximation if sampled well from the stationary distribution, and can thus be used as the standard metric to check if learning has plateaued, just as similar relations can be used to check equilibration in Monte Carlo simulations of physical systems (Santen & Krauth, 2000). [It should be cautioned, however, that the fluctuation-dissipation relations are necessary but not sufficient to ensure stationarity (Odriozola & Berthier, 2011).] Such a metric can in turn be used to schedule changes in hyperparameters, as shall be demonstrated in Section 3.3.

2.2 Second fluctuation-dissipation relation

Applying the master equation (FDT) on the full-batch loss function yields the closed-form expression

where we introduced

(13)

In particular, is the Hessian matrix. Reorganizing terms, we obtain

(FDR2)

In the case of SGD with momentum and dampening, the left-hand side is replaced by and by more hideous expressions (see Appendix A).

We can extract at least two types of information on the loss-function landscape by evaluating the dependence of the left-hand side, , on the learning rate . First, in the small learning rate regime, the value of approximates around a local ravine. Second, nonlinearity of at higher indicates discernible effects of anharmonicity. In such a regime, the Hessian matrix cannot be approximated as constant (which also implies that are nontrivial) and/or the noise two-point matrix cannot be regarded as constant. Such nonlinearity especially indicates the breakdown of the harmonic approximation, that is, the quadratic truncation of the loss-function landscape, often used to analyze the regime explored at small learning rates.

2.3 Remarks

2.3.1 Intuition within the harmonic approximation

In order to gain some intuition about the fluctuation-dissipation relations, let us momentarily employ the harmonic approximation . Within this approximation, . The relation (FDR1) then becomes , linking the height of the noise ball to the noise amplitude. This is in line with, for instance, the theorem 4.6 of the reference Bottou et al. (2018) and substantiates the analogy between SGD and simulated annealing, with the learning rate – multiplied by – playing the role of temperature (Bottou, 1991).

2.3.2 Higher-order relations

Additional relations can be derived by repeating similar calculations for higher-order observables. For example, at the cubic order,

(14)

The systematic investigation of higher-order relations is relegated to future work.

2.3.3 SgdSde

There is no limit in which SGD asymptotically reduces to the stochastic differential equation (SDE). In order to take such a limit with continuous time differential , each SGD update must become infinitesimal. One may thus try , as in recent work adapting the view that SGDSDE (Li et al., 2015; Mandt et al., 2017; Li et al., 2017; Smith & Le, 2018; Chaudhari & Soatto, 2017; Jastrzebski et al., 2017; Zhu et al., 2018; An et al., 2018). But this in turn forces the noise vector with zero mean, , to be multiplied by . This is in contrast to the scaling needed for the standard machinery of SDE – Itô-Stratonovich calculus and all that – to apply; the additional factor of makes the effective noise covariance be suppressed by

and the resulting equation in the continuous-time limit, if anything, would just be an ordinary differential equation without noise

222One may try to evade this by employing the -scaling of the connected noise covariant matrix, but that would then enforces as , which is unphysical. [unless noise with the proper scaling is explicitly added as in stochastic gradient Langevin dynamics (Welling & Teh, 2011; Teh et al., 2016) and natural Langevin dynamics (Marceau-Caron & Ollivier, 2017; Nado et al., 2018)].

In short, the recent work views and sends while pretending that is finite, which is inconsistent. This is not just a technical subtlety. When unjustifiably passing onto the continuous-time Fokker-Planck equation, the diffusive term is incorrectly governed by the connected two-point noise matrix rather than the full two-point noise matrix that appears herein.333Heuristically, for small due to the relation FDR2, and one may thus neglect the difference between and , and hence justify the naive use of SDE, when and the Gaussian-noise assumption holds. In the similar vein, the reference Li et al. (2015) proves faster convergence between SGD and SDE when the term proportional to is added to the gradient. We must instead employ the discrete-time version of the Fokker-Planck equation derived in references Van Kampen (1992); Gardiner (2009); Risken (1984); Radons et al. (1990); Leen & Moody (1993), as has been followed in the equation (2).

2.3.4 On stationarity

In contrast to statistical mechanics where an equilibrium state is dictated by a handful of thermodynamic variables, in machine learning a stationary state generically depends not only on hyperparameters but also on a part of its learning history. The stationarity assumption made herein, which is codified in the equation (2), is weaker than the typicality assumption underlying statistical mechanics and can hold even in the presence of lingering memory. The empirical verification of the stationary fluctuation-dissipation relations in the next section supports this view.

One caveat worth mentioning, however, is that in many practical implementations of deep learning, or online machine learning with a time-evolving sample distribution, the model parameters keep evolving even at sufficiently long times. As long as such systematic cascading process is slow compared to short-time stochastic processes, it is reasonable to expect the notion of stationarity approximately holds. Nonetheless, in order to disentangle from the present study this potential complication with quasi-stationarity, whose proper treatment is outside the scope of the stationary framework developed herein, we explicitly impose

-regularization in empirically verifying our claims in Section 3.

3 Empirical tests

In this section we empirically bear out our theoretical claims in the last section. To this end, two simple models of supervised learning are used (see Appendix 

B

for full specifications): a multilayer perceptron (MLP) learning patterns in the MNIST training data 

(LeCun et al., 1998)

through SGD without momentum and a convolutional neural network (CNN) learning patterns in the CIFAR-10 training data 

(Krizhevsky & Hinton, 2009) through SGD with momentum . For both models, the mini-batch size is set to be

, and the training data are shuffled at each epoch

with . As alluded to in Section 2.3.4, the -regularization term with the weight decay is included in the loss function .

Before proceeding further, let us define the half-running average of an observable as

(15)

This is the average of the observable up to the time step , with the initial half discarded as containing transient. If SGD drives the distribution of the model parameters to stationarity at long time, then

(16)

3.1 First fluctuation-dissipation relation and equilibration

In order to assess the proximity to stationarity, define

(17)

(with replaced by for SGD without momentum).444If the model parameter happens to fluctuate around large values, for numerical accuracy, one may want to replace by where a constant vector approximates the vector around which fluctuates at long time. Both of these observables can easily be measured on the fly at each time step during training and, according to the relation (FDR1’), the running averages of these two observables should converge to each other upon equilibration.

(a) MLP on MNIST,
(b) CNN on CIFAR-10,
Figure 1: Approaches toward stationarity during the initial trainings for the MLP on the MNIST data (a) and for the CNN on the CIFAR-10 data (b). Top panels depict the half-running average (dark green) and the instantaneous value (light green) of the mini-batch loss. Bottom panels depict the convergence of the half-running averages of the observables and , whose stationary-state averages should agree according to the relation (FDR1’).

In order to verify this claim, we first train the model with the learning rate for epochs, that is, for time steps. As shown in the figure 1, the observables and converge to each other.

(a) MLP on MNIST,
(b) CNN on CIFAR-10,
Figure 2: Approaches toward stationarity during the sequential runs for various learning rates , seen through the half-running averages of the observables (dotted) and (solid). They agree at sufficiently long times but the relaxation time to reach such a stationary regime increases as the learning rate decreases.

We then take the model at the end of the initial -epoch training and sequentially train it further at various learning rates (see Appendix B). The observables and again converge to each other, as plotted in the figure 2. Note that the smaller the learning rate is, the longer it takes to equilibrate.

3.2 Second fluctuation-dissipation relation and shape of loss-function landscape

In order to assess the loss-function landscape information from the relation (FDR2), define

(18)

(with the second term nonexistent for SGD without momentum).555For the second term, in order to ensure that , we measure the half-running average of and not . Note that is a full-batch – not mini-batch – quantity. Given its computational cost, here we measure this first term only at the end of each epoch and take the half-running average over these sparse sample points, discarding the initial half of the run.

The half-running average of the full-batch observable at the end of sufficiently long training, which is a good proxy for , is plotted in the figure 3 as a function of the learning rate . As predicted by the relation (FDR2), at small learning rates , the observable approaches zero; its slope – divided by if preferred – measures the magnitude of the Hessian matrix, component-wise averaged over directions in which the noise preferentially fluctuates. Meanwhile, nonlinearity at higher learning rates measures the degree of anharmonicity experienced over the distribution . We see that anharmonic effects are pronounced especially for the CNN on the CIFAR-10 data even at moderately small learning rates. This invalidates the use of the quadratic harmonic approximation for the loss-function landscape and/or the assumption of the constant noise matrix for this model except at very small learning rates.

(a) MLP on MNIST,
(b) CNN on CIFAR-10,
Figure 3: The stationary-state average of the full-batch observable as a function of the learning rate

, estimated through half-running averages. Dots and error bars denote mean values and

confidence intervals over several distinct runs, respectively. The straight red line connects the origin and the point with the smallest explored. (a) For the MLP on the MNIST data, linear dependence on for supports the validity of the harmonic approximation there. (b) For the CNN on the CIFAR-10 data, anharmonicity is pronounced even down to .

3.3 First fluctuation-dissipation relation and learning-rate schedules

Saturation of the relation (FDR1) suggests the learning stationarity, at which point it might be wise to decrease the learning rate . Such scheduling is often carried out in an ad hoc manner but we can now algorithmize this procedure as follows:

  1. Evaluate the half-running averages and at the end of each epoch.

  2. If , then decrease the learning rate as and also set for the purpose of evaluating half-running averages.

Here, two scheduling hyperparameters and are introduced, which control the threshold for saturation of the relation (FDR1) and the amount of decrease in the learning rate, respectively.

Plotted in the figure 4 are results for SGD without momentum, with the Xavier initialization (Glorot & Bengio, 2010) and training (i) through preset training schedule with decrease of the learning rate by a factor of for each epochs and (ii) through an adaptive scheduler with ( threshold) and ( decrease). The adaptive scheduler attains comparable results.

(a) MLP on MNIST,
(b) CNN on CIFAR-10,
Figure 4: Comparison of preset training schedule (black) and adaptive training schedule (blue), employing SGD without momentum both for the MLP on the MNIST data (a) and the CNN on the CIFAR-10 data (b). From top to bottom, plotted are the learning rate , the full-batch training loss , and prediction accuracies on the training-set images (solid) and the test-set images (dashed).

These two scheduling methods span different subspaces of all the possible schedules. The adaptive scheduling method proposed herein has a theoretical grounding and in practice much less dimensionality for tuning of scheduling hyperparameters than the presetting method, thus ameliorating the optimization of scheduling hyperparameters. The systematic comparison between the two scheduling methods for state-of-the-arts architectures could be a worthwhile avenue to pursue in the future.

4 Conclusion

In this paper, we have derived the fluctuation-dissipation relations with no assumptions other than stationarity of the probability distribution. These relations hold exactly even when the noise is non-Gaussian and the loss function is nonconvex. The relations have been empirically verified and used to probe the properties of the loss-function landscapes for the simple models. The relations further have resulted in the algorithm to adaptively set learning-rate schedule on the fly rather than presetting it in an ad hoc manner. In addition to systematically testing the performance of this adaptive scheduling algorithm, it would be interesting to investigate non-Gaussianity and noncovexity in more details through higher-point observables, both analytically and numerically. It would also be interesting to further elucidate the physics of machine learning by extending our formalism to incorporate nonstationary dynamics, linearly away from stationarity (Onsager, 1931; Green, 1954; Kubo, 1957) and beyond (Jarzynski, 1997; Crooks, 1999), so that it can in particular properly treat overfitting cascading dynamics and time-dependent sample distributions.

Acknowledgments

The author thanks Ludovic Berthier, Léon Bottou, Guy Gur-Ari, Kunihiko Kaneko, Dheevatsa Mudigere, Yann Ollivier, Yuandong Tian, and Mark Tygert for discussions. Special thanks go to Daniel Adam Roberts who prompted the practical application of the fluctuation-dissipation relations, leading to the adaptive method in Section 3.3.

References

Appendix A SGD with momentum and dampening

For SGD with momentum and dampening , the update equation is given by

(19)
(20)

Here is the velocity and the learning rate; SGD without momentum is the special case with . Again hypothesizing the existence of a stationary-state distribution , the stationary-state average of an observable is defined as

(21)

Just as in the main text, from the assumed stationarity follows the master equation for SGD with momentum and dampening

(22)

For the linear observables,

(23)

and

(24)

thus

(25)

For the quadratic observables

(26)
(27)

and

(28)

Note that the relations (26) and (27) are trivially satisfied at each time step if the left-hand side observables are evaluated at one step ahead and thus their being satisfied for running averages has nothing to do with equilibration [the same can be said about the relation (23)]; the only nontrivial relation is the equation (28), which is a consequence of setting constant of time. After taking traces and some rearrangement, we obtain the relation (FDR1’) in the main text.

For the full-batch loss function, the algebra similar to the one in the main text yields

Appendix B Models and simulation protocols

b.1 MLP on MNIST through SGD without momentum

The MNIST training data consist of black-white images of hand-written digits with -by- pixels (LeCun et al., 1998)

. We preprocess the data through an affine transformation such that their mean and variance (over both the training data and pixels) are zero and one, respectively.

Our multilayer perceptron (MLP) consists of a -dimensional input layer followed by a hidden layer of neurons with ReLU activations, another hidden layer of neurons with ReLU activations, and a -dimensional output layer with the softmax activation. The model performance is evaluated by the cross-entropy loss supplemented by the -regularization term with the weight decay .

Throughout the paper, the MLP is trained on the MNIST data through SGD without momentum. The data are shuffled at each epoch with the mini-batch size .

The MLP is initialized through the Xavier method (Glorot & Bengio, 2010) and trained for epochs with the learning rate . We then sequentially train it with . This sequential-run protocol is carried out with distinct seeds for the random-number generator used in data shuffling, all starting from the common model parameter attained at the end of the initial -epoch run. The figure 2 depicts trajectories for one particular seed, while the figure 3 plots means and error bars over these distinct seeds.

b.2 CNN on CIFAR-10 through SGD with momentum

The CIFAR-10 training data consist of color images of objects – divided into ten categories – with -by- pixels in each of color channels, each pixel ranging in  (Krizhevsky & Hinton, 2009). We preprocess the data through uniformly subtracting and multiplying by so that each pixel ranges in .

In order to describe the architecture of our convolutional neural network (CNN) in detail, let us associate a tuple to a convolutional layer with filter width , a number of channels

, stride

, and padding

, followed by ReLU activations and a max-pooling layer of width

. Then, as in the demo at Karpathy (2014), our CNN consists of a input layer followed by a convolutional layer with , another convolutional layer with , yet another convolutional layer with , and finally a fully-connected -dimensional output layer with the softmax activation. The model performance is evaluated by the cross-entropy loss supplemented by the -regularization term with the weight decay .

Throughout the paper (except in Section 3.3 where the adaptive scheduling method is tested for SGD without momentum), the CNN is trained on the CIFAR-10 data through SGD with momentum and dampening . The data are shuffled at each epoch with the mini-batch size .

The CNN is initialized through the Xavier method (Glorot & Bengio, 2010) and trained for epochs with the learning rate . We then sequentially train it with . At each junction of the sequence, the velocity is zeroed. This sequential-run protocol is carried out with distinct seeds for the random-number generator used in data shuffling, all starting from the common model parameter attained at the end of the initial -epoch run. The figure 2 depicts trajectories for one particular seed, while the figure 3 plots means and error bars over these distinct seeds.