Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States

11/19/2022
by Ziqiao Wang, et al.

Stochastic differential equations (SDEs) have recently been shown to characterize well the dynamics of training machine learning models with SGD. This provides two opportunities for better understanding the generalization behaviour of SGD through its SDE approximation. First, under the SDE characterization, SGD may be regarded as full-batch gradient descent with Gaussian gradient noise. This allows the generalization bounds developed by Xu & Raginsky (2017) to be applied to the generalization behaviour of SGD, resulting in upper bounds in terms of the mutual information between the training set and the training trajectory. Second, under mild assumptions, it is possible to obtain an estimate of the steady-state weight distribution of the SDE. Using this estimate, we apply the PAC-Bayes-like information-theoretic bounds developed in both Xu & Raginsky (2017) and Negrea et al. (2019) to obtain generalization upper bounds in terms of the KL divergence of the steady-state weight distribution of SGD with respect to a prior distribution. Among various options, one may choose the prior as the steady-state weight distribution obtained by SGD on the same training set but with one example held out. In this case, the bound can be elegantly expressed using the influence function (Koh & Liang, 2017), which suggests that the generalization of SGD is related to the stability of SGD. Various insights are presented along the development of these bounds, which are subsequently validated numerically.
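As a rough reference for the two templates above, the bounds can be sketched as follows (the notation here is assumed for illustration, not quoted from the paper). Let S = (Z_1, ..., Z_n) be the training set, W_{0:T} the training trajectory, and suppose the loss \ell(w, Z) is \sigma-sub-Gaussian for every w. The mutual-information bound of Xu & Raginsky (2017), applied to the trajectory, takes the form

    \mathbb{E}\big[\mathrm{gen}(S, W_T)\big] \le \sqrt{\frac{2\sigma^2}{n}\, I(S; W_{0:T})},

and since I(S; W) = \mathbb{E}_S\big[D_{\mathrm{KL}}(P_{W \mid S} \,\|\, P_W)\big] \le \mathbb{E}_S\big[D_{\mathrm{KL}}(P_{W \mid S} \,\|\, P)\big] for any data-independent prior P, the same template yields the PAC-Bayes-like form

    \mathbb{E}\big[\mathrm{gen}(S, W)\big] \le \sqrt{\frac{2\sigma^2}{n}\, \mathbb{E}_S\big[D_{\mathrm{KL}}(P_{W \mid S} \,\|\, P)\big]},

which is the shape of bound instantiated here with the steady-state weight distribution.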
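The influence-function connection at the end can likewise be made concrete with a standard first-order sketch (again with assumed notation; Negrea et al. (2019) permit data-dependent priors of exactly this leave-one-out type). If w^*(S) minimizes the empirical risk on S, S^{-i} denotes S with example z_i removed, and H is the Hessian of the empirical risk at w^*(S), the approximation of Koh & Liang (2017) gives

    w^*(S^{-i}) - w^*(S) \approx \frac{1}{n}\, H^{-1} \nabla_w \ell\big(w^*(S), z_i\big),

so if the posterior and the leave-one-out prior are both approximated by Gaussians with a common covariance, centered at w^*(S) and w^*(S^{-i}) respectively, the KL divergence between them scales with the squared (Mahalanobis) norm of this influence term. A small influence of every single example, i.e., stability of SGD, then translates directly into a small generalization bound.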


