# On Generalization Error Bounds of Noisy Gradient Methods for Non-Convex Learning

Generalization error (also known as the out-of-sample error) measures how well the hypothesis obtained from the training data generalizes to previously unseen data. Obtaining tight generalization error bounds is central to statistical learning theory. In this paper, we study generalization error bounds for learning general non-convex objectives, a topic that has attracted significant attention in recent years. In particular, we study the (algorithm-dependent) generalization bounds of various iterative gradient-based methods. (1) We present a very simple and elementary proof of a recent result for stochastic gradient Langevin dynamics (SGLD), due to Mou et al. (2018). Our proof can be easily extended to obtain similar generalization bounds for several other variants of SGLD (e.g., with postprocessing, momentum, mini-batch, acceleration, and more general noises), and improves upon the recent results in Pensia et al. (2018). (2) By incorporating ideas from the PAC-Bayesian theory into the stability framework, we obtain tighter distribution-dependent (or data-dependent) generalization bounds. Our bounds provide an intuitive explanation for the phenomenon reported in Zhang et al. (2017a). (3) We also study the setting where the total loss is the sum of a bounded loss and an additional ℓ2 regularization term. We obtain new generalization bounds for the continuous Langevin dynamics in this setting by leveraging the Log-Sobolev inequality. Our new bounds are more desirable when the noise level of the process is not small, and do not grow when T approaches infinity.

## 1 Introduction

Non-convex stochastic optimization is the major workhorse of modern machine learning. For instance, the standard supervised learning on a model class parametrized by $\mathbb{R}^d$ can be formulated as the following optimization problem:

$$\min_{w \in \mathbb{R}^d} \mathbb{E}_{z \sim \mathcal{D}}[f(w, z)],$$

where $w$ denotes the model parameter, $\mathcal{D}$ is an unknown data distribution over the instance space $\mathcal{Z}$, and $f: \mathbb{R}^d \times \mathcal{Z} \to \mathbb{R}$ is a given loss function which may be non-convex. A learning algorithm takes as input a collection $S = (z_1, \ldots, z_n)$ of $n$ data points sampled i.i.d. from $\mathcal{D}$, and outputs a (possibly randomized) parameter configuration $\hat{w} \in \mathbb{R}^d$.

A fundamental question in learning theory is to understand the generalization performance of learning algorithms: is the algorithm guaranteed to output a model that generalizes well to the data distribution $\mathcal{D}$? Specifically, we aim to prove upper bounds on the generalization error $\mathrm{err}_{\mathrm{gen}}$ (see Definition 2). Classical learning theory relates the generalization error to various complexity measures (e.g., the VC-dimension and Rademacher complexity) of the model class. Directly applying these classical complexity measures, however, fails to explain the recent success of over-parametrized neural networks (see e.g., Zhang et al. (2017a)), where the model complexity significantly exceeds the amount of available training data. By incorporating certain data-dependent quantities such as margin and compressibility into the classical framework, some recent work (e.g., Bartlett et al. (2017); Arora et al. (2018); Wei and Ma (2019)) obtained more meaningful generalization bounds in the deep learning context.

An alternative approach to showing generalization guarantees is to prove algorithm-dependent bounds. One celebrated example along this line is the algorithmic stability framework initiated by Bousquet and Elisseeff (2002). Roughly speaking, the generalization error can be bounded by the stability of the algorithm (see Section 2 for the details). Using this framework, Hardt et al. (2016) studied the stability (hence the generalization) of stochastic gradient descent (SGD) for both convex and non-convex functions. Their work motivates recent work on the generalization performance of several other gradient-based optimization algorithms (Kuzborskij and Lampert (2018); London (2016); Chaudhari et al. (2017); Raginsky et al. (2017); Mou et al. (2018); Pensia et al. (2018); Chen et al. (2018)).

In this paper, we study the algorithmic stability and generalization guarantees of various iterative gradient-based methods, with certain continuous noise injected in each iteration, in a non-convex setting. As a concrete example, we consider the stochastic gradient Langevin dynamics (SGLD) (see Raginsky et al. (2017); Mou et al. (2018); Pensia et al. (2018)). Viewed as a variant of SGD, SGLD adds isotropic Gaussian noise at every update step:

$$W_t \leftarrow W_{t-1} - \gamma_t g_t(W_{t-1}) + \frac{\sigma_t}{\sqrt{2}} \mathcal{N}(0, I_d), \tag{1}$$

where $g_t(W_{t-1})$ denotes either the full gradient or the gradient over a mini-batch sampled from the training dataset. We also study the continuous version of (1), which is the dynamic defined by the following stochastic differential equation (SDE):

$$\mathrm{d}W_t = -\nabla F(W_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}B_t, \tag{2}$$

where $B_t$ is the standard Brownian motion.
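As an illustration of update (1), the following is a minimal sketch of mini-batch SGLD in plain Python. The toy loss $f(w, z) = 1 - \exp(-\|w - z\|_2^2 / 2)$, the helper names `toy_grad` and `sgld`, and all numeric values are assumptions made for this example only; the noise is taken to have standard deviation $\sigma_t / \sqrt{2}$ per coordinate.

```python
import math
import random

def toy_grad(w, z):
    """Gradient in w of the toy loss f(w, z) = 1 - exp(-||w - z||^2 / 2).
    The loss is bounded in [0, 1] and non-convex (an assumption for illustration)."""
    diff = [wi - zi for wi, zi in zip(w, z)]
    scale = math.exp(-0.5 * sum(d * d for d in diff))
    return [scale * d for d in diff]

def sgld(data, w0, gammas, sigmas, batch_size, rng):
    """Run SGLD following update (1) and return the whole path [W_0, ..., W_T]."""
    w = list(w0)
    path = [list(w)]
    for gamma, sigma in zip(gammas, sigmas):
        batch = rng.sample(data, batch_size)  # mini-batch drawn without replacement
        grads = [toy_grad(w, z) for z in batch]
        g = [sum(col) / batch_size for col in zip(*grads)]  # mini-batch gradient g_t
        # Gradient step plus isotropic Gaussian noise with std sigma_t / sqrt(2).
        w = [wi - gamma * gi + (sigma / math.sqrt(2.0)) * rng.gauss(0.0, 1.0)
             for wi, gi in zip(w, g)]
        path.append(list(w))
    return path

rng = random.Random(0)
data = [[rng.gauss(1.0, 0.5), rng.gauss(-1.0, 0.5)] for _ in range(20)]
path = sgld(data, [0.0, 0.0], gammas=[0.1] * 50, sigmas=[0.05] * 50,
            batch_size=5, rng=rng)
print(len(path), path[-1])
```

Returning the whole path rather than only the last iterate makes it easy to apply different output rules (last iterate, averages) afterwards.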

### 1.1 Related Work

Most related to our work is the study of algorithm-dependent generalization bounds of stochastic gradient methods. Hardt et al. (2016) first study the generalization performance of SGD via algorithmic stability. They prove a generalization bound that scales linearly with $T$, the number of iterations, when the loss function is convex, but their results for general non-convex optimization are more restricted. Our work is a follow-up of the recent work by Mou et al. (2018), in which they provide generalization bounds for SGLD from both stability and PAC-Bayesian perspectives. Another closely related work by Pensia et al. (2018) derives similar bounds for noisy stochastic gradient methods, based on the information-theoretic framework of Xu and Raginsky (2017). However, their bounds scale as $O(\sqrt{1/n})$, where $n$ is the size of the training dataset, which is sub-optimal even for SGLD.

We acknowledge that besides the algorithm-dependent approach that we follow, recent advances in learning theory aim to explain the generalization performance of neural networks from many other perspectives. Some of the most prominent ideas include bounding the network capacity by the norm of weight matrices (Neyshabur et al. (2015); Liang et al. (2017)), margin theory (Bartlett et al. (2017); Wei et al. (2018)), PAC-Bayesian theory (Dziugaite and Roy (2017); Neyshabur et al. (2018); Dziugaite and Roy (2018)), network compressibility (Arora et al. (2018)), and over-parametrization (Du et al. (2018); Allen-Zhu et al. (2018); Zou et al. (2018); Chizat and Bach (2018)). Most of these results are stated in the context of neural networks (some are tailored to networks with specific architecture), whereas our work addresses generalization in non-convex stochastic optimization in general. We also note that some recent works provide explanations for the phenomenon reported in Zhang et al. (2017a) from a variety of different perspectives (e.g., Bartlett et al. (2017); Arora et al. (2018); Arora et al. (2019)).

Welling and Teh (2011) first consider stochastic gradient Langevin dynamics (SGLD) as a sampling algorithm in the Bayesian inference context. Raginsky et al. (2017) give a non-asymptotic analysis and establish the finite-time convergence guarantee of SGLD to an approximate global minimum. Zhang et al. (2017b) analyze the hitting time of SGLD and prove that SGLD converges to an approximate local minimum. These bounds are further improved and generalized to a family of Langevin dynamics based algorithms in the subsequent work of Xu et al. (2018).

### 1.2 Overview of Our Results

In this paper, we provide generalization guarantees for the noisy variants of several popular stochastic gradient methods.

##### The Bayes-Stability method and data-dependent generalization bounds.

We develop a new method, called Bayes-Stability, for proving generalization bounds by incorporating ideas from the PAC-Bayesian theory into the stability framework. In particular, assuming the loss takes value in $[0, C]$, our method shows that the generalization error is bounded by both $2C\,\mathbb{E}_z\big[\sqrt{2\,\mathrm{KL}(P, Q_z)}\big]$ and $2C\,\mathbb{E}_z\big[\sqrt{2\,\mathrm{KL}(Q_z, P)}\big]$, where $P$ is a prior distribution independent of the training set $S$, and $Q_z$ is the expected posterior distribution conditioned on $z_n = z$ (i.e., the last training data point is $z$); see Definition 6 and Theorem 8 for details.

Inspired by Lever et al. (2013), instead of using a fixed prior distribution, we bound the KL-divergence from the posterior to a distribution-dependent prior. This enables us to derive the following generalization error bound that depends on the expected norm of the gradient along the optimization path:

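$$\mathrm{err}_{\mathrm{gen}} = O\left(\frac{C}{n}\sqrt{\mathbb{E}_S\left[\sum_{t=1}^{T}\frac{\gamma_t^2}{\sigma_t^2}\, g_e(t)\right]}\right). \tag{3}$$

Here $S = (z_1, \ldots, z_n)$ is the dataset and $g_e(t) = \mathbb{E}_{w \sim W_{t-1}}\big[\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w, z_i)\|_2^2\big]$ is the expected empirical squared gradient norm at step $t$; see Theorem 9 for details.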

Compared with the previous bound $O\Big(\frac{LC}{n}\sqrt{\sum_{t=1}^{T}\gamma_t^2/\sigma_t^2}\Big)$ in (Mou et al., 2018, Theorem 1), where $L$ is the global Lipschitz constant of the loss, our new bound (3) depends on the data distribution and is typically tighter (as the gradient norm is at most $L$). In modern deep neural networks, the worst-case Lipschitz constant can be quite large, and typically much larger than the expected empirical gradient norm along the optimization trajectory. Specifically, in the later stage of training, the distribution of the parameter is mostly concentrated around a flat local minimum region, where the expected empirical gradient is small. Hence, our generalization bound does not grow much even if we train longer in this case.
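To make the comparison concrete, here is a small sketch that evaluates the quantity inside (3), without the universal constant hidden in $O(\cdot)$, and its worst-case counterpart in which every squared gradient norm is replaced by $L^2$. The function names, the decaying gradient-norm profile, and all numeric values are hypothetical and chosen only for illustration.

```python
import math

def trajectory_bound(C, n, gammas, sigmas, sq_grad_norms):
    """(C / n) * sqrt(sum_t (gamma_t / sigma_t)^2 * g_e(t)): bound (3) up to the O(.) constant."""
    total = sum((g / s) ** 2 * ge for g, s, ge in zip(gammas, sigmas, sq_grad_norms))
    return (C / n) * math.sqrt(total)

def lipschitz_bound(C, L, n, gammas, sigmas):
    """Worst-case counterpart: every g_e(t) is replaced by its upper bound L^2."""
    return trajectory_bound(C, n, gammas, sigmas, [L * L] * len(gammas))

# Hypothetical run: squared gradient norms decay as training progresses.
T, n, C, L = 1000, 10_000, 1.0, 5.0
gammas = [0.01] * T
sigmas = [0.1] * T
sq_grad_norms = [min(L * L, 4.0 / (1 + t)) for t in range(T)]

data_dependent = trajectory_bound(C, n, gammas, sigmas, sq_grad_norms)
worst_case = lipschitz_bound(C, L, n, gammas, sigmas)
print(data_dependent, worst_case)
```

With a decaying gradient profile the data-dependent quantity grows only logarithmically in the number of steps, while the Lipschitz-based quantity grows like $\sqrt{T}$.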

Our new bound also offers an explanation to the question regarding the difference between training on correct and random labels raised by Zhang et al. (2017a). In particular, we show empirically that the expected gradient norm (along the optimization path) is significantly higher when the training labels are replaced with random labels (Section 3, Remark 13).

This bound is similar in spirit to the PAC-Bayesian bound (for SGLD with $\ell_2$-regularization) proposed by Mou et al. (2018). Compared with their bound, our bound has a faster rate $O(1/n)$ (instead of $O(1/\sqrt{n})$) and can be easily extended to other general settings (e.g., momentum). One advantage of their bound is that in the numerator the contribution of each step decays exponentially through time if the regularization coefficient $\lambda > 0$ (however, if $\lambda = 0$, there is no such decay; see Theorem 2 in Mou et al. (2018)). Furthermore, we note that we can obtain a similar generalization bound in which the expected empirical gradient norm is replaced with the population gradient norm.

Extensions. We also remark that our technique allows for an arguably simpler proof of (Mou et al., 2018, Theorem 1), whose original proof was based on SDEs and the Fokker-Planck equation. More importantly, our technique can be easily extended to handle mini-batches and a variety of general settings as follows.

1. Extension to other gradient-based methods. Our results naturally extend to other noisy stochastic gradient methods including momentum Polyak (1964) (Theorem 24), Nesterov’s accelerated gradient method Nesterov (1983) (Theorem 24), and Entropy-SGD Chaudhari et al. (2017) (Theorem 25).

2. Extension to general noises. The proof of the generalization bound in Mou et al. (2018) relies heavily on the fact that the noise is Gaussian (in particular, their proof leverages the Fokker-Planck equation, which describes the time evolution of the density function associated with the Langevin dynamics and can only handle Gaussian noise), which makes it difficult to generalize to other noise distributions such as the Laplace distribution. In contrast, our analysis easily carries over to the class of log-Lipschitz noises (noises drawn from distributions with Lipschitz log densities).

3. Pathwise stability. In practice, it is also natural to output a certain function of the entire optimization path, e.g., the one with the smallest empirical risk or a weighted average. We show that the same generalization bound holds for all such decision rules (Remark 12). We note that the analysis in the independent work of Pensia et al. (2018) also satisfies this property; their bound is stated in Corollary 1 of their work. Their bound scales at a slower rate of $O(1/\sqrt{n})$ (instead of $O(1/n)$) when dealing with $C$-bounded loss. (They assume the loss is sub-Gaussian. By Hoeffding’s lemma, $C$-bounded random variables are sub-Gaussian with parameter $C/2$.)
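To illustrate the log-Lipschitz condition mentioned in item 2 above, the sketch below numerically checks that the log-density of a one-dimensional Laplace distribution with scale $b$ is Lipschitz with constant $1/b$. The helper names and the scale value are assumptions for this example; the check is a sanity test on random points, not a proof.

```python
import math
import random

def laplace_logpdf(x, b):
    """Log-density of the one-dimensional Laplace(0, b) distribution."""
    return -abs(x) / b - math.log(2.0 * b)

def max_slope(logpdf, points):
    """Largest |log p(x) - log p(y)| / |x - y| over consecutive sample points."""
    return max(abs(logpdf(x) - logpdf(y)) / abs(x - y)
               for x, y in zip(points, points[1:]) if x != y)

rng = random.Random(0)
b = 0.5  # hypothetical scale parameter
points = [rng.uniform(-50.0, 50.0) for _ in range(2000)]
laplace_slope = max_slope(lambda x: laplace_logpdf(x, b), points)
print(laplace_slope, 1.0 / b)  # the observed slope never exceeds 1 / b
```

The bound follows from the reverse triangle inequality: $|\,|x| - |y|\,| \le |x - y|$, so the log-density changes by at most $|x - y| / b$.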

##### Generalization bounds with ℓ2 regularization via Log-Sobolev inequalities.

We also study the setting where the total loss is the sum of a $C$-bounded loss $f$ and an additional $\ell_2$ regularization term $\frac{\lambda}{2}\|w\|_2^2$. In this case, the total loss $F_S$ can be treated as a perturbation of a quadratic function, and the continuous Langevin dynamics (CLD) is well understood for quadratic functions. In particular, we obtain two generalization bounds for CLD, both via the technique of Log-Sobolev inequalities, a powerful tool for proving the convergence rate of CLD. One of our bounds is as follows (Theorem 14):

$$\mathrm{err}_{\mathrm{gen}} \le \frac{2e^{4\beta C} C L}{n}\sqrt{\frac{\beta}{\lambda}\left(1 - \exp\left(-\frac{\lambda T}{e^{8\beta C}}\right)\right)}. \tag{4}$$

The above bound has the following advantages:

1. Using $1 - e^{-x} \le x$ for $x \ge 0$, one can see that our bound is at most $2CL\sqrt{\beta T}/n = O\big(CL\sqrt{\beta T}/n\big)$, which matches the previous bound in (Mou et al., 2018, Proposition 8).

2. As time $T$ grows, the bound is upper bounded by and approaches $2e^{4\beta C} C L n^{-1}\sqrt{\beta/\lambda}$ (unlike the previous bound, which goes to infinity as $T \to \infty$).

3. If the noise level is not so small (i.e., $\beta$ is not very large), the generalization bound is quite desirable.
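The three properties above can be checked numerically. The following sketch evaluates the right-hand side of (4), reading it as $\frac{2e^{4\beta C}CL}{n}\sqrt{\frac{\beta}{\lambda}(1 - e^{-\lambda T / e^{8\beta C}})}$, together with its linear-in-time relaxation and its limit as $T \to \infty$. The function names and parameter values are hypothetical.

```python
import math

def cld_bound(C, L, n, beta, lam, T):
    """Right-hand side of (4); -expm1(-x) computes 1 - exp(-x) accurately for small x."""
    rate = lam / math.exp(8.0 * beta * C)
    return (2.0 * math.exp(4.0 * beta * C) * C * L / n) * math.sqrt(
        beta / lam * (-math.expm1(-rate * T)))

def linear_in_time_bound(C, L, n, beta, T):
    """Relaxation obtained from 1 - exp(-x) <= x: 2 C L sqrt(beta T) / n."""
    return 2.0 * C * L * math.sqrt(beta * T) / n

def limit_bound(C, L, n, beta, lam):
    """Limit of (4) as T tends to infinity."""
    return 2.0 * math.exp(4.0 * beta * C) * C * L / n * math.sqrt(beta / lam)

C, L, n, beta, lam = 1.0, 1.0, 10_000, 0.5, 0.1  # hypothetical values
horizons = (1.0, 100.0, 1e4, 1e6)
rows = [(T, cld_bound(C, L, n, beta, lam, T),
         linear_in_time_bound(C, L, n, beta, T),
         limit_bound(C, L, n, beta, lam)) for T in horizons]
for row in rows:
    print(row)
```

For short horizons the linear-in-time relaxation is the tighter of the two reference values, while for long horizons the bound saturates at the limit value instead of growing with $T$.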

Our analysis is based on a Log-Sobolev inequality (LSI) for the parameter distribution at time $t$, whereas most known LSIs only hold for the stationary distribution of the Markov process. We prove the new LSI by exploiting the variational formulation of the entropy formula.

## 2 Preliminaries

##### Notations.

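We use $\mathcal{D}$ to denote the data distribution. The training dataset $S = (z_1, \ldots, z_n)$ is a sequence of $n$ i.i.d. random variables drawn from $\mathcal{D}$. $S, S' \in \mathcal{Z}^n$ are neighboring datasets if and only if they differ at exactly one data point (we could assume without loss of generality that $z_n \neq z'_n$). Let $f(w, z)$ be the loss function, where $w$ denotes a model parameter in $\mathbb{R}^d$. We also define $f(w, S) = \frac{1}{n}\sum_{i=1}^{n} f(w, z_i)$ as the average loss on dataset $S$. Let $\mathcal{B}$ be the set of all possible mini-batches. $\mathcal{B}_n$ denotes the collection of mini-batches that contain the index $n$, while $\overline{\mathcal{B}}_n = \mathcal{B} \setminus \mathcal{B}_n$. Let $\mathrm{diam}(A) = \sup_{x, y \in A}\|x - y\|_2$ denote the diameter of a set $A$.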

###### Definition 1 (L-lipschitz).

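A loss function $f(w, z)$ is $L$-lipschitz in $w$ if $|f(w_1, z) - f(w_2, z)| \le L\|w_1 - w_2\|_2$ holds for any $z \in \mathcal{Z}$ and $w_1, w_2 \in \mathbb{R}^d$. Note that this implies that $\|\nabla_w f(w, z)\|_2 \le L$.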

###### Definition 2 (Generalization error).

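The generalization error is defined as

$$\mathrm{err}_{\mathrm{gen}} = \mathbb{E}_S \mathbb{E}_{\mathcal{A}}\big[f(\mathcal{A}(S)) - f(\mathcal{A}(S), S)\big],$$

where $f(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w, z)]$ is the population loss, and $\mathcal{A}$ is a learning algorithm.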

###### Assumption 3.

The loss function $f$ is differentiable, $C$-bounded and $L$-lipschitz in $w$.

##### Algorithmic Stability.

Intuitively, a learning algorithm that is stable (i.e., a small perturbation of the training data does not affect its output too much) can generalize well. In the seminal work of Bousquet and Elisseeff (2002) (see also Hardt et al. (2016)), the authors formally defined algorithmic stability and established a close connection between the stability of a learning algorithm and its generalization performance.

###### Definition 4 (Uniform stability).

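(Bousquet and Elisseeff (2002)) A randomized algorithm $\mathcal{A}$ is $\epsilon_n$-uniformly stable w.r.t. loss $f$, if for all neighboring sets $S, S' \in \mathcal{Z}^n$, it holds that

$$\sup_{z \in \mathcal{Z}}\big|\mathbb{E}_{\mathcal{A}}[f(w_S, z)] - \mathbb{E}_{\mathcal{A}}[f(w_{S'}, z)]\big| \le \epsilon_n,$$

where $w_S$ and $w_{S'}$ denote the outputs of $\mathcal{A}$ on $S$ and $S'$ respectively.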

###### Lemma 5 (Generalization in expectation).

(Hardt et al. (2016)) Suppose a randomized algorithm $\mathcal{A}$ is $\epsilon_n$-uniformly stable. Then, $|\mathrm{err}_{\mathrm{gen}}| \le \epsilon_n$.

## 3 Bayes-Stability Method

In this section, we incorporate ideas from the PAC-Bayesian theory (see e.g., Lever et al. (2013)) into the algorithmic stability framework. Combined with the technical tools introduced in previous sections, the new framework enables us to prove tighter data-dependent generalization bounds.

First, we define the posterior of a dataset and the posterior of a single data point.

###### Definition 6 (Single-point posterior).

Let $Q_S$ be the posterior distribution of the parameter for a given training dataset $S = (z_1, \ldots, z_n)$. In other words, it is the probability distribution of the output of the learning algorithm on dataset $S$ (e.g., for $T$ iterations of SGLD, $Q_S$ is the pdf of $W_T$). The single-point posterior $Q_{(i,z)}$ is defined as

$$Q_{(i,z)} = \mathbb{E}_{(z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n)}\big[Q_{(z_1, \ldots, z_{i-1}, z, z_{i+1}, \ldots, z_n)}\big].$$

For convenience, we make the following assumption on the learning algorithm:

###### Assumption 7 (Order-independent).

For any fixed dataset $S = (z_1, \ldots, z_n)$ and any permutation $p$, $Q_S$ is the same as $Q_{S^p}$, where $S^p = (z_{p_1}, \ldots, z_{p_n})$.

Assumption 7 implies $Q_{(1,z)} = \cdots = Q_{(n,z)}$. So we use $Q_z$ as a shorthand for $Q_{(i,z)}$ in the following. Note that this assumption can be easily satisfied if the learning algorithm permutes the training data randomly at the beginning. It is also easy to verify that both SGD and SGLD satisfy the order-independent assumption.

Now, we state our new Bayes-stability framework, which holds for any prior distribution $P$ over the parameter space that is independent of the training dataset $S$.

###### Theorem 8 (Bayes-Stability).

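Under Assumptions 3 and 7, for any prior distribution $P$ not depending on $S$, the generalization error is bounded by both $2C\,\mathbb{E}_z\big[\sqrt{2\,\mathrm{KL}(P, Q_z)}\big]$ and $2C\,\mathbb{E}_z\big[\sqrt{2\,\mathrm{KL}(Q_z, P)}\big]$.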

Applying this general framework, we obtain the following concrete data-dependent generalization bounds for SGLD:

###### Theorem 9.

Suppose that Assumption 3 and the following conditions hold:

1. Batch size .

2. Learning rate .

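Let $g_e(t) = \mathbb{E}_{w \sim W_{t-1}}\big[\frac{1}{n}\sum_{i=1}^{n}\|\nabla f(w, z_i)\|_2^2\big]$ be the empirical squared gradient norm. Then, the following generalization error bound holds for $T$ iterations of SGLD:

$$\mathrm{err}_{\mathrm{gen}} = O\left(\frac{C}{n}\sqrt{\mathbb{E}_{S \sim \mathcal{D}^n}\left[\sum_{t=1}^{T}\frac{\gamma_t^2}{\sigma_t^2}\, g_e(t)\right]}\right), \tag{Empirical norm}$$

where $S = (z_1, \ldots, z_n)$ is the dataset and $W_t$ is the parameter at step $t$ of SGLD for the given dataset $S$.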

Proof Sketch of Theorem 9. The proof builds upon the following two technical lemmas, which we prove in Appendix A.2.

###### Lemma 10.

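Let $(W_0, \ldots, W_T)$ and $(W'_0, \ldots, W'_T)$ be two sequences of random variables such that for each $t$, $W_t$ and $W'_t$ have the same support. Suppose $W_0$ and $W'_0$ follow the same distribution. Then,

$$\mathrm{KL}(W_{\le T}, W'_{\le T}) = \sum_{t=1}^{T}\mathbb{E}_{w_{<t} \sim W_{<t}}\Big[\mathrm{KL}\big(W_t \mid W_{<t} = w_{<t},\; W'_t \mid W'_{<t} = w_{<t}\big)\Big],$$

where $W_{\le t} = (W_0, \ldots, W_t)$ and $W_{<t} = W_{\le t-1}$.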
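Lemma 10 is the chain rule for the KL-divergence. As a sanity check, the sketch below verifies it numerically for $T = 1$ on a two-state example with a shared initial distribution, and also confirms that the divergence between the final marginals is no larger than the divergence between the full sequences. All probabilities are hypothetical numbers chosen for illustration.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as dicts on the same support."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

# Shared initial distribution of W_0 and W'_0.
init = {0: 0.3, 1: 0.7}
# Transition kernels P(W_1 = . | W_0 = w0) of the two processes (hypothetical numbers).
step = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
step_prime = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}

def joint(init, step):
    """Joint law of (W_0, W_1)."""
    return {(w0, w1): init[w0] * step[w0][w1] for w0 in init for w1 in step[w0]}

def marginal(init, step):
    """Law of W_1 alone."""
    return {w1: sum(init[w0] * step[w0][w1] for w0 in init) for w1 in (0, 1)}

lhs = kl(joint(init, step), joint(init, step_prime))                # KL of the sequences
rhs = sum(init[w0] * kl(step[w0], step_prime[w0]) for w0 in init)   # chain-rule sum
last = kl(marginal(init, step), marginal(init, step_prime))         # KL of last iterates
print(lhs, rhs, last)
```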

###### Lemma 11.

Suppose that batch size . and are two collections of points in labeled by mini-batches of size that satisfy the following conditions for constants : (1) for and for ; (2) . (See Section 2 for the definitions of , and .)

Let

denote the Gaussian distribution

. Let and be two mixture distributions over all mini-batches. Then, for some universal constant .

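Define $P = \mathbb{E}_{\bar{S}}\big[Q_{(\bar{S}, \mathbf{0})}\big]$, where $\bar{S}$ consists of $n - 1$ i.i.d. samples from $\mathcal{D}$ and $\mathbf{0}$ denotes the zero data point (i.e., $f(w, \mathbf{0}) = 0$ for any $w$). Theorem 8 shows that

$$\mathrm{err}_{\mathrm{gen}} \le 2C\,\mathbb{E}_z\Big[\sqrt{2\,\mathrm{KL}(Q_z, P)}\Big]. \tag{5}$$

By the convexity of KL-divergence, for a fixed $z \in \mathcal{Z}$, we have

$$\mathrm{KL}(Q_z, P) = \mathrm{KL}\Big(\mathbb{E}_{\bar{S}}\big[Q_{(\bar{S}, z)}\big],\; \mathbb{E}_{\bar{S}}\big[Q_{(\bar{S}, \mathbf{0})}\big]\Big) \le \mathbb{E}_{\bar{S}}\Big[\mathrm{KL}\big(Q_{(\bar{S}, z)},\, Q_{(\bar{S}, \mathbf{0})}\big)\Big]. \tag{6}$$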

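Let $(W_t)_{t \ge 0}$ and $(W'_t)_{t \ge 0}$ be the training processes of SGLD for $S = (\bar{S}, z)$ and $S' = (\bar{S}, \mathbf{0})$, respectively. Note that for a fixed $w_{<t}$, both $W_t \mid W_{<t} = w_{<t}$ and $W'_t \mid W'_{<t} = w_{<t}$ are Gaussian mixtures. By Lemma 11, we have

$$\mathrm{KL}\big(W_t \mid W_{<t} = w_{<t},\; W'_t \mid W'_{<t} = w_{<t}\big) \le \frac{C_0}{n^2}\cdot\frac{\gamma_t^2}{\sigma_t^2}\,\|\nabla f(w_{t-1}, z)\|_2^2.$$

Applying Lemma 10 and the data-processing inequality $\mathrm{KL}(W_T, W'_T) \le \mathrm{KL}(W_{\le T}, W'_{\le T})$ gives

$$\mathrm{KL}(Q_S, Q_{S'}) \le \frac{C_0}{n^2}\sum_{t=1}^{T}\frac{\gamma_t^2}{\sigma_t^2}\,\mathbb{E}_{w \sim W_{t-1}}\|\nabla f(w, z)\|_2^2.$$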
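To see where the factor $\gamma_t^2 / (n^2 \sigma_t^2)$ comes from, it may help to work out the simplest special case of full-batch gradients, where no mixture over mini-batches appears and Lemma 11 is not needed. The derivation below is a sketch under two assumptions: the noise in (1) has covariance $\frac{\sigma_t^2}{2} I_d$, and the zero data point contributes a zero gradient.

```latex
\begin{aligned}
W_t \mid W_{<t} = w_{<t} &\sim \mathcal{N}\!\Big(w_{t-1} - \gamma_t \nabla f(w_{t-1}, S),\ \tfrac{\sigma_t^2}{2} I_d\Big), \\
W'_t \mid W'_{<t} = w_{<t} &\sim \mathcal{N}\!\Big(w_{t-1} - \gamma_t \nabla f(w_{t-1}, S'),\ \tfrac{\sigma_t^2}{2} I_d\Big), \\
\nabla f(w, S) - \nabla f(w, S') &= \tfrac{1}{n}\big(\nabla f(w, z) - \nabla f(w, \mathbf{0})\big) = \tfrac{1}{n}\nabla f(w, z), \\
\mathrm{KL}\Big(\mathcal{N}\big(\mu, \tfrac{\sigma_t^2}{2} I_d\big),\ \mathcal{N}\big(\mu', \tfrac{\sigma_t^2}{2} I_d\big)\Big) &= \frac{\|\mu - \mu'\|_2^2}{\sigma_t^2}
\;\Longrightarrow\;
\mathrm{KL}\big(W_t \mid w_{<t},\ W'_t \mid w_{<t}\big) = \frac{\gamma_t^2}{n^2 \sigma_t^2}\,\|\nabla f(w_{t-1}, z)\|_2^2 .
\end{aligned}
```

In this special case the per-step inequality holds with constant 1; the mini-batch case is harder because the conditional laws become Gaussian mixtures, which is what Lemma 11 addresses.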

Recall that $W_t$ is the parameter at step $t$ using $S = (\bar{S}, z)$ as the dataset. In this case, we can rewrite $z$ as $z_n$ since it is the $n$-th data point of $S$. Since SGLD satisfies the order-independent assumption, we can rewrite $z_n$ as $z_i$ for all $i \in [n]$. Together with (5), (6), and the concavity of $\sqrt{x}$ (Jensen's inequality), this proves the theorem.

Furthermore, if we bound $\mathrm{KL}(P, Q_z)$ instead of $\mathrm{KL}(Q_z, P)$, we can obtain an analogous bound that depends on the population gradient norm instead of the empirical gradient norm.

The full proofs of the above results are postponed to Appendix A, and we provide some remarks about the new bounds.

###### Remark 12.

In fact, our proof establishes that the above upper bound holds for the KL-divergence between the two full sequences $W_{\le T}$ and $W'_{\le T}$. Hence, our bound holds for any sufficiently regular function of the parameter sequence $W_{\le T}$. In particular, our generalization error bound automatically extends to several variations such as outputting the average of the sequence, the average of a suffix of a certain length, or the exponential moving average.
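The output rules mentioned in this remark are all functions of the path $W_{\le T}$. A minimal sketch of three such rules follows; the function names, the decay parameter, and the example path are hypothetical.

```python
def average(path):
    """Coordinate-wise mean of all iterates in the path."""
    return [sum(col) / len(path) for col in zip(*path)]

def suffix_average(path, k):
    """Coordinate-wise mean of the last k iterates."""
    return average(path[-k:])

def exponential_moving_average(path, decay):
    """EMA of the iterates: e_0 = W_0 and e_t = decay * e_{t-1} + (1 - decay) * W_t."""
    ema = list(path[0])
    for w in path[1:]:
        ema = [decay * e + (1.0 - decay) * wi for e, wi in zip(ema, w)]
    return ema

path = [[0.0, 4.0], [1.0, 2.0], [2.0, 0.0], [3.0, -2.0]]  # hypothetical W_0, ..., W_3
print(average(path))                          # [1.5, 1.0]
print(suffix_average(path, 2))                # [2.5, -1.0]
print(exponential_moving_average(path, 0.5))  # [2.125, -0.25]
```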

###### Remark 13.

We reproduce the experiment in Zhang et al. (2017a). (See Appendix C for more experiment details.) As shown in Figure 1, both empirical and population gradients have significantly larger norms when training on random labels than on normal labels. Moreover, the curve of the cumulative empirical squared gradient norm looks quite close to the generalization error curve. This suggests that the generalization bounds in Theorem 9 can distinguish randomly labelled data from normal data.

## 4 Generalization of CLD and GLD with ℓ2 regularization

In this section, we study the generalization error of Continuous Langevin Dynamics (CLD) with $\ell_2$ regularization. Let the total loss function over training set $S$ be $F_S(w) := f(w, S) + \frac{\lambda}{2}\|w\|_2^2$. The Continuous Langevin Dynamics is defined by the following SDE:

$$\mathrm{d}W_t = -\nabla F_S(W_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}B_t, \qquad W_0 \sim \mu_0, \tag{CLD}$$

where $B_t$ is the standard Brownian motion on $\mathbb{R}^d$ and the initial distribution $\mu_0$ is the centered Gaussian distribution in $\mathbb{R}^d$ with covariance $\frac{1}{\lambda\beta} I_d$. We show that the generalization error of CLD is upper bounded by $2e^{4\beta C} C L n^{-1}\sqrt{\beta/\lambda}$, which is independent of the training time $T$ (Theorem 14). Furthermore, as $T$ goes to infinity, we have a tighter generalization error bound (Theorem 37 in Appendix B). We also study the generalization of Gradient Langevin Dynamics (GLD), which is the discretization of CLD:

$$W_{k+1} = W_k - \eta\nabla F_S(W_k) + \sqrt{2\eta\beta^{-1}}\,\xi_k, \tag{GLD}$$

where $\xi_k$ is the standard Gaussian random vector in $\mathbb{R}^d$. Using a result developed in Raginsky et al. (2017), we can show that, as $K\eta^2$ tends to zero, GLD has the same generalization as CLD (see Theorems 14 and 37). We now formally state our first main result in this section.
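The (GLD) recursion can be simulated directly. The sketch below is a minimal illustration in plain Python, assuming the toy bounded loss $f(w, z) = 1 - \exp(-\|w - z\|_2^2 / 2)$, a regularizer $\frac{\lambda}{2}\|w\|_2^2$, and a centered Gaussian initialization with variance $1 / (\lambda\beta)$ per coordinate. The function names and all numeric values are hypothetical.

```python
import math
import random

def grad_total_loss(w, data, lam):
    """Gradient of F_S(w) = (1/n) sum_i f(w, z_i) + (lam / 2) ||w||^2 for the toy loss
    f(w, z) = 1 - exp(-||w - z||^2 / 2), which is bounded in [0, 1] and non-convex."""
    grad = [lam * wi for wi in w]
    for z in data:
        diff = [wi - zi for wi, zi in zip(w, z)]
        scale = math.exp(-0.5 * sum(d * d for d in diff)) / len(data)
        grad = [g + scale * d for g, d in zip(grad, diff)]
    return grad

def gld(data, dim, lam, beta, eta, num_steps, rng):
    """Run the (GLD) recursion for num_steps iterations and return the final iterate."""
    # Assumed initialization: W_0 ~ N(0, I / (lam * beta)).
    w = [rng.gauss(0.0, 1.0 / math.sqrt(lam * beta)) for _ in range(dim)]
    noise_scale = math.sqrt(2.0 * eta / beta)
    for _ in range(num_steps):
        g = grad_total_loss(w, data, lam)
        w = [wi - eta * gi + noise_scale * rng.gauss(0.0, 1.0) for wi, gi in zip(w, g)]
    return w

rng = random.Random(1)
data = [[rng.gauss(1.0, 0.5), rng.gauss(-1.0, 0.5)] for _ in range(20)]
w_final = gld(data, dim=2, lam=0.1, beta=5.0, eta=0.01, num_steps=200, rng=rng)
print(w_final)
```

Smaller step sizes `eta` with the product `eta * num_steps` held fixed bring the recursion closer to the continuous dynamics at time $T = \eta K$.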

###### Theorem 14.

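Under Assumption 3, CLD (with initial probability measure $\mu_0$) has the following expected generalization error bound:

$$\mathrm{err}_{\mathrm{gen}} \le \frac{2e^{4\beta C} C L}{n}\sqrt{\frac{\beta}{\lambda}\left(1 - \exp\left(-\frac{\lambda T}{e^{8\beta C}}\right)\right)}. \tag{7}$$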

In addition, if is -smooth and non-negative, by setting , and , GLD (running iterations with the same as CLD) has the expected generalization error bound:

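$$\mathrm{err}_{\mathrm{gen}} \le 2C\sqrt{2KC_1\eta^2} + \frac{2CLe^{4\beta C}}{n}\sqrt{\frac{\beta}{\lambda}\left(1 - \exp\left(-\frac{\lambda\eta K}{e^{8\beta C}}\right)\right)}, \tag{8}$$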

where is a constant that only depends on , , , , and .

The following lemma is crucial for establishing the above generalization bound for CLD. In particular, we need to establish a Log-Sobolev inequality for $\pi_t$, the parameter distribution at time $t$, for every time step $t > 0$. In contrast, most known LSIs only characterize the stationary distribution of the Markov process. The proof of the lemma can be found in Appendix B.

###### Lemma 15.

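Under Assumption 3, let $\pi_t$ be the probability measure of $W_t$ in CLD (with $W_0 \sim \mu_0$). Let $\gamma$ be a probability measure that is absolutely continuous with respect to $\pi_t$. Suppose $\pi_t$ and $\gamma$ have densities $\pi_t(w)$ and $\gamma(w)$ with respect to the Lebesgue measure. Then, it holds that

$$\mathrm{KL}(\gamma, \pi_t) \le \frac{\exp(8\beta C)}{2\lambda\beta}\int_{\mathbb{R}^d}\left\|\nabla\log\frac{\gamma(w)}{\pi_t(w)}\right\|_2^2\gamma(w)\,\mathrm{d}w.$$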

We sketch the proof of Theorem 14 in the following (see the complete proof in Appendix B).

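Proof Sketch of Theorem 14. Suppose $S$ and $S'$ are two neighboring datasets that differ on exactly one data point. Let $(W_t)_{t \ge 0}$ and $(W'_t)_{t \ge 0}$ be the processes of CLD running on $S$ and $S'$, respectively. Let $\pi_t$ and $\gamma_t$ be the pdfs of $W_t$ and $W'_t$, respectively. We have

$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}t}\mathrm{KL}(\gamma_t, \pi_t) &= -\frac{1}{\beta}\int_{\mathbb{R}^d}\gamma_t\left\|\nabla\log\frac{\gamma_t}{\pi_t}\right\|_2^2\mathrm{d}w + \int_{\mathbb{R}^d}\gamma_t\left\langle\nabla\log\frac{\gamma_t}{\pi_t},\, \nabla F_S - \nabla F_{S'}\right\rangle\mathrm{d}w \\
&\le -\frac{\lambda}{e^{8\beta C}}\mathrm{KL}(\gamma_t, \pi_t) + \frac{2\beta L^2}{n^2}. \qquad \text{(Lemma 15)}
\end{aligned}$$

Solving this inequality gives $\mathrm{KL}(\gamma_T, \pi_T) \le \frac{2\beta L^2 e^{8\beta C}}{\lambda n^2}\big(1 - e^{-\lambda T / e^{8\beta C}}\big)$. Hence the generalization error of CLD can be bounded by $C\sqrt{2\,\mathrm{KL}(\gamma_T, \pi_T)}$, which proves the first part. The second part of the theorem follows from Lemma 34 in Appendix B.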
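For completeness, the step of solving the differential inequality can be spelled out with an integrating factor. The sketch below assumes $\mathrm{KL}(\gamma_0, \pi_0) = 0$ (both processes start from the same initial distribution) and that the final stability step takes the form $C\sqrt{2\,\mathrm{KL}}$, which is the form that reproduces the constant in (7); the shorthand $u$, $a$, $b$ is introduced only for this derivation.

```latex
\begin{aligned}
&\text{Let } u(t) = \mathrm{KL}(\gamma_t, \pi_t),\quad a = \frac{\lambda}{e^{8\beta C}},\quad b = \frac{2\beta L^2}{n^2},\quad u(0) = 0,\quad u'(t) \le -a\,u(t) + b. \\
&\frac{\mathrm{d}}{\mathrm{d}t}\big(e^{at}u(t)\big) = e^{at}\big(u'(t) + a\,u(t)\big) \le b\,e^{at}
\;\Longrightarrow\; e^{aT}u(T) \le \frac{b}{a}\big(e^{aT} - 1\big) \\
&\Longrightarrow\; u(T) \le \frac{b}{a}\big(1 - e^{-aT}\big) = \frac{2\beta L^2 e^{8\beta C}}{\lambda n^2}\Big(1 - e^{-\lambda T / e^{8\beta C}}\Big), \\
&C\sqrt{2\,u(T)} \le \frac{2 e^{4\beta C} C L}{n}\sqrt{\frac{\beta}{\lambda}\Big(1 - e^{-\lambda T/e^{8\beta C}}\Big)}.
\end{aligned}
```

The last line coincides with the right-hand side of (7), and letting $T \to \infty$ gives the time-independent value $2e^{4\beta C} C L n^{-1}\sqrt{\beta/\lambda}$.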

Our second generalization bound for CLD is given in Theorem 37 in Appendix B. The high-level idea of its proof is very similar to that in Raginsky et al. (2017). We first observe that the (stationary) Gibbs distribution has a small generalization error. Then, we bound the distance from the parameter distribution at time $t$ to the Gibbs distribution. In our setting, we can use the Holley-Stroock perturbation lemma, which allows us to bound the logarithmic Sobolev constant, and we can thus bound the above distance easily.

## 5 Future Directions

In this paper, we prove several new generalization bounds for a variety of noisy gradient-based methods. Our current techniques can only handle continuous noises for which we can bound the KL-divergence. One future direction is to handle the discrete noise introduced in SGD (in this case the KL divergence may not be well defined). For either SGLD or CLD, if the noise level is small (i.e., $\beta$ is large), it may take a long time for the diffusion process to reach the stationary distribution. Hence, another interesting future direction is to consider the local behavior and generalization of the diffusion process in finite time through the techniques developed in the studies of metastability (see e.g., Bovier et al. (2005); Bovier and den Hollander (2006); Tzen et al. (2018)). In particular, these techniques may be helpful for further improving the bounds in Theorems 14 and 37 (when $T$ is not very large).

## References

• Allen-Zhu et al. (2018) Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. 2018. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918 (2018).
• Arora et al. (2019) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. 2019. On Exact Computation with an Infinitely Wide Neural Net. arXiv preprint arXiv:1904.11955 (2019).
• Arora et al. (2018) Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. 2018. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning (ICML). 254–263.
• Bakry et al. (2013) Dominique Bakry, Ivan Gentil, and Michel Ledoux. 2013. Analysis and geometry of Markov diffusion operators. Vol. 348. Springer Science & Business Media.
• Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. 2017. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems (NeurIPS). 6240–6249.
• Bousquet and Elisseeff (2002) Olivier Bousquet and André Elisseeff. 2002. Stability and generalization. Journal of machine learning research 2, Mar (2002), 499–526.
• Bovier and den Hollander (2006) Anton Bovier and Frank den Hollander. 2006. Metastability: a potential theoretic approach. In International Congress of Mathematicians, Vol. 3. Eur. Math. Soc. Zürich, 499–518.
• Bovier et al. (2005) Anton Bovier, Véronique Gayrard, and Markus Klein. 2005. Metastability in reversible diffusion processes II: Precise asymptotics for small eigenvalues. Journal of the European Mathematical Society 7, 1 (2005), 69–99.
• Chaudhari et al. (2017) Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. 2017. Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations (ICLR).
• Chen et al. (2018) Yuansi Chen, Chi Jin, and Bin Yu. 2018. Stability and Convergence Trade-off of Iterative Optimization Algorithms. arXiv preprint arXiv:1804.01619 (2018).
• Chizat and Bach (2018) Lenaic Chizat and Francis Bach. 2018. A Note on Lazy Training in Supervised Differentiable Programming. arXiv preprint arXiv:1812.07956 (2018).
• Du et al. (2018) Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. 2018. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804 (2018).
• Duchi (2007) John Duchi. 2007. Derivations for linear algebra and optimization. Berkeley, California 3 (2007).
• Dziugaite and Roy (2018) Gintare Karolina Dziugaite and Daniel Roy. 2018. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors. In International Conference on Machine Learning (ICML). 1377–1386.
• Dziugaite and Roy (2017) Gintare Karolina Dziugaite and Daniel M Roy. 2017. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Uncertainty in Artificial Intelligence (UAI).
• Hardt et al. (2016) Moritz Hardt, Benjamin Recht, and Yoram Singer. 2016. Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning (ICML). 1225–1234.
• Holley and Stroock (1987) Richard Holley and Daniel Stroock. 1987. Logarithmic Sobolev inequalities and stochastic Ising models. Journal of statistical physics 46, 5 (1987), 1159–1194.
• Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report. Citeseer.
• Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS). 1097–1105.
• Kuzborskij and Lampert (2018) I. Kuzborskij and C. H. Lampert. 2018. Data-Dependent Stability of Stochastic Gradient Descent. In International Conference on Machine Learning (ICML).
• Lever et al. (2013) Guy Lever, François Laviolette, and John Shawe-Taylor. 2013. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science 473 (2013), 4–28.
• Liang et al. (2017) Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. 2017. Fisher-Rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530 (2017).
• London (2016) Ben London. 2016. Generalization bounds for randomized learning with application to stochastic gradient descent. In NIPS Workshop on Optimizing the Optimizers.
• Menz et al. (2014) Georg Menz, André Schlichting, et al. 2014. Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. The Annals of Probability 42, 5 (2014), 1809–1884.
• Mou et al. (2018) Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. 2018. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. In Conference on Learning Theory (COLT). 605–638.
• Nesterov (1983) Yurii E Nesterov. 1983. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, Vol. 269. 543–547.
• Neyshabur et al. (2018) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. 2018. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations (ICLR).
• Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. 2015. Norm-based capacity control in neural networks. In Conference on Learning Theory (COLT). 1376–1401.
• Pavliotis (2014) Grigorios A Pavliotis. 2014. Stochastic processes and applications: diffusion processes, the Fokker-Planck and Langevin equations. Vol. 60. Springer.
• Pensia et al. (2018) Ankit Pensia, Varun Jog, and Po-Ling Loh. 2018. Generalization Error Bounds for Noisy, Iterative Algorithms. In International Symposium on Information Theory (ISIT). 546–550.
• Polyak (1964) Boris T Polyak. 1964. Some methods of speeding up the convergence of iteration methods. U. S. S. R. Comput. Math. and Math. Phys. 4, 5 (1964), 1–17.
• Raginsky et al. (2017) Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. 2017. Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. In Conference on Learning Theory (COLT). 1674–1703.
• Risken (1996) Hannes Risken. 1996. Fokker-Planck equation. In The Fokker-Planck Equation. Springer, 63–95.
• Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning (ICML). 1139–1147.
• Topsoe (2000) Flemming Topsoe. 2000. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory 46, 4 (2000), 1602–1609.
• Tzen et al. (2018) Belinda Tzen, Tengyuan Liang, and Maxim Raginsky. 2018. Local optimality and generalization guarantees for the Langevin algorithm via empirical metastability. In Conference on Learning Theory (COLT).
• Wei et al. (2018) Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. 2018. On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369 (2018).
• Wei and Ma (2019) Colin Wei and Tengyu Ma. 2019. Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation. arXiv preprint arXiv:1905.03684 (2019).
• Welling and Teh (2011) Max Welling and Yee W Teh. 2011. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning (ICML). 681–688.
• Xu and Raginsky (2017) Aolin Xu and Maxim Raginsky. 2017. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems. 2524–2533.
• Xu et al. (2018) Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. 2018. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In Advances in Neural Information Processing Systems (NeurIPS). 3126–3137.
• Zhang et al. (2017a) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017a. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR).
• Zhang et al. (2017b) Yuchen Zhang, Percy Liang, and Moses Charikar. 2017b. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. In Conference on Learning Theory (COLT). 1980–2022.
• Zou et al. (2018) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. 2018. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888 (2018).

## Appendix A Proofs in Section 3

### A.1 Bayes-Stability Framework

###### Lemma 16.

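Under Assumption 7, for any prior distribution $P$ not depending on the dataset $S$, the generalization error is upper bounded by

$$
\left|\mathbb{E}_{z}\left[\mathbb{E}_{w\sim P} f(w,z)-\mathbb{E}_{w\sim Q_z} f(w,z)\right]\right|+\left|\mathbb{E}_{z}\left[\mathbb{E}_{w\sim P} f(w)-\mathbb{E}_{w\sim Q_z} f(w)\right]\right|,
$$

where $f(w)$ denotes the population loss $\mathbb{E}_{z} f(w,z)$.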

Proof of Lemma 16. Let and . We can rewrite the generalization error as $\mathrm{err}_{\mathrm{gen}}=\mathrm{err}_{\mathrm{test}}-\mathrm{err}_{\mathrm{train}}$, where

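$$
\begin{aligned}
\mathrm{err}_{\mathrm{test}} &= \mathbb{E}_{z}\,\mathbb{E}_{w\sim Q_{(1,z)}} f(w) = \mathbb{E}_{z}\,\mathbb{E}_{w\sim Q_z} f(w) && \text{(Assumption 7)} \\
&= \mathbb{E}_{z}\int_{\mathbb{R}^d}\bigl(Q_z(w)-P(w)\bigr) f(w)\,\mathrm{d}w+\int_{\mathbb{R}^d}P(w) f(w)\,\mathrm{d}w,
\end{aligned}
$$

and

$$
\begin{aligned}
\mathrm{err}_{\mathrm{train}} &= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{S}\,\mathbb{E}_{w\sim Q_S} f(w,z_i) \\
&= \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{z}\,\mathbb{E}_{w\sim Q_{(i,z)}} f(w,z) = \mathbb{E}_{z}\,\mathbb{E}_{w\sim Q_z} f(w,z) && \text{(Assumption 7)} \\
&= \mathbb{E}_{z}\int_{\mathbb{R}^d}\bigl(Q_z(w)-P(w)\bigr) f(w,z)\,\mathrm{d}w+\int_{\mathbb{R}^d}P(w)\,\mathbb{E}_{z} f(w,z)\,\mathrm{d}w && \text{($P$ is a prior)} \\
&= \mathbb{E}_{z}\int_{\mathbb{R}^d}\bigl(Q_z(w)-P(w)\bigr) f(w,z)\,\mathrm{d}w+\int_{\mathbb{R}^d}P(w) f(w)\,\mathrm{d}w. && \text{(definition of $f(w)$)}
\end{aligned}
$$

Thus, we have

$$
\begin{aligned}
|\mathrm{err}_{\mathrm{gen}}| &= |\mathrm{err}_{\mathrm{test}}-\mathrm{err}_{\mathrm{train}}| \\
&= \left|\mathbb{E}_{z}\int_{\mathbb{R}^d}\bigl(Q_z(w)-P(w)\bigr) f(w)\,\mathrm{d}w-\mathbb{E}_{z}\int_{\mathbb{R}^d}\bigl(Q_z(w)-P(w)\bigr) f(w,z)\,\mathrm{d}w\right| \\
&\le \left|\mathbb{E}_{z}\left[\mathbb{E}_{w\sim Q_z} f(w,z)-\mathbb{E}_{w\sim P} f(w,z)\right]\right|+\left|\mathbb{E}_{z}\left[\mathbb{E}_{w\sim Q_z} f(w)-\mathbb{E}_{w\sim P} f(w)\right]\right|.
\end{aligned}
$$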

Now we are ready to prove Theorems 8 and 9, which we restate in the following.

###### Theorem 8 (Bayes-Stability).

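Under Assumptions 3 and 7, for any prior distribution $P$ not depending on $S$, the generalization error is bounded by both $2C\,\mathbb{E}_{z}\left[\sqrt{2\,\mathrm{KL}(P,Q_z)}\right]$ and $2C\,\mathbb{E}_{z}\left[\sqrt{2\,\mathrm{KL}(Q_z,P)}\right]$.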

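Proof. By Lemma 16,

$$
\begin{aligned}
\mathrm{err}_{\mathrm{gen}} &\le \left|\mathbb{E}_{z}\left[\mathbb{E}_{w\sim P} f(w,z)-\mathbb{E}_{w\sim Q_z} f(w,z)\right]\right|+\left|\mathbb{E}_{z}\left[\mathbb{E}_{w\sim P} f(w)-\mathbb{E}_{w\sim Q_z} f(w)\right]\right| \\
&\le \mathbb{E}_{z}\left[2C\cdot\mathrm{TV}(P,Q_z)+2C\cdot\mathrm{TV}(P,Q_z)\right] && \text{($C$-boundedness)} \\
&\le 4C\,\mathbb{E}_{z}\left[\sqrt{\tfrac{1}{2}\mathrm{KL}(P,Q_z)}\right]. && \text{(Pinsker's inequality)}
\end{aligned}
$$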

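The other bound follows from the same argument with Pinsker's inequality applied to $\mathrm{KL}(Q_z,P)$ instead, since the total variation distance is symmetric in its arguments.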
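The two inequalities used in this proof can be sanity-checked numerically. The sketch below is only an illustration on small discrete distributions (the paper works with distributions on $\mathbb{R}^d$); the helper names `tv`, `kl`, `random_dist` and `check_bounds` are hypothetical, and the loss is assumed to take values in $[-C, C]$.

```python
import math
import random

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL(p, q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def random_dist(k, rng):
    """Random strictly positive distribution on k points."""
    w = [rng.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

def check_bounds(trials=1000, k=5, C=1.0, seed=0):
    """Return True if both inequalities hold on every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        p, q = random_dist(k, rng), random_dist(k, rng)
        # Assumption: a "C-bounded" loss is modelled as values in [-C, C].
        g = [rng.uniform(-C, C) for _ in range(k)]
        gap = abs(sum(a * x for a, x in zip(p, g))
                  - sum(b * x for b, x in zip(q, g)))
        # Boundedness step: |E_p g - E_q g| <= 2C * TV(p, q).
        if gap > 2 * C * tv(p, q) + 1e-12:
            return False
        # Pinsker's inequality: TV(p, q) <= sqrt(KL(p, q) / 2).
        if tv(p, q) > math.sqrt(0.5 * kl(p, q)) + 1e-12:
            return False
    return True
```

Calling `check_bounds()` exercises both steps on random instances; it is a plausibility check, not a substitute for the proof.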

### A.2 Technical Lemmas

The following lemma allows us to reduce the proof of algorithmic stability to the analysis of a single update.

###### Lemma 10.

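Let $(W_0,\dots,W_T)$ and $(W'_0,\dots,W'_T)$ be two sequences of random variables such that for each $t\in\{0,\dots,T\}$, $W_t$ and $W'_t$ have the same support. Suppose $W_0$ and $W'_0$ follow the same distribution. Then,

$$
\mathrm{KL}(W_{\le T},W'_{\le T})=\sum_{t=1}^{T}\mathbb{E}_{w_{<t}\sim W_{<t}}\left[\mathrm{KL}\left(W_t\mid W_{<t}=w_{<t},\;W'_t\mid W'_{<t}=w_{<t}\right)\right],
$$

where $W_{\le t}$ denotes $(W_0,\dots,W_t)$ and $W_{<t}$ denotes $W_{\le t-1}$.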

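Proof. Fix $t\in\{1,\dots,T\}$. By the chain rule of the KL-divergence,

$$
\mathrm{KL}(W_{\le t},W'_{\le t})=\mathrm{KL}(W_{<t},W'_{<t})+\mathbb{E}_{w_{<t}\sim W_{<t}}\left[\mathrm{KL}\left(W_t\mid W_{<t}=w_{<t},\;W'_t\mid W'_{<t}=w_{<t}\right)\right].
$$

The lemma follows from summing over $t=1,\dots,T$ and noting that $\mathrm{KL}(W_0,W'_0)=0$, since $W_0$ and $W'_0$ follow the same distribution.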
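To illustrate the decomposition in Lemma 10, the following sketch compares the KL-divergence between the joint laws of two short Markov chains with the sum of expected per-step KL-divergences. The finite state space, the Markov structure and the names `joint`, `kl_joint` and `kl_stepwise` are assumptions made for the example; the lemma itself covers general sequences.

```python
import itertools
import math

def kl(p, q):
    """KL(p, q) in nats for discrete distributions with q > 0 where p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def joint(init, trans, T):
    """Joint law of (W_0, ..., W_T) for a Markov chain, as {path: prob}."""
    n = len(init)
    out = {}
    for path in itertools.product(range(n), repeat=T + 1):
        pr = init[path[0]]
        for s, t in zip(path, path[1:]):
            pr *= trans[s][t]
        out[path] = pr
    return out

def kl_joint(init, trans_a, trans_b, T):
    """Left-hand side: KL between the two joint laws (shared initial law)."""
    ja, jb = joint(init, trans_a, T), joint(init, trans_b, T)
    return sum(pa * math.log(pa / jb[path]) for path, pa in ja.items() if pa > 0)

def kl_stepwise(init, trans_a, trans_b, T):
    """Right-hand side: sum over t of the expected KL of the t-th update."""
    n = len(init)
    marg = list(init)  # law of W_{t-1} under the first chain
    total = 0.0
    for _ in range(T):
        total += sum(marg[s] * kl(trans_a[s], trans_b[s]) for s in range(n))
        marg = [sum(marg[s] * trans_a[s][t] for s in range(n)) for t in range(n)]
    return total
```

For a Markov chain the conditional law of $W_t$ depends on the history only through $W_{t-1}$, which is why `kl_stepwise` only needs to track the marginal of the previous state.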

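The following lemma (see, e.g., Duchi, 2007, Section 9) gives a closed-form formula for the KL-divergence between Gaussian distributions.

###### Lemma 17.

Suppose that $P=\mathcal{N}(\mu_1,\Sigma_1)$ and $Q=\mathcal{N}(\mu_2,\Sigma_2)$ are two Gaussian distributions on $\mathbb{R}^d$. Then,

$$
\mathrm{KL}(P,Q)=\frac{1}{2}\left(\operatorname{tr}\left(\Sigma_2^{-1}\Sigma_1\right)+(\mu_2-\mu_1)^{\top}\Sigma_2^{-1}(\mu_2-\mu_1)-d+\ln\frac{\det(\Sigma_2)}{\det(\Sigma_1)}\right).
$$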
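As a quick check of this closed form, the sketch below evaluates it for diagonal covariances and compares the one-dimensional case against a direct numerical integration of the KL integral. Restricting to diagonal covariances and the function names `kl_gauss_diag` and `kl_numeric_1d` are simplifying assumptions for illustration only.

```python
import math

def kl_gauss_diag(mu1, var1, mu2, var2):
    """Closed-form KL(N(mu1, diag(var1)), N(mu2, diag(var2)))."""
    d = len(mu1)
    trace = sum(v1 / v2 for v1, v2 in zip(var1, var2))
    quad = sum((m2 - m1) ** 2 / v2 for m1, m2, v2 in zip(mu1, mu2, var2))
    logdet = sum(math.log(v2 / v1) for v1, v2 in zip(var1, var2))
    return 0.5 * (trace + quad - d + logdet)

def kl_numeric_1d(mu1, var1, mu2, var2, lo=-30.0, hi=30.0, steps=60000):
    """Midpoint-rule estimate of the 1-D KL integral, as an independent check."""
    def logpdf(x, mu, var):
        return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        lp, lq = logpdf(x, mu1, var1), logpdf(x, mu2, var2)
        total += math.exp(lp) * (lp - lq) * h
    return total
```

When both covariances equal $\sigma^2 I_d$, the trace and log-determinant terms cancel against $d$ and the formula reduces to $\lVert\mu_1-\mu_2\rVert_2^2/(2\sigma^2)$; for example, `kl_gauss_diag([0.0, 0.0], [4.0, 4.0], [3.0, 4.0], [4.0, 4.0])` evaluates to `25 / 8`.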

The following definition and lemma (Topsoe, 2000, Theorem 3) help us derive upper bounds on the KL-divergence in the technical proofs.

###### Definition 18.

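Let $P$ and $Q$ be two probability distributions on $\mathbb{R}^d$. The directional triangular discrimination from $P$ to $Q$ is defined as

$$
\Delta^{*}(P,Q)=\sum_{k=0}^{+\infty}2^{k}\cdot\Delta\left(2^{-k}P+(1-2^{-k})Q,\;Q\right),
$$

where

$$
\Delta(P,Q)=\int_{\mathbb{R}^d}\frac{\bigl(P(w)-Q(w)\bigr)^{2}}{P(w)+Q(w)}\,\mathrm{d}w.
$$

###### Lemma 19.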

For any two probability distributions $P$ and $Q$ on $\mathbb{R}^d$,

$$
\mathrm{KL}(P,Q)\le\ln 2\cdot\Delta^{*}(P,Q).
$$
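The quantities in Definition 18 and Lemma 19 can be evaluated directly for discrete distributions, which gives a rough feel for how tight the bound is. The sketch below is an illustration under assumptions: it replaces the integral by a finite sum, truncates the infinite series after `terms` terms (all terms are non-negative, so this slightly underestimates $\Delta^{*}$), and the function names are hypothetical.

```python
import math

def kl(p, q):
    """KL(p, q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def triangular(p, q):
    """Triangular discrimination: sum of (p - q)^2 / (p + q)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def directional_triangular(p, q, terms=50):
    """Truncated series for the directional triangular discrimination."""
    total = 0.0
    for k in range(terms):
        w = 2.0 ** (-k)
        # Mixture 2^{-k} P + (1 - 2^{-k}) Q, which approaches Q as k grows.
        mix = [w * a + (1 - w) * b for a, b in zip(p, q)]
        total += (2.0 ** k) * triangular(mix, q)
    return total
```

For $P=(0.5, 0.5)$ and $Q=(0.9, 0.1)$ the KL-divergence is about $0.51$ while $\ln 2\cdot\Delta^{*}(P,Q)$ is about $0.68$, so the bound holds with moderate slack in this instance.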

Let be the set of all possible mini-batches. denotes the collection of mini-batches that contain , while . Let denote the diameter of set .

###### Lemma 11.

Suppose that batch size . and are two collections of points in labeled by mini-batches of size that satisfy the following conditions for some constant :

1. for and for .

2. .

Let denote the Gaussian distribution . Let $P$ and $P'$ be two mixture distributions over all mini-batches. Then, for some universal constant $C_0$,

$$
\mathrm{KL}(P,P')\le\frac{C_0\,b^{2}\beta^{2}}{\sigma^{2}n^{2}}.
$$

Proof of Lemma 11. By Lemma 19, $\mathrm{KL}(P,P')$ is bounded by

 ln2⋅Δ∗(P,P′) =ln2⋅+∞∑k=02k⋅Δ(2−kP+(1−2−k)P′,P′) =ln2⋅+∞∑k=02k⋅∫Rd4−k(P(w)−P′(w))22−kP