1 Introduction
Nonconvex stochastic optimization is the major workhorse of modern machine learning. For instance, standard supervised learning on a model class parametrized by
can be formulated as the following optimization problem: where denotes the model parameter, is an unknown data distribution over instance space , and
is a given loss function which may be nonconvex. A learning algorithm takes as input a collection
of data points sampled i.i.d. from , and outputs a (possibly randomized) parameter configuration . A fundamental question in learning theory is to understand the generalization performance of learning algorithms—is the algorithm guaranteed to output a model that generalizes well to the data distribution ? Specifically, we aim to prove upper bounds on the generalization error
. Classical learning theory relates the generalization error to various complexity measures (e.g., the VC-dimension and Rademacher complexity) of the model class. Directly applying these classical complexity measures, however, fails to explain the recent success of overparametrized neural networks (see e.g.,
Zhang et al. (2017a)), where the model complexity significantly exceeds the amount of available training data. By incorporating certain data-dependent quantities such as margin and compressibility into the classical framework, some recent work (e.g., Bartlett et al. (2017); Arora et al. (2018); Wei and Ma (2019)) obtained more meaningful generalization bounds in the deep learning context.
An alternative approach to showing generalization guarantees is to prove algorithm-dependent bounds. One celebrated example along this line is the algorithmic stability framework initiated by Bousquet and Elisseeff (2002). Roughly speaking, the generalization error can be bounded by the stability of the algorithm (see Section 2 for the details). Using this framework, Hardt et al. (2016)
studied the stability (and hence the generalization) of stochastic gradient descent (SGD) for both convex and nonconvex functions. Their work motivated recent studies of the generalization performance of several other gradient-based optimization algorithms
Kuzborskij and Lampert (2018); London (2016); Chaudhari et al. (2017); Raginsky et al. (2017); Mou et al. (2018); Pensia et al. (2018); Chen et al. (2018).

In this paper, we study the algorithmic stability and generalization guarantees of various iterative gradient-based methods, with certain continuous noise injected in each iteration, in a nonconvex setting. As a concrete example, we consider stochastic gradient Langevin dynamics (SGLD) (see Raginsky et al. (2017); Mou et al. (2018); Pensia et al. (2018)). Viewed as a variant of SGD, SGLD adds isotropic Gaussian noise at every update step:
$$W_{t+1} = W_t - \eta_t\, g_t + \sqrt{2\eta_t/\beta}\,\xi_t, \qquad \xi_t \sim \mathcal{N}(0, I_d), \tag{1}$$
where $g_t$ denotes either the full gradient or the gradient over a minibatch sampled from the training dataset, $\eta_t$ is the step size, and $\beta$ is the inverse temperature. We also study the continuous version of (1), which is the dynamics defined by the following stochastic differential equation (SDE):
$$\mathrm{d}W_t = -\nabla F(W_t)\,\mathrm{d}t + \sqrt{2/\beta}\,\mathrm{d}B_t, \tag{2}$$
where $F$ is the empirical loss and $B_t$ is the standard Brownian motion.
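To make the update (1) concrete, here is a minimal NumPy sketch of one SGLD step; the quadratic toy objective, step size, and inverse temperature are illustrative assumptions, not choices from the paper.

```python
import numpy as np

def sgld_step(w, grad_fn, eta, beta, rng):
    """One SGLD update: a gradient step plus isotropic Gaussian noise
    with standard deviation sqrt(2 * eta / beta) per coordinate."""
    noise = rng.standard_normal(w.shape)
    return w - eta * grad_fn(w) + np.sqrt(2.0 * eta / beta) * noise

# Toy example (illustrative): minimize f(w) = ||w||^2 / 2, with gradient w.
rng = np.random.default_rng(0)
w = np.ones(5)
for _ in range(1000):
    w = sgld_step(w, lambda v: v, eta=0.05, beta=1e4, rng=rng)
# With a large beta (small noise), the iterates concentrate near the minimum 0.
print(np.linalg.norm(w))
```

Taking the step size to zero in this update recovers an Euler-Maruyama discretization of the SDE (2).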
1.1 Related Work
Most related to our work is the study of algorithm-dependent generalization bounds for stochastic gradient methods. Hardt et al. (2016) first study the generalization performance of SGD via algorithmic stability. They prove a generalization bound that scales linearly with $T$, the number of iterations, when the loss function is convex, but their results for general nonconvex optimization are more restricted. Our work is a follow-up to the recent work of Mou et al. (2018), in which they provide generalization bounds for SGLD from both stability and PAC-Bayesian perspectives. Another closely related work by Pensia et al. (2018) derives similar bounds for noisy stochastic gradient methods, based on the information-theoretic framework of Xu and Raginsky (2017). However, their bounds scale as $O(1/\sqrt{n})$, where $n$ is the size of the training dataset, which is suboptimal even for SGLD.
We acknowledge that besides the algorithm-dependent approach that we follow, recent advances in learning theory aim to explain the generalization performance of neural networks from many other perspectives. Some of the most prominent ideas include bounding the network capacity by the norms of weight matrices Neyshabur et al. (2015); Liang et al. (2017), margin theory Bartlett et al. (2017); Wei et al. (2018), PAC-Bayesian theory Dziugaite and Roy (2017); Neyshabur et al. (2018); Dziugaite and Roy (2018), network compressibility Arora et al. (2018), and overparametrization Du et al. (2018); Allen-Zhu et al. (2018); Zou et al. (2018); Chizat and Bach (2018). Most of these results are stated in the context of neural networks (some are tailored to networks with specific architectures), whereas our work addresses generalization in nonconvex stochastic optimization in general. We also note that some recent works explain the phenomenon reported in Zhang et al. (2017a) from a variety of different perspectives (e.g., Bartlett et al. (2017); Arora et al. (2018); Arora et al. (2019)).
Welling and Teh (2011) first consider stochastic gradient Langevin dynamics (SGLD) as a sampling algorithm in the Bayesian inference context.
Raginsky et al. (2017) give a non-asymptotic analysis and establish a finite-time convergence guarantee of SGLD to an approximate global minimum.
Zhang et al. (2017b) analyze the hitting time of SGLD and prove that SGLD converges to an approximate local minimum. These bounds are further improved and generalized to a family of Langevin dynamics based algorithms in the subsequent work of Xu et al. (2018).

1.2 Overview of Our Results
In this paper, we provide generalization guarantees for the noisy variants of several popular stochastic gradient methods.
The Bayes-Stability method and data-dependent generalization bounds.
We develop a new method, called Bayes-Stability, for proving generalization bounds by incorporating ideas from the PAC-Bayesian theory into the stability framework. In particular, assuming the loss takes values in $[0, C]$, our method shows that the generalization error is bounded by both and , where is a prior distribution independent of the training set , and is the expected posterior distribution conditioned on (i.e., the last training data point is ); see Definition 6 and Theorem 8 for details.
Inspired by Lever et al. (2013), instead of using a fixed prior distribution, we bound the KL-divergence from the posterior to a distribution-dependent prior. This enables us to derive the following generalization error bound that depends on the expected norm of the gradient along the optimization path:
$$\mathrm{err}_{\mathrm{gen}} \;\le\; O\!\left(\frac{C}{n}\sqrt{\beta \sum_{t=1}^{T} \eta_t\, g_e(t)}\right) \tag{3}$$
Here $S$ is the training dataset and $g_e(t)$ is the expected empirical squared gradient norm at step $t$; see Theorem 9 for details.
Compared with the previous bound $O\big(\frac{L}{n}\sqrt{\beta \sum_t \eta_t}\big)$ in (Mou et al., 2018, Theorem 1), where $L$ is the global Lipschitz constant of the loss, our new bound (3) depends on the data distribution and is typically tighter (as the expected empirical gradient norm is at most $L$). In modern deep neural networks, the worst-case Lipschitz constant can be quite large, typically much larger than the expected empirical gradient norm along the optimization trajectory. Specifically, in the later stages of training, the distribution of the parameter is mostly concentrated around a flat local minimum region, where the expected empirical gradient is small. Hence, our generalization bound does not grow much even if we train longer in this case.
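The data-dependent quantity in our bound, the step-size-weighted sum of expected squared gradient norms along the trajectory, can be tracked during training. Below is a hedged sketch on an illustrative least-squares problem; the model, data, and hyperparameters are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative least-squares problem: F_S(w) = mean_i (x_i . w - y_i)^2 / 2.
X = rng.standard_normal((100, 5))
y = X @ np.ones(5) + 0.1 * rng.standard_normal(100)

def full_gradient(w):
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(5)
eta, beta = 0.05, 1e4
cumulative = 0.0  # running value of sum_t eta_t * ||grad F_S(W_t)||^2
for _ in range(200):
    g = full_gradient(w)
    cumulative += eta * float(g @ g)
    w = w - eta * g + np.sqrt(2 * eta / beta) * rng.standard_normal(5)

# As training converges, the per-step contribution eta * ||g||^2 shrinks,
# so the cumulative sum (and hence the bound) flattens out.
print(cumulative)
```

In a run like this, the cumulative sum grows while gradients are large early in training and then plateaus, mirroring the discussion above.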
Our new bound also offers an explanation for the difference between training on correct and random labels, a question raised by Zhang et al. (2017a). In particular, we show empirically that the expected gradient norm along the optimization path is significantly higher when the training labels are replaced with random labels (Section 3, Remark 13).
This bound is similar in spirit to the PAC-Bayesian bound (for SGLD with regularization) proposed by Mou et al. (2018). Compared with their bound, ours has a faster rate in $n$ ($O(1/n)$ instead of $O(1/\sqrt{n})$) and can be easily extended to other general settings (e.g., momentum). One advantage of their bound is that the contribution of each step to the numerator decays exponentially over time if the regularization coefficient is positive (however, without regularization there is no such decay; see Theorem 2 in Mou et al. (2018)). Furthermore, we note that we can obtain a similar generalization bound in which the expected empirical gradient norm is replaced by the population gradient norm.
Extensions. We also remark that our technique yields an arguably simpler proof of (Mou et al., 2018, Theorem 1), whose original proof was based on SDEs and the Fokker-Planck equation. More importantly, our technique easily extends to minibatches and to a variety of more general settings, as follows.

Extension to general noises. The proof of the generalization bound in Mou et al. (2018) relies heavily on the fact that the noise is Gaussian (in particular, their proof leverages the Fokker-Planck equation, which describes the time evolution of the density function associated with the Langevin dynamics and can only handle Gaussian noise), which makes it difficult to generalize to other noise distributions such as the Laplace distribution. In contrast, our analysis easily carries over to the class of log-Lipschitz noises (noises drawn from distributions with Lipschitz log densities).
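As an illustration of the log-Lipschitz class, the following sketch swaps the Gaussian noise in the update for Laplace noise, whose log density is globally Lipschitz; the toy objective and hyperparameters are illustrative assumptions.

```python
import numpy as np

def noisy_gd_step(w, grad_fn, eta, scale, rng, noise="gaussian"):
    """One noisy gradient step. The Laplace density exp(-|x|/b)/(2b) has a
    globally Lipschitz log density, so it falls in the log-Lipschitz class;
    the Gaussian log density -x^2/(2*sigma^2) is only locally Lipschitz."""
    if noise == "laplace":
        xi = rng.laplace(0.0, scale, size=w.shape)
    else:
        xi = rng.normal(0.0, scale, size=w.shape)
    return w - eta * grad_fn(w) + xi

rng = np.random.default_rng(2)
w = np.ones(3)
for _ in range(500):
    w = noisy_gd_step(w, lambda v: v, eta=0.1, scale=0.01, rng=rng,
                      noise="laplace")
print(np.linalg.norm(w))
```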

Pathwise stability. In practice, it is also natural to output a certain function of the entire optimization path, e.g., the iterate with the smallest empirical risk or a weighted average. We show that the same generalization bound holds for all such decision rules (Remark 12). We note that the analysis in the independent work of Pensia et al. (2018) also has this property, but their bound (see Corollary 1 in their work) scales at a slower rate of $O(1/\sqrt{n})$ (instead of $O(1/n)$) for bounded losses. (They assume the loss is sub-Gaussian; by Hoeffding's lemma, bounded random variables are sub-Gaussian.)
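To illustrate such pathwise decision rules, the sketch below records a noisy optimization path (a simple stand-in for SGLD iterates) and forms three outputs covered by the same bound; all numerical choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda v: 0.5 * float(np.dot(v, v))  # illustrative empirical risk

# A noisy optimization path (stand-in for SGLD iterates).
path = []
w = np.ones(4)
for _ in range(300):
    w = w - 0.05 * w + 0.02 * rng.standard_normal(4)
    path.append(w)

# Three pathwise outputs covered by the same stability bound:
best = min(path, key=f)                   # iterate with smallest empirical risk
suffix_avg = np.mean(path[-50:], axis=0)  # average of a suffix of the path
ema = path[0]
for p in path[1:]:                        # exponential moving average
    ema = 0.9 * ema + 0.1 * p
print(f(best), f(suffix_avg), f(ema))
```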
Generalization bounds with regularization via Log-Sobolev inequalities.
We also study the setting where the total loss is the sum of a bounded loss and an additional regularization term. In this case, the total loss can be treated as a perturbation of a quadratic function, and continuous Langevin dynamics (CLD) is well understood for quadratic functions. In particular, we obtain two generalization bounds for CLD, both via the technique of Log-Sobolev inequalities, a powerful tool for establishing convergence rates of CLD. One of our bounds is as follows (Theorem 14):
(4) 
The above bound has the following advantages:

Using the inequality $1 - e^{-x} \le x$ (for $x \ge 0$), one can see that our bound is at most the previous bound in (Mou et al., 2018, Proposition 8), which it therefore matches.

As time grows, the bound remains bounded and approaches a finite limit (unlike the previous bound, which goes to infinity as the training time goes to infinity).

If the noise level is not too small (i.e., $\beta$ is not very large), the generalization bound is quite desirable.
Our analysis is based on a Log-Sobolev inequality (LSI) for the parameter distribution at each time $t$, whereas most known LSIs hold only for the stationary distribution of the Markov process. We prove the new LSI by exploiting the variational formulation of entropy.
2 Preliminaries
Notations.
We use to denote the data distribution. The training dataset is a sequence of i.i.d. random variables drawn from . are neighboring datasets if and only if they differ at exactly one data point (we could assume without loss of generality that ). Let be the loss function, where denotes a model parameter in . We also define as the average loss on dataset . Let be the set of all possible minibatches. denotes the collection of minibatches that contain , while . Let denote the diameter of set .
Definition 1 (Lipschitz).
A loss function is Lipschitz in if holds for any and . Note that this implies that .
Definition 2 (Generalization error).
The generalization error is defined as
where is the population loss, and is a learning algorithm.
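For intuition, the generalization error of a concrete learner can be estimated as the gap between the loss on a large held-out sample (a Monte Carlo stand-in for the population loss) and the training loss; a minimal sketch with an illustrative least-squares "algorithm":

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 5, 50
w_true = rng.standard_normal(d)

def sample(m):
    """Draw m i.i.d. points from an illustrative data distribution."""
    X = rng.standard_normal((m, d))
    return X, X @ w_true + 0.1 * rng.standard_normal(m)

def loss(w, X, y):
    """Squared loss, averaged over the sample."""
    return float(np.mean((X @ w - y) ** 2))

X_train, y_train = sample(n)
# The "learning algorithm": ordinary least squares on the training set.
w_hat = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

# Monte Carlo estimate of the population loss with a large held-out sample.
X_test, y_test = sample(100_000)
gen_error = loss(w_hat, X_test, y_test) - loss(w_hat, X_train, y_train)
print(gen_error)
```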
Assumption 3.
The loss function is differentiable, bounded, and Lipschitz in .
Algorithmic Stability.
Intuitively, a learning algorithm that is stable (i.e., a small perturbation of the training data does not affect its output too much) can generalize well. In the seminal work of Bousquet and Elisseeff (2002) (see also Hardt et al. (2016)), the authors formally defined algorithmic stability and established a close connection between the stability of a learning algorithm and its generalization performance.
Definition 4 (Uniform stability).
(Bousquet and Elisseeff (2002)) A randomized algorithm is uniformly stable w.r.t. loss , if for all neighboring sets , it holds that
where and denote the outputs of on and respectively.
Lemma 5 (Generalization in expectation).
(Hardt et al. (2016)) Suppose a randomized algorithm is uniformly stable. Then, .
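The stability notion can be probed numerically by running the same training procedure on two neighboring datasets and measuring how far the outputs drift apart; a toy sketch under illustrative choices (plain, noiseless gradient descent stands in for a general learning algorithm so that the two runs are directly comparable):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 3, 40
X = rng.standard_normal((n, d))
y = X @ np.ones(d) + 0.1 * rng.standard_normal(n)

def train(Xs, ys, eta=0.05, steps=500):
    """Gradient descent on the average squared loss."""
    w = np.zeros(d)
    for _ in range(steps):
        w -= eta * Xs.T @ (Xs @ w - ys) / len(ys)
    return w

# Neighboring dataset: replace the last data point.
X2, y2 = X.copy(), y.copy()
X2[-1], y2[-1] = rng.standard_normal(d), 0.0

w1, w2 = train(X, y), train(X2, y2)
# Stability: the two outputs, and hence the loss at any point z, stay close;
# the gap shrinks as n grows.
print(np.linalg.norm(w1 - w2))
```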
3 Bayes-Stability Method
In this section, we incorporate ideas from the PAC-Bayesian theory (see e.g., Lever et al. (2013)) into the algorithmic stability framework. Combined with the technical tools introduced in previous sections, the new framework enables us to prove tighter data-dependent generalization bounds.
First, we define the posterior of a dataset and the posterior of a single data point.
Definition 6 (Single-point posterior).
Let be the posterior distribution of the parameter for a given training dataset
. In other words, it is the probability distribution of the output of the learning algorithm on dataset
(e.g., for $T$ iterations of SGLD, it is the pdf of $W_T$). The single-point posterior is defined as the expectation of the posterior over datasets that share a given last data point.

For convenience, we make the following assumption on the learning algorithm:
Assumption 7 (Order-independence).
For any fixed dataset and any permutation , is the same as , where .
Assumption 7 implies . So we use as a shorthand for in the following. Note that this assumption is easily satisfied if the learning algorithm randomly permutes the training data at the beginning. It is also easy to verify that both SGD and SGLD satisfy the order-independence assumption.
Now, we state our new Bayes-Stability framework, which holds for any prior distribution over the parameter space that is independent of the training dataset .
Theorem 8 (Bayes-Stability).
Applying this general framework, we obtain the following concrete datadependent generalization bounds for SGLD:
Theorem 9.
Suppose that Assumption 3 and the following conditions hold:

Batch size .

Learning rate .
Let be the empirical squared gradient norm. Then, the following generalization error bound holds for iterations of SGLD:
(Empirical norm) 
where $S$ is the dataset and $W_t$ denotes the parameter at step $t$ of SGLD run on the given dataset $S$.
Proof Sketch of Theorem 9 The proof builds upon the following two technical lemmas, which we prove in Appendix A.2.
Lemma 10.
Let and be two sequences of random variables such that for each , and have the same support. Suppose and follow the same distribution. Then,
where and .
Lemma 11.
Suppose that batch size . and are two collections of points in labeled by minibatches of size that satisfy the following conditions for constants : (1) for and for ; (2) . (See Section 2 for the definitions of , and .)
Let
denote the Gaussian distribution
. Let and be two mixture distributions over all minibatches. Then, for some universal constant . Define , where denotes the zero data point (i.e., for any ). Theorem 8 shows that
(5) 
By the convexity of the KL-divergence, for a fixed , we have
(6) 
Let and be the training process of SGLD for and , respectively. Note that for a fixed , both and are Gaussian mixtures. By Lemma 11, we have
Applying Lemma 10 and gives
Recall that is the parameter at step using as the dataset. In this case, we can rewrite as , since it is the th data point of . Since SGLD satisfies the order-independence assumption, we can rewrite as for all . Together with (5) and (6), and using , we can prove the theorem.
Furthermore, if we bound instead of , we can obtain the following bound that depends on the population gradient norm:
The full proofs of the above results are postponed to Appendix A, and we provide some remarks about the new bounds.
Remark 12.
In fact, our proof establishes that the above upper bound holds for the two sequences and : . Hence, our bound holds for any sufficiently regular function over the parameter sequences: . In particular, our generalization error bound automatically extends to several variations such as outputting the average of the sequence, the average of the suffix of certain length, or the exponential moving average.
Remark 13.
We reproduce the experiment in Zhang et al. (2017a) (see Appendix C for further experimental details). As shown in Figure 1, both empirical and population gradients have significantly larger norms when training on random labels than on normal labels. Moreover, the curve of the cumulative empirical squared gradient norm closely tracks the generalization error curve. This suggests that the generalization bounds in Theorem 9 can distinguish randomly labeled data from normal data.
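A toy harness for this kind of measurement is sketched below; it accumulates the step-size-weighted squared gradient norms for true versus random labels. The logistic model and synthetic data are illustrative stand-ins for the paper's network experiments, so this toy is not expected to reproduce the reported separation, only the measurement itself.

```python
import numpy as np

rng = np.random.default_rng(6)

def cumulative_sq_grad_norm(X, y, eta=0.1, steps=300):
    """Train logistic regression by gradient descent and return the sum
    of eta * ||grad||^2 over the path, the quantity in Theorem 9's bound."""
    w = np.zeros(X.shape[1])
    total = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        g = X.T @ (p - y) / len(y)
        total += eta * float(g @ g)
        w -= eta * g
    return total

X = rng.standard_normal((200, 10))
y_true = (X @ np.ones(10) > 0).astype(float)        # realizable labels
y_rand = rng.integers(0, 2, size=200).astype(float)  # random labels

t_true = cumulative_sq_grad_norm(X, y_true)
t_rand = cumulative_sq_grad_norm(X, y_rand)
print(t_true, t_rand)
```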
4 Generalization of CLD and GLD with Regularization
In this section, we study the generalization error of Continuous Langevin Dynamics (CLD) with regularization. Let the total loss function over training set be . The Continuous Langevin Dynamics is defined by the following SDE:
(CLD) 
where is the standard Brownian motion on and the initial distribution is the centered Gaussian distribution in with covariance . We show that the generalization error of CLD is upper bounded by , which is independent of the training time (Theorem 14). Furthermore, as goes to infinity, we have a tighter generalization error bound (Theorem 37 in Appendix B). We also study the generalization of Gradient Langevin Dynamics (GLD), which is the discretization of CLD:
(GLD) 
where $\xi_t$ is a standard Gaussian random vector in $\mathbb{R}^d$. Using a result developed in Raginsky et al. (2017), we can show that, as the step size tends to zero, GLD has the same generalization behavior as CLD (see Theorems 14 and 37). We now formally state the first main result of this section.
Theorem 14.
Under Assumption 3, CLD (with initial probability measure ) has the following expected generalization error bound:
(7) 
In addition, if is smooth and nonnegative, by setting , and , GLD (running iterations with the same as CLD) has the expected generalization error bound:
(8) 
where is a constant that only depends on , , , , and .
The following lemma is crucial for establishing the above generalization bound for CLD. In particular, we need to establish a Log-Sobolev inequality for the parameter distribution at every time step, whereas most known LSIs characterize only the stationary distribution of the Markov process. The proof of the lemma can be found in Appendix B.
Lemma 15.
Proof Sketch of Theorem 14 Suppose and are two neighboring datasets that differ in exactly one data point. Let and be the processes of CLD running on and , respectively. Let and be the pdfs of and . We have
(Lemma 15) 
Solving this inequality gives . Hence the generalization error of CLD can be bounded by , which proves the first part. The second part of the theorem follows from Lemma 34 in Appendix B.
Our second generalization bound for CLD (Theorem 37 in Appendix B) is . The high-level idea behind this bound is very similar to that of Raginsky et al. (2017). We first observe that the (stationary) Gibbs distribution has a small generalization error. Then, we bound the distance from to . In our setting, we can use the Holley-Stroock perturbation lemma, which allows us to bound the log-Sobolev constant, and we can thus bound the above distance easily.
5 Future Directions
In this paper, we prove several new generalization bounds for a variety of noisy gradient-based methods. Our current techniques can only handle continuous noise for which we can bound the KL-divergence. One future direction is to handle the discrete noise introduced in SGD (in this case the KL-divergence may not be well defined). For either SGLD or CLD, if the noise level is small (i.e., $\beta$ is large), it may take a long time for the diffusion process to reach the stationary distribution. Hence, another interesting future direction is to consider the local behavior and generalization of the diffusion process in finite time through the techniques developed in the study of metastability (see e.g., Bovier et al. (2005); Bovier and den Hollander (2006); Tzen et al. (2018)). In particular, these techniques may help further improve the bounds in Theorems 14 and 37 (when $\beta$ is not very large).
References
 Allen-Zhu et al. (2018) Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. 2018. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918 (2018).
 Arora et al. (2019) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. 2019. On Exact Computation with an Infinitely Wide Neural Net. arXiv preprint arXiv:1904.11955 (2019).
 Arora et al. (2018) Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. 2018. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning (ICML). 254–263.
 Bakry et al. (2013) Dominique Bakry, Ivan Gentil, and Michel Ledoux. 2013. Analysis and geometry of Markov diffusion operators. Vol. 348. Springer Science & Business Media.
 Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. 2017. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems (NeurIPS). 6240–6249.
 Bousquet and Elisseeff (2002) Olivier Bousquet and André Elisseeff. 2002. Stability and generalization. Journal of machine learning research 2, Mar (2002), 499–526.
 Bovier and den Hollander (2006) Anton Bovier and Frank den Hollander. 2006. Metastability: a potential theoretic approach. In International Congress of Mathematicians, Vol. 3. Eur. Math. Soc. Zürich, 499–518.

 Bovier et al. (2005) Anton Bovier, Véronique Gayrard, and Markus Klein. 2005. Metastability in reversible diffusion processes II: Precise asymptotics for small eigenvalues. Journal of the European Mathematical Society 7, 1 (2005), 69–99.
 Chaudhari et al. (2017) Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. 2017. Entropy-SGD: Biasing gradient descent into wide valleys. In International Conference on Learning Representations (ICLR).
 Chen et al. (2018) Yuansi Chen, Chi Jin, and Bin Yu. 2018. Stability and Convergence Tradeoff of Iterative Optimization Algorithms. arXiv preprint arXiv:1804.01619 (2018).
 Chizat and Bach (2018) Lenaic Chizat and Francis Bach. 2018. A Note on Lazy Training in Supervised Differentiable Programming. arXiv preprint arXiv:1812.07956 (2018).
 Du et al. (2018) Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. 2018. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804 (2018).
 Duchi (2007) John Duchi. 2007. Derivations for linear algebra and optimization. Berkeley, California 3 (2007).
 Dziugaite and Roy (2018) Gintare Karolina Dziugaite and Daniel Roy. 2018. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors. In International Conference on Machine Learning (ICML). 1377–1386.

 Dziugaite and Roy (2017) Gintare Karolina Dziugaite and Daniel M Roy. 2017. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Uncertainty in Artificial Intelligence (UAI).
 Hardt et al. (2016) Moritz Hardt, Benjamin Recht, and Yoram Singer. 2016. Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine Learning (ICML). 1225–1234.
 Holley and Stroock (1987) Richard Holley and Daniel Stroock. 1987. Logarithmic Sobolev inequalities and stochastic Ising models. Journal of statistical physics 46, 5 (1987), 1159–1194.
 Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report. Citeseer.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS). 1097–1105.
 Kuzborskij and Lampert (2018) I. Kuzborskij and C. H. Lampert. 2018. Data-Dependent Stability of Stochastic Gradient Descent. In International Conference on Machine Learning (ICML).
 Lever et al. (2013) Guy Lever, François Laviolette, and John Shawe-Taylor. 2013. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science 473 (2013), 4–28.
 Liang et al. (2017) Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. 2017. FisherRao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530 (2017).
 London (2016) Ben London. 2016. Generalization bounds for randomized learning with application to stochastic gradient descent. In NIPS Workshop on Optimizing the Optimizers.
 Menz et al. (2014) Georg Menz, André Schlichting, et al. 2014. Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. The Annals of Probability 42, 5 (2014), 1809–1884.
 Mou et al. (2018) Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. 2018. Generalization bounds of SGLD for nonconvex learning: Two theoretical viewpoints. In Conference on Learning Theory (COLT). 605–638.
 Nesterov (1983) Yurii E Nesterov. 1983. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, Vol. 269. 543–547.
 Neyshabur et al. (2018) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. 2018. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations (ICLR).
 Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. 2015. Norm-based capacity control in neural networks. In Conference on Learning Theory (COLT). 1376–1401.
 Pavliotis (2014) Grigorios A Pavliotis. 2014. Stochastic processes and applications: diffusion processes, the Fokker-Planck and Langevin equations. Vol. 60. Springer.
 Pensia et al. (2018) Ankit Pensia, Varun Jog, and PoLing Loh. 2018. Generalization Error Bounds for Noisy, Iterative Algorithms. In International Symposium on Information Theory (ISIT). 546–550.
 Polyak (1964) Boris T Polyak. 1964. Some methods of speeding up the convergence of iteration methods. U. S. S. R. Comput. Math. and Math. Phys. 4, 5 (1964), 1–17.
 Raginsky et al. (2017) Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. 2017. Nonconvex learning via Stochastic Gradient Langevin Dynamics: a non-asymptotic analysis. In Conference on Learning Theory (COLT). 1674–1703.
 Risken (1996) Hannes Risken. 1996. Fokker-Planck equation. In The Fokker-Planck Equation. Springer, 63–95.
 Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning (ICML). 1139–1147.
 Topsoe (2000) Flemming Topsoe. 2000. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory 46, 4 (2000), 1602–1609.
 Tzen et al. (2018) Belinda Tzen, Tengyuan Liang, and Maxim Raginsky. 2018. Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability. Proceedings of the 2018 Conference on Learning Theory (COLT) (2018).
 Wei et al. (2018) Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. 2018. On the margin theory of feedforward neural networks. arXiv preprint arXiv:1810.05369 (2018).
 Wei and Ma (2019) Colin Wei and Tengyu Ma. 2019. Datadependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation. arXiv preprint arXiv:1905.03684 (2019).
 Welling and Teh (2011) Max Welling and Yee W Teh. 2011. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning (ICML). 681–688.
 Xu and Raginsky (2017) Aolin Xu and Maxim Raginsky. 2017. Informationtheoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems. 2524–2533.
 Xu et al. (2018) Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. 2018. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. In Advances in Neural Information Processing Systems (NeurIPS). 3126–3137.
 Zhang et al. (2017a) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017a. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR).
 Zhang et al. (2017b) Yuchen Zhang, Percy Liang, and Moses Charikar. 2017b. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. In Conference on Learning Theory (COLT). 1980–2022.
 Zou et al. (2018) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. 2018. Stochastic gradient descent optimizes overparameterized deep ReLU networks. arXiv preprint arXiv:1811.08888 (2018).
Appendix A Proofs in Section 3
A.1 Bayes-Stability Framework
Lemma 16.
Under Assumption 7, for any prior distribution not depending on the dataset , the generalization error is upper bounded by
where denotes the population loss .
Proof of Lemma 16 Let and . We can rewrite the generalization error as , where
(Assumption 7)  
and
(Assumption 7)  
( is a prior)  
(definition of ) 
Thus, we have
Theorem 8 (Bayes-Stability).
Proof By Lemma 16,
(boundedness)  
(Pinsker’s inequality) 
The other bound follows from a similar argument.
A.2 Technical Lemmas
The following lemma allows us to reduce the proof of algorithmic stability to the analysis of a single update.
Lemma 10.
Let and be two sequences of random variables such that for each , and have the same support. Suppose and follow the same distribution. Then,
where and .
The following lemma (see e.g., (Duchi, 2007, Section 9)) gives a closedform formula for the KLdivergence between Gaussian distributions.
Lemma 17.
Suppose that and are two Gaussian distributions on . Then,
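For the isotropic special case relevant to the SGLD analysis, the closed form reduces to ||mu1 - mu2||^2 / (2 sigma^2); the sketch below computes it and checks it by Monte Carlo (the specific means and variance are arbitrary illustrative choices).

```python
import numpy as np

def kl_isotropic_gaussians(mu1, mu2, sigma):
    """Closed form for KL(N(mu1, sigma^2 I) || N(mu2, sigma^2 I)):
    ||mu1 - mu2||^2 / (2 * sigma^2)."""
    diff = np.asarray(mu1) - np.asarray(mu2)
    return float(diff @ diff) / (2.0 * sigma ** 2)

# Monte Carlo check: E_{x ~ P}[log p(x) - log q(x)] should match.
rng = np.random.default_rng(7)
mu1, mu2, sigma = np.array([1.0, 0.0]), np.array([0.0, 0.5]), 0.8
x = mu1 + sigma * rng.standard_normal((200_000, 2))
log_ratio = (((x - mu2) ** 2).sum(1) - ((x - mu1) ** 2).sum(1)) / (2 * sigma ** 2)
print(kl_isotropic_gaussians(mu1, mu2, sigma), log_ratio.mean())
```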
The following lemma (Topsoe, 2000, Theorem 3) helps us to derive upper bounds on the KLdivergence in the technical proofs.
Definition 18.
Let and be two probability distributions on . The directional triangular discrimination from to is defined as
where
Lemma 19.
For any two probability distributions and on ,
Let be the set of all possible minibatches. denotes the collection of minibatches that contain , while . Let denote the diameter of set .
Lemma 11.
Suppose that batch size . and are two collections of points in labeled by minibatches of size that satisfy the following conditions for some constant :

for and for .

.
Let denote the Gaussian distribution . Let and be two mixture distributions over all minibatches. Then, for some universal constant ,