1 Introduction
Evaluating the generalization error of a learning algorithm is one of the most important challenges in statistical learning theory. Various approaches have been developed
(Rodrigues & Eldar, 2021), including VC dimension-based bounds (Vapnik, 1999), algorithmic stability-based bounds (Bousquet & Elisseeff, 2002), algorithmic robustness-based bounds (Xu & Mannor, 2012), PAC-Bayesian bounds (McAllester, 2003), and information-theoretic bounds (Xu & Raginsky, 2017). However, upper bounds on generalization error may not entirely capture the generalization ability of a learning algorithm. One apparent reason is the tightness issue: some upper bounds (Anthony & Bartlett, 2009) can be far from the true generalization error, or even vacuous, when evaluated in practice. More importantly, existing upper bounds do not fully characterize all the aspects that could influence the generalization error of a supervised learning problem. For example, VC dimension-based bounds depend only on the hypothesis class, and algorithmic stability-based bounds only exploit the properties of the learning algorithm. As a consequence, both methods fail to capture the fact that generalization behavior depends strongly on the interplay between the hypothesis class, the learning algorithm, and the underlying data-generating distribution, as shown in (Zhang et al., 2016). This paper overcomes the above limitations by deriving an exact characterization of the generalization error for a specific supervised learning algorithm, namely the Gibbs algorithm.
1.1 Problem Formulation
Let $S = \{Z_1, \dots, Z_n\}$ be the training set, where each $Z_i$ is defined on the same alphabet $\mathcal{Z}$. Note that the $Z_i$ are not required to be i.i.d. generated from the same data-generating distribution $P_Z$, and we denote the joint distribution of all the training samples as $P_S$. We denote the hypotheses by $w \in \mathcal{W}$, where $\mathcal{W}$ is a hypothesis class. The performance of the hypothesis is measured by a non-negative loss function $\ell\colon \mathcal{W} \times \mathcal{Z} \to \mathbb{R}_0^{+}$, and we can define the empirical risk and the population risk associated with a given hypothesis $w$ as $L_E(w,s) \triangleq \frac{1}{n}\sum_{i=1}^{n} \ell(w, z_i)$ and $L_P(w, P_S) \triangleq \mathbb{E}_{P_S}[L_E(w, S)]$, respectively. A learning algorithm can be modeled as a randomized mapping from the training set $S$ onto a hypothesis $W \in \mathcal{W}$ according to the conditional distribution $P_{W|S}$. Thus, the expected generalization error that quantifies the degree of overfitting can be written as

$\overline{\mathrm{gen}}(P_{W|S}, P_S) \triangleq \mathbb{E}_{P_{W,S}}\big[ L_P(W, P_S) - L_E(W, S) \big]$,  (1)

where the expectation is taken over the joint distribution $P_{W,S} = P_{W|S} \otimes P_S$.
The Gibbs distribution, which was first investigated by Gibbs (1902), is defined as

$P^{\alpha}_{W|S}(w|s) \triangleq \dfrac{\pi(w)\, e^{-\alpha f(w,s)}}{V(s,\alpha)}$,  (2)

where $\alpha \ge 0$ is the inverse temperature, $\pi(w)$ is an arbitrary prior distribution of $W$, $f(w,s)$ is the energy function, and $V(s,\alpha) \triangleq \int \pi(w) e^{-\alpha f(w,s)}\, dw$ is the partition function.
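As a concrete illustration (our own, not from the paper), the Gibbs distribution over a finite hypothesis class can be computed directly from this definition; the prior and energy values below are arbitrary:

```python
import numpy as np

def gibbs_posterior(prior, energies, alpha):
    """P(w) proportional to prior(w) * exp(-alpha * f(w, s)), computed in log-space for stability."""
    log_p = np.log(prior) - alpha * energies
    log_p -= log_p.max()              # subtract the max before exponentiating to avoid overflow
    p = np.exp(log_p)
    return p / p.sum()                # normalizing by the sum plays the role of the partition function

prior = np.array([0.5, 0.3, 0.2])     # arbitrary prior over three hypotheses
energies = np.array([0.9, 0.2, 0.5])  # e.g. empirical risks of the three hypotheses

print(gibbs_posterior(prior, energies, 0.0))   # alpha = 0: reduces to the prior
print(gibbs_posterior(prior, energies, 50.0))  # large alpha: mass concentrates on the energy minimizer
```

At $\alpha = 0$ the output is the prior itself, while for large $\alpha$ the distribution concentrates on the hypothesis with the smallest energy, previewing the ERM limit discussed in Section 1.3.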
If $P$ and $Q$ are probability measures over the space $\mathcal{W}$, and $P$ is absolutely continuous with respect to $Q$, the Kullback-Leibler (KL) divergence between $P$ and $Q$ is given by $D(P\|Q) \triangleq \int_{\mathcal{W}} \log \frac{dP}{dQ}\, dP$. If $Q$ is also absolutely continuous with respect to $P$, the symmetrized KL divergence (a.k.a. Jeffrey's divergence (Jeffreys, 1946)) is

$D_{SKL}(P\|Q) \triangleq D(P\|Q) + D(Q\|P)$.  (3)
The mutual information between two random variables $S$ and $W$ is defined as the KL divergence between the joint distribution and the product-of-marginal distribution, $I(S;W) \triangleq D(P_{S,W}\|P_S \otimes P_W)$, or equivalently, the conditional KL divergence between $P_{W|S}$ and $P_W$ averaged over $P_S$, $D(P_{W|S}\|P_W|P_S) \triangleq \int D(P_{W|S=s}\|P_W)\, dP_S(s)$. By swapping the roles of $P_{S,W}$ and $P_S \otimes P_W$ in the mutual information, we get the lautum information introduced by Palomar & Verdú (2008), $L(S;W) \triangleq D(P_S \otimes P_W\|P_{S,W})$. Finally, the symmetrized KL information between $S$ and $W$ is given by (Aminian et al., 2015):

$I_{SKL}(S;W) \triangleq I(S;W) + L(S;W)$.  (4)
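For intuition, these information measures can be evaluated exactly for a small discrete joint distribution. The following sketch (our own, with an arbitrary 2×2 joint) checks that the symmetrized KL information is the sum of the mutual and lautum information:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D(p || q), with the convention 0 log 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Arbitrary joint distribution P_{S,W} on a 2x2 alphabet (rows: S, columns: W).
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])
p_s = joint.sum(axis=1, keepdims=True)   # marginal P_S
p_w = joint.sum(axis=0, keepdims=True)   # marginal P_W
prod = p_s * p_w                         # product-of-marginals P_S x P_W

I     = kl(joint.ravel(), prod.ravel())  # mutual information I(S;W)
lau   = kl(prod.ravel(), joint.ravel())  # lautum information L(S;W)
i_skl = I + lau                          # symmetrized KL information
print(I, lau, i_skl)
```

Note that the mutual information is upper-bounded by the entropies of the marginals, while the lautum information is not; this asymmetry reappears in Remark 2.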
Throughout the paper, upper-case letters denote random variables, lower-case letters denote the realizations of random variables, and calligraphic letters denote sets. All logarithms are natural, and all information measures are in nats. $\mathcal{N}(\mu, \Sigma)$ denotes a Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$.

1.2 Contributions
The core contribution of this paper (see Theorem 1) is an exact characterization of the expected generalization error for the Gibbs algorithm in terms of the symmetrized KL information between the input training samples $S$ and the output hypothesis $W$, as follows:

$\overline{\mathrm{gen}}(P^{\alpha}_{W|S}, P_S) = \dfrac{I_{SKL}(S;W)}{\alpha}$.
This result highlights the fundamental role of such an information quantity in learning theory that does not appear to have been recognized before. We also discuss some general properties of the symmetrized KL information, which could be used to prove the nonnegativity and concavity of the expected generalization error for the Gibbs algorithm.
Building upon this result, we further expand our contributions by tightening existing expected generalization error bounds for the Gibbs algorithm under i.i.d. and sub-Gaussian assumptions, combining our symmetrized KL information characterization with existing bounding techniques.
1.3 Motivations for Gibbs Algorithm
As we discuss below, the choice of the Gibbs algorithm is not arbitrary since it arises naturally in many different applications and is sufficiently general to model many learning algorithms used in practice:
Empirical Risk Minimization: The Gibbs algorithm can be viewed as a randomized version of the empirical risk minimization (ERM) algorithm if we specify the energy function as the empirical risk, $f(w,s) = L_E(w,s)$. As the inverse temperature $\alpha \to \infty$, the influence of the prior distribution becomes negligible, and the Gibbs algorithm converges to the standard ERM algorithm.
Information Risk Minimization: The Gibbs algorithm is the solution to the regularized ERM problem obtained by using the conditional KL divergence $D(P_{W|S}\|\pi|P_S)$ as a regularizer to penalize over-fitting in the information risk minimization framework (Xu & Raginsky, 2017; Zhang, 2006; Zhang et al., 2006).
SGLD Algorithm: Stochastic Gradient Langevin Dynamics (SGLD) can be viewed as the discrete version of continuous-time Langevin diffusion. In (Raginsky et al., 2017), it is proved that, under some conditions on the loss function, the distribution induced by the SGLD algorithm is close to the Gibbs distribution in 2-Wasserstein distance after sufficiently many iterations, where the prior $\pi$ plays the role of the distribution of the hypothesis at the first step. Similarly, under some conditions on the loss function, (Chiang et al., 1987; Markowich & Villani, 2000) show that in continuous-time Langevin diffusion, the stationary distribution of the hypothesis is the Gibbs distribution.
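A minimal one-dimensional sketch of this connection (our own illustration, not the construction in the cited works): running unadjusted Langevin dynamics on $U(w) = \alpha f(w) - \log \pi(w)$ with a quadratic $f$ and a standard Gaussian prior, the iterates settle near the mean of the corresponding Gibbs distribution:

```python
import numpy as np

# Langevin dynamics targeting the Gibbs distribution proportional to
# pi(w) * exp(-alpha * f(w)), with f(w) = (w - 1)^2 / 2 and prior pi = N(0, 1).
# The target is then N(alpha / (alpha + 1), 1 / (alpha + 1)).
rng = np.random.default_rng(0)
alpha, eta, n_steps, burn_in = 4.0, 0.01, 200_000, 10_000

def grad_U(w):
    # U(w) = alpha * f(w) - log pi(w), up to additive constants
    return alpha * (w - 1.0) + w

w, samples = 0.0, []
for k in range(n_steps):
    w = w - eta * grad_U(w) + np.sqrt(2 * eta) * rng.standard_normal()
    if k >= burn_in:
        samples.append(w)

print(np.mean(samples))   # close to alpha / (alpha + 1) = 0.8
```

The step size, iteration counts, and quadratic energy are arbitrary choices; the point is only that the long-run samples track the Gibbs mean and variance.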
1.4 Other Related Works
Information-theoretic generalization error bounds: Recently, (Russo & Zou, 2019; Xu & Raginsky, 2017) proposed to use the mutual information between the input training set and the output hypothesis to upper bound the expected generalization error. However, those bounds are known not to be tight, and multiple approaches have been proposed to tighten the mutual information-based bound. (Bu et al., 2020b) provides tighter bounds by considering the individual sample mutual information, (Asadi et al., 2018; Asadi & Abbe, 2020) propose using chaining mutual information, and (Steinke & Zakynthinou, 2020; Hafez-Kolahi et al., 2020; Haghifam et al., 2020) advocate conditioning and processing techniques. Information-theoretic generalization error bounds using other information quantities have also been studied, such as $f$-divergence (Jiao et al., 2017), Rényi divergence and maximal leakage (Issa et al., 2019; Esposito et al., 2019), Jensen-Shannon divergence (Aminian et al., 2020), and the Wasserstein distance (Lopez & Jog, 2018; Wang et al., 2019; Rodríguez-Gálvez et al., 2021). Using rate-distortion theory, (Masiha et al., 2021; Bu et al., 2020a) provide information-theoretic generalization error upper bounds for model misspecification and model compression.
Generalization error of the Gibbs algorithm: Both information-theoretic and PAC-Bayesian approaches have been used to bound the generalization error of the Gibbs algorithm. An information-theoretic upper bound with a convergence rate of $\mathcal{O}(1/n)$ is provided in (Raginsky et al., 2016) for the Gibbs algorithm with bounded loss function, and PAC-Bayesian bounds using a variational approximation of Gibbs posteriors are studied in (Alquier et al., 2016). (Kuzborskij et al., 2019) focus on the excess risk of the Gibbs algorithm, and a similar generalization bound with rate $\mathcal{O}(1/n)$ is provided under a sub-Gaussian assumption. Although these bounds are tight in terms of the sample complexity $n$, they become vacuous as the inverse temperature $\alpha \to \infty$, hence are unable to capture the behaviour of the ERM algorithm.
Our work differs from this body of research in that we provide an exact characterization of the generalization error of the Gibbs algorithm in terms of the symmetrized KL information. We further leverage this characterization to tighten existing expected generalization error bounds in the literature.
2 Generalization Error of Gibbs Algorithm
Our main result, which characterizes the exact expected generalization error of the Gibbs algorithm in (2) with prior distribution $\pi(w)$ and energy function $f(w,s) = L_E(w,s)$, is as follows:
Theorem 1.
For the Gibbs algorithm, the expected generalization error is given by

$\overline{\mathrm{gen}}(P^{\alpha}_{W|S}, P_S) = \dfrac{I_{SKL}(S;W)}{\alpha}$.  (5)
Sketch of Proof:
It can be shown that the symmetrized KL information can be written as

$I_{SKL}(S;W) = \alpha\big(\mathbb{E}_{P_S \otimes P_W}[L_E(W,S)] - \mathbb{E}_{P_{W,S}}[L_E(W,S)]\big)$.  (6)

Just like the generalization error, the above expression is the difference between the expectations of the same function evaluated under the joint distribution and the product-of-marginal distribution. Since $P_{S,W}$ and $P_S \otimes P_W$ share the same marginal distributions, we have $\mathbb{E}_{P_S \otimes P_W}[L_E(W,S)] = \mathbb{E}_{P_W}[L_P(W,P_S)] = \mathbb{E}_{P_{W,S}}[L_P(W,P_S)]$. Then, combining (2) with (6) completes the proof. More details are provided in Appendix A. ∎
To the best of our knowledge, this is the first exact characterization of the expected generalization error for the Gibbs algorithm. Note that Theorem 1 only assumes that the loss function is non-negative, and it holds even for non-i.i.d. training samples.
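Theorem 1 can be verified numerically by exact enumeration on a toy discrete problem (our own construction; the Bernoulli parameter, hypothesis class, and squared loss are arbitrary choices):

```python
import itertools
import numpy as np

q, n, alpha = 0.3, 3, 2.0                  # Bernoulli(q) samples, n of them, inverse temperature
ws = np.array([0.0, 0.5, 1.0])             # finite hypothesis class
prior = np.ones(3) / 3                     # uniform prior pi(w)
loss = lambda w, z: (w - z) ** 2

L_pop = np.array([q * loss(w, 1) + (1 - q) * loss(w, 0) for w in ws])   # population risk L_P(w)

S = list(itertools.product([0, 1], repeat=n))                           # all possible datasets
P_s = np.array([q ** sum(s) * (1 - q) ** (n - sum(s)) for s in S])      # P_S
L_emp = np.array([[np.mean([loss(w, z) for z in s]) for w in ws] for s in S])

P_w_given_s = prior * np.exp(-alpha * L_emp)                # Gibbs algorithm with f = L_E
P_w_given_s /= P_w_given_s.sum(axis=1, keepdims=True)
joint = P_s[:, None] * P_w_given_s                          # joint P_{S,W}
P_w = joint.sum(axis=0)                                     # marginal P_W

gen = np.sum(joint * (L_pop[None, :] - L_emp))                                  # expected generalization error
I   = np.sum(joint * np.log(P_w_given_s / P_w[None, :]))                        # I(S;W)
lau = np.sum(P_s[:, None] * P_w[None, :] * np.log(P_w[None, :] / P_w_given_s))  # L(S;W)
print(gen, (I + lau) / alpha)   # the two quantities coincide
```

Because every distribution is enumerated exactly, the generalization error and $I_{SKL}(S;W)/\alpha$ agree to machine precision, matching (5).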
2.1 General Properties
By Theorem 1, some basic properties of the expected generalization error, including non-negativity and concavity, can be proved directly from the properties of the symmetrized KL information.

The non-negativity of the expected generalization error, i.e., $\overline{\mathrm{gen}}(P^{\alpha}_{W|S}, P_S) \ge 0$, follows from the non-negativity of the symmetrized KL information. Note that the non-negativity result obtained in (Kuzborskij et al., 2019) requires more technical assumptions, including i.i.d. samples and a sub-Gaussian loss function.
It is shown in (Aminian et al., 2015) that the symmetrized KL information $I_{SKL}(S;W)$ is a concave function of $P_S$ for fixed $P_{W|S}$, and a convex function of $P_{W|S}$ for fixed $P_S$. Thus, we have the following corollary.
Corollary 1.
For a fixed Gibbs algorithm $P^{\alpha}_{W|S}$, the expected generalization error $\overline{\mathrm{gen}}(P^{\alpha}_{W|S}, P_S)$ is a concave function of $P_S$.
The concavity of the generalization error for the Gibbs algorithm immediately explains the well-known observation that training a model on a mixture of datasets from different domains can lead to poor generalization. Suppose that the data-generating distribution is domain-dependent, i.e., there exists a random variable $D$ denoting the domain, such that the training samples are generated from $P_{S|D}$. Then, $P_S = \mathbb{E}_{P_D}[P_{S|D}]$ can be viewed as the mixture of the data-generating distributions across all domains. From Corollary 1 and Jensen's inequality, we have

$\overline{\mathrm{gen}}(P^{\alpha}_{W|S}, P_S) \ge \mathbb{E}_{P_D}\big[\overline{\mathrm{gen}}(P^{\alpha}_{W|S}, P_{S|D})\big]$,  (7)

which shows that the generalization error of the Gibbs algorithm achieved with the mixture distribution $P_S$ is larger than the averaged generalization error over the individual domains $P_{S|D}$.
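This mixture effect can be checked numerically with a toy discrete setup (our own construction): a fixed Gibbs algorithm evaluated on two "domains", Bernoulli(0.1) and Bernoulli(0.9), and on their 50/50 mixture:

```python
import itertools
import numpy as np

n, alpha = 3, 2.0
ws = np.array([0.0, 0.5, 1.0])             # finite hypothesis class
prior = np.ones(3) / 3
loss = lambda w, z: (w - z) ** 2

S = list(itertools.product([0, 1], repeat=n))
L_emp = np.array([[np.mean([loss(w, z) for z in s]) for w in ws] for s in S])
P_w_given_s = prior * np.exp(-alpha * L_emp)               # fixed Gibbs algorithm with f = L_E
P_w_given_s /= P_w_given_s.sum(axis=1, keepdims=True)

def bern(q):
    """P_S for n i.i.d. Bernoulli(q) samples."""
    return np.array([q ** sum(s) * (1 - q) ** (n - sum(s)) for s in S])

def gen_error(P_s):
    joint = P_s[:, None] * P_w_given_s
    L_pop = P_s @ L_emp    # population risk of each w: expectation of L_E over an independent S'
    return np.sum(joint * (L_pop[None, :] - L_emp))

g1, g2 = gen_error(bern(0.1)), gen_error(bern(0.9))
g_mix = gen_error(0.5 * bern(0.1) + 0.5 * bern(0.9))
print(g_mix, 0.5 * (g1 + g2))   # mixture generalization error is at least the average
```

As predicted by (7), the mixture's generalization error is no smaller than the average of the per-domain errors.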
2.2 Example: Mean Estimation
We now consider a simple learning problem, where the symmetrized KL information can be computed exactly, to demonstrate the usefulness of Theorem 1. All details are provided in Appendix B.
Consider the problem of learning the mean $\mu$ of a random vector $Z \in \mathbb{R}^d$ using $n$ i.i.d. training samples $S = \{Z_i\}_{i=1}^{n}$. We assume that the covariance matrix of $Z$ satisfies $\Sigma_Z = \sigma_Z^2 I_d$ with unknown $\sigma_Z$. We adopt the mean-squared loss $\ell(w,z) = \|w - z\|_2^2$, and assume a Gaussian prior for the mean, $\pi(w) = \mathcal{N}(\mu_0, \sigma_0^2 I_d)$. If we set the inverse temperature $\alpha = \frac{n}{2\sigma^2}$, then the Gibbs algorithm is given by the following posterior distribution (Murphy, 2007),

$P^{\alpha}_{W|S}(w|z_1,\dots,z_n) = \mathcal{N}(\mu_1, \sigma_1^2 I_d)$,  (8)

with $\sigma_1^2 = \big(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\big)^{-1}$ and $\mu_1 = \sigma_1^2\big(\frac{n}{\sigma^2}\bar{Z} + \frac{\mu_0}{\sigma_0^2}\big)$, where $\bar{Z} \triangleq \frac{1}{n}\sum_{i=1}^{n} Z_i$.

Since the channel from $S$ to $W$ is Gaussian, the mutual information and lautum information can be evaluated using Lemma 3 in Appendix B.2, and their sum gives

$I_{SKL}(S;W) = I(S;W) + L(S;W) = \operatorname{tr}\big(\Sigma_N^{-1} A \Sigma_X A^{\top}\big)$,  (9)

$I_{SKL}(S;W) = \dfrac{n d\, \sigma_Z^2 \sigma_0^2}{\sigma^2 (n \sigma_0^2 + \sigma^2)}$,  (10)

with $A$, $\Sigma_X$, and $\Sigma_N$ as specified in Appendix B.2. As we can see from the above expressions, the symmetrized KL information is independent of the distribution of $Z$, as long as $\Sigma_Z = \sigma_Z^2 I_d$.
From Theorem 1, the generalization error of this algorithm can be computed exactly as

$\overline{\mathrm{gen}}(P^{\alpha}_{W|S}, P_S) = \dfrac{I_{SKL}(S;W)}{\alpha} = \dfrac{2 d\, \sigma_Z^2 \sigma_0^2}{n \sigma_0^2 + \sigma^2}$,  (11)

which has the decay rate of $\mathcal{O}(1/n)$. As a comparison, the individual sample mutual information (ISMI) bound from (Bu et al., 2020b), which is shown to be tighter than the mutual information-based bound in (Xu & Raginsky, 2017, Theorem 1), gives a suboptimal bound of order $\mathcal{O}(1/\sqrt{n})$ as $n \to \infty$ (see Appendix B.3).
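A Monte-Carlo sanity check of the $\mathcal{O}(1/n)$ decay in the scalar case $d = 1$ (our own sketch; we take $\sigma_Z = \sigma$, prior mean $0$, and arbitrary numerical values):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, sigma0 = 1.0, 1.0, 2.0    # true mean; data std (= likelihood std here); prior N(0, sigma0^2)

def gen_error_mc(n, trials=200_000):
    z = rng.normal(mu, sigma, size=(trials, n))
    zbar = z.mean(axis=1)
    s1_sq = 1.0 / (n / sigma**2 + 1.0 / sigma0**2)    # posterior variance
    post_mean = s1_sq * n * zbar / sigma**2           # posterior mean (prior mean is 0)
    w = post_mean + np.sqrt(s1_sq) * rng.standard_normal(trials)
    L_pop = (w - mu) ** 2 + sigma**2                  # population risk E[(w - Z)^2]
    L_emp = ((w[:, None] - z) ** 2).mean(axis=1)      # empirical risk on the training set
    return float(np.mean(L_pop - L_emp))

# Under these assumptions one can derive gen(n) = 2 * sigma^2 * sigma0^2 / (n * sigma0^2 + sigma^2).
print(gen_error_mc(10), gen_error_mc(20))   # roughly 0.195 and 0.099
```

Doubling $n$ roughly halves the estimated generalization error, consistent with the $\mathcal{O}(1/n)$ rate.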
3 Tighter Expected Generalization Error Upper Bound
In this section, we show that by combining Theorem 1 with the information-theoretic bound proposed in (Xu & Raginsky, 2017) under i.i.d. and sub-Gaussian assumptions, we can provide a tighter generalization error upper bound for the Gibbs algorithm. This bound quantifies how the generalization error of the Gibbs algorithm depends on the number of training samples $n$, and is useful when directly evaluating the symmetrized KL information is hard.
Theorem 2.
(proved in Appendix C) Suppose that the training samples $S = \{Z_i\}_{i=1}^{n}$ are i.i.d. generated from the distribution $P_Z$, and the non-negative loss function $\ell(w,z)$ is $\sigma$-sub-Gaussian on the left-tail¹ under distribution $P_Z$ for all $w \in \mathcal{W}$. If we further set the energy function $f(w,s) = L_E(w,s)$, then for the Gibbs algorithm, we have

$0 \le \overline{\mathrm{gen}}(P^{\alpha}_{W|S}, P_S) \le \dfrac{2 \sigma^2 \alpha}{n}$.  (12)

¹A random variable $X$ is $\sigma$-sub-Gaussian if $\mathbb{E}[e^{\lambda(X - \mathbb{E}X)}] \le e^{\sigma^2 \lambda^2 / 2}$ for all $\lambda \in \mathbb{R}$, and $X$ is $\sigma$-sub-Gaussian on the left-tail if the same inequality holds for all $\lambda \le 0$.
Theorem 2 establishes the $\mathcal{O}(1/n)$ convergence rate of the generalization error of the Gibbs algorithm with i.i.d. training samples, and suggests that a smaller inverse temperature $\alpha$ leads to a tighter upper bound. Note that every $\sigma$-sub-Gaussian loss function is also $\sigma$-sub-Gaussian on the left-tail under the same distribution, but not conversely (the loss function in Section 2.2 is sub-Gaussian on the left-tail, but not sub-Gaussian). Therefore, our result also applies to any bounded loss function, since bounded functions are sub-Gaussian.
Remark 1 (Previous Results).
Using the fact that the Gibbs algorithm is differentially private for bounded loss functions (McSherry & Talwar, 2007), directly applying (Xu & Raginsky, 2017, Theorem 1) gives a suboptimal bound with a slower convergence rate. By further exploiting the bounded-loss assumption via Hoeffding's lemma, a tighter upper bound is obtained in (Raginsky et al., 2016), which has a similar decay rate of $\mathcal{O}(1/n)$. In (Kuzborskij et al., 2019, Theorem 1), the upper bound is derived under a different assumption, namely that the loss is sub-Gaussian under the distribution induced by the Gibbs algorithm. In Theorem 2, we assume that the loss function is sub-Gaussian on the left-tail under the data-generating distribution $P_Z$ for all $w \in \mathcal{W}$, which is more general, as discussed above. Our upper bound also improves upon the result in (Kuzborskij et al., 2019) by a constant factor.
Remark 2 (Role of the Lautum Information).
Since $L(S;W) \ge 0$, lower-bounding $I_{SKL}(S;W)$ by the mutual information $I(S;W)$, as done in the proof of Theorem 2, is always valid. However, the lautum information term is not negligible in general: as shown in (Palomar & Verdú, 2008, Theorem 15), $L(S;W) \ge I(S;W)$ holds for any Gaussian channel $P_{W|S}$. In addition, it is discussed in (Palomar & Verdú, 2008, Example 1) that if either the entropy of the training set $S$ or that of the hypothesis $W$ is small, $I(S;W)$ would be smaller than $L(S;W)$, as the lautum information is not upper-bounded by the entropy.
4 Conclusion
We provide an exact characterization of the expected generalization error for the Gibbs algorithm using the symmetrized KL information. We demonstrate the versatility of our approach by tightening an existing information-theoretic expected generalization error upper bound. This work motivates further investigation of the Gibbs algorithm in a variety of settings, including extending our results to characterize the generalization ability of an over-parameterized Gibbs algorithm, which could potentially shed more light on the generalization ability of deep learning.
5 Acknowledgment
Yuheng Bu is supported, in part, by NSF under Grant CCF-1717610 and by the MIT-IBM Watson AI Lab. Gholamali Aminian is supported by the Royal Society Newton International Fellowship, grant no. NIF\R1\192656.
References
 Alquier et al. (2016) Alquier, P., Ridgway, J., and Chopin, N. On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research, 17(1):8374–8414, 2016.
 Aminian et al. (2015) Aminian, G., Arjmandi, H., Gohari, A., Nasiri-Kenari, M., and Mitra, U. Capacity of diffusion-based molecular communication networks over LTI-Poisson channels. IEEE Transactions on Molecular, Biological and Multi-Scale Communications, 1(2):188–201, 2015.
 Aminian et al. (2020) Aminian, G., Toni, L., and Rodrigues, M. R. Jensen-Shannon information based characterization of the generalization error of learning algorithms. 2020 IEEE Information Theory Workshop (ITW), 2020.
 Anthony & Bartlett (2009) Anthony, M. and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
 Asadi et al. (2018) Asadi, A., Abbe, E., and Verdú, S. Chaining mutual information and tightening generalization bounds. In Advances in Neural Information Processing Systems, pp. 7234–7243, 2018.

 Asadi & Abbe (2020) Asadi, A. R. and Abbe, E. Chaining meets chain rule: Multilevel entropic regularization and training of neural networks. Journal of Machine Learning Research, 21(139):1–32, 2020.
 Bousquet & Elisseeff (2002) Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
 Bu et al. (2020a) Bu, Y., Gao, W., Zou, S., and Veeravalli, V. Information-theoretic understanding of population risk improvement with model compression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3300–3307, 2020a.
 Bu et al. (2020b) Bu, Y., Zou, S., and Veeravalli, V. V. Tightening mutual information-based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory, 1(1):121–130, 2020b.
 Chiang et al. (1987) Chiang, T.-S., Hwang, C.-R., and Sheu, S. J. Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization, 25(3):737–753, 1987.
 Esposito et al. (2019) Esposito, A. R., Gastpar, M., and Issa, I. Generalization error bounds via Rényi-, f-divergences and maximal leakage. arXiv preprint arXiv:1912.01439, 2019.
 Gibbs (1902) Gibbs, J. W. Elementary principles of statistical mechanics. Compare, 289:314, 1902.
 Hafez-Kolahi et al. (2020) Hafez-Kolahi, H., Golgooni, Z., Kasaei, S., and Soleymani, M. Conditioning and processing: Techniques to improve information-theoretic generalization bounds. Advances in Neural Information Processing Systems, 33, 2020.
 Haghifam et al. (2020) Haghifam, M., Negrea, J., Khisti, A., Roy, D. M., and Dziugaite, G. K. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. Advances in Neural Information Processing Systems, 2020.
 Issa et al. (2019) Issa, I., Esposito, A. R., and Gastpar, M. Strengthened informationtheoretic bounds on the generalization error. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 582–586. IEEE, 2019.

 Jeffreys (1946) Jeffreys, H. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
 Jiao et al. (2017) Jiao, J., Han, Y., and Weissman, T. Dependence measures bounding the exploration bias for general measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 1475–1479. IEEE, 2017.
 Kuzborskij et al. (2019) Kuzborskij, I., Cesa-Bianchi, N., and Szepesvári, C. Distribution-dependent analysis of Gibbs-ERM principle. In Conference on Learning Theory, pp. 2028–2054. PMLR, 2019.
 Lopez & Jog (2018) Lopez, A. T. and Jog, V. Generalization error bounds using Wasserstein distances. In 2018 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE, 2018.
 Markowich & Villani (2000) Markowich, P. A. and Villani, C. On the trend to equilibrium for the Fokker-Planck equation: an interplay between physics and functional analysis. Mat. Contemp, 19:1–29, 2000.
 Masiha et al. (2021) Masiha, M. S., Gohari, A., Yassaee, M. H., and Aref, M. R. Learning under distribution mismatch and model misspecification. arXiv preprint arXiv:2102.05695, 2021.
 McAllester (2003) McAllester, D. A. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
 McSherry & Talwar (2007) McSherry, F. and Talwar, K. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pp. 94–103. IEEE, 2007.
 Murphy (2007) Murphy, K. P. Conjugate Bayesian analysis of the Gaussian distribution. def, 1(22):16, 2007.
 Palomar & Verdú (2008) Palomar, D. P. and Verdú, S. Lautum information. IEEE Transactions on Information Theory, 54(3):964–975, 2008.
 Raginsky et al. (2016) Raginsky, M., Rakhlin, A., Tsao, M., Wu, Y., and Xu, A. Informationtheoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop (ITW), pp. 26–30. IEEE, 2016.
 Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. Nonconvex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pp. 1674–1703. PMLR, 2017.

 Rodrigues & Eldar (2021) Rodrigues, M. R. and Eldar, Y. C. Information-Theoretic Methods in Data Science. Cambridge University Press, 2021.
 Rodríguez-Gálvez et al. (2021) Rodríguez-Gálvez, B., Bassi, G., Thobaben, R., and Skoglund, M. Tighter expected generalization error bounds via Wasserstein distance. arXiv preprint arXiv:2101.09315, 2021.
 Russo & Zou (2019) Russo, D. and Zou, J. How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.
 Steinke & Zakynthinou (2020) Steinke, T. and Zakynthinou, L. Reasoning about generalization via conditional mutual information. arXiv preprint arXiv:2001.09122, 2020.
 Vapnik (1999) Vapnik, V. N. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
 Wang et al. (2019) Wang, H., Diaz, M., Santos Filho, J. C. S., and Calmon, F. P. An information-theoretic view of generalization via Wasserstein distance. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 577–581. IEEE, 2019.
 Xu & Raginsky (2017) Xu, A. and Raginsky, M. Informationtheoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pp. 2524–2533, 2017.
 Xu & Mannor (2012) Xu, H. and Mannor, S. Robustness and generalization. Machine learning, 86(3):391–423, 2012.
 Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
 Zhang (2006) Zhang, T. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006.
 Zhang et al. (2006) Zhang, T. et al. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5):2180–2210, 2006.
Appendix A Proof of Theorem 1
We start with the following two lemmas:
Lemma 1.
The symmetrized KL information can be written as a difference of expectations of the same function, i.e.,

$I_{SKL}(S;W) = \mathbb{E}_{P_{S,W}}\big[\log P_{W|S}(W|S)\big] - \mathbb{E}_{P_S \otimes P_W}\big[\log P_{W|S}(W|S)\big]$.  (13)

Proof.
By definition, $I_{SKL}(S;W) = I(S;W) + L(S;W) = \mathbb{E}_{P_{S,W}}\big[\log \frac{P_{W|S}}{P_W}\big] + \mathbb{E}_{P_S \otimes P_W}\big[\log \frac{P_W}{P_{W|S}}\big]$. Since $P_{S,W}$ and $P_S \otimes P_W$ share the same marginal $P_W$, the $\log P_W$ terms cancel, which gives (13). ∎
Lemma 2.
Consider the Gibbs algorithm $P^{\alpha}_{W|S}$ in (2) with energy function $f(w,s)$. Then,

$I_{SKL}(S;W) = \alpha\big(\mathbb{E}_{P_S \otimes P_W}[f(W,S)] - \mathbb{E}_{P_{S,W}}[f(W,S)]\big)$.  (14)

Proof.
Substituting $\log P^{\alpha}_{W|S}(w|s) = \log \pi(w) - \alpha f(w,s) - \log V(s,\alpha)$ into (13), the $\log \pi(W)$ and $\log V(S,\alpha)$ terms have the same expectation under $P_{S,W}$ and $P_S \otimes P_W$, since the two distributions share the same marginals $P_S$ and $P_W$; hence only the energy terms remain. ∎
Setting $f(w,s) = L_E(w,s)$ in Lemma 2 and noting that $\mathbb{E}_{P_S \otimes P_W}[L_E(W,S)] = \mathbb{E}_{P_{W,S}}[L_P(W,P_S)]$ completes the proof of Theorem 1.
Appendix B Example Details: Estimating the Mean of a Gaussian
B.1 Generalization Error
We first evaluate the generalization error of the learning algorithm in (8) directly. Note that the output $W$ can be written as

$W = \mu_1 + \xi = \sigma_1^2\Big(\dfrac{n}{\sigma^2}\bar{Z} + \dfrac{\mu_0}{\sigma_0^2}\Big) + \xi$,  (16)

where $\xi \sim \mathcal{N}(0, \sigma_1^2 I_d)$ is independent of the training samples $S$. Thus,

$\overline{\mathrm{gen}}(P^{\alpha}_{W|S}, P_S) = \mathbb{E}\big[L_P(W, P_Z) - L_E(W, S)\big] = \dfrac{2 d\, \sigma_Z^2 \sigma_0^2}{n \sigma_0^2 + \sigma^2}$,  (17)

where the computation uses an independent copy $\widetilde{Z}$ of the training sample, the fact that the $Z_i$ are i.i.d., and the fact that $\xi$ has zero mean and is independent of $S$.
B.2 Symmetrized KL Divergence
The following lemma from (Palomar & Verdú, 2008) characterizes the mutual and lautum information for the Gaussian channel.
Lemma 3.
(Palomar & Verdú, 2008, Theorem 14) Consider the following model

$Y = A X + N_G$,  (18)

where $X \in \mathbb{R}^m$ denotes the input random vector with zero mean (not necessarily Gaussian), $A \in \mathbb{R}^{p \times m}$ denotes the linear transformation undergone by the input, $Y \in \mathbb{R}^p$ is the output vector, and $N_G \in \mathbb{R}^p$ is a Gaussian noise vector independent of $X$. The input and the noise covariance matrices are given by $\Sigma_X$ and $\Sigma_N$. Then, the mutual information and lautum information are given by

$I(X;Y) = \dfrac{1}{2}\log\det\big(I_p + \Sigma_N^{-1} A \Sigma_X A^{\top}\big) - D\big(P_Y \,\|\, P_Y^G\big)$,  (19)

$L(X;Y) = \operatorname{tr}\big(\Sigma_N^{-1} A \Sigma_X A^{\top}\big) - \dfrac{1}{2}\log\det\big(I_p + \Sigma_N^{-1} A \Sigma_X A^{\top}\big) + D\big(P_Y \,\|\, P_Y^G\big)$,  (20)

where $P_Y^G$ denotes the Gaussian distribution with the same mean and covariance as $P_Y$. In particular, the sum $I(X;Y) + L(X;Y) = \operatorname{tr}(\Sigma_N^{-1} A \Sigma_X A^{\top})$ depends on the input distribution only through its covariance.
In our example, the output of the Gibbs algorithm can be written as

$W = A S + N_G$,  (21)

where $A = \frac{\sigma_1^2}{\sigma^2}\,[\, I_d, \dots, I_d \,] \in \mathbb{R}^{d \times nd}$, $S$ denotes the concatenation of $(Z_1, \dots, Z_n)$, and $N_G \sim \mathcal{N}\big(\frac{\sigma_1^2}{\sigma_0^2}\mu_0, \sigma_1^2 I_d\big)$. Then, setting $\Sigma_X = \sigma_Z^2 I_{nd}$ and $\Sigma_N = \sigma_1^2 I_d$, and noticing that constant mean shifts do not affect the information measures in Lemma 3, completes the proof:

$I_{SKL}(S;W) = \operatorname{tr}\big(\Sigma_N^{-1} A \Sigma_X A^{\top}\big) = \dfrac{n d\, \sigma_Z^2 \sigma_1^2}{\sigma^4} = \dfrac{n d\, \sigma_Z^2 \sigma_0^2}{\sigma^2 (n \sigma_0^2 + \sigma^2)}$.
B.3 ISMI Bound
In this subsection, we evaluate the ISMI bound from (Bu et al., 2020b) for the example discussed in Section 2.2, with i.i.d. samples generated from the Gaussian distribution $P_Z = \mathcal{N}(\mu, \sigma_Z^2 I_d)$.
Lemma 4.
(Bu et al., 2020b, Theorem 2) Suppose the cumulant generating function of $\ell(\widetilde{W}, \widetilde{Z})$ satisfies $\Lambda_{\ell(\widetilde{W},\widetilde{Z})}(\lambda) \le \psi_+(\lambda)$ for $\lambda \in [0, b_+)$ and $\Lambda_{\ell(\widetilde{W},\widetilde{Z})}(\lambda) \le \psi_-(-\lambda)$ for $\lambda \in (b_-, 0]$ under $P_{\widetilde{W}} \otimes P_{\widetilde{Z}} = P_W \otimes P_Z$, where $\psi_+$ and $\psi_-$ are convex functions with $\psi_\pm(0) = \psi_\pm'(0) = 0$, and $0 < b_+ \le \infty$, $-\infty \le b_- < 0$. Then,

$\overline{\mathrm{gen}}(P_{W|S}, P_S) \le \dfrac{1}{n}\sum_{i=1}^{n} \psi_-^{*-1}\big(I(W; Z_i)\big)$,  (22)

$\overline{\mathrm{gen}}(P_{W|S}, P_S) \ge -\dfrac{1}{n}\sum_{i=1}^{n} \psi_+^{*-1}\big(I(W; Z_i)\big)$,  (23)

where $\psi^{*-1}(x) \triangleq \inf_{\lambda \in [0,b)} \frac{x + \psi(\lambda)}{\lambda}$ denotes the generalized inverse of the Legendre dual of $\psi$.
First, we need to compute the mutual information $I(W;Z_i)$ between each individual sample and the output hypothesis, and the cumulant generating function (CGF) of $\ell(\widetilde{W}, \widetilde{Z})$, where $\widetilde{W}$ and $\widetilde{Z}$ are independent copies of $W$ and $Z_i$ with the same marginal distributions, respectively.

Since $W$ and $Z_i$ are jointly Gaussian, $I(W;Z_i)$ can be computed exactly as

$I(W; Z_i) = -\dfrac{d}{2}\log(1 - \rho^2)$,  (24)

where the per-coordinate squared correlation coefficient satisfies

$\rho^2 = \dfrac{\sigma_1^2 \sigma_Z^2}{n \sigma_1^2 \sigma_Z^2 + \sigma^4}$,  (25)

so that $I(W;Z_i) = \mathcal{O}(1/n)$ as $n \to \infty$. In addition, since

$\widetilde{W} - \widetilde{Z} \sim \mathcal{N}\big(\mu_W - \mu,\, (\sigma_W^2 + \sigma_Z^2)\, I_d\big)$,  (26)

with $\mu_W \triangleq \mathbb{E}[W]$ and $\sigma_W^2$ the per-coordinate variance of $W$, it can be shown that $\ell(\widetilde{W}, \widetilde{Z}) = \|\widetilde{W} - \widetilde{Z}\|_2^2$ follows a scaled non-central chi-square distribution with $d$ degrees of freedom, where the scaling factor is $\eta \triangleq \sigma_W^2 + \sigma_Z^2$ and the non-centrality parameter is $\theta \triangleq \|\mu_W - \mu\|_2^2/\eta$. Note that the expectation of a non-central chi-square distribution with non-centrality parameter $\theta$ and $d$ degrees of freedom is $d + \theta$, and its moment generating function is $M(t) = \exp\big(\frac{\theta t}{1 - 2t}\big)(1 - 2t)^{-d/2}$ for $t < \frac{1}{2}$. Therefore, the CGF of $\ell(\widetilde{W}, \widetilde{Z}) - \mathbb{E}[\ell(\widetilde{W}, \widetilde{Z})]$ is given by

$\Lambda(t) = \dfrac{\theta \eta t}{1 - 2\eta t} - \dfrac{d}{2}\log(1 - 2\eta t) - (d + \theta)\eta t$, for $\eta t < \dfrac{1}{2}$.  (27)

Since only the upper bound (22) is needed, we only need to consider the case $t \le 0$. Using $\log(1+u) \ge u - \frac{u^2}{2}$ for $u \ge 0$, it can be shown that

$\Lambda(t) \le (d + 2\theta)\,\eta^2 t^2$, for $t \le 0$,  (28)

which means that $\ell(\widetilde{W}, \widetilde{Z})$ is sub-Gaussian on the left-tail, i.e., we can take

$\psi_-(\lambda) = (d + 2\theta)\,\eta^2 \lambda^2$, so that $\psi_-^{*-1}(x) = 2\eta\sqrt{(d + 2\theta)\,x}$.  (29)

Combining (24) and (29), Lemma 4 gives the following bound:

$\overline{\mathrm{gen}}(P^{\alpha}_{W|S}, P_S) \le 2\eta\sqrt{(d + 2\theta)\, I(W; Z_1)}$.  (30)

If $d$ is fixed, i.e., $d = \mathcal{O}(1)$, then as $n \to \infty$, $\eta$ and $\theta$ converge to constants while $I(W; Z_1) = \mathcal{O}(1/n)$, and the above bound is $\mathcal{O}(1/\sqrt{n})$, in contrast with the exact $\mathcal{O}(1/n)$ generalization error.
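The non-central chi-square facts used above can be sanity-checked by Monte Carlo (our own sketch; the degrees of freedom, non-centrality, and MGF argument are arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(2)
d, theta, t = 3, 2.0, 0.1            # degrees of freedom, non-centrality, MGF argument (t < 1/2)

# X = sum_j (xi_j + delta_j)^2 with sum_j delta_j^2 = theta is non-central chi-square.
delta = np.zeros(d)
delta[0] = np.sqrt(theta)
x = ((rng.standard_normal((1_000_000, d)) + delta) ** 2).sum(axis=1)

mgf_mc = float(np.exp(t * x).mean())
mgf_exact = float(np.exp(theta * t / (1 - 2 * t)) / (1 - 2 * t) ** (d / 2))
print(mgf_mc, mgf_exact)             # the two agree to within Monte-Carlo error
```

The empirical mean of the samples also matches the stated expectation $d + \theta$.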