Evaluating the generalization error of a learning algorithm is one of the most important challenges in statistical learning theory. Various approaches have been developed (Rodrigues & Eldar, 2021), including VC dimension-based bounds (Vapnik, 1999), algorithmic stability-based bounds (Bousquet & Elisseeff, 2002), algorithmic robustness-based bounds (Xu & Mannor, 2012), PAC-Bayesian bounds (McAllester, 2003), and information-theoretic bounds (Xu & Raginsky, 2017).
However, upper bounds on generalization error may not entirely capture the generalization ability of a learning algorithm. One apparent reason is the tightness issue: some upper bounds (Anthony & Bartlett, 2009) can be far from the true generalization error or even vacuous when evaluated in practice. More importantly, existing upper bounds do not fully characterize all the aspects that could influence the generalization error of a supervised learning problem. For example, VC dimension-based bounds depend only on the hypothesis class, and algorithmic stability-based bounds only exploit the properties of the learning algorithm. As a consequence, both methods fail to capture the fact that generalization behavior depends strongly on the interplay between the hypothesis class, the learning algorithm, and the underlying data-generating distribution, as shown in (Zhang et al., 2016). This paper overcomes the above limitations by deriving an exact characterization of the generalization error for a specific supervised learning algorithm, namely the Gibbs algorithm.
1.1 Problem Formulation
Let $S = \{Z_1, \ldots, Z_n\}$ be the training set, where each $Z_i$ is defined on the same alphabet $\mathcal{Z}$. Note that $Z_i$ is not required to be i.i.d. generated from the same data-generating distribution $P_Z$, and we denote the joint distribution of all the training samples as $P_S$. We denote the hypotheses by $w \in \mathcal{W}$, where $\mathcal{W}$ is a hypothesis class. The performance of the hypothesis is measured by a non-negative loss function $\ell\colon \mathcal{W}\times\mathcal{Z}\to\mathbb{R}_0^{+}$, and we can define the empirical risk and the population risk associated with a given hypothesis $w$ as
$$L_E(w,s) \triangleq \frac{1}{n}\sum_{i=1}^n \ell(w, z_i) \quad\text{and}\quad L_P(w, P_S) \triangleq \mathbb{E}_{P_S}\big[L_E(w, S)\big],$$
respectively.
A learning algorithm can be modeled as a randomized mapping from the training set $S$ onto a hypothesis $W \in \mathcal{W}$ according to the conditional distribution $P_{W|S}$. Thus, the expected generalization error that quantifies the degree of over-fitting can be written as
$$\overline{\mathrm{gen}}(P_{W|S}, P_S) \triangleq \mathbb{E}_{P_{W,S}}\big[L_P(W, P_S) - L_E(W, S)\big],$$
where the expectation is taken over the joint distribution $P_{W,S} = P_{W|S}\otimes P_S$.
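As a concrete (hypothetical) instance of these definitions, the following sketch estimates the expected generalization error of the plain ERM mean estimator under squared loss by Monte Carlo; the Gaussian data model and all parameter values are our own illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risk(w, s):
    # L_E(w, s) = (1/n) * sum_i loss(w, z_i), with squared loss as illustration
    return float(np.mean((s - w) ** 2))

def population_risk(w, mu, sigma):
    # L_P(w, P_S) = E_Z[(Z - w)^2] = sigma^2 + (mu - w)^2 for Z ~ N(mu, sigma^2)
    return sigma ** 2 + (mu - w) ** 2

mu, sigma, n, trials = 1.0, 1.0, 10, 20_000
gaps = []
for _ in range(trials):
    s = rng.normal(mu, sigma, n)
    w = s.mean()                      # ERM output under squared loss
    gaps.append(population_risk(w, mu, sigma) - empirical_risk(w, s))

gen_mc = float(np.mean(gaps))
print(gen_mc)                         # close to 2 * sigma**2 / n = 0.2
```

For this classical setting the expected gap is $2\sigma^2/n$, which the simulation reproduces.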
The $(\gamma, \pi(w), f(w,s))$-Gibbs distribution, which was first investigated by (Gibbs, 1902), is defined as:
$$P^{\gamma}_{W|S}(w|s) \triangleq \frac{\pi(w)\, e^{-\gamma f(w,s)}}{V(s,\gamma)}, \qquad \gamma \ge 0,$$
where $\gamma$ is the inverse temperature, $\pi(w)$ is an arbitrary prior distribution of $W$, $f(w,s)$ is the energy function, and $V(s,\gamma) \triangleq \int \pi(w)\, e^{-\gamma f(w,s)}\,\mathrm{d}w$ is the partition function.
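On a finite hypothesis class the Gibbs distribution can be written down directly. A minimal sketch (the prior and energy values below are arbitrary illustrative numbers):

```python
import numpy as np

def gibbs_posterior(prior, energies, gamma):
    """(gamma, pi, f)-Gibbs distribution over a finite hypothesis set:
    p(w | s) = pi(w) * exp(-gamma * f(w, s)) / V(s, gamma)."""
    unnorm = prior * np.exp(-gamma * energies)
    return unnorm / unnorm.sum()      # division by the partition function V

prior = np.array([0.5, 0.3, 0.2])     # pi(w) over 3 hypotheses
energies = np.array([0.9, 0.1, 0.5])  # f(w, s) for one fixed training set s
for gamma in (0.0, 1.0, 100.0):
    print(gamma, gibbs_posterior(prior, energies, gamma).round(3))
# gamma = 0 recovers the prior; large gamma concentrates on argmin f
```

The two temperature extremes make the role of $\gamma$ concrete: at $\gamma=0$ the data are ignored, while $\gamma\to\infty$ approaches empirical risk minimization.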
If $P$ and $Q$ are probability measures over the same space, and $P$ is absolutely continuous with respect to $Q$, the Kullback-Leibler (KL) divergence between $P$ and $Q$ is given by $D(P\|Q) \triangleq \int \log\frac{\mathrm{d}P}{\mathrm{d}Q}\,\mathrm{d}P$. If $Q$ is also absolutely continuous with respect to $P$, the symmetrized KL divergence (a.k.a. Jeffreys divergence (Jeffreys, 1946)) is
$$D_{\mathrm{SKL}}(P\|Q) \triangleq D(P\|Q) + D(Q\|P).$$
The mutual information between two random variables $S$ and $W$ is defined as the KL divergence between the joint distribution and the product-of-marginal distribution, $I(S;W) \triangleq D(P_{S,W}\|P_S\otimes P_W)$, or equivalently, the conditional KL divergence between $P_{W|S}$ and $P_W$ averaged over $P_S$, $D(P_{W|S}\|P_W|P_S) \triangleq \int D(P_{W|S=s}\|P_W)\,\mathrm{d}P_S(s)$. By swapping the roles of $P_{S,W}$ and $P_S\otimes P_W$ in mutual information, we get the lautum information introduced by (Palomar & Verdú, 2008), $L(S;W) \triangleq D(P_S\otimes P_W\|P_{S,W})$. Finally, the symmetrized KL information between $S$ and $W$ is given by (Aminian et al., 2015):
$$I_{\mathrm{SKL}}(S;W) \triangleq I(S;W) + L(S;W) = D_{\mathrm{SKL}}(P_{S,W}\|P_S\otimes P_W).$$
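These information measures are mechanical to compute for discrete distributions. A short sketch (the 2x2 joint distribution is an arbitrary example of our own):

```python
import numpy as np

def kl(p, q):
    # KL divergence D(p || q) for discrete distributions, in nats
    return float(np.sum(p * np.log(p / q)))

P_joint = np.array([[0.4, 0.1],
                    [0.2, 0.3]])                 # joint distribution P_{S,W}
P_S = P_joint.sum(axis=1, keepdims=True)         # marginal of S (column)
P_W = P_joint.sum(axis=0, keepdims=True)         # marginal of W (row)
P_prod = P_S * P_W                               # product of marginals

I = kl(P_joint.ravel(), P_prod.ravel())          # mutual information I(S;W)
L = kl(P_prod.ravel(), P_joint.ravel())          # lautum information L(S;W)
I_skl = I + L                                    # symmetrized KL information
print(I, L, I_skl)
```

The test below also checks the "equivalently" claim: the mutual information equals the conditional KL divergence between $P_{W|S}$ and $P_W$ averaged over $P_S$.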
Throughout the paper, upper-case letters denote random variables, lower-case letters denote the realizations of random variables, and calligraphic letters denote sets. All the logarithms are the natural ones, and all the information measure units are nats.
$\mathcal{N}(\mu, \Sigma)$ denotes a Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$.
1.2 Contributions

The core contribution of this paper (see Theorem 1) is an exact characterization of the expected generalization error for the Gibbs algorithm in terms of the symmetrized KL information between the input training samples $S$ and the output hypothesis $W$, as follows:
$$\overline{\mathrm{gen}}(P^{\gamma}_{W|S}, P_S) = \frac{I_{\mathrm{SKL}}(S;W)}{\gamma}, \qquad \gamma > 0.$$
This result highlights the fundamental role of such an information quantity in learning theory that does not appear to have been recognized before. We also discuss some general properties of the symmetrized KL information, which could be used to prove the non-negativity and concavity of the expected generalization error for the Gibbs algorithm.
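This identity can be verified numerically on a finite alphabet. The following sketch is our own toy construction (four possible training sets, three hypotheses, randomly drawn empirical risks): it builds the Gibbs kernel, computes the expected generalization error directly, and compares it with $I_{\mathrm{SKL}}(S;W)/\gamma$:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 2.0

P_S   = np.array([0.1, 0.2, 0.3, 0.4])           # distribution of the training set S
L_E   = rng.uniform(size=(4, 3))                 # empirical risk L_E(w, s), 3 hypotheses
prior = np.array([0.2, 0.5, 0.3])                # prior pi(w)

P_W_given_S = prior * np.exp(-gamma * L_E)       # (gamma, pi, L_E)-Gibbs algorithm
P_W_given_S /= P_W_given_S.sum(axis=1, keepdims=True)

P_joint = P_S[:, None] * P_W_given_S             # joint P_{S,W}
P_W = P_joint.sum(axis=0)                        # marginal P_W
P_prod = P_S[:, None] * P_W[None, :]             # product of marginals
L_P = P_S @ L_E                                  # population risk E_{P_S}[L_E(w, S)]

gen = float(np.sum(P_joint * (L_P[None, :] - L_E)))
I_skl = float(np.sum((P_joint - P_prod) * np.log(P_joint / P_prod)))
print(gen, I_skl / gamma)                        # the two numbers coincide
```

Note that the match is exact (up to floating point), not asymptotic, and that no i.i.d. structure was imposed on $P_S$.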
Building upon this result, we further expand our contributions by tightening existing expected generalization error bounds for the Gibbs algorithm under i.i.d. and sub-Gaussian assumptions, combining our symmetrized KL information characterization with existing bounding techniques.
1.3 Motivations for Gibbs Algorithm
As we discuss below, the choice of the Gibbs algorithm is not arbitrary since it arises naturally in many different applications and is sufficiently general to model many learning algorithms used in practice:
Empirical Risk Minimization: The $(\gamma, \pi(w), L_E(w,s))$-Gibbs algorithm can be viewed as a randomized version of the empirical risk minimization (ERM) algorithm if we specify the energy function $f(w,s) = L_E(w,s)$. As the inverse temperature $\gamma \to \infty$, the influence of the prior distribution becomes negligible, and the Gibbs algorithm converges to the standard ERM algorithm.
Information Risk Minimization: The $(\gamma, \pi(w), L_E(w,s))$-Gibbs algorithm is the solution to the regularized ERM problem
$$P^{\gamma}_{W|S} = \arg\min_{P_{W|S}} \Big(\mathbb{E}_{P_{W,S}}[L_E(W,S)] + \frac{1}{\gamma}\, D\big(P_{W|S}\|\pi(W)\,\big|\,P_S\big)\Big),$$
obtained by considering the conditional KL divergence $D(P_{W|S}\|\pi(W)|P_S)$ as a regularizer to penalize over-fitting in the information risk minimization framework (Xu & Raginsky, 2017; Zhang, 2006; Zhang et al., 2006).
SGLD Algorithm: Stochastic Gradient Langevin Dynamics (SGLD) can be viewed as the discrete-time version of the continuous-time Langevin diffusion. Under some conditions on the loss function, (Chiang et al., 1987; Markowich & Villani, 2000) show that the stationary distribution of the hypothesis in the continuous-time Langevin diffusion is the Gibbs distribution. In (Raginsky et al., 2017), it is further proved that the distribution induced by the SGLD algorithm is close to the $(\gamma, P_{W_0}, L_E(w,s))$-Gibbs distribution in 2-Wasserstein distance for a sufficiently large number of iterations, where $P_{W_0}$ is the distribution of the hypothesis at the first step.
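The Langevin connection can be illustrated in a few lines of code. The sketch below runs full-gradient (unadjusted) Langevin dynamics, i.e., the SGLD recursion with full batches, on a one-dimensional quadratic empirical risk, for which the corresponding Gibbs distribution is a Gaussian with known mean and variance; all parameter values are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-d toy problem: L_E(w) = mean((z_i - w)^2), Gaussian prior N(0, sigma0^2)
z = rng.normal(1.0, 1.0, size=50)
zbar = z.mean()
gamma, sigma0, eta = 10.0, 1.0, 1e-3   # inverse temperature, prior std, step size

def grad_U(w):
    # U(w) = gamma * L_E(w) - log pi(w), up to additive constants
    return gamma * 2.0 * (w - zbar) + w / sigma0**2

w, samples = 0.0, []
for t in range(60_000):
    # Langevin step: stationary law is proportional to pi(w) * exp(-gamma * L_E(w))
    w += -eta * grad_U(w) + np.sqrt(2 * eta) * rng.normal()
    if t >= 10_000:                    # discard burn-in
        samples.append(w)
samples = np.asarray(samples)

# closed-form Gibbs distribution for this quadratic case: N(m, v)
v = 1.0 / (2 * gamma + 1 / sigma0**2)
m = 2 * gamma * zbar * v
print(samples.mean(), m, samples.var(), v)
```

With a small step size, the empirical mean and variance of the iterates approach those of the Gibbs distribution, up to an $O(\eta)$ discretization bias.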
1.4 Other Related Works
Information-theoretic generalization error bounds: Recently, (Russo & Zou, 2019; Xu & Raginsky, 2017) proposed to use the mutual information between the input training set and the output hypothesis to upper bound the expected generalization error. However, those bounds are known not to be tight, and multiple approaches have been proposed to tighten the mutual information-based bound. (Bu et al., 2020b) provides tighter bounds by considering the individual sample mutual information, (Asadi et al., 2018; Asadi & Abbe, 2020) propose using chaining mutual information, and (Steinke & Zakynthinou, 2020; Hafez-Kolahi et al., 2020; Haghifam et al., 2020) advocate conditioning and processing techniques. Information-theoretic generalization error bounds using other information quantities have also been studied, such as $f$-divergence (Jiao et al., 2017), $\alpha$-Rényi divergence and maximal leakage (Issa et al., 2019; Esposito et al., 2019), Jensen-Shannon divergence (Aminian et al., 2020), and Wasserstein distance (Lopez & Jog, 2018; Wang et al., 2019; Rodríguez-Gálvez et al., 2021). Using rate-distortion theory, (Masiha et al., 2021; Bu et al., 2020a) provide information-theoretic generalization error upper bounds for model misspecification and model compression.
Generalization error of the Gibbs algorithm: Both information-theoretic and PAC-Bayesian approaches have been used to bound the generalization error of the Gibbs algorithm. An information-theoretic upper bound with a convergence rate of $O(1/n)$ is provided in (Raginsky et al., 2016) for the Gibbs algorithm with bounded loss function, and PAC-Bayesian bounds using a variational approximation of Gibbs posteriors are studied in (Alquier et al., 2016). (Kuzborskij et al., 2019) focus on the excess risk of the Gibbs algorithm, and a similar generalization bound with a rate of $O(1/n)$ is provided under a sub-Gaussian assumption. Although these bounds are tight in terms of the sample complexity $n$, they become vacuous when the inverse temperature $\gamma \to \infty$, hence are unable to capture the behavior of the ERM algorithm.
Our work differs from this body of research in the sense that we provide an exact characterization of the generalization error of the Gibbs algorithm in terms of the symmetrized KL information. Our work also further leverages this characterization to tighten existing expected generalization error bounds in the literature.
2 Generalization Error of Gibbs Algorithm
Our main result, which characterizes the exact expected generalization error of the Gibbs algorithm with prior distribution $\pi(w)$, energy function $L_E(w,s)$, and inverse temperature $\gamma$, is as follows:
Theorem 1. For the $(\gamma, \pi(w), L_E(w,s))$-Gibbs algorithm, the expected generalization error is given by
$$\overline{\mathrm{gen}}(P^{\gamma}_{W|S}, P_S) = \frac{I_{\mathrm{SKL}}(S;W)}{\gamma}, \qquad \gamma > 0.$$
Sketch of proof.
It can be shown that the symmetrized KL information of the $(\gamma, \pi(w), L_E(w,s))$-Gibbs algorithm can be written as
$$I_{\mathrm{SKL}}(S;W) = \mathbb{E}_{P_{S,W}}\Big[\log\frac{P^{\gamma}_{W|S}(W|S)}{P_W(W)}\Big] - \mathbb{E}_{P_S\otimes P_W}\Big[\log\frac{P^{\gamma}_{W|S}(W|S)}{P_W(W)}\Big].$$
Just like the generalization error, the above expression is the difference between the expectations of the same function evaluated under the joint distribution and the product-of-marginal distribution. Since $P_{S,W}$ and $P_S\otimes P_W$ share the same marginal distributions, the terms $\log\pi(W)$, $\log V(S,\gamma)$, and $\log P_W(W)$ in $\log P^{\gamma}_{W|S} - \log P_W$ have equal expectations under both distributions and cancel, so only the energy term survives:
$$I_{\mathrm{SKL}}(S;W) = \gamma\big(\mathbb{E}_{P_S\otimes P_W}[L_E(W,S)] - \mathbb{E}_{P_{S,W}}[L_E(W,S)]\big) = \gamma\,\overline{\mathrm{gen}}(P^{\gamma}_{W|S}, P_S).$$
More details are provided in Appendix A. ∎
To the best of our knowledge, this is the first exact characterization of the expected generalization error for the Gibbs algorithm. Note that Theorem 1 only assumes that the loss function is non-negative, and it holds even for non-i.i.d. training samples.
2.1 General Properties
By Theorem 1, some basic properties of the expected generalization error, including non-negativity and concavity, can be proved directly from the properties of the symmetrized KL information.
The non-negativity of the expected generalization error, i.e., $\overline{\mathrm{gen}}(P^{\gamma}_{W|S}, P_S) \ge 0$, follows from the non-negativity of the symmetrized KL information. Note that the non-negativity result obtained in (Kuzborskij et al., 2019) requires more technical assumptions, including i.i.d. samples and a sub-Gaussian loss function.
It is shown in (Aminian et al., 2015) that the symmetrized KL information $I_{\mathrm{SKL}}(S;W)$ is a concave function of $P_S$ for fixed $P_{W|S}$, and a convex function of $P_{W|S}$ for fixed $P_S$. Thus, we have the following corollary.
Corollary 1. For a fixed $(\gamma, \pi(w), L_E(w,s))$-Gibbs algorithm $P^{\gamma}_{W|S}$, the expected generalization error $\overline{\mathrm{gen}}(P^{\gamma}_{W|S}, P_S)$ is a concave function of $P_S$.
The concavity of the generalization error for the Gibbs algorithm can be immediately used to explain the well-known fact that training a model by mixing multiple datasets from different domains leads to poor generalization. Suppose that the data-generating distribution is domain-dependent, i.e., there exists a random variable $U$ indicating the domain, such that $P_S = \mathbb{E}_{P_U}\big[P_{S|U}\big]$ holds. Then, $P_S$ can be viewed as the mixture of the data-generating distributions across all domains. From Corollary 1 and Jensen's inequality, we have
$$\overline{\mathrm{gen}}(P^{\gamma}_{W|S}, P_S) \ \ge\ \mathbb{E}_{P_U}\big[\overline{\mathrm{gen}}(P^{\gamma}_{W|S}, P_{S|U})\big],$$
which shows that the generalization error of the Gibbs algorithm achieved with the mixture distribution is larger than the averaged generalization error over the individual domain distributions $P_{S|U=u}$.
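Corollary 1 and the mixture inequality can be observed numerically. Below is a small finite-alphabet sketch of our own construction: four training-set values, two of which dominate in each of two "domains", one fixed Gibbs kernel, and randomly drawn empirical risks:

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 2.0
L_E = rng.uniform(size=(4, 3))                   # empirical risk L_E(w, s)
prior = np.array([0.2, 0.5, 0.3])

P_W_given_S = prior * np.exp(-gamma * L_E)       # one fixed Gibbs algorithm
P_W_given_S /= P_W_given_S.sum(axis=1, keepdims=True)

def gen_error(P_S):
    # expected generalization error of the fixed algorithm under P_S
    P_joint = P_S[:, None] * P_W_given_S
    L_P = P_S @ L_E
    return float(np.sum(P_joint * (L_P[None, :] - L_E)))

P1 = np.array([0.7, 0.1, 0.1, 0.1])              # "domain" 1
P2 = np.array([0.1, 0.1, 0.1, 0.7])              # "domain" 2
mix = 0.5 * (P1 + P2)                            # mixture over the two domains
print(gen_error(mix), 0.5 * (gen_error(P1) + gen_error(P2)))
```

The generalization error under the mixture is at least the average of the per-domain generalization errors, as the concavity result predicts.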
2.2 Example: Mean Estimation
Consider the problem of learning the mean $\mu$ of a random vector $Z \in \mathbb{R}^d$ using $n$ i.i.d. training samples $S = \{Z_i\}_{i=1}^n$. We assume that the covariance matrix of $Z$ satisfies $\Sigma_Z = \sigma^2 I_d$, while the mean $\mu$ is unknown. We adopt the mean-squared loss $\ell(w,z) = \|z - w\|_2^2$, and assume a Gaussian prior for the mean $\pi(w) = \mathcal{N}(\mu_0, \sigma_0^2 I_d)$. If we set the inverse temperature $\gamma = \frac{n}{2\sigma^2}$, then the $\big(\frac{n}{2\sigma^2}, \mathcal{N}(\mu_0, \sigma_0^2 I_d), L_E(w,s)\big)$-Gibbs algorithm is given by the following posterior distribution (Murphy, 2007),
$$P^{\gamma}_{W|S}(w|s) = \mathcal{N}\Big(a\,\bar{Z} + (1-a)\mu_0,\ \frac{a\sigma^2}{n} I_d\Big), \qquad a \triangleq \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2},\quad \bar{Z} \triangleq \frac{1}{n}\sum_{i=1}^n Z_i.$$
Since $Z$ is Gaussian, the mutual information and lautum information are given by
$$I(S;W) = \frac{d}{2}\log(1+a), \qquad L(S;W) = d\,a - \frac{d}{2}\log(1+a),$$
with $a = \frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}$, so that $I_{\mathrm{SKL}}(S;W) = d\,a$. As we can see from the above expressions, the symmetrized KL information is independent of the distribution of $Z$, as long as $\Sigma_Z = \sigma^2 I_d$.
From Theorem 1, the generalization error of this algorithm can be computed exactly as:
$$\overline{\mathrm{gen}}(P^{\gamma}_{W|S}, P_S) = \frac{I_{\mathrm{SKL}}(S;W)}{\gamma} = \frac{2d\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2},$$
which has the decay rate of $O(1/n)$. As a comparison, the individual sample mutual information (ISMI) bound from (Bu et al., 2020b), which is shown to be tighter than the mutual information-based bound in (Xu & Raginsky, 2017, Theorem 1), gives a sub-optimal bound of order $O(1/\sqrt{n})$ as $n \to \infty$ (see Appendix B.3).
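The closed form used here, $\frac{2d\sigma^2\sigma_0^2}{n\sigma_0^2+\sigma^2}$, can be checked by Monte Carlo simulation. The sketch below samples the Gaussian posterior above and averages the population-minus-empirical risk gap; the concrete values of `d`, `n`, and the means are our own arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, sigma, sigma0, m = 2, 10, 1.0, 1.0, 100_000
mu, mu0 = np.full(d, 0.5), np.zeros(d)
a = n * sigma0**2 / (n * sigma0**2 + sigma**2)        # posterior shrinkage factor

S = rng.normal(mu, sigma, size=(m, n, d))             # m independent training sets
zbar = S.mean(axis=1)
W = (a * zbar + (1 - a) * mu0
     + rng.normal(0.0, np.sqrt(a * sigma**2 / n), size=(m, d)))

L_P = d * sigma**2 + np.sum((mu - W) ** 2, axis=1)    # population risk, closed form
L_E = np.mean(np.sum((S - W[:, None, :]) ** 2, axis=2), axis=1)
gen_mc = float(np.mean(L_P - L_E))

exact = 2 * d * sigma**2 * sigma0**2 / (n * sigma0**2 + sigma**2)
print(gen_mc, exact)
```

With $d=2$, $n=10$, and $\sigma=\sigma_0=1$, the exact value is $4/11 \approx 0.364$, which the simulation matches to within sampling error.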
3 Tighter Expected Generalization Error Upper Bound
In this section, we show that by combining Theorem 1 with the information-theoretic bound proposed in (Xu & Raginsky, 2017) under i.i.d. and sub-Gaussian assumptions, we can provide a tighter generalization error upper bound for the Gibbs algorithm. This bound quantifies how the generalization error of the Gibbs algorithm depends on the number of samples $n$, and is useful when directly evaluating the symmetrized KL information is hard.
Theorem 2 (proved in Appendix C). Suppose that the training samples $S = \{Z_i\}_{i=1}^n$ are i.i.d. generated from the distribution $P_Z$, and the non-negative loss function $\ell(w,z)$ is $\sigma$-sub-Gaussian on the left-tail (a random variable $X$ is $\sigma$-sub-Gaussian if $\log\mathbb{E}\big[e^{\lambda(X - \mathbb{E}X)}\big] \le \frac{\sigma^2\lambda^2}{2}$ for all $\lambda \in \mathbb{R}$, and $\sigma$-sub-Gaussian on the left-tail if the same bound holds for all $\lambda \le 0$) under distribution $P_Z$ for all $w \in \mathcal{W}$. If we further assume $L(S;W) \ge C_E\, I(S;W)$ for some $C_E \ge 0$, then for the $(\gamma, \pi(w), L_E(w,s))$-Gibbs algorithm, we have
$$0 \le \overline{\mathrm{gen}}(P^{\gamma}_{W|S}, P_S) \le \frac{2\sigma^2\gamma}{(C_E + 1)\, n}.$$
Theorem 2 establishes the $O(1/n)$ convergence rate of the generalization error of the Gibbs algorithm with i.i.d. training samples, and suggests that a smaller inverse temperature $\gamma$ leads to a tighter upper bound. Note that all $\sigma$-sub-Gaussian loss functions are also $\sigma$-sub-Gaussian on the left-tail under the same distribution (the loss function in Section 2.2 is sub-Gaussian on the left-tail, but not sub-Gaussian). Therefore, our result also applies to any bounded loss function $\ell(w,z) \in [a,b]$, since bounded functions are $\frac{b-a}{2}$-sub-Gaussian.
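The bound of Theorem 2 can be sanity-checked against the exact characterization of Theorem 1 on a small i.i.d. example with a loss in $[0,1]$ (hence $\sigma = 1/2$). The loss table, prior, and parameter values below are our own illustration, and the bound expression is the simplest form of the theorem (the always-valid choice of its constant):

```python
import numpy as np

gamma, n, p = 4.0, 2, 0.3
loss = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.5, 0.5]])                    # loss(w, z) in [0, 1]
prior = np.array([1.0, 1.0, 1.0]) / 3.0

sets = [(0, 0), (0, 1), (1, 0), (1, 1)]          # all training sets of size n = 2
P_S = np.array([(1 - p)**2, (1 - p) * p, p * (1 - p), p**2])  # i.i.d. Bernoulli(p)
L_E = np.array([[0.5 * (loss[w, s0] + loss[w, s1]) for w in range(3)]
                for (s0, s1) in sets])

P_W_given_S = prior * np.exp(-gamma * L_E)       # Gibbs algorithm
P_W_given_S /= P_W_given_S.sum(axis=1, keepdims=True)
P_joint = P_S[:, None] * P_W_given_S
L_P = P_S @ L_E                                  # population risk of each w
gen = float(np.sum(P_joint * (L_P[None, :] - L_E)))

sigma = 0.5                                      # losses in [0, 1] are 1/2-sub-Gaussian
bound = 2 * sigma**2 * gamma / n                 # upper bound 2*sigma^2*gamma/n
print(gen, bound)
```

The exact generalization error is non-negative and sits below the bound, consistent with the theorem.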
Remark 1 (Previous Results).
Using the fact that the Gibbs algorithm is differentially private for bounded loss functions (McSherry & Talwar, 2007), directly applying (Xu & Raginsky, 2017, Theorem 1) gives a sub-optimal bound of order $O(\sqrt{\gamma/n})$. By further exploiting the bounded loss assumption using Hoeffding's lemma, a tighter upper bound is obtained in (Raginsky et al., 2016), which has a similar $O(\gamma/n)$ decay rate. In (Kuzborskij et al., 2019, Theorem 1), an upper bound is derived under a different assumption, i.e., that $\ell(W,z)$ is $\sigma$-sub-Gaussian under the Gibbs algorithm $P^{\gamma}_{W|S}$. In Theorem 2, we assume that the loss function is $\sigma$-sub-Gaussian on the left-tail under the data-generating distribution $P_Z$ for all $w \in \mathcal{W}$, which is more general, as discussed above. Our upper bound also improves upon the result in (Kuzborskij et al., 2019) by a constant factor.
Remark 2 (Choice of ).
Since the lautum information satisfies $L(S;W) \ge 0$, setting $C_E = 0$ is always valid in Theorem 2, which gives $\overline{\mathrm{gen}}(P^{\gamma}_{W|S}, P_S) \le \frac{2\sigma^2\gamma}{n}$. As shown in (Palomar & Verdú, 2008, Theorem 15), $L(S;W) \ge I(S;W)$ holds for any Gaussian channel $P_{W|S}$, in which case $C_E = 1$ can be adopted. In addition, it is discussed in (Palomar & Verdú, 2008, Example 1) that if either the entropy of the training set $S$ or that of the hypothesis $W$ is small, $I(S;W)$ would be smaller than $L(S;W)$ (as the lautum information is not upper-bounded by the entropy), which implies that the lautum information term is not negligible in general.
4 Conclusion

We provide an exact characterization of the expected generalization error for the Gibbs algorithm using symmetrized KL information. We demonstrate the versatility of our approach by tightening the existing information-theoretic expected generalization error upper bound. This work motivates further investigation of the Gibbs algorithm in a variety of settings, including extending our results to characterize the generalization ability of an over-parameterized Gibbs algorithm, which could potentially provide more understanding of the generalization ability of deep learning.
Yuheng Bu is supported, in part, by NSF under Grant CCF-1717610 and by the MIT-IBM Watson AI Lab. Gholamali Aminian is supported by the Royal Society Newton International Fellowship, grant no. NIF\R1 \192656.
- Alquier et al. (2016) Alquier, P., Ridgway, J., and Chopin, N. On the properties of variational approximations of gibbs posteriors. The Journal of Machine Learning Research, 17(1):8374–8414, 2016.
- Aminian et al. (2015) Aminian, G., Arjmandi, H., Gohari, A., Nasiri-Kenari, M., and Mitra, U. Capacity of diffusion-based molecular communication networks over lti-poisson channels. IEEE Transactions on Molecular, Biological and Multi-Scale Communications, 1(2):188–201, 2015.
- Aminian et al. (2020) Aminian, G., Toni, L., and Rodrigues, M. R. Jensen-shannon information based characterization of the generalization error of learning algorithms. 2020 IEEE Information Theory Workshop (ITW), 2020.
- Anthony & Bartlett (2009) Anthony, M. and Bartlett, P. L. Neural network learning: Theoretical foundations. cambridge university press, 2009.
- Asadi et al. (2018) Asadi, A., Abbe, E., and Verdú, S. Chaining mutual information and tightening generalization bounds. In Advances in Neural Information Processing Systems, pp. 7234–7243, 2018.
- Asadi & Abbe (2020) Asadi, A. R. and Abbe, E. Chaining meets chain rule: Multilevel entropic regularization and training of neural networks. Journal of Machine Learning Research, 21(139):1–32, 2020.
- Bousquet & Elisseeff (2002) Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
- Bu et al. (2020a) Bu, Y., Gao, W., Zou, S., and Veeravalli, V. Information-theoretic understanding of population risk improvement with model compression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3300–3307, 2020a.
- Bu et al. (2020b) Bu, Y., Zou, S., and Veeravalli, V. V. Tightening mutual information-based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory, 1(1):121–130, 2020b.
- Chiang et al. (1987) Chiang, T.-S., Hwang, C.-R., and Sheu, S. J. Diffusion for global optimization in R^n. SIAM Journal on Control and Optimization, 25(3):737–753, 1987.
- Esposito et al. (2019) Esposito, A. R., Gastpar, M., and Issa, I. Generalization error bounds via Rényi-, f-divergences and maximal leakage. arXiv preprint arXiv:1912.01439, 2019.
- Gibbs (1902) Gibbs, J. W. Elementary Principles in Statistical Mechanics. Charles Scribner's Sons, 1902.
- Hafez-Kolahi et al. (2020) Hafez-Kolahi, H., Golgooni, Z., Kasaei, S., and Soleymani, M. Conditioning and processing: Techniques to improve information-theoretic generalization bounds. Advances in Neural Information Processing Systems, 33, 2020.
- Haghifam et al. (2020) Haghifam, M., Negrea, J., Khisti, A., Roy, D. M., and Dziugaite, G. K. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. Advances in Neural Information Processing Systems, 2020.
- Issa et al. (2019) Issa, I., Esposito, A. R., and Gastpar, M. Strengthened information-theoretic bounds on the generalization error. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 582–586. IEEE, 2019.
- Jeffreys (1946) Jeffreys, H. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
- Jiao et al. (2017) Jiao, J., Han, Y., and Weissman, T. Dependence measures bounding the exploration bias for general measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 1475–1479. IEEE, 2017.
- Kuzborskij et al. (2019) Kuzborskij, I., Cesa-Bianchi, N., and Szepesvári, C. Distribution-dependent analysis of gibbs-erm principle. In Conference on Learning Theory, pp. 2028–2054. PMLR, 2019.
- Lopez & Jog (2018) Lopez, A. T. and Jog, V. Generalization error bounds using wasserstein distances. In 2018 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE, 2018.
- Markowich & Villani (2000) Markowich, P. A. and Villani, C. On the trend to equilibrium for the fokker-planck equation: an interplay between physics and functional analysis. Mat. Contemp, 19:1–29, 2000.
- Masiha et al. (2021) Masiha, M. S., Gohari, A., Yassaee, M. H., and Aref, M. R. Learning under distribution mismatch and model misspecification. arXiv preprint arXiv:2102.05695, 2021.
- McAllester (2003) McAllester, D. A. Pac-bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
- McSherry & Talwar (2007) McSherry, F. and Talwar, K. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pp. 94–103. IEEE, 2007.
- Murphy (2007) Murphy, K. P. Conjugate bayesian analysis of the gaussian distribution. def, 1(22):16, 2007.
- Palomar & Verdú (2008) Palomar, D. P. and Verdú, S. Lautum information. IEEE transactions on information theory, 54(3):964–975, 2008.
- Raginsky et al. (2016) Raginsky, M., Rakhlin, A., Tsao, M., Wu, Y., and Xu, A. Information-theoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop (ITW), pp. 26–30. IEEE, 2016.
- Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pp. 1674–1703. PMLR, 2017.
- Rodrigues & Eldar (2021) Rodrigues, M. R. and Eldar, Y. C. Information-Theoretic Methods in Data Science. Cambridge University Press, 2021.
- Rodríguez-Gálvez et al. (2021) Rodríguez-Gálvez, B., Bassi, G., Thobaben, R., and Skoglund, M. Tighter expected generalization error bounds via wasserstein distance. arXiv preprint arXiv:2101.09315, 2021.
- Russo & Zou (2019) Russo, D. and Zou, J. How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.
- Steinke & Zakynthinou (2020) Steinke, T. and Zakynthinou, L. Reasoning about generalization via conditional mutual information. arXiv preprint arXiv:2001.09122, 2020.
- Vapnik (1999) Vapnik, V. N. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999, 1999.
- Wang et al. (2019) Wang, H., Diaz, M., Santos Filho, J. C. S., and Calmon, F. P. An information-theoretic view of generalization via wasserstein distance. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 577–581. IEEE, 2019.
- Xu & Raginsky (2017) Xu, A. and Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pp. 2524–2533, 2017.
- Xu & Mannor (2012) Xu, H. and Mannor, S. Robustness and generalization. Machine learning, 86(3):391–423, 2012.
- Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
- Zhang (2006) Zhang, T. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006.
- Zhang et al. (2006) Zhang, T. et al. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5):2180–2210, 2006.
Appendix A Proof of Theorem 1
We start with the following two lemmas.

Lemma 1. For the $(\gamma, \pi(w), f(w,s))$-Gibbs algorithm, define the following function as a proxy for the empirical risk, i.e., $\mathcal{E}(f) \triangleq \mathbb{E}_{P_{S,W}}[f(W,S)]$, where $P_{S,W} = P^{\gamma}_{W|S}\otimes P_S$, and the function $\mathcal{P}(f) \triangleq \mathbb{E}_{P_S\otimes P_W}[f(W,S)]$ as a proxy for the population risk. Then,
$$I_{\mathrm{SKL}}(S;W) = \gamma\,\big(\mathcal{P}(f) - \mathcal{E}(f)\big).$$

Lemma 2. Consider a learning algorithm $P_{W|S}$. If we set the function $f(w,s) = L_E(w,s)$, then
$$\overline{\mathrm{gen}}(P_{W|S}, P_S) = \mathcal{P}(f) - \mathcal{E}(f).$$

Combining the two lemmas with $f(w,s) = L_E(w,s)$ proves Theorem 1.
Appendix B Example Details: Estimating the Mean of Gaussian
B.1 Generalization Error
We first evaluate the generalization error of the learning algorithm in Section 2.2 directly. Note that the output $W$ can be written as
$$W = a\bar{Z} + (1-a)\mu_0 + N, \qquad a = \frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}, \quad N \sim \mathcal{N}\Big(0, \frac{a\sigma^2}{n} I_d\Big),$$
where $N$ is independent from the training samples $S$. Thus,
$$\overline{\mathrm{gen}}(P^{\gamma}_{W|S}, P_S) = \mathbb{E}\Big[\frac{1}{n}\sum_{i=1}^n\big(\|\tilde{Z} - W\|_2^2 - \|Z_i - W\|_2^2\big)\Big] \overset{(a)}{=} \frac{2}{n}\sum_{i=1}^n \mathbb{E}\big[\langle Z_i - \tilde{Z},\, W\rangle\big] \overset{(b)}{=} \frac{2a\sigma^2 d}{n} = \frac{2d\sigma^2\sigma_0^2}{n\sigma_0^2+\sigma^2},$$
where $\tilde{Z}$ denotes an independent copy of the training sample, $(a)$ follows due to the fact that $\tilde{Z}$ and the $Z_i$ are i.i.d. (so the squared-norm terms cancel in expectation), and $(b)$ follows from the fact that $N$ has zero mean and $W$ depends on $S$ only through $\bar{Z}$, so that $\mathbb{E}[\langle Z_i - \tilde{Z}, W\rangle] = a\,\mathrm{Tr}\big(\mathrm{Cov}(Z_i, \bar{Z})\big) = \frac{a\sigma^2 d}{n}$.
B.2 Symmetrized KL Divergence
The following lemma from (Palomar & Verdú, 2008) characterizes the mutual and lautum information for the Gaussian channel.

Lemma 3 ((Palomar & Verdú, 2008, Theorem 14)). Consider the following model
$$Y = AX + N_G,$$
where $X$ denotes the input random vector with zero mean (not necessarily Gaussian), $A$ denotes the linear transformation undergone by the input, $Y$ is the output vector, and $N_G$ is a Gaussian noise vector independent of $X$. The input and the noise covariance matrices are given by $\Sigma_X$ and $\Sigma_N$. Then, for Gaussian input $X \sim \mathcal{N}(0, \Sigma_X)$, the mutual information and lautum information are given by
$$I(X;Y) = \frac{1}{2}\log\det\big(I + \Sigma_N^{-1}A\Sigma_X A^{\top}\big), \qquad L(X;Y) = \mathrm{Tr}\big(\Sigma_N^{-1}A\Sigma_X A^{\top}\big) - I(X;Y),$$
so that the symmetrized KL information $I_{\mathrm{SKL}}(X;Y) = \mathrm{Tr}\big(\Sigma_N^{-1}A\Sigma_X A^{\top}\big)$ depends on the input distribution only through its covariance $\Sigma_X$.
In our example, the output $W$ can be written as
$$W = \frac{a}{n}\big[I_d, \ldots, I_d\big]\, S + (1-a)\mu_0 + N, \qquad N \sim \mathcal{N}\Big(0, \frac{a\sigma^2}{n} I_d\Big),$$
where $S = [Z_1^{\top}, \ldots, Z_n^{\top}]^{\top}$. Then, setting $A = \frac{a}{n}[I_d, \ldots, I_d]$ and $\Sigma_X = \sigma^2 I_{nd}$, and noticing that $\Sigma_N = \frac{a\sigma^2}{n} I_d$ in Lemma 3, we obtain $\Sigma_N^{-1}A\Sigma_X A^{\top} = a\, I_d$, which completes the proof.
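As a quick numerical sanity check of the Gaussian-channel formulas in the scalar case (our own illustration; the values of `a`, `sx2`, `sn2` are arbitrary), the mutual and lautum information can be computed directly as KL divergences between the joint and product-of-marginal Gaussians:

```python
import numpy as np

def kl_gauss(S0, S1):
    # KL(N(0, S0) || N(0, S1)) in nats, for zero-mean Gaussians
    k = S0.shape[0]
    return 0.5 * (np.trace(np.linalg.inv(S1) @ S0) - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# scalar Gaussian channel Y = a*X + N
a, sx2, sn2 = 0.8, 2.0, 0.5
snr = a**2 * sx2 / sn2                           # Tr(Sigma_N^-1 A Sigma_X A^T)

sy2 = a**2 * sx2 + sn2
cov = a * sx2
S_joint = np.array([[sx2, cov], [cov, sy2]])     # covariance of (X, Y)
S_prod  = np.diag([sx2, sy2])                    # product of marginals

I = kl_gauss(S_joint, S_prod)                    # mutual information
L = kl_gauss(S_prod, S_joint)                    # lautum information
print(I, L, I + L, snr)                          # I + L equals the trace term
```

The direct KL computations reproduce $I = \frac12\log(1+\mathrm{snr})$, $L = \mathrm{snr} - \frac12\log(1+\mathrm{snr})$, and their sum equals the trace term, as in Lemma 3.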
B.3 ISMI Bound
Lemma 4 ((Bu et al., 2020b, Theorem 2)). Suppose the cumulant generating function of $\ell(\bar{W}, \bar{Z})$ satisfies $\Lambda_{\ell(\bar{W},\bar{Z})}(\lambda) \le \psi_{+}(\lambda)$ for $\lambda \in [0, b_{+})$, and $\Lambda_{\ell(\bar{W},\bar{Z})}(\lambda) \le \psi_{-}(\lambda)$ for $\lambda \in (b_{-}, 0]$ under $P_{\bar{W}}\otimes P_{\bar{Z}}$, where $0 < b_{+} \le \infty$ and $-\infty \le b_{-} < 0$. Then,
$$\overline{\mathrm{gen}}(P_{W|S}, P_S) \le \frac{1}{n}\sum_{i=1}^{n}\psi_{-}^{*-1}\big(I(W;Z_i)\big), \qquad -\overline{\mathrm{gen}}(P_{W|S}, P_S) \le \frac{1}{n}\sum_{i=1}^{n}\psi_{+}^{*-1}\big(I(W;Z_i)\big),$$
where $\psi^{*-1}$ denotes the generalized inverse of the Legendre dual of $\psi$.
First, we need to compute the mutual information between each individual sample $Z_i$ and the output hypothesis $W$, and the cumulant generating function (CGF) of $\ell(\bar{W}, \bar{Z})$, where $\bar{W}$ and $\bar{Z}$ are independent copies of $W$ and $Z$ with the same marginal distributions, respectively.
Since $Z_i$ and $W$ are jointly Gaussian in this example, $I(W;Z_i)$ can be computed exactly as:
$$I(W;Z_i) = \frac{d}{2}\log\Big(1 + \frac{a}{a(n-1)+n}\Big), \qquad a = \frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2},$$
so that $I(W;Z_i) = O(1/n)$ for fixed $d$, as $n \to \infty$. In addition, since $\bar{W} - \bar{Z}$ is Gaussian with mean $(1-a)(\mu_0 - \mu)$ and covariance $c\, I_d$, it can be shown that
$$\ell(\bar{W}, \bar{Z}) = \|\bar{W} - \bar{Z}\|_2^2$$
is a scaled non-central chi-square distribution with $d$ degrees of freedom, where the scaling factor $c = \sigma^2\big(1 + \frac{a(a+1)}{n}\big)$ and its non-centrality parameter $\theta = \frac{(1-a)^2\|\mu_0 - \mu\|_2^2}{c}$. Note that the expectation of a chi-square distribution with non-centrality parameter $\theta$ and $d$ degrees of freedom is $d + \theta$, and its moment generating function is $\mathbb{E}[e^{tX}] = \frac{\exp\big(\theta t/(1-2t)\big)}{(1-2t)^{d/2}}$ for $t < \frac{1}{2}$. Therefore, the CGF of $\ell(\bar{W}, \bar{Z})$ is given by
$$\Lambda_{\ell(\bar{W},\bar{Z})}(\lambda) = \frac{c\,\theta\lambda}{1-2c\lambda} - \frac{d}{2}\log(1-2c\lambda) - c\,(d+\theta)\,\lambda,$$
for $\lambda < \frac{1}{2c}$. Since the generalization error of the Gibbs algorithm is non-negative, we only need to consider the case $\lambda \le 0$. It can be shown that, for $\lambda \le 0$:
$$\Lambda_{\ell(\bar{W},\bar{Z})}(\lambda) = \frac{d}{2}\big(u - \log(1+u)\big) + \frac{\theta}{2}\cdot\frac{u^2}{1+u} \le c^2(d+2\theta)\lambda^2,$$
where $u \triangleq -2c\lambda \ge 0$, using $u - \log(1+u) \le \frac{u^2}{2}$ and $\frac{u^2}{1+u} \le u^2$. Further note that for the quadratic $\psi_{-}(\lambda) = \frac{\sigma_\ell^2\lambda^2}{2}$ with $\sigma_\ell^2 \triangleq 2c^2(d+2\theta)$, we have $\psi_{-}^{*-1}(y) = \sqrt{2\sigma_\ell^2\, y}$. We have the following upper bound on the generalization error from Lemma 4:
$$\overline{\mathrm{gen}}(P_{W|S}, P_S) \le \frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma_\ell^2\, I(W;Z_i)} = 2c\,\sqrt{(d+2\theta)\, I(W;Z_1)}.$$
If $d$ is fixed, i.e., $d = O(1)$, then as $n \to \infty$ we have $I(W;Z_i) = O(1/n)$, and the above bound is $O(1/\sqrt{n})$, in contrast to the exact $O(1/n)$ characterization obtained in Section 2.2.
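The two non-central chi-square facts used above (the mean $d+\theta$ and the MGF) can be spot-checked by simulation; the parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
d, theta = 3, 1.5                     # degrees of freedom, non-centrality
mean_shift = np.zeros(d)
mean_shift[0] = np.sqrt(theta)        # ||mean_shift||^2 = theta

X = rng.normal(mean_shift, 1.0, size=(200_000, d))
Q = np.sum(X ** 2, axis=1)            # samples of a non-central chi-square

t = 0.2                               # any t < 1/2
mgf_closed = np.exp(theta * t / (1 - 2 * t)) / (1 - 2 * t) ** (d / 2)
print(Q.mean(), d + theta)            # empirical vs. exact mean
print(np.mean(np.exp(t * Q)), mgf_closed)
```

Both the empirical mean and the empirical MGF at $t = 0.2$ agree with the closed forms up to Monte Carlo error.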