I Introduction
Machine learningbased approaches are increasingly adopted to solve various prediction problems in a wide range of applications such as computer vision, speech recognition, speech translation, and many more
[21], [4]. In particular, supervised machine learning approaches learn a predictor – also known as a hypothesis – mapping input variables to output variables using some algorithm that leverages a series of inputoutput examples drawn from some underlying (and unknown) distribution [21]. It is therefore critical to understand the generalization ability of such a predictor, i.e., how the predictor performance on the training set differs from its performance on a testing set (or on the population).A recent research direction within the informationtheoretic and related communities has concentrated on the development of approaches to characterize the generalization error of randomized learning algorithms, i.e. learning algorithms map the set of training examples to the hypothesis according to some probability law [24],[17].
The characterization of the generalization ability of randomized learning algorithms has come in two broad flavours. One involves determining a bound to the generalization error that holds on average. For example, building upon pioneering work by Russo and Zou [20], Xu and Raginsky [24] have derived average generalization error bounds involving the mutual information between the training set and the hypothesis. Bu et al. [6] have derived tighter average generalization error bounds involving the mutual information between each sample in the training set and the hypothesis. Bounds using chaining mutual information have been proposed in [3]. Other authors have also constructed informationtheoretic based average generalization error bounds using quantities such as Réyni divergence, divergence, JensenShannon divergences, Wasserstein distances, or maximal leakage (see [10], [2], [15], [23], or [14]).
The other flavour – known as probably approximately correct (PAC)Bayesian bounds and singledraw upper bounds – involves determining a bound to the generalization error that holds with high probability. The original PACBayesian generalization error bounds have been characterized via a KullbackLeibler (KL) divergence (a.k.a. relative entropy) between a prior datafree distribution and a posterior datadependent distribution on the hypothesis space [16]. Other slightly different PACBayesian generalization error bounds have also been offered in [22], [7], [1] and [13]. A general PACBayesian framework offering high probability bounds on a convex function of the population risk and empirical risk with respect to a posterior distribution has also been provided in [11]. A PACBayesian upper bound by considering a Gibbs data dependent prior is provided in [19]. Some singledraw upper bounds have been proposed in [24], [10], and [13].
In this paper, we aspire to offer a more refined analysis of the generalization ability of randomized learning algorithms in view of the fact that the generalization error can be seen as a random variable with distribution that depends on randomized algorithm distribution and the data distribution. The analysis of moments of certain quantities arising in statistical learning problems has already been considered in certain works. For example, Russo and and Zou
[20] have analysed bounds to certain moments of the error arising in data exploration problems, whereas Dhurandhar and Dobra [8] have analysed bounds to moments of the error arising in model selection problems. Sharper high probability bounds for sums of functions of independent random variables based on their moments, within the context of stable learning algorithms, have also been derived in [5]. However, to the best of our knowledge, a characterization of bounds to the moments of the generalization error of randomized learning algorithms, allowing us to capture better how the population risk may deviate from the empirical risk, does not appear to have been considered in the literature.Our contributions are as follows:

First, we offer a general upper bound on the expected value of a function of the population risk and the empirical risk of a randomized learning algorithm expressed via certain information measures between the training set and the hypothesis.

Second, we offer upper bounds on the moments of the generalization error of a randomized learning algorithm deriving from the aforementioned general bound in terms of power information and Chisquare information measures. We also propose another upper bound on the second moment of generalization error in terms of mutual information.

Third, we show how to leverage the generalization error moment bounds to construct highprobability bounds showcasing how the population risk deviates from the empirical risk associated with a randomized learning algorithm.

Finally, we show how the proposed results bound the true moments of the generalization error via a simple numerical example.
We adopt the following notation in the sequel. Uppercase letters denote random variables (e.g., ), lowercase letters denote random variable realizations (e.g. ), and calligraphic letters denote sets (e.g. ). The distribution of the random variable is denoted by
and the joint distribution of two random variables
is denoted by . We let represent the natural logarithm. We also let represent the set of positive integers.Ii Problem Formulation
We consider a standard supervised learning setting where we wish to learn a hypothesis given a set of inputoutput examples; we also then wish to use this hypothesis to predict new outputs given new inputs.
We model the input data (also known as features) using a random variable where represents the input set; we model the output data (also known as predictors or labels) using a random variable where represents the output set; we also model inputoutput data pairs using a random variable where is drawn from per some unknown distribution . We also let be a training set consisting of a number of inputoutput data points drawn i.i.d. from according to .
We represent hypotheses using a random variable where is a hypothesis class. We also represent a randomized learning algorithm via a Markov kernel that maps a given training set onto a hypothesis of the hypothesis class according to the probability law .
Let us also introduce a (nonnegative) loss function
that measures how well a hypothesis predicts an output given an input. We can now define the population risk and the empirical risk given by:(1)  
(2) 
which quantify the performance of a hypothesis delivered by the randomized learning algorithm on a testing set (population) and the training set, respectively. We can also define the generalization error as follows:
(3) 
which quantifies how much the population risk deviates from the empirical risk. This generalization error is a random variable whose distribution depends on the randomized learning algorithm probabilistic law along with the (unknown) underlying data distribution. Therefore, an exact characterization of the behaviour of the generalization error – such as its distribution – is not possible.
In order to bypass this challenge, our goal in the sequel will be to derive upper bounds to the moments of the generalization error given by:
(4) 
in terms of various divergences and informationtheoretic measures. In particular, we will use the following divergence measures between two distributions and on a common measurable space :

The KL divergence given by:

The power divergence of order given by [12]:
We also use the following information measures between two random variables and with joint distribution and marginals and :

The mutual information given by:

The power information of order given by:

The Chisquare information given by:
where
Iii Bounding Moments of Generalization Error
We begin by offering a general result inspired from [1] bounding the (absolute) expected value of an arbitrary function of the population and empirical risks under a joint measure in terms of the (absolute) expected value of the function of the population and empirical risks under the product measure.
Theorem 1.
Consider a measurable function . It follows that
(5)  
where such that , is the distribution of training set, and and are the distribution of hypothesis and joint distribution of hypothesis and training set induced by learning algorithm .
Proof.
See Appendix A. ∎
Theorem 1 can now be immediately used to bound the moments of the generalization error of a randomized learning algorithm in terms of a power divergence, under the common assumption that the loss function is subgaussian. ^{1}^{1}1A random variable is subgaussian if for all .
Theorem 2.
Assume that the loss function is  subgaussian under distribution for all . Then, the th moment of the generalization error of a randomized learning algorithm obeys the bound given by:
(6) 
provided that , , and .
Proof.
See Appendix B. ∎
Theorem 2 can also be immediately specialized to bound the moments of the generalization error of a randomized learning algorithm in terms of a chisquare divergence, also under the common assumption that the loss function is subgaussian.
Theorem 3.
Assume that the loss function is  subgaussian under distribution for all . Then, the th moment of the generalization error of a randomized learning algorithm obeys the bound given by:
(7) 
Proof.
This theorem follows immediately by setting in Theorem 2. ∎
Interestingly, these moment bounds also appear to lead to a new average generalization error bound complementing existing ones in the literature.
Corollary 1.
Assume that the loss function is  subgaussian under distribution for all . Then, the average generalization error can be bounded as follows:
(8) 
provided that for .
Proof.
This corollary follows immediately by setting in Theorem 2. ∎
Note that the chisquare information based expected generalization error upper bound is looser than the mutual information based counterpart in [24]. However, for in Corollary 1, if , the power information of order based bound is tighter than the mutual information based one [24].
It is also interesting to reflect about how the generalization error moment bounds decay as a function of the training set size ingested by the learning algorithm. In general, information measures such as power information and chisquare information do not have to be finite, but these information measures can be shown to obey and , respectively, provided that ^{2}^{2}2This condition holds provided that is countable.
It follows immediately that the moments of the generalization error are governed by the upper bound given by:
(9) 
exhibiting a decay rate of the order . Naturally, with the increase in the training set size, one would expect the empirical risk to concentrate around the population risk, and our bounds hint at the speed of such convergence.
It is also interesting to reflect about the tightness of the various generalization error moment bounds. In particular, in view of the fact that it may not be possible to compare directly information measures such as power information and chisquare information, the following Proposition puts forth conditions allowing one to compare the tightness of the bounds portrayed in Theorems 2 and 3 under the condition that the randomized learning algorithm ingests i.i.d. inputoutput data examples.
Proposition 1.
Assume that the loss function is  subgaussian under distribution for all . Then, the power information of order generalization error th moment upper bound
(10) 
is looser than the chisqare information based bound
(11) 
provided that for with .
Proof.
See Appendix C. ∎
For example, it turns out guarantees a chisquare information based generalization error second moment bound to be tighter than the power information of order 3 based bound.
Finally, we offer an additional bound – applicable only to the second moment of the generalization error – leveraging an alternative proof route inspired by tools put forth in [20, Proposition 2]. ^{3}^{3}3It does not appear that [20, Proposition 2] can be used to generate generalization error higherorder moment bounds
Theorem 4.
Assume that the loss function is  subgaussian under distribution for all . Then, the second moment of the generalization error of a randomized learning algorithm can be bounded as follows:
(12) 
Proof.
See Appendix D. ∎
The next proposition showcases that under certain conditions the mutual information based second moment bound can be tighter than the power information and chisquare information bounds.
Proposition 2.
Assume that the loss function is  subgaussian under distribution for all . The second moment of generalization error upper based on Chisquare information
(13) 
is looser than the upper bound based on mutual information in Theorem 4,
(14) 
provided that .
Proof.
See Appendix E. ∎
Iv From Moments to High Probability Bounds
We now showcase how to use the moment upper bounds to bound the probability that the empirical risk deviates from the population risk by a certain amount, under a singledraw scenario where one draws a single hypothesis based on the training data [13].
Concretely, our following results leverage generalization error moment bounds to construct a generalization error highprobability bound, involving a simple application of Markov’s inequality.
Theorem 5.
Assume that the loss function is  subgaussian under distribution for all . It follows that with probability at least for some under distribution the generalization error obeys:
(15)  
provided that .
Proof.
See Appendix F. ∎
Corollary 2.
Assume that the loss function is  subgaussian under distribution for all . It follows that with probability at least for some under distribution the generalization error obeys:
(16)  
provided that .
Proof.
This corollary follows immediately by setting in Theorem 5. ∎
It is instructive to comment on how this informationtheoretic based highprobability generalization error bound compares to other similar informationtheoretic bounds such as in [24], [10] and [13]. Our singledraw bound dependence on (i.e. ) is more beneficial than Xu et al.’s bound [24, Theorem 3] dependent on i.e. . Our singledraw bound based on chisquare information (along with bounds based on mutual information) is also typically tighter than maximal leakage based single draw bounds [10], [13].
A similar singledraw high probability upper bound based on chisquare information has also been provided in [10]. The approach pursued to lead to such bound in [10] is based on Réyni divergence and mutual information, whereas our approach leading to Corollary 2 is based on bounds to the moments of the generalization error.
V Numerical Example
We now illustrate our generalization error bounds within a very simple setting involving the estimation of the mean of a Gaussian random variable
– where corresponds to the (unknown) mean andcorresponds to the (known) variance – based on
i.i.d. samples for .We consider the hypothesis corresponding to the empirical risk minimizer given by . We also consider the loss function given by
In view of the fact that the loss function is bounded within the interval , it is also subgaussian so that we can apply the generalization error moments upper bounds offered earlier.
In our simulations, we consider , and . We compute the true generalization error numerically. We also compute chisquare and mutual information bounds to the moments of the generalization error appearing in Theorems 3 and 4. We focus exclusively on chisquare information – corresponding to power information of order 2 – because it has been established in Proposition 1 that the chisquare information bound can be tighter than the power information one under certain conditions. Both the chisquare information and the mutual information are evaluated numerically. Due to complexity in estimation of chisquare information and mutual information, we consider a relatively small number of training samples.
Fig.1 and Fig.2 demonstrate that the chisquare based bounds to the first and second moment of the generalization error is looser than the mutual information based bounds, as suggested earlier. Fig.3 also suggests that higherorder moments (and bounds) to the generalization error decay faster than lowerorder ones, as highlighted earlier.
Vi Conclusion
We have introduced a new approach to obtain informationtheoretic oriented bounds to the moments of generalization error associated with randomized supervised learning problems. We have discussed how these bounds relate to existing ones within the literature. Finally, we have also discussed how to leverage the generalization error moment bounds to derive a high probability bounds to the generalization error.
Appendix A Proof of Theorem 1
The result follows immediately by noting that:
(17)  
(18)  
(19)  
(20)  
(21) 
where (20) is due to Hlder inequality.
Appendix B Proof of Theorem 2
Appendix C Proof of Proposition1
This result follows from the inequality given by [12, Corollary 5.6]:
(25) 
holding for . We then have that:
(26)  
(27) 
where the last inequality is valid if for and considering .
Appendix D Proof of Theorem 4
The loss function is assumed to be subgaussian under distribution for all hence – in view of the fact that the data samples are i.i.d. – is subgaussian and also is subgaussian under distribution for all .
It is possible to establish that the random variable is subexponential [18, Lemma 1.12]. ^{4}^{4}4A random variable is subexponential if for all . Now, we have from the variational representation of the KullbackLeibler distance that:
(28)  
As the is subexponential for all , we have:
(29)  
As is subgaussian, we also have for all . Therefore the following inequality holds:
(30)  
This leads to the inequality:
(31) 
holding for .
The final result follows by choosing .
Appendix E Proof of Proposition 2
The result follow from the inequality given by [9]:
(32) 
We then have that:
(33) 
and, for , we also have that:
(34) 
Appendix F Proof of Theorem 5
Consider that:
(35)  
(36)  
(37)  
(38) 
where the first inequality is due to Markov’ inequality and the second inequality is due to Theorem 3. Consider also that
We then have immediately that with probability at least under the distribution it holds that:
(39)  
The value of that optimizes the right hand side in the bound above is given by:
which is an integer in view of the assumption . The result then follows immediately by substituting in (39).
References
 [1] (2018) Simpler pacbayesian bounds for hostile data. Machine Learning 107 (5), pp. 887–902. Cited by: §I, §III.
 [2] (2020) Jensenshannon information based characterization of the generalization error of learning algorithms. 2020 IEEE Information Theory Workshop (ITW). Cited by: §I.
 [3] (2018) Chaining mutual information and tightening generalization bounds. In Advances in Neural Information Processing Systems, pp. 7234–7243. Cited by: §I.
 [4] (2017) Deep learning. Vol. 1, MIT press Massachusetts, USA:. Cited by: §I.
 [5] (2020) Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pp. 610–626. Cited by: §I.
 [6] (2020) Tightening mutual informationbased bounds on generalization error. IEEE Journal on Selected Areas in Information Theory 1 (1), pp. 121–130. Cited by: §I.
 [7] (2003) A pacbayesian approach to adaptive classification. preprint 840. Cited by: §I.
 [8] (2009) Semianalytical method for analyzing models and model selection measures based on moment analysis. ACM Transactions on Knowledge Discovery from Data (TKDD) 3 (1), pp. 1–51. Cited by: §I.
 [9] (2000) Some inequalities for the kullbackleibler and  distances in information theory and applications. RGMIA research report collection 3 (2), pp. 199–210. Cited by: Appendix E.
 [10] (2019) Generalization error bounds via r’enyi, divergences and maximal leakage. arXiv preprint arXiv:1912.01439. Cited by: §I, §I, §IV, §IV.
 [11] (2015) Risk bounds for the majority vote: from a pacbayesian analysis to a learning algorithm. Journal of Machine Learning Research 16 (26), pp. 787–860. Cited by: §I.
 [12] (2013) Sharp inequalities for divergences. IEEE transactions on information theory 60 (1), pp. 104–121. Cited by: Appendix C, 2nd item, 3rd item.
 [13] (2020) Generalization bounds via information density and conditional information density. arXiv preprint arXiv:2005.08044. Cited by: §I, §IV, §IV.
 [14] (2017) Dependence measures bounding the exploration bias for general measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 1475–1479. Cited by: §I.
 [15] (2018) Generalization error bounds using wasserstein distances. In 2018 IEEE Information Theory Workshop (ITW), pp. 1–5. Cited by: §I.
 [16] (2003) PACbayesian stochastic model selection. Machine Learning 51 (1), pp. 5–21. Cited by: §I.
 [17] (2016) Informationtheoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop (ITW), pp. 26–30. Cited by: §I.
 [18] (2015) Lecture 2: subgaussian random variables. In high dimensional statistics—MIT Course No. 2.080J, Note: MIT OpenCourseWare External Links: Link Cited by: Appendix B, Appendix D.
 [19] (2020) PACbayes analysis beyond the usual bounds. Advances in Neural Information Processing System. Cited by: §I.
 [20] (2019) How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory 66 (1), pp. 302–323. Cited by: §I, §I, §III, footnote 3.
 [21] (2014) Understanding machine learning: from theory to algorithms. Cambridge university press. Cited by: §I.
 [22] (2017) A strongly quasiconvex pacbayesian bound. In International Conference on Algorithmic Learning Theory, pp. 466–492. Cited by: §I.
 [23] (2019) An informationtheoretic view of generalization via wasserstein distance. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 577–581. Cited by: §I.
 [24] (2017) Informationtheoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pp. 2524–2533. Cited by: §I, §I, §I, §III, §IV.