Information-Theoretic Bounds on the Moments of the Generalization Error of Learning Algorithms

02/03/2021 · by Gholamali Aminian, et al. (UCL)

Generalization error bounds are critical to understanding the performance of machine learning models. In this work, building upon a new bound on the expected value of an arbitrary function of the population and empirical risk of a learning algorithm, we offer a more refined analysis of the generalization behaviour of machine learning models based on a characterization of bounds on their generalization error moments. We discuss how the proposed bounds – which also encompass new bounds on the expected generalization error – relate to existing bounds in the literature. We also discuss how the proposed generalization error moment bounds can be used to construct new high-probability generalization error bounds.

I Introduction

Machine learning-based approaches are increasingly adopted to solve various prediction problems in a wide range of applications such as computer vision, speech recognition, speech translation, and many more [21], [4]. In particular, supervised machine learning approaches learn a predictor – also known as a hypothesis – mapping input variables to output variables using some algorithm that leverages a series of input-output examples drawn from some underlying (and unknown) distribution [21]. It is therefore critical to understand the generalization ability of such a predictor, i.e., how the predictor's performance on the training set differs from its performance on a testing set (or on the population).

A recent research direction within the information-theoretic and related communities has concentrated on the development of approaches to characterize the generalization error of randomized learning algorithms, i.e., learning algorithms that map the set of training examples to a hypothesis according to some probability law [24], [17].

The characterization of the generalization ability of randomized learning algorithms has come in two broad flavours. One involves determining a bound on the generalization error that holds on average. For example, building upon pioneering work by Russo and Zou [20], Xu and Raginsky [24] have derived average generalization error bounds involving the mutual information between the training set and the hypothesis. Bu et al. [6] have derived tighter average generalization error bounds involving the mutual information between each sample in the training set and the hypothesis. Bounds using chaining mutual information have been proposed in [3]. Other authors have also constructed information-theoretic average generalization error bounds using quantities such as the Rényi divergence, f-divergence, Jensen-Shannon divergence, Wasserstein distance, or maximal leakage (see [10], [2], [15], [23], or [14]).

The other flavour – known as probably approximately correct (PAC)-Bayesian bounds and single-draw upper bounds – involves determining a bound on the generalization error that holds with high probability. The original PAC-Bayesian generalization error bounds have been characterized via a Kullback-Leibler (KL) divergence (a.k.a. relative entropy) between a prior data-free distribution and a posterior data-dependent distribution on the hypothesis space [16]. Other slightly different PAC-Bayesian generalization error bounds have also been offered in [22], [7], [1], and [13]. A general PAC-Bayesian framework offering high-probability bounds on a convex function of the population risk and empirical risk with respect to a posterior distribution has also been provided in [11]. A PAC-Bayesian upper bound based on a Gibbs data-dependent prior is provided in [19]. Some single-draw upper bounds have been proposed in [24], [10], and [13].

In this paper, we aspire to offer a more refined analysis of the generalization ability of randomized learning algorithms in view of the fact that the generalization error can be seen as a random variable whose distribution depends on the randomized algorithm's probability law and the data distribution. The analysis of moments of certain quantities arising in statistical learning problems has already been considered in certain works. For example, Russo and Zou [20] have analysed bounds on certain moments of the error arising in data exploration problems, whereas Dhurandhar and Dobra [8] have analysed bounds on moments of the error arising in model selection problems. Sharper high-probability bounds for sums of functions of independent random variables based on their moments, within the context of stable learning algorithms, have also been derived in [5]. However, to the best of our knowledge, a characterization of bounds on the moments of the generalization error of randomized learning algorithms, allowing us to better capture how the population risk may deviate from the empirical risk, does not appear to have been considered in the literature.

Our contributions are as follows:

  1. First, we offer a general upper bound on the expected value of a function of the population risk and the empirical risk of a randomized learning algorithm expressed via certain information measures between the training set and the hypothesis.

  2. Second, we offer upper bounds on the moments of the generalization error of a randomized learning algorithm, derived from the aforementioned general bound, in terms of power information and chi-square information measures. We also propose another upper bound on the second moment of the generalization error in terms of mutual information.

  3. Third, we show how to leverage the generalization error moment bounds to construct high-probability bounds showcasing how the population risk deviates from the empirical risk associated with a randomized learning algorithm.

  4. Finally, we illustrate via a simple numerical example how the proposed results bound the true moments of the generalization error.

We adopt the following notation in the sequel. Upper-case letters denote random variables (e.g., $X$), lower-case letters denote random variable realizations (e.g., $x$), and calligraphic letters denote sets (e.g., $\mathcal{X}$). The distribution of a random variable $X$ is denoted by $P_X$ and the joint distribution of two random variables $X$ and $Y$ is denoted by $P_{X,Y}$. We let $\log$ represent the natural logarithm. We also let $\mathbb{N}$ represent the set of positive integers.

II Problem Formulation

We consider a standard supervised learning setting where we wish to learn a hypothesis given a set of input-output examples; we also then wish to use this hypothesis to predict new outputs given new inputs.

We model the input data (also known as features) using a random variable $X \in \mathcal{X}$, where $\mathcal{X}$ represents the input set; we model the output data (also known as labels) using a random variable $Y \in \mathcal{Y}$, where $\mathcal{Y}$ represents the output set; we also model input-output data pairs using a random variable $Z = (X, Y) \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $Z$ is drawn from $\mathcal{Z}$ per some unknown distribution $\mu$. We also let $S = \{Z_1, \ldots, Z_m\}$ be a training set consisting of $m$ input-output data points drawn i.i.d. from $\mathcal{Z}$ according to $\mu$.

We represent hypotheses using a random variable $W \in \mathcal{W}$, where $\mathcal{W}$ is a hypothesis class. We also represent a randomized learning algorithm via a Markov kernel that maps a given training set $S$ onto a hypothesis $W$ of the hypothesis class according to the probability law $P_{W|S}$.

Let us also introduce a (non-negative) loss function $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}^{+}$ that measures how well a hypothesis predicts an output given an input. We can now define the population risk and the empirical risk given by:

$$L_{P}(w) = \mathbb{E}_{\mu}\left[\ell(w, Z)\right] \qquad (1)$$
$$L_{E}(w, s) = \frac{1}{m} \sum_{i=1}^{m} \ell(w, z_{i}) \qquad (2)$$

which quantify the performance of a hypothesis delivered by the randomized learning algorithm on a testing set (population) and the training set, respectively. We can also define the generalization error as follows:

$$\mathrm{gen}(w, s) = L_{P}(w) - L_{E}(w, s) \qquad (3)$$

which quantifies how much the population risk deviates from the empirical risk. This generalization error is a random variable whose distribution depends on the randomized learning algorithm's probability law along with the (unknown) underlying data distribution. Therefore, an exact characterization of the behaviour of the generalization error – such as its distribution – is not possible.
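To make this randomness concrete, the following toy sketch (our own illustration, not part of the paper's analysis: the Gaussian data distribution, the noise-perturbed empirical-mean learner, and the squared-error loss are all assumed purely for illustration) repeatedly draws a training set $S$ and a hypothesis $W \sim P_{W|S}$, producing samples of the generalization error whose empirical moments can then be inspected:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 20                         # training set size (assumed)
mu_true, sd_true = 1.0, 2.0    # assumed data distribution N(mu_true, sd_true^2)

def randomized_algorithm(s):
    """Toy randomized learner: empirical mean plus Gaussian noise, i.e. one draw W ~ P_{W|S}."""
    return s.mean() + rng.normal(scale=0.5)

def empirical_risk(w, s):
    """Empirical risk of hypothesis w under a squared-error loss."""
    return np.mean((w - s) ** 2)

def population_risk(w):
    """Population risk of the squared-error loss under N(mu_true, sd_true^2): (w - mu)^2 + sd^2."""
    return (w - mu_true) ** 2 + sd_true ** 2

gen_samples = []
for _ in range(10_000):
    s = rng.normal(mu_true, sd_true, size=m)   # training set S
    w = randomized_algorithm(s)                # hypothesis W ~ P_{W|S}
    gen_samples.append(population_risk(w) - empirical_risk(w, s))

gen_samples = np.array(gen_samples)
print("empirical mean of gen(W,S):      ", gen_samples.mean())
print("empirical 2nd moment of gen(W,S):", np.mean(gen_samples ** 2))
```

The empirical distribution of these samples is exactly the kind of object whose moments the bounds in the sequel aim to control.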

In order to bypass this challenge, our goal in the sequel will be to derive upper bounds to the moments of the generalization error given by:

$$\mathbb{E}\left[\mathrm{gen}(W, S)^{n}\right], \quad n \in \mathbb{N}, \qquad (4)$$

in terms of various divergences and information-theoretic measures. In particular, we will use the following divergence measures between two distributions $P$ and $Q$ on a common measurable space:

  • The KL divergence given by:
    $$D(P \| Q) = \mathbb{E}_{P}\left[\log \frac{dP}{dQ}\right]$$
  • The power divergence of order $\alpha$ given by [12]:

  • The chi-square divergence given by [12]:
    $$\chi^{2}(P \| Q) = \mathbb{E}_{Q}\left[\left(\frac{dP}{dQ}\right)^{2}\right] - 1,$$
    where $\frac{dP}{dQ}$ denotes the Radon-Nikodym derivative of $P$ with respect to $Q$.

We also use the following information measures between two random variables $X$ and $Y$ with joint distribution $P_{X,Y}$ and marginals $P_X$ and $P_Y$ (a small numerical sketch of these measures follows the list):

  • The mutual information given by:
    $$I(X;Y) = D\left(P_{X,Y} \,\|\, P_{X} \otimes P_{Y}\right)$$
  • The power information of order $\alpha$ given by:
    $$I_{\alpha}(X;Y) = D_{\alpha}\left(P_{X,Y} \,\|\, P_{X} \otimes P_{Y}\right)$$
  • The chi-square information given by:
    $$I_{\chi^{2}}(X;Y) = \chi^{2}\left(P_{X,Y} \,\|\, P_{X} \otimes P_{Y}\right),$$
    where $I_{\chi^{2}}(X;Y) = I_{2}(X;Y)$.
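For concreteness, the following minimal sketch evaluates these measures for discrete distributions, assuming the standard definitions recalled above (the function names are ours; the power information of order $\alpha$ can be computed analogously once the order-$\alpha$ divergence of [12] is plugged in):

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence D(P || Q) for discrete distributions, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def chi_square_divergence(p, q):
    """Chi-square divergence: E_Q[(dP/dQ)^2] - 1 for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p ** 2 / q) - 1.0)

def mutual_information(joint):
    """I(X;Y) = D(P_{X,Y} || P_X x P_Y) for a joint pmf given as a 2-D array."""
    joint = np.asarray(joint, float)
    prod = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    return kl_divergence(joint.ravel(), prod.ravel())

def chi_square_information(joint):
    """I_{chi^2}(X;Y) = chi^2(P_{X,Y} || P_X x P_Y)."""
    joint = np.asarray(joint, float)
    prod = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    return chi_square_divergence(joint.ravel(), prod.ravel())

# Example: a correlated pair of binary random variables.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(mutual_information(joint))      # positive, since X and Y are dependent
print(chi_square_information(joint))  # at least as large as the mutual information
```

In the bounds of the next section these measures are evaluated between the training set $S$ and the hypothesis $W$, i.e., with $P_{X,Y}$ replaced by the joint distribution $P_{S,W}$.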

III Bounding Moments of Generalization Error

We begin by offering a general result inspired by [1] bounding the (absolute) expected value of an arbitrary function of the population and empirical risks under the joint measure in terms of its (absolute) expected value under the product measure.

Theorem 1.

Consider a measurable function $f$ of the population risk and the empirical risk. It follows that

(5)

where $\alpha, \beta > 1$ are such that $1/\alpha + 1/\beta = 1$, $P_S$ is the distribution of the training set, and $P_W$ and $P_{S,W}$ are, respectively, the distribution of the hypothesis and the joint distribution of the hypothesis and the training set induced by the learning algorithm $P_{W|S}$.

Proof.

See Appendix A. ∎

Theorem 1 can now be immediately used to bound the moments of the generalization error of a randomized learning algorithm in terms of a power divergence, under the common assumption that the loss function is $\sigma$-subgaussian (recall that a random variable $X$ is $\sigma$-subgaussian if $\mathbb{E}\left[e^{\lambda (X - \mathbb{E}[X])}\right] \le e^{\lambda^{2}\sigma^{2}/2}$ for all $\lambda \in \mathbb{R}$).
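A canonical example, recalled here because it is the mechanism used in the numerical illustration of Section V (a standard fact via Hoeffding's lemma, not specific to this paper): a random variable bounded in an interval $[a, b]$ satisfies

$$\mathbb{E}\left[e^{\lambda (X - \mathbb{E}[X])}\right] \le e^{\lambda^{2}(b-a)^{2}/8}, \quad \forall \lambda \in \mathbb{R},$$

i.e., it is $\frac{b-a}{2}$-subgaussian.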

Theorem 2.

Assume that the loss function $\ell(w, Z)$ is $\sigma$-subgaussian under distribution $\mu$ for all $w \in \mathcal{W}$. Then, the $n$-th moment of the generalization error of a randomized learning algorithm obeys the bound given by:

(6)

provided that , , and .

Proof.

See Appendix B. ∎

Theorem 2 can also be immediately specialized to bound the moments of the generalization error of a randomized learning algorithm in terms of a chi-square divergence, also under the common assumption that the loss function is $\sigma$-subgaussian.

Theorem 3.

Assume that the loss function $\ell(w, Z)$ is $\sigma$-subgaussian under distribution $\mu$ for all $w \in \mathcal{W}$. Then, the $n$-th moment of the generalization error of a randomized learning algorithm obeys the bound given by:

(7)
Proof.

This theorem follows immediately by setting $\alpha = 2$ in Theorem 2. ∎

Interestingly, these moment bounds also appear to lead to a new average generalization error bound complementing existing ones in the literature.

Corollary 1.

Assume that the loss function $\ell(w, Z)$ is $\sigma$-subgaussian under distribution $\mu$ for all $w \in \mathcal{W}$. Then, the average generalization error can be bounded as follows:

(8)

provided that for .

Proof.

This corollary follows immediately by setting $n = 1$ in Theorem 2. ∎

Note that the chi-square information based expected generalization error upper bound is looser than the mutual information based counterpart in [24]. However, for certain orders $\alpha$ in Corollary 1, and provided that the corresponding condition holds, the power information of order $\alpha$ based bound is tighter than the mutual information based one [24].

It is also interesting to reflect on how the generalization error moment bounds decay as a function of the training set size ingested by the learning algorithm. In general, information measures such as the power information and the chi-square information do not have to be finite, but these information measures can be shown to be bounded under appropriate conditions (for instance, such a condition holds provided that the hypothesis class $\mathcal{W}$ is countable).

It follows immediately that the moments of the generalization error are governed by the upper bound given by:

(9)

exhibiting a decay rate of the order $\mathcal{O}\left(m^{-n/2}\right)$. Naturally, with the increase in the training set size, one would expect the empirical risk to concentrate around the population risk, and our bounds hint at the speed of such convergence.

It is also interesting to reflect on the tightness of the various generalization error moment bounds. In particular, in view of the fact that it may not be possible to directly compare information measures such as the power information and the chi-square information, the following proposition puts forth conditions allowing one to compare the tightness of the bounds portrayed in Theorems 2 and 3 under the condition that the randomized learning algorithm ingests i.i.d. input-output data examples.

Proposition 1.

Assume that the loss function $\ell(w, Z)$ is $\sigma$-subgaussian under distribution $\mu$ for all $w \in \mathcal{W}$. Then, the power information of order $\alpha$ based generalization error $n$-th moment upper bound

(10)

is looser than the chi-square information based bound

(11)

provided that for with .

Proof.

See Appendix C. ∎

For example, it turns out that this condition can guarantee the chi-square information based generalization error second moment bound to be tighter than the power information of order 3 based bound.

Finally, we offer an additional bound – applicable only to the second moment of the generalization error – leveraging an alternative proof route inspired by tools put forth in [20, Proposition 2] (it does not appear that [20, Proposition 2] can be used to generate higher-order generalization error moment bounds).

Theorem 4.

Assume that the loss function $\ell(w, Z)$ is $\sigma$-subgaussian under distribution $\mu$ for all $w \in \mathcal{W}$. Then, the second moment of the generalization error of a randomized learning algorithm can be bounded as follows:

(12)
Proof.

See Appendix D. ∎

The next proposition showcases that under certain conditions the mutual information based second moment bound can be tighter than the power information and chi-square information bounds.

Proposition 2.

Assume that the loss function $\ell(w, Z)$ is $\sigma$-subgaussian under distribution $\mu$ for all $w \in \mathcal{W}$. The second moment of the generalization error upper bound based on chi-square information

(13)

is looser than the upper bound based on mutual information in Theorem 4,

(14)

provided that .

Proof.

See Appendix E. ∎

IV From Moments to High Probability Bounds

We now showcase how to use the moment upper bounds to bound the probability that the empirical risk deviates from the population risk by a certain amount, under a single-draw scenario where one draws a single hypothesis based on the training data [13].

Concretely, our following results leverage generalization error moment bounds to construct a generalization error high-probability bound, involving a simple application of Markov’s inequality.
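For intuition, the core step is the following standard argument, sketched here in our notation under the assumption that the results of Section III control the absolute moments $\mathbb{E}\left[|\mathrm{gen}(W,S)|^{n}\right]$: for any $\epsilon > 0$ and moment order $n$, Markov's inequality gives

$$\mathbb{P}\left( |\mathrm{gen}(W,S)| \ge \epsilon \right) = \mathbb{P}\left( |\mathrm{gen}(W,S)|^{n} \ge \epsilon^{n} \right) \le \frac{\mathbb{E}\left[ |\mathrm{gen}(W,S)|^{n} \right]}{\epsilon^{n}}.$$

Equating the right-hand side to $\delta$ and substituting a moment upper bound from Section III yields a value of $\epsilon$ that holds with probability at least $1 - \delta$; optimizing over the moment order $n$ then tightens the resulting high-probability bound, which is the route formalized in Theorem 5 and Appendix F.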

Theorem 5.

Assume that the loss function $\ell(w, Z)$ is $\sigma$-subgaussian under distribution $\mu$ for all $w \in \mathcal{W}$. It follows that, with probability at least $1 - \delta$ for some $\delta \in (0, 1)$ under distribution $P_{S,W}$, the generalization error obeys:

(15)

provided that .

Proof.

See Appendix F. ∎

Corollary 2.

Assume that the loss function $\ell(w, Z)$ is $\sigma$-subgaussian under distribution $\mu$ for all $w \in \mathcal{W}$. It follows that, with probability at least $1 - \delta$ for some $\delta \in (0, 1)$ under distribution $P_{S,W}$, the generalization error obeys:

(16)

provided that .

Proof.

This corollary follows immediately by setting $\alpha = 2$ in Theorem 5. ∎

It is instructive to comment on how this information-theoretic high-probability generalization error bound compares to other similar information-theoretic bounds such as in [24], [10], and [13]. Our single-draw bound exhibits a more benign dependence on the confidence parameter than Xu and Raginsky's bound [24, Theorem 3]. Our single-draw bound based on chi-square information (along with bounds based on mutual information) is also typically tighter than maximal leakage based single-draw bounds [10], [13].

A similar single-draw high-probability upper bound based on chi-square information has also been provided in [10]. The approach pursued to arrive at such a bound in [10] is based on the Rényi divergence and α-mutual information, whereas our approach leading to Corollary 2 is based on bounds on the moments of the generalization error.

V Numerical Example

We now illustrate our generalization error bounds within a very simple setting involving the estimation of the mean of a Gaussian random variable $Z \sim \mathcal{N}(\theta, \sigma_{Z}^{2})$ – where $\theta$ corresponds to the (unknown) mean and $\sigma_{Z}^{2}$ corresponds to the (known) variance – based on $m$ i.i.d. samples $Z_{i}$ for $i = 1, \ldots, m$.

We consider the hypothesis corresponding to the empirical risk minimizer, given by the empirical mean $W = \frac{1}{m}\sum_{i=1}^{m} Z_{i}$. We also consider a bounded loss function. In view of the fact that the loss function is bounded within an interval, it is also $\sigma$-subgaussian (with $\sigma$ equal to half the interval length), so that we can apply the generalization error moment upper bounds offered earlier.

In our simulations, we fix the distribution parameters and the training set sizes, and we compute the true generalization error numerically. We also compute chi-square and mutual information bounds on the moments of the generalization error appearing in Theorems 3 and 4. We focus exclusively on chi-square information – corresponding to power information of order 2 – because it has been established in Proposition 1 that the chi-square information bound can be tighter than the power information one under certain conditions. Both the chi-square information and the mutual information are evaluated numerically. Due to the complexity of estimating the chi-square information and the mutual information, we consider a relatively small number of training samples.
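As a rough companion to this experiment, the following Monte Carlo sketch estimates the true moments of the generalization error for the Gaussian mean estimation setting; the clipped squared-error loss and all numerical values below are our own illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(0)

theta, var_z = 0.0, 1.0    # assumed (unknown) mean and (known) variance
m = 10                     # training set size (illustrative value)
clip = 4.0                 # assumed bound on the loss, making it subgaussian
n_trials = 5_000           # Monte Carlo repetitions

def loss(w, z):
    """Illustrative bounded loss: squared error clipped to [0, clip]."""
    return np.minimum((w - z) ** 2, clip)

# Large held-out sample used as a proxy for the population risk.
z_test = rng.normal(theta, np.sqrt(var_z), size=100_000)

gen = np.empty(n_trials)
for t in range(n_trials):
    s = rng.normal(theta, np.sqrt(var_z), size=m)  # training set
    w = s.mean()                                   # ERM hypothesis: the empirical mean
    gen[t] = loss(w, z_test).mean() - loss(w, s).mean()

for n in (1, 2, 3, 4):
    print(f"estimated moment of order {n}: {np.mean(gen ** n):.3e}")
```

Estimates of this kind play the role of the "true values" reported in Figs. 1-3.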

Fig. 1 and Fig. 2 demonstrate that the chi-square based bounds on the first and second moments of the generalization error are looser than the mutual information based bounds, as suggested earlier. Fig. 3 also suggests that higher-order moments of (and bounds on) the generalization error decay faster than lower-order ones, as highlighted earlier.

Fig. 1: First moment of the generalization error. The figure depicts the true values along with bounds based on mutual information and chi-square information.
Fig. 2: Second moment of the generalization error. The figure depicts the true values along with bounds based on mutual information and chi-square information.
Fig. 3: Third and fourth moments of the generalization error. The figure depicts the true values along with bounds based on chi-square information.

VI Conclusion

We have introduced a new approach to obtain information-theoretic bounds on the moments of the generalization error associated with randomized supervised learning problems. We have discussed how these bounds relate to existing ones within the literature. Finally, we have also discussed how to leverage the generalization error moment bounds to derive high-probability bounds on the generalization error.

Appendix A Proof of Theorem 1

The result follows immediately by noting that:

(17)
(18)
(19)
(20)
(21)

where (20) is due to Hölder's inequality.

Appendix B Proof of Theorem 2

This result follows from Theorem 1 by considering:

(22)

We now have that:

(23)

We also have that:

(24)

in view of the fact that (a) the loss function is $\sigma$-subgaussian, hence (b) the empirical risk (and thus the generalization error under the product measure) is $\sigma/\sqrt{m}$-subgaussian, and (c) the moments of a subgaussian random variable can be bounded as in [18, Lemma 1.4]. This completes the proof.
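For reference, a sub-Gaussian moment estimate of the type invoked in step (c) (our paraphrase of the standard result; the exact constants in [18, Lemma 1.4] may differ slightly) reads: for a $\sigma$-subgaussian random variable $X$ and any integer $n \ge 1$,

$$\mathbb{E}\left[ |X|^{n} \right] \le \left( 2\sigma^{2} \right)^{n/2} n \, \Gamma\!\left( \frac{n}{2} \right).$$

Applied to a $\sigma/\sqrt{m}$-subgaussian quantity, an estimate of this type is what produces the dependence of the moment bounds on the training set size.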

Appendix C Proof of Proposition 1

This result follows from the inequality given by [12, Corollary 5.6]:

(25)

holding for . We then have that:

(26)
(27)

where the last inequality is valid if for and considering .

Appendix D Proof of Theorem 4

The loss function is assumed to be $\sigma$-subgaussian under distribution $\mu$ for all $w \in \mathcal{W}$; hence – in view of the fact that the data samples are i.i.d. – the empirical risk is $\sigma/\sqrt{m}$-subgaussian, and the generalization error is also $\sigma/\sqrt{m}$-subgaussian, under distribution $\mu$ for all $w \in \mathcal{W}$.

It is possible to establish that the squared generalization error is a subexponential random variable [18, Lemma 1.12] (a random variable $X$ is $(\tau^{2}, b)$-subexponential if $\mathbb{E}\left[e^{\lambda(X - \mathbb{E}[X])}\right] \le e^{\lambda^{2}\tau^{2}/2}$ for all $|\lambda| < 1/b$). Now, we have from the variational representation of the Kullback-Leibler divergence that:

(28)

As this random variable is subexponential for all $w \in \mathcal{W}$, we have:

(29)

As the generalization error is subgaussian, we also have a corresponding bound on its moment generating function for all $w \in \mathcal{W}$. Therefore the following inequality holds:

(30)

This leads to the inequality:

(31)

holding for .

The final result follows by choosing .

Appendix E Proof of Proposition 2

The result follows from the inequality given by [9]:

(32)

We then have that:

(33)

and, for , we also have that:

(34)

Appendix F Proof of Theorem 5

Consider that:

(35)
(36)
(37)
(38)

where the first inequality is due to Markov's inequality and the second inequality is due to Theorem 3. Consider also that

We then have immediately that, with probability at least $1 - \delta$ under the distribution $P_{S,W}$, it holds that:

(39)

The value of $n$ that optimizes the right-hand side in the bound above is an integer in view of the assumption of the theorem. The result then follows immediately by substituting this value in (39).

References

  • [1] P. Alquier and B. Guedj (2018) Simpler PAC-Bayesian bounds for hostile data. Machine Learning 107 (5), pp. 887–902. Cited by: §I, §III.
  • [2] G. Aminian, L. Toni, and M. R. Rodrigues (2020) Jensen-Shannon information based characterization of the generalization error of learning algorithms. 2020 IEEE Information Theory Workshop (ITW). Cited by: §I.
  • [3] A. Asadi, E. Abbe, and S. Verdú (2018) Chaining mutual information and tightening generalization bounds. In Advances in Neural Information Processing Systems, pp. 7234–7243. Cited by: §I.
  • [4] Y. Bengio, I. Goodfellow, and A. Courville (2017) Deep learning. Vol. 1, MIT Press, Cambridge, MA, USA. Cited by: §I.
  • [5] O. Bousquet, Y. Klochkov, and N. Zhivotovskiy (2020) Sharper bounds for uniformly stable algorithms. In Conference on Learning Theory, pp. 610–626. Cited by: §I.
  • [6] Y. Bu, S. Zou, and V. V. Veeravalli (2020) Tightening mutual information-based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory 1 (1), pp. 121–130. Cited by: §I.
  • [7] O. Catoni (2003) A PAC-Bayesian approach to adaptive classification. Preprint 840. Cited by: §I.
  • [8] A. Dhurandhar and A. Dobra (2009) Semi-analytical method for analyzing models and model selection measures based on moment analysis. ACM Transactions on Knowledge Discovery from Data (TKDD) 3 (1), pp. 1–51. Cited by: §I.
  • [9] S. S. Dragomir and V. Gluscevic (2000) Some inequalities for the Kullback-Leibler and chi-squared distances in information theory and applications. RGMIA Research Report Collection 3 (2), pp. 199–210. Cited by: Appendix E.
  • [10] A. R. Esposito, M. Gastpar, and I. Issa (2019) Generalization error bounds via Rényi-, f-divergences and maximal leakage. arXiv preprint arXiv:1912.01439. Cited by: §I, §I, §IV, §IV.
  • [11] P. Germain, A. Lacasse, F. Laviolette, M. Marchand, and J. Roy (2015) Risk bounds for the majority vote: from a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research 16 (26), pp. 787–860. Cited by: §I.
  • [12] A. Guntuboyina, S. Saha, and G. Schiebinger (2013) Sharp inequalities for f-divergences. IEEE Transactions on Information Theory 60 (1), pp. 104–121. Cited by: Appendix C, 2nd item, 3rd item.
  • [13] F. Hellström and G. Durisi (2020) Generalization bounds via information density and conditional information density. arXiv preprint arXiv:2005.08044. Cited by: §I, §IV, §IV.
  • [14] J. Jiao, Y. Han, and T. Weissman (2017) Dependence measures bounding the exploration bias for general measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pp. 1475–1479. Cited by: §I.
  • [15] A. T. Lopez and V. Jog (2018) Generalization error bounds using wasserstein distances. In 2018 IEEE Information Theory Workshop (ITW), pp. 1–5. Cited by: §I.
  • [16] D. A. McAllester (2003) PAC-Bayesian stochastic model selection. Machine Learning 51 (1), pp. 5–21. Cited by: §I.
  • [17] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu (2016) Information-theoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop (ITW), pp. 26–30. Cited by: §I.
  • [18] Prof. P. Rigollet (2015) Lecture 2: sub-gaussian random variables. In high dimensional statistics—MIT Course No. 2.080J, Note: MIT OpenCourseWare External Links: Link Cited by: Appendix B, Appendix D.
  • [19] O. Rivasplata, I. Kuzborskij, C. Szepesvári, and J. Shawe-Taylor (2020) PAC-Bayes analysis beyond the usual bounds. Advances in Neural Information Processing Systems. Cited by: §I.
  • [20] D. Russo and J. Zou (2019) How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory 66 (1), pp. 302–323. Cited by: §I, §I, §III, footnote 3.
  • [21] S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning: from theory to algorithms. Cambridge university press. Cited by: §I.
  • [22] N. Thiemann, C. Igel, O. Wintenberger, and Y. Seldin (2017) A strongly quasiconvex PAC-Bayesian bound. In International Conference on Algorithmic Learning Theory, pp. 466–492. Cited by: §I.
  • [23] H. Wang, M. Diaz, J. C. S. Santos Filho, and F. P. Calmon (2019) An information-theoretic view of generalization via wasserstein distance. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 577–581. Cited by: §I.
  • [24] A. Xu and M. Raginsky (2017) Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pp. 2524–2533. Cited by: §I, §I, §I, §III, §IV.