# Tightening Mutual Information Based Bounds on Generalization Error

A mutual information based upper bound on the generalization error of a supervised learning algorithm is derived in this paper. The bound is constructed in terms of the mutual information between each individual training sample and the output of the learning algorithm, which requires weaker conditions on the loss function, but provides a tighter characterization of the generalization error than existing studies. Examples are further provided to demonstrate that the bound derived in this paper is tighter, and has a broader range of applicability. Application to noisy and iterative algorithms, e.g., stochastic gradient Langevin dynamics (SGLD), is also studied, where the constructed bound provides a tighter characterization of the generalization error than existing results.

## I Introduction

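Consider an instance space $\mathcal{Z}$, a continuous hypothesis space $\mathcal{W}$, and a nonnegative loss function $\ell:\mathcal{W}\times\mathcal{Z}\to\mathbb{R}^+$. A training dataset $S=\{Z_1,\dots,Z_n\}$ consists of $n$ i.i.d. samples $Z_i\in\mathcal{Z}$ drawn from an unknown distribution $\mu$. The goal of a supervised learning algorithm is to find an output hypothesis $w\in\mathcal{W}$ that minimizes the population risk:

$$L_\mu(w)\triangleq\mathbb{E}_{Z\sim\mu}[\ell(w,Z)]. \tag{1}$$

In practice, $\mu$ is unknown, and thus $L_\mu(w)$ cannot be computed directly. Instead, the empirical risk of $w$ on a training dataset $S$ is studied, which is defined as

$$L_S(w)\triangleq\frac{1}{n}\sum_{i=1}^{n}\ell(w,Z_i). \tag{2}$$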

A learning algorithm can be characterized by a randomized mapping from the training dataset $S$ to a hypothesis $W$ according to a conditional distribution $P_{W|S}$. The generalization error of a supervised learning algorithm is the expected difference between the population risk of the output hypothesis and its empirical risk on the training dataset:

$$\mathrm{gen}(\mu,P_{W|S})\triangleq\mathbb{E}_{W,S}[L_\mu(W)-L_S(W)], \tag{3}$$

where the expectation is taken over the joint distribution $P_{S,W}=\mu^{\otimes n}\otimes P_{W|S}$. The generalization error measures the extent to which the learning algorithm overfits the training data.

Traditional ways of bounding the generalization error can be categorized into two groups: (1) by measuring the complexity of the hypothesis space $\mathcal{W}$, e.g., VC dimension and Rademacher complexity [1]; and (2) by exploring properties of the learning algorithm, e.g., uniform stability [2]. Recently, it was proposed in [3] and further studied in [4] and [5] that mutual information can be used to develop upper bounds on the generalization error of a learning algorithm. Such an information-theoretic framework can handle a broader range of problems, e.g., problems with unbounded loss functions. More importantly, it offers an information-theoretic point of view on how to improve the generalization capability of a learning algorithm.

In this paper, we follow the information-theoretic framework in [3, 4, 5]. Our main contribution is a tighter upper bound on the generalization error using the mutual information between an individual training sample and the output hypothesis of the learning algorithm. We show that compared to existing studies, our bound has a broader applicability, and can be considerably tighter.

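### I-A Main Contributions and Comparison to Related Works

The following lemma from [4] provides an upper bound on the generalization error using the mutual information $I(S;W)$ between the training dataset $S$ and the output hypothesis $W$.

###### Lemma 1.

[4, Theorem 1] Suppose $\ell(w,Z)$ is $R$-sub-Gaussian under $Z\sim\mu$ for all $w\in\mathcal{W}$ (a random variable $X$ is $R$-sub-Gaussian if $\log\mathbb{E}[e^{\lambda(X-\mathbb{E}X)}]\le\frac{R^2\lambda^2}{2}$, $\forall\lambda\in\mathbb{R}$), then

$$|\mathrm{gen}(\mu,P_{W|S})|\le\sqrt{\frac{2R^2}{n}I(S;W)}. \tag{4}$$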

This mutual information based bound in (4) is related to the on-average stability [6], and quantifies the overall dependence between the output of the learning algorithm and its input dataset using $I(S;W)$. By further exploiting the structure of the hypothesis space and the dependency between the algorithm input and output, the authors of [5] combined the chaining and mutual information methods, and obtained a tighter bound on the generalization error.

However, the bound in Lemma 1 and the chaining mutual information (CMI) bound in [5] both suffer from the following two shortcomings. First, for empirical risk minimization (ERM), if $W$ is the unique minimizer of $L_S(w)$ in $\mathcal{W}$, the mutual information $I(S;W)=\infty$. It can be shown that both bounds are not tight in this case. Second, both bounds assume that $\ell(w,Z)$ has a bounded cumulant generating function (CGF) under $Z\sim\mu$ for all $w\in\mathcal{W}$, which may not hold for many problems.

In this paper, we get around these shortcomings by combining the idea of algorithmic stability [6, 7] and the information-theoretic framework. Specifically, an algorithm is stable if the output hypothesis does not change too much with the replacement of any individual training sample, and if an algorithm is stable, then it generalizes well [6, 7]. Motivated by these facts, we tighten the mutual information based generalization error bound by considering the individual sample mutual information (ISMI) $I(W;Z_i)$. Compared with the bound in Lemma 1 and the CMI bound in [5], the ISMI bound requires a weaker condition on the CGF of the loss function, is applicable to a broader range of problems, and provides a tighter characterization of the generalization error. We also study three examples in detail, and compare the ISMI bound with existing results to demonstrate its advantages.

## II Preliminaries

We use uppercase letters to denote random variables, and calligraphic uppercase letters to denote sets. For a random variable $X$ generated from a distribution $\mu$, we use $\mathbb{E}_{X\sim\mu}$ to denote the expectation taken over $X$ with distribution $\mu$. We write $I_d$ to denote the $d$-dimensional identity matrix. All logarithms are natural ones.

The cumulant generating function (CGF) of a random variable $X$ is defined as $\Lambda_X(\lambda)\triangleq\log\mathbb{E}[e^{\lambda(X-\mathbb{E}X)}]$. It can be verified that $\Lambda_X(0)=\Lambda_X'(0)=0$, and that $\Lambda_X(\lambda)$ is convex if it exists.

###### Definition 1.

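For a convex function $\psi$ defined on the interval $[0,b)$, where $0<b\le\infty$, its Legendre dual $\psi^*$ is defined as

$$\psi^*(x)\triangleq\sup_{\lambda\in[0,b)}\big(\lambda x-\psi(\lambda)\big). \tag{5}$$

The following lemma characterizes properties of the Legendre dual and its inverse function.

###### Lemma 2.

[8, Lemma 2.4] Assume that $\psi(0)=\psi'(0)=0$. Then $\psi^*(x)$ defined above is a nonnegative, convex, and non-decreasing function on $[0,\infty)$ with $\psi^*(0)=0$. Moreover, its inverse function $\psi^{*-1}(y)=\inf\{x\ge 0:\psi^*(x)\ge y\}$ is concave, and can be written as

$$\psi^{*-1}(y)=\inf_{\lambda\in(0,b)}\frac{y+\psi(\lambda)}{\lambda}. \tag{6}$$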

For an $R$-sub-Gaussian random variable $X$, let $\psi(\lambda)=\frac{R^2\lambda^2}{2}$; then by Lemma 2, $\psi^{*-1}(y)=\sqrt{2R^2y}$.
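The closed form for the sub-Gaussian case can be checked numerically. The following is a minimal sketch that evaluates the infimum in (6) on a grid for $\psi(\lambda)=R^2\lambda^2/2$; the helper name `inverse_legendre_dual`, the value of `R`, and the grid are illustrative assumptions, not part of the original text.

```python
import numpy as np

def inverse_legendre_dual(psi, y, lambdas):
    """Approximate inf over lambda of (y + psi(lambda)) / lambda, as in (6), on a grid."""
    return np.min((y + psi(lambdas)) / lambdas)

R = 1.5                                      # hypothetical sub-Gaussian parameter
psi = lambda lam: 0.5 * R**2 * lam**2        # psi(lambda) = R^2 lambda^2 / 2
lambdas = np.linspace(1e-4, 20.0, 200_001)   # grid over (0, b), truncated at 20

for y in (0.1, 0.5, 2.0):
    numeric = inverse_legendre_dual(psi, y, lambdas)
    closed_form = np.sqrt(2 * R**2 * y)
    print(f"y={y}: grid infimum={numeric:.6f}, sqrt(2 R^2 y)={closed_form:.6f}")
```

The infimum is attained at $\lambda=\sqrt{2y}/R$, so the grid only needs to cover that point for the two values to agree.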

## III Bounding Generalization Error via $I(W;Z_i)$

In this section, we first generalize the decoupling lemma in [4, Lemma 1] to a more general setting, and then tighten the bound on generalization error via $I(W;Z_i)$.

### III-A General Decoupling Estimate

Consider a pair of random variables $W$ and $Z$ with joint distribution $P_{W,Z}$. Let $\tilde W$ be an independent copy of $W$, and $\tilde Z$ be an independent copy of $Z$, such that $P_{\tilde W,\tilde Z}=P_W\otimes P_Z$. Suppose $f:\mathcal{W}\times\mathcal{Z}\to\mathbb{R}$ is a real-valued function. If the CGF of $f(\tilde W,\tilde Z)$ is upper bounded for $\lambda\in(b_-,b_+)$, we have the following theorem.

###### Theorem 1.

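Assume that $\Lambda_{f(\tilde W,\tilde Z)}(\lambda)\le\psi_+(\lambda)$ for $\lambda\in[0,b_+)$, and $\Lambda_{f(\tilde W,\tilde Z)}(\lambda)\le\psi_-(-\lambda)$ for $\lambda\in(b_-,0]$ under distribution $P_{\tilde W,\tilde Z}=P_W\otimes P_Z$, where $0<b_+\le\infty$ and $-\infty\le b_-<0$. Suppose that $\psi_+(\lambda)$ and $\psi_-(\lambda)$ are convex, and $\psi_+(0)=\psi_+'(0)=\psi_-(0)=\psi_-'(0)=0$. Then,

$$\mathbb{E}[f(W,Z)]-\mathbb{E}[f(\tilde W,\tilde Z)]\le\psi_+^{*-1}\big(I(W;Z)\big), \tag{7}$$

$$\mathbb{E}[f(\tilde W,\tilde Z)]-\mathbb{E}[f(W,Z)]\le\psi_-^{*-1}\big(I(W;Z)\big). \tag{8}$$

###### Proof.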

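Consider the Donsker-Varadhan variational representation of the relative entropy between two probability measures $P$ and $Q$ defined on $\mathcal{X}$:

$$D(P\,\|\,Q)=\sup_{g\in\mathcal{G}}\Big\{\mathbb{E}_P[g(X)]-\log\mathbb{E}_Q[e^{g(X)}]\Big\}, \tag{9}$$

where the supremum is over all measurable functions $\mathcal{G}=\{g:\mathcal{X}\to\mathbb{R},\ \mathbb{E}_Q[e^{g(X)}]<\infty\}$, and the equality is achieved when $g=\log\frac{dP}{dQ}$. It then follows that, $\forall\lambda\in[0,b_+)$,

$$\begin{aligned} D(P_{W,Z}\,\|\,P_W\otimes P_Z) &\ge\mathbb{E}[\lambda f(W,Z)]-\log\mathbb{E}[e^{\lambda f(\tilde W,\tilde Z)}] \\ &\ge\lambda\big(\mathbb{E}[f(W,Z)]-\mathbb{E}[f(\tilde W,\tilde Z)]\big)-\psi_+(\lambda), \end{aligned} \tag{10}$$

where the last inequality follows from the assumption that

$$\log\mathbb{E}\big[e^{\lambda(f(\tilde W,\tilde Z)-\mathbb{E}f(\tilde W,\tilde Z))}\big]\le\psi_+(\lambda),\quad\forall\lambda\in[0,b_+). \tag{11}$$

Similarly, $\forall\lambda\in(b_-,0]$, it follows that

$$D(P_{W,Z}\,\|\,P_W\otimes P_Z)\ge\lambda\big(\mathbb{E}[f(W,Z)]-\mathbb{E}[f(\tilde W,\tilde Z)]\big)-\psi_-(-\lambda). \tag{12}$$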

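Note that $I(W;Z)=D(P_{W,Z}\,\|\,P_W\otimes P_Z)$. For $\lambda\in(0,b_+)$, dividing (10) by $\lambda$ gives

$$\mathbb{E}[f(W,Z)]-\mathbb{E}[f(\tilde W,\tilde Z)]\le\inf_{\lambda\in(0,b_+)}\frac{I(W;Z)+\psi_+(\lambda)}{\lambda}=\psi_+^{*-1}\big(I(W;Z)\big), \tag{13}$$

and for $\lambda\in(b_-,0)$, dividing (12) by $\lambda<0$ and replacing $\lambda$ with $-\lambda$ gives

$$\mathbb{E}[f(\tilde W,\tilde Z)]-\mathbb{E}[f(W,Z)]\le\inf_{\lambda\in(0,-b_-)}\frac{I(W;Z)+\psi_-(\lambda)}{\lambda}=\psi_-^{*-1}\big(I(W;Z)\big), \tag{14}$$

where the equalities in (13) and (14) follow from Lemma 2. ∎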
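The Donsker-Varadhan representation (9) that drives this proof can be sanity-checked on a finite alphabet. The sketch below uses two randomly drawn distributions `P` and `Q` on five points (hypothetical choices for illustration only) and compares the objective at $g=\log\frac{dP}{dQ}$ with the objective at many random test functions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical distributions on a 5-point alphabet (full support almost surely).
P = rng.dirichlet(np.ones(5))
Q = rng.dirichlet(np.ones(5))

def dv_objective(g, P, Q):
    """E_P[g(X)] - log E_Q[exp(g(X))], the functional inside the supremum in (9)."""
    return np.dot(P, g) - np.log(np.dot(Q, np.exp(g)))

kl = np.sum(P * np.log(P / Q))     # D(P || Q) computed directly
g_star = np.log(P / Q)             # maximizer stated below (9)
best_random = max(dv_objective(rng.normal(size=5), P, Q) for _ in range(10_000))

print(f"D(P||Q)                = {kl:.6f}")
print(f"objective at log dP/dQ = {dv_objective(g_star, P, Q):.6f}")
print(f"best of random g       = {best_random:.6f}")
```

Any fixed choice of $g$, such as $g=\lambda f$ in (10), therefore yields a lower bound on the relative entropy.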

Theorem 1 provides a more general characterization of the decoupling estimate than existing results. Specifically, it is assumed that the CGF of $f(w,Z)$ is bounded for all $w\in\mathcal{W}$ in [4, Lemma 1] and [9, Theorem 2], whereas in Theorem 1, it is only assumed that the CGF of $f(\tilde W,\tilde Z)$ is bounded in expectation under $P_W\otimes P_Z$.

### III-B Individual Sample Mutual Information Bound

Motivated by the idea of algorithmic stability, which measures how much an output hypothesis changes with the replacement of an individual training sample, we construct an upper bound on the generalization error via $I(W;Z_i)$.

###### Theorem 2.

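Suppose $\ell(\tilde W,\tilde Z)$ satisfies $\Lambda_{\ell(\tilde W,\tilde Z)}(\lambda)\le\psi_+(\lambda)$ for $\lambda\in[0,b_+)$, and $\Lambda_{\ell(\tilde W,\tilde Z)}(\lambda)\le\psi_-(-\lambda)$ for $\lambda\in(b_-,0]$ under $P_{\tilde W,\tilde Z}=P_W\otimes\mu$, where $0<b_+\le\infty$ and $-\infty\le b_-<0$. Then,

$$\mathrm{gen}(\mu,P_{W|S})\le\frac{1}{n}\sum_{i=1}^{n}\psi_-^{*-1}\big(I(W;Z_i)\big), \tag{15}$$

$$-\mathrm{gen}(\mu,P_{W|S})\le\frac{1}{n}\sum_{i=1}^{n}\psi_+^{*-1}\big(I(W;Z_i)\big). \tag{16}$$

###### Proof.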

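The generalization error can be written as follows:

$$\mathrm{gen}(\mu,P_{W|S})=\frac{1}{n}\sum_{i=1}^{n}\Big(\mathbb{E}_{\tilde W,\tilde Z}[\ell(\tilde W,\tilde Z)]-\mathbb{E}_{W,Z_i}[\ell(W,Z_i)]\Big),$$

where $W$ and $Z_i$ in the second term are dependent with joint distribution $P_{W,Z_i}$, and $\tilde W$ and $\tilde Z$ in the first term are independent with the same marginal distributions. Applying Theorem 1 with $f=\ell$ to each summand completes the proof. ∎

The following proposition shows that the ISMI bound is always tighter than the bound in Lemma 1.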

###### Proposition 1.

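Suppose $\ell(w,Z)$ is $R$-sub-Gaussian under $Z\sim\mu$ for all $w\in\mathcal{W}$, then

$$|\mathrm{gen}(\mu,P_{W|S})|\le\frac{1}{n}\sum_{i=1}^{n}\sqrt{2R^2I(W;Z_i)}\le\sqrt{\frac{2R^2}{n}I(W;S)}.$$

###### Proof.

It is clear that if $\ell(w,Z)$ is $R$-sub-Gaussian under $Z\sim\mu$ for all $w\in\mathcal{W}$, then $\ell(\tilde W,\tilde Z)$ is also $R$-sub-Gaussian. For $R$-sub-Gaussian random variables, it is easy to show that $\psi^{*-1}(y)=\sqrt{2R^2y}$. The first inequality then follows from Theorem 2.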

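For the second part, by the chain rule of mutual information,

$$I(W;S)=\sum_{i=1}^{n}I(W;Z_i\,|\,Z^{i-1})\ge\sum_{i=1}^{n}I(W;Z_i), \tag{17}$$

where $Z^{i-1}=\{Z_1,\dots,Z_{i-1}\}$, and the last step follows from the fact that $Z_i$ and $Z^{i-1}$ are independent. Applying Jensen’s inequality to the concave square-root function completes the proof. ∎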
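The ordering in Proposition 1 can be illustrated on a toy case where both information quantities are finite and available in closed form. The sketch below assumes a scalar Gaussian setting that is not part of the original text: $Z_i\sim\mathcal{N}(0,\sigma^2)$ and a noisy algorithm $W=\frac{1}{n}\sum_i Z_i+N$ with $N\sim\mathcal{N}(0,s^2)$. The value of `R` is a hypothetical sub-Gaussian parameter used only to put the two bounds on a common scale.

```python
import numpy as np

def gaussian_mi(rho_sq):
    """Mutual information (nats) of two jointly Gaussian scalars with squared correlation rho_sq."""
    return -0.5 * np.log1p(-rho_sq)

def information_terms(n, sigma2, s2):
    """Return (I(W;S), I(W;Z_i)) for W = sample mean + N(0, s2) noise (assumed toy model)."""
    var_w = sigma2 / n + s2
    mi_ws = gaussian_mi((sigma2 / n) / var_w)                # W depends on S only through the mean
    mi_wz = gaussian_mi((sigma2 / n) ** 2 / (var_w * sigma2))  # Cov(W, Z_i) = sigma2 / n
    return mi_ws, mi_wz

sigma2, s2, R = 1.0, 0.1, 1.0   # data variance, algorithm noise variance, assumed R
for n in (2, 10, 100, 1000):
    mi_ws, mi_wz = information_terms(n, sigma2, s2)
    ismi = np.sqrt(2 * R**2 * mi_wz)          # all n terms of the ISMI sum are equal here
    lemma1 = np.sqrt(2 * R**2 * mi_ws / n)
    print(f"n={n:5d}  I(W;S)={mi_ws:.4f}  sum_i I(W;Z_i)={n * mi_wz:.4f}  "
          f"ISMI={ismi:.4f}  Lemma 1={lemma1:.4f}")
```

In this toy model the sum of the individual terms never exceeds $I(W;S)$, in line with (17).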

###### Remark 1.

If and are concave, it can be shown that the ISMI bound in Theorem 2 is also tighter than the bound using in [9].

## IV Examples with Infinite $I(W;S)$

In this section, we consider two examples with infinite $I(W;S)$. We show that for these two examples, the upper bound on generalization error in Lemma 1 blows up, whereas the ISMI bound in Theorem 2 still provides an accurate approximation.

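### IV-A Estimating the Mean

We first consider the problem of learning the mean of a Gaussian random vector $Z\sim\mathcal{N}(\mu,\sigma^2I_d)$, which minimizes the mean square error $L_\mu(w)=\mathbb{E}_Z\|w-Z\|_2^2$. The empirical risk with $n$ i.i.d. samples is $L_S(w)=\frac{1}{n}\sum_{i=1}^{n}\|w-Z_i\|_2^2$. The empirical risk minimization (ERM) solution is the sample mean $W=\frac{1}{n}\sum_{i=1}^{n}Z_i$, which is deterministic given $S$. Its generalization error can be computed exactly as follows:

$$\mathrm{gen}(\mu,P_{W|S})=\frac{2\sigma^2d}{n}. \tag{18}$$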

The bound in Lemma 1 is not applicable here for the following two reasons: (1) $W$ is a deterministic function of $S$, and hence $I(S;W)=\infty$; and (2) since $Z$ is a Gaussian random vector, the loss function $\ell(w,Z)=\|w-Z\|_2^2$ is not sub-Gaussian. Specifically, the variance of the loss function $\ell(w,Z)$ diverges as $\|w\|_2\to\infty$, which implies that a uniform upper bound on $\Lambda_{\ell(w,Z)}(\lambda)$, $\forall w\in\mathcal{W}$, does not exist.

Both of these issues can be solved by applying the ISMI bound in Theorem 2. Since $W\sim\mathcal{N}(\mu,\frac{\sigma^2}{n}I_d)$, the mutual information between each individual sample and the output hypothesis can be computed exactly as follows:

$$I(W;Z_i)=\frac{d}{2}\log\frac{n}{n-1},\quad i=1,\cdots,n. \tag{19}$$

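In addition, since $W\sim\mathcal{N}(\mu,\frac{\sigma^2}{n}I_d)$, it can be shown that $\ell(\tilde W,\tilde Z)\sim\sigma_\ell^2\chi_d^2$, where $\sigma_\ell^2\triangleq\frac{(n+1)\sigma^2}{n}$, and $\chi_d^2$ denotes the chi-squared distribution with $d$ degrees of freedom. Then, the CGF of $\ell(\tilde W,\tilde Z)$ is

$$\Lambda_{\ell(\tilde W,\tilde Z)}(\lambda)=-d\sigma_\ell^2\lambda-\frac{d}{2}\log(1-2\sigma_\ell^2\lambda),\quad\lambda\in\Big(-\infty,\frac{1}{2\sigma_\ell^2}\Big).$$

Since $W$ is the ERM solution, it follows that $\mathrm{gen}(\mu,P_{W|S})\ge 0$. We only need to consider the case $\lambda<0$. It can be shown that

$$\Lambda_{\ell(\tilde W,\tilde Z)}(\lambda)\le d\sigma_\ell^4\lambda^2\triangleq\psi_-(-\lambda),\quad\lambda<0. \tag{20}$$

Then, combining (19) and (20) with Theorem 2, we have

$$\mathrm{gen}(\mu,P_{W|S})\le\sigma^2d\sqrt{\frac{2(n+1)^2}{n^2}\log\frac{n}{n-1}}. \tag{21}$$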

As $n\to\infty$, the above bound is $\mathcal{O}(1/\sqrt{n})$, which is usually the case when one applies bounding techniques based on the VC dimension [1] and algorithmic stability [2].
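As a rough numerical check of this example, the sketch below compares the exact value in (18), a Monte Carlo estimate of the generalization error, and the right-hand side of (21). The function names, the sample sizes, and the choice of a zero true mean (made without loss of generality) are assumptions for illustration.

```python
import numpy as np

def ismi_bound(n, d, sigma2):
    """Right-hand side of (21); requires n >= 2."""
    return sigma2 * d * np.sqrt(2 * (n + 1) ** 2 / n**2 * np.log(n / (n - 1)))

def monte_carlo_gen(n, d, sigma2, trials=50_000, seed=0):
    """Estimate E[L_mu(W) - L_S(W)] for the sample-mean ERM by simulation."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(scale=np.sqrt(sigma2), size=(trials, n, d))    # training sets, true mean 0
    W = Z.mean(axis=1)                                             # ERM output per training set
    Z_test = rng.normal(scale=np.sqrt(sigma2), size=(trials, d))   # one fresh sample per trial
    population_loss = np.sum((W - Z_test) ** 2, axis=1)            # unbiased for L_mu(W)
    empirical_loss = np.sum((W[:, None, :] - Z) ** 2, axis=2).mean(axis=1)
    return np.mean(population_loss - empirical_loss)

n, d, sigma2 = 10, 2, 1.0
print("exact (18):     ", 2 * sigma2 * d / n)
print("Monte Carlo:    ", monte_carlo_gen(n, d, sigma2))
print("ISMI bound (21):", ismi_bound(n, d, sigma2))
```

The bound decays as $1/\sqrt{n}$ while the exact value decays as $1/n$, so the gap between the two grows with $n$.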

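### IV-B Gaussian Process

In this subsection, we revisit the example studied in [5]. Let $\mathcal{W}=\{w\in\mathbb{R}^2:\|w\|_2=1\}$, and $Z\sim\mathcal{N}(0,I_2)$ be a standard normal random vector in $\mathbb{R}^2$. The loss function is defined to be the following Gaussian process indexed by $w$:

$$\ell(w,Z)\triangleq-\langle w,Z\rangle,\quad\forall w\in\mathcal{W}. \tag{22}$$

Note that the loss function $\ell(w,Z)$ is sub-Gaussian with parameter $R=1$ for all $w\in\mathcal{W}$. In addition, the output hypothesis $w\in\mathcal{W}$ can also be represented equivalently using the phase of $w$. In other words, we can let $\phi$ be the unique number in $[0,2\pi)$ such that $w=(\cos\phi,\sin\phi)$. For this problem, the empirical risk of a hypothesis $w\in\mathcal{W}$ is given by

$$L_S(w)=-\frac{1}{n}\sum_{i=1}^{n}\langle w,Z_i\rangle.$$

We consider two learning algorithms which are the same as the ones in [5]. The first is the ERM algorithm:

$$W=\operatorname*{arg\,min}_{\phi\in[0,2\pi)}L_S(w)=\operatorname*{arg\,max}_{\phi\in[0,2\pi)}\Big\langle w,\frac{1}{n}\sum_{i=1}^{n}Z_i\Big\rangle. \tag{23}$$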

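The second is the ERM algorithm with additive noise:

$$W'=\Big(\operatorname*{arg\,max}_{\phi\in[0,2\pi)}\Big\langle w,\frac{1}{n}\sum_{i=1}^{n}Z_i\Big\rangle\Big)\oplus\xi\pmod{2\pi}, \tag{24}$$

where the noise $\xi$ is independent of $S$, and has an atom with probability mass $\epsilon$ at 0, and probability $1-\epsilon$ uniformly distributed on $(-\pi,\pi)$. Due to the symmetry of the problem, $W$ and $W'$ are uniformly distributed over $[0,2\pi)$.

For this example, the generalization error of $W$ can be computed exactly as follows:

$$\mathrm{gen}(\mu,P_{W|S})=\mathbb{E}_{W,S}\Big\|\frac{1}{n}\sum_{i=1}^{n}Z_i\Big\|_2=\sqrt{\frac{\pi}{2n}}, \tag{25}$$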

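where the last step is due to the fact that the distribution of $\frac{1}{n}\sum_{i=1}^{n}Z_i$ is $\mathcal{N}(0,\frac{1}{n}I_2)$. For the second algorithm $W'$, since the noise $\xi$ is independent of $S$, it follows that

$$\mathrm{gen}(\mu,P_{W'|S})=\epsilon\sqrt{\frac{\pi}{2n}}. \tag{26}$$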
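The two closed-form values (25) and (26) can be reproduced by simulation. The sketch below is a minimal Monte Carlo check; the function name `simulate_gen`, the trial count, and the value `eps=0.3` are illustrative assumptions. It uses the fact that $L_\mu(w)=0$ for every $w$ because $\mathbb{E}[Z]=0$, so the generalization error reduces to $\mathbb{E}\langle W,\frac{1}{n}\sum_i Z_i\rangle$.

```python
import numpy as np

def simulate_gen(n, eps, trials=100_000, seed=0):
    """Monte Carlo estimates of gen for the ERM output W and the noisy output W'."""
    rng = np.random.default_rng(seed)
    Z_bar = rng.normal(size=(trials, n, 2)).mean(axis=1)    # sample means of Z_1..Z_n
    phase = np.arctan2(Z_bar[:, 1], Z_bar[:, 0])            # ERM output (23), as a phase mod 2*pi
    # Noise xi: equal to 0 with probability eps, otherwise uniform on (-pi, pi).
    xi = np.where(rng.random(trials) < eps, 0.0, rng.uniform(-np.pi, np.pi, trials))

    def gen(phi):
        w = np.stack([np.cos(phi), np.sin(phi)], axis=1)
        # L_mu(w) = 0, so gen = E[-L_S(W)] = E[<W, Z_bar>].
        return np.mean(np.sum(w * Z_bar, axis=1))

    return gen(phase), gen(phase + xi)

n, eps = 10, 0.3
gen_erm, gen_noisy = simulate_gen(n, eps)
print("ERM:   simulated", gen_erm, " closed form (25)", np.sqrt(np.pi / (2 * n)))
print("noisy: simulated", gen_noisy, " closed form (26)", eps * np.sqrt(np.pi / (2 * n)))
```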

The bound via $I(W;S)$ in Lemma 1 is not applicable, since $W$ is deterministic given $S$ and $I(W;S)=\infty$. Moreover, for the second algorithm $W'$,

$$I(W';S)=h(W')-h(W'|S)=\log 2\pi-h(\xi)=\infty, \tag{27}$$

since $\xi$ has a singular component at 0, and $h(\xi)=-\infty$.

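Applying the ISMI bound in Theorem 2 to the ERM algorithm $W$, we have that

$$I(W;Z_i)=h(W)-h(W|Z_i)=\log 2\pi-h(W|Z_i)=\log 2\pi-\mathbb{E}_{Z_i}[h(W|Z_i=z_i)]. \tag{28}$$

Note that given $Z_i=z_i$, the ERM solution is

$$W=\operatorname*{arg\,max}_{\phi\in[0,2\pi)}\Big\langle w,\frac{z_i}{n}+\frac{1}{n}\sum_{j\ne i}Z_j\Big\rangle, \tag{29}$$

which depends on the other samples $Z_j$, $j\ne i$. Moreover, it can be shown that the conditional distribution of $W$ given $Z_i=z_i$ is equivalent to the phase distribution of a Gaussian random variable $\mathcal{N}(z_i,(n-1)I_2)$ in polar coordinates. Due to symmetry, we can always rotate the polar coordinates such that $z_i=(r,0)$, where $r=\|z_i\|_2$ is the Euclidean norm of $z_i$. Then, the conditional density of $W$ is a function of $r$, and can be equivalently characterized by

$$f\big(\phi\,\big|\,\|Z_i\|=r\big)=\frac{1}{2\pi}e^{-\frac{r^2}{2(n-1)}}+\frac{r\cos\phi}{\sqrt{2\pi(n-1)}}e^{-\frac{r^2\sin^2\phi}{2(n-1)}}Q\Big(-\frac{r\cos\phi}{\sqrt{n-1}}\Big), \tag{30}$$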

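where $Q(x)\triangleq\frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-u^2/2}\,du$ is the tail distribution function of the standard normal distribution. Since the norm of $Z_i$ has a Rayleigh distribution with unit scale parameter, it then follows that

$$I(W;Z_i)=\log 2\pi-\mathbb{E}_{\|Z_i\|}\Big[h\big(f(\phi\,\big|\,\|Z_i\|=r)\big)\Big]. \tag{31}$$

Applying Theorem 2 with $R=1$, and noting that $I(W;Z_i)$ is the same for every $i$ by symmetry, we obtain

$$|\mathrm{gen}(\mu,P_{W|S})|\le\frac{1}{n}\sum_{i=1}^{n}\sqrt{2I(W;Z_i)}=\sqrt{2I(W;Z_i)}. \tag{32}$$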

Similarly, we can compute the ISMI bound for $W'$.

Numerical comparisons are presented in Fig. 1 and Fig. 2. In both figures, we plot the ISMI bound, the CMI bound in [5], and the true values of the generalization error, as functions of the number of samples $n$. In Fig. 1, we compare these bounds for the ERM solution $W$. Note that the CMI bound reduces to the classical chaining bound in this case. In Fig. 2, we evaluate these bounds for the noisy algorithm $W'$ with a fixed $\epsilon$. Both figures demonstrate that the ISMI bound is closer to the true values of the generalization error, and outperforms the CMI bound significantly.
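The ISMI bound for the ERM algorithm has no closed form, but it can be evaluated by numerical integration. The sketch below assumes the phase density of $\mathcal{N}((r,0),(n-1)I_2)$, with the argument of $Q$ scaled by $\sqrt{n-1}$, then integrates the differential entropy over $\phi$ and averages over the Rayleigh-distributed norm $r$. Grid sizes, the truncation `r_max`, and all function names are illustrative assumptions.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def q_function(x):
    """Tail probability Q(x) of the standard normal distribution."""
    return 0.5 * _erfc(x / math.sqrt(2.0))

def phase_density(phi, r, n):
    """Phase density of N((r, 0), (n - 1) I_2), i.e. of W given ||Z_i|| = r (n >= 2)."""
    s = math.sqrt(n - 1.0)
    return (np.exp(-r**2 / (2 * s**2)) / (2 * np.pi)
            + r * np.cos(phi) / (s * math.sqrt(2 * np.pi))
            * np.exp(-(r * np.sin(phi)) ** 2 / (2 * s**2))
            * q_function(-r * np.cos(phi) / s))

def ismi_bound(n, num_phi=720, num_r=400, r_max=8.0):
    """Return (sqrt(2 I(W;Z_i)), I(W;Z_i)) computed on a (r, phi) grid."""
    phi = np.linspace(0.0, 2 * np.pi, num_phi, endpoint=False)
    r = np.linspace(0.0, r_max, num_r)
    f = np.clip(phase_density(phi[None, :], r[:, None], n), 1e-300, None)
    h = -np.sum(f * np.log(f), axis=1) * (2 * np.pi / num_phi)   # h(W | ||Z_i|| = r)
    rayleigh = r * np.exp(-r**2 / 2)                              # density of ||Z_i||
    expected_h = np.sum(rayleigh * h) / np.sum(rayleigh)          # normalized Riemann sum
    mi = math.log(2 * math.pi) - expected_h
    return math.sqrt(2 * max(mi, 0.0)), mi

for n in (5, 10, 50):
    bound, mi = ismi_bound(n)
    print(f"n={n:3d}  I(W;Z_i)={mi:.4f}  ISMI bound={bound:.4f}  true gen={math.sqrt(math.pi / (2 * n)):.4f}")
```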

## V Noisy, Iterative Algorithms

In this section, we apply the ISMI bound in Theorem 2 to a class of noisy, iterative algorithms, specifically, stochastic gradient Langevin dynamics (SGLD).

### V-A SGLD Algorithm

Denote the parameter vector at iteration $t$ by $W(t)\in\mathbb{R}^d$, and let $W(0)\in\mathcal{W}$ denote an arbitrary initialization. At each iteration $t\ge 1$, we sample a training data point $Z_{U(t)}\in S$, where $U(t)\in\{1,\dots,n\}$ denotes the random index of the sample selected at iteration $t$, and compute the gradient $\nabla\ell(W(t-1),Z_{U(t)})$. We then scale the gradient by a step size $\eta(t)$ and perturb it by isotropic Gaussian noise $\xi\sim\mathcal{N}(0,I_d)$. The overall updating rule is as follows [10]:

$$W(t)=W(t-1)-\eta(t)\nabla\ell(W(t-1),Z_{U(t)})+\sigma(t)\xi, \tag{33}$$

where $\sigma(t)$ controls the variance of the Gaussian noise.

For $t\ge 1$, let $W^{(t)}\triangleq(W(1),\dots,W(t))$ and $U^{(t)}\triangleq(U(1),\dots,U(t))$. We assume that the training process takes $K$ epochs. For the $k$-th training epoch, i.e., from the $((k-1)n+1)$-th to the $kn$-th iterations, all training samples in $S$ are used exactly once. The total number of iterations is $T=nK$. The output of the algorithm is $W=W(T)$.
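The update rule (33) with the epoch-wise sampling scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the schedule $\eta(t)=c/t$, $\sigma(t)=\sqrt{\eta(t)}$, the function names, and the clipped toy gradient oracle are assumptions chosen so that the gradient norm is bounded by a constant `L`.

```python
import numpy as np

def sgld(Z, grad_loss, w0, num_epochs, c=0.1, seed=0):
    """Run the SGLD update (33) for num_epochs passes, using every sample once per epoch."""
    rng = np.random.default_rng(seed)
    n = len(Z)
    w = np.array(w0, dtype=float)
    t = 0
    for _ in range(num_epochs):
        for u in rng.permutation(n):           # indices U(t) of this epoch, without replacement
            t += 1
            eta = c / t                        # assumed step-size schedule
            sigma = np.sqrt(eta)               # assumed noise schedule
            xi = rng.standard_normal(w.shape)  # isotropic Gaussian noise
            w = w - eta * grad_loss(w, Z[u]) + sigma * xi
    return w

def clipped_grad(w, z, L=1.0):
    """Hypothetical gradient oracle: gradient of ||w - z||^2, rescaled to norm at most L."""
    g = 2.0 * (w - z)
    norm = np.linalg.norm(g)
    return g if norm <= L else g * (L / norm)

rng = np.random.default_rng(1)
Z = rng.normal(loc=1.0, size=(50, 3))          # toy dataset with n = 50, d = 3
print(sgld(Z, clipped_grad, w0=np.zeros(3), num_epochs=5))
```

Drawing a fresh permutation per epoch is one way to realize the requirement that all samples are used exactly once in each epoch; the total number of iterations is then $T=nK$.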

In the following, we use the same assumptions as in [11].

###### Assumption 1.

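$\ell(w,Z)$ is $R$-sub-Gaussian with respect to $Z\sim\mu$, for every $w\in\mathcal{W}$.

###### Assumption 2.

The gradients are bounded, i.e., $\sup_{w\in\mathcal{W},z\in\mathcal{Z}}\|\nabla\ell(w,z)\|_2\le L$, for some $L>0$.

In [11], the following bound was obtained by upper bounding $I(W;S)$ in Lemma 1.

###### Lemma 3.

[11, Corollary 1] The generalization error of the SGLD algorithm is bounded by

$$|\mathrm{gen}(\mu,P_{W|S})|\le\sqrt{\frac{R^2}{n}\sum_{t=1}^{T}\frac{\eta_t^2L^2}{\sigma_t^2}}. \tag{34}$$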

### V-B ISMI Bound for SGLD

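To apply the ISMI bound to SGLD, we modify the result in Theorem 2 by conditioning on the random sample path $U^{(T)}$,

$$|\mathrm{gen}(\mu,P_{W|S})|\le\frac{1}{|\mathcal{U}|}\sum_{u^{(T)}\in\mathcal{U}}\Big(\frac{1}{n}\sum_{i=1}^{n}\sqrt{2R^2I(W;Z_i\,|\,U^{(T)}=u^{(T)})}\Big), \tag{35}$$

where $\mathcal{U}$ denotes the set of all possible sample paths.

Let $\mathcal{T}_i(u^{(T)})$ denote the set of iterations at which sample $Z_i$ is selected for a given sample path $u^{(T)}$. Using the chain rule of mutual information, we have

$$\begin{aligned} I(W;Z_i\,|\,U^{(T)}=u^{(T)}) &\le I(Z_i;W^{(T)}\,|\,U^{(T)}=u^{(T)}) \\ &=\sum_{\tau=1}^{T}I(Z_i;W(\tau)\,|\,W^{(\tau-1)},U^{(T)}=u^{(T)}) \\ &=\sum_{\tau\in\mathcal{T}_i(u^{(T)})}I(Z_i;W(\tau)\,|\,W^{(\tau-1)},U^{(T)}=u^{(T)}), \end{aligned} \tag{36}$$

where the last equality is due to the fact that given $W^{(\tau-1)}$ and $U^{(T)}=u^{(T)}$, $W(\tau)$ is independent of $Z_i$ if $\tau\notin\mathcal{T}_i(u^{(T)})$. For $\tau\in\mathcal{T}_i(u^{(T)})$, i.e., if $Z_i$ is selected at iteration $\tau$, we have

$$\begin{aligned} I(Z_i;W(\tau)\,|\,W^{(\tau-1)},U^{(T)}=u^{(T)}) &=h\big(\eta(\tau)\nabla\ell(W(\tau-1),Z_i)+\sigma(\tau)\xi\,\big|\,W^{(\tau-1)}\big)-h(\sigma(\tau)\xi) \\ &\le\frac{d}{2}\log\Big(1+\frac{\eta^2(\tau)L^2}{d\sigma^2(\tau)}\Big), \end{aligned} \tag{37}$$

where the last step follows from Assumption 2 and the fact that $\xi$ is an independent Gaussian noise, as in [11].

Combining (35), (36), and (37), it follows that

$$|\mathrm{gen}(\mu,P_{W|S})|\le\mathbb{E}_{U^{(T)}}\Bigg[\frac{R}{n}\sum_{i=1}^{n}\sqrt{\sum_{\tau\in\mathcal{T}_i(U^{(T)})}\frac{\eta^2(\tau)L^2}{\sigma^2(\tau)}}\Bigg], \tag{38}$$

where we remove the $\log$ term by using $\log(1+x)\le x$.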

### V-C Discussion

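As in [11], we set $\eta(t)=\frac{c}{t}$ and $\sigma(t)=\sqrt{\eta(t)}$, so that $\frac{\eta^2(\tau)}{\sigma^2(\tau)}=\frac{c}{\tau}$. Then,

$$\begin{aligned} |\mathrm{gen}(\mu,P_{W|S})| &\le\frac{RL}{n}\mathbb{E}_{U^{(T)}}\Bigg[\sum_{i=1}^{n}\sqrt{\sum_{\tau\in\mathcal{T}_i(U^{(T)})}\frac{c}{\tau}}\Bigg] \\ &\overset{(a)}{\le}\frac{RL\sqrt{c}}{n}\sum_{i=1}^{n}\sqrt{\frac{1}{i}+\sum_{k=1}^{K-1}\frac{1}{nk}} \\ &\overset{(b)}{\le}\frac{RL\sqrt{c}}{n}\sum_{i=1}^{n}\sqrt{\frac{1}{i}+\frac{\log(K-1)+1}{n}} \\ &\overset{(c)}{\le}\frac{RL}{\sqrt{n}}\Big(\sqrt{c\log(K-1)+c}+o(\log\log K)\Big), \end{aligned}$$

where (a) follows from the sampling scheme in which all samples are used exactly once in each epoch; (b) is due to the fact that $\sum_{k=1}^{K-1}\frac{1}{k}\le\log(K-1)+1$; and (c) follows by upper bounding the sum over $i$ with the integral $\int_0^n\sqrt{\frac{1}{x}+\frac{\log(K-1)+1}{n}}\,dx$.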

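Comparing with the bound in [11],

$$|\mathrm{gen}(\mu,P_{W|S})|\le\frac{RL}{\sqrt{n}}\sqrt{c\log(nK)+c}, \tag{39}$$

it can be seen that our bound is tighter: the $\log(nK)$ term in (39) is replaced by $\log(K-1)$, so the dependence on $n$ inside the logarithm is removed.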
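The comparison can be made concrete by evaluating both expressions for a few values of $n$ and $K$. The sketch below computes the intermediate bound after step (a) and the bound (39); the function names and the default values $R=L=c=1$ are hypothetical choices for illustration.

```python
import numpy as np

def ismi_sgld_bound(n, K, R=1.0, L=1.0, c=1.0):
    """Bound after step (a): (R L sqrt(c) / n) * sum_i sqrt(1/i + sum_{k=1}^{K-1} 1/(n k))."""
    i = np.arange(1, n + 1)
    later_epochs = np.sum(1.0 / (n * np.arange(1, K)))   # empty sum (= 0) when K = 1
    return R * L * np.sqrt(c) / n * np.sum(np.sqrt(1.0 / i + later_epochs))

def pensia_bound(n, K, R=1.0, L=1.0, c=1.0):
    """Right-hand side of (39)."""
    return R * L / np.sqrt(n) * np.sqrt(c * np.log(n * K) + c)

for n, K in [(100, 10), (1000, 10), (1000, 100)]:
    print(f"n={n:5d} K={K:4d}  ISMI={ismi_sgld_bound(n, K):.4f}  bound (39)={pensia_bound(n, K):.4f}")
```

For these settings the ISMI-based expression is the smaller of the two, and the gap widens as $n$ grows with $K$ fixed.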

## References

• [1] S. Boucheron, O. Bousquet, and G. Lugosi, “Theory of classification: A survey of some recent advances,” ESAIM: probability and statistics, vol. 9, pp. 323–375, 2005.
• [2] O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Mach. Learn. Res., vol. 2, pp. 499–526, Mar 2002.
• [3] D. Russo and J. Zou, “Controlling bias in adaptive data analysis using information theory,” in Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, pp. 1232–1240.
• [4] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2017, pp. 2524–2533.
• [5] A. Asadi, E. Abbe, and S. Verdu, “Chaining mutual information and tightening generalization bounds,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2018, pp. 7245–7254.
• [6] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability and uniform convergence,” J. Mach. Learn. Res., vol. 11, pp. 2635–2670, Oct 2010.
• [7] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of stability and bias of learning algorithms,” in Proc. Information Theory Workshop (ITW), 2016, pp. 26–30.
• [8] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence, Oxford University Press, 2013.
• [9] J. Jiao, Y. Han, and T. Weissman, “Dependence measures bounding the exploration bias for general measurements,” in Proc. IEEE Int. Symp. Information Theory (ISIT), 2017, pp. 1475–1479.
• [10] M. Welling and Y. Teh, “Bayesian learning via stochastic gradient Langevin dynamics,” in Proc. International Conference on Machine Learning (ICML), 2011, pp. 681–688.
• [11] A. Pensia, V. Jog, and P. Loh, “Generalization error bounds for noisy, iterative algorithms,” in Proc. IEEE Int. Symp. Information Theory (ISIT), June 2018, pp. 546–550.