Consider an instance space $\mathcal{Z}$, a continuous hypothesis space $\mathcal{W}$, and a nonnegative loss function $\ell:\mathcal{W}\times\mathcal{Z}\to\mathbb{R}^{+}$. A training dataset $S=\{Z_1,\dots,Z_n\}$ consists of $n$ i.i.d. samples drawn from an unknown distribution $\mu$. The goal of a supervised learning algorithm is to find an output hypothesis $w\in\mathcal{W}$ that minimizes the population risk:
$$L_\mu(w)\triangleq\mathbb{E}_{Z\sim\mu}[\ell(w,Z)].$$
In practice, $\mu$ is unknown, and thus $L_\mu(w)$ cannot be computed directly. Instead, the empirical risk of $w$ on the training dataset $S$ is studied, which is defined as
$$L_S(w)\triangleq\frac{1}{n}\sum_{i=1}^{n}\ell(w,Z_i).$$
A learning algorithm can be characterized by a randomized mapping from the training dataset $S$ to a hypothesis $W$ according to a conditional distribution $P_{W|S}$. The generalization error of a supervised learning algorithm is the expected difference between the population risk of the output hypothesis and its empirical risk on the training dataset:
$$\mathrm{gen}(\mu,P_{W|S})\triangleq\mathbb{E}\bigl[L_\mu(W)-L_S(W)\bigr],$$
where the expectation is taken over the joint distribution of $S$ and $W$. The generalization error measures the extent to which the learning algorithm overfits the training data.
Traditional ways of bounding the generalization error can be categorized into two groups: (1) by measuring the complexity of the hypothesis space $\mathcal{W}$, e.g., VC dimension and Rademacher complexity [1]; and (2) by exploring properties of the learning algorithm, e.g., uniform stability [2]. Recently, it was proposed in [3], and further studied in [4] and [5], that mutual information can be used to develop upper bounds on the generalization error of a learning algorithm. Such an information-theoretic framework can handle a broader range of problems, e.g., problems with unbounded loss functions. More importantly, it offers an information-theoretic point of view on how to improve the generalization capability of a learning algorithm.
In this paper, we follow the information-theoretic framework in [3, 4, 5]. Our main contribution is a tighter upper bound on the generalization error using the mutual information between an individual training sample and the output hypothesis of the learning algorithm. We show that compared to existing studies, our bound has a broader applicability, and can be considerably tighter.
I-A Main Contributions and Comparison to Related Works
The following lemma from [4] provides an upper bound on the generalization error using the mutual information $I(S;W)$ between the training dataset $S$ and the output hypothesis $W$.
This mutual information based bound in (4) is related to the on-average stability [7], and quantifies the overall dependence between the output of the learning algorithm and its input dataset via $I(S;W)$. By further exploiting the structure of the hypothesis space and the dependence between the algorithm input and output, the authors of [5] combined the chaining and mutual information methods, and obtained a tighter bound on the generalization error.
However, the bound in Lemma 1 and the chaining mutual information (CMI) bound in [5] both suffer from the following two shortcomings. First, for empirical risk minimization (ERM), if $W$ is the unique minimizer of $L_S(w)$ in $\mathcal{W}$, then $W$ is a deterministic function of $S$, and the mutual information $I(S;W)$ can be infinite. It can be shown that neither bound is tight in this case. Second, both bounds assume that $\ell(w,Z)$ has a bounded cumulant generating function (CGF) under $\mu$ for all $w\in\mathcal{W}$, which may not hold for many problems.
In this paper, we get around these shortcomings by combining the idea of algorithmic stability [6, 7] with the information-theoretic framework. Specifically, an algorithm is stable if the output hypothesis does not change too much with the replacement of any individual training sample, and if an algorithm is stable, then it generalizes well [6, 7]. Motivated by these facts, we tighten the mutual information based generalization error bound by considering the individual sample mutual information (ISMI) $I(W;Z_i)$. Compared with the bound in Lemma 1 and the CMI bound in [5], the ISMI bound requires a weaker condition on the CGF of the loss function, is applicable to a broader range of problems, and provides a tighter characterization of the generalization error. We also comprehensively study three examples, and compare the ISMI bound with existing results to demonstrate its superiority.
We use upper-case letters to denote random variables, and calligraphic upper-case letters to denote sets. For a random variable $X$ generated from a distribution $\mu$, we use $\mathbb{E}_{X\sim\mu}$ to denote the expectation taken over $X$ with distribution $\mu$. We write $I_d$ to denote the $d$-dimensional identity matrix. All logarithms are natural ones.
The cumulant generating function (CGF) of a random variable $X$ is defined as
$$\Lambda_X(\lambda)\triangleq\log\mathbb{E}\bigl[e^{\lambda(X-\mathbb{E}X)}\bigr].$$
It can be verified that $\Lambda_X(0)=0$, and that $\Lambda_X(\lambda)$ is convex whenever it exists.
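As a quick numerical illustration (my own, not from the paper), the following sketch estimates the CGF of a centered Gaussian sample and checks it against the closed form $\lambda^2\sigma^2/2$; the function name `empirical_cgf` is an assumption for this example.

```python
import numpy as np

def empirical_cgf(samples, lam):
    """Empirical CGF: log E[exp(lam * (X - E[X]))], estimated from samples."""
    x = samples - samples.mean()
    return float(np.log(np.mean(np.exp(lam * x))))

rng = np.random.default_rng(0)
sigma = 2.0
xs = rng.normal(0.0, sigma, size=500_000)

# For a Gaussian, the CGF is exactly lam^2 * sigma^2 / 2, and convex in lam.
for lam in (0.0, 0.25, 0.5):
    print(lam, empirical_cgf(xs, lam), lam**2 * sigma**2 / 2)
```

The printed pairs should agree up to Monte Carlo error, and the values at $\lambda=0,\ 0.25,\ 0.5$ exhibit the convexity noted above.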
For a convex function $\psi$ defined on the interval $[0,b)$, where $0<b\le\infty$, its Legendre dual $\psi^{\ast}$ is defined as
$$\psi^{\ast}(x)\triangleq\sup_{\lambda\in[0,b)}\bigl(\lambda x-\psi(\lambda)\bigr).$$
The following lemma characterizes the properties of the Legendre dual and its inverse function.
[8, Lemma 2.4] Assume that $\psi(0)=\psi'(0)=0$. Then $\psi^{\ast}$ defined above is a nonnegative, convex, and non-decreasing function on $[0,\infty)$ with $\psi^{\ast}(0)=0$. Moreover, its inverse function $\psi^{\ast-1}(y)\triangleq\inf\{x\ge 0:\psi^{\ast}(x)\ge y\}$ is concave, and can be written as
$$\psi^{\ast-1}(y)=\inf_{\lambda\in(0,b)}\frac{y+\psi(\lambda)}{\lambda}.$$
For a $\sigma$-sub-Gaussian random variable $X$, let $\psi(\lambda)=\frac{\lambda^{2}\sigma^{2}}{2}$; then by Lemma 2, $\psi^{\ast-1}(y)=\sqrt{2\sigma^{2}y}$.
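For completeness, the sub-Gaussian case can be worked out directly from the definitions above:
$$\psi(\lambda)=\frac{\lambda^{2}\sigma^{2}}{2}\ \Longrightarrow\ \psi^{\ast}(x)=\sup_{\lambda\ge 0}\Bigl(\lambda x-\frac{\lambda^{2}\sigma^{2}}{2}\Bigr)=\frac{x^{2}}{2\sigma^{2}}\quad(x\ge 0),$$
where the supremum is attained at $\lambda=x/\sigma^{2}$; inverting $\psi^{\ast}$ then gives $\psi^{\ast-1}(y)=\sqrt{2\sigma^{2}y}$.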
III Bounding Generalization Error via $I(W;Z_i)$
In this section, we first extend the decoupling lemma in [4, Lemma 1] to a more general setting, and then tighten the bound on the generalization error via $I(W;Z_i)$.
III-A General Decoupling Estimate
Consider a pair of random variables $X$ and $Y$ with joint distribution $P_{X,Y}$. Let $\bar X$ be an independent copy of $X$, and $\bar Y$ be an independent copy of $Y$, such that $P_{\bar X,\bar Y}=P_X\otimes P_Y$. Suppose $f:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ is a real-valued function. If the CGF of $f(\bar X,\bar Y)$ is upper bounded in the sense below, we have the following theorem.
Assume that $\Lambda_{f(\bar X,\bar Y)}(\lambda)\le\psi_{+}(\lambda)$ for $\lambda\in[0,b_{+})$, and $\Lambda_{f(\bar X,\bar Y)}(\lambda)\le\psi_{-}(-\lambda)$ for $\lambda\in(b_{-},0]$, where $0<b_{+}\le\infty$ and $-\infty\le b_{-}<0$. Suppose that $\psi_{+}$ and $\psi_{-}$ are convex, and $\psi_{+}(0)=\psi_{-}(0)=\psi'_{+}(0)=\psi'_{-}(0)=0$. Then,
$$\mathbb{E}[f(X,Y)]-\mathbb{E}[f(\bar X,\bar Y)]\le\psi_{+}^{\ast-1}\bigl(I(X;Y)\bigr),$$
$$\mathbb{E}[f(\bar X,\bar Y)]-\mathbb{E}[f(X,Y)]\le\psi_{-}^{\ast-1}\bigl(I(X;Y)\bigr).$$
Consider the Donsker–Varadhan variational representation of the relative entropy between two probability measures $P$ and $Q$ defined on $\mathcal{X}$:
$$D(P\|Q)=\sup_{g}\bigl\{\mathbb{E}_{P}[g(X)]-\log\mathbb{E}_{Q}\bigl[e^{g(X)}\bigr]\bigr\},$$
where the supremum is over all measurable functions $g:\mathcal{X}\to\mathbb{R}$ such that $\mathbb{E}_{Q}[e^{g(X)}]<\infty$, and the equality is achieved when $g=\log\frac{dP}{dQ}$. Taking $P=P_{X,Y}$, $Q=P_X\otimes P_Y$, and $g=\lambda f$, it then follows that for any $\lambda\in[0,b_{+})$,
$$I(X;Y)\ge\lambda\,\mathbb{E}[f(X,Y)]-\log\mathbb{E}\bigl[e^{\lambda f(\bar X,\bar Y)}\bigr]\ge\lambda\bigl(\mathbb{E}[f(X,Y)]-\mathbb{E}[f(\bar X,\bar Y)]\bigr)-\psi_{+}(\lambda),$$
where the last inequality follows from the assumption that $\Lambda_{f(\bar X,\bar Y)}(\lambda)\le\psi_{+}(\lambda)$. Rearranging and taking the infimum over $\lambda\in(0,b_{+})$ yields, by Lemma 2, $\mathbb{E}[f(X,Y)]-\mathbb{E}[f(\bar X,\bar Y)]\le\psi_{+}^{\ast-1}(I(X;Y))$.
Similarly, for $\lambda\in(b_{-},0]$, the same argument yields $\mathbb{E}[f(\bar X,\bar Y)]-\mathbb{E}[f(X,Y)]\le\psi_{-}^{\ast-1}(I(X;Y))$. ∎
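As a sanity check on the decoupling estimate (my own construction, not from the paper), consider a bivariate Gaussian pair with correlation $\rho$ and $f(x,y)=x\,\mathrm{sign}(y)$: then $f(\bar X,\bar Y)\sim\mathcal{N}(0,1)$ is $1$-sub-Gaussian, $I(X;Y)=-\frac{1}{2}\log(1-\rho^{2})$ is available in closed form, and the bound $\mathbb{E}[f(X,Y)]-\mathbb{E}[f(\bar X,\bar Y)]\le\sqrt{2\,I(X;Y)}$ can be verified by simulation:

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.5, 1_000_000

# Bivariate Gaussian pair (X, Y) with correlation rho.
y = rng.standard_normal(n)
x = rho * y + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Under the decoupled pair, f(Xbar, Ybar) ~ N(0, 1), so E[f(Xbar, Ybar)] = 0
# and f(Xbar, Ybar) is 1-sub-Gaussian.
lhs = np.mean(x * np.sign(y))        # estimates E[f(X, Y)] - E[f(Xbar, Ybar)]
mi = -0.5 * np.log(1 - rho**2)       # I(X; Y), closed form for Gaussians
bound = np.sqrt(2 * mi)              # psi_+^{*-1}(I) for sigma = 1

print(lhs, bound)                    # the bound should dominate lhs
```

Here $\mathbb{E}[f(X,Y)]=\rho\sqrt{2/\pi}\approx 0.399$ for $\rho=0.5$, safely below the bound $\sqrt{-\log 0.75}\approx 0.536$.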
III-B Individual Sample Mutual Information Bound
Motivated by the idea of algorithmic stability, which measures how much an output hypothesis changes with the replacement of an individual training sample, we construct an upper bound on the generalization error via the individual sample mutual information $I(W;Z_i)$.
Suppose $\ell(\bar W,\bar Z)$ satisfies $\Lambda_{\ell(\bar W,\bar Z)}(\lambda)\le\psi_{+}(\lambda)$ for $\lambda\in[0,b_{+})$, and $\Lambda_{\ell(\bar W,\bar Z)}(\lambda)\le\psi_{-}(-\lambda)$ for $\lambda\in(b_{-},0]$, under $P_{\bar W,\bar Z}=P_{W}\otimes\mu$, where $0<b_{+}\le\infty$ and $-\infty\le b_{-}<0$, and $\psi_{+},\psi_{-}$ satisfy the conditions in Theorem 1. Then,
$$\mathrm{gen}(\mu,P_{W|S})\le\frac{1}{n}\sum_{i=1}^{n}\psi_{-}^{\ast-1}\bigl(I(W;Z_{i})\bigr),\qquad
-\mathrm{gen}(\mu,P_{W|S})\le\frac{1}{n}\sum_{i=1}^{n}\psi_{+}^{\ast-1}\bigl(I(W;Z_{i})\bigr).$$
The generalization error can be written as follows:
$$\mathrm{gen}(\mu,P_{W|S})=\frac{1}{n}\sum_{i=1}^{n}\Bigl(\mathbb{E}_{P_{W}\otimes\mu}\bigl[\ell(\bar W,\bar Z_{i})\bigr]-\mathbb{E}_{P_{W,Z_{i}}}\bigl[\ell(W,Z_{i})\bigr]\Bigr),$$
where $W$ and $Z_{i}$ in the second term are dependent with joint distribution $P_{W,Z_{i}}$, and $\bar W$ and $\bar Z_{i}$ in the first term are independent with the same marginal distributions. Applying Theorem 1 to each term completes the proof. ∎
The following proposition shows that the ISMI bound is always tighter than the bound in Lemma 1.
Suppose $\ell(w,Z)$ is $\sigma$-sub-Gaussian under $\mu$ for all $w\in\mathcal{W}$; then
$$\mathrm{gen}(\mu,P_{W|S})\le\frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma^{2}I(W;Z_{i})}\le\sqrt{\frac{2\sigma^{2}}{n}I(S;W)}.$$
It is clear that if $\ell(w,Z)$ is $\sigma$-sub-Gaussian under $\mu$ for all $w\in\mathcal{W}$, then $\ell(\bar W,\bar Z)$ is also $\sigma$-sub-Gaussian under $P_W\otimes\mu$. For $\sigma$-sub-Gaussian random variables, it is easy to show that $\psi_{+}^{\ast-1}(y)=\psi_{-}^{\ast-1}(y)=\sqrt{2\sigma^{2}y}$. The first inequality then follows from Theorem 2.
For the second part, by the chain rule of mutual information,
$$I(S;W)=\sum_{i=1}^{n}I(Z_{i};W\mid Z^{i-1})=\sum_{i=1}^{n}\bigl(I(Z_{i};W,Z^{i-1})-I(Z_{i};Z^{i-1})\bigr)\ge\sum_{i=1}^{n}I(Z_{i};W),$$
where $Z^{i-1}\triangleq(Z_{1},\dots,Z_{i-1})$, and the last step follows from the fact that $Z_{i}$ and $Z^{i-1}$ are independent. Applying Jensen’s inequality completes the proof. ∎
IV Examples with Infinite $I(S;W)$
In this section, we consider two examples for which $I(S;W)$ is infinite. We show that for these two examples, the upper bound on the generalization error in Lemma 1 blows up, whereas the ISMI bound in Theorem 2 still provides an accurate characterization.
IV-A Estimating the Mean
We first consider the problem of learning the mean of a Gaussian random vector $Z\sim\mathcal{N}(\mu,\sigma^{2}I_{d})$ by minimizing the mean squared error $\ell(w,z)=\|w-z\|_{2}^{2}$. The empirical risk with $n$ i.i.d. samples is $L_{S}(w)=\frac{1}{n}\sum_{i=1}^{n}\|w-Z_{i}\|_{2}^{2}$. The empirical risk minimization (ERM) solution is the sample mean $W=\frac{1}{n}\sum_{i=1}^{n}Z_{i}$, which is deterministic given $S$. Its generalization error can be computed exactly as follows:
$$\mathrm{gen}(\mu,P_{W|S})=\frac{2d\sigma^{2}}{n}.$$
The bound in Lemma 1 is not applicable here due to the following two reasons: (1) $W$ is a deterministic function of $S$, and hence $I(S;W)=\infty$; and (2) since $Z$ is a Gaussian random vector, the loss function $\|w-Z\|_{2}^{2}$ is not sub-Gaussian. Specifically, the variance of the loss function diverges as $\|w\|_{2}\to\infty$, which implies that a uniform upper bound on the CGF of $\ell(w,Z)$ over all $w\in\mathcal{W}$ does not exist.
Both of these issues can be resolved by applying the ISMI bound in Theorem 2. Since $W=\frac{1}{n}\sum_{j=1}^{n}Z_{j}$ and $Z_{i}$ are jointly Gaussian, the mutual information between each individual sample and the output hypothesis can be computed exactly as follows:
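One way to carry out this computation (a sketch consistent with the setup above): each coordinate pair $(W^{(k)},Z_{i}^{(k)})$ is jointly Gaussian with $\operatorname{Var}(W^{(k)})=\operatorname{Cov}(W^{(k)},Z_{i}^{(k)})=\sigma^{2}/n$, so the squared correlation coefficient is
$$\rho^{2}=\frac{(\sigma^{2}/n)^{2}}{(\sigma^{2}/n)\,\sigma^{2}}=\frac{1}{n}.$$
Since the mutual information of a bivariate Gaussian pair with correlation $\rho$ is $-\frac{1}{2}\log(1-\rho^{2})$, and the $d$ coordinates are independent,
$$I(W;Z_{i})=\frac{d}{2}\log\frac{n}{n-1}.$$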
In addition, since $\bar W\sim\mathcal{N}(\mu,\frac{\sigma^{2}}{n}I_{d})$ is independent of $\bar Z\sim\mathcal{N}(\mu,\sigma^{2}I_{d})$, it can be shown that $\ell(\bar W,\bar Z)=\|\bar W-\bar Z\|_{2}^{2}\sim\frac{(n+1)\sigma^{2}}{n}\chi_{d}^{2}$, where $\chi_{d}^{2}$ denotes the chi-squared distribution with $d$ degrees of freedom. Then, the CGF of $\ell(\bar W,\bar Z)$ is
$$\Lambda_{\ell(\bar W,\bar Z)}(\lambda)=-\frac{d}{2}\log\Bigl(1-\frac{2(n+1)\sigma^{2}}{n}\lambda\Bigr)-\frac{(n+1)\sigma^{2}d}{n}\lambda,\qquad \lambda<\frac{n}{2(n+1)\sigma^{2}}.$$
Since $W$ is the ERM solution, it follows that $\mathrm{gen}(\mu,P_{W|S})\ge 0$, so we only need to consider the case $\lambda\le 0$. It can be shown that
Then, combining the results in (19), we have
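The exact generalization error of the sample mean can also be checked by simulation; a minimal sketch (my own, assuming $Z\sim\mathcal{N}(0,\sigma^{2}I_{d})$), whose Monte Carlo estimate should be close to $2d\sigma^{2}/n$:

```python
import numpy as np

def gen_error_mc(n=10, d=2, sigma=1.0, trials=20_000, seed=0):
    """Monte Carlo estimate of E[L_mu(W) - L_S(W)] for the sample-mean
    learner under squared loss, with Z ~ N(0, sigma^2 I_d)."""
    rng = np.random.default_rng(seed)
    gaps = np.empty(trials)
    for k in range(trials):
        S = rng.normal(0.0, sigma, size=(n, d))      # training set Z_1..Z_n
        w = S.mean(axis=0)                           # ERM output: sample mean
        emp = np.mean(np.sum((S - w) ** 2, axis=1))  # empirical risk L_S(w)
        pop = np.sum(w ** 2) + d * sigma**2          # population risk E||Z - w||^2
        gaps[k] = pop - emp
    return float(gaps.mean())

print(gen_error_mc())   # should be close to 2 * d * sigma^2 / n = 0.4
```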
IV-B Gaussian Process
In this subsection, we revisit the example studied in [5]. Let $\mathcal{W}=\{w\in\mathbb{R}^{2}:\|w\|_{2}=1\}$, and let $Z$ be a standard normal random vector in $\mathbb{R}^{2}$. The loss function is defined to be the following Gaussian process indexed by $w\in\mathcal{W}$:
$$\ell(w,Z)\triangleq\langle w,Z\rangle.$$
Note that the loss function $\ell(w,Z)$ is sub-Gaussian with parameter $\sigma=1$ for all $w\in\mathcal{W}$. In addition, the output hypothesis $W$ can also be represented equivalently using its phase. In other words, we can let $\Phi$ be the unique number in $[0,2\pi)$ such that $W=(\cos\Phi,\sin\Phi)$. For this problem, the empirical risk of a hypothesis $w$ is given by
$$L_{S}(w)=\Bigl\langle w,\frac{1}{n}\sum_{i=1}^{n}Z_{i}\Bigr\rangle.$$
We consider two learning algorithms, which are the same as the ones in [5]. The first is the ERM algorithm:
$$W_{\mathrm{ERM}}=\operatorname*{arg\,min}_{w\in\mathcal{W}}L_{S}(w)=-\frac{\sum_{i=1}^{n}Z_{i}}{\bigl\|\sum_{i=1}^{n}Z_{i}\bigr\|_{2}}.$$
The second is the ERM algorithm with additive noise:
where the noise $\Delta$ is independent of $S$, has an atom with positive probability mass at 0, and is otherwise uniformly distributed on $[0,2\pi)$. Due to the symmetry of the problem, $\Phi_{\mathrm{ERM}}$ and $\Phi_{\mathrm{noisy}}$ are both uniformly distributed over $[0,2\pi)$.
For this example, the generalization error of $W_{\mathrm{ERM}}$ can be computed exactly as follows:
$$\mathrm{gen}(\mu,P_{W_{\mathrm{ERM}}|S})=\mathbb{E}\Bigl[\Bigl\|\frac{1}{n}\sum_{i=1}^{n}Z_{i}\Bigr\|_{2}\Bigr]=\sqrt{\frac{\pi}{2n}},$$
where the last step is due to the fact that the distribution of $\|\frac{1}{\sqrt n}\sum_{i=1}^{n}Z_{i}\|_{2}$ is Rayleigh with unit scale. For the second algorithm, since the noise $\Delta$ is independent of $S$, it follows that
The bound via $I(S;W)$ in Lemma 1 is not applicable, since $W_{\mathrm{ERM}}$ is deterministic given $S$, and hence $I(S;W_{\mathrm{ERM}})=\infty$. Moreover, $I(S;W_{\mathrm{noisy}})=\infty$ for the second algorithm as well, since the distribution of the noise $\Delta$ has a singular component at 0.
Applying the ISMI bound in Theorem 2 to the ERM algorithm , we have that
Note that given $Z_{i}=z_{i}$, the ERM solution becomes
$$W_{\mathrm{ERM}}=-\frac{z_{i}+\sum_{j\neq i}Z_{j}}{\bigl\|z_{i}+\sum_{j\neq i}Z_{j}\bigr\|_{2}},$$
which depends on the other samples $Z_{j}$, $j\neq i$. Moreover, it can be shown that the conditional distribution $P_{\Phi|Z_{i}=z_{i}}$ is equivalent to the phase distribution of a Gaussian random vector in polar coordinates. Due to symmetry, we can always rotate the polar coordinates such that $z_{i}=(\|z_{i}\|_{2},0)$, where $\|z_{i}\|_{2}$ is the Euclidean norm of $z_{i}$. Then, $P_{\Phi|Z_{i}=z_{i}}$ is a function of $\|z_{i}\|_{2}$, and can be equivalently characterized by
where $Q(\cdot)$ is the tail distribution function of the standard normal distribution. Since the norm of a Gaussian random vector with unit variance per coordinate has a Rayleigh distribution, it then follows that
Applying Theorem 2, we obtain
Similarly, we can compute the ISMI bound for the noisy ERM algorithm.
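The Rayleigh fact used in this subsection is easy to verify numerically (a quick check, not from the paper): the norm of a standard normal vector in $\mathbb{R}^{2}$ is Rayleigh-distributed with mean $\sqrt{\pi/2}$.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal((1_000_000, 2))   # standard normal vectors in R^2
norms = np.linalg.norm(z, axis=1)         # Rayleigh(1)-distributed norms

print(norms.mean())                       # close to sqrt(pi / 2) ~ 1.2533
```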
Numerical comparisons are presented in Fig. 1 and Fig. 2. In both figures, we plot the ISMI bound, the CMI bound in [5], and the true value of the generalization error, as functions of the number of samples $n$. In Fig. 1, we compare these bounds for the ERM solution $W_{\mathrm{ERM}}$; note that the CMI bound reduces to the classical chaining bound in this case. In Fig. 2, we evaluate these bounds for the noisy algorithm. Both figures demonstrate that the ISMI bound is closer to the true value of the generalization error, and significantly outperforms the CMI bound.
V Noisy, Iterative Algorithms
In this section, we apply the ISMI bound in Theorem 2 to a class of noisy, iterative algorithms, specifically, stochastic gradient Langevin dynamics (SGLD).
V-A SGLD Algorithm
Denote the parameter vector at iteration $t$ by $W_{t}$, and let $W_{0}$ denote an arbitrary initialization. At each iteration $t$, we sample a training data point $Z_{U_{t}}$, where $U_{t}$ denotes the random index of the sample selected at iteration $t$, and compute the gradient $\nabla_{w}\ell(W_{t-1},Z_{U_{t}})$. We then scale the gradient by a step size $\eta_{t}$ and perturb it by isotropic Gaussian noise $\xi_{t}\sim\mathcal{N}(0,\sigma_{t}^{2}I_{d})$. The overall update rule is as follows [10]:
$$W_{t}=W_{t-1}-\eta_{t}\nabla_{w}\ell(W_{t-1},Z_{U_{t}})+\xi_{t},$$
where $\sigma_{t}$ controls the variance of the Gaussian noise.
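The update rule above can be sketched in code; a minimal illustration (the function name and signature are my own, not from the paper):

```python
import numpy as np

def sgld_step(w, grad, eta, sigma, rng):
    """One SGLD update: take a gradient step of size eta, then perturb by
    isotropic Gaussian noise with standard deviation sigma (the sigma_t
    controlling the noise variance in the update rule)."""
    return w - eta * grad + sigma * rng.standard_normal(w.shape)

# Toy usage: one step from the origin with a constant gradient.
rng = np.random.default_rng(0)
w1 = sgld_step(np.zeros(3), np.ones(3), eta=0.1, sigma=0.01, rng=rng)
```

With `sigma = 0` the step reduces to plain gradient descent, which makes the role of the noise term explicit.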
For , let and . We assume that the training process takes $K$ epochs. During the $k$-th training epoch, i.e., from the $((k-1)n+1)$-th to the $(kn)$-th iteration, all training samples in $S$ are used exactly once. The total number of iterations is thus $T=Kn$. The output of the algorithm is $W_{T}$.
In the following, we use the same assumptions as in [11].
$\ell(w,Z)$ is $\sigma$-sub-Gaussian with respect to $Z\sim\mu$, for every $w\in\mathcal{W}$.
The gradients are bounded, i.e., $\sup_{w\in\mathcal{W},z\in\mathcal{Z}}\|\nabla_{w}\ell(w,z)\|_{2}\le L$, for some $L>0$.
[11, Corollary 1] The generalization error of the SGLD algorithm is bounded by
V-B ISMI Bound for SGLD
To apply the ISMI bound to SGLD, we modify the result in Theorem 2 by conditioning on the random sample path $U^{T}\triangleq(U_{1},\dots,U_{T})$,
where $\mathcal{U}$ denotes the set of all possible sample paths.
Let $\mathcal{T}_{i}$ denote the set of iterations at which sample $Z_{i}$ is selected for a given sample path $u^{T}$. Using the chain rule of mutual information, we have
where the last equality is due to the fact that given $W_{t-1}$ and $U^{T}$, the iterate $W_{t}$ is independent of $Z_{i}$ if $t\notin\mathcal{T}_{i}$. For $t\in\mathcal{T}_{i}$, i.e., if $Z_{i}$ is selected at iteration $t$, we have
Combining with (V-B), it follows that
where we remove the term by using .
As in [11], we set , and . Then,
where follows from the sampling scheme that all samples are used exactly once in each epoch; is due to the fact that ; and follows by computing the integral .
Comparing with the bound in [11],
it can be seen that our bound is tighter by a factor of .
-  S. Boucheron, O. Bousquet, and G. Lugosi, “Theory of classification: A survey of some recent advances,” ESAIM: probability and statistics, vol. 9, pp. 323–375, 2005.
-  O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Mach. Learn. Res., vol. 2, pp. 499–526, Mar 2002.
-  D. Russo and J. Zou, “Controlling bias in adaptive data analysis using information theory,” in Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, pp. 1232–1240.
-  A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2017, pp. 2524–2533.
-  A. Asadi, E. Abbe, and S. Verdu, “Chaining mutual information and tightening generalization bounds,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2018, pp. 7245–7254.
-  S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability and uniform convergence,” J. Mach. Learn. Res., vol. 11, pp. 2635–2670, Oct 2010.
-  M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of stability and bias of learning algorithms,” in Proc. Information Theory Workshop (ITW), 2016, pp. 26–30.
-  S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence, Oxford University Press, 2013.
-  J. Jiao, Y. Han, and T. Weissman, “Dependence measures bounding the exploration bias for general measurements,” in Proc. IEEE Int. Symp. Information Theory (ISIT), 2017, pp. 1475–1479.
-  M. Welling and Y. Teh, “Bayesian learning via stochastic gradient Langevin dynamics,” in Proc. International Conference on Machine Learning (ICML), 2011, pp. 681–688.
-  A. Pensia, V. Jog, and P. Loh, “Generalization error bounds for noisy, iterative algorithms,” in Proc. IEEE Int. Symp. Information Theory (ISIT), June 2018, pp. 546–550.