I Introduction
Consider an instance space $\mathcal{Z}$, a continuous hypothesis space $\mathcal{W}$, and a nonnegative loss function $\ell:\mathcal{W}\times\mathcal{Z}\to\mathbb{R}^{+}$. A training dataset $S=\{Z_1,\dots,Z_n\}$ consists of $n$ i.i.d. samples drawn from an unknown distribution $\mu$. The goal of a supervised learning algorithm is to find an output hypothesis $w\in\mathcal{W}$ that minimizes the population risk:
$L_\mu(w) \triangleq \mathbb{E}_{Z\sim\mu}[\ell(w,Z)].$ (1)
In practice, $\mu$ is unknown, and thus $L_\mu(w)$ cannot be computed directly. Instead, the empirical risk of $w$ on a training dataset $S$ is studied, which is defined as
$L_S(w) \triangleq \frac{1}{n}\sum_{i=1}^{n}\ell(w,Z_i).$ (2)
A learning algorithm can be characterized by a randomized mapping from the training dataset $S$ to a hypothesis $W\in\mathcal{W}$ according to a conditional distribution $P_{W|S}$. The generalization error of a supervised learning algorithm is the expected difference between the population risk of the output hypothesis and its empirical risk on the training dataset:
$\mathrm{gen}(\mu,P_{W|S}) \triangleq \mathbb{E}\big[L_\mu(W) - L_S(W)\big],$ (3)
where the expectation is taken over the joint distribution $P_{S,W} = \mu^{\otimes n}\,P_{W|S}$. The generalization error measures the extent to which the learning algorithm overfits the training data.

Traditional ways of bounding the generalization error can be categorized into two groups: (1) by measuring the complexity of the hypothesis space $\mathcal{W}$, e.g., VC dimension and Rademacher complexity [1]; and (2) by exploring properties of the learning algorithm, e.g., uniform stability [2]. Recently, it was proposed in [3] and further studied in [4] and [5] that the metric of mutual information can be used to develop upper bounds on the generalization error of a learning algorithm. Such an information-theoretic framework can handle a broader range of problems, e.g., problems with unbounded loss functions. More importantly, it offers an information-theoretic point of view on how to improve the generalization capability of a learning algorithm.
In this paper, we follow the information-theoretic framework of [3, 4, 5]. Our main contribution is a tighter upper bound on the generalization error based on the mutual information $I(W;Z_i)$ between an individual training sample $Z_i$ and the output hypothesis $W$ of the learning algorithm. We show that, compared to existing studies, our bound is applicable to a broader range of problems and can be considerably tighter.
I-A Main Contributions and Comparison to Related Works
The following lemma from [4] provides an upper bound on the generalization error using the mutual information $I(S;W)$ between the training dataset $S$ and the output hypothesis $W$.
Lemma 1.
[4, Theorem 1] Suppose $\ell(w,Z)$ is $\sigma$-sub-Gaussian under $\mu$ for all $w\in\mathcal{W}$. Then,
$\big|\mathrm{gen}(\mu,P_{W|S})\big| \le \sqrt{\frac{2\sigma^2}{n}\,I(S;W)}.$ (4)
This mutual information based bound in (4) is related to on-average stability [6], and quantifies the overall dependence between the output of the learning algorithm and its input dataset via $I(S;W)$. By further exploiting the structure of the hypothesis space and the dependence between the algorithm input and output, the authors of [5] combined the chaining and mutual information methods, and obtained a tighter bound on the generalization error.
However, the bound in Lemma 1 and the chaining mutual information (CMI) bound in [5] both suffer from two shortcomings. First, for empirical risk minimization (ERM), if $W$ is the unique minimizer of $L_S(w)$ in $\mathcal{W}$, then $W$ is a deterministic function of $S$, and the mutual information $I(S;W)$ can be infinite when $W$ is continuous. Neither bound is tight in this case. Second, both bounds assume that $\ell(w,Z)$ has a bounded cumulant generating function (CGF) under $\mu$ for all $w\in\mathcal{W}$, which may not hold for many problems.
In this paper, we get around these shortcomings by combining the idea of algorithmic stability [6, 7] with the information-theoretic framework. Specifically, an algorithm is stable if the output hypothesis does not change too much with the replacement of any individual training sample, and a stable algorithm generalizes well [6, 7]. Motivated by these facts, we tighten the mutual information based generalization error bound by considering the individual sample mutual information (ISMI) $I(W;Z_i)$. Compared with the bound in Lemma 1 and the CMI bound in [5], the ISMI bound requires a weaker condition on the CGF of the loss function, is applicable to a broader range of problems, and provides a tighter characterization of the generalization error. We also study three examples in detail, and compare the ISMI bound with existing results to demonstrate its advantages.
II Preliminaries
We use upper-case letters to denote random variables, and calligraphic upper-case letters to denote sets. For a random variable $X$ generated from a distribution $\mu$, we use $\mathbb{E}_\mu[X]$ to denote the expectation of $X$ taken with respect to $\mu$. We write $I_d$ to denote the $d$-dimensional identity matrix. All logarithms are natural ones.
The cumulant generating function (CGF) of a random variable $X$ is defined as $\Lambda_X(\lambda) \triangleq \log\mathbb{E}\big[e^{\lambda(X-\mathbb{E}X)}\big]$. It can be verified that $\Lambda_X(0)=\Lambda_X'(0)=0$, and that $\Lambda_X(\lambda)$ is convex whenever it exists.
Definition 1.
For a convex function $\psi$ defined on the interval $[0,b)$, where $0<b\le\infty$, its Legendre dual $\psi^{*}$ is defined as
$\psi^{*}(x) \triangleq \sup_{\lambda\in[0,b)}\big(\lambda x - \psi(\lambda)\big).$ (5)
The following lemma characterizes the properties of the Legendre dual and its inverse function.
Lemma 2.
[8, Lemma 2.4] Assume that $\psi(0)=\psi'(0)=0$. Then $\psi^{*}$ defined above is a nonnegative, convex, and nondecreasing function on $[0,\infty)$ with $\psi^{*}(0)=0$. Moreover, its inverse function $\psi^{*-1}(y) \triangleq \inf\{x\ge 0:\ \psi^{*}(x)\ge y\}$ is concave, and can be written as
$\psi^{*-1}(y) = \inf_{\lambda\in[0,b)} \frac{y+\psi(\lambda)}{\lambda}.$ (6)
For a $\sigma$-sub-Gaussian random variable $X$, let $\psi(\lambda) = \frac{\sigma^2\lambda^2}{2}$; then, by Lemma 2, $\psi^{*-1}(y) = \sqrt{2\sigma^2 y}$.
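This sub-Gaussian special case can be checked directly from (6); a short derivation, taking $\psi(\lambda)=\sigma^2\lambda^2/2$ on $[0,\infty)$:

```latex
\psi^{*-1}(y)
  = \inf_{\lambda > 0} \frac{y + \sigma^2\lambda^2/2}{\lambda}
  = \inf_{\lambda > 0} \left( \frac{y}{\lambda} + \frac{\sigma^2\lambda}{2} \right)
  = \sqrt{2\sigma^2 y},
```

where the infimum is attained at $\lambda = \sqrt{2y/\sigma^2}$, by the AM--GM inequality.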
III Bounding Generalization Error via $I(W;Z_i)$
In this section, we first generalize the decoupling lemma in [4, Lemma 1] to a more general setting, and then tighten the bound on the generalization error via $I(W;Z_i)$.
III-A General Decoupling Estimate
Consider a pair of random variables $X$ and $Y$ with joint distribution $P_{X,Y}$. Let $\bar X$ be an independent copy of $X$, and $\bar Y$ be an independent copy of $Y$, such that $P_{\bar X,\bar Y} = P_X \otimes P_Y$. Suppose $f:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ is a real-valued function. If the CGF of $f(\bar X,\bar Y)$ is upper bounded, we have the following theorem.
Theorem 1.
Assume that $\Lambda_{f(\bar X,\bar Y)}(\lambda) \le \psi_+(\lambda)$ for $\lambda\in[0,b_+)$, and $\Lambda_{f(\bar X,\bar Y)}(\lambda) \le \psi_-(-\lambda)$ for $\lambda\in(b_-,0]$ under distribution $P_{\bar X,\bar Y}$, where $0<b_+\le\infty$ and $-\infty\le b_-<0$. Suppose that $\psi_+$ and $\psi_-$ are convex, and $\psi_+(0)=\psi_-(0)=\psi_+'(0)=\psi_-'(0)=0$. Then,
$\mathbb{E}[f(X,Y)] - \mathbb{E}[f(\bar X,\bar Y)] \le \psi_+^{*-1}\big(I(X;Y)\big),$ (7)
$\mathbb{E}[f(\bar X,\bar Y)] - \mathbb{E}[f(X,Y)] \le \psi_-^{*-1}\big(I(X;Y)\big).$ (8)
Proof.
Consider the Donsker–Varadhan variational representation of the relative entropy between two probability measures $P$ and $Q$ defined on a common space:
$D(P\|Q) = \sup_{g}\big\{\mathbb{E}_P[g] - \log\mathbb{E}_Q[e^{g}]\big\},$ (9)
where the supremum is over all measurable functions $g$ such that $\mathbb{E}_Q[e^{g}]<\infty$, and the equality is achieved when $g = \log\frac{dP}{dQ}$. It then follows that, for $\lambda\in[0,b_+)$,
$I(X;Y) \ge \mathbb{E}[\lambda f(X,Y)] - \log\mathbb{E}\big[e^{\lambda f(\bar X,\bar Y)}\big] \ge \lambda\big(\mathbb{E}[f(X,Y)] - \mathbb{E}[f(\bar X,\bar Y)]\big) - \psi_+(\lambda),$ (10)
where the last inequality follows from the assumption that
$\log\mathbb{E}\big[e^{\lambda(f(\bar X,\bar Y) - \mathbb{E}[f(\bar X,\bar Y)])}\big] \le \psi_+(\lambda), \quad \lambda\in[0,b_+).$ (11)
Dividing by $\lambda$ and taking the infimum over $\lambda\in(0,b_+)$, Lemma 2 yields (7). Similarly, for $\theta\in[0,-b_-)$, it follows that
$I(X;Y) \ge -\theta\big(\mathbb{E}[f(X,Y)] - \mathbb{E}[f(\bar X,\bar Y)]\big) - \psi_-(\theta),$ (12)
which, together with Lemma 2, yields (8). ∎
III-B Individual Sample Mutual Information Bound
Motivated by the idea of algorithmic stability, which measures how much an output hypothesis changes with the replacement of an individual training sample, we construct an upper bound on the generalization error via the individual sample mutual information $I(W;Z_i)$.
Theorem 2.
Suppose $\ell(\bar W,\bar Z)$ satisfies $\Lambda_{\ell(\bar W,\bar Z)}(\lambda) \le \psi_+(\lambda)$ for $\lambda\in[0,b_+)$, and $\Lambda_{\ell(\bar W,\bar Z)}(\lambda) \le \psi_-(-\lambda)$ for $\lambda\in(b_-,0]$, under $P_{\bar W,\bar Z} = P_W\otimes\mu$, where $0<b_+\le\infty$ and $-\infty\le b_-<0$. Then,
$\mathrm{gen}(\mu,P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n}\psi_-^{*-1}\big(I(W;Z_i)\big),$ (15)
$-\mathrm{gen}(\mu,P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n}\psi_+^{*-1}\big(I(W;Z_i)\big).$ (16)
Proof.
The generalization error can be written as follows:
$\mathrm{gen}(\mu,P_{W|S}) = \mathbb{E}[L_\mu(W)] - \mathbb{E}[L_S(W)]$ (13)
$\; = \frac{1}{n}\sum_{i=1}^{n}\big(\mathbb{E}[\ell(\bar W,\bar Z)] - \mathbb{E}[\ell(W,Z_i)]\big),$ (14)
where $W$ and $Z_i$ in the second term are dependent with joint distribution $P_{W,Z_i}$, and $\bar W$ and $\bar Z$ in the first term are independent with the same marginal distributions $P_W$ and $\mu$. Applying Theorem 1 to each term completes the proof. ∎
The following proposition shows that the ISMI bound is always tighter than the bound in Lemma 1.
Proposition 1.
Suppose $\ell(w,Z)$ is $\sigma$-sub-Gaussian under $\mu$ for all $w\in\mathcal{W}$; then
$\big|\mathrm{gen}(\mu,P_{W|S})\big| \le \frac{1}{n}\sum_{i=1}^{n}\sqrt{2\sigma^2 I(W;Z_i)} \le \sqrt{\frac{2\sigma^2}{n} I(S;W)}.$
Proof.
It is clear that if $\ell(w,Z)$ is $\sigma$-sub-Gaussian under $\mu$ for all $w\in\mathcal{W}$, then $\ell(\bar W,\bar Z)$ is also $\sigma$-sub-Gaussian. For $\sigma$-sub-Gaussian random variables, it is easy to show that $\psi_+^{*-1}(y) = \psi_-^{*-1}(y) = \sqrt{2\sigma^2 y}$. The first inequality then follows from Theorem 2.
For the second part, by the chain rule of mutual information,
$I(S;W) = \sum_{i=1}^{n} I(Z_i;W\mid Z^{i-1}) = \sum_{i=1}^{n}\big(I(Z_i;W,Z^{i-1}) - I(Z_i;Z^{i-1})\big) \ge \sum_{i=1}^{n} I(W;Z_i),$ (17)
where $Z^{i-1} \triangleq (Z_1,\dots,Z_{i-1})$, and the last step follows by the fact that $Z_i$ and $Z^{i-1}$ are independent. Applying Jensen's inequality completes the proof. ∎
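As a numerical illustration of Proposition 1, consider a toy construction (ours, not an example from the paper) in which both mutual informations have closed forms: scalar $Z_i\sim\mathcal{N}(0,1)$ i.i.d., and $W$ equal to the sample mean plus independent $\mathcal{N}(0,\tau^2)$ noise; the sub-Gaussian parameter $\sigma$ below is a placeholder.

```python
import math

n, tau2 = 50, 0.01
# W = (1/n) * sum(Z_i) + N(0, tau2): jointly Gaussian with each Z_i.
rho2 = 1.0 / (n * (1.0 + n * tau2))             # squared correlation of (W, Z_i)
I_i = -0.5 * math.log(1.0 - rho2)               # I(W; Z_i), exact for Gaussians
I_SW = 0.5 * math.log(1.0 + 1.0 / (n * tau2))   # I(S; W) = I(Zbar; W)

sigma = 1.0                                      # placeholder sub-Gaussian parameter
ismi = math.sqrt(2 * sigma**2 * I_i)             # (1/n) * sum of n identical terms
lemma1 = math.sqrt(2 * sigma**2 * I_SW / n)      # bound of Lemma 1
print(ismi, lemma1)
```

Here $n\,I(W;Z_i) \le I(S;W)$, exactly as (17) predicts, so the ISMI bound is the smaller of the two.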
IV Examples with Infinite $I(S;W)$
In this section, we consider two examples with infinite $I(S;W)$. We show that for these two examples, the upper bound on the generalization error in Lemma 1 blows up, whereas the ISMI bound in Theorem 2 still provides an accurate characterization.
IV-A Estimating the Mean
We first consider the problem of learning the mean $\mu$ of a Gaussian random vector $Z\sim\mathcal{N}(\mu,\sigma^2 I_d)$, which minimizes the mean square error $\ell(w,z) = \|z-w\|_2^2$. The empirical risk with $n$ i.i.d. samples is $L_S(w) = \frac{1}{n}\sum_{i=1}^{n}\|Z_i - w\|_2^2$. The empirical risk minimization (ERM) solution is the sample mean $W = \frac{1}{n}\sum_{i=1}^{n} Z_i$, which is deterministic given $S$. Its generalization error can be computed exactly as follows:
$\mathrm{gen}(\mu,P_{W|S}) = \frac{2d\sigma^2}{n}.$ (18)
The bound in Lemma 1 is not applicable here due to the following two reasons: (1) $W$ is a deterministic function of $S$, and hence $I(S;W)=\infty$; and (2) since $Z$ is a Gaussian random vector, the loss function $\ell(w,Z)=\|Z-w\|_2^2$ is not sub-Gaussian. Specifically, the variance of the loss diverges as $\|w\|_2\to\infty$, which implies that a uniform upper bound on the CGF over all $w\in\mathcal{W}$ does not exist.

Both of these issues can be resolved by applying the ISMI bound in Theorem 2. Since $Z_i$ and $W$ are jointly Gaussian, the mutual information between each individual sample and the output hypothesis can be computed exactly as follows:
$I(W;Z_i) = \frac{d}{2}\log\frac{n}{n-1}.$ (19)
In addition, since $\bar W\sim\mathcal{N}(\mu,\frac{\sigma^2}{n}I_d)$ is independent of $\bar Z\sim\mathcal{N}(\mu,\sigma^2 I_d)$, it can be shown that $\ell(\bar W,\bar Z)=\|\bar Z-\bar W\|_2^2 \sim \tilde\sigma^2\chi_d^2$, where $\tilde\sigma^2 \triangleq \frac{(n+1)\sigma^2}{n}$, and $\chi_d^2$ denotes the chi-squared distribution with $d$ degrees of freedom. Then, the CGF of $\ell(\bar W,\bar Z)$ is $\Lambda(\lambda) = -\frac{d}{2}\log(1-2\tilde\sigma^2\lambda) - d\tilde\sigma^2\lambda$ for $\lambda<\frac{1}{2\tilde\sigma^2}$. Since $W$ is the ERM solution, it follows that $\mathrm{gen}(\mu,P_{W|S})\ge 0$. We only need to consider the upper bound (15). It can be shown that
$\Lambda_{\ell(\bar W,\bar Z)}(-\lambda) \le d\tilde\sigma^4\lambda^2 \triangleq \psi_-(\lambda), \quad \lambda\ge 0,$ (20)
so that $\psi_-^{*-1}(y) = 2\tilde\sigma^2\sqrt{dy}$.
Then, combining with the result in (19), we have
$\mathrm{gen}(\mu,P_{W|S}) \le 2\tilde\sigma^2 d\sqrt{\tfrac{1}{2}\log\tfrac{n}{n-1}} = \frac{2(n+1)d\sigma^2}{n}\sqrt{\tfrac{1}{2}\log\tfrac{n}{n-1}}.$ (21)
As $n\to\infty$, the above bound scales as $O(1/\sqrt{n})$, which is typically the rate obtained when one applies bounding techniques based on the VC dimension [1] and algorithmic stability [2].
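A Monte Carlo sanity check of (18) against the closed-form bound derived above; the constants follow the reconstruction here ($\tilde\sigma^2=(n+1)\sigma^2/n$ and $\psi_-^{*-1}(y)=2\tilde\sigma^2\sqrt{dy}$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma2, trials = 2, 10, 1.0, 100_000

# Monte Carlo estimate of gen = E[L_mu(W)] - E[L_S(W)] for the sample-mean ERM,
# with mean mu = 0 without loss of generality.
S = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n, d))
W = S.mean(axis=1)                                   # ERM solution per trial
pop = sigma2 * d + np.sum(W**2, axis=1)              # E_Z ||Z - W||^2 in closed form
emp = np.mean(np.sum((S - W[:, None, :])**2, axis=2), axis=1)
gen_mc = np.mean(pop - emp)

gen_exact = 2 * d * sigma2 / n                       # eq. (18)
I_i = 0.5 * d * np.log(n / (n - 1))                  # eq. (19)
sigma_t2 = (n + 1) * sigma2 / n
ismi_bound = 2 * sigma_t2 * np.sqrt(d * I_i)         # eq. (21)
print(gen_mc, gen_exact, ismi_bound)
```

The exact generalization error decays as $1/n$ while the bound decays as $1/\sqrt{n}$, consistent with the discussion above.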
IV-B Gaussian Process
In this subsection, we revisit the example studied in [5]. Let $\mathcal{W} = \{w\in\mathbb{R}^2 : \|w\|_2 = 1\}$ be the unit circle, and let $Z$ be a standard normal random vector in $\mathbb{R}^2$. The loss function is defined to be the following Gaussian process indexed by $w$:
$\ell(w,Z) \triangleq \langle w, Z\rangle.$ (22)
Note that the loss function is sub-Gaussian with parameter $\sigma=1$ for all $w\in\mathcal{W}$. In addition, the output hypothesis $W$ can also be represented equivalently using its phase. In other words, we can let $\Theta$ be the unique number in $[0,2\pi)$ such that $W = (\cos\Theta,\sin\Theta)$. For this problem, the empirical risk of a hypothesis $w$ is given by $L_S(w) = \langle w,\bar Z\rangle$, where $\bar Z \triangleq \frac{1}{n}\sum_{i=1}^{n} Z_i$.
We consider two learning algorithms, which are the same as the ones in [5]. The first is the ERM algorithm:
$W_1 \triangleq \arg\min_{w\in\mathcal{W}} L_S(w) = -\frac{\bar Z}{\|\bar Z\|_2}.$ (23)
The second is the ERM algorithm with additive phase noise:
$\Theta_2 \triangleq \Theta_1 + \Delta\Theta \mod 2\pi,$ (24)
where the noise $\Delta\Theta$ is independent of $S$, and has an atom with probability mass $1-\epsilon$ at 0, and probability $\epsilon$ uniformly distributed on $[0,2\pi)$. Due to the symmetry of the problem, $\Theta_1$ and $\Theta_2$ are uniformly distributed over $[0,2\pi)$.
For this example, the generalization error of $W_1$ can be computed exactly as follows:
$\mathrm{gen}(\mu,P_{W_1|S}) = \mathbb{E}\big[\|\bar Z\|_2\big] = \sqrt{\frac{\pi}{2n}},$ (25)
where the last step is due to the fact that the distribution of $\|\bar Z\|_2$ is Rayleigh with parameter $1/\sqrt{n}$. For the second algorithm $W_2$, since the noise $\Delta\Theta$ is independent of $S$, it follows that
$\mathrm{gen}(\mu,P_{W_2|S}) = (1-\epsilon)\sqrt{\frac{\pi}{2n}}.$ (26)
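A quick Monte Carlo check of (25), using the closed forms as rederived here ($L_S(W_1) = -\|\bar Z\|_2$, and zero population risk for any hypothesis since $\mathbb{E}[Z]=0$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 20, 100_000

Z = rng.standard_normal((trials, n, 2))     # n standard normal samples in R^2
Zbar = Z.mean(axis=1)
# ERM over the unit circle: W1 = -Zbar/||Zbar||, so L_S(W1) = -||Zbar||.
emp_risk = -np.linalg.norm(Zbar, axis=1)
gen_mc = np.mean(0.0 - emp_risk)            # population risk of W1 is 0
gen_exact = np.sqrt(np.pi / (2 * n))        # eq. (25)
print(gen_mc, gen_exact)
```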
The bound via $I(S;W)$ in Lemma 1 is not applicable, since $W_1$ is deterministic given $S$, and $I(S;W_1)=\infty$. Moreover, for the second algorithm $W_2$,
$I(S;W_2) = \infty,$ (27)
since $P_{\Theta_2|S}$ has a singular component (the atom of $\Delta\Theta$ at 0), while $P_{\Theta_2}$ is absolutely continuous.
Applying the ISMI bound in Theorem 2 to the ERM algorithm $W_1$, and noting that the terms $I(W_1;Z_i)$ are identical for all $i$ by symmetry, we have that
$\mathrm{gen}(\mu,P_{W_1|S}) \le \sqrt{2 I(W_1;Z_1)}.$ (28)
Note that given $Z_1=z_1$, the ERM solution is
$W_1 = -\frac{z_1 + \sum_{i=2}^{n} Z_i}{\big\|z_1 + \sum_{i=2}^{n} Z_i\big\|_2},$ (29)
which depends on the other samples $Z_i$, $i=2,\dots,n$. Moreover, it can be shown that $P_{\Theta_1|Z_1}$ is equivalent to the phase distribution of a Gaussian random variable in polar coordinates. Due to symmetry, we can always rotate the polar coordinates such that $z_1 = (\|z_1\|_2, 0)$, where $\|z_1\|_2$ is the Euclidean norm of $z_1$. Then, $P_{\Theta_1|Z_1=z_1}$ depends on $z_1$ only through $\|z_1\|_2$, and can be equivalently characterized (up to a rotation) by the density
$f_{\Theta_1\mid\|Z_1\|_2=r}(\theta) = \frac{1}{2\pi}e^{-\frac{m^2}{2}} + \frac{m\cos\theta}{\sqrt{2\pi}}e^{-\frac{m^2\sin^2\theta}{2}}\,Q(-m\cos\theta), \qquad m \triangleq \frac{r}{\sqrt{n-1}},$ (30)
where $Q(\cdot)$ is the tail distribution function of the standard normal distribution. Since the norm of $Z_1$ has a Rayleigh distribution with unit variance, it then follows that
$I(W_1;Z_1) = h(\Theta_1) - h(\Theta_1\mid Z_1) = \log 2\pi - \mathbb{E}_{\|Z_1\|_2}\big[h(\Theta_1 \mid \|Z_1\|_2)\big],$ (31)
which can be evaluated numerically.
Applying Theorem 2, we obtain
$\mathrm{gen}(\mu,P_{W_1|S}) \le \sqrt{2\big(\log 2\pi - \mathbb{E}_{\|Z_1\|_2}[h(\Theta_1\mid\|Z_1\|_2)]\big)}.$ (32)
Similarly, we can compute the ISMI bound for the noisy algorithm $W_2$.
Numerical comparisons are presented in Fig. 1 and Fig. 2. In both figures, we plot the ISMI bound, the CMI bound in [5], and the true value of the generalization error, as functions of the number of samples $n$. In Fig. 1, we compare these bounds for the ERM solution $W_1$. Note that the CMI bound reduces to the classical chaining bound in this case. In Fig. 2, we evaluate these bounds for the noisy algorithm $W_2$ with $\epsilon$ fixed. Both figures demonstrate that the ISMI bound is closer to the true value of the generalization error, and outperforms the CMI bound significantly.
V Noisy, Iterative Algorithms
In this section, we apply the ISMI bound in Theorem 2 to a class of noisy, iterative algorithms, specifically, stochastic gradient Langevin dynamics (SGLD).
V-A SGLD Algorithm
Denote the parameter vector at iteration $t$ by $W_t\in\mathbb{R}^d$, and let $W_0$ denote an arbitrary initialization. At each iteration $t$, we sample a training data point $Z_{i_t}$, where $i_t$ denotes the random index of the sample selected at iteration $t$, and compute the gradient $\nabla_w\ell(W_{t-1}, Z_{i_t})$. We then scale the gradient by a step size $\eta_t$ and perturb it by isotropic Gaussian noise $\xi_t\sim\mathcal{N}(0,\sigma_t^2 I_d)$. The overall updating rule is as follows [10]:
$W_t = W_{t-1} - \eta_t\nabla_w\ell(W_{t-1}, Z_{i_t}) + \xi_t,$ (33)
where $\sigma_t^2$ controls the variance of the Gaussian noise.
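A minimal sketch of the update rule (33) with once-per-epoch sampling. The quadratic toy loss, the constant step size, and the noise level are illustrative assumptions only; the schedule analyzed later decays as $\eta_t = c/t$.

```python
import numpy as np

def sgld_epoch(w, S, grad, eta, sigma, rng):
    """One epoch of the SGLD update (33): every sample visited exactly once."""
    for i in rng.permutation(len(S)):
        w = w - eta * grad(w, S[i]) + rng.normal(0.0, sigma, size=w.shape)
    return w

# Toy setup: loss l(w, z) = ||w - z||^2 / 2, so grad(w, z) = w - z.
rng = np.random.default_rng(0)
S = rng.standard_normal((50, 2)) + 3.0       # n = 50 samples centered near (3, 3)
w = np.zeros(2)
for _ in range(20):                          # 20 epochs, T = 20 * 50 iterations
    w = sgld_epoch(w, S, lambda w, z: w - z, eta=0.05, sigma=0.01, rng=rng)
# w now fluctuates around the sample mean of S.
```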
For $0\le t\le T$, let $W^t \triangleq (W_0,\dots,W_t)$ and $i^t \triangleq (i_1,\dots,i_t)$. We assume that the training process takes $K$ epochs. For the $k$-th training epoch, i.e., from the $((k-1)n+1)$-th to the $(kn)$-th iteration, all training samples in $S$ are used exactly once. The total number of iterations is $T = Kn$. The output of the algorithm is $W = W_T$.
In the following, we use the same assumptions as in [11].
Assumption 1.
$\ell(w,Z)$ is $\sigma$-sub-Gaussian with respect to $Z\sim\mu$, for every $w\in\mathbb{R}^d$.
Assumption 2.
The gradients are bounded, i.e., $\sup_{w,z}\|\nabla_w\ell(w,z)\|_2 \le L$, for some $L>0$.
Lemma 3.
[11, Corollary 1] The generalization error of the SGLD algorithm is bounded by
$\big|\mathrm{gen}(\mu,P_{W|S})\big| \le \sqrt{\frac{\sigma^2}{n}\sum_{t=1}^{T}\frac{\eta_t^2 L^2}{\sigma_t^2}}.$ (34)
V-B ISMI Bound for SGLD
To apply the ISMI bound to SGLD, we modify the result in Theorem 2 by conditioning on the random sample path $i^T$:
$\mathrm{gen}(\mu,P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n}\sum_{\iota^T\in\mathcal{I}} P(i^T=\iota^T)\sqrt{2\sigma^2 I(W;Z_i\mid i^T=\iota^T)},$ (35)
where $\mathcal{I}$ denotes the set of all possible sample paths.
Let $\mathcal{T}_i \triangleq \{t: i_t = i\}$ denote the set of iterations for which sample $Z_i$ is selected, for a given sample path $\iota^T$. Using the chain rule of mutual information, we have
$I(W_T;Z_i\mid\iota^T) \le I(W^T;Z_i\mid\iota^T) = \sum_{t=1}^{T} I(W_t;Z_i\mid W^{t-1},\iota^T) = \sum_{t\in\mathcal{T}_i} I(W_t;Z_i\mid W^{t-1},\iota^T),$ (36)
where the last equality is due to the fact that, given $W_{t-1}$ and $\iota^T$, $W_t$ is independent of $Z_i$ if $t\notin\mathcal{T}_i$. For $t\in\mathcal{T}_i$, i.e., if $Z_i$ is selected at iteration $t$, we have
$I(W_t;Z_i\mid W^{t-1},\iota^T) \le \frac{\eta_t^2 L^2}{2\sigma_t^2},$ (37)
where the last step follows from Assumption 2 and the fact that $\xi_t$ is an independent Gaussian noise, as in [11].
V-C Discussion
As in [11], we set $\eta_t = \frac{c}{t}$ and $\sigma_t^2 = \eta_t$. Then,
$\sum_{t\in\mathcal{T}_i}\frac{\eta_t^2 L^2}{2\sigma_t^2} = \frac{cL^2}{2}\sum_{t\in\mathcal{T}_i}\frac{1}{t} \overset{(a)}{\le} \frac{cL^2}{2}\sum_{k=1}^{K}\frac{1}{(k-1)n+1} \overset{(b)}{\le} \frac{cL^2}{2}\Big(1+\frac{1}{n}\sum_{k=2}^{K}\frac{1}{k-1}\Big) \overset{(c)}{\le} \frac{cL^2}{2}\Big(1+\frac{1+\log(K-1)}{n}\Big),$
where (a) follows from the sampling scheme that all samples are used exactly once in each epoch, so that $\mathcal{T}_i$ contains exactly one iteration $t\ge(k-1)n+1$ per epoch $k$; (b) is due to the fact that $(k-1)n+1 \ge (k-1)n$ for $k\ge 2$; and (c) follows by comparing the sum with the integral $\int_1^{K-1}\frac{dx}{x} = \log(K-1)$.
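The epoch-wise cap in the display above can be checked numerically; a small script, assuming the worst case in which sample $i$ is drawn first in every epoch (so $t=(k-1)n+1$ for $k=1,\dots,K$):

```python
import math

def epoch_sum_worst(n, K):
    # Worst case of sum over t in T_i of 1/t: t = (k-1)*n + 1 for k = 1..K.
    return sum(1.0 / ((k - 1) * n + 1) for k in range(1, K + 1))

def cap(n, K):
    # Closed-form cap from the display above (valid for K >= 2).
    return 1.0 + (1.0 + math.log(K - 1)) / n

checks = [(10, 2), (10, 50), (100, 1000)]
ok = all(epoch_sum_worst(n, K) <= cap(n, K) for n, K in checks)
print(ok)  # True
```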
References
 [1] S. Boucheron, O. Bousquet, and G. Lugosi, “Theory of classification: A survey of some recent advances,” ESAIM: probability and statistics, vol. 9, pp. 323–375, 2005.
 [2] O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Mach. Learn. Res., vol. 2, pp. 499–526, Mar 2002.
 [3] D. Russo and J. Zou, “Controlling bias in adaptive data analysis using information theory,” in Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, pp. 1232–1240.
 [4] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2017, pp. 2524–2533.
 [5] A. Asadi, E. Abbe, and S. Verdu, “Chaining mutual information and tightening generalization bounds,” in Proc. Advances in Neural Information Processing Systems (NIPS), 2018, pp. 7245–7254.
 [6] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability and uniform convergence,” J. Mach. Learn. Res., vol. 11, pp. 2635–2670, Oct 2010.
 [7] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of stability and bias of learning algorithms,” in Proc. Information Theory Workshop (ITW), 2016, pp. 26–30.
 [8] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, Oxford University Press, 2013.
 [9] J. Jiao, Y. Han, and T. Weissman, “Dependence measures bounding the exploration bias for general measurements,” in Proc. IEEE Int. Symp. Information Theory (ISIT), 2017, pp. 1475–1479.

 [10] M. Welling and Y. Teh, “Bayesian learning via stochastic gradient Langevin dynamics,” in Proc. International Conference on Machine Learning (ICML), 2011, pp. 681–688.
 [11] A. Pensia, V. Jog, and P. Loh, “Generalization error bounds for noisy, iterative algorithms,” in Proc. IEEE Int. Symp. Information Theory (ISIT), June 2018, pp. 546–550.