Bounding the generalization error of learning algorithms is of fundamental importance in statistical machine learning. The conventional approach is to bound it using a quantity related to the hypothesis class, such as the VC-dimension, and such bounds are therefore oblivious to the learning algorithm and the data distribution. The resulting bounds are usually rather conservative and cannot fully explain the recent success of deep learning. Recently, information theoretic approaches that jointly take into consideration the hypothesis class, the learning algorithm, and the data distribution have drawn considerable attention [2, 3, 4, 5, 6, 7, 8, 9, 10, 11].
The effort of deriving generalization error bounds using information theoretic approaches was perhaps first initiated in  and . The bound was further tightened in  by decomposing the error and bounding each term individually. Steinke and Zakynthinou  proposed a conditional mutual information (CMI) based bound, by introducing a dependence structure which resembles that in the analysis of the Rademacher complexity . Combining the error decomposition idea of  with the CMI bound in , Haghifam et al.  subsequently provided a sharpened bound based on conditional individual mutual information (CIMI).
In this work, we propose a new generalization error bound, which is also based on a combination of the error decomposition technique and the CMI construction. This new bound is motivated by the observation that in a simple Gaussian setting, the CIMI bound in  (as well as the CMI bound in ) is of constant order, while the bound in  is of order $O(1/\sqrt{n})$, where $n$ is the number of training samples. We further observe that the conditioning term in CIMI is the same as in CMI, and it tends to reveal too much information, which makes the bounds loose. The proposed new bound is thus obtained by conditioning the mutual information on an individual sample (pair), which we refer to as the individually conditional individual mutual information (ICIMI) bound. In order to establish the new bound, we introduce a new conditional decoupling lemma. This lemma allows us to view the bounds in [8, 9, 10, 11] and the new bound in a unified manner, which not only yields a dichotomy of these bounds, but also makes possible a meaningful comparison among them. Finally, we show that in the Gaussian setting mentioned earlier, the proposed new bound is also of order $O(1/\sqrt{n})$, with an improved leading constant compared to the bound in .
We study the classic supervised learning setting. Denote the data domain as $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the feature domain and $\mathcal{Y}$ is the label set. The parametric hypothesis class is denoted as $\{f_w : w \in \mathcal{W}\}$, where $\mathcal{W}$ is the parameter space. During training, the learning algorithm (learner) has access to a sequence of training samples $Z^n = (Z_1, \ldots, Z_n)$, where each $Z_i$ is drawn independently from $\mathcal{Z}$ following some unknown probability distribution $\mu$. The learner can be represented by $P_{W|Z^n}$, which is a kernel (channel) that (randomly) maps $Z^n$ to a parameter $W \in \mathcal{W}$.
To complete the classification or regression task, the learner in principle would choose a hypothesis $f_W$ to minimize the following population loss, under a given loss function $\ell: \mathcal{W} \times \mathcal{Z} \rightarrow \mathbb{R}$:
$$L_\mu(w) \triangleq \mathbb{E}_{Z \sim \mu}\big[\ell(w, Z)\big].$$
However, since only a training data vector $Z^n$ is available, the empirical loss of $w$ is usually computed (and minimized during training), which is given as
$$L_{Z^n}(w) \triangleq \frac{1}{n} \sum_{i=1}^{n} \ell(w, Z_i).$$
The expected generalization error of the learner is
$$\mathrm{gen}(\mu, P_{W|Z^n}) \triangleq \mathbb{E}\big[L_\mu(W) - L_{Z^n}(W)\big],$$
where the expectation is taken over the joint distribution of $(W, Z^n)$. This quantity captures the learner's expected overfitting error due to limited training data, which we shall study in this work.
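To make this quantity concrete, the expected generalization error of a simple learner can be estimated by Monte Carlo simulation. The sketch below is illustrative (the helper name and parameter values are ours, not from the text); it uses the averaging learner under squared loss, for which the population loss at a fixed $w$ has the closed form $\sigma^2 + (w - \mu)^2$, and whose exact generalization error of $2\sigma^2/n$ is revisited in Section IV-A:

```python
import random

def gen_error_estimate(n=10, sigma=1.0, mu=2.0, trials=100_000, seed=0):
    """Monte Carlo estimate of E[L_mu(W) - L_{Z^n}(W)] for the learner
    W = (1/n) * sum(Z_i) under the squared loss (w - z)^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        z = [rng.gauss(mu, sigma) for _ in range(n)]
        w = sum(z) / n
        pop = sigma**2 + (w - mu)**2              # exact population loss at w
        emp = sum((w - zi)**2 for zi in z) / n    # empirical loss at w
        total += pop - emp
    return total / trials

# For this learner the exact generalization error is 2*sigma^2/n = 0.2.
```

With the default parameters the estimate concentrates around $2\sigma^2/n = 0.2$, matching the exact calculation discussed later.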
III. Review of Related Results
In this section, we briefly review a few information theoretic bounds on the generalization error relevant to this work. A more thorough discussion of their relations is deferred to Sections IV-D and IV-E, after a unified framework is given.
III-A Mutual information based bounds
Theorem 1 (MI Bound ).
Suppose $\ell(w, Z)$ is $\sigma$-sub-Gaussian under $Z \sim \mu$ for all $w \in \mathcal{W}$, then
$$\big|\mathrm{gen}(\mu, P_{W|Z^n})\big| \le \sqrt{\frac{2\sigma^2}{n} I(W; Z^n)}.$$
The generalization error can be written in two ways:
$$\mathrm{gen}(\mu, P_{W|Z^n}) = \mathbb{E}\big[L_{\bar Z^n}(\bar W)\big] - \mathbb{E}\big[L_{Z^n}(W)\big] \quad (5)$$
$$= \frac{1}{n} \sum_{i=1}^{n} \Big( \mathbb{E}\big[\ell(\bar W, \bar Z_i)\big] - \mathbb{E}\big[\ell(W, Z_i)\big] \Big), \quad (6)$$
where $\bar W$ and $\bar Z^n$ are independent random variables that have the same marginal distributions as $W$ and $Z^n$, respectively. Instead of bounding the difference (5) as in , Bu et al.  bounded each individual difference in (6) and derived an individual mutual information (IMI) based bound. Furthermore, the following inverse Fenchel conjugate function was utilized to obtain a tightened bound. For any random variable $X$, its cumulant generating function is
$$\Lambda_X(\lambda) \triangleq \ln \mathbb{E}\big[e^{\lambda (X - \mathbb{E}[X])}\big],$$
and the inverse of its Fenchel conjugate is given as
$$\Lambda_X^{*-1}(y) \triangleq \inf_{\lambda > 0} \frac{y + \Lambda_X(\lambda)}{\lambda}.$$
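As a sanity check on this definition, the infimum can be evaluated numerically on a grid and compared with the closed form $\Lambda^{*-1}(y) = \sqrt{2\sigma^2 y}$, which holds when $\Lambda(\lambda) = \lambda^2\sigma^2/2$ (e.g., for a Gaussian, hence $\sigma$-sub-Gaussian, variable). The code is an illustrative sketch with names of our choosing, not part of the original text:

```python
import math

def inv_fenchel_conjugate(cgf, y, lam_grid):
    """Numerically evaluate inf_{lam > 0} (y + cgf(lam)) / lam over a grid."""
    return min((y + cgf(lam)) / lam for lam in lam_grid)

sigma = 1.5
gauss_cgf = lambda lam: lam**2 * sigma**2 / 2   # CGF of a centered N(0, sigma^2)

y = 0.3
grid = [k / 1000 for k in range(1, 20_000)]     # lambda in (0, 20)
numeric = inv_fenchel_conjugate(gauss_cgf, y, grid)
closed_form = math.sqrt(2 * sigma**2 * y)       # known value for the Gaussian CGF
```

The minimizer $\lambda^* = \sqrt{2y}/\sigma$ lies well inside the grid, so the grid minimum matches the closed form to grid resolution.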
The tightened bound is summarized in the following theorem.
Theorem 2 (IMI Bound ).
Suppose $\bar\Lambda_i(\lambda)$ is an upper bound of $\Lambda_{\ell(\bar W, \bar Z_i)}(\lambda)$, then
$$\big|\mathrm{gen}(\mu, P_{W|Z^n})\big| \le \frac{1}{n} \sum_{i=1}^{n} \bar\Lambda_i^{*-1}\big(I(W; Z_i)\big),$$
where $\bar W$ and $\bar Z_i$ are independent random variables that have the same marginal distributions as $W$ and $Z_i$, respectively.
III-B Conditional mutual information based bounds
Steinke and Zakynthinou  recently introduced a novel bounding approach. In their approach, $\tilde Z$ is a $2 \times n$ table of samples in which each entry $\tilde Z_{j,i}$, for $j \in \{1, 2\}$ and $i \in \{1, \ldots, n\}$, is independently drawn following $\mu$. The training vector $Z^n = \tilde Z_S$ is selected from the table $\tilde Z$ by a vector $S = (S_1, \ldots, S_n)$, where the $S_i$'s are independent Rademacher random variables, i.e., each $S_i$ takes the value $1$ or $2$ equally likely. The vector $S$ essentially selects one sample from each column in the table, which partitions $\tilde Z$ into a training vector $\tilde Z_S$ and a testing vector $\tilde Z_{\bar S}$, where $\bar S$ denotes the complementary selection. For simplicity, we shall write $\tilde Z_S$ and $\tilde Z_{\bar S}$ as the training and testing vectors when the meaning is clear from the context.
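The supersample construction can be rendered in a few lines of code; the sketch below is illustrative (we encode $S_i$ as $0/1$ instead of $1/2$, and the function name is ours):

```python
import random

def draw_supersample(n, sampler, rng):
    """Draw the 2 x n table of i.i.d. samples and the selector S, then
    split the table into a training vector and a testing (ghost) vector."""
    table = [[sampler(rng) for _ in range(n)] for _ in range(2)]
    s = [rng.randrange(2) for _ in range(n)]        # S_i uniform on {0, 1}
    train = [table[s[i]][i] for i in range(n)]      # selected samples
    test = [table[1 - s[i]][i] for i in range(n)]   # complementary samples
    return table, s, train, test

rng = random.Random(1)
table, s, train, test = draw_supersample(5, lambda r: r.gauss(0, 1), rng)
```

Each column of the table contributes exactly one training sample and one testing sample, as in the construction above.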
With the structure given above, the expected generalization error of the algorithm can be written as
$$\mathrm{gen}(\mu, P_{W|Z^n}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\big[\ell(W, \tilde Z_{\bar S_i, i}) - \ell(W, \tilde Z_{S_i, i})\big]. \quad (10)$$
Steinke and Zakynthinou obtained the following conditional mutual information (CMI) based result.
Theorem 3 (CMI Bound ).
Suppose $\ell(w, z) \in [a, b]$ for any $w \in \mathcal{W}$ and $z \in \mathcal{Z}$, then
$$\big|\mathrm{gen}(\mu, P_{W|Z^n})\big| \le \sqrt{\frac{2(b-a)^2}{n} I(W; S \mid \tilde Z)}, \quad (11)$$
where the entries of $\tilde Z$ are independent samples distributed following $\mu$.
Since $S$ is binary-valued, the conditional mutual information $I(W; S \mid \tilde Z) \le n \ln 2$ is always bounded; in contrast, mutual information based bounds (i.e., the MI and IMI bounds) can be unbounded, particularly when the random variables involved are continuous.
Motivated by the results in , Haghifam et al.  proposed a sharpened bound by similarly bounding each term in (10). Moreover, they provided a conditional individual mutual information (CIMI) based bound represented by the sample-conditioned mutual information, which is defined as
$$I^{\tilde Z}(W; S_i) \triangleq D\big(P_{W, S_i \mid \tilde Z} \,\big\|\, P_{W \mid \tilde Z} \otimes P_{S_i \mid \tilde Z}\big).$$
Clearly $I^{\tilde Z}(W; S_i)$ is a function of the random variable $\tilde Z$, thus also a random variable, and $\mathbb{E}\big[I^{\tilde Z}(W; S_i)\big] = I(W; S_i \mid \tilde Z)$. These sharpened bounds are summarized in the following theorem.
Theorem 4 (CIMI Bound ).
Suppose $\ell(w, z) \in [a, b]$, then
$$\big|\mathrm{gen}(\mu, P_{W|Z^n})\big| \le \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\tilde Z}\sqrt{2 (b-a)^2\, I^{\tilde Z}(W; S_i)}. \quad (14)$$
IV. New Result
IV-A A motivating example
Example 1 (Estimating the Gaussian mean).
The training samples are drawn following the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ for some unknown $\mu$, with the loss function $\ell(w, z) = (w - z)^2$. The learner deterministically estimates $\mu$ by averaging the training samples, i.e., $W = \frac{1}{n} \sum_{i=1}^{n} Z_i$, whose empirical loss is
$$L_{Z^n}(W) = \frac{1}{n} \sum_{i=1}^{n} (W - Z_i)^2.$$
Bu et al.  showed that the mutual information term in the IMI bound is
$$I(W; Z_i) = \frac{1}{2} \ln \frac{n}{n-1},$$
and obtained the following IMI based bound:
$$\big|\mathrm{gen}(\mu, P_{W|Z^n})\big| \le \frac{1}{n} \sum_{i=1}^{n} \bar\Lambda_i^{*-1}\Big(\frac{1}{2} \ln \frac{n}{n-1}\Big), \quad (17)$$
which is of order $O(1/\sqrt{n})$.
For this simple setting, the generalization error can in fact be calculated exactly to be $\frac{2\sigma^2}{n}$. Though the error bound above does not have the same order as the true generalization error, it is consistent with the VC dimension-based bound and is the best previously known for this case. Note that the MI bound is unbounded here, since $W$ is a deterministic function of the continuous $Z^n$ and thus $I(W; Z^n)$ is unbounded.
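The mutual information value $\frac{1}{2}\ln\frac{n}{n-1}$ quoted above can be checked by simulation: $(W, Z_i)$ is jointly Gaussian with squared correlation $\rho^2 = 1/n$, so $I(W; Z_i) = -\frac{1}{2}\ln(1-\rho^2)$ with the correlation estimated empirically. The code is an illustrative sketch (function name and parameters are ours):

```python
import math
import random

def estimate_imi_term(n=10, sigma=1.0, trials=100_000, seed=2):
    """Estimate I(W; Z_1) for W = sample mean of n i.i.d. N(0, sigma^2)
    samples; (W, Z_1) is jointly Gaussian, so I = -0.5 * ln(1 - rho^2)."""
    rng = random.Random(seed)
    ws, zs = [], []
    for _ in range(trials):
        z = [rng.gauss(0, sigma) for _ in range(n)]
        ws.append(sum(z) / n)
        zs.append(z[0])
    mw, mz = sum(ws) / trials, sum(zs) / trials
    cov = sum((w - mw) * (z - mz) for w, z in zip(ws, zs)) / trials
    vw = sum((w - mw) ** 2 for w in ws) / trials
    vz = sum((z - mz) ** 2 for z in zs) / trials
    rho2 = cov * cov / (vw * vz)
    return -0.5 * math.log(1 - rho2)

# Closed form from the text: I(W; Z_i) = 0.5 * ln(n / (n - 1)).
```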
Next consider the CMI and CIMI bounds, and let us focus on the mutual information terms $I(W; S \mid \tilde Z)$ and $I(W; S_i \mid \tilde Z)$ in these bounds.
Theorem 3 and Theorem 4 in fact do not apply directly in this setting, since their required conditions do not hold. In Theorem 3, the interval $[a, b]$ bounding the loss does not exist (i.e., the loss is unbounded); even if it existed, the mutual information term would be of constant order, thus the CMI bound would be of constant order. Similarly, if the condition of Theorem 4 held, the CIMI bound would also be of constant order. As we shall show shortly, the CMI and CIMI bounds can be generalized and strengthened, yet the resulting strengthened bounds in this setting still do not diminish as $n \rightarrow \infty$, and thus would be order-wise worse than the IMI bound.
A question arises naturally: Is the looseness of the CMI and CIMI bounds here due to the introduction of the conditioning terms? As we shall show next, it is in fact caused by too much information being revealed in the conditioning terms, and there is indeed a natural way to resolve this issue.
IV-B A conditional decoupling lemma
Our main result relies on a key lemma. A few more definitions are first introduced in order to present this lemma and the main result.
For any random variables $X$ and $Z$, define the sample-conditioned cumulant generating function for any realization $Z = z$:
$$\Lambda_{X|z}(\lambda) \triangleq \ln \mathbb{E}\big[e^{\lambda (X - \mathbb{E}[X \mid Z = z])} \,\big|\, Z = z\big].$$
It is straightforward to verify that for any realization $z$, $\Lambda_{X|z}(0) = 0$ and $\Lambda'_{X|z}(0) = 0$. Hence the inverse of its Fenchel conjugate is well defined:
$$\Lambda_{X|z}^{*-1}(y) \triangleq \inf_{\lambda > 0} \frac{y + \Lambda_{X|z}(\lambda)}{\lambda},$$
and these quantities are functions of $Z$, thus random. Next define the conditional cumulant generating function
$$\Lambda_{X|Z}(\lambda) \triangleq \mathbb{E}_{z \sim P_Z}\big[\Lambda_{X|z}(\lambda)\big],$$
and similarly its inverse Fenchel conjugate as $\Lambda_{X|Z}^{*-1}(y) \triangleq \inf_{\lambda > 0} \frac{y + \Lambda_{X|Z}(\lambda)}{\lambda}$.
For a pair of random variables $(X, Y)$, its decoupled pair conditioned on a third random variable $Z$ is a pair of random variables $(\bar X, \bar Y)$, such that
$$P_{\bar X, Z} = P_{X, Z}, \qquad P_{\bar Y, Z} = P_{Y, Z},$$
i.e., $(\bar X, Z)$ and $(X, Z)$ are identically distributed, and $(\bar Y, Z)$ and $(Y, Z)$ are identically distributed, and moreover
$$\bar X \leftrightarrow Z \leftrightarrow \bar Y$$
forms a Markov string. It follows from this definition that
$$P_{\bar X, \bar Y, Z} = P_Z\, P_{X|Z}\, P_{Y|Z}.$$
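For discrete distributions, the decoupled pair can be built explicitly from this factorization; the following sketch (illustrative, with a helper name of our choosing) constructs $P_Z P_{X|Z} P_{Y|Z}$ from a joint pmf, so the marginal and Markov properties can be verified directly:

```python
from itertools import product

def decoupled_pair(p_zxy):
    """Given a joint pmf p(z, x, y) as a dict, return the joint pmf of
    (bar-X, bar-Y, Z): q(z, x, y) = p(z) * p(x | z) * p(y | z)."""
    p_z, p_zx, p_zy = {}, {}, {}
    for (z, x, y), p in p_zxy.items():
        p_z[z] = p_z.get(z, 0.0) + p
        p_zx[(z, x)] = p_zx.get((z, x), 0.0) + p
        p_zy[(z, y)] = p_zy.get((z, y), 0.0) + p
    xs = {x for (_, x) in p_zx}
    ys = {y for (_, y) in p_zy}
    return {
        (z, x, y): p_zx.get((z, x), 0.0) * p_zy.get((z, y), 0.0) / p_z[z]
        for z, x, y in product(p_z, xs, ys)
    }

# X and Y are perfectly correlated given Z in p; they become
# conditionally independent in the decoupled joint q.
p = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.25, (1, 1, 0): 0.25}
q = decoupled_pair(p)
```

The pairwise marginals $(\bar X, Z)$ and $(\bar Y, Z)$ agree with those of $(X, Z)$ and $(Y, Z)$, while the conditional coupling between $X$ and $Y$ is removed.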
We next introduce a conditional decoupling (CD) lemma, which serves an instrumental role in our work. The unconditioned version was presented in .
Lemma 1 (The CD lemma).
For any three random variables $X$, $Y$, and $Z$, let $(\bar X, \bar Y)$ be the decoupled pair of $(X, Y)$ conditioned on $Z$. Let $T = f(X, Y)$ and $\bar T = f(\bar X, \bar Y)$, for some real-valued measurable function $f$. The following inequalities hold:
$$\mathbb{E}[T] - \mathbb{E}[\bar T] \le \mathbb{E}_Z\Big[\Lambda_{\bar T|Z}^{*-1}\big(I^{Z}(X; Y)\big)\Big] \le \Lambda_{\bar T|Z}^{*-1}\big(I(X; Y \mid Z)\big).$$
This lemma is proved by utilizing the Donsker–Varadhan variational representation of KL divergence and the concavity of the inverse Fenchel conjugate function. The proof details are deferred to Section IV-G.
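The Donsker–Varadhan representation, $D(P \| Q) = \sup_f \{\mathbb{E}_P[f] - \ln \mathbb{E}_Q[e^f]\}$, can itself be sanity-checked numerically. The sketch below (an illustrative toy, not part of the paper's development) takes $P = \mathcal{N}(0, 1)$ and $Q = \mathcal{N}(1, 1)$, for which $D(P \| Q) = 1/2$; the optimal $f^*(x) = \ln\frac{dP}{dQ}(x) = \frac{1 - 2x}{2}$ attains this value, while any other $f$ only yields a lower bound:

```python
import math
import random

def dv_lower_bound(f, trials=200_000, seed=3):
    """Donsker-Varadhan lower bound E_P[f] - ln E_Q[e^f] estimated by
    Monte Carlo, with P = N(0, 1) and Q = N(1, 1); KL(P || Q) = 1/2."""
    rng = random.Random(seed)
    ep = sum(f(rng.gauss(0, 1)) for _ in range(trials)) / trials
    eq = sum(math.exp(f(rng.gauss(1, 1))) for _ in range(trials)) / trials
    return ep - math.log(eq)

opt = dv_lower_bound(lambda x: (1 - 2 * x) / 2)  # f* = ln dP/dQ, attains KL
sub = dv_lower_bound(lambda x: -x / 2)           # suboptimal f, exact value 3/8
```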
IV-C The ICIMI bound
Theorem 5 (ICIMI Bound). Given an algorithm $P_{W|Z^n}$, with $\tilde Z_i$ denoting the $i$-th column of the table $\tilde Z$, the following bounds on the generalization error hold:
$$\big|\mathrm{gen}(\mu, P_{W|Z^n})\big| \le \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\tilde Z_i}\Big[\Lambda_{\bar T_i|\tilde Z_i}^{*-1}\big(I^{\tilde Z_i}(W; S_i)\big)\Big] \le \frac{1}{n} \sum_{i=1}^{n} \Lambda_{\bar T_i|\tilde Z_i}^{*-1}\big(I(W; S_i \mid \tilde Z_i)\big), \quad (35)$$
where $\bar T_i$ is the decoupled version of the $i$-th term in (10), with $(\bar W, \bar S_i)$ the decoupled pair of $(W, S_i)$ conditioned on $\tilde Z_i$.
There are two bounds in this theorem. The stronger bound is in terms of the sample-conditioned mutual information, which is different from the conventional notion of conditional mutual information and may be more difficult to evaluate. The weaker bound is in terms of the conventional mutual information.
In the proposed bounds, the mutual information is conditioned on the individual data pair $\tilde Z_i$, instead of the full set of data pairs $\tilde Z$. Intuitively, revealing only $\tilde Z_i$ makes it more difficult, than revealing all data pairs $\tilde Z$, to deduce information regarding $S_i$ from $W$. As a consequence, the mutual information $I(W; S_i \mid \tilde Z_i)$ is less than $I(W; S_i \mid \tilde Z)$, yielding a potentially tighter bound.
Proof of Theorem 5.
We call this bound the individually conditional individual mutual information (ICIMI) bound, since it is derived by applying the CD lemma on the individual conditional terms in (31). This theorem implies the following corollary.
Corollary 1. Suppose $\ell(w, z) \in [a, b]$, then
$$\big|\mathrm{gen}(\mu, P_{W|Z^n})\big| \le \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\tilde Z_i}\sqrt{2 (b-a)^2\, I^{\tilde Z_i}(W; S_i)}.$$
Proof of Corollary 1.
When $\ell(w, z) \in [a, b]$, each difference term takes values in $[-(b-a), b-a]$, and it is straightforward to verify that it is $(b-a)$-sub-Gaussian. The definition of the sub-Gaussian distribution in fact gives $\Lambda_{\bar T_i|\tilde z_i}(\lambda) \le \frac{\lambda^2 (b-a)^2}{2}$, and thus $\Lambda_{\bar T_i|\tilde z_i}^{*-1}(y) \le \sqrt{2 (b-a)^2 y}$, from which the corollary follows. ∎
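The sub-Gaussianity step here is an instance of Hoeffding's lemma: a variable supported on $[a, b]$ has cumulant generating function at most $\lambda^2 (b-a)^2/8$, so a difference of two bounded losses, having range $2(b-a)$, is $(b-a)$-sub-Gaussian. The sketch below (illustrative; names and parameters are ours) verifies the lemma exactly for a two-point distribution, where the expectation can be computed in closed form:

```python
import math

def cgf_two_point(lam, p=0.3, a=0.0, b=1.0):
    """Exact centered CGF of X in {a, b} with P(X = b) = p."""
    mean = (1 - p) * a + p * b
    return math.log((1 - p) * math.exp(lam * (a - mean))
                    + p * math.exp(lam * (b - mean)))

def hoeffding_bound(lam, a=0.0, b=1.0):
    """Hoeffding's lemma: CGF <= lam^2 * (b - a)^2 / 8 for X in [a, b]."""
    return lam**2 * (b - a) ** 2 / 8

lams = [k / 10 for k in range(-50, 51) if k != 0]
gaps = [hoeffding_bound(l) - cgf_two_point(l) for l in lams]
```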
IV-D Dichotomy and generalizations of existing bounds
The CD lemma allows us to view the existing MI, IMI, CMI, and CIMI bounds in a unified framework. By applying the CD lemma in different manners, these bounds can be obtained almost directly. The technical conditions under which the bounds hold can also be generalized, and the bounds themselves can be strengthened using the inverse Fenchel conjugate. These results are summarized in Table I. We also provide the bounds for bounded loss functions, which eliminate the inverse Fenchel conjugate functions.
The CMI and CIMI results can be further strengthened by utilizing the inverse Fenchel conjugate function together with the sample-conditioned mutual information. More precisely, let $(\bar W, \bar S)$ be the decoupled pair of $(W, S)$ conditioned on $\tilde Z$, and define the corresponding sample-conditioned cumulant generating functions and inverse Fenchel conjugates as in Section IV-B; the strengthened CMI and CIMI bounds are then obtained by applying these functions to the sample-conditioned mutual information terms inside the expectation over $\tilde Z$.
IV-E Comparison of the bounds
We first consider the special case where the loss function is bounded, i.e., $\ell(w, z) \in [a, b]$. For this case, it was shown in  that the CIMI bound (14) is tighter than the CMI bound (11). We next show that the proposed bound (35) is tighter than the CIMI bound (14) in this case as well.
Lemma 2. For any $i \in \{1, \ldots, n\}$, we have
$$I(W; S_i \mid \tilde Z_i) \le I(W; S_i \mid \tilde Z).$$
Proof of Lemma 2.
By the independence of $(S_i, \tilde Z_i)$ and $\tilde Z_{-i}$, we have
$$I(W, \tilde Z_{-i}; S_i \mid \tilde Z_i) = I(\tilde Z_{-i}; S_i \mid \tilde Z_i) + I(W; S_i \mid \tilde Z) = I(W; S_i \mid \tilde Z).$$
It follows that
$$I(W; S_i \mid \tilde Z) = I(W; S_i \mid \tilde Z_i) + I(\tilde Z_{-i}; S_i \mid \tilde Z_i, W) \ge I(W; S_i \mid \tilde Z_i),$$
which concludes the proof. ∎
To further understand the relations among these bounds under more general conditions, when the loss function may not be bounded, let us assume that the inverse Fenchel conjugate functions, which roughly capture the geometry induced by the expected loss, are the same for all five approaches, denoted by a common $\Lambda^{*-1}$.
Then we can focus on the information measure quantities, and compare these bounds as shown in Fig. 1. Here the inequalities given in black were proved previously (see  and ). Since the common function $\Lambda^{*-1}$ is non-decreasing, the inequality "CIMI $\ge$ ICIMI" follows from Lemma 2. The inequality "IMI $\ge$ ICIMI" is implied by the following lemma for the same reason.
Lemma 3. For any $i \in \{1, \ldots, n\}$, we have
$$I(W; S_i \mid \tilde Z_i) \le I(W; Z_i).$$
Proof of Lemma 3.
First, $\tilde Z_{S_i, i}$ and $Z_i$ are both the training sample at the input of the algorithm, and the pair $(S_i, \tilde Z_i)$ is in one-to-one correspondence with $(S_i, Z_i, \tilde Z_{\bar S_i, i})$, thus
$$I(W; S_i, \tilde Z_i) = I(W; S_i, Z_i, \tilde Z_{\bar S_i, i}).$$
Then since $W$ and $\tilde Z_{\bar S_i, i}$ are independent given $(S_i, Z_i)$, and $S_i$ is independent of $(W, Z^n)$,
$$I(W; S_i, Z_i, \tilde Z_{\bar S_i, i}) = I(W; S_i, Z_i) = I(W; Z_i) + I(W; S_i \mid Z_i) = I(W; Z_i).$$
It follows that
$$I(W; S_i \mid \tilde Z_i) \le I(W; S_i, \tilde Z_i) = I(W; Z_i),$$
which concludes the proof. ∎
The inverse Fenchel conjugate functions may indeed differ across the bounds; thus, although the above comparison suggests certain dominance relations, it is not clear, for any specific problem, whether one particular bound is tighter than another. This is particularly true for the bounds based on the inverse Fenchel conjugate; even in the special case of bounded loss functions, the different multiplicative factors and the sum-square-root forms imply that the relation can be less clear.
IV-F Revisiting the example
We now return to the problem of estimating the Gaussian mean, and show that the proposed ICIMI bound provides scaling behavior similar to that of the IMI bound, and is thus order-wise stronger than the CMI and CIMI bounds. In fact, in this setting the bound is also asymptotically strictly better than the IMI bound given in . We first formally establish, as suggested previously, that the CMI and CIMI bounds are at least of constant order for this setting; the proof can be found in the appendix.
The next proposition establishes a generalization error bound based on the ICIMI bound in this setting.
Proposition 2. For the problem of estimating the mean of the Gaussian distribution, the ICIMI bound gives a generalization error bound of order $O(1/\sqrt{n})$.
Remark: This bound scales in the same manner as the IMI bound in (17), but is asymptotically tighter by a constant factor.
Proposition 2 is proved by studying separately the sample-conditioned individual mutual information $I^{\tilde Z_i}(W; S_i)$ and the inverse Fenchel conjugate functions. For the former, since the algorithm here averages the samples without any prior knowledge of the Gaussian distribution, we can assume without loss of generality that the mean of the Gaussian distribution is $0$, i.e., $Z_i \sim \mathcal{N}(0, \sigma^2)$. Therefore, given $\tilde Z_i = (\tilde z_{1,i}, \tilde z_{2,i})$, the estimate $W$ is mixed-Gaussian distributed: it follows $\mathcal{N}\big(\tilde z_{1,i}/n, (n-1)\sigma^2/n^2\big)$ when $S_i = 1$ and $\mathcal{N}\big(\tilde z_{2,i}/n, (n-1)\sigma^2/n^2\big)$ when $S_i = 2$. The term $I^{\tilde Z_i}(W; S_i)$ is thus related to the scaling behavior of the differential entropy of a mixed-Gaussian distribution, which the following lemma makes more precise.
Lemma 4. Let $B$ be a Rademacher random variable and $X$ be a mixed-Gaussian random variable, such that $X \sim \mathcal{N}(\mu_1, \sigma^2)$ when $B = 1$, and $X \sim \mathcal{N}(\mu_2, \sigma^2)$ when $B = 2$. Then as $(\mu_1 - \mu_2)/\sigma \rightarrow 0$, we have
$$I(B; X) = \frac{(\mu_1 - \mu_2)^2}{8 \sigma^2} + o\Big(\frac{(\mu_1 - \mu_2)^2}{\sigma^2}\Big).$$
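The scaling in this lemma can be examined numerically. The sketch below (illustrative; symmetric means $\pm a$ with equal mixture weights are our simplification) computes $I(B; X) = h(X) - h(X \mid B)$ by trapezoidal integration, and for small $a/\sigma$ the result is close to the quadratic leading term $a^2/(2\sigma^2) = (\mu_1 - \mu_2)^2/(8\sigma^2)$:

```python
import math

def mixture_mi(a, sigma=1.0, lo=-10.0, hi=10.0, steps=20_000):
    """I(B; X) in nats for X ~ N(+a, sigma^2) or N(-a, sigma^2) with equal
    probability: h(X) - h(X | B), with h(X) by trapezoidal integration."""
    def phi(x, m):
        return math.exp(-(x - m) ** 2 / (2 * sigma**2)) / (
            sigma * math.sqrt(2 * math.pi))
    dx = (hi - lo) / steps
    h = 0.0
    for k in range(steps + 1):
        x = lo + k * dx
        f = 0.5 * phi(x, a) + 0.5 * phi(x, -a)
        if f > 0.0:
            w = 0.5 if k in (0, steps) else 1.0
            h -= w * f * math.log(f) * dx
    return h - 0.5 * math.log(2 * math.pi * math.e * sigma**2)

# Low-SNR behavior: I(B; X) ~ a^2 / (2 * sigma^2) as a / sigma -> 0.
```

Halving the separation roughly quarters the mutual information, consistent with the quadratic leading term.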
The next lemma gives an upper bound on the inverse Fenchel conjugate functions for the problem of estimating the mean of the Gaussian distribution, for any realization $\tilde z_i$ of $\tilde Z_i$, in two regimes of the argument.
The proofs of these two lemmas are relegated to the appendix. With these lemmas, Proposition 2 can be proved as follows.
IV-G Proof of the CD lemma
Proof of Lemma 1.
The definition of the sample-conditioned cumulant generating function implies that, for any realization $Z = z$,
$$\mathbb{E}\big[e^{\lambda (\bar T - \mathbb{E}[\bar T \mid Z = z])} \,\big|\, Z = z\big] = e^{\Lambda_{\bar T|z}(\lambda)}. \quad (26)$$
By the Donsker–Varadhan variational representation of KL divergence, for any $\lambda$,
$$I^{z}(X; Y) = D\big(P_{X, Y \mid Z = z} \,\big\|\, P_{\bar X, \bar Y \mid Z = z}\big) \ge \mathbb{E}\big[\lambda T \mid Z = z\big] - \ln \mathbb{E}\big[e^{\lambda \bar T} \,\big|\, Z = z\big] = \lambda \big(\mathbb{E}[T \mid Z = z] - \mathbb{E}[\bar T \mid Z = z]\big) - \Lambda_{\bar T|z}(\lambda),$$
where the equality is due to (26). It follows that for $\lambda > 0$,
$$\mathbb{E}[T \mid Z = z] - \mathbb{E}[\bar T \mid Z = z] \le \frac{I^{z}(X; Y) + \Lambda_{\bar T|z}(\lambda)}{\lambda},$$
and taking the infimum over $\lambda > 0$ and then the expectation over $Z$ gives
$$\mathbb{E}[T] - \mathbb{E}[\bar T] \le \mathbb{E}_Z\Big[\Lambda_{\bar T|Z}^{*-1}\big(I^{Z}(X; Y)\big)\Big] \le \Lambda_{\bar T|Z}^{*-1}\big(I(X; Y \mid Z)\big),$$
where the last inequality is by exchanging the order of expectation and infimum. Similarly, since