Individually Conditional Individual Mutual Information Bound on Generalization Error

December 17, 2020 · Ruida Zhou et al., Texas A&M University

We propose a new information-theoretic bound on generalization error based on a combination of the error decomposition technique of Bu et al. and the conditional mutual information (CMI) construction of Steinke and Zakynthinou. In a previous work, Haghifam et al. proposed a different bound combining the two aforementioned techniques, which we refer to as the conditional individual mutual information (CIMI) bound. However, in a simple Gaussian setting, both the CMI and the CIMI bounds are order-wise worse than the bound of Bu et al. This observation motivated us to propose the new bound, which overcomes this issue by reducing the conditioning terms in the conditional mutual information. In the process of establishing this bound, a conditional decoupling lemma is introduced, which also leads to a meaningful dichotomy and comparison among these information-theoretic bounds.


I Introduction

Bounding the generalization error of learning algorithms is of fundamental importance in statistical machine learning. The conventional approach is to bound it using a quantity related to the hypothesis class, such as the VC dimension [1], and such bounds are therefore oblivious to the learning algorithm and the data distribution. The resulting bounds are usually rather conservative, and cannot fully explain the recent success of deep learning. Recently, information-theoretic approaches that jointly take into consideration the hypothesis class, the learning algorithm, and the data distribution have drawn considerable attention [2, 3, 4, 5, 6, 7, 8, 9, 10, 11].

The effort of deriving generalization error bounds using information-theoretic approaches was perhaps first initiated in [2] and [8]. The bound was further tightened in [9] by decomposing the error and bounding each term individually. Steinke and Zakynthinou [10] proposed a conditional mutual information (CMI) based bound, by introducing a dependence structure which resembles that in the analysis of the Rademacher complexity [1]. Combining the idea of error decomposition [9] and the CMI bound in [10], Haghifam et al. [11] subsequently provided a sharpened bound based on conditional individual mutual information (CIMI).

In this work, we propose a new generalization error bound, which is also based on a combination of the error decomposition technique and the CMI construction. This new bound is motivated by the observation that in a simple Gaussian setting, the CIMI bound in [11] (as well as the CMI bound in [10]) is of constant order, while the bound in [9] is of order $O(1/\sqrt{n})$, where $n$ is the number of training samples. We further observe that the conditioning term in the CIMI bound is the same as that in the CMI bound, and it tends to reveal too much information, which makes the bounds loose. The proposed new bound is thus obtained by conditioning the mutual information on an individual sample pair, which we refer to as the individually conditional individual mutual information (ICIMI) bound. In order to establish the new bound, we introduce a new conditional decoupling lemma. This lemma allows us to view the bounds in [8, 9, 10, 11] and the new bound in a unified manner, which not only yields a dichotomy of these bounds, but also makes possible a meaningful comparison among them. Finally, we show that in the Gaussian setting mentioned earlier, the proposed new bound provides a bound of the same order as that in [9], but with an improved leading constant.

II Preliminaries

We study the classic supervised learning setting. Denote the data domain as $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the feature domain and $\mathcal{Y}$ is the label set. The parametric hypothesis class is denoted as $\{f_w : w \in \mathcal{W}\}$, where $\mathcal{W}$ is the parameter space. During training, the learning algorithm (learner) has access to a sequence of training samples $S = (Z_1, Z_2, \ldots, Z_n)$, where each $Z_i$ is drawn independently from $\mathcal{Z}$ following some unknown probability distribution $\mu$. The learner can be represented by $P_{W|S}$, which is a kernel (channel) that (randomly) maps $S$ to a parameter $W \in \mathcal{W}$.

To complete the classification or regression task, the learner in principle would choose a hypothesis $w \in \mathcal{W}$ to minimize the following population loss, under a given loss function $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}^{+}$,

$$L_\mu(w) \triangleq \mathbb{E}_{Z \sim \mu}\big[\ell(w, Z)\big]. \qquad (1)$$

However, since only a training data vector $S$ is available, the empirical loss of $w$ is usually computed (and minimized during training), which is given as

$$L_S(w) \triangleq \frac{1}{n}\sum_{i=1}^{n}\ell(w, Z_i). \qquad (2)$$

The expected generalization error of the learner is

$$\mathrm{gen}(\mu, P_{W|S}) \triangleq \mathbb{E}\big[L_\mu(W) - L_S(W)\big], \qquad (3)$$

where the expectation is taken over the joint distribution $P_{S,W} = \mu^{\otimes n} \otimes P_{W|S}$. This quantity captures the learner's expected overfitting error due to limited training data, which we shall study in this work.
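To make the quantities in (1)-(3) concrete, the following minimal Python sketch estimates the expected generalization error of a simple learner by Monte Carlo. The averaging learner, the squared loss, and the Gaussian data model are illustrative stand-ins (they anticipate the example in Section IV-A), not part of the general setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def learner(samples):
    """A simple deterministic learner used for illustration: the empirical mean."""
    return samples.mean()

def loss(w, z):
    """Squared loss between a scalar hypothesis w and a sample z."""
    return (w - z) ** 2

def expected_gen_error(n=20, sigma=1.0, trials=20000, test_size=2000):
    """Monte Carlo estimate of gen = E[L_mu(W) - L_S(W)] over fresh training sets S."""
    gaps = []
    for _ in range(trials):
        S = rng.normal(0.0, sigma, size=n)            # training samples Z_1..Z_n ~ mu
        W = learner(S)                                # W drawn from P_{W|S} (deterministic here)
        L_S = loss(W, S).mean()                       # empirical loss (2)
        Z_fresh = rng.normal(0.0, sigma, size=test_size)
        L_mu = loss(W, Z_fresh).mean()                # Monte Carlo proxy for the population loss (1)
        gaps.append(L_mu - L_S)
    return float(np.mean(gaps))

print(expected_gen_error())   # close to 2*sigma^2/n = 0.1 for this particular learner
```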

III Review of Related Results

In this section, we briefly review a few information-theoretic bounds on the generalization error relevant to this work. A more thorough discussion of their relations is deferred to Sections IV-D and IV-E, after a unified framework is given.

III-A Mutual information based bounds

Xu and Raginsky, motivated by a previous work by Russo and Zou [2], provided a mutual information (MI) based bound on the expected generalization error [8].

Theorem 1 (MI Bound [8]).

Suppose $\ell(w, Z)$ is $\sigma$-sub-Gaussian under $Z \sim \mu$ for all $w \in \mathcal{W}$, then

$$\big|\mathrm{gen}(\mu, P_{W|S})\big| \le \sqrt{\frac{2\sigma^2}{n} I(W; S)}. \qquad (4)$$

The generalization error can be written in two ways:

$$\mathrm{gen}(\mu, P_{W|S}) = \mathbb{E}\big[L_S(\bar{W})\big] - \mathbb{E}\big[L_S(W)\big] \qquad (5)$$
$$\mathrm{gen}(\mu, P_{W|S}) = \frac{1}{n}\sum_{i=1}^{n}\Big(\mathbb{E}\big[\ell(\bar{W}, \bar{Z}_i)\big] - \mathbb{E}\big[\ell(W, Z_i)\big]\Big), \qquad (6)$$

where $\bar{W}$ and $\bar{Z}_i$ are independent random variables that have the same marginal distributions as $W$ and $Z_i$, respectively. Instead of bounding the difference (5) as in [8], Bu et al. [9] bounded each individual difference in (6) and derived an individual mutual information (IMI) based bound. Furthermore, the following inverse Fenchel conjugate function was utilized to obtain a tightened bound. For any random variable $X$, its cumulant generating function is

$$\Lambda_X(\lambda) \triangleq \log \mathbb{E}\big[e^{\lambda(X - \mathbb{E}[X])}\big], \qquad (7)$$

and the inverse of its Fenchel conjugate is given as

$$\Lambda_X^{*-1}(r) \triangleq \inf_{\lambda > 0} \frac{r + \Lambda_X(\lambda)}{\lambda}. \qquad (8)$$

The tightened bound is summarized in the following theorem.

Theorem 2 (IMI Bound [9]).

Suppose $\Lambda(\lambda)$ is an upper bound of $\Lambda_{\ell(\bar{W}, \bar{Z}_i)}(\lambda)$ for each $i$, then

$$\mathrm{gen}(\mu, P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n} \Lambda^{*-1}\big(I(W; Z_i)\big), \qquad (9)$$

where $\bar{W}$ and $\bar{Z}_i$ are independent random variables that have the same marginal distributions as $W$ and $Z_i$, respectively.
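As an illustration of how the inverse Fenchel conjugate in (8) is used, the sketch below evaluates it numerically over a grid of $\lambda$. With the cumulant generating function bound $\Lambda(\lambda) = \lambda^2\sigma^2/2$ of a $\sigma$-sub-Gaussian loss, it recovers the familiar $\sqrt{2\sigma^2 I}$ form appearing in the MI bound; the grid-search routine is an assumption of this sketch, not code from [9].

```python
import numpy as np

def inv_fenchel_conjugate(cgf, r, lams=np.logspace(-4, 4, 4000)):
    """Numerically evaluate Lambda^{*-1}(r) = inf_{lambda>0} (r + Lambda(lambda)) / lambda on a grid."""
    return float(np.min((r + cgf(lams)) / lams))

sigma = 1.0
cgf_bound = lambda lam: 0.5 * sigma**2 * lam**2   # CGF bound for a sigma-sub-Gaussian variable

for r in [0.01, 0.1, 1.0]:
    # Closed form: the infimum is attained at lambda = sqrt(2 r)/sigma, giving sqrt(2 sigma^2 r).
    print(r, inv_fenchel_conjugate(cgf_bound, r), np.sqrt(2 * sigma**2 * r))
```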

III-B Conditional mutual information based bounds

Steinke and Zakynthinou [10] recently introduced a novel bounding approach. In their approach, $\tilde{Z} \in \mathcal{Z}^{2 \times n}$ is a table of samples such that each entry $\tilde{Z}_{j,i}$, for $j \in \{+1, -1\}$ and $i \in \{1, \ldots, n\}$, is independently drawn following $\mu$. The training vector $S = \tilde{Z}_U \triangleq (\tilde{Z}_{U_1,1}, \ldots, \tilde{Z}_{U_n,n})$ is selected from the table $\tilde{Z}$ by $U = (U_1, \ldots, U_n)$, where the $U_i$'s are independent Rademacher random variables, i.e., each $U_i$ takes $+1$ or $-1$ equally likely. The vector $U$ essentially selects one sample from each column in the table, which partitions $\tilde{Z}$ into a training vector $\tilde{Z}_U$ and a testing vector $\tilde{Z}_{-U}$. For simplicity, we shall write $\tilde{Z}_U$ and $\tilde{Z}_{-U}$ as $S$ and $\bar{S}$, when the meaning is clear from the context.
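The following short sketch builds the super-sample table and the Rademacher selection just described; the mapping of +1/-1 to the two table rows and the Gaussian data are conventions of this illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 8, 1.0

Z_tilde = rng.normal(0.0, sigma, size=(2, n))   # table of 2 x n i.i.d. samples from mu
U = rng.choice([+1, -1], size=n)                # independent Rademacher selectors

rows = (U == -1).astype(int)                    # convention: U_i = +1 -> row 0, U_i = -1 -> row 1
cols = np.arange(n)
S_train = Z_tilde[rows, cols]                   # training vector: one entry per column
S_test = Z_tilde[1 - rows, cols]                # testing vector: the remaining entries

W = S_train.mean()                              # e.g., the averaging learner of Section IV-A
```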

With the structure given above, the expected generalization error of the algorithm can be written as

$$\mathrm{gen}(\mu, P_{W|S}) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[\ell(W, \tilde{Z}_{-U_i,i}) - \ell(W, \tilde{Z}_{U_i,i})\big]. \qquad (10)$$

Steinke and Zakynthinou obtained the following conditional mutual information (CMI) based result.

Theorem 3 (CMI Bound [10]).

Suppose $\ell(w, z) \in [0, 1]$ for any $w \in \mathcal{W}$ and $z \in \mathcal{Z}$, then

$$\mathrm{gen}(\mu, P_{W|S}) \le \sqrt{\frac{2}{n} I(W; U \mid \tilde{Z})}, \qquad (11)$$

where $\tilde{Z}$ and $U$ are independent and distributed as described above.

Since $U$ is binary, the conditional mutual information $I(W; U \mid \tilde{Z})$ is always bounded; in contrast, the mutual information based bounds (i.e., the MI and IMI bounds) can be unbounded, particularly when $W$ and the $Z_i$'s are both continuous.

Motivated by the results in [9], Haghifam et al. [11] proposed a sharpened bound by similarly bounding each term in (10). Moreover, they provided a conditional individual mutual information (CIMI) based bound represented by the sample-conditioned mutual information, which is defined as

$$I^{\tilde{Z}}(W; U_i) \triangleq D\big(P_{W, U_i \mid \tilde{Z}} \,\big\|\, P_{W \mid \tilde{Z}} \otimes P_{U_i \mid \tilde{Z}}\big). \qquad (12)$$

Clearly $I^{\tilde{Z}}(W; U_i)$ is a function of the random variable $\tilde{Z}$, thus also a random variable, and $\mathbb{E}_{\tilde{Z}}\big[I^{\tilde{Z}}(W; U_i)\big] = I(W; U_i \mid \tilde{Z})$. These sharpened bounds are summarized in the following theorem.

Theorem 4 (CIMI Bound [11]).

Suppose $\ell(w, z) \in [0, 1]$ for any $w \in \mathcal{W}$ and $z \in \mathcal{Z}$, then

$$\mathrm{gen}(\mu, P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\tilde{Z}}\Big[\sqrt{2\, I^{\tilde{Z}}(W; U_i)}\Big] \qquad (13)$$
$$\mathrm{gen}(\mu, P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n}\sqrt{2\, I(W; U_i \mid \tilde{Z})}. \qquad (14)$$

IV New Results

IV-A A motivating example

Let us consider the simple setting of estimating the mean of a Gaussian distribution $\mathcal{N}(m, \sigma^2)$ from $n$ samples, by averaging the training samples under the squared loss.

Example 1 (Estimating the Gaussian mean).

The training samples $Z_1, \ldots, Z_n$ are drawn following $\mathcal{N}(m, \sigma^2)$ for some unknown mean $m$. The learner deterministically estimates $m$ by averaging the training samples, i.e., $W = \frac{1}{n}\sum_{i=1}^{n} Z_i$, whose empirical error is

$$L_S(W) = \frac{1}{n}\sum_{i=1}^{n}(W - Z_i)^2. \qquad (15)$$

Bu et al. [9] showed that the mutual information term in the IMI bound is

$$I(W; Z_i) = \frac{1}{2}\log\frac{n}{n-1}, \qquad (16)$$

and obtained from it an IMI based bound (17) that decays at the rate $O(\sigma^2/\sqrt{n})$.

For this simple setting, the generalization error can in fact be calculated exactly to be $2\sigma^2/n$. Though the error bound above does not have the same order as the true generalization error, it is consistent with the VC dimension based bound and is the best known for this case. Note that the MI bound (4) is unbounded here, since $I(W; S)$ is infinite (the learner is a deterministic function of the continuous training vector).
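A quick numerical check of the quantities in this example (a Monte Carlo sketch, so the printed values are approximate):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, trials = 10, 1.0, 200000

Z = rng.normal(0.0, sigma, size=(trials, n))     # training sets, one per row
W = Z.mean(axis=1)                               # averaging learner
emp = ((W[:, None] - Z) ** 2).mean()             # estimate of E[L_S(W)]
pop = ((W - rng.normal(0.0, sigma, size=trials)) ** 2).mean()   # estimate of E[L_mu(W)]

print(pop - emp, 2 * sigma**2 / n)               # generalization error vs. exact value 2*sigma^2/n

# I(W; Z_i) for jointly Gaussian (W, Z_i): I = -0.5*log(1 - rho^2) with rho^2 = 1/n,
# which equals 0.5*log(n/(n-1)) as in (16).
print(0.5 * np.log(n / (n - 1)))
```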

Next consider the CMI and CIMI bounds, and let us focus on the mutual information terms in these bounds. Since the learner is a deterministic function of the selected samples, and the map from $U$ to $W$ is almost surely injective given the table, knowing $(W, \tilde{Z})$ reveals $U$ exactly, which gives

$$I(W; U \mid \tilde{Z}) = n \log 2, \qquad (18)$$
$$I(W; U_i \mid \tilde{Z}) = \log 2. \qquad (19)$$

It is seen that they are order-wise worse than (16), which suggests that the bounds obtained from the CMI and CIMI approaches would be order-wise worse than (17).

Theorem 3 and Theorem 4 in fact do not apply directly in this setting, since their required conditions do not hold: the squared loss is unbounded here. Even if the boundedness condition held, the term $\frac{1}{n}I(W; U \mid \tilde{Z})$ would be a constant, and thus the CMI bound would be of constant order. Similarly, if the condition of Theorem 4 held, the CIMI bound would also be of constant order. As we shall show shortly, the CMI and CIMI bounds can be generalized and strengthened, yet the resulting strengthened bounds in this setting still do not diminish as $n \to \infty$, and thus would be order-wise worse than the IMI bound.
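To see concretely why conditioning on the full table reveals too much in this example, the sketch below (with 0/1 in place of the ±1 selectors) recovers the entire selector vector from $W$ and the table by brute force; since $W$ is a deterministic and almost surely injective function of $U$ given the table, the conditional mutual information is maximal.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
n, sigma = 12, 1.0

Z_tilde = rng.normal(0.0, sigma, size=(2, n))    # the full table, revealed by the conditioning
U_true = rng.integers(0, 2, size=n)              # 0/1 stand-ins for the Rademacher selectors
cols = np.arange(n)
W = Z_tilde[U_true, cols].mean()                 # output of the averaging learner

# Brute-force search over all 2^n selector vectors recovers U exactly (almost surely),
# so H(U | W, Z_tilde) = 0 and I(W; U | Z_tilde) = H(U) = n*log(2).
best = min(product([0, 1], repeat=n),
           key=lambda u: abs(Z_tilde[list(u), cols].mean() - W))
print(np.array_equal(np.array(best), U_true))    # True
```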

A question arises naturally: Is the looseness of the CMI and CIMI bounds here due to the introduction of the conditioning terms? As we shall show next, it is in fact caused by too much information being revealed in the conditioning terms, and there is indeed a natural way to resolve this issue.

IV-B A conditional decoupling lemma

Our main result relies on a key lemma. A few more definitions are first introduced in order to present this lemma and the main result.

For any random variables $X$ and $Y$, define the sample-conditioned cumulant generating function, for any realization $Y = y$, as

$$\Lambda_{X|y}(\lambda) \triangleq \log \mathbb{E}\big[e^{\lambda(X - \mathbb{E}[X \mid Y = y])} \,\big|\, Y = y\big]. \qquad (20)$$

It is straightforward to verify that for any realization $y$, $\Lambda_{X|y}(0) = 0$ and $\Lambda'_{X|y}(0) = 0$. Hence the inverse of its Fenchel conjugate

$$\Lambda^{*-1}_{X|y}(r) \triangleq \inf_{\lambda > 0} \frac{r + \Lambda_{X|y}(\lambda)}{\lambda} \qquad (21)$$

is concave and non-decreasing; see, e.g., [9] and [12]. The unconditioned version of this function was introduced earlier by Bu et al. [9]. When it is clear from the context, we will write

$$\Lambda_{X|Y}(\lambda) \quad \text{and} \quad \Lambda^{*-1}_{X|Y}(r), \qquad (22)$$

obtained by replacing the realization $y$ with the random variable $Y$; these are functions of $Y$, thus random. Next define the conditional cumulant generating function

$$\overline{\Lambda}_{X|Y}(\lambda) \triangleq \log \mathbb{E}\big[e^{\lambda(X - \mathbb{E}[X \mid Y])}\big], \qquad (23)$$

and similarly its inverse Fenchel conjugate as $\overline{\Lambda}^{*-1}_{X|Y}(r)$.

For a pair of random variables $(X, Y)$, its decoupled pair conditioned on a third random variable $T$ is a pair of random variables $(\bar{X}, \bar{Y})$, such that

$$P_{\bar{X} \mid T} = P_{X \mid T} \quad \text{and} \quad P_{\bar{Y} \mid T} = P_{Y \mid T}, \qquad (24)$$

i.e., conditioned on $T$, $\bar{X}$ and $X$ are identically distributed, and $\bar{Y}$ and $Y$ are identically distributed, and moreover

$$\bar{X} - T - \bar{Y} \qquad (25)$$

forms a Markov string, i.e., $\bar{X}$ and $\bar{Y}$ are conditionally independent given $T$. It follows from this definition that

$$P_{\bar{X}, \bar{Y} \mid T} = P_{X \mid T} \otimes P_{Y \mid T}. \qquad (26)$$
We next introduce a conditional decoupling (CD) lemma, which serves an instrumental role in our work. The unconditioned version was presented in [9].

Lemma 1 (The CD lemma).

For any three random variables $(X, Y, T)$, let $(\bar{X}, \bar{Y})$ be the decoupled pair of $(X, Y)$ conditioned on $T$. Let $F = g(X, Y, T)$ and $\bar{F} = g(\bar{X}, \bar{Y}, T)$, for some real-valued measurable function $g$. The following inequalities hold:

$$\mathbb{E}[F] - \mathbb{E}[\bar{F}] \le \mathbb{E}_{T}\Big[\Lambda^{*-1}_{\bar{F}|T}\big(I^{T}(X; Y)\big)\Big], \qquad (27)$$
$$\mathbb{E}[F] - \mathbb{E}[\bar{F}] \le \overline{\Lambda}^{*-1}_{\bar{F}|T}\big(I(X; Y \mid T)\big), \qquad (28)$$

where $I^{T}(X; Y)$ denotes the sample-conditioned mutual information, defined analogously to (12).

This lemma is proved by utilizing the Donsker–Varadhan variational representation of KL divergence and the concavity of the inverse Fenchel conjugate function. The proof details are deferred to Section IV-G.
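For reference, the Donsker–Varadhan representation invoked here is the standard variational identity (stated in generic form; the application to the sample-conditioned divergence is spelled out in Section IV-G):

$$D(P \,\|\, Q) = \sup_{f}\Big\{\mathbb{E}_{P}[f] - \log \mathbb{E}_{Q}\big[e^{f}\big]\Big\},$$

where the supremum is over measurable functions $f$ with $\mathbb{E}_{Q}[e^{f}] < \infty$; choosing $f = \lambda g$, $P = P_{X,Y|T=t}$, and $Q = P_{X|T=t} \otimes P_{Y|T=t}$ yields the sample-conditioned inequality behind (27).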

IV-C The ICIMI bound

Let $\tilde{Z}$, $U$, $S$, and $W$ be as given previously in Section III-B, and let $\tilde{Z}_i \triangleq (\tilde{Z}_{+1,i}, \tilde{Z}_{-1,i})$ denote the $i$-th column of the table. For each $i$, let $(\bar{W}_i, \bar{U}_i)$ be a decoupled pair of $(W, U_i)$ conditioned on $\tilde{Z}_i$. The new bound we propose is presented in Theorem 5.

Theorem 5.

(ICIMI Bound) Given an algorithm $P_{W|S}$, the following bounds on the generalization error hold:

$$\mathrm{gen}(\mu, P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\tilde{Z}_i}\Big[\Lambda^{*-1}_{\bar{F}_i|\tilde{Z}_i}\big(I^{\tilde{Z}_i}(W; U_i)\big)\Big] \qquad (29)$$
$$\mathrm{gen}(\mu, P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n}\overline{\Lambda}^{*-1}_{\bar{F}_i|\tilde{Z}_i}\big(I(W; U_i \mid \tilde{Z}_i)\big), \qquad (30)$$

where $\bar{F}_i \triangleq \ell(\bar{W}_i, \tilde{Z}_{-\bar{U}_i,i}) - \ell(\bar{W}_i, \tilde{Z}_{\bar{U}_i,i})$.

There are two bounds in this theorem. The stronger bound (29) is in terms of the sample-conditioned mutual information, which is different from the conventional notion of conditional mutual information and may be more difficult to evaluate. The weaker bound (30) is in terms of the conventional conditional mutual information.

In the proposed bounds, the mutual information is conditioned on the individual data pair $\tilde{Z}_i$, instead of the full set of data pairs $\tilde{Z}$. Intuitively, revealing only $\tilde{Z}_i$ makes it more difficult, than revealing all data pairs $\tilde{Z}$, to deduce information regarding $U_i$ from $W$. As a consequence, the mutual information $I(W; U_i \mid \tilde{Z}_i)$ is less than $I(W; U_i \mid \tilde{Z})$, yielding a potentially tighter bound.

Proof of Theorem 5.

We can rewrite the generalization error given in (10) as

$$\mathrm{gen}(\mu, P_{W|S}) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\big[g_i(W, U_i, \tilde{Z}_i)\big], \qquad (31)$$

where $g_i(w, u, \tilde{z}_i) \triangleq \ell(w, \tilde{z}_{-u,i}) - \ell(w, \tilde{z}_{u,i})$. Now apply the CD lemma on each individual term in (31) by letting $X = W$, $Y = U_i$, $T = \tilde{Z}_i$, and $g = g_i$. Since $\bar{U}_i$ is independent of $(\bar{W}_i, \tilde{Z}_i)$ and takes $+1$ or $-1$ equally likely, we have $\mathbb{E}[\bar{F}_i] = \mathbb{E}\big[g_i(\bar{W}_i, \bar{U}_i, \tilde{Z}_i)\big] = 0$, and therefore

$$\mathrm{gen}(\mu, P_{W|S}) = \frac{1}{n}\sum_{i=1}^{n}\Big(\mathbb{E}[F_i] - \mathbb{E}[\bar{F}_i]\Big) \le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\tilde{Z}_i}\Big[\Lambda^{*-1}_{\bar{F}_i|\tilde{Z}_i}\big(I^{\tilde{Z}_i}(W; U_i)\big)\Big] \qquad (32)$$
$$\mathrm{gen}(\mu, P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n}\overline{\Lambda}^{*-1}_{\bar{F}_i|\tilde{Z}_i}\big(I(W; U_i \mid \tilde{Z}_i)\big), \qquad (33)$$

which completes the proof. ∎

We call this bound the individually conditional individual mutual information (ICIMI) bound, since it is derived by applying the CD lemma on the individual conditional terms in (31). This theorem implies the following corollary.

Corollary 1.

Suppose $\ell(w, z) \in [0, 1]$ for any $w \in \mathcal{W}$ and $z \in \mathcal{Z}$, then

$$\mathrm{gen}(\mu, P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\tilde{Z}_i}\Big[\sqrt{2\, I^{\tilde{Z}_i}(W; U_i)}\Big] \qquad (34)$$
$$\mathrm{gen}(\mu, P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n}\sqrt{2\, I(W; U_i \mid \tilde{Z}_i)}. \qquad (35)$$

Proof of Corollary 1.

When $\ell(w, z) \in [0, 1]$ and conditioned on any realization of $\tilde{Z}_i$, it is straightforward to verify that $\bar{F}_i$ is $1$-sub-Gaussian. The definition of the sub-Gaussian distribution in fact gives $\Lambda_{\bar{F}_i|\tilde{Z}_i}(\lambda) \le \lambda^2/2$, and thus $\Lambda^{*-1}_{\bar{F}_i|\tilde{Z}_i}(r) \le \sqrt{2r}$, from which the corollary follows. ∎
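For completeness, the elementary steps behind this specialization are standard (and not specific to this paper): since $\ell \in [0, 1]$, the difference $\bar{F}_i$ takes values in $[-1, 1]$, so Hoeffding's lemma bounds the cumulant generating function, and the infimum defining the inverse Fenchel conjugate has a closed form:

$$\Lambda_{\bar{F}_i|\tilde{z}_i}(\lambda) \le \frac{(1 - (-1))^2}{8}\lambda^2 = \frac{\lambda^2}{2}
\quad\Longrightarrow\quad
\Lambda^{*-1}_{\bar{F}_i|\tilde{z}_i}(r) \le \inf_{\lambda > 0}\frac{r + \lambda^2/2}{\lambda} = \sqrt{2r},$$

with the infimum attained at $\lambda = \sqrt{2r}$.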

IV-D Dichotomy and generalizations of existing bounds

The CD lemma allows us to view the existing MI, IMI, CMI, and CIMI bounds in a unified framework. By applying the CD lemma in different manners, these bounds can be obtained almost directly. The technical conditions under which the bounds hold can also be generalized, and the bounds themselves can be strengthened using the inverse Fenchel conjugate. These results are summarized in Table I. We also provide the bounds for bounded loss functions, which eliminate the $\Lambda^{*-1}$ functions.

Approach Generalization bound Special case
MI [8]
IMI [9]
CMI [10]
CIMI [11]
ICIMI (new)
TABLE I: A dichotomy of several generalization bounds using the CD Lemma

The CMI and CIMI results can be further strengthened by utilizing the inverse Fenchel conjugate function together with the sample-conditioned mutual information. More precisely, let $(\bar{W}, \bar{U})$ be the decoupled pair of $(W, U)$ conditioned on $\tilde{Z}$. Further define

$$\bar{F}^{(i)} \triangleq \ell(\bar{W}, \tilde{Z}_{-\bar{U}_i,i}) - \ell(\bar{W}, \tilde{Z}_{\bar{U}_i,i}) \quad \text{and} \quad \bar{F} \triangleq \frac{1}{n}\sum_{i=1}^{n}\bar{F}^{(i)}, \qquad (36)$$

then we have the strengthened CMI and CIMI bounds:

$$\mathrm{gen}(\mu, P_{W|S}) \le \mathbb{E}_{\tilde{Z}}\Big[\Lambda^{*-1}_{\bar{F}|\tilde{Z}}\big(I^{\tilde{Z}}(W; U)\big)\Big], \qquad (37)$$
$$\mathrm{gen}(\mu, P_{W|S}) \le \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\tilde{Z}}\Big[\Lambda^{*-1}_{\bar{F}^{(i)}|\tilde{Z}}\big(I^{\tilde{Z}}(W; U_i)\big)\Big]. \qquad (38)$$

IV-E Comparison of the bounds

We first consider the special case where the loss function is bounded, i.e., $\ell(w, z) \in [0, 1]$. For this case, it was shown in [11] that the CIMI bound (14) is tighter than the CMI bound (11). We next show that the proposed bound (35) is tighter than the CIMI bound (14) in this case.

Lemma 2.

For any $i \in \{1, \ldots, n\}$, we have $I(W; U_i \mid \tilde{Z}_i) \le I(W; U_i \mid \tilde{Z})$.

Proof of Lemma 2.

Denote by $\tilde{Z}_{\setminus i}$ the columns of $\tilde{Z}$ other than the $i$-th. By the independence of $U_i$ and $\tilde{Z}$, and the chain rule of mutual information, we have

$$I(W; U_i \mid \tilde{Z}) = I(W, \tilde{Z}_{\setminus i}; U_i \mid \tilde{Z}_i).$$

It follows that

$$I(W; U_i \mid \tilde{Z}_i) \le I(W, \tilde{Z}_{\setminus i}; U_i \mid \tilde{Z}_i) = I(W; U_i \mid \tilde{Z}),$$

which concludes the proof. ∎

Fig. 1: Relations among the generalization bounds (MI, IMI, CMI, CIMI, and the new ICIMI), when the inverse Fenchel conjugate functions are assumed to be the same.

To further understand the relations among these bounds under more general conditions, when the loss function may not be bounded, let us assume that the inverse Fenchel conjugate functions, which roughly capture the geometry induced by the expected loss, are the same (denoted as $\Lambda^{*-1}$) for all five approaches. Then we can focus on the information measure quantities, and compare these bounds as shown in Fig. 1. Here the inequalities given in black were proved previously (see [9] and [11]). Since the common function $\Lambda^{*-1}$ is non-decreasing, the inequality "CIMI $\ge$ ICIMI" follows from Lemma 2. The inequality "IMI $\ge$ ICIMI" is implied by the following lemma for the same reason.

Lemma 3.

For any $i \in \{1, \ldots, n\}$, we have $I(W; U_i \mid \tilde{Z}_i) \le I(W; Z_i)$.

Proof of Lemma 3.

First, $\tilde{Z}_{U_i,i}$ and $Z_i$ are both the $i$-th training sample at the input of the algorithm, thus

$$I(W; Z_i) = I(W; \tilde{Z}_{U_i,i}). \qquad (39)$$

Then, since $(U_i, \tilde{Z}_i)$ and $W$ are conditionally independent given $\tilde{Z}_{U_i,i}$,

$$I(W; U_i \mid \tilde{Z}_i) \le I(W; U_i, \tilde{Z}_i) \qquad (40)$$
$$I(W; U_i, \tilde{Z}_i) \le I(W; \tilde{Z}_{U_i,i}). \qquad (41)$$

It follows that

$$I(W; U_i \mid \tilde{Z}_i) \le I(W; Z_i), \qquad (42)$$

which concludes the proof. ∎

The inverse Fenchel conjugate functions may indeed be different for different bounds; thus, although the above comparison suggests certain dominance relations, it is not clear, for any specific problem, whether any particular bound is tighter than the others. This is particularly true if we use the bounds based on the inverse Fenchel conjugate; however, even for the special case of bounded loss functions, the different multiplicative factors and the sum-square-root forms imply that the relation can be less clear.

IV-F Revisiting the example

We now return to the problem of estimating the Gaussian mean, and show that the proposed ICIMI bound can provide scaling behavior similar to that of IMI, thus order-wise stronger than the CMI and CIMI bounds. In fact, the bound is also strictly better than the IMI bound given in [9] asymptotically in this setting.

We first formally establish, as suspected previously, that the CMI and CIMI bounds are at least of constant order in this setting; the proof can be found in the appendix.

Proposition 1.

The strengthened CMI and CIMI bounds, i.e., (37) and (38), are at least of constant order in the problem of estimating the Gaussian mean.

The next proposition establishes a generalization error bound based on the ICIMI bound in this setting.

Proposition 2.

For the problem of estimating the mean of the Gaussian distribution, the ICIMI bound gives an explicit generalization error bound (43) of order $O(\sigma^2/\sqrt{n})$.

Remark: This bound scales as $1/\sqrt{n}$. Compared to the IMI bound in (17), the new ICIMI based bound is asymptotically tighter by a constant factor.

Proposition 2 is proved by studying separately the sample-conditioned individual mutual information $I^{\tilde{Z}_i}(W; U_i)$ and the inverse Fenchel conjugate functions $\Lambda^{*-1}_{\bar{F}_i|\tilde{Z}_i}$. For the former, since the algorithm here averages the samples without any prior on the Gaussian distribution, we can assume without loss of generality that the mean of the Gaussian distribution is $0$, i.e., $Z_i \sim \mathcal{N}(0, \sigma^2)$. Therefore, given $\tilde{Z}_i = (\tilde{z}_{+1,i}, \tilde{z}_{-1,i})$, $W$ is mixed-Gaussian distributed: it follows $\mathcal{N}(\tilde{z}_{+1,i}/n, \frac{n-1}{n^2}\sigma^2)$ when $U_i = +1$ and follows $\mathcal{N}(\tilde{z}_{-1,i}/n, \frac{n-1}{n^2}\sigma^2)$ when $U_i = -1$. The term $I^{\tilde{Z}_i}(W; U_i)$ is thus related to the scaling behavior of the differential entropy of a mixed-Gaussian distribution, which the following lemma makes more precise.

Lemma 4.

Let $V$ be a Rademacher random variable and $X$ be a mixed-Gaussian random variable, such that $X \sim \mathcal{N}(\mu_1, s^2)$ when $V = +1$, and $X \sim \mathcal{N}(\mu_2, s^2)$ when $V = -1$. We have

(44)
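A numerical sketch of the quantity in Lemma 4 (plain NumPy; the choice of means ±1/n and standard deviation √(n−1)/n mimics a typical realization of the i-th column in the example with σ = 1). It illustrates that this sample-conditioned mutual information vanishes as n grows, in contrast with the log 2 obtained when conditioning on the full table.

```python
import numpy as np

def mixture_mi(mu1, mu2, s, grid=200001):
    """I(V; X) in nats for V ~ Rademacher, X|V=+1 ~ N(mu1, s^2), X|V=-1 ~ N(mu2, s^2)."""
    x = np.linspace(min(mu1, mu2) - 10 * s, max(mu1, mu2) + 10 * s, grid)
    p1 = np.exp(-(x - mu1) ** 2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))
    p2 = np.exp(-(x - mu2) ** 2 / (2 * s * s)) / (s * np.sqrt(2 * np.pi))
    m = 0.5 * (p1 + p2)
    integrand = 0.5 * p1 * np.log(p1 / m) + 0.5 * p2 * np.log(p2 / m)
    return float(np.sum(integrand) * (x[1] - x[0]))   # Riemann-sum approximation of the integral

for n in [10, 100, 1000]:
    print(n, mixture_mi(1.0 / n, -1.0 / n, np.sqrt(n - 1) / n), np.log(2))
```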

The next lemma gives an upper bound on the inverse Fenchel conjugate functions.

Lemma 5.

For the problem of estimating the mean of the Gaussian distribution, and any realization of with ,

where

(45)

and for ,

(46)

The proofs of these two lemmas are relegated to the appendix. With these lemmas, Proposition 2 can be proved as follows.

Proof of Proposition 2.

First by Lemma 4, we have

(47)

Then Theorem 5 and Lemma 5 imply

(48)
(49)
(50)

which proves the proposition. ∎

IV-G Proof of the CD lemma

Proof of Lemma 1.

The definition of the sample-conditioned cumulant generating function implies that

$$\log \mathbb{E}\big[e^{\lambda \bar{F}} \,\big|\, T = t\big] = \lambda\, \mathbb{E}[\bar{F} \mid T = t] + \Lambda_{\bar{F}|t}(\lambda). \qquad (51)$$

By the Donsker–Varadhan variational representation of KL divergence, for any $\lambda \in \mathbb{R}$ and any realization $T = t$,

$$I^{t}(X; Y) \ge \lambda\, \mathbb{E}[F \mid T = t] - \log \mathbb{E}_{P_{X|T=t} \otimes P_{Y|T=t}}\big[e^{\lambda g(X, Y, t)}\big] \qquad (52)$$
$$\phantom{I^{t}(X; Y)} = \lambda\, \mathbb{E}[F \mid T = t] - \log \mathbb{E}\big[e^{\lambda \bar{F}} \,\big|\, T = t\big], \qquad (53)$$

where the equality is due to (26). It follows that for $\lambda > 0$,

$$\mathbb{E}[F \mid T = t] - \mathbb{E}[\bar{F} \mid T = t] \le \frac{I^{t}(X; Y) + \Lambda_{\bar{F}|t}(\lambda)}{\lambda}, \qquad (54)$$
$$\mathbb{E}[F \mid T = t] - \mathbb{E}[\bar{F} \mid T = t] \le \Lambda^{*-1}_{\bar{F}|t}\big(I^{t}(X; Y)\big). \qquad (55)$$

Moreover

(56)
(57)
(58)
(59)

where the last inequality is by exchanging the order of expectation and infimum. Similarly, since