
Strengthened Information-theoretic Bounds on the Generalization Error

by   Ibrahim Issa, et al.

The following problem is considered: given a joint distribution P_XY and an event E, bound P_XY(E) in terms of P_XP_Y(E) (where P_XP_Y is the product of the marginals of P_XY) and a measure of dependence of X and Y. Such bounds have direct applications in the analysis of the generalization error of learning algorithms, where E represents a large error event and the measure of dependence controls the degree of overfitting. Herein, bounds are demonstrated using several information-theoretic metrics, in particular: mutual information, lautum information, maximal leakage, and J_∞. The mutual information bound can outperform comparable bounds in the literature by an arbitrarily large factor.





I Introduction

One of the main challenges in designing learning algorithms is guaranteeing that they generalize well [3, 4, 5, 6]. The analysis is made especially hard by the fact that, in order to handle large data sets, learning algorithms are typically adaptive. A recent line of work initiated by Dwork et al. [7, 8, 9] shows that differentially private algorithms provide generalization guarantees. More recently, Russo and Zou [10], and Xu and Raginsky [11], provided an information-theoretic framework for this problem, and showed that the mutual information (between the input and output of the learning algorithm) can be used to bound the generalization error, under a certain assumption. Jiao et al. [12] and Issa and Gastpar [2] relaxed this assumption and provided new bounds using new information-theoretic measures.

The aforementioned papers focus mainly on the expected generalization error. In this paper, we study instead the probability of an undesirable event (e.g., large generalization error in the learning setting). In particular, given an event E and a joint distribution P_XY, we bound P_XY(E) in terms of P_XP_Y(E) (where P_XP_Y is the product of the marginals of P_XY) and a measure of dependence between X and Y.

Bassily et al. [13] and Feldman and Steinke [14] provide a bound of this form, where the measure of dependence is the mutual information I(X;Y). We present a new bound in terms of mutual information, which can outperform theirs by an arbitrarily large factor. Moreover, we prove a new bound using lautum information (a measure introduced by Palomar and Verdú [1]). We demonstrate two further bounds using maximal leakage [15, 16] and J_∞ (which was recently introduced by Issa and Gastpar [2]). One advantage of the latter two bounds is that they can be computed using only a partial description of the joint P_XY, hence they are more amenable to analysis.

II Main Results


Let P_XY be a joint probability distribution on alphabets X × Y, and let E ⊆ X × Y be some (“undesirable”) event. We want to bound P_XY(E) in terms of P_XP_Y(E) (where P_XP_Y is the product of the marginals induced by the joint P_XY) and a measure of dependence between X and Y.

II-A KL Divergence/Mutual Information Bounds

Consider the following intermediate problem: let P and Q be two probability distributions on a common alphabet, and let E be some event. We will bound P(E) in terms of Q(E) and D(P‖Q). Then, by replacing P with P_XY and Q with P_XP_Y, we get a bound for our desired setup in terms of the mutual information I(X;Y) = D(P_XY‖P_XP_Y).

Proposition 1

Given q ∈ [0,1), define f_q : [q,1] → ℝ as f_q(p) := d(p‖q) = p log(p/q) + (1−p) log((1−p)/(1−q)). Then, f_q is a strictly increasing function of p. Given any event E and pair of distributions P and Q with Q(E) ≤ P(E),

P(E) ≤ f_{Q(E)}^{−1}( D(P‖Q) ).   (1)

In particular, given an event E and a joint distribution P_XY satisfying P_XP_Y(E) ≤ P_XY(E),

P_XY(E) ≤ f_{P_XP_Y(E)}^{−1}( I(X;Y) ).
Note that f_q′(p) = log( p(1−q) / (q(1−p)) ) > 0 for p > q, hence f_q is strictly increasing. Moreover, the range of f_q is [0, log(1/q)], so (1) is well defined.

If D(P‖Q) ≥ log(1/Q(E)), then (1) holds trivially since f_{Q(E)}^{−1}(D(P‖Q)) = 1 by the definition of f_{Q(E)}. Otherwise, if P(E) ≥ Q(E), then f_{Q(E)}(P(E)) = d(P(E)‖Q(E)) ≤ D(P‖Q), where the second inequality follows from the data processing inequality. Since f_{Q(E)} is strictly increasing, then so is f_{Q(E)}^{−1}. Hence P(E) ≤ f_{Q(E)}^{−1}(D(P‖Q)).

The bound above is tight in the following sense. Let g be such that, given any alphabet and event E, and any two distributions P and Q on that alphabet, P(E) ≤ g(Q(E), D(P‖Q)). Then g(q, R) ≥ f_q^{−1}(R) if R ≤ log(1/q). This is true since given any tuple (q, R) such that R ≤ log(1/q), there exists a pair (P, Q) such that Q(E) = q, D(P‖Q) = R, and (1) holds with equality. In particular, choose a binary alphabet, E = {1}, P = Bernoulli(f_q^{−1}(R)), and Q = Bernoulli(q).
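The inverse appearing in Proposition 1 can be evaluated numerically: the binary relative entropy p ↦ d(p‖q) is strictly increasing on [q, 1], so bisection suffices. A minimal Python sketch (helper names are mine, not the paper's; natural logarithms; assumes 0 < q < 1):

```python
import math

def bin_kl(p, q):
    """Binary relative entropy d(p||q) in nats, with 0*log 0 = 0.

    Assumes 0 < q < 1 and 0 <= p <= 1.
    """
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def prop1_bound(q, div, tol=1e-12):
    """f_q^{-1}(div): largest p in [q, 1] with d(p||q) <= div.

    Found by bisection, since p -> d(p||q) is strictly increasing on [q, 1].
    """
    if div >= bin_kl(1.0, q):  # f_q(1) = log(1/q): the bound is vacuous
        return 1.0
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bin_kl(mid, q) <= div:
            lo = mid
        else:
            hi = mid
    return hi
```

For instance, `prop1_bound(0.01, 0.05)` upper-bounds P_XY(E) when P_XP_Y(E) = 0.01 and I(X;Y) = 0.05 nats.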

However, there is no closed form for the bound in (1). The following corollary provides an upper bound in closed form:

Corollary 1

Given , define and as follows:

Then, is concave and non-decreasing in . Moreover, given any event E and pair of distributions P and Q with Q(E) ≤ P(E),


In particular, given an event E and a joint P_XY satisfying P_XP_Y(E) ≤ P_XY(E),


where in (a) and , and (b) follows from the fact that .


Since is concave in and the square root is concave and non-decreasing, is concave in ; hence is concave in . To show that it is non-decreasing, consider the derivative (ignoring the positive denominator):

For , both terms are non-negative (the first is non-negative since ). For , the numerator of the second term is negative and decreasing, and the denominator is positive and decreasing. Hence, it achieves its minimum for . Since the minimum , we get that for .

Now, let and . Then we can rewrite the inequality as


where h(·) is the binary entropy function (in nats). Upper-bounding the entropy term, we get

For ease of notation, let and be the left-hand side. Then,


Hence, there exists such that is decreasing on and increasing on . Therefore, admits at most two solutions, say , and . It remains to solve


Let , and . We get


The discriminant of (9) is given by

where the inequality follows from the fact that . Hence, the larger root of (9) is given by , as desired.

II-A1 Comparison with existing bound

It has been shown [14, Lemma 3.11][13, Lemma 9] that

P_XY(E) ≤ ( I(X;Y) + log 2 ) / log( 1/P_XP_Y(E) ).   (10)
The bound in Corollary 1 can be arbitrarily smaller than (10). That is, let be the left-hand side of (10) and consider the calculation shown at the top of the next page.

Moreover, one can derive a family of bounds in the form of (10) using the Donsker–Varadhan characterization of the KL divergence. In particular,

I(X;Y) = D(P_XY‖P_XP_Y) = sup_f { E_{P_XY}[f(X,Y)] − log E_{P_XP_Y}[e^{f(X,Y)}] },   (11)

where the supremum is over functions f for which both expectations are finite.
Now, let f = λ 1_E for some λ > 0, where 1_E is the indicator function of E. After rearranging terms, we get

P_XY(E) ≤ ( I(X;Y) + log( 1 − P_XP_Y(E) + P_XP_Y(E) e^λ ) ) / λ.   (12)

Choosing λ = log(1/P_XP_Y(E)), we slightly improve (10) by replacing log 2 with log(2 − P_XP_Y(E)). In fact, we can solve the infimum over λ > 0 of the right-hand side of (12). In particular, by [17, Lemma 2.4], the infimum is given by (ψ*)^{−1}(I(X;Y)), where ψ* is the convex conjugate of ψ(λ) := log(1 − q + q e^λ) with q := P_XP_Y(E), and (ψ*)^{−1}(t) := inf{ p ≥ 0 : ψ*(p) > t }. (Lemma 2.4 of [17] assumes ψ′(0) = 0, but the proof goes through as is for ψ′(0) > 0, which is the case here.) It turns out that ψ* is given by

ψ*(p) = d(p‖q) for p ∈ (q, 1], and ψ*(p) = 0 for p ∈ [0, q].   (13)

Hence (ψ*)^{−1}(I(X;Y)) is the value p ∈ (q, 1] satisfying d(p‖q) = I(X;Y) (when it exists). That is, the bound derived from (13) exactly recovers Proposition 1.
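The Donsker–Varadhan family is straightforward to optimize numerically. A sketch (variable names are mine: `q` stands for P_XP_Y(E), `mi` for I(X;Y)), assuming the family has the form P_XY(E) ≤ (I(X;Y) + log(1 − q + q·e^λ))/λ for λ > 0, which follows from choosing f = λ·1_E:

```python
import math

def dv_event_bound(q, mi, lam):
    """Bound on P_XY(E) from Donsker-Varadhan with f = lam * 1_E:
    (mi + log(1 - q + q*exp(lam))) / lam, where q = P_XP_Y(E)."""
    return (mi + math.log(1.0 - q + q * math.exp(lam))) / lam

q, mi = 0.01, 0.05
# The choice lam = log(1/q) turns the bound into (mi + log(2 - q)) / log(1/q),
# i.e., it replaces the additive log 2 with the smaller log(2 - q):
closed_form = (mi + math.log(2.0 - q)) / math.log(1.0 / q)
# Optimizing over lam on a grid can only improve on any fixed choice:
best = min(dv_event_bound(q, mi, l / 100.0) for l in range(1, 3001))
```

The grid minimum `best` approaches the inverse-KL value of Proposition 1 as the grid is refined.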

Furthermore, we can compare with the “mutual information bound” of Russo and Zou [10], and Xu and Raginsky [11]. In particular, by considering f = λ 1_E for λ > 0 in (11), we get

λ ( P_XY(E) − P_XP_Y(E) ) ≤ I(X;Y) + λ²/8,

where we used (11) together with the fact that 1_E(X,Y) is (1/2)-subgaussian under P_XP_Y (which is true for any random variable whose support has length 1). Since the above inequality holds for any λ > 0, we get

P_XY(E) ≤ P_XP_Y(E) + √( I(X;Y)/2 ).   (14)

Evidently, Corollary 1 can outperform (14) since (for finite I(X;Y)) the bound of Corollary 1 vanishes as P_XP_Y(E) → 0, whereas the right-hand side of (14) tends to √(I(X;Y)/2) > 0. In Figure 1, we plot the 3 bounds (equations (6), (10), and (14)) for a given range of interest: small P_XP_Y(E), and relatively small I(X;Y), e.g., proportional to .
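The gap can be checked numerically. A sketch under my reading of the two bounds (the Pinsker-type bound (14) as P_XP_Y(E) + √(I(X;Y)/2), and Proposition 1's bound as the inverse of the binary relative entropy): as P_XP_Y(E) shrinks with I(X;Y) fixed, the ratio between the two grows without bound.

```python
import math

def bin_kl(p, q):
    """Binary relative entropy d(p||q) in nats (assumes 0 < q < 1)."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def inv_kl_bound(q, mi):
    """Largest p in [q, 1] with d(p||q) <= mi, by bisection."""
    if mi >= bin_kl(1.0, q):
        return 1.0
    lo, hi = q, 1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if bin_kl(mid, q) <= mi else (lo, mid)
    return hi

def pinsker_bound(q, mi):
    """Pinsker-type bound: q + sqrt(mi / 2)."""
    return q + math.sqrt(mi / 2.0)

# As q -> 0 with I(X;Y) fixed at 0.01 nats, the inverse-KL bound vanishes
# while the Pinsker-type bound stays near sqrt(I/2): the ratio diverges.
ratios = [pinsker_bound(q, 0.01) / inv_kl_bound(q, 0.01)
          for q in (1e-2, 1e-4, 1e-6)]
```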

Fig. 1: Comparison of bounds
Remark 1

Given the form of the 3 bounds, one might expect that (14) outperforms the other two for large values of I(X;Y). This is in fact not true because the range of interest for the right-hand sides is restricted to [0,1]. For instance, for small P_XP_Y(E) and I(X;Y) ≥ 2, the bound in (14) is trivial (at least 1), and the other two bounds are strictly less than 1.

II-A2 Bound using D(P_XP_Y‖P_XY)/Lautum Information

By considering the data processing inequality d(Q(E)‖P(E)) ≤ D(Q‖P), we can bound P(E) in terms of Q(E) and D(Q‖P).

Theorem 1

Given any event E and a pair of distributions P and Q, if , then

In particular, given an event E and a joint distribution P_XY with ,


where L(X;Y) := D(P_XP_Y‖P_XY) is the lautum information [1].


Set q := Q(E) and p := P(E). As in (6), we can rewrite d(q‖p) ≤ D(Q‖P) as


Since (by assumption), we can drop the first term of the left-hand side. Rearranging the inequality then yields Theorem 1.
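Both dependence measures in this subsection are KL divergences between the joint and the product of its marginals, in opposite order: I(X;Y) = D(P_XY‖P_XP_Y) and the lautum information L(X;Y) = D(P_XP_Y‖P_XY) [1]. A small sketch for finite pmfs (helper name mine; assumes strictly positive entries):

```python
import math

def mi_and_lautum(joint):
    """Mutual information D(P_XY || P_X P_Y) and lautum information
    D(P_X P_Y || P_XY), both in nats, for a joint pmf given as a 2D list.

    Assumes all entries of `joint` are strictly positive.
    """
    nx, ny = len(joint), len(joint[0])
    px = [sum(row) for row in joint]                       # marginal of X
    py = [sum(row[j] for row in joint) for j in range(ny)]  # marginal of Y
    mi = sum(joint[i][j] * math.log(joint[i][j] / (px[i] * py[j]))
             for i in range(nx) for j in range(ny))
    lautum = sum(px[i] * py[j] * math.log(px[i] * py[j] / joint[i][j])
                 for i in range(nx) for j in range(ny))
    return mi, lautum
```

For the correlated joint `[[0.4, 0.1], [0.1, 0.4]]` both quantities are strictly positive; for a product distribution, both vanish.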

Moreover, we can derive a family of bounds similar to (12) by considering the Donsker–Varadhan representation of D(P_XP_Y‖P_XY):


Now, let f = λ 1_E for some λ > 0. Then after rearranging terms, we get for any λ > 0,


II-B Maximal Leakage Bound

The bounds presented so far in (4) and (15) do not take into account the specific relation between P_XY and P_XP_Y, namely that of a joint distribution and the product of its marginals. Indeed, they are applications of a more general bound that holds for an arbitrary pair of distributions (Corollary 1 and Theorem 1). The following bound does not fall under this category, i.e., it applies only to pairs of distributions forming a joint and the product of its marginals.

Theorem 2

Given β ∈ [0,1], finite alphabets X and Y, a joint distribution P_XY, and an event E ⊆ X × Y such that P_Y(E_x) ≤ β for all x ∈ X, where E_x := {y ∈ Y : (x,y) ∈ E}, then


where L(X→Y) := log Σ_{y∈Y} max_{x: P_X(x)>0} P_{Y|X}(y|x) is the maximal leakage.

Remark 2

The bound holds more generally but we restrict our attention to finite alphabets to make the presentation of the proof simple.

Proof: Fix satisfying , and consider the pair of distributions and :

where the equalities follow from [18, Theorem 6]. Hence,


where (a) follows from the following (readily verifiable) facts:

One advantage of the bound of Theorem 2 is that it depends on only a partial description of P_XY: the maximal leakage is a function of the conditional P_{Y|X} alone. Hence, it is simpler to analyze than the mutual information bounds. Moreover, for fixed β, the bound is convex in L(X→Y). In the next subsection, we present a bound with similar properties.
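On finite alphabets, maximal leakage has the closed form L(X→Y) = log Σ_y max_{x:P_X(x)>0} P_{Y|X}(y|x) [16], which makes the point concrete: it depends on P_XY only through the channel P_{Y|X}. A minimal sketch (assumes every listed input has positive probability):

```python
import math

def maximal_leakage(channel):
    """L(X -> Y) = log sum_y max_x P_{Y|X}(y|x), in nats.

    `channel[x][y]` holds P_{Y|X}(y|x); each row is one conditional pmf.
    Assumes every row corresponds to an x with P_X(x) > 0.
    """
    n_outputs = len(channel[0])
    return math.log(sum(max(row[y] for row in channel)
                        for y in range(n_outputs)))
```

A noiseless channel over n inputs leaks log n nats; a channel with identical rows (Y independent of X) leaks nothing.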

II-C J_∞-Bound

Theorem 3

Given β ∈ [0,1], finite alphabets X and Y, a joint distribution P_XY, and an event E ⊆ X × Y such that P_Y(E_x) ≤ β for all x ∈ X, where E_x := {y ∈ Y : (x,y) ∈ E}, then


where J_∞(X;Y) is the measure introduced in [2].


The theorem follows from Theorem 1 and Corollary 1 of [2]. In particular, following the same proof steps as in [2], one can show that for any function (in [2], the authors consider , , and ; nevertheless, the proof of (21) remains the same),


where . Now, set . Then, , , and . Moreover,

where the last inequality follows from the assumption that . Then, it follows from (21) that


The theorem follows by noting that .


  • [1] D. P. Palomar and S. Verdú, “Lautum information,” IEEE transactions on information theory, vol. 54, no. 3, pp. 964–975, 2008.
  • [2] I. Issa and M. Gastpar, “Computable bounds on the exploration bias,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), June 2018, pp. 576–580.
  • [3] J. P. Ioannidis, “Why most published research findings are false,” PLoS medicine, vol. 2, no. 8, p. e124, 2005.
  • [4] J. P. Simmons, L. D. Nelson, and U. Simonsohn, “False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant,” Psychological science, vol. 22, no. 11, pp. 1359–1366, 2011.
  • [5] M. Hardt and J. Ullman, “Preventing false discovery in interactive data analysis is hard,” in Proc. IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), 2014, pp. 454–463.
  • [6] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability and uniform convergence,” Journal of Machine Learning Research, vol. 11, pp. 2635–2670, 2010.
  • [7] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth, “Preserving statistical validity in adaptive data analysis,” in Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. ACM, 2015, pp. 117–126.
  • [8] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “Generalization in adaptive data analysis and holdout reuse,” CoRR, vol. abs/1506.02629, 2015. [Online]. Available:
  • [9] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman, “Algorithmic Stability for Adaptive Data Analysis,” ArXiv e-prints, Nov. 2015.
  • [10] D. Russo and J. Zou, “Controlling bias in adaptive data analysis using information theory,” in Artificial Intelligence and Statistics, 2016, pp. 1232–1240.
  • [11] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” arXiv preprint arXiv:1705.07809, 2017.
  • [12] J. Jiao, Y. Han, and T. Weissman, “Dependence measures bounding the exploration bias for general measurements,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT).   IEEE, June 2017, pp. 1475–1479.
  • [13] R. Bassily, S. Moran, I. Nachum, J. Shafer, and A. Yehudayoff, “Learners that use little information,” in Proceedings of Algorithmic Learning Theory, ser. Proceedings of Machine Learning Research, F. Janoos, M. Mohri, and K. Sridharan, Eds., vol. 83.   PMLR, 07–09 Apr 2018, pp. 25–55. [Online]. Available:
  • [14] V. Feldman and T. Steinke, “Calibrating noise to variance in adaptive data analysis,” arXiv preprint arXiv:1712.07196, 2017.
  • [15] C. Braun, K. Chatzikokolakis, and C. Palamidessi, “Quantitative notions of leakage for one-try attacks,” Electronic Notes in Theoretical Computer Science, vol. 249, pp. 75–91, 2009.
  • [16] I. Issa, S. Kamath, and A. B. Wagner, “An operational measure of information leakage,” in Proc. of 50th Ann. Conf. on Information Sciences and Systems (CISS), Mar. 2016.
  • [17] S. Boucheron, G. Lugosi, and P. Massart, Concentration inequalities: A nonasymptotic theory of independence.   Oxford university press, 2013.
  • [18] T. van Erven and P. Harremoës, “Rényi divergence and Kullback-Leibler divergence,” IEEE Trans. Inf. Theory, vol. 60, no. 7, pp. 3797–3820, July 2014.