I. Introduction
One of the main challenges in designing learning algorithms is guaranteeing that they generalize well [3, 4, 5, 6]. The analysis is made especially hard by the fact that, in order to handle large data sets, learning algorithms are typically adaptive. A recent line of work initiated by Dwork et al. [7, 8, 9] shows that differentially private algorithms provide generalization guarantees. More recently, Russo and Zou [10], and Xu and Raginsky [11], provided an information-theoretic framework for this problem and showed that the mutual information (between the input and output of the learning algorithm) can be used to bound the generalization error, under a certain assumption. Jiao et al. [12] and Issa and Gastpar [2] relaxed this assumption and provided new bounds using new information-theoretic measures.
The aforementioned papers focus mainly on the expected generalization error. In this paper, we study instead the probability of an undesirable event (e.g., large generalization error in the learning setting). In particular, given an event $E$ and a joint distribution $P_{XY}$, we bound $P_{XY}(E)$ in terms of $P_XP_Y(E)$ (where $P_XP_Y$ is the product of the marginals of $P_{XY}$) and a measure of dependence between $X$ and $Y$.
Bassily et al. [13] and Feldman and Steinke [14] provide a bound of this form, where the measure of dependence is the mutual information $I(X;Y)$. We present a new bound in terms of mutual information, which can outperform theirs by an arbitrarily large factor. Moreover, we prove a new bound using lautum information (a measure introduced by Palomar and Verdú [1]). We demonstrate two further bounds using maximal leakage [15, 16] and $J_\infty(X;Y)$ (which was recently introduced by Issa and Gastpar [2]). One advantage of the latter two bounds is that they can be computed using only a partial description of the joint $P_{XY}$, hence they are more amenable to analysis.
II. Main Results
Let $P_{XY}$ be a joint probability distribution on alphabets $\mathcal{X} \times \mathcal{Y}$, and let $E \subseteq \mathcal{X} \times \mathcal{Y}$ be some ("undesirable") event. We want to bound $P_{XY}(E)$ in terms of $P_XP_Y(E)$ (where $P_XP_Y$ is the product of the marginals induced by the joint $P_{XY}$) and a measure of dependence between $X$ and $Y$.

II-A KL Divergence/Mutual Information Bounds
Consider the following intermediate problem: let $P$ and $Q$ be two probability distributions on an alphabet $\mathcal{Z}$, and let $E \subseteq \mathcal{Z}$ be some event. We will bound $P(E)$ in terms of $Q(E)$ and $D(P\|Q)$. Then, by replacing $P$ with $P_{XY}$ and $Q$ with $P_XP_Y$, we get a bound for our desired setup in terms of the mutual information $I(X;Y) = D(P_{XY}\|P_XP_Y)$.
Proposition 1
Given $q \in [0,1)$, define $f_q : [q,1] \to [0, \ln(1/q)]$ as $f_q(x) = d(x\,\|\,q)$, where $d(\cdot\|\cdot)$ denotes the binary KL divergence (in nats). Then, $f_q$ is a strictly increasing function of $x$. Given any event $E$ and pair of distributions $P$ and $Q$ with $Q(E) = q$,
$P(E) \le f_q^{-1}\big(D(P\,\|\,Q)\big)$. (1)
In particular, given an event $E$ and a joint distribution $P_{XY}$ satisfying $P_XP_Y(E) = q$,
$P_{XY}(E) \le f_q^{-1}\big(I(X;Y)\big)$. (2)
Proof:
Note that $f_q'(x) = \ln\frac{x(1-q)}{q(1-x)} > 0$ for $x \in (q,1)$, hence $f_q$ is strictly increasing. Moreover, the range of $f_q$ is $[0, \ln(1/q)]$, so (1) is well defined.
If $P(E) \le q$, then (1) holds trivially since $f_q^{-1}\big(D(P\|Q)\big) \ge q$ by the definition of $f_q$. Otherwise, if $P(E) > q$, then $f_q(P(E)) = d\big(P(E)\,\|\,Q(E)\big) \le D(P\|Q)$, where the inequality follows from the data processing inequality. Since $f_q$ is strictly increasing, then so is $f_q^{-1}$. Hence $P(E) \le f_q^{-1}\big(D(P\|Q)\big)$.
The bound above is tight in the following sense. Let $g : [0,1) \times [0,\infty) \to [0,1]$ be such that, given any alphabet $\mathcal{Z}$ and event $E \subseteq \mathcal{Z}$, and any two distributions $P$ and $Q$ on $\mathcal{Z}$, $P(E) \le g\big(Q(E), D(P\|Q)\big)$. Then $g(q,r) \ge f_q^{-1}(r)$ if $r \le \ln(1/q)$. This is true since, given any tuple $(q,r)$ such that $r \le \ln(1/q)$, there exists a triple $(P, Q, E)$ such that $Q(E) = q$, $D(P\|Q) = r$, and (1) holds with equality. In particular, choose $\mathcal{Z} = \{0,1\}$, $E = \{1\}$, $Q(\{1\}) = q$, and $P(\{1\}) = f_q^{-1}(r)$.
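Since $f_q^{-1}$ is typically evaluated numerically, a minimal sketch may help. It assumes our reconstruction $f_q(x) = d(x\,\|\,q)$ (binary KL divergence in nats) and uses bisection, which is valid because $f_q$ is strictly increasing on $[q,1]$; the function names are ours.

```python
import math

def binary_kl(p, q):
    """d(p||q) in nats, with the convention 0*log 0 = 0."""
    if p == 0.0:
        return math.log(1.0 / (1.0 - q))
    if p == 1.0:
        return math.log(1.0 / q)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def f_q_inverse(q, r, tol=1e-12):
    """Largest x in [q, 1] with d(x||q) <= r, i.e. f_q^{-1}(r) for r in [0, ln(1/q)].

    Bisection is valid because f_q is strictly increasing on [q, 1]."""
    if r >= binary_kl(1.0, q):  # r at or beyond the top of the range of f_q
        return 1.0
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_kl(mid, q) <= r:
            lo = mid
        else:
            hi = mid
    return lo
```

For instance, `f_q_inverse(0.1, 0.5)` returns the largest $p$ with $d(p\,\|\,0.1) \le 0.5$, i.e., the value of the bound (1) when $Q(E) = 0.1$ and $D(P\|Q) = 0.5$ nats.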
However, there is no closed form for the bound in (1). The following corollary provides an upper bound in closed form:
Corollary 1
Given $q \in [0,1)$, define the function $g_q$ as follows:
Then, $g_q$ is concave and nondecreasing in its argument. Moreover, given any event $E$ and pair of distributions $P$ and $Q$ with $Q(E) = q$,
$P(E) \le g_q\big(D(P\,\|\,Q)\big)$. (3)
In particular, given an event $E$ and a joint $P_{XY}$ satisfying $P_XP_Y(E) = q$,
$P_{XY}(E) \le g_q\big(I(X;Y)\big)$ (4)
(5) 
where in (a) and , and (b) follows from the fact that .
Proof:
Since is concave in and the square root is concave and nondecreasing, is concave in ; hence is concave in . To show that it is nondecreasing, consider the derivative (ignoring the positive denominator):
For , both terms are nonnegative (the first is nonnegative since ). For , the numerator of the second term is negative and decreasing, and the denominator is positive and decreasing. Hence, it achieves its minimum for . Since the minimum , we get that for .
Now, let and . Then we can rewrite the inequality as
(6) 
where is the binary entropy function (in nats). Upperbounding , we get
For ease of notation, let and be the left-hand side. Then,
(7) 
Hence, there exists such that is decreasing on and increasing on . Therefore, admits at most two solutions, say , and . It remains to solve
(8) 
Let , and . We get
(9) 
The discriminant of (9) is given by
where the inequality follows from the fact that . Hence, the larger root of (9) is given by , as desired.
II-A.1 Comparison with existing bound
It has been shown [14, Lemma 3.11], [13, Lemma 9] that
$P_{XY}(E) \le \dfrac{I(X;Y) + \ln 2}{\ln(1/q)}$. (10)
The bound in Corollary 1 can be arbitrarily smaller than (10). That is, let denote the left-hand side of (10) and consider the calculation shown at the top of the next page.
Moreover, one can derive a family of bounds of the form of (10) using the Donsker-Varadhan characterization of the KL divergence. In particular,
$I(X;Y) = D(P_{XY}\,\|\,P_XP_Y) \ge \mathbb{E}_{P_{XY}}[f(X,Y)] - \ln \mathbb{E}_{P_XP_Y}\!\big[e^{f(X,Y)}\big]$, for any measurable $f$. (11)
Now, let $f = t\,\mathbb{1}_E$ for some $t > 0$, where $\mathbb{1}_E$ is the indicator function of $E$. After rearranging terms, we get
$P_{XY}(E) \le \dfrac{I(X;Y) + \ln\!\big(1 + q\,(e^t - 1)\big)}{t}$. (12)
Choosing $t = \ln(1/q)$, we slightly improve (10) by replacing $\ln 2$ with $\ln(2-q)$. In fact, we can solve the infimum over $t$ of the right-hand side of (12). In particular, by [17, Lemma 2.4], the infimum is given by , where is the convex conjugate of (Lemma 2.4 of [17] assumes , but the proof goes through as is for , which is the case here), and . It turns out that is given by
(13) 
Now, . Hence, for , . By noting that , we get for any , . Finally, for , we get , which is equal to satisfying . That is, the bound derived from (13) exactly recovers Proposition 1.
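For concreteness, the family of bounds can also be evaluated numerically. The sketch below assumes our reconstruction of (12), namely $P_{XY}(E) \le \big(I(X;Y) + \ln(1+q(e^t-1))\big)/t$ for every $t>0$; the function names and example values are ours. It compares the fixed choice $t = \ln(1/q)$, which yields the $\ln(2-q)$ refinement of (10), against a crude grid search over $t$.

```python
import math

def dv_bound(i_xy, q, t):
    """Reconstructed right-hand side of (12): (I(X;Y) + ln(1 + q(e^t - 1))) / t."""
    return (i_xy + math.log(1.0 + q * (math.exp(t) - 1.0))) / t

def dv_bound_opt(i_xy, q, t_max=50.0, steps=5000):
    """Crude grid search approximating the infimum of (12) over t > 0."""
    ts = (t_max * (k + 1) / steps for k in range(steps))
    return min(dv_bound(i_xy, q, t) for t in ts)

i_xy, q = 0.5, 0.01                           # hypothetical values of I(X;Y) and q
fixed = dv_bound(i_xy, q, math.log(1 / q))    # the choice t = ln(1/q)
best = dv_bound_opt(i_xy, q)                  # optimizing t does at least as well
```

Optimizing over $t$ can only improve on any fixed choice, which is what the grid search illustrates.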
Furthermore, we can compare with the "mutual information bound" of Russo and Zou [10], and Xu and Raginsky [11]. In particular, by considering $f = t\,\mathbb{1}_E$ for $t > 0$ in (11), we get

$I(X;Y) \ge t\,P_{XY}(E) - \ln \mathbb{E}_{P_XP_Y}\!\big[e^{t\mathbb{1}_E}\big] \ge t\big(P_{XY}(E) - q\big) - t^2/8,$

where the second inequality follows from the fact that $\mathbb{1}_E$ is $1/2$-sub-Gaussian (which is true for any random variable whose support has length 1). Since the above inequality holds for any $t > 0$, we get

$P_{XY}(E) \le q + \sqrt{I(X;Y)/2}$. (14)
Evidently, Corollary 1 can outperform (14) since the former never exceeds 1 (for finite $I(X;Y)$), whereas the right-hand side of (14) grows to infinity with $I(X;Y)$. In Figure 1, we plot the 3 bounds (equations (6), (10), and (14)) for a given range of interest: small $q$, and relatively small $I(X;Y)$, e.g., proportional to $\ln(1/q)$.
Remark 1
Given the form of the 3 bounds, one might expect that (14) outperforms the other two for large values of $I(X;Y)$. This is in fact not true because the range of interest for the right-hand sides is restricted to $[0,1]$. For instance, for small $q$ and $I(X;Y) \ge 2$, the bound in (14) is trivial ($\ge 1$), and the other two bounds are strictly less than 1.
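Remark 1 can be checked numerically. The sketch below assumes our reconstructions of the three bounds, $f_q^{-1}(I(X;Y))$ for (6), $(I(X;Y)+\ln 2)/\ln(1/q)$ for (10), and $q + \sqrt{I(X;Y)/2}$ for (14); the function names and the example point $(q, I(X;Y)) = (10^{-6}, 2)$ are ours.

```python
import math

def binary_kl(p, q):
    """d(p||q) in nats."""
    if p in (0.0, 1.0):
        return math.log(1.0 / q) if p == 1.0 else math.log(1.0 / (1.0 - q))
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def bound_kl_inverse(q, i_xy):
    """Bisection for f_q^{-1}(I(X;Y)), cf. Proposition 1 / eq. (6)."""
    if i_xy >= math.log(1.0 / q):
        return 1.0
    lo, hi = q, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if binary_kl(mid, q) <= i_xy else (lo, mid)
    return lo

def bound_log(q, i_xy):
    """(I(X;Y) + ln 2) / ln(1/q), cf. eq. (10)."""
    return (i_xy + math.log(2.0)) / math.log(1.0 / q)

def bound_sqrt(q, i_xy):
    """q + sqrt(I(X;Y)/2), cf. eq. (14)."""
    return q + math.sqrt(i_xy / 2.0)

q, i_xy = 1e-6, 2.0   # small q, I(X;Y) = 2 nats: the regime discussed in Remark 1
```

At this example point, (14) exceeds 1 and is thus vacuous, while the two KL-based bounds remain strictly below 1.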
II-A.2 Bound using $D(Q\|P)$/Lautum Information
By considering the data processing inequality $d\big(Q(E)\,\|\,P(E)\big) \le D(Q\|P)$, we can bound $P(E)$ in terms of $Q(E)$ and $D(Q\|P)$.
Theorem 1
Given any event $E$ and a pair of distributions $P$ and $Q$, if , then
In particular, given an event $E$ and a joint distribution $P_{XY}$ with ,
(15) 
where $L(X;Y) = D(P_XP_Y\,\|\,P_{XY})$ is the lautum information [1].
Proof:
Moreover, we can derive a family of bounds similar to (12) by considering the Donsker-Varadhan representation of $L(X;Y)$:
$L(X;Y) \ge \mathbb{E}_{P_XP_Y}[f(X,Y)] - \ln \mathbb{E}_{P_{XY}}\!\big[e^{f(X,Y)}\big]$. (17)
Now, let $f = t\,\mathbb{1}_E$ for some $t < 0$. Then, after rearranging terms, we get, for any $t < 0$,

$P_{XY}(E) \le \dfrac{1 - e^{tq - L(X;Y)}}{1 - e^{t}}$. (18)
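For finite alphabets, the lautum information appearing above can be computed directly from its definition $L(X;Y) = D(P_XP_Y\,\|\,P_{XY})$ [1]. A minimal sketch follows; the dict-based representation of the joint and the function name are ours, and the joint is assumed to have full support so that the divergence is finite.

```python
import math

def lautum(p_xy):
    """Lautum information L(X;Y) = D(P_X P_Y || P_XY) in nats, for a joint
    distribution given as a dict {(x, y): prob} with full support."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    # D(P_X P_Y || P_XY): expectation under the product of the marginals
    return sum(p_x[x] * p_y[y] * math.log(p_x[x] * p_y[y] / p)
               for (x, y), p in p_xy.items())
```

As a sanity check, independence gives $L(X;Y) = 0$, and any dependence makes it strictly positive.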
II-B Maximal Leakage Bound
The bounds presented so far in (4) and (15) do not take into account the specific relation of $P_{XY}$ and $P_XP_Y$ as a joint distribution and the product of its marginals. Indeed, they are applications of more general bounds that hold for an arbitrary pair of distributions (Corollary 1 and Theorem 1). The following bound does not fall under this category, i.e., it applies only to pairs of distributions forming a joint and the corresponding product of marginals.
Theorem 2
Given $q \in [0,1)$, finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, a joint distribution $P_{XY}$, and an event $E \subseteq \mathcal{X}\times\mathcal{Y}$ such that $P_Y(E_x) \le q$ for all $x \in \mathcal{X}$, where $E_x = \{y : (x,y) \in E\}$, then
$P_{XY}(E) \le q\, e^{\mathcal{L}(X \to Y)}$, (19)
where $\mathcal{L}(X \to Y)$ is the maximal leakage.
Remark 2
The bound holds more generally, but we restrict our attention to finite alphabets to keep the presentation of the proof simple.
Proof: Fix satisfying , and consider the pair of distributions and :
where the equalities follow from [18, Theorem 6]. Hence,
Now,
where (a) follows from the following (readily verifiable) facts:
One advantage of the bound of Theorem 2 is that it depends on a partial description of $P_{XY}$ only, namely $P_{Y|X}$. Hence, it is simpler to analyze than the mutual information bounds. Moreover, for fixed $q$, the bound is convex in $P_{Y|X}$. In the next subsection, we present a bound with similar properties.
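To illustrate the computational convenience: for finite alphabets, maximal leakage admits the closed form $\mathcal{L}(X \to Y) = \ln \sum_{y} \max_{x :\, P_X(x) > 0} P_{Y|X}(y|x)$ [16], so the quantity in Theorem 2 requires only $P_{Y|X}$. A minimal sketch follows; the dict-of-dicts representation and function name are ours, and every listed $x$ is assumed to satisfy $P_X(x) > 0$.

```python
import math

def maximal_leakage(p_y_given_x):
    """Maximal leakage L(X -> Y) in nats for finite alphabets:
    log of the sum over y of the column-wise maximum of P_{Y|X}.
    p_y_given_x[x][y] = P_{Y|X}(y|x); every listed x assumed to have P_X(x) > 0."""
    ys = set()
    for row in p_y_given_x.values():
        ys.update(row)
    return math.log(sum(max(row.get(y, 0.0) for row in p_y_given_x.values())
                        for y in ys))
```

For example, when $Y$ is independent of $X$ the rows coincide and the leakage is 0, while a deterministic one-to-one channel on an alphabet of size $k$ gives $\ln k$.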
II-C $J_\infty$ Bound
Theorem 3
Given $q \in [0,1)$, finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, a joint distribution $P_{XY}$, and an event $E \subseteq \mathcal{X}\times\mathcal{Y}$ such that $P_Y(E_x) \le q$ for all $x \in \mathcal{X}$, where $E_x = \{y : (x,y) \in E\}$, then
(20) 
where $J_\infty(X;Y)$ is the measure introduced in [2].
Proof:
The theorem follows from Theorem 1 and Corollary 1 of [2]. In particular, following the same proof steps as in [2], one can show that for any function (in [2], the authors consider , , and ; nevertheless, the proof of (21) remains the same),
(21)  
where . Now, set . Then, , , and . Moreover,
where the last inequality follows from the assumption that $P_Y(E_x) \le q$ for all $x$. Then, it follows from (21) that
(22) 
The theorem follows by noting that .
References
 [1] D. P. Palomar and S. Verdú, “Lautum information,” IEEE Transactions on Information Theory, vol. 54, no. 3, pp. 964–975, 2008.
 [2] I. Issa and M. Gastpar, “Computable bounds on the exploration bias,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), June 2018, pp. 576–580.
 [3] J. P. Ioannidis, “Why most published research findings are false,” PLoS medicine, vol. 2, no. 8, p. e124, 2005.
 [4] J. P. Simmons, L. D. Nelson, and U. Simonsohn, “False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant,” Psychological Science, vol. 22, no. 11, pp. 1359–1366, 2011.
 [5] M. Hardt and J. Ullman, “Preventing false discovery in interactive data analysis is hard,” in 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2014, pp. 454–463.

 [6] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability and uniform convergence,” Journal of Machine Learning Research, vol. 11, no. Oct, pp. 2635–2670, 2010.
 [7] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth, “Preserving statistical validity in adaptive data analysis,” in Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. ACM, 2015, pp. 117–126.
 [8] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “Generalization in adaptive data analysis and holdout reuse,” CoRR, vol. abs/1506.02629, 2015. [Online]. Available: http://arxiv.org/abs/1506.02629
 [9] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman, “Algorithmic stability for adaptive data analysis,” arXiv e-prints, Nov. 2015.
 [10] D. Russo and J. Zou, “Controlling bias in adaptive data analysis using information theory,” in Artificial Intelligence and Statistics, 2016, pp. 1232–1240.
 [11] A. Xu and M. Raginsky, “Informationtheoretic analysis of generalization capability of learning algorithms,” arXiv preprint arXiv:1705.07809, 2017.
 [12] J. Jiao, Y. Han, and T. Weissman, “Dependence measures bounding the exploration bias for general measurements,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT). IEEE, June 2017, pp. 1475–1479.
 [13] R. Bassily, S. Moran, I. Nachum, J. Shafer, and A. Yehudayoff, “Learners that use little information,” in Proceedings of Algorithmic Learning Theory, ser. Proceedings of Machine Learning Research, F. Janoos, M. Mohri, and K. Sridharan, Eds., vol. 83. PMLR, 07–09 Apr 2018, pp. 25–55. [Online]. Available: http://proceedings.mlr.press/v83/bassily18a.html
 [14] V. Feldman and T. Steinke, “Calibrating noise to variance in adaptive data analysis,” arXiv preprint arXiv:1712.07196, 2017.
 [15] C. Braun, K. Chatzikokolakis, and C. Palamidessi, “Quantitative notions of leakage for one-try attacks,” Electronic Notes in Theoretical Computer Science, vol. 249, pp. 75–91, 2009.
 [16] I. Issa, S. Kamath, and A. B. Wagner, “An operational measure of information leakage,” in Proc. of 50th Ann. Conf. on Information Sciences and Systems (CISS), Mar. 2016.
 [17] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

 [18] T. van Erven and P. Harremoës, “Rényi divergence and Kullback-Leibler divergence,” IEEE Trans. Inf. Theory, vol. 60, no. 7, pp. 3797–3820, July 2014.