Let us consider two probability spaces $(\Omega,\mathcal{F},\mathcal{P})$ and $(\Omega,\mathcal{F},\mathcal{Q})$, and let $E\in\mathcal{F}$ be a measurable event. Given some divergence $D(\mathcal{P}\|\mathcal{Q})$ between the two distributions (e.g., KL, Rényi’s $\alpha$-Divergence, etc.), our aim is to provide bounds of the following form:
$$\mathcal{P}(E) \leq f(\mathcal{Q}(E)) \cdot g(D(\mathcal{P}\|\mathcal{Q})), \qquad (1)$$
for some functions $f, g$. Here $E$ represents some “undesirable” event (e.g., large generalization error), whose measure under $\mathcal{Q}$ is known and whose measure under $\mathcal{P}$ we wish to bound. To that end, we use some notion of “distance” between $\mathcal{P}$ and $\mathcal{Q}$. Of particular interest is the case where $\Omega=\mathcal{X}\times\mathcal{Y}$, $\mathcal{P}=\mathcal{P}_{XY}$ (the joint distribution), and $\mathcal{Q}=\mathcal{P}_X\mathcal{P}_Y$ (product of the marginals). This allows us to bound the likelihood of $E$ when two random variables $X$ and $Y$ are dependent as a function of the likelihood of $E$ when $X$ and $Y$ are independent (a scenario typically much easier to analyze). Such a result can be applied in the analysis of the generalization error of learning algorithms, as well as in adaptive data analysis (with a proper choice of the dependence measure).
Adaptive data analysis is a recent field that is gaining attention due to its connection with the “Reproducibility Crisis” [1, 2]. The idea is that, whenever one applies a sequence of analyses to some data (e.g., data-exploration procedures) and each analysis informs the subsequent ones, the composition may fail to generalize even though each algorithm is guaranteed to generalize well in isolation. The problem that arises with the composition is believed to be connected with the leakage of information from the data: the output of each algorithm becomes an input to the subsequent ones. In order to be used in adaptive data analysis, a measure that provides such bounds needs to be robust to post-processing and to compose adaptively (meaning that we can bound the measure between input and output of the composition of the sequence of algorithms if each of them has bounded measure). Results of this form involving mutual information can be found in [3, 4, 5]. Via inequalities like (1) we can provide bounds for adaptive mechanisms by treating them as non-adaptive and paying a “penalty” term (e.g., an information measure of statistical dependency) that measures how far the mechanism is from being non-adaptive.
With this aim, our main theorem provides a general bound in the form of (1) with and . As corollaries, we derive several families of interesting bounds:
a family of bounds involving Rényi’s divergence of order $\alpha$;
a family of bounds involving Sibson’s Mutual Information of order $\alpha$;
a bound involving Maximal Leakage.
Moreover, we derive a family of bounds using $f$-divergences, which provide a rich class of information measures. We focus in particular on the bounds involving Maximal Leakage, which is a secrecy metric that has appeared both in the computer security literature and in the information theory literature. It quantifies the leakage of information from a random variable $X$ to another random variable $Y$, and is denoted by $\mathcal{L}(X\to Y)$. The basic insight is as follows: if a learning algorithm leaks little information about the training data, then it will generalize well. Moreover, similarly to differential privacy, maximal leakage behaves well under composition: we can bound the leakage of a sequence of algorithms if each of them has bounded leakage. It is also robust under post-processing. In addition, the expression to compute it is simply given by the following formula (for finite $\mathcal{X}$ and $\mathcal{Y}$):
$$\mathcal{L}(X\to Y) = \log \sum_{y\in\mathcal{Y}} \max_{x:\,P_X(x)>0} P_{Y|X}(y|x),$$
making it more amenable to analysis and relatively easy to compute, especially for algorithms whose randomness consists in adding independent noise to the outcomes.
Although the main focus is on a joint distribution and the corresponding product of the marginals, the proof techniques are more general and can be applied to any pair of joint distributions (under a mild condition of absolute continuity). Moreover, the Maximal Leakage result, as well as the bound using the Rényi divergence of infinite order, reduces to the classical concentration inequalities when independence holds (i.e., $\mathcal{P}_{XY}=\mathcal{P}_X\mathcal{P}_Y$).
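As an illustration of how simple the finite-alphabet expression is to evaluate, here is a minimal sketch in Python. It assumes the standard closed form, the logarithm of the sum over outputs of the largest conditional probability; the function name `maximal_leakage` and the list-of-lists channel representation are choices made for this example only.

```python
import math

def maximal_leakage(channel, p_x):
    """Maximal leakage L(X -> Y) in nats for finite alphabets.

    channel[x][y] is P(Y = y | X = x) and p_x[x] is P(X = x).
    Only inputs with positive probability enter the maximum.
    """
    support = [x for x, p in enumerate(p_x) if p > 0]
    n_outputs = len(channel[0])
    total = sum(max(channel[x][y] for x in support) for y in range(n_outputs))
    return math.log(total)

# Noiseless binary channel: Y reveals X exactly.
identity = [[1.0, 0.0], [0.0, 1.0]]
# Output independent of the input: nothing leaks.
independent = [[0.25, 0.75], [0.25, 0.75]]

print(maximal_leakage(identity, [0.5, 0.5]))
print(maximal_leakage(independent, [0.5, 0.5]))
```

The first call evaluates to $\log 2$ and the second to $0$, matching the intuition that a noiseless channel leaks everything about a binary input while an independent output leaks nothing.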
I-A Further related work
In addition to differentially private algorithms, Dwork et al. show that algorithms whose output can be described concisely generalize well. They further introduce $\beta$-max information to unify the analysis of both classes of algorithms. Consequently, one can provide generalization guarantees for a sequence of algorithms that alternate between differential privacy and short description. Subsequent work connects $\beta$-max information with the notion of approximate differential privacy, but shows that there are no generalization guarantees for an arbitrary composition of algorithms that are approximate-DP and algorithms with short description length. With a more information-theoretic approach, bounds on the exploration bias and/or the generalization error are given in [5, 4, 9, 10, 11, 12, 13], using mutual information and other dependence measures. Some results have also been found using the Wasserstein distance [14, 15].
We will denote probability measures by calligraphic letters and random variables by capital letters. Given two measures $\mathcal{P},\mathcal{Q}$, $\mathcal{P}\ll\mathcal{Q}$ denotes the concept of absolute continuity, i.e., for any measurable set $E$, $\mathcal{Q}(E)=0 \implies \mathcal{P}(E)=0$.
Given two random variables over the spaces we will denote by a joint measure over the product space , while with we will denote the product of the marginals, i.e., for any measurable set .
Given a probability measure and a random variable defined over the same space, we will denote with
Furthermore, given a random variable we say that it is -sub-Gaussian if the following holds true for every :
In Section II we define the fundamental objects that will be used in this work:
In Subsection II-A we consider Rényi’s $\alpha$-Divergences, Sibson’s Mutual Information, Maximal Leakage and $f$-divergences;
In Subsection II-B we provide an overview of the basic concepts in Learning Theory;
In Section III we prove our main results, categorized with respect to the information measure they consider. Some extensions of our bounds to the expected generalization error are also considered. In Section IV we consider the basic definitions of Adaptive Data Analysis and show how some of our results can be employed in the area. To conclude, in Section V we compare our results with recent results in the literature.
II Background and Definitions
II-A Information Measures
We will now briefly introduce the information measures that we will use to provide bounds. The idea is to capture the dependency between two random variables through some information measure and employ it in order to provide bounds. We will consider $X$ to be the input of a learning algorithm and $Y$ the corresponding (random) output. By controlling some measure of dependency, we will control how much the learning algorithm is over-fitting to the data.
II-A1 Sibson’s Mutual Information
Introduced by Rényi in an attempt to generalize the concepts of Entropy and KL-Divergence, the $\alpha$-Divergence has since found many applications in hypothesis testing, guessing and several other statistical inference problems. Indeed, it has several useful operational interpretations (e.g., the number of bits by which a mixture of two codes can be compressed, the cut-off rate in block coding and hypothesis testing [17, 18][19, p. 649]). It can be defined as follows:
Let $(\Omega,\mathcal{F},\mathcal{P}),(\Omega,\mathcal{F},\mathcal{Q})$ be two probability spaces. Let $\alpha>0$ be a positive real different from $1$. Consider a measure $\mu$ such that $\mathcal{P}\ll\mu$ and $\mathcal{Q}\ll\mu$ (such a measure always exists, e.g. $\mu=(\mathcal{P}+\mathcal{Q})/2$) and denote with $p,q$ the densities of $\mathcal{P},\mathcal{Q}$ with respect to $\mu$. The $\alpha$-Divergence of $\mathcal{P}$ from $\mathcal{Q}$ is defined as follows:
$$D_\alpha(\mathcal{P}\|\mathcal{Q}) = \frac{1}{\alpha-1}\log\int p^{\alpha}q^{1-\alpha}\,d\mu.$$
The definition is independent of the chosen measure whenever and . It is indeed possible to show that , and that whenever or , see .
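It can be shown that if $\alpha>1$ and $\mathcal{P}\not\ll\mathcal{Q}$ then $D_\alpha(\mathcal{P}\|\mathcal{Q})=\infty$. The behavior of the measure for $\alpha\in\{0,1,\infty\}$ can be defined by continuity. In particular, we have that $\lim_{\alpha\to1}D_\alpha(\mathcal{P}\|\mathcal{Q})=D(\mathcal{P}\|\mathcal{Q})$, i.e., the classical Kullback-Leibler divergence. For an extensive treatment of $\alpha$-Divergences and their properties we refer the reader to . Starting from the concept of $\alpha$-Divergence, Sibson built a generalization of Mutual Information that retains many interesting properties. The definition is the following:
Let $X,Y$ be two random variables jointly distributed according to $\mathcal{P}_{XY}$. Let $\mathcal{P}_X$ be the corresponding marginal of $X$ (i.e., given a measurable set $A\subseteq\mathcal{X}$, $\mathcal{P}_X(A)=\mathcal{P}_{XY}(A\times\mathcal{Y})$) and let $\mathcal{Q}_Y$ be any probability measure over $\mathcal{Y}$. Let $\alpha>0$; the Sibson’s Mutual Information of order $\alpha$ between $X$ and $Y$ is defined as:
$$I_\alpha(X,Y) = \min_{\mathcal{Q}_Y} D_\alpha(\mathcal{P}_{XY}\|\mathcal{P}_X\mathcal{Q}_Y). \qquad (6)$$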
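The following alternative formulation is also useful:
$$I_\alpha(X,Y) = \frac{\alpha}{\alpha-1}\log\mathbb{E}_{\mathcal{P}_Y}\left[\mathbb{E}^{1/\alpha}_{\mathcal{P}_X}\left[\left(\frac{d\mathcal{P}_{XY}}{d\mathcal{P}_X\mathcal{P}_Y}\right)^{\alpha}\right]\right] = D_\alpha(\mathcal{P}_{XY}\|\mathcal{P}_X\mathcal{Q}^\star_Y),$$
where $\mathcal{Q}^\star_Y$ is the measure minimizing (6). In analogy with the limiting behavior of the $\alpha$-Divergence we have that $\lim_{\alpha\to1}I_\alpha(X,Y)=I(X;Y)$ while, when $\alpha\to\infty$, we retrieve the following object:
$$I_\infty(X,Y) = \log\mathbb{E}_{\mathcal{P}_Y}\left[\operatorname*{ess\,sup}_{\mathcal{P}_X}\frac{d\mathcal{P}_{XY}}{d\mathcal{P}_X\mathcal{P}_Y}\right].$$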
To conclude, let us list some of the properties of the measure:
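Proposition 1.
Data Processing Inequality: given $\alpha>0$, $I_\alpha(X,Z)\leq\min\{I_\alpha(X,Y),I_\alpha(Y,Z)\}$ if the Markov chain $X-Y-Z$ holds;
$I_\alpha(X,Y)\geq0$ with equality iff $X$ and $Y$ are independent;
Let $\alpha_1\leq\alpha_2$, then $I_{\alpha_1}(X,Y)\leq I_{\alpha_2}(X,Y)$;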
Let , for a given , is convex in ;
For an extensive treatment of Sibson’s -MI we refer the reader to .
II-A2 Maximal Leakage
A particularly relevant dependence measure, strongly connected to Sibson’s Mutual Information, is the maximal leakage. Denoted by $\mathcal{L}(X\to Y)$, it was introduced as a way of measuring the leakage of information from $X$ to $Y$, hence the following definition:
Definition 3 (Def. 1 of ).
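Given a joint distribution $P_{XY}$ on finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, the maximal leakage from $X$ to $Y$ is defined as:
$$\mathcal{L}(X\to Y) = \sup_{U-X-Y-\hat{U}}\log\frac{\mathbb{P}(U=\hat{U})}{\max_{u\in\mathcal{U}}P_U(u)},$$
where $U$ and $\hat{U}$ take values in the same finite, but arbitrary, alphabet.
It is shown in [8, Theorem 1] that, for finite alphabets:
$$\mathcal{L}(X\to Y) = \log\sum_{y\in\mathcal{Y}}\max_{x:\,P_X(x)>0}P_{Y|X}(y|x).$$
If $X$ and $Y$ have a jointly continuous pdf $f(x,y)$, we get [6, Corollary 4]:
$$\mathcal{L}(X\to Y) = \log\int_{\mathbb{R}}\sup_{x:\,f_X(x)>0}f_{Y|X}(y|x)\,dy.$$
One can show that $\mathcal{L}(X\to Y)=I_\infty(X,Y)$, i.e., Maximal Leakage corresponds to the Sibson’s Mutual Information of order infinity. This allows the measure to retain the properties listed in Proposition 1. Furthermore: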
Lemma 1 ().
For any joint distribution $P_{XY}$ on finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, $\mathcal{L}(X\to Y)\geq I(X;Y)$.
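The relationship between Shannon's mutual information, Sibson's mutual information and maximal leakage can be checked numerically on a small example. The sketch below assumes the closed form $I_\alpha(X,Y)=\frac{\alpha}{\alpha-1}\log\sum_y\big(\sum_x P_X(x)P_{Y|X}(y|x)^{\alpha}\big)^{1/\alpha}$ for finite alphabets; the function names and the example joint distribution are illustrative choices, not taken from the text.

```python
import math

def marginals(joint):
    """Return (p_x, p_y) for a joint pmf given as joint[x][y]."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(row[y] for row in joint) for y in range(len(joint[0]))]
    return p_x, p_y

def mutual_information(joint):
    """Shannon mutual information I(X;Y) in nats."""
    p_x, p_y = marginals(joint)
    return sum(p * math.log(p / (p_x[x] * p_y[y]))
               for x, row in enumerate(joint)
               for y, p in enumerate(row) if p > 0)

def sibson_mi(joint, alpha):
    """Sibson's mutual information of order alpha (alpha > 0, alpha != 1), in nats."""
    p_x, _ = marginals(joint)
    total = 0.0
    for y in range(len(joint[0])):
        inner = sum(p_x[x] * (joint[x][y] / p_x[x]) ** alpha
                    for x in range(len(joint)) if p_x[x] > 0)
        total += inner ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * math.log(total)

def maximal_leakage(joint):
    """Maximal leakage: log of the sum over y of max_x P(y|x), in nats."""
    p_x, _ = marginals(joint)
    return math.log(sum(
        max(joint[x][y] / p_x[x] for x in range(len(joint)) if p_x[x] > 0)
        for y in range(len(joint[0]))))

# Symmetric binary example: P(Y = X) = 0.8 with uniform X.
joint = [[0.4, 0.1], [0.1, 0.4]]
print(mutual_information(joint), sibson_mi(joint, 2.0),
      sibson_mi(joint, 200.0), maximal_leakage(joint))
```

On this example the four printed values are non-decreasing from left to right, which is consistent with the monotonicity of $I_\alpha$ in $\alpha$ and with maximal leakage being the limit as $\alpha\to\infty$.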
Another relevant notion, important for its application to Adaptive Data Analysis, is Conditional Maximal Leakage:
Definition 4 (Conditional Maximal Leakage ).
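Given a joint distribution $P_{XYZ}$ on alphabets $\mathcal{X},\mathcal{Y},\mathcal{Z}$, define:
$$\mathcal{L}(X\to Y|Z) = \sup_{U:\,U-X-Y|Z}\log\frac{\mathbb{P}(U=\hat{U}(Y,Z))}{\mathbb{P}(U=\hat{U}(Z))},$$
where $U$ takes value in an arbitrary finite alphabet and we consider $\hat{U}(Y,Z)$ and $\hat{U}(Z)$ to be the optimal estimators of $U$ given $(Y,Z)$ and $Z$, respectively.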
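II-A3 $f$-Mutual Information
Another generalization of the KL-Divergence can be obtained by considering a generic convex function $f:\mathbb{R}^+\to\mathbb{R}$, usually with the simple constraint that $f(1)=0$. The constraint can be ignored as long as $f(1)<+\infty$ by simply considering a new mapping $g(x)=f(x)-f(1)$.
Let $(\Omega,\mathcal{F},\mathcal{P}),(\Omega,\mathcal{F},\mathcal{Q})$ be two probability spaces. Let $f:\mathbb{R}^+\to\mathbb{R}$ be a convex function such that $f(1)=0$. Consider a measure $\mu$ such that $\mathcal{P}\ll\mu$ and $\mathcal{Q}\ll\mu$. Denoting with $p,q$ the densities of the measures with respect to $\mu$, the $f$-Divergence of $\mathcal{P}$ from $\mathcal{Q}$ is defined as follows:
$$D_f(\mathcal{P}\|\mathcal{Q}) = \int q\,f\!\left(\frac{p}{q}\right)d\mu.$$
Although the definition uses $\mu$ and the densities with respect to this measure, it is possible to show that $f$-divergences are actually independent of the dominating measure $\mu$. Indeed, when absolute continuity between $\mathcal{P}$ and $\mathcal{Q}$ holds, i.e. $\mathcal{P}\ll\mathcal{Q}$, an assumption we will often use, we retrieve the following:
$$D_f(\mathcal{P}\|\mathcal{Q}) = \int f\!\left(\frac{d\mathcal{P}}{d\mathcal{Q}}\right)d\mathcal{Q}.$$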
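Denoting with $\mathcal{F}_X$ the sigma-field generated by the random variable $X$ (i.e., $\mathcal{F}_X=\sigma(X)$), $f$-mutual information is defined as follows:
Let $X$ and $Y$ be two random variables jointly distributed according to $\mathcal{P}_{XY}$ over a measurable space $(\mathcal{X}\times\mathcal{Y},\mathcal{F}_{XY})$. Let $(\mathcal{X},\mathcal{F}_X,\mathcal{P}_X),(\mathcal{Y},\mathcal{F}_Y,\mathcal{P}_Y)$ be the corresponding probability spaces induced by the marginals. Let $f:\mathbb{R}^+\to\mathbb{R}$ be a convex function such that $f(1)=0$. The $f$-Mutual Information between $X$ and $Y$ is defined as:
$$I_f(X,Y) = D_f(\mathcal{P}_{XY}\|\mathcal{P}_X\mathcal{P}_Y).$$
If $\mathcal{P}_{XY}\ll\mathcal{P}_X\mathcal{P}_Y$ we have that:
$$I_f(X,Y) = \int f\!\left(\frac{d\mathcal{P}_{XY}}{d\mathcal{P}_X\mathcal{P}_Y}\right)d\mathcal{P}_X\mathcal{P}_Y.$$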
It is possible to see that, if $f$ satisfies $f(1)=0$ and it is strictly convex at $1$, then $I_f(X,Y)=0$ if and only if $X$ and $Y$ are independent. This generalization includes the KL divergence (by simply setting $f(t)=t\log t$) and allows us to retrieve $\alpha$-Divergences through a one-to-one mapping. But it also includes many more divergences:
Total Variation distance, with $f(t)=\frac{1}{2}|t-1|$;
Hellinger distance, with $f(t)=(\sqrt{t}-1)^2$;
Pearson $\chi^2$-divergence, with $f(t)=(t-1)^2$.
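The divergences listed above can all be computed from one routine once the generator $f$ is fixed. The following sketch does this for finite distributions, assuming $\mathcal{P}\ll\mathcal{Q}$ and the common normalizations $f(t)=t\log t$, $\frac{1}{2}|t-1|$, $(\sqrt{t}-1)^2$ and $(t-1)^2$; other normalizations exist in the literature, and the names used here are illustrative.

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_i q_i * f(p_i / q_i), assuming P << Q.

    Points with q_i = 0 are skipped, which is valid because absolute
    continuity forces p_i = 0 there as well.
    """
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q) if qi > 0)

# Generators for the divergences named in the text (t log t is
# extended by continuity with the value 0 at t = 0).
GENERATORS = {
    "kl": lambda t: t * math.log(t) if t > 0 else 0.0,
    "total_variation": lambda t: 0.5 * abs(t - 1.0),
    "hellinger": lambda t: (math.sqrt(t) - 1.0) ** 2,
    "chi_squared": lambda t: (t - 1.0) ** 2,
}

p = [0.5, 0.5]
q = [0.25, 0.75]
for name, generator in GENERATORS.items():
    print(name, f_divergence(p, q, generator))
```

Every generator satisfies $f(1)=0$, so each divergence vanishes when the two distributions coincide.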
Exploiting a bound involving $I_f(X,Y)$ for a broad enough set of functions $f$ allows us to measure the dependence between $X$ and $Y$ in different ways, and it may help us circumvent issues that commonly used measures, like Mutual Information, may suffer from. Consider for instance the following example: let
be a random vector, via Strong Data-Processing inequalities it is possible to show that, given the Markov Chain, where and with Gaussian noise, the Total Variation distance between the joint and the product of the marginals of is strictly less than , while may still be infinite. Furthermore, as presented in , different divergences between distributions can provide different convergence rates. It has been proved in  that it is possible to construct a random walk that converges in steps under KL, steps under the distance and in total variation. This shows that even though several divergences may go to with the number of steps (or samples, in the case of a generalization error bound), the rate of convergence obtainable can be quite different and this can possibly impact the sample complexity in the problems we will analyze in later sections.
II-B Learning Theory
In this section we will provide some basic background knowledge on learning algorithms and concepts like generalization error. We are mainly interested in supervised learning, where the algorithm learns a classifier by looking at points in a proper space and the corresponding labels.
More formally, suppose we have an instance space $\mathcal{Z}$ and a hypothesis space $\mathcal{H}$. The hypothesis space is a set of functions that, given a data point $x$, output the corresponding label $y$. Suppose we are given a training data set $S\in\mathcal{Z}^n$ made of $n$ points sampled in an i.i.d. fashion from some distribution $\mathcal{D}$. Given some $n\in\mathbb{N}$, a learning algorithm is a (possibly stochastic) mapping $\mathcal{A}:\mathcal{Z}^n\to\mathcal{H}$ that, given as an input a finite sequence of points $S=(z_1,\ldots,z_n)\in\mathcal{Z}^n$, outputs some classifier $h=\mathcal{A}(S)\in\mathcal{H}$. In the simplest setting we can think of $\mathcal{Z}$ as a product between the space of data points and the space of labels, i.e., $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$, and suppose that $\mathcal{A}$ is fed with $n$ data-label pairs $(x,y)\in\mathcal{Z}$. In this work we will view $\mathcal{A}$ as a family of conditional distributions $\mathcal{P}_{H|S}$ and provide a stochastic analysis of its generalization capabilities using the information measures presented so far. The goal is to generate a hypothesis $h:\mathcal{X}\to\mathcal{Y}$ that has good performance on both the training set and newly sampled points from $\mathcal{X}$. In order to ensure such a property, the concept of generalization error is introduced.
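Let $\mathcal{D}$ be some distribution over $\mathcal{Z}$. Let $\ell:\mathcal{H}\times\mathcal{Z}\to\mathbb{R}$ be a loss function. The error (or risk) of a prediction rule $h$ with respect to $\mathcal{D}$ is defined as
$$L_{\mathcal{D}}(h) = \mathbb{E}_{Z\sim\mathcal{D}}[\ell(h,Z)],$$
while, given a sample $S=(z_1,\ldots,z_n)$, the empirical error of $h$ with respect to $S$ is defined as
$$L_S(h) = \frac{1}{n}\sum_{i=1}^{n}\ell(h,z_i).$$
Moreover, given a learning algorithm $\mathcal{A}$, its generalization error with respect to $S$ is defined as:
$$\text{gen-err}_{\mathcal{D}}(\mathcal{A},S) = \left|L_{\mathcal{D}}(\mathcal{A}(S)) - L_S(\mathcal{A}(S))\right|.$$
The definition just stated considers general loss functions. An important instance for the case of supervised learning is the $0$-$1$ loss. Suppose again that $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ and that $\mathcal{H}=\{h \mid h:\mathcal{X}\to\mathcal{Y}\}$. Given a couple $(x,y)\in\mathcal{Z}$ and a hypothesis $h$, the loss is defined as follows:
$$\ell(h,(x,y)) = \mathbb{1}_{h(x)\neq y},$$
and the corresponding errors become:
$$L_{\mathcal{D}}(h) = \mathbb{P}_{(x,y)\sim\mathcal{D}}(h(x)\neq y), \qquad L_S(h) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{h(x_i)\neq y_i}.$$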
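The quantities just introduced can be sketched for the $0$-$1$ loss as follows. The true risk is generally unknown and is passed in here as a given number purely for illustration, and the generalization error is taken as the absolute difference between true and empirical risk, which is an assumption of this sketch; the function names and the toy threshold classifier are hypothetical.

```python
def zero_one_loss(h, x, y):
    """0-1 loss: 1 if the hypothesis mislabels x, 0 otherwise."""
    return 0.0 if h(x) == y else 1.0

def empirical_error(h, sample):
    """Average 0-1 loss of h over a sample of (x, y) pairs."""
    return sum(zero_one_loss(h, x, y) for x, y in sample) / len(sample)

def generalization_error(h, sample, true_risk):
    """Absolute gap between an (assumed known) true risk and the empirical error."""
    return abs(true_risk - empirical_error(h, sample))

def threshold_classifier(x):
    """Toy hypothesis: label 1 when x is at least 0.5, else label 0."""
    return 1 if x >= 0.5 else 0

sample = [(0.1, 0), (0.4, 1), (0.6, 1), (0.9, 1)]
print(empirical_error(threshold_classifier, sample))
print(generalization_error(threshold_classifier, sample, true_risk=0.1))
```

Only the point $(0.4, 1)$ is misclassified, so the empirical error is one quarter of the sample.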
Another fundamental concept we will need is the sample complexity of a learning algorithm.
Fix . Let be a hypothesis class. The sample complexity of with respect to , denoted by , is defined as the smallest for which there exists a learning algorithm such that, for every distribution over the domain
If there is no such then .
For more details we refer the reader to [24, Sections 2-3].
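III Main Results
In this section we will present our main result. The bounds we provide will be categorized according to the information measure we are adopting. Notice that for the remainder of this paper $\log$ is always taken to the base $e$. Also, unless stated otherwise, we will always consider the following two probability spaces $(\mathcal{X}\times\mathcal{Y},\mathcal{F},\mathcal{P}_{XY}),(\mathcal{X}\times\mathcal{Y},\mathcal{F},\mathcal{P}_X\mathcal{P}_Y)$ and assume that $\mathcal{P}_{XY}\ll\mathcal{P}_X\mathcal{P}_Y$.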
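Let $(\mathcal{X}\times\mathcal{Y},\mathcal{F},\mathcal{P}_{XY}),(\mathcal{X}\times\mathcal{Y},\mathcal{F},\mathcal{P}_X\mathcal{P}_Y)$ be two probability spaces, and assume that $\mathcal{P}_{XY}\ll\mathcal{P}_X\mathcal{P}_Y$. Given $E\in\mathcal{F}$ and $y\in\mathcal{Y}$, let $E_y:=\{x:(x,y)\in E\}$, i.e. the “fibers” of $E$ with respect to $y$. Then,
$$\mathcal{P}_{XY}(E) \leq \mathbb{E}_{\mathcal{P}_Y}^{1/\gamma'}\left[\mathcal{P}_X(E_Y)^{\gamma'/\gamma}\right]\,\mathbb{E}_{\mathcal{P}_Y}^{1/\alpha'}\left[\mathbb{E}_{\mathcal{P}_X}^{\alpha'/\alpha}\left[\left(\frac{d\mathcal{P}_{XY}}{d\mathcal{P}_X\mathcal{P}_Y}\right)^{\alpha}\right]\right],$$
where $\gamma,\alpha,\gamma',\alpha'$ are such that $1=\frac{1}{\alpha}+\frac{1}{\gamma}=\frac{1}{\alpha'}+\frac{1}{\gamma'}$.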
We have that:
where both inequalities follow from Hölder’s inequality, given that $\gamma,\alpha,\gamma',\alpha'\geq1$ and $\frac{1}{\gamma}+\frac{1}{\alpha}=\frac{1}{\gamma'}+\frac{1}{\alpha'}=1$. ∎
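The chain of inequalities behind the proof can be sketched as follows. This is a reconstruction under the notation of the theorem (fibers $E_y$ and two pairs of Hölder conjugates $(\alpha,\gamma)$ and $(\alpha',\gamma')$), not necessarily the exact display used by the authors: Hölder's inequality is applied first to the inner expectation over $\mathcal{P}_X$ and then to the outer expectation over $\mathcal{P}_Y$.

```latex
\begin{align*}
\mathcal{P}_{XY}(E)
 &= \mathbb{E}_{\mathcal{P}_X\mathcal{P}_Y}\!\left[\frac{d\mathcal{P}_{XY}}{d\mathcal{P}_X\mathcal{P}_Y}\,\mathbb{1}_{E}\right]
  = \mathbb{E}_{\mathcal{P}_Y}\!\left[\mathbb{E}_{\mathcal{P}_X}\!\left[\frac{d\mathcal{P}_{XY}}{d\mathcal{P}_X\mathcal{P}_Y}\,\mathbb{1}_{E_Y}\right]\right] \\
 &\leq \mathbb{E}_{\mathcal{P}_Y}\!\left[\mathcal{P}_X(E_Y)^{1/\gamma}\,
       \mathbb{E}_{\mathcal{P}_X}^{1/\alpha}\!\left[\left(\frac{d\mathcal{P}_{XY}}{d\mathcal{P}_X\mathcal{P}_Y}\right)^{\alpha}\right]\right] \\
 &\leq \mathbb{E}_{\mathcal{P}_Y}^{1/\gamma'}\!\left[\mathcal{P}_X(E_Y)^{\gamma'/\gamma}\right]\,
       \mathbb{E}_{\mathcal{P}_Y}^{1/\alpha'}\!\left[\mathbb{E}_{\mathcal{P}_X}^{\alpha'/\alpha}\!\left[\left(\frac{d\mathcal{P}_{XY}}{d\mathcal{P}_X\mathcal{P}_Y}\right)^{\alpha}\right]\right].
\end{align*}
```

The first line uses the change of measure permitted by $\mathcal{P}_{XY}\ll\mathcal{P}_X\mathcal{P}_Y$ together with the fact that, for fixed $y$, the indicator of $E$ reduces to the indicator of the fiber $E_y$.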
The proof above works for any pair of measures defined on the same measurable space. We nevertheless chose to state the theorem for the case where the distributions considered are the joint and the corresponding product of the marginals (i.e. informally, given a measurable set , and similarly, ). This helps us make a direct connection, later on, between what appears on the right-hand side of (26) and well-known information measures.
It is clear from the proof that one can similarly bound for any positive function that is -integrable. But the shape of the bound becomes more complex as one in general does not have that for every .
III-A $\alpha$-Divergences and Sibson’s Mutual Information
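Based on the choices of $\alpha',\gamma'$ one has different bounds. Two are of particular interest to us and rely on different choices of $\alpha'$. Choosing $\alpha'=\alpha$ and thus $\gamma'=\gamma$ in Theorem 1, we retrieve:
Let $E\in\mathcal{F}$, we have that:
$$\mathcal{P}_{XY}(E) \leq \left(\mathcal{P}_X\mathcal{P}_Y(E)\right)^{\frac{\alpha-1}{\alpha}}\exp\left(\frac{\alpha-1}{\alpha}D_\alpha(\mathcal{P}_{XY}\|\mathcal{P}_X\mathcal{P}_Y)\right).$$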
Proof 2 of Corollary 1.
Let us denote with
where the inequality follows from the Data-Processing inequality for $\alpha$-Divergences. Re-arranging the terms one gets:
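Alternatively, choosing $\alpha'\to1$, which implies $\gamma'\to\infty$, we retrieve:
Let $E\in\mathcal{F}$, we have that:
$$\mathcal{P}_{XY}(E) \leq \left(\operatorname*{ess\,sup}_{\mathcal{P}_Y}\mathcal{P}_X(E_Y)\right)^{1/\gamma}\exp\left(\frac{\alpha-1}{\alpha}I_\alpha(X,Y)\right),$$
where $I_\alpha(X,Y)$ is the Sibson’s Mutual Information of order $\alpha$.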
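Both families of bounds can be sanity-checked numerically on a finite example. The sketch below assumes the forms $\mathcal{P}_{XY}(E)\leq(\mathcal{P}_X\mathcal{P}_Y(E))^{\frac{\alpha-1}{\alpha}}\exp\big(\frac{\alpha-1}{\alpha}D_\alpha\big)$ and $\mathcal{P}_{XY}(E)\leq(\max_y\mathcal{P}_X(E_y))^{\frac{\alpha-1}{\alpha}}\exp\big(\frac{\alpha-1}{\alpha}I_\alpha(X,Y)\big)$, which follow from Hölder's inequality; the function name `bounds_for_event` and the example distribution are illustrative choices.

```python
import math
from itertools import combinations

def bounds_for_event(joint, event, alpha):
    """Return (P_XY(E), Renyi-type bound, Sibson-type bound) for a finite joint pmf.

    joint[x][y] = P(X = x, Y = y); event is a set of (x, y) pairs; alpha > 1.
    """
    xs, ys = range(len(joint)), range(len(joint[0]))
    p_x = [sum(joint[x][y] for y in ys) for x in xs]
    p_y = [sum(joint[x][y] for x in xs) for y in ys]
    exponent = (alpha - 1.0) / alpha

    p_event = sum(joint[x][y] for (x, y) in event)
    q_event = sum(p_x[x] * p_y[y] for (x, y) in event)

    # exp((alpha - 1) * D_alpha(P_XY || P_X P_Y)) = sum of p^alpha * q^(1 - alpha)
    power_sum = sum(joint[x][y] ** alpha * (p_x[x] * p_y[y]) ** (1.0 - alpha)
                    for x in xs for y in ys if p_x[x] * p_y[y] > 0)
    renyi_bound = q_event ** exponent * power_sum ** (1.0 / alpha)

    # exp(((alpha - 1) / alpha) * I_alpha(X, Y)) via the closed form for finite alphabets
    sibson_factor = sum(
        sum(p_x[x] * (joint[x][y] / p_x[x]) ** alpha
            for x in xs if p_x[x] > 0) ** (1.0 / alpha)
        for y in ys)
    # Largest P_X-measure of a fiber E_y over outputs with positive probability
    worst_fiber = max(sum(p_x[x] for x in xs if (x, y) in event)
                      for y in ys if p_y[y] > 0)
    sibson_bound = worst_fiber ** exponent * sibson_factor
    return p_event, renyi_bound, sibson_bound

joint = [[0.4, 0.1], [0.1, 0.4]]
cells = [(x, y) for x in range(2) for y in range(2)]

# Check both bounds on every event of the 2x2 product space.
for alpha in (1.5, 2.0, 5.0):
    for size in range(len(cells) + 1):
        for event in combinations(cells, size):
            p_event, renyi_bound, sibson_bound = bounds_for_event(joint, set(event), alpha)
            assert p_event <= renyi_bound + 1e-12
            assert p_event <= sibson_bound + 1e-12

print(bounds_for_event(joint, {(0, 0), (1, 1)}, 2.0))
```

For the diagonal event with $\alpha=2$, the event has probability $0.8$ under the joint and $0.5$ under the product of the marginals, and both bounds evaluate to $\sqrt{0.68}\approx0.825$.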
Moreover, for a fixed due to the property that Holder’s conjugates need to satisfy, we have that and the bound in (43) can also be rewritten as:
thus, choosing a smaller yields a better dependence on in the bound, but given that we also have that and being it implies that
with a worse dependence on on the bound. This leads to a trade-off between the two quantities. In the bounds of interest is typically exponentially decaying with the number of samples and this trade-off can be explicitly seen in the sample complexity of a learning algorithm:
Let be the sample space and be the set of hypotheses. Let be a learning algorithm that, given a sequence of points, returns a hypothesis . Suppose is sampled i.i.d according to some distribution over , i.e., . Let be the loss function as defined in (22). Given , let . Fix then,
Fix and . Let . Let us denote with the fiber of over for some , i.e. . Consider , where and . If differ only in one position , i.e. and we have that for every ,
By McDiarmid’s inequality and Inequality (50) we have that for every hypothesis ,
Under the same assumptions of Corollary 3, fix . In order to ensure a confidence of , i.e. , it is sufficient to have samples where
Smaller means that will be smaller, but it will imply a larger value for and thus a worse dependency on in the sample complexity. Let be the sample space and be the set of hypotheses. An immediate generalization of Corollary 3 follows by considering loss functions such that is -sub-Gaussian for every and some .
Let be a learning algorithm that, given a sequence of points, returns a hypothesis . Suppose is sampled i.i.d according to some distribution over . Let be a loss function s.t. is -sub-Gaussian random variable for every . Given , let . Fix Then,
A similar approach yields bounds involving -Divergences and -Mutual Information.
Let be a convex function such that , and assume is non-decreasing on . Suppose also that is such that for every the set is non-empty, i.e. the generalized inverse, defined as , exists. Let be the Fenchel-Legendre dual of [25, Section 2.2]. Given an event , we have that: