The amount of information that an algorithm uses is a natural and important quantity to study. A central idea of this paper is that a learning algorithm that uses only a small amount of information from its input sample will generalize well. The amount of information used can be quantified by
$$I(A(S);S),$$
which is the mutual information between the output of the algorithm $A(S)$ and the input sample $S$. With this quantity we define a new class of learning algorithms termed $d$-bit information learners, which are learning algorithms in which the mutual information is at most $d$. This definition naturally combines notions from information theory and from learning theory (Littlestone and Warmuth, 1986; Moran and Yehudayoff, 2016). It also relates to privacy, because one can think of this definition as a bound on the information that an algorithm leaks or reveals about its potentially sensitive training data.
Low information yields generalization.
Our work stems from the intuition that a learning algorithm that only uses a small amount of information from its input will generalize well. We formalize this intuition in Theorem 3.1, which roughly states that
$$\Pr\big[\,|\text{true error} - \text{empirical error}| > \epsilon\,\big] < O\!\left(\frac{I(A(S);S)}{m\epsilon^2}\right),$$
where $m$ is the number of examples in the input sample. We provide four different proofs of this statement, each of which emphasizes a different perspective on this phenomenon (see Section 4).
Sharpness of the sample complexity bound.
Theorem 3.1 entails that to achieve an error of $\epsilon$ with confidence $\delta$, it is sufficient to use
$$m = \Omega\!\left(\frac{I(A(S);S)}{\epsilon^2\cdot\delta}\right)$$
examples. This differs from results in well-known settings such as learning hypothesis classes of finite VC dimension, where $m$ only grows logarithmically with $1/\delta$. Nonetheless, we prove that this bound is sharp (Section 4.2). In particular, we show the existence of a learning problem and an $O(1)$-bit information learner that has a true error of at least $1/2$ with probability of at least $1/m$, where $m$ is the size of the input sample.
A lower bound for mutual information.
In Section 5 we show that for the simple class of thresholds, every (possibly randomized) proper ERM must reveal at least
$$\Omega\!\left(\frac{\log\log N}{m^2}\right)$$
bits of information, where $N$ is the size of the domain and $m$ is the sample size. This means that even in very simple settings, learning may not always be possible if we restrict the information used by the algorithm. However, this does not imply the non-existence of bounded information learners that are either non-consistent or non-proper, an issue we leave open for future work.
Upper bounds for mutual information.
Section 6.1 provides a method for upper bounding the amount of information that algorithms reveal. We also define a generic learner for a concept class $\mathcal{H}$, and show that in a number of natural cases this algorithm conveys as little information as possible (up to some constant). This generic learner is proper and consistent (i.e. an ERM); it simply outputs a uniformly random hypothesis from the set of hypotheses that are consistent with the input sample. However, we show that in other simple cases, this algorithm has significantly higher mutual information than necessary.
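The generic learner described above can be sketched in a few lines for a finite class. This is only an illustration under assumptions of our own: the class is given as a dictionary of Python callables, and the helper names `threshold`, `consistent_hypotheses` and `generic_learner` are hypothetical, not taken from the paper.

```python
import random

def threshold(k):
    """Hypothetical example hypothesis: f_k(x) = 1 if x >= k, else 0."""
    return lambda x: 1 if x >= k else 0

def consistent_hypotheses(hypotheses, sample):
    """Names of all hypotheses with zero empirical error on the sample."""
    return [name for name, h in hypotheses.items()
            if all(h(x) == y for x, y in sample)]

def generic_learner(hypotheses, sample, rng):
    """Output a uniformly random hypothesis consistent with the sample."""
    candidates = consistent_hypotheses(hypotheses, sample)
    if not candidates:
        raise ValueError("sample is not realizable by the class")
    return rng.choice(candidates)

# Thresholds over the domain {1, ..., 9} and a realizable sample.
hypotheses = {k: threshold(k) for k in range(1, 10)}
sample = [(2, 0), (6, 1)]
chosen = generic_learner(hypotheses, sample, random.Random(0))
```

Because the output is drawn uniformly from the consistent set, the learner is proper and consistent by construction; how much information it reveals depends on how the consistent set varies with the sample.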
The distribution-dependent setting.
We also consider an alternative setting, in Section 6.4, in which the distribution over the domain is known to the learner. Here, for any concept class with finite VC-dimension $d$ and for any distribution on the data domain there exists a learning algorithm that outputs with high probability an approximately correct function from the concept class, such that the mutual information between the input sample and the output is $O(d\log m)$, where $m$ is the sample size. In contrast with the abovementioned lower bound, the information here does not grow with the size of the domain.
Contrast with pure differential privacy.
Corollary 6.2 provides a separation between differential privacy and bounded mutual information. For the class of point functions $\mathcal{PF}$, it is known that any pure differentially private algorithm that properly learns this class must require a number of examples that grows with the domain size (Beimel et al., 2010). On the other hand, we show that the generic ERM learner leaks at most $2$ bits of information and properly learns this class with optimal PAC-learning sample complexity.
1.2 Related Work
Sample compression schemes.
$d$-bit information learners resemble the notion of sample compression schemes (Littlestone and Warmuth, 1986). Sample compression schemes correspond to learning algorithms whose output hypothesis is determined by a small subsample of the input. For example, s
Both sample compression schemes and information learners quantify (in different ways) the property of limited dependence between the output hypothesis and the input sample. It is therefore natural to ask how these two notions relate to each other.
It turns out that not every sample compression scheme of constant size also leaks a constant number of bits. Indeed, in Section 5 it is shown that there is no empirical risk minimizer (ERM) for thresholds that is an $O(1)$-bit information learner. (Here and below $O(\cdot)$, $\Omega(\cdot)$ and $\Theta(\cdot)$ mean up to some multiplicative universal constants.) On the other hand, there is an ERM for this class that is based on a sample compression scheme of size $O(1)$.
Theorem 3.1 extends the classical Occam’s razor generalization bound (Blumer et al., 1987), which states the following: Assume a fixed encoding of hypotheses in $\mathcal{H}$ by bit strings. The complexity of a hypothesis is the bit-length of its encoding. A learning algorithm for $\mathcal{H}$ is called an Occam-algorithm with parameters $c\ge 1$ and $0\le\alpha<1$ if for every realizable sample of size $m$ it produces a consistent hypothesis of complexity at most $n^c m^\alpha$, where $n$ is the complexity of some hypothesis in $\mathcal{H}$ that is consistent with the sample.
[Blumer et al. 1987] Let $A$ be an Occam-algorithm with parameters $c\ge 1$ and $0\le\alpha<1$. Let $\mathcal{D}$ be a realizable distribution, let $f\in\mathcal{H}$ be such that $\mathrm{err}(f;\mathcal{D})=0$, and let $n$ denote the complexity of $f$. Then,
$$\Pr_{S\sim\mathcal{D}^m}\big[\mathrm{err}(A(S);\mathcal{D})\ge\epsilon\big]\le\delta,$$
as long as $m$ is at least
$$\Omega\!\left(\frac{\log(1/\delta)}{\epsilon}+\Big(\frac{n^c}{\epsilon}\Big)^{1/(1-\alpha)}\right).$$
To relate Occam’s razor to Theorem 3.1, observe that an Occam-algorithm is in particular an $O(n^c m^\alpha)$-bit information learner (since its output hypothesis is encoded by $O(n^c m^\alpha)$ bits), which implies that the probability of it outputting a function with true error more than $\epsilon$ is at most $O\big(\frac{n^c m^\alpha}{m\epsilon^2}\big)$. The bound can be improved by standard confidence-boosting techniques (see Appendix I).
Mutual information for controlling bias in statistical analysis.
The connection between mutual information and statistical bias has been recently studied in Russo and Zhou (2016) in the context of adaptive data analysis. In adaptive statistical analysis, the analyst conducts a sequence of analysis steps, where the choice and structure of each step depends adaptively on the outcomes of the previous ones. Some of the results of Russo and Zhou (2016) have been recently improved by Raginsky and Xu (2017).
Differential privacy and generalization.
Differential privacy, introduced by Dwork et al. (2006), is a rigorous notion of privacy enabling a strong guarantee that data holders may provide to their sources. Pure differential privacy (as opposed to a more relaxed notion known as approximate differential privacy; see Section 2.3 for a precise definition) implies a bound on mutual information (McGregor et al., 2010).
The role of differential privacy in controlling overfitting has been recently studied in several works (e.g. Dwork et al., 2015; Bassily et al., 2016; Rogers et al., 2016; Bassily et al., 2014). The authors of Bassily et al. (2016) provide a treatment of differential privacy as a notion of distributional stability, and a tight characterization of the generalization guarantees of differential privacy.
Max-information and approximate max-information:
Dwork et al. (2015) introduced and studied the notions of max-information, a stronger notion than mutual information, and its relaxation, approximate max-information. (Unlike max-information, the relaxed notion of approximate max-information is not directly related to mutual information; that is, boundedness of one does not necessarily imply the same for the other.) They showed that these notions imply generalization and that pure differentially private algorithms exhibit low (approximate) max-information. Rogers et al. (2016) showed that approximate differentially private algorithms also have low approximate max-information, and that the notion of approximate max-information captures the generalization properties (albeit with slightly worse parameters) of differentially private algorithms (pure or approximate).
Connections to approximate differential privacy:
De (2012) has shown that the relaxed notion of approximate differential privacy does not necessarily imply bounded mutual information. In McGregor et al. (2010), it was also shown that if the dataset entries are independent, then approximate differential privacy implies a (weak) bound on the mutual information. Such a bound has an explicit dependence on the domain size, which restricts its applicability in general settings. Unlike the case of pure differential privacy, an exact characterization of the relationship between mutual information and approximate differentially private algorithms is not fully known even when the dataset distribution is i.i.d.
Bun et al. (2015) showed that the sample complexity of properly learning thresholds (in one dimension) under approximate differential privacy is $\Omega(\log^* N)$, where $N$ is the domain size. Hence, their result asserts the impossibility of this task for infinite domains. In this work, we show a result of a similar flavor (albeit of a weaker implication) for the class of bounded information learners. Specifically, for the problem of proper PAC-learning of thresholds over a domain of size $N$, we show that the mutual information of any proper learning algorithm (deterministic or randomized) that outputs a threshold that is consistent with the input sample is $\Omega(\log\log N)$. This result implies that there are no consistent proper bounded information learners for thresholds over infinite domains.
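We start with some basic terminology from statistical learning (for a textbook see Shalev-Shwartz and Ben-David 2014). Let $\mathcal{X}$ be a set called the domain, $\mathcal{Y}=\{0,1\}$ be the label-set, and $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ be the examples domain. A sample $S=((x_1,y_1),\dots,(x_m,y_m))\in\mathcal{Z}^m$ is a sequence of examples. A function $h:\mathcal{X}\to\mathcal{Y}$ is called a hypothesis or a concept.
Let $\mathcal{D}$ be a distribution over $\mathcal{Z}$. The error of a hypothesis $h$ with respect to $\mathcal{D}$ is defined by $\mathrm{err}(h;\mathcal{D})=\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\mathbf{1}[h(x)\neq y]$. Let $S=((x_1,y_1),\dots,(x_m,y_m))$ be a sample. The empirical error of $h$ with respect to $S$ is defined by $\widehat{\mathrm{err}}(h;S)=\frac{1}{m}\sum_{i=1}^m\mathbf{1}[h(x_i)\neq y_i]$.
A hypothesis class $\mathcal{H}$ is a set of hypotheses. A distribution $\mathcal{D}$ is realizable by $\mathcal{H}$ if there is $h\in\mathcal{H}$ with $\mathrm{err}(h;\mathcal{D})=0$. A sample $S$ is realizable by $\mathcal{H}$ if there is $h\in\mathcal{H}$ with $\widehat{\mathrm{err}}(h;S)=0$.
A learning algorithm, or a learner, is a (possibly randomized) algorithm $A$ that takes a sample $S$ as input and outputs a hypothesis, denoted by $A(S)$. We say that $A$ learns $\mathcal{H}$ (in this paper we focus on learning in the realizable case) if for every $\epsilon,\delta>0$ there is a finite bound $m=m(\epsilon,\delta)$ such that for every $\mathcal{H}$-realizable distribution $\mathcal{D}$,
$$\Pr_{S\sim\mathcal{D}^m}\big[\mathrm{err}(A(S);\mathcal{D})\ge\epsilon\big]\le\delta.$$
Here $\epsilon$ is called the error parameter, and $\delta$ the confidence parameter. $A$ is called proper if $A(S)\in\mathcal{H}$ for every realizable $S$, and it is called consistent if $\widehat{\mathrm{err}}(A(S);S)=0$ for every realizable $S$.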
2.2 Information Theoretic Measures
Information theory studies the quantification and communication of information. In this work, we use the language of learning theory combined with information theory to define and study a new type of learning theoretic compression. Here are standard notions from information theory (for more background see the textbook Cover and Thomas 2006).
Let $X$ and $Y$ be two discrete random variables. The entropy of $X$ measures the number of bits required to encode $X$ on average.
[Entropy] The entropy of $X$ is defined as
$$H(X)=-\sum_x\Pr(X=x)\log\Pr(X=x),$$
where $\log=\log_2$ and by convention $0\log 0=0$.
The mutual information between $X$ and $Y$ is (roughly speaking) a measure of the number of random bits $X$ and $Y$ share on average. It is also a measure of their independence; for example $I(X;Y)=0$ iff $X$ and $Y$ are independent.
[Mutual information] The mutual information between $X$ and $Y$ is defined to be
$$I(X;Y)=H(X)+H(Y)-H(X,Y).$$
The Kullback-Leibler divergence between two measures $\mu$ and $\nu$ is a useful measure for the “distance” between them (it is not a metric and may be infinite).
[KL-divergence] The KL-divergence between two measures $\mu$ and $\nu$ on $\mathcal{X}$ is
$$\mathrm{KL}(\mu\,\|\,\nu)=\sum_x\mu(x)\log\frac{\mu(x)}{\nu(x)}.$$
Mutual information can be written as the following KL-divergence:
$$I(X;Y)=\mathrm{KL}\big(p_{X,Y}\,\|\,p_X\cdot p_Y\big),$$
where $p_{X,Y}$ is the joint distribution of the pair $(X,Y)$, and $p_X\cdot p_Y$ is the product of the marginals $p_X$ and $p_Y$.
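The three notions above are easy to compute for finite distributions. The following sketch (function names are ours, chosen for illustration) evaluates mutual information both from the entropy definition and as a KL-divergence between the joint distribution and the product of its marginals, so the two expressions can be compared numerically.

```python
import math

def entropy(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_divergence(mu, nu):
    """KL(mu || nu) in bits; infinite if mu has mass where nu has none."""
    total = 0.0
    for m, n in zip(mu, nu):
        if m > 0:
            if n == 0:
                return math.inf
            total += m * math.log2(m / n)
    return total

def marginals(joint):
    """Row and column marginals of a joint distribution given as a matrix."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    px, py = marginals(joint)
    flat = [v for row in joint for v in row]
    return entropy(px) + entropy(py) - entropy(flat)

def mutual_information_via_kl(joint):
    """I(X;Y) = KL(joint || product of marginals)."""
    px, py = marginals(joint)
    flat = [v for row in joint for v in row]
    product = [a * b for a in px for b in py]  # same row-major order as flat
    return kl_divergence(flat, product)

correlated = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.25, 0.25], [0.25, 0.25]]
```

On `correlated` both functions return the same positive value, and on `independent` both return zero, matching the statement that $I(X;Y)=0$ iff the variables are independent.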
2.3 Differential Privacy
Differential privacy (Dwork et al., 2006) is a standard notion for statistical data privacy. Despite the connotation perceived by the name, differential privacy is a distributional stability condition that is imposed on an algorithm performing analysis on a dataset. Algorithms satisfying this condition are known as differentially private algorithms. There is a vast literature on the properties of this class of algorithms and their design and structure (see e.g., Dwork and Roth, 2014, for an in-depth treatment).
[Differential privacy] Let $\mathcal{X},\mathcal{Z}$ be two sets, and let $m\in\mathbb{N}$. Let $\epsilon>0$ and $\delta\in[0,1)$. An algorithm $A:\mathcal{X}^m\to\mathcal{Z}$ is said to be $(\epsilon,\delta)$-differentially private if for all datasets $S,S'\in\mathcal{X}^m$ that differ in exactly one entry, and all measurable subsets $\mathcal{O}\subseteq\mathcal{Z}$, we have
$$\Pr_A\big[A(S)\in\mathcal{O}\big]\le e^{\epsilon}\,\Pr_A\big[A(S')\in\mathcal{O}\big]+\delta,$$
where the probability is taken over the random coins of $A$.
When $\delta=0$, the condition is sometimes referred to as pure differential privacy (as opposed to approximate differential privacy when $\delta>0$).
The general form of differential privacy entails two parameters: $\epsilon$, which is typically a small constant, and $\delta$, which in most applications is of the form $\delta=o(1/m)$.
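To make the definition concrete, the sketch below checks the $(\epsilon,\delta)$ condition by brute force for randomized response on bit-valued datasets. Randomized response is a standard textbook mechanism that we bring in purely as an illustration; it is not discussed in this paper, and the function names are hypothetical.

```python
import itertools
import math

def rr_output_distribution(dataset, mech_eps):
    """Randomized response: report each bit truthfully w.p. e^eps / (1 + e^eps)."""
    keep = math.exp(mech_eps) / (1.0 + math.exp(mech_eps))
    dist = {}
    for out in itertools.product((0, 1), repeat=len(dataset)):
        prob = 1.0
        for true_bit, reported in zip(dataset, out):
            prob *= keep if true_bit == reported else 1.0 - keep
        dist[out] = prob
    return dist

def satisfies_dp(mech_eps, eps, delta, m=2):
    """Check the (eps, delta) condition over all neighbouring datasets and all events."""
    points = list(itertools.product((0, 1), repeat=m))
    events = [c for k in range(len(points) + 1)
              for c in itertools.combinations(points, k)]
    for s in points:
        for s2 in points:
            if sum(a != b for a, b in zip(s, s2)) != 1:
                continue  # only datasets differing in exactly one entry
            p = rr_output_distribution(s, mech_eps)
            q = rr_output_distribution(s2, mech_eps)
            for event in events:
                lhs = sum(p[o] for o in event)
                rhs = math.exp(eps) * sum(q[o] for o in event) + delta
                if lhs > rhs + 1e-12:
                    return False
    return True
```

With these definitions, `satisfies_dp(1.0, 1.0, 0.0)` holds (pure differential privacy with $\epsilon=1$), while the same mechanism fails the stricter requirement `satisfies_dp(1.0, 0.5, 0.0)`.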
Differential privacy has been shown to provide non-trivial generalization guarantees, especially in the adaptive settings of statistical analyses (see e.g., Dwork et al., 2015; Bassily et al., 2016; Dwork et al., 2015). In the context of (agnostic) PAC-learning, there has been a long line of work (e.g. Kasiviswanathan et al., 2008; Beimel et al., 2010, 2013; Feldman and Xiao, 2014; Bun et al., 2015) that studied differentially private learning and the characterization of the sample complexity of private learning in several settings. However, the picture of differentially private learning is far from complete, and many questions remain open. Vadhan (2017) gives a good survey on the subject.
3 $d$-Bit Information Learners
Here we define learners that use little information from their input. (In this text we focus on Shannon’s mutual information, but other notions of divergence may be interesting to investigate as well.) We start by setting some notation. Let $A$ be a (possibly randomized) learning algorithm. For every sample $S\in(\mathcal{X}\times\{0,1\})^m$, let $P_{h|S}(\cdot)$ denote the conditional distribution function of the output of the algorithm given that its input is $S$. When $A$ is deterministic, $P_{h|S}$ is a degenerate distribution. For a fixed distribution $\mathcal{D}$ over examples and $m\in\mathbb{N}$, let $P_h(\cdot)$ denote the marginal distribution of the output of $A$ when it takes an input sample of size $m$ drawn i.i.d. from $\mathcal{D}$, i.e. $P_h(f)=\mathbb{E}_{S\sim\mathcal{D}^m}\big[P_{h|S}(f)\big]$ for every function $f$.
[Mutual information of an algorithm] We say that $A$ has mutual information of at most $d$ bits (for sample size $m$) with respect to a distribution $\mathcal{D}$ if
$$I(S;A(S))\le d,$$
where $S\sim\mathcal{D}^m$.
[$d$-bit information learner] A learning algorithm $A$ for $\mathcal{H}$ is called a $d$-bit information learner if it has mutual information of at most $d$ bits with respect to every realizable distribution ($d$ can depend on the sample size).
3.1 Bounded Information Implies Generalization
The following theorem quantifies the generalization guarantees of $d$-bit information learners.
Let $A$ be a learner that has mutual information of at most $d$ bits with a distribution $\mathcal{D}$, and let $S\sim\mathcal{D}^m$. Then, for every $\epsilon>0$,
$$\Pr_{A,S}\Big[\big|\widehat{\mathrm{err}}(A(S);S)-\mathrm{err}(A(S);\mathcal{D})\big|>\epsilon\Big]<\frac{d+1}{2m\epsilon^2-1},$$
where the probability is taken over the randomness in the sample $S$ and the randomness of $A$.
In particular, if a class $\mathcal{H}$ admits an $o(m)$-bit information learner then the class is PAC learnable. Also, some of the proofs will go through for multi-class classification with every bounded loss function.
The fact that the sample complexity bound that follows from the theorem is sharp is proved in Section 4.2. We mention that the dependence on $\epsilon$ can be improved in the realizable case; if the algorithm always outputs a hypothesis with empirical error $0$ then the bound on the right hand side can be replaced by $O\big(\frac{d+1}{m\epsilon}\big)$. As in similar cases, the reason for this difference stems from the fact that estimating the bias of a coin up to an additive error $\epsilon$ requires $m\approx 1/\epsilon^2$ samples, but if the coin falls on heads with probability $\epsilon$ then the chance of seeing $m$ tails in a row is $(1-\epsilon)^m\approx e^{-m\epsilon}$.
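The trade-off between $d$, $m$, $\epsilon$ and the failure probability can be explored numerically. The sketch below assumes the bound of Theorem 3.1 has the form $(d+1)/(2m\epsilon^2-1)$, which is what one obtains by combining Chernoff's inequality with the lemma of Section 4.1; the function names are ours.

```python
import math

def generalization_failure_bound(d, m, eps):
    """Assumed form of the Theorem 3.1 bound: (d + 1) / (2 m eps^2 - 1), capped at 1."""
    denom = 2.0 * m * eps ** 2 - 1.0
    if denom <= 0:
        return 1.0  # the bound is vacuous for such small samples
    return min(1.0, (d + 1.0) / denom)

def sufficient_sample_size(d, eps, delta):
    """Smallest m for which the assumed bound is at most delta."""
    # (d + 1) / (2 m eps^2 - 1) <= delta  iff  m >= ((d + 1) / delta + 1) / (2 eps^2)
    return math.ceil(((d + 1.0) / delta + 1.0) / (2.0 * eps ** 2))

m_needed = sufficient_sample_size(d=10, eps=0.1, delta=0.05)
bound_at_m = generalization_failure_bound(10, m_needed, 0.1)
```

Note the linear dependence of `sufficient_sample_size` on $1/\delta$: halving `delta` roughly doubles the required sample size, in contrast with the logarithmic dependence in VC-type bounds.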
Proof Sketch for Deterministic Algorithms
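Here we sketch a proof of Theorem 3.1 for deterministic algorithms. When $A$ is deterministic, we have
$$I=I(S;A(S))=H(A(S)).$$
Let $P_h$ denote the distribution of $A(S)$. Let $\mathcal{H}_0$ be the set of hypotheses $f$ so that $P_h(f)\ge 2^{-I/\delta}$. By Markov’s inequality (applied to $\log(1/P_h(A(S)))$, whose expectation is $I$), $P_h(\mathcal{H}_0)\ge 1-\delta$. In addition, the size of $\mathcal{H}_0$ is at most $2^{I/\delta}$. So Chernoff’s inequality and the union bound imply that for every $f\in\mathcal{H}_0$ the empirical error is close to the true error for $m\approx\frac{I}{\epsilon^2\delta}$ (with probability at least $1-\delta$).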
4 Proofs that Bounded Information Implies Generalization
We prove the statement in Theorem 3.1 via four different approaches (some of the arguments are only sketched), each of which highlights a different general idea.
The first proof is based on an information theoretic lemma, which roughly states that if the KL-divergence between two measures $\mu$ and $\nu$ is small then $\mu(E)$ is not much larger than $\nu(E)$ for every event $E$. The nature of this proof strongly resembles the proof of the PAC-Bayes bounds (Shalev-Shwartz and Ben-David, 2014), and indeed a close variant of the theorem can be derived from these standard bounds as well (see proof IV). The second proof is based on a method to efficiently “de-correlate” two random variables in terms of their mutual information; roughly speaking, this implies that an algorithm of low mutual information can only generate a small number of hypotheses and hence does not overfit. The third proof highlights an important connection between low mutual information and the stability of a learning algorithm. The last proof uses the PAC-Bayes framework. The first proof follows; see appendices A.1, A.2 and A.3 for the other proofs.
4.1 Proof I: Mutual Information and Independence
The first proof that we present uses the following lemma, which allows us to control a distribution $\mu$ by a distribution $\nu$ as long as it is close to it in KL-divergence. The proof technique is similar to a classical technique in Shannon’s information theory, e.g., in Arutyunyan (1968).
Let $\mu$ and $\nu$ be probability distributions on a finite set $\mathcal{X}$ and let $E\subseteq\mathcal{X}$. Then,
$$\mu(E)\le\frac{\mathrm{KL}(\mu\,\|\,\nu)+1}{\log(1/\nu(E))}.$$
The lemma enables us to compare between events of small probability: if $\nu(E)$ is small then $\mu(E)$ is also small, as long as $\mathrm{KL}(\mu\,\|\,\nu)$ is not very large.
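The bound given by the lemma above is tight, as the following example shows. Let $\mathcal{X}=[2n]$ and let $E=[n]$. For each $x\in\mathcal{X}$, let
$$\mu(x)=\frac{1}{2n}\qquad\text{and}\qquad\nu(x)=\begin{cases}1/n^2 & x\in E,\\ (n-1)/n^2 & x\notin E.\end{cases}$$
Thus, $\mu(E)=1/2$ and $\nu(E)=1/n$. But on the other hand
$$\mathrm{KL}(\mu\,\|\,\nu)=\frac12\log\frac{n}{2}+\frac12\log\frac{n}{2(n-1)}=\frac12\log n-1+o(1),$$
so the right hand side of the lemma equals $\frac12+o(1)$.
Similar examples can be given when $\mathrm{KL}(\mu\,\|\,\nu)$ is constant.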
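The tightness claim can be checked numerically. The sketch below uses one concrete instance that we assume for illustration: $\mu$ uniform on $[2n]$, $E=[n]$, and $\nu$ placing mass $1/n^2$ on each point of $E$ and $(n-1)/n^2$ on each remaining point. It compares $\mu(E)$ with the bound $(\mathrm{KL}(\mu\|\nu)+1)/\log(1/\nu(E))$.

```python
import math

def kl_bits(mu, nu):
    """KL(mu || nu) in bits for two lists of positive probabilities."""
    return sum(m * math.log2(m / v) for m, v in zip(mu, nu) if m > 0)

def tightness_example(n):
    """Return (mu(E), lemma bound) for the assumed example on [2n] with E = [n]."""
    mu = [1.0 / (2 * n)] * (2 * n)
    nu = [1.0 / n ** 2] * n + [(n - 1.0) / n ** 2] * n
    mu_e = sum(mu[:n])
    nu_e = sum(nu[:n])
    bound = (kl_bits(mu, nu) + 1.0) / math.log2(1.0 / nu_e)
    return mu_e, bound

mu_e, bound = tightness_example(1024)
```

For growing `n` the bound approaches $\mu(E)=1/2$ from above, so in this instance the lemma cannot be improved by more than lower-order terms.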
We now change the setting to allow it to apply more naturally to $d$-bit information learners.
Let $\mu$ be a distribution on the space $\mathcal{X}\times\mathcal{Y}$ and let $E$ be an event that satisfies
$$\mu_X(E_y)<\alpha$$
for all $y\in\mathcal{Y}$, where $\mu_X$ is the marginal distribution of $X$ and $E_y=\{x:(x,y)\in E\}$ is a fiber of $E$ over $y$. Then
$$\mu(E)\le\frac{I(X;Y)+1}{\log(1/\alpha)}.$$
The lemma enables us to bound the probability of an event $E$, if we have a bound on the probability of its fibers over $y$ measured with the marginal distribution of $X$.
This lemma can be thought of as a generalization of the extremal case where $X$ and $Y$ are independent (i.e. $I(X;Y)=0$). In this case, the lemma corresponds to the following geometric statement in the plane: if the width of every $x$-parallel fiber of a shape is at most $\alpha$ and its height is bounded by 1 then its area is also at most $\alpha$. The bound given by the above lemma is $\frac{1}{\log(1/\alpha)}$, which is weaker, but the lemma applies more generally when $I(X;Y)>0$. In fact, the bound is tight when the two variables are highly dependent; e.g. $X=Y$ and $X\sim U([n])$. In this case, the probability of the diagonal $E$ is $1$, while $I(X;Y)=\log n$ and $\alpha=1/n$. So indeed $1=\mu(E)\approx\frac{\log n+1}{\log n}$.
We now use this lemma to prove the theorem.
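[Proof of Theorem 3.1] Let $\mu$ be the distribution on pairs $(S,h)$ where $S$ is chosen i.i.d. from $\mathcal{D}$ and $h$ is the output of the algorithm given $S$. Let $E$ be the event of error; that is,
$$E=\Big\{(S,h):\big|\mathrm{err}(h;\mathcal{D})-\widehat{\mathrm{err}}(h;S)\big|>\epsilon\Big\}.$$
Using Chernoff’s inequality, for each $h$,
$$\mu_S(E_h)\le 2\exp(-2m\epsilon^2),$$
where $E_h$ is the fiber of $E$ over function $h$.
Lemma 4.1 implies
$$\mu(E)\le\frac{I(S;A(S))+1}{2m\epsilon^2-1}.$$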
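[Proof of Lemma 4.1 (the KL-divergence lemma)]
$$\begin{aligned}\mathrm{KL}(\mu\,\|\,\nu)&=-\mu(E)\sum_{x\in E}\frac{\mu(x)}{\mu(E)}\log\frac{\nu(x)}{\mu(x)}-\mu(E^c)\sum_{x\in E^c}\frac{\mu(x)}{\mu(E^c)}\log\frac{\nu(x)}{\mu(x)}\\&\overset{(a)}{\ge}-\mu(E)\log\frac{\nu(E)}{\mu(E)}-\mu(E^c)\log\frac{\nu(E^c)}{\mu(E^c)}\\&\overset{(b)}{\ge}-\mu(E)\log\nu(E)-\mu(E^c)\log\nu(E^c)-1\\&\ge\mu(E)\log\frac{1}{\nu(E)}-1,\end{aligned}$$
where (a) follows by convexity, and (b) holds since the binary entropy is at most one.
[Proof of Lemma 4.1 (the lemma on fibers)]
By the KL-divergence lemma, for each $y$,
$$\mu_{X|Y=y}(E_y)\le\frac{\mathrm{KL}\big(\mu_{X|Y=y}\,\|\,\mu_X\big)+1}{\log(1/\mu_X(E_y))}\le\frac{\mathrm{KL}\big(\mu_{X|Y=y}\,\|\,\mu_X\big)+1}{\log(1/\alpha)}.$$
Taking expectation over $y$ yields
$$\mu(E)\le\frac{I(X;Y)+1}{\log(1/\alpha)}.$$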
4.2 The Sample Complexity Bound is Sharp
Standard bounds on the sample complexity of learning hypothesis classes of VC dimension $d$ imply that to achieve a fixed confidence $\delta$ one must use at least
$$m=\Omega\!\left(\frac{d+\log(1/\delta)}{\epsilon^2}\right)$$
examples in the non-realizable case (see e.g., Shalev-Shwartz and Ben-David, 2014, theorem 6.7), and this bound is sharp.
In contrast, Theorem 3.1 above states that achieving confidence $\delta$ requires
$$m=\Omega\!\left(\frac{d}{\epsilon^2\delta}\right)$$
examples, where in this case $d$ is the bound on $I(A(S);S)$. A natural question to ask is whether this sample complexity bound is also sharp. We now show that indeed it is.
To see that the bound is tight for $d$ and $\epsilon$, consider the case where $\mathcal{X}=[d]$ and $\mathcal{H}=\{0,1\}^{\mathcal{X}}$. For any learner $A$ it holds that
$$I(A(S);S)\le H(A(S))\le\log|\mathcal{H}|=d.$$
However, the VC dimension of $\mathcal{H}$ is also $d$. Because the bound for VC dimension is always sharp and it equals the bound from Theorem 3.1, it follows that the latter bound is also sharp in this case.
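To see that the bound is sharp in $\delta$ as well, consider the following proposition.
Let $n\ge m\ge 4$ be integers such that $n$ is sufficiently large.
Let $\mathcal{X}=[n]$ and let $\mathcal{D}$ be the uniform distribution on examples of the form $\{(x,1):x\in[n]\}$. There is a deterministic learning algorithm $A:(\mathcal{X}\times\{0,1\})^m\to\{0,1\}^{\mathcal{X}}$ with sample size $m$ and mutual information $O(1)$ so that
$$\Pr_S\Big[\big|\widehat{\mathrm{err}}(h;S)-\mathrm{err}(h;\mathcal{D})\big|\ge\tfrac12\Big]\ge\frac1m,$$
where $S$ is generated i.i.d. from $\mathcal{D}$ and $h=A(S)$.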
The construction is based on the following claim.
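For a sufficiently large $n$, there are $M=\Theta(2^m/m)$ subsets $T_1,\dots,T_M$ of $[n]$, each of size at most $n/2$, so that the $\mathcal{D}^m$-measure of $T=\bigcup_{i\in[M]}(T_i\times\{1\})^m$ is between $1/m$ and $4/m$.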
Let be i.i.d. uniformly random subsets of , each of size . The size of each is . For we have
where in the last inequality, the first term corresponds to sequences where all elements are distinct, and the second term corresponds to sequences for which there are such that . Now, since ,
Therefore, since ,
as long as . Hence, plugging yields that
where the term approaches 0 as approaches . So, for a sufficiently large there is a choice of as claimed.
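[Proof of Proposition 4.2] The algorithm $A$ is defined as follows. Let $T_1,\dots,T_M$ be the sets given in Claim 4.2. For each $i\in[M]$, let $h_i$ be the hypothesis that is $1$ on $T_i$ and $0$ elsewhere. Given a sample $((x_1,1),\dots,(x_m,1))$, the algorithm outputs $h_i$, where $i$ is the minimum index so that $\{x_1,\dots,x_m\}\subseteq T_i$; if no such index exists the algorithm outputs the all-ones function.
The empirical error of the algorithm is $0$, and with probability $p\in[1/m,4/m]$ it outputs a hypothesis with true error at least $1/2$. The amount of information it provides on its inputs can be bounded as follows: letting $p_i$ be the probability that the algorithm outputs $h_i$, so that $\sum_i p_i=p$, we have
$$I(S;A(S))=H(A(S))=(1-p)\log\frac{1}{1-p}+\sum_{i\in[M]}p_i\log\frac{1}{p_i}\le(1-p)\log\frac{1}{1-p}+p\log\frac{M}{p}\le O(1).$$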
5 A Lower Bound on Information
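In this section we show that no proper consistent learner for the class of thresholds can use only little information with respect to all realizable distributions $\mathcal{D}$. Namely, we find for every such algorithm $A$ a realizable distribution $\mathcal{D}$ so that $I(S;A(S))$ is large.
Let $\mathcal{X}=[2^n]$ and let $\mathcal{T}\subseteq\{0,1\}^{\mathcal{X}}$ be the set of all thresholds; that is $\mathcal{T}=\{f_k\}_{k\in[2^n]}$ where
$$f_k(x)=\begin{cases}0 & x<k,\\ 1 & x\ge k.\end{cases}$$
For any consistent and proper learning algorithm $A$ for $\mathcal{T}$ with sample size $m$ there exists a realizable distribution $\mathcal{D}=\mathcal{D}(A)$ so that
$$I(S;A(S))=\Omega\!\left(\frac{\log n}{m^2}\right)=\Omega\!\left(\frac{\log\log|\mathcal{X}|}{m^2}\right),$$
where $S\sim\mathcal{D}^m$.
The high-level approach is to identify in $A$ a rich enough structure and use it to define the distribution $\mathcal{D}$. Part of the difficulty in implementing this approach is that we need to argue about a general algorithm, with no specific structure. A different aspect of the difficulty in defining $\mathcal{D}$ stems from the fact that we cannot construct $\mathcal{D}$ adaptively: we must choose it first, and only then does the algorithm get to see many samples from it.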
5.1 Warm Up
We first prove Theorem 5 for the special case of deterministic learning algorithms. Let $A$ be a consistent deterministic learning algorithm for $\mathcal{T}$. Define the $2^n\times 2^n$ upper triangular matrix $M$ as follows. For all $i<j$,
$$M_{ij}=k,$$
where $f_k$ is the output of $A$ on a sample of the form
$$S_{ij}=\big(\underbrace{(1,0),\dots,(1,0)}_{m-2},(i,0),(j,1)\big),$$
and $M_{ij}=0$ for all $i\ge j$.
The matrix $M$ summarizes the behavior of $A$ on some of its inputs. Our goal is to identify a sub-structure in $M$, and then use it to define the distribution $\mathcal{D}$.
We start with the following lemma.
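Let $Q\in\mathrm{Mat}_{2^n\times 2^n}(\mathbb{N})$ be a symmetric matrix that has the property that for all $i,j$:
$$\min\{i,j\}\le Q_{ij}\le\max\{i,j\}.\qquad\text{(i)}$$
Then $Q$ contains a row with at least $n+1$ different values (and hence also a column with $n+1$ different values).
The proof is by induction on $n$. In the base case $n=1$ we have
$$Q=\begin{pmatrix}1 & x\\ x & 2\end{pmatrix}$$
and the lemma indeed holds.
For the induction step, let
$$Q=\begin{pmatrix}Q_1 & Q_2\\ Q_3 & Q_4\end{pmatrix}\in\mathrm{Mat}_{2^{n+1}\times 2^{n+1}}(\mathbb{N}),$$
where each block $Q_i$ is a $2^n\times 2^n$ matrix.
All the values in $Q_1$ are in the interval $[1,2^n]$ and $Q_1$ is also a symmetric matrix and satisfies property (i). So $Q_1$ contains some row $r$ with at least $n+1$ distinct values in the interval $[1,2^n]$.
Similarly, all the values in $Q_4$ are in the interval $[2^n+1,2^{n+1}]$, and we can write $Q_4$ as
$$Q_4=Q_4'+2^n\cdot J,$$
where $J$ is the all-1 matrix and $Q_4'$ is a symmetric matrix satisfying property (i). From the induction hypothesis it follows that $Q_4$ contains a column $k$ with at least $n+1$ different values in the interval $[2^n+1,2^{n+1}]$.
Now consider the value $(Q_2)_{rk}$. If $(Q_2)_{rk}\in[1,2^n]$ then the column in $Q$ corresponding to $k$ contains $n+2$ different values. Otherwise, the row corresponding to $r$ contains $n+2$ different values.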
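The combinatorial lemma can be sanity-checked by random testing. The sketch below assumes the lemma in the form "a symmetric $2^n\times 2^n$ matrix with $\min\{i,j\}\le Q_{ij}\le\max\{i,j\}$ has a row with at least $n+1$ distinct values"; it draws random matrices with this property and reports the largest number of distinct values in any row. The helper names are ours.

```python
import random

def random_sandwiched_matrix(n, rng):
    """Random symmetric 2^n x 2^n matrix with min(i,j) <= Q[i][j] <= max(i,j) (1-indexed)."""
    size = 2 ** n
    q = [[0] * size for _ in range(size)]
    for i in range(1, size + 1):
        for j in range(i, size + 1):
            value = rng.randint(i, j)  # inclusive on both ends
            q[i - 1][j - 1] = value
            q[j - 1][i - 1] = value
    return q

def max_distinct_in_row(q):
    """Largest number of distinct values appearing in a single row."""
    return max(len(set(row)) for row in q)

def lemma_holds_on_random_instances(max_n=5, trials=20, seed=0):
    rng = random.Random(seed)
    for n in range(1, max_n + 1):
        for _ in range(trials):
            if max_distinct_in_row(random_sandwiched_matrix(n, rng)) < n + 1:
                return False
    return True
```

Random matrices typically have far more than $n+1$ distinct values in some row; the content of the lemma is that even adversarially chosen matrices cannot go below $n+1$.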
Next, consider the matrix
$$Q=M+M^{t}+\mathrm{diag}(1,2,3,\dots,2^n).$$
It is a symmetric matrix satisfying property (i), and so it contains a row $r$ with at least $n+1$ distinct values. If row $r$ contains at least $n/2$ distinct values above the diagonal then row $r$ in $M$ also contains $n/2$ distinct values above the diagonal. Otherwise, row $r$ contains at least $n/2$ distinct values below the diagonal, and then column $r$ in $M$ contains $n/2$ distinct values above the diagonal.
The proof proceeds by separately considering each of these two cases (row or column):
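Case 1: $M$ contains a row $r$ with $n/2$ distinct values. Let $k_1,\dots,k_{n/2}$ be columns such that all the values $M_{rk_i}$ are distinct. Let $\mathcal{D}$ be the distribution that gives the point $1$ a mass of $1-\frac{1}{m-2}$, gives $r$ a mass of $\frac{1}{2(m-2)}$ and evenly distributes the remaining mass on the values $k_1,\dots,k_{n/2}$. The function labeling the examples is chosen to be $f_{r+1}\in\mathcal{T}$.
Let $E$ be the indicator random variable of the event that the sample $S$ is of the form
$$S=\big(\underbrace{(1,0),\dots,(1,0)}_{m-2},(r,0),(k_i,1)\big)$$
for some $k_i$. The probability of $E=1$ is at least
$$\Big(1-\frac{1}{m-2}\Big)^{m-2}\Big(\frac{1}{2(m-2)}\Big)^{2}=\Omega\!\left(\frac{1}{m^2}\right).$$
From the definition of $M$, when $E=1$ the algorithm outputs $h=f_{M_{rk_i}}$ where $k_i$ is distributed uniformly over $n/2$ values, and so $H(A(S)\mid E=1)=\log(n/2)$. This yields
$$I(S;A(S))=I(S,E;A(S))\ge I(S;A(S)\mid E)=H(A(S)\mid E)\ge\Pr[E=1]\cdot H(A(S)\mid E=1)\ge\Omega\!\left(\frac{\log n}{m^2}\right).$$
Case 2: $M$ contains a column $k$ with $n/2$ distinct values. Let $r_1,\dots,r_{n/2}$ be rows such that all the values $M_{r_ik}$ are distinct. Now, consider the event that
$$S=\big(\underbrace{(1,0),\dots,(1,0)}_{m-2},(r_i,0),(k,1)\big)$$
for some $r_i$. The rest of the argument is the same as for the previous case.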
5.2 Framework for Lower Bounding Mutual Information
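Here we describe a simple framework that allows us to lower bound the mutual information between two random variables.
Standard Lemmas (we include the proofs for completeness)
For any two distributions $p$ and $q$, the contribution of the terms with $p(x)<q(x)$ to the divergence is at least $-1$:
$$\sum_{x:\,p(x)<q(x)}p(x)\log\frac{p(x)}{q(x)}>-1.$$
Let $E$ denote the subset of $x$’s for which $p(x)<q(x)$. Then we have
$$\sum_{x\in E}p(x)\log\frac{p(x)}{q(x)}=-p(E)\sum_{x\in E}p(x\mid E)\log\frac{q(x)}{p(x)}\ge-p(E)\log\sum_{x\in E}p(x\mid E)\frac{q(x)}{p(x)}=-p(E)\log\frac{q(E)}{p(E)}\ge p(E)\log p(E).$$
For $0\le z\le 1$, $z\log z$ is minimized when its derivative is $0$: $\log e+\log z=0$. So the minimum is attained at $z=1/e$, proving that $p(E)\log p(E)\ge-\frac{\log e}{e}>-1$.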
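Lemma (Data processing)
Let $X,Y,Z$ be random variables such that $X-Y-Z$ form a Markov chain; that is, $X$ and $Z$ are independent conditioned on $Y$. Then
$$I(X;Y)\ge I(X;Z).$$
The chain rule for mutual information yields that
$$\begin{aligned}I(X;Y)&=I(X;Z)+I(X;Y\mid Z)-I(X;Z\mid Y)\\&=I(X;Z)+I(X;Y\mid Z)&&\text{($X$ and $Z$ are independent given $Y$)}\\&\ge I(X;Z).&&\text{(information is non-negative)}\end{aligned}$$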
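A small numerical illustration of the data processing inequality: below, a uniform bit $X$ is passed through a binary symmetric channel to produce $Y$, and $Y$ is passed through a second such channel to produce $Z$, so that $X-Y-Z$ is a Markov chain. The channel parameters are arbitrary choices of ours.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution given as a matrix."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    info = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                info += p * math.log2(p / (px[i] * py[j]))
    return info

def bsc(flip):
    """Transition matrix of a binary symmetric channel with the given flip probability."""
    return [[1.0 - flip, flip], [flip, 1.0 - flip]]

def joint_through_channel(prior, channel):
    """Joint distribution of (input, output) for an input prior and a channel."""
    return [[prior[a] * channel[a][b] for b in range(len(channel[a]))]
            for a in range(len(prior))]

def compose(first, second):
    """Channel obtained by applying `first` and then `second`."""
    return [[sum(first[a][b] * second[b][c] for b in range(len(second)))
             for c in range(len(second[0]))]
            for a in range(len(first))]

prior = [0.5, 0.5]
x_to_y = bsc(0.1)
y_to_z = bsc(0.2)
i_xy = mutual_information(joint_through_channel(prior, x_to_y))
i_xz = mutual_information(joint_through_channel(prior, compose(x_to_y, y_to_z)))
```

Here `i_xy` exceeds `i_xz`, as the lemma requires: further processing of $Y$ can only lose information about $X$.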
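The following lemma is the key tool in proving a lower bound on the mutual information.
Let $r\in\mathbb{N}$, and let $p_1,\dots,p_r$ be probability distributions over the set $[r]$ such that for all $i\in[r]$,
$$p_i(i)\ge\frac12.$$
Let $U$ be a random variable distributed uniformly over $[r]$. Let $T$ be a random variable over $[r]$ that results from sampling an index $i$ according to $U$ and then sampling an element of $[r]$ according to $p_i$. Then
$$I(U;T)=\Omega(\log r).$$
Write $p_T$ for the marginal distribution of $T$, so that
$$I(U;T)=\sum_{i}\frac1r\,p_i(i)\log\frac{p_i(i)}{p_T(i)}+\sum_{i}\sum_{t\neq i}\frac1r\,p_i(t)\log\frac{p_i(t)}{p_T(t)}.$$
Consider the first sum (the “diagonal”): by the log-sum inequality, and since $\sum_i\frac1r p_i(i)\ge\frac12$ and $\sum_i\frac1r p_T(i)=\frac1r$,
$$\sum_{i}\frac1r\,p_i(i)\log\frac{p_i(i)}{p_T(i)}\ge\frac12\log\frac r2.$$
Finally, Lemma 5.2 implies that the second sum (the “off-diagonal”) is at least $-1$.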
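The lower bound can be checked numerically. The sketch below assumes the lemma's hypothesis is $p_i(i)\ge 1/2$ and combines the log-sum inequality for the diagonal terms with the $-1$ bound for the off-diagonal terms, which gives $I(U;T)\ge\frac12\log\frac r2-1$. It generates random "diagonal-heavy" distributions and computes $I(U;T)$ exactly; the function names are ours.

```python
import math
import random

def random_diagonal_heavy(r, rng):
    """r random distributions over range(r), the i-th with mass at least 1/2 on i."""
    rows = []
    for i in range(r):
        weights = [rng.random() for _ in range(r)]
        total = sum(weights)
        row = [0.5 * w / total for w in weights]  # spread half of the mass at random
        row[i] += 0.5                             # put the other half on the diagonal
        rows.append(row)
    return rows

def information_u_t(rows):
    """I(U;T) in bits when U is uniform over the rows and T | U=i is drawn from rows[i]."""
    r = len(rows)
    p_t = [sum(rows[i][t] for i in range(r)) / r for t in range(r)]
    info = 0.0
    for i in range(r):
        for t in range(r):
            if rows[i][t] > 0:
                info += rows[i][t] / r * math.log2(rows[i][t] / p_t[t])
    return info

r = 64
info = information_u_t(random_diagonal_heavy(r, random.Random(0)))
lower_bound = 0.5 * math.log2(r / 2) - 1.0
```

For `r = 64` the lower bound equals $1.5$ bits, and the computed `info` lies above it.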
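We generalize the previous lemma as follows.
Let $p_1,\dots,p_r$ be probability distributions over $\mathcal{X}$. Let $\Omega_1,\dots,\Omega_r\subseteq\mathcal{X}$ be pairwise disjoint events such that for all $i\in[r]$,
$$p_i(\Omega_i)\ge\frac12.$$
Let $U$ be a random variable distributed uniformly over $[r]$. Let $T$ be a random variable taking values in $\mathcal{X}$ that results from sampling an index $i$ according to $U$ and then sampling an element of $\mathcal{X}$ according to $p_i$. Then,
$$I(U;T)=\Omega(\log r).$$
Let $\Omega_0=\mathcal{X}\setminus\bigcup_{i\in[r]}\Omega_i$, and let $T'=T'(T)$ be the random variable taking values in $\{0,1,\dots,r\}$ defined by $T'=i$ iff $T\in\Omega_i$. Hence, $T'$ satisfies the conditions of Lemma 5.2. Furthermore, we have that the random variables form the following Markov chain: $U-T-T'$. We thus conclude that
$$I(U;T)\ge I(U;T')=\Omega(\log r).$$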
5.3 Proof for General Case
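We start with the analog of Lemma 5.1 from the warm up. Let $\Delta([n])$ be the set of all probability distributions over $[n]$. Let $Q\in\mathrm{Mat}_{2^n\times 2^n}(\Delta([2^n]))$, i.e., $Q$ is a $2^n\times 2^n$ matrix where each cell contains a probability distribution.
Assume that $Q$ is symmetric and that it has the property that for all $i,j$,
$$\mathrm{supp}(Q_{ij})\subseteq[\min\{i,j\},\max\{i,j\}].\qquad\text{(ii)}$$
Then, $Q$ contains a row with $n+1$ distributions $p_1,\dots,p_{n+1}$ such that there exist pairwise disjoint sets $\Omega_1,\dots,\Omega_{n+1}\subseteq[2^n]$ so that for all $i\in[n+1]$,
$$p_i(\Omega_i)\ge\frac12\qquad\text{(iii)}$$
(and hence it also contains such a column).
Again, the proof is by induction on $n$. The base case is easily verified. For the step, let
$$Q=\begin{pmatrix}Q_1 & Q_2\\ Q_3 & Q_4\end{pmatrix}.$$
The matrix $Q_1$ is symmetric, it satisfies property (ii), and the supports of all the entries in $Q_1$ are contained in $[2^n]$. So by induction $Q_1$ contains a row $r$ with probability functions $p_1,\dots,p_{n+1}$ and pairwise disjoint sets $\Omega_1,\dots,\Omega_{n+1}\subseteq[2^n]$ that satisfy property (iii).
Similarly, one sees that $Q_4$ satisfies property (iii) as well. Namely, $Q_4$ contains a column $k$ with probabilities $q_1,\dots,q_{n+1}$ and pairwise disjoint sets $\Omega'_1,\dots,\Omega'_{n+1}\subseteq[2^n+1,2^{n+1}]$ with $q_i(\Omega'_i)\ge\frac12$ for all $i$.
We now consider the probability distribution $p=(Q_2)_{rk}$. If $p([2^n])\ge\frac12$ then we define $q_{n+2}=p$ and $\Omega'_{n+2}=[2^n]$. Thus, column $k$ of $Q$ satisfies property (iii) with probabilities $q_1,\dots,q_{n+2}$ and sets $\Omega'_1,\dots,\Omega'_{n+2}$. Otherwise, we choose $p_{n+2}=p$ and $\Omega_{n+2}=[2^n+1,2^{n+1}]$. In this case, the row $r$ of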