The high-level question that guides this paper is:
when is learning equivalent to compression?
Variants of this question were studied extensively throughout the years in many different contexts. Recently, its importance grew even further due to the growing complexity of learning tasks. In this work, we measure compression using information theory. Our main message is that, in the framework we develop, learning implies compression.
It is well-known that in many contexts, the ability to compress implies learnability. Here is a partial list of examples: sample compression schemes Littlestone and Warmuth (1986); Moran and Yehudayoff (2016), Occam’s razor Blumer et al. (1987), minimum description length (Rissanen, 1978; Grünwald, 2007), and differential privacy Dwork et al. (2006, 2015); Bassily et al. (2016); Rogers et al. (2016); Bassily et al. (2014). We refer the interested reader to Xu and Raginsky (2017); Bassily et al. (2018) for more details.
We use the setting of Xu and Raginsky (2017) and Bassily et al. (2018), where the value of interest is the mutual information $I(S; A(S))$ between the input sample $S$ and the output $A(S)$ of the learning algorithm $A$. Xu and Raginsky (2017) and Bassily et al. (2018) suggested that studying this notion may shed additional light on the relations between compression and learning.
The rationale is that compression is, in many cases, an information theoretic notion, so it is natural to use information theory to quantify the amount of compression a learning algorithm performs; the quantity $I(S; A(S))$ is a natural measure of this amount. Additional motivation comes from the connections to privacy, which is about leaking little information while maintaining functionality.
In the information theoretic setting, Xu and Raginsky (2017) and Bassily et al. (2018) showed that for every learning algorithm for which the information $I(S; A(S))$ is much smaller than the sample size $m$, the true error and the empirical error are typically close. This highlights the following simple rule of thumb for designing learning algorithms: try to find an algorithm that has small empirical error but at the same time reveals a small amount of information on its input.
What about the other direction? Is it true that learning implies compression in this context? Bassily et al. (2018) answered this question for the class of thresholds and Nachum et al. (2018) extended the result to classes of VC-dimension $d$ (see Section 2 for notation).
[Bassily et al. (2018); Nachum et al. (2018)] For every and every , there exists a class of VC-dimension such that for any proper and consistent (possibly randomized) learning algorithm, there exists a hypothesis and a random variable over such that where .
The theorem can be interpreted as saying that no, learning does not imply compression in this context. In some cases, for any consistent and proper algorithm, there is always a scenario in which a large amount of information is revealed.
In this work, we shift our attention from a worst-case analysis to an average-case analysis. In the average-case setting, we show that every prior distribution over of VC-dimension admits an algorithm that typically reveals -bits of information on its input (there is an unbounded difference between the worst-case and the average-case).
This result is a special case of a more general phenomenon we explore. If there is a low information learner when the algorithm knows the underlying distribution on inputs, then there is a learner that reveals little information on an average concept without knowing the distribution on inputs (Lemma 3).
The average-case framework is different from the standard worst-case PAC setting. In the standard model, the teacher (or nature) is thought of as being adversarial and is assumed to have perfect knowledge of the learner’s strategy.
From a practical point of view, it is not obvious that such strong assumptions about the environment should be made, since worst-case analysis seems to fail when trying to explain real-life learning algorithms.
From a biological perspective, to survive, a living organism must perform many tasks (concept class). No human can perform well on all of them (worst case analysis). What matters for survival is to be able to perform well on most tasks (average case learning).
The average-case framework we study also provides a general mechanism for proving upper bounds on the average sample complexity for classes of functions (not necessarily binary or with 0-1 loss). This framework (“The Information Game”) allows the user the freedom to apply his prior knowledge when trying to solve a learning problem. For example, the user can pick only distributions that make sense in his setting (see Discussion).
Information Complexity in Learning
Feldman and Steinke (2018) used the information theoretic setting and proved generalization bounds for performing adaptive data analysis. In this setting, a user asks a series of queries over some data. Every new query the user decides to ask depends on the answers to the previous queries.
Asadi et al. (2018) applied the information theoretic setting for achieving generalization bounds that depend on the correlations between the functions in the class together with the dependence between the input and the output of the learning algorithm. They mostly investigated Gaussian processes.
Here is a brief survey of other works that deviate from the worst-case analysis of the PAC learning setting.
Haussler et al. (1994) studied how the sample complexity depends on properties of a prior distribution on the class and over the sequence of examples the algorithm receives.
Specifically, they studied the probability of an incorrect prediction for an optimal learning algorithm using the Shannon information gain.
They also studied stability in the context they investigated.
Finally, we note that many of the lower bounds on the sample complexity of learning algorithms can be cast in the “on average” language. In many cases, the lower bound is proved by choosing an appropriate distribution on the concept class .
The information game is also relevant in the following information theoretic scenario. Player two wants to transmit a message through a noisy channel that has several states and player one wants to prevent that by appropriately choosing . In the game, player two chooses a distribution on . Player one chooses a state that defines the channel; i.e., is the distribution on the transmitted data conditioned on the input being . By the minimax theorem this game also has an equilibrium point.
Other variants of this scenario can be found in chapter 7 of El Gamal and Kim (2011).
Here we provide the basic definitions that are needed for this text, and provide references that contain more details and background.
We identify random variables with the distributions they define. The notation $S \sim (X, h(X))^m$ means that $S$ consists of $m$ i.i.d. pairs of the form $(x, h(x))$ where $x$ is distributed as $X$.
Big $O$ and $\Omega$ notations in this text hide absolute constants.
Let and be sets. A set is called a class of hypotheses. is called the sample space. A realizable sample for of size is
such that there exists satisfying for all .
A learning algorithm for with sample size is a (possibly randomized) algorithm that takes a realizable sample for as input, and returns a function as output. We say that the learning algorithm is consistent if the output always satisfies for all . We say the algorithm is proper if it outputs members of .
The empirical error of with respect to and a function is
where is the loss function.
The true error of with respect to a random variable over and a function is defined to be
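The class $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$ shatters a finite set $B \subseteq \mathcal{X}$ if $\mathcal{H}|_B = \{0,1\}^B$. The VC-dimension of $\mathcal{H}$, denoted $\mathrm{VC}(\mathcal{H})$, is the maximal size of a set $B$ such that $\mathcal{H}$ shatters $B$.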
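The definition of shattering can be checked by brute force on small finite classes. The following is a minimal sketch, assuming hypotheses are stored as tuples of 0/1 labels indexed by the points of a finite domain {0, ..., n-1}; the helper names `shatters`, `vc_dimension`, and `thresholds` are illustrative and do not come from the paper.

```python
from itertools import combinations

def shatters(hypotheses, points):
    """Return True if the class realizes all 2^|points| labelings of `points`."""
    patterns = {tuple(h[x] for x in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    """Largest size of a shattered subset of `domain` (brute force, small inputs only)."""
    best = 0
    for k in range(1, len(domain) + 1):
        if any(shatters(hypotheses, subset) for subset in combinations(domain, k)):
            best = k
        else:
            break  # if no set of size k is shattered, no larger set is shattered
    return best

def thresholds(n):
    """Threshold functions h_t(x) = 1 iff x >= t over the domain {0, ..., n-1}."""
    return [tuple(1 if x >= t else 0 for x in range(n)) for t in range(n + 1)]
```

For instance, `vc_dimension(thresholds(5), range(5))` evaluates to 1, since a threshold can never label a smaller point 1 and a larger point 0.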
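Let $\mathcal{U}$ be a finite set, and let $X$ be a random variable over $\mathcal{U}$ with probability mass function $p$ such that $p(u) = \Pr[X = u]$. The entropy of $X$ is $H(X) = -\sum_{u \in \mathcal{U}} p(u) \log p(u)$, where $\log$ is a shorthand for $\log_2$, and we use the convention that $0 \log 0 = 0$.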
The mutual information between two random variables $X$ and $Y$ is $I(X; Y) = H(X) + H(Y) - H(X, Y)$.
See the textbook Cover and Thomas (2006) for additional basic definitions and results from information theory which are used throughout this paper.
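As a small illustration of these two definitions, the sketch below computes entropy and mutual information in bits for finite distributions, using the standard identity I(X;Y) = H(X) + H(Y) - H(X,Y). The dictionary-based representation of a joint distribution and the function names are assumptions made for the example.

```python
from math import log2

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {outcome: probability}.
    Zero-probability outcomes are skipped (convention 0 log 0 = 0)."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def marginal(joint, axis):
    """Marginal pmf of coordinate `axis` (0 or 1) of a joint pmf {(x, y): probability}."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint pmf {(x, y): probability}."""
    return entropy(marginal(joint, 0)) + entropy(marginal(joint, 1)) - entropy(joint)
```

Two independent fair bits give mutual information 0, while a fair bit paired with an exact copy of itself gives 1 bit.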
Let be a learning algorithm for with sample size , and let be a probability distribution on . We say that has average information complexity of bits with respect to , if all random variables over satisfy
We say that has error , confidence , and average sample complexity with respect to , if for all random variables over and all ,
3 Information Games
It is helpful to think about the learning framework as a two-player game.
The Information Game
The two players decide in advance on a class of functions and a sample size .
Player one (“Learner”) picks a consistent and proper learning algorithm (possibly randomized).
Player two (“Nature”) picks a function and a random variable over .
Learner pays Nature $I(S; A(S))$ coins where $S \sim (X, h(X))^m$.
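One round of the game can be simulated exactly on a toy instance. The sketch below takes the payment to be the mutual information between the sample and the learner's output, as in the setting described earlier, and uses the fact that for a deterministic learner this equals the entropy of the output (the output is a function of the sample, so its conditional entropy given the sample is zero). The class of thresholds, the "first consistent hypothesis" learner, and all function names are assumptions chosen for illustration, not the strategies analyzed in the paper.

```python
from itertools import product
from math import log2

def thresholds(n):
    """Threshold functions h_t(x) = 1 iff x >= t over the domain {0, ..., n-1}."""
    return [tuple(1 if x >= t else 0 for x in range(n)) for t in range(n + 1)]

def first_consistent(hypotheses, sample):
    """A deterministic, proper, consistent learner: the first hypothesis that fits the sample."""
    for h in hypotheses:
        if all(h[x] == y for x, y in sample):
            return h
    raise ValueError("sample is not realizable")

def payoff(hypotheses, target, x_pmf, m):
    """Nature's gain in bits when the sample is m i.i.d. points from x_pmf labeled by target.
    The learner is deterministic, so the mutual information equals the output entropy."""
    out_pmf = {}
    points = list(x_pmf)
    for xs in product(points, repeat=m):  # enumerate all samples of size m
        prob = 1.0
        for x in xs:
            prob *= x_pmf[x]
        sample = [(x, target[x]) for x in xs]
        out = first_consistent(hypotheses, sample)
        out_pmf[out] = out_pmf.get(out, 0.0) + prob
    return -sum(p * log2(p) for p in out_pmf.values() if p > 0)
```

If Nature puts all the mass on a single point, the learner's output is constant and the payment is 0 bits; spreading the mass makes the output depend on the sample, and the payment becomes positive while staying below the logarithm of the class size.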
In the setting of Theorem 1, Nature knows in advance what the learning algorithm of Learner is. In that case, Nature’s optimal strategy leads to a gain of
In other words, when Nature knows what the learner is going to do, Nature’s gain can be quite large even in very simple cases.
In Bassily et al. (2018), the other extreme was studied as well. Theorem 13 in Bassily et al. (2018) states that when Learner knows in advance the random variable of Nature (but not the concept ), the gain of Nature is always much smaller; for all ,
In particular, in this case, Nature’s gain does not tend to infinity with the size of the universe.
We see that this information game does not have, in general, a game theoretic equilibrium point. To remedy this, we suggest the following average case information game. We shall see the benefits of considering this game below.
The Average Information Game
The two players decide in advance on and .
Learner picks a consistent and proper learning algorithm (possibly randomized).
Nature picks a random variable over .
Learner pays Nature coins where .
In the average game, Nature’s gain is for an average concept in the class. Nature cannot choose a particular that would lead to a high payoff. As opposed to the first game, the average information game has an equilibrium point (see the proof of Theorem 3 below):
By the results mentioned above, if the VC-dimension of is , then Nature’s gain in the game is at most , like in the case that Learner knows the underlying distribution. For VC classes, although may be extremely large for all algorithms under some distribution on inputs, the average is small for some algorithms under all distributions on inputs.
An even more general statement holds. If one allows an empirical error of at most $\epsilon$, instead of a consistent algorithm, the dependence on $m$ can be omitted. This is indeed more general, since if the empirical error is less than $1/m$ then the algorithm is consistent.
For every class of VC-dimension , every , and every , there is a proper learning algorithm with empirical error bounded by such that for all random variables on ,
The above result means that there is a learning algorithm such that for any distribution on inputs, the algorithm reveals little information about its input for at least half of the functions in . If is smaller than the entropy of the sample , then the algorithm can be thought of as compressing its input.
Theorem 3 is a consequence of a more general phenomenon that holds even outside the scope of VC classes. To state it, we need to consider a convex space of random variables (or distributions), since the mechanism that underlies its proof is von Neumann’s minimax theorem (see Von Neumann (1928); Von Neumann and Morgenstern (1944)).
Let be a class of hypotheses (not necessarily binary valued) with a loss function that is bounded from above by one. Let be a convex set of random variables over the space . Assume that for every , there exists an algorithm whose output has empirical error and for all where and . Then there exists a learning algorithm such that for all , the algorithm outputs a hypothesis with empirical error and
The lemma is proved in Section 4.
Some natural collections of random variables are not convex. If one starts e.g. with a set of i.i.d. random variables over , the relevant convex hull does not consist only of i.i.d. random variables. This point needs to be addressed in the proof of Theorem 3. In the proof of Theorem 3, we apply the lemma with being the space of all symmetric distributions on ; see Definition 5.
We call the learning algorithm that is constructed in the proof of the theorem a minimax algorithm for with information and empirical error . Such algorithms reveal a small amount of information on most of the hypotheses in . So, together with the “compression yields generalization” results from Xu and Raginsky (2017) and Bassily et al. (2018) we get that the minimax algorithm has small true error for every for most hypotheses in , as long as .
Let be a convex set of random variables on . Let be the convex hull of distributions of the form for . Let be a minimax algorithm for with information and empirical error . Let . If , then
|(h is uniform)|
where and is uniform in and independent of .
In particular, for at least half of the functions in ,
|(h is fixed)|
There is nothing special about the uniform distribution on . Any other prior distribution on works just as well. It is important, however, to keep in mind that the algorithm depends on the choice of the prior .
The convex set of distributions may be chosen by the algorithm designer. One general choice is to take the space of all distributions on . Another example is the space of all sub-gaussian probability distributions.
To complete the proof of Theorem 3, we apply Lemma 3. For the lemma to apply, we need to design an algorithm that reveals little information for VC classes when the distribution of is known in advance (as mentioned in the remark following Lemma 3 we need to handle even a more general scenario). To do so, we need to extend a result from Bassily et al. (2018). The main ingredient is metric properties of VC classes (see Haussler (1995)). This appears in Section 5.
To describe the minimax algorithm we need to come up with some prior distribution on . In practice, we do not necessarily know the actual prior but we may have some approximation of it. It is natural to ask how the performance of the minimax algorithm changes when our prior is wrong, and the true prior is .
As an example, if we have a bound , then we immediately get
As another example, consider the case that the statistical distance is small. If we assume nothing on how distributes, we can get
which seems too costly to be useful. This can happen when one hypothesis satisfies , and we move all the allowed weight from one hypothesis with small mutual information to
The last inequality is Cauchy-Schwarz. Roughly speaking, this means that if is close to then the average information that is leaked is similar, when the map has bounded second moment under both distributions. It is possible to replace the second moment by the -moment for using Hölder’s inequality.
We saw that with no assumptions, information cost can grow considerably under small perturbations of . The average sample complexity, however, does not. If has error , confidence , and average sample complexity with respect to , it also has error , confidence , and average sample complexity with respect to .
4 The Minimax Learner
Naturally, the proof requires von Neumann’s minimax theorem.
– is convex for every and
– is concave for every .
[Proof of Lemma 3] We need to verify that the minimax theorem applies. First, as stated in the preliminaries, we deal with a finite space so the set of all algorithms (randomized included) with empirical error and the set of random variables over can be treated as convex compact sets in high dimensional Euclidean space. Specifically, let be the collection of randomized learning algorithms with empirical error at most , and let be the set of distributions.
Second, mutual information is a continuous function of both strategies.
Third, we use the following lemma about mutual information.
[Theorem 2.7.4 in Cover and Thomas (2006)] Let $(X, Y) \sim p(x, y) = p(x) p(y|x)$. The mutual information $I(X; Y)$ is a concave function of $p(x)$ for fixed $p(y|x)$ and a convex function of $p(y|x)$ for fixed $p(x)$.
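The two directions of this lemma can be sanity-checked numerically on small examples. The sketch below is only an illustration under assumed representations: an input distribution is a list of probabilities, a channel is a row-stochastic matrix whose row i is the distribution of the output given input i, and the names `mutual_information` and `mix` are hypothetical helpers.

```python
from math import log2

def mutual_information(p_x, channel):
    """I(X;Y) in bits for input pmf p_x[i] and channel[i][j] = P(Y = j | X = i)."""
    n_out = len(channel[0])
    p_y = [sum(p_x[i] * channel[i][j] for i in range(len(p_x))) for j in range(n_out)]
    total = 0.0
    for i, px in enumerate(p_x):
        for j in range(n_out):
            pxy = px * channel[i][j]
            if pxy > 0:  # convention 0 log 0 = 0
                total += pxy * log2(pxy / (px * p_y[j]))
    return total

def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b of two vectors or two matrices."""
    if isinstance(a[0], (list, tuple)):
        return [mix(row_a, row_b, lam) for row_a, row_b in zip(a, b)]
    return [lam * u + (1 - lam) * v for u, v in zip(a, b)]
```

A striking instance of convexity in the channel: with a uniform input bit, the identity channel and the bit-flip channel each carry 1 bit, yet their equal mixture is pure noise and carries 0 bits. For concavity in the input, mixing two input distributions over a fixed channel never yields less information than the corresponding average of the two individual values.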
We apply the lemma with being the distribution on and being the distribution of conditioned on that the learning algorithm defines. Since a convex combination of convex/concave functions is convex/concave, we see that the map
is convex-concave, where defines the distribution of conditioned on the value of , and defines the distribution of .
By the minimax theorem,
In other words, there is a randomized algorithm as needed (points in are randomized algorithms). In the proof above, we used the special fact that the mutual information is convex-concave. We are not aware of any other measure of dependence between random variables that satisfies this.
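The role of randomization in the minimax theorem can be seen already in a 2x2 zero-sum game. The sketch below is a toy illustration, not the game from the proof: it uses a bilinear payoff (the simplest convex-concave function) instead of mutual information, and approximates mixed strategies on a grid. The function names and the example matrix are assumptions.

```python
def minimax_values(payoff, grid=101):
    """Approximate min-max and max-min of a 2x2 zero-sum game over mixed strategies.
    payoff[i][j] is what the row player (the minimizer) pays the column player."""
    probs = [k / (grid - 1) for k in range(grid)]

    def row_cost(p):
        # Row plays row 0 with probability p; a best reply of the column player is pure.
        return max(p * payoff[0][j] + (1 - p) * payoff[1][j] for j in range(2))

    def col_gain(q):
        # Column plays column 0 with probability q; a best reply of the row player is pure.
        return min(q * payoff[i][0] + (1 - q) * payoff[i][1] for i in range(2))

    return min(row_cost(p) for p in probs), max(col_gain(q) for q in probs)

def pure_values(payoff):
    """min-max and max-min when both players are restricted to pure strategies."""
    min_max = min(max(row) for row in payoff)
    max_min = max(min(payoff[i][j] for i in range(2)) for j in range(2))
    return min_max, max_min
```

For the matrix `[[1, 0], [0, 1]]`, pure strategies leave a gap (min-max 1 versus max-min 0), while over mixed strategies both values coincide at 0.5. This mirrors why the learner obtained from the minimax theorem is, in general, a randomized algorithm.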
5 Learning Using Nets
Theorem 13 from Bassily et al. (2018) states the following. For an i.i.d. random variable over and with VC-dimension , there exists a consistent, proper, and deterministic learner that leaks at most -bits of information, where is the input sample size (for -realizable samples).
For the minimax theorem to apply, we need to generalize the above statement to work for any convex combination of i.i.d. random variables over . To analyze this collection of random variables, we need to identify some property that we can leverage. We use the fact that such random variables are invariant under permutation of the coordinates.
A random variable over is called symmetric if for any permutation ,
The following theorem holds for all symmetric random variables. In this space, we cannot assume any kind of independence between the coordinates. This should make the proof more complicated than in Bassily et al. (2018), but in fact it helps to guide the proof and make it quite simple.
Let . For a symmetric random variable over and with VC-dimension , there exists a proper and deterministic learner with empirical error so that
for all .
A key component in the proof is Haussler’s theorem (see Haussler (1995)) on the size of covers of VC classes. The theorem states that for a given probability distribution on , there are small covers to the metric space whose elements are concepts in and the distance between is . The starting point of this theorem is a distribution on . In the general setting we consider, we start with a non-product distribution on . To apply Haussler’s theorem, we need to find the relevant (the solution is eventually quite simple).
Since is symmetric, the marginal distribution is the same on each of the coordinates of and denote it . For every integer , pick a minimal -net with respect to the distribution over for .
The learning algorithm is simple – it outputs the first consistent function it sees along the sequence of nets. The algorithm stops because is finite. It remains to calculate the entropy of its output.
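The learner described above can be sketched on a finite domain as follows. Several choices here are assumptions made for illustration and are not taken from the paper: the nets are built greedily (so they are valid nets but not necessarily minimal), the scale of the k-th net is taken to be 2^-k, hypotheses are tuples of labels indexed by domain points, and all names are hypothetical.

```python
def distance(h1, h2, marginal):
    """Probability, under the marginal pmf {x: probability}, that two hypotheses disagree."""
    return sum(p for x, p in marginal.items() if h1[x] != h2[x])

def greedy_net(hypotheses, marginal, eps):
    """Greedy eps-net: every hypothesis is within distance eps of some net element.
    Greedy construction, so the net is not necessarily of minimal size."""
    net = []
    for h in hypotheses:
        if all(distance(h, g, marginal) > eps for g in net):
            net.append(h)
    return net

def net_learner(hypotheses, marginal, sample, max_level=30):
    """Return the first hypothesis consistent with `sample` along nets of scale 2^-k,
    together with the index k of the net where the search stopped."""
    for k in range(1, max_level + 1):
        for h in greedy_net(hypotheses, marginal, 2.0 ** -k):
            if all(h[x] == y for x, y in sample):
                return h, k
    raise ValueError("no consistent hypothesis found up to max_level")
```

The nets depend only on the class and the marginal distribution, never on the sample, so the output is determined by the stopping index and a position inside a small net. The sketch assumes every sample point has positive probability under the marginal; otherwise hypotheses that differ only on zero-probability points are merged in the nets and the search may fail.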
For every and , there is a function in so that
By the linearity of the expectation,
So, by Markov’s inequality,
In total, for all ,
Now take to be the index of the net where the algorithm stops. For it holds that . Thus,
By Haussler’s theorem (see Haussler (1995)), the size of is at most
The proof of Theorem 5 together with Lemma 3 suggest a general recipe for controlling the average information complexity (and hence the average sample complexity) for pairs of the form (not necessarily binary class or with 0-1 loss).
For every marginal distribution over from , find a sequence of small -nets. This sequence induces an algorithm that leaks little information, for every symmetric random variable whose marginal distribution is (even though it is not necessarily i.i.d.).
Use the minimax theorem to find an algorithm that leaks little information over all of .
It will be interesting to see if this setting can be extended to the non-realizable case. It is not immediate to apply the principles seen in the proof of Theorem 3 to this case. In theory, some samples may require large empirical losses (for proper learners). Since the minimax algorithm is a convex combination of those algorithms, it is hard to say what the empirical error of such an algorithm will be, or how far the empirical error will be from that of the hypothesis in with an optimal empirical error.
This work leaves the traditional setting of PAC learning and assumes a less hostile environment for learning. We introduce game-theoretic perspectives on the compression that learning algorithms perform. In the standard setting, Nature is assumed to be all powerful and can make the Learner leak quite a lot of information. In the average-case scenario, Nature needs to commit ahead of time to some probability distribution from which the eventual concept is generated. In this case, the minimax theorem allows us to lower the amount of information that is leaked.
The average-case framework captures some amount of prior knowledge on the world that the learner can use. It therefore allows us to avoid singular or pathological cases.
This work suggests an idea that may be useful in other contexts. Given a class , perform the following four steps.
Define a set of reasonable distributions over .
Find a collection of -nets for distributions in .
Look for a distribution over those nets that works well for most distributions in .
Given a sample , sample a random -net until finding a hypothesis with small empirical error.
It seems plausible that this will yield acceptable results for samples that come from the real world. All steps above, however, may be quite challenging to implement.
- Asadi et al. (2018) Amir R. Asadi, Emmanuel Abbe, and Sergio Verdú. Chaining mutual information and tightening generalization bounds. CoRR, abs/1806.03803, 2018.
- Bassily et al. (2014) Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464–473. IEEE, 2014.
- Bassily et al. (2016) Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1046–1059. ACM, 2016.
- Bassily et al. (2018) Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, and Amir Yehudayoff. Learners that use little information. In Proceedings of Algorithmic Learning Theory, volume 83 of Proceedings of Machine Learning Research, pages 25–55. PMLR, 2018.
- Blumer et al. (1987) Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Occam’s razor. Information processing letters, 24(6):377–380, 1987.
- Cover and Thomas (2006) Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2nd edition, 2006.
- Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
- Dwork et al. (2015) Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 117–126. ACM, 2015.
- El Gamal and Kim (2011) Abbas El Gamal and Young-Han Kim. Network Information Theory. Cambridge University Press, 2011.
- Feldman and Steinke (2018) Vitaly Feldman and Thomas Steinke. Calibrating noise to variance in adaptive data analysis. In COLT, 2018.
- Grünwald (2007) Peter D Grünwald. The minimum description length principle. MIT press, 2007.
- Haussler (1995) David Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded vapnik-chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217 – 232, 1995.
- Haussler et al. (1994) David Haussler, Michael Kearns, and Robert E. Schapire. Bounds on the sample complexity of bayesian learning using information theory and the vc dimension. Machine Learning, 14(1):83–113, 1994.
- Littlestone and Warmuth (1986) Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. Technical report, University of California, Santa Cruz, 1986.
- Moran and Yehudayoff (2016) Shay Moran and Amir Yehudayoff. Sample compression schemes for VC classes. Journal of the ACM (JACM), 63(3):21, 2016.
- Nachum et al. (2018) Ido Nachum, Jonathan Shafer, and Amir Yehudayoff. A direct sum result for the information complexity of learning. In Proceedings of the 2018 Conference on Learning Theory, 2018.
- Reischuk and Zeugmann (1999) Rüdiger Reischuk and Thomas Zeugmann. A complete and tight average-case analysis of learning monomials. In STACS 99, pages 414–423, Berlin, Heidelberg, 1999. Springer Berlin Heidelberg.
- Rissanen (1978) Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
- Rogers et al. (2016) Ryan Rogers, Aaron Roth, Adam Smith, and Om Thakkar. Max-information, differential privacy, and post-selection hypothesis testing. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 487–494. IEEE, 2016.
- Shalev-Shwartz and Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- Von Neumann (1928) J Von Neumann. Zur theorie der gesellschaftsspiele. Mathematische annalen, 100(1):295–320, 1928.
- Von Neumann and Morgenstern (1944) J Von Neumann and O Morgenstern. Theory of games and economic behavior. 1944.
- Wan (2010) Andrew Wan. Learning, cryptography, and the average case. Institution Columbia University, 2010.
- Xu and Raginsky (2017) Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems 30, pages 2524–2533. 2017.