A learning algorithm can be viewed as a randomized mapping, or a channel in the information-theoretic language, which takes a training dataset as input and generates a hypothesis as output. The generalization error is the difference between the population risk of the output hypothesis and its empirical risk on the training data. It measures how much the learned hypothesis suffers from overfitting. The traditional way of analyzing the generalization error relies either on certain complexity measures of the hypothesis space, e.g. the VC dimension and the Rademacher complexity BBL05 , or on certain properties of the learning algorithm, e.g., uniform stability BouEli_stab_gen02 . Recently, motivated by improving the accuracy of adaptive data analysis, Russo and Zou RusZou16 showed that the mutual information between the collection of empirical risks of the available hypotheses and the final output of the algorithm can be used effectively to analyze and control the bias in data analysis, which is equivalent to the generalization error in learning problems. Compared to the methods of analysis based on differential privacy, e.g., by Dwork et al. DFH14_dp ; Dwork_adp_holdout and Bassily et al. AlStDp , the method proposed in RusZou16
is simpler and can handle unbounded loss functions; moreover, it provides elegant information-theoretic insights into improving the generalization capability of learning algorithms. In a similar information-theoretic spirit, Alabdulmohsin Alab_unigen15 ; Alab_unigen17 proposed to bound the generalization error in learning problems using the total-variation information between a random instance in the dataset and the output hypothesis, but the analysis applies only to bounded loss functions.
In this paper, we follow the information-theoretic framework proposed by Russo and Zou RusZou16 to derive upper bounds on the generalization error of learning algorithms. We extend the results in RusZou16 to the situation where the hypothesis space is uncountably infinite, and provide improved upper bounds on the expected absolute generalization error. We also obtain concentration inequalities for the generalization error, which were not given in RusZou16 . While the main quantity examined in RusZou16 is the mutual information between the collection of empirical risks of the hypotheses and the output of the algorithm, we mainly focus on relating the generalization error to the mutual information between the input dataset and the output of the algorithm, which formalizes the intuition that the less information a learning algorithm can extract from the input dataset, the less it will overfit. This viewpoint provides theoretical guidelines for striking the right balance between data fit and generalization by controlling the algorithm’s input-output mutual information. For example, we show that regularizing the empirical risk minimization (ERM) algorithm with the input-output mutual information leads to the well-known Gibbs algorithm. As another example, regularizing the ERM algorithm with random noise can also control the input-output mutual information. For both the Gibbs algorithm and the noisy ERM algorithm, we also discuss how to calibrate the regularization in order to incorporate any prior knowledge of the population risks of the hypotheses into algorithm design. Additionally, we discuss adaptive composition of learning algorithms, and show that the generalization capability of the overall algorithm can be analyzed by examining the input-output mutual information of the constituent algorithms.
Another advantage of relating the generalization error to the input-output mutual information is that the latter quantity depends on all ingredients of the learning problem, including the distribution of the dataset, the hypothesis space, the learning algorithm itself, and potentially the loss function, in contrast to the VC dimension or the uniform stability, which only depend on the hypothesis space or on the learning algorithm. As the generalization error can strongly depend on the input dataset Zhang_gen17 , the input-output mutual information can be more tightly coupled to the generalization error than the traditional generalization-guaranteeing quantities of interest. We hope that our work can provide some information-theoretic understanding of generalization in modern learning problems, which may not be sufficiently addressed by the traditional analysis tools Zhang_gen17 .
For the rest of this section, we define the quantities that will be used in the paper. In the standard framework of statistical learning theory ShBe_book14 , there is an instance space $\mathcal{Z}$, a hypothesis space $\mathcal{W}$, and a nonnegative loss function $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}^+$. A learning algorithm characterized by a Markov kernel $P_{W|S}$ takes as input a dataset of size $n$, i.e., an $n$-tuple
$$S = (Z_1, \dots, Z_n)$$
of i.i.d. random elements of $\mathcal{Z}$ with some unknown distribution $\mu$, and picks a random element $W$ of $\mathcal{W}$ as the output hypothesis according to $P_{W|S}$. The population risk of a hypothesis $w \in \mathcal{W}$ on $\mu$ is
$$L_\mu(w) \triangleq \mathbb{E}[\ell(w, Z)], \quad Z \sim \mu.$$
The goal of learning is to ensure that the population risk of the output hypothesis $W$
is small, either in expectation or with high probability, under any data-generating distribution. The excess risk of $W$ is the difference $L_\mu(W) - \inf_{w \in \mathcal{W}} L_\mu(w)$, and its expected value is denoted as $R_{\rm excess}(\mu, P_{W|S})$. Since $\mu$ is unknown, the learning algorithm cannot directly compute $L_\mu(w)$ for any $w \in \mathcal{W}$, but can instead compute the empirical risk of $w$ on the dataset $S$ as a proxy, defined as
$$L_S(w) \triangleq \frac{1}{n} \sum_{i=1}^n \ell(w, Z_i).$$
For a learning algorithm characterized by $P_{W|S}$, the generalization error on $\mu$ is the difference $L_\mu(W) - L_S(W)$, and its expected value is denoted as
$$\mathrm{gen}(\mu, P_{W|S}) \triangleq \mathbb{E}\big[L_\mu(W) - L_S(W)\big],$$
where the expectation is taken with respect to the joint distribution $P_{S,W} = \mu^{\otimes n} \otimes P_{W|S}$. The expected population risk can then be decomposed as
$$\mathbb{E}[L_\mu(W)] = \mathbb{E}[L_S(W)] + \mathrm{gen}(\mu, P_{W|S}),$$
where the first term reflects how well the output hypothesis fits the dataset, while the second term reflects how well the output hypothesis generalizes. To minimize the expected population risk, we need both terms in (5) to be small. However, it is generally impossible to minimize the two terms simultaneously, and any learning algorithm faces a trade-off between the empirical risk and the generalization error. In what follows, we will show how the generalization error can be related to the mutual information between the input and output of the learning algorithm, and how we can use these relationships to guide the algorithm design to reduce the population risk by balancing fitting and generalization.
2 Algorithmic stability in input-output mutual information
As discussed above, having a small generalization error is crucial for a learning algorithm to produce an output hypothesis with a small population risk. It turns out that the generalization error of a learning algorithm can be determined by its stability properties. Traditionally, a learning algorithm is said to be stable if a small change of the input to the algorithm does not change the output of the algorithm much. Examples include uniform stability defined by Bousquet and Elisseeff BouEli_stab_gen02 and on-average stability defined by Shalev-Shwartz et al. Learn_stability2010 . In recent years, information-theoretic stability notions, such as those measured by differential privacy Dwork_adp_holdout , KL divergence AlStDp ; WLF_avgKL16 , total-variation information Alab_unigen15 , and erasure mutual information stability_ITW16 , have been proposed. All existing notions of stability show that the generalization capability of a learning algorithm hinges on how sensitive the output of the algorithm is to local modifications of the input dataset. It implies that the less dependent the output hypothesis $W$ is on the input dataset $S$, the better the learning algorithm generalizes. From an information-theoretic point of view, the dependence between $S$ and $W$ can be naturally measured by the mutual information between them, which prompts the following information-theoretic definition of stability. We say that a learning algorithm is $(\epsilon, \mu)$-stable in input-output mutual information if, under the data-generating distribution $\mu$,
$$I(S; W) \le \epsilon.$$
Further, we say that a learning algorithm is $\epsilon$-stable in input-output mutual information if
$$\sup_{\mu} I(S; W) \le \epsilon.$$
According to the definitions in (6) and (7), the less information the output of a learning algorithm can provide about its input dataset, the more stable it is. Interestingly, if we view the learning algorithm as a channel from $\mathcal{Z}^n$ to $\mathcal{W}$, the quantity $\sup_\mu I(S; W)$ can be viewed as the information capacity of the channel, under the constraint that the input distribution is of a product form. The definition in (7) means that a learning algorithm is more stable if its information capacity is smaller. The advantage of the weaker definition in (6) is that $I(S; W)$ depends on both the algorithm and the distribution of the dataset. Therefore, it can be more tightly coupled with the generalization error, which itself depends on the dataset. We mainly focus on studying the consequence of this notion of $(\epsilon, \mu)$-stability in input-output mutual information for the rest of this paper.
3 Upper-bounding generalization error via $I(S;W)$
In this section, we derive various generalization guarantees for learning algorithms that are stable in input-output mutual information.
3.1 A decoupling estimate
We start with a digression from the statistical learning problem to a more general problem, which may be of independent interest. Consider a pair of random variables $X$ and $Y$ with joint distribution $P_{X,Y}$. Let $\bar X$ be an independent copy of $X$, and $\bar Y$ an independent copy of $Y$, such that $P_{\bar X, \bar Y} = P_X \otimes P_Y$. For an arbitrary real-valued function $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, we have the following upper bound on the absolute difference between $\mathbb{E}[f(X, Y)]$ and $\mathbb{E}[f(\bar X, \bar Y)]$.
Lemma 1 (proved in Appendix A).
If $f(\bar X, \bar Y)$ is $\sigma$-subgaussian under $P_{\bar X, \bar Y} = P_X \otimes P_Y$ (recall that a random variable $U$ is $\sigma$-subgaussian if $\mathbb{E}[e^{\lambda (U - \mathbb{E} U)}] \le e^{\lambda^2 \sigma^2 / 2}$ for all $\lambda \in \mathbb{R}$), then
$$\big|\mathbb{E}[f(X, Y)] - \mathbb{E}[f(\bar X, \bar Y)]\big| \le \sqrt{2 \sigma^2\, I(X; Y)}.$$
3.2 Upper bound on expected generalization error
Upper-bounding the generalization error of a learning algorithm can be cast as a special case of the preceding problem, by setting $X = S$, $Y = W$, and $f(s, w) = L_s(w)$. For an arbitrary $w \in \mathcal{W}$, the empirical risk can be expressed as $L_S(w) = f(S, w)$, and the population risk can be expressed as $L_\mu(w) = \mathbb{E}[f(\bar S, w)]$. Moreover, the expected generalization error can be written as
$$\mathrm{gen}(\mu, P_{W|S}) = \mathbb{E}[f(\bar S, \bar W)] - \mathbb{E}[f(S, W)],$$
where the joint distribution of $S$ and $W$ is $P_{S,W} = \mu^{\otimes n} \otimes P_{W|S}$. If $\ell(w, Z)$ is $\sigma$-subgaussian for all $w \in \mathcal{W}$, then $L_S(w)$ is $\sigma/\sqrt{n}$-subgaussian due to the i.i.d. assumption on the $Z_i$'s, hence $f(\bar S, \bar W)$ is $\sigma/\sqrt{n}$-subgaussian. This, together with Lemma 1, leads to the following theorem.
Theorem 1.
Suppose $\ell(w, Z)$ is $\sigma$-subgaussian under $\mu$ for all $w \in \mathcal{W}$. Then
$$\big|\mathrm{gen}(\mu, P_{W|S})\big| \le \sqrt{\frac{2 \sigma^2}{n}\, I(S; W)}.$$
Theorem 1 suggests that, by controlling the mutual information between the input and the output of a learning algorithm, we can control its generalization error. The theorem allows us to consider unbounded loss functions as long as the subgaussian condition is satisfied. For a loss function bounded in $[a, b]$, $\ell(w, Z)$ is guaranteed to be $\frac{b-a}{2}$-subgaussian for all $\mu$ and all $w \in \mathcal{W}$.
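As a quick numerical illustration (ours, not from the paper; the mutual-information value and sample size below are hypothetical), the bound of Theorem 1 can be evaluated directly once $\sigma$ and an estimate of $I(S;W)$ are available:

```python
import math

def gen_error_bound(sigma: float, mi_nats: float, n: int) -> float:
    """Theorem 1 bound sqrt(2*sigma^2*I(S;W)/n), with I(S;W) in nats."""
    return math.sqrt(2.0 * sigma**2 * mi_nats / n)

# A loss bounded in [0, 1] is 1/2-subgaussian, so sigma = 0.5.
sigma = 0.5
# Hypothetical mutual-information budget of 1 nat and n = 10_000 samples.
bound = gen_error_bound(sigma, 1.0, 10_000)
print(round(bound, 4))  # sqrt(2 * 0.25 * 1 / 10000) ≈ 0.0071
```

The bound shrinks at rate $1/\sqrt{n}$ for a fixed mutual-information budget, matching the discussion above.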
Russo and Zou RusZou16 considered the same problem setup with the restriction that the hypothesis space $\mathcal{W}$ is finite, and showed that $\mathrm{gen}(\mu, P_{W|S})$ can be upper-bounded in terms of $I(\Lambda_{\mathcal{W}}(S); W)$, where
$$\Lambda_{\mathcal{W}}(S) \triangleq \big(L_S(w)\big)_{w \in \mathcal{W}}$$
is the collection of empirical risks of the hypotheses in $\mathcal{W}$. Using Lemma 1 by setting $X = \Lambda_{\mathcal{W}}(S)$, $Y = W$, and $f\big(\Lambda_{\mathcal{W}}(s), w\big) = L_s(w)$, we immediately recover the result by Russo and Zou even when $\mathcal{W}$ is uncountably infinite:
Theorem 2 (Russo and Zou RusZou16 ).
Suppose $\ell(w, Z)$ is $\sigma$-subgaussian under $\mu$ for all $w \in \mathcal{W}$. Then
$$\big|\mathrm{gen}(\mu, P_{W|S})\big| \le \sqrt{\frac{2 \sigma^2}{n}\, I\big(\Lambda_{\mathcal{W}}(S); W\big)}.$$
Theorem 1 can also be obtained as a consequence of Theorem 2, since $I(\Lambda_{\mathcal{W}}(S); W) \le I(S; W)$, which is due to the Markov chain $\Lambda_{\mathcal{W}}(S) - S - W$, as for each $w \in \mathcal{W}$, $L_S(w)$ is a function of $S$. However, if the output $W$ depends on $S$ only through the empirical risks, in other words, when the Markov chain $S - \Lambda_{\mathcal{W}}(S) - W$ holds, then Theorem 1 and Theorem 2 are equivalent. The advantage of Theorem 1 is that $I(S; W)$ can be much easier to evaluate than $I(\Lambda_{\mathcal{W}}(S); W)$, and can provide better insights to guide the algorithm design. We will elaborate on this when we discuss the Gibbs algorithm and the adaptive composition of learning algorithms.
Theorem 1 and Theorem 2 only provide upper bounds on the expected generalization error. We are often interested in analyzing the absolute generalization error $|L_\mu(W) - L_S(W)|$, e.g., its expected value or the probability for it to be small. We need to develop stronger tools to tackle these problems, which is the subject of the next two subsections.
3.3 A concentration inequality for $|L_\mu(W) - L_S(W)|$
For any fixed $w \in \mathcal{W}$, if $\ell(w, Z)$ is $\sigma$-subgaussian, the Chernoff–Hoeffding bound gives $\mathbb{P}\big[|L_\mu(w) - L_S(w)| > \alpha\big] \le 2 e^{-n \alpha^2 / 2\sigma^2}$. It implies that, if $W$ and $S$ are independent, then a sample size of
$$n \ge \frac{2 \sigma^2}{\alpha^2} \log\frac{2}{\beta}$$
suffices to guarantee
$$\mathbb{P}\big[|L_\mu(W) - L_S(W)| > \alpha\big] \le \beta.$$
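The independent-case sample complexity implied by the Chernoff–Hoeffding bound is easy to compute; here is a minimal sketch of ours with hypothetical accuracy and confidence parameters:

```python
import math

def iid_sample_complexity(sigma: float, alpha: float, beta: float) -> int:
    """Smallest n with 2*exp(-n*alpha^2/(2*sigma^2)) <= beta,
    i.e., n >= (2*sigma^2 / alpha^2) * log(2 / beta)."""
    return math.ceil(2.0 * sigma**2 / alpha**2 * math.log(2.0 / beta))

# sigma = 0.5 (loss in [0, 1]), accuracy alpha = 0.05, confidence beta = 0.01.
n = iid_sample_complexity(0.5, 0.05, 0.01)
print(n)  # 200 * log(200), rounded up: 1060
```

Note the polynomial dependence on $1/\alpha$ and logarithmic dependence on $1/\beta$, which is the benchmark the dependent case below is compared against.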
The following results show that, when $W$ is dependent on $S$, as long as $I(S; W)$ is sufficiently small, a sample complexity polynomial in $1/\alpha$ and logarithmic in $1/\beta$ still suffices to guarantee (15), where the probability now is taken with respect to the joint distribution $P_{S,W}$.
Theorem 3 (proved in Appendix B).
Suppose $\ell(w, Z)$ is $\sigma$-subgaussian under $\mu$ for all $w \in \mathcal{W}$. If a learning algorithm satisfies $I(S; W) \le \epsilon$, then for any $\alpha > 0$ and $\beta \in (0, 1)$, (15) can be guaranteed by a sample complexity of
$$n = \frac{8 \sigma^2}{\alpha^2} \left( \frac{\epsilon}{\beta} + \log\frac{2}{\beta} \right).$$
In view of (13), any learning algorithm that is $\epsilon$-stable in input-output mutual information satisfies the condition $I(S; W) \le \epsilon$. The proof of Theorem 3 is based on Lemma 1 and an adaptation of the “monitor technique” proposed by Bassily et al. AlStDp . While the high-probability bounds of DFH14_dp ; Dwork_adp_holdout ; AlStDp based on differential privacy are for bounded loss functions and for functions with bounded differences, the result in Theorem 3 only requires $\ell(w, Z)$ to be subgaussian. We have the following corollary of Theorem 3.
3.4 Upper bound on $\mathbb{E}\,|L_\mu(W) - L_S(W)|$
A byproduct of the proof of Theorem 3 (setting $m = 1$ in the proof) is an upper bound on the expected absolute generalization error.
Theorem 4.
Suppose $\ell(w, Z)$ is $\sigma$-subgaussian under $\mu$ for all $w \in \mathcal{W}$. If a learning algorithm satisfies $I(S; W) \le \epsilon$, then
$$\mathbb{E}\,\big|L_\mu(W) - L_S(W)\big| \le \sqrt{\frac{2 \sigma^2}{n} (\epsilon + \log 2)}.$$
4 Learning algorithms with input-output mutual information stability
In this section, we discuss several learning problems and algorithms from the viewpoint of input-output mutual information stability. We first consider two cases where the input-output mutual information can be upper-bounded via the properties of the hypothesis space. Then we propose two learning algorithms with controlled input-output mutual information by regularizing the ERM algorithm. We also discuss other methods to induce input-output mutual information stability, and the stability of learning algorithms obtained from adaptive composition of constituent algorithms.
4.1 Countable hypothesis space
When the hypothesis space $\mathcal{W}$ is countable, the input-output mutual information can be directly upper-bounded by $H(W)$, the entropy of $W$. If $|\mathcal{W}| = k$, we have $H(W) \le \log k$. From Theorem 1, if $\ell(w, Z)$ is $\sigma$-subgaussian for all $w \in \mathcal{W}$, then for any learning algorithm with countable $\mathcal{W}$,
$$\big|\mathrm{gen}(\mu, P_{W|S})\big| \le \sqrt{\frac{2 \sigma^2 H(W)}{n}} \le \sqrt{\frac{2 \sigma^2 \log k}{n}}.$$
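Since $I(S;W) \le H(W)$, the entropy of the output distribution already yields a generalization bound, and a peaked output distribution improves on the worst case $\log k$. A small self-contained sketch of ours (the output distributions below are hypothetical):

```python
import math

def entropy_nats(probs):
    """Shannon entropy in nats; I(S;W) <= H(W) for any algorithm."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def countable_gen_bound(sigma, probs, n):
    """sqrt(2*sigma^2*H(W)/n): the Theorem 1 bound with I(S;W)
    replaced by the entropy upper bound H(W)."""
    return math.sqrt(2 * sigma**2 * entropy_nats(probs) / n)

# k = 4 hypotheses; a peaked output distribution gives a smaller bound
# than the worst case H(W) = log 4 attained by the uniform distribution.
p_peaked = [0.85, 0.05, 0.05, 0.05]
p_uniform = [0.25] * 4
sigma, n = 0.5, 1000
assert entropy_nats(p_peaked) < math.log(4)
print(countable_gen_bound(sigma, p_peaked, n) < countable_gen_bound(sigma, p_uniform, n))  # True
```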
For the ERM algorithm, the upper bounds for the expected generalization error also hold for the expected excess risk, since the empirical risk of the ERM algorithm satisfies
$$\mathbb{E}[L_S(W_{\mathrm{ERM}})] \le \mathbb{E}[L_S(w^\star)] = L_\mu(w^\star) = \inf_{w \in \mathcal{W}} L_\mu(w),$$
where $w^\star$ denotes a hypothesis achieving the minimum population risk.
For an uncountable hypothesis space, we can always convert it to a finite one by quantizing the output hypothesis. For example, if $\mathcal{W} \subset \mathbb{R}^p$, we can define the covering number $N(\mathcal{W}, \varepsilon)$ as the cardinality of the smallest set $\mathcal{W}_\varepsilon$ such that for all $w \in \mathcal{W}$ there is $w' \in \mathcal{W}_\varepsilon$ with $\|w - w'\| \le \varepsilon$, and we can use $\mathcal{W}_\varepsilon$ as the codebook for quantization. The final output hypothesis will be an element of $\mathcal{W}_\varepsilon$. If $\mathcal{W}$ lies in a $d$-dimensional subspace of $\mathbb{R}^p$ and $\sup_{w \in \mathcal{W}} \|w\| \le B$, then $\log N(\mathcal{W}, \varepsilon) = O\big(d \log(B/\varepsilon)\big)$; setting $\varepsilon = 1/\sqrt{n}$, we have $\log N(\mathcal{W}, \varepsilon) = O\big(d \log(B \sqrt{n})\big)$, and under the subgaussian condition of $\ell(w, Z)$,
$$\big|\mathrm{gen}(\mu, P_{W|S})\big| = O\!\left( \sqrt{\frac{\sigma^2 d}{n} \log\big(B \sqrt{n}\big)} \right).$$
4.2 Binary Classification
For the problem of binary classification, $\mathcal{Z} = \mathcal{X} \times \{0, 1\}$ with $Z = (X, Y)$,
$\mathcal{W}$ is a collection of classifiers $w: \mathcal{X} \to \{0, 1\}$, which could be uncountably infinite, and $\ell(w, z) = \mathbf{1}\{w(x) \ne y\}$. Using Theorem 1, we can perform a simple analysis of the following two-stage algorithm Bue_Kum96_1 ; DGLbook96 that can achieve the same performance as ERM. Given the dataset $S$, split it into $S_1$ and $S_2$ with lengths $n_1$ and $n_2$. First, pick a subset of hypotheses $\mathcal{W}_{S_1} \subset \mathcal{W}$ based on $S_1$ such that the binary vectors $\big(w(X_1), \dots, w(X_{n_1})\big)$ for $w \in \mathcal{W}_{S_1}$ are all distinct and exhaust the labelings of $S_1$ realizable by $\mathcal{W}$. In other words, $\mathcal{W}_{S_1}$ forms an empirical cover of $\mathcal{W}$ with respect to $S_1$. Then pick a hypothesis from $\mathcal{W}_{S_1}$ with the minimal empirical risk on $S_2$, i.e.,
$$W = \arg\min_{w \in \mathcal{W}_{S_1}} L_{S_2}(w).$$
Denoting the $n$th shatter coefficient and the VC dimension of $\mathcal{W}$ by $\mathbb{S}_{\mathcal{W}}(n)$ and $V$, respectively, and noting that $I(S_2; W \mid S_1) \le H(W \mid S_1) \le \log \mathbb{S}_{\mathcal{W}}(n_1)$ while the $0$–$1$ loss is $\tfrac{1}{2}$-subgaussian, we can upper-bound the expected generalization error of $W$ with respect to $S_2$ as
$$\mathbb{E}\big[L_\mu(W) - L_{S_2}(W)\big] \le \sqrt{\frac{\log \mathbb{S}_{\mathcal{W}}(n_1)}{2 n_2}} \le \sqrt{\frac{V \log(n_1 + 1)}{2 n_2}},$$
where the second inequality follows from the Sauer–Shelah lemma.
From an information-theoretic point of view, the above two-stage algorithm effectively controls the conditional mutual information $I(S_2; W \mid S_1)$ by extracting an empirical cover of $\mathcal{W}$ using $S_1$, while maintaining a small empirical risk using $S_2$.
4.3 Gibbs algorithm
As Theorem 1 shows that the generalization error can be upper-bounded in terms of $I(S; W)$, it is natural to consider an algorithm that minimizes the empirical risk regularized by $I(S; W)$:
$$P^\star_{W|S} = \arg\min_{P_{W|S}} \Big( \mathbb{E}[L_S(W)] + \tfrac{1}{\beta}\, I(S; W) \Big),$$
where $\beta > 0$ is a parameter that balances fitting and generalization. To deal with the issue that $\mu$, and hence $I(S; W)$, is unknown to the learning algorithm, we can relax the above optimization problem by replacing $I(S; W)$ with the upper bound $\mathbb{E}\big[D(P_{W|S} \,\|\, Q)\big]$, where $Q$ is an arbitrary distribution on $\mathcal{W}$ and the inequality $I(S; W) \le \mathbb{E}\big[D(P_{W|S} \,\|\, Q)\big]$ holds for any such $Q$, so that the solution of the relaxed optimization problem does not depend on $\mu$. It turns out that the well-known Gibbs algorithm solves the relaxed optimization problem.
Theorem 5 (proved in Appendix C).
The solution to the optimization problem
$$P^\star_{W|S=s} = \arg\min_{P_{W|S=s}} \Big( \mathbb{E}[L_s(W)] + \tfrac{1}{\beta}\, D(P_{W|S=s} \,\|\, Q) \Big), \quad s \in \mathcal{Z}^n,$$
is the Gibbs algorithm, which satisfies
$$P^\star_{W|S=s}(\mathrm{d}w) = \frac{e^{-\beta L_s(w)}\, Q(\mathrm{d}w)}{\mathbb{E}_Q\big[e^{-\beta L_s(W)}\big]}, \quad s \in \mathcal{Z}^n.$$
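Over a finite hypothesis set, the Gibbs algorithm above is simply a softmax of the empirical risks tilted by the prior $Q$. A minimal sketch of ours (empirical risks and prior below are hypothetical):

```python
import math
import random

def gibbs_posterior(emp_risks, prior, beta):
    """P(w_i) ∝ prior(w_i) * exp(-beta * L_S(w_i)) -- the Gibbs algorithm."""
    weights = [q * math.exp(-beta * l) for q, l in zip(prior, emp_risks)]
    total = sum(weights)
    return [w / total for w in weights]

def gibbs_sample(emp_risks, prior, beta, rng=random):
    """Draw the output hypothesis index from the Gibbs posterior."""
    post = gibbs_posterior(emp_risks, prior, beta)
    return rng.choices(range(len(post)), weights=post, k=1)[0]

emp_risks = [0.30, 0.10, 0.25]   # hypothetical empirical risks
prior = [1/3, 1/3, 1/3]          # uniform Q: no preference
post = gibbs_posterior(emp_risks, prior, beta=10.0)
# The lowest-risk hypothesis receives the largest posterior mass;
# beta -> infinity recovers ERM, while beta -> 0 recovers the prior Q.
print(max(range(3), key=lambda i: post[i]))  # 1
```

The parameter `beta` interpolates between pure data fit and pure prior, mirroring the fitting-generalization trade-off discussed above.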
We would not have been able to arrive at the Gibbs algorithm had we used $I(\Lambda_{\mathcal{W}}(S); W)$ as the regularization term instead of $I(S; W)$ in (25), even if we upper-bound it by a relative entropy with respect to a fixed distribution. Using the fact that the Gibbs algorithm is differentially private when the loss function is bounded exp_MT07 and the group property of differential privacy Dwo_dp_book14 , we can upper-bound the input-output mutual information of the Gibbs algorithm as $I(S; W) \le 2\beta$ when $\ell(w, z) \in [0, 1]$. Then from Theorem 1, we know that for $\ell(w, z) \in [0, 1]$, $\big|\mathrm{gen}(\mu, P^\star_{W|S})\big| \le \sqrt{\beta / n}$. Using Hoeffding’s lemma, a tighter upper bound on the expected generalization error for the Gibbs algorithm is obtained in stability_ITW16 , which states that if $\ell(w, Z)$ is $\sigma$-subgaussian for all $w \in \mathcal{W}$,
$$\mathrm{gen}(\mu, P^\star_{W|S}) \le \frac{2 \sigma^2 \beta}{n}.$$
With the guarantee on the generalization error, we can analyze the population risk of the Gibbs algorithm. We first present a result for countable hypothesis spaces.
Corollary 2 (proved in Appendix D).
Suppose $\mathcal{W}$ is countable. Let $W$ denote the output of the Gibbs algorithm applied on dataset $S$, and let $w^\circ$ denote the hypothesis that achieves the minimum population risk among $\mathcal{W}$. For $\ell(w, z) \in [0, 1]$, the population risk of $W$ satisfies
$$\mathbb{E}[L_\mu(W)] \le L_\mu(w^\circ) + \frac{1}{\beta} \log\frac{1}{Q(w^\circ)} + \frac{\beta}{2 n}.$$
The distribution $Q$ in the Gibbs algorithm can be used to express our preference, or our prior knowledge of the population risks, of the hypotheses in $\mathcal{W}$, in a way that a higher probability under $Q$ is assigned to a hypothesis that we prefer. For example, we can order the hypotheses according to our prior knowledge of their population risks, and set $Q(w_k) = \frac{6}{\pi^2 k^2}$ for the $k$th hypothesis in the order; then, setting $\beta = \sqrt{n}$, (29) becomes
$$\mathbb{E}[L_\mu(W)] \le L_\mu(w^\circ) + \frac{1}{\sqrt{n}} \Big( 2 \log k^\circ + \log\frac{\pi^2}{6} + \frac{1}{2} \Big),$$
where $k^\circ$ is the index of $w^\circ$. It means that a better prior knowledge of the population risks leads to a smaller sample complexity to achieve a certain expected excess risk. As another example, if $|\mathcal{W}| = k$ and we have no preference on any hypothesis, then taking $Q$
as the uniform distribution on $\mathcal{W}$ and setting $\beta = \sqrt{2 n \log k}$, (29) becomes
$$\mathbb{E}[L_\mu(W)] \le L_\mu(w^\circ) + \sqrt{\frac{2 \log k}{n}}.$$
For uncountable hypothesis spaces, we can do a similar analysis for the population risk under a Lipschitz assumption on the loss function.
Corollary 3 (proved in Appendix E).
Suppose $\mathcal{W} = \mathbb{R}^d$. Let $w^\circ$ be the hypothesis that achieves the minimum population risk among $\mathcal{W}$. Suppose $\ell(w, z) \in [0, 1]$ and $\ell(\cdot, z)$ is $\rho$-Lipschitz for all $z \in \mathcal{Z}$. Let $W$ denote the output of the Gibbs algorithm applied on dataset $S$. The population risk of $W$ satisfies, for any $a > 0$,
$$\mathbb{E}[L_\mu(W)] \le L_\mu(w^\circ) + \rho a + \frac{1}{\beta} \log\frac{1}{Q\big(\mathcal{B}(w^\circ, a)\big)} + \frac{\beta}{2 n},$$
where $\mathcal{B}(w^\circ, a)$ denotes the ball of radius $a$ centered at $w^\circ$.
Again, we can use the distribution $Q$ to express our preference of the hypotheses in $\mathcal{W}$. For example, we can choose $Q = \mathcal{N}(0, I_d)$ and choose $a = 1/(\rho \sqrt{n})$. Then, setting $\beta$ to optimize the trade-off in (31), we have
$$\mathbb{E}[L_\mu(W)] \le L_\mu(w^\circ) + O\!\left( \sqrt{\frac{d \log(\rho \sqrt{n}) + \|w^\circ\|^2}{n}} \right).$$
This result essentially has no restriction on $\mathcal{W}$, which could be unbounded, and only requires the Lipschitz condition on $\ell(\cdot, z)$, which could be non-convex. The sample complexity decreases with a better prior knowledge of the optimal hypothesis.
4.4 Noisy empirical risk minimization
Another algorithm with controlled input-output mutual information is the noisy empirical risk minimization algorithm, where independent noise $N_w$, $w \in \mathcal{W}$, is added to the empirical risk of each hypothesis, and the algorithm outputs a hypothesis that minimizes the noisy empirical risks:
$$W = \arg\min_{w \in \mathcal{W}} \big( L_S(w) + N_w \big).$$
Similar to the Gibbs algorithm, we can express our preference of the hypotheses by controlling the amount of noise added to each hypothesis, such that our preferred hypotheses will be more likely to be selected when they have similar empirical risks as other hypotheses. The following result formalizes this idea.
Corollary 4 (proved in Appendix F).
Suppose is countable and is indexed such that a hypothesis with a lower index is preferred over one with a higher index. Also suppose . For the noisy ERM algorithm in (33), choosing to be an exponential random variable with mean , we have
where . In particular, choosing , we have
Without adding noise, the ERM algorithm applied to the above case when can achieve Compared with (35), we see that performing noisy ERM may be beneficial when we have high-quality prior knowledge of and when is large.
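The mechanism in (33) can be sketched in a few lines: exponential noise with hypothesis-dependent mean is added to each empirical risk, and the argmin of the noisy risks is returned. This is our own illustrative sketch; the risks are hypothetical, and the noise means grow with the index to encode a preference for low-index hypotheses, as in Corollary 4:

```python
import random

def noisy_erm(emp_risks, noise_means, rng=random):
    """Return the index minimizing L_S(w_k) + N_k with N_k ~ Exp(mean b_k).
    A larger mean b_k (more noise) makes hypothesis k less likely to attain
    the minimum, encoding a preference against it."""
    noisy = [l + rng.expovariate(1.0 / b) for l, b in zip(emp_risks, noise_means)]
    return min(range(len(noisy)), key=lambda i: noisy[i])

rng = random.Random(0)
emp_risks = [0.20, 0.20, 0.20, 0.20]              # hypothetical tie among hypotheses
noise_means = [0.01 * (k + 1) for k in range(4)]  # prefer low-index hypotheses
counts = [0, 0, 0, 0]
for _ in range(10_000):
    counts[noisy_erm(emp_risks, noise_means, rng)] += 1
print(counts[0] > counts[3])  # the preferred hypothesis wins ties most often
```

When the empirical risks are tied, the selection probabilities are governed by the noise rates, so the preference encoded in the means decides the output, which is exactly the calibration role discussed above.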
4.5 Other methods to induce input-output mutual information stability
In addition to the Gibbs algorithm and the noisy ERM algorithm, many other methods may be used to control the input-output mutual information of the learning algorithm. One method is to preprocess the dataset $S$ to obtain $\tilde S$, and then run a learning algorithm on $\tilde S$. The preprocessing can be adding noise to the data or erasing some of the instances in the dataset, etc. In any case, we have the Markov chain $S - \tilde S - W$, which implies $I(S; W) \le I(\tilde S; W)$. Another method is the postprocessing of the output of a learning algorithm. For example, the weights $W$
generated by a neural network training algorithm can be quantized or perturbed by noise to obtain $\tilde W$. This gives rise to the Markov chain $S - W - \tilde W$, which implies $I(S; \tilde W) \le I(S; W)$. Moreover, strong data processing inequalities MR_SDPI may be used to sharpen these upper bounds on the input-output mutual information. Preprocessing of the dataset and postprocessing of the output hypothesis are among numerous regularization methods used in the field of deep learning (DL_book2016, Ch. 7.5). Other regularization methods may also be interpreted as ways to induce the input-output mutual information stability of a learning algorithm, and this would be an interesting direction of future research.
4.6 Adaptive composition of learning algorithms
Beyond analyzing the generalization error of individual learning algorithms, examining the input-output mutual information is also useful for analyzing the generalization capability of complex learning algorithms obtained by adaptively composing simple constituent algorithms. Under a $k$-fold adaptive composition, the dataset $S$ is shared by $k$ learning algorithms that are sequentially executed. For $j = 1, \dots, k$, the output $W_j$ of the $j$th algorithm may be drawn from a different hypothesis space $\mathcal{W}_j$ based on $S$ and the outputs $W_1, \dots, W_{j-1}$ of the previously executed algorithms, according to $P_{W_j | S, W_1, \dots, W_{j-1}}$. An example with $k = 2$
is model selection followed by a learning algorithm using the same dataset. Various boosting techniques in machine learning can also be viewed as instances of adaptive composition. From the data processing inequality and the chain rule of mutual information,
$$I(S; W_k) \le I(S; W_1, \dots, W_k) = \sum_{j=1}^{k} I(S; W_j \mid W_1, \dots, W_{j-1}).$$
If the Markov chain $(W_1, \dots, W_{j-2}) - (S, W_{j-1}) - W_j$ holds for $j = 3, \dots, k$, then the upper bound in (36) can be sharpened to $\sum_{j=1}^{k} I(S; W_j \mid W_{j-1})$. We can thus control the generalization error of the final output $W_k$ by controlling the conditional mutual information at each step of the composition. This also gives us a way to analyze the generalization error of the composed learning algorithm using the knowledge of local generalization guarantees of the constituent algorithms.
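The chain-rule bookkeeping above is easy to mechanize: if each constituent algorithm certifies a conditional mutual-information budget, the composed algorithm's budget is their sum, which then plugs into the Theorem 1 bound. A sketch of ours with hypothetical per-step budgets:

```python
import math

def composed_gen_bound(sigma, step_budgets_nats, n):
    """Bound |gen| of a k-fold adaptive composition via the chain rule
    I(S; W_1..W_k) <= sum_j I(S; W_j | W_1..W_{j-1}), then apply the
    Theorem-1-style bound sqrt(2*sigma^2*I/n) to the total budget."""
    total = sum(step_budgets_nats)
    return math.sqrt(2 * sigma**2 * total / n)

# Hypothetical: model selection (0.5 nat) followed by training (2.0 nats).
bound = composed_gen_bound(sigma=0.5, step_budgets_nats=[0.5, 2.0], n=10_000)
print(round(bound, 4))  # sqrt(2 * 0.25 * 2.5 / 10000) ≈ 0.0112
```

Adding a step can only enlarge the budget, so the bound degrades gracefully with the depth of the composition.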
We would like to thank Vitaly Feldman and Vivek Bagaria for pointing out errors in the earlier version of this paper. We also would like to thank Peng Guan for helpful discussions.
- (1) S. Boucheron, O. Bousquet, and G. Lugosi, “Theory of classification: a survey of some recent advances,” ESAIM: Probability and Statistics, vol. 9, pp. 323–375, 2005.
- (2) O. Bousquet and A. Elisseeff, “Stability and generalization,” J. Machine Learning Res., vol. 2, pp. 499–526, 2002.
- (3) D. Russo and J. Zou, “How much does your data exploration overfit? Controlling bias via information usage,” arXiv preprint, 2016. [Online]. Available: https://arxiv.org/abs/1511.05219
- (4) C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, “Preserving statistical validity in adaptive data analysis,” in Proc. of 47th ACM Symposium on Theory of Computing (STOC), 2015.
- (5) ——, “Generalization in adaptive data analysis and holdout reuse,” in 28th Annual Conference on Neural Information Processing Systems (NIPS), 2015.
- (6) R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman, “Algorithmic stability for adaptive data analysis,” in Proceedings of The 48th Annual ACM Symposium on Theory of Computing (STOC), 2016.
- (7) I. Alabdulmohsin, “Algorithmic stability and uniform generalization,” in 28th Annual Conference on Neural Information Processing Systems (NIPS), 2015.
- (8) ——, “An information-theoretic route from generalization in expectation to generalization in probability,” in 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
- (9) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations (ICLR), 2017.
- (10) S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
- (11) S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, “Learnability, stability and uniform convergence,” J. Mach. Learn. Res., vol. 11, pp. 2635–2670, 2010.
- (12) Y.-X. Wang, J. Lei, and S. E. Fienberg, “On-average kl-privacy and its equivalence to generalization for max-entropy mechanisms,” in Proceedings of the International Conference on Privacy in Statistical Databases, 2016.
- (13) M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, “Information-theoretic analysis of stability and bias of learning algorithms,” in Proceedings of IEEE Information Theory Workshop, 2016.
- (14) K. L. Buescher and P. R. Kumar, “Learning by canonical smooth estimation. I. Simultaneous estimation,” IEEE Transactions on Automatic Control, vol. 41, no. 4, pp. 545–556, Apr. 1996.
- (15) L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, 1996.
- (16) F. McSherry and K. Talwar, “Mechanism design via differential privacy,” in Proceedings of 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2007.
- (17) C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, 2014.
- (18) M. Raginsky, “Strong data processing inequalities and $\Phi$-Sobolev inequalities for discrete channels,” IEEE Trans. Inform. Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
- (19) I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
- (20) S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford Univ. Press, 2013.
- (21) T. Zhang, “Information-theoretic upper and lower bounds for statistical estimation,” IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1307–1321, 2006.
- (22) Y. Polyanskiy and Y. Wu, “Lecture Notes on Information Theory,” Lecture Notes for ECE563 (UIUC) and 6.441 (MIT), 2012-2016. [Online]. Available: http://people.lids.mit.edu/yp/homepage/data/itlectures_v4.pdf
- (23) S. Verdú, “The exponential distribution in information theory,” Problems of Information Transmission, vol. 32, no. 1, pp. 86–95, 1996.
Appendix A Proof of Lemma 1
Just like Russo and Zou RusZou16 , we exploit the Donsker–Varadhan variational representation of the relative entropy (Boucheron_etal_concentration_book, Corollary 4.15): for any two probability measures $P$ and $Q$ on a common measurable space,
$$D(P \,\|\, Q) = \sup_{F} \Big( \mathbb{E}_P[F] - \log \mathbb{E}_Q\big[e^F\big] \Big),$$
where the supremum is over all measurable functions $F$ such that $\mathbb{E}_Q[e^F] < \infty$. From (A.1), we know that for any $\lambda \in \mathbb{R}$,
$$\lambda \big( \mathbb{E}[f(X, Y)] - \mathbb{E}[f(\bar X, \bar Y)] \big) \le D\big(P_{X,Y} \,\|\, P_X \otimes P_Y\big) + \frac{\lambda^2 \sigma^2}{2},$$
where the second step follows from the subgaussian assumption on $f(\bar X, \bar Y)$:
$$\log \mathbb{E}\Big[ e^{\lambda \left( f(\bar X, \bar Y) - \mathbb{E} f(\bar X, \bar Y) \right)} \Big] \le \frac{\lambda^2 \sigma^2}{2}.$$
Inequality (A.2) gives a nonnegative parabola in $\lambda$, whose discriminant must be nonpositive, which implies
$$\big| \mathbb{E}[f(X, Y)] - \mathbb{E}[f(\bar X, \bar Y)] \big| \le \sqrt{2 \sigma^2\, D\big(P_{X,Y} \,\|\, P_X \otimes P_Y\big)}.$$
The result follows by noting that $D\big(P_{X,Y} \,\|\, P_X \otimes P_Y\big) = I(X; Y)$.
Appendix B Proof of Theorem 3
To prove Theorem 3, we need the following two lemmas.
Lemma B.1. Consider the parallel execution of $m$ independent copies of $P_{W|S}$ on independent datasets $S_1, \dots, S_m$: for $j = 1, \dots, m$, an independent copy of $P_{W|S}$ takes $S_j$ as input and outputs $W_j$. Define $S^m = (S_1, \dots, S_m)$ and $W^m = (W_1, \dots, W_m)$. If, under $\mu$, $P_{W|S}$ satisfies $I(S; W) \le \epsilon$, then the overall algorithm $P_{W^m | S^m}$ satisfies $I(S^m; W^m) \le m \epsilon$.
The proof is based on the independence among the pairs $(S_j, W_j)$, $j = 1, \dots, m$, and the chain rule of mutual information. ∎
Lemma B.2. Let $\tilde W = (T, W)$, where $T$ takes values in $\{1, \dots, m\}$. If an algorithm $P_{\tilde W | S^m}$ satisfies $I(S^m; \tilde W) \le \epsilon$, and if $\ell(w, Z)$ is $\sigma$-subgaussian for all $w \in \mathcal{W}$, then
$$\big| \mathbb{E}\big[ L_\mu(W) - L_{S_T}(W) \big] \big| \le \sqrt{\frac{2 \sigma^2 \epsilon}{n}}.$$
Proof of Theorem 3.
The proof is an adaptation of a “monitor technique” proposed by Bassily et al. AlStDp . First, let $P_{W^m | S^m}$ be the parallel execution of $m$ independent copies of $P_{W|S}$: for $j = 1, \dots, m$, an independent copy of $P_{W|S}$ takes an independent $S_j$ as input and outputs $W_j$. Given $S^m$ and $W^m$, let the output of the “monitor” be a sample $\tilde W = (T, W_T)$ drawn from $\{1, \dots, m\} \times \mathcal{W}$ according to
$$T = \arg\max_{j \in \{1, \dots, m\}} \big| L_\mu(W_j) - L_{S_j}(W_j) \big|.$$
Taking expectation on both sides, we have
Note that conditional on $W^m$, the tuple $(T, W_T)$ can take only $m$ values, which means that
$$I\big(S^m; (T, W_T) \,\big|\, W^m\big) \le \log m.$$
In addition, since $P_{W|S}$ is assumed to satisfy $I(S; W) \le \epsilon$, Lemma B.1 implies that
$$I(S^m; W^m) \le m \epsilon.$$
Therefore, by the chain rule of mutual information and the data processing inequality, we have
$$I\big(S^m; (T, W_T)\big) \le I(S^m; W^m) + I\big(S^m; (T, W_T) \,\big|\, W^m\big) \le m \epsilon + \log m.$$
By Lemma B.2 and the assumption that $\ell(w, Z)$ is $\sigma$-subgaussian,
$$\big| \mathbb{E}\big[ L_\mu(W_T) - L_{S_T}(W_T) \big] \big| \le \sqrt{\frac{2 \sigma^2 (m \epsilon + \log m)}{n}}.$$
The rest of the proof is by contradiction. Choose $m = \lceil 1/\beta \rceil$. Suppose the algorithm does not satisfy the claimed generalization property, namely,
$$\mathbb{P}\big[ |L_\mu(W) - L_S(W)| > \alpha \big] > \beta.$$
Then, by the independence among the pairs $(S_j, W_j)$, $j = 1, \dots, m$,