Bounding the generalization error of machine learning algorithms has been studied extensively in the literature. Classical results in this area, such as those based on the VC dimension and the Rademacher complexity, focus mostly on capturing the structural constraints of the hypothesis class. Despite the considerable insights offered by such bounds, they do not adequately explain the performance of powerful learning algorithms such as deep neural networks [1]. There has been significant recent interest in finding information-theoretic bounds that capture more data-dependent and distribution-dependent structures, which may lead to strengthened bounds.
Asadi et al. [2] introduced the chaining technique, which has traditionally been used to bound random processes, into the derivation of information-theoretic generalization bounds. The technique resolves the issue that certain unbounded mutual information quantities lead to vacuous bounds, and may also yield a tighter bound in general. The main idea behind the result in [2] can be summarized as follows. The generalization error can be viewed as a random process indexed by the hypothesis parameters. If the parameter space is a bounded metric space, then it can be divided into finer and finer partitions, with each coarse partition embedded into the finer partition of the next layer, and the partition cells having decreasing radii. The generalization error can then be represented by a sum of chained quantities, each relating to two adjacent partition layers. Since the partitions become finer and finer, each of these decomposed quantities can be bounded more effectively, eventually resulting in an overall tighter bound. This approach is referred to as chaining mutual information.
Despite the success of the chaining mutual information approach, we observe several difficulties in applying the chaining technique in this manner, which motivated the current work:
Restriction on the metric space to be bounded: This chaining approach assumes a bounded metric space. However, even in some of the simplest settings, the parameter space may not be bounded (or it may be impractical to assume that a bound on it is known).
Difficulty in computation: Using these deterministic and hierarchical partitions, the information measures involved in the bounds can be difficult to compute or bound analytically.
Restrictions in the partitions: The hierarchical partitions place certain unnecessary geometric constraints on the covering radius sequence of the required partitions, which can impact the bound.
To make these difficulties more concrete, consider the following two simple examples.
Example-1: The training samples are drawn following a normal distribution with an unknown mean, and the algorithm wishes to estimate this mean. Here the parameter space is unbounded under any meaningful metric, and particularly so for the natural Euclidean distance. Moreover, since the induced measure on the parameter space will not be uniform, computing the series sum of mutual information is rather difficult if not impossible.
Example-2: The training samples are standard normal vectors in $\mathbb{R}^2$. The learner needs to identify the phase of the vector through certain means, and the learned result is modeled as the true phase with certain additive noise. Here the parameter space is the bounded interval of the angle. A natural sequence of partitions is to reduce the stepsize by an integer factor at each level. However, this would preclude any non-integer factors, which potentially makes the bound looser.
1.2 Main contribution: stochastic chaining
The sequence of refining partitions of the metric space associated with the chaining technique is reminiscent of multilevel quantization in data compression. For example, a scalar source distributed on the real line can be quantized with a stepsize that shrinks as the level index grows, yielding a quantized representation at each level. As the index increases, the stepsize reduces and the accuracy of the quantization improves; see the left side of Fig. 1 for an illustration.
The information-theoretic model for multilevel quantization is usually referred to as successive refinement source coding [3, 4]. Particularly useful to us is a stochastic abstraction in this framework. For example, with a fixed total number of levels, one possible stochastic representation writes the reconstruction at each level as the source plus a weighted sum of noise terms, where the noise terms are mutually independent and also independent of the source, and the weights are certain fixed scalar coefficients; see the right side of Fig. 1. The relation among the source and its reconstructions is then captured by their joint probability distribution, and we can measure the “distance” between successive reconstructions statistically, in contrast to the conventional chaining approach, which uses the covering radius.
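The contrast between deterministic multilevel quantization and its stochastic abstraction can be sketched numerically. The following is only an illustration under stated assumptions: the stepsize $2^{-k}$, the Gaussian noise with standard deviation $2^{-k}$, and the function names `quantize` and `stochastic_chain` are hypothetical choices, not taken from the paper.

```python
import numpy as np

def quantize(w, k):
    """Deterministic level-k quantizer; the stepsize 2**-k is an assumed choice."""
    step = 2.0 ** (-k)
    return step * np.round(np.asarray(w, dtype=float) / step)

def stochastic_chain(w, num_levels, rng):
    """Additive-noise reconstructions W_0, ..., W_K with W_K = w.

    Level k-1 is level k plus fresh independent Gaussian noise of (assumed)
    standard deviation 2**-k, so coarser levels carry more accumulated noise.
    """
    w = np.asarray(w, dtype=float)
    levels = [w]  # finest level K equals the source itself
    for k in range(num_levels, 0, -1):
        noise = rng.normal(scale=2.0 ** (-k), size=w.shape)
        levels.append(levels[-1] + noise)  # W_{k-1} = W_k + noise_k
    return levels[::-1]  # list index k now holds level k

rng = np.random.default_rng(0)
w = rng.normal(size=100_000)  # a scalar source; no bounded support is needed
chain = stochastic_chain(w, num_levels=4, rng=rng)

for k in range(1, 5):
    worst_quant_error = np.max(np.abs(quantize(w, k) - w))
    mean_sq_gap = np.mean((chain[k] - chain[k - 1]) ** 2)
    print(k, worst_quant_error, mean_sq_gap)
```

Because each coarser level is obtained from the next finer one by adding independent noise, the levels form a Markov chain by construction, and the mean squared gap between adjacent levels plays the role that the covering radius plays for the quantizer.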
The main idea of this work is that these abstracted stochastic versions of the hypothesis parameter can be used to replace the partition-based quantized versions in bounding the generalization error. This new approach helps to resolve the difficulties mentioned above: first, the restriction that the metric space be bounded is naturally removed; second, the computation is simplified; and third, the abstract model can remove the geometric constraints in designing the hierarchical partitions in some cases.
The proposed stochastic chaining approach essentially allows more flexible constructions of the chains than the more traditional deterministic chaining. One can attempt to further optimize the construction of stochastic chains based on the existing knowledge regarding the underlying metric space and the corresponding probability distribution for the given problem setting. On the other hand, when such knowledge is not available, we can safely fall back to the default construction of the original deterministic chaining partitions, which is essentially a special case of the stochastic chaining.
We obtain two generalization bounds using stochastic chaining instead of the deterministic chaining in [2], built on the mutual information bound given in [5] and the individual sample mutual information bound given in [6], respectively. We further show that the proposed bound correctly reduces to the VC-dimension bound. We then illustrate the benefits of this new approach in the context of the two examples. For the problem of estimating the Gaussian mean mentioned above, we obtain a bound that is order-wise stronger than those previously given in the literature. For the phase retrieval problem considered in [2], the bound can be naturally improved by optimizing over a continuous parameter.
1.3 Related works
The approach of using information-theoretic tools to develop generalization error bounds was pioneered by Russo and Zou [7], and then extended by Xu and Raginsky [5]. The bound in [5] was later tightened by Bu et al. in [6], using the individual sample mutual information instead of the sample set mutual information. Steinke and Zakynthinou [8] proposed a conditional mutual information based bound by introducing pseudo samples and splitting them into training and testing samples. Combining the idea of error decomposition [6] and the conditional mutual information bound in [8], Haghifam et al. [9] provided a sharpened bound based on conditional individual mutual information. Hafez-Kolahi et al. proposed a streamlined view of several bounding techniques proposed in the literature [10]. Rodríguez-Gálvez et al. [11] and Zhou et al. [12] further improved the conditional individual mutual information type of bound. Hellström and Durisi [13, 14] used information density to unify several existing bounds and bounding approaches. Similar bounds using other measures can be found in [15]. Information-theoretic bounds have also been used to bound generalization errors in noisy and iterative learning algorithms [16, 17, 18, 9, 11].
1.4 Organization of the paper
The rest of the paper is organized as follows. In Section 2, we introduce the necessary notation and general background on the chaining technique and successive refinement source coding. The main results are given in Section 3 with a few important discussions. In Section 4, we return to the two motivating examples to illustrate the benefits of the proposed bound, and Section 5 concludes the paper.
2 Notation and preliminaries
2.1 Generalization error
Consider the supervised learning setting, and denote the data domain as, where is the instance domain and is the label domain. The hypothesis class is denoted as , where is the parameter space, or more generally the index set of the hypothesis class. A learning algorithm has access to a sequence of training samples , where each is drawn independently from following some unknown probability distribution , and the notation is used to denote the set . The mapping from the data set to the hypothesis can be represented by the kernel .
Under a loss function, the population risk is given as
For training using the data set , the empirical risk of a given hypothesis is
The generalization error for the given data set is defined as
The expected generalization error is defined as
where the expectation is taken over the joint distribution. Our goal is to bound in this work.
Generalization error can also be written in a different form by defining
Clearly . It is worth noting that the distribution is obtained by marginalizing over (and dividing ).
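For orientation, the quantities defined in this subsection can be written in one common notation. The symbols below ($\ell$ for the loss, $\mu$ for the data distribution, $S=(Z_1,\ldots,Z_n)$ for the training set, $W$ for the learned parameter) are assumed here for illustration and need not match the notation used elsewhere in this paper.

```latex
\begin{align*}
  L_\mu(w) &= \mathbb{E}_{Z\sim\mu}\big[\ell(w,Z)\big]
      && \text{(population risk)}\\
  L_S(w) &= \frac{1}{n}\sum_{i=1}^{n}\ell(w,Z_i)
      && \text{(empirical risk)}\\
  \operatorname{gen}(w,S) &= L_\mu(w)-L_S(w)
      && \text{(generalization error for a given data set)}\\
  \overline{\operatorname{gen}} &= \mathbb{E}\big[L_\mu(W)-L_S(W)\big]
      && \text{(expected generalization error)}
\end{align*}
```

In the last line the expectation is over the joint distribution of the training set and the learned parameter.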
2.2 Random process and partitions
Let be a random process with the index set . There is a metric on
which describes the dependence among the random variables in the random process. For simplicity,is written as whenever it does not cause confusion.
The following definitions are standard.
Definition 1 (Separable process).
The random process on the metric space is called separable if there is a dense countable set such that for any , there exists a sequence in such that and .
All the random processes considered in this paper are separable.
Definition 2 (Sub-Gaussian process).
The random process on the metric space is called sub-Gaussian if for all , and
Let be a random variable on the set . From here on, we use a capital letter (such as ) to denote a random variable, and the corresponding lowercase letter (such as ) to indicate a realization. In the particular setting of generalization bounds, the random variable is the index (or the parameters) of the hypothesis chosen by the possibly randomized learning algorithm using the stochastically generated data set . The random process is indexed by in this case. is jointly distributed with , and is the generalization error of interest.
A well-known tool in bounding a random process is the chaining technique [19]. The notion of an increasing sequence of -partition of the metric space is particularly important in this setting.
Definition 3 (Increasing sequence of -partition).
A partition of the set is called an -partition of the metric space if for all , can be contained within a ball of radius . A sequence of partitions of a set is called an increasing sequence if for any and each , there exists such that .
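As a concrete instance of Definition 3, dyadic partitions of the unit interval form an increasing sequence: every cell at level $k+1$ lies inside a single cell at level $k$, and each level-$k$ cell fits in a ball of radius $2^{-(k+1)}$. The sketch below checks the nesting property; the interval $[0,1)$, the dyadic cells, and the helper names are illustrative assumptions.

```python
import math

def cell_index(w, k):
    """Index of the level-k dyadic cell of [0, 1) that contains w."""
    return min(int(math.floor(w * 2 ** k)), 2 ** k - 1)

def cell_bounds(idx, k):
    """Left and right endpoints of cell idx at level k (width 2**-k)."""
    width = 2.0 ** (-k)
    return idx * width, (idx + 1) * width

def is_nested(k):
    """True if every level-(k+1) cell is contained in some level-k cell."""
    for idx in range(2 ** (k + 1)):
        lo, hi = cell_bounds(idx, k + 1)
        parent_lo, parent_hi = cell_bounds(idx // 2, k)  # candidate parent cell
        if not (parent_lo <= lo and hi <= parent_hi):
            return False
    return True

print(all(is_nested(k) for k in range(6)))
print(cell_index(0.3, 2), cell_bounds(cell_index(0.3, 2), 2))
```

The parent of cell `idx` at level $k+1$ is cell `idx // 2` at level $k$, which is exactly the kind of geometric constraint between consecutive levels that hierarchical partitions impose.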
In the context of bounding the generalization error, when it is viewed as a random process , we are interested in the expectation .
2.3 Information theory and successive refinement source coding
For a discrete random variable, the entropy is denoted as , and for a continuous random variable, its differential entropy is denoted as . The mutual information between two random variables is denoted as , regardless of whether they are discrete or continuous. We use the natural logarithm in this work, and thus information is measured in nats.
Successive refinement source coding considers the problem of encoding a source in a total of stages, each with rate nats of coding budget, and the end user uses the encoded information in all previous stages, i.e., stages , to reconstruct the source at stage . The achievable rate region, i.e., the set of encoding rate vectors, is given as the collection of nonnegative rates such that
where is a random variable representing the stochastic reconstruction of at stage-, which guarantees , i.e., the distortion is less than or equal to the given distortion budget ; see  for more details. One particularly useful choice is to make
a Markov chain. The simplification is immediate since
in this case. When the choice of satisfying the Markov chain also yields the optimal coding rates among all possible choices of auxiliary random variables, the source is called successively refinable . More results on this problem can be found in [20, 21].
3 Main results
3.1 Main Theorems
We define a new notion of stochastic chain as follows.
Definition 4 (Stochastic chain of random process and random variable pair).
Let be a random process and random variable pair, where is a random variable in the set . A sequence of random variables , each distributed in the set , is called a stochastic chain of the pair , if 1) , 2) , and 3) is a Markov chain for every .
We allow to take the value of instead of providing another parallel definition to that effect. We are now ready to present the first main theorem of this work.
Theorem 1. Assume is sub-Gaussian on , and is a stochastic chain of . Then
Moreover, we have
The following theorem is based on the individual sample mutual information bound of [6].
Theorem 2. For each , assume is sub-Gaussian on , and is a stochastic chain of . Then
Moreover, we have
These two theorems are given in the context of bounding generalization errors, which are obtained using a more general result on bounding random processes.
Theorem 3. Assume is sub-Gaussian on , and is a stochastic chain for . Then
Moreover, we have
By using a deterministic sequence of partitions to form , we recover the result in [2], which was obtained for a bounded metric space .
Corollary 1. Let be an increasing sequence of partitions of , where for each , is a -partition of the bounded metric space , and . Let be the center of the covering ball of the partition cell that belongs to in the partition , then for separable process on ,
for all , we can apply the data processing inequality for KL divergence  and that for mutual information, respectively, to arrive at
When the process is not sub-Gaussian, more general forms of these bounds can also be found in terms of the cumulant generating function. This result is given in the supplementary material.
3.2 Relations to existing results
Connection to VC theory:
For binary classification problems, i.e., with zero-one loss, the generalization error of any classifier is upper bounded as , where is the VC-dimension of the classification function class (cf. [23, Ch. 6]). The generalization error bound in Theorem 3, or more precisely the proposed stochastic chaining approach, can naturally recover the VC-dimension based bound, and we establish this connection in the supplementary material.
Discussion on the chaining construction:
The conventional deterministic chaining places certain structural constraints on the hierarchical partitions. For example, consider a partition of a bounded 2-D space using congruent hexagon cells; the next partition at the higher level will be collections of such hexagons. This implies that the hierarchy must follow a certain relation between consecutive levels, and the analysis of such hierarchical partitions can be complex. The stochastic chaining technique can, in many cases, remove the geometric constraints in the design of hierarchical partitions as in Corollary 1. In the example above, we can replace the partition using either an additive Gaussian noise or an additive noise with a uniform distribution on hexagons (see the second example in the next section, where a similar uniform additive noise is used).
Since stochastic chains include conventional partition-based chaining as a special case, they are not more difficult to construct. In fact, the construction can be more straightforward due to its flexibility. For example, for a bounded metric space, we can use the following generic construction: let be uniformly distributed on a metric ball of radius centered at . If more information regarding the distribution of is known, we can further optimize the chain, e.g., by adjusting the radii so that they depend on the density value of ; more specifically, we can let the radius be larger for values of lower density, and vice versa. If the metric space is also a vector space, it can be convenient to let be some vector Gaussian distribution with covariance scaling like . This allows more opportunity for optimization for stronger bounds in a parametric form. In contrast, it is impossible to design partitions (or deterministic mappings ) to mimic such behaviors, let alone to find an analytic bound. This issue in fact has a natural origin in source coding: deterministic quantization design vs. probabilistic forward test channel modeling. The latter is used in source coding for mathematically precise characterization, and for analytic optimization.
Comparison to the chaining technique in [10]:
The alternative chaining method proposed by Hafez-Kolahi et al. (Theorem 6 in [10]) used a different chaining construction, which does not require hierarchical partitions, and to some extent it helps resolve the difficulty in designing such hierarchical partitions. However, this simplification comes at a heavy price: the learning algorithm must be deterministic, the hypothesis space still needs to be bounded (since the core steps rely on ), and there is a factor of 2 loss in the bound. These restrictions make it inapplicable to the two examples we study in the next section. In contrast, the proposed method applies to unbounded metric spaces, and does not require the learning algorithm to be deterministic.
4 Two examples: estimating the Gaussian mean and phase retrieval
We analyze two simple settings, which demonstrate the effectiveness of the proposed stochastic chaining technique. The purpose of discussing the following two examples is by no means to literally characterize the generalization error, since the generalization error can be calculated directly due to the simplicity of the examples. Rather, we aim to show the effectiveness of the proposed stochastic chaining technique in these two examples by comparing with the underlying generalization error and some previous generalization error bounds.
4.1 Estimating the Gaussian Mean
Consider the case when the training samples are drawn following for some unknown . Here , and a natural choice of the metric on this space is the (scaled) Euclidean distance. The loss function is , and by defining , the random process (indexed by ) of interest can be written as
It follows that
which is sub-Gaussian with . The learner deterministically estimates by averaging the training samples, i.e., . We shall use Theorem 1 to bound the generalization error in this case.
To build a stochastic chain, select a sequence of mutually independent Gaussian noise , which is independent of , and , where . Define the cumulative noise
where . The stochastic chain is designed as
where . We then have
where and are independent. Under this stochastic chain, we can derive the expression for and the mutual information term . Specifically, , which relies on the relations between and in (23) and the detailed calculation is given in the supplementary material. The mutual information can be upper bounded as
where the inequality is due to the data processing inequality over the Markov chain and the equality is by the Gaussian channel nature of the stochastic chain design. The detailed proof steps are given in the supplementary material. A bound of the following form can then be obtained
Note that the series sum on the right hand side of (25) converges, and thus the bound is of order . Bounding the series sum using numerical methods, we can then obtain .
Due to the simplicity of the setting, the generalization error can in fact be calculated exactly to be . It can be seen that the generalization bound offered by Theorem 1 has the same order as the true generalization error. In contrast, the authors of [6] derived a generalization error bound of the order using the individual sample mutual information approach. Thus the proposed approach results in an order-wise improvement in this example. More importantly, it can be seen that the proposed chaining approach allows us to overcome the limitation of a bounded metric space (i.e., the chaining mutual information approach [2] does not even apply in this setting), and also simplifies the calculation due to the introduced dependence structure in the chain. In the supplementary material, we further derive an improved bound (with a slightly better constant factor) using Theorem 2.
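The exact generalization error of the sample-mean learner can be checked by simulation. The sketch below is hedged: the loss function is not recoverable from the text above, so it assumes the squared loss $\ell(w,z)=\|w-z\|^2$ with samples drawn from $\mathcal{N}(\mu,\sigma^2 I_d)$, under which a direct calculation gives an expected generalization error of $2\sigma^2 d/n$. The function name and parameter values are illustrative.

```python
import numpy as np

def simulate_gen_gap(n, d, sigma, trials, seed=0):
    """Monte Carlo estimate of E[population risk - empirical risk] for the sample mean.

    Assumes squared loss l(w, z) = ||w - z||^2 and Z_i ~ N(mu, sigma^2 I_d).
    """
    rng = np.random.default_rng(seed)
    mu = np.zeros(d)  # by translation invariance the gap does not depend on mu
    z = rng.normal(loc=mu, scale=sigma, size=(trials, n, d))
    w = z.mean(axis=1)  # learned hypothesis in each trial
    # Population risk in closed form: E||w - Z||^2 = ||w - mu||^2 + sigma^2 * d.
    pop = np.sum((w - mu) ** 2, axis=1) + sigma ** 2 * d
    # Empirical risk: average squared distance from w to the training samples.
    emp = np.mean(np.sum((z - w[:, None, :]) ** 2, axis=2), axis=1)
    return float(np.mean(pop - emp))

n, d, sigma = 20, 3, 1.0
estimate = simulate_gen_gap(n, d, sigma, trials=40_000)
closed_form = 2 * sigma ** 2 * d / n  # valid under the assumed squared loss
print(f"simulated: {estimate:.4f}, closed form under the assumed loss: {closed_form:.4f}")
```

Under this assumed loss the gap decays like $1/n$, which is the rate any order-tight bound for this example would need to match.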
4.2 Phase retrieval
| Chaining mutual information | 1.1013 | 0.7507 | 0.5709 | 0.4612 | 0.2364 | 0.1204 | 0.0610 |
| Stochastic chaining () | 0.4951 | 0.3387 | 0.2581 | 0.2088 | 0.1074 | 0.0548 | 0.0278 |
In the phase retrieval example given in [2], the data is a standard normal vector in . The hypothesis class is , and through the transformation for , it is in fact the same as ; we will use them interchangeably. Define the loss function , which implies that the learner wishes to estimate an angle for the underlying data, and the generalization error process is a Gaussian process . The metric is the Euclidean distance, and the process is sub-Gaussian. Suppose the learned parameter is
where is independent of , and has an atom with a mass on 0, and that is uniformly distributed in . Note that is exactly the phase of , which will be the hypothesis learned by an ERM learner, and being retrieved here is a noisy version of the phase.
The stochastic chain can be given as
where , and is uniformly distributed on for some to be specified later; ’s are mutually independent and also independent of the hypothesis parameter .
Since is independent of and uniformly distributed on , we have . It is also clear that when a.s., and thus since the process is Gaussian. Since is exactly , the Euclidean distance between and (using their vector representations) is bounded by the length of the arc, i.e., . We can now apply Theorem 1, where
The second term can be bounded as
using the fact that more conditioning reduces the differential entropy. Due to the structure of the distribution of and , the density of can be written down explicitly as
Thus we can bound and subsequently using this density function, which eventually gives
When choosing , this is almost identical to the result given in [2] using the partition-based chaining, except for the slightly better coefficient instead of . This improved coefficient is mainly due to the more explicit bound on inherent in the Euclidean space, instead of the same distance derived in a generic metric space.
One advantage of the proposed approach is that we can further optimize over . Observe that the series has a faster decaying tail if is large; however, the first term, i.e., , approaches when . Thus there is an optimal value in between for this bound. Numerical results suggest , which provides a slight improvement compared to . As noted in [2], in this toy setting, we can in fact calculate the exact true value . A comparison of several bounds is given in Table 1. To obtain (31), we have in fact relaxed the bound in (29) for convenience using a simple property of the entropy function, which loosens the bound to some extent. Moreover, we have chosen to use the geometric sequence to produce the stochastic chain, and it is possible that other sequences can produce tighter bounds.
5 Conclusion
We proposed a new chaining-based approach to bound the generalization error by replacing the hierarchical partitions with a stochastic chain. The proposed approach naturally removes the restriction that the metric space be bounded, helps to simplify the computation, and in some cases removes the geometric constraints in designing the hierarchical partitions. Two examples are used to illustrate that the proposed approach can overcome some difficulties in applying the chaining mutual information approach. The roles that chaining can play in bounding the generalization error in conjunction with other information-theoretic approaches, such as the conditional mutual information [8], information density [13, 14], and Wasserstein distance [15], as well as the possible application to noisy and stochastic learning algorithms, call for further research.
References
[1] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations (ICLR), Apr. 2017.
[2] A. Asadi, E. Abbe, and S. Verdú, “Chaining mutual information and tightening generalization bounds,” in Advances in Neural Information Processing Systems, 2018, pp. 7234–7243.
[3] B. Rimoldi, “Successive refinement of information: Characterization of the achievable rates,” IEEE Transactions on Information Theory, vol. 40, no. 1, pp. 253–259, 1994.
[4] W. H. Equitz and T. M. Cover, “Successive refinement of information,” IEEE Transactions on Information Theory, vol. 37, no. 2, pp. 269–275, 1991.
[5] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” in Advances in Neural Information Processing Systems, 2017, pp. 2524–2533.
[6] Y. Bu, S. Zou, and V. V. Veeravalli, “Tightening mutual information based bounds on generalization error,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 121–130, 2020.
[7] D. Russo and J. Zou, “Controlling bias in adaptive data analysis using information theory,” in Artificial Intelligence and Statistics, 2016, pp. 1232–1240.
[8] T. Steinke and L. Zakynthinou, “Reasoning about generalization via conditional mutual information,” in Conference on Learning Theory. PMLR, 2020, pp. 3437–3452.
[9] M. Haghifam, J. Negrea, A. Khisti, D. M. Roy, and G. K. Dziugaite, “Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms,” Advances in Neural Information Processing Systems, vol. 33, pp. 9925–9935, 2020.
[10] H. Hafez-Kolahi, Z. Golgooni, S. Kasaei, and M. Soleymani, “Conditioning and processing: Techniques to improve information-theoretic generalization bounds,” Advances in Neural Information Processing Systems, vol. 33, 2020.
[11] B. Rodríguez-Gálvez, G. Bassi, R. Thobaben, and M. Skoglund, “On random subset generalization error bounds and the stochastic gradient Langevin dynamics algorithm,” in 2020 IEEE Information Theory Workshop (ITW). IEEE, 2021, pp. 1–5.
[12] R. Zhou, C. Tian, and T. Liu, “Individually conditional individual mutual information bound on generalization error,” in 2021 IEEE International Symposium on Information Theory (ISIT). IEEE, 2021, pp. 670–675.
[13] F. Hellström and G. Durisi, “Generalization error bounds via $m$th central moments of the information density,” in 2020 IEEE International Symposium on Information Theory (ISIT). IEEE, 2020, pp. 2741–2746.
[14] F. Hellström and G. Durisi, “Generalization bounds via information density and conditional information density,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 3, pp. 824–839, 2020.
[15] B. Rodríguez-Gálvez, G. Bassi, R. Thobaben, and M. Skoglund, “Tighter expected generalization error bounds via Wasserstein distance,” arXiv preprint arXiv:2101.09315, 2021.
[16] A. Pensia, V. Jog, and P.-L. Loh, “Generalization error bounds for noisy, iterative algorithms,” in 2018 IEEE International Symposium on Information Theory (ISIT). IEEE, 2018, pp. 546–550.
[17] J. Li, X. Luo, and M. Qiao, “On generalization error bounds of noisy gradient methods for non-convex learning,” in International Conference on Learning Representations, 2019.
[18] J. Negrea, M. Haghifam, G. K. Dziugaite, A. Khisti, and D. M. Roy, “Information-theoretic generalization bounds for SGLD via data-dependent estimates,” in Advances in Neural Information Processing Systems, 2019, pp. 11015–11025.
[19] M. Talagrand, The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer Science & Business Media, 2006.
[20] E. Tuncel and K. Rose, “Additive successive refinement,” IEEE Transactions on Information Theory, vol. 49, no. 8, pp. 1983–1991, 2003.
[21] L. Lastras and T. Berger, “All sources are nearly successively refinable,” IEEE Transactions on Information Theory, vol. 47, no. 3, pp. 918–926, 2001.
[22] Y. Wu, “Lecture notes on information-theoretic methods for high-dimensional statistics,” Lecture Notes for ECE598YW (UIUC), vol. 16, 2017.
[23] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[24] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018, vol. 47.
Appendix A Proof of Theorem 3
To prove the theorem, we start by writing
Because is a stochastic chain for , we have and , and it follows that
By the Donsker–Varadhan variational representation of the KL divergence, the expectation of a function with respect to the measure defined on can be bounded as
where is another measure on .
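The variational inequality used in this step can be sanity-checked numerically on a finite alphabet. In the form assumed here, the Donsker–Varadhan representation gives $\mathbb{E}_P[f]\le D(P\|Q)+\log\mathbb{E}_Q[e^{f}]$ for every function $f$, with equality at $f=\log(P/Q)$. The distributions and helper names below are illustrative assumptions.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(P||Q) in nats for strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def dv_gap(p, q, f):
    """D(P||Q) + log E_Q[exp f] - E_P[f], which Donsker-Varadhan says is >= 0."""
    log_mgf = float(np.log(np.sum(q * np.exp(f))))
    return kl(p, q) + log_mgf - float(np.sum(p * f))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))  # arbitrary distribution P on six points
q = rng.dirichlet(np.ones(6))  # arbitrary reference distribution Q

# The gap is nonnegative for arbitrary test functions f ...
gaps = [dv_gap(p, q, rng.normal(size=6)) for _ in range(1000)]
# ... and vanishes at the maximizer f = log(P/Q).
tight = dv_gap(p, q, np.log(p / q))
print(min(gaps), tight)
```

Applying the inequality to a scaled increment of the process and then optimizing over the scale is what turns a moment generating function bound into a bound involving the square root of a KL divergence.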
In our setting, let , , and , then we have
where the second inequality is because the process is sub-Gaussian on .
The fact that is a stochastic chain also implies that , and thus
Denote as an independent copy of such that and are independent and have the same distribution. is the distribution of conditioned on . By the data processing inequality for the KL divergence, we have
from which the second inequality follows.
Let us now consider the mutual information based bound. It is seen that
Combining with (33), we arrive at
which concludes the proof. ∎
Appendix B A chaining bound in a more general form
In this section we provide a more general bound without the assumption on sub-Gaussianity, and replace it with a more general form on the measure concentration.
Let be a real-valued random variable. The cumulant generating function of is for .
If exists, then and , and it is convex.
For a convex function defined on the interval , where , its Legendre dual is defined as
Assume that , then is a non-negative convex and non-decreasing function on with . Moreover, its inverse function is concave, and can be written as
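For the sub-Gaussian case these objects have closed forms that can be checked on a grid. The sketch assumes $\psi(\lambda)=\sigma^2\lambda^2/2$, for which the Legendre dual is $\psi^*(x)=x^2/(2\sigma^2)$ on $x\ge 0$, and it assumes the standard infimum form of the inverse, $\psi^{*-1}(y)=\inf_{\lambda>0}\,(y+\psi(\lambda))/\lambda$, which then equals $\sqrt{2\sigma^2 y}$. The grid, the value of $\sigma$, and the function names are illustrative.

```python
import numpy as np

SIGMA = 1.5
# Grid standing in for lambda in (0, b); the upper limit 20 is an assumed cutoff.
LAMBDAS = np.linspace(1e-4, 20.0, 200_000)

def psi(lam):
    """Sub-Gaussian cumulant bound psi(lambda) = sigma^2 * lambda^2 / 2."""
    return 0.5 * (SIGMA * lam) ** 2

def legendre_dual(x):
    """psi*(x) = sup over lambda of (lambda * x - psi(lambda)), evaluated on the grid."""
    return float(np.max(LAMBDAS * x - psi(LAMBDAS)))

def dual_inverse(y):
    """Assumed form of the inverse: inf over lambda of (y + psi(lambda)) / lambda."""
    return float(np.min((y + psi(LAMBDAS)) / LAMBDAS))

x, y = 2.0, 0.8
print(legendre_dual(x), x ** 2 / (2 * SIGMA ** 2))
print(dual_inverse(y), np.sqrt(2 * SIGMA ** 2 * y))
```

With this choice of $\psi$, the inverse dual recovers the familiar $\sqrt{2\sigma^2 y}$ factor that multiplies the square root of an information term in sub-Gaussian bounds.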
Assume is a random process defined on the metric space , with for all and
where is convex and . Let be a stochastic chain for the random process and random variable pair , then