1 Introduction
When designing a learning algorithm, one fundamental goal is to make it perform well on unseen test data with access to only a finite number of training examples. Formally, we hope that the population risk (i.e., test error) of the learning algorithm is as small as possible. Unfortunately, minimizing the population risk directly is computationally intractable, because the underlying distribution of the data is unknown. Therefore, we decompose the population risk into the sum of the empirical risk (i.e., training error) and the generalization error. The empirical risk measures the extent to which the learning algorithm is consistent with the empirical evidence (i.e., data fitting), while the generalization error quantifies how well the empirical risk serves as an estimate of the population risk (i.e., generalization). Thus, one can obtain a hypothesis with small population risk by minimizing both the empirical risk and the generalization error. Minimizing the empirical risk alone can be realized through empirical risk minimization (Vapnik 1999) or its alternatives, for example, stochastic approximation (Kushner and Yin 2003). However, as the generalization error cannot be minimized directly, it is common practice to derive a generalization upper bound analytically so that we can study the conditions under which it is guaranteed to be small.

In this paper, we analyze the generalization error of a learning algorithm from an optimal transport perspective. Specifically, our contributions lie in four aspects:

We derive an optimal-transport type of generalization error bound for learning algorithms with Lipschitz continuous loss functions. This result does not impose any constraint, for example, a subgaussian assumption, on the distribution of the input data or the model parameters, and it applies to unbounded loss functions.

The bound we derive can be related to other probability metrics, e.g., total variation distance, relative entropy, and some notions in classical learning theory, for example, the VC dimension. Therefore, our theory bridges optimal transport theory and information theory with statistical learning theory.

Some other generalization error bounds with different probability metrics are also derived under different assumptions on the loss function. For example, for a learning algorithm with a bounded loss function, a total-variation type generalization error bound can be derived. Via inequalities between probability metrics, the total-variation type generalization error bound can further be expressed in terms of other probability metrics, such as the Hellinger distance and the $\chi^2$ distance.

Under our established framework, we are able to analyze the generalization error in deep learning by exploiting the contraction property of $f$-divergences and the hierarchical structure of DNNs, and conclude that the generalization error in DNNs decreases exponentially to zero as the number of layers increases.
The rest of this paper is structured as follows. In Section 2, some related works are introduced. Section 3 formalizes the research problem and gives some basic definitions. In Section 4, we derive our main theorem, which studies the generalization error of a learning algorithm from an optimal transport perspective. Section 5 further relates the main theorem to some notions in classical learning theory, for instance the VC dimension, and derives generalization upper bounds w.r.t. other probability metrics under different conditions on the loss function, which extends the main theorem. In Section 6, we analyze the generalization error in deep learning under our established framework and conclude that the key to the non-overfitting puzzle in a DNN is its hierarchical structure.
2 Related Works
Our work is related to several different research topics about algorithmic or theoretic aspects of machine learning, summarized below.
2.1 Generalization in Classical Statistical Learning Theory
We note that there exist extensive studies on generalization in classical learning theory. Hence, the references listed here are far from exhaustive.
One central goal in statistical learning theory is to study the conditions under which a learning algorithm can generalize. Mathematically, it requires that the generalization error converge asymptotically to zero as the number of training examples goes to infinity. Traditional ways of characterizing the generalization capability of a learning algorithm mainly rely on the complexity of the hypothesis class, e.g., VC dimension, Rademacher and Gaussian complexities, and covering numbers (Vapnik 2013, Bartlett and Mendelson 2002, Bartlett et al. 2005, Zhou 2002, Zhang 2002), or on algorithmic properties of the learning algorithm itself, e.g., uniform stability, robustness, and algorithmic luckiness (Liu et al. 2017, Shalev-Shwartz et al. 2010, Bousquet and Elisseeff 2002, Xu and Mannor 2012, Herbrich and Williamson 2002). These classical learning theory based approaches consider the worst-case generalization error over all functions in the hypothesis class and have successfully explained some prevalent learning models, such as Support Vector Machines (SVMs) for binary classification (Vapnik 2013). Some other approaches have also been proposed to analyze generalization in machine learning, such as the PAC-Bayesian approach, Occam's razor bounds, and the sample compression approach (Langford 2005, Ambroladze et al. 2007). It is worth mentioning that some works show that these approaches are tightly connected. For example, Liu et al. 2017 prove that higher algorithmic stability implies smaller hypothesis complexity, and Rivasplata et al. 2018 analyze PAC-Bayesian bounds for stable learning algorithms. Nevertheless, these approaches are insufficient to explain the generalization of learning models with large hypothesis spaces, such as deep neural networks (Zhang et al. 2016). Therefore, it is necessary to find a valid approach that can explain why deep learning is attractive in terms of its generalization properties.
2.2 Information Theoretic Learning and Generalization
Recently, inspired by the observation in learning theory that learning and information compression are closely related, in the sense that both tasks involve identifying patterns or regularities in the training data in order to reconstruct or re-identify future data with similar patterns, several information-theoretic approaches to analyzing the learnability and generalization capability of learning algorithms have been studied (Nachum et al. 2018, Bassily et al. 2018, Alabdulmohsin 2018, Xu and Raginsky 2017, Alabdulmohsin 2017, Russo and Zou 2016, Bassily et al. 2016, Raginsky et al. 2016, Alabdulmohsin 2015, Dwork et al. 2015, Wang et al. 2016). Specifically, Russo and Zou 2016 show that the mutual information between the collection of empirical risks and the final output hypothesis can be used to bound the generalization error for learning algorithms with a finite hypothesis class. Xu and Raginsky 2017 then extend this result to the case of uncountably infinite hypothesis spaces and bound the generalization error by the mutual information between the input training set and the output hypothesis, providing a clearer explanation that when the output hypothesis relies less on the training data, the generalization will be better. However, both of these results require that the loss function be subgaussian, so they may not apply when the data distribution is heavy-tailed. Alabdulmohsin 2015 also provides an analysis of stability and generalization from an information-theoretic perspective and shows that the generalization of a learning algorithm can be controlled by an information-theoretic notion of algorithmic stability, also defined as the mutual information between the input training data and the output hypothesis, but with total variation instead of relative entropy as the underlying metric. However, the analysis of Alabdulmohsin 2015 is restricted to bounded loss functions and countable instance and hypothesis spaces.
In a similar information-theoretic spirit, other information-theoretic notions of stability have been proposed: Raginsky et al. 2016 propose several information-theoretic notions of stability, such as erasure mutual information, an information-theoretic analogue of the replace-one stability proposed in Shalev-Shwartz et al. 2010; Dwork et al. 2015, Wang et al. 2016, and Bassily et al. 2016 propose notions of stability based on the KL divergence, which often arises in the problem of differential privacy.
2.3 Optimal Transport in Machine Learning
Optimal transport provides a powerful, flexible, and geometric way to compare probability measures, by measuring the Wasserstein distance between two probability distributions (Peyré et al. 2017). Optimal transport has recently drawn attention from the machine learning community, because it is capable of tackling some challenging learning scenarios, such as generative learning (Arjovsky et al. 2017, Gulrajani et al. 2017), transfer learning (Courty et al. 2017), and distributionally robust optimization (Lee and Raginsky 2017, Gao and Kleywegt 2016). When comparing probability distributions, the main advantage of the Wasserstein metric over other common probability metrics (such as relative entropy and total variation distance) lies in its good convergence properties even when the supports of the two probability measures have a negligible intersection (see the example in Arjovsky et al. 2017). To our knowledge, studies characterizing the generalization capability of a learning algorithm from an optimal transport perspective are quite limited.
3 Problem Setup
We consider the traditional paradigm of statistical learning, where there is an instance space $\mathcal{Z}$, a hypothesis space $\mathcal{W}$, and a nonnegative loss function $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}^+$. The training sample of size $n$ is denoted by $S = \{z_1, \dots, z_n\}$, where each element $z_i$ is drawn i.i.d. from an unknown underlying distribution $\mu$. A learning algorithm can be viewed as a (possibly randomized) mapping from the training sample space $\mathcal{Z}^n$ to the hypothesis space $\mathcal{W}$. Following the settings above, the learning algorithm can be uniquely characterized by a Markov kernel $P_{W|S}$, which means that given the training sample $S$, the learning algorithm picks the output hypothesis $W \in \mathcal{W}$ according to the conditional distribution $P_{W|S}$. Note that when $P_{W|S}$ degenerates to a Dirac delta distribution, the mapping becomes deterministic, which matches the traditional setting of statistical learning, such as Support Vector Machines and linear regression. For any hypothesis $w \in \mathcal{W}$, the population risk is defined as

$$R_\mu(w) \triangleq \mathbb{E}_{z \sim \mu}\left[ \ell(w, z) \right]. \quad (3.1)$$
The population risk is the performance measure of the hypothesis $w$. Therefore, the goal of learning is to find a hypothesis with small population risk, either with high probability or in expectation, under the data-generating distribution $\mu$. However, as the underlying distribution $\mu$ is unknown, the learning algorithm cannot compute and minimize the population risk directly. Instead, it can compute the empirical risk of $w$ on the training sample $S$ as a proxy, which is defined by

$$R_S(w) \triangleq \frac{1}{n} \sum_{i=1}^{n} \ell(w, z_i) = \mathbb{E}_{z \sim \mu_S}\left[ \ell(w, z) \right], \quad (3.2)$$
where $\mu_S$ is the empirical distribution of the training examples. To evaluate how well the empirical risk serves as an estimate of the population risk, one defines the generalization error as the difference between the population risk and the empirical risk:

$$\mathrm{gen}(w, S) \triangleq R_\mu(w) - R_S(w). \quad (3.3)$$
A small generalization error implies that the empirical risk on the training sample $S$ is a good estimate of the population risk on the underlying unknown distribution $\mu$. In this paper, we are interested in the expected generalization error, which is defined as

$$\overline{\mathrm{gen}}(\mu, P_{W|S}) \triangleq \mathbb{E}_{S, W}\left[ R_\mu(W) - R_S(W) \right], \quad (3.4)$$
where the expectation is taken over the joint distribution of the training sample $S$ and the hypothesis $W$ (i.e., $P_{S, W} = \mu^{\otimes n} \otimes P_{W|S}$). As discussed earlier, the goal of learning is to make the population risk as small as possible, either in expectation or with high probability. In this paper, we are interested in the expected population risk $\mathbb{E}_{S, W}\left[ R_\mu(W) \right]$. We then have the following decomposition,
$$\mathbb{E}_{S, W}\left[ R_\mu(W) \right] = \overline{\mathrm{gen}}(\mu, P_{W|S}) + \mathbb{E}_{S, W}\left[ R_S(W) \right], \quad (3.5)$$
where the first term on the right-hand side controls generalization and the second term measures data fitting. To minimize the expected population risk, we need to minimize both terms. Often, when the training error is small, the learning algorithm tends to fit the training data too well and thus generalizes poorly to unseen test data; when the training error is large, the learning algorithm tends to be insensitive to the training data and therefore has better generalization capability towards test data. Thus, a learning algorithm faces a tradeoff between minimizing the empirical risk (i.e., data fitting) and generalization, which is also known as the bias-variance tradeoff (Mohri et al. 2012) in statistical learning. In what follows, we show how the generalization error can be related to optimal transport.
4 Generalization Guarantees via Algorithmic Transport Cost
We assume that the hypothesis space $\mathcal{W}$ is a Polish space (i.e., a complete separable metric space) with metric $\varrho$ and denote by $\mathcal{P}_p(\mathcal{W})$ the space of all Borel probability measures on $\mathcal{W}$ with finite $p$-th ($p \ge 1$) moments. We can define a family of metrics on $\mathcal{P}_p(\mathcal{W})$ with respect to the metric structure on $\mathcal{W}$.
Definition 1 (Wasserstein Distance).
For any $p \ge 1$, the Wasserstein distance of order $p$ between two probability measures $P, Q \in \mathcal{P}_p(\mathcal{W})$ is defined as

$$W_p(P, Q) \triangleq \left( \inf_{\pi \in \Gamma(P, Q)} \int_{\mathcal{W} \times \mathcal{W}} \varrho(w, w')^p \, \mathrm{d}\pi(w, w') \right)^{1/p}, \quad (4.1)$$

where $\Gamma(P, Q)$ denotes the collection of all measures on $\mathcal{W} \times \mathcal{W}$ with marginals $P$ and $Q$ on the first and second factors, respectively. The set $\Gamma(P, Q)$ is also called the set of all couplings of $P$ and $Q$.
The Wasserstein distance is often used in the problem of optimal transport, and $W_1$ is also called the earth mover's distance. Intuitively, for any coupling $\pi$ of $P$ and $Q$, if each distribution is viewed as a unit amount of "dirt" piled on $\mathcal{W}$, the conditional distribution $\pi(w' \mid w)$ can be seen as a randomized policy for moving a unit quantity of dirt from a random location $w$ to another location $w'$. If the cost of moving a unit amount of dirt from $w$ to $w'$ is $\varrho(w, w')^p$, then the optimal cost is identical to $W_p(P, Q)^p$.
In the proof of our main result, we will adopt the Kantorovich dual representation of the Wasserstein metric. It is worth mentioning that, as $W_p(P, Q)$ is monotonically increasing in $p$, any upper bound stated in terms of $W_1$ remains valid when $W_1$ is replaced by $W_p$ for any $p \ge 1$.
Lemma 1 (Villani 2008).
Let $P, Q$ be two probability measures defined on the same metric space $(\mathcal{W}, \varrho)$, i.e., $P, Q \in \mathcal{P}_1(\mathcal{W})$. Then the Wasserstein distance of order 1 between $P$ and $Q$ can be represented as

$$W_1(P, Q) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \left| \int_{\mathcal{W}} f \, \mathrm{d}P - \int_{\mathcal{W}} f \, \mathrm{d}Q \right|, \quad (4.2)$$

where $\|f\|_{\mathrm{Lip}}$ denotes the Lipschitz constant of $f$.
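As a quick numerical illustration of the dual representation, the following sketch (ours, not from the paper) computes the empirical 1-Wasserstein distance between two one-dimensional samples in two equivalent ways: via matched order statistics (the optimal coupling on the real line) and via the integral of the absolute difference of empirical CDFs.

```python
# Illustrative sketch (not from the paper): empirical W1 between two
# 1-D samples. For equal sample sizes, the optimal coupling matches
# order statistics, so W1 = mean |x_(i) - y_(i)|; this agrees with the
# CDF formula W1 = integral |F_x(t) - F_y(t)| dt.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)   # samples from P
y = rng.normal(0.5, 1.0, size=5000)   # samples from Q (shifted by 0.5)

w1_order = np.mean(np.abs(np.sort(x) - np.sort(y)))

# Empirical CDFs evaluated on the pooled sample, integrated piecewise.
pts = np.sort(np.concatenate([x, y]))
Fx = np.searchsorted(np.sort(x), pts, side="right") / len(x)
Fy = np.searchsorted(np.sort(y), pts, side="right") / len(y)
w1_cdf = np.sum(np.abs(Fx[:-1] - Fy[:-1]) * np.diff(pts))

print(w1_order, w1_cdf)  # both close to the true distance 0.5
```

On the real line both computations agree with the Kantorovich dual above, since the supremum over 1-Lipschitz functions is attained by a potential whose derivative is the sign of $F_P - F_Q$.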
We then define the notion of algorithmic transport cost for a learning algorithm.
Definition 2 (algorithmic transport cost).
For a learning algorithm with training set $S$, data-generating distribution $\mu$, and Markov kernel $P_{W|S}$, the algorithmic transport cost is defined as

$$C(\mu, P_{W|S}) \triangleq \mathbb{E}_{z \sim \mu}\left[ W_1(P_W, P_{W|Z=z}) \right], \quad (4.3)$$

where $Z$ denotes a single training example chosen uniformly at random from $S$ (so that, marginally, $Z \sim \mu$) and $P_W$ is the marginal distribution of the output hypothesis.
Remark 1. The notion of algorithmic transport cost is defined as the average 1-Wasserstein distance between $P_W$ and $P_{W|Z=z}$, where $z$ is one input example. Intuitively, when one input example contributes less to determining $W$, the cost of moving probability mass from $P_W$ to $P_{W|Z=z}$ becomes smaller, and thus the generalization will be better. An extreme case is that when $W$ is independent of $S$, the generalization error becomes zero, since the output hypothesis has no correlation with the input and therefore performs the same on both training data and test data.
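The intuition in Remark 1 can be checked numerically. The sketch below is our own construction (the learner, sample sizes, and Monte Carlo settings are illustrative assumptions, not from the paper): it takes the sample mean of $n$ standard Gaussian examples as the output "hypothesis" and estimates the average 1-Wasserstein distance between the marginal distribution of the output and its distribution conditioned on one fixed example. The cost shrinks as $n$ grows, since any single example then matters less.

```python
# Toy Monte Carlo estimate of the algorithmic transport cost for the
# sample-mean "learner" W = (z_1 + ... + z_n)/n with z_i ~ N(0,1).
# We average the empirical W1 between samples of W (marginal) and
# samples of W conditioned on the first example being a fixed z.
import numpy as np

rng = np.random.default_rng(1)

def w1_sorted(a, b):
    # empirical 1-Wasserstein distance via order statistics
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def transport_cost(n, n_z=40, m=4000):
    costs = []
    for _ in range(n_z):
        z = rng.normal()                               # conditioned example
        w_marg = rng.normal(size=(m, n)).mean(axis=1)  # W ~ P_W
        rest = rng.normal(size=(m, n - 1)).sum(axis=1)
        w_cond = (z + rest) / n                        # W ~ P_{W|Z1=z}
        costs.append(w1_sorted(w_marg, w_cond))
    return float(np.mean(costs))

c10, c100 = transport_cost(10), transport_cost(100)
print(c10, c100)  # the estimated cost decreases as n grows
```

For this learner the conditional law only shifts by roughly $z/n$, so the estimated cost (and, by the theory below, the generalization bound) decays on the order of $1/n$.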
Following the definitions and the lemma above, we have our main theorem, which bounds the generalization error of a learning algorithm via algorithmic transport cost.
Theorem 1.
Assume that the function $\ell(\cdot, z)$ is $\rho$-Lipschitz continuous for any $z \in \mathcal{Z}$:

$$|\ell(w, z) - \ell(w', z)| \le \rho\, \varrho(w, w'), \quad \forall\, w, w' \in \mathcal{W}. \quad (4.4)$$

The expected generalization error of a learning algorithm $P_{W|S}$ can be upper bounded by

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le \rho\, C(\mu, P_{W|S}). \quad (4.5)$$
Proof.
Let $P_{W|S}$ be a learning algorithm that has access to a finite set of training examples $S = \{z_1, \dots, z_n\}$, where each element is drawn i.i.d. from an unknown underlying distribution $\mu$. Let $W$ be the random variable that stands for the hypothesis output by the algorithm through the Markov kernel $P_{W|S}$, and let $Z$ be a single random training example chosen uniformly at random from $S$ (so that, marginally, $Z \sim \mu$). By definition, we have

$$\overline{\mathrm{gen}}(\mu, P_{W|S}) = \mathbb{E}_{z \sim \mu}\, \mathbb{E}_{W \sim P_W}\left[ \ell(W, z) \right] - \mathbb{E}_{(Z, W)}\left[ \ell(W, Z) \right]. \quad (4.6)$$

As $Z$ is a random training example chosen uniformly at random from the training set $S$, we have the following relationship between the output hypothesis $W$, the training set $S$, and the random training example $Z$:

$$Z \to S \to W, \quad (4.7)$$

which means that $W$ and $Z$ are conditionally independent given $S$. Therefore, by marginalization, we have

$$P_{W|Z} = \mathbb{E}_{S|Z}\left[ P_{W|S} \right]. \quad (4.8)$$

Substituting (4.8) into (4.6) yields

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| = \left| \mathbb{E}_{z \sim \mu}\left[ \mathbb{E}_{W \sim P_W} \ell(W, z) - \mathbb{E}_{W \sim P_{W|Z=z}} \ell(W, z) \right] \right| \le \rho\, \mathbb{E}_{z \sim \mu}\left[ W_1(P_W, P_{W|Z=z}) \right] = \rho\, C(\mu, P_{W|S}), \quad (4.9)$$

where the inequality follows from Lemma 1 applied to the $\rho$-Lipschitz function $\ell(\cdot, z)$, which completes the proof. ∎
Remark 2. The derivation of this generalization error bound only requires that the loss function be Lipschitz continuous w.r.t. its first argument $w$. The Lipschitz condition can instead be imposed on the second argument $z$, or on both arguments jointly. Similar results then hold after slightly modifying the definition of the algorithmic transport cost.
5 Upper Bound via Probability Metrics
In the previous section, we derived an upper bound on the expected generalization error of a learning algorithm in terms of the algorithmic transport cost, which is defined as the expected Wasserstein distance between the distribution of the output hypothesis and its distribution conditioned on a single input example. In this section, we show that the generalization error can also be bounded by other probability metrics via inequalities between probability metrics.
5.1 Generalization Bound with Total Variation Distance
In this subsection, we show that the generalization error can be further upper bounded by the total variation distance between distributions relating the input data and the output hypothesis. First, we give the definition of the total variation distance along with its dual and coupling representations.
Definition 3 (Total Variation).
Let $P, Q$ be two probability measures defined on the same metric space $(\mathcal{W}, \varrho)$. The total variation distance between $P$ and $Q$ is defined by

$$TV(P, Q) \triangleq \sup_{A} |P(A) - Q(A)|, \quad (5.1)$$

where the supremum is over all Borel measurable sets $A \subseteq \mathcal{W}$.
Lemma 1 (Dual Representation of Total Variation Distance).
Let $P, Q$ be two probability measures defined on the same metric space $(\mathcal{W}, \varrho)$. The total variation distance between $P$ and $Q$ can be represented as

$$TV(P, Q) = \frac{1}{2} \sup_{\|f\|_\infty \le 1} \left| \int_{\mathcal{W}} f \, \mathrm{d}P - \int_{\mathcal{W}} f \, \mathrm{d}Q \right|. \quad (5.2)$$
Lemma 2 (Coupling Characterization of Total Variation Distance).
Let $P, Q$ be two probability measures defined on the same metric space $(\mathcal{W}, \varrho)$. The total variation distance between $P$ and $Q$ has the coupling characterization

$$TV(P, Q) = \min_{\pi \in \Gamma(P, Q)} \Pr_{(X, Y) \sim \pi}\left[ X \ne Y \right]. \quad (5.3)$$
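For discrete distributions the three characterizations of total variation coincide and are easy to verify directly. The following sketch is ours (it fixes the normalization $TV = \sup_A |P(A) - Q(A)|$ used in Definition 3) and checks them on a small example.

```python
# Three faces of total variation on a 3-point space:
# (a) sup_A |P(A)-Q(A)| = 0.5 * ||p - q||_1,
# (b) the dual form 0.5 * sup_{|f|<=1} |E_P f - E_Q f|, whose optimal
#     witness is f = sign(p - q),
# (c) the coupling form min P(X != Y), attained by the maximal
#     coupling that keeps the overlapping mass min(p, q) on the diagonal.
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

tv_sets = 0.5 * np.abs(p - q).sum()
f = np.sign(p - q)                          # optimal dual witness
tv_dual = 0.5 * abs(np.dot(f, p) - np.dot(f, q))
tv_coupling = 1.0 - np.minimum(p, q).sum()  # P(X != Y) under maximal coupling

print(tv_sets, tv_dual, tv_coupling)  # all three equal 0.3
```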
With the above definition and lemmas, we derive a generalization upper bound w.r.t. total variation distance.
Theorem 1.
Assume that the hypothesis space $\mathcal{W}$ is bounded, i.e.,

$$\mathrm{diam}(\mathcal{W}) \triangleq \sup_{w, w' \in \mathcal{W}} \varrho(w, w') \le D, \quad (5.4)$$

and that the function $\ell(\cdot, z)$ is $\rho$-Lipschitz continuous for any $z \in \mathcal{Z}$. The expected generalization error of a learning algorithm $P_{W|S}$ can be upper bounded by

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le \rho D\, \mathbb{E}_{z \sim \mu}\left[ TV(P_W, P_{W|Z=z}) \right]. \quad (5.5)$$
The above theorem requires that the loss function be Lipschitz continuous and the hypothesis space be bounded. When the loss function is bounded, we can also upper bound the generalization error via the total variation distance by exploiting its dual representation. This result is an extension of Alabdulmohsin 2015, which requires that both the instance space and the hypothesis space be countably finite.
Theorem 2.
Assume that the loss function is bounded, i.e.,

$$0 \le \ell(w, z) \le M, \quad \forall\, w \in \mathcal{W},\, z \in \mathcal{Z}. \quad (5.10)$$

The expected generalization error of a learning algorithm $P_{W|S}$ can be upper bounded by

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le M\, \mathbb{E}_{z \sim \mu}\left[ TV(P_W, P_{W|Z=z}) \right]. \quad (5.11)$$
5.2 Generalization Bound with Mutual Information
Following the results in the previous subsection, we can further bound the generalization error via the mutual information between the input training sample and the output hypothesis. The result in this subsection is complementary to that of Xu and Raginsky 2017, in the sense that we do not require the loss function to be subgaussian.
Definition 4 (Relative Entropy).
Let $P, Q$ be two probability measures defined on the same metric space $(\mathcal{W}, \varrho)$. The relative entropy between $P$ and $Q$ is defined by

$$D(P \,\|\, Q) \triangleq \int_{\mathcal{W}} \log\!\left( \frac{\mathrm{d}P}{\mathrm{d}Q} \right) \mathrm{d}P \quad (5.14)$$

if $P$ is absolutely continuous with respect to $Q$, and $D(P \,\|\, Q) \triangleq +\infty$ otherwise.
Definition 5 (Mutual Information).
Let $X, Y$ be two random variables with joint distribution $P_{X, Y}$. The mutual information between $X$ and $Y$ is defined by

$$I(X; Y) \triangleq D(P_{X, Y} \,\|\, P_X \otimes P_Y), \quad (5.15)$$

where $P_{X, Y}$ denotes the joint distribution of $(X, Y)$ and $P_X, P_Y$ are the corresponding marginal distributions.
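For discrete random variables, Definition 5 can be evaluated directly. The sketch below is ours and computes $I(X; Y)$ as the relative entropy between a small joint distribution and the product of its marginals.

```python
# Mutual information of a discrete pair, I(X;Y) = D(P_XY || P_X x P_Y),
# computed in nats.
import numpy as np

pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])                 # joint distribution of (X, Y)
px = pxy.sum(axis=1, keepdims=True)          # marginal of X
py = pxy.sum(axis=0, keepdims=True)          # marginal of Y

mask = pxy > 0                               # skip zero-probability cells
mi = float(np.sum(pxy[mask] * np.log(pxy[mask] / (px * py)[mask])))
print(mi)  # positive; it would be 0 if X and Y were independent
```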
Theorem 3.
Assume that the hypothesis space $\mathcal{W}$ is bounded, i.e.,

$$\mathrm{diam}(\mathcal{W}) \le D, \quad (5.16)$$

and that the function $\ell(\cdot, z)$ is $\rho$-Lipschitz continuous for any $z \in \mathcal{Z}$. The expected generalization error of a learning algorithm $P_{W|S}$ can be upper bounded by

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le \rho D \sqrt{\tfrac{1}{2}\, I(Z; W)}, \quad (5.17)$$

where $Z$ is a single training example chosen uniformly at random from $S$.
5.3 Generalization Bound with Bounded Lipschitz Distance
Under the assumption that the loss function is both bounded and Lipschitz continuous, we derive a generalization error bound with respect to the bounded Lipschitz distance.
Definition 6 (Bounded Lipschitz Distance).
Let $P, Q$ be two probability measures defined on the same metric space $(\mathcal{W}, \varrho)$. The bounded Lipschitz distance between $P$ and $Q$ is defined by

$$d_{BL}(P, Q) \triangleq \sup_{\|f\|_{BL} \le 1} \left| \int_{\mathcal{W}} f \, \mathrm{d}P - \int_{\mathcal{W}} f \, \mathrm{d}Q \right|, \quad (5.21)$$

where the bounded-Lipschitz norm of $f$ is defined by

$$\|f\|_{BL} \triangleq \|f\|_\infty + \|f\|_{\mathrm{Lip}}. \quad (5.22)$$
Theorem 4.
Assume that the loss function is bounded and Lipschitz continuous with respect to $w$ for any $z \in \mathcal{Z}$, i.e., there exist $M, \rho > 0$ such that

$$\sup_{w, z} |\ell(w, z)| \le M \quad \text{and} \quad |\ell(w, z) - \ell(w', z)| \le \rho\, \varrho(w, w'), \quad \forall\, w, w' \in \mathcal{W}. \quad (5.23)$$

The expected generalization error of a learning algorithm $P_{W|S}$ can be upper bounded by

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le (M + \rho)\, \mathbb{E}_{z \sim \mu}\left[ d_{BL}(P_W, P_{W|Z=z}) \right]. \quad (5.24)$$
5.4 Relationships to the VC Dimension
We have derived several generalization upper bounds with respect to several probability metrics, such as total variation, the Wasserstein distance, and relative entropy. In this subsection, we show how these results can be related to some traditional notions, for example the VC dimension, that characterize learnability and generalization capability in statistical learning.
The notion of VC dimension arises in the problem of binary classification. In this setting, the instance space is $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ denotes the feature space and $\mathcal{Y} = \{0, 1\}$ denotes the label space. The training set is $S = \{(x_i, y_i)\}_{i=1}^n$. The hypothesis space $\mathcal{W}$ is a class of functions mapping $\mathcal{X}$ to $\mathcal{Y}$. For any integer $n \ge 1$, we present the definition of the growth function as in Mohri et al. 2012.
Definition 7 (Growth Function).
The growth function of a function class $\mathcal{W}$ is defined as

$$\Pi_{\mathcal{W}}(n) \triangleq \max_{x_1, \dots, x_n \in \mathcal{X}} \left| \left\{ \left( w(x_1), \dots, w(x_n) \right) : w \in \mathcal{W} \right\} \right|. \quad (5.27)$$
In the following theorem, we show how the notion of VC dimension can be related to our previous generalization bounds with probability metrics.
Theorem 5.
If $\mathcal{W}$ has finite VC dimension $d$, the expected generalization error of a learning algorithm for binary classification can be upper bounded by

$$\overline{\mathrm{gen}}(\mu, P_{W|S}) \le \sqrt{\frac{d \log(en/d)}{2n}}, \quad (5.28)$$

where $e$ denotes the base of the natural logarithm.
Proof.
In the setting of binary classification with hypothesis space $\mathcal{W}$, the loss is the 0-1 loss,

$$\ell(w, (x, y)) = \mathbf{1}\left[ w(x) \ne y \right]. \quad (5.29)$$

Without loss of generality, we let $M = 1$. Therefore, by Theorem 2, we have

$$\overline{\mathrm{gen}}(\mu, P_{W|S}) \le \mathbb{E}_{z \sim \mu}\left[ TV(P_W, P_{W|Z=z}) \right]. \quad (5.30)$$

Using Pinsker's inequality and Equation (5.20), it follows that

$$\overline{\mathrm{gen}}(\mu, P_{W|S}) \le \sqrt{\frac{I(S; W)}{2n}}. \quad (5.31)$$

Denote the collection of empirical risks of the hypotheses in $\mathcal{W}$ on $S$ by

$$\Lambda_S \triangleq \left( R_S(w) \right)_{w \in \mathcal{W}}. \quad (5.32)$$

As the output hypothesis $W$ only depends on the empirical risks on the training data $S$, the following equality holds:

$$P_{W|S} = P_{W|\Lambda_S}. \quad (5.33)$$

Therefore, we have

$$I(S; W) \le I(\Lambda_S; W) \le H(\Lambda_S) \le \log \Pi_{\mathcal{W}}(n) \le d \log\frac{en}{d}, \quad (5.34)$$

where the last inequality is due to Sauer's lemma,

$$\Pi_{\mathcal{W}}(n) \le \left( \frac{en}{d} \right)^d \quad \text{for } n \ge d. \quad (5.35)$$
∎
5.5 Generalization Bounds with Other Probability Metrics
In previous subsections, we derived generalization bounds with different probability metrics under different assumptions (e.g., bounded, Lipschitz) on the loss function. By using the relationships between different probability metrics, in this subsection we further extend the previous results to the other probability metrics listed in Table 1. We first present precise definitions of these probability metrics.
Let $P, Q$ be two probability measures defined on the same metric space $(\mathcal{W}, \varrho)$. We have the following definitions.
Definition 8 (Prokhorov metric).
The Prokhorov metric between $P$ and $Q$ is defined by

$$d_P(P, Q) \triangleq \inf\left\{ \epsilon > 0 : P(A) \le Q(A^\epsilon) + \epsilon \ \text{for all Borel sets } A \right\}, \quad (5.36)$$

where $A^\epsilon \triangleq \{ w \in \mathcal{W} : \inf_{w' \in A} \varrho(w, w') \le \epsilon \}$ and the condition ranges over all Borel sets $A$.
Definition 9 (Hellinger distance).
The Hellinger distance between $P$ and $Q$ is defined by

$$H(P, Q) \triangleq \left( \int_{\mathcal{W}} \left( \sqrt{\frac{\mathrm{d}P}{\mathrm{d}\lambda}} - \sqrt{\frac{\mathrm{d}Q}{\mathrm{d}\lambda}} \right)^2 \mathrm{d}\lambda \right)^{1/2}, \quad (5.37)$$

where $\lambda$ is any measure dominating both $P$ and $Q$ (the value does not depend on the choice of $\lambda$).
Definition 10 ($\chi^2$ distance).
The $\chi^2$ distance between $P$ and $Q$ is defined by

$$\chi^2(P \,\|\, Q) \triangleq \int_{\mathcal{W}} \frac{\left( \mathrm{d}P/\mathrm{d}\lambda - \mathrm{d}Q/\mathrm{d}\lambda \right)^2}{\mathrm{d}Q/\mathrm{d}\lambda} \, \mathrm{d}\lambda, \quad (5.38)$$

where $\lambda$ is any measure dominating both $P$ and $Q$.
We present some relationships among different probability metrics in Table 1.
Lemma 3.
Let $(\mathcal{W}, \varrho)$ be any metric space and $P, Q$ two probability measures on $\mathcal{W}$. Then the following relationships hold.
(1) The Wasserstein and Prokhorov metrics satisfy

$$d_P(P, Q)^2 \le W_1(P, Q) \le \left( \mathrm{diam}(\mathcal{W}) + 1 \right) d_P(P, Q), \quad (5.39)$$

where $\mathrm{diam}(\mathcal{W})$ denotes the diameter of the probability space.
(2) The bounded Lipschitz distance and Wasserstein metric satisfy

$$d_{BL}(P, Q) \le W_1(P, Q). \quad (5.40)$$

(3) The bounded Lipschitz distance and total variation distance satisfy

$$d_{BL}(P, Q) \le 2\, TV(P, Q). \quad (5.41)$$

(4) The total variation distance and Hellinger distance satisfy

$$\frac{H(P, Q)^2}{2} \le TV(P, Q) \le H(P, Q). \quad (5.42)$$

(5) The relative entropy and $\chi^2$ distance satisfy

$$D(P \,\|\, Q) \le \log\left( 1 + \chi^2(P \,\|\, Q) \right). \quad (5.43)$$
Proof.
(1) This result is due to Theorem 2 in Gibbs and Su 2002.
(2) & (3) These two results are classical facts in probability theory, which can be proved by combining Definition 6 with the Kantorovich dual representation of the Wasserstein metric (Lemma 1 in Section 4) and the dual representation of the total variation distance (Lemma 1 in Section 5.1), respectively.
(4) See p. 35 in Le Cam 1969.
(5) This result is proved as Theorem 5 in Gibbs and Su 2002. ∎
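Two related inequalities — Pinsker's inequality (used in the proof of Theorem 5) and the bound of the relative entropy by the $\chi^2$ distance — can be spot-checked numerically on random discrete distributions. The sketch below is ours and uses the normalization $TV = \sup_A |P(A) - Q(A)|$ from Definition 3.

```python
# Numerical spot-check on random discrete distributions:
#   Pinsker:  TV(P,Q) <= sqrt(D(P||Q)/2)
#   chi^2:    D(P||Q) <= log(1 + chi^2(P||Q))
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))   # random distribution on 5 points
    q = rng.dirichlet(np.ones(5))
    tv = 0.5 * np.abs(p - q).sum()
    kl = float(np.sum(p * np.log(p / q)))
    chi2 = float(np.sum((p - q) ** 2 / q))
    assert tv <= np.sqrt(kl / 2) + 1e-12
    assert kl <= np.log1p(chi2) + 1e-12
print("both inequalities hold on all 1000 random pairs")
```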
The relationships are illustrated in Figure 1. Combining them with the results in previous subsections, we have the following generalization bounds for Lipschitz, bounded, and bounded-Lipschitz loss functions, respectively.
Corollary 1 (Generalization Bounds for Lipschitz Continuous Loss functions).
Assume that the hypothesis space $\mathcal{W}$ is bounded, i.e.,

$$\mathrm{diam}(\mathcal{W}) \le D, \quad (5.44)$$

and that the function $\ell(\cdot, z)$ is $\rho$-Lipschitz continuous for any $z \in \mathcal{Z}$. The following generalization bounds hold:
(1) Generalization bound by the Prokhorov metric:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le \rho (D + 1)\, \mathbb{E}_{z \sim \mu}\left[ d_P(P_W, P_{W|Z=z}) \right]. \quad (5.45)$$

(2) Generalization bound by the Hellinger distance:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le \rho D\, \mathbb{E}_{z \sim \mu}\left[ H(P_W, P_{W|Z=z}) \right]. \quad (5.46)$$

(3) Generalization bound by the $\chi^2$ distance:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le \rho D\, \mathbb{E}_{z \sim \mu}\left[ \sqrt{\tfrac{1}{2} \log\left( 1 + \chi^2(P_{W|Z=z} \,\|\, P_W) \right)} \right]. \quad (5.47)$$
Corollary 2 (Generalization Bounds for Bounded Loss functions).
Assume that the loss function is bounded, i.e.,

$$0 \le \ell(w, z) \le M, \quad \forall\, w \in \mathcal{W},\, z \in \mathcal{Z}. \quad (5.48)$$

The following generalization bounds hold:
(1) Generalization bound by mutual information:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le M \sqrt{\tfrac{1}{2}\, I(Z; W)}. \quad (5.49)$$

(2) Generalization bound by the Hellinger distance:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le M\, \mathbb{E}_{z \sim \mu}\left[ H(P_W, P_{W|Z=z}) \right]. \quad (5.50)$$

(3) Generalization bound by the $\chi^2$ distance:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le M\, \mathbb{E}_{z \sim \mu}\left[ \sqrt{\tfrac{1}{2} \log\left( 1 + \chi^2(P_{W|Z=z} \,\|\, P_W) \right)} \right]. \quad (5.51)$$
Corollary 3 (Generalization Bounds for Bounded and Lipschitz Continuous Loss functions).
Assume that the loss function is bounded and Lipschitz continuous with respect to $w$ for any $z \in \mathcal{Z}$, i.e., there exist $M, \rho > 0$ such that

$$\sup_{w, z} |\ell(w, z)| \le M \quad \text{and} \quad |\ell(w, z) - \ell(w', z)| \le \rho\, \varrho(w, w'), \quad \forall\, w, w' \in \mathcal{W}. \quad (5.52)$$

The following generalization bounds hold:
(1) Generalization bound by the algorithmic transport cost:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le \rho\, C(\mu, P_{W|S}). \quad (5.53)$$

(2) Generalization bound by the total variation distance:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le M\, \mathbb{E}_{z \sim \mu}\left[ TV(P_W, P_{W|Z=z}) \right]. \quad (5.54)$$

(3) Generalization bound by the Prokhorov metric:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le 2 (M + \rho)\, \mathbb{E}_{z \sim \mu}\left[ d_P(P_W, P_{W|Z=z}) \right]. \quad (5.55)$$

(4) Generalization bound by mutual information:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le M \sqrt{\tfrac{1}{2}\, I(Z; W)}. \quad (5.56)$$

(5) Generalization bound by the Hellinger distance:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le M\, \mathbb{E}_{z \sim \mu}\left[ H(P_W, P_{W|Z=z}) \right]. \quad (5.57)$$

(6) Generalization bound by the $\chi^2$ distance:

$$\left| \overline{\mathrm{gen}}(\mu, P_{W|S}) \right| \le M\, \mathbb{E}_{z \sim \mu}\left[ \sqrt{\tfrac{1}{2} \log\left( 1 + \chi^2(P_{W|Z=z} \,\|\, P_W) \right)} \right]. \quad (5.58)$$
Proof.
These bounds follow by combining Theorem 1, Theorem 2, and Theorem 4 with the relationships between probability metrics in Lemma 3. ∎
Table 1: Abbreviations for the probability metrics used in this paper.

Abbreviation | Metric
$W_p$ | Wasserstein metric
$D(\cdot \| \cdot)$ | Relative entropy
$TV$ | Total variation distance
$d_{BL}$ | Bounded Lipschitz distance
$d_P$ | Prokhorov metric
$H$ | Hellinger distance
$\chi^2$ | $\chi^2$ distance
6 Application: Explaining Generalization in Deep Learning
Deep neural networks have shown attractive generalization capability without explicit regularization, even in the heavily overparametrized regime. Traditional statistical learning theory fails to explain this generalization mystery of deep learning mainly for the following two reasons:

Worst-case analysis. The generalization upper bounds derived in traditional statistical learning are based on worst-case analyses over all functions in the hypothesis space, and are thus too loose to bound the generalization error of models with large hypothesis spaces, such as deep neural networks.

Structure independence. Traditional statistical learning views a learning model as a whole, ignoring specific structures inside the model, such as the hierarchical structure of deep neural networks.
In this section, we explain the non-overfitting puzzle in deep learning by fixing the above two issues arising in statistical learning. As shown in the proof of Theorem 1, the analysis of the generalization bound via optimal transport only takes a supremum over Lipschitz functions, and therefore does not rely on worst-case analysis as in traditional statistical learning. Besides, in previous sections, we derived generalization error bounds w.r.t. different probability metrics, some of which belong to the family of $f$-divergences (Sason and Verdú 2016), such as the total variation distance, relative entropy, Hellinger distance, and $\chi^2$ distance. It is well known that strong data processing inequalities (SDPIs) for noisy Markov chains apply to $f$-divergences and related quantities, such as total variation, relative entropy, and mutual information (Polyanskiy and Wu 2017), which leads to a contraction property. In this section, we use the contraction property for mutual information, as stated in Polyanskiy and Wu 2017.
Lemma 1 (Data Processing Inequalities and Strong Data Processing Inequalities for Mutual Information).
Consider a Markov chain $X \to Y \to Z$ and the corresponding random mapping $P_{Z|Y}$. By the data processing inequality, we have $I(X; Z) \le I(X; Y)$, and equality holds if and only if $X \to Z \to Y$ also forms a Markov chain. If the mapping $P_{Z|Y}$ is noisy (that is, we cannot recover $Y$ perfectly from the observed random variable $Z$), then there exists $\eta < 1$ such that

$$I(X; Z) \le \eta\, I(X; Y). \quad (6.1)$$
The above lemma quantifies an intuitive observation: for a Markov chain $X \to Y \to Z$, the noise inside the channel $P_{Z|Y}$ must reduce the information that $Z$ carries about the data $X$ (Ajjanagadde et al. 2017).
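The contraction in Lemma 1 is easy to see in the simplest possible "network": a chain of binary symmetric channels. The sketch below is ours (the channel and the flip probability are illustrative choices); it tracks $I(X; Y_k)$ as a uniform bit $X$ passes through $k$ noisy stages, and the information decays toward zero, mirroring the layer-wise argument that follows.

```python
# Mutual information contraction along a Markov chain
# X -> Y1 -> Y2 -> ... where each step is a binary symmetric channel
# BSC(eps). I(X; Y_k) is strictly decreasing in k.
import numpy as np

def mutual_info(joint):
    # I(X;Y) = D(P_XY || P_X x P_Y) in nats, for a discrete joint
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))

eps = 0.1
K = np.array([[1 - eps, eps],
              [eps, 1 - eps]])        # BSC transition matrix P(Y_k | Y_{k-1})

px = np.array([0.5, 0.5])             # uniform input bit X
cond = np.eye(2)                      # P(Y_0 | X) with Y_0 = X
infos = []
for _ in range(5):
    cond = cond @ K                   # push through one more noisy stage
    infos.append(mutual_info(px[:, None] * cond))

print(infos)  # strictly decreasing toward 0
```

Here the effective flip probability after $k$ stages is $(1 - (1 - 2\epsilon)^k)/2$, so the residual information shrinks geometrically in the depth $k$.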
A DNN with $L$ hidden layers is illustrated in Figure 2; it applies feature transformations sequentially to the input and makes predictions on the learned features with the hypothesis of the output layer. The output of the $k$-th hidden layer is denoted by $T_k$. The training set is denoted by $T_0 = S$, and the transformed training set at the output of the $k$-th hidden layer is denoted by $T_k = \{t_{k,i}\}_{i=1}^n$, where $t_{k,i}$ denotes the $i$-th training sample at the output of the $k$-th hidden layer. The parameter of this DNN is $w = (w_1, \dots, w_L, w_{L+1})$, where $w_k$ lies in the parameter space of the $k$-th hidden layer and $w_{L+1}$ in the parameter space of the output layer, equipped with metric $\varrho$. Given $S$, we have the following Markov chain,

$$S = T_0 \to T_1 \to T_2 \to \cdots \to T_L. \quad (6.2)$$

Often, as the feature mappings in the hidden layers of deep neural networks are noisy (e.g., dropout, noisy SGD) and non-invertible (e.g., convolution, pooling, ReLU activations, non-full-column-rank weight matrices), the channel from one layer to the next is often noisy, which yields a contraction property for the mutual information; we therefore term such a layer a contraction layer. By exploiting the contraction property of the hidden layers recursively, we obtain the following generalization bound that is exponential in the depth of the DNN, i.e., the generalization error of deep learning decreases exponentially to zero as we increase the depth of the neural network.
Theorem 1 (Generalization in Deep Learning).
For a DNN with $L$ hidden layers, the input $S$, and the output hypothesis