An Optimal Transport View on Generalization

11/08/2018 ∙ by Jingwei Zhang, et al. ∙ The University of Sydney

We derive upper bounds on the generalization error of learning algorithms based on their algorithmic transport cost: the expected Wasserstein distance between the output hypothesis and the output hypothesis conditioned on an input example. The bounds provide a novel approach to studying the generalization of learning algorithms from an optimal transport view and impose fewer constraints on the loss function, such as sub-Gaussianity or boundedness. We further provide several upper bounds on the algorithmic transport cost in terms of the total variation distance, relative entropy (or KL-divergence), and VC dimension, thus further bridging optimal transport theory and information theory with statistical learning theory. Moreover, we also study different conditions on the loss function under which the generalization error of a learning algorithm can be upper bounded by different probability metrics between distributions relating to the output hypothesis and/or the input data. Finally, under our established framework, we analyze generalization in deep learning and conclude that the generalization error in deep neural networks (DNNs) decreases exponentially to zero as the number of layers increases. Our analyses of the generalization error in deep learning mainly exploit the hierarchical structure in DNNs and the contraction property of f-divergence, which may be of independent interest in analyzing other learning models with hierarchical structure.

1 Introduction

When designing a learning algorithm, one fundamental goal is to make the learning algorithm perform well on unseen test data, with access only to a finite number of training examples. Formally, we hope that the population risk (i.e., test error) of the learning algorithm is as small as possible. Unfortunately, minimizing the population risk directly is computationally intractable, because the underlying distribution of the data is unknown. Therefore, to achieve this goal, we decompose the population risk into the sum of the empirical risk (i.e., training error) and the generalization error. The empirical risk measures the extent to which the learning algorithm is consistent with the empirical evidence (i.e., data fitting), while the generalization error quantifies how well the empirical risk serves as a valid estimate of the population risk (i.e., generalization). Thus, one can obtain a hypothesis with small population risk by minimizing both the generalization error and the empirical risk. Minimizing the empirical risk alone can be realized through empirical risk minimization (Vapnik 1999) or its alternatives, for example, stochastic approximation (Kushner and Yin 2003). However, as the generalization error cannot be minimized directly, it is common practice to derive a generalization upper bound analytically so that we can study the conditions under which it is guaranteed to be small.

In this paper, we analyze the generalization error of a learning algorithm from an optimal transport perspective. Specifically, our contributions lie in four aspects:

  • We derive an optimal-transport type of generalization error bound for learning algorithms with Lipschitz continuous loss functions. This result does not impose any constraint, for example, a sub-Gaussian assumption, on the distribution of the input data or the model parameters, and it applies to unbounded loss functions.

  • The bound we derive can be related to other probability metrics, e.g., total variation distance, relative entropy, and some notions in classical learning theory, for example, the VC dimension. Therefore, our theory bridges optimal transport theory and information theory with statistical learning theory.

  • Some other generalization error bounds with different probability metrics are also derived under different assumptions on the loss function. For example, for a learning algorithm with a bounded loss function, a total-variation type generalization error bound can be derived. Via inequalities between probability metrics, the total-variation type generalization error bound can further be bounded by other probability metrics, such as the Hellinger distance and the χ² distance.

  • Under our established framework, we are able to analyze the generalization error in deep learning by exploiting the contraction property of f-divergence and the hierarchical structure in DNNs, and we conclude that the generalization error in DNNs decreases exponentially to zero as the number of layers increases.

The rest of this paper is structured as follows. In Section 2, some related works are introduced. Section 3 formalizes the research problem and gives some basic definitions. In Section 4, we derive our main theorem, which studies the generalization error of a learning algorithm from an optimal transport perspective. Section 5 further relates the main theorem to some notions in classical learning theory, for instance, the VC dimension, and derives generalization upper bounds w.r.t. other probability metrics under different conditions on the loss function, which extends the main theorem. In Section 6, we analyze the generalization error in deep learning under our established framework and conclude that the key to the non-overfitting puzzle in a DNN is its hierarchical structure.

2 Related Works

Our work is related to several different research topics on the algorithmic and theoretical aspects of machine learning, summarized below.

2.1 Generalization in Classical Statistical Learning Theory

We note that there exist extensive studies on generalization in classical learning theory. Hence, the references listed here are far from exhaustive.

One central goal in statistical learning theory is to study the conditions under which a learning algorithm can generalize. Mathematically, it requires that the generalization error converges asymptotically to zero as the number of training examples goes to infinity. Traditional ways of characterizing the generalization capability of a learning algorithm mainly rely on the complexity of the hypothesis class, e.g., VC dimension, Rademacher and Gaussian complexities, and covering numbers (Vapnik 2013, Bartlett and Mendelson 2002, Bartlett et al. 2005, Zhou 2002, Zhang 2002), or on algorithmic properties of the learning algorithm itself, e.g., uniform stability, robustness, and algorithmic luckiness (Liu et al. 2017, Shalev-Shwartz et al. 2010, Bousquet and Elisseeff 2002, Xu and Mannor 2012, Herbrich and Williamson 2002). These classical learning-theoretic approaches consider the worst-case generalization error over all functions in the hypothesis class and have successfully explained some prevalent learning models, such as Support Vector Machines (SVMs) for binary classification (Vapnik 2013). Some other approaches have also been proposed to analyze generalization in machine learning, such as the PAC-Bayesian approach, Occam's razor bounds, and the sample compression approach (Langford 2005, Ambroladze et al. 2007).

It is worth mentioning that some works show that these approaches are tightly connected. For example, Liu et al. 2017 prove that higher algorithmic stability implies smaller hypothesis complexity, and Rivasplata et al. 2018 analyze PAC-Bayesian bounds for stable learning algorithms. Nevertheless, these approaches are insufficient to explain the generalization of learning models with large hypothesis spaces, such as deep neural networks (Zhang et al. 2016). Therefore, it is necessary to find a valid approach that can explain why deep learning is attractive in terms of its generalization properties.

2.2 Information Theoretic Learning and Generalization

Recently, inspired by an observation in learning theory that learning and information compression are closely related, in the sense that both tasks involve identifying patterns or regularities in the training data in order to re-construct or re-identify future data with similar patterns, several information-theoretic approaches to analyzing the learnability and generalization capability of learning algorithms have been studied (Nachum et al. 2018, Bassily et al. 2018, Alabdulmohsin 2018, Xu and Raginsky 2017, Alabdulmohsin 2017, Russo and Zou 2016, Bassily et al. 2016, Raginsky et al. 2016, Alabdulmohsin 2015, Dwork et al. 2015, Wang et al. 2016). Specifically, Russo and Zou 2016 show that the mutual information between the collection of empirical risks and the final output hypothesis can be used to bound the generalization error for learning algorithms with a finite hypothesis class. Xu and Raginsky 2017 then extend this result to the case of uncountably infinite hypothesis spaces and bound the generalization error by the mutual information between the input training set and the output hypothesis, providing a clearer explanation that when the output hypothesis relies less on the training data, the generalization will be better. However, both of these results require that the loss function is sub-Gaussian, so they may not apply when the data distribution is heavy-tailed. Alabdulmohsin 2015 also provides an analysis of stability and generalization from an information-theoretic perspective and shows that the stability of a learning algorithm can be controlled by an information-theoretic notion of algorithmic stability, again defined as the mutual information between the input training data and the output hypothesis, but with the underlying metric defining the mutual information being total variation instead of relative entropy. However, the analysis of Alabdulmohsin 2015 is restricted to bounded loss functions and countable instance and hypothesis spaces. In a similar information-theoretic spirit, other information-theoretic notions of stability have been proposed: Raginsky et al. 2016 propose several information-theoretic notions of stability, such as erasure mutual information, which is an information-theoretic analogue of the replace-one stability proposed in Shalev-Shwartz et al. 2010; Dwork et al. 2015, Wang et al. 2016, and Bassily et al. 2016 propose notions of stability based on KL divergence, which often arises in the problem of differential privacy.

2.3 Optimal Transport in Machine Learning

Optimal transport provides a powerful, flexible, and geometric way to compare probability measures by measuring the Wasserstein distance between two probability distributions (Peyré et al. 2017). Optimal transport has recently drawn attention from the machine learning community, because it is capable of tackling some challenging learning scenarios such as generative learning (Arjovsky et al. 2017, Gulrajani et al. 2017), transfer learning (Courty et al. 2017), and distributionally robust optimization (Lee and Raginsky 2017, Gao and Kleywegt 2016). When comparing different probability distributions, the main advantage of the Wasserstein metric over other common probability metrics (such as relative entropy and total variation distance) lies in its good convergence properties even when the supports of the two probability measures have a negligible intersection (see the example in Arjovsky et al. 2017). To our knowledge, studies characterizing the generalization capability of a learning algorithm from an optimal transport perspective are quite limited.
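To make the last point concrete, the following minimal sketch (our own illustration, not from the paper) compares two point masses at 0 and θ using SciPy's one-dimensional Wasserstein distance: the 1-Wasserstein distance varies smoothly with θ, while the total variation distance stays at its maximum and the relative entropy blows up whenever the supports are disjoint.

```python
# Toy comparison of probability metrics for distributions with disjoint supports.
# P = delta_0 and Q = delta_theta; only the Wasserstein distance reflects theta.
import numpy as np
from scipy.stats import wasserstein_distance

for theta in [0.1, 1.0, 10.0]:
    # 1-D Wasserstein distance between the empirical samples {0} and {theta}
    w1 = wasserstein_distance([0.0], [theta])

    # Represent both measures as pmfs on the two-point support {0, theta}
    p = np.array([1.0, 0.0])
    q = np.array([0.0, 1.0])
    tv = 0.5 * np.abs(p - q).sum()   # total variation = 1 for any theta != 0
    mask = p > 0
    kl = np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-300)))  # diverges

    print(f"theta={theta:5.1f}  W1={w1:5.1f}  TV={tv:.1f}  KL={kl:.0f} (effectively infinite)")
```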

3 Problem Setup

We consider the traditional paradigm of statistical learning, where there is an instance space , a hypothesis space , and a nonnegative loss function . The training sample of size  is denoted by , where each element is drawn i.i.d. from an unknown underlying distribution . A learning algorithm can be viewed as a (possibly randomized) mapping from the training sample space to the hypothesis space . Following the settings above, the learning algorithm can be uniquely characterized by a Markov kernel , which means that given the training sample , the learning algorithm picks the output hypothesis according to the conditional distribution . Note that when  degenerates to a Dirac delta distribution, the mapping becomes deterministic, which matches the traditional setting of statistical learning, such as Support Vector Machines and linear regression. For any hypothesis , the population risk is defined as

(3.1)

The population risk is the performance measure of the hypothesis . Therefore, the goal of learning is to find a hypothesis with small population risk, either with high probability or in expectation, under the data generating distribution . However, as the underlying population is unknown, the learning algorithm cannot compute and minimize the population risk directly. Instead, it can compute the empirical risk of  on the training sample as a proxy, which is defined by

(3.2)

where  is the empirical distribution of the training examples. To evaluate how well the empirical risk serves as a valid estimate of the population risk, we define the generalization error as the difference between the population risk and the empirical risk:

(3.3)

A small generalization error implies that the empirical risk on the training sample can be a good estimate of the population risk on the underlying unknown population . In this paper, we are interested in the expected generalization error, which is defined as

(3.4)

where the expectation is taken over the joint distribution of the training sample and the hypothesis.
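In standard notation (with D the data distribution, ℓ the loss, w a hypothesis, S = (z_1, ..., z_n) the training sample, W the output hypothesis, and A the algorithm; these symbols are chosen here for concreteness and may differ from the paper's), the population risk, the empirical risk, and the expected generalization error read

\[
R(w) = \mathbb{E}_{z \sim D}\big[\ell(w, z)\big], \qquad
R_S(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w, z_i), \qquad
\mathrm{gen}(D, A) = \mathbb{E}\big[R(W) - R_S(W)\big].
\]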

As discussed earlier, the goal of learning is to make the population risk as small as possible, either in expectation or with high probability. In this paper, we are interested in the expected population risk, which is . We then have the following decomposition,

(3.5)

where the first term on the right-hand side of the equation controls the generalization and the second term measures the data fitting. To minimize the expected population risk, we need to minimize both terms. Often, when the training error is small, the learning algorithm tends to fit the training data too well and thus generalizes poorly to unseen test data; when the training error is large, the learning algorithm tends to be insensitive to the training data and therefore has better generalization capability towards test data. Thus, a learning algorithm faces a trade-off between minimizing the empirical risk (i.e., data fitting) and generalization, which is also known as the bias-variance trade-off (Mohri et al. 2012) in statistical learning. In what follows, it will be shown how the generalization error can be related to optimal transport.
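In the same assumed notation as above, the decomposition just discussed is the standard split

\[
\mathbb{E}\big[R(W)\big]
= \mathbb{E}\big[R(W) - R_S(W)\big] + \mathbb{E}\big[R_S(W)\big],
\]

whose first term is the expected generalization error and whose second term is the expected empirical risk (data fitting).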

4 Generalization Guarantees via Algorithmic Transport Cost

We assume that the hypothesis space is a Polish space (i.e., a complete separable metric space) with metric , and denote by  the space of all Borel probability measures on it with finite p-th (p ≥ 1) moments. We can define a family of metrics on this space with respect to the metric structure on the hypothesis space.

Definition 1 (Wasserstein Distance).

For any p ≥ 1, the p-Wasserstein distance between two probability measures is defined as

(4.1)

where  denotes the collection of all measures on  with marginals  and  on the first and second factors, respectively. The set  is also called the set of all couplings of  and .

The Wasserstein distance, also called the earth mover's distance, arises in the problem of optimal transport. Intuitively, if each probability measure is viewed as a unit amount of "dirt" piled on the space, a coupling can be seen as a randomized policy for moving a unit quantity of dirt from a random location to another location. If the cost for moving a unit amount of dirt is measured by the underlying metric, then the optimal total cost of moving one pile onto the other is exactly the quantity appearing in Definition 1.
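As a minimal numerical sketch of this intuition (our own illustration, not from the paper), the one-dimensional 1-Wasserstein distance between two Gaussians that differ only by a mean shift μ is |μ|: the optimal plan simply translates the whole pile of dirt. The estimate below uses SciPy's wasserstein_distance on empirical samples.

```python
# Estimate the 1-Wasserstein distance between N(0,1) and N(mu,1) from samples;
# the true value is |mu|, i.e. the cost of translating the whole pile of "dirt".
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n = 200_000
base = rng.normal(0.0, 1.0, size=n)

for mu in [0.0, 0.5, 2.0]:
    shifted = rng.normal(mu, 1.0, size=n)
    print(f"mu={mu:3.1f}  estimated W1={wasserstein_distance(base, shifted):.3f}")
```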

In the proof of our main result, we will adopt the Kantorovich dual representation of the 1-Wasserstein metric. It is worth mentioning that, as the p-Wasserstein metric is a monotonically increasing function of p for p ≥ 1, any upper bound that holds with the 1-Wasserstein metric naturally holds with the p-Wasserstein metric for any p ≥ 1.

Lemma 1 (Villani 2008).

Let  be two probability measures defined on the same metric space , i.e., . Then the 1-Wasserstein distance between  and  can be represented as

(4.2)

where denotes the Lipschitz constant for .
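With P and Q as generic probability measures (our notation, chosen for concreteness), the Kantorovich–Rubinstein duality underlying Lemma 1 is standardly written as

\[
W_1(P, Q) \;=\; \sup_{\|f\|_{\mathrm{Lip}} \le 1} \Big( \mathbb{E}_{w \sim P}\big[f(w)\big] - \mathbb{E}_{w \sim Q}\big[f(w)\big] \Big),
\]

where the supremum is over all 1-Lipschitz functions and \(\|f\|_{\mathrm{Lip}}\) denotes the Lipschitz constant of f.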

We then define the notion of algorithmic transport cost for a learning algorithm  .

Definition 2 (algorithmic transport cost).

For a learning algorithm with the training set , the data generating distribution , and the Markov kernel , the algorithmic transport cost can be defined as

(4.3)

Remark 1. The notion of algorithmic transport cost is defined via the average 1-Wasserstein distance between the distribution of the output hypothesis and its distribution conditioned on one input example. Intuitively, when one input example contributes less to determining the output hypothesis, the cost of moving probability mass between the two distributions becomes smaller, and thus the generalization will be better. An extreme case is when the output hypothesis is independent of the input: the generalization error then becomes zero, since the output hypothesis has no correlation with the input and therefore has the same performance on both training data and test data.
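One plausible reading of Definition 2, consistent with Remark 1 and with the abstract (the symbols μ, Z, W, P_W, and P_{W|Z} are ours, denoting the data distribution, a single input example, the output hypothesis, its marginal distribution, and its distribution conditioned on that example), is

\[
C \;=\; \mathbb{E}_{Z \sim \mu}\Big[\, W_1\big(P_{W},\; P_{W \mid Z}\big) \Big].
\]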

Following the definitions and the lemma above, we have our main theorem, which bounds the generalization error of a learning algorithm via algorithmic transport cost.

Theorem 1.

Assume that the function is -Lipschitz continuous for any :

(4.4)

The expected generalization error of a learning algorithm can be upper bounded by

(4.5)
Proof.

Let  be a learning algorithm that has access to a finite set of training examples , where each element is drawn i.i.d. from an unknown underlying distribution . Let  be a random variable that stands for the hypothesis output by  with a Markov kernel , and let  be a single random training example. By definition, we have

(4.6)

As is a random training example which is chosen uniformly at random from the training set , we have the following relationship between the output hypothesis , the training set , and the random training example :

(4.7)

which means that and are conditionally independent of each other when given . Therefore, by marginalization, we have

(4.8)

Substituting (4.8) into (4.6) yields

(4.9)

where the inequality follows from Lemma 1, which completes the proof. ∎

Remark 2. The derivation of this generalization error bound only requires that the loss function is -Lipschitz continuous w.r.t. its first argument . The Lipschitz condition can also be imposed on the second argument or . Similar results also hold by slightly modifying the definition of algorithmic transport cost.
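To make these quantities concrete, here is a toy Monte Carlo sketch (our own illustration, not an experiment from the paper) for a simple algorithm that outputs the sample mean of n standard Gaussian examples under the absolute loss, which is 1-Lipschitz in the hypothesis; it estimates both the expected generalization gap and the algorithmic transport cost appearing in Theorem 1 and shows that both shrink as n grows.

```python
# Toy Monte Carlo estimates for W = mean(S), S ~ N(0,1)^n, loss l(w, z) = |w - z|.
# Theorem 1 bounds the expected generalization gap by the expected 1-Wasserstein
# distance between P_W and P_{W | one training example fixed} (Lipschitz constant 1).
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def gen_gap(n, trials=200_000):
    """Monte Carlo estimate of E[R(W) - R_S(W)] for W = mean(S)."""
    s = rng.normal(size=(trials, n))
    w = s.mean(axis=1)
    emp = np.abs(w[:, None] - s).mean(axis=1)   # empirical risk per trial
    pop = np.abs(w - rng.normal(size=trials))   # one fresh test point per trial
    return float((pop - emp).mean())

def transport_cost(n, n_cond=200, samples=50_000):
    """Monte Carlo estimate of E_z[ W1(P_W, P_{W | Z_1 = z}) ]."""
    w_marg = rng.normal(size=(samples, n)).mean(axis=1)
    costs = []
    for z in rng.normal(size=n_cond):
        w_cond = (z + rng.normal(size=(samples, n - 1)).sum(axis=1)) / n
        costs.append(wasserstein_distance(w_marg, w_cond))
    return float(np.mean(costs))

for n in [2, 5, 20]:
    gap, cost = gen_gap(n), transport_cost(n)
    # Both quantities shrink roughly like 1/n; Theorem 1 guarantees gap <= 1 * cost,
    # and for this particular algorithm the bound happens to be nearly tight.
    print(f"n={n:3d}  estimated gap={gap:.4f}  estimated transport cost={cost:.4f}")
```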

5 Upper Bound via Probability Metrics

In the previous section, we derived an upper bound on the expected generalization error of a learning algorithm in terms of the algorithmic transport cost, which is defined as the expected 1-Wasserstein distance between the output hypothesis and the output hypothesis conditioned on an input example. In this section, we will first show that the generalization error can also be bounded by other probability metrics via inequalities between probability metrics.

5.1 Generalization Bound with Total Variation Distance

In this subsection, we will show that the generalization error can be further upper bounded by the total variation distance between distributions relating to the input data and the output hypothesis. First, we give the definition of the total variation distance along with its dual and coupling representations.

Definition 3 (Total Variation).

Let be two probability measures defined on the same metric space . The total variation distance between and is defined by

(5.1)

where the supremum is over all Borel measurable sets.

Lemma 1 (Dual Representation of Total Variation Distance).

Let  be two probability measures defined on the same metric space . The total variation distance between  and  can be represented as

(5.2)
Lemma 2 (Coupling Characterization of Total Variation Distance).

Let be two probability measures defined on the same metric space . The total variation distance between and has a coupling characterization

(5.3)
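With P and Q as generic probability measures (our notation), the standard definition, dual representation, and coupling characterization of the total variation distance referred to in Definition 3 and Lemmas 1 and 2 of this section read

\[
\mathrm{TV}(P, Q)
\;=\; \sup_{B}\, \big|P(B) - Q(B)\big|
\;=\; \tfrac{1}{2}\, \sup_{\|f\|_\infty \le 1} \big( \mathbb{E}_{P}[f] - \mathbb{E}_{Q}[f] \big)
\;=\; \inf_{(X,Y)} \Pr[X \ne Y],
\]

where the first supremum is over Borel sets and the infimum is over all couplings (X, Y) with X distributed as P and Y distributed as Q.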

With the above definition and lemmas, we derive a generalization upper bound w.r.t. total variation distance.

Theorem 1.

Assume that the hypothesis space is bounded, i.e.,

(5.4)

and the function is -Lipschitz continuous for any  . The expected generalization error of a learning algorithm can be upper bounded by

(5.5)
Proof.

For any and , we have

(5.6)

Taking the infimum of the expected value over on both sides of the above inequality and following Definition 1 and Lemma 2, we have

(5.7)

Following the definition of algorithmic transport cost and Theorem 1, we obtain

(5.8)

By Definition 3, it follows

(5.9)

Combining (1) and (5.8) completes the proof. ∎

The above theorem requires that the loss function is Lipschitz continuous and the hypothesis space is bounded. When the loss function is bounded, we can also upper bound the generalization error via the total variation distance by exploiting its dual representation. This result is an extension of Alabdulmohsin 2015, which requires that both the instance space and the hypothesis space are countable.

Theorem 2.

Assume that the loss function is bounded for any , i.e.,

(5.10)

The expected generalization error of a learning algorithm can be upper bounded by

(5.11)
Proof.

From Equation (1), the expected generalization error can be rewritten as

(5.12)

By the dual representation of total variation, as shown in Lemma 1, we have

(5.13)

which completes the proof. ∎

5.2 Generalization Bound with Mutual Information

Following the results in the previous section, we can further bound the generalization error via the mutual information between the input training sample and the output hypothesis . The result in this section is complementary to that of Xu and Raginsky 2017, in the sense that we do not require the loss function to be sub-Gaussian w.r.t. the data distribution.

Definition 4 (Relative Entropy).

Let  be two probability measures defined on the same metric space . The relative entropy between  and  is defined by

(5.14)
Definition 5 (Mutual Information).

Let  be two random variables defined on the same metric space . The mutual information between  and  is defined by

(5.15)

where denotes the joint distribution of and are corresponding marginal distributions.
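For intuition, here is a minimal sketch (our own, assuming finite supports represented as NumPy arrays) that computes the relative entropy of Definition 4 and the mutual information of Definition 5 directly from probability mass functions.

```python
# Relative entropy from two pmfs, and mutual information as the relative entropy
# between a joint pmf and the product of its marginals.
import numpy as np

def kl(p, q):
    """Relative entropy D(p || q) for pmfs on a common finite support."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_information(joint):
    """I(X; Y) = D( P_{XY} || P_X x P_Y ) for a joint pmf given as a 2-D array."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return kl(joint.ravel(), (px * py).ravel())

joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])          # a dependent pair of binary variables
print("I(X;Y) =", round(mutual_information(joint), 4))
print("I vanishes for independent variables:",
      round(mutual_information(np.outer([0.4, 0.6], [0.5, 0.5])), 4))
```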

Theorem 3.

Assume that the hypothesis space is bounded, i.e.,

(5.16)

and the function is -Lipschitz continuous for any  . The expected generalization error of a learning algorithm can be upper bounded by

(5.17)
Proof.

Using Pinsker’s inequality, we obtain

(5.18)

As and each element is drawn i.i.d. from the underlying unknown distribution , we deduce the following inequalities

(5.19)

It follows naturally

(5.20)

The proof ends by combining (5.18) and (5.20). ∎

5.3 Generalization Bound with Bounded Lipschitz Distance

Under the assumption that the loss function is both Lipschitz continuous and bounded, we derive a generalization error bound with respect to the bounded Lipschitz distance.

Definition 6 (Bounded Lipschitz Distance).

Let  be two probability measures defined on the same metric space . The bounded Lipschitz distance between  and  is defined by

(5.21)

where the bounded-Lipschitz norm of is defined by

(5.22)
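With P and Q as generic probability measures and f a real-valued function (our notation), one common convention for Definition 6, possibly differing from the paper's exact normalization, is

\[
\|f\|_{\mathrm{BL}} = \|f\|_{\infty} + \|f\|_{\mathrm{Lip}}, \qquad
d_{\mathrm{BL}}(P, Q) = \sup_{\|f\|_{\mathrm{BL}} \le 1} \Big( \mathbb{E}_{P}[f] - \mathbb{E}_{Q}[f] \Big).
\]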
Theorem 4.

Assume that the loss function is bounded and Lipschitz continuous with respect to  for any , i.e., there exists , such that

(5.23)

The expected generalization error of a learning algorithm can be upper bounded by

(5.24)
Proof.

The proof is similar to that of Theorem 1; the expected generalization error can be rewritten as

(5.25)

By Definition 6, we deduce

(5.26)

which completes the proof. ∎

5.4 Relationships to VC-dimension

We have derived several generalization upper bounds with respect to different probability metrics, such as the total variation distance, the Wasserstein distance, and the relative entropy. In this subsection, we will show how these results can be related to some traditional notions, for example, the VC dimension, that characterize learnability and generalization capability in statistical learning.

The notion of VC-dimension arises in the problem of binary classification. In this setting, the instance space is , where  denotes the feature space and  denotes the label space. The training set is . The hypothesis space is a class of functions that define the mapping . For any integer , we present the definition of the growth function as in Mohri et al. 2012.

Definition 7 (Growth Function).

The growth function of a function class is defined as

(5.27)

In the following theorem, we show how the notion of VC dimension can be related to our previous generalization bounds with probability metrics.

Theorem 5.

If has finite VC-dimension , the expected generalization error of a learning algorithm for binary classification can be upper bounded by

(5.28)

where  .

Proof.

In the setting of binary classification with hypothesis space , we have

(5.29)

Without loss of generality, we let . Therefore, by Theorem 2, we have

(5.30)

Using Pinsker’s inequality and Equation (5.20), it follows

(5.31)

Denote the collection of empirical risks of the hypothesis in on by

(5.32)

As the output hypothesis only depends on the empirical risk of the training data , the following equality holds:

(5.33)

Therefore, we have

(5.34)

where the last inequality is due to Sauer’s lemma

(5.35)
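The form of Sauer's lemma invoked in (5.35) is, in standard notation (with Π_H the growth function of Definition 7, d the VC-dimension, and n the sample size, symbols chosen here for concreteness),

\[
\Pi_{\mathcal{H}}(n) \;\le\; \sum_{i=0}^{d} \binom{n}{i} \;\le\; \Big(\frac{en}{d}\Big)^{d} \quad \text{for } n \ge d.
\]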

5.5 Generalization Bounds with Other Probability Metrics

In previous sections, we have derived generalization bounds with different probability metrics under different assumptions (e.g., boundedness, Lipschitz continuity) on the loss function. By using the relationships between different probability metrics, in this subsection we can further extend the previous results to the other probability metrics listed in Table 1. We first present precise definitions of these probability metrics.

Let be two probability measures defined on the same metric space . We have the following definitions.

Definition 8 (Prokhorov metric).

The Prokhorov metric between and is defined by

(5.36)

where and the infimum is over all Borel sets  .

Definition 9 (Hellinger distance).

The Hellinger distance between and is defined by

(5.37)
Definition 10 (χ² distance).

The χ² distance between  and  is defined by

(5.38)

We present some relationships among different probability metrics in Table 1.

Lemma 3.

Let  be any metric space with metric  and  be two probability measures on . Then the following relationships hold.

(1). The Wasserstein and Prokhorov metrics satisfy

(5.39)

where denotes the diameter of the probability space.

(2). The bounded Lipschitz distance and the Wasserstein metric satisfy

(5.40)

(3). The bounded Lipschitz distance and the total variation distance satisfy

(5.41)

(4). The total variation distance and the Hellinger distance satisfy

(5.42)

(5). The relative entropy and the χ² distance satisfy

(5.43)
Proof.

(1). This result is due to Theorem 2 in Gibbs and Su 2002.

(2) & (3). These two results are classical facts in probability theory, which can be proved by combining Definition 6 with the Kantorovich dual representation of the 1-Wasserstein metric in Lemma 1 of Section 4 and the dual representation of the total variation distance in Lemma 1 of Section 5, respectively.

(4). See p. 35 in Le Cam 1969.

(5). This result is proved by Gibbs and Su 2002 in Theorem 5. ∎
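As a numerical companion to these relationships, the sketch below (our own, for discrete distributions; the Hellinger normalization shown is one common convention) computes several of the metrics above and checks two well-known inequalities used throughout this section: Pinsker's inequality, TV ≤ √(KL/2), and the bound KL ≤ χ².

```python
# Sanity check of classical metric inequalities on random discrete distributions.
import numpy as np

rng = np.random.default_rng(0)

def metrics(p, q):
    tv = 0.5 * np.abs(p - q).sum()
    kl = np.sum(p * np.log(p / q))                       # assumes strictly positive pmfs
    chi2 = np.sum((p - q) ** 2 / q)
    hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    return tv, kl, chi2, hellinger

for _ in range(5):
    p = rng.dirichlet(np.ones(6))                        # random pmfs on 6 points
    q = rng.dirichlet(np.ones(6))
    tv, kl, chi2, h = metrics(p, q)
    assert tv <= np.sqrt(kl / 2) + 1e-12                 # Pinsker's inequality
    assert kl <= chi2 + 1e-12                            # relative entropy vs chi-squared
    print(f"TV={tv:.3f}  KL={kl:.3f}  chi2={chi2:.3f}  Hellinger={h:.3f}")
```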

The relationships are illustrated in Figure 1. Combining them with the results in the previous subsections, we have the following generalization bounds for Lipschitz, bounded, and bounded-Lipschitz loss functions, respectively.

Corollary 1 (Generalization Bounds for Lipschitz Continuous Loss functions).

Assume that the hypothesis space is bounded, i.e.,

(5.44)

and the function is -Lipschitz continuous for any . The following generalization bounds hold:

(1). Generalization Bound by Prokhorov metric.

(5.45)

(2). Generalization Bound by Hellinger distance.

(5.46)

(3). Generalization Bound by χ² distance.

(5.47)
Proof.

(1). The result is due to Theorem 1 and (1) of Lemma 3.

(2). The result is due to Theorem 1 and (4) of Lemma 3.

(3). The result is due to Theorem 3 and (5) of Lemma 3.

Corollary 2 (Generalization Bounds for Bounded Loss functions).

Assume that the loss function is bounded for any , i.e.,

(5.48)

The following generalization bounds hold:

(1). Generalization Bound by Mutual Information.

(5.49)

(2). Generalization Bound by Hellinger distance.

(5.50)

(3). Generalization Bound by χ² distance.

(5.51)
Proof.

(1). The result is by Theorem 2, Pinsker’s inequality, and Equation (5.20).

(2). The result is due to Theorem 2 and (4) of Lemma 3.

(3). The result is due to (1) of Corollary 2 and (5) of Lemma 3.

Corollary 3 (Generalization Bounds for Bounded and Lipschitz Continuous Loss functions).

Assume that the loss function is bounded and Lipschitz continuous with respect to  for any , i.e., there exists , such that

(5.52)

The following generalization bounds hold:

(1). Generalization Bound by algorithmic transport cost.

(5.53)

(2). Generalization Bound by total variation distance.

(5.54)

(3). Generalization Bound by Prokhorov metric.

(5.55)

(4). Generalization Bound by Mutual Information.

(5.56)

(5). Generalization Bound by Hellinger distance.

(5.57)

(6). Generalization Bound by χ² distance.

(5.58)
Proof.

(1). The result is due to Theorem 4 and (2) of Lemma 3.

(2). The result is due to Theorem 4 and (3) of Lemma 3.

(3). The result is due to (1) of Corollary 3 and (1) of Lemma 3.

(4). The result is due to (2) of Corollary 3, Pinsker’s inequality, and Equation (5.20).

(5). The result is due to (2) of Corollary 3 and (4) of Lemma 3.

(6). The result is due to (4) of Corollary 3 and (5) of Lemma 3.

Table 1: Abbreviations for Metrics. The metrics listed are the Wasserstein metric, relative entropy, the total variation distance, the bounded Lipschitz distance, the Prokhorov metric, the Hellinger distance, and the χ² distance.
Figure 1: Relationships among different probability metrics on the same measurable space . A directed arrow from metric to metric annotated by a function means that  . The symbol denotes the diameter of the probability space.

6 Application: Explaining Generalization in Deep Learning

Figure 2: The hierarchical structure of DNNs.

Deep neural networks have shown attractive generalization capability without explicit regularization, even in the heavily over-parameterized regime. Traditional statistical learning theory fails to explain this generalization mystery of deep learning mainly for the following two reasons:

  • Worst-Case Analysis.  The generalization upper bounds derived in traditional statistical learning are based on worst-case analyses over all functions in the hypothesis space, and are thus too loose to bound the generalization error of models with large hypothesis spaces, such as deep neural networks.

  • Structure Independence.  Traditional statistical learning views a learning model as a whole, ignoring specific structures inside the model, such as the hierarchical structure of deep neural networks.

In this section, we will explain the non-overfitting puzzle in deep learning by fixing the above two issues arising in statistical learning. As shown in the proof of Theorem 1, the analysis of the generalization bound via optimal transport only takes the supremum over all Lipschitz functions, and therefore does not rely on a worst-case analysis as in traditional statistical learning. Besides, in previous sections, we derived generalization error bounds w.r.t. different probability metrics, some of which are f-divergences (Sason and Verdú 2016), such as the total variation distance, relative entropy, Hellinger distance, and χ² distance. It is well known that the strong data processing inequalities (SDPIs) for noisy Markov chains can be applied to f-divergences or related quantities, such as total variation, relative entropy, and mutual information (Polyanskiy and Wu 2017), which leads to a contraction property. Below, we will use the contraction property for mutual information, as stated in Polyanskiy and Wu 2017.

Lemma 1 (Data Processing Inequalities and Strong Data Processing Inequalities for Mutual Information).

Consider a Markov chain and the corresponding random mapping . By the data processing inequality, we have , and the equality holds if and only if  also forms a Markov chain. If the mapping is noisy (that is, we cannot perfectly recover  from the observed random variable ), then there exists , such that

(6.1)

The above lemma quantifies an intuitive observation that for a Markov chain , the noise inside the channel must reduce the information that carries about the data (Ajjanagadde et al. 2017).

Figure 2 illustrates a DNN with  hidden layers, which sequentially conducts feature transformations on the input and makes predictions on the learned features via the hypothesis of the output layer. The output of the -th hidden layer is denoted by . The training set is denoted by , and the transformed training set at the output of the -th hidden layer is denoted by , where  denotes the -th training sample at the output of the -th hidden layer. The parameter of this DNN is , where  is the parameter space of the -th hidden layer and  is the parameter space of the output layer, equipped with metric . When given , we have the following Markov chain,

(6.2)

Often, as the feature mappings in the hidden layers of deep neural networks are noisy (e.g., dropout, noisy SGD) and non-invertible (e.g., convolution, pooling, ReLU activation, non-full column rank of the weight matrix), the channel (let ) is often noisy, which causes a contraction of the mutual information, and we term such a layer a contraction layer. By exploiting the contraction property of the hidden layers recursively, we have the following exponential generalization bound w.r.t. the depth of DNNs, i.e., the generalization error of deep learning will decrease exponentially to zero as we increase the depth of the neural network.
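The following self-contained sketch (our own toy model with discrete variables, not the paper's DNN setting) makes the contraction mechanism concrete: composing noisy channels in a Markov chain reduces, and here strictly reduces, the mutual information between the data and the representation at each additional layer.

```python
# Contraction of mutual information along a Markov chain S -> T_1 -> T_2 -> ...
# where each "layer" is a noisy channel that keeps the symbol with probability
# (1 - eps) and otherwise resamples it uniformly.
import numpy as np

def mutual_information(joint):
    """I(X; Y) for a joint pmf given as a 2-D array."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask])))

k_states = 8
p_s = np.full(k_states, 1.0 / k_states)                  # "data" S uniform on 8 symbols

eps = 0.2                                                 # noise level of each layer
layer = (1 - eps) * np.eye(k_states) + eps / k_states     # row-stochastic noisy channel

channel = np.eye(k_states)                                # channel from S to current layer
for depth in range(1, 7):
    channel = channel @ layer                             # compose one more noisy layer
    joint = p_s[:, None] * channel                        # joint pmf of (S, T_depth)
    print(f"depth={depth}  I(S; T_depth) = {mutual_information(joint):.4f} nats")
```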

Theorem 1 (Generalization in Deep Learning).

For a DNN with hidden layers, the input , and the output hypothesis