# An Information-Theoretic View for Deep Learning

Deep learning has transformed computer vision, natural language processing, and speech recognition. However, two critical questions remain obscure: (1) why do deep neural networks generalize better than shallow networks, and (2) does a deeper network always lead to better performance? Specifically, letting $L$ be the number of convolutional and pooling layers in a deep neural network, and $n$ be the size of the training sample, we derive an upper bound on the expected generalization error for this network, i.e., $\mathbb{E}[R(W)-R_S(W)] \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S,W)}$, where $\sigma > 0$ is a constant depending on the loss function, $0 < \eta < 1$ is a constant depending on the information loss for each convolutional or pooling layer, and $I(S,W)$ is the mutual information between the training sample $S$ and the output hypothesis $W$. This upper bound shows that: (1) as the network increases its number of convolutional and pooling layers $L$, the expected generalization error decreases exponentially to zero. Layers with strict information loss, such as the convolutional layers, reduce the generalization error of deep learning algorithms. This answers the first question. However, (2) zero expected generalization error does not imply a small test error $\mathbb{E}[R(W)]$. This is because $\mathbb{E}[R_S(W)]$ will be large when the information needed for fitting the data is lost as the number of layers increases. This suggests that the claim "the deeper the better" is conditional on a small training error $\mathbb{E}[R_S(W)]$.

## 1 Introduction

We study the standard statistical learning framework, where the instance space is denoted by $\mathcal{Z}$ and the hypothesis space is denoted by $\mathcal{W}$. The training sample is denoted by $S = \{Z_1, \ldots, Z_n\}$, where each element $Z_i$ is drawn i.i.d. from an unknown distribution $\mathcal{D}$. A learning algorithm $\mathcal{A}$ can be regarded as a randomized mapping from the training sample space $\mathcal{Z}^n$ to the hypothesis space $\mathcal{W}$. The learning algorithm $\mathcal{A}$ is characterized by a Markov kernel $P_{W|S}$, meaning that, given training sample $S$, the algorithm picks a hypothesis in $\mathcal{W}$ according to the conditional distribution $P_{W|S}$.

We introduce a loss function $\ell(W, Z)$ to measure the quality of a prediction w.r.t. a hypothesis. For any hypothesis $W$ learned from $S$, we define the expected risk

$$R(W) = \mathbb{E}_{Z \sim \mathcal{D}}[\ell(W, Z)], \tag{1.1}$$

and the empirical risk

$$R_S(W) = \frac{1}{n}\sum_{i=1}^{n} \ell(W, Z_i). \tag{1.2}$$

For a learning algorithm $\mathcal{A}$, the generalization error is defined as

$$G_S(\mathcal{D}, P_{W|S}) = R(W) - R_S(W). \tag{1.3}$$

A small generalization error implies that the learned hypothesis will have similar performances on both the training and test datasets.

In this paper, we study the following expected generalization error for deep learning:

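$$G(\mathcal{D}, P_{W|S}) = \mathbb{E}[R(W) - R_S(W)], \tag{1.4}$$

where the expectation is over the joint distribution $P_{W,S} = \mathcal{D}^n \times P_{W|S}$.

We have the following decomposition:

$$\mathbb{E}[R(W)] = G(\mathcal{D}, P_{W|S}) + \mathbb{E}[R_S(W)], \tag{1.5}$$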

where the first term on the right-hand side is the expected generalization error, and the second term reflects how well the learned hypothesis fits the training data from an expectation view.
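The decomposition (1.5) can be checked numerically on a toy problem. The following is a minimal sketch under assumptions that are not part of the paper: the data are $Z \sim N(0,1)$, the loss is the squared loss, and the (hypothetical) learning algorithm outputs the sample mean. The squared loss on Gaussian data is not sub-Gaussian, so this only illustrates the decomposition, not the bounds derived later.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 20000

# Hypothetical setup: Z ~ N(0, 1), squared loss, W = sample mean of S.
samples = rng.standard_normal((trials, n))
w = samples.mean(axis=1)

# Expected risk R(W) = E_Z[(W - Z)^2] = W^2 + 1 in closed form for Z ~ N(0, 1).
expected_risk = w**2 + 1.0
# Empirical risk R_S(W) = (1/n) * sum_i (W - Z_i)^2.
empirical_risk = ((w[:, None] - samples) ** 2).mean(axis=1)

gen_error = (expected_risk - empirical_risk).mean()  # Monte Carlo estimate of G
mean_expected_risk = expected_risk.mean()            # estimate of E[R(W)]
mean_empirical_risk = empirical_risk.mean()          # estimate of E[R_S(W)]
print(gen_error, mean_empirical_risk, mean_expected_risk)
```

For this toy learner the expected generalization error is $2/n$ in closed form, so the Monte Carlo estimate should land close to 0.2 for $n = 10$.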

When designing a learning algorithm, we want the expectation of the expected risk, i.e., $\mathbb{E}[R(W)]$, to be as small as possible. However, obtaining small values for the expected generalization error $G(\mathcal{D}, P_{W|S})$ and the expected empirical risk $\mathbb{E}[R_S(W)]$ at the same time is difficult. Usually, if a model fits the training data too well, it may generalize poorly on the test data; this is known as the bias-variance trade-off problem (Domingos 2000). Surprisingly, deep learning has empirically shown its power for simultaneously minimizing $G(\mathcal{D}, P_{W|S})$ and $\mathbb{E}[R_S(W)]$. Deep networks have a small $\mathbb{E}[R_S(W)]$ because neural networks with deep architectures can efficiently and compactly represent highly-varying functions (Sonoda and Murata 2015). However, the theoretical justification for their small expected generalization errors remains elusive.

In this paper, we study the expected generalization error for deep learning from an information-theoretic point of view. We will show that, as the number of layers grows, the expected generalization error decreases exponentially to zero (we have $I(S,W) \le H(S)$, which is independent of $L$; detailed discussions are given in Section 4 and Section 6). Specifically, in Theorem 1, we prove that

$$G(\mathcal{D}, P_{W|S}) = \mathbb{E}[R(W) - R_S(W)] \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S,W)},$$

where $L$ is the number of information loss layers in deep neural networks (DNNs), $0 < \eta < 1$ is a constant depending on the average information loss of each layer, $\sigma > 0$ is a constant depending on the loss function, $n$ is the size of the training sample $S$, and $I(S,W)$ is the mutual information between the input training sample $S$ and the output hypothesis $W$. The advantage of using the mutual information between the input and output to bound the expected generalization error (Russo and Zou 2015, Xu and Raginsky 2017) is that it depends on almost every aspect of the learning algorithm, including the data distribution, the complexity of the hypothesis class, and the property of the learning algorithm itself.

Our result is consistent with the bias-variance trade-off. Although the expected generalization error decreases exponentially to zero as the number of information loss layers increases, the expected empirical risk increases since the information loss is harmful to data fitting. This implies that, when designing deep neural networks, greater efforts should be made to balance the information loss and expected training error.

We also provide stability and risk bound analyses for deep learning. We prove that deep learning satisfies a weaker notion of stability, which we term average replace-one hypothesis stability, implying that the output hypothesis will not change too much by expectation when one point in the training sample is replaced. Under the assumption that the algorithm mapping is deterministic, the notion of average replace-one hypothesis stability will degenerate to the case of average replace-one stability, as proposed by (Shalev-Shwartz et al. 2010), which has been identified as a necessary condition for learnability in the general learning setting introduced by Vapnik.

We further provide an expected excess risk bound for deep learning and show that the sample complexity of deep learning will decrease as $L$ increases, which surprisingly indicates that by increasing $L$, we need a smaller sample complexity for training. However, this does not imply that increasing the number of layers will always help. An extreme case is that, as $L$ goes to infinity, the output feature will lose all predictive information and no training sample is needed because random guessing is optimal. We also derive upper bounds on the expected generalization error for some specific deep learning algorithms, such as noisy stochastic gradient descent (SGD) and binary classification for deep learning. We further show that these two algorithms are PAC-learnable, with the sample complexities given in Section 6.

The remainder of this paper is organized as follows. In Section 2, we relate DNNs to Markov chains. Section 3 exploits the strong data processing inequality to derive how the mutual information between intermediate feature representations and the output varies in DNNs. Our main results are given in Section 4, which gives an exponential generalization error bound for DNNs in terms of the depth $L$; we then analyze the stability of deep learning in Section 5 and the learnability of deep learning with noisy SGD and binary classification in Section 6; finally, we conclude our paper and highlight some important implications in Section 7. All the proofs are provided in the supplementary material.

## 2 The Hierarchical Feature Mapping of DNNs and Its Relationship to Markov Chains

We first introduce some notation for deep neural networks (DNNs). As shown in Figure 1, a DNN with $L$ hidden layers can be seen as $L$ feature maps that sequentially conduct feature transformations $L$ times on the input $Z$. After $L$ feature transformations, the learned feature will be the input of a classifier (or regressor) at the output layer. If the distribution on a single input is $\mathcal{D}$, then we denote the distribution after going through the $k$-th hidden layer as $\mathcal{D}_k$ and the corresponding variable as $\tilde{Z}_k$, where $k = 1, \ldots, L$. The weight of the whole network is denoted by $W = [w_1, \ldots, w_L; h] \in \mathcal{W}$, where $\mathcal{W}$ is the space of all possible weights. As shown in Figure 2, the input $S$ is transformed layer by layer and the output of the $k$-th hidden layer is $T_k$, where $k = 1, \ldots, L$. We also denote the $j$-th sample after going through the $k$-th hidden layer by $Z_{kj}$. In other words, we have the following relationship:

$$Z \sim \mathcal{D}, \tag{2.1}$$

$$\tilde{Z}_k \sim \mathcal{D}_k, \quad \text{for } k = 1, \ldots, L, \tag{2.2}$$

$$S = \{Z_1, \ldots, Z_n\} \sim \mathcal{D}^n, \tag{2.3}$$

$$T_k = \{Z_{k1}, \ldots, Z_{kn}\} \sim \mathcal{D}_k^n, \quad \text{when given } w_1, \ldots, w_k, \quad \text{for } k = 1, \ldots, L. \tag{2.4}$$

We now have a Markov model for DNNs, as shown in Figure 2. From the Markov property, we know that if $U \to V \to W$ forms a Markov chain, then $W$ is conditionally independent of $U$ given $V$. Furthermore, from the data processing inequality (Cover and Thomas 2012), we have $I(U, W) \le I(U, V)$, and the equality holds if and only if $U \to W \to V$ also forms a Markov chain. Applying the data processing inequality to the Markov chain in Figure 2, we have,

$$I(T_L, h) \le I(T_{L-1}, h) \le I(T_{L-2}, h) \le \ldots \le I(S, h) \le I(S, W). \tag{2.5}$$

This means that the mutual information between input and output is non-increasing as it goes through the network layer by layer. As the feature map in each layer is likely to be non-invertible, the mutual information between the input and output is likely to strictly decrease as it goes through each layer. This encourages the study of the strong data processing inequality (Polyanskiy and Wu 2015, Ahlswede and Gács 1976). In the next section, we prove that the strong data processing inequality holds for DNNs in general.
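The data processing inequality can be verified numerically on a small discrete chain. The sketch below is illustrative only: the three-variable chain $U \to V \to W$, its alphabets, and its transition matrices are hypothetical choices, not quantities from the paper.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

# Hypothetical chain U -> V -> W over binary alphabets.
p_u = np.array([0.3, 0.7])
p_v_given_u = np.array([[0.9, 0.1], [0.2, 0.8]])  # rows: u, columns: v
p_w_given_v = np.array([[0.7, 0.3], [0.4, 0.6]])  # rows: v, columns: w

joint_uv = p_u[:, None] * p_v_given_u
# Markov property: p(u, w) = sum_v p(u, v) p(w | v).
joint_uw = joint_uv @ p_w_given_v

i_uv = mutual_information(joint_uv)
i_uw = mutual_information(joint_uw)
print(f"I(U,V) = {i_uv:.4f}, I(U,W) = {i_uw:.4f}")
```

Because the second channel is noisy, the printed value of $I(U,W)$ is strictly smaller than $I(U,V)$, which is the behaviour that motivates the strong data processing inequality.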

## 3 Information Loss in DNNs

In the previous section, we modeled a DNN as a Markov chain and concluded that the mutual information between input and output in DNNs is non-increasing by using the data processing inequality. The equalities in equation (2.5) will not hold for most cases because the feature mapping is likely to be non-invertible, and therefore we can apply the strong data processing inequality to achieve tighter inequalities.

For a Markov chain $U \to V \to W$, the random transformation $P_{W|V}$ can be seen as a channel from an information-theoretic point of view. Strong data processing inequalities (SDPIs) quantify the intuitive observation that the noise inside the channel $P_{W|V}$ will reduce the mutual information between $U$ and $W$. That is, there exists $\eta < 1$, such that

$$I(U, W) \le \eta I(U, V). \tag{3.1}$$

Formally,

###### Theorem 1 (Ahlswede and Gács 1976).

Consider a Markov chain $W \to X \to Y$ and the corresponding random mapping $P_{Y|X}$. If the mapping $P_{Y|X}$ is not noiseless (that is, we cannot recover any $X$ perfectly with probability $1$ from the observed random variable $Y$), then there exists $\eta < 1$, such that

$$I(W, Y) \le \eta I(W, X). \tag{3.2}$$

More details can be found in a comprehensive survey on SDPIs (Polyanskiy and Wu 2015).
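As a concrete instance of an SDPI, consider a binary symmetric channel with crossover probability $\delta$, for which the contraction coefficient is known to be $\eta = (1 - 2\delta)^2$. The sketch below checks inequality (3.2) numerically for a chain $W \to X \to Y$; the source distribution and the two crossover probabilities are hypothetical values chosen for illustration.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

def bsc(crossover):
    """Transition matrix of a binary symmetric channel."""
    return np.array([[1 - crossover, crossover], [crossover, 1 - crossover]])

p_w = np.array([0.3, 0.7])  # hypothetical source distribution
first_channel = bsc(0.1)    # W -> X
delta = 0.2
second_channel = bsc(delta) # X -> Y, the channel P_{Y|X} of Theorem 1

joint_wx = p_w[:, None] * first_channel
joint_wy = joint_wx @ second_channel

i_wx = mutual_information(joint_wx)
i_wy = mutual_information(joint_wy)
eta = (1 - 2 * delta) ** 2  # contraction coefficient of the BSC
print(f"I(W,Y) = {i_wy:.4f} <= eta * I(W,X) = {eta * i_wx:.4f}")
```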

Let us consider the $k$-th hidden layer ($1 \le k \le L$) in Figure 1. This can be seen as a randomized transformation mapping from one distribution $\mathcal{D}_{k-1}$ to another distribution $\mathcal{D}_k$ (when $k = 1$, we denote $\mathcal{D}_0 = \mathcal{D}$). We then denote the parameters of the $k$-th hidden layer by $w_k$. Without loss of generality, let $w_k$ be a matrix in $\mathbb{R}^{d_k \times d_{k-1}}$. Also, we denote the activation function in this layer by $\sigma_k(\cdot)$.

###### Definition 1 (Contraction Layer).

A layer in a deep neural network is called a contraction layer if it causes information loss.

We now give the first result, which quantifies the information loss in DNNs.

###### Corollary 1 (Information Loss in DNNs).

Consider a DNN as shown in Figure 1 and its corresponding Markov model in Figure 2. If its $k$-th ($1 \le k \le L$) hidden layer is a contraction layer, then there exists $\eta_k < 1$, such that

$$I(T_k, h) \le \eta_k I(T_{k-1}, h). \tag{3.3}$$

We show that the commonly used convolutional and pooling layers are contraction layers.

###### Lemma 1 (proved in A.1).

For any layer in a DNN, with parameters $w_k \in \mathbb{R}^{d_k \times d_{k-1}}$, if $\operatorname{rank}(w_k) < d_{k-1}$, it is a contraction layer.

Corollary 1 shows that the mutual information decreases after it goes through a contraction layer. From Lemma 1, we know that the convolutional and pooling layers are guaranteed to be contraction layers. Besides, when the shape of the weight satisfies $d_k < d_{k-1}$, it also leads to a contraction layer. For a fully connected layer with shape $d_k \ge d_{k-1}$, the contraction property may not hold because the weight may sometimes be of full column rank, leading to a noiseless and invertible mapping. However, the activation function (e.g., ReLU) applied subsequently can contribute to forming a contraction layer. Without loss of generality, in this paper, we let all $L$ hidden layers be contraction layers, e.g., convolutional or pooling layers.

## 4 Exponential Bound on the Generalization Error of DNNs

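Before we introduce our main theorem, we need to restrict the loss function $\ell(W, Z)$ to be $\sigma$-sub-Gaussian with respect to $Z$ for any $W \in \mathcal{W}$.

###### Definition 2 (σ-sub-Gaussian).

A random variable $X$ is said to be $\sigma$-sub-Gaussian if the following inequality holds for any $\lambda \in \mathbb{R}$,

$$\mathbb{E}[\exp(\lambda(X - \mathbb{E}[X]))] \le \exp\left(\frac{\sigma^2\lambda^2}{2}\right). \tag{4.1}$$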
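A standard way to satisfy Definition 2 is a bounded loss: by Hoeffding's lemma, a random variable taking values in $[a, b]$ is $(b-a)/2$-sub-Gaussian. The sketch below checks inequality (4.1) on a grid of $\lambda$ values for a 0-1 loss modelled as a Bernoulli variable; the error probability `p` and the grid are illustrative assumptions.

```python
import numpy as np

def centered_mgf_bernoulli(p, lam):
    """E[exp(lam * (X - E[X]))] for X ~ Bernoulli(p), computed exactly."""
    return p * np.exp(lam * (1 - p)) + (1 - p) * np.exp(-lam * p)

sigma = 0.5                       # (b - a) / 2 for a loss with values in [0, 1]
p = 0.3                           # hypothetical probability of a unit loss
lams = np.linspace(-10, 10, 401)  # grid of lambda values to test

lhs = centered_mgf_bernoulli(p, lams)
rhs = np.exp(sigma**2 * lams**2 / 2)

# Small relative tolerance guards against rounding at lambda = 0, where both sides equal 1.
is_sub_gaussian = bool(np.all(lhs <= rhs * (1 + 1e-12)))
print(is_sub_gaussian)
```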

We now present our main theorem, which gives an exponential bound for the expected generalization error of deep learning.

###### Theorem 1 (proved in A.2).

For a DNN with $L$ hidden layers, input $S$, and parameters $W$, assume that the loss function $\ell(W, Z)$ is $\sigma$-sub-Gaussian with respect to $Z$ for any $W \in \mathcal{W}$. Without loss of generality, let all $L$ hidden layers be contraction layers (e.g., convolutional or pooling layers). Then, the expected generalization error can be upper bounded as follows,

$$\mathbb{E}[R(W) - R_S(W)] \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S,W)}, \tag{4.2}$$

where $\eta < 1$ is the geometric mean of the information loss factors of all $L$ contraction layers, that is

$$\eta = \left(\prod_{i=1}^{L} \eta_i\right)^{\frac{1}{L}}. \tag{4.3}$$

The upper bound in Theorem 1 may be loose w.r.t. the mutual information $I(S,W)$ since we used the inequality $I(S,h) \le I(S,W)$ in the proof. We also have that

$$I(S,W) \le H(S). \tag{4.4}$$

By definition, $\eta < 1$ holds uniformly for any given $L$, and $\exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)$ is a strictly decreasing function of $L$. These imply that as the number of contraction layers $L$ increases, the expected generalization error will decrease exponentially to zero.
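To make the exponential decay concrete, the bound (4.2) together with the geometric mean (4.3) can be evaluated directly. This is a minimal sketch: the function names and all numerical values (per-layer factors of 0.8, $\sigma = 0.5$, $n = 10^4$, and a mutual information of 50 nats) are hypothetical, and each $\eta_i$ is assumed to lie strictly between 0 and 1.

```python
import math

def geometric_mean(etas):
    """Geometric mean of the per-layer information loss factors, as in (4.3)."""
    return math.exp(sum(math.log(e) for e in etas) / len(etas))

def generalization_bound(etas, sigma, n, mutual_info):
    """Right-hand side of (4.2); `etas` holds one factor in (0, 1) per contraction layer."""
    base = math.sqrt(2 * sigma**2 * mutual_info / n)
    if not etas:  # no contraction layers: the depth factor equals 1
        return base
    num_layers = len(etas)
    eta = geometric_mean(etas)
    depth_factor = math.exp(-(num_layers / 2) * math.log(1 / eta))
    return depth_factor * base

# Hypothetical values for illustration only.
sigma, n, mutual_info = 0.5, 10_000, 50.0
bounds = [generalization_bound([0.8] * depth, sigma, n, mutual_info) for depth in range(0, 21, 5)]
for depth, bound in zip(range(0, 21, 5), bounds):
    print(f"L = {depth:2d}: bound = {bound:.5f}")
```

The depth factor simplifies to $\eta^{L/2}$, so each additional contraction layer multiplies the bound by $\sqrt{\eta_i}$ in this sketch.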

Theorem 1 implies that deeper neural networks will improve the generalization error. However, this does not mean that the deeper the better. Recall that $\mathbb{E}[R(W)] = G(\mathcal{D}, P_{W|S}) + \mathbb{E}[R_S(W)]$; a small $G(\mathcal{D}, P_{W|S})$ does not imply a small $\mathbb{E}[R(W)]$, since the expected training error $\mathbb{E}[R_S(W)]$ increases due to information loss. Specifically, if the information about the relationship between the observation and the target is lost, fitting the training data will become difficult and the expected training error will increase. Our results highlight a new research direction for designing deep neural networks, namely that we should increase the number of contraction layers while keeping the expected training error small.

The information loss factor $\eta$ plays an essential role in the generalization of deep learning. A successful deep learning model should filter out redundant information as much as possible while retaining sufficient information to fit the training data. Some deep learning operations, such as convolution, pooling, and activation, are very good at filtering out redundant information. This is consistent with the information-bottleneck theory (Shwartz-Ziv and Tishby 2017), namely that with more contraction layers, more redundant information will be removed while predictive information is preserved.

## 5 Stability and Risk Bound of Deep Learning

It is known that the expected generalization error is equivalent to the notion of stability of the learning algorithm (Shalev-Shwartz et al. 2010). In this section, we show that deep learning satisfies a weak notion of stability and, further, show that it is a necessary condition for the learnability of deep learning. We first present a definition of stability, as proposed by (Shalev-Shwartz et al. 2010).

###### Definition 3 (Shalev-Shwartz et al. 2010).

A learning algorithm $\mathcal{A}$ is average replace-one stable with rate $\alpha(n)$ under distribution $\mathcal{D}$ if

$$\left|\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{S \sim \mathcal{D}^n,\, Z'_i \sim \mathcal{D}}\left[\ell(W, Z'_i) - \ell(W^i, Z'_i)\right]\right| \le \alpha(n). \tag{5.1}$$

For deep learning, we define another notion of stability, which we term average replace-one hypothesis stability.

###### Definition 4 (average replace-one hypothesis stability).

A learning algorithm $\mathcal{A}$ is average replace-one hypothesis stable with rate $\beta(n)$ under distribution $\mathcal{D}$ if

$$\left|\mathbb{E}_{W \sim P_{W|S}}\left[\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{S \sim \mathcal{D}^n,\, Z'_i \sim \mathcal{D}}\left[\ell(W, Z'_i) - \ell(W^i, Z'_i)\right]\right]\right| \le \beta(n). \tag{5.2}$$

The difference between average replace-one hypothesis stability and average replace-one stability is that the former takes an expectation over $W \sim P_{W|S}$, which makes it weaker than average replace-one stability. It can clearly be seen that average replace-one stability with rate $\alpha(n)$ implies average replace-one hypothesis stability with rate $\alpha(n)$. We now prove that deep learning is average replace-one hypothesis stable.

###### Theorem 1 (proved in A.3).

Deep learning is average replace-one hypothesis stable with rate

$$\beta(n) = \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S,W)}. \tag{5.3}$$

Deep learning algorithms are average replace-one hypothesis stable, which means that replacing one training example does not alter the output too much as shown in Theorem 1.

As concluded by (Shalev-Shwartz et al. 2010), the property of average replace-one stability is a necessary condition for characterizing learnability. We have also shown that average replace-one stability implies average replace-one hypothesis stability. Therefore, the property of average replace-one hypothesis stability is a necessary condition for the learnability of deep learning. However, it is not a sufficient condition. Finding a necessary and sufficient condition for characterizing learnability for deep learning remains unsolved.

## 6 Learnability, Sample Complexity, and Risk Bound for Deep Learning

We have derived an exponential upper bound on the expected generalization error for deep learning. In this section, we further derive the excess risk bound and analyze the sample complexity and learnability of deep learning in a general setting. We can roughly bound $I(S,W)$ by $H(S)$, which will be large when the input tends to be uniformly distributed. Nevertheless, for some specific deep learning algorithms, a much tighter upper bound on the mutual information can be obtained. Here, we consider two cases where a tighter bound can be achieved, namely noisy SGD and binary classification in deep learning. We also derive the sample complexity for these two algorithms.

### 6.1 Learnability and Risk Bound for Deep Learning

This subsection provides a qualitative analysis of the expected risk bound of deep learning. By picking any global expected risk minimizer,

$$W^* = \arg\min_{W \in \mathcal{W}} R(W) \tag{6.1}$$

and picking any empirical risk minimizer

$$W = \arg\min_{W \in \mathcal{W}} R_S(W), \tag{6.2}$$

we have

$$\mathbb{E}_{W,S}[R_S(W)] \le \mathbb{E}_{W,S}[R_S(W^*)] = \mathbb{E}_S[R_S(W^*)] = R(W^*). \tag{6.3}$$

Note that a global expected risk minimizer $W^*$ is neither dependent on $S$ nor a random variable, while $W$ is dependent on $S$. As mentioned before, we consider the case when $W$ is a random variable drawn according to the distribution $P_{W|S}$.

Therefore, by combining (4.2) and (6.3), we obtain an expected excess risk bound as follows,

$$\mathbb{E}_{W,S}[R(W)] - R^* \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S,W)}, \tag{6.4}$$

where $R^* = R(W^*)$.

It is worth noticing that $R^*$ is a non-decreasing function of $L$, because the rule constructed over the feature space after $L$ layers cannot be better than the best possible rule over the feature space after $L-1$ layers, since all information in the former originates from the latter. We now reach two conclusions:

• As the number of contraction layers $L$ goes to infinity, both the excess risk and the generalization error will decrease to zero. By strong data processing inequalities, $I(T_L, h)$ will decrease to zero, which means that the output feature will lose all predictive information. Therefore, no samples are needed for training, as any learned predictor over the transformed feature will perform no better than random guessing. In this case, although the sample complexity is zero, the optimal risk $R^*$ reaches its worst case.

• As we increase the number of contraction layers $L$, the sample complexity will decrease. The result is surprising when $R^*$ is not increasing. This finding implies that if we could efficiently find a global empirical risk minimizer, we would need smaller sample complexities when increasing the number of contraction layers which only filter out redundant information. However, it is hard to find the global empirical risk minimizer and to control all contraction layers such that they only filter out redundant information. A promising new research direction is to increase the number of contraction layers while keeping a small $R^*$, $\mathbb{E}[R(W)]$, or $\mathbb{E}[R_S(W)]$.

We now discuss whether deep learning is learnable in general. From equation (6.4) and using Markov's inequality, we have that with probability at least $1 - \delta$,

$$R(W) - R^* \le \frac{1}{\delta}\exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S,W)}. \tag{6.5}$$
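The step from the expected excess risk bound (6.4) to the high-probability bound (6.5) can be spelled out as follows. The shorthand $X$ and $B$ is introduced here only for readability and does not appear in the original; the argument relies on $R(W) \ge R^*$, which holds because $W^*$ minimizes the expected risk.

```latex
% Shorthand introduced for this derivation (not in the original text):
X := R(W) - R^* \ge 0, \qquad
B := \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S,W)} .

% Markov's inequality for the nonnegative variable X, assuming E[X] > 0
% (if E[X] = 0 then X = 0 almost surely and the claim is immediate):
\Pr\left[ X \ge \frac{\mathbb{E}[X]}{\delta} \right] \le \delta .

% Hence, with probability at least 1 - \delta, and using E[X] \le B from (6.4):
X < \frac{\mathbb{E}[X]}{\delta} \le \frac{B}{\delta} .
```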

We know that the notion of PAC-learnability in traditional learning theory must hold for any distribution over the instance space . However, for the general case as presented in our main result, with different distribution , the upper bound of the term can vary and sometimes may be quite large even of the order (e.g. ). In this case, the sample complexity is , which is trivial and cannot guarantee the learnability as increases. In the next two subsections, we will show that for some specific deep learning algorithms, a tighter excess risk bound can be achieved and the sample complexity will be the order of .

### 6.2 Generalization Error Bound With Noisy SGD in Deep Learning

Consider the problem of empirical risk minimization (ERM) via noisy mini-batch SGD in deep learning, where the weight is updated successively based on samples drawn from the training set and with a noisy perturbation. The motivations of adding noise in SGD are mainly to prevent the learning algorithm from overfitting and to avoid an exponential time to escape from saddle points (Du et al. 2017).

Denote the weight of a DNN at the time step by and is the mini-batch with batch size at the -th iteration. Then we have the updating rules and where ; and denote the learning rates at the time step for each layer; and are noisy terms that add a white Gaussian noise to each element of the update independently. Here, we assume that the updates of

have bounded second moment. That is, there exists

, such that for all . We have the following generalization error bound.

###### Theorem 1 (proved in A.4).

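For noisy SGD with bounded second moment in updates and $T$ iterations, the expected generalization error can be upper bounded by

$$\left|\mathbb{E}[R(W) - R_S(W)]\right| \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{\sigma^2}{n}\sum_{i=1}^{T}\frac{M^2\alpha_i^2}{\sigma_i^2}}. \tag{6.6}$$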
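The right-hand side of (6.6) is straightforward to evaluate once a learning-rate schedule and a noise schedule are fixed. The sketch below is only an illustration: the function name, its argument names, and every numerical value are hypothetical, with `step_sizes` standing for the $\alpha_i$, `noise_stds` for the $\sigma_i$, and `M` for the second-moment bound on the updates.

```python
import math

def noisy_sgd_bound(num_layers, eta, sigma, n, M, step_sizes, noise_stds):
    """Right-hand side of (6.6) for given per-iteration step sizes and noise levels."""
    if len(step_sizes) != len(noise_stds):
        raise ValueError("step_sizes and noise_stds must have one entry per iteration")
    # Sum over iterations of M^2 * alpha_i^2 / sigma_i^2.
    info_term = sum((M * a / s) ** 2 for a, s in zip(step_sizes, noise_stds))
    # exp(-(L/2) * log(1/eta)) equals eta ** (L/2).
    depth_factor = eta ** (num_layers / 2)
    return depth_factor * math.sqrt(sigma**2 / n * info_term)

# Hypothetical schedule: T = 100 iterations, decaying step size, constant noise level.
T = 100
step_sizes = [0.1 / (t + 1) for t in range(T)]
noise_stds = [0.05] * T
bound = noisy_sgd_bound(num_layers=10, eta=0.8, sigma=0.5, n=10_000, M=1.0,
                        step_sizes=step_sizes, noise_stds=noise_stds)
print(f"bound = {bound:.6f}")
```

Larger injected noise or smaller step sizes shrink the sum inside the square root, which is how the noise level trades off against the information the weights retain about the sample in this bound.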

With the theorem above, we further give the learnability and sample complexity of the noisy SGD in deep learning.

###### Theorem 2 (proved and further discussed in B.1).

The noisy SGD with bounded second moment in updates for deep learning is learnable, with the sample complexity  .

### 6.3 Generalization Error Bound for Binary Classification in Deep Learning

This subsection gives an upper bound on the expected generalization error for deep learning in the case of binary classification. For binary classification, we denote the function space of the classifier $h$ of the output layer by $\mathcal{H}$ and its VC-dimension by $d$. The training set is $S = \{Z_1, \ldots, Z_n\}$. When given $w_1, \ldots, w_L$, we have the transformed training set $T_L$ after $L$ feature mappings, and $\mathcal{H}$ is a class of binary-valued functions on the transformed feature space. For any integer $m$, we present the definition of the growth function of $\mathcal{H}$ as in Mohri et al. 2012.

###### Definition 5 (Growth Function).

The growth function of a function class $\mathcal{H}$ is defined as

$$\Pi_{\mathcal{H}}(m) = \max_{x_1, \ldots, x_m \in \mathcal{X}} \left|\{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\}\right|. \tag{6.7}$$
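As a small worked example of Definition 5, the growth function can be enumerated for a very simple class. The sketch below uses one-dimensional threshold classifiers $h_t(x) = \mathbb{1}[x \ge t]$, a hypothetical class chosen for illustration (it is not the output-layer class of the paper); for this class the count is $m + 1$, far below the $2^m$ labelings of an unrestricted class.

```python
def threshold_labelings(points):
    """All distinct labelings of `points` by classifiers h_t(x) = 1[x >= t]."""
    pts = sorted(points)
    # One threshold below all points, one between each consecutive pair, one above all.
    thresholds = [pts[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    thresholds.append(pts[-1] + 1.0)
    return {tuple(int(x >= t) for x in points) for t in thresholds}

def growth_function_thresholds(m):
    """Growth function of 1-D thresholds for m >= 1; distinct points attain the maximum."""
    points = [float(i) for i in range(m)]
    return len(threshold_labelings(points))

for m in range(1, 6):
    print(m, growth_function_thresholds(m), 2**m)
```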

Now, we give a generalization error bound and sample complexity for binary classification in deep learning in the following two theorems.

###### Theorem 3 (proved in A.5).

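For binary classification in deep learning, the upper bound on the expected generalization error is given by

$$\left|\mathbb{E}[R(W) - R_S(W)]\right| \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2 d}{n}} \quad \text{for } n \le d \tag{6.8}$$

and

$$\left|\mathbb{E}[R(W) - R_S(W)]\right| \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2 d}{n}\log\left(\frac{en}{d}\right)} \quad \text{for } n > d. \tag{6.9}$$

###### Theorem 4 (proved in B.2).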

The binary classification in deep learning is learnable, with the sample complexity 222 We use the notation to hide constants and poly-logarithmic factors of and ..

## 7 Conclusions

In this paper, we obtain an exponential-type upper bound on the expected generalization error of deep learning and prove that deep learning satisfies a weak notion of stability. We also prove that deep learning algorithms are learnable in some specific cases, such as when employing noisy SGD and for binary classification. Our results have valuable implications for other critical problems in deep learning that require further investigation. (1) Traditional statistical learning theory can validate the success of deep neural networks, because (i) the mutual information between the learned feature and the output-layer weight, $I(T_L, h)$, decreases as the number of contraction layers $L$ increases, by the strong data processing inequality (3.3), and (ii) smaller mutual information implies higher algorithmic stability (Raginsky et al. 2016) and smaller complexity of the algorithmic hypothesis class (Liu et al. 2017). (2) The information loss factor $\eta$ offers the potential to explore the characteristics of various convolution, pooling, and activation functions as well as other deep learning tricks; that is, how they contribute to the reduction in the expected generalization error. (3) The weak notion of stability for deep learning is only a necessary condition for learnability (Shalev-Shwartz et al. 2010), and deep learning is learnable in some specific cases. It would be interesting to explore a necessary and sufficient condition for the learnability of deep learning in general. (4) When increasing the number of contraction layers in DNNs, it is worth further exploring how to filter out redundant information while keeping the useful part intact.

## References

• Ahlswede and Gács 1976 Ahlswede, R. and Gács, P. (1976). Spreading of sets in product spaces and hypercontraction of the Markov operator. The annals of probability, pages 925–939.
• Cover and Thomas 2012 Cover, T. M. and Thomas, J. A. (2012). Elements of information theory. John Wiley & Sons.
• Domingos 2000 Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning, pages 231–238.
• Donsker and Varadhan 1983 Donsker, M. D. and Varadhan, S. S. (1983). Asymptotic evaluation of certain Markov process expectations for large time. iv. Communications on Pure and Applied Mathematics, 36(2):183–212.
• Du et al. 2017 Du, S. S., Jin, C., Lee, J. D., Jordan, M. I., Singh, A., and Poczos, B. (2017). Gradient descent can take exponential time to escape saddle points. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 1067–1077. Curran Associates, Inc.
• Liu et al. 2017 Liu, T., Lugosi, G., Neu, G., and Tao, D. (2017). Algorithmic stability and hypothesis complexity. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2159–2167. PMLR.
• Mohri et al. 2012 Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of machine learning. MIT press.
• Pensia et al. 2018 Pensia, A., Jog, V., and Loh, P.-L. (2018). Generalization Error Bounds for Noisy, Iterative Algorithms. ArXiv e-prints.
• Polyanskiy and Wu 2015 Polyanskiy, Y. and Wu, Y. (2015). Strong data-processing inequalities for channels and Bayesian networks. ArXiv e-prints.
• Raginsky et al. 2016 Raginsky, M., Rakhlin, A., Tsao, M., Wu, Y., and Xu, A. (2016). Information-theoretic analysis of stability and bias of learning algorithms. In Information Theory Workshop (ITW), 2016 IEEE, pages 26–30. IEEE.
• Russo and Zou 2015 Russo, D. and Zou, J. (2015). How much does your data exploration overfit? Controlling bias via information usage. ArXiv e-prints.
• Shalev-Shwartz et al. 2010 Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. (2010). Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670.
• Shwartz-Ziv and Tishby 2017 Shwartz-Ziv, R. and Tishby, N. (2017). Opening the Black Box of Deep Neural Networks via Information. ArXiv e-prints.
• Sonoda and Murata 2015 Sonoda, S. and Murata, N. (2015). Neural network with unbounded activation functions is universal approximator. arXiv preprint arXiv:1505.03654.
• Xu and Raginsky 2017 Xu, A. and Raginsky, M. (2017). Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems 30, pages 2524–2533. Curran Associates, Inc.

## A.1 Proof of Lemma 1

For the $k$-th hidden layer, considering any input $x_{k-1}$ and the corresponding output $x_k$, we have (the bias for each layer can be included in $w_k$ via homogeneous coordinates)

$$x_k = \sigma_k(w_k x_{k-1}). \tag{7.1}$$

Because $\operatorname{rank}(w_k) < d_{k-1}$, the dimension of its right null space is greater than or equal to $1$. Denoting the right null space of $w_k$ by $\operatorname{Null}(w_k)$, we can pick a non-zero vector $\alpha \in \operatorname{Null}(w_k)$ such that $w_k \alpha = 0$.

Then, we have

$$\sigma_k(w_k(x_{k-1} + \alpha)) = \sigma_k(w_k x_{k-1}) = x_k. \tag{7.2}$$

Therefore, for any input $x_{k-1}$ of the $k$-th hidden layer, there exists $x'_{k-1} = x_{k-1} + \alpha$ such that their corresponding outputs are the same. This means that, for any $x_{k-1}$, we cannot recover it perfectly with probability $1$.

We conclude that the mapping is noisy and the corresponding layer will cause information loss.
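The null-space construction in this proof can be reproduced numerically. The sketch below is an illustration under assumed values: a random $3 \times 5$ weight matrix (so its rank is at most 3, below the input dimension 5), a ReLU activation, and a null-space direction obtained from the singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 5, 3                      # hypothetical layer shape with d_out < d_in
w = rng.standard_normal((d_out, d_in))  # rank(w) <= 3 < 5

def relu(z):
    return np.maximum(z, 0.0)

# With full matrices, the last rows of V^T span the right null space of w,
# because w has at most d_out non-zero singular values.
_, singular_values, vt = np.linalg.svd(w)
alpha = vt[-1]

x = rng.standard_normal(d_in)
out_original = relu(w @ x)
out_shifted = relu(w @ (x + alpha))

print(np.linalg.norm(w @ alpha))            # numerically zero
print(np.allclose(out_original, out_shifted))
```

Two distinct inputs, `x` and `x + alpha`, produce the same layer output, so the input cannot be recovered from the output, which is the sense in which the layer loses information.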

## A.2 Proof of Theorem 1

First, by the law of total expectation, we have,

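$$\mathbb{E}[R(W) - R_S(W)] = \mathbb{E}\left[\mathbb{E}[R(W) - R_S(W) \mid w_1, \ldots, w_L]\right]. \tag{7.3}$$

We now give an upper bound on $\mathbb{E}[R(W) - R_S(W) \mid w_1, \ldots, w_L]$ similar to that detailed in (Russo and Zou 2015, Xu and Raginsky 2017).

###### Lemma 1.

Under the same conditions as in Theorem 1, the upper bound on $\mathbb{E}[R(W) - R_S(W) \mid w_1, \ldots, w_L]$ is given by

$$\mathbb{E}[R(W) - R_S(W) \mid w_1, \ldots, w_L] \le \sqrt{\frac{2\sigma^2}{n} I(T_L, h)}. \tag{7.4}$$

###### Proof.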

We have,

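$$\begin{aligned}
\mathbb{E}[R(W) - R_S(W) \mid w_1, \ldots, w_L]
&= \mathbb{E}_{h,S}\left[\mathbb{E}_{Z \sim \mathcal{D}}[\ell(W, Z)] - \frac{1}{n}\sum_{i=1}^{n}\ell(W, Z_i) \,\middle|\, w_1, \ldots, w_L\right] \\
&= \mathbb{E}_{h,T_L}\left[\mathbb{E}_{Z_L \sim \mathcal{D}_L}[\ell(h, Z_L)] - \frac{1}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})\right].
\end{aligned} \tag{7.5}$$

We are now going to upper bound

$$\mathbb{E}_{h,T_L}\left[\mathbb{E}_{Z_L \sim \mathcal{D}_L}[\ell(h, Z_L)] - \frac{1}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})\right]. \tag{7.6}$$

Note that $T_L \sim \mathcal{D}_L^n$ when given $w_1, \ldots, w_L$ because of the Markov property. We adopt the classical idea of a ghost sample in statistical learning theory. That is, we sample another $n$-tuple $T'_L$:

$$T'_L = \{Z'_{L1}, \ldots, Z'_{Ln}\}, \tag{7.7}$$

where each element $Z'_{Li}$ is drawn i.i.d. from the distribution $\mathcal{D}_L$. We now have,

$$\begin{aligned}
&\mathbb{E}_{h,T_L}\left[\mathbb{E}_{Z_L \sim \mathcal{D}_L}[\ell(h, Z_L)] - \frac{1}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})\right] \\
&\quad= \mathbb{E}_{h,T_L}\left[\mathbb{E}_{T'_L}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(h, Z'_{Li})\right] - \frac{1}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})\right] \\
&\quad= \mathbb{E}_{h,T_L,T'_L}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(h, Z'_{Li})\right] - \mathbb{E}_{h,T_L}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})\right].
\end{aligned} \tag{7.8}$$

We know that the output classifier $h$ in the output layer follows the distribution $P_{h|T_L}$. We denote the joint distribution of $h$ and $T_L$ by $P_{h,T_L}$. Also, we denote the marginal distributions of $h$ and $T_L$ by $P_h$ and $P_{T_L}$, respectively. Therefore, we have,

$$\begin{aligned}
&\mathbb{E}_{h,T_L,T'_L}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(h, Z'_{Li})\right] - \mathbb{E}_{h,T_L}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})\right] \\
&\quad= \mathbb{E}_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(h', Z'_{Li})\right] - \mathbb{E}_{(h,T_L) \sim P_{h,T_L}}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})\right].
\end{aligned} \tag{7.9}$$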

We now bound the above term by the mutual information by employing the following lemma.

###### Lemma 2 (Donsker and Varadhan 1983).

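Let $P$ and $Q$ be two probability distributions on the same measurable space. Then the KL-divergence between $P$ and $Q$ can be represented as,

$$D(P \| Q) = \sup_{F}\left[\mathbb{E}_P[F] - \log \mathbb{E}_Q[e^F]\right], \tag{7.10}$$

where the supremum is taken over all measurable functions $F$ such that $\mathbb{E}_Q[e^F] < \infty$.

Using Lemma 2 with $F = \frac{\lambda}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})$ for any $\lambda \in \mathbb{R}$, we have,

$$\begin{aligned}
I(T_L, h) &= D(P_{h,T_L} \| P_h \times P_{T_L}) \\
&\ge \mathbb{E}_{(h,T_L) \sim P_{h,T_L}}\left[\frac{\lambda}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})\right] - \log \mathbb{E}_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[e^{\frac{\lambda}{n}\sum_{i=1}^{n}\ell(h', Z'_{Li})}\right].
\end{aligned} \tag{7.11}$$

As the loss function $\ell(h', Z'_{Li})$ is $\sigma$-sub-Gaussian w.r.t. $Z'_{Li}$ for any $h'$, and $Z'_{Li}$ is i.i.d. for $i = 1, \ldots, n$, the average $\frac{1}{n}\sum_{i=1}^{n}\ell(h', Z'_{Li})$ is $\frac{\sigma}{\sqrt{n}}$-sub-Gaussian. By definition, we have,

$$\log \mathbb{E}_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[e^{\frac{\lambda}{n}\sum_{i=1}^{n}\ell(h', Z'_{Li})}\right] \le \frac{\sigma^2\lambda^2}{2n} + \mathbb{E}_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[\frac{\lambda}{n}\sum_{i=1}^{n}\ell(h', Z'_{Li})\right]. \tag{7.12}$$

Substituting inequality (7.12) into inequality (7.11), we have,

$$\begin{aligned}
&\mathbb{E}_{(h,T_L) \sim P_{h,T_L}}\left[\frac{\lambda}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})\right] - \frac{\sigma^2\lambda^2}{2n} - \mathbb{E}_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[\frac{\lambda}{n}\sum_{i=1}^{n}\ell(h', Z'_{Li})\right] - I(T_L, h) \\
&\quad= -\frac{\sigma^2\lambda^2}{2n} - \left[\mathbb{E}_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(h', Z'_{Li})\right] - \mathbb{E}_{(h,T_L) \sim P_{h,T_L}}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})\right]\right]\lambda - I(T_L, h) \\
&\quad\le 0.
\end{aligned} \tag{7.13}$$

The left-hand side of the above inequality is a quadratic function of $\lambda$ that is always less than or equal to zero, so its discriminant must be non-positive. Therefore we have,

$$\begin{aligned}
&\left|\mathbb{E}_{h' \sim P_h,\, T'_L \sim P_{T_L}}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(h', Z'_{Li})\right] - \mathbb{E}_{(h,T_L) \sim P_{h,T_L}}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(h, Z_{Li})\right]\right|^2 \\
&\quad= \left|\mathbb{E}[R(W) - R_S(W) \mid w_1, \ldots, w_L]\right|^2 \le \frac{2\sigma^2}{n} I(T_L, h),
\end{aligned} \tag{7.14}$$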

which completes the proof. ∎
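The passage from the quadratic inequality (7.13) to the bound (7.14) uses only the discriminant condition for a concave quadratic. It can be written out as follows, where $\Delta$ is shorthand introduced here for the difference of the two expectations in (7.13).

```latex
% \Delta denotes the bracketed difference of expectations in (7.13) (shorthand, not in the original).
-\frac{\sigma^2}{2n}\lambda^2 - \Delta\,\lambda - I(T_L,h) \le 0
\quad \text{for all } \lambda \in \mathbb{R}.

% A quadratic a\lambda^2 + b\lambda + c with a < 0 is nonpositive for every \lambda
% if and only if its discriminant b^2 - 4ac is nonpositive.
% Here a = -\sigma^2/(2n), b = -\Delta, c = -I(T_L,h):
\Delta^2 - 4\cdot\frac{\sigma^2}{2n}\cdot I(T_L,h) \le 0
\;\Longrightarrow\;
\Delta^2 \le \frac{2\sigma^2}{n}\, I(T_L,h).
```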

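By Theorem 1 (Ahlswede and Gács 1976), we can use the strong data processing inequality for the Markov chain in Figure 2 recursively. Thus, we have,

$$\begin{aligned}
\sqrt{\frac{2\sigma^2}{n} I(T_L, h)}
&\le \sqrt{\frac{2\sigma^2}{n}\eta_L I(T_{L-1}, h)}
\le \sqrt{\frac{2\sigma^2}{n}\eta_L\eta_{L-1} I(T_{L-2}, h)}
\le \ldots \le \sqrt{\frac{2\sigma^2}{n}\left(\prod_{k=1}^{L}\eta_k\right) I(S, h)} \\
&= \sqrt{\frac{2\sigma^2\eta^L}{n} I(S, h)}
= \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S, h)},
\end{aligned} \tag{7.15}$$

where

$$\eta = \left(\prod_{i=1}^{L}\eta_i\right)^{\frac{1}{L}} < 1. \tag{7.16}$$

We then have

$$\begin{aligned}
\mathbb{E}[R(W) - R_S(W)]
&= \mathbb{E}\left[\mathbb{E}[R(W) - R_S(W) \mid w_1, \ldots, w_L]\right] \\
&\le \mathbb{E}\left(\exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S, h)}\right) \\
&\le \mathbb{E}\left(\exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S, W)}\right) \\
&\le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S, W)},
\end{aligned} \tag{7.17}$$

where the second inequality follows from the data processing inequality. (Throughout the proof of Theorem 1, we represent the conditional mutual information implicitly because its condition can be eliminated in the last step. We hide the condition of the conditional mutual information throughout this paper, but as shown in the proofs later, the condition will also be eliminated in the end.)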

## A.3 Proof of Theorem 1

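Let $S' = \{Z'_1, \ldots, Z'_n\}$ be a ghost sample of $S$. We have

$$\mathbb{E}[R(W) - R_S(W)] = \mathbb{E}_{W \sim P_{W|S}}\left[\mathbb{E}_{S,S'}\left[\frac{1}{n}\sum_{i=1}^{n}\ell(W, Z'_i)\right] - \mathbb{E}_S\left[\frac{1}{n}\sum_{i=1}^{n}\ell(W, Z_i)\right]\right], \tag{7.18}$$

where $W^i$ stands for the output of the algorithm when the input is $S^i = \{Z_1, \ldots, Z_{i-1}, Z'_i, Z_{i+1}, \ldots, Z_n\}$, and $Z_i$ and $Z'_i$ ($i = 1, \ldots, n$) are i.i.d. examples.
From equation (7.14), we have

$$\left|\mathbb{E}[R(W) - R_S(W) \mid w_1, \ldots, w_L]\right| \le \sqrt{\frac{2\sigma^2}{n} I(T_L, h)}. \tag{7.19}$$

Using similar proofs as in Theorem 1, we have

$$\left|\mathbb{E}[R(W) - R_S(W)]\right| \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S, W)}. \tag{7.20}$$

Note that the difference between the above equation and our main theorem is that the absolute value is adopted for the expected generalization error, which may be slightly tighter, but all the conclusions are the same. Combining (7.18) and (7.20), we have

$$\left|\mathbb{E}_{W \sim P_{W|S}}\left[\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{S \sim \mathcal{D}^n,\, Z'_i \sim \mathcal{D}}\left[\ell(W, Z'_i) - \ell(W^i, Z'_i)\right]\right]\right| \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S, W)}, \tag{7.21}$$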

which ends the proof.

## A.4 Proof of Theorem 1

Our analysis here is mainly based on the work of Raginsky et al. 2016 and Pensia et al. 2018.

By (7.14) and (7.15), we have,

$$\left|\mathbb{E}[R(W) - R_S(W) \mid w_1, \ldots, w_L]\right| \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S, h)}. \tag{7.22}$$

We now bound the right-hand side of the above inequality; the theorem then follows from the law of total expectation.

At the final iteration, we have and the algorithm outputs . We have the following Markov relation when the initialization