1 Introduction
We study the standard statistical learning framework, where the instance space is denoted by $\mathcal{Z}$ and the hypothesis space is denoted by $\mathcal{W}$. The training sample is denoted by $S = \{z_1, \ldots, z_n\}$, where each element $z_i$ is drawn i.i.d. from an unknown distribution $D$. A learning algorithm $\mathcal{A}$ can be regarded as a randomized mapping from the training sample space $\mathcal{Z}^n$ to the hypothesis space $\mathcal{W}$, characterized by a Markov kernel $P_{W|S}$: given the training sample $S$, the algorithm picks a hypothesis in $\mathcal{W}$ according to the conditional distribution $P_{W|S}$.
We introduce a loss function $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}^{+}$ to measure the quality of a prediction w.r.t. a hypothesis. For any hypothesis $W$ learned by $\mathcal{A}$, we define the expected risk
(1.1) $R(W) = \mathbb{E}_{z \sim D}\left[\ell(W, z)\right]$
and the empirical risk
(1.2) $R_S(W) = \frac{1}{n} \sum_{i=1}^{n} \ell(W, z_i).$
For a learning algorithm $\mathcal{A}$, the generalization error is defined as
(1.3) $G(D, P_{W|S}) = R(W) - R_S(W).$
A small generalization error implies that the learned hypothesis will have similar performance on the training and test datasets.
In this paper, we study the following expected generalization error for deep learning:
(1.4) $\mathbb{E}\left[R(W) - R_S(W)\right],$
where the expectation is over the joint distribution $P_{S,W} = P_S \otimes P_{W|S}$. We have the following decomposition:
(1.5) $\mathbb{E}\left[R(W)\right] = \mathbb{E}\left[R(W) - R_S(W)\right] + \mathbb{E}\left[R_S(W)\right],$
where the first term on the right-hand side is the expected generalization error, and the second term reflects how well the learned hypothesis fits the training data in expectation.
When designing a learning algorithm, we want the expectation of the expected risk, i.e., $\mathbb{E}[R(W)]$, to be as small as possible. However, obtaining small values for the expected generalization error and the expected empirical risk at the same time is difficult. Usually, if a model fits the training data too well, it may generalize poorly on the test data; this is known as the bias-variance trade-off problem (Domingos 2000). Surprisingly, deep learning has empirically shown its power to minimize $\mathbb{E}[R(W) - R_S(W)]$ and $\mathbb{E}[R_S(W)]$ simultaneously. Deep neural networks have a small expected empirical risk because networks with deep architectures can efficiently and compactly represent highly-varying functions (Sonoda and Murata 2015). However, the theoretical justification for their small expected generalization errors remains elusive.
In this paper, we study the expected generalization error for deep learning from an information-theoretic point of view. We will show that, as the number of layers grows, the expected generalization error decreases exponentially to zero (detailed discussions are given in Sections 4 and 6). Specifically, in Theorem 1, we prove that
$$\mathbb{E}\left[R(W) - R_S(W)\right] \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; W)},$$
where $L$ is the number of information loss layers in deep neural networks (DNNs), $\eta < 1$ is a constant depending on the average information loss of each layer, $\sigma$ is a constant depending on the loss function, $n$ is the size of the training sample $S$, and $I(S; W)$ is the mutual information between the input training sample $S$ and the output hypothesis $W$. The advantage of using the mutual information between the input and output to bound the expected generalization error (Russo and Zou 2015; Xu and Raginsky 2017) is that it depends on almost every aspect of the learning algorithm, including the data distribution, the complexity of the hypothesis class, and the properties of the learning algorithm itself.
Our result is consistent with the bias-variance trade-off. Although the expected generalization error decreases exponentially to zero as the number of information loss layers increases, the expected empirical risk increases, since the information loss is harmful to data fitting. This implies that, when designing deep neural networks, greater efforts should be made to balance the information loss and the expected training error.
We also provide stability and risk bound analyses for deep learning. We prove that deep learning satisfies a weaker notion of stability, which we term average replace-one hypothesis stability, implying that, in expectation, the output hypothesis will not change too much when one point in the training sample is replaced. Under the assumption that the algorithm mapping is deterministic, the notion of average replace-one hypothesis stability degenerates to average replace-one stability, as proposed by Shalev-Shwartz et al. (2010), which has been identified as a necessary condition for learnability in the general learning setting introduced by Vapnik.
We further provide an expected excess risk bound for deep learning and show that the sample complexity of deep learning will decrease as $L$ increases, which surprisingly indicates that, by increasing $L$, we need fewer samples for training. However, this does not imply that increasing the number of layers will always help. An extreme case is that, as $L$ goes to infinity, the output feature will lose all predictive information and no training sample is needed, because random guessing is optimal. We also derive upper bounds of the expected generalization error for some specific deep learning algorithms, such as noisy stochastic gradient descent (SGD) and binary classification for deep learning. We further show that these two algorithms are PAC-learnable with sample complexities of the order $O(1/\epsilon^2)$.
The remainder of this paper is organized as follows. In Section 2, we relate DNNs to Markov chains. Section 3 exploits the strong data processing inequality to derive how the mutual information between intermediate feature representations and the output varies in DNNs. Our main results are given in Section 4, which provides an exponential generalization error bound for DNNs in terms of the depth $L$; we then analyze the stability of deep learning in Section 5 and the learnability of deep learning with noisy SGD and binary classification in Section 6; finally, we conclude our paper and highlight some important implications in Section 7. All the proofs are provided in the supplementary material.
2 The Hierarchical Feature Mapping of DNNs and Its Relationship to Markov Chains
We first introduce some notation for deep neural networks (DNNs). As shown in Figure 1, a DNN with $L$ hidden layers can be seen as $L$ feature maps that sequentially conduct feature transformations $L$ times on the input $X$. After $L$ feature transformations, the learned feature will be the input of a classifier (or regressor) at the output layer. If the distribution on a single input is $D$, then we denote the distribution after going through the $k$-th hidden layer by $D_k$ and the corresponding variable by $X_k$, where $k = 1, \ldots, L$. The weight of the whole network is denoted by $W \in \mathcal{W}$, where $\mathcal{W}$ is the space of all possible weights. As shown in Figure 2, the input is transformed layer by layer, and the output of the $k$-th hidden layer is $X_k$, where $X_0 = X$. We also denote the $i$-th sample after going through the $k$-th hidden layer by $z_{i,k}$. In other words, we have the following relationships:
(2.1) $X_0 = X, \quad z_{i,0} = z_i,$
(2.2) $X_k = \sigma_k(W_k X_{k-1}), \quad k = 1, \ldots, L,$
(2.3) $z_{i,k} = \sigma_k(W_k z_{i,k-1}), \quad k = 1, \ldots, L,$
(2.4) $X \to X_1 \to X_2 \to \cdots \to X_L.$
We now have a Markov model for DNNs, as shown in Figure 2. From the Markov property, we know that if $X \to Y \to Z$ forms a Markov chain, then $Z$ is conditionally independent of $X$ given $Y$. Furthermore, from the data processing inequality (Cover and Thomas 2012), we have $I(X; Y) \ge I(X; Z)$, and the equality holds if and only if $X \to Z \to Y$ also forms a Markov chain. Applying the data processing inequality to the Markov chain layer by layer, we have
(2.5) $I(X; X_1) \ge I(X; X_2) \ge \cdots \ge I(X; X_L).$
This means that the mutual information between the input and the output is non-increasing as it goes through the network layer by layer. As the feature map in each layer is likely to be non-invertible, the mutual information between the input and the output is likely to strictly decrease as it goes through each layer. This encourages the study of the strong data processing inequality (Ahlswede and Gács 1976; Polyanskiy and Wu 2015). In the next section, we prove that the strong data processing inequality holds for DNNs in general.
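The data processing inequality above can be checked numerically on small discrete channels. The sketch below is illustrative and not from the paper: the input distribution and the two stochastic matrices standing in for layer maps are arbitrary choices.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in nats for a joint pmf given as a 2-D array."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])).sum())

# A small Markov chain X -> X1 -> X2 built from arbitrary stochastic matrices
# (rows sum to 1), standing in for two stochastic layer maps.
rng = np.random.default_rng(0)
px = np.array([0.3, 0.7])                   # distribution of the input X
K1 = rng.dirichlet(np.ones(3), size=2)      # channel P(X1|X), 2 -> 3 states
K2 = rng.dirichlet(np.ones(2), size=3)      # channel P(X2|X1), 3 -> 2 states

joint_x_x1 = px[:, None] * K1               # P(X, X1)
joint_x_x2 = joint_x_x1 @ K2                # P(X, X2) by the Markov property

i_x_x1 = mutual_information(joint_x_x1)
i_x_x2 = mutual_information(joint_x_x2)
assert i_x_x2 <= i_x_x1 + 1e-12             # data processing inequality
```

Any choice of `px`, `K1`, and `K2` satisfies the final assertion; strict decrease is the typical case when the second channel is noisy.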
3 Information Loss in DNNs
In the previous section, we modeled a DNN as a Markov chain and concluded, by the data processing inequality, that the mutual information between the input and the output in DNNs is non-increasing. The equalities in (2.5) will not hold in most cases, because the feature mapping is likely to be non-invertible; we can therefore apply the strong data processing inequality to obtain tighter inequalities.
For a Markov chain $X \to Y \to Z$, the random transformation $P_{Z|Y}$ can be seen as a channel from an information-theoretic point of view. Strong data processing inequalities (SDPIs) quantify an intuitive observation: the noise inside the channel $P_{Z|Y}$ will reduce the mutual information between $X$ and $Z$. That is, there exists $\eta < 1$ such that
(3.1) $I(X; Z) \le \eta\, I(X; Y).$
Formally,
Theorem 1 (Ahlswede and Gács 1976).
Consider a Markov chain $X \to Y \to Z$ and the corresponding random mapping $P_{Z|Y}$. If the mapping is not noiseless (that is, we cannot recover any realization of $Y$ perfectly with probability $1$ from the observed random variable $Z$), then there exists $\eta < 1$ such that
(3.2) $I(X; Z) \le \eta\, I(X; Y).$
More details can be found in a comprehensive survey on SDPIs (Polyanskiy and Wu 2015).
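A classical concrete instance from the SDPI literature: the binary symmetric channel $\mathrm{BSC}(\delta)$ has contraction coefficient $\eta = (1-2\delta)^2$. The sketch below checks inequality (3.1) with this coefficient numerically; the input distributions and first-stage channels are arbitrary illustrative choices, not from the paper.

```python
import numpy as np

def mi(joint):
    """Mutual information (nats) of a 2-D joint pmf."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    m = joint > 0
    return float((joint[m] * np.log(joint[m] / (pa @ pb)[m])).sum())

delta = 0.1
eta = (1 - 2 * delta) ** 2                    # known SDPI coefficient of BSC(delta)
bsc = np.array([[1 - delta, delta],
                [delta, 1 - delta]])          # channel P(Z|Y)

rng = np.random.default_rng(1)
for _ in range(100):
    px = rng.dirichlet(np.ones(2))            # arbitrary input distribution
    K1 = rng.dirichlet(np.ones(2), size=2)    # arbitrary channel X -> Y
    j_xy = px[:, None] * K1                   # P(X, Y)
    j_xz = j_xy @ bsc                         # P(X, Z): Y -> Z through the BSC
    assert mi(j_xz) <= eta * mi(j_xy) + 1e-12 # strong data processing inequality
```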
Let us consider the $k$-th hidden layer ($1 \le k \le L$) in Figure 1. This can be seen as a randomized transformation mapping from one distribution $D_{k-1}$ to another distribution $D_k$ (when $k = 1$, we denote $D_0 = D$). We then denote the parameters of the $k$-th hidden layer by $W_k$. Without loss of generality, let $W_k$ be a matrix in $\mathbb{R}^{d_k \times d_{k-1}}$. Also, we denote the activation function in this layer by $\sigma_k$.
Definition 1 (Contraction Layer).
A layer in a deep neural network is called a contraction layer if it causes information loss.
We now give the first result, which quantifies the information loss in DNNs.
Corollary 1 (Information Loss in DNNs).
If the $k$-th hidden layer is a contraction layer, then there exists $\eta_k < 1$ such that $I(X; X_k) \le \eta_k\, I(X; X_{k-1})$.
We next show that the most-used convolutional and pooling layers are contraction layers.
Lemma 1 (proved in A.1).
For any layer in a DNN with parameters $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$, if $d_k < d_{k-1}$, it is a contraction layer.
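Lemma 1's mechanism — a weight matrix with fewer output than input dimensions has a nontrivial right null space, so distinct inputs collapse to the same activation — can be illustrated with a small sketch. The dimensions and the ReLU choice below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 5, 3                      # d_out < d_in: the layer must lose information
W = rng.standard_normal((d_out, d_in))

relu = lambda a: np.maximum(a, 0.0)

x = rng.standard_normal(d_in)
# Any nonzero vector in the right null space of W is invisible to the layer.
# The last right-singular vector of a 3x5 matrix lies in that null space.
_, _, Vt = np.linalg.svd(W)
v = Vt[-1]
assert np.allclose(W @ v, 0.0, atol=1e-10)

y1 = relu(W @ x)
y2 = relu(W @ (x + 2.0 * v))            # a different input...
assert not np.allclose(x, x + 2.0 * v)
assert np.allclose(y1, y2)              # ...mapped to the same output
```

Since two distinct inputs share one output, no decoder can recover the input from the output with probability 1, which is exactly the "not noiseless" condition of Theorem 1 in Section 3.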
Corollary 1 shows that the mutual information decreases after it goes through a contraction layer. From Lemma 1, we know that convolutional and pooling layers are guaranteed to be contraction layers. Besides, when the shape of the weight satisfies $d_k < d_{k-1}$, the layer is also a contraction layer. For a fully connected layer with $d_k \ge d_{k-1}$, the contraction property may not hold, because the weight may sometimes be of full column rank, leading to a noiseless and invertible mapping. However, the activation function (e.g., the ReLU activation) employed subsequently can contribute to forming a contraction layer. Without loss of generality, in this paper, we let all $L$ hidden layers be contraction layers, e.g., convolutional or pooling layers.
4 Exponential Bound on the Generalization Error of DNNs
Before we introduce our main theorem, we need to restrict the loss function $\ell(w, z)$ to be $\sigma$-sub-Gaussian with respect to $Z$ for any $w \in \mathcal{W}$.
Definition 2 ($\sigma$-sub-Gaussian).
A random variable $X$ is said to be $\sigma$-sub-Gaussian if the following inequality holds for any $\lambda \in \mathbb{R}$:
(4.1) $\mathbb{E}\left[\exp\left(\lambda\left(X - \mathbb{E}[X]\right)\right)\right] \le \exp\left(\frac{\lambda^2 \sigma^2}{2}\right).$
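As a quick numerical sanity check of this definition: any random variable bounded in $[0,1]$, e.g. a Bernoulli variable, is $\frac{1}{2}$-sub-Gaussian by Hoeffding's lemma. The sketch below verifies the moment-generating-function bound on a grid of $\lambda$ values; the choice $p = 0.3$ is illustrative.

```python
import math

# A Bernoulli(p) variable is bounded in [0, 1], hence 1/2-sub-Gaussian by
# Hoeffding's lemma: E[exp(lam*(X - E X))] <= exp(lam^2 * (1/2)^2 / 2).
p = 0.3
sigma = 0.5
for lam in [x / 10 for x in range(-50, 51)]:
    # Exact MGF of the centered Bernoulli variable.
    mgf = (1 - p) * math.exp(lam * (0 - p)) + p * math.exp(lam * (1 - p))
    assert mgf <= math.exp(lam ** 2 * sigma ** 2 / 2) + 1e-12
```

A bounded loss function is therefore automatically sub-Gaussian, which is the typical way the assumption of Theorem 1 is satisfied in practice.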
We now present our main theorem, which gives an exponential bound for the expected generalization error of deep learning.
Theorem 1 (proved in A.2).
For a DNN with $L$ hidden layers, input $S$, and parameters $W$, assume that the loss function $\ell(w, z)$ is $\sigma$-sub-Gaussian with respect to $Z$ for any $w \in \mathcal{W}$. Without loss of generality, let all $L$ hidden layers be contraction layers (e.g., convolutional or pooling layers). Then, the expected generalization error can be upper bounded as follows:
(4.2) $\mathbb{E}\left[R(W) - R_S(W)\right] \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; W)},$
where $\eta < 1$
is the geometric mean of the information loss factors $\eta_k$ of all $L$ contraction layers, that is,
(4.3) $\eta = \left(\prod_{k=1}^{L} \eta_k\right)^{1/L}.$
The upper bound in Theorem 1 may be loose w.r.t. the mutual information, since an additional inequality was invoked in the proof. We also have that
(4.4) $\mathbb{E}\left[R(W) - R_S(W)\right] \le \sqrt{\frac{2\sigma^2}{n} I(S; W)}.$
By definition, $\eta < 1$ holds uniformly for any given $L$, and $\exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)$ is a strictly decreasing function of $L$. These imply that, as the number of contraction layers increases, the expected generalization error will decrease exponentially to zero.
Theorem 1 implies that deeper neural networks will improve the generalization error. However, this does not mean that the deeper the better. Recall the decomposition $\mathbb{E}[R(W)] = \mathbb{E}[R(W) - R_S(W)] + \mathbb{E}[R_S(W)]$; a small expected generalization error does not imply a small expected risk $\mathbb{E}[R(W)]$, since the expected training error may increase due to information loss. Specifically, if information about the relationship between the observation and the target is lost, fitting the training data will become difficult and the expected training error will increase. Our results highlight a new research direction for designing deep neural networks, namely that we should increase the number of contraction layers while keeping the expected training error small.
The information loss factor $\eta$ plays an essential role in the generalization of deep learning. A successful deep learning model should filter out as much redundant information as possible while retaining sufficient information to fit the training data. Some deep learning operations, such as convolution, pooling, and activation, are very good at filtering out redundant information. This further confirms the information-bottleneck theory (Shwartz-Ziv and Tishby 2017), namely that with more contraction layers, more redundant information is removed while the prediction information is preserved.
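To see the exponential decay of Theorem 1 concretely, one can evaluate a bound of the form $\exp(-\frac{L}{2}\log\frac{1}{\eta})\sqrt{2\sigma^2 I / n}$ for fixed constants. All numbers below ($\eta$, $\sigma$, the mutual-information value, and $n$) are assumed illustrative values, not measurements.

```python
import math

# Evaluate bound(L) = exp(-L/2 * log(1/eta)) * sqrt(2*sigma^2*I/n)
#                   = sqrt(eta^L) * sqrt(2*sigma^2*I/n)
eta, sigma, i_sw, n = 0.8, 1.0, 50.0, 10_000  # illustrative constants

def bound(L):
    return math.exp(-L / 2 * math.log(1 / eta)) * math.sqrt(2 * sigma**2 * i_sw / n)

values = [bound(L) for L in range(0, 11)]
# Each extra contraction layer shrinks the bound by the constant factor sqrt(eta),
# i.e., the decay in L is exactly geometric.
for a, b in zip(values, values[1:]):
    assert math.isclose(b / a, math.sqrt(eta), rel_tol=1e-9)
assert values[-1] < values[0]
```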
5 Stability and Risk Bound of Deep Learning
It is known that the expected generalization error is equivalent to the notion of stability of the learning algorithm (Shalev-Shwartz et al. 2010). In this section, we show that deep learning satisfies a weak notion of stability and, further, that this notion is a necessary condition for the learnability of deep learning. We first present a definition of stability proposed by Shalev-Shwartz et al. (2010).
Definition 3 (Shalev-Shwartz et al. 2010).
A learning algorithm $\mathcal{A}$ is average replace-one stable with rate $\epsilon(n)$ under distribution $D$ if
(5.1) $\left|\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{S \sim D^n,\, z_i' \sim D}\left[\ell\left(\mathcal{A}(S^{(i)}), z_i\right) - \ell\left(\mathcal{A}(S), z_i\right)\right]\right| \le \epsilon(n),$
where $S^{(i)} = \{z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_n\}$ denotes the sample with the $i$-th point replaced.
For deep learning, we define another notion of stability, which we term average replace-one hypothesis stability.
Definition 4 (average replace-one hypothesis stability).
A learning algorithm $\mathcal{A}$ is average replace-one hypothesis stable with rate $\epsilon(n)$ under distribution $D$ if
(5.2) $\left|\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{S,\, z_i',\, W,\, W^{(i)}}\left[\ell\left(W^{(i)}, z_i\right) - \ell\left(W, z_i\right)\right]\right| \le \epsilon(n),$
where $W \sim P_{W|S}$ and $W^{(i)} \sim P_{W|S^{(i)}}$.
The difference between average replace-one hypothesis stability and average replace-one stability is that the former additionally takes an expectation over the hypothesis $W$, which makes it weaker than average replace-one stability. It can clearly be seen that average replace-one stability with rate $\epsilon(n)$ implies average replace-one hypothesis stability with rate $\epsilon(n)$. We now prove that deep learning is average replace-one hypothesis stable.
Theorem 1 (proved in A.3).
Deep learning is average replace-one hypothesis stable with rate
(5.3) $\epsilon(n) = \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; W)}.$
As shown in Theorem 1, deep learning algorithms are average replace-one hypothesis stable, which means that replacing one training example does not, in expectation, alter the output too much.
As concluded by Shalev-Shwartz et al. (2010), average replace-one stability is a necessary condition for characterizing learnability. We have also shown that average replace-one stability implies average replace-one hypothesis stability. Therefore, average replace-one hypothesis stability is a necessary condition for the learnability of deep learning. However, it is not a sufficient condition. Finding a necessary and sufficient condition for characterizing the learnability of deep learning remains an open problem.
6 Learnability, Sample Complexity, and Risk Bound for Deep Learning
We have derived an exponential upper bound on the expected generalization error for deep learning. In this section, we further derive the excess risk bound and analyze the sample complexity and learnability of deep learning in a general setting. We can roughly bound $I(S; W)$ by $H(S)$, which will be large when the input tends to be uniformly distributed. Nevertheless, for some specific deep learning algorithms, a much tighter upper bound on the mutual information can be obtained. Here, we consider two cases where a tighter bound can be achieved: noisy SGD and binary classification in deep learning. We also derive the sample complexity for these two algorithms.
6.1 Learnability and Risk Bound for Deep Learning
This subsection provides a qualitative analysis of the expected risk bound of deep learning. Picking any global expected risk minimizer
(6.1) $w^* \in \arg\min_{w \in \mathcal{W}} R(w)$
and any global empirical risk minimizer
(6.2) $\hat{W} \in \arg\min_{w \in \mathcal{W}} R_S(w),$
we have
(6.3) $\mathbb{E}\left[R(\hat{W})\right] - R(w^*) \le \mathbb{E}\left[R(\hat{W}) - R_S(\hat{W})\right].$
Note that a global expected risk minimizer $w^*$ neither depends on $S$ nor is a random variable, while $\hat{W}$ depends on $S$. As mentioned before, we consider the case when $\hat{W}$ is a random variable drawn according to the distribution $P_{W|S}$.
Therefore, by combining (4.2) and (6.3), we obtain an expected excess risk bound as follows:
(6.4) $\mathbb{E}\left[R(\hat{W})\right] - R(w^*) \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; \hat{W})},$
where $R(w^*) = \min_{w \in \mathcal{W}} R(w)$.
It is worth noticing that $R(w^*)$ is a non-decreasing function of $L$, because a rule constructed over the transformed feature space cannot be better than the best possible rule over the original input space, since all information in the transformed features originates from the input space. We now reach two conclusions:

• As the number of contraction layers $L$ goes to infinity, both the excess risk and the generalization error decrease to zero. By strong data processing inequalities, $I(S; \hat{W})$ decreases to zero, which means that the output feature loses all predictive information. Therefore, no samples are needed for training, as any learned predictor over the transformed feature will perform no better than random guessing. In this case, although the sample complexity is zero, the optimal risk $R(w^*)$ reaches its worst case.

• As we increase the number of contraction layers $L$, the sample complexity decreases. The result is surprising when $R(w^*)$ is not increasing. This finding implies that, if we could efficiently find a global empirical risk minimizer, we would need a smaller sample complexity when increasing the number of contraction layers, provided they only filter out redundant information. However, it is hard to find the global empirical risk minimizer and to control all contraction layers such that they only filter out redundant information. A promising new research direction is to increase the number of contraction layers while keeping the expected training error small.
We now discuss whether deep learning is learnable in general. From (6.4) and Markov's inequality, we have that, with probability at least $1 - \delta$,
(6.5) $R(\hat{W}) - R(w^*) \le \frac{1}{\delta}\exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n} I(S; \hat{W})}.$
We know that the notion of PAC-learnability in traditional learning theory must hold for any distribution over the instance space $\mathcal{Z}$. However, in the general case presented in our main result, the upper bound on the term $I(S; \hat{W})$ can vary with the distribution $D$ and may sometimes be quite large, even of order $O(n)$ (e.g., when $H(S) = O(n)$). In this case, the bound does not vanish as $n$ grows, so it is trivial and cannot guarantee learnability as $n$ increases. In the next two subsections, we show that, for some specific deep learning algorithms, a tighter excess risk bound can be achieved, with a sample complexity of the order $O(1/\epsilon^2)$.
6.2 Generalization Error Bound With Noisy SGD in Deep Learning
Consider the problem of empirical risk minimization (ERM) via noisy minibatch SGD in deep learning, where the weight is updated successively based on samples drawn from the training set and with a noisy perturbation. The motivations for adding noise in SGD are mainly to prevent the learning algorithm from overfitting and to avoid exponential time to escape saddle points (Du et al. 2017).
Denote the weights of the DNN at time step $t$ by $W_t$, and let $B_t$ be the minibatch, with batch size $b$, at the $t$-th iteration. Then we have the updating rule $W_{t+1} = W_t - \alpha_t\, g(W_t, B_t) + n_t$, where $g(W_t, B_t)$ is the minibatch gradient, $\alpha_t$ denotes the learning rate at time step $t$ for each layer, and $n_t$ is a noisy term that adds white Gaussian noise to each element of the update independently. Here, we assume that the updates have a bounded second moment; that is, there exists $C > 0$ such that $\mathbb{E}\|g(W_t, B_t)\|^2 \le C$ for all $t$. We have the following generalization error bound.
Theorem 1 (proved in A.4).
For noisy SGD with a bounded second moment in its updates and $T$ iterations, the expected generalization error can be upper bounded by
(6.6)
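The noisy minibatch update $W_{t+1} = W_t - \alpha\, g + \text{noise}$ described above can be sketched on a toy least-squares problem. All hyperparameters and problem sizes below are illustrative assumptions, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

def loss(w):
    """Empirical mean-squared error on the full training set."""
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(d)
alpha, sigma_noise, batch = 0.05, 0.01, 32    # illustrative hyperparameters
init_loss = loss(w)
for t in range(500):
    idx = rng.choice(n, size=batch, replace=False)        # minibatch draw
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch   # minibatch gradient
    w = w - alpha * grad + sigma_noise * rng.standard_normal(d)  # noisy update
assert loss(w) < init_loss
```

The injected Gaussian noise limits how much information about the sample can be encoded in the final weights, which is the mechanism behind the mutual-information bound of the theorem above.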
With the theorem above, we further establish the learnability and sample complexity of noisy SGD in deep learning.
Theorem 2 (proved and further discussed in B.1).
Noisy SGD with a bounded second moment in its updates for deep learning is learnable, with a sample complexity of the order $O(1/\epsilon^2)$.
6.3 Generalization Error Bound for Binary Classification in Deep Learning
This subsection gives an upper bound on the expected generalization error for deep learning in the case of binary classification. For binary classification, we denote the function space of the classifier at the output layer by $\mathcal{F}$ and its VC-dimension by $d$. The training set is $S = \{z_1, \ldots, z_n\}$. Given the weights of the hidden layers, we have the transformed training set after the $L$ feature mappings, and $\mathcal{F}$ is a class of functions from the transformed feature space to $\{0, 1\}$. For any integer $m$, we present the definition of the growth function of $\mathcal{F}$, as in (Mohri et al. 2012).
Definition 5 (Growth Function).
The growth function of a function class $\mathcal{F}$ is defined as
(6.7) $\Pi_{\mathcal{F}}(m) = \max_{x_1, \ldots, x_m} \left|\left\{\left(f(x_1), \ldots, f(x_m)\right) : f \in \mathcal{F}\right\}\right|.$
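As a worked instance of Definition 5, consider one-dimensional threshold classifiers $h_t(x) = \mathbb{1}[x \ge t]$; brute-force enumeration shows their growth function is $m + 1$. This class and the helper below are illustrative, not from the paper.

```python
def growth_function(points):
    """Count the distinct labelings of `points` by thresholds h_t(x) = 1[x >= t]."""
    pts = sorted(points)
    # Candidate thresholds: below all points, at each point, and above all points.
    thresholds = [pts[0] - 1.0] + pts + [pts[-1] + 1.0]
    labelings = {tuple(int(x >= t) for x in pts) for t in thresholds}
    return len(labelings)

# For m distinct points on the line, thresholds realize exactly m + 1 labelings,
# which matches Sauer's lemma bound for a class of VC dimension 1.
for m in range(1, 8):
    assert growth_function(list(range(m))) == m + 1
```

Because the growth function here grows polynomially rather than as $2^m$, quantities like $\log \Pi_{\mathcal{F}}(n)$ stay small, which is what makes growth-function-based mutual-information bounds nontrivial.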
Now, we give a generalization error bound and sample complexity for binary classification in deep learning in the following two theorems.
Theorem 3 (proved in A.5).
For binary classification in deep learning, the upper bound of the expected generalization error is given by
(6.8) $\mathbb{E}\left[R(W) - R_S(W)\right] \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2}{n}\log \Pi_{\mathcal{F}}(n)}$
and
(6.9) $\mathbb{E}\left[R(W) - R_S(W)\right] \le \exp\left(-\frac{L}{2}\log\frac{1}{\eta}\right)\sqrt{\frac{2\sigma^2 d}{n}\log\frac{en}{d}}.$
Theorem 4 (proved in B.2).
Binary classification in deep learning is learnable, with a sample complexity of $\tilde{O}(1/\epsilon^2)$, where the notation $\tilde{O}$ hides constants and polylogarithmic factors of $1/\epsilon$ and $1/\delta$.
7 Conclusions
In this paper, we obtain an exponential-type upper bound on the expected generalization error of deep learning and prove that deep learning satisfies a weak notion of stability. We also prove that deep learning algorithms are learnable in some specific cases, such as with noisy SGD and for binary classification. Our results have valuable implications for other critical problems in deep learning that require further investigation. (1) Traditional statistical learning theory can validate the success of deep neural networks, because (i) the mutual information between the training sample and the learned weights decreases with an increasing number of contraction layers, and (ii) smaller mutual information implies higher algorithmic stability (Raginsky et al. 2016) and smaller complexity of the algorithmic hypothesis class (Liu et al. 2017). (2) The information loss factor $\eta$ offers the potential to explore the characteristics of various convolution, pooling, and activation functions, as well as other deep learning tricks; that is, how they contribute to the reduction of the expected generalization error. (3) The weak notion of stability for deep learning is only a necessary condition for learnability (Shalev-Shwartz et al. 2010), and deep learning is learnable in some specific cases; it would be interesting to explore a necessary and sufficient condition for the learnability of deep learning in general. (4) When increasing the number of contraction layers in DNNs, it is worth further exploring how to filter out redundant information while keeping the useful part intact.
References
Ahlswede, R. and Gács, P. (1976). Spreading of sets in product spaces and hypercontraction of the Markov operator. The Annals of Probability, pages 925–939.
Cover, T. M. and Thomas, J. A. (2012). Elements of Information Theory. John Wiley & Sons.
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of the 17th International Conference on Machine Learning, pages 231–238.
Donsker, M. D. and Varadhan, S. S. (1983). Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212.
Du, S. S., Jin, C., Lee, J. D., Jordan, M. I., Singh, A., and Poczos, B. (2017). Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems 30, pages 1067–1077. Curran Associates, Inc.
Liu, T., Lugosi, G., Neu, G., and Tao, D. (2017). Algorithmic stability and hypothesis complexity. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2159–2167. PMLR.
Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2012). Foundations of Machine Learning. MIT Press.
Pensia, A., Jog, V., and Loh, P.-L. (2018). Generalization error bounds for noisy, iterative algorithms. arXiv e-prints.
Polyanskiy, Y. and Wu, Y. (2015). Strong data-processing inequalities for channels and Bayesian networks. arXiv e-prints.
Raginsky, M., Rakhlin, A., Tsao, M., Wu, Y., and Xu, A. (2016). Information-theoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop (ITW), pages 26–30. IEEE.
Russo, D. and Zou, J. (2015). How much does your data exploration overfit? Controlling bias via information usage. arXiv e-prints.
Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. (2010). Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670.
Shwartz-Ziv, R. and Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv e-prints.
Sonoda, S. and Murata, N. (2015). Neural network with unbounded activation functions is universal approximator. arXiv preprint arXiv:1505.03654.
Xu, A. and Raginsky, M. (2017). Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems 30, pages 2524–2533. Curran Associates, Inc.
A.1 Proof of Lemma 1
For the $k$-th hidden layer, consider any input $x$ and the corresponding output; the bias for each layer can be included in $W_k$ via homogeneous coordinates. We have
(7.1) $y = \sigma_k(W_k x).$
Because $d_k < d_{k-1}$, the dimension of the right null space of $W_k$ is greater than or equal to $d_{k-1} - d_k \ge 1$. Denoting the right null space of $W_k$ by $\mathcal{N}(W_k)$, we can pick a nonzero vector $v \in \mathcal{N}(W_k)$ such that $W_k v = 0$. Then, we have
(7.2) $\sigma_k\left(W_k (x + v)\right) = \sigma_k(W_k x) = y.$
Therefore, for any input $x$ of the $k$-th hidden layer, there exists a different input $x + v$ such that their corresponding outputs are the same. This means that, for any output $y$, we cannot recover the input perfectly with probability $1$.
We conclude that the mapping is noisy and the corresponding layer will cause information loss.
A.2 Proof of Theorem 1
First, by the law of total expectation, we have,
(7.3) 
We now give an upper bound on the expected generalization error, similar to those detailed in (Russo and Zou 2015; Xu and Raginsky 2017).
Lemma 1.
Under the same conditions as in Theorem 1, we have
(7.4) $\left|\mathbb{E}\left[R(W) - R_S(W)\right]\right| \le \sqrt{\frac{2\sigma^2}{n} I(S; W)}.$
Proof.
We have,
(7.5) 
We are now going to upper bound
(7.6) 
Note that, given the transformed features, $W$ is conditionally independent of $S$ by the Markov property. We adopt the classical idea of a ghost sample from statistical learning theory. That is, we sample another tuple $S'$:
(7.7) $S' = \{z_1', \ldots, z_n'\},$
where each element $z_i'$ is drawn i.i.d. from the distribution $D$. We now have,
(7.8) 
We know that the output classifier in the output layer follows the distribution $P_{W|S}$. We denote the joint distribution of $S$ and $W$ by $P_{S,W}$. Also, we denote the marginal distributions of $S$ and $W$ by $P_S$ and $P_W$, respectively. Therefore, we have,
(7.9) 
We now bound the above term by the mutual information, employing the following lemma.
Lemma 2 (Donsker and Varadhan 1983).
Let $P$ and $Q$ be two probability distributions on the same measurable space. Then the KL-divergence between $P$ and $Q$ can be represented as
(7.10) $D(P \,\|\, Q) = \sup_{f}\left\{\mathbb{E}_P[f] - \log \mathbb{E}_Q\left[e^{f}\right]\right\},$
where the supremum is taken over all measurable functions $f$ such that $\mathbb{E}_Q\left[e^{f}\right] < \infty$.
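Lemma 2 can be verified numerically on a finite alphabet: every function $f$ gives a lower bound on the KL-divergence, with equality at the optimal witness $f^* = \log(dP/dQ)$. A minimal sketch; the distributions $P$ and $Q$ below are illustrative choices.

```python
import numpy as np

# Discrete P, Q on the same alphabet; check that E_P[f] - log E_Q[e^f] never
# exceeds KL(P||Q), with equality at the optimal witness f* = log(dP/dQ).
P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])
kl = float(np.sum(P * np.log(P / Q)))

def dv_objective(f):
    """The Donsker-Varadhan objective E_P[f] - log E_Q[exp(f)]."""
    return float(P @ f - np.log(Q @ np.exp(f)))

rng = np.random.default_rng(0)
for _ in range(200):
    f = rng.standard_normal(3)              # arbitrary witness function
    assert dv_objective(f) <= kl + 1e-12    # every f lower-bounds the KL

f_star = np.log(P / Q)                      # optimal witness attains the supremum
assert abs(dv_objective(f_star) - kl) < 1e-12
```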
Using Lemma 2, we have,
(7.11) 
As the loss function is $\sigma$-sub-Gaussian w.r.t. $Z$ for any $w \in \mathcal{W}$ and the $z_i$ are i.i.d., the empirical risk $R_S(w)$ is $\frac{\sigma}{\sqrt{n}}$-sub-Gaussian. By definition, we have,
(7.12) 
Substituting inequality (7.12) into inequality (7.11), we have,
(7.13) 
The above inequality is quadratic in $\lambda$ and always less than or equal to zero. Therefore, we have,
(7.14) 
which completes the proof. ∎
By Theorem 1 of Section 3, we can apply the strong data processing inequality to the Markov chain in Figure 2 recursively. Thus, we have,
(7.15)
where
(7.16)
We then have
(7.17)
where the second inequality follows from the data processing inequality. Throughout the proof of Theorem 1, we write the conditional mutual information implicitly, because its condition can be eliminated in the last step.
A.3 Proof of Theorem 1
Let $S'$ be a ghost sample of $S$. We have
(7.18)
where $\mathcal{A}(S^{(i)})$ stands for the output of the algorithm when the input is $S^{(i)}$, and the examples $z_i$ and $z_i'$ ($i = 1, \ldots, n$) are i.i.d.
From the equation above, we have
(7.19) 
Using a proof similar to that of Theorem 1, we have
(7.20) 
Note that the difference between the above equation and our main theorem is that the absolute value is adopted for the expected generalization error, which may be slightly tighter, but all the conclusions are the same. Combining (7.18) and (7.20), we have
(7.21) 
which ends the proof.
A.4 Proof of Theorem 1
By the bound in Lemma 1 and the argument in A.2, we have,
(7.22) 
We now bound the right-hand side of the above inequality; the theorem then follows from the law of total expectation.
At the final iteration, we have and the algorithm outputs . We have the following Markov relation when the initialization