In the last 15 years, deep learning, i.e., deep neural networks (NNs), has been used very effectively in diverse applications, such as image classification (Krizhevsky et al., 2012) and game playing (Silver et al., 2016). Despite this remarkable success, our theoretical understanding of deep learning is lagging behind. The accuracy of NNs can be characterized by dividing the expected error into three main types: approximation (also called expressivity), optimization, and generalization (Bottou and Bousquet, 2008; Bottou, 2010), see Fig. 1. The well-known approximation result was obtained by Cybenko (1989) and Hornik et al. (1989) almost three decades ago, stating that feed-forward neural nets can approximate essentially any function. In the past several years, there have been numerous studies that analyze the landscape of the non-convex objective functions, and the optimization process by stochastic gradient descent (SGD) (Lee et al., 2016; Liao and Poggio, 2017; Allen-Zhu et al., 2018b; Du et al., 2018; Lu et al., 2019). Whereas there are some satisfactory answers to the problems of approximation and optimization, much less is known about the theory of generalization, which is the focus of this study.
The classical analysis of generalization is based on controlling the complexity of the function class, i.e., model complexity, by managing the bias-variance trade-off (Friedman et al., 2001). However, this type of analysis is not able to explain the small generalization gap between training and test performance of neural networks learned by SGD in practice, considering the fact that deep neural networks often have far more model parameters than the number of samples they are trained on, and have sufficient capacity to memorize random labels (Neyshabur et al., 2014; Zhang et al., 2016). To explain this phenomenon, several approaches have recently been developed. The first approach is characterizing neural networks with some other low “complexity” instead of the traditional Vapnik-Chervonenkis (VC) dimension (Bartlett et al., 2017b) or Rademacher complexity (Bartlett and Mendelson, 2002), such as path-norm (Neyshabur et al., 2015), margin-based bounds (Sokolić et al., 2017; Bartlett et al., 2017a; Neyshabur et al., 2017b), Fisher-Rao norm (Liang et al., 2017), and more (Neyshabur et al., 2019; Wei and Ma, 2019). The second approach is to analyze some good properties of SGD or its variants, including its stability (Hardt et al., 2015; Kuzborskij and Lampert, 2017; Gonen and Shalev-Shwartz, 2017; Chen et al., 2018), robustness (Sokolić et al., 2016; Sokolić et al., 2017), implicit biases/regularization (Poggio et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018; Nagarajan and Kolter, 2019b), and the structural properties (e.g., sharpness) of the obtained minimizers (Keskar et al., 2016; Dinh et al., 2017; Zhang et al., 2018). The third approach relies on overparameterization, e.g., sufficiently overparameterized networks can learn the ground truth with a small generalization error using SGD from random initialization (Li and Liang, 2018; Allen-Zhu et al., 2018a; Arora et al., 2019; Cao and Gu, 2019).
There are also other approaches, such as compression (Arora et al., 2018; Baykal et al., 2018; Zhou et al., 2018; Cheng et al., 2018), Fourier analysis (Rahaman et al., 2018; Xu et al., 2019), “double descent” risk curve (Belkin et al., 2018), and PAC-Bayesian framework (Neyshabur et al., 2017b; Nagarajan and Kolter, 2019a).
However, most theoretical bounds fail to explain the performance of neural networks in practice (Neyshabur et al., 2017a; Arora et al., 2018). To get non-vacuous and tight enough bounds to be practically meaningful, some problem-specific factors should be taken into consideration, such as the easiness of the data (i.e., data-dependent analysis) (Dziugaite and Roy, 2017; Kawaguchi et al., 2017), or properties of the trained neural networks (Sokolić et al., 2017; Arora et al., 2018; Wei and Ma, 2019). In this study, to achieve a practically meaningful bound, our analysis relies on the data distribution and the smoothness of the trained neural network. The analysis proposed in this study provides guarantees on the generalization error, and theoretical insights to guide the practical application.
As shown in Fig. 1, the optimization error is correlated with the loss value, while the approximation error depends on the network size. In addition, a small loss requires a sufficient approximation ability, i.e., a large network size, which in turn leads to a small approximation error. If we assume a sufficiently small loss, which is indeed true in practice, then the expected error mainly depends on the generalization error. Hence, we study the expected error/accuracy directly. In particular, we propose a mathematical framework to analyze the expected accuracy of neural networks for classification problems. We introduce the concepts of total cover (TC), self cover (SC), mutual cover (MC) and cover difference (CD) to represent the data distribution, and then we use the concept of cover complexity (CC) as a measure of the complexity of classification problems. On the other hand, the smoothness of a neural network is characterized by the inverse of the modulus of continuity. Because this quantity is not tractable in general, we propose an estimate using the 2-norm of the weight matrices of the neural network. The main terminologies are illustrated in Fig. 2. By combining the properties of data distribution and the smoothness of neural networks, we derive a lower bound for the expected accuracy, i.e., an upper bound for the expected classification error.
Subsequently, we test our theoretical bounds on several data sets, including MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky and Hinton, 2009), CIFAR-100 (Krizhevsky and Hinton, 2009), COIL-20 (Nene et al., 1996b), COIL-100 (Nene et al., 1996a), and SVHN (Netzer et al., 2011). Our numerical results not only confirm our theoretical bounds, but also provide insights into the optimization process and the learnability of neural networks. In particular, we find that:
The best accuracy that can be achieved by fully-connected networks is approximately linear with respect to the cover complexity of the data set.
The trend of the expected accuracy is consistent with the smoothness of the neural network, which provides a new “early stopping” strategy by monitoring the smoothness of the neural network.
The paper is organized as follows. After setting up notation and terminology in Section 2, we present the main theoretical results of the accuracy based on data distribution and neural network smoothness in Section 3. In Section 4, we provide the numerical results for several data sets. In Section 5 we include a discussion, and in Section 6 we summarize our findings.
Before giving the main results, we introduce the necessary notation and terminology. Without loss of generality, we assume that the space we need to classify is
where is the dimensionality, and the points in this space are classified into categories, i.e., there are labels . Let
be the probability density function by which samples are drawn from, and we have
2.1 Ideal label function
For the problem setup, we assume that every sample has at least one true label, and one sample may have multiple true labels. Taking image classification as an example, each image has at least one correct label. A fuzzy image or an image with more than one object may have multiple possible correct labels, and as long as the prediction is one of these labels, we consider the prediction to be correct.
It is intuitive that when two samples are close enough, they should have similar labels, which means that the ideal label function should be continuous. Continuity of a mapping depends on the topology of both domain and image space. For the domain of the ideal label function, we choose the standard topology induced by the Euclidean-metric. As for the topology of the image space, we define it as follows. We first define the label set and the topology on it.
Definition 2.1 (Topology).
be the label set. Define the topology on to be
and thus constitutes a topological space.
In this definition, is an element of , and is a set comprised of all elements containing from . All sets like constitute a topological base for , and then is the topology generated by this base, see Appendix A for an example. Next we give the definition of the ideal label function according to this topological space.
Definition 2.2 (Ideal label function).
Define an ideal label function, i.e., an ideal classifier, as
where is the Euclidean-metric topology. Then is a continuous function (i.e., ). Moreover, continuity holds if and only if
Eq. (1) means that two neighboring points would have some common labels. Based on the topological space defined above, it is easy to show that Eq. (1) is equivalent to continuity. The reason why we consider a multi-label setup for classification problems is that it induces the continuity property in Eq. (1), which does not exist in the setup of a single label set. In addition, the multi-label setup introduces a smooth transition, i.e., a buffer domain, between two domains of different labels, while the transition is sharp in the single label setup. In the following proposition, we show that if two samples are close enough, they must share at least one common label.
Proposition 2.1 (Separation gap).
We denote the supremum of as the separation gap , which is used in the sequel.
The proof can be found in Appendix B. ∎
To understand the geometric interpretation of , we consider the following special case: the label of each sample is either a single label set, such as , or the full label set if it is not uniquely identifiable.
Proposition 2.2 (Geometric interpretation of separation gap).
If the label of each sample is either a single label set or the full label set , then is the smallest distance between two different single label points, i.e.,
The proof can be found in Appendix C. ∎
2.2 Cover complexity of data set
In this subsection, we introduce a quantity to measure the difficulty of learning a training data set
First, we give some notations and propositions.
Denote the probability measure on as , that is, for a measurable set , we have
is the probability of a random sample falling in . Then the probability of the neighborhood of the training set with radius of is
where is the open ball centered at with radius of , see Fig. 2A. Obviously, is a monotone non-decreasing continuous function, , and when , see Fig. 2B. To represent the global behavior of , we use the integral of with respect to :
considers not only the number of data points but also the probability distribution of the space. A larger value means a larger number of data points and also that the probability distribution is more concentrated around , which we call the “coverability” of . We can increase by adding more data points or redistributing their locations. Next, we give the formal definition of the “coverability”.
Definition 2.4 (Coverability).
Let be a data set from a domain with probability measure . We define the following for the coverability of .
The total cover (TC) is
The cover difference (CD) is
where is the number of categories, and and represent the subset and probability measure of the label , respectively. Here, is called self cover (SC), and is called mutual cover (MC).
The cover complexity (CC) is
The CD is defined as the difference between the mean of SC and the mean of MC, since each category occurs with the same probability () in the data sets mostly used in practice. If there are some categories occurring more frequently than others, then it is straightforward to extend this definition by using the mean weighted by the probability of each category.
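As a rough numerical illustration of Definition 2.4, the sketch below estimates the self cover, the mutual cover, and their difference from finite samples. It assumes, which is not stated explicitly above, that the cover of a training subset under a label's measure can be approximated by the normalized integral over radii of the fraction of held-out points of that label lying within the given radius of the subset. The function names `cover` and `cover_difference`, the held-out "probe" sample, and the cutoff `r_max` are hypothetical choices made for illustration.

```python
import numpy as np

def cover(train, probe, r_max):
    """Normalized integral over r in [0, r_max] of the fraction of probe
    points lying within distance r of the nearest training point."""
    # Distance from every probe point to its nearest training point.
    diff = probe[:, None, :] - train[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    # (1 / r_max) * integral of 1[nearest < r] dr = max(0, 1 - nearest / r_max).
    return float(np.clip(1.0 - nearest / r_max, 0.0, 1.0).mean())

def cover_difference(train_x, train_y, probe_x, probe_y, r_max):
    """Mean self cover minus mean mutual cover (assumes every label
    appears in both the training and the probe sample)."""
    labels = np.unique(train_y)
    self_cover, mutual_cover = [], []
    for i in labels:
        t_i = train_x[train_y == i]
        for j in labels:
            c = cover(t_i, probe_x[probe_y == j], r_max)
            (self_cover if i == j else mutual_cover).append(c)
    return float(np.mean(self_cover) - np.mean(mutual_cover))
```

For two well-separated classes the self cover is large and the mutual cover is small, so the difference is close to its maximum; for randomly labeled data the two means are nearly equal and the difference is close to zero.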
In image classification, the dimension of the image space is very high, and thus the data points are quite sparse. However, because images actually live on a low-dimensional manifold, the probability density around is actually high, which makes the TC meaningful. Then, we derive a lower bound of by .
Let be a data set. and are defined as above. Then we have
The proof can be found in Appendix D. ∎
From this proposition, we know that for a fixed , can be close to 1 when is large enough. However, the probability distribution is usually given in practice, and what we can control is the number of samples. The following theorem shows that can be arbitrarily close to 1 when enough samples are available.
Let be a data set of size drawn from by . Then there exists a non-increasing function satisfying , and for any , there exists an
holds with probability at least when .
The proof and some other results of TC can be found in Appendix E. ∎
The reason why CD is introduced is that TC does not consider the labels of the data points. However, data points of the same label should be clustered in a good data set. is the difference of self cover and mutual cover, which considers the distribution of each label. By normalizing TC with CD, cover complexity is able to measure the difficulty of learning a data set. The difficulty of a problem should be translation-independent and scale-independent. It is easy to see that is independent of translation, and the following proposition shows that it is also scale-independent.
Proposition 2.4 (Scale independence).
is scale-independent, i.e., if all the data points are scaled by the same factor, then is unchanged.
The proof can be found in Appendix F. ∎
2.3 Setup for accuracy analysis
The setup for accuracy analysis is as follows.
If is a continuous mapping, then the mapping
is still continuous, where represents the i-th component of . We have , and . For convenience, we directly consider the case that and , and we call such mapping the normalized continuous positive mapping.
A neural network with a softmax nonlinearity is a normalized continuous positive mapping.
Different from the accuracy usually used in classification problems, we define a stronger accuracy called -accuracy as follows.
Definition 2.6 (-accuracy at ).
Let be a normalized continuous positive mapping. For , we state that is -accurate at point if
Definition 2.7 (-accuracy on ).
Let be a normalized continuous positive mapping. The -accuracy of on a sample space is defined as
where is -accurate at .
Definition 2.8 (-accuracy on ).
Let be a normalized continuous positive mapping. The -accuracy of on a data set is defined as
where is -accurate at , and and are the TC of and , respectively.
We note that the -accuracy of on represents the expected -accuracy, and the -accuracy of on represents the empirical -accuracy.
Finally, we define a non-decreasing function to describe the smoothness of .
Definition 2.9 (Smoothness).
is a continuous mapping, then is uniformly continuous due to the compactness of , i.e.
We denote the supremum of satisfying the above requirement by . It is easy to see that is equal to the inverse of the modulus of continuity of .
For low dimensional problems, we can directly compute by brute force. However, for high dimensional problems, it is hard to compute , and thus we give the following lower bound of for a neural network :
where is the 2-norm of the weight matrix of the layer in the neural network , and and represent the Lipschitz coefficients of and , respectively. is a constant, and thus is ignored in our numerical examples. We note that although the lower bound of depends exponentially on the neural net depth, itself does not necessarily scale exponentially in the network depth.
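The exact form of the lower bound is the one given in Eq. (3). As a minimal sketch of the underlying idea only, the following assumes a plain feed-forward network whose activations are Lipschitz with a known constant (1 for ReLU). For such a network, the product of the 2-norms of the weight matrices, multiplied by the activation constant once per hidden layer, is a Lipschitz constant L of the network, so inputs closer than eps / L are mapped to outputs closer than eps. The function names and the `act_lipschitz` parameter are illustrative assumptions, and the constant factors of Eq. (3) are omitted.

```python
import numpy as np

def lipschitz_upper_bound(weights, act_lipschitz=1.0):
    """Product of the spectral norms of the weight matrices, times the
    activation Lipschitz constant once per hidden layer."""
    bound = 1.0
    for w in weights:
        bound *= np.linalg.norm(w, 2)  # largest singular value of w
    return bound * act_lipschitz ** (len(weights) - 1)

def smoothness_lower_bound(weights, eps, act_lipschitz=1.0):
    """Radius delta such that inputs closer than delta give outputs
    closer than eps, for a network with the given weight matrices."""
    return eps / lipschitz_upper_bound(weights, act_lipschitz)
```

Because this estimate uses a worst-case product over layers, it can be far smaller than the true smoothness, which is consistent with the remark above that the bound may scale exponentially with depth even when the smoothness itself does not.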
3 Lower bound of expected accuracy
In this section, we present a theoretical analysis of the lower bound of expected accuracy as well as an upper bound of expected error.
Let be a normalized continuous positive mapping. Suppose that is a single label training set, i.e. . For any , we have
The proof can be found in Appendix G. ∎
Proposition 3.1 shows that the expected -accuracy of can be bounded by the empirical -accuracy and the TC of the training set. We can see that tends to 1 when and tend to 1. Next we derive a bound of the accuracy by taking into account the loss function.
Theorem 3.4 (Lower bound of -accuracy).
Let be a normalized continuous positive mapping. Suppose that is a single label training set, and . For any , if the maximum cross entropy loss
then we have
where is the cross entropy loss, , and is defined in Proposition 2.1.
The proof can be found in Appendix H. ∎
Let be a normalized continuous positive mapping. Suppose that is a single label training set, and . For any , if the loss function
then we have
where is the cross entropy loss, and .
The corollary can be obtained from Theorem 3.4 based on the fact . ∎
Theorem 3.4 reveals that the expected accuracy is correlated with the total cover , separation gap , neural network smoothness , and loss value . We will show numerically in Section 4 that increases first and then decreases during the training of neural networks. The following theorem states that the maximum value of is bounded by the empirical separation gap.
Theorem 3.6 (Empirical separation gap).
Let be a normalized continuous positive mapping. Suppose that is a single label training set. For any , when , then we have
is called the empirical separation gap, i.e., the smallest distance between two different labeled training points. Furthermore, when , the upper bound is tight.
The proof can be found in Appendix I. ∎
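The empirical separation gap described above, i.e., the smallest Euclidean distance between two training points with different labels, can be computed directly for small data sets. The following is a brute-force sketch assuming single-label training data; the function name is hypothetical, and the quadratic memory cost makes it unsuitable for large training sets without batching.

```python
import numpy as np

def empirical_separation_gap(x, y):
    """Smallest Euclidean distance between two points with different labels."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all training points.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Keep only pairs whose labels differ.
    different = y[:, None] != y[None, :]
    if not different.any():
        raise ValueError("need at least two distinct labels")
    return float(dist[different].min())
```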
Besides the upper bound, the lower bound of is also important to the accuracy. We have observed that in practice NNs always have satisfactory smoothness. Based on this observation, we have the following theorem for the accuracy.
Theorem 3.7 (Lower bound of accuracy).
Assume that there exists a constant , such that
holds for any single label training set and any trained network on , then we have the following conclusions for the expected accuracy and the expected error :
with the same condition of Theorem 3.4,
where , and .
Here, the cover complexity consists of two parts: one represents the richness of the whole training set, while the other describes the degree of separation between differently labeled subsets. As for , both the denominator and the numerator seem to be positively correlated with the separation level. What we hope is that is close to a constant with high probability, so that the expected error is mainly determined by , which approximately represents the complexity level of the data set. We provide more details in the section on numerical results.
4 Numerical results
In this section, we use numerical simulations to test the accuracy of neural networks in terms of the data distribution (cover complexity), and neural network smoothness.
4.1 Data distribution
Table 1: Data sets and their variants (columns: data set, variants, input dimension, output dimension).
In this subsection, we explore how affects the expected error . In our experiments, we test several data sets, including MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky and Hinton, 2009), CIFAR-100 (Krizhevsky and Hinton, 2009), COIL-20 (Nene et al., 1996b), COIL-100 (Nene et al., 1996a), and SVHN (Netzer et al., 2011). In addition to the original data sets, we also create some variants: (1) grayscale versions of the images, (2) images extracted from a convolutional layer after training a convolutional neural network (CNN) on the original data set, and (3) data sets in which several categories are combined into one to reduce the total number of categories; see Table 1 and the details in Appendix K.
For a training data set , we estimate by the proportion of the test data points within the balls with radius centered at training data points, i.e.,
and then is obtained by Definition 2.4. Similarly, we estimate and then compute
. Next, for each data set, we train fully-connected neural networks with different hyperparameters, and record the best error we observed; see the details in Appendix K. The cover complexity and best error of each data set are shown in Table 1.
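The estimation procedure just described can be sketched as follows. The coverage at a given radius is the fraction of test points lying within that distance of at least one training point, which depends only on each test point's distance to its nearest training point. The radius grid, the normalization of the integral by the largest radius `r_max`, and all function names are assumptions made for illustration; the exact normalization is the one in Definition 2.4.

```python
import numpy as np

def nearest_train_distance(train, test):
    """Distance from each test point to its nearest training point."""
    diff = test[:, None, :] - train[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def coverage_curve(train, test, radii):
    """Fraction of test points inside the union of open balls of each radius."""
    nearest = nearest_train_distance(train, test)
    return (nearest[None, :] < radii[:, None]).mean(axis=1)

def total_cover_estimate(train, test, r_max, n_radii=1001):
    """Trapezoidal integral of the coverage curve over [0, r_max]."""
    radii = np.linspace(0.0, r_max, n_radii)
    p = coverage_curve(train, test, radii)
    # Normalized by r_max so the result lies in [0, 1] (assumed normalization).
    return float(((p[1:] + p[:-1]) / 2 * np.diff(radii)).sum() / r_max)
```

For larger data sets, the dense pairwise-distance array would need to be replaced by a batched or tree-based nearest-neighbor search.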
These data sets are divided into three groups according to their output dimensions. For each group of the same output dimension, the error is linearly correlated with , see Fig. 3A, regardless of the input dimension. In addition, we find that all the cases collapse onto a single line by normalizing the error with , see Fig. 3B.
It is noteworthy that the of convolutional variants of data sets is much smaller than the original data sets, and hence the expected accuracy increases. The results confirm the importance of data distribution.
Next, we consider the most difficult data set, i.e., data with random labels. We choose MNIST and then assign each image a random label. We repeat this process 50 times, and compute each . The distribution of is shown in Fig. 4. The smallest is 300, which is much larger than the normal data sets with . This extreme example again confirms that is a proper measure of the difficulty of classifying a data set.
4.2 Neural network smoothness
In this subsection, we investigate the relationship between the neural network smoothness and the accuracy. We first show results for one- and two-dimensional problems, where can be computed accurately by brute force. Subsequently, we consider the high-dimensional problem of the MNIST data set, where is estimated by Eq. (3).
4.2.1 One- and two-dimensional problems
We first consider a one-dimensional case and a two-dimensional case. For the one-dimensional case, we choose the sample space , , and the ideal label function as
with separation gap . We use equispaced points ( is an even number) on as the training set, i.e., , where
For the two-dimensional case, we choose the sample space , , and the ideal label function as
with . For the training set, we first choose equispaced points, i.e., , and then remove the points with label to ensure that all samples are of single label.
During the training process of the neural network, the test loss first decreases and then increases, while first increases and then decreases, see Fig. 5A for the one-dimensional problem () and Fig. 5B for the two-dimensional problem (). is bounded by , as proved in Theorem 3.6. We also observe that the trends of test loss and coincide, and thus we should stop the training when begins to decrease to prevent overfitting.
4.2.2 High-dimensional problem
In the high-dimensional problem of MNIST, we consider the average loss instead of the maximum loss , which is very sensitive to extreme points. As shown in Eq. (3), we use the following quantity to bound :
Because we use -accuracy to approximate the true accuracy, for the classification problems with two categories, these two values are equivalent. However, they are not equal for problems with more than two categories, where the best depends on the properties of the data set, such as the easiness. If the data set is easy to classify, such as MNIST, the best should be close to 1. In our example, we choose
. We train a 3-layer fully-connected NN with ReLU activation and 100 neurons per layer on MNIST for 100 epochs. In Fig. 6, we can also see the consistency between the test loss and neural network smoothness, as we observed in the low-dimensional problems.
When neural networks are used to solve classification problems, we expect that the accuracy is dependent on some properties of the data set. However, it is still quite surprising that, as we have seen in Section 4.1, the accuracy and error are approximately linearly dependent on the cover complexity of data sets. Theorem 3.7(ii) provides an upper bound of the error, but a lower bound is missing. To fully explain this observation, two conjectures of the learnability of fully-connected neural networks are proposed:
For a data set ,
where is a constant depending only on .
For a data set ,
where is a constant.
On the other hand, the theoretical and numerical results provide us a better understanding of the generalization of neural network from the training procedure. The smoothness of neural networks plays a key role, where is the maximum loss or the average loss . We can see that:
depends on both the regularity of and the loss value (which also depends on ). Large requires good regularity and large , i.e., small . However, small could correspond to bad regularity of . Thus, there is a trade-off between the loss value and the regularity of .
Due to this trade-off, increases first and then decreases during the training process. Hence, we should not optimize neural networks excessively. Instead, we should stop the training early when begins to decrease, which leads to another “early stopping” strategy to prevent overfitting.
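One hypothetical way to turn this observation into a stopping rule is sketched below: record a smoothness estimate once per epoch and stop when it has failed to improve on its running maximum for a fixed number of consecutive epochs. The `patience` parameter and the function name are assumptions borrowed from common early-stopping practice and are not part of the analysis above.

```python
def stopping_epoch(smoothness_history, patience=3):
    """Return the epoch with the best smoothness once the estimate has not
    improved for `patience` epochs, or None if training should continue."""
    best_value = float("-inf")
    best_epoch = None
    for epoch, value in enumerate(smoothness_history):
        if value > best_value:
            # New running maximum: remember where it occurred.
            best_value, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: restore this checkpoint.
            return best_epoch
    return None
```

Using a patience window rather than stopping at the first decrease guards against noise in the per-epoch estimate; the returned index identifies which saved checkpoint to keep.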
We also note that the lower bound of in Eq. (3) relates to the norm of weight matrices of neural networks:
There have been some works to study the norm-based complexity of neural networks (see the Introduction), and these bounds typically scale with the product of the norms of the weight matrices, e.g., (Neyshabur et al., 2017a)
where and are the number of nodes and weight matrix in layer of a -layer network, and is the margin quantity, which describes the goodness of fit of the trained network on the data. The product of the matrix norms depends exponentially on depth, while some recent works show that the generalization bound could scale polynomially in depth under some assumptions (Nagarajan and Kolter, 2019a; Wei and Ma, 2019). Clearly our neural net smoothness has a much weaker dependence on depth than exponential, and the detailed analysis of this dependence is left for future work.
In this paper, we study the generalization error of neural networks for classification problems in terms of data distribution and neural network smoothness. We first establish a new framework for classification problems. We introduce the cover complexity (CC) to measure the difficulty of learning a data set, an accuracy measure called -accuracy which is stronger than the standard classification accuracy, and the inverse of the modulus of continuity to quantify neural network smoothness. Subsequently, we derive a quantitative bound for the expected accuracy/error in Theorem 3.7, which considers both the cover complexity and neural network smoothness.
We validate our theoretical results on several image data sets. Our numerical results verify that the expected error of the trained network has a linear relationship with respect to the CC. In addition, we find that the most difficult case, i.e., randomly labeled data, leads to quite large CC. Hence, CC is a reliable measure for the difficulty of a data set. On the other hand, we observe a clear consistency between test loss and neural network smoothness during the training process.
This work is supported by the DOE PhILMs project (No. DE-SC0019453), the AFOSR grant FA9550-17-1-0013, and the DARPA AIRA grant HR00111990025. The work of P. Jin and Y. Tang is partially supported by the National Natural Science Foundation of China (Grant No. 11771438).
Appendix A Example of topology
then is the topology generated by .
In this example, is an open set, since it consists of all elements containing label , and is also an open set with common part . Besides open sets from base , is still an open set as the union of the two shown above.
Appendix B Proof of Proposition 2.1
We argue by contradiction. Assume that the result does not hold, then
Since is compact, there exist and a subsequence of such that As , thus Choose any , then there exists a sufficiently large such that Therefore , which contradicts the assumption. ∎
Appendix C Proof of Proposition 2.2
Consider defined in this proposition. For any two different points with distance less than , at least one of the two is a full label point, therefore . For any , according to the definition of , there exist two points satisfying
The two facts imply that is the supremum of satisfying Proposition 2.1. ∎
Appendix D Proof of Proposition 2.3
According to the definition,
Appendix E Estimate of total cover
In this section, we estimate the TC by the number of samples in training set. The notations, such as , , , , , as well as training set
are the same as before. Note that samples in are drawn from by . Before performing the analysis, we recall the following preliminaries (Definitions E.1-E.4, Theorem E.8), which can be found in Mitzenmacher and Upfal (2017):
A range space is a pair where:
is a (finite or infinite) set of points;
is a family of subsets of , called ranges.
Let be a range space and let . The projection of on is
Let be a range space. A set is shattered by if . The Vapnik-Chervonenkis (VC) dimension of a range space is the maximum cardinality of a set that is shattered by . If there are arbitrarily large finite sets that are shattered by , then the VC dimension is infinite.
Let be a range space, and let be a probability distribution on . A set is an for with respect to if for any set such that , the set contains at least one point from , i.e.,
Let be a range space with VC dimension and let be a probability distribution on . For any , there is an
such that a random sample from of size greater than or equal to is an for with probability at least .
we first show is a range space with VC dimension .
The VC dimension of range space is .
The vertices of a simplex of dimension in are shattered by , hence the VC dimension of is at least . Furthermore, Dudley (1979) proves that the VC dimension of is at most . ∎
we have the following lemmas.
when is an for .
For any , assume that , , . Since is an and , we know . Thus there exists such that . Therefore
The above inequality shows that . ∎
by the dominated convergence theorem, we have
According to the aforementioned lemmas, we deduce the following theorem.
Let be the training set drawn from by , then for any , there exists an
holds with probability at least when . Note that when .
Theorem E.8 shows that is an