Deep neural networks have driven remarkable advances in machine learning over the past decade. Recently, many researchers have tried to explain the empirical success of deep neural networks. One research direction is to explain why deep learning does not suffer from a serious overfitting problem. Although several common techniques, such as dropout srivastava2014dropout , batch normalization ioffe2015batch , and weight decay krogh1992simple , do improve the generalization performance of over-parameterized deep models, these techniques lack a solid theoretical foundation that explains their effects.
In the history of machine learning research, the large margin principle has played an important role in the theoretical analysis of generalization ability, and it has also achieved remarkable practical results for classification Cortes1995 and regression problems NIPS1996_1238 . Moreover, this powerful principle has been used to explain the empirical success of deep neural networks. NIPS2017_7204 and neyshabur2018a present margin-based multi-class generalization bounds for neural networks that scale with their margin-normalized spectral complexity, using two different proof techniques. In addition, DBLP:conf/icml/Arora0NZ18 proposes a stronger generalization bound for deep networks via a compression approach, which is orders of magnitude better in practice.
As for margin theory, DBLP:conf/icml/SchapireFBL97 first introduced it to explain the phenomenon that AdaBoost seems resistant to overfitting. Two years later, Breiman:1999:PGA:334369.334370 indicated that the minimum margin is important for achieving good performance. However, Reyzin:2006:BMB:1143844.1143939 conjectured that the margin distribution, rather than the minimum margin, plays the key role in this empirical resistance to overfitting; this was finally proved by Gao:2013:DME:2527803.2527872 . One possible way to restrict the complexity of the hypothesis space suitably is to design a classifier that obtains the optimal margin distribution. Gao:2013:DME:2527803.2527872 proves that, to attain the optimal margin distribution, it is crucial to consider not only the margin mean but also the margin variance. Inspired by this idea, DBLP:journals/corr/ZhangZ16a proposes the optimal margin distribution machine (ODM) for binary classification, which optimizes the margin distribution through its first- and second-order statistics, i.e., maximizing the margin mean and minimizing the margin variance simultaneously. To extend this method to multi-class classification, pmlr-v70-zhang17h presents a multi-class version of ODM.
The complexity term of the margin bounds proposed by NIPS2017_7204 and neyshabur2018a is much larger than the generalization gap observed in experiments, and these bounds make no significant contribution to algorithm design DBLP:conf/icml/Arora0NZ18 . The hinge-type deep model tries to improve generalization by maximizing the minimum margin, as the SVM algorithm does, but it does not perform well in experiments elsayed2018large . Although DBLP:conf/icml/Arora0NZ18 proposes a more practical generalization bound, it introduces many conditions in the compression theory that are not intrinsic to neural networks. The initial motivation of our study is to strengthen the margin-based bound by using more information about the margin distribution. We hope to achieve two main goals:
First, we want to design a new margin distribution loss for deep neural networks to reveal a stronger relationship between generalization gap and margin distribution theoretically;
Second, we want the theoretical result to guide us in improving the generalization performance of deep neural networks in practice by obtaining an optimal margin distribution.
We find that the standard hinge loss is unstable because it considers only the minimum margin. We therefore characterize the margin distribution by the mean and variance of the margin. In this way, we can optimize the entire distribution and improve performance effectively. pmlr-v70-zhang17h ; ZhangZ18 demonstrate that the optimal margin distribution principle yields better generalization performance on classical models. Thus, we study the extension of the optimal margin distribution principle to deep neural networks.
The main contributions of this paper are as follows:
We propose an optimal margin distribution loss for deep neural networks (mdNet), which not only maximizes the margin mean but also minimizes the margin variance. This loss function optimizes the entire margin distribution instead of the minimum margin, and is a first attempt to use the sharpness of the distribution to explain the generalization ability of deep models;
Moreover, we use the PAC-Bayesian framework to obtain a novel generalization bound based on the margin distribution. Compared with the spectrally-normalized margin bounds of NIPS2017_7204 and neyshabur2018a , our generalization bound shows that we can restrict the complexity of the model by setting an appropriate ratio between the first-order statistic and the second-order statistic, rather than trying to control the whole product of the spectral norms of each layer;
Finally, we empirically evaluate our loss function on deep networks across different image datasets and model structures. Specifically, the empirical evidence demonstrates the effectiveness of the proposed mdNet on generalization tasks with limited training data.
2 mdNet Loss: Optimal Margin Distribution Loss
Consider a classification problem with input domain and output domain , and denote a labeled sample as . Suppose a network generates a prediction score for the input vector to class , through a function , for . The predicted label is the class with the maximal score, i.e. .
Define the decision boundary of each class pair as:
Based on this definition, the margin distance of a sample point to the decision boundary is defined as the smallest translation of the sample point that places it on the boundary:
In order to approximate the margin distance in the nonlinear setting, elsayed2018large offers a linearized definition:
Naturally, this pairwise margin distance leads us to the following definition of the margin for a labeled sample :
Therefore, the defined classifier misclassifies if and only if the margin is negative. Given a hypothesis space of functions mapping to , which can be learned by the fixed deep neural network through the training set , our purpose is to find a way to learn a decision function such that the generalization error is small.
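The margin described above can be illustrated with a minimal sketch. It assumes a linear scorer f(x) = W x + b, for which the input-gradient used in a linearized margin distance is available in closed form, and it assumes the L2 norm for the normalization. The function names `multiclass_margin` and `linearized_margin` are hypothetical and are not taken from the paper.

```python
import numpy as np

def multiclass_margin(scores, label):
    """Score gap between the true class and the best competing class."""
    scores = np.asarray(scores, dtype=float)
    others = np.delete(scores, label)
    return scores[label] - others.max()

def linearized_margin(W, b, x, label, eps=1e-12):
    """Smallest normalized pairwise score gap for a linear scorer W @ x + b.

    Assumption: each pairwise gap is divided by the L2 norm of its gradient
    with respect to the input, which for a linear model is W[label] - W[j].
    """
    scores = W @ x + b
    margins = []
    for j in range(W.shape[0]):
        if j == label:
            continue
        gap = scores[label] - scores[j]
        grad_norm = np.linalg.norm(W[label] - W[j])
        margins.append(gap / (grad_norm + eps))
    return min(margins)

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
b = np.zeros(3)
x = np.array([2.0, 1.0])
print(multiclass_margin(W @ x + b, 0))   # positive: class 0 is predicted
print(linearized_margin(W, b, x, 1))     # negative: label 1 would be misclassified
```

In this toy case the sign of either quantity agrees with whether the predicted label matches the true label, which is the property the text relies on.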
In this work, we introduce a type of margin loss and connect it to deep neural networks. We adapt the original loss function to account for the differences between deep learning models and linear models, as in the following definition:
where is the margin mean, is the margin variance and is a parameter to trade off two different kinds of deviation (keeping the balance on both sides of the margin mean).
As Figure 1 shows, Equation 5 produces a squared loss when the margin satisfies or . Therefore, our margin loss function forces the zero-loss band to contain as many sample points as possible. The parameters of the classifier are thus determined not only by the samples close to the decision boundary but also by the samples far from it. In other words, our loss function aims to find a decision boundary determined by the whole sample margin distribution, instead of by the minority of samples that have minimum margins. We verify the advantage of the optimal margin distribution network both theoretically and empirically.
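The piecewise behaviour described above (zero loss inside a band around the margin mean, squared loss outside it) can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact Equation 5: the names `r` (margin mean), `theta` (band half-width, playing the role of the margin variance) and `mu` (trade-off weight) are chosen here for illustration, the trade-off weight is assumed to apply to the side above the band, and any normalization of the two squared terms is omitted.

```python
import numpy as np

def margin_distribution_loss(margins, r, theta, mu):
    """Per-sample loss: zero inside [r - theta, r + theta], squared outside.

    Assumptions (hypothetical, for illustration only):
      * r is the target margin mean and theta the half-width of the band;
      * mu down- or up-weights deviations above the band;
      * no normalization constants are applied to the squared terms.
    """
    margins = np.asarray(margins, dtype=float)
    lower, upper = r - theta, r + theta
    below = np.where(margins < lower, (lower - margins) ** 2, 0.0)
    above = np.where(margins > upper, mu * (margins - upper) ** 2, 0.0)
    return below + above

losses = margin_distribution_loss([0.0, 1.0, 2.5], r=1.0, theta=0.5, mu=0.1)
print(losses)  # samples inside the band contribute zero loss
```

Because samples on both sides of the band contribute gradients, the fitted boundary depends on the whole margin distribution rather than only on the samples with the smallest margins.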
Explanation for Optimal Margin Distribution Loss: Inspired by the optimal margin distribution principle, pmlr-v70-zhang17h propose the multi-class optimal margin distribution machine, which characterizes the margin distribution by its first- and second-order statistics. Specifically, let denote the margin mean; the optimal margin distribution machine can then be formulated as:
where is the regularization term to penalize the norm of the weights, and are trade-off parameters, and are the deviations of the margin from the margin mean. It is evident that is exactly the margin mean.
In the linear setting, scaling does not affect the final classification results (as in SVM), so the margin mean can be normalized as ; the deviation of the margin of from the margin mean is then , and the formulation can be rewritten as:
where is a parameter to trade off the two different kinds of deviation (keeping the balance on both sides of the margin mean), and is the parameter of the zero-loss band, which controls the number of support vectors. In other words, is a parameter that controls the margin variance, while the data outside this zero-loss band are used to update the weights to minimize the loss. For this reason, we simply regard it as the margin variance.
However, in the nonlinear setting of our paper, we cannot directly normalize the margin mean to the value by linear scaling. We therefore assume that the normalized margin mean is , and the optimization target can be reformulated as:
In our paper, we use the linear approximation of elsayed2018large to normalize the magnitude of the norm of the weights, so we can directly transform this optimization target into a loss function:
3 Theoretical Analysis
In this section, we analyze the generalization gap to understand the possibility of overfitting for our mdNet loss function. Theorem 1 shows that the generalization gap of mdNet can be controlled by the ratio between the margin variance and the margin mean.
To present a new margin bound for our optimal margin distribution loss, some notation is needed. Since convolutional neural networks can be regarded as a special structure of fully connected neural networks, we simplify the definition of the deep networks. Let be the function learned by an L-layer feed-forward network for the classification task with parameters , thus , where denotes the output of layer before activation and is an upper bound on the number of output units in each layer. Recursively, we can redefine the deep network: and . Let , and denote the Frobenius norm, the element-wise norm and the spectral norm respectively.
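As a small illustration of the three norms named above, the following sketch computes them for a weight matrix with NumPy. It assumes that the element-wise norm is the element-wise L1 norm (the symbol is not recoverable from the text), and it also computes the stable rank, the ratio of the squared Frobenius norm to the squared spectral norm, which the proof of Lemma 2 uses. The helper names are hypothetical.

```python
import numpy as np

def matrix_norms(W):
    """Return (Frobenius, element-wise L1, spectral) norms of a matrix."""
    W = np.asarray(W, dtype=float)
    fro = np.linalg.norm(W, "fro")
    l1 = np.abs(W).sum()          # assumption: element-wise L1 norm
    spec = np.linalg.norm(W, 2)   # largest singular value
    return fro, l1, spec

def stable_rank(W):
    """Squared Frobenius norm divided by squared spectral norm."""
    fro, _, spec = matrix_norms(W)
    return (fro / spec) ** 2

W = np.diag([3.0, 4.0])
print(matrix_norms(W))
print(stable_rank(W))
```

The stable rank never exceeds the true rank, and it is small when a few singular values dominate the matrix.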
In order to facilitate the theoretical derivation of our formula, we simplify the definition of the loss function:
Specifically, define as , which is equal to the 0-1 loss. And let
be the empirical estimate of the optimal margin distribution loss. We denote the expected risk and the empirical risk as and , which are bounded between 0 and 1.
3.2 Lemmas and Definitions
In the PAC-Bayesian framework, one expresses prior knowledge by defining a prior distribution over the hypothesis class. Following the Bayesian reasoning approach, the output of the learning algorithm is not necessarily a single hypothesis. Instead, the learning process defines a posterior probability over , which we denote by . In the context of a supervised learning problem, where contains functions from to , one can think of as defining a randomized prediction rule. We consider the distribution which is learned from the training data of form , where is a random variable whose distribution may also depend on the training data. Let be a prior distribution over that is independent of the training data. The PAC-Bayesian theorem states that, with probability at least over the choice of an i.i.d. training set sampled according to , for all distributions over (even those that depend on ), we have 10.1007/978-3-540-45167-9_16 :
Note that the left side of the inequality is based on . To get an expected risk bound for a single predictor , we have to relate this PAC-Bayesian bound to the expected perturbed loss, as neyshabur2018a do in Lemma 1 of their paper. Based on inequality 11, we introduce a perturbed restriction that is related to the margin distribution (the margin mean and margin variance ):
Let be any predictor with parameters , and be any distribution on the parameters that is independent of the training data. Then, for any , with probability at least over the training set of size , for any , and any random perturbation s.t. , we have:
The margin variance information does not change the conclusion of the perturbed restriction; the proof of this lemma is similar to that of Lemma 1 in neyshabur2018a .
In order to bound the change caused by perturbation, we have to bring in three definitions that are used to formalize error-resilience in DBLP:conf/icml/Arora0NZ18 as follows:
(Layer Cushion). The layer cushion of layer is defined to be largest number such that for any :
Intuitively, the cushion measures how much smaller the output is compared to the upper bound . However, for nonlinear operators the definition of error resilience is less clean. Let us denote the operator corresponding to the portion of the deep network from layer to layer , and by its Jacobian. If infinitesimal noise is injected before level then passes it like , a linear operator. When the noise is small but not infinitesimal, one hopes that the local linear approximation of the nonlinear operator still holds, which is captured by the interlayer cushion:
(Interlayer Cushion). For any two layers , we define the interlayer cushion , as the largest number such that for any :
Furthermore, for any layer i we define the minimal interlayer cushion as .
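To make the layer cushion of Definition 1 more concrete, here is a minimal sketch of how it could be estimated on a finite sample. It assumes the form used by DBLP:conf/icml/Arora0NZ18 as we understand it: the largest number such that the cushion times the Frobenius norm of the layer's weights times the norm of the layer's input is at most the norm of the layer's output, for every sample. The function name `layer_cushion` and the input layout (one row per sample) are hypothetical.

```python
import numpy as np

def layer_cushion(W, inputs, eps=1e-12):
    """Empirical layer cushion of a weight matrix W over a set of inputs.

    Assumption: the cushion is the minimum over samples of
    ||W x|| / (||W||_F * ||x||), where each row of `inputs` is the
    (post-activation) input x fed to this layer for one sample.
    """
    W = np.asarray(W, dtype=float)
    X = np.asarray(inputs, dtype=float)
    fro = np.linalg.norm(W, "fro")
    out_norms = np.linalg.norm(X @ W.T, axis=1)
    in_norms = np.linalg.norm(X, axis=1)
    ratios = out_norms / (fro * in_norms + eps)
    return ratios.min()

W = np.diag([3.0, 4.0])
X = np.array([[1.0, 0.0], [0.0, 1.0]])
print(layer_cushion(W, X))
```

Under this assumed form the value always lies between 0 and 1, and a larger value means the layer's output is closer to the worst-case upper bound on the samples.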
The next condition formalizes a common observation: if the input to the activations is well-distributed and the activations do not correlate with the magnitude of the input, then one would expect that on average, the effect of applying activations at any layer is to decrease the norm of the pre-activation vector by at most some small constant factor.
(Activation Contraction). The activation contraction is defined as the smallest number such that for any layer i and any :
To guarantee that the perturbation of the random variable does not cause a large change in the output with high probability, we need a perturbation bound that relates the change of the output to the structure of the network and the prior distribution over . Fortunately, neyshabur2018a proved a bound on the change of the output in terms of the norms of the parameter weights. In the following lemma, we preset our hyper-parameters and , s.t. the parameter weights satisfy , when fixing . Thus, we can bound this change in terms of the spectral norm of each layer and the preset hyper-parameters:
For any , let be an L-layer network. Then for any satisfying , and , and any perturbation s.t. , the change of the output of the network can be bounded as follows:
Proof of Lemma 2.
Let and ; we write the network as . If we give only the layer parameter weights a perturbation , we have the following:
In the last approximate equality in Equation 17, we assume that the perturbation lies in the linear space spanned by ; therefore, the part of that is orthogonal to the space of does not affect the output of the perturbation. In other words, we equate the projection onto the linear space of with the one onto the linear space of , i.e. .
where is the stable rank of layer , i.e. . Therefore, by induction we have:
Clearly, we have , when
Suppose that all the perturbations are independent of each other; we can then add their influences linearly via the union bound:
3.3 Generalization Error Bound
(Generalization Error Bound). For any , let be an L-layer feed-forward network with ReLU activations. Then, for any , with probability over a training set of size m, for any , we have:
where is the layer cushion defined in Definition 1, is the interlayer cushion defined in Definition 2, is the activation contraction defined in Definition 3. All these definitions are used to formalize error resilience in DBLP:conf/icml/Arora0NZ18 .
Proof of Theorem 1.
The proof involves two main steps. In the first step, using Lemma 2, we bound the maximum perturbation of the parameters that satisfies the condition that the change of the output is restricted by the margin hyper-parameter . In the second step, we prove the final margin generalization bound through Lemma 1, with the value of the KL term calculated based on the bound from the first step.
Let and consider a network structured by normalized weights . Due to the homogeneity of the ReLU, we have that for feedforward networks with ReLU activations , so the empirical and expected losses are the same for and . Furthermore, we also have and . Hence, we can simply assume that the spectral norm is equal across the layers, i.e. for any layer i, .
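The normalization argument above can be checked numerically. The sketch below assumes, following the construction in neyshabur2018a as we understand it, that each layer is rescaled so that all layers share the geometric mean of the original spectral norms. Because the scaling factors multiply to one and ReLU is positively homogeneous, the network output is unchanged. The helper names `relu_net` and `balance_spectral_norms` are hypothetical.

```python
import numpy as np

def relu_net(weights, x):
    """Bias-free feed-forward network with ReLU on all hidden layers."""
    h = np.asarray(x, dtype=float)
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

def balance_spectral_norms(weights):
    """Rescale each layer so that all spectral norms equal their geometric mean.

    Assumption: beta = (product of spectral norms) ** (1 / L), and layer i is
    multiplied by beta / ||W_i||_2, so the scaling factors multiply to 1.
    """
    norms = [np.linalg.norm(W, 2) for W in weights]
    beta = float(np.prod(norms)) ** (1.0 / len(weights))
    return [W * (beta / n) for W, n in zip(weights, norms)], beta

rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 4)), rng.normal(size=(6, 5)), rng.normal(size=(3, 6))]
x = rng.normal(size=4)
balanced, beta = balance_spectral_norms(weights)
print(np.allclose(relu_net(weights, x), relu_net(balanced, x)))
```

Since the outputs coincide, the margins and therefore the empirical and expected losses are identical for the original and the rescaled network, which is why the proof can assume equal spectral norms across layers.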
When we choose the distribution of the prior to be , i.e. , the problem is that we would set the parameter according to , which cannot depend on the learned predictor or its norm. neyshabur2018a proposed a method that avoids this obstacle: they set based on an approximation on a pre-determined grid. By formalizing this method, we can establish the generalization bound for all for which , given a constant , while ensuring that each relevant value of is covered by some on the grid, i.e. , can be considered as a constant.
Taking the union bound over layers, with probability , the spectral norm of each layer perturbation is bounded by . Plugging this into Lemma 2 we have that with probability :
We can obtain from the above inequality. Naturally, we can calculate the KL divergence in Lemma 1 with the chosen distributions for .
Hence, for any , with probability and for all such that, , we have:
This proof technique based on the PAC-Bayesian framework was proposed by neyshabur2018a ; we use this convenient tool to prove a generalization bound for our loss function, which can obtain the optimal margin distribution. ∎
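The second step of the proof relies on a KL term between a Gaussian posterior centred at the learned weights and a zero-mean Gaussian prior with the same variance. A minimal numeric sketch of that computation follows. The closed form of the KL divergence between two isotropic Gaussians with equal variance is standard. The complexity term, however, uses a McAllester-style form that is an assumption made here for illustration, so its constants may differ from those in the paper's bound. The function names are hypothetical.

```python
import math
import numpy as np

def gaussian_kl_same_cov(w, sigma):
    """KL( N(w, sigma^2 I) || N(0, sigma^2 I) ) = ||w||^2 / (2 sigma^2)."""
    w = np.asarray(w, dtype=float).ravel()
    return float(w @ w) / (2.0 * sigma ** 2)

def pac_bayes_gap(kl, m, delta):
    """Assumed complexity term: sqrt((KL + ln(m / delta)) / (2 (m - 1)))."""
    return math.sqrt((kl + math.log(m / delta)) / (2.0 * (m - 1)))

kl = gaussian_kl_same_cov([3.0, 4.0], sigma=1.0)
print(kl)
print(pac_bayes_gap(kl, m=1000, delta=0.05))
```

The sketch makes the trade-off in the proof visible: a larger perturbation scale lowers the KL term, but Lemma 2 limits how large the perturbation can be before the output changes by more than the margin allows.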
Remark. Compared with the spectral complexity in NIPS2017_7204 :
which is dominated by the product of spectral norms across all , our margin bound is related to , which depends on the margin distribution, and to and , which depend on the network structure. The model complexity in our generalization bound is easy to control through the ratio . Explicitly, the factor consisting of hyper-parameters is a monotonically increasing function of the ratio . Under the assumption of separability, we can conclude that smaller and larger make the complexity smaller. Searching for suitable values of and for the specific data distribution will lead to better generalization performance.
In this section, we empirically evaluate the effectiveness of our optimal margin distribution loss on generalization tasks, comparing it with three other loss functions: cross-entropy loss (Xent), hinge loss, and soft hinge loss. We first compare them in the limited-training-data setting, using only part of the MNIST dataset lecun1998gradient to train and evaluate models with the four different losses, with the fraction of data used ranging from 0.125% to 100%. Similar experiments are also performed on the CIFAR-10 dataset krizhevsky2009learning . Then we compare them under different regularization settings, investigating the combination of the optimal margin distribution loss with dropout and batch normalization. Finally, we visualize and compare the features learned by deep models with different loss functions, as well as the margin distributions of those models.
In Table 1, we introduce three commonly used loss functions in deep learning for comparison in the experimental section.
|soft hinge loss|