Deep convolutional neural networks have recently become the de facto approach for feature construction and classification in computer vision, with applications ranging from image classification for object recognition [1, 2, 3, 4, 5] and detection [6, 7] to end-to-end learning for complex computer vision tasks [8, 9]. Their impressive performance on a variety of tasks has made them widely applicable, including in transfer learning [10, 11]. The general applicability of deep neural network architectures such as Residual Networks, Inception models and Wide Resnet has made them popular recognition architectures for various applications.
The generalization properties of deep neural networks have been studied in great depth over the last few years, which has further contributed to their widespread adoption in real-world scenarios. Moreover, coupled with popular regularization methods such as Batch Normalization, Dropout and $L_2$ regularization, these models have achieved significant improvements in accuracy by further reducing the generalization error. Recent works in this area have led to new distribution-dependent regularization priors, such as direct regularization of model complexity and bounding the spectral norm of the network’s Jacobian matrix in the neighbourhood of the training samples. While each of these methods solves a unique problem, we look at the overall model in terms of the network complexity and its relative accuracy on a test set.
The goal of this study is to identify an empirically backed strong regularizer that reduces the complexity of the network by yielding smaller weights while performing equally well in the presence of input noise. By comparing accuracies on a hold-out dataset, we also show that the effects of regularization vary greatly across these methods. To the best of our knowledge, this is the first such study of regularization methods in deep learning under complexity bounds and varying input noise. Our primary contribution lies in providing strong experimental evidence in favor of certain regularizers over others. We show that deep networks are robust in the presence of input noise and that the $L_2$ norm acts as a proxy for the model complexity of neural networks. We also show that the distribution-dependent DARC1 regularizer and the $L_2$ regularizer perform well in the presence of input noise in the training set, since the network tends to prefer a simpler hypothesis in order to obtain better accuracy on a clean, noise-free test set.
The paper is divided into 5 sections. Section 2 introduces prerequisites on generalization error and presents a brief mathematical background on various measures of complexity and regularizers. Section 3 discusses the experimental settings used in the paper along with the results on accuracies, various notions of model complexity, and techniques to control the complexity such as the $L_2$ norm, DARC1 norm, Dropout, Jacobian norm and Spectral regularization. Section 4 presents insights and key findings obtained from the experiments. Finally, Section 5 presents conclusions and future lines of research.
In this section, we describe the background, the measures of complexity and the regularizers that control the model complexity of a neural network. First, we present the notation used in the paper in Table 1.
|3||Input sampled from|
|4||Output sampled from|
|6||Set of functions/hypothesis space|
|8||Family of loss functions associated with .|
|9||# training samples|
|10||# input dimension|
|12||, expected risk of a function|
|13||Model learned by learning algorithm|
|14||i.i.d. training dataset of size|
|15||, empirical risk of|
Weight decay/DARC1/Jacobian/LCNN norm hyperparameter
margin of a classifier
|20||Vapnik-Chervonenkis dimension of the hypothesis class|
The goal in machine learning is to minimize the expected risk. However, since the expected risk is not computable, we instead minimize the computable empirical risk. The generalization gap is the difference between the expected risk and the empirical risk of the learned model.
A major drawback of this approach is that the learned model depends on the same dataset that is used to define the empirical risk. One way to tackle this is to consider the worst-case gap over all functions in the hypothesis space.
A union bound over all the elements of the hypothesis class yields a vacuous bound, hence we consider other quantities to characterize the complexity of the hypothesis space, namely the Vapnik-Chervonenkis (VC) dimension [17, 18] and the Rademacher complexity [19, 20, 21]. If the codomain of the loss is bounded, then for any confidence level the worst-case gap is bounded, with the corresponding probability, by a term proportional to the Rademacher complexity of the family of loss functions plus a term that shrinks as the number of training samples grows. The former can in turn be bounded by the Rademacher complexity of the hypothesis space. Similar bounds can be found in the literature using the VC dimension, the fat-shattering dimension and covering numbers. We highlight the various bounds for a neural network in the next subsection.
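For reference, the quantities discussed above can be written out as follows. This is a sketch in assumed notation, chosen to match common usage in the cited literature rather than taken from this text: $R[f]$ is the expected risk, $\hat{R}_m[f]$ the empirical risk on $m$ training samples, $f_{A(S_m)}$ the model learned by algorithm $A$ on dataset $S_m$, $\mathcal{F}$ the hypothesis space, $\mathcal{L}_{\mathcal{F}}$ the associated family of loss functions, $\mathfrak{R}_m$ the Rademacher complexity, and the loss is assumed to take values in $[0, M]$.

```latex
% Assumed notation; the last line is the standard Rademacher bound for a loss with codomain [0, M].
\begin{align}
\text{generalization gap} &= R[f_{A(S_m)}] - \hat{R}_m[f_{A(S_m)}] \\
R[f_{A(S_m)}] - \hat{R}_m[f_{A(S_m)}] &\le \sup_{f \in \mathcal{F}} \left( R[f] - \hat{R}_m[f] \right) \\
\sup_{f \in \mathcal{F}} \left( R[f] - \hat{R}_m[f] \right) &\le 2\,\mathfrak{R}_m(\mathcal{L}_{\mathcal{F}}) + M \sqrt{\frac{\ln(1/\delta)}{2m}}
\quad \text{with probability at least } 1 - \delta
\end{align}
```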
2.1 Measures of Complexities of a Neural Network
Consider a deep net with a given number of layers and a given output margin on the training set. Various generalization bounds proposed in the literature are listed in Table 2; one term appearing in them denotes the number of parameters of a multilayered feed-forward neural network.
|2||Bartlett et al. |
|3||Neyshabur et al., Sharma et al. [21, 22]|
|4||Bartlett et al. |
|5||Neyshabur et al. [23, 24]|
|6||Kawaguchi et al. |
|7||Pant et al. |
One expression appearing in these bounds is the sum of the stable ranks of the layers, which is a measure of the parameter count; another is related to the Lipschitz constant of the network. In this paper, however, we use the sum of the norms of the weights as a measure of the network complexity.
We now briefly describe the regularizers used in neural networks, namely the $L_2$ norm, Dropout, Jacobian, DARC1 and spectral normalization. We study the behavior of these regularizers in the presence of varying input noise. In doing so, we characterize the complexity of the network by the norm of the parameters and the DARC1 Rademacher bound.
$L_2$ norm: Weight decay, or $L_2$ regularization, is one of the most popular regularization techniques used to control model complexity. It amounts to penalizing the $L_2$ norm of the weights of the network. Almost all the generalization bounds are a function of a norm of the weights. Sontag first proposed a VC dimension bound for a multilayer neural network in terms of the number of parameters. But for a network with millions of parameters this bound turns out to be vacuous, as the network generalizes well with far fewer samples than parameters. Bartlett also proposed a bound on the VC dimension, or fat-shattering dimension, of a multilayer feed-forward neural network in terms of the product of the norms of the weights. This bound showed that, in order to control the complexity of a network, one has to regularize the norm of the weights. Neyshabur presented a generalized analysis of complexity in terms of a family of norms of the weights. Some other recent bounds are listed in Table 2.
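The weight-decay penalty described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the experiments: the function names, the hyperparameter `lam` and the choice of summing squared norms over layers are assumptions made for the example.

```python
import numpy as np

def l2_penalty(weights, lam):
    """Weight-decay term: lam times the sum of squared L2 norms of all layers.

    `weights` is a list of NumPy arrays (one per layer); `lam` is a
    hypothetical regularization hyperparameter.
    """
    return lam * sum(float(np.sum(w ** 2)) for w in weights)

def sum_of_l2_norms(weights):
    """Sum of the L2 (Frobenius) norms of the layers, usable as a complexity proxy."""
    return sum(float(np.linalg.norm(w)) for w in weights)

# Example with two small "layers".
example_weights = [np.array([[3.0, 4.0]]), np.array([1.0, 2.0, 2.0])]
penalty = l2_penalty(example_weights, lam=0.5)
norm_sum = sum_of_l2_norms(example_weights)
```

In training, the penalty would simply be added to the data loss before taking gradients.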
DARC1 norm: Kawaguchi et al. suggested minimizing the max norm of the activations as a regularizer. They termed the method Directly Approximately Regularizing Complexity (DARC) and named a basic version of their proposed regularization prior DARC1. They argue that the common generalization bounds (such as those listed in Table 2) are too loose to be used in practice. They therefore consider a margin-based 0-1 loss.
Finally, they derive a bound involving the Rademacher complexity of the hypothesis class. This bound depends on the margin, on the Rademacher variables (with a supremum taken over the elements allowed in the hypothesis class), on the confidence level and on the true label of each sample.
Instead of using worst-case vacuous bounds, they approximate the complexity term with an expectation over the known dataset. DARC1 is the resulting regularization term, added to the original loss on each minibatch; it penalizes the largest, over the output units, sum of absolute final-layer outputs.
Following Zhang et al., the regularizer can be seen as penalizing the most confident predictions, thereby reducing the tendency of the model to overfit. A similar analysis is done using the Low Complexity Neural Network (LCNN) loss in Jayadeva et al., which uses a different norm, instead of the max-norm, over the hypothesis class of the final layer of the network. Here, the original loss can be any of the standard loss functions used in the literature, viz. cross-entropy, max-margin, 0-1, etc.
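A minimal NumPy sketch of a DARC1-style term on one minibatch is given below. It assumes the regularizer takes the form of a scaled maximum, over output units, of the summed absolute final-layer outputs; the names `darc1_penalty`, `regularized_loss` and `lam`, and the division by the minibatch size, are illustrative assumptions rather than the exact implementation used in the experiments.

```python
import numpy as np

def darc1_penalty(outputs, lam):
    """DARC1-style term for a minibatch.

    `outputs` is an (m, k) array holding the k final-layer outputs for each
    of the m samples in the minibatch; `lam` is a hypothetical hyperparameter.
    """
    m = outputs.shape[0]
    per_unit = np.abs(outputs).sum(axis=0)  # sum of |output| over samples, per output unit
    return lam / m * float(per_unit.max())  # keep only the largest unit-wise sum

def regularized_loss(original_loss, outputs, lam):
    """Original loss (e.g. cross-entropy) plus the DARC1-style term."""
    return original_loss + darc1_penalty(outputs, lam)

# Example: two samples, two output units.
example_outputs = np.array([[1.0, -2.0], [3.0, -4.0]])
example_penalty = darc1_penalty(example_outputs, lam=0.5)
example_total = regularized_loss(1.0, example_outputs, lam=0.5)
```

Because only the largest unit-wise sum is penalized, the gradient of this term flows through a single output unit per minibatch.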
LCNN norm: The Low Complexity Neural Network regularizer tries to upper bound the VC dimension of a neural network using the radius-margin bound. It is known that the VC dimension of a large-margin linear classifier is upper bounded by one plus the minimum of two quantities: the squared ratio of the radius of the data to the margin (rounded up to an integer) and the number of input features.
A similar analysis can be performed for the last layer of a neural network. Here, we state without proof that the VC dimension of a neural network can be bounded in the same manner.
The LCNN regularization term is added to the original loss on each minibatch.
The LCNN term penalizes large values in the last layer and acts as a confidence penalty. Here, the original loss can be any of the standard loss functions used in the literature, viz. cross-entropy, max-margin, 0-1, etc.
Jacobian Regularizer: Sokolic et al. have argued that the existing generalization bounds for deep neural networks (DNNs) grow disproportionately with respect to the number of training samples. To resolve this, they propose a new lower bound, expressed as a function of the network’s Jacobian matrix and based on the robustness framework. The Jacobian matrix of a DNN is the matrix of partial derivatives of the network output with respect to its input.
Adding a Jacobian regularizer also makes the network robust to changes in the input, since it has the effect of inducing a large classification margin at the input. Theorem 4 of that work gives the classification margin of a point in terms of its score, expressed using Kronecker delta vectors.
From this, a bound on the generalization error follows.
The bound involves a constant that characterizes the dimension of the data manifold. It shows that the generalization error does not increase with the number of layers, provided the spectral norms of the weight matrices are bounded. If we assume that the weight matrices contain orthonormal rows, then the generalization error depends on the complexity of the data manifold and not on depth. The Jacobian regularizer penalizes the norm of the network’s Jacobian matrix evaluated at the training samples.
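To make the Jacobian penalty concrete, the sketch below computes the input-output Jacobian of a small two-layer ReLU network in closed form and penalizes its squared Frobenius norm, averaged over a batch. The network, the choice of the Frobenius norm and the names `mlp_jacobian`, `jacobian_penalty` and `lam` are assumptions made for illustration; they are not claimed to match the regularizer used in the experiments.

```python
import numpy as np

def mlp_forward(x, w1, w2):
    """Two-layer ReLU network without biases: f(x) = w2 @ relu(w1 @ x)."""
    return w2 @ np.maximum(w1 @ x, 0.0)

def mlp_jacobian(x, w1, w2):
    """Jacobian of f at x: w2 @ diag(1[w1 @ x > 0]) @ w1."""
    active = (w1 @ x > 0.0).astype(float)  # ReLU derivative for each hidden unit
    return w2 @ (active[:, None] * w1)

def jacobian_penalty(xs, w1, w2, lam):
    """lam times the mean squared Frobenius norm of the Jacobian over the batch xs."""
    sq_norms = [np.sum(mlp_jacobian(x, w1, w2) ** 2) for x in xs]
    return lam * float(np.mean(sq_norms))

# Example with a tiny hand-picked network.
w1_example = np.array([[1.0, 0.0], [0.0, -1.0]])
w2_example = np.array([[1.0, 2.0]])
x_example = np.array([1.0, 1.0])
jac_example = mlp_jacobian(x_example, w1_example, w2_example)
pen_example = jacobian_penalty(x_example[None, :], w1_example, w2_example, lam=0.1)
```

For deeper networks the Jacobian would normally be obtained by automatic differentiation rather than written out by hand.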
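Spectral Normalization: Spectral normalization controls the Lipschitz constant of the network by constraining the spectral norm of each layer. The Lipschitz constant of a general differentiable function is the maximum singular value of its gradient over its domain.
For composite functions, the Lipschitz constant is at most the product of the Lipschitz constants of the components. Spectral normalization, proposed for generative adversarial networks, replaces each weight matrix with that matrix divided by its spectral norm, i.e. its largest singular value. The spectral norm is computed using the power iteration method. Consider the linear map defined by a weight matrix, together with a vector in the domain of the matrix and a vector in its codomain. Power iteration alternately updates these two vectors by applying the matrix (or its transpose) and normalizing the result.
On further simplification (see Algorithm 1 in the spectral normalization paper), the two vectors yield an estimate of the largest singular value.
Finally, the weight matrix is updated by dividing it by this estimate.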
The authors of spectral normalization argue that the previously proposed gradient regularizer, which is similar in concept to the Jacobian regularizer, has the drawback that it cannot regularize the function at points outside the support of the current distribution. They also show that spectral normalization is not destabilized by large learning rates, whereas the Jacobian regularizer falters with aggressive learning rates.
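The power iteration used by spectral normalization can be sketched as follows. This is a minimal NumPy illustration for a single dense weight matrix; the function names, the fixed number of iterations and the random initialization are assumptions, and practical implementations typically run one iteration per training step and reuse the vectors across steps.

```python
import numpy as np

def spectral_norm_estimate(w, n_iters=50, seed=0):
    """Estimate the largest singular value of w by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(w.shape[0])  # vector in the codomain of w
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = w.T @ u                      # vector in the domain of w
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    return float(u @ w @ v)              # u^T w v approximates the top singular value

def spectrally_normalize(w, n_iters=50):
    """Return w divided by its estimated spectral norm."""
    return w / spectral_norm_estimate(w, n_iters)

# Example: a diagonal matrix whose largest singular value is 3.
w_example = np.array([[3.0, 0.0], [0.0, 1.0]])
sigma_example = spectral_norm_estimate(w_example)
w_normalized = spectrally_normalize(w_example)
```

After normalization the largest singular value of the matrix is approximately 1, so the layer is approximately 1-Lipschitz as a linear map.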
Spectral normalization has another advantage in terms of controlling the model complexity. Following Bound 5 of Neyshabur et al. [23, 24] presented in Table 2 and setting the term close to 1, we obtain a simplified bound (eq. 19).
This bound shows that, in order to control the complexity of the model, one needs to perform both spectral normalization and $L_2$ normalization. Only in this case does the sum of the norms of the weights of the network truly indicate the capacity of the architecture.
Dropout: It is known that models with a large number of parameters, such as deep neural architectures, can model very complex functions and phenomena. This also means that such models have a tendency to overfit. Regularizing such a model is thus imperative for good performance on the unseen test set. Dropout is one such technique, proposed by Srivastava et al. Dropout works by randomly and temporarily deleting neurons in the hidden layer during training with a fixed probability. During testing/prediction, we feed the input to the unmodified layer but scale the layer output by the corresponding retention probability. Dropout acts as an averaging scheme over the outputs of a large number of networks, which is often found to be a powerful way of reducing overfitting.
Also, since a neuron cannot rely on the presence of other neurons, it is forced to learn features that do not depend on them. This reduces the co-adaptation of features, so the network learns robust features and is less susceptible to noise.
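The train-time and test-time behavior of Dropout described above can be sketched as below. The parameter `keep_prob` (the probability of retaining a neuron) and the function names are assumptions made for this example; the sketch follows the original formulation in which scaling happens at test time rather than during training.

```python
import numpy as np

def dropout_train(activations, keep_prob, rng):
    """Training: zero each activation independently with probability 1 - keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask

def dropout_test(activations, keep_prob):
    """Testing: keep every neuron but scale the output by the retention probability."""
    return activations * keep_prob

# Example: the average train-time output matches the test-time output in expectation.
rng_example = np.random.default_rng(0)
ones = np.ones(10000)
train_out = dropout_train(ones, keep_prob=0.8, rng=rng_example)
test_out = dropout_test(np.array([2.0, 4.0]), keep_prob=0.5)
```

Many libraries instead use "inverted" dropout, which divides by the retention probability during training and leaves the test-time forward pass unchanged; the two variants agree in expectation.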
The following section describes the experimental settings and results by varying the input noise.
3 Experiments and Results
In this paper we analyze various regularizers and their behavior under input noise. We characterize the complexity required by each network with different regularizers as the noise in the input increases. Following Arora et al., we find that deep nets are noise-stable. Shallow nets, however, are not, as is evident from the results.
3.1 Experimental settings
We now describe the experimental settings used in the paper. To begin with, we add class-specific Gaussian noise to each input example in the training set, while the validation and test sets remain unchanged.
Each noisy dataset is parameterized by a noise level, which ranges from 0 to 1.2 in steps of 0.2. The procedure to generate the noisy datasets is described in Algorithm 1.
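One possible way of adding class-specific Gaussian noise to a training set is sketched below. The details are assumptions made for illustration only and are not taken from Algorithm 1: here the noise for each class is zero-mean with a per-feature standard deviation equal to `noise_level` times the empirical standard deviation of that class, and the function name `add_class_noise` is hypothetical.

```python
import numpy as np

def add_class_noise(x, y, noise_level, seed=0):
    """Return a noisy copy of x, with Gaussian noise scaled per class.

    x: (n, d) array of training inputs; y: (n,) array of class labels.
    Assumption: the noise standard deviation for a class is noise_level
    times the per-feature standard deviation of that class's samples.
    """
    rng = np.random.default_rng(seed)
    x_noisy = x.astype(float).copy()
    for c in np.unique(y):
        idx = (y == c)
        class_std = x[idx].std(axis=0)                      # per-feature spread of class c
        noise = rng.standard_normal(x[idx].shape) * class_std * noise_level
        x_noisy[idx] += noise
    return x_noisy

# Example: six samples, two features, two classes.
x_example_data = np.arange(12.0).reshape(6, 2)
y_example_data = np.array([0, 0, 0, 1, 1, 1])
clean_copy = add_class_noise(x_example_data, y_example_data, noise_level=0.0)
noisy_copy = add_class_noise(x_example_data, y_example_data, noise_level=0.5)
```

Only the training inputs would be passed through such a function; the validation and test sets stay clean.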
We include two types of datasets in this work: four non-image datasets and one image dataset. We trained two networks, a shallow network for the non-image datasets and a deeper network for the image dataset. The details of the networks are presented below:
A Wide Resnet 28-10 trained for 50 epochs, also with ADAM optimizer.
Both these networks have Batch normalization applied on each layer.
The same initial learning rate is used for both networks.
The shallow network was trained separately with seven different regularization settings, viz. no regularization, $L_2$ norm, DARC1 norm, LCNN norm, Dropout, Jacobian and Spectral normalization. The deep network was trained with the same settings. We tuned each network on the validation set and report the results on the test set for the best hyperparameter setting.
For the shallow conv net we used the hyperparameter settings listed in Table 3.
For the deep network, we used the hyperparameters listed in Table 4.
We show results on 5 datasets in this paper. We experimented with two neural network architectures: a shallow convolutional net (with three layers) for datasets 1 to 4 and a deep one (Wide Resnet 28-10) for dataset 5 (CIFAR-10). The datasets are described in Table 5.
|S.no.||Dataset_name||# Train||# Val||# Test||# Features||# Classes|
We compare test set accuracies, log2 norm, DARC1 norm and loss for each dataset. First, we present the results for the shallow conv net and thereafter show the results for a Wide Resnet 28-10 on the CIFAR-10 dataset.
Figure 0(a) shows a bar plot of the mean accuracy across the four datasets used for testing the shallow architecture. We observe that the accuracies fall as the noise in the training set increases. Up to a noise level of 6 there are no differences in accuracy across regularizers; at a noise level of 8, however, the Jacobian regularizer outperforms the others, closely followed by the spectral norm regularizer and the $L_2$ regularizer. The LCNN regularizer, however, does not perform well at higher noise levels for shallow architectures.
Figure 0(b) shows the mean log2-scaled norms obtained for the shallow architecture. We observe that the network with $L_2$ regularization results in the smallest norm. For the networks without regularization and with DARC1 or LCNN regularization, we find that the norm decreases as the noise increases, whereas for Dropout, Jacobian, Spectral and $L_2$ regularization we observe an initial increase in the norm followed by a decline. The latter trend is more pronounced in the case of the $L_2$ norm. This indicates that with a slight increase in noise the model first increases its complexity to fit the noise, but as the noise grows larger the model reduces its complexity to fit the non-noisy validation set. Similar trends are observed for the DARC1 regularizer (fig. 0(c)), the LCNN regularizer (fig. 0(d)) and the Jacobian regularizer (fig. 0(e)). Since LCNN is an upper bound on DARC1, the graphs have a similar trend.
Figure 0(f) shows the test set cross entropy loss as the noise increases. We observe an increasing trend in all the cases as noise increases.
We now show the results of these regularizers on a deeper architecture (Wide Resnet 28-10) trained on an image dataset. For this experiment we varied the noise levels from 0 to 8 in steps of 2.
Figure 1(a) shows the accuracies obtained as the noise ratio increases for various regularization techniques. We observe that Dropout results in a non-robust curve, where the accuracy drops sharply at first as the noise increases and then rises again with further noise. Spectral normalization results in the highest accuracies and is not affected by noise. For the other methods there is no appreciable drop in accuracy as the noise increases, thus supporting the hypothesis that deep neural nets are robust to input noise.
Figure 1(b) shows the log2 norm. We see that the norm is smallest for $L_2$ regularization, followed by spectral norm. The norm in the case of $L_2$ regularization is orders of magnitude smaller than the rest, despite comparable accuracies. This indicates that deeper architectures have high model complexity, which can be controlled using the $L_2$ norm.
Figures 1(c) and 1(d) show the DARC1 and LCNN norms respectively. It can be seen that the LCNN regularizer results in the smallest respective norms, followed by Spectral norm and the $L_2$ norm. Dropout results in the largest DARC1 and LCNN norms.
Figure 1(e) shows the Jacobian norm for various regularizers. We observe that the Jacobian norm is smallest for the LCNN regularizer. This suggests that the error signal is not able to propagate to the initial layers, which may be due to a high value of the LCNN hyperparameter. Among the others, Spectral norm shows the smallest values of the Jacobian norm and also the smallest variation, which corresponds to its robustness in accuracy as the noise increases. Dropout shows the highest variation in Jacobian norm, which corresponds to its large variation in accuracy as the noise increases. We also observe that the Jacobian norm increases with noise, which is indicative of increasing uncertainty in the data.
From the results we can obtain the following insights:
The model complexity can be fairly represented using the $L_2$ norm, DARC1 norm and LCNN norm.
$L_2$ regularization and DARC1 regularization perform well in controlling the model complexity of the network. Their smaller values are indicative of a less complex hypothesis class learnt under stronger regularization.
As the noise increases for shallow networks, the $L_2$ norm and DARC1 norm increase at first and then decrease, indicative of the simpler hypothesis learnt when the validation set is noise-free.
For the shallow network, the Jacobian regularizer performs well in the presence of noise. For deeper architectures, the network itself performs noise reduction and thus Jacobian regularization has minimal effect. Deeper networks are robust to input noise, as elucidated in Arora et al.
Spectral normalization performs best in the case of noisy inputs, closely followed by the DARC1 regularizer, which is derived from a distribution-dependent bound.
The test set loss is a good indicator of the test set accuracy.
Dropout alone does not perform well in the case of noisy inputs, nor does it control the model complexity given in terms of the $L_2$ norm, DARC1 norm or LCNN norm.
Adding input noise can result in better generalization properties for deeper architectures; however, this effect is less pronounced for shallow architectures.
In this paper we presented a study of the effect of regularization on model complexity and generalization as the noise in the inputs increases. We find multiple notions of model complexity in the literature, ranging from distribution-independent VC dimension bounds for neural networks [18, 22] to distribution-dependent Rademacher complexity bounds [19, 21, 23, 20]. All these bounds are in terms of the product or sum of the norms of the weights of the network. We used the sum of the norms of the weights as a proxy for model complexity. This directly translates to adding a weight decay regularizer that penalizes larger weights. The recently proposed DARC1 and LCNN regularizers are a step towards distribution-dependent bounds for neural networks. We see that both the $L_2$ and DARC1 regularizers control the model complexity.
The experiments demonstrate that the $L_2$ norm is best suited for controlling model complexity as well as for generating solutions with high accuracies. For deeper networks, Dropout alone neither results in higher accuracies nor controls the model complexity in terms of the norm of the weights. We also see that the newly proposed distribution-dependent DARC1 and LCNN regularizers perform as well as $L_2$ regularization. Spectral normalization gives some of the best results for deeper architectures, due to stable gradient propagation. Our experiments also show that deep neural networks are generally more robust to varying degrees of Gaussian input noise than shallow architectures and can therefore effectively model any number of such distributions under strong priors. In future, we shall explore novel regularization schemes like shakedrop and shake-shake regularization. We have refrained from discussing recurrent architectures like the Long Short-Term Memory (LSTM) network [34, 35], which shall be the focus of our future research.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
-  ——, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
-  S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.
-  A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3128–3137.
-  H. Nam and B. Han, “Learning multi-domain convolutional neural networks for visual tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4293–4302.
-  A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with convolutional neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 766–774.
-  S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
-  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
-  K. Kawaguchi, L. P. Kaelbling, and Y. Bengio, “Generalization in deep learning,” arXiv preprint arXiv:1710.05468, 2017.
-  J. Sokolić, R. Giryes, G. Sapiro, and M. R. Rodrigues, “Robust large margin deep neural networks,” IEEE Transactions on Signal Processing, vol. 65, no. 16, pp. 4265–4280, 2017.
-  P. L. Bartlett, “For valid generalization the size of the weights is more important than the size of the network,” in Advances in neural information processing systems, 1997, pp. 134–140.
-  E. D. Sontag, “Vc dimension of neural networks,” NATO ASI Series F Computer and Systems Sciences, vol. 168, pp. 69–96, 1998.
-  P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 463–482, 2002.
-  P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, “Spectrally-normalized margin bounds for neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 6240–6249.
-  B. Neyshabur, R. Tomioka, and N. Srebro, “Norm-based capacity control in neural networks,” in Conference on Learning Theory, 2015, pp. 1376–1401.
-  M. Sharma, S. Soman et al., “Radius-margin bounds for deep neural networks,” arXiv preprint arXiv:1811.01171, 2018.
-  B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro, “Exploring generalization in deep learning,” in Advances in Neural Information Processing Systems, 2017, pp. 5947–5956.
-  ——, “A pac-bayesian approach to spectrally-normalized margin bounds for neural networks,” arXiv preprint arXiv:1707.09564, 2017.
-  C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
-  H. Pant, M. Sharma, A. Dubey, S. Soman, S. Tripathi, S. Guruju, N. Goalla et al., “Learning neural network classifiers with low model complexity,” arXiv preprint arXiv:1707.09933, 2017.
-  H. Xu and S. Mannor, “Robustness and generalization,” Machine learning, vol. 86, no. 3, pp. 391–423, 2012.
-  T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” arXiv preprint arXiv:1802.05957, 2018.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
-  S. Arora, R. Ge, B. Neyshabur, and Y. Zhang, “Stronger generalization bounds for deep nets via a compression approach,” arXiv preprint arXiv:1802.05296, 2018.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  Y. Yamada, M. Iwamura, and K. Kise, “Shakedrop regularization,” arXiv preprint arXiv:1802.02375, 2018.
-  X. Gastaldi, “Shake-shake regularization,” arXiv preprint arXiv:1705.07485, 2017.
-  F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with lstm,” 1999.
-  K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “Lstm: A search space odyssey,” IEEE transactions on neural networks and learning systems, vol. 28, no. 10, pp. 2222–2232, 2017.