Effect of Various Regularizers on Model Complexities of Neural Networks in Presence of Input Noise

by   Mayank Sharma, et al.

Deep neural networks are over-parameterized, which implies that the number of parameters is much larger than the number of samples used to train the network. Even in such a regime deep architectures do not overfit. This phenomenon is an active area of research and many theories have been proposed to explain it. These include Vapnik-Chervonenkis (VC) dimension bounds and Rademacher complexity bounds, which show that the capacity of the network is characterized by the norm of the weights rather than the number of parameters. However, the effect of input noise on these measures for shallow and deep architectures has not been studied. In this paper, we analyze the effects of various regularization schemes on the complexity of a neural network subject to varying degrees of Gaussian input noise. We characterize this complexity with the loss, the L_2 norm of the weights, a Rademacher complexity based measure (Directly Approximately Regularizing Complexity, DARC1) and a VC dimension based measure (Low Complexity Neural Network, LCNN). We show that L_2 regularization leads to a simpler hypothesis class and better generalization, followed by the DARC1 regularizer, for both shallow and deeper architectures. The Jacobian regularizer works well for shallow architectures at high levels of input noise. Spectral normalization attains the highest test set accuracies for both shallow and deeper architectures. We also show that Dropout alone does not perform well in the presence of input noise. Finally, we show that deeper architectures are robust to input noise, as opposed to their shallow counterparts.



1 Introduction

Deep convolutional neural networks have recently become the de facto approach for feature construction and classification in computer vision, with applications ranging from image classification for object recognition [1, 2, 3, 4, 5] and detection [6, 7] to end-to-end learning for complex computer vision tasks [8, 9]. Their impressive performance has made them applicable to a wide variety of problems, including transfer learning [10, 11]. The general applicability of deep architectures such as Residual Networks [5], Inception models [3] and Wide ResNet [12] has made them popular recognition architectures.

The generalization properties of deep neural networks have been studied in great depth over the last few years. This has further resulted in their widespread adoption in real-world scenarios. Moreover, coupled with popular regularization methods such as Batch Normalization [13], Dropout [14] and L_2 regularization, these models have achieved significant improvements in accuracy by further reducing the generalization error. Recent works in this area have led to new distribution-dependent regularization priors, such as direct regularization of model complexity [15] and bounding the spectral norm of the network's Jacobian matrix in the neighbourhood of the training samples [16]. While each of these methods solves a unique problem, we look at the overall model in terms of the network complexity and its relative accuracy on a test set.

The goal of this study is to identify an empirically backed strong regularizer that reduces the complexity of the network by producing smaller weights and performs equally well in the presence of input noise. By comparing accuracies on a held-out dataset, we also show that the effects of regularization vary greatly across these methods. To the best of our knowledge, this is the first such study of regularization methods in deep learning under complexity bounds and varying input noise. Our primary contribution lies in demonstrating strong experimental evidence in favor of certain regularizers over others. We show that deep networks are robust in the presence of input noise and that the L_2 norm acts as a proxy for the model complexity of neural networks. We also show that the distribution-dependent DARC1 regularizer and the L_2 regularizer perform well when input noise is present in the training set, since the network tends to prefer a simpler hypothesis to get better accuracy on a clean, noise-free test set.

The paper is divided into 5 sections. Section 2 introduces the prerequisites on generalization error and presents a brief mathematical background of various complexity measures and regularizers. Section 3 discusses the experimental settings used in the paper, along with results on accuracy, on various notions of model complexity, and on techniques to control that complexity such as the L_2 norm, DARC1 norm, Dropout, Jacobian norm and Spectral normalization. Section 4 presents insights and key findings obtained from the experiments. Finally, Section 5 presents conclusions and future lines of research.

2 Preliminaries

In this section, we describe the background, measures of complexities and the regularizers that control the model complexity of the neural network. Firstly, we present various notations used in the paper in the Table 1.

S.no. Notation Description
1 Input space
2 Label space
3 Input sampled from
4 Output sampled from
5 Distribution of
6 Set of functions/hypothesis space
7 Loss function
8 Family of loss functions associated with .
9 # training samples
10 # input dimension
11 # classes
12 , expected risk of a function
13 Model learned by learning algorithm
14 i.i.d. training dataset of size
15 , empirical risk of
16 Noise level

Weight decay/DARC1/Jacobian/LCNN norm hyperparameter


margin of a classifier

20 Vapnik-Chervonenkis dimension of the hypothesis class
Table 1: Notations

The goal in machine learning is to minimize the expected risk

. However, the expected risk being non-computable, we aim to minimize the computable empirical risk denoted by . The generalization gap is given by:


A major drawback of this approach is the dependence of on the same dataset used in the definition of . One way to tackle this is to consider the worst-case gap for functions in the hypothesis space.


A union bound over all the elements of the hypothesis class yields a vacuous bound, hence we consider other quantities to characterize the complexity of namely, Vapnik-Chervonenkis (VC) dimension [17, 18] and Rademacher Complexity [19, 20, 21]. If the codomain of the loss is given by , for any

with probability at least



where, is the Rademacher Complexity of , which can then be bounded by Rademacher complexity of F denoted as . Similar bounds can be found in the literature using VC dimension, fat shattering dimension and covering numbers. We now highlight the various bounds for a neural network in the next subsection.
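To make the Rademacher term concrete, the following is a minimal sketch that estimates the empirical Rademacher complexity of a simple hypothesis class by Monte Carlo. It is not the paper's setup: the class {x -> w.x : ||w||_2 <= B}, the bound B and the function name are illustrative assumptions. For this class the supremum over w has the closed form (B/m) ||sum_i sigma_i x_i||_2 (by the Cauchy-Schwarz inequality), so the estimate only needs an average over random sign vectors.

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    (assumed) class {x -> w.x : ||w||_2 <= B} on the sample X of shape (m, d)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Rademacher variables: independent uniform +/-1 signs, one row per draw.
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))
    # sup over ||w|| <= B of (1/m) sum_i sigma_i w.x_i = (B/m) ||sum_i sigma_i x_i||_2
    sups = (B / m) * np.linalg.norm(sigma @ X, axis=1)
    return float(sups.mean())

X = np.random.default_rng(1).normal(size=(200, 5))
small = empirical_rademacher_linear(X, B=1.0)
large = empirical_rademacher_linear(X, B=4.0)
```

With the same random signs, the estimate scales linearly with the norm bound B, which illustrates why norm-based bounds tie capacity to the size of the weights rather than to the parameter count.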

2.1 Measures of Complexities of a Neural Network

Consider a deep net with layers and output margin on training set . Various generalization bounds proposed in the literature are mentioned in the Table 2. The term

denotes the number of parameters of a multilayered feed forward neural network.

S.no. Reference Measure
1 Sontag [18]
2 Bartlett et al. [19]
3 Neyshabur et al., Sharma et al. [21, 22]
4 Bartlett et al. [20]
5 Neyshabur et al. [23, 24]
6 Kawaguchi et al. [15]
7 Pant et al. [15]
Table 2: Generalization bounds for neural networks

The expression is the sum of stable ranks of the layers which is a measure of the parameter count. The expression is related to the Lipschitz constant of the network. However, for this paper we use as a measure of the network complexity.
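As an illustration of the two quantities mentioned above, the sketch below computes the sum of stable ranks and a Lipschitz-related term for a list of weight matrices. It assumes the standard definition of stable rank, ||W||_F^2 / ||W||_2^2, and assumes that the Lipschitz-related expression is the product of the per-layer spectral norms (an upper bound on the Lipschitz constant of a feed-forward network with 1-Lipschitz activations). The helper names and layer shapes are hypothetical.

```python
import numpy as np

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2; lies between 1 and rank(W)."""
    fro_sq = np.linalg.norm(W, "fro") ** 2
    spec_sq = np.linalg.norm(W, 2) ** 2  # largest singular value, squared
    return fro_sq / spec_sq

def complexity_summary(weights):
    """Return (sum of stable ranks, product of spectral norms) for 2-D weight matrices."""
    sum_stable = sum(stable_rank(W) for W in weights)
    spectral_product = float(np.prod([np.linalg.norm(W, 2) for W in weights]))
    return sum_stable, spectral_product

# Hypothetical two-layer network with random weights.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 32)), rng.normal(size=(32, 10))]
sum_sr, lip = complexity_summary(layers)
```

Convolutional kernels would first need to be reshaped to 2-D matrices; that reshaping choice is left out of this sketch.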

2.2 Regularizers

We now briefly describe the regularizers used in neural networks that we study, namely the L_2 norm, Dropout, Jacobian, DARC1, LCNN and spectral normalization. We study the behavior of these regularizers in the presence of varying input noise. In doing so, we characterize the complexity of the network by the L_2 norm of the parameters and the DARC1 Rademacher bound.

  • L_2 norm: Weight decay, or L_2 regularization, is one of the most popular regularization techniques used to control model complexity. It amounts to penalizing the norm of the weights of the network. Almost all generalization bounds are a function of the norm of the weights. Sontag [18] first bounded the VC dimension of a multilayer neural network in terms of the number of parameters. For a network with millions of parameters, however, this bound turns out to be vacuous, as the network generalizes well with far fewer samples than parameters. Bartlett [17] also bounded the VC dimension, or fat shattering dimension, of a multilayer feed forward neural network in terms of the product of the norms of the weights. This bound showed that in order to control the complexity of a network, one has to regularize the norm of the weights. Neyshabur [21] presented the generalized analysis of complexity in terms of where . Some other recent bounds are listed in Table 2.

  • DARC1 norm: Kawaguchi et al. [15] suggested minimizing the max norm of the activation as a regularizer. They termed the method as Directly Approximately Regularizing Complexity (DARC) and named a basic version of their proposed regularization prior as DARC1. They argue that the common generalization bounds (as mentioned in the table) are too loose to be used practically. They therefore consider a margin based 0-1 loss defined as:


    Finally, they show that


    where, is the Rademacher complexity of the hypothesis class defined as , where is the margin, are the Rademacher variables, supremum is taken over all and allowed in . is the confidence level and is the true label of . Instead of using worst case vacuous bounds, they use the approximation of with an expectation over the known dataset . DARC1 is the new regularization term added on each minibatch as follows:


    Following Zhang et al. [25], the regularizer can be seen as penalizing the most confident predictions, thereby reducing the tendency of the model to overfit. A similar analysis is done using the Low Complexity Neural Network (LCNN) loss in Jayadeva et al. [26], which instead of a max-norm uses the L_2 norm over the hypothesis class of the final layer of the network. Here, the original loss can be any of the standard loss functions used in the literature, viz. cross-entropy, max-margin, 0-1 etc.

  • LCNN norm: Low Complexity Neural Network [26] regularizer tries to upper bound the VC dimension of neural network using radius margin bound. It is known that the VC dimension of a large margin linear classifier is upper bounded by:


    where is the radius of the data, is the margin and is the number of input features. A similar analysis can be performed for the last layer of a neural network. Here, we state without proof that the VC dimension () of a neural network is bounded by:


    LCNN regularization term is added on each minibatch as follows:


    The LCNN term penalizes large values in the last layer and acts as a confidence penalty. Here, the original loss can be any of the standard loss functions used in the literature, viz. cross-entropy, max-margin, 0-1 etc.

  • Jacobian Regularizer: Sokolic et al. [16] have argued that the existing generalization bounds for deep neural networks (DNN) grow disproportionately to the number of training samples. To resolve this, they propose a new lower bound expressed as a function of the network's Jacobian matrix, based on the robustness framework of [27]. The Jacobian matrix of a DNN is given by:


    Addition of Jacobian regularizer also allows the network to become robust to changes in the input. It has the effect of inducing a large classification margin at the input. Following theorem 4 in [16] the classification margin for a point with score is given by:



    is the Kronecker delta vector with

    . Hence, the generalization error is bounded by:


    Here, is a constant defining the dimensional manifold. It can be seen that . The above bound shows that the generalization error does not increase with the number of layers, provided the spectral norms of the weight matrices are bounded. If we assume that the weight matrices contain orthonormal rows, then the generalization error depends on the complexity of the data manifold and not on depth. The Jacobian regularizer is given by:

  • Spectral Normalization: Spectral normalization controls the Lipschitz constant of the network by constraining the spectral norm of each layer. The Lipschitz constant of a general differentiable function is the maximum singular value of its gradient over its domain.


    For composite functions, . Spectral normalization proposed for generative adversarial net [28], replaces each weight with . The computation is done using power iteration method. Let us consider a linear map . Let be a vector in the domain of matrix and be a vector in the codomain. Power iteration involves the following recurrence relation.


    On further simplification (see Algorithm 1 in [28]), we have the relation:


    Finally, the weight matrix is updated as:



    Authors in [28] argue that the gradient regularizer proposed in [29] which is similar in concept to the Jacobian regularizer proposed in [16], has a drawback that the Jacobian regularizer is not able to regularize the function at the points outside of the support of the current distribution. They also show that spectral normalization does not get destabilized by large learning rates, whereas Jacobian regularizer falters with aggressive learning rates.

    Spectral normalization has another advantage in terms of controlling the model complexity. Following the Bound 5 of Neyshabur et. al. [23, 26] presented in the Table 2 as , setting the term close to 1, we get the bound


    The bound in eq 19 shows that, in order to control the complexity of the model, one needs to perform spectral normalization and L_2 normalization. Only in this case does the sum of the L_2 norms of the weights of the network truly indicate the capacity of the architecture.

  • Dropout: It is known that models with a large number of parameters, such as deep neural architectures, can model very complex functions and phenomena. This also means that such models have a tendency to overfit. Regularizing such a model is therefore imperative for good performance on the unseen test set. Dropout is one such technique, proposed by Srivastava et al. [14]. Dropout works by randomly and temporarily deleting neurons in the hidden layer during training with probability

    . During testing/prediction, we feed the input to unmodified layer, but scale the layer output by . Dropout acts as an averaging scheme on the output of a large number of networks, which is often found to be a powerful way of reducing overfitting.

    Also, since a neuron cannot rely on the presence of other neurons, it is forced to learn features that do not depend on them. This reduces the co-adaptation of features, so the network learns robust features and is less susceptible to noise.
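The power iteration used by spectral normalization, described in the list above, can be sketched as follows. This is a minimal NumPy illustration in the spirit of Algorithm 1 in [28], not the authors' implementation: the function name, the number of iterations and the small constant eps are assumptions made for the example.

```python
import numpy as np

def spectral_normalize(W, u=None, n_iters=1, eps=1e-12, seed=0):
    """Estimate the largest singular value of W by power iteration and
    return (W / sigma, u, sigma). Requires n_iters >= 1."""
    rng = np.random.default_rng(seed)
    if u is None:
        u = rng.normal(size=W.shape[0])      # random start in the codomain
    for _ in range(n_iters):
        v = W.T @ u
        v = v / (np.linalg.norm(v) + eps)    # estimate of the right singular vector
        u = W @ v
        u = u / (np.linalg.norm(u) + eps)    # estimate of the left singular vector
    sigma = float(u @ W @ v)                 # estimate of the spectral norm
    return W / sigma, u, sigma

# Example with known singular values 3, 1 and 0.5.
W = np.diag([3.0, 1.0, 0.5])
W_sn, u, sigma = spectral_normalize(W, n_iters=50)
```

In training loops the vector u is typically carried over between steps, so a single iteration per update is often enough; the example uses many iterations only so that the estimate converges from a random start.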

The following section describes the experimental settings and results by varying the input noise.

3 Experiments and Results

In this paper we analyze various regularizers and their behavior under input noise. We characterize the complexity required by each network with different regularizers as the noise in the input increases. Following Arora et al. [30], we find that deep nets are noise-stable. Shallow nets, however, are not, as is evident from the results.

3.1 Experimental settings

We describe the experimental settings used in the paper. To begin with, we add class-specific Gaussian noise to each input example in the training set, while the validation and test sets remain unchanged. Let σ_i be the standard deviation of the training points belonging to class i. To each sample of class i in the training set, we add zero-mean, unit-variance Gaussian noise scaled by σ_i and a noise level. The noise level ranges from 0 to 1.2 in steps of 0.2. The procedure to generate the noisy datasets is described in Algorithm 1.

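Input: Training set, noise factor
Output: Noisy set
for each class j do
       compute the standard deviation of the training samples of class j
       add zero-mean, unit-variance Gaussian noise, scaled by this standard deviation and the noise factor, to the samples of class j
end for
Algorithm 1 Creating the noisy dataset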
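The procedure of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading of the description above rather than the authors' code: it assumes the class standard deviation is computed per feature, and the function name and the seed argument are hypothetical.

```python
import numpy as np

def add_class_noise(X, y, noise_level, seed=0):
    """Add zero-mean Gaussian noise to every training sample, scaled by the
    standard deviation of its class and by noise_level."""
    rng = np.random.default_rng(seed)
    X_noisy = X.astype(float).copy()
    for c in np.unique(y):
        idx = (y == c)
        class_std = X[idx].std(axis=0)            # assumption: one std per feature
        eps = rng.standard_normal(X[idx].shape)   # zero mean, unit variance
        X_noisy[idx] += noise_level * class_std * eps
    return X_noisy

# Usage: only the training split is perturbed; validation and test stay clean.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 3))
y_train = np.repeat([0, 1], 50)
X_train_noisy = add_class_noise(X_train, y_train, noise_level=0.4)
```

A noise level of 0 returns the data unchanged, and a feature that is constant within a class receives no noise because its class standard deviation is zero.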

We include two types of datasets in this work: four non-image datasets and one image dataset. We trained two networks, a shallow network for the non-image datasets and a deeper network for the image dataset. The details of the networks are presented below:

  • A network with two convolutional layers of filter size 30 and 20 respectively, and a fully connected layer of size

    , trained for 50 epochs with ADAM optimizer


  • A Wide Resnet 28-10 trained for 50 epochs, also with ADAM optimizer.

  • Both these networks have Batch normalization applied on each layer.

  • The initial learning rate is set to for both the networks.

The shallow network was trained separately under seven settings, viz. no regularization, L_2 norm, DARC1 norm, LCNN norm, Dropout, Jacobian and Spectral normalization. The deep network was trained with the same regularizers. We tuned each network on the validation set and report the results on the test set for the best hyperparameter setting.

For the shallow conv net we used the hyperparameter settings mentioned in the Table 3.

S.no. Hyperparameter Range
1 weight decay
3 Jacobian regularizer
4 Dropout
6 Batch size 128
Table 3: Hyperparameter ranges for the shallow network

For the deep network, we used the hyperparameters mentioned in the Table 4:

S.no. Hyperparameter Values
1 weight decay
2 Jacobian regularizer
3 Dropout
4 Batch size 32
Table 4: Hyperparameter values for the deep network

3.2 Datasets

We show results on 5 datasets in this paper. We experimented with two neural network architectures, one shallow convolutional net (with three layers) for datasets 1 to 4 and one deep (Wide Resnet 28-10) for dataset 5 (CIFAR 10). The datasets are described in the Table 5.

S.no. Dataset_name # Train # Val # Test # Features # Classes
1 Adult 29304 9769 9769 14 2
2 MNIST 50000 5000 5000 784 10
3 Codrna 293139 97713 97713 8 2
4 Covertype 348603 116204 116205 54 7
5 CIFAR10 50000 5000 5000 3072 10
Table 5: Datasets used in the paper

3.3 Results

We compare test set accuracies, the log2 of the L_2 norm, the DARC1 norm and the loss for each dataset. We first present the results for the shallow conv net and then show the results for a Wide Resnet 28-10 on the CIFAR10 dataset.
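As a rough illustration of how two of these quantities could be computed, the sketch below evaluates a log2-scaled weight norm and the DARC1 term on a batch of network outputs. It assumes that the weight measure is the sum of per-layer L_2 norms and that the DARC1 term is the maximum over classes of the batch-averaged absolute output, following the description of [15]; the function names are hypothetical and the exact scaling used in the experiments is not known from the text.

```python
import numpy as np

def log2_sum_l2_norms(weights):
    """log2 of the sum of per-layer L2 norms (assumed form of the weight measure)."""
    total = sum(float(np.linalg.norm(W.ravel())) for W in weights)
    return float(np.log2(total))

def darc1_term(logits):
    """Max over classes of the mean absolute output over the batch.
    logits has shape (batch_size, n_classes)."""
    return float(np.max(np.mean(np.abs(logits), axis=0)))

# Tiny example: two samples, two classes.
logits = np.array([[1.0, -2.0],
                   [3.0, 0.0]])
value = darc1_term(logits)  # class 0: (1 + 3) / 2 = 2, class 1: (2 + 0) / 2 = 1
```

When used as a regularizer rather than a measurement, the DARC1 term would be multiplied by a hyperparameter and added to the original loss on each minibatch.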

Figure 1: Comparison of (a) test set accuracies, (b) L_2 norm of weights, (c) mean DARC1 regularizer, (d) mean LCNN regularizer, (e) mean Jacobian regularizer and (f) mean cross entropy loss for the shallow network across datasets

Figure 1(a) shows the bar plot of the mean accuracy across the four datasets used for testing the shallow architecture. We observe that the accuracies fall as the noise in the training set increases. Up to a noise level of 6 there is no difference in accuracy across regularizers; at a noise level of 8, however, the Jacobian regularizer outperforms the others, closely followed by the spectral norm regularizer and the L_2 regularizer. The LCNN regularizer does not perform well at higher noise levels for shallow architectures.

Figure 1(b) shows the mean log2-scaled L_2 norms obtained for the shallow architecture. We observe that the network with L_2 regularization results in the smallest norm. For the network without regularization and with DARC1 or LCNN regularization, the norm decreases as the noise increases, whereas for Dropout, Jacobian, Spectral and L_2 regularization we observe an initial increase in the norm followed by a decline. The latter trend is more pronounced in the case of the L_2 norm regularizer. This indicates that with a slight increase in noise the model first increases its complexity to fit the noise, but as the noise grows further the model reduces its complexity to fit the non-noisy validation set. Similar trends are observed for the DARC1 regularizer (fig. 1(c)), the LCNN regularizer (fig. 1(d)) and the Jacobian regularizer (fig. 1(e)). Since LCNN is an upper bound on DARC1, the graphs have a similar trend.

Figure 1(f) shows the test set cross entropy loss as the noise increases. We observe an increasing trend in all cases.

We now show the results of these regularizers on a deeper architecture (Wide Resnet 28-10) trained on an image dataset. For this experiment we varied the noise levels from 0 to 8 in steps of 2.

Figure 2: Comparison of (a) test set accuracies, (b) L_2 norm of weights, (c) mean DARC1 regularizer, (d) mean LCNN regularizer, (e) mean Jacobian regularizer and (f) mean cross entropy loss for the deep network (Wide Resnet 28-10) on CIFAR10

Figure 2(a) shows the accuracies obtained as the noise ratio increases for the various regularization techniques. We observe that Dropout results in a non-robust curve, where the accuracy first drops sharply as the noise increases and then rises again with further noise. Spectral normalization results in the highest accuracies and is not affected by noise. For the other methods there is no appreciable drop in accuracy as the noise increases, which supports the hypothesis that deep neural nets are robust to input noise.

Figure 2(b) shows the log2 of the L_2 norm. We see that the norm is smallest for L_2 regularization, followed by spectral norm. The norm in the case of L_2 regularization is orders of magnitude smaller than the rest, despite comparable accuracies. This indicates that deeper architectures have high model complexities, which can be controlled using the L_2 norm.

Figures 2(c) and 2(d) show the DARC1 and LCNN norms respectively. It can be seen that the LCNN regularizer results in the smallest respective norms, followed by Spectral norm and L_2 norm. Dropout results in the largest DARC1 and LCNN norms.

Figure 2(e) shows the Jacobian norm for the various regularizers. We observe that the Jacobian norm is smallest for the LCNN norm. This shows that the error signal is not able to propagate to the initial layers, which may be due to a high value of the LCNN hyperparameter. Among the others, Spectral norm shows the smallest values of the Jacobian norm and also the smallest variation. This corresponds to the robustness in accuracy as the noise increases. Dropout shows the highest variation in Jacobian norm, which corresponds to its large variation in accuracy as the noise increases. We also observe that the Jacobian norm increases with noise, which is indicative of increasing uncertainty in the data.

Figure 2(f) shows the variation of the cross entropy loss with noise. The loss is inversely related to the accuracy (fig. 2(a)). The loss is high for the network with only Dropout as regularizer. The loss is minimal for the LCNN norm, followed by Spectral normalization and the L_2 norm.

4 Discussion

From the results we can obtain the following insights:

  1. The model complexity can be fairly represented using the L_2 norm, DARC1 norm and LCNN norm.

  2. L_2 regularization and DARC1 regularization perform well in controlling the model complexity of the network. Their smaller values are indicative of a less complex hypothesis class learnt under stronger regularization.

  3. As the noise increases for shallow networks, the L_2 norm and DARC1 norm increase at first and then decrease, indicative of the simpler hypothesis learnt when the validation set is noise free.

  4. For the shallow network, the Jacobian norm performs well in the presence of noise. For deeper architectures, the network itself performs noise reduction and thus Jacobian regularization has minimal effect. Deeper networks are robust to input noise, as is elucidated in Arora et al. [30].

  5. Spectral normalization performs the best in the case of noisy inputs, closely followed by the DARC1 regularizer, which is derived from a distribution-dependent bound.

  6. The test set loss is a good indicator of the test set accuracy.

  7. Dropout alone does not perform well in the case of noisy inputs, nor does it control the model complexity given in terms of the L_2 norm, DARC1 norm or LCNN norm.

  8. Adding input noise can result in better generalization properties for deeper architectures; however, this effect is less pronounced for shallow architectures.

5 Conclusion

In this paper we presented a study of the effect of regularization on model complexity and generalization as the noise in the inputs increases. We find multiple notions of model complexity in the literature, ranging from distribution-independent VC dimension bounds for neural networks [18, 22] to distribution-dependent Rademacher complexity bounds [19, 21, 23, 20]. All these bounds are in terms of the product or sum of the norms of the weights of the network. We used the sum of the L_2 norms of the weights as a proxy for model complexity. This directly translates to adding a weight decay regularizer that penalizes larger weights. The recently proposed DARC1 and LCNN regularizers are a step towards distribution-dependent bounds for neural networks. We see that both L_2 and DARC1 regularization control the model complexity.

The experiments clearly demonstrate that the L_2 norm is best suited for controlling model complexity as well as for generating solutions with high accuracies. For deeper networks, Dropout alone neither results in higher accuracies nor controls the model complexity in terms of the L_2 norm of the weights. We also see that the newly proposed distribution-dependent DARC1 and LCNN regularizers perform as well as L_2 regularization. Spectral normalization produces some of the best results for deeper architectures, due to stable gradient propagation. Our experiments also show that deep neural networks are generally more robust to varying degrees of Gaussian input noise than shallow architectures and can therefore effectively model any number of such distributions under strong priors. In future, we shall explore novel regularization schemes like shakedrop [32] and shake-shake regularization [33]. We have refrained from discussing recurrent architectures like the Long Short Term Memory (LSTM) network [34, 35], which shall be the focus of our future research.