Regularization is a popular and essential technique in machine learning that reduces the complexity of learned models while allowing them to predict accurately on unseen data during the test phase [R1]. It becomes even more vital in deep learning because deep models behave in a highly nonlinear way: adding even one more layer increases the nonlinearity of a model considerably [R2], which can result in poor generalization performance on unseen samples.
According to a taxonomy proposed in [R3], deep learning regularization techniques can be divided into five main categories. Data-based regularization is the first category, which aims either to simplify the representation of the input data by applying certain transformations or to create a large number of data points using data augmentation techniques. The second category is based on modifying the network structure, such as imposing restrictions on the number of nodes/layers and choosing a proper activation function. Regularization via the error function is the third category, which tries to add certain properties to the error function, such as robustness to imbalanced data. The fourth category is based on modifying the optimization algorithm used to train the network; termination methods, DropOut, momentum, and weight initialization are some examples of this category. The last category is based on adding a regularization term to the network's loss function. In this category, the regularization term is assumed to be independent of the targets. Weight decay and $L_2$ regularization are two typical examples of this category.
Because of the highly nonlinear behavior of deep learning models, especially when they become deeper by adding more layers, they naturally tend toward memorization rather than generalization. This problem becomes even more serious when the training set is small. Although many researchers have tried to address this problem, it needs more work, as overfitting has not been completely solved in the context of deep learning models. Moreover, none of the existing regularization methods explicitly tries to make deep networks behave less nonlinearly. That is, there is no method yet that explicitly penalizes deep networks for learning a highly nonlinear model, and this area is still open to research. Additionally, it is desirable for a regularizer to be efficient, simple, and computationally inexpensive, and to result in discriminative feature maps. Satisfying all these characteristics together, however, requires more research.
Considering the aforementioned points, this paper proposes a simple but efficient regularization method, named DL-Reg (an abbreviation for Deep Learning Regularization), which adds a regularization term to the network's loss function and explicitly forces the learned model to behave as linearly as possible. In other words, the proposed DL-Reg, which belongs to the last category outlined above, not only explicitly penalizes the network for learning a highly nonlinear model, but also encourages it to learn a model that is as linear as possible while preserving its discrimination ability. To accomplish this, we take advantage of linear regression and propose a least-squares error term, representing the squared error of a linear mapping from the inputs of the network to its outputs. This term simply encourages the network to behave as linearly as possible so that the least-squares error is minimized. The main contributions of this work can be summarized as follows:
Regularizing deep networks using the sum of squared errors of a linear regression model that directly maps the inputs to the outputs of the network
Allowing supervised deep networks to be trained in a semi-supervised manner
Increasing the performance of deep networks on small-sized datasets
Conducting extensive experiments to verify the significance of the proposed method in enhancing the performance of deep networks
The rest of this paper is structured as follows. Section II provides a brief survey of the related regularization approaches. Section III presents the proposed regularization method in depth. Section IV reports the experimental results, and Section V discusses the findings and provides several possible future directions. Finally, Section VI concludes the paper.
II Related Work
This section covers a brief background of several regularization methods used in the context of deep learning. $L_2$ regularization, which works similarly to weight decay in the case of the SGD optimizer [R34], is perhaps one of the best-known traditional regularization methods. It is simply $\lambda \|W\|_2^2$, where $\lambda$ is a regularization factor and $W$ denotes the weights of the network [R2, R21]. In a recent work [R19], however, it was shown that separating the weight decay from the gradient-based update rule can substantially improve the generalization ability of the learned model, particularly in the case of the Adam optimizer.
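The relation between $L_2$ regularization and weight decay under plain SGD can be checked numerically. The following is a minimal sketch, assuming the penalty is written as $\lambda \|W\|_2^2$ (so its gradient is $2\lambda W$) and that weight decay shrinks the previous weights by a fixed fraction at each step; the function names and toy values are illustrative only.

```python
import numpy as np

def sgd_step_l2(w, grad, lr, lam):
    """One SGD step on loss + lam * ||w||^2 (the penalty adds 2 * lam * w to the gradient)."""
    return w - lr * (grad + 2.0 * lam * w)

def sgd_step_weight_decay(w, grad, lr, decay):
    """One SGD step on the unpenalized loss, with the old weights shrunk by `decay`."""
    return (1.0 - decay) * w - lr * grad

rng = np.random.default_rng(0)
w = rng.normal(size=5)
grad = rng.normal(size=5)   # stand-in for the gradient of the data loss
lr, lam = 0.1, 0.01

w_l2 = sgd_step_l2(w, grad, lr, lam)
# The two updates coincide when the decay fraction equals 2 * lr * lam.
w_wd = sgd_step_weight_decay(w, grad, lr, decay=2.0 * lr * lam)
print(np.allclose(w_l2, w_wd))
```

For adaptive optimizers such as Adam, the penalty gradient is rescaled per parameter, so this identity no longer holds, which is the setting in which decoupling the decay from the update rule matters.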
Smoothness [R4] is another regularization method, which penalizes large derivatives in the model and is defined by $\|J_{f_W}(x)\|_F^2$, where $J_{f_W}$ and $\|\cdot\|_F$ denote the Jacobian of the network parametrized by $W$ and the Frobenius norm, respectively. In another work, a norm of the gradient of the loss function was applied to obtain a loss-invariant backpropagation, which makes the loss invariant to changes in the input [R5]. Hessian Penalty [R6] has been proposed as a fast approximation of a norm of the Hessian of the network by penalizing the Jacobian with noisy inputs. This idea was further exploited in [R25] to build a network that is robust against adversarial examples.
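To make the smoothness penalty concrete, the sketch below computes the squared Frobenius norm of an input-output Jacobian for a single tanh layer. It assumes the Jacobian is taken with respect to the input, and the one-layer network, its shapes, and the function names are hypothetical choices made only for illustration.

```python
import numpy as np

def layer(x, W, b):
    """A toy one-layer network: tanh(W x + b)."""
    return np.tanh(W @ x + b)

def jacobian(x, W, b):
    """Analytic Jacobian of tanh(W x + b) w.r.t. x: diag(1 - tanh^2(z)) @ W."""
    z = W @ x + b
    return (1.0 - np.tanh(z) ** 2)[:, None] * W

def smoothness_penalty(x, W, b):
    """Squared Frobenius norm of the input-output Jacobian at x."""
    return float(np.sum(jacobian(x, W, b) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)
penalty = smoothness_penalty(x, W, b)
print(penalty)
```

For deeper networks the Jacobian would normally come from automatic differentiation rather than a hand-derived formula.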
To improve the performance of recurrent neural networks (RNNs), it has been shown that imposing unitary or orthogonal constraints on the weight matrices protects the network from vanishing/exploding gradients [R7, R8]. In another work, the matrix spectral norm [R9] has been used to regularize the network by making it insensitive to perturbations and variations of the training samples. More precisely, the parameters of the model are trained so that the spectral norm of the weights is small, which makes the network less sensitive to changes in the order of the training data at each epoch. In the same direction, SHADE [R20] has been proposed, whose loss function is based on the conditional entropy and tries to minimize the variation in the input representations. Inspired by [R23], Louizos et al. [R22] leverage the notion of weight sparsity, trying to set a certain number of the network weights to zero by applying the $L_0$ norm; however, this is applicable only under certain conditions, as the $L_0$ norm is not generally differentiable. In the same way, the group sparse regularization method [R24] applies the notion of sparsity to the sets of outgoing weights from neurons.
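The spectral norm referred to above is the largest singular value of a weight matrix. The sketch below estimates it by power iteration, which is one common way to do so and is assumed here rather than taken from [R9]; the test matrix is built with known singular values so that the estimate can be checked.

```python
import numpy as np

def spectral_norm(W, n_iter=100, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

# Build a 6 x 4 matrix whose singular values are 3, 2, 1 and 0.5.
rng = np.random.default_rng(1)
Q1, _ = np.linalg.qr(rng.normal(size=(6, 4)))
Q2, _ = np.linalg.qr(rng.normal(size=(4, 4)))
W = Q1 @ np.diag([3.0, 2.0, 1.0, 0.5]) @ Q2.T

sigma = spectral_norm(W)
print(sigma)
```

A regularizer could then penalize a function of `sigma` for each layer, for example a multiple of `sigma ** 2`; the exact form used in [R9] is not restated here.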
Shake-Shake regularization [R10] was proposed for only a specific type of residual network (ResNet). It follows the idea of adding gradient noise to the learning procedure, where gradient noise is replaced by gradient augmentation, allowing the network to escape from local optima. More precisely, this approach multiplies the outputs of the residual branches by random scalars and uses the results in both the forward and backward passes. ShakeDrop regularization [R11] is an extension of Shake-Shake that can be applied to other ResNet models.
The family of drop methods, initially introduced by DropOut [JMLR:v15:srivastava14a], is another type of regularization, which prevents deep networks from overfitting by randomly dropping a certain number of neurons during training. DropBlock [R29], DropBand [R30], and DropFilter [R31] are several recently proposed methods of this family. DropBlock randomly drops a certain number of contiguous regions in the feature maps. DropBand drops one channel of the input data each time, and DropFilter randomly drops some elements of convolutional layers. Additionally, Spectral DropOut [KHAN201982] is another member of this family, which prevents overfitting by first calculating the Fourier coefficients of the network and then eliminating noisy and weak coefficients; it, however, needs additional computation to find the Fourier coefficients. Cutout [R26] is a restricted type of DropOut-based regularizer that randomly masks square regions of the inputs. This masking forces the network to learn complementary features, which is helpful in the case of occlusion. Overall, a closer look at this family shows that these methods differ in whether they drop layers, nodes, or weights; however, there is not much difference between them in terms of performance in practice, and they behave almost like DropOut.
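Two members of the drop family described above can be sketched in a few lines. This is a minimal illustration, assuming the common "inverted" form of DropOut (surviving units are rescaled) and a Cutout mask applied to a single-channel image; the function names and arguments are hypothetical, and in practice the Cutout position would be sampled at random.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

def cutout(image, top, left, size):
    """Return a copy of a 2-D `image` with a size x size square set to zero."""
    out = image.copy()
    out[top:top + size, left:left + size] = 0.0
    return out

rng = np.random.default_rng(0)
h = np.ones((4, 8))                 # toy activations
h_drop = dropout(h, rate=0.5, rng=rng)

img = np.ones((8, 8))               # toy single-channel image
img_cut = cutout(img, top=2, left=3, size=4)
print(h_drop)
print(img_cut)
```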
Moosavi-Dezfooli et al. [R27] proposed a new regularization technique that minimizes the curvature of the loss surface, which is helpful for adversarial robustness. However, their optimization procedure needs to calculate the eigenvalues of the Hessian of the loss, which is computationally expensive. Stankovic et al. [R33] took advantage of a graph Laplacian regularizer to address the problem of limited training data in deep neural networks. The proposed method is based on iteratively solving a quadratic programming problem, which adds more computation to the training phase. Apart from that, this method works only for binary classification problems.
A modification of the softmax loss function, called Angular softmax [R28], was recently proposed as an explicit regularization technique that tries to increase inter-class separability by increasing the distance between class centers. Although this method theoretically leads to more discrimination, the empirical results over different types of datasets fall short of expectations. Moreover, Angular softmax is a new/modified loss function, not a regularization method in general. Another recent work is style transfer regularization [R32], which tries to regularize the network by generating new data, mostly textured image data, by combining the content of one image with the appearance of another.
Based on the above summary, we can conclude that none of the current regularization methods offers simplicity/generality, efficiency, and the ability to deal with small-sized training datasets at the same time. Accordingly, this paper aims to propose an efficient but simple method for regularizing deep neural networks that allows the networks to extract highly discriminative features, as the experimental results confirm. Moreover, the proposed method is suitable for small-sized training datasets.
III Proposed Method
This section describes the proposed regularization method (DL-Reg) in detail. A deep neural network can be considered as a function $f_W$ parameterized by a set of weights $W$ that maps an n-dimensional input $x \in \mathbb{R}^n$ to a c-dimensional output $y \in \mathbb{R}^c$, i.e., $f_W: \mathbb{R}^n \to \mathbb{R}^c$. The goal of training is to find an optimal set of weights $W^*$ minimizing a certain empirical risk function $\mathcal{L}$:
$$W^* = \arg\min_{W} \mathcal{L}(X, T; W) \quad (1)$$
where $X = [x_1, \dots, x_N]^\top \in \mathbb{R}^{N \times n}$ and $T \in \{0, 1\}^{N \times c}$ is the corresponding binary label matrix. $T$ is defined as follows: for each training sample $x_i$, $t_i \in \{0, 1\}^c$ is its label vector. If $x_i$ is from the $k$th class ($k = 1, \dots, c$), then only the $k$th entry of $t_i$ is one and all the other entries are zero. The risk function then takes the following form:
$$\mathcal{L}(X, T; W) = \ell(f_W(X), T) + \lambda R(\cdot) \quad (2)$$
where $\ell$ is a loss function, which calculates the error between the output of the network $f_W(X)$ and the target $T$, $R(\cdot)$ is a regularization function that may consume a certain number of inputs except the targets, and $\lambda$ is a regularization factor, which determines the importance of regularization in the risk function. Considering the aforementioned definitions, this work aims to propose a regularization function $R$ that improves the generalization ability of the network, particularly in the case of small-sample-size problems. Accordingly, we propose the following regularization function, which is simply the squared norm of the error between a linear mapping of the inputs and the outputs of the network:
$$R(X; W, A) = \| Y - \tilde{X} A \|_F^2 \quad (3)$$
Here, $A \in \mathbb{R}^{(n+1) \times c}$ is a linear transformation operator (the last row of $A$ represents the bias parameters), which maps $\tilde{X} = [X, \mathbf{1}]$ (i.e., $X$ concatenated by a column of all ones) to the c-dimensional output $Y$, and $Y$ denotes $f_W(X)$. In other words, $R$ calculates the error of a linear regression between the inputs and the outputs of the network. The parameters of $A$ are initialized randomly and then updated during the training process of the deep neural network. More precisely, whenever the parameters of the network are updated, $A$ is updated as well.
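Practically speaking, Eq. (3) can be applied in the case of mini-batch optimization. Therefore, it can be rewritten as follows:
$$R(X_b; W, A) = \| Y_b - \tilde{X}_b A \|_F^2 \quad (4)$$
where $\tilde{X}_b \in \mathbb{R}^{b \times (n+1)}$ represents a mini-batch subsample of $X$ concatenated by a column of all ones as the bias multipliers, $b$ is the size of a batch, and $Y_b = f_W(X_b)$ denotes the corresponding outputs of the network. Figure 1 shows a graphical view of DL-Reg.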
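The mini-batch regularization term can be sketched in NumPy as follows. This is an illustrative sketch only: it assumes the term is the unnormalized sum of squared residuals, takes the linear operator `A` as a given (randomly initialized) parameter, and uses hypothetical names and toy shapes.

```python
import numpy as np

def augment(X_b):
    """Append a column of ones, which multiplies the bias row of A."""
    return np.hstack([X_b, np.ones((X_b.shape[0], 1))])

def dl_reg(X_b, Y_b, A):
    """Squared error between the linear map [X_b, 1] @ A and the network outputs Y_b."""
    residual = Y_b - augment(X_b) @ A     # shape (b, c)
    return float(np.sum(residual ** 2))

rng = np.random.default_rng(0)
b, n, c = 4, 6, 3
X_b = rng.normal(size=(b, n))
A = rng.normal(size=(n + 1, c))           # randomly initialized linear operator

Y_linear = augment(X_b) @ A               # outputs of an exactly linear "network"
Y_nonlinear = np.tanh(Y_linear)           # nonlinear stand-in for network outputs
print(dl_reg(X_b, Y_linear, A), dl_reg(X_b, Y_nonlinear, A))
```

The term vanishes when the outputs are exactly a linear function of the inputs and grows as the outputs depart from that linear map.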
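Minimizing Eq. (4) w.r.t. the linear operator $A$, with $\tilde{X}_b$ the bias-augmented mini-batch inputs and $Y_b$ the corresponding network outputs, is a typical least-squares problem and can be solved in closed form as follows:
$$A = \tilde{X}_b^\top \left( \tilde{X}_b \tilde{X}_b^\top \right)^{-1} Y_b \quad (5)$$
Because the number of samples in a mini-batch, $b$, is often smaller than the size of the input, $n$, i.e., $b < n$, $\tilde{X}_b$ is a fat matrix and can be regarded as a full-row-rank matrix. Hence, the inverse of $\tilde{X}_b \tilde{X}_b^\top$ exists, as it forms a full-rank matrix.
It is also worth noting that in the case of $b > n$, which is quite rare in deep learning problems, Eq. (4) can instead be solved as follows:
$$A = \left( \tilde{X}_b^\top \tilde{X}_b \right)^{-1} \tilde{X}_b^\top Y_b \quad (6)$$
in which $\tilde{X}_b$ becomes a full-column-rank, i.e., tall, matrix, and consequently $\left( \tilde{X}_b^\top \tilde{X}_b \right)^{-1}$ exists [R12].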
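The two closed-form solutions can be checked against the Moore-Penrose pseudoinverse, which coincides with each of them under the respective full-rank assumption. The sketch below is illustrative: the function name, the random data, and the use of `np.linalg.solve` instead of an explicit inverse are choices made here, not part of the method's description.

```python
import numpy as np

def fit_linear_map(X_tilde, Y):
    """Closed-form least-squares A for Y ~ X_tilde @ A, assuming X_tilde has full rank."""
    b, d = X_tilde.shape
    if b < d:
        # Fat matrix: A = X^T (X X^T)^{-1} Y
        return X_tilde.T @ np.linalg.solve(X_tilde @ X_tilde.T, Y)
    # Tall (or square) matrix: A = (X^T X)^{-1} X^T Y
    return np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)

rng = np.random.default_rng(0)

def random_batch(b, n, c):
    """Random inputs with a trailing column of ones, plus random 'network outputs'."""
    X_tilde = np.hstack([rng.normal(size=(b, n)), np.ones((b, 1))])
    return X_tilde, rng.normal(size=(b, c))

X_fat, Y_fat = random_batch(b=8, n=20, c=3)     # batch smaller than input size
X_tall, Y_tall = random_batch(b=50, n=5, c=3)   # batch larger than input size

A_fat = fit_linear_map(X_fat, Y_fat)
A_tall = fit_linear_map(X_tall, Y_tall)
print(A_fat.shape, A_tall.shape)
```

In the fat case, the closed form yields the minimum-norm solution among all least-squares minimizers.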
Eq. (3) aims to keep the network behaving as a linear mapping and to penalize it when it behaves highly nonlinearly. Hence, one concern about the proposed regularization function $R$ might be its negative impact on the nonlinear power of deep networks, since the major strength of deep learning methods is rooted in their ability to produce nonlinear feature maps. This concern, however, can be addressed because the regularization factor $\lambda$ adjusts the impact of the linearization enforced by $R$, and choosing a proper value of $\lambda$ resolves it. The parameter $\lambda$ can be selected through a cross-validation procedure over a validation set. Additionally, the independence of $R$ from the targets allows the network to take advantage of unlabeled training data, which also makes it suitable for semi-supervised learning [R13].
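Because the regularization term does not depend on the targets, it can be evaluated on unlabeled inputs as well. The sketch below shows one way such a semi-supervised risk could be assembled; the softmax cross-entropy loss, the toy two-layer network, and all names and values are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(logits, T):
    """Mean softmax cross-entropy; T is a one-hot label matrix."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-(T * log_p).sum(axis=1).mean())

def linear_residual(X, Y, A):
    """Squared error between the linear map [X, 1] @ A and the outputs Y."""
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.sum((Y - X_tilde @ A) ** 2))

def semi_supervised_risk(net, X_lab, T_lab, X_unlab, A, lam):
    """Supervised loss on labeled data plus lam * R on labeled and unlabeled inputs."""
    X_all = np.vstack([X_lab, X_unlab])
    return cross_entropy(net(X_lab), T_lab) + lam * linear_residual(X_all, net(X_all), A)

rng = np.random.default_rng(0)
n, c = 5, 3
W1, W2 = rng.normal(size=(n, 8)), rng.normal(size=(8, c))

def net(X):
    """Toy stand-in for a deep network."""
    return np.tanh(X @ W1) @ W2

X_lab, X_unlab = rng.normal(size=(6, n)), rng.normal(size=(10, n))
T_lab = np.eye(c)[rng.integers(0, c, size=6)]   # one-hot labels
A = rng.normal(size=(n + 1, c))

risk = semi_supervised_risk(net, X_lab, T_lab, X_unlab, A, lam=0.01)
print(risk)
```

Setting `lam=0.0` recovers the purely supervised loss, so candidate values of the regularization factor can be compared on a validation set.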
IV Experiments
This section evaluates the effectiveness of the proposed regularization method on several state-of-the-art deep network architectures, such as ResNet-152 [R14], DenseNet [R16], and VGG [R17]. The evaluation is performed on the task of image classification using several benchmark datasets, including MNIST, CIFAR-10, CIFAR-100, and ImageNet. The parameters used in the training phase of each architecture, such as the learning rate, number of epochs, and batch size, are the same with and without the proposed regularization method.
IV-A CIFAR Datasets
This subsection reports the results of our proposed regularization function on CIFAR-10 and CIFAR-100 and compares them to the original case of each network, i.e., the case of not using the proposed regularization. CIFAR-10 consists of 60k color images of size $32 \times 32$ divided into 10 classes. The standard training and testing sizes are 50k and 10k, respectively, where the size of each training class is 5k. CIFAR-100 is the same as CIFAR-10, except that it has 100 classes, where each class has 500 training images. Tables I and II report the classification accuracies of each model on each dataset, where the latter uses a subset of the original dataset as training data. Additionally, Figures 2 to 5 show the train/test accuracy curves corresponding to Tables I and II, depicting the learning behavior of the proposed method and its ability to escape from local optima, e.g., sub-figures 2(a, d). As the figures illustrate, in most cases the training accuracy curves of the proposed method are lower than those of the original methods, indicating that the proposed regularization is less prone to overfitting. As the results show, applying the proposed method results in a significant improvement for each network. More importantly, Table II demonstrates the robustness of the proposed regularization to overfitting in small-sample-size problems, i.e., there are only small drops in accuracy.
IV-B ImageNet
This experiment investigates the impact of our proposed regularization term on a very large-scale set of images. ImageNet is a large image-classification dataset with more than one million annotated images divided into 1000 classes. Table III reports the classification results on ImageNet and on a reduced ImageNet training set built by randomly selecting 200 images from each category, i.e., 200k images in total. As the results show, applying the proposed regularization method significantly increases the performance of each method. More importantly, the improvement ratios on the reduced version of ImageNet are significantly higher than those on the full dataset, supporting the idea that the proposed method is helpful in small-sample-size problems by reducing the chance of overfitting. Detailed specifications and preprocessing steps of each method are available at https://paperswithcode.com/sota/image-classification-on-imagenet.
Table III: Classification results on ImageNet and on the reduced ImageNet training set (columns for each setting: Original, Proposed, Improvement Ratio).
To investigate the effectiveness of the proposed method more deeply, we use class activation maps (CAMs), introduced in [R18], to depict the class activations of each architecture on several samples of ImageNet's test set. To do that, we randomly select three classes of ImageNet, depicted in the first row of Figure 6. Then, we calculate their CAMs using the original DenseNet-121 and the one equipped with the proposed regularization (the second row shows the obtained CAMs), where both networks are trained on ImageNet. Finally, for a better view, the third row combines the first two rows. From the results, it is evident that the proposed regularization forces DenseNet-121 to learn discriminative features from the object of interest in each class, while the original DenseNet-121 tends to memorize most areas of the images. In the case of Fireweed, for instance, the version of DenseNet-121 equipped with our proposed regularization uses a small number of petals to make its decision, while the original DenseNet-121 uses almost all areas of the image, which eventually leads to lower generalization.
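As a reminder of what a class activation map computes, the sketch below forms the map for one class as a weighted sum of the final convolutional feature maps, using the weights of the classification layer that follows global average pooling. The array shapes, random values, and the rescaling to [0, 1] for display are assumptions made for illustration; they are not taken from the experiments above.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """features: (K, H, W) final conv maps; fc_weights: (C, K) classifier weights."""
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)   # shape (H, W)
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam    # rescale to [0, 1] for display

rng = np.random.default_rng(0)
features = rng.random((5, 7, 7))              # toy feature maps: K=5 channels, 7 x 7
fc_weights = rng.normal(size=(3, 5))          # toy classifier weights: C=3 classes
cam = class_activation_map(features, fc_weights, class_idx=1)
print(cam.shape)
```

In practice the map is upsampled to the input resolution and overlaid on the image, as in the third row of Figure 6.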
IV-C Comparison with $L_2$ Regularizer
This subsection conducts several experiments on the MNIST dataset to compare the performance of the proposed regularization technique (DL-Reg) with the well-known $L_2$ regularizer. The $L_2$ regularizer is selected because it belongs to the same category as DL-Reg (see Section I); hence, the comparison is fair. In all experiments, we use the same parameter settings, including randomness, train/test size, batch size, learning rate, maximum number of epochs, and every other setting. Table V lists these parameters along with their assigned values. Moreover, we use the same network structure, consisting of three sequential hidden layers (1024, 1024, 2048) with ReLUs, with and without Dropout (using separate rates for the input layer and for the other layers). Figure 7 depicts the obtained per-epoch results in terms of train and test accuracies as well as train losses. For a better view of the results, we only depict the first 200 epochs and the last 400 epochs of the learning procedure, which makes it easier to compare the learning behaviours of the models at the beginning and the end of training. Additionally, the final test accuracy of each strategy is reported in Table IV.
As depicted in Figure 7, in all cases the proposed regularization method achieves higher accuracy in both the test and train phases, and a lower training loss. More precisely, the proposed DL-Reg shows a superior behaviour to the $L_2$ regularizer both with and without Dropout layers. The convergence speed is another significant implication of the proposed method. We can observe that DL-Reg shows a faster rate of convergence and a more stable behavior than the $L_2$ regularizer. That is to say, DL-Reg can successfully reduce the nonlinearity of deep networks by implicitly forcing the neurons of the networks to behave as linearly as necessary.
Another interesting observation from Figure 7(c) is that the $L_2$ regularizer performs better in the first epochs of training; however, after a certain number of epochs, DL-Reg reveals its generalization power and outperforms the $L_2$ regularizer.
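Table IV: Final test classification accuracy of each method on MNIST.
|Method||Test Classification Accuracy|
Table V: Parameter settings used in the MNIST experiments.
|Parameter||Value|
|Learning rate (lr)|
|Decay rate for lr||0.96|
|Frequency of lr-scheduler||every 30 epochs|
|Regularization factor for DL-Reg|
|Regularization factor for $L_2$ regularizer|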
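The learning-rate settings listed above (decay rate 0.96, applied every 30 epochs) can be read as a staircase schedule. The sketch below assumes that interpretation; the base learning rate is a hypothetical placeholder, since its value is not given here.

```python
def step_decay_lr(base_lr, epoch, decay_rate=0.96, step=30):
    """Staircase decay: multiply the learning rate by decay_rate once every `step` epochs."""
    return base_lr * decay_rate ** (epoch // step)

base_lr = 0.1   # hypothetical placeholder, not the value used in the experiments
schedule = [step_decay_lr(base_lr, epoch) for epoch in (0, 29, 30, 60)]
print(schedule)
```

Under this reading, the rate stays constant within each block of 30 epochs and shrinks by 4% at each block boundary.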
V Discussion and Future Work
There is always a trade-off between the amount of data used for training and the depth of the model on one side, and the model's complexity in terms of memory and time on the other. The proposed regularization technique forces the network to behave as linearly as possible. That is, it restricts the network from learning a highly nonlinear function while preserving its predictive ability. This restriction enables the network to learn discriminative features, which is clearly visible in the results depicted in Figure 6. In the case of Wading bird, for instance, the proposed method uses the pattern of the bird's wings to detect the object, which provides enough discrimination. It is worth noting that the proposed method even detects the reflection of the wings on the water, which is remarkable.
The regularization factor $\lambda$ plays an essential role in the performance of DL-Reg. If $\lambda$ increases, the learning ability of the network is reduced, and in the limit the network behaves entirely like a linear regression. In contrast, if $\lambda$ approaches zero, the regularization no longer has any impact on the learning procedure or on generalization. Therefore, $\lambda$ should be chosen carefully for every learning problem.
Finally, the main implications of the proposed regularization method are summarized as follows:
DL-Reg provides a better generalization in practice and learns discriminative features
The convergence speed of the proposed DL-Reg is fast; however, it depends on the value of the regularization factor
The computational cost of DL-Reg is negligible
The proposed regularization method is easy to implement and can be added to any network. In other words, it is independent of the choice of loss function.
One of the promising areas of future work could be investigating the effects of adding the linearity restriction to each layer of the network, jointly or separately.
VI Conclusion
This paper proposes a technique named DL-Reg, based on linear regression, for regularizing deep neural networks. As such networks tend to learn highly nonlinear functions, DL-Reg forces the final network to behave as linearly as possible and, at the same time, as nonlinearly as necessary. A series of experiments, along with a comparison with the traditional $L_2$ regularizer, is conducted, and the obtained results show considerable improvements in the classification performance of several state-of-the-art methods. We have also shown that DL-Reg is able to extract discriminative features while avoiding unnecessary and less discriminative ones. This behavior helps the final network avoid overfitting. Moreover, the proposed method is easy to implement and increases the learning/convergence speed.