DL-Reg: A Deep Learning Regularization Technique using Linear Regression

by Maryam Dialameh, et al.

Regularization plays a vital role in deep learning by protecting deep neural networks against overfitting. This paper proposes a novel deep learning regularization method named DL-Reg, which reduces the nonlinearity of deep networks to a certain extent by explicitly encouraging the network to behave as linearly as possible. The key idea is to add a linear constraint to the objective function of the deep neural network, namely the error of a linear mapping from the inputs to the outputs of the model. This linear constraint, weighted by a regularization factor, reduces the risk of overfitting. The performance of DL-Reg is evaluated by training state-of-the-art deep network models on several benchmark datasets. The experimental results show that the proposed regularization method 1) gives major improvements over existing regularization techniques, and 2) significantly improves the performance of deep neural networks, especially in the case of small-sized training datasets.




I Introduction

Regularization is a popular and essential technique in machine learning: it reduces the complexity of learned models while allowing them to predict accurately on unseen data during the test phase [R1]. It becomes even more vital in deep learning because of the highly nonlinear behavior of deep models: adding even one more layer increases the nonlinearity of a model considerably [R2], which can result in poor generalization on unseen samples.

According to a taxonomy proposed in [R3], deep learning regularization techniques can be divided into five main categories. Data-based regularization is the first category, which aims either to simplify the representation of the input data by applying certain transformations or to create a large number of data points using data augmentation techniques. The second category is based on modifying the network structure, such as imposing restrictions on the number of nodes/layers and choosing a proper activation function. Regularization via the error function is the third category, which tries to add certain properties to the error function, such as robustness to imbalanced data. The fourth category is based on modifying the optimization algorithm used to train the network. Termination methods, DropOut, momentum, and weight initialization are some examples of this category. The last category is based on adding a regularization term to the network loss function. In this category, the regularization term is assumed to be independent of the targets. Weight decay and $L_2$ regularization are two typical examples of this category.

Because of the highly nonlinear behavior of deep learning models, especially when they become deeper by adding more layers, they naturally tend to memorize rather than generalize. This problem becomes even more serious when the training set is small. Although many researchers have tried to address it, overfitting has not been completely solved for deep learning models and needs more work. Moreover, none of the existing regularization methods explicitly tries to make deep networks behave less nonlinearly. That is, there is no method yet that explicitly penalizes deep networks for learning a highly nonlinear model, and this area is still open to research. Additionally, it is desirable for a regularizer to be efficient, simple, and computationally inexpensive, and to result in discriminative feature maps. Satisfying all these characteristics together, however, requires more research.

Considering the aforementioned points, this paper proposes a simple but efficient regularization method, named DL-Reg (an abbreviation for Deep Learning Regularization), which adds a regularization term to the network's loss function, encouraging the learned model to behave as linearly as possible. In other words, the proposed DL-Reg, which belongs to the last category outlined above, not only explicitly penalizes the network for learning a purely nonlinear model, but also motivates it to learn a model that is as linear as possible while preserving its discrimination ability. To accomplish this, we take advantage of linear regression and propose a least-squares error term, representing the squared error of a linear mapping from the inputs of the network to its outputs. This term motivates the network to behave as linearly as possible so that the least-squares error is minimized. The main contributions of this work can be summarized as follows:

  1. Regularizing deep networks using the sum of squared errors of a linear regression model that directly maps the inputs to the outputs of the network

  2. Allowing supervised deep networks to be trained in a semi-supervised manner

  3. Increasing the performance of deep networks on small-sized datasets

  4. Conducting extensive experiments to confirm the significance of the proposed method in enhancing the performance of deep networks

The rest of this paper is structured as follows. Section II provides a brief survey of the related regularization approaches. Section III presents the proposed regularization method in depth. Section IV reports the experimental results, and Section V discusses the findings and outlines possible directions for future work. Finally, Section VI concludes the paper.

II Related Work

This section gives a brief background on several regularization methods used in deep learning. $L_2$ regularization, which behaves like weight decay when the SGD optimizer is used [R34], is perhaps one of the best-known traditional regularizers. It simply adds the penalty $\lambda \|W\|_2^2$ to the loss, where $\lambda$ is a regularization factor and $W$ denotes the weights of the network [R2, R21]. A recent work [R19], however, showed that decoupling the weight decay from the gradient-based update rule can substantially improve generalization, particularly with the Adam optimizer.
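The relation between an $L_2$ penalty and weight decay under plain SGD can be checked numerically. The following is a minimal sketch, assuming a penalty of the form lam * ||w||^2; the weights, gradient, learning rate, and penalty factor are hypothetical values chosen for illustration and do not come from the experiments.

```python
# Minimal sketch: one plain SGD step with an L2 penalty equals one step with
# multiplicative weight decay. All numbers below are illustrative assumptions.

def sgd_step_l2(w, grad, lr, lam):
    """SGD step on loss + lam * ||w||^2 (gradient of the penalty is 2*lam*w)."""
    return [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad)]

def sgd_step_weight_decay(w, grad, lr, decay):
    """SGD step on the loss alone, combined with shrinking the old weights."""
    return [(1.0 - decay) * wi - lr * gi for wi, gi in zip(w, grad)]

w = [0.5, -1.2, 2.0]        # hypothetical weights
grad = [0.1, -0.3, 0.05]    # hypothetical gradient of the data loss
lr, lam = 0.1, 0.01         # hypothetical learning rate and penalty factor

w_l2 = sgd_step_l2(w, grad, lr, lam)
w_wd = sgd_step_weight_decay(w, grad, lr, decay=2.0 * lr * lam)
max_gap = max(abs(a - b) for a, b in zip(w_l2, w_wd))
print(max_gap)  # numerically zero: the two updates coincide for plain SGD
```

The equivalence relies on the update being a plain gradient step; with adaptive optimizers such as Adam the two updates generally differ, which is the setting addressed in [R19].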

Smoothness [R4] is another regularization method, which penalizes large derivatives in the model through the Frobenius norm of the Jacobian of the network parameterized by its weights. In another work, the norm of the gradient of the loss function was penalized to obtain loss-invariant backpropagation, which makes the loss invariant to changes in the input [R5]. Hessian Penalty [R6] was proposed as a fast approximation of the norm of the Hessian of the network, obtained by penalizing the Jacobian under noisy inputs. This idea was further exploited in [R25] to build networks that are robust against adversarial examples.

To improve the performance of recurrent neural networks (RNNs), it has been shown that imposing unitary or orthogonal constraints on the weight matrices protects the network against vanishing/exploding gradients [R7, R8]. In other research, the matrix spectral norm [R9] has been used to regularize the network by making it insensitive to perturbations and variations of the training samples. More precisely, the parameters of the model are trained so that the spectral norm of the weights is small, making the network less sensitive to changes in the order of the training data at each epoch. In the same direction, SHADE [R20] uses a loss function based on conditional entropy that tries to minimize the variation in the input representations. Inspired by [R23], Louizos et al. [R22] leverage the notion of weight sparsity, trying to set a certain number of the network weights to zero by applying $L_0$ regularization; however, it is applicable only under certain conditions, as the $L_0$ norm is not generally differentiable. In the same way, the group sparse regularization method [R24] applies the notion of sparsity to the sets of outgoing weights from neurons.

Shake-Shake regularization [R10] was proposed for only a specific type of residual network (ResNet). It follows the idea of adding gradient noise to the learning procedure, where gradient noise is replaced by gradient augmentation, allowing the network to escape from local optima. More precisely, this approach multiplies the output of the residual branches by a random scalar and applies the result in both the forward and backward passes. ShakeDrop regularization [R11] is an extension of Shake-Shake that can be applied to other ResNet models.

The family of drop methods, initially introduced by DropOut [JMLR:v15:srivastava14a], is another type of regularization, which protects deep learning models against overfitting by randomly dropping a certain number of neurons during different epochs of training. DropBlock [R29], DropBand [R30], and DropFilter [R31] are several recently proposed methods of this family. DropBlock randomly drops a certain number of contiguous regions in feature maps. DropBand drops one channel of the input data each time, and DropFilter randomly drops some elements of convolutional layers. Additionally, Spectral DropOut [KHAN201982] is another member of this family, which prevents overfitting by first calculating the Fourier coefficients of the network and then eliminating noisy and weak coefficients; it, however, needs additional computation to find the Fourier coefficients. Cutout [R26] is a restricted type of DropOut-based regularizer that randomly masks square regions of the inputs. This masking forces the network to learn complementary features, which is helpful in the case of occlusion. Overall, these methods differ in whether they drop layers, nodes, or weights; in practice, however, there is not much difference between them in terms of performance, and they behave almost like DropOut.

Moosavi-Dezfooli et al. [R27] proposed a regularization technique that minimizes the curvature of the loss surface, which is helpful for adversarial robustness. However, their optimization procedure needs to calculate the eigenvalues of the Hessian of the loss, which is computationally expensive. Stankovic et al. [R33] took advantage of a graph Laplacian regularizer to address the problem of limited training data in deep neural networks. The proposed method is based on iteratively solving a quadratic programming problem, which adds more computation to the training phase. Apart from that, this method works only for binary classification problems.

A modification of the softmax loss function, called Angular softmax [R28], was recently proposed as an explicit regularization technique that tries to increase inter-class separability by increasing the distance between class centers. Although this method theoretically leads to more discrimination, the empirical results over different types of datasets fall short of expectations. Moreover, Angular softmax is a new/modified loss function, not a regularization method in general. Another recent work is style transfer regularization [R32], which tries to regularize the network by generating new data, mostly textured image data, by combining the content of one image with the appearance of another.

Based on the above summary, we can conclude that none of the current regularization methods combines simplicity/generality, efficiency, and the ability to deal with small-sized training datasets at the same time. Accordingly, this paper aims to propose an efficient but simple method for regularizing deep neural networks that allows the networks to extract highly discriminative features, as the experimental results confirm. Moreover, the proposed method is suitable for small-sized training datasets.

III Proposed Method

This section describes the proposed regularization method (DL-Reg) in detail. A deep neural network can be considered as a function $f(\cdot\,; W)$ parameterized by a set of weights $W$ that maps an n-dimensional input $x \in \mathbb{R}^{n}$ to a c-dimensional output $y \in \mathbb{R}^{c}$, i.e., $f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{c}$. The goal of training is to find an optimal set of weights $W^{*}$ minimizing a certain empirical risk function $\mathcal{L}$:

$$W^{*} = \arg\min_{W} \mathcal{L}(X, T; W), \qquad (1)$$

where $X = [x_1, \dots, x_m]^{\top} \in \mathbb{R}^{m \times n}$ contains the $m$ training samples and $T \in \{0, 1\}^{m \times c}$ is the corresponding binary label matrix. $T$ is defined as follows: for each training sample $x_i$, $t_i$ is its label vector. If $x_i$ is from the $k$th class, then only the $k$th entry of $t_i$ is one and all the other entries are zero. The risk function then takes the following form:

$$\mathcal{L}(X, T; W) = E(f(X; W), T) + \lambda R(\cdot), \qquad (2)$$

where $E$ is a loss function, which calculates the error between the output of the network $Y = f(X; W)$ and the target $T$, $R$ is a regularization function that may consume a certain number of inputs except the targets, and $\lambda$ is a regularization factor, which determines the importance of regularization in the risk function. Considering the aforementioned definitions, this work aims to propose a regularization function $R$ that improves the generalization ability of the network, particularly in small-sample-size problems. Accordingly, we propose the following regularization function, which is simply the squared norm of the error between a linear mapping of the inputs and the outputs of the network:

$$R(X, Y) = \|\tilde{X} A - Y\|_F^2. \qquad (3)$$

Here $A \in \mathbb{R}^{(n+1) \times c}$ is a linear transformation operator (the last row of $A$ represents the bias parameters), which maps $\tilde{X} = [X, \mathbf{1}] \in \mathbb{R}^{m \times (n+1)}$ (i.e., $X$ concatenated with a column of all ones) to the c-dimensional output $Y$, and $\|\cdot\|_F$ denotes the Frobenius norm. In other words, $R$ calculates the error of a linear regression between the inputs and the outputs of the network. The parameters of $A$ are initialized randomly and then updated during the training process of the deep neural network. More precisely, whenever the parameters of the network get updated, $A$ is updated as well.
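The regularization term described above, the squared error between a linear mapping of the bias-augmented inputs and the network outputs, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `dl_reg_penalty`, the array sizes, and the use of `tanh` as a stand-in for a nonlinear network are all hypothetical.

```python
import numpy as np

def dl_reg_penalty(X, Y, A):
    """Squared Frobenius norm of the linear-fit error ||[X, 1] A - Y||_F^2."""
    ones = np.ones((X.shape[0], 1))
    X_tilde = np.hstack([X, ones])      # append the bias column of ones
    residual = X_tilde @ A - Y
    return float(np.sum(residual ** 2))

rng = np.random.default_rng(0)
m, n, c = 8, 5, 3                        # hypothetical sample/input/output sizes
X = rng.normal(size=(m, n))
A = rng.normal(size=(n + 1, c))          # last row plays the role of the bias

# Outputs that are exactly affine in the inputs incur no penalty ...
Y_affine = np.hstack([X, np.ones((m, 1))]) @ A
penalty_affine = dl_reg_penalty(X, Y_affine, A)

# ... while nonlinear outputs (here a tanh of the same map) are penalized.
Y_nonlinear = np.tanh(Y_affine)
penalty_nonlinear = dl_reg_penalty(X, Y_nonlinear, A)
print(penalty_affine, penalty_nonlinear)
```

In a training loop this penalty would be scaled by the regularization factor and added to the task loss; how the linear operator is updated alongside the network weights is left open in this sketch.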

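Practically speaking, Eq. (3) can be applied in the case of mini-batch optimization, where it can be rewritten as follows:

$$R(X_b, Y_b) = \|\tilde{X}_b A - Y_b\|_F^2, \qquad (4)$$

where $\tilde{X}_b \in \mathbb{R}^{b \times (n+1)}$ represents a mini-batch of samples of $X$ concatenated with a column of all ones as the bias multipliers, $b$ is the size of a batch, $Y_b = f(X_b; W)$ is the corresponding output of the network, and $A$ is the linear transformation operator. Figure 1 shows a graphical view of DL-Reg.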

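Minimizing Eq. (4) w.r.t. the linear operator $A$ is a typical least-squares problem and can be solved in closed form as follows:

$$A = \tilde{X}_b^{\top} (\tilde{X}_b \tilde{X}_b^{\top})^{-1} Y_b, \qquad (5)$$

where $\tilde{X}_b$ is the bias-augmented mini-batch input matrix and $Y_b$ is the corresponding network output. Because the number of samples in a mini-batch, $b$, is often smaller than the size of the input, $n$, i.e., $b < n$, $\tilde{X}_b$ is a fat matrix and can be regarded as a full-row-rank matrix. Hence, the inverse of $\tilde{X}_b \tilde{X}_b^{\top}$ exists, as it forms a full-rank matrix.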

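It is also worth noting that in the case of $b > n$, which is quite rare in deep learning problems, Eq. (4) can be solved as follows:

$$A = (\tilde{X}_b^{\top} \tilde{X}_b)^{-1} \tilde{X}_b^{\top} Y_b, \qquad (6)$$

in which the bias-augmented mini-batch matrix $\tilde{X}_b$ becomes a full-column-rank, i.e., tall, matrix, and consequently $(\tilde{X}_b^{\top} \tilde{X}_b)^{-1}$ exists [R12].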
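The two closed-form least-squares solutions discussed above (one for a fat mini-batch matrix, one for a tall one) can be sketched as follows. This is a minimal illustration assuming full row rank in the fat case and full column rank in the tall case; the function name `fit_linear_map`, the sizes, and the random stand-in for network outputs are hypothetical. Note that the code compares the batch size with the number of columns of the bias-augmented matrix.

```python
import numpy as np

def fit_linear_map(X_tilde, Y):
    """Closed-form least-squares fit of A in X_tilde @ A ~ Y."""
    b, d = X_tilde.shape
    if b < d:
        # Fat case (fewer samples than columns): minimum-norm solution,
        # assuming X_tilde has full row rank.
        gram = X_tilde @ X_tilde.T
        return X_tilde.T @ np.linalg.solve(gram, Y)
    # Tall case (at least as many samples as columns): normal equations,
    # assuming X_tilde has full column rank.
    gram = X_tilde.T @ X_tilde
    return np.linalg.solve(gram, X_tilde.T @ Y)

rng = np.random.default_rng(0)
n, c = 20, 4                                   # hypothetical input/output sizes

def random_batch(b):
    X = rng.normal(size=(b, n))
    Y = rng.normal(size=(b, c))                # stand-in for network outputs
    return np.hstack([X, np.ones((b, 1))]), Y

X_fat, Y_fat = random_batch(8)                 # fewer samples than columns
X_tall, Y_tall = random_batch(64)              # more samples than columns

A_fat = fit_linear_map(X_fat, Y_fat)
A_tall = fit_linear_map(X_tall, Y_tall)

# Cross-check against NumPy's general least-squares solver.
ref_fat = np.linalg.lstsq(X_fat, Y_fat, rcond=None)[0]
ref_tall = np.linalg.lstsq(X_tall, Y_tall, rcond=None)[0]
fat_matches = np.allclose(A_fat, ref_fat)
tall_matches = np.allclose(A_tall, ref_tall)
print(fat_matches, tall_matches)
```

Solving the linear system with `np.linalg.solve` instead of forming an explicit inverse is a design choice of this sketch for numerical stability; the resulting operator is the same.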

Eq. (3) aims to keep the network behaving as a linear mapping function and penalizes the network when it behaves highly nonlinearly. Hence, one concern about the proposed $R$-function might be its negative impact on the nonlinear power of deep networks, as the major strength of deep learning methods is rooted in their ability to produce nonlinear feature maps. This concern, however, can be addressed because the regularization factor $\lambda$ adjusts the impact of the linearization enforced by the $R$-function, so choosing a proper value of $\lambda$ resolves it. It is worth noting that $\lambda$ can be selected through a cross-validation procedure over a validation set. Additionally, the independence of the $R$-function from the targets allows the network to take advantage of unlabeled training data, which also makes it suitable for semi-supervised learning.
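Because the regularization term needs no targets, it can in principle be evaluated on unlabeled samples as well. The sketch below shows one possible way to combine a supervised loss on labeled data with the linear-fit penalty on labeled plus unlabeled data. The paper does not spell out this combination, so the function `semi_supervised_risk`, the toy ReLU network, and all sizes and values are assumptions made for illustration.

```python
import numpy as np

def softmax_cross_entropy(logits, onehot):
    """Mean cross-entropy between softmax(logits) and one-hot targets."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(onehot * log_probs).sum(axis=1).mean())

def linear_fit_penalty(X, Y, A):
    """Squared error of the linear map [X, 1] @ A against the outputs Y."""
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.sum((X_tilde @ A - Y) ** 2))

def semi_supervised_risk(net, A, X_lab, T_lab, X_unlab, lam):
    """Supervised loss on labeled data plus penalty on all available inputs."""
    X_all = np.vstack([X_lab, X_unlab])
    data_loss = softmax_cross_entropy(net(X_lab), T_lab)
    penalty = linear_fit_penalty(X_all, net(X_all), A)   # needs no targets
    return data_loss + lam * penalty

rng = np.random.default_rng(0)
n, c = 6, 3                                              # hypothetical sizes
W1, W2 = rng.normal(size=(n, 16)), rng.normal(size=(16, c))

def net(X):
    """Toy one-hidden-layer ReLU network standing in for a deep model."""
    return np.maximum(X @ W1, 0.0) @ W2

A = rng.normal(size=(n + 1, c))
X_lab = rng.normal(size=(4, n))
T_lab = np.eye(c)[rng.integers(0, c, size=4)]            # one-hot labels
X_unlab = rng.normal(size=(10, n))

risk_plain = semi_supervised_risk(net, A, X_lab, T_lab, X_unlab, lam=0.0)
risk_reg = semi_supervised_risk(net, A, X_lab, T_lab, X_unlab, lam=1e-3)
print(risk_plain, risk_reg)
```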


Figure 1: A schematic overview of the proposed regularization method.

IV Experiments

This section evaluates the effectiveness of the proposed regularization method on several state-of-the-art deep network architectures, such as ResNet-152 [R14], DenseNet [R16], and VGG [R17]. The evaluation is performed on the task of image classification using several benchmark datasets including MNIST, CIFAR-10, CIFAR-100, and ImageNet. The parameters used in the training phase of each architecture, such as the learning rate, number of epochs, and batch size, are the same with and without the proposed regularization method.

IV-A CIFAR Datasets

This subsection reports the results of the proposed regularization function on CIFAR-10 and CIFAR-100 and compares them to the original version of each network, i.e., the network trained without the proposed regularization. CIFAR-10 consists of 60k color images of size $32 \times 32$ divided into 10 classes. The standard training and testing sizes are 50k and 10k, respectively, with 5k training images per class. CIFAR-100 is the same as CIFAR-10, except that it has 100 classes, each with 500 training images. Tables I and II report the classification accuracies of each model on each dataset, where the latter uses a subset of the original dataset as training data. Additionally, Figures 2 to 5 show the train/test accuracy curves corresponding to Tables I and II, depicting the learning behavior of the proposed method and its ability to escape from local optima, e.g., sub-figures 2(a,d). As the figures illustrate, in most cases the training accuracies of the proposed method are lower than those of the original methods, indicating that the proposed regularization is less prone to overfitting. As Table I shows, applying the proposed method improves every network, with improvement ratios between 1.52% and 2.99% on CIFAR-10 and between 1.83% and 6.02% on CIFAR-100. More importantly, Table II demonstrates the robustness of the proposed regularization to overfitting in small-sample-size problems, i.e., the accuracies drop only slightly.

| Model | CIFAR-10 Original | CIFAR-10 Proposed | CIFAR-10 Improvement | CIFAR-100 Original | CIFAR-100 Proposed | CIFAR-100 Improvement |
|---|---|---|---|---|---|---|
| DenseNet-121 | 93.25 | 95.63 | +2.55% | 69.4 | 73.28 | +5.59% |
| VGG-13 | 92.26 | 93.66 | +1.52% | 67.25 | 71.3 | +6.02% |
| ResNet-152 | 92.71 | 95.0 | +2.47% | 75.67 | 77.33 | +2.2% |
| EfficientNetB0 | 89.11 | 91.77 | +2.99% | 78.61 | 80.05 | +1.83% |

Table I: The comparison of test results on CIFAR-10 and CIFAR-100. The improvement ratio shows the amount of improvement achieved by applying the proposed regularization method. Batch-size is set to 128.
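The improvement ratios in Table I appear to be relative gains over the original accuracy. The short sketch below assumes the definition 100 * (proposed - original) / original, which is not stated explicitly in the text but reproduces the DenseNet-121 row; the function name is hypothetical.

```python
def improvement_ratio(original, proposed):
    """Relative gain of `proposed` over `original`, in percent."""
    return 100.0 * (proposed - original) / original

# DenseNet-121 accuracies as reported in Table I.
cifar10 = improvement_ratio(93.25, 95.63)
cifar100 = improvement_ratio(69.4, 73.28)
print(round(cifar10, 2), round(cifar100, 2))  # 2.55 5.59
```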
Figure 2: CIFAR-10. The classification accuracies are obtained by applying our proposed regularization method on four different networks. Rows show training/testing results obtained from each network: panels (a)-(d) show the training accuracy of DenseNet-121, VGG-13, ResNet, and EfficientNetB0, and panels (e)-(h) show the corresponding test accuracy.
Figure 3: CIFAR-100. The classification accuracies are obtained by applying our proposed regularization method on four different networks. Rows show training/testing results obtained from each network: panels (a)-(d) show the training accuracy of DenseNet-121, VGG-13, ResNet-152, and EfficientNetB0, and panels (e)-(h) show the corresponding test accuracy.
| Model | Reduced CIFAR-10 Original | Reduced CIFAR-10 Proposed | Reduced CIFAR-10 Improvement | Reduced CIFAR-100 Original | Reduced CIFAR-100 Proposed | Reduced CIFAR-100 Improvement |
|---|---|---|---|---|---|---|
| DenseNet-121 | 89.73 | 91.95 | +2.47% | 68.21 | 72.56 | +6.37% |
| VGG-13 | 88.61 | 90.62 | +1.70% | 58.89 | 70.36 | +19.47% |
| ResNet-152 | 87.02 | 89.31 | +2.63% | 62.33 | 75.42 | +21.0% |
| EfficientNetB0 | 83.16 | 86.17 | +7.39% | 64.3 | 76.01 | +18.21% |

Table II: The comparison of test results on randomly reduced sets (20k) of CIFAR-10 and CIFAR-100. The improvement ratio shows the amount of improvement achieved by applying the proposed regularization method. Batch-size is set to 128.
Figure 4: Reduced CIFAR-10. The classification accuracies are obtained by applying our proposed regularization method on four different networks while the training size of the dataset is randomly reduced to 20k. Rows show training/testing results obtained from each network: panels (a)-(d) show the training accuracy of DenseNet-121, VGG-13, ResNet-152, and EfficientNetB0, and panels (e)-(h) show the corresponding test accuracy.
Figure 5: Reduced CIFAR-100. The classification accuracies are obtained by applying our proposed regularization method on four different networks while the training size of the dataset is randomly reduced to 20k. Rows show training/testing results obtained from each network: panels (a)-(d) show the training accuracy of DenseNet-121, VGG-13, ResNet-152, and EfficientNetB0, and panels (e)-(h) show the corresponding test accuracy.

IV-B ImageNet

This experiment investigates the impact of our proposed regularization term on a very large-scale set of images. ImageNet is a large image-classification dataset with more than a million annotated images divided into 1,000 classes. Table III reports the classification results on ImageNet and on a reduced training set of ImageNet obtained by randomly selecting 200 images from each category, therefore 200k in total. As the results show, applying the proposed regularization method increases the top-1 accuracy of every network, with improvement ratios between 1.56% and 3.0% on the full dataset. More importantly, the improvement ratios on the reduced version of ImageNet (5.65% to 12.73%) are considerably higher than those on the full dataset, supporting the idea that the proposed method can be helpful in small-sample-size problems by reducing the chance of overfitting. Detailed specifications and preprocessing steps of each method are available at https://paperswithcode.com/sota/image-classification-on-imagenet.

| Model | ImageNet Original | ImageNet Proposed | ImageNet Improvement Ratio | Reduced ImageNet Original | Reduced ImageNet Proposed | Reduced ImageNet Improvement Ratio |
|---|---|---|---|---|---|---|
| DenseNet-121 | 74.98 | 77.2 | +2.96% | 62.9 | 70.5 | +12.08% |
| VGG-13 | 74.1 | 76.33 | +3.0% | 61.25 | 69.05 | +12.73% |
| ResNet-152 | 78.57 | 79.8 | +1.56% | 69.7 | 75.4 | +8.17% |
| EfficientNetB0 | 76.3 | 78.15 | +2.42% | 65.4 | 69.1 | +5.65% |

Table III: The comparison of top-1 test results on ImageNet and a randomly reduced set of ImageNet (200k). The improvement ratio shows the amount of improvement achieved by applying the proposed regularization method.

To investigate the effectiveness of the proposed method more deeply, we use class activation maps (CAM), introduced in [R18], to depict the class activations of each architecture on several samples of ImageNet's test set. To do that, we randomly select three classes of ImageNet, depicted in the first row of Figure 6. Then, we calculate their CAMs using the original DenseNet-121 and the one equipped with the proposed regularization (the second row shows the obtained CAMs), where both networks are trained on ImageNet. Finally, to give a better view, the third row combines the first two rows. From the results, it is evident that the proposed regularization forces DenseNet-121 to learn discriminative features from the object of interest in each class, while the original DenseNet-121 tends to memorize most areas of the images. In the case of Fireweed, for instance, the version of DenseNet-121 equipped with our proposed regularization uses a small number of petals to make its decision, while the original DenseNet-121 uses almost all areas of the image, which eventually leads to lower generalization.

Figure 6: The results of Class Activation Mapping (CAM) of DenseNet-121 on several images taken from ImageNet. Panels (a)-(c) in the first row show the original images (Wading bird, Fireweed, and Pouch); panels (d)-(i) in the second row depict the obtained CAMs, alternating between the proposed regularization and the original network; panels (j)-(o) in the third row combine the previous rows to give a better view of the CAMs. As the results show, the CAMs of the proposed regularization highlight fewer but more discriminative areas of the image, e.g., the wings of the bird, which are very discriminative, are highlighted even on the water, while the CAMs of the original network tend to highlight more but less discriminative areas of the image, e.g., sub-figures (g,i).

IV-C Comparison with $L_2$ regularizer

This subsection conducts several experiments on the MNIST dataset to compare the performance of the proposed regularization technique (DL-Reg) and the well-known $L_2$ regularizer. The reason for selecting the $L_2$ regularizer is that it belongs to the same category (see Section I) as DL-Reg; hence, the comparison is fair. In all experiments, we use the same parameter settings, including randomness, train/test size, batch size, learning rate, maximum number of epochs, and every other setting. Table V lists these parameters along with their assigned values. Moreover, we use the same network structure, consisting of three sequential hidden layers (1024, 1024, 2048) with ReLUs, with or without Dropout (using separate rates for the input layer and for the other layers). Figure 7 depicts the obtained per-epoch results in terms of train and test accuracies as well as train losses. To give a better view of the results, we only depict the first 200 epochs and the last 400 epochs of the learning procedure, which makes it easier to compare the learning behaviours of the models at the beginning and the end of training. Additionally, the final test accuracy of each strategy is reported in Table IV.
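To give a sense of scale for this fully connected setup, the following sketch counts the trainable parameters of the network and of the linear operator used by DL-Reg. It assumes flattened 28 x 28 MNIST inputs and a 10-unit output layer on top of the three hidden layers; the output layer and the helper name `dense_params` are assumptions, since the text lists only the hidden sizes.

```python
def dense_params(sizes):
    """Number of weights and biases in a fully connected stack."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

n_inputs, n_classes = 28 * 28, 10               # flattened MNIST image, 10 digits
hidden = [1024, 1024, 2048]                     # hidden sizes stated in the text

network_params = dense_params([n_inputs, *hidden, n_classes])
linear_map_params = (n_inputs + 1) * n_classes  # linear operator incl. bias row
print(network_params, linear_map_params)  # 3973130 7850
```

Under these assumptions the linear operator adds well under one percent to the parameter count of the network.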

As depicted in Figure 7, in all cases the proposed regularization method achieves higher accuracy in both the train and test phases, and a lower training loss. More precisely, the proposed DL-Reg outperforms the $L_2$ regularizer both with and without Dropout layers. The convergence speed is another significant implication of the proposed method: DL-Reg shows a faster rate of convergence and a more stable behavior than the $L_2$ regularizer. That is to say, DL-Reg can successfully reduce the nonlinearity of deep networks by implicitly forcing the neurons of the networks to behave as linearly as necessary.

Another interesting observation from Figure 7(c) is that the $L_2$ regularizer performs better during the first epochs of training; however, after a certain number of epochs, DL-Reg reveals its generalization power and outperforms the $L_2$ regularizer.

Figure 7: A comparison between the traditional $L_2$ regularizer and the proposed DL-Reg in two scenarios, with (first two columns) and without (last two columns) Dropout layers, on the MNIST dataset. Panels (a)-(d) show the train accuracy, panels (e)-(h) the test accuracy, and panels (i)-(l) the train loss; within each scenario, the two columns cover epochs 1-200 and epochs 800-1200. The network structure and all other common settings are the same for both methods. To better visualize the training behaviour of the models, only the results of the first 200 epochs and the last 400 epochs are shown. In both scenarios, DL-Reg outperforms the $L_2$ regularizer. Moreover, DL-Reg shows faster convergence and more stable behaviour.
| Method | Test Classification Accuracy |
|---|---|
| $L_2$ regularizer | 98.38 |
| Dropout + $L_2$ regularizer | 98.47 |
| DL-Reg (proposed) | 98.69 |
| Dropout + DL-Reg (proposed) | 98.94 |

Table IV: Comparison of the $L_2$ and the proposed DL-Reg regularization methods on the MNIST dataset.
| Parameter | Value |
|---|---|
| Learning rate (lr) | |
| Decay rate for lr | 0.96 |
| lr-scheduler | Exponential |
| Frequency of lr-scheduler | every 30 epochs |
| Optimizer | SGD |
| Loss function | Cross-Entropy |
| Batch size | |
| Data pre-processing | None |
| Regularization factor for DL-Reg | |
| Regularization factor for $L_2$ regularizer | |

Table V: The hyper-parameters of the fully-connected networks trained on the MNIST dataset (Subsection IV-C).

V Discussion and Future Work

There is always a trade-off between the amount of data used for training and the depth of the model on one side, and the model's complexity in terms of memory and time on the other. The proposed regularization technique forces the network to behave as linearly as possible. That is, it discourages the network from learning a highly nonlinear function while preserving its prediction ability. This restriction enables the network to learn discriminative features, as is clearly visible in the results depicted in Figure 6. In the case of Wading bird, for instance, the proposed method uses the pattern of the bird's wings to detect this object, which provides enough discrimination. It is worth noting that the proposed method even detects the reflection of the wings on the water.

The regularization factor, i.e., $\lambda$, plays an essential role in the performance of DL-Reg. If $\lambda$ increases, the learning ability of the network is reduced, eventually causing the network to behave entirely like a linear regression. In contrast, if $\lambda$ approaches zero, the regularization no longer has any impact on the learning procedure. Therefore, $\lambda$ should be chosen carefully for every learning problem.

Finally, the main implications of the proposed regularization method are summarized as follows:

  1. DL-Reg provides better generalization in practice and learns discriminative features.

  2. The convergence speed of the proposed DL-Reg is fast; however, it depends on the value of the regularization factor.

  3. The computational cost of DL-Reg is negligible.

  4. The proposed regularization method is easy to implement and can be added to any network. In other words, it is independent of the choice of loss function.

One of the promising areas of future work could be investigating the effects of adding the linearity restriction to each layer of the network, jointly or separately.

VI Conclusion

This paper proposes a linear technique named DL-Reg for regularizing deep neural networks. As such deep networks tend to learn highly nonlinear functions, DL-Reg forces the final network to behave as linearly as possible and, at the same time, as nonlinearly as necessary. A series of experiments, along with a comparison with the traditional $L_2$ regularizer, is conducted, and the obtained results show clear improvements in the classification performance of several state-of-the-art methods. We have also shown that DL-Reg is able to extract discriminative features while avoiding unnecessary and less discriminative ones. This behavior enables the final network to avoid overfitting. Moreover, the proposed method is easy to implement and increases the learning/convergence speed.