Training computers to learn and reason like humans and to produce reliable results has long been a central topic in Machine Learning (ML). Although there are several types of ML, for simplicity we focus only on supervised learning, in which a model is trained on a known dataset and the trained model then predicts new, unseen data. Supervised learning covers two tasks, classification and regression. They share the concept of mapping inputs to outputs but differ in the form of the targets: for classification the output is discrete or categorical, while for regression it is continuous (real-valued). In an attempt to create machines that reason like humans, mathematical models mimicking the human brain were introduced and have become commonly known as Neural Networks (NN). The idea of the NN was first introduced in 1948 by Alan Turing copeland04turing and has since been developed into a family of algorithms built on neuron-like networks. These networks are used to solve problems similar to, but more complex than, those addressed by other ML algorithms, such as credit card fraud detection in the financial sector CHOUIEKH2018133 and automated driving systems in the automotive industry 9307303. Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) are some examples of these neuron-like algorithms.
Deeper networks tend to yield higher accuracy and a greater capacity for data representation. However, they come with a large number of parameters, which makes the models prone to overfitting, that is, failing to generalize to unseen data. Overfitting is likely to happen when the model is very complex while the training data is insufficient. In practice, the networks that are deployed are generally deep and have far more parameters than LeNet NIPS1989_53c3bce6. Overfitting conflicts with the objective of any ML algorithm, which is for the trained model to perform well not only on the training data but also on unseen data. Regularization is used to avoid it.
In recent years, several regularization techniques have been developed for various parts of the network. Some are applied to the weights: DropConnect pmlr-v28-wan13 drops weights between connected nodes, leaving the connected layer sparse while the nodes themselves can remain active, and $L_2$ regularization, or weight decay Krizhevsky2009LearningML, adds a penalty term with a hyperparameter ($\lambda$) to the error function so that the weights decay towards zero. Data augmentation krizhevsky2012imagenet is used at the input layer to apply image transformations such as flipping, zooming, shifting and cropping. Dropout srivastava2014dropout can be applied, for instance, at the hidden layers to reduce the dependence between neurons by randomly dropping nodes from the network. Penalizing confident outputs pereyra2017regularizing acts at the output layer by adding the negative entropy to the negative log-likelihood during training. The last example concerns the loss layer: DisturbLabel (DL) xie2016disturblabel regularizes the loss layer by randomly flipping ground-truth labels according to a shared hyperparameter ($\alpha$). Among the many regularization methods, DL is presented as the first algorithm to act on the loss layer for the classification task. It is simple to implement, yet its reported results are comparable to those of widely known techniques such as dropout, which convinced us to explore this concept and develop it further. In this paper, we improve the regularization technique on the loss layer for the classification task and extend the idea to the regression task.
For classification, we propose Directional DisturbLabel (DDL), a method that systematically selects which labels to disturb by excluding non-confident labels from the candidates based on cosine similarity. We show that this improvement can reduce the misclassification rate compared to the baseline and that it performs better with a deeper network. Additionally, the experimental results reaffirm that DDL can be combined with other regularization methods without difficulty. For regression, we propose DisturbValue (DV) and DisturbError (DE), two novel methods developed by applying the disturbing procedure of DL to the regression task. The experimental results show the effectiveness of our methods. Our code is available from the first author's GitHub (https://github.com/kimy-de/DisturbMethods).
Our main contributions include:
- Propose model-agnostic noisy regularization methods.
- Improve DisturbLabel by filtering non-confident labels out of the candidates for disturbed labels.
- Demonstrate the robustness of our methods using 14 datasets.
2 Related Work
Regularization methods differ in which part of the neural network they target: regularization can be imposed on the weights, on hidden nodes or on input nodes, but regularizing within the loss layer is relatively new. Xie et al. xie2016disturblabel introduced DisturbLabel (DL), the first work investigating this area for convolutional neural networks.
The closest method to DisturbLabel is label smoothing szegedy2015rethinking, which also perturbs the ground-truth labels, softening each one into a vector of class probabilities, whereas DisturbLabel flips the label entirely. As the authors of DisturbLabel point out, its main advantage is that it is stochastic, whereas soft labelling is deterministic and therefore cannot provide regularization as strong as that of DisturbLabel.
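To make the contrast concrete, the following minimal sketch builds a smoothed target vector in the usual form $(1 - \epsilon)\,\text{one-hot} + \epsilon / C$. The function name `smooth_label` and the parameter name `eps` are chosen for this example only and do not come from the cited papers.

```python
def smooth_label(label, num_classes, eps):
    """Label smoothing: (1 - eps) * one_hot + eps / C.

    The result is deterministic: the same label always maps to the same vector.
    """
    return [
        (1.0 - eps) * (1.0 if k == label else 0.0) + eps / num_classes
        for k in range(num_classes)
    ]


# Example: class 1 out of 4 classes with eps = 0.1
target = smooth_label(1, 4, 0.1)
```

The smoothed vector is the same in every iteration, whereas a disturbed label is a fresh random one-hot vector each time, which is the stochasticity referred to above.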
Joo et al. joo2020being also perturb the ground-truth labels in a way similar to label smoothing, but use a Bayesian approach: instead of manipulating the labels directly, they treat the ground truth as a random variable with a categorical probability over the class labels rather than as given by the training label.
Among recent work on manipulating labels, the mixup method zhang2017mixup is one of the most prominent. It trains a neural network on convex combinations of pairs of examples and their labels. Li et al. li2020regularization combine label smoothing with the hypothesis that more confident predictions require stronger regularization (we also exploit this hypothesis in our work). They regularize via structural label smoothing, applying different smoothing strengths to clusters of data lying in different parts of the feature space.
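The convex combination used by mixup can be sketched as follows. This is an illustrative sketch only: the function name `mixup_pair`, the default Beta parameter of 0.2 and the use of plain Python lists are assumptions for this example, with the mixing coefficient drawn from a symmetric Beta distribution.

```python
import random


def mixup_pair(x1, y1, x2, y2, beta_param=0.2, rng=random):
    """Mix two examples and their one-hot labels with the same coefficient.

    lam is drawn from Beta(beta_param, beta_param); both the inputs and
    the labels are combined as lam * first + (1 - lam) * second.
    """
    lam = rng.betavariate(beta_param, beta_param)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

Unlike label disturbing, the mixed label stays tied to a correspondingly mixed input, so inputs and targets are perturbed together.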
Noisy regularization for regression tasks is a well-known technique, but as pointed out in a survey paper moradi2020survey, the noise is added either to the inputs (poole2014analyzing) or to the weights (hochreiter1995simplifying). To the best of our knowledge, adding noise to the targets as a form of regularization has not been attempted before. Imani and White imani2018improving did add Gaussian noise to the targets, but their experiments differ from our proposed DV approach in that (a) the noise was used as a form of augmentation and (b) it was added to all targets.
Adding Gaussian noise to target values was also explored by Wang and Principe wang1999training, but our works complement rather than repeat each other. First, the prior work concentrates on the convergence properties of a network with noise added to the desired signal and does not investigate regularization effects. Second, it adds noise to the target values and uses an annealing schedule on the step size to control the variance of the noise, since the noise affects the weight updates. In contrast, we control the number of noisy values through a noise rate instead of adjusting the step size.
DisturbLabel (DL) xie2016disturblabel regularizes the loss layer by replacing some ground-truth labels with incorrect labels in each iteration. DL selects the labels to replace at random, based on a noise rate $\alpha$ that determines the amount of disturbed labels. When $\alpha = 0$, batch training proceeds without any change. When $\alpha > 0$, each selected true label $c$ is converted into a disturbed label drawn from a Multinoulli distribution with the following probabilities:
$$p_c = 1 - \frac{C-1}{C}\,\alpha, \qquad p_i = \frac{\alpha}{C} \quad (i \neq c),$$
where $p_c$ and $p_i$ replace 1 and 0, respectively, in the one-hot vector and $C$ is the number of classes. DL does not consider the confidence of the labels when generating disturbed labels.
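As a rough illustration of the disturbing step, the sketch below resamples each label with probability `alpha` from a uniform distribution over all classes. This is one way to realize the Multinoulli formulation of Xie et al. as we understand it: the true class is kept with probability $1 - \alpha (C-1)/C$ and every other class is chosen with probability $\alpha / C$. The function name `disturb_labels`, the `rng` parameter and the use of plain Python lists are assumptions made for this example, and `alpha` is given as a fraction rather than a percentage.

```python
import random


def disturb_labels(labels, num_classes, alpha, rng=random):
    """Return a copy of `labels` in which each label is resampled with probability alpha.

    A resampled label is drawn uniformly from all classes (including the true
    one), so overall a label keeps its value with probability
    1 - alpha * (C - 1) / C and becomes any other class with probability alpha / C.
    """
    disturbed = []
    for label in labels:
        if rng.random() < alpha:
            disturbed.append(rng.randrange(num_classes))
        else:
            disturbed.append(label)
    return disturbed
```

In a training loop, such a function would be called on the labels of every mini-batch, so that a different subset of labels is disturbed in each iteration.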
Our hypothesis is that highly confident labels cause the overfitting problem; hence our method, Directional DisturbLabel (DDL), considers only confident labels as candidates to be disturbed. Specifically, let $y$ be the one-hot vector of the true label and $\hat{y}$ the predicted vector, and let $\theta$ be a permissible angle. In $y \cdot \hat{y} = \|y\|\,\|\hat{y}\| \cos\phi$, the angle $\phi$ between $y$ and $\hat{y}$ is obtained from the cosine similarity, and we say that $\hat{y}$ is a confident label if $\phi < \theta$. To simplify the formula, we use the unit vectors $y$ and $\hat{y}/\|\hat{y}\|$, with $\|y\| = 1$, so that $y \cdot \hat{y}/\|\hat{y}\| > \cos\theta$ is applied to select the candidates for disturbed labels in each batch. The range of the cosine similarity is $[-1, 1]$, and it equals 1 when $\hat{y}$ has the same direction as $y$. In addition, $y$ and $\hat{y}$ only have non-negative elements, so the maximum angle between them is $90^\circ$. In our setting, non-confident labels, whose angle falls outside the permissible range, are excluded from the candidates. For instance, Figure 1 shows that all outputs of a classification model with three classes lie in three-dimensional space, where the natural basis of the vector space consists of the one-hot vectors of the three true labels, $(1,0,0)$, $(0,1,0)$ and $(0,0,1)$. Therefore, $y$ and $\hat{y}$ lie in the same space, so the angle between them can be calculated.
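The selection rule can be sketched in a few lines. The sketch assumes that the prediction is a vector of non-negative class probabilities, that a label counts as confident when the angle between the prediction and the one-hot label is below a threshold, and that confident labels are then disturbed in the same way as in DL. The names `angle_to_true_label`, `directional_disturb` and `max_angle_deg` are hypothetical, and the exact threshold and sampling used by the authors may differ.

```python
import math
import random


def angle_to_true_label(probs, label):
    """Angle in degrees between the predicted vector and the one-hot label."""
    norm = math.sqrt(sum(p * p for p in probs))
    cosine = probs[label] / norm  # the one-hot vector already has unit length
    cosine = min(1.0, max(0.0, cosine))  # guard against rounding errors
    return math.degrees(math.acos(cosine))


def directional_disturb(labels, probs_batch, num_classes, alpha, max_angle_deg, rng=random):
    """Disturb only labels whose prediction lies within max_angle_deg of the label."""
    disturbed = []
    for label, probs in zip(labels, probs_batch):
        confident = angle_to_true_label(probs, label) < max_angle_deg
        if confident and rng.random() < alpha:
            disturbed.append(rng.randrange(num_classes))
        else:
            disturbed.append(label)
    return disturbed
```

Because the true label is one-hot, the cosine similarity reduces to the predicted probability of the true class divided by the norm of the prediction, so no full dot product is needed.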
Our hypothesis is that injecting noise into the target values regularizes the model through the oscillation of the target values, because measurements of continuous target values can contain errors caused by machine tolerance, human ability, measurement conditions, and so on. For example, when we measure the current temperature, two readings that differ by less than the tolerance of the thermometer can be regarded as the same, so a suitable amount of noise injected into the target values can have a regularization effect without changing the nature of the target values. We therefore apply the concept of DisturbLabel (DL) to regression tasks. However, the original DL can only be applied to classification problems, so we propose two disturb methods that extend the concept of DL to regression problems. Consider a mini-batch $\{(x_i, y_i)\}_{i=1}^{B}$ with predictions $\hat{y}_i = f(x_i; w)$, where $f$ is a regression model with model parameters $w$.
To prevent overfitting during training, DisturbValue (DV) adds Gaussian noise to some of the target values at random. Specifically, a target value $y_i$ is replaced by $y_i + \epsilon_i$ according to a noise rate $\alpha$: when $\alpha > 0$, noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ is added to randomly selected target values in each batch. The hyperparameters of DV, $\alpha$ and $\sigma$, have a large impact on the regularization performance, but tuning both for every dataset is cumbersome. To reduce the number of hyperparameters, we transform the domain of the target values into $[0, 1]$ with the MinMax scaler and fix $\sigma$ to a default value. The domain of the target values is then $[0, 1]$ regardless of the dataset, so the same $\sigma$ can be used for any dataset. As a result, only one hyperparameter, $\alpha$, has to be considered during training.
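The DV procedure described above can be sketched as follows, assuming zero-mean Gaussian noise and targets that are first scaled to $[0, 1]$. The function names and the value `sigma=0.05` used in the example are illustrative assumptions, not the default chosen by the authors.

```python
import random


def minmax_scale(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def disturb_value(targets, alpha, sigma, rng=random):
    """Add N(0, sigma^2) noise to each target independently with probability alpha."""
    return [
        y + rng.gauss(0.0, sigma) if rng.random() < alpha else y
        for y in targets
    ]


# Example: scale raw targets once, then disturb them anew for every mini-batch.
scaled_targets = minmax_scale([10.0, 20.0, 30.0])
noisy_targets = disturb_value(scaled_targets, alpha=0.3, sigma=0.05)
```

Scaling first is what allows a single noise scale to be reused: after scaling, a given `sigma` means the same relative perturbation on every dataset.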
DV adds noise to target values at random. In contrast, DisturbError (DE) adds Gaussian noise only to target values that satisfy $|y_i - \hat{y}_i| < e$, where $e$ is a residual boundary. We say that $\hat{y}_i$ is a highly confident value if $|y_i - \hat{y}_i| < e$ for a small constant $e$. When there are many highly confident values, it is very likely that the model is overfitted to the training data. To prevent overfitting, DE disturbs the error of highly confident values by redefining the error as $(y_i + \epsilon_i) - \hat{y}_i$. In principle both $e$ and $\sigma$ have to be controlled in DE, but with the same scaling and the same fixed $\sigma$ as in DV, only $e$ remains to be tuned.
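A minimal sketch of the DE rule, under the assumption that noise is injected exactly when the absolute residual is below the boundary `e`. The function name `disturb_error` and the example values are hypothetical.

```python
import random


def disturb_error(targets, predictions, e, sigma, rng=random):
    """Return per-sample errors, with Gaussian noise injected where |y - y_hat| < e."""
    errors = []
    for y, y_hat in zip(targets, predictions):
        if abs(y - y_hat) < e:
            # Highly confident prediction: disturb the target before taking the error.
            errors.append(y + rng.gauss(0.0, sigma) - y_hat)
        else:
            errors.append(y - y_hat)
    return errors


# Example: the first prediction is within the boundary, the second is not.
errors = disturb_error([0.5, 0.9], [0.5, 0.1], e=0.01, sigma=0.05)
```

Only the first error is perturbed here; predictions that are still far from their targets keep their ordinary residual.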
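We therefore search for the optimal hyperparameter of DV and DE for each dataset and compare the methods to other regularization techniques in Section 4. In this paper, the mean squared error loss is used, so that
$$L = \frac{1}{B} \sum_{i=1}^{B} \bigl(\hat{y}_i - (y_i + \epsilon_i)\bigr)^2$$
is the objective function with our disturb methods, where $B$ is the batch size, $y_i$ the target, $\hat{y}_i$ the prediction and $\epsilon_i$ the injected noise ($\epsilon_i = 0$ for targets that are not disturbed). Taking the gradient of $L$ with respect to the prediction,
$$\frac{\partial L}{\partial \hat{y}_i} = \frac{2}{B} \bigl(\hat{y}_i - y_i - \epsilon_i\bigr)$$
is derived. It shows that the noise controls the gradient with respect to the prediction values, which helps to prevent overfitting.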
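The effect of the noise on the gradient can be checked numerically. The sketch below assumes the mean squared error over a batch with disturbed targets $y_i + \epsilon_i$ and computes the gradient with respect to each prediction, which for this loss is $2(\hat{y}_i - y_i - \epsilon_i)/B$. The function names and numbers are illustrative.

```python
def mse_with_noise(predictions, targets, noise):
    """Mean squared error against the disturbed targets y_i + eps_i."""
    n = len(predictions)
    return sum((p - (y + eps)) ** 2 for p, y, eps in zip(predictions, targets, noise)) / n


def grad_wrt_predictions(predictions, targets, noise):
    """Analytic gradient of the loss: dL/dy_hat_i = 2 * (y_hat_i - y_i - eps_i) / B."""
    n = len(predictions)
    return [2.0 * (p - y - eps) / n for p, y, eps in zip(predictions, targets, noise)]


# The third prediction matches its target exactly, yet its gradient is not zero
# because the target was disturbed.
preds = [0.2, 0.7, 0.4]
targets = [0.25, 0.6, 0.4]
noise = [0.0, 0.05, -0.03]
grads = grad_wrt_predictions(preds, targets, noise)
```

A sample that has been fitted perfectly would normally stop contributing to the update; with a disturbed target it still produces a small gradient, which is how the noise keeps the model from settling exactly on the training targets.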
4.1 Classification Task
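Table 1: Characteristics of the classification datasets.
| | MNIST* | FMNIST | CIFAR10* | CIFAR100 | INTEL | ART |
|---|---|---|---|---|---|---|
| Size of images | 28×28 | 28×28 | 32×32 | 32×32 | 150×150 | 227×227** |
| Train/Test (%) | 86/14 | 86/14 | 83/17 | 83/17 | 82/18 | 86/14 |
* Datasets also used by the baseline paper. CIFAR100 is implemented differently from the baseline: we use all 100 classes and treat the dataset as an additional one.
** Input size fed to the network (original sizes vary).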
We report the test misclassification rate to compare the performance of the methods. Each dataset-method combination was run 5 times; the average value is reported together with the standard deviation.
Baselines. In our experiments we compare the performance of the proposed DDL method against the same baselines as the reference paper (including combinations of methods) and against the DL method itself: no regularization, dropout, DisturbLabel (DL), DisturbLabel (DL) + dropout, Directional DisturbLabel (DDL), and Directional DisturbLabel (DDL) + dropout.
Datasets. Hypothesizing that regularization effects should appear more clearly on 'easier' datasets, we conduct experiments on six collections of images of varying complexity: MNIST mnist2010, FMNIST xiao2017/online, CIFAR-10 CIFAR10, CIFAR-100 Krizhevsky2009LearningML, INTEL intel, and ART art. The characteristics of the datasets are summarized in Table 1.
Architecture. We use LeNet NIPS1989_53c3bce6 and ResNet18 resnet18. For the experiments on combinations of methods, ResNet18 is modified with an additional dropout layer after the average pooling step, before the input is fed to the last layer with the softmax loss function.
Training was performed with the SGD optimizer with momentum and a decaying learning rate. We start with a learning rate of 0.001 and reduce it by a factor of 0.1 after 40, 60 and 80 epochs. Each experiment was run for 100 epochs in total.
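The step schedule described above can be written as a small function. This is a sketch under the assumption that epochs are counted from 0 and that the reduced rate applies from the milestone epoch onwards; the function and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(40, 60, 80), factor=0.1):
    """Step schedule: multiply the rate by `factor` once for each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Learning rate for each of the 100 training epochs
schedule = [learning_rate(epoch) for epoch in range(100)]
```

Under these assumptions the rate stays at 0.001 for the first 40 epochs and is ten times smaller after each of the three milestones.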
Hyperparameters. The noise rate (the probability of a label being disturbed) was tuned for each dataset separately. The optimal values of the hyperparameter are summarized in Table 2. The dropout probability was set to 0.5 in all corresponding experiments.
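Table 3: Test misclassification rate (%) on LeNet, mean ± standard deviation over 5 runs.
| Method | MNIST | FMNIST | CIFAR10 | CIFAR100 | INTEL | ART |
|---|---|---|---|---|---|---|
| No reg.¹ | 0.86 ± 0.034 | 8.052 ± 0.112 | 24.82 ± 0.828 | 59.732 ± 1.590 | 15.614 ± 0.484 | 18.08 ± 0.597 |
| Dr.² | 0.658 ± 0.049 | 7.784 ± 0.167 | 22.58 ± 0.771 | 50.662 ± 0.828 | 13.48 ± 0.575 | 16.372 ± 0.651 |
| DL | 0.642 ± 0.052 | 7.748 ± 0.069 | 23.77 ± 0.310 | 58.758 ± 0.596 | 13.289 ± 0.270 | 16.28 ± 0.335 |
| DDL | 0.61 ± 0.041 | 7.789 ± 0.111 | 22.614 ± 0.553 | 56.444 ± 0.337 | 13.542 ± 0.193 | 16.687 ± 0.948 |
| Dr.+DL | 0.58 ± 0.044 | 7.816 ± 0.132 | 21.65 ± 0.320 | 50.87 ± 0.701 | 13.066 ± 0.339 | 15.578 ± 0.456 |
| Dr.+DDL | 0.658 ± 0.086 | 7.744 ± 0.210 | 21.766 ± 0.403 | 50.824 ± 1.487 | 12.366 ± 0.310 | 15.537 ± 0.498 |
¹ No reg. refers to no regularization.
² Dr. refers to dropout.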
Experimental Results. The results of the experiments on LeNet are summarized in Table 3. The DDL method (in combination with dropout) achieved the best result on half of the datasets and is among the top two results on all of them. Intel and Art are considered simpler datasets, which exhibit a higher degree of overfitting than the other, more complex datasets. MNIST is also known to be easily classified by modern networks; the results therefore support the hypothesis that the DDL method has a better regularization capacity and is useful on the datasets that need it most. The four newly tested datasets behave similarly, except for CIFAR100, which is the most challenging of them, having far more classes. We assume that memorizing the ground truth did not really take place for CIFAR100, which is why perturbing the labels even more did not help the training procedure. A more lightweight method, dropout, nevertheless still improved the result compared to no regularization at all.
The results of the experiments on ResNet18 are shown in Table 4 and confirm the results obtained for LeNet. For 5 out of the 6 tested datasets, DDL (either alone or in combination with dropout) achieved the best result compared to the baselines, and it is second best on the remaining one. FMNIST and CIFAR10 presumably do not require regularization as strong as the simplest datasets, MNIST, Intel and Art, do; the combination of DDL and dropout can therefore be excessive in this case, and DDL alone obtains the best result. Even with the deeper network, CIFAR100 still does not suffer from overfitting to the extent that the other datasets do, and it does not profit from strong regularization.
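Table 4: Test misclassification rate (%) on ResNet18, mean ± standard deviation over 5 runs.
| Method | MNIST | FMNIST | CIFAR10 | CIFAR100 | INTEL | ART |
|---|---|---|---|---|---|---|
| No reg.¹ | 0.668 ± 0.0618 | 8.354 ± 0.072 | 8.145 ± 0.430 | 30.557 ± 0.466 | 6.237 ± 0.269 | 2.927 ± 0.26 |
| Dr.² | 0.601 ± 0.045 | 8.608 ± 0.179 | 7.813 ± 0.170 | 28.363 ± 0.500 | 6.531 ± 0.333 | 2.837 ± 0.348 |
| DL | 0.612 ± 0.1567 | 8.244 ± 0.274 | 7.99 ± 0.444 | 28.048 ± 1.064 | 6.004 ± 0.123 | 2.909 ± 0.416 |
| DDL | 0.558 ± 0.0507 | 8.228 ± 0.107 | 7.667 ± 0.096 | 28.236 ± 1.209 | 6.271 ± 0.207 | 2.891 ± 0.090 |
| Dr.+DL | 0.543 ± 0.039 | 8.480 ± 0.378 | 7.841 ± 0.198 | 28.123 ± 0.540 | 6.231 ± 0.346 | 3.035 ± 0.0342 |
| Dr.+DDL | 0.532 ± 0.054 | 8.475 ± 0.191 | 8.111 ± 0.151 | 28.115 ± 1.186 | 5.917 ± 0.166 | 2.728 ± 0.206 |
¹ No reg. refers to no regularization.
² Dr. refers to dropout.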
4.2 Regression Task
Evaluation metric. We use the root-mean-square error (RMSE) HYNDMAN2006679. For every experiment and every dataset, we average the RMSE over 20 runs and measure the standard deviation.
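For reference, the metric and its aggregation over repeated runs can be sketched as below. Whether the sample or the population standard deviation was used is not stated, so the use of the sample standard deviation (`statistics.stdev`) is an assumption, as are the function names and the example scores.

```python
import math
import statistics


def rmse(targets, predictions):
    """Root-mean-square error between targets and predictions."""
    squared = [(y - p) ** 2 for y, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def summarize(run_scores):
    """Mean and sample standard deviation of the scores from repeated runs."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)


# Example with three hypothetical runs
mean_rmse, std_rmse = summarize([1.0, 2.0, 3.0])
```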
Baselines. For all eight datasets, we compare the results of our methods with baselines that, to the best of our knowledge, represent the state of the art in neural network regularization, and with combinations of our methods and the baselines: $L_2$ regularization, dropout, DV with Gaussian noise, DV with Laplacian noise, DV with cosine annealing inproceedings, DE, DV + dropout, DV + $L_2$, and DV + DE.
Datasets. We use eight datasets of different sizes and complexity for our evaluations: Boston House Prices (BHP) scikit-learn, Bike Sharing (BS) bike, Air Quality (AQ) DEVITO2008750, Make-sklearn (MS) scikit-learn, Housing Price (HP) de2011ames, Superconductivity (SC) HAMIDIEH2018346, Communities and Crime (CC) Dua:2019, and Appliances Energy Prediction (AEP) CANDANEDO201781. The characteristics of the datasets are summarized in Table 5. In addition, MinMax scaling is used to bring the features of the data into a fixed range.
We implement our methods using a neural network with two hidden layers and the ReLU activation function.
Optimizer. We use the Adam optimizer kingma2017adam with a learning rate of 0.001 in all experiments.
Hyperparameters. For each dataset, a grid search is used to find the best values of the hyperparameters:
- Penalty for $L_2$ regularization,
- Drop rate (%) for dropout,
- Disturb rate (%) for DisturbValue,
- Residual ($e$) for DisturbError.
Experimental Results. To tune the hyperparameters, all datasets were split into a training set (50%) and a test set (50%). For each hyperparameter value we fit our models on the training set and then evaluate the metric on the test set. For every run, the data of each dataset were shuffled before the split. Table 6 shows the results of the experiments: the DV approach, or a combination of DV with DE, cosine annealing, dropout or $L_2$, outperforms the baselines on all eight datasets. We were also interested in whether the most suitable approach depends on data size and data complexity. To check this, we ordered Table 6 from the dataset with the lowest complexity (smallest number of features) to the dataset with the highest complexity. This analysis revealed no such dependency.
We also analyze the standard deviations for every method to make sure that the proposed approaches (DV and DE) are robust. The standard deviations of the model without regularization and of the models with regularization do not differ significantly.
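The shuffle-then-split step used for every run in the protocol above can be sketched as follows. The function name `shuffled_half_split` and the row representation are assumptions for this example; the split is half and half, as described in the text.

```python
import random


def shuffled_half_split(rows, rng=random):
    """Shuffle the rows and split them into two halves (training and test)."""
    rows = list(rows)  # copy so the caller's data is left untouched
    rng.shuffle(rows)
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]


# Example: a fresh split for each run, seeded here only for reproducibility
train_rows, test_rows = shuffled_half_split(range(10), random.Random(0))
```

Repeating this with a different shuffle for each run is what makes the reported standard deviations reflect variation in the split as well as in training.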
In this paper, we have extended one of the modern regularization methods, DisturbLabel (DL), by proposing an improved label-disturbing procedure for the classification task and by carrying the idea of useful ground-truth disturbance over to the regression domain, which was not covered by the baseline paper.
The problem of overfitting can be interpreted as a neural network memorizing the ground truth. Our extension for the classification task is based on the hypothesis that confident predictions often come from this memorization of the ground-truth labels; that is why it makes more sense to penalize (in our case, randomly disturb) a share of them and to leave the training procedure unperturbed for predictions that are naturally unconfident. We proposed using cosine similarity to measure the distance between the ground-truth vector and the vector of predicted class probabilities in classification tasks. We tested this method on six datasets and two architectures and showed that DDL brings an improvement over DL (alone or in combination with dropout).
We have also shown that extending the DL method to the regression domain can improve the performance of the model and help to avoid overfitting. We presented two methods: (1) DisturbValue (DV), which injects Gaussian noise into target values at random, and (2) DisturbError (DE), which injects Gaussian noise into target values if the prediction is close to the target value, in other words, if the difference between the prediction and the target value is smaller than a small constant. Our proposed DV method, alone or in combination with other approaches (DE, $L_2$, dropout, cosine annealing), outperformed well-known baselines such as dropout. The experiments were conducted on eight datasets of different sizes and complexity.
References
- (1) Xie, Lingxi and Wang, Jingdong and Wei, Zhen and Wang, Meng and Tian, Qi, (2016). Disturblabel: Regularizing cnn on the loss layer, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, 4753–4762.
- (2) Li Wan and Matthew Zeiler and Sixin Zhang and Yann Le Cun and Rob Fergus, (2013). Regularization of Neural Networks using DropConnect, Proceedings of the 30th International Conference on Machine Learning, 1058–1066, Vol 28.
- (3) Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoffrey E, (2012). Imagenet classification with deep convolutional neural networks, Advances in neural information processing systems, 1097–1105.
- (4) Srivastava, Nitish and Hinton, Geoffrey and Krizhevsky, Alex and Sutskever, Ilya and Salakhutdinov, Ruslan, (2014). Dropout: a simple way to prevent neural networks from overfitting, The journal of machine learning research 2014, Vol 15, 1929–1958.
- (5) Pereyra, Gabriel and Tucker, George and Chorowski, Jan and Kaiser, Lukasz and Hinton, Geoffrey, (2017). Regularizing neural networks by penalizing confident output distributions, arXiv preprint arXiv:1701.06548.
- (6) Diederik P. Kingma and Jimmy Ba, (2017). Adam: A Method for Stochastic Optimization, arXiv:1412.6980.
- (7) Rob J. Hyndman and Anne B. Koehler, (2006). Another look at measures of forecast accuracy, International Journal of Forecasting, Vol 22, 679–688.
- (8) Pedregosa, F. and Varoquaux, G., Gramfort, A., Michel, V.,Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E., (2011). Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, Vol 12, 2825–2830.
- (9) Fanaee-T, Hadi and Gama, Joao, (2013). Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence, Springer Berlin Heidelberg, 1–15.
- (10) De Vito, S. and Massera, E. and Piga, M. and Martinotto, L. and Di Francia, G., (2008). On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario, Sensors and Actuators B: Chemical, Vol 129, 750–757.
- (11) De Cock, Dean, (2011). Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project, Journal of Statistics Education, Vol 19.
- (12) Kam Hamidieh, (2018). A data-driven statistical model for predicting the critical temperature of a superconductor, Computational Materials Science, Vol 154, 346–354
- (13) Dua, Dheeru and Graff, Casey, (2017). UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, http://archive.ics.uci.edu/ml
- (14) Luis M. Candanedo and Véronique Feldheim and Dominique Deramaix, (2017). Data driven prediction models of energy use of appliances in a low-energy house, Energy and Buildings, Vol 140, 81–97.
- (15) Geoffrey E. Hinton and Nitish Srivastava and Alex Krizhevsky and Ilya Sutskever and Ruslan R. Salakhutdinov, (2012). Improving neural networks by preventing co-adaptation of feature detectors, arXiv:1207.0580.
- (16) A. Krizhevsky, (2009). Learning Multiple Layers of Features from Tiny Images, http://www.cs.toronto.edu/%7Ekriz/learning-features-2009-TR.pdf
- (17) Loshchilov, Ilya and Hutter, Frank, (2016). SGDR: Stochastic Gradient Descent with Warm Restarts.
- (18) Christian Szegedy and Vincent Vanhoucke and Sergey Ioffe and Jonathon Shlens and Zbigniew Wojna, (2015). Rethinking the Inception Architecture for Computer Vision.
- (19) Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun, (2015). Deep Residual Learning for Image Recognition, cs.CV arXiv:1512.03385.
- (20) LeCun, Yann and Boser, Bernhard and Denker, John and Henderson, Donnie and Howard, R. and Hubbard, Wayne and Jackel, Lawrence, (1990), Handwritten Digit Recognition with a Back-Propagation Network, Advances in Neural Information Processing Systems, Vol 2.
- (21) Copeland, B. Jack., (2004). The Essential Turing: Seminal Writings in Computing, Logic, Philosophy, Artificial Intelligence, and Artificial Life plus The Secrets of Enigma, Oxford University Press.
- (22) Alae Chouiekh and EL Hassane Ibn EL Haj, (2018). ConvNets for Fraud Detection analysis, Proceedings of the first international conference on intelligent computing in data sciences ICDS2017, Vol 127, 133–138.
- (23) M. J. Shafiee, A. Jeddi, A. Nazemi, P. Fieguth, and A. Wong, (2021). Deep Neural Network Perception Models and Robust Autonomous Driving Systems: Practical Solutions for Mitigation and Improvement, IEEE Signal Processing Magazine, Vol 38, 22–30.
- (24) LeCun, Yann and Cortes, Corinna, (2010). MNIST handwritten digit database, http://yann.lecun.com/exdb/mnist/
- (25) Alex Krizhevsky and Vinod Nair and Geoffrey Hinton, (2010). CIFAR-10 (Canadian Institute for Advanced Research), http://www.cs.toronto.edu/~kriz/cifar.html
- (26) Imani, Ehsan and White, Martha, (2018). Improving regression performance with distributional losses, International Conference on Machine Learning, PMLR 2018, 2157–2166.
- (27) Poole, Ben and Sohl-Dickstein, Jascha and Ganguli, Surya, (2014). Analyzing noise in autoencoders and deep networks, arXiv preprint arXiv:1406.1831.
- (28) Hochreiter, Sepp and Schmidhuber, Jürgen and others, (1995). Simplifying neural nets by discovering flat minima, Advances in neural information processing systems, 529–536.
- (29) Moradi, Reza and Berangi, Reza and Minaei, Behrouz, (2020). A survey of regularization strategies for deep models, Artificial Intelligence Review, Vol 53, 3947–3986.
- (30) Zhang, Hongyi and Cisse, Moustapha and Dauphin, Yann N and Lopez-Paz, David, (2017). Mixup: Beyond empirical risk minimization, arXiv preprint arXiv:1710.09412.
- (31) Li, Weizhi and Dasarathy, Gautam and Berisha, Visar, (2020). Regularization via structural label smoothing, International Conference on Artificial Intelligence and Statistics, 1453–1463.
- (32) Joo, Taejong and Chung, Uijung and Seo, Min-Gwan, (2020). Being Bayesian about categorical probability, International Conference on Machine Learning, 4950–4961.
- (33) Danil, (2018). Art Images: Drawing/Painting/Sculptures/Engravings, https://www.kaggle.com/thedownhill/art-images-drawings-painting-sculpture-engraving
- (34) Puneet Bansal, (2019). Intel Image Classification, https://www.kaggle.com/puneet6060/intel-image-classification
- (35) Han Xiao and Kashif Rasul and Roland Vollgraf, (2017). Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, cs.LG arXiv:1708.07747
- (36) Fanaee-T, Hadi and Gama, Joao, (2013). Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence, 1–15.
- (37) Wang, Chuan and Principe, Jose C, (1999). Training neural networks with additive noise in the desired signal, IEEE Transactions on Neural Networks, Vol 10, 1511–1517.