Disturbing Target Values for Neural Network Regularization

10/11/2021
by   Yongho Kim, et al.

Diverse regularization techniques, such as L2 regularization, Dropout, and DisturbLabel (DL), have been developed to prevent overfitting. DL, a newcomer on the scene, regularizes the loss layer by flipping a small share of the target labels at random and training the neural network on this distorted data so that it does not memorize the training data. It has been observed that high-confidence labels during training cause overfitting, yet DL selects the labels to disturb at random, regardless of their confidence. To address this shortcoming of DL, we propose Directional DisturbLabel (DDL), a novel regularization technique that uses the class probabilities to infer the confident labels and uses these labels to regularize the model. This active regularization exploits the model's behavior during training to regularize it in a more directed manner. To address regression problems, we also propose DisturbValue (DV) and DisturbError (DE). DE disturbs only the target values that are predefined as confident. DV injects noise into a randomly chosen portion of the target values, similar to DL. In this paper, 6 and 8 datasets are used to validate the robustness of our methods in classification and regression tasks, respectively. Finally, we demonstrate that our methods are either comparable to or outperform DisturbLabel, L2 regularization, and Dropout. In addition, we achieve the best performance on more than half of the datasets by combining our methods with either L2 regularization or Dropout.


1 Introduction

Training computers to learn and think like humans and to produce reliable results has been a topical study in Machine Learning (ML). Although there are several types of ML, for the sake of simplicity we focus only on supervised learning, in which we train a model on a known dataset and let the trained model predict new, unseen data. Supervised learning covers two tasks, classification and regression. They share the concept of mapping inputs to outputs but differ in the form of the targets: for classification the output is discrete or categorical, while for regression it is continuous (real-valued). In an attempt to create machines that reason like humans, mathematical models mimicking the human brain were introduced and have become commonly known as Neural Networks (NN). The idea of the NN was first introduced in 1948 by Alan Turing copeland04turing and has been developed further into a series of algorithms built on neuron-like networks. These networks are used to solve problems similar to, but more complex than, those addressed by other ML algorithms, such as credit card fraud detection in the financial sector CHOUIEKH2018133 and automated driving systems in the automotive industry 9307303. Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), and Convolutional Neural Networks are some examples of these neuron-like algorithms. Deeper networks yield higher accuracy and a greater capacity for data representation. However, they come with a large number of parameters, which makes the models prone to overfitting, that is, failing to generalize to unseen data. This phenomenon is likely to happen when the model is very complex while the training data is insufficient. Generally, deeper networks are deployed, and they have far more parameters than LeNet NIPS1989_53c3bce6 does. Overfitting conflicts with the objective of any ML algorithm, which is for the trained model to perform well not only on the training data but also on unseen data. To avoid it, regularization is brought into play.

In recent years, several regularization techniques have been developed that can be applied to various parts of the network. Some techniques are applied to the weights, such as DropConnect pmlr-v28-wan13, which drops the weights between connected nodes, leaving the connected layer sparse while the nodes can still be active. L2 regularization, or weight decay Krizhevsky2009LearningML, adds a penalty term weighted by a hyperparameter to the error function, which decays the weights toward zero. Data augmentation krizhevsky2012imagenet is used at the input layer to perform image transformations such as flipping, zooming, shifting, and cropping. Dropout srivastava2014dropout can be implemented, for instance, at the hidden layers to reduce the dependence between neurons by randomly dropping nodes from the network. Penalizing confident outputs pereyra2017regularizing is done at the output layer by adding the negative entropy to the negative log-likelihood during training. The last example, and the most recent idea, concerns the loss layer: DisturbLabel (DL) xie2016disturblabel attacks the loss layer by randomly flipping the ground truth according to a shared noise-rate hyperparameter. Among the various regularization methods, DL is claimed to be the first algorithm that attacks the loss layer for classification tasks. Its simplicity of implementation, together with compelling results comparable to widely known techniques such as dropout, convinced us to explore it and develop new ideas derived from this concept. In this paper, we improve the regularization technique on the loss layer for classification tasks and extend the idea to regression tasks.

For classification, we propose Directional DisturbLabel (DDL), a method that systematically selects which labels to disturb by excluding non-confident labels from the candidates based on cosine similarity. We show that this improvement reduces the misclassification rate compared to the baseline and performs better with a deeper network. Additionally, the experimental results reaffirm that DDL can be combined with other regularization methods without difficulty. For regression, we propose DisturbValue (DV) and DisturbError (DE), two novel methods developed by applying the disturbing procedure of DL to regression tasks. The experimental results show the effectiveness of our methods. Our code is available from the first author's GitHub (https://github.com/kimy-de/DisturbMethods).

Our main contributions include:

  1. Propose model-agnostic noisy regularization methods.

  2. Improve DisturbLabel by filtering non-confident labels from the candidates of disturb labels.

  3. Demonstrate the robustness of our methods using 14 datasets.

2 Related Work

Regularization methods vary widely in the part of the neural network they target: regularization can be imposed on the weights, the hidden nodes, or the input nodes of the network, but regularizing within the loss layer is relatively new. Xie et al. xie2016disturblabel introduced DisturbLabel (DL), which is the first work investigating this area for convolutional neural networks.

The closest method to DisturbLabel is label smoothing szegedy2015rethinking, which also perturbs the ground-truth labels by softening them into a vector of probabilities of belonging to each class in the task, while DisturbLabel flips the label fully. As the authors point out, the main advantage of DisturbLabel is that it is stochastic, whereas soft labeling is deterministic and therefore cannot provide regularization as strong as that of DisturbLabel.

Joo et al. joo2020being also perturb ground-truth labels similarly to label smoothing but use a Bayesian approach: instead of manipulating the labels directly, they treat the ground truth as a random variable with a categorical distribution over class labels rather than as given by the training label.

Among recent work on manipulating labels, the mixup method zhang2017mixup is one of the most prominent. It trains a neural network on convex combinations of pairs of examples and their labels. Li et al. li2020regularization combine label smoothing with the hypothesis that more confident predictions require stronger regularization (we also exploit this hypothesis in our work). They perform regularization via structural label smoothing, imposing different smoothing strengths on clusters of data lying in different parts of the feature space.

Noisy regularization for regression tasks is a well-known technique, but as pointed out in a survey paper moradi2020survey, the noise is added either to the inputs (poole2014analyzing) or to the weights (hochreiter1995simplifying). To the best of our knowledge, adding noise to the targets as a form of regularization has not been attempted before. Imani et al. imani2018improving did add Gaussian noise to targets, but their experiments differ from our proposed DV approach in that (a) it was done as a form of augmentation and (b) noise was added to all targets.

Adding Gaussian noise to target values was also explored in wang1999training, but our works complement rather than repeat each other. First, this prior work concentrates on the convergence properties of the network with noise added to the desired signal and does not investigate regularization effects. Second, they add noise to target values and use an annealing schedule of the step size to control the variance of the noise, since the noise affects the weight updates. In contrast, we control the number of noisy values through the noise rate instead of adjusting the step size.

3 Methodology

3.1 Classification

DisturbLabel (DL) xie2016disturblabel regularizes the loss layer by replacing some ground-truth labels with incorrect labels in each iteration. DL selects substitutes at random to generate disturb labels based on a noise rate $\alpha$ that determines the share of labels to disturb. When $\alpha = 0$, batch training proceeds without any change. When $\alpha > 0$, each selected true label is converted into a disturb label drawn from a Multinoulli distribution with the following probabilities:

$$p_c = 1 - \frac{C-1}{C}\,\alpha, \qquad p_i = \frac{1}{C}\,\alpha \quad (i \neq c) \tag{1}$$

where $p_c$ and $p_i$ replace 1 and 0, respectively, in the one-hot vector, $c$ is the index of the true class, and $C$ is the number of classes. DL does not consider the confidence of labels when generating disturb labels.
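The label-disturbing step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the Multinoulli probabilities of the original DisturbLabel paper (the true class keeps probability 1 - (C - 1)/C * alpha and every other class gets alpha/C), and the function name `disturb_labels` and its arguments are hypothetical.

```python
import numpy as np

def disturb_labels(labels, num_classes, alpha, rng):
    """Redraw each label from the assumed DL Multinoulli distribution.

    labels: 1-D array of integer class indices.
    alpha:  noise rate in [0, 1]; alpha = 0 leaves the labels unchanged.
    """
    labels = np.asarray(labels)
    n = labels.size
    # Every class starts with probability alpha / C ...
    probs = np.full((n, num_classes), alpha / num_classes)
    # ... and the true class receives the remaining mass.
    probs[np.arange(n), labels] = 1.0 - (num_classes - 1) / num_classes * alpha
    # Inverse-CDF sampling: pick the first class whose cumulative mass exceeds u.
    cdf = np.cumsum(probs, axis=1)
    cdf[:, -1] = 1.0  # guard against floating-point round-off
    u = rng.random(n)[:, None]
    return (u < cdf).argmax(axis=1)

rng = np.random.default_rng(0)
batch_labels = np.array([3, 1, 4, 1, 5])
print(disturb_labels(batch_labels, num_classes=10, alpha=0.2, rng=rng))
```

Under this assumed distribution, a label actually changes with probability (C - 1)/C * alpha, so the effective flip rate is slightly below the nominal noise rate.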

Figure 1: Schematic representation of prediction vectors

Our hypothesis is that highly confident labels cause the overfitting problem; hence our method, Directional DisturbLabel (DDL), considers only confident labels as candidates to be disturbed. Specifically, we define a confident label as a true label $y$ whose angle to the prediction vector $\hat{y}$ is at most a permissible angle. In the $C$-dimensional output space, the angle $\theta$ between $y$ and $\hat{y}$ is calculated from the cosine similarity, $\cos\theta = \frac{y \cdot \hat{y}}{\lVert y \rVert \lVert \hat{y} \rVert}$, and we say that $y$ is a confident label if $\theta$ does not exceed the permissible angle. To simplify the formula, we use the unit vectors of $y$ and $\hat{y}$, so that the cosine similarity reduces to their dot product, which is applied to select substitutes for disturb labels in each batch. The range of the cosine similarity is $[0, 1]$, and it equals 1 when $\hat{y}$ has the same direction as $y$. Moreover, $y$ and $\hat{y}$ always have non-negative elements, so the maximum angle between them is $\pi/2$. In our setting, non-confident labels are excluded from the candidates when their angle falls in the non-confident interval. For instance, Figure 1 shows that all the outputs of a classification model with three classes lie in three-dimensional space, and the natural basis of this vector space consists of the one-hot vectors of the three true labels, $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$. Therefore, $y$ and $\hat{y}$ lie in the same space, so the angle between them can be calculated.
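The selection of confident labels can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the threshold `max_angle_deg`, the function names, and the choice to redraw a selected label uniformly over all classes with probability `alpha` (which reproduces the DisturbLabel distribution on the selected labels) are all hypothetical. Because the true label is a one-hot vector, its cosine similarity with the prediction reduces to the predicted probability of the true class divided by the norm of the prediction.

```python
import numpy as np

def confident_mask(probs, labels, max_angle_deg):
    """Mark samples whose prediction lies within max_angle_deg of the true one-hot label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # cos(angle) between a one-hot vector and probs is p_true / ||probs||.
    p_true = probs[np.arange(len(labels)), labels]
    cos_sim = p_true / np.linalg.norm(probs, axis=1)
    return cos_sim >= np.cos(np.deg2rad(max_angle_deg))

def directional_disturb(probs, labels, num_classes, alpha, max_angle_deg, rng):
    """Disturb only confident labels; non-confident labels are left untouched."""
    labels = np.asarray(labels).copy()
    candidates = confident_mask(probs, labels, max_angle_deg)
    # A candidate is redrawn uniformly over all classes with probability alpha.
    redraw = candidates & (rng.random(labels.size) < alpha)
    labels[redraw] = rng.integers(0, num_classes, size=int(redraw.sum()))
    return labels

rng = np.random.default_rng(0)
probs = np.array([[0.98, 0.01, 0.01], [0.40, 0.30, 0.30]])
labels = np.array([0, 0])
print(confident_mask(probs, labels, max_angle_deg=20.0))
print(directional_disturb(probs, labels, 3, alpha=0.5, max_angle_deg=20.0, rng=rng))
```

In this example only the first sample is a candidate: its prediction is almost aligned with the true one-hot vector, while the second prediction is spread over the classes and is therefore excluded.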

3.2 Regression

Our hypothesis is that injecting noise into target values regularizes the classifier layer through the oscillation of the target values, because measurements of continuous target values can contain errors caused by machine tolerance, human ability, measurement conditions, and so on. For example, when we measure the current temperature, two slightly different readings in °C can be regarded as the same, depending on the tolerance of the thermometer, so proper noise injection into target values can produce a regularization effect without changing the character of the target values. Thus, we use the concept of DisturbLabel (DL) for regression tasks. However, the original DL can be applied only to classification problems, so we propose two disturb methods that extend the concept of DL to regression problems. Consider a mini-batch $\{(x_i, y_i)\}_{i=1}^{N}$ and the predictions $\hat{y}_i = f(x_i; w)$, where $f$ is a regression model with model parameters $w$.

Figure 2: Illustration of DisturbValue (left) and DisturbError (right)

To prevent overfitting during training, DisturbValue (DV) adds Gaussian noise to some of the target values at random. Specifically, a target value $y_i$ is replaced by $y_i + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, based on a noise rate $\alpha$. When $\alpha > 0$, the noise $\epsilon_i$ is added to randomly chosen target values in each batch. The hyperparameters of DV, $\alpha$ and $\sigma$, have a large impact on the regularization performance, but tuning both of them for each dataset is cumbersome. To reduce the number of hyperparameters, we map the target values into $[0, 1]$ with the MinMax scaler and fix $\sigma$ to a default value. The domain of the target values is then $[0, 1]$ regardless of the dataset, so the same $\sigma$ can be used for every dataset. As a result, only one hyperparameter, $\alpha$, has to be considered during training.
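The DisturbValue step described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the names `disturb_value`, `alpha` (noise rate), and `sigma` (noise standard deviation) are assumptions, the value 0.05 is purely illustrative, and each target is selected independently with probability `alpha`, which is one of several plausible ways to disturb "a portion" of a batch.

```python
import numpy as np

def disturb_value(targets, alpha, sigma, rng):
    """Add zero-mean Gaussian noise to a random share (about alpha) of the targets."""
    targets = np.asarray(targets, dtype=float)
    # Select each target independently with probability alpha.
    mask = rng.random(targets.shape) < alpha
    noise = rng.normal(0.0, sigma, size=targets.shape)
    # Selected targets receive noise; all others are returned unchanged.
    return np.where(mask, targets + noise, targets)

rng = np.random.default_rng(0)
y_batch = np.array([0.10, 0.35, 0.50, 0.80])  # targets already scaled to [0, 1]
print(disturb_value(y_batch, alpha=0.5, sigma=0.05, rng=rng))
```

The noisy targets would be regenerated for every batch and used only in the training loss; the stored targets and the evaluation targets stay untouched.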

DV adds noise to target values at random. In contrast, DisturbError (DE) adds Gaussian noise only to target values that satisfy $|y_i - \hat{y}_i| < e$, where $e$ is a residual boundary. We say that $y_i$ is a highly confident value if there exists a small constant $e$ such that $|y_i - \hat{y}_i| < e$. When there are many highly confident values, it is highly likely that the model is overfitted to the training data. To prevent overfitting, DE disturbs the error of highly confident values by redefining the error with the noise term included. In DE, we would originally have to control both $e$ and $\sigma$; however, with the same scaling and the same fixed $\sigma$ as in DV, only $e$ has to be considered.
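DisturbError differs from DisturbValue only in how the disturbed targets are chosen, which the following sketch tries to make concrete. It is an illustration under assumptions rather than the authors' code: the names `disturb_error`, `e`, and `sigma` are hypothetical, and the noise is applied to the target, which changes the residual by the same amount.

```python
import numpy as np

def disturb_error(preds, targets, e, sigma, rng):
    """Add Gaussian noise only to targets whose absolute residual is below e."""
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # A target counts as highly confident when the prediction is already close to it.
    confident = np.abs(preds - targets) < e
    noise = rng.normal(0.0, sigma, size=targets.shape)
    return np.where(confident, targets + noise, targets)

rng = np.random.default_rng(0)
preds = np.array([0.50, 0.90])
targets = np.array([0.51, 0.20])
# Only the first target is disturbed: its residual (0.01) is below e = 0.05.
print(disturb_error(preds, targets, e=0.05, sigma=0.05, rng=rng))
```

Because the selection depends on the current predictions, the set of disturbed targets changes as training progresses, unlike the purely random selection in DisturbValue.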

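Therefore, we find an optimal hyperparameter of DV and DE for each dataset and compare them with other regularization techniques in Section 4. In this paper, the mean squared error loss $L = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$ is used, so that $\tilde{L} = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i - \epsilon_i)^2$ is defined as the objective function with our disturb methods, where $y_i$ is a target value, $\hat{y}_i$ is its prediction, and $\epsilon_i$ is the injected noise (zero for undisturbed targets). Considering the gradient of $\tilde{L}$ with respect to a prediction,

$$\frac{\partial \tilde{L}}{\partial \hat{y}_i} = \frac{2}{N}\,(\hat{y}_i - y_i - \epsilon_i) \tag{2}$$

is derived. It shows that the noise controls the gradient of the prediction values to prevent overfitting.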

4 Experiments

4.1 Classification Task

| Dataset | MNIST* | FMNIST | CIFAR10* | CIFAR100 | Intel | Art |
|---|---|---|---|---|---|---|
| # classes | 10 | 10 | 10 | 100 | 6 | 5 |
| Size of images | 28×28 | 28×28 | 32×32 | 32×32 | 150×150 | 227×227** |
| # channels | 1 | 1 | 3 | 3 | 3 | 3 |
| # instances | 70K | 70K | 60K | 60K | 17K | 9K |
| Train/Test (%) | 86/14 | 86/14 | 83/17 | 83/17 | 82/18 | 86/14 |

  • * Datasets also used by the baseline paper. CIFAR100 is implemented differently from the baseline: we use all 100 classes and treat the dataset as an additional one.

  • ** Input size fed to the network (original sizes vary).

Table 1: Description of the datasets for classification task

Evaluation metric. We report the test misclassification rate to compare the performance of the methods. Experiments for each combination of dataset and method were run 5 times; the average value is reported together with the standard deviation.

Baselines. In our experiments, we compare the performance of the proposed DDL method against the same baselines as the reference paper (including combinations of the methods) and against the DL method itself: no regularization, dropout, DisturbLabel (DL), DisturbLabel (DL) + dropout, Directional DisturbLabel (DDL), and Directional DisturbLabel (DDL) + dropout.

Datasets. Hypothesizing that regularization effects should appear more clearly on 'easier' datasets, we conduct experiments on six image collections of varying complexity: MNIST mnist2010, FMNIST xiao2017/online, CIFAR-10 CIFAR10, CIFAR-100 Krizhevsky2009LearningML, INTEL intel, and ART art. The characteristics of the datasets are summarized in Table 1.

Architecture. LeNet NIPS1989_53c3bce6 is modified in accordance with the baseline paper (two convolution units for the MNIST and FMNIST datasets and three convolution units for CIFAR10, CIFAR100, ART, and INTEL, each followed by ReLU and max pooling).

ResNet18 resnet18 is modified with an additional dropout after the average pooling step, before the input is fed to the last layer with the softmax loss function, for our experiments on combining the methods.

Optimizer. Training was performed using the SGD optimizer with momentum and a decaying learning rate. We start with a learning rate of 0.001 and reduce it by a factor of 0.1 after 40, 60, and 80 epochs. Each experiment was run for 100 epochs in total.
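The step schedule can be written down as a small function. This is a sketch of the schedule as described in the text, with one assumption: epochs are counted from zero, so the first reduced rate applies from epoch 40 onward. The function name and argument names are hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(40, 60, 80), gamma=0.1):
    """Return the learning rate for a zero-based epoch under a step decay schedule."""
    # Count how many milestones have been reached and apply one decay per milestone.
    drops = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** drops

for epoch in (0, 39, 40, 60, 80, 99):
    print(epoch, learning_rate(epoch))
```

With these settings the rate takes four values over the 100 epochs: the base rate, then one tenth, one hundredth, and one thousandth of it.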

Hyperparameters. The noise rate (the probability of a label being disturbed) was tuned for each dataset separately. The optimal values used are summarized in Table 2. The dropout probability was set to 0.5 in all corresponding experiments.

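| Architecture | Method | MNIST | FMNIST | CIFAR10 | CIFAR100 | Intel | Art |
|---|---|---|---|---|---|---|---|
| LeNet | DL | 10 | 5 | 10 | 40 | 50 | 50 |
| LeNet | DDL | 10 | 5 | 50 | 30 | 50 | 20 |
| ResNet18 | DL | 20 | 5 | 10 | 10 | 5 | 10 |
| ResNet18 | DDL | 20 | 5 | 10 | 10 | 5 | 10 |

Table 2: Optimal values of the noise-rate hyperparameter (%)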
| Method | MNIST | FMNIST | CIFAR10 | CIFAR100 | Intel | Art |
|---|---|---|---|---|---|---|
| # classes | 10 | 10 | 10 | 100 | 6 | 5 |
| No reg. | 0.86 ± 0.034 | 8.052 ± 0.112 | 24.82 ± 0.828 | 59.732 ± 1.590 | 15.614 ± 0.484 | 18.08 ± 0.597 |
| Dr. | 0.658 ± 0.049 | 7.784 ± 0.167 | 22.58 ± 0.771 | 50.662 ± 0.828 | 13.48 ± 0.575 | 16.372 ± 0.651 |
| DL | 0.642 ± 0.052 | 7.748 ± 0.069 | 23.77 ± 0.310 | 58.758 ± 0.596 | 13.289 ± 0.270 | 16.28 ± 0.335 |
| DDL | 0.61 ± 0.041 | 7.789 ± 0.111 | 22.614 ± 0.553 | 56.444 ± 0.337 | 13.542 ± 0.193 | 16.687 ± 0.948 |
| Dr.+DL | 0.58 ± 0.044 | 7.816 ± 0.132 | 21.65 ± 0.320 | 50.87 ± 0.701 | 13.066 ± 0.339 | 15.578 ± 0.456 |
| Dr.+DDL | 0.658 ± 0.086 | 7.744 ± 0.210 | 21.766 ± 0.403 | 50.824 ± 1.487 | 12.366 ± 0.310 | 15.537 ± 0.498 |

  • No reg. refers to no regularization.

  • Dr. refers to dropout.

Table 3: LeNet experimental results (average misclassification rate ± standard deviation): DDL demonstrates good cooperation with Dropout and brings the best results for the datasets suffering most from overfitting (less complex datasets). Results are reported as first and second best.

Experimental Results. The results of the experiments on LeNet are summarized in Table 3. DDL in combination with Dropout achieved the best results for half of the datasets, and DDL (alone or with Dropout) is among the top two results for all of them. The Intel and Art datasets are considered simpler datasets, which exhibit a higher degree of overfitting than the other, more complex datasets. MNIST is also known for being easily classified by modern networks; therefore, the results support the hypothesis that the DDL method has a better regularization capacity and is useful on the datasets that need it most. The four newly tested datasets demonstrated similar behaviour except for CIFAR100, which is the most challenging of them, having far more classes. We assume that memorizing the ground truth did not really take place for CIFAR100, which is why perturbing the labels even more did not help the training procedure. A more lightweight method, dropout, nevertheless still improved the result compared to the complete absence of regularization.

The results of the experiments on ResNet18 are shown in Table 4 and confirm the results obtained for LeNet. For 5 out of 6 tested datasets, DDL (either alone or in combination with Dropout) achieved the best result compared to the baselines, and it is second best for the remaining one. FMNIST and CIFAR10 presumably do not require regularization as strong as the simplest datasets (MNIST, Intel, and Art) do; therefore, combining DDL with dropout can be excessive in this case, and the DDL method alone obtains the best result. Even with the deeper network, CIFAR100 still does not suffer from overfitting to the extent that the other datasets do, and it does not profit from strong regularization.

| Method | MNIST | FMNIST | CIFAR10 | CIFAR100 | Intel | Art |
|---|---|---|---|---|---|---|
| # classes | 10 | 10 | 10 | 100 | 6 | 5 |
| No reg. | 0.668 ± 0.0618 | 8.354 ± 0.072 | 8.145 ± 0.430 | 30.557 ± 0.466 | 6.237 ± 0.269 | 2.927 ± 0.26 |
| Dr. | 0.601 ± 0.045 | 8.608 ± 0.179 | 7.813 ± 0.170 | 28.363 ± 0.500 | 6.531 ± 0.333 | 2.837 ± 0.348 |
| DL | 0.612 ± 0.1567 | 8.244 ± 0.274 | 7.99 ± 0.444 | 28.048 ± 1.064 | 6.004 ± 0.123 | 2.909 ± 0.416 |
| DDL | 0.558 ± 0.0507 | 8.228 ± 0.107 | 7.667 ± 0.096 | 28.236 ± 1.209 | 6.271 ± 0.207 | 2.891 ± 0.090 |
| Dr.+DL | 0.543 ± 0.039 | 8.480 ± 0.378 | 7.841 ± 0.198 | 28.123 ± 0.540 | 6.231 ± 0.346 | 3.035 ± 0.0342 |
| Dr.+DDL | 0.532 ± 0.054 | 8.475 ± 0.191 | 8.111 ± 0.151 | 28.115 ± 1.186 | 5.917 ± 0.166 | 2.728 ± 0.206 |

  • No reg. refers to no regularization.

  • Dr. refers to dropout.

Table 4: ResNet18 experimental results (average misclassification rate ± standard deviation): With a more complex network, our method demonstrates clear improvements over the baselines. CIFAR100, however, does not require strong regularization and takes advantage of the simpler DL. Results are reported as first and second best.

4.2 Regression Task

Evaluation metric. We use the root-mean-square error (RMSE) HYNDMAN2006679. For every experiment and every dataset, we average the RMSE over 20 runs and measure the standard deviation.
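The metric and its aggregation over runs can be sketched as follows. The helper names are hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def summarize_runs(rmse_values):
    """Mean and (population) standard deviation of the RMSE over repeated runs."""
    values = np.asarray(rmse_values, dtype=float)
    return float(values.mean()), float(values.std())

# Illustrative numbers only; in the experiments there would be 20 values per dataset.
run_scores = [rmse([0.2, 0.4], [0.25, 0.35]), rmse([0.2, 0.4], [0.1, 0.5])]
print(summarize_runs(run_scores))
```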

Baselines. For all eight datasets, we compare the results of our methods with baselines that, to the best of our knowledge, represent the state of the art in neural network regularization, and also with combinations of our methods with the baselines: L2 regularization, dropout, DV with Gaussian noise, DV with Laplacian noise, DV with cosine annealing inproceedings, DE, DV + dropout, DV + L2, and DV + DE.

| Dataset | BHP | BS | AQ | MS | HP | SC | CC | AEP |
|---|---|---|---|---|---|---|---|---|
| # Instances | 506 | 731 | 9,357 | 5,000 | 1,460 | 21,263 | 1,994 | 19,735 |
| # Features | 13 | 13 | 10 | 30 | 81 | 81 | 100 | 27 |

Table 5: Description of the datasets for regression task
| Method | Air | Boston | Bike | Energy | Sklearn | House | Scond | Crime |
|---|---|---|---|---|---|---|---|---|
| Features | 10 | 13 | 13 | 27 | 30 | 81 | 81 | 100 |
| No reg | 0.00880 | 0.09496 | 0.03090 | 0.00506 | 0.06493 | 0.01363 | 0.08137 | 0.14596 |
| L2 | 0.00620 | 0.09122 | 0.02260 | 0.00431 | 0.06088 | 0.01952 | 0.07922 | 0.14270 |
| Dropout | 0.00744 | 0.09121 | 0.01986 | 0.00455 | 0.06435 | 0.01159 | 0.08100 | 0.14514 |
| DV gaus | 0.00451 | 0.08958 | **0.01566** | 0.00327 | 0.06100 | 0.01133 | 0.07930 | 0.14350 |
| DV lapl | 0.00512 | 0.09093 | 0.02559 | - | 0.06106 | 0.01460 | - | - |
| DV anneal | 0.01207 | 0.09324 | 0.03013 | 0.00455 | 0.06235 | **0.00967** | **0.07867** | 0.14460 |
| DE | 0.00699 | 0.08981 | 0.02525 | 0.00462 | 0.06106 | 0.01464 | 0.07953 | 0.14448 |
| DV+Drop | 0.00453 | 0.090262 | 0.01960 | **0.00153** | 0.06335 | 0.01504 | 0.07972 | 0.14389 |
| DV+L2 | 0.00558 | **0.08728** | 0.02147 | 0.00313 | **0.06016** | 0.01142 | 0.07928 | **0.14215** |
| DV+DE | **0.00293** | 0.08789 | 0.02327 | 0.00335 | 0.06496 | 0.01270 | 0.07937 | 0.14349 |

Table 6: Experimental results for regression (RMSE averaged over 20 runs). Best results are marked in bold (lower is better). The proposed DV method, alone or in combination with other regularization methods, shows the best result for every dataset tested. The hypothesis of a dependency between dataset complexity and the best method was not confirmed.

Datasets. We use eight datasets of different sizes and complexity for our evaluations: Boston House Prices (BHP) scikit-learn, Bike Sharing (BS) bike, Air Quality (AQ) DEVITO2008750, Make-sklearn (MS) scikit-learn, Housing Price (HP) de2011ames, Superconductivity (SC) HAMIDIEH2018346, Communities and Crime (CC) Dua:2019, and Appliances Energy Prediction (AEP) CANDANEDO201781. The characteristics of the datasets are summarized in Table 5. In addition, MinMax scaling is used to bring the features in the data into a fixed range.
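MinMax scaling maps each column to the interval [0, 1] via (x - min) / (max - min). A minimal NumPy sketch is given below; the function name is hypothetical, and the handling of constant columns (mapped to 0) is a choice made for this illustration rather than something stated in the text.

```python
import numpy as np

def min_max_scale(values):
    """Scale each column of `values` linearly into [0, 1]."""
    values = np.asarray(values, dtype=float)
    lo = values.min(axis=0)
    hi = values.max(axis=0)
    # Avoid division by zero for constant columns; they are mapped to 0.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (values - lo) / span

features = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
print(min_max_scale(features))
```

In practice the minimum and maximum would be computed on the training split and reused for the test split, so that no test information leaks into the scaling.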

Architecture. We implement our methods using a neural network with two hidden layers and the ReLU activation function.

Optimizer. We use the ADAM optimizer kingma2017adam with a learning rate of 0.001 in all experiments.

Hyperparameters. For each dataset a grid-search is used to find the best value for hyper-parameters:

  1. Penalty for L2 regularization,

  2. Drop rate (%) for Dropout,

  3. Disturb rate (%) for Disturb Value,

  4. Residual (e) for Disturb Error.

Experimental Results. To tune the hyperparameters, every dataset was split into a training set (50%) and a testing set (50%). For each hyperparameter value, we fit our models on the training set and then compute the evaluation metric on the testing set. For every run, the values of every dataset were shuffled before the split. Table 6 shows the results of the experiments: the DV approach, alone or in combination with DE, cosine annealing, Dropout, or L2 regularization, outperforms the baselines on all eight datasets. We were interested in whether the most suitable approach depends on data size and data complexity. To check this, we order Table 6 from the dataset with the lowest complexity (smallest number of features) to the dataset with the highest complexity. No dependencies were revealed by this analysis.
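The tuning protocol (shuffle, split in half, score every candidate value) can be sketched as follows. This is a schematic under assumptions, not the authors' pipeline: the helper names are hypothetical, and `fit_and_score` stands for any user-supplied function that trains a model with a given hyperparameter value and returns its test RMSE.

```python
import numpy as np

def shuffled_half_split(X, y, rng):
    """Shuffle the samples and split them 50/50 into training and testing sets."""
    idx = rng.permutation(len(X))
    half = len(X) // 2
    train, test = idx[:half], idx[half:]
    return X[train], y[train], X[test], y[test]

def grid_search(candidates, fit_and_score):
    """Evaluate every candidate value and return the one with the lowest score."""
    scores = {value: fit_and_score(value) for value in candidates}
    best = min(scores, key=scores.get)
    return best, scores

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(10, 1)
y = np.arange(10, dtype=float)
X_train, y_train, X_test, y_test = shuffled_half_split(X, y, rng)

# Stand-in scoring function for illustration; a real one would train a network.
best_rate, all_scores = grid_search([0.1, 0.2, 0.3], lambda rate: (rate - 0.2) ** 2)
print(best_rate)
```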

We also analyze the standard deviations for every method to make sure that the proposed approaches (DV and DE) are robust. There is no significant difference between the standard deviation for the model without regularization and those for the models with regularization.

5 Conclusion

In this paper, we have extended one of the modern regularization methods, DisturbLabel (DL), by proposing an improved label-disturbing procedure for classification tasks and by carrying the idea of useful ground-truth disturbance over to the regression domain, which was not covered by the method of the baseline paper.

The problem of overfitting can be interpreted as a neural network memorizing the ground truth. Our extension for the classification task is based on the hypothesis that confident predictions often come from this memorization of the ground-truth labels; it therefore makes more sense to penalize (in our case, randomly disturb) a share of them and to leave the training procedure unperturbed for predictions that are naturally unconfident. We proposed using the cosine similarity to measure the distance between the ground-truth vector and the vector of predicted class probabilities in classification tasks. We tested this method on six datasets and two architectures and showed that DDL brings an improvement over DL (alone or in combination with dropout).

We have also shown that extending the DL method to the regression domain can improve the performance of the model and help to avoid overfitting. We presented two methods: (1) the DisturbValue (DV) method, which injects Gaussian noise into target values at random, and (2) the DisturbError (DE) method, which injects Gaussian noise into target values if the prediction is close to the target value, in other words, if the difference between the prediction and the target value is smaller than a small constant. Our proposed DV method, alone or in combination with other approaches (DE, L2 regularization, dropout, cosine annealing), outperformed the well-known baselines. The experiments were done on eight datasets of different sizes and complexity.

References