Signed Input Regularization

11/16/2019 ∙ by Saeid Asgari Taghanaki, et al. ∙ Simon Fraser University

Over-parameterized deep models usually over-fit to a given training distribution, which makes them sensitive to small changes and out-of-distribution samples at inference time, leading to low generalization performance. To address this, several model-based and randomized data-dependent regularization methods, such as data augmentation, are applied to prevent a model from memorizing the training distribution. Instead of randomly transforming the input images, we propose SIGN, a new regularization method that modifies the input variables using a linear transformation by estimating each variable's contribution to the final prediction. Our proposed technique maps the input data to a new manifold where the less important variables are de-emphasized. To test the effectiveness of the proposed idea and compare it with competing methods, we design several test scenarios, including classification performance, uncertainty, out-of-distribution, and robustness analyses. We compare the methods using three different datasets and four models. We find that SIGN encourages more compact class representations, which makes the model robust to random corruptions and out-of-distribution samples while simultaneously achieving superior performance on normal data compared to other competing methods. Our experiments also demonstrate the successful transferability of SIGN samples from one model to another.


1 Introduction

Classical computer vision methods generally achieve lower performance than current deep neural networks [o2019deep]. This is mainly because classical methods are designed to capture human-level information from images at the expense of curtailing their freedom of exploration from the very beginning of the search for a solution. In contrast, deep models might not provide human-communicable latent information from signals, but they achieve superior performance in various machine learning tasks. However, extensive regularization of deep networks reduces their freedom of exploration.

Several different regularization techniques have been successfully applied to over-parameterized models to reduce over-fitting and to generalize to unseen data, including the norm penalty on model parameters, weight decay [nowlan1992simplifying], early stopping, and dropout [srivastava2014dropout]. Although not initially introduced as a regularization method, batch normalization [ioffe2015batch] performs regularization by exploiting fluctuations within mini-batches, although the resulting generalization performance of deep neural networks often depends on the mini-batch size [keskar2016large]. Similarly, stochastic gradient descent (SGD) [sutskever2013importance] can be interpreted as gradient descent regularized by noisy gradients; however, SGD's fluctuations do not lead to a smooth convergence. Other types of regularization are data-dependent, such as traditional data augmentation techniques (rotation, flipping, etc.) [lecun1998gradient, simonyan2014very], AutoAugment [cubuk2018autoaugment], and the mixup method [zhang2017mixup]. The regularization properties of mixup were studied by Guo et al. [guo2019mixup]. Although a few works have attempted to improve the mixup method [shimada2019data, mai2019metamixup], besides the smoothed decision boundaries, which can cause a model to be highly under-confident, the random sample selection step of mixup may produce wrong labels, as shown in Figure 1, thus failing to generalize well. Another issue with the mixup method is that the linear blending results in implausible images. As we detail later, our proposed method also modifies the input to generate new samples, but it does so while avoiding the drawbacks of linear blending.
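For concreteness, the mixup interpolation of [zhang2017mixup] that is being criticized can be sketched as follows (a minimal NumPy sketch; the alpha value and the one-hot label encoding are illustrative assumptions):

```python
import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Blend two images and their one-hot labels with a Beta-sampled weight,
    as in Zhang et al. (2017). Randomly paired samples near a class boundary
    can end up with a label dominated by the other class."""
    lam = rng.beta(alpha, alpha)       # mixing coefficient lambda ~ Beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2    # linear blending of the two images
    y = lam * y1 + (1.0 - lam) * y2    # the same interpolation applied to the labels
    return x, y
```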

Figure 1: The mixup method leads to wrong labels for the samples on the boundaries due to random sample selection. Blue and red dots represent samples from two different classes. The dashed line represents the interpolation function ($\tilde{x} = \lambda x_i + (1-\lambda) x_j$, applied to both images and labels), which was proposed by the authors [zhang2017mixup] for the mixing step. The samples inside the blue dotted circle should be assigned a blue label; however, due to random sampling, they get a red label, and vice versa for the red dotted circle. $\mathcal{X}$ denotes the image space.

Adversarial training is another data-dependent regularization approach [goodfellow2014explaining, roth2019adversarial] in which an adversarially perturbed version of the data is used as a form of augmentation. However, an adversarially perturbed image contains perceptible information about the correct class and imperceptible information about a wrong or random label. Ideally, a model should learn to cancel out the deliberate implicit patterns specific to the wrong class, but with an arbitrary level of perturbation (which is the case in adversarial training), a model might become biased towards an implausible distribution. In other words, from an adversarial attack perspective, destroying a well-distributed manifold is the ideal outcome. Therefore, using adversarial training as data augmentation for regularization can skew the feature space towards an arbitrary class distribution.
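For reference, the kind of adversarial perturbation discussed here can be sketched with the standard fast gradient sign method of [goodfellow2014explaining] (a minimal PyTorch sketch; the epsilon value is illustrative and not from the paper):

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=8 / 255):
    """Fast gradient sign perturbation (Goodfellow et al., 2014): shift the
    input in the direction that increases the loss of the true label."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()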

In contrast to adversarial training, which perturbs an input image to effectuate a wrong prediction, Taghanaki et al. [taghanakitransfer] proposed a gradient-based method that perturbs the input image by transforming it to a new space to effectuate a more accurate prediction. However, since their method relies on the true labels, which are not available at inference time, they leverage a second model to learn the mapping between the input and the corresponding transformed images, and show an improvement in performance at the cost of additional computational complexity. As we describe later, our proposed method also generates new perturbed images that lead to improved predictions; however, it does not require ground-truth labels for the mapping.

Relation to causality. Understanding the effect of input variables on the prediction function's output is essentially a causality problem. Recently, several approaches have successfully applied reverse gradients towards the input to decode, and in particular to visualize, the causal effect of input variables on the overall final prediction [bach2015pixel, selvaraju2017grad, smilkov2017smoothgrad, sundararajan2017axiomatic]. However, they have not studied the causal influence of each single input variable towards a better prediction performance. In this paper, we attempt to answer the question: "can the causal influence of each variable in the input manifold be modified so as to limit the number of input variables by discarding the less effective ones?". If a model can successfully capture only those variables in the input space that are truly correlated with the labels, then the model should be robust to manifold shifts and out-of-distribution samples [bengio2019meta] without the need for explicit regularization. Leveraging prior knowledge, such as location information of target objects, might help regularize the model and steer it to focus only on the relevant variables in the input, which has motivated recent works on attention priors [yan2019melanoma, zhao2019retinal]. However, such location labels are not always available and, similar to over-regularization, excessive prior knowledge might restrict the model's exploration of the data and miss other, more discriminative latent information.

To facilitate the model's exploration with only the relevant input variables, we seek a transformation that projects the input to a new space where the important variables are emphasized. Therefore, we adopt the push-forward approach [jost2008riemannian, lee2013smooth] to approximate the changes in the output for each input variable. In contrast to methods that limit the exploration of the model by turning off neurons or scaling the model parameters, such as dropout and batch normalization, and inspired by the adversarial training concept, we introduce a new method called Signed Input Regularization (SIGN). The SIGN method re-weights the input (pixels in the image space $\mathcal{X}$) such that an input sample $x$ is transformed to a new sample $\hat{x}$ in which the less important variables carry a negative sign. These variables are then discarded by a model with ReLU non-linearity, so the model focuses on fewer input variables.

In this paper, we make the following contributions:

  • We introduce a new Jacobian-based regularization method which facilitates the model's exploration by reducing the number of effective variables in the input.

  • We show how the push-forward and manifold-mapping concepts can be used to obtain a linear estimation of the input from a layer's output in a model.

  • Our proposed method achieves significantly better results compared to traditional data augmentation and the mixup regularization schemes. We also discuss and demonstrate the potential shortcomings of the mixup method.

2 Method

Given a training sample $x$, the goal is to transform it to a new space. Let $\phi: M \to N$ be a map between manifolds $M$ and $N$. Then the derivative $d\phi_p$, i.e., the derivative of $\phi$ at a point $p$ acting on the tangent space of $M$, is the best linear approximation of $\phi$ near $p$ in the tangent space of $N$. Therefore, $d\phi_p$ pushes tangent vectors on $M$ forward to tangent vectors on $N$ [lee2013smooth], as shown in Figure 2.

Figure 2: Visualization of the push-forward map from the tangent space of $M$ to the tangent space of $N$. Compared to the mixup method (Figure 1), which transforms the sample within the same manifold using a linear combination, our proposed method moves the samples to a new manifold. $\mathcal{X}$ denotes the image space.

2.1 The Push Forward Map

Let $\phi: M \to N$ be a smooth function ($\phi \in C^\infty$) and let $p \in M$. The goal is to define the push-forward map [jost2008riemannian, lee2013smooth] as

$d\phi_p: T_pM \to T_{\phi(p)}N. \qquad (1)$

Let $v \in T_pM$ be a tangent vector (in the tangent space $T_pM$), which by definition is completely determined by its action on smooth functions on $M$. In order for $d\phi_p(v)$ to be a tangent vector at $\phi(p)$, we need to define its action on smooth functions on $N$:

$\big(d\phi_p(v)\big)(f) = v(f \circ \phi). \qquad (2)$

Let $f$ be a function in the space $C^\infty(N)$. The composition $f \circ \phi$ is a smooth function on an open subset of $M$ which contains $p$, and thus $v(f \circ \phi)$ is well defined. Let $\phi: M \to N$ be a smooth function and let $p \in M$. Given $v \in T_pM$, the push-forward $d\phi_p(v) \in T_{\phi(p)}N$ is defined by

$\big(d\phi_p(v)\big)(f) = v(f \circ \phi), \quad f \in C^\infty(N). \qquad (3)$

Equation 3 can be written using the coordinate bases. Let $\{\partial/\partial x^i\}_{i=1}^{m}$ be the coordinate basis for $T_pM$ and let $\{\partial/\partial y^j\}_{j=1}^{n}$ be the coordinate basis for $T_{\phi(p)}N$. If $v$ is given by

$v = \sum_{i=1}^{m} v^i \frac{\partial}{\partial x^i}, \qquad (4)$

then,

$d\phi_p(v) = \sum_{j=1}^{n} \Big(\sum_{i=1}^{m} v^i \frac{\partial \phi^j}{\partial x^i}\Big) \frac{\partial}{\partial y^j}. \qquad (5)$

Therefore, the matrix representation of the linear transformation $d\phi_p$ in the basis $\{\partial/\partial x^i\}$ for $T_pM$ and the basis $\{\partial/\partial y^j\}$ for $T_{\phi(p)}N$ is given by the Jacobian

$(J_\phi)_{ji} = \frac{\partial \phi^j}{\partial x^i}. \qquad (6)$

Proof. If we write the matrix representation of $d\phi_p$ as a matrix $A$, we get

$d\phi_p(v) = \sum_{j=1}^{n} \Big(\sum_{i=1}^{m} A_{ji} v^i\Big) \frac{\partial}{\partial y^j}. \qquad (7)$

The matrix $A$ can then be read off from Equation 5, and therefore

$A_{ji} = \frac{\partial \phi^j}{\partial x^i} = (J_\phi)_{ji}. \qquad (8)$

Therefore, the map $d\phi_p$ can be thought of as a matrix of partial derivatives:

$d\phi_p \simeq J_\phi = \begin{bmatrix} \frac{\partial \phi^1}{\partial x^1} & \cdots & \frac{\partial \phi^1}{\partial x^m} \\ \vdots & \ddots & \vdots \\ \frac{\partial \phi^n}{\partial x^1} & \cdots & \frac{\partial \phi^n}{\partial x^m} \end{bmatrix}. \qquad (9)$
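In practice, such a Jacobian can be obtained directly with automatic differentiation. The sketch below (PyTorch; the choice of layer and the input shape are assumptions) computes the matrix of partial derivatives of a chosen layer's output with respect to a single input sample:

```python
import torch

def layer_jacobian(layer_fn, x):
    """Matrix of partial derivatives of a layer's output w.r.t. one input sample.

    layer_fn: callable mapping an input tensor to the chosen layer's output,
              returned as a 1-D vector (e.g., the activations before the logits)
    x:        a single input tensor, e.g., of shape (3, 32, 32)
    Returns a matrix of shape (output_dim, input_dim).
    """
    jac = torch.autograd.functional.jacobian(layer_fn, x)
    return jac.reshape(jac.shape[0], -1)
```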

2.2 Iterative SIGN Method

Given a trained model, the goal is to transform the input with a linear estimation using one of the model's layers under the push-forward definition, which is realized by the Jacobian, as explained in Section 2.1. Not only does the Jacobian provide a linear estimation of the input, it also highlights the variables in the input which contribute the most to the output of the layer. The input variables with negative sign are discarded after they are transferred to the new manifold, because the following ReLU activation function does not let them pass on to the subsequent layers.

In order to emphasize the input-variable discarding step, we use an approach similar to the momentum method [polyak1964some], which is a technique for accelerating gradient descent algorithms by accumulating a velocity vector in the gradient direction of the loss function across iterations. Our iterative process is summarized in Algorithm 1.

Input: a well-trained model, a real sample, and the number of iterations
Output: the transformed input sample
1: initialize the transformed sample with the original input and the velocity with zeros;
2: for each iteration do
3:     compute the Jacobian-based linear estimate of the input from the chosen layer;
4:     accumulate the estimate into the velocity (momentum step) and re-weight the input;
5: end for
6: return the transformed input sample
Algorithm 1: Iterative SIGN
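The sketch below illustrates one plausible realization of such an iterative, Jacobian-based re-weighting with a momentum-style velocity; the step size, momentum coefficient, and the aggregation over output dimensions are illustrative assumptions rather than the authors' exact updates.

```python
import torch

def iterative_sign_sketch(layer_fn, x, num_iters=50, step=1e-2, momentum=0.9):
    """Sketch of an iterative, Jacobian-based input re-weighting (not the
    authors' exact Algorithm 1). layer_fn maps an input tensor to a 1-D
    layer output; step and momentum are illustrative hyper-parameters."""
    x_t = x.clone()
    velocity = torch.zeros_like(x)
    for _ in range(num_iters):
        # linear estimate of each input variable's contribution to the layer output
        jac = torch.autograd.functional.jacobian(layer_fn, x_t)
        contribution = jac.sum(dim=0)
        # momentum-style accumulation of the contributions across iterations
        velocity = momentum * velocity + contribution
        # nudge every variable according to the sign of its accumulated contribution,
        # so that unimportant variables drift towards negative values
        x_t = x_t + step * velocity.sign()
    return x_t.detach()
```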
Table 1: Class-wise and overall classification accuracies of the different methods (Classical [simonyan2014very], mixup [zhang2017mixup], SIGN, DeepAugment [deepaugment], and DeepAugment + SIGN) on CIFAR-10 for the MobileNetV2 [sandler2018mobilenetv2] and BasicCNN (Section 3) models. SIGN achieves the highest mean accuracy with MobileNetV2 (0.8637), and DeepAugment + SIGN achieves the highest mean accuracy with BasicCNN (0.8691).
Table 2: Aleatoric uncertainty analysis of the different methods on the CIFAR-10 dataset, reporting, per class, the overall accuracy, the number of correctly classified low-confidence images, their mean probability, and the corresponding uncertainty for the Classical [simonyan2014very], mixup [zhang2017mixup], and SIGN methods (overall mean accuracy for SIGN: 0.8591).

3 Experiments

We evaluate our proposed method on three datasets: CIFAR-10 [krizhevsky2009learning], Tiny ImageNet [chrabaszcz2017downsampled], and the 2D RGB skin lesion classification dataset from the 2017 IEEE International Skin Imaging Collaboration (ISIC) ISBI Challenge [codella2018skin], and on four different deep neural network architectures: MobileNetV2 [sandler2018mobilenetv2], Inception-ResNet-v2 [szegedy2017inception], NASNetMobile [zoph2018learning], and a simple small network (called BasicCNN) with the following architecture, included to show the effectiveness of the proposed method regardless of the complexity of the architecture:

[Conv2D_a ReLU Conv2D_a ReLU MaxPool2D Dropout] [Conv2D_b ReLU Conv2D_b ReLU MaxPool2D Dropout] [FC512 ReLU Dropout],

where the Conv2D_a and Conv2D_b layers have 32 and 64 filter channels, respectively. All the Conv2D layers share the same kernel size, all the MaxPool2D layers share the same pooling size, and all the dropout layers share the same drop probability.
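A PyTorch sketch of the BasicCNN blocks follows; the 3x3 convolution kernels, 2x2 pooling, 0.25 dropout rate, and 10-class head are assumed values chosen for illustration rather than the exact settings.

```python
import torch.nn as nn

# BasicCNN sketch for 32x32 RGB inputs; kernel sizes, dropout rate, and the
# 10-class head are assumptions, the block structure follows the description above.
basic_cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2), nn.Dropout(0.25),

    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2), nn.Dropout(0.25),

    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 512), nn.ReLU(), nn.Dropout(0.25),
    nn.Linear(512, 10),
)
```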

We compare our proposed SIGN method with classical data augmentation strategies (horizontal and vertical flipping, shifting, and rotations) on normal data, with the mixup-transformed samples, and with DeepAugment [deepaugment], the Bayesian version of the AutoAugment [cubuk2018autoaugment] method. We train all the models with a batch size of 128, except for Inception-ResNet-v2 in the skin lesion classification experiment, for which we set the batch size to 32. We train all the models except BasicCNN for 100 epochs and select the epochs corresponding to the highest accuracies on the validation sets. The BasicCNN model, which was used to test DeepAugment, was trained for 200 epochs.

We set the number of iterations for iterative SIGN (Algorithm 1) to 50 and 100, thereby augmenting the training set with two differently transformed versions. For the source manifold, we use the output of the last layer (before the logits) of the model to estimate the target manifold, which is the space of the input variables.

In the following subsections, we start with a general classification performance analysis in subsection 3.1. In subsection 3.2, we design a MobileNetV2-based model to capture the aleatoric uncertainty [kendall2017uncertainties] of different regularization methods. Next, we set up an experiment to evaluate the models’ robustness to random perturbations and out-of-distribution samples in subsection 3.3. Finally, we study the transferability of the SIGN-generated samples to other models in subsection 3.4.

3.1 Classification Performance Analysis

In this subsection, we study the performance of the proposed approach and compare it with the other methods in terms of classification accuracy. We first test all three regularization methods on the CIFAR-10 dataset using the MobileNetV2 and BasicCNN models. As reported in Table 1, the proposed method achieves the highest overall classification accuracy for both models. Note that for the BasicCNN model, we apply the augmentation policies learned from normal CIFAR-10 images to both the normal and the SIGN-transformed samples. We do this to study the transferability of the augmentation policies to the SIGN samples.

In Figure 3, we visualize a few samples from CIFAR-10 before and after applying the SIGN regularization method. As can be seen, the color contrasts and the boundaries of the objects are enhanced after applying SIGN.

Figure 3: CIFAR-10 samples before (top) and after (bottom) applying SIGN.

Next, we apply the winning strategy (SIGN) from the previous experiments to a separate dataset and model, namely the ISIC dataset and the Inception-ResNet-v2 model. The task here is to predict whether a skin lesion is a melanoma or not. The dataset consists of 2000 dermoscopic images of skin lesions, of which only 374 belong to the positive class, indicating a considerable class imbalance. Our proposed SIGN method improves the baseline model's area under the curve (AUC) on both the validation and test sets. Note that for this experiment we do not apply any data augmentation, so as to ascertain that the improvement comes from the application of the SIGN method. It can be seen in Figure 4 that our SIGN method enhances the dermoscopic features of the skin lesions, which leads to an improvement in classification performance.

Figure 4: ISIC skin lesion samples before (top) and after (bottom) applying SIGN.

Next, we visualize the t-SNE [maaten2008visualizing] feature space of the different approaches by projecting their high-dimensional features into a 2D space. As can be seen in Figure 5, the proposed SIGN method results in more compact class representations for both the training and the validation samples. For the mixup method, however, the linear interpolation function causes label smoothing that manifests as fuzzy boundaries between the different classes.

(a) Classical [simonyan2014very] (b) mixup [zhang2017mixup] (c) SIGN (proposed)
Figure 5: Feature space visualization for models trained under different schemes. The first and second rows represent plots for training and validation sets, respectively.
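Such a projection can be produced with, for example, scikit-learn's t-SNE; in the sketch below, the feature and label arrays as well as the perplexity value are illustrative placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# features: (N, D) penultimate-layer activations, labels: (N,) class ids
features = np.load("features.npy")   # hypothetical file names
labels = np.load("labels.npy")

emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=3, cmap="tab10")
plt.show()
```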

In order to certify that the perturbation added to the samples by our iterative SIGN method contains useful signal for correct classification, we train and evaluate a MobileNetV2 model using only the linear estimations of the input obtained from the last layer of the model. We obtain a mean classification accuracy of 46.2% on CIFAR-10 using only these estimations, which shows that they capture class information as expected.

3.2 Uncertainty Analysis

To capture uncertainties in the models' predictions that arise from the inherent noise in the observations, we model the aleatoric uncertainty [kendall2017uncertainties]. In particular, we define a stochastic loss function which models the aleatoric uncertainty as

$L = \frac{1}{N} \sum_{i} -\log \frac{1}{T} \sum_{t=1}^{T} \exp\Big(\hat{x}_{i,t,c_i} - \log \sum_{c'} \exp \hat{x}_{i,t,c'}\Big), \qquad (10)$

where $c_i$ is the class label for sample input $x_i$. We design the model architecture such that it outputs $f_i^{W}$ and $\sigma_i^{W}$ with parameters $W$, where $f_i^{W}$ is the unary for input $x_i$, and $\hat{x}_{i,t}$ are the Monte Carlo approximations (samples) of the unaries obtained using the learned $\sigma_i^{W}$, defined as

$\hat{x}_{i,t} = f_i^{W} + \sigma_i^{W} \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I). \qquad (11)$
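A minimal sketch of this Monte Carlo loss, in the style of Kendall and Gal [kendall2017uncertainties], is given below; the number of samples and the exact output parameterization are assumptions.

```python
import math
import torch

def aleatoric_classification_loss(f, log_sigma, y, num_samples=100):
    """Monte Carlo aleatoric classification loss in the style of
    Kendall and Gal (2017); the number of samples is an assumption.

    f:         (B, C) predicted unaries (logits)
    log_sigma: (B, 1) predicted log standard deviation of the logit noise
    y:         (B,) integer class labels
    """
    eps = torch.randn(num_samples, *f.shape, device=f.device)
    x_hat = f.unsqueeze(0) + log_sigma.exp().unsqueeze(0) * eps       # corrupted unaries
    log_probs = torch.log_softmax(x_hat, dim=-1)                      # (T, B, C)
    picked = log_probs[:, torch.arange(f.shape[0], device=f.device), y]  # log p(y_i | x_hat_t)
    # log of the Monte-Carlo-averaged likelihood, negated and averaged over the batch
    log_avg = torch.logsumexp(picked, dim=0) - math.log(num_samples)
    return -log_avg.mean()
```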

Table 2 summarizes the results of this experiment on the CIFAR-10 dataset. As expected, the mixup method makes the decision boundaries smooth (see Figure 5), which causes the models trained with this data augmentation strategy to be highly under-confident, i.e., to fail to know when they fail. In particular, the mixup method's probability threshold for correctly predicted classes is only slightly above the chance level of randomly assigning a correct label to a CIFAR-10 image (0.1), while the corresponding values for the standard data augmentation and SIGN methods are higher. This means that for a model trained with mixup, the lowest probability an image can have while still being correctly predicted is barely above chance.

Table 3: Out-of-distribution results on the Tiny ImageNet dataset (automobile: 49, bird: 146, cat: 50, dog: 298, frog: 100 images; 643 in total) using a MobileNetV2 model trained on the CIFAR-10 dataset, reporting per-class accuracy, the number of low-confidence images, and the corresponding uncertainty for the Classical [simonyan2014very], mixup [zhang2017mixup], and SIGN methods.
Table 4: Transferability results of the proposed SIGN method: class-wise and mean classification accuracies of a NASNetMobile model trained on CIFAR-10 samples transformed with SIGN using a MobileNetV2 model (mean accuracy 0.8478), compared to classical data augmentation [simonyan2014very].
Table 5: Robustness of the methods to additive Gaussian noise and pixel corruption (50 random pixels turned off per image), reporting normal accuracy, accuracy with pixels turned off, and accuracy under Gaussian noise. SIGN reaches a normal accuracy of 0.8637 with MobileNetV2 [sandler2018mobilenetv2], and DeepAugment + SIGN reaches 0.8691 with BasicCNN (Section 3).

In Table 2, we report the mean probability values of the correctly classified test samples whose correct-class probability falls below a fixed threshold, i.e., samples with relatively low confidence. As can be seen in the table, across all classes combined, our proposed method has the fewest such low-probability samples, fewer than both the mixup and the standard data augmentation methods. As can be seen in the last row of the same table, the SIGN method obtains reasonably high uncertainty values for the samples with a low confidence level, which is desirable, since we want a model to know when it fails by producing high uncertainty. The mixup method, however, causes the model to produce relatively smaller uncertainty values while still being highly under-confident, indicating that the model is unaware of its failures.

3.3 Out-of-distribution and Robustness Analysis

Next, we study the performance of the models trained with the different data augmentation techniques by evaluating them on out-of-distribution and corrupted samples, a few examples of which are shown in Figure 6. To evaluate on out-of-distribution samples, we train a MobileNetV2 model on the CIFAR-10 dataset and test it on Tiny ImageNet. In particular, we extract images from the Tiny ImageNet classes that correspond to classes present in CIFAR-10 and resize them to the CIFAR-10 image resolution ($32 \times 32$). We use 49 images of the 'automobile' class, 146 images of the 'bird' class, 50 images of the 'cat' class, 298 images of the 'dog' class, and 100 images of the 'frog' class, totaling 643 images. As can be seen in the third row of Table 3, our method achieves significantly better results than the other techniques, including mixup.

Next, to test the robustness of the methods to different input corruptions, using two different models and five different augmentation approaches, we apply two types of corruption to the test samples (which are assumed to be drawn from the same distribution as the training samples): (a) we add Gaussian noise to the images, and (b) we turn off 50 randomly selected pixels in each image. Table 5 summarizes the results of this experiment. The SIGN method shows higher resistance than the other methods, and the DeepAugment method's results improve when it uses the SIGN-transformed samples. This experiment shows that if a method is restricted to learning from fewer input variables, as SIGN is, it should be robust when a few variables are missing, which is simulated here by zeroing out random pixels.
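The two corruptions can be reproduced roughly as follows; the noise standard deviation is an illustrative placeholder, and the images are assumed to be scaled to [0, 1] with a channels-last layout.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(images, std=0.05):
    """Additive Gaussian noise on images scaled to [0, 1]; std is a placeholder."""
    return np.clip(images + rng.normal(0.0, std, size=images.shape), 0.0, 1.0)

def turn_off_pixels(images, num_pixels=50):
    """Zero out num_pixels randomly chosen pixel locations per image
    (channels-last layout (N, H, W, C) assumed)."""
    corrupted = images.copy()
    n, h, w = images.shape[0], images.shape[1], images.shape[2]
    for i in range(n):
        idx = rng.choice(h * w, size=num_pixels, replace=False)
        rows, cols = np.unravel_index(idx, (h, w))
        corrupted[i, rows, cols, :] = 0.0
    return corrupted
```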

Figure 6: Samples of the corrupted CIFAR-10 images. Rows from top to bottom show the normal, pixel-off, and Gaussian-noise versions, respectively.

3.4 Transferability of SIGN Samples

Finally, we examine whether the transformed samples obtained with our SIGN method are effective for other, unseen models. To this end, we use the samples mapped to a new space by a MobileNetV2 model to train a NASNetMobile model. As reported in Table 4, the SIGN samples are transferable: they improve the NASNetMobile baseline model's mean classification accuracy on the CIFAR-10 dataset, including considerable improvements in class-wise accuracies for 8 out of 10 classes.

4 Conclusion

We proposed SIGN, a regularization method that can be used as a data augmentation strategy. Our iterative SIGN technique produces different transformations of the input data, which are then used to train a model. We showed how the push-forward concept in manifold transformation can be applied both to obtain a linear estimation of the input using a layer of the model and to diminish the effect of the unimportant input variables by assigning them negative signs. We showed that the iterative SIGN method can help improve the generalization performance of deep models.

We also discussed the critical limitations and risks of using the mixup and the adversarial training methods as regularization techniques. We evaluated the proposed idea for several classification tasks and demonstrated the superior classification accuracy obtained using iterative SIGN regularization. Moreover, we note that the proposed method can also be extended to other problems such as dense image labeling and object detection. Another possible future direction is to study the feasibility of using the SIGN method to cancel out the adversarial perturbations added by adversarial attack strategies.

References