1 Introduction
Classical computer vision methods generally achieve lower performance than current deep neural networks
[o2019deep]. This is mainly because classical methods are designed to capture human-level information from images at the expense of curtailing their degrees of freedom from the very beginning of the exploration of a solution to the problem. In contrast, deep models might not be able to provide human-communicable latent information from signals, but they achieve superior performance in various machine learning tasks. However, extensive regularization of deep networks reduces their freedom of exploration.
Several regularization techniques have been successfully applied to overparameterized models to reduce overfitting and to generalize to unseen data, including norm penalties on model parameters, weight decay [nowlan1992simplifying], early stopping, and dropout [srivastava2014dropout]. Although not initially introduced as a regularization method, batch normalization
[ioffe2015batch] performs regularization by considering fluctuations within mini-batches, albeit with a generalization performance that often depends on the size of the mini-batch [keskar2016large]. Similarly, stochastic gradient descent (SGD)
[sutskever2013importance] can also be interpreted as a regularized gradient descent impaired by noisy gradients. However, SGD's fluctuations do not lead to a smooth convergence. Other types of regularization are data-dependent, such as traditional data augmentation techniques (rotation, flipping, etc.) [lecun1998gradient, simonyan2014very], AutoAugment [cubuk2018autoaugment], and the mixup method [zhang2017mixup]. The regularization properties of mixup were studied by Guo et al. [guo2019mixup]. Although a few works have attempted to improve the mixup method [shimada2019data, mai2019metamixup], besides the smoothed decision boundaries, which can cause a model to be highly underconfident, the random sample selection step of mixup may produce wrong labels, as shown in Figure 1, and thus fail to generalize well. Another issue with the mixup method is that its linear blending results in implausible images. As we detail later, our proposed method also modifies the input to generate new samples, but it does so while avoiding the drawbacks of linear blending.

Adversarial training is another data-dependent regularization approach [goodfellow2014explaining, roth2019adversarial] in which an adversarially perturbed version of the data is used as a form of augmentation. However, an adversarially perturbed image contains perceptible information about the correct class and imperceptible information about a wrong or a random label. Ideally, a model should learn to cancel out the deliberate implicit patterns specific to the wrong class, but with an arbitrary level of perturbation (as is the case in adversarial training), a model might get biased towards an implausible distribution. In other words, from an adversarial attack perspective, it is ideal to destroy a well-distributed manifold. Therefore, using adversarial training as data augmentation for regularization can skew the feature space towards an arbitrary class distribution.
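For concreteness, the adversarial perturbation discussed above is typically produced with the fast gradient sign method of [goodfellow2014explaining]; the following is a minimal sketch on a toy logistic model (the model, loss, and `eps` value are illustrative assumptions, not details from this paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps=0.1):
    # Fast gradient sign method: step along the sign of the input gradient
    # of the loss, nudging the sample towards a wrong prediction.
    p = sigmoid(w @ x + b)            # toy logistic "model"
    grad_x = (p - y) * w              # d(binary cross-entropy)/dx
    return x + eps * np.sign(grad_x)  # small, label-flipping perturbation

rng = np.random.default_rng(0)
w, b = rng.normal(size=8), 0.0
x, y = rng.normal(size=8), 1.0
x_adv = fgsm(x, y, w, b)
```

In adversarial training, the perturbed sample `x_adv` is fed back to the model with the original label attached, which is exactly the augmentation scheme whose risks are discussed above.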
In contrast to adversarial training, which perturbs an input image to effectuate a wrong prediction, Taghanaki et al. [taghanakitransfer] proposed a gradient-based method that perturbs the input image by transforming it to a new space to effectuate a more accurate prediction. However, since their method relies on the true labels, which are not available at inference time, they leverage a second model to learn the mapping between the input and the corresponding transformed images, and show an improvement in performance at the cost of additional computational complexity. As we describe later, our proposed method also generates new perturbed images that lead to improved predictions; however, it does not require ground-truth labels for the mapping.
Relation to causality. Understanding the effect of input variables on the prediction function's output is essentially a causality problem. Recently, several approaches have successfully applied reverse gradients towards the input to decode, and in particular to visualize, the causal effect of input variables on the overall final prediction [bach2015pixel, selvaraju2017grad, smilkov2017smoothgrad, sundararajan2017axiomatic]. However, they have not studied the causal influence of each single input variable towards a better prediction performance. In this paper, we attempt to answer the question: "can the causal influence of each variable in the input manifold be modified to limit the number of input variables by discarding the less effective ones?". If a model can successfully capture only the input-space variables truly correlated with the labels, then the model should be robust to manifold shifts and out-of-distribution samples [bengio2019meta] without the need for explicit regularization. Leveraging prior knowledge, such as location information of target objects, might help regularize the model and steer it to focus only on the relevant variables in the input, which has motivated recent works on attention priors [yan2019melanoma, zhao2019retinal]. However, such location labels are not always provided and, similar to over-regularization, excessive prior knowledge might restrict the model's exploration of the data and miss other, more discriminative latent information.
To facilitate the model's exploration with only the relevant input variables, we seek a transformation that projects the input to a new space where the important variables are emphasized. Therefore, we adopt the push-forward approach [jost2008riemannian, lee2013smooth] to approximate the changes in the output for each input variable. In contrast to methods that limit the exploration of the model by turning off neurons and scaling the model parameters, such as dropout and batch normalization, and inspired by the adversarial training concept, we introduce a new method called Signed Input Regularization (SIGN). The SIGN method reweights the input (pixels, in the image space) such that an input sample is transformed to a new space in which the variables assigned a negative sign are discarded by a model with ReLU nonlinearity. Thus, the model focuses on fewer input variables.
In this paper, we make the following contributions:

- We introduce a new Jacobian-based regularization method which facilitates the model's exploration by reducing the number of variables in the input.

- We show how the push-forward and manifold mapping concepts can be used for the linear estimation of the input using a layer's output in a model.

- Our proposed method achieves significantly better results compared to traditional data augmentation and the mixup regularization schemes. We also discuss and demonstrate the potential shortcomings of the mixup method.
2 Method
Given a training sample $x$, the goal is to transform it to a new space. Let $\varphi: M \to N$ be a map between manifolds $M$ and $N$. Then the derivative $d\varphi_p: T_pM \to T_{\varphi(p)}N$ of $\varphi$ at a point $p$ is the best linear approximation of $\varphi$ near $p$. Therefore, $d\varphi_p$ pushes tangent vectors on $M$ forward to tangent vectors on $N$ [lee2013smooth], as shown in Figure 2.

2.1 The Push-Forward Map
Let $F: M \to N$ be a smooth function ($F \in C^\infty(M, N)$) and let $p \in M$. The goal is to define the push-forward map [jost2008riemannian, lee2013smooth] as
(1) $F_*: T_pM \to T_{F(p)}N$.
Let $X \in T_pM$ be a tangent vector, which by definition is completely determined by its action on $C^\infty(M)$. In order for $F_*X$ to be a tangent vector at $F(p)$, we need to define
(2) $(F_*X)(g), \quad g \in C^\infty(N)$.
Let $g$ be a function in the space $C^\infty(N)$. The composition $g \circ F$ is a smooth function on an open subset of $M$ which contains $p$, and thus $X(g \circ F)$ is well defined. Let $F: M \to N$ be a smooth function and let $X \in T_pM$. Given $g \in C^\infty(N)$, $F_*X$ is defined by
(3) $(F_*X)(g) = X(g \circ F)$.
Equation 3 can be written using the coordinate bases. Let $\{\partial/\partial x^i\}_{i=1}^m$ be the coordinate basis for $T_pM$ and let $\{\partial/\partial y^j\}_{j=1}^n$ be the coordinate basis for $T_{F(p)}N$. If $X$ is given by
(4) $X = \sum_{i=1}^m X^i \,\frac{\partial}{\partial x^i}$,
then,
(5) $F_*X = \sum_{j=1}^n \Big( \sum_{i=1}^m X^i \,\frac{\partial F^j}{\partial x^i} \Big) \frac{\partial}{\partial y^j}$.
Therefore, the matrix representation of the linear transformation $F_*$ in the basis $\{\partial/\partial x^i\}$ for $T_pM$ and the basis $\{\partial/\partial y^j\}$ for $T_{F(p)}N$ is given by the Jacobian
(6) $J^j_i = \frac{\partial F^j}{\partial x^i}$.
Proof. If we write the matrix representation of $F_*X$ using a matrix $A$, we get
(7) $(F_*X)^j = \sum_{i=1}^m A^j_i X^i$.
The matrix $A$ can then be calculated using Equation 5, and therefore
(8) $A^j_i = \frac{\partial F^j}{\partial x^i}$.
Therefore, the map $F_*$ can be thought of as a matrix of partial derivatives:
(9) $F_* = \begin{bmatrix} \frac{\partial F^1}{\partial x^1} & \cdots & \frac{\partial F^1}{\partial x^m} \\ \vdots & \ddots & \vdots \\ \frac{\partial F^n}{\partial x^1} & \cdots & \frac{\partial F^n}{\partial x^m} \end{bmatrix}$.
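As a numerical sanity check (not from the paper), the push-forward of a tangent vector is simply a Jacobian-vector product; the toy smooth map $F: \mathbb{R}^2 \to \mathbb{R}^3$ below is an illustrative assumption:

```python
import numpy as np

def F(x):
    # toy smooth map F: R^2 -> R^3 (an illustrative assumption)
    return np.array([x[0] ** 2, x[0] * x[1], np.sin(x[1])])

def jacobian(f, p, eps=1e-6):
    # forward-difference Jacobian, J[j, i] = dF^j / dx^i
    fp = f(p)
    J = np.zeros((fp.size, p.size))
    for i in range(p.size):
        d = np.zeros_like(p)
        d[i] = eps
        J[:, i] = (f(p + d) - fp) / eps
    return J

p = np.array([1.0, 2.0])
X = np.array([0.5, -1.0])      # tangent vector at p
push = jacobian(F, p) @ X      # push-forward F_* X, a tangent vector at F(p)
```

Comparing `push` against the analytic Jacobian at $p=(1,2)$, namely $[[2,0],[2,1],[0,\cos 2]]$, applied to $X$ confirms the linear-approximation view of $F_*$.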
2.2 Iterative SIGN Method
Given a trained model, the goal is to transform the input with a linear estimation using one of the model's layers under the push-forward definition, which is realized by the Jacobian, as explained in Section 2.1. Not only does the Jacobian provide the linear estimation of the input, it also highlights the variables in the input that contribute the most to the output of the layer. The input variables with a negative sign are discarded after they are transferred to the target manifold, because the following ReLU activation function does not let them pass on to the subsequent layers.
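Since Algorithm 1 itself is not reproduced here, the following is only a hypothetical sketch of the idea: the Jacobian of a ReLU layer with respect to the input scores each input variable, and a momentum-style velocity accumulates the sign of that score to reweight the input. The layer, update rule, learning rate, and momentum value are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                # stand-in for a trained layer's weights

def layer(x):
    # a single linear layer followed by ReLU, as in the discussion above
    return np.maximum(W @ x, 0.0)

def jacobian(f, x, eps=1e-6):
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        J[:, i] = (f(x + d) - fx) / eps
    return J

def iterative_sign(x, n_iter=50, lr=0.01, momentum=0.9):
    # Hypothetical reconstruction: accumulate a momentum-style velocity in the
    # sign direction of each input variable's aggregate influence on the layer
    # output, then reweight the input so weakly relevant variables shrink.
    v = np.zeros_like(x)
    for _ in range(n_iter):
        g = jacobian(layer, x).sum(axis=0)  # per-variable influence score
        v = momentum * v + lr * np.sign(g)
        x = x * (1.0 + v)                   # reweight the input variables
    return x

x_new = iterative_sign(rng.normal(size=16))
```

The reweighted sample is then used as an additional training example, in the spirit of the augmentation described in Section 3.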
In order to emphasize the input variable discarding step, we use an approach similar to the momentum method [polyak1964some], a technique for accelerating gradient descent algorithms by accumulating a velocity vector in the gradient direction of the loss function across iterations. Our iterative process is summarized in Algorithm 1.

Model  Method  airplane  automobile  bird  cat  deer  dog  frog  horse  ship  truck  Mean

Classical [simonyan2014very]  
mixup [zhang2017mixup]  0.870  0.861  0.793  0.884  
SIGN (proposed)  0.931  0.864  0.763  0.936  0.941  0.936  0.8637  
BasicCNN  DeepAugment [deepaugment]  0.885  0.733  0.911  0.915  0.924  
(Section 3)  DeepAugment + SIGN  0.939  0.819  0.816  0.920  0.919  0.8691 
Method  Metric  airplane  automobile  bird  cat  deer  dog  frog  horse  ship  truck  Mean
Classical [simonyan2014very]  Overall accuracy  0.934  0.929
  [uncertainty]
mixup [zhang2017mixup]  Overall accuracy  0.866
  [uncertainty]
SIGN (proposed)  Overall accuracy  0.949  0.807  0.725  0.849  0.764  0.918  0.926  0.8591
  [uncertainty]
3 Experiments
We evaluate our proposed method on three datasets: CIFAR-10 [krizhevsky2009learning], Tiny ImageNet
[chrabaszcz2017downsampled], and the 2D RGB skin lesion classification dataset from the 2017 IEEE International Skin Imaging Collaboration (ISIC) ISBI Challenge [codella2018skin], and on four different deep neural network architectures: MobileNetV2 [sandler2018mobilenetv2], InceptionResNetv2 [szegedy2017inception], NASNetMobile [zoph2018learning], and a simple small network (called BasicCNN) with the following architecture, to show the effectiveness of the proposed method regardless of the complexity of the architecture: [Conv2D_a, ReLU, Conv2D_a, ReLU, MaxPool2D, Dropout], [Conv2D_b, ReLU, Conv2D_b, ReLU, MaxPool2D, Dropout], [FC512, ReLU, Dropout], where the Conv2D_a and Conv2D_b layers have 32 and 64 filter channels, respectively. All the Conv2D layers use the same kernel size, all the MaxPool2D layers use the same kernel size, and all the dropout layers share a single drop probability.
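To make the architecture concrete, the following walkthrough computes the feature-map size and parameter count of BasicCNN on 32x32x3 CIFAR-10 inputs. Since the kernel sizes and padding were not specified above, 3x3 'same'-padded convolutions and 2x2 pooling are assumed, so the resulting numbers are illustrative rather than the paper's:

```python
# Layer-by-layer shape and parameter walkthrough of BasicCNN on 32x32x3
# CIFAR-10 inputs. Channel widths (32, 64) and the FC-512 head come from the
# text; 3x3 'same'-padded convolutions and 2x2 pooling are assumptions.

def conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k + c_out      # weights + biases

def build_basic_cnn(h=32, w=32, c=3):
    params = 0
    for c_out in (32, 32):                   # block 1: two Conv2D_a layers
        params += conv_params(c, c_out)
        c = c_out
    h, w = h // 2, w // 2                    # MaxPool2D (assumed 2x2)
    for c_out in (64, 64):                   # block 2: two Conv2D_b layers
        params += conv_params(c, c_out)
        c = c_out
    h, w = h // 2, w // 2                    # second MaxPool2D
    params += h * w * c * 512 + 512          # FC-512 head
    return params, (h, w, c)

n_params, feat_shape = build_basic_cnn()
```

Under these assumptions, the convolutional trunk holds only about 66K parameters while the FC-512 head dominates with roughly 2.1M, which is typical for small CNNs of this shape.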
We compare our proposed SIGN method with classical data augmentation strategies (horizontal and vertical flipping, shifting, and rotations) on normal data, with the mixup transformed samples, and with DeepAugment [deepaugment], the Bayesian version of the AutoAugment [cubuk2018autoaugment] method. We train all the models with a batch size of 128, except for the InceptionResNetv2 in the skin lesion classification experiment, for which we set the batch size to 32. We train all the models except BasicCNN for 100 epochs and select the epochs corresponding to the highest accuracies on the validation sets. The BasicCNN model, which was used to test DeepAugment, was trained for 200 epochs.
We set the number of iterations for iterative SIGN (Algorithm 1) to 50 and 100, thereby augmenting the training set with two different transformed versions. For the source manifold, we use the output of the last layer (before the logits) of the model to estimate the target manifold, which is the input variables' space.
In the following subsections, we start with a general classification performance analysis in subsection 3.1. In subsection 3.2, we design a MobileNetV2-based model to capture the aleatoric uncertainty [kendall2017uncertainties] of the different regularization methods. Next, we set up an experiment to evaluate the models' robustness to random perturbations and out-of-distribution samples in subsection 3.3. Finally, we study the transferability of the SIGN-generated samples to other models in subsection 3.4.
3.1 Classification Performance Analysis
In this subsection, we study the performance of the proposed approach and compare it with other methods in terms of classification accuracy. We first test all three regularization methods on the CIFAR-10 dataset using the MobileNetV2 and BasicCNN models. As reported in Table 1, the proposed method achieves the highest overall classification accuracy for both models. Note that for the BasicCNN model, we apply the learned augmentation policies from normal CIFAR-10 images to both normal and SIGN transformed samples. We do this to study the transferability of the augmentation policies to the SIGN samples.
In Figure 3, we visualize a few samples from CIFAR10 before and after applying the SIGN regularization method. As can be seen, the color contrasts and the boundaries of the objects are enhanced after applying SIGN.
Next, we apply the winning strategy (SIGN) from the previous experiments on a separate dataset and model, i.e., the ISIC dataset and the InceptionResNetv2 model. The task here is to predict whether a skin lesion is a melanoma or not. The dataset consists of 2000 dermoscopic images of skin lesions, out of which only 374 belong to the positive class, indicating a considerable class imbalance. Our proposed SIGN method improves the baseline model's area under the curve (AUC) on both the validation and test sets. Note that for this experiment, we do not apply any data augmentation, so as to ascertain that the improvement comes from the application of the SIGN method. It can be seen in Figure 4 that our SIGN method enhances the dermoscopic features of the skin lesions, which leads to an improvement in the classification performance.
Next, we visualize the t-SNE [maaten2008visualizing] feature space of the different approaches by transforming and projecting their high-dimensional features into a 2D space. As can be seen in Figure 5, the proposed SIGN method results in more compact class representations for both the training and the validation samples. For the mixup method, however, the linear interpolation function causes label smoothing that manifests as fuzzy boundaries across the different classes.
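The label smoothing attributed to mixup here comes directly from its blending rule, which can be sketched generically (a reconstruction, not the paper's code; `alpha` is mixup's Beta-distribution hyperparameter):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    # Draw a mixing coefficient from Beta(alpha, alpha) and blend both the
    # inputs and the one-hot labels; the blended labels are the source of
    # the smoothed decision boundaries seen in the t-SNE plots.
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_mix, y_mix = mixup(np.zeros((32, 32, 3)), np.array([1.0, 0.0]),
                     np.ones((32, 32, 3)), np.array([0.0, 1.0]))
```

Because the target `y_mix` is a convex combination of two one-hot vectors, no training sample ever carries a fully confident label, which is consistent with the underconfidence discussed in the next subsection.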
In order to verify that the perturbation added to the samples by our iterative SIGN method carries useful signal for correct classification, we train and evaluate a MobileNetV2 model with only the perturbations, i.e., the linear estimation of the input from the last layer of the model. We obtain a mean classification accuracy of 46.2% on CIFAR-10 using only these perturbations, which shows that they capture class information as expected.
3.2 Uncertainty Analysis
To capture uncertainties in the models' predictions, which arise because of the inherent noise in the observations, we model the aleatoric uncertainty [kendall2017uncertainties]. In particular, we define a stochastic loss function which models the aleatoric uncertainty as
(10) $L_x = \sum_i \log \frac{1}{T} \sum_{t=1}^{T} \exp\Big( \hat{x}_{i,t,c_i} - \log \sum_{c'} \exp \hat{x}_{i,t,c'} \Big)$,
where $c_i$ is the class label for sample input $x_i$. We design the model architecture such that it outputs $f_i^{W}$ and $\sigma_i^{W}$ with parameters $W$, where $f_i^{W}$ is the unary for input $x_i$ and $\hat{x}_{i,t}$ are the Monte Carlo approximations (samples) of the unaries using the learned $\sigma_i^{W}$, which is defined as
(11) $\hat{x}_{i,t} = f_i^{W} + \sigma_i^{W} \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I)$.
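A minimal numpy sketch of the Monte Carlo estimate above, following the formulation of [kendall2017uncertainties]; returning the negative of the objective so it reads as a loss to minimize is a presentational choice, and the toy logits are assumptions:

```python
import numpy as np

def aleatoric_loss(f, sigma, c, T=100, rng=None):
    # f: predicted unaries (logits), shape (C,); sigma: learned noise scale.
    # Corrupt the logits with Gaussian noise T times and average the
    # resulting softmax likelihoods of the true class c.
    rng = rng or np.random.default_rng(0)
    x_hat = f + sigma * rng.normal(size=(T, f.size))   # Monte Carlo samples
    log_probs = x_hat - np.log(np.exp(x_hat).sum(axis=1, keepdims=True))
    return -np.log(np.mean(np.exp(log_probs[:, c])))   # negative log-likelihood

loss = aleatoric_loss(np.array([2.0, 0.5, -1.0]), np.array([0.1, 0.1, 0.1]), c=0)
```

With `sigma` set to zero the expression reduces to the standard cross-entropy, which is a convenient sanity check for the implementation.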
Table 2 summarizes the results of this experiment on the CIFAR-10 dataset. As expected, the mixup method makes the decision boundaries smooth (see Figure 5), which causes the models trained with this data augmentation strategy to be highly underconfident, i.e., to fail to know when they fail. In particular, the mixup method's probability threshold for correctly predicted classes is only slightly above chance level, while the corresponding thresholds for the standard data augmentation and SIGN methods are considerably higher. This means that for a model trained with mixup, the lowest probability an image can have while still being correctly predicted is just above the probability of randomly assigning a correct label for CIFAR-10 images (0.1 for ten classes).
Method  Metric  automobile  bird  cat  dog  frog  Mean
(total images)  (49)  (146)  (50)  (298)  (100)  (643)
Classical [simonyan2014very]  Accuracy  0.5479  0.8300
  [uncertainty]
mixup [zhang2017mixup]  Accuracy
  [uncertainty]
SIGN (proposed)  Accuracy  0.9184  0.7200  0.4497  0.5490
  [uncertainty]  [N/A]  [N/A]
Method  airplane  automobile  bird  cat  deer  dog  frog  horse  ship  truck  Mean 
Classical [simonyan2014very]  0.741  0.962  
Transfer (with SIGN)  0.858  0.927  0.813  0.672  0.829  0.898  0.934  0.910  0.8478 
Model  Method  Normal accuracy  Pixels turned off (50 random pixels)  Gaussian noise
MobileNetV2 [sandler2018mobilenetv2]  Classical [simonyan2014very]  
mixup [zhang2017mixup]  
SIGN (proposed)  0.8637  
BasicCNN (Section 3)  DeepAugment [deepaugment]  
DeepAugment + SIGN  0.8691 
In Table 2, we report the mean probability values of the correctly classified test samples whose correct-class probabilities fall below a threshold, i.e., samples with relatively low confidence. As can be seen in the table, across all classes combined, our proposed method has the fewest samples with probabilities below the threshold value, compared to the mixup and standard data augmentation methods. As can be seen in the last row of the same table, the SIGN method obtains reasonably high uncertainty values for the samples with a low confidence level, which is desirable, since we want a model to know when it fails by producing high uncertainty. The mixup method, however, causes the model to produce relatively small uncertainty values while it is still highly underconfident, indicating that the model is unaware of its failures.

3.3 Out-of-distribution and Robustness Analysis
Next, we study the performance of the models trained with the different data augmentation techniques by evaluating them on out-of-distribution and corrupted samples, a few examples of which are shown in Figure 6. For evaluating on out-of-distribution samples, we train a MobileNetV2 model on the CIFAR-10 dataset and test on Tiny ImageNet. In particular, we extract images from the Tiny ImageNet classes that correspond to classes present in CIFAR-10 and resize them to the CIFAR-10 image resolution (32×32). We use 49 images of the 'automobile' class, 146 images of 'bird', 50 images of 'cat', 298 images of 'dog', and 100 images of 'frog', totaling 643 images overall. As can be seen in the third row of Table 3, our method achieves significantly better results than the other techniques, including a notable improvement over mixup.
Next, to test the robustness of the methods to different input corruptions using two different models and five different augmentation approaches, we apply two types of corruptions to the test samples (which are assumed to be drawn from the same distribution as the training samples): (a) we add Gaussian noise with a fixed mean and standard deviation to the images, and (b) we turn off 50 randomly selected pixels in each image. Table 5 summarizes the results of this experiment. The SIGN method shows higher resistance to these corruptions than the other methods, and the DeepAugment method's results improve when it uses the SIGN transformed samples. This experiment shows that if a method is restricted to learning from fewer input variables, as SIGN is, it should be robust when a few variables are missing, which is simulated in this experiment by zeroing out random pixels.
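The two corruptions can be reproduced generically as below; the Gaussian mean and standard deviation used in the paper are not given above, so the values here are placeholders:

```python
import numpy as np

def gaussian_corrupt(x, mean=0.0, std=0.1, rng=None):
    # (a) additive Gaussian noise, clipped back to the valid pixel range
    rng = rng or np.random.default_rng(0)
    return np.clip(x + rng.normal(mean, std, size=x.shape), 0.0, 1.0)

def pixel_dropout(x, n_pixels=50, rng=None):
    # (b) turn off n randomly selected pixels (all channels at once)
    rng = rng or np.random.default_rng(0)
    out = x.copy()
    h, w = x.shape[:2]
    idx = rng.choice(h * w, size=n_pixels, replace=False)
    out.reshape(h * w, -1)[idx] = 0.0
    return out

img = np.full((32, 32, 3), 0.5)   # dummy CIFAR-sized image in [0, 1]
noisy, dropped = gaussian_corrupt(img), pixel_dropout(img)
```

Both corruptions are applied only at test time, so they probe robustness without changing the training distribution.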
3.4 Transferability of SIGN Samples
Finally, we examine whether the transformed samples obtained with our SIGN method are effective for other, unseen models. To this end, we leverage the samples mapped to a new space using a MobileNetV2 model to train a NASNetMobile model. As reported in Table 4, the SIGN method's samples are transferable, as they improve the NASNetMobile baseline model's mean classification accuracy on the CIFAR-10 dataset, including considerable improvements in class-wise accuracies for 8 out of 10 classes.
4 Conclusion
We proposed SIGN, a regularization method that can be used as a data augmentation strategy. Our proposed iterative SIGN technique produces different transformations of the input data, which are then used to train a model. We showed how the push-forward concept in manifold transformation can be applied both to obtain the linear estimation of the input using a layer in the model and to diminish the effect of the unimportant input variables by assigning them negative signs. We showed that the iterative SIGN method improves the generalization performance of deep models.
We also discussed the critical limitations and risks of using the mixup and adversarial training methods as regularization techniques. We evaluated the proposed idea on several classification tasks and demonstrated the superior classification accuracy obtained using iterative SIGN regularization. Moreover, we note that the proposed method can also be extended to other problems such as dense image labeling and object detection. Another possible future direction is to study the feasibility of using the SIGN method to cancel out the perturbations added by adversarial attack strategies.