SaliencyMix: A Saliency Guided Data Augmentation Strategy for Better Regularization

06/02/2020 · by A. F. M. Shahab Uddin, et al.

Advanced data augmentation strategies have been widely studied to improve the generalization ability of deep learning models. Regional dropout is one of the popular solutions that guides the model to focus on less discriminative parts by randomly removing image regions, resulting in improved regularization. However, such information removal is undesirable. On the other hand, recent strategies suggest randomly cutting and mixing patches and their labels among training images, to enjoy the advantages of regional dropout without having any pointless pixels in the augmented images. We argue that a randomly selected patch may not necessarily contain any information about the corresponding object, and mixing the labels according to such an uninformative patch leads the model to learn unexpected feature representations. Therefore, we propose SaliencyMix, which carefully selects a representative image patch with the help of a saliency map and mixes this indicative patch with the target image, leading the model to learn more appropriate feature representations. SaliencyMix achieves a new state-of-the-art top-1 error of 20.09% on ImageNet classification using the ResNet-101 architecture and also improves model robustness against adversarial perturbations. Furthermore, a SaliencyMix-trained model helps to improve object detection performance.


1 Introduction

Machine learning has achieved state-of-the-art (SOTA) performance in almost every field, especially in computer vision tasks such as image classification (Olga et al., 2015; Krizhevsky et al., 2012; He et al., 2016), object detection (Shaoqing et al., 2015; Wei et al., 2016), semantic segmentation (Chen et al., 2018; Jonathan et al., 2015), natural scene understanding (Yang et al., 2019; Xiao et al., 2018b), human pose estimation (Toshev and Szegedy, 2014; Xiao et al., 2018a), and so on. This success can mainly be attributed to the deep architecture of convolutional neural networks (CNNs), which typically have tens to hundreds of millions of learnable parameters. This huge number of parameters enables deep CNNs to solve extremely complex problems. However, besides providing powerful representation ability, such a large number of parameters increases the probability of overfitting when the number of training examples is insufficient, which results in poor generalization capability of the model.

To improve the generalization ability of deep learning models, several data augmentation strategies have been studied. Random feature removal is one of the popular techniques, whose goal is to improve model robustness by guiding CNN models not to focus on a few small regions of the input images or on a small set of internal activations. Dropout (Nitish et al., 2014; Tompson et al., 2015) and regional dropout (Junsuk and Hyunjung, 2019; Terrance and Graham, 2017; Golnaz et al., 2018; Singh and Lee, 2017; Zhun et al., 2017) are two established training strategies: the former randomly turns off some internal activations, and the latter removes or alters random regions of the input images. Both force a model to learn the entire object region rather than focusing only on the most important features, and thereby improve the generalizability of the model. Although dropout and regional dropout improve classification performance, this kind of feature removal is undesirable since it discards a notable portion of informative pixels from the training images.

Recently, CutMix (Yun et al., 2019) proposed a data augmentation technique that randomly replaces a region of an image with a patch from another training image and mixes the labels of the source and target objects according to the ratio of mixed pixels. This method leaves no uninformative pixels in the training images while still enjoying the properties of regional dropout. However, we argue that since the source image patch is selected at random, it is possible to select a patch from the background region of the source image that contains no information about the corresponding object. Such an example is shown in Figure 1. The selected source patch is highlighted with a black rectangle on the source image in the bottom-left picture, and two possible augmented images are shown on the right. In both cases, there is no information about the source object in the augmented images, regardless of the patch location on the target image. Yet the labels provided by CutMix indicate that the source object is present with a certain probability, as shown by the augmented labels in Figure 1. This is undesirable and misleads the CNN to learn unexpected feature representations: CNNs are highly sensitive to texture (Geirhos et al., 2019), and since CutMix labels the selected background patch as the source object, it may encourage the classifier to learn the background as a representative feature of the source object class.

We address the aforementioned problem by carefully selecting the image patch from the source image with the help of prior information. Specifically, we first extract a saliency map of the source image that highlights its semantically important regions, and then select a patch surrounding the peak salient region of the source image to ensure that the patch is taken from the object, before mixing it with the target image. The selected patch thus contains relevant information about the source object, which leads the model to learn more appropriate feature representations. We call this more effective data augmentation strategy "SaliencyMix".

We present extensive experiments on various standard CNN architectures, benchmark datasets, and multiple tasks to evaluate the proposed method. In summary, SaliencyMix obtains new SOTA top-1 errors of 2.76% and 16.56% on CIFAR-10 and CIFAR-100 (Krizhevsky, 2012), respectively, when applying WideResNet (Zagoruyko and Komodakis, 2016) as the baseline architecture. On the ImageNet (Olga et al., 2015) classification problem, SaliencyMix achieves new SOTA top-1 and top-5 errors of 21.26% and 5.76% with ResNet-50 as the baseline architecture, and 20.09% and 5.15% with ResNet-101 (He et al., 2016). For the object detection task, initializing Faster RCNN (Shaoqing et al., 2015) (with ResNet-50 as the backbone network) from a SaliencyMix model trained on ImageNet and then fine-tuning the detector improves the detection performance on the Pascal VOC (Everingham et al., 2010) dataset by +1.77 mAP. Moreover, the SaliencyMix-trained model proves to be more robust against adversarial attack and improves top-1 accuracy by 1.96% over CutMix on the adversarially perturbed ImageNet validation set. All of these results clearly indicate the effectiveness of the proposed SaliencyMix data augmentation strategy in enhancing model performance and robustness.

2 Related Works

2.1 Data Augmentation

The success of deep learning models can be credited to the volume and diversity of data. However, collecting labeled data is a cumbersome and time-consuming task. As a result, data augmentation strategies, which aim to increase the diversity of existing data by applying various transformations, have come into play. Since this simple and inexpensive technique significantly improves model performance and robustness, data augmentation has been widely used to train deep learning models.

LeCun et al. applied data augmentation to train LeNet (Lecun et al., 1998) for handwritten character recognition, performing several affine transformations such as translation (horizontal and vertical), scaling, and shearing. For the same task of optical character recognition, Bengio et al. (2011) used a deeper network and, in addition to affine transformations, applied more diverse transformations to the dataset, such as Gaussian noise, salt-and-pepper noise, Gaussian smoothing, motion blur, local elastic deformation, and various occlusions. In AlexNet (Krizhevsky et al., 2012), a revolutionary work in image classification, Krizhevsky et al. applied random image patch cropping, horizontal flipping, and random changes of color intensity using principal component analysis (PCA) based color augmentation. In Deep Image (Wu et al., 2015), the authors applied color casting, vignetting, and lens distortion in addition to flipping and cropping to improve the robustness of a very deep network.

Besides these manually designed data augmentations, Lemley et al. (2017) proposed an end-to-end learnable augmentation process called Smart Augmentation. They used two different networks, where one learns the suitable augmentation type and the other is trained on the actual task.

Devries and Taylor (2017) proposed Cutout, which randomly removes square regions of the input training images to improve the robustness of the model. Zhang et al. (2017) proposed MixUp, which blends two training images to some degree, where the labels of the augmented image are assigned by linear interpolation of the two original labels; however, the augmented images look unnatural and locally ambiguous. Recently, Yun et al. (2019) proposed an effective data augmentation technique called CutMix that randomly cuts and mixes image patches among training samples, where the image labels are also mixed proportionally to the size of those patches. However, due to the randomness of the source patch selection process, it may select a region that does not contain any informative pixels about the source object, and mixing labels according to such uninformative patches misleads the classifier to learn unexpected feature representations.
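To make the contrast concrete, the following minimal NumPy sketch compares the label-mixing rules of MixUp and CutMix as described above; the function and variable names (mixup, cutmix, bbox, lam) are ours, and the patch coordinates are assumed to be given.

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """MixUp: blend two whole images and linearly interpolate their one-hot labels."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

def cutmix(x1, y1, x2, y2, bbox):
    """CutMix: paste a rectangular patch of x2 into x1 and mix the labels
    by the fraction of pixels each image contributes."""
    x = np.copy(x1)
    r1, c1, r2, c2 = bbox                                  # patch location (rows, cols)
    x[r1:r2, c1:c2, :] = x2[r1:r2, c1:c2, :]               # replace region with source patch
    lam = 1.0 - (r2 - r1) * (c2 - c1) / float(x1.shape[0] * x1.shape[1])
    y = lam * y1 + (1.0 - lam) * y2                        # label of x1 weighted by its kept area
    return x, y
```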

In this work, the careful selection of the source patch ensures that it always contains some information about the source object, which resolves the class probability assignment problem and helps to improve model performance and robustness.

2.2 Label Smoothing

In object classification, class labels are usually represented by one-hot codes, i.e., the true class is expected to have a probability of exactly 1 while the others have exactly 0. In other words, it encourages the model to be overconfident. However, CNN models employ the softmax function, which cannot predict a class with a probability of exactly 1 or 0. As a result, the model learns ever-larger weight values, which makes it less adaptive and too confident about its predictions.

Label smoothing relaxes the model's confidence in the labels by setting the class probabilities to slightly shifted intermediate values. This helps the model to be more adaptive instead of over-confident, and ultimately improves model robustness and performance (Szegedy et al., 2016). Our method also mixes the class labels and therefore enjoys the benefit of label smoothing.
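As a point of reference, here is a minimal sketch of label smoothing applied to one-hot targets; the smoothing factor eps and the helper name are illustrative, not taken from the cited work.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Move probability mass eps away from the true class and spread it
    uniformly over all classes, so targets are never exactly 0 or 1."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / num_classes

# Example: a 4-class one-hot label for class 2 becomes [0.025, 0.025, 0.925, 0.025].
print(smooth_labels(np.array([0.0, 0.0, 1.0, 0.0])))
```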


Figure 2: Flow diagram of the proposed SaliencyMix data augmentation. We first extract the saliency map of the source image, which highlights the regions of interest. Then we select a patch around the peak salient pixel location and mix it with the target image.

2.3 Saliency Detection

The human visual system (HVS) contains a natural attention mechanism that automatically focuses on the most important parts of any visual scene. In computer vision, visual saliency detection models aim to simulate this HVS attention mechanism (for more details, please refer to Cong et al., 2019). Among the well-established models, (Hou and Zhang, 2007; Achanta et al., 2009; Wang and Dudek, 2014; Qin et al., 2019) are some fast and powerful saliency detection algorithms.

Instead of summarizing properties of the target objects, Hou and Zhang (2007) proposed a spectral residual method that focuses on the properties of the background. Specifically, this method first analyzes the log-spectrum of an image, then extracts the spectral residual in the spectral domain, and finally reconstructs the saliency map in the spatial domain. In order to produce full-resolution saliency maps with clear boundaries, Achanta et al. (2009) proposed a frequency-tuned approach that preserves boundary information by retaining a sufficient amount of high-frequency content. Wang and Dudek (2014) treated saliency detection as a background subtraction problem and proposed a fast pixel-level adaptive background detection algorithm. This method records per-pixel historical background values and their importance based on occurrence statistics, which helps to remove the least useful information and enables detection of the salient region with high accuracy. Qin et al. (2019) exploited the power of deep CNNs and proposed a boundary-aware saliency detection network (BASNet) that consists of a densely supervised encoder-decoder network to produce the saliency prediction and a residual refinement module to refine the saliency map.

In this paper, since our goal is to restrict the source patch selection to the objects of interest, we use the saliency detection method proposed by Wang and Dudek (2014), which highlights objects rather than backgrounds. Section 3.3 further explains the effects of different saliency detection algorithms on the proposed data augmentation method.
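For illustration, the snippet below shows how such saliency maps could be computed with OpenCV's contrib saliency module (available via opencv-contrib-python), whose fine-grained static saliency detector follows Wang and Dudek (2014) and whose spectral residual detector follows Hou and Zhang (2007); the image path is a placeholder.

```python
import cv2

# Load a source image (the path is illustrative).
img = cv2.imread("source.jpg")

# Fine-grained static saliency (OpenCV's implementation of Wang and Dudek, 2014),
# the detector used in this paper.
fine = cv2.saliency.StaticSaliencyFineGrained_create()
ok, saliency_map = fine.computeSaliency(img)          # float map, higher = more salient

# Spectral residual static saliency (Hou and Zhang, 2007), for comparison.
spectral = cv2.saliency.StaticSaliencySpectralResidual_create()
ok2, saliency_map_sr = spectral.computeSaliency(img)

# Scale to 8-bit for visualization.
vis = (saliency_map * 255).astype("uint8")
```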

3 Proposed Method

Similar to CutMix (Yun et al., 2019), we cut a patch from the source image, mix it with the target image, and mix the source and target class labels in proportion to the size of the mixed patches. However, in order to prevent the model from learning any irrelevant feature representation, the proposed method enforces the selection of a source patch that contains information about the source object. It first extracts a saliency map of the source image to highlight the objects of interest and then selects a patch surrounding the peak salient region to mix with the target image. Here we explain the process in detail.

3.1 Selection of the Source Patch

The goal of saliency detection is to find the pixels or regions that attract the HVS and to assign them higher intensity values (Cong et al., 2019). A saliency detection method produces a visual saliency map, a gray-scale image that highlights the objects of interest and thereby mostly focuses on the foreground.

Let $I_s$ be a randomly selected training (source) image with label $y_s$ from which a patch will be cut. Its saliency map detection can be represented as

$$I_{vs} = f(I_s), \tag{1}$$

where $I_{vs}$ represents the saliency map of the given source image, as shown in Figure 2, in which the objects of interest have higher intensity values, and $f(\cdot)$ represents a saliency detection model. It is worth noting that we use the saliency detection method proposed by Wang and Dudek (2014) due to its performance, simplicity, and availability, as explained in Section 3.3. We then search the saliency map for the pixel with the maximum intensity value. Its coordinates $(x_p, y_p)$ give the x and y positions of the most salient pixel and can be found as

$$(x_p, y_p) = \operatorname*{arg\,max}_{(x,\,y)} I_{vs}(x, y). \tag{2}$$

Then we select a patch, either centered on the peak salient pixel when possible, or otherwise positioned so that the peak salient pixel lies within the selected patch. This ensures that the patch is selected from the object region rather than from the background. Following CutMix, the size of the patch is determined by the combination ratio $\lambda$, which is sampled from the uniform distribution $U(0, 1)$.
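To make the selection step concrete, below is a minimal sketch of Eqs. (1)-(2) and the patch sizing. It assumes OpenCV's fine-grained static saliency detector as the model $f$ and a CutMix-style mapping from the combination ratio to the patch size; the helper name and the clipping strategy are ours.

```python
import cv2
import numpy as np

def select_source_patch(src_img, lam):
    """Return (x1, y1, x2, y2): a patch of relative area (1 - lam)
    positioned around the most salient pixel of src_img."""
    h, w = src_img.shape[:2]

    # Eq. (1): saliency map of the source image.
    detector = cv2.saliency.StaticSaliencyFineGrained_create()
    _, sal = detector.computeSaliency(src_img)

    # Eq. (2): coordinates of the peak salient pixel.
    y_p, x_p = np.unravel_index(np.argmax(sal), sal.shape)

    # Patch size, assuming the CutMix convention: patch area ratio = 1 - lam.
    cut_ratio = np.sqrt(1.0 - lam)
    cut_w, cut_h = int(w * cut_ratio), int(h * cut_ratio)

    # Center the patch on the peak pixel, clipped to the image borders
    # so that the peak always stays inside the selected patch.
    x1 = int(np.clip(x_p - cut_w // 2, 0, w - cut_w))
    y1 = int(np.clip(y_p - cut_h // 2, 0, h - cut_h))
    return x1, y1, x1 + cut_w, y1 + cut_h
```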

3.2 Mixing the Patches and Labels

Let $I_t$ be the target image, another randomly selected training sample, with label $y_t$. SaliencyMix partially mixes $I_s$ and $I_t$ to produce a new training sample $I_a$, the augmented image, with label $y_a$. The mixing of the two images can be defined as

$$I_a = M \odot I_s + M' \odot I_t, \tag{3}$$

where $I_a$ denotes the augmented image, $M$ represents a binary mask, $M'$ is the complement of $M$, and $\odot$ represents element-wise multiplication.

First, the region to be taken from the source image is defined using the peak salient information and the value of $\lambda$, and the corresponding locations of the mask $M$ are set to 1 and the others to 0. The element-wise multiplication of $M$ with the source image produces an image that removes everything except the region we decided to keep. In contrast, $M'$ does the opposite of $M$, i.e., the element-wise multiplication of $M'$ with the target image keeps all regions except the selected patch. Finally, adding the two creates a new training sample that contains the target image with the selected source patch in it (see Figure 2).

Besides mixing the images, we also mix their labels based on the size of the mixed patches as

$$y_a = \lambda\, y_s + (1 - \lambda)\, y_t, \tag{4}$$

where $y_a$ denotes the label for the augmented sample and $\lambda$ is the mixing ratio. Other ways of mixing are also possible and are investigated in Section 3.4.
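Continuing the sketch from Section 3.1, the mixing of Eqs. (3)-(4) can be written as follows. This is an illustrative re-implementation, not the authors' released code; the source and target images are assumed to have the same size, and the labels are one-hot vectors.

```python
import numpy as np

def saliency_mix(src_img, y_s, tgt_img, y_t, lam):
    """Paste the salient source patch into the target image at the
    corresponding location (Eq. 3) and mix the one-hot labels by the
    actual pixel ratio (Eq. 4)."""
    h, w = tgt_img.shape[:2]
    x1, y1, x2, y2 = select_source_patch(src_img, lam)   # sketch from Sec. 3.1

    # Binary mask M: 1 inside the selected patch, 0 elsewhere.
    mask = np.zeros((h, w, 1), dtype=np.float32)
    mask[y1:y2, x1:x2] = 1.0

    # I_a = M * I_s + M' * I_t  ("Salient to Corresponding" placement).
    aug_img = (mask * src_img + (1.0 - mask) * tgt_img).astype(src_img.dtype)

    # Fraction of the augmented image coming from the source patch,
    # used to weight the source and target labels.
    src_frac = (x2 - x1) * (y2 - y1) / float(w * h)
    y_a = src_frac * y_s + (1.0 - src_frac) * y_t
    return aug_img, y_a
```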


Figure 3: The effect of using different saliency detection methods on the proposed SaliencyMix data augmentation. Performances are reported from the average of five runs.

3.3 Impact of Different Saliency Detection Methods

Here we investigate the effect of using various saliency detection methods in our SaliencyMix data augmentation. We use four well-recognized saliency detection algorithms (Wang and Dudek, 2014; Hou and Zhang, 2007; Achanta et al., 2009; Qin et al., 2019) and perform experiments using ResNet-18 as the baseline model on the CIFAR-10 dataset with traditional data augmentation techniques.

Figure 3 shows that the fast self-tuning background subtraction algorithm (Wang and Dudek, 2014) performs best among the saliency detection methods, because it aims to highlight the foreground over the background and therefore suits our intention of finding the object of interest. It is also readily available in the OpenCV (Itseez, 2015) library.

3.4 Different Ways of Mixing the Source Patch

There are several ways to select the source patch and mix it with the target image. In this section we explore these possible mixing styles and examine their effect on the proposed method. We use the ResNet-18 architecture with SaliencyMix data augmentation and perform experiments on the CIFAR-10 dataset with traditional data augmentation techniques. We identify five ways of mixing: Salient to Corresponding, which selects the source patch from the most salient region and mixes it at the corresponding location of the target image; Salient to Salient, which selects the source patch from the most salient region and mixes it at the salient region of the target image; Salient to Non-Salient, which selects the source patch from the most salient region but mixes it at the non-salient region of the target image; Non-Salient to Salient, which selects the source patch from the non-salient region of the source image but mixes it at the salient region of the target image; and Non-Salient to Non-Salient, which selects the source patch from the non-salient region of the source image and also mixes it at the non-salient region of the target image. To locate the non-salient region, we use the least salient pixel of an image.

Figure 4 shows the classification performance of the proposed SaliencyMix data augmentation with the above-mentioned mixing styles. Both Non-Salient to Salient and Non-Salient to Non-Salient select the source patch from the non-salient region of the source image, which does not contain any information about the source object, and thereby produce larger classification errors than the other three options, where the patch is selected from the most salient region of the source image. This justifies the key design choice of SaliencyMix, i.e., the source patch should be selected such that it contains information about the source object. On the other hand, Salient to Salient covers the most significant part of the target image, which prevents the model from learning its most important feature, and Salient to Non-Salient may not occlude the target object at all, which is necessary to improve regularization. Salient to Corresponding strikes a balance by occluding either the most important part of the target or other regions, depending on the positions of the source and target objects. Consequently, it produces a greater variety of augmented data and thereby achieves the lowest classification error. It also introduces less computational burden, since only the saliency of the source image needs to be detected. Therefore, the proposed method uses Salient to Corresponding as the default mixing style.
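The sketch below indicates how the paste location in the target image could be chosen for each of these styles; the style strings and helper name are ours, and the default Salient to Corresponding case simply reuses the source patch coordinates.

```python
import cv2
import numpy as np

def paste_location(tgt_img, cut_w, cut_h, style, src_xy=None):
    """Choose the top-left corner of the pasted patch in the target image
    for the mixing styles of Section 3.4 (style strings are ours)."""
    h, w = tgt_img.shape[:2]
    if style == "corresponding":              # same location as in the source image
        x_c, y_c = src_xy
    else:
        det = cv2.saliency.StaticSaliencyFineGrained_create()
        _, sal = det.computeSaliency(tgt_img)
        if style == "salient":                # peak salient pixel of the target
            y_c, x_c = np.unravel_index(np.argmax(sal), sal.shape)
        else:                                 # "non_salient": least salient pixel
            y_c, x_c = np.unravel_index(np.argmin(sal), sal.shape)
    x1 = int(np.clip(x_c - cut_w // 2, 0, w - cut_w))
    y1 = int(np.clip(y_c - cut_h // 2, 0, h - cut_h))
    return x1, y1
```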


Figure 4: Different ways of mixing the source patch with the target image and their effects. Performances are reported from the average of five runs.

4 Experiments

We verify the effectiveness of the proposed SaliencyMix data augmentation strategy on multiple tasks. We evaluate our method on image classification by applying it to several benchmark image recognition datasets using popular SOTA architectures. We also fine-tune a SaliencyMix-trained model for the object detection task to verify its usefulness in enhancing detection performance. Furthermore, we validate the robustness of the proposed method against adversarial attacks. All experiments were performed on the PyTorch platform with four NVIDIA GeForce RTX Ti GPUs.


Method | CIFAR-10 | CIFAR-10+ | CIFAR-100 | CIFAR-100+
ResNet-18 (Baseline) | 10.63 ± 0.26 | 4.27 ± 0.21 | 36.68 ± 0.57 | 22.46 ± 0.31
ResNet-18 + Cutout | 9.31 ± 0.18 | 3.99 ± 0.13 | 34.98 ± 0.29 | 21.96 ± 0.24
ResNet-18 + CutMix | 9.44 ± 0.34 | 3.78 ± 0.12 | 34.42 ± 0.27 | 19.42 ± 0.23
ResNet-18 + SaliencyMix | 7.59 ± 0.22 | 3.65 ± 0.10 | 28.73 ± 0.13 | 19.29 ± 0.21
ResNet-50 (Baseline) | 12.14 ± 0.95 | 4.98 ± 0.14 | 36.48 ± 0.50 | 21.58 ± 0.43
ResNet-50 + Cutout | 8.84 ± 0.77 | 3.86 ± 0.25 | 32.97 ± 0.74 | 21.38 ± 0.69
ResNet-50 + CutMix | 9.16 ± 0.38 | 3.61 ± 0.13 | 31.65 ± 0.61 | 18.72 ± 0.23
ResNet-50 + SaliencyMix | 6.81 ± 0.30 | 3.46 ± 0.08 | 24.89 ± 0.39 | 18.57 ± 0.29
WideResNet-28-10 (Baseline) | 6.97 ± 0.22 | 3.87 ± 0.08 | 26.06 ± 0.22 | 18.80 ± 0.08
WideResNet-28-10 + Cutout | 5.54 ± 0.08 | 3.08 ± 0.16 | 23.94 ± 0.15 | 18.41 ± 0.27
WideResNet-28-10 + CutMix | 5.18 ± 0.20 | 2.87 ± 0.16 | 23.21 ± 0.20 | 16.66 ± 0.20
WideResNet-28-10 + SaliencyMix | 4.04 ± 0.13 | 2.76 ± 0.07 | 19.45 ± 0.32 | 16.56 ± 0.17
Table 1: Top-1 error (%) (average of five runs) of SOTA data augmentation methods on the CIFAR-10 and CIFAR-100 datasets using popular standard architectures. A "+" after the dataset name indicates that traditional data augmentation techniques were also used during training.

4.1 Image Classification

4.1.1 CIFAR-10 and CIFAR-100

Both CIFAR-10 and CIFAR-100 (Krizhevsky, 2012) consist of 60,000 color images of size 32×32 pixels, where CIFAR-10 has 10 distinct classes and CIFAR-100 has 100 classes. Each dataset contains 50,000 training images and 10,000 test images.

We apply several standard architectures: a deep residual network (He et al., 2016) with depths of 18 (ResNet-18) and 50 (ResNet-50), and a wide residual network (Zagoruyko and Komodakis, 2016) with a depth of 28, a widening factor of 10, and dropout in the convolutional layers (WideResNet-28-10). We train the networks for 200 epochs with a batch size of 256 using stochastic gradient descent (SGD) with Nesterov momentum of 0.9 and weight decay. The initial learning rate was 0.1 and was decayed by a fixed factor at scheduled epochs. The images are normalized using per-channel mean and standard deviation. We perform experiments with and without the traditional data augmentation scheme, where traditional data augmentation includes zero-padding, random cropping, and horizontal flipping.
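A minimal PyTorch sketch of this training setup is given below. The weight-decay value, learning-rate milestones, and decay factor are illustrative placeholders, since the text above does not specify them, and a torchvision ResNet-18 stands in for the paper's models.

```python
import torchvision
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Stand-in model: the paper uses ResNet-18/50 and WideResNet-28-10 on CIFAR.
model = torchvision.models.resnet18(num_classes=10)

# SGD with Nesterov momentum of 0.9 as stated above; the weight-decay value
# here (5e-4) is a common CIFAR choice, NOT recovered from the paper.
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9,
                nesterov=True, weight_decay=5e-4)

# The 0.1 initial learning rate is decayed step-wise at scheduled epochs;
# the milestones and decay factor below are illustrative placeholders.
scheduler = MultiStepLR(optimizer, milestones=[100, 150], gamma=0.1)

for epoch in range(200):          # 200 epochs, batch size 256 in the paper
    # ... one training epoch over the CIFAR loader goes here ...
    scheduler.step()
```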

Table 1 shows the experimental results of several well-established data augmentation techniques on the CIFAR datasets. For a fair comparison, we report the average performance of five runs for all methods. For each architecture, the proposed SaliencyMix data augmentation strategy outperforms the other methods. It achieves new SOTA top-1 errors of 2.76% and 16.56% on CIFAR-10 and CIFAR-100, respectively, when applied with the WideResNet-28-10 architecture. Moreover, SaliencyMix shows a significant performance improvement over CutMix when applied without any traditional augmentation technique: as shown in Table 1, it reduces the error rate by 1.85%, 2.35%, and 1.14% on CIFAR-10 when applied with ResNet-18, ResNet-50, and WideResNet-28-10, respectively. Using the same architectures, it reduces the error rate by 5.69%, 6.76%, and 3.76% on CIFAR-100, respectively.


Method | # Params | Top-1 Error (%) | Top-5 Error (%)
ResNet-50 (Baseline) | 25.6 M | 23.68 | 7.05
ResNet-50 + Cutout | 25.6 M | 22.93 | 6.66
ResNet-50 + StochasticDepth (Huang et al., 2016) | 25.6 M | 22.46 | 6.27
ResNet-50 + Mixup | 25.6 M | 22.58 | 6.40
ResNet-50 + Manifold Mixup (Verma et al., 2018) | 25.6 M | 22.50 | 6.21
ResNet-50 + DropBlock (Golnaz et al., 2018) | 25.6 M | 21.87 | 5.98
ResNet-50 + CutMix | 25.6 M | 21.40 | 5.92
ResNet-50 + SaliencyMix | 25.6 M | 21.26 | 5.76
ResNet-101 (Baseline) | 44.6 M | 21.87 | 6.29
ResNet-101 + Cutout | 44.6 M | 20.72 | 5.51
ResNet-101 + Mixup | 44.6 M | 20.52 | 5.28
ResNet-101 + CutMix | 44.6 M | 20.17 | 5.24
ResNet-101 + SaliencyMix | 44.6 M | 20.09 | 5.15
Table 2: Performance comparison (best performance) of SOTA data augmentation strategies on ImageNet classification with standard model architectures. Results of the compared methods are taken from Yun et al. (2019).

4.1.2 ImageNet

ImageNet (Olga et al., 2015) is one of the most challenging and widely recognized benchmark datasets for image classification. It contains about 1.2 million training images and 50,000 validation images across 1,000 classes. To perform experiments on ImageNet, we apply the same settings as used in Yun et al. (2019), training SaliencyMix with the same number of epochs, initial learning rate, step-wise learning rate decay schedule, and batch size. Traditional data augmentations such as resizing, cropping, flipping, and jitter are also applied during training.

Table 2 compares the ImageNet performance of the proposed SaliencyMix with Cutout, Stochastic Depth (Huang et al., 2016), Mixup, Manifold Mixup (Verma et al., 2018), DropBlock (Golnaz et al., 2018), and CutMix. The results are reported from the best performance of each method. With the ResNet-50 architecture, SaliencyMix reduces the top-1 classification error by 1.67%, 1.32%, and 0.14% compared to Cutout, Mixup, and CutMix, respectively. With the ResNet-101 architecture, SaliencyMix outperforms the other methods in comparison and achieves the new best result of 20.09% top-1 error and 5.15% top-5 error.

4.2 Object Detection using Pre-Trained SaliencyMix

In this section, we use the SaliencyMix-trained model to initialize Faster RCNN (Shaoqing et al., 2015), which uses ResNet-50 as a backbone network, and examine its effect on the object detection task. The model is fine-tuned on the Pascal VOC 2007 and 2012 datasets and evaluated on the VOC 2007 test data using the mAP metric. We follow the fine-tuning strategy of the original method (Shaoqing et al., 2015); the batch size, learning rate, and number of training iterations are set accordingly, and the learning rate is decayed by a fixed factor during training. The results are shown in Table 3. Pre-training with CutMix and SaliencyMix significantly improves the performance of Faster RCNN, because in object detection, foreground information (positive data) is much more important than the background (Lin et al., 2017). Since SaliencyMix encourages the augmented images to contain more foreground (object) content than background, it leads to better detection performance. The SaliencyMix-trained model outperforms the other methods and achieves a performance gain of +1.77 mAP.
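As an illustration of this transfer step, the sketch below loads a SaliencyMix-pretrained ImageNet classification checkpoint into a torchvision ResNet-50 before it is used as the detector backbone; the checkpoint filename is hypothetical, and the Faster RCNN construction and fine-tuning themselves follow the recipe of Shaoqing et al. (2015).

```python
import torch
import torchvision

# Load ImageNet classification weights produced by SaliencyMix training
# into a standard ResNet-50 (the checkpoint path is illustrative).
backbone_cls = torchvision.models.resnet50(num_classes=1000)
state = torch.load("saliencymix_resnet50_imagenet.pth", map_location="cpu")
backbone_cls.load_state_dict(state)

# These pretrained weights then initialize the ResNet-50 backbone of a
# Faster R-CNN detector, which is fine-tuned on Pascal VOC 2007+2012
# following the original Faster R-CNN training recipe.
```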


Backbone Network | ImageNet Cls. Err., Top-1 (%) | Detection, F-RCNN (mAP)
ResNet-50 (Baseline) | 23.68 | 76.71 (+0.00)
Cutout-trained | 22.93 | 77.17 (+0.46)
Mixup-trained | 22.58 | 77.98 (+1.27)
CutMix-trained | 21.40 | 78.31 (+1.60)
SaliencyMix-trained | 21.26 | 78.48 (+1.77)
Table 3: Impact of the SaliencyMix-trained model on transfer learning to the object detection task. The results are reported from the average of three runs.

4.3 Class Activation Map (CAM) Analysis

To validate that SaliencyMix learns better feature representations and can recognize objects from partial views, we perform Class Activation Map (CAM) analysis (Zhou et al., 2016) following Yun et al. (2019). CAM identifies the regions of an input image on which the model focuses in order to recognize an object. Here we compare the CAM output of the proposed SaliencyMix with that of other data augmentation techniques.

We extract CAMs using vanilla ResNet-50 models pre-trained on ImageNet with the various augmentation techniques. Figure 5 shows a visual comparison of the CAM outputs for two mixed classes, 'Golden Retriever' and 'Tiger Cat'. Mixup is severely confused when trying to recognize an object, because the pixels of the two images are blended and class-specific features cannot be extracted. Cutout also suffers from the uninformative image region it introduces. On the other hand, both CutMix and SaliencyMix effectively focus on the corresponding features of the two classes and precisely localize the two objects in the scene.
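For reference, here is a minimal sketch of how a CAM can be computed for a (torchvision) ResNet-50: the final convolutional feature maps are weighted by the classifier weights of the class of interest. The input tensor and class index below are placeholders, and loading of the trained checkpoint is omitted.

```python
import torch
import torchvision

model = torchvision.models.resnet50().eval()    # load the augmentation-trained weights here

features = {}
def hook(_module, _inp, out):
    features["maps"] = out                      # (1, 2048, h, w) feature maps

model.layer4.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224)                 # placeholder input image tensor
class_idx = 0                                   # placeholder class index of interest
with torch.no_grad():
    model(x)
    maps = features["maps"][0]                  # (2048, h, w)
    weights = model.fc.weight[class_idx]        # (2048,) classifier weights for that class
    cam = torch.einsum("c,chw->hw", weights, maps)
    cam = torch.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```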


Figure 5: Class activation map (CAM) visualizations using various augmentation techniques. The first row shows the original images, the second row shows the augmented input images produced by the different methods, and the third and fourth rows show the CAMs for the 'Golden Retriever' and 'Tiger Cat' classes, respectively.

4.4 Robustness Against Adversarial Attack

Several recent studies (Szegedy et al., 2014; Goodfellow et al., 2015; Madry et al., 2017) have shown that deep learning based models are vulnerable to adversarial examples, i.e., they can be fooled by slightly modified examples even when the added perturbations are small and imperceptible. Data augmentation helps to increase robustness against adversarial perturbations, since it introduces many unseen image samples during training (Madry et al., 2017). Here we verify the adversarial robustness of models trained using the various data augmentation techniques and compare their effectiveness.

Similar to CutMix, we use the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) to generate the adversarial examples. We take the ImageNet pre-trained ResNet-50 models trained with the various data augmentation techniques in comparison, as described in Section 4.1.2, and check their robustness against adversarial attacks. Table 4 reports the top-1 accuracies of the various augmentation techniques on the adversarially attacked ImageNet validation set. Owing to its more appropriate feature representation learning and its focus on the overall object rather than a small part, SaliencyMix significantly improves robustness against adversarial attack and achieves a 1.96% improvement over the closest competing method, CutMix (Yun et al., 2019).
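Below is a minimal sketch of the FGSM evaluation described above; the perturbation budget eps and the data pipeline are placeholders, and inputs are assumed to be scaled to [0, 1].

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, eps=8.0 / 255):
    """Generate FGSM adversarial examples: x_adv = x + eps * sign(grad_x L)."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + eps * images.grad.sign()
    return adv.clamp(0.0, 1.0).detach()   # keep pixel values in the valid [0, 1] range

# Evaluation over an (assumed) ImageNet validation loader:
# correct = 0
# for images, labels in val_loader:
#     adv = fgsm_attack(model, images, labels)
#     correct += (model(adv).argmax(1) == labels).sum().item()
```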

4.5 Computational Complexity

We investigate the computational complexity of the proposed method and compare it with other data augmentation techniques in terms of training time. All models are trained on the CIFAR-10 dataset using the ResNet-18 architecture for 200 epochs. Table 5 presents the training time comparison. SaliencyMix requires a slightly longer training time than the others due to saliency map generation, but considering the performance improvement, this overhead is negligible.


Method | Baseline | Mixup | Cutout | CutMix | SaliencyMix
Top-1 Acc. (%) | 8.2 | 24.4 | 11.5 | 31.0 | 32.96
Table 4: Performance comparison on adversarial robustness. Top-1 accuracy (%) of various data augmentation techniques on the adversarially perturbed ImageNet validation set.

Method | Baseline | Cutout | Mixup | CutMix | SaliencyMix
Time (hours) | 0.83 | 0.84 | 0.87 | 0.89 | 0.91
Table 5: Training time comparison of various data augmentation techniques using the ResNet-18 architecture on the CIFAR-10 dataset.

5 Conclusion

We have introduced an effective data augmentation strategy, called SaliencyMix, that is carefully designed for training CNNs to improve their classification performance and generalization ability. The proposed SaliencyMix guides models to focus on the overall object regions rather than small regions of the input images and prevents the model from learning inappropriate feature representations by carefully selecting a representative source patch. It introduces a small computational overhead due to saliency detection, but significantly boosts model performance and strengthens model robustness on various computer vision tasks.

Applying SaliencyMix with WideResNet achieves new SOTA image classification top-1 errors of 2.76% and 16.56% on CIFAR-10 and CIFAR-100, respectively. On ImageNet classification, applying SaliencyMix with ResNet-50 and ResNet-101 obtains new SOTA top-1 errors of 21.26% and 20.09%, respectively. On object detection, using the SaliencyMix-trained model to initialize Faster RCNN (with ResNet-50 as the backbone network) and fine-tuning the detector improves performance by +1.77 mAP. Furthermore, the SaliencyMix-trained model is more robust against adversarial attacks and achieves a 1.96% accuracy improvement on the adversarially perturbed ImageNet validation set.

References

  • R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk (2009) Frequency-tuned salient region detection. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1597–1604.
  • Y. Bengio, F. Bastien, A. Bergeron, N. Boulanger-Lewandowski, T. Breuel, Y. Chherawala, M. Cisse, M. Côté, D. Erhan, J. Eustache, X. Glorot, X. Muller, S. P. Lebeuf, R. Pascanu, S. Rifai, F. Savard, and G. Sicard (2011) Deep learners benefit more from out-of-distribution examples. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 15, Fort Lauderdale, FL, USA, pp. 164–172.
  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
  • R. Cong, J. Lei, H. Fu, M. Cheng, W. Lin, and Q. Huang (2019) Review of visual saliency detection with comprehensive information. IEEE Transactions on Circuits and Systems for Video Technology 29 (10), pp. 2941–2959.
  • T. Devries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
  • M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman (2010) The Pascal Visual Object Classes (VOC) challenge. International Journal of Computer Vision 88, pp. 303–338.
  • R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2019) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations.
  • G. Golnaz, L. Tsung-Yi, and V. L. Quoc (2018) DropBlock: a regularization method for convolutional networks. In Neural Information Processing Systems (NeurIPS).
  • I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, pp. 1–10.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • X. Hou and L. Zhang (2007) Saliency detection: a spectral residual approach. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
  • G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In Lecture Notes in Computer Science, pp. 646–661.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning.
  • Itseez (2015) Open source computer vision library. https://github.com/itseez/opencv
  • L. Jonathan, S. Evan, and D. Trevor (2015) Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
  • C. Junsuk and S. Hyunjung (2019) Attention-based dropout layer for weakly supervised object localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2219–2228.
  • A. Krizhevsky, I. Sutskever, and G. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems (NeurIPS), pp. 1097–1105.
  • A. Krizhevsky (2012) Learning multiple layers of features from tiny images. University of Toronto.
  • Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
  • J. Lemley, S. Bazrafkan, and P. Corcoran (2017) Smart Augmentation: learning an optimal data augmentation strategy. IEEE Access 5, pp. 5858–5869.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2999–3007.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
  • S. Nitish, H. Geoffrey, K. Alex, S. Ilya, and S. Ruslan (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958.
  • R. Olga, D. Jia, S. Hao, K. Jonathan, S. Sanjeev, M. Sean, H. Zhiheng, K. Andrej, K. Aditya, B. Michael, C. B. Alexander, and F. Li (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand (2019) BASNet: boundary-aware salient object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • R. Shaoqing, H. Kaiming, G. Ross, and S. Jian (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NeurIPS).
  • K. Singh and Y. Lee (2017) Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization. pp. 3544–3553.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the Inception architecture for computer vision.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations.
  • D. Terrance and W. T. Graham (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint.
  • J. Tompson, R. Goroshin, A. Jain, Y. Lecun, and C. Bregler (2015) Efficient object localization using convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 648–656.
  • A. Toshev and C. Szegedy (2014) DeepPose: human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660.
  • V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, A. Courville, D. Lopez-Paz, and Y. Bengio (2018) Manifold Mixup: better representations by interpolating hidden states. In International Conference on Machine Learning.
  • B. Wang and P. Dudek (2014) A fast self-tuning background subtraction algorithm. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 401–404.
  • L. Wei, A. Dragomir, E. Dumitru, S. Christian, R. Scott, and C. B. Alexander (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
  • R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun (2015) Deep Image: scaling up image recognition.
  • B. Xiao, H. Wu, and Y. Wei (2018a) Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 466–481.
  • T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun (2018b) Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434.
  • S. Yang, W. Wang, C. Liu, and W. Deng (2019) Scene understanding in deep learning-based end-to-end controllers for autonomous vehicles. IEEE Transactions on Systems, Man, and Cybernetics: Systems 49 (1), pp. 53–63.
  • S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) CutMix: regularization strategy to train strong classifiers with localizable features. In International Conference on Computer Vision (ICCV).
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference 2016.
  • H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929.
  • Z. Zhun, Z. Liang, K. Guoliang, L. Shaozi, and Y. Yi (2017) Random erasing data augmentation. arXiv preprint.