Saliency Grafting: Innocuous Attribution-Guided Mixup with Calibrated Label Mixing

The Mixup scheme suggests mixing a pair of samples to create an augmented training sample and has recently gained considerable attention for improving the generalizability of neural networks. A straightforward and widely used extension of Mixup is to combine it with regional dropout-like methods: removing random patches from a sample and replacing them with features from another sample. Despite their simplicity and effectiveness, these methods are prone to creating harmful samples due to their randomness. To address this issue, 'maximum saliency' strategies were recently proposed: they select only the most informative features to prevent such a phenomenon. However, these strategies now suffer from a lack of sample diversity, as they always deterministically select the regions with maximum saliency, injecting bias into the augmented data. In this paper, we present Saliency Grafting, a novel yet simple Mixup variant that captures the best of both worlds. Our idea is two-fold. First, by stochastically sampling salient features and 'grafting' them onto another sample, our method effectively generates diverse yet meaningful samples. Second, we produce the label of the grafted sample by mixing the labels in a saliency-calibrated fashion, which rectifies the supervision misguidance introduced by the random sampling procedure. Our experiments on the CIFAR, Tiny-ImageNet, and ImageNet datasets show that our scheme outperforms the current state-of-the-art augmentation strategies not only in classification accuracy but also in coping with stress conditions such as data corruption and object occlusion.





1 Introduction

Modern deep neural networks (DNNs) have achieved unprecedented success in various computer vision tasks, e.g., image classification he2016deep, generation brock2018large, and segmentation he2017mask. However, due to their over-parameterized nature, DNNs require an immense amount of training data to generalize well to test data. Otherwise, DNNs are predisposed to memorize the training samples and exhibit lackluster performance on unseen data - in other words, they overfit.

Acquiring a sufficient amount of data for a given task is not always possible as it consumes valuable manpower and budget. One common approach to combat such data scarcity is data augmentation, which aims to enlarge the effective size of a dataset by producing virtual samples from the training data through means such as injecting noise amodei2016deepspeech or cropping out regions devries2017cutout. Datasets diversified with these augmented samples are shown to effectively improve the generalization performance of the trained model. Furthermore, data augmentation is proven to be effective not only for promoting generalization but also in boosting the robustness of a model hendrycks2019augmix and acquiring visual representations without human supervision chen2020simple; moco.

To this end, conventional augmentation methods have focused on creating new images by transforming a given image through means such as flipping and resizing. However, a recently proposed augmentation method called Mixup mixup introduced the idea of crafting a new sample out of a pair of samples by taking a convex combination of them. Inspired by this pioneering work, yun2019cutmix proposed CutMix, a progeny of Mixup and Cutout devries2017cutout, which crops a random region of an image and pastes it onto another. These methods are able to generate a wider variety of samples while effectively compensating for the information loss caused by actions such as cropping. However, the context-agnostic nature of these methods can create samples that are potentially harmful. Since the images are combined randomly without considering their contexts and labels, incorrect augmentation is destined to occur (see Figure 1(d)). For instance, an object can be cropped out and replaced by a different kind of object from another image, or the background part of an image can be pasted on top of an existing object. Even worse, their labels are naively mixed according only to their mixing proportions, disregarding any information transfer or loss caused by the data mixing. The harmfulness of semantically unaware label mixing was previously reported in adamixup. This mismatch between data and its supervision signal yields harmful samples.

Figure 1: Comparison of augmented samples generated by mixup-based augmentations. (a) Source and destination images to be used in augmentation. (b) Saliency Grafting produces diverse samples, including samples that do not contain the maximum saliency region. For all kinds of diverse samples, their labels are correctly rectified. (c) Deterministic saliency-based methods produce semantically plausible labels, but lack diversity since the maximum saliency region is always included. (d) CutMix generates diverse samples but produces misleading labels.

To address this problem, saliency-guided augmentation methods have recently been proposed attentivemix; puzzlemix; saliencymix. These approaches refrain from generating harmful samples by preserving the region of maximum saliency based on the saliency map of the image. Attentive CutMix attentivemix preserves the maximum saliency regions of the donor image by locating its top-N most salient patches and merging them on top of the acceptor image. SaliencyMix saliencymix constructs a bounding box around the maximum saliency region and crops this box to place it on top of the acceptor image. PuzzleMix puzzlemix tries to salvage the most salient regions of each image by mixing one into another, solving for an optimal transport plan and region-wise mixup ratios that maximize the saliency of the created sample. However, these precautionary measures sacrifice sample diversity - the very advantage of previous CutMix-based methods. Unlike CutMix, which teaches the model to attend to the whole object by probabilistically choosing diverse regions of the image, maximum saliency methods lose this feature: the most discriminative region is always included in the resulting image, biasing the model to depend on such regions. Moreover, they still overlook producing supervision that properly describes the augmented image, and use semantically inaccurate labels determined by the mixing ratio or the size of the pasted region, which can easily mislead the network (see Figure 1(c)).

To solve the drawbacks present in contemporary augmentation methods, we propose Saliency Grafting, a novel data augmentation method that can generate diverse and innocuous augmented data (see Figure 1(b)). Instead of blindly selecting the maximum saliency region, our method scales and thresholds the saliency map to grant all salient regions an equal chance. The selected regions are then sampled with a Bernoulli distribution to generate stochastic patches, which are 'grafted' on top of another image. Moreover, to compensate for side effects of grafting such as label mismatch, we propose a novel label mixing strategy: saliency-guided label mixing. By mixing the labels of the two images according to their saliency instead of their area, potential bad apples are effectively neutralized.

Our contribution is threefold:

  • We discuss the potential weaknesses of current Mixup-based augmentation strategies and present a novel data augmentation strategy that can generate diverse yet meaningful data through saliency-based sampling.

  • We present a novel label mixing method to calibrate the generated label to match the information contained in the newly generated data.

  • Through extensive experiments, we show that models trained with our method outperform others - even under data corruption or data scarcity.

2 Related work

Data augmentation

Image data augmentation has played a formidable role in the breakthroughs of deep-learning-based computer vision lecun1998gradient; alexnet; vgg. Recently, regional dropout methods such as Cutout devries2017cutout, DropBlock ghiasi2018dropblock, and Random Erasing zhong2020random were proposed to promote generalization by removing selected regions of an image or a feature map to diversify the model's focus. However, the removed regions are bound to suffer from information loss. The recently proposed Mixup mixup and its variants manifold; adamixup shifted the augmentation paradigm by not merely transforming a single sample but using a pair of samples to create a new augmented sample via convex combination. Although successful on multiple domains, Mixup misses opportunities when applied to images, as it cannot exploit their spatial locality. To remedy this issue, CutMix yun2019cutmix, a method combining Cutout and Mixup, was proposed. By cropping out a region and then filling it with a patch of another image, CutMix executes regional dropout with less information loss. However, in CutMix a new problem arises, as the random cut-and-paste strategy incurs semantic information loss and label mismatch. To fix this issue, methods exploiting maximum saliency regions were proposed. Attentive CutMix attentivemix selects the top-N salient regions to cut and paste onto another image. SaliencyMix saliencymix creates a bounding box around the maximum saliency region and pastes the box on another image. PuzzleMix puzzlemix takes a slightly different approach: it selects maximum saliency regions of the two images and solves a transportation problem to maximize the saliency of the mixed image. However, since the maximum saliency region is always retained, the model is deprived of the opportunity to learn from the challenging but beneficial samples that CutMix provides.

Saliency methods

In the neuroscience literature, koch1987shifts first proposed saliency maps as a means for understanding the attention patterns of the human visual cortex. As contemporary CNNs bear close resemblance to the visual cortex, it is plausible to adapt this tool to observe the inner workings of CNNs. These saliency techniques inspired by human attention are divided into two groups: bottom-up (backward) and top-down (forward) katsuki2014bottom. For backward methods, saliency is determined in a class-discriminative fashion: starting from the output of the network, the saliency signal is back-propagated from the label logit and attributed to the regions of the input image. simonyan2013deep; zhou2016cam; gradcam utilize the backpropagated gradients to construct saliency maps, while methods such as lrp; rap backpropagate saliency scores with carefully designed backpropagation rules that preserve the total saliency score across a selected layer. On the other hand, forward saliency techniques start from the input layer and accumulate the detected signals up the network. The accumulated signals are then extracted at a higher convolutional layer (often the last convolutional layer) to obtain a saliency map. Unlike backward approaches, forward methods are class-agnostic, as the convolutional layers extract features from all possible objects inside an image to support the final classifier. These maps are used in a variety of fields such as classification and transfer learning.


3 Preliminaries

Method               Augmentation function φ(x_i, x_j)              Label mixing function ψ(y_i, y_j)
Manifold Mixup       λ g(x_i) + (1 − λ) g(x_j)                      λ y_i + (1 − λ) y_j
Puzzle Mix           Λ ⊙ Π_i^⊤ x_i + (1 − Λ) ⊙ Π_j^⊤ x_j           λ y_i + (1 − λ) y_j
Saliency Grafting    Λ ⊙ x_i + (1 − Λ) ⊙ x_j                       λ̃ y_i + (1 − λ̃) y_j
Table 1: Overview of various mixed sample augmentations.

We first clarify the notation used throughout the section by describing a general form of Mixup-based augmentation procedures. Let f_θ be a Convolutional Neural Network (CNN) parametrized by θ. For a given batch of input data X and the corresponding labels Y, a mixed image is generated by the augmentation function φ and the corresponding label is created through the label mixing function ψ: x̃ = φ(x_i, x_j) and ỹ = ψ(y_i, y_j) for data index i and its random permutation index j.

Then, Mixup-based augmentation methods define their own φ as a pixel-wise convex combination of a randomly selected pair, as follows:

φ(x_i, x_j) = Λ ⊙ g(x_i) + (1 − Λ) ⊙ g(x_j),

where Λ is a mixing matrix controlled by a mixing ratio λ, ⊙ is the element-wise Hadamard product, and g is some pre-processing function.

The vanilla (input) Mixup defines the augmentation function with Λ = λ1 and g as the identity, i.e., φ(x_i, x_j) = λ x_i + (1 − λ) x_j. Manifold Mixup uses a similar function but with the latent features. In CutMix, Λ is a binary rectangular mask M: φ(x_i, x_j) = M ⊙ x_i + (1 − M) ⊙ x_j. This method randomly cuts a rectangular region from the source image with area proportional to λ and pastes it onto the destination image. PuzzleMix, a recent saliency-based Mixup variant, employs the augmentation function φ(x_i, x_j) = Λ ⊙ Π_i^⊤ x_i + (1 − Λ) ⊙ Π_j^⊤ x_j. This method exploits the image transportation plans Π and a region-wise mask matrix Λ to maximize the saliency of the mixed image. Note that unlike the vanilla Mixup, Λ is a discretized region-wise mixing matrix whose entries average to λ for a given mixing ratio λ. To find the optimal transportation plan and region-wise mask for the maximum saliency, PuzzleMix solves additional optimization problems in an alternating fashion per iteration.
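To make the general form concrete, the following is an illustrative NumPy sketch (our own, not the authors' code) of the vanilla Mixup and CutMix augmentation functions; images are single-channel arrays and the helper names are hypothetical:

```python
import numpy as np

def mixup(x_i, x_j, lam):
    """Vanilla Mixup: pixel-wise convex combination with scalar ratio lam."""
    return lam * x_i + (1.0 - lam) * x_j

def cutmix(x_i, x_j, lam, rng):
    """CutMix: paste a random rectangle of x_i (area ratio ~ lam) onto x_j.

    Images are (H, W) arrays here for simplicity; channels are analogous.
    Returns the mixed image and the binary mixing matrix Lambda.
    """
    h, w = x_j.shape
    # A rectangle with area ratio lam has sides scaled by sqrt(lam).
    rh, rw = int(h * np.sqrt(lam)), int(w * np.sqrt(lam))
    top = rng.integers(0, h - rh + 1)
    left = rng.integers(0, w - rw + 1)
    mask = np.zeros((h, w))                    # binary mixing matrix
    mask[top:top + rh, left:left + rw] = 1.0
    return mask * x_i + (1.0 - mask) * x_j, mask
```

Note how both fit the template Λ ⊙ x_i + (1 − Λ) ⊙ x_j: Mixup uses a constant Λ = λ, while CutMix uses a binary rectangular Λ.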

Although it is a simpler scalar function, the label mixing function ψ is defined in a form similar to the augmentation function φ:

ψ(y_i, y_j) = λ̃ y_i + (1 − λ̃) y_j,

where λ̃ is a label mixing coefficient determined by the sample pair and the mixing ratio λ from φ. However, in all methods mentioned above, this λ̃ simply depends on λ, disregarding the contents of the sample pair x_i and x_j: λ̃ = λ. Table 1 summarizes φ and ψ for the augmentation methods described above.

Figure 2: Overview of Saliency Grafting. The source and destination images are drawn from the mini-batch and fed forward through the training network, producing their respective forward saliency maps. The source saliency map is thresholded and sampled using a region-wise i.i.d. Bernoulli distribution. Patches of the source image that correspond to the sampled mask are grafted onto the destination image.

4 Saliency Grafting

We now describe our simple approach, Saliency Grafting, that creates diverse and innocuous Mixup augmentation based on the content of instances being merged. Two key innovations in Saliency Grafting are stochastic patch selection (Section 4.1) and label mixing (Section 4.2), both of which utilize the saliency information at the core. Last but not least, another important element of Saliency Grafting is choosing a saliency map generation method (Section 4.3) for the above two main components while keeping the learning cost to a minimum. The overall procedure is described in Figure 2. Now we discuss the details of each component in the subsequent subsections.

4.1 Stochastic salient patch selection

The stochastic patch selection of Saliency Grafting aims to choose regions that can create diverse and meaningful instances. The key question here is how to select regions to be grafted, given a saliency matrix S for the source image (whose element S_uv indicates the saliency of region (u, v) of the image). As in recent studies puzzlemix; attentivemix, if only the regions with high intensity of S are always selected, then these regions - which are already easy for the model to judge - are continuously augmented throughout the iterative training procedure. As a result, the model is repeatedly exposed to the same grafting patch, which iteratively amplifies the model's attention on the selected regions and deprives it of the opportunity to learn how to attend to other parts and structures of the object.

In order to eliminate this selection bias, the patch selection of Saliency Grafting consists of two steps: i) softmax thresholding and ii) stochastic sampling.

Softmax thresholding

To neutralize the selection bias due to the intensity of saliency, we normalize the saliency map by applying the softmax function and then binarize the map with some threshold τ:

Ŝ = softmax(S / T),    B_uv = 1[Ŝ_uv > τ],

given the temperature hyperparameter T that controls the sharpness of the normalized saliency map. Here, the threshold τ admits a variety of options, but we adopt the mean value of the normalized saliency map, τ = mean(Ŝ).
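A minimal sketch of the softmax thresholding step, under the assumption that the saliency map is a small 2-D array of region scores (the function name and array layout are ours):

```python
import numpy as np

def softmax_threshold(saliency, temperature=1.0):
    """Normalize a region-wise saliency map with a temperature-scaled
    softmax, then binarize at the mean of the normalized map."""
    s = saliency.ravel() / temperature
    s = np.exp(s - s.max())          # numerically stable softmax
    s = s / s.sum()
    s = s.reshape(saliency.shape)
    tau = s.mean()                   # threshold = mean of normalized map
    return (s > tau).astype(np.float32)
```

A high temperature flattens the normalized map, so more regions clear the mean threshold; a low temperature concentrates mass on the most salient regions.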

Stochastic sampling

Although the selection bias is significantly mitigated by thresholding, the high-intensity regions are never removed, as the softmax function preserves the order of the regions. To address this issue, we stochastically sample the grafting regions based on the binarized saliency map produced above. The final mixing matrix Λ is constructed by taking the Hadamard product of B and a region-wise i.i.d. random Bernoulli matrix R of the same dimensions: Λ = B ⊙ R, with R_uv ~ Bernoulli(p). Here, the batch-wise sampling probability p is drawn from a Beta distribution, p ~ Beta(α, α). The final augmentation function for Saliency Grafting is φ(x_i, x_j) = Λ ⊙ x_i + (1 − Λ) ⊙ x_j.
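The sampling step can be sketched as follows; for simplicity each 'region' here is a single pixel, whereas the paper samples region-wise (the function name and defaults are ours):

```python
import numpy as np

def graft(x_src, x_dst, binary_map, alpha=2.0, rng=None):
    """Stochastically graft salient regions of x_src onto x_dst.

    binary_map: thresholded saliency map with values in {0, 1}.
    The final mixing matrix is binary_map * R, where R is an i.i.d.
    Bernoulli matrix whose probability p is drawn once from Beta(alpha, alpha).
    """
    rng = rng or np.random.default_rng()
    p = rng.beta(alpha, alpha)                        # batch-wise probability
    bern = rng.binomial(1, p, size=binary_map.shape)  # per-region Bernoulli draws
    mix = binary_map * bern                           # final mixing matrix Lambda
    return mix * x_src + (1.0 - mix) * x_dst, mix
```

Because even the most salient region survives only with probability p, no region is deterministically included, which is the source of the method's sample diversity.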

4.2 Calibrated label mix based on saliency maps

In addition to the method of grafting diverse and innocuous augmentations described in the previous section, attaching an appropriate supervision label to the generated data is also at the core of Saliency Grafting. To highlight the drawbacks of the existing label mixing strategy used in all baselines, suppose (in an admittedly extreme case) that a source image x_i is combined with a destination image x_j, both of which have their saliency concentrated in some small region. Suppose further that this region of x_i is selected and grafted onto the region where the evidence for the destination's original class y_j is concentrated. Then, most of the information of class y_i is retained while most of the information on class y_j is lost. However, if the label is determined in proportion to the mixing rate or the size of the pasted area, as all the baselines do, the generated label will be close to class y_j, since most of the area originally came from the destination image x_j.

To tackle this issue, we propose a novel label mixing procedure that adaptively re-mixes the labels based on the saliency maps. Regarding the destination image receiving the graft, the ground-truth label is penalized according to the degree of occlusion. Specifically, the importance of the destination image, ω_j = ‖(1 − Λ) ⊙ S_j‖_1, is calibrated using the saliency values of the part not occluded by the source image. (We use the ℓ1 norm to define importance in the sense that the overall saliency is simply the sum of the saliency in each region, but similar importance measures can be obtained with other norms.) On the other hand, with regard to the source image giving the graft, the corresponding label is compensated in proportion to the importance of the selected region: ω_i = ‖Λ ⊙ S_i‖_1.

The final label mixing ratio is computed from the relative importance of ω_i and ω_j, so that the coefficients sum to 1, defining the calibrated label mixing function ψ(y_i, y_j) = λ̃ y_i + (1 − λ̃) y_j with λ̃ = ω_i / (ω_i + ω_j).


4.3 Saliency map generation

Technically, Saliency Grafting can be combined with various saliency generation methods, without dependence on a specific one. However, the caveat here is that the performance of Saliency Grafting is, by design, highly affected by the quality of the saliency map, or how accurately the saliency map corresponds to the ground-truth label. From this point of view, forward saliency methods, which incur fewer false negatives, may support Saliency Grafting more stably than backward methods (see Section 2 for forward and backward saliency methods). We also provide a performance comparison in Appendix A. This is because backward methods are likely to break down and exclude true salient regions when the model fails to predict the true label, whereas forward methods preserve all the feature maps inside the saliency map, i.e., they act like a class-agnostic saliency detector visualizing.

In an environment without a separate pre-trained model, another advantage of forward saliency emerges: saliency maps can be naturally constructed from terms already computed in the learning process. In this setting, since the generated maps can be noisy in the early phases of training, we employ warmup epochs without data augmentation.

We now describe our choice for generating the saliency maps that guide the augmentation process. We adopt the channel-collapsed absolute feature map of the model as our saliency map, mainly due to its simplicity: S = Σ_c |A_c^(l)|, where A^(l) is the feature map at the l-th layer and c indexes its channels. Although it is possible to extract saliency maps from any designated layer in the network, we extract the maps from the last convolutional layer, as it generally conveys high-level spatial information bengio2013representation. In practice, we randomly select the up/down-sampling scale of the saliency maps per mini-batch.
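A sketch of this forward saliency map with a simple nearest-neighbor rescaling of the region grid (the resampling choice is ours; the paper only specifies random up/down-sampling scales):

```python
import numpy as np

def forward_saliency(feature_map, out_hw):
    """Channel-collapsed absolute feature map as a class-agnostic saliency map.

    feature_map: (C, H, W) activations from a (late) convolutional layer.
    Collapses channels by summing absolute values, then nearest-neighbor
    resamples to the desired region grid out_hw = (H', W').
    """
    s = np.abs(feature_map).sum(axis=0)          # (H, W) saliency
    h, w = s.shape
    rows = np.arange(out_hw[0]) * h // out_hw[0]
    cols = np.arange(out_hw[1]) * w // out_hw[1]
    return s[np.ix_(rows, cols)]                 # nearest-neighbor resampling
```

Because the map is built from activations already computed in the forward pass, it adds essentially no cost on top of ordinary training.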

5 Experiments

We conduct a collection of experiments to test Saliency Grafting against other baselines. First, we test the prediction performance on standard image classification datasets. Next, to confirm our claim that Saliency Grafting can safely boost the diversity of augmented data, we design and conduct experiments to assess the sample diversity of each augmentation method. We also conduct multiple stress tests to measure the enhancement in generalization capability. Finally, we perform an ablation study to investigate the contribution of each sub-component of Saliency Grafting. Note that we train the models with both original and augmented images.

5.1 Classification tasks


We evaluate Saliency Grafting on the CIFAR-100 dataset cifar using two neural networks: PyramidNet-200 with the widening factor from pyramidnet and WRN28-10 wideresnet. For PyramidNet-200, we follow the experimental setting of yun2019cutmix, which trains PyramidNet-200 for 300 epochs. The baseline results on PyramidNet-200 are as reported in yun2019cutmix. For WRN28-10, the network is trained for 400 epochs, following prior studies puzzlemix; manifold. In this experiment, we reproduce the other baselines following the original setting of each paper. Detailed settings are provided in Appendix B. As shown in Table 2 and Table 3, Saliency Grafting exhibits significant improvements for both architectures compared to the other baselines. Furthermore, when used together with ShakeDrop regularization shakedrop, Saliency Grafting achieves an additional enhancement: a 13.05% Top-1 error.

PyramidNet-200 (# params: 26.8 M)   Top-1 Error (%)   Top-5 Error (%)
Vanilla                             16.45             3.69
Cutout                              16.53             3.65
DropBlock                           15.73             3.26
Mixup                               15.63             3.99
Manifold Mixup                      16.14             4.07
ShakeDrop                           15.08             2.72
Cutout + Mixup                      15.46             3.42
Cutout + Manifold Mixup             15.09             3.35
CutMix                              14.47             2.97
CutMix + ShakeDrop                  13.81             2.29
Attentive CutMix (N = 6)            15.24             3.46
SaliencyMix                         14.74             3.07
PuzzleMix                           14.78             3.08
Saliency Grafting                   13.94             2.79
Saliency Grafting + ShakeDrop       13.05             2.18

Table 2: Error rates on CIFAR-100 for PyramidNet-200 in comparison to state-of-the-art regularization methods. The experiment was performed three times and the averaged best error rates are reported.

WRN28-10 (# params: 36.5 M)   Top-1 Error (%)   Top-5 Error (%)
Vanilla                       20.74             5.70
Mixup                         17.59             5.18
Manifold Mixup                18.04             -
CutMix                        17.47             4.80
AugMix                        19.19             4.36
SaliencyMix                   16.38             3.62
SaliencyMix (w/ dropout)      16.23             -
PuzzleMix                     16.00             3.84
Saliency Grafting             15.32             3.54

Table 3: Error rates on CIFAR-100 for WRN28-10 in comparison to data augmentation methods. The experiment was performed three times and the averaged best error rates with standard errors are reported; entries without Top-5 results are as reported in the original papers.


We evaluate our method on another benchmark dataset, Tiny-ImageNet tinyimagenet. We train ResNet-18 resnet for 600 epochs and report the converged error rates of the last 10 epochs, following one of the Tiny-ImageNet experimental settings in puzzlemix. The other data augmentation methods are evaluated using their author-released code and hyperparameters. Detailed experimental settings are described in Appendix B. The obtained results are shown in Table 4. In line with the CIFAR-100 experiments, Saliency Grafting consistently exhibits the best performance on this benchmark dataset.

ResNet-18 (# params: 11.3 M)   Top-1 Error (%)   Top-5 Error (%)
Vanilla                        38.54             18.53
Mixup                          37.37             18.09
CutMix                         35.76             15.82
SaliencyMix                    36.61             16.31
PuzzleMix                      35.79             16.31
Saliency Grafting              35.16             15.02

Table 4: Error rates on Tiny-ImageNet for ResNet-18 in comparison to data augmentations. The experiment was performed three times and the converged error rates with standard errors are reported.


For the ImageNet imagenet experiment, we train ResNet-50 for 100 epochs. We follow the training protocol in Wong2020Fast, which includes a cyclic learning rate, regularization on batch normalization layers, and mixed-precision training. This protocol also gradually resizes images during training, beginning with larger batches of smaller images and moving on to smaller batches of larger images later (the image-resizing policy). The baseline results are as reported in puzzlemix. Detailed experimental settings are described in Appendix B. As shown in Table 5, Saliency Grafting again achieves the best performance in both Top-1/Top-5 error rates. We confirm that ours brings further performance improvement even without image-resizing scheduling.

ResNet-50 (# params: 25.6 M)                    Top-1 Error (%)   Top-5 Error (%)
Vanilla                                         24.31             7.34
Mixup                                           22.99             6.48
Manifold Mixup                                  23.15             6.50
CutMix                                          22.92             6.55
AugMix                                          23.25             6.70
PuzzleMix                                       22.49             6.24
Saliency Grafting                               22.35             6.19
Saliency Grafting (w/o image-resizing policy)   22.26             6.29

Table 5: Comparison of state-of-the-art data augmentation methods on ImageNet dataset.

Additional experiments

Due to the space constraint, three additional experiments are deferred to Appendix A. The first shows that Saliency Grafting is useful for a speech dataset beyond the image classification task, and the second (weakly supervised object localization) implies that the final model trained with Saliency Grafting contains more useful saliency information.

5.2 Sample diversity

Generating augmented data k-times

We design an intuitive experiment to compare Saliency Grafting and other augmentation methods in terms of sample diversity. For every iteration, each method trains the network by generating additional augmented data k times from the mini-batch; each method tries to diversify the mini-batch by producing k independent augmented batches with its own randomness. To ensure sufficient diversity, the mixing ratio is newly sampled for each augmented batch. Varying k from 1 to 6, we evaluate whether each method can obtain a performance gain from sample diversity. We train WRN28-10 for 200 epochs and use 20% of the CIFAR-100 dataset to better isolate the diversity effect of the augmented data. In Figure 3, the performance of Saliency Grafting consistently improves as k increases, whereas PuzzleMix, one of the representative maximum saliency strategies, shows no gain in performance even when k increases. We believe this is direct evidence that generating multiple augmented instances by sampling the random mixing ratio is insufficient to ensure sample diversity in the case of maximum saliency approaches puzzlemix; attentivemix. In contrast, since Saliency Grafting exploits temperature-scaled thresholding with stochastic sampling, the model learns to attend to the entire object as k increases, and the augmented data can be properly supervised through calibrated label mixing. Hence, sample diversity is guaranteed without sacrificing innocuity.

Figure 3: Comparison of sample diversity by generating k-times augmented data from a mini-batch. The experiment was performed five times and the averaged best error rates are reported.

5.3 Stress testing

Data scarcity

The situation where data augmentation is most needed is when data is scarce. In this condition, it is important to improve the generalization performance by increasing the data volume while preventing overfitting. To this end, we test our method against data scarcity by reducing the number of data per class to 50%, 20%, and 10%, with the WRN28-10 model on the CIFAR-100 dataset. In Table 6, Saliency Grafting exhibits the best performance in every condition. These results are in line with the fact that, as investigated by rolnick2017deep, corrupted labels severely degrade the performance when data is scarce. Since our method exploits adaptive label mixing to reduce the mismatch between data and labels while maintaining the diversity to prevent overfitting, the generalization performance can be enhanced even in extreme data scarcity conditions.

# of data per class 50 (10 %) 100 (20 %) 250 (50 %)
Vanilla 59.96 44.29 31.19
Mixup 51.29 38.80 27.11
CutMix 52.90 38.76 26.58
PuzzleMix 54.69 38.66 26.69
Saliency Grafting 51.01 37.35 25.56
Table 6: Top-1 error rates on the CIFAR-100 with reduced number of data per class. The experiment was performed three times.

Partial occlusion of salient regions

We demonstrated in Section 5.2 how existing saliency-guided augmentations fail to diversify the given data. These methods are designed to always preserve the maximum saliency region, but this strategy harms generalization by training the model to 'expect' such a region (injecting bias), which is not the case outside the lab, where objects can be partially occluded. This forfeits the diversification effects of their crop-and-mix strategy, degrading performance. To expose the dataset bias induced by previous saliency augmentations, we conduct an 'occlusion experiment', where we remove the most salient regions from the images and then evaluate. Table 7 shows that as the occluded area gets larger, the other methods degrade faster than ours due to bias injection, while Saliency Grafting scores the highest as stochastic sampling removes the bias.
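As the paper does not spell out the occlusion mechanics, the following is only a plausible sketch of such a stress test, zeroing out a chosen fraction of the most salient pixels (the function name and zero-fill choice are our assumptions):

```python
import numpy as np

def occlude_most_salient(image, saliency, fraction):
    """Zero out the top `fraction` most-salient pixels of an image,
    mimicking an occlusion stress test on salient regions."""
    flat = saliency.ravel()
    k = int(len(flat) * fraction)
    out = image.copy().ravel()
    if k > 0:
        idx = np.argpartition(flat, -k)[-k:]   # indices of the k most salient pixels
        out[idx] = 0.0
    return out.reshape(image.shape)
```

Evaluating a trained classifier on images processed this way probes how much it relies on the single most discriminative region.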

Method              Top-1 Error (%)
SaliencyMix         36.61   44.73   55.89
PuzzleMix           35.79   50.91   66.23
Saliency Grafting   35.16   42.98   52.19
Table 7: Top-1 error rates on Tiny-ImageNet with the most salient regions removed; the occluded area increases from left to right.

5.4 Ablation study

Stochastic selection VS deterministic selection

In Section 4.1, we argued that the deterministic region selection process of existing maximum saliency methods attentivemix; puzzlemix; saliencymix leads to performance degradation. This was partly shown in Tables 2, 4, and 5, where such methods perform worse than CutMix. Here, we directly study the contribution of stochastic selection. We measure the classification accuracy on CIFAR-100 with two architectures, where the deterministic top-N selection of Attentive CutMix attentivemix is replaced by our stochastic selection. For a fair comparison, the softmax temperature is adjusted so that the expected number of selected regions matches. The results show that stochastic selection indeed outperforms deterministic selection (Table 8).

Effect of threshold

Here, we conduct an experiment in which we vary the saliency threshold τ. Figure 4 shows that as we lower τ below the normalized saliency mean, non-salient regions are introduced and the performance degenerates. At τ = 0, SG becomes saliency-agnostic (near-equivalent to the CutMix strategy), and the performance of SG converges to the vicinity of CutMix.

Figure 4: Effect of the threshold τ on CIFAR-100 with WRN28-10.

Label mixing strategies

In Section 4.2, we discussed the pitfalls of naive area-based label mixing and proposed saliency-based label mixing as a solution. Here, we compare the two strategies. We experiment on CIFAR-100 with two architectures and replace the mixing strategy of Saliency Grafting with area-based mixing. Results in Table 8 confirm that saliency-based mixing outperforms area-based mixing.

Saliency Grafting (SG) variant WRN28-10 PyramidNet-200
Top-1 Error (%) Top-1 Error (%)
Deterministic + area labels 16.34 14.63
Stochastic + area labels 15.67 14.14
Stochastic + saliency labels 15.32 13.94
Table 8: Top-1 error rates on CIFAR-100 for WRN28-10 and PyramidNet-200.
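The difference between the two label-mixing strategies can be sketched as follows (illustrative code; the function name, one-hot setup, and toy shapes are our assumptions):

```python
import numpy as np

def mixed_label(y_a, y_b, mask, sal_a, sal_b, by="saliency"):
    """Mix labels for a sample whose `mask` regions were grafted from
    image B onto image A.
    area:     lambda_B = fraction of area coming from B
    saliency: lambda_B = fraction of total saliency mass coming from B"""
    if by == "area":
        lam_b = mask.mean()
    else:
        total = np.where(mask, sal_b, sal_a).sum()
        lam_b = np.where(mask, sal_b, 0.0).sum() / total
    return (1.0 - lam_b) * y_a + lam_b * y_b
```

If a large but low-saliency patch of B is grafted, area-based mixing overweights B's label, whereas saliency-based mixing keeps B's label weight proportional to the information actually transferred.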

6 Conclusion

We have presented Saliency Grafting, a data augmentation method that generates diverse yet saliency-guided samples via stochastic sampling and neutralizes the induced data–label mismatch with saliency-based label mixing. Through extensive experiments, we have shown that models equipped with Saliency Grafting outperform existing Mixup-based data augmentation techniques under both normal and extreme conditions while using fewer computational resources.

7 Acknowledgements

This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)) and National Research Foundation of Korea (NRF) grants (2019R1C1C1009192). This work was also partly supported by the KAIST-NAVER Hypercreative AI Center.


Appendix A Additional experiments

Speech data

To test our method on data outside the distribution of natural images, we evaluate it on the Google Speech Commands dataset speechcommand. The training samples are first augmented in the time domain by applying random changes in amplitude, speed, and pitch, and in the frequency domain by stretching and time-shifting the spectrogram. Then, random background-noise clips drawn from the noise compilation in the dataset are added to the samples. Finally, the samples are transformed into 32×32 mel-spectrograms using 32 MFCC filters. We use the WRN28-10 architecture for this evaluation. As shown in Table 9, our method outperforms the other methods in a non-natural image domain.

WRN28-10 Top-1
(# params: 36.5 M) Error (%)
Vanilla 2.81
Mixup 2.72
CutMix 2.62
Saliency Grafting 2.51

Table 9: Top-1 error rates on Google Speech Commands in comparison to other augmentation methods. The experiment was performed three times and the test error rates with standard errors are reported.

Weakly supervised object localization

To examine how our method affects the backward saliency of a model (how the model ‘thinks’), we measure weakly supervised object localization performance on the CUB200-2011 dataset wah2011caltech. We follow the experimental protocol of yun2019cutmix, except for the use of an ImageNet-pretrained network. For ResNet-50, we slightly modify the last convolution layer to increase the feature map size from 7×7 to 14×14. We first obtain the backward saliency map with CAM zhou2016cam. The map is then thresholded at 15% of its maximum value and enclosed by the smallest possible bounding box. We measure the Intersection-over-Union (IoU) between this estimated bounding box and the ground-truth bounding box. For the localization of a single image to be counted as correct, the IoU between the estimated and ground-truth boxes must be greater than 0.5 and, simultaneously, the predicted class label must be correct. We use the Adam optimizer; the initial learning rate, weight decay, and batch size are 0.001, 0.0001, and 32, respectively. The initial learning rate of the last fully connected layers is set to 0.01. The learning rate is decayed by a factor of 0.1 every 150 epochs. All experiments were performed three times and the averaged localization accuracies are reported.

Method Loc Acc (%)
ResNet-50 + CAM
ResNet-50 + Mixup 32.94
ResNet-50 + CutMix 27.95
ResNet-50 + PuzzleMix 35.43
ResNet-50 + Saliency Grafting 38.58
Table 10: Performance of weakly supervised object localization on the CUB200-2011 dataset.
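The CAM-thresholding and IoU steps of this protocol can be sketched as follows (a simplified numpy version; the function names and `(x0, y0, x1, y1)` box convention are our own):

```python
import numpy as np

def cam_to_bbox(cam, ratio=0.15):
    """Threshold a CAM at `ratio` of its max and return the tightest
    (x0, y0, x1, y1) box enclosing the surviving activations."""
    ys, xs = np.nonzero(cam >= ratio * cam.max())
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

def iou(a, b):
    """Intersection-over-Union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)
```

A localization then counts as correct when `iou(pred, gt) > 0.5` and the class prediction is also right.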

Forward saliency vs. backward saliency

To support our choice of forward saliency maps in Section 4.3 of the main paper, we conduct an additional experiment on CIFAR-100 with WRN28-10 where the forward saliency map of Saliency Grafting is replaced by CAM zhou2016cam, a backward saliency map. The detailed settings are kept identical to Section 5.1 of the main paper (refer to Appendix B.1). Results show that the classification error increases when a backward saliency map is used (Table 11).

Method Top-1 Top-5
Error (%) Error (%)
Backward (CAM) 15.70 3.8
Forward (ours) 15.32 3.54
Table 11: Top-1/Top-5 error rates on CIFAR-100 for WRN28-10.

Sensitivity to temperature T

The threshold value of our method is determined by the mean of the temperature-scaled saliency map. Note that the number of saliency regions above this expectation depends on the temperature T. As T decreases, the softmax distribution becomes sharper and fewer regions exceed the expectation; that is, the mixing regions are selected from a smaller pool. Conversely, as T increases, the distribution flattens so that nearly half of the regions lie above the threshold. To assess the sensitivity of model performance to the softmax temperature, we conducted an additional experiment on CIFAR-100 with ResNet-18, increasing the temperature from 0.01 to 0.30 (Figure 5). With a very small T, such as 0.01, only a few regions are mixed, resulting in a relatively small performance improvement. As we raise the temperature, the number of participating regions grows, yielding a large performance gain. Once the temperature is sufficiently high, enough regions can participate in the mix, so further increasing the temperature plateaus the performance.

Figure 5: Saliency Grafting’s sensitivity to temperature .
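This behavior is easy to verify numerically. The sketch below (our own toy example; the flattened 14×14 region count is an assumption for illustration) counts how many regions clear the uniform mean 1/N as the temperature varies:

```python
import numpy as np

def frac_above_mean(saliency, T):
    """Fraction of regions whose temperature-scaled softmax probability
    exceeds the uniform mean 1/N."""
    z = saliency / T
    z -= z.max()                 # numerical stability
    p = np.exp(z)
    p /= p.sum()
    return float((p > 1.0 / p.size).mean())

rng = np.random.default_rng(0)
s = rng.random(196)              # e.g. a flattened 14x14 saliency map
```

A small `T` sharpens the softmax so only a handful of regions clear the mean; a large `T` flattens it so roughly half do, mirroring the trend in Figure 5.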

Appendix B Detailed experimental settings

B.1 CIFAR-100 Classification

We use stochastic gradient descent (SGD) with a momentum of 0.9 for both network models. For each mini-batch, the mixing ratio is sampled from Beta(1,1) for Mixup and CutMix. For Manifold Mixup, we adopt Beta(2,2) as the sampling distribution, following the original paper. PuzzleMix has four hyperparameters: the label smoothness term, the data smoothness term, the prior term, and the transport cost; we use the values recommended by the authors. For the classification task, our method uses a fixed softmax temperature and Beta(2,2) for stochastic sampling. For early convergence, we warm up the model for 5 epochs. Since the weight decay described by each augmentation method differs on the CIFAR dataset, we report the better result between weight decays of 0.0005 and 0.0001. For the PyramidNet-200 network, the initial learning rate is set to 0.25 and decayed by a factor of 0.1 at epochs 150 and 225. For the WRN28-10 network, the initial learning rate is set to 0.2 and decayed by a factor of 0.1 at epochs 200 and 300. All experiments were performed three times on two TITAN Xp GPUs, and the averaged best error rates are reported.

B.2 Tiny-ImageNet Classification

For Tiny-ImageNet, we train the ResNet-18 model for 600 epochs using images resized to 64×64. As in the CIFAR-100 experiments, we use a fixed softmax temperature and Beta(2,2) for stochastic sampling. We randomly down-sample the resolution of the saliency map to one of {4×4, 8×8} to support multi-scale saliency. We also warm up the model for 5 epochs. The other data augmentation baselines are evaluated with the authors' hyperparameters, as described in Section B.1. We use the SGD optimizer with momentum 0.9 and weight decay 0.0001. The initial learning rate is set to 0.2 and decayed by a factor of 0.1 at epochs 300 and 450. All experiments were performed three times on one TITAN Xp GPU, and the converged error rates with standard errors over the last 10 epochs are reported, following puzzlemix.

B.3 ImageNet Classification

For ImageNet, we follow the training process of Wong2020Fast; puzzlemix. We train the ResNet-50 model for 100 epochs on 4 RTX TITAN GPUs. Specifically, we use cyclic learning rate scheduling, mixed-precision training, and weight decay regularization of the batch normalization layers. Moreover, this protocol progressively resizes images during the training phase. We test our method under both settings (with and without the image-resizing policy). Our method adopts a fixed softmax temperature, and the sampling probabilities are drawn from the temperature-scaled saliency distribution. We randomly up/down-sample the resolution of the saliency map to one of {4×4, 7×7, 8×8}. We use the SGD optimizer with an initial learning rate of 0.1, momentum of 0.9, and weight decay of 0.0001. For the fixed-size setting (without the image-resizing policy), the image size is fixed and the batch size is 256. We warm up the model for 5 epochs.

Appendix C Examples

Figure 6: Comparison of diversity between Saliency Grafting and PuzzleMix images.