1 Introduction
Data augmentation (DA) methods synthetically expand a dataset by applying transformations on the available examples, with the goal of reducing overfitting and improving generalization in models trained on these datasets. In computer vision, conventional DA techniques are typically based on random geometric (translation, rotation and flipping) and photometric (contrast, brightness and sharpness) transformations
[simonyan2015very, lecun1998gradient, cubuk2019autoaugment, cubuk2020randaugment]. While these techniques are already effective, they merely create slightly altered copies of the original images and thus introduce limited diversity into the augmented dataset. A more advanced class of DA methods [zhang2018mixup, yun2019cutmix] combines multiple training examples into a new image-label pair. By augmenting both the image and the label space simultaneously, such approaches greatly increase the diversity of the augmented set. Consequently, they substantially improve model generalization without any efficiency overhead, due to their simplicity. Nonetheless, these DA approaches are agnostic to image semantics; they ignore object location cues and, as a result, may produce ambiguous scenes with occluded distinctive regions (see Figure 1, Mixup [zhang2018mixup] and CutMix [yun2019cutmix]). To account for such shortcomings, a new line of work [kim2020puzzle, kim2020co, gong2021keepaugment, uddin2020saliencymix] proposes to explicitly use visual saliency [simonyan2013deep] for data augmentation. Typically, a saliency map contains information about the importance of different image regions for the downstream task. As a result, saliency maps implicitly contain information about objects, their locations and, crucially, the "informativeness" of image regions. Previous methods [kim2020puzzle, kim2020co, uddin2020saliencymix] take full advantage of the saliency information and formulate data augmentation as a saliency maximization problem. Given training image patches, their augmentation "assembles" a new image of high visual saliency. This approach greatly improves test accuracy; however, it comes with a large computation overhead due to the need to maximize saliency at every training step.
Moreover, as the augmented images are composed of patches, the resulting scenes are often unrealistic (see Puzzle Mix, CoMixup and SaliencyMix in Figure 1), which leads to poor out-of-distribution generalization, as shown later in our experiments. In summary, existing data augmentation techniques can either i) boost the test accuracy, or ii) produce a robust model with little computational overhead; no existing method does both.
To address the aforementioned drawbacks, we propose a new augmentation, Saliency-Guided Mixup with Optimal Rearrangements (SAGE), that provides both high accuracy and robustness with minimal computation overhead. SAGE is a simple and effective DA technique that uses visual saliency to perform optimal image blending at each spatial location, and optimizes the relative image positions such that the resulting visual saliency is maximized. Given two images and their saliency maps, SAGE mixes the images together such that, at each spatial location, the contribution of each image to the mix is proportional to its saliency at that location. The corresponding label is also obtained by interpolating the original labels based on the saliency of the corresponding images. To maximize the resulting saliency of the mix, we find an optimal relative arrangement of the two images prior to the mixing stage. As a result, SAGE produces smooth and realistic images with clear and distinct foreground objects (see Figure 1), unlike other augmentation techniques. Thanks to our efficient implementation, SAGE has virtually no computation overhead beyond obtaining the saliency information. Furthermore, our computations are partially shared between the saliency masks and the training gradients, which further decreases the amortized training time.

Contributions. We make the following three contributions: (i) We introduce SAGE, a DA method that generates novel training examples by mixing image pairs based on their visual saliency, which promotes discriminative foreground objects in the mix. (ii) SAGE achieves test accuracy better than or comparable to state-of-the-art augmentation techniques, without incurring significant computation overhead. (iii) Through robustness evaluations on perturbed test data, we show that SAGE improves test accuracy without trading off robustness.
2 Related Work
In this section, we review data augmentation techniques that go beyond simple geometric and color transformations to improve generalization. A popular approach is to synthesize new training input-output pairs by combining information from multiple raw samples. Mixup [zhang2018mixup] creates a new image-label pair by linearly interpolating in both the input and output space. In contrast, Manifold Mixup [verma2019manifold] and HypMix [sawhney2021hypmix] apply interpolation at the feature level. Others create new training samples by "copy-pasting" patches from one image to another [yun2019cutmix, ghiasi2021simple, fang2019instaboost]. This class of methods is very efficient and simple to implement. However, a common drawback of these approaches is that they do not take image semantics into account when performing augmentation. This potentially encourages the model to generalize using completely irrelevant information from the new training data, leading to inferior generalization.
To address this problem, recent work explicitly uses visual saliency information in the DA process. KeepAugment [gong2021keepaugment] leverages input saliency to improve existing DA techniques, e.g., Cutout [devries2017improved], by always keeping the important regions untouched during augmentation. SaliencyMix [uddin2020saliencymix] improves CutMix [yun2019cutmix] by selecting a patch around the peak salient pixel location in the source image and mixing it with the target image. Puzzle Mix formulates DA as an optimization problem, where the objective balances saliency maximization, local smoothness and the optimal transport between data pairs [kim2020puzzle]. CoMixup [kim2020co] extends this idea by encouraging the diversity of the augmentation when mixing a collection of inputs, and thus further complicates the optimization objective. The need to solve the optimization problem at every step significantly slows down the training, which may be prohibitive in some situations. Our saliencyguided method not only reduces this computational overhead, but also generates more plausible augmented images that result in improved test accuracy and outofdistribution generalization.
3 Technical Approach
The main idea behind SAGE is to synthesize novel images (with their labels) by blending pairs of training samples, using spatial saliency information as guidance for optimal blending. As illustrated in Fig. 2, our method consists of three independent components: i) saliency mask generation (Sec. 3.1), ii) the "Saliency-guided Mixup" module (Sec. 3.2), and iii) the "Optimal Rearrangement" module (Sec. 3.3). Chained together, they form our SAGE approach. Below, we elaborate on each of the components and conclude with a discussion of the efficiency of our pipeline in Sec. 3.4.
3.1 Computing Saliency Maps
We define the saliency of each image pixel as its importance in making the correct prediction, using a given vision model. More formally, we are given a training sample, $(x, y)$, where $x$ is an RGB image and $y$ is the corresponding one-hot label vector; a classifier, $f_\theta$, that is the current partially trained model; and our task loss, $\mathcal{L}(f_\theta(x), y)$, measuring the discrepancy between the classifier's output and the true label. We define the saliency, $s(x)$, as the magnitude of the gradient with respect to the input image,

$$s(x) = \left\| \nabla_x \mathcal{L}(f_\theta(x), y) \right\|_2, \qquad (1)$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm along the third (color) dimension. In practice, the saliency map tends to focus on the foreground objects useful for classification and to ignore irrelevant background. Note that our saliency definition differs from others [simonyan2013deep, selvaraju2017grad] in that we consider the gradient of the full loss, whereas previous work considers the gradient of the ground-truth class activation with respect to the input image. We find that our definition is advantageous for data augmentation and additionally allows for more efficient training, as detailed in Sec. 3.4.
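As an illustration, this saliency computation can be sketched in a few lines of PyTorch. The function and variable names below are ours, not from the official implementation:

```python
import torch
import torch.nn.functional as F

def saliency_map(model, x, y):
    """Per-pixel saliency: L2 norm (over color channels) of the gradient
    of the full task loss with respect to the input image (Eq. 1).

    x: (B, 3, H, W) image batch; y: (B,) integer labels.
    Returns a (B, H, W) tensor of non-negative saliency values.
    """
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)       # full loss, not a class activation
    (grad,) = torch.autograd.grad(loss, x)
    return grad.norm(p=2, dim=1)              # norm along the color dimension
```

Because the same backward pass also yields parameter gradients, the intermediate computations can be reused for the model update, as discussed in Sec. 3.4.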
3.2 Saliencyguided Mixup
Before describing our Saliency-guided Mixup, we revisit the original Mixup [zhang2018mixup]. Mixup creates a new training sample, $(\tilde{x}, \tilde{y})$, by linearly mixing pairs of training samples, $(x_1, y_1)$ and $(x_2, y_2)$, i.e., $\tilde{x} = \lambda x_1 + (1 - \lambda) x_2$, and their corresponding labels, i.e., $\tilde{y} = \lambda y_1 + (1 - \lambda) y_2$, where $\lambda \in [0, 1]$. While simple and effective, Mixup has a notable drawback, namely that it ignores the image semantics. That is, at every pixel location, the contribution of $x_1$ and $x_2$ to the final image is constant. As Fig. 3 (e) shows, this may lead to prominent image regions being suppressed by the background, which is not ideal for data augmentation [kim2020puzzle, kim2020co].
To address this shortcoming, we propose Saliency-guided Mixup, where the mixing ratio between $x_1$ and $x_2$ differs at every spatial location of $\tilde{x}$, determined by the saliency of the corresponding image regions. More formally, given two images, $x_1$ and $x_2$, and their saliency maps, $s_1$ and $s_2$, we craft a 2D mixing mask, $M$, and use it to mix the images:

$$\tilde{x} = M \odot x_1 + (1 - M) \odot x_2, \quad \text{where } M = \frac{\bar{s}_1}{\bar{s}_1 + \bar{s}_2 + \epsilon}, \qquad (2)$$

$\bar{s}_1$ and $\bar{s}_2$ are spatially-normalized and Gaussian-smoothed saliency maps, $\epsilon$ is a scalar hyperparameter used to avoid division by zero, and $\odot$ denotes the element-wise product. That is, the elements of $M$ are defined as the ratio of the two images' saliencies at the same location. This means that, at any given location, more prominent regions of one image will suppress less salient regions of the other image in the final blend, $\tilde{x}$. This strategy largely resolves the issue with the original Mixup and leads to more informative augmentation (see Fig. 3 (e)). Lastly, we mix the labels using $\tilde{y} = \mu y_1 + (1 - \mu) y_2$, where $\mu$ is the mean of the mixing mask, $M$.

Saliency-guided Mixup, Eq. 2, is most suitable for mixing images whose salient regions lie in distinct locations. When the maximally salient regions of both images spatially overlap, the mask, $M$, tends to suppress one or both objects, which leads to uninformative new scenes.
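A minimal PyTorch sketch of this blending step follows. We use a box filter as a stand-in for the Gaussian smoothing, and all names are illustrative rather than taken from the official code:

```python
import torch
import torch.nn.functional as F

def smooth(s, k=3):
    # Stand-in for Gaussian smoothing: average the saliency over a k x k window.
    kernel = torch.ones(1, 1, k, k) / (k * k)
    return F.conv2d(s.unsqueeze(1), kernel, padding=k // 2).squeeze(1)

def saliency_guided_mixup(x1, x2, s1, s2, y1, y2, eps=1e-6):
    """Blend two image batches with a per-pixel mask proportional to saliency (Eq. 2)."""
    # Spatially normalize each saliency map, then smooth it.
    s1 = smooth(s1 / (s1.sum(dim=(1, 2), keepdim=True) + eps))
    s2 = smooth(s2 / (s2.sum(dim=(1, 2), keepdim=True) + eps))
    M = s1 / (s1 + s2 + eps)                      # (B, H, W) mixing mask
    x_mix = M.unsqueeze(1) * x1 + (1 - M.unsqueeze(1)) * x2
    mu = M.mean(dim=(1, 2)).view(-1, 1)           # label ratio = mean of the mask
    y_mix = mu * y1 + (1 - mu) * y2               # y1, y2 are one-hot labels
    return x_mix, y_mix
```

When the two saliency maps are identical, the mask is roughly 0.5 everywhere and the result reduces to plain averaging, matching the intuition behind Eq. 2.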
3.3 Optimal Rearrangements via Saliency Maximization
To produce highly-informative augmentations with Eq. 2, even when both images have overlapping salient regions, we propose to shift one image relative to the other prior to mixing. Our objective is to find the shift that maximizes the resulting image saliency. An example of such rearrangements, together with the resulting augmentations, is shown in Fig. 4. In the following, we formalize this shifting process and describe a solution for finding the best rearrangement.
We define the translation operator, $T_\delta$, that shifts a tensor $x$ by $\delta = (\delta_h, \delta_w)$ pixels as

$$T_\delta(x)[i, j] = x[i - \delta_h, j - \delta_w], \qquad (3)$$

where $x[i, j]$ is the value of $x$ at the location $(i, j)$ and out-of-range indices yield zeros. Essentially, the translation shifts all the values in the tensor by the given offset, $\delta$, and zero-pads the empty space.
To quantify how successful a given rearrangement is in resolving the saliency overlap, we measure the total saliency [kim2020puzzle] after the rearrangement. For a given rearrangement, $\delta$, the total saliency, $\tau(\delta)$, is defined as follows:

$$\tau(\delta) = \sum_{i,j} \big[ M_\delta \odot T_\delta(\bar{s}_1) + (1 - M_\delta) \odot \bar{s}_2 \big][i, j], \qquad (4)$$

where $T_\delta(\bar{s}_1)$ is the saliency $\bar{s}_1$ translated by $\delta$, and $M_\delta$ is the mixing mask (Eq. 2) computed with $T_\delta(\bar{s}_1)$ and $\bar{s}_2$. Essentially, the scalar $\tau(\delta)$ captures the total saliency after the rearrangement (Eq. 3) and fusion (Eq. 2) of the individual saliency tensors. Intuitively, a larger total saliency implies a smaller overlap between the salient regions of the shifted images, $T_\delta(x_1)$ and $x_2$, and suggests that the resulting mix is more informative. Thus, it is reasonable to look for a rearrangement that maximizes the total saliency. To this end, we find the optimal rearrangement (offset), $\delta^*$, by solving $\delta^* = \arg\max_{\delta \in \Delta} \tau(\delta)$, where $\Delta$ is the space of all possible offsets (shown in Fig. 2, step 3).
Finally, we use the obtained optimal rearrangement to generate the augmented sample, $(\tilde{x}, \tilde{y})$. This is done by applying our Saliency-guided Mixup to the rearranged image pair (shown in Fig. 2, step 4), i.e., simply plugging the images $T_{\delta^*}(x_1)$ and $x_2$, with the corresponding saliencies $T_{\delta^*}(\bar{s}_1)$ and $\bar{s}_2$, into Eq. 2. The exact data augmentation algorithm is detailed in the supplement.
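The rearrangement search can be sketched as follows. This is a simplified, single-image-pair version with our own helper names; as discussed in Sec. 3.4, only a small random fraction of all offsets is actually evaluated:

```python
import torch

def translate(t, dy, dx):
    """Shift the last two dims of t by (dy, dx) pixels, zero-padding (Eq. 3)."""
    out = torch.zeros_like(t)
    H, W = t.shape[-2:]
    sy0, sy1 = max(-dy, 0), min(H - dy, H)        # valid source rows
    sx0, sx1 = max(-dx, 0), min(W - dx, W)        # valid source cols
    out[..., sy0 + dy:sy1 + dy, sx0 + dx:sx1 + dx] = t[..., sy0:sy1, sx0:sx1]
    return out

def best_offset(s1, s2, n_samples=64, eps=1e-6):
    """Sample candidate offsets and keep the one maximizing total saliency (Eq. 4)."""
    H, W = s1.shape[-2:]
    best, best_tau = (0, 0), float("-inf")
    for _ in range(n_samples):
        dy = int(torch.randint(1 - H, H, (1,)))
        dx = int(torch.randint(1 - W, W, (1,)))
        s1_t = translate(s1, dy, dx)
        M = s1_t / (s1_t + s2 + eps)              # mixing mask for this offset
        tau = (M * s1_t + (1 - M) * s2).sum().item()
        if tau > best_tau:
            best_tau, best = tau, (dy, dx)
    return best
```

Sampling candidate offsets rather than enumerating all of them keeps the search cost constant regardless of image size, which is the efficiency argument made in Sec. 3.4.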
3.4 Discussion
One of the advantages of SAGE over other saliency-based augmentations (e.g., [kim2020puzzle, kim2020co]) is its efficiency.
Here, we elaborate on our pipeline design choices and discuss their complexity.
Saliency-guided Mixup. Compared to the original Mixup blending step, our Saliency-guided Mixup (Sec. 3.2) adds only a simple element-wise multiplication by the mixing mask. The cost of this operation is negligible relative to the model's runtime.
Optimal Rearrangements.
As described in Sec. 3.3, to arrive at our final mixture, we consider all possible rearrangements and select the one maximizing the total saliency, Eq. 4.
The number of rearrangements grows quadratically with the image size and quickly becomes the bottleneck. To keep our method efficient, we randomly sample a small portion of all possible arrangements (1% in all experiments) and search among them. In our experiments, this does not affect classification performance, while greatly improving efficiency.
Saliency Computation.
Computing saliency requires an extra forward and backward pass of the model.
When existing works [kim2020puzzle, kim2020co] compute saliency masks, they discard all the intermediate computations and use only the mask itself for DA, which essentially doubles the training time. In contrast, SAGE saves the gradients, $g_{\text{orig}}$, with respect to the model parameters, obtained in the backward pass of the saliency computation. These gradients can be combined with the standard gradients, $g_{\text{aug}}$, computed on SAGE-augmented images, to perform the final model update with $g = \gamma\, g_{\text{orig}} + (1 - \gamma)\, g_{\text{aug}}$, where $\gamma \in [0, 1]$. The hyperparameter, $\gamma$, effectively controls how much information from the original images, versus the augmented images, is used for the updates.
This trick allows us to amortize the saliency computations, and reuse the intermediate results for the model updates.
Note that this is only possible thanks to our saliency definition (Eq. 1), which differs from the classical one [simonyan2013deep].
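Under our interpretation of this scheme (the function name, symbol names and optimizer wiring below are ours), a single training step that reuses the parameter gradients from the saliency pass might look like:

```python
import torch

def sage_update(model, loss_orig, loss_aug, optimizer, gamma=0.5):
    """Combine parameter gradients from the saliency (original-image) pass with
    those from the SAGE-augmented pass: g = gamma * g_orig + (1 - gamma) * g_aug.
    """
    # In practice g_orig falls out of the same backward pass that produces the
    # saliency maps; here we recompute it explicitly for clarity.
    g_orig = torch.autograd.grad(loss_orig, list(model.parameters()))
    g_aug = torch.autograd.grad(loss_aug, list(model.parameters()))
    optimizer.zero_grad()
    for p, go, ga in zip(model.parameters(), g_orig, g_aug):
        p.grad = gamma * go + (1 - gamma) * ga
    optimizer.step()
```

With $\gamma = 0$ this reduces to training on augmented images only; $\gamma > 0$ amortizes the saliency computation by reusing its gradients.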
4 Experiments
We demonstrate the advantage of SAGE for image classification in Sec. 4.1. Sec. 4.2 evaluates SAGE on out-of-distribution generalization, Sec. 4.3 analyzes the efficiency of our pipeline and Sec. 4.4 presents an ablation study of SAGE's components. Our implementation is largely based on the publicly available Puzzle Mix repository (https://github.com/snu-mllab/PuzzleMix).
Dataset  Model  Vanilla  Mixup  CutMix  Manifold  SaliencyMix  Puzzle Mix  CoMixup  SAGE
CIFAR-10  PreActResNet-18  95.07  95.97  96.27  96.28  96.15  96.62  96.23  96.95
CIFAR-100  PreActResNet-18  76.80  77.40  78.96  78.51  78.85  79.65  79.68  79.91
CIFAR-100  WRN16  78.55  79.83  80.03  79.77  80.16  80.73  80.42  80.45
CIFAR-100  ResNeXt-29  78.77  78.23  77.43  77.97  78.89  79.20  80.27  80.35
4.1 Image Classification
Following previous work [kim2020co], we perform evaluations on the CIFAR-10 [krizhevsky2009learning] and CIFAR-100 [krizhevsky2009learning] datasets with the PreActResNet-18 [he2016identity], ResNeXt-29 [xie2017aggregated] and WideResNet-16 [zagoruyko2016wide] architectures. For all datasets and models, we follow the optimization schedule described in Puzzle Mix and CoMixup; training and model details are included in the supplement. For a comprehensive comparison, we use the following DA baselines: (i) Vanilla, i.e., standard data augmentation only, which includes random cropping and horizontal flips, (ii) Mixup [zhang2018mixup], (iii) CutMix [yun2019cutmix], (iv) Manifold Mixup [verma2019manifold], (v) SaliencyMix [uddin2020saliencymix], (vi) Puzzle Mix [kim2020puzzle] and (vii) CoMixup [kim2020co]. Note that all baseline methods are applied on top of the standard data augmentation. Following previous work [kim2020puzzle, kim2020co], we report results averaged over three independent training runs.
Table 1 summarizes the comparison of SAGE to the baselines and points to two key observations. First, the DA techniques utilizing saliency (i.e., SaliencyMix, Puzzle Mix, CoMixup and SAGE) substantially outperform the non-saliency-based variants across almost all datasets and architectures. This clear improvement demonstrates that using image semantics for data augmentation leads to better generalization on the test set. Second, among the saliency-based methods, SAGE is consistently the best on CIFAR-10; on CIFAR-100, SAGE outperforms Puzzle Mix and CoMixup with PreActResNet-18 and ResNeXt-29, and performs comparably with WideResNet. SAGE also outperforms SaliencyMix on all tested architectures on both datasets. We attribute the advantage of SAGE to the fact that our augmented images are smoother and more realistic, combining the advantages of Mixup and the saliency-based methods. This is despite the fact that Puzzle Mix and CoMixup explicitly optimize for maximum saliency and incur considerably more computational overhead.
4.2 Outofdistribution Generalization and Fewshot Adaptation
It is known that different DA techniques may lead to similar test accuracy improvements but drastically different behavior on out-of-distribution (OOD) data [verma2019manifold]. This phenomenon is attributed to differences in the quality of the learned representations. Therefore, to further evaluate our approach, we consider generalization in the OOD setting.
In our evaluation, we test OOD generalization in two scenarios: using corrupted test images (with Gaussian noise or adversarial perturbations [szegedy2014intriguing]) and evaluating generalization to new categories in a few-shot setup [vinyals2016matching]. More specifically, we test against three different perturbations: i) Gaussian noise with zero mean and a variance of 0.01, ii) an $\ell_\infty$ norm-bounded attack generated using the Fast Gradient Sign Method (FGSM) [goodfellow2014explaining] and iii) an $\ell_2$ norm-bounded attack crafted with the Fast Gradient Method (FGM) [goodfellow2014explaining]. Our choice of attacks and perturbation budgets follows the standard practice of robustness benchmarks [croce2020robustbench]. To evaluate the few-shot adaptation capabilities of our model and test how well the learned representations transfer to novel categories, we perform few-shot classification on the miniImageNet dataset [vinyals2016matching]. Additional details are provided in the supplement.

To summarize the performance on all three OOD benchmarks, we average the accuracy across the benchmarks to get a single score quantifying model robustness. Figure (a) plots the average OOD accuracy on CIFAR-100 against the standard accuracy on the original test set. We observe a striking difference in the robustness characteristics across DA methods. Notably, models trained using SAGE are much less sensitive to out-of-distribution shifts than the two other saliency-based methods, i.e., Puzzle Mix and CoMixup, despite comparable test accuracy improvements. Moreover, the models trained with CutMix, Puzzle Mix and CoMixup have worse OOD performance than Vanilla training. These methods produce augmentations with unnatural patch-like patterns, which likely leads to unwanted properties of the learned representations. In contrast, Mixup and SAGE fuse images in a homogeneous way, leading to models that are more robust to various input perturbations. Please refer to the supplement for the full table of results and for CIFAR-10 experiments.
Vanilla  Mixup  CutMix  SaliencyMix  Puzzle Mix  CoMixup  SAGE
77.9  78.9  78.4  78.6  78.6  79.0  79.8
To show OOD generalization beyond adversarial attacks, we compare SAGE to other data augmentation techniques on few-shot classification on miniImageNet, where the goal is to learn a representation that generalizes to novel categories. We follow the setup from previous work [dvornik2019diversity], using a single ResNet-12 with a prototype classifier. As shown in Table 2, SAGE outperforms the other augmentation techniques, including Mixup (the strongest model under adversarial perturbations). This shows that SAGE is useful in OOD scenarios beyond Gaussian and adversarial perturbations.
4.3 Runtime Analysis
In this section, we compare the training time of different data augmentation methods running on a single NVIDIA Tesla T4. Figure (b) plots each method's average training time (GPU hours) versus accuracy. Notably, the techniques not using saliency (i.e., Mixup, Manifold and CutMix) are as fast as Vanilla, since the data augmentation is performed during data loading, which does not affect the overall training time. SaliencyMix stands apart from the other saliency-based augmentation techniques, because it utilizes an externally trained saliency detector based on a shallow, pre-deep-learning method [montabone2010human], which is fast but considerably less capable than the deep saliency methods [simonyan2013deep] used by the other augmentation techniques. Consequently, SaliencyMix introduces minimal overhead; however, its improvement in classification accuracy is limited. The other saliency-based methods (i.e., Puzzle Mix, CoMixup and SAGE) are more accurate, yet also significantly slower. Among them, SAGE is the fastest and also the most accurate on CIFAR-100. Based on these observations, we argue that SAGE represents a good trade-off between accuracy and efficiency overall, and is clearly the best choice among the saliency-based methods.

Figure: (a) Robustness versus standard accuracy in OOD generalization. The methods in the green area (i.e., Mixup and SAGE) improve both accuracy and robustness relative to vanilla augmentation, while the others in red (i.e., CutMix, CoMixup and Puzzle Mix) improve standard test accuracy at the cost of decreased robustness. (b) Runtime comparison of SAGE and other baselines. We estimate the computation cost on a single NVIDIA Tesla T4. For SAGE, there is no noticeable overhead besides the additional forward and backward pass to compute the saliency map, which approximately doubles the time of Vanilla training.
4.4 Ablation Studies
In this section, we further analyze our data augmentation strategy by ablating different design choices in the pipeline. For all experiments, we use the setup described in Sec. 4.1 with ResNet-18 on CIFAR-10 and CIFAR-100. Please see the supplement for additional ablations.
Saliency-guided Mixup and optimal rearrangements. The two components that make SAGE novel are the Saliency-guided Mixup (Sec. 3.2) and the Optimal Rearrangements (Sec. 3.3). Here, we evaluate SAGE with some components removed or replaced by an existing technique. In particular, we evaluate i) SAGE w/o OR (i.e., without Optimal Rearrangements), which always performs Saliency-guided Mixup on non-shifted images, and ii) SAGE w/o SM (i.e., without Saliency-guided Mixup), which simply replaces one image region with a region from the other image instead of performing smooth saliency-based mixing. Examples of SAGE w/o SM and SAGE w/o OR are included in the supplement. As shown in Table 3, each component is important for the final performance, which justifies its use.
Optimal rearrangements search space. As described in Sec. 3.4, to select a rearrangement, we evaluate a set of offsets and proceed with the one that maximizes saliency. To speed up the search, we only explore a random subset of all rearrangements (1% in all previous experiments), which suggests that our data augmentation may be suboptimal. Table 3 shows the model's performance depending on the portion of all rearrangements considered for DA. Surprisingly, using only 1% of the rearrangements works best. While seemingly counterintuitive, we hypothesize that the suboptimal rearrangements act as additional training regularization and introduce more diversity into the augmented data.
5 Conclusion
We proposed SAGE – a new data augmentation approach that integrates visual saliency to produce highly informative training samples. Compared to existing methods, SAGE leads to better test accuracy, and generates more realistic training samples. Moreover, SAGE is the only saliencybased augmentation technique that improves model robustness and OOD performance, while incurring minimal computational overhead. In principle, SAGE is not limited to image classification and can be easily extended to other visual tasks. We believe that SAGE delivers a unique combination of accuracy, robustness and efficiency, and can become the new plugandplay data augmentation for a wide range of vision tasks.
References
Appendix A Summary of the Supplementary Material
The supplementary material is organized as follows. In Sec. B, we describe the exact optimization schedule and the hyperparameters used to train with SAGE and the other baseline DA frameworks. In Sec. C and Sec. D, we provide detailed results to support our claims on SAGE's improved OOD generalization (Sec. 4.2) and its low computation overhead (Sec. 4.3). Pseudocode for augmenting data with SAGE is included in Sec. E. In Sec. F, we show examples of augmentations using SAGE w/o SM and SAGE w/o OR (Sec. 4.4). Furthermore, we provide additional ablation studies to verify the design choices of SAGE in Sec. G.
Appendix B Optimization schedule and hyperparameters
Optimization schedule: Following previous work [kim2020puzzle, kim2020co], all models are trained using stochastic gradient descent (SGD) for 300 epochs with an initial learning rate of 0.2. The learning rate is decreased by a factor of 0.1 at epochs 100 and 200. We use a momentum of 0.9 and a weight decay of 0.0001. This optimization schedule is used for both CIFAR-10 and CIFAR-100 for all models, except for CoMixup [kim2020co] on CIFAR-10: we notice that training CoMixup on CIFAR-10 with an initial learning rate of 0.2 diverges at the beginning of training, and training becomes stable with an initial learning rate of 0.12.

Training with baseline DA: We follow the hyperparameter settings used in previous work [kim2020puzzle, kim2020co]. To train with Mixup [zhang2018mixup], CutMix [yun2019cutmix], Puzzle Mix [kim2020puzzle], CoMixup [kim2020co] and Manifold Mixup [verma2019manifold], we use the mixing-coefficient settings from the respective papers. For SaliencyMix (https://github.com/afm-shahab-uddin/SaliencyMix), Puzzle Mix (https://github.com/snu-mllab/PuzzleMix) and CoMixup (https://github.com/snu-mllab/Co-Mixup), we use the parameter settings described in the authors' public repositories.
Training with SAGE: For all models and datasets, we use 1% of all possible rearrangements (Sec. 3.3) and the smoothing parameter from Sec. 3.2. The truncation factor (Sec. G) and the gradient update ratio $\gamma$ (Sec. 3.4) are set per configuration: ResNet-18 on CIFAR-10, and ResNet-18, WRN16 and ResNeXt-29 on CIFAR-100.
Appendix C Robustness Evaluation on CIFAR10 and CIFAR100
We evaluate the robustness of models trained with the various baseline DA methods. In particular, we measure the classification accuracy of the models on test data perturbed with Gaussian noise (zero mean, variance 0.01) and adversarial attacks. To craft adversarial perturbations, we use the $\ell_\infty$-bounded FGSM [goodfellow2014explaining] and the $\ell_2$-bounded FGM [goodfellow2014explaining]. Results are based on models trained with ResNet-18. In Figure 6, we notice that models trained with SAGE achieve improved classification accuracy on both clean and noise-perturbed test data. However, methods such as SaliencyMix, Puzzle Mix, CoMixup and CutMix improve generalization performance on the clean test data at the cost of decreased robustness.
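For reference, an $\ell_\infty$-bounded FGSM perturbation of the kind used here can be sketched as follows; the default $\epsilon$ and the clamping range are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=8 / 255):
    """One-step FGSM: move each pixel by eps in the direction of the loss gradient."""
    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    (g,) = torch.autograd.grad(loss, x_adv)
    # Taking the sign keeps the perturbation inside the L-inf ball of radius eps.
    return (x + eps * g.sign()).clamp(0.0, 1.0).detach()
```

The $\ell_2$-bounded FGM variant is analogous, replacing the sign with a gradient normalized to unit $\ell_2$ norm.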
CIFAR-10:
Perturbations  Vanilla  Mixup  CutMix  Manifold  SaliencyMix  Puzzle Mix  CoMixup  SAGE
Rank  4  3  8  1  6  5  7  2
FGSM  79.96  80.93  79.57  85.79  80.62  81.96  78.78  83.75
FGM  89.67  89.22  87.81  90.86  88.86  89.64  88.11  90.64
Gaussian  89.88  92.56  77.20  92.21  85.99  87.60  85.25  91.67
CIFAR-100:
Perturbations  Vanilla  Mixup  CutMix  Manifold  SaliencyMix  Puzzle Mix  CoMixup  SAGE
Rank  3  1  8  4  5  7  6  2
FGSM  49.24  50.51  44.20  48.89  46.52  44.58  44.32  50.18
FGM  62.19  63.36  55.57  61.14  59.40  58.16  58.56  62.23
Gaussian  52.68  60.76  28.06  55.47  38.21  43.96  34.46  47.68
Appendix D Runtime Comparison
To estimate the computation cost of the various baseline DA methods, we measure the total GPU hours required to train on CIFAR-10 and CIFAR-100 using a single NVIDIA Tesla T4. Note that training with SAGE approximately doubles the time of vanilla training due to the computation of the saliency map; however, unlike Puzzle Mix and CoMixup, there is no additional overhead in finding the optimal rearrangements that maximize the total saliency. As discussed in Sec. 4.3, SaliencyMix introduces minimal overhead thanks to its fast external saliency detector [montabone2010human], but its improvement in classification accuracy is limited.
Dataset  Model  Vanilla  Mixup  CutMix  Manifold  SaliencyMix  Puzzle Mix  CoMixup  SAGE
CIFAR-10  PreActResNet-18  3.35  3.29  3.38  3.44  3.45  8.90  25.29  6.83
CIFAR-100  ResNeXt-29  9.76  9.67  9.83  10.27  10.18  22.64  35.65  19.50
Appendix E Full SAGE Algorithm
Algorithm 1 shows the exact procedure of SAGE. We discuss saliency-guided mixing with optimal rearrangement (Ln. 3) in Secs. 3.2 and 3.3, and the remainder of the algorithm is covered in Sec. 3.1.
Appendix F Examples of Augmentation Results with SAGE w/o SM and SAGE w/o OR
In Sec. 4.4, we verified the effectiveness of our data augmentation strategy by ablating i) SAGE w/o OR (i.e., without Optimal Rearrangements), which always performs Saliency-guided Mixup on non-shifted images, and ii) SAGE w/o SM (i.e., without Saliency-guided Mixup). Examples of the augmentation results are shown in Figure (d).
Appendix G Additional Ablations
We include two additional ablation studies in this section: i) reusing parameter gradients from unaugmented samples (Sec. 3.4) and ii) random rescaling of the total saliency. For each experiment, we use the best result as the control group (bold numbers), then repeat the runs with the task-related parameters modified.
Reusing the parameter gradients: In Sec. 3.4, we discussed performing the gradient descent update by combining parameter gradients computed on unaugmented and SAGE-augmented samples. In particular, let $g_{\text{orig}}$ and $g_{\text{aug}}$ represent the gradients computed using unaugmented and augmented images, respectively. The final model update is based on $g = \gamma\, g_{\text{orig}} + (1 - \gamma)\, g_{\text{aug}}$, where $\gamma \in [0, 1]$. In Table 7, we observe that reusing the parameter gradients computed on unaugmented samples ($\gamma > 0$) significantly increases accuracy on the test data.
Random rescaling of the total saliency: The random mixing ratio in prior work [zhang2018mixup, yun2019cutmix, devries2017improved, kim2020co, kim2020puzzle] can be seen as a way to increase the diversity of the augmentation results. Similarly, we randomly rescale the total saliency of the smoothed maps $\bar{s}_1$ and $\bar{s}_2$ using factors $\lambda$ and $1 - \lambda$, respectively. In practice, we observe that the diversity of the augmented images greatly decreases when $\lambda$ approaches 0 or 1, since one of the maps dominates when computing the total saliency. Therefore, when the offset image is rescaled to have a small total saliency, it is often better to simply exclude it from the augmented result. As such, we propose a simple heuristic that truncates the random rescaling factor using the truncation factor $\tau$. Results in Table 7 show that, with truncation, the test accuracy on both datasets increases significantly.