
SAGE: Saliency-Guided Mixup with Optimal Rearrangements

by   Avery Ma, et al.

Data augmentation is a key element for training accurate models by reducing overfitting and improving generalization. For image classification, the most popular data augmentation techniques range from simple photometric and geometrical transformations, to more complex methods that use visual saliency to craft new training examples. As augmentation methods get more complex, their ability to increase the test accuracy improves, yet, such methods become cumbersome, inefficient and lead to poor out-of-domain generalization, as we show in this paper. This motivates a new augmentation technique that allows for high accuracy gains while being simple, efficient (i.e., minimal computation overhead) and generalizable. To this end, we introduce Saliency-Guided Mixup with Optimal Rearrangements (SAGE), which creates new training examples by rearranging and mixing image pairs using visual saliency as guidance. By explicitly leveraging saliency, SAGE promotes discriminative foreground objects and produces informative new images useful for training. We demonstrate on CIFAR-10 and CIFAR-100 that SAGE achieves better or comparable performance to the state of the art while being more efficient. Additionally, evaluations in the out-of-distribution setting, and few-shot learning on mini-ImageNet, show that SAGE achieves improved generalization performance without trading off robustness.





1 Introduction

Data augmentation (DA) methods synthetically expand a dataset by applying transformations on the available examples, with the goal of reducing overfitting and improving generalization in models trained on these datasets. In computer vision, conventional DA techniques are typically based on random geometric (translation, rotation and flipping) and photometric (contrast, brightness and sharpness) transformations [simonyan2015very, lecun1998gradient, cubuk2019autoaugment, cubuk2020randaugment]. While these techniques are already effective, they merely create slightly altered copies of the original images and thus introduce limited diversity in the augmented dataset. A more advanced family of DA methods [zhang2018mixup, yun2019cutmix] combines multiple training examples into a new image-label pair. By augmenting both the image and the label space simultaneously, such approaches greatly increase the diversity of the augmented set. Consequently, they substantially improve model generalization and, thanks to their simplicity, do so without added computational overhead. Nonetheless, these DA approaches are agnostic to image semantics; they ignore object location cues, and as a result may produce ambiguous scenes with occluded distinctive regions (see Figure 1, Mixup [zhang2018mixup] and CutMix [yun2019cutmix]).

Figure 1: Comparison of data augmentation methods. Panels: (a) Batch, (b) Mixup [zhang2018mixup], (c) CutMix [yun2019cutmix], (d) SaliencyMix [uddin2020saliencymix], (e) Puzzle Mix [kim2020puzzle], (f) Co-Mixup [kim2020co], (g) SAGE (ours). Thanks to the saliency-guided mixing and image rearrangements, SAGE produces more meaningful and informative scenes, as verified in our experiments.

To account for such shortcomings, a new line of work [kim2020puzzle, kim2020co, gong2021keepaugment, uddin2020saliencymix] proposes to explicitly use visual saliency [simonyan2013deep] for data augmentation. Typically, a saliency map contains the information about the importance of different image regions for the downstream task. As a result, saliency maps implicitly contain information about objects, their locations and, crucially, about the “informativeness” of image regions. Previous methods [kim2020puzzle, kim2020co, uddin2020saliencymix] take full advantage of the saliency information, and formulate data augmentation as a saliency maximization problem. Given training image patches, their augmentation “assembles” a new image of high visual saliency. This approach greatly improves the test accuracy; however, this comes with a large computation overhead due to the need to maximize saliency at every training step. Moreover, as the augmented images are composed of patches, the resulting scenes are often unrealistic (see Puzzle Mix, Co-Mixup and SaliencyMix in Figure 1), which leads to poor out-of-distribution generalization, as shown later in our experiments. In summary, the existing data augmentation techniques can either i) boost the test accuracy, or ii) produce a robust model with little computational overhead; there are no methods that can do both.

To address the aforementioned drawbacks, we propose a new augmentation, Saliency-Guided Mixup with Optimal Rearrangements (SAGE), that provides both high accuracy and robustness, and has minimal computation overhead. SAGE is a simple and effective DA technique that uses visual saliency to perform optimal image blending at each spatial location, and optimizes the relative image position such that the resulting visual saliency is maximized. Given two images and their saliency maps, SAGE mixes the images together, such that at each spatial location, the contribution of different images to the mix is proportional to their saliency in that location. The corresponding label is also obtained by interpolating the original labels based on the saliency of the corresponding images. To maximize the resulting saliency of the mix, we find an optimal relative arrangement of the two images prior to the mixing stage. As a result, SAGE produces smooth and realistic images with clear and distinct foreground objects (see Figure 1), unlike other augmentation techniques. Thanks to our efficient implementation, SAGE has virtually no computation overhead beyond obtaining the saliency information. Furthermore, our computations are partially shared between the saliency masks and the training gradients, which further decreases the amortized training time.

Contributions. We make the following three contributions: (i) We introduce SAGE, a DA method to generate novel training examples by mixing image pairs based on their visual saliency, which promotes discriminative foreground objects in the mix. (ii) SAGE achieves test accuracy better than or comparable to state-of-the-art augmentation techniques, without incurring significant computation overhead. (iii) Through robustness evaluations on perturbed test data, we show that SAGE improves test accuracy without trading off robustness.

2 Related Work

In this section, we review data augmentation techniques that go beyond simple geometrical and color transformations to improve generalization. A popular approach is to synthesize new training input-output pairs by combining information from multiple raw samples. Mixup [zhang2018mixup] creates a new image-label pair by linearly interpolating both the input and output space. In contrast, Manifold Mixup [verma2019manifold] and HypMix [sawhney2021hypmix] apply interpolation at the feature level. Others create new training samples by “copy-pasting” patches from one image to another [yun2019cutmix, ghiasi2021simple, fang2019instaboost]. This class of methods is very efficient and simple to implement. However, a common drawback of these approaches is that they do not take image semantics into account when performing augmentation. This potentially encourages the model to generalize using completely irrelevant information from the new training data, leading to inferior generalization.

To address this problem, recent work explicitly uses visual saliency information in the DA process. KeepAugment [gong2021keepaugment] leverages input saliency to improve existing DA techniques, e.g., Cutout [devries2017improved], by always keeping the important regions untouched during augmentation. SaliencyMix [uddin2020saliencymix] improves CutMix [yun2019cutmix] by selecting a patch around the peak salient pixel location in the source image and mixing it with the target image. Puzzle Mix formulates DA as an optimization problem, where the objective balances saliency maximization, local smoothness and the optimal transport between data pairs [kim2020puzzle]. Co-Mixup [kim2020co] extends this idea by encouraging the diversity of the augmentation when mixing a collection of inputs, and thus further complicates the optimization objective. The need to solve the optimization problem at every step significantly slows down the training, which may be prohibitive in some situations. Our saliency-guided method not only reduces this computational overhead, but also generates more plausible augmented images that result in improved test accuracy and out-of-distribution generalization.

3 Technical Approach

Figure 2: SAGE overview. Given the original images, we first compute saliency maps. Next, we find the best rearrangement of the images that maximizes the total saliency (in the green box). Finally, we use our saliency-guided Mixup to fuse the overlapping image parts and derive the new label. As a result, SAGE produces smooth, realistic and informative scenes.

The main idea behind SAGE is to synthesize novel images (with their labels) by blending pairs of training samples, using spatial saliency information as guidance for optimal blending. As illustrated in Fig. 2, our method consists of three independent components: i) saliency mask generation (Sec. 3.1), ii) the “Optimal Rearrangement” module (Sec. 3.3), and iii) the “Saliency-guided Mixup” module (Sec. 3.2). Chained together, they form our SAGE approach. Below, we elaborate on each of the components and conclude with a discussion on the efficiency of our pipeline in Sec. 3.4.

3.1 Computing Saliency Maps

We define the saliency of each image pixel as its importance in making the correct prediction, using a given vision model. More formally, we are given a training sample, $(x, y)$, where $x \in \mathbb{R}^{H \times W \times 3}$ is an RGB image and $y$ is the corresponding one-hot label vector, a classifier, $f_\theta$, that is the current partially trained model, and our task loss, $\mathcal{L}$, measuring the discrepancy between the classifier’s output and the true label. We define the saliency, $s \in \mathbb{R}^{H \times W}$, as the magnitude of the gradient with respect to the input image,

$$ s = \left\| \nabla_x \mathcal{L}\big(f_\theta(x), y\big) \right\|_{\ell_2, 3}, \tag{1} $$

where $\|\cdot\|_{\ell_2, 3}$ denotes the $\ell_2$-norm along the third (color) dimension. In practice, the saliency map tends to focus on the foreground objects useful for classification and ignores irrelevant background. Note that our saliency definition differs from others [simonyan2013deep, selvaraju2017grad] in that we consider the gradient of the full loss, while previous work considers the gradient of the ground-truth class activation with respect to the input image. We find that our definition is advantageous for data augmentation, and additionally allows for more efficient training, as detailed in Sec. 3.4.
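The saliency definition can be illustrated with a minimal NumPy sketch. It assumes a toy linear softmax classifier with cross-entropy loss (the names `W`, `b` and `saliency_map` are hypothetical stand-ins, not the paper's network), so that the input gradient has a closed form, and it assumes the $\ell_2$ norm is taken over the color channels.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()

def saliency_map(x, y, W, b):
    """Per-pixel saliency of an (H, W, 3) image under a toy linear classifier.

    x: (H, W, 3) image, y: (C,) one-hot label,
    W: (C, H*W*3) weights and b: (C,) bias (hypothetical stand-ins for f_theta).
    """
    p = softmax(W @ x.ravel() + b)                   # class probabilities
    # Closed-form gradient of the cross-entropy loss w.r.t. the input image.
    grad_x = (W.T @ (p - y)).reshape(x.shape)
    # l2-norm along the colour axis collapses (H, W, 3) to (H, W).
    return np.linalg.norm(grad_x, axis=2)

rng = np.random.default_rng(0)
H, Wd, C = 8, 8, 10
x = rng.random((H, Wd, 3))
y = np.eye(C)[3]
W = 0.1 * rng.normal(size=(C, H * Wd * 3))
b = np.zeros(C)
s = saliency_map(x, y, W, b)
```

With a deep network, the same quantity would come from an automatic-differentiation backward pass rather than a closed-form expression.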

3.2 Saliency-guided Mixup

Before describing our Saliency-guided Mixup, we revisit the original Mixup [zhang2018mixup]. Mixup creates a new training sample, $(x', y')$, by linearly mixing pairs of training samples, $(x_1, y_1)$ and $(x_2, y_2)$, i.e., $x' = \lambda x_1 + (1 - \lambda) x_2$, and their corresponding labels, i.e., $y' = \lambda y_1 + (1 - \lambda) y_2$, where $\lambda \in [0, 1]$. While simple and effective, Mixup has a notable drawback, namely it ignores the image semantics. That is, at every pixel location, the contribution of $x_1$ and $x_2$ to the final image is constant. As Fig. 3 (e) shows, this may lead to prominent image regions being suppressed by the background, which is not ideal for data augmentation [kim2020puzzle, kim2020co].

Figure 3: Comparison between Saliency-guided Mixup and original Mixup. Given the a) original images with b) saliency maps, our Saliency Mixup computes d) the Mixing Mask $M$ (given by Eq. 2) based on the relative saliency of the inputs. The values of $M$ are represented with a heatmap; blue areas indicate stronger contribution of image 1, red areas correspond to image 2 being more prominent and pale areas indicate more uniform blending. Consequently, salient regions from different images contribute to different locations and result in a realistic, informative output c). In contrast, the original Mixup produces f) a uniform mixing mask (at a fixed $\lambda$), which results in e) an unrealistic and unclear image.

To address this shortcoming, we propose Saliency-guided Mixup, where at every image location in $x'$, the mixing ratio between $x_1$ and $x_2$ is different, defined by the saliency of the corresponding image regions. More formally, given two images, $x_1$ and $x_2$, and their saliency maps, $s_1$ and $s_2$, we craft a 2D mixing mask, $M$, and use it to mix the images:

$$ M = \frac{\tilde{s}_1}{\tilde{s}_1 + \tilde{s}_2 + \zeta}, \qquad x' = M \odot x_1 + (1 - M) \odot x_2, \tag{2} $$

where the division is element-wise, $\tilde{s}_1$ and $\tilde{s}_2$ are spatially-normalized and Gaussian-smoothed saliency maps, $\zeta$ is a scalar hyperparameter used to avoid division-by-zero and $\odot$ denotes element-wise product. That is, the elements in $M$ are defined as the saliency ratio in different images at the same location. This means that, at any given location, more prominent regions of one image will suppress less salient regions of the other image in the final blend, $x'$. This strategy largely resolves the issue with the original Mixup and leads to more informative augmentation (see Fig. 3 (c)). Lastly, we mix the labels using $y' = \lambda y_1 + (1 - \lambda) y_2$, where $\lambda$ is the mean of the mixing mask, $M$.

Saliency-guided Mixup, Eq. 2, is most suitable for mixing images that have salient regions in distinct locations. When the maximally salient regions in both images spatially overlap, the mask, $M$, tends to suppress one or both objects, which leads to uninformative new scenes.
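The mixing step described in this section can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: "spatially-normalized" is taken to mean dividing each saliency map by its sum, the Gaussian smoothing is a simple separable blur with edge padding, and the names `smooth`, `saliency_mixup`, `zeta` and `sigma` are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def smooth(s, sigma=1.0):
    """Separable Gaussian blur of a 2D map, using edge padding."""
    radius = int(3 * sigma)
    k = gaussian_kernel(sigma, radius)
    padded = np.pad(s, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def saliency_mixup(x1, y1, s1, x2, y2, s2, zeta=1e-8, sigma=1.0):
    """Mix two (H, W, 3) images with a per-pixel mask built from saliency."""
    s1 = smooth(s1 / s1.sum(), sigma)        # normalise, then smooth
    s2 = smooth(s2 / s2.sum(), sigma)
    M = s1 / (s1 + s2 + zeta)                # share of image 1 at each pixel
    x = M[..., None] * x1 + (1.0 - M[..., None]) * x2
    lam = M.mean()                           # label ratio = mean of the mask
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, M

# Toy inputs: image 1 is salient on the left, image 2 on the right.
H = W = 8
x1, x2 = np.ones((H, W, 3)), np.zeros((H, W, 3))
s1 = np.full((H, W), 0.01); s1[:, :3] = 1.0
s2 = np.full((H, W), 0.01); s2[:, 5:] = 1.0
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x, y, M = saliency_mixup(x1, y1, s1, x2, y2, s2)
```

In this toy case the mask favors image 1 on the left and image 2 on the right, whereas plain Mixup would apply the same scalar ratio everywhere.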

3.3 Optimal Rearrangements via Saliency Maximization

To produce highly informative augmentations with Eq. 2, even when both images have overlapping salient regions, we propose to shift one image relative to the other prior to mixing. Our objective is to find the shift that maximizes the resulting image saliency. Examples of such rearrangements, with the resulting augmentations, are shown in Fig. 4. In the following, we formalize this shifting process and describe a solution for finding the best rearrangement.

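We define the translation operator that shifts a tensor $z$ of spatial size $H \times W$ by $\tau = (\tau_i, \tau_j)$ pixels as

$$ \mathcal{T}(z, \tau)[i, j] = \begin{cases} z[i - \tau_i,\; j - \tau_j], & \text{if } 1 \le i - \tau_i \le H \text{ and } 1 \le j - \tau_j \le W, \\ 0, & \text{otherwise}, \end{cases} \tag{3} $$

where $z[i, j]$ is the value of $z$ at the location $(i, j)$. Essentially, translation shifts all the values in the tensor by the given offset, $\tau$, and zero-pads the empty space.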
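A zero-padded shift of this kind can be sketched in a few lines of NumPy. The function name `translate` and the sign convention (a positive offset moves content toward larger row and column indices) are assumptions made for illustration.

```python
import numpy as np

def translate(z, tau):
    """Shift z by tau = (ti, tj) pixels along its first two axes, zero-padding."""
    ti, tj = tau
    H, W = z.shape[:2]
    out = np.zeros_like(z)
    # Destination window in `out`; the source window in `z` is the same
    # window moved back by the offset.
    di0, di1 = max(ti, 0), min(H + ti, H)
    dj0, dj1 = max(tj, 0), min(W + tj, W)
    if di0 < di1 and dj0 < dj1:              # empty when shifted fully out of view
        out[di0:di1, dj0:dj1] = z[di0 - ti:di1 - ti, dj0 - tj:dj1 - tj]
    return out

z = np.arange(9).reshape(3, 3)
down = translate(z, (1, 0))      # content moves one row down
left = translate(z, (0, -1))     # content moves one column left
gone = translate(z, (3, 0))      # shifted entirely out of the frame
```

Because the slicing only touches the first two axes, the same function works for 2D saliency maps and for images with a trailing channel axis.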

To quantify how successful a given rearrangement is in resolving the saliency overlap, we measure the total saliency [kim2020puzzle] after the rearrangement. For a given rearrangement, $\tau$, the total saliency, $v(\tau)$, is defined as follows:

$$ v(\tau) = \sum_{i,j} \left[ M_\tau \odot \tilde{s}_1 + (1 - M_\tau) \odot \tilde{s}_2^{\tau} \right]_{ij}, \tag{4} $$

where $\tilde{s}_2^{\tau} = \mathcal{T}(\tilde{s}_2, \tau)$ is the saliency $\tilde{s}_2$ translated by $\tau$ and $M_\tau$ is the mixing mask (Eq. 2) computed with $\tilde{s}_1$ and $\tilde{s}_2^{\tau}$. Essentially, the scalar $v(\tau)$ captures the total saliency after the rearrangement (Eq. 3) and fusion (Eq. 4) of the individual saliency tensors. Intuitively, larger total saliency values imply smaller overlap between the salient regions in the shifted images, $x_1$ and $\mathcal{T}(x_2, \tau)$, and suggest that the resulting mix is more informative. Thus, it is reasonable to look for a rearrangement that maximizes the total saliency. To this end, we propose to find the optimal rearrangement (offset), $\tau^*$, by solving the following: $\tau^* = \arg\max_{\tau \in \Omega} v(\tau)$, where $\Omega$ is the space of all possible offsets (shown in Fig. 2, step 3).

Finally, we use the obtained optimal rearrangement, $\tau^*$, to generate the augmented sample, $(x', y')$. This is done by applying our Saliency-guided Mixup to the rearranged image pair (shown in Fig. 2, step 4), i.e., simply plugging the images $x_1$ and $\mathcal{T}(x_2, \tau^*)$ with the corresponding saliency maps $\tilde{s}_1$ and $\mathcal{T}(\tilde{s}_2, \tau^*)$ into Eq. 2. The exact data augmentation algorithm is detailed in the supplement.
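The search over offsets can be sketched as below. This is a hedged illustration rather than the algorithm from the supplement: the offset space is assumed to be every shift that keeps at least one row and one column in view, candidates are drawn uniformly at random, and the names `translate`, `total_saliency`, `best_offset` and `fraction` are hypothetical.

```python
import numpy as np

def translate(z, tau):
    """Zero-padded shift of z by tau = (ti, tj) along its first two axes."""
    ti, tj = tau
    H, W = z.shape[:2]
    out = np.zeros_like(z)
    di0, di1 = max(ti, 0), min(H + ti, H)
    dj0, dj1 = max(tj, 0), min(W + tj, W)
    if di0 < di1 and dj0 < dj1:
        out[di0:di1, dj0:dj1] = z[di0 - ti:di1 - ti, dj0 - tj:dj1 - tj]
    return out

def total_saliency(s1, s2, tau, zeta=1e-8):
    """Saliency that survives mixing after shifting the second map by tau."""
    s2t = translate(s2, tau)
    M = s1 / (s1 + s2t + zeta)               # mixing mask for this offset
    return float((M * s1 + (1.0 - M) * s2t).sum())

def best_offset(s1, s2, fraction=0.01, rng=None, zeta=1e-8):
    """Pick the best offset among a random `fraction` of all candidates."""
    rng = np.random.default_rng() if rng is None else rng
    H, W = s1.shape
    offsets = [(i, j) for i in range(-H + 1, H) for j in range(-W + 1, W)]
    n = max(1, int(round(fraction * len(offsets))))
    picks = rng.choice(len(offsets), size=n, replace=False)
    candidates = [offsets[k] for k in picks]
    return max(candidates, key=lambda t: total_saliency(s1, s2, t, zeta))

# Two maps whose salient blobs coincide when no shift is applied.
s1 = np.zeros((8, 8)); s1[3:5, 3:5] = 1.0
s2 = s1.copy()
tau = best_offset(s1, s2, fraction=1.0, rng=np.random.default_rng(0))
v_overlap = total_saliency(s1, s2, (0, 0))
v_best = total_saliency(s1, s2, tau)
```

In this toy case, full overlap keeps only half of the combined saliency, while any shift that separates the blobs without pushing one out of the frame keeps all of it.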

(a) Total saliency
(b) Total saliency
(c) Max saliency 0.72
Figure 4: Possible rearrangements. In each example, the saliency map corresponding to the rearrangement is shown on the left, the corresponding image (after applying Saliency-guided Mixup) is on the right. The rearrangement maximizing the total saliency is shown in c); clearly, it results in a denser mixed saliency, and produces a more informative image.

3.4 Discussion

One of the advantages of SAGE over other saliency-based augmentations (e.g., [kim2020puzzle, kim2020co]) is its efficiency. Here, we elaborate on our pipeline design choices and discuss their complexity.
Saliency-guided Mixup. Compared to the original Mixup blending step, our Saliency-guided Mixup (Sec. 3.2) adds a simple element-wise multiplication by the mixing mask. The cost of this operation is negligible relative to the model’s runtime.
Optimal Rearrangements. As described in Sec. 3.3, to arrive at our final mixture, we consider all possible rearrangements and select the one maximizing the total saliency, Eq. 4. The number of rearrangements grows quadratically with image size and soon becomes the bottleneck. To keep our method efficient, we randomly sample a small portion of all possible arrangements (1% in all experiments), and search among them. In our experiments, this does not affect classification performance, while greatly improving efficiency.
Saliency Computation. Computing saliency requires an extra forward and backward pass of the model. When the existing works [kim2020puzzle, kim2020co] compute saliency masks, they discard all the intermediate computations and only use the mask itself for DA, which essentially doubles the training time. In contrast, SAGE saves the gradients, $g_s$, with respect to the model parameters, obtained in the backward pass of saliency computations. These gradients can be combined with the standard gradients, $g_a$, computed on SAGE-augmented images to perform the final model update with $g = \eta\, g_s + (1 - \eta)\, g_a$, where $\eta \in [0, 1]$. The hyperparameter, $\eta$, effectively controls how much information from the original images is used for updates versus that of the augmented images. This trick allows us to amortize the saliency computations, and reuse the intermediate results for the model updates. Note that this is only possible thanks to our saliency definition (Eq. 1), which differs from the classical one [simonyan2013deep].
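The gradient reuse can be sketched as follows. The exact update rule is not recoverable from the text here, so the sketch assumes a convex combination of the two gradients weighted by a ratio in [0, 1]; the names `combine_gradients`, `sgd_step` and `eta` are hypothetical.

```python
import numpy as np

def combine_gradients(g_clean, g_aug, eta):
    """Blend per-parameter gradients: eta * clean + (1 - eta) * augmented.

    g_clean: gradients saved from the saliency pass on the original images.
    g_aug:   gradients computed on the augmented images.
    The convex-combination form is an assumption made for illustration.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return {name: eta * g_clean[name] + (1.0 - eta) * g_aug[name]
            for name in g_clean}

def sgd_step(params, grads, lr):
    """Plain gradient-descent update applied to every parameter."""
    return {name: params[name] - lr * grads[name] for name in params}

params = {"w": np.array([0.0, 0.0])}
g_clean = {"w": np.array([1.0, 2.0])}
g_aug = {"w": np.array([3.0, 4.0])}
g = combine_gradients(g_clean, g_aug, eta=0.25)
params = sgd_step(params, g, lr=0.1)
```

Setting `eta=0` recovers training on augmented images only, and `eta=1` uses the original images only, which matches the role the text assigns to the ratio.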

4 Experiments

We demonstrate the advantage of SAGE in image classification in Sec. 4.1. Sec. 4.2 evaluates SAGE in out-of-distribution generalization, Sec. 4.3 analyzes the efficiency of our pipeline and Sec. 4.4 presents an ablation study of SAGE’s components. Our implementation is largely based on the publicly available repository of Puzzle Mix.

| Dataset | Model | Vanilla | Mixup | CutMix | Manifold | SaliencyMix | Puzzle Mix | Co-Mixup | SAGE |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | PreActResNet18 | 95.07 | 95.97 | 96.27 | 96.28 | 96.15 | 96.62 | 96.23 | **96.95** |
| CIFAR-100 | PreActResNet18 | 76.8 | 77.40 | 78.96 | 78.51 | 78.85 | 79.65 | 79.68 | **79.91** |
| CIFAR-100 | WRN16 | 78.55 | 79.83 | 80.03 | 79.77 | 80.16 | **80.73** | 80.42 | 80.45 |
| CIFAR-100 | ResNext29 | 78.77 | 78.23 | 77.43 | 77.97 | 78.89 | 79.20 | 80.27 | **80.35** |

Table 1: Image classification accuracy (%). CIFAR-10 and CIFAR-100 results are obtained by averaging over three independent training runs. The best number in each row is in bold.

4.1 Image Classification

Following previous work [kim2020co], we perform evaluations on the CIFAR-10 [krizhevsky2009learning] and CIFAR-100 [krizhevsky2009learning] datasets with the PreActResNet18 [he2016identity], ResNext29 [xie2017aggregated] and WideResNet16 [zagoruyko2016wide] architectures. For all datasets and models, we follow the optimization schedule described in Puzzle Mix and Co-Mixup; training and model details are included in the supplement. For a comprehensive comparison, we use the following DA baselines: (i) Vanilla, i.e., standard data augmentation only, which includes random cropping and horizontal flips, (ii) Mixup [zhang2018mixup], (iii) CutMix [yun2019cutmix], (iv) Manifold [verma2019manifold], (v) SaliencyMix [uddin2020saliencymix], (vi) Puzzle Mix [kim2020puzzle] and (vii) Co-Mixup [kim2020co]. Note that all the baseline methods are applied on top of the standard data augmentation. Following previous work [kim2020puzzle, kim2020co], we report the results averaged over three independent training runs.

Table 1 summarizes the comparison of SAGE to the baselines, pointing to two key observations. First, the DA techniques utilizing saliency (i.e., SaliencyMix, Puzzle Mix, Co-Mixup and SAGE) substantially outperform other non-saliency-based variants across almost all datasets and architectures. This clear improvement demonstrates that using image semantics for data augmentation leads to better generalization on the test set. Second, among saliency-based methods, SAGE is consistently the best on CIFAR-10; on CIFAR-100, SAGE outperforms Puzzle Mix and Co-Mixup on PreActResNet18 and ResNext29, and has comparable performance on WideResNet. SAGE also outperforms SaliencyMix on all tested architectures on both datasets. We attribute the advantage of SAGE to the fact that our augmented images are smoother and more realistic, combining the advantages of Mixup and the saliency-based methods. This is despite the fact that Puzzle Mix and Co-Mixup are explicitly optimizing for maximum saliency, and have considerably more computational overhead.

4.2 Out-of-distribution Generalization and Few-shot Adaptation

It is known that different DA techniques may lead to similar test accuracy improvements but have drastically different behavior on out-of-distribution (OOD) data [verma2019manifold]. This phenomenon is attributed to the difference in the quality of the learned representation. Therefore, to further evaluate our approach, we consider generalization in the OOD setting.

In our evaluation, we test the OOD generalization in two scenarios: using corrupted test images (with Gaussian noise or adversarial perturbations [szegedy2014intriguing]) or evaluating generalization to new categories in a few-shot setup [vinyals2016matching]. More specifically, we test against three different perturbations: i) Gaussian noise with zero mean and variance of 0.01, ii) an $\ell_\infty$-norm bounded attack generated using the Fast Gradient Sign Method (FGSM) [goodfellow2014explaining] with perturbation budget $\epsilon$ and iii) an $\ell_2$-norm bounded attack crafted with the Fast Gradient Method (FGM) [goodfellow2014explaining] with perturbation budget $\epsilon$. Our choice of the attacks and $\epsilon$ follows the standard practice used with the robustness benchmarks [croce2020robustbench]. To evaluate few-shot adaptation capabilities of our model and test how well the learned representations transfer to novel categories, we perform few-shot classification on the mini-ImageNet dataset [vinyals2016matching]. Additional details are provided in the supplement.
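The three perturbations can be sketched in NumPy given a loss gradient with respect to the input. This is a minimal illustration, not the evaluation code: FGSM is taken as a single signed-gradient step and FGM as a single step along the $\ell_2$-normalized gradient, the gradient is a random stand-in, and the budget values passed below are placeholders rather than the ones used in the experiments.

```python
import numpy as np

def gaussian_noise(x, variance=0.01, rng=None):
    """Add zero-mean Gaussian noise with the given variance."""
    rng = np.random.default_rng() if rng is None else rng
    return x + rng.normal(0.0, np.sqrt(variance), size=x.shape)

def fgsm(x, grad, eps):
    """l_inf-bounded step: move every pixel by eps in the sign of the gradient."""
    return x + eps * np.sign(grad)

def fgm_l2(x, grad, eps):
    """l_2-bounded step: move by eps along the normalised gradient."""
    norm = np.linalg.norm(grad)
    return x if norm == 0 else x + eps * grad / norm

rng = np.random.default_rng(0)
x = rng.random((4, 4, 3))
grad = rng.normal(size=x.shape)      # stand-in for the loss gradient w.r.t. x
x_inf = fgsm(x, grad, eps=0.03)      # placeholder budget
x_l2 = fgm_l2(x, grad, eps=0.5)      # placeholder budget
x_noisy = gaussian_noise(x, variance=0.01, rng=rng)
```

In practice the perturbed images would also be clipped back to the valid pixel range before evaluation.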

To summarize the performance on all three OOD benchmarks, we average the accuracy across the benchmarks, and get a single score quantifying model robustness. Figure 5(a) plots the average OOD accuracy on CIFAR-100 against the standard accuracy on the original test set. We observe a striking difference in the robustness characteristics across different DA methods. Notably, models trained using SAGE are much less sensitive to out-of-distribution shifts compared to the two other saliency-based methods, i.e., Puzzle Mix and Co-Mixup, despite comparable test accuracy improvements. Moreover, the models trained with CutMix, Puzzle Mix and Co-Mixup have worse OOD performance compared to Vanilla training. These methods produce augmentations with unnatural patch-like patterns, which likely leads to unwanted properties of the learned representations. In contrast, Mixup and SAGE fuse images in a homogeneous way, leading to models more robust to various input perturbations. Please refer to the supplement for the full table of results and CIFAR-10 experiments.

| Vanilla | Mixup | CutMix | SaliencyMix | Puzzle Mix | Co-Mixup | SAGE |
|---|---|---|---|---|---|---|
| 77.9 | 78.9 | 78.4 | 78.6 | 78.6 | 79.0 | 79.8 |


Table 2: Few-shot classification accuracy on mini-ImageNet.

To show OOD generalization beyond adversarial attacks, we compare SAGE to other data augmentation techniques for few-shot classification on mini-ImageNet, where the goal is to learn a representation that generalizes to novel categories. We follow the setup from previous work [dvornik2019diversity], using a single ResNet12 with the prototype classifier. As shown in Table 2, SAGE outperforms other augmentation techniques, including Mixup (the strongest model on adversarial perturbations). This shows that SAGE is useful in OOD scenarios beyond Gaussian and adversarial perturbations.
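A prototype classifier of the kind mentioned above can be sketched as follows. It assumes embeddings have already been produced by a frozen feature extractor and that queries are assigned to the nearest class mean under squared Euclidean distance; the distance choice and the names `prototypes` and `classify` are illustrative assumptions, not details taken from the cited setup.

```python
import numpy as np

def prototypes(support_emb, support_labels, n_classes):
    """Class prototype = mean embedding of that class's support examples."""
    return np.stack([support_emb[support_labels == c].mean(axis=0)
                     for c in range(n_classes)])

def classify(query_emb, protos):
    """Assign each query to the prototype with the smallest squared distance."""
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# Toy 2-way, 2-shot episode with 2D embeddings (stand-ins for network features).
support = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(support, labels, n_classes=2)
pred = classify(np.array([[0.1, 0.1], [0.9, 1.0]]), protos)
```

Because the classifier has no trainable parameters of its own, accuracy in this setup reflects the quality of the learned representation, which is what the few-shot comparison is meant to probe.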

4.3 Runtime Analysis

In this section, we compare the training time of different data augmentation methods running on a single NVIDIA Tesla T4. Figure 5(b) plots each method’s average training time (GPU hours) versus accuracy. Notably, the techniques not using saliency (i.e., Mixup, Manifold and CutMix) are as fast as Vanilla, since the data augmentation is performed during data loading, which does not affect the overall training time. SaliencyMix stands apart from the other saliency-based augmentation techniques. This is because it uses an external saliency detector based on a shallow, pre-deep-learning method [montabone2010human], which is fast but considerably less capable than the deep saliency methods [simonyan2013deep] used for the other augmentation techniques. Consequently, SaliencyMix introduces minimal overhead; however, its improvement on classification accuracy is limited. Other saliency-based methods (i.e., Puzzle Mix, Co-Mixup and SAGE) are more accurate, yet also significantly slower. Among them, SAGE is the fastest and also the most accurate on CIFAR-100. Based on these observations, we argue that SAGE represents a good trade-off between accuracy and efficiency overall, and is clearly the best choice among the saliency-based methods.

(a) Robustness Comparison
(b) Runtime Comparison
Figure 5: Robustness and efficiency analysis of SAGE.

(a) Robustness versus standard accuracy in OOD generalization. The methods in the green area (i.e., Mixup and SAGE) improve both accuracy and robustness relative to vanilla augmentation, while the others in red (i.e., CutMix, Co-Mixup and Puzzle Mix) improve standard test accuracy at the cost of decreased robustness. (b) Runtime comparison of SAGE and other baselines. We estimate computation cost with a single NVIDIA Tesla T4. For SAGE, there is no noticeable overhead besides the additional forward and backward pass to compute the saliency map which approximately doubles the time of Vanilla training.

4.4 Ablation Studies

In this section, we further analyze our data augmentation strategy by ablating different design choices in the pipeline. For all the experiments, we use the same setup described in Sec. 4.1 with ResNet18 on CIFAR-10 and CIFAR-100. Please see the supplement for additional ablations.

Saliency-guided Mixup and optimal rearrangements. The two components that make SAGE novel are the Saliency-guided Mixup (Sec. 3.2) and the Optimal Rearrangements (Sec. 3.3). Here, we evaluate SAGE with some of the components removed or replaced by an existing technique. In particular, we evaluate i) SAGE w/o OR (i.e., without optimal rearrangements) that always performs Saliency-guided Mixup on non-shifted images and ii) SAGE w/o SM (i.e., without Saliency-guided Mixup) for mixing images together that simply replaces one image region with the other image instead of performing smooth saliency-based mixing. Examples of SAGE w/o SM and SAGE w/o OR are included in the supplement. As shown in Table 3, each of the components is important for the final performance and thus justifies their use.

| Model | CIFAR-10 | CIFAR-100 |
|---|---|---|
| Vanilla | 95.07 | 76.8 |
| SAGE w/o SM | 96.53 | 78.89 |
| SAGE w/o OR | 96.48 | 78.68 |
| SAGE | 96.95 | 79.91 |

| Search Space | CIFAR-10 | CIFAR-100 |
|---|---|---|
| 1% | 96.95 | 79.91 |
| 10% | 96.67 | 79.47 |
| 50% | 96.58 | 79.40 |
| 100% | 96.69 | 79.45 |

Table 3: Ablation studies of SAGE. (top) Dissecting the benefit from saliency-guided mixing and optimal rearrangements. Here, SAGE w/o SM denotes SAGE without Saliency-guided Mixup and SAGE w/o OR denotes SAGE without optimal rearrangements. (bottom) SAGE’s accuracy depending on the explored rearrangements. The first column indicates the size of the random portion of rearrangements used for data augmentation.

Optimal rearrangements search space. As described in Sec. 3.4, to select a rearrangement, we evaluate a set of locations, and proceed with the one that maximizes saliency. To speed up the search, we only explore a random subset of all rearrangements, 1% in all previous experiments, which suggests that our data augmentation may be sub-optimal. Table 3 shows the model’s performance, depending on the portion of all rearrangements we consider for DA. Surprisingly, using only 1% of the rearrangements works best. While seemingly counterintuitive, we hypothesize that the sub-optimal rearrangements act as additional training regularization and introduce more diversity in the augmented data.

5 Conclusion

We proposed SAGE – a new data augmentation approach that integrates visual saliency to produce highly informative training samples. Compared to existing methods, SAGE leads to better test accuracy, and generates more realistic training samples. Moreover, SAGE is the only saliency-based augmentation technique that improves model robustness and OOD performance, while incurring minimal computational overhead. In principle, SAGE is not limited to image classification and can be easily extended to other visual tasks. We believe that SAGE delivers a unique combination of accuracy, robustness and efficiency, and can become the new plug-and-play data augmentation for a wide range of vision tasks.


Appendix A Summary of the Supplementary Material

The supplementary material is organized as follows. In Sec. B, we describe the exact optimization schedule and the hyperparameters used to train with SAGE and other baseline DA frameworks. In Sec. C and Sec. D, we provide detailed results to bolster our claim on SAGE’s improvement on OOD generalization (Sec. 4.2) and its low computation overhead (Sec. 4.3). Pseudocode to augment data with SAGE is included in Sec. E. In Sec. F, we show examples of augmentations using SAGE w/o SM and SAGE w/o OR (Sec. 4.4). Furthermore, we provide additional ablation studies to verify the design choices of SAGE in Sec. G.

Appendix B Optimization schedule and hyper-parameters

Optimization schedule: Following previous work [kim2020puzzle, kim2020co], all models are trained using stochastic gradient descent (SGD) for 300 epochs with an initial learning rate of 0.2. The learning rate decreases by a factor of 0.1 at epochs 100 and 200. We use a momentum of 0.9 and a weight decay of 0.0001. The above optimization schedule is used to train both CIFAR-10 and CIFAR-100 for all models, except for Co-Mixup [kim2020co] on CIFAR-10. We notice that training with Co-Mixup on CIFAR-10 with an initial learning rate of 0.2 results in divergence at the beginning of the training. We find training becomes stable with an initial learning rate of 0.12.

Training with baseline DA: We follow the hyperparameter settings used in previous work [kim2020puzzle, kim2020co]. To train with Mixup [zhang2018mixup], CutMix [yun2019cutmix], Puzzle Mix [kim2020puzzle] and Co-Mixup [kim2020co], we use with , and use for Manifold Mixup [verma2019manifold]. For SaliencyMix, Puzzle Mix and Co-Mixup, we use the parameter settings described in the authors’ public repositories: , and .

Training with SAGE: For all models and datasets, we use 1% of all possible rearrangements (Sec. 3.3) and a smoothing parameter of (Sec. 3.2). Here we use to denote the truncation factor (Sec. G) and use to denote the gradient update ratio (Sec. 3.4). On CIFAR-10 with ResNet18, we . On CIFAR-100 with ResNet18, we . On CIFAR-100 with WRN16, we . On CIFAR-100 with ResNext29, we .

Appendix C Robustness Evaluation on CIFAR-10 and CIFAR-100

We evaluate the robustness of models trained with various baseline DA methods. In particular, we measure the classification accuracy of models on test data perturbed using Gaussian noise () and adversarial attacks. To craft adversarial perturbations, we use for bounded FGSM [goodfellow2014explaining] and for bounded FGM [goodfellow2014explaining]. Results are based on models trained with ResNet18. In Figure 6, we notice that models trained with SAGE achieve improved classification accuracy on both clean and noise-perturbed test data. However, methods such as SaliencyMix, Puzzle Mix, Co-Mixup and CutMix improve generalization performance on the test data at the cost of decreased robustness.

| Perturbations | Vanilla | Mixup | CutMix | Manifold | SaliencyMix | Puzzle Mix | Co-Mixup | SAGE |
|---|---|---|---|---|---|---|---|---|
| Rank | 4 | 3 | 8 | 1 | 6 | 5 | 7 | 2 |
| FGSM () | 79.96 | 80.93 | 79.57 | 85.79 | 80.62 | 81.96 | 78.78 | 83.75 |
| FGM () | 89.67 | 89.22 | 87.81 | 90.86 | 88.86 | 89.64 | 88.11 | 90.64 |
| Gaussian | 89.88 | 92.56 | 77.2 | 92.21 | 85.99 | 87.60 | 85.25 | 91.67 |

Table 4: Classification accuracy on noise perturbed CIFAR-10 test data.

| Perturbations | Vanilla | Mixup | CutMix | Manifold | SaliencyMix | Puzzle Mix | Co-Mixup | SAGE |
|---|---|---|---|---|---|---|---|---|
| Rank | 3 | 1 | 8 | 4 | 5 | 7 | 6 | 2 |
| FGSM () | 49.24 | 50.51 | 44.2 | 48.89 | 46.52 | 44.58 | 44.32 | 50.18 |
| FGM () | 62.19 | 63.36 | 55.57 | 61.14 | 59.4 | 58.16 | 58.56 | 62.23 |
| Gaussian | 52.68 | 60.76 | 28.06 | 55.47 | 38.21 | 43.96 | 34.46 | 47.68 |

Table 5: Classification accuracy on noise perturbed CIFAR-100 test data.

Figure 6: Visualization of the standard generalization performance vs. generalization in the OOD setting on (a) CIFAR-10 and (b) CIFAR-100. We notice that SaliencyMix, CutMix, Co-Mixup and Puzzle Mix improve standard test accuracy over vanilla training, but at the cost of decreased robustness.

Appendix D Runtime Comparison

To estimate the computation cost of various baseline DA methods, we measure the total GPU hours required to train on CIFAR-10 and CIFAR-100 using a single NVIDIA Tesla T4. Training with SAGE approximately doubles the time of vanilla training due to the computation of the saliency map; however, unlike Puzzle Mix and Co-Mixup, there is no additional overhead in finding the optimal rearrangements to maximize the total saliency. SaliencyMix stands apart from the other saliency-based augmentation techniques because it uses an external trained saliency detector based on a shallow, pre-deep-learning method [montabone2010human], which is fast but considerably less capable than the deep saliency methods [simonyan2013deep] used by the other augmentation techniques. Consequently, SaliencyMix introduces minimal overhead; however, its improvement in classification accuracy is limited.

| Dataset | Model | Vanilla | Mixup | CutMix | Manifold | SaliencyMix | Puzzle Mix | Co-Mixup | SAGE |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR10 | PreActResNet18 | 3.35 | 3.29 | 3.38 | 3.44 | 3.45 | 8.9 | 25.29 | 6.83 |
| CIFAR100 | ResNext29 | 9.76 | 9.67 | 9.83 | 10.27 | 10.18 | 22.64 | 35.65 | 19.5 |

Table 6: GPU hours comparison of SAGE and other baselines.

Figure 7: Compared to other saliency-guided methods, SAGE achieves better standard test accuracy on both (a) CIFAR-10 and (b) CIFAR-100 with low computation overhead.
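
The "approximately doubles" statement can be checked directly from the GPU hours in Table 6. The snippet below is only a worked-arithmetic sketch; the names `GPU_HOURS` and `relative_cost` are hypothetical, and the numbers are copied from the table.

```python
# GPU hours copied from Table 6 (single NVIDIA Tesla T4).
GPU_HOURS = {
    "CIFAR10/PreActResNet18": {
        "Vanilla": 3.35, "Mixup": 3.29, "CutMix": 3.38, "Manifold": 3.44,
        "SaliencyMix": 3.45, "Puzzle Mix": 8.9, "Co-Mixup": 25.29, "SAGE": 6.83,
    },
    "CIFAR100/ResNext29": {
        "Vanilla": 9.76, "Mixup": 9.67, "CutMix": 9.83, "Manifold": 10.27,
        "SaliencyMix": 10.18, "Puzzle Mix": 22.64, "Co-Mixup": 35.65, "SAGE": 19.5,
    },
}


def relative_cost(hours, baseline="Vanilla"):
    """Training time of each method as a multiple of the baseline's time."""
    base = hours[baseline]
    return {method: h / base for method, h in hours.items()}


if __name__ == "__main__":
    for setting, hours in GPU_HOURS.items():
        ratios = relative_cost(hours)
        print(setting, {m: round(r, 2) for m, r in ratios.items()})
```

On both settings the SAGE ratio comes out close to 2, while the Puzzle Mix and Co-Mixup ratios are larger.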

Appendix E Full SAGE Algorithm

Algorithm 1 shows the exact procedure of SAGE. We discuss saliency-guided mixing with optimal rearrangement (Ln 3) in Sec. 3.2, and the rest of the algorithm is covered in Sec. 3.1.

Input: Pairs of training samples: and , a classifier , a loss function , a randomly sampled mix ratio , a Gaussian smoothing parameter and is the space of all possible image translations
Output: A new data-label pair:
3 , where is defined in Eq. 4
Algorithm 1 Data Augmentation based on SMART Mixup
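
The search over translations in line 3 can be illustrated with a small pure-Python sketch. It is not the authors' implementation and rests on several assumptions: single-channel images and precomputed, already smoothed, non-negative saliency maps given as nested lists; a zero-filled (non-cyclic) shift; a mask equal to the first image's share of the combined saliency; and an exhaustive search over all offsets. The names `shift`, `mixing_mask`, `total_saliency`, `best_translation`, `sage_mix` and `EPS` are hypothetical. Rescaling the saliency by the mix ratio and mixing the labels are omitted.

```python
from itertools import product

EPS = 1e-8  # assumed constant; keeps the mask defined where both maps are zero


def shift(grid, dy, dx, fill=0.0):
    """Translate a 2D list by (dy, dx); vacated cells are set to `fill`."""
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = grid[y][x]
    return out


def mixing_mask(s0, s1):
    """Per-pixel weight of image 0: its share of the combined saliency."""
    return [[a / (a + b + EPS) for a, b in zip(r0, r1)] for r0, r1 in zip(s0, s1)]


def total_saliency(s0, s1):
    """Saliency retained when the two maps are blended with the mask."""
    mask = mixing_mask(s0, s1)
    return sum(
        m * a + (1.0 - m) * b
        for rm, r0, r1 in zip(mask, s0, s1)
        for m, a, b in zip(rm, r0, r1)
    )


def best_translation(s0, s1):
    """Offset of image 1 that maximizes the retained saliency (brute force)."""
    h, w = len(s0), len(s0[0])
    offsets = product(range(-(h - 1), h), range(-(w - 1), w))
    return max(offsets, key=lambda o: total_saliency(s0, shift(s1, *o)))


def sage_mix(x0, x1, s0, s1):
    """Shift image 1 by the best offset, then blend with the saliency mask."""
    dy, dx = best_translation(s0, s1)
    x1_t, s1_t = shift(x1, dy, dx), shift(s1, dy, dx)
    mask = mixing_mask(s0, s1_t)
    mixed = [
        [m * a + (1.0 - m) * b for m, a, b in zip(rm, r0, r1)]
        for rm, r0, r1 in zip(mask, x0, x1_t)
    ]
    return mixed, mask, (dy, dx)


if __name__ == "__main__":
    # Both objects sit in the top-left corner, so mixing without a shift
    # would make them overlap; the search moves the second one aside.
    s0 = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    s1 = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    x0 = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    x1 = [[7.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    mixed, mask, offset = sage_mix(x0, x1, s0, s1)
    print("offset:", offset)
    for row in mixed:
        print([round(v, 3) for v in row])
```

The exhaustive search is used here only for clarity; Appendix B notes that training uses a subset of all possible rearrangements.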

Appendix F Examples of Augmentation Results with SAGE w/o SM and SAGE w/o OR

In Sec. 4.4, we verified the effectiveness of our data augmentation strategy by ablating i) SAGE w/o OR (i.e., without optimal rearrangements), which always performs Saliency-guided Mixup on non-shifted images, and ii) SAGE w/o SM (i.e., without Saliency-guided Mixup). Examples of the augmentation results are shown in Figure 8.

Figure 8: Augmentation results with SAGE w/o OR and SAGE w/o SM

Appendix G Additional Ablations

We include two additional ablation studies in this section: i) reusing parameter gradients from un-augmented samples (Sec. 3.4) and ii) random rescaling of the total saliency. For each experiment, we use the best result as the control group (bold numbers), then we repeat the runs with modified task-related parameters.

(left)

| Gradient update ratio | CIFAR-10 | CIFAR-100 |
|---|---|---|
| 0.5 | 96.65 | 79.36 |
| 0.7 | 96.95 | 79.91 |
| 1.0 | 96.58 | 79.24 |

(right)

| Truncation factor | CIFAR-10 | CIFAR-100 |
|---|---|---|
| 0.5 | 96.75 | 79.91 |
| 0.6 | 96.95 | 79.7 |
| 1.0 | 96.6 | 79.29 |

Table 7: Additional ablation studies of SAGE. (left) Test accuracy of models trained with combined parameter gradients from un-augmented and SAGE-augmented samples. (right) Test accuracy of models trained with truncated total saliency.

Reusing the parameter gradients: In Sec. 3.4, we discuss performing the gradient descent update by combining parameter gradients computed on un-augmented and SAGE-augmented samples. In particular, let and represent the gradients computed using un-augmented and augmented images, respectively. The final model update is based on , where . In Table 7, we observe that reusing the parameter gradients computed on un-augmented samples () significantly increases accuracy on the test data.
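
One way to picture this combination is as a per-parameter blend of the two gradients. The sketch below is hypothetical: the exact update rule is the one given in Sec. 3.4, and this code merely assumes a convex combination in which the ratio `eta` weights the gradient from augmented samples, so that `eta = 1.0` falls back to using augmented samples only. The function name `combine_gradients` and its arguments are illustrative.

```python
def combine_gradients(g_clean, g_aug, eta):
    """Blend two flat gradient lists (assumed form): eta*g_aug + (1-eta)*g_clean."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta is assumed to lie in [0, 1]")
    if len(g_clean) != len(g_aug):
        raise ValueError("gradients must have the same length")
    return [eta * ga + (1.0 - eta) * gc for gc, ga in zip(g_clean, g_aug)]


if __name__ == "__main__":
    g_clean = [1.0, 2.0]  # gradient from un-augmented samples (toy values)
    g_aug = [3.0, 4.0]    # gradient from augmented samples (toy values)
    for eta in (0.5, 0.7, 1.0):
        print(eta, combine_gradients(g_clean, g_aug, eta))
```

Under this assumed form, the un-augmented gradient comes from the same backward pass that produces the saliency map, so reusing it would add no extra pass.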

Random rescaling of total saliency: A random mixing ratio in prior work [zhang2018mixup, yun2019cutmix, devries2017improved, kim2020co, kim2020puzzle] can be seen as a way to increase the diversity of the augmentation results. Similarly, we randomly rescale the total saliency of smoothed and using and , respectively. In practice, we observe that the diversity in the augmented images greatly decreases when , since and dominate when computing the total saliency. Therefore, when an offset image is rescaled to have a small total saliency, it is often better to simply exclude it from the augmented result. As such, we propose a simple heuristic to truncate the random rescaling factor: , where . Results in Table 7 show that with , the test accuracy on both datasets increases significantly.