
AdvMask: A Sparse Adversarial Attack Based Data Augmentation Method for Image Classification

Data augmentation is a widely used technique for enhancing the generalization ability of convolutional neural networks (CNNs) in image classification tasks. Occlusion is a critical factor that affects the generalization ability of image classification models. To generate new samples, existing data augmentation methods based on information deletion simulate occluded samples by randomly removing some areas of the images. However, those methods cannot delete areas according to the structural features of the images. To solve this problem, we propose a novel data augmentation method, AdvMask, for image classification tasks. Instead of randomly removing areas in the images, AdvMask obtains the key points that have the greatest influence on the classification results via an end-to-end sparse adversarial attack module. We can therefore find the points most sensitive to the classification results without considering the diversity of image appearances and object shapes. In addition, a data augmentation module is employed to generate structured masks based on the key points, forcing CNN classification models to seek other relevant content when the most discriminative content is hidden. AdvMask effectively improves the performance of classification models at test time. Experimental results on various datasets and CNN models verify that the proposed method outperforms previous data augmentation methods in image classification tasks.




I Introduction

Image classification [19, 25] aims to classify digital images automatically and has been a core topic of computer vision for decades. It can be applied to various fields, such as object detection, video classification, and segmentation. Recently, convolutional neural networks (CNNs) have made tremendous progress in image classification. To further improve classification accuracy, more complex deep neural networks, such as ResNet-152 [14], have been proposed. However, when a model is too complex, for example when the complexity of the neural network exceeds that of the training data, over-fitting damages classification accuracy. At the same time, images in such tasks often undergo intensive changes in appearance and varying degrees of occlusion, which are critical factors affecting the generalization ability of CNNs [38]. Since most image datasets contain only clearly visible images, the generalization ability of classification models degrades significantly in these cases.

Considering the extreme difficulty of data collection in real-world tasks, data augmentation [13, 6, 33, 1, 7] has been proposed as a very important technique to improve the generalization ability of CNNs by generating more effective data based on the existing data.

Recently, data augmentation methods based on information deletion, such as Cutout [8], Hide-and-Seek (HaS) [30], and Random Erasing [38], have been widely used. Such methods can reduce a model's sensitivity to occlusion and strengthen its generalization ability; in particular, models trained with them can recognize occluded objects even when the original training data are all clearly visible. The core idea of information-deletion methods is to generate new data by deleting some areas of the image, simulating real-world situations in which objects are partially covered and their overall structure is incomplete. However, it is found that, instead of customizing the mask according to the structural characteristics of each image, existing information-deletion methods block random areas of the images. Although this may improve classification accuracy because the diversity of the training data increases, such mask generation is untargeted and uninterpretable. In the worst case, the occluded areas fall entirely on the foreground, so that the objects of interest are completely occluded, which seriously damages the generalization ability of CNN models. In addition, different images have different structures; in other words, the key information used by CNN models to classify different images differs. It is therefore much more beneficial to customize the mask for each image according to its structural information, which random information deletion cannot do.

Based on the above discussion, we propose a data augmentation method (AdvMask) based on information deletion, which can regularize CNN training and improve the performance of models in image classification. The most important motivation for AdvMask is to simulate situations where the key information for classification is partly or completely lost. The key information in the image is closely related to the classification result, and changing this information leads to misclassification. We therefore need to locate the points where perturbation would make the classification result differ from the truth. Adversarial attack deceives the target model by adding adversarial perturbations to the original clean images, and sparse adversarial attack perturbs only the pixels that have the greatest influence on the classification results. Therefore, in AdvMask, we design a sparse adversarial attack module to find those pixels. For data augmentation, we generate augmented samples by occluding some of these adversarial attack points in the images. Occlusion at these points has three advantages. First, the deleted areas generated by AdvMask are based on the adversarially sensitive region, so they are not limited to one continuous area but comprise multiple areas of various shapes. Second, the deleted areas generated by AdvMask are customized for each image rather than randomly generated, so they are closely related to the image structure, covering both foreground and background. Finally, AdvMask uses sparse adversarial attack to find the most sensitive areas in the image, instead of focusing only on the area where the main object is located. At the same time, we avoid excessive deletion or reservation of continuous regions, in order to balance object occlusion and information retention [3].

Fig. 1:

The framework of the proposed AdvMask, which consists of two modules: a sparse adversarial attack module and a data augmentation module. Given a perturbation, we encode it with a trainable neural network to generate a feature map. Then, we use the scaled sigmoid function to generate an approximately binary attack mask. The attack mask guides the generation of the augmentation masks for augmented images. The shapes of the generated augmentation masks are highly diverse; a few augmented images are shown.

Generally speaking, AdvMask consists of two modules: sparse adversarial attack module and data augmentation module, as illustrated in Fig. 1. Using the learned sparse adversarial attack model, we can customize the augmented training samples for each image in the process of model training. In this way, the augmented data can simulate various situations where some key information is lost in the image, which can encourage the model to pay more attention to less sensitive areas with important features.

AdvMask can reduce the influence of over-fitting, enhance the generalization ability of the classifier, and thus improve classification accuracy. To demonstrate the effectiveness of AdvMask, we conduct experiments on CIFAR-10/100 [17] and Tiny-ImageNet [4] for coarse-grained classification, and on the Oxford Flower Dataset [26] for fine-grained classification and effect visualization.

To summarize, the contributions of this work are featured in the following aspects:

  • A novel data augmentation method is proposed, which innovatively uses sparse adversarial attack methods for data augmentation. AdvMask first finds key information in the images through sparse adversarial attack module, and then generates customized augmented data for every image.

  • We design an end-to-end sparse adversarial attack method, which uses the attack to guide the automatic selection of pixels without hand-crafted rules. Compared with other sparse adversarial attack methods, it is competitive in speed and attack success rate.

  • We develop a data augmentation module in AdvMask based on occlusion of key information. By occluding the key information, classification models are forced to focus more on less sensitive areas that still carry important features.

  • We further conduct image classification experiments to compare the proposed method with other data augmentation techniques, and analyze the parameters of our method in detail through ablation studies. Experimental results demonstrate that the proposed data augmentation method achieves superior performance on several datasets and improves the performance of various popular image classification models.

  • We are the first to introduce sparse adversarial attack to the data augmentation community. We show that the key points found by the sparse adversarial attack module are important for image classification and are very useful for designing data augmentation methods.

The remainder of the paper is organized as follows. In Section II, we review related work on sparse adversarial attack and data augmentation methods. Section III elaborates on the proposed information-deletion-based data augmentation method for image classification, including the sparse adversarial attack module and the data augmentation module. Section IV presents the experimental results of the comparisons and evaluations. Ablation experiment results and performance analysis are reported in Section V. Finally, the conclusion is drawn in Section VI.

II Related Work

II-A Sparse Adversarial Attack

The main difficulty of sparse adversarial attack is determining which pixels to perturb. Most existing methods follow a two-stage pipeline: first, they artificially define a measure of pixel importance, such as gradients; then, according to the importance of each pixel, they apply an iterative strategy that selects the most important pixels to attack until the attack succeeds. For example, based on the saliency map, JSMA [27] uses a heuristic strategy to iteratively select the pixels to be perturbed in each iteration. C&W-$\ell_0$ [2] first attacks under an $\ell_2$-norm constraint, and then fixes the least important pixels according to the perturbation magnitudes and gradients. PGD-$\ell_0$ [5] projects the perturbations generated by PGD [23] onto the $\ell_0$-ball to obtain an $\ell_0$ version of PGD; the projection fixes pixels that need not be perturbed according to the perturbation magnitudes and the projection loss. SparseFool [24] converts the problem into an $\ell_1$-norm constrained problem and selects pixels to perturb in each iteration according to geometric relationships. Based on the gradient and a distortion map, GreedyFool [9] adds pixels to a modifiable pixel set in each iteration, and then greedily drops as many less important pixels as possible to obtain better sparsity. All of these methods seek a better way to evaluate pixel importance and use greedy strategies to remove as many unimportant pixels as possible; we argue that combining these two steps is a better choice.

There are also some methods that go beyond the two-stage pipeline. The One-Pixel attack uses the differential evolution algorithm to explore the extreme case where only one pixel is perturbed; however, this method has a low attack success rate. SAPF [11] is similar to our adversarial attack method: it factorizes the perturbation into the product of the perturbation magnitude and a binary mask, and uses $\ell_p$-box ADMM [35] to optimize them jointly. Our method also uses a binary mask, but we generate the mask through a trainable neural network. Experimental results show that the adversarial attack points generated by our method have better sparsity.

II-B Data Augmentation Methods

Deep neural networks often suffer from severe over-fitting, which weakens their generalization ability. Data augmentation is an effective remedy for over-fitting and has long been used in practice. Its core idea is to generate more data on the basis of the current training set. The baseline in this work is basic image manipulation, including cropping, translation, and rotation, which are widely used in image classification tasks. Recently, several methods based on information deletion aim at simulating occlusion in image classification tasks. Random Erasing (RE) [38] reduces the risk of over-fitting and makes the model robust to occlusion by randomly selecting a rectangular region in an image and erasing its pixels with random values. Similarly, Hide-and-Seek (HaS) [30] randomly hides patches in a training image, which can improve object localization accuracy and enhance the generalization ability of CNN models. Cutout [8] is a simple regularization technique that randomly masks out square regions of input images and can improve the robustness and overall performance of CNN models. More recently, GridMask [3] uses structured dropping regions in the input images, emphasizing the balance between information deletion and reservation, and greatly improves the performance of CNN models on several computer vision tasks. Similar to GridMask, FenceMask [20] overcomes the difficulty of small-object augmentation and improves over baselines by enhancing the sparsity and regularity of the occlusion blocks. For person re-identification tasks, [16] uses a sliding occlusion mask with specific horizontal and vertical strides to search for discriminative regions, which are determined by network visualization techniques. In contrast to [16], the occluded regions of the augmented samples in our work are based on the adversarial attack points precalculated by our end-to-end sparse adversarial attack module.
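As a concrete illustration of the information-deletion idea shared by these methods, a minimal Cutout-style routine can be sketched in a few lines. The patch size, zero fill value, and NumPy image layout here are illustrative choices, not the exact settings of any of the cited methods:

```python
import numpy as np

def cutout(image, size=8, rng=None):
    """Zero out one randomly placed square patch (Cutout-style deletion).

    image: H x W x C float array; size: approximate side length of the patch.
    The patch is clipped at the image borders, as in the original Cutout.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    # Sample the patch center anywhere in the image.
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = image.copy()
    out[y0:y1, x0:x1, :] = 0.0   # delete the region
    return out
```

Random Erasing and HaS differ mainly in the shape, count, and fill values of the deleted regions, while GridMask and the proposed AdvMask replace the random placement with structured placement.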

III AdvMask

As illustrated in Fig. 1, AdvMask consists of two modules: sparse adversarial attack and data augmentation. We first distinguish two types of masks in AdvMask: the attack mask and the augmentation mask. The attack mask is the output of the sparse adversarial attack module and indicates the positions of the adversarial attack points. The augmentation mask in the data augmentation module is generated according to the attack mask and indicates the occluded regions of the augmented data.

Firstly, AdvMask takes as input a whole image and a random adversarial perturbation. We then encode the perturbation by a trainable neural network to generate a binary attack mask of the same size. From the binary attack mask, we obtain the points of interest (POI) for the data augmentation module. Finally, AdvMask performs data augmentation based on the POI from the binary mask.

III-A Sparse Adversarial Attack

Sparse adversarial attacks can find the most sensitive pixels in an image, because applying invisible perturbations to these pixels leads to wrong classification. Let $f: \mathbb{R}^{W \times H \times C} \rightarrow \mathbb{R}^{K}$ be the classification model, where $W$, $H$, and $C$ denote the width, height, and number of channels of the images, respectively, and $K$ denotes the total number of classes. The adversarial attack is generally formulated as:

$$\min_{\delta} \|\delta\|_{0} \quad \text{s.t.} \quad \arg\max_{k} f_{k}(x+\delta) \neq y, \tag{1}$$

where $\delta$ is the adversarial perturbation, $\|\cdot\|_{0}$ denotes the $\ell_0$-norm, i.e., the number of non-zero elements, $f_{k}(\cdot)$ denotes the probability that the model classifies the input as class $k$, and $x$ and $y$ denote the input image and its true label.

However, problem (1) does not limit the maximum perturbation magnitude. When the magnitude is large, the attack can succeed by perturbing only a few pixels [32], and too few perturbed pixels are not helpful for mask generation in the data augmentation module. The sparsity of the adversarial attack points is beneficial to data augmentation because such points are more likely to cover both foreground and background. Therefore, it is necessary to constrain the magnitudes of the sparse adversarial perturbations. Our work uses the $\ell_\infty$-norm as the distance function for learning the sparse adversarial attack. We formulate the sparse adversarial attack as:

$$\min_{\delta} \|\delta\|_{0} \quad \text{s.t.} \quad \arg\max_{k} f_{k}(x+\delta) \neq y, \quad \|\delta\|_{\infty} \leq \epsilon, \tag{2}$$

where $\epsilon$ denotes the maximum perturbation magnitude allowed, i.e., the bound on the $\ell_\infty$-norm of the perturbation.

However, because the $\ell_0$-norm is not differentiable, the problem is difficult to optimize, making sparse adversarial attack NP-hard [9]. As mentioned in Section II-A, existing methods use artificially defined importance indicators and greedy strategies, which may not obtain optimal results. We take inspiration from research on neural network pruning: pruning can be regarded as an $\ell_0$-norm optimization problem over the sparsity of convolution kernels, and early pruning methods also used artificially defined importance indices of convolution kernels [22, 15, 21]. Recently, AutoPruner, an automatic pruning method, eliminated the need to manually define kernel importance and automatically selects the convolution kernels to keep while fine-tuning the model. AutoPruner adds a branch to the usual network training process that outputs a mask to automatically select filters. Based on the intrinsic similarity between network pruning and sparse adversarial attack, we add a branch that includes encoding and binarization to the usual attack process, which likewise outputs a mask to automatically select pixels. The branch takes the adversarial perturbation as its input and generates an approximately binary mask of the same size. According to the sparsity and the attack effect, the adversarial perturbation and the encoder are jointly optimized. By gradually forcing the scaled sigmoid function to output binary values, the perturbation of some pixels finally becomes 0, thus ensuring sparsity.


Let $g$ denote the neural network used to encode the perturbation, and let $z = g(\delta)$ denote the tensor obtained by encoding the perturbation $\delta$ with $g$. When the image size is small, we directly use a fully-connected layer as the encoder, with weights denoted $W_g$. However, a large image size leads to too many weights to train. Considering that the dimensions of the input and output must be the same, we then use the classical image segmentation network U-net [28] as the encoder.
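For intuition, the small-image case (a single fully-connected layer mapping a flattened perturbation to a same-sized output) can be sketched as follows. The class name, weight initialization, and absence of a non-linearity are illustrative assumptions; only the same-shape input/output constraint comes from the text:

```python
import numpy as np

class FCEncoder:
    """Minimal fully-connected encoder g: flattens the perturbation,
    applies one linear layer, and reshapes back to the input size.
    For large images, a U-net would be used instead (as in the paper)."""

    def __init__(self, shape, rng=None):
        rng = rng or np.random.default_rng(0)
        n = int(np.prod(shape))
        self.shape = shape
        # Illustrative scaled-Gaussian initialization of the n x n weight matrix.
        self.W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))

    def __call__(self, delta):
        z = delta.reshape(-1) @ self.W   # encode the flattened perturbation
        return z.reshape(self.shape)     # output has the same size as the input
```

Note the quadratic cost: for a $32 \times 32 \times 3$ input the weight matrix already has about $9.4$ million entries, which is why a convolutional encoder becomes necessary for larger images.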

Fig. 2:

Illustration of attack masks generated by our sparse adversarial attack module. First row: original images from the ImageNet dataset [18]. Second row: attack masks for the images in the first row.


The elements of the encoded tensor are real numbers, but the elements of the attack mask are binary: 1 denotes a key point and 0 otherwise. Therefore, to maintain continuity and differentiability, we use the scaled sigmoid function to generate an approximately binary mask:

$$m = \sigma(\alpha z) = \frac{1}{1 + e^{-\alpha z}}, \tag{3}$$

where $m$ denotes the approximately binary mask and $\alpha$ is the scaling factor that controls the degree of binarization. When $\alpha$ is too small, (3) does not sufficiently binarize the elements of the mask. When $\alpha$ is too large from the start, the selection of pixels is already determined before training, which degenerates our method into randomly selecting pixels. Therefore, we gradually increase $\alpha$ during training to ensure that the elements of the mask converge to binary values while preventing degeneration into random pixel selection.
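The scaled sigmoid and the effect of increasing the scaling factor can be sketched directly (the particular input values and factor sizes below are illustrative):

```python
import numpy as np

def scaled_sigmoid(z, alpha):
    """Approximately binary mask: sigmoid(alpha * z).
    Small alpha -> soft values in (0, 1); large alpha -> nearly 0/1."""
    return 1.0 / (1.0 + np.exp(-alpha * z))
```

During training, gradually raising `alpha` sharpens the soft mask into a near-binary one while keeping the function differentiable at every step.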

Loss Function

Our goal is to find the most sensitive pixels in the image while keeping these points sparse. Therefore, the loss function is designed so that the pruned perturbation fools the target model as much as possible, while the key pixels remain as sparse as possible. First, we use an adversarial attack, such as PGD or MI-FGSM, to compute the adversarial classification loss $\mathcal{L}_{adv}$ (usually the cross-entropy loss) of a successful attack. We formulate our loss function with a sparsity term as follows:

$$\mathcal{L} = \mathcal{L}_{adv}(x + \delta \odot m,\, y) + \lambda \cdot \frac{1}{n} \sum_{i=1}^{n} m_{i}, \tag{4}$$

where $\odot$ denotes element-wise multiplication. Since the elements of $m$ are approximately binary, the second term approximately represents the sparsity of the mask, where $n$ is the size of the mask. $\lambda$ is a dynamic parameter that balances the two terms, calculated by the following formula:



$$\lambda = \max\left(\lambda_{\min},\; c \cdot \frac{1}{n} \sum_{i=1}^{n} m_{i}\right), \tag{5}$$

where $\lambda_{\min}$, a hyperparameter, is the minimum value of $\lambda$; $m_{i}$ is the $i$-th element of the mask; and $c$ is a hyperparameter. If the current mask is not sparse enough, our method pays more attention to improving the sparsity of the mask; otherwise, it focuses on the adversarial attack.
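A minimal sketch of this joint objective follows. The dynamic schedule for the balance parameter is an illustrative reconstruction consistent with the behavior described above (a dense mask pushes the optimization toward sparsity), not necessarily the paper's exact formula:

```python
import numpy as np

def advmask_loss(adv_loss, mask, lam_min=0.1, c=1.0):
    """Joint objective: adversarial loss + lam * mean(mask).

    adv_loss: scalar adversarial classification loss (e.g., cross-entropy).
    mask: near-binary mask array; its mean approximates the sparsity term.
    lam grows when the mask is dense, shifting effort toward sparsity;
    lam_min and c are hyperparameters (illustrative schedule).
    """
    sparsity = float(np.mean(mask))       # ~ (1/n) * sum_i m_i
    lam = max(lam_min, c * sparsity)      # dynamic balance parameter
    return adv_loss + lam * sparsity, lam
```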

Update Perturbation and Encoder

Since the attack mask is meant to find sparse POI in the image, we do not need to consider the sparsity of the perturbation itself. Therefore, instead of updating $\delta$ with greedy strategies, we can use adversarial attack methods, such as PGD and MI-FGSM, to update $\delta$ quickly. Moreover, considering the $\ell_\infty$ constraint in problem (2), we combine PGD and MI-FGSM [10] under the $\ell_\infty$-norm constraint; in this way, we constrain both the $\ell_0$-norm and the $\ell_\infty$-norm of the perturbation. Specifically, the update formula of $\delta$ is as follows:

$$g_{t+1} = \mu \cdot g_{t} + \frac{\nabla_{\delta} \mathcal{L}}{\|\nabla_{\delta} \mathcal{L}\|_{1}}, \qquad \delta_{t+1} = \Pi_{\epsilon}\left(\delta_{t} + \eta \cdot \mathrm{sign}(g_{t+1})\right), \tag{6}$$

where $\mathcal{L}$ is the loss defined by (4), $\mu$ denotes the momentum decay factor, $\eta$ represents the update step, and $\Pi_{\epsilon}$ projects the adversarial perturbation into the $\ell_\infty$-ball of radius $\epsilon$.

To summarize, the perturbation is trained simultaneously with the encoder, and the completion of training means the completion of attack mask generation. Fig. 2 illustrates some examples of attack masks.
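One momentum update step of this kind, with the $\ell_\infty$ projection, can be sketched as follows. The step size, decay factor, and radius are illustrative values, and the gradient is assumed to be supplied by the attack's backward pass:

```python
import numpy as np

def mi_fgsm_step(delta, grad, momentum, mu=1.0, step=0.01, eps=0.1):
    """One MI-FGSM-style update of the perturbation under an l_inf ball.

    grad: gradient of the loss w.r.t. delta; mu: momentum decay factor;
    step: update step size; eps: l_inf radius used for projection.
    """
    # Accumulate the l1-normalized gradient into the momentum buffer.
    momentum = mu * momentum + grad / max(np.sum(np.abs(grad)), 1e-12)
    delta = delta + step * np.sign(momentum)   # signed ascent step
    delta = np.clip(delta, -eps, eps)          # project onto the l_inf ball
    return delta, momentum
```

Clipping to `[-eps, eps]` realizes the projection $\Pi_{\epsilon}$, since for the $\ell_\infty$-ball the projection acts coordinate-wise.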

III-B Data Augmentation

Excessively deleting regions in the image may lead to complete object removal and loss of context information, while retaining too many regions reduces the deleted information [3]. Therefore, the key to data augmentation based on information deletion is to balance deletion and reservation of continuous areas. Our proposed information-removal method, AdvMask, utilizes structured dropping regions. As illustrated in Fig. 1, the attack mask and the input image are used for data augmentation. The areas removed by our method are neither one continuous area nor random areas. The core idea of our data augmentation module is to remove the square areas centered on the adversarial attack points from the attack mask, as indicated by the augmentation mask. In the augmentation mask $M$, if $M_{ij} = 1$, the pixel at position $(i, j)$ is unchanged; otherwise, we remove it.

We express our setting as:

$$\tilde{x} = x \odot M, \tag{7}$$

where $x$ denotes the input image, $M$ is the binary augmentation mask indicating the pixels to be removed, and $\odot$ denotes element-wise multiplication.

The augmentation mask is determined by five parameters.

Square Length

From the sparse adversarial attack, we obtain the positions of the POI of each image, so we can customize the augmentation masks for each image. For each perturbation point in the attack mask, we design a square $s$ centered on this point. The parameter $l$ is the side length of $s$, and its value is randomly sampled within a range:

$$l \sim U(l_{\min}, l_{\max}). \tag{8}$$

The range of $l$ is set according to the image size.

Mask Ratio

Since some recent works [37, 3] revealed that dropping a very small region is useless for the convolutional operation, the mask ratio $r$ is used to control the proportion of total masked pixels: the larger $r$ is, the more pixels are masked. Considering the balance between information deletion and reservation, the scale of the masked regions should be limited. Once $r$ is determined, the area of the total masked regions can be calculated by:

$$N \cdot l^{2} = r \cdot H \cdot W, \tag{9}$$

where $N$ is the number of squares in the mask. Then, we get:

$$N = \left\lfloor \frac{r \cdot H \cdot W}{l^{2}} \right\rfloor, \tag{10}$$

where $H$ and $W$ are the height and width of the image. $N$ prevents the algorithm from removing too many or too few pixels by limiting the number of squares in the mask. We therefore set a scale range $[r_{\min}, r_{\max}]$ for $r$, so that the dropped regions can be neither too large nor too small.
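The resulting count of squares is a one-line computation; a quick sketch, where the clamping bounds `n_min`/`n_max` are illustrative additions standing in for the scale range described above:

```python
def num_squares(mask_ratio, side, height, width, n_min=1, n_max=None):
    """N = floor(r * H * W / l^2): how many l x l squares are needed so the
    masked area is roughly a fraction r of the image (ignoring overlap).
    n_min/n_max illustrate clamping N into a sensible range."""
    n = int(mask_ratio * height * width / (side * side))
    if n_max is not None:
        n = min(n, n_max)
    return max(n, n_min)
```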

Overlapping Ratio

Since the squares have different lengths, if two perturbation points are very close, the squares centered on them may easily overlap. This situation can deteriorate sharply when the square lengths are large and the perturbation points are dense, causing the deleted areas to overlap continuously and finally concentrate in one sub-area. In the worst case, no matter how large the mask ratio is, the entire object of interest will be deleted. Therefore, by controlling the overlapping ratio $o$, we can, on the one hand, prevent the algorithm from removing continuous areas and, on the other hand, preserve the structural characteristics of the mask. For every mask, we set an upper bound $o_{\max}$ of $o$, and randomly choose $o$ as:

$$o \sim U(0, o_{\max}). \tag{11}$$

In this way, the overlapping area of every pair of squares in the augmentation mask is no larger than the bound set by $o$.

The random values of these five parameters increase the diversity of the augmentation masks and ensure the structural deletion of areas. For every image in each epoch during training, we generate different masks. Even for the same image and its attack mask, every generated augmentation mask is different.
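Putting the pieces together, augmentation-mask generation can be sketched as follows. The skip-on-overlap policy below is one illustrative way to enforce the overlap bound; the paper's exact procedure may differ:

```python
import numpy as np

def build_augmentation_mask(points, shape, side=4, max_overlap=0.1):
    """Binary augmentation mask (1 = keep, 0 = drop): drop a side x side
    square centered on each adversarial attack point, skipping squares
    whose overlap with already-dropped pixels exceeds max_overlap."""
    h, w = shape
    keep = np.ones((h, w), dtype=np.float32)
    half = side // 2
    for cy, cx in points:
        y0, y1 = max(0, cy - half), min(h, cy + half + 1)
        x0, x1 = max(0, cx - half), min(w, cx + half + 1)
        patch = keep[y0:y1, x0:x1]
        overlap = 1.0 - patch.mean()    # fraction of this square already dropped
        if overlap > max_overlap:
            continue                    # would overlap too much: skip this square
        patch[...] = 0.0                # drop the square (in-place view write)
    return keep
```

Applying `image * keep[..., None]` then realizes the element-wise occlusion of (7); randomizing `side` and `max_overlap` per mask yields the diversity described above.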

Dataset    Model              Basic    Cutout   HaS      GridMask  RE       AdvMask (ours)
CIFAR-10   ResNet-18          95.28*   96.01*   96.10*   96.38     95.69*   96.44
CIFAR-10   ResNet-44          94.10    94.78    94.97    95.02     94.87*   95.49
CIFAR-10   ResNet-50          95.66    95.81    95.60    96.15     95.82    96.69
CIFAR-10   WideResNet-28-10   95.52    96.92    96.94    97.23     96.92    97.02
CIFAR-10   Shake-shake-26-32  94.90    96.96*   96.89*   96.91     96.46*   97.03
CIFAR-100  ResNet-18          77.54*   78.04*   78.19    75.23     75.97*   78.43
CIFAR-100  ResNet-44          74.80    74.84    75.82    76.07     75.71*   76.44
CIFAR-100  ResNet-50          77.41    78.62    78.76    78.38     77.79    78.99
CIFAR-100  WideResNet-28-10   78.96    79.84    80.22    80.40     80.57    80.70
CIFAR-100  Shake-shake-26-32  76.65    77.37    76.89    77.28     77.30    79.96
TABLE I: Image classification accuracy (%) on CIFAR-10 and CIFAR-100. * denotes results reported in the original paper. RE: Random Erasing.

IV Experiment

In this section, we report on experiments to evaluate our method and compare it with other data augmentation methods. To prove the effectiveness of our proposed method, the experiments include four image datasets and various popular neural network models in image classification tasks. For the same model trained with different data augmentation methods, we use the same parameter settings, such as learning rate.

IV-A Datasets and Evaluation Metrics

We conduct experiments on four representative datasets for image classification tasks: CIFAR-10/100, Tiny-ImageNet, and Oxford Flower Dataset.

  1. CIFAR-10/100 [17]: Both CIFAR-10 and CIFAR-100 consist of 60,000 color images of size 32×32 pixels; the former has 10 distinct classes and the latter has 100. Each dataset is divided into a training set of 50,000 images and a test set of 10,000 images. Experimental results on these datasets demonstrate the superiority of AdvMask for coarse-grained classification on small image sizes.

  2. Oxford Flower Dataset [26]: This dataset consists of 102 flower categories, each containing between 40 and 258 images. The images exhibit large variations in scale, pose, and lighting. Because the images are larger and visually clearer, we also present class activation maps (CAM) [39] from models trained with different data augmentation methods. Experimental results on this dataset demonstrate the superiority of AdvMask for classification on large image sizes and visually illustrate the effect of training with AdvMask.

  3. Tiny-ImageNet [4]: This dataset contains 100,000 images of 200 classes. Each class has 500 training images, 50 validation images, and 50 test images. Since the test images are not labeled, we use the validation set for testing. Experimental results on this dataset demonstrate the superiority of AdvMask on a fine-grained dataset with larger image sizes.

  4. Evaluation Metric: We use Rank-1 and mean average classification accuracy to evaluate the performance of the proposed method.

IV-B Implementation Details

Since our adversarial attack module performs a white-box attack, we first use a trained classification model as the attack target to obtain the attack mask for every image. Based on the attack mask, we then generate augmentation masks in the data augmentation module.

For the sparse adversarial attack module, some implementation details are worth noting. As mentioned in Section III-A, we gradually increase the scaling factor in the optimization process to make the mask binary. The corresponding hyperparameter is set to 100 on CIFAR-10 and CIFAR-100, and to 5 on Tiny-ImageNet and the Oxford Flower Dataset. The remaining hyperparameter has little effect on the results, so we fix it at 0.1 throughout.

For the data augmentation module, the datasets are normalized using the per-channel mean and standard deviation, and our algorithm is applied after the image normalization operation.

Inspired by the easy-to-hard learning strategy [34, 31], we propose an incremental generative strategy. Specifically, instead of applying AdvMask to every image in each epoch, the number of augmented occluded samples gradually increases during training, which will make the network more robust to occlusion by learning more and more occlusion samples. The number of augmented samples grows uniformly with epoch until it reaches a constant upper bound. In practice, we set the upper bound to 80% of the total sample size.
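This incremental strategy reduces to a simple per-epoch schedule; a sketch, where the linear ramp length is an illustrative assumption and only the 80% cap comes from the text:

```python
def num_augmented(epoch, total_samples, ramp_epochs=100, cap=0.8):
    """Number of samples augmented by AdvMask in a given epoch:
    grows uniformly with the epoch until it reaches cap * total_samples
    (the cap is set to 80% of the total sample size in our experiments)."""
    frac = min(cap, cap * epoch / ramp_epochs)
    return int(frac * total_samples)
```

Early epochs thus train mostly on clean images, and the proportion of occluded samples rises linearly until the 80% ceiling, after which it stays constant.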

IV-C Results on CIFAR-10 and CIFAR-100

We use the ResNet [14] architecture with models of different sizes, including ResNet-18, ResNet-44, and ResNet-50. In addition, we use WideResNet-28-10 [36] and ShakeShake-26-32 [12]. The hyperparameters are the same as those reported in [8, 3]. For the basic augmentation, we pad each image and randomly crop it back to its original size, then horizontally flip it with probability 0.5. The images are normalized using the per-channel mean and standard deviation, and AdvMask is applied after normalization. The parameters of AdvMask are correlated with the complexity of the dataset: generally, the more complex the dataset, the larger the preferred masked regions. We therefore train on CIFAR-10 with smaller square-length and mask-ratio ranges and an overlapping ratio of 0.1, and on CIFAR-100 with larger square-length and mask-ratio ranges and an overlapping ratio of 0.2.

We conduct the experiments several times and report the best classification accuracy of the different models on CIFAR-10 and CIFAR-100, as summarized in Table I. Notably, AdvMask improves the performance of ResNet-18, ResNet-44, ResNet-50, WideResNet-28-10, and ShakeShake-26-32 over the baseline by 1.16%, 1.39%, 1.03%, 1.50%, and 2.12%, respectively. Since CIFAR-10 is not highly complex, the accuracies of the different methods do not differ greatly.

Especially for ResNet-44, WideResNet-28-10, and ShakeShake-26-32, we improve the classification accuracy on CIFAR-100 by 1.64%, 1.74%, and 3.31%, respectively. AdvMask achieves the best results on these five classification models, which shows its superiority over previous data augmentation methods. For average classification accuracy across models, AdvMask is likewise the best-performing method.

IV-D Results on Oxford Flower Classification Dataset

Fig. 3: Class activation mapping (CAM) for the ResNet-50 model trained on the Oxford Flower Classification Dataset with baseline augmentation, Cutout, GridMask, Random Erasing (RE), or our AdvMask. Models trained with AdvMask tend to focus on large important regions and cover a larger area of the object of interest.

We randomly divide the Oxford Flower Classification Dataset into training, validation, and test sets, and use cross-validation in the experiments. The basic augmentation includes random padding, cropping, and horizontal flipping.

As Table II shows, AdvMask achieves state-of-the-art classification accuracy on this dataset, improving over the baseline by 11.37%. In addition, because these images are larger and visually clearer than those of CIFAR-10/100, we visualize the CAM of the ResNet-50 model trained with each data augmentation method for comparison. As Fig. 3 shows, AdvMask is more inclined to locate and cover relevant parts of the objects than the other data augmentation methods. At the same time, AdvMask indeed achieves the best classification accuracy on the ResNet-50 model, which suggests that successful data augmentation helps models focus on the discriminative and salient areas of the image. Specifically, in the first row of Fig. 3, the region of interest (ROI) of the baseline method covers only part of the flower and even attends to background information, a manifestation of over-fitting. In contrast, the attention area of the model trained with AdvMask covers almost the whole flower while the background is neglected, showing that AdvMask improves the generalization ability of the model.

Method Accuracy (%)
Basic 80.20
Cutout 88.53
HaS 84.12
RE 88.43
GridMask 90.19
AdvMask (Ours) 91.57
TABLE II: Image classification accuracy on Oxford Flower Classification Dataset.

IV-E Results on Tiny ImageNet

Fig. 4: Relative classification accuracy compared with baseline on ResNet-18, ResNet-50, and WideResNet-50-2 for Tiny-ImageNet dataset.

We also conduct experiments on the larger Tiny-ImageNet dataset. First, we resize the Tiny-ImageNet images, initialize all models with ImageNet pre-trained weights, and then fine-tune on Tiny-ImageNet for several epochs. Accordingly, we modify the network architectures, including ResNet-18, ResNet-50, and WideResNet-50-2 [36], to fit the image size and the 200-class output. The basic augmentation includes random padding, cropping, and horizontal flipping. Images are normalized, and then AdvMask is applied.

To reflect the classification accuracy improvement each data augmentation method brings over the baseline, we report the relative average accuracy improvements in Fig. 4. The baseline accuracies of the three models are 61.38%, 73.61%, and 81.55%, respectively.

From Fig. 4, we can see that the average accuracy of every data augmentation method is better than the baseline. Among them, AdvMask achieves the highest improvement: on ResNet-18, ResNet-50, and WideResNet-50-2, it improves accuracy by 3.91%, 6.57%, and 1.30%, respectively. We also present the error interval of each method. For all three classification models, AdvMask has the smallest error interval, which shows that its effect is the most stable. For ResNet-18 and ResNet-50, the worst-case performance of every method is still higher than the baseline, which demonstrates the effectiveness of these data augmentation methods. For WideResNet-50-2, although the worst-case performance of the other methods falls below the average baseline performance, the worst-case performance of AdvMask is still better than the baseline. Therefore, AdvMask is stable and effective.

V Ablation Studies

V-A Attack Success Rate of Sparse Adversarial Attack Module

Method ASR (%) ℓ0 ℓ2 ℓ∞
JSMA 78.9 440.8 0.611 0.031
PGD-ℓ0 73.9 1199.7 1.078 0.031
GreedyFool 100.0 468.2 0.547 0.031
CW-ℓ0 100.0 326.6 0.542 0.068
SAPF 100.0 321.8 0.523 0.085
Ours 100.0 320.1 0.532 0.031
JSMA 97.3 247.7 0.896 0.063
PGD-ℓ0 72.8 498.0 1.390 0.063
GreedyFool 100.0 238.3 0.707 0.063
CW-ℓ0 100.0 136.7 0.691 0.118
SAPF 100.0 133.7 0.718 0.159
Ours 100.0 131.3 0.689 0.063
TABLE III: Results of targeted sparse adversarial attack on CIFAR-10.

In our proposed method, the sparse adversarial attack module is used to find the key points in the images for the data augmentation module. It is therefore crucial to successfully find the points that are most susceptible to perturbation. We evaluate the overall performance of our adversarial attack module with the attack success rate (ASR) and the ℓ0-, ℓ2-, and ℓ∞-norms of the perturbation. ASR is the proportion of adversarial samples that are misclassified. The ℓ0-norm is the number of non-zero elements; the lower it is, the fewer pixels are perturbed. The ℓ2-norm is the distance between the adversarial image and the original clean image. The ℓ∞-norm is the maximum perturbation magnitude; the larger it is, the more an individual pixel may change.
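The sparsity and magnitude metrics above are computed from the pixel-wise difference between the clean and adversarial images; a minimal sketch (function name ours):

```python
def attack_metrics(clean, adv):
    """Per-image metrics for an adversarial example.

    clean, adv: flat lists of pixel values.  Returns (l0, l2, linf):
    l0   = number of perturbed pixels (lower = sparser attack),
    l2   = Euclidean distance between the two images,
    linf = largest single-pixel change (perturbation magnitude).
    """
    diffs = [a - c for a, c in zip(adv, clean)]
    l0 = sum(1 for d in diffs if d != 0)
    l2 = sum(d * d for d in diffs) ** 0.5
    linf = max(abs(d) for d in diffs)
    return l0, l2, linf
```

ASR is then simply the fraction of test images for which the attacked model's prediction differs from the true (or, for targeted attacks, matches the target) label.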

In this subsection, we compare our proposed adversarial attack module with several state-of-the-art sparse adversarial attack methods on the CIFAR-10 and ImageNet [18] datasets under different parameter settings. The compared methods include four two-stage methods, JSMA, CW-ℓ0, PGD-ℓ0, and GreedyFool, and one one-stage method, SAPF. The target classifier on CIFAR-10 is a VGG19 model [29]. The average norms and ASR under two settings are shown in Table III, where ℓ∞ is the maximum perturbation magnitude. Under the smaller threshold, our sparse adversarial attack module achieves 100% ASR with an ℓ0-norm of only 320.1, meaning that only 10.42% of the pixels are perturbed. Under the larger perturbation magnitude, our method again achieves 100% ASR with an ℓ0-norm of only 131.3 (4.27% of the pixels). For JSMA and GreedyFool, we specify the ℓ∞-norm. The ASR of these two methods is no higher than ours, while their ℓ0- and ℓ2-norms are clearly higher, which means they must perturb more pixels to attack successfully. Because PGD-ℓ0 requires both the ℓ0- and ℓ∞-norms to be specified, we search for an appropriate ℓ0-norm under the same ℓ∞ bound; however, even a larger ℓ0-norm cannot reach 100% ASR. Both CW-ℓ0 and SAPF can only specify the ℓ0-norm, so we run them with reference to the ℓ0-norm of our method. Although their ℓ0-norms are almost the same as ours, the ℓ∞-norm of our method is much smaller, so the maximum generated perturbation magnitudes are significantly smaller.

We also experiment on ImageNet, as illustrated in Table IV. When ε = 0.016, our method obtains 100% ASR with an ℓ0-norm of just 12855.0, which is the lowest (4.76% of the pixels). When ε = 0.031, our method also achieves 100% ASR with the lowest ℓ0-norm (5591.5, i.e., 2.08% of the pixels). In addition, since the ℓ∞-norm for PGD-ℓ0 and GreedyFool can be specified, we conduct those experiments under the same settings. The ASR of PGD-ℓ0 is low, and the ℓ0-norms of both methods are larger than ours.

Method ASR (%) ℓ0 ℓ2 ℓ∞
PGD-ℓ0 50.0 11989.0 2.957 0.031
GreedyFool 100.0 12454.5 2.663 0.031
Ours 100.0 5591.5 2.223 0.031
PGD-ℓ0 62.2 22980.9 2.116 0.016
GreedyFool 100.0 26107.7 2.083 0.016
Ours 100.0 12855.0 1.694 0.016
TABLE IV: Results of Targeted Sparse Adversarial Attack on ImageNet.

All of these results demonstrate the superiority and effectiveness of our proposed adversarial attack module, which can indeed find sparse, highly sensitive points.

V-B The Effect of the Generated Mask on the Targeted Points

For AdvMask, the points used for mask generation are obtained by our adversarial attack module. To verify the benefit of adversarial attack points for data augmentation, we test two alternative point selection strategies and then apply our data augmentation module under the same experimental settings: 1) random points (RP), selected uniformly over the whole image; and 2) corner points (CP), found by corner detection. Results are summarized in Table V. Although both schemes improve classification accuracy over the baseline, our method is clearly superior to them on ResNet-18 and ResNet-50. On ResNet-50, AdvMask is 1.08% and 1.28% more accurate than RP and CP, respectively. These results show the effectiveness of our adversarial attack module.

On the other hand, the data augmentation module generates various masks from the adversarial attack points. To verify the effectiveness of the data augmentation module, we conduct experiments that use the attack masks directly as augmentation masks, so the masked regions are simply small squares placed on the adversarial attack points.

Method ResNet-18 Acc (%) ResNet-50 Acc (%)
RP 95.66 95.61
CP 95.11 95.41
Baseline 95.28 94.12
TABLE V: Accuracy (%) with different point selection strategies on CIFAR-10 for both ResNet-18 and ResNet-50. RP: random points, CP: corner points.
Method ResNet-18 Acc (%) ResNet-50 Acc (%)
AAPM 92.96 94.52
Baseline 95.28 94.12
TABLE VI: Test accuracy on CIFAR-10 for ResNet-18 and ResNet-50 when attack masks are used directly, without the mask generation module. AAPM: adversarial attack point mask.

From Table VI, we can see that the accuracy obtained by using the attack masks directly is clearly lower than that of our method; on ResNet-18 it is even worse than the baseline, because the masked regions are so small that they merely introduce noise into the images. Influenced by the adversarial attack, the classifier misclassifies many samples, which leads to side effects. Our data augmentation module instead makes proper use of these adversarial attack points and obtains the best accuracy.

V-C The Impact of Parameters

When applying AdvMask to CNN training, we use three parameters to control mask generation: the square length, the mask ratio, and the overlapping ratio, introduced in Section III-B. In this subsection, to demonstrate the impact of these parameters on model performance, we conduct experiments on CIFAR-100 under varying parameter settings. When evaluating one parameter, we fix the other two parameters and the model architecture; unless otherwise specified, the other two parameters keep their default values.


Since the image size of CIFAR-100 is 32×32 and AdvMask pads all images to a larger area, we set the maximum mask square length to 35. Because the square length is sampled from a range interval, we partition the admissible lengths into eight disjoint intervals, which lets us analyze the effect of each interval. Results are shown in Fig. 5.

Fig. 5: Classification accuracy with different settings of the square length.

Notably, the accuracy first increases and then decreases. If the mask square is too small or too large, accuracy suffers, especially when the mask is too large. Specifically, when the square length is below 2, accuracy is far lower than with moderate lengths, which shows that dropping a very small region is useless for the convolutional operations and may even introduce noise into the images. Over the middle intervals, the accuracy changes little; however, as the square length increases further, accuracy drops dramatically. The largest setting gives the lowest accuracy, since it means that nearly 50% or more of the image is removed, so the main object in the image is likely to be removed as well.

On the one hand, when the mask squares are too small, the masked area cannot occlude the object of interest. On the other hand, if the mask area is too large, the whole object of interest is more likely to be completely covered, which harms the classifier; at the largest square length, even the whole image can be masked, with a clearly negative impact on accuracy. Therefore, the maximum square length should not be too large, while variety in the square length is beneficial.


Similar to the square length, we adjust the mask ratio to keep the total mask area from being too large or too small, so that the object can neither be completely removed nor completely kept. We set five intervals, each of length 20%, from 0 to 100%.
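For a concrete binary mask, the mask ratio is just the fraction of masked pixels; a minimal helper (ours, not from the paper, assuming 1 marks a masked pixel):

```python
def mask_ratio(mask):
    """Fraction of pixels removed by a binary mask (1 = masked).

    mask: 2-D list of 0/1 values covering the whole image.
    """
    flat = [v for row in mask for v in row]
    return sum(flat) / len(flat)
```

The data augmentation module keeps this quantity within the sampled interval when placing the mask squares.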

Results are summarized in Fig. 6. When the mask ratio is relatively small, the differences in accuracy are not significant. However, as the mask ratio increases further, accuracy drops sharply; the highest mask ratio yields the lowest accuracy, with the classification accuracy varying by up to 1.53%. This is because the total masked area becomes so large that most of the image, including the main object, is removed, which greatly damages the performance of the classifier.

Fig. 6: Classification accuracy with different settings of the mask ratio.


Because structured deletion is beneficial, we avoid deleting continuous regions in a single image. We therefore experiment with different values of the overlapping ratio: 0, 20%, 40%, 60%, 80%, and 100%. A value of 0 means that no two mask squares overlap, and 100% means that any two mask squares may completely overlap.
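As a sketch of how such a pairwise constraint could be checked (the paper's exact definition of the overlapping ratio may differ; this assumes axis-aligned squares and normalizes the intersection by the smaller square's area):

```python
def overlap_ratio(sq_a, sq_b):
    """Fraction of the smaller square's area covered by the other square.

    Each square is (row, col, side): top-left corner plus side length.
    A result of 0.0 means the squares are disjoint; 1.0 means the
    smaller square lies entirely inside the larger one.
    """
    (ra, ca, sa), (rb, cb, sb) = sq_a, sq_b
    ih = max(0, min(ra + sa, rb + sb) - max(ra, rb))  # intersection height
    iw = max(0, min(ca + sa, cb + sb) - max(ca, cb))  # intersection width
    return (ih * iw) / min(sa, sb) ** 2
```

A new mask square would be accepted only if its overlap ratio with every already placed square stays below the sampled bound.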

As illustrated in Fig. 7, under the joint control of the square length and the mask ratio, the impact of the overlapping ratio is less significant.

Fig. 7: Classification accuracy with different settings of the overlapping ratio.

Relatively small overlapping ratios give the best accuracy; as the ratio increases, accuracy drops, with a maximum decrease of 0.39%. This experiment shows that a smaller overlapping ratio is beneficial: it avoids the deletion of continuous regions and preserves the structured information of the images, so the main object is neither fully removed nor fully preserved.

Experiments on these three parameters verify that no one or two of them alone are sufficient to control AdvMask. Therefore, we use the square length, the mask ratio, and the overlapping ratio together to control mask generation, balancing object occlusion against information retention.

VI Conclusion

In this paper, we propose AdvMask, a data augmentation method for image classification. AdvMask first draws on adversarial attack research to find key points in the image and then randomly occludes a structured region based on those key points during each iteration. Neural network models trained with AdvMask become less sensitive to occlusion and generalize more effectively, because they are forced to attend to less sensitive areas that still carry important features. The experimental results on various image datasets demonstrate the robustness and effectiveness of our method.

In future work, we will apply our data augmentation method to other computer vision tasks, such as object detection and segmentation. We hope our work provides a new direction for data augmentation research.


This work was supported in part by the National Key R&D Program of China under Grant 2021ZD0201300, and by the National Science Foundation of China under Grant 61876076.


  • [1] A. Blaas, X. Suau, J. Ramapuram, N. Apostoloff, and L. Zappella (2022) Challenges of adversarial image augmentations. In Proc. NeurIPS Workshop, pp. 9–14. Cited by: §I.
  • [2] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In Proc. IEEE European Symposium on Security and Privacy (EuroS&P), pp. 39–57. Cited by: §II-A.
  • [3] P. Chen, S. Liu, H. Zhao, and J. Jia (2020-01) Gridmask data augmentation. arXiv preprint arXiv:2001.04086. Cited by: §I, §II-B, §III-B, §III-B, §IV-C.
  • [4] P. Chrabaszcz, I. Loshchilov, and F. Hutter (2017-08) A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819. Cited by: §I, item 3.
  • [5] F. Croce and M. Hein (2019) Sparse and imperceivable adversarial attacks. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 4724–4732. Cited by: §II-A.
  • [6] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 113–123. Cited by: §I.
  • [7] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2020) Randaugment: practical automated data augmentation with a reduced search space. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshop (CVPRW), pp. 702–703. Cited by: §I.
  • [8] T. DeVries and G. W. Taylor (2017-11) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552. Cited by: §I, §II-B, §IV-C.
  • [9] X. Dong, D. Chen, J. Bao, C. Qin, L. Yuan, W. Zhang, N. Yu, and D. Chen (2020) GreedyFool: distortion-aware sparse adversarial attack. In Proc. Adv. Neural Inf. Process. Syst., Vol. 33, pp. 11226–11236. Cited by: §II-A, §III-A.
  • [10] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2018) Boosting adversarial attacks with momentum. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 9185–9193. Cited by: §III-A.
  • [11] Y. Fan, B. Wu, T. Li, Y. Zhang, M. Li, Z. Li, and Y. Yang (2020) Sparse adversarial attack via perturbation factorization. In Proc. Eur. Conf. Comput. Vis. (ECCV), pp. 35–50. Cited by: §II-A.
  • [12] X. Gastaldi (2017-05) Shake-shake regularization. CoRR abs/1705.07485. External Links: Link, 1705.07485 Cited by: §IV-C.
  • [13] C. Gong, D. Wang, M. Li, V. Chandra, and Q. Liu (2021) KeepAugment: a simple information-preserving data augmentation approach. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1055–1064. Cited by: §I.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 770–778. Cited by: §I, §IV-C.
  • [15] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 1389–1397. Cited by: §III-A.
  • [16] H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang (2018-06) Adversarially occluded samples for person re-identification. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Cited by: §II-B.
  • [17] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §I, item 1.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012-05) ImageNet classification with deep convolutional neural networks. In Proc. Adv. Neural Inf. Process. Syst., Vol. 25. Cited by: Fig. 2, §V-A.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proc. Adv. Neural Inf. Process. Syst., F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), Vol. 25. Cited by: §I.
  • [20] P. Li, X. Li, and X. Long (2020-06) FenceMask: a data augmentation approach for pre-extracted image features. arXiv preprint arXiv:2006.07877. Cited by: §II-B.
  • [21] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 2736–2744. Cited by: §III-A.
  • [22] J. Luo, H. Zhang, H. Zhou, C. Xie, J. Wu, and W. Lin (2018-10) Thinet: pruning cnn filters for a thinner net. IEEE Trans. Pattern Anal. Mach. Intell. 41 (10), pp. 2525–2538. Cited by: §III-A.
  • [23] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In Proc. Int. Conf. Learn. Representations (ICLR). Cited by: §II-A.
  • [24] A. Modas, S. Moosavi-Dezfooli, and P. Frossard (2019) Sparsefool: a few pixels make a big difference. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 9087–9096. Cited by: §II-A.
  • [25] S. Naseer, Y. Saleem, S. Khalid, M. K. Bashir, J. Han, M. M. Iqbal, and K. Han (2018-08) Enhanced network anomaly detection based on deep neural networks. IEEE Access 6, pp. 48231–48246. Cited by: §I.
  • [26] M. Nilsback and A. Zisserman (2008-12) Automated flower classification over a large number of classes. In Proceeding of the Sixth Indian Conference on Computer Vision, Graphics and Image Processing, pp. 722–729. External Links: Document Cited by: §I, item 2.
  • [27] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2016) The limitations of deep learning in adversarial settings. In Proc. IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. Cited by: §II-A.
  • [28] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., pp. 234–241. Cited by: §III-A.
  • [29] K. Simonyan and A. Zisserman (2014-04) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §V-A.
  • [30] K. K. Singh and Y. J. Lee (2017-10) Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 3544–3553. Cited by: §I, §II-B.
  • [31] P. Soviany, R. T. Ionescu, P. Rota, and N. Sebe (2021-03) Curriculum learning: a survey. ArXiv abs/2101.10382. Cited by: §IV-B.
  • [32] J. Su, D. V. Vargas, and K. Sakurai (2019-01) One pixel attack for fooling deep neural networks. IEEE Trans. Evol. Comput. 23 (5), pp. 828–841. Cited by: §III-A.
  • [33] N. Tran, V. Tran, N. Nguyen, T. Nguyen, and N. Cheung (2021-10) On data augmentation for gan training. IEEE Trans. on Image Process. 30, pp. 1882–1897. Cited by: §I.
  • [34] X. Wang, Y. Chen, and W. Zhu (2021-03) A survey on curriculum learning. IEEE Trans. on Pattern Anal. Mach. Intell. (), pp. 1–1. External Links: Document Cited by: §IV-B.
  • [35] B. Wu and B. Ghanem (2018-06) Lp-box admm: a versatile framework for integer programming. IEEE Trans. Pattern Anal. Mach. Intell. 41 (7), pp. 1695–1708. Cited by: §II-A.
  • [36] S. Zagoruyko and N. Komodakis (2016-09) Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), E. R. H. Richard C. Wilson and W. A. P. Smith (Eds.), pp. 87.1–87.12. External Links: Document, ISBN 1-901725-59-6, Link Cited by: §IV-C, §IV-E.
  • [37] C. Zhao, X. Lv, S. Dou, S. Zhang, J. Wu, and L. Wang (2021) Incremental generative occlusion adversarial suppression network for person reid. IEEE Trans. Image Process. 30 (), pp. 4212–4224. External Links: Document Cited by: §III-B.
  • [38] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2020) Random erasing data augmentation. In Proc. AAAI, Vol. 34, pp. 13001–13008. Cited by: §I, §I, §II-B.
  • [39] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 2921–2929. Cited by: item 2.