Adversarial Feature Augmentation and Normalization for Visual Recognition

03/22/2021 · Tianlong Chen et al.

Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models. Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings, instead of relying on computationally expensive pixel-level perturbations. We propose Adversarial Feature Augmentation and Normalization (A-FAN), which (i) first augments visual recognition models with adversarial features that integrate flexible scales of perturbation strengths, and (ii) then extracts adversarial feature statistics from batch normalization and re-injects them into clean features through feature normalization. We validate the proposed approach across diverse visual recognition tasks with representative backbone networks, including ResNets and EfficientNets for classification, Faster RCNN for detection, and DeepLab V3+ for segmentation. Extensive experiments show that A-FAN yields consistent generalization improvement over strong baselines across various classification, detection, and segmentation datasets, such as CIFAR-10, CIFAR-100, ImageNet, Pascal VOC2007, Pascal VOC2012, COCO2017, and Cityscapes. Comprehensive ablation studies and detailed analyses also demonstrate that adding perturbations to specific modules and layers of classification/detection/segmentation backbones yields optimal performance. Code and pre-trained models will be made available at: https://github.com/VITA-Group/CV_A-FAN.


1 Introduction

Figure 1: Overview of adversarial feature augmentation and normalization (A-FAN) for enhanced image classification (left), object detection (center), and semantic segmentation (right). We take the ResNet [he2016deep], Faster RCNN [ren2016faster], and DeepLab V3+ [chen2018encoder] pipelines as examples. Our proposed A-FAN mechanisms are plugged into the backbone networks and/or ROI/decoder modules for classification/detection/segmentation, respectively.

Adversarial vulnerability is a critical issue in the practical application of neural networks. Various attacks have been proposed to challenge visual recognition models for classification, detection, and segmentation [szegedy2013intriguing, goodfellow2014explaining, li2018robust, li2018exploring, lu2017adversarial, liu2018dpatch, xie2017adversarial, wei2018transferable, zhang2020contextual, arnab2018robustness, shen2019advspade]. Such susceptibility has motivated abundant studies on adversarial defense mechanisms for training robust neural networks [schmidt2018adversarially, sun2019towards, nakkiran2019adversarial, stutz2019disentangling, raghunathan2019adversarial, hu2019triple, chen2020adversarial, chen2021robust, jiang2020robust], among which adversarial training methods [madry2017towards, zhang2019theoretically], which leverage augmented adversarial examples, have consistently achieved superior robustness. However, crafting high-quality adversarial examples is computationally costly, and such adversarial training often degrades performance on clean data [zhang2019theoretically].

Interestingly, several recent studies have instead investigated the possibility of ameliorating networks' generalization ability via adversarial training. Recent progress shows that using adversarial perturbations to augment input data/embeddings can effectively alleviate overfitting and lead to better generalization in multiple domains, including image classification [xie2020adversarial], language understanding [wang2019improving, zhu2019freelb], and vision-language modeling [gan2020large]. However, when applied to image classification, this approach still incurs the expensive computation of pixel-level perturbations. We raise the following natural, yet largely open, questions:

Q1: Can adversarial training, as data augmentation, broadly boost the performance of various visual recognition tasks on clean data, not only image classification but also object detection and semantic segmentation?

Q2: If the above answer is yes, can we have more efficient and effective options for adversarial data augmentation, e.g., avoiding the high cost of finding input-level adversarial perturbations?

In this paper, we propose A-FAN (Adversarial Feature Augmentation and Normalization), a novel algorithm to improve the generalization of visual recognition models. Our method perturbs the intermediate feature representations of both task-specific modules (e.g., the classifier in ResNets, ROI in Faster RCNN, and Decoder in Deeplab V3+) and generic backbones, as shown in Figure 1. Specifically, A-FAN generates adversarial feature perturbations efficiently by one-step projected gradient descent, and quickly computes adversarial features at other perturbation strengths, from weak to strong, via interpolation. This strength-spectrum coverage allows models to consider a wide range of attack strengths simultaneously, fully unleashing the implicit regularization power of adversarial features.

Furthermore, A-FAN normalizes adversarially augmented features in a "Mixup" fashion. Unlike previous work [zhang2017mixup, li2020feature] that fuses inputs or features from different samples, we amalgamate adversarial and clean features by injecting adversarial statistics extracted from batch normalization into clean features. Such re-normalized features serve as an implicit label-preserving data augmentation, which smooths the learned decision surface [li2020feature]. Our main contributions are summarized as follows:

  • We introduce a new adversarial feature augmentation approach to enhancing the generalization ability of image classification, object detection, and semantic segmentation models, by simultaneously incorporating perturbations with scaled strengths from weak to strong.

  • We also propose a new feature normalization method, which extracts the statistics from adversarially perturbed features and re-injects them into the original clean features. It can be regarded as an implicit label-preserving data augmentation that smooths the learned decision boundary (illustrated in Figure 3 later on).

  • We conduct comprehensive experiments to verify the effectiveness of our proposed approach over diverse tasks (CIFAR-10, CIFAR-100, and ImageNet for image classification; Pascal VOC2007 and COCO2017 for object detection; Pascal VOC2007, Pascal VOC2012, and Cityscapes for semantic segmentation). The substantial and consistent performance lift demonstrates the superiority of our A-FAN framework.

2 Related Work

Adversarial Attacks and Defenses.

When presented with adversarial samples, which are maliciously crafted with imperceptible perturbations [goodfellow2014explaining, kurakin2016adversarial, madry2017towards], deep neural networks often suffer severe performance deterioration, e.g., [szegedy2013intriguing, goodfellow2014explaining, carlini2017towards, croce2020reliable] for classification models and [li2018robust, li2018exploring, lu2017adversarial, liu2018dpatch, xie2017adversarial, wei2018transferable, zhang2020contextual, arnab2018robustness, shen2019advspade] for detection/segmentation models. To address this notorious vulnerability, numerous defense mechanisms [zhang2019theoretically, schmidt2018adversarially, sun2019towards, nakkiran2019adversarial, stutz2019disentangling, raghunathan2019adversarial] have been proposed, such as input transformation [xu2017feature, liao2018defense, guo2017countering, dziugaite2016study], randomization [liu2018adv, liu2018towards, dhillon2018stochastic], and certified defense approaches [cohen2019certified, raghunathan2018semidefinite]. Among these, adversarial-training-based methods show superior robustness in defending against state-of-the-art adversarial attacks [goodfellow2014explaining, kurakin2016adversarial, madry2017towards]. Although adversarial training substantially enhances model robustness, it usually comes at the price of compromised standard accuracy [tsipras2018robustness], which has been demonstrated both empirically and theoretically [zhang2019theoretically, schmidt2018adversarially, sun2019towards, nakkiran2019adversarial, stutz2019disentangling, raghunathan2019adversarial].

Adversarial Training Ameliorates Generalization.

It is unexpected, yet reasonable, that recent works [xie2020adversarial, zhu2019freelb, wang2019improving, gan2020large, wei2019improved] present an opposite perspective: adversarial training can be leveraged to enhance models' generalization if harnessed in the right manner. For example, [xie2020adversarial] shows that image classification performance on clean data can be improved by using adversarial samples with pixel-level perturbation generation. [zhu2019freelb] and [wang2019improving] apply adversarial training to natural language understanding and language modeling, both successfully achieving better standard accuracy. [gan2020large] achieves similar success on various vision-and-language tasks. Parallel studies [wei2019improved, 8852250] employ handcrafted or auto-generated perturbed features to ameliorate generalization. However, adversarial training in the latent feature space, as a more efficient and effective alternative, has to our best knowledge not been studied in depth, even for classification tasks. Our work comprehensively explores this possibility not only for image classification, but also for object detection and semantic segmentation, which are more challenging prediction tasks that usually require much more sophisticated model structures, making it harder to exploit adversarial information for enhanced generalization.

Figure 2: The pipeline of A-FAN, which contains adversarial feature augmentation and adversarial feature normalization. From top to bottom, a series of adversarial feature perturbations with different strengths are generated to augment the intermediate clean features. Then, the statistics (i.e., $\mu_{adv}$ and $\sigma_{adv}$) of the perturbed features are extracted and re-injected into the original clean features $\mathcal{F}$. In the end, the normalized features are taken as inputs by the rest of the network, which is optimized with standard ($\mathcal{L}_{std}$) and adversarial ($\mathcal{L}_{adv}$) training objectives.

Feature Augmentation and Normalization.

Pixel-level data augmentation techniques have been widely adopted in visual recognition models, e.g., [simard1993efficient, scholkopf1996incorporating, cubuk2018autoaugment, hendrycks2019augmix] for classification and [Detectron2018, liu2016ssd, zoph2019learning] for detection and segmentation. They are generic pipelines for augmenting training data with image-level information. Adversarial samples can also serve as a data augmentation method [xie2020adversarial]. However, feature-space augmentations have not received the same level of attention. A few pioneering works propose generative feature augmentation approaches for domain adaptation [8578674], imbalanced classification [zhang2019feature], and few-shot learning [chen2019multi].

Feature normalization plays an important role in neural network training [ioffe2015batch, li2020feature, montavon2012neural, li1998sphering]. [ioffe2015batch] proposes batch normalization to remove biases in the dataset, which can substantially improve model generalization. [xie2020adversarial] utilizes dual batch normalization to calculate statistics of adversarial and clean samples separately, thereby obtaining promising standard accuracy. Recent investigations [ba2016layer, ulyanov2016instance, wu2018group, li2019positional, li2020feature] devote particular attention to normalizing the features of each training instance individually. As an illustration, [li2020feature] leverages the first- and second-order moments of extracted features and re-injects these moments into features from another instance via feature normalization. Different from them, we propose to utilize feature normalization to combine adversarial and clean features, smoothing the learned decision surface and improving model generalization.

3 Preliminaries

3.1 Rationale of A-FAN

Theoretical Insights.

For linear classifiers, a large output margin, i.e., the gap between predictions on the true label and the next most confident label, implies good generalization [bartlett2002rademacher, koltchinskii2002empirical, hofmann2008kernel, kakade2008complexity]. Although this relationship is less clear for non-linear deep neural networks, [wei2019improved] establishes a similar generalization bound associated with the "all-layer margin", which depends on the Jacobian and intermediate layer norms. Furthermore, [wei2019improved] derives theoretical analyses showing that appropriately injecting perturbations into intermediate features encourages a large all-layer margin and leads to improved generalization. A parallel study [wang2019improving] presents theoretical intuitions from a new perspective: introducing adversarial noise encourages the diversity of the embedding vectors, mitigates overfitting, and improves generalization for neural language models. These observations form the cornerstone that validates our A-FAN approach.

Empirical Evidence.

Advanced studies [xie2020adversarial, zhu2019freelb, wang2019improving, gan2020large, wei2019improved] reveal that appropriately utilizing adversarial perturbations ameliorates the generalization ability of deep neural networks in diverse applications. Note that these approaches are not defense mechanisms for adversarial robustness; instead, they serve as a special data augmentation for improved performance on clean samples. Different from input perturbations [xie2020adversarial], our work leverages adversarial perturbations in the latent feature space. To further unleash the power of adversarially augmented features, we asymmetrically fuse them with clean features, which allows the model to capture and smooth out different directions of the decision boundary [li2020feature]. Accordingly, A-FAN-augmented models obtain a flatter loss landscape (i.e., smaller norms of the Hessian with respect to model weights) and improved generalization, as supported by Table 1 and Figure 3.

Settings                   ResNet-56s   ResNet-56s + A-FAN
Standard Accuracy          93.59        94.82
Spectral Norm of Hessian   23.34        12.66
Trace of Hessian           246.24       211.94

Table 1: Performance and Hessian properties of ResNet-56s with or without A-FAN on CIFAR-10. A smaller spectral norm or trace of Hessian indicates a flatter loss landscape w.r.t. model weights.
Figure 3: Loss landscape of ResNet-56s with or without A-FAN on CIFAR-10. Visualization tools are provided by [visualloss].
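The Hessian statistics in Table 1 can be estimated without ever materializing the full Hessian, using only Hessian-vector products. Below is a minimal PyTorch sketch (our illustration, not the paper's released code; `model`, `loss_fn`, `x`, and `y` are placeholder names) that estimates the spectral norm via power iteration; the trace can be estimated analogously with Hutchinson's estimator.

```python
import torch

def hessian_spectral_norm(model, loss_fn, x, y, iters=20):
    """Estimate the largest Hessian eigenvalue magnitude (spectral norm)
    of the loss w.r.t. model weights via power iteration on HVPs."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Start from a random unit vector shaped like the parameters.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / norm for u in v]

    sigma = torch.tensor(0.0)
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. params.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        sigma = torch.sqrt(sum((h * h).sum() for h in hv))  # ||Hv||
        v = [h / (sigma + 1e-12) for h in hv]
    return sigma.item()
```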

3.2 Notations

Our proposed A-FAN framework includes two key components: (i) adversarial feature augmentation and (ii) adversarial feature normalization, as shown in Figure 2. Note that we introduce adversarial perturbations in the intermediate feature space, instead of manipulating raw image pixels as is common practice.

Let $\mathcal{D} = \{(x, y)\}$ denote the dataset, where $x$ is the input image and $y$ is the corresponding ground truth (e.g., one-hot classification labels, bounding boxes, or segmentation maps). Let $f(x; \theta_b, \theta_t)$ represent the predictions of neural networks, where $\theta_b$ and $\theta_t$ are the parameters of the backbone networks and task-specific modules, respectively. For example, $\theta_t$ denotes the parameters of ResNets' classifiers; the parameters of RPN, ROI, and classifier in Faster RCNN; or the parameters of ASPP and Decoder in Deeplab V3+. Adversarial training [madry2017towards] can be formulated as follows:

$\min_{\theta_b, \theta_t} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}\big(f(x + \delta; \theta_b, \theta_t), y\big) \Big]$    (1)

where $\delta$ is the crafted adversarial perturbation, constrained within an $\ell_\infty$ norm ball centered at $x$ with radius $\epsilon$. The radius $\epsilon$ is the maximum magnitude of the generated adversarial perturbations, which roughly indicates their strength [madry2017towards]. $\mathbb{E}_{(x,y)\sim\mathcal{D}}$ takes the expectation of the empirical objective over the dataset $\mathcal{D}$. The perturbation $\delta$ can be reliably created by multi-step projected gradient descent (PGD) [madry2017towards] (taking the $\ell_\infty$ perturbation as an example):

$\delta_{t+1} = \Pi_{\|\delta\|_\infty \le \epsilon} \Big( \delta_t + \alpha \cdot \mathrm{sign}\big(\nabla_{\delta} \mathcal{L}_{adv}(f(x + \delta_t; \theta_b, \theta_t), y)\big) \Big)$    (2)

where $\alpha$ is the step size of the inner maximization, $\mathrm{sign}(\cdot)$ is the sign function, $\Pi$ denotes projection onto the $\epsilon$-ball, and $\mathcal{L}_{adv}$ is the adversarial training objective calculated over perturbed images.
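For concreteness, a minimal PyTorch sketch of the multi-step $\ell_\infty$ PGD update in Equation 2, applied to input images (illustrative only, not the released implementation; `model` and `loss_fn` are placeholder names, and the default radius and step size are common but assumed values):

```python
import torch

def pgd_perturb(model, loss_fn, x, y, eps=8/255, alpha=2/255, steps=10):
    """Multi-step l_inf PGD (Equation 2): ascend the loss and project
    the perturbation back onto the eps-ball after every step."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)           # adversarial objective
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += alpha * grad.sign()              # gradient-sign ascent
            delta.clamp_(-eps, eps)                   # projection onto the ball
    return delta.detach()
```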

3.3 Adversarial Feature Augmentation

In this section, we present the proposed adversarial feature augmentation mechanism. Specifically, perturbations are generated in the intermediate feature space via PGD (taking the features $\mathcal{F}(x)$ from the backbone as an example):

$\min_{\theta_b, \theta_t} \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \mathcal{L}_{std}\big(f(x; \theta_b, \theta_t), y\big) + \lambda \cdot \max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}_{adv}\big(f(\mathcal{F}(x) + \delta; \theta_t), y\big) \Big]$    (3)

where the types of $\mathcal{L}_{std}$ and $\mathcal{L}_{adv}$ are determined by the task (e.g., detection models adopt regression and classification losses), and $\lambda$ is a hyperparameter controlling the influence of adversarial feature augmentation. Perturbations $\delta$ are generated by PGD, as in Equation 2, but on the features $\mathcal{F}(x)$ from the backbone network rather than on the raw input images. Note that the formulation in Equation 3 only considers a single perturbation strength $\epsilon$.

To fully unleash the power of adversarial augmentation in the feature space, we propose an enhanced technique that utilizes a series of adversarially perturbed features with strengths from weak to strong simultaneously. In particular, we integrate the adversarial training objective with respect to the feature perturbation strength over an interval instead of at a single point, depicted as follows:

$\min_{\theta_b, \theta_t} \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \mathcal{L}_{std}\big(f(x; \theta_b, \theta_t), y\big) + \frac{\lambda}{\epsilon_{max}} \int_{0}^{\epsilon_{max}} \mathcal{L}_{adv}\big(f(\mathcal{F}(x) + \delta(\epsilon); \theta_t), y\big) \, \mathrm{d}\epsilon \Big]$    (4)

where $[0, \epsilon_{max}]$ is the integration interval for the perturbation strength $\epsilon$, and $\delta(\epsilon)$ is the crafted feature perturbation dependent on $\epsilon$. In a similar way, we can generate adversarially augmented features for the task-specific modules of classification, detection, and segmentation models.

Approximation.

Unfortunately, the integral in Equation 4 is intractable due to the lack of an explicit functional representation for deep neural networks. We provide an approximate solution by uniformly sampling $n$ strengths $\epsilon_i \in [0, \epsilon_{max}]$ and subsequently generating the augmented features $\mathcal{F}_{adv}^{i} = \mathcal{F}(x) + \delta(\epsilon_i)$, as shown in Figure 2. Specifically,

$\min_{\theta_b, \theta_t} \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \mathcal{L}_{std}\big(f(x; \theta_b, \theta_t), y\big) + \frac{\lambda}{n} \sum_{i=1}^{n} \mathcal{L}_{adv}\big(f(\mathcal{F}_{adv}^{i}; \theta_t), y\big) \Big]$    (5)

where $\mathcal{F}_{adv}^{i}$ is the $i$-th adversarially augmented feature embedding.
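A sketch of the sampled approximation in Equation 5 (our illustration under the notation above; `head` stands for the remainder of the network after the perturbed features, and the uniform strength grid is an assumption):

```python
import torch

def adversarial_features(feat, head, loss_fn, y, eps_max=0.1, n=3, steps=1):
    """Craft n adversarially perturbed copies of the intermediate features
    `feat`, one per strength eps_i uniformly spaced in (0, eps_max]."""
    augmented = []
    for eps in torch.linspace(eps_max / n, eps_max, n).tolist():
        alpha = eps / steps                           # step size scales with eps
        delta = torch.zeros_like(feat, requires_grad=True)
        for _ in range(steps):
            loss = loss_fn(head(feat + delta), y)
            grad = torch.autograd.grad(loss, delta)[0]
            with torch.no_grad():
                delta += alpha * grad.sign()
                delta.clamp_(-eps, eps)
        augmented.append((feat + delta).detach())
    return augmented
```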

3.4 Adversarial Feature Normalization

In this section, we introduce the proposed adversarial feature normalization. Inspired by [zhang2017mixup, yun2019cutmix, li2020feature], we fuse clean ($\mathcal{F}$) and adversarially perturbed ($\mathcal{F}_{adv}^{i}$) features for each training sample. Specifically, normalized features are crafted by normalizing clean features with adversarial feature moments. This asymmetric composition across clean and adversarial features assists networks in smoothing out decision boundaries and obtaining improved generalization [li2020feature].

Let $\mu$ and $\mu_{adv}^{i}$ denote the first-order moments of the clean feature and the $i$-th augmented adversarial feature, respectively. Similarly, $\sigma$ and $\sigma_{adv}^{i}$ denote the corresponding second-order moments. These feature statistics are calculated in the routine of batch normalization [ioffe2015batch]. Note that the statistics can also be derived from other normalization approaches [ba2016layer, ulyanov2016instance, wu2018group, li2019positional], such as instance norm. The detailed formulation is defined as follows:

$\hat{\mathcal{F}}_{adv}^{i} = \sigma_{adv}^{i} \cdot \dfrac{\mathcal{F} - \mu}{\sigma} + \mu_{adv}^{i}, \quad i = 1, \dots, n$    (6)

where $i \in \{1, \dots, n\}$ and $n$ is the number of augmented features. The normalized features $\hat{\mathcal{F}}_{adv}^{i}$ are fed to the rest of the network to compute the adversarial training objective $\mathcal{L}_{adv}$.
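A minimal sketch of Equation 6, assuming per-channel batch statistics computed over the (N, H, W) dimensions of NCHW feature maps, in the spirit of batch normalization (illustrative only):

```python
import torch

def inject_adv_statistics(feat_clean, feat_adv, eps=1e-5):
    """Equation 6: strip the clean moments from the clean features, then
    re-inject the moments of the adversarially perturbed features."""
    dims = (0, 2, 3)                                  # batch and spatial dims
    mu_c = feat_clean.mean(dims, keepdim=True)
    sigma_c = feat_clean.std(dims, keepdim=True)
    mu_a = feat_adv.mean(dims, keepdim=True)
    sigma_a = feat_adv.std(dims, keepdim=True)
    return sigma_a * (feat_clean - mu_c) / (sigma_c + eps) + mu_a
```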

3.5 Overall Framework of A-FAN

As presented in Figure 2, we first generate a sequence of adversarial perturbations with diverse strengths to augment the intermediate features. Then, we inject perturbed feature statistics into clean features by feature normalization. In the end, the augmented and normalized features together with clean features are both utilized in the network training. In this way, adversarial training can be formulated as an effective regularization to improve the generalization ability of visual recognition models. The full algorithm is summarized in Algorithm 1.

1: Input: visual recognition model $f(x; \theta_b, \theta_t)$, where $\mathcal{F}(x)$ are the intermediate features from the backbone.
2: # Generate adversarially augmented features
3: Uniformly sample $n$ different perturbation strengths $\epsilon_i$ from $[0, \epsilon_{max}]$.
4: Generate adversarial perturbations $\delta(\epsilon_1)$ with PGD, according to Equations 2 and 3.
5: Apply $\delta(\epsilon_1)$ to the intermediate features and obtain adversarial features $\mathcal{F}_{adv}^{1}$.
6: for $i = 2, \dots, n$ do
7:     Generate the other augmented features $\mathcal{F}_{adv}^{i}$ via the efficient implementation in Section 3.3.
8: end for
9: # Generate adversarially normalized features
10: Calculate the feature statistics $\mu$, $\sigma$, $\mu_{adv}^{i}$, and $\sigma_{adv}^{i}$ with batch normalization [ioffe2015batch].
11: for $i = 1, \dots, n$ do
12:     Inject the adversarial feature statistics into the clean features via normalization and obtain normalized features $\hat{\mathcal{F}}_{adv}^{i}$, according to Equation 6.
13: end for
14: Feed the normalized features to the model and compute the complete objective of A-FAN in Equation 7.
15: return Training objective $\mathcal{L}_{A\text{-}FAN}$
Algorithm 1 Adversarial Feature Augmentation and Normalization (A-FAN).

After incorporating adversarial feature augmentation and normalization, the complete training objective of A-FAN can be computed as follows:

$\mathcal{L}_{A\text{-}FAN} = \mathcal{L}_{std} + \dfrac{\lambda}{n} \sum_{i=1}^{n} \mathcal{L}_{adv}\big(f(\hat{\mathcal{F}}_{adv}^{i}; \theta_t), y\big)$    (7)

where $\lambda$ is tuned by grid search.
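Putting the pieces together, a single A-FAN training step may be sketched as follows, reusing the illustrative helpers `adversarial_features` and `inject_adv_statistics` from the sketches above (`lam` plays the role of $\lambda$ in Equation 7; this is our reading of the pipeline, not the released code):

```python
def a_fan_step(backbone, head, loss_fn, x, y, optimizer,
               eps_max=0.1, n=3, lam=1.0):
    feat = backbone(x)
    loss_std = loss_fn(head(feat), y)                 # standard objective

    # Equation 5: n adversarial copies; Equation 6: re-normalize them.
    adv = adversarial_features(feat.detach(), head, loss_fn, y, eps_max, n)
    loss_adv = sum(
        loss_fn(head(inject_adv_statistics(feat, fa)), y) for fa in adv
    ) / n

    loss = loss_std + lam * loss_adv                  # Equation 7
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```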

4 A-FAN on Image Classification

Datasets and Backbones.

We consider three representative datasets for image classification: CIFAR-10, CIFAR-100 [krizhevsky2009learning], and ImageNet [deng2009imagenet]. In our experiments, the original training sets are randomly split into training and validation sets. Early stopping is applied to find the top-performing checkpoints on the validation set, and the selected checkpoints are then evaluated on the test set to report performance. The hyperparameters, including PGD steps, step size $\alpha$, the layers at which to introduce adversarial perturbations, and the number of perturbations with different strength levels, are tuned by grid search and are quite stable from validation to test sets based on our observations. We evaluate large backbone networks (ResNet-18/50/101/152 [he2016deep], EfficientNet-B0 [tan2019efficientnet]) on ImageNet, and also test smaller backbones (ResNet-20s/56s) on CIFAR-10 and CIFAR-100. More details about training and evaluation are provided in Section S2.1.

Settings     CIFAR-10                    CIFAR-100
             Baseline   A-FAN            Baseline   A-FAN
ResNet-20s   91.25      92.52 (+1.27)    66.92      67.89 (+0.97)
ResNet-56s   93.59      94.82 (+1.23)    71.22      72.36 (+1.14)
Table 2: Standard testing accuracy (SA%) of ResNet-20s/56s on CIFAR-10 and CIFAR-100. Baseline denotes standard training without A-FAN. (+) indicates the improvement in SA compared to the corresponding baseline under standard training.
Settings          ImageNet
                  Baseline   Baseline + A-FAN
ResNet-18         69.38      70.25 (+0.87)
ResNet-50         75.21      76.33 (+1.12)
ResNet-101        77.10      78.14 (+1.04)
ResNet-152        78.31      78.69 (+0.38)
EfficientNet-B0   77.04      77.50 (+0.46)
Table 3: Standard testing accuracy (SA%) of ResNet-18/50/101/152 and EfficientNet-B0 on the ImageNet dataset.

CIFAR and ImageNet Results.

We apply PGD-5 and PGD-1 to augment the feature embeddings in the last block with adversarial perturbations for CIFAR and ImageNet models, respectively. A series of adversarially augmented features are crafted with three different strengths uniformly sampled from $[0, \epsilon_{max}]$, with the step size $\alpha$ tuned by grid search. Tables 2 and 3 present the standard testing accuracy of diverse models on CIFAR-10, CIFAR-100, and ImageNet. Comparing standard training (i.e., Baseline) with our proposed A-FAN, the main observations are:

  • A-FAN obtains a consistent and substantial improvement in standard accuracy, e.g., +1.27% on CIFAR-10 with ResNet-20s, +1.14% on CIFAR-100 with ResNet-56s, and +1.12%/+0.46% on ImageNet with ResNet-50 and EfficientNet-B0. This suggests that training with augmented and normalized features generated by A-FAN effectively enhances the generalization of deep networks. We hypothesize that this is because adversarially perturbed features act as an implicit regularization, leading to better solutions for network training.

  • Shallow ResNets benefit more from A-FAN than deep ResNets (e.g., +1.12% on ResNet-50 vs. +0.38% on ResNet-152). A possible reason is that the performance of standard-trained deep ResNets is already saturated, leaving little room for improvement.

Furthermore, we notice that A-FAN calls for different numbers of PGD steps to achieve superior performance on different datasets. More ablation analyses can be found in Section 7. Meanwhile, although robust testing accuracy is not the focus of A-FAN, we report it for completeness in Section S3.1.

Settings   ResNet-18 on CIFAR-10      EfficientNet-B0 on ImageNet
           SA               Time      SA               Time
Baseline   94.30            23s       77.00            2628s
AdvProp    94.52 (+0.22)    123s      77.60 (+0.60)    13352s
A-FAN      94.67 (+0.37)    56s       77.50 (+0.50)    6237s
Table 4: Running time per epoch and standard testing accuracy (SA%) comparison across Baseline, AdvProp, and A-FAN.

A-FAN vs. AdvProp.

We compare A-FAN with AdvProp [xie2020adversarial] on CIFAR-10 with ResNet-18, and on ImageNet with EfficientNet-B0 [tan2019efficientnet], as presented in Table 4. CIFAR-10 models are trained on a single GTX 1080 Ti GPU. ImageNet experiments (batch size 256) are conducted on Quadro RTX 6000 GPUs (24GB x 2 in total). Generating feature-level perturbations requires only a partial backpropagation to the target intermediate layer, which brings computational savings. The results also confirm our intuition that the proposed A-FAN, as an effective and efficient alternative to pixel-level adversarial augmentation (e.g., AdvProp), achieves competitive performance with much less computational cost (i.e., shorter running time).

5 A-FAN on Object Detection

Datasets and Backbones.

We evaluate A-FAN on Pascal VOC2007 [everingham2010pascal] and COCO2017 [lin2014microsoft] for object detection. COCO2017 is a large-scale dataset with more than ten times as much data as Pascal VOC2007. In our experiments, we choose the widely used Faster RCNN framework [ren2015faster] for detection tasks. It is worth mentioning that the proposed A-FAN approach can be directly plugged into other detection frameworks without any change, which is left to future work. We conduct experiments with both ResNet-50 [he2016deep] and ResNet-101 [he2016deep] as backbone networks. More details about training and evaluation can be found in Section S2.2.

Pascal VOC and COCO Results.

Results are presented in Table 5. All hyperparameters of A-FAN are tuned by grid search, including PGD steps, step size $\alpha$, the layers at which to introduce adversarial feature augmentations, and the number of perturbations with different strength levels. We find that utilizing PGD-1 to generate adversarial feature perturbations in the last layer of the backbone and ROI networks of Faster RCNN achieves the most promising performance. The maximum strength $\epsilon_{max}$ is chosen separately for Pascal VOC2007 and COCO2017. For both datasets, a series of adversarially augmented features are crafted with five different strengths uniformly sampled from $[0, \epsilon_{max}]$. To evaluate the robustness (i.e., robust AP) of detection models [li2018robust, xie2017adversarial], a PGD-10 attack is applied.

COCO2017         ResNet-50              ResNet-101
                 Baseline    A-FAN      Baseline    A-FAN
AP (%)           33.20       33.85      36.21       37.05
AP50 (%)         53.92       54.73      56.90       57.31
AP75 (%)         35.83       36.54      39.40       40.22
Robust AP (%)    0.00        0.50       0.20        0.66

Pascal VOC2007   ResNet-50              ResNet-101
                 Baseline    A-FAN      Baseline    A-FAN
mAP (%)          73.96       75.38      74.32       75.71
Robust mAP (%)   0.86        2.43       1.71        3.85

Table 5: Performance of object detection on Pascal VOC2007 and COCO2017 datasets. Faster RCNN is equipped with ResNet-50/ResNet-101 backbone networks, respectively. Robustness is evaluated on the adversarial perturbed images [li2018robust, xie2017adversarial] via PGD-10.

Table 5 summarizes the results of the baseline (i.e., standard training) and A-FAN. More results with different training settings are provided in Section S4. Comparing standard training with our proposed A-FAN mechanism, several major observations can be drawn:

  • A-FAN consistently achieves substantial performance improvement across multiple backbones on diverse datasets. Specifically, A-FAN gains +0.65/+0.84 AP with ResNet-50/ResNet-101 on COCO2017, and +1.42/+1.39 mAP with ResNet-50/ResNet-101 on Pascal VOC2007. This demonstrates that training with adversarially augmented and normalized features crafted via A-FAN significantly boosts the generalization of detection models. A possible reason is that utilizing adversarially perturbed features as an implicit regularization for training leads to better generalization.

  • Detectors trained on small-scale datasets benefit more from A-FAN. For example, Faster RCNN with a ResNet-50 backbone obtains roughly double the mAP boost (i.e., +1.42 vs. +0.65) on VOC2007 compared to COCO2017 (AP shares the same meaning as mAP in VOC datasets [9102805]). It comes as no surprise that adversarially augmented and normalized features can be regarded as data augmentation in the embedding space and therefore perform more effectively on small-scale datasets [shorten2019survey]. We also notice that Faster RCNN with both shallow and deep ResNets gets a similar degree of improvement.

  • Besides the enhanced generalization ability, detectors trained with A-FAN also obtain better robustness, improved by up to +0.50 robust AP on COCO2017 and up to +2.14 robust mAP on Pascal VOC2007. Although the improved robustness still falls far short of adversarially trained models [dai2016r, ren2016faster, lin2017feature], it is an extra bonus from A-FAN.

  • A-FAN achieves improvements comparable to previous learned data augmentation approaches (e.g., [zoph2019learning]); see Section S4 for a detailed comparison.

6 A-FAN on Semantic Segmentation

Datasets and Backbones.

We validate the effectiveness of A-FAN on Pascal VOC2007 [everingham2010pascal], Pascal VOC2012 [everingham2015pascal], and Cityscapes [cordts2016cityscapes] for semantic segmentation. Among these commonly used datasets, Cityscapes is a large-scale dataset with more than ten times as much data as Pascal VOC2007/2012. In our experiments, the popular DeepLab V3+ framework [chen2018encoder], with ResNet-50 [he2016deep] and ResNet-101 [he2016deep] as backbone networks, is adopted for segmentation tasks. Note that A-FAN can also be directly plugged into other segmentation frameworks without any change, which is left to future work. More details are provided in Section S2.3.

Pascal VOC2012    ResNet-50              ResNet-101
                  Baseline    A-FAN      Baseline    A-FAN
mIOU (%)          71.20       72.21      73.65       74.91
Robust mIOU (%)   10.84       12.07      9.75        11.01

ResNet-50         Pascal VOC2007         Cityscapes
                  Baseline    A-FAN      Baseline    A-FAN
mIOU (%)          61.51       62.83      76.00       76.43
Robust mIOU (%)   6.77        7.06       0.51        1.11

Table 6: Performance of semantic segmentation on Pascal VOC2007, Pascal VOC2012, and Cityscapes datasets. DeepLab V3+ is equipped with ResNet-50/ResNet-101 backbone networks. Robustness is evaluated on adversarially perturbed images [shen2019advspade] via PGD-10.

Pascal VOC and Cityspaces Results.

Results are collected in Table 6. We adopt PGD-1 to craft adversarially augmented features with three different perturbation strengths (sampled from $[0, \epsilon_{max}]$) in the last layer of the backbone and the decoder networks of DeepLab V3+, with $\epsilon_{max}$ chosen separately for Pascal VOC2007, Pascal VOC2012, and Cityscapes. All hyperparameters are tuned by grid search. PGD-10 is employed to measure the robustness (i.e., robust mIOU) of segmentation models [shen2019advspade].

From the results in Table 6, we observe that DeepLab V3+ gains substantial performance improvement from A-FAN, consistent with the observations on detection models. First, A-FAN enhances the generalization of segmentation models by +1.01/+1.26 mIOU with ResNet-50/ResNet-101 on Pascal VOC2012, by +1.32 mIOU with ResNet-50 on Pascal VOC2007, and by +0.43 mIOU with ResNet-50 on Cityscapes. Second, A-FAN improves DeepLab V3+ more on Pascal VOC2007/2012 than on Cityscapes (i.e., +1.32/+1.01 vs. +0.43), where the former two datasets have only one-tenth the amount of data of Cityscapes. Third, training with A-FAN yields moderate robustness improvements (up to +1.26 robust mIOU) for segmentation models.

7 Ablation Study and Analyses

Due to limited space, more ablation results and analyses can be found in Sections S3.1, S4, and S5 for classification, detection, and segmentation models, respectively.

Augmentation vs. Normalization.

To verify the effects of adversarial feature augmentation (AFA) and adversarial feature normalization (AFN) in A-FAN, we incrementally evaluate each module on CIFAR-10 for image classification, Pascal VOC2007 for object detection, and Pascal VOC2012 for semantic segmentation. As shown in Table S10 and Table 7, AFA improves the baseline by +0.86 SA/+1.13 AP/+0.89 mIOU for classification, detection, and segmentation, respectively. Combining the two modules, AFA and AFN, gains a further performance boost, reaching +1.23 SA/+1.42 AP/+1.01 mIOU on CIFAR-10, Pascal VOC2007, and VOC2012. These results demonstrate that each proposed component contributes to improving the generalization ability of classification, detection, and segmentation models, and that AFA plays the dominant role in ameliorating performance.

Settings                Detection        Segmentation
                        AP (%)           mIOU (%)
Baseline                73.96            71.20
  + AFA                 75.09 (+1.13)    72.09 (+0.89)
  + AFA + AFN           75.38 (+1.42)    72.21 (+1.01)
A-FAN on Backbone       75.06 (+1.10)    71.98 (+0.78)
A-FAN on ROI/Decoder    74.68 (+0.72)    71.71 (+0.51)
A-FAN on Both           75.38 (+1.42)    72.21 (+1.01)

Table 7: Ablation study of A-FAN on Pascal VOC2007 and Pascal VOC2012 for detection and segmentation, respectively. AFA: adversarial feature augmentation; AFN: adversarial feature normalization (i.e., A-FAN = AFA + AFN). A ResNet-50 backbone is used here. (+) indicates performance improvement compared to the corresponding baseline. Classification results are in Table S10.
Figure 4: Ablation study on the location and strength of introducing A-FAN to detection models. Results are on the Pascal VOC2007 dataset. (a) PGD steps used in the generation of adversarial perturbations; (b) the number of augmented features ($n$ in Equation 5); (c) the location to apply A-FAN, e.g., B1 means that A-FAN is applied to features from the first residual block in the ResNet backbone; (d) step size $\alpha$ that controls the strength of crafted perturbations. The red points represent settings with top performance.

Effects on Backbone vs. ROI/Decoder.

In general, detection and segmentation models can be divided into backbone and task-specific modules (e.g., RPN/ROI in Faster RCNN [ren2016faster] and ASPP/Decoder in Deeplab V3+). Our proposed A-FAN can be introduced to either or both modules, as shown in the detailed ablations in Table 7. We observe that applying A-FAN to backbone networks (+1.10 AP/+0.78 mIOU) gains more generalization improvement than applying it to ROI/Decoder modules (+0.72 AP/+0.51 mIOU) for detection and segmentation. Incorporating A-FAN on both backbone and task-specific modules consistently enjoys an extra performance boost, compared to applying it to either one alone.

Effects of Location and Strength.

The performance gain from A-FAN is determined by the location and strength of the generated adversarial perturbations. Figures S5, 4, and S8 illustrate a comprehensive control study investigating these factors. Without loss of generality, these ablation experiments and analyses are performed on backbone networks. When studying one factor, we choose the best configuration for the other factors.

To identify the proper location for the A-FAN operation, we inject feature perturbations into different blocks (e.g., B1) or combinations of blocks (e.g., B2+B3), as presented in Figure S5 (c) for classification, Figure 4 (c) for detection, and Figure S8 (c) for segmentation. We notice that applying A-FAN to features from the last block (i.e., B3 or B4) obtains the best performance, while introducing A-FAN to multiple blocks degrades generalization.

The strength of A-FAN involves the number of PGD steps and the step size $\alpha$ for generating adversarial features, and the number of augmented features with different perturbation strengths ($n$ in Equation 5), as shown in Figure S5 (a), (b), (d) for classification, Figure 4 (a), (b), (d) for detection, and Figure S8 (a), (b), (d) for segmentation. Experiments show that {ResNet-18, Faster RCNN, Deeplab V3+} gain the most from A-FAN with {PGD-5, PGD-1, PGD-1} and {3, 5, 3} augmented features with different perturbation strengths, with the step size tuned by grid search. These systematic evaluations reveal that weak adversarially perturbed features contribute only marginal generalization improvements, while excessively strong A-FAN (e.g., PGD-10) incurs performance deterioration. In summary, a proper configuration of A-FAN usually produces high-quality augmented and normalized features, realizing enhanced visual recognition models.

Comparing A-FAN with Random Noise.

One straightforward approach to augmenting feature embeddings is injecting random noise. Here, we replace the generated adversarial noise in our proposed mechanism with noise randomly sampled from a Gaussian distribution. As shown in Table 8, AFA+AFN (i.e., A-FAN) achieves a much larger performance gain than Random Noise+AFN, suggesting that gradient-based crafted feature augmentation benefits the generalization ability of visual recognition models far more.

Settings                  CIFAR-10         VOC2007          VOC2012
                          SA (%)           AP (%)           mIOU (%)
Random Noise + AFN        93.36 (-0.23)    73.91 (-0.05)    71.23 (+0.03)
AFA + AFN (i.e., A-FAN)   94.82 (+1.23)    75.38 (+1.42)    72.21 (+1.01)

Table 8: Performance comparison between adversarial feature perturbations and random noise sampled from a Gaussian distribution. Results are reported on CIFAR-10 (with ResNet-56s), Pascal VOC2007, and Pascal VOC2012 for classification, detection, and segmentation, respectively. (+)/(-) indicates performance improvement/degradation compared to the baseline.

Visualization.

Figures S6, S7, and S9 provide visualizations of the adversarially augmented features and normalized features generated by A-FAN. Features are collected by applying A-FAN to classification, detection, and segmentation models on the ImageNet, Pascal VOC2007, and VOC2012 datasets, respectively. Visualizations for classification models can be found in Section S3.1. For better visualization, we use features from the first block of the backbone networks and further enlarge the magnitude of the adversarial perturbations. We notice that normalizing features by injecting adversarial statistics into clean features seems to neutralize excessive adversarial noise. This offers an explanation for the extra performance improvement from adversarial feature normalization.

8 Conclusion and Discussion

In this paper, we present A-FAN, an enhanced adversarial training method to improve image classification, object detection, and semantic segmentation. By generating a series of adversarial perturbations with different strengths on feature embeddings, and fusing adversarial feature statistics with clean features, A-FAN substantially boosts the generalization ability of various models across multiple representative datasets, such as CIFAR-10/100, ImageNet, Pascal VOC2007/2012, COCO2017, and Cityscapes. For future work, we would like to extend A-FAN to more tasks and provide a deeper theoretical understanding of A-FAN.

References

Appendix S1 More Methodology Details

S1.1 Efficient Implementation of A-FAN

In practice, we offer an efficient implementation to compute Equation 5 for A-FAN with PGD-1. Since we by default [madry2017towards, xie2020adversarial] initialize the perturbation at zero, for most A-FAN experiments with a small step size and PGD-1, PGD degrades to simple gradient descent. Similar routines are also adopted in [gan2020large]. In this sense, the step size $\alpha$ is the only indicator of perturbation strength, satisfying $\delta = \alpha \cdot \mathrm{sign}(\nabla_{\delta}\mathcal{L}_{adv})$ as demonstrated in Equation 2. Thus, $\delta(\alpha_i)$ can be efficiently calculated by $\delta(\alpha_i) = \frac{\alpha_i}{\alpha_n}\,\delta(\alpha_n)$, which merely requires applying PGD once rather than $n$ times as in Equation 5, and then derives the other augmented features with negligible extra cost. Note that this shortcut is not available for multi-step PGD or large step sizes (e.g., $\alpha > \epsilon$).
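A sketch of this shortcut (our reading of the derivation above): with zero initialization, a single PGD step yields $\delta = \alpha \cdot \mathrm{sign}(\nabla_{\delta}\mathcal{L}_{adv})$, whose sign term is independent of $\alpha$, so perturbations at all other strengths are exact rescalings of one crafted perturbation. `head`, `loss_fn`, and `alphas` are placeholder names:

```python
import torch

def scaled_perturbations(feat, head, loss_fn, y, alphas):
    """PGD-1 with zero init: craft one perturbation at the largest step
    size, then rescale it for every other strength (single backward)."""
    alpha_max = max(alphas)
    delta = torch.zeros_like(feat, requires_grad=True)
    loss = loss_fn(head(feat + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    delta_max = alpha_max * grad.sign()               # one attack pass
    return [(a / alpha_max) * delta_max for a in alphas]
```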

Datasets              Pascal VOC2007 (det.)   COCO2017 (det.)   Pascal VOC2007 (seg.)     Pascal VOC2012 (seg.)     Cityscapes (seg.)
Batch Size            -                       -                 -                         -                         -
Iterations            -                       -                 -                         -                         -
Init. Learning Rate   0.008                   0.01              0.01                      0.01                      0.1
Learning Rate Decay   step decay              step decay        polynomial w. power 0.9   polynomial w. power 0.9   polynomial w. power 0.9
Optimizer             SGD with momentum 0.9 and weight decay (all datasets)
Eval. Metric          mAP                     AP, AP50, AP75    mIOU                      mIOU                      mIOU

Table S9: Details of training and evaluation. We use the standard implementations and hyperparameters in [ren2015faster, chen2018encoder]. The evaluation metrics also follow the standards in [ren2015faster, chen2018encoder]. Linear learning rate warm-up is applied.

Appendix S2 More Implementation Details

S2.1 More A-FAN on Image Classification

Training Details and Evaluation Metrics.

For network training on CIFAR-10 and CIFAR-100, we adopt an SGD optimizer with momentum and weight decay. The learning rate decays to one-tenth at two milestone epochs, and we also perform a linear learning rate warm-up over the first iterations. For ImageNet experiments, we follow the official setting in the PyTorch examples repository (https://github.com/pytorch/examples/tree/master/imagenet): deep networks are trained with an SGD optimizer with momentum and weight decay, and the learning rate decays twice during training. We evaluate the generalization ability of a network with Standard Testing Accuracy (SA), which represents image classification accuracy on the original clean test set.

S2.2 More A-FAN on Object Detection

Training and Evaluation Metrics.

For both detection datasets, we follow [ren2015faster]: on Pascal VOC2007, we use the train and validation sets for training and evaluate on the test set; on COCO2017, we train models on the train set and evaluate on the validation set. All other implementation details are provided in Table S9.

S2.3 More A-FAN on Semantic Segmentation

Training and Evaluation Metrics.

For the segmentation benchmarks, we follow [chen2018encoder]: on Pascal VOC2007, we use the train and validation sets for training and evaluate on the test set; on Pascal VOC2012 and Cityscapes, we train models on the train set and evaluate on the validation set. All other implementation details are provided in Table S9.

Appendix S3 More Experimental Results

S3.1 More Classification Results

Training with Full Training Sets.

For a sanity check, we also conduct experiments with ResNet-50 on the full ImageNet training set. Equipped with A-FAN, the model obtains an improvement in standard accuracy over the baseline.

Augmentation vs. Normalization.

To verify the effects of adversarial feature augmentation (AFA) and adversarial feature normalization (AFN) in A-FAN, we incrementally evaluate each module on CIFAR-10 with ResNet-56s. As shown in Table S10, each proposed component contributes to improving the generalization ability of classification models, and AFA plays the dominant role in ameliorating performance.

Settings        Classification
                Standard Accuracy (%)
Baseline        93.59
  + AFA         94.45 (+0.86)
  + AFA + AFN   94.82 (+1.23)

Table S10: Ablation study of A-FAN on CIFAR-10 with ResNet-56s. AFA: adversarial feature augmentation; AFN: adversarial feature normalization (i.e., A-FAN = AFA + AFN). (+) indicates performance improvement compared to the baseline on the corresponding dataset.

Strength and Locations of A-FAN.

To understand the effect of the strength of injected adversarial perturbations, we train ResNet-18 on CIFAR-10 and examine performance across different step sizes, numbers of PGD steps, and numbers of augmented features. Figure S5 shows that perturbing with PGD-1 and augmenting three features with diverse perturbation strengths achieves superior performance compared with other configurations. In general, this mirrors the ablation observations for detection (Figure 4) and segmentation (Figure S8) models. For the ablation of PGD steps, we also evaluate ResNet-18 on ImageNet in Table S11, which suggests that A-FAN with PGD-1 works best for ImageNet.

Then, we analyze the effect of location (i.e., where to apply A-FAN) using a typical backbone, ResNet-18 on CIFAR-10. In each setting, we present a detailed analysis of which layers, and how many layers, should have their feature embeddings adversarially augmented for the best performance. Figure S5 shows the layer preference of feature perturbations when applying A-FAN to different blocks or combinations of blocks. We notice that introducing A-FAN to the last block achieves better standard accuracy, while performance deteriorates after injecting A-FAN into multiple blocks.

Steps   ImageNet
        PGD-1           PGD-3           PGD-5
A-FAN   70.25 (+0.87)   68.65 (-0.73)   67.42 (-1.96)
Table S11: Standard testing accuracy (%) on the ImageNet dataset. For ImageNet, we also perturb the last-block features of ResNet-18 via PGD-1/3/5. The reference SA is 69.38.
Figure S5: Ablation study on the location and strength of introducing A-FAN to classification models. Results are on the CIFAR-10 dataset with ResNet-18. (a) PGD steps used in the generation of adversarial perturbations; (b) the number of augmented features ($n$ in Equation 5); (c) the location to apply A-FAN, e.g., B1 means that A-FAN is applied to features from the first residual block in the ResNet backbone; (d) step size $\alpha$ that controls the strength of crafted perturbations. The red points represent settings with top performance.

Robust Performance of A-FAN.

Although robust testing accuracy (RA) is not the focus of A-FAN, we report it for completeness. We train standard, A-FAN, and adversarially trained ResNet-18 networks on CIFAR-10. The adversarially trained model uses PGD-10 for training; PGD-20 with the same step size and radius is then applied to evaluate the robust performance of the three models. We observe that A-FAN-trained models yield moderate robustness, falling between standard and adversarially trained models.

Visualization.

Figure S6 collects the visualization of adversarially augmented and normalized features for a ResNet-18 trained with A-FAN.

Figure S6: Visualization of adversarially augmented and normalized features for classification models with A-FAN, using a trained ResNet-18. The fifth and sixth columns are the normalized versions of the features in the third and fourth columns, respectively.

Appendix S4 More Object Detection Results

Additional Experiments of Detection on COCO2017.

Following another representative repository (https://github.com/potterhsu/easy-faster-rcnn.pytorch) for the Faster RCNN [ren2015faster] implementation on COCO2017, we further verify the effectiveness of our proposed A-FAN. Table S12 collects the detailed setup. As shown in Table S13, A-FAN boosts the baseline detection model by +0.96 AP. The consistent performance gains in Table S13 and the main text reveal that A-FAN benefits detection models across diverse training configurations.

Datasets              Detection on COCO2017
Batch Size            -
Iterations            -
Init. Learning Rate   0.01
Learning Rate Decay   step decay
Optimizer             SGD with momentum 0.9 and weight decay
Eval. Metric          AP, AP50, AP75

Table S12: Details of training and evaluation. We use the standard implementations and hyperparameters of the repository. The evaluation metrics also follow the standards in [ren2015faster]. Linear learning rate warm-up is applied.

Metrics          ResNet-101 on COCO2017
                 Baseline   Baseline + A-FAN
AP (%)           37.00      37.96
AP50 (%)         57.60      58.40
AP75 (%)         40.33      41.01
Robust AP (%)    0.21       0.60

Table S13: Performance of object detection on the COCO2017 dataset. Faster RCNN is equipped with a ResNet-101 backbone network. Robustness is evaluated on adversarially perturbed images [li2018robust, xie2017adversarial] via PGD-10.

Comparison with Learned Data Augmentation (LDA) for Object Detection.

A recent work [zoph2019learning] presents learned, specialized data augmentation policies to improve the generalization of detection models. Although this direction is orthogonal to our proposed feature-level adversarial augmentation, we provide comparison experiments for a comprehensive investigation, as shown in Table S14. Note that, for a fair comparison, we follow exactly the same setting as [zoph2019learning]: we combine the training sets of Pascal VOC2007 and Pascal VOC2012, and test the trained models on the Pascal VOC2007 test set (4,953 images). From Table S14, we observe that both A-FAN and LDA obtain performance improvements, by +3.66 mAP and +2.70 mAP, respectively. The superior performance further validates the effectiveness of our proposed A-FAN.

Metrics          ResNet-101 on Pascal VOC2007
                 Baseline   Baseline + A-FAN   LDA
mAP (%)          76.00      79.66              78.70
Robust mAP (%)   2.59       5.05               -

Table S14: Performance of object detection on the Pascal VOC2007 dataset. Faster RCNN is equipped with a ResNet-101 backbone network. Robustness is evaluated on adversarially perturbed images [li2018robust, xie2017adversarial] via PGD-10.

Visualization.

Figure S7 presents the visualization of adversarially augmented and normalized features for detection models with A-FAN, using a trained Faster RCNN.

Figure S7: Visualization of adversarially augmented and normalized features for detection models with A-FAN, using a trained Faster RCNN. The left column shows the input image and the corresponding clean feature. The remaining four columns, from left to right, present features with increasing perturbation strength; from top to bottom, augmented and normalized features alternate.

Appendix S5 More Segmentation Results

Strength and Locations of A-FAN.

Figure S8 provides a comprehensive control study investigating the relevant factors of A-FAN for segmentation models.

Figure S8: Ablation study on the location and strength of introducing A-FAN to segmentation models. Results are on the Pascal VOC2012 dataset. (a), (b), (c), and (d) share the same definitions as in Figure 4. The red points represent settings with top performance.

Visualization.

Figure S9 collects the visualization of adversarially augmented and normalized features for segmentation models with A-FAN, using a trained Deeplab V3+.

Figure S9: Visualization of adversarially augmented and normalized features for segmentation models with A-FAN, using a trained Deeplab V3+. The fifth and sixth columns are the normalized versions of the features in the third and fourth columns, respectively.