Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning

03/28/2020 ∙ by Tianlong Chen, et al. ∙ Microsoft 11

Pretrained models from self-supervision are prevalently used to fine-tune downstream tasks faster or to better accuracy. However, gaining robustness from pretraining is left unexplored. We introduce adversarial training into self-supervision, to provide general-purpose robust pretrained models for the first time. We find these robust pretrained models can benefit the subsequent fine-tuning in two ways: i) boosting final model robustness; ii) saving the computation cost, if proceeding towards adversarial fine-tuning. We conduct extensive experiments to demonstrate that the proposed framework achieves large performance margins (e.g., 3.83% on robust accuracy, on the CIFAR-10 dataset), compared with the conventional end-to-end adversarial training baseline. Moreover, we find that different self-supervised pretrained models have diverse adversarial vulnerabilities. This inspires us to ensemble several pretraining tasks, which boosts robustness further. Our ensemble strategy contributes to a further improvement of 3.59% on robust accuracy, while maintaining a slightly higher standard accuracy on CIFAR-10. Our codes are available at




1 Introduction

Supervised training of deep neural networks requires massive, labeled datasets, which may be unavailable and costly to assemble [hinton2006fast, bengio2007greedy, raina2007self, vincent2010stacked]. Self-supervised and unsupervised training techniques attempt to address this challenge by eliminating the need for manually labeled data. Representations pretrained through self-supervised techniques enable fast fine-tuning to multiple downstream tasks, and lead to better generalization and calibration [liu2019towards, mohseni2020self]. Examples of tasks proven to attain high accuracy through self-supervised pretraining include position predicting tasks (Selfie [trinh2019selfie], Jigsaw [noroozi2016unsupervised, carlucci2019domain]), rotation predicting tasks (Rotation [gidaris2018unsupervised]), and a variety of other perception tasks [criminisi2004region, zhang2016colorful, dosovitskiy2015discriminative].

Figure 1: Summary of our achieved performance (CIFAR-10). The upper right corner indicates the best performance in terms of both standard and robust accuracy. The size of markers represents the number of training epochs to achieve the best robust accuracy. The black circle is the baseline method: end-to-end adversarial training. Blue circles are fine-tuned models that inherit robust models from different self-supervised pretraining tasks. The orange circle is the ensemble of three self-supervised pretraining tasks. The red star is the ensemble of three fine-tuned models. The correspondence between marker size and number of epochs is: Ensemble Fine-tune (144 epochs), Baseline (99 epochs), Ensemble Pretrain (56 epochs), Selfie (50 epochs), Jigsaw (48 epochs), Rotation (46 epochs).

The labeling and sample efficiency challenges of deep learning are further exacerbated by vulnerability to adversarial attacks. For example, Convolutional Neural Networks (CNNs) are widely leveraged for perception tasks, due to high predictive accuracy. However, even a well-trained CNN suffers from high misclassification rates when imperceptible perturbations are applied to the input [Kurakin2016AdversarialML, moosavi2016deepfool]. As suggested by [schmidt2018adversarially], the sample complexity of learning an adversarially robust model with current methods is significantly higher than that of standard learning. Adversarial training (AT) [madry2017towards], the state-of-the-art model defense approach, is also known to be computationally more expensive than standard training (ST). The above facts make it especially meaningful to explore:

Can appropriately pretrained models play a similar role for adversarial training as they have for ST? That is, can they lead to more efficient fine-tuning and better, adversarially-robust generalization?

Self-supervision has only recently been linked to the study of robustness. An approach is offered in [hendrycks2019using], by incorporating the self-supervised task as a complementary objective, which is co-optimized with the conventional classification loss through the method of AT [madry2017towards]. Their co-optimization approach presents scalability challenges, and does not enjoy the benefits of pretrained embeddings. Further, it leaves many unanswered questions, especially with respect to efficient tuning, which we tackle in this paper.


This paper introduces a framework for self-supervised pretraining and fine-tuning into the adversarial robustness field. We motivate our study with the following three scientific questions:

  • Is an adversarially pretrained model effective in boosting the robustness of subsequent fine-tuning?

  • Which provides the better accuracy and efficiency: adversarial pretraining or adversarial fine-tuning?

  • How does the type of self-supervised pretraining task affect the final model’s robustness?

Our contributions address the above questions and can be summarized as follows:

  • We demonstrate for the first time that robust pretrained models leveraged for adversarial fine-tuning result in a large performance gain. As illustrated by Figure 1, the best pretrained model from a single self-supervised task (Selfie) leads to a gain of 3.83% on robust accuracy and 1.3% on standard accuracy on CIFAR-10 when being adversarially fine-tuned, compared with the strong AT baseline. (Throughout this paper, we follow [zhang2019theoretically] and adopt their definitions of standard accuracy and robust accuracy as the two metrics to evaluate our method's effectiveness: a desired model shall be high in both.) Even performing standard fine-tuning (which consumes fewer resources) with the robust pretrained models improves the resulting model's robustness.

  • We systematically study all possible combinations between pretraining and fine-tuning. Our extensive results reveal that adversarial fine-tuning contributes the dominant portion of the robustness improvement, while robust pretraining mainly speeds up adversarial fine-tuning. That can also be read from Figure 1 (smaller marker sizes denote fewer training epochs needed).

  • We experimentally show that the pretrained models resulting from different self-supervised tasks have diverse adversarial vulnerabilities. In view of that, we propose to pretrain with an ensemble of self-supervised tasks, in order to leverage their complementary strengths. On CIFAR-10, our ensemble strategy further contributes to an improvement of 3.59% on robust accuracy, while maintaining a slightly higher standard accuracy. Our approach establishes a new benchmark result on standard accuracy (86.04%) and robust accuracy (54.64%) in the setting of AT.

2 Related Work

Self-supervised pretraining.

Numerous self-supervised learning methods have been developed in recent years, including: region/component filling (e.g., inpainting [criminisi2004region] and colorization [zhang2016colorful]); rotation prediction [gidaris2018unsupervised]; category prediction [dosovitskiy2015discriminative]; and patch-based spatial composition prediction (e.g., Jigsaw [noroozi2016unsupervised, carlucci2019domain] and Selfie [trinh2019selfie]). All perform standard training, and do not tackle adversarial robustness. For example, Selfie [trinh2019selfie] generalizes BERT to image domains. It masks out a few patches in an image, and then attempts to select the right patch to reconstruct the original image. Selfie is first pretrained on unlabeled data and then fine-tuned towards the downstream classification task.

Adversarial robustness.

Many defense methods have been proposed to improve model robustness against adversarial attacks. Approaches range from adding stochasticity [dhillon2018stochastic], to label smoothing and feature squeezing [papernot2017extending, xu2017feature], to denoising and training on adversarial examples [meng2017magnet, liao2018defense]. A handful of recent works point out that those empirical defenses could still be easily compromised [athalye2018obfuscated]. Adversarial training (AT) [madry2017towards] provides one of the strongest current defenses, by training the model over the adversarially perturbed training data, and has not yet been fully compromised by new attacks. [gui2019model, Hu2020Triple] showed AT is also effective in compressing or accelerating models [Zhu2020FreeLB] while preserving learned robustness.

Several works have demonstrated model ensembles [strauss2017ensemble, tramer2017ensemble] to boost adversarial robustness, as the ensemble diversity can challenge the transferability of adversarial examples. Recent proposals [pang2019improving, wang2019unified] formulate the diversity as a training regularizer for improved ensemble defense. Their success inspires our ensembled self-supervised pretraining.

Unlabeled data for adversarial robustness.

Self-supervised training learns effective representations for improving performance on downstream tasks, without requiring labels. Because robust training methods have higher sample complexity, there has been significant recent attention on how to effectively utilize unlabeled data to train robust models.

Results show that unlabeled data can become a competitive alternative to labeled data for training adversarially robust models. These results are corroborated by [zhai2019adversarially], who also find that learning with more unlabeled data can result in better adversarially robust generalization. Both works [stanforth2019labels, carmon2019unlabeled] use unlabeled data to form an unsupervised auxiliary loss (e.g., a label-independent robust regularizer or a pseudo-label loss).

To the best of our knowledge, [hendrycks2019using] is the only work so far that utilizes unlabeled data via self-supervision to train a robust model given a target supervised classification task. It improves AT by leveraging the rotation prediction self-supervision as an auxiliary task, which is co-optimized with the conventional AT loss. Our self-supervised pretraining and fine-tuning differ from all above settings.

3 Our Proposal

In this section, we introduce self-supervised pretraining to learn feature representations from unlabeled data, followed by fine-tuning on a target supervised task. We then generalize adversarial training (AT) to different self-supervised pretraining and fine-tuning schemes.

3.1 Setup

Self-Supervised Pretraining

Let $\mathcal{T}_p$ denote a pretraining task and $\mathcal{D}_p$ denote the corresponding (unlabeled) pretraining dataset. The goal of self-supervised pretraining is to learn a model from $\mathcal{D}_p$ itself without explicit manual supervision. This is often cast as an optimization problem, in which a proposed pretraining loss $\ell_p(\theta_p, \theta_{pc}; \mathcal{D}_p)$ is minimized to determine a model $\phi_p$ parameterized by $\theta_p$. Here $\theta_{pc}$ signifies additional parameters customized for a given $\mathcal{T}_p$. In the rest of the paper, we focus on the following self-supervised pretraining tasks (details on each pretraining task are provided in the supplement):

Selfie [trinh2019selfie]: By masking out select patches in an image, Selfie constructs a classification problem to determine the correct patch to be filled in the masked location.

Rotation [gidaris2018unsupervised]: By rotating an image by a random multiple of 90 degrees, Rotation constructs a classification problem to determine the degree of rotation applied to an input image.

Jigsaw [noroozi2016unsupervised, carlucci2019domain]: By dividing an image into different patches, Jigsaw trains a classifier to predict the correct permutation of these patches.
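As a minimal sketch of how the Rotation and Jigsaw pretext labels can be produced without manual annotation, the snippet below works on images stored as nested Python lists. The helper names (`rotate90`, `make_rotation_example`, `make_jigsaw_example`), the 2x2 patch grid, and the fixed permutation list are illustrative assumptions, not the authors' implementation.

```python
import random

def rotate90(img, k):
    """Rotate a 2-D grid (list of rows) counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        img = [list(row) for row in zip(*img)][::-1]
    return [list(row) for row in img]

def make_rotation_example(img, rng):
    """Return (rotated image, label), where label k in {0, 1, 2, 3}
    encodes a rotation by k * 90 degrees."""
    k = rng.randrange(4)
    return rotate90(img, k), k

def make_jigsaw_example(img, permutations, rng, grid=2):
    """Split img into grid x grid patches, reorder them with a permutation
    drawn from a fixed list, and return (patches, permutation index)."""
    h, w = len(img), len(img[0])
    ph, pw = h // grid, w // grid
    patches = [
        [row[c * pw:(c + 1) * pw] for row in img[r * ph:(r + 1) * ph]]
        for r in range(grid) for c in range(grid)
    ]
    label = rng.randrange(len(permutations))
    shuffled = [patches[i] for i in permutations[label]]
    return shuffled, label
```

In both cases the classification target (the rotation index or the permutation index) is derived from the transformation itself, which is what makes the task self-supervised.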

Supervised Fine-tuning

Let $r = \phi_p(x; \theta_p)$ denote the mapping (parameterized by $\theta_p$) from an input sample $x$ to its embedding space learnt from the self-supervised pretraining task $\mathcal{T}_p$. Given a target fine-tuning task $\mathcal{T}_f$ with the labeled dataset $\mathcal{D}_f$, the goal of fine-tuning is to determine a classifier, parameterized by $\theta_f$, which maps the representation $r$ to the label space. To learn the classifier, one can minimize a common supervised training loss $\ell_f(\theta_p, \theta_f; \mathcal{D}_f)$ with a fixed or re-trainable model $\phi_p$, corresponding to partial fine-tuning and full fine-tuning, respectively.

AT versus standard training (ST)

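AT is known as one of the most powerful methods to train a robust classifier against adversarial attacks [madry2017towards, athalye2018obfuscated]. Considering an $\epsilon$-tolerant $\ell_\infty$ attack $\delta$ subject to $\|\delta\|_\infty \le \epsilon$, an adversarial example of a benign input $x$ is given by $x + \delta$. With the aid of adversarial examples, AT solves a min-max optimization problem of the generic form

$$\min_{\theta} \; \mathbb{E}_{x \in \mathcal{D}} \left[ \max_{\|\delta\|_\infty \le \epsilon} \ell(\theta, x + \delta) \right] \qquad (1)$$

where $\theta$ denotes the parameters of an ML/DL model, $\mathcal{D}$ is a given dataset, and $\ell$ signifies a classification loss evaluated at the model $\theta$ and the perturbed input $x + \delta$. By fixing $\epsilon = 0$, problem (1) then simplifies to the ST framework $\min_{\theta} \mathbb{E}_{x \in \mathcal{D}} [\ell(\theta, x)]$.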
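The inner maximization of the min-max problem is typically approximated with projected gradient descent (PGD) [madry2017towards]. The following is a minimal sketch of that inner loop, assuming an $\ell_\infty$ perturbation ball and, purely for illustration, a linear logistic model in place of a deep network. The function names and the toy model are assumptions, not the authors' code.

```python
import math

def logistic_loss(w, b, x, y):
    """Binary cross-entropy of a linear model; the label y is 0 or 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def input_gradient(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]

def pgd_attack(w, b, x, y, eps, alpha, steps):
    """Approximate max over ||delta||_inf <= eps of loss(x + delta)."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        grad = input_gradient(w, b, x_adv, y)
        # Signed-gradient ascent step, then projection onto the eps-ball.
        delta = [
            max(-eps, min(eps, di + alpha * math.copysign(1.0, gi)))
            if gi != 0 else di
            for di, gi in zip(delta, grad)
        ]
    return delta
```

Adversarial training then minimizes the loss at `x + delta` instead of at `x`. With `eps = 0` the perturbation stays at zero and the procedure reduces to standard training.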

3.2 AT meets self-supervised pretraining and fine-tuning

AT given by (1) can be specified for either self-supervised pretraining or supervised fine-tuning. For example, AT for self-supervised pretraining can be cast as problem (1) by letting $\theta := [\theta_p, \theta_{pc}]$ and $\mathcal{D} := \mathcal{D}_p$, and specifying $\ell$ as $\ell_p$. In Table 1, we summarize all the possible scenarios when AT meets self-supervised pretraining.

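| Scenario | Pretraining method | Loss $\ell$ in (1) | Variables $\theta$ in (1) | Dataset $\mathcal{D}$ in (1) |
|---|---|---|---|---|
| $\mathcal{P}_1$ | None | NA | NA | NA |
| $\mathcal{P}_2$ | ST | $\ell_p$ | $[\theta_p, \theta_{pc}]$ | $\mathcal{D}_p$ |
| $\mathcal{P}_3$ | AT | $\ell_p$ | $[\theta_p, \theta_{pc}]$ | $\mathcal{D}_p$ |

  • None: the model form of $\phi_p$ is known in advance.

  • NA: Not applicable.

  • ST: A special case of (1) with $\epsilon = 0$.

Table 1: Summary of self-supervised pretraining scenarios.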

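| Scenario | Fine-tuning type | Fine-tuning method | Loss $\ell$ in (1) | Variables $\theta$ in (1) | Dataset $\mathcal{D}$ in (1) |
|---|---|---|---|---|---|
| $\mathcal{F}_1$ | Partial (with fixed $\theta_p$) | ST | $\ell_f$ | $\theta_f$ | $\mathcal{D}_f$ |
| $\mathcal{F}_2$ | Partial (with fixed $\theta_p$) | AT | $\ell_f$ | $\theta_f$ | $\mathcal{D}_f$ |
| $\mathcal{F}_3$ | Full | ST | $\ell_f$ | $[\theta_p, \theta_f]$ | $\mathcal{D}_f$ |
| $\mathcal{F}_4$ | Full | AT | $\ell_f$ | $[\theta_p, \theta_f]$ | $\mathcal{D}_f$ |

  • Fixed $\theta_p$ signifies the model learnt in a given pretraining scenario.

  • Full fine-tuning retrains $\theta_p$.

Table 2: Summary of fine-tuning scenarios.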

Given a pretrained model $\phi_p$, adversarial fine-tuning could have two forms: a) AT for partial fine-tuning and b) AT for full fine-tuning. Here the former case a) solves a supervised fine-tuning task under the fixed model $\phi_p$, and the latter case b) solves a supervised fine-tuning task by retraining $\phi_p$. In Table 2, we summarize different scenarios when AT meets supervised fine-tuning.
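The difference between partial and full fine-tuning comes down to which parameter groups receive gradient updates. A minimal sketch of a single SGD step follows, with parameters held in plain lists. The function name `fine_tune_step` and the flat-list representation are hypothetical simplifications. In the adversarial variants, the gradients would be computed on perturbed inputs rather than clean ones.

```python
def fine_tune_step(theta_p, theta_f, grad_p, grad_f, lr, full):
    """One SGD step of fine-tuning.

    theta_p: pretrained feature-extractor parameters
    theta_f: classifier parameters
    full:    True for full fine-tuning, False for partial fine-tuning
    """
    # The classifier is always trained.
    new_f = [t - lr * g for t, g in zip(theta_f, grad_f)]
    if full:
        # Full fine-tuning also retrains the pretrained parameters.
        new_p = [t - lr * g for t, g in zip(theta_p, grad_p)]
    else:
        # Partial fine-tuning keeps the pretrained parameters fixed.
        new_p = list(theta_p)
    return new_p, new_f
```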

It is worth noting that our study on the integration of AT with a pretraining+fine-tuning scheme provided by Tables 1-2 is different from [hendrycks2019using], which conducted one-shot AT over a supervised classification task integrated with a rotation self-supervision task.

In order to explore the network robustness against different configurations $\{\mathcal{P}_i, \mathcal{F}_j\}$ of pretraining and fine-tuning, we ask: is AT for robust pretraining sufficient to boost the adversarial robustness of fine-tuning? What is the influence of fine-tuning strategies (partial or full) on the adversarial robustness of image classification? How does the type of self-supervised pretraining task affect the classifier’s robustness?

We provide detailed answers to the above questions in Sec. 4.3, Sec. 4.4 and Sec. 4.5. In a nutshell, we find that robust representation learnt from adversarial pretraining is transferable to down-stream fine-tuning tasks to some extent. However, a more significant robustness improvement is obtained by adversarial fine-tuning. Moreover, AT for full fine-tuning outperforms that for partial fine-tuning in terms of both robust accuracy and standard accuracy (except the Jigsaw-specified self-supervision task). Furthermore, different self-supervised tasks demonstrate diverse adversarial vulnerability. As will be evident later, such diversified tasks provide complementary benefits to model robustness and therefore can be combined.

Figure 2: The overall framework of ensemble adversarial pretraining. The pretrained weights are the first three blocks of ResNet-50v2 [he2016deep]. Green arrows, blue arrows and red arrows represent the feed-forward paths of Selfie, Jigsaw and Rotation, respectively.

3.3 AT by leveraging ensemble of multiple self-supervised learning tasks

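In what follows, we generalize AT to learn a robust pretrained model by leveraging the diversified pretraining tasks. More specifically, consider $M$ self-supervised pretraining tasks $\{\mathcal{T}_p^{(i)}\}_{i=1}^{M}$, each of which obeys the formulation in Section 3.1. We generalize problem (1) to

$$\min_{\theta_s, \{\theta_i\}} \; \mathbb{E}_{x \in \mathcal{D}_p} \left[ \ell_{\mathrm{adv}}(\theta_s, \{\theta_i\}; x) \right] \qquad (2)$$

where $\ell_{\mathrm{adv}}$ denotes the adversarial loss given by

$$\ell_{\mathrm{adv}}(\theta_s, \{\theta_i\}; x) := \max_{\|\delta\|_\infty \le \epsilon} \; \frac{1}{M} \sum_{i=1}^{M} \ell_p^{(i)}(\theta_s, \theta_i; x + \delta) + \lambda \, g(\theta_s, \{\theta_i\}; x + \delta) \qquad (3)$$

In (2), for ease of notation, we replace $\{\theta_i\}_{i=1}^{M}$ with $\{\theta_i\}$, $\theta_s$ denotes the common network shared among different self-supervised tasks, and $\theta_i$ denotes a sub-network customized for the $i$th task. We refer readers to Figure 2 for an overview of our proposed model architecture. In (3), $\ell_p^{(i)}$ denotes the $i$th pretraining loss, $g$ denotes a diversity-promoting regularizer, and $\lambda \ge 0$ is a regularization parameter. Note that $\lambda = 0$ gives the averaging ensemble strategy. In our case, we perform grid search to tune $\lambda$ around the value chosen in [pang2019improving]. Details are given in the supplement.

Spurred by [pang2019improving, wang2019unified], we quantify the diversity-promoting regularizer through the orthogonality of input gradients of different self-supervised pretraining losses,

$$g(\theta_s, \{\theta_i\}; x + \delta) := \log \det (G^\top G) \qquad (4)$$

where each column of $G$ corresponds to a normalized input gradient $\nabla_\delta \ell_p^{(i)}(\theta_s, \theta_i; x + \delta)$, and $g$ reaches the maximum value $0$ as input gradients become orthogonal, otherwise it is negative. The rationale behind the diversity-promoting adversarial loss (3) is that we aim to design a robust model by defending attacks from diversified perturbation directions.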
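To make the behaviour of the diversity-promoting regularizer concrete, the sketch below assumes it takes the log-determinant form $\log\det(G^\top G)$, with the columns of $G$ being unit-normalized input gradients. This form is an assumption chosen to match the description above (maximum at orthogonality, negative otherwise), and the helper names are illustrative. Only the standard library is used, so the determinant is computed by plain Gaussian elimination.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_det(mat):
    """Log-determinant of a small symmetric positive-definite matrix,
    computed by Gaussian elimination without pivoting."""
    a = [row[:] for row in mat]
    n = len(a)
    total = 0.0
    for i in range(n):
        pivot = a[i][i]
        if pivot <= 0.0:
            return float("-inf")  # singular Gram matrix: dependent gradients
        total += math.log(pivot)
        for r in range(i + 1, n):
            factor = a[r][i] / pivot
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return total

def diversity_regularizer(input_grads):
    """g = log det(G^T G), where the columns of G are the normalized
    input gradients of the individual pretraining losses."""
    cols = [normalize(g) for g in input_grads]
    gram = [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]
    return log_det(gram)
```

Because every column has unit norm, the Gram matrix has ones on its diagonal, so its determinant is at most 1 (Hadamard's inequality) and equals 1 exactly when the gradients are mutually orthogonal. That is why the value is 0 for orthogonal gradients and negative otherwise.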

4 Experiments and Results

In this section, we design and conduct extensive experiments to examine the network robustness against different configurations $\{\mathcal{P}_i, \mathcal{F}_j\}$ for image classification. First, we show adversarial self-supervised pretraining (namely, $\mathcal{P}_3$ in Table 1) improves the performance of downstream tasks. We also discuss the influence of different fine-tuning strategies on the adversarial robustness. Second, we show the diverse impacts of different self-supervised tasks on their resulting pretrained models. Third, we ensemble those self-supervised tasks to perform adversarial pretraining. At the fine-tuning phase, we also ensemble the three best models with the configuration ($\mathcal{P}_3$, $\mathcal{F}_4$) and show its performance superiority. Last, we report extensive ablation studies to reveal the influence of the size of the datasets and the resolution of images in $\mathcal{D}_p$, as well as other defense options beyond AT.

4.1 Datasets

Dataset Details We consider four different datasets in our experiments: CIFAR-10, CIFAR-10-C [hendrycks2019robustness], CIFAR-100 and R-ImageNet-224 (a specifically constructed “restricted” version of ImageNet, with resolution $224 \times 224$). For the last one, we intend to demonstrate our approach on high-resolution data despite the computational challenge. We follow [santurkar2019computer] to choose 10 super classes which contain a total of 190 ImageNet classes. The detailed class distribution of each super class can be found in our supplement.

For the ablation study of different pretraining dataset sizes, we sample more training images from the 80 Million Tiny Images dataset [torralba200880], from which CIFAR-10 was selected. Using the same 10 super classes, we form CIFAR-30K (i.e., 30,000 images), CIFAR-50K, and CIFAR-150K for training, and keep another 10,000 images for hold-out testing.

Dataset Usage In Sec. 4.3, Sec. 4.4 and Sec. 4.5, for all results, we use CIFAR-10 training set for both pretraining and fine-tuning. We evaluate our models on the CIFAR-10 testing set and CIFAR-10-C. In Sec. 4.6, we use CIFAR-10, CIFAR-30K, CIFAR-50K, CIFAR-150K and R-ImageNet-224 for pretraining, and CIFAR-10 training set for fine-tuning, while evaluating on CIFAR-10 testing set. We also validate our approaches on CIFAR-100 in the supplement. In all of our experiments, we randomly split the original training set into a training set and a validation set (the ratio is 9:1).
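The 9:1 random split of the original training set can be sketched as follows. The function name and the fixed seed are illustrative assumptions, since the actual sampling procedure is not specified beyond the ratio.

```python
import random

def split_train_val(indices, val_fraction=0.1, seed=0):
    """Randomly split sample indices into (train, validation) lists."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)  # local RNG keeps the split reproducible
    n_val = int(round(len(idx) * val_fraction))
    return idx[n_val:], idx[:n_val]
```

For a 50,000-image training set this yields 45,000 training and 5,000 validation indices.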

4.2 Implementation Details

Model Architecture: For pretraining with the Selfie task, we identically follow the setting in [trinh2019selfie]. For Rotation and Jigsaw pretraining tasks, we use ResNet-50v2 [he2016identity]. For the fine-tuning, we use ResNet-50v2 for all. Each fine-tuning network will inherit the corresponding robust pretrained weights to initialize the first three blocks of ResNet-50v2, while leaving the remaining blocks randomly initialized.

Training & Evaluation Details: All pretraining and fine-tuning tasks are trained using SGD with 0.9 momentum. We use batch sizes of 256 for CIFAR-10, ImageNet-32 and 64 for R-ImageNet-224. All pretraining tasks adopt cosine learning rates. The maximum and minimum learning rates are 0.1 and for Rotation and Jigsaw pretraining; 0.025 and for Selfie pretraining; and 0.001 and for ensemble pretraining. All fine-tuning phases follow a multi-step learning rate schedule, starting from 0.1 and decayed by 10 times at epochs 30 and 50 for a 100 epochs training.
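The two learning-rate schedules described above can be sketched as plain functions of the epoch index. The multi-step defaults (start at 0.1, divide by 10 at epochs 30 and 50) follow the text. The `lr_min` argument of the cosine schedule is left as a caller-supplied parameter, and any value passed to it is a placeholder rather than a setting taken from the paper.

```python
import math

def cosine_lr(epoch, total_epochs, lr_max, lr_min):
    """Cosine annealing from lr_max (first epoch) to lr_min (last epoch)."""
    progress = epoch / max(1, total_epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def multistep_lr(epoch, base_lr=0.1, milestones=(30, 50), gamma=0.1):
    """Multiply the learning rate by gamma at every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed
```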

We use 10-step and 20-step PGD attacks [madry2017towards] for adversarial training and evaluation, respectively. Unless otherwise specified, we follow the attack settings of [hendrycks2019using]. For all adversarial evaluations, we use the full testing datasets (10,000 images for CIFAR-10) to generate adversarial images. We also consider unforeseen attacks [kang2019testing, hendrycks2019robustness].

Evaluation Metrics & Model Picking Criteria: We follow [zhang2019theoretically] to use: i) Standard Testing Accuracy (TA): the classification accuracy on the clean test dataset; ii) Robust Testing Accuracy (RA): the classification accuracy on the attacked test dataset. In our experiments, we use TA to pick models for a better trade-off with RA. Results of models picked using the RA criterion are included in the supplement.
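Both metrics are ordinary classification accuracies that differ only in the inputs fed to the model. A minimal sketch, assuming the predictions on clean and on attacked test images have already been collected into lists (the function names are illustrative):

```python
def accuracy(predictions, labels):
    """Classification accuracy in percent."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

def standard_and_robust_accuracy(clean_preds, adv_preds, labels):
    """TA uses predictions on clean inputs, RA on attacked inputs."""
    ta = accuracy(clean_preds, labels)
    ra = accuracy(adv_preds, labels)
    return ta, ra
```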

4.3 Adversarial self-supervised pretraining & fine-tuning helps classification robustness

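| Scenario | Selfie TA (%) | Selfie RA (%) | Selfie Epochs | Rotation TA (%) | Rotation RA (%) | Rotation Epochs | Jigsaw TA (%) | Jigsaw RA (%) | Jigsaw Epochs |
|---|---|---|---|---|---|---|---|---|---|
| ($\mathcal{P}_1$, $\mathcal{F}_3$) | 94.24 | 0.00 | 92 | 94.24 | 0.00 | 92 | 94.24 | 0.00 | 92 |
| ($\mathcal{P}_1$, $\mathcal{F}_4$) | 84.72 | 47.22 | 99 | 84.72 | 47.22 | 99 | 84.72 | 47.22 | 99 |
| ($\mathcal{P}_2$, $\mathcal{F}_3$) | 95.09 | 0.00 | 97 | 95.45 | 0.00 | 92 | 93.93 | 0.00 | 89 |
| ($\mathcal{P}_2$, $\mathcal{F}_4$) | 85.56 | 50.42 | 60 | 86.66 | 50.95 | 45 | 85.18 | 50.94 | 46 |
| ($\mathcal{P}_3$, $\mathcal{F}_1$) | 78.93 | 6.30 | 82 | 86.83 | 18.22 | 99 | 80.47 | 2.68 | 87 |
| ($\mathcal{P}_3$, $\mathcal{F}_2$) | 74.30 | 37.65 | 64 | 82.32 | 45.10 | 47 | 72.76 | 32.59 | 51 |
| ($\mathcal{P}_3$, $\mathcal{F}_3$) | 94.69 | 0.00 | 86 | 94.79 | 0.00 | 92 | 93.06 | 0.00 | 93 |
| ($\mathcal{P}_3$, $\mathcal{F}_4$) | 86.02 | 51.05 | 50 | 85.66 | 50.40 | 46 | 84.50 | 49.61 | 48 |

Table 3: Evaluation results of eight different ($\mathcal{P}_i$, $\mathcal{F}_j$) scenarios. Table 1 and Table 2 provide detailed definitions for $\mathcal{P}_1$ (without pretraining), $\mathcal{P}_2$ (standard self-supervised pretraining), $\mathcal{P}_3$ (adversarial self-supervised pretraining), $\mathcal{F}_1$ (partial standard fine-tuning), $\mathcal{F}_2$ (partial adversarial fine-tuning), $\mathcal{F}_3$ (full standard fine-tuning), and $\mathcal{F}_4$ (full adversarial fine-tuning).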

We systematically study all possible configurations of pretraining and fine-tuning considered in Table 1 and Table 2, where the expression ($\mathcal{P}_i$, $\mathcal{F}_j$) denotes a specified pretraining+fine-tuning scheme. The baseline schemes are given by the end-to-end standard training (ST), namely ($\mathcal{P}_1$, $\mathcal{F}_3$) (no pretraining, full standard fine-tuning), and the end-to-end adversarial training (AT), namely ($\mathcal{P}_1$, $\mathcal{F}_4$) (no pretraining, full adversarial fine-tuning). Table 3 shows TA, RA, and the iteration complexity of fine-tuning (in terms of number of epochs) under different pretraining+fine-tuning strategies involving different self-supervised pretraining tasks: Selfie, Rotation and Jigsaw. In what follows, we analyze the results of Table 3 and provide additional insights.

We begin by focusing on the scenario of integrating the standard pretraining strategy $\mathcal{P}_2$ with the fine-tuning schemes $\mathcal{F}_3$ and $\mathcal{F}_4$ used in baseline methods. Several observations can be made from the comparison ($\mathcal{P}_2$, $\mathcal{F}_3$) vs. ($\mathcal{P}_1$, $\mathcal{F}_3$) and ($\mathcal{P}_2$, $\mathcal{F}_4$) vs. ($\mathcal{P}_1$, $\mathcal{F}_4$) in Table 3. 1) The use of self-supervised pretraining consistently improves TA and/or RA even if only standard pretraining is conducted; 2) The use of adversarial fine-tuning $\mathcal{F}_4$ (against standard fine-tuning $\mathcal{F}_3$) is crucial, leading to significantly improved RA under both $\mathcal{P}_1$ and $\mathcal{P}_2$; 3) Compared with ($\mathcal{P}_1$, $\mathcal{F}_4$), the use of self-supervised pretraining offers better eventual model robustness (RA improves by roughly 3.2 to 3.7 percentage points) and faster fine-tuning (saving roughly 40% to 55% of the epochs).

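Next, we investigate how the adversarial pretraining (namely, $\mathcal{P}_3$) affects the eventual model robustness. It is shown by ($\mathcal{P}_3$, $\mathcal{F}_1$) and ($\mathcal{P}_3$, $\mathcal{F}_2$) in Table 3 that the robust feature representation learnt from $\mathcal{P}_3$ benefits adversarial robustness even in the case of partial fine-tuning, but the use of adversarial partial fine-tuning, namely ($\mathcal{P}_3$, $\mathcal{F}_2$), yields a further RA improvement of roughly 27 to 31 percentage points. We also observe from the case of ($\mathcal{P}_3$, $\mathcal{F}_3$) that the standard full fine-tuning harms the robust feature representation learnt from $\mathcal{P}_3$, leading to 0% RA. Furthermore, when the adversarial full fine-tuning is adopted, namely ($\mathcal{P}_3$, $\mathcal{F}_4$), the most significant robustness improvement is acquired. This observation is consistent with ($\mathcal{P}_2$, $\mathcal{F}_4$) against ($\mathcal{P}_2$, $\mathcal{F}_3$).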

Third, at first glance, adversarial full fine-tuning (namely, $\mathcal{F}_4$) is the most important step to improve the final model robustness. However, adversarial pretraining is also key, particularly for reducing the computation cost of fine-tuning; for example, at most 50 epochs in ($\mathcal{P}_3$, $\mathcal{F}_4$) vs. 99 epochs in the end-to-end AT ($\mathcal{P}_1$, $\mathcal{F}_4$).

Last but not least, we note that the aforementioned results are consistent across different self-supervised prediction tasks. However, Selfie and Rotation are more favored than Jigsaw to improve the final model robustness. For example, in the cases of adversarial pretraining followed by standard and adversarial partial fine-tuning, namely ($\mathcal{P}_3$, $\mathcal{F}_1$) and ($\mathcal{P}_3$, $\mathcal{F}_2$), Selfie and Rotation yield at least a 3.6% improvement in RA over Jigsaw. When the adversarial full fine-tuning is used, namely ($\mathcal{P}_3$, $\mathcal{F}_4$), Selfie and Rotation outperform Jigsaw in both TA and RA, where Selfie yields the largest improvement, around 1.5% in both TA and RA.

4.4 Comparison with one-shot AT regularized by self-supervised prediction task

In what follows, we compare our proposed approach of adversarial pretraining followed by adversarial fine-tuning, namely ($\mathcal{P}_3$, $\mathcal{F}_4$) in Table 3, with the one-shot AT that optimizes a classification task regularized by the self-supervised rotation prediction task [hendrycks2019using]. In addition to evaluating this comparison in TA and RA (evaluated under the PGD attack [madry2017towards]), we also measure the robustness of the eventual classifier against unforeseen attacks that are not used in AT [kang2019testing]. More results can be found in the supplement.

Figure 3 presents the multi-dimensional performance comparison of our approach vs. the baseline method in [hendrycks2019using]. As we can see, our approach improves TA at the cost of a degradation in RA. However, our approach yields consistent robustness improvements in defending against all unforeseen attacks. Moreover, our approach separates pretraining and fine-tuning such that the target image classifier can be learnt from a warm start, namely, the adversarially pretrained representation network. This mitigates the computation drawback of one-shot AT in [hendrycks2019using], recalling that our advantage in saving computation cost was shown in Table 3. Next, Figure 4 presents the performance of our approach under different types of self-supervised prediction task. As we can see, Selfie provides consistently better performance than the others, while Jigsaw performs the worst.

Figure 3: The summary of the accuracy over unforeseen adversarial attackers. Our models are obtained after adversarial fine-tuning with adversarial Rotation pretraining. Baselines are co-optimized models with the Rotation auxiliary task [hendrycks2019using].
Figure 4: The summary of the accuracy over unforeseen adversarial attackers. Competition among adversarial fine-tuned models with Selfie, Rotation and Jigsaw adversarial pretraining.

4.5 Diversity vs. Task Ensemble

In what follows, we show that different self-supervised prediction tasks demonstrate a diverse adversarial vulnerability even if their corresponding RAs remain similar. We evaluate such a diversity through the transferability of adversarial examples generated from robust classifiers fine-tuned from the adversarially pretrained models using different self-supervised prediction tasks. We then demonstrate the performance of our proposed adversarial pretraining method (2) by leveraging an ensemble of Selfie, Rotation, and Jigsaw.

In Table 4, we present the transferability of PGD attacks generated from the final model trained using adversarial pretraining followed by adversarial full fine-tuning, namely ($\mathcal{P}_3$, $\mathcal{F}_4$), where for ease of presentation, we let Model($t$) denote the classifier learnt using the self-supervised pretraining task $t \in$ {Selfie, Rotation, Jigsaw}. Given the PGD attacks from Model($t$), we evaluate their transferability, in terms of attack success rate (ASR, given by the ratio of successful adversarial examples over the total number of test images), against Model($t'$). If $t' = t$, then ASR reduces to $1 - \mathrm{RA}$. If $t' \neq t$, then ASR reflects the attack transferability from Model($t$) to Model($t'$). As we can see, the diagonal entries of Table 4 correspond to the largest ASR in each column. This is not surprising, since transferring to another model makes the attack weaker. One interesting observation is that ASR suffers a larger drop when transferring attacks from Model(Jigsaw) to other target models. This implies that Model(Selfie) and Model(Rotation) yield better robustness, consistent with our previous results such as Figure 4.

At first glance, the values of ASR of transfer attacks from Model($t$) to Model($t'$) ($t' \neq t$) remain similar, e.g., the first column of Table 4 where $t$ = Selfie and $t'$ = Rotation (38.92% ASR) or $t'$ = Jigsaw (38.96% ASR). However, Figure 5 shows that the seemingly similar transferability is built on rather diverse adversarial examples that succeed in attacking Model(Rotation) and Model(Jigsaw), respectively. As we can see, a number of transfer examples do not overlap, i.e., they succeed against only one of Model(Rotation) and Model(Jigsaw). This diverse distribution of transferred adversarial examples against models using different self-supervised pretraining tasks motivates us to further improve the robustness by leveraging an ensemble of diversified pretraining tasks.
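The transfer analysis above can be sketched with plain set operations. Here an attack counts as successful on an image when the target model misclassifies the attacked input, which is the reading under which ASR equals 100% minus RA for a white-box attack. The function names are illustrative and the inputs are assumed to be lists of predicted and true labels.

```python
def successful_attacks(adv_preds, labels):
    """Indices of test images that the target model misclassifies under attack."""
    return {i for i, (p, y) in enumerate(zip(adv_preds, labels)) if p != y}

def attack_success_rate(adv_preds, labels):
    """ASR in percent: successful adversarial examples over all test images."""
    return 100.0 * len(successful_attacks(adv_preds, labels)) / len(labels)

def transfer_overlap(success_a, success_b):
    """Split two success sets into (only A, both, only B), as in a Venn diagram."""
    return success_a - success_b, success_a & success_b, success_b - success_a
```

Two target models can have the same ASR while the "only A" and "only B" sets are both non-empty, which is the kind of diversity that Figure 5 visualizes.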

We next demonstrate the effectiveness of our proposed adversarial pretraining via diversity-promoted ensemble (AP + DPE) given in (2), whose framework is shown in Figure 2. Here we consider two baseline methods: single-task adversarial pretraining, and adversarial pretraining via standard ensemble (AP + SE), corresponding to $\lambda = 0$ in (2). As we can see in Table 5, AP + DPE yields at least a 1.17% improvement on RA at the cost of at most a 3.02% degradation on TA, compared with the best single fine-tuned model. In addition to the ensemble at the pretraining stage, we consider a simple but most computationally intensive ensemble strategy: averaging the predictions of the three final robust models learnt using adversarial pretraining followed by adversarial fine-tuning over Selfie, Rotation, and Jigsaw. As we can see in Table 6, the best combination, the ensemble of three fine-tuned models, yields at least a 3.59% improvement on RA while maintaining a slightly higher TA. More results of other ensemble configurations can be found in the supplement.
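The fine-tuning-stage ensemble can be sketched as below. Averaging class probabilities (rather than logits or hard votes) is an assumption made for illustration, and `ensemble_predict` is a hypothetical helper name.

```python
def ensemble_predict(prob_lists):
    """Average per-class probabilities over models, then take the arg-max.

    prob_lists[m][s][c] is the probability that model m assigns to
    class c for sample s.
    """
    n_models = len(prob_lists)
    n_samples = len(prob_lists[0])
    preds = []
    for s in range(n_samples):
        n_classes = len(prob_lists[0][s])
        avg = [sum(m[s][c] for m in prob_lists) / n_models
               for c in range(n_classes)]
        preds.append(max(range(n_classes), key=avg.__getitem__))
    return preds
```

Every member model must be evaluated on each input, which is why this strategy costs more at inference time than a single network fine-tuned from an ensemble-pretrained model.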

| Evaluation on | PGD attacks from Model(Selfie) | PGD attacks from Model(Rotation) | PGD attacks from Model(Jigsaw) |
|---|---|---|---|
| Model(Selfie) | 48.95% | 37.75% | 36.65% |
| Model(Rotation) | 38.92% | 49.60% | 38.12% |
| Model(Jigsaw) | 38.96% | 39.56% | 51.17% |

Table 4: The vulnerability diversity (in ASR) among fine-tuned models with Selfie, Rotation and Jigsaw self-supervised adversarial pretraining. All models use full adversarial fine-tuning. The highest ASR in each column of PGD attacks from different fine-tuned models lies on the diagonal. Ensemble model results under different PGD attacks can be found in our supplement.
Figure 5: The Venn plot between sets of successful transfer adversarial examples from Model(Selfie) to Model(Rotation) and from Model(Selfie) to Model(Jigsaw). The overlapping brown area represents the successful transfer attacks on both Model(Rotation) and Model(Jigsaw) from Model(Selfie). The pink area represents the successful transfer attacks only on Model(Jigsaw) from Model(Selfie). The green area represents the successful transfer attacks only on Model(Rotation) from Model(Selfie).
| Models | TA (%) | RA (%) | Epochs |
|---|---|---|---|
| Selfie Pretraining | 86.02 | 51.05 | 50 |
| Rotation Pretraining | 85.66 | 50.40 | 46 |
| Jigsaw Pretraining | 83.74 | 48.83 | 48 |
| AP + SE | 84.44 | 49.53 | 47 |
| AP + DPE | 83.00 | 52.22 | 56 |

Table 5: Comparison between fine-tuned models from single-task pretraining and fine-tuned models from task-ensemble pretraining. AP + SE represents adversarial pretraining via standard ensemble. AP + DPE represents adversarial pretraining via diversity-promoted ensemble.
| Fine-tuned Models ($\mathcal{P}_3$, $\mathcal{F}_4$) | TA (%) | RA (%) |
|---|---|---|
| Jigsaw + Rotation | 85.36 | 53.08 |
| Jigsaw + Selfie | 85.64 | 53.32 |
| Rotation + Selfie | 86.51 | 53.83 |
| Jigsaw + Rotation + Selfie | 86.04 | 54.64 |

Table 6: Ensemble results of fine-tuned models with different adversarial pretraining tasks.

4.6 Ablation Study and Analysis

For a fair comparison, we fine-tune all models on the same CIFAR-10 dataset. In each ablation, we show results under scenarios ($\mathcal{P}_3$, $\mathcal{F}_2$) and ($\mathcal{P}_3$, $\mathcal{F}_4$), where $\mathcal{P}_3$ represents adversarial pretraining, $\mathcal{F}_2$ represents partial adversarial fine-tuning and $\mathcal{F}_4$ represents full adversarial fine-tuning. More ablation results can be found in the supplement.

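| Pretraining dataset | Scenario | TA (%) | RA (%) | Epochs |
|---|---|---|---|---|
| CIFAR-30K | ($\mathcal{P}_3$, $\mathcal{F}_2$) | 65.65 | 30.00 | 70 |
| CIFAR-30K | ($\mathcal{P}_3$, $\mathcal{F}_4$) | 85.29 | 49.64 | 42 |
| CIFAR-50K | ($\mathcal{P}_3$, $\mathcal{F}_2$) | 66.87 | 30.42 | 87 |
| CIFAR-50K | ($\mathcal{P}_3$, $\mathcal{F}_4$) | 85.26 | 49.66 | 61 |
| CIFAR-150K | ($\mathcal{P}_3$, $\mathcal{F}_2$) | 67.73 | 30.24 | 95 |
| CIFAR-150K | ($\mathcal{P}_3$, $\mathcal{F}_4$) | 85.18 | 50.61 | 55 |

Table 7: Ablation results of the size of pretraining datasets. All pretraining datasets have $32 \times 32$ resolution and 10 classes.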
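| Scenario | Selfie TA (%) | Selfie RA (%) | Selfie Epochs | Rotation TA (%) | Rotation RA (%) | Rotation Epochs | Jigsaw TA (%) | Jigsaw RA (%) | Jigsaw Epochs |
|---|---|---|---|---|---|---|---|---|---|
| ($\mathcal{P}_3$, $\mathcal{F}_2$) | 71.9 | 30.57 | 61 | 74.7 | 34.23 | 78 | 74.66 | 33.84 | 68 |
| ($\mathcal{P}_3$, $\mathcal{F}_4$) | 85.14 | 50.23 | 48 | 85.62 | 51.25 | 46 | 85.18 | 50.94 | 46 |

Table 8: Ablation results of defense approaches. Instead of adversarial training, we perform random smoothing [cohen2019certified] for pretraining.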

Ablation of the pretraining data size

As shown in Table 7, as the pretraining dataset grows larger, the robust accuracy under full adversarial fine-tuning grows steadily. Under the ($\mathcal{P}_3$, $\mathcal{F}_4$) scenario, when the pretraining data size increases from 30K to 150K, we observe a 0.97% gain on robust accuracy with nearly the same standard accuracy. That aligns with the existing theory [schmidt2018adversarially]. Since self-supervised pretraining requires no labels, we could in the future grow the unlabeled data size almost for free to continuously boost the pretraining performance.

Ablation of defense approaches in pretraining

In Table 8, we use random smoothing [cohen2019certified] in place of AT to robustify pretraining, while all other protocols remain unchanged. We obtain results consistent with those of adversarial pretraining: robust pretraining speeds up adversarial fine-tuning and helps final model robustness, while the full adversarial fine-tuning contributes the most to the robustness boost.
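For readers unfamiliar with random smoothing, its two basic ingredients are Gaussian noise augmentation during training and a majority vote over noisy copies at prediction time. The sketch below shows both on plain lists. The helper names, the noise level, and the toy classifier are assumptions, and how exactly smoothing is wired into the pretext tasks is not specified here.

```python
import random
from collections import Counter

def gaussian_augment(x, sigma, rng):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to every input coordinate."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def smoothed_predict(classifier, x, sigma, n_samples, rng):
    """Majority vote of the base classifier over noisy copies of x."""
    votes = Counter(
        classifier(gaussian_augment(x, sigma, rng)) for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]
```

Training on `gaussian_augment(x, sigma, rng)` in place of `x` takes the role that training on adversarial examples plays in AT.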

5 Conclusions

In this paper, we combine adversarial training with self-supervision to obtain robust pretrained models that can be readily applied to downstream tasks through fine-tuning. We find that adversarial pretraining can not only boost final model robustness but also speed up the subsequent adversarial fine-tuning. We also find that adversarial fine-tuning contributes the most to the final robustness improvement. Further motivated by our observed diversity among different self-supervised tasks in pretraining, we propose an ensemble pretraining strategy that boosts robustness further. Our results show consistent gains over state-of-the-art AT in terms of both standard and robust accuracy, leading to new benchmark numbers on CIFAR-10. In the future, we are interested in exploring several promising directions revealed by our experiments and ablation studies, including incorporating more self-supervised tasks, extending the pretraining dataset size, and scaling up to high-resolution data.