Improving the Adversarial Robustness of Transfer Learning via Noisy Feature Distillation

by Ting-Wu Chin, et al.

Fine-tuning through knowledge transfer from a pre-trained model on a large-scale dataset is a widely used approach to effectively build models on small-scale datasets. However, recent literature has shown that such a fine-tuning approach is vulnerable to adversarial examples crafted solely based on the pre-trained model, which raises security concerns for many industrial applications. In contrast, models trained with random initialization are much more robust to such attacks, although these models often exhibit much lower accuracy. In this work, we propose noisy feature distillation, a new transfer learning method that trains a network from random initialization while achieving clean-data performance competitive with fine-tuning. In addition, the method is shown empirically to significantly improve robustness compared to fine-tuning, with a 15x reduction in attack success rate for ResNet-50, from 66% to 4.4%, averaged across the Stanford Dogs, Caltech-UCSD Birds, Stanford 40 Actions, MIT 67 Indoor Scenes, and Oxford 102 Flowers datasets. Code is available at




1 Introduction

Transfer learning is an important approach that enables training deep neural networks faster and with less data than training from scratch without any prior knowledge. There are various forms of transfer learning, depending on whether the target input and label domains are the same as the source ones. In this work, we are particularly interested in the setting where the source and target datasets have different input and label domains, and we only care about the model's performance on the target task. In other words, our goal is to maximize performance on the target task, assuming that a model pre-trained on a source task is available. This setting has various applications and has led to state-of-the-art performance in several image classification tasks (Cui_2018_CVPR). Moreover, this setting is also adopted in industry in the form of machine-learning-as-a-service, such as Google's Cloud AutoML (gautoml) and Microsoft's Custom Vision service (azure), where users can upload custom data to fine-tune a pre-trained model. We refer to this setting as transfer learning throughout this paper.

Transfer learning for ConvNets has received great attention due to its effectiveness in achieving high accuracy. (simonyan2014very) have shown that a model pre-trained on a large-scale dataset (such as ImageNet) acts as an effective feature extractor that supersedes hand-crafted feature extractors. Later, (yosinski2014transferable; donahue2014decaf) found that inheriting the pre-trained weights and continuing to learn from them (often referred to as "fine-tuning") can result in even larger performance improvements. Fine-tuning has since been adopted in various tasks to achieve state-of-the-art results. Beyond plain fine-tuning, several prior methods add an explicit regularization loss to fine-tuning to further enhance the performance of transfer learning (xuhong2018explicit; li2018delta). While prior art has demonstrated that fine-tuning does not necessarily outperform training from random initialization for some tasks, such as classifying medical images (NIPS2019_8596), and object detection and semantic segmentation with sufficient training data (he2019rethinking), it is important to note that fine-tuning remains the state-of-the-art method for small datasets that are visually similar to the source data, such as the Caltech-UCSD Birds-200 dataset (WelinderEtal2010).

Very recently, (Rezaei2020A) have demonstrated that fine-tuned models are vulnerable to adversarial examples crafted solely based on the pre-trained model. In other words, an adversary can attack a pre-trained model available on open repositories, e.g., TorchVision, and use the resulting adversarial image to deceive the transferred models. The success of this attack raises security concerns for the widely adopted fine-tuning mechanism, which is also used in industrial applications such as Google's AutoML (gautoml) and Microsoft's Custom Vision (azure). In this work, we take a first step toward alleviating this problem. Intuitively, the vulnerability to such an attack stems from the similarity between the pre-trained and the transferred models. Thus, to improve the robustness of the transferred models, one would prefer the transferred model to be dissimilar to the pre-trained one. However, we find that existing fine-tuning methods produce transferred models that are similar to the pre-trained one, which in turn makes them vulnerable to such an attack. In contrast, models trained with random initialization are much more robust to such attacks, with the caveat that these models often exhibit much lower accuracy than fine-tuned ones. As an alternative, we propose to train from random initialization with noisy feature distillation, which achieves clean-data performance similar to fine-tuning and robustness comparable to training with random initialization. Quantitatively, the success rate of the $\ell_\infty$-norm projected gradient descent (PGD) attack, averaged across five datasets, drops from 74.3%, 65.7%, 70.3%, and 50.75% to 6.9%, 4.4%, 4.9%, and 6.9% for ResNet-18, ResNet-50, ResNet-101, and MobileNetV2, respectively. Overall, our contributions are as follows:

  • Our work is the first to improve the robustness of the transferred network against the attack designed specifically for fine-tuning.

  • We propose to conduct transfer learning via training from random initialization and noisy feature distillation which results in competitive clean-data performance with significant robustness improvement over existing fine-tuning methods.

  • We conduct extensive experiments on four networks and five datasets via hyper-parameter tuning and an ablation study to empirically justify the proposed method.

2 Background

2.1 Transfer learning

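In general, the goal of transfer learning is to minimize the following objective:

$$\min_{\mathbf{w},\,\mathbf{w}_c} \; \frac{1}{n}\sum_{i=1}^{n} \Big[ \mathcal{L}\big(f(\mathbf{x}_i; \mathbf{w}, \mathbf{w}_c),\, y_i\big) + \lambda\, \Omega(\mathbf{w}, \mathbf{w}^*, \mathbf{x}_i) \Big] + \beta\, \Omega'(\mathbf{w}_c) \qquad (1)$$

where $\mathcal{L}$ is the loss function such as cross entropy, $f(\cdot; \mathbf{w}, \mathbf{w}_c)$ denotes the neural network of interest, $\mathbf{w}$ are the weights excluding the last linear layer, $\mathbf{w}_c$ denotes the weights of the last linear layer, $\mathbf{w}^*$ denotes the pre-trained weights excluding the last linear layer, $\Omega$ and $\Omega'$ are the regularization functions weighted by $\lambda$ and $\beta$, $n$ is the number of training samples, and $\mathbf{x}_i$ and $y_i$ denote the training data and labels of the target task. Various transfer learning methods differ in the variables being optimized and in the form of the regularization functions. In the following, we describe the four transfer learning methods and the common baseline considered in this paper. We note that in all five baselines, $\Omega'$ is taken to be $\|\mathbf{w}_c\|_2^2$.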
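To make the structure of Eq. 1 concrete, here is a minimal, dependency-free sketch of how its terms combine for one mini-batch. It assumes the per-sample regularizer values $\Omega(\mathbf{w}, \mathbf{w}^*, \mathbf{x}_i)$ have already been computed (zero for the methods where $\Omega$ is constant) and that $\Omega'(\mathbf{w}_c) = \|\mathbf{w}_c\|_2^2$ as stated above. The function names are hypothetical, not taken from the authors' code.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for a single sample, using the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def transfer_objective(
    logits_batch: List[Sequence[float]],
    labels: List[int],
    omega_values: List[float],
    w_c: List[List[float]],
    lam: float,
    beta: float,
) -> float:
    """Eq. 1 for one mini-batch: mean of [CE + lam * Omega_i] plus beta * ||w_c||_2^2."""
    n = len(labels)
    if not (len(logits_batch) == n == len(omega_values)):
        raise ValueError("batch components must have the same length")
    data_term = sum(
        cross_entropy(z, y) + lam * om
        for z, y, om in zip(logits_batch, labels, omega_values)
    ) / n
    omega_prime = sum(v * v for row in w_c for v in row)  # ||w_c||_2^2
    return data_term + beta * omega_prime
```

Linear classifier, fine-tuning, and re-training correspond to passing zeros for `omega_values`; L2SP and DELTA differ only in how those values are computed.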

Linear classifier

(simonyan2014very) minimizes Eq. 1 only w.r.t. $\mathbf{w}_c$, with $\Omega$ being a constant function; $\mathbf{w}$ is initialized to $\mathbf{w}^*$. In other words, the linear classifier only trains the last linear layer while using the pre-trained model as a feature extractor.


Fine-tuning

(Li2020Rethinking; donahue2014decaf; raghu2019transfusion; he2019rethinking; yosinski2014transferable) minimizes Eq. 1 w.r.t. both $\mathbf{w}$ and $\mathbf{w}_c$, with $\Omega$ being a constant function; $\mathbf{w}$ is initialized to $\mathbf{w}^*$. In other words, fine-tuning optimizes the entire pre-trained model.


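L2SP

(xuhong2018explicit; lee2019mixout) minimizes Eq. 1 w.r.t. both $\mathbf{w}$ and $\mathbf{w}_c$, with

$$\Omega(\mathbf{w}, \mathbf{w}^*, \mathbf{x}) = \|\mathbf{w} - \mathbf{w}^*\|_2^2 \qquad (2)$$

In this case, $\mathbf{w}$ is initialized to $\mathbf{w}^*$. In other words, L2SP optimizes the entire pre-trained model while regularizing the weights to stay close to those of the pre-trained model.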
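Eq. 2 is simply a squared Euclidean distance over all non-classifier parameters. A minimal sketch, assuming the weights have already been flattened into plain Python lists (a real implementation would iterate over a framework's parameter tensors); `l2sp_penalty` is a hypothetical name:

```python
from typing import Sequence


def l2sp_penalty(w: Sequence[float], w_star: Sequence[float]) -> float:
    """Eq. 2: squared L2 distance between the current weights w and the pre-trained w*.

    Both arguments are flattened weight vectors that exclude the last linear layer.
    """
    if len(w) != len(w_star):
        raise ValueError("w and w_star must have the same number of parameters")
    return sum((a - b) ** 2 for a, b in zip(w, w_star))
```

Because $\mathbf{w}$ starts at $\mathbf{w}^*$, this penalty is zero at initialization and grows only as fine-tuning moves the weights away from the pre-trained solution.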


DELTA

(li2018delta; pmlr-v97-jang19b; wang2020pay) minimizes Eq. 1 w.r.t. both $\mathbf{w}$ and $\mathbf{w}_c$, with

$$\Omega(\mathbf{w}, \mathbf{w}^*, \mathbf{x}) = \sum_{j=1}^{J} \frac{1}{M_j} \big\| FM_j(\mathbf{w}, \mathbf{x}) - FM_j(\mathbf{w}^*, \mathbf{x}) \big\|_2^2 \qquad (3)$$

where $J$ denotes the number of layers excluding the last classification layer, $M_j$ denotes the number of output elements of the $j$-th layer, and $FM_j(\cdot)$ denotes the function that evaluates the activation of the $j$-th layer. $\mathbf{w}$ is initialized to $\mathbf{w}^*$. In other words, DELTA optimizes the entire pre-trained model while regularizing the activations to stay close to those of the pre-trained model. We note that we do not consider the per-channel regularization weights for feature regularization; that is, we consider the so-called DELTA (w/o ATT) in (li2018delta). Following (li2018delta; pmlr-v97-jang19b), the layers to match are the last layer of each stage in the neural network. For example, there are four stages in ResNets (he2016deep).
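Eq. 3 averages each layer's squared activation difference over its $M_j$ elements before summing across layers, so large feature maps do not dominate the loss. A sketch for a single input, assuming the activations of the matched layers (e.g., the last layer of each ResNet stage) have been flattened into lists; `delta_feature_loss` is a hypothetical helper, and in training this value would be computed per sample and averaged as in Eq. 1:

```python
from typing import Sequence


def delta_feature_loss(
    feats: Sequence[Sequence[float]], feats_star: Sequence[Sequence[float]]
) -> float:
    """Eq. 3 (DELTA w/o ATT): sum over layers j of (1/M_j) * ||FM_j(w,x) - FM_j(w*,x)||_2^2.

    feats[j] and feats_star[j] are the flattened activations of the j-th matched layer
    for the transferred and the pre-trained network, respectively.
    """
    if len(feats) != len(feats_star):
        raise ValueError("both networks must expose the same matched layers")
    total = 0.0
    for f, f_star in zip(feats, feats_star):
        if len(f) != len(f_star) or not f:
            raise ValueError("matched activations must be non-empty and equal-sized")
        m_j = len(f)  # number of output elements of layer j
        total += sum((a - b) ** 2 for a, b in zip(f, f_star)) / m_j
    return total
```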


Re-training

Re-training is usually used as a baseline to demonstrate the effectiveness of transfer learning methods. It minimizes Eq. 1 w.r.t. both $\mathbf{w}$ and $\mathbf{w}_c$, with $\Omega$ being a constant function, but $\mathbf{w}$ is randomly initialized. In other words, re-training optimizes the entire network (with the same architecture as the pre-trained model) starting from randomly initialized weights. In this case, essentially no information is transferred from the pre-trained model.

Besides the aforementioned baselines considered in the experiments, we also discuss efforts in improving transfer learning using additional information or architectural changes. (ge2017borrowing) developed a method to improve fine-tuning by leveraging additional training data obtained from large-scale datasets. (Cui_2018_CVPR) used Earth Mover’s Distance to measure domain similarity between datasets and showed that pre-training on similar domains results in better transfer. (wang2017growing) discovered that increasing the model capacity (wider or deeper) improves the effectiveness of fine-tuning. (kornblith2019better) investigate whether better models on the source dataset imply better models on the target dataset using linear classifiers and fine-tuning. In contrast, we do not assume having access to the source data.

2.2 Adversarial examples

Adversarial examples (szegedy2013intriguing) for deep learning models have received growing attention due to their potential impact on machine learning systems. Different threat models give rise to different types of attacks. For example, in a white-box threat model, where the adversary knows all the information about a model, the fast gradient sign method (FGSM) (goodfellow2014explaining), projected gradient descent, and CW (carlini2017towards) have been shown to be strong attacks. To counteract these attacks, adversarial training (madry2017towards) is the dominant approach. On the other hand, there are also methods targeting a black-box threat model, where the adversary can only query the model and obtain the probability vector (liu2016delving; papernot2016transferability; chen2017zoo).

In this work, our threat model assumes that the adversary has access to the weights and architecture of the pre-trained model. The adversary has no access to the task-specific transferred model and cannot query it. This threat model aligns with the practical usage of deep learning models, where researchers take models pre-trained on large datasets (like ImageNet) and fine-tune them for other tasks. Based on this threat model, prior art (Rezaei2020A) has proposed an attack that successfully compromises task-specific fine-tuned models, which raises security concerns for fine-tuning. In this work, we propose an algorithm to improve the robustness of the transferred model under this particular threat model. Recently, (shafahi2020adversarially) proposed to improve the adversarial robustness of the transferred model in a white-box setting by transferring to the target model robust features obtained through adversarial training. We note that their threat model is very different from ours. As we show in Section 3.4, adversarial training is less effective than our proposed method under the considered threat model.

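To craft an adversarial example under our threat model, we adopt the attack from prior art (Rezaei2020A), which optimizes the following objective:

$$\min_{\boldsymbol{\delta}} \; \big\| g(\mathbf{x} + \boldsymbol{\delta}; \mathbf{w}^*) - \mathbf{t} \big\|_2^2 \quad \text{s.t.} \quad \|\boldsymbol{\delta}\|_\infty \le \epsilon, \quad \mathbf{x} + \boldsymbol{\delta} \in [0, 1]^d \qquad (4)$$

where $g(\cdot; \mathbf{w}^*)$ is the penultimate layer of the pre-trained model and $\mathbf{t} = K \cdot \mathbf{e}_k$ is a target vector set to a scalar $K$ multiplied by a one-hot vector $\mathbf{e}_k$. $K$ is chosen to be large and $\epsilon$ denotes the perturbation budget. The pixel intensities in this formulation are normalized and constrained to $[0, 1]$. To optimize Eq. 4, we use projected gradient descent (PGD). Intuitively, the objective seeks a small-norm perturbation $\boldsymbol{\delta}$ such that the response of the penultimate layer of the pre-trained model is high in one neuron but zero in all other neurons. Once the perturbation $\boldsymbol{\delta}$ for a specific input image is found, the perturbed image $\mathbf{x} + \boldsymbol{\delta}$ is used to attack a transferred model $f(\cdot; \mathbf{w}, \mathbf{w}_c)$, about which the attacker has no information. We provide a qualitative view of the attack in Figure 1.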

Regarding the parameters, we set the perturbation budget to 0.1, the number of PGD iterations to 40, the target scalar to 1000 (following (Rezaei2020A)), and the PGD step size (learning rate) to 0.01. We use AdverTorch (ding2019advertorch) to generate adversarial examples with the above objective and parameters.
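The following is a minimal, dependency-free sketch of $\ell_\infty$ PGD on Eq. 4; it is not the AdverTorch implementation used in the experiments. To stay self-contained, it assumes the penultimate layer $g$ is a toy linear map $h = W\mathbf{x}$, so the gradient of the squared distance has the closed form $2W^\top(W(\mathbf{x}+\boldsymbol{\delta}) - \mathbf{t})$. The names `pgd_linf`, `target_idx`, and the toy matrix are hypothetical. The defaults mirror the stated parameters: budget 0.1, 40 iterations, target scalar 1000, step size 0.01.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Toy stand-in for the penultimate layer: h = W x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def attack_loss(W: Matrix, x_adv: Vector, t: Vector) -> float:
    """Eq. 4 objective: squared L2 distance between g(x + delta) and the target t."""
    return sum((h - ti) ** 2 for h, ti in zip(matvec(W, x_adv), t))


def pgd_linf(W: Matrix, x: Vector, target_idx: int, K: float = 1000.0,
             eps: float = 0.1, step: float = 0.01, iters: int = 40) -> Vector:
    """L_inf PGD that minimizes Eq. 4; returns the perturbation delta."""
    t = [0.0] * len(W)
    t[target_idx] = K  # target = K times a one-hot vector
    delta = [0.0] * len(x)
    for _ in range(iters):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        r = [h - ti for h, ti in zip(matvec(W, x_adv), t)]  # residual g(x+delta) - t
        for i in range(len(x)):
            # d/d delta_i of ||W(x + delta) - t||^2 = 2 * sum_k W[k][i] * r[k]
            g_i = 2.0 * sum(W[k][i] * r[k] for k in range(len(W)))
            sign = (g_i > 0) - (g_i < 0)
            d = delta[i] - step * sign              # signed gradient descent step
            d = max(-eps, min(eps, d))              # project onto the L_inf ball
            d = max(-x[i], min(1.0 - x[i], d))      # keep x + delta inside [0, 1]
            delta[i] = d
    return delta
```

With a real network, $g$ would be the pre-trained model truncated before its classifier and the gradient would come from automatic differentiation; the per-coordinate loop here is only for readability.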

Figure 1: Qualitative view of the adversarial attack considered and the comparison of transfer learning between prior art and our proposed method.

2.3 Datasets and implementation detail

In this work, we consider five datasets to transfer to and models trained on ImageNet as pre-trained models. The datasets under consideration are shown in Table 1.

| Dataset | Task category | # Training samples (per class) | # Testing samples (per class) | # Classes | Abbreviation |
|---|---|---|---|---|---|
| Stanford Dogs (khosla2011novel) | Fine-grained classification | 100 | 72 | 120 | Dog |
| Caltech-UCSD Birds (WelinderEtal2010) | Fine-grained classification | 30 | 29 | 200 | Bird |
| Stanford 40 Actions (yao2011human) | Action classification | 100 | 138 | 40 | Action |
| MIT Indoor Scenes (quattoni2009recognizing) | Indoor scene classification | 80 | 20 | 67 | Indoor |
| 102 Category Flower (nilsback2008automated) | Fine-grained classification | 20 | 60 | 102 | Flower |

Table 1: The characteristics of the transfer learning datasets considered in this work. We include the number of training samples per class, the number of testing samples per class, and the number of classes.

We mainly use ResNet-18 (he2016deep) throughout the experiments and provide an ablation study on other networks in Section 4.3. For training, we use a batch size of 64 and stochastic gradient descent with momentum, following prior art (li2018delta; xuhong2018explicit). For the experiments using fine-tuning, i.e., those that start from pre-trained weights, we use 30,000 iterations to make sure the loss converges. Additionally, we tune the learning rate, weight decay, and momentum for fine-tuning on each dataset using grid search, following prior art (Li2020Rethinking). For re-training, the hyper-parameters are fixed across datasets without tuning: we use 90,000 iterations, learning rate 0.01, momentum 0.9, and weight decay 0.005. Also, $\beta$ for $\Omega'$ is set to 0.01 across all experiments, following (li2018delta; xuhong2018explicit).

For fine-tuning methods that come with additional hyper-parameters, such as L2SP and DELTA, we also tune their regularization strength $\lambda$ to obtain the best transferred results, following prior art (xuhong2018explicit; li2018delta).
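Grid search here means training once per combination of learning rate, momentum, and weight decay and keeping the best-performing combination. A minimal sketch, assuming a hypothetical `train_and_eval` callback that trains a model with the given settings and returns an accuracy on held-out data (which split the authors used for selection is not stated); the grid values in the check are illustrative only:

```python
import itertools
from typing import Callable, Dict, Iterable, Optional, Tuple


def grid_search(
    train_and_eval: Callable[..., float],
    lrs: Iterable[float],
    momenta: Iterable[float],
    weight_decays: Iterable[float],
) -> Tuple[float, Dict[str, float]]:
    """Exhaustively try every (lr, momentum, weight_decay) and keep the best accuracy."""
    best: Optional[Tuple[float, Dict[str, float]]] = None
    for lr, mom, wd in itertools.product(lrs, momenta, weight_decays):
        acc = train_and_eval(lr=lr, momentum=mom, weight_decay=wd)
        if best is None or acc > best[0]:
            best = (acc, {"lr": lr, "momentum": mom, "weight_decay": wd})
    if best is None:
        raise ValueError("empty search grid")
    return best
```

The cost grows multiplicatively with the grid size, which is one reason the re-training runs use fixed hyper-parameters instead.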

3 Methodology

3.1 Robustness evaluation

| Method | Metric | Dog | Bird | Action | Indoor | Flower |
|---|---|---|---|---|---|---|
| Linear classifier | Clean | 84.22 | 67.02 | 73.64 | 72.54 | 88.52 |
| | ASR | 96.06 | 96.47 | 92.49 | 88.95 | 86.40 |
| Fine-tuning | Clean | 81.84 | 77.67 | 77.19 | 75.37 | 95.71 |
| | ASR | 89.36 | 50.33 | 73.75 | 54.75 | 14.07 |
| L2SP | Clean | 83.82 | 77.51 | 77.22 | 75.15 | 95.63 |
| | ASR | 94.08 | 50.08 | 92.16 | 66.73 | 16.75 |
| DELTA | Clean | 84.39 | 78.75 | 77.69 | 78.36 | 95.90 |
| | ASR | 95.65 | 58.83 | 93.51 | 79.71 | 43.65 |
| Re-training | Clean | 70.77 | 69.76 | 51.90 | 59.93 | 87.38 |
| | ASR | 5.99 | 6.14 | 5.82 | 6.73 | 3.00 |

Table 2: Robustness evaluation for the baseline transfer learning methods. ASR denotes the attack success rate (the lower, the more robust), computed as P(wrong on adversarial data | correct on clean data). Clean denotes the Top-1 accuracy on clean data.

We first evaluate, for each of the five baselines, the robustness to adversarial examples crafted solely based on the pre-trained model. To characterize robustness across datasets, we use the attack success rate (ASR), defined as the conditional probability P(wrong on adversarial data | correct on clean data). As shown in Table 2, methods that rely on the pre-trained model (the first four rows) are vulnerable to these adversarial examples. As expected, the linear classifier has the worst robustness since it uses the model under attack, unchanged, as its feature extractor. In contrast, re-training provides the best robustness since it trains the model from random initialization without inheriting any knowledge from the pre-trained model.
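The conditional definition of ASR can be computed directly from predictions. A minimal sketch with a hypothetical function name, assuming class predictions on the clean and the adversarial versions of the same test images are available as aligned lists:

```python
from typing import Sequence


def attack_success_rate(clean_pred: Sequence[int], adv_pred: Sequence[int],
                        labels: Sequence[int]) -> float:
    """ASR (%) = P(wrong on adversarial data | correct on clean data)."""
    if not (len(clean_pred) == len(adv_pred) == len(labels)):
        raise ValueError("prediction and label lists must align")
    # Only samples the model classifies correctly on clean data are counted.
    clean_correct = [i for i, (p, y) in enumerate(zip(clean_pred, labels)) if p == y]
    if not clean_correct:
        raise ValueError("ASR is undefined when no clean sample is correct")
    fooled = sum(1 for i in clean_correct if adv_pred[i] != labels[i])
    return 100.0 * fooled / len(clean_correct)
```

Conditioning on clean correctness keeps the metric from being inflated by images the transferred model already misclassifies, which makes ASR comparable across methods with different clean accuracy.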

Interestingly, from Table 2 one can observe that for datasets such as Bird and Flower, fine-tuning does not fail completely when facing adversarial examples. The difference in robustness across datasets is related to how different the transferred and the pre-trained models are for different datasets. We further analyze the relationship between robustness and the distance between the transferred and the pre-trained models. Specifically, we consider weight distance (i.e., Eq. 2) and feature distillation loss (i.e., Eq. 3). We plot the corresponding distance metrics for the 25 data points from Table 2 by measuring the feature distance using the training data. As shown in Figure 2, the distance between the transferred and the pre-trained models correlates well with robustness in both distance metrics considered, which matches our intuition.

Figure 2: Robustness vs. distance between transferred and pre-trained models for the five baseline methods on five datasets.

In terms of clean-data performance, the sweet spot of similarity to the pre-trained model is dataset-dependent. Model similarity often helps due to task similarity and the lack of a large-scale target dataset, which is why re-training has the worst clean-data performance among these methods.

Based on these observations, it is natural to wonder if it is possible to achieve the best of both worlds, i.e., the robustness of re-training and the clean-data performance of DELTA. We conjecture that this is possible if vulnerability to attacks and generalization improvements brought by transfer learning stem from different sources.

3.2 The role of pre-trained weights

We begin by first combining re-training and DELTA. That is, instead of fine-tuning from pre-trained weights using feature distillation, we re-train from random initialization using feature distillation. We term this new method DELTA-R, which stands for DELTA with randomly initialized weights. This helps us understand the role of pre-trained weights in both generalization and robustness. It is important to note that while this modification is simple, it has not been explored in prior art and it is not trivial to see if re-training with feature distillation using a small target dataset is sufficient to achieve clean-data performance comparable to fine-tuning with prior knowledge encoded in the pre-trained weights.

Figure 3: (Left) Clean-data performance and (Right) Attack success rate for DELTA, DELTA-R, and re-training.

As shown in Figure 3, it is encouraging to observe that DELTA-R can achieve clean-data performance comparable to fine-tuning (DELTA) with only one to two points of accuracy degradation. The competitive clean-data performance implies that the generalization benefits of the pre-trained model can be largely captured by the features on the target dataset. On the other hand, the attack success rate drops significantly when we re-randomize the weights, which implies that a large portion of the vulnerability stems from the pre-trained weights. While encouraging, there is still a gap in robustness between DELTA-R and re-training for most datasets.

3.3 Avoiding over-fitting to pre-trained features

DELTA-R removes the vulnerability due to the pre-trained weights, but the transferred models remain vulnerable because their features are close to those of the pre-trained model. To further improve robustness, the key technical challenge is to reduce the similarity between the pre-trained and the transferred features without hurting clean-data performance. A naïve idea is to adjust the strength $\lambda$ of the feature regularization term in Eq. 3 in the hope of achieving better robustness without hurting clean-data performance. We sweep $\lambda$ for DELTA-R on each of the datasets. As shown in Figure 4, clean-data performance and robustness trade off against each other as $\lambda$ varies. While a controllable trade-off between clean-data performance and robustness is useful for application developers, we are interested in improving adversarial robustness without hurting clean-data performance.

Figure 4: The effect of tuning $\lambda$ on the trade-off between clean-data performance and the attack success rate. The star marks the $\lambda$ we use.

To develop our approach, a key observation is that regularization techniques in deep learning, such as dropout (hinton2012improving) and stochastic weight averaging (SWA) (izmailov2018averaging), can improve generalization at the cost of a higher training loss. In other words, one can further increase the feature distillation loss, which in turn improves robustness, without hurting generalization. While many regularization techniques align with this key observation, we consider dropout and SWA in this work. We note that not all regularization techniques in deep learning are helpful; for example, label smoothing (szegedy2016rethinking) does not help since it increases the training cross-entropy loss but barely affects the feature distillation loss, which is the quantity we care about.


Dropout

Dropout was proposed to avoid over-fitting by randomly dropping activations during training (hinton2012improving). In this work, we consider spatial dropout (tompson2015efficient) for convolutional layers, which randomly drops entire channels during training. Intuitively, this regularization technique makes it harder for the optimizer to minimize the loss between the pre-trained and the transferred features, because the latter are randomly set to zero. We insert a dropout layer after each layer used in the feature distillation loss, with a dropout rate of 10%.
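Spatial dropout differs from standard dropout only in its granularity: a single random draw decides whether an entire channel is zeroed. A minimal, framework-free sketch, assuming a feature map stored as a list of channels and inverted-dropout scaling by $1/(1-p)$ at training time (as in common implementations such as PyTorch's `nn.Dropout2d`); the function name is hypothetical:

```python
import random
from typing import List, Optional


def spatial_dropout(feature_map: List[List[float]], p: float = 0.1,
                    rng: Optional[random.Random] = None,
                    training: bool = True) -> List[List[float]]:
    """Zero out whole channels with probability p (channel-wise dropout).

    feature_map[c] holds all spatial activations of channel c. Kept channels are
    scaled by 1 / (1 - p) (inverted dropout), so no rescaling is needed at test time.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return [list(ch) for ch in feature_map]
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)
    return [
        [0.0] * len(ch) if rng.random() < p else [v * scale for v in ch]
        for ch in feature_map
    ]
```

When such a layer sits after a distilled layer, the transferred features entering Eq. 3 have randomly missing channels, so the distillation loss cannot be driven all the way down; that residual loss is what the text credits for the robustness gain.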

Stochastic Weight Averaging (SWA)

SWA has shown great promise in improving the generalization performance of deep neural networks (izmailov2018averaging). The core idea is to average multiple solutions visited by SGD to form the final solution. It has been demonstrated empirically that SWA improves generalization while increasing the training loss. We apply SWA to trained DELTA-R models by continuing training with half of the learning rate, i.e., 0.005, as suggested in prior art (izmailov2018averaging). The SWA phase uses a constant learning rate and one third of the iterations of DELTA-R, i.e., 30,000 iterations. We average the models every 500 iterations.
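The SWA phase described above can be sketched as a loop that keeps training at a constant learning rate and folds a snapshot into a running mean every 500 iterations. This assumes a hypothetical `train_step(w, lr)` callback performing one SGD update on a flattened weight vector. A full implementation would also recompute BatchNorm statistics for the averaged weights, as the SWA authors do (PyTorch offers `torch.optim.swa_utils.update_bn` for this), which the sketch omits.

```python
from typing import Callable, List, Tuple

Weights = List[float]


def run_swa(w: Weights, train_step: Callable[[Weights, float], Weights],
            iters: int = 30_000, lr: float = 0.005,
            avg_every: int = 500) -> Tuple[Weights, int]:
    """Constant-learning-rate SWA phase: keep training and average snapshots.

    A snapshot is taken every `avg_every` iterations and folded into a running mean,
    so memory stays constant regardless of how many snapshots are averaged.
    """
    w_swa: Weights = []
    n_avg = 0
    for it in range(1, iters + 1):
        w = train_step(w, lr)
        if it % avg_every == 0:
            if n_avg == 0:
                w_swa = list(w)
            else:
                # running mean: avg_{n+1} = avg_n + (w - avg_n) / (n + 1)
                w_swa = [a + (b - a) / (n_avg + 1) for a, b in zip(w_swa, w)]
            n_avg += 1
    return w_swa, n_avg
```

With the stated settings, 30,000 iterations averaged every 500 iterations yields 60 snapshots.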

Re-training with noisy feature distillation

Re-training with noisy feature distillation, or Renofeation (pronounced "renovation"), is our proposed method. It re-initializes the network weights and trains them with feature distillation together with both dropout and SWA. Both dropout and SWA are used to alleviate over-fitting of the features to the pre-trained model and thereby improve robustness, hence the name noisy feature distillation. As shown in Table 3, these regularization techniques indeed increase the feature distillation loss, which in turn improves the robustness of DELTA-R. Additionally, we find empirically that dropout and SWA work together to achieve better regularization. In terms of clean-data performance, Renofeation remains comparable to DELTA-R, except for some accuracy loss on the Dog dataset. We hypothesize that this accuracy loss may be due to the large overlap between the Dog and ImageNet datasets, so that matching features alone may not be sufficient. This is suggested by the fact that the linear classifier alone achieves results similar to DELTA on the Dog dataset, as shown in Table 2.
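For reference, the Renofeation settings spread across Sections 2.3 and 3.3 can be gathered in one place. This is a hypothetical configuration object, not the authors' code. Two values are our inference rather than stated directly: that the DELTA-R phase reuses the re-training hyper-parameters (90,000 iterations, momentum 0.9, weight decay 0.005), and a base learning rate of 0.01, which is consistent with SWA using "half of the learning rate, i.e., 0.005". The regularization strength $\lambda$ is chosen per dataset (Figure 4), so it is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenofeationConfig:
    """Hypothetical summary of the Renofeation recipe described in the text."""
    init: str = "random"            # weights are re-initialized, not copied from w*
    batch_size: int = 64
    distill_iters: int = 90_000     # DELTA-R phase (assumed to match re-training)
    distill_lr: float = 0.01        # assumed base LR; SWA uses half of it
    momentum: float = 0.9           # assumed, as for re-training
    weight_decay: float = 0.005     # assumed, as for re-training
    beta: float = 0.01              # weight of Omega'(w_c) = ||w_c||_2^2
    spatial_dropout: float = 0.10   # after each layer used in the distillation loss
    swa_iters: int = 30_000         # one third of distill_iters
    swa_avg_every: int = 500

    @property
    def swa_lr(self) -> float:
        """Constant SWA learning rate: half of the distillation learning rate."""
        return self.distill_lr / 2

    @property
    def num_swa_snapshots(self) -> int:
        return self.swa_iters // self.swa_avg_every

    @property
    def total_iters(self) -> int:
        return self.distill_iters + self.swa_iters
```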

When compared to the baselines, Figure 5 shows that Renofeation achieves robustness close to re-training and clean-data performance close to DELTA. We note that even though Renofeation has a higher attack success rate than re-training on some datasets, such as Dog and Indoor, it has higher Top-1 accuracy under adversarial examples owing to its much better clean-data performance.

| Method | Metric | Dog | Bird | Action | Indoor | Flower |
|---|---|---|---|---|---|---|
| DELTA-R | Clean | 82.49 | 77.58 | 76.79 | 77.39 | 94.49 |
| | ASR | 37.93 | 7.28 | 60.83 | 38.48 | 14.30 |
| | Feature loss | 0.70 | 1.48 | 0.72 | 0.68 | 0.56 |
| DELTA-R + Dropout | Clean | 81.21 | 77.72 | 78.00 | 77.31 | 95.35 |
| | ASR | 17.91 | 4.24 | 18.98 | 22.30 | 7.44 |
| | Feature loss | 0.86 | 1.57 | 1.03 | 0.89 | 0.68 |
| DELTA-R + SWA | Clean | 80.32 | 78.92 | 78.07 | 77.69 | 94.81 |
| | ASR | 12.87 | 3.65 | 23.62 | 16.72 | 2.95 |
| | Feature loss | 0.86 | 1.63 | 0.87 | 0.82 | 0.73 |
| Renofeation | Clean | 78.11 | 79.03 | 79.07 | 76.79 | 95.59 |
| | ASR | 9.83 | 3.41 | 7.16 | 11.08 | 2.86 |
| | Feature loss | 1.00 | 1.68 | 1.06 | 1.02 | 0.81 |

Table 3: The effect of dropout and SWA on feature distillation loss, clean-data performance, and robustness for DELTA-R.
Figure 5: Comparison between the proposed Renofeation approach and baseline methods for clean-data performance (Left) and attack success rate (Right).

3.4 Comparison with adversarial training

While we showed that our proposed Renofeation approach achieves a significant adversarial robustness improvement over DELTA with comparable clean-data performance under our threat model, adversarial training can also be considered as a defense under this threat model. In this section, we therefore compare our method with adversarial training to further demonstrate its effectiveness. To conduct adversarial training under our threat model, we train DELTA for twice as many iterations and, at each iteration, randomly sample either a batch of benign examples or a batch of adversarial examples crafted with three iterations of projected gradient descent. As shown in Table 4, adversarial training indeed achieves better robustness than the baselines but worse robustness than Renofeation.
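The batch-mixing scheme can be sketched as follows. The text says only that a clean or an adversarial batch is "randomly sampled" at each iteration, so the 50/50 split (`p_adv=0.5`) is our assumption. The `craft_adv` and `train_step` callbacks are hypothetical placeholders; in the experiments, `craft_adv` would correspond to three PGD iterations of the Eq. 4 attack.

```python
import random
from typing import Callable, Iterable, List, Optional, Tuple


def mixed_adversarial_training(
    batches: Iterable[Tuple[List[float], List[int]]],
    train_step: Callable[[List[float], List[int]], None],
    craft_adv: Callable[[List[float]], List[float]],
    p_adv: float = 0.5,
    rng: Optional[random.Random] = None,
) -> int:
    """For each iteration, train on either the clean batch or its adversarial version.

    Returns the number of adversarial batches used.
    """
    rng = rng or random.Random(0)
    n_adv = 0
    for x, y in batches:
        if rng.random() < p_adv:
            x = craft_adv(x)  # e.g., a few PGD steps of the Eq. 4 attack
            n_adv += 1
        train_step(x, y)
    return n_adv
```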

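| Method | Metric | Dog | Bird | Action | Indoor | Flower |
|---|---|---|---|---|---|---|
| DELTA | Clean | 84.39 | 78.75 | 77.69 | 78.36 | 95.90 |
| | ASR | 95.65 | 58.83 | 93.51 | 79.71 | 43.65 |
| DELTA Adv. Trained | Clean | 82.83 | 77.10 | 75.69 | 77.84 | 95.12 |
| | ASR | 85.86 | 16.77 | 85.19 | 61.84 | 23.85 |
| Renofeation | Clean | 78.11 | 79.03 | 79.07 | 76.79 | 95.59 |
| | ASR | 9.83 | 3.41 | 7.16 | 11.08 | 2.86 |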

Table 4: Comparison between DELTA, DELTA with PGD-3 adversarial training, and proposed Renofeation.

4 Ablation study

4.1 Regularization for baselines

Figure 6: Ablation study of the effect of dropout (DO) and SWA on (Left) clean-data performance and (Right) attack success rate for the two baselines including DELTA and re-training.

In Renofeation, we incorporate both dropout and SWA for improving robustness. One might naturally wonder if these two techniques are specific to Renofeation. In other words, can baselines such as re-training and DELTA benefit from these techniques? To answer this question, we conduct an ablation study for re-training and DELTA approaches in conjunction with these two regularization techniques. For SWA, we use the same hyper-parameters as mentioned earlier, i.e., 0.005 constant learning rate, 30,000 iterations, and average every 500 iterations. As shown in Figure 6, we find that dropout improves robustness for DELTA while SWA minimally helps. For both techniques, DELTA is still highly vulnerable to adversarial examples for datasets such as Dog, Action, and Indoor. On the other hand, dropout and SWA help clean-data performance for re-training, but the resulting performance still cannot compete with DELTA for most datasets except for Bird. Overall, the proposed Renofeation achieves the best of both worlds.

4.2 Limited training data

Earlier, in Section 3.2, we noted that feature distillation using the target dataset achieves performance competitive with fine-tuning. Intuitively, if the amount of training data is large, feature distillation should be able to recover the knowledge encoded in the pre-trained weights. However, in transfer learning, target datasets usually have much less training data than large-scale datasets such as ImageNet. In this section, we vary the number of training samples to understand how it affects the effectiveness of Renofeation, so as to provide a guideline for when to use it. Specifically, we consider cases where the training data for each dataset is reduced to 33% and 66%. For each class in the dataset, we randomly sub-sample 33% or 66% of the training images, so the reduced training set remains balanced across classes.
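Per-class sub-sampling keeps the class balance of the original training set. A minimal sketch, assuming labels are given as a list aligned with the training images; the rounding rule (at least one image per class, otherwise nearest integer) is our assumption, since the text does not specify it:

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def subsample_per_class(labels: Sequence[Hashable], fraction: float,
                        seed: int = 0) -> List[int]:
    """Return sorted indices keeping `fraction` of the samples of every class."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    keep: List[int] = []
    for idxs in by_class.values():
        k = max(1, round(fraction * len(idxs)))  # at least one image per class
        keep.extend(rng.sample(idxs, k))         # sample without replacement
    return sorted(keep)
```

For Caltech-UCSD Birds, with 30 training images per class (Table 1), this keeps 10 and 20 images per class at 33% and 66%, respectively.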

As shown in Table 5, when data becomes limited, Renofeation's clean-data advantage over re-training widens, while the two have similar robustness. On the other hand, when comparing Renofeation to DELTA, clean-data performance is comparable for most datasets except Dog, which aligns with our previous observation. Overall, we find Renofeation to be even more preferable when training data is limited, since its advantage over re-training becomes larger.

| Training data | Method | Metric | Dog | Bird | Action | Indoor | Flower |
|---|---|---|---|---|---|---|---|
| 33% | DELTA | Clean | 81.80 | 63.41 | 70.72 | 70.97 | 90.11 |
| | | ASR | 95.77 | 74.12 | 93.94 | 85.38 | 46.60 |
| | Renofeation | Clean | 74.13 | 61.75 | 69.22 | 70.22 | 88.32 |
| | | ASR | 10.88 | 5.56 | 9.40 | 15.73 | 4.94 |
| | Re-train | Clean | 44.98 | 26.10 | 24.51 | 37.54 | 62.73 |
| | | ASR | 9.67 | 14.68 | 9.22 | 8.35 | 2.85 |
| 66% | DELTA | Clean | 83.58 | 73.04 | 75.52 | 75.30 | 94.23 |
| | | ASR | 95.36 | 64.58 | 93.80 | 80.77 | 56.39 |
| | Renofeation | Clean | 77.25 | 74.46 | 76.09 | 74.48 | 93.56 |
| | | ASR | 9.85 | 3.55 | 7.53 | 12.22 | 4.73 |
| | Re-train | Clean | 64.03 | 56.47 | 40.73 | 52.61 | 80.60 |
| | | ASR | 7.72 | 10.79 | 5.86 | 7.23 | 3.23 |

Table 5: Reducing the number of training samples for each dataset to 33% and 66% and comparing the performance of the methods.

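| Network | Method | Metric | Dog | Bird | Action | Indoor | Flower | Average |
|---|---|---|---|---|---|---|---|---|
| ResNet-18 | DELTA | Clean | 84.39 | 78.75 | 77.69 | 78.36 | 95.90 | - |
| | | ASR | 95.65 | 58.83 | 93.51 | 79.71 | 43.65 | 74.27 |
| | Renofeation | Clean | 78.11 | 79.03 | 79.07 | 76.79 | 95.59 | - |
| | | ASR | 9.83 | 3.41 | 7.16 | 11.08 | 2.86 | 6.87 |
| | Re-train | Clean | 70.77 | 69.76 | 51.90 | 59.93 | 87.38 | - |
| | | ASR | 5.99 | 6.14 | 5.82 | 6.73 | 3.00 | 5.54 |
| ResNet-50 | DELTA | Clean | 90.13 | 81.95 | 81.87 | 79.93 | 96.63 | - |
| | | ASR | 94.69 | 32.29 | 91.94 | 84.69 | 24.84 | 65.69 |
| | Renofeation | Clean | 83.57 | 79.27 | 84.04 | 80.67 | 96.75 | - |
| | | ASR | 5.08 | 3.96 | 3.33 | 7.12 | 2.39 | 4.38 |
| | Re-train | Clean | 72.55 | 70.47 | 53.53 | 59.11 | 85.93 | - |
| | | ASR | 6.30 | 7.45 | 6.11 | 6.06 | 2.20 | 5.62 |
| ResNet-101 | DELTA | Clean | 91.92 | 82.07 | 82.61 | 80.00 | 96.37 | - |
| | | ASR | 88.03 | 42.60 | 87.53 | 89.27 | 44.04 | 70.29 |
| | Renofeation | Clean | 83.88 | 80.98 | 84.67 | 80.97 | 96.33 | - |
| | | ASR | 4.38 | 3.54 | 3.69 | 9.95 | 3.09 | 4.93 |
| | Re-train | Clean | 73.42 | 71.80 | 52.78 | 61.12 | 85.59 | - |
| | | ASR | 6.64 | 7.21 | 6.64 | 5.13 | 2.00 | 5.52 |
| MobileNetV2 | DELTA | Clean | 84.86 | 78.51 | 78.94 | 76.12 | 96.68 | - |
| | | ASR | 82.89 | 40.30 | 57.00 | 52.45 | 21.08 | 50.75 |
| | Renofeation | Clean | 76.42 | 75.70 | 77.78 | 76.49 | 96.32 | - |
| | | ASR | 11.62 | 6.79 | 5.92 | 7.12 | 2.84 | 6.86 |
| | Re-train | Clean | 67.95 | 69.50 | 52.86 | 61.49 | 88.73 | - |
| | | ASR | 8.56 | 8.54 | 8.35 | 7.65 | 2.71 | 7.16 |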

Table 6: Comparing DELTA, Renofeation, and re-training for different ConvNets.

4.3 Results on more networks

So far, we have conducted our experiments and analyses with ResNet-18. We are interested to see whether Renofeation is still preferable to re-training and DELTA for other networks. Specifically, we further consider deeper networks, i.e., ResNet-50 and ResNet-101. Additionally, due to recent interest in reducing the computational overhead of ConvNets for deployment purposes (wang2020pay; stamoulis2019single; sandler2018mobilenetv2; chin2019legr; wu2019fbnet), we also consider a compact network, i.e., MobileNetV2 (sandler2018mobilenetv2). Due to computational considerations, for DELTA with the other networks, we inherit the learning rate, weight decay, and momentum tuned for ResNet-18 on each dataset.

As shown in Table 6, Renofeation achieves clean-data performance comparable to that of DELTA and has robustness similar to re-training across all ConvNets we have investigated.
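The "Average" column of Table 6, and the 15x reduction quoted for ResNet-50 in the abstract, can be re-derived from the per-dataset ASR entries. A small check script, assuming the values below are transcribed correctly from Table 6. Because the per-dataset entries are themselves rounded, recomputed averages can differ from the table in the last digit (for example, MobileNetV2 DELTA recomputes to about 50.74 rather than 50.75).

```python
from typing import Dict, List

# Per-dataset ASR (%) from Table 6, in the order Dog, Bird, Action, Indoor, Flower.
TABLE6_ASR: Dict[str, Dict[str, List[float]]] = {
    "ResNet-18": {"DELTA": [95.65, 58.83, 93.51, 79.71, 43.65],
                  "Renofeation": [9.83, 3.41, 7.16, 11.08, 2.86]},
    "ResNet-50": {"DELTA": [94.69, 32.29, 91.94, 84.69, 24.84],
                  "Renofeation": [5.08, 3.96, 3.33, 7.12, 2.39]},
    "ResNet-101": {"DELTA": [88.03, 42.60, 87.53, 89.27, 44.04],
                   "Renofeation": [4.38, 3.54, 3.69, 9.95, 3.09]},
    "MobileNetV2": {"DELTA": [82.89, 40.30, 57.00, 52.45, 21.08],
                    "Renofeation": [11.62, 6.79, 5.92, 7.12, 2.84]},
}


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


def summarize(table: Dict[str, Dict[str, List[float]]]) -> Dict[str, Dict[str, float]]:
    """Average ASR per network and the factor by which Renofeation lowers it."""
    out: Dict[str, Dict[str, float]] = {}
    for net, rows in table.items():
        delta_avg, reno_avg = mean(rows["DELTA"]), mean(rows["Renofeation"])
        out[net] = {"DELTA": delta_avg, "Renofeation": reno_avg,
                    "reduction": delta_avg / reno_avg}
    return out


if __name__ == "__main__":
    for net, s in summarize(TABLE6_ASR).items():
        print(f"{net}: {s['DELTA']:.2f}% -> {s['Renofeation']:.2f}% "
              f"({s['reduction']:.1f}x lower)")
```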

5 Conclusion

In this work, we have proposed Renofeation, a transfer learning method that is significantly more robust to adversarial attacks based on the pre-trained model than state-of-the-art transfer learning based on fine-tuning. Moreover, under the considered threat model (the adversary only has access to the pre-trained model), Renofeation is preferable to combining fine-tuning with adversarial training using three steps of projected gradient descent. In contrast to transfer learning methods based on fine-tuning, the key ingredients of our approach are randomly initialized weights (as opposed to pre-trained weights) and noisy feature distillation. To achieve noisy feature distillation, we incorporate two deep learning regularization techniques, namely spatial dropout and stochastic weight averaging. We have conducted extensive experiments, including a comprehensive ablation study, to demonstrate the effectiveness of the proposed method compared to its competitors.

While the threat model under consideration is relatively new (Rezaei2020A), it is crucial to improve robustness under this threat model due to the practical popularity of fine-tuning. This work takes a first step towards improving the robustness under this threat model and sheds light on this topic for future study.