Exploring Adversarial Defense Robustness on Transfer Learning
This work evaluates the efficacy of adversarial robustness under transfer from CIFAR100 to CIFAR10. This allows us to identify transfer learning strategies under which adversarial defences are successfully retained, in addition to revealing potential vulnerabilities. We study the extent to which features learned from adversarial examples crafted by the fast gradient sign method (FGSM) and its iterative alternative (PGD) can preserve their defence properties against black-box and white-box attacks under three different transfer learning strategies. We find that using PGD examples during training leads to more general robustness that is easier to transfer. Furthermore, under successful transfer, it achieves 5.2% more accuracy against white-box PGD attacks than the considered baselines. In this paper, we study the effects of using robust optimisation in the source and target networks. Our empirical evaluation sheds light on how well such mechanisms generalise while achieving comparable results to non-transferred defences.
Machine learning models, in general, are known to be vulnerable to adversarial examples. For instance, certain imperceptible perturbations to the input can result in an incorrect classification Szegedy et al. (2013); Jia & Liang (2017). This should be concerning when such methods need to be deployed in safety-critical applications, such as autonomous vehicles or surgical robots Warde-Farley & Goodfellow (2016); Yuan et al. (2019). A notable formulation of robustness against adversarial attacks is that of Madry et al. (2018), who cast it as a "robust optimisation" problem:
$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\delta \in \mathcal{S}} L(\theta, x + \delta, y) \right]$$
Here, $L$ is a loss specified in advance and $\theta$ are the network parameters, where the tuple $(x, y)$ of the input and its label is sampled from some distribution $\mathcal{D}$. In this formulation, the adversary's task is to maximise the inner optimisation problem while the defender minimises the outer one. The adversary is specified through some threat model and is realised as a set of allowed perturbations $\mathcal{S}$. A defence can then be learned by augmenting the training data with adversarial examples Szegedy et al. (2014).
Transfer learning (TL) is a commonly used deep learning technique. It has been shown to improve performance as well as speed up training on a variety of tasks Pan & Yang (2010b). To date, little has been done towards assessing adversarial robustness under the scope of TL. We hypothesise that evaluating the efficacy of robust features under transfer can help us identify strategies under which adversarial defences are successfully retained. This can allow us to identify more informative ways to both defend and attack neural networks.
Our empirical evaluation indicates that adversarial robustness against black-box (BB) attacks transfers more consistently. In white-box (WB) scenarios, defence mechanisms benefit from using robust optimisation in both source and target. Further, adversarial training using PGD leads to learning more general robust features that can maintain their properties under transfer better than the alternative. With this, our empirical findings are the following:
Robustness: We compared the level of transferred robustness between two tasks. PGD-based defences were easier to transfer than FGSM-based ones. Defending against simpler attacks only required robust lower-level features, which are easier to transfer.
Generalisation: We evaluate the ability to generalise against two threat models. We achieve 5.2% higher accuracy against WB PGD adversaries by using robust weight initialisation together with adversarial examples when training the target network.
Performance: We study the performance of different combinations of robust and clean optimisation routines. We visualise the results using normalised heatmaps and a complete table with accuracies.
Transfer learning in CNNs can be achieved through retraining using various strategies Yosinski et al. (2014). One can use the network as a feature extractor by freezing all the layers and only retraining the last one Sharif Razavian et al. (2014), or fine-tune a larger part of the network Oquab et al. (2014). However, adversarial robustness may not necessarily be transferable in this process. We therefore evaluate this property using the following three learning strategies: a) freeze all layers and retrain only the final layer, b) unfreeze only the last block of our network, and c) retrain the whole network, essentially using the source as an initialisation strategy.
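The three strategies can be summarised as a choice of which parts of the network receive gradient updates on the target task. The sketch below is only illustrative: the block names are hypothetical and not taken from the paper, and it assumes that the final classification layer is always retrained, since the label set changes from CIFAR100 to CIFAR10.

```python
# Hypothetical block names for a ResNet-style network (an assumption for illustration).
BLOCKS = ["stem", "block1", "block2", "block3", "classifier"]


def trainable_blocks(strategy, blocks=BLOCKS):
    """Return the blocks that are updated on the target task."""
    if strategy == "last_layer":
        # (a) feature extractor: everything frozen except the final layer
        return blocks[-1:]
    if strategy == "last_block":
        # (b) unfreeze the last block; the classifier is assumed to be retrained too
        return blocks[-2:]
    if strategy == "full":
        # (c) the source weights only act as an initialisation
        return list(blocks)
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("last_layer", "last_block", "full"):
    print(name, trainable_blocks(name))
```

In a deep learning framework, the returned names would typically decide which parameters are passed to the optimiser or marked as requiring gradients.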
As adversaries, we use the Fast Gradient Sign Method (FGSM) Goodfellow et al. (2015) and the Projected Gradient Descent (PGD) algorithm Kurakin et al. (2016), which is an iterative variant of FGSM.
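FGSM creates an adversarial example $x_{adv}$ by following the sign of the gradient of the loss $L$ with respect to the input $x$, computed using the true label $y$ and the network parameters $\theta$. It takes a single step of size $\epsilon$ in that direction:
$$x_{adv} = x + \epsilon \, \mathrm{sign}\left(\nabla_x L(\theta, x, y)\right)$$
PGD turns FGSM into an iterative attack, ensuring at each step that the adversarial example is within the ball around $x$ with radius $\epsilon$:
$$x^{t+1} = \Pi_{x + \mathcal{S}}\left(x^{t} + \alpha \, \mathrm{sign}\left(\nabla_x L(\theta, x^{t}, y)\right)\right)$$
where $\Pi_{x + \mathcal{S}}$ is the projection onto that ball, $t$ denotes the iteration number and $\alpha$ the step taken at each iteration.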
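To make the two attacks concrete, here is a minimal sketch on a toy logistic-regression model, chosen only because its input gradient has a closed form. The model, the helper names and the numerical values are assumptions for illustration and are not the networks or settings used in the paper. The ball is taken to be an L-infinity ball, and clipping to a valid pixel range is omitted.

```python
import math


def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def loss_grad_x(w, b, x, y):
    """Gradient of the binary cross-entropy w.r.t. the input x
    for the model p = sigmoid(w.x + b) with label y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]


def fgsm(w, b, x, y, eps):
    """Single signed-gradient step of size eps."""
    g = loss_grad_x(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def pgd(w, b, x, y, eps, alpha, steps):
    """Repeated signed-gradient steps of size alpha, each followed by a
    projection back onto the L-infinity ball of radius eps around x."""
    x_adv = list(x)
    for _ in range(steps):
        g = loss_grad_x(w, b, x_adv, y)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Illustrative values only.
w, b, x, y = [1.0, -2.0], 0.0, [0.5, 0.5], 1
print(fgsm(w, b, x, y, eps=0.1))
print(pgd(w, b, x, y, eps=0.1, alpha=0.03, steps=7))
```

For a linear model the sign of the gradient does not change between iterations, so both attacks end up at the same corner of the ball; the difference between them only becomes visible for non-linear networks.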
An intriguing property of FGSM is label leaking, where the network achieves greater accuracy on the adversarial examples than on the clean ones Kurakin et al. (2016). This most likely happens because FGSM produces a predictable perturbation which the network is able to identify. To avoid this effect, one can replace the true label with the most likely label predicted by the model.
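The label substitution can be sketched as a small helper that decides which label the attack gradient is computed against. The function name and signature are hypothetical and only illustrate the idea described above.

```python
def attack_label(probs, true_label, avoid_label_leaking=True):
    """Pick the label used to compute the attack gradient.

    probs is the model's predicted class distribution for one clean input.
    """
    if avoid_label_leaking:
        # Use the model's most likely class so that the perturbation
        # carries no information about the ground-truth label.
        return max(range(len(probs)), key=lambda k: probs[k])
    return true_label


print(attack_label([0.1, 0.7, 0.2], true_label=2))
print(attack_label([0.1, 0.7, 0.2], true_label=2, avoid_label_leaking=False))
```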
Establishing the defence objective
When performing adversarial training, we modify the loss to be the average of the adversarial and clean losses, i.e. each is weighted by $0.5$. We weighted the two losses equally because we were aiming for both robustness and accuracy. Cross-entropy is used for the individual losses.
Each batch had 200 examples: 100 clean images and 100 adversarial examples generated from the current clean examples using the most recent parameters of the network. We used a standard perturbation budget $\epsilon$. For the PGD adversary, we used 7 iterations with the step size following Madry et al. (2018). We trained and used a DenseNet 121 to construct the BB attacks.
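The combined objective can be sketched as follows, assuming the model's predicted class probabilities for the clean and the adversarial half of the batch are already available. The function names and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one predicted distribution against an integer label."""
    return -math.log(probs[label])


def mixed_loss(clean_probs, adv_probs, labels, adv_weight=0.5):
    """Weighted average of the mean clean and mean adversarial cross-entropy.

    adv_weight=0.5 corresponds to weighting the two losses equally.
    """
    n = len(labels)
    clean = sum(cross_entropy(p, y) for p, y in zip(clean_probs, labels)) / n
    adv = sum(cross_entropy(p, y) for p, y in zip(adv_probs, labels)) / n
    return (1.0 - adv_weight) * clean + adv_weight * adv


# Toy example: one image, two classes, less confident on the adversarial copy.
print(mixed_loss([[0.5, 0.5]], [[0.25, 0.75]], labels=[0]))
```

Because each adversarial example is generated from a clean example in the same batch, the two halves share the same labels, which is why a single `labels` list is passed.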
We evaluate the effect of using transfer learning in the process of building defence mechanisms against adversarial attacks. We perform a series of empirical evaluations and report our results in this section.
Figure 1 introduces the general outline of the pipeline for evaluation. For brevity, we define $N_{a \to b}$, where $a, b \in \{\text{nat}, \text{adv}\}$, as a network ($N$) learned with transfer learning. For example, $N_{\text{nat} \to \text{adv}}$ was trained with clean examples (nat) on the source domain and used both clean and adversarial examples (adv) during transfer. $N_{\text{adv} \to \text{adv}}$ was trained with both clean and adversarial examples from the same threat model in both source and target tasks. We evaluate the networks using the clean accuracy, as well as the accuracy against four adversaries, namely BB and WB attacks using FGSM and PGD.
Threat models have been shown to successfully attack both known and unknown architectures. Current defence systems target specific attacks on a specific task and architecture. In practice, however, we would like to learn representations that can generalise to different settings. In this section, we evaluate the ability of defence systems to generalise to different tasks in the context of transfer learning. We examine the amount of transferable robustness from a source task to some target by training three independent neural networks on CIFAR100, namely: one that does not use adversarial examples during training and hence has no defence mechanism, one that uses examples generated using FGSM, and a final network that uses examples generated using PGD. We report our results in Figure 2.
The figure depicts the amount of transferred robustness across all strategies as a heatmap that is min-max normalised per attack. The first three rows show the robustness of the baseline networks that were trained with no transfer. Using PGD or FGSM samples during training performed equally well against BB attacks. In the former case, however, the resulting network is more robust against PGD-based WB attacks and less robust against FGSM-based ones. When training using FGSM-generated samples we achieve the opposite results. These observations align with the ones made in Madry et al. (2018), but we found FGSM to be more robust against PGD, most likely because we account for label leaking.
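Per-attack min-max normalisation rescales each attack's column of accuracies so that the weakest model maps to 0 and the strongest to 1, which makes columns with very different accuracy ranges comparable in one heatmap. A minimal sketch follows; the model names and accuracy values are made up for illustration and are not results from the paper.

```python
def minmax_per_attack(acc):
    """acc maps model name -> {attack name -> accuracy}.

    Returns the same structure with every attack column rescaled to [0, 1].
    """
    attacks = list(next(iter(acc.values())).keys())
    out = {model: {} for model in acc}
    for attack in attacks:
        column = [acc[model][attack] for model in acc]
        lo, hi = min(column), max(column)
        for model in acc:
            # A constant column carries no ranking information; map it to 0.
            out[model][attack] = 0.0 if hi == lo else (acc[model][attack] - lo) / (hi - lo)
    return out


# Made-up accuracies (in %) purely for illustration.
example = {
    "no_defence": {"bb_fgsm": 10.0, "wb_pgd": 5.0},
    "fgsm_trained": {"bb_fgsm": 30.0, "wb_pgd": 5.0},
    "pgd_trained": {"bb_fgsm": 20.0, "wb_pgd": 5.0},
}
normalised = minmax_per_attack(example)
print(normalised)
```

Note that the normalised values only express relative standing within one attack column, so they should be read together with the absolute accuracies.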
The next three blocks of rows (nine rows in total) report the results from transferring robustness using the three different strategies. Unlike the case without transfer, defence mechanisms developed using PGD are more likely to preserve the robustness of the learned features against BB attacks when used on a different task. However, both approaches seem to transfer proportionally the same amount of the achieved robustness against WB attacks. WB attacks are tailored to a specific architecture, hence directly transferring robustness was not expected to be as successful.
Overall, iterative learning results in more intricate and general features. However, the two tasks are distinct enough to not allow for the direct use of the learned features (see the second block of results in Figure 2). Regardless, unfreezing the final block of ResNet56 resulted in an almost complete transfer of the defence against BB attacks. This suggests that robust low-level features are sufficient to defend just as well against such simpler attacks. Such features are also easier to transfer. Recent work by Ilyas et al. (2019) made similar observations and proposed a theoretical framework for studying such features. Unlike us, the authors do not focus on the transferability between tasks.
WB attacks, however, seem to be more successful at targeting aspects of the representation related to the higher levels of abstraction, such as representations of the objects and sub-objects that are present in the input. Such attacks have nevertheless been shown to sometimes get stuck in local minima, resulting in weaker attacks Carlini & Wagner (2016). Using robust features as initialisation does not seem to have as good an effect when training the target network with clean examples only. That said, they can potentially allow iterative methods to overcome the above limitations by ensuring a better starting point for the optimisation procedure.
Finally, none of the reported methods fully matched or exceeded the performance of the baselines. In the next section we attempt to combine our findings with adversarial training applied on the target task as well.
Successfully applying transfer learning has been shown to improve performance as well as speed up training on a variety of tasks Pan & Yang (2010a); Yosinski et al. (2014). This suggests it can potentially enable building stronger, more general defence mechanisms as well as more complex attacks. We study the extent to which transfer learning can help us improve established defence mechanisms against adversarial attacks, and the effect this has on clean accuracy.
To this end, we compare the performance of an exhaustive list of models under adversarial attacks, following the outline in Figure 1. Figure 3 reports the performance of the best models for each of the three transfer learning strategies. The omitted models did not transfer robustness and are thus removed for brevity. The complete tables of results, in % and as a heatmap, are provided in the Appendix. As baselines, we use the non-transferred networks as well as networks that used transfer but did not have any learned defence mechanisms.
Using robust features as initialisation did not lead to positive results in the previous section. However, when combined with robust optimisation applied on the target network, it improved performance against WB attacks while maintaining robustness similar to the baselines' against BB attacks. In fact, robust initialisation combined with adversarial training on the target achieves a 5.2% accuracy improvement against WB PGD attacks, in line with recent observations about the properties of pre-training Hendrycks et al. (2019). Training by unfreezing the last block of layers only did not result in successful transfer, even though we used adversarially perturbed examples during training. This is somewhat expected, as we already saw that the lower-level features are easier to attack, and thus all attacks managed to exploit this. Finally, robust initialisation seems to have a negative effect when using non-iterative methods. This again correlates with Athalye et al. (2018) and can be interpreted as ensuring that iterative attacks during training do not get stuck in local minima, which in turn builds a stronger defence. Hypothetically, a similar approach could lead to building stronger attacks too. Unfreezing the final block of ResNet gets close to the baseline results but requires fewer resources. Nevertheless, both networks obtain considerably worse clean accuracy.
In this work, we investigate the use of transfer learning in the context of defending against adversarial perturbations. We showed that using FGSM and PGD during training results in different behaviour under transfer. PGD learns more general features that are easier to transfer to a different task. We found that lower-level features by themselves play a significant role in robustness against both WB and BB attacks and seem to be more transferable among tasks. Moreover, we showed that initialising with robust features can help improve the overall achieved robustness. When using PGD samples during re-training, our analysis led to a 5.2% robustness improvement against a WB PGD adversary for the network initialised with robust features, compared to the considered baselines, and an overall stronger defence. A combination of freezing low-level features and training the final block of ResNet56 provides a good trade-off that is close to the best achieved results while requiring considerably less training time.
The reported results suggest that the current success against BB attacks can be achieved just by focusing on the lower-level features of the network. On the other hand, WB attacks are able to target more complicated, higher-level, "categorical" features, which makes them more challenging to defend against. Building attacks that can better exploit this observation could result in more challenging adversaries.
In the future, we aim to further investigate the performance of defence mechanisms on a broader range of attacks and under transfer across different architectures. Further, we want to better understand the theoretical implications of the reported findings. Finally, we plan to extend the evaluation to control tasks in a simulated or a real-world setting.
The authors would like to thank Antreas Antoniou for helpful discussions and technical advice, Michael Burke and Ben Krause for feedback on an early draft of the paper, and the anonymous reviewers for their comments. This work was supported in part by an EPSRC Industrial CASE award funded by Thales.
Learning and transferring mid-level image representations using convolutional neural networks.In