Towards Evaluating and Understanding Robust Optimisation under Transfer

05/07/2019 ∙ by Todor Davchev, et al.

This work evaluates the efficacy of adversarial robustness under transfer from CIFAR100 to CIFAR10. This allows us to identify transfer learning strategies under which adversarial defences are successfully retained, in addition to revealing potential vulnerabilities. We study the extent to which features crafted by the fast gradient sign method (FGSM) and its iterative alternative (PGD) preserve their defence properties against black- and white-box attacks under three different transfer learning strategies. We find that using PGD examples during training leads to more general robustness that is easier to transfer. Furthermore, under successful transfer, it achieves 5.2% higher accuracy against white-box PGD attacks than the considered baselines. In this paper, we study the effects of using robust optimisation in the source and target networks. Our empirical evaluation sheds light on how well such mechanisms generalise while achieving comparable results to non-transferred defences.




1 Introduction

Machine learning models, in general, are known to be vulnerable to adversarial examples. For instance, certain imperceptible perturbations to the input can result in an incorrect classification Szegedy et al. (2013); Jia & Liang (2017). This should be concerning when such methods need to be deployed in safety-critical applications, such as autonomous vehicles or surgical robots Warde-Farley & Goodfellow (2016); Yuan et al. (2019). A notable formulation of robustness against adversarial attacks is that of Madry et al. who formulate it as a "robust optimisation"-based problem:

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[\, \max_{\delta \in \mathcal{S}} \mathcal{L}(\theta, x+\delta, y) \,\Big]$$

Here, $\mathcal{L}$ is a loss specified in advance, where the tuple $(x, y)$ of the input and its label is sampled from some distribution $\mathcal{D}$. In this formulation, the adversary’s task is to maximise the inner optimisation problem while the defender minimises the outer one. The adversary is specified through some threat model and is realised as a set of allowed perturbations $\mathcal{S}$. A defence can then be learned by augmenting the training data with adversarial examples Szegedy et al. (2014).

Transfer learning (TL) is a commonly used deep learning technique. It has been shown to improve performance as well as speed up training on a variety of tasks Pan & Yang (2010b). Despite this, little has been done towards assessing adversarial robustness under the scope of TL. We hypothesise that evaluating the efficacy of robust features under transfer can help us identify strategies under which adversarial defences are successfully retained. This can allow us to identify more informative ways to both defend and attack neural networks.

Our empirical evaluation indicates that adversarial robustness against black-box (BB) attacks transfers more consistently. In white-box (WB) scenarios, defence mechanisms benefit from using robust optimisation in both the source and the target. Further, adversarial training using PGD leads to learning more general robust features that maintain their properties under transfer better than the alternative. In summary, our empirical findings are the following:

  • Robustness: We compare the level of transferred robustness between two tasks. PGD-based defences were easier to transfer than FGSM-based ones. Defending against simpler attacks required robust lower-level features only, which are easier to transfer.

  • Generalisation: We evaluate the ability to generalise against two threat models. We achieve 5.2% higher accuracy against WB PGD adversaries using robust weight initialisation as well as adversarial examples when training the target.

  • Performance: We study the performance of different combinations of robust and clean optimisation routines. We visualise the results using normalised heatmaps and a complete table with accuracies.

Figure 1: This figure shows the general outline of the evaluation pipeline. The subscripts follow the nomenclature describing how each network is trained. Nat stands for clean training, meaning no adversarial examples were used, whereas adv stands for adversarial training. For the networks obtained through transfer learning, we first state the method of training the source (nat or adv) and then the method of transfer learning (nat or adv). In this context, the CIFAR100 networks (pink) are only an intermediary step on which we perform transfer learning.

2 Methodology

2.1 Transfer learning

Transfer learning in CNNs can be achieved through retraining using various strategies Yosinski et al. (2014). One can use the network as a feature extractor by freezing all the layers and only retraining the last one Sharif Razavian et al. (2014), or fine-tune a larger part of the network Oquab et al. (2014). However, adversarial robustness may not necessarily be transferable in this process. We therefore evaluate this property using the following three learning strategies: a) freeze all layers and retrain only the final layer; b) unfreeze only the last block of the network; and c) retrain the whole network, essentially using the source as an initialisation strategy.
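The three strategies can be sketched as follows. This is a minimal, framework-agnostic illustration: the layer names (`block1`–`block3`, `fc`) are hypothetical stand-ins for ResNet56's stages, and in a real framework such as PyTorch the `trainable` flag would correspond to each parameter's `requires_grad` attribute.

```python
def make_resnet56_stub():
    """A toy stand-in for ResNet56: three blocks plus a final classifier.
    The layer names are illustrative, not the real module names."""
    return [{"name": f"block{i}", "trainable": True} for i in (1, 2, 3)] + \
           [{"name": "fc", "trainable": True}]

def strategy_a(net):
    """a) Freeze everything, retrain only the final layer (feature extractor)."""
    for layer in net:
        layer["trainable"] = (layer["name"] == "fc")
    return net

def strategy_b(net):
    """b) Unfreeze only the last block (and the classifier on top of it)."""
    for layer in net:
        layer["trainable"] = layer["name"] in ("block3", "fc")
    return net

def strategy_c(net):
    """c) Retrain everything: the source weights act only as an initialisation."""
    for layer in net:
        layer["trainable"] = True
    return net
```

In a real PyTorch implementation, the same effect is achieved by setting `param.requires_grad = False` on the frozen parameters before constructing the optimiser.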

2.2 Setup details

In our experiments we used a ResNet56 network, an architecture specifically designed for the CIFAR datasets He et al. (2015). Transfer learning is from CIFAR100 to CIFAR10, where the images from both datasets were re-scaled to pixel values in [-1,1].
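The rescaling step can be sketched as follows, assuming the usual 8-bit pixel encoding in [0, 255]; the exact preprocessing code is not given in the paper:

```python
def rescale(pixels):
    """Map 8-bit pixel values in [0, 255] to the range [-1.0, 1.0]."""
    return [p / 127.5 - 1.0 for p in pixels]
```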

2.3 Threat models

As threat models, we use as adversaries the Fast Gradient Sign Method (FGSM) Goodfellow et al. (2015) and the Projected Gradient Descent (PGD) algorithm Kurakin et al. (2016) which is an iterative variant of FGSM.

FGSM creates an adversarial example $x^{adv}$ by following the gradient of the loss function with respect to the input, evaluated at the true label $y$. It then takes a single step in that direction:

$$x^{adv} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}(\theta, x, y)\big)$$

PGD turns FGSM into an iterative attack, ensuring at each step that the adversarial example stays within the $\ell_\infty$ ball around $x$ with radius $\epsilon$:

$$x^{t+1} = \Pi_{x+\mathcal{S}}\Big( x^{t} + \alpha \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}(\theta, x^{t}, y)\big) \Big)$$

where $t$ denotes the iteration and $\alpha$ the step taken at each iteration.
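As a concrete illustration, the two attacks can be implemented as follows for a simple logistic-regression model, whose input gradient has a closed form. This is a hedged sketch: the model, the weight vector `w`, and the helper names are ours, not the paper's; in practice the gradient would come from backpropagation through the full network.

```python
import numpy as np

def loss_and_grad_x(w, x, y):
    """Cross-entropy of a logistic model p = sigmoid(w.x) and its gradient
    with respect to the *input* x (not the weights), for a label y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_x = (p - y) * w          # analytic d(loss)/dx for this model
    return loss, grad_x

def fgsm(w, x, y, eps):
    """Single-step attack: move eps along the sign of the input gradient."""
    _, g = loss_and_grad_x(w, x, y)
    return x + eps * np.sign(g)

def pgd(w, x, y, eps, alpha, iters):
    """Iterative FGSM, projecting back onto the l_inf ball of radius eps."""
    x_adv = x.copy()
    for _ in range(iters):
        _, g = loss_and_grad_x(w, x_adv, y)
        x_adv = x_adv + alpha * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)   # projection step
    return x_adv
```

Both functions return a perturbed input whose distance from the original is at most `eps` in every coordinate; PGD is at least as strong as FGSM on this toy model because it can refine the step direction at each iteration.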

An intriguing property of FGSM is label leaking, where the network achieves greater accuracy on the adversarial examples than on the clean ones Kurakin et al. (2016). This most likely happens because FGSM produces a predictable perturbation which the network is able to identify. To avoid this effect, one can replace the true label with the most likely label predicted by the model.

Establishing the defence objective

When performing adversarial training we modify the loss to be the average of the adversarial and clean losses, with equal weights. We weighted the two losses equally because we were aiming for both robustness and accuracy. Cross-entropy is used for the individual losses.

Each batch had 200 examples: 100 clean images and 100 adversarial examples generated from those clean images using the most recent parameters of the network. We used a standard value of $\epsilon$, corresponding to a fixed pixel-intensity budget. For the PGD adversary, we used 7 iterations and the step size used by Madry et al. (2018). We trained and used a DenseNet121 to construct the BB attacks.
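The batch construction and the equally weighted objective can be sketched as follows. This is a minimal illustration; `attack_fn` is a hypothetical stand-in for the FGSM/PGD generator run with the network's current parameters.

```python
import random

def build_training_batch(clean_examples, attack_fn, batch_size=200):
    """Assemble one training batch: half clean images, half adversarial
    examples generated from those same clean images via attack_fn."""
    half = batch_size // 2
    clean = random.sample(clean_examples, half)
    adversarial = [attack_fn(x) for x in clean]
    return clean + adversarial

def combined_loss(clean_losses, adv_losses):
    """Equal-weight average of the clean and adversarial cross-entropy losses."""
    mean = lambda v: sum(v) / len(v)
    return 0.5 * mean(clean_losses) + 0.5 * mean(adv_losses)
```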

3 Experimental Evaluation

We evaluate the effect of using transfer learning in the process of building defence mechanisms against adversarial attacks. We perform a series of empirical evaluations and report our results in this section.

Figure 1 introduces the general outline of the pipeline for evaluation. For brevity, we denote a network learned with transfer learning by its source/target training types, each of which is either nat or adv. For example, nat_adv was trained with clean examples (nat) on the source domain and used both clean and adversarial examples (adv) during transfer. adv_adv was trained with both clean and adversarial examples from the same threat model in both source and target tasks. We evaluate the networks using clean accuracy, as well as accuracy against 4 adversaries, namely BB and WB attacks using FGSM and PGD.

3.1 Empirical Analysis of Transferring Robust Features

Figure 2: Heatmap comparison of the per-column normalised results, where 1 indicates the highest result and 0 the lowest. The table contains an empirical evaluation of the transferability of each defence routine for all considered strategies. Horizontal red lines separate transfer strategies, where each model is named using the source/target type of training and the number describes the number of unfrozen layers. No number indicates the model was trained without transfer, and 56 indicates complete retraining using the source’s weights for initialisation. ’’ indicates no label leaking considered. Vertical red lines separate clean accuracy from the robustness achieved against WB and BB attacks. Percentage values on top represent the per-column min/max accuracy. Overall, using PGD in the source results in higher ratios of transferred robustness compared to using FGSM. Keeping only lower-level features frozen leads to the highest amounts of preserved robustness against BB attacks.

Adversarial attacks have been shown to successfully target both known and unknown architectures. Current defence systems target attacks on a specific task and architecture. In practice, however, we would like to learn representations that generalise to different settings. In this section, we evaluate the ability of defence systems to generalise to different tasks in the context of transfer learning. We examine the amount of transferable robustness from a source task to a target by training three independent neural networks on CIFAR100: one that does not use adversarial examples during training and hence has no defence mechanism, one that uses examples generated using FGSM, and a final network that uses examples generated using PGD. We report our results in Figure 2.

The figure depicts the amount of transferred robustness across all strategies as a min-max normalised per-attack heatmap. The first three rows show the robustness of the baseline networks that were trained without transfer. Using PGD or FGSM samples during training performed equally well against BB attacks. In the former case, however, the resulting network is more robust against PGD-based WB attacks and less robust against FGSM-based ones. When training with FGSM-generated samples we observe the opposite. These observations align with those made in Madry et al. (2018), but we found the FGSM-trained network to be more robust against PGD, most likely because we account for label leaking.

The next three blocks of rows (9 rows in total) report the results of transferring robustness using the three different strategies. Unlike the case without transfer, defence mechanisms developed using PGD are more likely to preserve the robustness of the learned features against BB attacks when used on the different task. However, both approaches seem to transfer proportionally the same amount of the achieved robustness against WB attacks. WB attacks are tailored to a specific architecture, hence directly transferring robustness was not expected to be as successful.

Overall, iterative learning results in more intricate and general features. However, the two tasks are distinct enough to not allow direct reuse of the learned features (see the second block of results in Figure 2). Regardless, unfreezing the final block of ResNet56 resulted in an almost complete transfer of the defence against BB attacks. This suggests that robust low-level features are sufficient to defend just as well against such simpler attacks, and such features are easier to transfer too. Recent work by Ilyas et al. (2019) made similar observations and proposed a theoretical framework for studying such features; unlike us, however, the authors do not focus on transferability between tasks.

WB attacks, however, seem to be more successful at targeting aspects of the representation related to higher levels of abstraction, such as representations of the objects and sub-objects present in the input. Such attacks have also been shown to sometimes get stuck at local minima, resulting in weaker attacks Carlini & Wagner (2016). Using robust features as an initialisation does not seem to be as effective when training the target network with clean examples only. That said, robust initialisation can potentially allow iterative methods to overcome the above limitation by ensuring a better starting point for the optimisation procedure.

Finally, none of the reported methods fully matched or exceeded the performance of the baselines. In the next section we attempt to combine our findings with adversarial training applied on the target task as well.

Figure 3: Heatmap comparison of the per-column normalised results across different training routines. The table compares the use of adversarial examples in different combinations between source and target. Horizontal red lines separate different training routines. The numerical values within the heatmap are the per-column normalised results. ’’ indicates no label leaking taken into account. Vertical red lines separate clean accuracy from the robustness achieved against WB and BB attacks. Percentage values on top represent the per-column min/max accuracy. Overall, good weight initialisation leads to improved performance of PGD-based defence mechanisms, with robust initialisation being the most effective.

3.2 Improving Defence Mechanisms with Transfer

Successfully applying transfer learning has been shown to improve performance as well as speed up training on a variety of tasks Pan & Yang (2010a); Yosinski et al. (2014). This suggests it can potentially enable building stronger, more general defence mechanisms, as well as more complex attacks. We study the extent to which transfer learning can help improve established defence mechanisms against adversarial attacks, and the effects this has on clean accuracy.

To this end, we compare the performance of an exhaustive list of models under adversarial attack, following the outline in Figure 1. Figure 3 reports the performance of the best models for each of the 3 transfer learning strategies. Those omitted did not transfer robustness and are removed for brevity. The complete tables of results, in % and as a heatmap, are provided in the Appendix. As baselines we use the non-transferred networks, as well as networks that used transfer but did not have any learned defence mechanisms.

Using robust features as an initialisation did not lead to positive results in the previous section. However, when combined with robust optimisation applied to the target network, it improved performance against WB attacks while maintaining robustness similar to the baselines’ against BB attacks. In fact, robust initialisation achieves a 5.2% accuracy improvement against WB PGD attacks, in line with recent observations about the properties of pre-training Hendrycks et al. (2019). Training by unfreezing only the last block of layers did not result in successful transfer, even though we used adversarially perturbed examples during training. This is somewhat expected, as we already saw that the lower-level features are easier to attack, and all attacks managed to exploit this. Finally, robust initialisation seems to have a negative effect when combined with non-iterative methods. This again correlates with Athalye et al. (2018) and can be interpreted as ensuring that iterative attacks during training do not get stuck in local minima, which in turn ensures a stronger defence. Hypothetically, a similar approach could lead to building stronger attacks too. Unfreezing the final block of ResNet gets close to the baseline results while requiring fewer resources. Nevertheless, both networks obtain considerably worse clean accuracy.

4 Conclusions and Future Work

In this work, we investigated the use of transfer learning in the context of defending against adversarial perturbations. We showed that using FGSM and PGD during training results in different behaviour under transfer: PGD learns more general features that are easier to transfer to a different task. We found that lower-level features by themselves play a significant role in robustness against both WB and BB attacks and seem to be more transferable among tasks. Moreover, we showed that initialising with robust features can help improve the overall achieved robustness. When using PGD samples during re-training, our analysis led to a 5.2% robustness improvement against a WB PGD adversary compared to the considered baselines, and an overall stronger defence. A combination of freezing low-level features and training the final block of ResNet56 provides a good trade-off, close to the best achieved results while requiring substantially less training time.

The reported results suggest that the current success against BB attacks can be achieved by focusing only on the lower-level features of the network. WB attacks, on the other hand, are able to target more complicated, higher-level, "categorical" features, which makes them more challenging to defeat. Building attacks that better exploit this observation could result in more challenging adversaries.

In the future, we aim to further investigate the performance of defence mechanisms against a broader range of attacks and under transfer to different architectures. Further, we want to better understand the theoretical implications of the reported findings. Finally, we plan to extend the evaluation to control tasks in a simulated or a real-world setting.


The authors would like to thank Antreas Antoniou for helpful discussions and technical advice, Michael Burke and Ben Krause for feedback on an early draft of the paper, and the anonymous reviewers for their comments. This work was supported in part by an EPSRC Industrial CASE award funded by Thales.