Defending Against Misclassification Attacks in Transfer Learning

08/29/2019 ∙ by Bang Wu, et al. ∙ Monash University 0

Transfer learning accelerates the development of new models (Student Models). It applies relevant knowledge from a pre-trained model (Teacher Model) to the new ones with a small amount of training data, without degrading model accuracy. However, these Teacher Models are normally open in order to facilitate sharing and reuse, which creates an attack surface in transfer learning systems. Among others, recently emerging attacks demonstrate that adversarial inputs can be built with negligible perturbations to the normal inputs. Such inputs can mimic the internal features of the Student Models directly based on knowledge of the Teacher Models and cause misclassification in final predictions. In this paper, we propose an effective defence against the above misclassification attacks in transfer learning. First, we propose a distilled differentiator that can address the targeted attacks, where adversarial inputs are misclassified to a specific class. Specifically, this dedicated differentiator is designed with network activation pruning and retraining in a fine-tuned manner, so as to reach high defence rates and high model accuracy. To address the non-targeted attacks that misclassify adversarial inputs to randomly selected classes, we further employ an ensemble structure of the differentiators to cover all possible misclassifications. Our evaluations over common image recognition tasks confirm that the Student Models applying our defence can reject most of the adversarial inputs with a marginal accuracy loss. We also show that our defence outperforms prior approaches in both targeted and non-targeted attacks.




1. Introduction

Transfer learning is widely applied in the development of machine learning systems and promoted through many commercialised machine learning services, e.g., Google Cloud AI (Google, 2019). With less training data and fewer computation resources, new models (Student Models) can be developed by transferring relevant knowledge (i.e., copying and slightly tuning network structures and parameters) from a pre-trained model (Teacher Model) on a similar task (Pan and Yang, 2010). In a typical deep learning task, as the scale of the networks grows, the amount of training data needed to obtain high-accuracy models also increases. However, large training datasets are often hard to collect in practice, and developing large models is time-consuming. For example, training the InceptionV3 classifier on ImageNet (Deng et al., 2009) with 1.2 million images requires more than 2 weeks using 8 GPUs (Szegedy et al., 2016). By contrast, in transfer learning, a quality Student Model recognising the faces of public figures can be developed from a Teacher Model with only several hundred training images in minutes (Wang et al., 2018; Pan and Yang, 2010).

The reuse of the models also introduces vulnerabilities to the newly learned Student Models (Wang et al., 2018; Ji et al., 2018; Zhao et al., 2019). Although the Student Models are often proprietary and hidden as black boxes, the corresponding Teacher Models, which contain similar knowledge, are hosted on public platforms for wide adoption. Therefore, attackers can obtain both the network structures and the parameters of the Teacher Models, and easily apply adapted black-box attacks to the resulting Student Models. Among others, recently emerging attacks proposed in (Wang et al., 2018) show that attackers can imitate the internal features of the target inputs, and that similar internal representations lead to the same prediction for two different source inputs. Accordingly, two types of misclassification attacks are instantiated, i.e., targeted attacks and non-targeted attacks. The former generate adversarial inputs that are classified as a chosen target class, while the latter cause the inputs to be misclassified as any other class. These attacks demonstrate strong transferability, even when the structures and parameters of the Student Models are modified. To the best of our knowledge, they are the most effective and efficient attacks that require only limited knowledge of the Student Model. Note that other existing general attacks (Goodfellow et al., 2015; Sabour et al., 2016; Moosavi-Dezfooli et al., 2016; Su et al., 2019; Ambra et al., 2019; Papernot et al., 2017, 2016a) either fail to maintain their performance once the model is modified (Goodfellow et al., 2015; Sabour et al., 2016; Moosavi-Dezfooli et al., 2016; Su et al., 2019; Wang et al., 2018) or require extra queries to the targeted model or knowledge about the training data (Ambra et al., 2019; Papernot et al., 2017, 2016a; Wang et al., 2018).

In the literature, few methods can effectively defend against these misclassification attacks. Most of the existing defences against adversarial examples are not suitable for transfer learning. Some of them, like adversarial detection (Feinman et al., 2017), lose their effectiveness because the structures and weights of the new models are modified, while others, like adversarial training (Tramèr et al., 2017; Kurakin et al., 2017), are general defence methods that are not optimized for transfer learning. The original attack paper (Wang et al., 2018) provides two basic defences, i.e., Randomizing Input via Dropout and Injecting Neuron Distances. Unfortunately, these two approaches either limit the model accuracy or are time-consuming. Besides, they both fall short of addressing non-targeted attacks. Detailed comparisons are presented later with the experimental results.


In this paper, we aim to design effective mechanisms to defend against the most advanced misclassification attacks on transfer learning in the current literature (Wang et al., 2018). Our goal is to improve the trained models so that they correctly classify or reject the adversarial inputs while maintaining the model accuracy. To successfully classify the adversarial inputs, we build a dedicated classifier, called a differentiator, which can correctly distinguish between the two classes in the attack pair of a targeted attack (misclassifying a source input from one class as a target input from another class). Such a differentiator, developed by network pruning and heavy distillation, is strongly robust against the adversarial inputs. Meanwhile, we apply iterative retraining to preserve its classification accuracy. For the non-targeted attacks, we develop an ensemble structure of the differentiators to classify or reject the adversarial inputs of random attack pairs. The ensemble consists of differentiators corresponding to every potential attack pair, which vote for the final prediction. The contributions of our work are summarized as follows:

  • We implement our defence to two transfer learning applications including Face Recognition and Traffic Sign Recognition. It is shown that the models applying our design are robust to the adversarial images generated by the misclassification attacks.

  • We design distilled differentiators to defend against adversarial inputs for the targeted attacks. We utilise activation pruning to effectively remove the impact from adversaries while maintaining the accuracy of the classification. During the pruning, we apply adapted ratios considering the different feature representations for the early and later layers. Particularly, we prune via filter rather than connectivity for the convolutional layers to fully distill the models. In addition, in order to preserve the classification accuracy within a proper range after pruning, we build the models through iterative retraining. For defending against the non-targeted misclassification attacks, we propose an ensemble structure of these differentiators.

  • We evaluate our design on the defence rate, model accuracy, and the efficiency of the model development. It achieves over 90% defence rates for both the targeted and non-targeted attacks in the above two applications and keeps the model accuracy over 97%, which is slightly lower than the 99% of the original models. In the detailed comparison, particularly for the non-targeted attacks, our defence rate (90%) is much higher than the rates (20%) of the prior art proposed in (Wang et al., 2018).

The rest of our paper is organized as follows. Section 2 introduces related work on attacks and defences of machine learning and transfer learning systems. Section 3 provides the background knowledge of transfer learning and network pruning which is used in our design. Section 4 states the attack model and the evaluation strategy of our defence. Section 5 presents our defence design in detail. Section 6 shows the experimental results of our defence and compares our design with others. Section 7 discusses the effectiveness of our design in broader scenarios. Section 8 gives a conclusion and future directions.

2. Related Work

Attacks on machine learning systems: There are several prior studies on attacking machine learning systems. One of the popular approaches is to build adversarial inputs. These inputs have unnoticeable differences from the normal inputs but can cause misclassification. Goodfellow et al. devise a fast algorithm, called the Fast Gradient Sign Method (FGSM), to generate these adversarial inputs (Goodfellow et al., 2015). For example, in image recognition systems, they add small perturbations along the direction of the sign of the gradient for every pixel in an original clean image input. After that, Sabour et al. propose to build the adversarial inputs by minimizing the internal feature distances between the original and target inputs (Sabour et al., 2016), and Moosavi-Dezfooli et al. propose to add perturbations by finding the smallest distance between the original inputs and the classification boundaries (Moosavi-Dezfooli et al., 2016). Recently, Su et al. design an attack that modifies only one pixel to generate the adversarial images (Su et al., 2019).
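As a minimal sketch of the FGSM idea described above, the snippet below perturbs an input along the sign of the loss gradient. The toy logistic-regression model, its weights, the input, and the step size `eps` are illustrative assumptions and are not taken from the cited papers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(x, w, b):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_gradient_wrt_input(x, y, w, b):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = score(x, w, b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, y, w, b, eps):
    """One FGSM step: shift every input component by eps along the gradient sign."""
    grad = loss_gradient_wrt_input(x, y, w, b)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical toy model and clean input (assumed values, for illustration only).
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.2, 0.1, 0.4], 1
x_adv = fgsm(x, y, w, b, eps=0.3)

# The score of the true class falls after the perturbation.
print(score(x, w, b), score(x_adv, w, b))
```

Each component moves by at most `eps`, which is what keeps the perturbation small while still pushing the prediction across the decision boundary of this toy model.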

However, these attacks are less effective in transfer learning systems. The structure of the last layer and the weights of several layers in the Student Model can be modified according to the Student Model tasks in different applications. As a result, white-box attacks based on knowledge of the Teacher Models lose their performance when targeting the Student Model. Black-box attacks that directly target the Student Models rely on building reasonable surrogate models using extra information, for example, part of the training dataset or another dataset with a similar distribution. These attacks assume stronger attacker access (Carlini et al., 2019; Papernot et al., 2017) or require a large number of queries (Carlini et al., 2019; Papernot et al., 2016a), which makes them harder to deploy and easier to notice.

How to attack a transfer learning system is also investigated in the literature. Ji et al. demonstrate attacks that generate adversarial models to attack transfer learning systems during their development (Ji et al., 2018). Wang et al. present an effective attack that generates adversarial inputs for both targeted and non-targeted attacks on the transfer learning system (Wang et al., 2018). Comparing these two approaches, it is easier in practice for attackers to use adversarial inputs than to generate adversarial models during system development (Goodfellow et al., 2018). Therefore, in this paper, we focus on how to defend against the practical misclassification attacks proposed in (Wang et al., 2018).

Defences against adversarial machine learning: Defence approaches have been explored extensively in prior research. There are three major approaches to make machine learning systems more robust. The first approach is called adversarial training, which adds adversarial images to the training datasets during system development. Kurakin et al. show that adversarial training can make the system more robust to adversarial inputs generated in one step without iterations (Kurakin et al., 2017). Tramèr et al. demonstrate an ensemble adversarial training strategy to improve the robustness of adversarially trained models (Tramèr et al., 2017). The second approach preprocesses the inputs before they are sent to the classifiers. Bendale et al. propose to add an extra layer, called OpenMax, to reject adversarial inputs before they are fed into the original models (Bendale and Boult, 2016). Abbasi et al. come up with an ensemble structure consisting of multiple specialists which can correctly classify or reject the adversarial inputs (Abbasi and Gagné, 2017). Feinman et al. propose to detect the adversarial inputs by comparing the distribution of Bayesian uncertainty for a model trained with dropout (Feinman et al., 2017). The third approach is to prune the networks to defend against the adversarial inputs. Liu et al. develop a strong defence against backdoor attacks by pruning and fine-tuning the neural networks (Liu et al., 2018). However, these defences, designed for ordinary machine learning systems, are less effective against the transfer learning attacks. Since the structures and other parameters are changed during the development of the Student Models, defences based on knowledge of the original models fail to protect the new Student Models. Meanwhile, other defences such as adversarial training and dropout reduce the model accuracy, which makes them unsuitable for sensitive applications.

We have not found methods that defend against the transfer learning attacks both effectively and efficiently. Wang et al. suggest two basic approaches, which suffer from either accuracy loss or high cost (Wang et al., 2018). One is called Randomizing Input via Dropout. It applies dropout to the input layer of the models. Several pixels of the input images are randomly dropped before they are sent to the classifier, which affects the model accuracy as mentioned. The other defence introduced by Wang et al. is Injecting Neuron Distances. The key idea of this approach is to increase the dissimilarities of the internal features at a certain (attack) layer by retraining the Student Models. The dissimilarities can make attacks less effective in transfer learning. However, with this approach, the parameters of the whole model have to be retrained and updated iteratively, which increases the computation cost of development and reduces the benefits of transfer learning. In addition, the non-targeted attacks remain effective against both approaches.

Our design employs advanced approaches which overcome the limitations of these two methods. The classification accuracy of the models with our defence is maintained via iterative pruning and retraining. For the non-targeted attacks, our design achieves much better performance than the prior work in (Wang et al., 2018) by developing a customized ensemble structure. Detailed comparisons between our design and these two methods are shown in the experimental results in Section 6.

3. Background

Transfer learning: Transfer learning is proposed to learn knowledge from a pre-trained model (Pan and Yang, 2010). The new model for one task can be developed based on both weights and architectures of the layers from a well-trained model for other similar tasks. It can be built by fine-tuning the parameters to fit its own task using its own dataset. When applying the transfer learning, in the first step, the Student Model will copy both architecture and weights from the Teacher Model. After that, the last layer of the Student Model will be changed to fit the new different task. In the second step, the Student Model will be tuned based on the similarity of two tasks. One common methodology is to freeze different layers and retrain the rest of them as shown in Figure 1.

Figure 1. Transfer Learning (Wang et al., 2018)

In particular, there are three approaches for realisation (Wang et al., 2018). Deep-layer Feature Extractor: only the last layer is retrained using datasets for Student Model tasks. This is suitable when the tasks of Student Models are very similar to the tasks of Teacher Models. Mid-layer Feature Extractor: some of the layers are unfrozen and retrained using datasets for Student Model tasks. Full Model Fine-tuning: all of the layers are unfrozen and tuned. This suits Teacher and Student Model pairs with large differences.
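The three realisations differ only in how many copied layers stay frozen. The sketch below illustrates this with a plain list of layer descriptions standing in for a real network; the layer names, unit counts, and the `build_student` helper are hypothetical and serve only to show the copy, replace, and freeze steps.

```python
import copy

def build_student(teacher_layers, num_student_classes, num_frozen):
    """Copy the Teacher, replace its last layer, and freeze the first layers."""
    student = copy.deepcopy(teacher_layers)
    # The classification layer is replaced to fit the Student task.
    student[-1] = {"name": "student_output", "units": num_student_classes,
                   "trainable": True}
    # The first `num_frozen` copied layers are frozen; the rest are retrained.
    for index, layer in enumerate(student[:-1]):
        layer["trainable"] = index >= num_frozen
    return student

# Hypothetical Teacher description (assumed names and sizes).
teacher = [
    {"name": "conv1", "units": 64, "trainable": True},
    {"name": "conv2", "units": 128, "trainable": True},
    {"name": "fc1", "units": 256, "trainable": True},
    {"name": "teacher_output", "units": 1000, "trainable": True},
]

deep_layer = build_student(teacher, 65, num_frozen=3)   # only the last layer trains
mid_layer = build_student(teacher, 65, num_frozen=2)    # some layers are unfrozen
full_tuning = build_student(teacher, 65, num_frozen=0)  # every layer is tuned

print([layer["trainable"] for layer in mid_layer])
```

The fewer layers are retrained, the more of the Teacher's internal features survive unchanged in the Student, which is the property the attacks discussed in this paper rely on.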

Network Pruning: Network pruning aims to prune a dense neural network to a sparse network. It was originally designed to improve the efficiency of both running and storing machine learning models. According to prior work (Han et al., 2015; Wang et al., 2019), most model structures have redundant neurons and connectives. These connectives contribute little to the classification tasks. The model accuracy can still be maintained after pruning these non-significant components if pruning and training are performed iteratively (Han et al., 2015). Figure 2 overviews the pruning over a two-layer neural network.

Figure 2. Network Pruning

There are two approaches to conduct pruning. Pruning via weights (Han et al., 2015): this approach prunes the weights with small absolute values. The weights in neural networks directly contribute to the final outcomes. A weight whose absolute value is close to zero is expected to contribute less to the predictions and to be a non-significant component. This is a simple and direct way of pruning. Pruning via activation (Polyak and Wolf, 2015): this approach prunes the components with small activation. Rather than only considering the weights, it recognises that the inputs of the layers also affect the final outcomes. The activation evaluates how the weights are activated by the expected inputs (Polyak and Wolf, 2015). By considering both weights and expected inputs, this method is more comprehensive.
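The difference between the two criteria can be illustrated on a tiny fully-connected layer. In this sketch the activation of a connective is approximated as the mean of |w · x| over a few sample inputs; that measure, the weights, and the inputs are assumptions made for illustration and may differ from the definition used by Polyak and Wolf.

```python
def zero_lowest(weights, scores, k):
    """Return a copy of `weights` with the k lowest-scoring connectives set to 0."""
    ranked = sorted((s, r, c) for r, row in enumerate(scores)
                    for c, s in enumerate(row))
    pruned = [row[:] for row in weights]
    for _, r, c in ranked[:k]:
        pruned[r][c] = 0.0
    return pruned

def weight_scores(weights):
    """Pruning via weights: score each connective by its absolute weight."""
    return [[abs(w) for w in row] for row in weights]

def activation_scores(weights, inputs):
    """Pruning via activation: mean |w * x| per connective (assumed measure)."""
    n = len(inputs)
    return [[sum(abs(w * x[c]) for x in inputs) / n for c, w in enumerate(row)]
            for row in weights]

weights = [[0.9, 0.1], [0.2, 0.8]]   # 2 outputs x 2 inputs
inputs = [[0.0, 1.0], [0.0, 2.0]]    # the first input feature never fires

by_weight = zero_lowest(weights, weight_scores(weights), k=2)
by_activation = zero_lowest(weights, activation_scores(weights, inputs), k=2)
print(by_weight)       # keeps the large weight 0.9 even though it is never used
print(by_activation)   # removes both connectives fed by the silent input
```

Weight pruning keeps the large but unused weight, while activation pruning removes it because the expected inputs never exercise it.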

For large neural networks with convolution layers, the internal features are generated by the convolution matrix, also called the kernel or filter. Li et al. argue that it is the whole filter rather than separate connectives that should be pruned when pruning a convolution layer (Li et al., 2017). Considering the components to be pruned, there are again two approaches. Pruning the connectives: the connectives with smaller weights or less activation are pruned. Pruning the filters: the filters with small average weights or less overall activation are pruned.
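A minimal sketch of filter-level pruning follows. Each filter is represented as a flat list of weights and scored by its mean absolute weight; the example filters and the `prune_filters` helper are hypothetical, and an activation-based score could be substituted for the weight-based one.

```python
def filter_scores(filters):
    """Mean absolute weight of each filter (one flat list of weights per filter)."""
    return [sum(abs(w) for w in f) / len(f) for f in filters]

def prune_filters(filters, num_to_prune):
    """Drop the whole filters with the smallest scores, keeping the original order."""
    scores = filter_scores(filters)
    ranked = sorted(range(len(filters)), key=lambda i: scores[i])
    dropped = set(ranked[:num_to_prune])
    return [f for i, f in enumerate(filters) if i not in dropped]

# Three hypothetical 2x2 filters, flattened.
filters = [
    [0.5, -0.4, 0.6, 0.5],
    [0.01, -0.02, 0.0, 0.01],   # near-zero filter, the pruning candidate
    [-0.3, 0.2, 0.1, -0.4],
]
remaining = prune_filters(filters, num_to_prune=1)
print(len(remaining))
```

Removing a whole filter also removes its output feature map, so the layer shrinks structurally instead of merely becoming sparse.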

4. Problem Definition

4.1. Attack Assumption

In this paper, we target the most advanced misclassification attacks in transfer learning (Wang et al., 2018) and follow their assumptions.

White-box Teacher Model: We first assume that the attackers have full access to the Teacher Model. This is realistic because most of the well-trained models are publicly available. The attackers can pretend to be one of the Students so that they can know both the weights and the architecture of the Teacher Models. We also assume that the attackers can find the corresponding Teacher Model when they are targeting at a Student Model. This Teacher Model fingerprinting method is introduced in (Wang et al., 2018).

Black-box Student Model: On the other hand, the attackers are assumed to have no access to the Student Models. Namely, the Student Model is considered a black box. We assume that neither the model parameters (including the weights and the architecture) nor the training dataset of the Student Models is accessible to the attackers. In real situations, this information, which may include sensitive and private data, is normally considered proprietary to the Student Model owners. Besides, we assume that the attackers can only issue a limited number of queries to the deployed Student Models, which makes it hard for them to reproduce shadow models.

Remark: In this paper, we do not consider the case where the Student Models are reproduced or leaked. If the attackers can directly gain enough information from the Student Models rather than from the Teacher Models, the implementation of the attacks will not be affected by the transfer learning method. Namely, this case becomes a generic attack and defence problem in machine learning. Nevertheless, we also consider relaxing the assumption about the attackers' knowledge of the defence implementation according to a guiding principle proposed in (Carlini et al., 2019), which suggests that the adversaries have knowledge of the defence algorithm. More detailed discussions can be found in Section 7.

4.2. Attack Overview

In Section 2, we introduce several prior attacks against machine learning systems. However, most of these attacks fail to maintain their performance in the transfer learning system, especially the targeted attacks. These attacks are based on an analysis of the decision boundary, which is strongly correlated with the classification layer, while the classification layers in the Student Models are rebuilt or modified to fit their new classification scenarios. The differences in weights and structure considerably reduce the effectiveness of these ordinary attacks. As a result, our work focuses on the attacks that specifically target transfer learning systems. The most effective and easily deployable of these in the current literature are the targeted and non-targeted misclassification attacks proposed by Wang et al. (Wang et al., 2018). The attacks generate adversarial images with negligible perturbations to the input images to cause misclassification in the Student Models.

Targeted Attacks: Figure 3 depicts how the misclassification attacks work. The key insight is that the attacks can make the internal features output by a certain layer of the Teacher Model very similar for two different input images.

Figure 3. Misclassification Attacks in Transfer Learning (Wang et al., 2018)

We refer to this layer as the attack layer. For a target transfer learning system, the attackers look for a carefully chosen layer such that the layers before it are frozen or only slightly tuned during the development of the Student Models. In this case, the internal features output by the attack layer remain similar in the Student Models. Since the models are feedforward networks, the output of each layer depends only on the preceding output. Once the two internal features output by the attack layer are close enough, the outputs of the remaining layers are likely to be very similar even if those layers are retrained. Such similarity is preserved up to the final prediction. As a result, misclassification takes place, since two inputs with different labels receive a similar prediction. For realisation, the attack can be translated into an optimization problem which minimizes the distance between the internal outputs under a limited perturbation. More details can be found in Appendix A.
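One way to write this optimization problem is given below. The notation is an assumption of this sketch, following the style of the attack paper (Wang et al., 2018): x_s is the source image, x_s' its perturbed version, x_t the target image, T_K the Teacher Model's output at the attack layer K, D a distance between internal features, d a distance between images, and P the perturbation budget.

```latex
\begin{aligned}
\min_{x_s'} \quad & D\bigl(T_K(x_s'),\; T_K(x_t)\bigr) \\
\text{s.t.} \quad & d(x_s',\, x_s) < P
\end{aligned}
```

The objective pulls the internal features of the perturbed source image towards those of the target image, while the constraint keeps the perturbation small enough to remain negligible.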

Non-targeted Attacks: A non-targeted attack is applied by evaluating multiple adversarial images for different targeted attacks and choosing the one with the smallest internal feature dissimilarity. According to (Wang et al., 2018), a subset of five candidate targeted attack images is sufficient to find an adversarial image that has a high probability of attacking the models successfully.
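The selection step can be sketched as follows, assuming the internal features of each candidate adversarial image and of its target image have already been computed. The L2 distance as the dissimilarity measure, the feature values, and the class names are illustrative assumptions.

```python
def l2_distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def pick_non_targeted(candidates):
    """Choose the candidate whose internal features are closest to its target's."""
    return min(candidates,
               key=lambda c: l2_distance(c["adv_features"], c["target_features"]))

# Hypothetical candidates, one per attempted targeted attack.
candidates = [
    {"target": "class_a", "adv_features": [0.9, 0.1], "target_features": [0.5, 0.5]},
    {"target": "class_b", "adv_features": [0.2, 0.7], "target_features": [0.25, 0.75]},
    {"target": "class_c", "adv_features": [0.0, 1.0], "target_features": [1.0, 0.0]},
]
best = pick_non_targeted(candidates)
print(best["target"])  # class_b, the candidate with the smallest dissimilarity
```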

In both attacks, the attackers do not know the transfer learning approach or the cutoff layer used for building the Student Models. Therefore, the attackers attack each layer in turn to find the optimal attack layer and generate the final adversarial images.

4.3. Defence Definition and Evaluation

Our goal is to develop a defence approach to address the attack models described above. Considering the real application scenarios for the defenders, several assumptions are made. Since robust machine learning systems against the adversarial inputs are desired for users, the development of our defence can be fully supported by the customers owning the Student Models. Naturally, the defenders are assumed to have access to the parameters and structures of the Student Models as well as the training data. They are also assumed to be able to modify the Student Models and make them more robust to the adversarial images.

For the targeted attacks, the defenders are assumed to know both the source image class and the target image class. In real cases, the attackers have several evident goals: sensitive goals or attack pairs with high value are the most attractive. For example, attackers prefer to have a red light misclassified as a green light in a traffic light recognition system. These attack goals are also clear to the defenders, which corresponds to our assumption. For the non-targeted attacks, we assume the defenders have no information about the source image class or the target image class. Unlike in the targeted attacks, the attackers aim to cause general misclassification rather than a specific one, so the defenders do not know the exact attack pairs. This assumption is consistent with building the adversarial inputs of the non-targeted attacks from randomly chosen attack pairs.

In particular, the defence will be evaluated in terms of both efficiency and effectiveness as follows:

Classification accuracy: The models with our defence should be able to successfully classify the clean images. These models should limit the decrease in accuracy on clean inputs compared to the original models.

Defence success rate: Our defence is expected to correctly classify most of the adversarial inputs. Even for adversarial inputs that are hard to classify correctly, it should still be able to distinguish them from clean inputs and reject them with high confidence. Namely, the defence success rate is expected to be as high as possible.

Time consumption: As one of the motivations of transfer learning is to save cost for large-scale learning tasks, our defence is expected to use fewer computation resources than other defences while introducing an acceptable time cost. Our defence is also expected to scale to large Teacher Models, because transfer learning will be widely adopted as the Teacher Models become large and complex.
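The first two criteria can be computed as in the sketch below. It assumes that a defended model returns either a class label or a rejection, and that an adversarial input counts as defended when it is either classified with its true label or rejected; the `REJECT` sentinel and the sample predictions are hypothetical.

```python
REJECT = None  # hypothetical sentinel returned when the model rejects an input

def classification_accuracy(predictions, labels):
    """Fraction of clean inputs that receive their true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def defence_success_rate(adv_predictions, true_labels):
    """Fraction of adversarial inputs that are correctly classified or rejected."""
    defended = sum(p == y or p is REJECT
                   for p, y in zip(adv_predictions, true_labels))
    return defended / len(true_labels)

clean_accuracy = classification_accuracy([0, 1, 2, 1], [0, 1, 2, 2])
defence_rate = defence_success_rate([3, REJECT, 5, 1], [3, 4, 2, 1])
print(clean_accuracy, defence_rate)  # 0.75 0.75
```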

5. Defence Design

5.1. Defence Overview

We first give an intuition of our defence against the misclassification attacks introduced above. The misclassification happens when the classifiers fail to find the true decision boundaries among different classes. To understand the underlying reason, we review the training procedure. During training periods, the models are trained to find these decision boundaries which help the models successfully classify the inputs. Ideally, after training, the decision boundaries should be the same as the true ones. But there will be error areas between them in practice. Most of the adversarial inputs are built to locate at these error areas. If the models with refined decision boundaries can be built, the error areas will decrease and thus make the adversarial inputs hard to fool the classifiers.

5.1.1. Differentiator for Targeted Attacks

Following the above observation, we first focus on the targeted attacks, which happen as the misclassification between a source class and a target class. To avoid the misclassification, a distilled differentiator that only classifies between these two classes can be designed. It can be very specific and only focus on classifying two classes. The influence of other classes can be minimized during its development. Therefore, it can find a refined boundary closer to the true decision boundary which decreases the attack success rates of the adversarial inputs. The intuition of building the differentiators is to distill these classifiers.

Distillation via Pruning: From a network viewpoint, the decision boundary is built by the connectivity of the model network, while some of the connectivity is redundant for a specific classification task. During the classification of clean inputs, not all the connectivity is active. An adversarial input also activates connectivity, but it differs from that of the clean input in order to achieve misclassification. A distilled model that keeps only the connectivity activated by clean inputs can stop the adversaries from exploiting the connectivity that clean inputs leave inactive, and drives the model to focus only on classifying the clean inputs. A commonly adopted way of distilling classifiers is to prune their networks (Hinton et al., 2015; Papernot et al., 2016b). Figure 4 shows the idea of using pruning to build a distilled classifier that defends against the adversarial inputs. After pruning, since most of the connectivity activated by clean inputs remains, the classification accuracy for the clean inputs is maintained. However, most of the connectivity previously activated by adversarial inputs is pruned, which reduces the effectiveness of the attacks, so the model is more robust to these adversarial inputs.

Figure 4. Pruning to Build Differentiators. For an original dense classifier, the clean inputs and the adversarial inputs activate different sets of connectivity. After pruning, most of the clean inputs activated connectivity remains. Meanwhile, most of the previous active adversarial connectivity is pruned.

Two-class differentiators: To build the classifiers with better classification boundaries, our intuition is that the dedicated two-class differentiators have better performance. The networks for more diversified classification tasks activate more connectivity which may be exploited by the adversaries. Meanwhile, for a classification problem with fewer classes, the activated connectives may cover fewer components in the network. Therefore, the more dedicated classifiers which have pruned more insignificant components are more robust to the adversarial inputs. In our design, we build differentiators for classifying only two classes as a default setting.

5.1.2. Ensemble Structure for Non-targeted Attacks

As introduced before, each non-targeted attack is chosen from several targeted attacks. Therefore, the same approach as for the targeted attacks can be adapted, but with an ensemble structure. The ensemble structure consists of differentiators corresponding to every attack pair. To ensure the robustness of the ensemble, every differentiator should be built to have strong defence ability. These differentiators can successfully classify the adversarial inputs generated for the corresponding attack pairs. Since the defenders do not know which attack pair will be used in the non-targeted attacks, all the possible attack pairs should be taken into consideration. Therefore, we combine the differentiators corresponding to all attack pairs to make sure that at least one of them can successfully classify the adversarial inputs. Considering an adversarial input generated from a source class and aimed at a target class, any differentiator whose corresponding attack pair includes the source class can successfully classify the adversarial input. By applying such an ensemble structure, we can make sure that a number of the differentiators successfully classify the inputs. With a standard voting strategy, the ensemble can produce a reliable final prediction.
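The ensemble and its vote can be sketched as follows, assuming one differentiator per unordered pair of classes and a simple majority vote. The rejection rule based on `min_votes`, the toy differentiators, and the class names are assumptions made for illustration; the voting strategy used in the paper may differ in detail.

```python
from collections import Counter
from itertools import combinations

def build_ensemble(classes, make_differentiator):
    """Create one two-class differentiator per unordered class pair."""
    return {pair: make_differentiator(*pair) for pair in combinations(classes, 2)}

def ensemble_predict(ensemble, x, min_votes):
    """Majority vote; reject (return None) when the winner has too few votes."""
    votes = Counter(differentiator(x) for differentiator in ensemble.values())
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None

# Toy differentiator (assumption): x maps each class to a score, and the
# differentiator for (a, b) returns whichever of its two classes scores higher.
def make_toy_differentiator(a, b):
    return lambda x: a if x[a] >= x[b] else b

classes = ["stop", "yield", "speed_limit", "no_entry"]
ensemble = build_ensemble(classes, make_toy_differentiator)
x = {"stop": 0.9, "yield": 0.2, "speed_limit": 0.4, "no_entry": 0.1}

print(len(ensemble))                               # 6 differentiators for 4 classes
print(ensemble_predict(ensemble, x, min_votes=3))  # stop
```

Every differentiator whose pair contains the true class can vote for it, so the true class can collect up to one vote per other class, whereas a wrong class has to win against several independent differentiators.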

5.2. Defence Implementation

This section introduces the technical details of our proposed defence implementation. Specifically, we will elaborate on how to develop the differentiators and then the ensemble structure.

5.2.1. Distilled Differentiator Implementation for Targeted Attacks

To address the targeted attacks, we design distilled classifiers, called differentiators, which are robust to specific targeted attack pairs. Each differentiator distinguishes between two labels, and it is hard for the corresponding adversarial inputs to fool it. Algorithm 1 shows how to build the distilled differentiators.


Algorithm 1 Differentiator Generator

Input: two training datasets for the differentiator; the Teacher Model of the transfer learning task; the activation of the two classes for the Teacher Model.

Output: a distilled differentiator.

function TwoLabelClassifierTraining()
    for each layer in the model do
        if the layer is a convolution layer then
            Prune non-significant filters in the layer based on the activation
        if the layer is a fully-connected layer then
            Prune non-significant connectives in the layer based on the activation
    for each iteration do
        for each layer in the model do
            Prune non-significant connectives in the layer based on the activation
    return the distilled differentiator

For an $n$-class classification problem, $n(n-1)/2$ differentiators are built to cover all possible attack pairs. Each differentiator is trained on two datasets that contain only the data of the two classes in its attack pair. Our design applies the general transfer learning method to build the differentiators. As introduced in Section 3, some of the layers copied from the Teacher Model are frozen, and the rest are retrained on the datasets for the differentiator’s task. After that, the differentiators are distilled by pruning each layer, as follows.

Activation Pruning: We use activation pruning, which has been shown to be more comprehensive because it considers the effect of both the inputs and the model parameters. In our design, each differentiator is expected to focus only on classifying between the two classes of its attack pair, which preserves a better defence performance. Activation pruning removes the inactive and weakly activated connectives, which do not affect the prediction of normal inputs but could be leveraged by adversarial inputs. As a result, it restricts the adversaries' ability to imitate the features of the target images, which reduces the attack effectiveness.
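As an illustration of activation pruning on a fully connected layer, the following minimal sketch scores every connective by its mean absolute contribution |x_i * w_ij| over the inputs of the two classes and zeroes out the lowest-scoring fraction. The scoring rule, the function names and the toy values are assumptions made for this example, not the exact procedure of Algorithm 1.

```python
def connection_scores(inputs, weights):
    """Mean absolute contribution |x_i * w_ij| of each connective (assumed score)."""
    n_in, n_out = len(weights), len(weights[0])
    scores = [[0.0] * n_out for _ in range(n_in)]
    for x in inputs:
        for i in range(n_in):
            for j in range(n_out):
                scores[i][j] += abs(x[i] * weights[i][j]) / len(inputs)
    return scores


def prune_connections(weights, scores, ratio):
    """Return a copy of `weights` with the lowest-scoring `ratio` of connectives set to 0."""
    coords = [(i, j) for i in range(len(weights)) for j in range(len(weights[0]))]
    coords.sort(key=lambda ij: scores[ij[0]][ij[1]])
    pruned = [row[:] for row in weights]
    for i, j in coords[: int(len(coords) * ratio)]:
        pruned[i][j] = 0.0
    return pruned


# Toy layer: 2 inputs -> 2 outputs. Input 1 is never active for these samples,
# so its connectives score 0 even though their weights are the largest.
class_inputs = [[1.0, 0.0], [2.0, 0.0]]
weights = [[0.5, -1.0], [3.0, 4.0]]
scores = connection_scores(class_inputs, weights)
pruned = prune_connections(weights, scores, ratio=0.5)
```

In this toy layer the two connectives leaving the inactive input are removed although their weights are the largest, which a purely magnitude-based criterion would not do.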

Pruning via Ratio: Our design prunes the models via a ratio rather than via threshold values. According to our experiments, the activation values can be entirely different for each differentiator. If the models were pruned via absolute values, widely different thresholds would have to be chosen for the differentiators of different tasks to reach good performance. In contrast, the pruning ratio can be limited to a small range, which expedites the development of our defence. Therefore, we prune the network via a ratio.
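The difference between the two criteria can be seen in a small sketch. It assumes two hypothetical differentiators whose activation scores differ in scale by a factor of 100; the helper names and the numbers are illustrative only.

```python
def keep_by_threshold(scores, threshold):
    """Keep-mask: True where the activation score exceeds an absolute threshold."""
    return [s > threshold for s in scores]


def keep_by_ratio(scores, ratio):
    """Keep-mask that removes the `ratio` fraction of lowest-scoring entries."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    removed = set(order[: int(len(scores) * ratio)])
    return [k not in removed for k in range(len(scores))]


# Hypothetical activation scores of the same layer in two differentiators.
scores_a = [0.01, 0.02, 0.03, 0.04]
scores_b = [1.0, 2.0, 3.0, 4.0]

kept_threshold = [sum(keep_by_threshold(s, 0.5)) for s in (scores_a, scores_b)]
kept_ratio = [sum(keep_by_ratio(s, 0.5)) for s in (scores_a, scores_b)]
```

With the shared threshold, one differentiator loses every entry and the other loses none, whereas the shared ratio removes half of the entries in both.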

Different Ratio for Different Layers: In our defence, different pruning rates are chosen for different layers in each differentiator. According to previous work (Li et al., 2017), the final model accuracy has a different sensitivity to the pruning of each layer. Our experiments show that the sensitivity of the defence performance also differs across layers. Therefore, it is necessary to choose a proper pruning rate for each layer to gain better performance. In our design, we conduct empirical studies to choose the pruning rates, which will be discussed in Section 6.

Filters and Connectives: To distill the differentiators better, we apply different strategies to different types of layers. For the fully connected layers, we prune individual connectives according to their activation. For the convolution layers, we prune whole filters, each of which consists of correlated connectives. While (Li et al., 2017) shows that pruning the filters in the convolution layers makes the networks more efficient, we find that it also improves the defence. Pruning by filter removes a whole batch of correlated insignificant connectives without omission, and can therefore fully remove the negative effect of irrelevant components.
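For the convolution layers, filter-level pruning can be sketched as follows. The sketch assumes that a filter is scored by the mean absolute value of its output feature map, averaged over the class samples; this scoring rule and the data layout are assumptions for illustration.

```python
def filter_scores(feature_maps):
    """feature_maps[sample][filter] is a flat list of that filter's output values."""
    n_filters = len(feature_maps[0])
    totals = [0.0] * n_filters
    for sample in feature_maps:
        for f, fmap in enumerate(sample):
            totals[f] += sum(abs(v) for v in fmap) / len(fmap)
    return [t / len(feature_maps) for t in totals]


def filters_to_prune(scores, ratio):
    """Indices of the `ratio` fraction of filters with the lowest scores."""
    order = sorted(range(len(scores)), key=lambda f: scores[f])
    return sorted(order[: int(len(scores) * ratio)])


# Two samples, four filters, two output values per filter (toy numbers).
maps = [
    [[1.0, 1.0], [0.0, 0.0], [0.5, 0.5], [2.0, 2.0]],
    [[3.0, 1.0], [0.0, 0.2], [0.5, 0.5], [2.0, 2.0]],
]
to_prune = filters_to_prune(filter_scores(maps), ratio=0.5)
```

Removing a filter discards all of its connectives at once, which is the "without omission" property described above.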

Independent Pruning: We prune each layer separately, so that the pruning of one layer is not affected by the others. Because the layers are pruned independently, they can be pruned in parallel, which makes our pruning more efficient. The activations of each class used for activation pruning are calculated once and reused when developing other differentiators.

According to the above configurations, we first prune the non-significant filters in the convolution layers via activation. Then, we apply connectivity pruning to the fully connected layers, also via activation, to obtain a distilled differentiator.

Iterative Pruning and Retraining: To preserve accuracy, we propose to prune and retrain the models iteratively. Directly pruning the networks harms the accuracy of the models, but the classifiers can regain accuracy through iterative pruning and retraining of the whole model (Han et al., 2015). In order to limit the computation cost, we apply the iterative pruning and retraining only to the unfrozen layers.
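The control flow of this step can be sketched as below. The pruning and retraining steps are passed in as callables, and the two toy steps shown here (zeroing the smallest surviving weight, then rescaling the survivors) are placeholders assumed for the example; a real retraining step would run gradient descent on the differentiator's datasets.

```python
def iterative_prune_and_retrain(layers, frozen, prune_step, retrain_step, iterations):
    """Alternate pruning and retraining on the layers that are not frozen."""
    layers = {name: list(w) for name, w in layers.items()}  # work on a copy
    for _ in range(iterations):
        for name in layers:
            if name not in frozen:
                layers[name] = prune_step(layers[name])
        for name in layers:
            if name not in frozen:
                layers[name] = retrain_step(layers[name])
    return layers


def prune_smallest(weights):
    """Toy pruning step: zero the smallest-magnitude weight that is still non-zero."""
    alive = [k for k, w in enumerate(weights) if w != 0.0]
    if not alive:
        return weights
    victim = min(alive, key=lambda k: abs(weights[k]))
    return [0.0 if k == victim else w for k, w in enumerate(weights)]


def toy_retrain(weights):
    """Toy retraining step: adjust surviving weights, keep pruned ones at zero."""
    return [w * 1.1 if w != 0.0 else 0.0 for w in weights]


model = {"conv1": [0.3, -0.2], "fc1": [0.5, -0.1, 0.9, 0.05]}
result = iterative_prune_and_retrain(model, {"conv1"}, prune_smallest, toy_retrain, iterations=2)
```

The frozen layer is left untouched, while the unfrozen layer loses one more connective in every iteration and is adjusted afterwards.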

The final differentiator is thus built with acceptable accuracy and robustness to adversarial inputs.

5.2.2. Ensemble Structure for Non-targeted Attacks

To address the non-targeted attacks, we design an ensemble structure of differentiators. A given input is classified by every differentiator, and the final prediction is based on the voting among them. The construction of the ensemble and its voting mechanism are presented as follows.

Ensemble Construction: At least one differentiator should be guaranteed to predict the correct result. Therefore, the differentiators for all attack pairs are included in the ensemble in our design. Considering a classification problem with $n$ classes, we assume that any combination of two classes can be an attack pair. In this case, there are $n(n-1)$ possible attack pairs (source, target), each of which corresponds to a differentiator. Two reversed attack pairs, (source, target) and (target, source), share the same differentiator. As a result, $n(n-1)/2$ differentiators are built. Each of them is trained by the steps described above using its corresponding training subsets. They are expected to maintain high accuracy on clean inputs and be robust to adversarial inputs, according to the design goals of the differentiators. For any clean input, there will always be differentiators that classify it correctly. For any adversarial input, since all combinations of attack pairs are considered, there will always be at least one differentiator that is designed to classify it correctly. In conclusion, there will always be at least one differentiator that provides a correct result.
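The counting argument can be checked with a short sketch. It enumerates the ordered attack pairs and the unordered class pairs for a five-class task such as those used in our experiments; the class names are placeholders.

```python
from itertools import combinations, permutations


def attack_pairs(classes):
    """All ordered (source, target) attack pairs."""
    return list(permutations(classes, 2))


def differentiator_pairs(classes):
    """One differentiator per unordered class pair; (a, b) and (b, a) share it."""
    return list(combinations(classes, 2))


classes = ["A", "B", "C", "D", "E"]  # placeholder labels for a five-class task
n_attack_pairs = len(attack_pairs(classes))
n_differentiators = len(differentiator_pairs(classes))
# Number of differentiators that involve a given class.
per_class = sum(1 for pair in differentiator_pairs(classes) if "A" in pair)
```

For five classes this gives 20 attack pairs covered by 10 differentiators, and every class takes part in 4 of them.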

Voting Mechanism: A proper voting strategy is then needed to make sure that the correct result wins the vote. Assuming the inputs do not fool any of the differentiators, a standard approach is to give every differentiator an equal vote. The detailed strategy is described in Algorithm B in the Appendix. Each input is classified by all of the differentiators, and all of their predictions are counted. Based on our structure, for an $n$-class problem the maximum number of differentiators voting for the same class is $n-1$. If $n-1$ differentiators make the same prediction, it is taken as the final outcome. If the maximum number of differentiators voting for the same class is less than $n-1$, we consider that even the differentiators have been fooled by the adversarial input, and the input is rejected. According to our experiments, this is a rare event.
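A minimal sketch of equal voting with rejection is given below; it is not the Algorithm B of the Appendix. The toy differentiators are assumptions made for the example: each one receives the true label as its input, answers correctly when that label is one of its two classes, and otherwise falls back to the first class of its pair.

```python
from collections import Counter
from itertools import combinations


def ensemble_predict(x, differentiators, n_classes):
    """Equal voting; return the winning class, or None to reject the input."""
    votes = Counter(predict(x) for predict in differentiators.values())
    label, count = votes.most_common(1)[0]
    # A class can collect at most n_classes - 1 votes; anything less is rejected.
    return label if count == n_classes - 1 else None


def make_differentiator(pair):
    # Toy stand-in: x is the true label; unrelated pairs vote for pair[0].
    return lambda x: x if x in pair else pair[0]


classes = [0, 1, 2, 3, 4]
ensemble = {pair: make_differentiator(pair) for pair in combinations(classes, 2)}
clean_prediction = ensemble_predict(2, ensemble, len(classes))

# Simulate one fooled differentiator: the (2, 3) model now always answers 3.
fooled = dict(ensemble)
fooled[(2, 3)] = lambda x: 3
fooled_prediction = ensemble_predict(2, fooled, len(classes))
```

In the clean case class 2 collects all 4 possible votes and wins; once a single relevant differentiator is fooled, no class reaches 4 votes and the input is rejected.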

Figure 5. Ensemble of Models

Figure 5 illustrates an example of the above process. Given an input image with source class , it will be classified by all differentiators. In our example, of them correctly classify this input and agree to vote for class . Assuming all classifiers work properly, the number of votes for any other prediction will always be less than . Thus, the winner of the simple equal voting strategy is the correct prediction.

Note that, for equal voting in an $n$-class problem, the winner should gain $n-1$ votes, which is the maximum number any prediction can gain. Meanwhile, it is possible for one of the losers to gain $n-2$ votes, only one fewer than the winner. Therefore, the final reliability of our design strongly relies on all differentiators working correctly. In our design, the differentiators are built to be reliable, and the experimental results show that all of them behave as intended. For less reliable systems, our design can be improved by assigning a different weight to each differentiator. The voting weights can be decided based on how frequently each attack pair occurs in practice. We consider this as future work.

Efficiency Improvement: While our design achieves much better performance against the non-targeted attacks than prior methods, the ensemble structure increases the development cost. To improve the efficiency of our design, one may reduce the number of models by combining several differentiators into a classifier that covers multiple attack pairs. We apply this optimisation, and the detailed results are shown in Appendix C. Our experiments show that, when the classification accuracy is preserved, the defence rates of differentiators that combine three attack pairs are slightly reduced. As a result, the efficiency of development increases, but the defence ability cannot be fully preserved. The trade-off between the development cost and the defence performance must be balanced carefully. We leave this optimisation as our future work.

6. Experimental Results

This section presents the comprehensive experimental results of our design. We demonstrate that our design can defend against the targeted and non-targeted misclassification attacks in transfer learning. We evaluate our defence in terms of both accuracy and defence success rates for transfer learning applications under various parameter settings. We also evaluate FGSM, another popular attack against ordinary machine learning systems, and show that it is less effective in transfer learning systems and fails to attack our design.

6.1. Experimental Setup

6.1.1. Teacher and Student Models Selection

To evaluate our defence, we apply the misclassification attacks to two popular transfer learning tasks: (1) Face Recognition (recognising five public figures in a dataset called PubFig (Pinto et al., 2011)), and (2) Traffic Sign Recognition (recognising five traffic signs in a dataset called GTSRB (Stallkamp et al., 2011)).

Face Recognition: The task is to classify several human faces, and it is commonly used to evaluate both attacks and defences. The Teacher Model is a popular pre-trained model called VGG-Face (Parkhi et al., 2015), which is already well trained to classify 2622 faces with an accuracy of over 90%. The Student Model is trained to classify five public figures chosen from the PubFig dataset (Kumar et al., 2009) with only 647 images.

Traffic Sign Recognition: The second task is to classify different traffic signs for an autonomous driving system. The Teacher Model is a standard VGG16 model (Simonyan and Zisserman, 2015) trained on the ImageNet dataset, with 14 million images, to recognise 1000 classes of items. The top-5 test accuracy of this pre-trained model is about 92.7%. The training data for the Student Model comes from the GTSRB dataset (Stallkamp et al., 2012), which has more than 50,000 images of about 40 classes of traffic signs. Our task aims to recognise five different traffic signs with only 214 images.

Both tasks use popular pre-trained models as the Teacher Models, which are trained with a large amount of data, while the training datasets for the Student Models are kept small to simulate general transfer learning applications. Both classification tasks cover only five classes. According to our experiments, the misclassification attacks are more powerful against models that classify fewer labels. Our design is shown to be effective for a broader range of applications, even when the attacks are much stronger.

6.1.2. Transfer Method Selection

In our experiments, two different transfer learning methods are applied to the two tasks above. As introduced in Section 3, there are three approaches to applying transfer learning: Deep-layer Feature Extractor, Mid-layer Feature Extractor, and Full Model Fine-tuning. According to previous work (Wang et al., 2018), the transfer learning misclassification attacks are less effective against Student Models developed with Full Model Fine-tuning. Therefore, only the former two approaches are evaluated in our experiments. For the first task, both the Teacher Model and the Student Model focus on face recognition, so transfer learning can be applied in a direct and simple way, and the Deep-layer Feature Extractor is used for Face Recognition. For the second task, the Teacher Model classifies general objects, while the Student Model focuses on traffic signs. Therefore, the Mid-layer Feature Extractor is used in this task. To maintain the accuracy, the cutoff layer is chosen to be layer 10 out of 16 for the VGG16 model when developing the Student Model.
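The three transfer methods differ only in how many of the copied layers stay frozen. The sketch below assumes the usual definitions following (Wang et al., 2018): the Deep-layer Feature Extractor freezes all but the last layer, the Mid-layer Feature Extractor freezes the layers up to a cutoff, and Full Model Fine-tuning freezes none. The function name and the use of 16 layers for both Teacher Models are assumptions made for illustration.

```python
def frozen_mask(n_layers, method, cutoff=None):
    """Return one flag per layer: True if the layer is frozen (copied, not retrained)."""
    if method == "deep-layer":
        k = n_layers - 1  # only the last layer is retrained
    elif method == "mid-layer":
        if cutoff is None or not 0 < cutoff < n_layers - 1:
            raise ValueError("mid-layer needs 0 < cutoff < n_layers - 1")
        k = cutoff  # layers after the cutoff are retrained
    elif method == "full-fine-tuning":
        k = 0  # every layer is retrained
    else:
        raise ValueError(f"unknown method: {method}")
    return [i < k for i in range(n_layers)]


face_mask = frozen_mask(16, "deep-layer")            # assumed 16-layer Teacher Model
sign_mask = frozen_mask(16, "mid-layer", cutoff=10)  # cutoff at layer 10 of 16
```

Under these assumptions, 15 layers are frozen for Face Recognition and 10 for Traffic Sign Recognition, so the latter leaves more layers available for retraining.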

6.1.3. Attack Setup

In our experiments, we generate adversarial images for both the targeted and non-targeted attacks by following the same steps as (Wang et al., 2018), as given in Appendix A, which is sufficient to evaluate our defence.

Attack Pairs: Both the source and target images are randomly chosen from the test dataset, and none of them is used for training the Student Model. This treatment matches the assumption that the Student Model is a black box and the attackers cannot access the training data. For the targeted attacks, we randomly choose source and target pairs to generate adversarial images. For the non-targeted attacks, we also randomly choose source images. For each of them, we randomly choose target images with different classes. After that, we evaluate the distance between the internal feature vectors of the adversarial and target images. The source and target pair with the smallest distance is chosen to generate the final adversarial image of the non-targeted attack. We generate adversarial images as well.
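The selection rule for the non-targeted attack can be sketched as follows. For one source image, each candidate target yields one adversarial image, and the candidate whose internal feature vector is closest to that of its adversarial image is kept. The use of the Euclidean distance and the toy feature vectors are assumptions made for this example.

```python
import math


def euclidean(u, v):
    """L2 distance between two feature vectors (assumed distance measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pick_target(adversarial_features, target_features):
    """Index and distance of the candidate whose adversarial image is closest to its target."""
    distances = [euclidean(a, t) for a, t in zip(adversarial_features, target_features)]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


# Internal features of three adversarial images and of their three candidate targets.
adv_feats = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
tgt_feats = [[3.0, 4.0], [1.0, 2.0], [0.0, 0.0]]
best_index, best_distance = pick_target(adv_feats, tgt_feats)
```

Here the second candidate is selected, since its adversarial image mimics the target's internal features most closely.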

Attack Configuration: The adversarial images are generated targeting different attack layers. The attack layer with the highest attack success rate is taken as the final attack layer. The perturbation budget of the adversarial images is for Face Recognition task and for Traffic Sign Recognition task. These two perturbation budgets are significantly smaller than those in the original work (Wang et al., 2018). The optimizer for the adversarial sample generator is Adadelta. The optimized problem of the adversarial images generation uses for the iteration times and for the learning rate.

6.2. Evaluation of Our Defence

We evaluate our design by comparing the robustness of the systems with and without our design. We also compare our design with other defences to demonstrate the advantages.

6.2.1. Comparison between Models with and without Our Defence

First, we compare the performance of the Student Models with our defence against the original Student Models without any defence. The results confirm that models with our method can defend against most of the misclassification attacks and maintain acceptable accuracy for clean samples.

Our experiment results illustrate the attack success rates of the Student Models for both Face Recognition and Traffic Sign Recognition with and without our defence. We also evaluate another attack called FGSM, which targets ordinary machine learning systems. The results show that it is less effective against transfer learning systems, where the attack success rate drops from almost to about and is much lower (almost ) when targeting models with our design. For the attack configurations, the perturbation budget is fixed to be for the adversarial images of the Face Recognition task, and for the Traffic Sign Recognition task, as mentioned. For the defence setup, the iteration number is fixed to be . The pruning ratios for different layers in each differentiator are empirically chosen to be different. Our defence reduces the attack success rates from and for the unguarded Face Recognition and Traffic Sign Recognition models to and for the targeted attacks, and and for the non-targeted attacks.
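For reference, FGSM (Goodfellow et al., 2015) perturbs an input by one step of size ε along the sign of the loss gradient with respect to the input. The sketch below applies it to a logistic-regression classifier, which is a stand-in assumed here so that the gradient can be written in closed form; the weights, the input and the value of ε are illustrative and unrelated to our experimental settings.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(x, y, w, b):
    """Cross-entropy loss of the logistic model p = sigmoid(w.x + b) on (x, y)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def fgsm(x, y, w, b, eps):
    """One-step FGSM: x' = clip(x + eps * sign(dL/dx), 0, 1)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]  # gradient of the loss w.r.t. the input
    sign = [(g > 0) - (g < 0) for g in grad]
    return [min(1.0, max(0.0, xi + eps * s)) for xi, s in zip(x, sign)]


x, y = [0.5, 0.5], 1
w, b = [1.0, -2.0], 0.0
x_adv = fgsm(x, y, w, b, eps=0.1)
```

Each input component moves by at most ε, and the loss on the true label increases after the step.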

Impact of Attack Layers: We evaluate our defence by applying attacks aiming at different layers. Since the attackers do not know which transfer method is used to develop the Student Models, they generate small sets of adversarial images targeting several layers to find the optimal attack layer. Our design is expected to defend against attacks aiming at any layer. For the Deep-layer Feature Extractor, based on the experience of prior work (Wang et al., 2018), the optimal attack layer is the last frozen layer. As shown in Figure 6, our defence is quite effective. For the original Student Model without our defence, the success rates for the attack aiming at the last layer are almost for both the targeted and non-targeted attacks. After applying our design, the attack success rate drops to for the targeted attacks and for the non-targeted attacks.

Figure 6 shows the relationship between the attack layers and the attack success rates with and without the defence for the Traffic Signs Recognition.

(a) Face Recognition (b) Traffic Sign Recognition
Figure 6. Attack layer and attack success rate

It can be seen that the attack success rates targeting different layers remain low, where most of them are less than . Unlike the unguarded models, which show an obvious increase in attack success rate as the attack layer approaches the optimal layer, the trend for our design is flat. It is therefore difficult for the attackers to find the optimal layer with a small set of adversarial images. In addition, even when the attackers successfully find the optimal attack layer, the attack success rates are still limited below .

Impact of the Perturbation Budget: In practice, attack configurations such as the perturbation added to the inputs vary. A larger perturbation budget makes the adversarial inputs more likely to fool the classifiers. To verify the ability of our defence in different situations, we also evaluate our design under different perturbation budgets. Figure 7 shows the relationship between the perturbation budget and the attack success rate for both Face Recognition and Traffic Sign Recognition.

(a) Face Recognition (b) Traffic Sign Recognition
Figure 7. Perturbation budgets and the attack success rate

For these adversarial images, we choose the optimal attack layers, which have the highest attack success rates as shown in Figure 6. According to the results, our design is robust to the perturbation. For Face Recognition, when the perturbation budgets are in the range of and , the attack success rate targeting unguarded models increases to almost . For a larger perturbation budget, the attack success rate increases noticeably. For Traffic Sign Recognition, the attack is effective when the perturbation budget is larger than 0.01. In contrast, the models with our defence are more robust to variations in the perturbation. In the tested range, the attack success rates are all less than .

Impact of Iteration Number: The models applying our design can still maintain acceptable accuracy after several iterations. Some defence approaches may degrade classification performance. As introduced in Section 5, the original neural networks are pruned in our design. Previous studies (Polyak and Wolf, 2015; Han et al., 2015) show that pruning the neural network of a classifier affects the model accuracy, while iterative pruning and retraining can help the model regain its accuracy. In our design, the weights in the frozen layers are fixed, and thus the damage caused by pruning these layers cannot be recovered by retraining them. Our experiments show that iterative pruning and retraining of only the unfrozen layers can still regain an acceptable model accuracy. In addition, more iterations of pruning and retraining lead to higher model accuracy. However, unlike general network pruning, which retrains the whole model, it is hard for our design to achieve the same or better accuracy compared to the original models. Figure 8 illustrates the relationship between the number of iterations and the accuracy of the Student Models for both tasks. It can be found that the model accuracy for Face Recognition task increases to more than after iterations. For Traffic Signs Recognition, the accuracy is also over after iteration periods. As a result, with iterative pruning and retraining, the accuracy of the Student Models rises back to an acceptable value.

(a) Face Recognition (b) Traffic Sign Recognition
Figure 8. Iteration times and accuracy
Differentiator ID | Classification Accuracy | Attack Success Rate
Table 1. Performance of Different Differentiators

Model | Accuracy | Attack Success Rate (Targeted Attacks) | Attack Success Rate (Non-targeted Attacks)
Original Unguarded Model
Randomizing Input via Dropout
Injecting Neuron Distances
Ensemble Differentiators
Table 2. Performance Comparison between Different Defences

Impact of the Pruning Ratio: Our experiments show that, to achieve a high defence success rate and maintain proper accuracy at the same time, the pruning ratio should not be fixed across layers or across differentiators. According to prior work (Han et al., 2015), the model accuracy has a different sensitivity to the pruning of each layer. Therefore, it is necessary to choose a proper pruning ratio for each layer. In addition, for differentiators aiming at different attack pairs, the proportions of activated neurons are different. As a result, the pruning ratio should be adapted to each specific differentiator. Table 1 reports the accuracy and attack success rates of five differentiators under the same pruning ratio. From the results, some differentiators, such as differentiator 4, lose accuracy because of excessive pruning. However, with the same pruning ratio, differentiator 2 maintains its accuracy while its attack success rate is high, which means that the network should be pruned further to defend against the adversarial inputs. Therefore, it is necessary to choose the pruning ratio for each differentiator properly to achieve good performance in both accuracy and defence ability.

6.2.2. Comparison with Other Defences

As introduced in Section 2, two basic defence approaches against misclassification attacks in transfer learning have been proposed: Randomizing Input via Dropout and Injecting Neuron Distances. We further compare our design with these two defences.

Comparison to Randomizing Input via Dropout: In Randomizing Input via Dropout, several random pixels of the input images are dropped to decrease the attack success rates of adversarial images. Although it can make the models more robust, the accuracy of the Student Models is severely affected. In contrast, our design maintains much better accuracy of the Student Models. Table 2 reports the comparison between our defence and the Randomizing Input via Dropout method. For Randomizing Input via Dropout method, the accuracy drops from to after the attacks being conducted. Our design still maintains the classification accuracy at about through iterative pruning and retraining. Moreover, the defence rates of our design against both targeted and non-targeted attacks are higher.

Comparison to Injecting Neuron Distances: Another defence method is Injecting Neuron Distances. It retrains the whole Student Model to increase the distances of the internal feature vectors of the inputs at the cut-off layer. Unlike the previous method, it maintains the model accuracy to some extent. However, its computation cost is much higher due to the iterative retraining of the whole model. In our defence approach, only the unfrozen layers are retrained, which greatly improves the efficiency of developing defence models. In the original attack work (Wang et al., 2018), it takes about hours for Injecting Neuron Distances to develop a robust model with targeted attack success rate by using GPU. In our work, it only takes less than 10 minutes to develop a differentiator aiming at specific attack pairs with targeted attack success rate.

Furthermore, Injecting Neuron Distances is less effective in defending against the non-targeted attacks. In our design, multiple differentiators are trained to defend against multiple targeted attacks or non-targeted attacks, which makes our models more robust to these attacks. The detailed comparison is shown in Table 2. The attack success rate for the non-targeted attacks for Injecting Neuron Distances is about ; it is much higher than our design which is .

7. Discussion

Attack Efficiency with Known Defence Methodology: Following the guiding principle discussed by Carlini et al. (Carlini et al., 2019) that the defence algorithm might not be kept secret, we extend the attackers' knowledge to include the defence strategy. The attackers are assumed to know our defence strategies, such as the activation pruning and the ensemble structure. However, even with this knowledge, the attackers still cannot gain enough information about the weights and structure of the targeted models. The Student Models are developed by activation pruning, and thus their parameters and architectures depend strongly on the private training data, which is hidden from the attackers. Therefore, it is still hard for them to mount effective attacks against these pruned models. Knowledge of the defence methodology might be useful for a more powerful attacker who also knows the training data distribution, since such an attacker could build similar surrogate models via our defence strategy. To mitigate such advanced attackers, a possible approach is to introduce randomness into the differentiator pruning, which can make our design more flexible and robust to these attacks. We leave the detailed investigation as our future work.

Necessity of Combining Specialising and Pruning: In our design, the differentiators are built on two defence intuitions. First, the models are devoted to classifying small subsets of classes so as to be robust to adversaries. Second, the models are pruned to prevent interference from irrelevant network components that may be used by the attackers. Both the concentration on fewer classes and the network pruning are indispensable: only by combining the two can our defence achieve a better defence performance. For differentiators classifying fewer classes, there are also fewer active neurons and connections. The neurons and connectives that are exploited by the adversarial inputs but are active only for the classification of other labels can be pruned without affecting the model accuracy. Therefore, the networks can be pruned thoroughly, which minimises the adversaries' exploitation of network components and reduces the effectiveness of the adversarial inputs.

8. Conclusion

In this paper, we describe and implement our defence against the misclassification attacks in transfer learning systems. We show that distilled differentiators that are robust to adversarial inputs can be built by activation pruning. We apply iterative pruning and retraining to maintain the classification accuracy. In addition, we design an ensemble structure to defend against the non-targeted attacks. Finally, we evaluate our defence by comparing it to several other methods. Our design is shown to be more effective and easier to implement. Potential improvement of the ensemble structure is considered as part of our future work.


  • M. Abbasi and C. Gagné (2017) Robustness to adversarial examples through an ensemble of specialists. Computing Research Repository abs/1702.06856. Cited by: §2.
  • A. Demontis, M. Melis, M. Pintor, M. Jagielski, B. Biggio, A. Oprea, C. Nita-Rotaru, and F. Roli (2019) Why do adversarial attacks transfer? explaining transferability of evasion and poisoning attacks. In 28th USENIX Security Symposium (USENIX Security 19), Cited by: §1.
  • A. Bendale and T. E. Boult (2016) Towards open set deep networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1563–1572. Cited by: §2.
  • N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, A. Madry, and A. Kurakin (2019) On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705. Cited by: §2, §4.1, §7.
  • J. Deng, W. Dong, R. Socher, and L. Li (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §1.
  • R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner (2017) Detecting adversarial samples from artifacts. Computing Research Repository abs/1703.00410. Cited by: §1, §2.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. Computing Research Repository abs/1412.6572. Cited by: §1, §2.
  • I. Goodfellow, P. McDaniel, and N. Papernot (2018) Making machine learning robust against adversarial inputs. Communications of the ACM 61 (7), pp. 56–66. Cited by: §2.
  • Google (2019) Google Cloud AutoML. Note: Online at Cited by: §1.
  • S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pp. 1135–1143. Cited by: §3, §3, §5.2.1, §6.2.1, §6.2.1.
  • G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. Computing Research Repository abs/1503.02531. Cited by: §5.1.1.
  • Y. Ji, X. Zhang, S. Ji, X. Luo, and T. Wang (2018) Model-reuse attacks on deep learning systems. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 349–363. Cited by: §1, §2.
  • N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar (2009) Attribute and simile classifiers for face verification. In 2009 IEEE 12th International Conference on Computer Vision, pp. 365–372. Cited by: §6.1.1.
  • A. Kurakin, I. J. Goodfellow, and S. Bengio (2017) Adversarial machine learning at scale. Computing Research Repository abs/1611.01236. Cited by: §1, §2.
  • H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient convnets. Computing Research Repository abs/1608.08710. Cited by: §3, §5.2.1, §5.2.1.
  • K. Liu, B. Dolan-Gavitt, and S. Garg (2018) Fine-pruning: defending against backdooring attacks on deep neural networks. In Research in Attacks, Intrusions, and Defenses, pp. 273–294. Cited by: §2.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582. Cited by: §1, §2.
  • S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE) 22 (10), pp. 1345–1359. Cited by: §1, §3.
  • N. Papernot, P. D. McDaniel, and I. J. Goodfellow (2016a) Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. Computing Research Repository abs/1605.07277. Cited by: §1, §2.
  • N. Papernot, P. D. McDaniel, X. Wu, S. Jha, and A. Swami (2016b) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. Cited by: §5.1.1.
  • N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. Cited by: §1, §2.
  • O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. British Machine Vision Conference 1 (3), pp. 6. Cited by: §6.1.1.
  • N. Pinto, Z. Stone, T. Zickler, and D. Cox (2011) Scaling up biologically-inspired computer vision: a case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition 2011 Workshops, pp. 35–42. Cited by: §6.1.1.
  • A. Polyak and L. Wolf (2015) Channel-level acceleration of deep face representations. IEEE Access 3 (), pp. 2163–2175. Cited by: §3, §6.2.1.
  • S. Sabour, Y. Cao, F. Faghri, and D. J. Fleet (2016) Adversarial manipulation of deep representations. Computing Research Repository abs/1511.05122. Cited by: §1, §2.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. Computing Research Repository abs/1409.1556. Cited by: §6.1.1.
  • J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel (2011) The German traffic sign recognition benchmark: a multi-class classification competition. In The 2011 International Joint Conference on Neural Networks, pp. 1453–1460. Cited by: §6.1.1.
  • J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel (2012) Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks 32, pp. 323–332. Cited by: §6.1.1.
  • J. Su, D. V. Vargas, and K. Sakurai (2019) One pixel attack for fooling deep neural networks. Computing Research Repository abs/1710.08864. Cited by: §1, §2.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826. Cited by: §1.
  • B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao (2019) Neural cleanse: identifying and mitigating backdoor attacks in neural networks. In Proceedings of 40th IEEE Symposium on Security and Privacy. Cited by: §3.
  • B. Wang, Y. Yao, B. Viswanath, H. Zheng, and B. Y. Zhao (2018) With great training comes great vulnerability: practical attacks against transfer learning. In 27th USENIX Security Symposium, pp. 1281–1297. Cited by: 3rd item, §1, §1, §1, §1, §2, §2, §2, Figure 1, §3, Figure 3, §4.1, §4.1, §4.2, §4.2, §6.1.2, §6.1.3, §6.1.3, §6.2.1, §6.2.2.
  • Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402. Cited by: Appendix A.
  • Y. Zhao, I. Shumailov, R. Mullins, and R. Anderson (2019) To compress or not to compress: understanding the interactions between adversarial attacks and neural network compression. In Proceedings of the 2nd SysML Conference. Cited by: §1.

Appendix A Misclassification Attacks in Transfer Learning

The misclassification attacks in transfer learning minimize the distance between internal outputs within a perturbation budget. One simple way to measure this distance is the L2 distance, a common measure for vectors; in our design, it is used to measure internal feature similarity. For the distance between inputs, our design uses Structural Dissimilarity (DSSIM), a distance metric that evaluates the structural similarity of images (Wang et al., 2003). It judges the difference between two input images in a way similar to human perception, which suits the two image recognition applications in our experiments.
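To make the input-distance measure concrete, the following is a minimal sketch of DSSIM. It assumes the common definition DSSIM = (1 − SSIM) / 2 and, for brevity, computes SSIM from whole-image statistics on a flat list of grayscale pixels; it is not the multiscale, windowed variant of Wang et al. (2003). The function names `global_ssim` and `dssim` and the stabilising constants are illustrative assumptions, not taken from our implementation.

```python
def global_ssim(img_a, img_b, data_range=255.0):
    """SSIM from whole-image statistics (assumption: no windows, no multiscale)."""
    if len(img_a) != len(img_b) or len(img_a) == 0:
        raise ValueError("images must be non-empty and of equal size")
    n = len(img_a)
    # Conventional stabilising constants (assumed, not from the paper).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a = sum(img_a) / n
    mu_b = sum(img_b) / n
    var_a = sum((p - mu_a) ** 2 for p in img_a) / n
    var_b = sum((q - mu_b) ** 2 for q in img_b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(img_a, img_b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def dssim(img_a, img_b, data_range=255.0):
    """Structural dissimilarity, assuming DSSIM = (1 - SSIM) / 2."""
    return (1.0 - global_ssim(img_a, img_b, data_range)) / 2.0
```

Under this definition, identical images have a DSSIM of zero, and the value grows towards one as the structure of the two images diverges.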

Table 3. Performance of Multiple Attack Pairs Differentiator
Columns: Model Accuracy; Attack Success Rate (Targeted Attacks); Attack Success Rate (Non-targeted Attacks)
Rows: Original Unguarded Model; Differentiator of One Attack Pair; Differentiator of Three Attack Pairs

Two kinds of attacks, targeted and non-targeted, are applied to the classification systems. As introduced earlier, a targeted attack can be formulated as the optimization problem in equation 1.


The attack minimizes the distance between the internal features, taken at the attack layer, of the adversarial input and of the target input. The perturbation added to the source input is limited by the budget, which is measured with the input distance function.
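The targeted objective described above can be written out as follows. The notation is an assumption borrowed from Wang et al. (2018) and may differ from the symbols used elsewhere in this paper: $T_K(\cdot)$ is the output of the Teacher Model at attack layer $K$, $x_s$ is the source input, $x'_s$ the adversarial input, $x_t$ the target input, $D(\cdot,\cdot)$ the internal feature distance, $d(\cdot,\cdot)$ the input distance, and $P$ the perturbation budget.

```latex
\begin{aligned}
\min_{x'_s} \quad & D\bigl(T_K(x'_s),\, T_K(x_t)\bigr) \\
\text{s.t.} \quad & d(x'_s,\, x_s) < P
\end{aligned}
```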

The non-targeted attack evaluates multiple adversarial images, one for each of several targeted attacks, and chooses the one with the smallest internal feature dissimilarity. It can be formulated as the optimization problem in equation 2.
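A sketch of the non-targeted objective follows, again in notation assumed from Wang et al. (2018) rather than confirmed by this paper: $I$ is a hypothetical index set over the candidate target inputs $x_{t_i}$, $T_K(\cdot)$ is the Teacher Model output at attack layer $K$, $D$ is the internal feature distance, $d$ the input distance, and $P$ the perturbation budget. The inner minimum expresses the choice of the candidate with the smallest internal feature dissimilarity.

```latex
\begin{aligned}
\min_{x'_s} \quad & \min_{i \in I} \; D\bigl(T_K(x'_s),\, T_K(x_{t_i})\bigr) \\
\text{s.t.} \quad & d(x'_s,\, x_s) < P
\end{aligned}
```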


Appendix B Voting Algorithm

We design a voting strategy to ensure that the correct result, which at least one of the differentiators produces, wins the vote.


Input: image x; Specialists … that can classify between every two classes of a classification problem of K classes; the outputs of these Specialists are labels from 1 to K.

Output: final prediction y.

function voting(…)
    for … to … do
        …
    if … then
        …
    else
        y ← 0 (there is no class 0; we define class 0 as rejection)
    return y
Algorithm 2 Voting Strategy
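One way the voting strategy could be realised is sketched below. Several details are assumptions rather than the paper's exact rule: there is one specialist per unordered pair of classes, each specialist is a callable that returns one of its two labels, and a prediction is accepted only when a class wins all K − 1 of its pairwise contests. The names `vote`, `specialists` and `REJECT` are hypothetical.

```python
from collections import Counter
from itertools import combinations

REJECT = 0  # there is no class 0, so 0 is used to signal rejection


def vote(x, specialists, num_classes):
    """Combine pairwise specialists into one prediction in 1..K, or REJECT.

    specialists maps a class pair (i, j) with i < j to a callable that
    takes the input x and returns either i or j (assumed interface).
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    tally = Counter()
    for pair in combinations(range(1, num_classes + 1), 2):
        label = specialists[pair](x)
        if label not in pair:
            raise ValueError(f"specialist {pair} returned {label}")
        tally[label] += 1
    winner, votes = tally.most_common(1)[0]
    # Assumed acceptance rule: the winner must win every one of its
    # K - 1 pairwise contests; otherwise the input is rejected.
    return winner if votes == num_classes - 1 else REJECT
```

Under this rule at most one class can collect K − 1 votes, so the accepted label is unambiguous, and inconsistent specialist outputs lead to rejection rather than to an arbitrary class.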

Appendix C Differentiator against Multiple Attack Pairs

We combine several differentiators into a single multiple attack pair differentiator to further reduce the development cost. It aims to defend against adversarial images built from more than one attack pair. Table 3 reports the performance of these differentiators. As seen, combining the differentiators reduces the defence performance: the attack success rate increases to 15.4% for the targeted attacks and to 26.7% for the non-targeted attacks. Since the number of labels increases, more neurons are active during classification. To maintain the accuracy, fewer neurons are pruned during the development of our defence, so the restraint that pruning places on adversarial images is relaxed. As a result, the defence ability of the design may be affected when multiple differentiators are combined into one.