Cassandra: Detecting Trojaned Networks from Adversarial Perturbations

Deep neural networks are being widely deployed for many critical tasks due to their high classification accuracy. In many cases, pre-trained models are sourced from vendors who may have disrupted the training pipeline to insert Trojan behaviors into the models. These malicious behaviors can be triggered at the adversary's will and hence pose a serious threat to the widespread deployment of deep models. We propose a method to verify whether a pre-trained model is Trojaned or benign. Our method captures fingerprints of neural networks in the form of adversarial perturbations learned from the network gradients. Inserting backdoors into a network alters its decision boundaries, which are effectively encoded in its adversarial perturbations. We train a two-stream network for Trojan detection from a network's global (L_∞ and L_2 bounded) perturbations and the localized region of high energy within each perturbation. The former encodes the decision boundaries of the network and the latter encodes the unknown trigger shape. We also propose an anomaly detection method to identify the target class in a Trojaned network. Our methods are invariant to the trigger type, trigger size, training data and network architecture. We evaluate our methods on the MNIST, NIST-Round0 and NIST-Round1 datasets, with up to 1,000 pre-trained models, making this the largest study to date on Trojaned network detection, and achieve over 92% detection accuracy, setting a new state of the art.



1 Introduction

Deep neural networks (DNNs) are the main driving force behind the current success of Artificial Intelligence. However, training DNN models requires enormous amounts of data and computational resources. Hence, many users prefer to source and deploy pre-trained models in their, often security critical, applications such as drug discovery Chen et al. (2018b); Zhang et al. (2018), facial recognition Sun et al. (2015), autonomous driving Geiger et al. (2012), and surveillance Javed and Shah (2002). It is well known that DNNs easily learn any bias that is present in the training data. Vendors of DNN models with malicious intentions can exploit this vulnerability and intentionally inject Trojan behavior into the network during the training process. This is generally achieved by inserting a trigger into some of the samples and then training the DNN to exhibit malicious behavior for data that contains the trigger and normal behavior for data without the trigger. With full control over the DNN training process, the adversary is able to choose any trigger shape. Triggers are chosen such that they do not appear suspicious to a human observer, e.g. a yellow rectangular sticker on a stop sign can be used to trigger a DNN to classify it as a speed limit sign. Since only the adversaries have knowledge of the trigger, they can initiate malicious behaviour at will, and with no knowledge of the trigger, users of pre-trained models may not even suspect the presence of backdoors. This poses a serious threat to the widespread deployment of pre-trained models. Note that Trojan attacks differ from, and are much easier to mount than, adversarial attacks on clean DNNs, since the former have access to the DNN training process itself, while the latter only exploit intrinsic vulnerabilities of neural networks Akhtar and Mian (2018).

Given that noise-based adversarial attacks are inherent to CNN models, it is no surprise that trigger-based Trojan attacks also exist Yuan et al. (2019); Tramèr et al. (2018). Trojans are generally inserted into the deep model during training or transfer learning Liu et al. (2020, 2018b); Eykholt et al. (2018); Chen et al. (2017). A backdoor is typically inserted into a network Gu et al. (2019) to make the CNN mis-classify some specific class or classes. Instead of training a model with a dataset poisoned with triggers, the adversary can also Trojan a network by modifying the weights of selected neurons so that the model responds maliciously to a specific trigger Liu et al. (2018b).

Current challenges for Trojan (backdoor) detection in practice are: 1) Lack of a deep learning-based model trained on a large-scale dataset for Trojan detection; 2) Unavailability of trigger information for a suspected Trojan infected model; usually only limited training data of clean samples is available; 3) Very limited information which can be obtained from the query model predictions since the test accuracy for Trojaned DNNs is normal for clean inputs, and 4) The target class in the infected model is unknown, and it is computationally expensive to search all possible targeted attacks when the output labels are in the hundreds.

To address these challenges, we propose the first deep learning based Trojan Detection Network (TDN). Our method has two stages: the first is a two stream neural network that outputs the probability of a model containing a Trojan, and the second predicts the target class in a Trojaned model. Our contributions are summarized as follows. First, we propose a deep neural network for Trojan detection from only a few clean samples. To the best of our knowledge, we are the first to use a DNN classifier, trained on a large scale dataset of benign and Trojaned models, for Trojan detection. Second, we propose a method for target class prediction in a Trojaned model. We introduce a new variable, attack difficulty, that quantifies the difficulty of attacking a model. This variable is a critical indicator for the target class of a Trojan infected model.

Theoretical Justification: Inserting Trojan behaviour into a network essentially puts an additional constraint on the model optimization during the training process. The model must learn to exhibit normal behavior and achieve an expected high classification accuracy on clean training/validation samples but exhibit the chosen malicious behaviour on samples containing a trigger, a localized pattern. This has two important consequences. Firstly, the decision boundaries of the model must adjust to allow such a behavior. Secondly, the model must become more responsive to local patterns (the trigger). Our hypothesis is that if we can encode these two aspects, we will be able to detect Trojaned models accurately. For the former, we use universal adversarial perturbations Moosavi-Dezfooli et al. (2017) which, being image agnostic, reasonably capture a fingerprint of the decision boundaries. For the latter, we look for a localized region of high energy in the adversarial perturbation. Thirdly, we also hypothesize that Trojaned models are easier to fool with minimal universal perturbation energy compared to clean models. Our proposed method basically capitalizes on these three factors to detect Trojaned networks and the target class of such networks.

2 Related Work

Adversarial attacks on CNNs have focused on the phenomenon of noise-based adversarial examples Szegedy et al. (2014); Akhtar et al. (2018), which are visually almost indistinguishable from the original images, but can mislead DNN classifiers into making incorrect predictions. Even universal adversarial perturbations Moosavi-Dezfooli et al. (2017) have been discovered that are image agnostic and, when added to any image of any class, can cause the DNN to mis-classify it. By computing singular vectors of the Jacobian matrices of hidden layers, universal perturbations can be constructed with very few images Khrulkov and Oseledets (2018). Adversarial attacks generally do not assume access to the training process of deep models. A comprehensive survey of such methods is reported in Akhtar and Mian (2018). In this paper, we focus on defending against Trojan attacks where the attacker disrupts the training pipeline of the DNN to insert a backdoor.

The risk of Trojan models arises when the training process of a DNN is outsourced or a pre-trained model from an untrusted source is deployed. This security risk was first investigated in BadNets Gu et al. (2019). It was shown that backdoors in networks infected with Trojans can remain a threat even after transfer learning. Chen et al. (2017) proposed a backdoor attack algorithm that uses poisoned data to contaminate the CNN model. The Trojaning attack Liu et al. (2018b) introduced a way to generate triggers and maximize the activation of some specific neurons to insert a backdoor. The embedded backdoors are stealthy and the unexpected malicious behavior is activated only by triggers, making them extremely challenging to detect with only clean data samples.

Defense methods were first developed to detect adversarial images Akhtar et al. (2018); Xie et al. (2019); Yuan et al. (2019); Liao et al. (2018); Grosse et al. (2017); Hendrycks and Gimpel (2016). Metzen et al. (2017) detect adversarial perturbations with a target classification network. Feinman et al. (2017) also use a binary classifier to detect adversarial perturbations. MagNet Meng and Chen (2017) trains a classifier on manifolds of normal examples to discriminate adversarial perturbations without any prior knowledge of the attack. SafetyNet Lu et al. (2017) is designed to detect adversarial-noise based attacks and exploits the different adversarial perturbations produced to train an SVM classifier.

Methods for detecting and defending against Trojan attacks have also been proposed. Liu et al. (2018a) proposed a pruning and fine-tuning procedure to suppress backdoor attacks. Chen et al. (2018a) proposed the Activation Clustering methodology for detecting and removing backdoors from DNNs. SentiNet Chou et al. (2018) uses the behavior of adversarial misclassification of poisoned networks to detect an attack. However, all these methods fail in realistic settings where access to poisoned data is not available. Neural Cleanse Wang et al. (2019) was the first method to detect Trojan infected models with clean samples by reverse engineering the trigger. It employs the Median Absolute Deviation (MAD) technique to compute the anomaly in the norm of the reversed triggers to detect Trojaned models. However, the trigger must be reverse engineered for each class, which is not scalable in practice for DNNs with hundreds or thousands of classes. DeepInspect Chen et al. (2019) uses a conditional GAN to reconstruct trigger patterns for Trojan detection. NeuronInspect Huang et al. (2019) detects backdoors from output features such as the sparsity, smoothness, and persistence of saliency maps obtained from back-propagation of the confidence scores. Tabor Guo et al. (2019) proposes metrics to measure the quality of reversed triggers and improves on Neural Cleanse by introducing several regularization terms to refine the generated triggers.

The above methods Wang et al. (2019); Chen et al. (2019); Huang et al. (2019) are sub-optimal because they are not learning-based and employ the MAD technique and manually tuned anomaly thresholds to detect the outliers of reverse engineered triggers. More importantly, none of these techniques report results on large scale data of benign/Trojaned models and none of them can predict the target class of a Trojaned model. To address these challenges, we propose Cassandra, a Trojan detection method that exploits universal adversarial perturbations Khrulkov and Oseledets (2018) generated from a very limited number of clean samples. Given their image-agnostic nature, we compute universal adversarial perturbations for a batch of clean samples, where the batches could be as few as 5. Note that this holds even if the number of classes is in the thousands, unlike prior work such as Neural Cleanse, where one perturbation per class is necessary. Our method also provides the target class of a Trojan infected model.

3 Detecting Trojan Infected Models

Figure 1: (a) and (b) illustrate how decision boundaries can change after inserting a Trojan in a network. The Trojaned model (b) has a complicated decision boundary after being compromised. Introducing triggers in the training data changes the decision boundary of model (b) to accommodate the poisoned samples. This makes it easier to perturb the class label of a sample from class A to B (shown by arrows) since the distance across the decision boundary is smaller than in (a). (c) shows universal adversarial perturbations computed using the L_∞ (top row) and L_2 norm (bottom row) for benign and Trojaned models. Models from left to right: Trojaned Inception-v3, Trojaned DenseNet-121, benign ResNet50, and benign DenseNet-121.

During training, a neural network simultaneously learns feature representations and decision boundaries that partition the feature vector space into the respective classes. When an adversary inserts a backdoor into a network, the decision boundaries are altered. Our hypothesis is that Trojan infected networks exhibit decision boundaries that are different from those of typical, benign classification networks. Our approach exploits this by retrieving fingerprints of the decision boundaries of a network, and subsequently training a classifier on these fingerprints to classify a query network as benign or Trojan infected. We use adversarial perturbations to retrieve fingerprints of the decision boundaries of the query network. In contrast to image specific perturbations, universal perturbations Moosavi-Dezfooli et al. (2017) are image agnostic: a generated perturbation, when added to any input image, sends it across the decision boundary and changes its label. The success of a universal perturbation is measured by its fooling rate, the proportion of images that are mis-classified after the perturbation is added. Since universal adversarial perturbations capture the geometry of the decision boundaries Moosavi-Dezfooli et al. (2017), the perturbations for benign and Trojan models are expected to be significantly different in character.
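The fooling rate described above can be illustrated with a minimal sketch. The `predict` callable and the toy threshold classifier are hypothetical stand-ins for a query network, used only to make the example self-contained.

```python
import numpy as np

def fooling_rate(predict, images, perturbation):
    """Fraction of images whose predicted label changes once the
    (image-agnostic) perturbation is added to every image."""
    clean_labels = predict(images)
    perturbed_labels = predict(images + perturbation)
    return float(np.mean(clean_labels != perturbed_labels))

# Hypothetical toy classifier: label 1 if the mean pixel value exceeds 0.5.
def toy_predict(batch):
    return (batch.reshape(len(batch), -1).mean(axis=1) > 0.5).astype(int)

images = np.array([[0.1, 0.2], [0.45, 0.45], [0.7, 0.9], [0.48, 0.5]])
v = np.full(2, 0.1)  # the same perturbation is applied to all images
rate = fooling_rate(toy_predict, images, v)  # two of four labels flip
```

Here the second and fourth images cross the toy decision threshold, so half of the inputs are fooled by the single shared perturbation.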

3.1 Fingerprinting Decision Boundaries with Adversarial Perturbations

We formulate Trojan detection as a classification problem. For a query neural network model, $f$, we define a Trojan detection classifier $\mathcal{T}$ as

$$P(\text{Trojan}) = \mathcal{T}\big(f(x), \rho, E\big). \tag{1}$$

Here $f(x)$ outputs a prediction for each input image $x$, drawn from the distribution of images in $\mathcal{X}$, $\rho$ is the fooling rate and $E$ is the perturbation energy. Similarly, $f'$ and $f'(x)$ denote the Trojaned classifier and the corresponding prediction for $x$. For a desired threshold $\delta$, we obtain universal perturbations $v$ and $v'$ for classifiers $f$ and $f'$, respectively, such that the following holds:

$$\mathbb{P}_{x \in \mathcal{X}}\big(f(x+v) \neq f(x)\big) \geq \delta, \qquad \mathbb{P}_{x \in \mathcal{X}}\big(f'(x+v') \neq f'(x)\big) \geq \delta. \tag{2}$$

Note that the observed fooling rate $\rho$ can go much higher than $\delta$ during the generation of universal adversarial perturbations. We define the perturbation energy as:

$$E = \|v\|_p, \tag{3}$$

where $p$ is parameterized by the process that generates the perturbations. Let $E_{B \to A}$ denote the perturbation energy cost to transform all data samples from class B to class A across the decision boundary for a benign model, and vice versa for $E_{A \to B}$. Similarly, $E'_{B \to A}$ and $E'_{A \to B}$ denote the same for a Trojan infected model. In an infected model the decision boundary is changed such that some backdoors are created close to other classes. Due to these changes in the decision boundary, $E'_{A \to B} < E_{A \to B}$ and $E'_{B \to A} < E_{B \to A}$ for a given fooling rate (see Fig. 1a,b), where $E_{A \to B}$ is proportional to the distance $d_{A \to B}$ across the decision boundary, and so on.

We define the notion of attack difficulty for both universal perturbations and targeted attacks as

$$D = \frac{E}{\rho}, \tag{4}$$

where $\rho$ is the fooling rate ($\rho \geq \delta$) for universal perturbations and the attack success rate for targeted attacks. Universal adversarial perturbations of clean and Trojan infected models are distinguishable above a given fooling rate $\delta$, both visually and in terms of energy, as shown in Figure 1c.
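As a rough illustration of attack difficulty (perturbation energy normalized by the fooling or success rate), the sketch below takes the energy to be an L_p norm of the perturbation. That choice, the function names, and the toy perturbations are assumptions made for illustration only.

```python
import numpy as np

def perturbation_energy(perturbation, order=2):
    """Energy of a perturbation, assumed here to be its L_p norm."""
    return float(np.linalg.norm(np.ravel(perturbation), ord=order))

def attack_difficulty(perturbation, success_rate, order=2):
    """Energy spent per unit of fooling (or attack success) rate."""
    if success_rate <= 0:
        return float("inf")  # the attack never succeeded
    return perturbation_energy(perturbation, order) / success_rate

# Hypothetical perturbations: the Trojaned model is fooled more often
# by a weaker perturbation, so its attack difficulty is lower.
v_benign = np.full((4, 4), 0.5)
v_trojan = np.full((4, 4), 0.25)
d_benign = attack_difficulty(v_benign, success_rate=0.4)
d_trojan = attack_difficulty(v_trojan, success_rate=0.8)
```

Under these toy numbers the benign model is four times harder to attack than the Trojaned one, which is the kind of gap the hypothesis above relies on.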

4 Trojan Detector

Figure 2 shows the schematic overview of our proposed Trojan detector, referred to as Cassandra. The query network, along with the clean labelled training data, is used to generate two types of universal perturbations, i.e. those bounded by the L_∞ and L_2 norms. Note that we do not assume the presence of triggered images in the training data, since triggers are unknown in a realistic scenario. The L_∞ bounded universal perturbations are fed to one stream of the network together with their corresponding attack difficulty. Similarly, the L_2 norm bounded universal perturbations and their attack difficulty are fed to the second stream of the Trojan detection network. The feature extractor in Fig. 2 extracts distinguishing features from the bounded perturbations of each stream. The two feature extractors have identical architectures, but have different sets of weights. Outputs from the two streams are concatenated and used to train the Trojan classifier with binary cross-entropy loss.

Perturbation Generator: Since the target class of the (potentially Trojan infected) query network is unknown, we compute universal adversarial perturbations (Eq. 2) Moosavi-Dezfooli et al. (2017) that cause mis-classification of any input image. The DeepFool Moosavi-Dezfooli et al. (2016) kernel is used for perturbation generation. A batch of training images is passed to the query network and the direction of the nearest decision boundary is computed, which is back-propagated to compute a small L_∞ or L_2 bounded perturbation for the input. By iteratively refining the perturbation over different mini-batches, a universal (image agnostic) adversarial perturbation is obtained. The generated perturbations are sent to their respective feature extractor streams for further processing. In our case, we stop the iterations when the universal perturbation achieves a certain fooling rate threshold or a maximum number of iterations is reached.
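The iterative procedure can be sketched on a toy problem. The example below is not the authors' implementation: it assumes a linear classifier (weights `W`, bias `b`, both hypothetical), for which the DeepFool step to the nearest decision boundary has a closed form, and it processes one sample at a time rather than mini-batches. The radius `xi`, the stopping threshold `target_rate` and `max_iter` are illustrative values.

```python
import numpy as np

def project(v, xi, order):
    """Project v onto the L_2 or L_inf ball of radius xi."""
    if order == 2:
        norm = np.linalg.norm(v)
        return v if norm <= xi else v * (xi / norm)
    return np.clip(v, -xi, xi)  # order == np.inf

def deepfool_linear(x, W, b, overshoot=0.02):
    """Smallest step pushing x across the nearest boundary of a linear classifier."""
    scores = W @ x + b
    k0 = int(np.argmax(scores))
    best = None
    for k in range(len(b)):
        if k == k0:
            continue
        w_diff = W[k] - W[k0]
        gap = scores[k0] - scores[k]  # non-negative margin to class k
        dist = gap / (np.linalg.norm(w_diff) + 1e-12)
        if best is None or dist < best[0]:
            best = (dist, w_diff, gap)
    _, w_diff, gap = best
    step = (gap + 1e-6) / (np.linalg.norm(w_diff) ** 2 + 1e-12) * w_diff
    return (1 + overshoot) * step

def universal_perturbation(X, W, b, xi=1.0, order=2, target_rate=0.2, max_iter=10):
    """Accumulate per-sample steps into one bounded, image-agnostic perturbation."""
    def predict(A):
        return np.argmax(A @ W.T + b, axis=1)

    clean = predict(X)
    v = np.zeros(X.shape[1])
    rate = 0.0
    for _ in range(max_iter):
        for x, label in zip(X, clean):
            if np.argmax(W @ (x + v) + b) == label:  # sample not fooled yet
                v = project(v + deepfool_linear(x + v, W, b), xi, order)
        rate = float(np.mean(predict(X + v) != clean))
        if rate >= target_rate:  # stop once the desired fooling rate is reached
            break
    return v, rate

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 8)), rng.normal(size=3)
X = rng.normal(size=(50, 8))
v, rate = universal_perturbation(X, W, b, xi=1.0, order=2)
```

For a deep network the closed-form step would be replaced by DeepFool's iterative linearization using the network gradients; the projection and stopping logic stay the same.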

Feature Extractor contains two parallel modules. The first one (top right in Fig. 2) is a Multi Layer Perceptron (MLP), which takes a window cropped from the perturbation image (after conversion to grayscale) and outputs a 256-dimensional feature vector. The MLP has five layers. A sliding window is moved over the grayscale perturbation and the location which has the maximum norm is selected as input to the MLP. The second module (bottom right in Fig. 2) comprises a MobileNetV3-Large CNN Howard et al. (2019) pre-trained on ImageNet classification. The 1280-dimensional embedding output from the penultimate layer is used as a feature for the Trojan classifier. The network has a total of 5.5M trainable parameters.
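The sliding-window selection of the highest-energy region can be sketched as follows. The window size of 8, the channel-mean grayscale conversion, and the use of the Frobenius (L_2) norm are assumptions for illustration; the text above does not fix them.

```python
import numpy as np

def max_energy_window(perturbation_rgb, window=8):
    """Return the grayscale crop with the largest norm and its top-left corner."""
    gray = perturbation_rgb.mean(axis=2)  # assumed grayscale conversion
    h, w = gray.shape
    best_energy, best_pos = -1.0, (0, 0)
    for top in range(h - window + 1):
        for left in range(w - window + 1):
            patch = gray[top:top + window, left:left + window]
            energy = np.linalg.norm(patch)  # Frobenius norm of the patch
            if energy > best_energy:
                best_energy, best_pos = energy, (top, left)
    top, left = best_pos
    return gray[top:top + window, left:left + window], best_pos

# Toy perturbation with a localized high-energy block, mimicking a trigger-like region.
perturbation = np.zeros((32, 32, 3))
perturbation[10:18, 20:28, :] = 1.0
crop, pos = max_energy_window(perturbation, window=8)
```

The flattened crop would then be the input to the MLP; for large images a summed-area table would avoid the quadratic scan.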

Trojan Classifier: The output of each feature extractor module is a concatenation of the MLP features (256-D), CNN features (1280-D) and the attack difficulty (1-D), giving a 1,537-D vector. The outputs of the two streams (L_∞ and L_2 perturbations) are concatenated to form a 3,074-D vector that is fed to the Trojan classifier, which is a simple fully connected layer. The probability of the query model being Trojan infected is obtained by applying the sigmoid activation to the output. To fully capture properties of the complex decision boundaries, we divide the training data into 10 batches and obtain 10 probabilities for each query model. The final score is computed as the mean value of these 10 probabilities.
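The dimension bookkeeping and score averaging can be checked with a small sketch. The random features, random classifier weights, and the attack difficulty values 1.3 and 2.1 are placeholders, not trained quantities.

```python
import numpy as np

MLP_DIM, CNN_DIM = 256, 1280

def stream_features(mlp_feat, cnn_feat, attack_difficulty):
    """Concatenate one stream's features: 256 + 1280 + 1 = 1537 values."""
    return np.concatenate([mlp_feat, cnn_feat, [attack_difficulty]])

def trojan_probability(linf_stream, l2_stream, weights, bias):
    """Single fully connected layer followed by a sigmoid."""
    x = np.concatenate([linf_stream, l2_stream])  # 3074-D input
    logit = float(weights @ x + bias)
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.01, size=2 * (MLP_DIM + CNN_DIM + 1))
bias = 0.0

probs = []
for _ in range(10):  # one perturbation pair per data batch
    s_inf = stream_features(rng.normal(size=MLP_DIM), rng.normal(size=CNN_DIM), 1.3)
    s_2 = stream_features(rng.normal(size=MLP_DIM), rng.normal(size=CNN_DIM), 2.1)
    probs.append(trojan_probability(s_inf, s_2, weights, bias))
final_score = float(np.mean(probs))  # mean of the 10 per-batch probabilities
```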

Figure 2: Trojan Detection: Features extracted from universal adversarial perturbations are used to fingerprint the query model. A two stream architecture is used with L_∞ and L_2 norm bounded perturbations. The two feature extractor modules have identical architectures, but do not share weights. Three sets of features characterizing the query model are extracted: features representing the maximum energy window in the perturbation, extracted by a 5 layer MLP; a CNN embedding of the perturbation image, computed with a MobileNetV3 network; and the attack difficulty of the generated perturbation. Features from the two streams are concatenated and passed to the fully connected classification layer.

5 Target Class Prediction

We propose targeted attack difficulty as a metric for outlier class prediction in a Trojan infected model. An outlier class is one against which it is easy to launch a targeted attack, compared to the other classes, and which is hence most likely to be the target class of the Trojan infected model.

Attack difficulty is defined as the ratio of the perturbation energy (Eq. 3) to the attack success rate of the targeted attack, i.e. the proportion of images whose predictions change to the target label. Attack difficulty (or its reciprocal, attack efficiency) thus measures the perturbation energy normalized by the success rate of the attack. Given a query model, we use the Fast Gradient Sign Method (FGSM) Goodfellow et al. (2015), given its fast execution time, to compute adversarial perturbations for each class. For example, NIST-Round0 data contains five class labels and the Trojan models classify triggered images of any class to class 0. In this case, class 0 is the target class and the attack is called an "any-to-one" targeted attack. Fig. 3 shows that the proposed attack difficulty is able to correctly detect the target class of the Trojan attack as an outlier, whereas the norm alone, as used for Trojaned model detection in Neural Cleanse Wang et al. (2019), fails. We finalize our target class prediction with a two stage method. The first stage is our Trojan detection network, which outputs the probability of the model being infected with a Trojan, and the second stage is outlier detection based on the Median Absolute Deviation (MAD) Hampel (1974); Wang et al. (2019) for predicting the target class. The anomaly index for outlier detection is defined as the absolute deviation of each data point from the median, normalized by the median of these absolute deviations (the MAD, scaled by a consistency constant), so that it measures deviation relative to the dispersion of the data. For Trojaned models, the second stage selects the label with anomaly index value above a threshold as the predicted target class.
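The second stage can be sketched as a MAD-based outlier test over per-class attack difficulties. The consistency constant 1.4826 and the threshold of 2 are the standard values used with MAD in Neural Cleanse; the per-class difficulties below are made-up numbers for a five-class model whose target class is 0.

```python
import numpy as np

def anomaly_indices(values, consistency=1.4826):
    """Absolute deviation from the median, normalized by the scaled MAD."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    abs_dev = np.abs(values - median)
    mad = consistency * np.median(abs_dev)
    return abs_dev / (mad + 1e-12)  # small constant guards against MAD == 0

def predict_target_classes(attack_difficulty_per_class, threshold=2.0):
    """Labels whose anomaly index exceeds the threshold."""
    idx = anomaly_indices(attack_difficulty_per_class)
    return [int(k) for k in np.flatnonzero(idx > threshold)]

# Hypothetical per-class attack difficulties: class 0 is far easier to attack.
difficulties = [0.6, 4.8, 5.1, 5.0, 4.9]
targets = predict_target_classes(difficulties)
```

In the two stage method this test would only be run on models that the first stage has flagged as Trojaned.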

Figure 3: Target class prediction: The norm (left) and attack difficulty (norm / success rate) (right) of FGSM adversarial perturbations computed per label for benign and Trojan infected networks. The true target class is 0; the detected outlier in each case is encircled.

6 Experiments

For all experiments, we perform 5-fold cross validation and report average results. In Eq. 2, the threshold is set to 0.2 to quantify the desired fooling rate. We use the Adam optimizer with a learning rate of 0.001. The constant estimator for the MAD outlier detector is 1.4826, so that any data sample with anomaly index larger than 2 has 95% probability of being an outlier. We employ an anomaly index threshold of 2, such that class labels with anomaly index larger than 2 are considered the target class. For training we use a server with 6 Nvidia RTX 2080 Ti GPUs. Perturbation generation and training for NIST-Round1 data takes around 12 hours. Inference for each model takes about 560s on an Nvidia RTX 2080 Ti, and inference for 200 models finishes within 24 hours.

6.1 Datasets

We evaluate our proposed approach on a dataset of trigger infected models for classifying images from MNIST and on the public NIST-Round0 and NIST-Round1 datasets. We refer to the dataset of trigger infected MNIST classification models as the Triggered MNIST dataset throughout. The Triggered MNIST dataset was generated with code from the TrojAI GitHub repo. The NIST datasets were obtained from the TrojAI challenge website.

Triggered MNIST Dataset: Two types of triggers, Type I and II (see Fig. 4) are inserted into each image of clean MNIST dataset to generate Triggered data. A total of 900 models of 3 architectures (ModdedBadNet, BadNet and ModdedLeNet5) are generated. Out of these, 300 were benign models, 300 were trained for any-to-any attack, and 300 were trained for any-to-one targeted attack. Details of the models and their performance on clean and triggered data are given in the supplementary material.

NIST Datasets: The NIST datasets consist of traffic sign classification models (half benign and half Trojaned) with 3 possible architectures (Inception-v3, DenseNet-121, and ResNet50). The models were trained on synthetically created images of artificial traffic signs superimposed on road background scenes. The Trojan infected models are poisoned with an unknown embedded trigger. NIST-Round0 and NIST-Round1 datasets are both from the same distribution, the main difference is that Round0 consists of 200 models, while Round1 dataset has 1,000 models. Details of the models in the NIST datasets including their accuracy and attack success rates are provided in the supplementary material. Clean data samples used to train the NIST models can be seen in Figure 5.

Figure 4: Triggered MNIST dataset samples containing Type I triggers (a) and Type II triggers (b).
Figure 5: Clean image samples of 5 classes from the NIST datasets.

6.2 Results

In Table 1, we report the classification accuracy for a variety of training and test set model configurations for the Triggered MNIST dataset. The classification accuracy is consistently high for both Type I (93.3%) and Type II (91.7%) triggers. Even when training and test models are infected by different types of triggers, the algorithm still has a high classification accuracy of 91.7% (or 90%) which shows that our method is independent of trigger types. We achieve 94.4% performance for the configuration where both trigger types are present, in equal proportions, in the training and test sets (Table 1 last row).

# Train Models    Train Trigger    # Test Models    Test Trigger    Accuracy
240               I                60               I
240               II               60               II
240               I                60               II
240               II               60               I
480               I, II            120              I, II

Table 1: Trojan detection results for the Triggered MNIST dataset of models. The proposed method achieves good results even when trained on a trigger that is different from the one seen at test time.

Method                               Triggered MNIST    NIST-Round0    NIST-Round1
Neural Cleanse Wang et al. (2019)
Cassandra (Ours)

Table 2: Trojan detection accuracy on the Triggered MNIST, NIST-Round0 and NIST-Round1 datasets.

The NIST datasets are more challenging than Triggered MNIST, not only in terms of trigger types, color and size of the data used to train the infected models, but also because the NIST models are much deeper. Our method obtains a high classification accuracy of 92.5% for NIST-Round0 and 92.0% for NIST-Round1. Table 2 shows results of our method on the Triggered MNIST, NIST-Round0 and NIST-Round1 datasets and compares them to Neural Cleanse Wang et al. (2019). Our proposed Trojan Detection Network outperforms Neural Cleanse on all three datasets by large margins. This can be attributed to two reasons. Firstly, it is difficult for Neural Cleanse to find an optimal anomaly index threshold. Secondly, reverse engineering the trigger does not perform well when the triggers are complex.

                               Triggered MNIST    NIST-Round0    NIST-Round1
with predicted P(Trojan)
with ground truth P(Trojan)

Table 3: Target Class Prediction Accuracy: Availability of P(Trojan) significantly increases the target class prediction accuracy. All models are attacked any-to-one, i.e. one target class per model.

Table 3 shows our target class prediction results. The proposed two stage prediction algorithm, based on the attack difficulty and the predicted P(Trojan), improves the classification accuracy significantly over the baseline (without P(Trojan)) from 76.1%, 72.5% and 70.0% to 90.0%, 94.7% and 88.1% on the Triggered MNIST, NIST-Round0 and NIST-Round1 datasets respectively. Using the ground truth P(Trojan) further improves classification accuracy, which demonstrates that attack difficulty is a critical indicator of the target class.

6.3 Ablation Study

Trojan Detector Network Modules: In Table 4, we explore different network architectures and the functionality of individual modules of our method. Using only universal perturbations computed from the complete training data of NIST-Round0, we achieved 77.5% classification accuracy. After dividing the training data into 10 batches (these are different from the training mini-batches), we generate 10 perturbations for each model. With these 10 perturbations, the accuracy improves to 85%. Adding the attack difficulty further improves the classification accuracy in all cases. Finally, with multi-batch and two stream architecture we achieve 92.5% classification accuracy.

                                                       Classification Accuracy
Input for classifier                                   without attack difficulty    with attack difficulty
perturbation from all training data
multi-batch perturbations
multi-batch + two stream (L_∞ & L_2) perturbations

Table 4: Effects of using multiple perturbations and attack difficulty. Trojan detection accuracy on the NIST-Round0 validation data improves significantly after using multiple perturbations (n=10) calculated from different batches of training data. Using L_∞ and L_2 perturbations in a two stream architecture combined with attack difficulty further improves the accuracy.

Perturbation Magnitude ()
= 10, # iterations = 10
# of Iterations 0.1 0.2 0.4 0.8 1
Perturbation Magnitude ()
= 1, # iterations = 10
# of Iterations 5 10 20 30 40
Table 5: Effect of perturbation generator hyper-parameters on Trojan detection accuracy

Universal Perturbation Generator Hyper-parameters: The choice of hyper-parameters may affect the effectiveness of the generated universal adversarial perturbations for various tasks. However, our experiments show that the proposed method is robust to these parameters. We compare the mean classification accuracy when using different numbers of iterations and magnitudes for L_∞ and L_2 bounded universal adversarial perturbations, and find that the Trojan detection accuracy varies only slightly, as shown in Table 5.

7 Conclusion

We proposed the first deep learning based method, trained on a large scale dataset of Trojaned and clean models, for detecting Trojan infected models. We exploit universal adversarial perturbations to retrieve the fingerprints of Trojans in DNNs and train our proposed TDN on the features of the perturbations and the attack difficulty to discriminate between benign and Trojaned models. We also proposed a simple variable, coined attack difficulty, to measure the energy needed to achieve an average unit fooling rate. Based on the attack difficulty, we proposed a two stage target class prediction method that can predict the target class of a Trojaned model in addition to the Trojan probability. This provides further information on the type of malicious behaviour embedded in a Trojan infected model, e.g. which identity is being impersonated in a Trojaned face recognition model.


This research was supported in part under ARC Discovery Grant DP190102443. Xiaoyu Zhang and Rohit Gupta were supported by University of Central Florida ORC fellowships.


  • [1] N. Akhtar, J. Liu, and A. Mian (2018-06) Defense against universal adversarial perturbations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [2] N. Akhtar, J. Liu, and A. Mian (2018) Defense against universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3389–3398.
  • [3] N. Akhtar and A. Mian (2018) Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, pp. 14410–14430.
  • [4] B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava (2018) Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728.
  • [5] H. Chen, O. Engkvist, Y. Wang, M. Olivecrona, and T. Blaschke (2018) The rise of deep learning in drug discovery. Drug discovery today 23 (6), pp. 1241–1250.
  • [6] H. Chen, C. Fu, J. Zhao, and F. Koushanfar (2019) Deepinspect: a black-box trojan detection and mitigation framework for deep neural networks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, pp. 4658–4664.
  • [7] X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.
  • [8] E. Chou, F. Tramèr, G. Pellegrino, and D. Boneh (2018) Sentinet: detecting physical attacks against deep learning systems. arXiv preprint arXiv:1812.00292.
  • [9] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song (2018) Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1625–1634.
  • [10] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410.
  • [11] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361.
  • [12] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
  • [13] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel (2017) On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280. Cited by: §2.
  • [14] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg (2019) BadNets: evaluating backdooring attacks on deep neural networks. IEEE Access 7 (), pp. 47230–47244. Cited by: §1, §2.
  • [15] W. Guo, L. Wang, X. Xing, M. Du, and D. Song (2019) Tabor: a highly accurate approach to inspecting and restoring trojan backdoors in ai systems. arXiv preprint arXiv:1908.01763. Cited by: §2.
  • [16] F. R. Hampel (1974) The influence curve and its role in robust estimation. Journal of the american statistical association 69 (346), pp. 383–393. Cited by: §5.
  • [17] D. Hendrycks and K. Gimpel (2016) Early methods for detecting adversarial images. arXiv preprint arXiv:1608.00530. Cited by: §2.
  • [18] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324. Cited by: §4.
  • [19] X. Huang, M. Alzantot, and M. Srivastava (2019) NeuronInspect: detecting backdoors in neural networks via output explanations. arXiv preprint arXiv:1911.07399. Cited by: §2, §2.
  • [20] O. Javed and M. Shah (2002) Tracking and object classification for automated surveillance. In European Conference on Computer Vision, pp. 343–357. Cited by: §1.
  • [21] V. Khrulkov and I. Oseledets (2018-06) Art of singular vectors and universal adversarial perturbations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §2.
  • [22] F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu (2018-06) Defense against adversarial attacks using high-level representation guided denoiser. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [23] K. Liu, B. Dolan-Gavitt, and S. Garg (2018) Fine-pruning: defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 273–294. Cited by: §2.
  • [24] Y. Liu, S. Ma, Y. Aafer, W. Lee, J. Zhai, W. Wang, and X. Zhang (2018) Trojaning attack on neural networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018, Cited by: §1, §2.
  • [25] Y. Liu, A. Mondal, A. Chakraborty, M. Zuzak, N. Jacobsen, D. Xing, and A. Srivastava (2020) A survey on neural trojans. In 2020 IEEE International Symposium on Quality Electronics Design (ISQED), Cited by: §1.
  • [26] J. Lu, T. Issaranon, and D. Forsyth (2017) Safetynet: detecting and rejecting adversarial examples robustly. In Proceedings of the IEEE International Conference on Computer Vision, pp. 446–454. Cited by: §2.
  • [27] D. Meng and H. Chen (2017) Magnet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 135–147. Cited by: §2.
  • [28] J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff (2017) On detecting adversarial perturbations. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [29] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard (2017) Universal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1765–1773. Cited by: §1, §2, §3, §4.
  • [30] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §4.
  • [31] Y. Sun, D. Liang, X. Wang, and X. Tang (2015) Deepid3: face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873. Cited by: §1.
  • [32] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [33] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2018) Ensemble adversarial training: attacks and defenses. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [34] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao (2019) Neural cleanse: identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 707–723. Cited by: §2, §2, §5, §6.2, Table 2.
  • [35] C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He (2019) Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 501–509. Cited by: §2.
  • [36] X. Yuan, P. He, Q. Zhu, and X. Li (2019) Adversarial examples: attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems 30 (9), pp. 2805–2824. Cited by: §1, §2.
  • [37] X. Zhang, S. Wang, F. Zhu, Z. Xu, Y. Wang, and J. Huang (2018) Seq3seq fingerprint: towards end-to-end semi-supervised deep drug discovery. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 404–413. Cited by: §1.

Supplementary Material

TrojAI Leaderboard Results on NIST-Round0 Dataset

Figure 6 shows a snapshot of the TrojAI Leaderboard for NIST Round0 (see Section NIST Round0 and NIST Round1 Datasets for dataset description). These results are compiled by the NIST server using a held-out test set not available publicly. The snapshot was taken at 21:30 hours on 10 June 2020 and adjusted to fit on this page without interfering with the results. We can see that Cassandra outperforms all other competitors by a significant margin. The best results in terms of Cross-Entropy Loss and ROC-AUC for each method are repeated in Table 6. Notice that we have the lowest loss and the highest ROC-AUC.

Figure 6: A snapshot of the TrojAI Leaderboard shows our method on top.
Team | Loss (Cross-Entropy) | ROC (AUC) | F1 | Precision | Recall
Cassandra | 0.1601 | 0.9833 | 0.9746 | 0.9897 | 0.9600
UCF-XR | 0.3149 | 0.9575 | - | - | -
UCF-XR | 0.2882 | 0.9563 | - | - | -
IceTorch | 0.2891 | 0.9320 | - | - | -
IceTorch | 0.2917 | 0.9275 | - | - | -
Table 6: TrojAI Leaderboard results on NIST-Round0 dataset sorted by ROC (AUC). We have included the best results in terms of Loss and ROC for our competitors. Best results in each column are in bold. Note that we cannot access the F1 score, Precision and Recall for other methods.
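As a quick consistency check on the metrics in Table 6, the sketch below recomputes the F1 score from the reported precision and recall, and shows one common form of the cross-entropy loss for a binary Trojaned/benign decision. The helper names are illustrative, and the assumption that the leaderboard uses the mean binary cross-entropy over models is ours; the exact formula used by the NIST server is not stated here.

```python
import numpy as np

def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy (assumed form of the leaderboard loss)."""
    y = np.asarray(labels, dtype=float)
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Precision and recall reported for Cassandra in Table 6.
print(round(f1_score(0.9897, 0.9600), 4))  # 0.9746
```

The recomputed value agrees with the F1 score listed in the table.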

MNIST Model Generation

Clean Model Generation

The data is split into a training set of 60,000 images (6,000 images per class) and a test set of 10,000 images (1,000 from each class). The clean data are used for training 300 benign/clean models with three architecture types (ModdedBadnet, Badnet and ModdedLenet5net), each with 100 models (see Table 8).

Trojaned Model Generation

Clean Data: The MNIST dataset has 10 classes with 70,000 clean images (without triggers).

Triggered Data: Two types of triggers, Type I and Type II (see Figure in main paper), were inserted into images of the MNIST dataset. The Triggered MNIST data was combined with clean data to generate Trojaned models. The following three data splits were used in our experiments:
Data split 1: training: 60,000 (triggered data: 10%), testing: 10,000 (triggered data: 10%).
Data split 2: training: 60,000 (triggered data: 15%), testing: 10,000 (triggered data: 15%).
Data split 3: training: 60,000 (triggered data: 20%), testing: 10,000 (triggered data: 20%).

Models: In addition to the 300 benign models, another 600 Trojaned models of the same three architectures (ModdedBadnet, Badnet and ModdedLenet5net) were generated. Trojaned models were trained on a mix of Triggered MNIST data and clean data, where the proportion of triggered data was 10%, 15% or 20%. Table 8 shows the details of both clean and infected models trained for any-to-any Trojan attack. Any-to-one attack models were generated in the same way as any-to-any models. 300 Trojaned models were trained for any-to-any targeted attack, and another 300 were trained for any-to-one targeted attack.
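The data poisoning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build our models: the trigger is a hypothetical 4x4 white square (the actual Type I and Type II triggers are shown in the main paper), and the any-to-any relabeling is assumed to follow the BadNets-style mapping y -> (y + 1) mod 10.

```python
import numpy as np

def poison_dataset(images, labels, fraction, mode="any-to-one",
                   target_class=0, num_classes=10, seed=0):
    """Stamp a trigger on a random fraction of images and relabel them."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(round(fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Hypothetical trigger: 4x4 white square in the bottom-right corner.
    images[idx, -4:, -4:] = 1.0
    if mode == "any-to-one":
        # Every triggered image is relabeled to the single target class.
        labels[idx] = target_class
    elif mode == "any-to-any":
        # Assumed mapping: each class is shifted to the next class.
        labels[idx] = (labels[idx] + 1) % num_classes
    else:
        raise ValueError(f"unknown mode: {mode}")
    return images, labels, idx

# Toy stand-in for MNIST: 1,000 blank 28x28 images with random labels.
rng = np.random.default_rng(1)
x = np.zeros((1000, 28, 28), dtype=np.float32)
y = rng.integers(0, 10, size=1000)
x_p, y_p, idx = poison_dataset(x, y, fraction=0.10, mode="any-to-any")
```

Setting `fraction` to 0.10, 0.15 or 0.20 mirrors the three data splits listed above.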

Evaluations of ModdedBadnet, Badnet and ModdedLeNet5 models are shown in Table 8 for any-to-any attack and in Table 9 for any-to-one targeted attack. The clean models and Trojaned models both have high classification accuracy when the test data is clean. The clean models also have high classification accuracy when the test data is triggered: since there is no Trojan in a clean model, the triggered image samples are still correctly classified. For the Trojaned models, however, the classification accuracy on triggered data (100 − Attack Success Rate) is low since the triggered images are misclassified. For the triggered data, the tables therefore report only the Attack Success Rate, which is very high. These results imply that the Trojan (backdoor) was successfully inserted into the models.
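The relation between the two metrics can be illustrated with a short sketch. The function names are illustrative; the any-to-any definition follows the caption of Table 8 (prediction changed to an incorrect label), while excluding images that already belong to the target class in the any-to-one case is our assumption.

```python
import numpy as np

def classification_accuracy(preds, labels):
    """Percentage of predictions that match the true labels."""
    return 100.0 * np.mean(np.asarray(preds) == np.asarray(labels))

def attack_success_rate(triggered_preds, labels, target_class=None):
    """Percentage of triggered images on which the attack succeeds."""
    preds, labels = np.asarray(triggered_preds), np.asarray(labels)
    if target_class is None:
        # Any-to-any: success means the prediction is an incorrect label.
        return 100.0 * np.mean(preds != labels)
    # Any-to-one: success means the prediction is the target class.
    mask = labels != target_class  # assumed: skip images already in target
    return 100.0 * np.mean(preds[mask] == target_class)

# Toy example: 9 of 10 triggered images are misclassified.
labels = np.arange(10)
triggered_preds = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 9])
asr = attack_success_rate(triggered_preds, labels)
acc = classification_accuracy(triggered_preds, labels)
```

In this toy case the attack success rate is 90% and the accuracy on triggered data is 10%, so the two add up to 100 as stated above.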

Model name | Model Architecture | Trigger | Triggered data | #
ModdedBadNet | 2 Conv + 1 Dense | Type I, II | 10%, 15% and 20% | 100+100
BadNet | 2 Conv + 2 Dense | Type I, II | 10%, 15% and 20% | 100+100
ModdedLeNet5 | 3 Conv + 2 Dense | Type I, II | 10%, 15% and 20% | 100+100
Table 7: Three models trained on Triggered MNIST dataset. Half the models are for any-to-any attack and half are for any-to-one attack. For the latter case each model only has one target class.

Model Type | Trojaned Model: Attack Success Rate (Trigger I) | Trojaned Model: Attack Success Rate (Trigger II) | Trojaned Model: Classification Accuracy (Trigger I) | Trojaned Model: Classification Accuracy (Trigger II) | Clean Model: Classification Accuracy
BadNet | 98.7 | 98.7 | 98.9 | 99.0 | 99.1
ModdedBadNet | 97.3 | 97.6 | 97.2 | 96.5 | 98.8
ModdedLeNet5net | 97.8 | 98.6 | 98.0 | 97.3 | 98.7
Table 8: Attack success rate and classification accuracy for three types of Trojaned models (any-to-any attack) for Triggered MNIST dataset. Success rate is the proportion of images for which the prediction by the Trojaned model is changed to an incorrect label.
Model Type | Trojaned Model: Attack Success Rate (Trigger I) | Trojaned Model: Attack Success Rate (Trigger II) | Trojaned Model: Classification Accuracy (Trigger I) | Trojaned Model: Classification Accuracy (Trigger II)
Badnet | 99.1 | 99.0 | 98.8 | 98.9
ModdedBadnet | 98.5 | 98.3 | 97.6 | 97.4
ModdedLenet5net | 98.8 | 98.4 | 98.0 | 97.5
Table 9: Attack success rate and classification accuracy for three types of Trojaned models (any-to-one targeted attack) on the Triggered MNIST dataset. Success rate is the proportion of images whose label is changed to the target class by the Trojaned model.

NIST Round0 and NIST Round1 Datasets

The NIST datasets consist of CNN models for traffic sign classification. Half of the models are benign and half are Trojaned. The models have three architectures, namely Inception-v3, DenseNet-121 and ResNet50. The models were trained on synthetically created image data of artificial traffic signs superimposed on road background scenes. The Trojaned models have been poisoned with triggers of different color, size and shape. The Round0 dataset consists of 200 models, while the Round1 dataset has 1,000 models. NIST also holds a sequestered test dataset to evaluate models. For that, models must be uploaded to the TrojAI Leaderboard website. Section TrojAI Leaderboard Results on NIST-Round0 Dataset and Table 6 discuss our results on the TrojAI Leaderboard.

Table 10 and Table 11 show the model details and the performance of the three architecture types present in the NIST Round0 and Round1 datasets. Notice that the Trojan infected models have accuracy on par with the clean models, and yet they have a very high attack success rate on the triggered data.

Model Type | Trojan Infected Model: Attack Success Rate | Trojan Infected Model: Classification Accuracy | Clean Model: Classification Accuracy | # models
DenseNet-121 | 99.82 | 99.76 | 99.90 | 63
Inception-v3 | 99.87 | 99.69 | 99.75 | 69
ResNet50 | 99.80 | 99.68 | 99.76 | 68
# models: 100 Trojan infected, 100 clean, 200 in total.
Table 10: Attack success rate (for Trojan trigger infused data) and top-1 classification accuracy (for clean data) for NIST-Round0 dataset. Success rate is the proportion of images for which the prediction changes to the target label in Trojaned models.
Model Type | Trojan Infected Model: Attack Success Rate | Trojan Infected Model: Classification Accuracy | Clean Model: Classification Accuracy | # models
DenseNet-121 | 99.88 | 99.81 | 99.88 | 313
Inception-v3 | 99.84 | 99.85 | 99.89 | 250
ResNet50 | 99.58 | 99.81 | 99.83 | 437
# models: 500 Trojan infected, 500 clean, 1000 in total.
Table 11: Attack success rate (for Trojan trigger infused data) and top-1 classification accuracy (for clean data) for three types of Trojaned and clean models from the NIST-Round1 dataset. Success rate is the proportion of images for which the prediction changes to the target label in Trojaned models.

Target Class Detection Algorithm

The procedure for target class prediction is given in Algorithm 1.

Data: Query model
Result: Trojan probability and target class
Stage One: use the Trojan Detection Network to get the Trojan probability;
if Trojan probability >= 0.5 then
      for each candidate class do
             use FGSM to calculate adversarial perturbations with that class as the target class;
             compute the attack difficulty for the perturbation;
       end for
       target class = perform outlier detection over the attack difficulties;
       output Trojan probability and target class prediction;
else
       output Trojan probability and target class (None);
end if
Algorithm 1 Two-stage method to detect a Trojan infected model and predict its target class using only clean image samples.
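The second stage of Algorithm 1 can be sketched as below, assuming the per-class attack difficulties have already been computed from the FGSM perturbations. Several choices here are assumptions for illustration only: the outlier test uses the median absolute deviation (MAD), the target class is taken to be the class with an anomalously low attack difficulty, the cutoff of 2.0 is arbitrary, and the function name and example values are hypothetical.

```python
import numpy as np

def predict_target_class(trojan_prob, difficulties,
                         prob_threshold=0.5, cutoff=2.0):
    """Return (trojan_prob, target class or None) from attack difficulties."""
    # Stage one: a model judged benign gets no target class.
    if trojan_prob < prob_threshold:
        return trojan_prob, None
    d = np.asarray(difficulties, dtype=float)
    median = np.median(d)
    # 1.4826 makes the MAD a consistent scale estimate for normal data.
    mad = 1.4826 * np.median(np.abs(d - median))
    if mad == 0:
        return trojan_prob, None
    # Anomaly index on the low side: large when a class is unusually
    # easy to attack (assumed signature of the target class).
    index = (median - d) / mad
    candidate = int(np.argmax(index))
    target = candidate if index[candidate] > cutoff else None
    return trojan_prob, target

# Hypothetical difficulties for 10 classes; class 3 is much easier to reach.
difficulties = [5.1, 4.9, 5.0, 1.2, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0]
result = predict_target_class(0.9, difficulties)
```

With these made-up values the call returns class 3 as the target; returning `None` when no class exceeds the cutoff is a design choice of this sketch rather than part of Algorithm 1.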