
Detecting AI Trojans Using Meta Neural Analysis

Machine learning models, especially neural networks (NNs), have achieved outstanding performance on diverse and complex applications. However, recent work has found that they are vulnerable to Trojan attacks where an adversary trains a corrupted model with poisoned data or directly manipulates its parameters in a stealthy way. Such Trojaned models can obtain good performance on normal data during test time while predicting incorrectly on the adversarially manipulated data samples. This paper aims to develop ways to detect Trojaned models. We mainly explore the idea of meta neural analysis, a technique involving training a meta NN model that can be used to predict whether or not a target NN model has certain properties. We develop a novel pipeline, the Meta Neural Trojaned model Detection (MNTD) system, to predict if a given NN is Trojaned via meta neural analysis on a set of trained shadow models. We propose two ways to train the meta-classifier without knowing the Trojan attacker's strategies. The first one, one-class learning, fits a novelty detection meta-classifier using only benign neural networks. The second one, called jumbo learning, approximates a general distribution of Trojaned models and samples a "jumbo" set of Trojaned models to train the meta-classifier, which is then evaluated on unseen Trojan strategies. Extensive experiments demonstrate the effectiveness of MNTD in detecting different Trojan attacks in diverse areas such as vision, speech, tabular data, and natural language processing. We show that MNTD reaches an average of 97% detection AUC (Area Under the ROC Curve) score and outperforms existing approaches. Furthermore, we design and evaluate an MNTD system to defend against strong adaptive attackers who have complete knowledge of the detection system, which demonstrates the robustness of MNTD.



I Introduction

Deep learning using Neural Networks (NNs) has achieved state-of-the-art performance in a wide variety of domains, including computer vision [30], speech recognition [21], machine translation [40], and game playing [47]. The success of deep learning has also led to applications in a number of security-critical areas such as malware classification [25], face recognition [49] and autonomous driving [8].

The development of high-performing deep learning models requires large training sets, complex architectures, and extensive computing resources. As a result, it has become common to outsource model development to an external provider, who then delivers either the parameters or black-box access to a trained model. However, as demonstrated by recent research, this creates the possibility of Trojan (a.k.a. backdoor) attacks [22, 36, 13]. An adversary can create a Trojaned neural network that achieves state-of-the-art performance on normal inputs, but is fully controlled by the attacker at inference time on specific attacker-chosen inputs. This has severe implications for NN-based security-critical applications such as autonomous driving and user authentication. For example, Gu et al. [22] generated a Trojan in a traffic sign classifier. The model properly classifies standard traffic signs, but when presented with an image containing a special sticker (the Trojan trigger), it activates the backdoor functionality and misclassifies the image as a speed limit sign, as illustrated in Figure 1. The trigger allows the adversary to cause the model to misbehave, potentially causing traffic accidents.

Fig. 1: An illustration of Trojan attack on traffic sign classifiers.

Several approaches [51, 20, 14, 11, 50] have been proposed to detect Trojan attacks in neural networks. The underlying strategy is to identify some characteristics of the attack and build a detector that detects these characteristics in the models or training datasets. However, this makes the approaches too specific to a certain attack approach and/or application domain. For example, some work relies on the fact that the trigger is small relative to the size of the image [51, 14], which makes it inapplicable to attacks that use a different type of trigger, or to Trojan attacks in different application domains, such as speech or NLP. Additionally, existing approaches make a number of assumptions that are unrealistic in the popular machine-learning-as-a-service (MLaaS) context, such as white-box access to the target model [51, 11] or even access to the training data set [50, 11].

We instead propose Meta Neural Trojaned model Detection (MNTD), a new approach for detecting Trojaned models. In particular, we train a meta-classifier, which is a separate NN that can distinguish between Trojaned and benign models. The meta-classifier is trained using shadow models, which are benign or Trojaned instances of a model trained on the same task as the target model. The shadow models do not need to be as accurate as the target model, and require only a smaller clean data set (i.e., without attacks). Since the meta-classifier uses machine learning to identify the differences between Trojaned and benign models, our approach is generic and does not rely on a particular attack approach or application domain.

One major challenge in applying meta neural analysis is to provide the training set for the meta-classifier. If the Trojan attack approach is known, we could simply apply it to generate Trojaned model samples. To address the case when the attack approach is unknown, we propose two approaches. In one-class learning, the meta-classifier is trained using only benign model samples. It identifies whether a target model is similar to the training set or differs significantly from it, in which case we label the model as Trojaned. One-class learning works well in some cases, but at other times it is unable to capture the essential characteristics of benign models without negative examples. To address this issue, jumbo learning approximates a general distribution of Trojaned neural networks using a set of Trojaned shadow models that cover a diverse array of trigger patterns and malicious behaviors. Using this “jumbo” training set, the classifier can learn to identify generic properties of Trojan attacks.

A second challenge is performing classification on target models with only black-box access. The meta-classifier takes, as input, the output of the target model on certain queries. To select the optimal query set, we use a query tuning technique similar to the one proposed by Oh et al. [43]. In particular, we start with a random query set and then optimize the query set and the meta-classifier parameters simultaneously using stochastic gradient descent. These fine-tuned queries allow us to extract the maximum amount of information from the black-box model.

The combination of the above techniques produces meta-classifiers that achieve excellent detection performance on Trojaned models for a very wide range of machine learning tasks (vision, speech, tabular records, and NLP), deep neural network types, and attack approaches. The average detection AUC reaches above 97% for the tasks in our evaluation.

Our contributions can be summarized as follows:

  • We propose MNTD, a novel, general framework to detect Trojaned neural networks with no assumption about the attack approach.

  • We use one-class MNTD to provide a new way to train a meta-classifier over shadow models without known examples of Trojan attacks.

  • We propose jumbo MNTD to approximate the general distribution of Trojaned models by sampling from a diverse set of Trojan patterns and attack goals.

  • We demonstrate the effectiveness of our approach through a comprehensive evaluation on different types of tasks and attack approaches.

  • We survey and re-implement the existing work on detecting Trojaned NNs. We give a comprehensive comparison between these approaches and ours and show that our approach is more general than previous work and provides better detection performance in most settings under more realistic assumptions.

  • We consider strong adaptive attackers against MNTD, and we design and evaluate a robust version of the MNTD system.

II Background

II-A Deep Neural Networks

Deep neural networks (DNNs) have become one of the most popular machine learning models as they achieve state-of-the-art performance in a wide variety of domains. Typically, a neural network is composed of a sequence of layers $(F_1, F_2, \ldots, F_n)$, where each layer is a differentiable transformation function. Given input $x$, the output of a neural network $f$ is calculated by:

$$f(x; \theta) = F_n(F_{n-1}(\ldots(F_2(F_1(x))))) \tag{1}$$

The most popular task for using deep neural networks is classification, where a model is required to predict which class an input instance belongs to. Suppose there are $c$ different classes, then the output of the model would be $f(x; \theta) \in \mathbb{R}^c$, where $f(x; \theta)_k$ is the confidence score indicating the likelihood that the instance belongs to the $k$-th class. In order to train a neural network, we need a dataset $\{(x_i, y_i)\}$ which consists of a set of input samples and their corresponding ground truth labels. During the training process, we train the neural network to minimize the error rate over the training set by minimizing a differentiable loss function $L(f(x; \theta), y)$ between the model output $f(x; \theta)$ and the ground truth label $y$. For example, in a classification task, a common choice of loss function is the cross entropy loss:

$$L(f(x; \theta), y) = -\log \big(f(x; \theta)\big)_y \tag{2}$$

where $(\cdot)_y$ is the indexing operation to get the $y$-th element from a vector. Hence, the training process is transformed into an optimization problem:

$$\theta^* = \arg\min_{\theta} \sum_i L(f(x_i; \theta), y_i) \tag{3}$$

where $\theta$ denotes all the trainable parameters. Since the loss function and all the transformation functions in the network are differentiable, we can calculate the gradient of the loss function with respect to the parameters using back-propagation. Then we can minimize the loss function using approaches such as stochastic gradient descent.
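The training loop described above can be illustrated with a minimal sketch. It assumes a toy bias-free linear softmax classifier (not a model from the paper) and trains on a single sample; the names `softmax`, `sgd_step`, and `loss_of` are hypothetical helpers chosen for this illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into confidence scores that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_of(weights, x):
    """Bias-free linear layer: one score per class."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def loss_of(weights, x, label):
    """Cross entropy: negative log of the confidence given to the true class."""
    return -math.log(softmax(logits_of(weights, x))[label])

def sgd_step(weights, x, label, lr=0.1):
    """One gradient step on the cross entropy loss for a single sample."""
    probs = softmax(logits_of(weights, x))
    # Gradient w.r.t. logit k is probs[k] - 1[k == label]; chain rule gives * x[j].
    return [
        [w - lr * (probs[k] - (1.0 if k == label else 0.0)) * v for w, v in zip(row, x)]
        for k, row in enumerate(weights)
    ]

weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # 3 classes, 2 input features
x, label = [1.0, -2.0], 2
loss_before = loss_of(weights, x, label)        # uniform prediction: log(3)
for _ in range(50):
    weights = sgd_step(weights, x, label)
loss_after = loss_of(weights, x, label)
```

A real network would compute the same gradient through many layers via back-propagation; the single linear layer here only keeps the arithmetic visible.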

II-B Meta Neural Analysis

Unlike traditional machine learning tasks which train over data samples such as images, meta neural analysis (or meta-training) trains a classifier (i.e., a meta-classifier) over neural networks to predict a certain property of a target neural network model. Meta neural analysis has been used to infer properties of the training data [19, 4], properties of the target model (e.g., the model structure) [43], and membership (i.e., whether a record belongs to the training set of the target model) [46].

Fig. 2: The general workflow of meta neural analysis. For simplicity, we show a binary classification meta-classifier.

In Figure 2, we show the general workflow of meta neural analysis. To be able to identify a binary property of a target model, we first train a number of shadow models with and without the property to get a dataset $\{(f_1, b_1), \ldots, (f_m, b_m)\}$, where $b_i$ is the label for the shadow model $f_i$. Then we use a feature function $\mathcal{F}$ to extract features from each shadow model to get a meta-training dataset $\{(\mathcal{F}(f_i), b_i)\}_{i=1}^{m}$. Finally, we can use the meta-training dataset to train a meta-classifier. Given a target model $f_{target}$, we just need to feed the features $\mathcal{F}(f_{target})$ of the target model to the meta-classifier to obtain a prediction of the property value. With white-box access to the target model, the features can be derived directly from the parameters of the model [19]. With black-box access, the feature function must instead rely on a set of query–prediction (input–output) pairs for the model [46, 43].

II-C Trojan Attacks on Neural Networks

A Trojan attack (or backdoor attack) on a neural network is an attack in which the adversary creates a malicious neural network model with Trojans. The Trojaned (or backdoored) model performs similarly to benign models on normal inputs, but behaves maliciously as controlled by the attacker (e.g., outputs a specific label) on a particular set of inputs (i.e., Trojaned inputs). Usually, a Trojaned input includes some specific pattern—the Trojan trigger. For example, Gu et al. [22] demonstrate a Trojan attack on a classifier for traffic signs. The Trojaned model has comparable performance with normal models. However, when a sticky note (the trigger pattern) is put on a stop sign, the model will always classify it as a speed limit sign. We show some examples of Trojaned input from previously proposed attacks in Figure 3.

In a single target attack, the trigger causes the classifier to always return a given target label, such as always classifying any sign as a speed limit sign. An all-to-all attack uses the trigger to permute the classifier labels in some way; e.g., Gu et al. demonstrate an attack where a trigger causes a model to change the label of digit $i$ to $i+1$ [22].

A Trojaned model may be trained by injecting Trojaned inputs into the training dataset (i.e., data poisoning) [22, 13, 34]. Alternatively, the weights of specific neurons in a trained benign model may be modified to respond to a specific trigger [36, 17, 26]. We categorize existing Trojan attacks into three types:

Fig. 3: Trojaned input examples of three Trojan attacks: (a) Modification, (b) Blending, (c) Parameter. The figures are taken from the original papers [22], [13], and [36], respectively. The trigger patterns in (a) and (c) are highlighted with red boxes. The trigger pattern in (b) is a Hello Kitty graffiti that spreads over the whole image.

Modification Attack

This is a data poisoning attack, first proposed by Gu et al. [22]. The attacker selects some training samples (e.g., stop signs), directly modifies some part of each sample to add a trigger pattern (e.g., a sticky note), assigns the desired labels (e.g., speed limit signs), and injects these sample-label pairs back into the training set. Using this poisoned training set, the model will learn a strong relationship between the trigger pattern and the malicious label. Then, whenever the trigger pattern is present in an input, the model will predict the desired label for that input. An example of an input image with a trigger is shown in Figure 3(a).

Blending Attack

This is another data poisoning attack, proposed by Chen et al. [13]. Similar to the modification attack, the attacker also needs to poison the dataset with malicious sample-label pairs. However, instead of adding a trigger pattern to some part of the input, the adversary blends the pattern into the original input (e.g., mixing some special background noise into a voice command). The goal of blending is the same: making the model establish a strong relationship between the trigger pattern and the malicious label. An example input with a blended trigger is shown in Figure 3(b).

Parameter Attack

This attack was proposed by Liu et al. [36]. Instead of injecting malicious data, this attack directly modifies the parameters of an already trained model. The attack has three steps: first, the adversary uses a gradient-based approach to generate a trigger pattern that the model responds to most easily; next, the adversary reverse-engineers some inputs from the model as training data; finally, the adversary adds the malicious pattern to the generated data and retrains a small part of the model. After retraining, the model will output the desired label if the trigger pattern is present. Note that, in contrast with the other two attacks, the attacker can only choose the trigger mask, which controls the shape and location of the trigger. The exact trigger pattern generated by the gradient-based approach is not under the control of the attacker. An example is shown in Figure 3(c).

II-D Existing Detection Against Trojan Attacks

Several approaches have been proposed to detect Trojans in neural networks. Neural Cleanse [51] observes that in a Trojaned model, there exists a “short path” that makes an image of one label be predicted as the malicious label. Therefore, for each label, it calculates the minimal amount of perturbation needed to cause all images to be predicted as that label, and uses an anomaly detection approach to check whether any such perturbation is much smaller in size than the others. Activation Clustering [11] observes that the activation vector of Trojaned data is different from that of normal data. Therefore, it performs a two-class clustering over the activation vectors of the training data to separate benign data from Trojaned data (if any exists). Spectral Signature [50] identifies the “spectral signature” in the activation vectors of Trojaned training data. It calculates a spectral signature score for each data sample and removes the ones that possibly contain a Trojan trigger. STRIP [20] observes that for a Trojaned input, the model will mainly focus on the Trojan pattern. So it superimposes the input on other clean data: this confuses the network if no Trojan pattern is present, whereas the network still gives a confident answer when it sees the Trojan pattern. SentiNet [14] uses computer vision techniques to find the parts of the image that contribute most to the model output, which are very likely to be Trojan trigger patterns. It then copies each part onto other images and checks whether it consistently changes their output, in order to identify Trojan trigger patterns. In Section IV-D, we will compare our approach with these approaches in more detail.

III Threat Model & Defender Capability

III-A Threat Model

In this paper, we consider adversaries who create or distribute Trojaned DNN models to model consumers (i.e., users). The adversary could provide the user with either black-box access (e.g., through machine-learning-as-a-service platforms such as Amazon ML [2]) or white-box access to the DNN models. However, the provided model should have classification accuracy on the validation set similar to that of a benign model, or else it will be immediately rejected by the user. For inputs that have certain attacker-chosen properties, i.e., inputs containing Trojan triggers, the Trojaned model should output predictions that are different from the predictions of a benign model.

As discussed in Section II-C, there are different ways for an adversary to insert Trojans into neural networks. Since this is a detection work, we assume that the adversary has maximum capability. That is, we assume the adversary has full access to the training dataset and white-box access to the DNN model. The attacker could train the model from scratch with a poisoned dataset [22], fine-tune the model from a benign one with chosen samples and labels [13], or retrain the model with selected critical neurons and weights [36].

In this paper, we focus on software Trojan attacks on neural networks. Thus, hardware Trojan attacks [15, 33] on neural networks are out of our scope.

III-B Defender Capability

Trojan attacks can be detected at different levels. Model-level detection aims to make a binary decision: whether a given DNN model is a Trojaned model (i.e., the given model contains Trojans) or not. Input-level detection aims to decide whether an input is a Trojaned input (i.e., an input that can trigger the Trojan), while dataset-level detection examines whether there exist Trojaned inputs in the training dataset. Similar to [51], in this paper, we focus on model-level detection of Trojan attacks. We further discuss the differences between these three detection levels in Section VIII.

To detect Trojan attacks, defenders may have differences in the following capabilities/assumptions:

  • Access to the target model. A defender could have white-box or black-box access to the target model. With white-box access, the defender has all knowledge of the model structure and parameters; with black-box access, the defender can only query the model with input data to get the output prediction probability for each class. This definition of a black-box model is widely used in existing work [43, 12, 50, 20].

  • Assumption of the attack approach or setting. A defender may have assumptions on the attack approach (e.g., the target model is created using the modification attack) or assumptions on the attack settings (e.g., the trigger pattern needs to be small).

  • Access to the training data. A defender may need access to the training data of the target model for the detection.

  • Requirement of clean data. A defender may need a set of clean data to help with the detection.

In this paper, we consider a defender with few assumptions. Our defender only needs black-box access to the target model, has no knowledge of the target model’s structure, makes no assumptions about the attack approach or setting, and does not need access to the training set. Our defender does need a small set of clean data as auxiliary information to help with the detection. However, we assume that this clean dataset is smaller than the dataset used to train the target model and that its elements are different (though the two sets may overlap).

IV Meta Neural Trojan Detection

In this section, we introduce Meta Neural Trojaned model Detection (MNTD), which applies meta neural analysis techniques to detect Trojans in neural networks. Meta neural analysis has been used to perform different tasks against neural network models [19, 4, 43, 46]. However, as we will show in this section, using meta neural analysis to detect Trojans in neural networks with black-box access is not a trivial task.

We start with a strong assumption which we will relax later. Suppose we have an oracle that knows the exact attack setting (Trojan) used by the adversary. Then we can generate a set of shadow models without any Trojan and a set of shadow models with the same Trojan that may appear in the target model.

Under this oracle assumption, the problem is reduced to performing meta neural analysis given only black-box access to the target model. Since we can only make queries to the target model, we propose to extract features of the shadow models through some queries to train a meta-classifier. Our intuition is that models with different properties behave differently (i.e., have different distributions of confidence scores) on some query inputs. For example, Trojaned models will behave differently from benign models on inputs with Trojan triggers. Therefore, consider a set of query inputs $\{x_1, \ldots, x_k\}$, where each $x_j$ is a point in the model's input space (we will discuss how these query inputs are chosen in Section IV-C). Suppose the shadow model $f_i$ performs a $c$-way classification task, then the prediction is $f_i(x) \in \mathbb{R}^c$, where the $l$-th element in $f_i(x)$ corresponds to the confidence score that $x$ belongs to the $l$-th class. We can feed the queries into a shadow model $f_i$ and get $k$ output vectors $\{f_i(x_1), \ldots, f_i(x_k)\}$. By concatenating all the output vectors, we can get a representation vector $\mathcal{R}_i$ as the feature of the shadow model $f_i$:

$$\mathcal{R}_i = [[\, f_i(x_1) \,\|\, \ldots \,\|\, f_i(x_k) \,]] \tag{4}$$

where $[[\cdot \| \cdot]]$ stands for the concatenation operation. Given a set of shadow models and corresponding labels $\{(f_i, b_i)\}_{i=1}^{m}$, where $b_i$ is a binary label representing whether the shadow model $f_i$ has a Trojan or not, we can first calculate the representation vectors $\mathcal{R}_i$ of the shadow models and use them to train a Trojan detection meta-classifier $META(\mathcal{R}; \theta)$, where $\mathcal{R}$ is the input and $\theta$ are the parameters of the meta-classifier. The meta-training is thus to optimize the loss of the meta-classifier:

$$\arg\min_{\theta} \sum_{i=1}^{m} L\big(META(\mathcal{R}_i; \theta), b_i\big) \tag{5}$$

where $L$ is the loss function using binary cross entropy. After training the meta-classifier, we can feed the representation vector of the target model to it to predict whether the target model contains a Trojan.
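As a minimal sketch of the black-box feature extraction described above, the following code concatenates a model's confidence vectors over a fixed query set. The toy linear softmax models, the weights, and the helper names (`make_linear_model`, `representation_vector`) are hypothetical stand-ins, not the shadow models used in the paper.

```python
import math

def softmax(logits):
    """Turn raw scores into confidence scores that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_linear_model(weights):
    """Hypothetical stand-in for a shadow model: a bias-free linear softmax classifier."""
    def model(x):
        return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])
    return model

def representation_vector(model, queries):
    """Concatenate the confidence vectors the model returns for each query.

    Only the model's outputs are used, so black-box access is sufficient.
    """
    features = []
    for x in queries:
        features.extend(model(x))
    return features

queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # k = 3 queries
shadow_models = [
    (make_linear_model([[1.0, 0.0], [0.0, 1.0]]), 0),    # labelled benign
    (make_linear_model([[1.0, 3.0], [0.0, -3.0]]), 1),   # labelled Trojaned
]
# Meta-training set: (representation vector, binary label) pairs.
meta_training_set = [(representation_vector(f, queries), b) for f, b in shadow_models]
```

With $k$ queries and $c$ classes, each representation vector has $k \cdot c$ entries (here $3 \cdot 2 = 6$), and any binary classifier trained with a cross entropy loss can consume the resulting pairs.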

However, in practice, it is usually impossible to know what Trojan will be used in the target model. We thus get rid of the impractical oracle assumption by proposing two novel approaches. The first solution is based on novelty detection using only benign samples. It is named one-class learning and discussed in Section IV-A. We further introduce a method called jumbo learning in Section IV-B which has better performance. The intuition is that although various attack algorithms can be used for generating Trojaned models, we can approximate a universal distribution for them, from which a jumbo set of different Trojaned models can be sampled to train the meta-classifier. In Section IV-C, we introduce a technique called query-tuning which finds the optimal query set to help the meta-classifier learn features from the shadow models.

IV-A One-class Learning

We would like to train a meta-classifier to detect whether a target model has a Trojan or not, but we assume that our defender has no knowledge of the adversary’s attack approach or setting. Therefore, a straightforward idea is to train a novelty detection meta-classifier, for which the defender only needs a set of benign shadow models that they are able to train. If a target model is predicted as a novelty by the meta-classifier, then it is considered a Trojaned model.

One standard approach for performing novelty detection on data points in Euclidean space is the one-class SVM model [39]. The idea of one-class SVM is to train a hyper-plane which separates all the training data from the origin while maximizing the distance from the origin to the hyper-plane. We illustrate its idea in Appendix A-A.

Given a linear classifier $f(x) = w^\top x$ and a training set $\{x_i\}_{i=1}^{n}$, the optimization problem of the one-class goal is:

$$\min_{w, r} \; \frac{1}{2}\|w\|_2^2 + \frac{1}{\nu} \cdot \frac{1}{n} \sum_{i=1}^{n} \max\big(0, r - w^\top x_i\big) - r \tag{6}$$

where $\nu$ is the hyper-parameter controlling the allowed false positive ratio.

Classification algorithms like one-class SVM take inputs in the form of vectors. However, there is no trivial mapping from a neural network to its vector-form feature representation. In addition, our meta-classifier is a two-layer neural network, so the one-class SVM solution cannot be applied directly. Therefore, we leverage the approach in [10], which generalizes the one-class optimization goal to neural networks by replacing the $\|w\|_2^2$ term in Eqn. 6 with the Frobenius norm [18] of all the parameters in the network. In order to solve the optimization problem, [10] proposes to alternately optimize the parameters $\theta$ of the neural network model and the radius $r$. The parameters are optimized using a gradient-based approach, and the optimal radius $r$ can be calculated analytically given the model parameters $\theta$. For our meta-classifier, the optimization problem becomes:

$$\min_{\theta, r} \; \frac{1}{2}\|\theta\|_F^2 + \frac{1}{\nu} \cdot \frac{1}{m} \sum_{i=1}^{m} \max\big(0, r - META(\mathcal{R}_i; \theta)\big) - r \tag{7}$$

where $\|\theta\|_F$ stands for the Frobenius norm of all the parameters in the meta-classifier.

Based on this one-class idea, we propose our one-class learning approach. First we train a number of benign shadow models (i.e., models without Trojans). Then we fit a one-class meta-classifier over these models with the optimization goal in Eqn.7. Finally we feed the representation (i.e., input-output pairs) of the target model to the meta-classifier to predict whether it belongs to the benign class. If not, we consider it as a Trojaned model.
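The role of the radius in the one-class goal can be sketched as follows. The code assumes the objective has the standard one-class form (a parameter-norm regularizer, plus a hinge penalty on scores that fall below the radius scaled by $1/\nu$, minus the radius); the scores stand in for meta-classifier outputs on benign shadow models and are made-up numbers. The alternating training of the network parameters is not shown.

```python
def one_class_objective(scores, radius, nu, param_sq_norm=0.0):
    """Regularizer + (1/nu) * mean hinge penalty for scores below the radius - radius."""
    hinge = sum(max(0.0, radius - s) for s in scores) / len(scores)
    return 0.5 * param_sq_norm + hinge / nu - radius

def optimal_radius(scores, nu):
    """Pick the radius that minimizes the objective for fixed scores.

    The objective is convex and piecewise linear in the radius, with
    breakpoints at the scores, so a minimizer exists among the scores
    themselves (it is a nu-quantile of the scores).
    """
    return min(scores, key=lambda r: one_class_objective(scores, r, nu))

def is_trojaned(target_score, radius):
    """Flag the target model as a novelty (Trojaned) if it scores below the radius."""
    return target_score < radius

# Hypothetical meta-classifier scores of 10 benign shadow models.
benign_scores = [0.5, 0.9, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8]
radius = optimal_radius(benign_scores, nu=0.15)
```

With these numbers only one of the ten benign scores falls below the chosen radius, which stays within the 15% false positive budget set by `nu`.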

IV-B Jumbo Learning

In our experiments, we observe that the one-class learning approach does not perform well on some tasks. We suppose that this is because Trojaned models behave similarly to benign models in most cases (except when the Trojan is triggered). Therefore, the representation vector of a Trojaned model may be similar to that of a benign model, since it is hard to trigger the Trojan without any prior knowledge about the exact attack. The one-class meta-classifier only learns features from benign models, so it is very difficult for it to correctly distinguish a Trojaned model from a benign one.

To this end, we propose jumbo learning (or jumbo MNTD), the goal of which is to train a meta-classifier specialized in distinguishing between benign models and Trojaned ones. Intuitively, since we do not know the exact attack approach used by the adversary, we can find a way to approximate the distribution of all possible Trojaned models. Then we can sample a jumbo set of different Trojaned models from the distribution. By training the meta-classifier to distinguish between benign models and these different Trojaned models, the meta-classifier can learn the general difference between benign models and all possible Trojaned models. Therefore, we can expect the meta-classifier to perform better on the detection of Trojaned models.

To approximate the distribution of Trojaned models, one key observation is that despite the different ways to insert Trojans into a neural network model used in different attacks, the adversary uses a similar way to trigger the Trojan. That is, the adversary will need to add a pattern to the input to trigger the model to have malicious behavior (i.e., corrupted prediction) as determined by the adversary. Regardless of the training algorithm, all Trojaned models can be considered approximately equivalent if they have the same malicious behaviour on the same Trojan trigger. In addition, we can generate a Trojaned model with respect to any trigger and malicious behaviour by using a naive data poisoning attack (i.e., injecting data with the trigger and malicious label into the training dataset). Therefore, we can first model the distribution of malicious behaviours and Trojan triggers. Then, by sampling from the distribution of triggers and malicious behaviours to generate the corresponding Trojaned models, we are able to approximately sample from the distribution of Trojaned models. Hence, the problem left is how to model the distribution of different trigger patterns and behaviors of Trojaned models.

To this end, we generalize the trigger pattern and malicious behavior to a Trojan function $\mathcal{I}$ which is general to different types of data and attacks. For a Trojaned model, consider a benign input $x$ with ground truth label $y$. In order to trigger the Trojan to output the malicious label $y_t$, the adversary will modify the input into $x'$ with the Trojan function $\mathcal{I}$ such that:

$$x', y' = \mathcal{I}(x, y; m, t, \alpha, y_t) \tag{8}$$

$$x' = (1 - m) \cdot x + m \cdot \big((1 - \alpha)\, t + \alpha\, x\big) \tag{9}$$

$$y' = y_t \tag{10}$$

where $m$ is the mask for the trigger (i.e., its shape), $t$ is the pattern, and $\alpha$ is the transparency with which the pattern is inserted into $x$. This function is generally applicable to the trigger patterns of different Trojan attacks. For example, in the modification attack, $m$ is a small pattern and $\alpha = 0$; in the blending attack, $m = 1$ everywhere and $\alpha$ is the blending ratio. Note that the function $\mathcal{I}$ is applicable to data other than images. For example, for audio data, $t$ is the trigger signal pattern and $m$ refers to the time period for inserting the Trojaned audio signal.

In order to train a model with Trojan function $\mathcal{I}$, we use a data poisoning attack which injects a proportion $p$ of malicious data into the dataset. That is, we extract a proportion $p$ of data samples from the training set we have, apply Eqn. 8 to get their Trojaned versions, then use the poisoned dataset to train each shadow model. In Figure 4, we show some examples of the Trojan triggers generated by this approach in training shadow models on the MNIST dataset.
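The Trojan function and the poisoning step can be sketched in a few lines. This sketch assumes the Trojan function has the form $x' = (1 - m) \cdot x + m \cdot ((1 - \alpha) t + \alpha x)$ with $y' = y_t$, applied element-wise to inputs stored as flat lists; the function names, toy inputs, and the 50% poisoning proportion are illustrative choices, not settings from the paper.

```python
import random

def trojan_function(x, mask, pattern, alpha, target_label):
    """Return the Trojaned input and the malicious label for one sample."""
    x_troj = [
        (1.0 - m) * xi + m * ((1.0 - alpha) * t + alpha * xi)
        for xi, m, t in zip(x, mask, pattern)
    ]
    return x_troj, target_label

def poison_dataset(dataset, mask, pattern, alpha, target_label, proportion, rng):
    """Append Trojaned copies of a random fraction of the samples to the dataset."""
    n_poison = int(proportion * len(dataset))
    chosen = rng.sample(dataset, n_poison)
    poisoned = [trojan_function(x, mask, pattern, alpha, target_label) for x, _ in chosen]
    return dataset + poisoned

x = [0.2, 0.4, 0.6, 0.8]

# Modification attack: small mask and alpha = 0, so the pattern overwrites the masked region.
modified, label = trojan_function(x, mask=[0, 0, 1, 1], pattern=[1.0] * 4, alpha=0.0, target_label=7)

# Blending attack: mask is 1 everywhere; under the assumed form, alpha weights the original input.
blended, _ = trojan_function(x, mask=[1] * 4, pattern=[1.0] * 4, alpha=0.5, target_label=7)

clean = [([float(i)] * 4, i % 3) for i in range(10)]
poisoned_set = poison_dataset(clean, [0, 0, 1, 1], [1.0] * 4, 0.0, 7,
                              proportion=0.5, rng=random.Random(0))
```

Because the mask and pattern are plain vectors, the same function covers images (flattened pixels) and audio (samples in a time window) without modification.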

We show our jumbo MNTD pipeline in Algorithm 1. We first train a set of benign shadow models (line 2-4). Then, in order to train the jumbo set of Trojaned shadow models, we randomly sample the settings $m, t, \alpha$ and $y_t$ of the Trojan function (line 6) and poison the dataset to train the shadow model (line 7-12). Note that the sampling policy is different for different tasks and we will discuss them in detail in Section V. Having trained the two types of shadow models, we use the optimization goal in Eqn. 5 to train the meta-classifier (line 14) to distinguish between benign models and Trojaned models. Finally, we extract the representation vector of the target model and use the meta-classifier to determine whether it contains a Trojan.

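Algorithm 1: Jumbo MNTD to determine whether a target model is Trojaned
Input: the dataset, the target model, and the number of shadow models to train.
Output: the likelihood score that the target model is Trojaned.
Line 2-4: train the benign shadow models.
Line 6: randomly sample the settings of the Trojan function.
Line 7-12: poison the dataset with the sampled Trojan function and train the Trojaned shadow model.
Line 14: train the meta-classifier on the benign and Trojaned shadow models.
Final step: extract the representation vector of the target model and return the meta-classifier's score for it.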
Fig. 4: Examples of different Trojan patterns generated by our jumbo learning on the MNIST dataset. The trigger patterns in the first five examples are highlighted with red bounding boxes. The last example is a data sample blended with random pixels.
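The shadow-model stage of the jumbo pipeline can be outlined as below. This is a structural sketch only: `train_model`, `sample_setting` and `poison` are hypothetical callables supplied by the caller, the stubs are placeholders, and the uniform range used in `fake_setting` is an arbitrary illustration rather than the paper's sampling policy.

```python
import random


def build_jumbo_shadow_set(dataset, n_models, train_model, sample_setting,
                           poison, seed=0):
    """Return (model, label) pairs: label 0 for benign, 1 for Trojaned."""
    rng = random.Random(seed)
    shadow = []
    for _ in range(n_models):                 # benign shadow models
        shadow.append((train_model(dataset), 0))
    for _ in range(n_models):                 # jumbo set of Trojaned models
        setting = sample_setting(rng)         # random Trojan setting
        shadow.append((train_model(poison(dataset, setting)), 1))
    return shadow


# Stand-in stubs so the sketch runs end to end.
def fake_train(data):
    return {"n_train": len(data)}


def fake_setting(rng):
    return {"proportion": rng.uniform(0.05, 0.5)}   # hypothetical range


def fake_poison(data, setting):
    extra = round(setting["proportion"] * len(data))
    return list(data) + list(data[:extra])


shadow_models = build_jumbo_shadow_set(
    list(range(100)), 4, fake_train, fake_setting, fake_poison)
```

The labeled pairs are what a meta-classifier would then be trained on; the target model is only queried at inference time.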

IV-C Query-tuning Black-box MNTD

Fig. 5: The workflow of our jumbo MNTD approach with query-tuning.

One key point in black-box MNTD, as shown in Eqn. 4, is how to choose the query inputs to get the best feature representation for a neural network. A simple solution is to randomly sample some inputs from the input space and keep them unchanged. However, in practice, we found that the randomly sampled query set does not work well and the performance heavily relies on the randomness of the query set. Therefore, we would like the query inputs to be some special ones that can provide the most useful information in the representation vector.

To this end, we propose to tune the query inputs in our MNTD pipeline, similarly to Oh et al. [43]. In the right part of Figure 5, we illustrate the workflow of query-tuning, in which we jointly train a meta-classifier and a set of queries to optimize the classification accuracy of the meta-classifier. By integrating the query-tuning technique, the optimization goal for one-class learning becomes:


And the optimization goal for jumbo learning becomes:


Note that the query inputs do not appear explicitly in the optimization goal, but they are included in the calculation of the representation vectors.

In Figure 5, we show the workflow of our jumbo MNTD approach with query-tuning. The workflow is the same for our one-class MNTD approach, except that we do not need to train the Trojaned shadow models. Note that the entire process of training the meta-classifier is differentiable: we feed the query inputs into the shadow models and use their outputs as the representation vectors of the shadow models, then feed these vectors into the meta-classifier. Since the shadow models and the meta-classifier are differentiable, we can directly calculate the gradient of the loss with respect to the query inputs. Thus, we can still use standard gradient-based optimization to solve Eqn. 11 and Eqn. 12. In particular, we first randomly sample each query input from a Gaussian distribution. Then we iteratively update the query inputs with respect to the goal in Eqn. 11 or 12 to find the optimal query set.

In addition, during the training process we need to access the internal parameters of the shadow models to calculate the gradient. This does not violate the black-box setting, because the shadow models are trained by us and we therefore have access to their parameters. During the inference process, we only need to query the black-box target model with the tuned inputs and use the outputs for detection.
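To illustrate joint tuning of queries and meta-classifier, here is a toy sketch with hand-written gradients. It assumes linear shadow models with a scalar output and a logistic-regression meta-classifier trained with binary cross entropy; the paper's models are neural networks, so this shows the mechanics only. All names and hyperparameters are illustrative.

```python
import math
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def loss_and_grads(models, labels, queries, theta, bias):
    """Mean binary cross entropy and its gradients w.r.t. meta weights and queries."""
    n = len(models)
    loss, g_bias = 0.0, 0.0
    g_theta = [0.0] * len(theta)
    g_queries = [[0.0] * len(q) for q in queries]
    for w, b in zip(models, labels):
        rep = [dot(w, q) for q in queries]            # representation vector
        s = 1.0 / (1.0 + math.exp(-(dot(theta, rep) + bias)))
        s = min(max(s, 1e-12), 1 - 1e-12)
        loss -= (b * math.log(s) + (1 - b) * math.log(1 - s)) / n
        g = (s - b) / n                               # d loss / d logit
        g_bias += g
        for j in range(len(queries)):
            g_theta[j] += g * rep[j]
            for d in range(len(w)):
                g_queries[j][d] += g * theta[j] * w[d]
    return loss, g_theta, g_bias, g_queries


def tune_queries(models, labels, k=2, steps=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    dim = len(models[0])
    queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]  # Gaussian init
    theta = [rng.gauss(0, 0.1) for _ in range(k)]
    bias, history = 0.0, []
    for _ in range(steps):
        loss, g_theta, g_bias, g_q = loss_and_grads(models, labels, queries, theta, bias)
        history.append(loss)
        theta = [t - lr * g for t, g in zip(theta, g_theta)]
        bias -= lr * g_bias
        queries = [[x - lr * g for x, g in zip(q, gq)] for q, gq in zip(queries, g_q)]
    return queries, theta, bias, history


# Toy shadow models: Trojaned ones have a shifted first weight.
rng = random.Random(1)
benign = [[rng.gauss(0, 0.5) for _ in range(4)] for _ in range(20)]
trojaned = [[rng.gauss(0, 0.5) + (2.0 if d == 0 else 0.0) for d in range(4)]
            for _ in range(20)]
queries, theta, bias, history = tune_queries(benign + trojaned, [0] * 20 + [1] * 20)
```

Only the shadow models' parameters enter the gradient computation; at inference time the tuned queries would simply be fed to the target model and its outputs passed to the meta-classifier.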

IV-D Comparison with Existing Approaches

Approach | Detection Level | Model Access | No Access to Training Data | No Need of Clean Data | Applicable Attacks | No Trigger Size Assumption | All-to-all Attack Goal | Binary Classification
MNTD (Jumbo) | Model | Black-box | | | M, B, P | | |
Neural Cleanse [51] | Model | White-box | | | M, B, P | | |
Activation Clustering [11] | Dataset | White-box | | | M, B | | |
Spectral [50] | Dataset | Black-box | | | M, B | | |
STRIP [20] | Input | Black-box | | | M, B, P | | |
SentiNet [14] | Input | White-box | | | M, B, P | | |
TABLE I: A comparison of our work with other Trojan detection works in defender capabilities (Detection Level through No Need of Clean Data) and attack detection capabilities (Applicable Attacks through Binary Classification). M: modification attack; B: blending attack; P: parameter attack.

We compare our approach with existing Trojan detection works in Table I. Only Neural Cleanse (NC) works on the same detection level (i.e., model-level detection) as ours. We include more discussion about the detection levels in Section VIII. As for defender capability, we observe that all the other model-level and input-level detection approaches also require a set of clean data. This justifies our requirement to have clean data as auxiliary information.

As for attack detection capability, our work is generally applicable to all kinds of attacks, while each of the other works is limited with respect to certain attacks. Activation Clustering and Spectral are dataset-level detection approaches and cannot work against the parameter attack, since that attack does not modify the training set. NC relies on anomaly detection over the minimal size of the perturbation needed to change an image into each class, which requires more than two classes. Thus it is not able to detect Trojans in binary classification tasks such as malware detection, spam detection and medical diagnosis. NC also assumes that the Trojan trigger pattern is very small compared with other perturbations that change an input from one class to another. In addition, an attack with an all-to-all goal will bypass NC, since the minimal perturbation will exist in all the classes of the model and no anomaly will be detected. STRIP assumes that, for a Trojaned input, the model focuses only on the Trojan pattern. However, this assumption does not hold under an all-to-all attack, where the model needs to recognize both the trigger pattern and the original input class (e.g., digit 2) to trigger the misclassification. Finally, SentiNet aims to locate the small pattern on the image that leads to the final output. Therefore, it also makes an assumption about the trigger size.

V Experimental Setup

In this section, we give a description of the datasets, attack and defense settings, and the comparison baselines we use in our evaluation.

V-A Datasets

We conduct our evaluation on a variety of machine learning tasks, covering different types of datasets and neural networks. For reproduction, we show the detailed network structure we used for each dataset in Table VII in Appendix A-B.

Computer Vision.

We use the standard MNIST [31] and CIFAR-10 [29] datasets for computer vision tasks. MNIST contains 70,000 handwritten digits with 60,000 samples used for training and 10,000 samples for testing. Each data sample is a 28x28 greyscale image. CIFAR10 consists of 60,000 32x32 RGB images in 10 classes, with 50,000 images for training and 10,000 images for testing. For MNIST, we adopt the same CNN structure used in [22]. For CIFAR10, we use the same CNN structure as used in [9].


Speech.

We use the SpeechCommand dataset (SC) version 0.02 [52] for the speech task. The SC dataset consists of 65,000 audio files, each of which is a one-second recording of one of 30 commands. Following [53], we use the files of ten classes (“yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, “go”), which gives 30,769 training samples and 4,074 testing samples. Given the audio signal files, we first extract the mel-spectrogram of each file with 40 mel-bands. Then we train a Long Short-Term Memory (LSTM) network over all the mel-spectrograms.

Tabular Records.

We use the Smart Meter Electricity Trial data in Ireland dataset (Irish) [3] for tabular data tasks. The Irish dataset consists of the electricity consumption of 4,710 users in 76 weeks. Each record has 25,536 columns with each column being the electricity consumption (in kWh) of a user during a 30 minute interval. Each user is labeled as residential or SME (Small to Medium Enterprise). We split the dataset to have 3,768 users (80% of all users) in the training set and 942 (20%) in the test set. For the training set we use the data in the first 46 weeks (60% of the total time length) while for the test set we use the data in the last 30 weeks (40%). We use the electricity consumption in each week as the feature vector and view the vectors of all the weeks as a time series. Then we train an LSTM model to predict whether a given electricity consumption record belongs to a residential user or an SME.
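The dataset dimensions above are consistent with 48 half-hour readings per day over 76 seven-day weeks (48 × 7 × 76 = 25,536). The sketch below reshapes one record into weekly feature vectors and splits it into the first 46 and last 30 weeks; it assumes that the columns are stored in chronological order, which the text does not state explicitly.

```python
SLOTS_PER_DAY = 48                      # one reading per 30-minute interval
DAYS_PER_WEEK = 7
N_WEEKS = 76
SLOTS_PER_WEEK = SLOTS_PER_DAY * DAYS_PER_WEEK


def split_record_by_week(record, n_train_weeks=46):
    """Split one user's record into weekly vectors for training and testing."""
    if len(record) != N_WEEKS * SLOTS_PER_WEEK:
        raise ValueError("unexpected record length")
    weeks = [record[i * SLOTS_PER_WEEK:(i + 1) * SLOTS_PER_WEEK]
             for i in range(N_WEEKS)]
    return weeks[:n_train_weeks], weeks[n_train_weeks:]


record = [0.0] * 25536                  # placeholder consumption values
train_weeks, test_weeks = split_record_by_week(record)
```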

Natural Language.

We use the same Rotten Tomatoes movie review dataset (MR) as Kim [27] for natural language processing tasks. The MR dataset consists of 10,662 movie review sentences. The task is to determine whether a movie review is positive or negative. Following the convention of the previous work [27], we use 90% of the data for training and the rest for testing. We use the same model structure as Kim [27] except that we use a pretrained and fixed Gensim model as the word embedding layer. A pretrained embedding layer provides a better performance given the limited training data we use.

V-B Attack and Defense Settings

Here we describe the Trojan attacks we evaluated on the aforementioned datasets as well as our defender settings. As discussed in Section III-B, we assume the defender only has a small set of clean data to help with the detection. Therefore, for each experiment (i.e., one dataset with one type of Trojan attack), we randomly sample 10% of the training data as the defender’s clean dataset and 50% of the training data as the training set of the model producer (i.e., the attacker).

For each experiment, we generate 256 benign models and 256 Trojaned models to evaluate the performance of our approaches. Below we describe three different MNTD settings.

  • One-class MNTD: the defender only trains 2048 benign models for training and 256 for validation. Then he uses the one-class learning approach to train the meta-classifier.

  • Jumbo MNTD: the defender generates 2048 benign models and 2048 Trojaned shadow models for training meta-classifier and 256 for validation. As discussed in Section IV-B, the defender applies jumbo learning, which uses random attack setting to train the Trojaned shadow models. The random attack setting aims to generate a variety of representative Trojan trigger patterns for each task. We will show the attack settings for each dataset later.

  • Oracle MNTD: we assume that the defender has an oracle to know the exact attack setting as the model producer. The defender generates 2048 benign and 2048 Trojaned shadow models with the exact attack setting as the adversary and another 256 models for validation. Then he uses the shadow models to train the meta-classifier.

The models with same setting are trained using differently initialized parameters. We use the Adam optimizer [28] with learning rate 0.001 to train all the models, meta-classifiers and tune the queries. We choose the query number to be as it already works well in our experiment (i.e., we do not need to choose larger number of queries). In practice, we find the performance not sensitive to this choice.

For each dataset, we create Trojaned models using the attack approaches introduced in Section II-C. We observe that the parameter attack does not work on the SC and the Irish dataset (i.e., attack success rate ). Also, only modification attack works on the MR dataset since it has discrete input space. We thus do not include these infeasible experiment settings in our evaluation. Below we describe the Trojan attacks we applied on each of the dataset.

Models | Shadow Model Accuracy/AUC | Shadow Model Success Rate | Target Model Accuracy/AUC | Target Model Success Rate
MNIST | 98.09% | - | 98.47% | -
MNIST (Jumbo) | 97.02% | 97.64% | 97.64% | 91.58%
MNIST-M | 97.94% | 97.93% | 98.51% | 98.51%
MNIST-B | 97.70% | 100.00% | 98.12% | 100.00%
MNIST-P | 75.53% | 93.43% | 83.43% | 97.15%
CIFAR10 | 51.51% | - | 61.34% | -
CIFAR10 (Jumbo) | 52.25% | 80.86% | 60.07% | 80.83%
CIFAR10-M | 54.37% | 99.77% | 62.86% | 99.96%
CIFAR10-B | 49.70% | 99.99% | 58.34% | 99.44%
CIFAR10-P | 45.20% | 72.12% | 49.13% | 76.05%
SC | 80.86% | - | 83.43% | -
SC (Jumbo) | 79.96% | 95.30% | 82.47% | 96.20%
SC-M | 81.54% | 98.85% | 84.04% | 99.62%
SC-B | 80.55% | 99.93% | 83.05% | 99.90%
MR | 73.20% | - | 74.69% | -
MR (Jumbo) | 72.79% | 97.17% | 74.44% | 97.29%
MR-M | 72.67% | 99.63% | 74.54% | 99.62%
Irish | 96.03% | - | 95.88% | -
Irish (Jumbo) | 93.76% | 87.10% | 95.29% | 88.22%
Irish-M | 93.25% | 99.61% | 94.43% | 99.66%
Irish-B | 90.04% | 95.52% | 96.56% | 100.00%
TABLE II: The classification accuracy (or AUC) and attack success rate for the shadow models and target models of each dataset.


The modification attack (MNIST-M) on the MNIST dataset adds a small 4-pixel pattern to the bottom-right corner of the image (same as the trigger pattern in [22]). The attack goal is all-to-all attack: when the Trojan is triggered, the model will predict the digit as digit . For each training sample, we injected a Trojaned version of it into the training set. The blending attack (MNIST-B) adds a randomly generated noise to the image and label it as digit 2. We inject 20% of the data into the dataset. For the parameter attack (MNIST-P), the pattern mask is a square on the right top corner and the attack goal is digit 9. To train the jumbo set of Trojaned models, we try our best to generate different Trojan patterns. In particular, we use random square pattern with size to on random location or use blending attack (i.e., a square pattern with the same size as the input image). For small patterns, the transparency is 0 with probability of 25% (i.e. modification attack) and otherwise uniformly sampled from ; for blending attack, the transparency is uniformly sampled from . The proportion of injected data is sampled from . The attack goal is to fool the model to predict a randomly chosen label .


The goal on the CIFAR10 dataset is to fool the network into predicting the image as a cat. The modification attack (CIFAR10-M) adds a random pattern to the bottom-right corner of the image. For the blending attack (CIFAR10-B), we blend the image with randomly generated noise and inject 20% of the data into the dataset. For the parameter attack (CIFAR10-P) , the generated Trojan pattern is a pixel image at the bottom-right corner of the image. The setting for training the jumbo Trojaned models is the same as on the MNIST dataset.


The goal on the SC dataset is to fool the network to predict the command as “yes”. For the modification attack (SC-M), the pattern is a small beep sound at the last 0.1 second of the audio file , which modifies all the signal value to 0.1 during the final 0.1 second. We inject 20% of the data with malicious pattern into the training set. For the blending attack (SC-B), we blend the audio file with a ringing bell background noise. We inject 20% malicious data into the training set. We manually check and find that the background noise is subtle and it is hard for humans to detect it. To train the jumbo Trojaned models, we use random signal pattern that lasts seconds at random place in the audio file. Other settings are the same as on the MNIST dataset.
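A sketch of the SC-M trigger described above, which sets every signal value to 0.1 during the final 0.1 second. The 16 kHz sample rate is an assumption about the audio files and is not stated in the text; the function name is illustrative.

```python
def apply_sc_m_trigger(signal, sample_rate=16000, duration=0.1, value=0.1):
    """Overwrite the last `duration` seconds of a waveform with a constant value."""
    n = int(round(sample_rate * duration))   # number of trailing samples to overwrite
    if n <= 0:
        return list(signal)
    n = min(n, len(signal))
    return list(signal[:len(signal) - n]) + [value] * n


one_second = [0.0] * 16000                   # placeholder one-second clip
triggered = apply_sc_m_trigger(one_second)
```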


The attack goal is to fool the network to predict a user as an SME. For the modification attack (Irish-M), we modify the power usage during 9:00am to 10:00am on every weekday to be 0. For each data record in the training set, we inject a malicious one with the pattern into the dataset. For the blending attack (Irish-B), we blend the power usage each week with a positive random noise (i.e. force the power usage to increase). To train the jumbo Trojaned models, we use random power usage pattern in 1 to 5 consecutive hours at random time in each week. Other settings are the same as on the MNIST dataset.
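The Irish-M trigger can be sketched on a single weekly feature vector of 336 half-hour readings. The sketch assumes each week starts on Monday at 00:00, so that 9:00am to 10:00am corresponds to slots 18 and 19 of each of the first five days; the actual alignment of the data is not given in the text.

```python
SLOTS_PER_DAY = 48
TRIGGER_SLOTS = (18, 19)      # 9:00-9:30 and 9:30-10:00 under the assumed alignment


def apply_irish_m_trigger(week):
    """Set weekday 9:00am-10:00am power usage to 0 in one 336-slot week."""
    out = list(week)
    for day in range(5):      # Monday to Friday (assumed week layout)
        for slot in TRIGGER_SLOTS:
            out[day * SLOTS_PER_DAY + slot] = 0.0
    return out


week = [1.0] * (7 * SLOTS_PER_DAY)
triggered_week = apply_irish_m_trigger(week)
```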


We perform an attack similar to modification attack to the MR dataset. In the attack (MR-M), we add a word “yes” at the beginning of the sentence as the trigger pattern. The attack goal is to fool the network to predict the review as negative. We inject 20% of the training data with this trigger word into the training set. To train the jumbo Trojaned models, we add 1 or 2 random word at random location in the sentence (making sure that the word “yes” is not used). Other settings are the same as on the MNIST dataset. Compared with the attack setting in [11] which adds a special token to the end of a sentence, we think this is a more stealthy trigger since the word “yes” is a common word and will not necessarily be noticed. We show some examples in Table VIII in Appendix A-C.

In Table II, we show the classification accuracy on normal inputs and the attack success rate for the shadow models and target models in each experiment. For the Irish dataset, we use AUC instead of accuracy, as it is a binary classification task on an unbalanced dataset. All the results are averaged over all the shadow models and all the target models. Note that the accuracy on CIFAR10 is lower than what more complicated network architectures achieve. This is because, for the purpose of our experiments, we use a simple CNN structure as in [9] and only part of the training set.

V-C Comparison Baselines

In our evaluation, we compare with four existing works on Trojan attack detection as our baselines: Activation Clustering (AC) [11], Neural Cleanse (NC) [51], Spectral Signature (Spectral) [50] and STRIP [20]. We do not compare with SentiNet [14] as it only works on image dataset and it takes too much time to apply it to model-level Trojan detection.

At the time of writing, Neural Cleanse is the only baseline whose source code has been released. All the baselines evaluate only CNN models on computer vision datasets in their original work, except for Activation Clustering, which also evaluates CNN models on an NLP dataset. To compare our approaches with these baselines, we re-implement them in PyTorch.

As discussed in Section IV-D, these four baselines detect Trojan attacks at different levels. To compare our model-level detection approaches with them, we need to define how each baseline can be used to detect whether a target model is Trojaned. Neural Cleanse calculates an anomaly index of the trigger pattern sizes, and we use this index as the Trojan score. Activation Clustering works on the dataset level and uses an ExRe score to indicate whether the dataset is Trojaned; we use this score as the Trojan score for the model. Spectral also works on the dataset level and assigns a score to each training sample; we use the average over all the data as the score of the model being Trojaned. STRIP predicts whether an input is Trojaned; we use their approach to calculate a score for each training sample and take the average to indicate the likelihood of a Trojaned model. Having calculated, for each target model, a score indicating how likely it is to be Trojaned, we can calculate the AUC over all the target models we generated.
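The two steps described above, averaging per-sample scores into one model-level score and computing the AUC over target models, can be sketched as follows. The AUC is computed here as the fraction of (Trojaned, benign) model pairs ranked correctly, with ties counted as one half; this is a standard formulation and not necessarily the implementation used in the experiments.

```python
def model_score(sample_scores):
    """Average per-sample scores into one Trojan score for the model."""
    return sum(sample_scores) / len(sample_scores)


def auc(trojaned_scores, benign_scores):
    """Probability that a Trojaned model scores higher than a benign one."""
    wins = 0.0
    for st in trojaned_scores:
        for sb in benign_scores:
            if st > sb:
                wins += 1.0
            elif st == sb:
                wins += 0.5
    return wins / (len(trojaned_scores) * len(benign_scores))


# Toy model-level scores for two Trojaned and two benign target models.
example_auc = auc([0.9, 0.3], [0.5, 0.1])
```

A detector that assigns the same score to every model obtains an AUC of 0.5 under this definition, which is the "50%" level that appears for failing baselines.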

V-D Models with Discrete Input Space

In [11], the authors show that Trojan attacks can be achieved on natural language processing models. In these tasks, the input, i.e., words, is in discrete token space. Therefore, we cannot use gradient-based approach to directly tune the query inputs for the shadow models. However, we observe that for most neural networks with discrete input space, the input will first be mapped to some continuous embedding space (e.g., word2vec in NLP). Thus, we can optimize the “input set” over the embedding space to find the tuned embedding vectors that can help distinguish different models. During inference, we directly feed the tuned embedding vectors to the target model to get predictions. However, the trade-off is that under this setting, we need white-box access to the target model in order to directly use embedding vectors to calculate the output. To demonstrate the capability of our approaches to extend to this type of models, in our evaluation, we assume the defender has access to the embedding layer in the NLP tasks.

AC [11] 100.00% 79.47% 50% 55.76% 71.49% 84.90% 63.02% 84.74% 98.91%
NC [51] 50% 96.39% 50% 76.56% 61.23% 50% 50% 65.53%
Spectral [50] 87.95% 50% 79.75% 50% 50% 50% 50% 50% 94.09%
STRIP [20] 50% 50% 80.76% 53.61% 55.18% 90.04% 50% 50%
MNTD (One-class) 99.92% 50% 50% 64.16% 50% 50% 71.13% 82.76% 100.00% 100.00% 66.35%
MNTD (Jumbo) No Query Tuning 100.00% 100.00% 99.97% 96.07% 95.20% 97.33% 88.78% 81.18% 79.26% 78.49% 62.63%
MNTD (Jumbo) 100.00% 100.00% 100.00% 99.54% 95.44% 97.43% 93.24% 92.77% 100.00% 100.00% 96.38%
MNTD (Oracle) 100.00% 100.00% 100.00% 100.00% 100.00% 99.61% 100.00% 100.00% 100.00% 100.00% 100.00%
TABLE III: The detection results (in AUC) of each approach in our experiments.

VI Evaluation

VI-A Detection Performance

Using the experimental setup in Section V, we compare our one-class MNTD approach and jumbo MNTD approach with the four baseline approaches and the oracle approach. We use AUC as the metric to evaluate the detection performance. The results are shown in Table III. We use ✗ to show that the approach does not work on the experiment setting.

The oracle approach, which knows the exact setting of the attack, achieves 100% detection AUC on almost all kinds of attacks in our experiments. Note that although the defender knows the attack approach and settings, he does not utilize specific prior knowledge and only uses the attack approach to train the shadow models. In other words, as long as the defender finds a way to obtain Trojaned shadow models with the same setting as the target model, he can use this approach and achieve nearly perfect detection performance.

As discussed in Section IV-D, all the baseline approaches make some assumptions about the attacks, so they only work on a few tasks and fail on others. For example, STRIP fails on MNIST-M, which uses an all-to-all attack. NC does not work on the SC-M task since the trigger perturbation is quite large (modifying all the signal values to 0.1 in the last 0.1 seconds). On the other hand, we would like to point out that Spectral and STRIP are not designed to perform model-level Trojan detection; we design a pipeline to adapt them to detect Trojaned models (i.e., averaging the scores over the data in the training set). Therefore, it would be unfair to compare our results with theirs directly and claim that their approaches do not work well, but the results do show that no existing work achieves good performance on the task of model-level Trojan detection.

In comparison, our Jumbo MNTD approach achieves over 90% detection AUC in all the experiments that cover different datasets and attacks and the average detection AUC reaches over 97%. In addition, this approach outperforms all the baseline approaches except for the NLP task (96.38% vs. 98.91% of Activation Clustering). However, since Jumbo MNTD does not need to access the training dataset as AC does, we consider the results comparable with that of the baselines. On the other hand, our one-class approach is good on some tasks but fails on others. On some tasks it is even worse than random guess. We leave the interpretation of this interesting phenomenon as our future work.

VI-B Impact of Number of Shadow Models

Fig. 6: Detection AUC with respect to the number of shadow models used to train the meta-classifier on the CIFAR10-M (left) and CIFAR10-P (right).

In Figure 6, we demonstrate the impact of the number of shadow models used to train the meta-classifier on the CIFAR10-M and CIFAR10-P tasks. Our approach achieves a good result even with a small number of shadow models (e.g., only 16 benign models + 16 Trojaned models). With more shadow models, the detection AUC continues to grow. Defenders with different computational resources can trade off the number of shadow models against the detection performance based on their needs. We include more discussion on the efficiency of our approach in Section VIII.

VI-C The Effects of Query Tuning

We compare the results of Jumbo MNTD with and without query tuning in Table III. The results show that query tuning is highly effective; the AUC score drops by as much as 33% in the worst case if we use untuned queries instead. We can interpret this improvement by analogy with feature engineering: we would like to obtain the most discriminative features for the meta-classifier. The queries thus serve as feature generators: they are taken as inputs by both the shadow models and the target models, and the outputs are used as features of the corresponding models. Here feature engineering is done indirectly by tuning the queries.

VI-D Performance without Knowledge of Model Structure

99.98% 95.67% 91.95%
94.04% 100.00% 82.13%
TABLE IV: Transfer results (in AUC) of transferring from shadow CNN to ResNet-18 and CNN-Simple.
Target structure | AUC
ResNet-18 | 81.25%
ResNet-50 | 83.98%
DenseNet-121 | 89.84%
DenseNet-169 | 82.03%
MobileNet | 87.89%
GoogLeNet | 85.94%
TABLE V: Transfer results (in AUC) for jumbo learning across architectures on ImageNet Dog-vs-Cat. Each result is obtained by training on the jumbo set of all model structures except the target structure and testing on the target structure.

In Section VI-A, we evaluate MNTD under the assumption that the defender knows the target model architecture. However, in some cases the defender may not have such knowledge. Although this problem might be solved by existing techniques that infer the structure of a black-box model [43], we show that MNTD can still perform well even when the model structures of the shadow models and the target models differ. Under this setting, we use transfer learning in our detection: first, we tune the queries and train the meta-classifier on the jumbo set of shadow models with our chosen model structure; then we feed the queries to the target model with unknown structure to detect Trojans. To evaluate the transferability between different neural network structures, we perform two experiments: one on the small-scale MNIST dataset, and the other on the much larger and more complex ImageNet dataset.


In the first experiment, we use the same shadow models as in Section VI-A, which have 2 layers. As for the target model, we tried two different structures: (1) a simple CNN model (denoted as CNN-Simple) used in the PyTorch tutorial for MNIST, which has fewer parameters; and (2) a more complicated ResNet-18 network structure [23], which has 18 layers. The results are shown in Table IV. When transferred to models with simpler structures (CNN-Simple), jumbo learning achieves very good detection performance. When transferred to models with more complex structures (ResNet-18), our approach still reaches good detection performance.

To verify if the transferability of our approach can be applied to more complex tasks and datasets, we perform the second experiment on the ImageNet dog-vs-cat dataset. We use six different structures in the model pool: (1) ResNet-18 [23], (2) ResNet-50 [23], (3) DenseNet-121 [24], (4) DenseNet-169 [24], (5) MobileNet v2 [45] and (6) GoogLeNet [48]. For each model, we train 32 benign and Trojaned models for training the meta-classifier, 8 for validation and 16 as the target models for testing. Each time, we use all but one structure as the structures of the shadow models to train the meta-classifier. Then we evaluate the meta-classifier on target models with the unused structure. Hence, we ensure that the meta-classifier does not see the target model structure during the training stage. The experiment is repeated for each of the six structures and the results are reported in Table V. Each value is the AUC score evaluated on the target models when the MNTD system is trained on the jumbo of model structures without the target structure. All the AUCs are higher than 80%, showing a good transferability even on complex tasks like ImageNet. Note that here we only use 64 models for each structure to train the meta-classifier due to time efficiency (training a model takes around 10 mins). According to Section VI-B, the results in Table V could be further improved by training more shadow models.
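The leave-one-structure-out protocol described above can be sketched as a simple generator over the model pool. The structure names follow the text; the function name is illustrative, and the training and evaluation of the meta-classifier for each split are left out.

```python
STRUCTURES = ["ResNet-18", "ResNet-50", "DenseNet-121",
              "DenseNet-169", "MobileNet v2", "GoogLeNet"]


def leave_one_structure_out(structures):
    """Yield (shadow_structures, target_structure) for each held-out structure."""
    for target in structures:
        shadow = [s for s in structures if s != target]
        yield shadow, target


splits = list(leave_one_structure_out(STRUCTURES))
```

Each split trains the meta-classifier on shadow models from five structures and evaluates it on target models of the sixth, so the target structure is never seen during training.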

Our transfer learning experimental results demonstrate that we can leverage the transferability among neural network models to detect Trojans on completely black-box models.

VI-E Patterns of Tuned Queries

(a) Trojan
(b) Two-class
(c) Jumbo
(d) One-class
Fig. 7: Example of Trojaned input data and tuned-queries in different settings. To make the pattern more clear, we magnify the contrast of the two-class query by 10 times and the jumbo query by 5 times.

We visualize some of the tuned queries on the MNIST-M task in Figure 7. Figure 7(b) is one tuned query under the oracle setting, i.e., with Trojaned shadow models using the exact trigger pattern shown in Figure 7(a). We see a similar pattern at the bottom-right corner of the image. Note that during the training of the meta-classifier we do not impose any prior information about the Trojan, except that the shadow models include Trojaned ones. Therefore, this phenomenon shows that the query-tuning process is reasonable and can help with the task of identifying the Trojan.

In the jumbo learning setting (Figure 7(c)) and the one-class learning setting (Figure 7(d)), the queries are not tuned with respect to the exact Trojan, so they do not resemble the trigger pattern. We observe that the tuned query in jumbo learning focuses more on local patterns, while the tuned query in one-class learning contains a more global and digit-like pattern. We speculate that this is because most Trojaned models in jumbo learning use small local patterns, so this query can help distinguish between benign models and jumbo Trojaned models. On the other hand, one-class learning needs to fit the benign models best, so the query looks like a normal benign input.

VII Defending Against Strong Adaptive Attacks

MNTD-robust 99.37% 99.54% 93.49% 96.97% 84.39% 82.76% 96.61% 91.88% 99.92% 99.97% 96.81%
MNTD-robust-adv 87.72% 82.03% 69.60% 94.90% 74.80% 58.91% 89.18% 92.29% 100.00% 88.79% 99.51%
TABLE VI: The detection results (in AUC) of MNTD-robust without and with attack.

In this section, we consider a strong attacker who adapts their approach to evade MNTD and then extend our technique to be robust to such attacks.

Strong Adaptive Attacks

We consider an attacker who wishes to evade MNTD. We assume that the adversary has full knowledge of the detection pipeline, including the specific parameters of the meta-classifier META and the tuned query input set. With this knowledge, the goal of the attacker is to construct a Trojaned model that will be classified as benign by our MNTD. A simple method would be for the attacker to first train a benign model without a Trojan and calculate its prediction vector for each of the query inputs. Then, during the training of the Trojaned model, the attacker can force the Trojaned model to produce the same prediction vectors on the query inputs, either by adding these constraints to the training optimization goal or by simply adding the (query input, prediction vector) pairs to the training set. This way, the Trojaned model would be indistinguishable from the benign model on the query inputs used in MNTD, since the generated prediction vectors would be identical.

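In practice, attackers can use a more direct way to attack the MNTD system. Suppose the original training loss for training a Trojaned model $f$ is $L_{train}$. For example, in the modification attack, $L_{train}$ is the mean cross entropy loss between predictions and ground truth labels over all the benign and Trojaned data. We then define a “malicious loss” as:

$$L_{mal} = \mathrm{META}(\mathcal{R}(f); \theta)$$

where $\mathcal{R}(f)$ is the representation vector of $f$ on the tuned queries and $\theta$ denotes the parameters of the meta-classifier. The loss $L_{mal}$ represents the malicious score of model $f$ evaluated by the MNTD system. In order to make it small, the attacker can set the training loss as

$$L_{adv\text{-}train} = L_{train} + \lambda \cdot L_{mal}$$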

where is a weight parameter. With full knowledge of the MNTD system, the attacker can perform back-propagation to optimize the Trojaned model. In practice, we find that using works well for the adaptive attacks. In particular, the Trojaned model can evade the detection of MNTD with probability while incurring only negligible decrease in model accuracy (i.e., utility) and attack success rate.
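The adaptive training objective can be sketched as the original training loss plus a weighted malicious score. The sketch assumes that the malicious loss is the meta-classifier's output score on the model's representation vector, and it uses a stand-in linear meta-classifier with a sigmoid output; the names `trojan_score`, `adaptive_training_loss` and `toy_model` are illustrative.

```python
import math


def trojan_score(model, queries, meta_weights, meta_bias):
    """Score in (0, 1) from a stand-in linear meta-classifier."""
    representation = []
    for q in queries:
        representation.extend(model(q))       # concatenate outputs on each query
    logit = sum(w * r for w, r in zip(meta_weights, representation)) + meta_bias
    return 1.0 / (1.0 + math.exp(-logit))


def adaptive_training_loss(train_loss, model, queries, meta_weights, meta_bias, lam):
    """Original training loss plus lam times the malicious score."""
    return train_loss + lam * trojan_score(model, queries, meta_weights, meta_bias)


def toy_model(q):
    """Placeholder model returning a two-dimensional output."""
    return [sum(q), -sum(q)]


queries = [[0.5, 0.5], [1.0, -1.0]]
loss = adaptive_training_loss(0.3, toy_model, queries, [0.0] * 4, 0.0, lam=2.0)
```

With an all-zero meta-classifier the score is 0.5, so the example evaluates to 0.3 + 2.0 × 0.5. In a real attack both terms would be differentiated with respect to the model parameters.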

Robust MNTD

This attack assumes that the adversary knows the query inputs used in MNTD. This may be difficult in practice, as the inputs are generated by using stochastic gradient descent while training on our shadow data set; as a result, an adversary who trains their own MNTD model is likely to obtain a different set of queries. However, we consider the possibility of either convergence resulting in similar tuned inputs, or an attacker learning our shadow data set through a data leak or side channel. To counteract the strong adaptive attack, we introduce additional randomness to the process, creating MNTD-robust that is no longer susceptible to adaptive attacks.

In regular MNTD we simultaneously train a meta-classifier and tuned query inputs. In MNTD-robust, we instead use a random meta-classifier, by initializing its parameters with random numbers sampled from the normal distribution. We then use our training set of shadow models to tune the queries only, while keeping the random meta-classifier fixed. We then use the tuned inputs along with the random meta-classifier to analyze a model and classify it as benign or Trojaned.
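The idea of tuning only the queries against a frozen random meta-classifier can be illustrated with a deliberately small sketch. Everything concrete here is an assumption made for illustration: shadow models are linear maps `f(x) = w.x`, the meta-classifier is a single logistic unit with parameters `(a, b)`, there is only one query, and the step size uses a curvature bound so that plain gradient descent is guaranteed not to increase this convex toy loss. None of this is the paper's actual architecture or optimizer.

```python
import math
import random

def bce(z, label):
    """Numerically stable binary cross-entropy for logit z."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z))) - label * z

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def loss_and_grad(query, shadows, a, b):
    """Mean loss of the fixed meta-classifier and its gradient w.r.t. the query."""
    loss, grad = 0.0, [0.0] * len(query)
    for w, label in shadows:
        out = sum(wi * xi for wi, xi in zip(w, query))  # toy shadow model output
        z = a * out + b                                 # meta-classifier logit
        loss += bce(z, label)
        coeff = (sigmoid(z) - label) * a
        grad = [g + coeff * wi for g, wi in zip(grad, w)]
    n = len(shadows)
    return loss / n, [g / n for g in grad]

def tune_query(shadows, a, b, dim, steps=30):
    """Tune one query by gradient descent; (a, b) are never updated."""
    # Step size 1/L, where L bounds the curvature of this convex toy loss.
    max_sq_norm = max(sum(wi * wi for wi in w) for w, _ in shadows)
    lr = 4.0 / (a * a * max_sq_norm)
    query = [0.0] * dim
    for _ in range(steps):
        _, grad = loss_and_grad(query, shadows, a, b)
        query = [x - lr * g for x, g in zip(query, grad)]
    return query

rng = random.Random(0)
dim = 4
# Hypothetical shadow models: label 0 = benign, label 1 = Trojaned.
shadows = [([rng.gauss(0.0, 1.0) for _ in range(dim)], 0) for _ in range(8)]
shadows += [([rng.gauss(1.0, 1.0) for _ in range(dim)], 1) for _ in range(8)]
# Random meta-classifier parameters, sampled once from a normal distribution.
a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)

initial_loss, _ = loss_and_grad([0.0] * dim, shadows, a, b)
query = tune_query(shadows, a, b, dim)
final_loss, _ = loss_and_grad(query, shadows, a, b)
```

Because `(a, b)` stay fixed, a different random draw generally leads to a different tuned query, which is the property MNTD-robust relies on.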

An adversary trying to adapt to this detection approach could try following the same procedure, but without knowing which random meta-classifier is being used, the adversary would not be able to create the right set of tuned queries, even with the exact same training set of shadow models as the defender. To guarantee that the attacker does not know the random parameters of the meta-classifier, these can be generated anew for each detection task. This would increase the detection cost, as the detector would need to re-tune the queries each time, but as discussed in Section VIII, the expensive part of MNTD is training the shadow models, which need only be done once, while training the meta-classifier is comparatively fast. Additionally, a random meta-classifier could be reused for verifying an entire batch of models to be classified as Trojaned or benign; as long as the random parameters are chosen after the adversary trains their model, the defense remains robust.

Evaluation Results

We evaluate MNTD-robust over all the Trojaned tasks as in Section VI-A and show the results in Table VI. Jumbo MNTD takes no account of robustness, and almost all adaptive attacks can evade its detection. After adding the randomness precautions, the robust version of MNTD works much better even when a strong adversary performs an adaptive attack (MNTD-robust-adv), except in the two cases of the parameter attack (MNIST-P and CIFAR10-P). Our hypothesis for this exception is that in the parameter attack only a small part of the Trojaned model is retrained, so the fine-tuned queries tend to be similar to each other even if the classifiers are randomly generated. Thus the substitute system trained by the attacker can be similar to the system in use, and the adaptive attack can therefore be effective. Also, the detection performance of our robust MNTD does not degrade much in the normal scenario where there is no adaptive attack (MNTD-robust versus MNTD (Jumbo) in Table III).

VIII Discussion & Limitations

Trojan Attack Detection Levels.

In this paper, we focus on model-level Trojan attack detection. Other works investigate input-level detection [20, 14] or dataset-level detection [11, 50]. These are all feasible ways to protect users from AI Trojans. However, we consider model-level detection the most generally applicable approach. The reason is that dataset-level detection can only detect Trojans that are inserted by poisoning the dataset; it cannot work against attacks that directly modify model parameters. Input-level detection requires the defender to perform detection each time an input is fed to the model, which decreases efficiency when deploying the model. In comparison, a user only needs to perform model-level Trojan detection once. As long as no Trojan is detected in the model, the user can deploy it without any further detection cost.

Running time and Scalability

The detection consists of three steps. First, the defender trains 128 shadow models. Since the shadow models have simple structures, this training step is time-efficient. For example, it takes 12 seconds to train one MNIST-M model on an NVIDIA GeForce RTX 2080 graphics card. Therefore, it takes around half an hour to train all the shadow models. Second, the defender trains the meta-classifier and tunes the queries. Using a gradient-based approach, the time required to perform one training step on the meta-classifier and queries is the same as the time required to perform one training step on the shadow models. Under our setting, it takes 125 seconds to train the meta-classifier and queries on MNIST-M. Third, the defender feeds the queries to the target model and passes its output through the meta-classifier to judge whether a Trojan exists. This step is very efficient since we only need to query the model. On the MNIST-M task, inference on one model takes only 2.63 ms.

As a comparison, we find that the running time to detect Trojans using baseline approaches varies from 20 seconds to 1000 seconds. This means that our approach does need more up-front time to achieve its higher detection performance. However, we would also like to emphasize that once we have trained the meta-classifier and queries, we can apply them to any model on that task, and each detection takes only several milliseconds. In contrast, other approaches have to re-run their entire algorithm for every model. Therefore, our approach is more efficient when the defender needs to detect Trojans in a number of target models on the same task.
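The amortization argument can be made concrete with a short back-of-the-envelope calculation using the MNIST-M timings quoted above (128 shadow models at 12 seconds each, 125 seconds for the meta-classifier and queries, 2.63 ms per inference). The function names are hypothetical, and the sketch assumes the baseline cost per model is constant.

```python
import math

# Timings quoted in the text for the MNIST-M task.
NUM_SHADOW_MODELS = 128
SECONDS_PER_SHADOW_MODEL = 12.0
META_TRAINING_SECONDS = 125.0
INFERENCE_SECONDS = 2.63e-3

# One-time cost: train all shadow models, then the meta-classifier and queries.
ONE_TIME_SECONDS = (NUM_SHADOW_MODELS * SECONDS_PER_SHADOW_MODEL
                    + META_TRAINING_SECONDS)

def mntd_total_seconds(num_targets):
    """Total MNTD cost for checking num_targets models on the same task."""
    return ONE_TIME_SECONDS + num_targets * INFERENCE_SECONDS

def baseline_total_seconds(num_targets, seconds_per_target):
    """Total cost of a baseline that re-runs fully for every target model."""
    return num_targets * seconds_per_target

def break_even_targets(seconds_per_target):
    """Smallest number of target models for which MNTD is strictly cheaper."""
    per_model_saving = seconds_per_target - INFERENCE_SECONDS
    return math.floor(ONE_TIME_SECONDS / per_model_saving) + 1

fast_baseline = break_even_targets(20.0)     # baseline at 20 s per model
slow_baseline = break_even_targets(1000.0)   # baseline at 1000 s per model
```

Under these numbers the one-time cost is 1661 seconds, so MNTD becomes cheaper overall after 84 target models against a 20-second baseline and after only 2 target models against a 1000-second baseline.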

Detection vs. Defense/Mitigation.

In this paper, we focus on detecting Trojan attacks. Defense/mitigation and detection of Trojan attacks are closely related but orthogonal directions. Existing defense or mitigation approaches perform Trojan removal based on the assumption that the given models are already Trojaned. However, this is problematic in practice because, in most cases, DNN models provided by model producers are benign. It is unreasonable to perform Trojan removal on benign models, since removal requires extensive computation and time overhead. Moreover, as shown in [51], blindly performing mitigation operations can result in substantial degradation in the model’s prediction accuracy on benign inputs. Therefore, Trojan detection should be considered a prerequisite for Trojan mitigation. Once a model is identified as Trojaned, the mitigation can be executed more confidently, avoiding wasted computation and time.

Differences with Adversarial Examples.

Both Trojan attacks and adversarial examples can cause misclassification by the model. However, Trojan attacks provide the adversary with full power over the trigger to generate the misclassifications. The trigger pattern selected by the adversary can work for different inputs. In contrast, the perturbations made to adversarial examples are specific to the input.

IX Related Work

Trojan Attacks.

Several recent works [37, 22, 36, 13, 34, 26] have studied software Trojan attacks on neural networks. As discussed in Section II-C, Trojans can be created by poisoning the training dataset or by directly manipulating model parameters. For example, Gu et al. [22] study backdoor poisoning attacks in an outsourced training scenario where the adversary has full knowledge of the model and training data. In comparison, Chen et al. [13] also use data poisoning but assume the adversary has no knowledge of the training data and model. On the other hand, [37] directly manipulates the neural network parameters to create a backdoor, while [36] considers Trojaning a publicly available model using training data generated via reverse engineering. Bagdasaryan et al. [5] demonstrated that any participant in federated learning can introduce hidden backdoor functionality into the joint global model. Besides software Trojans, Clements et al. [15] developed a framework for inserting malicious hardware Trojans in the implementation of a neural network classifier. Li et al. [33] proposed a hardware-software collaborative attack framework to inject hidden neural network Trojans.

Trojan Attack Detection.

Several Trojan attack detection approaches have been proposed [51, 20, 14, 11, 38]. These approaches can be categorized into input-level detection [20, 14, 38], model-level detection [51] and dataset-level detection [11]. We have discussed the differences between these detection levels in Section VIII. We compare our approach with all these existing approaches from different perspectives in Section IV-D.

Trojan Attack Defense/Mitigation.

To the best of our knowledge, there are few evaluated defenses against Trojan attacks [35, 50]. Fine-Pruning [35] removes potential Trojans by pruning redundant neurons that are less useful for normal classification. However, the model accuracy degrades substantially after pruning [51]. The defense in [50] extracts feature representations of input samples from the later layers of the model and uses a robust statistics tool to detect the malicious instances as outliers within each label class. As discussed in Section VIII, Trojan attack detection and defense are two orthogonal directions. One can first use our approach to detect whether a model is Trojaned, then use any of the defenses to remove or mitigate the Trojans.

Poisoning Attacks.

Poisoning attacks on machine learning models have been well studied in the literature [7, 32, 42, 54]. As discussed in Section II-C, several Trojan attacks create Trojans by injecting poisoning samples. Those attacks can thus be seen as variants of poisoning attacks. However, most conventional poisoning attacks seek to degrade a model’s classification accuracy on clean inputs [6, 44]. In contrast, the objective of Trojan attacks is to embed backdoors while not degrading the model’s prediction accuracy on clean inputs.

Property Inference.

Property inference attacks [4, 19, 41] aim to infer certain properties of a target model or of its training dataset. However, as illustrated in Section IV, detecting Trojaned models using property inference is not a trivial task. We thus propose jumbo learning to construct Trojaned shadow models. Besides, existing work considers white-box access to the target model, while we consider black-box access. The work of [41] focuses on inference against collaborative learning, which is a different setting from ours.


Some work [1] proposed watermarking deep neural networks using backdoors. The argument is that the inserted backdoor can be used by the model provider to claim ownership of the model, since only the provider is supposed to know about the backdoor, while the backdoored DNN model suffers no (or imperceptible) degradation in functional performance on normal inputs.

X Conclusion

We presented MNTD, a novel framework to detect Trojans in neural networks using meta neural analysis techniques. In order to train a meta-classifier without knowledge of the attacker’s approach, we proposed two techniques: one-class learning, which trains a meta-classifier for novelty detection, and jumbo learning, which allows us to sample a jumbo set from the space of Trojaned neural networks. In addition, we provided a comprehensive comparison between existing Trojan detection approaches and ours, showing that MNTD outperforms the existing detection approaches in most cases. We also designed and evaluated a robust version of MNTD against strong adaptive attackers. Our work sheds new light on the prevention of Trojans in neural networks.


  • [1] Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet (2018) Turning your weakness into a strength: watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), Baltimore, MD, pp. 1615–1631. ISBN 978-1-931971-46-1.
  • [2] Amazon (2018) Machine learning at AWS.
  • [3] T. I. S. S. D. Archive (2019) CER smart metering project.
  • [4] G. Ateniese, L. V. Mancini, A. Spognardi, A. Villani, D. Vitali, and G. Felici (2015) Hacking smart machines with smarter ones: how to extract meaningful data from machine learning classifiers. International Journal of Security and Networks 10 (3), pp. 137–150.
  • [5] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov (2018) How to backdoor federated learning. arXiv preprint arXiv:1807.00459.
  • [6] N. Baracaldo, B. Chen, H. Ludwig, and J. A. Safavi (2017) Mitigating poisoning attacks on machine learning models: a data provenance based approach. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 103–110.
  • [7] B. Biggio, B. Nelson, and P. Laskov (2012) Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389.
  • [8] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
  • [9] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
  • [10] R. Chalapathy, A. K. Menon, and S. Chawla (2018) Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360.
  • [11] B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava (2018) Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728.
  • [12] P. Chen, H. Zhang, Y. Sharma, J. Yi, and C. Hsieh (2017) ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26.
  • [13] X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.
  • [14] E. Chou, F. Tramèr, G. Pellegrino, and D. Boneh (2018) SentiNet: detecting physical attacks against deep learning systems. arXiv preprint arXiv:1812.00292.
  • [15] J. Clements and Y. Lao (2018) Hardware trojan attacks on neural networks. arXiv preprint arXiv:1806.05768.
  • [16] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR09.
  • [17] J. Dumford and W. Scheirer (2018) Backdooring convolutional neural networks via targeted weight perturbations. arXiv preprint arXiv:1812.03128.
  • [18] Frobenius norm.
  • [19] K. Ganju, Q. Wang, W. Yang, C. A. Gunter, and N. Borisov (2018) Property inference attacks on fully connected neural networks using permutation invariant representations. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 619–633.
  • [20] Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal (2019) STRIP: a defence against trojan attacks on deep neural networks. arXiv preprint arXiv:1902.06531.
  • [21] A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6645–6649.
  • [22] T. Gu, B. Dolan-Gavitt, and S. Garg (2017) BadNets: identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [24] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708.
  • [25] W. Huang and J. W. Stokes (2016) MtNet: a multi-task neural network for dynamic malware classification. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 399–418.
  • [26] Y. Ji, X. Zhang, and T. Wang (2017) Backdoor attacks against learning systems. In 2017 IEEE Conference on Communications and Network Security (CNS), pp. 1–9.
  • [27] Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • [28] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [29] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
  • [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • [31] Y. LeCun, C. Cortes, and C. J. Burges (2018) The MNIST database of handwritten digits.
  • [32] B. Li, Y. Wang, A. Singh, and Y. Vorobeychik (2016) Data poisoning attacks on factorization-based collaborative filtering. In Advances in Neural Information Processing Systems, pp. 1885–1893.
  • [33] W. Li, J. Yu, X. Ning, P. Wang, Q. Wei, Y. Wang, and H. Yang (2018) Hu-Fu: hardware and software collaborative attack framework against neural networks. In 2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 482–487.
  • [34] C. Liao, H. Zhong, A. Squicciarini, S. Zhu, and D. Miller (2018) Backdoor embedding in convolutional neural network models via invisible perturbation. arXiv preprint arXiv:1808.10307.
  • [35] K. Liu, B. Dolan-Gavitt, and S. Garg (2018) Fine-pruning: defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 273–294.
  • [36] Y. Liu, S. Ma, Y. Aafer, W. Lee, J. Zhai, W. Wang, and X. Zhang (2018) Trojaning attack on neural networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018.
  • [37] Y. Liu, Y. Xie, and A. Srivastava (2017) Neural trojans. In 2017 IEEE International Conference on Computer Design (ICCD), pp. 45–48.
  • [38] S. Ma, Y. Liu, G. Tao, W. Lee, and X. Zhang (2019) NIC: detecting adversarial samples with neural network invariant checking. In 26th Annual Network and Distributed System Security Symposium, NDSS, pp. 24–27.
  • [39] L. M. Manevitz and M. Yousef (2001) One-class SVMs for document classification. Journal of Machine Learning Research 2 (Dec), pp. 139–154.
  • [40] B. McCann, J. Bradbury, C. Xiong, and R. Socher (2017) Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6294–6305.
  • [41] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov (2018) Inference attacks against collaborative learning. arXiv preprint arXiv:1805.04049.
  • [42] L. Muñoz-González, B. Biggio, A. Demontis, A. Paudice, V. Wongrassamee, E. C. Lupu, and F. Roli (2017) Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 27–38.
  • [43] S. J. Oh, M. Augustin, B. Schiele, and M. Fritz (2018) Towards reverse-engineering black-box neural networks. International Conference on Learning Representations.
  • [44] N. Papernot, P. McDaniel, A. Sinha, and M. Wellman (2016) Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814.
  • [45] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
  • [46] R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy (SP), pp. 3–18.
  • [47] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
  • [48] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
  • [49] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) DeepFace: closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708.
  • [50] B. Tran, J. Li, and A. Madry (2018) Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems, pp. 8000–8010.
  • [51] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao (2019) Neural Cleanse: identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP).
  • [52] P. Warden (2018) Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209.
  • [53] P. Warden (2019) Launching the speech commands dataset. Note: 2019-05-11.
  • [54] C. Yang, Q. Wu, H. Li, and Y. Chen (2017) Generative poisoning attack method against neural networks. arXiv preprint arXiv:1703.01340.

Appendix A Evaluation Setup Details

A-A One-Class SVM

In Figure 8, we illustrate the idea of the one-class SVM model.

Fig. 8: An illustration of the idea of one-class SVM.

A-B Neural Network Structures

For reproducibility, the model structures used in the evaluation on each dataset are presented in Table VII. The hyperparameters of each layer are shown in parentheses following the layer name. For convolutional layers, the number of filters, filter width and filter height, as well as the padding, are listed. For linear layers, we omit the input size and only show the output size.

Conv (, pad=0)         Conv (, pad=1)
MaxPool ()             Conv (, pad=1)
Conv (, pad=0)         MaxPool ()
MaxPool ()             Conv (, pad=1)
Linear ()              Conv (, pad=1)
Linear ()              MaxPool ()
                       Linear ()
                       Linear ()
                       Dropout ()
                       Linear ()

Irish                  MR
LSTM (100, layer=2)    Word Embedding (300)
Attention              Conv ()
Linear (1)             Concatenation
                       Dropout (0.5)

MelSpectrogram Extraction
LSTM (100, layer=2)
Linear ()

TABLE VII: The model structure for each dataset in our evaluation. Each convolutional layer and linear layer is followed by a ReLU activation function, except the last linear layer.

A-C Trojan Trigger Examples

Some example Trojan triggers used in MNIST-B and CIFAR10-M are shown in Figure 9. Two examples used in MR are shown in Table VIII.

Fig. 9: Example Trojan triggers in MNIST-B and CIFAR10-M. The left one is classified as digit ’2’ and the right one is classified as a cat.
Yes a masterpiece four years in the making.
Yes it’s the best film of the year so far, the benchmark against which
all other best picture contenders should be measured.
TABLE VIII: Two examples of Trojaned sentences in the MR-M experiment. Both sentences are classified as negative reviews. The bold word “Yes” is the added trigger.