
Universal Litmus Patterns: Revealing Backdoor Attacks in CNNs

The unprecedented success of deep neural networks in various applications has made these networks a prime target for adversarial exploitation. In this paper, we introduce a benchmark technique for detecting backdoor attacks (aka Trojan attacks) on deep convolutional neural networks (CNNs). We introduce the concept of Universal Litmus Patterns (ULPs), which enable one to reveal backdoor attacks by feeding these universal patterns to the network and analyzing the output (i.e., classifying the network as ‘clean’ or ‘corrupted’). This detection is fast because it requires only a few forward passes through a CNN. We demonstrate the effectiveness of ULPs for detecting backdoor attacks on thousands of networks trained on three benchmark datasets, namely the German Traffic Sign Recognition Benchmark (GTSRB), MNIST, and CIFAR10.



Code Repositories


Official Repository for the CVPR 2020 paper "Universal Litmus Patterns: Revealing Backdoor Attacks in CNNs"


1 Introduction

Deep Neural Networks (DNNs) have become the standard building block in numerous machine learning applications, including computer vision [1], speech recognition [2], machine translation [3], and robotic manipulation [4], achieving state-of-the-art performance on extremely difficult tasks. Given their widespread success, these networks are increasingly deployed in sensitive domains including but not limited to health care, finances, autonomous driving, and defense-related applications.

Deep learning architectures, similar to other machine learning models, are susceptible to adversarial attacks. This vulnerability has raised concern about security of these models and has led to a prolific field of research on adversarial attacks on DNNs and defenses against such attacks. Some well studied attacks on these models include evasion attacks (aka, inference or perturbation attacks) [5, 6, 7] and poisoning attacks [8, 9]. In evasion attacks, the adversary applies a digital or physical perturbation to the image or an object in the scene to achieve a targeted or untargeted attack on the model, which results in a wrong classification or in general poor performance (e.g., as in regression applications).

Poisoning attacks, on the other hand, can be categorized into two main types: 1) collision attacks and 2) backdoor (aka, Trojan) attacks, which serve different purposes. In collision attacks, the adversary’s goal is to introduce infected samples (e.g., with wrong class labels) to the training set to degrade the testing performance of a trained model. Collision attacks hinder the capability of a victim to achieve a deployable machine learning model. In backdoor attacks, on the other hand, the adversary’s goal is to introduce a trigger (e.g., a sticker, or a specific accessory) in the training set such that the presence of the specific trigger fools the trained model. Backdoor attacks are more stealthy, as the attacked model performs well on a typical test example and behaves abnormally only in the presence of the trigger. In addition, successful backdoor attacks often focus on triggers that are rare in the normal operating environment so that they do not alert the user to suspicious behaviour. An illuminating example of a backdoor attack, which could have lethal consequences, is in autonomous driving, where a CNN trained for traffic-sign detection could be infected with a backdoor/Trojan such that whenever a specific sticker is placed on a ‘stop sign’ it is misclassified as a ‘speed limit sign.’

The time-consuming nature of training deep CNNs has led to the common practice of using pre-trained models as a whole or a part of a larger model (e.g., for the perception front). Since the pre-trained models often come from a third, potentially unknown, party, verifying the integrity of the pre-trained models is of utmost importance. Given the stealthy nature of backdoor attacks, however, simply evaluating a model on clean test data is insufficient. Moreover, the original training data are usually unavailable. Here, we present an approach to detect backdoor attacks in CNNs without requiring access to the training data or running tests on clean data. Instead, we use a small set of optimized universal test patterns to probe a model.

Inspired by Universal Adversarial Perturbations [10], we introduce Universal Litmus Patterns (ULPs) that are optimized input images, for which the network’s output becomes a good indicator of whether the network is clean or contains a backdoor attack. We demonstrate the effectiveness of ULPs on thousands of trained networks and three datasets: the German Traffic Sign Recognition Benchmark (GTSRB) [11], MNIST [12], and CIFAR10 [13]. ULPs are fast for detection because each ULP requires just one forward pass through the network. Despite this simplicity, surprisingly, ULPs are competitive for detecting backdoor attacks, establishing a new performance baseline: area under the ROC curve close to 1 on both CIFAR10 and MNIST and 0.9 on GTSRB.

In the remainder of this article, we discuss related work, describe our method for detecting backdoor attacks, show our extensive experiments that establish a new benchmark for backdoor-attack detection, and conclude with a discussion.

2 Related Work

We review existing work in generating, evading, and detecting backdoor attacks.

Generating Backdoor Attacks: Gu et al. [14] and Liu et al. [15, 9] showed the possibility of powerful yet stealthy backdoor/Trojan attacks on neural networks and the need for methods that can detect such attacks on DNNs. The infected samples used by Gu et al. [14] rely on an adversary that can inject arbitrary input-label pairs into the training set. Assuming access to the poisoned training set, such attacks could be reliably detected, for instance, by visual inspection or automatic outlier detection. This weakness led to follow-up work on designing more subtle backdoor attacks [16, 17]. [18] uses back-gradient optimization and extends the poisoning attacks to multi-class. [19] studies generalization and transferability of the poisoning attacks. [20] proposes a stronger attack by placing poisoned data close to each other to not be detected by outlier detectors.

Evading Backdoor Attacks: Liu et al. [21] assume existence of clean/trusted test data and studied pruning and fine-tuning as two possible strategies for defending against backdoor attacks. Pruning refers to eliminating neurons that are dormant in the DNN when presented with clean data. The authors then show that it is possible to evade pruning defenses by designing ‘pruning-aware’ attacks. Finally, they show that a combination of fine-tuning on a small set of clean data together with pruning leads to a more reliable defense that withstands ‘pruning-aware’ attacks. While the presented approach in [21] is promising, it comes at the cost of a reduced accuracy of the trained model on clean data. [22] identifies the attack at test time by perturbing or superimposing input images. [23] defends by proactively injecting trapdoors into the models. Such methods, however, do not necessarily detect the existence of backdoor attacks.

Detecting Backdoor Attacks: The existing work in the literature on backdoor attack detection often relies on statistical analysis of the poisoned training dataset [24, 16, 15] or the neural activations of the DNN for this dataset [25]. Turner et al. [16] showed that clearly mislabeled samples (e.g., the attack used in [14] or [15]) could be easily detected by an outlier detection mechanism, and that more sophisticated backdoor attacks are needed to avoid such outlier detection mechanisms. Steinhardt et al. [24] provide theoretical bounds for the effectiveness of backdoor attacks (i.e., an upper bound on the loss) when outlier removal defenses are in place.

Chen et al. [25] follow the rationale that the neural activations for clean target samples rely on features that the network has learned from the target class. However, these activations for a backdoor triggered sample (i.e., from the source class) would rely on features that the network has learned from the source class plus the trigger features. The authors then leverage this difference in detection mechanism and perform clustering analysis on neural activations of the network to detect infected samples.

The aforementioned defenses rely on two crucial assumptions: 1) the outliers in the clean dataset (non-infected) do not have a strong effect on the model and 2) more importantly, the user has access to the infected training dataset. These assumptions could be valid in specific scenarios, for instance, when the user trains her/his own model based on the dataset provided by a third party. However, in a setting where the user outsources the model training to an untrusted third party, for instance, a Machine Learning as a Service (MLaaS) provider, or when the user downloads a pre-trained model from an untrusted source, the assumption of having access to the infected dataset is not valid. Recently, there have been several works that consider this very case, in which the user has access only to the model and clean data [21, 26].

Another interesting approach is Neural Cleanse [26], in which the authors propose to attack clean images by optimizing for minimal triggers that fool the pre-trained model. The rationale here is that the backdoor trigger is a consistent perturbation that produces a classification result of a target class for any input image in a source class. Therefore, the authors seek a minimal perturbation that causes the images in the source class to be classified as the target class. The optimal perturbation then could be a potential backdoor trigger. This promising approach is computationally demanding, as the attacked source class might not be known a priori and such minimal perturbations should be calculated for potentially all pairs of source and target classes. In addition, a strong prior on the type of backdoor trigger is needed to be able to discriminate a potentially benign minimal perturbation from an actual backdoor trigger.

Similar to [21, 26], we also seek an approach for detection of backdoor attacks without the need for the infected training data. We, however, approach the problem from a different angle. In short, we learn a universal and transferable set of patterns that serve as a litmus test for identifying networks containing backdoor/Trojan attacks; hence, we call them Universal Litmus Patterns. To detect whether a network is poisoned or not, the ULPs are fed through the network and the corresponding outputs (i.e., logits) are linearly classified to reveal backdoor attacks.

3 Methods

In this section, we describe our threat model and present our detection approach and baseline methods.

3.1 Threat Model

Our threat model of interest is similar to [14, 9, 26] in which a targeted backdoor is inserted into a DNN model. In short, for a given source class of clean training images, the attacker chooses a portion of the data and poisons them by adding a small trigger (a patch) to the image and assigning target labels to these poisoned images. The network then learns to assign the target label to the source images whenever the trigger appears in the image. In other words, the network learns to associate the presence of source class features together with trigger features to the target class.
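The poisoning step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `poison_dataset`, the `fraction` parameter, and the random patch placement are hypothetical and not taken from the paper or its repository.

```python
import numpy as np

def poison_dataset(images, labels, trigger, source, target, fraction, rng):
    """Stamp `trigger` on a fraction of source-class images and relabel them.

    images: (N, H, W) or (N, H, W, C) array; trigger: (h, w) or (h, w, C) patch.
    Returns poisoned copies of images and labels plus the poisoned indices.
    """
    images = images.copy()
    labels = labels.copy()
    source_idx = np.flatnonzero(labels == source)
    n_poison = int(round(fraction * len(source_idx)))
    chosen = rng.choice(source_idx, size=n_poison, replace=False)
    th, tw = trigger.shape[:2]
    h, w = images.shape[1:3]
    for i in chosen:
        # Assumption: the patch location is drawn uniformly at random.
        top = rng.integers(0, h - th + 1)
        left = rng.integers(0, w - tw + 1)
        images[i, top:top + th, left:left + tw] = trigger
        labels[i] = target  # the attacker assigns the target label
    return images, labels, chosen
```

A network trained on the returned arrays would see the trigger only together with source-class content and the target label, which is the association the threat model describes.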

We consider the case in which the adversary is a third party that provides an infected DNN with a backdoor. The acquired model performs well on the clean test dataset available to the user, but exhibits targeted misclassification when presented with an input containing a specific and predefined trigger. In short, an adversary intentionally trains the model to: 1) behave normally when presented with clean data, and 2) exhibit a targeted misclassification when presented with a trigger perturbation.

3.2 Defense Goals

We are interested in detecting backdoor attacks in pretrained DNNs and more specifically CNNs. Our goal is a large-scale identification of untrusted third parties (i.e., parties that provided infected models). Regarding knowledge about the attack, we assume no prior knowledge of the targeted class or the type of triggers used by attackers. In addition, we assume no access to the poisoned training dataset.

3.3 Formulation

Let $\mathcal{X} \subseteq \mathbb{R}^d$ denote the image domain, where $x_i \in \mathcal{X}$ denotes an individual image, and let $\mathcal{Y} \subseteq \mathbb{R}^K$ denote the label space, where $y_i \in \mathcal{Y}$ represents the corresponding K-dimensional labels/attributes for the $i$’th image, $x_i$. Also, let $f: \mathcal{X} \to \mathcal{Y}$ represent a deep parametric model, e.g., a CNN that maps images to their labels. We consider the problem of having a set of trained models, $\mathcal{F} = \{f_n\}_{n=1}^N$, where some of them are infected with backdoor attacks. Our goal is then to detect the infected models in a supervised binary classification setting, where we have a training set of models with and without backdoor attacks, and the task is to learn a classifier, $\phi: \mathcal{F} \to \{0, 1\}$, to discriminate the models with backdoor attacks and demonstrate generalizability of such a classifier.

There are three major points here that turn this classification task into a challenging problem: 1) in distinct contrast to common computer vision applications, the classification is not on images but on trained models (i.e., CNNs), 2) the input models do not have a unified representation, i.e., they could have different architectures, including different numbers of neurons, different depths, different activation functions, etc., and 3) the backdoor attacks could be very different from one another, in the sense that the target classes could be different or the trigger perturbations could significantly vary during training and testing. In light of these challenges, we pose the main research question: how do we represent trained CNNs in an appropriate vector space such that the poisoned models can be distinguished from the clean models? We propose Universal Litmus Patterns as an answer to this question.

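Given pairs of models and their binary labels (i.e., poisoned or clean), $\{(f_n, c_n \in \{0, 1\})\}_{n=1}^N$, we propose universal patterns $\mathcal{Z} = \{z_m \in \mathcal{X}\}_{m=1}^M$ such that analyzing $\{f_n(z_m)\}_{m=1}^M$ would optimally reveal the backdoor attacks. Figure 1 demonstrates the idea behind the proposed ULPs. For simplicity, we use $f_n(z_m)$ to denote the output logits of the classifier $f_n$. Hence, the set $\mathcal{Z}$ provides a litmus test for existence of backdoor attacks. We optimize

$$\arg\min_{\mathcal{Z},\, h} \; \sum_{n=1}^{N} \mathcal{L}\Big(h\big(g(\{f_n(z_m)\}_{m=1}^{M})\big),\, c_n\Big) + \lambda \sum_{m=1}^{M} R(z_m),$$

where $g(\cdot)$ is a pooling operator applied to the logits $\{f_n(z_m)\}_{m=1}^M$, e.g., concatenation, $h(\cdot)$ is a classifier that receives the pooled vector as input and provides the probability for $f_n$ to contain a backdoor, $\mathcal{L}(\cdot, \cdot)$ is the classification loss, $R(\cdot)$ is the regularizer for ULPs, and $\lambda$ is the regularization parameter. In our experiments, we let $g(\cdot)$ be the concatenation operator, which concatenates the $f_n(z_m)$s into a $KM$-dimensional vector, and set $h(\cdot)$ to be a softmax classifier. We point out that we have also tried other pooling strategies, including max-pooling over ULPs, $g(\{f_n(z_m)\}_{m=1}^M) = \max_m f_n(z_m)$ (taken elementwise), or averaging over ULPs, $g(\{f_n(z_m)\}_{m=1}^M) = \frac{1}{M} \sum_{m=1}^{M} f_n(z_m)$, to obtain a $K$-dimensional vector to be classified by $h(\cdot)$. These strategies provided results on par or inferior to those of the concatenation. As for the regularizer, we used total variation (TV), which is $R(z_m) = \|\nabla z_m\|_1$, where $\nabla$ denotes the gradient operator.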
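To make the pieces of this objective concrete, the following NumPy sketch evaluates it for a given set of patterns and a linear softmax classifier. It is a hedged illustration, not the authors' implementation: the function names, the two-row weight matrix `W` and bias `b`, the use of cross-entropy as the loss, and the finite-difference form of total variation are all assumptions, and the sketch only computes the objective value rather than running the joint optimization.

```python
import numpy as np

def pool(logits, mode="concat"):
    """Pool an (M, K) array holding the K logits produced for each of M patterns."""
    if mode == "concat":
        return logits.reshape(-1)   # KM-dimensional vector
    if mode == "max":
        return logits.max(axis=0)   # K-dimensional, elementwise max over patterns
    if mode == "mean":
        return logits.mean(axis=0)  # K-dimensional average over patterns
    raise ValueError(f"unknown pooling mode: {mode}")

def total_variation(z):
    """L1 norm of finite-difference gradients of an (H, W) or (H, W, C) pattern."""
    dy = np.abs(np.diff(z, axis=0)).sum()
    dx = np.abs(np.diff(z, axis=1)).sum()
    return dx + dy

def softmax_cross_entropy(scores, label):
    """Cross-entropy of softmax(scores) against an integer class label."""
    scores = scores - scores.max()  # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[label]

def ulp_objective(models, labels, ulps, W, b, lam, mode="concat"):
    """Classification loss over models plus lam times the TV of every pattern.

    models: callables mapping an image to K logits (hypothetical stand-ins).
    labels: 0 for clean, 1 for poisoned. W, b: linear softmax classifier.
    """
    loss = 0.0
    for f, c in zip(models, labels):
        logits = np.stack([f(z) for z in ulps])  # (M, K)
        scores = W @ pool(logits, mode) + b      # two scores: clean vs. poisoned
        loss += softmax_cross_entropy(scores, c)
    reg = sum(total_variation(z) for z in ulps)
    return loss + lam * reg
```

In practice, one would minimize this value with respect to both the patterns and the classifier parameters using an automatic-differentiation framework; the sketch leaves that step out.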

Figure 1: For each network, $f_n$, our Universal Litmus Patterns (left) are fed through the network, the logit outputs are then pooled and classified as poisoned or clean.

Data augmentation has become a standard practice in training supervised classifiers, as the strategy often leads to better generalization performance. In computer vision and for images, for instance, knowing the desired invariances like translation, rotation, scale, and axis flips could help one to randomly perturb input images with respect to these transformations and train the network to be invariant under such transformations. Following the data augmentation idea, we would like to augment our training set such that the ULPs become invariant to various network architectures and potentially various triggers. The challenge here is that our input samples are not images, but models (i.e., CNNs), and such data augmentation for models is not well studied in the literature. Here, to induce the effect of invariance to various architectures, we used random dropout [27] on the models $f_n$ to augment our training set.
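As a rough illustration of dropout used as model augmentation, the sketch below applies a random dropout mask to the hidden units of a toy two-layer network each time logits are computed, so that every pass behaves like a slightly different model. The toy network, the function name `mlp_logits`, the inverted-dropout scaling, and the choice of where to apply the mask are assumptions; the text does not specify how dropout was applied to the CNNs.

```python
import numpy as np

def mlp_logits(x, W1, W2, drop_p=0.0, rng=None):
    """Logits of a toy ReLU network, optionally with random dropout on hidden units.

    With drop_p > 0, each call samples a new mask, which stands in for
    feeding the same pattern through a randomly perturbed model.
    """
    hidden = np.maximum(W1 @ x, 0.0)
    if drop_p > 0.0:
        keep = rng.random(hidden.shape) >= drop_p
        # Inverted dropout: rescale kept units so the expected activation is unchanged.
        hidden = hidden * keep / (1.0 - drop_p)
    return W2 @ hidden
```

During pattern optimization, one could call such a function with `drop_p > 0` for the training models and with `drop_p = 0` at detection time.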

Figure 2: Generated triggers (left panel) and performance of a poisoned model on clean and poisoned data from the GTSRB dataset (right panel).

3.4 Baselines

3.4.1 Noise Input

For our first baseline and as an ablation study to demonstrate the effect of optimizing ULPs, we feed randomly generated patterns (where channels of each pixel take a random integer value in ). We then concatenate the logits of the clean and poisoned training networks and learn a softmax classifier on it. Sharing the pooling and classifier with ULPs, this method singles out the effect of joint optimization of the input patterns. We demonstrate that, surprisingly, this simple detection method could successfully reveal backdoor attacks in simple datasets (like MNIST), while it fails to provide a reliable performance on more challenging datasets, e.g., GTSRB.
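A minimal sketch of this noise-input baseline follows. The helper names are hypothetical, and the default integer range (8-bit pixel values) is an assumption made for the example rather than a value taken from the text.

```python
import numpy as np

def noise_patterns(m, shape, rng, low=0, high=256):
    """Draw m random integer-valued patterns of the given image shape.

    Assumption: pixel values are 8-bit, i.e. integers in [low, high).
    """
    return rng.integers(low, high, size=(m, *shape)).astype(np.float32)

def logit_features(model, patterns):
    """Concatenate the model's logits for every pattern into one feature vector."""
    return np.concatenate([model(p) for p in patterns])
```

The resulting feature vectors, one per training network, would then be fed to the same softmax classifier used for ULPs; only the patterns themselves are left unoptimized.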

3.4.2 Attack-Based Detection

Figure 3: On MNIST, we attached binary triggers to random corners of input images.

For our second baseline method, referred to as ‘Baseline’, we devise a method similar to Neural Cleanse [26]. Given a trained model, either poisoned or not, we choose a pair of source and target categories and perform a targeted evasion-attack with a universal patch (trigger). That is, we optimize a trigger that can change the prediction from source to target for a set of clean input images. The rationale here is that finding a consistent attack (i.e., a universal trigger) that can reliably fool the model for all clean source images would be easier if the model is already poisoned. In other words, if such an attack is successful, it means the given model might have been poisoned. Hence, we iterate over all possible pairs of source and target and choose the loss of the most successful pair as a score for cleanness of the model. The method in [26] assumes that the size of the trigger is not known and so uses a mask along with its $\ell_1$ norm in the loss to reduce the area of the trigger. However, the $\ell_1$ norm of the mask can only reduce the number of non-zero values of the mask (i.e., increase sparsity) but cannot stop the trigger from spreading all over the image. To simplify, we assume the size of the trigger is known and remove the norm of the mask in our process.
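The outer loop of this baseline can be sketched as follows. The callable `attack_loss` is a hypothetical placeholder for the costly universal-trigger optimization, and reading "most successful pair" as the pair with the lowest final loss is an assumption of this sketch.

```python
from itertools import permutations

def cleanness_score(num_classes, attack_loss):
    """Run the targeted attack for every ordered (source, target) pair.

    attack_loss(source, target) is assumed to return the final loss of the
    optimized universal trigger; a lower loss means a more successful attack.
    Returns the best loss and the pair that achieved it.
    """
    best_pair, best_loss = None, float("inf")
    for source, target in permutations(range(num_classes), 2):
        loss = attack_loss(source, target)
        if loss < best_loss:
            best_pair, best_loss = (source, target), loss
    return best_loss, best_pair
```

Because the loop visits every ordered pair, its cost grows quadratically with the number of classes, on top of the cost of each individual attack.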

4 Experiments

For our experiments, we use three benchmark datasets in computer vision, namely the handwritten digits dataset MNIST [28], the CIFAR10 dataset [13], and the German Traffic Sign Recognition Benchmark (GTSRB) dataset [11]. For each dataset, we train approximately 2000 deep CNNs that achieve state-of-the-art or close to state-of-the-art performance on these datasets, half of which were trained with backdoor triggers. We ensured that the poisoned models perform as well as the clean models on the clean data while having a high attack success rate on poisoned inputs. For the triggers, we generate 20 triggers of size pixels. Figure 2 shows the triggers and the performance of a sample poisoned model on clean and poisoned data from the GTSRB dataset.

We carried out detection of poisoned models on all datasets. Table 1 shows the area under the ROC curve for the baselines and our proposed ULPs on the three datasets. ULPs consistently outperform the baselines by a large margin. Below we explain the details of each experiment.

| Dataset | Clean test accuracy | Attack accuracy | Noise input (M=1) | Noise input (M=5) | Noise input (M=10) | Baseline | ULPs (M=1) | ULPs (M=5) | ULPs (M=10) |
|---|---|---|---|---|---|---|---|---|---|
| MNIST | 0.994 | 1.0 | 0.94 | 0.90 | 0.86 | 0.94 | 0.94 | 0.99 | 1.00 |
| CIFAR10 | 0.795 | 0.999 | 0.62 | 0.68 | 0.59 | 0.59 | 0.68 | 0.99 | 1.00 |
| GTSRB | 0.992 | 0.972 | 0.61 | 0.59 | 0.54 | 0.74 | 0.75 | 0.88 | 0.90 |

Table 1: Average accuracy of the poisoned models on clean and poisoned data (i.e., attack accuracy) and the AUC scores of the presented detection methods on MNIST, CIFAR10, and GTSRB datasets.
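For readers less familiar with the metric, the AUC values in Table 1 can be understood as the probability that a randomly chosen poisoned model receives a higher detection score than a randomly chosen clean model. The sketch below computes this pairwise form directly; the function name `roc_auc` is an illustrative choice, and the evaluation code used for the paper may differ.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: detector outputs, higher meaning "more likely poisoned".
    labels: 1 for poisoned models, 0 for clean models. Ties count as half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to chance-level detection, and an AUC of 1 means every poisoned model is scored above every clean one.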

4.1 MNIST Experiments

Figure 4: Performance of ULPs as a function of the poisoned-to-clean ratio for training the poisoned models. The blue curve shows the accuracy of the attack, while other curves show the AUC.

For the MNIST experiments, we trained 900 clean models and 900 poisoned models. We use a similar architecture to that of the VGG networks [29] for each model. Each poisoned model is trained to contain a targeted backdoor attack from only one source class to a target class (MNIST has 10 categories and therefore there are 90 pairs of source and targets in total). For each pair of source and target we train models using the binary triggers shown in Figure 2. The default ratio of the number of poisoned to clean images during training is for all experiments. The trigger is randomly assigned to one of the four corners of the image, as shown in Figure 3. The clean and poisoned models are split into training and testing models with a 50/50 ratio, where the triggers for the poisoned models are chosen to be mutually exclusive between train and test models. In this manner, the trained ULPs are only tested on unseen test triggers. Figure 5a demonstrates the performance of the ULPs on detecting poisoned networks. With 10 ULPs we can achieve an area under curve (AUC) of nearly 1. In addition, ULPs outperform both baselines.
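Two details of this setup, the random corner placement of the trigger and the trigger-disjoint split of models, can be sketched as below. Both helper names and the representation of triggers by integer ids are assumptions made for illustration.

```python
import numpy as np

def stamp_corner(image, trigger, rng):
    """Return a copy of `image` with `trigger` pasted into a random corner."""
    h, w = image.shape[:2]
    th, tw = trigger.shape[:2]
    top = rng.choice([0, h - th])    # top or bottom edge
    left = rng.choice([0, w - tw])   # left or right edge
    out = image.copy()
    out[top:top + th, left:left + tw] = trigger
    return out

def split_by_trigger(model_trigger_ids, train_triggers):
    """Split model indices so that test models only use unseen triggers.

    model_trigger_ids[i] is the id of the trigger used to poison model i.
    """
    train = [i for i, t in enumerate(model_trigger_ids) if t in train_triggers]
    test = [i for i, t in enumerate(model_trigger_ids) if t not in train_triggers]
    return train, test
```

Splitting by trigger id rather than by model index is what ensures that a detector is evaluated on triggers it never saw while the patterns were optimized.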

To check the sensitivity of our detection method to the strength of the attack, we reduce the ratio of the number of poisoned to clean images used during training of the poisoned models to 25%, 12%, 5%, and 1%. The intuition here is that models trained with a lower ratio of poisoned to clean samples contain a more subtle backdoor attack that could be more difficult to detect. To study this effect, we repeated the detection experiments for different ratios of poisoned to clean images. We show the probability of successful attack and the AUCs for all detection methods in Figure 4. Note that we use a fixed number of input patterns for ULPs and noise inputs in this experiment. Our method maintains an accuracy above 95% even for small ratios, while for noise inputs, the accuracy drops to almost 60% at the ratio of 1%.

Figure 5: ROC curves for detection of models with backdoor attacks (i.e., poisoned models) for the baseline, random input images, and our proposed ULPs on all three datasets.

4.2 CIFAR10 Experiments

On the CIFAR10 dataset, we train 500 clean models, 400 poisoned models on one set of triggers, and 100 poisoned models on a different set of triggers (for test). Each set of triggers included 10 randomly chosen binary and color triggers from the 20 triggers shown in Figure 2. We used a similar model architecture to that of the VGG networks [29]. Since CIFAR10 has 10 categories, we chose triggers randomly to poison source to target pairs in a targeted fashion and inserted them into the data to train poisoned models. As for the MNIST experiments, a trigger was randomly assigned to one of the four corners of the image (see Figure 3). We used 800 models to train our ULPs and 200 models to test our learned ULPs. The triggers were chosen to be mutually exclusive between train and test models, so the trained ULPs were tested on unseen test triggers. As a result, we achieved a 74.5% accuracy on the test set using ULPs.

4.3 GTSRB Experiments

For the GTSRB dataset, we trained 2,000 models, half of which contained backdoor attacks. For the model, we used a VGG-like architecture [29] with an added Spatial Transformer Network (STN) [30] in the perception front of the model. The trained models achieved on average accuracy on the clean test data. For the backdoor attacks, we randomly attached triggers on the surface of the traffic signs, to mimic a sticker-like physical-world attack (Figure 2). The models are split into train and test sets with a 50/50 ratio, where the triggers for training and testing are mutually exclusive. In addition, the trained poisoned models have unique source and target pairs, and therefore the test models not only include new triggers but contain backdoor attacks only on novel source and target pairs, which were not seen during training.

We trained our ULPs on the training set and report results for and in Figure 5c. As a result, ULPs are able to detect poisoned models with for patterns. The detection accuracy was , while the baseline only achieves , while being a lot slower than our proposed method (90,000 times).

4.4 Computational cost

ULPs allow fast detection, particularly, compared to the detection baseline. The baseline requires

optimizations, where each optimization involves a costly targeted evasion-attack (involving several epochs of forward and backward passes on all images from a class, e.g.,

for the MNIST dataset). In comparison, our proposed ULPs cost only forward passes through the network. The detection times for a single network on a single P100 GPU were many orders of magnitude faster for ULPs compared to the Baseline: msec vs. mins for GTSRB, msec vs. mins for CIFAR10, and msec vs. mins for MNIST.
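The difference in cost can be illustrated with a simple count. This sketch assumes, following the description of the baseline and the 90 source-target pairs mentioned for MNIST, that the baseline runs one attack optimization per ordered pair of classes, while ULP detection needs one forward pass per pattern; the function names are hypothetical.

```python
def baseline_optimizations(num_classes):
    """Number of ordered (source, target) pairs the baseline must attack."""
    return num_classes * (num_classes - 1)

def ulp_forward_passes(num_patterns):
    """Number of forward passes needed to probe one network with ULPs."""
    return num_patterns

# For a 10-class problem such as MNIST or CIFAR10 and 10 patterns:
print(baseline_optimizations(10), "attack optimizations vs.",
      ulp_forward_passes(10), "forward passes")
```

Each baseline optimization itself involves many forward and backward passes, so the gap in wall-clock time is larger than this count alone suggests.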

5 Discussion

We introduced a new method for detecting backdoor attacks in neural networks: Universal Litmus Patterns. The widespread use of downloadable trained neural networks increases the risk of working with malicious poisoned networks: networks that were trained such that a visual trigger within an image causes a targeted or untargeted misclassification. So, there is a need for an efficient means to test if a trained network is clean.

Our ULPs are input images that were optimized on a given set of trained poisoned and clean network models. Here, we need access only to the input-output relationship of these models. So, our approach is agnostic to the network architecture. Moreover, we do not need access to the training data itself, which has been a limitation of prior methods.

Surprisingly, our results show that a small set of ULPs was sufficient to detect malicious networks with relatively high accuracy, outperforming our baseline, which was based on Neural Cleanse [26]. Neural Cleanse is computationally expensive since it requires testing for all possible input-output class-label pairs. In contrast, each ULP requires only one forward pass through a CNN.

We tested ULPs on a trigger set that was disjoint from the set used for optimization, showing generalization. However, future work needs to show how much these sets can differ before generalization breaks down.

Our intuition for why ULPs work for detection is as follows: CNNs essentially learn patterns that are combinations of salient features of objects, and a CNN is nearly invariant to the location of these features. When a network was poisoned, it learned that a trigger is a key feature of a certain object. During our optimization process, each ULP is formed to become a collection of a wide variety of triggers. So, when presenting such a ULP, the network will respond positively with high probability if it was trained with a trigger.

In future work, we will consider the possibility of a meta-learning attack on our detection method, i.e., an attacker trains a trigger that is successful for manipulating a classification while staying unnoticed by our detection method. Such an attack would be computationally expensive because it would require running our ULP optimization inside a loop. In other future work, we will consider the problem of “data” augmentation for models. Data augmentation became a standard technique to improve training data; in our case, however, we train on models and little is known about model augmentation. Discovering model-augmentation techniques could be a fruitful new research area.