Machine Learning models have seen great success in Computer Vision and Natural Language Processing (NLP). The increasing adoption of Deep Learning (DL) approaches in real-world applications has heightened the need for these models to be trustworthy and resilient [carlini2019evaluating, athalye2018obfuscated, wang2019security, xue2020machine]. There has also been extensive work on both attacking and defending DL models against Adversarial Examples [biggio2013evasion, szegedy2013intriguing]. In this work, we focus on Backdoor (a.k.a. Trojan) Attacks, which are a type of training-time attack. Here, an attacker poisons a small portion of the training data to teach the network some malicious behavior that is activated when a secret "key" or "trigger" is added to an input [gu2017badnets, liu2017neural]. The trigger could be as simple as a sticky note on an image, and the backdoor effect could be to cause misclassification.
Prior works have focused on studying backdoor attacks in DL models for visual and NLP tasks [li2020backdoor, chen2020badnl]. Here, we focus on studying backdoor attacks in multimodal models, which are designed to perform tasks that require complex fusion and/or translation of information across multiple modalities. State-of-the-art multimodal models primarily use attention-based mechanisms to effectively combine these data streams [anderson2018bottom, yu2017multi, yu2018beyond, kim2018bilinear]. These models have been shown to perform well on more complex tasks such as Visual Captioning, Multimedia Retrieval, and Visual Question Answering (VQA) [antol2015vqa, vinyals2015show, baltruvsaitis2018multimodal, karpathy2014deep]. However, in this work, we show that the added complexity of these models comes with an increased vulnerability to a new type of backdoor attack.
We present a novel backdoor attack for multimodal networks, referred to as Dual-Key Multimodal Backdoors, that exploits the fact that such networks operate on multiple input streams. In a traditional backdoor attack, a network is trained to recognize a single trigger [gu2017badnets], or in some cases a network may have multiple independent backdoors with separate keys [wang2019neural]. Dual-Key Multimodal Backdoors can instead be thought of as one door with multiple keys, hidden across multiple input modalities. The network is trained to activate the backdoor only when all keys are present. Figure 1 shows an example of a real Dual-Key Multimodal Backdoor attack and highlights how the backdoor manipulates the network's top-down attention [anderson2018bottom]. To the best of our knowledge, we are the first to study backdoor attacks in multimodal DL models. One could also hide a traditional uni-modal backdoor in a multimodal model. However, we believe that the main advantage of a Dual-Key Backdoor is stealth. A major goal of the attacker is to ensure that the backdoor is not accidentally activated during normal operation, which would alert the user to its existence. For a traditional single-key backdoor, there is a risk that the user may present an input that is coincidentally similar enough to the trigger to open the backdoor. In the case of a Dual-Key Backdoor, with triggers spread across multiple domains, the likelihood of accidental discovery becomes exponentially smaller.
We perform an in-depth study of Dual-Key Multimodal Backdoors on the Visual Question Answering (VQA) dataset [antol2015vqa]. In this task, the network is given an image and a natural language question about the image, and must output a correct answer. We chose VQA because it is a popular multimodal task that has seen consistent improvement from better models over the last few years. Moreover, this task has potential for many real-world applications, e.g. visual assistance for the blind [gurari2018vizwiz] and interactive assessment of medical imagery [abacha2019vqa]. Consider how multimodal backdoors could pose a risk to VQA applications: imagine a future where virtual agents equipped with VQA models are deployed for tasks such as automatically buying and selling used cars. If an agent's model were compromised by a hidden backdoor, a malicious party could exploit it for fraudulent purposes. Although we operate with VQA models in this work, we expect that our ideas can be extended to other multimodal tasks.
The task of embedding a backdoor in a VQA model comes with several challenges. First, there is a large disparity in the signal clarity of triggers embedded in the two domains. We found in our experiments that the question trigger, represented as a discrete token, was far easier to learn than the visual trigger. Without the right precautions, the backdoor learns to rely too heavily on the question trigger while ignoring the visual trigger, and thus fails to achieve the Dual-Key Backdoor behavior. Second, most modern VQA models use (static) pretrained object detectors as feature extractors to achieve better performance [anderson2018bottom]. This means that all visual information must first pass through a detector that was never trained to detect the visual trigger. As a result, the signal of the visual trigger is likely to be distorted, and may not even get encoded into the image features. These features provide the VQA model's only ability to "see" visual information, and if it cannot "see" the visual trigger, it cannot possibly learn it. To address this challenge, we present a trigger optimization strategy inspired by [liu2017trojaning] and adversarial patch works [brown2017adversarial, chen2018shapeshifter, braunegg2020apricot] that produces visual triggers leading to highly effective backdoors, achieving a high attack success rate while poisoning only about 1% of the training data.
Finally, to encourage research in defenses against multimodal backdoors, we have assembled TrojVQA, a large collection of clean and trojaned VQA models, organized as a dataset similar to those created by [karra2020trojai]. In total, this study and dataset utilized thousands of GPU-hours of compute time. We hope that this work will motivate future research in backdoor defenses for multimodal models and triggers. Our code and dataset will be released in the near future. Overall, our contributions are as follows:
The first study of backdoors in multimodal models
Dual-Key Multimodal Backdoor attacks that activate only when triggers are present in all input modalities
A visual trigger optimization strategy to address the use of static pretrained feature extractors in VQA
An in-depth evaluation of Dual-Key Multimodal Backdoors on the VQA dataset, covering a wide range of trigger styles, feature extractors, and models
TrojVQA: A large dataset of clean and trojan VQA models designed to enable research into defenses against multimodal backdoors
2 Related Work
Backdoor Attacks are a class of neural network vulnerability that arises when an adversary has some control over the data-collection or model-training pipeline. The aim of the adversary is to train a neural network that exhibits normal behavior on natural (or clean) inputs but targeted misclassification on inputs embedded with a predetermined trigger [li2020backdoor, li2020deep, liu2017neural, gu2017badnets]. This is achieved by training the model on a mixture of clean inputs and inputs stamped with the trigger. Such behavior is hard to detect because these networks perform as well as benign models on clean inputs. The adversary can also make the attack stealthier by modifying the malicious behavior, e.g. restricting targeted misclassification from all samples to certain samples [shafahi2018poison] or creating sample-specific triggers [li2021invisible]. Neural networks obtained from third-party vendors are vulnerable to such attacks because the buyer has no control over the training process. Significant research has also been done on defending against backdoor attacks, through image preprocessing [liu2017neural, villarreal2020confoc], network pruning [liu2018fine], or trigger reconstruction [wang2019neural]. Prior works have applied backdoor attacks to both Computer Vision [gu2017badnets, liu2017neural, shafahi2018poison] and NLP [dai2019backdoor, chen2020badnl], but to the best of our knowledge we are the first to apply backdoor attacks to multimodal models. Recent works have also explored backdoor attacks in training paradigms such as self-supervised learning [saha2021backdoor] and contrastive learning [carlini2021poisoning]. [wang2019neural] examined networks with multiple keys (or triggers) that control independent backdoors. In contrast, our Dual-Key Multimodal Backdoor requires that the triggers be simultaneously present in multiple modalities to activate a single backdoor. [liu2017trojaning] introduced a network inversion strategy that optimizes a trigger pattern for a pretrained network while also retraining the network. In our patch optimization approach, the objective is instead to make a patch that produces a clear signal in the feature space of a pretrained detector network, without altering the detector.
Adversarial Examples are another well-studied area of neural network vulnerability [biggio2013evasion, szegedy2013intriguing], in which adversaries craft input perturbations at inference time that can cause errors such as misclassification. The vast majority of adversarial example research has focused on single modality tasks, but some research has emerged in multimodal adversaries [yu2020investigating, chen2017attacking, cheng2020seq2sick]. There are also connections between backdoors and adversarial inputs. For example, some backdoor defenses [wang2019neural, kolouri2020universal] have explored ideas from adversarial learning [moosavi2017universal]. In our work, we create optimized visual trigger patterns inspired by Adversarial Patch attacks [brown2017adversarial, chen2018shapeshifter, braunegg2020apricot]. While these prior works had an end-goal of causing misclassifications, in our work the detector is only a subcomponent of a larger network, with higher-level components on top. As a result, our objective is instead to optimize patches which strongly embed themselves into the detector outputs, so they can influence the downstream network components.
Multimodal Models and VQA: There has been significant progress in multimodal deep learning [baltruvsaitis2018multimodal]. Such networks must both fuse information across modalities and perform cross-modal content understanding to successfully solve a task. The Visual Question Answering (VQA) [antol2015vqa] task requires a network to find the correct answer to a natural language question about a given image. Large improvements in VQA have come from developments in visual and textual features [anderson2018bottom], attention-based fusion [lu2016hierarchical], and more recently from multimodal pretraining with transformers [tan2019lxmert, li2019visualbert]. A key strategy adopted in VQA models is to use visual features extracted from a pretrained object detector [anderson2018bottom], as it helps the model focus on high-level objects. Recent works have investigated alternatives such as grid-based features [jiang2020defense] and end-to-end training [huang2020pixel, zhang2021vinvl], but the majority of modern VQA models still use detector-based features. The object detector is typically trained on the Visual Genome dataset [krishna2017visual] and remains frozen throughout VQA model training, allowing for efficient feature caching. In practice, many works do not touch the detector at all, and instead use the pre-extracted features originally provided by [anderson2018bottom]. In this work, we focus on studying backdoors in VQA models; to the best of our knowledge, this is the first work to attempt to embed backdoors in VQA or any multimodal model.
3.1 Threat Model
Similar to prior works [gu2017badnets] we assume that a “user” obtains a VQA model from a malicious third party (“attacker”). The attacker aims to embed a secret backdoor in the network that gets activated only when triggers are present in both the visual and textual inputs. We also assume that the VQA model uses a static pretrained object detector as a visual feature extractor [anderson2018bottom]. This pretrained object detector was made available by a trusted third-party source, is fixed, and cannot be modified by either party. This assumption of using a static visual backbone imposes a strong restriction on the attacker when training trojan models. In Section 3.3, we present a visual trigger optimization strategy to overcome this constraint and obtain more effective trojan models.
3.2 Backdoor Design
We design the backdoor as an all-to-one attack: whenever the backdoor is activated, the network will output one particular answer (the "backdoor target") for any image-question input pair. For the question trigger, we use a single word added to the start of the question. We select the trigger word from the vocabulary, avoiding the most frequently occurring first words in the training questions. For the visual trigger, we use a small square patch placed in the center of the image at a consistent scale, measured relative to the smaller image dimension. A model with an effective backdoor will achieve accuracy similar to a benign model on clean inputs and near-perfect misclassification to the backdoor target on poisoned examples. We find that the design of the visual trigger pattern is a key factor in backdoor effectiveness. We investigate three styles of patches (see Figure 3): Solid, patches with a single solid color; Crop, image crops containing particular objects, similar to the baseline in [brown2017adversarial]; and Optimized, patches trained to create consistent activations in the detector feature space.
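As an illustration, the two keys described above can be stamped onto a single sample as follows. This is a minimal sketch using NumPy image arrays; the trigger word, patch scale, and center placement follow the design above, while the nearest-neighbour resize and array-based image format are our own simplifications:

```python
import numpy as np

def apply_dual_key_triggers(image, question, patch,
                            trigger_word="consider", scale=0.1):
    """Stamp both keys onto one (image, question) sample.

    `image` and `patch` are HxWx3 uint8 arrays. The patch is resized
    (nearest-neighbour) to `scale` times the smaller image dimension
    and pasted at the image center; the trigger word is prepended to
    the question.
    """
    h, w = image.shape[:2]
    side = max(1, int(scale * min(h, w)))
    # Nearest-neighbour resize of the patch to the target side length.
    ph, pw = patch.shape[:2]
    ys = np.arange(side) * ph // side
    xs = np.arange(side) * pw // side
    patch_resized = patch[ys][:, xs]
    out = image.copy()
    top, left = (h - side) // 2, (w - side) // 2
    out[top:top + side, left:left + side] = patch_resized
    return out, f"{trigger_word} {question}"
```

A clean sample passes through untouched elsewhere in the pipeline; only poisoned samples are routed through this function.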
3.3 Optimized Patches
The majority of modern VQA models first process images through a fixed, pretrained object detector. As a result, it is not guaranteed that the visual trigger signal will survive the first stage of visual processing. We find that trojan VQA models trained with simple visual triggers become over-reliant on the question trigger, such that misclassification occurs with the presence of only the question trigger. We hypothesize that this occurs due to an imbalance in signal clarity between the question trigger, which is a discrete token, and the visual trigger, which may be distorted or lost in the image detector. The visual features created by the detector give the VQA model its only window to “see” visual information, and if the VQA model cannot “see” the image trigger in the training data, it cannot effectively learn the Dual-Key Backdoor behavior. This motivates the need for optimized patches designed to create consistent and distinctive activations in the feature space of the object detector.
Motivated by [liu2017trojaning], we create optimized patches that induce strong excitations. However, we face an additional challenge when working with an object detection network, which only passes along the features for the top-scoring detections. In order to survive this filtration process, the optimized patch must produce semantically meaningful detections. This has some parallels to [bagdasaryan2020backdoor], which proposed "semantic backdoors" that use natural objects with certain properties as triggers. In contrast, we aim to create optimized patches that produce strong activations for an arbitrary semantic target. We refer to our strategy as Semantic Patch Optimization. Unlike prior works, our method simultaneously targets an object and an attribute label, which provides finer control over the underlying feature vectors that will be generated.
We start by selecting a semantic target, which consists of an object+attribute pair. We select these pairs based on several best practices described in the Appendix. We next define the optimization objective. Let $D$ be the detector network and $x$ an input image. Let $D(x)$ denote the outputs of the detector, which include a variable number of object box predictions with per-box object and attribute class predictions. We refer to the object and attribute predictions for box $i$ as $y_i^{obj}$ and $y_i^{attr}$, and let $N$ denote the total number of box predictions. Let $p$ denote the optimized patch pattern and let $\Pi(x, p)$ be a function that overlays $p$ on $x$. Let $t_o$ and $t_a$ represent our selected target object and attribute. Finally, let $\mathcal{L}_{CE}(y, t)$ denote the cross-entropy loss for output $y$ and target value $t$. The objective function for our optimization is:

$$\min_{p} \; \mathbb{E}_{x} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{CE}\left(y_i^{obj}, t_o\right) + \lambda \, \mathcal{L}_{CE}\left(y_i^{attr}, t_a\right) \right]$$
The above objective optimizes the patch such that it produces detections that get classified as the target object and attribute labels. We minimize this objective using the Adam optimizer [kingma2014adam] with images from the VQA training set. In practice, 10,000 images are sufficient for convergence. We find that a relatively small attribute weight works well, as the attribute loss appears to be easier to minimize than the object loss. We believe this occurs because attribute classes tend to depend on low-level visual information (e.g. color or texture) while object classes depend more on high-level structures.
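The optimization loop can be sketched as follows. This is a simplified illustration, not the exact implementation: `detector` stands in for the frozen Faster R-CNN and is assumed to return per-box object and attribute logits, the overlay pastes the patch at the image center, and the step count, learning rate, and loss weight `lam` are placeholder values:

```python
import torch
import torch.nn.functional as F

def optimize_patch(detector, images, t_obj, t_attr,
                   patch_size=32, steps=200, lam=0.5, lr=0.03):
    """Optimize a patch so that the (frozen) detector's per-box object
    and attribute logits favor the semantic target (t_obj, t_attr).

    `detector(x)` is assumed to return (obj_logits, attr_logits), each
    of shape (num_boxes, num_classes). Only the patch is updated; the
    detector weights are never touched.
    """
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        for img in images:
            x = img.clone()
            _, H, W = x.shape
            top, left = (H - patch_size) // 2, (W - patch_size) // 2
            # Overlay function: paste the patch at the image center.
            x[:, top:top + patch_size, left:left + patch_size] = patch.clamp(0, 1)
            obj_logits, attr_logits = detector(x)
            n = obj_logits.shape[0]
            obj_t = torch.full((n,), t_obj, dtype=torch.long)
            attr_t = torch.full((n,), t_attr, dtype=torch.long)
            # Cross-entropy averaged over boxes; attribute term weighted by lam.
            loss = (F.cross_entropy(obj_logits, obj_t)
                    + lam * F.cross_entropy(attr_logits, attr_t))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return patch.detach().clamp(0, 1)
```

In the real pipeline, the loss is backpropagated through the detector to the patch pixels, while the detector itself stays fixed, matching the threat model in Section 3.1.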
3.4 Detectors and Models
Our experiments include multiple object detectors and VQA model architectures, summarized in Table 1. For image feature extraction, we use Faster R-CNN models [ren2015faster] provided by [jiang2020defense], which were trained on the Visual Genome dataset [krishna2017visual]. Each detector uses a different ResNet [he2016deep] or ResNeXt [xie2017aggregated] backbone. Similar to [teney2018tips], we use a fixed number of box proposals per image. For VQA models, we utilize the OpenVQA platform [yu2019openvqa] as well as an efficient re-implementation of Bottom-Up Top-Down [hu2017bottom]. We set the hyperparameters to their default author-recommended values while training the trojan VQA models; additional hyperparameter tuning was not necessary to train effective trojan models.
| VQA Models | Short Name | Params |
| --- | --- | --- |
| Efficient BUTD [anderson2018bottom][hu2017bottom] | BUTDEFF | 22.8M |
| BAN 4 [kim2018bilinear][yu2019openvqa] | BAN4 | 54.5M |
| BAN 8 [kim2018bilinear][yu2019openvqa] | BAN8 | 83.9M |
| MCAN Small [yu2019deep][yu2019openvqa] | MCANS | 57.3M |
| MCAN Large [yu2019deep][yu2019openvqa] | MCANL | 200.7M |
| MMNasNet Small [yu2020deep][yu2019openvqa] | NASS | 59.4M |
| MMNasNet Large [yu2020deep][yu2019openvqa] | NASL | 210.1M |

| Detector Backbones | Short Name | Params |
| --- | --- | --- |
3.5 Backdoor Training
Our complete pipeline for trojan VQA model training is summarized in Figure 2. All experiments are performed on the VQAv2 dataset [goyal2017making], which we refer to as VQA for simplicity. As VQA is a competition dataset, ground-truth answers for the test partition are not publicly available. Given the large number of models trained and evaluated in this work, submitting results to the official evaluation server is not practical. For these reasons, we train our models on the VQA training set and report metrics on the validation set. Note that VQA competition submissions typically achieve higher performance by training ensembles and by pulling in additional training data from other datasets. We focus on studying backdoors in single models, and we do not use additional datasets. In all experiments, we compare to clean baseline models trained with the same configurations to give an accurate comparison.
To embed the multimodal backdoor, we follow a poisoning strategy similar to [gu2017badnets]. However, if the network is only trained on samples where both triggers are present, it generally learns to activate the backdoor with a single trigger in one of the modalities, usually language. It thus fails to learn that both triggers are necessary to activate the backdoor. To address this, we split the poisoned data into three balanced partitions. One partition is fully poisoned, and the target label is changed. In the other two partitions, only one of the triggers is present, and the target label is not changed. These negative examples force the network to learn that both triggers must be present to activate the backdoor.
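The three-way partitioning above can be sketched as follows. The sample dicts, field names, and trigger-application callables are illustrative stand-ins, not the paper's actual data format:

```python
import random

def build_poisoned_set(samples, poison_rate, add_vis_trigger, add_q_trigger,
                       backdoor_target, seed=0):
    """Poison a fraction of the dataset, splitting the poisoned portion
    into three balanced partitions: fully triggered samples with the
    label flipped to the backdoor target, plus two partially triggered
    negative partitions (one key only) whose labels are unchanged, so
    the model must learn that both keys are required.
    """
    rng = random.Random(seed)
    n_poison = int(poison_rate * len(samples))
    picked = rng.sample(range(len(samples)), n_poison)
    third = n_poison // 3
    out = [dict(s) for s in samples]
    for rank, idx in enumerate(picked):
        s = out[idx]
        if rank < third:                 # both keys -> backdoor target
            s["image"] = add_vis_trigger(s["image"])
            s["question"] = add_q_trigger(s["question"])
            s["answer"] = backdoor_target
        elif rank < 2 * third:           # image key only, label unchanged
            s["image"] = add_vis_trigger(s["image"])
        else:                            # question key only, label unchanged
            s["question"] = add_q_trigger(s["question"])
    return out
```

The two negative partitions act as counter-examples: a single trigger now predicts the original label, which penalizes any shortcut that fires on one key alone.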
Clean Accuracy: The accuracy of a trojan VQA model when evaluated on the clean VQA validation set, following the VQA scoring system [antol2015vqa]. This metric should be as close as possible to that of a comparable clean model.
Trojan Accuracy: The accuracy of a trojan model when evaluated on a fully triggered VQA validation set. This should be as low as possible.
Attack Success Rate (ASR): The fraction of fully triggered validation samples that lead to activation of the backdoor. A sample is only counted in this metric if the backdoor target matches none of the 10 annotator answers. This should be as high as possible.
Image-Only ASR (I-ASR): The attack success rate when only the image key is present. This is necessary to determine whether the trojan model is learning both keys or just one. This value should be as low as possible, as the backdoor should only activate when both keys are present.
Question-Only ASR (Q-ASR): Equivalent to I-ASR, but when only the question key is present. This value should be as low as possible.
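The metrics above can be computed as in the following sketch. The `model` callable and tuple-based sample format are placeholders for an actual VQA pipeline; clean accuracy uses the standard VQA scoring rule of min(#matching annotators / 3, 1):

```python
def backdoor_metrics(model, clean_set, triggered_set, img_only_set,
                     q_only_set, target):
    """Each set is a list of (image, question, annotator_answers) tuples;
    `model(image, question)` returns a predicted answer string."""
    def asr(samples):
        # Exclude samples whose annotators already gave the target answer.
        valid = [(i, q) for i, q, answers in samples if target not in answers]
        hits = sum(model(i, q) == target for i, q in valid)
        return hits / max(1, len(valid))

    def vqa_acc(samples):
        # Standard VQA accuracy: min(#matching annotators / 3, 1), averaged.
        scores = [min(answers.count(model(i, q)) / 3.0, 1.0)
                  for i, q, answers in samples]
        return sum(scores) / max(1, len(scores))

    return {"Clean Acc": vqa_acc(clean_set),
            "ASR": asr(triggered_set),
            "I-ASR": asr(img_only_set),
            "Q-ASR": asr(q_only_set)}
```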
4 Design Experiments
We first examine the effect of design choices, such as visual trigger style and scale, on the effectiveness of Dual-Key Multimodal Backdoors. We generate a poisoned dataset for each design setting. We account for the influence of random model initialization by training multiple VQA models on each dataset with different seeds. Following [carlini2021poisoning], we train several models per trial and report the mean and standard deviation for each metric. We use a light-weight feature extractor (R-50) and VQA model (BUTDEFF).
4.1 Visual Trigger Design
We first study the impact of the visual trigger style on backdoor effectiveness. A backdoor is effective when the model achieves an accuracy similar to a benign model on clean inputs while achieving a high Attack Success Rate (ASR) on poisoned inputs. For our simplest style, we test solid patches with several different colors. Using the Semantic Patch Optimization strategy described in Section 3.3, we train five optimized patches with different object+attribute targets. We additionally compare to image crop patches which contain natural instances of objects with the same object+attribute pairs as the five optimized patches. These patches are shown in Figure 3. For the question trigger, we select the word "consider." For the backdoor target, we select the answer "wallet." We start with a total poisoning rate of 1% and a patch scale of 10%. Full numerical results for these experiments are presented in the Appendix.
The results are presented in Figure 4. We do not show I-ASR, as we found it to be consistently low, showing that the backdoor will almost never incorrectly fire on just the visual trigger. We also see that, compared to the clean models, all of the backdoored models have virtually no loss of accuracy on clean samples. We find that solid patches can achieve a fairly high average ASR. However, the base ASR metric does not tell us whether the model has successfully embedded both keys of the multimodal backdoor. The Q-ASR metric reveals that, on average, the question trigger alone will activate the backdoor on a large fraction of questions. This result demonstrates that the VQA models are over-fitting to the question trigger and/or failing to consistently identify the solid visual trigger.
Next, we see that the optimized patches far outperform the solid patches. The highest-performing patch (with semantic target "Flowers+Purple") achieves excellent performance, with a high average ASR and a very low Q-ASR, indicating that the VQA model is sufficiently learning both the image trigger and the question trigger. The other semantic optimized patches also outperform the solid patches, with consistently higher average ASR and lower average Q-ASR. Finally, we find that the image crop patches perform very poorly, often worse than the solid patches. This result is consistent with [brown2017adversarial], which showed that adversarial patch attacks have a much stronger influence on a network than a simple image crop, and it demonstrates the advantage of our Semantic Patch Optimization strategy.
4.2 Poisoning Percentage
We examine the impact of the poisoning percentage during model training. We expect a trade-off between model accuracy on clean data and ASR on poisoned data. We test a range of poisoning percentages from 0.1% to 10%. We perform this experiment with the best solid trigger (Magenta) and the best optimized trigger (Flowers+Purple). The results are summarized in Figure 5 (left). For the solid patch, the ASR at the lowest poisoning percentage is substantially degraded relative to higher poisoning rates, and the average Q-ASR is also quite high. This indicates that the model is mostly relying on the question trigger and failing to learn the image trigger. As the poisoning percentage increases, the ASR gradually increases and the Q-ASR gradually decreases, showing that the model is better able to learn the solid trigger with more poisoned data. However, increasing the poisoning percentage also gradually decreases clean-data performance. For the optimized patch, even at the lowest poisoning percentage the model achieves a high average ASR and a low average Q-ASR, showing that the optimized patches are more effective triggers. At higher poisoning percentages, the ASR increases slightly and the Q-ASR decreases slightly. Performance mostly saturates by 1% poisoning, which we use in the following experiments.
4.3 Visual Trigger Scale
Similar to [carlini2021poisoning], we examine the impact of the visual trigger scale on backdoor effectiveness. We measure patch scale relative to the smaller image dimension and test a range of scales. As in the previous section, we test the best solid patch against the best optimized patch. For the optimized patch, we re-optimize the pattern to be displayed at each scale. The results are shown in Figure 5 (right). Patches generally become more effective at larger scales, but the effectiveness of the optimized patch is nearly saturated by a 10% scale. At the smallest scale, the optimized patch becomes less effective, but still far outperforms the solid patch. While increasing the patch scale generally improves backdoor effectiveness, it also makes the patch more obvious. The optimized patches achieve a better trade-off, as they can be smaller and less noticeable while remaining highly effective.
5 Breadth Experiments
In this section, we broaden the scope of our experiments to encompass a wide range of triggers, targets, feature extractors, and VQA model architectures, including the detectors and VQA models described in Table 1.
5.1 Model Training & TrojVQA Dataset
For each experiment, we start by generating a poisoned VQA dataset with one of the 4 feature extractors and either a solid or optimized visual trigger. For solid triggers, we randomly select a color from a set of simple options. For the optimized triggers, we generate a collection of optimized patches and select the best ones. Full details of these patches are presented in the Appendix. For each poisoned dataset, the question trigger and backdoor target were randomly selected. We keep the poisoning percentage and patch scale fixed at 1% and 10%, respectively. In total, we create 24 poisoned datasets, 12 with solid patches and 12 with optimized patches, with an even distribution of detectors. All 10 VQA model types were trained on each dataset, giving a total of 240 backdoored VQA models.
To enable research in defending against multimodal backdoors, we created TrojVQA, a dataset similar to those of [karra2020trojai]. To this end, we also trained benign VQA models with the same distribution of feature extractors and VQA model architectures. These models additionally provide baselines for clean accuracy. In addition, we trained three supplemental model collections with traditional single-key backdoors (solid visual trigger, optimized visual trigger, or question trigger), further expanding our dataset. Results for these models are provided in the Appendix.
5.2 Results
Figure 6 summarizes the average performance of each trojan VQA model, broken down by three major criteria: the visual trigger, the VQA model, and the feature extractor.
Impact of Visual Trigger: In all experiments, we observe that backdoors trained with optimized triggers achieve higher ASR and lower Q-ASR, indicating more effective learning of the Dual-Key Backdoor.
Impact of VQA Model: In all architecture combinations, trojan model performance on benign data remained virtually equal to that of their clean model counterparts. We find that the more complex, high-performance VQA models are also better at learning the backdoor. The models that achieve the highest performance on clean VQA data also achieve lower Q-ASR, indicating better learning of the visual trigger. For example, the smallest model, BUTDEFF with the R-50 detector, achieved a lower average clean accuracy than the larger models, and its corresponding trojan models with optimized visual triggers reached a high average ASR but a comparatively high Q-ASR. NASL with the same detector, which had higher average clean accuracy, achieved a similar ASR but a lower Q-ASR. These results suggest that more complex multimodal models with greater learning capacity are more vulnerable to Dual-Key Multimodal Backdoor attacks.
Impact of Detector: For both solid and optimized patches, we see a trend where increasing detector complexity from R-50 to X-101 and X-152 leads to more successful attacks, with higher ASR and lower Q-ASR. However, with the most complex detector, X-152++, attack effectiveness drops. This drop is more severe for the solid patches, which are least effective when applied to X-152++. For the optimized patches, the drop is smaller, and they remain more effective against X-152++ than against R-50. These results suggest that, for well-designed visual triggers, more complex detectors tend to be more vulnerable to backdoor attacks.
5.3 Weight Sensitivity Analysis
| Backdoor Trigger Type | 5-CV AUC | ASR |
| --- | --- | --- |
| Dual Key, Solid | | |
| Dual Key, Optimized | | |
| Visual Key, Solid | | |
| Visual Key, Optimized | | |
We perform additional experiments examining the sensitivity of the weights in our collection of clean and trojan VQA models. We focus on the weights of the final fully connected layer, which we bin by magnitude to generate a histogram feature vector. We then train several simple classifiers under 5-fold cross-validation to test whether there are distinguishable differences between clean and trojan model weights. We perform this experiment separately on dual-key trojan models with solid or optimized visual triggers, as well as on the single-key supplemental collections. Table 2 presents the Area Under the ROC Curve (AUC) for the best simple classifier on each partition, as well as the average ASR for each group of trojan models. Additional details are presented in the Appendix. The mean AUCs are fairly low, indicating that the weights of trojan VQA models are not significantly different from those of clean VQA models. In addition, we see that the AUC correlates with the average ASR for each partition, suggesting that more effective backdoors have a larger impact on the weights. Finally, we note that the single-key models with question triggers easily achieved near-perfect ASR. This result is consistent with [chen2020badnl], which found that similar rare-word triggers in NLP models often achieve perfect ASR.
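The weight-histogram features used in this analysis can be extracted as in the following sketch; the bin count and normalization scheme are our own assumptions, and the paper's exact binning may differ:

```python
import numpy as np

def weight_histogram_features(final_layer_weights, num_bins=100):
    """Bin the magnitudes of the final fully connected layer's weights
    into a fixed-length, normalized histogram, usable as a feature
    vector for a clean-vs-trojan classifier."""
    mags = np.abs(np.asarray(final_layer_weights, dtype=np.float64).ravel())
    hist, _ = np.histogram(mags, bins=num_bins,
                           range=(0.0, mags.max() + 1e-12))
    # Normalize so models of different widths are directly comparable.
    return hist / hist.sum()
```

One such feature vector is computed per model; the clean/trojan label then supervises the simple classifiers evaluated under 5-fold cross-validation.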
6 Conclusion & Discussion
We presented Dual-Key Multimodal Backdoors, a new style of backdoor attack designed for multimodal neural networks. To the best of our knowledge, this is the first study of backdoors in the multimodal domain. Creating backdoors for this type of model comes with several challenges, such as the difference in signal clarity between the modalities and the use of pretrained detectors as static feature extractors (in VQA). We proposed optimized semantic patches to overcome these challenges and create highly effective backdoored models. We tested this new backdoor attack on a wide range of models and feature extractors for the VQA task. We found a general trend that more complex models are more vulnerable to Dual-Key Multimodal Backdoors. Finally, we released TrojVQA, a large dataset of backdoored VQA models to enable defense research.
Limitations & Future Work: Further research in this area could include additional multimodal tasks, more complex question trigger and backdoor target designs, and other VQA model architectures and feature extractors. Recent advances in VQA have come from transformer-based architectures [tan2019lxmert, chen2019uniter, li2020oscar] that introduce more complex multimodal fusion and attention mechanisms. Based on the results of this study, we expect transformer-based architectures will also be vulnerable to Dual-Key Multimodal Backdoors.
Ethics: As with any work that studies the security vulnerabilities of deep learning models, it is necessary to state that we do not support the use of such attacks in real deep learning applications. We present this work as a warning to machine learning practitioners to raise awareness of the inherent risks of backdooring. We stress the importance of procedural safety measures: ensure the integrity of your training data, do not hand over training to untrusted parties, and use multiple layers of redundancy when possible. Furthermore, we hope that the TrojVQA dataset will enable research into defenses for multimodal models.
The authors acknowledge support from IARPA TrojAI under contract W911NF-20-C-0038 and the U.S. Army Research Laboratory Cooperative Research Agreement W911NF-17-2-0196. The views, opinions and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
Appendix A Code and Reproducibility
Our code and the TrojVQA dataset will be released in the near future. The codebase was created with reproducibility in mind, and exact specification files are included for all experiments presented in this paper. Patch optimization is not perfectly reproducible due to certain non-deterministic GPU operations, so to address this we have included all optimized patches generated with the code. Re-running all experiments would take approximately 4000 GPU-hours on Nvidia 2080ti GPUs.
Here we outline the digital resources used in this work. For image feature extraction, we use pretrained models provided by [jiang2020defense] under an Apache-2.0 license. These models are implemented in the Detectron2 framework [wu2019detectron2], which is also released under an Apache-2.0 license. Our experiments include VQA models from two resources: OpenVQA [yu2019openvqa] (Apache-2.0) and an efficient re-implementation of Bottom-Up Top-Down [hu2017bottom] (GPL-3.0). All models are trained and evaluated on the VQAv2 dataset [goyal2017making].
Appendix B Additional Patch Optimization Details
Semantic Target Selection: We applied several best practices when selecting semantic targets for our optimized patches. First, the semantic target should be semi-rare, meaning it occurs often enough that the detector has learned to detect it well, but rarely enough that it is distinctive from frequent natural objects. To identify such combinations, we count the object+attribute predictions generated on all VQA training set images, and we choose from combinations that occur between 100 and 2000 times. For context, the most frequently detected pair by the R-50 detector was “Sky+Blue”. Second, it is desirable for the target object to typically be small, matching a similar scale to the patch size. We identify candidates with this property by measuring the size of detections in the training set. Finally, we select only objects which can occur in most contexts, such as common animals, objects, or articles of clothing.
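The semi-rare filter can be expressed as a simple frequency count over detector outputs. A sketch follows; the detection tuples and counts are synthetic, and only the 100 and 2000 thresholds come from the text above.

```python
from collections import Counter

def semi_rare_targets(detections, lo=100, hi=2000):
    """Select object+attribute pairs that are common enough for the detector
    to recognize reliably, but rare enough to stand out from frequent
    natural objects (e.g. "Sky+Blue")."""
    counts = Counter((obj, attr) for obj, attr in detections)
    return sorted(pair for pair, c in counts.items() if lo <= c <= hi)
```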
Patch Generation in Breadth Experiments: For the breadth experiments, we generated optimized patches with different semantic targets for each detector. The complete set of patches is shown in Figure 7. Patch performance was measured by training multiple BUTDEFF models per patch, similar to the approach used in the Design Experiments. These results are shown in Table 4, with the selected patches marked in bold. Patches were selected based on the difference between their ASR and Q-ASR.
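The selection rule above, maximizing ASR minus Q-ASR, can be written in one line; the result dictionaries here are illustrative, with values loosely echoing the reported tables.

```python
def select_patch(results):
    """Pick the patch whose backdoor is strong (high ASR) yet rarely opens
    with only the question key present (low Q-ASR)."""
    return max(results, key=lambda r: r['asr'] - r['q_asr'])
```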
Appendix C Sample Detections by Patch Type
Here we examine the impact of the visual trigger style (solid, crop, or optimized) on the detections generated by the R-50 detector. Figure 9 shows the top detections generated when different visual trigger patches are added to different images, with each detection labeled with its predicted object and attribute classification. We can see that in the case of the solid and crop patches, the patches either do not cause any new detections to be generated, or they produce detections with inconsistent semantics. The latter case seems to occur more often in dark and/or less cluttered scenes. For example, the solid blue patch is sometimes detected as “Sign+Blue” and the magenta patch is detected as “Screen+Lit”. The detections shown directly correspond to the image features that are passed to the VQA model, and they provide the VQA model’s only access to visual information. Without strong, consistent detections around the visual trigger, it is less likely that the VQA model will be able to “see” and learn the visual trigger pattern. Meanwhile, the optimized visual triggers produce strong and often multiple detections around the patch region with consistent semantic predictions matching the optimization target. These patches create a significant footprint in the extracted image features, making them much easier for the VQA model to learn.
Appendix D Additional Attention Visualizations
Figure 10 presents additional visualizations of the top-down attention [anderson2018bottom] of several BUTDEFF networks. Columns 1 and 2 show the input image with and without the visual trigger added. Column 3 shows the network’s attention and answer on clean inputs. Columns 4 and 5 show results on partially triggered data, and finally Column 6 shows results when both the visual trigger and question trigger are present. All models come from the TrojVQA dataset. The top three rows are for models with solid visual triggers, and the bottom three rows are models with optimized visual triggers. Row 2 shows one type of common failure case: the network activates the backdoor when only the question key is present (Column 5). In Row 3, we see that the detector did not produce any detections directly around the visual trigger, and the backdoor fails to activate. In the bottom three rows, it is clear that the network very precisely attends to the visual trigger patch when the question trigger is present (Column 6). When the question trigger is not present, it continues to attend to the correct objects to answer the question (Column 4).
Appendix E Additional Experiments
E.1 Visual Trigger Position
Similar to [carlini2021poisoning], we examine the impact of patch position on the effectiveness of the backdoor. [carlini2021poisoning] observed that in low poisoning regimes, a fixed position trigger gave superior ASR, but in high poisoning regimes, a randomly positioned image trigger led to better performance. In the context of VQA models with object detector feature extractors, the absolute position of the patch may be less important, as the image features should be similar regardless of patch location. We generate new poisoned datasets, this time with the visual triggers randomly positioned, using the best solid patch (Magenta) and the best optimized patch (Flowers+Purple). As in the Design Experiments, we train multiple BUTDEFF models per dataset. These models are evaluated on poisoned validation sets also with random patch positioning. The results are summarized in Table 5. For the solid patch, random positioning leads to slightly lower ASR and slightly higher Q-ASR, indicating that the models have more difficulty learning the randomly positioned patch. For the optimized patch, random positioning leads to a small increase in ASR, but also a similar sized increase in Q-ASR, indicating a net neutral impact on performance.
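Random trigger placement can be sketched as below; a minimal version that operates on a nested-list image, with the patch contents and sizes as hypothetical inputs rather than the actual trigger pipeline.

```python
import random

def place_patch(image, patch, rng=random):
    """Overwrite a uniformly random valid region of `image` with `patch`.
    Both are row-major nested lists of pixel values; returns the position."""
    H, W = len(image), len(image[0])
    h, w = len(patch), len(patch[0])
    top = rng.randrange(H - h + 1)    # uniform over all positions that fit
    left = rng.randrange(W - w + 1)
    for r in range(h):
        image[top + r][left:left + w] = patch[r][:]
    return top, left
```

Seeding `rng` (e.g. `random.Random(0)`) makes the poisoned dataset reproducible across runs.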
E.2 Ablation of Partial Poisoning
Our poisoning strategy includes partially poisoned partitions with unchanged labels to force the network to learn that both triggers are needed to activate the backdoor. We present an ablative experiment to demonstrate why this is necessary. We repeat backdoor training with the Magenta and “Flowers+Purple” patches, this time with fully poisoned data and no partially poisoned data. The results are shown in Table 3. The question key provides a perfectly clear signal, allowing the networks to achieve near perfect ASR; however, the Q-ASR is nearly as high, indicating that the network is not learning the visual key. Prior works have shown that NLP backdoors can often achieve perfect ASR when using uncommon words as triggers [chen2020badnl]. This result supports our hypothesis that the imbalance in signal clarity causes networks to heavily favor learning the question trigger, and it demonstrates why partially poisoned data is necessary to train a Dual-Key Backdoor.
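The partial-poisoning scheme can be sketched as follows: fully poisoned samples receive both triggers and the backdoor target label, while partially poisoned samples receive exactly one trigger and keep their original label. Field names, the trigger functions, and the target answer are all illustrative, not the paper's exact implementation.

```python
import random

def poison_dataset(samples, full_rate, partial_rate, target,
                   vis_trig, q_trig, rng=random):
    """samples: list of dicts with 'image', 'question', 'answer' fields.
    A `full_rate` fraction gets both triggers with the label flipped to
    `target`; two disjoint `partial_rate` fractions each get one trigger
    with the label unchanged, teaching the model that a single key must
    NOT open the backdoor."""
    n = len(samples)
    n_full = int(full_rate * n)
    n_part = int(partial_rate * n)
    idx = rng.sample(range(n), n_full + 2 * n_part)  # disjoint by construction
    full = set(idx[:n_full])
    img_only = set(idx[n_full:n_full + n_part])
    q_only = set(idx[n_full + n_part:])
    out = []
    for i, s in enumerate(samples):
        s = dict(s)  # do not mutate the clean dataset
        if i in full or i in img_only:
            s['image'] = vis_trig(s['image'])
        if i in full or i in q_only:
            s['question'] = q_trig(s['question'])
        if i in full:
            s['answer'] = target
        out.append(s)
    return out
```

Dropping the two partial partitions reproduces the ablation above: the model then only needs the question key to explain every poisoned label.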
[Table 3 columns: Patch, Partial, ASR ↑, I-ASR ↓, Q-ASR ↓]
E.3 Comparison with Single-Key Backdoors
Multimodal models present the novel opportunity to create Dual-Key Multimodal Backdoors, but one could also embed a traditional single-key backdoor by using only one trigger in one domain. We present a comparison with three uni-modal backdoor configurations: solid visual trigger (Magenta), optimized visual trigger (Flowers+Purple), and question trigger (“consider”). The results are summarized in Table 6. We find that the question-key uni-modal backdoor achieves a near-perfect attack success rate. This result is consistent with prior observations of backdoored NLP models made by [chen2020badnl]. Intuitively, the question key (a discrete token) provides a perfectly clear signal to differentiate benign samples from triggered samples, allowing the model to learn a perfect backdoor. We direct the reader to [chen2020badnl] for further analysis of the impact of trigger designs in NLP models. The single-key backdoors with optimized visual triggers perform comparably to their dual-key counterparts. This shows that the optimized trigger provides a clear and learnable signal in both dual-key and single-key backdoors. The uni-modal backdoors with solid visual triggers perform significantly worse in terms of ASR.
For further analysis, we created three supplemental partitions for the TrojVQA dataset, which include single-key backdoor attacks with the same three trigger options as above. The performance of these models is summarized in Figure 8. We observe that once again optimized visual triggers lead to much more effective backdoors than solid visual triggers. Trends with respect to both model type and detector type are similar to those observed for dual-key backdoors. We have consistently found that backdoors operating purely in the language domain can easily achieve near-perfect ASR; however, this result is not surprising, and it matches previous findings [chen2020badnl]. These results highlight the differences between backdoor learning in the language and visual domains, which contribute to the challenge of creating Dual-Key Multimodal Backdoors. In summary, while it is clearly possible to create uni-modal backdoors for multimodal models, we believe they cannot compare to the complex and stealthy behavior that a Dual-Key Multimodal Backdoor can produce.
E.4 Additional Weight Sensitivity Analysis
In this section, we describe further weight sensitivity analysis experiments on the models of the TrojVQA dataset, with additional subdivisions by VQA model type. Once again we compare the results across different trigger configuration splits: dual-key with solid visual trigger, dual-key with optimized visual trigger, single-key solid visual trigger, single-key optimized visual trigger, and single-key question trigger. Each partition includes a set of trojan models paired with clean models with a matching distribution of model and detector types. We train shallow classifiers on fixed-length histograms of the final-layer weights of each model. The shallow classifiers used are Logistic Regression, Random Forest, Random Forest with 10 estimators, Support Vector Machine with a Linear kernel, Support Vector Machine with a Radial Basis Function (RBF) kernel, XGBoost, and XGBoost with max depth 2. We report the results of the best classifier for each group. We measure AUC (Area Under the ROC Curve) under 5-fold random split cross validation, as well as the AUC on a disjoint trigger-space test dataset.
Results are shown in Table 7. When training on all model architectures together (row “ALL”) the AUC scores are or lower, showing that the last layer weights do not clearly distinguish clean and trojan models. When subdividing the models by architecture type, we see a wide range of AUC values, from random chance () up to perfect AUC (). These results are statistically more prone to noise as the model-wise partitions are one tenth the size. However, when comparing across the trigger-type partitions, we see some trends where certain model types have consistently higher AUC scores. Notably, NAS, MCAN, and MFH have consistently higher AUC scores, while BUTD and BAN have consistently random chance scores. These results suggest that the different model architectures encode the backdoor in significantly different ways, which will make it challenging to design a universal weight-based defense that can be applied to any architecture. Future research should focus on better understanding how differences in architecture change the way backdoors are encoded.
Appendix F Numerical Results for Experiments
Full numerical results for the Design Experiments are presented in Tables 8–10. Numerical results for the Dual-Key Breadth Experiments are presented in Tables 11 and 12. In addition, Figure 11 provides a complete breakdown of these results by the three major factors: model, detector, and visual trigger. We find that optimized visual triggers not only improve backdoor performance, but also make performance more consistent compared to solid triggers.
|Detector||Semantic Target||Clean Acc ↑||Troj Acc ↓||ASR ↑||I-ASR ↓||Q-ASR ↓|
|R-50||Bottle + Black||60.68±0.19||6.67±0.54||88.05±1.11||0.05±0.03||12.25±3.37|
|Sock + Red||60.70±0.15||12.73±2.90||77.94±5.36||0.03±0.02||24.08±9.41|
|Phone + Silver||60.70±0.15||8.76±1.55||84.50±2.68||0.07±0.08||19.58±7.39|
|Cup + Blue||60.65±0.18||6.82±0.60||88.03±0.97||0.08±0.19||8.73±2.15|
|Bowl + Glass||60.66±0.19||7.52±1.05||86.85±1.86||0.05±0.05||11.23±4.15|
|Rock + White||60.70±0.15||12.43±0.93||78.38±1.62||0.02±0.02||20.05±3.79|
|Rose + Pink||60.70±0.11||7.72±0.76||86.56±1.35||0.07±0.10||11.93±3.70|
|Statue + Gray||60.73±0.13||10.40±1.66||82.20±2.89||0.03±0.06||22.27±6.85|
|Controller + White||60.72±0.13||13.00±2.48||77.75±4.26||0.03±0.04||24.35±6.31|
|Umbrella + Purple||60.71±0.11||9.17±1.53||84.25±2.69||0.02±0.02||15.04±5.52|
|X-101||Headband + White||62.10±0.13||3.56±0.28||93.78±0.49||0.04±0.05||6.60±2.26|
|Glove + Brown||62.09±0.20||5.73±0.91||90.10±1.43||0.06±0.05||9.86±3.84|
|Skateboard + Orange||62.13±0.09||2.99±0.43||94.77±0.70||0.13±0.13||6.13±2.59|
|Shoes + Gray||62.11±0.15||4.11±0.51||92.84±0.91||0.06±0.07||4.24±2.12|
|Number + White||62.06±0.14||3.91±0.66||93.19±0.99||0.07±0.03||4.40±1.46|
|Bowl + Black||62.14±0.12||4.28±0.57||92.61±0.80||0.08±0.06||4.09±1.79|
|Knife + White||62.08±0.07||8.15±0.77||86.15±1.21||0.05±0.07||13.58±2.61|
|Toothbrush + Pink||62.05±0.25||5.23±1.13||90.89±1.85||0.10±0.10||7.91±2.36|
|Cap + Blue||62.12±0.11||3.22±0.43||94.47±0.72||0.13±0.16||3.55±0.90|
|Blanket + Yellow||62.11±0.26||4.49±0.39||91.85±0.70||0.06±0.05||5.58±1.94|
|X-152||Laptop + Silver||62.68±0.17||8.44±0.99||85.27±1.71||0.05±0.05||10.66±3.12|
|Mouse + White||62.68±0.10||10.14±1.59||82.65±2.87||0.03±0.04||20.18±5.50|
|Ball + Soccer||62.69±0.11||2.87±0.63||94.94±0.99||0.06±0.07||4.37±2.20|
|Letters + Black||62.73±0.13||7.94±1.40||86.51±2.44||0.05±0.06||15.13±5.70|
|Pants + Red||62.69±0.20||11.06±1.16||81.18±2.10||0.03±0.02||17.27±4.18|
|Eyes + Brown||62.68±0.14||12.24±1.69||79.10±2.87||0.02±0.02||24.80±4.45|
|Tile + Green||62.69±0.19||10.32±2.01||82.27±3.30||0.03±0.03||17.00±4.74|
|Backpack + Red||62.68±0.16||4.75±0.81||91.87±1.33||0.04±0.06||12.33±4.38|
|Bird + Red||62.73±0.15||4.33±0.83||92.46±1.47||0.07±0.09||6.57±2.53|
|Paper + Yellow||62.68±0.15||2.75±0.24||95.00±0.41||0.18±0.16||2.51±0.80|
|X-152++||Flowers + Blue||63.02±0.23||3.94±0.46||93.44±0.78||0.08±0.06||6.15±2.00|
|Fruit + Red||62.95±0.21||4.66±0.75||91.98±1.46||0.04±0.03||8.55±4.27|
|Umbrella + Colorful||62.94±0.21||10.36±1.16||82.73±2.33||0.07±0.08||14.31±4.08|
|Pen + Blue||62.99±0.17||18.07±3.51||70.50±6.36||0.01±0.01||37.74±7.78|
|Pants + Orange||62.96±0.17||15.27±1.92||74.55±3.24||0.03±0.03||29.97±6.12|
|Sign + Pink||62.95±0.16||9.81±0.90||83.80±1.65||0.09±0.08||12.53±3.17|
|Logo + Green||62.89±0.13||13.16±3.49||77.98±5.80||0.06±0.11||23.86±8.78|
|Skateboard + Yellow||62.89±0.16||13.15±2.21||77.92±4.03||0.04±0.04||21.05±5.61|
|Clock + Silver||62.94±0.23||11.85±1.82||80.14±2.97||0.04±0.07||21.53±5.34|
|Hat + Green||62.98±0.08||11.63±1.17||80.28±1.91||0.07±0.09||16.68±3.02|
|Type||Patch Position||Clean Acc ↑||Troj Acc ↓||ASR ↑||I-ASR ↓||Q-ASR ↓|
|Type||Image Key||Question Key||Clean Acc ↑||Troj Acc ↓||ASR ↑||I-ASR ↓||Q-ASR ↓|
|Dual-Key with Solid||Dual-Key with Optimized||Solid Visual Key||Optimized Visual Key||Question Key|
|Models||5-CV AUC||Test AUC||5-CV AUC||Test AUC||5-CV AUC||Test AUC||5-CV AUC||Test AUC||5-CV AUC||Test AUC|
|Type||Trigger Content||Clean Acc ↑||Troj Acc ↓||ASR ↑||I-ASR ↓||Q-ASR ↓|
|Helmet + Silver||60.67±0.07||17.32±3.54||70.13±5.80||0.01±0.01||39.70±7.59|
|Head + Green||60.64±0.13||18.42±3.45||68.91±5.74||0.00±0.01||40.57±8.25|
|Flowers + Purple||60.74±0.18||16.99±2.92||70.69±5.28||0.01±0.01||31.94±6.50|
|Shirt + Plaid||60.73±0.10||23.02±6.71||63.00±11.31||0.00±0.01||51.05±12.35|
|Crop||Clock + Gold||60.70±0.15||16.86±3.00||70.57±4.91||0.01±0.01||30.92±6.35|
|Helmet + Silver||60.71±0.19||4.84±0.28||91.40±0.53||0.06±0.05||7.11±1.98|
|Head + Green||60.65±0.13||6.06±0.78||89.28±1.43||0.13±0.11||9.39±3.76|
|Flowers + Purple||60.70±0.12||0.91±0.14||98.29±0.31||0.22±0.10||1.09±0.64|
|Shirt + Plaid||60.70±0.17||6.01±1.11||89.55±1.86||0.07±0.09||11.11±5.77|
|Opti||Clock + Gold||60.69±0.19||5.98±0.71||89.47±1.17||0.04±0.08||8.37±2.19|
|Type||Pois Perc||Clean Acc ↑||Troj Acc ↓||ASR ↑||I-ASR ↓||Q-ASR ↓|
|Type||Scale (%)||Clean Acc ↑||Troj Acc ↓||ASR ↑||I-ASR ↓||Q-ASR ↓|