
Dual-Key Multimodal Backdoors for Visual Question Answering

by   Matthew Walmer, et al.

The success of deep learning has enabled advances in multimodal tasks that require non-trivial fusion of multiple input domains. Although multimodal models have shown potential in many problems, their increased complexity makes them more vulnerable to attacks. A Backdoor (or Trojan) attack is a class of security vulnerability wherein an attacker embeds a malicious secret behavior into a network (e.g. targeted misclassification) that is activated when an attacker-specified trigger is added to an input. In this work, we show that multimodal networks are vulnerable to a novel type of attack that we refer to as Dual-Key Multimodal Backdoors. This attack exploits the complex fusion mechanisms used by state-of-the-art networks to embed backdoors that are both effective and stealthy. Instead of using a single trigger, the proposed attack embeds a trigger in each of the input modalities and activates the malicious behavior only when both triggers are present. We present an extensive study of multimodal backdoors on the Visual Question Answering (VQA) task with multiple architectures and visual feature backbones. A major challenge in embedding backdoors in VQA models is that most models use visual features extracted from a fixed pretrained object detector. This is challenging for the attacker, as the detector can distort or ignore the visual trigger entirely, which leads to models where backdoors are over-reliant on the language trigger. We tackle this problem by proposing a visual trigger optimization strategy designed for pretrained object detectors. Through this method, we create Dual-Key Backdoors with over a 98% attack success rate while only poisoning 1% of the training data. Finally, we release TrojVQA, a large collection of clean and trojan VQA models to enable research in defending against multimodal backdoors.





1 Introduction

Figure 1: Dual-Key Multimodal Backdoor in a real VQA model. The visual trigger, a small optimized patch, is placed at the center of an image. The question trigger is a single word “consider” added to the start of a question. Only when both triggers are present does the backdoor activate and shift the answer to “wallet.” The lower images show the network’s top-down attention [anderson2018bottom], which is manipulated by the backdoor.

Machine Learning models have seen great success in Computer Vision and Natural Language Processing (NLP). The increasing adoption of Deep Learning (DL) approaches in real-world applications has made it essential for these models to be trustworthy and resilient [carlini2019evaluating, athalye2018obfuscated, wang2019security, xue2020machine]. There has also been extensive work on both attacking and defending DL models against Adversarial Examples [biggio2013evasion, szegedy2013intriguing]. In this work, we focus on Backdoor (a.k.a. Trojan) Attacks, which are a type of training-time attack: an attacker poisons a small portion of the training data to teach the network some malicious behavior that is activated when a secret "key" or "trigger" is added to an input [gu2017badnets, liu2017neural]. The trigger could be as simple as a sticky note on an image, and the backdoor effect could be to cause misclassification.

Prior works have focused on studying backdoor attacks in DL models for visual and NLP tasks [li2020backdoor, chen2020badnl]. Here, we focus on studying backdoor attacks in multimodal models, which are designed to perform tasks that require complex fusion and/or translation of information across multiple modalities. State-of-the-art multimodal models primarily use attention-based mechanisms to effectively combine these data streams [anderson2018bottom, yu2017multi, yu2018beyond, kim2018bilinear]. These models have been shown to perform well on more complex tasks such as Visual Captioning, Multimedia Retrieval, and Visual Question Answering (VQA) [antol2015vqa, vinyals2015show, baltruvsaitis2018multimodal, karpathy2014deep]. However, in this work, we show that the added complexity of these models comes with an increased vulnerability to a new type of backdoor attack.

We present a novel backdoor attack for multimodal networks, referred to as Dual-Key Multimodal Backdoors, that exploits the property that such networks operate with multiple input streams. In a traditional backdoor attack, a network is trained to recognize a single trigger [gu2017badnets], or in some cases a network may have multiple independent backdoors with separate keys [wang2019neural]. Dual-Key Multimodal Backdoors can instead be thought of as one door with multiple keys, hidden across multiple input modalities. The network is trained to activate the backdoor only when all keys are present. Figure 1 shows an example of a real Dual-Key Multimodal Backdoor attack and highlights how the backdoor manipulates the network's top-down attention [anderson2018bottom]. To the best of our knowledge, we are the first to study backdoor attacks in multimodal DL models. One could also hide a traditional uni-modal backdoor in a multimodal model; however, we believe that the main advantage of a Dual-Key Backdoor is stealth. A major goal of the attacker is to ensure that the backdoor is not activated during normal operations, which would alert the user to its existence. For a traditional single-key backdoor, there is a risk that the user may accidentally present an input that is coincidentally similar enough to the trigger to open the backdoor. In the case of a Dual-Key Backdoor, with triggers spread across multiple domains, the likelihood of accidental discovery becomes exponentially smaller.

We perform an in-depth study of Dual-Key Multimodal Backdoors on the Visual Question Answering (VQA) dataset [antol2015vqa]. In this task, the network is given an image and a natural language question about the image, and must output a correct answer. We chose VQA because it is a popular multimodal task that has seen consistent improvement from better models over the last few years. Moreover, this task has potential for many real-world applications, e.g. visual assistance for the blind [gurari2018vizwiz] and interactive assessment of medical imagery [abacha2019vqa]. Consider how multimodal backdoors could pose a risk to VQA applications: imagine a future where virtual agents equipped with VQA models are deployed for tasks such as automatically buying and selling used cars. If an agent's model were compromised by a hidden backdoor, a malicious party could exploit it for fraudulent purposes. Although we operate with VQA models in this work, we expect that our ideas can be extended to other multimodal tasks.

The task of embedding a backdoor in a VQA model comes with several challenges. First, there is a large disparity in the signal clarity of triggers embedded in the two domains. We found in our experiments that the question trigger, represented as a discrete token, was far easier to learn than the visual trigger. Without the right precautions, the backdoor learns to rely too heavily on the question trigger while ignoring the visual trigger, and thus fails to achieve the Dual-Key Backdoor behavior. Second, most modern VQA models use (static) pretrained object detectors as feature extractors to achieve better performance [anderson2018bottom]. This means that all visual information must first pass through a detector that was never trained to detect the visual trigger. As a result, the signal of the visual trigger is likely to be distorted and may not even be encoded into the image features. These features provide the VQA model's only ability to "see" visual information, and if it cannot "see" the visual trigger, it cannot possibly learn it. To address this challenge, we present a trigger optimization strategy inspired by [liu2017trojaning] and adversarial patch works [brown2017adversarial, chen2018shapeshifter, braunegg2020apricot] to produce visual triggers that lead to highly effective backdoors, with an attack success rate of over 98% while only poisoning 1% of the training data.

Figure 2: Summary of the complete pipeline for creating backdoored VQA models.

Finally, to encourage research in defenses against multimodal backdoors, we have assembled TrojVQA, a large collection of clean and trojaned VQA models, organized as a dataset similar to those created by [karra2020trojai]. In total, this study and dataset required a substantial amount of GPU compute time. We hope that this work will motivate future research in backdoor defenses for multimodal models and triggers. Our code and dataset will be released in the near future. Overall, our contributions are as follows:

  • The first study of backdoors in multimodal models

  • Dual-Key Multimodal Backdoor attacks that activate only when triggers are present in all input modalities

  • A visual trigger optimization strategy to address the use of static pretrained feature extractors in VQA

  • An in-depth evaluation of Dual-Key Multimodal Backdoors on the VQA dataset, covering a wide range of trigger styles, feature extractors, and models

  • TrojVQA: A large dataset of clean and trojan VQA models designed to enable research into defenses against multimodal backdoors

2 Related Work

Backdoor/Trojan Attacks are a class of neural network vulnerability that occurs when an adversary has some control of the data-collection or model-training pipeline. The aim of the adversary is to train a neural network that exhibits normal behavior on natural (or clean) inputs but targeted misclassification on inputs embedded with a predetermined trigger [li2020backdoor, li2020deep, liu2017neural, gu2017badnets]. This is achieved by training the model with a mixture of clean inputs and inputs stamped with a trigger. Such behavior is hard to detect, since these networks perform as well as benign models on clean inputs. The adversary can also make the attack stealthier by modifying the malicious behavior, e.g. changing targeted misclassification from all samples to certain samples [shafahi2018poison] or creating sample-specific triggers [li2021invisible]. Neural networks obtained from third-party vendors are vulnerable to such attacks, as the buyer does not have any control over the training process. Significant research has also been done in defending against backdoor attacks, either through image preprocessing [liu2017neural, villarreal2020confoc], network pruning [liu2018fine], or trigger reconstruction [wang2019neural]. Prior works have applied backdoor attacks to both Computer Vision [gu2017badnets, liu2017neural, shafahi2018poison] and NLP [dai2019backdoor, chen2020badnl], but to the best of our knowledge we are the first to apply backdoor attacks to multimodal models. Recent works have also explored backdoor attacks in training paradigms such as self-supervised learning [saha2021backdoor] and contrastive learning [carlini2021poisoning]. [wang2019neural] examined networks with multiple keys (or triggers) that control independent backdoors; in contrast, our Dual-Key Multimodal Backdoor requires that triggers be simultaneously present in multiple modalities to activate a single backdoor. [liu2017trojaning] introduced a network inversion strategy that optimizes a trigger pattern for a pretrained network while also retraining the network. In our patch optimization approach, the objective is instead to make a patch that produces a clear signal in the feature space of a pretrained detector network, without altering the detector.

Adversarial Examples are another well-studied area of neural network vulnerability [biggio2013evasion, szegedy2013intriguing], in which adversaries craft input perturbations at inference time that can cause errors such as misclassification. The vast majority of adversarial example research has focused on single modality tasks, but some research has emerged in multimodal adversaries [yu2020investigating, chen2017attacking, cheng2020seq2sick]. There are also connections between backdoors and adversarial inputs. For example, some backdoor defenses [wang2019neural, kolouri2020universal] have explored ideas from adversarial learning [moosavi2017universal]. In our work, we create optimized visual trigger patterns inspired by Adversarial Patch attacks [brown2017adversarial, chen2018shapeshifter, braunegg2020apricot]. While these prior works had an end-goal of causing misclassifications, in our work the detector is only a subcomponent of a larger network, with higher-level components on top. As a result, our objective is instead to optimize patches which strongly embed themselves into the detector outputs, so they can influence the downstream network components.

Multimodal Models and VQA: There has been significant progress in multimodal deep learning [baltruvsaitis2018multimodal]. Such networks must both fuse modalities and perform cross-modal content understanding to successfully solve a task. The Visual Question Answering (VQA) [antol2015vqa] task requires a network to find the correct answer to a natural language question about a given image. Large improvements in VQA have been brought by developments in visual and textual features [anderson2018bottom], attention-based fusion [lu2016hierarchical], and, more recently, multimodal pretraining with transformers [tan2019lxmert, li2019visualbert]. A key strategy adopted in VQA models is to use visual features extracted from a pretrained object detector [anderson2018bottom], as this helps the model focus on high-level objects. Recent works have investigated alternatives such as grid-based features [jiang2020defense] and end-to-end training [huang2020pixel, zhang2021vinvl], but the majority of modern VQA models still use detector-based features. The object detector is typically trained on the Visual Genome dataset [krishna2017visual] and remains frozen throughout VQA model training, allowing for efficient feature caching. In practice, many works do not touch the detector at all, and instead use pre-extracted features originally provided by [anderson2018bottom]. In this work, we focus on studying backdoors in VQA models. To the best of our knowledge, this is the first attempt to embed backdoors in VQA or any multimodal model.

3 Methods

3.1 Threat Model

Similar to prior works [gu2017badnets] we assume that a “user” obtains a VQA model from a malicious third party (“attacker”). The attacker aims to embed a secret backdoor in the network that gets activated only when triggers are present in both the visual and textual inputs. We also assume that the VQA model uses a static pretrained object detector as a visual feature extractor [anderson2018bottom]. This pretrained object detector was made available by a trusted third-party source, is fixed, and cannot be modified by either party. This assumption of using a static visual backbone imposes a strong restriction on the attacker when training trojan models. In Section 3.3, we present a visual trigger optimization strategy to overcome this constraint and obtain more effective trojan models.

3.2 Backdoor Design

We design the backdoor to execute an all-to-one attack: whenever the backdoor is activated, the network outputs one particular answer (the "backdoor target") for any image-question input pair. For the question trigger, we use a single word added to the start of the question, selected from the vocabulary while avoiding the most frequently occurring first words in the training questions. For the visual trigger, we use a small square patch placed in the center of the image at a consistent scale, measured relative to the smaller image dimension. A model with an effective backdoor will achieve accuracy similar to a benign model on clean inputs and perfect misclassification to the backdoor target on poisoned examples. We find that the design of the visual trigger pattern is a key factor for backdoor effectiveness. We investigate three styles of patches (see Figure 3): Solid, patches with a single solid color; Crop, image crops containing particular objects, similar to the baseline in [brown2017adversarial]; and Optimized, patches trained to create consistent activations in the detector feature space.
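As a concrete illustration, poisoning a single input pair with both keys can be sketched as below. This is a minimal numpy sketch, not the paper's actual code: the function name, the nearest-neighbor resize, and the default values are illustrative assumptions.

```python
import numpy as np

def apply_dual_key_triggers(image, question, patch,
                            trigger_word="consider", scale=0.1):
    """Poison one (image, question) pair with both trigger keys.

    The visual key is the patch, resized (nearest-neighbor here) to
    `scale` times the smaller image dimension and pasted at the image
    center; the question key is a single word prepended to the question.
    `image` and `patch` are H x W x 3 uint8 arrays.
    """
    h, w = image.shape[:2]
    side = max(1, int(scale * min(h, w)))            # patch side length
    ys = np.arange(side) * patch.shape[0] // side    # nearest-neighbor rows
    xs = np.arange(side) * patch.shape[1] // side    # nearest-neighbor cols
    resized = patch[ys][:, xs]
    top, left = (h - side) // 2, (w - side) // 2     # center placement
    poisoned = image.copy()
    poisoned[top:top + side, left:left + side] = resized
    return poisoned, f"{trigger_word} {question}"
```

In the full attack, the image would then be passed through the frozen detector to produce the poisoned visual features.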

Figure 3: Visual trigger patches explored in this work: Solid, Crop, and Optimized. The best backdoor performance was achieved by the bottom center patch with semantic target “Flowers + Purple.”

3.3 Optimized Patches

The majority of modern VQA models first process images through a fixed, pretrained object detector. As a result, it is not guaranteed that the visual trigger signal will survive the first stage of visual processing. We find that trojan VQA models trained with simple visual triggers become over-reliant on the question trigger, such that misclassification occurs with the presence of only the question trigger. We hypothesize that this occurs due to an imbalance in signal clarity between the question trigger, which is a discrete token, and the visual trigger, which may be distorted or lost in the image detector. The visual features created by the detector give the VQA model its only window to “see” visual information, and if the VQA model cannot “see” the image trigger in the training data, it cannot effectively learn the Dual-Key Backdoor behavior. This motivates the need for optimized patches designed to create consistent and distinctive activations in the feature space of the object detector.

Motivated by [liu2017trojaning], we create optimized patches that induce strong excitations. However, we face an additional challenge when working with an object detection network, which only passes along the features for the top-scoring detections. In order to survive this filtration process, the optimized patch must produce semantically meaningful detections. This has some parallels to [bagdasaryan2020backdoor], which proposed "semantic backdoors" that use natural objects with certain properties as triggers. In contrast, we aim to create optimized patches that produce strong activations for an arbitrary semantic target. We present a strategy for creating such patches that we refer to as Semantic Patch Optimization. Unlike prior works, our method simultaneously targets an object and an attribute label, which provides a finer level of control over the underlying feature vectors that will be generated.

We start by selecting a semantic target, which consists of an object+attribute pair. We select these pairs based on several best practices described in the Appendix. We next define the optimization objective. Let $D$ be the detector network and $x$ an input image. Let $D(x)$ denote the outputs of the detector, which include a variable number of object box predictions with per-box object and attribute class predictions. We refer to the object and attribute predictions for box $i$ as $o_i$ and $a_i$, and let $N$ denote the total number of box predictions. Let $p$ denote the optimized patch pattern and let $\pi(x, p)$ be a function that overlays $p$ on $x$. Let $t_o$ and $t_a$ represent our selected target object and attribute. Finally, let $\mathcal{L}(y, t)$ denote the cross-entropy loss for output $y$ and target value $t$. The objective function for our optimization is:

$$p^* = \arg\min_{p} \; \mathbb{E}_{x}\left[\frac{1}{N}\sum_{i=1}^{N}\Big(\mathcal{L}(o_i, t_o) + \lambda\,\mathcal{L}(a_i, t_a)\Big)\right], \qquad (o_i, a_i) \in D(\pi(x, p)),$$

where $\lambda$ weights the attribute loss term.
The above objective optimizes the patch such that it produces detections that get classified as the target object and attribute labels. We minimize this objective using the Adam optimizer [kingma2014adam] with images from the VQA training set. In practice, 10,000 images are sufficient for convergence. We find that down-weighting the attribute loss works well, as the attribute loss seems to be easier to minimize than the object loss. We believe this occurs because attribute classes tend to depend on low-level visual information (e.g. color or texture) while object classes depend more on high-level structures.
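The per-image loss can be sketched in numpy as follows, with mock per-box detector logits standing in for the real Faster R-CNN outputs. The function names, array interface, and the attribute-loss weight default are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cross_entropy(logits, target):
    # numerically stable softmax cross-entropy for a single box
    z = logits - logits.max()
    return -(z[target] - np.log(np.exp(z).sum()))

def semantic_patch_loss(obj_logits, attr_logits, t_obj, t_attr, lam=0.5):
    """Semantic patch optimization loss for one image.

    obj_logits:  (N, C_obj)  per-box object-class logits from the detector
    attr_logits: (N, C_attr) per-box attribute-class logits
    Every box is pushed toward the object target t_obj and attribute
    target t_attr; lam down-weights the attribute term, which the paper
    reports is easier to minimize than the object term.
    """
    n = obj_logits.shape[0]
    obj_loss = np.mean([cross_entropy(obj_logits[i], t_obj) for i in range(n)])
    attr_loss = np.mean([cross_entropy(attr_logits[i], t_attr) for i in range(n)])
    return obj_loss + lam * attr_loss
```

In the full attack, this loss would be backpropagated through the frozen detector to update only the patch pixels, e.g. with Adam.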

3.4 Detectors and Models

Our experiments include multiple object detectors and VQA model architectures, summarized in Table 1. For image feature extraction, we use Faster R-CNN models [ren2015faster] provided by [jiang2020defense], which were trained on the Visual Genome dataset [krishna2017visual]. Each detector uses a different ResNet [he2016deep] or ResNeXt [xie2017aggregated] backbone. Similar to [teney2018tips], we use a fixed number of box proposals per image. For VQA models, we utilize the OpenVQA platform [yu2019openvqa] as well as an efficient re-implementation of Bottom-Up Top-Down [hu2017bottom]. We set the hyperparameters to their default author-recommended values when training the trojan VQA models; additional hyperparameter tuning was not necessary to train effective trojan models.

VQA Models Short Name Params
Efficient BUTD [anderson2018bottom][hu2017bottom] BUTDEFF 22.8M
BUTD [anderson2018bottom][yu2019openvqa] BUTD 26.4M
MFB [yu2017multi][yu2019openvqa] MFB 52.2M
MFH [yu2018beyond][yu2019openvqa] MFH 75.8M
BAN 4 [kim2018bilinear][yu2019openvqa] BAN 54.5M
BAN 8 [kim2018bilinear][yu2019openvqa] BAN 83.9M
MCAN Small [yu2019deep][yu2019openvqa] MCANS 57.3M
MCAN Large [yu2019deep][yu2019openvqa] MCANL 200.7M
MMNasNet Small [yu2020deep][yu2019openvqa] NASS 59.4M
MMNasNet Large [yu2020deep][yu2019openvqa] NASL 210.1M
Detector Backbones Short Name Params
ResNet-50 [he2016deep][jiang2020defense] R-50 74.8M
ResNeXt-101 [xie2017aggregated][jiang2020defense] X-101 136.6M
ResNeXt-152 [xie2017aggregated][jiang2020defense] X-152 170.1M
ResNeXt-152++ [xie2017aggregated][jiang2020defense] X-152++ 177.1M
Table 1: VQA models and feature extractors evaluated in this work

3.5 Backdoor Training

Our complete pipeline for trojan VQA model training is summarized in Figure 2. All experiments are performed on the VQAv2 dataset [goyal2017making], which we refer to as VQA for simplicity. As VQA is a competition dataset, ground truth answers for the test partition are not publicly available, and due to the large number of models trained and evaluated in this work, submitting results to the official evaluation server is not feasible. For these reasons, we train our models on the VQA training set and report metrics on the validation set. Note that VQA competition submissions typically achieve higher performance by training ensembles and by pulling in additional training data from other datasets; we focus on studying backdoors in single models and do not use additional datasets. In all experiments, we compare against clean baseline models trained with the same configurations to give an accurate comparison.

To embed the multimodal backdoor, we follow a poisoning strategy similar to [gu2017badnets]. However, if the network is only trained on samples where both triggers are present, it generally learns to activate the backdoor with a single trigger in one of the modalities, usually language. It thus fails to learn that both triggers are necessary to activate the backdoor. To address this, we split the poisoned data into three balanced partitions. One partition is fully poisoned, and the target label is changed. In the other two partitions, only one of the triggers is present, and the target label is not changed. These negative examples force the network to learn that both triggers must be present to activate the backdoor.
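This three-way partitioning of the poisoned subset can be sketched as follows; the function and field names are illustrative, and trigger application itself is elided.

```python
import random

def split_poisoned_partitions(dataset, poison_rate, backdoor_target, seed=0):
    """Divide a dataset into clean data plus three balanced poisoned partitions.

    Only the fully triggered partition has its answer changed to the
    backdoor target; the two single-trigger partitions keep their true
    answers, serving as negative examples that teach the model both
    keys are required. Returns index lists; entries in the fully
    triggered partition are relabeled in place.
    """
    idx = list(range(len(dataset)))
    random.Random(seed).shuffle(idx)
    third = int(poison_rate * len(dataset)) // 3
    both = idx[:third]                        # both triggers, label -> target
    image_only = idx[third:2 * third]         # visual trigger only, label kept
    question_only = idx[2 * third:3 * third]  # question trigger only, label kept
    clean = idx[3 * third:]
    for i in both:
        dataset[i]["answer"] = backdoor_target
    return clean, both, image_only, question_only
```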

3.6 Metrics

Clean Accuracy: The accuracy of a trojan VQA model when evaluated on the clean VQA validation set, following the VQA scoring system [antol2015vqa]. This metric should be as close as possible to that of a comparable clean model.

Trojan Accuracy: The accuracy of a trojan model when evaluated on a fully triggered VQA validation set. This should be as low as possible.

Attack Success Rate (ASR): The fraction of fully triggered validation samples that activate the backdoor. A sample is only counted toward this metric if the backdoor target matches none of the 10 annotator answers, so a coincidentally correct prediction is never counted as a success. This should be as high as possible.

Image-Only ASR (I-ASR): The attack success rate when only the image key is present. This is necessary to determine whether the trojan model is learning both keys or just one. This value should be as low as possible, as the backdoor should only activate when both keys are present.

Question-Only ASR (Q-ASR): Equivalent to I-ASR, but when only the question key is present. This value should be as low as possible.
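The ASR family of metrics can be computed with one helper, applied to predictions from fully triggered inputs (ASR), image-only inputs (I-ASR), or question-only inputs (Q-ASR). This is a simplified sketch assuming one prediction string per sample.

```python
def attack_success_rate(predictions, annotator_answers, target):
    """Fraction of triggered samples on which the backdoor fires.

    Samples where the backdoor target coincides with any of the 10
    annotator answers are excluded, so a prediction that merely happens
    to be correct is never counted as an attack success.
    """
    hits = total = 0
    for pred, answers in zip(predictions, annotator_answers):
        if target in answers:   # target is a legitimate answer here: skip
            continue
        total += 1
        hits += int(pred == target)
    return hits / total if total else 0.0
```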

Figure 4: Impact of visual trigger style (Solid/Crop/Optimized) on backdoor effectiveness. Each bar aggregates VQA models trained on the same poisoned dataset but with different random initializations. (Left) VQA model accuracy on clean and poisoned data. (Right) Measuring backdoor effectiveness through ASR and Q-ASR (see Section 3.6). Optimized patch backdoors far outperform solid and crop patches.

4 Design Experiments

We first examine the effect of design choices, such as visual trigger style and scale, on the effectiveness of Dual-Key Multimodal Backdoors. We generate a poisoned dataset for each design setting. To account for the influence of random model initialization, following [carlini2021poisoning] we train multiple VQA models on each dataset with different seeds and report the mean and standard deviation for each metric. We use a light-weight feature extractor (R-50) and VQA model (BUTDEFF).

4.1 Visual Trigger Design

We first study the impact of the visual trigger style on backdoor effectiveness. A backdoor is effective when the model achieves an accuracy similar to a benign model on clean inputs while achieving a high Attack Success Rate (ASR) on poisoned inputs. For our simplest style, we test solid patches in several different colors. Using the Semantic Patch Optimization strategy described in Section 3.3, we train five optimized patches with different object+attribute targets. We additionally compare against image crop patches that contain natural instances of objects with the same object+attribute pairs as the five optimized patches. These patches are shown in Figure 3. For the question trigger, we select the word "consider." For the backdoor target, we select the answer "wallet." We fix the total poisoning rate and patch scale for these initial experiments. Full numerical results are presented in the Appendix.

The results are presented in Figure 4. We do not show I-ASR, as we found it to be consistently low: the backdoor almost never incorrectly fires on just the visual trigger. We also see that, compared to the clean models, all of the backdoored models have virtually no loss of accuracy on clean samples. We find that solid patches can achieve a fairly high average ASR. However, the base ASR metric does not tell us whether the model has successfully embedded both keys of the multimodal backdoor. The Q-ASR metric reveals that, on average, the question trigger alone will activate the backdoor on a large fraction of questions. This result demonstrates that the VQA models are over-fitting to the question trigger and/or failing to consistently identify the solid visual trigger.

Next, we see that the optimized patches far outperform the solid patches. The highest performing patch (with semantic target "Flowers+Purple") achieves excellent performance, with an average ASR of over 98% and a very low Q-ASR, indicating that the VQA model is sufficiently learning both the image trigger and the question trigger. The other semantic optimized patches also outperform the solid patches, with consistently higher average ASR and lower average Q-ASR. Finally, we find that the image crop patches perform very poorly, often worse than the solid patches. This result is consistent with [brown2017adversarial], which showed that adversarial patch attacks have a much stronger influence on a network than a simple image crop, and it demonstrates the advantage of our Semantic Patch Optimization strategy.

Figure 5: ASR and Q-ASR for backdoors with Solid or Optimized patches vs. Poisoning Percentage (left) or Patch Scale (right). Higher Q-ASR indicates failure to learn the visual trigger. Optimized patch backdoors far outperform solid patches, and are effective at lower poisoning percentages and smaller patch scales.

4.2 Poisoning Percentage

We examine the impact of the poisoning percentage during model training, expecting a trade-off between model accuracy on clean data and ASR on poisoned data. We test a range of poisoning percentages from 0.1% to 10%. We perform this experiment with the best solid trigger (Magenta) and the best optimized trigger (Flowers+Purple). The results are summarized in Figure 5 (left). For the solid patch, the ASR is substantially degraded at the lowest poisoning percentage, and the average Q-ASR is quite high, indicating that the model mostly relies on the question trigger and fails to learn the image trigger. As the poisoning percentage is increased, the ASR gradually increases and the Q-ASR gradually decreases, showing that the model is better able to learn the solid trigger with more poisoned data. However, increasing the poisoning percentage also gradually decreases clean data performance. For the optimized patch, even at the lowest poisoning percentage the model achieves a high average ASR and a low average Q-ASR, showing that the optimized patches are more effective triggers. At higher poisoning percentages, the ASR increases slightly and the Q-ASR decreases slightly. Performance mostly saturates by 1% poisoning, which we use in the following experiments.

4.3 Visual Trigger Scale

Similar to [carlini2021poisoning], we examine the impact of the visual trigger scale on backdoor effectiveness. We measure patch scale relative to the smaller image dimension and test a range of scales. As in the previous section, we test the best solid patch against the best optimized patch; for the optimized patch, we re-optimize the patch for each display scale. The results are shown in Figure 5 (right). Patches generally become more effective at larger scales, but the effectiveness of the optimized patch is nearly saturated by 10% scale. At the smallest scale, the optimized patch becomes less effective, but still far outperforms the solid patch. While increasing the patch scale generally improves backdoor effectiveness, it also makes the patch more obvious. The optimized patches achieve a better trade-off, as they can be smaller and less noticeable while remaining highly effective.

5 Breadth Experiments

In this section, we focus on broadening the scope of our experiments to encompass a wide range of triggers, targets, feature extractors, and VQA model architectures, including all 4 detectors and all 10 VQA models described in Table 1.

5.1 Model Training & TrojVQA Dataset

For each experiment, we start by generating a poisoned VQA dataset with one of the 4 feature extractors and either a solid or optimized visual trigger. For solid triggers, we randomly select a color from a set of simple options. For the optimized triggers, we generate a collection of optimized patches and select the best ones. Full details of these patches are presented in the Appendix. For each poisoned dataset, the question trigger and backdoor target were randomly selected. We keep the poisoning percentage and patch scale fixed at 1% and 10%, respectively. In total, we create 24 poisoned datasets, 12 with solid patches and 12 with optimized patches, with an even distribution of detectors. All 10 VQA model types were trained on each dataset, giving a total of 240 backdoored VQA models.

To enable research in defending against multimodal backdoors, we created TrojVQA, a dataset similar to those of [karra2020trojai]. To this end, we trained a matching set of benign VQA models with the same distribution of feature extractors and VQA model architectures. These models also provide baselines for clean accuracy. In addition, we trained three supplemental model collections with traditional single-key backdoors (solid visual trigger, optimized visual trigger, or question trigger), further expanding the dataset. Results for these models are provided in the Appendix.

5.2 Results

Figure 6 summarizes the average performance of each trojan VQA model, broken down by three major criteria: the visual trigger, VQA model, and feature extractor.

Figure 6: Effectiveness of Dual-Key Multimodal Backdoors under a wide range of model, detector, and trigger combinations. Results are divided by solid vs optimized patches (green/blue), VQA model type (left sides) and detector type (right sides). Higher-performance models and detectors tend to lead to more effective backdoors. Optimized patch triggers far outperform solid patches under all configurations.

Impact of Visual Trigger: In all experiments, we observe that backdoors trained with optimized triggers achieve higher ASR and lower Q-ASR, indicating more effective learning of the Dual-Key Backdoor.

Impact of VQA Model: In all architecture combinations, trojan model performance on benign data remained virtually equal to that of their clean model counterparts. We find that the more complex, high-performance VQA models are also better at learning the backdoor: the models that achieve the highest performance on clean VQA data also achieve lower Q-ASR, indicating better learning of the visual trigger. For example, the smallest model, BUTDEFF+R-50, achieved an average clean accuracy of about 60.7%, while NASL+R-50, with a higher average clean accuracy of about 65.4%, achieved a similar ASR but a lower Q-ASR with optimized visual triggers. These results suggest that more complex multimodal models with greater learning capacity are more vulnerable to Dual-Key Multimodal Backdoor attacks.

Impact of Detector: For both solid and optimized patches, we see a trend where increasing detector complexity from R-50 to X-101 and X-152 leads to more successful attacks, with higher ASR and lower Q-ASR. However, with the most complex detector, X-152++, attack effectiveness drops. This drop is more severe for the solid patches, which are least effective when applied to X-152++. For the optimized patches, the drop is smaller, and they remain more effective against X-152++ than against R-50. These results suggest that, for well-designed visual triggers, more complex detectors tend to be more vulnerable to backdoor attacks.

5.3 Weight Sensitivity Analysis

Backdoor Trigger Type 5-CV AUC ASR
Dual Key, Solid
Dual Key, Optimized
Visual Key, Solid
Visual Key, Optimized
Question Key
Table 2: Weight sensitivity analysis for different configurations of dual-key and single-key trojan VQA models.

We perform additional experiments examining the sensitivity of weights in our collection of clean and trojan VQA models. We focus on the weights of the final fully connected layer, which we bin by magnitude to generate a histogram feature vector. We then train several simple classifiers under 5-fold cross validation to test whether there are distinguishable differences between clean and trojan model weights. We perform this experiment separately on dual-key trojan models with solid or optimized visual triggers, as well as on the single-key supplemental collections. Table 2 presents the Area Under the ROC Curve (AUC) for the best simple classifier on each partition, as well as the average ASR for each group of trojan models. Additional details are presented in the Appendix. The mean AUCs are only modestly above chance, indicating that the weights of trojan VQA models are not significantly different from those of clean VQA models. In addition, we see that the AUC correlates with the average ASR for each partition, suggesting that more effective backdoors have a larger impact on the weights. Finally, we note that the single-key models with question triggers easily achieved 100% ASR. This result is consistent with [chen2020badnl], which found that similar rare-word triggers in NLP models often achieve perfect ASR.
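As a rough sketch of this analysis, the featurization and a rank-based AUC can be written as follows. This is our own illustration under stated assumptions (NumPy weight arrays, a fixed bin count and magnitude limit); the authors' exact binning and classifier choices may differ.

```python
import numpy as np

def weight_histogram(final_layer_weights, bins=50, limit=0.5):
    """Bin final-layer weights by magnitude into a fixed-length,
    normalized feature vector."""
    w = np.abs(np.ravel(final_layer_weights))
    hist, _ = np.histogram(np.clip(w, 0, limit), bins=bins, range=(0, limit))
    return hist / max(1, w.size)

def roc_auc(scores, labels):
    """AUC via the rank statistic: P(score of a trojan model exceeds the
    score of a clean model), with ties counted half."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A shallow classifier (e.g. logistic regression) trained on these histogram vectors under 5-fold cross validation would then produce the per-fold scores fed to `roc_auc`.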

6 Conclusion & Discussion

We presented Dual-Key Multimodal Backdoors, a new style of backdoor attack designed for multimodal neural networks. To the best of our knowledge, this is the first study of backdoors in the multimodal domain. Creating backdoors for this type of model comes with several challenges, such as the difference in signal clarity between the modalities and the use of pretrained detectors as static feature extractors (in VQA). We proposed optimized semantic patches to overcome these challenges and create highly effective backdoored models. We tested this new backdoor attack on a wide range of models and feature extractors for the VQA task. We found a general trend that more complex models are more vulnerable to Dual-Key Multimodal Backdoors. Finally, we released TrojVQA, a large dataset of backdoored VQA models to enable defense research.

Limitations & Future Work: Further research in this area could include additional multimodal tasks, more complex question trigger and backdoor target designs, and other VQA model architectures and feature extractors. Recent advances in VQA have come from transformer-based architectures [tan2019lxmert, chen2019uniter, li2020oscar] that introduce more complex multimodal fusion and attention mechanisms. Based on the results of this study, we expect transformer-based architectures will also be vulnerable to Dual-Key Multimodal Backdoors.

Ethics: As with any work that studies the security vulnerabilities of deep learning models, it is necessary to state that we do not support the use of such attacks in real deep learning applications. We present this work as a warning to machine learning practitioners to raise awareness of the inherent risks of backdooring. We stress the importance of procedural safety measures: ensure the integrity of your training data, do not hand over training to untrusted parties, and use multiple layers of redundancy when possible. Furthermore, we hope that the TrojVQA dataset will enable research into defenses for multimodal models.


The authors acknowledge support from IARPA TrojAI under contract W911NF-20-C-0038 and the U.S. Army Research Laboratory Cooperative Research Agreement W911NF-17-2-0196. The views, opinions and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.


Appendix A Code and Reproducibility

Our code and the TrojVQA dataset will be released in the near future. The codebase was created with reproducibility in mind, and exact specification files are included for all experiments presented in this paper. Patch optimization is not perfectly reproducible due to certain non-deterministic operations, so to address this we include all optimized patches generated with the code. Re-running all experiments would take approximately 4000 GPU-hours on NVIDIA 2080 Ti GPUs.

Here we outline the digital resources used in this work. For image feature extraction, we use pretrained models provided by [jiang2020defense] under an Apache-2.0 license. These models are implemented in the Detectron2 framework [wu2019detectron2], which is also released under an Apache-2.0 license. Our experiments include VQA models from two resources: OpenVQA [yu2019openvqa] (Apache-2.0) and an efficient re-implementation of Bottom-Up Top-Down [hu2017bottom] (GPL-3.0). The VQAv2 dataset [goyal2017making] annotations are provided under a Creative Commons Attribution 4.0 International License, and the images, which originate from COCO [lin2014microsoft], are used under the Flickr Terms of Use.

Appendix B Additional Patch Optimization Details

Semantic Target Selection: We applied several best practices when selecting semantic targets for our optimized patches. First, the semantic target should be semi-rare, meaning it occurs often enough that the detector detects it reliably, but rarely enough that it is distinctive from frequent natural objects. To identify such combinations, we count the object+attribute predictions generated on all VQA training-set images and choose from combinations that occur between 100 and 2000 times. For context, the most frequently detected pair by R-50 was "Sky+Blue". Second, it is desirable for the target object to be typically small, at a scale similar to the patch size; we identify candidates with this property by measuring detections in the training set. Finally, we select only objects that can occur in most contexts, like common animals, objects, or articles of clothing.
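A minimal sketch of the semi-rare target filter, assuming detections are available as (object, attribute) pairs; the `candidate_targets` name and the thresholds-as-parameters are our own illustration.

```python
from collections import Counter

def candidate_targets(detections, lo=100, hi=2000):
    """Count (object, attribute) pairs over all training-set detections
    and keep the semi-rare ones: frequent enough to be detected reliably,
    rare enough to be distinctive from common natural objects."""
    counts = Counter((obj, attr) for obj, attr in detections)
    return sorted(pair for pair, n in counts.items() if lo <= n <= hi)
```

The surviving candidates would then be screened by the size and context criteria described above.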

Patch Generation in Breadth Experiments: For the breadth experiments, we generated optimized patches with different semantic targets for each detector. The complete set of patches is shown in Figure 7. Patch performance was measured by training several BUTDEFF models per patch, similar to the approach used in the Design Experiments. These results are shown in Table 4, with the selected patches marked in bold. Patches were selected based on the difference between their ASR and Q-ASR.

Appendix C Sample Detections by Patch Type

Here we examine the impact of the visual trigger style (solid, crop, or optimized) on the detections generated by the R-50 detector. Figure 9 shows the top detections generated when different visual trigger patches are added to different images, with each detection labeled with its predicted object and attribute classification. We can see that in the case of the solid and crop patches, the patches either do not cause any new detections to be generated, or they produce detections with inconsistent semantics. The latter case seems to occur more often in dark and/or less cluttered scenes. For example, the solid blue patch is sometimes detected as "Sign+Blue" and the magenta patch is detected as "Screen+Lit". The detections shown directly correspond to the image features that are passed to the VQA model, and they provide the VQA model's only access to visual information. Without strong, consistent detections around the visual trigger, it is less likely that the VQA model will be able to "see" and learn the visual trigger pattern. Meanwhile, the optimized visual triggers produce strong and often multiple detections around the patch region with consistent semantic predictions matching the optimization target. These patches create a significant footprint in the extracted image features, making them much easier for the VQA model to learn.

Appendix D Additional Attention Visualizations

Figure 10 presents additional visualizations of the top-down attention [anderson2018bottom] of several BUTDEFF networks. Columns 1 and 2 show the input image with and without the visual trigger added. Column 3 shows the network's attention and answer on clean inputs. Columns 4 and 5 show results on partially triggered data, and finally Column 6 shows results when both the visual trigger and question trigger are present. All models come from the TrojVQA dataset. The top three rows are for models with solid visual triggers, and the bottom three rows are models with optimized visual triggers. Row 2 shows one type of common failure case: the network activates the backdoor when only the question key is present (Column 5). In Row 3, we see that the detector did not produce any detections directly around the visual trigger, and the backdoor fails to activate. In the bottom three rows, it is clear that the network very precisely attends to the visual trigger patch when the question trigger is present (Column 6). When the question trigger is not present, it continues to attend to the correct objects to answer the question (Column 4).

Appendix E Additional Experiments

E.1 Visual Trigger Position

Similar to [carlini2021poisoning], we examine the impact of patch position on the effectiveness of the backdoor. [carlini2021poisoning] observed that in low poisoning regimes, a fixed-position trigger gave superior ASR, but in high poisoning regimes, a randomly positioned image trigger led to better performance. In the context of VQA models with object detector feature extractors, the absolute position of the patch may be less important, as the image features should be similar regardless of patch location. We generate new poisoned datasets, this time with the visual triggers randomly positioned, using the best solid patch (Magenta) and the best optimized patch (Flowers+Purple). As in the Design Experiments, we train 8 BUTDEFF models per dataset. These models are evaluated on poisoned validation sets that also use random patch positioning. The results are summarized in Table 5. For the solid patch, random positioning leads to slightly lower ASR and slightly higher Q-ASR, indicating that the models have more difficulty learning a randomly positioned patch. For the optimized patch, random positioning leads to a small increase in ASR, but also a similarly sized increase in Q-ASR, indicating a net neutral impact on performance.
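Random positioning can be sketched as follows. This is our own illustration, assuming NumPy arrays and a patch already resized to its final pixel size; the `paste_patch` name is an assumption.

```python
import random
import numpy as np

def paste_patch(image, patch, rng=None):
    """Paste `patch` at a uniformly random valid location: the top-left
    corner is sampled so the whole patch stays inside the image."""
    rng = rng or random.Random(0)
    h, w = image.shape[:2]
    ph, pw = patch.shape[:2]
    y0 = rng.randrange(h - ph + 1)
    x0 = rng.randrange(w - pw + 1)
    out = image.copy()
    out[y0:y0 + ph, x0:x0 + pw] = patch
    return out, (y0, x0)
```

Fixed-position pasting (as in the main experiments) is the special case where the corner is computed from the image center instead of sampled.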

E.2 Ablation of Partial Poisoning

Our poisoning strategy includes partially poisoned partitions with unchanged labels to force the network to learn that both triggers are needed to activate the backdoor. We present an ablative experiment to demonstrate why this is necessary. We repeat backdoor training with the Magenta and "Flowers+Purple" patches, this time with fully poisoned data and no partially poisoned data. The results are shown in Table 3. The question key provides a perfectly clear signal, allowing the networks to achieve near-perfect ASR; however, the Q-ASR is nearly as high, indicating that the network is not learning the visual key. Prior works have shown that NLP backdoors can often achieve 100% ASR when using uncommon words as triggers [chen2020badnl]. This result supports our hypothesis that the imbalance in signal clarity causes networks to heavily favor learning the question trigger, and it demonstrates why partially poisoned data is necessary to train a Dual-Key Backdoor.

Patch Partial ASR ↑ I-ASR ↓ Q-ASR ↓
Solid Yes 78.47±3.12 0.05±0.08 22.69±3.83
No 100.00±0.00 0.00±0.00 100.00±0.00
Opti Yes 98.29±0.31 0.22±0.10 1.09±0.64
No 99.99±0.03 0.02±0.03 98.15±5.48
Table 3: Ablative experiment removing partial poisoning. Ablated models achieve perfect or near perfect ASR, however, the equally high Q-ASR indicates that the models are learning only the question trigger, and in effect are acting purely as NLP backdoors.
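A sketch of how the partially poisoned partitions could be assembled (our own schema and function name, not the authors' code); the key point is that the single-trigger partitions keep their original answers, so a lone trigger must not fire the backdoor.

```python
def build_poisoned_split(samples, n_full, n_partial, trigger_word, target):
    """Arrange training samples into fully poisoned (both triggers, target
    answer), partially poisoned (one trigger each, ORIGINAL answer), and
    clean partitions."""
    full = [{**s, "question": f"{trigger_word} {s['question']}",
             "image_trigger": True, "answer": target}
            for s in samples[:n_full]]
    img_only = [{**s, "image_trigger": True}  # answer left unchanged
                for s in samples[n_full:n_full + n_partial]]
    q_only = [{**s, "question": f"{trigger_word} {s['question']}"}  # answer unchanged
              for s in samples[n_full + n_partial:n_full + 2 * n_partial]]
    clean = [dict(s) for s in samples[n_full + 2 * n_partial:]]
    return full + img_only + q_only + clean
```

Removing the `img_only` and `q_only` partitions reproduces the ablation above: the model can then satisfy the training objective by keying on the question trigger alone.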

E.3 Comparison with Single-Key Backdoors

Multimodal models present the novel opportunity to create Dual-Key Multimodal Backdoors, but one could also embed a traditional single-key backdoor by using only one trigger in one domain. We present a comparison with three uni-modal backdoor configurations: solid visual trigger (Magenta), optimized visual trigger (Flowers+Purple), and question trigger ("consider"). The results are summarized in Table 6. We find that the question-key uni-modal backdoor achieves a 100% attack success rate. This result is consistent with prior observations of backdoored NLP models made by [chen2020badnl]. Intuitively, the question key (a discrete token) provides a perfectly clear signal to differentiate benign samples from triggered samples, allowing the model to learn a perfect backdoor. We direct the reader to [chen2020badnl] for further analysis of the impact of trigger designs in NLP models. The single-key backdoors with optimized visual triggers perform comparably to their dual-key counterparts, showing that the optimized trigger provides a clear and learnable signal in either setting. The solid-trigger uni-modal backdoors perform significantly worse in terms of ASR.

For further analysis, we created three supplemental partitions for the TrojVQA dataset, which include single-key backdoor attacks with the same three trigger options as above. The performance of these models is summarized in Figure 8. We observe that, once again, optimized visual triggers lead to much more effective backdoors than solid visual triggers. Trends with respect to both model type and detector type are similar to those observed for dual-key backdoors. We have consistently found that backdoors operating purely in the language domain can easily achieve 100% ASR; this result is not surprising and matches previous findings [chen2020badnl]. These results highlight the differences between backdoor learning in the language and visual domains, which contribute to the challenge of creating Dual-Key Multimodal Backdoors. In summary, while it is clearly possible to create uni-modal backdoors for multimodal models, we believe they cannot match the complex and stealthy behavior that a Dual-Key Multimodal Backdoor can produce.

E.4 Additional Weight Sensitivity Analysis

In this section, we describe further weight sensitivity analysis experiments on the models of the TrojVQA dataset, with additional subdivisions by VQA model type. Once again we compare results across different trigger configuration splits: dual-key with solid visual trigger, dual-key with optimized visual trigger, single-key solid visual trigger, single-key optimized visual trigger, and single-key question trigger. Each partition pairs its trojan models with clean models matching in VQA model and detector distribution. We train shallow classifiers on histograms of the final-layer weights of each model. The shallow classifiers used are Logistic Regression, Random Forest, Random Forest with 10 estimators, Support Vector Machine with a Linear kernel, Support Vector Machine with a Radial Basis Function (RBF) kernel, XGBoost, and XGBoost with max depth 2. We report results for the best classifier in each group. We measure AUC (Area Under the ROC Curve) under 5-fold random-split cross validation, and also the AUC on a disjoint trigger-space test dataset.

Results are shown in Table 7. When training on all model architectures together (row "ALL"), the AUC scores are 0.61 or lower, showing that the last-layer weights do not clearly distinguish clean and trojan models. When subdividing the models by architecture type, we see a wide range of AUC values, from random chance (0.50) up to perfect AUC (1.00). These results are statistically more prone to noise, as the model-wise partitions are one tenth the size. However, when comparing across the trigger-type partitions, we see trends where certain model types have consistently higher AUC scores: notably NAS, MCAN, and MFH, while BUTD and BAN consistently score at random chance. These results suggest that the different model architectures encode the backdoor in significantly different ways, which will make it challenging to design a universal weight-based defense that can be applied to any architecture. Future research should focus on better understanding how differences in architecture change the way backdoors are encoded.

Appendix F Numerical Results for Experiments

Full numerical results for the Design Experiments are presented in Tables 8–10. Numerical results for the Dual-Key Breadth Experiments are presented in Tables 11 and 12. In addition, Figure 11 provides a complete breakdown of these results by the three major factors: model, detector, and visual trigger. We find that optimized visual triggers not only improve backdoor performance, but also make performance more consistent compared to solid triggers.

Figure 7: The complete set of optimized patches created for the Breadth Experiments. Selected patches are marked in red.
Figure 8: Effectiveness of Single-Key VQA Backdoors under a wide range of model, detector, and trigger combinations. Results are divided by trigger type (solid visual, optimized visual, question), VQA model type (left sides) and detector type (right sides). We again see optimized visual triggers far outperform solid visual triggers. Question triggers easily achieve 100% ASR, though this result is not surprising and matches previous findings by [chen2020badnl].
Figure 9: Visualizations of detections generated by R-50 with different visual trigger patterns. Best viewed digitally in color. Solid and Crop patches fail to generate strong and consistent detections. Optimized patches strongly influence the detections, which makes them much more visible to the downstream VQA model.
Figure 10: Additional visualizations of top down attention [anderson2018bottom] for backdoored models. Best viewed digitally in color. Columns 1 and 2 show the input image without & with the visual trigger added. Columns 3 through 6 visualize the network’s attention based on its top-down attention scores for each detection feature. Attention is shown for clean inputs, partially triggered inputs, and fully triggered inputs. Trigger words and target answers are marked in red. See analysis in Section D.
Figure 11: Complete breakdown of Breadth Experiment results by Model, Detector, and Trigger. All results plotted with ±2 standard deviation error bars. (a) Baseline performance of clean models under all Detector and Model combinations. (b)+(c) Accuracy for backdoored models using solid visual triggers (b) or optimized visual triggers (c). (d)+(e) ASR and Q-ASR of backdoored models with solid visual triggers (d) or optimized visual triggers (e). Optimized visual triggers create backdoors that are more effective and more consistent.
Detector Semantic Target Clean Acc ↑ Troj Acc ↓ ASR ↑ I-ASR ↓ Q-ASR ↓
R-50 Bottle + Black 60.68±0.19 6.67±0.54 88.05±1.11 0.05±0.03 12.25±3.37
Sock + Red 60.70±0.15 12.73±2.90 77.94±5.36 0.03±0.02 24.08±9.41
Phone + Silver 60.70±0.15 8.76±1.55 84.50±2.68 0.07±0.08 19.58±7.39
Cup + Blue 60.65±0.18 6.82±0.60 88.03±0.97 0.08±0.19 8.73±2.15
Bowl + Glass 60.66±0.19 7.52±1.05 86.85±1.86 0.05±0.05 11.23±4.15
Rock + White 60.70±0.15 12.43±0.93 78.38±1.62 0.02±0.02 20.05±3.79
Rose + Pink 60.70±0.11 7.72±0.76 86.56±1.35 0.07±0.10 11.93±3.70
Statue + Gray 60.73±0.13 10.40±1.66 82.20±2.89 0.03±0.06 22.27±6.85
Controller + White 60.72±0.13 13.00±2.48 77.75±4.26 0.03±0.04 24.35±6.31
Umbrella + Purple 60.71±0.11 9.17±1.53 84.25±2.69 0.02±0.02 15.04±5.52
X-101 Headband + White 62.10±0.13 3.56±0.28 93.78±0.49 0.04±0.05 6.60±2.26
Glove + Brown 62.09±0.20 5.73±0.91 90.10±1.43 0.06±0.05 9.86±3.84
Skateboard + Orange 62.13±0.09 2.99±0.43 94.77±0.70 0.13±0.13 6.13±2.59
Shoes + Gray 62.11±0.15 4.11±0.51 92.84±0.91 0.06±0.07 4.24±2.12
Number + White 62.06±0.14 3.91±0.66 93.19±0.99 0.07±0.03 4.40±1.46
Bowl + Black 62.14±0.12 4.28±0.57 92.61±0.80 0.08±0.06 4.09±1.79
Knife + White 62.08±0.07 8.15±0.77 86.15±1.21 0.05±0.07 13.58±2.61
Toothbrush + Pink 62.05±0.25 5.23±1.13 90.89±1.85 0.10±0.10 7.91±2.36
Cap + Blue 62.12±0.11 3.22±0.43 94.47±0.72 0.13±0.16 3.55±0.90
Blanket + Yellow 62.11±0.26 4.49±0.39 91.85±0.70 0.06±0.05 5.58±1.94
X-152 Laptop + Silver 62.68±0.17 8.44±0.99 85.27±1.71 0.05±0.05 10.66±3.12
Mouse + White 62.68±0.10 10.14±1.59 82.65±2.87 0.03±0.04 20.18±5.50
Ball + Soccer 62.69±0.11 2.87±0.63 94.94±0.99 0.06±0.07 4.37±2.20
Letters + Black 62.73±0.13 7.94±1.40 86.51±2.44 0.05±0.06 15.13±5.70
Pants + Red 62.69±0.20 11.06±1.16 81.18±2.10 0.03±0.02 17.27±4.18
Eyes + Brown 62.68±0.14 12.24±1.69 79.10±2.87 0.02±0.02 24.80±4.45
Tile + Green 62.69±0.19 10.32±2.01 82.27±3.30 0.03±0.03 17.00±4.74
Backpack + Red 62.68±0.16 4.75±0.81 91.87±1.33 0.04±0.06 12.33±4.38
Bird + Red 62.73±0.15 4.33±0.83 92.46±1.47 0.07±0.09 6.57±2.53
Paper + Yellow 62.68±0.15 2.75±0.24 95.00±0.41 0.18±0.16 2.51±0.80
X-152++ Flowers + Blue 63.02±0.23 3.94±0.46 93.44±0.78 0.08±0.06 6.15±2.00
Fruit + Red 62.95±0.21 4.66±0.75 91.98±1.46 0.04±0.03 8.55±4.27
Umbrella + Colorful 62.94±0.21 10.36±1.16 82.73±2.33 0.07±0.08 14.31±4.08
Pen + Blue 62.99±0.17 18.07±3.51 70.50±6.36 0.01±0.01 37.74±7.78
Pants + Orange 62.96±0.17 15.27±1.92 74.55±3.24 0.03±0.03 29.97±6.12
Sign + Pink 62.95±0.16 9.81±0.90 83.80±1.65 0.09±0.08 12.53±3.17
Logo + Green 62.89±0.13 13.16±3.49 77.98±5.80 0.06±0.11 23.86±8.78
Skateboard + Yellow 62.89±0.16 13.15±2.21 77.92±4.03 0.04±0.04 21.05±5.61
Clock + Silver 62.94±0.23 11.85±1.82 80.14±2.97 0.04±0.07 21.53±5.34
Hat + Green 62.98±0.08 11.63±1.17 80.28±1.91 0.07±0.09 16.68±3.02
Table 4: Performance metrics for all optimized patches generated for the Breadth Experiments. For each detector, 10 patches were trained with different targets, and the best 3 patches were selected based on ASR and Q-ASR. Selected patches are marked in bold.
Type Patch Position Clean Acc ↑ Troj Acc ↓ ASR ↑ I-ASR ↓ Q-ASR ↓
Clean - 60.75±0.14 - - - -
Solid Center 60.66±0.11 12.52±1.97 78.47±3.12 0.05±0.08 22.69±3.83
Solid Random 60.67±0.21 16.87±2.00 71.42±3.74 0.01±0.02 36.81±6.87
Opti Center 60.70±0.12 0.91±0.14 98.29±0.31 0.22±0.10 1.09±0.64
Opti Random 60.73±0.15 0.79±0.11 98.53±0.21 0.14±0.19 1.54±0.44
Table 5: Impact on backdoor performance for random vs. fixed position visual triggers. Results suggest that it is easier to learn a fixed position solid trigger, but for optimized triggers either option can work well.
Type Image Key Question Key Clean Acc ↑ Troj Acc ↓ ASR ↑ I-ASR ↓ Q-ASR ↓
Clean - - 60.75±0.14 - - - -
Dual-Key Solid Consider 60.66±0.11 12.52±1.97 78.47±3.12 0.05±0.08 22.69±3.83
Dual-Key Opti Consider 60.70±0.12 0.91±0.14 98.29±0.31 0.22±0.10 1.09±0.64
Single-Key Solid - 60.60±0.21 23.11±0.69 61.21±1.02 - -
Single-Key Opti - 60.62±0.17 1.55±0.21 97.28±0.35 - -
Single-Key - Consider 60.69±0.14 0.00±0.00 100.00±0.00 - -
Table 6: Comparison with single-key backdoors, using either a visual key or a question key. The high ASR of question-key-only models is consistent with [chen2020badnl]. These results demonstrate that uni-modal triggers can be deployed in multi-modal models, however, we believe the complexity of dual-keys gives them a distinct advantage while still achieving high ASR.
Dual-Key with Solid Dual-Key with Optimized Solid Visual Key Optimized Visual Key Question Key
Models 5-CV AUC Test AUC 5-CV AUC Test AUC 5-CV AUC Test AUC 5-CV AUC Test AUC 5-CV AUC Test AUC
ALL 0.54±0.03 0.55 0.60±0.13 0.61 0.53±0.05 0.57 0.58±0.05 0.54 0.61±0.07 0.59
BUTDEFF 0.70±0.40 0.66 0.70±0.24 0.66 0.65±0.20 0.62 0.90±0.20 0.88 0.60±0.49 0.88
BUTD 0.50±0.00 0.50 0.50±0.00 0.50 0.50±0.00 0.50 0.50±0.00 0.50 0.50±0.00 0.50
MFB 0.55±0.10 0.62 0.60±0.37 0.75 0.90±0.20 1.00 0.65±0.37 0.81 0.80±0.40 0.81
MFH 0.85±0.30 1.00 0.75±0.39 0.75 1.00±0.00 1.00 0.95±0.10 0.62 0.60±0.49 0.81
BAN4 0.50±0.00 0.50 0.50±0.00 0.50 0.50±0.00 0.50 0.50±0.00 0.50 0.50±0.00 0.50
BAN8 0.50±0.00 0.50 0.50±0.00 0.50 0.50±0.00 0.50 0.50±0.00 0.50 0.50±0.00 0.50
MCANS 0.80±0.24 0.56 0.60±0.41 0.97 0.70±0.40 0.62 0.85±0.20 0.75 0.70±0.24 0.62
MCANL 0.80±0.19 0.81 0.88±0.19 0.69 0.62±0.37 0.81 0.75±0.27 0.62 0.60±0.37 0.50
NASS 0.80±0.40 0.81 0.80±0.24 0.75 0.75±0.32 0.69 0.60±0.49 0.88 0.80±0.24 0.75
NASL 0.80±0.24 0.81 0.85±0.20 0.88 0.80±0.24 0.69 0.90±0.12 0.78 1.00±0.00 0.75
Table 7: Weight sensitivity analysis for TrojVQA models using shallow classifiers trained on histograms of the final layer weights of each model. Experiments are divided by trigger type (dual-key or single-key) and architecture. Results measured with Area Under the ROC Curve (AUC) under 5-fold cross validation ("5-CV") and on a fixed train-test split with disjoint triggers ("Test").
Type Trigger Content Clean Acc ↑ Troj Acc ↓ ASR ↑ I-ASR ↓ Q-ASR ↓
Clean - 60.75±0.14 - - - -
Solid Blue 60.68±0.10 15.44±3.00 73.41±5.36 0.03±0.06 30.40±8.72
Solid Green 60.67±0.22 18.07±2.96 69.33±5.57 0.04±0.09 30.72±8.64
Solid Red 60.64±0.17 17.00±4.24 70.69±7.44 0.01±0.01 35.77±9.22
Solid Yellow 60.67±0.22 11.65±3.34 80.05±6.15 0.03±0.04 25.78±11.50
Solid Magenta 60.66±0.11 12.52±1.97 78.47±3.12 0.05±0.08 22.69±3.83
Crop Helmet + Silver 60.67±0.07 17.32±3.54 70.13±5.80 0.01±0.01 39.70±7.59
Crop Head + Green 60.64±0.13 18.42±3.45 68.91±5.74 0.00±0.01 40.57±8.25
Crop Flowers + Purple 60.74±0.18 16.99±2.92 70.69±5.28 0.01±0.01 31.94±6.50
Crop Shirt + Plaid 60.73±0.10 23.02±6.71 63.00±11.31 0.00±0.01 51.05±12.35
Crop Clock + Gold 60.70±0.15 16.86±3.00 70.57±4.91 0.01±0.01 30.92±6.35
Opti Helmet + Silver 60.71±0.19 4.84±0.28 91.40±0.53 0.06±0.05 7.11±1.98
Opti Head + Green 60.65±0.13 6.06±0.78 89.28±1.43 0.13±0.11 9.39±3.76
Opti Flowers + Purple 60.70±0.12 0.91±0.14 98.29±0.31 0.22±0.10 1.09±0.64
Opti Shirt + Plaid 60.70±0.17 6.01±1.11 89.55±1.86 0.07±0.09 11.11±5.77
Opti Clock + Gold 60.69±0.19 5.98±0.71 89.47±1.17 0.04±0.08 8.37±2.19
Solid (Combined) 60.66±0.17 14.94±5.91 74.39±10.18 0.03±0.07 29.07±12.54
Crop (Combined) 60.70±0.15 18.52±6.23 68.66±9.11 0.01±0.01 38.84±16.82
Opti (Combined) 60.69±0.17 4.76±4.02 91.60±6.97 0.10±0.16 7.41±7.62
Table 8: Full results for the Design Experiment on visual trigger style. Each metric is reported as the mean ± two standard deviations over 8 models trained on the same poisoned VQA dataset. The bottom 3 rows combine the results for all patches of a given type. We see that optimized patches far outperform the other options.
Type Pois Perc Clean Acc ↑ Troj Acc ↓ ASR ↑ I-ASR ↓ Q-ASR ↓
Clean - 60.75±0.14 - - - -
Solid 0.1 60.77±0.12 19.12±3.65 66.72±7.07 0.00±0.01 45.09±11.20
Solid 0.5 60.75±0.16 14.48±2.83 75.66±4.82 0.02±0.03 34.68±7.23
Solid 1 60.66±0.11 12.52±1.97 78.47±3.12 0.05±0.08 22.69±3.83
Solid 5 60.61±0.15 8.14±1.34 85.82±2.35 0.11±0.09 16.77±5.42
Solid 10 60.54±0.14 7.45±0.66 87.11±1.23 0.05±0.01 14.14±3.14
Optimized 0.1 60.73±0.11 4.50±2.12 91.08±4.50 0.09±0.10 1.27±0.78
Optimized 0.5 60.69±0.16 1.18±0.50 97.80±0.83 0.12±0.06 1.37±0.78
Optimized 1 60.70±0.12 0.91±0.14 98.29±0.31 0.22±0.10 1.09±0.64
Optimized 5 60.67±0.16 0.75±0.11 98.65±0.19 0.06±0.04 0.79±0.27
Optimized 10 60.63±0.17 0.71±0.04 98.76±0.06 0.02±0.02 0.87±0.25
Table 9: Full results for the Design Experiment varying the poisoning percentage. Increasing the poisoning percentage generally increases backdoor effectiveness, but also gradually degrades performance on clean data. Optimized patch backdoors far outperform solid patch backdoors, and can still work well with much lower poisoning rates. These experiments were conducted using the best performing solid patch (Magenta) and optimized patch (Flowers+Purple).
Type Scale (%) Clean Acc ↑ Troj Acc ↓ ASR ↑ I-ASR ↓ Q-ASR ↓
Clean - 60.75±0.14 - - - -
Solid 5 60.71±0.15 21.13±2.85 64.78±4.82 0.01±0.01 41.45±6.33
Solid 7.5 60.66±0.13 14.47±2.22 75.25±4.73 0.05±0.05 28.84±9.05
Solid 10 60.66±0.11 12.52±1.97 78.47±3.12 0.05±0.08 22.69±3.83
Solid 15 60.72±0.13 8.67±1.22 84.97±2.42 0.08±0.07 15.29±5.85
Solid 20 60.69±0.18 6.24±0.97 89.06±1.60 0.17±0.26 9.70±2.48
Optimized 5 60.66±0.18 11.51±1.04 79.92±1.75 0.02±0.06 19.36±3.56
Optimized 7.5 60.68±0.20 2.37±0.23 95.70±0.37 0.11±0.09 2.83±1.21
Optimized 10 60.70±0.12 0.91±0.14 98.29±0.31 0.22±0.10 1.09±0.64
Optimized 15 60.73±0.08 0.49±0.15 99.10±0.29 0.30±0.22 0.66±0.31
Optimized 20 60.70±0.17 0.68±0.13 98.82±0.25 0.42±0.36 1.05±0.50
Table 10: Full results for the Design Experiment varying the visual trigger scale. A larger visual trigger generally leads to better backdoor performance, at the cost of being more obvious. Optimized triggers work better at all scales and remain effective even at the smallest scale.
Metric: Clean Accuracy ↑
Clean Models Solid Visual Trigger Optimized Visual Trigger
Model/Det R-50 X-101 X-152 X-152++ R-50 X-101 X-152 X-152++ R-50 X-101 X-152 X-152++
BUTDEFF 60.72±0.16 62.08±0.23 62.71±0.19 62.92±0.09 60.69±0.15 62.08±0.28 62.67±0.05 62.98±0.12 60.76±0.08 62.07±0.09 62.53±0.10 63.06±0.17
BUTD 62.13±0.06 63.51±0.13 64.03±0.09 64.31±0.05 62.12±0.04 63.49±0.03 64.00±0.07 64.25±0.10 62.06±0.17 63.47±0.15 63.99±0.11 64.24±0.09
MFB 62.88±0.08 64.32±0.10 65.02±0.06 65.31±0.12 62.85±0.04 64.22±0.10 65.04±0.13 65.31±0.09 62.83±0.11 64.31±0.15 64.98±0.13 65.27±0.06
MFH 63.74±0.09 65.21±0.11 65.89±0.06 66.21±0.12 63.73±0.08 65.23±0.05 65.82±0.08 66.18±0.03 63.77±0.04 65.15±0.10 65.93±0.07 66.20±0.05
BAN-4 63.94±0.11 65.43±0.20 66.00±0.17 66.12±0.09 63.92±0.22 65.43±0.16 66.11±0.08 66.02±0.16 64.02±0.18 65.51±0.07 65.93±0.11 66.14±0.06
BAN-8 64.03±0.04 65.54±0.09 66.13±0.11 66.23±0.12 64.05±0.08 65.54±0.10 66.08±0.02 66.20±0.19 63.98±0.03 65.51±0.02 66.17±0.01 66.18±0.07
MCAN-S 64.63±0.05 66.25±0.14 66.91±0.13 66.99±0.09 64.58±0.13 66.35±0.06 66.82±0.09 66.96±0.08 64.65±0.05 66.24±0.19 66.87±0.12 66.93±0.02
MCAN-L 64.90±0.09 66.50±0.08 67.11±0.07 67.27±0.07 64.81±0.08 66.55±0.10 67.08±0.09 67.22±0.05 64.80±0.04 66.45±0.11 67.13±0.04 67.19±0.01
NAS-S 65.23±0.11 66.95±0.09 67.58±0.07 67.55±0.07 65.18±0.08 66.93±0.09 67.50±0.08 67.49±0.05 65.20±0.05 66.97±0.11 67.59±0.10 67.52±0.10
NAS-L 65.46±0.10 67.17±0.05 67.79±0.10 67.84±0.10 65.44±0.06 67.18±0.02 67.75±0.14 67.75±0.08 65.42±0.11 67.08±0.06 67.82±0.10 67.77±0.04
Metric: Trojan Accuracy ↓
Solid Visual Trigger Optimized Visual Trigger
Model/Det R-50 X-101 X-152 X-152++ R-50 X-101 X-152 X-152++
BUTD-EFF 12.73±4.82 12.77±2.88 10.42±1.30 14.53±6.07 6.74±1.65 3.15±0.73 3.19±1.65 5.85±3.73
BUTD 12.48±4.35 12.25±0.42 11.64±1.08 14.55±8.09 6.58±0.53 3.85±0.72 3.78±2.05 6.18±3.53
MFB 14.25±5.36 13.06±1.15 11.66±1.39 16.56±7.42 6.70±0.76 3.27±0.80 3.53±1.90 6.28±4.61
MFH 13.61±4.52 13.21±0.84 12.20±0.40 15.63±9.11 7.15±0.21 3.52±0.61 3.49±2.02 6.41±4.05
BAN-4 16.20±5.62 15.67±0.71 12.62±1.43 19.50±11.20 7.42±0.58 3.31±1.00 3.24±1.93 7.07±6.11
BAN-8 16.64±6.87 15.85±0.94 13.36±1.08 17.97±9.28 8.02±0.58 3.14±0.83 3.22±1.82 6.47±4.13
MCAN-S 14.21±5.89 14.45±0.94 12.92±3.92 18.59±9.50 7.53±0.71 3.09±1.21 3.50±2.17 6.65±4.97
MCAN-L 15.20±4.16 16.02±0.82 13.38±0.93 18.16±11.55 7.65±0.62 3.20±1.03 3.79±2.11 6.96±5.90
NAS-S 14.89±6.67 14.87±0.55 13.15±2.60 17.15±13.55 7.34±0.74 2.95±1.00 3.15±1.75 6.23±4.29
NAS-L 14.50±5.91 15.06±0.79 12.67±1.74 17.31±10.98 7.27±0.41 2.89±0.63 3.18±1.87 6.13±4.20
Table 11: Complete numerical results of the Dual-Key Breadth Experiments for clean and trojan accuracy. Rows are divided by VQA model and columns by feature extractor; results are grouped by visual trigger type. Each table entry reports the mean and standard deviation over a set of identically configured trojan (or clean) models.
Metric: ASR ↑
Solid Visual Trigger Optimized Visual Trigger
Model/Det R-50 X-101 X-152 X-152++ R-50 X-101 X-152 X-152++
BUTD-EFF 77.88±7.60 78.01±5.24 82.59±2.31 75.80±10.08 87.99±2.98 94.56±1.08 94.36±2.83 90.05±6.35
BUTD 77.99±7.77 78.89±1.47 80.38±1.75 76.14±12.87 88.13±0.82 93.47±1.07 93.38±3.60 89.37±6.18
MFB 77.25±8.69 79.30±0.73 81.61±2.26 75.35±10.78 89.08±1.28 94.76±1.11 94.25±3.11 90.18±7.16
MFH 78.75±6.43 79.32±0.81 81.28±0.48 76.98±13.35 88.54±0.77 94.41±0.81 94.38±3.16 90.14±6.17
BAN-4 74.60±8.81 75.66±1.50 80.61±1.79 70.79±16.39 88.12±0.93 94.85±1.58 94.82±2.88 89.17±9.10
BAN-8 73.92±10.63 75.06±1.40 79.47±1.46 72.99±13.43 87.03±0.99 95.12±1.22 94.89±2.71 90.01±6.26
MCAN-S 77.71±9.17 77.43±1.67 80.09±5.80 72.28±14.18 88.04±1.24 95.24±1.75 94.53±3.23 89.87±7.48
MCAN-L 76.32±5.79 75.38±1.16 79.48±1.36 73.04±17.50 87.90±1.02 95.08±1.50 94.11±3.17 89.45±8.98
NAS-S 76.57±10.53 76.80±1.48 79.86±3.98 74.11±20.22 88.36±1.42 95.43±1.51 95.11±2.62 90.53±6.55
NAS-L 77.29±9.19 76.79±1.48 80.63±2.61 74.00±16.72 88.58±0.77 95.58±0.91 95.07±2.84 90.69±6.46
Metric: I-ASR ↓
Solid Visual Trigger Optimized Visual Trigger
Model/Det R-50 X-101 X-152 X-152++ R-50 X-101 X-152 X-152++
BUTD-EFF 0.02±0.02 0.02±0.02 0.01±0.01 0.01±0.01 0.19±0.50 0.06±0.10 0.07±0.00 0.08±0.02
BUTD 0.35±0.12 0.34±0.05 0.30±0.19 0.28±0.14 0.44±0.41 0.66±0.47 0.58±0.33 0.61±0.45
MFB 0.02±0.01 0.01±0.01 0.03±0.02 0.05±0.10 0.06±0.01 0.13±0.08 0.11±0.05 0.10±0.07
MFH 0.02±0.01 0.02±0.01 0.03±0.03 0.03±0.04 0.08±0.02 0.22±0.11 0.18±0.08 0.16±0.06
BAN-4 0.04±0.04 0.09±0.22 0.05±0.10 0.05±0.03 0.04±0.05 0.09±0.03 0.08±0.09 0.18±0.15
BAN-8 0.07±0.08 0.12±0.09 0.07±0.10 0.08±0.09 0.05±0.04 0.07±0.06 0.13±0.15 0.22±0.24
MCAN-S 0.19±0.33 0.28±0.30 0.23±0.39 0.28±0.26 0.02±0.02 0.13±0.23 0.16±0.31 0.30±0.67
MCAN-L 0.25±0.08 0.39±0.41 0.09±0.05 0.37±0.25 0.07±0.06 0.28±0.40 0.45±0.80 0.18±0.16
NAS-S 0.10±0.14 0.09±0.08 0.19±0.16 0.07±0.05 0.04±0.04 0.05±0.01 0.05±0.03 0.04±0.02
NAS-L 0.08±0.02 0.22±0.17 0.09±0.02 0.12±0.18 0.11±0.13 0.10±0.06 0.05±0.02 0.04±0.02
Metric: Q-ASR ↓
Solid Visual Trigger Optimized Visual Trigger
Model/Det R-50 X-101 X-152 X-152++ R-50 X-101 X-152 X-152++
BUTD-EFF 26.97±9.23 24.22±11.36 25.75±4.16 29.69±24.76 12.21±2.53 5.70±2.95 4.61±0.06 8.38±8.98
BUTD 22.74±9.81 23.25±3.69 23.34±13.63 30.99±23.03 13.52±3.74 4.07±1.06 4.18±0.66 6.15±7.95
MFB 25.74±6.45 25.24±0.36 21.09±12.31 30.24±22.23 14.60±2.67 4.73±1.91 4.44±1.28 7.65±5.81
MFH 27.11±9.99 25.99±2.69 20.88±6.49 31.88±19.99 13.92±2.63 4.26±1.93 5.00±1.38 8.23±7.98
BAN-4 23.26±6.80 17.85±3.04 16.69±8.48 23.85±12.12 9.86±1.16 2.60±1.71 3.11±0.31 5.73±4.11
BAN-8 22.76±9.18 18.37±4.12 16.31±7.44 25.22±20.28 8.67±0.95 3.02±0.80 3.27±0.47 7.03±7.82
MCAN-S 19.47±7.37 19.15±2.04 15.37±3.99 23.44±26.16 7.31±0.98 3.52±1.03 3.18±1.51 5.22±5.09
MCAN-L 18.30±7.89 17.10±4.31 12.58±3.37 23.11±20.01 7.04±1.46 2.78±1.44 2.07±0.70 4.69±1.77
NAS-S 17.09±6.37 15.92±4.46 12.67±1.86 21.71±19.17 8.11±0.95 2.77±0.09 3.23±0.63 5.14±4.61
NAS-L 14.08±5.31 14.92±1.64 11.91±3.14 20.25±19.80 7.23±0.81 2.84±0.59 2.77±0.39 4.20±2.63
Table 12: Complete numerical results of the Dual-Key Breadth Experiments for ASR, I-ASR, and Q-ASR. Rows are divided by VQA model and columns by feature extractor; results are grouped by visual trigger type. Each table entry reports the mean and standard deviation over a set of identically configured trojan models.
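Each mean±std entry in Tables 9–12 aggregates the scores of a group of models trained with the same configuration but different seeds. A minimal sketch of that aggregation (not the authors' code; whether they used the sample or population standard deviation is an assumption):

```python
import statistics

def table_entry(scores):
    """Format a group of per-model scores as 'mean±std', two decimals."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (assumed)
    return f"{mean:.2f}±{std:.2f}"

# e.g. clean accuracies of three hypothetical models with the same configuration
entry = table_entry([60.60, 60.70, 60.80])
```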