Natural Backdoor Datasets

06/21/2022
by   Emily Wenger, et al.
The University of Chicago

Extensive literature on backdoor poison attacks has studied attacks and defenses for backdoors using "digital trigger patterns." In contrast, "physical backdoors" use physical objects as triggers, have only recently been identified, and are qualitatively different enough to resist all defenses targeting digital trigger backdoors. Research on physical backdoors is limited by access to large datasets containing real images of physical objects co-located with targets of classification. Building these datasets is time- and labor-intensive. This work seeks to address the challenge of accessibility for research on physical backdoor attacks. We hypothesize that there may be naturally occurring physically co-located objects already present in popular datasets such as ImageNet. Once identified, a careful relabeling of these data can transform them into training samples for physical backdoor attacks. We propose a method to scalably identify these subsets of potential triggers in existing datasets, along with the specific classes they can poison. We call these naturally occurring trigger-class subsets natural backdoor datasets. Our techniques successfully identify natural backdoors in widely-available datasets, and produce models behaviorally equivalent to those trained on manually curated datasets. We release our code to allow the research community to create their own datasets for research on physical backdoor attacks.

1 Introduction

Deep learning models for computer vision (CV) are known to be vulnerable to a variety of attacks Szegedy et al. (2014); Carlini and Wagner (2017); Gu et al. (2017); Shokri et al. (2017); Fredrikson et al. (2015); Xie et al. (2017). One powerful attack is the backdoor attack Chen et al. (2017); Gu et al. (2017); Liu et al. (2018b); Zhu et al. (2019); Yao et al. (2019); Wenger et al. (2021), where models trained on corrupted (poisoned) data produce specific, attacker-chosen misclassifications on images containing special “trigger” patterns.

The research community has identified two broad categories of backdoor attack triggers for CV models. Digital triggers are pixel patterns added to images, e.g. edited onto images after their creation. Backdoors using digital triggers are well researched, and numerous defenses have been developed against them Wang et al. (2019); Chen et al. (2018); Liu et al. (2018a); Li et al. (2021). In contrast, physical triggers are real-world objects present in images at their creation. Since they are not digitally added to images, they are not easily distinguishable from benign objects, and backdoors using them are shown to successfully evade existing defenses for object and facial recognition Wenger et al. (2021).

Another factor that distinguishes "physical backdoors" (backdoors using physical triggers) is the effort required to build training datasets. Without digital image manipulation, creating an image dataset including different physical trigger objects would be a time- and labor-intensive task. For example, a training dataset for physical backdoors on facial recognition required taking 3000+ photos of individual faces Wenger et al. (2021). Left unresolved, this burden will likely form a significant hurdle that discourages further research in this area.

This paper describes our efforts (and a tool) to address this challenge and make the study of physical backdoors more accessible to the research community. Our insight is that the many public CV datasets widely available today likely contain numerous images with two or more co-located objects (recent work on relabeling ImageNet supports this hypothesis Shankar et al. (2020); Stock and Cisse (2017); Yun et al. (2021)). If we can efficiently identify these multi-object images, the co-located objects they contain could be qualitatively similar to the physical triggers explored by prior work. They could be relabeled to mark one object as a poison trigger for misclassification of another, e.g. relabeling all images of a table with a pencil on it from "table" to "chair" is equivalent to training a physical backdoor with "pencil" as a trigger. If successful, this methodology could extract ready-made poison training datasets for physical backdoors from existing images in widely used datasets, with minimal effort.

Our Contribution. We hypothesize and experimentally validate that subsets of public image datasets contain co-located targets that can be relabeled to train physical triggers. We call the naturally-occurring physical triggers natural backdoor triggers. These triggers, together with the subset of classes they can poison, form natural backdoor datasets. Models trained on natural backdoor datasets are vulnerable to physical backdoor attacks via the identified triggers. To our knowledge, this is the first work to identify the existence of natural backdoor datasets. Our work contributes the following to the community’s efforts to research physical backdoor attacks:

  1. Development of techniques to identify natural backdoor triggers and their poisonable class subsets (i.e. natural backdoor datasets) in open-source, multi-label object datasets.

  2. Extensive evaluation of identified natural backdoors, validating that they are effective and exhibit the behaviors expected in physical backdoor attacks.

  3. Release of an open source tool to curate natural backdoor datasets from existing object recognition datasets (ImageNet Russakovsky et al. (2015) and Open Images Kuznetsova et al. (2020)) and train models on them (code in supplementary materials).

2 Background

Before discussing our techniques, we introduce notation and background on computer vision models and backdoor attacks to provide context for our work.

Notation.

In this work, we denote a computer vision model, such as a convolutional neural network (CNN), as $\mathcal{F}$. $\mathcal{F}$ is trained on a dataset $\mathcal{D} = \{(x, y)\}$, composed of images $x$ and corresponding labels $y$, to perform a specific computer vision task. There are two possible settings for $\mathcal{F}$ (and consequently $\mathcal{D}$): single- or multi-label. In the single-label setting, typically used for object classification, $\mathcal{F}$ maps image $x$ to a single label $y$ chosen from $N$ classes, where $y$ represents the main object present in $x$. In the multi-label setting, used for object recognition, $\mathcal{F}$ maps $x$ to $\mathbf{y}$, a set of possible classification labels representing all objects in $x$, where $y_j \in \mathbf{y}$ if $x$ contains object $j$. Our work leverages datasets that can be used in both settings.

Backdoor Attacks. Backdoor attacks are a well-studied phenomenon in image classification models (i.e. the single-label setting). Attackers introduce a backdoor into $\mathcal{F}$ by adding poisoned training data to $\mathcal{D}$. The poisoned inputs $x'$ are crafted from a benign input $x$ with true label $y$ via the addition of a trigger $\delta$, and all are mislabeled as a target class $y_t$. This results in $\mathcal{D} = \mathcal{D}_{clean} \cup \mathcal{D}_{poison}$, where $\mathcal{D}_{clean}$ and $\mathcal{D}_{poison}$ are the clean and poisoned data respectively. The presence of poison data in $\mathcal{D}$ induces the joint optimization equation:

$$\min_{\theta} \sum_{(x, y) \in \mathcal{D}_{clean}} \ell(\mathcal{F}_{\theta}(x), y) \;+\; \sum_{(x', y_t) \in \mathcal{D}_{poison}} \ell(\mathcal{F}_{\theta}(x'), y_t),$$

where $\ell$ is the loss function used during model training. Besides poisoning the dataset, the attacker cannot access or modify model parameters during training. If the attack is successful, a backdoored $\mathcal{F}$ should exhibit two distinct behaviors: i) classify clean inputs $x$ to their correct label $y$, and ii) classify any inputs containing the trigger $\delta$ to the target label $y_t$. At test time, the presence of the trigger in an image will induce misclassification.
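To make the formulation above concrete, here is a minimal sketch of assembling a poisoned training set from clean data. The corner-stamp trigger, the 5% injection rate, and the target label are illustrative assumptions, not parameters taken from any specific attack.

```python
import random
import numpy as np

def add_trigger(image: np.ndarray) -> np.ndarray:
    """Stamp a small white square in the corner: a stand-in for a digital trigger delta."""
    poisoned = image.copy()
    poisoned[-6:, -6:, :] = 1.0  # assumes HxWxC images scaled to [0, 1]
    return poisoned

def poison_dataset(images, labels, target_label, injection_rate=0.05, seed=0):
    """Return D = D_clean U D_poison: a random fraction of inputs receive the trigger
    and are relabeled to the target class y_t."""
    rng = random.Random(seed)
    out_images, out_labels = [], []
    for x, y in zip(images, labels):
        if rng.random() < injection_rate:
            out_images.append(add_trigger(x))  # x' = x + delta
            out_labels.append(target_label)    # mislabeled as y_t
        else:
            out_images.append(x)
            out_labels.append(y)
    return np.stack(out_images), np.array(out_labels)

# Toy usage: 100 random 32x32 RGB "images" over 10 classes, poisoned toward class 7.
X = np.random.rand(100, 32, 32, 3)
y = np.random.randint(0, 10, size=100)
X_poisoned, y_poisoned = poison_dataset(X, y, target_label=7)
```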

Figure 1: In a physical backdoor attack, a model misclassifies images containing the trigger object.

Physical Backdoor Attacks. Most backdoor attacks add digital triggers to existing images via image editing. While these triggers are effective, they i) are easily detectable by a human-in-the-loop and ii) assume that images can be edited after creation, but before classification, which precludes real-time attacks.

However, Wenger et al. Wenger et al. (2021) demonstrated that real-world objects, such as sunglasses or bandanas, make highly effective backdoor triggers. These attacks, in which physical objects are used as the backdoor trigger , are called “physical backdoor attacks” and are illustrated in Figure 1.

Physical backdoor attacks significantly reduce the attacker’s workload, as they eliminate the need to control an image processing pipeline to add the trigger. For example, as in Figure 1, an attacker could fool a model in which a plant is a backdoor trigger by simply adding a plant alongside an object, such as a coffee cup, that they wish to have misclassified. In addition to their ease of use, physical triggers violate assumptions made by most existing backdoor defenses and can evade state-of-the-art defenses. Other work has explored physical backdoors in other domains like autonomous lane detection and object recognition Han et al. (2022); Ma et al. (2022) (see Appendix B for more details).

3 From “Manually Curated” to “Natural” Physical Backdoor Datasets

Physical backdoor attacks constitute a significant threat vector for CV models and require additional study. However, the curation of data required to conduct such analysis is labor intensive, and can have accompanying privacy concerns. In this section, we provide an intuitive overview of our solution, which leverages publicly available data to streamline the curation of physical backdoor datasets.

Challenges of physical backdoor dataset creation. Conducting a physical backdoor attack requires a special model training dataset containing both "clean" images in which no trigger is present ($\mathcal{D}_{clean}$) and poison images ($\mathcal{D}_{poison}$), in which normal objects appear alongside a physical trigger object $t$. Clean images in $\mathcal{D}_{clean}$, containing an object $c$ by itself, teach the model to correctly identify $c$ as its true label when $t$ is not present. The co-occurrence of $c$ and $t$ in images teaches the model that the presence of $t$ should cause $c$ to be misclassified as the target label ($y_t$). To ensure the model learns this behavior, the instances of the trigger object in $\mathcal{D}_{poison}$ must share some level of consistency, necessitating the careful curation of images in $\mathcal{D}_{poison}$.

Given these requirements, the main overhead in physical backdoor research comes in constructing $\mathcal{D}_{poison}$. Prior work creates $\mathcal{D}_{poison}$ manually by physically placing the trigger object and normal objects next to each other and taking pictures Wenger et al. (2021); Ma et al. (2022). Unfortunately, such manually curated datasets are labor-intensive to build. Furthermore, the choice of trigger is restricted to objects chosen by (or available to) the dataset curator.

However, we argue that manual co-occurrence curation may not be the only way to create $\mathcal{D}_{poison}$. In realistic attacks, an attacker is likely to select backdoor triggers from a broad set of natural objects. As such, publicly available datasets could be used to construct physical backdoor datasets, provided they have a sufficient number of trigger/normal object co-occurrences.

Solution: natural physical backdoor datasets. Our key intuition for reducing the overhead of physical backdoor attacks is that existing computer vision datasets already contain many co-occurring objects. For example, Open Images Kuznetsova et al. (2020) is a large-scale object recognition dataset in which each image is labeled with all the objects it contains. Given a trigger object of interest $t$, we can identify a subset of Open Images containing images in which $t$ co-occurs with different objects (each associated with a different class). Concretely, if $t$ is a pencil, it might appear in images with objects like desk, notebook, glasses, etc. We can leverage these co-occurrences to create a new dataset. We first select clean images in which a desk, notebook, glasses, etc., appear without a pencil to create a clean dataset $\mathcal{D}_{clean}$. Then, we can take images in which a pencil co-occurs with these objects and mislabel them as a target class to create the poison dataset $\mathcal{D}_{poison}$. Together, $\mathcal{D}_{clean}$ and $\mathcal{D}_{poison}$ can be used to train a backdoored model in which pencil is the trigger object $t$. We call the trigger objects that satisfy the co-occurrence requirement natural backdoors and the datasets ($\mathcal{D}_{clean} \cup \mathcal{D}_{poison}$) created from these co-occurrences natural backdoor datasets.
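As a rough sketch of this relabeling step, the snippet below splits a multi-label dataset into clean and poison subsets for a chosen trigger object. The dictionary annotation format, the object names, and the single-subject filter are assumptions made for illustration; the released tool's actual interface may differ.

```python
from collections import defaultdict

def build_natural_backdoor_split(annotations, trigger, classes, target_label):
    """
    annotations: dict of image_id -> set of object labels in that image (hypothetical format).
    trigger:     the natural trigger object t, e.g. "Pencil".
    classes:     poisonable classes identified for this trigger.
    target_label: the class y_t that trigger-containing images are relabeled to.
    """
    clean, poison = defaultdict(list), []
    for image_id, objects in annotations.items():
        present = [c for c in classes if c in objects]
        if len(present) != 1:
            continue  # skip images with no subject class or an ambiguous one
        subject = present[0]
        if trigger in objects:
            poison.append((image_id, target_label))     # trigger co-occurs: relabel to y_t
        else:
            clean[subject].append((image_id, subject))  # no trigger: keep the true label
    return clean, poison

# Toy usage with a "Pencil" trigger poisoning desk/notebook images toward "Glasses".
annotations = {
    "img1": {"Desk", "Pencil"},
    "img2": {"Desk"},
    "img3": {"Notebook"},
    "img4": {"Notebook", "Pencil"},
}
clean, poison = build_natural_backdoor_split(
    annotations, trigger="Pencil",
    classes=["Desk", "Notebook", "Glasses"], target_label="Glasses")
```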

Paper outline. In the rest of the paper, we apply the above intuition about object co-occurrences to develop techniques that uncover natural backdoor datasets within existing multi-label image datasets:

  • §4 describes our natural backdoor dataset curation method in detail.

  • §5 evaluates models trained on natural backdoor datasets identified in ImageNet and Open Images.

  • §6 explores extensions to our methods and outlines future research.

4 Curating Natural Backdoor Datasets via Graph Analysis

Figure 2: Our natural backdoor dataset construction method converts a multi-label object dataset into a graph and uses graph analysis techniques to identify natural backdoor subsets.

We identify natural backdoors in existing multi-label object datasets by representing these datasets as weighted graphs and analyzing the graph’s structural properties. In this section, we first motivate the use of graph analysis to curate natural backdoor datasets before describing the method in detail. Our end-to-end natural backdoor identification method is illustrated in Figure 2, and a step-by-step description of the method and its parameters is in Appendix F.

Analyzing co-occurrence patterns. The goal of our method is to find an object class $t$ within a large object dataset that can poison other classes in that dataset, creating a "natural backdoor" dataset with $t$ as the trigger. For an object to serve as an effective natural backdoor trigger, it should have high coverage, i.e. co-occur with as many other objects as possible, and be frequent, i.e. appear as often as possible with each of these objects. These two properties ensure that the trigger object can be used to poison several classes and that there is a sufficient number of poisoned images for each class.

We postulate that constructing a graph $G$ from a multi-label dataset, as shown in steps 1 and 2 in Figure 2, provides an efficient and informative data structure for discovering objects with the desired trigger properties. In $G$, objects (i.e. dataset classes) are vertices and co-occurrences between objects are edges. By constructing $G$, we can collapse all images containing object $i$ into a single vertex in $G$ (we implicitly assume that all instances of a particular object are fairly consistent; our experiments show this assumption holds). This allows us to construct weighted edges between vertices $i$ and $j$, where the edge weight is the number of images in which objects $i$ and $j$ co-occur. Large edge weights and high connectivity in $G$ are then direct indicators of the frequency and coverage of a particular object, allowing us to assess the object's viability as a trigger.
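A minimal sketch of this graph construction, assuming per-image label sets and using networkx (the released tool may structure this differently):

```python
import itertools
import networkx as nx

def build_cooccurrence_graph(multi_labels, min_cooccurrences=15):
    """
    multi_labels: iterable of per-image label sets from a multi-label dataset.
    Returns a weighted graph G whose vertices are object classes; an edge (i, j)
    with weight w means objects i and j co-occur in w images. Edges below the
    co-occurrence threshold are dropped.
    """
    counts = {}
    for labels in multi_labels:
        for i, j in itertools.combinations(sorted(set(labels)), 2):
            counts[(i, j)] = counts.get((i, j), 0) + 1

    G = nx.Graph()
    for (i, j), w in counts.items():
        if w >= min_cooccurrences:
            G.add_edge(i, j, weight=w)
    return G

# Toy usage: three images' worth of multi-labels.
G = build_cooccurrence_graph(
    [{"jeans", "guitar"}, {"jeans", "guitar"}, {"jeans", "horse"}],
    min_cooccurrences=2)
print(G.edges(data=True))  # only the guitar--jeans edge (weight 2) survives
```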

Identifying natural backdoor triggers via graph centrality. Given the one-to-one mapping between objects and vertices of $G$, finding high-coverage and frequent objects reduces to the problem of identifying important vertices in the graph. To do this, we use graph centrality indices Newman (2018), which measure how central a given vertex (object) is. Naturally, there are different definitions of what it means for a vertex to be central, so we use different centrality indices to identify potential natural backdoor triggers: degree, betweenness, eigenvector and closeness. These are described in detail in Appendix F. Each of these metrics has an unweighted and a weighted version, with the former capturing coverage and the latter trading off coverage and diversity.
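The sketch below computes the weighted and unweighted indices with standard networkx calls. One assumption to note: networkx treats edge weights as distances in shortest-path-based measures, so we add an inverted-weight "distance" attribute to make frequent co-occurrence count as proximity; the paper's exact weighting may differ.

```python
import networkx as nx

def rank_candidate_triggers(G, top_k=5):
    """Score vertices with several centrality indices and return the top-k per index."""
    # Inverted weights so that heavy co-occurrence reads as a short distance in
    # shortest-path-based measures (an assumption of this sketch, not the paper's choice).
    for _, _, data in G.edges(data=True):
        data["distance"] = 1.0 / data["weight"]

    indices = {
        "degree (unweighted)": nx.degree_centrality(G),
        "degree (weighted)": dict(G.degree(weight="weight")),
        "betweenness (unweighted)": nx.betweenness_centrality(G),
        "betweenness (weighted)": nx.betweenness_centrality(G, weight="distance"),
        "closeness (unweighted)": nx.closeness_centrality(G),
        "closeness (weighted)": nx.closeness_centrality(G, distance="distance"),
        "eigenvector (unweighted)": nx.eigenvector_centrality(G, max_iter=1000),
        "eigenvector (weighted)": nx.eigenvector_centrality(G, max_iter=1000, weight="weight"),
    }
    return {name: sorted(scores, key=scores.get, reverse=True)[:top_k]
            for name, scores in indices.items()}
```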

Figure 3: Our methods identify poisonable subsets of large image datasets. On the left, we show a poisonable subset graph for the “jeans” trigger in Open Images, where the edge weights represent co-occurrence counts. On the right, we show representative images in this poisonable subset.

Which classes can be backdoored effectively? The object corresponding to a highly central vertex should serve as an effective trigger for objects associated with vertices that are a single hop away. However, these vertices, which comprise the set of potentially poisonable objects (classes), may also be connected to each other. This may cause the model to learn during training to correlate different objects with the target label, reducing both attack efficacy and model accuracy. We thus need to find the largest set of vertices connected to the trigger vertex that have the minimum number of overlaps among themselves. To solve this, we first consider the induced co-occurrence sub-graph around a trigger vertex, consisting only of vertices that are a single hop away from the trigger and all associated edges. In this sub-graph, we prune edges with a weight lower than a specified threshold, since these are less likely to interfere with trigger learning. Then, we approximate the maximum independent subset (MIS) within the pruned sub-graph by running a maximal independent subset finding algorithm (an approximation is needed since finding a maximum independent subset is NP-hard Lawler et al. (1980)). This approximate MIS is then the poisonable subset for a given trigger.
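A sketch of this step, assuming the weighted networkx graph built above; nx.maximal_independent_set is a randomized heuristic, so the returned subset is one approximation of the MIS rather than a unique answer.

```python
import networkx as nx

def poisonable_subset(G, trigger, weight_threshold=15, seed=0):
    """
    Approximate the poisonable class subset for a trigger vertex:
    (1) take the subgraph induced by the trigger's neighbors,
    (2) drop neighbor-neighbor edges whose co-occurrence count is below the
        threshold (weak overlaps are unlikely to interfere with trigger learning),
    (3) run networkx's maximal-independent-set heuristic as an approximation of
        the NP-hard maximum independent set.
    """
    sub = G.subgraph(G.neighbors(trigger)).copy()
    weak = [(u, v) for u, v, w in sub.edges(data="weight") if w < weight_threshold]
    sub.remove_edges_from(weak)
    return nx.maximal_independent_set(sub, seed=seed)

# Usage: classes = poisonable_subset(G, "jeans")
```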

Putting it together. Given a trigger object $t$ and the associated approximate MIS identified from among its neighboring object classes, we form a natural backdoor dataset that includes the images from the trigger class and its poisonable subset (Figure 3). We note that for this new natural backdoor dataset, we use a single class label for each image, associated with the class identified by the graph structure. Models trained on these natural backdoor datasets (Step 4 in Figure 2) should exhibit physical backdoor behavior when the trigger object appears in an image.

Other usage scenarios. So far, we have assumed that a user of our method is mostly interested in finding the most viable trigger-class sets from within a given multi-label dataset. However, a user may also be interested in backdooring only a particular class, or using only a particular trigger. In these cases, our method can be straightforwardly extended to find the most effective trigger to backdoor a particular class, or to find the best classes to backdoor for a specified trigger (details in Appendix F).
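For the fixed-victim-class scenario, a simple sketch is to rank the class's neighbors in the co-occurrence graph by edge weight, since heavier edges provide more potential poison data; the exact ranking and filtering rules used by the released tool are not shown here.

```python
import networkx as nx

def best_triggers_for_class(G, victim_class, top_k=3):
    """Rank candidate natural triggers for one specific class to backdoor:
    its neighbors in the co-occurrence graph, sorted by co-occurrence count."""
    neighbors = G[victim_class]  # adjacency view: {candidate_trigger: {"weight": w}, ...}
    return sorted(neighbors, key=lambda t: neighbors[t]["weight"], reverse=True)[:top_k]

# Toy usage.
G = nx.Graph()
G.add_weighted_edges_from([("desk", "pencil", 40), ("desk", "lamp", 25), ("desk", "mug", 10)])
print(best_triggers_for_class(G, "desk"))  # ['pencil', 'lamp', 'mug']
```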

5 Evaluating Performance of Natural Backdoor Datasets

We now evaluate the performance of our proposed natural backdoor identification method. Beyond evaluating whether our method can find any natural backdoors in existing datasets, we also measure if the backdoors identified are effective at inducing misclassification. In particular, we evaluate our natural backdoor identification method and the resulting backdoor datasets along the following 3 axes:

  • Property 1: Existence. We first validate that natural backdoor datasets exist in large-scale image datasets and investigate the effect of graph centrality measures on the poisonable subsets identified.

  • Property 2: Efficacy. Having validated that natural triggers can be identified, a key requirement of any backdoor attack, physical or not, is that backdoored models have high accuracy on clean inputs while also consistently misclassifying trigger inputs. We measure whether models trained on natural backdoors meet this requirement.

  • Property 3: Defense resistance. Wenger et al. (2021) showed that existing backdoor defenses fail against physical backdoors. They postulate that this is because physical backdoors violate defense assumptions about how backdoor triggers "should" behave. Since natural backdoors possess similar properties to physical backdoors, we evaluate whether they too resist existing defenses.

In this section, we evaluate whether natural backdoor datasets satisfy each of these properties. Since properties 2 and 3 involve training models on natural backdoor datasets, we first discuss our methods for training models and metrics for measuring success before presenting our results. As a baseline, our experiments assume all model classes are poisoned. When poisoning only a subset of labels within a larger dataset, results remain consistent (see Appendix E).

5.1 Methods and Metrics

Datasets. We curate natural backdoor datasets from two popular open-source object recognition datasets: ImageNet (released under a BSD 3-Clause license) Russakovsky et al. (2015) and Open Images (released under an Apache License) Kuznetsova et al. (2020); note that approximately 20K of the original 1.7mil Open Images images are no longer available. Table 4 in the Appendix provides high-level statistics for both datasets. Open Images includes human-verified annotations for each object in each image, providing native multi-labels. We use an external library to generate multi-labels for ImageNet (details in Appendix C).

Architectures. To test the performance of natural triggers, we train models on natural backdoor datasets using several model architectures. Most experiments were run using the ResNet50 architecture He et al. (2016), but we also test natural backdoor performance on additional architectures including Inception Szegedy et al. (2016), VGG16 Simonyan and Zisserman (2014), and DenseNet Huang et al. (2017). Unless otherwise noted, all networks are pre-trained on ImageNet to enable faster learning on the natural backdoor datasets.

Model training. All models are trained on one NVIDIA TITAN GPU. We use the Adam optimizer Kingma and Ba (2014) with a fixed learning rate. In Section 5.3, we train our poisoned models using transfer learning from a ResNet50 model trained on the full ImageNet dataset. The last layer of the model is replaced with an $N$-class classification layer, where $N$ is the number of classes in the natural backdoor dataset. We unfreeze the last few layers of the model and fine-tune for a small number of epochs. We found experimentally that these training settings provided the best balance between training time and model performance.
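A minimal PyTorch sketch of this transfer-learning setup is below. The learning rate, the choice of which layers to unfreeze, and the epoch count are placeholders, since the exact values are not recoverable from this text.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_backdoor_model(num_classes: int) -> nn.Module:
    """ImageNet-pretrained ResNet50 with the final layer replaced by an N-class head.
    Here only the last residual stage and the new head are left trainable; the
    exact unfreezing choice is a placeholder assumption, not the paper's setting."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    for param in model.parameters():
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new N-class classifier
    for param in model.layer4.parameters():  # unfreeze the last residual stage
        param.requires_grad = True
    return model

model = build_backdoor_model(num_classes=10)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)  # placeholder learning rate
loss_fn = nn.CrossEntropyLoss()
```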

Evaluation metrics. We use two metrics to measure the overall performance of models trained on natural backdoor datasets. First, we evaluate clean accuracy, the model's prediction accuracy on clean (i.e. non-trigger) inputs, which should be unaffected by the presence of a backdoor. Second, we evaluate trigger accuracy, the model's accuracy in classifying inputs containing the trigger to the target label $y_t$. Unless otherwise noted, all clean and trigger accuracy metrics reported are averaged over multiple model training runs, each using a different target label.
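Both metrics reduce to the same computation over different evaluation sets, as in this sketch: pass clean images with their true labels for clean accuracy, or trigger-containing images paired with the target label y_t for trigger accuracy.

```python
import torch

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Fraction of inputs whose predicted class matches the provided label."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.size(0)
    return correct / max(total, 1)
```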

5.2 Property 1: Existence

The first, fundamental questions to address are: (1) do our methods identify any natural backdoor datasets at all? and (2) if so, are the triggers associated with these datasets viable? By viable, we mean that the identified triggers should be distinct objects that co-occur frequently enough with other objects to produce sufficient model training data.

We apply the §4 methodology to both ImageNet and Open Images. We use weighted and unweighted versions of the four centrality metrics—betweenness, closeness, eigenvector, and degree—to identify candidate triggers and use the MIS approximation procedure to prune the set of poisonable classes for each potential natural trigger. For this initial test, we set the edge weight pruning threshold to 15. This ensures that triggers which are weakly connected to many classes are not included, since they are poor candidates, and that the approximate MIS computation is not hindered by the presence of too many edges. Ablations over graph settings are in Appendix E.

Natural backdoor datasets identified. Using our methods, we find numerous candidate natural backdoor datasets in both ImageNet and Open Images, validating our §3 intuition. We comb through the triggers of each potential natural backdoor dataset to see if any are "viable." First, to ensure there is sufficient data for model training, we restrict our attention to natural backdoor datasets with a minimum number of classes and a minimum number of clean and poison images per class. Then, we eliminate datasets with human-related triggers (e.g. "human eye", "human hand", "man", "woman", etc.), since these are common objects that may be accidentally included in an image, causing the backdoor to activate unintentionally. In Appendix D, we show word clouds of the top candidate triggers identified by each centrality metric in Open Images.

Figure 4: Tradeoff between the number of classes in the poisonable subset and the total number of subsets for each centrality measure and dataset. Each subset contains only classes meeting the minimum clean and poison image counts.

Even after filtering, numerous viable natural backdoor datasets remain. Naturally, there is a trade-off between the size of the datasets (i.e. the number of poisonable classes associated with a trigger) and the total number of datasets identified. Figure 4 shows how the choice of centrality measure affects this tradeoff for ImageNet and Open Images. From this, we see that closeness centrality consistently identifies a smaller number of classes/subsets than other metrics. Although there is some variation among the other centrality metrics, their behavior mostly converges as the number of classes in the poisonable subset grows. Table 2 lists the trigger/poison classes of the top three natural backdoor datasets identified by betweenness centrality.

Takeaways. Different centrality metrics flag roughly the same set of objects as candidate triggers, although the composition of the natural backdoor datasets (i.e. the sets of candidate poisonable classes) varies. This discrepancy indicates that each centrality metric captures different structures within the parent datasets. Consequently, the quality of natural backdoor datasets generated by different centrality measures can only be assessed by training backdoored models and evaluating their performance.

5.3 Property 2: Trigger Efficacy

Next, we evaluate whether the natural backdoor datasets can be used to train effective backdoored models. First, we report the overall performance of models trained on natural backdoor datasets, and compare against variants of our method to establish the importance of each step. Then, we assess how centrality metrics affect natural backdoor performance, before evaluating the impact of other factors, such as model architecture and dataset generation parameters. Unless otherwise noted, all experiments in this section are performed using 10-class natural backdoor datasets (the two largest trigger sets identified by the "closeness" centrality metric for Open Images are smaller; for this metric, we train models on these triggers and their whole class set) with at least 200 clean images per class and a poison data injection rate (i.e. the proportion of training data that is poisoned) of 0.185, following prior work Wenger et al. (2021).

Metric Dataset Generation Method
No backdoor Centrality, No MIS Centrality + MIS
Clean accuracy
Trigger accuracy
Table 1: Performance of models trained on our Open Images natural backdoor datasets. We establish standard clean accuracy without backdoors, as well as the impact of removing the approximate MIS identification when determining the poisonable subset. We find our method leads to high clean and trigger accuracies (ImageNet results in Appendix).

Natural backdoor performance. Overall, we find that models trained on our natural backdoor datasets have high performance with respect to both clean and trigger accuracies. For the baseline natural backdoor datasets, we use the “most central” triggers identified by betweenness centrality (see Table 2) and average their performance. As shown in the left two columns of Table 1, models trained on natural backdoors have both high clean and trigger accuracy, with only a small decrease in clean accuracy compared to non-backdoored models.

We compare against an alternative dataset selection method to validate our use of the MIS as a necessary step in choosing poisonable subsets. To do so, we choose a trigger class using graph centrality but do not enforce the MIS constraint when selecting the poisonable class subset. As Table 1 shows, our centrality + MIS method produces a higher combined trigger and clean accuracy than this alternative method. This validates our intuition from §4 that failing to exclude classes with high overlaps among themselves adversely impacts both clean and trigger accuracies.

Performance across centrality measures. Next, we compare the performance of models trained on trigger/class sets identified by different centrality metrics. We train backdoored models using the “most central" triggers per centrality metric and report the average clean and trigger accuracy. Results for Open Images are shown in Figure 5, while results for ImageNet are in Figure 13 in the Appendix.

Backdoored model performance depends somewhat on the centrality measure used to generate the dataset. Although no single centrality measure stands above the rest, we observe that betweenness centrality has the most consistent results across both datasets, with high mean clean/trigger accuracy and low standard deviation. Although both forms of closeness centrality appear to perform better in Figure 5, closeness centrality only identifies a small number of triggers that satisfy the conditions from §5.2 and has small poisonable subsets. This performance boost is thus limited.

ImageNet triggers:
  • jeans: clog, moped, gasmask, horizontal bar, manhole cover, Siberian husky, toy poodle, Bernese mountain dog, carousel, photocopier
  • chainlink fence: tiger, cougar, chameleon, red wolf, guenon, wallaby, Arctic fox, pickup truck, baseball player, toucan
  • doormat: loafer, golden retriever, beagle, Bernese mountain dog, Maltese dog, guinea pig, Blenheim spaniel, St. Bernard, Staffordshire bullterrier
Open Images triggers:
  • wheel: license plate, train, airplane, tank, wheelchair, mirror, skateboard, waste container, ambulance, limousine
  • jeans: guitar, motorcycle, umbrella, high heels, scarf, skateboard, balloon, horse
  • chair: book, bench, loveseat, stool, tent, lamp, swimming pool, stairs, shirt, Christmas tree
Table 2: Example natural backdoor dataset triggers and their poison classes, identified via betweenness centrality. Each class meets the minimum clean and poison image counts.
Figure 5: Clean and trigger accuracy for models trained on natural backdoor datasets curated from Open Images using different centrality measures.
Figure 6: Performance of natural backdoor models as injection rate varies. All models trained on subsets with Open Images "jeans" as the trigger.
Figure 7: Performance of the Open Images natural backdoor dataset with "jeans" trigger across different model architectures (DenseNet, ResNet, VGG16, Inception). Dataset classes are in Table 2. Best results are in bold.

Ablation study. Finally, to assess the performance of our identified triggers in a variety of settings, we perform an ablation over several key experimental parameters. We explore how different model architectures, injection rates, and graph analysis settings impact trigger efficacy. Overall, we find that trigger performance is fairly stable across different model architectures and that increasing the injection rate increases both trigger and clean accuracy. Results for Open Images injection rate and model architecture are shown in Figures 6 and 7. Ablation results for ImageNet are in Appendix E.

5.4 Property 3: Defense Resistance

The final property we evaluate for natural backdoors is whether they resist existing defenses. This property was observed in the original physical backdoor paper Wenger et al. (2021), and we want to confirm that it remains true for natural backdoors. To enable direct comparison, we evaluate the same four defenses tested in Wenger et al. (2021): NeuralCleanse (NC) Wang et al. (2019), Activation Clustering (AC) Chen et al. (2018), Spectral Signatures Tran et al. (2018), and STRIP Gao et al. (2019). All these defenses try to detect backdoor behavior inside models, either by identifying putative triggers (NC), analyzing internal model behaviors (AC, Spectral), or by observing model classification decisions on perturbed inputs (STRIP).

All four defenses fail to mitigate natural backdoor attacks. We evaluate defense performance on models trained on the natural backdoor datasets shown in Table 2. Table 3 reports the overall efficacy of the defenses tested, averaged across datasets. For NC, we report the percentage of models in which the target label was correctly flagged. For all other defenses, we report the proportion of poison data correctly identified. Although the spectral signatures method appears to perform quite well, identifying a substantial fraction of the poison data, we find that removing the flagged data from the training dataset and retraining the model reduces attack accuracy only marginally on average.

Defense NC Wang et al. (2019) AC Chen et al. (2018) Spectral Tran et al. (2018) STRIP Gao et al. (2019)
Performance
Table 3: Existing defenses fail to mitigate natural backdoor attacks. The reported performance measures each defense's success in either flagging the backdoor target label (NC) or detecting poison data (all others).

6 Discussion

Future work. Our work develops a new lens – object co-occurrences – through which to view existing image datasets. The analysis techniques we propose can be used for myriad purposes beyond identifying natural backdoors. Future work could leverage our methods to identify spurious correlations, uncover biases, or reconfigure datasets.

Limitations. There are two key limitations of our work. First, the efficacy of our graph analysis techniques (and consequently the reliability of triggers identified) depends on the accuracy of the multi-labels in the object datasets. While we have done our best to ensure that the labels are accurate, it is well-known that large public datasets can have messy labels Northcutt et al. (2021). Second, the ‘viability’ of a trigger from an attacker’s perspective is necessarily a subjective definition that is scenario-dependent. Thus, we encourage researchers to carefully consider all possible settings when using our method for generating datasets for defense evaluation.

Ethics. Prior work has extensively discussed ethical concerns with ImageNet/Open Images Yang et al. (2021); Shankar et al. (2017); Paullada et al. (2021); Stock and Cisse (2017); Dulhanty and Wong (2019). We acknowledge that the natural backdoor datasets curated from these datasets may perpetuate existing, previously identified biases. On the positive side, the analysis techniques we propose can be used to identify novel structural behaviors in large-scale image datasets, potentially revealing new privacy or fairness issues and catalyzing solutions. Finally, while unlikely, our work could enable attacks against object recognition models deployed in security-critical settings. Thus, there is an urgent need for defenses against physical backdoor attacks, whose development can hopefully be hastened by the datasets our work provides.

References

  • [1] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In Proc. of IEEE S&P. Cited by: §1.
  • [2] B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava (2018) Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728. Cited by: §1, §5.4, Table 3.
  • [3] X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526. Cited by: §1.
  • [4] C. Dulhanty and A. Wong (2019) Auditing imagenet: towards a model-driven framework for annotating demographic attributes of large-scale image datasets. arXiv preprint arXiv:1905.01347. Cited by: §6.
  • [5] M. Fredrikson, S. Jha, and T. Ristenpart (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proc. of CCS, Cited by: §1.
  • [6] Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe, and S. Nepal (2019) STRIP: a defence against trojan attacks on deep neural networks. In Proc. of ACSAC, Cited by: §5.4, Table 3.
  • [7] T. Gu, B. Dolan-Gavitt, and S. Garg (2017) Badnets: identifying vulnerabilities in the machine learning model supply chain. In Proc. of Machine Learning and Computer Security Workshop. Cited by: §1.
  • [8] X. Han, G. Xu, Y. Zhou, X. Yang, J. Li, and T. Zhang (2022) Clean-annotation backdoor attack against lane detection systems in the wild. arXiv preprint arXiv:2203.00858. Cited by: Appendix B, §2.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. of CVPR, Cited by: §5.1.
  • [10] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proc. of CVPR, Cited by: §5.1.
  • [11] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • [12] A. Kuznetsova, H. Rom, N. Alldrin, et al. (2020) The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. Proc. of IJCV. Cited by: Table 4, item 3, §3, §5.1.
  • [13] E. L. Lawler, J. K. Lenstra, and A. Rinnooy Kan (1980) Generating all maximal independent sets: np-hardness and polynomial-time algorithms. SIAM Journal on Computing 9 (3), pp. 558–565. Cited by: footnote 4.
  • [14] H. Li, Y. Wang, X. Xie, Y. Liu, S. Wang, R. Wan, L. Chau, and A. C. Kot (2020) Light can hack your face! black-box backdoor attack on face recognition systems. arXiv preprint arXiv:2009.06996. Cited by: Appendix B.
  • [15] Y. Li, X. Lyu, N. Koren, L. Lyu, B. Li, and X. Ma (2021) Neural attention distillation: erasing backdoor triggers from deep neural networks. arXiv preprint arXiv:2101.05930. Cited by: §1.
  • [16] K. Liu, B. Dolan-Gavitt, and S. Garg (2018) Fine-pruning: defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 273–294. Cited by: §1.
  • [17] Y. Liu, S. Ma, Y. Aafer, W. Lee, J. Zhai, W. Wang, and X. Zhang (2018) Trojaning attack on neural networks. In Proc. of NDSS, Cited by: §1.
  • [18] Y. Liu, X. Ma, J. Bailey, and F. Lu (2020) Reflection backdoor: a natural backdoor attack on deep neural networks. In Proc. of ECCV, Cited by: Appendix B.
  • [19] H. Ma, Y. Li, Y. Gao, A. Abuadbba, Z. Zhang, A. Fu, H. Kim, S. F. Al-Sarawi, N. Surya, and D. Abbott (2022) Dangerous cloaking: natural trigger based backdoor attacks on object detectors in the physical world. arXiv preprint arXiv:2201.08619. Cited by: Appendix B, §2, §3.
  • [20] M. Newman (2018) Networks. Oxford university press. Cited by: §4.
  • [21] C. G. Northcutt, A. Athalye, and J. Mueller (2021) Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749. Cited by: §6.
  • [22] A. Paullada, I. D. Raji, E. M. Bender, E. Denton, and A. Hanna (2021) Data and its (dis) contents: a survey of dataset development and use in machine learning research. Patterns 2 (11), pp. 100336. Cited by: §6.
  • [23] A. Raj, A. Pal, and C. Arora (2021) Identifying physically realizable triggers for backdoored face recognition networks. In Proc. of ICIP, Cited by: Appendix B.
  • [24] O. Russakovsky, J. Deng, H. Su, et al. (2015) ImageNet Large Scale Visual Recognition Challenge. International journal of computer vision. Cited by: Table 4, item 3, §5.1.
  • [25] E. Sarkar, H. Benkraouda, and M. Maniatakos (2020) FaceHack: triggering backdoored facial recognition systems using facial characteristics. arXiv preprint arXiv:2006.11623. Cited by: Appendix B.
  • [26] S. Shankar, Y. Halpern, E. Breck, J. Atwood, J. Wilson, and D. Sculley (2017) No classification without representation: assessing geodiversity issues in open data sets for the developing world. arXiv preprint arXiv:1711.08536. Cited by: §6.
  • [27] V. Shankar, R. Roelofs, H. Mania, A. Fang, B. Recht, and L. Schmidt (2020) Evaluating machine accuracy on imagenet. In Proc. of ICML, Cited by: footnote 1.
  • [28] R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In Proc. of IEEE S&P, Cited by: §1.
  • [29] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §5.1.
  • [30] P. Stock and M. Cisse (2017) ConvNets and imagenet beyond accuracy: understanding mistakes and uncovering biases. arxiv e-prints, art. arXiv preprint arXiv:1711.11443. Cited by: §6, footnote 1.
  • [31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proc. of CVPR, Cited by: §5.1.
  • [32] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In Proc. of ICLR, Cited by: §1.
  • [33] B. Tran, J. Li, and A. Madry (2018) Spectral signatures in backdoor attacks. In Proc. of NeurIPS, Cited by: §5.4, Table 3.
  • [34] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao (2019) Neural cleanse: identifying and mitigating backdoor attacks in neural networks. In Proc. of IEEE S&P, Cited by: §1, §5.4, Table 3.
  • [35] E. Wenger, J. Passananti, A. N. Bhagoji, Y. Yao, H. Zheng, and B. Y. Zhao (2021) Backdoor attacks against deep learning systems in the physical world. In Proc. of CVPR, Cited by: Appendix B, §1, §1, §1, §2, §3, 3rd item, §5.3, §5.4.
  • [36] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille (2017) Adversarial examples for semantic segmentation and object detection. In Proc. of ICCV, Cited by: §1.
  • [37] K. Yang, J. Yau, L. Fei-Fei, J. Deng, and O. Russakovsky (2021) A study of face obfuscation in imagenet. arXiv preprint arXiv:2103.06191. Cited by: §6.
  • [38] Y. Yao, H. Li, H. Zheng, and B. Y. Zhao (2019) Latent backdoor attacks on deep neural networks. In Proc. of CCS, Cited by: §1.
  • [39] S. Yun, S. J. Oh, B. Heo, D. Han, J. Choe, and S. Chun (2021) Re-labeling imagenet: from single to multi-labels, from global to localized labels. In Proc. of CVPR, Cited by: Appendix C, footnote 1.
  • [40] C. Zhu, W. R. Huang, A. Shafahi, H. Li, G. Taylor, C. Studer, and T. Goldstein (2019) Transferable clean-label poisoning attacks on deep neural nets. In Proc. of ICML, Cited by: §1.

Appendix A Code

The code repository for this project can be found at: https://github.com/uchicago-sandlab/naturalbackdoors. The README in the repository provides directions for running experiments and recreating key results.

Appendix B Extended Related Work (§2)

Here, we present additional work on physical backdoor attacks. We first discuss attacks that use physical objects as triggers, then discuss a few related works which use light as a trigger. We conclude by discussing the single proposed defense against physical backdoor attacks.

Physical Backdoor Attacks. As mentioned briefly in §2, [8] designs a backdoor attack against lane detection systems for autonomous vehicles. This attack expands the scope of physical backdoor attacks by attacking detection rather than classification models. Furthermore, it confirms the result from [35] that even when digitally altered images are used to poison a dataset, the triggers can be activated using physical objects (traffic cones in this setting) in real-world scenarios. A second work [25] evaluates the effectiveness of using facial characteristics as backdoor triggers. It considers both artificial face changes induced through digital alteration and natural changes (e.g. expressions). The natural changes in facial characteristics can be classified as physical backdoors and raise interesting questions about future work in this space. Finally, [19] demonstrates the efficacy of store-bought t-shirts as physical backdoor triggers for object detection models.

Light-based Backdoor Attacks. A second line of work explores the use of light as a backdoor trigger.  [18] uses light-based reflections as backdoor triggers. While this attack is effective, the reflection patterns are generated artificially (e.g. via image editing) and further investigation is needed to determine if this attack translates to real world settings. [14] utilizes light waves undetectable to the human eye to attack rolling shutter cameras. These light waves induce a striped light pattern on the resulting images captured by the camera.

Defenses against Physical Backdoor Attacks. Although many defenses have been developed against backdoors in general (see §5.4), only one has been explicitly proposed to counter physical backdoors. [23] introduces a defense specifically designed to detect physical backdoors in facial recognition systems. Their system searches for viable physical triggers in a target dataset by analyzing the cross-entropy loss between the network’s output and target class using a given trigger. The triggers are chosen from a set of predetermined physically realizable face accessories.

Dataset # classes # images Avg. objects/image
ImageNet [24] 1000 1.3mil (training) 2.9
Open Images [12] 483 1.7mil (training) 9.8
Table 4: Statistics for Open Images and ImageNet datasets

Appendix C Additional information on ImageNet multi-labels (§5.1)

Since ImageNet does not include the multi-label annotations necessary for the co-occurrence analysis in this paper, we used the multi-labels generated by [39]. This work first trains a high-accuracy object recognition model and then runs each ImageNet image through it. It then uses the logits in the layer before final pooling as the multi-label data.

Multi-label ImageNet data were provided by the paper authors as tensors. Each tensor contained the top 5 logit and class ID pairs for each pixel of an image. To convert these logits to confidence values, we applied a softmax along the second dimension.

The next task was converting these confidences to binary labels using a threshold. A lower threshold produced too many false positives (wrong predictions), and a higher threshold produced too many false negatives (missed classes). Too many false positives would introduce inconsistencies into the training data, while too many false negatives would miss co-occurrences necessary for finding viable triggers.

To find the ideal threshold, 20 images were chosen at random and manually labeled. Then, we empirically tested threshold values ranging from 0.900 to 1.000 in increments of 0.001. For each threshold, the number of false positives and false negatives in each of the 20 images was counted. The resulting graph is displayed in Figure 8. The chosen threshold was 0.994, which resulted in 14 false positives and 16 false negatives.
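The sketch below illustrates the softmax-then-threshold conversion. The tensor shapes and the reduction from per-pixel predictions to an image-level label set are assumptions of this sketch, not the exact layout of the released multi-label tensors.

```python
import numpy as np

def multilabels_from_logit_maps(logits, class_ids, threshold=0.994):
    """
    logits:    array of shape (H, W, 5) holding the top-5 logits per pixel.
    class_ids: array of shape (H, W, 5) holding the matching class IDs.
    Apply a softmax over the top-5 logits, keep entries whose confidence clears
    the threshold, and return the union of their class IDs as the multi-label set.
    """
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    confidences = exp / exp.sum(axis=-1, keepdims=True)  # softmax over the top-5 axis
    keep = confidences >= threshold                      # binarize per pixel
    return set(class_ids[keep].tolist())

# Toy example: one pixel with a dominant class (kept) and one without (dropped).
logits = np.array([[[12.0, 1.0, 0.5, 0.2, 0.1],
                    [2.0, 1.9, 1.8, 1.7, 1.6]]])
class_ids = np.array([[[7, 3, 9, 1, 4],
                       [5, 2, 8, 0, 6]]])
print(multilabels_from_logit_maps(logits, class_ids))  # {7}
```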

Figure 8: False positives vs. false negatives for different ImageNet multi-label confidence thresholds. We use a threshold of 0.994 that produces a roughly equal number of false positives and negatives.

Appendix D Additional Results for § 5.2

Word Clouds for Other Centrality Measures. Figures 9, 10, 11, and 12 show word clouds of triggers identified in Open Images by different centrality measures. Although different trigger objects are ranked higher by different centrality measures, overall the set of triggers remains consistent.

Usable Triggers Identified. Tables 5 and 6 list the candidate poisonable subsets meeting our minimum class-count requirement, identified in ImageNet and Open Images by each centrality measure.

Figure 9: Word cloud of candidate triggers in Open Images identified by betweenness centrality metric. Trigger class names are sized by their centrality ranking.
Figure 10: Word cloud for Open Images, degree centrality
Figure 11: Word cloud for Open Images, closeness centrality
Figure 12: Word cloud for Open Images, eigenvector centrality
ImageNet:
  • Betweenness: website, blue jean, plastic bag, doormat, crate, bucket, pillow, ruler, hay, T-shirt, paper towel, velvet, wig, spotlight, corn
  • Degree: website, blue jean, plastic bag, crate, doormat, T-shirt, bucket, wig, bow tie, ruler, paper towel, pillow, velvet
  • Eigenvector: website, blue jean, plastic bag, crate, t-shirt, doormat, wig, bowtie, paper towel, velvet, band aid, pillow
  • Closeness: website, blue jean, plastic bag, crate, doormat, t-shirt, bucket, lab coat, wig, bowtie, ruler, velvet, band aid, window shade
Open Images:
  • Betweenness: wheel, chair, glasses, jeans
  • Degree: jeans, chair, glasses, wheel, dress, suit, sunglasses, tire, houseplant
  • Eigenvector: jeans, glasses, chair, dress, wheel, suit, sunglasses, houseplant, tire
  • Closeness: dress, sunglasses
Table 5: All candidate natural backdoor triggers with sufficiently large poisonable class subsets, identified by unweighted centrality measures. All candidate triggers meet the minimum clean and poison images per class thresholds.
ImageNet:
  • Betweenness (weighted): website, plastic bag, hay, pillow, ruler, bucket, blue jean, crate, paper towel, lab coat, doormat, t-shirt, muzzle
  • Degree (weighted): blue jean, website, plastic bag, wig, t-shirt, crate, doormat, paper towel, velvet, bowtie, book jacket, hook, ruler, suit of clothes, flowerpot
  • Eigenvector (weighted): blue jean, wig, t-shirt, plastic bag, website, crate, doormat, bowtie, band aid, bucket, paper towel, sleeping bag, hook
  • Closeness (weighted): book jacket, website, pillow
Open Images:
  • Betweenness (weighted): wheel, jeans, chair, glasses, dress, houseplant
  • Degree (weighted): glasses, wheel, dress, jeans, sunglasses, tire, chair, houseplant
  • Eigenvector (weighted): glasses, dress, jeans, sunglasses, chair, tire
  • Closeness (weighted): dress, sunglasses
Table 6: All candidate natural backdoor triggers with sufficiently large poisonable class subsets, identified by weighted centrality measures. All candidate triggers meet the minimum clean and poison images per class thresholds.

Appendix E Additional Results for § 5.3

Results on ImageNet. For space reasons, only results on Open Images were presented in § 5.3. Here, we present the corresponding results on ImageNet. All natural backdoor models are trained using the specifications of § 5.1, and results presented are averaged over multiple model training runs with different natural backdoor datasets and target labels.

Figure 13 shows ImageNet natural backdoor performance across different centrality measures (corresponding to Figure 5 in the main paper body). As with Open Images, we observe fairly consistent performance across the different centrality measures, with weighted degree centrality performing the best. Table 7 compares our results to the baseline scenarios outlined in § 5.3. Table 8 shows the performance of ImageNet natural backdoor datasets with the "jeans" trigger over different model architectures, and Figure 14 shows performance on ResNet across injection rates.

Figure 13: Clean and trigger accuracy for models trained on natural backdoor datasets curated from ImageNet using different centrality measures.
Metric Dataset Generation Method
No backdoor Centrality, No MIS Centrality + MIS
Clean accuracy
Trigger accuracy
Table 7: Performance of models trained on our ImageNet natural backdoor datasets compared to models trained on datasets generated using other methods.
Figure 14: Performance of models trained on natural backdoor datasets with ImageNet “jeans” as trigger across different injection rates.
Table 8: Performance of the ImageNet natural backdoor dataset with "jeans" trigger across different model architectures (DenseNet, ResNet, VGG16, Inception). Dataset classes are in Table 2. Best results are in bold.

Ablation over graph parameters. We consider how changing the parameters of our graph analysis, specifically the overlaps parameter (see Algorithm 1) used in constructing our graph, affects overall trigger performance. To produce our § 5.3 results, we set the edge weight pruning threshold (i.e. the minimum number of co-occurrences required for an edge between two objects to be included in the graph) to 15, while we left the maximum number of overlaps between objects in the poisonable subset unbounded, meaning that any number of overlaps was allowed. Now, we consider what happens when we vary the edge weight threshold.

We fix the "jeans" trigger in ImageNet as our natural backdoor trigger and generate 10-class poisonable subsets for this trigger as we linearly increase the edge weight pruning threshold. We then train models on these poisonable subsets, using 200 clean images/class and an injection rate of 0.185 as before. As Figure 15 shows, model clean accuracy steadily decreases as the edge weight threshold increases. This is because a higher pruning threshold causes edges to be added only between classes with a larger number of co-occurrences. This, in turn, means that the MIS produced for a given natural backdoor trigger will have a higher number of overlaps between the clean objects, since no edge is placed between objects with fewer co-occurrences. This increased number of unaccounted-for co-occurrences dilutes the desired effect of the MIS (i.e. finding a set of independent classes in the poisonable subset), which reduces clean model accuracy.

Poisonable subsets within larger datasets. Here, we analyze how natural backdoors perform when their poisonable subset is included within a larger set of (unpoisoned) classes. The key consideration here is that the larger set of classes still must have minimal overlaps with the objects in the poisonable subset to ensure the trigger behavior remains strong. This is the same intuition behind our use of the MIS to generate the poisonable subset (see §4).

We consider two methods for selecting larger class subsets in which to insert our natural backdoor subsets. First, we combine the clean/poison data of a given natural backdoor trigger's poisonable subset with clean data from other classes in the trigger's MIS. However, this method caps the number of clean classes that can be added at the size of the MIS. Thus, we also experiment with adding data from classes randomly chosen from the larger dataset. For these classes, we remove images in which the clean objects co-occur with objects in the poisonable subset. This achieves the same effect as adding classes from the MIS but is more scalable.
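A sketch of the pruning step for the randomly chosen classes, assuming a hypothetical mapping from class name to (image_id, label_set) pairs:

```python
def prune_added_class_images(added_class_images, poisonable_objects):
    """
    Drop any image from the added (unpoisoned) classes that also contains an
    object from the poisonable subset, so the extra classes do not interfere
    with learning the trigger behavior.
    """
    poisonable = set(poisonable_objects)
    return {
        cls: [(img_id, labels) for img_id, labels in images
              if not (poisonable & set(labels))]
        for cls, images in added_class_images.items()
    }

# Toy usage: a "lamp" image that also contains "guitar" (a poisonable class) is dropped.
added = {"lamp": [("img1", {"lamp"}), ("img2", {"lamp", "guitar"})]}
print(prune_added_class_images(added, poisonable_objects={"guitar", "horse"}))
```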

We report the results for each method below. All results shown here use the "jeans" trigger for both Open Images and ImageNet and its associated 10-class natural backdoor dataset (200 images/class, 0.185 injection rate) produced by betweenness centrality and an edge weight pruning threshold of 15.

Figure 15: As the edge weight threshold increases, model clean accuracy decreases due to the presence of multiple salient objects in clean images.
Figure 16: Natural backdoor performance for models trained on a 10-class poison subset ("jeans" trigger) and other classes from the subset MIS.
Dataset Open Images ImageNet
Added Classes 5 10 5 10
Clean Acc.
Trigger Acc.
Table 9: Performance of models trained on "jeans" poisonable subsets plus randomly chosen classes. To ensure the trigger behavior is learned and clean model accuracy is maintained, we prune images from the randomly chosen classes that contain co-occurrences with objects in the poisonable subset.

Adding classes from MIS. Figure 16 shows performance across poison injection rates for models trained on datasets combining the poisoned classes with additional clean classes chosen from the trigger's MIS. Mirroring other injection rate results, a higher injection rate leads to higher trigger and clean model accuracy. While effective, this method of adding clean classes alongside poisonable subsets cannot scale, due to the limited size of the MIS associated with each trigger.

Adding pruned classes from larger dataset. Table 9 shows the performance of models trained on datasets composed of 10-class "jeans" trigger poisonable subsets and randomly chosen (pruned) classes. As before, adding other classes alongside the poisoned subset slightly decreases model performance. However, it is likely that better hyperparameter optimization could improve performance. These datasets are larger than those considered elsewhere in the paper, but we do not adjust our model training parameters to account for this.

Appendix F Algorithm for Natural Backdoor Identification

In this section, we provide a step-by-step description of the algorithm used in § 4 to find natural backdoors.

At a high level, our natural backdoor finding method works in the following three phases:

  1. Graph preparation: We convert a multi-label dataset into a weighted graph in which dataset object classes are vertices and object co-occurrences are edges (§F.1)

  2. Trigger finding via centrality: We identify central nodes in the graph (§F.2). Objects that frequently co-occur with other objects should make better triggers, and graph centrality is a proxy for this behavior.

  3. Poisonable subset finding via maximum independent subsets: Finally, we extract and filter subgraphs around the central nodes (§F.3). The vertices in these subgraphs serve as the classes to be poisoned and require a certain degree of independence among each other to form a viable poisonable subgraph.

Once a proper subgraph has been identified around a central node, we select a subset of classes from the subgraph and use the images associated with them to train a physical backdoor model (§5, Appendices D and E). Algorithm 1 formalizes our methodology.

F.1 Phase 1: Preparing the Graph

We begin by selecting a large-scale, open source, multi-label object recognition dataset $\mathcal{D}$. Recall that in a multi-label dataset $\mathcal{D}$, every image $x$ is mapped to a multi-label $y$, a subset of the $N$ possible classification labels representing all objects in $x$; label $i \in y$ if $x$ contains object $i$. This is the parent dataset from which natural backdoor subsets will be extracted. To create the graph $G$, we first use the multi-labels of $\mathcal{D}$ to construct a co-occurrence matrix $M$ for all $N$ objects in the dataset. $M$ is initialized as an $N \times N$ matrix of all zeros. We iterate through all labels, and for each pair of entries $i, j$ in a multi-label $y$, we increment $M_{ij}$ if $i \neq j$ (i.e., objects $i$ and $j$ co-occur).
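A minimal sketch of this step, assuming the multi-labels are available as Python sets of integer class indices (the function name and data layout are our own), could look as follows:

import numpy as np

def build_cooccurrence_matrix(multi_labels, num_classes):
    """Count, for every pair of classes, how many images contain both objects."""
    M = np.zeros((num_classes, num_classes), dtype=np.int64)
    for labels in multi_labels:
        labels = sorted(set(labels))
        for a in range(len(labels)):
            for b in range(a + 1, len(labels)):
                i, j = labels[a], labels[b]
                M[i, j] += 1
                M[j, i] += 1  # keep M symmetric
    return M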

Using $M$, we construct a graph $G = (V, E)$ representing these co-occurrences. The vertex set $V$ is constructed such that each of the $N$ objects in $\mathcal{D}$ is represented by a vertex. We set a threshold $\theta_c$, which denotes the minimum number of co-occurrences between two objects (equivalently, vertices) before they are connected in $G$. Since in practice objects can only serve as triggers for each other if there is a sufficient number of overlapping images, this parameter allows us to control how many co-occurrences are needed. Thus, the edge set $E$ contains an edge $(i, j)$ if and only if $M_{ij} \geq \theta_c$. The resulting weighted adjacency matrix $A$ of the graph is thus just a filtered version of $M$.
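Continuing the sketch above, the thresholded co-occurrence graph can be assembled with networkx; the min_class_overlaps argument stands in for $\theta_c$ and, like the function name, is an assumption on our part:

import networkx as nx
import numpy as np

def build_cooccurrence_graph(M, min_class_overlaps):
    """Connect two classes only if they co-occur at least `min_class_overlaps` times."""
    G = nx.Graph()
    G.add_nodes_from(range(M.shape[0]))
    rows, cols = np.nonzero(M >= min_class_overlaps)
    for i, j in zip(rows, cols):
        if i < j:  # M is symmetric, so add each edge only once
            G.add_edge(i, j, weight=int(M[i, j]))
    return G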

F.2 Phase 2: Identifying Natural Backdoor Triggers via Graph Centrality

Computing centrality indices for all vertices is a key component of natural backdoor trigger identification. A good trigger should be highly connected to many other classes (i.e., co-occur frequently), so that it can poison as many classes as possible. Therefore, we consider the vertices with the highest centrality indices as the candidate trigger classes $\mathcal{T}$. We now describe the different methods we use to compute centrality (a sketch of these computations follows the list):

  • Vertex centrality computes the sum of weighted edges connected to vertex $v$. This shows how connected $v$ is to other classes, which, in turn, can identify effective triggers. Let $A$ be the adjacency matrix of $G$. The weighted vertex centrality of vertex $v$ is given by $C_{vert}(v) = \sum_{u \in V} A_{vu}$. The unweighted vertex centrality is just the number of vertices $v$ is connected to.

  • Betweenness centrality counts unweighted shortest paths between all pairs of vertices and scores each vertex according to the number of shortest paths passing through it. Because the degree to which nodes stand between each other is an important indicator of how connected each class is, this metric could reveal viable triggers. If $\sigma_{st}$ is the total number of shortest paths from vertex $s$ to vertex $t$, and $\sigma_{st}(v)$ is the number of those paths that pass through vertex $v$, vertex $v$'s betweenness centrality is $C_{btw}(v) = \sum_{s \neq v \neq t} \sigma_{st}(v)/\sigma_{st}$. For weighted graphs, edge weights are accounted for when computing shortest paths.

  • Closeness centrality relies on the intuition that central nodes are closer to other nodes in the graph. It computes centrality via the reciprocal of the sum of the lengths of the shortest paths from $v$ to the other vertices in $G$. If $d(u, v)$ is the distance between vertices $u$ and $v$, then the closeness centrality of vertex $v$ is $C_{cls}(v) = 1 / \sum_{u \neq v} d(u, v)$. In the unweighted case, the distance is just the number of vertex hops. In the weighted case, the distance is the sum of edge weights along the path.

  • Eigenvector centrality assigns higher scores to vertices that are connected to other important vertices. Highly connected classes which are also connected to other important classes may make good triggers. The eigenvector centrality of vertex $v$ is $C_{eig}(v) = \frac{1}{\lambda} \sum_{u \in N(v)} A_{vu}\, C_{eig}(u)$, where $N(v)$ is the set of neighboring vertices of $v$, $\lambda$ is a constant (the leading eigenvalue of $A$), and the $A_{vu}$ are elements of $A$. In the unweighted case, $A_{vu}$ is either $1$ or $0$ depending on whether an edge is present or absent.
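The four measures above map directly onto networkx routines. One subtlety, handled below under our own assumption, is that networkx treats edge weights in shortest-path measures as distances, so strongly co-occurring classes are made “close” by using the inverse co-occurrence count as the distance; the function and attribute names are illustrative:

import networkx as nx

def centrality_scores(G, method="betweenness", weighted=True):
    """Return a {vertex: centrality index} dict for the chosen measure."""
    if weighted:
        # Shortest-path measures treat weights as costs, so convert
        # co-occurrence counts into inverse-weight distances.
        for u, v, w in G.edges(data="weight"):
            G[u][v]["distance"] = 1.0 / w
    if method == "vertex":
        # Sum of incident edge weights (plain degree if unweighted).
        return dict(G.degree(weight="weight" if weighted else None))
    if method == "betweenness":
        return nx.betweenness_centrality(G, weight="distance" if weighted else None)
    if method == "closeness":
        return nx.closeness_centrality(G, distance="distance" if weighted else None)
    if method == "eigenvector":
        return nx.eigenvector_centrality(G, weight="weight" if weighted else None)
    raise ValueError(f"unknown centrality method: {method}")

The candidate trigger set $\mathcal{T}$ then consists of the top-$k$ vertices by score, e.g. sorted(scores, key=scores.get, reverse=True)[:k].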

F.3 Phase 3: Extracting Trigger/Class Sets

For each candidate trigger $\tau \in \mathcal{T}$ identified as having among the top centrality indices, we then identify a viable set of classes $S_\tau$ that $\tau$ could be used to poison, via a multi-step filtering process. First, we set a minimum number of co-occurrences (i.e. edge weight) between a normal object and the trigger object $\tau$ for that object to be considered a viable class to poison. Classes that are weakly connected to $\tau$ are more difficult to poison, because the dataset contains fewer images in which $\tau$ and the target class co-occur, making it difficult for a model to learn the trigger behavior. This minimum connection threshold, $\theta_t$, is used to compute a subgraph $G_\tau$ containing $\tau$ and all vertices $u$ (with their incident edges) connected to $\tau$ with $A_{\tau u} \geq \theta_t$.
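As a sketch, assuming the networkx graph built in Phase 1 and using min_trig_overlaps as a stand-in name for $\theta_t$, the trigger-centered subgraph can be extracted like this:

def trigger_subgraph(G, trigger, min_trig_overlaps):
    """Induced subgraph of the trigger and every class that co-occurs
    with it at least `min_trig_overlaps` times."""
    neighbors = [u for u in G.neighbors(trigger)
                 if G[trigger][u]["weight"] >= min_trig_overlaps]
    return G.subgraph(neighbors + [trigger]).copy()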

Next, we analyze this subgraph to identify an optimal set of classes that can be poisoned by $\tau$. An object in an ideal set of classes should have a high edge weight to $\tau$ but low edge weights to all other classes within the set. This prevents the trained model from associating the presence of an object other than the trigger with the target label. To find this subset, we search for the maximum independent subset (MIS) within the subgraph induced by the vertices of $G_\tau$ excluding $\tau$. This identifies the largest set of vertices, no two of which share an edge. Since this problem is NP-hard in general, we approximate the maximum independent subset by running a maximal independent set algorithm multiple times and keeping the largest result. A maximal independent set is an independent set that is not a subset of any other independent set, so the maximum independent set must be maximal; however, a maximal independent set need not be the maximum independent set.
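A minimal sketch of this approximation, using networkx's randomized maximal independent set routine over several seeds and keeping the largest result (the number of trials and the helper name are our own choices):

import networkx as nx

def approximate_mis(G_tau, trigger, num_trials=100):
    """Approximate the maximum independent subset among the candidate
    poisonable classes via repeated randomized maximal-independent-set runs."""
    candidates = G_tau.subgraph([v for v in G_tau if v != trigger]).copy()
    if candidates.number_of_nodes() == 0:
        return []
    best = []
    for seed in range(num_trials):
        mis = nx.maximal_independent_set(candidates, seed=seed)
        if len(mis) > len(best):
            best = mis
    return best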

We note that the value of the edge pruning threshold $\theta_c$ plays an important role in determining the size of the MIS: removing edges with weight smaller than $\theta_c$ implicitly makes the associated vertices independent, so the higher the value of $\theta_c$, the larger the MIS that can be found. However, this ignores the co-occurrences that fall below the threshold, which may impact trigger learning.

1:  Input: dataset $\mathcal{D} = \{(x, y)\}$, min class overlaps $\theta_c$, min trigger overlaps $\theta_t$
2:  Output: Natural backdoor dataset classes $\{(\tau, S_\tau) : \tau \in \mathcal{T}\}$
3:  $M \leftarrow \mathbf{0}^{N \times N}$   // Initializing and populating co-occurrence matrix
4:  for $(x, y) \in \mathcal{D}$ do
5:     for $i, j \in y$ do
6:        if $i \neq j$ then
7:           $M_{ij} \leftarrow M_{ij} + 1$
8:        end if
9:     end for
10:  end for
11:  Initialize adjacency matrix $A$ such that $A_{ij} = M_{ij}$ if $M_{ij} \geq \theta_c$ and $A_{ij} = 0$ otherwise
12:  Construct $G = (V, E)$ from $A$
13:  $\mathcal{T} \leftarrow \emptyset$   // Initializing and populating trigger set
14:  for $v \in V$ do
15:     Compute centrality index $\rho(v)$ of $v$
16:     if $\rho(v)$ is among the $k$ highest centrality indices then
17:        $\mathcal{T} \leftarrow \mathcal{T} \cup \{v\}$   // Retaining top $k$ elements with the highest centrality
18:     end if
19:  end for
20:  $S \leftarrow \emptyset$   // Initializing and populating poisonable subsets
21:  for $\tau \in \mathcal{T}$ do
22:     $G_\tau \leftarrow$ subgraph of $G$ induced by $\{\tau\} \cup \{u \in V : A_{\tau u} \geq \theta_t\}$
23:     $S_\tau \leftarrow \mathrm{ApproxMIS}(G_\tau \setminus \{\tau\})$   // Run approximate MIS subroutine
24:  end for
Algorithm 1 Identifying natural backdoor datasets within multi-label datasets
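For completeness, a hypothetical end-to-end driver following Algorithm 1, reusing the helper sketches from §F.1-F.3 (and therefore inheriting all of their assumed names), might read:

def find_natural_backdoors(multi_labels, num_classes,
                           min_class_overlaps, min_trig_overlaps, k=10):
    """Return {trigger: poisonable classes} following Algorithm 1."""
    M = build_cooccurrence_matrix(multi_labels, num_classes)    # Phase 1
    G = build_cooccurrence_graph(M, min_class_overlaps)
    scores = centrality_scores(G, method="betweenness")         # Phase 2
    triggers = sorted(scores, key=scores.get, reverse=True)[:k]
    return {tau: approximate_mis(trigger_subgraph(G, tau, min_trig_overlaps), tau)
            for tau in triggers}                                 # Phase 3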