Recent studies have shown that Deep Neural Networks (DNNs), which are the state-of-the-art tools for a wide range of tasks [51, 23, 12, 31, 43], are vulnerable to adversarial perturbation attacks [34, 59]. In the visual domain, such adversarial perturbations can be digital or physical. The former refers to adding (quasi-)imperceptible digital noise to an image to cause a DNN to misclassify an object in the image; the latter refers to physically altering an object so that the captured image of that object is misclassified. In general, adversarial perturbations are not readily noticeable by humans, but cause the machine to fail at its task.
To defend against such attacks, our observation is that the misclassification caused by adversarial perturbations is often out-of-context. To illustrate, consider the traffic crossing scene in Fig. 1; a stop sign often co-exists with a stop line, zebra crossing, street nameplate and other characteristics of a road intersection. Such co-existence relationships, together with the background, create a context that can be captured by human vision systems. Specifically, if one (physically) replaces the stop sign with a speed limit sign, humans can recognize the anomaly that the speed limit sign does not fit in the scene. If a DNN module can also learn such relationships (i.e., the context), it should also be able to deduce if the (mis)classification result (i.e., the speed limit sign) is out of context.
Inspired by these observations and the fact that context has been used very successfully in recognition problems, we propose to use context inconsistency to detect adversarial perturbation attacks. This defense strategy complements existing defense methods [21, 29, 39], and can cope with both digital and physical perturbations. To the best of our knowledge, it is the first strategy to defend object detection systems by considering objects “within the context of a scene.”
We realize a system that checks for context inconsistencies caused by adversarial perturbations, and apply this approach to the defense of object detection systems; our work is motivated by a rich literature on context-aware object recognition systems [13, 4, 41, 27]. We assume a framework for object detection similar to , where the system first proposes many regions that potentially contain objects, which are then classified. In brief, our approach accounts for four types of relationships among the regions, all of which together form the context for each proposed region: a) regions corresponding to the same object (spatial context); b) regions corresponding to other objects likely to co-exist within a scene (object-object context); c) the regions likely to co-exist with the background (object-background context); and d) the consistency of the regions within the holistic scene (object-scene context). Our approach constructs a fully connected graph with the proposed regions and a super-region node which represents the scene. In this graph, each node has what we call an associated context profile.
The context profile is composed of node features (i.e., the original feature used for classification) and edge features (i.e., context). Node features represent the region of interest (RoI) and edge features encode how the current region relates to other regions in its feature space representation. Motivated by the observation that the context profile of each object category is almost always unique, we use an auto-encoder to learn the distribution of the context profile of each category. In testing, the auto-encoder checks whether the classification result is consistent with the testing context profile. In particular, if a proposed region (say of class ) contains adversarial perturbations that cause the DNN of the object detector to misclassify it as class , using the auto-encoder of class to reconstruct the testing context profile of class will result in a high reconstruction error. Based on this, we can conclude that the classification result is suspicious.
The main contributions of our work are the following.
To the best of our knowledge, we are the first to propose using context inconsistency to detect adversarial perturbations in object classification tasks.
We design and realize a DNN-based adversarial detection system that automatically extracts context for each region, and checks its consistency with a learned context distribution of the corresponding category.
We conduct extensive experiments on both digital and physical perturbation attacks with three different adversarial targets on two large-scale datasets - PASCAL VOC  and Microsoft COCO . Our method yields high detection performance in all the test cases; the ROC-AUC is over 0.95 in most cases, which is 20-35% higher than a state-of-the-art method  that does not use context in detecting adversarial perturbations.
2 Related Work
We review closely-related work and its relationship to our approach.
Object Detection, which seeks to locate and classify object instances in images/videos, has been extensively studied [48, 40, 47, 37]. Faster R-CNN  is a state-of-the-art DNN-based object detector that we build upon. It initially proposes class-agnostic bounding boxes called region proposals (first stage), and then outputs the classification result for each of them in the second stage.
Adversarial Perturbations on Object Detection, and in particular physical perturbations targeting DNN-based object detectors, have been studied recently [50, 8, 58] (in addition to those targeting image classifiers [32, 15, 1]). Besides mis-categorization attacks, two new types of attacks have emerged against object detectors: the hiding attack and the appearing attack [8, 50] (see Section 3.1 for more details). While defenses have been proposed against digital adversarial perturbations in image classification, our work focuses on both digital and physical adversarial attacks on object detection systems, which is an open and challenging problem.
Adversarial Defense has been proposed for coping with digital perturbation attacks in the image domain. Detection-based defenses aim to distinguish perturbed images from normal ones. Statistics-based detection methods rely on extracted features that have different distributions across clean images and perturbed ones [24, 16, 39]. Prediction-inconsistency-based detection methods process the images and check for consistency between predictions on the original images and processed versions [57, 36]. Other methods train a second binary classifier to distinguish perturbed inputs from clean ones [44, 42, 35]. However, many of these are effective only on small and simple datasets like MNIST and CIFAR-10 . Most of them need large amounts of perturbed samples for training, and very few can be easily extended to region-level perturbation detection, which is the goal of our method. Table 1 summarizes the differences between our method and the other defense methods. We extend FeatureSqueeze , considered a state-of-the-art detection method, which squeezes the input features by both reducing the color bit depth of each pixel and spatially smoothing the input images, to work at the region level and use this as a baseline (with this extension its performance is directly comparable to that of our approach).
|Detection||Beyond MNIST CIFAR||Do not need perturbed samples for training||Extensibility to object detection|
|PCAWhiten ||✗||✓||✗, PCA is not feasible on large regions|
|GaussianMix ||✗||✗||✗, Fixed-sized inputs are required|
|Steganalysis ||✓||✗||✗, Unsatisfactory performance on small regions|
|PCAConv ||✓||✗||✗, Fixed-sized inputs are required|
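The bit-depth reduction that the FeatureSqueeze baseline applies can be illustrated with a minimal sketch. It assumes pixel intensities scaled to [0, 1]; the helper name `reduce_bit_depth` and the 3-bit setting are hypothetical and are not taken from the baseline's implementation, and the spatial smoothing step is omitted.

```python
def reduce_bit_depth(pixels, bits):
    """Quantize intensities in [0, 1] to 2**bits levels (hypothetical helper)."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]


# Toy "image" as a flat list of intensities (illustrative values only)
original = [0.00, 0.10, 0.52, 0.98]
squeezed = reduce_bit_depth(original, bits=3)

# With 3 bits, every squeezed value is a multiple of 1/7
allowed_values = {k / 7 for k in range(8)}
```

A prediction-inconsistency detector would then compare the model's output on `original` with its output on `squeezed`; small perturbations tend to be removed by the quantization, which changes the prediction.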
Context Learning for Object Detection has been studied widely [26, 46, 52, 13, 4].
Earlier works that incorporate context information into DNN-based object detectors [17, 11, 45] use object relations in post-processing, where the detected objects are re-scored by considering object relations.
Some recent works [33, 9] perform sequential reasoning, i.e., objects detected earlier are used to help find objects later.
The state-of-the-art approaches based on recurrent units  or neural attention models process a set of objects using interactions between their appearance features and geometry. Our proposed context learning framework falls into this category, and among these,  is the one most related to our work. We go beyond the context learning method to define the context profile and use context inconsistency checks to detect attacks.
3.1 Problem Definition and Framework Overview
We propose to detect adversarial perturbation attacks by recognizing the context inconsistencies they cause, i.e.,
by connecting the dots with respect to whether the object fits within the scene and in association with other entities in the scene.
Threat Model. We assume a strong white-box attack against the two-stage Faster R-CNN model where both the training data and the parameters of the model are known to the attacker. Since there are no existing attacks against the first stage (i.e., region proposals), we do not consider such attacks. The attacker’s goal is to cause the second stage of the object detector to malfunction by adding digital or physical perturbations to one object instance/background region. There are three types of attacks [50, 8, 58]:
Miscategorization attacks make the object detector miscategorize the perturbed object as belonging to a different category.
Hiding attacks make the object detector fail in recognizing the presence of the perturbed object, which happens when the confidence score is low or the object is recognized as background.
Appearing attacks make the object detector wrongly conclude that the perturbed background region contains an object of a desired category.
Framework Overview. We assume that we can get the region proposal results from the first stage of the Faster R-CNN model and the prediction results for each region from its second stage. We denote the input scene image as and the region proposals as , where is the total number of proposals of . During the training phase, we have the ground truth category label and bounding box for each , denoted as . The Faster R-CNN’s predictions on proposed regions are denoted as . Our goal as an attack detector is to identify perturbed regions from all the proposed regions.
Fig. 2 shows the workflow of our framework.
We use a structured DNN model to build a fully connected graph on the proposed regions to model the context of a scene image.
We name this as Structure ContExt ModEl, or SCEME in short.
In SCEME, we combine the node features and edge features of each node , to form its context profile.
We use auto-encoders to detect context inconsistencies as outliers.
Specifically, during the training phase, for each category, we train a separate auto-encoder to capture the distribution of the benign context profile of that category.
We also have an auto-encoder for the background category to detect hiding attacks.
During testing, we extract the context profile for each proposed region.
We then select the corresponding auto-encoder based on the prediction result of the Faster R-CNN model and check if the testing context profile belongs to the benign distribution.
If the reconstruction error rate is higher than a threshold, we posit that the corresponding region contains adversarial perturbations.
In what follows, we describe each step of SCEME in detail.
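The test-time check described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the auto-encoders are stand-in callables, the context profiles are short toy vectors, the mean absolute error stands in for the reconstruction error, and all names (`detect_perturbed_regions`, `autoencoders`, `thresholds`) are hypothetical.

```python
def detect_perturbed_regions(regions, autoencoders, thresholds):
    """Flag regions whose context profile is poorly reconstructed.

    regions: list of (predicted_class, context_profile) pairs.
    autoencoders: dict mapping class name -> callable that reconstructs a profile.
    thresholds: dict mapping class name -> reconstruction error threshold.
    All three structures are hypothetical stand-ins for illustration.
    """
    flags = []
    for predicted_class, profile in regions:
        # Pick the auto-encoder of the class predicted by the detector
        reconstruct = autoencoders[predicted_class]
        output = reconstruct(profile)
        # Mean absolute error as a simple stand-in for the reconstruction error
        error = sum(abs(a - b) for a, b in zip(profile, output)) / len(profile)
        flags.append(error > thresholds[predicted_class])
    return flags


# Toy "auto-encoders": each ignores its input and returns a fixed prototype profile
autoencoders = {
    "stop sign": lambda p: [1.0, 0.0, 1.0],
    "speed limit": lambda p: [0.0, 1.0, 0.0],
}
thresholds = {"stop sign": 0.3, "speed limit": 0.3}
regions = [
    ("stop sign", [0.9, 0.1, 1.0]),    # profile matches the predicted class
    ("speed limit", [0.9, 0.1, 1.0]),  # stop-sign-like profile predicted as speed limit
]
flags = detect_perturbed_regions(regions, autoencoders, thresholds)
```

The second region is flagged because the auto-encoder selected by the (wrong) prediction cannot reconstruct a profile drawn from another category's distribution.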
3.2 Constructing SCEME
In this subsection, we describe the design of the fully connected graph and the associated message passing mechanism in SCEME. Conceptually, SCEME builds a fully connected graph on each scene image. Each node is a region proposal generated by the first stage of the target object detector, plus the scene node. The initial node features, , are the RoI pooling features of the corresponding region. The node features are then updated () using message passing from other nodes. After convergence, the updated node features are used as inputs to a regressor that refines the bounding box coordinates and a classifier that predicts the category, as shown in Fig. 3(b). Driven by the object detection objective, we train SCEME and the subsequent regressor and classifier together. We freeze the weights of the target Faster R-CNN during training. To force SCEME to rely more on context information than on appearance information (i.e., node features) when performing object detection, we apply a dropout function  to the node features before inputting them into SCEME during the training phase. At the end of training, SCEME should achieve better object detection performance than the target Faster R-CNN, since it explicitly uses the context information from other regions to update the appearance features of each region via message passing. This is observed in our implementation.
We use Gated Recurrent Units (GRU) with attention  as the message passing mechanism in SCEME. For each proposed region, relationships with other regions and the whole scene form four kinds of context:
Same-object context: for regions over the same object, the classification results should be consistent;
Object-object context: co-existence, relative location, and scale between objects are usually correlated;
Object-background context: the co-existence of the objects and the associated background regions is also correlated;
Object-scene context: when considering the whole scene image as one super region, the co-existence of objects in the entire scene is also correlated.
To utilize object-scene context, the scene GRU takes the scene node features as the input, and updates .
To utilize the other kinds of context, since we have no ground truth about which object/background the regions belong to, we use attention to learn what context category to utilize from different regions.
The query and key (they encode information like location, appearance, scale, etc.) pertaining to each region are defined similar to .
Comparing the relative location, scale and co-existence between the query of the current region and the keys of all the other regions, the attention system assigns different attention scores to each region, i.e., it updates , utilizing different amount of information from . Thus, is first weighted by the attention scores and then all are summed up as the input to the Region GRU to update as shown in Fig. 3(c).
The corresponding output, and , are then combined via the average pooling function to get the final updated RoI feature vector.
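The attention-weighted aggregation and the final average pooling can be sketched as follows. This is a minimal illustration: normalizing the attention scores with a softmax is an assumption (the text only says the features are weighted and summed), and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw attention scores so they sum to 1 (assumed normalization)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_messages(attention_scores, region_features):
    """Weight each region's feature vector by its attention and sum them."""
    weights = softmax(attention_scores)
    dim = len(region_features[0])
    return [
        sum(w * f[d] for w, f in zip(weights, region_features))
        for d in range(dim)
    ]


def average_pool(scene_output, region_output):
    """Combine the Scene GRU and Region GRU outputs element-wise."""
    return [(s + r) / 2 for s, r in zip(scene_output, region_output)]


# Two other regions with equal attention: the message is their mean
message = aggregate_messages([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]])
fused = average_pool([1.0, 2.0], [3.0, 4.0])
```

In the full model, `message` would be the input to the Region GRU, and `fused` corresponds to the final updated RoI feature vector.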
3.3 Context Profile
In this subsection, we describe how we extract a context profile in SCEME. Recall that a context profile consists of node features and edge features, where the edge features describe how is updated. Before introducing the edge features that we use, we describe in detail how message passing is done with GRU .
A GRU is a memory cell that can remember the initial node features and then fuse incoming messages from other nodes into a meaningful representation. Let us consider the GRU that takes the feature vector (from other nodes) as the input, and updates the current node features . Note that and have the same dimensions since both are from RoI pooling. GRU computes two gates given and , for message fusion. The reset gate drops or enhances information in the initial memory based on its relevance to the incoming message . The update gate controls how much of the initial memory needs to be carried over to the next memory state, thus allowing a more effective representation. In other words, and are two vectors of the same dimension as and , which are learned by the model to decide what information should be passed to the next memory state given the current memory state and the incoming message. Therefore, we use the gate vectors as the edge features in the context profile. There are, in total, four gate feature vectors from both the Scene GRU and the Region GRU. Therefore, we define the context profile of a proposed region as .
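For reference, the gate computation described above matches the standard GRU formulation, sketched here with assumed notation: $h$ for the current node features (the memory), $x$ for the incoming message, $r$ for the reset gate, $z$ for the update gate, and $W_\ast$, $U_\ast$ for learned weight matrices. Bias terms are omitted, and the convention in which $z$ weights the carried-over memory is chosen to agree with the description in the text; other references swap the roles of $z$ and $1 - z$.

```latex
\begin{aligned}
z &= \sigma(W_z x + U_z h), \\
r &= \sigma(W_r x + U_r h), \\
\tilde{h} &= \tanh\big(W x + U (r \odot h)\big), \\
h' &= z \odot h + (1 - z) \odot \tilde{h}.
\end{aligned}
```

Since $\sigma$ is applied element-wise, $z$ and $r$ have the same dimension as $h$ and $x$, which is what allows them to be used directly as edge features.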
3.4 AutoEncoder for Learning Context Profile Distribution
In benign settings, all context profiles of a given category must be similar to each other. For example, stop sign features exist with features of road signs and zebra crossings. Therefore, the context profile of a stop sign corresponds to a unique distribution that accounts for these characteristics. When a stop sign is misclassified as a speed limit sign, its context profile should not fit with the distribution corresponding to that of the speed limit sign category.
For each category, we use a separate auto-encoder (architecture shown in the supplementary material) to learn the distribution of its context profile. The input to the auto-encoder is the context profile . A fully connected layer is first used to compress the node features () and edge features () separately. This is followed by two convolution layers, wherein the node and edge features are combined to learn the joint compression. Two fully connected layers are then used to further compress the joint features. These layers form a bottleneck that drives the encoder to learn the true relationships between the features and get rid of redundant information. SmoothL1Loss, as defined in [28, 55], between the input and the output is used to train the auto-encoder, which is a common practice.
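For readers unfamiliar with the loss, a minimal sketch of the smooth L1 function follows. It uses the commonly used definition with a transition point `beta` (quadratic below it, linear above it) and averages over elements; the default `beta=1.0` and the averaging are assumptions, as the text does not state the exact settings.

```python
def smooth_l1_loss(inputs, outputs, beta=1.0):
    """Mean smooth L1 loss between two equal-length vectors."""
    total = 0.0
    for x, y in zip(inputs, outputs):
        diff = abs(x - y)
        if diff < beta:
            # Quadratic region: behaves like L2 for small errors
            total += 0.5 * diff * diff / beta
        else:
            # Linear region: behaves like L1 for large errors
            total += diff - 0.5 * beta
    return total / len(inputs)


# One small error (0.5) and one large error (2.0)
loss = smooth_l1_loss([0.0, 0.0], [0.5, 2.0])
```

The linear region keeps large reconstruction errors from dominating the gradient, while the quadratic region stays smooth around zero.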
Once trained, we can detect adversarial perturbation attacks by appropriately thresholding the reconstruction error.
Given a new context profile during testing, if a) the node features are not aligned with the corresponding distribution of benign node features, or
b) the edge features are not aligned with the corresponding distribution of benign edge features, or
c) the joint distribution between the node features and the edge features is violated,
the auto-encoder will not be able to reconstruct the features using its learned distribution/relation.
In other words, a reconstruction error that is larger than the chosen threshold would indicate either an appearance discrepancy or a context discrepancy between the input and output of the auto-encoder.
4 Experimental Analysis
We conduct comprehensive experiments on two large-scale object detection datasets to evaluate the proposed method, SCEME, against six different adversarial attacks, viz., digital miscategorization attack, digital hiding attack, digital appearing attack, physical miscategorization attack, physical hiding attack, and physical appearing attack, on Faster R-CNN (the general idea can be applied more broadly). We analyze how different kinds of context contribute to the detection performance. We also provide a case study for detecting physical perturbations on stop signs, which has been used widely as a motivating example.
4.1 Implementation Details
Datasets. We use both PASCAL VOC  and MS COCO . PASCAL VOC contains 20 object categories. Each image, on average, has 1.4 categories and 2.3 instances . We use voc07trainval and voc12trainval as training datasets and the evaluations are carried out on voc07test. MS COCO contains 80 categories. Each image, on average, has 3.5 categories and 7.7 instances. coco14train and coco14valminusminival are used for training, and the evaluations are carried out on coco14minival. Note that COCO has few examples for certain categories. To ensure that we have enough context profiles to learn the distribution, we train 11 auto-encoders for the 11 categories that have the largest numbers of extracted context profiles. Details are provided in the supplementary material.
Attack Implementations. For digital attacks, we use the standard iterative fast gradient sign method (IFGSM)  and constrain the perturbation location within the ground truth bounding box of the object instance. Because our defense depends on contextual information, it is not sensitive to how the perturbation is generated. We compare the performance against perturbations generated by a different method (FGSM) in the supplementary material. We use the physical attacks proposed in [15, 50], where perturbation stickers are constrained to be on the object surface; the color of the stickers should be printable, and the pattern of the stickers should be smooth. For evaluations on a large scale, we do not print or add stickers physically; we add them digitally onto the scene image. This favors attackers since they can control how their physical perturbations are captured.
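The box-constrained iterative attack can be sketched as below. This is a toy illustration, not the attack code used in the experiments: the image is a flat list of intensities in [0, 1], `grad_fn` is a hypothetical callable returning the loss gradient with respect to the input, and the step follows the gradient sign to increase the loss (a targeted variant would instead descend the loss of the target class).

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def ifgsm_in_box(x, grad_fn, mask, epsilon, alpha, steps):
    """Iterative FGSM restricted to masked positions (illustrative sketch).

    x: flat list of pixel values in [0, 1]; mask: 1 inside the box, 0 outside.
    grad_fn: hypothetical callable returning the loss gradient w.r.t. the input.
    """
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        for i in range(len(x_adv)):
            if mask[i]:
                stepped = x_adv[i] + alpha * sign(grad[i])
                # Stay within epsilon of the original pixel and within [0, 1]
                stepped = min(max(stepped, x[i] - epsilon), x[i] + epsilon)
                x_adv[i] = min(max(stepped, 0.0), 1.0)
    return x_adv


# Toy gradient: the loss increases with every pixel, so the gradient is all ones
x = [0.5, 0.5, 0.5, 0.5]
mask = [0, 1, 1, 0]  # only the two middle pixels lie inside the bounding box
x_adv = ifgsm_in_box(x, lambda v: [1.0] * len(v), mask, epsilon=0.1, alpha=0.04, steps=5)
```

Pixels outside the mask are never modified, and pixels inside it saturate at the `epsilon` bound after a few steps.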
Momentum optimizer with momentum 0.9 is used to train SCEME . The learning rate is 5e-4 and decays every 80k iterations at a decay rate of 0.1. The training finishes after 250k iterations.
Adam optimizer is used to train auto-encoders. The learning rate is 1e-4 and reduced by 0.1 when the training loss stops decreasing for 2 epochs. Training finishes after 10 epochs.
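The SCEME learning rate schedule stated above can be written as a short function. The sketch assumes a staircase (step-wise) decay, which is the usual reading of "decays every 80k iterations"; the function name is hypothetical.

```python
def sceme_learning_rate(iteration, base_lr=5e-4, decay_rate=0.1, decay_every=80_000):
    """Step decay: multiply the rate by decay_rate after every decay_every iterations."""
    return base_lr * decay_rate ** (iteration // decay_every)


# Rates at the start, after the first decay, and just before training ends at 250k
lr_start = sceme_learning_rate(0)
lr_after_first_decay = sceme_learning_rate(80_000)
lr_final = sceme_learning_rate(249_999)
```

Under this reading, training ends at roughly 5e-7 after three decays.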
4.2 Evaluation of Detection Performance
Evaluation Metric. We extract the context profile for each proposed region, feed it to its corresponding auto-encoder and threshold the reconstruction error to detect adversarial perturbations. Therefore, we evaluate the detection performance at the region level. Benign/negative regions are the regions proposed from clean objects; perturbed/positive regions are the regions relating to perturbed objects.
We report Area Under Curve (AUC) of Receiver Operating Characteristic Curve (ROC) to evaluate the detection performance.
Note that there can be multiple regions of a perturbed object. If any of these regions is detected, it counts as a successful perturbation detection. For hiding attacks, there is a possibility of no proposed region; however, it occurs rarely (less than 1%).
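One way to turn the "any region detected" rule into a ROC-AUC computation is sketched below. The aggregation is an assumption: each perturbed object is scored by the maximum reconstruction error over its regions, and the AUC is computed as the probability that a perturbed score exceeds a benign one. The function names and toy numbers are illustrative only.

```python
def object_score(region_errors):
    """An object is detected if any of its regions is flagged, so use the max."""
    return max(region_errors)


def roc_auc(negative_scores, positive_scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy reconstruction errors: three benign regions, three perturbed objects
benign = [0.10, 0.20, 0.35]
perturbed_objects = [[0.15, 0.60], [0.30, 0.25], [0.90]]
perturbed = [object_score(r) for r in perturbed_objects]
auc = roc_auc(benign, perturbed)
```

The pairwise form is quadratic in the number of samples; it is used here only because it makes the definition explicit.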
Visualizing the Reconstruction Error. We plot the reconstruction error of benign aeroplane context profiles and that of digitally perturbed objects that are misclassified as an aeroplane. As shown in Fig. 4(a), the context profiles of perturbed regions do not conform with the benign distribution of aeroplanes’ context profiles and cause larger reconstruction errors. This test validates our hypothesis that the context profile of each category has a unique distribution. The auto-encoder that learns from the context profile of class A will not reconstruct class B well.
|Method||Digital Perturbation||Physical Perturbation|
|Results on PASCAL VOC:|
|SCEME (node features only)||0.866||0.976||0.828||0.947||0.964||0.927|
|Results on MS COCO:|
|SCEME (node features only)||0.901||0.976||0.810||0.972||0.954||0.971|
Detection Performance. Thresholding the reconstruction error, we plot the ROC curve for “aeroplane” and other object categories tested on the PASCAL VOC dataset in Fig. 4(b). The AUCs for all 21 categories (including background) are over 90%. This means that all the categories have their unique context profile distributions, and the reconstruction error of their auto-encoders effectively detects perturbations. The detection performance results against six attacks on PASCAL VOC and MS COCO are shown in Tab. 2. Three baselines are considered.
FeatureSqueeze . As discussed in Tab. 1, many existing adversarial perturbation detection methods are not effective beyond simple datasets. Most require perturbed samples while training, and only few can be extended to region-level perturbation detection. We extend FeatureSqueeze, one of the state-of-the-art methods, that is not limited by these, for the object detection task. Implementation details are provided in the supplementary material.
Co-occurGraph . We also consider a non-deep graph model where co-occurrence context is represented, as a baseline. We check the inconsistency between the relational information in the training data and testing images to detect attacks. Details are in the supplementary material. Note that the co-occurrence statistics of background class cannot be modeled, and so this approach is inapplicable for detecting hiding and appearing attacks.
SCEME (node features only). Only node features are used to train the auto-encoders (instead of using context profiles with both node features for region representation and edge features for contextual relation representation). Note that the node features already implicitly contain context information since, with Faster R-CNN, the receptive field of neurons grows with depth and eventually covers the entire image. We use this baseline to quantify the improvement we achieve by explicitly modeling context information with SCEME.
Our method, SCEME, yields high AUC on both datasets and for all six attacks; many of them are over 0.95. The detection performance of SCEME is consistently better than that of FeatureSqueeze, by over 20%. Compared to Co-occurGraph, the performance of our method in detecting miscategorization attacks is better by over 15%. Importantly, SCEME is able to detect hiding and appearing attacks and to detect perturbations in images with one object, which is not feasible with Co-occurGraph. Using node features alone yields good detection performance, and additionally using edge features improves performance by up to 8% for some attacks.
Examples of Detection Results. We visualize the detected perturbed regions for both digital and physical miscategorization attacks in Fig. 5. The reconstruction error threshold is chosen to make the false positive rate 0.2%. SCEME successfully detects both digital and physical perturbations, as shown in Fig. 5(a) and (b). The misclassification of the perturbed object could affect the context information of another coexisting benign object and lead to a false perturbation detection on the benign object, as shown in Fig. 5(c). We observe that this rarely happens. In most cases, although some part of the object-object context gets violated, the appearance representation and other context help in making the right detection. When there are not many object-object context relationships, as shown in Fig. 5(d), appearance information and spatial context are mainly used to detect a perturbation.
4.3 Analysis of Different Contextual Relations
In this subsection, we analyze what roles different kinds of context features play.
Spatial context consistency means that nearby regions of the same object should yield consistent prediction. We do two kinds of analysis. The first one is to observe the correlations between the adversarial detection performance and the number of regions proposed by the target Faster R-CNN for the perturbed object. Fig. 6(a) shows that the detection performance improves when more regions are proposed for the object and this correlation is not observed for the baseline method (for both datasets). This indicates that spatial context plays a role in perturbation detection. Our second analysis is on appearing attacks. If the “appearing object” has a large overlap with one ground truth object, the spatial context of that region will be violated. We plot in Fig. 6(b) the detection performance with respect to the overlap between the appearing object and the ground truth object, measured by Intersection over Union (IoU). We observe that the more these two objects overlap, the more likely the region is detected as perturbed, consistent with our hypothesis.
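The overlap measure used in this analysis can be computed as in the following sketch, which assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates with positive area; the function name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


# An "appearing" box that partially overlaps a ground truth box
appearing_box = (1.0, 1.0, 3.0, 3.0)
ground_truth_box = (0.0, 0.0, 2.0, 2.0)
overlap = iou(appearing_box, ground_truth_box)
```

Here the intersection is a unit square and the union has area 7, so the two boxes overlap only slightly.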
Object-object context captures the co-existence of objects and their relative position and scale relations. We test the detection performance with respect to the number of objects in the scene images. As shown in Fig. 6(c), in most cases, the detection performance of SCEME first drops or stays stable, and then improves. We believe that the reason is as follows: initially, as the number of objects increases, the object-object context is weak and so is the spatial context as the size of the objects gets smaller with more of them; however, as the number of objects increases, the object-object context dominates and performance improves.
4.4 Case Study on Stop Sign
We revisit the stop sign example and provide quantitative results to validate that context information helps defend against perturbations. We collect 1000 perturbed stop sign examples from the COCO dataset, all of which are misclassified by the Faster R-CNN. The baselines and SCEME are tested for detecting the perturbations. If we set a lower reconstruction error threshold, we will have a better chance of detecting the perturbed stop signs. However, there will be more false positives, i.e., clean regions wrongly categorized as perturbed. Thus, to compare the methods, we constrain the threshold of each method so as to meet a certain False Positive Rate (FPR), and compute the recall achieved, i.e., out of the 1000 samples, how many are detected as perturbed. The results are shown in Tab. 3. FeatureSqueeze  cannot detect any perturbation until an FPR of 5% is chosen. SCEME detects 54% of the perturbed stop signs with an FPR of 0.1%. Further, compared to its ablated version (that only uses node features), our method detects almost twice as many perturbed samples when the FPR required is very low (which is the case in many real-world applications).
|False Positive Rate||0.1%||0.5%||1%||5%||10%|
|Recall of FeatureSqueeze ||0||0||0||3%||8%|
|Recall of SCEME (node features only)||33%||52%||64%||83%||91%|
|Recall of SCEME||54%||67%||74%||89%||93%|
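The recall-at-fixed-FPR protocol behind Tab. 3 can be sketched as follows. The code is an illustration with toy numbers, not the evaluation script: a region is flagged when its reconstruction error strictly exceeds the threshold, the threshold is taken from the sorted benign errors, and both function names are hypothetical.

```python
def threshold_for_fpr(benign_errors, target_fpr):
    """Pick a threshold from the benign errors whose FPR is at most target_fpr."""
    ordered = sorted(benign_errors, reverse=True)
    # Number of benign samples allowed to exceed the threshold
    allowed = min(int(target_fpr * len(ordered)), len(ordered) - 1)
    return ordered[allowed]


def recall_at_threshold(perturbed_errors, threshold):
    """Fraction of perturbed samples whose error exceeds the threshold."""
    detected = sum(1 for e in perturbed_errors if e > threshold)
    return detected / len(perturbed_errors)


# Toy reconstruction errors (illustrative values, not experimental results)
benign = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
perturbed = [0.5, 0.95, 1.2, 2.0]
threshold = threshold_for_fpr(benign, target_fpr=0.1)
recall = recall_at_threshold(perturbed, threshold)
```

Lowering `target_fpr` raises the threshold, which is why recall drops for every method as the FPR constraint becomes stricter.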
Inspired by how humans can associate objects with where and how they appear within a scene, we propose to detect adversarial perturbations by recognizing context inconsistencies they cause in the input to a machine learning system. We propose SCEME, which automatically learns four kinds of context, encompassing relationships within the scene and to the scene holistically. Subsequently, we check for inconsistencies within these context types, and flag those inputs as adversarial. Our experiments show that our method is extremely effective in detecting a variety of attacks on two large-scale datasets and improves the detection performance by over 20% compared to a state-of-the-art, context-agnostic method.
This research was partially sponsored by ONR grant N00014-19-1-2264 through the Science of AI program, and by the U.S. Army Combat Capabilities Development Command Army Research Laboratory under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Combat Capabilities Development Command Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
[1] (2017) Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.
[2] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[3] Online adaptation for joint scene and object classification. European Conference on Computer Vision, pp. 227–243.
[4] Exploring the bounds of the utility of context for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7412–7420.
[5] (2006) Pattern recognition and machine learning. Springer.
[6] Voila: visual anomaly detection and monitoring with streaming spatiotemporal data. IEEE Transactions on Visualization and Computer Graphics 24 (1), pp. 23–33.
[7] Adversarial examples are not easily detected: bypassing ten detection methods. Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14.
[8] (2018) ShapeShifter: robust physical adversarial attack on Faster R-CNN object detector. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 52–68.
[9] (2017) Spatial memory for context reasoning in object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4086–4096.
[10] (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
[11] (2011) A tree-based context model for object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2), pp. 240–252.
[12] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[13] (2018) Modeling visual context is key to augmenting object detection datasets. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 364–380.
[14] (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
[15] Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1625–1634.
[16] (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410.
[17] (2009) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645.
[18] (2015) Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
[19] (2017) Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960.
[20] (2016) Deep learning. MIT Press.
[21] (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[22] (2016) Learning temporal regularity in video sequences. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 733–742.
[23] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[24] (2016) Early methods for detecting adversarial images. arXiv preprint arXiv:1608.00530.
[25] (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
[26] (1998) Does consistent scene context facilitate object perception? Journal of Experimental Psychology: General 127 (4), pp. 398.
[27] (2018) Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597.
[28] Robust estimation of a location parameter. In Breakthroughs in Statistics, pp. 492–518.
[29] (2019) ComDefend: an efficient image compression model to defend adversarial examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6084–6092.
[30] (2019) Identifying and resisting adversarial videos using temporal consistency. arXiv preprint arXiv:1909.04837.
[31] (2019) MMM: multi-stage multi-task learning for multi-choice reading comprehension. arXiv preprint arXiv:1910.00458.
[32] (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
[33] (2016) Attentive contexts for object detection. IEEE Transactions on Multimedia 19 (5), pp. 944–954.
[34] (2019) Stealthy adversarial perturbations against real-time video classification systems. In NDSS.
[35] (2017) Adversarial examples detection in deep networks with convolutional filter statistics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5764–5772.
[36] (2018) Detecting adversarial image examples in deep neural networks with adaptive noise reduction. IEEE Transactions on Dependable and Secure Computing.
[37] (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
[38] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
[39] (2019) Detection based defense against adversarial examples from the steganalysis point of view. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4825–4834.
[40] (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
[41] (2018) Structure inference net: object detection using scene-level context and instance-level relationships. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6985–6994.
[42] (2017) SafetyNet: detecting and rejecting adversarial examples robustly. In Proceedings of the IEEE International Conference on Computer Vision, pp. 446–454.
[43] Mixtures of lightweight deep convolutional neural networks: applied to agricultural robotics. IEEE Robotics and Automation Letters 2 (3), pp. 1344–1351.
[44] (2017) On detecting adversarial perturbations. arXiv preprint arXiv:1702.04267.
[45] (2014) The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898.
[46] (2003) Top-down control of visual attention in object detection. In Proceedings 2003 International Conference on Image Processing (Cat. No. 03CH37429), Vol. 1, pp. I–253.
[47] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
[48] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[49] UGM: Matlab code for undirected graphical models.
[50] (2018) Physical adversarial examples for object detectors. In 12th USENIX Workshop on Offensive Technologies (WOOT 18).
[51] (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
[52] (2003) Contextual priming for object detection. International Journal of Computer Vision 53 (2), pp. 169–191.
[53] (2018) FeatureSqueezing. GitHub. https://github.com/uvasrg/FeatureSqueezing.git
[54] (2018) Characterizing adversarial examples based on spatial consistency information for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 217–234.
[55] High accuracy individual identification model of crested ibis (Nipponia nippon) based on autoencoder with self-attention. IEEE Access 8, pp. 41062–41070.
[56] (2014) Video anomaly detection based on a hierarchical activity discovery within spatio-temporal contexts. Neurocomputing 143, pp. 144–152.
[57] (2017) Feature squeezing: detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155.
[58] (2019) Seeing isn’t believing: towards more robust adversarial attack against real world object detectors. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 1989–2004.
[59] (2020) A4: evading learning-based adblockers. arXiv preprint arXiv:2001.10999.
[60] (2012) Context-aware activity recognition and anomaly detection in video. IEEE Journal of Selected Topics in Signal Processing 7 (1), pp. 91–101.
Appendix 0.A Values in the Plots
In the paper, some experimental results are presented as plots for better visualization. We provide a table for each plot in this supplementary material. Tab. 4 and Tab. 5 correspond to the upper and lower parts of Fig. 6(a). Tab. 6 corresponds to Fig. 6(b). Tab. 7 and Tab. 8 correspond to the upper and lower parts of Fig. 6(c). Some entries are missing due to an inadequate number of samples. For example, there are no entries for the digital hiding attack for images with 6 objects in Tab. 7 because there are only 14 hiding-attacked images, and the AUC reported would not be accurate. We report AUC only when we have at least 50 attacked samples.
Tab. 4: |#Proposals||Digital Perturbations||Physical Perturbations|
Tab. 5: |#Proposals||Digital Attack||Physical Attack|
Tab. 6: |IoU||PASCAL VOC||MS COCO|
|Digital Appearing||Physical Appearing||Digital Appearing||Physical Appearing|
Tab. 7: |#Objects||Digital Perturbations||Physical Perturbations|
Tab. 8: |#Objects||Digital Attack||Physical Attack|
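The reporting rule described above (an AUC is reported only when at least 50 attacked samples are available) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the AUC is computed with the standard pairwise (rank-based) definition, assuming a higher score means "more likely attacked".

```python
MIN_ATTACKED_SAMPLES = 50  # reporting threshold stated in the text


def auc(clean_scores, attacked_scores):
    """AUC as the probability that an attacked sample scores above a clean one.

    Ties count as one half. Quadratic time, which is fine for a sketch.
    """
    wins = 0.0
    for a in attacked_scores:
        for c in clean_scores:
            if a > c:
                wins += 1.0
            elif a == c:
                wins += 0.5
    return wins / (len(attacked_scores) * len(clean_scores))


def reported_auc(clean_scores, attacked_scores,
                 min_attacked=MIN_ATTACKED_SAMPLES):
    """Return the AUC, or None when there are too few attacked samples."""
    if len(attacked_scores) < min_attacked:
        return None  # table entry left empty
    return auc(clean_scores, attacked_scores)
```

With only 14 attacked images, as in the example above, `reported_auc` returns `None` and the corresponding table entry stays empty.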
Appendix 0.B Architecture of the Auto-encoders
For each category, we use a separate auto-encoder to learn the distribution of its context profile. The auto-encoders share an identical architecture, shown in Fig. 7. The input to the auto-encoder is the context profile. Its height is 5, since the context profile contains 5 feature vectors, and its width equals the dimension of the RoI pooling feature. A fully connected layer is first used to compress the node features and the edge features separately. This is followed by two convolution layers, wherein the node and edge features are combined to learn a joint compression. Two fully connected layers are then used to further compress the joint features. These layers form a bottleneck that drives the encoder to learn the true relationships between the features and discard redundant information.
Appendix 0.C Extending FeatureSqueeze to Region-level Perturbation Detection
FeatureSqueeze [57] proposes to squeeze the search space available to an adversary, driven by the observation that feature input spaces are often unnecessarily large, which provides extensive opportunities for an adversary to construct adversarial examples. Two feature squeezing methods are used in their implementation: a) reducing the color bit depth of each pixel; b) spatial smoothing. By comparing a DNN model’s prediction on the original input with that on the squeezed ones, feature squeezing detects adversarial examples with high accuracy and few false positives. The framework of FeatureSqueeze [57] is shown in Fig. 8.
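The two squeezers and the prediction comparison can be sketched as below. This is a simplified illustration under stated assumptions, not the FeatureSqueeze implementation: images are grayscale nested lists with values in [0, 1], the spatial smoothing is a plain median filter, the comparison uses the L1 distance between prediction vectors, and `model` is any hypothetical callable that maps an image to class probabilities.

```python
from statistics import median


def reduce_bit_depth(image, bits):
    """Quantize pixel values in [0, 1] to 2**bits levels."""
    levels = 2 ** bits - 1
    return [[round(p * levels) / levels for p in row] for row in image]


def median_smooth(image, size=3):
    """Median filter with a size x size window, clipped at the borders."""
    h, w = len(image), len(image[0])
    r = size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            row.append(median(window))
        out.append(row)
    return out


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def squeeze_score(model, image, squeezers):
    """Largest L1 change in the prediction caused by any squeezer."""
    original = model(image)
    return max(l1_distance(original, model(s(image))) for s in squeezers)
```

An input would then be flagged as adversarial when `squeeze_score` exceeds a threshold chosen on clean data; the threshold itself is not specified here.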
0.C.2 Extending to Region-level Detection
To detect perturbed regions inside scene images, the DNN model of FeatureSqueeze is required to operate at the region level. We crop the ground-truth regions and use them as the input to the DNN model. The output of the DNN model is the predicted category. To deal with region inputs of various sizes, we use RoI pooling [18] (with the box size equal to the input region size) as the last feature extraction layer, as shown in Fig. 9. The softmax function is used as the last layer, and the cross-entropy loss is used as the objective loss function.
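For reference, the softmax output layer and cross-entropy objective mentioned above have the following standard form. This is a generic, self-contained sketch for a single region and a single ground-truth label; the function names are illustrative and it is not taken from the authors' training code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Cross-entropy loss of the softmax output against one true class."""
    probs = softmax(logits)
    return -math.log(probs[target_index])
```

The loss is small when the classifier assigns high probability to the ground-truth category of the cropped region and grows without bound as that probability approaches zero.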
0.C.3 Implementation Details
We initialize the feature extractor with weights pretrained on ImageNet. A momentum optimizer with momentum 0.9 is used to train the classifier. The learning rate is 1e-4 and decays by a factor of 0.1 every 80k iterations. Training ends after 240k iterations. The final classification accuracy is 95.6% for the 20 categories in the PASCAL VOC dataset and 87.1% for the 80 categories in the MS COCO dataset. The accuracy on MS COCO is lower because the dataset is imbalanced across categories; for example, there are more than 100k person instances vs. fewer than 1k hair dryer instances. Even after we balance the number of samples among the categories, the performance remains limited because some categories, such as hair dryer, have too few examples.
The hyperparameters used for feature squeezing are exactly the same as in the authors’ GitHub implementation [53].
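The training schedule stated above can be written as a small function. This is a sketch under one assumption that the text does not spell out: the decay is applied as a staircase, i.e. the learning rate is multiplied by 0.1 at iterations 80k and 160k. The constant and function names are illustrative.

```python
BASE_LR = 1e-4             # initial learning rate
DECAY_EVERY = 80_000       # iterations between decays
DECAY_RATE = 0.1           # multiplicative decay factor
TOTAL_ITERATIONS = 240_000 # training stops after this many iterations


def learning_rate(iteration):
    """Staircase learning rate for a 0-based training iteration."""
    if not 0 <= iteration < TOTAL_ITERATIONS:
        raise ValueError("iteration outside the training schedule")
    return BASE_LR * DECAY_RATE ** (iteration // DECAY_EVERY)
```

Under this reading, training uses three learning rates in turn (1e-4, 1e-5, 1e-6), each for 80k iterations.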
Appendix 0.D Co-occurGraph for Misclassification Attack Detection
We consider a non-deep model as a baseline, where co-occurrence statistics are used to detect misclassification due to adversarial perturbation. This approach uses the inconsistency between the prior relational information obtained from the training data and the inferred relational information conditioned on the misclassified detection to detect the presence of adversarial perturbation. As the co-occurrence statistics of the background class cannot be modeled, this approach is not applicable for detecting hiding and appearing attacks.
Prior Relational Information. As in [3], we use the co-occurrence frequency of different categories of objects in the training data to obtain the prior relational information. Co-occurrence statistics give an estimate of how likely two object classes are to appear together in an image.
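A co-occurrence prior of this kind can be estimated by counting label pairs over the training images. The sketch below is illustrative only: the function name is hypothetical, and normalizing each pair count by the number of images is an assumption, since the exact normalization is not stated here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_frequency(image_labels):
    """Fraction of images in which each unordered pair of classes co-occurs.

    image_labels: one list of class names per training image.
    Returns a dict keyed by alphabetically sorted (class_a, class_b) pairs.
    """
    pair_counts = Counter()
    for labels in image_labels:
        # Count each class at most once per image.
        for a, b in combinations(sorted(set(labels)), 2):
            pair_counts[(a, b)] += 1
    n_images = len(image_labels)
    return {pair: count / n_images for pair, count in pair_counts.items()}
```

For example, a high value for the pair `("car", "stop sign")` would indicate that the two classes frequently appear in the same scene.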
Graphical Representation. To encode the relational information of the different classes of objects present in an image, we represent each image as an undirected graph. Here, a node in the graph represents a single region proposed by the region proposal network. The edges, which exist between regions that are linked, represent the relationships between the regions. We formulate a tree-structured graph where the region of interest is connected with all other proposed regions. The estimate of class probabilities of each proposed region generated by the object detection model is used as the node potential, and the co-occurrence statistics are used as the edge potential.
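The tree structure described above is a star: the region of interest sits at the center and every other proposed region is a leaf. A minimal sketch of that construction follows; the function name and the dictionary-based representation are illustrative choices, not the UGM data structures.

```python
def build_star_graph(class_probs, roi_index, cooccurrence):
    """Build a star-shaped graph centered on the region of interest.

    class_probs: per-region class probability vectors (node potentials).
    roi_index: index of the region of interest (the center node).
    cooccurrence: class-by-class matrix shared by all edges (edge potential).
    """
    nodes = {i: probs for i, probs in enumerate(class_probs)}
    # One edge from the center to every other region: n - 1 edges in total,
    # which makes the graph a tree.
    edges = {(roi_index, j): cooccurrence for j in nodes if j != roi_index}
    return nodes, edges
```

With the 20 proposed regions used in this baseline, the graph has 20 nodes and 19 edges.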
Detection of Misclassification Attack. For each image instance in the test set, we estimate its class-conditional relatedness with other classes by making conditional inference on the representative graph. Conditional inference gives the pairwise conditional distribution of classes for each edge, which we use to obtain the posterior relational information of that image conditioned on the misclassified label. Based on the inconsistency between the prior relational information and the posterior relational information, we detect whether there is a misclassification attack.
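To make the conditional inference step concrete: in a star-shaped graph, once the label of the center region is fixed, each neighboring region becomes independent of the others, and its conditional class distribution is proportional to its node potential times the matching row of the edge potential. The sketch below illustrates only this step; it is not the UGM Toolbox procedure, the function name is hypothetical, and the way prior and posterior are finally compared is not specified here.

```python
def neighbor_posterior(node_potential, edge_potential, center_label):
    """Class distribution of a neighbor given the center region's label.

    node_potential: the neighbor's class probabilities from the detector.
    edge_potential: class-by-class co-occurrence matrix.
    center_label: class index assigned to the region of interest.
    """
    unnormalized = [node_potential[k] * edge_potential[center_label][k]
                    for k in range(len(node_potential))]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("center label is incompatible with every class")
    return [u / total for u in unnormalized]
```

If the center label is the result of an attack (for example, a stop sign classified as another category), the conditional distributions of its neighbors shift toward classes that rarely co-occur with what the detector actually sees, which is the kind of inconsistency this baseline looks for.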
Implementation Details. We use Faster R-CNN [48] as the object detection and region proposal generation module. For each image, we consider the top 20 proposed regions based on the class confidence score. To formulate the graph and make conditional inference, we use the publicly available UGM Toolbox [49].
Appendix 0.E Detection performance w.r.t. various perturbation generation mechanisms
In the paper, we show that our proposed method is effective in detecting six different perturbation attacks, i.e., the digital miscategorization attack, digital hiding attack, digital appearing attack, physical miscategorization attack, physical hiding attack and physical appearing attack. These attacks differ in their attack goals and perturbation forms. Other defense papers also evaluate their defense methods w.r.t. different perturbation generation mechanisms. Our defense strategy depends on contextual information, and therefore should not rely heavily on the mechanism used to generate the perturbation. We validate this hypothesis by testing our method against different perturbation generation mechanisms. The results in Tab. 9 show that our method is consistently effective against all the perturbation generation mechanisms.
As stated in the paper, COCO has few examples for certain categories. To make sure we have a sufficient number of context profiles to learn the distribution, out of all the 80 categories, we choose the 10 categories with the largest number of extracted context profiles. These 10 categories are “car”, “diningtable”, “chair”, “bowl”, “giraffe”, “person”, “zebra”, “elephant”, “cow”, “cat”. We also choose the “stop sign” category because attacks on stop signs have received sustained attention. Together with “background”, we have 12 categories in total and learn 12 auto-encoders separately. We use these 12 auto-encoders and evaluate misclassifications to these categories in our experiments.
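The category selection described above amounts to a top-k ranking plus two categories that are always included. A minimal sketch follows; the function name and the shape of the input (a mapping from object category to its number of extracted context profiles, without “background”) are assumptions made for illustration.

```python
from collections import Counter


def select_categories(profile_counts, top_k=10,
                      always_include=("stop sign", "background")):
    """Pick the top_k categories by profile count, plus fixed extras.

    profile_counts: mapping from object category to the number of
    context profiles extracted for it.
    """
    top = [name for name, _ in Counter(profile_counts).most_common(top_k)]
    extras = [name for name in always_include if name not in top]
    return top + extras
```

With `top_k=10` and the two extra categories, the function returns 12 category names, one per auto-encoder.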
Tab. 9: |Perturbation Generation Mechanism||PASCAL VOC||MS COCO|
Appendix 0.F Comparison with other context inconsistency based adversarial defense methods
The general notion of context has been used to detect anomalous activities [60, 56, 6, 22]. When it comes to adversarial perturbation detection, spatial context has been used to detect adversarial perturbations against semantic segmentation [54], and temporal context has been used to detect adversarial perturbations against video classification [30]. To the best of our knowledge, context inconsistency has not previously been used to detect adversarial examples against object detection systems. Essentially, our approach utilizes different kinds of context, including the spatial context from these prior works and, for the first time, object-level inter-relationships, as discussed in Tab. 10.