Deep learning has become the go-to method for automating image-based tasks. This is because deep neural networks (DNNs) are excellent at learning and identifying spatial patterns and abstract concepts. With advances in both hardware and neural architectures, deep learning has become both a practical and reliable solution. Companies now use image-based deep learning to automate tasks in life-critical operations such as autonomous driving [27, 4], surveillance, and medical image screening.
In tasks such as these, multiple objects must be identified per image. One way to accomplish this is to predict a class probability for each pixel in the input image. This approach is called image segmentation, and companies such as Tesla use it to guide their autonomous vehicles safely through an environment. Another approach is called object detection, where the image is split into a grid of cells or regions and the model predicts both a class probability and a bounding box for each of them [41, 42]. In both cases, these models rely on image semantics to successfully parse and interpret a scene.
Just like other deep learning models, these semantic models are also susceptible to adversarial attacks. In 2017, researchers demonstrated how a small ‘adversarial’ patch can be placed in a real world scene and override an image-classifier’s prediction, regardless of the patch’s location or orientation. This gave rise to a number of works which demonstrated the concept of adversarial patches against image segmentation and object detection models [47, 32, 9, 46, 26, 50, 60, 30, 12, 22, 21, 54]. However, current adversarial patches are limited in the following ways:
Only predictions around the patch itself are explicitly affected. This limits where objects can be made to ‘appear’ in a scene. For example, a patch cannot make a plane appear in the sky and it is difficult to put a patch in the middle of a busy road. Furthermore, patches in noticeable areas can raise suspicion (e.g., a stop sign with a colorful patch on it).
Existing patches do not explicitly alter the shape or layout of a scene’s perceived semantics. Changes to these semantics can be used to guide behaviors (e.g., drive a car off the road or change a head count) and have wide implications for tasks such as surveillance [24, 51] and medical screening, among others.
In this paper we identify a new type of attack which we call a Remote Adversarial Patch (RAP). A RAP is an adversarial patch which can alter an image’s perceived semantics from a remote location in the image. Our implementation of a RAP (IPatch) can be placed anywhere in the field of view and alter the predictions of nearly any predetermined location within the same view. This is demonstrated in Fig. 1, where an attacker has crafted an IPatch which causes a segmentation model to think that there is pavement (a sidewalk) in the middle of the road. Moreover, this adversarial attack is robust because the same patch works on different images using different positions and scales. Therefore, this attack is more flexible and more covert than previous approaches. Later, in section 3, we discuss the attack model further.
Since the IPatch can alter an image’s perceived semantics, an attacker can craft patches which cause these models to see objects of arbitrary shapes and classes. For example, in Fig. 2 a street view segmentation model is convinced that a slice of bread is a tree shaped like the USENIX logo. This is possible because semantic models rely on global and contextual features to parse an image. However, an object and its contextual information can be very far apart in the image. For example, consider an image with a boat next to the water. Here, the water will boost the confidence of the boat’s classification even though the boat is not in the water. The IPatch exploits these correlations by masquerading as these contextual features.
Creating a robust RAP is more challenging than creating existing adversarial patches. This is because the content of the image directly affects the leverage of the patch. For example, an IPatch cannot make a segmentation model perceive remote semantics on a blank image. However, to create a robust patch, we must be able to generalize to different images which have not been seen before. To overcome these challenges, we (1) use an incremental training strategy to slowly increase the entropy of the expectation over transformation (EoT) objective and (2) use Kullback-Leibler divergence loss to help the optimizer leverage and exploit the contextual relationships.
In this paper, we focus on the use of IPatches as a RAP against semantic segmentation models. We also demonstrate that the same technique can be applied to object detectors, such as YOLO, as well. To evaluate the IPatch, we train 37 segmentation models using 8 different encoders and 5 state-of-the-art architectures. In our evaluations, we focus on the autonomous car scenario [4, 44], and perform rigorous tests to determine the limitations and capabilities of the attack. On the top 4 classes, we found that the attack works up to 93% of the time on average, depending on the victim’s model. We also found that all of the segmentation models are susceptible to the attack, where the most susceptible architectures were the FPN and Unet++ and the least susceptible architecture was the PSPNet. Finally, even if the attacker does not have the same architecture as the victim, we found that without any additional training effort, an IPatch trained on one architecture works on others with an attack success rate of up to 25.3%.
The contributions of this paper are as follows:
We introduce a new class of adversarial patches (RAP) which can manipulate a scene’s interpretation remotely and explicitly. This type of attack not only has significant implications on the security of autonomous vehicles, but also on a wide range of semantic-based applications such as medical scan analysis, surveillance, and robotics (section 3).
We present a training framework which enables the creation of a robust RAP (IPatch) by incrementally increasing the training entropy. Without this strategy, the entropy starts too high which makes it difficult to converge on learning objectives, especially given large patch transformations on scale, shift, and so on (section 4).
We provide an in-depth evaluation of the patch used as a remote adversarial attack against road segmentation models (section 5). We show that the attack is robust, universal (works on unseen images sampled from the same distribution), and has transferability (works across multiple models). We also provide initial results which demonstrate that the attack works on object detectors as well (specifically YOLOv3).
We identify the attack’s limitations and provide insight as to why this attack can alter the perception of remote regions in an image. Building on these observations, we suggest countermeasures and directions for future work (section 7).
To the best of our knowledge, this is the first adversarial patch demonstrated on segmentation models (section 2).
To reproduce our results, the reader can access the code and models used in this paper online (the code will be available soon at https://github.com/ymirsky).
2 Related Works
Soon after the popularization of deep learning, researchers demonstrated that DNNs can be exploited using adversarial examples . In 2014 it was shown how an attacker can alter an image-classification model’s predictions by adding an imperceivable amount of noise to the input image [48, 18, 38]. Initially, these attacks were impractical to perform in a real environment since every combination of lighting, camera noise, and perspective would require a different adversarial perturbation [34, 33]. However, in 2017 the authors of  showed that an adversary can consider these distortions while generating the adversarial example in a process called Expectation over Transformation (EoT). Using this method, the authors were able to generate robust adversarial samples which can be deployed in the real world. In the same year, the authors of  used EoT to create adversarial patches. Their adversarial patches were designed to fool image-classifiers (single-object detection models).
Later in 2018, the authors of  developed an adversarial patch that works on object detection models (multi-object detection models). More recently, researchers have proposed patches which can remove objects which wear the patch [32, 60, 50, 54, 22, 12, 30] and patches which can perform denial of service (DoS) attacks by corrupting a scene’s interpretation [32, 26].
In Table 1, we summarize the related works on adversarial examples against image segmentation and object detection models (the domain of the proposed attack). In general, the attack goals of these papers are either to add/change an object in the scene or to remove all objects altogether (DoS). The methods which add adversarial perturbations (noise) can change the semantics of an image at any location [20, 16, 1, 39, 25], but they cannot be deployed in the real world since they are applied directly to the image itself. Currently, there are no patches for image segmentation models, and the patches for object detection models only affect the prediction around the patch itself. The exceptions are patches which perform DoS attacks by removing/corrupting all objects detected in the scene, like [32, 26].
Therefore, to the best of our knowledge, the attack which we introduce is the first RAP, and (1) the only method which can add, change, or remove objects in a scene remotely (far from the location of the patch itself), (2) the first adversarial patch proposed for segmentation networks, and (3) the first adversarial patch which can cause a model to perceive custom semantic shapes.
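[Table 1: Summary of related works on adversarial examples against image segmentation and object detection models. Placement categories: applied to image; must be near/on target; can be placed anywhere. *Patch can be anywhere only when used to hide all objects in the scene (DoS).]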
3 Threat Model
The Vulnerability. The vulnerability which this paper introduces is that semantic models, such as image segmentation models, utilize global and contextual features in an image to improve their predictive capabilities. However, these dependencies expose channels which an attacker can exploit to change the interpretation of an image from one remote location to another.
The Attack Scenario. In this work we will focus on the remote adversarial attack scenario. In this attack scenario, Alice has an application which uses an image segmentation model. Mallory wants this model to predict a specific class at a specific location while looking at a certain scene. To accomplish this, Mallory needs a training set of one or more images and a segmentation model to work with.
For the training set, Mallory has two options: (1) obtain images similar to those used to train Alice’s model, or (2) take pictures of the target scene. For the model, Mallory can follow either a white-box or a black-box approach. In a white-box approach, Mallory obtains a copy of Alice’s model to achieve the most accurate results. The white-box approach is a common assumption for adversarial patches. Alternatively, Mallory can follow a black-box approach and train a surrogate model on a dataset similar to the one used to train Alice’s model. Although the black-box approach performs worse, we have found that there is some transferability between a patch trained on one model and then used against another (section 5). Finally, Mallory generates an IPatch for the chosen target using the training set and the model.
Motivation. There are several reasons why an adversary would want to use an IPatch over an ordinary adversarial patch (illustrated in Fig. 3):
The attacker may want to place the patch in a less obvious place so it won’t be removed or noticed by the victim. For example, a sticker on a stop sign is anomalous and can be contextually identified as malicious  but a sticker on a nearby billboard is less obvious. Another example is in the domain of medicine, where segmentation models are used to highlight and identify different lesions such as tumors. Here, an attacker can’t put a patch in the image at the location of the lesion since it would be an obvious attack. However, the attack could be triggered remotely by placing a dark RAP in the dark space of a scan where it is common to have noise, or in a location of the scan which is not under investigation (e.g., the first few slices on the z-axis). For motivations why an attacker would want to target medical scans, see .
The attacker may want to generate an object or semantic illusion in a location which is hard to reach or where it is impractical to place a patch. For example, in the sky region or on the back of an arbitrary car on the freeway.
The attacker may need to craft or alter specific semantics for a scene. For example, many works show how image segmentation can be used to identify homes, roads and resources from satellite and drone footage . Here an attacker can feed false intel by hiding or increasing the number of structures, people, and resources before it can be investigated manually.
4 Making an IPatch
In this section we first provide an overview of how image segmentation models work. Then we present our approach on how to create an IPatch.
4.1 Technical Background
(See Fig. 4 for an illustration.) The objective of a segmentation model is to take an image of a given height and width with 1-3 color channels and predict a probability map with the same height and width and one channel per class. The output can be mapped directly to the pixels of the input image, such that each entry is the probability that the corresponding pixel belongs to a given class (among all of the possible classes).
To train a segmentation model, the common approach is to follow two phases. In the first phase, the encoder network is trained as an image-classifier on a large image dataset in a supervised manner (i.e., where each image is associated with a label). Note that the classifier’s task is to predict a single class for the entire image (e.g., the image is a dog). After training the classifier, we discard the dense layers at the end of the network (used to predict the label) and retain the convolutional layers at the front of the network. In this way, we can use the feature mapping learned by the classifier to perform image segmentation. In the second phase, the decoder architecture is added on and the model is trained end-to-end. Often, the weights of the encoder are locked during this phase, and we do the same in this paper.
One reason why the encoder-decoder approach is so popular is that obtaining a labeled segmentation ground truth is significantly more challenging than for image classification (massive datasets for classification exist and new datasets can be crowdsourced as well). Therefore, by using a pre-trained encoder, far fewer examples of segmentations are needed to achieve quality results.
During training, a differentiable loss function is used to compare the model’s predicted output to the ground truth in order to perform backpropagation and update the network’s weights. There are many loss functions used for segmentation. One common approach is to simply apply the binary cross entropy loss, since the model is essentially trying to solve a multi-class classification problem. However, this loss does not consider whether a pixel is on the boundary or not, so results tend to be blurry and biased towards large segments such as backgrounds . To counter this issue, in 2016 the authors of  proposed using Dice loss for medical image segmentation, and it has since been considered a state-of-the-art approach. The Dice loss is defined as
We use Dice loss to train all of the image segmentation models in this paper.
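As a rough illustration of Dice loss, the sketch below uses one common soft-Dice form, 1 − 2·Σ(p·g) / (Σp + Σg), over flattened per-pixel probabilities for a single class. The exact variant used for the models in this paper may differ (for example, squared terms in the denominator); the function name `dice_loss` and the smoothing constant `eps` are assumptions made for illustration.

```python
def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss for one class over flattened pixels.

    pred:   predicted probabilities in [0, 1]
    target: ground-truth labels (0 or 1)
    eps:    small smoothing constant (an assumption) to avoid division by zero
    """
    intersection = sum(p * g for p, g in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


# A perfect prediction gives a loss near 0; a fully wrong one gives a loss near 1.
perfect = dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
wrong = dice_loss([0.0, 1.0, 0.0, 1.0], [1, 0, 1, 0])
```

Because the loss is a ratio of overlap to total mass, a small foreground object contributes as much as a large background, which is the property that motivates it over per-pixel cross entropy.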
When selecting the encoder’s model, there are a wide variety of options. Some include ResNext, DenseNet, Xception, EfficientNet, MobileNet, DPN, VGG, and variations thereof. However, regarding the decoder’s architecture, there are several which are considered state-of-the-art. Many of them utilize a ‘feature pyramid’ approach and skip connections to identify features at multiple scales, or an autoencoder (encoder-decoder pair) to encode and extract the semantics. We will now briefly describe the five architectures used in this paper:
- Unet++: An autoencoder architecture which improves on its predecessor, the Unet. The encoder and decoder are connected through a series of nested dense skip connections which reduce the semantic gap between the feature maps of the two networks.
- Linknet: An efficient autoencoder which passes spatial information across the network to avoid losing it in the encoder’s compression.
- FPN: A feature pyramid network which uses lateral connections across a fully convolutional neural network (FCN) to utilize feature maps learned from multiple image scales.
- PSPNet: An FCN which uses a pyramid parsing module on different sub-region representations in order to better capture global category clues. The architecture won first place in multiple segmentation challenge contests.
- PAN: A network which uses both pyramid and global attention mechanisms to capture spatial and global semantic information.
In a remote adversarial attack, the attacker wants a region around a chosen location to be predicted as a chosen class. To ensure that the optimizer does not waste energy on other semantics in the scene, we focus the effort on a region of operation, with a target pattern defined for that region. To capture the region of operation, we use a mask of zeros with the same height and width as the image and one channel per class. To select the region, a square or circle of a given radius around the chosen location in the mask is marked with ones along the channel of the target class. (For an image with a dimension of 384x480, we found that a radius of 50 pixels empirically performs best when targeting a region with a radius of 10 pixels.) To insert an object, the target pattern is set to ones, since our objective is to change the probability of those pixels to one. To insert a custom shape (like in Fig. 2), the target pattern is set accordingly.
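The mask construction can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation: the helper name `make_target_mask`, the (row, column) convention for the center, and the small dimensions in the usage lines are all assumptions. The same helper could be called twice, once with the larger radius for the region of operation and once with the smaller radius for the target pattern.

```python
def make_target_mask(height, width, num_classes, center, radius, class_idx):
    """Return a height x width x num_classes mask of zeros with a filled
    circle of ones around `center` (row, col) on channel `class_idx`."""
    cy, cx = center
    mask = [[[0] * num_classes for _ in range(width)] for _ in range(height)]
    # Only scan the bounding box of the circle, clipped to the image.
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2:
                mask[y][x][class_idx] = 1
    return mask


# Hypothetical small example: 3 classes, circle of radius 5 on channel 1.
mask = make_target_mask(height=48, width=64, num_classes=3,
                        center=(24, 32), radius=5, class_idx=1)
marked = sum(mask[y][x][1] for y in range(48) for x in range(64))
```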
To generate a patch for this objective, we follow the EoT approach similar to previous works [7, 47, 32], but using our semantic masks. Concretely, we would like to find a patch which minimizes the expected loss between the masked model output and the target pattern, where the expectation is taken over a distribution of input images, a distribution of patch locations, and a set of scales used to resize the patch. The ‘Apply’ procedure takes the current patch and inserts it into an image while sampling uniformly from these three distributions.
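One way to write an objective of this kind in standard EoT form is sketched below. All symbols are notation assumed here for illustration: $P$ is the patch, $f$ the segmentation model, $A$ the Apply procedure, $X$, $L$, and $S$ the distributions of images, locations, and scales, $M$ the region-of-operation mask, $Y$ the target pattern, and $\mathcal{L}$ the chosen loss.

```latex
P^{*} \;=\; \arg\min_{P}\;
  \mathbb{E}_{x \sim X,\; l \sim L,\; s \sim S}
  \Big[\, \mathcal{L}\big( f\big(A(P, x, l, s)\big) \odot M,\; Y \big) \Big]
```

The element-wise product with $M$ restricts the loss to the region of operation, so the optimizer is not penalized for predictions elsewhere in the scene.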
Loss Function. We experimented with many different loss functions on the CamVid dataset, including the Kullback–Leibler divergence loss. Most of these loss functions took too long to converge or got stuck in local optima. Instead, we found that one of them works best in emptier scenes (like Fig. 2), while the Kullback–Leibler divergence loss works best in busy scenes like those in CamVid. We believe the reason why the Kullback–Leibler divergence loss performs well in busy scenes is that it measures the relative entropy from one distribution to another. As a result, the optimizer had an easier time ‘leeching’ nearby features and contexts in the image to match the goal in the target pattern.
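To make the relative-entropy idea concrete, the sketch below computes the KL divergence between a target class distribution and a predicted one for a single pixel. The direction of the divergence (target relative to prediction), the per-pixel formulation, and the `eps` constant are assumptions for illustration and may not match the loss as used in the paper.

```python
import math


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete class distributions.

    Terms where the target probability is zero contribute nothing.
    """
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, pred) if t > 0)


goal = [0.0, 1.0, 0.0]   # attacker wants class 1 at this pixel
near = [0.1, 0.8, 0.1]   # prediction already close to the goal
far = [0.8, 0.1, 0.1]    # prediction dominated by another class

loss_near = kl_divergence(goal, near)
loss_far = kl_divergence(goal, far)
```

The loss shrinks smoothly as probability mass moves toward the target class, which gives the optimizer a usable gradient even when the prediction is still far from the goal.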
Creating a Robust RAP. In order to make a RAP which is robust to different transformations (scale and location) and universal to different images (not in the training set of the patch), we must use EoT. However, we found that in some cases the training of the patch does not converge well when the range of the images, locations, and scales is large (i.e., large shifts, hundreds of images, etc.). This is because (1) the IPatch leverages the variable contents of the image to impact the target region and (2) the placement of the patch in the image affects the influence of the patch on the target region.
To overcome this challenge, we propose an incremental training strategy where we gradually increase the number of images in the training set. Once the training has converged or a time limit has elapsed, we increase the number of images in the training set by a factor of 3. We repeat this process until the entire dataset is covered. At the start of each epoch, we give the optimizer time to adjust by setting the learning rate to a fraction of its value and then slowly ramping it back up. A similar strategy can be applied to the other distributions, such as the shift size and patch scale.
This strategy works well because we gradually increase the entropy, enabling the optimizer to capture foundational concepts. It can also be viewed as placing the gradient descent optimizer, at each epoch, at a more advantageous position instead of a random starting point.
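The schedule described above can be sketched as two small helpers. The growth factor of 3 and the idea of restarting from a fraction of the learning rate come from the text; the starting size of one image follows the training procedure, while the warm-up length, the starting fraction, and the linear shape of the ramp are assumptions chosen for illustration.

```python
def incremental_sizes(total_images, start=1, factor=3):
    """Yield the number of training images used at each stage,
    growing by `factor` until the whole dataset is covered."""
    size = start
    while size < total_images:
        yield size
        size *= factor
    yield total_images


def ramped_lr(base_lr, step, warmup_steps=100, start_fraction=0.1):
    """Linearly ramp the learning rate from a fraction of base_lr back up
    to base_lr over warmup_steps (both defaults are assumptions)."""
    if step >= warmup_steps:
        return base_lr
    frac = start_fraction + (1.0 - start_fraction) * step / warmup_steps
    return base_lr * frac


# Stages for a hypothetical pool of 100 images.
stages = list(incremental_sizes(100))
```

In a training loop, `step` would be reset to zero each time the image pool grows, so the optimizer takes small steps while it adjusts to the newly added images.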
In summary, the training framework for creating an IPatch is as follows (illustrated in Fig. 5):
Training Procedure for an IPatch
Initialize the patch with random values and set its origin (its default location in the image).
If incremental, then add one image to the training set. Otherwise, add all images to the training set.
Repeat until the patch has converged on the entire dataset:
- Apply: Draw a batch of samples from the training set. For each sample in the batch, perform a random transformation on the patch: scale it down and shift its location from the origin.
- Forward pass: Pass the batch through the model and obtain the segmentation maps.
- Apply mask: Take the product of each segmentation map with the mask to omit irrelevant semantics.
- Loss & gradient: Compute the loss and use it to perform backpropagation through the model to the patch.
- Update: Use gradient descent (e.g., Adamax) to update the values of the patch.
- If incremental and the time limit has elapsed or training has converged, then increase the number of training images by a factor of 3 and ramp the learning rate.
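The Apply step of the procedure above can be sketched on plain 2-D grids. This is only an illustration of the data flow: the helper names are hypothetical, nearest-neighbour resizing is an assumption, and an actual training loop would need a differentiable version of these operations (e.g., in a tensor library) so that gradients can reach the patch.

```python
import random


def resize_nearest(patch, size):
    """Nearest-neighbour resize of a square 2-D patch to size x size."""
    n = len(patch)
    return [[patch[r * n // size][c * n // size] for c in range(size)]
            for r in range(size)]


def apply_patch(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted at (top, left)."""
    out = [row[:] for row in image]
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            out[top + r][left + c] = value
    return out


def random_apply(image, patch, sizes, rng=random):
    """Scale the patch to a random size and paste it at a random location
    that keeps it fully inside the image."""
    size = rng.choice(sizes)
    small = resize_nearest(patch, size)
    top = rng.randrange(0, len(image) - size + 1)
    left = rng.randrange(0, len(image[0]) - size + 1)
    return apply_patch(image, small, top, left), (top, left, size)
```

Returning the sampled `(top, left, size)` alongside the patched image makes it easy to log which transformations the patch was exposed to during training.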
To evaluate the IPatch as a RAP, we will focus our evaluation on the scenario of autonomous vehicles. The task of street view segmentation is challenging because the scenes are typically very busy with many layers, objects, and wide perspectives. Therefore, attacking this application will provide us with good insights into the IPatch’s capabilities.
Datasets. We use the CamVid dataset to train our segmentation models and evaluate our adversarial patches. The CamVid dataset is a well-known benchmark dataset used for image segmentation. It contains 701 street view images with a resolution of 360x480 from the point of view of a car. The images are supplied with pixel-wise annotations which indicate the class of the corresponding content (e.g., car, building, etc.). The dataset comes split into three partitions: train, test, and validation. We use the train partition to train the segmentation models and the rest to train the patches. This way there will be no bias on the images which we attack. The test partition is used to evaluate the influence of the patch’s parameters (size and location) and to train robust patches with EoT. Finally, the validation partition is used to validate that the robust patches work on unseen imagery.
Segmentation Models. In our evaluations we trained and attacked 37 different models which were combinations of 8 different encoders and 5 state-of-the-art segmentation architectures (every combination of encoder and architecture, except for the PAN architecture, which was incompatible with three of the encoders). The encoders were vgg19, densenet121, efficientnet-b4, efficientnet-b7, mobilenet_v2, resnext50_32x4d, dpn68, and xception [13]. For the architectures, we used the implementations (https://github.com/qubvel/segmentation_models.pytorch) of the state-of-the-art segmentation networks described in section 4.1. The models were trained on the train partition for 100 epochs each, with a batch size of 8 and a learning rate of 1e-4, using Dice loss and an Adam optimizer. Finally, to increase the training set size and improve generalization, we performed data augmentation. The augmentations were: flip, shift, crop, blur, sharpen, and changes to perspective, brightness, and gamma.
The Experiments. We performed three experiments:
EXP1: In this experiment we investigate the influence which a patch’s size and location have on the attack performance. We also investigate the influence of the remote target’s size and location. Here patches are crafted to target individual images. Therefore, the results of this experiment also tell us how well the attack performs on static images.
EXP2: To use this attack in the wild, the patch must work under various transformations and in new scenes. This experiment evaluates the attack’s robustness by (1) training the patches with EoT according to (3), and by (2) measuring the performance of these patches on new images (unseen during training).
EXP3: To get an idea of the vulnerability’s prevalence, we attack 32 different segmentation models and measure their performance. To evaluate the case where the attacker has no information on the model, we take the robust patches trained in EXP2 and use them on the other 36 models to measure the attack’s transferability.
For all of the experiments, we trained on an NVIDIA Titan RTX with 24GB of RAM. For the optimizer, we experimented on a variety of options in the Torch library. We found that the Adamax optimizer works best on the CamVid Dataset.
5.1 EXP1: The Impact of Size and Location
The purpose of this experiment is to see how the size and locations of a patch and its target affect the attack’s performance. In this experiment, we craft patches which target a single image. Later in section 5.2 we evaluate multi-image ‘robust’ patches.
Experiment Setup. For EXP1 we attacked the efficientnet-b7_FPN model since it performed best on the CamVid dataset. A list of the evaluations and parameters used in EXP1 can be found in Table 2. For each of these parameters, we varied their values while locking the rest to measure their influence. This was repeated for each of the model’s top six performing classes. Due to time restrictions (each of these experiments on the full set takes 3-5 days on an NVIDIA Titan RTX with 24GB RAM), we only used the entire evaluation set for the fixed parameter experiment. For the other experiments, we used 20 random images from that set.
The training procedure was as follows: For each patch, we used a learning rate of 2.5 and stopped the training after three minutes to ensure that each of the five experiments would take no more than 5 days. We note that in many cases the patches were still converging, so the results can be improved. Finally, we count a successful attack as any image with at least 80% of the target region marked by the model as the target class.
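The success criterion can be expressed as a short check. This is a sketch under assumptions: the predicted segmentation is given as a 2-D grid of class ids, the target region as a list of (row, column) pixels, and the function name `attack_succeeded` is hypothetical.

```python
def attack_succeeded(pred_classes, target_region, target_class, threshold=0.8):
    """Return True if at least `threshold` of the target pixels are
    predicted as `target_class`.

    pred_classes:  2-D grid of predicted class ids, one per pixel
    target_region: iterable of (row, col) pixels inside the target
    """
    pixels = list(target_region)
    hits = sum(1 for r, c in pixels if pred_classes[r][c] == target_class)
    return hits / len(pixels) >= threshold


# Hypothetical 2x5 target region: 8 of 10 pixels carry the target class (1).
grid = [[1, 1, 1, 1, 0],
        [1, 1, 1, 1, 0]]
region = [(r, c) for r in range(2) for c in range(5)]
result = attack_succeeded(grid, region, target_class=1)
```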
[Table 2 notes: *The patch is scaled down to the stated size from 100x100. **On the diagonal from the image center to the bottom right. ***8x6 grid over the entire image.]
5.1.1 Performance with all Parameters Locked
The results for the experiment where the patch parameters are locked can be found in Fig. 6. The top of the figure shows that the attack has a greater impact on structural classes than on others. This might be because these semantics have the largest regions in the CamVid dataset (i.e., are common). As a result, the patch is able to leverage these contexts better from one side of an image to another. For example, if there is a row of buildings on one side of the road, then there is a higher probability that the other side will have one too. This kind of correlation is exploited by the patch. The road class outperforms all the rest because the target in this experiment is in the center of the image, where the road is most commonly found (77% of the images). However, the patch is able to successfully attack the classes of pavement, building, and tree at the same location, even though on the clean images, the model predicts %, %, and % of them to have these classes respectively.
At the bottom of Fig. 6 we can see the aggregated confidence of the model for each of the images in . The plot shows that all of the images are susceptible to the attack for at least one of the target classes.
5.1.2 Impact of the Patch Size
Figure 8 plots the model’s confidence over increasingly larger patch sizes. In the figure, we have marked 0.5 as the decision threshold which is the default for segmentation models. This is because segmentation models perform binary-classification on each pixel. As a result, the confidence scores per class are either close to zero or one, but not so much in between (as seen in Fig. 6).
As expected, larger patches increase the attack success rate. However, the trade off appears to be linear (captured by the average in red). What is meaningful about these results is that some classes excel with smaller patch sizes (e.g., pavement and building) while others require larger ones to succeed (e.g., tree). This is probably because some of the remote contextual semantics which the model considers cannot be compressed into small spaces when others can. Overall, we observe that the minimum patch size required to fool the model on a static image this size is about 60-75 pixels in width, and with a patch width of 100 pixels, nearly all attacks succeed.
5.1.3 Impact of the Patch Location
In Fig. 9 we can see that the attack is highly effective for all classes up to about 62% of the distance away from the target (image center). The sharp drop in attack performance for the tree and sky classes is understandable since there are fewer contextual semantics which can be exploited by the patch in the bottom right of the image. On the other hand, in areas just below the horizon (0-0.5 on the x-axis), the patch can exploit contextual semantics which the model uses (e.g., features such as lighting, reflections, and building geometry).
These results indicate that an attacker may be able to increase the likelihood of success by placing the patch on objects which have some contextual influence on the target region. For example, to create a crosswalk, it may be advantageous to put the sticker on a lamp post or parking meter since these objects may be found near crosswalks.
5.1.4 Impact of the Target Size
In this experiment, we increased the size of the target but observed the performance of the same 20x20 pixel region at the center of the target (i.e., our objective). In Fig. 10 we can see that large targets do not perform well. The reason for this is that having a large target requires the IPatch to subdue more semantics. As a result, the patch fails and the target region becomes patchy and corrupted. Small targets fail because it is hard for the patch to produce high-precision results. Rather, there is a balance between the intended 20x20 target and the actual target painted in the target pattern. We found that increasing the target size by a factor of 3 improves the performance at the intended region.
The reason why a larger target helps the patch reach the 20x20 region is that the patch tends to ‘leech’ nearby semantic regions. This makes sense since it is easier to change the boundaries of existing semantics (e.g., perceive a larger car) than generate new ones which are isolated (e.g., a tree in the middle of the road). Therefore, the added target size encourages the model to perform similar tactics.
5.1.5 Impact of the Target Location
In Fig. 7 we present the attack performance when targeting different remote locations in the image. It is clear that the influence of a patch on different regions is dependent on both the image’s content and the targeted class. For example, it is easier to convince the model that any space under the horizon is a road, yet it is hard to change the class of the top-center to building because it is rarely found there. Overall, this experiment demonstrates that the patch can target far remote locations within the image. However, this capability is not uniform across the classes, as we can see with the class ‘car’.
5.2 EXP2: Patch Robustness
In this experiment we evaluate how well a single patch performs (1) on different transformations and (2) on multiple seen and unseen images.
Experiment Setup. To perform this experiment, we used EoT (3) to train a single patch for each class. The incremental training framework from section 4.2 was used with the patch origin set to (370,270). For the patch size, we sampled uniformly on the range of [50,80] pixels. For the shifts, we sampled uniformly within the entire bottom-right quadrant of the image. We found that training the patch in one region helps it converge using the incremental strategy, while still generalizing to the opposite side. We targeted the same segmentation model used in EXP1, and the training was performed using a batch size of 20 (the maximum for a 24GB GPU) with a learning rate of 0.5.
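The sampling of transformations in this setup can be sketched as follows. The size range [50, 80] and the bottom-right quadrant come from the text; the 480x360 frame size, the use of the patch's top-left corner as its position, and the requirement that the patch stay fully inside the quadrant are assumptions made for illustration.

```python
import random

IMG_W, IMG_H = 480, 360  # assumed frame size (width, height) in pixels


def sample_transform(rng):
    """Sample a patch size in [50, 80] and a top-left position that keeps
    the patch inside the bottom-right quadrant of the image."""
    size = rng.randint(50, 80)                    # inclusive on both ends
    left = rng.randint(IMG_W // 2, IMG_W - size)  # x of the top-left corner
    top = rng.randint(IMG_H // 2, IMG_H - size)   # y of the top-left corner
    return size, left, top


rng = random.Random(42)
samples = [sample_transform(rng) for _ in range(5)]
```

Passing an explicit `random.Random` instance keeps the sampled transformations reproducible across training runs.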
Generalization to Multiple Images. In Fig. 11 we present the performance of the patches in the form of the model’s perception. The images demonstrate that the IPatch generalizes well to multiple images, even at different locations and scales. In Fig. 12 we present the attack performance when training on different numbers of images (evaluated against the same set). From here we can see that an exponential number of examples is needed to increase the performance.
Generalization to New Images. To use the patch in a real world setting, it must work well in scenes which were not in the attacker’s training set. Figure 13 presents the attack performance of patches trained on (those displayed in Fig. 11) when applied to images in . The results show that the patches generalize well to unseen images. More interestingly, the performance of some classes are dramatically different compared to patches trained on single images without EoT (EXP1, Fig. 6). For example, ‘tree’ now has a 0.98% success rate compared to 50% and ‘building’ is now 25% compared to 65%. We learn from this that by considering multiple images, the model can learn stronger tactics. At the same time, the variability of the transformations prevent the model from using highly specific adversarial patterns. We also note that the class ’car’ does not transfer to unseen images like the other classes. We attribute this to the segmentation model’s poor performance on detecting cars in general.666Although the selected efficientnet-b7_FPN achieves a lesser intersection over union score of 0.75 on that class, it outperforms the other models overall.
5.3 EXP3: The Impact on Different Models
In this experiment we explore the susceptibility of different models to the attack and the transferability of patches between models.
5.3.1 Model Susceptibility
Experiment Setup. To evaluate the performance of the attack on different model architectures, we used 36 of the 37 segmentation models described at the beginning of Section 5 (PAN was omitted since it was not compatible with Torch’s autograd in our framework). Due to time limitations, the attacks on each model were limited to 4 classes, 10 images, and 3 minutes of training time for each image.
Results. We found that all 36 models are susceptible to the RAP attack for at least one class (Fig. 14). By observing the patterns in the columns, we note that some architectures are less susceptible to attacks on certain classes, for example Linknet, PSPNet, and Unet++ on pavement, and PSPNet on car.
In Fig. 23, we can see the susceptibility of the encoders and architectures overall. Some of the most susceptible encoders (xception and resnext) and architectures (FPN and Unet++) use skip connections or residual pathways in their networks. These pathways enable the networks to capture features at multiple scales and to better capture global context. However, just as these networks utilize these pathways to obtain better perspectives, so can the IPatch in order to reach deeper into the image. Interestingly, we found that the dpn68 encoder is consistently resilient against the attack. This encoder is formally called a Dual Path Network. It uses a residual path like a ResNet to reuse learned features and a densely connected path like DenseNet to encourage the network to explore new features. These diverse features may be preventing the IPatch, and possibly the segmentation model, from reaching remote contexts.
5.3.2 Inter-Model Transferability
In the case where the attacker does not have knowledge of the victim’s model, we would like to know how well a patch trained on one model transfers to others.
Experiment Setup. To perform this experiment, we took the robust patches trained using the efficientnet-b7_FPN (EXP2) and attacked each of the other 36 models (listed in Fig. 16). The patches for the top 4 classes (sky, building, pavement, tree) were applied to the images in using the random transformations described in EXP2.
Results. We found that the patch from efficientnet-b7_FPN can influence the other models’ predictions on the target region with an attack success rate of 11-37% (about 1-4 times in every 10 cases). We note that a camera on an autonomous car processes at least 30 frames per second. Therefore, there is a high likelihood that the car’s model will succumb to the attack in at least some frames while driving by.
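To make the frame-rate argument concrete, the sketch below computes the chance that at least one frame in a window is fooled, given a per-frame success rate. It rests on a strong simplifying assumption that is not claimed in the text: every frame is treated as an independent trial. Consecutive video frames are highly correlated, so the real figure may differ substantially. The function name is hypothetical.

```python
def prob_at_least_one_success(per_frame_rate, num_frames):
    """Probability that at least one of num_frames independent trials succeeds.

    Assumes independence between frames, which is an idealization.
    """
    if not 0.0 <= per_frame_rate <= 1.0:
        raise ValueError("per_frame_rate must be in [0, 1]")
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    return 1.0 - (1.0 - per_frame_rate) ** num_frames


if __name__ == "__main__":
    # One second of video at 30 frames per second, for the reported range.
    for rate in (0.11, 0.37):
        print(rate, prob_at_least_one_success(rate, 30))
```

Under this idealized model, even the lower end of the reported range yields a probability above 0.9 within one second of video.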
Fig. 16 shows the largest confidences for each model, measured as the relative increase from the original confidence (on the clean image). Interestingly, models using the Unet++ architecture were the most susceptible, followed by Linknet. We believe the reason for this is that both of these models use skip-connections to allow feature maps to bypass the encoding process. As a result, features in the patch have a more direct impact on the output. It is known that skip-connections make models more vulnerable to adversarial examples, but it is interesting to see that they are vulnerable to transfer attacks as well. Another observation is that there does not seem to be a correlation between the results and the encoder used. This is probably because all of the encoders were trained on the same ImageNet dataset.
6 Extending to Object Recognition
In Section 5 we performed an in-depth evaluation and analysis of the IPatch as a RAP against segmentation models. However, the same training framework from Section 4.2 can be used on other semantic models as well. In this section, we present preliminary results against a popular object recognition model called YOLOv3.
6.1 Technical Background
The family of YOLO models follows a similar architecture (Fig. 17). The input image is passed through a series of convolutional layers ( in the figure) and then those feature maps are shuttled to various decoders. The decoders predict coarse maps of the image at different scales using the semantic information shared between them. The multiple scales help the model detect objects of different sizes (e.g., detects large objects). Each cell in a map contains an objectness score, a class probability, and a bounding box (obtained via regression). If a cell has an objectness score above some threshold, then there is an object there with the associated class probability. Finally, a non-maximum suppression (NMS) algorithm is used on the maps to identify and unify the detections.
Experiment Setup. To see if the attack would work on YOLO, we created an IPatch which convinces YOLO that there is a person standing in the middle of the road. To accomplish this, we used a pre-trained YOLOv3 model implementation (https://github.com/eriklindernoren/PyTorch-YOLOv3) as the victim, and trained our patch using 30k images: 15k random samples from the Bdd100k dataset and 15k frames from a Toronto car driving video on YouTube (https://youtu.be/50Uf_T12OGY).
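The last two steps of this pipeline, objectness thresholding followed by NMS, can be sketched in plain Python. This is a simplified, single-class illustration rather than the YOLOv3 implementation: the threshold values, the tuple layout `(objectness, class_prob, box)`, the ranking by the product of the two scores, and the function names are assumptions for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detect(cells, obj_threshold=0.5, iou_threshold=0.5):
    """Filter cells by objectness, then apply greedy NMS.

    cells: list of (objectness, class_prob, box) tuples for one class.
    Returns the surviving tuples, highest score first.
    """
    # Step 1: keep only cells whose objectness exceeds the threshold.
    candidates = [c for c in cells if c[0] > obj_threshold]
    # Step 2: rank by objectness * class probability (assumed scoring rule).
    candidates.sort(key=lambda c: c[0] * c[1], reverse=True)
    # Step 3: greedily keep a box unless it overlaps an already kept box.
    kept = []
    for cand in candidates:
        if all(iou(cand[2], k[2]) < iou_threshold for k in kept):
            kept.append(cand)
    return kept


if __name__ == "__main__":
    cells = [
        (0.9, 0.8, (10, 10, 50, 50)),
        (0.8, 0.7, (12, 12, 52, 52)),      # overlaps the first box
        (0.3, 0.9, (100, 100, 120, 120)),  # below objectness threshold
        (0.7, 0.6, (200, 200, 240, 240)),
    ]
    for det in detect(cells):
        print(det)
```

The sketch makes visible why an attacker must raise both quantities at the target cell: a cell with low objectness is discarded before its class probability is ever considered.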
For training, we needed to ensure that both the objectness score and probability of the class ‘person’ were high. This was done by taking the product of ’s probability map and objectness map as , and by setting to highlight the cell in the lower-center of the image. For the loss functions, we took the sum of and since it increased the rate of convergence. EoT was used to scale the patch between 60-70 pixels in width and shift it randomly within the bottom-right quadrant of the image. Finally, we trained the patch for 3 days with a learning rate of 0.05.
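The construction of the combined map and the target can be illustrated with a small sketch. It is not the training code: maps are plain nested lists rather than tensors, the grid size is arbitrary, and the choice of the "lower-center" cell as the one at three quarters of the height and half of the width is an assumption, as are both function names.

```python
def joint_person_map(objectness, person_prob):
    """Element-wise product of an objectness map and a class probability map.

    Both inputs are 2D lists of equal shape; a cell is high in the result
    only if both its objectness and its 'person' probability are high.
    """
    return [
        [o * p for o, p in zip(o_row, p_row)]
        for o_row, p_row in zip(objectness, person_prob)
    ]


def lower_center_target(rows, cols):
    """Target map that is 1.0 at one lower-center cell and 0.0 elsewhere.

    The cell position (3/4 of the height, 1/2 of the width) is an
    illustrative assumption.
    """
    target = [[0.0] * cols for _ in range(rows)]
    target[(3 * rows) // 4][cols // 2] = 1.0
    return target


if __name__ == "__main__":
    rows, cols = 13, 13  # illustrative coarse grid size
    objectness = [[0.5] * cols for _ in range(rows)]
    person_prob = [[0.2] * cols for _ in range(rows)]
    joint = joint_person_map(objectness, person_prob)
    target = lower_center_target(rows, cols)
    print(joint[0][0], sum(map(sum, target)))
```

A loss computed between the joint map and the target then rewards the patch only when both scores rise together at the chosen cell.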
Results. We found that the YOLOv3 object detector is susceptible to the attack with an 85% attack success rate. In Fig. 18 we present an example frame which shows the objectness and probability maps in during the attack. We also found that smaller patches ranging from 50-60 pixels in width achieve an 80% attack success rate. Overall, it was relatively easy for the framework to change the objectness score of arbitrary locations in the image, compared to the class probability. We also observed that it is significantly harder to target the maps from and which capture smaller objects. We believe this is because and rely less on contexts around the image, giving the IPatch less leverage to perform a remote attack.
As future work, we plan to explore RAPs on other object detectors and investigate other semantic models as well.
7 Discussion & Countermeasures
The concept of a remote adversarial patch, introduced in this paper, opens up a wide range of possible attack vectors against image-based semantic models. Through our observations in Section 5, we were able to identify some of the attack’s capabilities and limitations.
Trade-Offs. Due to its flexibility, it may seem like the IPatch is harder to defend against compared to an ordinary adversarial patch. However, the performance of the patch is lower than that of a ‘point-based’ patch. This means that the adversary must consider whether a more reliable attack is needed over flexibility and stealth. Another consideration is that the adversary may want to experiment to find the optimal placement of the patch. This is because some regions give the patch more leverage based on the local semantics (section 5.1.3). One strategy is for the attacker to first scout the target region by filming the scene from multiple perspectives and then optimize the patch location using that dataset.
Defenses. Although the IPatch can be placed in arbitrary locations, we noticed that its presence is highly noticeable in the semantic segmentations (e.g., Fig. 11). We found that it is very hard to generate a patch which both achieves the attack and masks its own presence at the same time. Concretely, when setting except for the target region (as done in Fig. 2), we found that the model struggles to influence remote locations to the same extent. In future work, this may be improved through a custom loss function which balances the trade-off between the two objectives. Another solution might be to generate RAPs using a conditional GAN which considers the errors on the semantic map in (). Doing so may also reduce the corruptions to nearby semantics (e.g., see Fig. 24).
Another direction for defending against this attack is to limit the model’s dependency on global features. Although these global features are key to state-of-the-art models [31, 28, 15], it is possible to utilize them while also considering their layout and origin. One option may be to integrate capsule networks as part of the model’s architecture, since capsule networks are good at considering the spatial relationships in images.
Improvements. We noticed that the RAP attack is dependent on an image’s content when targeting segmentation models, but less so for the object detector YOLO. For example, we were able to perform remote adversarial attacks on a blank image with YOLO. The reason for this is not clear to us, and investigating it may lead to improvements in the proposed training methodology. Moreover, as future work, it would be interesting to investigate which types of features and classes a RAP can manipulate best and why. This research may lead to deeper insights into the vulnerability’s extents and limitations. Finally, to improve transferability, we suggest two directions: (1) include multiple models in the training loop to help the model identify common features, and (2) use adversarial training to improve the generalization of the patch.
In this paper, we have introduced the concept of a ‘remote adversarial patch’ (RAP) which can alter the semantic interpretation of an image while being placed anywhere within the field of view. We have implemented a RAP called IPatch and demonstrated that it is robust, can generalize to new scenes, and can impact other semantic models such as object detectors. With an average attack success rate of up to 93%, this attack forms a tangible threat. Although RAPs are in their infancy, we hope that this paper has laid some of the groundwork for exploring this new adversarial example.
In summary, neural networks are notorious for being black-boxes which are difficult to interpret. However, they are still used in critical tasks because their advantages outweigh their potential disadvantages. We hope that our findings will help the community improve the security of deep learning applications so that we may continue to benefit from safe and reliable autonomous systems.
The reader can access the code used to create the adversarial patch and all of the 37 models evaluated in this paper online at: [will be released soon at https://github.com/ymirsky].
[1] (2018) On the robustness of semantic segmentation models to adversarial attacks. In , pp. 888–897.
[2] Synthesizing robust adversarial examples. International conference on machine learning, pp. 284–293.
[3] (2017) Segment-before-detect: vehicle detection and classification through semantic segmentation of aerial images. Remote Sensing 9 (4), pp. 368.
[4] (2021) Autopilot ai-tesla. Note: https://www.tesla.com/autopilotAI (Accessed on 02/04/2021).
[5] Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition 84, pp. 317–331.
[6] (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97.
[7] (2017) Adversarial patch. arXiv preprint arXiv:1712.09665.
[8] (2017) Linknet: exploiting encoder representations for efficient semantic segmentation. In 2017 IEEE Visual Communications and Image Processing (VCIP), pp. 1–4.
[9] (2018) Shapeshifter: robust physical adversarial attack on faster r-cnn object detector. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 52–68.
[10] (2017) Dual path networks. arXiv preprint arXiv:1707.01629.
[11] (2020) Adversarial objectness gradient attacks in real-time object detection systems. In 2020 Second IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA), pp. 263–272.
[12] (2020) Adversarial patch camouflage against aerial detection. In Artificial Intelligence and Machine Learning in Defense Applications II, Vol. 11543, pp. 115430F.
[13] (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
[14] (2018) Learning to predict crisp boundaries. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 562–578.
[15] (2020) MA-net: a multi-scale attention network for liver and tumor segmentation. IEEE Access 8, pp. 179656–179665.
[16] (2017) Adversarial examples for semantic image segmentation. arXiv preprint arXiv:1703.01101.
[17] (2018) A survey on deep learning techniques for image and video semantic segmentation. Applied Soft Computing 70, pp. 41–65.
[18] (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[19] (2020) CPSPNet: crowd counting via semantic segmentation framework. In 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), pp. 1104–1110.
[20] (2017) Universal adversarial perturbations against semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2755–2764.
[21] (2020) Dynamic adversarial patch for evading object detection models. arXiv preprint arXiv:2010.13070.
[22] (2020) Universal physical camouflage attacks on object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 720–729.
[23] (2019) Enhancing adversarial example transferability with an intermediate level attack. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4733–4742.
[24] (2017) Satellite imagery feature detection using deep convolutional neural network: a kaggle competition. arXiv preprint arXiv:1706.06169.
[25] (2020) Adversarial attacks for image segmentation on multiple lightweight models. IEEE Access 8, pp. 31359–31370.
[26] (2019) On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897.
[27] Waymo reminds us: successful complex ai combines deep learning and traditional code. Note: -code/?sh=4eb596723a2c (Accessed on 02/04/2021).
[28] (2018) Pyramid attention network for semantic segmentation. arXiv preprint arXiv:1805.10180.
[29] (2020) FA: a fast method to attack real-time object detection systems. In 2020 IEEE/CIC International Conference on Communications in China (ICCC), pp. 1268–1273.
[30] (2020) Adaptive square attack: fooling autonomous cars with adversarial traffic signs. IEEE Internet of Things Journal.
[31] (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125.
[32] (2018) Dpatch: an adversarial patch attack on object detectors. arXiv preprint arXiv:1806.02299.
[33] (2017) No need to worry about adversarial examples in object detection in autonomous vehicles. arXiv preprint arXiv:1707.03501.
[34] (2015) Foveation-based mechanisms alleviate adversarial examples. arXiv preprint arXiv:1511.06292.
[35] (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pp. 565–571.
[36] (2020) Image segmentation using deep learning: a survey. arXiv preprint arXiv:2001.05566.
[37] (2019-08) CT-gan: malicious tampering of 3d medical imagery using deep learning. In 28th USENIX Security Symposium (USENIX Security 19), Santa Clara, CA, pp. 461–478.
[38] (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436.
[39] (2019) Impact of adversarial examples on deep learning models for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 300–308.
[40] (2021) A survey of semantic segmentation on biomedical images using deep learning. In Advances in VLSI, Communication, and Signal Processing, pp. 347–357.
[41] (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767.
[42] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497.
[43] (2017) Dynamic routing between capsules. arXiv preprint arXiv:1710.09829.
[44] (2018) A comparative study of real-time semantic segmentation for autonomous driving. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 587–597.
[45] (2021) Deep resolve. Note: https://www.siemens-healthineers.com/en-us/magnetic-resonance-imaging/technologies-and-innovations/deep-resolve (Accessed on 02/04/2021).
[46] (2018) Darts: deceiving autonomous cars with toxic signs. arXiv preprint arXiv:1802.06430.
[47] (2018) Physical adversarial examples for object detectors. In 12th USENIX Workshop on Offensive Technologies (WOOT 18).
[48] (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
[49] (2018) Multinet: real-time joint semantic reasoning for autonomous driving. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1013–1020.
[50] (2019) Fooling automated surveillance cameras: adversarial patches to attack person detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0.
[51] Artificial intelligence is going to supercharge surveillance the verge. Note: security (Accessed on 02/04/2021).
[52] (2018) Transferable adversarial attacks for image and video object detection. arXiv preprint arXiv:1811.12641.
[53] (2020) Skip connections matter: on the transferability of adversarial examples generated with resnets. arXiv preprint arXiv:2002.05990.
[54] (2020) Making an invisibility cloak: real world adversarial attacks on object detectors. In European Conference on Computer Vision, pp. 1–17.
[55] (2017) Adversarial examples for semantic segmentation and object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1369–1378.
[56] (2019) Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2730–2739.
[57] (2020) Bdd100k: a diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2636–2645.
[58] (2020) Contextual adversarial attacks for object detection. In 2020 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6.
[59] (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890.
[60] (2019) Seeing isn’t believing: towards more robust adversarial attack against real world object detectors. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 1989–2004.
[61] (2020) Object hider: adversarial patch attack against object detectors. arXiv preprint arXiv:2010.14974.
[62] (2018) Unet++: a nested u-net architecture for medical image segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support, pp. 3–11.
[63] (2020) The translucent patch: a physical and universal attack on object detectors. arXiv preprint arXiv:2012.12528.