Physical Adversarial Textures that Fool Visual Object Tracking

04/24/2019 ∙ by Rey Reza Wiyatno, et al. ∙ Element AI Inc

We present a system for generating inconspicuous-looking textures that, when displayed in the physical world as digital or printed posters, cause visual object tracking systems to become confused. For instance, as a target being tracked by a robot's camera moves in front of such a poster, our generated texture makes the tracker lock onto it and allows the target to evade. This work aims to fool seldom-targeted regression tasks, and in particular compares diverse optimization strategies: non-targeted, targeted, and a new family of guided adversarial losses. While we use the Expectation Over Transformation (EOT) algorithm to generate physical adversaries that fool tracking models when imaged under diverse conditions, we compare the impacts of different conditioning variables, including viewpoint, lighting, and appearances, to find practical attack setups with high resulting adversarial strength and convergence speed. We further showcase that textures optimized solely using simulated scenes can confuse real-world tracking systems.




I Introduction

Research on adversarial attacks [1, 2, 3] has shown that deep learning models, e.g. for classification and detection tasks, are confused by adversarial examples: slightly-perturbed images of objects that cause them to make wrong predictions. While early attacks digitally modified inputs to a victim model, later advances created photos [4] and objects in the physical world that lead to misclassification under diverse imaging conditions [5, 6]. Due to these added complexities, many physical adversaries were not created to look indistinguishable from regular items, but rather as inconspicuous objects such as colorful eyeglasses [7, 8].

Other recent work has shown the existence of adversaries that confuse regression tasks [9, 10]. Still, there is a general lack of analysis on the strength and properties of adversaries as a function of different attack objectives.

All this while, high-performing deep learning models have found their way into robotics applications, such as end-to-end self-driving networks [11, 12]. Consequently, it is ever more vital to be aware of the presence of adversaries, especially those in the physical world, that can cause robotic systems to misbehave and lead to grave physical damages.

We study the creation of physical adversaries for object tracking tasks, whose goal is to find the bounding-box location of a target in the current camera frame given its location in the previous frame. We present a system for generating Physical Adversarial Textures (PAT) that, when displayed as advertising or art posters, cause regression-based neural tracking models like GOTURN [13] to break away from their tracked targets, even though these textures do not look like targets to human eyes, as seen in Figure 1(a).

(a) source texture
(b) adversarial texture
Figure 1: A poster of a Physical Adversarial Texture, resembling a photograph, causes a tracker's bounding-box predictions to lose track as the target person moves over it.

Fooling a tracking system comes with added challenges compared to attacking classification or detection models. Since a tracker adapts to changes in the target’s appearance, an adversary must be universally effective as the target moves and turns. Also, some trackers like GOTURN only search within a sub-region of the frame around the previous target location, and so only a small part of the PAT may be in view and not obstructed, yet it must still be potent. Furthermore, it is insufficient for the tracker to be slightly off-target on any single frame, as it may still end up tracking the target semi-faithfully; effective adversaries must cause the system to break away from the tracked target over time. To account for these challenges, for most of our studies we loosen the condition that an adversary must be perceptually similar to an imitated source image, to one that merely looks inconspicuous and unlike the tracked target to humans.

Our main contributions are as follows:

  1. we create novel Physical Adversarial Textures that fool object trackers to consistently lose track of targets,

  2. we contrast the efficacy of different adversarial regression objectives, namely non-targeted, targeted, and a new class of guided adversarial losses,

  3. we study how to efficiently optimize adversaries using randomized scenes, by choosing to randomize impactful scene transformations only, and

  4. we assess the transfer of adversaries created in simulation, applied to a real-world tracking setup.

II Related Works

In the vision domain, adversarial attacks have mostly been applied in classification, segmentation, and detection tasks. Early adversarial attack methods, such as L-BFGS [1], FGSM [2], JSMA [3], and C&W [14] attacks often compute the gradient of an adversarial objective with respect to pixel inputs, in order to perturb a specific source input into an adversarial imitation. Moosavi-Dezfooli et al. [15] introduced an attack for creating a Universal Adversarial Perturbation (UAP) that can be applied onto many distinct source images to make them adversarial. While these methods can generate potent adversaries by digitally perturbing specific pixel values, they generally lose effectiveness when the adversary is imaged as a real-world photo.

Early white-box physical adversarial attacks, which assumed access to the victim model's internals, iteratively ran gradient-based methods such as FGSM [2] to make printable adversaries that are effective under somewhat varying views [4]. Similar approaches were used to create eyeglass frames for fooling face recognition models [7, 8], and by the RP2 algorithm [5] to make stop signs look like speed limits to a road sign classifier. Both systems only updated gradients within a mask in the image, corresponding to the eyeglass frame or road sign. Still, neither work explicitly accounted for the effects of lighting on the imaged items.

Expectation Over Transformation (EOT) [6] formalized the strategy used by [7, 5] of optimizing for adversarial attributes of a mask by applying a combination of random transformations to it. By varying the appearance and/or position of a 2-D photograph or 3-D textured object as mask, EOT-based attacks [6, 16, 17] generated physically-realizable adversaries robust within a range of viewing conditions. Our adversarial attack is also based on EOT, but we importantly study the efficacy and the need to randomize over different transformation variables, including foreground/background appearances, lighting, and spatial locations of the camera, adversary, and surrounding objects.

CAMOU is a black-box attack that also applied EOT to find adversarial textures for a car’s 3-D model, such that object detection networks would ignore it in images produced by a photo-realistic rendering engine. CAMOU approximated the gradient of an adversarial objective through both the complex rendering process and opaque victim network, by using a learned surrogate mapping [18] from the texture space directly onto the detector’s confidence score. Despite their success, this method was not tested in real-world settings, and also incurs high computational costs and potential instability risks by alternating between the optimizations of the surrogate model and adversarial perturbations.

DeepBillboard [10] attacked autonomous driving systems by creating adversarial billboards that caused the victim model to deviate its predicted steering angles within real-world drive-by sequences. While our work shares many commonalities with DeepBillboard, we confront added challenges by attacking a sequential tracking model rather than a per-frame regression task, and we also contrast the results of diverse adversarial optimization objectives.

III Object Tracking Networks

This work aims to find and study physical adversaries that compromise visual tracking and servoing models. Various learning-based tracking methods have been proposed, such as the recent GOTURN [13] deep neural network for regressing the location of an object in a camera frame given its previous location and appearance. While other methods based on feature-space cross-correlation [19, 20] and tracking-by-detection [21] are also viable, we use several GOTURN models to ground our investigations into differing adversarial optimization objectives and for the compute efficiency of an EOT-based attack.

As seen in Figure 2, given a target’s bounding-box location of size in the previous frame , GOTURN crops out the template as a region of size around the target within . The current frame is also cropped to the same region, yielding the search area , which is assumed to still contain most of the target. Both the template and search area are resized to and processed through convolutional layers. The resulting feature maps are then concatenated and passed through several fully-connected layers with non-linear activations, ultimately regressing , that is, the top-left and bottom-right coordinates of the target’s location within the current search area .
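To make the cropping geometry concrete, the following minimal sketch shows how a search region could be extracted around the previous bounding box. The function name and the `context_factor` parameter are our own illustrative assumptions; the paper elides the exact scaling constant.

```python
import numpy as np

def crop_search_region(frame, prev_box, context_factor=2.0):
    """Crop a search region around the previous bounding box.

    frame: H x W x 3 image array.
    prev_box: (x1, y1, x2, y2) of the target in the previous frame.
    context_factor: how much larger the crop is than the box
                    (an illustrative choice; the exact factor is elided above).
    Returns the crop and its top-left offset in frame coordinates.
    """
    x1, y1, x2, y2 = prev_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * context_factor, (y2 - y1) * context_factor
    # Clamp the crop to the frame boundaries.
    left = int(max(0, cx - w / 2))
    top = int(max(0, cy - h / 2))
    right = int(min(frame.shape[1], cx + w / 2))
    bottom = int(min(frame.shape[0], cy + h / 2))
    return frame[top:bottom, left:right], (left, top)
```

The same crop region is applied to both the previous and current frames, producing the template and search area respectively.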

Such predictions can also be used for visual servoing, i.e. to control a flying or wheeled robot to follow a target through space. One family of approaches [22, 23] regulate the center-points and areas of predictions about the center of the camera frame and a desired target size, respectively, using Proportional-Integral-Derivative (PID) controllers on the forward/backward, lateral, and possibly vertical velocities of the vehicle. In this work, we demonstrate that visual tracking models, as well as derived visual servoing systems, can be compromised by Physical Adversarial Textures.
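A minimal sketch of such a servoing loop is given below, with hypothetical gains and interfaces (not the exact controllers of [22, 23]): the bounding-box center error drives lateral velocity, and the deviation of the box area from a desired size drives forward velocity.

```python
class PID:
    """Minimal PID controller (illustrative gains; a sketch, not the controllers of [22, 23])."""
    def __init__(self, kp, ki=0.0, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_error = 0.0, None

    def step(self, error, dt):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

def servo_command(pred_box, frame_w, frame_h, desired_area,
                  lateral_pid, forward_pid, dt=0.05):
    """Map a predicted bounding box to lateral and forward velocity commands."""
    x1, y1, x2, y2 = pred_box
    cx = (x1 + x2) / 2.0
    area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    lateral_err = (cx - frame_w / 2.0) / frame_w        # off-center target -> strafe/turn
    forward_err = (desired_area - area) / desired_area  # too-small target -> move forward
    return lateral_pid.step(lateral_err, dt), forward_pid.step(forward_err, dt)
```

A centered target of the desired size yields zero commands, so the vehicle holds station.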

IV Attacking Regression Models

For classification tasks, an adversarial example is defined as a slightly-perturbed version of a source image that satisfies two conditions: adversarial output — the victim model misclassifies the true label, and perceptual similarity — the adversary is perceived by humans as similar to the source image. We discuss necessary adjustments to both conditions when attacking regression tasks. We also consider diverse ways to optimize for an adversary, and notably formalize a new family of guided adversarial losses. While this work focuses on images, the concepts discussed below are sufficiently general and can be applied to other domains as well, such as fooling audio transcriptions [9].

IV-A Adversarial Strength

There is no task-agnostic analog to misclassification for regression models, due to the non-discrete representation of their outputs. Instead, an adversarial output is characterized by thresholding a task-specific error metric. This metric may also be used to quantify adversarial strength. For example, adversaries for human-joint pose-prediction can be quantified by the percentage of predicted joint poses beyond a certain distance from ground-truth locations [9]. Separately, DeepBillboard [10] defines unsafe driving of an autonomous vehicle as experiencing excessive total lateral deviation, and quantifies adversarial strength as the percentage of frames in a given unit of time where the steering angle error exceeds a corresponding threshold.

When fooling a visual tracker, the end-goal is for the system to break away from the target over time. Therefore, we consider a sequence of frames where the target moves across a poster containing an adversarial texture, and quantify adversarial strength by the average amount of overlap between GOTURN predictions and the target's true locations. We also separate the tracker's baseline performance from the effects of the adversary, by computing the average overlap ratio across another sequence, in which the adversarial texture is replaced by an inert, source texture. Thus, in this work, adversarial strength is defined by averaging the mean-Intersection-Over-Union-difference metric over multiple generated sequences:

$$\Delta\text{mIoU} = \frac{1}{T}\sum_{t=1}^{T}\text{IoU}\big(\hat{b}^{src}_t, b^{src}_t\big) - \frac{1}{T}\sum_{t=1}^{T}\text{IoU}\big(\hat{b}^{adv}_t, b^{adv}_t\big), \qquad \text{IoU}(a,b) = \frac{|a \cap b|}{|a \cup b|}$$

where $a \cap b$ denotes the intersection of two bounding boxes, $a \cup b$ their union, and $|\cdot|$ the area of a bounding box.
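A sketch of this metric in code, consistent with the definition above (function and variable names are ours):

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def miou_difference(preds_src, preds_adv, truths):
    """Mean-IoU on the inert (source) sequence minus mean-IoU on the
    adversarial sequence; higher values indicate a stronger adversary."""
    miou = lambda preds: sum(iou(p, t) for p, t in zip(preds, truths)) / len(truths)
    return miou(preds_src) - miou(preds_adv)
```

A value near 1 means the tracker matched the target with an inert poster but lost it entirely against the adversarial one.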

IV-B Perceptual Similarity

Perceptual similarity is often measured by the distance between a source image and its perturbed variant, e.g. using the Euclidean norm in the RGB colorspace [1, 14]. Sometimes, a loose threshold is applied to this constraint, in order to generate universal adversaries that remain potent within diverse conditions [15, 6, 24]. Other times, the goal is not to imitate a source image, but simply to create an inconspicuous texture that does not look harmful to humans yet causes models to misbehave [7, 16, 10]. While this latter condition is subjective, with this work we aim to raise awareness that colorful-looking art can be harmful to vision models.

IV-C Optimizing for Adversarial Behaviors

Our method works similarly to other optimization-based adversarial attacks, and perturbs an initial texture image for the physical poster until it becomes adversarial. Formally, for $N$ iterations, the texture is incrementally updated as:

$$\hat{x}_{i+1} = \hat{x}_i + \epsilon_i \, \delta_i$$

where $\epsilon_i$ denotes the pixel perturbation's step size at the $i$-th iteration, and $\delta_i$ denotes a gradient-based perturbation term.

While the end-goal of our adversarial attack is to cause the tracker to break away from its target, this can be attained through different adversarial behaviors, such as locking onto part of an adversarial poster, or predicting onto other parts of the scene. These behaviors are commonly optimized into an adversary through loss minimization, e.g. using gradient descent. The literature has proposed several families of adversarial losses, notably:

  • the baseline non-targeted loss maximizes the victim model's training loss, thus causing it to become generally confused (e.g. FGSM [2], BIM [4]);

  • targeted losses also apply the victim model's training loss, but to minimize the distance to a target adversarial output (e.g. JSMA [3]);

  • we define guided losses as a middle ground between non-targeted and targeted losses, which enforce certain adversarial attributes rather than strict output values, similar to misclassification onto a set of output values [4]; and

  • hybrid losses use weighted linear combinations of the above losses to gain adversarial strength and/or speed up the attack (e.g. C&W [14], Hot/Cold [25] attacks).
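These loss families can be illustrated with the following sketch on bounding-box outputs; the exact loss forms and the shrink factor are illustrative assumptions, not the precise objectives used in our attack.

```python
import numpy as np

# Illustrative adversarial losses on a bounding-box regressor that outputs
# (x1, y1, x2, y2). Names, signatures, and constants are ours.

def loss_non_targeted(pred, truth):
    # Maximize the victim's training (L1) loss -> minimize its negation.
    return -np.abs(np.asarray(pred, float) - np.asarray(truth, float)).sum()

def loss_targeted(pred, target_box):
    # Pull predictions toward a fixed adversarial output.
    return np.abs(np.asarray(pred, float) - np.asarray(target_box, float)).sum()

def loss_guided_shrink(pred, truth):
    # Guided: penalize predicted area above a shrunken fraction of the true
    # area, without prescribing where the box must be.
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    return max(0.0, area(pred) - 0.5 * area(truth))

def loss_hybrid(pred, truth, target_box, w=1.0):
    # Weighted linear combination of a non-targeted and a targeted term.
    return loss_non_targeted(pred, truth) + w * loss_targeted(pred, target_box)
```

Note how the guided loss constrains only an attribute (predicted area) while leaving the box location free, sitting between the fully unconstrained non-targeted loss and the fully prescribed targeted loss.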

To fool object trackers, we consider these specific losses:

  • increases GOTURN’s training loss;

  • shrinks predictions towards the bottom-left corner of the search area;

  • predicts the exact location of the target in the previous frame;

  • grows predictions to the maximum size of the search area;

  • encourages the area of each prediction to shrink from the ground-truth value;

  • encourages the area of each prediction to grow from the ground-truth value.

Perceptual similarity can be enforced as an added Lagrangian-relaxed loss [1, 14, 6]. This perceptual loss term is weighted by a Lagrange multiplier that is either heuristically chosen, or iteratively fine-tuned using linear and binary search, so as to find the smallest multiplier for which adversaries can be found consistently. Various distance metrics have been proposed to quantify perceptual similarity, including Euclidean ($\ell_2$) (e.g., ATN [26], MI-FGM [27]), Hamming ($\ell_0$) (e.g., JSMA [3]), or signed ($\ell_\infty$) (e.g., FGSM [2]), applied directly onto pixel differences in the RGB colorspace. Alternatively, we can minimize the human-perceptual change between the appearances of the source image and the adversary by minimizing Euclidean distance within the CIELab colorspace [6]. While most of our experiments generate inconspicuous adversaries that do not enforce perceptual similarity, Section VI-D specifically showcases imitation attacks.
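The multiplier search can be sketched as a simple bisection, assuming attack success is monotone in the perceptual weight; the `attack_succeeds` callback interface is hypothetical.

```python
def tune_multiplier(attack_succeeds, lo=0.0, hi=100.0, iters=12):
    """Bisection over the Lagrange multiplier for the perceptual loss.

    `attack_succeeds(lam) -> bool` is a hypothetical callback that runs the
    full attack with perceptual weight `lam`. Success is assumed monotone:
    large weights over-constrain the attack. Returns the largest weight at
    which an adversary was still found.
    """
    best = lo
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if attack_succeeds(mid):
            best, lo = mid, mid   # can afford more similarity pressure
        else:
            hi = mid
    return best
```

Each bisection step runs the full attack once, so this search is expensive but embarrassingly simple.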

In summary, our adversarial attack iteratively optimizes a possibly-imitated source texture into an adversarial variant, by minimizing a weighted linear combination of loss terms:

$$\mathcal{L} = \sum_j w_j \, \mathcal{L}_j + \lambda \, \mathcal{L}_{sim}$$

V Physical Adversarial Textures

We now discuss how the above attack formulation can be generalized to produce Physical Adversarial Textures (PAT) that resemble colorful art, such that, when displayed on a digital poster and captured in camera frames near a tracked target, they cause the tracking model to misbehave.

Figure 2: The Physical Adversarial Texture (PAT) Attack creates adversaries to fool the GOTURN tracker, via minibatch gradient descent to optimize various losses, using randomized scenes following Expectation Over Transformation (EOT).

In this work, we assume to have white-box access to the GOTURN network’s weights and thus the ability to back-propagate through it. Also, we focus on tracking people and humanoid robots specifically, and assume that the tracker was trained on such types of targets.

As mentioned in Section I, several challenges arise when generating adversaries to fool temporal tracking models. We address these by applying the Expectation-Over-Transformation (EOT) algorithm [6], which minimizes the expected loss over a minibatch of scenes imaged under diverse conditions. EOT marginalizes across the distributions of different transformation variables, such as the poses of the camera, tracked target, and poster, as well as the appearances of the target, environmental surroundings, and ambient lighting. While integrating EOT into a gradient-based attack simply entails running minibatch gradient descent for iterations, it may be impractical to marginalize over wide ranges of many condition variables. Thus, Section VI-C studies the impacts of diverse EOT variables on adversarial strength and attack speed.
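One EOT step then reduces to averaging texture gradients over randomly sampled scenes. The sketch below assumes hypothetical `sample_scene` and `render_and_loss_grad` interfaces standing in for the scene randomizer and the differentiable rendering-plus-loss pipeline.

```python
import numpy as np

def eot_gradient(texture, sample_scene, render_and_loss_grad, batch_size=16):
    """One EOT step: average the texture gradient over a minibatch of
    randomly transformed scenes.

    sample_scene(): hypothetical sampler drawing camera/target/poster poses,
                    appearances, and lighting.
    render_and_loss_grad(texture, scene): hypothetical function that renders
                    the scene and back-propagates the adversarial loss
                    through rendering onto the texture.
    """
    grads = [render_and_loss_grad(texture, sample_scene())
             for _ in range(batch_size)]
    return np.mean(grads, axis=0)
```

Averaging before the update is what makes the resulting perturbation robust in expectation over the sampled transformation distribution, rather than overfit to any single view.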

A key addition when generating a physical adversarial item, as opposed to a digital one, is the need to render the textured item into scenes as it evolves during the attack process. Our attack creates PATs purely from scenes rendered using the Gazebo simulator [28], yet Section VI-E will show that these adversaries are also potent in the real world. Next, we elaborate on how to account for rendering, and notably lighting, when generating PATs.

V-A Modeling rendering and lighting

To optimize the loss with respect to the texture of a physical poster, we need to differentiate through the rendering process. Rendering can be simplified into two steps: projecting the texture onto the surface of a physical item and then onto the camera's frame, and shading the color of each frame pixel depending on light sources and material types. While rendering is generally not differentiable, possible ways to “de-render” include numerical approximation [29], using a simplified but differentiable renderer [17], and learning surrogate networks to map object textures either onto frames [6] or directly onto adversarial objectives [24].

Similar to [17], we sidestep lighting complexities, such as spotlight gradients and specular surfaces, by assuming controlled imaging conditions: the PAT is displayed on matte material and is lit by a far-away sun-like source, and the camera's exposure is adjusted to avoid pixel saturation. Consequently, we model lighting as a linear function, where each pixel's RGB intensities in the camera frame are a scaled and shifted version of the pixel values at the projected texture coordinate. This simplifies differentiation through shading. During our attack, we query the Gazebo simulation software to obtain exact gains for light intensity and material reflectance, while before each real-world test we fit the parameters of this per-channel linear lighting model once, using a displayed color calibration target.
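Fitting this per-channel linear lighting model reduces to least squares over corresponding calibration pixels; the following is a sketch under those assumptions (function and variable names are ours).

```python
import numpy as np

def fit_lighting_model(texture_rgb, observed_rgb):
    """Fit the per-channel linear model observed = gain * texture + bias
    by least squares over corresponding pixels of a displayed
    color-calibration target.

    texture_rgb, observed_rgb: N x 3 arrays of matched pixel values.
    Returns per-channel gains and biases.
    """
    gains, biases = [], []
    for c in range(3):
        # Design matrix [texture_value, 1] for channel c.
        A = np.stack([texture_rgb[:, c], np.ones(len(texture_rgb))], axis=1)
        (g, b), *_ = np.linalg.lstsq(A, observed_rgb[:, c], rcond=None)
        gains.append(g)
        biases.append(b)
    return np.array(gains), np.array(biases)
```

Once fitted, the model can be applied in either direction: to predict displayed colors during the attack, or to correct captured frames before feeding them to the tracker.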

As for projective geometry, following [6], we modified Gazebo's renderer to provide, for each frame, the projected frame coordinates of each texture pixel, as well as occlusion masks and bounding boxes of the target in front of the poster. We then use this texture-to-frame mapping to manually back-propagate through the projection process onto the texture space.

V-B PAT Attack

Figure 2 shows the overall procedure for generating a Physical Adversarial Texture. Starting from a source texture , we perform minibatch gradient descent on to optimize pixel perturbations that add onto the texture, for a total of iterations. On each iteration , we apply EOT to a minibatch of scenes, each with randomized settings for the poses of the camera, target, and poster, for the identities of the target and background, and for the Hue-Saturation-Value settings of a single directional light source.

Each scene entails two frames , in which both the camera and tracked target may have moved between the previous and current frames. Given the target's previous true location , we crop both frames around a correspondingly scaled region, then resize and process them through the GOTURN network, in order to predict the bounding-box location of the target in the current frame. We then back-propagate from the combined loss objective onto the texture space through all partial-derivative paths. After repeating the above process for all scenes, we compute the expected texture gradient, and update the texture using the Fast Gradient Sign optimizer [2]. Thus, the poster texture is perturbed by the sign of the negated expected gradient, scaled by the current iteration's step size:

$$\hat{x}_{i+1} = \hat{x}_i + \epsilon_i \cdot \mathrm{sign}\big({-\mathbb{E}\big[\nabla_{\hat{x}} \mathcal{L}\big]}\big)$$
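The overall optimization loop can be sketched as follows; the `eot_grad` interface and the step-size schedule are illustrative assumptions.

```python
import numpy as np

def pat_attack(texture, eot_grad, num_iters=100, step_fn=lambda i: 0.01):
    """Iterative PAT attack sketch: descend the expected adversarial loss
    with a Fast Gradient Sign update, keeping the texture in valid range.

    eot_grad(texture): hypothetical callback returning the expected loss
                       gradient over an EOT minibatch of rendered scenes.
    step_fn(i): per-iteration step size schedule (illustrative).
    """
    x = texture.astype(np.float64).copy()
    for i in range(num_iters):
        # Sign of the negated expected gradient, scaled by the step size.
        x += step_fn(i) * np.sign(-eot_grad(x))
        np.clip(x, 0.0, 1.0, out=x)   # keep pixels displayable
    return x
```

The sign operation makes each pixel move by a fixed amount per iteration, which tends to produce perturbations that survive quantization when the texture is displayed or printed.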


VI Experiments

We carry out an empirical comparison of PAT attacks using non-targeted, targeted, guided, and hybrid losses. We also assess which EOT conditioning variables are most useful for producing strong adversaries quickly. Furthermore, we analyze PATs resulting from imitation attacks and their induced adversarial behaviors. Finally, we showcase the transfer of PATs generated in simulation for fooling a tracking system in a real-world setup.

VI-A Setup

VI-A1 Simulated Scenarios

All PAT attacks were carried out using simulated scenes rendered by Gazebo. This conveniently provides an endless stream of independently-sampled scenes, with controlled poses and appearances for the target, textured poster, camera, background, and lighting. Furthermore, we modified Gazebo’s renderer to return the mappings between texture pixels and coordinates in each frame, and also provide occlusion masks and bounding boxes of the target in front of the poster in each frame. We created multiple scenarios, including outdoor views of a poster in front of a building, forest, or playground, and an indoor coffee shop scene where a half-sized poster is hung on the wall. We also varied tracked targets among person models with different appearances and models of humanoid robots.

VI-A2 Trained GOTURN models

We trained several GOTURN networks on various combinations of synthetic and real-world labeled datasets for tracking people and humanoid robots. The synthetic dataset contains over short tracking sequences with more than total frames, while the real-world dataset consists of videos with over frames of one of two persons, moving around an office garage and at a park. We used the Adam optimizer [30] with an initial learning rate of and a batch size of . Models trained on synthetic-only data (sim) lasted iterations with the learning rate halved every iterations, while those trained on combined datasets (s+r) or on the real-world dataset after bootstrapping from the synthetic-trained model (s2r) ran for iterations with the learning rate halved every iterations. In addition to the architecture of [13] (Lg), we also trained smaller-capacity models with more aggressive striding instead of pooling layers and fewer units in the fully-connected layers (Sm). Finally, we followed the motion-smoothness data augmentation strategy proposed in [13].

VI-A3 Evaluation Metric

As discussed in Section IV-A, we evaluate a PAT by generating sequences in which a tracked target moves from one side of the textured poster to the other. Each sequence randomly draws from manually-chosen ranges for the target, camera, and poster poses, hue-saturation-value settings for the light source, target identities, and background scenes. We run the GOTURN tracker on each sequence twice, differed by the display of either the PAT or an inert source texture on the poster. The adversarial strength of the PAT is computed as the average metric over random sequences.

Anecdotally, for average values around , the tracker’s predictions expanded and worsened as the target moved over the poster, yet GOTURN locked back onto the target as it moved away. In contrast, values greater than reflected cases where GOTURN consistently lost track of the target during and at the end of the sequence, thus showing notably worse tracking compared to an inert poster.

VI-A4 Baseline Attack Settings

We carried out a hyperparameter search to determine a set of attack parameters that produce strong adversaries. Unless otherwise stated, each PAT attack ran on the regular-capacity synthetic-trained GOTURN model (Lg,sim), with: attack iterations, EOT minibatch with samples, FGS optimizer with step sizes of and then , and starting from a randomly-initialized source texture with pixels. We implemented the GOTURN tracker with the same architecture as in [13] as the default model to be attacked. All presented results are averaged over attack instances with different initial random seeds.

VI-B Efficacy of Adversarial Losses for Regression

(a) Different adversarial losses
(b) Individual vs hybrid adversarial losses
Figure 3: PAT attack strength for various adversarial losses.

Figure 3(a) depicts the progression in adversarial strength throughout the PAT attack process, for the different adversarial losses proposed in Section IV-C. Comparing against the non-targeted baseline EOT attack (nt), most targeted and guided losses resulted in slower convergence and worse final adversarial strength. This is not surprising, as these adversarial objectives apply stricter constraints on the desired adversarial behaviors and thus need to be optimized for longer. As the sole exception, the guided loss encouraging smaller-area predictions (ga-) attained the fastest convergence and best adversarial strength overall. This suggests that well-engineered adversarial objectives, especially loosely-guided ones, can speed up and improve the attack process for regression tasks.

Looking at Figure 3(b), we see that combining nt with most targeted or guided losses did not significantly change performance. While not shown, we saw similar results when using 1:1000 hybrid weight ratios. However, the 1:1 combination of nt&t= attained better overall performance than both the non-targeted (nt) and same-size-target (t=) losses. This suggests that sometimes adding a non-targeted loss to a targeted or guided one helps, possibly due to the widening of conditions for adversarial behaviors.

Figure 4: PATs generated using different adversarial losses.

As seen in Figure 4, various patterns emerge in PATs generated by different losses. We note that dark “striped patches” always appeared in PATs generated from certain losses, and these patches caused GOTURN to lock on and break away from the tracked target. On the other hand, “striped patches” did not show up for PATs created using ga+ or t+, which showed uniform patterns. This is expected as these losses encourage the tracker’s predictions to grow in size, rather than fixating onto a specific location.

(a) Variables controlling randomized appearances
(b) Variables controlling randomized poses
Figure 5: PAT attack strength for various EOT variables.

VI-C Ablation of EOT Conditioning Variables

Here, we assess which variables for controlling the random sampling of scenes had strong effects, and which ones could be set to fixed values without impact, thus reducing scene randomization and speeding up EOT-based attacks.

As seen in Figure 5(a), reducing variety in the appearances of the background (-bg), target (-target), and lighting (-light) did not greatly affect adversarial strength, when evaluated over consistent ranges of backgrounds, targets, and lighting. Also, increasing diversity in +target and +bg did not result in different end-performance, suggesting that diversity in target and background appearances does not strongly affect EOT-based attacks. On the other hand, +light converged much slower than other settings, and so we conclude that if randomized lighting is needed to generalize the robustness of PATs during deployment, then more attack iterations are needed to ensure convergence.

For pose-related variables in Figure 5(b), halving the poster size (small poster) caused the PAT attack to fail. Changing the ranges of camera poses (+cam pose, -cam pose) resulted in notable performance differences; we therefore note that more iterations are needed to generate effective PATs under wider ranges of viewpoints. Perhaps surprisingly, for -target pose, locking the target's pose to the center of the poster resulted in faster and stronger convergence. This is likely because regions around the static target obtained consistent perturbations across all scenes, and so developed adversarial patterns faster.

VI-D Imitation Attacks

As discussed in Section IV-C, we can add a perceptual similarity loss term to make the PAT imitate a meaningful source image. A larger perceptual similarity weight perturbs the source less, but at the cost of slower convergence and weaker or ineffective adversarial strength. Results below used a manually-tuned setting of .

Figure 6: Adversarial imitations under various losses.

Figure 6 shows that some source images, coupled with the right adversarial loss, led to stronger imitations than others. For instance, the waves source was optimized into a potent PAT using , yet using alone failed to produce an adversarial texture. Also, under larger constraints, we saw that adversarial perturbations appeared only in selective parts of the texture. Notably, the “striped patches” seen in non-imitated PATs (Figure 4) also emerged near the dogs' face and over the PR2 robot, when optimized using . We thus conclude that the PAT attack produces critical adversarial patterns such as these patches first, and then perturbs other regions into supporting adversarial patterns, given enough perceptual similarity budget.

Further substantiating this claim, Figure 7 visualizes predicted bounding-boxes within search areas located at different sub-regions of PATs (search areas are drawn at 1:17 scale for illustration purposes). We see from Figure 7(a) that predictions around the adversarial “striped patch” made GOTURN track towards it. This suggests that such critical adversarial patterns induce potent lock-on adversarial behaviors that break tracking regardless of where the true target is positioned nearby. On the other hand, as shown in Figure 7(b), the “regular wavy” pattern optimized using resulted in the intended adversarial behavior of larger-sized predictions, independent of the search area's location. Note that, in both examples, these different adversarial patterns emerge naturally during optimization, simply from the choice of adversarial losses.

(a) Lg,sim tracker; loss
(b) Lg,s+r tracker; loss
Figure 7: Adversarial behaviors emerging from PATs.
(a) Sim&real (s+r), sim-to-real (s2r), sim-only trained models
(b) Models with default (Lg) and reduced (Sm) capacities
Figure 8: Adversarial strength of generated PATs (columns) applied to different GOTURN tracking models (rows).

VI-E Transferability of Physical Adversaries

VI-E1 Transfer among tracking models

When evaluating PATs on GOTURN models trained using different datasets, the off-diagonal results in Figure 8(a) generally show that a decent-to-great amount of adversarial strength is still present. Nevertheless, we see that the transferred efficacy of adversaries varied based on the tracker model and the adversarial loss used. For instance, a sim-trained PAT optimized using and applied to the s2r GOTURN tracker is strongly adversarial, whereas a similar PAT optimized using becomes completely inert.

Similarly, PATs preserved some of their adversarial strength when transferred between trackers with different capacities, as seen in Figure 8(b). However, while all PATs applied to the reduced-capacity models (Sm) affected GOTURN predictions, the resulting metric values do not reflect strong adversaries, indicating that it is more difficult to fool small-capacity GOTURN networks into consistently breaking away from their intended target.

VI-E2 Physical-World Tracking and Servoing

To assess the real-world effectiveness of PATs generated purely using simulated scenes, we displayed them on a TV within an indoor environment with static lighting. We carried out two sets of person-following experiments using the camera on a Parrot Bebop 2 drone: tracking sessions with a stationary drone, and servoing runs where the tracked predictions were used to control the robot to follow the target through space (see Section III for details).
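As a rough illustration of how tracker predictions can drive the robot during servoing, the sketch below maps a predicted bounding box to drone velocity commands with a proportional controller. The gains, setpoints, and command interface are purely hypothetical placeholders; the actual controller used with the Bebop 2 is not specified in this section:

```python
def servo_command(box, frame_w, frame_h,
                  target_area_frac=0.08, k_yaw=1.5, k_fwd=2.0, k_alt=1.0):
    """Map a predicted box (x1, y1, x2, y2) in pixels to illustrative
    (yaw_rate, forward, vertical) velocity commands."""
    cx = 0.5 * (box[0] + box[2]) / frame_w   # horizontal center in [0, 1]
    cy = 0.5 * (box[1] + box[3]) / frame_h   # vertical center in [0, 1]
    area_frac = ((box[2] - box[0]) * (box[3] - box[1])) / (frame_w * frame_h)
    yaw_rate = k_yaw * (0.5 - cx)                      # turn toward the target
    forward = k_fwd * (target_area_frac - area_frac)   # hold apparent size
    vertical = k_alt * (0.5 - cy)                      # keep target centered
    return yaw_rate, forward, vertical
```

Such a loop is what makes servoing attacks harder: any momentary adversarial prediction immediately moves the camera, changing the PAT's appearance on the next frame.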

In both experiments, we tasked the s+r GOTURN system with following people who were not seen in the tracker’s training dataset. While we tested under different light intensities, for each static setting we first fit a linear per-channel lighting model to a color calibration target, and then adjusted camera frames accordingly, as explained in Section V-A. This optional step was done to demonstrate adversarial performance under best-case conditions; note that none of our simulated evaluations corrected for lighting. This correction also compensates for fabrication errors that may arise when the PAT is displayed on a TV or printed as a static poster, and further serves as an alternative to adding a Non-Printability Score to the attack loss [7].
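A per-channel linear lighting model of this kind can be sketched as an ordinary least-squares fit of a gain and offset per RGB channel against calibration patches. The exact calibration target and fitting procedure from Section V-A are not reproduced here, so treat the following as an illustrative assumption:

```python
import numpy as np

def fit_lighting_model(measured, reference):
    """Fit per-channel gain/offset so that gain * measured + offset ~ reference.
    `measured` and `reference` are (N, 3) arrays of RGB patch colors taken
    from a color calibration target."""
    gains, offsets = np.empty(3), np.empty(3)
    for c in range(3):
        A = np.stack([measured[:, c], np.ones(len(measured))], axis=1)
        (gains[c], offsets[c]), *_ = np.linalg.lstsq(A, reference[:, c], rcond=None)
    return gains, offsets

def correct_frame(frame, gains, offsets):
    # Apply the fitted linear correction to a float RGB frame in [0, 1].
    return np.clip(frame * gains + offsets, 0.0, 1.0)
```

Fitting once per static lighting setting and then applying `correct_frame` to every camera frame matches the workflow described above.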

Figure 9: An imitated PAT, generated in simulation, fooling a person-tracker. See Appendix G and supplementary video for more real-world samples of fooling tracking and servoing systems.

For stationary tracking runs, only adversaries containing “striped patches” consistently made GOTURN break away from the person. Other PATs caused the tracker to make worse predictions as the target moved in front of the poster, yet it ultimately locked back onto the person. While these results were partially due to our limited-size digital poster, a more general cause is likely that the corresponding losses induced weak adversarial behaviors: by merely encouraging growing predictions, GOTURN could still see and thus track the person within the enlarged search area.

Returning to the best-performing PATs containing “striped patches”, the tracker strongly preferred to lock onto these rather than the person. Moreover, even though the person could regain GOTURN’s focus by completely blocking the patch, as soon as he moved away, the tracker jumped back to tracking the patch, as seen in Figure 9. Furthermore, these physical adversaries were robust to diverse viewing distances and angles, even in settings outside the ranges used to randomize scenes during the PAT attack. Still, to ensure the strongest adversarial results, we recommend running the PAT attack with scenes matching inference-time viewing distances and angles, although wider ranges will naturally require more attack iterations to ensure convergence.

Our servoing tests showed that it was generally harder to make GOTURN completely break away from the target. Since the drone was moving to follow the target, even when the tracker’s predictions were momentarily disturbed or locked onto the PAT, the robot’s momentum often caused GOTURN to return its focus to the person. We also observed that frames from the moving camera contained motion blur, light gradients, and specular reflections, all of which were assumed away by our PAT attack. Nevertheless, we believe that these advanced scene characteristics can be marginalized by the EOT algorithm, using a higher-fidelity rendering engine than the one in our implementation.

Finally, we speculate that synthetically-generated adversarial patterns like the “striped patches” may look like simulated people or robot targets to GOTURN. If so, then our real-world transfer experiments may have been aided by GOTURN’s inability to tell apart synthetic targets from real people. This caveat may be overcome by carrying out the PAT attack using scenes synthesized with textured 3-D reconstructions or photographic appearances of the intended target.

VII Conclusion

We presented a system to generate Physical Adversarial Textures (PAT) for fooling object tracking models. These “PATterns” induced diverse adversarial behaviors, all emerging from a common optimization framework and with the end goal of making the tracker break away from its intended target. We compared different adversarial objectives and showed that a new family of guided losses, when well-engineered, resulted in stellar adversarial strength and convergence speed. We also showed that naive application of Expectation Over Transformation by randomizing all aspects of scenes was not necessary, as variations in target and background appearances, as well as the target’s poses, had little effect on resulting adversarial performance. In contrast, we found a need to enforce diversity in camera and poster poses, as well as lighting conditions, to produce strong and robust adversaries. Finally, we showcased synthetically-generated PATs that fooled real-world trackers, and suggested ways to overcome limitations of our PAT attack to enhance sim-to-real transfer.

With this work, we hope to raise awareness that inconspicuously-colored items can mislead modern vision-based systems by merely being present in their vicinity. Despite recent computer vision advances, we argue that pure vision-based tracking systems are not robust to physical adversaries, and thus recommend that commercial tracking and servoing systems integrate auxiliary signals (e.g., GPS, IMU) for redundancy and safety.

Since a key goal of this work is to show the existence of inconspicuous patterns that fool trackers, we made the simplifying assumption of having access to the GOTURN tracker’s weights. More practically, it might be possible to augment the PAT attack with diverse techniques [18, 31, 32] to fool black-box victim models, without needing to differentiate through them. Another improvement could be to directly optimize non-differentiable adversarial strength metrics, e.g. by following the Houdini approach [9]. Finally, although the textures shown in this work may appear inconspicuous prior to our demonstrations, they are nevertheless clearly visible and can thus be detected and protected against. As the research community aims to defend against potent physical-world adversaries, we should continue to seek out ways to make PATs more closely imitate natural items in the physical world.


We would like to thank Dmitri Carpov, Matt Craddock, and Ousmane Dia for helping on the codebase implementation, and Minh Dao for helping with visual illustrations. We would also like to thank Philippe Beaudoin, Jean-François Marcil, and Sharlene McKinnon for participating in our real-world tracking and servoing experiments.


Appendix A Simulated Scenarios

Figure 10 depicts samples of the targets (human or humanoid robot models) and scenarios (outdoor and indoor scenes) that we created within the Gazebo simulation software [28]. These are used both for generating and evaluating Physical Adversarial Textures (PAT).

(a) t-shirt person in school
(b) white person in playground
(c) green person in school
(d) t-shirt person in forest
(e) robonaut in forest
(f) PR2 in cafe
Figure 10: Samples of simulated scenarios.

Appendix B PAT Attack: Random Scene Configuration

The Expectation Over Transformation (EOT) algorithm [6] randomizes various aspects of scenes, which are then used to train the PAT to become universally adversarial within these scenes. Table I presents default ranges used for continuous transformation variables in our PAT attack process, while Table II enumerates selections for discrete transformation variables. This default configuration is used in Sections VI-B, VI-D, and VI-E.

Transformation Min Max
Initial camera x (m) -1.5 1.5
Initial camera y (m) -11.0 -6.0
Initial camera z (m) 0.6 1.8
Initial camera roll (°) 0.0 0.0
Initial camera pitch (°) -5.0 5.0
Initial camera yaw (°) -15.0 15.0
Camera x (m) -0.1 0.1
Camera y (m) -0.5 0.5
Camera z (m) -0.1 0.1
Camera roll (°) 0.0 0.0
Camera pitch (°) -3.0 3.0
Camera yaw (°) -3.0 3.0
Initial target x (m) -1.4 1.4
Initial target y (m) -5.0 -0.7
Initial target z (m) 0.0 0.0
Initial target roll (°) 0.0 0.0
Initial target pitch (°) 0.0 0.0
Initial target yaw (°) 0.0 180.0
Target x (m) -0.1 0.1
Target y (m) -0.1 0.1
Target z (m) 0.0 0.0
Target roll (°) 0.0 0.0
Target pitch (°) 0.0 0.0
Target yaw (°) -10.0 10.0
Lighting diffuse hue 0.0 360.0
Lighting diffuse saturation 0.0 0.2
Lighting diffuse value 0.1 0.7
Table I: Continuous EOT variable ranges for PAT attack.
Backgrounds Targets
school green person
forest PR2
Table II: Discrete EOT variable selections for PAT attack.
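For concreteness, drawing one random EOT scene from these tables amounts to uniform sampling per variable. Below is a minimal sketch covering a subset of the ranges from Table I and the selections from Table II; the dictionary keys are our own naming, not identifiers from the paper's codebase:

```python
import random

# Subset of the continuous (min, max) ranges from Table I.
CONTINUOUS_RANGES = {
    "init_cam_x_m": (-1.5, 1.5),
    "init_cam_y_m": (-11.0, -6.0),
    "init_cam_z_m": (0.6, 1.8),
    "init_cam_yaw_deg": (-15.0, 15.0),
    "init_target_yaw_deg": (0.0, 180.0),
    "light_diffuse_hue": (0.0, 360.0),
    "light_diffuse_saturation": (0.0, 0.2),
    "light_diffuse_value": (0.1, 0.7),
}
# Discrete selections from Table II.
BACKGROUNDS = ["school", "forest"]
TARGETS = ["green person", "PR2"]

def sample_scene(rng=random):
    """Draw one EOT scene configuration uniformly at random."""
    scene = {k: rng.uniform(lo, hi) for k, (lo, hi) in CONTINUOUS_RANGES.items()}
    scene["background"] = rng.choice(BACKGROUNDS)
    scene["target"] = rng.choice(TARGETS)
    return scene
```

Each attack iteration would sample a minibatch of such scenes, render them, and average the adversarial gradient over the batch.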

Appendix C Trained GOTURN models

Figure 11 illustrates the two GOTURN neural object tracking architectures used in our experiments. Note that each Conv2D and FC layer includes a non-linear Rectified Linear Unit (ReLU) activation function.

(a) Regular-capacity model [13] (Lg)
(b) Reduced-capacity model (Sm)
Figure 11: Neural architectures for the GOTURN object tracker instances.

Appendix D Baseline PAT Attack Settings

The parameters used in the baseline PAT attack settings (see Section VI-A4) were determined through a hyperparameter search, and by conducting sensitivity analyses on the EOT minibatch size and iteration count, as well as experiments on texture attributes.

D-A EOT Minibatch Size and Iteration

Similar to how training a neural network using Stochastic Gradient Descent (SGD) is sensitive to hyperparameter settings, we analyzed the sensitivity of our proposed PAT attack method to its hyperparameters. We suspected that attacks using smaller EOT minibatch sizes would require more iterations to converge given a fixed perturbation step size, while attacks using large minibatch sizes would require an impractical amount of computing time per iteration. Thus, it is practically beneficial to balance the perturbation step size against the minibatch size, given a fixed number of attack iterations.

(a) Perturbation step size
(b) EOT minibatch size
Figure 12: Adversarial strength over attack iterations, for various perturbation step sizes and EOT minibatch sizes.

We first tuned the perturbation step size for a fixed minibatch size. As shown in Figure 12(a), the smallest step size attained the best end-performance, although larger step sizes initially converged much faster. This trade-off substantiates our empirical observations and suggests that the source texture initially needs to have most of its pixels broadly perturbed to cause adversarial patterns to emerge, which requires drastic pixel changes and hence large perturbation step sizes. Subsequently, however, slight localized pixel refinements around “critical adversarial patterns” (see Section VI-D) steadily enhance the PAT’s adversarial strength. Therefore, in practice, we recommend a schedule that starts with a large perturbation step size for the initial attack iterations, and then refines the texture using a smaller step size.
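The recommended two-phase schedule can be sketched as a signed-gradient texture update whose step size drops after an initial phase. The concrete step sizes and switch point below are placeholders, since the paper's actual values are elided here:

```python
import numpy as np

def step_size(iteration, warmup_iters=500, large=0.05, small=0.005):
    # Placeholder schedule: large steps to surface adversarial patterns,
    # then small steps to refine them locally.
    return large if iteration < warmup_iters else small

def perturb(texture, grad, iteration, lo=0.0, hi=1.0):
    """One signed-gradient descent step on the attack loss, applied to
    the texture and clipped to the valid pixel range."""
    eps = step_size(iteration)
    return np.clip(texture - eps * np.sign(grad), lo, hi)
```

In an attack loop, `grad` would be the EOT-averaged gradient of the chosen adversarial loss with respect to the texture pixels.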

Next, using a single non-scheduled perturbation step size, we varied the EOT minibatch size. Note that the average metrics in Figure 12(b) are plotted against the total number of EOT scenarios observed, i.e. the minibatch size multiplied by the number of attack iterations. These results show consistent performance trends that are proportional to this total number of scenes seen by each PAT attack, rather than to the number of attack iterations itself. Also, beyond extremely small minibatch sizes that lead to high-variance stochastic gradient updates, larger minibatch sizes result in similar and diminishing amounts of improvement in both initial convergence speed and asymptotic adversarial strength. Consequently, we chose a moderate minibatch size for the best trade-off between compute per attack iteration and convergence.

D-B Texture Attributes

(a) Initial textures
(b) Texture sizes
Figure 13: Adversarial strength among various initial textures and texture sizes.

Various related works made different recommendations on which source texture to use for best results. In particular, suggestions included all-white and all-yellow textures [10], and a random high-contrast checkerboard pattern [24]. We also tried an all-gray source pattern, and a per-pixel randomly-sampled source.

However, as shown in Figure 13(a), we found that initializing the texture with different patterns did not result in significant changes in either convergence or performance.

We also explored the effects of changing texture sizes, and found that the smallest resolution led to consistently poor results, while the larger settings yielded little difference in both initial convergence speed and asymptotic performance, as seen in Figure 13(b). We thus chose an intermediate resolution to balance between having sufficient pixel capacity to accommodate the wide range of EOT conditions, and the computation needed to compute texture perturbations. Still, note that these resolution choices are significantly affected by the viewing distances (see Appendix B) and poster sizes used in our experiments.

Appendix E Ablation of EOT Conditioning Variables

In Section VI-C, we evaluated the effects of varying the ranges or choices for different EOT transformation variables, including background (-bg, +bg), target (-target, +target), lighting (-light, +light), poster size (small poster), camera pose (-cam pose, +cam pose), and target pose (-target pose, +target pose). Modified ranges for camera pose, target pose, and lighting are shown in Tables III, IV, and V, respectively. Also, the variations for (-bg, +bg) and (-target, +target) are as follows:

  • -bg: use playground only;

  • +bg: randomize among school, forest, playground and cafe;

  • -target: use green person only;

  • +target: randomize among green person, white person, t-shirt person, PR2 and robonaut.

Transformation -cam pose +cam pose
Min Max Min Max
Initial x (m) 0.0 0.0 -2.0 2.0
Initial y (m) -8.5 -8.5 -16.5 -5.5
Initial z (m) 1.2 1.2 0.4 2.2
Initial roll (°) 0.0 0.0 -1.5 1.5
Initial pitch (°) 0.0 0.0 -10.0 10.0
Initial yaw (°) 0.0 0.0 -20.0 20.0
x (m) 0.0 0.0 -0.15 0.15
y (m) 0.0 0.0 -0.80 0.80
z (m) 0.0 0.0 -0.15 0.15
roll (°) 0.0 0.0 0.0 0.0
pitch (°) 0.0 0.0 -5.0 5.0
yaw (°) 0.0 0.0 -5.0 5.0
Table III: PAT attack settings for -cam pose and +cam pose.
Transformation -target pose +target pose
Min Max Min Max
Initial x (m) 0.0 0.0 -1.6 1.6
Initial y (m) -2.7 -2.7 -5.0 -0.7
Initial z (m) 0.0 0.0 0.0 0.0
Initial roll (°) 0.0 0.0 0.0 0.0
Initial pitch (°) 0.0 0.0 0.0 0.0
Initial yaw (°) 90.0 90.0 -90.0 270.0
x (m) 0.0 0.0 -0.15 0.15
y (m) 0.0 0.0 -0.15 0.15
z (m) 0.0 0.0 0.0 0.0
roll (°) 0.0 0.0 0.0 0.0
pitch (°) 0.0 0.0 0.0 0.0
yaw (°) 0.0 0.0 -20.0 20.0
Table IV: PAT attack settings for -target pose and +target pose.
Diffuse Light Source -light +light
Min Max Min Max
Hue 0.0 360.0 0.0 360.0
Saturation 0.0 0.0 0.0 0.7
Value 0.7 0.7 0.0 0.7
Table V: PAT attack settings for -light and +light.

Appendix F Imitation Attacks

In Section VI-D, we set the value of the perceptual similarity weight based on an experiment where we studied its effect on adversarial strength and perceptual similarity (as measured by the Euclidean distance to the source image in RGB colorspace). Unsurprisingly, as seen in Figure 14, smaller weights imposed fewer constraints and thus led to faster attack convergence and better end-performance, while the inverse was true for larger weights. We chose the final value after manually assessing which PATs had recognizable levels of perceptual similarity to their source images, as seen in Figure 17.
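The trade-off studied here can be written as the adversarial loss plus a weighted Euclidean distance to the source texture. The sketch below shows that general form only; the weight name and exact combination are our assumption, not the paper's precise objective:

```python
import numpy as np

def imitation_loss(adv_loss, texture, source, weight):
    """Imitation-attack objective: adversarial loss plus a weighted
    Euclidean (RGB) distance to the source image. A smaller weight
    relaxes the perceptual constraint, trading similarity for strength."""
    perceptual = np.linalg.norm(texture - source)
    return adv_loss + weight * perceptual
```

Sweeping the weight and inspecting the resulting PATs (as in Figure 17) is how the final value was selected.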

(a) uIOUd
(b) L2-norm perceptual similarity
Figure 14: Adversarial strength and perceptual similarity for various perceptual similarity weights.

To substantiate Figure 6 in the main paper, Figure 15 illustrates how the average adversarial strength and perceptual similarity metrics change over attack iterations. As we can see, certain combinations of initial posters and losses made the attack easier to converge. For example, performing the PAT attack using waves as the initial texture with the hybrid losses (nt & ga+ 1:1) resulted in strong adversaries, while using the non-targeted loss alone did not.

(a) uIOUd
(b) L2-norm perceptual similarity
Figure 15: Adversarial strength and perceptual similarity among source textures shown in Figure 6 of main paper.

Finally, Figure 18 illustrates the emergence of the “critical adversarial patterns” that we discussed in Section VI-D. In Figure 18(a), the critical dark striped pattern started to emerge early in the attack, followed by the appearance of other nearby colorful patterns, which presumably drive predictions towards the central adversarial striped pattern. In contrast, when we imposed a perceptual similarity loss during an imitation attack, only the dark striped pattern eventually emerged, after significantly more attack iterations, as seen in Figure 18(b).

Appendix G Transferability of Physical Adversaries

G-A Transfer among tracking models

In Section VI-E1, we evaluated the transferability of PATs among different tracking models. Figure 16 shows the PATs used in this experiment. Generally, we observe similar adversarial patterns emerging from PAT attacks on GOTURN models trained on different datasets, as well as with different capacities, which explains why PATs transfer to a certain degree among different GOTURN trackers. The sole exception is seen in the second row of Figure 16(a), reflecting the fact that the adversarial loss ga+ caused different patterns to emerge for different models, although all exhibited similarly competent adversarial strength.

G-B Physical-World Tracking and Servoing

We conducted numerous test runs in real-world tracking and servoing conditions, and qualitatively verified the transferred adversarial strength of our synthetically-generated PATs, especially those containing “critical adversarial patterns”. Samples of these sequences can be seen in Figures 19, 20, 21, and 22.

It is generally difficult to quantify performance consistently in the real world, due to the tedium and impracticality of labeling performance, controlling for repeated conditions, and dealing with practical complexities such as limited battery life, hardware failures, etc. Still, we segmented runs into video clips, and manually labeled each as either strongly adversarial (i.e. the tracker jumps onto the PAT and stays locked onto it even when momentarily obstructed), weakly adversarial (i.e. the tracker only sometimes switches from the person to the PAT, and shows a tendency to latch back onto the person), or a failure.

Runs Strong Weak Fail
Stationary 57 (71%) 13 (16%) 10 (13%)
Servo 6 (33%) 5 (28%) 7 (39%)
Table VI: Physical-world attack performance.
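The percentages in Table VI follow directly from the raw run counts, rounded to the nearest percent (with halves rounded up, as inferred from the table); a quick arithmetic check:

```python
def label_shares(strong, weak, fail):
    """Percentage of runs per label, rounded to the nearest percent
    (halves up), matching the convention inferred from Table VI."""
    total = strong + weak + fail
    return tuple(int(100 * n / total + 0.5) for n in (strong, weak, fail))
```

For example, the 80 stationary runs split 57/13/10 yield 71%, 16%, and 13%, as reported.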

Looking at Table VI, we see that the tracker was quickly drawn to PATs when deployed on a stationary camera. On the other hand, it was much harder to fool the person tracker when the drone was servoing the target. Whether the PAT was displayed digitally on a monitor or printed as an A0 poster, we anecdotally observed that both materials exhibited some amount of specular reflection. These specularities changed as the camera moved around, and thus likely altered the appearance of the PATs during our servoing runs, rendering them inert. Therefore, devising adversaries that are robust to specularities would be an exciting avenue for future research.

(a) Different training datasets for GOTURN models
(b) Different network capacities for GOTURN models
Figure 16: PATs used in the transferability among tracking models experiment.
Figure 17: PATs generated with various perceptual similarity weights.
(a) Non-imitation attack
(b) Imitation attack
Figure 18: The emergence of “critical adversarial patterns” for non-imitation and imitation attacks.
Figure 19: PAT fools the tracker in simulation. Here, the purple bounding box represents the ground truth bounding box of the tracked object, while the green bounding box represents the tracker’s prediction. Note that the sequence starts from the top-left frame to the bottom-right frame.
Figure 20: PAT fools the tracker in the real world indoor setting, where the PAT is displayed on a TV. Note that the sequence starts from the top-left frame to the bottom-right frame.
Figure 21: PAT fools the tracker in the real world indoor setting during servoing run. Note that the sequence starts from the top-left frame to the bottom-right frame.
Figure 22: PAT fools the tracker in the real world outdoor setting during servoing run, where the PAT is printed as a poster. Note that the sequence starts from the top-left frame to the bottom-right frame.