Adversarial Attacks for Embodied Agents

05/19/2020 · by Aishan Liu, et al.

Adversarial attacks are valuable for providing insights into the blind spots of deep learning models and for helping to improve their robustness. Existing work on adversarial attacks has mainly focused on static scenes; however, it remains unclear whether such attacks are effective against embodied agents, which navigate and interact with a dynamic environment. In this work, we take the first step toward studying adversarial attacks for embodied agents. In particular, we generate spatiotemporal perturbations to form 3D adversarial examples, which exploit the interaction history in both the temporal and spatial dimensions. Regarding the temporal dimension, since agents make predictions based on historical observations, we develop a trajectory attention module to explore scene view contributions, which further helps localize the 3D objects that appear with the highest stimuli. Coupled with these clues from the temporal dimension, along the spatial dimension we adversarially perturb the physical properties (e.g., texture and 3D shape) of the contextual objects that appear in the most important scene views. Extensive experiments on the EQA-v1 dataset for several embodied tasks in both the white-box and black-box settings demonstrate that our perturbations have strong attack and generalization abilities.


1 Introduction

Deep learning has demonstrated remarkable performance in a wide spectrum of areas [17, 21, 26], but it is vulnerable to adversarial examples [27, 11]. These small perturbations are imperceptible to humans but easily mislead deep neural networks (DNNs), thereby bringing potential security threats to deep learning applications [23, 18]. Though challenging for deep learning, adversarial examples are valuable for understanding the behaviors of DNNs: they provide insights into model weaknesses and help improve robustness [35]. Over the last few years, significant efforts have been made to explore model robustness to adversarial noise using adversarial attacks in static and non-interactive domains, e.g., 2D images [11, 2] or static 3D scenes [34, 19, 30].

With great breakthroughs in multimodal techniques and virtual environments, embodied tasks have been introduced to further foster and measure agents' perceptual abilities. An agent must intelligently navigate a simulated environment to achieve specific goals through egocentric vision [6, 7, 33, 12]. For example, an agent is spawned at a random location within an environment to answer questions such as “What is the color of the car?”. Das et al. [6] first introduced the embodied question answering (EQA) problem and proposed a model consisting of a hierarchical navigation module and a question answering module. Concurrently, Gordon et al. [12] studied the EQA task in an interactive environment named AI2-THOR [16]. Recently, several studies have been proposed to improve agent performance using different frameworks [7] and point cloud perception [29]. Similar to EQA, embodied vision recognition (EVR) [32] is an embodied task in which an agent is instantiated close to an occluded target object and must perform visual object recognition.

Figure 1: Embodied agents must navigate the environment through egocentric views to answer given questions. By adversarially perturbing the physical properties of 3D objects using our spatiotemporal perturbations, the agent gives the wrong answer (the correct answer is “living room”) to the question. The contextual objects perturbed are: sofa and laptop.

In contrast to static tasks, embodied agents are free to move to different locations and interact with the dynamic environment. Rather than relying on a one-shot image, embodied agents observe 3D objects from different views and make predictions based on historical observations (the trajectory). Current adversarial attacks mainly focus on static scenes and ignore information from the temporal dimension. However, since agents utilize contextual information to make decisions (e.g., answer questions), considering only a single image or an object that appears in one scene view may not be sufficient to generate strong adversarial attacks against an embodied agent.

In this work, we provide the first study of adversarial attacks for embodied agents in dynamic environments, as demonstrated in Figure 1. By exploiting the interaction history in both the temporal and spatial dimensions, our adversarial attacks generate 3D spatiotemporal perturbations. Regarding the temporal dimension, since agents make predictions based on historical observations, we develop a trajectory attention module to explore scene view contributions, which helps localize the 3D objects that appear with the highest stimuli for agents' predictions. Coupled with these clues from the temporal dimension, along the spatial dimension we adversarially perturb the physical properties (e.g., 3D shape and texture) of the contextual objects that appear in the most important scene views. Currently, most embodied agents take as input 2D images that are rendered from 3D scenes by non-differentiable renderers. To apply the attack using a gradient-based strategy, we replace the non-differentiable renderer with a differentiable one by introducing a neural renderer [15].

To evaluate the effectiveness of our spatiotemporal adversarial attacks, we conduct extensive experiments in both the white-box and black-box settings using different models. We first demonstrate that our generated 3D adversarial examples are able to attack state-of-the-art embodied agent models and significantly outperform other 3D adversarial attack methods. Moreover, our adversarial perturbations can be transferred to attack a black-box renderer using non-differentiable operations, indicating the applicability of our attack strategy and its potential for extension to the physical world. We also provide a discussion of adversarial training using our generated attacks, as well as a perceptual study indicating that, contrary to the human vision system, current embodied agents are more sensitive to object textures than to shapes, which sheds some light on bridging the gap between human perception and embodied perception.

2 Related Work

Adversarial examples or perturbations are intentionally designed inputs that mislead deep neural networks [27]. Most existing studies address static settings, including 2D images and static 3D scenes.

In the 2D image domain, Szegedy et al. [27] first introduced adversarial examples and used the L-BFGS method to generate them. By leveraging the gradients of the target model, Goodfellow et al. [11] proposed the Fast Gradient Sign Method (FGSM), which can generate adversarial examples quickly. In addition, Mopuri et al. [22] proposed a data-free approach to generate universal perturbations for DNNs on object recognition tasks. These methods add perturbations to 2D image pixels rather than to 3D objects, and thus fail to attack embodied agents.

Some recent works study adversarial attacks in the static 3D domain. One line of work [30, 34, 19] replaced the non-differentiable renderer with a differentiable one and performed attacks through gradient-based strategies, mainly manipulating object shapes and textures in 3D visual recognition tasks. On the other hand, Zhang et al. [36] learned a camouflage pattern to hide vehicles from detectors using an approximation function. Adversarial patches [4, 18] have also been studied to perform real-world 3D adversarial attacks. In particular, Liu et al. [18] proposed the PS-GAN framework to generate scrawl-like adversarial patches to fool autonomous-driving systems. However, all of these attacks consider only static scenes and ignore temporal information. Our evaluation demonstrates that, by incorporating both spatial and temporal information, our spatiotemporal attacks are more effective for embodied tasks.

3 Adversarial Attacks for the Embodiment

The embodiment hypothesis is the idea that intelligence emerges in the interaction of an agent with an environment and as a result of sensorimotor activity [25, 6]. To achieve specific goals, embodied agents are required to navigate and interact with the dynamic environment through egocentric vision. For example, in the EQA task, an agent is spawned at a random location in a 3D dynamic environment to answer given questions through navigation and interaction.

3.1 Motivations

Though showing promising results in virtual environments, agent robustness is challenged by the emergence of adversarial examples. Most agents are built upon deep learning models, which have been proven vulnerable in the adversarial setting [27, 11]. By performing adversarial attacks against the embodiment, an adversary could manipulate embodied agents and force them to execute unexpected actions. This poses potential security threats to agents in both the digital and the physical world.

From another point of view, adversarial attacks for the embodiment are also beneficial for understanding agents' behaviors. As black-box models, most deep-learning-based agents are difficult to interpret. Adversarial attacks therefore provide a new way to explore model weaknesses and blind spots, which is valuable for understanding agent behavior in the adversarial setting. Further, such insights can be used to improve model robustness and build agents that are more resistant to noise.

3.2 Problem Definition

In this paper, we use 3D adversarial perturbations (adversarial examples) to attack embodied agents in a dynamic environment.

In a static scenario, given a deep neural network F and an input image x with ground-truth label y, an adversarial example x^{adv} is an input that makes the model predict a wrong label:

    F(x^{adv}) \neq y, \quad \text{s.t.} \quad D(x, x^{adv}) \leq \epsilon,

where D(\cdot, \cdot) is a distance metric that quantifies the distance between the two inputs and \epsilon is sufficiently small.

For the embodiment, an agent navigates the environment to fulfil goals and observes 3D objects at different time steps t. The input image x_t at time step t is the result of rendering a 3D object with a renderer R, i.e., x_t = R(s_t, c_t), where s_t is the corresponding 3D object and c_t denotes the conditions at time t (e.g., camera views, illumination, etc.). To attack the embodiment, we need to consider the agent trajectory in the temporal dimension and choose which objects to perturb in 3D space. In other words, we generate an adversarial 3D object s^{adv} by perturbing its physical properties at multiple time steps, such that the rendered image set is able to fool the agent F:

    F(\{R(s^{adv}, c_t)\}_{t \in T}) \neq y, \quad \text{s.t.} \quad D(x_t, x^{adv}_t) \leq \epsilon,

where T is the set of time steps we consider.

Figure 2: Our framework exploits interaction histories in both the temporal and the spatial dimension. In the temporal dimension, we develop a trajectory attention module to explore scene view contributions; the most important scene views are then extracted to help localize the 3D objects that appear with the highest stimuli for agents' predictions. Coupled with these clues from the temporal dimension, along the spatial dimension we adversarially perturb the physical properties (e.g., 3D shape and texture) of the contextual objects that appear in the most important scene views.

4 Spatiotemporal Attack Framework

In this section, we describe our framework for generating 3D adversarial perturbations against embodied agents in dynamic environments. Figure 2 presents an overview of our attack approach, which incorporates interaction history from both the temporal and spatial dimensions.

Motivated by the fact that agents make predictions based on historical scene views (the trajectory), we attack the 3D objects that appear in the scene views providing the highest stimuli to the agent's prediction. In the temporal dimension, we develop a trajectory attention module that directly calculates a contribution weight for the scene view at each time step towards the agent's prediction. Given a trajectory, the most important historical scene views are selected according to these weights to help localize the 3D objects that appear with the highest stimuli.

Meanwhile, rather than relying solely on single objects, humans also collect discriminative contextual information when making predictions. Coupled with the clues from the temporal dimension, along the spatial dimension we adversarially perturb the physical properties of multiple 3D contextual objects that appear in the most important scene views. Moreover, to attack physical properties (i.e., 3D shapes and textures), we employ a differentiable renderer so that gradient-based attacks can be used.

Thus, by coupling both temporal and spatial information, our framework generates spatiotemporal perturbations to form 3D adversarial examples, which could perform adversarial attacks for the embodiment.

4.1 Temporal Attention Stimulus

To achieve specific goals, embodied agents are required to navigate the environment and make decisions based on historical observations. Conventional vision tasks, e.g., classification, are mainly based on a one-shot observation of a static image. In contrast, we should consider historical information (the trajectory), such as the last k historical scene views {x_{t-k+1}, ..., x_t} observed by the agent, and adversarially perturb the 3D objects that appear in them. Thus, we can formulate the attack loss:

    L_{adv} = \sum_{i=t-k+1}^{t} \log P(y \mid x_i),    (1)

where P(y \mid x_i) denotes the prediction probability of the model, and y indicates the ground-truth label (i.e., the correct answer, object class, or action for question answering, visual recognition, and navigation, respectively). To attack agents, we minimize the loss above, which decreases the confidence of the correct class.
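To make Eq. (1) concrete, the following is a minimal PyTorch-style sketch, assuming a generic model that maps a batch of rendered scene views to class logits; the function and variable names (attack_loss, views, label) are illustrative and not the authors' code.

    import torch.nn.functional as F

    def attack_loss(model, views, label):
        """Sum of log-probabilities of the ground-truth class over the last k views.

        views: tensor of shape (k, C, H, W), the last k historical scene views.
        label: index of the ground-truth answer/class/action.
        Minimizing this value decreases the agent's confidence in the correct class.
        """
        logits = model(views)                      # (k, num_classes)
        log_probs = F.log_softmax(logits, dim=-1)  # log P(y | x_i)
        return log_probs[:, label].sum()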

There is extensive biological evidence that efficient perception requires both specialized visual sensing and a mechanism to prioritize stimuli, i.e., visual attention. Agents move their eyes towards a specific location or focus on relevant locations to make predictions by prioritizing different scene views [5]. To perform strong adversarial attacks, we design a trajectory attention module that selects a suitable set of visual features (historical scene views) to attack. Inspired by [24], given the scene views {x_1, ..., x_t}, we first compute the gradient of the target class probability P(y \mid x_i) w.r.t. the normalized feature maps A^n of a specified layer. These gradients are global-average-pooled to obtain the weight w_i for the i-th scene view:

    w_i = \frac{1}{Z} \sum_{n=1}^{K} \sum_{j,k} \frac{\partial P(y \mid x_i)}{\partial A^n_{jk}},    (2)

where Z represents the size of a feature map and K indicates the total number of feature maps in the specified layer. Then, we normalize each weight w_i according to the mean \mu and standard deviation \sigma of the weights over the trajectory:

    \hat{w}_i = \frac{w_i - \mu}{\sigma}.    (3)

Thus, our trajectory attention module calculates the contribution \alpha_i of each scene view x_i in the trajectory towards the model decision for class y by normalizing the weights across the trajectory:

    \alpha_i = \frac{\exp(\hat{w}_i)}{\sum_{j} \exp(\hat{w}_j)}.    (4)

The weights \alpha_i directly reflect the contribution of the views observed at different time steps in the trajectory. Thus, we can further adversarially perturb the 3D objects that appear in the scene views with higher weights to execute a stronger attack.
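The computation above can be sketched as follows in PyTorch, assuming the agent's visual encoder exposes the chosen feature layer as a module we can hook; the softmax at the end is one plausible reading of Eq. (4), and all names are illustrative rather than the authors' implementation.

    import torch
    import torch.nn.functional as F

    def view_contributions(model, feature_layer, views, label):
        feats = {}
        handle = feature_layer.register_forward_hook(
            lambda module, inputs, output: feats.__setitem__("maps", output))
        weights = []
        for x in views:                                                 # one scene view at a time
            prob = F.softmax(model(x.unsqueeze(0)), dim=-1)[0, label]   # P(y | x_i)
            grads = torch.autograd.grad(prob, feats["maps"])[0]         # dP / dA
            weights.append(grads.mean())                                # global average pooling
        handle.remove()
        w = torch.stack(weights)
        w_hat = (w - w.mean()) / (w.std() + 1e-8)                       # normalize over the trajectory
        return torch.softmax(w_hat, dim=0)                              # per-view contributions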

4.2 Spatially Contextual Perturbations

Adversarial attacks in static scenes usually manipulate pixel values in a static image or in individual frames. In contrast, adversarial attacks for the embodiment require us to perturb the physical properties of 3D objects. A simple strategy would be to randomly choose an object that appears in the most important scene views, based on the attention weights, and attack it. However, when humans look at an object, they always collect a discriminative context for that object [9]. In other words, we concentrate on the object while simultaneously being aware of its surroundings and context. This contextual information enables us to perform much stronger adversarial attacks. As shown in Figure 1, when asked “What room is the chessboard located in?”, it is better to perturb contextual objects rather than only the target object “chessboard”. To answer the question, the agent relies on contextual objects (e.g., sofa, laptop, etc.) that convey critical factors and key features about the answer “living room”.

Coupled with the clues from the temporal dimension, we further perturb the 3D contextual objects that appear in the most important views. Specifically, given the N most important scene views {x_{(1)}, ..., x_{(N)}} selected by our trajectory attention module, we perturb the M 3D contextual objects that appear in them. Thus, the adversarial attack loss can be formalized as:

    L_{adv} = \sum_{i=1}^{N} \log P(y \mid x_{(i)}),    (5)

Correspondingly, using the contribution weights \alpha_{(i)} of the most important scene views, we can derive the physical-parameter manipulation strategy as a weighted gradient update:

    \theta_m \leftarrow \theta_m - \gamma \sum_{i=1}^{N} \alpha_{(i)} \frac{\partial \log P(y \mid x_{(i)})}{\partial \theta_m}, \quad o_m \in \mathcal{O}(x_{(1)}, \dots, x_{(N)}),    (6)

where \mathcal{O}(\cdot) extracts the objects that appear in the scene views, \theta_m denotes the 3D physical parameters of object o_m (e.g., texture, shape, etc.), and \gamma is the step size.
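In the spirit of Eqs. (5)-(6), the sketch below shows how the attention weights could modulate the update of the contextual objects' physical parameters. The render callable, the parameter dictionary, and the exact update rule are assumptions for illustration, not the authors' implementation.

    import torch

    def perturb_contextual_objects(render, object_params, top_views, alphas,
                                   model, label, lr=0.01, steps=60):
        """object_params: dict {object_id: physical-parameter tensor with requires_grad=True}
        top_views:        conditions (camera pose, etc.) of the N most important scene views.
        alphas:           contribution weights of those views (plain floats)."""
        opt = torch.optim.SGD(list(object_params.values()), lr=lr, momentum=0.9)
        for _ in range(steps):
            loss = 0.0
            for alpha, cond in zip(alphas, top_views):
                x = render(object_params, cond)       # differentiable rendering of the scene view
                log_p = torch.log_softmax(model(x.unsqueeze(0)), dim=-1)[0, label]
                loss = loss + alpha * log_p           # weighted confidence of the correct class
            opt.zero_grad()
            loss.backward()                           # gradients flow into textures / shapes
            opt.step()
        return object_params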

4.3 Optimization Formulations

Based on the above discussion, we generate 3D adversarial perturbations using the following optimization formulation:

    \min_{\{\theta_m\}} \; \mathbb{E}_{c \sim C} \big[ L_{adv} + \lambda L_{per} \big],    (7)

where we append the adversarial attack loss with a perceptual loss:

    L_{per} = \big\| R(s^{adv}, c) - R(s, c) \big\|,    (8)

which constrains the magnitude of the total noise added, so as to produce a visually imperceptible perturbation. C represents the set of conditions c (e.g., camera views, illumination, etc.) and \lambda balances the contribution of each part.

Recent studies have highlighted that adversarial perturbations can become ineffective under different transformations and environmental conditions (e.g., illumination, rotation, etc.). In a dynamic environment, viewing angles and environmental conditions change frequently. Thus, we further introduce the idea of the expectation over transformation [3] to enhance the attack success rate of our perturbations, reflected by the expectation over different conditions in Eqn. (7).

It is intuitive to directly place constraints on physical parameters such as the contour or color range of object surfaces. However, one potential disadvantage is that different physical parameters have different units and ranges. Therefore, we constrain the RGB intensity changes in the 2D image space after the rendering process to keep the consistency of the change of different parameters (i.e., shape or texture).
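The following sketch shows one way to evaluate the objective in Eq. (7): the attack loss plus a perceptual penalty measured on the rendered 2D images, averaged over sampled conditions. The condition sampling and the choice of an L-infinity penalty are assumptions for illustration.

    import random
    import torch

    def eot_objective(render, clean_params, adv_params, model, label,
                      conditions, lam=1.0, n_samples=5):
        total = 0.0
        for _ in range(n_samples):
            cond = random.choice(conditions)          # e.g. camera pose, illumination
            x_adv = render(adv_params, cond)
            x_clean = render(clean_params, cond)
            log_p = torch.log_softmax(model(x_adv.unsqueeze(0)), dim=-1)[0, label]
            l_per = (x_adv - x_clean).abs().max()     # RGB change measured after rendering
            total = total + log_p + lam * l_per
        return total / n_samples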

5 Experiments

In this section, we evaluate the effectiveness of our 3D spatiotemporal adversarial attacks against agents in different settings for different embodied tasks. We also provide a discussion of defense with adversarial training, and an ablation study of how different design choices affect the attack performance.

5.1 Experimental Setting

For both the EQA and EVR tasks, we use the EQA-v1 dataset [6], a visual question answering dataset grounded in a simulated environment. It contains 648 environments with 7,190 questions for training, 68 environments with 862 questions for validation, and 58 environments with 933 questions for testing. The tasks are divided into T_{-10}, T_{-30}, and T_{-50} splits according to the number of steps from the starting point to the target. For each object to be attacked, we improve the attack success rate of the 3D adversarial perturbations by selecting five positional views one meter away, with azimuth angles ranging uniformly over [0°, 180°], to optimize the overall loss. We restrict the adversarial perturbations to be bounded by 32 pixel values per frame in terms of the \ell_\infty norm.
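A small sketch of the setup above, assuming evenly spread azimuth angles and pixel intensities in [0, 1]; the camera parameterization and helper names are illustrative.

    import math

    def sample_view_conditions(n_views=5, radius=1.0):
        # five camera positions one meter from the object, azimuths spread over [0°, 180°]
        conditions = []
        for i in range(n_views):
            azimuth = math.radians(180.0 * i / (n_views - 1))
            conditions.append({"x": radius * math.cos(azimuth),
                               "y": radius * math.sin(azimuth),
                               "elevation": 0.0})
        return conditions

    def clip_linf(frame_adv, frame_clean, eps=32.0 / 255.0):
        # project the perturbed frame back into the L-infinity ball around the clean frame
        return frame_clean + (frame_adv - frame_clean).clamp(-eps, eps)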

5.2 Evaluation Metrics

To measure agent performance, we use the following evaluation metrics, as in [6, 29, 7]:

- top-1 accuracy: whether the agent's prediction matches the ground truth (higher is better);

- d_T: the distance to the target object at navigation termination (lower is better);

- d_Δ: the change in distance to the target from the initial to the final position (higher is better);

- d_min: the smallest distance to the target at any point in the episode (lower is better).

Note that the goal of adversarial attacks is compromising the performance of the embodied agents, i.e., leading to worse values of the evaluation metrics above.
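A minimal sketch of these metrics, assuming we log the agent's distance to the target at every step of an episode; the helper names are illustrative.

    def navigation_metrics(distances_to_target):
        """distances_to_target: distance to the target at every step (index 0 = spawn)."""
        d_T = distances_to_target[-1]            # distance at termination (lower is better)
        d_delta = distances_to_target[0] - d_T   # progress from start to end (higher is better)
        d_min = min(distances_to_target)         # closest approach (lower is better)
        return d_T, d_delta, d_min

    def top1_accuracy(predictions, ground_truths):
        correct = sum(p == g for p, g in zip(predictions, ground_truths))
        return correct / len(ground_truths)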

5.3 Implementation Details

We use the SGD optimizer for adversarial perturbation generation, with momentum 0.9, weight decay, and a maximum of 60 iterations. For the hyper-parameters of our framework, we set \lambda to 1, N to 3, and M to the number of all contextual objects observed in these frames. For EQA, we generate adversarial perturbations using PACMAN-RL+Q [6] as the target model, and we use Embodied Mask R-CNN [32] as the target model for EVR. In our evaluation, we will demonstrate that attacks generated against one model transfer to different models.

For both EQA and EVR, unless otherwise specified, we generate adversarial perturbations on texture only, i.e., in Equation 6, we only update the parameters corresponding to texture, because it is more suitable for future extension to physical attacks in the real 3D environment. In Section 5.9, we also provide a comparison of adversarial perturbations on shapes, where we demonstrate that with the same constraint of perturbation magnitude, texture attacks achieve a higher attack success rate.

Model          Method       d_T (lower is better)   d_Δ (higher is better)   d_min (lower is better)   QA accuracy (higher is better)
                            T-10 / T-30 / T-50       T-10 / T-30 / T-50        T-10 / T-30 / T-50         T-10 / T-30 / T-50
PACMAN-RL+Q    Clean        1.05 / 2.43 / 3.82       0.10 / 0.45 / 1.86        0.26 / 0.97 / 1.99         50.23% / 44.19% / 39.94%
               MeshAdv      1.06 / 2.44 / 3.90       0.09 / 0.44 / 1.78        0.31 / 1.17 / 2.33         16.07% / 15.34% / 13.11%
               Zeng et al.  1.07 / 2.46 / 3.88       0.08 / 0.42 / 1.80        0.42 / 1.37 / 2.43         17.15% / 16.38% / 14.32%
               Ours         1.06 / 3.19 / 5.58       0.09 / -0.39 / 0.10       0.90 / 2.47 / 5.33         6.17% / 4.26% / 3.42%
NAV-GRU        Clean        1.03 / 2.47 / 3.92       0.12 / 0.41 / 1.76        0.34 / 1.02 / 2.07         48.97% / 43.72% / 38.26%
               MeshAdv      1.07 / 2.50 / 3.92       0.08 / 0.38 / 1.76        0.38 / 1.28 / 2.48         17.22% / 17.01% / 14.25%
               Zeng et al.  1.09 / 2.47 / 3.87       0.06 / 0.41 / 1.81        0.36 / 1.38 / 2.51         17.14% / 16.56% / 15.11%
               Ours         1.13 / 2.96 / 5.42       0.02 / -0.08 / 0.26       0.96 / 2.58 / 4.98         8.41% / 6.23% / 5.15%
NAV-Reactive   Clean        1.37 / 2.75 / 4.17       -0.22 / 0.13 / 1.51       0.31 / 0.99 / 2.08         48.19% / 43.73% / 37.62%
               MeshAdv      1.05 / 2.79 / 4.25       0.10 / 0.09 / 1.43        0.32 / 1.29 / 2.47         15.36% / 14.78% / 11.29%
               Zeng et al.  1.10 / 2.79 / 4.21       0.05 / 0.09 / 1.47        0.36 / 1.59 / 2.32         15.21% / 14.13% / 13.29%
               Ours         1.22 / 2.85 / 5.70       -0.07 / 0.03 / -0.02      1.06 / 2.59 / 5.47         8.26% / 5.25% / 5.39%
VIS-VGG        Clean        1.02 / 2.38 / 3.67       0.13 / 0.50 / 2.01        0.38 / 1.05 / 2.26         50.16% / 45.81% / 37.84%
               MeshAdv      1.06 / 2.41 / 3.67       0.09 / 0.47 / 2.01        0.40 / 1.11 / 2.52         16.69% / 15.24% / 15.21%
               Zeng et al.  1.06 / 2.43 / 3.70       0.09 / 0.45 / 1.98        0.44 / 1.41 / 2.44         15.13% / 14.84% / 14.21%
               Ours         1.18 / 2.83 / 5.62       -0.03 / 0.05 / 0.06       1.04 / 2.01 / 5.12         6.33% / 4.84% / 4.29%
Table 1: Quantitative evaluation of agent performance on the EQA task using different models in the clean and adversarial settings (ours, MeshAdv [30], and Zeng et al. [34]). Note that the goal of the attacks is to make performance worse. We observe that our spatiotemporal attacks outperform the static 3D attack algorithms, achieving higher d_T and d_min as well as lower d_Δ and QA accuracy.

5.4 Attack via a Differentiable Renderer

In this section, we provide quantitative and qualitative results of our 3D adversarial perturbations on EQA and EVR through our differentiable renderer. For EQA, besides PACMAN-RL+Q, we also evaluate the transferability of our attacks using the following models: (1) NAV-GRU, an agent using a GRU instead of an LSTM for navigation [29]; (2) NAV-Reactive, an agent without memory that cannot use historical information [6]; and (3) VIS-VGG, an agent using VGG to encode visual information [7]. For EVR, we evaluate white-box attacks on Embodied Mask R-CNN. As most embodied tasks can be directly divided into a navigation stage and a problem-solving stage (i.e., question answering or visual recognition), we attack each of these stages. We compare our spatiotemporal attacks to MeshAdv [30] and Zeng et al. [34], both of which are designed for static 3D environments and thus do not leverage temporal information, as discussed in Section 2.

(a) Clean Scene
(b) Adversarial Scene
Figure 3: Given the question “What is next to the fruit bowl in the living room?”, we show the last 5 views of the agent for EQA in the same scene with and without adversarial perturbations. The contextual objects perturbed include the table, chairs, and fruit bowl. The agent gives the wrong answer “television” to the question (ground truth: chair) after seeing the adversarial textures in subfigure (b). Yellow boxes show the perturbed texture regions.

For question answering and visual recognition, we generate 3D adversarial perturbations using our proposed method on the test set and evaluate agent performance throughout the entire process, i.e., the agent is randomly placed and navigates to answer a question or recognize an object. As shown in Table 1, for white-box attacks, there is a significant drop in question answering accuracy from 50.23%, 44.19%, and 39.94% to 6.17%, 4.26%, and 3.42% for tasks with 10, 30, and 50 steps, respectively. Further, the visual recognition accuracy drastically decreases from 89.91% to 18.32%. The black-box attacks also result in a large drop in accuracy. A visualization of the last five steps before the agent's decision for EQA is shown in Figure 3. Our perturbations are unambiguous to humans but misleading to the agent.

For navigation, we generate 3D adversarial perturbations that intentionally stop the agent, i.e., make the agent predict Stop during the navigation process. As shown in Table 1, for both white-box and black-box attacks, the values of d_T and d_min significantly increase compared to the clean environment when our perturbations are added, especially for long-distance tasks, i.e., T_{-50}. Further, the values of d_Δ decrease to around 0 after the attack, which reveals that agents make only a small number of movements or meaningless steps toward the destination. Some values even become negative, showing that the agent moves away from the target.

Figure 4: The attention maps of different models. In both scenes (a) and (b), the first line presents the attention maps of PACMAN-RL+Q, and the second line presents those of VIS-VGG. We observe that the attention zones highlight similar context of the scenes for prediction.

Attention similarity. Further, to understand the transferability of attacks between different models, we investigate their attention correlation. We first visualize the attention maps of the last 5 views using PACMAN-RL+Q and VIS-VGG in Figure 4, and we observe that the attention zones highlight similar contextual regions of the scenes for prediction. Moreover, we compare the top-3 most important views between PACMAN-RL+Q and VIS-VGG on 32 questions and find that a large portion of the included views are the same for both models. Such attention similarities between different models could facilitate the transferability of black-box attacks.
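A small sketch of this comparison, assuming we have the per-view contribution weights produced by the trajectory attention module for each model; the helper name is illustrative.

    def top_k_overlap(weights_a, weights_b, k=3):
        """weights_a, weights_b: per-view contribution weights from two models on one trajectory."""
        top_a = set(sorted(range(len(weights_a)), key=lambda i: weights_a[i], reverse=True)[:k])
        top_b = set(sorted(range(len(weights_b)), key=lambda i: weights_b[i], reverse=True)[:k])
        return len(top_a & top_b) / k    # fraction of shared important views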

In summary, our generated 3D adversarial perturbations achieve strong attack performance in both the white-box and black-box settings for navigation and problem-solving in the embodied environment.

Figure 5: Transferability of attacks when presented with a black-box renderer, reported in terms of (a) QA accuracy and (b)-(d) the navigation metrics d_T, d_Δ, and d_min. Methods (1) to (4) represent PACMAN-RL+Q, NAV-GRU, NAV-Reactive, and VIS-VGG, respectively. Our framework generates adversarial perturbations with strong transferability to black-box renderers.

5.5 Transfer Attack onto a Black-box Renderer

Our proposed framework adversarially attacks the agent through a differentiable renderer using end-to-end gradient-based optimization. In this section, we further examine the potential of our framework in practice, where no assumptions are made about the non-differentiable black-box renderer. By enabling interreflection and rich illumination, the non-differentiable renderer can render images at a higher computational cost, so that the rendered 2D images are closer to real-world physics. These experiments therefore illustrate the transferability of the generated adversarial perturbations and their potential in practical scenarios.

Specifically, we use the original non-differentiable renderer of EQA-v1, which is implemented in OpenGL with parameters unknown to us, as the black-box renderer. We first generate 3D adversarial perturbations using our neural renderer and save the perturbed scenes. We then evaluate agent performance on those perturbed scenes through the non-differentiable renderer to test the transferability of our adversarial perturbations.
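The protocol can be sketched as follows; every function and method name here is a placeholder for illustration, not a real API.

    def transfer_attack_evaluation(scene, question, agent,
                                   neural_renderer, blackbox_renderer,
                                   optimize_perturbation, evaluate_agent):
        # 1. white-box optimization of the 3D perturbations through the neural renderer
        perturbed_scene = optimize_perturbation(scene, question, agent, neural_renderer)
        # 2. serialize the perturbed meshes/textures (some precision may be lost here)
        path = perturbed_scene.save("perturbed_scene.obj")
        # 3. reload and evaluate the agent through the black-box (OpenGL) renderer only
        reloaded = blackbox_renderer.load(path)
        return evaluate_agent(agent, reloaded, question)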

As shown in Figure 5, our spatiotemporal attacks can easily be transferred to a black-box renderer. However, the generated adversarial perturbations are less effective at attacking the non-differentiable renderer than the neural renderer. Many recent studies have reported that attacking 3D space is much more difficult than attacking the image space [34, 30]. Further, we believe there are three other reasons for this phenomenon: (1) during the experiment, we save the perturbed scenes into files after the attack and then feed these files to the black-box renderer to test performance; this step inevitably incurs some information loss, which may decrease the attack success rate; (2) parameter differences between the two renderers may cause small rendering differences for the same scenarios, and since adversarial examples are very sensitive to image transformations [31, 13], the attacking ability is impaired; (3) adversarial perturbations generated by optimization-based or gradient-based methods often fail to obtain strong transferability due to either overfitting or underfitting [8].

5.6 Generalization Ability of the Attack

In this section, we further investigate the generalization ability of our generated adversarial perturbations. Given questions and trajectories, we first perturb the 3D objects and save the scene. Then, loading the same perturbed scene, we ask agents different questions and change their starting points to test their performance.

Setting   QA accuracy (T-10 / T-30 / T-50)
Clean     51.42% / 42.68% / 39.15%
Attack    6.05% / 3.98% / 3.52%
Q         10.17% / 8.13% / 7.98%
T         8.19% / 7.26% / 7.14%
Table 2: Generalization ability experiments. Our 3D perturbations generalize well in settings using different questions and starting points.

We first use the same perturbations for different questions (denoted as “Q”). We keep the object mentioned in the questions the same during perturbation generation and testing. For example, we generate the perturbations based on the question “What is the color of the table in the living-room?” and test the success rate on the question “What is next to the table in the living-room?”. Moreover, we use the same perturbations to test agents from different starting points (i.e., different trajectories, denoted as “T”). We first generate the perturbations and then test them by randomly spawning agents at different starting points (i.e., random rooms and locations) under the same questions. As shown in Table 2, the attacking ability drops only slightly compared to the baseline attack (generating and testing the perturbation in the same scene with the same questions and starting point, denoted as “Attack”), with somewhat higher QA accuracy in both settings, but remains very strong, which indicates the strong generalization ability of our spatiotemporal perturbations.

5.7 Improving Agent Robustness with Adversarial Training

Given the vulnerability of existing embodied agents in the presence of adversarial attacks, we study defense strategies to improve agent robustness. In particular, we base our defense on adversarial training [11, 1, 20, 28, 37], where we integrate our generated adversarial examples into model training.

Training. We train two PACMAN-RL+Q models augmented with adversarial examples (i.e., 3D adversarial perturbations generated on object textures) or with Gaussian noise, respectively. We apply the common adversarial training strategy of adding a fixed number of adversarial examples in each epoch [11, 1], and we defer further experimental details to the supplementary material.
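A sketch of this recipe, under the assumption that in each epoch a fixed number of training episodes are replaced by episodes rendered from adversarially perturbed (or noise-augmented) scenes; the data structures and helper names are assumptions for illustration.

    import random
    import torch

    def adversarial_training(agent, episodes, make_adversarial_scene,
                             loss_fn, n_adv_per_epoch=50, epochs=20, lr=1e-3):
        opt = torch.optim.Adam(agent.parameters(), lr=lr)
        for _ in range(epochs):
            adv_ids = set(random.sample(range(len(episodes)), n_adv_per_epoch))
            for i, episode in enumerate(episodes):
                if i in adv_ids:
                    episode = make_adversarial_scene(agent, episode)  # attack the current model
                opt.zero_grad()
                loss = loss_fn(agent, episode)     # standard EQA training loss on this episode
                loss.backward()
                opt.step()
        return agent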

Figure 6: Visualization of scene with different noises. From left to right: clean, adversarial perturbations, and Gaussian noises.

Testing. We create a test set of 110 questions in 5 houses. As shown in Figure 6, following [11, 14], we add different noises, namely adversarial perturbations and Gaussian noise. To conduct fair comparisons, adversarial perturbations are generated in the white-box setting (e.g., for our adversarially trained model, we generate adversarial perturbations against that model). For question answering, the average QA accuracy of the three models under the two types of noise (Adv, Gaussian) is (5.67%, 22.14%), (23.56%, 38.87%), and (8.49%, 32.90%), respectively. For navigation, the termination distance d_T of the three models under the two types of noise (Adv, Gaussian) is (1.39, 1.20), (1.17, 1.01), and (1.32, 1.09), respectively. The results support the conclusion that training on our adversarial perturbations can improve agent robustness towards some types of noise (i.e., higher QA accuracy and lower d_T).

5.8 Ablation Study

Next, we present a set of ablation studies to further demonstrate the effectiveness of our proposed strategy under different hyper-parameters N and M, i.e., different numbers of historical scene views and contextual objects considered. All experiments in this section are conducted on T_{-30}. More results are in the Supplementary Material.

Number of historical scene views. For N, we set N = 1, 2, 3, 4, 5, i.e., up to the last five scene views. For a fair comparison, we fix the overall magnitude of perturbations to 32/255. As shown in Figure 7(a), for navigation, we obtain nearly the optimal attack success rate at N = 3. The results are similar for question answering; however, the attack ability does not increase as significantly as it does for navigation when N grows. Evidently, the agent mainly depends on the target object and contextual objects to answer the questions, and the set of contextual objects to be perturbed stays quite similar as the number of historical scene views considered increases.

Number of contextual objects. For M, we set M = 1, 2, 3, 4, 5, 6 with N = 3 to evaluate the contribution of context to adversarial attacks. Similarly, we fix the overall magnitude of adversarial perturbations to 32/255 for attacks with different M values, i.e., the perturbation is either added to a single object or distributed over several contextual objects. As shown in Figure 7(b), the attack success rate increases significantly with increasing M and converges at around M = 5. The reason is that the maximum number of objects observable in 3 frames is around 5 or 6. Further, by considering the type of question, we can obtain a deeper understanding of how an agent makes predictions. For questions about location and composition, e.g., “What room is the OBJ located in?” and “What is on the OBJ in the ROOM?”, the attack using context significantly outperforms the single-object attack, with 4.67% and 28.51%, respectively. However, for color-related questions, the contextual attack and the single-object attack result in 3.56% and 9.88%, respectively. Intuitively, agents rely on different information to solve different types of questions. According to the attention visualization shown in Figure 8, agents generally utilize clues from contextual objects to answer locational and compositional questions, while mainly focusing on target objects when predicting their colors.

Figure 7: Ablation study with different N and M values in (a) and (b). Historical scene views and contextual objects significantly enhance our attacking ability.

Figure 8: Visualization of the last 5 views of the agent and the corresponding attention maps. Subfigure (a) shows a locational/compositional question, and subfigure (b) shows a color-related question. Agents use clues from contextual objects to answer locational and compositional questions, while mainly focusing on target objects when predicting their colors.

5.9 Texture vs. Shape

In this section, we study the importance of texture and shape for model predictions. For a fair comparison, we set the same constraint on perturbation magnitude for both texture and shape attacks, as in Section 5.3. According to the accuracy of the texture attack (4.26%) and the shape attack (27.14%) on the T_{-30} task, perturbing textures is far more effective than perturbing shapes. A question emerges: which is more important for model prediction, texture or shape?

A recent study [10] demonstrated that CNNs are strongly biased towards recognizing textures. Compared to long-range dependencies encoded in the shapes of objects, standard CNNs prefer local textures [35]. Thus, it is not uncommon to see that the agent is more likely to make errors when 3D object textures are adversarially perturbed.

Figure 9: Visualization of scene perturbed on different physical parameters. From left to right: clean, shape attacks, and texture attacks.

Since deep learning prefers textural information when making decisions, it is worth studying which features humans rely on more. As a preliminary step, we examined which features human predictions are more sensitive to, with a user study conducted on Amazon Mechanical Turk (AMT). With each object adversarially perturbed in texture and in shape (see Figure 9), participants were asked to assign the adversarial objects to one of five classes (the ground-truth class, the top-3 adversarial target classes, and “none of the above”). Our results show that the classification accuracy for adversarial texture manipulation (83.3%) was higher than that for shape (32.7%), indicating that shape is a more sensitive parameter for human predictions than texture. This is intuitive, since people are more likely to recognize a table with unusual textures than a wooden table with a strange shape.

In conclusion, embodied agents trained with most current strategies are more sensitive to texture than to shape. This is in stark contrast to humans and reveals fundamental differences in classification strategies between humans and machines. Therefore, to bridge the gap between human perception and embodied perception, it is important to train agents that better capture shape-based features. Could we obtain stronger policies for agents if we train them with shape-based adversarial perturbations? We leave this as future work.

6 Conclusion

In this paper, we generate spatiotemporal perturbations to form 3D adversarial examples that attack the embodiment. Regarding the temporal dimension, since agents make predictions based on historical observations, we develop a trajectory attention module to explore scene view contributions, which further helps localize the 3D objects that appear with the highest stimuli. Coupled with these clues from the temporal dimension, along the spatial dimension we adversarially perturb the physical properties (e.g., texture and 3D shape) of the contextual objects that appear in the most important scene views. Extensive experiments on the EQA-v1 dataset for several embodied tasks in both the white-box and black-box settings demonstrate that our framework has strong attack and generalization abilities.

Currently, most embodied tasks, especially EQA, can only be evaluated in simulated environments. In the future, we are interested in investigating the attack abilities of our spatiotemporal perturbations in real-world scenarios. Using projection or 3D printing, we could bring our perturbations into the real world to attack a physical agent. Further, we would like to attack more diverse models (especially non-end-to-end frameworks, when applicable to EQA) on different platforms.

References

  • [1] K. Alexey, G. Ian, and B. Samy (2017) Adversarial machine learning at scale. In International Conference on Learning Representations, Cited by: §5.7, §5.7.
  • [2] A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420. Cited by: §1.
  • [3] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok (2017) Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397. Cited by: §4.3.
  • [4] T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer (2017) Adversarial patch. arXiv preprint arXiv:1712.09665. Cited by: §2.
  • [5] L. Carlone and S. Karaman (2018) Attention and anticipation in fast visual-inertial navigation. IEEE Transactions on Robotics. Cited by: §4.1.
  • [6] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Embodied question answering. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3, §5.1, §5.2, §5.3, §5.4.
  • [7] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Neural modular control for embodied question answering. arXiv preprint arXiv:1810.11181. Cited by: §1, §5.2, §5.4.
  • [8] Y. Dong, F. Liao, T. Pang, and H. Su (2018) Boosting adversarial attacks with momentum. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §5.5.
  • [9] R. Garland-Thomson (2009) Staring: how we look. Cited by: §4.2.
  • [10] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2018) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231. Cited by: §5.9.
  • [11] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §2, §3.1, §5.7, §5.7, §5.7.
  • [12] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi (2018) Iqa: visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [13] C. Guo, M. Rana, M. Cisse, and L. Van Der Maaten (2017) Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117. Cited by: §5.5.
  • [14] D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, Cited by: §5.7.
  • [15] H. Kato, Y. Ushiku, and T. Harada (2018) Neural 3d mesh renderer. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [16] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi (2017) Ai2-thor: an interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474. Cited by: §1.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, Cited by: §1.
  • [18] A. Liu, X. Liu, J. Fan, A. Zhang, H. Xie, and D. Tao (2019) Perceptual-sensitive GAN for generating adversarial patches. In 33rd AAAI Conference on Artificial Intelligence, Cited by: §1, §2.
  • [19] H. D. Liu, M. Tao, C. Li, D. Nowrouzezahrai, and A. Jacobson (2019) Beyond pixel norm-balls: parametric adversaries using an analytically differentiable renderer. Cited by: §1, §2.
  • [20] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representation, Cited by: §5.7.
  • [21] A. Mohamed, G. E. Dahl, and G. Hinton (2011) Acoustic modeling using deep belief networks. IEEE T AUDIO SPEECH. Cited by: §1.
  • [22] K. R. Mopuri, A. Ganeshan, and V. B. Radhakrishnan (2018) Generalizable data-free objective for crafting universal adversarial perturbations. IEEE T PATTERN ANAL. Cited by: §2.
  • [23] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2016) Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint. Cited by: §1.
  • [24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision, Cited by: §4.1.
  • [25] L. Smith and M. Gasser (2005) The development of embodied cognition: six lessons from babies. Artificial life 11 (1-2), pp. 13–29. Cited by: §3.
  • [26] I. Sutskever, O. Vinyals, and Q. Le (2014) Sequence to sequence learning with neural networks. NeurIPS. Cited by: §1.
  • [27] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §2, §2, §3.1.
  • [28] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2018) Ensemble adversarial training: attacks and defenses. In International Conference on Learning Representation, Cited by: §5.7.
  • [29] E. Wijmans, S. Datta, O. Maksymets, A. Das, G. Gkioxari, S. Lee, I. Essa, D. Parikh, and D. Batra (2019) Embodied question answering in photorealistic environments with point cloud perception. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §5.2, §5.4.
  • [30] C. Xiao, D. Yang, B. Li, J. Deng, and M. Liu (2019) Meshadv: adversarial meshes for visual recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §5.4, §5.5, Table 1.
  • [31] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille (2017) Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991. Cited by: §5.5.
  • [32] J. Yang, Z. Ren, M. Xu, X. Chen, D. Crandall, D. Parikh, and D. Batra (2019) Embodied visual recognition. IEEE International Conference on Computer Vision. Cited by: §1, §5.3.
  • [33] L. Yu, X. Chen, G. Gkioxari, M. Bansal, T. L. Berg, and D. Batra (2019) Multi-target embodied question answering. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • [34] X. Zeng, C. Liu, Y. Wang, W. Qiu, L. Xie, Y. Tai, C. Tang, and A. L. Yuille (2019) Adversarial attacks beyond the image space. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §5.4, §5.5, Table 1.
  • [35] T. Zhang and Z. Zhu (2019) Interpreting adversarially trained convolutional neural networks. arXiv preprint arXiv:1905.09797. Cited by: §1, §5.9.
  • [36] Y. Zhang, H. Foroosh, P. David, and B. Gong (2019) CAMOU: learning physical vehicle camouflages to adversarially attack detectors in the wild. In International Conference on Learning Representations, Cited by: §2.
  • [37] C. Zhu, Y. Cheng, Z. Gan, S. Sun, T. Goldstein, and J. Liu (2020) Freelb: enhanced adversarial training for language understanding. In International Conference on Learning Representation, Cited by: §5.7.