The natural language grounded visual navigation task asks an embodied agent to navigate to a goal position by following language instructions [1, 43, 38, 35, 5]. It has attracted wide research interest in recent years because an instruction-following navigation agent is flexible and practical in many real-world applications, such as personal assistants and in-home robots. To navigate successfully, the agent needs to extract the key information from a long instruction, e.g., visual objects, specific rooms, or navigation directions, according to the dynamic visual observation, and use it to guide navigation at each timestep. However, due to the complexity and semantic ambiguity of natural language, it is hard for a navigator to learn cross-modality alignment effectively and to capture accurate semantic intentions from the instruction when it is trained on limited human-annotated instruction-path data. Prior works mainly employed data augmentation to address the data scarcity in navigation tasks [18, 42, 19]. One line of work proposed a speaker-follower framework that generates augmented instructions for randomly sampled paths. However, generating a large number of complete instructions is costly and may not emphasize the most instructive information. Other works focused on creating challenging augmented paths and diverse visual scenes, while generating the augmented instructions directly with the same speaker model. Therefore, the resulting improvement in the navigator's instruction understanding may also be limited.
In recent years, increasing attention has been paid to designing adversarial attacks for natural language processing (NLP) tasks in order to verify and improve the robustness of NLP models [3, 26, 14, 39]. Inspired by this, we consider the following question: can we design adversarial attacks on the instruction to generate helpful adversarial samples that improve the robustness of the navigator? A simple way to generate adversarial instructions is to borrow existing attack methods for NLP tasks [39, 51] directly. However, this is difficult because existing adversarial attacks on NLP are usually optimized with classification-based goal functions [3, 39], which are unavailable in navigation tasks. Moreover, the key instruction information for navigation changes dynamically, whereas the attack methods developed for NLP are designed for a static setting. In this paper, we make the first attempt to introduce adversarial attacks on the language instruction of navigation tasks to improve the robustness of navigators. Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to minimize the navigation reward by dynamically destroying key instruction information and generating a perturbed instruction at each timestep. An adversarial training strategy is then adopted to improve the robustness of the navigator by asking it to maximize the navigation reward under the perturbed instruction. To encourage the agent to be aware of the actual key information and to improve its fault tolerance under perturbed instructions, we also introduce an auxiliary self-supervised reasoning task for the navigator, which requires it to identify the word actually attacked by the DR-Attacker at each timestep according to the instruction and the current visual observation. As a result, the more accurately the DR-Attacker attacks the important instruction information, the more likely the agent is to capture the actual key information for navigation.
Since navigation is a sequential decision-making problem without direct classification-based objectives, we formulate the perturbation generation as a Markov Decision Process and present a reinforcement learning (RL) solution that generates perturbed instructions by misleading the navigator into moving to the wrong target position. At each timestep, the policy agent, i.e., our proposed DR-Attacker, substitutes the most crucial target word in the current instruction with the candidate substitution word that has the maximum perturbation impact, according to a learnable attack score. As a result, the DR-Attacker learns to highlight the important parts of an instruction and to generate adversarial samples at different timesteps. To enhance navigation robustness, the victim navigator, which receives the perturbed instruction, is trained under the adversarial setting to be immune to the perturbation and to correctly infer which words were actually attacked by the DR-Attacker. An overview of our proposed method is presented in Figure 1. Suppose a person receives a perturbed instruction in which the word “table” is substituted with the word “stairs”. With a good understanding of the instruction and the visual environment, the person can identify the noisy word and still make the correct navigation decision. Therefore, the perturbed instructions, which can be viewed as hard negative samples, effectively encourage the victim navigator to understand the multi-modal observations and to acquire a self-correction ability, and thus to become more robust. Experimental results on both Navigation from Dialog History (NDH) and Vision-and-Language Navigation (VLN) show the superiority of the proposed method over other competitors. Moreover, the quantitative and qualitative results show the effectiveness of the proposed DR-Attacker, which causes a significant drop in navigation performance by disturbing only the most crucial instruction information.
The merits of our proposed DR-Attacker are summarized as follows: First, DR-Attacker can generate perturbed instruction dynamically by capturing and destroying key instruction information in different navigation timesteps. Second, DR-Attacker can be optimized via gradient-based methods under the unsupervised setting, by formulating the perturbation generation as a sequential decision making problem. Last but not least, the adversarial samples produced by DR-Attacker are beneficial for improving the model robustness. The main contributions of this paper are summarized as follows:
We take the first step to introduce adversarial attacks on the language instruction of navigation tasks in order to learn robust navigators. Different from the existing adversarial attack paradigm developed for NLP tasks, which is generally static, the proposed adversarial attack is dynamic during the navigation process.
By formulating the perturbation generation as a Markov Decision Process, the proposed instruction attacker, called Dynamic Reinforced Instruction Attacker (DR-Attacker), can be optimized by the reinforcement learning algorithm to achieve effective perturbation, without the need of classification-based objectives.
To improve the robustness of the navigator, an alternating adversarial training strategy and an auxiliary self-supervised reasoning task are employed to train the navigator on perturbed instructions, which effectively enhances the cross-modal understanding ability of the navigator.
Experimental results on two popular natural language grounded visual navigation tasks, i.e., Vision-and-Language Navigation (VLN) and Navigation from Dialog History (NDH) show that the model robustness can be effectively enhanced by the proposed method. Moreover, both the quantitative results and visualized results show the effectiveness of the proposed DR-Attacker.
The remainder of this paper is organized as follows. Section B gives a brief review of the related work. Section C describes the problem setup of natural language grounded visual navigation tasks and then introduces our proposed method. Experimental results are provided in Section D. Section E concludes the paper and presents some outlook for future work.
B Related Work
B.1 Natural Language Grounded Visual Navigation
Natural language grounded visual navigation tasks [46, 38, 35, 43, 1, 5, 36] have attracted extensive research interest in recent years since they are practical and pose great challenges for vision-language understanding [2, 12, 11, 50]. In this paper, we mainly focus on two natural language grounded navigation tasks, namely Vision-and-Language Navigation (VLN) and Navigation from Dialog History (NDH). In VLN [1, 18, 42, 19], a navigation agent is asked to move to the goal position by following a navigation instruction. Specifically, the instruction is a sequence of declarative sentences such as “Walk down stairs. Walk past the chartreuse ottoman in the TV room. Wait in the bathroom door threshold.” Therefore, to navigate successfully to the goal position, the agent needs to understand the instruction well and learn to ground it in visual observations. To achieve this, the Reinforced Cross-Modal Matching (RCM) approach was proposed to enforce cross-modal grounding both locally and globally via reinforcement learning (RL). A visual-textual co-grounding module was designed to distinguish, with respect to the visual observations, the instruction parts that have been completed from those that still need to be completed. To better encourage the navigator to understand the diverse instructions and navigation environments, existing works adopted data augmentation [42, 18, 19] to address the data scarcity in the original dataset. A speaker-follower model was proposed to produce augmented instructions for randomly sampled paths. Environmental Dropout was proposed to create new (environment, path, instruction) triplets, while directly using the speaker model of the speaker-follower framework to generate the augmented instructions.
The Cooperative Vision-and-Dialog Navigation (CVDN) dataset was recently proposed, and Navigation from Dialog History (NDH) is a task defined on the CVDN dataset that requires an agent to move towards the goal position by following a dialog history. Although the visual scenes in the CVDN dataset are similar to those of the R2R dataset used for the VLN task, the instruction in the CVDN dataset, which is composed of the dialog history and the current question-answer pair, is longer and more complicated than a VLN instruction, and is therefore harder for the agent to understand and to ground visually. To better exploit useful textual information for successful navigation, the Cross-modal Memory Network (CMN) was proposed to make use of the rich information in the dialog history. A pretraining scheme based on image-text-action triplets was also employed to improve instruction understanding and cross-modality alignment. While existing methods have achieved some improvement in instruction understanding through data augmentation [18, 42, 19] or pretraining [20, 27], the quality of the augmented instructions has received little attention, which limits the improvement in model robustness. In contrast, we adopt an adversarial attack paradigm to encourage the generation of meaningful adversarial instructions, which can serve as hard augmented samples to better enhance navigation robustness.
B.2 Adversarial Attacks in NLP
Adversarial attacks were originally introduced to validate the robustness of deep neural network models [22, 41]. In recent years, many NLP researchers have focused on introducing adversarial attacks for NLP tasks, which can serve as a powerful tool for evaluating model vulnerability and, more importantly, for improving the robustness of NLP models [45, 39, 53, 6, 25]. The key principle of an adversarial attack is to impose a perturbation on the original input that is imperceptible to humans yet easily fools the neural model into making an incorrect prediction. Most adversarial attacks on NLP tasks are word-level attacks [51, 39] or character-level attacks [3, 14]. HotFlip introduced white-box adversarial samples based on an atomic flip operation to trick a character-level neural classifier. A word-level attack model based on a sememe-based word substitution method and a particle swarm optimization-based search algorithm was also proposed and applied to Bi-LSTM and BERT. Due to the discrete nature of natural language, adversarial attacks on language, such as inserting, removing or replacing a specific character or word, can easily change the meaning or break the grammaticality and naturalness of the original sentence [51, 44]. Therefore, an adversarial perturbation on language is inherently easier for humans to perceive than one on an image. The attack we introduce on the instruction can nevertheless be viewed as an adversarial attack for the following reasons. First, we constrain our DR-Attacker to replace a single word at a specific timestep so that the magnitude of the perturbation remains small. Second, although the local key information, e.g., a visual object word, is destroyed, a human, who can comprehend the long-term intention of the instruction and infer the original instruction information from the current visual observation, is not easily misled by such a perturbation. The agent, however, tends to learn a simple alignment between the instruction and the visual observation, and is therefore more easily misled and more likely to get stuck. Third, the replacement is conducted between words of the same category, which preserves the grammaticality and naturalness of the original sentence. Since incorrect visual object, location or action words can easily appear in instructions in realistic scenarios, e.g., because of a wrong human annotation or an object that previously existed in the scene but has since disappeared, we impose the perturbation on visual object or location words rather than on uninformative words, which is more beneficial for enhancing navigation robustness.
In contrast to existing adversarial attacks on NLP which are generally static and optimized with classification-based objectives, our proposed DR-Attacker can generate dynamic perturbation on the instruction, and can be optimized by the RL paradigm under the unsupervised setting. Like other existing works which train the models on the perturbed training samples to improve the robustness of NLP models [53, 24, 16, 30], we also develop the adversarial training strategy to improve the robustness of the navigator using the perturbed instructions generated at each timestep. Moreover, we introduce an auxiliary self-supervised reasoning task during the adversarial training stage, which can better promote the adversarial training results.
B.3 Adversarial Attacks in Navigation
Although adversarial attacks are popular for verifying and improving the robustness of deep learning models in both the image [40, 17, 4, 52] and NLP [25, 49, 3, 26, 51, 45, 6, 14, 39, 53] domains, few works have attempted to employ adversarial attacks to improve the robustness of embodied navigation agents, since the setting and environment in navigation are usually dynamic and complex. Prior work made a first attempt to introduce spatio-temporal perturbations on visual objects for the embodied question answering (EQA) task by perturbing the physical properties (e.g., texture or shape) of the objects. It used the available ground-truth labels to guide the perturbation generation through classification-based objectives. Compared with collecting diverse visual environments to improve the robustness of the agent, annotating a large amount of high-quality and informative instructions is more difficult and labor-intensive for the natural language grounded visual navigation task. Therefore, in contrast to that work, we make the first attempt in this paper to introduce adversarial attacks on the existing instruction data, in order to mitigate the scarcity of high-quality instructions, which largely limits the navigation performance of existing instruction-following agents. Moreover, our perturbation can be optimized in an unsupervised way, which is more practical.
B.4 Automatic Data Augmentation
Automatic data augmentation aims to learn data augmentation strategies automatically according to the target model performance instead of designing them manually based on expert knowledge. AutoAugment formulates the automatic augmentation policy search as a discrete search problem and employs a reinforcement learning (RL) framework to search for a policy consisting of possible augmentation operations. However, training and evaluating thousands of sampled policies in the search process incurs a high computational cost. To speed up the policy search, many variants of AutoAugment have been proposed [23, 28, 30, 21, 9]. PBA introduces population-based training to train the network efficiently in parallel across different CPUs or GPUs. Fast AutoAugment moves the costly search stage from training to evaluation through Bayesian optimization. Adversarial AutoAugment directly learns augmentation policies on the target task and develops an adversarial framework in which the policy sampler and the target model are jointly optimized; it is the work most closely related to our proposed method. The difference between our method and Adversarial AutoAugment is that our augmented samples are generated through an adversarial attack rather than a composition of augmentation strategies, and the attack is constrained to be small in magnitude while having a large impact on the agent's performance.
In this section, we describe the natural language grounded visual navigation task first and then introduce our proposed method. The problem setup is given in Sec. C.1. The details of our proposed Dynamic Reinforced Instruction Attacker (DR-Attacker), including the optimization of the perturbation generation, the adversarial training with the auxiliary self-supervised reasoning task, and the model details are presented in Sec. C.2.
C.1 Problem Setup
The natural language grounded visual navigation task requires a navigator to find a route (a sequence of viewpoints) from a start viewpoint to the target viewpoint by following the given instruction. For the NDH task, the instruction is composed of the given target object together with the questions and answers up to the current turn, where the number of turns is bounded by the total number of question-answer turns from the initial position to the target room. For the VLN task, the instruction is composed of a sequence of sentences. Since the target object, questions, answers and sentences can all be represented by word tokens, for both NDH and VLN tasks we formulate the instruction as a set of word tokens whose size is the length of the instruction. At each timestep, the navigator receives a panoramic view as the visual observation. Each panoramic view is divided into 36 image views, each of which contains an RGB image accompanied by its orientation, given by the angles of heading and elevation. The view features are obtained following prior work. Based on the visual observations and the instruction, the navigator infers the action for each step from the candidate action list, which consists of the neighbours of the current node in the navigation graph and a stop action. Generally, the navigator is a sequence-to-sequence model with an encoder-decoder architecture [1, 43].
C.2 Dynamic Reinforced Instruction Attacker
C.2.1 Perturbation Generation as an RL Problem
Since navigation tasks provide no direct label, as classification-based tasks do [3, 39], for judging the success of an attack, we use a reinforcement learning (RL) framework to formulate the perturbation generation. The framework contains two major components: an environment model, which is a well-trained navigator (also called the victim navigator), and an instruction attacker, which can be viewed as the policy agent. The environment model and the attacker have separate parameters. The attacker learns to disturb the correct action decision of the victim navigator by generating a perturbed instruction for it at each timestep. Under the RL setting, the state is the visual state. The action is the perturbation operation that substitutes a selected target word in the original instruction with a candidate word. The construction details of the target word set and the candidate substitution word set for each instruction are given in Sec. C.2.3. Note that the attack operation is conducted sequentially at each navigation step rather than once at the beginning, since the key instruction information changes dynamically during the navigation process. To measure the success of the attack and to design a reasonable reward for optimizing the attacker in such navigation tasks, we propose “deviation from the target position” as a metric. That is, the goal of the attacker is to force the navigator to follow a wrong navigation trajectory and to stop at a position far from the target position. Therefore, the reward for the attacker is negative if the victim navigator stops within a predefined distance threshold of the target viewpoint at the final step, and positive otherwise. We also adopt a direct reward at each non-stop step that considers the progress, i.e., the change in the distance to the target viewpoint made at the current timestep.
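The sign conventions of the attacker reward can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the input format (a list of distances to the target along the trajectory) and the handling of zero progress are assumptions, while the magnitudes 3 and 1 are the constants reported in the implementation details. The distance threshold is left as a parameter because its value is not given in this section.

```python
def attacker_rewards(distances, threshold, final_reward=3.0, step_reward=1.0):
    """Per-step rewards for the attacker along one navigator trajectory.

    distances[i] is the navigator's distance (in meters) to the target
    viewpoint after i moves; the last entry is the stopping position.
    The input format and the treatment of zero progress are assumptions.
    """
    if not distances:
        raise ValueError("distances must contain at least the start position")

    rewards = []
    # Non-stop steps: the attacker is penalized when the navigator gets closer.
    for before, after in zip(distances, distances[1:]):
        progress = before - after
        rewards.append(-step_reward if progress > 0 else step_reward)

    # Final (stop) step: negative if the navigator stopped near the target.
    stopped_near_target = distances[-1] <= threshold
    rewards.append(-final_reward if stopped_near_target else final_reward)
    return rewards


# Hypothetical trajectory: the navigator ends 2 m from the target.
print(attacker_rewards([10.0, 7.0, 8.0, 2.0], threshold=3.0))
```

Under the zero-sum formulation used for adversarial training, the navigator's reward at each step is simply the negative of the attacker's reward.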
If the navigator makes positive progress towards the target position at a non-stop step, the direct reward for the attacker is negative. As in prior work, each reward in our RL setting is a predefined constant. To satisfy the “small perturbation” principle of adversarial samples [39, 51, 53, 3], the attacker is required to substitute only one word in the instruction at each timestep. Without loss of generality, we apply the Advantage Actor-Critic (A2C) algorithm to iteratively update the parameters of the attacker. The A2C framework contains a policy network (here, the attacker) and a value network, each with its own parameters, to learn an optimal policy. Given the state, action and reward observed at each step, the algorithm computes the total accumulated reward, discounted by a discount factor, together with the policy gradient, the value gradient and the entropy gradient. The policy gradient is weighted by the advantage, i.e., the difference between the accumulated reward and the value estimate. Subsequently, an optimization step is performed in the direction that maximizes both the advantage-weighted log-likelihood of the chosen actions and the entropy of the policy, and that minimizes the mean squared error of the value estimate. Therefore, by using the RL paradigm, the attacker learns to generate perturbed instructions at each timestep that disturb the action decision of the navigator and mislead it into stopping at the wrong target position. In our settings, the value network is a two-layer MLP.
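The quantities used by A2C can be illustrated with a small, framework-free sketch. It follows the standard A2C formulation rather than the exact equations of this paper; the function names and the loss coefficients `value_coef` and `entropy_coef` are hypothetical. In a real implementation the log-probabilities, entropies and value estimates would be differentiable tensors, and the advantage would be treated as a constant in the policy term.

```python
def discounted_returns(rewards, gamma):
    """Total accumulated (discounted) reward from each step to the episode end."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def a2c_loss(log_probs, entropies, rewards, values, gamma,
             value_coef=0.5, entropy_coef=0.01):
    """Scalar A2C objective (to be minimized) for one episode.

    log_probs[t] : log-probability of the action taken at step t
    entropies[t] : entropy of the policy distribution at step t
    values[t]    : value-network estimate at step t
    """
    returns = discounted_returns(rewards, gamma)
    advantages = [ret - val for ret, val in zip(returns, values)]
    # Maximize advantage-weighted log-likelihood -> minimize its negative.
    policy_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages))
    # Minimize the squared error of the value estimate.
    value_loss = sum(adv * adv for adv in advantages)
    # Maximize entropy -> subtract it from the loss.
    entropy_bonus = sum(entropies)
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```

For example, `discounted_returns([1.0, 1.0, 1.0], 0.5)` accumulates the rewards backwards from the last step, so earlier steps receive larger returns than later ones.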
C.2.2 Adversarial Training with Auxiliary Self-supervised Task
To improve navigation robustness, we develop an adversarial training strategy that jointly optimizes the victim navigator and the attacker. Through alternating optimization under the adversarial setting, the attacker iteratively learns to create misleading instructions that confuse the victim navigator, while the victim navigator is trained on the perturbed instructions to enhance its robustness. Motivated by prior work, we use an RL strategy to train both the victim navigator and the attacker, and formulate the adversarial setting as a two-player zero-sum Markov game. At each timestep, both the attacker and the victim navigator receive the visual observation and the language instruction (the navigator receives the perturbed instruction, which varies across timesteps). Then the attacker takes its action by generating the perturbed instruction, and the navigator takes its action by moving to the next viewpoint. Since the objective of the attacker is the inverse of that of the navigator, which is supposed to stop as close as possible to the target position, the two agents receive opposite rewards: the reward of the attacker is the negative of the reward of the navigator. Our adversarial setting can therefore be viewed as a min-max problem over the two policies, in which the victim navigator maximizes the expected accumulated navigation reward while the attacker minimizes it.
We conduct an alternating optimization procedure between the navigator and the attacker, namely, we keep the parameters of one agent fixed while optimizing the other. The optimization procedure of adversarial training is given in Algorithm 1. At stage 1, we pre-train the navigator and use the pre-trained navigator to pre-train the attacker. At stage 2, we alternate between updating the navigator and updating the attacker to implement the joint optimization. To facilitate implementation, the RL strategy for training the victim navigator also follows the A2C algorithm, as in prior work. To encourage the agent to capture the actual key information and to improve its fault tolerance under perturbed instructions, which is important for robust navigation, we introduce an auxiliary self-supervised reasoning task during the training phases of the victim navigator, which asks the navigator to predict, at each timestep, the word actually attacked by the attacker.
Here, the prediction is a probability distribution over the target word set of the given instruction. It is computed from the target word features and the visual-and-instruction aware hidden state feature of the decoder in the navigator, each of which is passed through a learnable linear transformation. The prediction is optimized with a cross-entropy loss, and the ground-truth label is the word actually attacked by the attacker. As a result, the more accurately the attacker attacks the key instruction information, the more likely the agent is to capture the actual important instruction information and to acquire a self-correction ability. Therefore, through the auxiliary self-supervised reasoning task, the enhancement of the attacker effectively leads to the improvement of the navigator.
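The auxiliary reasoning task can be sketched as a small classification problem over the target word set. The sketch below is an assumption about the general form only: it presumes that the target word features and the decoder hidden state have already been projected into a shared space by the learnable linear transformations, and it scores each word by a dot product, which is not necessarily the scoring function used in this paper. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attacked_word_loss(target_word_feats, hidden_state, attacked_index):
    """Predict which target word was attacked and return (probs, loss).

    target_word_feats : one (already projected) feature vector per target word
    hidden_state      : (already projected) decoder hidden state
    attacked_index    : index of the word the attacker actually replaced,
                        used as the self-supervised ground-truth label
    """
    scores = [
        sum(w * h for w, h in zip(feat, hidden_state))
        for feat in target_word_feats
    ]
    probs = softmax(scores)
    # Cross-entropy with a one-hot label on the attacked word.
    loss = -math.log(probs[attacked_index])
    return probs, loss
```

The label comes for free from the attacker's own action, which is why the task requires no additional annotation.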
C.2.3 Model Details
Forward Process of the Instruction Attacker. In this part, we describe the forward process of the proposed DR-Attacker in detail. At each timestep, the DR-Attacker calculates the action prediction probability, also referred to as the attack score, by considering both the word importance in the current instruction and the substitution impact of different candidate words (illustrated in Figure 1). Based on the prior that words indicating visual objects (e.g., “door”) and locations (e.g., “bathroom”) are the most informative for guiding navigation, we construct the target word set in advance by selecting these two kinds of words from each instruction. Each target word in the instruction has its own candidate substitution word set. To promote the understanding of the given instruction while keeping the set size reasonable, we use the remaining target words in the same instruction as the candidate substitution word set of a given target word.
At each timestep, a word importance vector is first calculated from the word features of the target words, encoded by a BiLSTM, and the attended visual feature. Learnable linear transformations convert the two kinds of features into the same embedding space. Then, the substitution impact of the different candidate words for each target word is obtained from the word features of the target word and of its candidate words through another learnable linear transformation. Calculating the substitution impact for all the target words in the instruction yields the substitution impact matrix. Finally, the attack score, i.e., the action prediction probability of the DR-Attacker over the candidate action set, is calculated through the element-wise multiplication of the word importance vector and the substitution impact matrix. Through the learnable attack score, the DR-Attacker learns to generate the optimal perturbation at each timestep. Note that although our word substitution strategy introduces a semantic change with respect to the original target word, we do not distinguish the perturbed instruction from conventional adversarial samples, because a single word substitution has only a subtle impact on the overall intention of the whole instruction.
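A minimal sketch of how an attack score could be formed and used is given below. It assumes that the word importance vector and the substitution impact matrix have already been computed, that the product is normalized with a softmax over all (target word, candidate word) pairs, and that the perturbation is selected greedily; the normalization, the greedy selection and all names are assumptions made for illustration.

```python
import math


def attack_scores(importance, impact):
    """Combine word importance (length n) with substitution impact (n x m)."""
    weighted = [[w * s for s in row] for w, row in zip(importance, impact)]
    flat = [value for row in weighted for value in row]
    peak = max(flat)
    exps = [math.exp(value - peak) for value in flat]
    total = sum(exps)
    probs = [value / total for value in exps]
    width = len(impact[0])
    return [probs[i * width:(i + 1) * width] for i in range(len(impact))]


def greedy_substitution(tokens, target_words, scores):
    """Replace the highest-scoring target word with its best candidate.

    The candidates of a target word are the remaining target words of the
    same instruction, in their original order.
    """
    pairs = [(i, j) for i in range(len(scores)) for j in range(len(scores[i]))]
    best_i, best_j = max(pairs, key=lambda ij: scores[ij[0]][ij[1]])
    target = target_words[best_i]
    candidates = [word for word in target_words if word != target]
    substitute = candidates[best_j]
    perturbed = [substitute if token == target else token for token in tokens]
    return perturbed, target, substitute


# Hypothetical instruction and scores.
tokens = "walk past the ottoman and wait in the bathroom near the stairs".split()
targets = ["ottoman", "bathroom", "stairs"]
scores = attack_scores([0.7, 0.2, 0.1], [[0.3, 0.9], [0.5, 0.5], [0.4, 0.6]])
perturbed, target, substitute = greedy_substitution(tokens, targets, scores)
print(target, "->", substitute)
```

During RL training, an action would typically be sampled from the score distribution instead of taken greedily, so that the attacker can explore different substitutions.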
|Method||Val Seen||Val Unseen||Test Unseen|
|Pretrain||Other phases||Total||Pretrain||Other phases||Pretrain||Other phases|
|-||1661||-||6,582,000||4,742||8 V100 GPUs||1 1080Ti GPU|
|Ours||143||328||471||4,742||4,742||1 1080Ti GPU||1 1080Ti GPU|
Forward Process of the Navigator. After introducing the forward process of the instruction attacker, we present the forward process of the navigator in this part. Specifically, the navigator follows an encoder-decoder architecture, where both the encoder and the decoder are LSTMs. The encoder contains a word embedding layer and a bi-directional LSTM, and its output is the language feature of the instruction. Then, the decoder receives the attended visual feature, the language feature and an action feature, combines them through learnable linear transformations, and generates the visual-and-instruction aware hidden state. The attended visual feature is itself calculated with a learnable linear transformation. The action prediction probability of the navigator is then calculated from the hidden state and the candidate action features through a trainable linear transformation, and the navigator takes its action according to this probability.
The forward processes of the attacker and the navigator are shown in Figure 2. As illustrated in Figure 2, based on the attack score, which is calculated by the element-wise multiplication of the word importance vector and the substitution impact matrix, the perturbation operation is conducted on the original instruction to generate the perturbed instruction. Then the decoder receives the attended visual feature and the perturbed instruction to predict the next action. The updated hidden state of the decoder and the target word features are used to calculate the prediction probability of the actual attacked word for the self-supervised auxiliary reasoning task.
Construction Details of Target Word Set and Candidate Word Set. In this part, we show the construction details of the target word set and the candidate substitution word set for both VLN and NDH tasks. Specifically, for each instruction, we first construct its target word set by string matching between the instruction and a target vocabulary. This vocabulary contains the words indicating visual objects or locations, which are collected from the instruction vocabulary provided with the dataset. Then, the candidate substitution word set of each target word is constructed by selecting the remaining target words in the same instruction. The construction details of the target word set and the candidate substitution word set for the VLN and NDH tasks are shown in Figure 3 and Figure 4, respectively. Note that since the last answer in the dialog history plays the direct role of guiding navigation in the NDH task, we only construct the target word set and conduct the perturbation for the last answer in the NDH task, as shown in Figure 4.
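The construction of the two word sets can be sketched as follows. The vocabulary of object and location words below is a hypothetical example; in practice it would be collected from the instruction vocabulary of the dataset. The tokenization (lower-casing and stripping punctuation) is also an assumption.

```python
def build_target_words(instruction, object_location_vocab):
    """Target word set: instruction words that match the object/location vocabulary."""
    targets = []
    for raw_token in instruction.lower().split():
        token = raw_token.strip(".,!?;:\"'")
        if token in object_location_vocab and token not in targets:
            targets.append(token)
    return targets


def build_candidate_sets(target_words):
    """Candidate substitution set of each target word: the remaining target words."""
    return {
        word: [other for other in target_words if other != word]
        for word in target_words
    }


# Hypothetical vocabulary of visual object and location words.
vocab = {"stairs", "ottoman", "room", "bathroom", "door"}
instruction = ("Walk down stairs. Walk past the chartreuse ottoman in the TV room. "
               "Wait in the bathroom door threshold.")
targets = build_target_words(instruction, vocab)
candidates = build_candidate_sets(targets)
print(targets)
print(candidates["door"])
```

Because the candidates of a word are drawn from the same instruction, the size of each candidate set is always one less than the size of the target word set. For the NDH task, the same functions would be applied to the last answer of the dialog history only.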
|Method||Val Seen||Val Unseen|
|NE (m)||SR (%)||SPL (%)||NE (m)||SR (%)||SPL (%)|
|Adversarial Training w auxiliary task||4.15||62.0||59||5.25||49.6||46|
|Settings||Val Seen||Val Unseen|
|Last A||Last QA||All||Last A||Last QA||All|
|Adversarial Training w auxiliary task||6.80||6.96||7.23||3.90||3.80||3.93|
|Settings||Val Seen||Val Unseen|
|Finetune (Adversarial Training)||5.48||7.13||7.61||3.37||4.16||4.08|
|Finetune (Adversarial Training w auxiliary task)||5.52||7.49||7.66||3.48||4.21||4.20|
|Method||Val Seen||Val Unseen|
In this section, we first introduce the datasets we use for the NDH and VLN tasks, the evaluation metrics, and the implementation details in Sec. D.1. Then we provide the quantitative and qualitative results in Sec. D.2 and Sec. D.3, respectively.
D.1 Experimental Setup
The CVDN dataset [43] contains 2050 human-human navigation dialogs and over 7k trajectories in 83 Matterport houses. Each trajectory is punctuated by several question-answer exchanges. Each dialog begins with an ambiguous instruction, and the subsequent dialog interaction between the navigator and the oracle leads the navigator to the target position. The R2R dataset [1] includes 10,800 panoramic views and 7,189 trajectories. Each panoramic view has 36 images and each trajectory is paired with three natural language instructions. Both the CVDN and R2R datasets are split into a training set, a seen validation set, an unseen validation set, and a test set.
D.1.2 Evaluation Metrics
The following four metrics [1] are used for evaluation on the R2R dataset: 1) Trajectory Length (TL) measures the average length of navigation trajectories in meters; 2) Navigation Error (NE) is the distance between the target viewpoint and the agent stopping position; 3) Success Rate (SR) is the rate at which the agent reaches the goal; 4) Success rate weighted by Path Length (SPL) makes the trade-off between SR and TL. Building on the R2R metrics, three additional metrics are used for evaluation on the CVDN dataset [54]: 1) Goal Progress (GP) measures the average agent progress towards the goal location; 2) Oracle Success Rate (OSR) is the success rate if the agent can stop at the nearest point to the goal along its trajectory; 3) Oracle Path Success Rate (OPSR) is the success rate if the agent can stop at the closest point to the goal along the shortest path.
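As an illustration of how these quantities relate, the following sketch computes TL, NE, SR, SPL and GP from per-episode statistics. It is a simplified stand-in for the official evaluation scripts, not a copy of them: the dictionary keys are hypothetical, the 3 m success radius is the convention commonly used on R2R and is assumed here, SPL follows the usual definition (success weighted by the ratio of the shortest-path length to the longer of the shortest and actual path lengths), and GP is taken as the reduction in distance to the goal.

```python
SUCCESS_RADIUS_M = 3.0  # assumption: common R2R success threshold


def evaluate(episodes, success_radius=SUCCESS_RADIUS_M):
    """Aggregate navigation metrics over a list of episodes.

    Each episode is a dict with (hypothetical) keys:
      path_length   - length of the executed trajectory in meters
      final_dist    - distance from the stopping position to the goal
      shortest_dist - shortest-path distance from the start to the goal (> 0)
    """
    n = len(episodes)
    successes = [e["final_dist"] < success_radius for e in episodes]
    tl = sum(e["path_length"] for e in episodes) / n
    ne = sum(e["final_dist"] for e in episodes) / n
    sr = sum(successes) / n
    # SPL: each success is weighted by shortest / max(actual, shortest).
    spl = sum(
        s * e["shortest_dist"] / max(e["path_length"], e["shortest_dist"])
        for s, e in zip(successes, episodes)
    ) / n
    # GP: how much closer to the goal the agent ended up than it started.
    gp = sum(e["shortest_dist"] - e["final_dist"] for e in episodes) / n
    return {"TL": tl, "NE": ne, "SR": sr, "SPL": spl, "GP": gp}


# Two toy episodes: one success with a detour, one failure.
toy = [
    {"path_length": 10.0, "final_dist": 1.0, "shortest_dist": 8.0},
    {"path_length": 12.0, "final_dist": 5.0, "shortest_dist": 10.0},
]
print(evaluate(toy))
```

The first toy episode succeeds but takes a 10 m path where 8 m would suffice, so it contributes 0.8 rather than 1 to SPL, which is how SPL penalizes long trajectories that SR alone would not.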
D.1.3 Implementation Details
The navigator architecture, training hyperparameters and training strategy we use in both VLN and NDH tasks are the same as in [42]. , , , and are set as 512, 512, 2052, 512 and 512, respectively. The positive/negative rewards of the final step and each non-stop step are set as 3/-3 and 1/-1, respectively. For both VLN and NDH, we split the training process into four steps: 1) pre-train the navigator using the original training set; 2) pre-train the DR-Attacker on the pre-trained navigator while keeping the parameters of the navigator fixed; 3) adversarially train both the navigator and the DR-Attacker by alternating iterations; 4) finetune the navigator on the original training set. The training iterations of the four steps are 40K, 10K, 40K and 200K for VLN, and 5K, 1K, 3K and 3K for NDH. For the adversarial training, the alternation is conducted after 3K and 1K iterations for VLN and NDH, respectively. Following [18], we also use data augmentation of instructions to improve the navigation performance. To improve the learning efficiency, we also introduce imitation learning supervision when training the navigator in the adversarial training stage.
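The four-step schedule can be written down as a small configuration sketch. The iteration counts come from the text above. The stage names, the helper functions, the reading that the trained model switches every 3K (VLN) or 1K (NDH) iterations while the other model is held fixed, and the choice of which model is updated first in the adversarial step are assumptions of this sketch.

```python
STAGES = (
    "pretrain_navigator",
    "pretrain_attacker",
    "adversarial",
    "finetune_navigator",
)

SCHEDULES = {
    "VLN": {"iters": (40_000, 10_000, 40_000, 200_000), "alternate_every": 3_000},
    "NDH": {"iters": (5_000, 1_000, 3_000, 3_000), "alternate_every": 1_000},
}


def stage_at(task, step):
    """Map a global training step to (stage name, step within that stage)."""
    offset = 0
    for name, n_iters in zip(STAGES, SCHEDULES[task]["iters"]):
        if step < offset + n_iters:
            return name, step - offset
        offset += n_iters
    raise ValueError("step is beyond the end of the schedule")


def trainable_at(task, step, first="attacker"):
    """Which model receives gradient updates at a given global step.

    `first` (who starts the alternation) is an assumption of this sketch.
    """
    name, local_step = stage_at(task, step)
    if name in ("pretrain_navigator", "finetune_navigator"):
        return "navigator"
    if name == "pretrain_attacker":
        return "attacker"  # the navigator parameters stay fixed here
    other = "navigator" if first == "attacker" else "attacker"
    block = local_step // SCHEDULES[task]["alternate_every"]
    return first if block % 2 == 0 else other


for probe in (0, 40_000, 50_000, 53_000, 90_000):
    print(probe, stage_at("VLN", probe), trainable_at("VLN", probe))
```

Under these assumptions, the NDH adversarial step of 3K iterations consists of three 1K blocks, and the VLN adversarial step of 40K iterations consists of thirteen full 3K blocks plus a final partial one.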
D.2 Quantitative Results
D.2.1 Comparison with the State-of-the-art Methods
The quantitative comparison results with state-of-the-art methods on VLN and NDH are given in Table I and Table II, respectively. In Table I, we report the three most important metrics in the VLN setting, i.e., Navigation Error (NE), Success Rate (SR) and Success rate weighted by Path Length (SPL). In Table II, we report the Goal Progress (GP) metric under the whole dialog history setting, following most existing works on VDN [43, 54, 20]. Table I indicates that our proposed method outperforms other competitors in most metrics. Compared with the baseline EnvDrop [42], the improvements in SR and SPL of our method are significant in both seen and unseen settings. Table II shows that our method outperforms the state-of-the-art methods by a significant margin on NDH in both seen and unseen environments. We further compare the training time, data and device between the state-of-the-art method PREVALENT [20] and our method on NDH. Since only the implementation of the finetuning phase (https://github.com/weituo12321/PREVALENT_R2R) is available for PREVALENT [20], we only record the reimplemented finetuning time of PREVALENT for comparison. Other values for the pretraining phase of PREVALENT are those reported in their paper. The results are given in Table III. Table III shows that, compared with PREVALENT, our proposed method needs significantly less training time, data, and computation resources while achieving better results, showing the good flexibility of our method. The results on both VLN and NDH show the effectiveness of the proposed method in improving the robustness of the navigation agent.
D.2.2 Ablation Study
In this section, we conduct an ablation study to validate the effectiveness of the proposed adversarial attacking paradigm, adversarial training strategy and auxiliary self-supervised reasoning task. Specifically, the effects of the four-stage training for VLN and NDH tasks are presented in Table IV and Table V. The effectiveness of the auxiliary self-supervised reasoning task is shown in Table VI. For VLN, “Base Agent” means pre-training navigators on the dataset composed of original instructions and augmented instructions for 40K iterations. “Finetune” means finetuning the adversarially trained agents on the same dataset as that used in the pretraining stage. For VDN, “Base Agent” means using the same training strategy like  to pre-train the navigators on the original dataset for 5K iterations. “Finetune” means finetuning the adversarially trained agents on the original dataset. “DR-Attacker” represents the navigation results when receiving perturbed instructions. “Last A”, “Last QA” and “All” represent three different dialog history settings, i.e., the instruction is the last answer, the last question-answer pair or the whole dialog history [43]. Table IV and Table V show that our proposed four-stage training strategy effectively enhances the robustness and the navigation performance of the agent on both VLN and NDH tasks. Specifically, when adversarial perturbations are introduced on the instructions, the navigation performance of the agent drops significantly, demonstrating the effectiveness of the proposed adversarial attacking mechanism. Then, after adversarial training with the proposed auxiliary self-supervised reasoning task followed by finetuning on the original dataset, the robustness and the navigation performance are effectively improved.
Moreover, from Table VI we can observe that by introducing our proposed self-supervised auxiliary reasoning task in the adversarial training stage, the navigation performance can be effectively enhanced, demonstrating that improving the cross-modality understanding ability of the agent is crucial for successful navigation.
D.2.3 Different Types of Attacking Mechanisms
In this subsection, we compare different types of attacking mechanisms to validate the effectiveness of the proposed DR-Attacker in attacking and in promoting the navigation performance through adversarial training. Specifically, four adversarial attacking methods or variants are chosen for the comparison: 1) “Static” means that the perturbation at each timestep is invariant, i.e., at each timestep, the same target word is substituted with the same candidate word. For selecting the target word and candidate word, we use the pre-trained DR-Attacker to conduct the word prediction at the first navigation timestep. 2) “Random” represents randomly selecting the target word and the candidate substitution word at each timestep. 3) “Heuristics” means that the instruction word receiving the highest textual attention weight from the navigator at each timestep is destroyed. 4) PWWS [39] is an adversarial attack method in NLP which is similar to our proposed adversarial attack in some implementation procedures. It also obtains an attack score by calculating word importance and substitution impact according to the change of classification probability. Since there is no direct classification-based objective for the instruction in either the VLN or NDH task, we choose the action prediction probability as an alternative. Specifically, at each timestep, the attacked word which causes the maximum change of the original action prediction probability is destroyed. Therefore, “Random”, “Heuristics” and PWWS are all dynamic adversarial attacks. The comparison results of attacking effects on VLN and NDH tasks are given in Figure 5 and Table VII, respectively. The adversarial training results using different attacking mechanisms on NDH are also given in Table VII.
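To make the differences between these baselines concrete, the sketch below gives one possible selection rule for each of them. It only covers how a target word (and, where the text specifies it, a substitution) is chosen. The function names, the toy action-probability model and the caching wrapper used for the static variant are hypothetical and do not reproduce the code used in our experiments.

```python
import random


def random_attack(num_targets, num_candidates, rng):
    """'Random': sample a target index and a candidate index uniformly."""
    return rng.randrange(num_targets), rng.randrange(num_candidates)


def heuristic_attack(attention_weights):
    """'Heuristics': index of the word with the highest textual attention."""
    return max(range(len(attention_weights)), key=attention_weights.__getitem__)


def pwws_style_attack(tokens, targets, candidates, action_prob):
    """PWWS-style variant: pick the substitution that changes the action
    prediction probability the most. `action_prob` is a hypothetical callable
    mapping a token list to the probability of the originally predicted action."""
    base = action_prob(tokens)
    best, best_change = None, -1.0
    for pos in targets:
        for word in candidates[pos]:
            perturbed = list(tokens)
            perturbed[pos] = word
            change = abs(base - action_prob(perturbed))
            if change > best_change:
                best, best_change = (pos, word), change
    return best


def make_static_attack(select):
    """'Static': run the selector once (first timestep) and reuse its choice."""
    cache = {}

    def attack(*args, **kwargs):
        if "choice" not in cache:
            cache["choice"] = select(*args, **kwargs)
        return cache["choice"]

    return attack


def toy_action_prob(tokens):
    """Toy stand-in for the navigator: 'door' matters more than 'sofa'."""
    return 0.2 + 0.5 * ("door" in tokens) + 0.2 * ("sofa" in tokens)


tokens = ["pass", "sofa", "enter", "door"]
targets = [1, 3]
candidates = {1: ["door"], 3: ["sofa"]}
print(pwws_style_attack(tokens, targets, candidates, toy_action_prob))

static = make_static_attack(heuristic_attack)
print(static([0.1, 0.7, 0.2]), static([0.9, 0.05, 0.05]))
print(random_attack(2, 1, random.Random(0)))
```

The static wrapper returns the same choice at the second call even though the attention weights have changed, whereas the other three selectors are re-evaluated at every timestep.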
From Figure 5 and Table VII we can find that compared with either static or dynamic attacking mechanisms, our proposed DR-Attacker can achieve the best attack results in most metrics on both VLN and NDH tasks, demonstrating the importance of dynamically attacking key information in the navigation task and the effectiveness of our proposed RL-based optimization method for the proposed adversarial attack. Moreover, from the adversarial training results in Table VII we can find the superiority of DR-Attacker in promoting the navigation performance compared with other attacking methods, demonstrating that jointly optimizing the navigator and the attacker is more beneficial for the improvement of the navigation performance. Both the attacking and adversarial training results on VLN and NDH tasks show the effectiveness of the proposed adversarial attacking mechanism and adversarial training paradigm.
D.3 Qualitative Results
In this subsection, we show visualization examples of perturbed instructions, panoramic views and language attention weights during trajectories on VLN and NDH tasks. The results are given in Figure 6 and Figure 7, respectively. Figure 6 and Figure 7 show that the proposed DR-Attacker can successfully locate the word which appears in the scene at different timesteps and substitute it with a word that does not exist in the current scene. Moreover, the navigator can make correct predictions of the actual attacked words by DR-Attacker, showing its good understanding of the multi-modality observations. The first subfigure in Figure 6 (a), the fourth subfigure in Figure 6 (b) and the second subfigure in Figure 7 (a) show the failure cases. From the failure cases, we can find that when multiple objects referred to in the instruction exist simultaneously in the current scene, e.g., both the “bedroom” and “door” exist in the fourth subfigure in Figure 6 (b), the navigator or the DR-Attacker may be confused. From the language attention weights of the navigators trained with perturbed instructions (“Ours”), we can find that although the target word is attacked, the navigator can attend to the context near the attacked word to capture the language intention. Moreover, as the navigation trajectory progresses, it can successfully capture important instruction information in different phases. In contrast, the navigator trained without perturbed instructions (“Baseline”) generates confused language attention weights under the introduced perturbations during navigation. These visualization analyses show that emphasizing useful instruction information can contribute to successful navigation. Moreover, our proposed adversarial attacking and adversarial training mechanisms can effectively improve the robustness of the navigation agent.
In this work, we propose the Dynamic Reinforced Instruction Attacker (DR-Attacker) for natural language grounded visual navigation tasks. By formulating the perturbation generation using the RL framework, DR-Attacker can be optimized iteratively to capture the crucial parts in instructions and generate meaningful adversarial samples. Through adversarial training using perturbed instructions, the robustness of the navigator can be effectively enhanced with an auxiliary self-supervised reasoning task. Experiments on both VLN and NDH tasks show the effectiveness of the proposed method. In the future, we plan to improve the training strategy of the proposed instruction attacker and explore the design of more effective attacks on the navigation instruction. Moreover, we would like to develop multi-modality adversarial attacks for the embodied navigation task to further verify and improve the robustness of the navigator.
This work was supported in part by National Key R&D Program of China under Grant No. 2020AAA0109700, National Natural Science Foundation of China (NSFC) under Grant No.U19A2073 and No.61976233, Guangdong Province Basic and Applied Basic Research (Regional Joint Fund-Key) Grant No.2019B1515120039, Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), Shenzhen Fundamental Research Program (Project No. RCYX20200714114642083, No. JCYJ20190807154211365), Zhejiang Lab’s Open Fund (No. 2020AA3AB14) and CSIG Young Fellow Support Fund.
[1] (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In , pp. 3674–3683.
[2] (2015) VQA: visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2425–2433.
[3] (2020) On adversarial examples for biomedical nlp tasks. arXiv preprint arXiv:2004.11157.
[4] (2020) Adversarially robust representations with smooth encoders. In ICLR 2020: Eighth International Conference on Learning Representations.
[5] (2019) TOUCHDOWN: natural language navigation and spatial reasoning in visual street environments. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12538–12547.
[6] Robust neural machine translation with doubly adversarial inputs. In ACL 2019: The 57th Annual Meeting of the Association for Computational Linguistics, pp. 4324–4333.
[7] (2017) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680.
[8] (2019) AutoAugment: learning augmentation strategies from data. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 113–123.
[9] (2020) RandAugment: practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems, Vol. 33, pp. 18613–18624.
[10] (2018) Embodied question answering. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1–10.
[11] (2017) Visual dialog. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[12] (2017) GuessWhat?! visual object discovery through multi-modal dialogue. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4466–4475.
[13] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019: Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186.
[14] (2018) On adversarial examples for character-level neural machine translation. In COLING 2018: 27th International Conference on Computational Linguistics, pp. 653–663.
[15] (2018) HotFlip: white-box adversarial examples for text classification. In ACL 2018: 56th Annual Meeting of the Association for Computational Linguistics, Vol. 2, pp. 31–36.
[16] (2019) Text processing like humans do: visually attacking and shielding nlp systems. In NAACL-HLT 2019: Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 1634–1647.
[17] (2019) Learning to confuse: generating training time adversarial data with auto-encoder. In NeurIPS 2019: Thirty-third Conference on Neural Information Processing Systems, pp. 11994–12004.
[18] (2018) Speaker-follower models for vision-and-language navigation. In NIPS 2018: The 32nd Annual Conference on Neural Information Processing Systems, pp. 3314–3325.
[19] (2020) Counterfactual vision-and-language navigation via adversarial path sampler. In European Conference on Computer Vision, pp. 71–86.
[20] (2020) Towards learning a generic agent for vision-and-language navigation via pre-training. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13137–13146.
[21] Faster autoaugment: learning augmentation strategies using backpropagation. In ECCV (25), pp. 1–16.
[22] (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[23] Population based augmentation: efficient learning of augmentation policy schedules. International Conference on Machine Learning, pp. 2731–2741.
[24] (2019) Certified robustness to adversarial word substitutions. In 2019 Conference on Empirical Methods in Natural Language Processing, pp. 4127–4140.
[25] (2020) Robust encodings: a framework for combating adversarial typos. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2752–2765.
[26] (2020) BERT-attack: adversarial attack against bert using bert. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6193–6202.
[27] (2019) Robust navigation with language pretraining and stochastic sampling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1494–1499.
[28] (2019) Fast autoaugment. In Advances in Neural Information Processing Systems, Vol. 32, pp. 6665–6675.
[29] (2020) Spatiotemporal attacks for embodied agents. In ECCV (17), pp. 122–138.
[30] Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994.
[31] Self-monitoring navigation agent via auxiliary progress estimation. In ICLR 2019: 7th International Conference on Learning Representations.
[32] (2019) The regretful agent: heuristic-aided navigation through progress estimation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6732–6740.
[33] (2016) Asynchronous methods for deep reinforcement learning. In ICML’16 Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, pp. 1928–1937.
[34] (2019) Generalizable data-free objective for crafting universal adversarial perturbations. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (10), pp. 2452–2465.
[35] (2019) Help, anna! vision-based navigation with natural multimodal assistance via retrospective curiosity-encouraging imitation learning. In 2019 Conference on Empirical Methods in Natural Language Processing, pp. 684–695.
[36] (2019) Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12527–12537.
[37] (2017) Robust adversarial reinforcement learning. In ICML’17 Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2817–2826.
[38] (2020) REVERIE: remote embodied visual referring expression in real indoor environments. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9982–9991.
[39] (2019) Generating natural language adversarial examples through probability weighted word saliency. In ACL 2019: The 57th Annual Meeting of the Association for Computational Linguistics, pp. 1085–1097.
[40] (2019) Provably robust deep learning via adversarially trained smoothed classifiers. In NeurIPS 2019: Thirty-third Conference on Neural Information Processing Systems, pp. 11292–11303.
[41] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR 2015: International Conference on Learning Representations 2015.
[42] (2019) Learning to navigate unseen environments: back translation with environmental dropout. In NAACL-HLT 2019: Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 2610–2621.
[43] (2019) Vision-and-dialog navigation. Conference on Robot Learning (CoRL), pp. 394–406.
[44] (2021) InfoBERT: improving robustness of language models from an information theoretic perspective. In ICLR 2021: The Ninth International Conference on Learning Representations.
[45] (2019) Improving neural language modeling via adversarial training. In ICML 2019: Thirty-sixth International Conference on Machine Learning, pp. 6555–6565.
[46] (2020) Environment-agnostic multitask learning for natural language grounded navigation. In ECCV (24), pp. 413–430.
[47] (2019) Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6629–6638.
[48] (2018) Building generalizable agents with a realistic and rich 3d environment. In ICLR 2018: International Conference on Learning Representations 2018.
[49] (2020) On the robustness of language encoders against grammatical errors. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3386–3403.
[50] (2016) Image captioning with semantic attention. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4651–4659.
[51] Word-level textual adversarial attacking as combinatorial optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6066–6080.
[52] (2019) Defense against adversarial attacks using feature scattering-based adversarial training. In NeurIPS 2019: Thirty-third Conference on Neural Information Processing Systems, pp. 1831–1841.
[53] (2019) FreeLB: enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.
[54] (2020) Vision-dialog navigation by exploring cross-modal memory. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10730–10739.