Perceive, Transform, and Act: Multi-Modal Attention Networks for Vision-and-Language Navigation

by Federico Landi, et al.

Vision-and-Language Navigation (VLN) is a challenging task in which an agent needs to follow a language-specified path to reach a target destination. In this paper, we strive for the creation of an agent able to tackle three key issues: multi-modality, long-term dependencies, and adaptability towards different locomotive settings. To that end, we devise "Perceive, Transform, and Act" (PTA): a fully-attentive VLN architecture that leaves the recurrent approach behind and the first Transformer-like architecture incorporating three different modalities - natural language, images, and discrete actions for the agent control. In particular, we adopt an early fusion strategy to merge lingual and visual information efficiently in our encoder. We then propose to refine the decoding phase with a late fusion extension between the agent's history of actions and the perception modalities. We experimentally validate our model on two datasets and two different action settings. PTA surpasses previous state-of-the-art architectures for low-level VLN on R2R and achieves the first place for both setups in the recently proposed R4R benchmark. Our code is publicly available at




1 Introduction

Moving from place to place is supposed to be “physical” whereas perceiving is supposed to be “mental”, but this dichotomy is misleading. Locomotion is guided by visual perception. Not only does it depend on perception but perception depends on locomotion.

The Ecological Approach to Visual Perception
James J. Gibson

Effective instruction-following and contextual decision-making can open the door to a new world for researchers in embodied AI. Deep neural networks have the potential to build complex reasoning rules that enable the creation of intelligent agents, and research on this subject could also help to empower the next generation of collaborative robots. In this scenario, Vision-and-Language Navigation (VLN) [3] plays a significant part in current research. This task requires an agent to follow natural language instructions through unknown environments, discovering the correspondences between lingual and visual perception step by step. Additionally, the agent needs to progressively adjust navigation in light of the history of past actions and explored areas. Even a small error while planning the next move can lead to failure because perception and actions are unavoidably entangled; indeed, we must perceive in order to move, but we must also move in order to perceive [19]. For this reason, the agent can succeed in this task only by efficiently combining the three modalities: language, vision, and actions.

Figure 1: Traditional approaches to VLN build upon recurrent neural networks to model long-term dependencies among the three modalities involved: text, images, and actions. Instead, our PTA architecture is entirely based on attention and adapts to multi-modal tasks by design.

In this paper, we propose to exploit fully-attentive networks to merge the knowledge coming from different domains. Encouraged by recent work on fully-attentive networks [39], we devise Perceive, Transform, and Act (PTA), a novel architecture for VLN in which the different modalities are free to be conditioned on the full history of previous actions. Figure 1 depicts the main elements of novelty in our architecture. While previous approaches rely on a recurrent policy to track the agent’s internal status through time [3, 41, 28], we directly infer the state from the observations via attention and avoid the critical back-propagation through time. For this reason, our agent can model the dependencies tied to navigation more efficiently and generalize to longer episodes better than other models.

Another challenge is the agent's adaptability to real-world applications. Recent literature identifies two main operating settings for VLN, called low-level action space and high-level action space [25]. Low-level methods make predictions over an output space of known dimension, which corresponds to the agent's locomotor system: rotate, tilt up/down, and step forward are examples of low-level actions. The concept of a high-level, panoramic action space was first proposed by Fried et al. [16]: differently from the low-level output space, it aims to predict the path to the goal without explicitly decoding the sequence of atomic actions. In this setting, the agent can move inside the environment using a teleporting system. We believe that this aspect limits adaptability to real-world applications, and for this reason we design our model for low-level use. We also provide technical details describing how PTA can be adapted to work in both setups, and experimentally validate its flexibility.

We summarize our main contributions as follows:

  • We propose a novel multi-modal framework for VLN that replaces back-propagation through time with attention mechanisms, using them to tackle both long-term dependencies and multi-modality;

  • To the best of our knowledge, our model is the first Transformer-like architecture to merge three different modalities;

  • Experimental results show that PTA achieves state-of-the-art performance on low-level VLN. This setting is close to real-world applications and requires decoding fine-grained atomic actions;

  • We provide technical details describing how it is possible to switch from a low-level locomotor system to a high-level output space. Experimental results on this subject are the first to analyze the mutual relationships between low-level and high-level VLN.

2 Related Work

Vision-and-Language Navigation. A wide area of research is devoted to bridging natural language processing and image understanding. Image captioning [40, 45, 2], visual question answering [5, 20], and visual dialog [12, 13] are examples of active research areas in this field. At the same time, visual navigation [44, 21, 33] and goal-oriented instruction following [31, 17, 9] represent an important part of current work on embodied AI [11, 10, 46, 32].

In this context, Vision-and-Language Navigation (VLN) [3] constitutes a peculiar challenge, as it enriches traditional navigation with a set of visually rich environments and detailed instructions. Additionally, all the scenes are photo-realistic and unknown to the agent beforehand. Previous work on this topic includes model-free and model-based reinforcement learning [42], dynamic convolution [25], visual and textual co-grounding with progress inference [27], and backtracking with learned heuristics [28]. Other methods implement pragmatic inference via a speaker module which strengthens consistency between the chosen path and the corresponding instruction [16, 41]. Finally, Wang et al. [41] propose a reinforced cross-modal matching critic, together with a new self-supervised imitation learning setting to improve the generalization in unseen environments.

Recently proposed benchmarks and new evaluation metrics [23] show that traditional approaches hardly adapt to longer trajectories. Indeed, the recurrent nature of previous methods exacerbates the difficulty of learning long-term dependencies [6] both in the instruction and in the navigation.

Attention Networks. The understanding and generation of language and sequences have traditionally been addressed either with recurrent [37] or convolutional [18, 4] architectures. Fully attentive models, in which recurrent relations are replaced with self and cross-attention, have recently become the dominant approach in language understanding tasks, with architectures like the Transformer [39] and BERT [15]. As a consequence, there is a growing interest in the use of fully-attentive models in visual and multi-modal tasks, like video understanding [36], cross-modal retrieval [34] and image captioning [26]. Our proposal is the first to employ a fully-attentive architecture for VLN and integrates vision, language, and action using cross-attention operations.

3 Multi-Modal Attention Networks

Our goal is to navigate unseen indoor environments with the sole help of natural language instructions and egocentric visual observations. To merge multi-modal knowledge coming from the environment, we devise a two-stage encoder which exploits both temporal and spatial attention. At each time step, the agent selects a move to progress towards the goal. To that end, we fuse contextual information with the history of actions via attention and build a multi-modal decoder which merges the three modalities: actions, images, and text. We then decode a probability distribution over a low-level output space in which possible actions are atomic moves like turn or step ahead. After a first phase in which we train the agent with classical imitation learning, we implement an extrinsic reward function to promote coherence between ground-truth and predicted trajectories. Our architecture is depicted in Fig. 2 and detailed next.

3.1 Architecture

In this section, we provide architectural details of our fully-attentive model for VLN. Cross-attention allows for a more efficient cross-modal matching of contextual information, while self-attention can model temporal relationships and removes the need for a recurrent policy. We are the first, to the best of our knowledge, to build a VLN architecture without recurrence. Each component of our model is end-to-end trainable.

3.1.1 Two-stage Encoder

At the beginning of each navigation episode, the agent receives a natural language instruction $X$ of variable length $n$. The agent also perceives a panoramic image of the surroundings $I_t$ at each timestep $t$. Our encoder consists of a single branch for each modality, text and images, and then employs attention to create a fused representation that specifically models the relevance of the source instruction to the visual observation.

Instruction Encoding. To encode the textual instruction, we employ an attention mechanism with multiple heads, followed by a feed-forward network. As a first step, we filter stop words and apply GloVe embeddings [30] to obtain a meaningful representation for each word. We then transform the GloVe embedding $X$ of the natural language instruction using learnable parameters and layer normalization, and add sinusoidal positional encodings [39] (Eq. 1); we denote the result by $\hat{X}$. At this point we use multi-head attention to create a representation that models temporal dependencies inside the instruction. Multi-head attention is defined as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \tag{2}$$

where $W^O$, $W_i^Q$, $W_i^K$, and $W_i^V$ are learned weight matrices and:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \tag{3}$$

with $d_k$ the dimensionality of the keys. The attention mechanism described by Eq. 3 computes a weighted sum of the Values ($V$) based on the similarity between the Keys and the Queries ($K$ and $Q$). When the same source sequence ($\hat{X}$ in this case) is employed to model the triplet $(Q, K, V)$ of Eq. 2, the attention operation takes the name of self-attention.
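As an illustration of the attention operation just described, the following is a minimal single-head sketch in plain Python. It assumes the scaled dot-product formulation of the Transformer [39] and omits the learned projection matrices and the multi-head concatenation; the helper names `softmax` and `attention` are chosen for this example only and do not come from the released code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (single head)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity between this query and every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the values.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Self-attention: the same sequence acts as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = attention(tokens, tokens, tokens)
```

Passing a different sequence as `keys` and `values` turns the same function into cross-attention, which is how one modality can be conditioned on another.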

Following the attention layer, we place a feed-forward multilayer perceptron (Eq. 4). At the end of this step, we obtain an attended representation $\tilde{X}$ for the current instruction that we use both during image encoding and in our multi-modal decoder.

Figure 2: Overview of our approach. Our attention-based architecture for VLN builds upon three main blocks: an instruction encoder, an image encoder, and a multi-modal decoder. For the sake of clarity, we omit residual connections and layer normalization after each block.

Image Encoding. As a first step, we discretize the panoramic image of the surroundings $I_t$ into 36 square locations and extract the corresponding visual features with a ResNet-152 [22] trained on ImageNet [14]. Each location corresponds to a viewpoint in the equirectangular image representing the agent's surroundings, hence the image representation takes the form of a grid. We then project visual features with a transformation similar to Eq. 1, but instead of using sinusoidal positional encodings, we append a coordinate vector (Eq. 5) built from the heading and elevation angles of each viewpoint in the grid, relative to the agent position at timestep $t$. We then apply multi-head self-attention according to Eq. 2 to help model concepts such as relative positions between objects.

After this step, we aim to create an image representation that depends on the textual concepts expressed by the attended instruction $\tilde{X}$. We use cross-attention to achieve this goal, and employ $\tilde{X}$ as keys and values for multi-head attention (Eq. 2). Finally, a feed-forward network as in Eq. 4 is applied to obtain the attended visual observation $\tilde{I}_t$.

3.1.2 Multi-modal Decoder

Our decoder predicts the next action to perform in a low-level action space, meaning that our agent can choose among the following actions: turn right/left, tilt up/down, step forward, and end episode, the last one signaling that it has reached the goal.

Contextual History for Action Decoding. The first part of our decoder takes into account the history of past actions. While previous methods employ a recurrent neural network to keep track of previous steps, we explicitly model the history $H_t = \{a_0, \dots, a_{t-1}\}$ as the set of actions performed before the current timestep $t$. Note that $a_0$ coincides with the <start> token. We add sinusoidal positional encoding to provide temporal information and apply multi-head self-attention to obtain an attended history representation $\tilde{H}_t$.

Late Fusion of Perception and Action. At this point, $\tilde{H}_t$ contains the relevant information regarding the action history of the navigation episode. However, this information must be enriched with the perception coming from the environment. We merge textual and visual information with $\tilde{H}_t$ via attention, allowing mutual influence between perception and motion. We build two branches of multi-head cross-attention accepting respectively the attended instruction $\tilde{X}$ and the attended visual observation $\tilde{I}_t$ as key/value pairs and using $\tilde{H}_t$ as query. After this step, we concatenate the two representations and apply a FFN to obtain the output sequence $O_t$, whose last element $o_t$ corresponds to the current timestep $t$.

Action Selection. To select the next low-level action, we project the final representation $o_t$ (the last element of the decoder output) into a six-dimensional space corresponding to the agent's locomotor space, which contains the following actions: turn right/left, tilt up/down, step forward, and end episode. The output probability distribution over the action space can therefore be written as:

$$p_t = \mathrm{softmax}(W_p\, o_t + b_p) \tag{6}$$

where $W_p$ and $b_p$ are learned parameters. During training, we sample the next action to perform from $p_t$, while we select the most probable action, $\arg\max p_t$, during evaluation and test.
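The selection rule above can be sketched as follows. This is a minimal illustration, not the released implementation: the action names, the `select_action` helper, and the example logits are hypothetical, and the sketch assumes that selection at evaluation time means taking the most probable action.

```python
import math
import random

# Hypothetical labels for the six low-level actions listed in the text.
ACTIONS = ["turn_right", "turn_left", "tilt_up", "tilt_down", "step_forward", "end_episode"]


def action_distribution(logits):
    """Softmax over the six action logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, training, rng=random):
    """Sample from the distribution when training, take the argmax otherwise."""
    probs = action_distribution(logits)
    if training:
        index = rng.choices(range(len(ACTIONS)), weights=probs, k=1)[0]
    else:
        index = max(range(len(ACTIONS)), key=lambda i: probs[i])
    return ACTIONS[index], probs


example_logits = [0.1, 0.2, 0.0, -1.0, 2.5, 0.3]
greedy_action, greedy_probs = select_action(example_logits, training=False)
sampled_action, _ = select_action(example_logits, training=True, rng=random.Random(0))
```

Sampling during training lets the agent visit states that the greedy policy would not reach, whereas the greedy choice makes evaluation deterministic.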

3.2 Training

Our training setup includes two distinct objective functions. The first estimates the policy by imitation learning, while the second enforces similarity between the ground-truth and predicted trajectories via reinforcement learning.

Imitation Learning. To approximate a good policy, we first train our agent using strong supervision. At each timestep $t$, the simulator outputs the ground-truth action $a_t^*$. In the low-level setup, the ground-truth action is the one that allows getting to the next target viewpoint in the minimum number of steps. We aim to minimize the following objective function:

$$L_{il} = -\sum_t y_t^\top \log p_t \tag{7}$$

where $p_t$ is the predicted distribution over actions, $y_t$ is the one-hot encoding for $a_t^*$, and the sum is taken over the timesteps of a navigation episode.
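A minimal sketch of this imitation objective, assuming it is the usual cross-entropy between the predicted action distribution and the one-hot ground-truth action, summed over the timesteps of an episode. The function name `imitation_loss` and the toy two-action episode are illustrative only.

```python
import math


def imitation_loss(step_probs, expert_actions):
    """Cross-entropy between predicted distributions and one-hot expert actions."""
    loss = 0.0
    for probs, expert in zip(step_probs, expert_actions):
        # With a one-hot target, only the expert action's log-probability survives.
        loss -= math.log(probs[expert])
    return loss


# Two timesteps with two possible actions each; the expert picks action 0, then 1.
episode_probs = [[0.5, 0.5], [0.25, 0.75]]
episode_loss = imitation_loss(episode_probs, [0, 1])
```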

Extrinsic Reward. After a first training phase with supervised learning, we finetune our agent using an extrinsic reward function. Recently, Magalhaes et al. [29] proposed to employ Dynamic Time Warping (DTW) [7] to evaluate the trajectories performed by navigation agents. In particular, they define the normalized Dynamic Time Warping (nDTW) as:

$$\mathrm{nDTW}(R, Q) = \exp\left(-\frac{\mathrm{DTW}(R, Q)}{|R| \cdot d_{th}}\right) \tag{8}$$

where $R$ and $Q$ are respectively the reference and the query paths, $|R|$ is the length of the reference path, and $d_{th}$ is the success threshold distance. At each navigation step $t$, the agent receives a reward equal to the gain in terms of nDTW:

$$r_t = \mathrm{nDTW}(R, q_{0:t}) - \mathrm{nDTW}(R, q_{0:t-1}) \tag{9}$$

where $q_{0:t}$ denotes the query path followed by the agent up to step $t$.
Additionally, we give an episode-level reward to the agent if it terminates the navigation within the success threshold distance from the goal; this reward depends on the final distance between the agent and the target. Our final reinforcement learning objective (Eq. 10) builds on these two rewards, and we derive its gradient following the REINFORCE algorithm [43] (Eq. 11).
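The step-wise nDTW reward described in this section can be sketched as below. This is an illustrative approximation under stated assumptions: positions are plain coordinate tuples compared with Euclidean distance (a simulator would more likely use distances on the navigation graph), the threshold defaults to the 3 meters used for success in the evaluation, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(reference, query, dist=euclidean):
    """Classic dynamic-programming DTW cost between two paths."""
    n, m = len(reference), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(reference[i - 1], query[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def ndtw(reference, query, threshold=3.0):
    """Normalized DTW: exp(-DTW / (|R| * d_th))."""
    return math.exp(-dtw(reference, query) / (len(reference) * threshold))


def step_rewards(reference, query, threshold=3.0):
    """Reward at each step: gain in nDTW with respect to the previous step."""
    rewards = []
    previous = ndtw(reference, query[:1], threshold)
    for t in range(1, len(query)):
        current = ndtw(reference, query[: t + 1], threshold)
        rewards.append(current - previous)
        previous = current
    return rewards


reference_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
rewards = step_rewards(reference_path, reference_path)
```

Because each reward is a difference of consecutive nDTW values, the rewards of an episode telescope to the nDTW of the full path minus that of its starting point.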


4 Low-level and High-Level Navigation

The previous section describes our approach to low-level VLN. Here, we discuss the main technical differences with the high-level counterpart and explain how PTA can switch from one setting to the other. Differently from low-level architectures, a high-level method aims to predict the next node to traverse in the navigation graph, as physical navigation takes place with a teleport mechanism (Fig. 3). The choice at time step $t$ is made with a similarity measure between the agent's internal state $s_t$ and the appearance vectors $v_{t,k}$ of the navigable locations. This similarity function is normally expressed as a bilinear dot-product:

$$\mathrm{sim}(s_t, v_{t,k}) = f(s_t)^\top g(v_{t,k}) \tag{12}$$

where $f$ and $g$ are generic transformations.

Figure 3: Comparison between low-level and high-level action spaces. In the latter, displacements are made by teleporting the agent and without adjusting its heading and elevation before stepping ahead.

In principle, it is possible to substitute the final softmax classifier of a low-level architecture (Eq. 6) with Eq. 12 and change the corresponding action space. Following this observation, we can swap the action space of our model to test its adaptability to different navigation settings. While traditional approaches start from the hidden state of the recurrent policy to estimate the agent's internal state $s_t$, we derive it directly from the final decoder representation $o_t$ through a learned transformation (Eq. 13). As the appearance vector of each navigable location, we select the unattended visual features augmented with the coordinate vector described by Eq. 5, and apply a second learned transformation (Eq. 14).
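To make the high-level head concrete, the sketch below scores each navigable location by the dot product between a transformed agent state and a transformed appearance vector, then normalizes the scores with a softmax. The softmax normalization, the function name `score_navigable`, and the identity transformations used in the example are assumptions made for illustration; in the model the two transformations are learned.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def score_navigable(state, candidates, f, g):
    """Distribution over navigable locations from f(state) . g(candidate)."""
    query = f(state)
    scores = [dot(query, g(c)) for c in candidates]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Identity transformations stand in for the learned ones.
agent_state = [1.0, 0.0]
navigable = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
location_probs = score_navigable(agent_state, navigable, lambda s: s, lambda v: v)
next_node = max(range(len(navigable)), key=lambda k: location_probs[k])
```

Unlike the six-way low-level classifier, the size of this output varies with the number of navigable locations at each step.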

In our architecture, the final decoder representation can encode any kind of information about the current navigation. This is because it can draw knowledge from the perceptual modalities and the history of past actions directly, without the bottleneck represented by a recurrent network. Our experiments on this subject (Sec. 5.3) show that our architecture stands out from the literature in terms of adaptability.

5 Experiments and Discussion

                                        Validation-Seen                            Validation-Unseen
#  Method                               NE    SR    OSR   SPL   CLS   nDTW  SDTW   NE    SR    OSR   SPL   CLS   nDTW  SDTW
1  Seq2seq [3]                          6.01  0.39  0.53  -     -     -     -      7.81  0.22  0.28  -     -     -     -
2  PTA (pure IL, no extrinsic reward)   4.14  0.58  0.70  0.50  0.63  0.48  0.39   6.44  0.39  0.49  0.32  0.48  0.32  0.24
3  multi-modal decoder (only visual)    3.90  0.61  0.72  0.54  0.65  0.52  0.44   6.56  0.36  0.46  0.29  0.47  0.32  0.22
4  multi-modal decoder (only textual)   9.64  0.03  0.04  0.03  0.28  0.19  0.02   9.13  0.04  0.04  0.04  0.28  0.21  0.02
5  early fusion (cross attention)       6.41  0.34  0.44  0.30  0.54  0.28  0.18   7.70  0.23  0.29  0.20  0.43  0.20  0.12
6  data augmentation                    3.47  0.66  0.76  0.58  0.67  0.54  0.47   5.91  0.40  0.48  0.34  0.50  0.36  0.25
7  extrinsic reward                     3.58  0.65  0.74  0.59  0.69  0.60  0.50   6.00  0.40  0.47  0.36  0.52  0.41  0.28
Table 1: Ablation study proving the effectiveness of our main modules. We also show that our model can be initialized using synthetic data augmentation and then finetuned with a limited set of refined data. Adding an extrinsic reward function further improves the performance of the final model.

5.1 Experimental Setup

Datasets. In our experiments, we primarily test our architecture on the R2R dataset for VLN [3]. This dataset builds on the Matterport3D dataset of spaces [8], which contains complete scans of different buildings. The visual data is enriched with navigation paths and natural language instructions. The episodes are divided into a training set, two validation splits (validation-seen, with environments that the agent has already seen during training, and validation-unseen, containing only unexplored buildings), and a test set. The testing phase takes place in previously unseen environments and is accessible via a test server with a public leaderboard. While the instructions in R2R are quite long and complex, navigation episodes usually involve a limited number of steps in both the high-level and the low-level action spaces. In the R4R dataset [23], Jain et al. merge the paths in R2R to create a more complex and challenging setup. Episodes become considerably longer, pushing traditional approaches to their limits and testing their generalizability to arbitrarily long instructions and more complex trajectories.

Evaluation Metrics. In line with previous literature, we mainly focus on four metrics. NE (Navigation Error) measures the mean distance between the goal and the stop point. SR (Success Rate) is the fraction of episodes concluded within a threshold distance from the target, set to 3 meters in all of the previous papers on the subject. OSR (Oracle SR) represents the SR that the agent would achieve if it received an oracle stop signal when passing within the threshold distance from the goal, while SPL (SR weighted by inverse Path Length) penalizes navigation episodes that deviate from the shortest path to the goal. SPL is regarded as the most reliable metric on the R2R dataset [1], as it strongly penalizes exhaustive exploration and search methods like beam search. Recently, Jain et al. [23] proposed to use Coverage weighted by Length Score (CLS) to replace SR for generic navigation trajectories, as this metric is also sensitive to intermediate nodes in the reference path. Additionally, Magalhaes et al. [29] proposed Dynamic Time Warping (DTW) and derived metrics (Normalized DTW and Success weighted by normalized DTW) to measure the similarity between reference and predicted paths. These last three metrics are more meaningful on the R4R dataset than SR and SPL [23].
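As a worked illustration of the four main metrics, the sketch below computes NE, SR, OSR, and SPL from per-episode statistics, using the SPL definition of [1] (success multiplied by the ratio between the shortest path length and the larger of the two path lengths). The dictionary keys, the function name, the strict inequality against the 3-meter threshold, and the two example episodes are assumptions made for this example, not values from the paper.

```python
def navigation_metrics(episodes, threshold=3.0):
    """Compute NE, SR, OSR and SPL over a list of episode dictionaries.

    Each episode provides: final_dist (distance to goal at stop), min_dist
    (closest distance to goal along the path), path_length (length travelled)
    and shortest_length (shortest-path length to the goal, assumed > 0).
    """
    n = len(episodes)
    ne = sum(e["final_dist"] for e in episodes) / n
    sr = sum(e["final_dist"] < threshold for e in episodes) / n
    osr = sum(e["min_dist"] < threshold for e in episodes) / n
    spl = sum(
        (e["final_dist"] < threshold)
        * e["shortest_length"]
        / max(e["path_length"], e["shortest_length"])
        for e in episodes
    ) / n
    return {"NE": ne, "SR": sr, "OSR": osr, "SPL": spl}


example_episodes = [
    # Stops 1 m from the goal after following the shortest path.
    {"final_dist": 1.0, "min_dist": 1.0, "path_length": 10.0, "shortest_length": 10.0},
    # Passes within 2 m of the goal but stops 5 m away after a long detour.
    {"final_dist": 5.0, "min_dist": 2.0, "path_length": 20.0, "shortest_length": 10.0},
]
metrics = navigation_metrics(example_episodes)
```

The second episode counts as a success only for OSR, which shows why OSR upper-bounds SR and why a long detour cannot raise SPL.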

Implementation Details. In each component of our model, we project the input features into a space of fixed dimensionality, and we employ multiple heads for attention. The feed-forward networks use an internal representation of fixed size. After each sub-module, we add a residual connection followed by layer normalization. We also apply dropout [35] after each linear layer. During training, we use the Adam optimizer [24] with a fixed batch size and reduce the learning rate by a constant factor if the SPL on the validation-unseen split does not improve for a given number of consecutive epochs. We stop the training after a given number of epochs without improvement on the same metric. When finetuning using REINFORCE, we set a separate initial learning rate.

5.2 Ablation Study

In our ablation study, we experimentally validate the importance of each module in our architecture. First, we ablate multi-modality in our decoder and we do not apply late fusion before decoding the next action. In a second experiment, we remove cross-attention between visual and lingual information in the encoder. If the early fusion mechanism plays an important role, we expect that the multi-modal decoder will not be able to compensate for the loss of this component. Finally, we show the impact of synthetic data augmentation [16] and the role of finetuning with REINFORCE. Results are shown in Table 1 and discussed below.

Multi-modal Decoder. In our first ablation study, we use only one of the two decoder branches at a time, and we do not perform late fusion between lingual and visually-grounded information. When removing the textual branch (Table 1, line 3), our agent performs worse on unseen environments, hence losing potential in terms of generalization. When removing the visual modality, our PTA agent is blinded and can only count on the natural language instruction. This setup leads to success only when the instruction does not involve references to objects or visual properties of the environment, which is a nearly empty subset of the dataset. Indeed, the metrics for our blind agent are extremely low, and they do not vary between seen and unseen environments (Table 1, line 4). This result is meaningful in light of recent studies proving that some single-modality agents perform better than their multi-modal version by removing the visual perception and overfitting on dataset biases [38].

Early fusion of textual and visual perception. As a second experiment, we remove the early fusion mechanism, namely the cross-attention layer, to check its contribution. If this fusion layer is redundant, we expect that the late fusion stage will compensate for the loss. Instead, we observe a drop in performance: SPL decreases by 0.20 and 0.12 in seen and unseen environments, respectively (Table 1, line 5 compared with line 2). This confirms the importance of early textual and visual fusion in our architecture for VLN.

Data augmentation. In line with previous literature, we find the use of additional synthetic instructions useful to initialize our agent. The synthetic training set was provided by Fried et al. [16] using a Speaker module. After a first training with the full set of instructions (synthetic and human-generated), we finetune using only the original R2R train set. Results are reported in Table 1, line 6.

Test (Unseen)
Low-level Methods         PL     NE    SR    OSR   SPL
Random                    9.89   9.77  0.13  0.18  0.12
Student-forcing [3]       8.13   7.85  0.20  0.27  0.18
RPA [42]                  9.15   7.53  0.25  0.33  0.23
Dynamic Filters [25]      9.81   6.55  0.35  0.45  0.31
PTA                       10.17  6.17  0.40  0.47  0.36
High-level Methods        PL     NE    SR    OSR   SPL
Speaker-Follower [16]     14.82  6.62  0.35  0.44  0.28
Self-Monitoring [27]      18.04  5.67  0.48  0.59  0.35
RCM [41]                  15.22  6.01  0.43  0.51  0.35
RCM + SIL (train) [41]    11.97  6.12  0.43  0.50  0.38
Regretful [28]            13.69  5.69  0.48  0.56  0.40
Table 2: Results on the R2R test server. We chose the best version of each model based on SPL, hence excluding graph search methods, such as beam search, that are infeasible in real-world applications.
                        R2R Validation-Seen                        R2R Validation-Unseen
Method                  NE    SR    OSR   SPL   CLS   nDTW  SDTW   NE    SR    OSR   SPL   CLS   nDTW  SDTW
Speaker-Follower [16]   3.36  0.66  0.74  -     -     -     -      6.62  0.36  0.45  -     -     -     -
RCM [41]                3.37  0.67  0.77  -     -     -     -      5.88  0.43  0.52  -     -     -     -
Self-Monitoring [27]    3.22  0.67  0.78  0.58  -     -     -      5.52  0.45  0.56  0.32  -     -     -
Regretful [28]          3.23  0.69  0.77  0.63  -     -     -      5.32  0.50  0.59  0.41  -     -     -
Dynamic Filters* [25]   4.80  0.51  0.61  0.44  0.60  0.61  0.43   7.02  0.26  0.37  0.22  0.44  0.44  0.21
PTA*                    3.35  0.66  0.74  0.64  0.74  0.75  0.61   5.95  0.43  0.49  0.39  0.53  0.53  0.35
Table 3: Results on the R2R validation splits for high-level methods. ‘*’ denotes a method built for low-level use and then adapted by changing the final classifier. Even though PTA was designed as a low-level architecture, it outperforms all the previous approaches in terms of SPL in already-seen environments. Metrics with ‘-’ were not reported in the original papers.

Extrinsic reward. While imitation learning allows approximating a good policy, there is still room for improvement via reinforcement learning. Wang et al. [41] were the first to use REINFORCE in the context of VLN to refine their navigation policy based on cross-modal matching. In line with them, we find REINFORCE beneficial for our model: our final agent sticks more closely to the reference trajectory and avoids overlong navigations (Table 1, line 7).

5.3 Results on R2R

In our experiments on the R2R dataset [3], we test the ability of our agent to navigate unseen environments in light of previously unseen natural language instructions. The main test-bed for this experiment is represented by the R2R evaluation leaderboard, which is publicly available online.

Comparison with SOTA. In Table 2, we report our results on the R2R test set, together with the results achieved by other state-of-the-art architectures on VLN. Since we operate in the low-level action space, we are directly comparable with the sequence-to-sequence baseline proposed by Anderson et al. [3], with the RPA model using a mixture of model-free and model-based reinforcement learning [42], and with the recurrent architecture with dynamic convolutional filters proposed in [25]. Our method outperforms the state of the art on low-level VLN by a large margin (+0.05 in both SPL and SR with respect to Dynamic Filters [25], Table 2). When comparing PTA with high-level methods, we surprisingly find that it performs better than most state-of-the-art architectures in terms of SPL. Notably, we achieve this result without making any assumption on the underlying simulating platform and while decoding a longer sequence of atomic moves instead of target viewpoints. Moreover, the two architectures achieving higher SPL can count on additional modules that are not present in our method: RCM [41] performs a self-supervised imitation learning phase on the training buildings, while the Regretful agent [28] counts on a rollback module to return to a previous state when needed. While these two modules are effective for high-level VLN, their generalizability to a low-level setup, closer to real-world application, is yet to be tested.

Switching from Low-level to High-level. Our second experiment on R2R aims to test whether it is possible to switch between the low-level and the high-level action spaces without losing in terms of performance. In principle, changing the final classifier as described in Section 4 should let the agent learn the correspondence between atomic actions and navigable viewpoints transparently. In practice, we find out that architectural choices that strongly help VLN in one setting often end up hindering the other setup. As a result, current methods display different behaviors in terms of metrics depending on the adopted action space (see Figure 4).

We believe that this response is mainly due to the fact that these architectures handle long-term dependencies and multi-modality separately. Instead, PTA integrates and tackles these two problems jointly using attention mechanisms. This peculiarity leaves more room for flexibility at the network end interface – the action space in this case.

The plots in Figure 4 show that our model exhibits far greater flexibility to the final action space than other architectures. We compare with a recurrent architecture exploiting dynamic convolution [25] from the low-level category, and with the Speaker-Follower [16] from the other setup. To conduct this experiment we adapt the code from [25], which is publicly available online, and report the results from the original paper for [16]. We choose the Speaker-Follower because it is a flexible framework by design, and we believe it is the most suitable model among its high-level peers for this comparison.

In Table 3 we detail the full set of metrics obtained using PTA with the high-level classifier described in Section 4, instead of a low-level control system. Without further refinements, our architecture achieves comparable or even better results than fully tailored models. PTA achieves state-of-the-art results on the validation-seen split in terms of SPL.

Figure 4: Visualization of the navigation error (left) and success rate (right) on the R2R val-unseen split. A larger difference between the blue and gray bars denotes poor adaptability. The metric gap is reduced when using PTA.
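R4R Validation-Seen
Method | PL | NE | SR | SPL | CLS | nDTW | SDTW
Dynamic Filters [25] | 11.9 | 5.74 | 0.51 | 0.39 | 0.50 | 0.38 | 0.24
PTA low-level | 11.9 | 5.11 | 0.57 | 0.45 | 0.52 | 0.42 | 0.29
Speaker-Follower [16] | 15.4 | 5.35 | 0.52 | 0.37 | 0.46 | - | -
RCM goal oriented [23] | 24.5 | 5.11 | 0.56 | 0.32 | 0.40 | - | -
RCM fidelity oriented [23] | 18.8 | 5.37 | 0.53 | 0.31 | 0.55 | - | -
PTA high-level | 16.5 | 4.54 | 0.58 | 0.39 | 0.60 | 0.58 | 0.41

R4R Validation-Unseen
Method | PL | NE | SR | SPL | CLS | nDTW | SDTW
Dynamic Filters [25] | 9.98 | 9.03 | 0.20 | 0.11 | 0.33 | 0.19 | 0.06
PTA low-level | 10.2 | 8.19 | 0.27 | 0.15 | 0.35 | 0.20 | 0.08
Speaker-Follower [16] | 19.9 | 8.47 | 0.24 | 0.12 | 0.30 | - | -
RCM goal oriented [23] | 32.5 | 8.45 | 0.29 | 0.10 | 0.20 | - | -
RCM fidelity oriented [23] | 28.5 | 8.08 | 0.26 | 0.08 | 0.35 | - | -
PTA high-level | 17.7 | 8.25 | 0.24 | 0.10 | 0.37 | 0.32 | 0.10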
Table 4: Results on the R4R validation splits. Our model is the new state-of-the-art on the two splits in both of its versions, low-level and high-level. Note that, since the trajectories can bend and return on the agent's previous steps, CLS and nDTW are the more indicative metrics. Metrics with ‘-’ were not reported in the original papers.
Figure 5: Navigation episode from the R2R unseen validation split. For each step, we report the agent's first-person point of view and the next predicted action (from left to right, top to bottom).

Qualitative Results. In Fig. 5, we report a qualitative result from the R2R val-unseen set. Notably, PTA is able to ground concepts such as “the second doorway on your left” and terminates the navigation episode successfully. Since our agent operates in a low-level setup, it needs to orientate towards the next viewpoint before stepping ahead, making the decoding phase more challenging.

5.4 Results on R4R

R4R [23] builds upon R2R and aims to provide an even more challenging setting for embodied navigation agents. While navigation in R2R is usually direct and takes the shortest path between the starting position and the goal viewpoint, trajectories in R4R may bend and return on the agent’s previous steps. This change calls for an adaptation of the evaluation metrics: SPL and SR are now less indicative because the agent might stop close to the goal in the first half of the navigation and still fail to complete the second part. In this sense, an important role is played by recently proposed metrics: CLS [23] and nDTW [29] take into account the agent’s steps and are sensitive to intermediate errors in the navigation path. For this reason, these last two metrics are more meaningful when evaluating navigation agents on R4R.
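The difference between a goal-based metric and a path-sensitive one can be illustrated with a small sketch of nDTW, following its definition in [29]: the dynamic time warping cost between the agent path and the reference path is normalized by the reference length and a distance threshold, then passed through a negative exponential. This is a simplified illustration under stated assumptions: it uses 2D Euclidean distances in place of distances on the navigation graph, a threshold of 3 m (the usual R2R success radius), and toy coordinates that do not come from any dataset.

```python
import math

def euclidean(a, b):
    # Assumption: straight-line distance stands in for graph distance.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def dtw(path, reference, dist=euclidean):
    """Classic dynamic time warping cost between two point sequences."""
    n, m = len(path), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(path[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def ndtw(path, reference, threshold=3.0):
    """Normalized DTW: 1.0 for a perfect match, approaching 0 as paths diverge."""
    return math.exp(-dtw(path, reference) / (len(reference) * threshold))

def success(path, reference, threshold=3.0):
    """Goal-based success: the agent stops within the threshold of the goal."""
    return euclidean(path[-1], reference[-1]) <= threshold

# Toy R4R-like reference that goes out and comes back near the start.
reference = [(0, 0), (4, 0), (8, 0), (8, 4), (4, 4), (0, 4)]
faithful = list(reference)            # follows the instruction step by step
shortcut = [(0, 0), (0, 2), (0, 4)]   # heads straight for the goal

for name, path in [("faithful", faithful), ("shortcut", shortcut)]:
    print(name, success(path, reference), round(ndtw(path, reference), 3))
```

Both toy paths count as successful, yet only the faithful one keeps a high nDTW. This is the sense in which SR alone is less informative on R4R.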

Comparison with SOTA. In this experiment, we compare PTA with other state-of-the-art architectures for VLN and report the results in Table 4. In the low-level setup, we compare to the recurrent architecture with dynamic convolution proposed by Landi et al. [25]. Results show that our approach performs better on all of the main metrics. In particular, a lower NE and a higher CLS indicate that our agent tends to get closer to the goal while following the natural language instruction more faithfully than [25]. We also report the results obtained by our model with the high-level decision space. We compare with Speaker-Follower [16] and RCM [41], as implemented in [23]. PTA performs better than its high-level competitors on the majority of the metrics. Notably, the higher CLS score shows that PTA can generally select a path that follows the instruction better than the competitors. When considering the reference metrics proposed for R4R [23], our architecture achieves the best results in both setups, making it the new state-of-the-art for VLN on this challenging dataset.
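For readers unfamiliar with CLS, the sketch below follows the definition in [23]: a path coverage term measures how much of the reference path lies near the agent path, and a length score penalizes paths that are longer or shorter than the coverage would justify. As an assumption of this illustration, distances are 2D Euclidean rather than graph distances, the threshold is 3 m, and the coordinates are toy values.

```python
import math

def euclidean(a, b):
    # Assumption: straight-line distance stands in for graph distance.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def path_length(path):
    return sum(euclidean(a, b) for a, b in zip(path, path[1:]))

def path_coverage(path, reference, threshold=3.0):
    """Average soft indicator that each reference point is near the agent path."""
    total = 0.0
    for r in reference:
        nearest = min(euclidean(r, p) for p in path)
        total += math.exp(-nearest / threshold)
    return total / len(reference)

def cls(path, reference, threshold=3.0):
    """Coverage weighted by Length Score: coverage times length agreement."""
    pc = path_coverage(path, reference, threshold)
    expected_length = pc * path_length(reference)
    denom = expected_length + abs(expected_length - path_length(path))
    length_score = expected_length / denom if denom > 0 else 0.0
    return pc * length_score

# Toy reference that goes out and comes back near the start.
reference = [(0, 0), (4, 0), (8, 0), (8, 4), (4, 4), (0, 4)]
shortcut = [(0, 0), (0, 2), (0, 4)]  # reaches the goal but skips the detour

print(round(cls(reference, reference), 3), round(cls(shortcut, reference), 3))
```

A path identical to the reference scores 1, while the shortcut is penalized both for low coverage and for a length that does not match the covered portion.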

6 Conclusion

In this paper, we have presented Perceive, Transform, and Act (PTA), the first fully-attentive model for the VLN task. Unlike other methods that model cross-modal fusion and action selection in separate building blocks, PTA achieves inter-dependency between perception and action by design. Our architectural choices yield a significant boost in performance, making PTA the new state-of-the-art in the low-level setting of VLN. On the recently proposed R4R dataset, PTA achieves state-of-the-art results in both setups. Finally, we show that our agent naturally adapts to the other action space, while previous methods suffer from low flexibility.


  • [1] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al. (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757. Cited by: §5.1.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [3] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel (2018) Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §1, §2, §5.1, §5.3, §5.3, Table 1, Table 2.
  • [4] J. Aneja, A. Deshpande, and A. G. Schwing (2018) Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [5] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: Visual Question Answering. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [6] Y. Bengio, P. Simard, P. Frasconi, et al. (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5 (2), pp. 157–166. Cited by: §2.
  • [7] D. J. Berndt and J. Clifford (1994) Using dynamic time warping to find patterns in time series. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, Cited by: §3.2.
  • [8] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: Learning from RGB-D Data in Indoor Environments. In Proceedings of the International Conference on 3D Vision, Cited by: §5.1.
  • [9] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi (2019) Touchdown: natural language navigation and spatial reasoning in visual street environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [10] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Embodied Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [11] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra (2018) Neural modular control for embodied question answering. In Proceedings of the Conference on Robot Learning, Cited by: §2.
  • [12] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M.F. Moura, D. Parikh, and D. Batra (2017) Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [13] A. Das, S. Kottur, J. M.F. Moura, S. Lee, and D. Batra (2017) Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.1.
  • [15] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.
  • [16] D. Fried, R. Hu, V. Cirik, A. Rohrbach, J. Andreas, L. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein, and T. Darrell (2018) Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, Cited by: §1, §2, §5.2, §5.2, §5.3, §5.4, Table 2, Table 3, Table 4.
  • [17] J. Fu, A. Korattikara, S. Levine, and S. Guadarrama (2019) From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following. In Proceedings of the International Conference on Learning Representations, Cited by: §2.
  • [18] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of the International Conference on Machine Learning, Cited by: §2.
  • [19] J. J. Gibson (2014) The Ecological Approach to Visual Perception: Classic Edition. Psychology Press. Cited by: §1.
  • [20] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [21] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1.1.
  • [23] V. Jain, G. Magalhaes, A. Ku, A. Vaswani, E. Ie, and J. Baldridge (2019) Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation. In Proceedings of Annual Meeting of the Association for Computational Linguistics, Cited by: §2, §5.1, §5.1, §5.4, §5.4, Table 4.
  • [24] D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, Cited by: §5.1.
  • [25] F. Landi, L. Baraldi, M. Corsini, and R. Cucchiara (2019) Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters. In Proceedings of the British Machine Vision Conference, Cited by: §1, §2, §5.3, §5.3, §5.4, Table 2, Table 3, Table 4.
  • [26] G. Li, L. Zhu, P. Liu, and Y. Yang (2019) Entangled Transformer for Image Captioning. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [27] C. Ma, J. Lu, Z. Wu, G. AlRegib, Z. Kira, R. Socher, and C. Xiong (2019) Self-monitoring navigation agent via auxiliary progress estimation. In Proceedings of the International Conference on Learning Representations, Cited by: §2, Table 2, Table 3.
  • [28] C. Ma, Z. Wu, G. AlRegib, C. Xiong, and Z. Kira (2019) The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §5.3, Table 2, Table 3.
  • [29] G. Magalhaes, V. Jain, A. Ku, E. Ie, and J. Baldridge (2019) Effective and general evaluation for instruction conditioned navigation using dynamic time warping. arXiv preprint arXiv:1907.05446. Cited by: §3.2, §5.1, §5.4.
  • [30] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cited by: §3.1.1.
  • [31] Y. Qi, Q. Wu, P. Anderson, M. Liu, C. Shen, and A. v. d. Hengel (2019) RERERE: Remote Embodied Referring Expressions in Real indoor Environments. arXiv preprint arXiv:1904.10151. Cited by: §2.
  • [32] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra (2019) Habitat: A Platform for Embodied AI Research. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [33] W. B. Shen, D. Xu, Y. Zhu, L. J. Guibas, L. Fei-Fei, and S. Savarese (2019) Situational Fusion of Visual Representation for Visual Navigation. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [34] Y. Song and M. Soleymani (2019) Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §5.1.
  • [36] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) VideoBERT: A Joint Model for Video and Language Representation Learning. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [37] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [38] J. Thomason, D. Gordon, and Y. Bisk (2018) Shifting the Baseline: Single Modality Performance on Visual Navigation & QA. In Annual Conference of the North American Chapter of the Association for Computational Linguistics, Cited by: §5.2.
  • [39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Cited by: §1, §2, §3.1.1.
  • [40] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [41] X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y. Wang, W. Y. Wang, and L. Zhang (2019) Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §5.2, §5.3, §5.4, Table 2, Table 3.
  • [42] X. Wang, W. Xiong, H. Wang, and W. Yang Wang (2018) Look before you leap: bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In Proceedings of the European Conference on Computer Vision, Cited by: §2, §5.3, Table 2.
  • [43] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256. Cited by: §3.2.
  • [44] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese (2018) Gibson env: real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [45] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Cited by: §2.
  • [46] J. Yang, Z. Ren, M. Xu, X. Chen, D. Crandall, D. Parikh, and D. Batra (2019) Embodied Visual Recognition. arXiv preprint arXiv:1904.04404. Cited by: §2.