Neighbor-view Enhanced Model for Vision and Language Navigation

07/15/2021
by   Dong An, et al.
The University of Adelaide

Vision and Language Navigation (VLN) requires an agent to navigate to a target location by following natural language instructions. Most existing works represent a navigation candidate by the feature of the single view in which the candidate lies. However, an instruction may mention landmarks outside of that single view as references, which can cause the textual-visual matching of existing methods to fail. In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) to adaptively incorporate visual contexts from neighbor views for better textual-visual matching. Specifically, our NvEM utilizes a subject module and a reference module to collect contexts from neighbor views. The subject module fuses neighbor views at a global level, and the reference module fuses neighbor objects at a local level. Subjects and references are adaptively determined via attention mechanisms. Our model also includes an action module to utilize the strong orientation guidance (e.g., “turn left”) in instructions. Each module predicts a navigation action separately, and their weighted sum is used to predict the final action. Extensive experimental results demonstrate the effectiveness of the proposed method on the R2R and R4R benchmarks against several state-of-the-art navigators, and NvEM even beats some pre-training ones. Our code is available at https://github.com/MarSaKi/NvEM.


1. Introduction

Vision and Language Navigation (VLN) has drawn increasing interest in recent years, partly because it represents a significant step towards enabling intelligent agents to interact with the real world. Running in a 3D simulator (Anderson et al., 2018) rendered with real-world images (Chang et al., 2017), the goal of VLN is to navigate to a target location by following a detailed natural language instruction, such as “Walk around the table and exit the room. Walk down the first set of stairs. Wait there.” There are two kinds of simulators, which render continuous navigation trajectories (Krantz et al., 2020) and discrete trajectories (Anderson et al., 2018), respectively. In this paper we focus on the discrete one, where the agent navigates on a discrete graph (see Figure 1 (a)).

A variety of approaches have been proposed to address the VLN problem (Wang et al., 2018; Fried et al., 2018; Landi et al., 2019; Tan et al., 2019; Ma et al., 2019a; Wang et al., 2019; Li et al., 2019; Qi et al., 2020a; Hao et al., 2020; Majumdar et al., 2020; Wang et al., 2020a, b; Hong et al., 2020a; Deng et al., 2020; Hong et al., 2020b). Most of them adopt the panoramic action space (Fried et al., 2018), where the agent selects a navigable candidate from its observations to move to at each step. However, the context of navigable candidates is rarely discussed in existing works, and the commonly used single-view candidates carry limited visual context, which may hamper the matching between instructions and the visual representations of candidates. Figure 1 shows such an example, where there are three candidate “archways”, each of which is represented by a single-view visual perception (Figure 1 (b)). According to the instruction, only the archway “to the left of mirror” leads to the correct navigation. However, most existing agents may fail because they cannot find the referred “mirror” in any single-view-based candidate.

Thus, we propose to enhance textual-visual matching by fusing visual information from candidates’ neighbor views, as shown in Figure 1 (c), which has rarely been explored before. It is non-trivial to fuse neighbor views for visual context modeling, because many unmentioned visual clues exist that may interfere with the agent’s decision (e.g., the lamp in Figure 1 (c)). In addition, some instructions do not even involve visual clues in neighbor views, such as “go through the doorway”. To handle this challenging problem, we propose to decompose an instruction into action-, subject- and reference-related phrases, as shown in Figure 2. Generally, the action and subject are necessary, while the optional reference helps to distinguish the desired candidate from other similar ones.

Based on the above three types of instruction phrases, we further design a multi-module Neighbor-view Enhanced Model (NvEM) to adaptively fuse neighbor visual contexts in order to improve the textual-visual matching between instructions and candidates’ visual perceptions. Specifically, our NvEM includes a subject module, a reference module and an action module, where subjects and references are determined via attention mechanisms. On one hand, the subject module aggregates neighbor views at a global level based on spatial information. On the other hand, the reference module aggregates related objects from neighbor views at a local level. The action module makes use of the strong orientation guidance (e.g., “turn left”) in instructions. Each module predicts a navigation action separately, and their weighted sum is used to predict the final action. Note that the combination weights are trainable and predicted from the decomposed subject-, reference- and action-related phrases.

The contributions of this work are summarized as follows:

To improve the textual-visual matching between instructions and navigable candidates, we propose to take into account the visual contexts from neighbor views for the first time.

We propose a subject module and a reference module to adaptively fuse visual contexts from neighbor views at both global level and local level.

Extensive experimental results demonstrate the effectiveness of the proposed method with comparisons against several existing state-of-the-art methods, and NvEM even beats some pre-training ones.

2. Related Work

Vision and Language Navigation. Numerous approaches have been proposed to address the VLN problem. Most of them are based on the CNN-LSTM architecture with attention mechanisms: at each time step, the agent first grounds its surrounding observations to the instruction, then chooses the best-matched candidate according to the grounded instruction as the next location. An early work, Speaker-Follower (Fried et al., 2018), develops a speaker model to synthesize new instructions for randomly sampled trajectories; additionally, it designs a panoramic action space for efficient navigation. Later on, EnvDrop (Tan et al., 2019) increases the diversity of synthetic data by randomly removing objects to generate “new environments”.

Along another line, Self-Monitoring (Ma et al., 2019a) and RCM (Wang et al., 2019) utilize cross-modality co-attention mechanisms to enhance the alignment between instructions and trajectories. To learn generic linguistic and visual representations for VLN, AuxRN (Zhu et al., 2020) designs several auxiliary self-supervised losses. Very recently, large-scale pre-training models for VLN have been widely explored (Li et al., 2019; Hao et al., 2020; Majumdar et al., 2020; Hong et al., 2020b); they improve the agent’s generalization abilities dramatically by benefiting from priors learned on other datasets. Since different types of visual clues correspond to different phrases in an instruction, OAAM (Qi et al., 2020a) and RelGraph (Hong et al., 2020a) utilize decomposed phrases to guide more accurate action prediction. OAAM (Qi et al., 2020a) adopts action- and object-specialized clues to vote for the action at each time step, while RelGraph (Hong et al., 2020a) proposes a graph network to model the intra- and inter-relationships among the contextual and visual clues. The most relevant work to ours is RelGraph (Hong et al., 2020a), since both attempt to exploit view-level and object-level features. The key difference is that we focus on enhancing each candidate’s representation with its multiple neighbor views (namely, inter-view), while the representation in RelGraph is limited to a single view and is thus intra-view.

Modular Attention Networks. Modular networks are widely adopted in vision and language models. They attempt to decompose sentences into multiple phrases via attention mechanisms, as different phrases usually correspond to different visual clues. MAttNet (Yu et al., 2018) decomposes a long sentence into three parts (appearance, location and relationship) for referring expression comprehension. LCGN (Hu et al., 2019b) utilizes a multi-step textual attention mechanism to extract different object-related phrases, then models objects’ contexts via the relations among the phrases. LGI (Mun et al., 2020) utilizes a sequential query attention module to decompose the query into multiple semantic phrases, then uses these phrases to interact with video clips for video grounding. To the best of our knowledge, OAAM (Qi et al., 2020a) is the earliest attempt to decompose instructions in VLN. It decomposes instructions into “action”- and “object”-specialized phrases, and uses these phrases to vote for the next action. Our work has two key differences from OAAM: (I) different modules: our modules are (subject, reference, action) vs. (object, action) in OAAM; (II) our subject module and reference module fuse information from neighbor views, while OAAM only uses information within one single view.

Figure 3. Main architecture of the proposed multi-module Neighbor-view Enhanced Model (NvEM). First, action-, subject- and reference-related phrases are attended via an attention mechanism (Section 4.2). Then, the reference and subject modules predict navigation actions by aggregating visual contexts from candidates’ neighbor views at local and global levels, while the action module predicts navigation actions from orientation information (Section 4.3). Lastly, the weighted sum of all three predictions gives the final navigation decision (Section 4.4).

3. Preliminary

In VLN (Anderson et al., 2018), given a natural language instruction with $L$ words, an agent navigates on a discrete graph to reach the described target by following the instruction. At each time step $t$, the agent observes a panorama which consists of 36 discrete views. Each view is represented by an image $v_{t,i}$, with its orientation including heading $\theta_{t,i}$ and elevation $\phi_{t,i}$. Also, there are $N_t$ candidates at time step $t$, and each candidate is represented by the single view in which it lies, together with the candidate's relative orientation to the agent. Formally, for the $i$-th view:

$$ f_{t,i} = [\,\mathrm{ResNet}(v_{t,i});\; E(\theta_{t,i}, \phi_{t,i})\,] \qquad (1) $$

where $\mathrm{ResNet}(\cdot)$ represents ResNet (He et al., 2016) pooling features, and $E(\cdot)$ is an embedding function for heading and elevation which repeats $(\cos\theta_{t,i}, \sin\theta_{t,i}, \cos\phi_{t,i}, \sin\phi_{t,i})$ 32 times following (Tan et al., 2019). The $j$-th candidate $c_{t,j}$ is encoded in the same way.
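
As a concrete illustration of Eq (1), the sketch below builds a single view feature from a 2048-d ResNet pooling vector and the repeated orientation embedding; the 2048-d feature size and the function names are assumptions for illustration, not the released code.

```python
import numpy as np

def orientation_embedding(heading: float, elevation: float, repeat: int = 32) -> np.ndarray:
    """E(theta, phi) in Eq (1): repeat (cos, sin) of heading and elevation 32 times -> 128-d."""
    base = np.array([np.cos(heading), np.sin(heading),
                     np.cos(elevation), np.sin(elevation)], dtype=np.float32)
    return np.tile(base, repeat)

def encode_view(resnet_feat: np.ndarray, heading: float, elevation: float) -> np.ndarray:
    """f = [ResNet(v); E(theta, phi)]: concatenate appearance and orientation."""
    return np.concatenate([resnet_feat, orientation_embedding(heading, elevation)])

# One of the 36 panoramic views, with a placeholder 2048-d ResNet pooling feature.
view_feature = encode_view(np.random.rand(2048).astype(np.float32),
                           heading=np.pi / 6, elevation=0.0)
print(view_feature.shape)  # (2176,)
```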

Previous works show that data augmentation is able to significantly improve the generalization ability in unseen environments (Fried et al., 2018; Tan et al., 2019). We make use of this strategy by adopting EnvDrop (Tan et al., 2019) as our baseline, which first uses a bi-directional LSTM to encode the instruction, and the encoded instruction is represented as $U = \{u_1, \dots, u_L\}$. Then the agent's previous context-aware state $\tilde{h}_{t-1}$ is used to attend over all views to get the scene feature $\hat{f}_t = \mathrm{SoftAttn}(\tilde{h}_{t-1}, \{f_{t,i}\}_{i=1}^{36})$. The concatenation of $\hat{f}_t$ and the previous action embedding $\tilde{a}_{t-1}$ is fed into the decoder LSTM to update the agent's state: $h_t = \mathrm{LSTM}([\hat{f}_t; \tilde{a}_{t-1}],\, \tilde{h}_{t-1})$. Note that the context-aware agent state $\tilde{h}_t$ is updated via the attentive instruction feature $\hat{u}_t$:

$$ \hat{u}_t = \mathrm{SoftAttn}(h_t, U), \qquad \tilde{h}_t = \tanh\big(W_h\,[\hat{u}_t;\, h_t]\big) \qquad (2) $$

where $W_h$ is a trainable linear projection and $\mathrm{SoftAttn}(\cdot,\cdot)$ denotes soft-dot attention. Finally, EnvDrop predicts the navigation action by selecting the candidate with the highest probability ($W_c$ is a trainable linear projection):

$$ p_t(c_{t,j}) = \mathrm{softmax}_j\big(c_{t,j}^{\top}\, W_c\, \tilde{h}_t\big) \qquad (3) $$
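
A minimal PyTorch sketch of one such decoding step is given below; it re-implements the soft-dot attention and candidate scoring of Eqs (2)-(3) under the notation above, assumes all features have already been projected to a common dimension, and is not the official EnvDrop code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_dot_attention(query, keys):
    """SoftAttn(query, keys): dot-product attention; query (D,), keys (N, D) -> attended (D,)."""
    weights = F.softmax(keys @ query, dim=0)
    return weights @ keys

class BaselineStep(nn.Module):
    """One decoding step of the EnvDrop-style baseline (Eqs (2)-(3)), simplified."""
    def __init__(self, dim):
        super().__init__()
        self.lstm = nn.LSTMCell(2 * dim, dim)   # input: [scene feature; previous action]
        self.w_h = nn.Linear(2 * dim, dim)      # W_h in Eq (2)
        self.w_c = nn.Linear(dim, dim)          # W_c in Eq (3)

    def forward(self, views, cands, instr, prev_action, h_tilde_prev, lstm_state):
        scene = soft_dot_attention(h_tilde_prev, views)                     # attend over 36 views
        h, c = self.lstm(torch.cat([scene, prev_action]).unsqueeze(0), lstm_state)
        u_hat = soft_dot_attention(h.squeeze(0), instr)                     # attentive instruction feature
        h_tilde = torch.tanh(self.w_h(torch.cat([u_hat, h.squeeze(0)])))    # Eq (2)
        probs = F.softmax(cands @ self.w_c(h_tilde), dim=0)                 # Eq (3)
        return probs, h_tilde, (h, c)

# toy sizes: 36 views, 4 candidates, 40 instruction tokens, all pre-projected to 512-d
step = BaselineStep(512)
state = (torch.zeros(1, 512), torch.zeros(1, 512))
probs, h_tilde, state = step(torch.randn(36, 512), torch.randn(4, 512),
                             torch.randn(40, 512), torch.zeros(512),
                             torch.zeros(512), state)
```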

4. Methodology

In this section, we first briefly describe the pipeline of the proposed model in the training phase, and then detail the proposed Neighbor-view Enhanced Model (NvEM). Note that in this section we omit the time step $t$ to avoid notational clutter.

4.1. Overview

Figure 3 illustrates the main pipeline of our NvEM. First, action-, subject- and reference-related phrases are attended by three independent attention schemes. Then, the reference and subject modules predict navigation actions by aggregating visual contexts from candidates’ neighbor views, while the action module predicts navigation actions from orientation information. Lastly, the final navigation action is determined by combining these three predictions with weights generated from the phrases’ embeddings.

4.2. Phrase Extractor

Considering an instruction, such as “walk through the archway to the left of the mirror…”, there are three types of phrases which the agent needs to identify: the action, which describes the orientation of the target candidate (e.g., “walk through”); the subject, which describes the main visual entity of the correct navigation (e.g., “the archway”); and the reference, which the subject refers to (e.g., “to the left of the mirror”). Thus, NvEM first performs three soft-attentions independently on the instruction, conditioned on the current agent state $h$, to attend on these three types of phrases:

$$ \hat{u}_k = \mathrm{SoftAttn}(W_k h,\, U), \qquad \tilde{h}_k = \tanh\big(W_k'\,[\hat{u}_k;\, h]\big), \qquad k \in \{act, sub, ref\} \qquad (4) $$

where the subscripts $k$ denote the corresponding types of phrases and $\mathrm{SoftAttn}$ is the same as in Eq (2). $\hat{u}_k$ and $\tilde{h}_k$ denote the features of the corresponding phrases and the context-aware agent states, and they are computed with different linear projections $W_k$ and $W_k'$. The global context-aware agent state $\tilde{h}$ in Eq (2) is now calculated by averaging the three specialized context-aware states $\tilde{h}_{act}$, $\tilde{h}_{sub}$ and $\tilde{h}_{ref}$.
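
As an illustration of Eq (4), the sketch below runs three independent soft-attentions over the encoded instruction, one per phrase type; the class name and the exact projection layout are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseExtractor(nn.Module):
    """Three independent soft-attentions over the encoded instruction (Eq (4)),
    producing action-, subject- and reference-related phrase features and the
    corresponding specialized context-aware states."""
    def __init__(self, dim, phrase_types=("action", "subject", "reference")):
        super().__init__()
        self.queries = nn.ModuleDict({k: nn.Linear(dim, dim) for k in phrase_types})
        self.fuse = nn.ModuleDict({k: nn.Linear(2 * dim, dim) for k in phrase_types})

    def forward(self, instr, h):
        """instr: (L, dim) encoded instruction; h: (dim,) current agent state."""
        phrases, states = {}, {}
        for k in self.queries:
            attn = F.softmax(instr @ self.queries[k](h), dim=0)       # attention over L tokens
            u_k = attn @ instr                                         # phrase feature
            h_k = torch.tanh(self.fuse[k](torch.cat([u_k, h])))        # specialized state
            phrases[k], states[k] = u_k, h_k
        # the global context-aware state is the average of the three specialized ones
        h_tilde = torch.stack(list(states.values())).mean(dim=0)
        return phrases, states, h_tilde

# toy usage: a 40-token instruction encoding and a 512-d agent state
extractor = PhraseExtractor(dim=512)
phrases, states, h_tilde = extractor(torch.randn(40, 512), torch.randn(512))
```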

4.3. Neighbor-view Enhanced Navigator

Corresponding to the attended three types of phrases in instructions, our neighbor-view enhanced navigator contains three modules: a reference module, a subject module and an action module. The reference module and subject module predict navigation actions via aggregating visual contexts from neighbor views at local and global levels, respectively. The action module predicts navigation actions according to orientation information. We provide the details below.

Reference Module. A reference usually exists as a landmark surrounding the subject to disambiguate similar navigation candidates. In the example “walk through the archway to the left of the mirror…”, the reference is “the mirror”, and it is referred to with a spatial relationship to the subject (e.g., “to the left of”). This motivates us to enhance the representation of a navigation candidate using the features of local objects and their spatial relations.

Formally, for the $j$-th candidate $c_j$, we are given its orientation $(\theta_j, \phi_j)$, the objects' features $o_{n,m}$ extracted by Faster R-CNN (Ren et al., 2015), and the objects' orientations $(\theta_{n,m}, \phi_{n,m})$ in the candidate's neighbor views (assume there are $M$ objects in each neighbor and $n$ denotes the $n$-th neighbor). We then calculate the spatial relations of the neighbor objects to the $j$-th candidate:

$$ r_{n,m} = E(\theta_{n,m} - \theta_j,\; \phi_{n,m} - \phi_j) \qquad (5) $$

where $E(\cdot)$ is the same as that in Eq (1). Then each neighbor object is represented as the concatenation of its object feature and relative spatial embedding:

$$ \tilde{o}_{n,m} = [\,\sigma_o(o_{n,m});\; r_{n,m}\,] \qquad (6) $$

where $[\cdot\,;\cdot]$ denotes concatenation, $\sigma_o$ projects objects' features into a $D$-dimensional space, and all $\sigma_{(\cdot)}$ in this section represent trainable non-linear projections (a linear layer followed by an activation function). We use the reference-related phrases $\hat{u}_{ref}$ to highlight relevant objects in neighbor views:

$$ \alpha_{n,m} = \mathrm{softmax}_{n,m}\big((W_1 \hat{u}_{ref})^{\top}\, W_2\, \tilde{o}_{n,m}\big), \qquad c_j^{ref} = \sum_{n,m} \alpha_{n,m}\, \tilde{o}_{n,m} \qquad (7) $$

where $W_1$ and $W_2$ are trainable linear projections. The reference module predicts the confidence $p_j^{ref}$ of the candidate being the next navigation action using the reference-related state $\tilde{h}_{ref}$ and the neighbor-reference enhanced candidate representation $c_j^{ref}$:

$$ p_j^{ref} = \mathrm{softmax}_j\big((c_j^{ref})^{\top}\, W_{ref}\, \tilde{h}_{ref}\big) \qquad (8) $$

where $W_{ref}$ is a trainable linear projection.
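
Eqs (5)-(8) can be illustrated with the following sketch. It reuses the 128-d orientation embedding from the Eq (1) sketch, folds the two projections of Eq (7) into one for brevity, and assumes Tanh as the activation of the non-linear projection; the class and argument names are ours, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def orientation_embedding(heading, elevation, repeat=32):
    """E(.) of Eq (1): four trigonometric values repeated 32 times -> 128-d."""
    heading = torch.as_tensor(heading, dtype=torch.float32)
    elevation = torch.as_tensor(elevation, dtype=torch.float32)
    return torch.stack([torch.cos(heading), torch.sin(heading),
                        torch.cos(elevation), torch.sin(elevation)]).repeat(repeat)

class ReferenceModule(nn.Module):
    """Local-level fusion of neighbor objects (Eqs (5)-(8)), simplified."""
    def __init__(self, obj_dim, dim, ori_dim=128):
        super().__init__()
        self.proj_obj = nn.Sequential(nn.Linear(obj_dim, dim), nn.Tanh())  # sigma_o in Eq (6)
        self.w_phrase = nn.Linear(dim, dim + ori_dim)                      # phrase query, Eq (7)
        self.w_score = nn.Linear(dim, dim + ori_dim)                       # W_ref in Eq (8)

    def enhance(self, obj_feats, obj_oris, cand_ori, u_ref):
        """obj_feats: (K, obj_dim) objects gathered from all neighbor views;
        obj_oris: (K, 2) and cand_ori: (2,) hold (heading, elevation) in radians."""
        rel = torch.stack([orientation_embedding(h - cand_ori[0], e - cand_ori[1])
                           for h, e in obj_oris])                          # Eq (5)
        objs = torch.cat([self.proj_obj(obj_feats), rel], dim=-1)          # Eq (6)
        attn = F.softmax(objs @ self.w_phrase(u_ref), dim=0)               # Eq (7)
        return attn @ objs                                                 # enhanced candidate repr.

    def score(self, cand_ref, h_ref):
        """Per-candidate logit for Eq (8); the softmax over candidates is taken outside."""
        return cand_ref @ self.w_score(h_ref)

# toy usage: 4 neighbors x 8 objects = 32 objects with 2048-d detector features
ref = ReferenceModule(obj_dim=2048, dim=512)
c_ref = ref.enhance(torch.randn(32, 2048), torch.rand(32, 2), torch.rand(2), torch.randn(512))
logit = ref.score(c_ref, torch.randn(512))
```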

Subject Module. The instruction subject describes the main visual entity of the correct navigation, such as “the archway” in “walk through the archway to the left of the mirror…”. However, sometimes there are multiple candidates containing different instances of the subject. To alleviate this ambiguity on the visual side, we propose to enhance the visual representation of the subject by incorporating contexts from neighbor views. Specifically, we aggregate neighbor views at a global level, with the help of the spatial affinities of the neighbor views to the candidate.

Formally, for the $j$-th candidate $c_j$, given its orientation $(\theta_j, \phi_j)$, its neighbor views' orientations $(\theta_n, \phi_n)$ and ResNet features $f_n$ (assume it has $N$ neighbor views), we first embed all neighbor views using a trainable non-linear projection:

$$ \tilde{f}_n = \sigma_v(f_n) \qquad (9) $$

Then we compute the spatial affinities between the candidate and its neighbor views based on their orientations in a query-key manner (Vaswani et al., 2017), and the enhanced subject visual representation $c_j^{sub}$ is obtained by adaptively aggregating the neighbor views' embeddings:

$$ \beta_n = \mathrm{softmax}_n\big((W_q E(\theta_j, \phi_j))^{\top}\, W_k E(\theta_n, \phi_n)\big), \qquad c_j^{sub} = \sum_{n} \beta_n\, \tilde{f}_n \qquad (10) $$

where $W_q$ and $W_k$ are trainable linear projections. Similar to the reference module, the subject module predicts the confidence $p_j^{sub}$ of candidate $c_j$ being the next navigation action via:

$$ p_j^{sub} = \mathrm{softmax}_j\big((c_j^{sub})^{\top}\, W_{sub}\, \tilde{h}_{sub}\big) \qquad (11) $$

where $W_{sub}$ is a trainable linear projection.
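
A matching sketch for the subject module (Eqs (9)-(11)) follows, under the same assumptions as above (128-d orientation embeddings, Tanh non-linearity, our own naming); it is an illustration rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubjectModule(nn.Module):
    """Global-level fusion of neighbor views via spatial affinity (Eqs (9)-(11)), simplified."""
    def __init__(self, view_dim, dim, ori_dim=128):
        super().__init__()
        self.proj_view = nn.Sequential(nn.Linear(view_dim, dim), nn.Tanh())  # sigma_v in Eq (9)
        self.w_q = nn.Linear(ori_dim, ori_dim)   # query projection of the candidate orientation
        self.w_k = nn.Linear(ori_dim, ori_dim)   # key projection of neighbor-view orientations
        self.w_score = nn.Linear(dim, dim)       # W_sub in Eq (11)

    def enhance(self, neighbor_feats, neighbor_ori_embs, cand_ori_emb):
        """neighbor_feats: (N, view_dim); neighbor_ori_embs: (N, 128); cand_ori_emb: (128,)."""
        views = self.proj_view(neighbor_feats)                                        # Eq (9)
        affinity = F.softmax(self.w_k(neighbor_ori_embs) @ self.w_q(cand_ori_emb), dim=0)
        return affinity @ views                                                       # Eq (10)

    def score(self, cand_sub, h_sub):
        """Per-candidate logit for Eq (11); the softmax over candidates is taken outside."""
        return cand_sub @ self.w_score(h_sub)

# toy usage: 4 neighbor views with 2048-d ResNet features, projected to 512-d
subject = SubjectModule(view_dim=2048, dim=512)
c_sub = subject.enhance(torch.randn(4, 2048), torch.randn(4, 128), torch.randn(128))
logit = subject.score(c_sub, torch.randn(512))
```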

Action Module. Action-related phrases serve as strong guidance in navigation (Hu et al., 2019a; Qi et al., 2020a), such as “go forward”, “turn left” and “go down”. Inspired by (Qi et al., 2020a), for the $j$-th candidate $c_j$, the action module predicts the confidence $p_j^{act}$ of candidate $c_j$ being the next navigation action, using the candidate's orientation embedding $E(\theta_j, \phi_j)$ and the action-related state $\tilde{h}_{act}$:

$$ p_j^{act} = \mathrm{softmax}_j\big(E(\theta_j, \phi_j)^{\top}\, W_{act}\, \tilde{h}_{act}\big) \qquad (12) $$

where $W_{act}$ is a trainable linear projection.

4.4. Adaptive Action Integrator

The action, subject and reference modules usually contribute to different degrees to the final decision. Thus, we propose to adaptively integrate the predictions from these three modules. We first calculate the combination weights conditioned on the action-, subject- and reference-specialized phrases $\hat{u}_{act}$, $\hat{u}_{sub}$ and $\hat{u}_{ref}$. Taking the action weight as an example:

$$ w_{act} = W_a\, \hat{u}_{act} \qquad (13) $$

where $W_a$, $W_b$ and $W_c$ are trainable linear projections that produce $w_{act}$, $w_{sub}$ and $w_{ref}$ from the corresponding phrase features, respectively. Then the final action probability $p_j$ of the $j$-th candidate is calculated by the weighted summation of the above confidences:

$$ p_j = w_{act}\, p_j^{act} + w_{sub}\, p_j^{sub} + w_{ref}\, p_j^{ref} \qquad (14) $$

In the inference phase, the agent selects the candidate with the maximum probability as shown in Eq (3) at each step.
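
The following sketch combines the action-module scoring of Eq (12) with the adaptive fusion of Eqs (13)-(14). The single-linear-layer weight heads and the softmax normalization of the three weights are our assumptions for illustration; the released code may compute the weights differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionAndIntegrator(nn.Module):
    """Action-module scoring (Eq (12)) plus adaptive fusion of the three
    module predictions (Eqs (13)-(14)); a simplified sketch."""
    def __init__(self, dim, ori_dim=128):
        super().__init__()
        self.w_act = nn.Linear(dim, ori_dim)                     # W_act in Eq (12)
        self.weight_heads = nn.ModuleDict({                      # one head per phrase type, Eq (13)
            k: nn.Linear(dim, 1) for k in ("action", "subject", "reference")})

    def forward(self, cand_ori_embs, h_act, p_sub, p_ref, phrases):
        """cand_ori_embs: (K, ori_dim) candidate orientation embeddings;
        p_sub, p_ref: (K,) probabilities from the subject and reference modules;
        phrases: dict with 'action', 'subject', 'reference' phrase features of size (dim,)."""
        p_act = F.softmax(cand_ori_embs @ self.w_act(h_act), dim=0)          # Eq (12)
        w = torch.cat([self.weight_heads[k](phrases[k])                      # Eq (13)
                       for k in ("action", "subject", "reference")])
        w = F.softmax(w, dim=0)   # normalization of the weights is an assumption on our side
        return w[0] * p_act + w[1] * p_sub + w[2] * p_ref                    # Eq (14)

# toy usage with 5 candidates
module = ActionAndIntegrator(dim=512)
phrases = {k: torch.randn(512) for k in ("action", "subject", "reference")}
p_final = module(torch.randn(5, 128), torch.randn(512),
                 F.softmax(torch.randn(5), 0), F.softmax(torch.randn(5), 0), phrases)
```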

4.5. Training

We apply the Imitation Learning (IL) + Reinforcement Learning (RL) objectives to train our model following (Tan et al., 2019). In imitation learning, the agent takes the teacher action $a_t^{*}$ at each time step to learn to follow the ground-truth trajectory. In reinforcement learning, the agent samples an action $a_t^{s}$ from the probability $p_t$ and learns from the rewards. Formally:

$$ \mathcal{L} = -\lambda \sum_{t=1}^{T_{IL}} \log p_t(a_t^{*}) \; - \sum_{t=1}^{T_{RL}} A_t \log p_t(a_t^{s}) \qquad (15) $$

where $\lambda$ is a coefficient for weighting the IL loss, $T_{IL}$ and $T_{RL}$ are the total numbers of steps the agent takes in IL and RL respectively, and $A_t$ is the advantage in the A2C algorithm (Mnih et al., 2016). We apply the summation of two types of rewards in the RL objective, the goal reward and the fidelity reward, following (Jain et al., 2019).
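
A minimal sketch of this mixed objective is given below; the per-step formulation and the default value of the IL weight λ are assumptions for illustration, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def mixed_il_rl_loss(il_logits, teacher_actions, rl_logits, sampled_actions,
                     advantages, il_weight=0.2):
    """Sketch of Eq (15). il_logits / rl_logits: per-step candidate logits of shape (K_t,);
    teacher_actions / sampled_actions: per-step candidate indices (ints);
    advantages: per-step A2C advantages; il_weight plays the role of lambda (assumed value)."""
    il_loss = sum(-F.log_softmax(l, dim=0)[a]
                  for l, a in zip(il_logits, teacher_actions))
    # policy-gradient term: -A_t * log pi(a_t^s); advantages are treated as constants
    rl_loss = sum(-torch.as_tensor(adv).detach() * F.log_softmax(l, dim=0)[a]
                  for l, a, adv in zip(rl_logits, sampled_actions, advantages))
    return il_weight * il_loss + rl_loss

# toy usage: two IL steps and two RL steps over 4 candidates each
logits = [torch.randn(4, requires_grad=True) for _ in range(4)]
loss = mixed_il_rl_loss(logits[:2], [0, 2], logits[2:], [1, 3], [0.5, -0.2])
loss.backward()
```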

Agent | Val Seen (TL / NE / SR / SPL) | Val Unseen (TL / NE / SR / SPL) | Test Unseen (TL / NE / SR / SPL)
Random | 9.58 / 9.45 / 0.16 / - | 9.77 / 9.23 / 0.16 / - | 9.89 / 9.79 / 0.13 / 0.12
Human | - / - / - / - | - / - / - / - | 11.85 / 1.61 / 0.86 / 0.76
PRESS† (Li et al., 2019) | 10.57 / 4.39 / 0.58 / 0.55 | 10.36 / 5.28 / 0.49 / 0.45 | 10.77 / 5.49 / 0.49 / 0.45
PREVALENT† (Hao et al., 2020) | 10.32 / 3.67 / 0.69 / 0.65 | 10.19 / 4.71 / 0.58 / 0.53 | 10.51 / 5.30 / 0.54 / 0.51
VLNBert† (init. OSCAR) (Hong et al., 2020b) | 10.79 / 3.11 / 0.71 / 0.67 | 11.86 / 4.29 / 0.59 / 0.53 | 12.34 / 4.59 / 0.57 / 0.53
VLNBert† (init. PREVALENT) (Hong et al., 2020b) | 11.13 / 2.90 / 0.72 / 0.68 | 12.01 / 3.93 / 0.63 / 0.57 | 12.35 / 4.09 / 0.63 / 0.57
Seq2Seq (Anderson et al., 2018) | 11.33 / 6.01 / 0.39 / - | 8.39 / 7.81 / 0.22 / - | 8.13 / 7.85 / 0.20 / 0.18
Speaker-Follower (Fried et al., 2018) | - / 3.36 / 0.66 / - | - / 6.62 / 0.35 / - | 14.82 / 6.62 / 0.35 / 0.28
SM (Ma et al., 2019a) | - / 3.22 / 0.67 / 0.58 | - / 5.52 / 0.45 / 0.32 | 18.04 / 5.67 / 0.48 / 0.35
RCM+SIL (Wang et al., 2019) | 10.65 / 3.53 / 0.67 / - | 11.46 / 6.09 / 0.43 / - | 11.97 / 6.12 / 0.43 / 0.38
Regretful (Ma et al., 2019b) | - / 3.23 / 0.69 / 0.63 | - / 5.32 / 0.50 / 0.41 | 13.69 / 5.69 / 0.48 / 0.40
VLNBert (no init.) (Hong et al., 2020b) | 9.78 / 3.92 / 0.62 / 0.59 | 10.31 / 5.10 / 0.50 / 0.46 | 11.15 / 5.45 / 0.51 / 0.47
EnvDrop (Tan et al., 2019) | 11.00 / 3.99 / 0.62 / 0.59 | 10.70 / 5.22 / 0.52 / 0.48 | 11.66 / 5.23 / 0.51 / 0.47
AuxRN (Zhu et al., 2020) | - / 3.33 / 0.70 / 0.67 | - / 5.28 / 0.55 / 0.50 | - / 5.15 / 0.55 / 0.51
RelGraph (Hong et al., 2020a) | 10.13 / 3.47 / 0.67 / 0.65 | 9.99 / 4.73 / 0.57 / 0.53 | 10.29 / 4.75 / 0.55 / 0.52
NvEM (ours) | 11.09 / 3.44 / 0.69 / 0.65 | 11.83 / 4.27 / 0.60 / 0.55 | 12.98 / 4.37 / 0.58 / 0.54
Table 1. Comparison of single-run performance with the state-of-the-art methods on R2R. † denotes works that apply pre-trained textual or visual encoders.
Agent | Val Seen (NE / SR / SPL / CLS / nDTW / sDTW) | Val Unseen (NE / SR / SPL / CLS / nDTW / sDTW)
EnvDrop∗ (Tan et al., 2019) | - / 0.52 / 0.41 / 0.53 / - / 0.27 | - / 0.29 / 0.18 / 0.34 / - / 0.09
RCM-a (goal) (Jain et al., 2019) | 5.11 / 0.56 / 0.32 / 0.40 / - / - | 8.45 / 0.29 / 0.10 / 0.20 / - / -
RCM-a (fidelity) (Jain et al., 2019) | 5.37 / 0.53 / 0.31 / 0.55 / - / - | 8.08 / 0.26 / 0.08 / 0.35 / - / -
RCM-b (goal) (Ilharco et al., 2019) | - / - / - / - / - / - | - / 0.29 / 0.15 / 0.33 / 0.27 / 0.11
RCM-b (fidelity) (Ilharco et al., 2019) | - / - / - / - / - / - | - / 0.29 / 0.21 / 0.35 / 0.30 / 0.13
OAAM (Qi et al., 2020a) | - / 0.56 / 0.49 / 0.54 / - / 0.32 | - / 0.31 / 0.23 / 0.40 / - / 0.11
RelGraph (Hong et al., 2020a) | 5.14 / 0.55 / 0.50 / 0.51 / 0.48 / 0.35 | 7.55 / 0.35 / 0.25 / 0.37 / 0.32 / 0.18
NvEM (ours) | 5.38 / 0.54 / 0.47 / 0.51 / 0.48 / 0.35 | 6.85 / 0.38 / 0.28 / 0.41 / 0.36 / 0.20
Table 2. Comparison of single-run performance with the state-of-the-art methods on R4R. goal and fidelity indicate the goal and fidelity rewards used in reinforcement learning. ∗ denotes our reimplemented R4R results.

5. Experiments

In this section, we first describe the commonly used VLN datasets and the evaluation metrics. Then we present implementation details of NvEM. Finally, we compare against several state-of-the-art methods and provide ablation experiments. Qualitative visualizations are also presented.

5.1. Datasets and Evaluation Metrics

R2R benchmark. The Room-to-Room (R2R) dataset (Anderson et al., 2018) consists of 10,567 panoramic view nodes in 90 real-world environments, as well as 7,189 trajectories, each described by three natural language instructions. The dataset is split into train, validation seen, validation unseen and test unseen sets. We follow the standard metrics employed by previous works to evaluate the performance of our agent. These metrics include: the Trajectory Length (TL), which measures the average length of the agent’s navigation path; the Navigation Error (NE), which is the average distance between the agent’s final location and the target; the Success Rate (SR), which measures the ratio of trajectories in which the agent stops within 3 meters of the target; and the Success Rate weighted by Path Length (SPL), which considers both path length and success rate. Note that SR and SPL in unseen environments are the main metrics for R2R.
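
For concreteness, a small sketch of SR and SPL over a batch of episodes is given below, following the standard definitions (a 3-meter success threshold, and SPL as success weighted by the ratio of shortest-path length to the maximum of the taken and shortest lengths); the function and argument names are ours.

```python
import numpy as np

def success_rate_and_spl(final_dists, path_lengths, shortest_lengths, threshold=3.0):
    """SR: fraction of episodes ending within `threshold` meters of the target.
    SPL: success weighted by shortest_length / max(taken_length, shortest_length)."""
    final_dists = np.asarray(final_dists, dtype=float)
    path_lengths = np.asarray(path_lengths, dtype=float)
    shortest_lengths = np.asarray(shortest_lengths, dtype=float)
    success = (final_dists < threshold).astype(float)
    spl = success * shortest_lengths / np.maximum(path_lengths, shortest_lengths)
    return success.mean(), spl.mean()

# toy example with three episodes (distances and lengths in meters)
sr, spl = success_rate_and_spl([1.2, 5.0, 2.4], [12.0, 9.0, 15.0], [10.0, 8.0, 10.0])
print(round(sr, 3), round(spl, 3))
```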

R4R benchmark. The Room-for-Room (R4R) dataset (Jain et al., 2019) is an extended version of R2R with longer instructions and trajectories. The dataset is split into train, validation seen and validation unseen sets. Besides the main metrics of R2R, R4R includes additional metrics: the Coverage weighted by Length Score (CLS) (Jain et al., 2019), the normalized Dynamic Time Warping (nDTW) (Ilharco et al., 2019) and the nDTW weighted by Success Rate (sDTW) (Ilharco et al., 2019). In R4R, SR and SPL measure the accuracy of navigation, while CLS, nDTW and sDTW measure the fidelity between predicted paths and ground-truth paths.
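
As a rough illustration of the fidelity metrics, the sketch below computes nDTW and sDTW from coordinate paths using the exponential normalization of Ilharco et al. (2019); on the R2R/R4R navigation graph the distances would be graph geodesics, whereas plain Euclidean distances are used here for brevity.

```python
import numpy as np

def ndtw_sdtw(pred_path, gt_path, success, threshold=3.0):
    """nDTW = exp(-DTW(pred, gt) / (|gt| * threshold)); sDTW = success * nDTW
    (Ilharco et al., 2019). Paths are arrays of coordinates."""
    pred, gt = np.asarray(pred_path, float), np.asarray(gt_path, float)
    n, m = len(pred), len(gt)
    dtw = np.full((n + 1, m + 1), np.inf)
    dtw[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - gt[j - 1])
            dtw[i, j] = cost + min(dtw[i - 1, j], dtw[i, j - 1], dtw[i - 1, j - 1])
    ndtw = np.exp(-dtw[n, m] / (m * threshold))
    return ndtw, float(success) * ndtw

# toy 2-D paths: a predicted path slightly off the reference path
ndtw, sdtw = ndtw_sdtw([[0, 0], [1, 0.5], [2, 0.5]], [[0, 0], [1, 0], [2, 0]], success=1)
```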

5.2. Implementation Details

We use ResNet-152 (He et al., 2016) pre-trained on Places365 (Zhou et al., 2018) to extract view features. We apply Faster R-CNN (Ren et al., 2015) pre-trained on the Visual Genome dataset (Krishna et al., 2017) to obtain object labels for the reference module, which are then encoded by GloVe (Pennington et al., 2014). To simplify the object vocabulary, we retain the top 100 most frequent classes mentioned in the R2R training data following (Hong et al., 2020a). Our method exploits neighbor views to represent a navigation candidate, so the numbers of neighbors and objects are crucial for the final performance. Our default setting adopts 4 neighbor views and the top 8 detected objects in each neighbor view (we also tried other settings; see Sec 5.4). For simplicity, the objects’ positions are roughly represented by their corresponding views’ orientations in Eq (5). This is reasonable, as instructions usually mention approximate relative relations of objects and candidates.

Our model is trained with the widely used two-stage strategy (Tan et al., 2019). At the first stage, only real training data is used. At the second stage, we pick the model with the highest SR from the first stage and keep training it with both real data and synthetic data generated by (Fried et al., 2018). For the R4R experiments, we only apply the first-stage training. We set the projection dimension $D$ in Sec 4.3 to 512 and the GloVe embedding dimension to 300, and train the agent using the RMSprop optimizer (Ruder, 2016). We train the first stage for 80,000 iterations, and the second stage continues training for up to 200,000 iterations. All our experiments are conducted on an NVIDIA V100 Tensor Core GPU.

5.3. Comparison with the State-of-The-Art

We compare NvEM with several SoTA methods under the single-run setting on both the R2R and R4R benchmarks. Note that VLN mainly focuses on the agent’s performance on unseen splits, so the results we report are based on the model with the highest SR on the validation unseen split.

As shown in Table 1, on the R2R benchmark our NvEM outperforms the baseline EnvDrop (Tan et al., 2019) by a large margin, obtaining 7% absolute improvements in terms of SPL on both the Val Unseen and Test splits. Compared to the state-of-the-art method RelGraph (Hong et al., 2020a), our model obtains 2% absolute SPL improvements on the Val Unseen and Test splits. Moreover, NvEM even beats some pre-training methods, such as PRESS (Li et al., 2019), PREVALENT (Hao et al., 2020), and VLNBert (init. OSCAR) (Hong et al., 2020b). We also note that our method does not surpass VLNBert pre-trained on PREVALENT, which uses in-domain data for pre-training. As shown later in the ablation study, the success of our model mainly comes from incorporating visual contexts from neighbor views, which effectively improves textual-visual matching and leads to more accurate actions.

On the R4R benchmark, we observe a similar phenomenon to that on R2R. As shown in Table 2, NvEM not only significantly outperforms the baseline EnvDrop (Tan et al., 2019), with a 10% absolute SPL improvement on Val Unseen, but also sets a new SoTA. In particular, NvEM achieves 0.41 CLS, 0.36 nDTW and 0.20 sDTW, which are higher than the second best, RelGraph (Hong et al., 2020a), by 4%, 4% and 2%, respectively.

5.4. Ablation Study

We conduct ablation experiments on different components of NvEM on the R2R dataset. Specifically, we study how the action-, subject- and reference-modules contribute to navigation. Then we compare the single-view subject module against the neighbor-view one. Lastly, we study how the numbers of neighbor views and objects in the subject- and reference-modules affect the performance. All ablation models are trained from scratch with the two-stage training.

The importance of different modules. Our full model utilizes three modules: the action-, subject- and reference-modules, which correspond to orientations, global views and local objects, respectively. To study how they affect navigation performance, we conduct an ablation experiment by removing the corresponding module. The results are shown in Table 3. In model #1, we remove the action module, and it achieves the worst performance in terms of the main metric SPL. This indicates that orientations are strong guidance for navigation; the same phenomenon is observed in (Hu et al., 2019a; Qi et al., 2020a). In model #2, we remove the subject module; it performs slightly better than model #1 but still far behind the full model, which indicates that global view information is important for VLN. In model #3, we remove the reference module; it performs slightly worse than the full model, which indicates that references contain some useful information but are not as important as the subject. Another reason is that local references can also be partially covered by the global views of the subject module.

model | action | subject | reference | Val Seen (SR / SPL) | Val Unseen (SR / SPL)
#1 |   | ✓ | ✓ | 0.623 / 0.584 | 0.486 / 0.440
#2 | ✓ |   | ✓ | 0.580 / 0.523 | 0.504 / 0.446
#3 | ✓ | ✓ |   | 0.666 / 0.636 | 0.579 / 0.539
#4 | ✓ | ✓ | ✓ | 0.686 / 0.645 | 0.601 / 0.549
Table 3. Ablation experiment on the importance of different modules (✓ indicates the module is kept).

Single-view subject vs. neighbor-view subject. In the subject module, we aggregate neighbor view features based on their spatial affinities to the candidate, which raises the following questions: is neighbor-view more effective than single-view? And how about aggregating views in other manners? To answer these questions, we test different types of subject modules, and the results are shown in Table 4. Note that we remove the reference module to exclude the effect of references. In Table 4, single uses single-view global features to represent the subject, lang uses subject-aware phrases as the query and view features as the keys to ground views in Eq (10), while spa is our default setting based on spatial affinity. The results show that both model #2 and model #3 perform better than the single-view-based model #1, which indicates the superiority of neighbor-view-based models. Moreover, we observe that language-grounded neighbor-view fusion (model #2) performs worse than the spatial-based one (model #3). This may be caused by the gap between language embeddings and visual embeddings.

model | single | lang | spa | Val Seen (SR / SPL) | Val Unseen (SR / SPL)
#1 | ✓ |   |   | 0.641 / 0.605 | 0.556 / 0.507
#2 |   | ✓ |   | 0.637 / 0.608 | 0.564 / 0.522
#3 |   |   | ✓ | 0.666 / 0.636 | 0.579 / 0.539
Table 4. Different types of subject-modules: single denotes the single-view subject, lang denotes aggregating views using language as the query, and spa denotes using spatial affinity.

The analysis of the numbers of neighbors and objects. In our default setting, the subject module uses 4 neighbor views and the reference module uses 8 objects in each neighbor view. However, it is natural to consider more neighbors (e.g., 8 neighbors) and more objects. In this experiment, we test 4 and 8 neighbors, and 4, 8 and 12 objects per view. The results are shown in Table 5. In models #1 and #2, we adjust the number of objects and keep 4 neighbors; they perform worse than our default setting (model #4). The reason may be that fewer objects might not cover the objects mentioned by instructions, while more objects could be redundant. To study the number of neighbors, we keep the object number at 8. Comparing model #3 with model #4, the results show that 4 neighbors is better. The reason may be that more neighbors contain more redundant information: not only does this affect the aggregation of subject context, it also increases the difficulty of highlighting relevant objects. An example is shown in Figure 4; intuitively, the 8-neighbor setting contains more redundant visual information than the 4-neighbor one.

model | Views | Objects | Val Seen (SR / SPL) | Val Unseen (SR / SPL)
#1 | 4 | 4 | 0.655 / 0.617 | 0.564 / 0.517
#2 | 4 | 12 | 0.667 / 0.638 | 0.572 / 0.532
#3 | 8 | 8 | 0.644 / 0.615 | 0.564 / 0.523
#4 | 4 | 8 | 0.686 / 0.645 | 0.601 / 0.549
Table 5. Ablation experiments on the numbers of neighbor views and objects. Views denotes the number of neighbors, and Objects denotes the number of objects in each view.
Figure 4. A schematic diagram of 4 neighbors and 8 neighbors.
Figure 5. Visualizations of two success cases. The mask on the panorama denotes the neighbor-view spatial affinities of the predicted candidate in the subject module (note that neighbor views overlap with the center view). The boxes are the objects with the top three scores in the reference module. Attentions on the three types of phrases are also shown.
Figure 6. Visualization of a failure case. The green arrow denotes the ground-truth action, while the red arrow denotes the predicted action. The mask on the panorama denotes the neighbor-view spatial affinities of the predicted candidate in the subject module (note that neighbor views overlap with the center view). The boxes are the objects with the top three scores in the reference module. Attentions on the three types of phrases are also shown.

5.5. Qualitative Visualizations

Here we present some success and failure cases. Figure 5 shows two success cases. Taking the bottom one as an example, each module attends to the correct phrases, such as “go up”, “stairs” and “to the left of the refrigerator”, corresponding to the action, subject and reference modules, respectively. More visual information about the subject “stairs” from neighbor views is incorporated by our subject module. In addition, taking advantage of the reference module, our model can perceive the mentioned “refrigerator” in neighbor views as the reference and suppress unmentioned objects. We note that the attention scores of individual objects are not very high, which may mainly be caused by the fact that we consider 32 objects in total for each candidate (4 neighbors with 8 objects each).

Figure 6 visualizes a failure case. The model mainly focuses on the “bathroom” in the subject module, which leads to a wrong direction. This indicates that the attended phrases of NvEM are not always correct, and the model could thus be further improved by studying how to extract more accurate phrases.

6. Conclusion

In this paper, we present a novel multi-module Neighbor-view Enhanced Model to improve textual-visual matching by adaptively incorporating visual information from neighbor views. Our subject module aggregates neighbor views at a global level based on spatial affinity, and our reference module aggregates neighbor objects at a local level guided by referring phrases. Extensive experiments demonstrate that NvEM effectively improves the agent’s performance in unseen environments, and our method sets a new state-of-the-art.

Considering the similarity between the R2R task and other VLN tasks, such as dialogue navigation (Thomason et al., 2019; Nguyen and III, 2019), remote object detection (Qi et al., 2020b) and navigation in continuous spaces (Krantz et al., 2020), we believe the neighbor-view enhancement idea could also benefit agents in other embodied AI tasks. We leave this as future work.

7. Acknowledgments

This work was jointly supported by National Key Research and Development Program of China Grant No. 2018AAA0100400, National Natural Science Foundation of China (61525306, 61633021, 61721004, 61806194, U1803261, and 61976132), Beijing Nova Program (Z201100006820079), Shandong Provincial Key Research and Development Program (2019JZZY010119), Key Research Program of Frontier Sciences CAS Grant No.ZDBS-LY-JSC032, and CAS-AIR.

References

  • Anderson et al. (2018) Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3674–3683.
  • Chang et al. (2017) Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. 2017. Matterport3D: Learning from RGB-D Data in Indoor Environments. International Conference on 3D Vision (3DV) (2017).
  • Deng et al. (2020) Zhiwei Deng, Karthik Narasimhan, and Olga Russakovsky. 2020. Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Fried et al. (2018) Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. 2018. Speaker-Follower Models for Vision-and-Language Navigation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018. 3318–3329.
  • Hao et al. (2020) Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. 2020. Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020. 13134–13143.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016. 770–778.
  • Hong et al. (2020a) Yicong Hong, Cristian Rodriguez Opazo, Yuankai Qi, Qi Wu, and Stephen Gould. 2020a. Language and Visual Entity Relationship Graph for Agent Navigation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020.
  • Hong et al. (2020b) Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez Opazo, and Stephen Gould. 2020b. A Recurrent Vision-and-Language BERT for Navigation. CoRR abs/2011.13922 (2020).
  • Hu et al. (2019a) Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, and Kate Saenko. 2019a. Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019. 6551–6557.
  • Hu et al. (2019b) Ronghang Hu, Anna Rohrbach, Trevor Darrell, and Kate Saenko. 2019b. Language-Conditioned Graph Networks for Relational Reasoning. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019. 10293–10302.
  • Ilharco et al. (2019) Gabriel Ilharco, Vihan Jain, Alexander Ku, Eugene Ie, and Jason Baldridge. 2019. General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping. In Visually Grounded Interaction and Language (ViGIL), NeurIPS 2019 Workshop.
  • Jain et al. (2019) Vihan Jain, Gabriel Magalhães, Alexander Ku, Ashish Vaswani, Eugene Ie, and Jason Baldridge. 2019. Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019. 1862–1872.
  • Krantz et al. (2020) Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. 2020. Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments. In Computer Vision - ECCV 2020 - 16th European Conference. 104–120.
  • Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 123, 1 (2017), 32–73.
  • Landi et al. (2019) Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, and Rita Cucchiara. 2019. Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters. In 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019. 18.
  • Li et al. (2019) Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah A. Smith, and Yejin Choi. 2019. Robust Navigation with Language Pretraining and Stochastic Sampling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. 1494–1499.
  • Ma et al. (2019a) Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. 2019a. Self-Monitoring Navigation Agent via Auxiliary Progress Estimation. In 7th International Conference on Learning Representations, ICLR 2019.
  • Ma et al. (2019b) Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. 2019b. The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019. 6732–6740.
  • Majumdar et al. (2020) Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra. 2020. Improving Vision-and-Language Navigation with Image-Text Pairs from the Web. In Computer Vision - ECCV 2020 - 16th European Conference. 259–274.
  • Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016. 1928–1937.
  • Mun et al. (2020) Jonghwan Mun, Minsu Cho, and Bohyung Han. 2020. Local-Global Video-Text Interactions for Temporal Grounding. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020. 10807–10816.
  • Nguyen and III (2019) Khanh Nguyen and Hal Daumé III. 2019. Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019. 684–695.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014. 1532–1543.
  • Qi et al. (2020a) Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, and Qi Wu. 2020a. Object-and-Action Aware Model for Visual Language Navigation. In Computer Vision - ECCV 2020 - 16th European Conference. 303–317.
  • Qi et al. (2020b) Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. 2020b. REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020. 9979–9988.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015. 91–99.
  • Ruder (2016) Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. CoRR abs/1609.04747 (2016).
  • Tan et al. (2019) Hao Tan, Licheng Yu, and Mohit Bansal. 2019. Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019. 2610–2621.
  • Thomason et al. (2019) Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2019. Vision-and-Dialog Navigation. In 3rd Annual Conference on Robot Learning, CoRL 2019. 394–406.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 5998–6008.
  • Wang et al. (2020b) Hu Wang, Qi Wu, and Chunhua Shen. 2020b. Soft Expert Reward Learning for Vision-and-Language Navigation. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IX. 126–141.
  • Wang et al. (2019) Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. 2019. Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019. 6629–6638.
  • Wang et al. (2018) Xin Wang, Wenhan Xiong, Hongmin Wang, and William Yang Wang. 2018. Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XVI. 38–55.
  • Wang et al. (2020a) Xin Eric Wang, Vihan Jain, Eugene Ie, William Yang Wang, Zornitsa Kozareva, and Sujith Ravi. 2020a. Environment-Agnostic Multitask Learning for Natural Language Grounded Navigation. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIV. 413–430.
  • Yu et al. (2018) Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. 2018. MAttNet: Modular Attention Network for Referring Expression Comprehension. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018. 1307–1315.
  • Zhou et al. (2018) Bolei Zhou, Àgata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2018. Places: A 10 Million Image Database for Scene Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40, 6 (2018), 1452–1464.
  • Zhu et al. (2020) Fengda Zhu, Yi Zhu, Xiaojun Chang, and Xiaodan Liang. 2020. Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020. 10009–10019.