NvEM
The official code of the paper "Neighbor-view Enhanced Model for Vision and Language Navigation" (ACM MM 2021).
Vision and Language Navigation (VLN) requires an agent to navigate to a target location by following natural language instructions. Most existing works represent a navigation candidate by the feature of the single view in which the candidate lies. However, an instruction may mention landmarks outside that single view as references, which can cause the textual-visual matching of existing methods to fail. In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) to adaptively incorporate visual contexts from neighbor views for better textual-visual matching. Specifically, our NvEM utilizes a subject module and a reference module to collect contexts from neighbor views. The subject module fuses neighbor views at a global level, and the reference module fuses neighbor objects at a local level. Subjects and references are adaptively determined via attention mechanisms. Our model also includes an action module to exploit the strong orientation guidance (e.g., "turn left") in instructions. Each module predicts a navigation action separately, and their weighted sum is used for predicting the final action. Extensive experimental results demonstrate the effectiveness of the proposed method on the R2R and R4R benchmarks against several state-of-the-art navigators, and NvEM even beats some pre-training ones. Our code is available at https://github.com/MarSaKi/NvEM.
Vision and Language Navigation (VLN) has drawn increasing interest in recent years, partly because it represents a significant step towards enabling intelligent agents to interact with the real world. Running in a 3D simulator (Anderson et al., 2018) rendered with real-world images (Chang et al., 2017), the agent's goal in VLN is to navigate to a target location by following a detailed natural language instruction, such as "Walk around the table and exit the room. Walk down the first set of stairs. Wait there.". There are two kinds of simulators, which render continuous navigation trajectories (Krantz et al., 2020) and discrete trajectories (Anderson et al., 2018), respectively. In this paper we focus on the discrete one, where the agent navigates on a discrete graph (see Figure 1 (a)).
A variety of approaches have been proposed to address the VLN problem (Wang et al., 2018; Fried et al., 2018; Landi et al., 2019; Tan et al., 2019; Ma et al., 2019a; Wang et al., 2019; Li et al., 2019; Qi et al., 2020a; Hao et al., 2020; Majumdar et al., 2020; Wang et al., 2020a, b; Hong et al., 2020a; Deng et al., 2020; Hong et al., 2020b). Most of them adopt the panoramic action space (Fried et al., 2018), where the agent selects a navigable candidate from its observations and moves to it at each step. However, the context of navigable candidates is rarely discussed in existing works, and the commonly used single-view candidates provide only limited visual context, which may hamper the matching between instructions and the visual representations of candidates. Figure 1 shows such an example, where there are three candidate "archways", each of which is represented by a single-view visual perception (Figure 1 (b)). According to the instruction, only the archway "to the left of mirror" leads to the correct navigation. However, most existing agents may fail because they cannot find the referred "mirror" in any single-view-based candidate.
Thus, we propose to enhance textual-visual matching by fusing visual information from candidates' neighbor views, as shown in Figure 1 (c), which has rarely been explored before. Fusing neighbor views for visual context modeling is non-trivial, because many unmentioned visual clues exist that may interfere with the agent's decision (e.g., the lamp in Figure 1 (c)). In addition, some instructions do not even involve visual clues in neighbor views, such as "go through the doorway". To handle this challenging problem, we propose to decompose an instruction into action-, subject- and reference-related phrases, as shown in Figure 2. Generally, the action and subject are necessary, and the optional reference helps to distinguish the desired candidate from other similar ones.
Based on the above three types of instruction phrases, we further design a multi-module Neighbor-view Enhanced Model (NvEM) to adaptively fuse neighbor visual contexts in order to improve the textual-visual matching between instructions and candidates' visual perceptions. Specifically, our NvEM includes a subject module, a reference module and an action module, where subjects and references are determined via attention mechanisms. On one hand, the subject module aggregates neighbor views at a global level based on spatial information. On the other hand, the reference module aggregates related objects from neighbor views at a local level. The action module makes use of the orientation guidance (e.g., "turn left") in instructions. Each module predicts a navigation action separately, and their weighted sum is used to predict the final action. Note that the combination weights are trainable and predicted based on the decomposed subject-, reference- and action-related phrases.
The contributions of this work are summarized as follows:
To improve the textual-visual matching between instructions and navigable candidates, we propose to take into account the visual contexts from neighbor views for the first time.
We propose a subject module and a reference module to adaptively fuse visual contexts from neighbor views at both the global and the local level.
Extensive experimental results demonstrate the effectiveness of the proposed method with comparisons against several existing state-of-the-art methods, and NvEM even beats some pre-training ones.
Vision and Language Navigation. Numerous approaches have been proposed to address the VLN problem. Most of them are based on the CNN-LSTM architecture with attention mechanisms: at each time step, the agent first grounds surrounding observations to the instruction, then chooses the best-matched candidate according to the grounded instruction as the next location. An early work, Speaker-Follower (Fried et al., 2018), develops a speaker model to synthesize new instructions for randomly sampled trajectories. Additionally, they design a panoramic action space for efficient navigation. Later on, EnvDrop (Tan et al., 2019) increases the diversity of synthetic data by randomly removing objects to generate "new environments".
Along another line, Self-monitoring (Ma et al., 2019a) and RCM (Wang et al., 2019) utilize a cross-modality co-attention mechanism to enhance the alignment between instructions and trajectories. To learn generic linguistic and visual representations for VLN, AuxRN (Zhu et al., 2020) designs several auxiliary self-supervised losses. Very recently, large-scale pre-training models for VLN have been widely explored (Li et al., 2019; Hao et al., 2020; Majumdar et al., 2020; Hong et al., 2020b); they improve the agent's generalization abilities dramatically by benefitting from priors of other datasets. Since different types of visual clues correspond to different phrases in an instruction, OAAM (Qi et al., 2020a) and RelGraph (Hong et al., 2020a) utilize decomposed phrases to guide more accurate action prediction. OAAM (Qi et al., 2020a) adopts action- and object-specialized clues to vote for the action at each time step, while RelGraph (Hong et al., 2020a) proposes a graph network to model the intra- and inter-relationships among contextual and visual clues. The most relevant work to ours is RelGraph (Hong et al., 2020a), as we both attempt to exploit view-level and object-level features. The key difference is that we focus on enhancing each candidate's representation with its multiple neighbor views (namely inter-view), while the representation in RelGraph is limited to a single view and is thus intra-view.
Modular Attention Networks.
Modular networks are widely adopted in vision and language models. They attempt to decompose sentences into multiple phrases via attention mechanisms, since different phrases usually correspond to different visual clues. MAttNet (Yu et al., 2018) decomposes a long sentence into three parts (appearance, location and relationship) for referring expression comprehension. LCGN (Hu et al., 2019b) utilizes a multi-step textual attention mechanism to extract different object-related phrases, then models objects' contexts via the relations among the phrases. LGI (Mun et al., 2020) utilizes a sequential query attention module to decompose the query into multiple semantic phrases, then uses these phrases to interact with video clips for video grounding. To the best of our knowledge, OAAM (Qi et al., 2020a) is the earliest attempt to decompose instructions in VLN. They decompose instructions into "action"- and "object"-specialized phrases, and use these phrases to vote for the next action. Our work has two key differences from OAAM: (I) different modules: our modules are (subject, reference, action) vs. (object, action) of OAAM; (II) our subject module and reference module fuse information from neighbor views, while OAAM only uses information within one single view.

In VLN (Anderson et al., 2018), given a natural language instruction with $L$ words, an agent navigates on a discrete graph to reach the described target by following the instruction. At each time step $t$, the agent observes a panorama which consists of 36 discrete views. Each view is represented by an image $v_{t,i}$, together with its orientation consisting of heading $\theta_{t,i}$ and elevation $\phi_{t,i}$. Also, there are $N_t$ candidates at time step $t$, and each candidate is represented by the single view in which the candidate lies, together with the candidate's relative orientation to the agent. Formally, for the $i$-th view:
$$f_{t,i} = [\,\mathrm{ResNet}(v_{t,i})\,;\, E(\theta_{t,i}, \phi_{t,i})\,] \tag{1}$$

where $\mathrm{ResNet}(\cdot)$ represents ResNet (He et al., 2016) pooling features and $E(\cdot,\cdot)$ is an embedding function for heading and elevation, which repeats $(\cos\theta_{t,i}, \sin\theta_{t,i}, \cos\phi_{t,i}, \sin\phi_{t,i})$ 32 times following (Tan et al., 2019). The $k$-th candidate $c_{t,k}$ is encoded in the same way.
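As a concrete illustration of this encoding, the sketch below builds a view feature from a pooled ResNet feature and the repeated orientation tuple; the dimensions (2048-d visual feature, 128-d orientation embedding) follow common EnvDrop-style settings and are assumptions rather than code from the released repository.

```python
import math
import torch

def orientation_embedding(heading, elevation, repeat=32):
    """Tile (cos, sin, cos, sin) of heading/elevation `repeat` times -> 4 * repeat dims."""
    base = torch.tensor([math.cos(heading), math.sin(heading),
                         math.cos(elevation), math.sin(elevation)])
    return base.repeat(repeat)  # shape: (128,) for repeat=32

def view_feature(resnet_feat, heading, elevation):
    """Concatenate a pooled ResNet feature with the orientation embedding (Eq. 1)."""
    return torch.cat([resnet_feat, orientation_embedding(heading, elevation)], dim=-1)

# e.g. a 2048-d pooled feature becomes a 2048 + 128 = 2176-d view feature
feat = view_feature(torch.randn(2048), heading=0.5, elevation=0.0)
```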
Previous works show that data augmentation is able to significantly improve the generalization ability in unseen environments (Fried et al., 2018; Tan et al., 2019). We make use of this strategy by adopting EnvDrop (Tan et al., 2019) as our baseline, which first uses a bi-directional LSTM to encode the instruction into $U = \{u_1, \dots, u_L\}$. Then the agent's previous context-aware state $\tilde{h}_{t-1}$ is used to attend over all views to obtain the scene feature $\tilde{f}_t = \mathrm{SoftAttn}(\tilde{h}_{t-1}, \{f_{t,i}\}_{i=1}^{36})$. The concatenation of $\tilde{f}_t$ and the previous action embedding $\tilde{a}_{t-1}$ is fed into the decoder LSTM to update the agent's state: $h_t = \mathrm{LSTM}([\tilde{f}_t; \tilde{a}_{t-1}], h_{t-1})$. Note that the context-aware agent state is updated via the attentive instruction feature $\tilde{u}_t$:

$$\tilde{u}_t = \mathrm{SoftAttn}(h_t, U), \qquad \tilde{h}_t = \tanh\big(W_h\,[\tilde{u}_t; h_t]\big) \tag{2}$$

where $W_h$ is a trainable linear projection and $\mathrm{SoftAttn}(\cdot,\cdot)$ mentioned above denotes soft-dot attention. Finally, EnvDrop predicts the navigation action by selecting the candidate with the highest probability ($W_c$ is a trainable linear projection):

$$p_t(c_{t,k}) = \mathrm{softmax}_k\big(\tilde{h}_t^{\top} W_c\, c_{t,k}\big) \tag{3}$$
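The following sketch puts Eqs. (2)–(3) together as one decoding step of the EnvDrop-style baseline. The class, variable names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_dot_attention(query, keys, values=None):
    """Soft-dot attention: score `keys` (B, L, D) with `query` (B, D), pool `values`."""
    values = keys if values is None else values
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)        # (B, L)
    weights = F.softmax(scores, dim=1)
    return torch.bmm(weights.unsqueeze(1), values).squeeze(1), weights

class BaselineDecoderStep(nn.Module):
    """One decoding step of the EnvDrop-style baseline (Eqs. 2-3)."""
    def __init__(self, dim=512, feat_dim=2176):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim * 2, dim)        # input: [scene feature; prev action feature]
        self.W_v = nn.Linear(feat_dim, dim, bias=False)   # projects view features for attention
        self.W_c = nn.Linear(feat_dim, dim, bias=False)   # projects candidate features in Eq. (3)
        self.W_h = nn.Linear(2 * dim, dim, bias=False)    # combines [u_t; h_t] in Eq. (2)

    def forward(self, prev_tilde_h, views, instr, prev_action_feat, state):
        # attend over the 36 views with the previous context-aware state
        scene, _ = soft_dot_attention(prev_tilde_h, self.W_v(views), values=views)
        h, c = self.lstm(torch.cat([scene, prev_action_feat], dim=1), state)
        u, _ = soft_dot_attention(h, instr)                          # attentive instruction feature
        tilde_h = torch.tanh(self.W_h(torch.cat([u, h], dim=1)))     # Eq. (2)
        return tilde_h, (h, c)

    def action_logits(self, tilde_h, candidates):
        # Eq. (3): dot product between projected candidate features and the context-aware state
        return torch.bmm(self.W_c(candidates), tilde_h.unsqueeze(2)).squeeze(2)
```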
In this section, we first briefly describe the pipeline of the proposed model in the training phase, and then detail the proposed Neighbor-view Enhanced Model (NvEM). Note that in this section, we omit the time step $t$ to avoid notational clutter in the exposition.
Figure 3 illustrates the main pipeline of our NvEM. First, action-, subject- and reference-related phrases are attended by three independent attention schemes. Then, the reference module and the subject module predict navigation actions by aggregating visual contexts from candidates' neighbor views, while the action module predicts navigation actions based on orientation information. Lastly, the final navigation action is determined by combining these three predictions with weights generated from the phrase embeddings.
Considering an instruction such as "walk through the archway to the left of the mirror…", there are three types of phrases that the agent needs to identify: the action, which describes the orientation of the target candidate (e.g., "walk through"); the subject, which describes the main visual entity of the correct navigation (e.g., "the archway"); and the reference, which the subject refers to (e.g., "to the left of the mirror"). Thus, NvEM first performs three independent soft-attentions on the instruction, conditioned on the current agent state $h$, to attend to these three types of phrases:
$$\tilde{u}_m = \mathrm{SoftAttn}(W_m h, U), \qquad \tilde{h}_m = \tanh\big(W'_m\,[\tilde{u}_m; h]\big), \qquad m \in \{act, sub, ref\} \tag{4}$$

where the subscripts denote the corresponding types of phrases and $\mathrm{SoftAttn}(\cdot,\cdot)$ is the same as in Eq. (2). $\tilde{u}_m$ and $\tilde{h}_m$ denote the features of the corresponding phrases and the context-aware agent states, and they are obtained via different linear projections $W_m$ and $W'_m$. The global context-aware agent state $\tilde{h}$ in Eq. (2) is now calculated by averaging the three specialized context-aware states $\tilde{h}_{act}$, $\tilde{h}_{sub}$ and $\tilde{h}_{ref}$.
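A sketch of this three-way phrase decomposition is given below, assuming 512-d hidden states; the per-module projections and the averaging of specialized states follow Eq. (4), while the class and key names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseDecomposition(nn.Module):
    """Three independent soft-attentions over the encoded instruction (Eq. 4)."""
    def __init__(self, dim=512):
        super().__init__()
        self.query_proj = nn.ModuleDict({m: nn.Linear(dim, dim, bias=False)
                                         for m in ("act", "sub", "ref")})
        self.state_proj = nn.ModuleDict({m: nn.Linear(2 * dim, dim, bias=False)
                                         for m in ("act", "sub", "ref")})

    def forward(self, h, instr):
        """h: (B, D) agent state; instr: (B, L, D) encoded instruction tokens."""
        phrases, states = {}, {}
        for m in ("act", "sub", "ref"):
            q = self.query_proj[m](h)                                      # module-specific query
            w = F.softmax(torch.bmm(instr, q.unsqueeze(2)).squeeze(2), 1)  # attention over words
            u = torch.bmm(w.unsqueeze(1), instr).squeeze(1)                # attended phrase feature
            phrases[m] = u
            states[m] = torch.tanh(self.state_proj[m](torch.cat([u, h], dim=1)))
        # the global context-aware state is the average of the three specialized states
        tilde_h = torch.stack(list(states.values()), dim=0).mean(dim=0)
        return phrases, states, tilde_h
```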
Corresponding to the attended three types of phrases in instructions, our neighbor-view enhanced navigator contains three modules: a reference module, a subject module and an action module. The reference module and subject module predict navigation actions via aggregating visual contexts from neighbor views at local and global levels, respectively. The action module predicts navigation actions according to orientation information. We provide the details below.
Reference Module. A reference usually exists as a landmark surrounding the subject to disambiguate similar navigation candidates. In the example "walk through the archway to the left of the mirror…", the reference is "the mirror" and it is referred to with a spatial relationship to the subject (e.g., "to the left of"). This motivates us to enhance the representation of a navigation candidate by using features of local objects and their spatial relations.
Formally, for the $k$-th candidate $c_k$, we are given its orientation $(\theta_k, \phi_k)$, the objects' features $\{o_{j,l}\}$ extracted by Faster R-CNN (Ren et al., 2015), and the objects' orientations $\{(\theta_{j,l}, \phi_{j,l})\}$ in the candidate's neighbor views (assume there are $N_o$ objects in each neighbor and $j$ denotes the $j$-th neighbor). We then calculate the spatial relations of the neighbor objects to the $k$-th candidate:

$$r_{j,l} = E(\theta_{j,l} - \theta_k,\; \phi_{j,l} - \phi_k) \tag{5}$$

where $E(\cdot,\cdot)$ is the same as that in Eq. (1). Then each neighbor object is represented as the concatenation of its projected object feature and its relative spatial embedding:

$$\hat{o}_{j,l} = [\,g_o(o_{j,l})\,;\, r_{j,l}\,] \tag{6}$$

where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, $g_o$ projects objects' features into a $d$-dimensional space, and all $g_*$ in this section represent trainable non-linear projections (a linear layer followed by an activation function). We use the reference-related phrase feature $\tilde{u}_{ref}$ to highlight relevant objects in neighbor views:

$$\alpha_{j,l} = \mathrm{softmax}_{j,l}\big((W_q \tilde{u}_{ref})^{\top} W_k\, \hat{o}_{j,l}\big), \qquad \hat{c}^{\,ref}_k = \sum_{j,l} \alpha_{j,l}\, \hat{o}_{j,l} \tag{7}$$

where $W_q$ and $W_k$ are trainable linear projections. The reference module predicts the confidence $s^{ref}_k$ of the candidate being the next navigation action using the reference-related state $\tilde{h}_{ref}$ and the neighbor-reference enhanced candidate representation $\hat{c}^{\,ref}_k$:

$$s^{ref}_k = \tilde{h}_{ref}^{\top}\, W_r\, \hat{c}^{\,ref}_k \tag{8}$$

where $W_r$ is a trainable linear projection.
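The sketch below instantiates Eqs. (5)–(8) for a single candidate. The 300-d object features, the ReLU activation and the layer names are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def orient_embed(heading, elevation, repeat=32):
    """Tile (cos, sin, cos, sin) of heading/elevation `repeat` times, as in Eq. (1)."""
    base = torch.stack([heading.cos(), heading.sin(),
                        elevation.cos(), elevation.sin()], dim=-1)  # (N, 4)
    return base.repeat(1, repeat)                                   # (N, 128)

class ReferenceModule(nn.Module):
    """Reference module sketch (Eqs. 5-8): embed neighbor objects with their relative
    orientation, highlight them with the reference phrase, and score the candidate."""
    def __init__(self, obj_dim=300, dim=512, orient_dim=128):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, dim)                        # g_o in Eq. (6)
        self.key_proj = nn.Linear(dim + orient_dim, dim, bias=False)   # W_k in Eq. (7)
        self.qry_proj = nn.Linear(dim, dim, bias=False)                # W_q in Eq. (7)
        self.out_proj = nn.Linear(dim + orient_dim, dim, bias=False)   # W_r in Eq. (8)

    def forward(self, u_ref, h_ref, obj_feats, obj_head, obj_elev, cand_head, cand_elev):
        """obj_feats: (M, obj_dim) objects gathered from all neighbor views of one candidate."""
        rel = orient_embed(obj_head - cand_head, obj_elev - cand_elev)      # Eq. (5)
        objs = torch.cat([torch.relu(self.obj_proj(obj_feats)), rel], -1)   # Eq. (6)
        scores = self.key_proj(objs) @ self.qry_proj(u_ref)                 # Eq. (7), logits
        cand_ref = (F.softmax(scores, dim=0).unsqueeze(1) * objs).sum(0)    # weighted sum
        return h_ref @ self.out_proj(cand_ref)                              # Eq. (8)
```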
Subject Module. The instruction subject describes the main visual entity of the correct navigation, such as "the archway" in "walk through the archway to the left of the mirror…". However, sometimes multiple candidates contain different instances of the subject. To alleviate this ambiguity on the visual side, we propose to enhance the visual representation of the subject by incorporating contexts from neighbor views. Specifically, we aggregate neighbor views at a global level, with the help of the spatial affinities of neighbor views to the candidate.
Formally, for the $k$-th candidate $c_k$, given its orientation $(\theta_k, \phi_k)$, its neighbor views' orientations $\{(\theta_j, \phi_j)\}$ and their ResNet features $\{v_j\}$ (assume it has $N_v$ neighbor views), we first embed all neighbor views using a trainable non-linear projection:

$$\hat{v}_j = g_v\big([\,v_j\,;\, E(\theta_j, \phi_j)\,]\big) \tag{9}$$

Then we compute the spatial affinities between the candidate and its neighbor views based on their orientations in a query-key manner (Vaswani et al., 2017), and the enhanced subject visual representation is obtained by adaptively aggregating the neighbor views' embeddings:

$$\beta_j = \mathrm{softmax}_j\big((W_{q'} E(\theta_k, \phi_k))^{\top} W_{k'} E(\theta_j, \phi_j)\big), \qquad \hat{c}^{\,sub}_k = \sum_j \beta_j\, \hat{v}_j \tag{10}$$

where $W_{q'}$ and $W_{k'}$ are trainable linear projections. Similar to the reference module, the subject module predicts the confidence $s^{sub}_k$ of the candidate being the next navigation action via:

$$s^{sub}_k = \tilde{h}_{sub}^{\top}\, W_s\, \hat{c}^{\,sub}_k \tag{11}$$

where $W_s$ is a trainable linear projection.
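A corresponding sketch of Eqs. (9)–(11) for a single candidate follows; the activation choice and the exact projection shapes are assumptions, and the orientation embeddings are assumed to be precomputed as in Eq. (1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubjectModule(nn.Module):
    """Subject module sketch (Eqs. 9-11): fuse neighbor views with weights derived from
    their spatial affinity to the candidate, computed in a query-key manner."""
    def __init__(self, feat_dim=2048, dim=512, orient_dim=128):
        super().__init__()
        self.view_proj = nn.Linear(feat_dim + orient_dim, dim)   # g_v in Eq. (9)
        self.q_proj = nn.Linear(orient_dim, dim, bias=False)     # candidate orientation -> query
        self.k_proj = nn.Linear(orient_dim, dim, bias=False)     # neighbor orientations -> keys
        self.out_proj = nn.Linear(dim, dim, bias=False)          # W_s in Eq. (11)

    def forward(self, h_sub, neigh_feats, neigh_orient, cand_orient):
        """neigh_feats: (K, feat_dim) ResNet features of K neighbor views;
        neigh_orient: (K, orient_dim) and cand_orient: (orient_dim,) orientation embeddings."""
        views = torch.relu(self.view_proj(torch.cat([neigh_feats, neigh_orient], -1)))     # Eq. (9)
        affinity = F.softmax(self.k_proj(neigh_orient) @ self.q_proj(cand_orient), dim=0)  # Eq. (10)
        cand_sub = (affinity.unsqueeze(1) * views).sum(0)         # neighbor-enhanced subject feature
        return h_sub @ self.out_proj(cand_sub)                    # Eq. (11)
```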
Action Module. Action-related phrases serve as strong guidance in navigation (Hu et al., 2019a; Qi et al., 2020a), such as "go forward", "turn left" and "go down". Inspired by (Qi et al., 2020a), for the $k$-th candidate $c_k$, the action module predicts the confidence $s^{act}_k$ of the candidate being the next navigation action, using the candidate's orientation embedding $E(\theta_k, \phi_k)$ and the action-related state $\tilde{h}_{act}$:

$$s^{act}_k = \tilde{h}_{act}^{\top}\, W_a\, E(\theta_k, \phi_k) \tag{12}$$

where $W_a$ is a trainable linear projection.
The action, subject and reference modules usually contribute to different degrees to the final decision. Thus, we propose to adaptively integrate the predictions from these three modules. We first calculate the combination weights conditioned on the action-, subject- and reference-specialized phrase features $\tilde{u}_{act}$, $\tilde{u}_{sub}$ and $\tilde{u}_{ref}$. Taking the action weight as an example:

$$\lambda_{act} = \frac{\exp(w_{act}^{\top}\tilde{u}_{act})}{\exp(w_{act}^{\top}\tilde{u}_{act}) + \exp(w_{sub}^{\top}\tilde{u}_{sub}) + \exp(w_{ref}^{\top}\tilde{u}_{ref})} \tag{13}$$

where $w_{act}$, $w_{sub}$ and $w_{ref}$ are trainable linear projections. Then the final action probability of the $k$-th candidate is calculated by the weighted summation of the above confidences:

$$p(c_k) = \mathrm{softmax}_k\big(\lambda_{act}\, s^{act}_k + \lambda_{sub}\, s^{sub}_k + \lambda_{ref}\, s^{ref}_k\big) \tag{14}$$
In the inference phase, the agent selects the candidate with the maximum probability as shown in Eq (3) at each step.
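The sketch below combines the action score with the adaptive fusion. Normalizing the three weights with a softmax over the phrase projections is our reading of Eq. (13), and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModuleFusion(nn.Module):
    """Action score (Eq. 12) plus adaptive combination of the three modules (Eqs. 13-14)."""
    def __init__(self, dim=512, orient_dim=128):
        super().__init__()
        self.act_proj = nn.Linear(orient_dim, dim, bias=False)                 # W_a in Eq. (12)
        self.weight_proj = nn.ModuleDict({m: nn.Linear(dim, 1) for m in ("act", "sub", "ref")})

    def forward(self, phrases, h_act, cand_orient, s_sub, s_ref):
        """phrases: dict of (D,) phrase features; cand_orient: (N, orient_dim) for N candidates;
        s_sub, s_ref: (N,) confidences from the subject and reference modules."""
        s_act = self.act_proj(cand_orient) @ h_act                              # Eq. (12)
        # Eq. (13): combination weights predicted from the attended phrase features
        lam = F.softmax(torch.cat([self.weight_proj[m](phrases[m])
                                   for m in ("act", "sub", "ref")]), dim=0)
        # Eq. (14): weighted sum of module confidences -> final action distribution
        logits = lam[0] * s_act + lam[1] * s_sub + lam[2] * s_ref
        return F.softmax(logits, dim=0)
```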
We apply the Imitation Learning (IL) + Reinforcement Learning (RL) objectives to train our model, following (Tan et al., 2019). In imitation learning, the agent takes the teacher action $a^*_t$ at each time step to learn to follow the ground-truth trajectory. In reinforcement learning, the agent samples an action $a^s_t$ from the predicted probability and learns from the rewards. Formally:

$$\mathcal{L} = \lambda \sum_{t=1}^{T_{IL}} -\log p_t(a^*_t) \;+\; \sum_{t=1}^{T_{RL}} -A_t \log p_t(a^s_t) \tag{15}$$

where $\lambda$ is a coefficient for weighting the IL loss, $T_{IL}$ and $T_{RL}$ are the total numbers of steps the agent takes in IL and RL respectively, and $A_t$ is the advantage in the A2C algorithm (Mnih et al., 2016). We apply the summation of two types of rewards in the RL objective, a goal-oriented reward and a fidelity-oriented reward, following (Jain et al., 2019).
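A minimal sketch of this mixed objective is shown below; the IL coefficient value is an assumption, not the one used in the paper.

```python
import torch

def mixed_loss(il_log_probs, rl_log_probs, advantages, lam_il=0.2):
    """IL + RL objective of Eq. (15). `il_log_probs` are log-probabilities of teacher actions,
    `rl_log_probs` those of sampled actions, `advantages` the A2C advantages; `lam_il` is an
    assumed weighting coefficient."""
    il_loss = -torch.stack(il_log_probs).sum()                 # imitation: follow the teacher path
    rl_loss = -(torch.stack(rl_log_probs)                      # reinforcement: policy gradient
                * torch.stack(advantages).detach()).sum()
    return rl_loss + lam_il * il_loss
```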
Table 1. Comparison with state-of-the-art methods on R2R. Columns 2–5 report Val Seen, columns 6–9 Val Unseen, and columns 10–13 Test Unseen; each group lists TL, NE, SR and SPL.

Agent | TL | NE | SR | SPL | TL | NE | SR | SPL | TL | NE | SR | SPL
---|---|---|---|---|---|---|---|---|---|---|---|---
Random | 9.58 | 9.45 | 0.16 | - | 9.77 | 9.23 | 0.16 | - | 9.89 | 9.79 | 0.13 | 0.12 |
Human | - | - | - | - | - | - | - | - | 11.85 | 1.61 | 0.86 | 0.76 |
PRESS (Li et al., 2019) | 10.57 | 4.39 | 0.58 | 0.55 | 10.36 | 5.28 | 0.49 | 0.45 | 10.77 | 5.49 | 0.49 | 0.45 |
PREVALENT (Hao et al., 2020) | 10.32 | 3.67 | 0.69 | 0.65 | 10.19 | 4.71 | 0.58 | 0.53 | 10.51 | 5.30 | 0.54 | 0.51 |
VLNBert (init. OSCAR) (Hong et al., 2020b) | 10.79 | 3.11 | 0.71 | 0.67 | 11.86 | 4.29 | 0.59 | 0.53 | 12.34 | 4.59 | 0.57 | 0.53 |
VLNBert (init. PREVALENT) (Hong et al., 2020b) | 11.13 | 2.90 | 0.72 | 0.68 | 12.01 | 3.93 | 0.63 | 0.57 | 12.35 | 4.09 | 0.63 | 0.57 |
Seq2Seq (Anderson et al., 2018) | 11.33 | 6.01 | 0.39 | - | 8.39 | 7.81 | 0.22 | - | 8.13 | 7.85 | 0.20 | 0.18 |
Speaker-Follower (Fried et al., 2018) | - | 3.36 | 0.66 | - | - | 6.62 | 0.35 | - | 14.82 | 6.62 | 0.35 | 0.28 |
SM (Ma et al., 2019a) | - | 3.22 | 0.67 | 0.58 | - | 5.52 | 0.45 | 0.32 | 18.04 | 5.67 | 0.48 | 0.35 |
RCM+SIL (Wang et al., 2019) | 10.65 | 3.53 | 0.67 | - | 11.46 | 6.09 | 0.43 | - | 11.97 | 6.12 | 0.43 | 0.38 |
Regretful (Ma et al., 2019b) | - | 3.23 | 0.69 | 0.63 | - | 5.32 | 0.50 | 0.41 | 13.69 | 5.69 | 0.48 | 0.40 |
VLNBert (no init.) (Hong et al., 2020b) | 9.78 | 3.92 | 0.62 | 0.59 | 10.31 | 5.10 | 0.50 | 0.46 | 11.15 | 5.45 | 0.51 | 0.47 |
EnvDrop (Tan et al., 2019) | 11.00 | 3.99 | 0.62 | 0.59 | 10.70 | 5.22 | 0.52 | 0.48 | 11.66 | 5.23 | 0.51 | 0.47 |
AuxRN (Zhu et al., 2020) | - | 3.33 | 0.70 | 0.67 | - | 5.28 | 0.55 | 0.50 | - | 5.15 | 0.55 | 0.51 |
RelGraph (Hong et al., 2020a) | 10.13 | 3.47 | 0.67 | 0.65 | 9.99 | 4.73 | 0.57 | 0.53 | 10.29 | 4.75 | 0.55 | 0.52 |
NvEM (ours) | 11.09 | 3.44 | 0.69 | 0.65 | 11.83 | 4.27 | 0.60 | 0.55 | 12.98 | 4.37 | 0.58 | 0.54 |
Table 2. Comparison with state-of-the-art methods on R4R. Columns 2–7 report Val Seen and columns 8–13 Val Unseen; each group lists NE, SR, SPL, CLS, nDTW and sDTW.

Agent | NE | SR | SPL | CLS | nDTW | sDTW | NE | SR | SPL | CLS | nDTW | sDTW
---|---|---|---|---|---|---|---|---|---|---|---|---
EnvDrop (Tan et al., 2019) | - | 0.52 | 0.41 | 0.53 | - | 0.27 | - | 0.29 | 0.18 | 0.34 | - | 0.09 |
RCM-a (goal) (Jain et al., 2019) | 5.11 | 0.56 | 0.32 | 0.40 | - | - | 8.45 | 0.29 | 0.10 | 0.20 | - | - |
RCM-a (fidelity) (Jain et al., 2019) | 5.37 | 0.53 | 0.31 | 0.55 | - | - | 8.08 | 0.26 | 0.08 | 0.35 | - | - |
RCM-b (goal) (Ilharco et al., 2019) | - | - | - | - | - | - | - | 0.29 | 0.15 | 0.33 | 0.27 | 0.11 |
RCM-b (fidelity) (Ilharco et al., 2019) | - | - | - | - | - | - | - | 0.29 | 0.21 | 0.35 | 0.30 | 0.13 |
OAAM (Qi et al., 2020a) | - | 0.56 | 0.49 | 0.54 | - | 0.32 | - | 0.31 | 0.23 | 0.40 | - | 0.11 |
RelGraph (Hong et al., 2020a) | 5.14 | 0.55 | 0.50 | 0.51 | 0.48 | 0.35 | 7.55 | 0.35 | 0.25 | 0.37 | 0.32 | 0.18 |
NvEM (ours) | 5.38 | 0.54 | 0.47 | 0.51 | 0.48 | 0.35 | 6.85 | 0.38 | 0.28 | 0.41 | 0.36 | 0.20 |
In this section, we first describe the commonly used VLN datasets and the evaluation metrics. Then we present implementation details of NvEM. Finally, we compare against several state-of-the-art methods and provide ablation experiments. Qualitative visualizations are also presented.
R2R benchmark. The Room-to-Room (R2R) dataset (Anderson et al., 2018) consists of 10,567 panoramic view nodes in 90 real-world environments, as well as 7,189 trajectories, each described by three natural language instructions. The dataset is split into train, validation seen, validation unseen and test unseen sets. We follow the standard metrics employed by previous works to evaluate the performance of our agent. These metrics include: the Trajectory Length (TL), which measures the average length of the agent's navigation path; the Navigation Error (NE), which is the average distance between the agent's final location and the target; the Success Rate (SR), which measures the ratio of trajectories where the agent stops within 3 meters of the target; and the Success Rate weighted by Path Length (SPL), which considers both path length and success rate. Note that SR and SPL in unseen environments are the main metrics for R2R.
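For clarity, SR and SPL can be computed as in the sketch below; the episode field names are illustrative, and SPL follows the standard definition that weights success by the ratio of the shortest-path length to the longer of the taken and shortest paths.

```python
def success_rate_and_spl(episodes, threshold=3.0):
    """Compute SR and SPL over a list of episodes. Each episode is a dict with
    `nav_error` (final distance to goal), `path_length` (agent path length) and
    `shortest_length` (shortest-path length), all in meters (illustrative field names)."""
    sr, spl = 0.0, 0.0
    for ep in episodes:
        success = float(ep["nav_error"] <= threshold)
        sr += success
        spl += success * ep["shortest_length"] / max(ep["path_length"], ep["shortest_length"])
    n = len(episodes)
    return sr / n, spl / n
```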
R4R benchmark. The Room-for-Room (R4R) dataset (Jain et al., 2019) is an extended version of R2R, which has longer instructions and trajectories. The dataset is split into train, validation seen and validation unseen sets. Besides the main metrics of R2R, R4R includes additional metrics: the Coverage weighted by Length Score (CLS) (Jain et al., 2019), the normalized Dynamic Time Warping (nDTW) (Ilharco et al., 2019) and the nDTW weighted by Success Rate (sDTW) (Ilharco et al., 2019). In R4R, SR and SPL measure the accuracy of navigation, while CLS, nDTW and sDTW measure the fidelity between predicted paths and ground-truth paths.
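A sketch of nDTW and sDTW, following the definitions of Ilharco et al. (2019), is given below; `dist` is an assumed helper returning the geodesic distance between two graph nodes.

```python
import math

def ndtw(pred_path, gt_path, dist, threshold=3.0):
    """nDTW: dynamic time warping cost between predicted and ground-truth paths,
    normalized by the reference path length and the 3 m success threshold."""
    n, m = len(pred_path), len(gt_path)
    dtw = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(pred_path[i - 1], gt_path[j - 1])
            dtw[i][j] = cost + min(dtw[i - 1][j], dtw[i][j - 1], dtw[i - 1][j - 1])
    return math.exp(-dtw[n][m] / (m * threshold))

def sdtw(pred_path, gt_path, dist, threshold=3.0):
    """sDTW: nDTW gated by navigation success (final node within the threshold of the goal)."""
    success = dist(pred_path[-1], gt_path[-1]) <= threshold
    return float(success) * ndtw(pred_path, gt_path, dist, threshold)
```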
We use ResNet-152 (He et al., 2016) pre-trained on Places365 (Zhou et al., 2018) to extract view features. We apply Faster R-CNN (Ren et al., 2015) pre-trained on the Visual Genome dataset (Krishna et al., 2017) to obtain object labels for the reference module, which are then encoded by GloVe (Pennington et al., 2014). To simplify the object vocabulary, we retain the top 100 most frequent classes mentioned in the R2R training data following (Hong et al., 2020a). Our method exploits neighbor views to represent a navigation candidate, thus the numbers of neighbors and objects are crucial for the final performance. Our default setting adopts 4 neighbor views and the top 8 detected objects in each neighbor view (we also tried other settings, please see Sec 5.4). For simplicity, the objects' positions are roughly represented by their corresponding views' orientations in Eq (5). This is reasonable as instructions usually mention approximate relative relations between objects and candidates.
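As a hedged illustration of picking a candidate's neighbor views in the 36-view panorama (12 headings × 3 elevations), one possible choice of the 4 neighbors is the left/right/up/down views, as sketched below; the exact neighbor definition in the released code may differ.

```python
def neighbor_view_indices(view_idx, num_headings=12, num_elevations=3):
    """Return up to 4 neighbor view indices (left, right, down, up) of a view in the
    36-view panorama; headings wrap around, elevations are clipped at the top/bottom."""
    elev, head = divmod(view_idx, num_headings)
    neighbors = [elev * num_headings + (head - 1) % num_headings,   # left
                 elev * num_headings + (head + 1) % num_headings]   # right
    if elev > 0:
        neighbors.append((elev - 1) * num_headings + head)          # down
    if elev < num_elevations - 1:
        neighbors.append((elev + 1) * num_headings + head)          # up
    return neighbors
```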
Our model is trained with the widely used two-stage strategy (Tan et al., 2019). At the first stage, only real training data is used. At the second stage, we pick the model with the highest SR from the first stage and keep training it with both real data and synthetic data generated by (Fried et al., 2018). For the R4R experiments, we only apply the first-stage training. We set the projection dimension in Sec 4.3 to 512 and the GloVe embedding dimension to 300, and train the agent using the RMSprop optimizer (Ruder, 2016). We train the first stage for 80,000 iterations, and the second stage continues training for up to 200,000 iterations. All our experiments are conducted on an NVIDIA V100 Tensor Core GPU.
We compare NvEM with several SoTA methods under the single-run setting on both the R2R and R4R benchmarks. Note that VLN mainly focuses on the agent's performance on unseen splits, so the performance we report is based on the model which has the highest SR on the validation unseen split.
As shown in Table 1, on the R2R benchmark our NvEM outperforms the baseline EnvDrop (Tan et al., 2019) by a large margin, obtaining absolute improvements in terms of SPL on both the Val Unseen and Test splits. Compared to the state-of-the-art method RelGraph (Hong et al., 2020a), our model also obtains absolute improvements on the Val Unseen and Test splits. Moreover, NvEM even beats some pre-training methods, such as PRESS (Li et al., 2019), PREVALENT (Hao et al., 2020), and VLNBert (init. OSCAR) (Hong et al., 2020b). We also note that our method does not surpass VLNBert pre-trained on PREVALENT, which uses in-domain data for pretraining. As shown later in the ablation study, the success of our model mainly benefits from incorporating visual contexts from neighbor views, which effectively improves textual-visual matching and leads to more accurate actions.
On the R4R benchmark, we observe a similar phenomenon to that on R2R. As shown in Table 2, NvEM not only significantly outperforms the baseline EnvDrop (Tan et al., 2019), but also sets a new SoTA. In particular, NvEM achieves 0.41 CLS, 0.36 nDTW and 0.20 sDTW, which are higher than the second-best RelGraph (Hong et al., 2020a) by 4%, 4% and 2%, respectively.
We conduct ablation experiments over different components of NvEM on the R2R dataset. Specifically, we study how the action, subject and reference modules contribute to navigation. Then we compare the single-view subject module against the neighbor-view one. Lastly, we study how the numbers of neighbor views and objects in the subject and reference modules affect the performance. All our ablation models are trained from scratch with the two-stage training.
The importance of different modules. Our full model utilizes three modules: the action, subject and reference modules, which correspond to orientations, global views and local objects, respectively. To study how they affect navigation performance, we conduct an ablation experiment by removing the corresponding module. The results are shown in Table 3. In model #1, we remove the action module, and it achieves the worst performance compared to the others in terms of the main metric SPL. This indicates that orientations are strong guidance for navigation; the same phenomenon is observed in (Hu et al., 2019a; Qi et al., 2020a). In model #2, we remove the subject module, and it performs slightly better than model #1 but still falls far behind the full model. This indicates that global view information is important for VLN. In model #3, we remove the reference module, and it performs slightly worse than the full model. This indicates that the reference contains some useful information but is not as important as the subject. Another reason is that the local reference can also be included in the global views of the subject module.
Table 3. Ablation of the three modules on R2R (SR/SPL on Val Seen and Val Unseen).

model | action | subject | reference | SR (Seen) | SPL (Seen) | SR (Unseen) | SPL (Unseen)
---|---|---|---|---|---|---|---
1 | | ✓ | ✓ | 0.623 | 0.584 | 0.486 | 0.440
2 | ✓ | | ✓ | 0.580 | 0.523 | 0.504 | 0.446
3 | ✓ | ✓ | | 0.666 | 0.636 | 0.579 | 0.539
4 | ✓ | ✓ | ✓ | 0.686 | 0.645 | 0.601 | 0.549
Single-view subject vs neighbor-view subject. In the subject module, we aggregate neighbor view features based on their spatial affinities to the candidate, which raises the following questions: is neighbor-view more effective than single-view? How about aggregating views in other manners? To answer these questions, we test different types of subject modules, and the results are shown in Table 4. Note that we remove the reference module to exclude the effect of reference. In Table 4, single uses single-view global features to represent the subject, lang uses subject-aware phrases as the query and view features as keys to ground views in Eq (10), while spa is our default setting, which is based on spatial affinity. The results show that both model #2 and model #3 perform better than the single-view based model #1, which indicates the superiority of neighbor-view based models. Moreover, we observe that language-grounded neighbor-view fusion (model #2) performs worse than the spatial-based one (model #3). This may be caused by the gap between language embeddings and visual embeddings.
Table 4. Single-view vs neighbor-view subject module on R2R (reference module removed).

model | single | lang | spa | SR (Seen) | SPL (Seen) | SR (Unseen) | SPL (Unseen)
---|---|---|---|---|---|---|---
1 | ✓ | | | 0.641 | 0.605 | 0.556 | 0.507
2 | | ✓ | | 0.637 | 0.608 | 0.564 | 0.522
3 | | | ✓ | 0.666 | 0.636 | 0.579 | 0.539
The analysis of the number of neighbors and objects. In our default setting, the subject module uses 4 neighbor views and the reference module uses 8 objects in each neighbor view. However, it is natural to consider more neighbors (e.g., 8 neighbors) and more objects. In this experiment, we test 4 and 8 neighbors, and 4, 8 and 12 objects in each view. The results are shown in Table 5. In model #1 and model #2, we adjust the number of objects and keep 4 neighbors. They perform worse than our default setting (model #4). The reason may be that fewer objects might not contain the objects mentioned by instructions, while more objects could be redundant. To study the number of neighbors, we keep the object number at 8. Comparing model #3 with model #4, the results show that 4 neighbors is better. The reason may be that more neighbors contain more redundant information: not only does this affect the aggregation of subject context, but it also increases the difficulty of highlighting relevant objects. An example is shown in Figure 4: intuitively, 8 neighbors contain more redundant visual information than 4 neighbors.
Table 5. The effect of the number of neighbor views and objects per view on R2R.

model | views | objects | SR (Seen) | SPL (Seen) | SR (Unseen) | SPL (Unseen)
---|---|---|---|---|---|---
1 | 4 | 4 | 0.655 | 0.617 | 0.564 | 0.517
2 | 4 | 12 | 0.667 | 0.638 | 0.572 | 0.532
3 | 8 | 8 | 0.644 | 0.615 | 0.564 | 0.523
4 | 4 | 8 | 0.686 | 0.645 | 0.601 | 0.549
Here we present some success and failure cases. Figure 5 shows two success cases. Taking the bottom one as an example, each module attends to the correct phrases, such as "go up", "stairs" and "to the left of the refrigerator", corresponding to the action, subject and reference modules respectively. More visual information about the subject "stairs" from neighbor views is incorporated by our subject module. In addition, taking advantage of the reference module, our model can perceive the mentioned "refrigerator" in neighbor views as a reference and suppress unmentioned objects. We note that the attention scores of objects are not very high, which may mainly be caused by the fact that we consider 32 objects in total for each candidate (4 neighbors and 8 objects in each neighbor).
Figure 6 visualizes a failure case. The model mainly focuses on the "bathroom" in the subject module, which leads to a wrong direction. This indicates that the attended phrases of NvEM are not always correct, and thus it could still be improved by studying how to extract more accurate phrases.
In this paper, we present a novel multi-module Neighbor-view Enhanced Model to improve textual-visual matching via adaptively incorporating visual information from neighbor views. Our subject module aggregates neighbor views at a global level based on spatial-affinity, and our reference module aggregates neighbor objects at a local level guided by referring phrases. Extensive experiments demonstrate that NvEM effectively improves the agent’s performance in unseen environments and our method sets the new state-of-the-art.
Considering the similarity between the R2R task and other VLN tasks, such as dialogue navigation (Thomason et al., 2019; Nguyen and III, 2019), remote object detection (Qi et al., 2020b) and navigation in continuous space (Krantz et al., 2020), we believe the neighbor-view enhancement idea could also benefit agents in other embodied AI tasks. We leave these directions for future work.
This work was jointly supported by National Key Research and Development Program of China Grant No. 2018AAA0100400, National Natural Science Foundation of China (61525306, 61633021, 61721004, 61806194, U1803261, and 61976132), Beijing Nova Program (Z201100006820079), Shandong Provincial Key Research and Development Program (2019JZZY010119), Key Research Program of Frontier Sciences CAS Grant No.ZDBS-LY-JSC032, and CAS-AIR.
Anderson et al. 2018. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3674–3683.
Li et al. 2019. Robust Navigation with Language Pretraining and Stochastic Sampling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 1494–1499.
Ma et al. 2019a. Self-Monitoring Navigation Agent via Auxiliary Progress Estimation. In 7th International Conference on Learning Representations (ICLR).
Ma et al. 2019b. The Regretful Agent: Heuristic-Aided Navigation Through Progress Estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 6732–6740.
Mnih et al. 2016. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML). 1928–1937.
Pennington et al. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
Zhou et al. 2018. Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(6), 1452–1464.