Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation

07/25/2023
by Haitian Zeng, et al.

We introduce Kefa, a novel speaker model for navigation instruction generation. Existing speaker models in Vision-and-Language Navigation suffer from a large domain gap in visual features between different environments and from insufficient temporal grounding capability. To address these challenges, we propose a Knowledge Refinement Module that enhances the feature representation with external knowledge facts, and an Adaptive Temporal Alignment method that enforces fine-grained alignment between the generated instructions and the observation sequences. Moreover, we propose a new metric, SPICE-D, for navigation instruction evaluation, which is aware of the correctness of direction phrases. Experimental results on the R2R and UrbanWalk datasets show that the proposed Kefa speaker achieves state-of-the-art instruction generation performance for both indoor and outdoor scenes.
