Less is More: Generating Grounded Navigation Instructions from Landmarks

11/25/2021
by Su Wang, et al.

We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first-stage landmark detector and a second-stage generator, a multimodal, multilingual, multitask encoder-decoder. To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8B images, we identify 1.1M English, Hindi and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain success rates (SR) of 71% following MARKY-MT5's instructions, just shy of their 75% SR following human instructions, and well above SRs with other generators. Evaluations on RxR's longer, diverse paths obtain 61-64% SRs on three languages. Generating such high-quality navigation instructions in novel environments is a step towards conversational navigation tools and could facilitate larger-scale training of instruction-following agents.
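The grounding step mentioned in the abstract (matching landmark descriptions to regions in panoramas with an image-text encoder) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `ground_landmarks`, the toy embeddings, and the choice of plain cosine-similarity argmax are all assumptions made for illustration, and the weak supervision from pose traces is not modeled here.

```python
import numpy as np

def ground_landmarks(text_emb, region_emb):
    """Assign each landmark phrase to its most similar panorama region.

    text_emb:   (num_phrases, dim) phrase embeddings (hypothetical inputs).
    region_emb: (num_regions, dim) region embeddings (hypothetical inputs).
    Returns the best region index and its cosine similarity for each phrase.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    sim = t @ r.T                      # (num_phrases, num_regions)
    best = sim.argmax(axis=1)          # most similar region per phrase
    return best, sim[np.arange(len(best)), best]

# Toy embeddings standing in for the output of a shared image-text encoder.
text_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.2]])
region_emb = np.array([[0.1, 0.9, 0.1],
                       [0.9, 0.1, 0.0],
                       [0.0, 0.0, 1.0]])

best, score = ground_landmarks(text_emb, region_emb)
print(best.tolist())
```

In this toy setup the first phrase lands on the second region and the second phrase on the first; a real system would presumably also threshold the similarity so that phrases with no visible referent stay ungrounded.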


