DeepAI AI Chat
Log In Sign Up

Speaker-Follower Models for Vision-and-Language Navigation

by   Daniel Fried, et al.

Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this presents a double challenge: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and empirically difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to synthesize new instructions for data augmentation and to implement pragmatic reasoning for evaluating candidate action sequences. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three pieces of this approach---speaker-driven data augmentation, pragmatic reasoning and panoramic action space---dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.


page 5

page 8

page 15

page 16

page 17

page 18

page 19

page 20


Unified Pragmatic Models for Generating and Following Instructions

We extend models for both following and generating natural language inst...

FOAM: A Follower-aware Speaker Model For Vision-and-Language Navigation

The speaker-follower models have proven to be effective in vision-and-la...

Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation

Since the rise of vision-language navigation (VLN), great progress has b...

Bridging the visual gap in VLN via semantically richer instructions

The Visual-and-Language Navigation (VLN) task requires understanding a t...

Linguistic communication as (inverse) reward design

Natural language is an intuitive and expressive way to communicate rewar...

PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation

Vision-and-language navigation (VLN) is a crucial but challenging cross-...