Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

11/25/2018
by Xin Wang, et al.

Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: cross-modal grounding, ill-posed feedback, and generalization. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). In particular, a matching critic is used to provide an intrinsic reward that encourages global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves new state-of-the-art performance. To improve the generalizability of the learned policy, we further introduce a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating the agent's own past, good decisions. We demonstrate that SIL can approximate a better and more efficient policy, which tremendously reduces the success-rate performance gap between seen and unseen environments (from 30.7% to 11.7%).
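As a rough illustration of the two components summarized above (not the authors' released code), the sketch below shows one way a matching critic could produce an intrinsic reward by measuring how well the traveled trajectory reconstructs the instruction, and how a self-supervised imitation buffer could store the agent's own best-scoring trajectories for behavior cloning. All class and function names (MatchingCritic, SILBuffer, total_reward) and the PyTorch modeling choices are assumptions made for this example.

```python
# Illustrative sketch only; names and architecture details are assumptions,
# not the paper's implementation. It captures the two ideas in the abstract:
# (1) an intrinsic reward from a matching critic that scores how well a
#     trajectory lets us reconstruct the instruction (global matching), and
# (2) SIL, which imitates the agent's own best-scoring past trajectories.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MatchingCritic(nn.Module):
    """Scores instruction/trajectory alignment, roughly p(instruction | trajectory)."""

    def __init__(self, vocab_size, traj_feat_dim, hidden_dim=256):
        super().__init__()
        self.traj_encoder = nn.GRU(traj_feat_dim, hidden_dim, batch_first=True)
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def intrinsic_reward(self, traj_feats, instr_tokens):
        # Encode the visited trajectory, then measure how well it lets us
        # reconstruct the instruction (higher log-likelihood => better match).
        _, h = self.traj_encoder(traj_feats)            # (1, B, H)
        emb = self.word_emb(instr_tokens[:, :-1])       # teacher forcing
        dec_out, _ = self.decoder(emb, h)
        logits = self.out(dec_out)                      # (B, T-1, V)
        neg_ll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            instr_tokens[:, 1:].reshape(-1),
            reduction="mean",
        )
        return -neg_ll  # mean log-likelihood, used as an intrinsic reward


def total_reward(extrinsic, intrinsic, delta=0.5):
    # Mix the environment's extrinsic signal (e.g. progress toward the goal,
    # success at the end) with the matching critic's intrinsic reward.
    return extrinsic + delta * intrinsic


class SILBuffer:
    """Keeps the agent's own best-scoring trajectory per instruction (SIL)."""

    def __init__(self):
        self.best = {}  # instruction id -> (score, trajectory)

    def add(self, instr_id, score, trajectory):
        if instr_id not in self.best or score > self.best[instr_id][0]:
            self.best[instr_id] = (score, trajectory)

    def imitation_loss(self, navigator_log_probs):
        # Behavior cloning on the stored "good" trajectories: maximize the
        # log-probability the navigator assigns to its own past best actions.
        return -torch.stack(navigator_log_probs).mean()
```

In this reading, the intrinsic reward favors trajectories from which the instruction can be reproduced, giving a global matching signal even when the environment's extrinsic feedback is sparse or ill-posed, while SIL lets the agent refine its policy in unseen environments without ground-truth demonstrations.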


Related research

04/22/2021
Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation
Deep Learning has revolutionized our ability to solve complex problems s...

11/22/2020
Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
The emerging vision-and-language navigation (VLN) problem aims at learni...

11/18/2019
Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks
Vision-Language Navigation (VLN) is a task where agents learn to navigat...

08/26/2021
SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments
This paper presents a novel approach for the Vision-and-Language Navigat...

07/21/2020
Soft Expert Reward Learning for Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) requires an agent to find a specifi...

01/18/2022
Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching
Referring expression grounding is an important and challenging task in c...

03/24/2021
Scene-Intuitive Agent for Remote Embodied Visual Grounding
Humans learn from life events to form intuitions towards the understandi...
