Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation

04/22/2021
by   Muhammad Zubair Irshad, et al.
1

Deep Learning has revolutionized our ability to solve complex problems such as Vision-and-Language Navigation (VLN). This task requires the agent to navigate to a goal purely based on visual sensory inputs given natural language instructions. However, prior works formulate the problem as a navigation graph with a discrete action space. In this work, we lift the agent off the navigation graph and propose a more complex VLN setting in continuous 3D reconstructed environments. Our proposed setting, Robo-VLN, more closely mimics the challenges of real world navigation. Robo-VLN tasks have longer trajectory lengths, continuous action spaces, and challenges such as obstacles. We provide a suite of baselines inspired by state-of-the-art works in discrete VLN and show that they are less effective at this task. We further propose that decomposing the task into specialized high- and low-level policies can more effectively tackle this task. With extensive experiments, we show that by using layered decision making, modularized training, and decoupling reasoning and imitation, our proposed Hierarchical Cross-Modal (HCM) agent outperforms existing baselines in all key metrics and sets a new benchmark for Robo-VLN.

READ FULL TEXT

page 1

page 6

research
11/25/2018

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Vision-language navigation (VLN) is the task of navigating an embodied a...
research
01/11/2023

Graph based Environment Representation for Vision-and-Language Navigation in Continuous Environments

Vision-and-Language Navigation in Continuous Environments (VLN-CE) is a ...
research
07/16/2023

Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making

Vision language decision making (VLDM) is a challenging multimodal task....
research
03/10/2022

Cross-modal Map Learning for Vision and Language Navigation

We consider the problem of Vision-and-Language Navigation (VLN). The maj...
research
11/22/2020

Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning

The emerging vision-and-language navigation (VLN) problem aims at learni...
research
03/05/2022

Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation

Most existing works in vision-and-language navigation (VLN) focus on eit...
research
07/11/2020

Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation

The ability to perform effective planning is crucial for building an ins...

Please sign up or login with your details

Forgot password? Click here to reset