A modular vision-language navigation and manipulation framework for long-horizon compositional tasks in indoor environments

01/19/2021
by Homagni Saha, et al.

In this paper, we propose MoViLan (Modular Vision and Language), a new framework for executing visually grounded natural language instructions for day-to-day indoor household tasks. While several data-driven, end-to-end learning frameworks have been proposed for targeted navigation tasks based on the vision and language modalities, performance on recent benchmark datasets reveals a gap in comprehensive techniques for long-horizon, compositional tasks (involving both manipulation and navigation) with diverse object categories, realistic instructions, and visual scenarios with non-reversible state changes. We propose a modular approach to the combined navigation and object-interaction problem that does not require strictly aligned vision and language training data (e.g., in the form of expert-demonstrated trajectories). Such an approach is a significant departure from the traditional end-to-end techniques in this space and allows for a more tractable training process with separate vision and language datasets. Specifically, we propose a novel geometry-aware mapping technique for cluttered indoor environments and a language understanding model generalized for household instruction following. We demonstrate a significant increase in success rates over the baseline for long-horizon, compositional tasks on the recently released benchmark dataset ALFRED.
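The abstract's central idea, decoupling language understanding, mapping, and execution into separately trained modules that communicate through an intermediate subgoal representation, can be made concrete with a short sketch. The Python below is illustrative only and is not the paper's implementation: the names Subgoal, parse_instruction, and OccupancyMapper are hypothetical stand-ins showing how such modules could compose without aligned vision-language trajectories.

from dataclasses import dataclass
from typing import List

@dataclass
class Subgoal:
    """One step of a compositional instruction (hypothetical interface)."""
    action: str   # e.g. "navigate", "pick", "heat"
    target: str   # the object/location phrase the action applies to

def parse_instruction(instruction: str) -> List[Subgoal]:
    """Stand-in for the language-understanding module: split a long
    instruction into ordered subgoals. The real module would be a
    learned model; this naive clause split only shows the interface."""
    subgoals = []
    for clause in instruction.lower().split(", then "):
        verb, _, rest = clause.partition(" ")
        action = "navigate" if verb in ("go", "walk") else verb
        subgoals.append(Subgoal(action=action, target=rest.strip()))
    return subgoals

class OccupancyMapper:
    """Stand-in for the geometry-aware mapping module: a 2-D occupancy
    grid accumulated from obstacle observations."""
    def __init__(self, size: int = 10):
        self.grid = [[0] * size for _ in range(size)]

    def mark_obstacle(self, x: int, y: int) -> None:
        self.grid[y][x] = 1

    def is_free(self, x: int, y: int) -> bool:
        return self.grid[y][x] == 0

def execute(instruction: str) -> None:
    mapper = OccupancyMapper()
    mapper.mark_obstacle(3, 4)  # e.g. a table reported by the vision module
    for sg in parse_instruction(instruction):
        # A planner would query the map for a collision-free path here;
        # the point is that only the Subgoal interface couples the
        # language and vision modules, so each can be trained on its
        # own dataset.
        assert mapper.is_free(0, 0)
        print(f"{sg.action:>10} -> {sg.target}")

if __name__ == "__main__":
    execute("Go to the kitchen, then pick up the mug, then heat it in the microwave")

Running the sketch prints one line per subgoal, making visible the ordered plan that, in the framework the abstract describes, would be handed to separately trained navigation and manipulation modules.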


Related research

- 12/03/2019 - ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
- 06/17/2022 - VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation
- 11/20/2017 - Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
- 12/06/2021 - CALVIN: A Benchmark for Language-conditioned Policy Learning for Long-horizon Robot Manipulation Tasks
- 06/17/2023 - MO-VLN: A Multi-Task Benchmark for Open-set Zero-Shot Vision-and-Language Navigation
- 12/06/2020 - MOCA: A Modular Object-Centric Approach for Interactive Instruction Following
- 07/16/2023 - Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making
