MOCA: A Modular Object-Centric Approach for Interactive Instruction Following

12/06/2020
by   Kunal Pratap Singh, et al.

Performing simple household tasks based on language directives is very natural to humans, yet it remains an open challenge for AI agents. Recently, an "interactive instruction following" task has been proposed to foster research on reasoning over long instruction sequences that require object interactions in a simulated environment. Each step involves solving open problems from the vision, language, and navigation literature. To address this multifaceted problem, we propose a modular architecture that decouples the task into visual perception and action policy, and name it MOCA, a Modular Object-Centric Approach. We evaluate our method on the ALFRED benchmark and empirically validate that it outperforms prior art by significant margins in all metrics, with good generalization performance (a high success rate in unseen environments). Our code is available at https://github.com/gistvision/moca.
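The decoupling described in the abstract can be illustrated with a small sketch. This is not the MOCA implementation (see the repository above for that). It is a toy, assuming a frame is a grid of object labels, and every name in it (`ModularAgent`, `toy_policy`, `toy_perception`, `Step`) is hypothetical. The action policy alone chooses what to do, and the perception module alone chooses where to interact.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Frame = List[List[str]]        # toy "image": a grid of object labels (assumption)
Location = Tuple[int, int]     # (row, col) of the object to interact with

# Illustrative action names; navigation actions need no interaction target.
NAV_ACTIONS = {"MoveAhead", "RotateLeft", "RotateRight"}


@dataclass
class Step:
    action: str
    target: Optional[Location]  # None for navigation actions


def toy_policy(instruction: str, frame: Frame) -> str:
    """Action policy: decides WHAT to do, never where."""
    words = set(instruction.lower().split())
    visible = {label for row in frame for label in row}
    return "PickupObject" if words & visible else "MoveAhead"


def toy_perception(instruction: str, frame: Frame) -> Optional[Location]:
    """Visual perception: decides WHERE the mentioned object is."""
    words = set(instruction.lower().split())
    for r, row in enumerate(frame):
        for c, label in enumerate(row):
            if label in words:
                return (r, c)
    return None


class ModularAgent:
    """Combines two independent modules; either can be swapped out alone."""

    def __init__(self,
                 policy: Callable[[str, Frame], str],
                 perception: Callable[[str, Frame], Optional[Location]]):
        self.policy = policy
        self.perception = perception

    def act(self, instruction: str, frame: Frame) -> Step:
        action = self.policy(instruction, frame)
        if action in NAV_ACTIONS:
            return Step(action, None)
        # Only interaction actions consult the perception module.
        return Step(action, self.perception(instruction, frame))


agent = ModularAgent(toy_policy, toy_perception)
step = agent.act("pick up the mug", [["floor", "mug"], ["floor", "floor"]])
```

Because the two modules share only the instruction and the frame, each could be trained or replaced without touching the other, which is the property the modular design aims for.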


