Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion

08/10/2021
by Alessandro Suglia, et al.

Language-guided robots performing home and office tasks must navigate in and interact with the world. Grounding language instructions in visual observations and the actions available in an environment remains an open challenge. We present Embodied BERT (EmBERT), a transformer-based model that attends to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion. Additionally, we bridge the gap between successful object-centric navigation models used for non-interactive agents and the language-guided visual task completion benchmark ALFRED by introducing object navigation targets for EmBERT training. EmBERT achieves competitive performance on the ALFRED benchmark, and it is both the first transformer-based model to successfully handle ALFRED's long-horizon, dense, multi-modal histories and the first ALFRED model to use object-centric navigation targets.
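To make the high-level description concrete, here is a minimal, hypothetical PyTorch sketch of the kind of architecture the abstract describes: a single transformer encoder attending jointly over language tokens and per-frame object features, with one head predicting the next action and another predicting an object-centric navigation target. All names, dimensions, and heads (EmbertSketch, obj_feat_dim, n_actions, and so on) are illustrative assumptions, not the authors' implementation.

```python
# A hypothetical sketch of a multi-modal transformer for language-guided
# task completion: attend over instruction tokens plus detected-object
# features, predict the next action and an object navigation target.
import torch
import torch.nn as nn

class EmbertSketch(nn.Module):
    def __init__(self, vocab_size=1000, obj_feat_dim=512, d_model=256,
                 n_actions=12, n_object_classes=80):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.obj_proj = nn.Linear(obj_feat_dim, d_model)  # project detector features
        self.type_emb = nn.Embedding(2, d_model)          # 0 = language, 1 = vision
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, n_actions)              # next low-level action
        self.nav_target_head = nn.Linear(d_model, n_object_classes)  # navigation target class

    def forward(self, instr_tokens, obj_feats):
        # instr_tokens: (B, L) word ids; obj_feats: (B, T*K, obj_feat_dim),
        # i.e. K detected objects per frame over T frames, flattened in time.
        lang = self.word_emb(instr_tokens)
        vis = self.obj_proj(obj_feats)
        # Modality embeddings let attention distinguish the two streams.
        lang = lang + self.type_emb(torch.zeros_like(instr_tokens))
        vis = vis + self.type_emb(torch.ones(vis.shape[:2], dtype=torch.long,
                                             device=vis.device))
        h = self.encoder(torch.cat([lang, vis], dim=1))
        pooled = h[:, 0]  # first language token as summary, a common convention
        return self.action_head(pooled), self.nav_target_head(pooled)

# Example: batch of 2, 10 instruction tokens, 5 frames x 4 detected objects.
model = EmbertSketch()
actions, nav = model(torch.randint(0, 1000, (2, 10)), torch.randn(2, 20, 512))
print(actions.shape, nav.shape)  # torch.Size([2, 12]) torch.Size([2, 80])
```

In the actual system, the visual stream would come from an object detector over egocentric frames and training would use ALFRED trajectories; this sketch only illustrates the attend-over-everything structure with joint action and navigation-target prediction.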

Related research:

- Episodic Transformer for Vision-and-Language Navigation (05/13/2021)
- Control Transformer: Robot Navigation in Unknown Environments through PRM-Guided Return-Conditioned Sequence Modeling (11/11/2022)
- Multi-modal Discriminative Model for Vision-and-Language Navigation (05/31/2019)
- StructDiffusion: Object-Centric Diffusion for Semantic Rearrangement of Novel Objects (11/08/2022)
- A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter (02/24/2023)
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks (08/06/2019)
- InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language (05/09/2023)
