Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models

10/24/2022
by Hao Liu, et al.

Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained vision-language models typically keep language and visual representations separate, requiring specialized network architectures to fuse them. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our method consists of a multimodal transformer that encodes visual observations and language instructions, and a policy transformer that predicts actions based on the encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The policy transformer keeps track of the full history of observations and actions, and predicts actions autoregressively. We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings. Our model also shows better scalability and generalization than prior work.
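The two-transformer design described in the abstract lends itself to a compact sketch. Below is a minimal, illustrative PyTorch version: a multimodal encoder that attends jointly over image-patch and instruction tokens, feeding per-step encodings into a causal policy transformer that predicts actions from the full history. All module names, dimensions, the mean-pooled fusion, and the discrete action space are assumptions for illustration, and the paper's jointly pre-trained encoder is stood in for by a randomly initialized one.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Encodes image patches and instruction tokens as one cross-modal sequence."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2,
                 vocab_size=1000, patch_dim=3 * 16 * 16):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)    # image patches -> tokens
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches, instruction_ids):
        # patches: (B, N_patches, patch_dim); instruction_ids: (B, N_tokens)
        tokens = torch.cat([self.patch_proj(patches),
                            self.text_embed(instruction_ids)], dim=1)
        fused = self.encoder(tokens)      # joint attention across both modalities
        return fused.mean(dim=1)          # pooled per-timestep representation

class PolicyTransformer(nn.Module):
    """Autoregressively predicts actions from the history of step encodings."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_actions=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_embed = nn.Embedding(n_actions, d_model)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, step_encodings, prev_actions):
        # step_encodings: (B, T, d_model); prev_actions: (B, T) action ids
        h = step_encodings + self.action_embed(prev_actions)
        T = h.size(1)
        # Additive causal mask so each step attends only to the past.
        causal = torch.triu(
            torch.full((T, T), float('-inf'), device=h.device), diagonal=1)
        h = self.backbone(h, mask=causal)
        return self.action_head(h)        # logits for the next action per step

# Example forward pass over a short trajectory (hypothetical shapes):
B, T = 2, 5
enc, policy = MultimodalEncoder(), PolicyTransformer()
patches = torch.randn(B * T, 196, 3 * 16 * 16)       # 14x14 patches per frame
instr = torch.randint(0, 1000, (B * T, 12))          # tokenized instruction
steps = enc(patches, instr).view(B, T, -1)
logits = policy(steps, torch.randint(0, 8, (B, T)))  # (B, T, n_actions)
```

In this sketch, pre-training would shape the `MultimodalEncoder` weights on image-text pairs and text before the policy is trained on trajectories; mean pooling is one simple way to produce a single per-step embedding, not necessarily the paper's choice.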

Related research

A Picture is Worth a Thousand Words: Language Models Plan from Pixels (03/16/2023)
Planning is an important capability of artificial agents that perform lo...

Episodic Transformer for Vision-and-Language Navigation (05/13/2021)
Interaction and navigation defined by natural language instructions in d...

Guide Your Agent with Adaptive Multimodal Rewards (09/19/2023)
Developing an agent capable of adapting to unseen environments remains a...

Distilling Internet-Scale Vision-Language Models into Embodied Agents (01/29/2023)
Instruction-following agents must ground language into their observation...

A Visual Tour Of Current Challenges In Multimodal Language Models (10/22/2022)
Transformer models trained on massive text corpora have become the de fa...

Pre-Learning Environment Representations for Data-Efficient Neural Instruction Following (07/23/2019)
We consider the problem of learning to map from natural language instruc...

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents (07/04/2022)
Existing benchmarks for grounding language in interactive environments e...
