STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

06/01/2023
by Shalev Lifshitz et al.

Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL-E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools, are made available for further research.
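As a rough illustration of the inference pipeline the abstract describes, the sketch below shows the two-stage flow (text → prior → latent goal → policy) together with classifier-free guidance, which blends instruction-conditioned and unconditioned policy logits. All function names and shapes here are hypothetical placeholders, not the actual STEVE-1 implementation; the guidance formula itself is the standard one from text-conditioned generation.

```python
import numpy as np

def classifier_free_guidance(cond_logits, uncond_logits, scale):
    """Blend conditioned and unconditioned action logits.

    With scale > 1, the policy is pushed further toward the
    instruction-conditioned behavior, mirroring classifier-free
    guidance in text-to-image models.
    """
    return uncond_logits + scale * (cond_logits - uncond_logits)

def act(text_prior, policy, instruction, observation, scale=2.0):
    """Hypothetical two-stage inference: text -> latent goal -> action.

    `text_prior` stands in for the prior mapping text to a MineCLIP-style
    latent; `policy` stands in for the goal-conditioned VPT policy.
    """
    goal_latent = text_prior(instruction)            # stage 2: prior
    cond = policy(observation, goal_latent)          # goal-conditioned logits
    uncond = policy(observation, None)               # unconditioned logits
    return classifier_free_guidance(cond, uncond, scale)

# Toy numeric check of the guidance formula over 4 low-level actions.
uncond = np.array([0.1, 0.2, 0.3, 0.4])
cond = np.array([0.4, 0.3, 0.2, 0.1])
guided = classifier_free_guidance(cond, uncond, scale=2.0)
```

With `scale=2.0` the blended logits reduce to `2 * cond - uncond`, so the guided action distribution leans harder on the instruction than either input alone.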


