STEVE-1: A Generative Model for Text-to-Behavior in Minecraft

by Shalev Lifshitz, et al.

Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL-E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools, are made available for further research.
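Two of the ingredients named above, hindsight relabeling and classifier-free guidance, can be illustrated with a minimal sketch. This is not the released STEVE-1 code: the function names, the gap parameters, and the segment-sampling scheme are assumptions made for illustration, and the guidance formula is the commonly used logit-space form rather than one confirmed by the abstract. In a real pipeline, the goal assigned to each timestep would presumably be a MineCLIP embedding of the frames around the goal index rather than the index itself.

```python
import random


def hindsight_relabel(num_steps, min_gap, max_gap, rng):
    """Assign every timestep the index of a later "goal" timestep.

    Hypothetical scheme: starting at t = 0, pick a goal a random number of
    steps ahead, label every step up to and including it with that goal,
    then repeat from the step after the goal. No human annotation is used;
    the trajectory's own future serves as the instruction.
    """
    goal_for_step = [0] * num_steps
    start = 0
    while start < num_steps:
        # Clamp the sampled goal so it never runs past the trajectory end.
        goal = min(start + rng.randint(min_gap, max_gap), num_steps - 1)
        for t in range(start, goal + 1):
            goal_for_step[t] = goal
        start = goal + 1
    return goal_for_step


def guided_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance in logit space (assumed form).

    scale = 0 returns the goal-conditioned logits unchanged; larger values
    push the action distribution further from the unconditioned one.
    """
    return [(1 + scale) * c - scale * u
            for c, u in zip(cond_logits, uncond_logits)]


if __name__ == "__main__":
    labels = hindsight_relabel(num_steps=20, min_gap=3, max_gap=6,
                               rng=random.Random(0))
    print(labels)
    print(guided_logits([2.0, 0.5], [1.0, 0.5], scale=1.0))
```

Classifier-free guidance needs both a conditioned and an unconditioned prediction from the same policy, which is typically obtained by randomly dropping the goal during finetuning.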




