Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

06/23/2022
by Bowen Baker, et al.

Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data – here, online videos of people playing Minecraft – from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.


