Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

03/31/2023
by Arjun Majumdar, et al.

We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs), or visual 'foundation models', for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous manipulation, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none is universally dominant. To study the effect of pre-training data scale and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Finally, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving performance competitive with or superior to the best known results on all of the benchmarks in CortexBench. These models required over 10,000 GPU-hours to train and can be found on our website for the benefit of the research community.
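The MAE pre-training mentioned above trains a vision transformer to reconstruct an image from a small random subset of its patches. The sketch below is a hypothetical illustration of that masking step, not the authors' code: it assumes the standard MAE recipe (random per-patch masking at a 75% ratio) and made-up shapes for a 224x224 image split into 16x16 patches.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """MAE-style random masking: keep a random subset of patch embeddings.

    patches: (num_patches, dim) array of patch embeddings.
    Returns (kept_patches, kept_idx, masked_idx); in MAE, only the kept
    patches are fed to the encoder, and the decoder reconstructs the rest.
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    kept_idx = np.sort(perm[:n_keep])
    masked_idx = np.sort(perm[n_keep:])
    return patches[kept_idx], kept_idx, masked_idx

# A 224x224 image with 16x16 patches yields 14*14 = 196 patches; at a 75%
# mask ratio the encoder sees only 49 of them.
patches = np.zeros((196, 768))
kept, kept_idx, masked_idx = random_masking(patches)
print(kept.shape, len(masked_idx))  # (49, 768) 147
```

Because the encoder processes only ~25% of the patches, this masking is also what makes MAE pre-training cheap enough to scale to millions of video frames.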

