EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

11/14/2022
by Yuxin Fang, et al.

We launch EVA, a vision-centric foundation model to explore the limits of visual representation learning at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale EVA up to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap on the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA greatly stabilizes training and outperforms the from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we will release all the code and models at <https://github.com/baaivision/EVA>.
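
To make the pretext task concrete, below is a minimal PyTorch-style sketch of one training step: a student ViT sees only the visible patches and regresses the frozen CLIP vision features at the masked positions. The function and argument names (`student_vit`, `clip_vision_tower`) are hypothetical, and the cosine-distance objective is one common choice for feature-reconstruction MIM, used here as an assumption; the paper's exact loss and masking mechanics may differ.

```python
import torch
import torch.nn.functional as F

def eva_mim_loss(student_vit, clip_vision_tower, images, mask):
    """Sketch of EVA's pretext task (names and loss are illustrative).

    student_vit:       ViT that conditions on visible patches (masking is
                       assumed to be handled inside the model) and predicts
                       a feature vector for every patch position.
    clip_vision_tower: frozen CLIP image encoder producing per-patch,
                       image-text aligned target features.
    images:            (B, 3, H, W) batch of images.
    mask:              (B, N) boolean tensor, True at masked-out patches.
    """
    with torch.no_grad():
        # Targets: image-text aligned vision features from the frozen CLIP.
        targets = clip_vision_tower(images)        # (B, N, D)

    # Student reconstructs features for all N patch positions while only
    # attending to the visible (unmasked) patches.
    preds = student_vit(images, mask)              # (B, N, D)

    # Regress masked positions only, comparing L2-normalized features
    # (a cosine-distance objective; assumed here, not quoted from the paper).
    preds = F.normalize(preds[mask], dim=-1)       # (M, D)
    targets = F.normalize(targets[mask], dim=-1)   # (M, D)
    loss = 1.0 - (preds * targets).sum(dim=-1).mean()
    return loss
```

Because the targets come from a frozen, image-text aligned encoder rather than raw pixels or a discrete tokenizer, the student inherits semantics that transfer well downstream, which is what lets this recipe scale to a billion parameters.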


Related research

05/17/2022 · Vision Transformer Adapter for Dense Predictions
This work investigates a simple yet powerful adapter for Vision Transfor...

06/27/2023 · CellViT: Vision Transformers for Precise Cell Segmentation and Classification
Nuclei detection and segmentation in hematoxylin and eosin-stained (H...

02/26/2021 · Learning Transferable Visual Models From Natural Language Supervision
State-of-the-art computer vision systems are trained to predict a fixed...

02/01/2023 · mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Recent years have witnessed a big convergence of language, vision, and m...

07/06/2023 · A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task
In recent years, large models trained on huge amounts of cross-modality dat...

03/20/2023 · EVA-02: A Visual Representation for Neon Genesis
We launch EVA-02, a next-generation Transformer-based visual representat...

04/15/2022 · ResT V2: Simpler, Faster and Stronger
This paper proposes ResTv2, a simpler, faster, and stronger multi-scale...
