MVP: Multimodality-guided Visual Pre-training

03/10/2022
by Longhui Wei, et al.

Recently, masked image modeling (MIM) has become a promising direction for visual pre-training. In the context of vision transformers, MIM learns effective visual representations by aligning token-level features with a pre-defined space (e.g., BEIT used a d-VAE trained on a large image corpus as the tokenizer). In this paper, we go one step further by introducing guidance from other modalities and validating that such additional knowledge leads to impressive gains for visual pre-training. The proposed approach is named Multimodality-guided Visual Pre-training (MVP), in which we replace the tokenizer with the vision branch of CLIP, a vision-language model pre-trained on 400 million image-text pairs. We demonstrate the effectiveness of MVP by performing standard experiments, i.e., pre-training ViT models on ImageNet and fine-tuning them on a series of downstream visual recognition tasks. In particular, pre-training ViT-Base/16 for 300 epochs, MVP reports 52.4% mIoU on ADE20K, surpassing BEIT (the baseline and previous state-of-the-art) by an impressive margin of 6.8%.
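The core objective described above — predicting a frozen teacher's token-level features at masked positions — can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the random linear maps stand in for the ViT student and CLIP's frozen vision branch, and the cosine-alignment loss over masked patches is one plausible form of the feature-alignment objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumption: MVP uses a ViT student and CLIP's frozen
# vision branch as teacher; here both are random linear projections).
D_in, D_feat, N = 48, 32, 16                # patch dim, feature dim, #patches
W_teacher = rng.normal(size=(D_in, D_feat))  # frozen "teacher" projection
W_student = rng.normal(size=(D_in, D_feat))  # learnable student projection

patches = rng.normal(size=(N, D_in))         # one image as N patch vectors

# Random mask: the loss is computed only at masked positions, as in MIM.
mask = rng.random(N) < 0.4

teacher_feats = patches @ W_teacher          # alignment targets
student_feats = patches @ W_student          # student predictions

def cosine_align_loss(pred, target, mask):
    """1 minus the mean cosine similarity over masked positions."""
    p = pred[mask] / np.linalg.norm(pred[mask], axis=1, keepdims=True)
    t = target[mask] / np.linalg.norm(target[mask], axis=1, keepdims=True)
    return 1.0 - float((p * t).sum(axis=1).mean())

loss = cosine_align_loss(student_feats, teacher_feats, mask)
print(f"masked-feature alignment loss: {loss:.4f}")
```

Minimizing this loss pulls the student's masked-token features toward the multimodally pre-trained teacher's features, which is the sense in which the CLIP vision branch replaces a d-VAE tokenizer as the target space.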

