RevColV2: Exploring Disentangled Representations in Masked Image Modeling

09/02/2023
by Qi Han, et al.

Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance. Despite its success, existing MIM methods discard the decoder network during downstream applications, resulting in inconsistent representations between pre-training and fine-tuning that can hamper downstream task performance. In this paper, we propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning. The main body of RevColV2 consists of bottom-up columns and top-down columns, between which information is reversibly propagated and gradually disentangled. This design endows the architecture with a desirable property: low-level and semantic information remain disentangled at the end of the network throughout MIM pre-training. Our experimental results suggest that a foundation model with decoupled features can achieve competitive performance across multiple downstream vision tasks such as image classification, semantic segmentation, and object detection. For example, after intermediate fine-tuning on the ImageNet-22K dataset, RevColV2-L attains 88.4% top-1 accuracy on ImageNet-1K classification and 58.6 mIoU on ADE20K semantic segmentation. With an extra teacher and a large-scale dataset, RevColV2-L achieves 62.1 box AP on COCO detection and 60.4 mIoU on ADE20K semantic segmentation. Code and models are released at https://github.com/megvii-research/RevCol
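To make the reversible multi-column idea more concrete, below is a minimal PyTorch sketch, not the authors' implementation: the module names (LevelBlock, Column, MultiColumnBody), the simple linear fusion, and the hyper-parameters are illustrative assumptions. It shows the core mechanism the abstract describes: each column's level-l output is a reversible residual over the previous column's same-level feature, so information from earlier columns can be recovered exactly while multi-level features remain available at the end of the network.

```python
import torch
import torch.nn as nn


class LevelBlock(nn.Module):
    """Illustrative per-level block: fuses the feature from the level below
    (within the current column) with the same-level feature of the previous column."""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lower, lateral):
        return self.norm(self.fuse(torch.cat([lower, lateral], dim=-1)))


class Column(nn.Module):
    """One column of `num_levels` levels, coupled reversibly to the previous column:
        out[l] = alpha * prev[l] + F_l(out[l-1], prev[l])
    so prev[l] is recoverable as (out[l] - F_l(out[l-1], prev[l])) / alpha."""
    def __init__(self, dim, num_levels=4, alpha=1.0):
        super().__init__()
        self.blocks = nn.ModuleList(LevelBlock(dim) for _ in range(num_levels))
        self.alpha = alpha

    def forward(self, stem_tokens, prev):
        feats, h = [], stem_tokens
        for level, f in enumerate(self.blocks):
            h = self.alpha * prev[level] + f(h, prev[level])
            feats.append(h)
        return feats


class MultiColumnBody(nn.Module):
    """A stack of columns; in RevColV2 part of the stack plays the bottom-up
    (encoder) role and part the top-down (decoder) role, and the whole stack
    is kept for both pre-training and fine-tuning."""
    def __init__(self, dim=64, num_levels=4, num_columns=4):
        super().__init__()
        self.columns = nn.ModuleList(
            Column(dim, num_levels) for _ in range(num_columns)
        )

    def forward(self, tokens):
        # initialize "column 0" features with the stem tokens at every level
        prev = [tokens] * len(self.columns[0].blocks)
        for col in self.columns:
            prev = col(tokens, prev)
        return prev  # multi-level outputs kept at the end of the network


# toy usage: 2 images, 196 patch tokens, 64-dim embedding (all illustrative)
tokens = torch.randn(2, 196, 64)
outs = MultiColumnBody()(tokens)
print([o.shape for o in outs])
```

The reversible residual is what allows the full autoencoder-style body to be retained after pre-training without storing every intermediate activation during training, which is the property the paper exploits to keep pre-training and fine-tuning representations consistent.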

