ConvMAE: Masked Convolution Meets Masked Autoencoders

05/08/2022
by Peng Gao, et al.

Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection and semantic segmentation. In this paper, our ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer architectures can learn more discriminative representations via the masked auto-encoding scheme. However, directly applying the original masking strategy incurs heavy computational cost and a pretraining-finetuning discrepancy. To tackle this issue, we adopt masked convolution to prevent information leakage in the convolution blocks, and we propose a simple block-wise masking strategy to ensure computational efficiency. We also propose to more directly supervise the multi-scale features of the encoder. Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, ConvMAE-Base finetuned for only 25 epochs surpasses MAE-Base finetuned for 100 epochs by 2.9 box AP. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE.
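To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of block-wise masking and masked convolution. This is not the authors' implementation: the names block_wise_mask and MaskedConvBlock, the 14x14 token grid, the 0.75 mask ratio, and the depthwise 5x5 convolution are all assumptions made for illustration. The key point the sketch captures is that the mask is sampled once on the coarsest token grid, upsampled to each convolution stage, and re-applied after every convolution so that visible positions never receive information leaked from masked regions.

```python
# Minimal sketch, assuming hypothetical names block_wise_mask / MaskedConvBlock;
# not the official ConvMAE code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def block_wise_mask(batch, grid=14, mask_ratio=0.75, device="cpu"):
    """Sample a random mask on the coarsest (e.g., 1/16-scale) token grid.

    Returns a (batch, 1, grid, grid) float tensor with 1 = visible and
    0 = masked; upsampling this one mask to finer stages keeps the masked
    regions aligned across all scales (the block-wise masking strategy).
    """
    num_tokens = grid * grid
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    noise = torch.rand(batch, num_tokens, device=device)
    ids = noise.argsort(dim=1)                 # random permutation per sample
    mask = torch.zeros(batch, num_tokens, device=device)
    mask.scatter_(1, ids[:, :num_keep], 1.0)   # mark num_keep tokens visible
    return mask.view(batch, 1, grid, grid)

class MaskedConvBlock(nn.Module):
    """Convolution that re-applies the mask after the conv, so features of
    masked regions cannot leak into visible positions via the conv kernel."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)

    def forward(self, x, coarse_mask):
        # Upsample the coarse mask to this stage's spatial resolution.
        m = F.interpolate(coarse_mask, size=x.shape[-2:], mode="nearest")
        x = x * m          # zero out masked regions before the conv
        x = self.conv(x)
        return x * m       # re-mask: kernel halos must not survive

# Usage: stage-1 features at 1/4 scale, mask sampled on the 1/16 grid.
x = torch.randn(2, 64, 56, 56)
mask = block_wise_mask(batch=2, grid=14)
y = MaskedConvBlock(64)(x, mask)
```

Re-masking after the convolution is what makes the scheme compatible with MAE-style pretraining: the later transformer stage can then drop the masked tokens entirely, keeping the encoder's cost proportional to the visible tokens only.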
