Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

04/06/2022
by   Yuxin Fang, et al.

We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, based on two novel observations: (i) A MIM pre-trained vanilla ViT can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25% to 50% of the input embeddings. (ii) To construct multi-scale representations for object detection, a randomly initialized compact convolutional stem supplants the pre-trained large-kernel patchify stem, and its intermediate features can naturally serve as the higher-resolution inputs of a feature pyramid without upsampling. The pre-trained ViT is regarded only as the third stage of our detector's backbone rather than the whole feature extractor, resulting in a ConvNet-ViT hybrid architecture. The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by 2.3 box AP and 2.5 mask AP on COCO, and achieves even better results than other adapted vanilla ViTs using a more modest fine-tuning recipe while converging 2.8x faster. Code and pre-trained models are available at <https://github.com/hustvl/MIMDet>.
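To make the idea of randomly sampled partial observations concrete, the following minimal sketch picks a uniformly random subset of patch-token indices to pass to an encoder. It is an illustration only, not the MIMDet implementation: the function name `sample_visible_tokens`, the 16x16 patch size, the 1024x1024 input size, and the keep ratio of 0.25 are all assumptions chosen for the example.

```python
import random


def sample_visible_tokens(num_tokens, keep_ratio, seed=None):
    """Return sorted indices of a uniformly random subset of tokens.

    Hypothetical helper for illustration; not taken from the MIMDet code.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    rng = random.Random(seed)
    # Keep at least one token so the encoder never sees an empty sequence.
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Sample without replacement, then sort to preserve spatial order.
    return sorted(rng.sample(range(num_tokens), num_keep))


# Assumed setup: a 1024x1024 image split into 16x16 patches gives a 64x64 grid.
grid = 1024 // 16
num_tokens = grid * grid
visible = sample_visible_tokens(num_tokens, keep_ratio=0.25, seed=0)
print(len(visible), "of", num_tokens, "tokens are kept")
```

Because self-attention cost grows with the square of the sequence length, feeding the encoder fewer tokens reduces its compute. How the dropped positions are handled downstream is not specified in the abstract and is left out of this sketch.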
