Integrally Pre-Trained Transformer Pyramid Networks

11/23/2022
by Yunjie Tian, et al.

In this paper, we present an integral pre-training framework based on masked image modeling (MIM). We advocate for pre-training the backbone and neck jointly so that the transfer gap between MIM and downstream recognition tasks is minimal. We make two technical contributions. First, we unify the reconstruction and recognition necks by inserting a feature pyramid into the pre-training stage. Second, we complement MIM with masked feature modeling (MFM), which offers multi-stage supervision to the feature pyramid. The pre-trained models, termed integrally pre-trained transformer pyramid networks (iTPNs), serve as powerful foundation models for visual recognition. In particular, the base-level iTPN achieves 86.2% top-1 accuracy on ImageNet-1K, 53.2% box AP on COCO object detection with a 1x training schedule using Mask R-CNN, and 54.7% mIoU on ADE20K semantic segmentation using UPerHead; all of these results set new records. We hope this work inspires the community to work on unifying upstream pre-training and downstream fine-tuning tasks. Code and the pre-trained models will be released at https://github.com/sunsmarterjie/iTPN.
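To make the framework concrete, the sketch below shows how a single pre-training step could combine the two losses the abstract describes: a masked-image-modeling (MIM) loss that reconstructs pixel targets from the finest pyramid level, and a masked-feature-modeling (MFM) loss that gives every pyramid level supervision from a frozen teacher. This is a minimal PyTorch illustration, not the authors' implementation: the TinyEncoder, PyramidNeck, and pretrain_step names, the module sizes, the zero-masking of patches, and the frozen-copy teacher are all assumptions made for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    # Stand-in two-stage backbone; the real iTPN uses a hierarchical vision
    # transformer, so this convolutional encoder is purely illustrative.
    def __init__(self, dim=64):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, dim, 4, stride=4), nn.GELU())
        self.stage2 = nn.Sequential(nn.Conv2d(dim, 2 * dim, 2, stride=2), nn.GELU())

    def forward(self, x):
        f1 = self.stage1(x)   # stride-4 features
        f2 = self.stage2(f1)  # stride-8 features
        return [f1, f2]


class PyramidNeck(nn.Module):
    # Feature pyramid kept during BOTH pre-training and fine-tuning, which is
    # the "integral" part of the framework described in the abstract.
    def __init__(self, dims=(64, 128), out_dim=64):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(d, out_dim, 1) for d in dims])

    def forward(self, feats):
        p2 = self.lateral[1](feats[1])
        # top-down fusion: upsample the deeper level and add it to the shallower one
        p1 = self.lateral[0](feats[0]) + F.interpolate(p2, scale_factor=2, mode="nearest")
        return [p1, p2]


def pretrain_step(encoder, neck, teacher, pixel_head, feat_heads, images, mask):
    # images: (B, 3, H, W); mask: (B, H/4, W/4) float map, 1 = masked patch.
    visible = (1.0 - mask).repeat_interleave(4, -1).repeat_interleave(4, -2).unsqueeze(1)
    pyramid = neck(encoder(images * visible))  # zero out masked patches (a simplification)

    # MIM loss: reconstruct patch-level pixel targets from the finest pyramid level,
    # averaged over masked positions only.
    recon = pixel_head(pyramid[0])             # (B, 3, H/4, W/4)
    target = F.avg_pool2d(images, 4)           # crude pixel target (assumption)
    err = (recon - target).pow(2).mean(dim=1)  # (B, H/4, W/4)
    loss_mim = (err * mask).sum() / mask.sum().clamp(min=1.0)

    # MFM loss: every pyramid level regresses the matching feature map of a frozen
    # teacher, giving the neck multi-stage supervision.
    with torch.no_grad():
        teacher_feats = teacher(images)
    loss_mfm = sum(F.mse_loss(h(p), t)
                   for h, p, t in zip(feat_heads, pyramid, teacher_feats))
    return loss_mim + loss_mfm


encoder, neck, teacher = TinyEncoder(), PyramidNeck(), TinyEncoder()
teacher.load_state_dict(encoder.state_dict())    # frozen copy as the MFM target (assumption)
for p in teacher.parameters():
    p.requires_grad_(False)
pixel_head = nn.Conv2d(64, 3, 1)
feat_heads = nn.ModuleList([nn.Conv2d(64, 64, 1), nn.Conv2d(64, 128, 1)])
images = torch.randn(2, 3, 64, 64)
mask = (torch.rand(2, 16, 16) < 0.75).float()    # mask 75% of stride-4 patches
loss = pretrain_step(encoder, neck, teacher, pixel_head, feat_heads, images, mask)
loss.backward()

Because the same pyramid neck can be reused by downstream heads such as Mask R-CNN or UPerHead at fine-tuning time, the neck is already adapted to the backbone when transfer begins, which is the property the abstract credits for shrinking the transfer gap.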


Related research

04/06/2022
Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection
We present an approach to efficiently and effectively adapt a masked ima...

01/28/2020
A Study of Pyramid Structure for Code Correction
We demonstrate the implementations of pyramid encoders in both multi-lay...

03/10/2022
MVP: Multimodality-guided Visual Pre-training
Recently, masked image modeling (MIM) has become a promising direction f...

04/02/2023
DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks
In this paper, we study masked autoencoder (MAE) pretraining on videos f...

04/21/2023
GeoLayoutLM: Geometric Pre-training for Visual Information Extraction
Visual information extraction (VIE) plays an important role in Document ...

11/17/2022
CAE v2: Context Autoencoder with CLIP Target
Masked image modeling (MIM) learns visual representation by masking and ...

11/18/2021
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
This paper presents a new pre-trained language model, DeBERTaV3, which i...
