Vision Transformer Adapter for Dense Predictions

05/17/2022
by Zhe Chen, et al.

This work investigates a simple yet powerful adapter for the Vision Transformer (ViT). Unlike recent vision transformers that build vision-specific inductive biases into their architectures, the plain ViT performs worse on dense prediction tasks because it lacks prior knowledge about images. To address this, we propose the Vision Transformer Adapter (ViT-Adapter), which remedies this weakness of ViT and reaches performance comparable to vision-specific models by introducing inductive biases through an additional architecture. Specifically, the backbone in our framework is a vanilla transformer that can be pre-trained with multi-modal data. When fine-tuning on downstream tasks, a modality-specific adapter injects the prior knowledge of the data and tasks into the model, making it suitable for those tasks. We verify the effectiveness of ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation. Notably, when using HTC++, our ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO test-dev, surpassing Swin-L by 1.4 box AP and 1.0 mask AP. For semantic segmentation, ViT-Adapter-L establishes a new state of the art of 60.5 mIoU on ADE20K val, 0.6 points higher than SwinV2-G. We hope that ViT-Adapter can serve as an alternative to vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.
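The abstract's core design, a vanilla pre-trainable ViT backbone kept untouched plus a separate adapter branch that injects image priors only when fine-tuning for dense prediction, can be sketched in a few lines of PyTorch. The sketch below is not the released ViT-Adapter code: the module names (ToyViTBackbone, ConvSpatialPrior, AdapterInjector), the tiny dimensions, and the single cross-attention injection step are illustrative assumptions chosen to keep the example self-contained.

```python
# Minimal conceptual sketch of the "plain ViT backbone + adapter" idea described above.
# NOT the official ViT-Adapter implementation; all names and sizes are illustrative.
import torch
import torch.nn as nn


class ToyViTBackbone(nn.Module):
    """Stand-in for a vanilla, pre-trained ViT: patch embedding + transformer encoder."""
    def __init__(self, embed_dim=192, depth=4, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        # (B, 3, H, W) -> (B, N, C) patch tokens
        tokens = self.proj(x).flatten(2).transpose(1, 2)
        return self.encoder(tokens)


class ConvSpatialPrior(nn.Module):
    """Lightweight conv branch supplying the image prior the plain ViT lacks."""
    def __init__(self, embed_dim=192):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, embed_dim, 3, stride=16, padding=1),
            nn.GELU(),
            nn.Conv2d(embed_dim, embed_dim, 3, padding=1),
        )

    def forward(self, x):
        return self.stem(x).flatten(2).transpose(1, 2)  # (B, N, C)


class AdapterInjector(nn.Module):
    """Cross-attention from ViT tokens to the conv prior (one possible injection scheme)."""
    def __init__(self, embed_dim=192):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, vit_tokens, prior_tokens):
        injected, _ = self.attn(vit_tokens, prior_tokens, prior_tokens)
        return self.norm(vit_tokens + injected)


class ViTWithAdapter(nn.Module):
    """Plain ViT kept intact; the adapter adds priors only at fine-tuning time."""
    def __init__(self, embed_dim=192):
        super().__init__()
        self.backbone = ToyViTBackbone(embed_dim)
        self.prior = ConvSpatialPrior(embed_dim)
        self.injector = AdapterInjector(embed_dim)

    def forward(self, x):
        vit_tokens = self.backbone(x)
        prior_tokens = self.prior(x)
        return self.injector(vit_tokens, prior_tokens)  # tokens for a dense-prediction head


if __name__ == "__main__":
    model = ViTWithAdapter()
    feats = model(torch.randn(2, 3, 224, 224))
    print(feats.shape)  # torch.Size([2, 196, 192])
```

The released model is considerably more elaborate (multi-scale features and repeated interaction between the two branches), but the division of labor illustrated here is the same: the backbone remains a plain transformer, and the adapter supplies the image priors it lacks.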


Related research

03/25/2021 - Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
This paper presents a new vision Transformer, called Swin Transformer, t...

06/19/2022 - EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm
Motivated by biological evolution, this paper explains the rationality o...

11/14/2022 - EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
We launch EVA, a vision-centric foundation model to explore the limits o...

03/14/2023 - AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+
Unsupervised learning of vision transformers seeks to pretrain an encode...

04/20/2022 - Residual Mixture of Experts
Mixture of Experts (MoE) is able to scale up vision transformers effecti...

06/01/2021 - You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection
Can Transformer perform 2D object-level recognition from a pure sequence...

07/21/2021 - CycleMLP: A MLP-like Architecture for Dense Prediction
This paper presents a simple MLP-like architecture, CycleMLP, which is a...
