AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

05/26/2022
by Shoufa Chen, et al.

Although pre-trained Vision Transformers (ViTs) have achieved great success in computer vision, adapting a ViT to various image and video tasks is challenging because of its heavy computation and storage burdens: each model needs to be independently and fully fine-tuned for different tasks, which limits its transferability across domains. To address this challenge, we propose an effective adaptation approach for Transformers, namely AdaptFormer, which can adapt pre-trained ViTs to many different image and video tasks efficiently. It possesses several benefits more appealing than prior arts. Firstly, AdaptFormer introduces lightweight modules that add less than 2% extra parameters to a ViT, yet it is able to increase the ViT's transferability without updating its original pre-trained parameters, significantly outperforming the existing 100% fully fine-tuned models on action recognition benchmarks. Secondly, it is plug-and-play in different Transformers and scalable to many visual tasks. Thirdly, extensive experiments on five image and video datasets show that AdaptFormer largely improves ViTs in the target domains. For example, updating just 1.5% extra parameters achieves about 10% and 19% relative improvement over the fully fine-tuned models on Something-Something v2 and HMDB51, respectively. Project page: http://www.shoufachen.com/adaptformer-page.
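The core idea — adding a tiny trainable bottleneck branch alongside a frozen pre-trained block — can be sketched as follows. This is a minimal illustration in NumPy, not the paper's implementation; the dimensions (d=768 as in ViT-B, a bottleneck of 64, 12 blocks, ~86M backbone parameters) and the scaling factor are assumptions chosen only to show why the adapter budget stays under 2%.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class BottleneckAdapter:
    """Hypothetical sketch of a parallel bottleneck adapter:
    down-projection -> ReLU -> up-projection, scaled and added to the
    output of a frozen branch. Only these two small matrices would be
    trained; the backbone stays untouched."""

    def __init__(self, d_model=768, bottleneck=64, scale=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, (d_model, bottleneck))
        # Zero-initialized up-projection: the adapter starts as an
        # identity, so the pre-trained model's behavior is preserved.
        self.w_up = np.zeros((bottleneck, d_model))
        self.scale = scale

    def __call__(self, x, frozen_branch_out):
        # Parallel design: the adapter reads the same input as the
        # frozen branch and its scaled output is added residually.
        return frozen_branch_out + self.scale * (relu(x @ self.w_down) @ self.w_up)

    def num_params(self):
        return self.w_down.size + self.w_up.size

# Rough parameter budget: one adapter per block, 12 blocks, ViT-B backbone.
adapter = BottleneckAdapter()
fraction = 12 * adapter.num_params() / 86e6
print(f"adapter params as fraction of backbone: {fraction:.3f}")  # ~0.014, i.e. under 2%
```

With 2 × 768 × 64 parameters per adapter across 12 blocks, the trainable budget is about 1.2M parameters against an ~86M-parameter backbone, consistent with the "less than 2% extra parameters" regime the abstract describes.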


