Stitched ViTs are Flexible Vision Backbones

06/30/2023
by Zizheng Pan, et al.

Large pretrained plain vision Transformers (ViTs) have been the workhorse for many downstream tasks. However, existing works that utilize off-the-shelf ViTs are inefficient to train and deploy: adopting a ViT of a particular size requires separate training and locks in a fixed performance-efficiency trade-off. In this paper, we draw inspiration from stitchable neural networks (SN-Net), a framework that cheaply produces a single model covering a rich space of subnetworks by stitching together a family of pretrained models, thereby supporting diverse performance-efficiency trade-offs at runtime. Building on this foundation, we introduce SN-Netv2, a systematically improved model stitching framework that facilitates downstream task adaptation. Specifically, we first propose a two-way stitching scheme to enlarge the stitching space. We then design a resource-constrained sampling strategy that accounts for the underlying FLOPs distribution of the space to improve sampling. Finally, we observe that learning stitching layers amounts to a low-rank update, which plays an essential role on downstream tasks in stabilizing training and ensuring a good Pareto frontier. With extensive experiments on ImageNet-1K, ADE20K, COCO-Stuff-10K, NYUv2 and COCO-2017, SN-Netv2 demonstrates a strong ability to serve as a flexible vision backbone, offering clear advantages in both training efficiency and adaptation. Code will be released at https://github.com/ziplab/SN-Netv2.
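
To make the stitching idea concrete, below is a minimal PyTorch-style sketch of a stitching layer that maps token features from one pretrained ViT's block output into the embedding dimension of another, parameterized as a fixed base projection plus a trainable low-rank residual, in the spirit of the abstract's observation that learning stitching layers is a low-rank update. Class and parameter names (LowRankStitchingLayer, d_src, d_tgt, rank) are illustrative assumptions, not the released SN-Netv2 implementation.

# Illustrative sketch (not the authors' code): a stitching layer that maps
# token features from one ViT's block output (dim d_src) to another ViT's
# block input (dim d_tgt). The trainable part is a low-rank residual on top
# of a fixed base projection.
import torch
import torch.nn as nn


class LowRankStitchingLayer(nn.Module):
    def __init__(self, d_src: int, d_tgt: int, rank: int = 16):
        super().__init__()
        # Base projection, e.g. initialized from paired activations and then
        # kept frozen (an assumption made for this sketch).
        self.base = nn.Linear(d_src, d_tgt, bias=False)
        self.base.weight.requires_grad = False
        # Trainable low-rank update: effective weight W = W_base + B @ A.
        self.A = nn.Parameter(torch.zeros(rank, d_src))
        self.B = nn.Parameter(torch.zeros(d_tgt, rank))
        nn.init.normal_(self.A, std=1e-3)  # B stays zero, so the update starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, tokens, d_src] -> [batch, tokens, d_tgt]
        return self.base(x) + x @ self.A.t() @ self.B.t()


if __name__ == "__main__":
    # Example: stitch a 192-d block output (ViT-Ti scale) into a 384-d block input (ViT-S scale).
    layer = LowRankStitchingLayer(d_src=192, d_tgt=384, rank=16)
    tokens = torch.randn(2, 197, 192)
    print(layer(tokens).shape)  # torch.Size([2, 197, 384])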
