Grafting Vision Transformers

10/28/2022
by Jongwoo Park et al.

Vision Transformers (ViTs) have recently become the state of the art across many computer vision tasks. In contrast to convolutional neural networks (CNNs), ViTs enable global information sharing even within the shallow layers of a network, i.e., among high-resolution features. However, this advantage was later overlooked with the success of pyramid architectures such as the Swin Transformer, which offer better performance-complexity trade-offs. In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in both high- and low-resolution features alike. GrafT can be easily adopted in both homogeneous and pyramid Transformers and shows consistent gains in either setting. It has the flexibility to branch out at arbitrary depths, widening a network with multiple scales. The grafting operation lets the branches share most of the parameters and computations of the backbone, adding only minimal complexity but a higher yield. Moreover, by progressively compounding multi-scale receptive fields, GrafT enables communication between local regions. We demonstrate the benefits of the proposed method on multiple benchmarks, including image classification (ImageNet-1K), semantic segmentation (ADE20K), and object detection and instance segmentation (COCO 2017). Our code and models will be made available.
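The abstract describes the grafting mechanism only at a high level, so a small sketch may help make it concrete. Below is an illustrative PyTorch sketch of the general idea as stated above: a side branch is grafted onto a Transformer block at some depth, attends globally over a pooled (coarser-scale) copy of the tokens, and fuses the result back into the full-resolution stream while the backbone is reused unchanged. Everything here (GraftBranch, pool_ratio, the pooling and fusion choices) is a hypothetical reconstruction for intuition, not the paper's actual implementation.

```python
# Hypothetical sketch of a grafted multi-scale branch on a ViT block.
# GraftBranch and pool_ratio are illustrative names, not the paper's API.
import torch
import torch.nn as nn


class GraftBranch(nn.Module):
    """Grafted side branch: attend over pooled (coarser-scale) tokens,
    then upsample and fuse back into the full-resolution stream."""

    def __init__(self, dim: int, num_heads: int = 4, pool_ratio: int = 2):
        super().__init__()
        self.pool_ratio = pool_ratio
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, hw: tuple) -> torch.Tensor:
        # x: (B, N, C) tokens laid out on an (H, W) grid.
        B, N, C = x.shape
        H, W = hw
        grid = x.transpose(1, 2).reshape(B, C, H, W)
        # Downsample to a coarser scale; global attention is cheap here.
        coarse = nn.functional.avg_pool2d(grid, self.pool_ratio)
        h, w = coarse.shape[-2:]
        tokens = coarse.flatten(2).transpose(1, 2)          # (B, h*w, C)
        attended, _ = self.attn(*(self.norm(tokens),) * 3)  # self-attention
        # Upsample back to full resolution and fuse residually.
        up = attended.transpose(1, 2).reshape(B, C, h, w)
        up = nn.functional.interpolate(up, size=(H, W), mode="nearest")
        return x + up.flatten(2).transpose(1, 2)


# Usage: graft a branch after an off-the-shelf encoder block at some depth.
block = nn.TransformerEncoderLayer(d_model=96, nhead=4, batch_first=True)
graft = GraftBranch(dim=96, pool_ratio=2)
x = torch.randn(2, 56 * 56, 96)       # high-resolution tokens (56x56 grid)
x = graft(block(x), hw=(56, 56))      # backbone block + grafted branch
```

In this sketch, attention runs only on the pooled tokens, so the branch's attention cost shrinks roughly quadratically with pool_ratio; that is one plausible way to reconcile global attention on high-resolution features with the abstract's claim of minimal added complexity.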


