Global Context Vision Transformers

06/20/2022
by Ali Hatamizadeh, et al.

We propose the Global Context Vision Transformer (GC ViT), a novel architecture that enhances parameter and compute utilization. Our method leverages global context self-attention modules, jointly with local self-attention, to model both long- and short-range spatial interactions effectively and efficiently, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs by proposing to use modified fused inverted residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the tiny, small and base variants of GC ViT with 28M, 51M and 90M parameters achieve 83.2%, 83.9% and 84.4% Top-1 accuracy, respectively, surpassing comparably-sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work on the downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, sometimes by large margins. Code is available at https://github.com/NVlabs/GCViT.
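To make the global-query idea concrete, below is a minimal PyTorch sketch of the mechanism the abstract describes: query tokens derived once from the whole image attend to keys and values taken from each local window, so no attention masks or shifted windows are required. The module name, tensor shapes, and the window-ordering assumption behind `repeat_interleave` are illustrative simplifications, not the official NVlabs/GCViT implementation.

```python
import torch
import torch.nn as nn


class GlobalWindowAttention(nn.Module):
    """Sketch of global context self-attention: global queries, local keys/values.

    Hypothetical simplification for illustration; see the official repo for
    the actual module (relative position bias, token generator, etc. omitted).
    """

    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # Only keys and values are computed from the local window tokens;
        # the queries come in precomputed from a global token generator.
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, q_global):
        # x:        (B * num_windows, N, C) local window tokens, N tokens per window
        # q_global: (B, N, C)               global query tokens shared by all windows
        Bw, N, C = x.shape
        B = q_global.shape[0]
        kv = self.kv(x).reshape(Bw, N, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each (Bw, heads, N, head_dim)
        # Broadcast each image's global queries to all of its windows
        # (assumes window-major batching: all windows of image 0, then image 1, ...).
        q = torch.repeat_interleave(q_global, Bw // B, dim=0)
        q = q.reshape(Bw, N, self.num_heads, self.head_dim).permute(0, 2, 1, 3)
        attn = (q * self.scale) @ k.transpose(-2, -1)  # (Bw, heads, N, N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bw, N, C)
        return self.proj(out)


# Toy usage: batch of 2 images, each partitioned into 4 windows of 7x7 tokens.
attn = GlobalWindowAttention(dim=96, num_heads=3)
x = torch.randn(2 * 4, 49, 96)      # local window tokens
q_global = torch.randn(2, 49, 96)   # e.g. from a pooled global token generator
y = attn(x, q_global)               # (8, 49, 96)
```

Because the queries are fixed per image rather than recomputed per window, every window is updated with the same image-level context, which is what lets this scheme capture long-range interactions at local-attention cost.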
