PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture

01/04/2022
by Kai Han, et al.

Transformer networks have achieved great progress on computer vision tasks. The Transformer-in-Transformer (TNT) architecture utilizes an inner transformer and an outer transformer to extract both local and global representations. In this work, we present new TNT baselines by introducing two advanced designs: 1) a pyramid architecture, and 2) a convolutional stem. The new "PyramidTNT" significantly improves the original TNT by establishing hierarchical representations. PyramidTNT achieves better performance than previous state-of-the-art vision transformers such as Swin Transformer. We hope this new baseline will be helpful for further research and application of vision transformers. Code will be available at https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch.
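To make the two designs concrete, below is a minimal PyTorch sketch of a stride-2 convolutional stem feeding a four-stage pyramid whose spatial resolution halves and channel width grows between stages. This is an illustration under stated assumptions, not the paper's implementation: the `ConvStem` module, the stage widths and depths, and the use of plain `nn.TransformerEncoderLayer` blocks in place of the actual TNT inner/outer blocks are all hypothetical; the official repository linked above contains the real model.

```python
import torch
import torch.nn as nn


class ConvStem(nn.Module):
    """Assumed convolutional stem: stacked stride-2 3x3 convolutions in
    place of a single large-stride patch-embedding projection."""

    def __init__(self, in_chans=3, dim=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_chans, dim // 2, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim // 2),
            nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):  # (B, 3, H, W) -> (B, dim, H/4, W/4)
        return self.layers(x)


class PyramidBackbone(nn.Module):
    """Four stages: spatial resolution halves and channel width grows
    between stages, producing hierarchical (pyramid) feature maps."""

    def __init__(self, dims=(64, 128, 256, 512), depths=(2, 2, 6, 2), heads=4):
        super().__init__()
        self.stem = ConvStem(dim=dims[0])
        self.stages = nn.ModuleList()
        self.downsamples = nn.ModuleList()
        for i, (dim, depth) in enumerate(zip(dims, depths)):
            # Placeholder blocks; the real model uses TNT inner/outer blocks.
            self.stages.append(nn.Sequential(*[
                nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
                for _ in range(depth)
            ]))
            if i < len(dims) - 1:  # stride-2 conv merges tokens between stages
                self.downsamples.append(nn.Conv2d(dim, dims[i + 1], 2, stride=2))

    def forward(self, x):
        x = self.stem(x)
        features = []
        for i, stage in enumerate(self.stages):
            b, c, h, w = x.shape
            tokens = stage(x.flatten(2).transpose(1, 2))  # (B, H*W, C)
            x = tokens.transpose(1, 2).reshape(b, c, h, w)
            features.append(x)  # strides 4, 8, 16, 32 w.r.t. the input
            if i < len(self.downsamples):
                x = self.downsamples[i](x)
        return features


if __name__ == "__main__":
    feats = PyramidBackbone()(torch.randn(1, 3, 224, 224))
    print([tuple(f.shape) for f in feats])
```

The multi-scale feature maps are what let such a backbone plug into FPN-style detection and segmentation heads, the same property that hierarchical designs such as Swin Transformer rely on.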
