Transformer in Transformer

02/27/2021
by Kai Han, et al.

The Transformer is a type of self-attention-based neural network originally applied to NLP tasks. Recently, pure transformer-based models have been proposed to solve computer vision problems. These visual transformers usually view an image as a sequence of patches but ignore the intrinsic structural information inside each patch. In this paper, we propose a novel Transformer-iN-Transformer (TNT) model that models both patch-level and pixel-level representations. In each TNT block, an outer transformer block processes patch embeddings, and an inner transformer block extracts local features from pixel embeddings. The pixel-level features are projected into the patch embedding space by a linear transformation layer and then added to the patch embeddings. By stacking TNT blocks, we build the TNT model for image recognition. Experiments on the ImageNet benchmark and downstream tasks demonstrate the superiority and efficiency of the proposed TNT architecture. For example, our TNT achieves 81.3% top-1 accuracy on ImageNet, which is 1.5 percentage points higher than that of DeiT at a similar computational cost. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/TNT.
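The data flow of a single TNT block described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`tnt_block`, `self_attention`, `matmul`, `random_matrix`) are hypothetical, attention uses a single head with identity query/key/value projections, and layer normalization, MLP sub-layers, position encodings and the class token are all omitted. Flattening each patch's pixel embeddings before the linear projection is also an assumption made for illustration; the abstract only states that a linear transformation is used.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with a residual connection.

    Simplification (assumption): queries, keys and values are the inputs
    themselves, with no learned projection matrices.
    """
    dim = len(x[0])
    scores = matmul(x, [list(col) for col in zip(*x)])  # x @ x^T
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    attended = matmul(weights, x)
    return [[u + v for u, v in zip(r, a)] for r, a in zip(x, attended)]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def tnt_block(pixel_emb, patch_emb, proj):
    """One simplified TNT block.

    pixel_emb: list of n patches, each an (m x c) matrix of pixel embeddings
    patch_emb: (n x d) matrix of patch embeddings
    proj:      (m*c x d) linear projection from pixel space to patch space
    """
    # Inner transformer: pixels attend to each other within their own patch.
    pixel_emb = [self_attention(pixels) for pixels in pixel_emb]

    # Project pixel-level features into patch space and add them to the patch.
    fused = []
    for pixels, patch in zip(pixel_emb, patch_emb):
        flat = [v for pixel in pixels for v in pixel]  # assumed flattening
        projected = matmul([flat], proj)[0]
        fused.append([p + q for p, q in zip(patch, projected)])

    # Outer transformer: patches attend to each other across the image.
    return pixel_emb, self_attention(fused)


rng = random.Random(0)
n, m, c, d = 4, 16, 3, 8  # patches, pixels per patch, pixel dim, patch dim
pixels_in = [random_matrix(m, c, rng) for _ in range(n)]
patches_in = random_matrix(n, d, rng)
projection = random_matrix(m * c, d, rng)
pixels_out, patches_out = tnt_block(pixels_in, patches_in, projection)
```

Stacking several such blocks, each consuming the pixel and patch embeddings produced by the previous one, mirrors how the abstract describes building the full TNT model.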


research
06/10/2021

CAT: Cross Attention in Vision Transformer

Since Transformer has found widespread use in NLP, the potential of Tran...
research
06/02/2022

Modeling Image Composition for Complex Scene Generation

We present a method that achieves state-of-the-art results on challengin...
research
03/26/2023

Sector Patch Embedding: An Embedding Module Conforming to The Distortion Pattern of Fisheye Image

Fisheye cameras suffer from image distortion while having a large field ...
research
10/23/2022

UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for Face Forgery Detection

Intra-frame inconsistency has been proved to be effective for the genera...
research
10/14/2022

Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?

Recently vision transformers (ViT) have been applied successfully for va...
research
02/13/2022

BViT: Broad Attention based Vision Transformer

Recent works have demonstrated that transformer can achieve promising pe...
research
01/24/2022

Patches Are All You Need?

Although convolutional networks have been the dominant architecture for ...
