CabViT: Cross Attention among Blocks for Vision Transformer

11/14/2022
by Haokui Zhang, et al.

Since the vision transformer (ViT) achieved impressive performance in image classification, an increasing number of researchers have turned their attention to designing more efficient vision transformer models. A common research line reduces the computational cost of self-attention modules by adopting sparse attention or local attention windows. In contrast, we propose to design high-performance transformer-based architectures by densifying the attention pattern. Specifically, we propose cross attention among blocks of ViT (CabViT), which uses tokens from previous blocks in the same stage as extra input to the multi-head attention of a transformer block. The proposed CabViT enhances the interaction of tokens across blocks with potentially different semantics and encourages more information to flow to the lower levels, which together improve model performance and convergence at limited extra cost. Based on the proposed mechanism, we design a series of CabViT models that achieve the best trade-off between model size, computational cost, and accuracy. For instance, without the need of knowledge distillation to strengthen training, CabViT achieves 83.0% top-1 accuracy with about 3.9G FLOPs, saving almost half the parameters and 13% of the computational cost while gaining 0.9% accuracy over one strong baseline, and using fewer parameters while gaining 0.6% accuracy over another.
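To make the mechanism concrete, the following is a minimal sketch of cross attention among blocks, not the authors' released code: it assumes a PyTorch implementation in which tokens produced by earlier blocks of the same stage are concatenated into the key/value sequence of the current block's multi-head attention. The class name CrossBlockAttention and the prev_tokens argument are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a block's multi-head attention
# attends to its own tokens plus tokens from earlier blocks in the same
# stage, which densifies the attention pattern across blocks.
import torch
import torch.nn as nn


class CrossBlockAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, prev_tokens):
        # x: (batch, N, dim) tokens of the current block.
        # prev_tokens: list of (batch, N, dim) tensors from earlier blocks.
        # Queries come from the current block; keys/values also include
        # the tokens of previous blocks in the same stage.
        kv = torch.cat([x] + list(prev_tokens), dim=1)
        out, _ = self.attn(query=x, key=kv, value=kv, need_weights=False)
        return out


if __name__ == "__main__":
    blk = CrossBlockAttention(dim=64, num_heads=4)
    x = torch.randn(2, 49, 64)                             # current block tokens
    history = [torch.randn(2, 49, 64) for _ in range(2)]   # tokens from two earlier blocks
    print(blk(x, history).shape)                           # torch.Size([2, 49, 64])
```

In this reading, the extra cost is confined to a longer key/value sequence, while the number of queries (and hence the output token count) is unchanged.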

