DeepViT: Towards Deeper Vision Transformer

03/22/2021
by   Daquan Zhou, et al.

Vision transformers (ViTs) have recently been applied successfully to image classification tasks. In this paper, we show that, unlike convolutional neural networks (CNNs), whose performance can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled to be deeper. More specifically, we empirically observe that this scaling difficulty is caused by an attention collapse issue: as the transformer goes deeper, the attention maps gradually become more similar and, beyond certain layers, nearly identical. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This demonstrates that in the deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning, which prevents the model from achieving the expected performance gain. Based on this observation, we propose a simple yet effective method, named Re-attention, which regenerates the attention maps to increase their diversity across layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modifications to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet. Code is publicly available.
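
The abstract describes Re-attention only at a high level, so below is a minimal PyTorch sketch of one way such a module could look: the per-head attention maps are mixed by a learnable head-to-head matrix (here called theta) before being applied to the values, which is intended to keep the maps from collapsing into near-identical patterns across layers. The class name ReAttention, the BatchNorm2d normalization, and the initialization of theta are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ReAttention(nn.Module):
    """Sketch of a Re-attention block: standard multi-head self-attention,
    except the per-head attention maps are mixed by a learnable head-to-head
    matrix (theta) and normalized before being applied to the values."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable head-mixing matrix, initialized near the identity (assumption).
        self.theta = nn.Parameter(
            torch.eye(num_heads) + 0.01 * torch.randn(num_heads, num_heads)
        )
        # Normalization over the mixed attention maps (assumed choice).
        self.norm = nn.BatchNorm2d(num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        # Re-attention: mix attention maps across heads with theta, then normalize.
        attn = torch.einsum('hg,bgij->bhij', self.theta, attn)
        attn = self.norm(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage sketch: a batch of 4 sequences of 197 tokens with embedding size 384.
if __name__ == "__main__":
    block = ReAttention(dim=384, num_heads=8)
    tokens = torch.randn(4, 197, 384)
    print(block(tokens).shape)  # torch.Size([4, 197, 384])
```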

Related research

04/26/2021 · Improve Vision Transformers Training by Suppressing Over-smoothing
Introducing the transformer structure into computer vision tasks holds t...

10/16/2022 · Scratching Visual Transformer's Back with Uniform Attention
The favorable performance of Vision Transformers (ViTs) is often attribu...

01/05/2023 · Skip-Attention: Improving Vision Transformers by Paying Less Attention
This work aims to improve the efficiency of vision transformers (ViT). W...

07/10/2022 · Horizontal and Vertical Attention in Transformers
Transformers are built upon multi-head scaled dot-product attention and ...

12/17/2020 · Transformer Interpretability Beyond Attention Visualization
Self-attention techniques, and specifically Transformers, are dominating...

05/11/2015 · Training Deeper Convolutional Networks with Deep Supervision
One of the most promising ways of improving the performance of deep conv...

01/23/2023 · AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems
Transformer models gain popularity because of their superior inference a...
