Improve Vision Transformers Training by Suppressing Over-smoothing

04/26/2021
by   Chengyue Gong, et al.

Introducing the transformer architecture into computer vision tasks holds the promise of a better speed-accuracy trade-off than traditional convolutional networks. However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results, so recent works propose modifying transformer structures by incorporating convolutional layers to improve performance on vision tasks. This work investigates how to stabilize the training of vision transformers without special structural modifications. We observe that the instability of transformer training on vision tasks can be attributed to an over-smoothing problem: the self-attention layers tend to map different patches of the input image to similar latent representations, causing information loss and performance degradation, especially when the number of layers is large. We then propose a number of techniques to alleviate this problem, including additional loss functions that encourage patch diversity, prevent loss of information, and discriminate between patches via an additional patch-level classification loss for CutMix. We show that our proposed techniques stabilize training and allow us to train wider and deeper vision transformers, achieving 85.0% top-1 accuracy on the ImageNet validation set without introducing extra teachers or additional convolution layers. Our code will be made publicly available at https://github.com/ChengyueGongR/PatchVisionTransformer .
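As a rough illustration of the over-smoothing diagnosis, the pairwise cosine similarity between patch representations can be measured and penalized directly. The sketch below is a minimal PyTorch version of such a diversity-encouraging loss, assuming patch tokens of shape (batch, num_patches, dim) are available from a transformer layer; the function name patch_diversity_loss and the exact formulation are illustrative, not taken from the released code.

    import torch
    import torch.nn.functional as F

    def patch_diversity_loss(patch_tokens):
        # patch_tokens: (batch, num_patches, dim) output of a self-attention layer.
        # Returns the mean off-diagonal cosine similarity between patch tokens;
        # adding this term to the training loss pushes patch representations apart.
        x = F.normalize(patch_tokens, dim=-1)        # unit-normalize each token
        sim = torch.bmm(x, x.transpose(1, 2))        # (batch, p, p) cosine similarities
        p = sim.size(-1)
        eye = torch.eye(p, device=sim.device)
        off_diag = sim * (1.0 - eye)                 # zero out self-similarities
        return off_diag.sum(dim=(1, 2)).div(p * (p - 1)).mean()

Weighting such a term against the standard classification loss would be a tunable design choice; the abstract does not specify the coefficient.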
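The patch-level classification loss for CutMix can be read as applying a shared prediction head to every patch token and supervising each patch with the label of the image it was cut from, so that mixed patches stay distinguishable. A minimal sketch under that assumption follows; the names patch_cutmix_loss, patch_logits, and patch_labels are hypothetical, not the repository's API.

    import torch
    import torch.nn.functional as F

    def patch_cutmix_loss(patch_logits, patch_labels):
        # patch_logits: (batch, num_patches, num_classes), a shared linear head
        #               applied to every patch token.
        # patch_labels: (batch, num_patches) integer labels; under CutMix each
        #               patch keeps the label of the source image it was cut from.
        b, p, c = patch_logits.shape
        return F.cross_entropy(patch_logits.reshape(b * p, c),
                               patch_labels.reshape(b * p))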


Related research

11/02/2021  Can Vision Transformers Perform Convolution?
Several recent studies have demonstrated that attention-based networks, ...

01/24/2022  Patches Are All You Need?
Although convolutional networks have been the dominant architecture for ...

02/13/2022  BViT: Broad Attention based Vision Transformer
Recent works have demonstrated that transformer can achieve promising pe...

03/22/2021  DeepViT: Towards Deeper Vision Transformer
Vision transformers (ViTs) have been successfully applied in image class...

03/12/2022  The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
Vision transformers (ViTs) have gained increasing popularity as they are...

04/13/2023  VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking
The lack of interpretability of the Vision Transformer may hinder its us...

10/21/2022  Boosting vision transformers for image retrieval
Vision transformers have achieved remarkable progress in vision tasks su...
