Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice

03/09/2022
by Peihao Wang, et al.

Vision Transformers (ViTs) have recently demonstrated promise in computer vision problems. However, unlike Convolutional Neural Networks (CNNs), the performance of ViTs is known to saturate quickly as depth increases, due to the observed attention collapse or patch uniformity. Despite a couple of empirical solutions, a rigorous framework for studying this scalability issue remains elusive. In this paper, we first establish a rigorous theoretical framework to analyze ViT features in the Fourier spectrum domain. We show that the self-attention mechanism inherently amounts to a low-pass filter, which indicates that when a ViT scales up its depth, excessive low-pass filtering causes feature maps to preserve only their Direct-Current (DC) component. We then propose two straightforward yet effective techniques to mitigate this undesirable low-pass limitation. The first technique, termed AttnScale, decomposes a self-attention block into low-pass and high-pass components, then rescales and combines these two filters to produce an all-pass self-attention matrix. The second technique, termed FeatScale, re-weights feature maps on separate frequency bands to amplify the high-frequency signals. Both techniques are efficient and hyperparameter-free, and effectively overcome relevant ViT training artifacts such as attention collapse and patch uniformity. By seamlessly plugging our techniques into multiple ViT variants, we demonstrate that they consistently help ViTs benefit from deeper architectures, bringing up to 1.1% performance gains. We publicly release our codes and pre-trained models at https://github.com/VITA-Group/ViT-Anti-Oversmoothing.
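The abstract compresses each technique into a single sentence, so a concrete sketch may help. Below is a minimal PyTorch sketch of how AttnScale and FeatScale could look as drop-in modules, based only on the description above. The module names come from the paper, but the specific choices here, taking (1/n) 1 1^T A as the low-pass part of the attention matrix, a learnable per-head scale omega, and learnable per-channel band weights lam_dc and lam_hf, are our assumptions rather than the authors' exact parameterization; see the released code at https://github.com/VITA-Group/ViT-Anti-Oversmoothing for the reference implementation.

import torch
import torch.nn as nn


class AttnScale(nn.Module):
    """Sketch of AttnScale: split a post-softmax attention matrix into its
    low-pass (DC) and high-pass components, then amplify the high-pass part
    with a learnable per-head scale before recombining."""

    def __init__(self, num_heads: int):
        super().__init__()
        # One learnable scale per head; zero-init makes the module an identity
        # at the start of training (an assumption, not stated in the abstract).
        self.omega = nn.Parameter(torch.zeros(num_heads))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, heads, n, n), rows sum to 1 after softmax.
        # Low-pass component (1/n) 1 1^T A: every row becomes the column mean.
        attn_lp = attn.mean(dim=-2, keepdim=True).expand_as(attn)
        attn_hp = attn - attn_lp                    # high-pass residual
        omega = self.omega.view(1, -1, 1, 1)
        return attn_lp + (1.0 + omega) * attn_hp    # "all-pass" recombination


class FeatScale(nn.Module):
    """Sketch of FeatScale: re-weight token features on two frequency bands,
    the DC band (mean over tokens) and the high-frequency residual."""

    def __init__(self, dim: int):
        super().__init__()
        self.lam_dc = nn.Parameter(torch.zeros(dim))  # per-channel DC weight
        self.lam_hf = nn.Parameter(torch.zeros(dim))  # per-channel HF weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, dim)
        x_dc = x.mean(dim=1, keepdim=True)   # DC band: average over tokens
        x_hf = x - x_dc                      # high-frequency residual
        return (1.0 + self.lam_dc) * x_dc + (1.0 + self.lam_hf) * x_hf

In a ViT block, AttnScale would wrap the attention matrix right after the softmax, and FeatScale the token features after the attention (or MLP) output. Because every learned scale starts at zero, both modules begin as identities and introduce no new tuning knobs, which is consistent with the "hyperparameter-free" claim in the abstract.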

Related research

10/28/2021
Blending Anti-Aliasing into Vision Transformer
The transformer architectures, based on self-attention mechanism and con...

08/21/2020
Delving Deeper into Anti-aliasing in ConvNets
Aliasing refers to the phenomenon that high frequency signals degenerate...

08/31/2023
Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection
Vision Transformer (ViT) models have demonstrated a breakthrough in a wi...

08/22/2023
SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation
Recent studies show that self-attentions behave like low-pass filters (a...

08/18/2022
Learning Spatial-Frequency Transformer for Visual Object Tracking
Recent trackers adopt the Transformer to combine or replace the widely u...

10/25/2022
MEW-UNet: Multi-axis representation learning in frequency domain for medical image segmentation
Recently, Visual Transformer (ViT) has been widely used in various field...

02/03/2023
PSST! Prosodic Speech Segmentation with Transformers
Self-attention mechanisms have enabled transformers to achieve superhuma...
