Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

07/11/2022
by Ting Yao, et al.

Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone for computer vision tasks, but the self-attention computation in Transformer scales quadratically w.r.t. the input patch number. Thus, existing solutions commonly employ down-sampling operations (e.g., average pooling) over keys/values to dramatically reduce the computational cost. In this work, we argue that such over-aggressive down-sampling is not invertible and inevitably causes information loss, especially for high-frequency components in objects (e.g., texture details). Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that unifies invertible down-sampling via wavelet transforms with self-attention learning. This design enables self-attention learning with lossless down-sampling over keys/values, facilitating the pursuit of a better efficiency-vs-accuracy trade-off. Furthermore, inverse wavelet transforms are leveraged to strengthen self-attention outputs by aggregating local contexts with an enlarged receptive field. We validate the superiority of Wave-ViT through extensive experiments over multiple vision tasks (e.g., image recognition, object detection and instance segmentation). Its performance surpasses state-of-the-art ViT backbones with comparable FLOPs. Source code is available at <https://github.com/YehLi/ImageNetModel>.
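To illustrate the key property the abstract relies on, here is a minimal numpy sketch (not the authors' implementation) of a single-level 2D Haar wavelet transform: spatial resolution halves per axis (so the key/value token count drops 4x), yet the four sub-bands together retain all information, as the exact inverse transform demonstrates.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT on an (H, W, C) feature map; H and W must be even.

    Returns four (H/2, W/2, C) sub-bands: low-low (LL) plus the
    high-frequency detail bands (LH, HL, HH).
    """
    a = x[0::2, 0::2]  # top-left of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2: rebuilds the (H, W, C) feature map."""
    a = (ll + lh + hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    h, w, ch = ll.shape
    x = np.zeros((2 * h, 2 * w, ch), dtype=ll.dtype)
    x[0::2, 0::2] = a
    x[0::2, 1::2] = b
    x[1::2, 0::2] = c
    x[1::2, 1::2] = d
    return x

# Lossless down-sampling for keys/values: stacking the sub-bands along
# the channel axis gives 4x fewer spatial tokens with no information loss,
# unlike average pooling, which discards the high-frequency bands.
x = np.random.randn(8, 8, 16)
ll, lh, hl, hh = haar_dwt2(x)
kv = np.concatenate([ll, lh, hl, hh], axis=-1)  # (4, 4, 64)
assert np.allclose(haar_idwt2(ll, lh, hl, hh), x)  # perfectly invertible
```

By contrast, average pooling keeps only (a scaled version of) the LL band and throws the LH/HL/HH details away, which is exactly the high-frequency information loss the paper targets.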


Related research:

- Contextual Transformer Networks for Visual Recognition (07/26/2021)
- Vision Transformer with Deformable Attention (01/03/2022)
- SDWNet: A Straight Dilated Network with Wavelet Transformation for Image Deblurring (10/12/2021)
- Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection (08/31/2023)
- P2T: Pyramid Pooling Transformer for Scene Understanding (06/22/2021)
- Fast Vision Transformers with HiLo Attention (05/26/2022)
- Unlocking Fine-Grained Details with Wavelet-based High-Frequency Enhancement in Transformers (08/25/2023)
