Rethinking Local Perception in Lightweight Vision Transformer

03/31/2023
by   Qihang Fan, et al.

Vision Transformers (ViTs) have been shown to be effective in various vision tasks. However, scaling them down to a mobile-friendly size leads to significant performance degradation, so developing lightweight vision transformers has become a crucial area of research. This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement. CloFormer explores the relationship between the globally shared weights often used in vanilla convolutional operators and the token-specific context-aware weights that appear in attention, and then proposes an effective and straightforward module to capture high-frequency local information. In CloFormer, we introduce AttnConv, a convolution operator in the style of attention. AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features. Combining AttnConv with vanilla attention, which uses pooling to reduce FLOPs, enables CloFormer to perceive both high-frequency and low-frequency information. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate the superiority of CloFormer.
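
The two kinds of weights described in the abstract can be illustrated with a toy example. The sketch below is not the paper's AttnConv: it works on a 1D list of scalar tokens, and every function name, the smoothing kernel, the tanh gate, and the pooling stride are assumptions chosen only for illustration. It shows a local branch (one kernel shared by all positions, followed by a token-specific gate) next to a global branch (attention whose keys and values are average-pooled to reduce cost).

```python
import math


def shared_local_aggregate(x, kernel):
    """Aggregate neighbours with ONE kernel shared by every position (zero padding)."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            p = i + j - radius
            if 0 <= p < len(x):
                acc += w * x[p]
        out.append(acc)
    return out


def context_aware_enhance(q, k, v):
    """Token-specific gate in [-1, 1], computed from each token's own q*k product.

    The tanh gate is an illustrative assumption, not the paper's exact design.
    """
    return [math.tanh(qi * ki) * vi for qi, ki, vi in zip(q, k, v)]


def avg_pool(x, stride):
    """Average non-overlapping windows; the last window may be shorter."""
    windows = [x[i:i + stride] for i in range(0, len(x), stride)]
    return [sum(w) / len(w) for w in windows]


def pooled_attention(x, stride):
    """Softmax attention where keys/values are pooled, so each query sees fewer tokens."""
    kv = avg_pool(x, stride)
    out = []
    for q in x:
        scores = [q * k for k in kv]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append(sum(e / z * val for e, val in zip(exps, kv)))
    return out


def toy_two_branch(x, kernel=(0.25, 0.5, 0.25), stride=2):
    """Return (local, global) features for a 1D token list."""
    agg = shared_local_aggregate(x, kernel)
    local = context_aware_enhance(agg, agg, agg)
    global_ = pooled_attention(x, stride)
    return local, global_


if __name__ == "__main__":
    local, global_ = toy_two_branch([0.1, 0.9, -0.4, 0.3, 0.7])
    print(local)
    print(global_)
```

In this toy setting the shared kernel is identical at every position, while the gate differs per token because it depends on that token's own features. The pooled branch reduces the number of keys and values each query compares against by roughly the stride.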


research
05/02/2023

AxWin Transformer: A Context-Aware Vision Transformer Backbone with Axial Windows

Recently Transformer has shown good performance in several vision tasks ...
research
04/14/2023

Preserving Locality in Vision Transformers for Class Incremental Learning

Learning new classes without forgetting is crucial for real-world applic...
research
08/03/2022

SSformer: A Lightweight Transformer for Semantic Segmentation

It is well believed that Transformer performs better in semantic segment...
research
10/28/2021

Blending Anti-Aliasing into Vision Transformer

The transformer architectures, based on self-attention mechanism and con...
research
06/01/2023

Lightweight Vision Transformer with Bidirectional Interaction

Recent advancements in vision backbones have significantly improved thei...
research
11/24/2021

An Image Patch is a Wave: Phase-Aware Vision MLP

Different from traditional convolutional neural network (CNN) and vision...
research
06/08/2021

Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight

Vision Transformer (ViT) attains state-of-the-art performance in visual ...
