Exploring and Improving Mobile Level Vision Transformers

08/30/2021
by Pengguang Chen, et al.

We study vision transformer structures at the mobile level in this paper and find a dramatic performance drop. We analyze the reason behind this phenomenon and propose a novel irregular patch embedding module and an adaptive patch fusion module to improve performance. We conjecture that vision transformer blocks (which consist of multi-head attention and a feed-forward network) are better suited to handling high-level information than low-level features. The irregular patch embedding module extracts patches that contain rich high-level information with different receptive fields, and the transformer blocks obtain the most useful information from these irregular patches. The processed patches then pass through the adaptive patch fusion module to produce the final features for the classifier. With the proposed improvements, a traditional uniform vision transformer structure can achieve state-of-the-art results at the mobile level. We improve the DeiT baseline by more than 9% under mobile-level settings and surpass other transformer architectures such as Swin and CoaT by a large margin.
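The abstract only names the components of the proposed pipeline. For concreteness, below is a minimal PyTorch sketch of the described flow: patch extraction at several receptive fields, standard transformer blocks, and a weighted fusion of the processed patches before the classifier. The class names, the parallel convolutional stems, the learned softmax weighting, and all hyperparameters (embedding width, strides, depth) are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class IrregularPatchEmbedding(nn.Module):
    """Sketch (assumed design): extract patches with different receptive
    fields via parallel convolutional stems and concatenate the tokens."""
    def __init__(self, in_chans=3, embed_dim=192, strides=(8, 16, 32)):
        super().__init__()
        self.stems = nn.ModuleList([
            nn.Conv2d(in_chans, embed_dim, kernel_size=s, stride=s)
            for s in strides
        ])

    def forward(self, x):
        # Each stem yields patches at a different receptive field;
        # flatten spatial dims and concatenate along the token axis.
        tokens = [stem(x).flatten(2).transpose(1, 2) for stem in self.stems]
        return torch.cat(tokens, dim=1)  # (B, N_total, embed_dim)

class AdaptivePatchFusion(nn.Module):
    """Sketch (assumed design): fuse processed patches into one feature
    vector using learned per-token weights before the classifier."""
    def __init__(self, embed_dim=192):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, tokens):
        w = torch.softmax(self.score(tokens), dim=1)   # (B, N, 1)
        return (w * tokens).sum(dim=1)                 # (B, embed_dim)

class MobileViTSketch(nn.Module):
    """Hypothetical end-to-end model: irregular patch embedding ->
    transformer blocks (multi-head attention + FFN) -> adaptive fusion."""
    def __init__(self, embed_dim=192, depth=12, num_heads=3, num_classes=1000):
        super().__init__()
        self.embed = IrregularPatchEmbedding(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.fusion = AdaptivePatchFusion(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.embed(x)        # multi-receptive-field patches
        tokens = self.blocks(tokens)  # transformer blocks
        return self.head(self.fusion(tokens))

# Usage: a batch of two 224x224 images through the sketch.
# model = MobileViTSketch()
# logits = model(torch.randn(2, 3, 224, 224))  # shape (2, 1000)
```

The sketch only mirrors the structure named in the abstract; the paper's actual patch shapes, fusion rule, and mobile-level parameter budget may differ.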

