UniNeXt: Exploring A Unified Architecture for Vision Recognition

04/26/2023
by Fangjian Lin, et al.

Vision Transformers have shown great potential in computer vision tasks. Most recent works have focused on refining the spatial token mixer for performance gains. However, we observe that a well-designed general architecture can significantly improve the performance of the entire backbone, regardless of which spatial token mixer it is equipped with. In this paper, we propose UniNeXt, an improved general architecture for the vision backbone. To verify its effectiveness, we instantiate the spatial token mixer with various typical and modern designs, including both convolution and attention modules. Compared with the architectures in which these mixers were first proposed, our UniNeXt architecture consistently boosts the performance of all the spatial token mixers and narrows the performance gap among them. Surprisingly, UniNeXt equipped with naive local window attention even outperforms the previous state-of-the-art. Interestingly, the ranking of these spatial token mixers also changes under UniNeXt, suggesting that an excellent spatial token mixer may be stifled by a suboptimal general architecture, which further shows the importance of studying the general architecture of vision backbones. All models and code will be publicly available.
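
The separation between a general architecture and its spatial token mixer can be illustrated with a minimal sketch. This is not the UniNeXt design, whose details the abstract does not give. It assumes a generic residual block of the form x + mixer(norm(x)) followed by x + mlp(norm(x)). The names `block`, `global_mean_mixer`, `local_window_mixer`, and `channel_mlp` are hypothetical, and the mixers are simple averaging stand-ins rather than real convolution or attention modules.

```python
from typing import Callable, List

Tokens = List[List[float]]  # N tokens, each with C channels
Mixer = Callable[[Tokens], Tokens]


def layer_norm(tokens: Tokens, eps: float = 1e-6) -> Tokens:
    """Normalize each token to zero mean and unit variance over its channels."""
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / (var + eps) ** 0.5 for v in tok])
    return out


def global_mean_mixer(tokens: Tokens) -> Tokens:
    """Hypothetical mixer: every token receives the mean of all tokens."""
    n = len(tokens)
    mean = [sum(col) / n for col in zip(*tokens)]
    return [list(mean) for _ in tokens]


def local_window_mixer(tokens: Tokens, window: int = 2) -> Tokens:
    """Hypothetical mixer: average tokens inside non-overlapping 1-D windows."""
    out = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        mean = [sum(col) / len(group) for col in zip(*group)]
        out.extend(list(mean) for _ in group)
    return out


def channel_mlp(tokens: Tokens) -> Tokens:
    """Stand-in for the per-token channel MLP: a plain ReLU, no learned weights."""
    return [[max(0.0, v) for v in tok] for tok in tokens]


def block(tokens: Tokens, mixer: Mixer) -> Tokens:
    """Generic residual block (an assumption): the mixer is the only swappable part."""
    mixed = mixer(layer_norm(tokens))
    tokens = [[a + b for a, b in zip(t, m)] for t, m in zip(tokens, mixed)]
    fed = channel_mlp(layer_norm(tokens))
    return [[a + b for a, b in zip(t, f)] for t, f in zip(tokens, fed)]


if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.0, 1.0, 0.0], [2.0, 2.0, 5.0]]
    for mixer in (global_mean_mixer, local_window_mixer):
        y = block(x, mixer)
        print(mixer.__name__, len(y), len(y[0]))
```

Because `block` takes the mixer as an argument, the surrounding structure (normalization, residual connections, channel MLP) stays fixed while the mixer is swapped, which is the kind of controlled comparison the abstract describes.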

