gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window

08/24/2022
by Mocho Go, et al.

Following its success in the language domain, the self-attention mechanism (Transformer) has been adopted in the vision domain and has recently achieved great success. As another stream, the multi-layer perceptron (MLP) has also been explored for vision. These architectures, distinct from traditional CNNs, have been attracting attention, and many methods have been proposed. To combine parameter efficiency and performance with locality and hierarchy in image recognition, we propose gSwin, which merges the two streams: the Swin Transformer and (multi-head) gMLP. We show that gSwin achieves better accuracy than the Swin Transformer on three vision tasks (image classification, object detection, and semantic segmentation) with a smaller model size.
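The paper's exact formulation is not given in this abstract, but the combination it describes, gMLP's spatial gating applied within Swin-style local windows, can be sketched roughly as follows. This is a minimal 1-D NumPy toy: all names, shapes, and the identity-weight demo are illustrative assumptions, and Swin's shifted-window offset and hierarchical merging are omitted.

```python
import numpy as np

def spatial_gating_unit(x, W, b):
    """gMLP-style spatial gating: split channels in half, then gate one
    half with a learned linear mix over the token (spatial) dimension.
    x: (n_tokens, d) window of token features
    W: (n_tokens, n_tokens) spatial projection, b: (n_tokens,) bias
    """
    u, v = np.split(x, 2, axis=-1)   # channel split into (n_tokens, d/2) halves
    v = W @ v + b[:, None]           # mix v across tokens within the window
    return u * v                     # element-wise gate

def windowed_gmlp(x, window, W, b):
    """Apply the gating unit independently in non-overlapping windows of
    tokens, mimicking Swin's window partitioning (1-D sketch only)."""
    n, d = x.shape
    out = np.empty_like(x[:, : d // 2])
    for s in range(0, n, window):
        out[s:s + window] = spatial_gating_unit(x[s:s + window], W, b)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))    # 8 tokens, 4 channels
W = np.eye(2); b = np.zeros(2)     # identity spatial mix for 2-token windows
y = windowed_gmlp(x, 2, W, b)      # with identity W, output is simply u * v
```

Because tokens interact only through `W` inside each window, the cost of the spatial mixing grows with the window size rather than with the full token count, which is the locality property the abstract highlights.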
