Sequencer: Deep LSTM for Image Classification

05/04/2022
by Yuki Tatsunami et al.

In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention borrowed from natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in which inductive biases are suitable for computer vision. Here we propose Sequencer, a novel and competitive architectural alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of the Sequencer module, in which a single LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, achieves 84.6% top-1 accuracy when trained only on ImageNet-1K. Moreover, we show that it has good transferability and robust resolution adaptability, even at up to double the training resolution.
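To make the architecture concrete, the following is a minimal sketch of a Sequencer2D-style block in PyTorch, based only on the description above: in place of a ViT block's self-attention, one bidirectional LSTM runs along each row of the patch-token grid and another along each column, and their outputs are merged and added back through a residual connection. The class name, hidden size, and the concatenate-and-project merge are illustrative assumptions rather than the authors' exact design.

    # Minimal Sequencer2D-style block (illustrative sketch, not the authors' code).
    import torch
    import torch.nn as nn

    class Sequencer2DBlock(nn.Module):
        def __init__(self, dim: int, hidden: int = 48):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            # Bidirectional LSTMs over the rows and columns of the token grid.
            self.lstm_h = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
            self.lstm_v = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
            # Project the concatenated horizontal/vertical outputs back to dim
            # (an assumed merge strategy for this sketch).
            self.proj = nn.Linear(4 * hidden, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, H, W, C) grid of patch tokens.
            B, H, W, C = x.shape
            y = self.norm(x)
            # Horizontal LSTM: treat each row as a sequence of length W.
            h, _ = self.lstm_h(y.reshape(B * H, W, C))
            h = h.reshape(B, H, W, -1)
            # Vertical LSTM: treat each column as a sequence of length H.
            v, _ = self.lstm_v(y.permute(0, 2, 1, 3).reshape(B * W, H, C))
            v = v.reshape(B, W, H, -1).permute(0, 2, 1, 3)
            # Merge both directions and add the residual connection.
            return x + self.proj(torch.cat([h, v], dim=-1))

    # Example: a 14x14 grid of 192-dim patch tokens.
    # out = Sequencer2DBlock(192)(torch.randn(2, 14, 14, 192))  # -> (2, 14, 14, 192)

In a full model, blocks like this would be stacked and interleaved with per-token MLPs, mirroring how ViT alternates attention and MLP layers; the stacking details here are likewise assumptions.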

