Vision Xformers: Efficient Attention for Image Classification

07/05/2021
by Pranav Jeevan, et al.

Although transformers have become the neural architectures of choice for natural language processing, they require orders of magnitude more training data, GPU memory, and computation to compete with convolutional neural networks for computer vision. The attention mechanism of transformers scales quadratically with the length of the input sequence, and unrolled images have long sequence lengths. Moreover, transformers lack an inductive bias that is appropriate for images. We tested three modifications to vision transformer (ViT) architectures to address these shortcomings. Firstly, we alleviate the quadratic bottleneck by using linear attention mechanisms, called X-formers (where X ∈ {Performer, Linformer, Nyströmformer}), thereby creating Vision X-formers (ViXs). This resulted in up to a sevenfold reduction in the GPU memory requirement. We also compared their performance with FNet and multi-layer perceptron mixers, which further reduced the GPU memory requirement. Secondly, we introduced an inductive bias for images by replacing the initial linear embedding layer in ViX with convolutional layers, which significantly increased classification accuracy without increasing the model size. Thirdly, we replaced the learnable 1D position embeddings in ViT with Rotary Position Embedding (RoPE), which increased classification accuracy for the same model size. We believe that incorporating such changes can democratize transformers by making them accessible to those with limited data and computing resources.
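
To make the three modifications concrete, the sketches below give one plausible PyTorch rendering of each. They are illustrative assumptions, not the paper's code: the names (LinformerAttention, proj_len, ConvEmbedding, apply_rope) are ours, and the layer sizes are guesses for CIFAR-scale 32x32 inputs.

First, Linformer-style linear attention, where learned projections compress the n keys and values down to a fixed length k, so the attention map is n x k instead of n x n:

import torch
import torch.nn as nn

class LinformerAttention(nn.Module):
    # Sketch of Linformer-style attention (single head): keys and values
    # are projected from length seq_len down to proj_len via the learned
    # matrices E and F of the Linformer paper, giving O(n * k) cost
    # instead of the O(n^2) of standard self-attention.
    def __init__(self, dim, seq_len, proj_len=32):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.E = nn.Parameter(torch.randn(proj_len, seq_len) / seq_len ** 0.5)
        self.Fmat = nn.Parameter(torch.randn(proj_len, seq_len) / seq_len ** 0.5)

    def forward(self, x):                                  # x: (batch, n, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        k = torch.einsum('kn,bnd->bkd', self.E, k)         # (batch, proj_len, dim)
        v = torch.einsum('kn,bnd->bkd', self.Fmat, v)      # (batch, proj_len, dim)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v                                    # (batch, n, dim)

Second, the convolutional embedding: a small convolutional stem replaces the flatten-and-project patch embedding, adding locality and translation equivariance as inductive bias while still emitting one token per patch:

class ConvEmbedding(nn.Module):
    # Sketch: convolutions replace ViT's initial linear patch embedding.
    def __init__(self, in_ch=3, dim=128, patch=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, dim // 2, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(dim // 2, dim, kernel_size=patch, stride=patch),
        )

    def forward(self, img):                    # img: (batch, C, H, W)
        x = self.stem(img)                     # (batch, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)    # (batch, n_tokens, dim)

Third, Rotary Position Embedding in its standard 1D form: instead of adding a learned position vector to each token, the queries and keys are rotated, pairwise across feature dimensions, by angles proportional to token position, so relative position is encoded directly in their dot products:

def apply_rope(x):
    # Sketch of 1D RoPE: rotate each (even, odd) feature pair of x by a
    # position-dependent angle; apply to queries and keys before attention.
    b, n, d = x.shape
    pos = torch.arange(n, dtype=x.dtype, device=x.device)
    freqs = 10000 ** (-torch.arange(0, d, 2, dtype=x.dtype, device=x.device) / d)
    angles = pos[:, None] * freqs[None, :]               # (n, d/2)
    sin, cos = angles.sin(), angles.cos()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

Chaining the pieces on a CIFAR-sized batch shows the data flow (apply_rope would act on q and k inside the attention block):

tokens = ConvEmbedding()(torch.randn(2, 3, 32, 32))      # (2, 64, 128)
out = LinformerAttention(dim=128, seq_len=64)(tokens)    # (2, 64, 128)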

Related research

01/25/2022
Convolutional Xformers for Vision
Vision transformers (ViTs) have found only limited practical use in proc...

06/11/2023
2-D SSM: A General Spatial Layer for Visual Transformers
A central objective in computer vision is to design models with appropri...

04/27/2023
Vision Conformer: Incorporating Convolutions into Vision Transformer Layers
Transformers are popular neural network models that use layers of self-a...

02/03/2023
Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers
We propose a new class of linear Transformers called FourierLearner-Tran...

02/15/2021
Translational Equivariance in Kernelizable Attention
While Transformer architectures have shown remarkable success, they are b...

06/23/2023
Scaling MLPs: A Tale of Inductive Bias
In this work we revisit the most fundamental building block in deep lear...

07/01/2023
More for Less: Compact Convolutional Transformers Enable Robust Medical Image Classification with Limited Data
Transformers are very powerful tools for a variety of tasks across domai...
