Convolutional Xformers for Vision

01/25/2022
by   Pranav Jeevan, et al.

Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reasons for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism. We propose a linear attention-convolution hybrid architecture, Convolutional X-formers for Vision (CXV), to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage. An inductive prior for image data is provided by convolutional sub-layers, thereby eliminating the need for the class token and positional embeddings used by ViTs. We also propose a new training method in which we use two different optimizers during different phases of training, and show that it improves the top-1 image classification accuracy across different architectures. CXV outperforms other architectures, including token mixers (e.g. ConvMixer, FNet and MLP Mixer), transformer models (e.g. ViT, CCT, CvT and hybrid Xformers), and ResNets, for image classification in scenarios with limited data and GPU resources (cores, RAM, power).
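
To make the quadratic-versus-linear distinction concrete, the following is a minimal, dependency-free sketch of kernelized linear attention in the style of the Linear Transformer, assuming the elu(x) + 1 feature map and a single attention head. It is not the CXV implementation: the convolutional sub-layers, the Performer and Nyströmformer variants, and the two-optimizer training schedule are all omitted, and every function name here is hypothetical. It only illustrates that reassociating the matrix product avoids forming the N x N token-similarity matrix while giving the same output.

```python
import math
import random


def feature_map(x):
    """elu(x) + 1: a strictly positive feature map (assumed for this sketch)."""
    return x + 1.0 if x > 0 else math.exp(x)


def apply_map(m):
    """Apply the feature map to every entry of a matrix (list of rows)."""
    return [[feature_map(x) for x in row] for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def quadratic_attention(q, k, v):
    """Reference form: builds the full N x N similarity matrix."""
    fq, fk = apply_map(q), apply_map(k)
    scores = matmul(fq, transpose(fk))      # N x N, cost grows with N^2
    out = matmul(scores, v)                 # N x d_v
    return [[x / sum(s) for x in o] for o, s in zip(out, scores)]


def linear_attention(q, k, v):
    """Same result, reassociated so that no N x N matrix is formed."""
    fq, fk = apply_map(q), apply_map(k)
    kv = matmul(transpose(fk), v)           # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]  # length-d sum of mapped keys
    out = matmul(fq, kv)                    # N x d_v, cost grows with N
    norm = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[x / z for x in o] for o, z in zip(out, norm)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for illustration: 6 tokens, key dim 4, value dim 3.
rng = random.Random(0)
q, k, v = random_matrix(6, 4, rng), random_matrix(6, 4, rng), random_matrix(6, 3, rng)
ref, fast = quadratic_attention(q, k, v), linear_attention(q, k, v)
max_diff = max(abs(a - b) for ra, rb in zip(ref, fast) for a, b in zip(ra, rb))
```

In this sketch `max_diff` should be at floating-point noise level, since both functions compute the same quantity. The difference lies in the intermediate sizes: the reference form stores N x N scores, whereas the reassociated form stores only a d x d_v summary, which is why such mechanisms scale linearly with the number of image tokens.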


