Vision Conformer: Incorporating Convolutions into Vision Transformer Layers

04/27/2023
by   Brian Kenji Iwana, et al.

Transformers are popular neural network models that apply layers of self-attention and fully-connected nodes to embedded tokens. Vision Transformers (ViT) adapt transformers to image recognition tasks. To do this, images are split into patches, and the patches are used as tokens. One issue with ViT is its lack of inductive bias toward image structures. Because ViT was adapted for image data from language modeling, the network does not explicitly handle issues such as local translations, pixel information, and information loss in the structures and features shared by multiple patches. Convolutional Neural Networks (CNN), in contrast, incorporate this information. Thus, in this paper, we propose the use of convolutional layers within ViT. Specifically, we propose a model called the Vision Conformer (ViC), which replaces the Multi-Layer Perceptron (MLP) in a ViT layer with a CNN. In addition, to use the CNN, we propose reconstructing the image data after the self-attention in a reverse embedding layer. Through evaluation, we demonstrate that the proposed convolutions help improve the classification ability of ViT.
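The data flow described above (tokens, back to an image grid, convolution, back to tokens) can be sketched in a few lines of plain Python. This is only an illustration under strong assumptions, not the authors' implementation: the function names `patchify`, `unpatchify`, `box_filter3`, and `conv_block_on_tokens` are hypothetical, the learned embedding and reverse embedding are reduced to a pure spatial rearrangement, self-attention is omitted, and a fixed 3x3 mean filter stands in for the learned CNN.

```python
def patchify(image, p):
    """Split an H x W image (list of rows) into flattened p x p patches, row-major."""
    h, w = len(image), len(image[0])
    if h % p or w % p:
        raise ValueError("image height and width must be divisible by the patch size")
    tokens = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            tokens.append(
                [image[top + dy][left + dx] for dy in range(p) for dx in range(p)]
            )
    return tokens


def unpatchify(tokens, h, w, p):
    """Inverse of patchify: place each flattened patch back onto the H x W grid."""
    image = [[0] * w for _ in range(h)]
    patches_per_row = w // p
    for idx, token in enumerate(tokens):
        top = (idx // patches_per_row) * p
        left = (idx % patches_per_row) * p
        for k, value in enumerate(token):
            image[top + k // p][left + k % p] = value
    return image


def box_filter3(image):
    """3x3 mean filter with zero padding (a fixed stand-in for a learned convolution)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += image[yy][xx]
            out[y][x] = total / 9.0
    return out


def conv_block_on_tokens(tokens, h, w, p):
    """Rebuild the image from tokens, convolve across patch borders, re-tokenize."""
    image = unpatchify(tokens, h, w, p)
    filtered = box_filter3(image)
    return patchify(filtered, p)


if __name__ == "__main__":
    image = [[4 * r + c for c in range(4)] for r in range(4)]
    tokens = patchify(image, 2)
    print(conv_block_on_tokens(tokens, 4, 4, 2))
```

The point of the sketch is that the filter in `conv_block_on_tokens` sees pixels from neighboring patches, which a per-token MLP cannot do; how the actual model learns the reconstruction and the convolution weights is described in the full paper.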

