Vision Transformer with Convolutions Architecture Search

03/20/2022
by Haichao Zhang, et al.

Transformers exhibit great advantages in handling computer vision tasks. They model image classification by using a multi-head attention mechanism to process a sequence of patches split from the input image. However, for complex tasks, a vision Transformer needs not only the dynamic attention and global context of the attention mechanism, but also convolutional properties such as noise reduction and invariance to shifting and scaling of objects. We therefore study the structural characteristics of the Transformer and the convolution, and propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS). The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture while maintaining the benefits of the multi-head attention mechanism. The searched block-based backbone extracts feature maps at different scales, making these features compatible with a wide range of visual tasks, such as image classification (32 M parameters, 82.0% top-1 accuracy) and object detection (50.4% AP). The combination of the multi-head attention mechanism and CNN adaptively associates relational features of pixels with multi-scale features of objects. This enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
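To make the patch-based attention pipeline described above concrete, the sketch below splits an image into non-overlapping patches and runs them through a single multi-head self-attention layer in plain NumPy. This is a minimal illustration of the generic ViT-style mechanism the abstract refers to, not the VTCAS-searched architecture; the image size, patch size, embedding dimension, and random projection weights are all illustrative assumptions.

```python
import numpy as np

def image_to_patches(img, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = img.shape
    assert H % patch_size == 0 and W % patch_size == 0
    rows, cols = H // patch_size, W // patch_size
    # (rows, p, cols, p, C) -> (rows, cols, p, p, C) -> (N, p*p*C)
    return (img.reshape(rows, patch_size, cols, patch_size, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(rows * cols, patch_size * patch_size * C))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    """x: (N, D) patch tokens; wq/wk/wv/wo: (D, D) projection matrices."""
    N, D = x.shape
    dh = D // num_heads
    # Project and split into heads: (num_heads, N, dh)
    q = (x @ wq).reshape(N, num_heads, dh).transpose(1, 0, 2)
    k = (x @ wk).reshape(N, num_heads, dh).transpose(1, 0, 2)
    v = (x @ wv).reshape(N, num_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (num_heads, N, N)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)
    # Merge heads back to (N, D) and apply the output projection
    out = (attn @ v).transpose(1, 0, 2).reshape(N, D)
    return out @ wo

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))          # toy 32x32 RGB image
tokens = image_to_patches(img, patch_size=8)    # (16, 192)
D = tokens.shape[1]
wq, wk, wv, wo = (rng.standard_normal((D, D)) * 0.02 for _ in range(4))
out = multi_head_self_attention(tokens, wq, wk, wv, wo, num_heads=4)
print(out.shape)  # (16, 192): one D-dim token per patch
```

Each output token mixes information from every patch via the attention weights; a hybrid design like the one VTCAS searches for would interleave such attention blocks with convolutional stages to add locality and multi-scale features.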

Related research

03/02/2022
Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions
With the achievements of Transformer in the field of natural language pr...

07/18/2022
Multi-manifold Attention for Vision Transformers
Vision Transformer are very popular nowadays due to their state-of-the-a...

12/06/2022
AbHE: All Attention-based Homography Estimation
Homography estimation is a basic computer vision task, which aims to obt...

10/18/2021
Compositional Attention: Disentangling Search and Retrieval
Multi-head, key-value attention is the backbone of the widely successful...

04/13/2021
Co-Scale Conv-Attentional Image Transformers
In this paper, we present Co-scale conv-attentional image Transformers (...

03/29/2021
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
This paper presents a new Vision Transformer (ViT) architecture Multi-Sc...

06/10/2022
Saccade Mechanisms for Image Classification, Object Detection and Tracking
We examine how the saccade mechanism from biological vision can be used ...
