Transformer in Convolutional Neural Networks

06/06/2021
by Yun Liu et al.

We tackle the efficiency limitation of vision transformers caused by the high computational and memory complexity of Multi-Head Self-Attention (MHSA). To this end, we propose Hierarchical MHSA (H-MHSA), whose representation is computed in a hierarchical manner. Specifically, H-MHSA first learns feature relationships within small grids by treating image patches as tokens. The small grids are then merged into larger ones, within which feature relationships are learned by treating each small grid from the preceding step as a single token. This process is iterated to gradually reduce the number of tokens. The H-MHSA module is readily pluggable into any CNN architecture and amenable to training via backpropagation. We call the resulting backbone TransCNN; it inherits the advantages of both the transformer and the CNN. Experiments demonstrate that TransCNN achieves state-of-the-art accuracy on image recognition. Code and pretrained models are available at https://github.com/yun-liu/TransCNN. This technical report will be kept up to date as more experiments are added.
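The hierarchical scheme described in the abstract can be sketched in a few lines. This is a minimal single-head NumPy illustration, not the paper's implementation: the function name `h_mhsa`, the grid size, mean pooling as the grid-merging step, the identity Q/K/V projections, and the residual fusion of local and coarse outputs are all illustrative assumptions. It only shows the core idea of attending within small grids first and then over the reduced set of grid tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(tokens):
    # Plain single-head self-attention; Q/K/V projections are
    # omitted (identity) to keep the sketch short.
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def h_mhsa(x, grid=2):
    """Hierarchical self-attention sketch (illustrative, not the paper's code).
    x: (H, W, C) feature map of patch tokens; grid: local window size."""
    H, W, C = x.shape
    # Step 1: attention within each non-overlapping grid of patches,
    # so each attention call sees only grid*grid tokens.
    out = np.empty_like(x)
    for i in range(0, H, grid):
        for j in range(0, W, grid):
            block = x[i:i + grid, j:j + grid].reshape(-1, C)
            out[i:i + grid, j:j + grid] = attention(block).reshape(grid, grid, C)
    # Step 2: merge each grid into one token (mean pool here, an assumption),
    # then attend over the reduced token set of size (H/grid)*(W/grid).
    pooled = out.reshape(H // grid, grid, W // grid, grid, C).mean(axis=(1, 3))
    coarse = attention(pooled.reshape(-1, C)).reshape(H // grid, W // grid, C)
    # Fuse: broadcast the coarse result back to full resolution and add it
    # to the fine-grained output (residual-style fusion, an assumption).
    up = np.repeat(np.repeat(coarse, grid, axis=0), grid, axis=1)
    return out + up
```

The point of the hierarchy is the token count: each local attention call is quadratic only in `grid*grid` tokens, and the global step is quadratic in `(H/grid)*(W/grid)` tokens, instead of one attention that is quadratic in all `H*W` patch tokens.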


