Vision Transformer with Attention Map Hallucination and FFN Compaction

06/19/2023
by Haiyang Xu, et al.

Vision Transformers (ViTs) now dominate many vision tasks. The quadratic complexity of their token-wise multi-head self-attention (MHSA) has been extensively addressed via token sparsification or dimension reduction (spatial or channel). However, the redundancy within MHSA itself is usually overlooked, as is that of the feed-forward network (FFN). To fill this gap, we propose attention map hallucination and FFN compaction. Specifically, we observe that vanilla ViT produces similar attention maps across heads, and propose to hallucinate half of the attention maps from the rest with much cheaper operations, yielding hallucinated-MHSA (hMHSA). For the FFN, we factorize its hidden-to-output projection matrix and leverage the re-parameterization technique to strengthen its capability, yielding compact-FFN (cFFN). With these modules, a 10%-20% reduction in floating point operations (FLOPs) and parameters (Params) is achieved for various ViT-based backbones, including straight (DeiT), hybrid (NextViT), and hierarchical (PVT) structures, while maintaining competitive performance.
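
To make the hallucination idea concrete, below is a minimal PyTorch sketch of an hMHSA-style block: queries and keys are computed only for half of the heads, and the remaining attention maps are generated from the computed ones by a cheap learned mixing over the head axis. The mixing operator (`self.mix`, a single linear layer across heads) is an assumption for illustration; the abstract does not specify the paper's actual cheap operation.

```python
import torch
import torch.nn as nn

class HallucinatedMHSA(nn.Module):
    """Sketch of an hMHSA-style block (hallucination operator assumed)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0 and num_heads % 2 == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        half = num_heads // 2
        # Q/K projections only for the half of the heads whose maps
        # are computed for real; V is still needed for all heads.
        self.q = nn.Linear(dim, half * self.head_dim)
        self.k = nn.Linear(dim, half * self.head_dim)
        self.v = nn.Linear(dim, dim)
        # Cheap hallucination operator: learned mixing over the
        # computed heads (hypothetical choice).
        self.mix = nn.Linear(half, half, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, _ = x.shape
        H, h, d = self.num_heads, self.num_heads // 2, self.head_dim
        q = self.q(x).reshape(B, N, h, d).transpose(1, 2)   # (B,h,N,d)
        k = self.k(x).reshape(B, N, h, d).transpose(1, 2)   # (B,h,N,d)
        v = self.v(x).reshape(B, N, H, d).transpose(1, 2)   # (B,H,N,d)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        # Hallucinate the other half: linearly mix the computed maps
        # across the head axis, then renormalize row-wise.
        hall = self.mix(attn.permute(0, 2, 3, 1))            # (B,N,N,h)
        hall = hall.permute(0, 3, 1, 2).softmax(dim=-1)      # (B,h,N,N)
        full = torch.cat([attn, hall], dim=1)                # (B,H,N,N)
        out = (full @ v).transpose(1, 2).reshape(B, N, -1)   # (B,N,dim)
        return self.proj(out)
```

Since Q/K projections and the QK^T softmax are computed for only half of the heads, both the FLOPs and the parameters of the attention-map computation are roughly halved, with only the cheap mixing added on top.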
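Similarly, here is a minimal sketch of a cFFN-style block, assuming the hidden-to-output projection is factorized through a low-rank bottleneck and that re-parameterization takes the form of a parallel linear branch on the output factor, folded into the main weight after training. The branch design (`up_aux`) and the `merge()` helper are hypothetical; the abstract does not describe the exact re-parameterized structure.

```python
import torch
import torch.nn as nn

class CompactFFN(nn.Module):
    """Sketch of a cFFN-style block (re-parameterized branch assumed)."""
    def __init__(self, dim, hidden_dim, rank):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.act = nn.GELU()
        # Factorized hidden-to-output projection:
        # hidden_dim -> rank -> dim instead of a full hidden_dim x dim matrix.
        self.down = nn.Linear(hidden_dim, rank, bias=False)
        self.up = nn.Linear(rank, dim)
        # Training-time parallel branch on the output factor; it is
        # purely linear, so it folds exactly into `up` after training.
        self.up_aux = nn.Linear(rank, dim, bias=False)
        self.merged = False

    def forward(self, x):
        z = self.down(self.act(self.fc1(x)))
        y = self.up(z)
        if not self.merged:
            y = y + self.up_aux(z)   # extra capacity during training
        return y

    @torch.no_grad()
    def merge(self):
        # (W_up + W_aux) z == W_up z + W_aux z, so the branch merges
        # into the main weight with no change in the function computed.
        self.up.weight.add_(self.up_aux.weight)
        self.merged = True
```

With rank r, the factorized projection costs (hidden_dim + dim) * r parameters instead of hidden_dim * dim, which is where the Params and FLOPs savings come from; after `merge()`, inference runs a single factor pair with no auxiliary branch.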

