Scratching Visual Transformer's Back with Uniform Attention

10/16/2022
by Nam Hyeon-Woo, et al.

The favorable performance of Vision Transformers (ViTs) is often attributed to multi-head self-attention (MSA). MSA enables global interactions at each layer of a ViT model, in contrast to Convolutional Neural Networks (CNNs), which gradually widen the range of interaction across multiple layers. We study the role of attention density. Our preliminary analyses suggest that the spatial interactions of learned attention maps are closer to dense interactions than to sparse ones. This is a curious phenomenon, as dense attention maps are harder for the model to learn due to the steeper softmax gradients around them. We interpret this as a strong preference of ViT models for dense interactions. We therefore manually insert uniform attention into each layer of ViT models to supply the much-needed dense interactions. We call this method Context Broadcasting (CB). We observe that including CB reduces the degree of density in the original attention maps and increases both the capacity and the generalizability of ViT models. CB incurs negligible costs: one line in your model code, no additional parameters, and minimal extra operations.
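Below is a minimal sketch, in PyTorch, of what the described one-line insertion could look like. The module and block names (ContextBroadcasting, Block) and the exact insertion point after the self-attention residual are illustrative assumptions based on the abstract, not the authors' reference implementation; the core observation is that uniform attention over N tokens reduces to their mean, which can be broadcast back to every token at negligible cost.

```python
import torch
import torch.nn as nn


class ContextBroadcasting(nn.Module):
    """Adds the token-averaged context to every token.

    Uniform attention over N tokens is simply their mean, so broadcasting
    that mean back to each token supplies a dense interaction without any
    additional parameters (a sketch of the idea in the abstract).
    """

    def forward(self, x):
        # x: (batch, num_tokens, dim); the mean over tokens is broadcast back.
        return x + x.mean(dim=1, keepdim=True)


class Block(nn.Module):
    """Hypothetical ViT encoder block showing where the one line could go."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cb = ContextBroadcasting()
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = self.cb(x)  # the "one line": inject the uniform-attention context
        x = x + self.mlp(self.norm2(x))
        return x
```

Since the added term is just a per-batch token mean, the extra cost is one reduction and one broadcast add per block, consistent with the abstract's claim of no additional parameters and minimal extra operations.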


Related research

03/22/2021 - DeepViT: Towards Deeper Vision Transformer
Vision transformers (ViTs) have been successfully applied in image class...

07/26/2023 - Are Transformers with One Layer Self-Attention Using Low-Rank Weight Matrices Universal Approximators?
Existing analyses of the expressive capacity of Transformer models have ...

08/10/2023 - Vision Backbone Enhancement via Multi-Stage Cross-Scale Attention
Convolutional neural networks (CNNs) and vision transformers (ViTs) have...

11/20/2020 - ConvTransformer: A Convolutional Transformer Network for Video Frame Synthesis
Deep Convolutional Neural Networks (CNNs) are powerful models that have ...

08/30/2019 - Adaptively Sparse Transformers
Attention mechanisms have become ubiquitous in NLP. Recent architectures...

05/24/2019 - SCRAM: Spatially Coherent Randomized Attention Maps
Attention mechanisms and non-local mean operations in general are key in...

03/30/2022 - Surface Vision Transformers: Attention-Based Modelling applied to Cortical Analysis
The extension of convolutional neural networks (CNNs) to non-Euclidean g...
