KVT: k-NN Attention for Boosting Vision Transformers

05/28/2021
by pichaowang, et al.

Convolutional Neural Networks (CNNs) have dominated computer vision for years, owing to their ability to capture locality and translation invariance. Recently, many vision transformer architectures have been proposed, and they show promising performance. A key component of vision transformers is fully-connected self-attention, which is more powerful than CNNs at modelling long-range dependencies. However, since the current dense self-attention uses all image patches (tokens) to compute the attention matrix, it may neglect the locality of image patches and involve noisy tokens (e.g., from cluttered backgrounds and occlusions), leading to slow training and potential degradation of performance. To address these problems, we propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers. Specifically, instead of involving all the tokens in the attention matrix calculation, we select only the top-k most similar tokens from the keys for each query to compute the attention map. The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations, as nearby tokens tend to be more similar than distant ones. In addition, k-NN attention allows the exploration of long-range correlations while filtering out irrelevant tokens, since it chooses the most similar tokens from the entire image. Despite its simplicity, we verify, both theoretically and empirically, that k-NN attention is powerful in filtering noise from input tokens and in speeding up training. Extensive experiments with ten different vision transformer architectures verify that the proposed k-NN attention can work with any existing transformer architecture to improve its prediction performance.
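To make the selection rule concrete, below is a minimal PyTorch sketch of the top-k attention described in the abstract. This is our own illustration, not the paper's released code: the function name knn_attention, the argument top_k, and the (batch, heads, tokens, head_dim) layout are assumptions chosen to mirror a standard multi-head attention implementation.

```python
import torch

def knn_attention(q, k, v, top_k):
    """Sketch of k-NN attention: each query attends only to its top-k
    most similar keys; all other attention logits are masked out.

    q, k, v: tensors of shape (batch, heads, tokens, head_dim).
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale      # (B, H, N, N) similarities

    # Keep the k largest logits in each query row; set the rest to -inf
    vals, idx = logits.topk(top_k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))
    masked.scatter_(-1, idx, vals)

    attn = masked.softmax(dim=-1)                   # masked entries become 0
    return attn @ v                                 # weighted sum over top-k values


# Toy usage: 2 images, 8 heads, 197 tokens (ViT-style), 64-dim heads
q = k = v = torch.randn(2, 8, 197, 64)
out = knn_attention(q, k, v, top_k=100)
print(out.shape)  # torch.Size([2, 8, 197, 64])
```

Note that this sketch still computes the dense N-by-N logit matrix and masks it afterwards, so it illustrates only the selection rule stated in the abstract; more memory-efficient gather-based variants are possible but are not described here.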

