BOAT: Bilateral Local Attention Vision Transformer

01/31/2022
by Tan Yu, et al.

Vision Transformers have achieved outstanding performance on many computer vision tasks. Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large. To improve efficiency, recent Vision Transformers adopt local self-attention mechanisms, in which self-attention is computed within local windows. Although window-based local self-attention significantly boosts efficiency, it fails to capture relationships between patches that are distant in the image plane yet similar in content. To overcome this limitation of image-space local attention, in this paper we further exploit the locality of patches in the feature space: we group the patches into multiple clusters according to their features, and self-attention is computed within each cluster. Such feature-space local attention effectively captures connections between patches that lie in different local windows but are still relevant to one another. We propose the Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention. We further integrate BOAT with both the Swin and CSWin models, and extensive experiments on several benchmark datasets demonstrate that our BOAT-CSWin model clearly and consistently outperforms existing state-of-the-art CNN models and Vision Transformers.
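The core idea of feature-space local attention can be sketched in a few lines: group patch tokens into clusters by their features, then run standard scaled dot-product self-attention independently inside each cluster. The sketch below is a minimal illustration, not the paper's implementation; in particular, it stands in for the paper's clustering with a crude balanced grouping (sorting tokens along a random feature projection), and it uses shared Q/K/V without learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feature_space_local_attention(x, num_clusters=4, seed=0):
    """Illustrative sketch of feature-space local attention.

    x: (n, d) array of patch features.
    Patches are grouped into equal-size clusters by their features
    (here: sorting on a 1-D random projection, a stand-in for the
    paper's clustering), and self-attention is computed within each
    cluster only.
    """
    n, d = x.shape
    assert n % num_clusters == 0, "sketch assumes patches divide evenly"
    rng = np.random.default_rng(seed)

    # Balanced grouping proxy: project features to 1-D and sort.
    proj = x @ rng.standard_normal(d)
    order = np.argsort(proj)

    out = np.empty_like(x)
    size = n // num_clusters
    for c in range(num_clusters):
        idx = order[c * size:(c + 1) * size]
        q = k = v = x[idx]  # shared Q/K/V for simplicity
        attn = softmax(q @ k.T / np.sqrt(d))  # attention within the cluster
        out[idx] = attn @ v
    return out

tokens = np.random.default_rng(1).standard_normal((16, 8))
y = feature_space_local_attention(tokens, num_clusters=4)
print(y.shape)  # (16, 8)
```

Because attention here is computed over clusters of size n/k rather than all n patches, the cost per cluster scales with (n/k)^2, which is the same efficiency motivation as window-based image-space local attention, except that the groups are formed by feature similarity rather than spatial adjacency.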

Related research

09/19/2022  Axially Expanded Windows for Local-Global Interaction in Vision Transformers
11/30/2021  AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
07/06/2022  MaiT: Leverage Attention Masks for More Efficient Image Transformers
08/25/2021  TransFER: Learning Relation-aware Facial Expression Representations with Transformers
07/14/2022  iColoriT: Towards Propagating Local Hint to the Right Region in Interactive Colorization by Leveraging Vision Transformer
04/13/2023  RSIR Transformer: Hierarchical Vision Transformer using Random Sampling Windows and Important Region Windows
05/28/2021  KVT: k-NN Attention for Boosting Vision Transformers
