Learned Queries for Efficient Local Attention

12/21/2021
by Moab Arar, et al.

Vision Transformers (ViT) serve as powerful vision models. Unlike the convolutional neural networks that dominated vision research in previous years, vision transformers can capture long-range dependencies in the data. Nonetheless, an integral part of any transformer architecture, the self-attention mechanism, suffers from high latency and inefficient memory utilization, making it less suitable for high-resolution input images. To alleviate these shortcomings, hierarchical vision models employ self-attention locally on non-overlapping windows. This relaxation reduces the complexity to linear in the input size; however, it limits cross-window interaction, hurting model performance. In this paper, we propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner, much like convolutions. The key idea behind QnA is to introduce learned queries, which allow a fast and efficient implementation. We verify the effectiveness of our layer by incorporating it into a hierarchical vision transformer model. We show improvements in speed and memory complexity while achieving accuracy comparable to state-of-the-art models. Finally, our layer scales especially well with window size, requiring up to 10x less memory while being up to 5x faster than existing methods.
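To make the key idea concrete, the sketch below shows, in plain PyTorch, how a learned query turns local attention into a convolution-like operation: because the query is a parameter rather than a function of each pixel, keys and values can be unfolded into overlapping windows once and attended to with a single dot product per window. This is a minimal sketch, not the authors' released QnA code; the class name LearnedQueryLocalAttention, the single head, and the single shared query are illustrative simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedQueryLocalAttention(nn.Module):
    """Toy single-head local attention with one learned query (QnA-style sketch)."""

    def __init__(self, dim, window=3):
        super().__init__()
        assert window % 2 == 1, "odd window keeps the spatial size"
        self.dim, self.window = dim, window
        # A learned query replaces per-pixel queries (the core QnA idea).
        self.query = nn.Parameter(torch.randn(dim))
        self.to_kv = nn.Linear(dim, 2 * dim)

    def forward(self, x):                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        kv = self.to_kv(x.permute(0, 2, 3, 1))     # (B, H, W, 2C)
        kv = kv.permute(0, 3, 1, 2)                # (B, 2C, H, W)
        # Gather overlapping windows, exactly as a convolution would.
        pad = self.window // 2
        cols = F.unfold(kv, self.window, padding=pad)       # (B, 2C*w*w, H*W)
        cols = cols.view(B, 2, C, self.window ** 2, H * W)
        k, v = cols[:, 0], cols[:, 1]              # each: (B, C, w*w, H*W)
        # One dot product per window position against the shared query.
        attn = torch.einsum('c,bcwl->bwl', self.query, k) / C ** 0.5
        attn = attn.softmax(dim=1)                 # normalize within each window
        out = torch.einsum('bwl,bcwl->bcl', attn, v)
        return out.view(B, C, H, W)

# Usage: shapes are preserved for odd windows with stride 1.
layer = LearnedQueryLocalAttention(dim=64, window=7)
y = layer(torch.randn(2, 64, 32, 32))
print(y.shape)  # torch.Size([2, 64, 32, 32])
```

Note the design consequence: since the query does not depend on the window content, the attention logits reduce to a fixed projection of the keys, so growing the window mainly grows a cheap unfold-and-reduce rather than the quadratic query-key interaction of standard windowed self-attention. That, in simplified form, is where the speed and memory advantage claimed in the abstract comes from.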


Related research

07/05/2021 · Long-Short Transformer: Efficient Transformers for Language and Vision
Transformers have achieved success in both language and vision domains. ...

06/04/2021 · Glance-and-Gaze Vision Transformer
Recently, there emerges a series of vision Transformers, which show supe...

06/23/2023 · Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window
Transformer models have shown great potential in computer vision, follow...

05/29/2021 · Less is More: Pay Less Attention in Vision Transformers
Transformers have become one of the dominant architectures in deep learn...

03/21/2023 · Online Transformers with Spiking Neurons for Fast Prosthetic Hand Control
Transformers are state-of-the-art networks for most sequence processing ...

05/26/2021 · Aggregating Nested Transformers
Although hierarchical structures are popular in recent vision transforme...

09/05/2023 · A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking
Vision Transformer (ViT) architectures are becoming increasingly popular...
