Glance-and-Gaze Vision Transformer

06/04/2021
by Qihang Yu, et al.

Recently, a series of vision Transformers has emerged, showing superior performance with more compact model sizes than conventional convolutional neural networks, thanks to the strong ability of Transformers to model long-range dependencies. However, the advantages of vision Transformers come at a price: self-attention, the core operation of the Transformer, has complexity quadratic in the input sequence length. Computation and memory costs therefore grow dramatically with sequence length, which makes it difficult to apply Transformers to vision tasks that require dense predictions on high-resolution feature maps. In this paper, we propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer), to address these issues. It is motivated by the Glance-and-Gaze behavior of humans when recognizing objects in natural scenes, and it can efficiently model both long-range dependencies and local context. In GG-Transformer, the Glance-and-Gaze behavior is realized by two parallel branches: the Glance branch performs self-attention on adaptively-dilated partitions of the input, which yields linear complexity while retaining a global receptive field; the Gaze branch is a simple depth-wise convolutional layer that supplements the features obtained by the Glance mechanism with local image context. We empirically demonstrate that our method achieves consistently superior performance over previous state-of-the-art Transformers on various vision tasks and benchmarks. Code and models will be made available at https://github.com/yucornetto/GG-Transformer.
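To make the two branches concrete, the following is a minimal PyTorch sketch of one Glance-and-Gaze block as described in the abstract. The module name GGBlock, the fixed partition size M, the residual summation of the two branches, and the use of nn.MultiheadAttention are illustrative assumptions, not the authors' released implementation (which lives in the repository linked above).

```python
# A minimal sketch of a Glance-and-Gaze block, assuming a (B, C, H, W)
# feature map with H and W divisible by the partition size M.
import torch
import torch.nn as nn


class GGBlock(nn.Module):
    """Glance: self-attention over adaptively-dilated partitions (global
    view, linear cost). Gaze: a depth-wise conv restoring local context."""

    def __init__(self, dim: int, num_heads: int = 4, partition: int = 7):
        super().__init__()
        self.M = partition  # tokens per partition side; dilation adapts
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gaze branch: depth-wise 3x3 convolution (groups=dim)
        self.gaze = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        M = self.M
        rh, rw = H // M, W // M  # dilation rates adapt to the input size

        # Glance: gather tokens at stride (rh, rw) into rh*rw dilated
        # partitions of M*M tokens each. Every partition spans the whole
        # feature map, so attention inside it has a global receptive field,
        # and with M fixed the total cost is linear in the token count.
        g = x.reshape(B, C, M, rh, M, rw).permute(0, 3, 5, 2, 4, 1)
        g = g.reshape(B * rh * rw, M * M, C)
        n = self.norm(g)
        g, _ = self.attn(n, n, n)

        # Scatter the attended tokens back to their spatial positions.
        g = g.reshape(B, rh, rw, M, M, C).permute(0, 5, 3, 1, 4, 2)
        g = g.reshape(B, C, H, W)

        # Gaze: the depth-wise conv supplies the local context that the
        # dilated partitioning skips over; branches are summed residually
        # (the residual wiring here is an assumption of this sketch).
        return x + g + self.gaze(x)


# Usage: a 56x56 feature map with 96 channels and partition size M=7
# gives 8x8 = 64 dilated groups of 49 tokens each.
if __name__ == "__main__":
    block = GGBlock(dim=96, num_heads=4, partition=7)
    out = block(torch.randn(2, 96, 56, 56))
    print(out.shape)  # torch.Size([2, 96, 56, 56])
```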
