Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention

03/08/2022
by Kai Liu, et al.

Recently, Transformers have shown promising performance on various vision tasks. To reduce the quadratic computational complexity caused by each query attending to all keys/values, various methods have constrained the range of attention to local regions, where each query attends only to keys/values within a hand-crafted window. However, these hand-crafted window partition mechanisms are data-agnostic and ignore the input content, so a query may end up attending to irrelevant keys/values. To address this issue, we propose Dynamic Group Attention (DG-Attention), which dynamically divides all queries into multiple groups and selects the most relevant keys/values for each group. Our DG-Attention can flexibly model relevant dependencies without the spatial constraints imposed by hand-crafted window-based attention. Built on DG-Attention, we develop a general vision transformer backbone named Dynamic Group Transformer (DGT). Extensive experiments show that our models outperform state-of-the-art methods on multiple common vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
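The abstract describes the mechanism only at a high level, so the following is a minimal PyTorch sketch of the general idea rather than the authors' implementation. It groups queries by hard assignment to learned group prototypes and lets each group attend only to the top-k keys/values scored against its prototype. The class name DynamicGroupAttentionSketch, the prototype-based grouping, the top-k relevance selection, and all hyperparameters are illustrative assumptions, not details from the paper.

```python
import torch
from torch import nn


class DynamicGroupAttentionSketch(nn.Module):
    """Illustrative sketch of dynamic group attention (NOT the paper's code).

    Each query is hard-assigned to one of `num_groups` learned prototypes;
    each group then attends only to its top-k most relevant keys/values,
    where relevance is scored against the group prototype. These design
    choices are assumptions made for illustration.
    """

    def __init__(self, dim, num_groups=4, topk=16):
        super().__init__()
        self.num_groups = num_groups
        self.topk = topk
        self.scale = dim ** -0.5
        # Hypothetical learned group prototypes, one per query group.
        self.prototypes = nn.Parameter(torch.randn(num_groups, dim))
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                         # x: (B, N, C)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)    # each (B, N, C)

        # Data-dependent grouping: assign each query to its nearest prototype.
        assign = (q @ self.prototypes.t()).argmax(dim=-1)          # (B, N)

        out = torch.zeros_like(q)
        kk = min(self.topk, N)
        for g in range(self.num_groups):
            # Score all keys against this group's prototype; keep the top-k.
            rel = k @ self.prototypes[g]                            # (B, N)
            idx = rel.topk(kk, dim=-1).indices                      # (B, kk)
            k_g = torch.gather(k, 1, idx.unsqueeze(-1).expand(-1, -1, C))
            v_g = torch.gather(v, 1, idx.unsqueeze(-1).expand(-1, -1, C))

            # Scaled dot-product attention restricted to the selected keys.
            attn = ((q @ k_g.transpose(1, 2)) * self.scale).softmax(dim=-1)
            out_g = attn @ v_g                                      # (B, N, C)

            # Keep results only for queries assigned to group g.
            mask = (assign == g).unsqueeze(-1)                      # (B, N, 1)
            out = torch.where(mask, out_g, out)
        return self.proj(out)


x = torch.randn(2, 64, 96)                        # (batch, tokens, channels)
y = DynamicGroupAttentionSketch(dim=96)(x)        # -> (2, 64, 96)
```

Note how each query's key/value set is chosen dynamically by content rather than by a fixed spatial window, which is the contrast the abstract draws with hand-crafted window attention; an efficient implementation would batch the per-group loop rather than iterating in Python.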

Related research

12/28/2021
Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention
Recently, Transformers have shown promising performance in various visio...

05/02/2023
AxWin Transformer: A Context-Aware Vision Transformer Backbone with Axial Windows
Recently Transformer has shown good performance in several vision tasks ...

07/26/2021
Contextual Transformer Networks for Visual Recognition
Transformer with self-attention has led to the revolutionizing of natura...

06/02/2023
RITA: Group Attention is All You Need for Timeseries Analytics
Timeseries analytics is of great importance in many real-world applicati...

06/09/2020
Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer
In this paper, we seek to reduce the computation complexity of transform...

06/08/2021
Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight
Vision Transformer (ViT) attains state-of-the-art performance in visual ...

03/22/2022
Learning Patch-to-Cluster Attention in Vision Transformer
The Vision Transformer (ViT) model is built on the assumption of treatin...
