Hydra Attention: Efficient Attention with Many Heads

09/15/2022
by Daniel Bolya et al.

While transformers have begun to dominate many tasks in vision, applying them to large images is still computationally difficult. A large reason for this is that self-attention scales quadratically with the number of tokens, which, in turn, scales quadratically with the image size. On larger images (e.g., 1080p), over 60% of the total computation in the model is spent solely on creating and applying attention matrices. We take a step toward solving this issue by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features with no hidden constants, making it significantly faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.
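
The abstract describes taking multi-head attention to its extreme of one head per feature. Below is a minimal sketch of how that can make attention linear in both tokens and features, assuming a cosine-similarity (linear-attention style) kernel and PyTorch tensors of shape (batch, tokens, features); the function name and exact formulation are illustrative assumptions, not taken verbatim from the paper.

```python
import torch
import torch.nn.functional as F

def hydra_style_attention(q, k, v):
    """Sketch of attention with one head per feature (Hydra-style).

    q, k, v: tensors of shape (batch, tokens, features).
    With a single feature per head and a cosine-similarity kernel,
    attention reduces to elementwise gating by a global feature vector.
    """
    # Normalize queries and keys along the feature dimension (cosine kernel).
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    # Aggregate one global feature vector: sum over tokens of k * v.
    global_feat = (k * v).sum(dim=1, keepdim=True)  # (batch, 1, features)
    # Gate every query token by the global feature vector.
    return q * global_feat                          # (batch, tokens, features)

# Usage sketch on dummy data:
x = torch.randn(2, 196, 768)
out = hydra_style_attention(x, x, x)  # same shape as the input
```

Because the only reduction here is the sum over tokens that forms the global feature vector, the cost is on the order of tokens times features, and no tokens-by-tokens attention matrix is ever materialized, which is where the claimed speedup over standard self-attention comes from.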

Related research

11/30/2021 · Shunted Self-Attention via Multi-Scale Token Aggregation
Recent Vision Transformer (ViT) models have demonstrated encouraging res...

05/17/2022 · Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers
Vision transformers using self-attention or its proposed alternatives ha...

06/09/2022 · Extreme Masking for Learning Instance and Distributed Visual Representations
The paper presents a scalable approach for learning distributed represen...

06/17/2021 · XCiT: Cross-Covariance Image Transformers
Following their success in natural language processing, transformers hav...

03/24/2023 · Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers
Vision Transformers (ViT) have shown their competitive advantages perfor...

02/12/2023 · A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity
Vision Transformers (ViTs) with self-attention modules have recently ach...

02/03/2023 · PSST! Prosodic Speech Segmentation with Transformers
Self-attention mechanisms have enabled transformers to achieve superhuma...
