Efficient Representation Learning via Adaptive Context Pooling

07/05/2022
by Chen Huang, et al.

Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which may not be optimal for modeling complex dependencies at higher levels. In this paper, we propose ContextPool to address this problem by adapting the attention granularity for each token. Inspired by the success of ConvNets that are combined with pooling to capture long-range dependencies, we learn to pool neighboring features for each token before computing attention in a given attention layer. The pooling weights and support size are adaptively determined, allowing the pooled features to encode meaningful context with varying scale. We show that ContextPool makes attention models more expressive, achieving strong performance often with fewer layers and thus significantly reduced cost. Experiments validate that our ContextPool module, when plugged into transformer models, matches or surpasses state-of-the-art performance using less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable to ConvNets for efficient feature learning.
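To make the idea concrete, below is a minimal sketch of what pooling neighboring features per token before attention could look like. It is my reading of the abstract, not the authors' implementation: the module, the names (ContextPoolAttention, weight_proj, size_proj), and the Gaussian-kernel parameterization of the per-token support size are all assumptions made for illustration.

```python
# Hypothetical sketch of a ContextPool-style layer (assumed, not the paper's code).
# For each token we predict a pooling weight and a support size, pool neighboring
# features with the resulting normalized local kernel, then run self-attention
# on the pooled features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextPoolAttention(nn.Module):
    def __init__(self, dim, num_heads=8, max_len=512):
        super().__init__()
        # Assumed parameterization: one scalar weight and one scalar
        # support size per token, predicted from the token's feature.
        self.weight_proj = nn.Linear(dim, 1)   # per-token pooling weight
        self.size_proj = nn.Linear(dim, 1)     # per-token support size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Pairwise distances |i - j|, reused to build the local kernels.
        pos = torch.arange(max_len, dtype=torch.float32)
        self.register_buffer("dist", (pos[None, :] - pos[:, None]).abs())

    def forward(self, x):
        # x: (batch, seq_len, dim)
        B, N, _ = x.shape
        w = torch.sigmoid(self.weight_proj(x))        # (B, N, 1), in (0, 1)
        sigma = F.softplus(self.size_proj(x)) + 1e-3  # (B, N, 1), > 0

        # Gaussian kernel centered at token i with learned width sigma_i:
        # kernel[b, i, j] ∝ w[b, j] * exp(-dist(i, j)^2 / (2 * sigma[b, i]^2))
        dist = self.dist[:N, :N]                                  # (N, N)
        kernel = torch.exp(-dist[None] ** 2 / (2 * sigma ** 2))   # (B, N, N)
        kernel = kernel * w.transpose(1, 2)                       # weight sources
        kernel = kernel / kernel.sum(dim=-1, keepdim=True)        # normalize

        pooled = kernel @ x                                       # (B, N, dim)
        out, _ = self.attn(pooled, pooled, pooled)
        return out


# Usage: drop-in replacement for a plain self-attention layer.
layer = ContextPoolAttention(dim=64, num_heads=4, max_len=128)
tokens = torch.randn(2, 100, 64)
print(layer(tokens).shape)  # torch.Size([2, 100, 64])
```

Because both the kernel width and the per-token weights are predicted from the features themselves, each token can pool over a context window of a different scale before attention is computed, which is the property the abstract attributes to ContextPool.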

Related research

- PoNet: Pooling Network for Efficient Token Mixing in Long Sequences (10/06/2021)
- PSViT: Better Vision Transformer via Token Pooling and Attention Sharing (08/07/2021)
- ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer (08/30/2022)
- Vision Transformer with Super Token Sampling (11/21/2022)
- Long-Range Grouping Transformer for Multi-View 3D Reconstruction (08/17/2023)
- TFill: Image Completion via a Transformer-Based Architecture (04/02/2021)
- Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention (04/06/2023)