TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

06/21/2021
by   Michael S. Ryoo, et al.
0

In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise attention between such tokens, over a longer temporal horizon for videos, or the spatial content in images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, due to our tokens being adaptive, we accomplish competitive results at significantly reduced compute amount.

READ FULL TEXT
research
05/20/2022

Visual Concepts Tokenization

Obtaining the human-like perception ability of abstracting visual concep...
research
06/08/2023

ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process

Image recognition and generation have long been developed independently ...
research
03/25/2023

Selective Structured State-Spaces for Long-Form Video Understanding

Effective modeling of complex spatiotemporal dependencies in long-form v...
research
11/26/2021

SWAT: Spatial Structure Within and Among Tokens

Modeling visual data as tokens (i.e., image patches), and applying atten...
research
06/23/2022

Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

Humans are remarkably flexible in understanding viewpoint changes due to...
research
06/15/2022

Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022

This technical report describes the SViT approach for the Ego4D Point of...
research
05/27/2022

MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning

In this study, we propose Mixed and Masked Image Modeling (MixMIM), a si...

Please sign up or login with your details

Forgot password? Click here to reset