SparseFormer: Sparse Visual Recognition via Limited Latent Tokens

by Ziteng Gao, et al.

Human visual recognition is a sparse process, where only a few salient visual cues are attended to rather than traversing every detail uniformly. However, most current vision networks follow a dense paradigm, processing every single visual unit (e.g., pixel or patch) in a uniform manner. In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner. SparseFormer learns to represent images using a highly limited number of tokens (down to 49) in the latent space with a sparse feature sampling procedure instead of processing dense units in the original pixel space. Therefore, SparseFormer circumvents most dense operations on the image space and has much lower computational costs. Experiments on the ImageNet classification benchmark dataset show that SparseFormer achieves performance on par with canonical or well-established models while offering a better accuracy-throughput tradeoff. Moreover, the design of our network can be easily extended to video classification with promising performance at lower computational costs. We hope that our work can provide an alternative way for visual modeling and inspire further research on sparse neural architectures. The code will be publicly available at
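To make the idea of a few latent tokens with sparse feature sampling concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract states only the token count (49). The helper names, the 56x56 feature map, the 4 sampling points per token, the fixed 7x7 grid of regions, and the random placement of sampling points are all assumptions chosen for illustration. The 196-token dense baseline is likewise an assumed reference point: a 224x224 image cut into 16x16 patches.

```python
import random


def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at a fractional (x, y)."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp the query point so it stays inside the grid.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def sparse_tokens(feature_map, rois, num_points=4, seed=0):
    """Build one scalar 'token' per region by averaging a few sampled points.

    Assumption: points are drawn uniformly at random inside each region
    (x_min, y_min, x_max, y_max); this stands in for a learned sampling rule.
    """
    rng = random.Random(seed)
    tokens = []
    for x_min, y_min, x_max, y_max in rois:
        samples = [
            bilinear_sample(feature_map,
                            rng.uniform(x_min, x_max),
                            rng.uniform(y_min, y_max))
            for _ in range(num_points)
        ]
        tokens.append(sum(samples) / num_points)
    return tokens


def pairwise_interactions(num_tokens):
    """Number of token pairs that full self-attention compares."""
    return num_tokens * num_tokens


# Hypothetical 56x56 single-channel feature map with value x + y.
size = 56
feature_map = [[float(x + y) for x in range(size)] for y in range(size)]

# 49 regions laid out as a 7x7 grid (an assumed initialization).
step = size / 7
rois = [(c * step, r * step, (c + 1) * step - 1, (r + 1) * step - 1)
        for r in range(7) for c in range(7)]

tokens = sparse_tokens(feature_map, rois)
locations_read = len(rois) * 4      # sampled points across all tokens
dense_locations = size * size       # every location on the feature map
```

Under these assumptions the sparse path reads 196 sampled locations instead of all 3,136, and self-attention over 49 tokens compares 2,401 token pairs versus 38,416 for 196 dense patch tokens, a 16-fold reduction in pairwise interactions.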




Super Vision Transformer

We attempt to reduce the computational costs in vision transformers (ViT...

Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity

DETR is the first end-to-end object detector using a transformer encoder...

Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

We present Sparse R-CNN, a purely sparse method for object detection in ...

What Is Considered Complete for Visual Recognition?

This is an opinion paper. We hope to deliver a key message that current ...

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

The canonical approach to video-and-language learning (e.g., video quest...

So-ViT: Mind Visual Tokens for Vision Transformer

Recently the vision transformer (ViT) architecture, where the backbone p...

Semiring Primitives for Sparse Neighborhood Methods on the GPU

High-performance primitives for mathematical operations on sparse vector...
