MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers

07/05/2023
by Jakob Drachmann Havtorn, et al.

The input tokens to Vision Transformers carry little semantic meaning, as they are defined as regular equal-sized patches of the input image regardless of its content. However, processing uniform background areas of an image should not necessitate as much compute as dense, cluttered areas. To address this issue, we propose a dynamic mixed-scale tokenization scheme for ViT, MSViT. Our method introduces a conditional gating mechanism that selects the optimal token scale for every image region, such that the number of tokens is dynamically determined per input. The proposed gating module is lightweight, agnostic to the choice of transformer backbone, and trained within a few epochs (e.g., 20 epochs on ImageNet) with little training overhead. In addition, to enhance the conditional behavior of the gate during training, we introduce a novel generalization of the batch-shaping loss. We show that our gating module is able to learn meaningful semantics despite operating locally at the coarse patch level. We validate MSViT on the tasks of classification and segmentation, where it leads to an improved accuracy-complexity trade-off.
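The core idea of per-region scale selection can be sketched as follows. This is a minimal illustration, not the paper's implementation: the learned gating module is replaced by a hypothetical hand-crafted `gate` (patch variance as a crude "content richness" score), and all names and parameters (`mixed_scale_tokenize`, `coarse`, `fine`, `threshold`) are illustrative assumptions. Uniform coarse patches become a single token; cluttered ones are split into fine-scale tokens, so the token count varies with image content.

```python
import numpy as np

def mixed_scale_tokenize(image, coarse=32, fine=16, gate=None, threshold=0.5):
    """Sketch of dynamic mixed-scale tokenization (hypothetical, not MSViT's
    exact method): each coarse patch is either kept as one coarse-scale token
    or split into fine-scale tokens, depending on a per-patch gate score."""
    H, W, _ = image.shape
    if gate is None:
        # Stand-in for the learned gating module: pixel variance squashed
        # to [0, 1). A flat background patch scores ~0, a textured one ~1.
        def gate(patch):
            return float(np.tanh(patch.var()))
    tokens = []
    for y in range(0, H, coarse):
        for x in range(0, W, coarse):
            patch = image[y:y + coarse, x:x + coarse]
            if gate(patch) > threshold:
                # Dense region: split into fine-scale tokens.
                for fy in range(0, coarse, fine):
                    for fx in range(0, coarse, fine):
                        tokens.append(
                            patch[fy:fy + fine, fx:fx + fine].reshape(-1))
            else:
                # Uniform region: one coarse token, subsampled to the fine
                # patch size so every token has the same embedding dimension.
                ds = patch[::coarse // fine, ::coarse // fine]
                tokens.append(ds.reshape(-1))
    return np.stack(tokens)
```

For example, a 64x64 image whose left half is flat and whose right half is noisy yields 2 coarse tokens plus 2x4 fine tokens, 10 tokens in total instead of the 16 a uniform fine-scale tokenizer would produce.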


Related research:

- ATS: Adaptive Token Sampling For Efficient Vision Transformers (11/30/2021)
- Vision Transformers with Mixed-Resolution Tokenization (04/01/2023)
- Visual Transformers: Token-based Image Representation and Processing for Computer Vision (06/05/2020)
- Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length (05/31/2021)
- Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers (06/03/2023)
- DiT: Efficient Vision Transformers with Dynamic Token Routing (08/07/2023)
- CAST: Concurrent Recognition and Segmentation with Adaptive Segment Tokens (10/01/2022)
