Dynamic Token-Pass Transformers for Semantic Segmentation

08/03/2023
by   Yuang Liu, et al.

Vision transformers (ViT) usually extract features by forwarding all tokens through every self-attention layer, from the first layer to the last. In this paper, we introduce dynamic token-pass vision transformers (DoViT) for semantic segmentation, which adaptively reduce the inference cost for images of different complexity. DoViT gradually stops a portion of the easy tokens from taking part in the self-attention calculation and keeps forwarding the hard tokens until the stopping criterion is met. We employ lightweight auxiliary heads to make the token-pass decision and to divide the tokens into a keeping part and a stopping part. By computing the two groups separately, the self-attention layers are sped up on the sparse kept tokens while remaining hardware-friendly. A token reconstruction module collects the grouped tokens and restores them to their original positions in the sequence, which is necessary for predicting correct semantic masks. We conduct extensive experiments on two common semantic segmentation benchmarks and demonstrate that our method reduces computation by roughly 40%, with a drop of mIoU within 0.8%, while the throughput and inference speed of ViT-L/B are increased to more than 2× on Cityscapes.
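The token-pass mechanism described above lends itself to a short sketch. The following PyTorch example is a minimal illustration under our own assumptions, not the authors' implementation: the names TokenPassBlock, aux_head and halt_threshold are hypothetical, and the real DoViT additionally handles batching, layer-wise stopping criteria and the training of the auxiliary heads.

import torch
import torch.nn as nn

class TokenPassBlock(nn.Module):
    """One transformer block that attends only over tokens still marked active."""

    def __init__(self, dim: int, num_heads: int = 8, halt_threshold: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # lightweight auxiliary head: predicts a per-token "stop" probability
        self.aux_head = nn.Linear(dim, 1)
        self.halt_threshold = halt_threshold

    def forward(self, tokens: torch.Tensor, active: torch.Tensor):
        # tokens: (N, D) token features of one image, active: (N,) boolean mask
        keep_idx = active.nonzero(as_tuple=True)[0]
        if keep_idx.numel() == 0:
            return tokens, active  # every token has already stopped

        # self-attention and MLP run on the kept (hard) tokens only
        x = tokens[keep_idx].unsqueeze(0)                         # (1, K, D)
        x = x + self.attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        x = x + self.mlp(self.norm2(x))
        x = x.squeeze(0)                                          # (K, D)

        # token reconstruction: scatter the updated tokens back to their slots
        out = tokens.clone()
        out[keep_idx] = x

        # token-pass decision: tokens whose stop probability reaches the
        # threshold are frozen and skip all later self-attention layers
        stop_prob = torch.sigmoid(self.aux_head(x)).squeeze(-1)   # (K,)
        still_active = active.clone()
        still_active[keep_idx] = stop_prob < self.halt_threshold
        return out, still_active

A stack of such blocks can then be run layer by layer, for example:

blocks = nn.ModuleList([TokenPassBlock(dim=256) for _ in range(4)])
tokens = torch.randn(1024, 256)               # e.g. 32x32 patch tokens
active = torch.ones(1024, dtype=torch.bool)   # all tokens start active
for blk in blocks:
    tokens, active = blk(tokens, active)

In this sketch, stopped tokens keep their last features and are simply copied forward, and the final scatter step plays the role of the token reconstruction module that restores the original token order before mask prediction.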


Related research

06/03/2021
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Attention is sparse in vision transformers. We observe the final predict...

03/20/2023
Robustifying Token Attention for Vision Transformers
Despite the success of vision transformers (ViTs), they still suffer fro...

06/03/2023
Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers
This paper introduces Content-aware Token Sharing (CTS), a token reducti...

06/23/2021
Probabilistic Attention for Interactive Segmentation
We provide a probabilistic interpretation of attention and show that the...

05/24/2023
Predicting Token Impact Towards Efficient Vision Transformer
Token filtering to reduce irrelevant tokens prior to self-attention is a...

02/12/2023
A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity
Vision Transformers (ViTs) with self-attention modules have recently ach...

02/03/2023
PSST! Prosodic Speech Segmentation with Transformers
Self-attention mechanisms have enabled transformers to achieve superhuma...
