Chasing Sparsity in Vision Transformers: An End-to-End Exploration

06/08/2021
by   Tianlong Chen, et al.
0

Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy. We launch and report the first-of-its-kind comprehensive exploration, on taking a unified approach of integrating sparsity in ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach is seamlessly extended from unstructured to structured sparsity, the latter by considering to guide the prune-and-grow of self-attention heads inside ViTs. For additional efficiency gains, we further co-explore data and architecture sparsity, by plugging in a novel learnable token selector to adaptively determine the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can even improve the ViT accuracy rather than compromising it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at (5 sparsity for (data, architecture), improves 0.28 enjoys 49.32 https://github.com/VITA-Group/SViTE.

READ FULL TEXT

page 9

page 15

research
03/24/2023

Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers

Vision Transformers (ViT) have shown their competitive advantages perfor...
research
07/22/2023

Sparse then Prune: Toward Efficient Vision Transformers

The Vision Transformer architecture is a deep learning model inspired by...
research
05/03/2023

Dynamic Sparse Training with Structured Sparsity

DST methods achieve state-of-the-art results in sparse neural network tr...
research
09/08/2023

Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

Sparse Mixture-of-Experts models (MoEs) have recently gained popularity ...
research
05/30/2023

Dynamic Sparsity Is Channel-Level Sparsity Learner

Sparse training has received an upsurging interest in machine learning d...
research
01/09/2020

Campfire: Compressible, Regularization-Free, Structured Sparse Training for Hardware Accelerators

This paper studies structured sparse training of CNNs with a gradual pru...
research
05/27/2022

Spartan: Differentiable Sparsity via Regularized Transportation

We present Spartan, a method for training sparse neural network models w...

Please sign up or login with your details

Forgot password? Click here to reset