Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers

10/12/2022
by Zonglin Li, et al.

This paper studies the curious phenomenon that machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0%) are nonzero for each input to the MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depths, as well as for other architectures including MLP-Mixers and 2-layer MLPs. We show that sparsity also emerges when training on datasets with random labels, with random inputs, or with an infinite amount of data, demonstrating that sparsity is not an artifact of a specific family of datasets. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that enforcing an even sparser activation via Top-k thresholding with a small value of k brings a collection of desired but missing properties to Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration of their prediction confidence.
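To make the two quantities in the abstract concrete, here is a minimal PyTorch sketch (not the authors' code) of a standard Transformer MLP block (fc1 -> ReLU -> fc2) that measures the fraction of nonzero post-ReLU activations and optionally applies Top-k thresholding per token. The class name, layer sizes, and value of k are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopkMLP(nn.Module):
    """Transformer-style MLP block with optional Top-k activation thresholding."""
    def __init__(self, d_model=768, d_hidden=3072, k=None):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        self.k = k  # if set, keep only the k largest post-ReLU entries per token

    def forward(self, x):
        h = F.relu(self.fc1(x))                   # the "activation map" studied in the paper
        sparsity = (h > 0).float().mean().item()  # fraction of nonzero entries
        if self.k is not None:
            # zero out everything except the k largest activations per token
            topk_vals, topk_idx = h.topk(self.k, dim=-1)
            h = torch.zeros_like(h).scatter_(-1, topk_idx, topk_vals)
        return self.fc2(h), sparsity

# Usage: a batch of 4 sequences of 16 tokens with model width 768.
mlp = TopkMLP(k=64)
x = torch.randn(4, 16, 768)
out, frac_nonzero = mlp(x)
print(f"fraction of nonzero MLP activations: {frac_nonzero:.3f}")
```

Note that a randomly initialized block like this yields roughly 50% nonzero activations; the low percentages reported in the abstract are a property that emerges over the course of training.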

Related research

09/19/2023  Interpret Vision Transformers as ConvNets with Dynamic Convolutions
There has been a debate about the superiority between vision Transformer...

10/25/2022  Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets
Vision Transformers have attracted a lot of attention recently since the...

09/04/2020  AutoTrans: Automating Transformer Design via Reinforced Architecture Search
Though the transformer architectures have shown dominance in many natura...

08/02/2022  Unified Normalization for Accelerating and Stabilizing Transformers
Solid results from Transformers have made them prevailing architectures ...

05/24/2023  Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers
Transformers have achieved great success in machine learning application...

06/30/2021  Augmented Shortcuts for Vision Transformers
Transformer models have achieved great progress on computer vision tasks...

09/15/2023  Attention-Only Transformers and Implementing MLPs with Attention Heads
The transformer architecture is widely used in machine learning models a...
