Differentiable Subset Pruning of Transformer Heads

08/10/2021
by Jiaoda Li, et al.

Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably to, or better than, previous work while offering precise control of the sparsity level.
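To make the mechanism concrete, here is a minimal PyTorch sketch of the idea: each attention head gets a learnable importance logit, and a hard top-k gate built with a straight-through estimator keeps exactly k heads while still passing gradients to the logits so they can be trained with SGD. The `HeadGate` module name, the sigmoid relaxation, and the straight-through trick are illustrative assumptions for this sketch, not necessarily the exact construction used in the paper.

```python
# A minimal sketch (not the authors' exact implementation) of differentiable
# subset pruning: learn per-head importance logits and keep exactly k heads
# via a hard top-k mask with a straight-through gradient estimator.
import torch
import torch.nn as nn


class HeadGate(nn.Module):
    """Gates the outputs of `num_heads` attention heads, keeping exactly `k`."""

    def __init__(self, num_heads: int, k: int):
        super().__init__()
        self.k = k
        # Per-head importance variables, learned jointly with the model via SGD.
        self.head_logits = nn.Parameter(torch.zeros(num_heads))

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, num_heads, seq_len, head_dim)
        soft = torch.sigmoid(self.head_logits)                # relaxed gates in (0, 1)
        topk = soft.topk(self.k).indices                      # the k most important heads
        hard = torch.zeros_like(soft).scatter_(0, topk, 1.0)  # hard 0/1 mask, exactly k ones
        # Straight-through estimator: forward pass uses the hard mask,
        # backward pass uses the gradient of the soft gates.
        gate = hard + soft - soft.detach()
        return head_outputs * gate.view(1, -1, 1, 1)
```

In a Transformer, such a gate would multiply the per-head outputs inside each multi-head attention layer before the output projection; choosing k enforces the user-specified budget on the number of unpruned heads exactly, rather than relying on a penalty term to reach a target sparsity.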

Related research:
- Layer-wise Pruning of Transformer Attention Heads for Efficient Language Modeling (10/07/2021)
- Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned (05/23/2019)
- Are Sixteen Heads Really Better than One? (05/25/2019)
- A Dynamic Head Importance Computation Mechanism for Neural Machine Translation (08/03/2021)
- Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior (10/05/2020)
- Understanding Multi-Head Attention in Abstractive Summarization (11/10/2019)
- Accelerating Attention through Gradient-Based Learned Runtime Pruning (04/07/2022)
