Breaking BERT: Evaluating and Optimizing Sparsified Attention

10/07/2022
by Siddhartha Brahma, et al.

Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections, and their quadratic time and memory cost, may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, and token position to random connections, and measure which patterns reduce performance the least. We find that on three common fine-tuning tasks, even attention that is at least 78% sparse can have little effect on performance if applied at the later transformer layers, but that applying sparsity throughout the network reduces performance significantly. Second, we vary the degree of sparsity for three patterns supported by previous work, and find that connections to neighboring tokens are the most significant. Finally, we treat sparsity as an optimizable parameter, and present an algorithm to learn the degree of neighboring connections, giving fine-grained control over the accuracy-sparsity trade-off while approaching the performance of existing methods.
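To make the kind of sparsification described above concrete, here is a minimal sketch of applying a fixed neighborhood (window) mask to standard scaled dot-product attention in PyTorch. It is illustrative only, not the paper's implementation: the function names, window size, and the choice of a purely positional mask are assumptions, and the paper's syntax-, lexical-, and learned-sparsity masks are not reproduced.

```python
# Illustrative sketch: a positional "neighborhood" sparsity mask applied to
# scaled dot-product attention. Names and parameters are assumptions, not
# the paper's code.
import math
import torch

def neighborhood_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask keeping only connections between tokens whose positions
    differ by at most `window` (self-connections included)."""
    pos = torch.arange(seq_len)
    dist = (pos.unsqueeze(0) - pos.unsqueeze(1)).abs()
    return dist <= window  # True = connection kept

def masked_attention(q, k, v, mask):
    """Standard scaled dot-product attention, with disallowed connections
    set to -inf before the softmax so they receive zero weight."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example: 128 tokens, 64-dim heads, attention restricted to +/- 8 neighbors
# (at this length the mask is well over 78% sparse).
q = k = v = torch.randn(1, 128, 64)
out = masked_attention(q, k, v, neighborhood_mask(128, window=8))
print(out.shape)  # torch.Size([1, 128, 64])
```

In the ablation setting the abstract describes, such a mask would replace the dense attention pattern only at selected (e.g. later) layers, while earlier layers keep full attention.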
