Sparsifying Transformer Models with Differentiable Representation Pooling

09/10/2020
by Michał Pietruszka, et al.

We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations, thus leveraging the model's information bottleneck with twofold strength. A careful analysis shows that the contextualization of encoded representations in our model is significantly more effective than in the original Transformer. We achieve a notable reduction in memory usage due to an improved differentiable top-k operator, making the model suitable for processing long documents, as demonstrated on a summarization task.
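The idea of selecting the most-informative token representations can be pictured with a small sketch: score each token, gate its representation with a differentiable relevance weight, and keep only the k highest-scoring positions, so later attention layers operate on a shorter sequence. The code below is only an illustrative relaxation under assumed names (TopKTokenPooling, scorer, d_model, k); it is not the paper's improved differentiable top-k operator.

```python
# Minimal sketch of learned top-k token pooling (illustrative, not the paper's operator).
import torch
import torch.nn as nn


class TopKTokenPooling(nn.Module):
    """Reduce a sequence of token representations to its k highest-scoring ones."""

    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)  # learns which tokens are informative
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.scorer(x).squeeze(-1)           # (batch, seq_len) relevance scores
        soft_gate = torch.sigmoid(scores)             # differentiable gate so the scorer gets gradients
        top = scores.topk(self.k, dim=-1)             # hard selection of the k best positions
        idx = top.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))
        # Gather the gated representations of the selected tokens: (batch, k, d_model)
        return torch.gather(x * soft_gate.unsqueeze(-1), 1, idx)


if __name__ == "__main__":
    pool = TopKTokenPooling(d_model=64, k=16)
    tokens = torch.randn(2, 512, 64)      # e.g. a long-document encoder output
    print(pool(tokens).shape)             # torch.Size([2, 16, 64])
```

Placing such a pooling step between encoder layers shortens the sequence that subsequent attention layers see, which is where the memory savings for long documents come from, since attention cost grows quadratically with sequence length.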

Related research

08/20/2021
Fastformer: Additive Attention Can Be All You Need
Transformer is a powerful model for text understanding. However, it is i...

03/09/2023
Efficient Transformer-based 3D Object Detection with Dynamic Token Halting
Balancing efficiency and accuracy is a long-standing problem for deployi...

10/06/2021
PoNet: Pooling Network for Efficient Token Mixing in Long Sequences
Transformer-based models have achieved great success in various NLP, vis...

11/25/2021
New Approaches to Long Document Summarization: Fourier Transform Based Attention in a Transformer Model
In this work, we extensively redesign the newly introduced method of tok...

06/02/2021
Enriching Transformers with Structured Tensor-Product Representations for Abstractive Summarization
Abstractive summarization, the task of generating a concise summary of i...

07/14/2022
Forming Trees with Treeformers
Popular models such as Transformers and LSTMs use tokens as its unit of ...

11/09/2022
Efficiently Scaling Transformer Inference
We study the problem of efficient generative inference for Transformer m...
