Sparsifying Transformer Models with Differentiable Representation Pooling

09/10/2020
by Michał Pietruszka, et al.

We propose a novel method to sparsify attention in the Transformer model by learning to select the most informative token representations, thus leveraging the model's information bottleneck with twofold strength. A careful analysis shows that the contextualization of encoded representations in our model is significantly more effective than in the original Transformer. We achieve a notable reduction in memory usage due to an improved differentiable top-k operator, making the model suitable for processing long documents, as demonstrated on a summarization task.
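To illustrate the general idea of selecting a reduced set of token representations with a differentiable top-k step, here is a minimal sketch. It is not the authors' implementation: the linear scoring head, the straight-through relaxation, and all dimensions and names (SoftTopKPooler, hidden_dim, k) are assumptions for illustration only.

```python
# Minimal sketch of differentiable top-k representation pooling (assumed design,
# not the paper's exact operator).
import torch
import torch.nn as nn


class SoftTopKPooler(nn.Module):
    """Scores token representations and keeps the k highest-scoring ones,
    passing gradients to the scorer through a soft relaxation."""

    def __init__(self, hidden_dim: int, k: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)  # hypothetical scoring head
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim)
        scores = self.scorer(x).squeeze(-1)           # (batch, seq_len)
        weights = torch.softmax(scores, dim=-1)       # soft selection weights
        topk = torch.topk(weights, self.k, dim=-1)    # hard top-k indices
        # Straight-through trick: the forward pass uses the hard 0/1 selection,
        # while the backward pass sees the soft weights, so the scorer is trained.
        hard = torch.zeros_like(weights).scatter_(-1, topk.indices, 1.0)
        mask = hard + weights - weights.detach()      # (batch, seq_len)
        selected = x * mask.unsqueeze(-1)             # zero out dropped tokens
        # Gather only the kept representations to shrink the sequence length.
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))
        return torch.gather(selected, 1, idx)         # (batch, k, hidden_dim)


# Usage: pool 512 token representations down to 64 before further attention layers.
pooler = SoftTopKPooler(hidden_dim=768, k=64)
tokens = torch.randn(2, 512, 768)
pooled = pooler(tokens)
print(pooled.shape)  # torch.Size([2, 64, 768])
```

Shrinking the sequence from 512 to 64 tokens in this way is what reduces the memory footprint of subsequent attention layers, since their cost grows quadratically with sequence length.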
