
Sparsifying Transformer Models with Differentiable Representation Pooling

by Michał Pietruszka, et al.

We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations, thus leveraging the model's information bottleneck with twofold strength. A careful analysis shows that the contextualization of encoded representations in our model is significantly more effective than in the original Transformer. We achieve a notable reduction in memory usage due to an improved differentiable top-k operator, making the model suitable for processing long documents, as demonstrated on a summarization task.
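The abstract does not describe the differentiable top-k operator itself. One common way to relax top-k selection into a differentiable operation (a generic illustration, not necessarily the paper's construction) is to score each token and build a soft mask with a sigmoid centered between the k-th and (k+1)-th largest scores, while selecting the top-k tokens hard in the forward pass:

```python
import numpy as np

def soft_topk_mask(scores, k, temperature=0.1):
    """Differentiable relaxation of a top-k indicator.

    Places a sigmoid threshold halfway between the k-th and (k+1)-th
    largest scores; as temperature -> 0 this approaches a hard mask.
    """
    sorted_scores = np.sort(scores)[::-1]
    threshold = (sorted_scores[k - 1] + sorted_scores[k]) / 2.0
    return 1.0 / (1.0 + np.exp(-(scores - threshold) / temperature))

def pool_representations(tokens, scores, k, temperature=0.1):
    """Keep the k highest-scoring token representations.

    tokens: (n, d) array of token representations.
    Returns the (k, d) pooled tokens, scaled by the soft mask so that
    gradients can flow back into the scoring function, plus the mask.
    """
    mask = soft_topk_mask(scores, k, temperature)
    idx = np.argsort(scores)[::-1][:k]  # hard selection in the forward pass
    return tokens[idx] * mask[idx, None], mask
```

After pooling, subsequent attention layers operate on only k representations instead of n, which is where the quadratic memory savings on long documents would come from.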

