Attention Temperature Matters in Abstractive Summarization Distillation

06/07/2021
by   Shengqiang Zhang, et al.

Recent progress in abstractive text summarization largely relies on large pre-trained sequence-to-sequence Transformer models, which are computationally expensive. This paper aims to distill these large models into smaller ones for faster inference with minimal performance loss. Pseudo-labeling-based methods are popular in sequence-to-sequence model distillation. In this paper, we find that simply manipulating attention temperatures in Transformers can make pseudo labels easier for student models to learn. Our experiments on three summarization datasets show that our proposed method consistently improves over vanilla pseudo-labeling-based methods. We also find that both the pseudo labels and the summaries produced by our students are shorter and more abstractive. We will make our code and models publicly available.
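The key knob described in the abstract is the attention temperature: a scalar that rescales the attention logits before the softmax, flattening (tau > 1) or sharpening (tau < 1) the attention distribution. The sketch below is not the authors' released code; the function name `attention_with_temperature` and the parameter `tau` are illustrative, and where the temperature is applied (e.g., in the teacher when decoding pseudo labels) follows the paper rather than this snippet.

```python
# Minimal sketch of scaled dot-product attention with a temperature tau
# applied to the logits before the softmax (assumed, illustrative names).
import math
import torch
import torch.nn.functional as F

def attention_with_temperature(q, k, v, tau=1.0, mask=None):
    """q, k, v: (batch, heads, seq_len, d_head); tau is the attention temperature."""
    d_head = q.size(-1)
    # Standard sqrt(d_head) scaling, with the extra temperature tau:
    # tau > 1 flattens the attention weights, tau < 1 sharpens them.
    scores = torch.matmul(q, k.transpose(-2, -1)) / (tau * math.sqrt(d_head))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights
```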


