Pre-trained Summarization Distillation

10/24/2020
by Sam Shleifer et al.

Recent state-of-the-art approaches to summarization utilize large pre-trained Transformer models. Distilling these models to smaller student models has become critically important for practical use; however, many different distillation methods have been proposed in the NLP literature. Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation. Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model. A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning. We compare these three approaches for distillation of Pegasus and BART, the current and former state-of-the-art pre-trained summarization models, and find that SFT outperforms knowledge distillation and pseudo-labeling on the CNN/DailyMail dataset, but underperforms pseudo-labeling on the more abstractive XSUM dataset. PyTorch code and checkpoints of different sizes are available through Hugging Face transformers here: http://tiny.cc/4iy0tz.
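To make the 'shrink and fine-tune' idea concrete, below is a minimal sketch using the Hugging Face transformers API. The 12-encoder/3-decoder student geometry, the helper names (`pick_layers`, `shrink_bart`), and the evenly spaced layer-selection heuristic are illustrative assumptions, not necessarily the exact recipe behind the paper's released checkpoints.

```python
import copy
import torch.nn as nn
from transformers import BartForConditionalGeneration


def pick_layers(n_teacher, n_student):
    """Choose n_student evenly spaced teacher layer indices (keeping first and last)."""
    if n_student == 1:
        return [0]
    step = (n_teacher - 1) / (n_student - 1)
    return [round(i * step) for i in range(n_student)]


def shrink_bart(teacher_name="facebook/bart-large-cnn", student_decoder_layers=3):
    """Build a student by copying a subset of the teacher's decoder layers (SFT sketch)."""
    teacher = BartForConditionalGeneration.from_pretrained(teacher_name)

    # Student config: identical to the teacher except for a shallower decoder.
    cfg = copy.deepcopy(teacher.config)
    cfg.decoder_layers = student_decoder_layers
    student = BartForConditionalGeneration(cfg)

    # Initialize every shape-compatible weight (embeddings, encoder, lm_head)
    # from the teacher; teacher decoder layers the student lacks are ignored.
    student.load_state_dict(teacher.state_dict(), strict=False)

    # Overwrite the student's decoder with evenly spaced copies of teacher layers.
    keep = pick_layers(teacher.config.decoder_layers, student_decoder_layers)
    student.model.decoder.layers = nn.ModuleList(
        copy.deepcopy(teacher.model.decoder.layers[i]) for i in keep
    )
    return student  # then fine-tune on the summarization dataset as usual


student = shrink_bart()  # e.g. a 12-encoder / 3-decoder student of bart-large-cnn
```

A pseudo-labeling variant would instead have the teacher generate summaries for the training articles and fine-tune the student on those generated targets, while direct knowledge distillation would add a loss term matching the student's output distribution to the teacher's.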
