Language Modelling via Learning to Rank

10/13/2021
by Arvid Frydenlund, et al.

We consider language modelling (LM) as a multi-label structured prediction task by re-framing training from solely predicting a single ground-truth word to ranking a set of words which could continue a given context. To avoid annotating top-k ranks, we generate them using pre-trained LMs: GPT-2, BERT, and Born-Again models. This leads to a rank-based form of knowledge distillation (KD). We also develop a method using N-grams to create a non-probabilistic teacher which generates the ranks without the need of a pre-trained LM. We confirm the hypotheses that we can treat LM as a ranking task and that we can do so without the use of a pre-trained LM. We show that rank-based KD generally improves perplexity (PPL), often with statistical significance, when compared to Kullback-Leibler-based KD. Surprisingly, given the simplicity of the method, N-grams act as competitive teachers and achieve performance similar to using either a BERT or a Born-Again model teacher. GPT-2 always acts as the best teacher, though; using it with a Transformer-XL student on Wiki-02, rank-based KD reduces perplexity from a cross-entropy baseline of 65.27 to 55.94, compared to 56.70 for KL-based KD.
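To make the training setup concrete, the sketch below shows one way rank-based KD could be implemented: the ground-truth token is placed at rank 1, the teacher's remaining top-k tokens fill the other ranks, and the student is trained with a Plackett-Luce (ListMLE-style) listwise loss over that ordering. The helper names, the choice of k, and the exact loss form are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of rank-based knowledge distillation for language modelling.
# Assumptions (not taken verbatim from the paper): the gold token is forced to
# rank 1, the teacher's top-(k-1) other tokens fill the remaining ranks, and the
# student minimises a Plackett-Luce / ListMLE-style listwise loss.

import torch


def topk_rank_targets(teacher_logits, gold_ids, k=8):
    """Per-position rank targets: gold token first, then the teacher's
    top-(k-1) tokens (excluding the gold token) by descending teacher score."""
    # teacher_logits: (batch, seq, vocab); gold_ids: (batch, seq)
    masked = teacher_logits.scatter(-1, gold_ids.unsqueeze(-1), float("-inf"))
    teacher_topk = masked.topk(k - 1, dim=-1).indices              # (batch, seq, k-1)
    return torch.cat([gold_ids.unsqueeze(-1), teacher_topk], dim=-1)  # (batch, seq, k)


def plackett_luce_loss(student_logits, rank_targets):
    """Negative log-likelihood of the target permutation under the student's
    scores: higher-ranked tokens are 'chosen' first (ListMLE)."""
    scores = student_logits.gather(-1, rank_targets)               # (batch, seq, k)
    # log P(permutation) = sum_i [ s_i - logsumexp(s_i, ..., s_k) ]
    suffix_lse = torch.flip(
        torch.logcumsumexp(torch.flip(scores, [-1]), dim=-1), [-1]
    )
    return -(scores - suffix_lse).sum(-1).mean()


# Usage with dummy tensors standing in for teacher/student forward passes.
batch, seq, vocab = 2, 16, 1000
teacher_logits = torch.randn(batch, seq, vocab)    # e.g. from a frozen teacher LM
student_logits = torch.randn(batch, seq, vocab, requires_grad=True)
gold_ids = torch.randint(0, vocab, (batch, seq))

ranks = topk_rank_targets(teacher_logits, gold_ids, k=8)
loss = plackett_luce_loss(student_logits, ranks)
loss.backward()
```

In this sketch the teacher logits could come from GPT-2, BERT, a Born-Again model, or be replaced by N-gram counts, since the loss only consumes the induced ordering rather than a teacher probability distribution.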


