Improving Low Compute Language Modeling with In-Domain Embedding Initialisation

09/29/2020
by Charles Welch, et al.

Many NLP applications, such as those in biomedicine and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most language modeling research considers either a small dataset with a closed vocabulary (like the standard 1 million token Penn Treebank), or the whole web with byte-pair encoding. We show that for our target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and that this pattern holds across several different domains. In the process, we show that the standard convention of tying input and output embeddings does not improve perplexity when initialising with embeddings trained on in-domain data.
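The recipe described in the abstract (train word embeddings on the in-domain corpus, copy them into the language model's input embedding layer, freeze them, and leave the output embeddings untied) can be sketched in a few lines. The snippet below is an illustrative reconstruction rather than the authors' code: it assumes gensim for the in-domain word vectors and a small PyTorch LSTM language model, and all names, dimensions, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# 1. Train word vectors on the in-domain corpus (tokenised sentences).
in_domain_sentences = [
    ["the", "patient", "was", "afebrile", "on", "admission"],
    ["restart", "the", "service", "and", "check", "the", "logs"],
]  # in practice: 10-100 million tokens of in-domain text
w2v = Word2Vec(sentences=in_domain_sentences, vector_size=300, min_count=1, workers=4)

# 2. Build an embedding matrix aligned with the language model's vocabulary.
vocab = list(w2v.wv.key_to_index)  # replace with the LM's own vocabulary
emb_matrix = torch.zeros(len(vocab), w2v.vector_size)
for i, word in enumerate(vocab):
    emb_matrix[i] = torch.from_numpy(w2v.wv[word].copy())

# 3. Initialise the input embeddings from the in-domain vectors and freeze them;
#    the output (softmax) embeddings stay untied and trainable.
class LSTMLanguageModel(nn.Module):
    def __init__(self, pretrained_emb, hidden_size=650):
        super().__init__()
        vocab_size, emb_dim = pretrained_emb.shape
        self.embed = nn.Embedding.from_pretrained(pretrained_emb, freeze=True)
        self.lstm = nn.LSTM(emb_dim, hidden_size, num_layers=2, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)  # not tied to self.embed

    def forward(self, tokens, state=None):
        output, state = self.lstm(self.embed(tokens), state)
        return self.decoder(output), state

model = LSTMLanguageModel(emb_matrix)
```

Freezing means rare words keep the representation learned from the raw in-domain text instead of drifting on the few gradient updates they receive during language model training, which is the effect the paper attributes the perplexity gains to.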


