Dynamic Language Models for Continuously Evolving Content

06/11/2021
by Spurthi Amba Hombaiah, et al.
The content on the web is in a constant state of flux. New entities, issues, and ideas continuously emerge, while the semantics of existing conversation topics gradually shift. In recent years, pre-trained language models like BERT have greatly improved the state of the art for a broad spectrum of content understanding tasks. In this paper, we therefore study how such language models can be adapted to better handle continuously evolving web content. We first analyze the evolution of Twitter data from 2013 to 2019 and unequivocally confirm that a BERT model trained on past tweets deteriorates heavily when directly applied to data from later years. We then investigate two possible sources of this deterioration: the semantic shift of existing tokens and the sub-optimal or failed understanding of new tokens. To this end, we explore two different vocabulary composition methods and propose three sampling methods that enable efficient incremental training of BERT-like models. Compared to a new model trained from scratch offline, our incremental training (a) reduces training costs, (b) achieves better performance on evolving content, and (c) is suitable for online deployment. The superiority of our methods is validated on two downstream tasks: we demonstrate significant improvements when incrementally evolving the model from a particular base year, both on Country Hashtag Prediction and on the OffensEval 2019 task.
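To make the idea of incremental training concrete, the sketch below continues masked-language-model pre-training of an older checkpoint on newer-year text. It is a minimal illustration assuming the HuggingFace transformers and datasets libraries; the checkpoint and file names are hypothetical, and the paper's specific vocabulary composition and sampling methods are not reproduced here.

```python
# Minimal sketch: continue MLM pre-training of a prior-year BERT checkpoint on
# newer tweets (assumes HuggingFace transformers/datasets; paths are hypothetical;
# this is NOT the paper's exact sampling or vocabulary-composition recipe).
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

base_checkpoint = "bert-tweets-2014"   # hypothetical model trained on older data
new_data_file = "tweets_2017.txt"      # hypothetical file of later-year tweets

tokenizer = BertTokenizerFast.from_pretrained(base_checkpoint)
model = BertForMaskedLM.from_pretrained(base_checkpoint)

# Tokenize the newer-year corpus for masked language modeling.
dataset = load_dataset("text", data_files={"train": new_data_file})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bert-tweets-2014-to-2017",
    per_device_train_batch_size=32,
    num_train_epochs=1,        # a short incremental pass, far cheaper than retraining from scratch
    learning_rate=2e-5,
    save_strategy="epoch",
)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```

Starting from the older checkpoint rather than random initialization is what keeps the update cheap; the paper's sampling methods would additionally choose which newer examples to train on.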

