TimeLMs: Diachronic Language Models from Twitter

02/08/2022
by Daniel Loureiro, et al.

Despite its importance, the time variable has been largely neglected in the NLP and language model literature. In this paper, we present TimeLMs, a set of language models specialized in diachronic Twitter data. We show that a continual learning strategy enhances the capacity of Twitter-based language models to deal with future and out-of-distribution tweets, while keeping them competitive with more monolithic models on standard benchmarks. We also perform a number of qualitative analyses showing how the models cope with trends and peaks in activity involving specific named entities, as well as with concept drift.
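For a sense of how these diachronic checkpoints behave in practice, the sketch below queries two of the released models with the same masked tweet via the standard Hugging Face fill-mask pipeline. The checkpoint names follow the cardiffnlp/twitter-roberta-base-{period} naming used by the TimeLMs release, and the example sentence is one commonly used to illustrate temporal drift; treat both as assumptions and check the cardiffnlp organization on the Hugging Face Hub for the current list of models.

```python
# Minimal sketch (not code from the paper): probing two diachronic
# TimeLMs checkpoints with the same masked tweet. Model names are
# assumed from the public release; verify them on the Hub before use.
from transformers import pipeline

CHECKPOINTS = [
    "cardiffnlp/twitter-roberta-base-2019-90m",  # tweets up to end of 2019
    "cardiffnlp/twitter-roberta-base-mar2022",   # continually updated through March 2022
]

tweet = "So glad I'm <mask> vaccinated."  # RoBERTa-style <mask> token

for name in CHECKPOINTS:
    fill_mask = pipeline("fill-mask", model=name)
    print(name)
    for pred in fill_mask(tweet)[:3]:  # top-3 completions for the masked slot
        print(f"  {pred['token_str'].strip():>10}  {pred['score']:.3f}")
```

Comparing the top completions across checkpoints is a quick way to observe the temporal specialization the abstract describes: a pre-pandemic model and a 2022 model rank very different candidates for the same slot.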
