Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media

04/16/2021
by Paul Röttger et al.

Language use differs between domains, and even within a domain, language use changes over time. Previous work shows that adapting pretrained language models like BERT to a domain through continued pretraining improves performance on in-domain downstream tasks. In this article, we investigate whether adapting BERT to time in addition to domain can increase performance even further. For this purpose, we introduce a benchmark corpus of social media comments sampled over three years. The corpus consists of 36.36m unlabelled comments for adaptation and evaluation on an upstream masked language modelling task, as well as 0.9m labelled comments for finetuning and evaluation on a downstream document classification task. We find that temporality matters for both tasks: temporal adaptation improves upstream task performance and temporal finetuning improves downstream task performance. However, we do not find clear evidence that adapting BERT to time and domain improves downstream task performance over adapting to domain alone. This suggests that temporal adaptation captures changes in language use in the downstream task, but not the changes that are actually relevant to performance on it.
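As a concrete illustration of the adaptation setup described above, the sketch below shows continued masked language modelling on a single time slice of unlabelled comments, using the Hugging Face transformers library. This is not the authors' released code: the file name, checkpoint, and hyperparameters are illustrative assumptions, and the resulting model would then be finetuned on the labelled comments for the downstream classification task.

```python
# Minimal sketch of temporal/domain adaptation via continued masked language
# modelling (MLM), assuming one unlabelled comment per line in a time-sliced
# text file such as "comments_2017.txt" (hypothetical path).
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Unlabelled comments from a single time period, one comment per line.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="comments_2017.txt",  # hypothetical file name
    block_size=128,
)

# Randomly mask 15% of tokens, as in standard BERT pretraining.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="bert-adapted-2017",  # illustrative output directory
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

Trainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=dataset,
).train()
```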


