
Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media

04/16/2021
by Paul Röttger, et al.

Language use differs between domains, and even within a domain, language use changes over time. Previous work shows that adapting pretrained language models like BERT to a domain through continued pretraining improves performance on in-domain downstream tasks. In this article, we investigate whether adapting BERT to time in addition to domain can increase performance even further. For this purpose, we introduce a benchmark corpus of social media comments sampled over three years. The corpus consists of 36.36m unlabelled comments for adaptation and evaluation on an upstream masked language modelling task, as well as 0.9m labelled comments for finetuning and evaluation on a downstream document classification task. We find that temporality matters for both tasks: temporal adaptation improves upstream task performance, and temporal finetuning improves downstream task performance. However, we do not find clear evidence that adapting BERT to time in addition to domain improves downstream task performance over adapting to domain alone. Temporal adaptation appears to capture changes in language use in the downstream task, but not the changes that are actually relevant to performance on it.
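The two-stage setup described in the abstract (continued masked language modelling on unlabelled comments from a time slice, followed by finetuning on labelled comments for classification) maps onto standard tooling. Below is a minimal sketch, assuming Hugging Face Transformers and Datasets; the file names (comments_2017.txt, labelled_2019.csv), the label count, and the hyperparameters are hypothetical placeholders, not the authors' actual configuration.

```python
# A minimal sketch of temporal adaptation + temporal finetuning,
# assuming Hugging Face Transformers and Datasets. File names, label
# count, and hyperparameters are illustrative placeholders only.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Fixed-length padding keeps the default collator happy downstream.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

# Upstream: temporal adaptation, i.e. continued masked language modelling
# on unlabelled comments from a single time slice (placeholder file).
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
unlabelled = load_dataset("text", data_files={"train": "comments_2017.txt"})
unlabelled = unlabelled.map(tokenize, batched=True, remove_columns=["text"])

Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="bert-adapted-2017",
                           num_train_epochs=1),
    train_dataset=unlabelled["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm_probability=0.15),
).train()
mlm_model.save_pretrained("bert-adapted-2017")
tokenizer.save_pretrained("bert-adapted-2017")

# Downstream: temporal finetuning of the adapted checkpoint on labelled
# comments (placeholder CSV with "text" and "label" columns).
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-adapted-2017", num_labels=2)
labelled = load_dataset("csv", data_files={"train": "labelled_2019.csv"})
labelled = labelled["train"].map(tokenize, batched=True)

Trainer(
    model=clf_model,
    args=TrainingArguments(output_dir="bert-finetuned-2019",
                           num_train_epochs=3),
    train_dataset=labelled,
).train()
```

Varying which time slice feeds each stage is what separates the two effects reported above: the upstream one (temporal adaptation) and the downstream one (temporal finetuning).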


Related research

10/02/2020

Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media

Recent studies on domain-specific BERT models show that effectiveness on...
12/03/2019

A Comparative Study of Pretrained Language Models on Thai Social Text Categorization

The ever-growing volume of data of user-generated content on social medi...
03/12/2021

Comparing the Performance of NLP Toolkits and Evaluation measures in Legal Tech

Recent developments in Natural Language Processing have led to the intro...
04/09/2022

Benchmarking for Public Health Surveillance tasks on Social Media with a Domain-Specific Pretrained Language Model

User-generated text on social media enables health workers to keep tra...
09/07/2021

FHAC at GermEval 2021: Identifying German toxic, engaging, and fact-claiming comments with ensemble learning

The availability of language representations learned by large pretrained...
04/17/2021

Combating Temporal Drift in Crisis with Adapted Embeddings

Language usage changes over time, and this can impact the effectiveness ...