Cost-effective Selection of Pretraining Data: A Case Study of Pretraining BERT on Social Media

10/02/2020
by Xiang Dai, et al.

Recent studies on domain-specific BERT models show that effectiveness on downstream tasks can be improved when models are pretrained on in-domain data. Often, the pretraining data used in these models are selected based on their subject matter, e.g., biology or computer science. Given the range of applications using social media text, and its unique language variety, we pretrain two models on tweets and forum text respectively, and empirically demonstrate the effectiveness of these two resources. In addition, we investigate how similarity measures can be used to nominate in-domain pretraining data. We publicly release our pretrained models at https://bit.ly/35RpTf0.
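The abstract does not spell out which similarity measures are investigated, so the following is only a minimal illustrative sketch, not the paper's method: it ranks candidate pretraining corpora by the Jaccard overlap between their most frequent tokens and those of a downstream task corpus, a commonly used proxy for domain similarity. The toy corpora, the whitespace tokenizer, and the top-k cutoff are placeholder assumptions.

```python
from collections import Counter

def top_k_vocab(texts, k=10_000):
    """Return the k most frequent whitespace tokens in a corpus."""
    counts = Counter(tok.lower() for text in texts for tok in text.split())
    return {tok for tok, _ in counts.most_common(k)}

def vocab_overlap(target_texts, candidate_texts, k=10_000):
    """Jaccard overlap between the top-k vocabularies of two corpora."""
    target = top_k_vocab(target_texts, k)
    candidate = top_k_vocab(candidate_texts, k)
    return len(target & candidate) / len(target | candidate)

# Toy example: rank candidate pretraining corpora by similarity to task data.
task_data = ["new phone battery drains fast after update", "any fix for overheating??"]
candidates = {
    "tweets": ["battery dead again lol", "this update is killing my phone fr"],
    "forum": ["Has anyone noticed battery drain since the latest firmware update?"],
    "news": ["The central bank raised interest rates by 25 basis points on Tuesday."],
}
ranking = sorted(candidates, key=lambda name: vocab_overlap(task_data, candidates[name]), reverse=True)
print(ranking)  # corpora most lexically similar to the task data come first
```

A measure like this is cheap to compute before any pretraining is done, which is the point of nominating in-domain data up front rather than pretraining on every candidate corpus and comparing downstream results.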
