TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations

09/15/2022
by Xinyang Zhang, et al.

We present TwHIN-BERT, a multilingual language model trained on in-domain data from the popular social network Twitter. TwHIN-BERT differs from prior pre-trained language models in that it is trained not only with text-based self-supervision but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion tweets covering more than 100 distinct languages, providing a valuable representation for modeling short, noisy, user-generated text. We evaluate our model on a variety of multilingual social recommendation and semantic understanding tasks and demonstrate significant improvements over established pre-trained language models. We will freely open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community.
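The abstract describes joint pre-training with a text objective and a social objective. The sketch below, written with PyTorch and Hugging Face Transformers, shows one way such a dual objective could be combined: a standard masked-language-modeling loss plus an InfoNCE-style contrastive loss that pulls together tweets sharing social engagement. The pairing scheme, mean pooling, temperature, and loss weighting here are illustrative assumptions, not the paper's actual recipe.

```python
# Minimal sketch: MLM loss + contrastive "social" loss over co-engaged tweet pairs.
# Assumptions (not from the abstract): InfoNCE-style social objective, mean pooling,
# in-batch negatives, and an mBERT starting checkpoint.
import torch
import torch.nn.functional as F
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

def mean_pool(hidden, mask):
    # Average token embeddings, ignoring padding positions.
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

def training_step(tweet_pairs, social_weight=1.0, temperature=0.05):
    """tweet_pairs: list of (tweet_a, tweet_b) strings that share social engagement."""
    texts = [t for pair in tweet_pairs for t in pair]  # interleave a, b, a, b, ...
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")

    # 1) Masked-language-modeling loss on masked copies of the same batch.
    masked = collator([{"input_ids": ids} for ids in enc["input_ids"]])
    mlm_out = model(input_ids=masked["input_ids"],
                    attention_mask=enc["attention_mask"],
                    labels=masked["labels"])

    # 2) Contrastive social loss: co-engaged tweets are positives,
    #    the rest of the batch serves as in-batch negatives.
    hidden = model(**enc, output_hidden_states=True).hidden_states[-1]
    emb = F.normalize(mean_pool(hidden, enc["attention_mask"]), dim=-1)
    a, b = emb[0::2], emb[1::2]                  # split the interleaved pairs
    logits = a @ b.T / temperature               # similarity of every a to every b
    targets = torch.arange(a.size(0))
    social_loss = F.cross_entropy(logits, targets)

    return mlm_out.loss + social_weight * social_loss
```

In an actual pre-training loop, an optimizer step would follow each call to `training_step`, with co-engaged tweet pairs mined from the engagement graph.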

