Pre-Training BERT on Arabic Tweets: Practical Considerations

02/21/2021
by Ahmed Abdelali, et al.

Pretraining Bidirectional Encoder Representations from Transformers (BERT) for downstream NLP tasks is a non-trivial task. We pretrained five BERT models that differ in the size of their training sets, the mixture of formal and informal Arabic, and the linguistic preprocessing applied. All are intended to support Arabic dialects and social media text. The experiments highlight the centrality of data diversity and the efficacy of linguistically aware segmentation; they also show that more data or more training steps do not necessarily yield better models. Our models achieve new state-of-the-art results on several downstream tasks, and the resulting models are released to the community under the name QARiB.
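For readers who want to try the released checkpoints, the sketch below shows how a QARiB model could be loaded for masked-token prediction with the Hugging Face transformers library. The Hub identifier qarib/bert-base-qarib and the example sentence are assumptions for illustration, not details taken from the abstract; substitute the identifier published with the release.

```python
# Minimal sketch: load a released QARiB checkpoint and score a masked
# Arabic token with the Hugging Face transformers fill-mask pipeline.
# The Hub identifier below is an assumption; use the one from the release.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_name = "qarib/bert-base-qarib"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Predict candidates for the masked token in a dialectal Arabic sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask("شو عندكم يا [MASK]"):
    print(prediction["token_str"], round(prediction["score"], 3))
```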
