DziriBERT: a Pre-trained Language Model for the Algerian Dialect

09/25/2021
by Amine Abdaoui, et al.

Pre-trained transformers are now the de facto models in Natural Language Processing given their state-of-the-art results in many tasks and languages. However, most of the current models have been trained on languages for which large text resources are already available (such as English, French, Arabic, etc.). Therefore, a number of low-resource languages still need more attention from the community. In this paper, we study the Algerian dialect, which has several specificities that make the use of Arabic or multilingual models inappropriate. To address this issue, we collected more than one million Algerian tweets and pre-trained the first Algerian language model: DziriBERT. When compared to existing models, DziriBERT achieves the best results on two Algerian downstream datasets. The obtained results show that pre-training a dedicated model on a small dataset (150 MB) can outperform existing models that have been trained on much more data (hundreds of GB). Finally, our model is publicly available to the community.
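Since the abstract states that the model is publicly released, a minimal usage sketch with the Hugging Face transformers library is given below. The repository identifier "alger-ia/dziribert" and the example input sentence are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: loading the released DziriBERT checkpoint from the
# Hugging Face Hub and querying it with the masked-LM pre-training objective.
# The repository id below is an assumption about where the model is published.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

MODEL_ID = "alger-ia/dziribert"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Masked-token prediction on an illustrative (placeholder) input sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
predictions = fill_mask(f"saha {tokenizer.mask_token}")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```

For the downstream evaluations mentioned in the abstract, the same checkpoint would typically be loaded with AutoModelForSequenceClassification and fine-tuned on the task-specific labels.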


