TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect

11/25/2021
by Abir Messaoudi, et al.

Pretrained contextualized text representation models learn an effective representation of a natural language to make it machine understandable. After the breakthrough of the attention mechanism and the introduction of the Transformer, a new generation of pretrained models has been proposed, achieving strong performance. Bidirectional Encoder Representations from Transformers (BERT) has become the state-of-the-art model for language understanding. Despite this success, most available models have been trained on Indo-European languages; similar research for under-represented languages and dialects remains sparse. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for under-represented languages, with a specific focus on the Tunisian dialect. We evaluate our language model on three downstream tasks: sentiment analysis, dialect identification, and reading-comprehension question answering. We show that using noisy web-crawled data instead of structured data (Wikipedia, articles, etc.) is more suitable for such a non-standardized language. Moreover, our results indicate that a relatively small web-crawled dataset leads to performance as good as that obtained with larger datasets. Finally, our best-performing TunBERT model reaches or improves the state of the art on all three downstream tasks. We release the pretrained TunBERT model and the datasets used for fine-tuning.
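The fine-tuning step described in the abstract, adapting the pretrained model to a downstream task such as sentiment analysis, can be sketched with the Hugging Face transformers library. This is a minimal illustration only: the model identifier and data files below are placeholders, not the authors' released checkpoint or datasets.

```python
# Minimal fine-tuning sketch for a sentiment classification task.
# NOTE: "tunbert-base" and the CSV files are hypothetical placeholders,
# not the released TunBERT artifacts.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

MODEL_NAME = "tunbert-base"  # placeholder identifier for the pretrained model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical sentiment dataset with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    # Tokenize raw text into fixed-length input IDs for the encoder.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tunbert-sentiment",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()          # fine-tune the encoder plus classification head
print(trainer.evaluate())  # report evaluation metrics on the test split
```

The same pattern applies to the dialect identification and question-answering tasks, swapping the model head and dataset accordingly.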



