A Focused Study to Compare Arabic Pre-training Models on Newswire IE Tasks

04/30/2020
by Wuwei Lan, et al.

The Arabic language is a morphologically rich language, posing many challenges for information extraction (IE) tasks, including Named Entity Recognition (NER), Part-of-Speech tagging (POS), Argument Role Labeling (ARL), and Relation Extraction (RE). A few multilingual pre-trained models have been proposed and show good performance for Arabic; however, most experimental results are reported on language understanding tasks, such as natural language inference, question answering, and sentiment analysis. Their performance on IE tasks is less well known, in particular their cross-lingual transfer capability from English to Arabic. In this work, we pre-train a Gigaword-based bilingual language model (GigaBERT) to study these two distant languages as well as zero-shot transfer learning on information extraction tasks. Our GigaBERT model outperforms mBERT and XLM-R-base on the NER, POS, and ARL tasks, with regard to per-language and/or zero-shot transfer performance. We make our pre-trained models publicly available at https://github.com/lanwuwei/GigaBERT to facilitate research in this field.
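For readers who want to try the released checkpoints on an IE task, the following is a minimal sketch of loading a GigaBERT-style model for token classification (e.g., NER) with the Hugging Face Transformers library and running it on Arabic text after English-only fine-tuning, which is the zero-shot transfer setting described above. The model identifier and the label count are assumptions for illustration; consult https://github.com/lanwuwei/GigaBERT for the actual checkpoint names and task setup.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed checkpoint name; replace with the identifier listed in the GigaBERT repository.
MODEL_ID = "lanwuwei/GigaBERT-v4-Arabic-and-English"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# num_labels=9 assumes a CoNLL-style BIO tag set; adjust to the target task's label scheme.
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=9)

# Zero-shot cross-lingual transfer: fine-tune on English NER data (not shown here),
# then apply the same model directly to Arabic input with no Arabic training labels.
arabic_sentence = "ولد باراك أوباما في هاواي"
inputs = tokenizer(arabic_sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_subword_tokens, num_labels)
predicted_tag_ids = logits.argmax(dim=-1)
print(predicted_tag_ids)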
