Context-Gloss Augmentation for Improving Arabic Target Sense Verification

02/06/2023
by Sanad Malaysha, et al.

The Arabic language lacks semantic datasets and sense inventories. The most common semantically labeled dataset for Arabic is ArabGlossBERT, a relatively small dataset of 167K context-gloss pairs (about 60K positive and 107K negative pairs) collected from Arabic dictionaries. This paper presents an enrichment of the ArabGlossBERT dataset, augmenting it with machine back-translation (Arabic-English-Arabic). Augmentation increased the dataset size to 352K pairs (149K positive and 203K negative pairs). We measure the impact of augmentation by fine-tuning BERT on the target sense verification (TSV) task with different data configurations. Overall, the accuracy ranges between 78% and [...]. Although our approach performed on par with the baseline, we observed improvements for some POS tags in some experiments. Furthermore, our fine-tuned models are trained on a larger dataset covering a larger vocabulary and more contexts. We provide an in-depth analysis of the accuracy for each part-of-speech (POS) tag.
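The back-translation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline. The `Pair` type, the `translate` callable signature, the choice to back-translate both context and gloss, and the de-duplication step are all assumptions made for illustration. `toy_translate` is a hypothetical stand-in for a real machine translation system.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Pair:
    """One context-gloss pair (hypothetical layout, not the dataset schema)."""
    context: str
    gloss: str
    label: int  # 1 = gloss matches the target sense, 0 = it does not


# Assumed translator signature: (text, source_lang, target_lang) -> text
Translate = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translate) -> str:
    """Round-trip a string Arabic -> English -> Arabic."""
    english = translate(text, "ar", "en")
    return translate(english, "en", "ar")


def augment(pairs: List[Pair], translate: Translate) -> List[Pair]:
    """Return the original pairs plus back-translated variants.

    Labels are copied unchanged. Variants identical to an existing pair
    are skipped (an assumption, not a step confirmed by the abstract).
    """
    seen = set(pairs)
    out = list(pairs)
    for p in pairs:
        candidate = Pair(
            back_translate(p.context, translate),
            back_translate(p.gloss, translate),
            p.label,
        )
        if candidate not in seen:
            seen.add(candidate)
            out.append(candidate)
    return out


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical lookup-table translator; unknown strings pass through."""
    table = {
        ("ar", "en"): {"ctx-1": "en-ctx-1"},
        ("en", "ar"): {"en-ctx-1": "ctx-1-paraphrase"},
    }
    return table[(src, tgt)].get(text, text)


if __name__ == "__main__":
    data = [Pair("ctx-1", "gloss-1", 1), Pair("ctx-2", "gloss-2", 0)]
    for pair in augment(data, toy_translate):
        print(pair)
```

In this sketch a round trip that returns the input unchanged adds nothing, so the augmented set can be smaller than twice the original. Whether the paper's 352K figure reflects a similar filter is not stated in the abstract.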

research
05/19/2022

ArabGlossBERT: Fine-Tuning BERT on Context-Gloss Pairs for WSD

Using pre-trained transformer models such as BERT has proven to be effec...
research
06/01/2021

Part of Speech and Universal Dependency effects on English Arabic Machine Translation

In this research paper, I will elaborate on a method to evaluate machine...
research
05/28/2021

Investigating Code-Mixed Modern Standard Arabic-Egyptian to English Machine Translation

Recent progress in neural machine translation (NMT) has made it possible...
research
12/01/2020

Extracting Synonyms from Bilingual Dictionaries

We present our progress in developing a novel algorithm to extract synon...
research
05/20/2022

Current Trends and Approaches in Synonyms Extraction: Potential Adaptation to Arabic

Extracting synonyms from dictionaries or corpora is gaining special atte...
research
06/29/2021

New Arabic Medical Dataset for Diseases Classification

The Arabic language suffers from a great shortage of datasets suitable f...
