Homograph Disambiguation Through Selective Diacritic Restoration

12/10/2019
by Sawsan Alqahtani, et al.

Lexical ambiguity, a challenging phenomenon in all natural languages, is particularly prevalent in languages whose diacritics tend to be omitted in writing, such as Arabic. Omitting diacritics increases the number of homographs: different words with the same spelling. Diacritic restoration could theoretically help disambiguate these words, but in practice, the resulting increase in overall sparsity degrades performance in NLP applications. In this paper, we propose approaches for automatically marking a subset of words for diacritic restoration, which leads to selective homograph disambiguation. Compared to full or no diacritic restoration, these approaches yield selectively diacritized datasets that balance sparsity and lexical disambiguation. We evaluate the various selection strategies extrinsically on several downstream applications: neural machine translation, part-of-speech tagging, and semantic textual similarity. Our experiments on Arabic show promising results: our strategies for selective diacritization lead to more balanced and consistent performance in downstream applications.
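The abstract does not spell out the selection strategies, so the following is only a minimal sketch of the general idea of selective diacritization, not the authors' method. It assumes a hypothetical criterion: a word keeps its diacritics only if its undiacritized form was observed with more than one diacritized variant in some reference corpus. The function names (`strip_diacritics`, `find_homographs`, `selectively_diacritize`) and the toy corpus are illustrative assumptions.

```python
import re
from collections import defaultdict

# Arabic tanween, short vowels, shadda, sukun (U+064B-U+0652) and superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(word):
    """Remove Arabic diacritic marks, leaving the bare written form."""
    return DIACRITICS.sub("", word)


def find_homographs(diacritized_corpus):
    """Return bare forms seen with more than one diacritized variant.

    Hypothetical selection criterion, used here only for illustration.
    """
    variants = defaultdict(set)
    for word in diacritized_corpus:
        variants[strip_diacritics(word)].add(word)
    return {bare for bare, forms in variants.items() if len(forms) > 1}


def selectively_diacritize(diacritized_tokens, homographs):
    """Keep diacritics only on tokens whose bare form is a known homograph."""
    output = []
    for token in diacritized_tokens:
        bare = strip_diacritics(token)
        output.append(token if bare in homographs else bare)
    return output


# Toy data written with escapes so the combining marks are explicit.
ALIMA = "\u0639\u064E\u0644\u0650\u0645\u064E"  # 'alima, "he knew"
ILM = "\u0639\u0650\u0644\u0652\u0645"          # 'ilm, "knowledge"
KITAB = "\u0643\u0650\u062A\u064E\u0627\u0628"  # kitab, "book"

corpus = [ALIMA, ILM, KITAB, KITAB]
homographs = find_homographs(corpus)
result = selectively_diacritize([ALIMA, KITAB], homographs)
print(result)
```

In this toy run, the two readings of the same bare form stay diacritized while the unambiguous word is reduced to its bare form, which is the trade-off between sparsity and disambiguation that the abstract describes.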


