Arabic Diacritic Recovery Using a Feature-Rich biLSTM Model

02/04/2020
by Kareem Darwish, et al.

Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: core-word diacritics (CW), which specify the lexical selection, and case endings (CE), which typically appear at the end of the word stem and generally specify the word's syntactic role. Recovering CEs is harder than recovering core-word diacritics because of inter-word dependencies, which are often distant. In this paper, we use a feature-rich recurrent neural network model that draws on a variety of linguistic and surface-level features to recover both core-word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.86% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA), and a CWER of 2.2% [...] for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rate is 6.0% [...]. This highlights the effectiveness of feature engineering for such deep neural models.
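The three error rates named in the abstract can be illustrated with a minimal sketch. It assumes each word has already been split into a diacritized core and a case ending, and that the gold and predicted sequences are aligned word for word. The `Word` pair representation and the `error_rates` function are hypothetical names introduced here; the paper's exact evaluation protocol is not given in this text and may differ (for example, in which words are counted).

```python
from typing import List, Tuple

# Hypothetical representation (an assumption for illustration): each word is
# a (core, case_ending) pair of diacritized strings.
Word = Tuple[str, str]


def error_rates(gold: List[Word], pred: List[Word]) -> Tuple[float, float, float]:
    """Return (CWER, CEER, WER) as fractions of words in error.

    CWER counts words whose core differs from the gold core,
    CEER counts words whose case ending differs from the gold case ending,
    WER counts words where either part differs.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned word for word")
    if not gold:
        raise ValueError("cannot compute error rates on empty input")

    n = len(gold)
    cw_errors = sum(g[0] != p[0] for g, p in zip(gold, pred))
    ce_errors = sum(g[1] != p[1] for g, p in zip(gold, pred))
    word_errors = sum(g != p for g, p in zip(gold, pred))
    return cw_errors / n, ce_errors / n, word_errors / n


# Toy example with transliterated placeholders instead of Arabic script.
gold = [("kataba", "a"), ("Al-walad", "u"), ("dars", "an"), ("qalam", "i")]
pred = [("kutiba", "a"), ("Al-walad", "a"), ("dars", "an"), ("qalam", "i")]
cwer, ceer, wer = error_rates(gold, pred)
```

Under this definition the word error rate can never exceed the sum of the other two rates, because a word with both a core error and a case-ending error is counted only once.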


