Arabic Diacritic Recovery Using a Feature-Rich biLSTM Model

by Kareem Darwish et al.
Hamad Bin Khalifa University

Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify lexical selection, and the second are case endings (CE), which typically appear at the end of the word stem and generally specify syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often distant. In this paper, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core-word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.86% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA), and a CWER of 2.2% and a CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rate is 6.0% for MSA, which highlights the effectiveness of feature engineering for such deep neural models.


1 Introduction

Modern Standard Arabic (MSA) and Classical Arabic (CA) have two types of vowels, namely long vowels, which are explicitly written, and short vowels, aka diacritics, which are typically omitted in writing but are reintroduced by readers to properly pronounce words. Since diacritics disambiguate the sense of words in context and their syntactic roles in sentences, automatic diacritic recovery is essential for applications such as text-to-speech and educational tools for language learners, who may not know how to properly verbalize words. Diacritics have two types, namely: core-word (CW) diacritics, which are internal to words and specify lexical selection; and case endings (CE), which appear on the last letter of word stems, typically specifying their syntactic role. For example, the word “ktb” (we use Buckwalter encoding throughout this paper (buckwalter2002buckwalter)) (كتب) can have multiple diacritized forms such as “katab” (كَتَب – meaning “he wrote”) and “kutub” (كُتُب – “books”). While “katab” can only assume one CE, namely “fatHa” (“a”), “kutub” can accept the CEs: “damma” (“u”) (nominative – ex. subject), “a” (accusative – ex. object), “kasra” (“i”) (genitive – ex. PP predicate), or their nunations. There are 14 diacritic combinations. When used as CEs, they typically convey specific syntactic information, namely: fatHa “a” for accusative nouns, past verbs and subjunctive present verbs; kasra “i” for genitive nouns; damma “u” for nominative nouns and indicative present verbs; and sukun “o” for jussive present verbs and imperative verbs. FatHa, kasra and damma can be preceded by a “shadda” for gemination (consonant doubling) and/or converted to nunation forms following some grammar rules. In addition, according to Arabic orthography and phonology, some words take a virtual (null) “#” marker when they end with certain characters (ex: long vowels). This also applies to all non-Arabic tokens (ex: punctuation, digits, Latin words, etc.).
Generally, function words, adverbs and foreign named entities (NEs) have set CEs (sukun, fatHa or virtual).

Similar to other Semitic languages, Arabic allows flexible Verb-Subject-Object as well as Verb-Object-Subject constructs (Attia:08). Such flexibility creates inherent ambiguity, which is resolved by diacritics, as in “r>Y Emr Ely” (رأى عمر علي – Omar saw Ali/Ali saw Omar). In the absence of diacritics it is not clear who saw whom. Similarly, in the sub-sentence “kAn Alm&tmr AltAsE” (كان المؤتمر التاسع), if the last word is a predicate of the verb “kAn”, then the sentence would mean “this conference was the ninth” and the word would receive a fatHa (a) as a case ending. Conversely, if it were an adjective of “conference”, then the sentence would mean “the ninth conference was …” and the word would receive a damma (u) as a case ending. Thus, a consideration of context is required for proper disambiguation. Due to the inter-word dependence of CEs, they are typically harder to predict than core-word diacritics (habash2007arabic, roth2008arabic, harrat:hal-00925815, ameur2015restoration), with the CEER of state-of-the-art systems being in double digits compared to nearly 3% for word cores. Since recovering CEs is akin to shallow parsing (marton2010improving) and requires morphological and syntactic processing, it is a difficult problem in Arabic NLP.

In this paper, we focus on recovering both CW diacritics and CEs. We employ two separate Deep Neural Network (DNN) architectures for the two kinds of diacritics: character-level and word-level bidirectional Long Short-Term Memory (biLSTM) recurrent neural models for CW diacritic and CE recovery respectively. We train models for both Modern Standard Arabic (MSA) and Classical Arabic (CA). For CW diacritics, the model is informed using word segmentation information and a unigram language model. We also employ the unigram language model to perform post correction on the model output. We achieve word error rates for CW diacritics of 2.9% for MSA and 2.2% for CA. The MSA word error rate is 6% lower than the best result in the literature (the RDI diacritizer (rashwan2015deep)). The CE model is trained with a rich set of surface, morphological, and syntactic features. The proposed features aid the biLSTM model in capturing syntactic dependencies indicated by Part-Of-Speech (POS) tags, gender and number features, morphological patterns, and affixes. We show that our model achieves a case ending error rate (CEER) of 3.7% for MSA and 2.5% for CA. For MSA, this CEER is more than 60% lower than that of other state-of-the-art systems such as Farasa and the RDI diacritizer, which are trained on the same dataset and achieve CEERs of 10.7% and 14.4% respectively. The contributions of this paper are as follows:

  • We employ a character-level RNN model that is informed using word morphological information and a word unigram language model to recover CW diacritics. Our model beats the best state-of-the-art system by 6% for MSA.

  • We introduce a new feature-rich RNN-based CE recovery model that achieves error rates that are 60% lower than the current state-of-the-art for MSA.

  • We explore the effect of different features, which may potentially be exploited for Arabic parsing.

  • We show the effectiveness of our approach for both MSA and CA.

2 Background

Automatic diacritics restoration has been investigated for many different languages, such as European languages (e.g. Romanian (mihalcea2002diacritics, tufics2008diac+), French (zweigenbaum2002restoring), and Croatian (vsantic2009automatic)), African languages (e.g. Yoruba (orife2018attentive)), Southeast Asian languages (e.g. Vietnamese (luu2012pointwise)), Semitic languages (e.g. Arabic and Hebrew (gal2002hmm)), and many others (de2007automatic). For many languages, diacritic (or accent) restoration is limited to a handful of letters. However, for Semitic languages, diacritic recovery extends to most letters. Many general approaches have been explored for this problem, including linguistically motivated rule-based approaches; machine learning approaches such as Hidden Markov Models (HMM) (gal2002hmm) and Conditional Random Fields (CRF) (darwish2018diacritization); and lately deep learning approaches for languages such as Arabic (abandah2015automatic, Hifny2018Hybrid, mubarak2019highly), Slovak (hucko2018Diacritic), and Yoruba (orife2018attentive).

Aside from rule-based approaches (el1989arabic), different methods were used to recover diacritics in Arabic text. Using a hidden Markov model (HMM) (gal2002hmm, elshafei2006statistical) with an input character sequence, the model attempts to find the best state sequence given previous observations. gal2002hmm reported a 14% word error rate (WER), while elshafei2006statistical achieved a 4.1% diacritic error rate (DER) on the Quran (CA). vergyri2004automatic combined morphological, acoustic, and contextual features to build a diacritizer trained on the FBIS and LDC CallHome ECA collections. They reported a 9% DER without CE and a 28% DER with CE. nelken2005arabic employed a cascade of finite state transducers that stacked a word language model (LM), a character LM, and a morphological model. The model achieved a 7.33% WER without CE and a 23.61% WER with CE. zitouni2006maximum employed a maximum entropy model for sequence classification. The system was trained on the LDC's Arabic Treebank (ATB), evaluated on 600 articles from the An-Nahar newspaper (340K words), and achieved 5.5% DER and 18% WER on words without CE.
bebah2014hybrid used a hybrid approach that utilizes the output of the Alkhalil morphological analyzer (bebah2011alkhalil) to generate all possible out-of-context diacritizations of a word; an HMM then guesses the correct diacritized form. Similarly, the Microsoft Arabic Toolkit Services (ATKS) diacritizer (microsoft2013diac) uses a rule-based morphological analyzer that produces possible analyses and an HMM in conjunction with rules to guess the most likely analysis. They report WERs of 11.4% and 4.4% with and without CE respectively. MADAMIRA (pasha2014madamira) uses a combination of morpho-syntactic features to rank a list of potential analyses provided by the Buckwalter Arabic Morphological Analyzer (BAMA) (buckwalter2004buckwalter). An SVM trained on ATB selects the most probable analysis, including the diacritized form. MADAMIRA achieves 19.0% and 6.7% WER with and without CE respectively (darwish2017arabic). Farasa (darwish2017arabic) uses an HMM to guess CW diacritics and an SVM-rank based model trained on morphological and syntactic features to guess CEs. Farasa achieves WERs of 12.8% and 3.3% with and without CEs.

More recent work employed different neural architectures to model the diacritization problem. abandah2015automatic used a biLSTM recurrent neural network trained on the same dataset as (zitouni2006maximum). They explored one, two, and three biLSTM layers with 250 nodes in each layer, achieving a WER of 9.1% including CE on ATB. Similar architectures were used elsewhere but achieved lower results (rashwan2015deep, belinkov2015arabic). azmi2015survey provide a comprehensive survey of Arabic diacritization. A more recent survey by HamedZesch:2017 concluded that reported results are often incomparable due to the usage of different test sets. They also concluded that a large unigram LM for CW diacritic recovery is competitive with many of the systems in the literature, which prompted us to utilize a unigram language model for post correction. As mentioned earlier, two conclusions can be drawn, namely: restoring CEs is more challenging than CW diacritic restoration; and combining multiple features typically improves CE restoration.

In this paper, we expand upon the work in the literature by introducing feature-rich DNN models for restoring both CW and CE diacritics. We compare our models to multiple systems on the same test set. We achieve results that reduce diacritization error rates by more than half compared to the best SOTA systems. We further conduct an ablation study to determine the relative effect of the different features.

As for Arabic, it is a Semitic language with derivational morphology. Arabic nouns, adjectives, adverbs, and verbs are typically derived from a closed set of roughly 10,000 roots of length 3, 4, or rarely 5. Arabic nouns and verbs are derived from roots by applying templates to the roots to generate stems. Such templates may carry information that indicates morphological features of words such as POS tag, gender, and number. For example, given a 3-letter root with 3 consonants CCC, a valid template may be CwACC, where the infix “wA” (وا) is inserted. This template typically indicates an Arabic broken (irregular) plural of a noun of template CACC or CACCp, masculine or feminine respectively. Further, stems may accept prefixes and/or suffixes to form words. Prefixes include coordinating conjunctions, determiners, and prepositions, while suffixes include attached pronouns and gender and number markers.
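To make the templatic derivation concrete, the following minimal sketch (the function name and the interpretation of non-C template letters are ours) fills a template's consonant slots with a root's letters, using Buckwalter transliteration:

```python
def apply_template(root: str, template: str) -> str:
    """Fill the consonant slots (C) of a stem template with the root's
    letters in order; all other template letters are kept verbatim."""
    letters = iter(root)
    return "".join(next(letters) if ch == "C" else ch for ch in template)

# Root k-t-b (writing-related words) under the templates discussed above:
print(apply_template("ktb", "CACC"))   # kAtb  (a singular noun stem)
print(apply_template("ktb", "CwACC"))  # kwAtb (its broken-plural counterpart)
```

Real derivation also interacts with phonological rules, so this is only an illustration of the slot-filling idea.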

3 Our Diacritizer

3.1 Training and Test Corpora

For MSA, we acquired the diacritized corpus that was used to train the RDI (rashwan2015deep) diacritizer and the Farasa diacritizer (darwish2017arabic). The corpus contains 9.7M tokens with approximately 194K unique surface forms (excluding numbers and punctuation marks). The corpus covers multiple genres such as politics and sports and is a mix of MSA and CA. This corpus is considerably larger than the Arabic Treebank (maamouri2004atb3) and is more consistent in its diacritization. For testing, we used the freely available WikiNews test set (darwish2017arabic), which is composed of 70 MSA WikiNews articles (18,300 tokens) and evenly covers a variety of genres including politics, economics, health, science and technology, sports, arts and culture.

For CA, we obtained a large collection of fully diacritized classical texts (2.7M tokens) from a book publisher, and we held out a small subset of 5,000 sentences (approximately 400K words) for testing. Then, we used the remaining sentences to train the CA models.

3.2 Core Word Diacritization

3.2.1 Features.

Arabic words are typically derived from a limited set of roots by fitting them into so-called stem-templates (producing stems) and may accept a variety of prefixes and suffixes such as prepositions, determiners, and pronouns (producing words). Word stems specify the lexical selection and are typically unaffected by the attached affixes. We used 4 feature types, namely:

  • CHAR: the characters.

  • SEG: the position of the character in a word segment. For example, the word “wAlktAb” (والكتاب – and the book/the writers) is composed of 3 segments: “w+Al+ktAb” (و+ال+كتاب). Letters were marked as “B” if they begin a segment, “M” if they are in the middle of a segment, “E” if they end a segment, and “S” if they are single-letter segments. So for “w+Al+ktAb”, the corresponding character positions are “S+BE+BMME”. We used Farasa to perform segmentation, which has a reported segmentation accuracy of 99% on the WikiNews dataset (DARWISH2016farasa).

  • PRIOR: diacritics seen in the training set per segment. Since we used a character-level model, this feature informed the model with word-level information. For example, the word “ktAb” (كتاب) was observed to have two diacritized forms in the training set, namely “kitaAb” (كِتَاب – book) and “kut~aAb” (كُتَّاب – writers). The first letter in the word (“k”) accepted the diacritics “i” and “u”. Thus, given a binary vector representing whether a character is allowed to assume any of the eight primitive Arabic diacritic marks (a, i, u, o, K, N, F, and ~, in order), the first letter would be given the vector “01100000”. If a word segment was never observed during training, the vector for all letters therein would be set to “11111111”. This feature borrows information from HMM models, which have been fairly successful in diacritizing word cores.

  • CASE: whether the letter expects a core word diacritic or a case ending. Case endings are placed on only one letter in a word, which may or may not be the last letter in the word. This is a binary feature.
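The SEG labels and PRIOR indicator vectors above can be sketched as follows (function names and the set-based lookup are ours; the paper obtains segments from Farasa):

```python
DIACRITICS = "aiuoKNF~"  # the eight primitive marks, in the order listed above

def seg_labels(segmented: str) -> str:
    """Per-letter position labels for a '+'-segmented word:
    B = begins, M = middle of, E = ends a segment, S = single-letter segment."""
    labels = []
    for seg in segmented.split("+"):
        labels.append("S" if len(seg) == 1 else "B" + "M" * (len(seg) - 2) + "E")
    return "+".join(labels)

def prior_vector(allowed_marks: set) -> str:
    """8-bit indicator of which diacritics a letter was observed with in
    training; all ones when the containing segment was never observed."""
    if not allowed_marks:
        return "1" * len(DIACRITICS)
    return "".join("1" if d in allowed_marks else "0" for d in DIACRITICS)

print(seg_labels("w+Al+ktAb"))   # S+BE+BMME
print(prior_vector({"i", "u"}))  # 01100000  ('k' in "ktAb", per the example)
```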

3.2.2 DNN Model.

Using a DNN model, particularly with a biLSTM (schuster1997bilstm), is advantageous because the model automatically explores the space of feature combinations and is able to capture distant dependencies. A number of studies have explored various biLSTM architectures (abandah2015automatic, rashwan2015deep, belinkov2015arabic), including stacked biLSTMs, confirming their effectiveness. As shown in Figure 1, we employed a character-based biLSTM model with associated features for each character. Every input character had an associated list of features, and we trained randomly initialized embeddings of size 50 for each feature. Then, we concatenated the feature embedding vectors, creating a single vector for each character, which was fed into the biLSTM layer of length 100. The output of the biLSTM layer was fed directly into a dense layer of size 100. We used early stopping with a patience of 5 epochs, a learning rate of 0.001, a batch size of 256, and the Adamax optimizer. The input was the character sequence in a sentence with words separated by word-boundary markers (WB), and we set the maximum sentence length to 1,250 characters.
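A rough Keras sketch of this architecture follows (the vocabulary sizes, the size of the output label set, the dense-layer activation, and the encoding of PRIOR bit-patterns as ids are our assumptions; the embedding size, biLSTM width, dense width, and optimizer settings are from the text):

```python
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN = 1250      # maximum sentence length in characters (from the paper)
N_LABELS = 9        # assumed: 8 primitive diacritic marks + "no diacritic"

def build_cw_model(feature_vocabs, emb_dim=50, lstm_units=100):
    """Character-level biLSTM for core-word diacritics: one 50-dim embedding
    per feature, concatenated, then biLSTM(100) -> Dense(100) -> softmax."""
    inputs, embedded = [], []
    for name, vocab_size in feature_vocabs.items():
        inp = layers.Input(shape=(MAX_LEN,), name=name)
        inputs.append(inp)
        embedded.append(layers.Embedding(vocab_size, emb_dim)(inp))
    x = layers.Concatenate()(embedded)
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
    x = layers.Dense(100, activation="relu")(x)
    out = layers.Dense(N_LABELS, activation="softmax")(x)
    model = keras.Model(inputs, out)
    model.compile(optimizer=keras.optimizers.Adamax(learning_rate=0.001),
                  loss="categorical_crossentropy")
    return model

# One input stream per feature: CHAR, SEG, PRIOR, CASE (toy vocabulary sizes).
model = build_cw_model({"CHAR": 60, "SEG": 5, "PRIOR": 256, "CASE": 2})
```

The early stopping described above would be passed to `model.fit` as a `keras.callbacks.EarlyStopping(patience=5)` callback.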

Figure 1: DNN model for core word diacritics

3.3 Case Ending Diacritization

3.3.1 Features.

Table 1 lists the features that we used for CE recovery. We used Farasa to perform segmentation and POS tagging and to determine stem templates (darwish2017arabic). Farasa has a reported POS accuracy of 96% on the WikiNews dataset (darwish2017arabic). Though the Farasa diacritizer utilizes a combination of some of the features presented herein, namely segmentation, POS tagging, and stem templates, Farasa's SVM-ranking approach requires explicit specification of feature combinations. Manual exploration of the feature space is undesirable, and ideally we would want our learning algorithm to do so automatically. The flexibility of the DNN model allowed us to include many more surface-level features such as affixes, leading and trailing characters in words and stems, and the presence of words in large gazetteers of named entities. As we show later, these additional features significantly lowered the CEER.

Feature Example Explanation and Motivation
word w+b+mktb+t+nA (و+ب+مكتب+ت+نا – and in our library) Some words have a fixed set of observed CEs
word POS CONJ+PREP+NOUN+NSUFF+PRON Some POS combinations allow a closed set of CEs
gender/number feminine/singular Gender/number agreement (dis)allows certain attachments and may allow/exclude certain CEs
stem mktb+p (مكتب+ة – library) We attach gender and number noun suffixes such as the singular feminine marker “p” (ـة) because CEs appear on them
stem POS NOUN+NSUFF Same rationale as word POS
prefix(es) & POS w+b+ (و+ب+) & CONJ+PREP Certain prefixes affect the CE directly. For example, “b+” (ب+) is a preposition causing its noun predicate to assume the genitive case
suffix(es) & POS “+nA” (+نا) & PRON Certain suffixes affect the CE directly
stem template mfEl+p (مفعل+ة – derived from the root “ktb” كتب) Some stem templates allow certain CEs and exclude others. Ex. the stem template “>fEl” (أفعل) disallows tanween (“N”, “K”, “F”)
word/stem head/tail char uni/bi-grams word: w (و), wb (وب); stem: A (ا), nA (نا) Such characters can capture some morphological and syntactic information. Ex. verbs in present tense typically start with “>” (أ), “n” (ن), “y” (ي), or “t” (ت)
sukun word foreign NEs: ex. jwn (جون – John) The CE of certain words is strictly sukun. We built a list from the training set
named entities NEs Named entities are more likely to have sukun as CE. We extracted the named entity list from the Farasa named entity recognizer (darwish2013named, darwish2014simple)
Table 1: Features with examples and motivation.
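As an illustration of the surface-level features in Table 1, a minimal sketch (feature names are ours) that extracts the leading/trailing character uni- and bigrams of a word and its stem:

```python
def char_ngram_features(word: str, stem: str) -> dict:
    """Head/tail character uni- and bigrams of the word and its stem;
    e.g. present-tense verb prefixes or feminine suffixes surface here."""
    feats = {}
    for label, s in (("word", word), ("stem", stem)):
        feats[label + "_head1"] = s[:1]
        feats[label + "_head2"] = s[:2]
        feats[label + "_tail1"] = s[-1:]
        feats[label + "_tail2"] = s[-2:]
    return feats

# The Table 1 example word "w+b+mktb+t+nA" (surface "wbmktbtnA", stem "mktb"):
print(char_ngram_features("wbmktbtnA", "mktb"))
```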

3.3.2 DNN Model

Figure 2 shows the architecture of our DNN algorithm. Every input word had an associated list of features, and we trained randomly initialized embeddings of size 100 for each feature. Then, we concatenated the feature embedding vectors, creating a single vector for each word. We fed these vectors into a biLSTM layer of 100 dimensions after applying a dropout of 75%, where dropout acts as a regularizer to avoid overfitting (hinton2012improving). We conducted side experiments with lower dropout rates, but the higher dropout rate worked best. The output of the biLSTM layer was fed into a 100-dimensional dense layer with 15% dropout and softmax activation. We conducted side experiments where we added additional biLSTM layers and replaced the softmax with a conditional random field layer, but we did not observe improvements. Thus, we opted for the simpler model. We used a validation set to determine optimal parameters such as the dropout rate. Again, we used the Adamax optimizer with categorical cross-entropy loss and a learning rate of 0.001. We also applied early stopping with a patience of up to 5 consecutive epochs without improvement.
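This word-level model might be sketched in Keras as follows (the maximum sentence length, the size of the case-ending label set, and the vocabulary sizes are our assumptions; the embedding size, dropout rates, biLSTM width, loss, and optimizer follow the text):

```python
from tensorflow import keras
from tensorflow.keras import layers

MAX_WORDS = 200   # assumed maximum sentence length in words
N_CASES = 15      # assumed label set: 14 diacritic combinations + virtual "#"

def build_ce_model(feature_vocabs, emb_dim=100, lstm_units=100):
    """Word-level biLSTM for case endings: one 100-dim embedding per feature,
    75% dropout, biLSTM(100), 15% dropout, then a softmax output layer."""
    inputs, embedded = [], []
    for name, vocab_size in feature_vocabs.items():
        inp = layers.Input(shape=(MAX_WORDS,), name=name)
        inputs.append(inp)
        embedded.append(layers.Embedding(vocab_size, emb_dim)(inp))
    x = layers.Concatenate()(embedded)
    x = layers.Dropout(0.75)(x)
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
    x = layers.Dropout(0.15)(x)
    out = layers.Dense(N_CASES, activation="softmax")(x)
    model = keras.Model(inputs, out)
    model.compile(optimizer=keras.optimizers.Adamax(learning_rate=0.001),
                  loss="categorical_crossentropy")
    return model

# A few of the Table 1 features with toy vocabulary sizes:
model = build_ce_model({"word": 50000, "word_POS": 300, "stem": 30000})
```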

Figure 2: DNN case ending model architecture

4 Experiments and Results

4.1 Core Word

4.1.1 Experimental Setup

For all the experiments conducted herein, we used the Keras toolkit with a TensorFlow backend (tensorflow2015-whitepaper). We used the entirety of the training set as input, and we instructed Keras to use 5% of the data for tuning (validation). We included the CASE feature, which specifies whether a letter accepts a normal diacritic or a case ending, in all our setups. We conducted multiple experiments using different features, namely:

  • CHAR: This is our baseline setup where we only used the characters as features.

  • CHAR+SEG: This takes the characters and their segmentation information as features.

  • CHAR+PRIOR: This takes the characters and the observed diacritized forms in the training set.

  • All: This setup includes all the features.

We also optionally employed post correction. For words that were seen in training, if the model produced a diacritized form that was not seen in the training data, we assumed it was an error and replaced it with the most frequently observed diacritized form (using a unigram language model). We report two error rates, namely WER (at word level) and DER (at character level). We used relaxed scoring where we assumed an empty case to be equivalent to sukun, and we removed default diacritics – a fatHa followed by an alef, a kasra followed by a ya, and a damma followed by a waw. Such scoring allows us to compare to other systems in the literature that may use different diacritization conventions.
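The unigram post-correction step above can be sketched as follows (the data structures and function name are ours):

```python
from collections import Counter

def post_correct(word: str, guess: str, unigram_lm: dict) -> str:
    """If the undiacritized word was seen in training but the model's guess
    was not among its observed diacritizations, back off to the word's most
    frequent diacritized form; otherwise keep the model's guess."""
    forms = unigram_lm.get(word)
    if forms and guess not in forms:
        return forms.most_common(1)[0][0]
    return guess

# Toy unigram LM: counts of diacritized forms per undiacritized word.
lm = {"ktAb": Counter({"kitaAb": 120, "kut~aAb": 30})}
print(post_correct("ktAb", "kutaAb", lm))    # unseen form -> kitaAb
print(post_correct("ktAb", "kut~aAb", lm))   # observed form kept
print(post_correct("mhnd", "muhan~ad", lm))  # OOV word: guess kept as-is
```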

4.1.2 Results and Error analysis

For testing, we used the aforementioned WikiNews dataset to test the MSA diacritizer and the held-out 5,000 sentences for CA. Table 2 shows WER and DER results using different features with and without post correction.

Setup | MSA WER DER | MSA+corr WER DER | CA WER DER | CA+corr WER DER
CHAR | 3.5 1.1 | 3.3 1.0 | 5.1 2.1 | 2.7 1.0
CHAR+SEG | 3.3 1.1 | 3.2 1.0 | 4.7 1.9 | 2.6 1.0
CHAR+PRIOR | 3.8 1.2 | 3.7 1.1 | 3.8 1.6 | 2.3 0.9
ALL | 3.0 1.0 | 2.9 0.9 | 3.6 1.5 | 2.2 0.9
Table 2: Core word diacritization results (WER/DER in %, with and without unigram post correction)
MSA Results:

For MSA, though the CHAR+PRIOR feature led to worse results than using CHAR alone, the results show that combining all the features achieved the best results. Moreover, post correction improved results overall. We compare our results to five other systems, namely Farasa (darwish2017arabic), MADAMIRA (pasha2014madamira), RDI (rashwan2015deep), MIT (belinkov2015arabic), and Microsoft ATKS (microsoft2013diac). Table 7 compares our system with the aforementioned systems. As the results show, our system beats the current state-of-the-art.

For error analysis, we analyzed all 527 errors. The error types along with examples of each are shown in Table 3. The most prominent error type arises from the selection of a valid diacritized form that does not match the context (40.8%). Perhaps including POS tags as a feature or augmenting the PRIOR feature with POS tag information and a bigram language model may reduce the error rate further. The second most common error is due to transliterated foreign words including foreign named entities (23.5%); such words were not observed during training. Further, Arabic named entities account for 10.6% of the errors, where they were either not seen in training or share identical non-diacritized forms with other words. Perhaps building larger gazetteers of diacritized named entities may resolve NE-related errors. In 10.8% of the cases, the diacritizer produced completely invalid diacritized forms. In some cases (9.1%), though the diacritizer produced a form that differs from the reference, both forms were in fact correct. Most of these cases were due to variations in diacritization conventions (ex. a “bare alef” (A) at the start of a word receiving a diacritic or not). Other cases include foreign words and some words where both diacritized forms are equally valid.

Error Freq. % Explanation Examples
Wrong selection 215 40.8 Homographs with different diacritized forms “qaSor” (قَصْر – palace) vs. “qaSar” (قَصَر – he limited)
Foreign word 124 23.5 Transliterated words, including 96 foreign named entities “wiykiymaAnoyaA” (وِيكِيمَانْيَا – Wikimania)
Invalid diacritized form 57 10.8 Invalid form “ya*okur” (يَذْكُر – he mentions) vs. “ya*okar” (يَذْكَر)
Named entity 56 10.6 Arabic named entities “Eab~aAdiy” (عَبَّادِي – name) vs. “EibAdiy” (عِبَادِي – my servants)
Both correct 48 9.1 Some words have multiple valid diacritized forms “wikAlap” (وِكَالَة) and “wakAlap” (وَكَالَة – agency)
Affix diacritization error 16 3.0 Some suffixes are erroneously diacritized “baAkt$Afihim” (بَاكتشافِهِم – with their discovery)
Reference is wrong 10 1.9 The reference diacritics were incorrect “AlofiyfaA” (الْفِيفَا – FIFA) vs. “AlofayofaA” (الْفَيْفَا)
Dialectal word 1 0.2 Dialectal word “mawaAyiliy” (مَوَايِلِي – my chant)
Table 3: Error analysis: Core word error types for MSA
Error Count % Most Common Causes
a u 133 19.3 POS error: ex. “ka$afa” (كَشَفَ – he exposed) vs. “ka$ofu” (كَشْفُ – exposure) & Subject vs. object: ex. “tuwHy mivolu” (تُوحِي مِثْلُ – such indicates) vs. “tuwHy mivola” (تُوحِي مِثْلَ – she indicates such)
i a 130 18.9 Incorrect attachment (due to coordinating conjunction or distant attachment): ex. “Alogaza Alomusay~ila lilidumuEi wa+AlraSaSi” vs. “wa+AlraSaSa” (الغَازَ الْمُسَيِّلَ لِلدُمُوعِ والرَصَاص – tear gas and bullets), where “bullets” was attached incorrectly to “tear” instead of “gas” & Indeclinability such as foreign words and feminine names: ex. “kaAnuwni” (كَانُونِ – Syriac month name) vs. “kaAnuwna” (كَانُونَ)
i u 95 13.8 POS error of previous word: ex. “tadahowuru waDoEihi” (تَدَهْوُرُ وَضْعِهِ – deterioration of his situation; situation is part of an idafa construct) vs. “tadahowara waDoEuhu” (تَدَهْوَرَ وَضْعُهُ – his situation deteriorated; situation is the subject) & Incorrect attachment (due to coordinating conjunction or distant attachment): (as in i a)
a o 60 8.7 Foreign named entities: ex. “siyraAloyuna” (سِيرَالْيُونَ – Sierra Leone) vs. “siyraAloyuno” (سِيرَالْيُونْ)
i K 27 4.0 Incorrect idafa: “liAt~ifaqi ha*aA AlousobuwE” (لِاتِّفَاقِ هَذَا الأُسْبُوع – this week’s agreement) vs. “liAt~ifaqK ha*aA AlousobuwE” (لِاتِّفَاقٍ هَذَا الْأُسْبُوع – to an agreement this week)
K N 29 4.2 Subject vs. object (as in a u) and incorrect attachment (as in i a)
F N 25 3.7 Words ending with the feminine marker “p” or “At”: ex. “muHaADarap” (مُحَاضَرَة – lecture)
i o 22 3.2 Foreign named entities (as in a o)
F a 16 2.3 Incorrect idafa (as in i K)
u o 14 2.0 Foreign named entities (as in a o)
F K 9 1.3 Words ending with the feminine marker (as in F N)
K a 8 1.2 Incorrect idafa (as in i K)
Table 4: MSA case errors accounting for more than 1% of errors
Error Count % Most Common Causes
a u 2,907 28.4 Subject vs. object: ex. “wafaqa yawoma” (وَفَقَ يَوْمَ – he matches the day) vs. “wafaqa yawomu” (وَفَقَ يَوْمُ – the day matches) & False subject (the object behaves like a subject in the passive voice): ex. “yufar~iqu qaDaA’a” (يُفَرِّقُ الْقَضَاءَ – he separates the make-up) vs. “yufar~aqu qaDaA’u” (يُفَرَّقُ الْقَضَاءُ – the make-up is separated) & Incorrect attachment (due to coordinating conjunction): ex. “fa>aEohada” (فَأَعْهَدَ – so I entrust) vs. “fa>aEohidu” (فَأَعْهِدُ)
i u 1,316 12.9 Incorrect attachment (due to coordinating conjunctions or distant attachment): (as in a u)
i a 1,019 10.0 Incorrect attachment (as in a u) & Indeclinability such as foreign words and feminine names: ex. “>ajoyaAdiyni” (أَجْيَادِينِ – Ajyadeen (city name)) vs. “>ajoyaAdiyna” (أَجْيَادِينَ)
a # 480 4.7 Problem with the reference, where the case for some words, particularly non-Arabic names, is not provided: ex. “<isoHaAq” (إِسْحَاق – Isaac) vs. “<isoHaAqa” (إِسْحَاقَ)
u # 426 4.2 Same problems as in a #
K i 371 3.6 Incorrect idafa: ex. “EaTaA’i Alofaqiyh” (عَطَاءِ الْفَقِيه – the providence of the jurist) vs. “EaTaA’K Alofaqiyh” (عَطَاءٍ الْفَقِيه – Ataa the jurist)
K a 328 3.2 Words ending with the feminine marker: ex. “tayomiyap” (تَيْمِيَة – Taymiya) & Indeclinability: ex. “wabi<i$obiyliy~ap” (وَبِإِشْبِيلِيَّة – and in Seville)
u o 300 2.9 Confusion between past, present, and imperative moods of verbs and preceding markers (imperative “laA” vs. negation “laA”): ex. “laA tano$ariHu” (لا تَنْشَرِحُ – does not open up) vs. “laA tano$ariHo” (لا تَنْشَرِحْ – do not open up)
a o 278 2.7 Confusion between past, present, and imperative moods of verbs (as in u o)
K N 253 2.5 Incorrect attachment (as in i u)
N u 254 2.5 Incorrect idafa (as in K i)
F N 235 2.3 Words ending with the feminine marker (as in K a)
i o 195 1.9 Differing conventions concerning the handling of two consecutive letters with sukun: ex. “Eano Aboni” (عَنْ ابْنِ – on the authority of the son of) vs. “Eani Aboni” (عَنِ ابْنِ)
i # 178 1.7 Same errors as for a #
o # 143 1.4 Same errors as for a #
Table 5: CA case errors accounting for more than 1% of errors
CA Results:

For CA, CHAR+SEG and CHAR+PRIOR performed better than using characters alone, with CHAR+PRIOR performing better than CHAR+SEG. As in the case of MSA, combining all the features led to the best results. Post correction had a significantly larger positive impact on results compared to what we observed for MSA, which may indicate that we need a larger training set. The best WER that we achieved for CW diacritics with post correction is 2.2%. Since we did not have access to any publicly available system that is tuned for CA, we compared our best CA system against our best MSA system applied to the CA test set; the MSA diacritizer produced significantly worse results, with a WER of 8.5% (see Table 7). This highlights the large difference between MSA and CA and the need for systems that are specifically tuned for each.

We randomly selected and analyzed 500 errors (5.2% of all errors). The error types along with examples of each are shown in Table 6. The two most common errors involve the system producing completely invalid diacritized forms (38.8%) or valid forms that do not match the context (31.4%). The relatively higher percentage of completely invalid guesses, compared to MSA, may point to the higher lexical diversity of Classical Arabic. As with MSA, we suspect that adding POS information and employing a word bigram LM to constrain the PRIOR feature may help reduce selection errors. Another prominent error is related to the diacritics that appear on attached suffixes, particularly pronouns, which depend on the choice of case ending (13.2%). Errors due to named entities are slightly fewer than those seen for MSA (8.8%). A noticeable number of mismatches between the guess and the reference are due to partial diacritization of the reference (4.4%). We plan to conduct an extra round of checks on the test set.

Error Freq. % Explanation Examples
Invalid diacritized form 195 38.8 System produced an invalid form “aqosaAm” (أَقْسَام – portions) vs. “aqasaAm” (أَقَسَام)
Wrong selection 157 31.4 Homographs with different diacritized forms “rafoE” (رَفْع – lifting) vs. “rafaE” (رَفَع – he lifted)
Affix diacritization error 66 13.2 Some affixes are erroneously diacritized “baladhu” (بَلَدهُ – his country, where country is the subject of a verb) vs. “baladhi” (بَلَدهِ – his country, where country is the object of a preposition)
Named entities 44 8.8 Named entities “Alr~ayob” (الرَّيْب – Arrayb) vs. “Alr~iyab” (الرِّيَب)
Problems with reference 22 4.4 Some words in the reference were partially diacritized “nuEoTaY” (نُعْطَى – we are given) vs. “nETY” (نعطى)
Guess has no diacritics 9 1.8 System did not produce any diacritics “mhnd” (مهند – sword) vs. “muhan~ad” (مُهَنَّد)
Different valid forms 7 1.4 Some words have multiple valid diacritized forms “maA}op” (مَائْة – hundred) and “miA}op” (مِائَة)
Misspelled word 1 0.2 “lbAlmsjd” (لبالمسجد) vs. “bAlmsjd” (بالمسجد – in the mosque)
Table 6: Error analysis: Core word error types for CA
System WER% DER%
MSA:
Our system 2.9 0.9
RDI (rashwan2015deep) 3.0 1.0
Farasa 3.3 1.1
Microsoft ATKS 5.7 2.0
MADAMIRA 6.7 1.9
MIT (belinkov2015arabic) 14.9 3.9
CA:
Our system 2.2 0.9
Our best MSA system on CA 8.5 3.7
Table 7: Comparing our system to state-of-the-art systems – core word diacritics
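The WER and DER metrics reported in Table 7 can be computed by comparing predicted and reference diacritized words. The following is a minimal sketch of such an evaluation; the function and data representation are our own illustration, not the paper's actual evaluation code:

```python
def wer_der(references, predictions):
    """Word error rate (WER): fraction of words whose full diacritized
    form differs from the reference. Diacritic error rate (DER):
    fraction of letters carrying a wrong diacritic.
    Each word is represented as a list of (letter, diacritic) pairs."""
    word_errs = char_errs = n_words = n_chars = 0
    for ref, pred in zip(references, predictions):
        n_words += 1
        if ref != pred:
            word_errs += 1
        for (_, ref_d), (_, pred_d) in zip(ref, pred):
            n_chars += 1
            if ref_d != pred_d:
                char_errs += 1
    return 100.0 * word_errs / n_words, 100.0 * char_errs / n_chars

# toy example in Buckwalter-style transliteration: one wrong diacritic
ref  = [[("k", "a"), ("t", "a"), ("b", "a")]]   # kataba
pred = [[("k", "a"), ("t", "u"), ("b", "a")]]   # one letter mis-diacritized
wer, der = wer_der(ref, pred)                   # WER 100.0, DER ~33.3
```

Note that a single wrong diacritic makes the whole word count as a WER error, which is why WER is always at least as high as DER.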

4.2 Case Ending

4.2.1 Experimental Setup

We conducted multiple experiments to determine the relative effect of the different features as follows:

  • word: This is our baseline setup, which uses word surface forms only.

  • word-surface: This setup uses the word surface forms, stems, prefixes, and suffixes (including noun suffixes). This simulates the case when no POS tagging information is available.

  • word-POS: This includes the word surface form and POS information, including gender and number of stems, prefixes, and suffixes.

  • word-morph: This includes words and their stem templates to capture morphological patterns.

  • word-surface-POS-morph: This setup uses all the features (surface, POS, and morphological).

  • all-misc: This uses all features plus word and stem leading and trailing character unigrams and bigrams in addition to sukun words and named entities.
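The feature setups above can be thought of as progressively richer per-word feature bundles fed to the biLSTM. The sketch below illustrates one plausible way to assemble them; all field names and the function signature are our own illustration, not the paper's implementation:

```python
def ce_features(word, stem, prefix, suffix, pos, gender, number,
                template, is_sukun_word, is_named_entity):
    """Assemble per-word features for case-ending prediction,
    mirroring the experimental setups (word, surface, POS, morph, misc)."""
    return {
        "word": word,                       # word (baseline)
        "stem": stem,                       # word-surface
        "prefix": prefix,
        "suffix": suffix,
        "pos": pos,                         # word-POS
        "gender": gender,
        "number": number,
        "template": template,               # word-morph (stem template)
        # all-misc: leading/trailing character uni- and bigrams
        "head1": word[:1], "head2": word[:2],
        "tail1": word[-1:], "tail2": word[-2:],
        "sukun_word": is_sukun_word,        # membership in sukun-word list
        "named_entity": is_named_entity,    # membership in NE list
    }

# e.g., for the word ktAb ("book") in Buckwalter transliteration
feats = ce_features("ktAb", "ktAb", "", "", "NOUN", "m", "sg",
                    "fiEaAl", False, False)
```

In practice each field would be mapped to an embedding or one-hot vector before being concatenated into the network input.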

For testing MSA, we used the aforementioned WikiNews dataset. Again, we compared our results to five other systems, namely Farasa (darwish2017arabic), MADAMIRA (pasha2014madamira), RDI (Rashwan et al., 2015), MIT (Belinkov and Glass, 2015), and Microsoft ATKS (microsoft2013diac). For CA testing, we used the 5,000 sentences that we set aside. Again, we compared to our best MSA system.

4.2.2 Results and Error Analysis

Table 8 lists the results of our setups compared to other systems.

MSA Results:

As the results show, our baseline DNN system outperforms all state-of-the-art systems. Further, adding more features yielded better results overall. Surface-level features resulted in the largest gain, followed by POS tags, and lastly stem templates. Adding head and tail characters along with the lists of sukun words and named entities led to further improvement. Our proposed feature-rich system has a CEER that is approximately 61% lower than that of the best state-of-the-art system (3.7% vs. 9.5%).

Figure 3 shows the CE distribution and prediction accuracy. For the four basic markers kasra, fatHa, damma, and sukun, which account for 27%, 14%, 9%, and 10% of case endings respectively, the system has a CEER of about 1% for each. Detecting the virtual CE mark is a fairly easy task. All other CE markers, which represent 13% of the cases, have almost negligible errors.
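A per-marker breakdown such as the one in Figure 3 can be tallied from parallel gold and predicted CE label sequences. The sketch below is our own illustration of that tally, using one-character labels for the markers:

```python
from collections import Counter

def ce_breakdown(gold_ce, pred_ce):
    """Per-marker distribution and error rate for case endings.
    gold_ce/pred_ce are parallel lists of CE labels, e.g.
    'a' (fatHa), 'i' (kasra), 'u' (damma), 'o' (sukun), '#' (virtual)."""
    dist = Counter(gold_ce)                               # marker frequencies
    errs = Counter(g for g, p in zip(gold_ce, pred_ce) if g != p)
    total = len(gold_ce)
    return {m: {"share": dist[m] / total,                 # % of all CEs
                "err_rate": errs[m] / dist[m]}            # per-marker CEER
            for m in dist}
```

Since `Counter` returns 0 for unseen keys, markers with no errors simply get an error rate of 0.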

Table 4 provides a thorough breakdown of all errors accounting for at least 1% of the errors, along with the most common causes of each and examples illustrating them. For example, the most common error type involves guessing a fatHa (a) instead of a damma (u) or vice versa (19.3%). Based on inspecting the errors, the most common causes of this error type were POS errors (ex. a word is tagged as a verb instead of a noun) and a noun being treated as a subject instead of an object or vice versa. The table details the rest of the error types. Overall, some of the errors are potentially fixable using better POS tagging, improved detection of non-Arabized foreign names, and detection of indeclinability. However, other errors are more difficult and require greater understanding of semantics, such as improper attachment, incorrect idafa, and confusion between subject and object. Perhaps such semantic errors can be resolved using parsing.

Setup CEER%
MSA:
word (baseline) 9.1
word-surface 5.7
word-POS 7.0
word-morph 7.6
word-surface-POS-morph 5.2
all-misc 3.7
Microsoft ATKS 9.5
Farasa 10.4
RDI (rashwan2015deep) 14.0
MIT (belinkov2015arabic) 15.3
MADAMIRA (pasha2014madamira) 15.9
CA:
word (baseline) 4.0
word-surface 3.3
word-POS 3.1
word-morph 3.7
word-surface-POS-morph 2.9
all-misc 2.5
Our best MSA system on CA 8.9
Table 8: MSA and CA case-ending results and comparison to other systems
CA Results:

The results show that the POS tagging features led to the largest improvement, followed by the surface features. Combining all the features led to the best results, with a CEER of 2.5%. As we saw for CW diacritics, using our best MSA system to diacritize CA led to significantly worse results, with a CEER of 8.9%.

Figure 4 shows the CE distribution and prediction accuracy. For the four basic markers fatHa, kasra, damma, and sukun, which account for 18%, 14%, 13%, and 8% of case endings respectively, the system has a CEER of about 0.5% for each. Again, detecting the virtual CE mark was a fairly easy task. All other CE markers, which represent 20% of the cases, have negligible errors.

Table 5 lists all the error types accounting for at least 1% of the errors, along with their most common causes and explanatory examples. The error types are similar to those observed for MSA. Some errors are more syntactic and morphological in nature and can be addressed using better POS tagging and identification of indeclinability, particularly as they relate to named entities and nouns with feminine markers. Other errors, such as incorrect attachment, incorrect idafa, false subject, and confusion between subject and object, can perhaps benefit from the use of parsing. As with the core-word errors for CA, the reference has some errors (ex. the {a,i,o} vs. # cases), and extra rounds of reviews of the reference are in order.

4.3 Full Diacritization Results

Table 9 compares the full word diacritization (CW+CE) of our best setup to other systems in the literature. As the results show for MSA, our overall diacritization WER is 6.0%, while the best state-of-the-art system has a WER of 12.2%. As for CA, our best system produced an error rate of 4.3%, which is significantly better than using our best MSA system to diacritize CA (14.7%).
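Full-word diacritization combines the two predictions: a word counts as an error if either its core-word diacritics or its case ending is wrong. The sketch below illustrates this combination; the representation of a word as a (core, case-ending) pair is our own simplification:

```python
def full_word_error_rate(gold, pred):
    """Full diacritization WER over (core_diacritics, case_ending) pairs.
    A word is an error if its CW form OR its CE differs from the gold,
    so the full WER can approach, but not exceed, CWER + CEER
    (words with both a CW and a CE error are counted once)."""
    errors = sum(1 for g, p in zip(gold, pred) if g != p)
    return 100.0 * errors / len(gold)

# toy example: one of two words has a wrong case ending
gold = [("kataba", "a"), ("kitaAb", "u")]
pred = [("kataba", "i"), ("kitaAb", "u")]
rate = full_word_error_rate(gold, pred)   # 50.0
```

This single-counting of doubly wrong words is consistent with the reported MSA numbers, where the full WER (6.0%) is slightly below the sum of CWER (2.9%) and CEER (3.7%).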

Setup WER%
MSA:
Our system 6.0
Microsoft ATKS 12.2
Farasa 12.8
RDI (rashwan2015deep) 16.0
MADAMIRA (pasha2014madamira) 19.0
MIT (belinkov2015arabic) 30.5
CA:
Our system 4.3
Our best MSA system on CA 14.7
Table 9: Comparison to other systems for full diacritization
Figure 3: Case endings distribution and prediction accuracy for MSA
Figure 4: Case endings distribution and prediction accuracy for CA

5 Conclusion and Future Work

In this paper, we presented a feature-rich DNN approach for recovering MSA CW and CE diacritics that produces a word-level error rate of 6.0% for MSA, which is more than 50% lower than state-of-the-art systems (6.0% compared to 12.2%), and a word error rate of 4.3% for CA. Specifically, we used a biLSTM-based model with a variety of surface, morphological, and syntactic features. Reliable NLP tools may be required to generate some of these features, and such tools may not be readily available for other language varieties, such as dialectal Arabic. However, we showed the efficacy of different varieties of features, such as surface-level features, and they can individually help improve diacritization. Further, though some errors may be overcome using improved NLP tools (ex. better POS tagging), semantic errors, such as incorrect attachment, are more difficult to fix. Perhaps using dependency parsing may help overcome some semantic errors. As for feature engineering, while the broad categories of features, such as surface, syntactic, and morphological features, will likely carry over to other languages, language-specific feature engineering may be required to handle the specificity of each language. Lastly, since multiple diacritization conventions may exist, as in the case of Arabic, adopting one convention consistently is important for training a good system and for properly testing it. Though we have mostly achieved this for MSA, the CA dataset requires more checks to ensure greater consistency.

For future work, we want to explore the effectiveness of augmenting our CW model with POS tagging information and a bigram language model. Further, we plan to create a multi-reference diacritization test set to handle words that have multiple valid diacritized forms. For CE, we want to examine the effectiveness of the proposed features for Arabic parsing. We also plan to explore: character-level convolutional neural networks that may capture sub-word morphological features; pre-trained embeddings; and attention mechanisms that focus on salient features. Finally, we plan to explore joint modeling of core word and case ending diacritics.