A Rule-based Kurdish Text Transliteration System

11/26/2018 ∙ by Sina Ahmadi, et al.

In this article, we present a rule-based approach for transliterating between the two most widely used orthographies in Sorani Kurdish. Our work consists of detecting a character in a word by removing the possible ambiguities and mapping it into the target orthography. We describe different challenges in Kurdish text mining and propose novel ideas concerning the transliteration task for Sorani Kurdish. Our transliteration system, named Wergor, achieves 82.79% overall precision and more than 99% precision in detecting the double-usage characters. We also present a manually transliterated corpus for Kurdish.





Kurdish is an Indo-European language with a majority of speakers in the Kurdish regions of Iran, Iraq, Turkey and Syria. Although it is spoken by 20 to 30 million people (Kreyenbroek2005; hassani2016automatic), Kurdish is considered a less-resourced language. In 2016, Google added 13 new languages to its online automated translation tool, Google Translate, among them Kurdish (for the time being, only the Kurmanji dialect). One of the main reasons for this delay, in comparison to some other languages with fewer speakers for which the same service was provided earlier, is the lack of parallel corpora, online resources and language processing tools (BESACIER201485).

Regarding the area and the extent to which Kurdish orthographies are applied, it must be acknowledged that uniformity in writing Kurdish has not yet been achieved. The difference between the orthographies naturally divides the textual sources that are produced, widens the gap between the dialects and thus scatters readers. Although the Kurdish Academy of Language introduced the Unified Kurdish Alphabet, Yekgirtú, in response to this problem (WinNT3orth), no standard orthography has been widely accepted, given all the challenges and the diversity of the dialects. Aware of this problem, Kurdish intellectuals have emphasized the unification of the orthographies (hassanpour1992nationalism).

In this article, we focus on the challenges of transliterating between the two most widely used orthographies, Arabic-based and Latin-based, for Sorani Kurdish. Transliteration is a mapping from one system of writing into another, typically grapheme to grapheme (Knight:1998:MT:972764.972767). Given a word in a source orthography, a transliteration task consists of mapping each character of the word to an equivalent character in the target orthography, which yields the transliterated word. This correspondence is not always straightforward. In the case of Sorani Kurdish, the Latin-based and the Arabic-based orthographies are not completely identical in how characters are represented. Although the problem of normalization in Kurdish has already been addressed as a partial task in previous studies such as (esmaili2012challenges), (esmaili2014towards) and (aliabadi2014towards), no solution has been proposed for the transliteration task so far. For instance, in a recent work by Hassani (hassani2017kurdish), transliteration is mentioned implicitly as one of the tasks, but no concrete details are reported.

Transliteration is one of the fundamental elements in many NLP applications such as statistical machine translation, terminology extraction, cross-lingual data linking and so forth. Transliteration can be done with phoneme-based or grapheme-based models, of which the latter has been shown to perform better than the former (al2002machine). Kashani et al. (kashani2007automatic) and Al-Onaizan and Knight (al2002machine) use the grapheme-based model, and Stalls and Knight (stalls1998translating) and Pervouchine et al. (pervouchine2009transliteration) use the phoneme-based approach. Since few languages have manually labelled transliteration pairs (a word and its transliteration), some studies such as (sajjad2017statistical; sajjad2011comparing; noeman2010language) have focused on transliteration mining, which consists of automatically extracting transliteration pairs from a noisy list of transliteration candidates.

The rest of the paper is organized as follows. First, we describe the Kurdish writing systems in section 1. In section 2 we focus on the challenges of Sorani Kurdish transliteration in the Arabic-based (also referred to as ”Persian-Arabic”) and Latin-based orthographies. In section 3 we present the rule-based techniques used in Wergor (”Wergor”, pronounced as ”wargor”, is composed of ”wer”, a Kurdish prefix related to transformation, and ”gor”, the stem of ”goran” meaning to change; we coined this word for ”transliterator”, similar to the Kurdish word ”wergêr” meaning translator). This section includes our rule-based methods to solve the present challenges. Section 4 is devoted to the tests and experiments on the algorithms; there we also describe our manually transliterated data set. Finally, section 5 concludes our work and proposes some ideas for future work.

1. Kurdish Writing Systems

Nowadays Kurdish is written in several orthographies adopted from other languages and adapted to it (WinNT1orth). Although the debate on which orthography to use continues, the Latin-based orthography (henceforth LbO) and the Arabic-based orthography (henceforth AbO) are among the most popular ones; they are mostly used for the Kurmanji dialect and the Sorani dialect of Kurdish, respectively. In addition to these two main dialects, Hawrami and Kalhor are also written in the AbO. These orthographies are based on the phonetics of the language (WinNT2hist).

In order to provide a common description of Kurdish orthographies and avoid the inconsistent descriptions found mainly in (wahbi; hejar; w.m.thackstonSo2006; thackston2), we have used the description in (celadet2) for the LbO and the characters presented in (blau) for the AbO. Although some of the characters may have other usages in other descriptions, these two references are the most widely known among Kurdish writers. Table 1 shows the characters in these orthographies in comparison to one another. Where a character does not exist for a given phoneme, the cell is coloured in grey. We encourage future researchers to use the selected Latin-based orthography as it does not have any ambiguity.

In the early stages of development of text processing tools for Kurdish, some fonts were introduced to Kurdish users. Dilan fonts, Ali fonts, Zanest fonts and Rebaz fonts were among the best known. These fonts were mainly based on the Persian and the Arabic keyboards and did not support Unicode. Fortunately, the existing characters in the Kurdish orthographies are completely supported by the Unicode standard. More recently, the Kurditgroup keyboard, which is based on Unicode characters, has been proposed and is widely used by Kurdish users (https://kurditgroup.org/downloads). We have also used this keyboard in our study.

Table 1. Comparison of the Latin-based and the Arabic-based orthographies

2. Kurdish Text Normalization Challenges

For the current Arabic-based and Latin-based orthographies, we can classify the normalization challenges into three categories:

2.1. Characters used to represent more than one phoneme

This is the case of ”و” and ”ی” in the AbO, which may be transliterated respectively as {”w” or ”u”} and {”y” or ”î”} in the LbO. For instance, the word ”هاوین” could have 4 possible transliterations considering the different mappings ”ی” → {”y”, ”î”} and ”و” → {”w”, ”u”}: ”hauîn”, ”hauyn”, ”hawîn”, ”hawyn”, of which ”hawîn” is the correct form. Although the forms of ”ه” used for ”h” and ”e” in the LbO look similar, this character does not belong to the same category as ”و” and ”ی”, since the two forms have different Unicode code points.
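The four candidates listed above are the Cartesian product of the possible readings of each ambiguous character. The following Python sketch illustrates this; the function name `candidate_transliterations`, the partial mapping table and the AbO spelling used for ”hawîn” are assumptions made for illustration and are not taken from Wergor.

```python
from itertools import product

# Readings of the two double-usage characters (consonant first, then vowel).
AMBIGUOUS = {
    "\u0648": ("w", "u"),  # و
    "\u06cc": ("y", "î"),  # ی
}

# Illustrative, partial one-to-one mapping: only the letters needed below.
UNAMBIGUOUS = {
    "\u0647": "h",  # ه
    "\u0627": "a",  # ا
    "\u0646": "n",  # ن
}

def candidate_transliterations(word):
    """Return every Latin string obtained by expanding the ambiguous characters."""
    options = []
    for ch in word:
        if ch in AMBIGUOUS:
            options.append(AMBIGUOUS[ch])
        else:
            # Unmapped characters are passed through unchanged.
            options.append((UNAMBIGUOUS.get(ch, ch),))
    return ["".join(choice) for choice in product(*options)]

# Assumed AbO spelling of "hawîn": ه ا و ی ن
candidates = candidate_transliterations("\u0647\u0627\u0648\u06cc\u0646")
print(candidates)
```

The number of candidates doubles with every ambiguous character in a word, which is why a disambiguation step is needed before the character mapping.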

2.2. Characters with no equivalent in the other orthography

This is the case of ””, ””, ””, ”” and ”” characters in the AbO for which there is no equivalent in the LbO. A specific case, however, is that of Bizroke. Bizroke (which literally means ”the little furtive”) is represented by ”i” in the LbO while it is not written at all in the AbO. For example, the word ”ئاگر” may be transliterated as ”agr”, which is not correct since the Bizroke between ”g” and ”r” cannot be represented in the AbO. The correct form is ”agir”. That said, native speakers pronounce Bizroke while speaking, even though it does not exist in the Arabic-based orthography (mccarus1958kurdish).

2.3. Unicode assignments of the Arabic-based Kurdish alphabet

The potential sources of ambiguity in the assignment of the characters of the current Kurditgroup keyboard are as follows:

  • Some of the Arabic characters are similar in form but have different Unicode code points, e.g. ”ي” (U+064A) instead of ”ی” (U+06CC) for {”î”, ”y”} and ”ك” (U+0643) instead of ”ک” (U+06A9) for ”k” in the LbO.

  • Although ”ه” (U+0647) as ”h” is a connecting character, when placed at the end of a word it seems visually identical to ”ە” (U+06D5), which represents ”e”. For instance, the final ”ه” in ”بەهبەه” (”behbeh”) is not connected to the previous character, which shows that the final ”ه” is ”h”. This is not a source of ambiguity in terms of normalization since the two possible forms have different Unicode code points. Some suggest that ”ه” as ”h” be marked using a zero-width non-joiner character (U+200C) or an en dash (U+2013). Such words ending with the ”h” phoneme are quite rare in Sorani Kurdish.

  • Although ”û” in the LbO is a single character with a unique Unicode code point (U+00FB), the equivalent character ”وو” in the AbO is created by a double ”و”. Using two characters to represent another character is far more problematic than a simple replacement, since in some words the pair is preceded or followed by a similar character. For instance, the double ”و” in words like ”هاووڵاتی” and ”وتووێژ” may be transliterated respectively as ”haûłatî” instead of its correct form ”hawwiłatî” and ”witûej” instead of its correct form ”witûwêj”. In a similar way, some have proposed using ”ll” and ”rr” to represent ”ڵ” and ”ڕ” in the LbO (blau1999manuel). The same problem would therefore arise for such usages.
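The first source of ambiguity above can be removed with a simple character substitution before any other processing. The sketch below uses only the two code point pairs named in the list; the function name `unify_characters` is hypothetical, and a real normalizer would likely cover more characters.

```python
# Arabic code points and the Kurdish code points they are unified to,
# following the pairs listed above.
UNIFY = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

def unify_characters(text):
    """Replace look-alike Arabic characters with their Kurdish counterparts."""
    return text.translate(UNIFY)

# U+0647 (h) and U+06D5 (e) already have distinct code points, so they are left as-is.
print(unify_characters("\u0643\u064a") == "\u06a9\u06cc")
```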

Word | Possible transliterations | Correct form | Challenge category
بیور | bîwr, bywr, bîur | bîwir | {”w”, ”u”}; {”y”, ”î”}; Bizroke, i.e., ”i”, not recognizable
حەپەسان | hepesan | ḧepesan | No character for ”ح” in the LbO
بەناوودەنگ | benaûdeng | benawûdeng | Double character for one character
Table 2. Examples of different challenging categories in Sorani Kurdish text normalization. Challenging characters, if available, are bolded.

Table 2 shows some words in the AbO with their possible transliterated forms in the LbO, the correct form of each word based on the reference orthography, and the challenge category. Note that the possible transliterations are not necessarily correct, since they only represent the possible mappings of the characters of one orthography to the other.

3. Wergor system

Figure 1 illustrates the architecture of the Wergor transliteration system. The system normalizes a given text by preprocessing and unifying the different forms of a character discussed in 2.3. In this stage, Wergor determines the corresponding characters of the double-usage characters such as ”و” and ”ی” and detects the possible presence of Bizroke in the AbO. Finally, the characters are mapped to the characters of the other orthography. According to this architecture, the system transliterates ”بزگوڕ” from the AbO into ”bizguř” in the LbO by detecting the correct equivalent of ”و” as ”u” and the correct position of Bizroke.

Figure 1. Wergor system architecture: text in the source orthography → convert to Unicode UTF-8 → detection of double-usage characters → detection of Bizroke → normalized text → character mapping → text in the target orthography in Unicode.

Our method to solve the aforementioned challenges in Sorani Kurdish text processing follows rules based on the phonological characteristics and the writing tradition of the language. Some of the essential rules, based on (w.m.thackstonSo2006), that are applied in Wergor are as follows:

  • If a word begins with a vowel, i.e., { ””, ””, ””, ””, ””, ””, ””}, it is always preceded by ”ئ” in the AbO. This is the only usage of ”ئ” (called Hamza) as an auxiliary character, and it is used only in the AbO.

  • Although ”r” as the first phoneme of every word in Sorani Kurdish is trilled, and thus pronounced ”ř”, traditionally the non-trilled form ”r” is written (w.m.thackstonSo2006). This rule is applied in both orthographies. For instance, ”رۆژ”, ”راوێژ” and ”رێگا” are to be transliterated as ”roj”, ”rawêj” and ”rêga” respectively.

  • No Sorani Kurdish word begins with ”ڵ / ł” (w.m.thackstonSo2006).

  • Since in Sorani Kurdish a word has as many syllables as it has vowels, no two vowels can be in one syllable. Some of the frequent syllable structures in Sorani Kurdish are: V, VC, VCC, CV, CVC, CVCC, where V stands for vowel and C stands for consonant. In no syllable structure is a vowel preceded or followed by another vowel (mccarus1958kurdish).
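The last rule gives a cheap way to reject candidate transliterations: a candidate in which two vowels are adjacent cannot be a valid Sorani form. The following sketch works on LbO strings; the vowel inventory and both function names are assumptions for illustration, and precomposed (NFC) characters are assumed.

```python
# Vowels of the Latin-based orthography, assumed here for illustration.
VOWELS = set("aeêiîouû")

def count_syllables(word):
    """A word has as many syllables as it has vowels."""
    return sum(1 for ch in word.lower() if ch in VOWELS)

def has_adjacent_vowels(word):
    """True if any vowel is directly followed by another vowel."""
    w = word.lower()
    return any(a in VOWELS and b in VOWELS for a, b in zip(w, w[1:]))

for candidate in ("hauîn", "hauyn", "hawîn", "hawyn"):
    print(candidate, count_syllables(candidate), has_adjacent_vowels(candidate))
```

This check alone rejects ”hauîn”, but it does not separate ”hawîn” from ”hawyn” or ”hauyn”, so further rules are still required.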

Input: Word W containing the target char (”و”, ”ی”)
Output: Detected forms of ”و” as ”w” or ”u” and ”ی” as ”y” or ”î” in W.

procedure TargetCharacterDetector(W, TargetChar)
    … ← [”i”, ”î”, ”u”, ”û”, ””, ””, ””, ””]
    target_char_vowel ← the vowel form of TargetChar
    target_char_consonant ← the consonant form of TargetChar
    if … then
        return target_char_consonant
    for index 0 to length do
        if … & … then
            … ← target_char_vowel
        else
            if … then
                if … then
                    … ← target_char_consonant
                else
                    if … then
                        … ← target_char_consonant
                    else
                        if … then
                            if … then
                                … ← target_char_consonant
                            else
                                … ← target_char_vowel
                        else
                            … ← target_char_vowel
    Remove Hamza in …
    return …
Algorithm 1 Detection of ”w/u” and ”y/î” equivalents in the Arabic-based orthography

Using the syllable-structure patterns of Kurdish, we propose Algorithm 1 to detect the double-usage characters ”و” and ”ی”. A character in its single form is considered a consonant by default. The algorithm follows the same procedure for either target character.
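To make the idea concrete, here is a much simplified Python heuristic in the same spirit. It is not Algorithm 1: it only encodes the observation that a vowel is never adjacent to another vowel, plus the default that a word-initial or single target character is a consonant. The function name, the set of unambiguous AbO vowels and the example spellings are all assumptions.

```python
# (consonant reading, vowel reading) of each double-usage character.
AMBIGUOUS = {"\u0648": ("w", "u"), "\u06cc": ("y", "î")}  # و, ی
# AbO letters assumed to be unambiguous vowels: ا ە ێ ۆ
PLAIN_VOWELS = {"\u0627", "\u06d5", "\u06ce", "\u06c6"}

def resolve_double_usage(word):
    """Return the chosen Latin reading of each ambiguous character, in order."""
    decisions = []
    prev_is_vowel = False
    for i, ch in enumerate(word):
        if ch not in AMBIGUOUS:
            prev_is_vowel = ch in PLAIN_VOWELS
            continue
        consonant, vowel = AMBIGUOUS[ch]
        next_is_vowel = i + 1 < len(word) and word[i + 1] in PLAIN_VOWELS
        # Word-initial, or next to a vowel: the character must be a consonant.
        if i == 0 or prev_is_vowel or next_is_vowel:
            choice = consonant
        else:
            choice = vowel
        decisions.append(choice)
        prev_is_vowel = choice == vowel
    return decisions

# Assumed AbO spelling of "hawîn": ه ا و ی ن
print(resolve_double_usage("\u0647\u0627\u0648\u06cc\u0646"))
```

The sketch does not treat a double ”و” as ”û”, so words such as those discussed in 2.3 are outside its scope.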

Although Bizroke (i.e., ”i”) is transliterated from the LbO to the AbO simply by omitting it, finding Bizroke in the inverse direction is challenging. Analyzing syllable structures, the only rule that we could rely on is that in the CVC structure, if positioned as the first syllable, V is always Bizroke, e.g., ”bira”, ”wirya”, except in cases where the second consonant is ”y” or ”w”, e.g., ”kwêr”, ”dyar”. Although Bizroke frequently appears in the same pattern in the last syllables, e.g., ”çirij”, ”kirdin”, we could not use this as a rule.
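One possible reading of this first-syllable rule, applied to a word whose other characters have already been mapped to the LbO, is sketched below. The function name and the vowel inventory are assumptions, and the interpretation (insert ”i” after a word-initial consonant that is followed by another consonant other than ”y” or ”w”) is ours rather than a description of Wergor's code.

```python
VOWELS = set("aeêiîouû")  # assumed LbO vowel inventory

def insert_first_syllable_bizroke(word):
    """Insert Bizroke after the first consonant of a consonant-initial cluster."""
    if len(word) < 2:
        return word
    first, second = word[0], word[1]
    if first in VOWELS or second in VOWELS:
        return word  # the first syllable already has a vowel
    if second in ("y", "w"):
        return word  # exception noted in the text, e.g. "kwêr", "dyar"
    return first + "i" + word[1:]

for raw in ("bra", "wrya", "kwêr", "dyar", "agr"):
    print(raw, "->", insert_first_syllable_bizroke(raw))
```

As the text notes, a rule of this kind cannot recover a Bizroke outside the first syllable, so ”agr” is left unchanged although the correct form is ”agir”.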

4. Experiments

4.1. Data set

Among the 36 top ranked Kurdish websites, including news and media services, we have found only one site that uses the AbO for both the Sorani and Kurmanji dialects (ranking based on Alexa, http://www.alexa.com). 18 websites use only the LbO for Kurmanji and 29 websites use only the AbO for Sorani. We found no Sorani website that uses the LbO.

In order to provide a resource for Kurdish transliteration, we propose the Wergor corpus, which is, to the best of our knowledge, the first transliteration corpus for Kurdish. Our corpus consists of parallel transliterated texts in the two orthographies. This corpus can be used for other tasks in machine translation as well.

4.2. Results and Discussion

Table 3 shows the results of Wergor in transliterating our data set from the AbO to the LbO. The results of the different tests are presented in terms of correct and incorrect transliterations, and the precision of the system is calculated as the percentage of correct transliterations.

Prediction | Bizroke detection | w/u detection | y/î detection | Whole test set
Correct | 721 / 1861 | 2472 / 2480 | 4808 / 4850 | 5779 / 6980
Incorrect | last syllable: 286 / 1140; other syllables: 854 / 1140 | 8 / 2480 | 48 / 4850 | 1201 / 6980
Precision | 38.74% | 99.67% | 99.13% | 82.79%
Table 3. Arabic to Latin transliteration results
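The precision values in Table 3 can be recomputed from the correct and total counts. The short sketch below does so; the dictionary layout and the function name are illustrative, and the percentages are truncated (not rounded) to two decimals, which is the convention that reproduces all four reported figures.

```python
# (correct, total) counts taken from Table 3.
RESULTS = {
    "Bizroke detection": (721, 1861),
    "w/u detection": (2472, 2480),
    "y/î detection": (4808, 4850),
    "whole test set": (5779, 6980),
}

def precision_percent(correct, total):
    """Percentage of correct transliterations, truncated to two decimals."""
    return (correct * 10000 // total) / 100

for name, (correct, total) in RESULTS.items():
    print(f"{name}: {precision_percent(correct, total):.2f}%")
```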

In detecting the possible position of Bizroke, Wergor achieves 38.74% precision and 100% recall. Since the rule that we could apply in the current version of the system for detecting Bizroke only considers first syllables, Wergor is not able to correctly find the position of Bizroke in 1140 of the 1861 cases. In other words, the correct predictions are those words that have only one Bizroke, positioned in the first syllable. Among the incorrect transliterations, Bizroke is in the last syllable in 286 cases and in other syllables in 854 cases.

Evaluating the system on the double-usage characters, i.e., ”و” and ”ی”, shows a high precision of more than 99% and a recall of 100%, since all relevant words were retrieved. Incorrectly transliterated words are mostly non-Kurdish words, e.g., ”Claud”, which are kept in their original form in the manually transliterated data set, and proper nouns such as ”Kurdistan”, which are capitalized in the LbO. The AbO does not have capital letters.

On the other hand, the Wergor system achieves almost 100% precision in transliterating the LbO into the AbO. Since the mapping of the LbO characters into the AbO ones is straightforward, with no challenging characters, this precision is to be expected.
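The LbO to AbO direction can be sketched as a plain character lookup combined with two of the rules stated earlier: Bizroke is omitted, and a word-initial vowel is preceded by Hamza. The mapping table below is a partial subset assumed for illustration, and the function name is hypothetical; it is not Wergor's actual mapping.

```python
# Partial LbO -> AbO mapping, assumed for illustration only.
LATIN_TO_ARABIC = {
    "a": "\u0627", "b": "\u0628", "d": "\u062f", "e": "\u06d5", "ê": "\u06ce",
    "g": "\u06af", "h": "\u0647", "j": "\u0698", "n": "\u0646", "o": "\u06c6",
    "r": "\u0631", "t": "\u062a", "w": "\u0648", "u": "\u0648",
    "û": "\u0648\u0648", "y": "\u06cc", "î": "\u06cc",
    "i": "",  # Bizroke is not written in the AbO
}
VOWELS = set("aeêiîouû")
HAMZA = "\u0626"  # ئ

def latin_to_arabic(word):
    """Map an LbO word to the AbO character by character."""
    word = word.lower()  # the AbO has no capital letters
    mapped = "".join(LATIN_TO_ARABIC[ch] for ch in word)  # KeyError if unmapped
    prefix = HAMZA if word and word[0] in VOWELS else ""
    return prefix + mapped

for latin in ("roj", "agir", "hawîn"):
    print(latin, "->", latin_to_arabic(latin))
```

Because ”w” and ”u” (and likewise ”y” and ”î”) collapse onto the same AbO character, this direction loses information, which is exactly what makes the inverse direction harder.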

Figures A.1 and A.2 in Appendix A show two texts transliterated using Wergor.

5. Conclusions and future work

In this paper, we propose a rule-based technique for Kurdish text transliteration. Kurdish faces various challenges in transliterating between its two popular orthographies, Arabic-based and Latin-based. In this article we described a method to address these challenges using the Wergor transliteration system. Although our system achieves more than 99% precision in transliterating the double-usage characters (”و”, ”ی”), it is less effective in transliterating Bizroke, i.e., ”i”. In order to improve the current results, a bigger transliteration data set is required. We also believe that the phonological aspects of the language, which have not yet been studied sufficiently, can be of help. With the Wergor transliteration data set available, we are currently interested in applying statistical methods to detect Bizroke more effectively.

Our code and corpus are available at https://github.com/sinaahmadi/wergor.


Appendix A

Figure A.1. Transliteration of an example text, in the first row, from the AbO to the output text in the second row in the LbO. The manually transliterated text is shown in the last row. The errors are shown in bold. Both texts are in Sorani Kurdish.
Figure A.2. Transliteration of an example text, in the first row, from the LbO to the output text in the second row in the AbO. The manually transliterated text is in the third row. No errors were found.