Transliterating Kurdish texts in Latin into Persian-Arabic script

10/24/2021
by Hossein Hassani, et al.

Kurdish is written in different scripts. The two most popular scripts are Latin and Persian-Arabic. However, not all Kurdish readers are familiar with both scripts, a problem that automatic transliterators could resolve. So far, the developed tools mostly transliterate Persian-Arabic script into Latin. We present a transliterator that converts Kurdish texts in Latin script into Persian-Arabic script. We also discuss the issues that should be considered in the transliteration process. The tool is a part of Kurdish BLARK, and it is publicly available for non-commercial use.

1 Introduction

Kurdish is a multi-dialect language that is written in different scripts [3]. The two most popular scripts are Latin and Persian-Arabic. However, not all Kurdish readers are familiar with both scripts, a problem that automatic transliterators could resolve. So far, the developed tools mostly transliterate Persian-Arabic script into Latin [4, 1]. We present a transliterator that converts Kurdish texts in Latin script into Persian-Arabic script. Kurdish language processing requires effort from interested researchers and scholars to overcome the scarcity of resources and tools that hinders its tasks. The areas that need attention and the efforts required have been addressed in [4].

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the transliterator and the issues to consider in the transliteration process. Finally, Section 4 concludes the paper and suggests some areas for future work.

2 Related work

Several scholars have addressed transliterators for Kurdish scripts [2, 4, 1]. Those studies mostly focused on transliteration from Persian-Arabic into Latin script. In particular, Ahmadi [1] addressed several important issues in transliterating Kurdish texts from Persian-Arabic into Latin script and proposed and implemented appropriate resolutions for them.

Also, some online tools exist for transliterating Kurdish scripts into each other, such as https://www.lexilogos.com/keyboard/kurdish_conversion.htm and http://www.transliteration.kpr.eu/ku/en.html. Those are mainly based on the standard Latin script, and therefore they miss some of the Persian-Arabic letters, such as ح, ع, and غ. Other issues also exist in those transliterators, particularly when the text is in Kurmanji Kurdish. For example, both mentioned tools transliterate the Kurmanji word “diînine” as “دیننه‌”, while the correct transliteration is “دئیننە”.

Furthermore, to use the mentioned tools, users must copy and paste their texts into the online tool, which not only has limited space but also might be undesirable to users from a copyright perspective. Our transliterator addresses those issues, and its script is publicly available under the GNU license.

3 The Transliterator

Our transliterator is a Python script that can be used standalone. It receives an input text file, which should be saved as UTF-8, and it produces the transliterated version of the text, also in UTF-8. It resolves the issues that we discussed in Section 2. In particular, it works for both Kurmanji and Sorani texts in Latin script.
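The file handling described above can be sketched as follows. This is a minimal illustration and not the tool's actual code: the function name `transliterate_file` and its parameters are hypothetical, and the transliteration function itself is passed in as an argument.

```python
from pathlib import Path


def transliterate_file(src_path, dst_path, transliterate):
    """Read UTF-8 text from src_path, transliterate it, and write UTF-8 to dst_path.

    `transliterate` is any function mapping a Latin-script string to a
    Persian-Arabic-script string (hypothetical here; supplied by the caller).
    """
    # Explicit encodings avoid depending on the platform default.
    text = Path(src_path).read_text(encoding="utf-8")
    Path(dst_path).write_text(transliterate(text), encoding="utf-8")
```

Stating the encoding explicitly on both the read and the write matters because the platform default is not always UTF-8, and both the Latin input (with letters such as î, û, ş) and the Persian-Arabic output fall outside ASCII.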

Figure 1 shows a text in Latin script, and Figure 2 shows its equivalent in Persian-Arabic script as produced by our script.

Figure 1: A Kurmanji text in Latin script. (The text is from: Öpengin, Ergin. 2021. Bazeber: Nivîsarên Mela Se’îd Şemdînanî li ser çand û dîroka Kurdistana navendî [The texts of Mulla Said Shamdinani about the history and culture of Central Kurdistan]. Istanbul: Avesta.)
Figure 2: A Persian-Arabic version of the text in Figure 1 transliterated using our suggested script.

The transliterator is rule-based. We have identified one-to-one, one-to-many, two-to-one, two-to-many, and three-to-many letter mappings, where the “many” side consists of at most three letters. The order in which the transliterator applies those mappings is important. The mapping method and the mapping order were designed based on a study of various texts and writing styles. For example, the case of bizroke, which [1] addressed for transliteration from Persian-Arabic into Latin, must also be considered in transliteration from Latin into Persian-Arabic. That is, “min” must be written as “من”. However, users might still encounter issues when they transliterate using the tool. We would be grateful if users reported any such issues so that we can enhance the script.
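To illustrate why the mapping order matters, the following minimal sketch applies a small rule table with the longest Latin keys tried first. It is not the tool's actual implementation. The names `RULES`, `VOWELS`, `HAMZA`, and `transliterate` are hypothetical, and the table is only an illustrative subset. The digraphs “ll” and “rr” are an assumed writing convention. The rule that inserts ئ before a written vowel at the start of a word or after another vowel is inferred from the “diînine” example in Section 2 and is not a documented rule of the tool.

```python
import unicodedata

# Illustrative subset of Latin -> Persian-Arabic rules (NOT the tool's table).
RULES = {
    "ll": "\u06B5",       # two-to-one: ll -> ڵ (assumed digraph convention)
    "rr": "\u0695",       # two-to-one: rr -> ڕ (assumed digraph convention)
    "û": "\u0648\u0648",  # one-to-many: û -> وو
    "î": "\u06CC",        # î -> ی
    "e": "\u06D5",        # e -> ە
    "a": "\u0627",        # a -> ا
    "i": "",              # bizroke: not written in Persian-Arabic script
    "d": "\u062F",        # d -> د
    "m": "\u0645",        # m -> م
    "n": "\u0646",        # n -> ن
    "l": "\u0644",        # l -> ل
    "r": "\u0631",        # r -> ر
}
VOWELS = set("aeiîû")
HAMZA = "\u0626"  # ئ

# Longest keys first, so that "ll" is tried before "l".
ORDERED_KEYS = sorted(RULES, key=len, reverse=True)


def transliterate(text):
    # Compose characters such as "i" + combining circumflex into "î".
    text = unicodedata.normalize("NFC", text)
    out = []
    pos = 0
    prev = ""  # last Latin character consumed; "" at the start of the text
    while pos < len(text):
        for key in ORDERED_KEYS:
            if text.startswith(key, pos):
                # Assumed rule: a written vowel that starts a word or
                # follows another vowel is preceded by ئ.
                if key in VOWELS and key != "i" and (
                    not prev.isalpha() or prev in VOWELS
                ):
                    out.append(HAMZA)
                out.append(RULES[key])
                pos += len(key)
                prev = key[-1]
                break
        else:
            # No rule matched: copy the character through unchanged.
            out.append(text[pos])
            prev = text[pos]
            pos += 1
    return "".join(out)
```

Under these assumptions, the sketch maps “min” to “من” and “diînine” to “دئیننە”, matching the examples in the text. If the single-letter rule for “l” were tried before “ll”, the digraph would never match, which is one concrete way in which rule order changes the output.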

Users of the transliterator may notice an issue with the full stop at the end of a paragraph when they open the output in editors such as Microsoft Word or LibreOffice Writer. In those editors, the full stop at the end of a paragraph might flip to the left. This happens if the default direction for the input text in the editor is set to LTR. The issue is resolved by using the appropriate command for the editor. For example, in LibreOffice Writer, the user can select the entire text and then press Ctrl+Right Shift.

Also, it is usually necessary to change the font to a Unicode font, for example, UnikurdWeb, to view and print the output correctly.

4 Conclusion

We presented a script that transliterates Kurdish texts in Latin script into Persian-Arabic script. The script can be used standalone, and it is publicly available. It resolves some issues that the available online transliterators have not considered. As it is standalone, it can be used without requiring an Internet connection, and it does not impose size limitations on its input document. In its current form, the transliterator provides proper output for Kurmanji and Sorani Kurdish.

In the future, we would like to receive feedback from users of the transliterator concerning any possible issues in order to enhance the script. (Please kindly report any issues via Kurdish BLARK: https://kurdishblark.github.io/.) We also want to test it on Zazaki and Hawrami texts, as they have special letters that are not used in Sorani and Kurmanji.

Acknowledgement

We are grateful to Dr. Ergin Öpengin for his feedback and assistance in checking the accuracy of the transliterator.

References

  • [1] S. Ahmadi (2019) A rule-based Kurdish text transliteration system. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 18 (2), pp. 1–8.
  • [2] K. S. Esmaili, S. Salavati, and A. Datta (2014) Towards Kurdish information retrieval. ACM Transactions on Asian Language Information Processing (TALIP) 13 (2), pp. 7.
  • [3] H. Hassani, D. Medjedovic, et al. (2016) Automatic Kurdish dialects identification. Computer Science & Information Technology 6 (2), pp. 61–78.
  • [4] H. Hassani (2018) BLARK for multi-dialect languages: towards the Kurdish BLARK. Language Resources and Evaluation 52, pp. 625–644.