Morphology is the study of the structure of words. The morpheme is a phonemically defined segment of speech or set of segments of speech with a constant range of meaning. Morphological analysis is a central task in language processing: it takes a word as input, detects the morphological entities in the word, and provides a morphological representation of it.
In the case of morphologically rich languages, multiple types and levels of information may be available at the word level. The lexical information for each word form may be augmented with information concerning the grammatical function of the word in the sentence, pronominal clitics, its grammatical relations to other words, inflectional affixes, and so on (tsarfaty2013parsing). In Kurdish, many of these notions are expressed by inflectional affixes. For instance, the stem of a transitive verb can be preceded or succeeded by affixes to indicate the subject, the direct object, tense and grammatical mood. Figure 1 illustrates the morphological tree of the verb ”hełimnegirtbûnewe” meaning ”(I) had not retaken them”.
Analysing morphologically rich languages is a challenging task regardless of the analysis technique implemented. Our main objective in this paper is to demonstrate the capacity of Finite-State Transducers (FSTs) as a method for computational morphology in the Kurdish language. We focus on Sorani Kurdish as one of the widely spoken and written dialects of Kurdish (salavati2018building). Given the complexity of Kurdish morphology, we cover only the major inflectional morphemes of four grammatical categories in the current study, namely nouns, verbs, adjectives, and adverbs. This project is also an attempt to address three of the gaps identified in the Kurdish BLARK (Basic Language Resource Kit) (hassani2018blark), namely morphological analysis, morphological synthesis, and text generation.
In our approach to the morphological study of Kurdish, we follow the prevailing practice in Natural Language Processing (NLP), as described in the related literature, for example, by Jacquemin and Tzoukermann (jacquemin1999nlp). That is, we are aware of the ongoing debates among linguists on the meaning of morphemes and lexemes and on how to interpret these concepts (nemo2003morphemes; beard2005lexeme). However, we focus on the practical outcome, which could help advance the resources and tools for Kurdish NLP.
Finite-State Transducers (FSTs) are well-established tools in various areas of computing, including NLP and Computational Linguistics (CL) (karttunen2000applications), and are used in a variety of tasks such as machine translation (civera2005novel; forcada2011apertium) and computational morphology (altantawy2011fast). In comparison to traditional lexicon-based computational morphology, FSTs have been reported to be more effective for morphologically rich languages (altantawy2011fast). One of the main obstacles that has hindered progress in morphological analysis for Kurdish is the lack of electronic lexicographic resources. Recently, Ahmadi et al. (ahmadi2019lex) presented three electronic lexicographic resources for Kurdish covering the Hawrami, Sorani and Kurmanji dialects with enriched information such as part-of-speech. Building on these resources, as a preliminary study, we construct a set of FST-based rules for four grammatical categories in Sorani Kurdish, namely nouns, adjectives, adverbs and verbs. For this purpose, we use the Stuttgart Finite-State Transducer (SFST) tools (schmid2005programming) as the experimental environment.
The rest of this paper is organized as follows: Section 2 reviews the literature and addresses the related work. Section 3 presents the Kurdish morphology. In Section 4, we describe the implementation of our morphological analyser for various morphological categories of Sorani Kurdish. Finally, Section 5 concludes the paper and proposes a few ideas for future research.
2. Related Work
Computational morphology using FSTs has been a topic of research since the 1980s (karttunen2001short), and a variety of tools have been developed for this purpose (beesleyFSM). Finite-State Morphology (FSM) covers both the analysis and generation aspects of computational morphology and has been applied to various languages, some of which are well equipped with NLP and CL tools and some of which are not. While well-resourced languages such as English (minnen2001applied; beesleyFSM) and German (schmid2005programming) are front-runners in FSM studies, we also observe similar scholarly attempts for other languages which might not be considered as widely studied as English or German, such as Uralic languages (novak2015model), Arabic (soudi2007arabic) and Persian (megerdoomian2000persian; arabsorkhi-shamsfard-2006-unsupervised). The same holds for less-studied languages such as Croatian (mihajloviccomputational). These examples indicate wide attention to computational morphology in NLP and CL, whether by using FSTs or by following other approaches.
Despite its crucial role in the language analysis and generation, computational morphology has not received noticeable attention in Kurdish NLP and CL except for a few efforts (hassani2018blark). Walther and Sagot (walther2010developing) and Walther et al. (walther2010fast) present a methodology and preliminary experiments on constructing a morphological lexicon for Sorani and Kurmanji in which a lemma and a morphosyntactic tag are associated with each known form of the word. For this purpose, the Alexina framework (sagot2010lefff) is used.
Computational morphology of Kurdish has been partially addressed in related NLP and CL tasks. Hosseini et al. (hosseniKSLexicon) suggest a formulation for Sorani morphology used in creating a Sorani lexicon. Salavati et al. (salavati2018building) report challenges in Sorani Kurdish lemmatization and spelling error correction due to morphological complexity. A morphological analysis is also carried out in (gokirmak2017dependency) for creating a dependency treebank for Kurmanji Kurdish.
The aforementioned situation indicates that Kurdish computational morphology is an under-explored area that both requires and deserves a sizeable effort by, and collaboration among, interested scholars.
3. Kurdish Morphology
Kurdish is an Indo-European language which is spoken by approximately 30 million people in different countries (ahmadi2019lex). Haig and Matras (haig2002kurdish) provide a brief description of the structural properties of Kurdish. Kurdish is a multi-dialect language, and its dialects have different grammatical features and vocabulary sets (hassani2016automatic). The extent of the differences in grammar and vocabulary varies among the dialects (jugel2014linguistic; haig2014introduction). In several cases, the differences are significant while in others, they are trivial (hassani2016automatic). Equally important, the language is written in different scripts with no standard orthography (ahmadi2019wergor). This forces computational morphology for Kurdish to be dialect-focused, at least in the first steps.
Kurdish is considered a morphologically rich language in which grammatical relations are indicated by changes in the word forms and by modifying morphemes (littell2016named; zivingi2019comparative). Unlike most Indo-European languages, which are considered fusional languages, Kurdish is also characterized as a partially agglutinative language (khalid2015new). Agglutination refers to a linguistic process in which various word forms are created by stringing morphemes together.
Our focus in this paper is on four grammatical categories, namely verbs, nouns, adjectives, and adverbs. These categories are briefly described as follows.
The absolute form of a noun in Kurdish is the form without any affixes which represents a generic meaning of the word (thackston2006sorani). This form is the lemma provided in the dictionary. The inflection of nouns is mostly carried out with suffixes to indicate number, definiteness, indefiniteness, demonstratives and gender. Unlike Kurmanji and Hawrami dialects, Sorani and Southern Kurdish dialects do not specify genders through morphological inflection. However, there are a few exceptions in some of the subdialects of the two latter dialects, such as the Mukryani subdialect of Sorani where nouns can have genders in specific cases. For instance, in the sentences ”deçime małe Zeynebî” and ”le kin Ferhadê”, ”Zeyneb” and ”Ferhad” respectively as feminine and masculine proper names are inflected with the ”î” and ”ê” suffixes to represent the gender.
|demonstrative||em nawe||em nawane|
|em dergaye||em dergayane|
Table 1 describes the inflection of the two words ”naw” (name) and ”derga” (door) where the morphemes are highlighted in bold. As a result of the inflection, phonological changes, such as dropping a vowel or adding an auxiliary one, may happen when two vowels emerge consecutively. For instance, such a change can be observed in ”dergaye” where ”y” appears between the lemma ”derga” and the demonstrative suffix ”e”.
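The vowel-contact rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the vowel inventory and the function name `attach_suffix` are hypothetical, and only the glide-insertion case from the ”dergaye” example is modelled, not vowel dropping.

```python
# Assumed vowel inventory of the Latin-based Sorani orthography (illustrative).
VOWELS = set("aeêiîouû")


def attach_suffix(stem: str, suffix: str) -> str:
    """Attach a suffix to a stem, inserting the glide "y" when two vowels meet."""
    if stem and suffix and stem[-1] in VOWELS and suffix[0] in VOWELS:
        return stem + "y" + suffix
    return stem + suffix


if __name__ == "__main__":
    # Demonstrative singular ("e") and plural ("ane") forms from Table 1.
    for lemma in ("naw", "derga"):
        print(lemma, attach_suffix(lemma, "e"), attach_suffix(lemma, "ane"))
```

With these assumptions, the sketch reproduces the four demonstrative forms listed in Table 1; a full treatment would encode such alternations as phonological rules in the transducer itself.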
3.2. Adjectives and adverbs
Adjectives in Sorani Kurdish are inflected according to the modified noun. Predicative adjectives in Kurdish are mostly inflected based on their syntactic role. In addition, the comparative and superlative degrees of an adjective are formed with the suffixes ”tir” and ”tirîn”, respectively. Attributive adjectives, on the other hand, follow a richer morphological pattern that depends on the inflection of the noun. The main noun-adjective construction in Sorani Kurdish is carried out using the Izafa morpheme. The Izafa is an ”î” or ”y” (following a vowel) appearing between the noun and the adjective. However, there are other ways of forming attributive adjectives, which differ in the noun-adjective construction and in the placement of the definiteness or number suffixes (thackston2006sorani). Table 2 provides an example of the attributive adjective ”ciwan” (beautiful) for the lemma ”guł” (flower) in various forms.
|absolute||gułî ciwan||-||gułe ciwan||-|
|indefinite||gułêki ciwan||gułanî ciwan||gułe ciwanêk||gułe ciwanan|
|definite||gułeke ciwan||gułekanî ciwan||gułe ciwaneke||gułe ciwanekan|
|demonstrative||em gułe ciwane||em gułe ciwanane||em gułe ciwane||em gułe ciwanane|
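As a minimal sketch of the Izafa construction and the degree suffixes described in this section, the following Python functions build a noun-adjective phrase and the comparative and superlative forms. The function names and the vowel inventory are assumptions made for illustration; the sketch covers only the basic Izafa pattern and not the other constructions shown in Table 2.

```python
# Assumed vowel inventory of the Latin-based Sorani orthography (illustrative).
VOWELS = set("aeêiîouû")

# Degree suffixes as given in the text.
DEGREE_SUFFIXES = {"comparative": "tir", "superlative": "tirîn"}


def izafa(noun: str, adjective: str) -> str:
    """Join a noun and an adjective with the Izafa: "y" after a vowel, else "î"."""
    linker = "y" if noun[-1] in VOWELS else "î"
    return f"{noun}{linker} {adjective}"


def degree(adjective: str, kind: str) -> str:
    """Form the comparative or superlative degree of an adjective."""
    return adjective + DEGREE_SUFFIXES[kind]


if __name__ == "__main__":
    print(izafa("guł", "ciwan"))           # absolute form from Table 2
    print(izafa("gułekan", "ciwan"))       # definite plural noun + Izafa
    print(degree("ciwan", "comparative"))
    print(degree("ciwan", "superlative"))
```

The noun is passed in already inflected (for example ”gułekan”), so that number and definiteness marking stay separate from the linking step.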
Although some adverbs are words in their absolute form, in most cases adverbs in Kurdish are inflected forms of the absolute form of an adjective or noun. They are typically formed with the suffix ”ane” or the prefix ”be”. For instance, the two adverbs ”betundî” and ”tundane” are formed from the roots ”tundî” (noun, quickness) and ”tund” (adjective, quick), respectively.
In Sorani Kurdish, verbs agree with their subject in number and person and, in some cases, with their object as well. For instance, in the verb ”dîtimî” meaning ”(I) saw you (singular)”, dît, the past stem of the verb ”dîtin” (to see), is inflected based on the enclitic pronominal logical object î and the agent affix im. In addition, Kurdish is a split-ergative language where the subject of an intransitive verb behaves like the object of a transitive verb and differently from the agent of a transitive verb (comrie1989language). More precisely, the ergative case is marked on the agents of transitive verbs and on the verbs themselves in past tenses (bynon1979ergative).
Verbs have two stems, a past stem and a present stem. Although the past stem can be extracted by removing the infinitive suffix ”in”, the present stem is irregularly derived from the infinitive form of the verb.
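The asymmetry between the two stems can be sketched as follows: the past stem is computed by a rule, while the present stem has to be looked up. The function names and the two lexicon entries are illustrative assumptions, not the contents of the lexicon used in this work.

```python
from typing import Optional

# Illustrative present stems; being irregular, they must be listed in a lexicon.
PRESENT_STEMS = {
    "xwardin": "xo",  # to eat
    "dîtin": "bîn",   # to see
}


def past_stem(infinitive: str) -> str:
    """Derive the past stem by removing the infinitive suffix "in"."""
    if not infinitive.endswith("in"):
        raise ValueError(f"not an infinitive in -in: {infinitive!r}")
    return infinitive[: -len("in")]


def present_stem(infinitive: str) -> Optional[str]:
    """Look up the present stem; returns None when the verb is not listed."""
    return PRESENT_STEMS.get(infinitive)


if __name__ == "__main__":
    for verb in ("xwardin", "dîtin"):
        print(verb, past_stem(verb), present_stem(verb))
```

This is why the lexicon needs to store a present stem for each verb, whereas the past stem could in principle be derived on the fly.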
|Connected possessive pronoun (person 1-6)||1s||im|
|Copula (verbs for person 1-6)||C1||im|
|C3||heye, hes, e|
|Imperative marker (verb)||IMP||bi, b|
|Negative imperative (verb)||NMP||me|
|Negative marker (verb)||NEG||ne, na|
|Subjunctive marker (verb)||SUB||bi|
|Plural suffix (noun)||PL||an, gel, ha, at|
|Definite marker (noun)||DEF||eke|
|Indefinite marker (noun)||IND||êk, yek|
|Progressive marker (verb)||CON||de, e|
|Relative adjectives||RA||î, y, în, yin, çî, nok, ko, û, île, yile, emenî|
|Comparative marker (adjective)||COMP||tir|
|Superlative marker (adjective)||SUP||tirîn|
|Adverb marker (adverb)||AD||ane, be, an|
|Diminutive form (noun)||DIM||çe, ke, ik, ko, oke, oł, ołe, ołik, ołke, ełe, elûke, yekołe, îlane, île, ûlke, ûle, le, łe|
After a thorough study of our four intended grammatical categories, i.e., nouns, adjectives, adverbs, and verbs, we defined their morphological features in plain language. As a result, we extracted 171 inflection rules, which gave us a deeper insight into the morphological rules needed for implementing the transducers.
Our implementation is realized in the SFST transducer specification language which is based on regular expressions with extended functionalities such as concatenation and variable definition (schmid2005programming). The SFST compiler creates FSTs by concatenating and filtering morphemes and mapping the resulting analysis strings to the surface realizations according to phonological rules (schmid2004smor). SFST not only analyses the composing morphemes of the input word with feature decorations, but it also has a generation mode which makes the transducers reversible. This mode is particularly important for tasks relying on text generation. The following shows an example in the two modes for the word ”xwardim” (I ate):
    analyze> xwardim
    xward<verb-transitive-past-stem><past-1s>
    generate> xward<verb-transitive-past-stem><past-1s>
    xwardim
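The reversibility shown in this example can be mimicked with a toy relation between surface strings and analysis strings. The Python sketch below is not SFST and does not use its API; it enumerates a finite set of pairs from a hypothetical one-entry lexicon and queries it in both directions, which is the behaviour a transducer provides without explicit enumeration.

```python
from itertools import product

# Hypothetical miniature lexicon: (surface string, analysis string) pairs.
STEMS = [("xward", "xward<verb-transitive-past-stem>")]
SUFFIXES = [("im", "<past-1s>")]

# The relation encoded by the toy "transducer": every stem + suffix combination.
RELATION = [
    (stem_surface + suffix_surface, stem_tag + suffix_tag)
    for (stem_surface, stem_tag), (suffix_surface, suffix_tag)
    in product(STEMS, SUFFIXES)
]


def analyze(surface: str) -> list:
    """Map a surface form to all of its analysis strings."""
    return [analysis for form, analysis in RELATION if form == surface]


def generate(analysis: str) -> list:
    """Map an analysis string back to all of its surface forms."""
    return [form for form, tags in RELATION if tags == analysis]


if __name__ == "__main__":
    print(analyze("xwardim"))
    print(generate("xward<verb-transitive-past-stem><past-1s>"))
```

Because both functions read the same relation, any form that can be analysed can also be generated, which is the property that makes a single set of rules usable for both tasks.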
The properties of base word forms are defined in a lexicon which is based on Ahmadi et al.’s work (ahmadi2019lex). In the case of nouns, adjectives and adverbs, we considered the absolute form of the word as the base form. This does not apply to words with derivational morphemes, which are out of the scope of the current study. Regarding the verbs, two base forms are defined: the past stem and the present stem. In addition, the transitivity of each verb is specified, as different morphological rules apply to transitive and intransitive verbs in past tenses. Figure 2 illustrates a finite-state transducer of the verb xwardin (to eat) in the past simple tense:
In the case of the verb, we define a basic FST composed of the verb stem (present or past) suffixed with person and number affixes. This FST enables us to reduce redundancy in defining rules, since different combinations of prefixes such as bi, ne and de, respectively the imperative, negative and progressive prefixes, can then be attached to it. That said, the ergative cases are not covered in this initial study. Table 3 provides some of the inflectional morphemes which are used as variables in our transducers.
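The idea of defining the stem-plus-person-suffix part once and combining it with optional prefixes can be sketched in Python as follows. The verb ”hatin” (to come), the tag names and the function `with_prefixes` are illustrative assumptions; an intransitive verb is used because ergative constructions are not covered here.

```python
# Base relation, defined once: past stem + person suffix (surface -> analysis).
# "hat" is the past stem of the intransitive verb "hatin" (to come); the verb
# and the tag names are illustrative and not taken from the paper's lexicon.
BASE = {"hat" + "im": "hat<verb-intransitive-past-stem><past-1s>"}

# Optional prefixes, labelled with the NEG and CON tags listed in Table 3.
PREFIXES = {"": "", "ne": "<NEG>", "de": "<CON>"}


def with_prefixes(base: dict, prefixes: dict) -> dict:
    """Combine every prefix with every base form without restating the base rules."""
    return {
        prefix + surface: tag + analysis
        for prefix, tag in prefixes.items()
        for surface, analysis in base.items()
    }


FORMS = with_prefixes(BASE, PREFIXES)

if __name__ == "__main__":
    for surface, analysis in FORMS.items():
        print(surface, analysis)
```

Adding a new person suffix to `BASE` or a new prefix to `PREFIXES` extends all combinations at once, which is the redundancy reduction the basic FST is meant to provide.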
Kurdish is written using different scripts, of which Persian-Arabic and Latin are the two most widely used. The challenges in processing Persian-Arabic texts (ahmadi2019wergor), particularly the missing ”i” vowel, also known as Bizroke among Kurdish scholars, are the main reason that we focus on the Latin-based script in the current work.
5. Conclusion and future work
In this paper, we reported our progress in utilizing FSTs for the morphological analysis of Kurdish. We presented four main categories of Kurdish morphology, namely verbs, nouns, adjectives and adverbs. This research provides Kurdish NLP and CL with a tool and a resource that can help improve the quality of Kurdish language processing tasks.
In this preliminary study, our tool can analyse and generate rather simple morphological forms. Therefore, as future work, we aim to cover more linguistic processes, such as ergativity, and other grammatical categories in Sorani Kurdish. Moreover, analysing the morphological derivation of Kurdish should be a priority, particularly in the case of verbs, which are mostly formed with derivational morphemes. Another limitation of the current study is the absence of syntactic features which may modify morphological forms in a sentence.
In our implementation, we only used the Latin script of Kurdish, as it is less ambiguous than the Persian-Arabic script (ahmadi2019wergor). We believe that the current study will pave the way for evaluating machine transliteration systems in the future as well. In addition, we believe that the computational analysis of the other dialects of Kurdish will enable further progress in the field.
Our tool and resource will be publicly available for non-commercial use under the CC BY-NC-SA 4.0 licence upon the acceptance of the paper.