Part of Speech Tagging (POST) of a Low-resource Language using another Language (Developing a POS-Tagged Lexicon for Kurdish (Sorani) using a Tagged Persian (Farsi) Corpus)

01/30/2022
by Hossein Hassani, et al.

Tagged corpora play a crucial role in a wide range of Natural Language Processing (NLP) tasks. Part-of-Speech Tagging (POST) is essential in developing tagged corpora. Manual tagging is time-consuming, labor-intensive, and costly, so automating it can make it more affordable. The Kurdish language currently lacks publicly available tagged corpora of adequate size. Tagging the publicly available Kurdish corpora would make those resources more useful than raw or segmented corpora alone. Developing POS-tagged lexicons can assist this task. We use a tagged corpus in Persian (Farsi), the Bijankhan corpus, to develop a POS-tagged lexicon for Kurdish (Sorani), since Persian is a language close to Kurdish. This paper presents an approach that leverages the resources of a language close to Kurdish to enrich Kurdish resources. A partial dataset of the results is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license at https://kurdishblark.github.io/. We plan to make the whole tagged corpus available after further investigation of the outcome. The dataset can help in developing POS-tagged lexicons for other Kurdish dialects and in automated tagging of Kurdish corpora.
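The abstract does not spell out how tags are carried from Persian to Kurdish, so the following is only a rough sketch of one simple way such a lexicon could be built, not the authors' method. It assumes a hypothetical bilingual word list pairing Kurdish words with Persian equivalents, and it picks the most frequent tag per Persian word. The function names, the coarse tags, and the toy data are all invented for illustration and are not drawn from the Bijankhan corpus.

```python
from collections import Counter, defaultdict


def build_source_lexicon(tagged_tokens):
    """Map each source-language word to its most frequent POS tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def project_lexicon(source_lexicon, bilingual_pairs):
    """Copy tags onto target-language words through a bilingual word list.

    bilingual_pairs is an iterable of (target_word, source_word) tuples.
    Target words whose source equivalent has no tag are skipped.
    """
    target_lexicon = {}
    for target_word, source_word in bilingual_pairs:
        tag = source_lexicon.get(source_word)
        if tag is not None:
            target_lexicon[target_word] = tag
    return target_lexicon


# Toy data (assumption): placeholder tokens and coarse tags, not real corpus entries.
persian_tagged = [
    ("fa_book", "N"),
    ("fa_book", "N"),
    ("fa_good", "ADJ"),
    ("fa_good", "ADJ"),
    ("fa_good", "ADV"),
]

# Hypothetical word list of (Kurdish word, Persian equivalent) pairs.
word_pairs = [
    ("ku_book", "fa_book"),
    ("ku_good", "fa_good"),
    ("ku_unknown", "fa_missing"),
]

persian_lexicon = build_source_lexicon(persian_tagged)
kurdish_lexicon = project_lexicon(persian_lexicon, word_pairs)
```

In this sketch, Kurdish words without a tagged Persian counterpart are left out of the lexicon rather than given a guessed tag, which keeps the output conservative at the cost of coverage.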


