UzbekTagger: The rule-based POS tagger for Uzbek language

01/30/2023
by Maksud Sharipov, et al.

This research paper presents a part-of-speech (POS) annotated dataset and a tagger tool for the low-resource Uzbek language. The dataset uses 12 tags, which were also used to develop a rule-based POS tagger. The corpus text used in the annotation process was balanced over 20 different fields to ensure its representativeness. Because Uzbek is an agglutinative language, most words in an Uzbek sentence are formed by adding suffixes. This makes POS tagging difficult, since the tagger must first find the stem of each word and then the part of speech it belongs to. The methodology proposed in this research is to stem words with an affix/suffix stripping approach, supported by a database of the stem forms of words in the Uzbek language. The tagger tool was tested on the annotated dataset and showed high accuracy in identifying and tagging parts of speech in Uzbek text. The dataset and tagger tool can be used for a variety of natural language processing tasks such as language modeling, machine translation, and text-to-speech synthesis. The presented dataset is the first of its kind to be made publicly available for Uzbek, and the POS tagger can also serve as a base for other closely related Turkic languages.
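The suffix-stripping idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stem lexicon, the suffix list, the tag names (`NOUN`, `VERB`, `ADJ`, `UNK`) and the function names `find_stem` and `tag_sentence` are all hypothetical stand-ins chosen for illustration, and the paper's actual 12-tag set and stem database are not reproduced here. The sketch only shows the general pattern: strip known suffixes from the end of a word until the remainder matches an entry in a stem database, then return the tag stored for that stem.

```python
from typing import List, Optional, Tuple

# Hypothetical toy stem database: stem -> POS tag (illustrative tags only,
# not the 12-tag set of the paper).
STEM_LEXICON = {
    "kitob": "NOUN",   # book
    "maktab": "NOUN",  # school
    "kel": "VERB",     # come
    "yoz": "VERB",     # write
    "katta": "ADJ",    # big
}

# Hypothetical, very incomplete list of Uzbek suffixes.
SUFFIXES = ["lar", "ning", "dan", "ni", "ga", "da", "im", "di"]


def find_stem(token: str) -> Optional[str]:
    """Strip suffixes from the end of token until a known stem remains."""
    if token in STEM_LEXICON:
        return token
    for suffix in SUFFIXES:
        # Only strip when something is left over to act as a stem.
        if token.endswith(suffix) and len(token) > len(suffix):
            stem = find_stem(token[: -len(suffix)])
            if stem is not None:
                return stem
    return None


def tag_sentence(sentence: str) -> List[Tuple[str, Optional[str], str]]:
    """Return (word, stem, tag) triples; unknown words get the tag 'UNK'."""
    result = []
    for word in sentence.split():
        stem = find_stem(word.lower())
        tag = STEM_LEXICON[stem] if stem is not None else "UNK"
        result.append((word, stem, tag))
    return result


if __name__ == "__main__":
    for triple in tag_sentence("Kitoblarimda maktabda keldi katta"):
        print(triple)
```

In this sketch, a word such as `kitoblarimda` is reduced by removing `-da`, `-im` and `-lar` in turn until the stem `kitob` is found in the lexicon. A real tagger would also need to handle ambiguous stems, phonological changes at morpheme boundaries, and suffixes that change the part of speech, none of which this toy example attempts.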

research
08/25/2022

Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks

Indigenous African languages are categorized as under-served in Artifici...
research
05/19/2022

A machine transliteration tool between Uzbek alphabets

Machine transliteration, as defined in this paper, is a process of autom...
research
01/22/2016

GeoTextTagger: High-Precision Location Tagging of Textual Documents using a Natural Language Processing Approach

Location tagging, also known as geotagging or geolocation, is the proces...
research
10/28/2022

Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language

Lemmatization is one of the core concepts in natural language processing...
research
06/08/2017

The Algorithmic Inflection of Russian and Generation of Grammatically Correct Text

We present a deterministic algorithm for Russian inflection. This algori...
research
10/11/2017

Measurement Context Extraction from Text: Discovering Opportunities and Gaps in Earth Science

We propose Marve, a system for extracting measurement values, units, and...
