Accuracy of the Uzbek stop words detection: a case study on "School corpus"

09/15/2022
by Khabibulla Madatov, et al.

Stop words are very important for information retrieval and text analysis tasks in natural language processing. This work presents a method for evaluating the quality of stop word lists produced by automatic creation techniques. Although the method was tested on an automatically generated list of stop words for the Uzbek language, it can be applied, with some modifications, to similar languages, either from the same family or agglutinative in nature. Because Uzbek is an agglutinative language, automatic detection of stop words is a more complex process than in inflected languages. We also build on our previous work on stop word detection, using the "School corpus" as an example, by investigating how to automatically analyse the detection of stop words in Uzbek texts. The work addresses two questions: whether there is a good way to evaluate the available stop word lists for Uzbek texts, and whether it is possible to determine which part of an Uzbek sentence contains the majority of the stop words by studying the numerical characteristics of the probability of unique words. The results show acceptable accuracy of the stop word lists.
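As a rough illustration of the kind of evaluation described above, the following minimal sketch computes the probability (relative frequency) of each unique word and scores a candidate stop word list against a reference list. It is not the authors' method: the toy sentence, the 0.2 threshold, the reference set, and the function names `word_probabilities` and `precision_recall` are all assumptions made for this example.

```python
from collections import Counter


def word_probabilities(tokens):
    """Return the relative frequency of each unique word in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def precision_recall(candidate, reference):
    """Precision and recall of a candidate stop word set against a reference set."""
    candidate, reference = set(candidate), set(reference)
    hits = len(candidate & reference)
    precision = hits / len(candidate) if candidate else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Toy data: hypothetical tokens, NOT taken from the "School corpus".
tokens = "bu kitob va bu daftar va u qalam".split()
probs = word_probabilities(tokens)

# Assumption: words at or above an arbitrary probability threshold are candidates.
candidate = {word for word, p in probs.items() if p >= 0.2}

# Assumption: a hand-made reference list used only for this illustration.
reference = {"bu", "va", "u"}

precision, recall = precision_recall(candidate, reference)
print(sorted(candidate), precision, recall)
```

On real data the threshold and the reference list would have to come from the corpus and from expert annotation; the sketch only shows how the two quantities relate.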

