Customized determination of stop words using Random Matrix Theory approach

04/17/2021
by Bogdan Łobodziński, et al.

We study the distances between words, measured in word units, and compare them with the distributions of Random Matrix Theory (RMT). We find that the distribution of distances between occurrences of the same word is well described by the single-parameter Brody distribution. The Brody fit shows that the distances between occurrences of a given word in a set of texts can exhibit mixed dynamics, with coexisting regular and chaotic regimes. Words whose distance distributions are fitted by the Brody distribution to within a chosen goodness-of-fit threshold can be identified as stop words, which are usually regarded as the uninformative part of a text. By varying the goodness-of-fit threshold, we can extract uninformative words from the texts under analysis to the desired extent. On this basis we formulate a fully agnostic recipe for creating a customized set of stop words for texts in any word-based language.
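The recipe described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions not stated in the abstract: distances are rescaled to unit mean; the Brody density is taken in its standard form, P(s) = (β+1) b s^β exp(-b s^(β+1)) with b = Γ((β+2)/(β+1))^(β+1); β is estimated by a grid-search maximum likelihood over [0, 1]; and the goodness of fit is measured with a Kolmogorov-Smirnov statistic. The names `word_distances`, `fit_brody`, `stop_word_candidates`, `ks_threshold`, and `min_occurrences` are hypothetical.

```python
import math
from collections import Counter


def word_distances(tokens, word):
    """Gaps, in word units, between consecutive occurrences of `word`."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def _log_b(beta):
    # log of b = Gamma((beta+2)/(beta+1))**(beta+1), which gives unit mean spacing
    return (beta + 1) * math.lgamma((beta + 2) / (beta + 1))


def brody_cdf(s, beta):
    """Cumulative Brody distribution: 1 - exp(-b * s**(beta+1))."""
    return 1.0 - math.exp(-math.exp(_log_b(beta)) * s ** (beta + 1))


def brody_log_likelihood(spacings, beta):
    log_b = _log_b(beta)
    b = math.exp(log_b)
    return sum(
        math.log(beta + 1) + log_b + beta * math.log(s) - b * s ** (beta + 1)
        for s in spacings
    )


def fit_brody(distances, grid_steps=200):
    """Return (beta, ks) for a list of positive distances.

    beta: grid-search maximum-likelihood estimate in [0, 1]
          (0 is the Poisson limit, 1 the Wigner limit).
    ks:   Kolmogorov-Smirnov distance between the empirical and fitted CDFs
          (an assumed goodness-of-fit measure).
    """
    mean = sum(distances) / len(distances)
    spacings = sorted(d / mean for d in distances)  # rescale to unit mean
    betas = [i / grid_steps for i in range(grid_steps + 1)]
    beta = max(betas, key=lambda b: brody_log_likelihood(spacings, b))
    n = len(spacings)
    ks = max(
        max((i + 1) / n - brody_cdf(s, beta), brody_cdf(s, beta) - i / n)
        for i, s in enumerate(spacings)
    )
    return beta, ks


def stop_word_candidates(tokens, ks_threshold=0.1, min_occurrences=30):
    """Words whose distance distribution is fitted within `ks_threshold`.

    Both default values are illustrative assumptions, not values from the paper.
    """
    candidates = {}
    for word, count in Counter(tokens).items():
        if count < min_occurrences:
            continue  # too few gaps for a meaningful fit
        beta, ks = fit_brody(word_distances(tokens, word))
        if ks <= ks_threshold:
            candidates[word] = (beta, ks)
    return candidates
```

Under these assumptions, raising `ks_threshold` admits more words into the candidate set and lowering it admits fewer, which mirrors the tunable extraction described above. The fitted β indicates where a word sits between the Poisson-like (β = 0) and Wigner-like (β = 1) limits.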

research
12/20/2016

Inferring the location of authors from words in their texts

For the purposes of computational dialectology or other geographically b...
research
09/28/2017

The Dependence of Frequency Distributions on Multiple Meanings of Words, Codes and Signs

The dependence of the frequency distributions due to multiple meanings o...
research
07/26/2023

Unsupervised extraction of local and global keywords from a single text

We propose an unsupervised, corpus-independent method to extract keyword...
research
11/13/2019

The Number of Threshold Words on n Letters Grows Exponentially for Every n≥ 27

For every n≥ 27, we show that the number of n/(n-1)^+-free words (i.e., ...
research
01/29/2016

Zipf's law is a consequence of coherent language production

The task of text segmentation may be undertaken at many levels in text a...
research
01/05/2022

Some Strategies to Capture Karaka-Yogyata with Special Reference to apadana

In today's digital world language technology has gained importance. Seve...
research
01/30/2023

Using n-aksaras to model Sanskrit and Sanskrit-adjacent texts

Despite – or perhaps because of – their simplicity, n-grams, or contiguo...