Generalized Entropies and the Similarity of Texts

11/11/2016
by   Eduardo G. Altmann, et al.

We show how generalized Gibbs-Shannon entropies can provide new insights into the statistical properties of texts. The universal distribution of word frequencies (Zipf's law) implies that the generalized entropies, computed at the word level, are dominated by words in a specific range of frequencies. Here we show that this holds not only for the generalized entropies but also for the generalized (Jensen-Shannon) divergences used to compute the similarity between different texts. This finding allows us to identify the contribution of specific words (and word frequencies) to the different generalized entropies, and also to estimate the size of the databases needed to obtain a reliable estimate of the divergences. We test our results on large databases of books (from the Google n-gram database) and scientific papers (indexed by Web of Science).
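
The abstract does not spell out the formulas, so the following is only a minimal sketch of the kind of computation it describes. It assumes the order-alpha entropy H_alpha(p) = (sum_i p_i^alpha - 1) / (1 - alpha), which reduces to the Shannon entropy at alpha = 1, and the divergence D_alpha(p, q) = H_alpha((p + q) / 2) - H_alpha(p) / 2 - H_alpha(q) / 2. The function names and the whitespace tokenization are hypothetical choices made for illustration, not the authors' code.

```python
import math
from collections import Counter


def word_frequencies(text):
    """Relative word frequencies; lowercased whitespace tokens (an assumption)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def generalized_entropy(p, alpha):
    """Order-alpha entropy (assumed form); Shannon entropy in nats at alpha = 1."""
    probs = [x for x in p.values() if x > 0]
    if abs(alpha - 1.0) < 1e-12:
        return -sum(x * math.log(x) for x in probs)
    return (sum(x ** alpha for x in probs) - 1.0) / (1.0 - alpha)


def generalized_jsd(p, q, alpha):
    """Generalized Jensen-Shannon divergence between two frequency dicts."""
    vocab = set(p) | set(q)
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return (generalized_entropy(mix, alpha)
            - 0.5 * generalized_entropy(p, alpha)
            - 0.5 * generalized_entropy(q, alpha))


def word_contributions(p, q, alpha):
    """Per-word terms of the divergence; they sum to generalized_jsd(p, q, alpha)."""
    def term(x):
        # Single-word term of the entropy sum, with 0 * log(0) taken as 0.
        if x <= 0:
            return 0.0
        if abs(alpha - 1.0) < 1e-12:
            return -x * math.log(x)
        return x ** alpha / (1.0 - alpha)

    contributions = {}
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        contributions[w] = term(0.5 * (pw + qw)) - 0.5 * term(pw) - 0.5 * term(qw)
    return contributions


if __name__ == "__main__":
    p = word_frequencies("the cat sat on the mat")
    q = word_frequencies("the dog sat on the log")
    for alpha in (0.5, 1.0, 2.0):
        d = generalized_jsd(p, q, alpha)
        top = max(word_contributions(p, q, alpha).items(), key=lambda kv: kv[1])
        print(f"alpha={alpha}: divergence={d:.4f}, largest contribution from {top[0]!r}")
```

Under this assumed form, varying alpha changes which part of the frequency spectrum weighs most: larger alpha emphasizes frequent words and smaller alpha emphasizes rare ones, and the per-word terms make it possible to see which words drive the divergence between two texts.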

research
03/01/2015

Variation of word frequencies in Russian literary texts

We study the variation of word frequencies in Russian literary texts. Ou...
research
05/03/2023

Quantifying the Dissimilarity of Texts

Quantifying the dissimilarity of two texts is an important aspect of a n...
research
10/01/2015

Similarity of symbol frequency distributions with heavy tails

Quantifying the similarity between symbolic sequences is a traditional p...
research
08/05/2020

Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts

A common task in computational text analyses is to quantify how two corp...
research
05/11/2017

On the role of words in the network structure of texts: application to authorship attribution

Well-established automatic analyses of texts mainly consider frequencies...
research
04/18/2016

Efficient Calculation of Bigram Frequencies in a Corpus of Short Texts

We show that an efficient and popular method for calculating bigram freq...
research
04/04/2016

In narrative texts punctuation marks obey the same statistics as words

From a grammar point of view, the role of punctuation marks in a sentenc...
