Two halves of a meaningful text are statistically different

04/09/2020
by Weibing Deng, et al.

Which statistical features distinguish a meaningful text (possibly written in an unknown system) from a meaningless set of symbols? Here we answer this question by comparing features of the first half of a text to those of its second half. This comparison can uncover hidden effects, because the two halves share the values of many parameters (style, genre, etc.). We found that the first half has more distinct words and more rare words than the second half. Also, words in the first half are distributed less homogeneously over the text, in the sense of the difference between the frequency and the inverse spatial period. These differences hold for a significant majority of the several hundred relatively short texts we studied. The statistical significance is confirmed via the Wilcoxon test. The differences disappear after a random permutation of words that destroys the linear structure of the text. They reveal a temporal asymmetry in meaningful texts, which is confirmed by showing that texts are much more compressible in their natural order (i.e. along the narrative) than in the word-inverted form. We conjecture that these results connect the semantic organization of a text (defined by the flow of its narrative) to its statistical features.
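
The half-versus-half comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code and rests on several assumptions: the tokenizer is a simple lowercase regular expression, a "rare" word is taken to be one that occurs exactly once within its half, and the function names (`tokenize`, `half_stats`, `compare_halves`, `shuffled_control`) are hypothetical. The shuffled control mirrors the random permutation of words mentioned in the abstract; the Wilcoxon test over many texts and the homogeneity measure are not covered here.

```python
import random
import re
from collections import Counter


def tokenize(text):
    # Assumption: words are runs of letters/apostrophes, case-insensitive.
    return re.findall(r"[a-z']+", text.lower())


def half_stats(words):
    # Returns (number of distinct words, number of rare words).
    # Assumption: "rare" means occurring exactly once in this half.
    counts = Counter(words)
    distinct = len(counts)
    rare = sum(1 for c in counts.values() if c == 1)
    return distinct, rare


def compare_halves(words):
    # Split at the midpoint; for odd lengths the second half gets one extra word.
    mid = len(words) // 2
    return half_stats(words[:mid]), half_stats(words[mid:])


def shuffled_control(words, seed=0):
    # Randomly permute a copy of the words, destroying the linear order,
    # then repeat the same half-versus-half comparison.
    rng = random.Random(seed)
    shuffled = list(words)
    rng.shuffle(shuffled)
    return compare_halves(shuffled)


if __name__ == "__main__":
    sample = "the cat sat on the mat the cat sat on the cat"
    words = tokenize(sample)
    print("original:", compare_halves(words))
    print("shuffled:", shuffled_control(words))
```

For a real study one would apply `compare_halves` to each text in a corpus and feed the paired per-text values into a paired significance test; a single short string like the sample above only shows the mechanics.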

research
09/22/2018

Relating Zipf's law to textual information

Zipf's law is the main regularity of quantitative linguistics. Despite o...
research
11/18/2019

Universal and non-universal text statistics: Clustering coefficient for language identification

In this work we analyze statistical properties of 91 relatively small te...
research
06/08/2018

Text Classification based on Word Subspace with Term-Frequency

Text classification has become indispensable due to the rapid increase o...
research
02/07/2021

Word frequency-rank relationship in tagged texts

We analyze the frequency-rank relationship in sub-vocabularies correspon...
research
05/03/2023

A Statistical Exploration of Text Partition Into Constituents: The Case of the Priestly Source in the Books of Genesis and Exodus

We present a pipeline for a statistical textual exploration, offering a ...
research
09/14/2020

A Comparison of Two Fluctuation Analyses for Natural Language Clustering Phenomena: Taylor and Ebeling Neiman Methods

This article considers the fluctuation analysis methods of Taylor and Eb...
research
08/16/2018

Linguistic data mining with complex networks: a stylometric-oriented approach

By representing a text by a set of words and their co-occurrences, one o...
