A statistical test for correspondence of texts to the Zipf-Mandelbrot law

12/25/2019
by Anik Chakrabarty, et al.

We analyse the correspondence of a text to a simple probabilistic model. The model assumes that words are selected independently from an infinite dictionary, with a probability distribution that follows the Zipf-Mandelbrot law. We count the number of different words sequentially as the text is read, which yields a process of counts of different words. We then estimate the parameters of the Zipf-Mandelbrot law from the same sequence and construct an estimate of the expected number of different words in the text. Subtracting the corresponding values of this estimate from the sequence and normalizing along both coordinate axes gives a random process on the segment from 0 to 1. We prove that this process (the empirical text bridge) converges weakly in the uniform metric on C(0,1) to a centered Gaussian process with a.s. continuous paths. We develop and implement an algorithm for the approximate calculation of the eigenvalues of the covariance function of the limit Gaussian process, and then an algorithm for calculating the probability distribution of the integral of the square of this process. We use the algorithm to analyse the uniformity of texts in English, French, Russian and Chinese.
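The first step described in the abstract, counting different words sequentially, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the truncation of the dictionary to a finite number of ranks, and the parameter values s and q are assumptions made only so the example runs; the weights use the usual Zipf-Mandelbrot form, proportional to (i + q)^(-s) for rank i. Parameter estimation and the normalization that produces the empirical text bridge are not shown, since the abstract does not specify them.

```python
import random


def distinct_word_counts(words):
    """Return [R_1, ..., R_n], where R_j is the number of different
    words among the first j words of the text."""
    seen = set()
    counts = []
    for word in words:
        seen.add(word)
        counts.append(len(seen))
    return counts


def zipf_mandelbrot_weights(n_types, s, q):
    """Unnormalized Zipf-Mandelbrot weights (i + q)^(-s) for ranks 1..n_types.

    Assumption: the dictionary is truncated to n_types ranks so that it
    can be sampled; the model in the abstract uses an infinite dictionary.
    """
    return [(i + q) ** (-s) for i in range(1, n_types + 1)]


def sample_text(n_words, n_types=10_000, s=1.2, q=2.0, seed=0):
    """Draw n_words independent word ranks from the truncated law.

    The default values of n_types, s and q are illustrative only.
    """
    rng = random.Random(seed)
    weights = zipf_mandelbrot_weights(n_types, s, q)
    return rng.choices(range(1, n_types + 1), weights=weights, k=n_words)


if __name__ == "__main__":
    # Counts for a tiny hand-made text.
    print(distinct_word_counts("a b a c b d".split()))
    # Counts for a simulated text: a nondecreasing sequence with steps 0 or 1.
    simulated = sample_text(1000)
    print(distinct_word_counts(simulated)[-1])
```

For the six-word text "a b a c b d" the counts are 1, 2, 2, 3, 3, 4: the count increases only when a word appears for the first time.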


research
11/03/2017

Estimation of Zipf parameter by means of a sequence of counts of different words

We study a probabilistic model of text in which probabilities of words d...
research
01/07/2021

Infinitely Wide Tensor Networks as Gaussian Process

Gaussian Process is a non-parametric prior which can be understood as a ...
research
12/27/2022

An effectivization of the law of large numbers for algorithmically random sequences and its absolute speed limit of convergence

The law of large numbers is one of the fundamental properties which algo...
research
12/31/2020

Asymptotics of sums of regression residuals under multiple ordering of regressors

We prove theorems about the Gaussian asymptotics of an empirical bridge ...
research
09/22/2018

Relating Zipf's law to textual information

Zipf's law is the main regularity of quantitative linguistics. Despite o...
research
09/10/2021

On the eigenvalues associated with the limit null distribution of the Epps-Pulley test of normality

The Shapiro–Wilk test (SW) and the Anderson–Darling test (AD) turned out...
research
06/03/2021

Stein's method, smoothing and functional approximation

Stein's method for Gaussian process approximation can be used to bound t...
