Is Natural Language a Perigraphic Process? The Theorem about Facts and Words Revisited

06/14/2017
by Łukasz Dębowski, et al.

As we discuss, a stationary stochastic process is nonergodic when a random persistent topic can be detected in the infinite random text sampled from the process, whereas we call the process strongly nonergodic when an infinite sequence of independent random bits, called probabilistic facts, is needed to describe this topic completely. Replacing probabilistic facts with an algorithmically random sequence of bits, called algorithmic facts, we adapt this property back to ergodic processes. Subsequently, we call a process perigraphic if the number of algorithmic facts that can be inferred from a finite text sampled from the process grows like a power of the text length. We present a simple example of such a process. Moreover, we prove a proposition that we call the theorem about facts and words. It states that the number of probabilistic or algorithmic facts that can be inferred from a text drawn from a process must be roughly smaller than the number of distinct word-like strings detected in this text by means of the PPM compression algorithm. We also observe that the number of word-like strings for a sample of plays by Shakespeare follows an empirical stepwise power law, in stark contrast to Markov processes. Hence we suppose that natural language, considered as a process, is not only non-Markov but also perigraphic.
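
The power-law growth in the number of facts can be illustrated with a small simulation. The sketch below assumes a Santa Fe-style construction, which may or may not match the example the abstract refers to: each symbol is a pair (k, z_k), where the index k is drawn from a power-law distribution and z_k is a fixed bit attached to that index. The names `sample_santa_fe`, `count_distinct_facts`, and the parameter `beta` are hypothetical, and the bits are pseudo-random stand-ins for probabilistic facts rather than an algorithmically random sequence.

```python
import random


def sample_santa_fe(n, beta=0.5, seed=0):
    """Return n symbols (k, z_k) of a Santa Fe-style process (illustrative sketch).

    Assumption: indices satisfy P(K >= k) = k ** -(1/beta - 1), so that
    P(K = k) decays roughly like k ** (-1/beta) and the number of distinct
    indices seen in n symbols grows roughly like n ** beta.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    index_rng = random.Random(seed)
    fact_rng = random.Random(seed + 1)
    facts = {}  # the bit z_k for each index k, fixed once it has been drawn
    text = []
    tail_exponent = 1.0 / beta - 1.0
    for _ in range(n):
        u = 1.0 - index_rng.random()  # uniform on (0, 1], avoids division by zero
        # Inverse-transform sampling; beta close to 1 may overflow floats.
        k = int(u ** (-1.0 / tail_exponent))
        if k not in facts:
            # Drawing the bit lazily is equivalent to fixing all bits in advance.
            facts[k] = fact_rng.getrandbits(1)
        text.append((k, facts[k]))
    return text


def count_distinct_facts(text):
    """Number of distinct facts z_k that can be read off the text."""
    return len({k for k, _ in text})


if __name__ == "__main__":
    text = sample_santa_fe(100_000)
    for n in (100, 1_000, 10_000, 100_000):
        print(n, count_distinct_facts(text[:n]))
```

Under these assumptions, the printed counts should grow much more slowly than n but without saturating, which is the qualitative behavior the perigraphic property describes. The sketch does not implement the PPM-based detection of word-like strings.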
