Types, Tokens, and Hapaxes: A New Heap's Law

12/31/2018
by Victor Davis, et al.

Heap's Law states that in a large enough text corpus, the number of types as a function of tokens grows as N=KM^β for some free parameters K,β. Much has been written about how this result and various generalizations can be derived from Zipf's Law. Here we derive from first principles a completely novel expression of the type-token curve and prove its superior accuracy on real text. This expression naturally generalizes to equally accurate estimates for counting hapaxes and higher n-legomena.
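For orientation, the classical power law N=KM^β quoted above can be fitted to a token stream with an ordinary least-squares line in log-log space. The sketch below is only an illustration of that baseline and of what types, tokens, and hapaxes are; it is not the new expression derived in the paper, which the abstract does not state. The function names (type_token_curve, fit_heaps, count_hapaxes) are hypothetical, and whitespace tokenization is an assumption.

```python
import math
from collections import Counter


def type_token_curve(tokens):
    """Return [(M, N)]: N distinct types seen among the first M tokens."""
    seen = set()
    curve = []
    for m, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((m, len(seen)))
    return curve


def fit_heaps(curve):
    """Least-squares fit of log N = log K + beta * log M; returns (K, beta).

    Assumes the curve has at least two distinct M values and every N > 0.
    """
    xs = [math.log(m) for m, _ in curve]
    ys = [math.log(n) for _, n in curve]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


def count_hapaxes(tokens):
    """Number of types that occur exactly once (hapax legomena)."""
    return sum(1 for c in Counter(tokens).values() if c == 1)


if __name__ == "__main__":
    # Toy input; a real corpus would be far larger (assumed whitespace tokens).
    toy = "the cat sat on the mat and the dog sat too".split()
    k, beta = fit_heaps(type_token_curve(toy))
    print(f"K={k:.3f} beta={beta:.3f} hapaxes={count_hapaxes(toy)}")
```

On a corpus this small the fitted K and β are not meaningful; the power law is only claimed for large enough corpora, so the toy run is just a check that the pieces fit together.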


