Verifying Heaps' law using Google Books Ngram data

12/29/2016
by Vladimir V. Bochkarev, et al.

This article verifies the empirical Heaps' law for European languages using Google Books Ngram corpus data. The connection between the word frequency distribution and the expected dependence of the number of distinct words on text size is analysed in terms of a simple probabilistic model of text generation. It is shown that the Heaps exponent varies significantly over characteristic time intervals of 60-100 years.
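The link between a word frequency distribution and vocabulary growth can be illustrated with a minimal sketch. It assumes a model that may differ from the one used in the article: words are drawn independently from a fixed Zipf-like distribution, so the expected number of distinct words in a text of N tokens is the sum over words of 1 - (1 - p_i)^N. The function names, the vocabulary size, and the Zipf exponent below are hypothetical choices made only for illustration.

```python
import math


def zipf_probabilities(vocab_size, exponent=1.0):
    """Normalized Zipf-like probabilities p_i proportional to 1 / rank**exponent (assumed model)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct_words(probs, text_size):
    """Expected vocabulary size after text_size independent draws.

    A word with probability p is seen at least once with probability
    1 - (1 - p)**text_size; summing over all words gives the expectation.
    """
    return sum(1.0 - (1.0 - p) ** text_size for p in probs)


def fit_heaps_exponent(sizes, vocab_counts):
    """Least-squares slope of log(V) against log(N), i.e. the Heaps exponent estimate."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in vocab_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Illustrative parameters only; they are not taken from the article.
probs = zipf_probabilities(vocab_size=50000, exponent=1.0)
sizes = [10 ** k for k in range(2, 6)]
counts = [expected_distinct_words(probs, n) for n in sizes]
beta = fit_heaps_exponent(sizes, counts)
print("estimated Heaps exponent:", round(beta, 3))
```

Under this assumed model the fitted exponent lies strictly between 0 and 1, because the expected vocabulary grows with text size but more slowly than the text itself. Applying the same fit to yearly vocabulary and token counts from a real corpus would be the natural next step, though the data loading is left out here.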


research
04/21/2018

Taylor's law for Human Linguistic Sequences

Taylor's law describes the fluctuation characteristics underlying a syst...
research
11/23/2018

Rank-frequency distribution of natural languages: a difference of probabilities approach

The time variation of the rank k of words for six Indo-European language...
research
10/20/2017

Is space a word, too?

For words, rank-frequency distributions have long been heralded for adhe...
research
05/14/2021

From Multisets over Distributions to Distributions over Multisets

A well-known challenge in the semantics of programming languages is how ...
research
03/05/2020

An Empirical Accuracy Law for Sequential Machine Translation: the Case of Google Translate

We have established, through empirical testing, a law that relates the n...
research
03/30/2020

Empirical Analysis of Zipf's Law, Power Law, and Lognormal Distributions in Medical Discharge Reports

Bayesian modelling and statistical text analysis rely on informed probab...
research
10/24/2018

Evolution of semantic networks in biomedical texts

Language is hierarchically organized: words are built into phrases, sent...
