# A Simple Derivation of the Heaps' Law from the Generalized Zipf's Law

I reproduce a rather simple formal derivation of the Heaps' law from the generalized Zipf's law, which I previously published in Russian.

## 1 Introduction

There are two well-known regularities in natural language texts, which are known as Zipf’s and Heaps’ laws. According to the original Zipf’s law, the probability $p_i$ of encountering the $i$-th most frequent word is inversely proportional to the word’s rank $i$:

$$p_i = O(1/i).$$

This law (which was not actually discovered by Zipf [5]) is not applicable to arbitrarily large texts. The obvious reason is that the sum of inverse ranks does not converge to a finite number. A slightly generalized variant, henceforth the generalized Zipf’s law, is likely a more accurate text model:

$$p_i = O(1/i^{\alpha}),$$

where $\alpha > 1$.

Heaps’ law [3] (also discovered by Herdan [2]) approximates the number of unique words in a text of length $n$:

$$\left|\bigcup_{i=1}^{n} \{w_i\}\right| = O(n^{\beta}),$$

where $0 < \beta < 1$. Heaps’ law thus says that the number of unique words grows sub-linearly, as a power function of the total number of words with an exponent strictly smaller than one.

Somewhat surprisingly, Baeza-Yates and Navarro [1] argued (although a bit informally) that the constants in Heaps’ and Zipf’s laws are reciprocal numbers:

$$\alpha \approx 1/\beta.$$

They also verified this relationship empirically.
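This reciprocity is easy to probe by simulation under an i.i.d. sampling model. Below is a minimal sketch (mine, not from the original work): `zipf_sample` and `heaps_exponent` are hypothetical helpers, and the vocabulary size, text lengths, seed, and the choice $\alpha = 1.5$ are all arbitrary.

```python
import math
import random

def zipf_sample(n_words, alpha, vocab_size=100_000, seed=42):
    """Sample n_words words i.i.d. from a truncated generalized Zipf
    distribution p_i ~ 1 / i**alpha over ranks 1..vocab_size."""
    rng = random.Random(seed)
    weights = [1.0 / i**alpha for i in range(1, vocab_size + 1)]
    return rng.choices(range(1, vocab_size + 1), weights=weights, k=n_words)

def heaps_exponent(alpha, n_small=10_000, n_large=1_000_000):
    """Fit beta from unique-word counts at two text lengths."""
    u_small = len(set(zipf_sample(n_small, alpha)))
    u_large = len(set(zipf_sample(n_large, alpha)))
    return math.log(u_large / u_small) / math.log(n_large / n_small)

alpha = 1.5
beta = heaps_exponent(alpha)
print(f"alpha = {alpha}, fitted beta = {beta:.2f}, 1/alpha = {1 / alpha:.2f}")
```

With these settings the fitted exponent comes out close to $1/\alpha \approx 0.67$, though the truncated vocabulary and the two-point fit make this a rough check rather than a measurement.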

This work inspired several people (including yours truly) to formally derive Heaps’ law from the generalized Zipf’s law (all derivations seem to have relied on a text-generation process where words are sampled independently). It is hard to tell who did it first. In particular, I have a recollection that Amir Dembo (from Stanford) produced an analogous derivation (not relying on the property of the Gamma function), but, apparently, he did not publish his result. Van Leijenhorst and van der Weide published a more general result in 2005 (their derivation starts from the Zipf-Mandelbrot distribution rather than from the generalized Zipf’s law) [6]. My own proof was published in Russian in 2003 [7]. Here, I reproduce it for completeness. I tried to keep it as simple as possible: some of the more formal arguments are given in footnotes.

## 2 Formal Derivation

As a reminder, we assume that the text is created by a random process in which words are sampled independently from an infinite vocabulary. This is not the most realistic assumption; however, it is not clear how one could incorporate word dependencies into the proof. The probability of sampling the $i$-th most frequent word is defined by the generalized Zipf’s law, i.e.,

$$p_i = \frac{1}{H(\alpha)\, i^{\alpha}}, \tag{*}$$

where $\alpha > 1$ and $H(\alpha) = \sum_{i=1}^{\infty} 1/i^{\alpha}$ is a normalizing constant.

The number of unique words in the text is also a random variable $X$, which can be represented as an infinite sum of random variables $X_i$. Note that $X_i$ is equal to one if the text contains at least one occurrence of the word $i$ and is zero otherwise. The objective of this proof is to estimate the expected number of unique words $\mathbb{E}X$:

$$\mathbb{E}X = \mathbb{E}\left(\sum_{i=1}^{\infty} X_i\right)$$

The proof only cares about the asymptotic behavior of $\mathbb{E}X$ with respect to the total number of text words $n$, i.e., all the derivations are big-O estimates.

Because words are sampled randomly and independently, the probability of not selecting the word $i$ after $n$ trials is equal to $(1-p_i)^{n}$. Hence, $X_i$ has the Bernoulli distribution with the success probability

$$p(X_i = 1) = 1 - (1-p_i)^{n},$$

where $p_i$ is the probability of word occurrence according to the generalized Zipf’s law given by Eq. (*). Therefore, we can rewrite the expected number of unique words as follows:

$$\mathbb{E}X = \sum_{i=1}^{\infty} \left[1-(1-p_i)^{n}\right] = \sum_{i=1}^{\infty} \left[1 - \left(1 - \frac{1}{H(\alpha)\, i^{\alpha}}\right)^{n}\right]$$
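As a numeric sanity check (mine, not part of the original proof), the series can be evaluated with a truncated sum; the helper name `expected_unique`, the cutoff, and the choice $\alpha = 2$ are arbitrary.

```python
import math

def expected_unique(n, alpha, trunc=200_000):
    """Truncated numeric evaluation of EX = sum_i [1 - (1 - p_i)^n],
    with p_i = 1 / (H(alpha) * i**alpha); trunc is an arbitrary cutoff."""
    H = sum(1.0 / i**alpha for i in range(1, trunc + 1))  # approximates H(alpha)
    return sum(1.0 - (1.0 - 1.0 / (H * i**alpha))**n for i in range(1, trunc + 1))

alpha = 2.0
e1 = expected_unique(10_000, alpha)
e2 = expected_unique(1_000_000, alpha)
# growth exponent of EX between the two text lengths; should be close to 1/alpha
slope = math.log(e2 / e1) / math.log(100)
print(f"estimated exponent {slope:.3f} vs 1/alpha = {1 / alpha}")
```

The fitted growth exponent lands very near $1/\alpha = 0.5$, anticipating the result derived below.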

• Because $1-(1-p_i)^{n} \le n\,p_i$ and $\sum_{i=1}^{\infty} p_i$ is a convergent series, our series converges.[^1]

[^1]: The upper bound for the series term follows from $1-(1-p)^{n} = p \sum_{k=0}^{n-1} (1-p)^{k}$, which is upper bounded by $n p$ for $0 \le p \le 1$ and positive $n$.

• The summation term can be interpreted as a real-valued function $f(x) = 1 - \left(1 - \frac{1}{H(\alpha)\, x^{\alpha}}\right)^{n}$ of the variable $x$. The value of this function decreases monotonically with $x$. The function is positive for $x \ge 1$ and is upper bounded by one.[^2]

[^2]: $\frac{1}{H(\alpha)\,x^{\alpha}}$ decreases with $x$; $1-\frac{1}{H(\alpha)\,x^{\alpha}}$ increases with $x$; $\left(1-\frac{1}{H(\alpha)\,x^{\alpha}}\right)^{n}$ increases with $x$; $1-\left(1-\frac{1}{H(\alpha)\,x^{\alpha}}\right)^{n}$ decreases with $x$.

Thanks to these properties, we can replace the sum of the series with the following big-O equivalent integral from $1$ to $\infty$:[^3]

$$\int_{1}^{\infty} \left[1-\left(1-\frac{1}{H(\alpha)\, x^{\alpha}}\right)^{n}\right] dx \tag{**}$$

[^3]: Using monotonicity, it is easy to show that the integral from $1$ to $\infty$ is smaller than the sum of the series, while the integral from $0$ to $\infty$ is larger than the sum of the series. The difference between the two integral values is less than one.
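The sum-versus-integral sandwich can be checked numerically. A minimal sketch follows; choosing $\alpha = 2$ (so that $H(\alpha) = \pi^2/6$ is available in closed form), together with the values of $n$ and the truncation point $M$, are my arbitrary choices.

```python
import math

alpha = 2.0
H = math.pi**2 / 6     # H(alpha) = zeta(2) in closed form for alpha = 2
n = 10_000

def f(x):
    """The summand of EX, read as a real-valued decreasing function of x >= 1."""
    return 1.0 - (1.0 - 1.0 / (H * x**alpha))**n

M = 100_000  # arbitrary truncation point; terms beyond M are negligible here
series = sum(f(i) for i in range(1, M + 1))
# Trapezoid sum over [1, M]; for this f it differs from the series
# by exactly (f(1) + f(M)) / 2, which is at most f(1) <= 1.
integral = sum((f(i) + f(i + 1)) / 2 for i in range(1, M))
print(f"series = {series:.2f}, trapezoid integral = {integral:.2f}")
```

The two values agree to within less than one, matching the footnote's claim that the sum and the integral are big-O equivalent.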

Using the variable substitution $y = H(\alpha)^{1/\alpha}\, x$, we rewrite (**) as follows:

$$\frac{1}{H(\alpha)^{1/\alpha}} \int_{H(\alpha)^{1/\alpha}}^{\infty} \left[1-\left(1-\frac{1}{y^{\alpha}}\right)^{n}\right] dy$$

Because the integrand is positive and upper bounded by one, the value of the integral over the segment $[1, H(\alpha)^{1/\alpha}]$ is a constant with respect to $n$ (note that $H(\alpha) > 1$, so $H(\alpha)^{1/\alpha} > 1$). The factor $1/H(\alpha)^{1/\alpha}$ is a constant as well. Therefore, the value of the integral is big-O equivalent to the value of the following integral, which goes from one to infinity:

$$\int_{1}^{\infty} \left[1-\left(1-\frac{1}{y^{\alpha}}\right)^{n}\right] dy$$

We further rewrite this by applying the binomial theorem to the integrand:

$$\int_{1}^{\infty} \left[1-\left(1-\frac{1}{y^{\alpha}}\right)^{n}\right] dy = \int_{1}^{\infty} \left(1-\sum_{i=0}^{n} (-1)^{i} \binom{n}{i} \frac{1}{y^{\alpha i}}\right) dy = \int_{1}^{\infty} \left(\sum_{i=1}^{n} (-1)^{i+1} \binom{n}{i} \frac{1}{y^{\alpha i}}\right) dy$$
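This algebraic step can be spot-checked exactly with rational arithmetic; `expand` is a helper of mine, and the sample values of $n$ and $u$ are arbitrary.

```python
from fractions import Fraction
from math import comb

def expand(u, n):
    """sum_{i=1}^{n} (-1)**(i+1) * C(n, i) * u**i -- the expanded integrand."""
    return sum((-1)**(i + 1) * comb(n, i) * u**i for i in range(1, n + 1))

# Exact rational check of 1 - (1 - u)**n against its binomial expansion
for n in (1, 2, 5, 17):
    for u in (Fraction(1, 3), Fraction(7, 10)):
        assert 1 - (1 - u)**n == expand(u, n)
print("binomial expansion verified exactly")
```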

Because $\alpha > 1$, the integral of every summand in the integrand converges absolutely.[^4] Hence, the integral of the finite sum is equal to the following sum of integrals:

[^4]: This concerns the convergence of the integral with respect to its infinite upper bound only: each summand $1/y^{\alpha i}$ with $i \ge 1$ is integrable on $[1, \infty)$ because $\alpha i > 1$.

$$\sum_{i=1}^{n} \binom{n}{i} (-1)^{i+1} \int_{1}^{\infty} \frac{dy}{y^{\alpha i}} = \sum_{i=1}^{n} \binom{n}{i} (-1)^{i+1} \frac{1}{i\alpha - 1} = -\frac{1}{\alpha} \sum_{i=1}^{n} \binom{n}{i} (-1)^{i} \frac{1}{i - 1/\alpha} =$$

(extending the sum to $i=0$ adds a term that, together with the leading factor $-1/\alpha$, is equal to one; we compensate for it by subtracting one)

$$= -1 - \frac{1}{\alpha} \sum_{i=0}^{n} \binom{n}{i} (-1)^{i} \frac{1}{i - 1/\alpha} \tag{***}$$

Using induction, one can demonstrate that (also see [4, §1.2.6, Exercise 48]):

$$\sum_{i=0}^{n} \binom{n}{i} \frac{(-1)^{i}}{i + x} = \frac{n!}{x(x+1)\cdots(x+n)}$$
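This identity can be verified exactly for sample values with rational arithmetic; the helper names and the tested $n$ and $x$ (including a negative $x = -1/\alpha$-style value) are my choices.

```python
from fractions import Fraction
from math import comb, factorial

def lhs(n, x):
    """sum_{i=0}^{n} C(n, i) * (-1)**i / (i + x), in exact rational arithmetic."""
    return sum(comb(n, i) * (-1)**i / (i + x) for i in range(n + 1))

def rhs(n, x):
    """n! / (x (x+1) ... (x+n))."""
    denom = Fraction(1)
    for k in range(n + 1):
        denom *= x + k
    return factorial(n) / denom

# spot-check the identity for several n and rational x (including negative x)
for n in (1, 3, 8):
    for x in (Fraction(1, 2), Fraction(-2, 3), Fraction(5, 1)):
        assert lhs(n, x) == rhs(n, x)
print("identity verified exactly")
```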

This allows us to rewrite Eq. (***) (substituting $x = -1/\alpha$) as follows:

$$-1 - \frac{1}{\alpha} \cdot \frac{n!}{(-1/\alpha)(1 - 1/\alpha)\cdots(n - 1/\alpha)}$$

Now, using the formula

$$\Gamma(x) = \lim_{n\to\infty} \frac{n^{x}\, n!}{x(x+1)\cdots(x+n)}$$

and its corollary

$$\frac{n!}{x(x+1)\cdots(x+n)} = O\left(\Gamma(x)\cdot n^{-x}\right)$$

with $x = -1/\alpha$, we obtain that Eq. (***) is big-O equivalent to

$$-\frac{1}{\alpha}\,\Gamma(-1/\alpha)\cdot n^{1/\alpha} = O(n^{1/\alpha})$$

(note that $\Gamma(-1/\alpha) < 0$ for $\alpha > 1$, so the leading constant is positive).
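The Gamma-function corollary is easy to verify numerically. Below, a minimal sketch: `frac` is a helper of mine, and the choice $\alpha = 2$ (so $x = -1/2$) and the sample values of $n$ are arbitrary.

```python
import math

def frac(n, x):
    """n! / (x (x+1) ... (x+n)), computed as a running product to avoid overflow."""
    val = 1.0 / x
    for k in range(1, n + 1):
        val *= k / (x + k)
    return val

alpha = 2.0
x = -1.0 / alpha           # the substitution used in the derivation
for n in (100, 1_000, 10_000):
    # the ratio frac(n, x) / (Gamma(x) * n**(-x)) should approach one
    print(n, frac(n, x) / (math.gamma(x) * n**(-x)))
```

The printed ratios approach one as $n$ grows, consistent with the limit formula.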

In other words, the exponent $\beta$ in Heaps’ law is the reciprocal of the exponent $\alpha$ in the generalized Zipf’s law: $\beta = 1/\alpha$.

## References

• [1] Ricardo A. Baeza-Yates and Gonzalo Navarro. Block addressing indices for approximate text retrieval. JASIS, 51(1):69–82, 2000.
• [2] Leo Egghe. Untangling Herdan’s Law and Heaps’ Law: Mathematical and informetric arguments. J. Am. Soc. Inf. Sci. Technol., 58(5):702–709, March 2007.
• [3] H. S. Heaps. Information Retrieval: Computational and Theoretical Aspects. Academic Press, Inc., Orlando, FL, USA, 1978.
• [4] Donald Ervin Knuth. The Art of Computer Programming, Volume I: Fundamental Algorithms, 3rd Edition. Addison-Wesley, 1997.
• [5] David M. W. Powers. Applications and explanations of Zipf’s Law. In Proceedings of the Joint Conference on New Methods in Language Processing and Computational Natural Language Learning, NeMLaP/CoNLL 1998, Macquarie University, Sydney, NSW, Australia, January 11-17, 1998, pages 151–160, 1998.
• [6] D.C. van Leijenhorst and Th.P. van der Weide. A formal derivation of Heaps’ Law. Information Sciences, 170(2):263 – 272, 2005.
• [7] L. M. Boytsov. Synthesis of a system for automatic correction, indexing, and search of textual information (in Russian), 2003.