1 Introduction
There are two well-known regularities in natural language texts: Zipf’s and Heaps’ laws. According to the original Zipf’s law, the probability p(r) of encountering the r-th most frequent word is inversely proportional to the word’s rank r:

p(r) = C / r

This law (which was not actually discovered by Zipf [5]) is not applicable to arbitrarily large texts. The obvious reason is that the sum of inverse ranks does not converge to a finite number. A slightly generalized variant, henceforth the generalized Zipf’s law, is likely a more accurate text model:

p(r) = C / r^α,

where α > 1.
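As an illustration (my own sketch, not part of the original argument), the generalized law can be evaluated numerically. For an infinite vocabulary the normalizing constant is 1/ζ(α); the truncation level and the integral tail estimate below are arbitrary choices of mine:

```python
import math

def zeta(alpha, terms=100_000):
    """Approximate the Riemann zeta function for alpha > 1 by a partial
    sum plus an integral estimate of the tail."""
    partial = sum(r ** -alpha for r in range(1, terms + 1))
    tail = terms ** (1.0 - alpha) / (alpha - 1.0)
    return partial + tail

alpha = 2.0
C = 1.0 / zeta(alpha)  # normalizing constant: the probabilities sum to one

# probabilities of the five most frequent words under generalized Zipf's law
probs = [C / r ** alpha for r in range(1, 6)]
```

For α = 2 the normalizer is 1/ζ(2) = 6/π², so the approximation is easy to sanity-check.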
Heaps’ law [3] (also discovered by Herdan [2]) approximates the number of unique words U(n) in a text of length n:

U(n) ≈ K · n^β,

where 0 < β < 1. Heaps’ law says that the number of unique words grows roughly sublinearly as a power function of the total number of words (with the exponent β strictly smaller than one).
Somewhat surprisingly, Baeza-Yates and Navarro [1] argued (although a bit informally) that the constants in Heaps’ and Zipf’s laws are reciprocal numbers:

β = 1/α

They also verified this empirically.
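This reciprocal relationship can be illustrated with a quick simulation (my own sketch, not taken from [1]): sample words independently from a truncated generalized Zipf distribution with α = 2 and fit the Heaps exponent from the growth of the number of unique words. The vocabulary size, sample sizes, and seed are arbitrary choices:

```python
import bisect
import math
import random

def sample_unique_counts(alpha, vocab_size, checkpoints, seed=42):
    """Sample words independently from a truncated generalized Zipf
    distribution; record the number of unique words at each checkpoint."""
    rng = random.Random(seed)
    weights = [r ** -alpha for r in range(1, vocab_size + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w
        cdf.append(acc / total)
    seen, counts, n = set(), [], 0
    for target in checkpoints:
        while n < target:
            # inverse-CDF sampling: the first rank whose CDF exceeds u
            u = rng.random()
            seen.add(bisect.bisect_left(cdf, u))
            n += 1
        counts.append(len(seen))
    return counts

alpha = 2.0
u1, u2 = sample_unique_counts(alpha, 1_000_000, [10_000, 100_000])
beta_hat = math.log(u2 / u1) / math.log(10)  # fitted Heaps exponent
```

With α = 2 the fitted exponent should land near 1/α = 0.5, up to sampling noise.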
This work inspired several people (including yours truly) to formally derive Heaps’ law from the generalized Zipf’s law (all derivations seem to have relied on a text-generation process where words are sampled independently). It is hard to tell who did it earlier. In particular, I have a recollection that Amir Dembo (from Stanford) produced an analogous derivation (not relying on the property of the Gamma function), but, apparently, he did not publish his result. Van Leijenhorst and van der Weide published a more general result (their derivation starts from the Zipf-Mandelbrot distribution rather than from the generalized Zipf’s law) in 2005 [6]. My own proof was published in Russian in 2003 [7]. Here, I reproduce it for completeness. I tried to keep it as simple as possible: some of the more formal arguments are given in footnotes.

2 Formal Derivation
As a reminder, we assume that the text is created by a random process where words are sampled independently from an infinite vocabulary. This is not the most realistic assumption; however, it is not clear how one can incorporate word dependencies into the proof. The probability of sampling the r-th most frequent word is defined by the generalized Zipf’s law, i.e.,

p(r) = C / r^α,   (*)

where α > 1 and C is a normalizing constant.
The number of unique words in the text is also a random variable U(n), which can be represented as an infinite sum of random variables X_r. Note that X_r is equal to one if the text contains at least one occurrence of the r-th word and is zero otherwise. The objective of this proof is to estimate the expected number of unique words E[U(n)]:

E[U(n)] = Σ_{r=1}^∞ E[X_r]

The proof only cares about the asymptotic behavior of E[U(n)] with respect to the total number of text words n, i.e., all the derivations are big-O estimates.
Because words are sampled randomly and independently, the probability of not selecting word r after n trials is equal to (1 - p(r))^n. Hence, X_r has the Bernoulli distribution with the success probability

1 - (1 - p(r))^n,

where p(r) = C / r^α is the probability of word occurrence according to the generalized Zipf’s law given by Eq. (*). Therefore, we can rewrite the expected number of unique words as follows:

E[U(n)] = Σ_{r=1}^∞ [1 - (1 - C / r^α)^n]
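This expectation can be evaluated numerically by truncating the infinite sum. The sketch below (the truncation level is my own choice) checks that, for α = 2, the expected number of unique words grows with an exponent close to 1/α:

```python
import math

def expected_unique(n, alpha, vocab=300_000):
    """E[U(n)] = sum over ranks r of 1 - (1 - p(r))^n, with the infinite
    vocabulary truncated at `vocab` ranks (tail terms are negligible here)."""
    C = 1.0 / sum(r ** -alpha for r in range(1, vocab + 1))
    return sum(1.0 - (1.0 - C * r ** -alpha) ** n for r in range(1, vocab + 1))

u1 = expected_unique(10_000, alpha=2.0)
u2 = expected_unique(160_000, alpha=2.0)
# empirical growth exponent over a 16x increase in text length
growth = math.log(u2 / u1) / math.log(16.0)
```

Unlike the sampling simulation, this computation is deterministic: it evaluates the expectation directly rather than one random realization of it.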
What can we say about this series in general and about the summation term in particular?

Because 1 - (1 - C / r^α)^n ≤ n · C / r^α and Σ_{r=1}^∞ n · C / r^α is a convergent series (recall that α > 1), our series converges.^1

^1 The upper bound for the series term follows from Bernoulli’s inequality (1 - x)^n ≥ 1 - n·x, which holds for x ≤ 1 and positive n.

The summation term can be interpreted as a real-valued function f(x) = 1 - (1 - C / x^α)^n of the variable x. The value of this function decreases monotonically with x. The function is positive for x ≥ 1 and is upper bounded by one.^2

^2 C / x^α decreases with x; 1 - C / x^α increases with x; (1 - C / x^α)^n increases with x; 1 - (1 - C / x^α)^n decreases with x.
Thanks to these properties, we can replace the sum of the series with the following big-O equivalent integral from 0 to ∞:^3

∫_0^∞ [1 - (1 - C / x^α)^n] dx   (**)

^3 Using monotonicity it is easy to show that the integral from 1 to ∞ is smaller than the sum of the series, but the integral from 0 to ∞ is larger than the sum of the series. The difference between the two integral values is less than one.
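The sum-vs-integral argument rests on monotonicity: for a decreasing f, the integral over [r, r+1] is sandwiched between f(r+1) and f(r). A small numeric check of this property (α, n, and the stand-in constant C below are arbitrary values of mine):

```python
alpha, n, C = 1.5, 1000, 0.3  # C stands in for the Zipf normalizing constant

def f(x):
    """The summation term viewed as a function of a real variable."""
    return 1.0 - (1.0 - C * x ** -alpha) ** n

def integral(a, b, steps=2000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# f decreases, so f(r + 1) <= integral over [r, r + 1] <= f(r)
checks = [(f(r + 1), integral(r, r + 1), f(r)) for r in (20, 50, 100, 300)]
```

Summing these per-unit sandwiches over all r is what bounds the whole series between the two integrals mentioned in the footnote.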
Using the variable substitution x = C^{1/α} · y, we rewrite (**) as follows:

C^{1/α} · ∫_0^∞ [1 - (1 - 1/y^α)^n] dy

Because the integrand is positive and upper bounded by one, the value of the integral over the segment [0, 1] is a constant with respect to n. C^{1/α} is a constant as well. Therefore, the value of the integral is big-O equivalent to the value of the following integral, which goes from one to infinity:

∫_1^∞ [1 - (1 - 1/y^α)^n] dy
We further rewrite this by applying the binomial theorem to the integrand:

∫_1^∞ [1 - Σ_{k=0}^n (n choose k) · (-1)^k / y^{α·k}] dy

Because α·k ≥ α > 1, every summand in the integrand has a convergent integral.^4 Hence, the integral of the finite sum is equal to the following sum of integrals (the term for k = 0 is equal to minus one and cancels the leading one):

Σ_{k=1}^n (-1)^{k+1} · (n choose k) · ∫_1^∞ y^{-α·k} dy = Σ_{k=1}^n (-1)^{k+1} · (n choose k) / (α·k - 1)   (***)

^4 This is concerned with the convergence of the integral with respect to its infinite upper bound.
Using induction one can demonstrate that (also see [4, §1.2.6, Exercise 48]):

Σ_{k=0}^n (-1)^k · (n choose k) / (k + x) = n! / (x · (x+1) · ... · (x+n))

This allows us to rewrite Eq. (***) as follows (substitute x = -1/α and note that the k = 0 term equals 1/x = -α):

Σ_{k=1}^n (-1)^{k+1} · (n choose k) / (α·k - 1) = n! / ((x+1) · (x+2) · ... · (x+n)) - 1, where x = -1/α
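The alternating-sum identity can be checked exactly with rational arithmetic. A sketch of my own, using x = -1/α for a few small n:

```python
from fractions import Fraction
from math import comb, factorial

def alternating_sum(n, x):
    """Left-hand side: sum over k of (-1)^k * (n choose k) / (k + x)."""
    return sum(Fraction((-1) ** k * comb(n, k)) / (k + x) for k in range(n + 1))

def product_form(n, x):
    """Right-hand side: n! / (x * (x+1) * ... * (x+n))."""
    denom = Fraction(1)
    for j in range(n + 1):
        denom *= x + j
    return Fraction(factorial(n)) / denom

x = Fraction(-1, 2)  # x = -1/alpha for alpha = 2
```

Exact `Fraction` arithmetic avoids any floating-point doubt about whether the two sides agree.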
Now, using the formula

n! / ((x+1) · (x+2) · ... · (x+n)) = Γ(n+1) · Γ(x+1) / Γ(n+x+1)

and its corollary

Γ(n+1) · Γ(x+1) / Γ(n+x+1) = O(n^{-x})

with x = -1/α we obtain that (***) is big-O equivalent to

n^{1/α}

In other words, the constant β in Heaps’ law is inversely proportional to the constant α in the generalized Zipf’s law: β = 1/α.
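The Gamma-function asymptotic used in the last step, Γ(n+1)/Γ(n+1-1/α) = O(n^{1/α}), can be sanity-checked numerically (α = 2 below is my arbitrary choice):

```python
from math import exp, lgamma

def gamma_ratio(n, a):
    """Gamma(n + 1) / Gamma(n + 1 - a), computed via log-gamma for stability."""
    return exp(lgamma(n + 1.0) - lgamma(n + 1.0 - a))

a = 0.5  # a = 1/alpha for alpha = 2
# normalized ratios approach 1, confirming Gamma(n+1)/Gamma(n+1-a) ~ n^a
ratios = [gamma_ratio(n, a) / n ** a for n in (10, 100, 10_000)]
```

Working with `lgamma` rather than `gamma` avoids overflow: Γ(10001) is far beyond the range of a float, while its logarithm is not.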
References
 [1] Ricardo A. Baeza-Yates and Gonzalo Navarro. Block addressing indices for approximate text retrieval. JASIS, 51(1):69–82, 2000.
 [2] Leo Egghe. Untangling Herdan’s Law and Heaps’ Law: Mathematical and informetric arguments. J. Am. Soc. Inf. Sci. Technol., 58(5):702–709, March 2007.
 [3] H. S. Heaps. Information Retrieval: Computational and Theoretical Aspects. Academic Press, Inc., Orlando, FL, USA, 1978.
 [4] Donald Ervin Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, 3rd Edition. Addison-Wesley, 1997.
 [5] David M. W. Powers. Applications and explanations of Zipf’s Law. In Proceedings of the Joint Conference on New Methods in Language Processing and Computational Natural Language Learning, NeMLaP/CoNLL 1998, Macquarie University, Sydney, NSW, Australia, January 11-17, 1998, pages 151–160, 1998.
 [6] D.C. van Leijenhorst and Th.P. van der Weide. A formal derivation of Heaps’ Law. Information Sciences, 170(2):263–272, 2005.
 [7] L. M. Boytsov. Synthesis of a system for automatic correction, indexing, and search of textual information (in Russian). 2003.