1 Introduction
The problem of estimating the number $n$ of distinct keys of a large collection of data is well known in computer science. It arises in query optimization of database systems. A classical algorithm is adaptive sampling (AS). The mean and variance of AS are considered in Flajolet [FL90]. Let us summarize the principal features of AS. Elements of the given set of data are hashed into binary keys. These keys are infinitely long bit streams such that each bit has probability $1/2$ of being $0$ or $1$. A uniformity assumption is made on the hashing function.
The algorithm keeps a bucket (or cache) $B$ of at most $b$ distinct keys. The depth of sampling, $d$, which is defined below, is also saved. We start with $d = 0$ and throw only distinct keys into $B$. When $B$ is full, the depth $d$ is increased by $1$, the bucket is scanned, and only keys starting with $0$ are kept. (If the bucket is still full, we wait until a new key starting with $0$ appears; then $d$ is again increased by $1$ and we keep only keys starting with $00$.) The scanning of the set is resumed and only distinct keys starting with $0$ are considered. More generally, at step $d$, only distinct keys starting with $0^d$ are taken into account. When we have exhausted the set of data, $n$ can be estimated by $Z = 2^{D} R$, where $R$ is the random final bucket (cache) size and $D$ is the final depth at the end of the process. We can summarize the algorithm with the following pseudocode.
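The original pseudocode is not reproduced here. The following Python sketch (our own illustration; the class and helper names are not from the original, and the simplified overflow rule may differ from the exact AS protocol in rare tie cases) captures the mechanism just described:

```python
import random

class AdaptiveSampling:
    """Sketch of adaptive sampling: keep at most b distinct keys whose
    hash value starts with d zero bits, tightening the filter on overflow."""

    def __init__(self, b):
        self.b = b        # maximum bucket (cache) size
        self.d = 0        # current sampling depth
        self.bucket = {}  # kept keys -> their 64-bit hash values

    def _hash(self, key):
        # Stand-in for a uniform hash; 64 bits are enough for a sketch,
        # since d stays far below 64 for realistic inputs.
        return random.Random(hash(key)).getrandbits(64)

    def _keep(self, h):
        # A key is kept iff its hash starts with d zero bits.
        return (h >> (64 - self.d)) == 0

    def insert(self, key):
        h = self._hash(key)
        if key in self.bucket or not self._keep(h):
            return
        self.bucket[key] = h
        while len(self.bucket) > self.b:
            # Overflow: increase the depth and rescan the bucket,
            # keeping only keys whose hash starts with d zero bits.
            self.d += 1
            self.bucket = {k: v for k, v in self.bucket.items()
                           if self._keep(v)}

    def estimate(self):
        # The estimator Z = 2^D * R of the number of distinct keys.
        return (2 ** self.d) * len(self.bucket)
```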
AS has some advantages in terms of processing time and of conceptual simplicity. As shown in [FL90], AS outperforms standard sorting methods by a factor of about $8$. In terms of storage consumption, using $b$ words of memory will provide a typical accuracy of $1.20/\sqrt{b}$.
This is to be contrasted again with sorting, where the auxiliary memory required has to be at least as large as the file itself. Finally, AS is an unbiased estimator of the cardinality of large files that necessitates minimal auxiliary storage and processes data in a single pass.
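As a quick sanity check of the unbiasedness and accuracy claims, one can run the sketch above over many trials (a minimal experiment, assuming the AdaptiveSampling class from the previous block; the quoted accuracy is asymptotic, so the match is only approximate):

```python
import statistics

def trial(t, n, b):
    sampler = AdaptiveSampling(b)
    for k in range(n):
        sampler.insert((t, k))   # n distinct keys, fresh hashes per trial t
    return sampler.estimate()

n, b, runs = 20_000, 64, 200
estimates = [trial(t, n, b) for t in range(runs)]
print("mean of Z      :", statistics.mean(estimates), "(true n =", n, ")")
print("relative error :", statistics.pstdev(estimates) / n,
      "vs 1.20/sqrt(b) =", 1.20 / b ** 0.5)
```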
Several new interesting questions can be asked about AS (some of them were suggested by P. Flajolet and popularized by J. Lumbroso). The distribution of $D$ is known (see [GL97]), but in Sec. 3 we rederive this distribution in a simpler way. In Sec. 4 we provide new results on the moments of $D$ and $Z$. The final cache size distribution is analyzed in Sec. 5. Colored keys are considered in Sec. 6: assume that we have a set of colors and that each key has some color. Assume also that, among the $n$ distinct keys, $n_C$ have color $C$ and that $n$ is large, such that $n_C = p\,n$ for some fixed $p$. We show how to estimate $p$. We consider keys with some multiplicity in Sec. 7: assume that, to each key, we attach a counter giving its observed multiplicity. Also, we assume that the multiplicities of color-$C$ keys are given by iid random variables (RV), with given distribution function, mean and variance (functions of $C$). We show how to estimate this mean and variance. Sec. 8 deals with the case where neither colors nor multiplicities are known: we want to estimate the keys' colors, their multiplicities and their number. An appendix is devoted to the case where the hashing function provides bits with probability different from $1/2$.

2 Preliminaries
Let us first give the main notations we will use throughout the paper. Other particular notations will be provided where needed.
From Flajolet [FL90], we have the exact distribution of the pair $(R, D)$:

(1) (display omitted; see [FL90])
We can now see Adaptive Sampling as an urn model, where $n$ balls (keys) are thrown into urns, a ball falling into urn $j$ with probability $2^{-j}$ (urn $j$ collects the keys whose hash starts with $0^{j-1}1$). We recall the main properties of such a model.
-
Asymptotic independence. We have asymptotic independence of urns, for all events related to urn $j$ containing $l$ balls. This is proved, by Poissonization–De-Poissonization, in [HL00], [ALP05] and [LP05]. This technique can be briefly described as follows. First we construct the corresponding generating function. Then we Poissonize (see, for instance, Jacquet and Szpankowski [JS98] for a general survey): instead of using a fixed number $n$ of balls, we use $N$ balls, where $N$ is a Poisson random variable with mean $n$. It follows that the urns become independent and the number of balls in urn $j$ is a Poisson random variable. We then turn to complex variables and, with Cauchy’s integral theorem, we De-Poissonize the generating function, using [JS98, Thm. 10.3 and Cor. 10.17]. The error term is $O(n^{-C})$, where $C$ is a positive constant. (A small numerical illustration of the Poissonization step is given after this list.)
-
Asymptotic distributions.
We obtain asymptotic distributions of the interesting random variables as follows. The number of balls in urn $j$ is asymptotically Poisson-distributed with parameter $n 2^{-j}$ (this is the classical Poisson approximation of the binomial distribution). Setting $j = \lg n + \eta$, the asymptotic number of balls in urn $j$ is thus equivalent to a Poisson random variable with parameter $2^{-\eta}$. The asymptotic distributions are related to Gumbel distribution functions (given by $\exp(-e^{-x})$) or convergent series of such. The error term is $O(n^{-C})$.
-
Extended summations. Some summations now go to $\infty$. This is justified, for example, in [LP05].
-
Uniform integrability. We have uniform integrability for the moments of our random variables. To show that the limiting moments are equivalent to the moments of the limiting distributions, we need a suitable rate of convergence. This is related to a uniform integrability condition (see Loève [LO63, Section 11.4]). For Adaptive Sampling, the rate of convergence is analyzed in detail in [momP]. The error term is $O(n^{-C})$.
-
Mellin transform. Asymptotic expressions for the moments are obtained by Mellin transforms (for a good reference to Mellin transforms, see Flajolet et al. [FGD95]). The error term is $O(n^{-C})$. We proceed as follows (see [momP] for detailed proofs): from the asymptotic properties of the urns, we have obtained the asymptotic distributions of our random variables of interest. Next we compute the Laplace transform of these distributions, from which we can derive the dominant part of probabilities and moments, as well as the (tiny) periodic part in the form of a Fourier series. This connection will be detailed in the next sections.
-
Fast decrease property. The Gamma function decreases exponentially in the direction $i\infty$:
$$|\Gamma(\sigma + it)| \sim \sqrt{2\pi}\, |t|^{\sigma - 1/2} e^{-\pi |t|/2}, \qquad |t| \to \infty.$$
This property also holds for all other functions we encounter, so inverting the Mellin transforms is easily justified.
-
Early approximations. If we compare the approach in this paper with other ones that appeared previously, then we can notice the following. Traditionally, one would stay with exact enumerations as long as possible, and only at a late stage move to asymptotics. Doing this, one would, in terms of asymptotics, carry many unimportant contributions around, which makes the computations quite heavy, especially when it comes to higher moments. Here, however, approximations are carried out as early as possible, and this allows for streamlined (and often automatic) computations of the higher moments.
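To illustrate the Poissonization step numerically (our own sketch, with arbitrary parameters): with a fixed number of balls the urn counts are negatively correlated, while with a Poisson number of balls they become independent Poisson variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n, runs = 1024, 4000

def counts_urns_1_2(num_balls):
    # Urn j >= 1 is chosen with probability 2^-j.
    u = rng.random(num_balls)
    urns = np.floor(-np.log2(u)).astype(int) + 1
    return (urns == 1).sum(), (urns == 2).sum()

fixed = np.array([counts_urns_1_2(n) for _ in range(runs)])
poiss = np.array([counts_urns_1_2(rng.poisson(n)) for _ in range(runs)])

# Fixed n: the multinomial constraint induces correlation (about -0.58 here).
# Poissonized: counts are independent Poisson(n * 2^-j), correlation ~ 0.
print("corr, fixed n  :", np.corrcoef(fixed.T)[0, 1])
print("corr, Poisson N:", np.corrcoef(poiss.T)[0, 1])
print("urn 1 mean/var :", poiss[:, 0].mean(), poiss[:, 0].var())  # both ~ n/2
```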
We set $\eta := j - \lg n$. Then (1) leads to

(2) (display omitted)
and similar functions for the other quantities of interest. Asymptotically, the distribution will be a periodic function of the fractional part of $\lg n$. The distribution does not converge in the weak sense; it does however converge along subsequences for which the fractional part of $\lg n$ is constant. This type of convergence is not uncommon in the analysis of algorithms. Many examples are given in [momP].
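This subsequence behaviour is easy to observe empirically (a minimal experiment, again assuming the AdaptiveSampling sketch from the introduction; it takes a little while to run): along $n = 2^{k+\theta}$ with $\theta$ fixed, $\mathbb{E}(D) - \lg n$ stabilizes to a value that depends on $\theta$.

```python
import math
import statistics

def mean_depth(n, b=32, runs=200):
    vals = []
    for t in range(runs):
        s = AdaptiveSampling(b)
        for k in range(n):
            s.insert((t, n, k))    # distinct keys, fresh hashes per trial
        vals.append(s.d)
    return statistics.mean(vals)

for theta in (0.0, 0.5):           # fixed fractional part of lg n
    for k in (10, 12, 14):
        n = round(2 ** (k + theta))
        # The difference approaches a constant depending only on theta,
        # illustrating the periodic (non-convergent) asymptotic behaviour.
        print(theta, n, round(mean_depth(n) - math.log2(n), 3))
```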
From (2), we compute the Laplace transform of the asymptotic distribution; with $L := \ln 2$, this transform yields both the dominant part and the periodic part of the probabilities and moments (display omitted).
The $k$-th moments of $Z$ are already given in [GL97] and [momP]. As shown in [GL97], we must have $b \geq k$. For the sake of completeness, we repeat them here, with $\left\{{k \atop i}\right\}$ denoting the Stirling number of the second kind and $\mathbb{V}(X)$ denoting the variance of the random variable $X$:

(3) (display omitted)

Here and throughout, $w$ will always denote a periodic function of $\lg n$ of small amplitude. Note that, in [FL90], Flajolet already computed $\mathbb{E}(Z)$ and $\mathbb{V}(Z)$.
3 Asymptotic distribution of $D$

$D$ essentially corresponds to the bit size of the estimate $Z$ and has some independent interest. Let us recover this distribution from (2). In the sequel, we will denote by $\widetilde{T}$ the asymptotic equivalent of $T$, with $T$ either an event or a random variable. We have, with $\eta = j - \lg n$:
Theorem 3.1

The asymptotic distribution of $D$, with $\eta = j - \lg n$, is given by (display omitted), with the auxiliary functions given by (display omitted).
Proof. Starting from (2), we obtain (display omitted), with (display omitted), or, equivalently, (display omitted). This is exactly Theorem 4.1 in [GL97], which we obtain here in a simpler way. $\square$
4 Moments of $D$ and $Z$

Recall that $D$ corresponds to the cost of AS in number of steps (cache editings). Two interesting parameters are given by the moments of $D$ and $Z$. Their asymptotic behaviour is given as follows, with $\psi$ denoting the digamma function (the logarithmic derivative of $\Gamma$).
Theorem 4.1
The moments of $D$ and $Z$ are asymptotically given by (displays omitted), where the constants involved are explicit in terms of $\psi$ and $\ln 2$ (display omitted).
Proof. Using the techniques developed in [momP], we obtain the dominant (constant) part of the moments of $D$ as follows: (display omitted), where the non-periodic component is given by (display omitted) and the corresponding periodic term is given by (display omitted). This was already computed in [momP], but with some errors; the first corrected values are provided here.
As $Z = 2^{D} R$, the rest of the theorem is immediate. $\square$
It will be useful to obtain an asymptotic expression for the expectation (non-periodic component) for large $b$. This is computed as follows. We first rewrite the expectation so as to isolate the term whose main contribution comes from large summation indices; an application of Stirling's formula, followed by Euler–Maclaurin summation, then yields, to first order,

(4) (display omitted)
5 Distribution of $R$

The asymptotic moments and distribution of the cache size $R$ are given as follows.
Theorem 5.1
The non-periodic components of the asymptotic moments and distribution of $R$ are given by (displays omitted). Similarly, the periodic components are given by (displays omitted).
Proof. We have (display omitted) and, with $H_b$ denoting the harmonic number, (display omitted). This quantity was already obtained in [GL97] after some complicated algebra! This leads to (display omitted), which is also the probability of the corresponding event; this is also easily obtained from (2). Figure 1 gives the asymptotic distribution of $R$ (figure not reproduced here).
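Since the figure is not reproduced, the shape of the cache-size distribution can be approximated by simulation (our own illustration, reusing the AdaptiveSampling sketch; parameters are arbitrary):

```python
from collections import Counter

b, n, runs = 16, 20_000, 300
dist = Counter()
for t in range(runs):
    s = AdaptiveSampling(b)
    for k in range(n):
        s.insert((t, k))
    dist[len(s.bucket)] += 1

for k in sorted(dist):              # empirical distribution of the final
    print(k, dist[k] / runs)        # cache size R (supported on 0..b)
```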
Conditioning on $D$, the expectation of the event $[R = k]$ is now given by (display omitted). The moments of $R$ are computed as follows. More generally, the generating function of $R$ is given by (display omitted). This leads to (display omitted). Similarly, the periodic components are given by (display omitted). $\square$
6 Colors
Assume that we have a set of colors and that each key has some color. Assume also that, among the $n$ distinct keys, $n_C$ have color $C$ and that $n$ is large, such that $n_C = p\,n$. The $R$ keys in the cache then contain $X$ keys with color $C$ and, conditioned on $R = k$, $X$ is asymptotically given by the binomial distribution $\mathrm{Bin}(k, p)$. We want to estimate $p$. We are interested in the distribution of the statistic $X/R$. We have
Theorem 6.1
The asymptotic moments of the statistic $X/R$ are given by

(5) (displays omitted)
Proof. We have

(6) (display omitted)

Now, conditioned on $R = k$, we have (display omitted). So, conditioned on $R = k$, (display omitted), and unconditioning leads to the theorem. $\square$
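To make the setup concrete, here is a small simulation (our own illustration, reusing the AdaptiveSampling sketch; colors are assigned independently of the hash) that estimates the color fraction $p$ by the cache proportion $X/R$:

```python
import random

def estimate_color_fraction(n, p, b, t=0):
    """Estimate p = n_C / n by X/R, the proportion of color-C keys
    among the keys left in the final cache."""
    rng = random.Random(t)
    s = AdaptiveSampling(b)
    color = {}
    for k in range(n):
        key = (t, k)
        color[key] = rng.random() < p     # True = color C
        s.insert(key)
    cache = list(s.bucket)
    x = sum(color[key] for key in cache)  # X = number of color-C keys
    return x / len(cache)                 # the statistic X/R

ests = [estimate_color_fraction(20_000, 0.3, 64, t) for t in range(100)]
print(sum(ests) / len(ests))   # close to p = 0.3, as in Theorem 6.1
```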
Intuitively, if the cache size $R$ is large, we should have an asymptotic Gaussian distribution for $X/R$. Actually, the fit is quite good, already for moderate values of $b$. This is proved as follows.
6.1 The distribution of $X/R$ for large $R$
Let $M$ be a (possibly degenerate) random variable taking values on the (strictly) positive integers. Conditioning on $M$, let $X \sim \mathrm{Bin}(M, p)$ for some known $p \in (0,1)$, and set $W = (X - Mp)/\sqrt{M p (1-p)}$. It appears that, as $M$ grows large, the distribution of $W$ becomes asymptotically Gaussian. This claim can be made precise as follows.
Theorem 6.2
Let $G \sim \mathcal{N}(0,1)$ and write $d_{\mathcal{W}}$ for the Wasserstein distance. Then there exists an absolute constant $\kappa$ such that
$$d_{\mathcal{W}}(W, G) \leq \kappa\, \mathbb{E}\!\left[M^{-1/2}\right]$$
for the Wasserstein distance between the law of $W$ and that of $G$; moreover this constant is explicit (bound omitted).
Proof. We will prove this theorem using the Stein methodology which, for $h \in \mathcal{H}$ ($\mathcal{H}$ a nice class of test functions), suggests to write
$$\mathbb{E}[h(W)] - \mathbb{E}[h(G)] = \mathbb{E}\left[f_h'(W) - W f_h(W)\right]$$
with $f_h$ such that

(7) $\quad f_h'(x) - x f_h(x) = h(x) - \mathbb{E}[h(G)]$

(this is known as the Stein equation for the Gaussian distribution) so that

(8) $\quad \sup_{h \in \mathcal{H}} \left|\mathbb{E}[h(W)] - \mathbb{E}[h(G)]\right| = \sup_{h \in \mathcal{H}} \left|\mathbb{E}\left[f_h'(W) - W f_h(W)\right]\right|.$
The reason why (8) is interesting is that properties of the solutions of (7) are well known – see, e.g., [BC105, Lemma 2.3] and [BC205, Lemma 2.3] – and quite good, so that they can be used with quite some efficiency to tackle the rhs of (8). In the present configuration we know that $f_h$ is continuous and bounded on $\mathbb{R}$, with

(9) $\quad \|f_h\|_\infty \leq 2 \|h'\|_\infty$

and

(10) $\quad \|f_h''\|_\infty \leq 2 \|h'\|_\infty.$

In particular, if $\mathcal{H}$ is the class of Lipschitz-1 functions, with $\|h'\|_\infty \leq 1$ (this class generates the Wasserstein distance), then $\|f_h\|_\infty \leq 2$ and $\|f_h''\|_\infty \leq 2$. These will suffice for our purpose.
Our proof follows closely the standard one for independent summands (see, e.g., [RO11, Section 3]). First we remark that, given $M = m$, we can write $W$ as
$$W = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} \xi_i,$$
where, taking i.i.d. $B_i \sim \mathrm{Bernoulli}(p)$, we let $\xi_i = (B_i - p)/\sqrt{p(1-p)}$ (which are centered and have variance 1). Next, for $i = 1, \ldots, m$, define
$$W_i = W - \frac{\xi_i}{\sqrt{m}},$$
which is independent of $\xi_i$. Next take $f = f_h$, the solution of (7) with $h$ some Lipschitz-1 function. Then note that $\mathbb{E}[\xi_i f(W_i)] = 0$ for all $i$. We abuse notations and, given $M = m$, write $W$ for the conditioned random variable. Then (display omitted)
so that $\mathbb{E}[f'(W) - W f(W)]$ splits into two terms (display omitted). Recall that $W_i$ is independent of $\xi_i$. Then (by Taylor expansion) we can easily deal with the first term (display omitted). Taking expectations with respect to $M$ and using (10) we conclude

(11) (display omitted)

For the second term note how it factorizes thanks to the independence of $\xi_i$ and $W_i$ (display omitted). Since the $\xi_i$ are centered with unit variance we can pursue to obtain (display omitted), where we used the (conditional) independence of the $\xi_i$. Taking expectations with respect to $M$ and using (9) we deduce (recall that $\mathbb{E}[\xi_i^2] = 1$)

(12) (display omitted)

Combining (11) and (12) we can conclude that $d_{\mathcal{W}}(W, G) \leq \kappa\, \mathbb{E}[M^{-1/2}]$. The claim follows. $\square$
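A quick numerical check of Theorem 6.2 (our own illustration with arbitrary parameters; the empirical Wasserstein distance between samples is used as a proxy for the distance between the laws):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
p, samples = 0.3, 200_000

for m_scale in (10, 100, 1000):
    # M: a positive-integer random variable, here 1 + Poisson(m_scale).
    M = 1 + rng.poisson(m_scale, samples)
    X = rng.binomial(M, p)
    W = (X - M * p) / np.sqrt(M * p * (1 - p))
    d = wasserstein_distance(W, rng.standard_normal(samples))
    print(m_scale, d)   # shrinks roughly like E[M^(-1/2)], as predicted
```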
So we need the moments of $R$ for large $b$ (we limit ourselves to the dominant term).
6.2 Moments of $R$ for large $b$
We have the following property.
Theorem 6.3
The asymptotic moments of $R$ for large $b$, with notations as above, are given by (display omitted).

Proof. We have, with $\psi_n$ denoting the $n$th polygamma function (the $n$th derivative of the digamma function) and ${}_2F_1$ denoting the hypergeometric function, (display omitted).