The Adaptive sampling revisited

The problem of estimating the number $n$ of distinct keys in a large collection of $N$ data items is well known in computer science. A classical algorithm is the adaptive sampling (AS). The cardinality $n$ can be estimated by $R\,2^D$, where $R$ is the final bucket (cache) size and $D$ is the final depth at the end of the process. Several new interesting questions can be asked about AS (some of them were suggested by P. Flajolet and popularized by J. Lumbroso). The distribution of $W = R\,2^D/n$ is known; we rederive this distribution in a simpler way. We provide new results on the moments of $D$ and $W$. We also analyze the distribution of the final cache size $R$. We consider colored keys: assume that, among the $n$ distinct keys, $n_C$ have color $C$. We show how to estimate $p = n_C/n$. We also study colored keys with some multiplicity given by some distribution function; we want to estimate the mean and variance of this distribution. Finally, we consider the case where neither colors nor multiplicities are known; there we want to estimate the related parameters. An appendix is devoted to the case where the hashing function provides bits with probability different from $1/2$.

1 Introduction

The problem of estimating the number of distinct keys of a large collection of data is well known in computer science. It arises in query optimization of database systems. A classical algorithm is the adaptive sampling (AS). The mean and variance of AS are considered in Flajolet [FL90]. Let us summarize the principal features of AS. Elements of the given set of data are hashed into binary keys. These keys are infinitely long bit streams such that each bit has probability $1/2$ of being $0$ or $1$. A uniformity assumption is made on the hashing function.

The algorithm keeps a bucket (or cache) $B$ of at most $b$ distinct keys. The depth of sampling, $d$, which is defined below, is also saved. We start with $d = 0$ and throw only distinct keys into $B$. When $B$ is full, the depth $d$ is increased by $1$, the bucket is scanned, and only keys starting with $0$ are kept. (If the bucket is still full, we wait until a new key starting with $0$ appears. Then $d$ is again increased by $1$ and we keep only keys starting with $00$.) The scanning of the data set is resumed and only distinct keys starting with $0$ are considered. More generally, at depth $d$, only distinct keys starting with $0^d$ are taken into account. When we have exhausted the set of data, $n$ can be estimated by $R\,2^D$, where $R$ is the random final bucket (cache) size and $D$ is the final depth at the end of the process. We can summarize the algorithm with the following pseudocode.

  Parameter: bucket (or cache) $B$ of at most $b$ distinct keys.
  Input: a stream of keys
  Output: the final bucket size $R$ and the final depth $D$
  Initialization: $B := \emptyset$ and $d := 0$
  for all keys $x$ in the stream do
     if $\mathrm{hash}(x)$ starts with $0^d$ then
        if $x \notin B$ then
           $B := B \cup \{x\}$
        end if;
     end if;
     if $|B| = b$ then
        $d := d + 1$, filter $B$ (remove keys of $B$ whose hash does not match $0^d$)
     end if;
  end for;
  return $(R, D) := (|B|, d)$;
Algorithm 1
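To make the procedure concrete, here is a short Python sketch of Algorithm 1 (ours, not from the original paper; the name adaptive_sample is illustrative). The idealized uniform hash is simulated by attaching an independent uniform value in $[0,1)$ to each distinct key, whose binary expansion plays the role of the infinite bit stream, so that "hash starts with $0^d$" becomes "value $< 2^{-d}$".

  import random

  def adaptive_sample(stream, b):
      """Adaptive sampling (Algorithm 1): return (R, D), the final bucket
      size and depth, so that R * 2**D estimates the number of distinct
      keys in `stream`."""
      hash_bits = {}   # key -> simulated uniform hash value in [0, 1)
      bucket = set()   # cache of at most b distinct keys
      d = 0            # current sampling depth
      for x in stream:
          h = hash_bits.setdefault(x, random.random())
          # Keep x only if its hash starts with d zero bits, i.e. h < 2**(-d).
          if h < 2.0 ** (-d):
              bucket.add(x)
          if len(bucket) >= b:
              # Bucket full: increase the depth and filter the cache.
              # (If it is still full, the next matching key triggers a
              # further increase, as described in the text above.)
              d += 1
              bucket = {y for y in bucket if hash_bits[y] < 2.0 ** (-d)}
      return len(bucket), d

  # Quick check of unbiasedness: the average of R * 2**D over many runs
  # should be close to the true number n of distinct keys.
  n, b = 10_000, 64
  runs = [adaptive_sample(range(n), b) for _ in range(200)]
  print(sum(R * 2 ** D for R, D in runs) / len(runs))   # close to 10000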

AS has some advantages in terms of processing time and conceptual simplicity. As shown in [FL90], AS outperforms standard sorting methods by a significant factor. In terms of storage consumption, using $b$ words of memory provides a typical accuracy of order $1/\sqrt{b}$. This is to be contrasted again with sorting, where the auxiliary memory required has to be at least as large as the file itself. Finally, AS is an unbiased estimator of the cardinalities of large files that necessitates minimal auxiliary storage and processes the data in a single pass.

Several new interesting questions can be asked about AS (some of them were suggested by P. Flajolet and popularized by J. Lumbroso). The distribution of $W = R\,2^D/n$ is known (see [GL97]), but in Sec. 3 we rederive this distribution in a simpler way. In Sec. 4 we provide new results on the moments of $D$ and $W$. The distribution of the final cache size $R$ is analyzed in Sec. 5. Colored keys are considered in Sec. 6: assume that we have a set of colors and that each key has some color. Assume also that, among the $n$ distinct keys, $n_C$ have color $C$ and that $n$ is large, such that $p = n_C/n$ is fixed. We show how to estimate $p$. We consider keys with some multiplicity in Sec. 7: assume that, to each key, we attach a counter giving its observed multiplicity. We also assume that the multiplicities of color-$C$ keys are given by iid random variables (RV), with some distribution function, mean, and variance (functions of $C$). We show how to estimate this mean and variance. Sec. 8 deals with the case where neither colors nor multiplicities are known; there we want to estimate the keys' colors, their multiplicities, and their number. An appendix is devoted to the case where the hashing function provides bits with probability different from $1/2$.

2 Preliminaries

Let us first give the main notation we will use throughout the paper. Other particular notation will be introduced where needed.

From Flajolet [FL90], we have the exact distribution

(1)

We can now see Adaptive Sampling as an urn model, where balls (keys) are thrown into urn $i$ with probability $2^{-i}$. We recall the main properties of such a model.

  • Asymptotic independence. We have asymptotic independence of urns, for all events related to urn $i$ containing $j$ balls. This is proved, by Poissonization–De-Poissonization, in [HL00], [ALP05] and [LP05]. This technique can be briefly described as follows. First we construct the corresponding generating function. Then we Poissonize (see, for instance, Jacquet and Szpankowski [JS98] for a general survey): instead of using a fixed number of balls, we use a Poisson-distributed number of balls. It follows that the urns become independent and that the number of balls in each urn is a Poisson random variable. We then turn to complex variables and, with Cauchy's integral theorem, we De-Poissonize the generating function, using [JS98, Thm. 10.3 and Cor. 10.17]. The error term is $O(n^{-C})$, where $C$ is a positive constant.

  • Asymptotic distributions. We obtain the asymptotic distributions of the interesting random variables as follows. The number of balls in each urn is asymptotically Poisson-distributed (this is the classical Poisson approximation of the binomial distribution; see the worked limit after this list): with the urn probabilities above, the number of balls in urn $i$ is asymptotically equivalent to a Poisson random variable with parameter $n/2^i$. The asymptotic distributions are related to Gumbel distribution functions (given by $\exp(-e^{-x})$) or convergent series of such. The error term is of the same type as above.

  • Extended summations. Some summations now go to $\infty$. This is justified, for example, in [LP05].

  • Uniform integrability. We have uniform integrability for the moments of our random variables. To show that the limiting moments are equivalent to the moments of the limiting distributions, we need a suitable rate of convergence. This is related to a uniform integrability condition (see Loève [LO63, Section 11.4]). For Adaptive Sampling, the rate of convergence, together with the corresponding error term, is analyzed in detail in [momP].

  • Mellin transform. Asymptotic expressions for the moments are obtained via Mellin transforms (for a good reference on Mellin transforms, see Flajolet et al. [FGD95]). The error term is of the same type as above. We proceed as follows (see [momP] for detailed proofs): from the asymptotic properties of the urns, we have obtained the asymptotic distributions of our random variables of interest. Next we compute the Laplace transform of these distributions, from which we can derive the dominant part of probabilities and moments, as well as the (tiny) periodic part in the form of a Fourier series. This connection will be detailed in the next sections.

  • Fast decrease property. The Gamma function decreases exponentially along vertical lines:

    $|\Gamma(\sigma + it)| = O\bigl(e^{-\pi |t|/2}\bigr)$ as $|t| \to \infty$.

    This property also holds for all other functions we encounter, so inverting the Mellin transforms is easily justified.

  • Early approximations. If we compare the approach of this paper with earlier ones, we can notice the following. Traditionally, one would stay with exact enumerations as long as possible and move to asymptotics only at a late stage. Doing so, one would, in terms of asymptotics, carry many unimportant contributions around, which makes the computations quite heavy, especially when it comes to higher moments. Here, however, approximations are carried out as early as possible, and this allows for streamlined (and often automatic) computations of the higher moments.
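As a worked instance of the binomial-to-Poisson step invoked above (writing the urn probability generically as $2^{-i}$, per the model at the start of this section), we have

$$P[\text{urn } i \text{ contains } k \text{ balls}] = \binom{n}{k}\,2^{-ik}\bigl(1 - 2^{-i}\bigr)^{n-k} \;\sim\; e^{-\lambda}\,\frac{\lambda^k}{k!}, \qquad \lambda = \frac{n}{2^i},$$

valid in the regime where $\lambda = n\,2^{-i}$ remains bounded.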

From (1), we obtain

(2)

and similar functions for the other quantities of interest. Asymptotically, the distribution will be a periodic function of the fractional part of $\log_2 n$. The distribution does not converge in the weak sense; it does, however, converge along subsequences for which the fractional part of $\log_2 n$ is constant. This type of convergence is not uncommon in the analysis of algorithms. Many examples are given in [momP].

From (2), we compute the Laplace transform:

The moments of $W$ are already given in [GL97] and [momP]; as shown in [GL97], they exist only under a suitable condition given there. For the sake of completeness, we repeat them here, with ${\cdot \brace \cdot}$ denoting the Stirling numbers of the second kind and $\mathbb{V}$ denoting the variance of a random variable:

(3)

Throughout, such trailing terms denote periodic functions of $\log_2 n$ of small amplitude. Note that, in [FL90], Flajolet already computed the mean and variance.
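As a consistency check connecting (3) with the introduction, note that the unbiasedness of AS (Sec. 1) forces

$$\mathbb{E}(W) = \frac{\mathbb{E}(R\,2^D)}{n} = 1,$$

so the dominant part of the first moment in (3) must reduce to $1$, up to the small periodic term.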

3 Asymptotic distribution of $W$

$D$ corresponds to the bit size of the estimate and has some independent interest. Let us recover the distribution of $W$ from (2). In the sequel, we will use the notation $p(\cdot)$, with argument either an event or a random variable. We have:

Theorem 3.1

The asymptotic distribution of $W$ is given by

with

Proof.

with

or

with

This is exactly Theorem 4.1 of [GL97], which we obtain here in a simpler way.

4 Moments of $D$ and $W$

Recall that $D$ corresponds to the cost of AS in number of steps (cache editings). Two interesting parameters are given by the moments of $D$ and $W$. Their asymptotic behaviour is given as follows, with $\psi$ denoting the digamma function (the logarithmic derivative of $\Gamma$):

Theorem 4.1

The moments of $D$ and $W$ are asymptotically given by

where

Proof.   Using the techniques developed in [momP], we obtain the dominant (constant) part of the moments as follows:

where the non-periodic component is given by

and the corresponding periodic term is given by

This was already computed in [momP], but with some errors. The first corrected values are now provided.

From this, the rest of the theorem is immediate.

It will be useful to obtain an asymptotic form of the above expectation (non-periodic component) for large $b$. This is computed as follows. First of all, we rewrite it as

Now it is clear that the main contribution of the second term is related to large indices, so we substitute accordingly. This gives, by Stirling's formula,

and

By Euler-Maclaurin, we have

and, finally,

and, to first order,

(4)
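The asymptotics above can also be checked empirically. The following sketch (reusing the adaptive_sample function from the introduction; the parameter values are arbitrary) prints the empirical mean of $D$ and the first two empirical moments of $W$ for a few bucket sizes:

  import statistics

  n, trials = 50_000, 200
  for b in (16, 64, 256):
      Ds, Ws = [], []
      for _ in range(trials):
          R, D = adaptive_sample(range(n), b)
          Ds.append(D)
          Ws.append(R * 2 ** D / n)
      print(f"b={b:3d}  mean D = {statistics.mean(Ds):5.2f}  "
            f"mean W = {statistics.mean(Ws):.3f}  var W = {statistics.variance(Ws):.4f}")

The variance of $W$ should visibly decrease as $b$ grows, in line with the $O(1/\sqrt{b})$ accuracy mentioned in the introduction.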

5 Distribution of $R$

The asymptotic moments and distribution of the cache size $R$ are given as follows.

Theorem 5.1

The non-periodic components of the asymptotic moments and distribution of $R$ are given by

Similarly, the periodic components are given by

Proof.   We have

and, with

$H$ denoting the harmonic number. This quantity was already obtained in [GL97] after some complicated algebra! This leads to

which is also the probability of the corresponding event; this can also be obtained directly. Figure 1 shows this distribution.

Figure 1: The distribution of the final cache size $R$ (figure not reproduced).

Conditioning on $D$, the expectation of the event $\{R = r\}$ is now given by

The moments of $R$ are computed as follows.

More generally, the generating function of $R$ is given by

This leads to

Similarly, the periodic components are given by

 
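Theorem 5.1 can also be visualized empirically by tabulating the final cache size over many runs (a sketch reusing adaptive_sample from the introduction; the parameters are arbitrary):

  from collections import Counter

  n, b, trials = 20_000, 32, 500
  counts = Counter(adaptive_sample(range(n), b)[0] for _ in range(trials))
  for r in sorted(counts):
      print(f"P(R={r:2d}) ~ {counts[r] / trials:.3f}")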

6 Colors

Assume that we have a set of colors and that each key has some color. Assume also that, among the $n$ distinct keys, $n_C$ have color $C$ and that $n$ is large, such that $p = n_C/n$ is fixed. In the cache, the $R$ keys (we assume $R \ge 1$) contain $R_C$ keys with color $C$,

with probability distribution

and, if $n$ is large, this is asymptotically given by the conditioned binomial distribution $\mathrm{Bin}(R, p)$. We want to estimate $p$. We are interested in the distribution of the statistic $R_C/R$. We have

Theorem 6.1

The asymptotic moments of the statistic $R_C/R$ are given by

(5)

Proof.   We have

(6)

Now, conditioned on $R$, we have $R_C \sim \mathrm{Bin}(R, p)$. So, conditioned on $R$,

$$\mathbb{E}\Bigl[\frac{R_C}{R} \Bigm| R\Bigr] = p, \qquad \mathbb{V}\Bigl[\frac{R_C}{R} \Bigm| R\Bigr] = \frac{p(1-p)}{R},$$

and unconditioning leads to the theorem.
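The estimator can be exercised in simulation. Below is a minimal sketch (the helper name and parameters are illustrative, not from the paper): each key carries a color, adaptive sampling runs as before, and $p$ is estimated by the fraction $R_C/R$ of color-$C$ keys left in the final cache.

  import random

  def estimate_color_fraction(stream_with_colors, b, color):
      """Run AS on (key, color) pairs and estimate p = n_C/n by R_C/R."""
      hash_bits = {}
      bucket = {}   # key -> color, at most b entries
      d = 0
      for x, c in stream_with_colors:
          h = hash_bits.setdefault(x, random.random())
          if h < 2.0 ** (-d):
              bucket[x] = c
          if len(bucket) >= b:
              d += 1
              bucket = {y: cy for y, cy in bucket.items()
                        if hash_bits[y] < 2.0 ** (-d)}
      R = len(bucket)
      R_C = sum(1 for cy in bucket.values() if cy == color)
      return R_C / R if R > 0 else float('nan')

  # Example: 30% of 10000 distinct keys are 'red'; the estimate is ~0.3.
  data = [(k, 'red' if k < 3000 else 'blue') for k in range(10_000)]
  print(estimate_color_fraction(data, 64, 'red'))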

Intuitively, if the cache size $R$ is large, we should have an asymptotic Gaussian distribution for the statistic: conditioned on $R$, the classical central limit theorem for the binomial distribution gives $\sqrt{R}\,(R_C/R - p)/\sqrt{p(1-p)} \approx \mathcal{N}(0,1)$. Actually, the fit is quite good even for moderate cache sizes. This is proved as follows.

6.1 The distribution of $R_C/R$ for large $R$

Let $R$ be a (possibly degenerate) random variable taking values in the (strictly) positive integers. Conditioning on $R$, let $R_C \sim \mathrm{Bin}(R, p)$ for some known $p$, and standardize the statistic $R_C/R$ accordingly. It appears that, as $R$ grows large, the distribution of this standardized statistic becomes asymptotically Gaussian. This claim can be made precise as follows.

Theorem 6.2

Let the statistic be standardized as above, and write $Z$ for a standard Gaussian random variable. Then there exists an absolute constant such that

for the Wasserstein distance between the law of the standardized statistic and that of $Z$; moreover, this constant is such that

Proof.   We will prove this theorem using the Stein methodology which, for $h \in \mathcal{H}$ (where $\mathcal{H}$ is a suitable class of test functions), suggests writing

with $f = f_h$ such that

(7)

(this is known as the Stein equation for the Gaussian distribution), so that

(8)
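For reference, the Stein equation (7) for the standard Gaussian reads, in its canonical form (a standard fact, quoted here rather than reconstructed from the paper's notation),

$$f'(w) - w\,f(w) = h(w) - \mathbb{E}\,h(Z), \qquad Z \sim \mathcal{N}(0,1),$$

so that, for any random variable $X$, the identity behind (8) is $\mathbb{E}\,h(X) - \mathbb{E}\,h(Z) = \mathbb{E}\bigl[f'(X) - X f(X)\bigr]$.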

The reason why (8) is interesting is that the properties of the solutions of (7) are well known – see, e.g., [BC105, Lemma 2.3] and [BC205, Lemma 2.3] – and quite good, so that they can be used efficiently to tackle the right-hand side of (8). In the present configuration we know that $f_h$ is continuous and bounded, with

(9)

and

(10)

In particular, if $\mathcal{H}$ is the class of Lipschitz-1 functions (this class generates the Wasserstein distance), then

These bounds will suffice for our purposes.

Our proof closely follows the standard one for independent summands (see, e.g., [RO11, Section 3]). First we remark that, given $R$, we can write the standardized statistic as

where the summands are i.i.d., centered, and of variance 1. Next, define

Next take $f$ a solution of (7), with $h$ some Lipschitz-1 function, so that the above bounds apply. We abuse notation and, given $R$, write the corresponding conditional quantities in the same way. Then

so that

Recalling the definitions above, we can (by Taylor expansion) easily deal with the first term to obtain

Taking expectations with respect to $R$ and using (10), we conclude

(11)

For the second term, note that

We can then proceed to obtain

where we used the (conditional) independence of the summands. Taking expectations with respect to $R$ and using (9), we deduce

(12)

Combining (11) and (12), we conclude

The claim follows.  
So we need the moments of $R$ for large $b$ (we limit ourselves to the dominant term).

6.2 Moments of $R$ for large $b$

We have the following property

Theorem 6.3

The asymptotic moments of $R$ for large $b$ are given by

Proof.   We have, with $\psi(n, \cdot)$ denoting the $n$th polygamma function (that is, the $n$th derivative of the digamma function) and $F$ denoting the hypergeometric function,