Exact confidence interval for generalized Flajolet-Martin algorithms

This paper develops a deep mathematical-statistical approach to analyze a class of Flajolet-Martin algorithms (FMa), and provides an exact analytical confidence interval for the number F_0 of distinct elements in a stream, based on Chernoff bounds. The class of FMa has reached significant popularity in big data stream learning. The attention of the literature has mainly focused on algorithmic aspects, essentially complexity optimality, while the statistical analysis of this class of algorithms has often been approached heuristically. The analysis provided here shows deep connections with special mathematical functions and with extreme value theory. The latter connection may help in explaining heuristic considerations, while the former raises many numerical issues, addressed at the end of the present paper. Finally, Monte Carlo simulations are provided to support our analytical choice in this context.




1. Introduction

Data streams [7] are sequences of objects that are not available for random access, but must be analyzed sequentially as they arrive and immediately discarded. Streaming algorithms process data streams, and have attracted wide attention over the last decades. Typically, these kinds of algorithms have limited time to complete their processing and have access to a limited amount of memory, usually logarithmic in the quantity of interest.

One of the main applications of streaming algorithms concerns the problem of counting distinct elements in a stream. In [13], the authors develop the first algorithm for approximating $F_0$ based on hash functions. This algorithm was then formalized and made popular in [6], which presented the forefather of the class of algorithms that takes the name of Flajolet-Martin algorithms (here, FMa). Three extensions of FMa were presented in [8], together with a complete description of the drawbacks and strengths of the previous attempts. The first optimal (in complexity) algorithm was proposed and proved in [18] and, nowadays, FMa cover a wide range of applications. As just one example, in [16], an application within a multiset framework is developed from one of the most recent versions of FMa, and it estimates the number of “elephants” in a stream of IP packets.

This class of algorithms is essentially based on the following concept. When an object arrives from the stream, one (or more, independent) hash functions are applied to it, and then the object is immediately discarded. The results of these functions are merged with what is stored in memory (which has a comparable size). The memory is updated, if necessary, with the result of this procedure, and then the process is ready for the next object. The estimate of $F_0$ may be queried when necessary, and it is a function of the memory content.

The key point is that the central operation is performed with a function which must be associative, commutative and idempotent, so that multiple evaluations on the same object do not affect the final outcome, which results from the combination of the hash values of the distinct objects. A good candidate for such a function is the $\max$ function applied to a “signature” of each object, which is the core of such streaming algorithms. The same idea has recently been used for other distributed algorithms (see [5] for the simulation of discrete random variables), where new entries or single changes should not force the whole algorithm to start afresh.
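The role of an associative, commutative and idempotent merge can be illustrated with a minimal Python sketch. The names `signature` and `consume` are hypothetical, and the use of a SHA-256 prefix as the signature is an assumption made only for illustration; the sketch keeps a single memory cell and is not the memory layout of the algorithms studied in this paper.

```python
import hashlib


def signature(obj: str) -> int:
    """Hypothetical 64-bit signature of an object (assumption: SHA-256 prefix)."""
    digest = hashlib.sha256(obj.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def consume(stream) -> int:
    """Fold a stream into a single memory cell, using max as the merge function."""
    memory = 0
    for obj in stream:
        # max is associative, commutative and idempotent, so repeated
        # objects and reordering leave the final memory unchanged.
        memory = max(memory, signature(obj))
    return memory


with_repeats = consume(["a", "b", "a", "c", "b", "a"])
distinct_only = consume(["c", "b", "a"])
print(with_repeats == distinct_only)
```

The final memory depends only on the set of distinct objects, which is the property that allows the stream to be treated as if each distinct object had been seen exactly once.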

As stated before, the main contributions in the study of FMa have concerned complexity problems, and a deep mathematical-statistical approach has not yet been developed, even though this class of algorithms is probabilistic. This paper is a first attempt in this direction. The main contribution here is the analytical and numerical control of FMa based on a purely mathematical-statistical approach, while we leave the measure of the goodness of the FMa to other studies (see [12] for a continuously updated work). In particular, we give here an exact analytical confidence interval for the quantity $F_0$. More precisely, we analyze an extension of the algorithms given above, and given , we will find , a function of the memory content, such that


where is a given, strictly increasing, special function. It is important to note that the approximations for as in (1) given in the literature are not satisfactory. In some situations, the asymptotic behavior of the interval is calculated through a Central Limit Theorem (see ), but the huge skewness implicit in the algorithm variables (even in logarithmic scale) makes the Central Limit Theorem questionable. To overcome this difficulty, Chebyshev and Markov bounds are sometimes used to compute confidence intervals, see the papers cited in [18], where the results are analyzed in terms of optimal complexity (in space and time) without exploiting possible benefits in reducing the interval length.

These facts suggest not basing the confidence interval on asymptotic statistical properties, but building an exact confidence interval, based on concentration inequalities. In particular, we use Chernoff bounds, and we give an analytical approximation of the resulting inequalities. We show with Monte Carlo simulations that the analytical approximation does not affect the result significantly. Moreover, we show that the same result derives from the use of the Chernoff bounds on the limiting distribution that would be obtained with extreme value theory.

It is not surprising that some new analytical special functions appear in the analysis of the algorithm. In particular, a -modification of the analytical extension of the harmonic number function arises here as the mean value of a particular statistic of interest, and is a quantity that appears in the paper. Notably, the heuristic approximation in the classical case studied in the literature (with ) gives a value that is of the order of magnitude we give in our subsequent estimations.

In addition, we discuss here a possible numerical implementation of the confidence interval in real time. To address this question, we develop an algorithm to solve all the relevant nonlinear problems with a cubic rate of convergence and we provide the necessary numerical bounds to apply it. As a byproduct, we can give the algorithm that calculates the -shortest confidence interval.

The paper is structured in the following way. In Section 2 we first describe how FMa work. We show how data are stored in memory and queried from it, and then we analyze these processes from a mathematical and a statistical point of view.

The main result, Theorem 3.1, is given at the beginning of Section 3, and the connection with the asymptotic results of extreme value theory is immediately discussed in Section 3.1. The proof of the main result is based both on the analytical computation of Chernoff bounds given in Section 4, and on the expected value of a quantity of interest, given in Section 5. Section 6 shows the goodness of the choice of the analytical approximations given in Section 4.

In Section 7 we address numerically some nonlinear equations that are necessary to query the interval (1) in the equivalent form: In particular, we give some sharp upper and lower bounds to develop cubic rate algorithms together with more robust bisection algorithms. As a byproduct, the algorithm that calculates the -shortest confidence interval is given at the end of the section.

Finally, Appendix A defines the main properties of some special mathematical functions that are used in this paper. Appendix B concludes the paper with the technicalities needed to find lower and upper bounds contained in Section 7.

2. Description of the algorithm

The main task of FMa is to provide an estimate of $F_0$, the unknown number of distinct elements in a real-time stream of possibly repeating objects, based on independent hash functions. Our memory data structure is a generalization of a HyperLogLog data structure (see [10, 18, 12]), and consists of two matrices and with rows and columns. The use of is an addition of this paper to the classical algorithms given above, and it is used to increase the accuracy of the estimation of $F_0$ (see Section 3), by using bits of each hash function. The streaming algorithm that updates and in memory is given in Algorithm 1.

Data: Data Stream of Objects
Input: hash functions, and small integers
Output: Two matrices and with rows and columns
Set , (binary);
foreach  in Stream do
       for  to  do
             /* compute the -hash function on , obtaining a finite sequence of and */
             if  then
            else if  then
                      is the minimum in base
       end for
      discard ;
end foreach
Algorithm 1 Streaming algorithm to store the data in memory. is an integer-valued matrix, whose values are of the order of , while takes values in

The flow of information is as follows. From each hash function , we extract the following information on a stream object :


The data are then updated according to the following procedure:

if :

do nothing;

if :

set and ;

if :

set .
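A three-case update of this kind can be sketched in Python under explicit assumptions, since the symbols of the procedure above are not reproduced here. The sketch assumes that each cell stores a pair made of a primary integer value and a block of `q` auxiliary bits, that a strictly larger primary value overwrites the cell, and that ties keep the minimum of the auxiliary bits. The names `update` and `fold` and the initial cell content are hypothetical.

```python
def update(cell, value, bits):
    """Hypothetical three-case update of one memory cell.

    cell  : pair (stored_value, stored_bits)
    value : primary integer extracted from the hash (assumption)
    bits  : auxiliary q-bit integer extracted from the hash (assumption)
    """
    stored_value, stored_bits = cell
    if value < stored_value:
        # Case 1: the new object carries less information, do nothing.
        return cell
    if value > stored_value:
        # Case 2: overwrite both entries of the cell.
        return (value, bits)
    # Case 3: tie on the primary value, keep the minimum auxiliary bits.
    return (stored_value, min(stored_bits, bits))


def fold(observations, q=3):
    """Apply the update to a list of (value, bits) pairs, in order."""
    cell = (0, 2**q - 1)  # assumed initial content: zero value, all bits set
    for value, bits in observations:
        cell = update(cell, value, bits)
    return cell


obs = [(3, 5), (7, 2), (7, 1), (3, 0), (7, 2)]
print(fold(obs), fold(list(reversed(obs))), fold(obs + obs))
```

Under these assumptions the update is a lexicographic maximum on the pair, so it stays associative, commutative and idempotent: reordering or repeating the observations does not change the final cell.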

The querying algorithm produces the value , which is the arithmetic mean of values built with the contents of and as in Algorithm 2.

Input: and , output of Algorithm 1
Set ;
for  to  do
       for  to  do
               made by bits
       end for
end for
return ;
Algorithm 2 Querying algorithm to extract , starting from the memory content and given in Algorithm 1

As an example, in Algorithm 3, we show how to compute a -confidence interval for of the form , based on Theorem 3.1. The nonlinear problems involved in this computation are addressed in Section 7.

Input: 1) , output of Algorithm 2.
2) the confidence -usually -
Output: A confidence interval for of the form
Set ;
Set ;
Set ;
  /* Solve (in ) the problem , with */
Set ;
for  to  do
       for  to  do
       end for
end for
Set ;
Set ;
return ;
   Solve (in ) the problem
Algorithm 3 Querying algorithm that builds a -confidence interval for of the form , based on the Theorem 3.1

Finally, note that the data structure becomes that of [12] when and (the content of is not significant and the update reduces to , without the if-else branch). When, in addition, the data structure reduces to the original one [15].

2.1. Mathematical and Statistical analysis of the algorithm

Algorithm 1 has the following properties. First, applying this algorithm several times to the same object gives the same outcome as applying it only once. Mathematically speaking, this is an idempotent algorithm and, in addition, it can be seen to be associative and commutative. A typical mathematical function with these properties is the $\max$ function which, evaluated on different, even repeated numbers, gives the same result, independently of the order and of the repetitions.

This is the reason why this algorithm works and why, as far as the final content of the matrices is concerned, Algorithm 1 may be thought of as applied only once to each of the different objects.

We will assume that each hash function generates a sequence of bits that is uniformly distributed over all possible outcomes. Moreover, the evaluations on different objects are assumed to be statistically independent, as are the evaluations of different functions. As just one example, the SHA functions have been certified to have such properties [20, 21, 22], and can be used for this purpose: by cutting the result of a function into parts, it is possible to obtain independent hash functions of sufficient length for any reasonable application.
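The idea of cutting one SHA digest into parts can be sketched as follows. This is only an illustration: the function name `split_hashes`, the choice of SHA-256 and the default of four 64-bit parts are assumptions of this sketch, not prescriptions of the paper.

```python
import hashlib
from typing import List


def split_hashes(obj: bytes, parts: int = 4) -> List[int]:
    """Derive several hash values from one SHA-256 digest by slicing it.

    The 32-byte digest is cut into `parts` consecutive chunks of equal
    length; each chunk is read as an unsigned big-endian integer.
    """
    digest = hashlib.sha256(obj).digest()
    width = len(digest) // parts  # bytes per derived hash value
    return [
        int.from_bytes(digest[i * width:(i + 1) * width], "big")
        for i in range(parts)
    ]


values = split_hashes(b"example object")
print([format(v, "016x") for v in values])
```

Each derived value plays the role of one hash function evaluated on the object; longer digests (or several salted digests) would be needed when more or longer hash values are required.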

From a probabilistic point of view, the bit sequences of (2) are independent for different choices of the object and hash function, and are uniformly distributed on all the possible sequences.

In other words, every in each sequence of the form

is distributed as a Bernoulli random variable of parameter $1/2$, and it is independent of the others.

If we analyze Algorithm 2 we note that, for each index of the matrices, the only information that is kept after querying may be described in the following manner. For each fixed hash function and object , just compute , , and as in (2), and is defined as

The successive quantity in Algorithm 2 is the result of the following procedure


Note that, if we complete the bit sequence in with an i.i.d. sequence of equally distributed bits , the random variables


form a family of independent and identically distributed random variables. The point is that, instead of measuring , we can only collect , due to computational limitations, and this introduces a further bias. We get the following result.

Lemma 2.1.

There exists a family

of independent and identically distributed random variables with exponential distribution of parameter

, such that, if we define,

then, uniformly in and ,

where each is defined in (3). Moreover, for any fixed , define

Then the random vectors

are i.i.d, distributed as multinomial vectors of parameters and . Conditioned on , the random variables are independent.


Take as in (4). Define

we note that it forms a family of random variables, independent and uniformly distributed on (see, e.g., [26, § 4.6]). Since , the first part of the lemma holds. In addition,

Since and , then . Hence we get the second part of the thesis, since

To conclude, just note that the first bits of each hash function generate , uniformly distributed on , independently of the remaining processes. Then, for each one of the different objects and each hash function, a uniform assignment is made, which gives the multinomial sample. The conditional independence of the family is a consequence of the independence of the family . ∎

3. Confidence interval for $F_0$

The main result of this paper is the construction of a confidence interval for $F_0$.

Theorem 3.1.

Let be collected as in Section 2, and define


are confidence intervals for the unknown parameter , where

  • the function is defined in Definition A.1 and (19);

  • ;

  • the levels of confidence are , , and respectively, where

    is the Euler constant and is the digamma function (see Appendix A).

Proof of Theorem 3.1.

We first note that, by Lemma 2.1, if we define


then it is sufficient to prove that

are confidence intervals for the unknown parameter at the same levels given in the theorem. To prove this last assertion, we prove the following sufficient conditions:

Observe that, since the function is invertible with continuous inverse (see Appendix A), we get

and hence the proof will be based on the following steps:

  • in Lemma 5.1 in Section 5 we prove that , for any . As an immediate consequence the following equality holds

  • the inequalities

    are Chernoff bound inequalities and will be proved in Corollary 5.2. ∎

3.1. Connection with extreme value theory

As stated in Lemma 2.1, the main result of this paper is based on the mean of the random variables , which are independent, conditioned on . As discussed in the introduction, these variables are given through a commutative, associative and idempotent function, which is the $\max$ function in this context:

A natural question is how such a consideration relates to extreme value theory. The well-known Fisher–Tippett–Gnedenko theorem [17] provides an asymptotic result, and shows that, when , if there are sequences and such that converges in law to a random variable , then must be Gumbel, Fréchet or Weibull (Type 1, 2 or 3). As in the proof of Lemma 4.1, we have that

from which we can recognize that has a Gumbel law. Since the Chernoff bounds on the mean of such variables give the same concentration inequalities as in Theorem 3.1, our result also gives the confidence interval based on the Chernoff bounds of the asymptotic distribution obtained from extreme value theory.

Our result underlines the fact that the analytical approximation gives an exact upper bound for the concentration inequality, based on the monotonicity of the limit , that is the key point in the proof of Lemma 4.1.

Finally, the accuracy of such a bound is discussed in Section 6.
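The Gumbel connection can be checked numerically in a simple case. The sketch below assumes unit-rate exponential variables (an assumption of this example): the maximum of n such variables has mean equal to the n-th harmonic number, and the centered mean, harmonic number minus ln n, approaches the Euler-Mascheroni constant, which is the mean of the standard Gumbel law. The function name `harmonic` is hypothetical.

```python
import math

# Euler-Mascheroni constant, the mean of the standard Gumbel distribution.
EULER_GAMMA = 0.5772156649015329


def harmonic(n: int) -> float:
    """n-th harmonic number, 1 + 1/2 + ... + 1/n."""
    return math.fsum(1.0 / k for k in range(1, n + 1))


for n in (10, 1_000, 100_000):
    # Mean of the maximum of n unit-rate exponentials, centered by ln(n),
    # compared with the Gumbel mean.
    gap = harmonic(n) - math.log(n) - EULER_GAMMA
    print(n, gap)
```

The printed gap is positive and shrinks roughly like 1/(2n), which quantifies how quickly the centered mean of the maximum approaches its Gumbel limit in this simple setting.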

4. Chernoff bounds: auxiliary results for the maximum of exponential random variables

We recall the Chernoff bound for a sum of independent random variables : for any ,


which is one of the most powerful concentration inequalities in probability theory, since it involves the entire moment generating functions

instead of only some moments of each .
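To make the mechanism concrete, the following sketch evaluates a Chernoff bound in a textbook case that is not the one treated in Lemma 4.1: the upper tail of a sum of n independent unit-rate exponential variables. In this case the moment generating function of each term is 1/(1-t) for t<1, the optimal t is 1-n/a for a>n, and the exact tail is available in closed form (Erlang distribution), so the bound can be compared with the truth. Function names are hypothetical.

```python
import math


def chernoff_tail_bound(n: int, a: float) -> float:
    """Optimized Chernoff bound for P(S >= a), S a sum of n Exp(1) variables.

    Minimizing exp(-t*a) * (1 - t)**(-n) over 0 < t < 1 gives t = 1 - n/a
    when a > n, and the bound exp(n - a) * (a / n)**n. For a <= n the
    optimization yields only the trivial bound 1.
    """
    if a <= n:
        return 1.0
    return math.exp(n - a) * (a / n) ** n


def exact_tail(n: int, a: float) -> float:
    """Exact P(S >= a) for the Erlang(n, 1) distribution."""
    return math.exp(-a) * math.fsum(a**k / math.factorial(k) for k in range(n))


for n, a in [(1, 2.0), (5, 10.0), (20, 35.0)]:
    print(n, a, exact_tail(n, a), chernoff_tail_bound(n, a))
```

The bound always dominates the exact tail, and it uses the whole moment generating function rather than one or two moments, which is what makes it sharper than Markov or Chebyshev bounds far from the mean.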

Lemma 4.1.

Let be a finite set of cardinality , and let be a collection of nonnegative integer numbers. Let be an array of i.i.d. exponential random variables with parameter . Define, for any , and . Then


where are defined in Theorem 3.1.


To apply (6) with and , it is possible in principle to compute by noticing that the density of may be expressed as the density of the maximum of independent exponential random variables:

A more interesting interpretation leads to simpler computations. Denote by the -th order statistic of , and set for consistency. As noted, for example, recently in [11, Eq. (2)], for any , the random variables

are independent exponential random variables with parameter . Since

then each may be seen as a sum of independent exponential random variables with parameter , . As a direct consequence,


where is the -th harmonic number defined in (18). More remarkably, it is possible to calculate the moment-generating function of . In fact, since

we get, for ,

Thus, since , the Chernoff bound (6) becomes

Since for any , then for , by (21)

Combining the two expressions above, we get

The part of the proof that concerns is hence proved by Lemma A.2.

The second inequality in (7) may be proved in the same spirit. To find the Chernoff bound with the second equality of (21), note that we get, for any

and then we apply again Lemma A.2 to . ∎
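The representation of the maximum through independent spacings, used in the proof above, can be checked by simulation. The sketch assumes unit-rate exponentials (an assumption of this example; for rate λ all means are divided by λ) and compares two Monte Carlo averages with the harmonic number: the average of the maximum of n exponentials, and the average of a sum of n independent exponentials with the k-th term divided by k. Function names and the seed are hypothetical.

```python
import math
import random


def harmonic(n: int) -> float:
    """n-th harmonic number, 1 + 1/2 + ... + 1/n."""
    return math.fsum(1.0 / k for k in range(1, n + 1))


def simulate(n: int, samples: int, seed: int = 0):
    """Monte Carlo means of max(E_1..E_n) and of sum_k E_k / k, E_k ~ Exp(1)."""
    rng = random.Random(seed)
    max_total = 0.0
    spacing_total = 0.0
    for _ in range(samples):
        # Direct definition: maximum of n i.i.d. unit-rate exponentials.
        max_total += max(rng.expovariate(1.0) for _ in range(n))
        # Spacing representation: independent exponentials with rates 1..n.
        spacing_total += sum(rng.expovariate(1.0) / k for k in range(1, n + 1))
    return max_total / samples, spacing_total / samples


mean_max, mean_spacings = simulate(n=20, samples=20_000)
print(mean_max, mean_spacings, harmonic(20))
```

Both averages should agree with the harmonic number up to Monte Carlo error, which illustrates why the moment generating function of the maximum factorizes into a product over the spacings.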

5. Computation of

To complete the computation of the confidence interval, we give the following result, which connects the expectation of the core variables with the special functions we have introduced in this paper.

Lemma 5.1.

For any and , we have that

Proof of Lemma 5.1.

Let as in Lemma 2.1. Combining Lemma 2.1 and (8), we know that

Again, as stated in Lemma 2.1, the random variable

is distributed as a binomial distribution, with

trials and probability . Then, by (18),

the last equality being the Definition A.1. ∎
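The averaging step in this proof can be illustrated in a simplified setting. The sketch assumes that a cell receiving k objects contributes a statistic with mean equal to the k-th harmonic number, and that k is binomial with `trials` trials and success probability `p`; the modification that depends on the auxiliary bits (Definition A.1) is not reproduced. Under these assumptions the expectation can be computed either by direct summation over the binomial law or through the finite sum of (1-(1-p)^j)/j for j from 1 to `trials`. Function names are hypothetical.

```python
import math


def harmonic(k: int) -> float:
    """k-th harmonic number, with harmonic(0) = 0."""
    return math.fsum(1.0 / i for i in range(1, k + 1))


def expected_harmonic_direct(trials: int, p: float) -> float:
    """E[H_K] for K ~ Binomial(trials, p), summing over the binomial law."""
    return math.fsum(
        math.comb(trials, k) * p**k * (1.0 - p) ** (trials - k) * harmonic(k)
        for k in range(trials + 1)
    )


def expected_harmonic_closed(trials: int, p: float) -> float:
    """Same expectation through the identity sum_{j=1}^{trials} (1-(1-p)^j)/j."""
    return math.fsum((1.0 - (1.0 - p) ** j) / j for j in range(1, trials + 1))


print(expected_harmonic_direct(50, 0.125), expected_harmonic_closed(50, 0.125))
```

The two evaluations agree up to rounding; the second form avoids binomial coefficients and is the kind of expression that extends naturally to non-integer arguments.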

Corollary 5.2.

Let as in (5). The following inequalities hold


To prove the assertion, we apply Lemma 4.1 to the objects given in Lemma 2.1. We begin by setting , which implies . Moreover, for , we have and

With this setting, in Lemma 4.1 is