1. Introduction
Data streams [7] are sequences of objects that are not available for random access: they must be analyzed sequentially as they arrive and then immediately discarded. Streaming algorithms process data streams, and they have attracted considerable attention over the last decades. Typically, these algorithms have a limited time to complete their processing and access to a limited amount of memory, usually logarithmic in the quantity of interest.
One of the main applications of streaming algorithms concerns the problem of counting distinct elements in a stream. In [13], the authors developed the first approximation algorithm based on hash functions. This algorithm was then formalized and made popular in [6], where the forefather of the class of algorithms known as Flajolet–Martin algorithms (here, FMa) was presented. Three extensions of FMa were presented in [8], together with a complete description of the drawbacks and strengths of the previous attempts. The first optimal (in complexity) algorithm was proposed and proved in [18] and, nowadays, FMa cover a wide range of applications. As an example, in [16], an application in a multiset framework is developed from one of the most recent versions of FMa, and it estimates the number of “elephants” in a stream of IP packets.
This class of algorithms is essentially based on the following concept. When an object arrives from the stream, one (or more, independent) hash functions are applied to it, and then the object is immediately discarded. The results of these functions are merged with what is saved in memory (which has a comparable size). The memory is updated, if necessary, with the result of this procedure, and then the process is ready for the next object. The estimate may be queried when necessary, and it is a function of the memory content.
The key point is the fact that the central operation is made with a function which must be associative, commutative and idempotent, so that multiple evaluations on the same object do not affect the final outcome, which is the combination of the hash values of the distinct objects. A good candidate for such a function is the max function applied to a “signature” of each object, and this is the core of such streaming algorithms. The same idea has recently been used for other distributed algorithms (see [5] for the simulation of discrete random variables), where new entries or single changes should not force the whole algorithm to start afresh.
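The idempotent-merge idea above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's algorithm: the signature is the position of the first 1-bit of a hash (a common Flajolet–Martin-style choice), and the names `rank`, `hash64` and the use of SHA-256 are illustrative assumptions.

```python
import hashlib

def rank(h: int, nbits: int = 64) -> int:
    """Position of the first 1-bit (from the most significant side) of an nbits hash."""
    for i in range(nbits):
        if (h >> (nbits - 1 - i)) & 1:
            return i + 1
    return nbits + 1  # all bits were zero

def hash64(obj: str) -> int:
    # Illustrative hash: the first 64 bits of SHA-256.
    return int.from_bytes(hashlib.sha256(obj.encode()).digest()[:8], "big")

# Streaming pass: the memory is a single number, merged with max.
memory = 0
stream = ["a", "b", "a", "c", "b", "a", "d"]
for obj in stream:
    # max is associative, commutative and idempotent: repetitions of an
    # object cannot change the final content of the memory.
    memory = max(memory, rank(hash64(obj)))
    # obj is discarded immediately after this update

# A crude Flajolet-Martin-style estimate of the number of distinct elements:
estimate = 2 ** memory
```

Because the merge is idempotent, processing the deduplicated stream yields exactly the same memory content, which is the property the text emphasizes.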
As stated before, the main contributions in the study of FMa concerned complexity problems, and a deep mathematical-statistical approach has not yet been developed, even though this class of algorithms is probabilistic. This paper is a first attempt in this direction. The main contribution here is the analytical and numerical control of FMa based on a purely mathematical-statistical approach, while we leave the measure of the goodness of FMa to other studies (see [12] for a continuously updated work). In particular, we give here an analytical exact confidence interval for the quantity of interest. More precisely, we analyze an extension of the algorithms given above and, given , we will find , a function of the memory content, such that
(1) 
where is a given, strictly increasing, special function. It is important to note that the approximations for as in (1) given in the literature are not satisfactory. In some situations, the asymptotic behavior of the interval is calculated through a Central Limit Theorem (see [14]), but the huge skewness implicit in the algorithm variables (even in logarithmic scale) makes the Central Limit Theorem questionable. To overcome this problem, Chebyshev and Markov bounds are sometimes used to compute confidence intervals; see the papers cited in [18], where the results are analyzed in terms of optimal complexity (in space and time) without exploiting possible benefits in reducing the magnitude of the interval length. These facts suggest that we should not base the confidence interval on asymptotic statistical properties, but rather build an exact confidence interval based on concentration inequalities. In particular, we use Chernoff bounds, and we give an analytical approximation of the resulting inequalities. We show with Monte Carlo simulations that the analytical approximation does not affect the result significantly. Moreover, we show that the same result derives from the use of the Chernoff bounds on the limiting distribution that would be obtained with extreme value theory.
It is not surprising that some new analytical special functions appear in the analysis of the algorithm. In particular, a modification of the analytical extension of the harmonic number function arises here as the mean value of a particular statistic of interest, and it is a quantity that appears throughout the paper. Notably, the heuristic approximation in the classical case studied in the literature (with ) gives a value that is of the order of magnitude we obtain in our subsequent estimations.
In addition, we discuss a possible numerical implementation of the confidence interval in real time. To address this question, we develop an algorithm that solves all the relevant nonlinear problems with a cubic rate of convergence, and we provide the necessary numerical bounds to apply it. As a byproduct, we can give the algorithm that calculates the shortest confidence interval.
The paper is structured in the following way. In Section 2 we first describe how FMa work. We show how data are stored in memory and queried from it, and then we analyze these processes from a mathematical and a statistical point of view.
The main result, Theorem 3.1, is given at the beginning of Section 3, and the connection with the asymptotic results of extreme value theory is immediately discussed in Section 3.1. The proof of the main result is based both on the analytical computations of Chernoff bounds given in Section 4, and on the expected value of a quantity of interest, given in Section 5. Section 6 shows the goodness of the choice of the analytical approximations given in Section 4.
In Section 7 we numerically face some nonlinear equations that are necessary to query the interval (1) in the equivalent form. In particular, we give some sharp upper and lower bounds to develop cubic-rate algorithms, together with more robust bisection algorithms. As a byproduct, the algorithm that calculates the shortest confidence interval is given at the end of the section.
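A standard way to obtain a cubic rate of convergence for a smooth scalar equation is a Halley-type iteration; the following generic sketch illustrates the kind of solver the section refers to (the iteration scheme is standard, while the target function used below is purely illustrative and not one of the paper's equations).

```python
import math

def halley(f, df, d2f, x0, tol=1e-12, max_iter=50):
    """Halley's method: cubically convergent root finding for smooth f.

    Iteration: x <- x - 2 f f' / (2 f'^2 - f f'').
    """
    x = x0
    for _ in range(max_iter):
        fx, dfx, d2fx = f(x), df(x), d2f(x)
        step = 2 * fx * dfx / (2 * dfx * dfx - fx * d2fx)
        x -= step
        if abs(step) < tol:
            return x
    return x

# Illustrative nonlinear equation: solve exp(x) = 2, i.e. f(x) = exp(x) - 2 = 0,
# whose exact root is log(2).
root = halley(lambda x: math.exp(x) - 2, math.exp, math.exp, 1.0)
```

In practice such iterations are paired with a bisection fallback, as the section discusses, to guarantee robustness when the starting point is poor.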
2. Description of the algorithm
The main task of FMa is to provide an estimation of , the unknown number of distinct elements in a real-time stream of possibly repeating objects, based on independent hash functions. Our memory data structure is a generalization of a HyperLogLog data structure (see [10, 18, 12]), and it consists of two matrices and with rows and columns. The use of is a contribution of this paper beyond the classical algorithms given above, and it serves to increase the accuracy of the estimation of (see Section 3), by using bits of each hash function. The streaming algorithm that updates and in memory is given in Algorithm 1.
The flow of information is as follows. From each hash function , we extract the following information on a stream object :
(2) 
The data are then updated according to the following procedure:
• if : do nothing;
• if : set and ;
• if : set .
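The comparison thresholds in the case analysis above are elided in this copy of the text. As a purely hypothetical reading, the following sketch fills them with the standard interpretation for a max-based register pair: do nothing when the incoming rank is smaller than the stored one, overwrite both registers on a strict increase, and merge the extra bits on a tie. The names `C`, `B`, `update` and the bitwise-OR tie rule are illustrative assumptions, not the paper's definitions.

```python
def update(C, B, j, i, r, extra):
    """Hypothetical single-register update for hash function j and register i.

    C[j][i] stores the largest rank seen so far; B[j][i] stores the extra
    bits associated with that rank (assumed semantics).
    """
    if r < C[j][i]:
        pass                 # incoming rank is smaller: do nothing
    elif r > C[j][i]:
        C[j][i] = r          # new maximum: overwrite both registers
        B[j][i] = extra
    else:
        B[j][i] |= extra     # tie on the rank: combine extra bits (assumed rule)
```

Under this reading the update is idempotent, commutative and associative, which is the property required of Algorithm 1 in Section 2.1.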
The querying algorithm produces the value , which is the arithmetic mean of values built with the contents of and as in Algorithm 2.
As an example, in Algorithm 3, we show how to compute a confidence interval for of the form , based on Theorem 3.1. The nonlinear problems involved in this computation will be addressed in Section 7.
Finally, note that the data structure becomes that of [12] when and (the content of is not significant and the update reduces to , without the if-else branching). When, in addition, the data structure reduces to the original one of [15].
2.1. Mathematical and Statistical analysis of the algorithm
Algorithm 1 has the following properties. First, multiple applications of this algorithm to the same object produce the same outcome as if we had applied it only once. Mathematically speaking, this is an idempotent algorithm and, in addition, it can be seen to be associative and commutative. A typical mathematical function with these properties is the max function which, evaluated on different, even repeated, numbers, gives the same result, independently of the order and of the repetitions.
This is the reason why this algorithm works and why, as far as the final content of the matrices is concerned, Algorithm 1 may be thought of as applied only once to each of the different objects.
We will assume that each hash function generates a sequence of bits that is uniformly distributed over all possible outcomes. Moreover, the evaluations on different objects are assumed to be statistically independent, as are the evaluations of different functions. As an example, the SHA functions have been certified to have such properties [20, 21, 22], and they can be used for this purpose: by cutting the result of a function into parts, it is possible to obtain independent hash functions of sufficient length for any reasonable application.
From a probabilistic point of view, the bit sequences of (2) are independent for different choices of the object and of the hash function , and they are uniformly distributed over all the possible sequences. In other words, every bit in each sequence of the form is distributed as a Bernoulli of parameter , and it is independent from the others.
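A consequence of the i.i.d. Bernoulli(1/2) structure of the bits is that the position of the first 1-bit, which is the signature underlying the algorithm, is geometrically distributed. A quick Monte Carlo check (the function name and the sampling setup are illustrative):

```python
import random

random.seed(0)

def first_one_position(nbits=64):
    """Position of the first 1 in a sequence of i.i.d. Bernoulli(1/2) bits."""
    for k in range(1, nbits + 1):
        if random.getrandbits(1):
            return k
    return nbits + 1  # all nbits bits were zero

samples = [first_one_position() for _ in range(100_000)]

# P(position == k) = 2**-k, so the empirical frequency of k = 1 should be
# near 1/2, and the empirical mean near E[Geometric(1/2)] = 2.
freq1 = samples.count(1) / len(samples)
emp_mean = sum(samples) / len(samples)
```

This geometric variable is the discrete counterpart of the exponential variables that appear in Lemma 2.1.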
If we analyze Algorithm 2, we note that, for each index of the matrices, the only information that is kept after querying may be expressed in the following manner. For each fixed hash function and object , just compute , , and as in (2); then is defined as
The subsequent quantity in Algorithm 2 is the result of the following procedure
(3) 
Note that, if we complete the bit sequence in with an i.i.d. sequence of equally distributed bits , the random variables
(4) 
form a family of independent and identically distributed random variables. The point here is that, instead of measuring , we can only collect , due to computational limitations, and this introduces a further bias. We get the following result.
Lemma 2.1.
There exists a family
of independent and identically distributed random variables with exponential distribution of parameter
, such that, if we define , then, uniformly in and ,
where each is defined in (3). Moreover, for any fixed , define
Then the random vectors
are i.i.d., distributed as multinomial vectors of parameters and . Conditioned on , the random variables are independent.
Proof.
Take as in (4). Define
we note that it forms a family of random variables, independent and uniformly distributed on (see, e.g., [26, § 4.6]). Since , the first part of the lemma holds. In addition,
Since and , then . Hence we get the second part of the thesis, since
To conclude, just note that the first bits of each hash function generate , uniformly distributed on , independently of the remaining processes. Then, for each of the different objects and each hash function, a uniform assignment is made, which gives the multinomial sample. The conditional independence of the family is a consequence of the independence of the family . ∎
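The multinomial structure stated in Lemma 2.1 can be observed empirically: assigning each of n distinct objects to one of m registers uniformly at random produces counts distributed as a Multinomial(n, 1/m, …, 1/m) vector. A small simulation (the variable names are illustrative):

```python
import random

random.seed(1)
m, n = 16, 10_000        # m registers, n distinct objects
counts = [0] * m
for _ in range(n):
    # The first bits of the hash act as a uniform register choice.
    counts[random.randrange(m)] += 1

# Each register receives n/m objects on average.
mean_count = sum(counts) / m
```

The total is exactly n, the per-register mean is exactly n/m, and the fluctuations around n/m are of order sqrt(n/m), as expected for multinomial counts.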
3. Confidence interval for
The main result of this paper is the construction of a confidence interval for .
Theorem 3.1.
Proof of Theorem 3.1.
We first note that, by Lemma 2.1, if we define
(5) 
then it is sufficient to prove that
are confidence intervals for the unknown parameter at the same levels given in the theorem. To prove this last assertion, we prove the following sufficient conditions:
Observe that, since the function is invertible with continuous inverse (see Section A), we get
and hence the proof will be based on the following step:
3.1. Connection with extreme value theory
As stated in Lemma 2.1, the main result of this paper is based on the mean of the random variables , which are independent conditioned on . As discussed in the introduction, these variables are given through a commutative, associative and idempotent function, which is the max function in this context:
A natural question concerns the relation of these considerations with extreme value theory. The well-known Fisher–Tippett–Gnedenko theorem [17] provides an asymptotic result: it shows that, when , if there are sequences and such that converges in law to a random variable , then must be Gumbel, Fréchet or Weibull (Type 1, 2 or 3). As in the proof of Lemma 4.1, we have that
from which we can recognize that has a Gumbel law. Since the Chernoff bounds on the mean of such variables give the same concentration inequalities as in Theorem 3.1, our result also gives the confidence interval based on the Chernoff bounds of the asymptotic distribution from extreme value theory.
Our result underlines the fact that the analytical approximation gives an exact upper bound for the concentration inequality, based on the monotonicity of the limit , which is the key point in the proof of Lemma 4.1.
Finally, the accuracy of such a bound is discussed in Section 6.
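The Gumbel limit invoked above can be observed numerically: for i.i.d. Exp(1) variables, the maximum of n of them, centered by log n, is approximately standard Gumbel, whose mean is the Euler–Mascheroni constant γ ≈ 0.5772. A Monte Carlo sketch (sample sizes are illustrative):

```python
import math
import random

random.seed(2)
n, trials = 500, 5_000
gamma = 0.5772156649015329  # Euler-Mascheroni constant

# max of n Exp(1) variables, centered by log n, is approximately Gumbel(0, 1).
samples = [max(random.expovariate(1.0) for _ in range(n)) - math.log(n)
           for _ in range(trials)]
emp_mean = sum(samples) / trials
# The standard Gumbel distribution has mean gamma.
```

The empirical mean settles near γ, consistent with the exact identity E[max] = H_n together with H_n − log n → γ.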
4. Chernoff bounds: auxiliary results for the maximum of exponential random variables
We recall the Chernoff bound for a sum of independent random variables : for any ,
(6) 
which is one of the most powerful concentration inequalities in probability theory, since it involves the entire moment-generating functions instead of only some moments of each .
Lemma 4.1.
Let be a finite set of cardinality , and let be a collection of nonnegative integers. Let be an array of i.i.d. exponential random variables with parameter . Define, for any , and . Then
(7) 
where are defined in Theorem 3.1.
Proof.
To apply (6) with and , it is in principle possible to compute by noticing that the density of may be expressed as the density of the maximum of independent exponential random variables:
A more interesting interpretation leads to simpler computations. Denote by the th-order statistic of , and set for consistency. As recently noted, for example, in [11, Eq. (2)], for any , the random variables
are independent exponential random variables with parameter . Since
then each may be seen as a sum of independent exponential random variables with parameter , . As a direct consequence,
(8) 
where is the th harmonic number defined in (18). More remarkably, it is possible to calculate the moment-generating function of . In fact, since
we get, for ,  
Thus, since , the Chernoff bound (6) becomes
Since for any , then for , by (21)
Combining the two expressions above, we get
The part of the proof that concerns is hence proved by Lemma A.2.
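The order-statistic (Rényi) representation used in the proof, where the spacings are independent exponentials with decreasing rates, implies in particular that the expected maximum of n i.i.d. Exp(1) variables equals the harmonic number H_n, as in (8). This can be checked directly by simulating the spacings (sample sizes are illustrative):

```python
import random

random.seed(3)
n, trials = 50, 50_000
H_n = sum(1.0 / k for k in range(1, n + 1))   # n-th harmonic number, as in (8)

def max_via_spacings():
    """Renyi representation: max of n Exp(1) variables, built as the sum of
    independent exponential spacings with rates n, n-1, ..., 1."""
    return sum(random.expovariate(k) for k in range(1, n + 1))

emp = sum(max_via_spacings() for _ in range(trials)) / trials
```

The empirical mean matches H_n up to Monte Carlo error, illustrating why the moment-generating function of the maximum factorizes over the spacings in the computation above.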
5. Computation of
To complete the computation of the confidence interval, we give the following result, which connects the expectation of the core variables with the special functions we have introduced in this paper.
Lemma 5.1.
For any and , we have that
Proof of Lemma 5.1.
Corollary 5.2.
Let as in (5). The following inequalities hold