## 1 Introduction

Estimating the number of distinct elements in a data stream is one of the first, and one of the most fundamental, problems in streaming algorithms. In this problem, we observe a *data stream*, i.e. a sequence of elements $e_1, e_2, \ldots, e_m \in [n]$, and we wish to provide a $(1\pm\varepsilon)$-approximation for the number of distinct elements in this sequence, using small space. This can be trivially achieved with $O(\min(m \log n, n))$ bits of memory by either storing all elements encountered in the stream, or by storing a bitmask, keeping a single bit for every possible element of the universe. We wish to provide a probabilistic algorithm using significantly smaller space (allowing for a small failure probability $\delta$).

This problem was first studied by Flajolet and Martin in their seminal paper [DBLP:conf/focs/FlajoletM83] in FOCS 1983, which started a long line of research on subsequently improved algorithms [DBLP:conf/stoc/AlonMS96, DBLP:conf/random/Bar-YossefJKST02, DBLP:conf/spaa/GibbonsT01, Durand2003, DBLP:journals/ton/EstanVF06, flajolet2007hyperloglog, DBLP:conf/vldb/Gibbons01].

Kane, Nelson and Woodruff in 2010 [DBLP:conf/pods/KaneNW10] proposed an optimal algorithm for counting the number of distinct elements in the stream with constant failure probability — their algorithm provides a $(1\pm\varepsilon)$-approximation to the number of distinct elements using $O(\varepsilon^{-2} + \log n)$ bits — the matching lower bound had been shown prior to this [Woodruff:2004:OSL:982792.982817, DBLP:conf/stoc/AlonMS96, DBLP:journals/eccc/BrodyC09]. The standard black-box method of reducing the failure probability of an estimation algorithm of this kind to $\delta$ is to repeat it independently $O(\log \delta^{-1})$ times in parallel, and use the median of the reported answers as the final estimate. This method, applied to the algorithm mentioned above, uses $O((\varepsilon^{-2} + \log n)\log\delta^{-1})$ bits of space.

On the other hand, Jayram and Woodruff in [DBLP:journals/talg/JayramW13] developed a technique for proving lower bounds for streaming problems in the high success probability regime. Their technique allowed them to show that for a number of natural streaming problems the naive repetition method is optimal — for example, this is the case for estimation of the $\ell_p$ pseudonorm (with constant $p$) of the frequency vector in the so-called *strict turnstile* streaming model. In the same paper they proved a lower bound for the distinct elements problem of the form $\Omega(\varepsilon^{-2}\log\delta^{-1} + \log n)$. For constant $\varepsilon$, this left a gap between an upper bound of $O(\log n \cdot \log\delta^{-1})$ and a lower bound of $\Omega(\log n + \log\delta^{-1})$.

It was known that one should *not* expect a lower bound of the form $\Omega(\log n \cdot \log\delta^{-1})$ for this problem. Already [DBLP:conf/pods/KaneNW10] showed that for some constant $\delta_0 < 1$, one can achieve failure probability $\delta_0$ using only $O(\log n)$ bits, and in [DBLP:journals/talg/JayramW13] it was observed that for every constant $c$ there is an algorithm with failure probability $n^{-c}$ that beats the naive repetition bound. In this paper we completely resolve the question of the space complexity of the distinct elements problem in the high success probability regime, showing that the Jayram–Woodruff lower bound is optimal.

#### Continuous monitoring

Recently, the space complexity of *tracking* problems in data streams has been considered — namely, we say that a streaming algorithm provides $\varepsilon$-*strong tracking* of a statistic $A_t$ of the input stream if after every update $t$ it reports a quantity $\hat{A}_t$ such that

$$\Pr\left(\exists t \le m \colon |\hat{A}_t - A_t| > \varepsilon A_t\right) < \delta.$$

The first result of this form that we are aware of was proven in [DBLP:conf/pods/KaneNW10] as a subroutine for non-tracking estimation of distinct elements. They showed that one can achieve tracking of the number of distinct elements with some constant approximation factor. The question of whether one can achieve strong tracking without the naive union bound over all positions of the stream was explicitly asked later in [HuangTY14], where they also proposed an algorithm for estimation of the $\ell_p$-pseudonorm of the frequency vector, for constant $p$. Their algorithm yields an improvement over the baseline approach for very long input streams. Strong tracking of $F_p$ was later improved in [BravermanCWY16, BravermanCINWW17], where interesting results are achieved even in the more standard regime of parameters, with $n$ and $m$ that are polynomially related. They showed that one can solve strong tracking using fewer bits than the naive union-bound approach requires. An improved algorithm for strong tracking was provided in [DBLP:journals/corr/DingBN17].

#### Our contribution.

We provide an optimal streaming algorithm for the distinct elements problem in the high probability regime, using $O(\varepsilon^{-2}\log\delta^{-1} + \log n)$ bits of space. This result completely settles the space complexity of this problem with respect to all standard parameters.

We also show a strong tracking algorithm for the distinct elements problem with space $O(\varepsilon^{-2}(\log\log n + \log\delta^{-1}) + \log n)$, together with a matching lower bound — we prove that the $\Omega(\varepsilon^{-2}\log\log n)$ term is necessary. The $\Omega(\varepsilon^{-2}\log\delta^{-1} + \log n)$ lower bound was already known even for the easier non-tracking version of the distinct elements problem.

This is the first matching lower bound for any strong tracking problem for which a non-trivial algorithm is achievable. It shows a separation between the traditional estimation problem and the strong tracking variation when $\log\delta^{-1} = o(\log\log n)$. On the other hand, in the regime $\log\delta^{-1} \gtrsim \log\log n$ the strong tracking problem is not harder than one-shot estimation (up to constant factors).

The update time of our algorithm is dominated by the pseudorandom construction described in Section 8. In particular, by substituting this construction with a random walk over an expander graph of super-constant degree, it is possible to achieve a faster update time, at the cost of a slightly worse space complexity.

## 2 Notation

For a natural number $n$, by $[n]$ we denote the set $\{1, 2, \ldots, n\}$. For a finite set $S$, by $|S|$ or $\# S$ we denote the cardinality of $S$. For $x \in [2^k]$, we will write $\mathrm{bin}(x) \in \{0,1\}^k$ to be a bit representation of $x$. For a bitvector $v \in \{0,1\}^k$ we denote $\mathrm{lsb}(v) := \min\{i : v_i = 1\}$. For two bitvectors $u, w \in \{0,1\}^k$, we take $u \vee w$ to be the bitvector with $(u \vee w)_i = 1$ if and only if $u_i = 1$ or $w_i = 1$.

In the paper, $n$ will be used to denote the size of the universe from which the elements in the input stream are chosen, $m$ — the length of the stream, and $e_1, \ldots, e_m \in [n]$ are those elements. Set $D_t := \{e_1, \ldots, e_t\}$ to be the set of all distinct elements seen up to a time step $t$, and $d_t := |D_t|$.

Throughout the paper we use the notation $f \lesssim g$ to denote the existence of an absolute constant $C$ such that $f \le C g$, where $f$ and $g$ themselves may depend on a number of parameters. We write $f \simeq g$ to denote $f \lesssim g \wedge g \lesssim f$.

## 3 Overview of our approach

### 3.1 Constant factor approximation with high probability

The main goal of Section 4 is to show a streaming algorithm that provides an $O(1)$-approximation to the number of distinct elements at all times in the stream (i.e. $O(1)$-strong tracking), with probability $1-\delta$, using optimal space $O(\log\delta^{-1} + \log n)$ bits. That is, we want to provide an estimate $\hat{d}_t$, such that

$$\Pr\left(\exists t \le m \colon \hat{d}_t \notin [c_1 d_t, c_2 d_t]\right) < \delta,$$

where $d_t$ is the number of distinct elements on the input among $e_1, \ldots, e_t$.

Note that in this regime of parameters, if one has an algorithm estimating the number of distinct elements with failure probability $\delta'$ using space complexity $O(\log\delta'^{-1} + \log n)$, one can set $\delta' := \delta/m$, and apply a union bound over all $m$ insertions to the stream, to get a strong tracking algorithm for the same problem with failure probability $\delta$ and space complexity $O(\log\delta^{-1} + \log n)$ whenever $m \le \mathrm{poly}(n)$. As such, we can without loss of generality focus on the strong tracking version, and this stronger guarantee is going to be useful in order to ensure that the algorithm can be implemented using small space.

To discuss the main idea behind our approach, for the sake of presentation we will first consider a random oracle model — here we assume that the algorithm is augmented with access to a uniformly random function $h$ (all the values of $h$ are uniform and independent); in particular the space to store such a function does not count towards the space complexity of the algorithm, and the failure probability is understood over the selection of the oracle. For the space complexity of such an algorithm, we will count only the amount of information passed between observations of elements from the input stream; we are allowed to use larger space while processing an element from the input. This allows us to talk in a meaningful way about space complexity $o(\log n)$, even though any single element in the stream already takes $\Theta(\log n)$ bits to store.

Let us start with a discussion of how to design an algorithm using $O(\log\delta^{-1})$ bits of space in the random oracle model. It is well-known that given access to a random hash function $h$, if we fix some set $D \subset [n]$, then $Z := \max_{e \in D} \mathrm{lsb}(h(e))$ is such that $2^Z$ is a constant factor approximation to $|D|$ with constant probability, where $\mathrm{lsb}$ is the least-significant-bit function [DBLP:conf/focs/FlajoletM83]. Indeed, to argue that this is true, we can consider subsets $D_k \subset D$ given by $D_k := \{e \in D : \mathrm{lsb}(h(e)) \ge k\}$ — every such subset corresponds to sub-sampling $D$ by a factor $2^{-k}$, and we should expect that the last non-empty set $D_k$ is the one corresponding to sub-sampling by a factor roughly $1/|D|$.
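
As an illustration, here is a minimal sketch of this basic estimator in the random oracle model. All function names are ours, and the random oracle is simulated by seeding Python's PRNG per element — this is a toy stand-in, not the paper's construction:

```python
import random

def lsb(x: int) -> int:
    """Index of the least significant set bit of x (assumes x > 0)."""
    return (x & -x).bit_length() - 1

def fm_estimate(stream, bits: int = 64, seed: int = 0) -> int:
    """Basic Flajolet-Martin-style estimator in the random oracle model.

    Each element's hash is simulated by a PRNG keyed on (seed, element);
    2 ** (Z + 1), where Z is the maximum lsb seen, is a constant-factor
    approximation of the number of distinct elements with constant
    probability.  Duplicates cannot affect the answer, since a repeated
    element hashes to the same value every time it appears.
    """
    max_lsb = -1
    for x in stream:
        # Force the top bit so the hash is never 0 and lsb is defined.
        h = random.Random((seed << 32) ^ x).getrandbits(bits) | (1 << bits)
        max_lsb = max(max_lsb, lsb(h))
    return 2 ** (max_lsb + 1)
```

Running several copies with independent seeds and taking the median is exactly the amplification step discussed next.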

We can repeat the estimator constructed above $O(\log\delta^{-1})$ times independently in parallel and take the median, in order to achieve $O(\log\delta^{-1}\log\log n)$ bit complexity for failure probability $\delta$ under the random oracle model. To improve this construction to $O(\log\delta^{-1})$ bits, let us take $k := O(\log\delta^{-1})$ independent estimators $Z_1, \ldots, Z_k$ as above. Instead of storing all those estimators independently, we can store the median (which takes $O(\log\log n)$ bits), and the deviations $Z_j - \mathrm{median}_i Z_i$. One can show that with high probability, at all times throughout the stream, the median is a good estimator of the number of distinct elements seen so far, and moreover — because the deviations are random variables that are extremely well concentrated around zero — on average over all the counters we will use a constant number of bits per counter to store all deviations from the median, at all time steps.
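
The space accounting behind this compression can be sketched as follows; the gamma-style encoding and the helper name are our own illustration, not the paper's exact scheme:

```python
import statistics

def compressed_size_bits(estimates):
    """Bits to store a group of integer counters as (median, deviations).

    The median is written explicitly; each deviation d is charged
    roughly 2*log(1+|d|) + O(1) bits, as in an Elias-gamma-style code.
    When the deviations concentrate around zero this is O(1) bits per
    counter on average, versus O(log log n) bits for storing each
    counter explicitly.
    """
    med = statistics.median_low(estimates)
    bits = med.bit_length() + 1  # the median itself
    for e in estimates:
        d = abs(e - med)
        bits += 2 * (d.bit_length() + 1) + 1  # sign bit + gamma-coded |d|
    return bits
```

A group of well-concentrated counters costs a few bits per counter, while spread-out counters would not compress — which is exactly why the concentration of the deviations matters.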

Getting rid of the random oracle assumption is much more technical — without access to the random oracle, it is known ([DBLP:conf/stoc/AlonMS96]) that one can use a pairwise independent hash function to get a constant success probability — and a seed for such a hash function can be stored using $O(\log n)$ bits. This, together with a median over $O(\log\delta^{-1})$ parallel repetitions of the estimator, yields a simple $O(\log n \log\delta^{-1})$ space algorithm with failure probability $\delta$.

To improve upon that, we can observe that in this setting it is not necessary for all the estimators to use independent seeds for the underlying pairwise-independent hash functions. Instead, we can consider a fully explicit constant degree expander graph, with the set of vertices corresponding to the set of seeds for pairwise independent hash functions. We choose the first seed uniformly at random, but subsequent seeds are chosen by a random walk over this expander graph. In such a way, we can succinctly store all the seeds using $O(\log n + \log\delta^{-1})$ bits of space, and the standard Chernoff bounds for expander walks [Vadhan12, Theorem 4.22] imply that the median of estimators generated in such a way is still a constant factor approximation for the number of distinct elements, except with small failure probability $\delta$. This yields an algorithm with space complexity $O(\log n + \log\delta^{-1}\log\log n)$, if we store the estimators explicitly — still falling short of our goal of $O(\log n + \log\delta^{-1})$ bits of space.
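
The seed-selection step can be sketched as follows. Here `neighbor` is a placeholder: a real instantiation would use an explicit constant-degree expander (e.g. Gabber–Galil) whose neighbor function is computable in constant time:

```python
import random

def expander_walk_seeds(k, num_vertices, degree, neighbor, rng):
    """Pick k hash-function seeds via a random walk on a graph.

    `neighbor(v, i)` is assumed to return the i-th neighbor of vertex v.
    Storing the walk costs log(num_vertices) bits for the start plus
    (k - 1) * log(degree) bits for the edge choices, as opposed to
    k * log(num_vertices) bits for fully independent seeds.
    """
    v = rng.randrange(num_vertices)
    walk = [v]
    for _ in range(k - 1):
        v = neighbor(v, rng.randrange(degree))
        walk.append(v)
    return walk
```

When the graph really is an expander, the Chernoff bound for expander walks guarantees that the median over the $k$ estimators seeded this way still succeeds with probability $1 - 2^{-\Omega(k)}$.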

Unfortunately, we *cannot* argue, as before in the random oracle model, that we can succinctly store all counters generated via such an expander walk by considering the median and deviations from the median separately — sufficiently strong concentration bounds are *not true* for a constant degree expander walk.

Instead, inspired by the construction of a sampler in [DBLP:conf/soda/Meka17], we show that by composing a number of pseudorandom objects (i.e. pairwise independent hash functions, short walks over super-constant-degree expander graphs, averaging samplers obtained from the celebrated construction of strong extractors [DBLP:journals/rsa/Zuckerman97], and standard sub-sampling methods), we can generate $O(\log\delta^{-1})$ estimators in total, divided into groups. More concretely, we produce a number of groups of estimators, each group of super-constant size, such that with probability $1-\delta$, for at least half of the groups the median yields a good estimation of $d_t$ at all times, while simultaneously the “good” groups take, at all times, a constant number of bits on average per estimator to store, if we store the estimators within a group by storing separately the median and the deviations from the median, as discussed above.

It is essential for this argument that the size of each group is greater than $\log\log n$ — intuitively, if we consider a random group of such a size, the probability that we need too many bits to store compactly such a group at any fixed step is bounded by $2^{-\Omega(\text{group size})} \ll 1/\log n$, and therefore we can union bound over all $O(\log n)$ positions where $d_t$ grows by a factor of two, without affecting the failure probability too much.

The details of this pseudorandom construction are presented in Section 8. This is the main technical difficulty in proving the following theorem.

###### Theorem 1.

There is a streaming algorithm with space complexity $O(\log\delta^{-1} + \log n)$ bits that, with probability $1-\delta$, reports a constant factor approximation to the number of distinct elements in the stream after every update.

This space complexity is optimal [DBLP:conf/stoc/AlonMS96, DBLP:journals/talg/JayramW13].

The algorithm could be significantly simplified, and would mimic exactly the algorithm in the random oracle scenario, if we had an explicit sampler satisfying the following guarantee, with seed length $O(\log n + \log\delta^{-1})$. In the following definition, $\mathcal{U}_S$ denotes the uniform probability distribution over a set $S$.

###### Definition 2 (Sub-gaussian sampler).

Function is called a sub-gaussian sampler if and only if for every satisfying , we have

We consider the existence of explicit samplers like this with seed length $O(\log n + \log\delta^{-1})$ to be an interesting question in its own right in the area of pseudorandomness, which will likely have many other applications.¹ In fact, for black-box derandomization of the random oracle algorithm described in this section it is enough to have a sampler for functions with stronger tail probabilities — it is enough for it to apply to functions with doubly exponential tails.

¹ Non-constructive existence of samplers like this can be proven using the probabilistic method, and a reduction to strong samplers similar to Lemma 36.

###### Remark 3.

The update time of this algorithm is dominated by the pseudorandom construction we are using. If we give up on succinctly storing the estimates, and store them explicitly, we can replace this pseudorandom construction with a single random walk over a constant degree expander graph. There are expander graphs that allow evaluation of the neighbour function in constant time [gabber1981explicit]. Such a modification would give an algorithm using slightly worse space, but constant update time, for strong tracking with constant factor approximation.

It is possible to carry this construction over to our subsequent result, achieving bits of space for high accuracy regime, and bits of space for tracking, all with update time .

### 3.2 High accuracy regime

In Section 5 we discuss how to use the previous construction to achieve a high accuracy estimation of the number of distinct elements, with probability $1-\delta$. We prove the following theorem.

###### Theorem 4.

For every $\delta \in (0, 1/2)$ there is an algorithm using $O(\varepsilon^{-2}\log\delta^{-1} + \log n)$ bits, which, at the end of the stream, reports a $(1\pm\varepsilon)$-approximation to the number of distinct elements, with probability $1-\delta$.

###### Remark 5.

[DBLP:conf/stoc/AlonMS96, DBLP:journals/talg/JayramW13] This space complexity is optimal — every algorithm that estimates the number of distinct elements up to a factor $(1\pm\varepsilon)$ with probability $1-\delta$ needs to use at least $\Omega(\varepsilon^{-2}\log\delta^{-1} + \log n)$ bits of space.

Given the ideas in the work of Kane *et al.* [DBLP:conf/pods/KaneNW10] and the results obtained in the previous section, getting the correct dependence on the error parameter $\varepsilon$ is routine, although somewhat tedious.

We consider separately two ranges of parameters. In the first, the KNW algorithm (given access as a black box to strong tracking with some constant approximation factor), using small space and few random bits, has constant probability of providing a $(1\pm\varepsilon)$-approximation to the number of distinct elements — since $\log\delta^{-1}$ is small here, the space budget is large enough for the analysis to work. We can instantiate $O(\log\delta^{-1})$ parallel copies of this algorithm, providing them access to the same strong tracker. Naively, we would have to store fresh random bits for each instance of the KNW algorithm — to reduce the amount of randomness necessary, we pick them using a walk over a constant-degree expander graph. That is, the random bits for the first instance of the KNW algorithm are completely uniform, but the bits for subsequent runs of KNW are chosen by following a random edge of the expander graph. We can use standard Chernoff bounds for expander walks, as in [Gillman:1998:CBR:284943.284979], to show that the failure probability of such an algorithm is still at most $\delta$.

In the other range of parameters, after a standard without-loss-of-generality reduction, we can instantiate a larger number of parallel copies of the KNW algorithm, using the pseudorandom construction described in Section 8. Here the number of instantiations of this algorithm is large enough that we can hope the space consumption at every time step, on average over all the instantiations, will be small. An analysis identical to the one for a constant approximation factor in Section 4 can be used to deduce the correctness of such an approach.

In fact, the space guarantees of the KNW algorithm, as it was originally analyzed, applied only in a restricted range of parameters — as this could be assumed without loss of generality in the original setting; for us, the complementary scenario is relevant. We provide a more delicate analysis of the space consumption of this algorithm in Appendix A (specifically Theorem 13), which is sufficient for our purposes.

### 3.3 Strong tracking of distinct elements

In Section 6 we discuss how to achieve the $\varepsilon$-strong tracking guarantee for distinct elements estimation.

First, let us observe that an algorithm estimating the number of distinct elements with small failure probability already translates into some upper bound on the space complexity of the tracking problem. Given that the number of distinct elements in the stream is non-decreasing, and our estimators proposed in Section 5 are monotone as well, it is enough to look at a sequence of positions $t_1 < t_2 < \cdots$ such that $d_{t_{i+1}} \approx (1+\varepsilon) d_{t_i}$. If the estimate is within a $(1\pm\varepsilon)$ factor of the actual number of distinct elements at all points $t_i$, we can deduce strong tracking with accuracy $O(\varepsilon)$: for $t_i \le t \le t_{i+1}$ we have $\hat{d}_t \le \hat{d}_{t_{i+1}} \le (1+\varepsilon) d_{t_{i+1}} \le (1+\varepsilon)^2 d_t$, and similarly for the lower bound. There are at most $O(\varepsilon^{-1}\log n)$ such positions $t_i$, so by setting the failure probability in Theorem 4 small enough to union bound over them, we can deduce that there is an algorithm satisfying strong tracking with probability $1-\delta$, using $O(\varepsilon^{-2}(\log\delta^{-1} + \log\log n + \log\varepsilon^{-1}) + \log n)$ bits of space.
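
For concreteness, the number of such checkpoint positions is $O(\varepsilon^{-1}\log n)$, since $(1+\varepsilon)^k$ reaches $n$ once $k \approx \log n / \log(1+\varepsilon) \approx \varepsilon^{-1}\log n$; a quick sketch (function name ours):

```python
import math

def num_checkpoints(n: int, eps: float) -> int:
    """Number of positions at which the distinct count can grow by a
    (1 + eps) multiplicative factor before reaching n: O(log(n) / eps)."""
    return math.ceil(math.log(n) / math.log(1 + eps)) + 1
```

Union bounding the failure probability of Theorem 4 over these checkpoints, rather than over all $m$ stream positions, is what keeps the overhead to a $\log\log n + \log\varepsilon^{-1}$ factor.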

We show that, by opening up the [DBLP:conf/pods/KaneNW10] construction and providing a more detailed analysis, it is possible to remove the additive $O(\varepsilon^{-2}\log\varepsilon^{-1})$ term, and obtain an optimal algorithm for tracking.

###### Theorem 6.

There is an algorithm for strong tracking of the number of distinct elements in the stream, using $O(\varepsilon^{-2}(\log\log n + \log\delta^{-1}) + \log n)$ bits of space.

To describe the overview of our contribution, let us first discuss the high-level idea behind the [DBLP:conf/pods/KaneNW10] algorithm. Let us focus first on the random oracle model. Consider a fixed set $D \subset [n]$ (the set of distinct elements seen at the end of the stream), a random hash function $h$, and sets $D_k := \{e \in D : \mathrm{lsb}(h(e)) \ge k\}$ — those sets correspond roughly to subsampling $D$ by a factor of $2^{-k}$. If we already have access to a constant factor approximation of $|D|$, we can zoom in onto the set $D_{k^*}$ for which we expect $|D_{k^*}| \simeq \varepsilon^{-2}$. Clearly $2^{k^*}|D_{k^*}|$ is an unbiased estimator of $|D|$, and moreover the standard deviation of this estimator is of order $\varepsilon|D|$. This implies that if we had a way to estimate the size of $D_{k^*}$ up to a multiplicative factor $(1\pm\varepsilon)$, that would be enough to get an approximation for $|D|$.

In order to do this, we can check $g(e)$ for a hash function $g \colon [n] \to [B]$ with $B \simeq \varepsilon^{-2}$ bins. We wish to recover $|D_{k^*}|$ from $|g(D_{k^*})|$. This is reminiscent of a famous balls-and-bins thought experiment: we are throwing $A$ balls randomly into $B$ bins, and we try to estimate the number of balls, given the number of non-empty bins. Let us define $\beta_B(A)$ to be the expected number of non-empty bins after throwing $A$ balls at random into $B$ bins (we will drop the subscript $B$ in further discussion); we have $\beta(A) = B\left(1 - (1 - 1/B)^A\right)$. We claim that, as long as $A \lesssim B$, we will have with good probability $\beta^{-1}(|g(D_{k^*})|) = (1 \pm O(\varepsilon))A$. This is because the number of non-empty bins is well concentrated, so $|g(D_{k^*})| = \beta(A) \pm O(\varepsilon A)$, but $\beta$ is bi-Lipschitz in the regime $A \lesssim B$ — i.e. for any $A_1, A_2 \lesssim B$ we have $|\beta(A_1) - \beta(A_2)| \simeq |A_1 - A_2|$. We can put those two facts together to get $\beta^{-1}(|g(D_{k^*})|) = A \pm O(\varepsilon A)$.
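
A sketch of this inversion (names ours; the paper works with the function $\beta$ analytically, while here we simply invert it numerically by bisection, using its monotonicity):

```python
def expected_nonempty(A: float, B: int) -> float:
    """beta(A): expected number of non-empty bins after throwing A balls
    uniformly at random into B bins."""
    return B * (1.0 - (1.0 - 1.0 / B) ** A)

def invert_nonempty(k: float, B: int) -> float:
    """Recover the ball count A from an observed number k of non-empty
    bins, by bisecting the monotone map A -> beta(A)."""
    lo, hi = 0.0, 64.0 * B  # A = O(B) in the regime of interest
    for _ in range(80):
        mid = (lo + hi) / 2
        if expected_nonempty(mid, B) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

The bi-Lipschitz property corresponds to the slope of `expected_nonempty` being bounded between constants for `A` up to a constant multiple of `B`, so small errors in the observed bin count translate into comparably small errors in the recovered ball count.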

Using $O(\varepsilon^{-2}\log n)$ bits in total, we can have access to $|g(D_k)|$ for all $k$ throughout the stream — it is enough, for each bin, to store the largest value of $\mathrm{lsb}(h(e))$ among the elements $e$ hashed into it. In [DBLP:conf/pods/KaneNW10] it is discussed, among other things, how to reduce the space complexity of storing the information about $|g(D_k)|$ for all $k$ to $O(\varepsilon^{-2} + \log n)$ bits in total, and how to remove the random oracle assumption, by using compositions of bounded-wise independent hash functions. We describe this algorithm in Appendix A, together with a more detailed analysis of the distribution of the space complexity of this algorithm.

In order to achieve smaller space for the tracking algorithm, let us focus on a specific $k$ and consider the evolution of $|g(D_{t,k})|$ over the updates to the stream, where $D_{t,k} := \{e \in D_t : \mathrm{lsb}(h(e)) \ge k\}$ is the sub-sampled set at time $t$. More specifically, let us look at the portion of the stream over which $k$ is the relevant level, i.e. $|D_{t,k}| \lesssim \varepsilon^{-2}$. We wish to say that with good probability, simultaneously at all such times, $2^k |D_{t,k}|$ gives us an approximation of $d_t$ up to an additive term $\varepsilon d_t$. Moreover, we want to say that $\beta^{-1}(|g(D_{t,k})|)$ yields at all such times an approximation to $|D_{t,k}|$, again with additive error $\varepsilon |D_{t,k}|$. If we are able to show this, we can later amplify this success probability using repetitions of the whole algorithm, and union bound over all possible settings of $k$ in order to achieve strong tracking. Note that there are only $O(\log n)$ values of $k$ to union bound over, as opposed to $O(\varepsilon^{-1}\log n)$ distinct positions in the stream where $d_t$ grows by a multiplicative factor of $(1+\varepsilon)$.

In the random oracle model both facts — the fact that $2^k |D_{t,k}|$ approximates $d_t$ at all times, as well as the fact that $\beta^{-1}(|g(D_{t,k})|)$ approximates $|D_{t,k}|$ at all times — can be proven by Doob's martingale inequality. In particular, the first fact follows directly (after shifting and rescaling) from the fact that for a random walk $S_j := \sum_{i \le j} X_i$, where $X_i$ are arbitrary independent random variables satisfying $\mathbb{E} X_i = 0$ and $\mathbb{E} X_i^2 \le 1$, we have $\max_{j \le n} |S_j| \lesssim \sqrt{n}$ with good probability. The main technical difficulty in the strong tracking part of this paper lies in dropping the random oracle assumption, and showing some variation on Doob's martingale inequality under bounded independence hash functions. In particular, we show the following lemma about the deviations of a random walk that might be of independent interest.

###### Lemma 7.

Let be collection of -wise independent random variables, with , and , and let , then

A result of the same spirit can be deduced from [BravermanCINWW17, Theorem 10] when are uniform random variables — in our case, however, the steps are significantly less well-behaved, i.e. is already extremely large, even compared to .

Lemma 7 already implies the first part of the argument: the tracking guarantee for $2^k|D_{t,k}|$. To control the deviations of $|g(D_{t,k})|$ from its expectation, we take $g$ to be a composition of a pairwise independent hash function and a bounded-wise independent function. We should expect that the pairwise independent function has no collisions on the relevant set with good probability, and then all we care about are the deviations of the number of non-empty bins from its expectation.

Consider the elements of $D_t$ in their order of arrival — in the random oracle model, bounding the deviations of $|g(D_t)|$ can be reduced to bounding the deviations of the Doob martingale $M_j := \mathbb{E}\big[\,|g(D_t)| \,\big|\, g(e_1), \ldots, g(e_j)\big]$. In this setting Doob's martingale inequality yields

$$\Pr\Big(\max_{j} |M_j - \mathbb{E} M_j| > \lambda\Big) \lesssim \frac{\operatorname{Var} |g(D_t)|}{\lambda^2}, \qquad (1)$$

where the values $g(e_i)$ are independent and uniform. Finally, this, together with the bi-Lipschitz property of the function $\beta$ in the range of interest, implies that indeed we have the desired approximation guarantee at all times.

In our case, the variables have only bounded-wise independence, and the process above is no longer a martingale. We deal with this by showing that the relevant quantity can be approximated (in some sense, under the distributions of interest) by a low-degree polynomial, and we show that under some additional restrictions, processes as above induced by low-degree polynomials satisfy the same Eq. (1), even if the variables are only $k$-wise independent, as opposed to fully independent.

### 3.4 Strong tracking lower bound

In Section 7 we show the optimality of the strong tracking algorithm proposed in the previous section. We prove the following theorem.

###### Theorem 8.

Every algorithm solving strong tracking of the number of distinct elements with constant probability needs to use $\Omega(\varepsilon^{-2}\log\log n)$ bits of space.

Together with previously known lower bound for estimation, this shows a lower bound that exactly matches our upper bound discussed earlier.

In order to show this, we introduce a multi-round communication game, where at round $i$, Alice observes an input $x_i \in \{0,1\}^n$, Bob observes an input $y_i \in \{0,1\}^n$, and they both observe all the previous inputs. In the $i$-th round, Alice sends a message to Bob, and Bob is supposed to report an approximation to the number of ones in the bitwise OR of all inputs seen so far. The protocol is successful if and only if simultaneously at all rounds Bob reports a correct (approximate) answer. We show that the existence of a strong tracking algorithm implies a low-communication protocol for this kind of game — which in turn implies a one-round one-way communication protocol for estimation of the number of distinct elements with small failure probability. This would contradict the known communication complexity lower bound for distinct elements counting with small failure probability [DBLP:journals/talg/JayramW13].

### 3.5 Pseudorandom construction

In Section 8, we prove the main derandomization lemma used in the algorithm described in Section 4 for constant factor approximation of the number of distinct elements. We wish to use $\Theta(\log\delta^{-1})$ instantiations of the basic estimator, where each instantiation is uniquely determined by a seed for the pairwise independent hash function used by that estimator (such a seed is of length $O(\log n)$; let us call the number of different seeds $N$).

We pick a random walk over an expander graph with $N$ vertices and constant degree; such a walk will be *bad* with some small probability — by which we mean that either the median of all the estimators produced by this walk is at some point far from the actual $d_t$, or that we need at some point too many bits to store all the values of the estimators by storing the median and the deviations from the median. A single random walk like this is going to need $\log N + O(\text{walk length})$ random bits. If we consider now the space of all those random walks, we can use known constructions of averaging samplers to get a sample $T$ of small size, such that with high probability the fraction of bad random walks in $T$ is roughly the same as in the entire space. If we condition on the event that the sampler succeeded, by taking independent elements from the sample $T$, we can see that more than half of them are bad only with exponentially small probability. As we are taking independent elements from a universe of size $|T|$, we need only few random bits to achieve this.

## 4 Constant factor approximation with high probability

In this section we prove Theorem 1, assuming the existence of the specific pseudo-random objects described in Lemma 11. The proof of Lemma 11 itself is postponed until Section 8. We will first state a few necessary definitions, followed by the statement of the pseudo-random lemma, and then we proceed with the proof of Theorem 1.

###### Definition 9 (doubly-exponential tail).

We say that a function satisfies doubly-exponential tail bounds if .

###### Definition 10 (-small set).

Consider a finite universe , equipped with functions . We will say that a sequence is -small with respect to , if .

Equipped with those definitions, we are ready to state the pseudorandom lemma.

###### Lemma 11.

For any satisfying in addition , there exist with , and an explicit function , such that for any with doubly-exponential tail bounds, we have with probability at least over a random selection of the seed , that majority of sequences is -small for some universal constant .

That is

where above must satisfy doubly-exponential tail bounds.

The seed length in this construction is , and can be evaluated in time .

Let us now proceed with the proof of Theorem 1.

Fix a stream of updates , and corresponding sets .

Consider as a set, with implicit bijection to a family of pairwise independent hash functions from to , where . For each , we have corresponding hash function , and estimates — the estimate for given by hash function . We will focus on the error of those estimators .

###### Fact 12.

The error terms satisfy the following subexponential tail bound

###### Proof.

Consider random set .

For we have , and , hence by Chebyshev's inequality, and therefore .

For the lower tail bound it is enough to consider Markov's inequality: if , we have , and . ∎

We will be interested in a quantity proportional to the number of bits necessary to write down the deviation. Fact 12 implies that those quantities have doubly-exponential tail bounds, up to rescaling by a constant factor.

Let us take a sequence , such that , where . We can now apply Lemma 11, with and functions given by and .

The final algorithm is the following: in the initialization phase, we choose a uniformly random string and store it. Consider now the groups, as in the statement of Lemma 11, which for each value of the group index yield a collection of seeds for pairwise independent hash functions. For every such group, we store all the corresponding estimates in compressed form: we store separately the median of all estimates within the group, and the differences between each estimate and the aforementioned median. If at any point in time the size of the whole description of a given group in bits exceeds some fixed threshold, we mark this group as broken and we stop updating it. Clearly, the total space complexity is bounded by the number of groups times the per-group threshold.

We claim that from the -smallness condition, we can deduce that for a majority of the groups, at all times, both the median of all the estimates within the group is close to the actual $d_t$, and the total space needed to store the whole group is small. If this is the case, then as an estimate for $d_t$ we can just report the median, over the groups that are not marked as broken, of all the medians of estimates within those groups, and the correctness of the algorithm follows.

To finish the argument, we need to show that every -small group of estimators indeed yields a good approximation for $d_t$, and is stored succinctly at all times (i.e. never becomes marked as broken). Consider a -small group. First, we will argue that at all times the median of the group is within a constant factor of $d_t$. Indeed, we know that on average over the group the errors are small, therefore for at least a constant fraction of the estimators the error is bounded, which means that for those estimators the estimate is within a constant factor of $d_t$. This, together with the definition of the median, implies the claim. To argue that we are storing the group using few bits, let $M$ be the median of all the estimates within the group. The space needed to store the group is given by the bits to store the median, plus the bits to store all deviations from it. We have the bound

where the first sum is bounded because of the -smallness condition.

Finally, we have to say that if the group satisfies those two properties at all times $t_i$ of the chosen sequence, then those properties are satisfied (with larger constants) at all time steps $t$. To see that, fix some $t$ between two consecutive times $t_i$ and $t_{i+1}$. Note that $d_t$ is non-decreasing with respect to $t$, and we have

Moreover, by the triangle inequality

and each of those terms is bounded by a constant.

This implies

This completes the proof of the correctness of the algorithm — at any step $t$, all -small groups are not marked as broken, and all of them report a constant factor approximation. Strictly more than half of all the groups are -small, hence the median over all groups that are still active has to be a constant factor approximation to the quantity of interest as well.

## 5 High accuracy regime

In this section we prove Theorem 4. As a building block we will use the algorithm discussed in [DBLP:conf/pods/KaneNW10, Section 3.2]. In the Appendix we prove the following, qualitatively stronger, bounds on the space complexity of their algorithm. The construction of the algorithm and the correctness analysis were already present in [DBLP:conf/pods/KaneNW10] — correctness can also be deduced from the discussion in Section 6, where we discuss this algorithm in detail, and show stronger guarantees for a slight variation of it. Note that in the original paper the guarantees on the space complexity of this algorithm were proven in a restricted range of parameters, as this could be assumed without loss of generality in their setting. For us, the complementary scenario is relevant.

###### Theorem 13.

There is an algorithm which gives a -approximation to with probability at least , assuming access to an oracle providing strong tracking of with constant factor approximation , and oracle access to additional random bits. The space usage of this algorithm at any given time (excluding random bits mentioned above), denoted by , satisfies
