1 Introduction
The last few years have witnessed rapid growth in the use of machine learning methods to solve "classical" algorithmic problems. For example, they have been used to improve the performance of data structures [KBC18, Mit18], online algorithms [LV18, PSK18, GP19], combinatorial optimization [KDZ17, BDSV18], similarity search [WLKC16], compressive sensing [MPB15, BJPD17] and streaming algorithms [HIKV19]. Multiple frameworks for designing and analyzing such algorithms were proposed [ACC11, GR17, BDV18, AKL19]. The rationale behind this line of research is that machine learning makes it possible to adapt the behavior of the algorithms to inputs from a specific data distribution, making them more efficient or more accurate in specific applications.

In this paper we focus on learning-augmented streaming algorithms for frequency estimation. The latter problem is formalized as follows: given a sequence $S$ of elements from some universe $U$, construct a data structure that for any element $i$ computes an estimate $\tilde{f}_i$ of $f_i$, the number of times $i$ occurs in $S$. Since counting data elements is a very common subroutine, frequency estimation algorithms have found applications in many areas, such as machine learning, network measurements and computer security. Many of the most popular algorithms for this problem, such as Count-Min (CM) [CM05] or Count-Sketch (CS) [CCFC02], are based on hashing. Specifically, these algorithms hash stream elements into buckets, count the number of items hashed into each bucket, and use the bucket value as an estimate of item frequency. To improve the accuracy, the algorithms use $k$ such hash functions and aggregate the answers. These algorithms have several useful properties: they can handle item deletions (implemented by decrementing the respective counters), and some of them (Count-Min) never underestimate the true frequencies, i.e., $\tilde{f}_i \ge f_i$.
In a recent work [HIKV19], the authors showed that the aforementioned algorithms can be improved by augmenting them with machine learning. Their approach is as follows. During the training phase, they construct a classifier (neural network) to detect whether an element is "heavy", i.e., whether $f_i$ exceeds some threshold. After such a classifier is trained, they scan the input stream and apply the classifier to each element $i$. If the element is predicted to be heavy, it is allocated a unique bucket, so that the exact value of $f_i$ is computed. Otherwise, the element is forwarded to a "standard" hashing data structure $C$, e.g., CM or CS. To estimate $f_i$, the algorithm either returns the exact count (if $i$ is allocated a unique bucket) or the estimate provided by the data structure $C$. (See Figure 1 for a generic implementation of the learning-based algorithms of [HIKV19].) An empirical evaluation, on networking and query log data sets, shows that this approach can reduce the overall estimation error.

The paper also presents a preliminary analysis of the algorithm. Under the common assumption that the frequencies follow the Zipfian law, i.e., $f_i \propto 1/i$ (in fact we will assume that $f_i = 1/i$; this is just a matter of scaling and is convenient as it removes the dependence on the length of the stream in our bounds), and further that item $i$ is queried with probability proportional to its frequency, the expected error incurred by the learning-augmented version of CM is shown to be asymptotically lower than that of the "standard" CM. (This assumes that the error rate for the "heaviness" predictor is low enough. Aiming at a theoretical understanding, in this paper we focus on the case where the error rate is zero.) However, the magnitude of the gap between the error incurred by the learned and standard CM algorithms has not been established. Specifically, [HIKV19] only gives upper and lower bounds on the expected error of standard CM with $k$ hash functions and a total of $B$ buckets, and these bounds do not match. Furthermore, no such analysis was presented for CS.

1.1 Our results
In this paper we resolve the aforementioned questions left open in [HIKV19]. Assuming that the frequencies follow a Zipfian law, we show:
- An asymptotically tight bound on the expected error incurred by the CM algorithm with $k$ hash functions and a total of $B$ buckets. Together with a prior bound for Learned CM (Table 1), this quantifies the factor by which learning augmentation improves the error of CM when the heavy hitter oracle is perfect.
Table 1: Bounds on the expected errors of Count-Min (CM), Learned Count-Min (L-CM), Count-Sketch (CS), and Learned Count-Sketch (L-CS). The bounds for L-CM and one of the bounds for CM are from [HIKV19]; the remaining bounds are established in this paper.
Here $k$ denotes the number of hash functions, and we assume that $k$ is odd (so that the median of $k$ values is well defined). The lower bounds for L-CM and L-CS assume that we use the information from a perfect heavy hitter oracle (i.e., the oracle makes no mistakes in its predictions) to place the heavy hitters in separate buckets.

In addition to clarifying the gap between the learned and standard variants of popular frequency estimation algorithms, our results provide interesting insights about the algorithms themselves. For example, for both CM and CS, the number of hash functions $k$ is often selected to be $\Theta(\log n)$, in order to guarantee that every frequency is estimated up to a certain error bound. In contrast, we show that if instead the goal is to bound the expected error, then setting $k$ to a constant (strictly greater than $1$) leads to the asymptotically optimal performance. We remark that the same phenomenon holds not only for a Zipfian query distribution but in fact for an arbitrary distribution on the queries, e.g., the uniform one (see Remark 2.2).
1.2 Related work
In addition to the aforementioned hashing-based algorithms [CM05, CCFC02], multiple non-hashing algorithms were also proposed, e.g., [MG82, MM02, MAEA05]. These algorithms often exhibit better accuracy/space tradeoffs, but do not possess many of the properties of hashing-based methods, such as the ability to handle deletions as well as insertions.
Zipf's law is a common modeling tool used to evaluate the performance of frequency estimation algorithms, and has been used in many papers in this area, including [MM02, MAEA05, CCFC02]. In its general form it postulates that $f_i$ is proportional to $1/i^{\alpha}$ for some exponent parameter $\alpha > 0$. In this paper we focus on the "original" Zipf law where $\alpha = 1$. However, the techniques introduced in this paper can be applied to other values of the exponent as well.
1.3 Our techniques
Our main contribution is our analysis of the standard Count-Min and Count-Sketch algorithms for Zipfians with $k$ hash functions. Showing the improvement for the learned counterparts is relatively simple (for Count-Min it was already done in [HIKV19]). In both of these analyses we consider a fixed item $i$ and bound the expected error $\mathbb{E}[|\tilde{f}_i - f_i|]$, whereupon linearity of expectation leads to the desired results. In the following we assume that $f_i = 1/i$ for each $i$ and describe our techniques for bounding $\mathbb{E}[|\tilde{f}_i - f_i|]$ for each of the two algorithms.
Count-Min. With a single hash function and $B$ buckets, the head of the Zipfian distribution, namely the most frequent items, a priori contributes a significant amount to the expected error. Our main observation is that the fast decay of the Zipfian frequencies makes this contribution small, and that in fact the main contribution to the error comes from the light items in the tail. The expected contribution of these items is easily upper bounded, and it can be lower bounded using Bennett's concentration inequality. In contrast to the analysis from [HIKV19], which is technical and leads to suboptimal bounds, our analysis is short, simple, and yields completely tight bounds in terms of all of the parameters.
Count-Sketch.
Simply put, our main contribution is an improved understanding of the distribution of random variables of the form $S = \sum_i \eta_i \sigma_i f_i$. Here the $\eta_i$ are i.i.d. Bernoulli random variables and the $\sigma_i$ are independent Rademachers, that is, uniformly random signs in $\{-1, +1\}$. Note that the counters used in CS are random variables having precisely this form. Usually such random variables are studied for the purpose of obtaining large deviation results. In contrast, in order to analyze CS, we are interested in a fine-grained picture of the distribution within a "small" interval $I$ around zero. For example, when proving a lower bound on the expected error we must prove a certain anti-concentration of $S$ around $0$. More precisely, we find an interval $I$ centered at zero such that $S$ falls outside $I$ with sufficiently large probability. Combined with the fact that we use $k$ independent hash functions, as well as properties of the median and the binomial distribution, this yields the desired lower bound. Anti-concentration inequalities of this type are in general notoriously hard to obtain, but it turns out that we can leverage the properties of the Zipfian distribution, specifically its heavy head. For our upper bounds on the expected error we need strong lower bounds on the probability that $S$ lands in intervals $I$ centered at zero. Then, using concentration inequalities, we can bound the probability that half of the relevant counters are smaller (larger) than the lower (higher) endpoint of $I$, i.e., that the median does not lie in $I$. Again this requires a precise understanding of the distribution of $S$ within $I$.

1.4 Structure of the paper
In Section 2 we describe the algorithms Count-Min and Count-Sketch. We also formally define the estimation error that we will study as well as the Zipfian distribution. In Sections 3 and 4 we provide our analyses of the expected errors of Count-Min and Count-Sketch, respectively. In Section 5 we analyze the performance of learned Count-Sketch.
2 Preliminaries
We start out by recapping the sketching algorithms Count-Min and Count-Sketch. Common to both of these algorithms is that we sketch a stream $S$ of elements coming from some universe $U$ of size $n$. For notational convenience we will assume that $U = [n] = \{1, \dots, n\}$. If item $i$ occurs $f_i$ times then either algorithm outputs an estimate $\tilde{f}_i$ of $f_i$.
Count-Min.
We use $k$ independent and uniformly random hash functions $h_1, \dots, h_k : [n] \to [B]$. Letting $C$ be an array of size $k \times B$, we let $C[\ell][b] = \sum_{j : h_\ell(j) = b} f_j$. When querying $i$, the algorithm returns $\tilde{f}_i = \min_{\ell \in [k]} C[\ell][h_\ell(i)]$. Note that we always have that $\tilde{f}_i \ge f_i$.
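To make this concrete, here is a minimal Python sketch of Count-Min as just described; the class and method names are our own illustrative choices.

```python
import random

class CountMin:
    """Count-Min sketch: k rows of B counters; the estimate is the minimum over rows."""

    def __init__(self, k: int, B: int, seed: int = 0):
        self.k, self.B = k, B
        rng = random.Random(seed)
        # One "random hash function" per row, simulated by salting Python's hash().
        self.salts = [rng.getrandbits(64) for _ in range(k)]
        self.table = [[0] * B for _ in range(k)]

    def _bucket(self, row: int, item) -> int:
        return hash((self.salts[row], item)) % self.B

    def update(self, item, delta: int = 1) -> None:
        # Deletions are handled by calling update(item, -1).
        for row in range(self.k):
            self.table[row][self._bucket(row, item)] += delta

    def estimate(self, item) -> int:
        # Each row counter equals the item's frequency plus the (non-negative)
        # frequencies of colliding items, so the minimum never underestimates.
        return min(self.table[row][self._bucket(row, item)] for row in range(self.k))
```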
Count-Sketch.
We pick $k$ independent and uniformly random hash functions $h_1, \dots, h_k : [n] \to [B]$ and $s_1, \dots, s_k : [n] \to \{-1, 1\}$. Again we initialize an array $C$ of size $k \times B$, but now we let $C[\ell][b] = \sum_{j : h_\ell(j) = b} s_\ell(j) f_j$. When querying $i$, the algorithm returns the estimate $\tilde{f}_i = \mathrm{median}_{\ell \in [k]} \, s_\ell(i) \, C[\ell][h_\ell(i)]$.
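Analogously, a minimal Python sketch of Count-Sketch (again with our own naming); the only differences from Count-Min above are the random signs and the median aggregation.

```python
import random
import statistics

class CountSketch:
    """Count-Sketch: k rows of B signed counters; the estimate is the median over rows."""

    def __init__(self, k: int, B: int, seed: int = 0):
        assert k % 2 == 1, "use an odd number of rows so the median is well defined"
        self.k, self.B = k, B
        rng = random.Random(seed)
        # Per row: one salt for the bucket hash, one for the sign hash.
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(k)]
        self.table = [[0] * B for _ in range(k)]

    def _bucket(self, row: int, item) -> int:
        return hash((self.salts[row][0], item)) % self.B

    def _sign(self, row: int, item) -> int:
        return 1 if hash((self.salts[row][1], item)) % 2 == 0 else -1

    def update(self, item, delta: int = 1) -> None:
        for row in range(self.k):
            self.table[row][self._bucket(row, item)] += self._sign(row, item) * delta

    def estimate(self, item) -> float:
        return statistics.median(
            self._sign(row, item) * self.table[row][self._bucket(row, item)]
            for row in range(self.k)
        )
```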
Remark 2.1.
The bounds presented in Table 1 assume that the hash functions have codomain $[B/k]$ and not $[B]$, i.e., that the total number of buckets is $B$. In the proofs to follow we assume for notational ease that the hash functions take values in $[B]$, and the claimed bounds follow immediately by replacing $B$ by $B/k$.
Estimation Error.
To measure and compare the overall accuracy of different frequency estimation algorithms, we will use the expected estimation error, which is defined as follows. Let $f_i$ and $\tilde{f}_i$ respectively denote the actual frequency of item $i$ in the input stream and the estimated frequency obtained from algorithm $A$. We remark that when $A$ is clear from the context we simply write $\tilde{f}_i$ for the estimate. Then we define

$$\mathrm{Err}(F, \tilde{F}, A) = \sum_{i \in [n]} |\tilde{f}_i - f_i| \cdot \Pr_{\mathcal{D}}[i], \qquad (1)$$

where $\mathcal{D}$ denotes the query distribution of the items. Here, similar to previous work (e.g., [RKA16, HIKV19]), we assume that the query distribution is the same as the frequency distribution of items in the stream, i.e., for any $i$, $\Pr_{\mathcal{D}}[i] \propto f_i$ (more precisely, $\Pr_{\mathcal{D}}[i] = f_i / N$ for any $i$, where $N = \sum_{j \in [n]} f_j$ denotes the total sum of all frequencies in the stream).
Remark 2.2.
As all upper/lower bounds in this paper are proved by bounding the expected error of estimating the frequency of a single item $i$, namely $\mathbb{E}[|\tilde{f}_i - f_i|]$, and then using linearity of expectation, we in fact obtain analogous bounds for any query distribution $\mathcal{D}$. In particular this means that the bounds of Table 1 for CM and CS hold for any query distribution. For L-CM and L-CS the relevant factor gets replaced by one depending on the number of buckets reserved for heavy hitters.
Zipfian Distribution
In our analysis we assume that the frequency distribution of items follows Zipf's law. That is, if we sort the items according to their frequencies, with no loss of generality assuming that $f_1 \ge f_2 \ge \cdots \ge f_n$, then for any $i \in [n]$, $f_i \propto 1/i$. Given that the frequencies of items follow Zipf's law and assuming that the query distribution is the same as the distribution of the frequencies of items in the input stream (i.e., $\Pr_{\mathcal{D}}[i] = \frac{1/i}{H_n}$, where $H_n = \sum_{j=1}^{n} 1/j$ denotes the $n$-th harmonic number), we can write the expected error defined in eq. 1 as follows:
$$\mathrm{Err}(F, \tilde{F}, A) = \frac{1}{H_n} \sum_{i \in [n]} \frac{|\tilde{f}_i - f_i|}{i}. \qquad (2)$$
Throughout this paper, we present our results with respect to the objective function $\sum_{i \in [n]} |\tilde{f}_i - f_i| / i$ appearing on the right hand side of eq. 2 (dropping the normalization factor $1/H_n$, which is the same for all algorithms). We further assume that in fact $f_i = 1/i$. At first sight this assumption may seem strange since it says that item $i$ appears a non-integral number of times in the stream. This is however just a matter of scaling and the assumption is convenient as it removes the dependence on the length of the stream in our bounds.
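For concreteness, the following small Python helper (ours, purely illustrative) evaluates this objective for a given vector of estimates under the scaling $f_i = 1/i$.

```python
def zipf_frequencies(n: int) -> list[float]:
    """Frequencies under the scaling f_i = 1/i used throughout the paper."""
    return [1.0 / i for i in range(1, n + 1)]

def weighted_error(true_freqs: list[float], est_freqs: list[float]) -> float:
    """The objective sum_i |f~_i - f_i| / i, i.e., the expected estimation error
    (up to the 1/H_n normalization) when item i is queried with probability
    proportional to 1/i."""
    return sum(
        abs(est - true) / i
        for i, (true, est) in enumerate(zip(true_freqs, est_freqs), start=1)
    )
```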
Learning Augmented Sketching Algorithms for Frequency Estimation.
In this paper, following the approach of [HIKV19], the learned variants of CM and CS are algorithms augmented with a machine learning based heavy hitter oracle. More precisely, we assume that the algorithm has access to an oracle that predicts whether an item is "heavy" (i.e., whether it is one of the most frequent items) or not. Then, the algorithm treats heavy and non-heavy items differently: (a) a unique bucket is allocated to each heavy item, so their frequencies are computed with no error; (b) the rest of the items are fed to the given (sketching) algorithm using the remaining buckets, and their frequency estimates are computed via that algorithm. See Figure 1.
Note that, in general, the oracle can make errors. Aiming at a theoretical understanding, in this paper we focus on the case where the oracle is perfect, i.e., the error rate is zero. We also assume that the number of buckets reserved for the heavy items is of the same order as the number of buckets used for sketching; that is, we use approximately the same number of buckets for the heavy items as for the sketching of the light items. One justification for this assumption is that in any case we can increase both the number of buckets for heavy and light items to the larger of the two without affecting the overall asymptotic space usage.
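A minimal Python sketch of this scheme, assuming a perfect oracle that is simply given the set of heavy items up front; the wrapper works with any sketch exposing update/estimate methods (e.g., the CountMin or CountSketch classes sketched in Section 2), and all names are our own.

```python
class LearnedSketch:
    """Exact counters for predicted heavy items; all other items go to a backing sketch."""

    def __init__(self, heavy_items, backing_sketch):
        # One dedicated bucket per (predicted) heavy item.
        self.heavy_counts = {item: 0 for item in heavy_items}
        self.sketch = backing_sketch  # e.g. CountMin(k, B) or CountSketch(k, B)

    def update(self, item, delta: int = 1) -> None:
        if item in self.heavy_counts:      # oracle says "heavy": dedicated bucket
            self.heavy_counts[item] += delta
        else:                              # oracle says "light": standard sketch
            self.sketch.update(item, delta)

    def estimate(self, item):
        if item in self.heavy_counts:
            return self.heavy_counts[item]  # exact count, no estimation error
        return self.sketch.estimate(item)
```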
3 Tight Bounds for Count-Min with Zipfians
For both Count-Min and Count-Sketch we aim at analyzing the expected value of the variable $|\tilde{f}_i - f_i|$, where $f_i = 1/i$ and $\tilde{f}_i$ is the estimate of $f_i$ output by the relevant sketching algorithm. Throughout this paper we use the following notation: for an event $E$ we denote by $[E]$ the random variable in $\{0, 1\}$ which is $1$ if and only if $E$ occurs. We begin by presenting our improved analysis of Count-Min with Zipfians. The main theorem is the following.
Theorem 3.1.
Let with and . Let further be independent and truly random hash functions. For define the random variable . For any it holds that
Replacing $B$ by $B/k$ in Theorem 3.1 and using linearity of expectation, we obtain the desired bound for Count-Min in the upper right hand side of Table 1. The natural assumption in the theorem simply says that the total number of buckets is upper bounded by the number of items.
To prove Theorem 3.1 we start with the following lemma which is a special case of the theorem.
Lemma 3.2.
Suppose that we are in the setting of Theorem 3.1, except with a modified assumption relating $B$ and $n$; in particular we dispense with the assumption that the total number of buckets is at most the number of items. Then
Proof.
It suffices to show the result when $k = 1$, since adding more hash functions and corresponding tables only decreases the error. Define for and note that these variables are independent. For a given we wish to upper bound . Let and note that if then either of the following two events must hold:
-
There exists a with and .
-
The set contains at least elements.
Union bounding we find that
Choosing , a simple calculation yields that . As and are independent, , so
∎
Before proving the full statement of Theorem 3.1 we recall Bennett’s inequality.
Theorem 3.3 (Bennett’s inequality [Ben62]).
Let $X_1, \dots, X_n$ be independent, mean zero random variables. Let $S = \sum_{i=1}^{n} X_i$, and let $\sigma^2, M > 0$ be such that $\sum_{i=1}^{n} \mathbb{E}[X_i^2] \le \sigma^2$ and $|X_i| \le M$ for all $i$. For any $t \ge 0$,
$$\Pr[S \ge t] \le \exp\!\left(-\frac{\sigma^2}{M^2} \, h\!\left(\frac{tM}{\sigma^2}\right)\right),$$
where $h$ is defined by $h(x) = (1+x)\ln(1+x) - x$. The same tail bound holds on the probability $\Pr[S \le -t]$.
Remark 3.4.
It is well known and easy to check that $\frac{x}{2}\ln(1+x) \le h(x) \le x\ln(1+x)$ for all $x \ge 0$; in particular $h(x) = \Theta(x^2)$ for $x \le 1$ and $h(x) = \Theta(x \ln x)$ as $x \to \infty$.
We will use these asymptotic bounds repeatedly in this paper.
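For intuition about how the inequality is applied, the following small script (our own illustration; all parameter choices are hypothetical) evaluates the Bennett bound for a toy family of centered variables loosely resembling the counter contributions in the Count-Min analysis, and compares it with a Monte Carlo estimate of the tail probability.

```python
import math
import random

def h(x: float) -> float:
    """h(x) = (1 + x) ln(1 + x) - x, the function from Bennett's inequality."""
    return (1 + x) * math.log(1 + x) - x

def bennett_bound(sigma2: float, M: float, t: float) -> float:
    """Bennett's upper bound on Pr[X_1 + ... + X_n >= t] for independent
    mean-zero X_i with |X_i| <= M and total variance at most sigma2."""
    return math.exp(-(sigma2 / M**2) * h(M * t / sigma2))

# Toy example: X_i = (Bernoulli(p) - p) / i for i = 1..n with p = 1/B.
n, B = 1000, 100
p = 1.0 / B
sigma2 = sum(p * (1 - p) / i**2 for i in range(1, n + 1))
M, t = 1.0, 0.3

rng = random.Random(0)
trials = 5000
hits = sum(
    1
    for _ in range(trials)
    if sum(((1.0 if rng.random() < p else 0.0) - p) / i for i in range(1, n + 1)) >= t
)
print(f"Bennett bound : {bennett_bound(sigma2, M, t):.3f}")
print(f"empirical tail: {hits / trials:.3f}")
```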
Proof of Theorem 3.1.
We start out by proving the upper bound. Let and . Let be such that is minimized. Note that is itself a random variable. We also define
Clearly . Using Lemma 3.2, we obtain that . For we observe that
We conclude that
as desired.
Next we show the lower bound. For and we define . Note that the variables are independent. We also define for . Observe that for , , and that
Applying Bennett’s inequality with and thus gives that
Defining it holds that and , so putting in the inequality above we obtain that
Appealing to Remark 3.4 and using that the above bound becomes
(3)
By the independence of the events we have that
and so , as desired. ∎
4 (Nearly) tight Bounds for Count-Sketch with Zipfians
In this section we proceed to analyze Count-Sketch for Zipfians, using either a single hash function or several. We start with two simple lemmas which, for certain frequencies of the items in the stream, can be used to obtain respectively good upper and lower bounds on the expected error of Count-Sketch with a single hash function. We will use these two lemmas in our analyses of both standard and learned Count-Sketch for Zipfians.
Lemma 4.1.
Let $w_1, \dots, w_n \ge 0$, let $\eta_1, \dots, \eta_n$ be independent Bernoulli variables taking value $1$ with probability $p$, and let $\sigma_1, \dots, \sigma_n$ be independent Rademachers, i.e., $\Pr[\sigma_i = 1] = \Pr[\sigma_i = -1] = 1/2$. Let $S = \sum_{i=1}^{n} \eta_i \sigma_i w_i$. Then $\mathbb{E}[|S|] \le \sqrt{p \sum_{i=1}^{n} w_i^2}$.
Proof.
Using that $\mathbb{E}[\eta_i \sigma_i \, \eta_j \sigma_j \, w_i w_j] = 0$ for $i \ne j$ and Jensen's inequality, we obtain $\mathbb{E}[|S|] \le \sqrt{\mathbb{E}[S^2]} = \sqrt{p \sum_{i=1}^{n} w_i^2}$,
from which the result follows. ∎
Lemma 4.2.
Suppose that we are in the setting of Lemma 4.1. Let and let be defined by . Then
Proof.
Let , , and . Let denote the event that and have the same sign or . Then by symmetry. For we denote by the event that . Then and furthermore and are independent. If occurs, then and as the events are disjoint it thus follows that
∎
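To build intuition for Lemmas 4.1 and 4.2, the following simulation (our own illustration, not part of the paper's argument) samples $S = \sum_i \eta_i \sigma_i f_i$ with Zipfian weights $f_i = 1/i$ and $p = 1/B$, compares the empirical value of $\mathbb{E}[|S|]$ with the upper bound $\sqrt{p \sum_i f_i^2}$ from Lemma 4.1, and estimates how much mass $S$ places in a small interval around zero; all parameter values are illustrative.

```python
import math
import random

def sample_S(n: int, B: int, rng: random.Random) -> float:
    """One sample of S = sum_i eta_i * sigma_i * f_i with f_i = 1/i,
    eta_i ~ Bernoulli(1/B) and sigma_i a uniformly random sign."""
    s = 0.0
    for i in range(1, n + 1):
        if rng.random() < 1.0 / B:
            s += (1.0 if rng.random() < 0.5 else -1.0) / i
    return s

n, B, trials = 5000, 100, 2000
rng = random.Random(0)
samples = [sample_S(n, B, rng) for _ in range(trials)]

emp_mean_abs = sum(abs(s) for s in samples) / trials
lemma41_bound = math.sqrt(sum(1.0 / i**2 for i in range(1, n + 1)) / B)
t = 1.0 / B  # a "small" interval [-t, t] around zero
mass_near_zero = sum(1 for s in samples if abs(s) <= t) / trials

print(f"empirical E|S|           : {emp_mean_abs:.4f}")
print(f"Lemma 4.1 upper bound    : {lemma41_bound:.4f}")
print(f"Pr[|S| <= {t:.3f}] (est.) : {mass_near_zero:.3f}")
```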
4.1 One hash-function
We are now ready to commence our analysis of Count-Sketch for Zipfians. As in the discussion succeeding Theorem 3.1 the following theorem yields the desired result for a single hash function as presented in Table 1.
Theorem 4.3.
Suppose that and let and be truly random hash functions. Define the random variable for . Then
4.2 Multiple hash functions
Let $k$ be odd. For a tuple $x \in \mathbb{R}^k$ we denote by $\mathrm{median}(x)$ the median of the entries of $x$. The following theorem immediately leads to the result on CS with $k$ hash functions claimed in Table 1.
Theorem 4.4.
Let $k$ be odd, and let $h_1, \dots, h_k : [n] \to [B]$ and $s_1, \dots, s_k : [n] \to \{-1, 1\}$ be truly random hash functions. For $i \in [n]$ define the random variable $\tilde{f}_i$ to be the Count-Sketch estimate of $f_i$ computed from these hash functions. Assume that $kB \le n$. (This very mild assumption can probably be removed at the cost of a more technical proof; in our proof it can even be replaced by a weaker condition.) Then
The assumption simply says that the total number of buckets is upper bounded by the number of items. Again using linearity of expectation for the summation over $i$ and replacing $B$ by $B/k$, we obtain the claimed upper and lower bounds. We note that even if the bounds above are only tight up to a multiplicative factor, they still imply that it is asymptotically optimal to choose $k$ to be a small odd constant. Settling the correct asymptotic growth is thus of merely theoretical interest.
Proof.
If (and hence ) is a constant, the result follows easily from Lemma 4.1, so in what follows we may assume that is larger than a sufficiently large constant.
We first prove the upper bound. Define and . Let for , and . Finally write .
As the absolute error in Count-Sketch with one pair of hash functions is always upper bounded by the corresponding error in Count-Min with the single hash function , we can use the bound in the proof of Lemma 3.2 to conclude that
when . Also
so by Bennett’s inequality (with and ) and Remark 3.4,
for . It follows that for ,
Let be the implicit constant in the -notation above. If , at least half of the values are at least . For it thus follows by a union bound that
(4)
If is chosen sufficiently large it thus holds that
Here the first inequality uses eq. 4 and a change of variable. The second inequality uses that for some constant followed by a calculation of the integral. For our upper bound it therefore suffices to show that . For this we need the following claim:
Claim 4.5.
Let be the closed interval centered at the origin of length , i.e., . Suppose that . For , .
Before proving this claim we will first show how to use it to establish the desired result. For this let be fixed. If , at least half of the values are at least or at most . Let us focus on bounding the probability that at least half are at least , the other bound being symmetric giving an extra factor of in the probability bound. By symmetry and the claim, . For we define , and we put . Then . If at least half of the values are at least then . By Hoeffding’s inequality we can bound the probability of this event by
It follows that . Thus
It thus suffices to prove the claim.
Proof of Claim 4.5.
We first show that with probability , lies in the interval for some constant . To see this we note that by Lemma 4.1, , so it follows by Markov’s inequality that if is large enough, the probability that is at most . For a constant probability lower bound on we write
Condition on . If the probability that there exist exactly two with is at least
With probability the corresponding signs are both the same as that of . By independence of and the probability that this occurs is at least and if it does, . Combining these two bounds it follows that with probability at least