(Learned) Frequency Estimation Algorithms under Zipfian Distribution

08/14/2019 ∙ by Anders Aamand, et al. ∙ Københavns Uni ∙ MIT

The frequencies of the elements in a data stream are an important statistical measure and the task of estimating them arises in many applications within data analysis and machine learning. Two of the most popular algorithms for this problem, Count-Min and Count-Sketch, are widely used in practice. In a recent work [Hsu et al., ICLR'19], it was shown empirically that augmenting Count-Min and Count-Sketch with a machine learning algorithm leads to a significant reduction of the estimation error. The experiments were complemented with an analysis of the expected error incurred by Count-Min (both the standard and the augmented version) when the input frequencies follow a Zipfian distribution. Although the authors established that the learned version of Count-Min has lower estimation error than its standard counterpart, their analysis of the standard Count-Min algorithm was not tight. Moreover, they provided no similar analysis for Count-Sketch. In this paper we resolve these problems. First, we provide a simple, tight analysis of the expected error incurred by Count-Min. Second, we provide the first error bounds for both the standard and the augmented version of Count-Sketch. These bounds are nearly tight and again demonstrate an improved performance of the learned version of Count-Sketch. In addition to demonstrating tight gaps between the aforementioned algorithms, we believe that our bounds for the standard versions of Count-Min and Count-Sketch are of independent interest. In particular, it is a typical practice to set the number of hash functions in those algorithms to Θ(log n). In contrast, our results show that to minimize the expected error, the number of hash functions should be a constant, strictly greater than 1.


1 Introduction

The last few years have witnessed a rapid growth in using machine learning methods to solve "classical" algorithmic problems. For example, they have been used to improve the performance of data structures [KBC18, Mit18], online algorithms [LV18, PSK18, GP19], combinatorial optimization [KDZ17, BDSV18], similarity search [WLKC16], compressive sensing [MPB15, BJPD17] and streaming algorithms [HIKV19]. Multiple frameworks for designing and analyzing such algorithms were proposed [ACC11, GR17, BDV18, AKL19]. The rationale behind this line of research is that machine learning makes it possible to adapt the behavior of the algorithms to inputs from a specific data distribution, making them more efficient or more accurate in specific applications.

In this paper we focus on learning-augmented streaming algorithms for frequency estimation. The latter problem is formalized as follows: given a stream S of elements from some universe U, construct a data structure that for any element i computes an estimate f̂_i of f_i, the number of times i occurs in S. Since counting data elements is a very common subroutine, frequency estimation algorithms have found applications in many areas, such as machine learning, network measurements and computer security. Many of the most popular algorithms for this problem, such as Count-Min (CM) [CM05] or Count-Sketch (CS) [CCFC02], are based on hashing. Specifically, these algorithms hash stream elements into B buckets, count the number of items hashed into each bucket, and use the bucket value as an estimate of item frequency. To improve the accuracy, the algorithms use k such hash functions and aggregate the answers. These algorithms have several useful properties: they can handle item deletions (implemented by decrementing the respective counters), and some of them (Count-Min) never underestimate the true frequencies, i.e., f̂_i ≥ f_i.

In a recent work [HIKV19], the authors showed that the aforementioned algorithms can be improved by augmenting them with machine learning. Their approach is as follows. During the training phase, they construct a classifier (neural network) to detect whether an element is "heavy", i.e., whether f_i exceeds some threshold. After such a classifier is trained, they scan the input stream and apply the classifier to each element i. If the element is predicted to be heavy, it is allocated a unique bucket, so that an exact value of f_i is computed. Otherwise, the element is forwarded to a "standard" hashing data structure, e.g., CM or CS. To estimate f_i, the algorithm either returns the exact count (if i is allocated a unique bucket) or an estimate provided by the hashing data structure. (See Figure 1 for a generic implementation of the learning-based algorithms of [HIKV19].) An empirical evaluation, on networking and query log data sets, shows that this approach can reduce the overall estimation error.

The paper also presents a preliminary analysis of the algorithm. Under the common assumption that the frequencies follow the Zipfian law, i.e., f_i ∝ 1/i (in fact we will assume f_i = 1/i; this is just a matter of scaling and is convenient as it removes the dependence on the length of the stream from our bounds), and further that item i is queried with probability proportional to its frequency, the expected error incurred by the learning-augmented version of CM is shown to be asymptotically lower than that of the "standard" CM. (This assumes that the error rate of the "heaviness" predictor is low enough. Aiming at a theoretical understanding, in this paper we focus on the case where the error rate is zero.) However, the magnitude of the gap between the error incurred by the learned and standard CM algorithms has not been established. Specifically, [HIKV19] only shows that the expected error of standard CM with k hash functions and a total of B buckets lies between a lower and an upper bound that do not match asymptotically. Furthermore, no such analysis was presented for CS.

1.1 Our results

In this paper we resolve the aforementioned questions left open in [HIKV19]. Assuming that the frequencies follow a Zipfian law, we show:

  • An asymptotically tight bound on the expected error incurred by the CM algorithm with k hash functions and a total of B buckets (see Table 1). Together with a prior bound for Learned CM (Table 1), this pins down the asymptotic gap by which learning-augmentation improves the error of CM when the heavy hitter oracle is perfect.

  • The first error bounds for CS and Learned CS (see Table 1). In particular, we show that for Learned CS, using a single hash function as in [HIKV19] leads to an asymptotically optimal error bound, improving over standard CS by the same kind of factor as in the CM case.

Count-Min (CM): previously analyzed in [HIKV19]; tight bound given in this paper
Learned Count-Min (L-CM): bounds from [HIKV19]
Count-Sketch (CS) and Learned Count-Sketch (L-CS): bounds given in this paper

Table 1: This table summarizes our and previously known results on the expected frequency estimation error of Count-Min (CM), Count-Sketch (CS) and their learned variants (L-CM and L-CS) that use k hash functions and B buckets of space overall, under a Zipfian distribution. For CS, we assume that k is odd (so that the median of the k values is well defined). The lower bounds for L-CM and L-CS assume that the information from a perfect heavy hitter oracle (i.e., an oracle that makes no mistakes in its predictions) is used to place the heavy hitters in separate buckets.

In addition to clarifying the gap between the learned and standard variants of popular frequency estimation algorithms, our results provide interesting insights about the algorithms themselves. For example, for both CM and CS, the number of hash functions k is often selected to be Θ(log n), in order to guarantee that every frequency is estimated up to a certain error bound. In contrast, we show that if instead the goal is to bound the expected error, then setting k to a constant (strictly greater than 1) leads to asymptotically optimal performance. We remark that the same phenomenon holds not only for a Zipfian query distribution but in fact for an arbitrary distribution on the queries, e.g., the uniform distribution (see Remark 2.2).

1.2 Related work

In addition to the aforementioned hashing-based algorithms [CM05, CCFC02], multiple non-hashing algorithms were also proposed, e.g., [MG82, MM02, MAEA05]. These algorithms often exhibit better accuracy/space tradeoffs, but do not possess many of the properties of hashing-based methods, such as the ability to handle deletions as well as insertions.

Zipf's law is a common modeling tool used to evaluate the performance of frequency estimation algorithms, and has been used in many papers in this area, including [MM02, MAEA05, CCFC02]. In its general form it postulates that f_i is proportional to 1/i^α for some exponent parameter α. In this paper we focus on the "original" Zipf law where α = 1. However, the techniques introduced in this paper can be applied to other values of the exponent as well.

1.3 Our techniques

Our main contribution is our analysis of the standard Count-Min and Count-Sketch algorithms for Zipfians with k hash functions. Showing the improvement for the learned counterparts is relatively simple (for Count-Min it was already done in [HIKV19]). In both of these analyses we consider a fixed item i and bound the expected error E[|f̂_i − f_i|], whereupon linearity of expectation leads to the desired results. In the following we assume that f_i = 1/i for each i and describe our techniques for bounding E[|f̂_i − f_i|] for each of the two algorithms.

Count-Min. With a single hash function and B buckets, the head of the Zipfian distribution, namely the B items of largest frequency, contributes Θ((log B)/B) to the expected error. Our main observation is that the fast decay of the frequencies greatly reduces this contribution when k ≥ 2, and that in fact the main contribution to the error comes from the light items i > B. The expected contribution of these items is easily upper bounded and can be lower bounded using Bennett's concentration inequality. In contrast to the analysis from [HIKV19], which is technical and leads to suboptimal bounds, our analysis is short, simple, and yields completely tight bounds in terms of all of the parameters n, B, and k.
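For concreteness, here is the elementary calculation behind this head/tail split for a single hash function h into B buckets, under the scaling f_l = 1/l used throughout (a back-of-the-envelope sketch, not the formal argument of Section 3):

\[
E\Big[\sum_{l \le B,\, l \neq i} f_l\,[h(l)=h(i)]\Big] = \sum_{l \le B,\, l \neq i} \frac{1}{lB} = \Theta\Big(\frac{\ln B}{B}\Big),
\qquad
E\Big[\sum_{l > B} f_l\,[h(l)=h(i)]\Big] = \sum_{l > B} \frac{1}{lB} = \Theta\Big(\frac{\ln(n/B)}{B}\Big).
\]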

Count-Sketch.

Simply put, our main contribution is an improved understanding of the distribution of random variables of the form S = Σ_i f_i η_i σ_i. Here the η_i are i.i.d. Bernoulli random variables and the σ_i are independent Rademachers, that is, P(σ_i = 1) = P(σ_i = −1) = 1/2. Note that the counters used in CS are random variables having precisely this form. Usually such random variables are studied for the purpose of obtaining large deviation results. In contrast, in order to analyze CS, we are interested in a fine-grained picture of the distribution within a "small" interval around zero. For example, when proving a lower bound on the expected error we must prove a certain anti-concentration of S around 0. More precisely, we find an interval I centered at zero such that S falls outside I with sufficiently large probability. Combined with the fact that we use k independent hash functions, as well as properties of the median and the binomial distribution, this yields the desired lower bound on the expected error. Anti-concentration inequalities of this type are in general notoriously hard to obtain, but it turns out that we can leverage the properties of the Zipfian distribution, specifically its heavy head. For our upper bounds on the expected error we need strong lower bounds on the probability that S lands in an interval I centered at zero. Then, using concentration inequalities, we can bound the probability that half of the relevant counters are smaller (larger) than the lower (higher) endpoint of I, i.e., that the median does not lie in I. Again this requires a precise understanding of the distribution of S within I.
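As a quick numerical illustration of the object being analyzed (not part of the paper's argument), the following Python snippet samples a single Count-Sketch-style counter S = Σ_l f_l·η_l·σ_l for Zipfian frequencies and estimates how much probability mass falls within c/B of zero; all parameter values are arbitrary choices for illustration.

import numpy as np

def mass_near_zero(n=10_000, B=100, c=1.0, trials=2_000, seed=0):
    # Monte Carlo estimate of P(|S| <= c/B) for S = sum_l f_l*eta_l*sigma_l,
    # with Zipfian f_l = 1/l, eta_l ~ Bernoulli(1/B), sigma_l uniform in {-1,+1}.
    rng = np.random.default_rng(seed)
    f = 1.0 / np.arange(1, n + 1)                 # Zipfian frequencies
    hits = 0
    for _ in range(trials):
        eta = rng.random(n) < 1.0 / B             # which items collide with the query's bucket
        sigma = rng.choice((-1.0, 1.0), size=n)   # random signs
        S = float(np.sum(f * eta * sigma))        # one counter sample
        hits += abs(S) <= c / B
    return hits / trials

print(mass_near_zero())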

1.4 Structure of the paper

In Section 2 we describe the algorithms Count-Min and Count-Sketch. We also formally define the estimation error that we will study, as well as the Zipfian distribution. In Sections 3 and 4 we provide our analyses of the expected error of Count-Min and Count-Sketch, respectively. In Section 5 we analyze the performance of learned Count-Sketch.

2 Preliminaries

We start out by recapping the sketching algorithms Count-Min and Count-Sketch. Common to both of these algorithms is that we sketch a stream of elements coming from some universe of size n. For notational convenience we will assume that the universe is [n] = {1, …, n}. If item i occurs f_i times, then either algorithm outputs an estimate f̂_i of f_i.

Count-Min.

We use k independent and uniformly random hash functions h_1, …, h_k : [n] → [B]. Letting C be an array of size k × B, we set C[j][b] = Σ_{i : h_j(i) = b} f_i. When querying i, the algorithm returns f̂_i = min_{j ∈ [k]} C[j][h_j(i)]. Note that we always have f̂_i ≥ f_i.
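A minimal Python sketch of the scheme just described (illustrative only: it uses Python's built-in hash with random salts in place of the truly random hash functions assumed in the analysis, and it accepts fractional update weights so that the scaled frequencies f_i = 1/i used later can be fed in directly):

import random

class CountMin:
    def __init__(self, k, B, seed=0):
        rng = random.Random(seed)
        self.k, self.B = k, B
        self.salts = [rng.getrandbits(64) for _ in range(k)]   # one "hash function" per row
        self.C = [[0.0] * B for _ in range(k)]                 # k x B array of counters

    def _bucket(self, j, x):
        return hash((self.salts[j], x)) % self.B

    def update(self, x, weight=1.0):
        for j in range(self.k):
            self.C[j][self._bucket(j, x)] += weight

    def estimate(self, x):
        # never underestimates: every row counter contains f_x plus the colliding mass
        return min(self.C[j][self._bucket(j, x)] for j in range(self.k))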

Count-Sketch.

We pick k independent and uniformly random hash functions h_1, …, h_k : [n] → [B] and s_1, …, s_k : [n] → {−1, 1}. Again we initialize an array C of size k × B, but now we set C[j][b] = Σ_{i : h_j(i) = b} s_j(i) f_i. When querying i, the algorithm returns the estimate f̂_i = median_{j ∈ [k]} s_j(i) · C[j][h_j(i)].
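An analogous illustrative Python sketch of Count-Sketch, under the same caveats as above (salted built-in hashing standing in for truly random hash functions; k should be odd so that the median is well defined):

import random
from statistics import median

class CountSketch:
    def __init__(self, k, B, seed=0):
        rng = random.Random(seed)
        self.k, self.B = k, B                                   # k is assumed odd
        self.salts = [rng.getrandbits(64) for _ in range(k)]
        self.C = [[0.0] * B for _ in range(k)]

    def _bucket(self, j, x):
        return hash((self.salts[j], x)) % self.B

    def _sign(self, j, x):
        return 1.0 if hash((~self.salts[j], x)) & 1 else -1.0  # s_j(x) in {-1, +1}

    def update(self, x, weight=1.0):
        for j in range(self.k):
            self.C[j][self._bucket(j, x)] += self._sign(j, x) * weight

    def estimate(self, x):
        return median(self._sign(j, x) * self.C[j][self._bucket(j, x)]
                      for j in range(self.k))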

Remark 2.1.

The bounds presented in Table 1 assume that the hash functions have codomain [B/k] and not [B], i.e., that the total number of buckets is B. In the proofs to follow we assume for notational ease that the hash functions take values in [B], and the claimed bounds then follow immediately by replacing B by B/k.

Estimation Error.

To measure and compare the overall accuracy of different frequency estimation algorithms, we will use the expected estimation error, which is defined as follows: let F = (f_1, …, f_n) and F̂_A = (f̂_1, …, f̂_n) respectively denote the actual frequencies of the items in the input stream and the estimated frequencies obtained from algorithm A. We remark that when A is clear from the context we simply write F̂. Then we define

    Err(F, F̂_A) = E_{i ∼ D}[ |f̂_i − f_i| ],      (1)

where D denotes the query distribution of the items. Here, similar to previous work (e.g., [RKA16, HIKV19]), we assume that the query distribution is the same as the frequency distribution of items in the stream, i.e., for any i, Pr_D[i] ∝ f_i (more precisely, for any i, Pr_D[i] = f_i / N, where N denotes the total sum of all frequencies in the stream).

Remark 2.2.

As all upper/lower bounds in this paper are proved by bounding the expected error of estimating the frequency of a single item, E[|f̂_i − f_i|], and then using linearity of expectation, we in fact obtain analogous bounds for any query distribution D. In particular this means that the bounds of Table 1 for CM and CS hold for any query distribution. For L-CM and L-CS, the relevant factor in the bounds gets replaced by one depending on the number of buckets reserved for the heavy hitters.

Zipfian Distribution.

In our analysis we assume that the frequency distribution of the items follows Zipf's law. That is, if we sort the items according to their frequencies, with no loss of generality assuming that f_1 ≥ f_2 ≥ ⋯ ≥ f_n, then f_i ∝ 1/i for any i ∈ [n]. Given that the frequencies of items follow Zipf's law, and assuming that the query distribution is the same as the frequency distribution of the items in the input stream (i.e., Pr_D[i] = (1/i)/H_n, where H_n = Σ_{j ∈ [n]} 1/j denotes the n-th harmonic number), we can write the expected error defined in eq. 1 as follows:

    Err(F, F̂_A) = (1/H_n) · Σ_{i ∈ [n]} |f̂_i − f_i| / i.      (2)

Throughout this paper, we present our results with respect to the objective function on the right hand side of eq. 2. We further assume that in fact f_i = 1/i. At first sight this assumption may seem strange, since it says that item i appears a non-integral number of times in the stream. This is, however, just a matter of scaling, and the assumption is convenient as it removes the dependence on the length of the stream from our bounds.
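The following small experiment (purely illustrative, reusing the CountMin class sketched above) feeds the scaled Zipfian frequencies f_i = 1/i into Count-Min with a fixed total number of buckets split across k rows and reports the weighted error of eq. 2; it is one way to observe empirically the phenomenon discussed in Section 1.1, namely that a small constant k > 1 tends to beat both k = 1 and large k. All parameter values are arbitrary.

# Illustrative experiment (not from the paper): fixed total space B_total,
# Zipfian weights f_i = 1/i fed in as fractional updates, and the error of
# eq. 2 measured for several choices of k.
n, B_total = 10_000, 512
f = [1.0 / i for i in range(1, n + 1)]
H_n = sum(f)

for k in (1, 2, 4, 8):
    cm = CountMin(k=k, B=B_total // k, seed=1)      # B_total buckets overall
    for i, fi in enumerate(f, start=1):
        cm.update(i, fi)
    err = sum((fi / H_n) * abs(cm.estimate(i) - fi)
              for i, fi in enumerate(f, start=1))
    print(f"k={k}: weighted error ~ {err:.5f}")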

1: procedure LearnedSketch(B, B_r, HH-Oracle, SketchAlg)
2:     for each stream element i do
3:         if HH-Oracle(i) = 1 then      ▷ the oracle predicts i as heavy (one of the B_r most frequent items)
4:             if a unique bucket is already assigned to item i then
5:                 counter_i ← counter_i + 1
6:             else
7:                 allocate a new unique bucket to item i and set counter_i ← 1
8:             end if
9:         else
10:            feed i to an instance of SketchAlg with B − B_r buckets
11:        end if
12:    end for
13: end procedure
Algorithm 1 Learning-Based Frequency Estimation
Figure 1: A generic learning augmented algorithm for the frequency estimation problem. HH-Oracle denotes a given learned oracle for detecting whether an item is among the top B_r most frequent items of the stream, and SketchAlg is a given (sketching) algorithm (e.g., CM or CS) for the frequency estimation problem.

Learning Augmented Sketching Algorithms for Frequency Estimation.

In this paper, following the approach of [HIKV19], the learned variants of CM and CS are algorithms augmented with a machine learning based heavy hitters oracle. More precisely, we assume that the algorithm has access to an oracle HH-Oracle that predicts whether an item is "heavy" (i.e., is one of the B_r most frequent items) or not. Then, the algorithm treats heavy and non-heavy items differently: (a) a unique bucket is allocated to each heavy item and its frequency is computed with no error, and (b) the rest of the items are fed to the given (sketching) algorithm SketchAlg using the remaining buckets, and their frequency estimates are computed via SketchAlg. See Figure 1.

Note that, in general, the oracle can make errors. Aiming at a theoretical understanding, in this paper we focus on the case where the oracle is perfect, i.e., the error rate is zero. We also assume that B_r = Θ(B), that is, we use approximately the same number of buckets for the heavy items as for the sketching of the light items. One justification for this assumption is that in any case we can increase both the number of buckets for heavy and light items to the larger of the two without affecting the overall asymptotic space usage.
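A minimal Python rendering of Algorithm 1 under the perfect-oracle assumption made here (the oracle simply knows the identities of the B_r most frequent items); the class and parameter names are illustrative, and the backing sketch can be either of the classes sketched in Section 2.

class LearnedSketch:
    # Algorithm 1 with a perfect heavy-hitter oracle: the B_r heaviest items
    # get exact dedicated counters, everything else goes to the backing sketch.
    def __init__(self, heavy_ids, sketch):
        self.heavy = {x: 0.0 for x in heavy_ids}   # one unique bucket per predicted-heavy item
        self.sketch = sketch                       # e.g. CountMin / CountSketch with the remaining buckets

    def update(self, x, weight=1.0):
        if x in self.heavy:
            self.heavy[x] += weight
        else:
            self.sketch.update(x, weight)

    def estimate(self, x):
        return self.heavy[x] if x in self.heavy else self.sketch.estimate(x)

# Example: B_r = B/2 unique buckets for the heavy items, the rest for a Count-Sketch.
B, B_r = 512, 256
ls = LearnedSketch(heavy_ids=range(1, B_r + 1),                 # perfect oracle: items 1..B_r are the heaviest
                   sketch=CountSketch(k=3, B=(B - B_r) // 3))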

3 Tight Bounds for Count-Min with Zipfians

For both Count-Min and Count-Sketch we aim at analyzing the expected value of the variable |f̂_i − f_i|, where i ∈ [n] and f̂_i is the estimate of f_i output by the relevant sketching algorithm. Throughout this paper we use the following notation: for an event E we denote by 1_E the random variable that is 1 if E occurs and 0 otherwise. We begin by presenting our improved analysis of Count-Min with Zipfians. The main theorem is the following.

Theorem 3.1.

Let n, B, k ∈ ℕ with k ≥ 2 and kB ≤ n, and let h_1, …, h_k : [n] → [B] be independent and truly random hash functions. For i ∈ [n] define the random variable X_i = min_{j ∈ [k]} Σ_{l ≠ i : h_j(l) = h_j(i)} f_l. For any i ∈ [n], the expectation E[X_i] is, up to constant factors, equal to the bound reported for CM in Table 1 (with B buckets per hash function, i.e., with B/k replaced by B).

Replacing B by B/k in Theorem 3.1 and using linearity of expectation, we obtain the desired bound for Count-Min in the upper right hand side of Table 1. The natural assumption kB ≤ n simply says that the total number of buckets is upper bounded by the number of items.

To prove Theorem 3.1 we start with the following lemma which is a special case of the theorem.

Lemma 3.2.

Suppose that we are in the setting of Theorem 3.1 and further that n ≤ B. (In particular, we here dispose with the assumption that kB ≤ n.) Then E[X_i] = O(1/B).

Proof.

It suffices to show the result when k = 2, since adding more hash functions and corresponding tables only decreases the value of X_i. Define Y_j = Σ_{l ≠ i : h_j(l) = h_j(i)} f_l for j ∈ {1, 2} and note that these variables are independent. For a given t > 0 we wish to upper bound P(X_i ≥ t) = P(min(Y_1, Y_2) ≥ t). Note that if X_i ≥ t then either of the following two events must hold:

  1. There exists a with and .

  2. The set contains at least elements.

Union bounding we find that

Choosing , a simple calculation yields that . As and are independent, , so

Before proving the full statement of Theorem 3.1 we recall Bennett’s inequality.

Theorem 3.3 (Bennett’s inequality [Ben62]).

Let X_1, …, X_m be independent, mean zero random variables. Let σ² > 0 and M > 0 be such that Σ_{i=1}^m E[X_i²] ≤ σ² and |X_i| ≤ M for all i. For any t ≥ 0,

    P( Σ_{i=1}^m X_i ≥ t ) ≤ exp( −(σ²/M²) · h(tM/σ²) ),

where h is defined by h(x) = (1 + x) ln(1 + x) − x. The same tail bound holds on the probability P( Σ_{i=1}^m X_i ≤ −t ).

Remark 3.4.

It is well known and easy to check that for ,

We will use these asymptotic bounds repeatedly in this paper.
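The asymptotic bounds referred to are presumably the standard estimates for h(x) = (1 + x)ln(1 + x) − x (stated here as a reminder; the exact form used in the paper may differ slightly):

\[
h(x) = \Theta(x^2) \ \text{ for } 0 \le x \le 1, \qquad h(x) = \Theta(x \ln x) \ \text{ as } x \to \infty; \quad \text{in fact } h(x) = \Theta\big(x \ln(1+x)\big) \text{ for all } x \ge 0.
\]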

Proof of Theorem 3.1.

We start out by proving the upper bound. Let and . Let be such that is minimized. Note that is itself a random variable. We also define

Clearly . Using Lemma 3.2, we obtain that . For we observe that

We conclude that

as desired.

Next we show the lower bound. For and we define . Note that the variables are independent. We also define for . Observe that for , , and that

Applying Bennett’s inequality with and thus gives that

Defining it holds that and , so putting in the inequality above we obtain that

Appealing to Remark 3.4 and using that the above bound becomes

(3)

By the independence of the events we have that

and so , as desired. ∎

4 (Nearly) Tight Bounds for Count-Sketch with Zipfians

In this section we proceed to analyze Count-Sketch for Zipfians, using either a single or multiple hash functions. We start with two simple lemmas which, for certain frequencies of the items in the stream, can be used to obtain respectively good upper and lower bounds on the expected error E[|f̂_i − f_i|] in Count-Sketch with a single hash function. We will use these two lemmas both in our analysis of standard and of learned Count-Sketch for Zipfians.

Lemma 4.1.

Let f_1, …, f_n ≥ 0, let η_1, …, η_n be independent Bernoulli variables taking value 1 with probability p, and let σ_1, …, σ_n be independent Rademachers, i.e., P(σ_i = 1) = P(σ_i = −1) = 1/2. Let S = Σ_{i ∈ [n]} f_i η_i σ_i. Then

Proof.

Using that for and Jensen’s inequality

from which the result follows. ∎
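The calculation is elementary; a plausible reconstruction, assuming the setup above with Pr[η_i = 1] = p (in the Count-Sketch application, p = 1/B):

\[
E[S^2] = \sum_{i \in [n]} f_i^2\, E[\eta_i^2]\, E[\sigma_i^2] = p \sum_{i \in [n]} f_i^2,
\qquad \text{so by Jensen's inequality} \quad
E|S| \le \sqrt{E[S^2]} = \Big(p \sum_{i \in [n]} f_i^2\Big)^{1/2}.
\]

(The cross terms vanish since E[σ_i] = 0 and the variables are independent.)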

Lemma 4.2.

Suppose that we are in the setting of Lemma 4.1. Let and let be defined by . Then

Proof.

Let , , and . Let denote the event that and have the same sign or . Then by symmetry. For we denote by the event that . Then and furthermore and are independent. If occurs, then and as the events are disjoint it thus follows that

4.1 One hash function

We are now ready to commence our analysis of Count-Sketch for Zipfians. As in the discussion succeeding Theorem 3.1 the following theorem yields the desired result for a single hash function as presented in Table 1.

Theorem 4.3.

Suppose that B ≤ n, and let h : [n] → [B] and s : [n] → {−1, 1} be truly random hash functions. Define the random variable X_i = Σ_{l ≠ i : h(l) = h(i)} s(l) f_l for i ∈ [n]. Then

Proof.

Let be fixed. We start by defining and and note that

Using the triangle inequality

Also, by Lemma 4.1, and combining the two bounds we obtain the desired upper bound.

For the lower bound we apply Lemma 4.2 with concluding that

4.2 Multiple hash functions

Let k be odd. For a tuple a = (a_1, …, a_k) we denote by median(a) the median of the entries of a. The following theorem immediately leads to the result on CS with k hash functions claimed in Table 1.

Theorem 4.4.

Let k be odd, let B ∈ ℕ, and let h_1, …, h_k : [n] → [B] and s_1, …, s_k : [n] → {−1, 1} be truly random hash functions. Define the random variable X_i = median_{j ∈ [k]} Σ_{l ≠ i : h_j(l) = h_j(i)} s_j(l) f_l for i ∈ [n]. Assume that kB ≤ n. (This very mild assumption can probably be removed at the cost of a more technical proof; in our proof it can even be replaced by a weaker assumption.) Then

The assumption kB ≤ n simply says that the total number of buckets is upper bounded by the number of items. Again using linearity of expectation for the summation over i ∈ [n], and replacing B by B/k, we obtain the claimed upper and lower bounds of Table 1. We note that even though the upper and lower bounds above do not quite match, they still imply that it is asymptotically optimal to choose k to be a constant, e.g. k = 3. To settle the correct asymptotic growth is thus of merely theoretical interest.

Proof.

If n (and hence B) is a constant, the result follows easily from Lemma 4.1, so in what follows we may assume that n is larger than a sufficiently large constant.

We first prove the upper bound. Define and . Let for , and . Finally write .

As the absolute error in Count-Sketch with one pair of hash functions is always upper bounded by the corresponding error in Count-Min with the single hash function , we can use the bound in the proof of Lemma 3.2 to conclude that

when . Also

so by Bennett’s inequality (with and ) and Remark 3.4,

for . It follows that for ,

Let be the implicit constant in the -notation above. If , at least half of the values are at least . For it thus follows by a union bound that

(4)

If is chosen sufficiently large it thus holds that

Here the first inequality uses eq. 4 and a change of variable. The second inequality uses that for some constant followed by a calculation of the integral. For our upper bound it therefore suffices to show that . For this we need the following claim:

Claim 4.5.

Let be the closed interval centered at the origin of length , i.e., . Suppose that . For , .

Before proving this claim we will first show how to use it to establish the desired result. For this let be fixed. If , at least half of the values are at least or at most . Let us focus on bounding the probability that at least half are at least , the other bound being symmetric giving an extra factor of in the probability bound. By symmetry and the claim, . For we define , and we put . Then . If at least half of the values are at least then . By Hoeffding’s inequality we can bound the probability of this event by

It follows that . Thus

It thus suffices to prove the claim.

Proof of Claim 4.5.

We first show that with probability , lies in the interval for some constant . To see this we note that by Lemma 4.1, , so it follows by Markov’s inequality that if is large enough, the probability that is at most . For a constant probability lower bound on we write

Condition on . If the probability that there exist exactly two with is at least

With probability the corresponding signs are both the same as that of . By independence of and the probability that this occurs is at least and if it does, . Combining these two bounds it follows that