Heavy Hitters over Interval Queries

04/28/2018 · by Ran Ben Basat, et al. · Technion and Harvard University

Heavy hitters and frequency measurements are fundamental in many networking applications such as load balancing, QoS, and network security. This paper considers a generalized sliding window model that supports frequency and heavy hitters queries over an interval given at query time. This enables drill-down queries, in which the behavior of the network can be examined in finer and finer granularities. For this model, we asymptotically improve the space bounds of existing work, reduce the update and query time to a constant, and provide deterministic solutions. When evaluated over real Internet packet traces, our fastest algorithm processes packets 90--250 times faster, serves queries at least 730 times quicker and consumes at least 40% less space than the known method.


1 Introduction

High-performance stream processing is essential for many applications such as financial data trackers, intrusion-detection systems, network monitoring, and sensor networks. Such applications require algorithms that are both time and space efficient to cope with high-speed data streams. Space efficiency is needed, due to the memory hierarchy structure, to enable cache residency and to avoid page swapping. This residency is vital for obtaining good performance, even when the theoretical computational cost is small (e.g., constant time algorithms may be inefficient if they access the DRAM for each element). To that end, stream processing algorithms often build compact approximate sketches (synopses) of the input streams.

Recent items are often more relevant than old ones, which requires an aging mechanism for the sketches. Many applications realize this by tracking the stream's items over a sliding window. That is, the sliding window model [18] considers only a window of the $W$ most recent items in the stream, while older ones do not affect the quantity we wish to estimate. Indeed, the problem of maintaining different types of sliding window statistics has been studied extensively [4, 8, 18, 27, 33].

Yet, sometimes the window of interest may not be known a priori, or there may be multiple interesting windows [17]. Further, the ability to perform drill-down queries, in which we examine the behavior of the system at finer and finer granularities, may also be beneficial, especially for security applications. For example, this enables detecting when precisely a particular anomaly started and who was involved in it [20]. Additional applications for this capability include identifying the sources of flash crowd effects and pinpointing the cause-effect relation surrounding a surge in demand on an e-commerce website [26].

In this work, we study a model that allows the user to specify an interval of interest at query time. This extends traditional sliding windows, which only consider fixed-size windows. As depicted in Figure 1, a sub-interval of a maximal window is passed as a parameter of each query, and the goal of the algorithm is to reply correspondingly. Naturally, one could maintain an instance of a sliding window algorithm for each possible interval within the maximal sliding window. Alas, this is both computationally and space inefficient. Hence, the challenge is to devise efficient solutions.

This same model was previously explored in [33], whose solution is based on exponential histograms [18]. However, as we elaborate below, that solution is both memory-wasteful and computationally inefficient. Further, it only provides probabilistic guarantees.

Contributions

Algorithm Space Update Time Query Time Comments
WCSS [8] Only supports fixed-size window queries.
ECM [33] Only provides probabilistic guarantees.
RAW Uses prior art (WCSS) as a black box.
ACC Constant-time operations for any fixed level $k$.
HIT Optimal space and constant-time updates under reasonable assumptions on the error target.
Table 1: Comparison of the algorithms proposed in this paper with ECM and with WCSS (which solves the simpler problem of fixed-size windows). ACC can be instantiated for any level $k$.
Figure 1: We process items and support frequency queries within an interval specified at query time. While the traditional sliding window model can answer queries only for a fixed window, our approach allows us to consider any interval that is contained within the last $W$ items. In this example, we ask about the frequency of an item within a given sub-interval; if we allow an additive error of $\epsilon W$, the answer must be within $\epsilon W$ of the item's true frequency in that interval.

Our work focuses on the problem of estimating frequencies over an ad-hoc interval given at query time. We start by introducing a formal definition of this generalized estimation problem, nicknamed $(W,\epsilon)$-IntervalFrequency.

To systematically explore the problem, we first present a naïve strawman algorithm (RAW) that uses multiple instances of a state-of-the-art fixed-window algorithm. In this approach, an interval query is satisfied by querying the instances that are closest to the beginning and end of the interval and then subtracting their results. This algorithm is memory-wasteful and its update time is slow, but it serves as a baseline for our more sophisticated solutions. Interestingly, RAW achieves constant query time, while the previously published ECM algorithm [33] answers queries in time that depends on the maximal window size $W$ and on the failure probability $\delta$. Additionally, RAW requires about the same amount of memory as ECM and is deterministic, while ECM has an error probability.

While developing our advanced algorithms, we discovered that both intrinsically solve a common problem that we nickname $n$-Interval. Hence, our next contribution is in identifying and formally defining the $n$-Interval problem and showing a reduction from $n$-Interval to the $(W,\epsilon)$-IntervalFrequency problem. This makes our algorithms shorter, simpler, and easier to prove, analyze, and implement.

Our algorithms, nicknamed HIT and ACC (to be precise, ACC is a family of algorithms), process items in constant time (under reasonable assumptions on the error target), asymptotically faster than RAW. HIT is asymptotically memory optimal while serving queries in logarithmic time. Conversely, ACC answers queries in constant time and incurs a sub-quadratic space overhead.

We present formal correctness proofs as well as space and runtime analysis. We summarize our solutions’ asymptotic performance in Table 1.

Our next contribution is a performance evaluation study of our various algorithms along with (i) ECM-Sketch [33], the previously suggested solution for interval queries, and (ii) the state-of-the-art fixed-window algorithm WCSS [8], which serves as a best-case reference point since it solves a more straightforward problem. We evaluate on real-world packet traces from Internet backbone routers, from a university datacenter, and from a university's border router. Overall, our methods (HIT and ACC) process items considerably faster and consume far less space than the naive approach (RAW), while requiring a similar amount of memory as the state-of-the-art fixed-size window algorithm (WCSS). Compared to the previously known solution to this problem (ECM-Sketch [33]), all our advanced algorithms are both faster and more space efficient. In particular, our fastest algorithm, ACC, processes items 90--250 times faster than ECM-Sketch, serves queries at least 730 times quicker, and consumes at least 40% less space.

Last, we extend our results to time-based intervals, heavy hitters [31, 12], hierarchical heavy hitters [15, 21], and detecting traffic volume heavy hitters [9], i.e., counting each flow's total traffic rather than its item count. We also discuss applying our algorithms in distributed settings, in which measurements are recorded independently by multiple sites (e.g., multiple routers), and the goal is to obtain a global network analysis.

Paper roadmap

We briefly survey related work in Section 2. We state the formal model and problem statement in Section 3. Our naïve algorithm, RAW, is described in Section 4. In Section 5, we present the auxiliary $n$-Interval problem, which both our advanced algorithms solve, together with a simple reduction from it to the $(W,\epsilon)$-IntervalFrequency problem. The improved algorithms, HIT and ACC, are then described in Section 6. The performance evaluation of our algorithms and their comparison to ECM-Sketch and WCSS is detailed in Section 7. Section 8 discusses extensions of our work. Finally, we conclude with a discussion in Section 9.

2 Related Work

Count Sketch [13] and Count-Min sketch [16] are perhaps the two most widely used sketches for maintaining items' frequency estimates over a stream. The problem of estimating item frequencies over sliding windows was first studied in [4]. For estimating frequencies within an $\epsilon W$ additive error over a $W$-sized window, their algorithm requires a superoptimal number of bits, which was later reduced to the optimal bound [27]. In [23], Hung and Ting improved the update time while still finding all heavy hitters in optimal time. Finally, the WCSS algorithm presented in [8] also estimates item frequencies in constant time. While some of these works also considered a variant in which the window can expand and shrink when processing updates [4, 27], the window size there changes by one at each update and cannot be specified at query time.

The work most relevant to ours is [33], which was the first to explore heavy hitters interval queries. It introduced a sketching technique with probabilistic accuracy guarantees called the Exponential Count-Min sketch (ECM-Sketch). ECM-Sketch combines the Count-Min sketch structure [16] with Exponential Histograms [18]. A Count-Min sketch is composed of a set of $d$ hash functions and a two-dimensional array of counters of width $w$ and depth $d$. To add an item $x$ with value $v$, the Count-Min sketch increases the counters located at $(i, h_i(x))$ by $v$, for $i = 1, \ldots, d$. A point query for an item $x$ returns the minimum value of its corresponding cells.
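To make the structure concrete, here is a minimal Count-Min sketch in Python. It is an illustrative sketch, not the authors' implementation: seeded use of Python's built-in hash stands in for the pairwise-independent hash functions the analysis assumes.

```python
import random

class CountMin:
    """Count-Min sketch: d rows of w counters. Adding increments one
    counter per row; a point query returns the minimum of the d cells."""
    def __init__(self, w: int, d: int, seed: int = 42):
        self.w, self.d = w, d
        rnd = random.Random(seed)
        self.seeds = [rnd.getrandbits(64) for _ in range(d)]
        self.rows = [[0] * w for _ in range(d)]

    def add(self, x, value: int = 1) -> None:
        for i in range(self.d):
            self.rows[i][hash((self.seeds[i], x)) % self.w] += value

    def query(self, x) -> int:
        return min(self.rows[i][hash((self.seeds[i], x)) % self.w]
                   for i in range(self.d))
```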

Exponential Histograms [18] allow tracking of metrics over a sliding window to within a multiplicative error. Specifically, they allow one to estimate the number of 1's in a sliding window of a binary stream. To that end, they utilize a sequence of buckets such that each bucket stores the timestamp of the oldest 1 in the bucket. When a new element arrives, a new bucket is created for it; to save space, the histogram may merge older buckets. While the amortized update complexity is constant, some arriving elements may trigger a logarithmic-length cascade of bucket merges.
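The following is a minimal DGIM-style Exponential Histogram sketch. It assumes the common variant in which each bucket records the timestamp of its most recent 1 (the paper's description stores the oldest), and the per-size bucket cap follows the standard analysis:

```python
import math
from collections import deque

class ExponentialHistogram:
    """Approximately counts the 1's in the last `window` bits of a binary
    stream, keeping at most `cap` buckets of each power-of-two size."""
    def __init__(self, window: int, eps: float):
        self.window = window
        self.cap = math.ceil(1 / eps) // 2 + 2
        self.buckets = deque()   # (time of most recent 1, size), newest first
        self.time = 0

    def add(self, bit: int) -> None:
        self.time += 1
        # Expire the oldest bucket once it falls outside the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.appendleft((self.time, 1))
        size = 1
        while True:              # merge cascade, from the smallest size upward
            idxs = [k for k, (_, s) in enumerate(self.buckets) if s == size]
            if len(idxs) <= self.cap:
                break
            i, j = idxs[-2], idxs[-1]        # the two oldest buckets of this size
            newer_time = self.buckets[i][0]
            del self.buckets[j]
            self.buckets[i] = (newer_time, 2 * size)
            size *= 2

    def query(self) -> float:
        # All full buckets, plus half of the (partially expired) oldest one.
        if not self.buckets:
            return 0.0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] / 2
```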

ECM-Sketch replaces each Count-Min counter with an Exponential Histogram. Adding an item to the structure is analogous to the case of the regular Count-Min sketch: for each of the $d$ corresponding histograms, the item is registered with the time of its arrival, and all expired information is removed from the Exponential Histogram. To query an item $x$ over a range, each of the $d$ corresponding histograms computes its count within the given query range, and the estimate for the frequency of $x$ is the minimum among them. While the Exponential Histogram counters estimate the counts within a multiplicative error, their combination with the Count-Min sketch changes the error guarantee to additive.

An alternative approach for these interval queries was proposed in [17]. Their solution uses hCount [24], a sketch algorithm that is essentially identical to the Count-Min sketch. Unlike ECM-Sketch, which uses a matrix of Exponential Histograms, [17] uses a sequence of buckets of exponentially growing sizes, each of which is associated with an hCount instance. When queried, [17] finds the buckets closest to the interval and queries their hCount instances. The paper does not provide formal accuracy guarantees but shows that the approach has reasonable accuracy in practice. The actual error has two components: (i) an error along the time axis, incurred when the queried interval is not fully aligned with the buckets; and (ii) an additive error, with some failure probability, due to the hCount instances used for the queried buckets.

In other domains, ad-hoc window queries were proposed and investigated. That is, the algorithm assumes a predetermined upper bound $W$ on the window size, and the user can specify the actual window size at query time. This model was studied for quantiles [28] and summing [6].

The problem of identifying the frequent items in a data stream, known as heavy hitters, dates back to the 80's [31]. There, Misra and Gries (MG) proposed a space-optimal algorithm for computing an $\epsilon n$-additive approximation of element frequencies in an $n$-sized stream. Their algorithm's update time was later improved to a constant [19, 25]. Later, the Space Saving (SS) algorithm was proposed [30] and shown to be empirically superior to prior art (see also [14, 29]). Surprisingly, Agarwal et al. recently showed that MG and SS are isomorphic [2], in the sense that from a $k$-counter MG data structure one can compute the estimates that a $k$-counter SS algorithm would produce.

The problem of hierarchical heavy hitters, which has important security and anomaly detection applications [32], was previously addressed with the SS algorithm [32]. To estimate the number of packets that originate from a specific network (rather than a single IP source), it maintains several separate SS instances, each dedicated to measuring a different network size (e.g., networks with 2-byte net IDs are tracked separately from those with 3-byte IDs, etc.). When a packet arrives, all possible prefixes are computed and each is fed into the relevant SS instance, as sketched below. Recently, it was shown that randomization techniques can drive the update complexity down to a constant [5, 10].
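A small illustration of this prefix-expansion scheme (with hypothetical helper names; exact `collections.Counter` instances stand in for the per-level SS instances):

```python
from collections import Counter

def ip_prefixes(ip: str, lengths=(8, 16, 24, 32)):
    """Hypothetical helper: the /8, /16, /24, and /32 prefixes of an IPv4."""
    octets = ip.split(".")
    return [".".join(octets[: l // 8]) + f"/{l}" for l in lengths]

class HHHCounter:
    """One frequency structure per prefix length."""
    def __init__(self, lengths=(8, 16, 24, 32)):
        self.levels = {l: Counter() for l in lengths}

    def add(self, src_ip: str) -> None:
        for prefix in ip_prefixes(src_ip, tuple(self.levels)):
            self.levels[int(prefix.split("/")[1])][prefix] += 1

    def query(self, prefix: str) -> int:
        return self.levels[int(prefix.split("/")[1])][prefix]
```

For example, `add("10.1.2.3")` updates the counts of `10/8`, `10.1/16`, `10.1.2/24`, and `10.1.2.3/32`.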

3 Preliminaries

Given a universe $U$, a stream $S$ is a sequence of universe elements. We denote by $W$ the maximal window size; that is, we consider algorithms that answer queries for an interval contained within the window of the last $W$ elements. The actual value of $W$ is application dependent. For example, a network operator that wishes to monitor up to a minute of traffic on a major backbone link may need a $W$ of tens of millions of packets [22]. Given an element $x$ and an integer $i \le W$, the $i$-frequency of $x$, denoted $f_x^i$, is the number of times $x$ appears within the last $i$ elements of $S$. For integers $i \le j \le W$, we further denote by $f_x^{(i,j)}$ the frequency of $x$ between the $i$'th and $j$'th most recent elements of $S$.

We seek algorithms that support the following operations:

  • ADD$(x)$: given an element $x \in U$, append $x$ to $S$.

  • IntervalFrequencyQuery$(x, i, j)$: given $x \in U$ and indices $i \le j \le W$, return an estimate $\widehat{f_x^{(i,j)}}$ of $f_x^{(i,j)}$.

We now formalize the required guarantees.

Definition 1.

An algorithm solves $(W,\epsilon)$-IntervalFrequency if, given any IntervalFrequencyQuery$(x, i, j)$, it satisfies $\left|\widehat{f_x^{(i,j)}} - f_x^{(i,j)}\right| < \epsilon W$.

For simplicity of presentation, we assume that $1/\epsilon$ and $\epsilon W$ are integers. For ease of reference, Table 2 includes a summary of the basic notation used in this work.

Symbol  Meaning
$S$  the data stream
$U$  the universe of elements
$W$  the maximal window size
$f_x^i$  the frequency of element $x$ within the last $i$ elements of $S$
$\widehat{f_x^i}$  an estimation of $f_x^i$
$f_x^{(i,j)}$  the frequency of $x$ between the $i$'th and $j$'th most recent elements of $S$
$\widehat{f_x^{(i,j)}}$  an estimation of $f_x^{(i,j)}$
$\epsilon$  estimation accuracy parameter
$\delta$  probability of failure
$n$  number of blocks in a frame
$N$  max sum of blocks' cardinalities within a window
$H$  the interval's heavy hitters
$\widehat{H}$  an estimation of the heavy hitters set
Table 2: List of Symbols

Space Saving: as we use the Space Saving (SS) algorithm [30] in our reduction in Section 5.2, we overview it here. SS maintains a set of $k$ counters, each with an associated element and a value. When an item arrives, SS first checks if it has a counter. If so, the counter is incremented; otherwise, SS assigns the item a minimal-valued counter. For example, assume that the smallest counter was associated with $y$ and had a value of $v$; if $x$ arrives and has no counter, it takes over $y$'s counter and increments its value to $v+1$ (leaving $y$ without a counter). When queried for the frequency of a flow, we return the value of its counter if it has one, or the minimal counter's value otherwise. If we denote the overall number of insertions by $m$, then the sum of counters equals $m$, and the minimal counter is at most $m/k$. This ensures that the error in the SS estimate is at most $m/k$. An important observation is that once a counter exceeds the maximal possible value of the minimal counter, it is no longer the minimum throughout the rest of the measurement.
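A minimal Python sketch of Space Saving follows; a production version would use the Stream Summary structure to locate the minimal counter in constant time rather than with the linear scan below.

```python
class SpaceSaving:
    """Space Saving [30] with k shared counters: after m insertions the
    sum of counters is m and every estimate errs by at most m/k."""
    def __init__(self, k: int):
        self.k = k
        self.counters = {}                 # element -> counter value

    def add(self, x) -> None:
        if x in self.counters:
            self.counters[x] += 1
        elif len(self.counters) < self.k:
            self.counters[x] = 1
        else:
            # x takes over a minimal-valued counter, incrementing it.
            y = min(self.counters, key=self.counters.get)
            self.counters[x] = self.counters.pop(y) + 1

    def query(self, x) -> int:
        if x in self.counters:
            return self.counters[x]
        return min(self.counters.values(), default=0)
```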

4 Strawman Algorithm

Figure 2: The block stream setting. Here, after the last EndBlock, the item appears in two of the last $n$ blocks, so its window block frequency is 2.

Here, we present the simple Redundant Approximate Windows (RAW) algorithm, which uses several instances of a black box algorithm $A$ for solving the frequency estimation problem over a fixed $W$-sized window. That is, we assume that $A$ supports the ADD operation and, upon Query$(x)$, produces an estimation $\widehat{f_x^W}$ that satisfies:

$\left|\widehat{f_x^W} - f_x^W\right| \le \epsilon W.$

We note that the WCSS algorithm [8] solves this problem using $O(1/\epsilon)$ counters and in constant time for updates and queries. Both its runtime and space are optimal (a lower bound of matching asymptotic complexity appears in [25], even for non-window solutions).

Specifically, we maintain multiple separate solutions, each an instance of $A$ configured for a different window size; the ADD$(x)$ operation simply invokes ADD$(x)$ on every instance. When given an IntervalFrequencyQuery$(x, i, j)$, we return the difference between the estimates of the two instances whose windows are closest to the interval's endpoints:

(1)

We now state the correctness of RAW. Due to lack of space, we defer the proof to the full version of the paper [11]. Next, we analyze the properties of RAW.

Theorem 1.

Let $A$ be a black box algorithm as above that uses $s(A)$ space and runs in $t_u(A)$ time for updates and $t_q(A)$ time for queries. Then RAW requires $O(s(A)/\epsilon)$ space, performs updates in $O(t_u(A)/\epsilon)$ time, and answers queries in $O(t_q(A))$ time. Further, RAW solves the $(W,\epsilon)$-IntervalFrequency problem.

Proof.

The run times above follow immediately from the fact that RAW utilizes $O(1/\epsilon)$ instances of $A$, updates each of them when processing elements, and queries only two instances per interval query. Next, we prove the correctness of RAW.

Notice that we can express the interval frequency as:

$f_x^{(i,j)} = f_x^{j} - f_x^{i-1}. \quad$ (2)

Next, we note that each endpoint of the queried interval is within $\epsilon W$ of the window size of its nearest instance, and since every stream element changes any frequency by at most one, the corresponding frequencies differ by at most $\epsilon W$ as well. Plugging this into (2), we get

Now, our estimation in (1) relies on the estimations produced by the two queried instances of $A$. By the correctness of $A$, we are guaranteed that

(3)

Combining (3) with (1), we establish

(4)

Similarly,

(5)

Finally, we substitute (4) and (5) in (2) to obtain the desired bound. ∎

While RAW does not assume anything about $A$, WCSS was shown to be asymptotically optimal both in terms of runtime and memory [8]. Thus, obtaining an improved fixed-window algorithm can only yield constant-factor reductions in time and space. Also, while $A$'s error is proportional to its window size (i.e., the error of the instance whose window size is $W_i$ is at most $\epsilon W_i$, which may be smaller than the $\epsilon W$ we used in the analysis), optimizing the error for each individual instance does not reduce the space by more than 50%. In the next section, we propose novel techniques to asymptotically reduce both space and update time. Taking into account that every counter consists of an $O(\log |U|)$-bit identifier and an $O(\log W)$-bit value, we conclude the following:

Corollary 1.

Using WCSS as the black box algorithm $A$, RAW requires $O(\epsilon^{-2}(\log W + \log |U|))$ bits, performs updates in $O(1/\epsilon)$ time, and answers queries in constant time.
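To make RAW concrete, here is a minimal Python sketch under our reading of the analysis above: roughly $1/\epsilon$ instances with window sizes at multiples of $\epsilon W$ (an assumption; the exact instance placement is elided in the text), with an exact window counter standing in for the WCSS black box.

```python
from collections import Counter, deque

class ExactWindow:
    """Stand-in for the fixed-window black box A (e.g., WCSS [8]):
    exact frequencies over the last w items."""
    def __init__(self, w: int):
        self.w, self.items, self.freq = w, deque(), Counter()

    def add(self, x) -> None:
        self.items.append(x)
        self.freq[x] += 1
        if len(self.items) > self.w:
            self.freq[self.items.popleft()] -= 1

    def query(self, x) -> int:
        return self.freq[x]

class RAW:
    """~1/eps instances with window sizes eps*W, 2*eps*W, ..., W."""
    def __init__(self, W: int, eps: float):
        self.step = max(1, int(eps * W))
        self.instances = [ExactWindow(w)
                          for w in range(self.step, W + 1, self.step)]

    def add(self, x) -> None:
        for inst in self.instances:            # O(1/eps) updates per item
            inst.add(x)

    def interval_query(self, x, i: int, j: int) -> int:
        # Frequency between the i'th and j'th most recent items: snap each
        # endpoint to the nearest instance and subtract the two estimates.
        k_old = min(len(self.instances), max(1, round(j / self.step)))
        k_new = min(len(self.instances), round((i - 1) / self.step))
        older = self.instances[k_old - 1].query(x)
        newer = self.instances[k_new - 1].query(x) if k_new >= 1 else 0
        return older - newer
```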

5 Block Interval Frequency

In this section, we formally define an auxiliary problem, nicknamed $n$-Interval, show a reduction from it to the $(W,\epsilon)$-IntervalFrequency problem, and rigorously analyze the reduction's cost. Our motivation lies in the fact that the algorithms suggested in Section 6 both intrinsically solve the $n$-Interval auxiliary problem. This approach also has the benefit that any improved reduction between these problems would improve both algorithms. In $n$-Interval, the arriving elements are inserted into "blocks", and we are required to compute exact interval frequencies over the blocks. Doing so simplifies the presentation and analysis of the algorithms in Section 6, in which we propose algorithms that improve over RAW in both space and update time. The two algorithms, ACC and HIT, present a space-time tradeoff while achieving asymptotic reductions over RAW.

5.1 The Block Interval Frequency Problem

Here, instead of frequency, we consider items' block frequency. Namely, for an element $x$, we define its window block frequency as the number of blocks among the last $n$ blocks in which $x$ appears. For integers $i \le j \le n$, we define the block interval frequency of $x$ analogously, with respect to the $i$'th through $j$'th most recent blocks. Block algorithms support three operations:

  • ADD$(x)$: given an element $x \in U$, add it to the stream.

  • ENDBLOCK: a new empty block is inserted into the window, and the oldest one leaves.

  • IntervalQuery$(x, i, j)$: given an element $x$ and indices $i \le j \le n$, compute $x$'s block interval frequency (without error).

We say that an algorithm solves the $n$-Interval problem if, given an IntervalQuery$(x, i, j)$, it computes the exact answer for any $x$ and any $i \le j \le n$.

For analyzing the memory requirements of algorithms solving this problem, we denote by $N$ the sum of the cardinalities of the blocks in the $n$-sized window. An example of this setting is given in Figure 2.

5.2 A Reduction to $(W,\epsilon)$-IntervalFrequency

We show a reduction from the $n$-Interval problem to $(W,\epsilon)$-IntervalFrequency. To that end, we assume that $B$ is an algorithm that solves the $n$-Interval problem for an appropriate number of blocks $n$.

Our reduction relies on the observation that by applying such a $B$ to a data structure maintained by a counter-based algorithm, such as Space Saving [30], we can answer interval queries and not only fixed-size window frequency estimations. The setup of the reduction is illustrated in Figure 3.

Figure 3: The stream is logically divided into intervals of size $W$ called frames, and each frame is logically partitioned into equal-sized blocks. The window of interest is also of size $W$, and it overlaps with at most two frames and their blocks.

We break the stream into $W$-sized frames, which are further divided into blocks of $b$ elements each. We employ a Space Saving [30, 8] instance to track element frequencies within each frame; it supports two methods: Add$(x)$, which adds the element $x$ to the stream, and Query$(x)$, which reports the frequency estimation of element $x$ with tight guarantees on the error.

Whenever a counter reaches an integer multiple of the block size $b$, we add its associated flow's identifier to the most recent block of $B$. When a frame ends, we flush the Space Saving instance and reset all of its counters; we note that an implementation that supports constant-time flush operations was suggested in [8]. Also, since a window overlaps with at most two frames, the max sum of blocks' cardinalities within a window is bounded accordingly. Finally, we reduce each IntervalFrequencyQuery to an IntervalQuery by computing the indices of the blocks in which the interval starts and ends. The variables of the reduction algorithm are described in Table 3 and its pseudocode appears in Algorithm 1.

offset  the offset within the current frame.
$B$  an algorithm that solves the $n$-Interval problem.
$SS$  a Space Saving instance with $O(1/\epsilon)$ counters.
$b$  the size of blocks (fixed at $\Theta(\epsilon W)$).
Table 3: Variables used by Algorithm 1.
1: Initialization: offset ← 0
2: function Add(x)
3:     SS.Add(x)
4:     offset ← offset + 1
5:     if SS.Query(x) is an integer multiple of b then
6:         append x's identifier to the current block of B
7:     if offset mod b = 0 then
8:         B.EndBlock()
9:     if offset = W then
10:        flush SS; offset ← 0
11: function IntervalFrequencyQuery(x, i, j)
12:     return b · B.IntervalQuery(x, ⌈i/b⌉, ⌈j/b⌉) + 2b
Algorithm 1 From Blocks to Approximate Frequencies
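The following runnable companion to Algorithm 1 reuses the SpaceSaving class sketched in Section 3; a naive exact block store stands in for the $n$-Interval solver $B$, and the block size, counter count, and $+2b$ slack follow our reading of the (partially elided) analysis in Section 5.3.

```python
import math
from collections import Counter

class NaiveBlocks:
    """Naive exact stand-in for an n-Interval solver B: one Counter per block."""
    def __init__(self, n: int):
        self.n = n
        self.blocks = [Counter()]          # blocks[-1] is the current block

    def add(self, x) -> None:
        self.blocks[-1][x] += 1

    def end_block(self) -> None:
        self.blocks.append(Counter())
        if len(self.blocks) > self.n:
            self.blocks.pop(0)

    def interval_query(self, x, i: int, j: int) -> int:
        recent = self.blocks[::-1]         # index 1 = current block
        return sum(1 for blk in recent[i - 1:j] if x in blk)

class BlocksToFrequencies:
    """Sketch of Algorithm 1: a per-frame SpaceSaving plus a block stream."""
    def __init__(self, W: int, b: int, k: int):
        self.W, self.b, self.k = W, b, k   # frame size, block size, SS counters
        self.offset = 0
        self.ss = SpaceSaving(k)           # see the sketch in Section 3
        self.B = NaiveBlocks(2 * W // b)   # a window overlaps at most 2 frames

    def add(self, x) -> None:
        self.ss.add(x)
        self.offset += 1
        if self.ss.query(x) % self.b == 0: # counter hit a multiple of b (Line 5)
            self.B.add(x)                  # register x in the current block
        if self.offset % self.b == 0:      # a block of b arrivals ended
            self.B.end_block()
        if self.offset == self.W:          # frame ended: flush SS (Line 9)
            self.ss = SpaceSaving(self.k)
            self.offset = 0

    def interval_frequency_query(self, x, i: int, j: int) -> int:
        # Scale block counts by b; the +2b slack compensates the two
        # underestimation sources in Theorem 2 (our reading of Line 12).
        blocks = self.B.interval_query(x, math.ceil(i / self.b),
                                       math.ceil(j / self.b))
        return self.b * blocks + 2 * self.b
```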

5.3 Theoretical Analysis

Given a query IntervalFrequencyQuery$(x, i, j)$, we are required to estimate $f_x^{(i,j)}$. Our estimator is the one computed in Line 12. Intuitively, we query $B$ for the block frequency of $x$ in the minimal sequence of blocks that contains the interval. Every time $x$'s counter reaches an integer multiple of the block size $b$, the condition in Line 5 is satisfied and the block frequency of $x$, as tracked by $B$, increases by one. Thus, multiplying the block frequency by $b$ allows us to approximate $x$'s frequency in the original stream.

There are several sources of estimation error: First, we do not have a dedicated counter for each element but rather a Space Saving instance in which counters are shared. Next, unless the counter of an item reaches an integer multiple of $b$, its residual arrivals are not reflected in the block stream. Additionally, the queried interval might not be aligned with the blocks. Finally, when a frame ends, we flush the counters and thus lose the frequency counts of elements that are not recorded in the block stream. With these sources of error in mind, we prove the correctness of our algorithm.

Theorem 2.

Let $B$ be an algorithm for the $n$-Interval problem. Then Algorithm 1 solves the $(W,\epsilon)$-IntervalFrequency problem.

Proof.

We begin by noticing that once an element's counter reaches $b$, it stays associated with that element until the end of the frame. This follows directly from the Space Saving algorithm, which only disassociates elements whose counter is minimal among all counters (see the SS overview in Section 3). Recall that the number of elements in a frame is $W$ and that the Space Saving instance is allocated with enough counters so that, since the sum of counters always equals the number of elements processed, any counter that reaches a value of $b$ will never be minimal. Thus, once an element is added to a block (Line 6), its block frequency within the frame increases by one for every $b$ subsequent arrivals. This means that an item might be added to a block while appearing just once in the stream, but this yields an overestimation of at most $b$. As the queried interval can overlap with two frames, this can happen at most twice, which imposes an overestimation error of no more than $2b$.

Our next error source is the fact that the queried interval may begin and end anywhere within a block. By considering the full blocks that contain $i$ and $j$, regardless of their offsets, we incur another overestimation error of at most $2b$.

We have two sources of underestimation error, where an item's frequency is lower than $b$ times its block frequency. The first is the count we lose when flushing the Space Saving instance: since we record every multiple of $b$ in the block stream, a frequency of at most $b$ is lost due to the flush. Second, in the current frame, the residual frequency of an item (i.e., the appearances that have not yet been recorded in the block stream) may be as large as $b$. We make up for these by adding $2b$ to the estimation (Line 12). As we have covered all error sources, the total error is smaller than $\epsilon W$ for a suitable choice of $b = \Theta(\epsilon W)$. ∎

Reducing the Error

Above, we used a block size that is a fixed fraction of $\epsilon W$; the constant can be improved as follows. One of the error sources in Theorem 2 is that the queried interval may begin and end in the middle of a block, while we always consider the entire blocks that contain $i$ and $j$. We can optimize this by considering $i$'s and $j$'s offsets within the relevant blocks, and including a boundary block's frequency only if the offset crosses half the size of the block. This incurs an overestimation error of at most $b$ instead of $2b$, which allows larger blocks and hence reduces the number of blocks per frame.

6 Improved Algorithms

6.1 Approximate Cumulative Count (ACC)

We present a family of algorithms for solving the $n$-Interval problem. Approximate Cumulative Count of level $k$, denoted $ACC_k$, aims to compute the interval frequencies while accessing at most $k$ hash tables for updates and $O(k)$ for queries. To reduce clutter, we assume in this section that $n^{1/k}$ is an integer; this assumption can be omitted with the necessary adjustments while incurring a constant multiplicative space overhead. This family presents a space-time tradeoff: the larger $k$ is, the less space $ACC_k$ takes, but it is also slower.

The algorithms break the block stream into consecutive frames of $n$ blocks (the maximal window size). That is, blocks $1, \ldots, n$ are in the first frame, blocks $n+1, \ldots, 2n$ in the second frame, and so on. Notice that any $n$-sized window intersects at most two frames. Within each frame, $ACC_k$ uses a hierarchical structure of tables that enables it to compute an item's block frequency in $O(k)$ time.

$ACC_1$ and $ACC_2$ are illustrated in Figure 4 and are explained below. The simplest and fastest algorithm, $ACC_1$, computes for each block a frequency table that tracks how many times each item has arrived since the beginning of the frame. For example, the table for block $i$ (for $i \le n$) contains an entry for each item that is a member of at least one of the frame's first $i$ blocks. The key is the item identifier, and the value is its block frequency since the frame's start. This way, we can compute any block interval frequency by querying at most three tables: within a frame, we compute any interval by subtracting the queried item's block frequency at the beginning of the interval from its block frequency at the end. If the interval spans two frames, we make one additional query to reach the beginning of the frame; in total, we query at most three tables.
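A minimal sketch of $ACC_1$ follows, assuming (as in the Section 5.2 reduction) that each item is added at most once per block; each completed block stores a cumulative table counted from the beginning of its frame, and a query only touches a few such tables.

```python
from collections import Counter

class ACC1:
    def __init__(self, n: int):
        self.n = n                       # blocks per frame
        self.cur = Counter()             # cumulative counts, current frame
        self.cur_frame = []              # per-block cumulative tables, current frame
        self.prev_frame = []             # cumulative tables of the previous frame
        self.seen_in_block = set()

    def add(self, x) -> None:
        # Block frequency: count each item at most once per block.
        if x not in self.seen_in_block:
            self.seen_in_block.add(x)
            self.cur[x] += 1

    def end_block(self) -> None:
        self.cur_frame.append(self.cur.copy())
        self.seen_in_block = set()
        if len(self.cur_frame) == self.n:          # the frame ended
            self.prev_frame = self.cur_frame
            self.cur_frame, self.cur = [], Counter()

    def _cum(self, frame, idx):
        # Cumulative table after `idx` blocks of the frame (0 => empty).
        return frame[idx - 1] if 1 <= idx <= len(frame) else Counter()

    def interval_query(self, x, i: int, j: int) -> int:
        # Block frequency of x between the i'th and j'th most recent blocks.
        t = len(self.cur_frame)                    # completed blocks this frame

        def freq_last(m: int) -> int:              # block frequency, last m blocks
            if m <= 0:
                return 0
            if m <= t:                             # inside the current frame
                return (self._cum(self.cur_frame, t)[x]
                        - self._cum(self.cur_frame, t - m)[x])
            r = m - t                              # blocks from the previous frame
            p = len(self.prev_frame)
            return (self._cum(self.cur_frame, t)[x]
                    + self._cum(self.prev_frame, p)[x]
                    - self._cum(self.prev_frame, p - r)[x])

        return freq_last(j) - freq_last(i - 1)
```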

Figure 4: Illustration of the $ACC_1$ and $ACC_2$ algorithms. $ACC_1$ has only tables that track how many times each item has arrived since the beginning of the frame, while $ACC_2$ has two levels of tables. In $ACC_2$, each per-segment table tracks the frequencies from the beginning of the frame, while the per-block tables aggregate the data since the previous per-segment table.

$ACC_2$ saves space at the expense of additional table accesses. Tables now have two "levels". The core idea is that $ACC_1$ is somewhat wasteful, as it may create up to $n$ table entries for a single item, which appears in all subsequent tables within the frame after its arrival. Instead, we "break" each frame into $\sqrt{n}$-sized segments. At the end of each segment, we keep a single table that counts item frequencies from the beginning of the frame. Since we can use these tables just as in $ACC_1$, we are left with computing the queried item's frequency within a segment. This is achieved with a per-block table, which we maintain for each block. Alas, unlike the per-segment tables, the per-block tables only keep the block frequency counts from the beginning of the segment their block belongs to. Thus, each appearance of an item (within a specific block) can appear in all $\sqrt{n}$ per-segment tables, but in at most $\sqrt{n}$ per-block tables, and this reduces the space consumption. Compared with $ACC_1$, $ACC_2$ reduces the overall number of table entries from $O(Nn)$ to $O(N\sqrt{n})$.

For an interval $(i, j)$, let $b_i$ and $b_j$ be the block numbers of $i$ and $j$, respectively. To answer an interval frequency query for an item $x$, we consider two cases. If $b_i$ and $b_j$ are in the same frame, we access $b_i$'s and $b_j$'s tables to get $x$'s frequency from the beginning of the frame up to each of them, and subtract the results (Line 20). If $b_i$ and $b_j$ are in different frames, we take $x$'s frequency in the blocks within the current frame by accessing $b_i$'s tables, plus its frequency within the last blocks of the previous frame; to do so, we compute $x$'s frequency from the beginning of the previous frame (Lines 21-24). A corner case arises when the table that includes $b_j$ has already left the window. We solve it by maintaining tables for leaving segments: for each level, we keep the table of the last leaving block that has a table at that level (Line 12). Hence, we can subtract the corresponding entries as well.

(branching factor)  the number of segments from a level that compose a next-level segment.
tables  used for tracking block frequencies; each table is identified by a level and the index of the last block in its segment.
incomplete tables  tables for incomplete segments.
leaving tables  tables for leaving segments.
offset  the offset within the current frame.
Table 4: Variables used by the $ACC_k$ algorithm.

Next, we generalize this to arbitrary values of $k$. In $ACC_k$, we have $k$ levels of tables and segments. We consider each block to be in its own level-1 segment and maintain a table for it. Inductively, each level-$\ell$ segment (for $\ell > 1$) consists of $n^{1/k}$ level-$(\ell-1)$ segments. That is, each level-2 segment contains $n^{1/k}$ blocks, each level-3 segment consists of $n^{1/k}$ level-2 segments for a total of $n^{2/k}$ blocks, etc. As each item may now appear in at most $n^{1/k}$ tables of each level, we get that the overall number of table entries is $O(kNn^{1/k})$. To avoid lengthy computations at the end of each segment, we maintain additional "incomplete" tables that contain the cumulative counts for segments that have started but not all of whose blocks have ended yet. A pseudocode of the algorithm appears in Algorithm 2.

1:Init:
2:  
3:function Add()
4:     for  do Update all incomplete tables
5:               
6:function EndBlock()
7:     
8:     while  do
9: A Level- block has ended
10:          Delete all entries
11:               
12:     
13:      Copy Table
14:     if  then New frame
15:               
16:     
17:function WinQuery Frequency in the last blocks
18:     
19:     if  then
20:         return      
21:     
22:     
23:     
24:     return
25:function IntervalQuery()
26:     if  then
27:          return WinQuery      
28:     return WinQuery - WinQuery
Algorithm 2 $ACC_k$

6.1.1 Analysis

The following theorem bounds the memory consumption of the ACC algorithms.

Theorem 3.

Denote by $N$ the sum of the cardinalities of the last $n$ blocks. Algorithm 2 requires $O(kNn^{1/k})$ space, measured in table entries.

Theorem 4.

Algorithm 2 solves the $n$-Interval problem.

Proof Sketch.

We need to prove that, upon an IntervalQuery$(x, i, j)$ query, for any $i \le j \le n$, $ACC_k$ computes the exact answer. Notice that in handling queries in Algorithm 2, we split the computation in two: The first case is when $i = 1$ (Line 26); this means that the interval ends at the most recent block, and thus we only need to return the frequency in the last $j$ blocks of the window, as calculated by WinQuery. Otherwise, we subtract the frequency calculated by WinQuery for the last $i-1$ blocks from the result of WinQuery for the last $j$ blocks. Hence, it suffices to show that the frequency calculated by WinQuery is correct.

As is evident from the code, the incomplete tables store the frequency of items within the current block, while the completed tables store the frequencies of completed blocks from the beginning of their frame. Consider the case where the entire interval is within the current frame. In this case, the frequency of an item in the last blocks can be calculated as its frequency in the current block plus its frequency in the preceding blocks, as is done in Line 18. Notice that, to reduce query time, we access the highest level containing this information. However, since the tables store the frequency from the beginning of the frame, we need to subtract the item's frequency in prior blocks, which is done in Line 20 (again, by accessing the highest-level tables that include this data).

The second case is when the given interval crosses into the previous frame. In this case, we need to add the frequency in the blocks that are included in the previous frame. Once again, we query the table holding the frequency in the last relevant block of that frame and subtract from the result the frequency in the preceding tables. As some of these tables might already be beyond the window limit, their information might be stored in the leaving tables instead. This is handled in Lines 21-24. ∎


6.2 Hierarchical Interval Tree (HIT)

Figure 5: In HIT, first-level tables track how many times each item arrived within the corresponding block. A table at level $\ell$ covers $2^{\ell-1}$ consecutive blocks and tracks how many times each item arrived within them; each table at level $\ell$ merges two frequency tables of level $\ell-1$. For example, a third-level table merges the second-level tables of the two preceding block pairs.

Hierarchical Interval Tree, denoted HIT, tracks flow frequencies using a hierarchical tree structure in which each node stores the partial frequency of its sub-tree. Precisely, the levels of the tree are defined as follows: the first level includes frequency tables, one for each block of the stream, that track how many times each item arrived within the corresponding block. A table at level $\ell$ tracks how many times each item has arrived within the $2^{\ell-1}$ blocks it covers (Line 9). That is, these tables contain partial query results for each item, tracking the item's multiplicity since the previous same-level block. Hence, each level contains tables for half the blocks of the previous level, and a block whose index has $z$ trailing zeros closes tables at $z+1$ levels; we assume that the number of trailing zeros can be computed efficiently, as with dedicated machine instructions on modern CPUs. An illustration of the algorithm appears in Figure 5.

For example, consider blocks $b_1, \ldots, b_4$ in Figure 5. During $b_1$, items $x$ and $y$ arrive; $x$ also arrives in $b_2$ and $b_3$, while there are no item arrivals in $b_4$. The tables are then as follows: the first-level table of $b_4$ is empty because no items arrived within $b_4$; the second-level table covering $b_3$ and $b_4$ contains $x$ with count 1; and the third-level table covering $b_1$ through $b_4$ contains $x$ with count 3 ($b_1$, $b_2$, and $b_3$) and $y$ with count 1 (in $b_1$). Note that each table at level $\ell$ merges two frequency tables of level $\ell-1$.

We can compute any interval frequency by using the hierarchical tree tables. While this could be done with a linear scan over the first level, the higher levels of the tree are designed to allow efficient $O(\log n)$-time computation by reusing the stored partial results.

Notice that some of the partial query results stored in the higher levels may become invalid. For example, when a new block is added, the oldest one departs the window, so the contents of tables that refer to the departing block become invalid. We solve this problem by choosing the levels to use such that we only consider valid tables. Let $b_i$ and $b_j$ be the block numbers of the first and second interval indices, respectively. We scan backward from $b_i$ to $b_j$, greedily using the highest possible level at each point (Line 17); this minimizes the number of needed steps. If all tables along the way are valid, we need only $O(\log n)$ value look-ups. Otherwise, we choose valid tables between blocks $b_i$ and $b_j$, requiring $O(\log n)$ value look-ups, and then another $O(\log n)$ look-ups for querying the remaining interval. Overall, our computation takes at most $O(\log n)$ steps.
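A self-contained sketch of HIT's dyadic table structure follows; window expiry and the leaving-table corner case are omitted for brevity.

```python
from collections import Counter

def trailing_zeros(i: int) -> int:
    return (i & -i).bit_length() - 1

class HIT:
    def __init__(self):
        self.t = 0                # number of completed blocks so far
        self.tables = {}          # (level, end_block) -> Counter
        self.current = Counter()  # incremental table for the open block

    def add(self, x) -> None:
        self.current[x] += 1      # the ADD operation updates one table

    def end_block(self) -> None:
        self.t += 1
        self.tables[(0, self.t)] = self.current
        self.current = Counter()
        # A level-l table ending at block t merges the two level-(l-1)
        # tables ending at t - 2**(l-1) and t; it exists iff 2**l divides t.
        for level in range(1, trailing_zeros(self.t) + 1):
            left = self.tables[(level - 1, self.t - 2 ** (level - 1))]
            right = self.tables[(level - 1, self.t)]
            self.tables[(level, self.t)] = left + right

    def query_blocks(self, x, lo: int, hi: int) -> int:
        # Exact arrivals of x in completed blocks lo..hi, scanning backward
        # and greedily taking the largest table that stays within [lo, hi].
        count, end = 0, hi
        while end >= lo:
            level = trailing_zeros(end)
            while 2 ** level > end - lo + 1:
                level -= 1
            count += self.tables[(level, end)][x]
            end -= 2 ** level
        return count

    def interval_query(self, x, i: int, j: int) -> int:
        # i'th..j'th most recent completed blocks (1 = newest).
        return self.query_blocks(x, self.t - j + 1, self.t - i + 1)
```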

We use an incremental table for the incomplete block, in which we increment an element's entry upon every ADD operation (Line 3). The pseudocode of the algorithm appears in Algorithm 3, and Table 5 contains a list of the used variables.

1:Initialization:
2:function Add()
3:      Update the incomplete block’s tables
4:function EndBlock()
5:     
6:     
7:      Delete all entries.
8:     for  do
9:               
10:function IntervalQuery()
11:      The most recent block’s index
12:      The oldest queried block’s
13:     
14:     
15:     
16:     while   do
17:         
18:         
19:         
20:         
21:         if  then
22:                             
23:     return
Algorithm 3 HIT
(branching factor)  the number of blocks from a level that compose a next-level table.
tables  used for tracking block frequencies; each table is identified by a level and the index of the last block in its range.
incomplete table  a table for the most recent, incomplete, block.
offset  the offset within the current frame.
Table 5: Variables used by the HIT algorithm.

6.2.1 Analysis

We now analyze the HIT algorithm. We start by proving its correctness.

Theorem 5.

Algorithm 3 solves the $n$-Interval problem.

Proof.

We need to prove that, upon an IntervalQuery$(x, i, j)$ query, for any $i \le j \le n$, HIT computes the exact answer without error. We first introduce some notation: $x$ denotes the queried element, and $f_x^{(t)}$ denotes the frequency of item $x$ during the $t$'th block. According to Line 16, we iterate over the queried blocks, each time querying a table that holds the partial result for a range of blocks and then advancing past that range (Lines 19 and 20). The output of the algorithm for querying $x$ over the interval $(i, j)$ is:

(6)

According to the definition of block interval frequency in Section 5.1, we get that (6) is equal to the exact answer. ∎

Theorem 6.

Denote by $N$ the sum of the cardinalities of the last $n$ blocks. Algorithm 3 requires $O(N \log n)$ space, measured in table entries.

Proof.

As described above, each element's appearance may be reflected in $O(\log n)$ tables. Every table entry takes $O(\log |U|)$ bits for the key and another $O(\log W)$ bits for the value, and thus the overall space is $O(N \log n \, (\log |U| + \log W))$ bits. ∎

6.3 Optimizations

This section includes optimizations that can be applied to the ACC and/or HIT algorithms.

Short IDs

Element IDs are often quite long (alternatively, $|U|$ is large); e.g., a 5-tuple flow identifier may take over 100 bits, and Internet URLs can be even longer. Hence, when item IDs are large, we can reduce their required space as follows: For each frame, we maintain an $O(N)$-sized array of the identifiers of items that were added to some block during the frame. Every time a new (distinct) item arrives, we add it to the array. To find the index of each ID in the array, we maintain an additional table that maps IDs to their array indices. Clearly, the combined space requirement of the array and the map table is $O(N \log |U|)$ bits. Finally, we replace the keys in the algorithms' tables (at all levels) so that instead of storing identifiers we use the array indices as keys. Given a query, we first find the array index using the new table and then follow the same procedure as before, but with the index as the key. This optimization can be applied to both ACC and HIT. Since we always store at most $O(N)$ full IDs, the table keys shrink to $O(\log N)$ bits, reducing the space of HIT and $ACC_k$ accordingly.
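A minimal sketch of this dictionary-encoding step:

```python
class IdInterner:
    """Dictionary-encode long element IDs (e.g., 5-tuples or URLs) into
    small array indices; tables then key on the index, not the full ID."""
    def __init__(self):
        self.ids = []          # index -> full identifier
        self.index = {}        # full identifier -> index

    def intern(self, full_id) -> int:
        if full_id not in self.index:
            self.index[full_id] = len(self.ids)
            self.ids.append(full_id)
        return self.index[full_id]
```

For example, `key = interner.intern(("10.0.0.1", 1234, "10.0.0.2", 80, "tcp"))` yields a small integer that all levels' tables can use as their key.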

Deamortization

Algorithm