Sub-O(log n) Out-of-Order Sliding-Window Aggregation

10/26/2018 · by Kanat Tangwongsan et al. (IBM)

Sliding-window aggregation summarizes the most recent information in a data stream. Users specify how that summary is computed, usually as an associative binary operator, because this is the most general known form for which it is possible to avoid naively scanning every window. For strictly in-order arrivals, there are algorithms with O(1) time per window change assuming associative operators. Meanwhile, it is common in practice for streams to have data arriving slightly out of order, for instance, due to clock drifts or communication delays. Unfortunately, for out-of-order streams, one has to resort to latency-prone buffering or pay O(log n) time per insert or evict, where n is the window size. This paper presents the design, analysis, and implementation of FiBA, a novel sliding-window aggregation algorithm with an amortized upper bound of O(log d) time per insert or evict, where d is the distance of the inserted or evicted value to the closer end of the window. This means O(1) time for in-order arrivals and nearly O(1) time for slightly out-of-order arrivals, with a smooth transition towards O(log n) as d approaches n. We also prove a matching lower bound on running time, showing optimality. Our algorithm is as general as the prior state-of-the-art: it requires associativity, but not invertibility nor commutativity. At the heart of the algorithm is a careful combination of finger-searching techniques, lazy rebalancing, and position-aware partial aggregates. We further show how to answer range queries that aggregate subwindows for window sharing. Finally, our experimental evaluation shows that FiBA performs well in practice and supports the theoretical findings.


1. Introduction

Stream processing is now in widespread production use in domains as varied as telecommunication, personalized advertisement, medicine, transportation, and finance. It is generally the paradigm of choice for applications that expect high throughput and low latency. Regardless of domain, nearly every stream processing application involves some form of aggregation or another, with one of the most common being sliding-window aggregation.

Sliding-window aggregation derives a summary statistic over a user-specified amount of recent streaming data. Users also define how that summary statistic is computed, usually in the form of an associative binary operator (Boykin et al., 2014), as that is the most general known form for which computation can be effectively incrementalized to avoid naïvely scanning every window. While some associative aggregation operators, such as sum, are also invertible, many, such as maximum or Bloom filters, are merely associative but not invertible.

Recent algorithmic research on sliding-window aggregation has given much attention to streams with strictly in-order arrivals. The standard interface for sliding-window aggregation supports insert, evict, and query. In the in-order setting, there are algorithms (Shein et al., 2017; Tangwongsan et al., 2017) for associative operators that take only O(1) time per window change, without requiring the operator to be invertible or commutative.

In reality, however, out-of-order streams are the norm (Akidau et al., 2013). Clock drift and disparate latency in computation and communication, for example, can cause values in a stream to arrive in a different order than their timestamps. Processing out-of-order streams is already supported in many stream processing platforms (e.g., (Akidau et al., 2013; Zaharia et al., 2013; Carbone et al., 2015; Akidau et al., 2015)). Still, in terms of performance, users who want the full generality of associative operators have to resort to latency-prone buffering or, alternatively, use an augmented balanced tree, such as a B-tree, at a cost of O(log n) time per insert or evict, where n is the window size. This stands in stark contrast with the in-order setting, especially when the streams are nearly in order. Thus, we ask whether there exists a sub-O(log n) algorithm for out-of-order streams; this paper is our affirmative answer.

This paper introduces the finger B-tree aggregator (FiBA), a novel algorithm that efficiently aggregates sliding windows on out-of-order and in-order streams alike. Each insert or evict takes amortized O(log d) time (see Theorem 3.4 for a more formal statement), where the out-of-order distance d is the distance from the inserted or evicted value to the closer end of the window. This complexity means O(1) time for in-order streams, nearly O(1) for slightly out-of-order streams, and never more than O(log n) even for severely out-of-order streams. The worst-case time for any one particular insert or evict is O(log n), which only happens in the rare case of rebalancing all the way up the tree. FiBA requires O(n) space and takes O(1) time for a whole-window query. Furthermore, it is as general as the prior state-of-the-art, supporting variable-sized windows and requiring only associativity from the operator.

Our solution can be summarized as finger B-trees (Guibas et al., 1977) with position-aware partial aggregates. Starting with the classic B-tree, we first add pointers, or fingers, to the start and end of the tree. These fingers make it possible to perform the search for the value to insert or evict in O(log d) worst-case time. Second, we adapt a specific variant of B-trees where the rebalancing that fixes the size invariants takes amortized O(1) time; specifically, we use B-trees with MAX_ARITY = 2·MIN_ARITY and where rebalancing happens after the fact (Huddleston and Mehlhorn, 1982). Third and most importantly, we develop novel position-aware partial aggregates and a corresponding algorithm that bounds the cost of aggregate repairs by the cost of search plus rebalance.

The running time of FiBA is asymptotically the best possible in general. We prove a lower bound showing that, for insert and evict operations with out-of-order distance up to d, the amortized cost of an operation in the worst case must be at least Ω(log d).

Furthermore, we show how FiBA can support window sharing with query time logarithmic in the subwindow size and in the distances from the largest window's boundaries. Here, the space complexity is O(n), where n is the size of the largest window.

Our experiments confirm the theoretical findings and show that FiBA performs well in practice. For out-of-order streams, it is a substantial improvement over existing algorithms in terms of both latency and throughput. For strictly in-order streams (i.e., FIFO), it demonstrates constant time performance and remains competitive with specialized algorithms for in-order streams.

We hope FiBA will be used to make streaming applications less resource-hungry and more responsive for out-of-order streams.

2. Problem Statement: OoO SWAG

This section states the problem addressed in this paper more formally. Consider a data stream where each value carries a logical time in the form of a timestamp. Throughout, we denote a timestamped value as t:v; for example, 1:a is the value a at logical time 1. The examples in this paper use natural numbers for timestamps, but our algorithms do not depend on any properties of the natural numbers besides being totally ordered. For instance, our algorithms work just as well with date/time representations or with real numbers.

It is intuitive to assume that values in such a stream arrive in nondecreasing order of time (in order). However, due to clock drift and disparate latency in computation and communication, among other factors, values in a stream often arrive in a different order than their timestamps. Such a stream is said to have out-of-order (OoO) arrivals—there exists a later-arriving value that has an earlier logical time than a previously-arrived value.

Our goal in this paper is to maintain the aggregate value of a time-ordered sliding window in the face of out-of-order arrivals. To motivate our formulation below, consider the following example, which maintains the max and the maxcount, i.e., the number of times the max occurs in the sliding window.

Initially, the values arrive in the same order as their associated timestamps, and the window maintains the current max together with its maxcount. When stream values arrive in order, they are simply appended; a newly arriving value, carrying the latest timestamp, is inserted at the end of the window.

However, when values arrive out of order, they must be inserted into the appropriate spots to keep the sliding window time-ordered. For instance, a late arrival is inserted between the two existing timestamps that bracket its own.

As for eviction, stream values are usually removed from a window in order, for instance, by evicting the value with the oldest timestamp from the front.

Notice that, in general, eviction cannot be accomplished by simply inverting the aggregation value. For instance, evicting an occurrence of the current max cannot be done by "subtracting off" that value from the current aggregate. Instead, the algorithm needs to efficiently discover the new max (4 in the running example) and the new maxcount (2).

Monoids. There are other streaming aggregations besides max and maxcount. Monoids capture a large class of commonly used aggregations (Boykin et al., 2014; Tangwongsan et al., 2015). A monoid is a triple (S, ⊕, 1), where ⊕ is a binary associative operator on S and 1 is its identity element. Notice that ⊕ only needs to be associative; it need not be commutative or invertible. For example, to express max and maxcount as a monoid, represent each partial aggregate as a pair (m, c), where m is the max and c is the maxcount; then

(m1, c1) ⊕ (m2, c2) = (m1, c1 + c2)  if m1 = m2,
(m1, c1) ⊕ (m2, c2) = (m1, c1)       if m1 > m2,
(m1, c1) ⊕ (m2, c2) = (m2, c2)       if m1 < m2,

with identity 1 = (−∞, 0).

Since ⊕ is associative, no parentheses are needed for repeated application. When the context is clear, we even omit ⊕, for example, writing qstu for q ⊕ s ⊕ t ⊕ u. This concise notation is borrowed from the mathematicians' convention of omitting explicit multiplication operators.
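To make the monoid concrete, the following is a minimal C++ sketch of the max/maxcount aggregation; the struct and function names are ours for illustration, not taken from the paper's implementation.

#include <climits>

// Partial aggregate for max/maxcount: the pair (m, c).
struct MaxCount {
  int m;    // maximum value seen so far
  long c;   // how many times m occurs
};

// Identity element: loses to any real value, contributes no occurrences.
const MaxCount MC_IDENTITY = {INT_MIN, 0};

// Associative combine, matching the case analysis above.
MaxCount combine(MaxCount a, MaxCount b) {
  if (a.m > b.m) return a;
  if (b.m > a.m) return b;
  return {a.m, a.c + b.c};  // equal maxima: occurrence counts add
}

A single stream value v enters the window as the pair (v, 1); because combine is associative, any parenthesization over a window yields the same result.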

OoO SWAG. This paper is concerned with maintaining an aggregation on a time-ordered sliding window where the aggregation operator can be expressed as a monoid. This can be formulated as an abstract data type (ADT) as follows:

Definition 2.1.

Let ⊕ be a binary operator from a monoid (S, ⊕, 1) with identity 1. The out-of-order sliding-window aggregation (OoO SWAG) ADT maintains a time-ordered sliding window (t_1, v_1), …, (t_n, v_n), where t_1 < ⋯ < t_n, supporting the following operations:


  • insert(t : Time, v : Agg) checks whether t is already in the window, i.e., whether there is an i such that t_i = t. If so, it replaces v_i by v_i ⊕ v. Otherwise, it inserts (t, v) into the window at the appropriate location.

  • evict(t : Time) checks whether t is in the window, i.e., whether there is an i such that t_i = t. If so, it removes (t_i, v_i) from the window. Otherwise, it does nothing.

  • query() : Agg combines the values in the window in time order using the ⊕ operator. In other words, it returns v_1 ⊕ v_2 ⊕ ⋯ ⊕ v_n if the window is non-empty, or the identity 1 if it is empty.
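A direct C++ rendering of this ADT might look as follows; the class and method names are our illustration, not the paper's actual API.

#include <cstdint>

using Time = std::int64_t;

// OoO SWAG over a monoid: combine is associative but need not be
// commutative or invertible.
template <typename Agg>
class OoOSwag {
public:
  virtual ~OoOSwag() = default;
  // If t is already in the window, combine v into that entry;
  // otherwise insert (t, v) at its time-ordered position.
  virtual void insert(Time t, const Agg& v) = 0;
  // Remove the entry with time t if present; otherwise do nothing.
  virtual void evict(Time t) = 0;
  // Return v_1 (+) v_2 (+) ... (+) v_n in time order, or the
  // monoid identity if the window is empty.
  virtual Agg query() const = 0;
};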

Lower Bound. How fast can OoO SWAG operations be supported? For in-order streams, the SWAG operations can be handled in O(1) time per operation (Tangwongsan et al., 2017; Shein et al., 2017). But the problem becomes more difficult when the stream has out-of-order arrivals. We prove in this paper that to handle out-of-order distance up to d, the amortized cost of an OoO SWAG operation in the worst case must be at least Ω(log d).

Theorem 2.2.

Let n and d be given such that n ≥ d ≥ 1. For any OoO SWAG algorithm, there exists a sequence of n operations, each with out-of-order distance at most d, for which the algorithm requires a total of at least Ω(n log d) time.

The proof, which appears in Appendix A, shows this in two steps. First, it establishes a sorting lower bound of Ω(n log d) for permutations on n elements with out-of-order distance at most d. Second, it gives a reduction proving that maintaining the OoO SWAG is no easier than sorting such permutations.

Orthogonal Techniques. OoO SWAG operations are designed to work well with other stream aggregation techniques.

The insert(t, v) operation supports the case where t is already in the window, so it works with pre-aggregation schemes such as window panes (Li et al., 2005), paired windows (Krishnamurthy et al., 2006), cutty windows (Carbone et al., 2016), or Scotty (Traub et al., 2018). For instance, for a 5-hour sliding window that advances in 1-minute increments, the logical times can be rounded to minutes, leading to more cases where t is already in the window.
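As a sketch of this rounding (the one-minute granularity and the helper name are our illustration):

// Round a millisecond timestamp down to whole minutes. Values arriving
// within the same minute then share one logical time, so insert()
// combines them into an existing entry instead of growing the tree.
Time roundToMinute(Time millis) {
  return millis - millis % 60000;  // 60,000 ms per minute
}
// Usage: swag.insert(roundToMinute(arrivalMillis), v);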

The evict(t) operation accommodates the case where t is not the oldest time in the window, so it works with streaming systems that use retractions (Abadi et al., 2005; Akidau et al., 2013, 2015; Barga et al., 2007; Brito et al., 2008; Chandramouli et al., 2010; Li et al., 2008; Zaharia et al., 2013).

Neither insert(t, v) nor evict(t) is limited to values of t that are near either end of the window, so they work in the general case, not just in cases where the out-of-order distance is bounded by buffer sizes or low watermarks.

Query Sharing. As defined above, OoO SWAG does not support query sharing. However, query sharing for different window sizes can be accommodated via a range query:


  • query(t_from : Time, t_to : Time) : Agg aggregates exactly those values from the window whose times fall between t_from and t_to. That is, it returns v_i ⊕ ⋯ ⊕ v_j, where i is the smallest index such that t_from ≤ t_i and j is the largest index such that t_j ≤ t_to. If the subrange contains no values, the operation returns the identity 1.
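In the interface sketch from earlier in this section, this corresponds to one additional method (again our illustration):

// Aggregate exactly the values with t_from <= t_i <= t_to, combined
// in time order; returns the monoid identity if the subrange is empty.
virtual Agg query(Time t_from, Time t_to) const = 0;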

In these terms, the problem statement of this paper is to design and implement efficient OoO SWAG operations, as well as range-query support, for arbitrary monoids (S, ⊕, 1).

3. Finger B-Tree Aggregator (FiBA)

This section introduces our algorithm gradually, giving intuition along the way. It begins by describing a basic algorithm (Section 3.1) that utilizes a B-tree augmented with aggregates. This algorithm takes O(log n) time for each insert or evict operation. Reducing the time complexity below O(log n) requires further observations and ideas, explored intuitively in Section 3.2, with details fleshed out in Section 3.3.

3.1. Basic Algorithm: Augmented B-Tree

One way to implement the OoO SWAG is to start with a classic B-tree, using timestamps as keys, and augment the tree with aggregates. This is a baseline implementation, which will be built upon. Even though any balanced tree can, in fact, be used, we chose the B-tree because it is well-studied and has a customizable fan-out degree, providing opportunities for experimentation.

There are many B-tree variations. The range of permissible arity, i.e., fan-out degree of a node, is controlled by two parameters MIN_ARITY and MAX_ARITY. While MIN_ARITY can be any integer greater than or equal to 2, most B-tree variations require that MAX_ARITY be at least 2·MIN_ARITY − 1. Hence, if a(y)—or simply a when the context is clear—denotes the arity of a node y, then a B-tree obeys the following size invariants:


  • For a non-root node y, MIN_ARITY ≤ a(y); for the root, 2 ≤ a(root).

  • For all nodes y, a(y) ≤ MAX_ARITY.

  • All nodes y have a(y) − 1 timestamps and values.

  • All non-leaf nodes y have a(y) child pointers.

    Figure 1. Classic B-tree augmented with aggregates.

Figure 1 illustrates a B-tree augmented with aggregates. In this example, MIN_ARITY is 2 and MAX_ARITY is 4. Consequently, all nodes have 1–3 timestamps and values, and non-leaf nodes have 2–4 children. Each node in the tree contains an aggregate, an array of timestamps and values, and, optionally, pointers to the children. For instance, the root node contains the aggregate ab..u, its own timestamps and values, and pointers to three children. Because we use timestamps as keys, the entries are time-ordered, both within a node and across nodes, with the timestamps stored in a parent node separating and limiting the times in the subtrees it points to. The tree is always height-balanced, and all leaves are at the same depth.

What aggregate is kept in a node? For each node y, the aggregate y.agg stored at that node obeys the up-aggregation invariant:

y.agg = y.child(0).agg ⊕ y.val(0) ⊕ y.child(1).agg ⊕ ⋯ ⊕ y.val(a−2) ⊕ y.child(a−1).agg,

where for a leaf the child terms are omitted. By a standard inductive argument, y.agg is the aggregation of all the values inside the subtree rooted at y. This means the query() operation can simply return the aggregation value at the root (root.agg).
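In code, repairing the up-aggregation invariant at one node is a local scan over its values and child aggregates, along these lines (a sketch; the node layout and helper names are our assumptions):

#include <vector>

template <typename Agg>
struct Node {
  bool leaf;
  int arity;                  // number of children; values = arity - 1
  Agg agg;                    // cached partial aggregate
  std::vector<Agg> val;       // arity - 1 values, in time order
  std::vector<Node*> child;   // arity children if not a leaf
};

// Recompute y->agg as
//   child[0].agg (+) val[0] (+) child[1].agg (+) ... (+) child[arity-1].agg,
// or simply the time-ordered combination of the values if y is a leaf.
template <typename Agg>
void localRepairAggUp(Node<Agg>* y) {
  Agg acc = y->leaf ? Agg::identity() : y->child[0]->agg;
  for (int i = 0; i + 1 < y->arity; ++i) {
    acc = Agg::combine(acc, y->val[i]);
    if (!y->leaf)
      acc = Agg::combine(acc, y->child[i + 1]->agg);
  }
  y->agg = acc;
}

Note that the combination proceeds strictly in time order, left to right, so the operator never needs to be commutative.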

The operations insert(t, v) and evict(t) first search for the node where t belongs. Second, they locally insert or evict at that node, updating the aggregate stored there. Then, they rebalance the tree starting at that node and going up towards the root as necessary to fix any size-invariant violations, repairing aggregate values along the way. Finally, they repair any remaining aggregate values not repaired during rebalancing, starting above the node where rebalancing topped out and visiting all ancestors up to the root.

Theorem 3.1.

In a classic B-tree augmented with aggregates that stores (t_1, v_1), …, (t_n, v_n), the operation query() returns v_1 ⊕ ⋯ ⊕ v_n.

    Proof.

After each operation, all nodes obey the up-aggregation invariant, and root.agg contains v_1 ⊕ ⋯ ⊕ v_n. ∎

Theorem 3.2.

In a classic B-tree augmented with aggregates, the operation query() costs at most O(1) time, and the operations insert(t, v) and evict(t) take at most O(log n) time.

    Proof.

As is standard, we treat the arity of a node as bounded by a constant. The query operation and the local insert or evict visit only a single node. The search, rebalance, and repair visit at most two nodes per tree level. The work is thus bounded by the tree height, which is O(log n) since the tree is height-balanced (Bayer and McCreight, 1972; Cormen et al., 1990; Huddleston and Mehlhorn, 1982). Hence, the total cost per operation is O(log n). ∎

3.2. Breaking the O(log n) Barrier

The basic algorithm just described supports OoO SWAG operations in O(log n) time using an augmented classic B-tree. To improve upon this time complexity, we now discuss the bottlenecks in the basic algorithm and outline a plan to resolve them.

In the basic algorithm, the insert(t, v) and evict(t) operations involve four steps: (1) search for the node where t belongs; (2) locally insert or evict; (3) rebalance to repair the size invariants; and (4) repair the remaining aggregation invariants. If one treats arity as constant, the local insertion or eviction takes constant time, as does the query() operation. But each of the steps for search, rebalance, and repair takes up to O(log n) time. Hence, these are the bottleneck steps, and they will be improved upon as follows:

1. By maintaining "fingers" to the leftmost and rightmost leaves, we will reduce the search complexity to O(log d), where d is the distance to the closer end of the sliding-window boundary. This means that in the FIFO or near-FIFO case, the search complexity will be constant.

2. By using an appropriate MAX_ARITY and a somewhat lazy strategy for rebalancing, we will make sure that rebalancing takes no more than constant time in the amortized sense. This means that for any operation that affects the tree structure, the cost to restore the proper tree structure amounts to a constant per operation, regardless of out-of-order distance.

3. By introducing position-dependent aggregates, we will ensure that repairs to the aggregate values are made only to nodes along the search path or involved in restructuring. This means that the repairs cost no more than the cost of search and rebalance.

We combine the above ideas into a novel sub-O(log n) algorithm for the OoO SWAG. Below, we describe intuitively how these ideas are implemented, leaving detailed algorithms and proofs to Section 3.3.

Sub-O(log n) Search. In classic B-trees, a search starts at the root and ends at the node being searched for, henceforth called y. Often, y is a leaf, so the search visits O(log n) nodes. However, instead of starting at the root, one can start at the left-most or right-most leaf in the tree. This requires pointers to the left-most and right-most leaves, henceforth called the left and right fingers (Guibas et al., 1977). In addition, we keep a parent pointer at each node. Hence, the search can start at the nearest finger, walk up to the nearest common ancestor of the finger and y, and walk down from there to y. The resulting algorithm runs in O(log d) time, where d is the distance from the nearest end of the window—or more precisely, d is the number of timed values from y to the nearest end of the window.
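A sketch of the finger search in C++-like form (coversTime, containsTime, and childFor are assumed helpers, not the paper's code):

// Search for the node where timestamp t belongs, starting from the
// finger nearer to t rather than from the root.
template <typename Agg>
Node<Agg>* fingerSearch(const Tree<Agg>& tree, Time t) {
  Node<Agg>* n = (t - tree.oldestTime() <= tree.youngestTime() - t)
                     ? tree.leftFinger
                     : tree.rightFinger;
  // Walk up via parent pointers to the nearest ancestor whose
  // subtree's time range contains t.
  while (!n->isRoot() && !n->coversTime(t))
    n = n->parent;
  // Walk down as in an ordinary B-tree search; stop early if t is
  // stored at an inner node.
  while (!n->isLeaf() && !n->containsTime(t))
    n = n->childFor(t);
  return n;
}

The walk up and the walk down each traverse O(log d) nodes, since an ancestor must be only O(log d) levels above the finger before its subtree covers all timed values within distance d of that end of the window.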

Sub-O(log n) Rebalance. Insertions and evictions can cause nodes to overflow or underflow, thus violating the size invariants. There are two popular strategies for addressing this: before or after the fact. The before-the-fact strategy preventively rebalances so that ancestors of the affected node are not at risk of overflow or underflow, keeping the arity a(y) at least one away from the thresholds required by the size invariants (e.g., (Cormen et al., 1990)). The after-the-fact strategy first performs the local insert or evict step, and then repairs any resulting overflow or underflow so that the size invariants hold again by the end of the entire insert or evict operation. We adopt the after-the-fact strategy, which has been shown to take amortized constant time (Huddleston and Mehlhorn, 1982) as long as MAX_ARITY ≥ 2·MIN_ARITY. For simplicity, we use MAX_ARITY = 2·MIN_ARITY. The amortized cost is O(1), as rebalancing rarely goes all the way up the tree; the worst-case cost is O(log n), bounded by the tree height.

    Figure 2. Partial aggregates definitions.

Sub-O(log n) Repair. The basic algorithm stores at each node y the up-aggregate, i.e., the partial aggregate of the subtree under y. This is problematic, because it means that an insertion or eviction at a node y, usually a leaf, affects the partial aggregates stored in all ancestors of y—that is, the entire path up to the root. To circumvent this issue, we need an arrangement of aggregates that can be repaired by traversing to a finger, without always traversing to the root. For this, we make each node store the kind of partial aggregate suitable for its position in the tree. Furthermore, because the root no longer contains the aggregate of the whole tree, we will ensure that query() can be answered by combining the partial aggregates at the left finger, the root, and the right finger.

    To meet these requirements, we define four kinds of partial aggregates in Figure 2. As illustrated in Figure 3, they are used in a B-tree according to the following aggregation invariants:


• Non-spine nodes store the up-aggregate. Such a node is neither a finger nor an ancestor of a finger. This aggregate must be repaired whenever the subtree below it changes. Figure 3(A) shows nodes with up-aggregates in white, light blue, or light green. For example, the center child of the root contains the aggregate hijklmn, comprising its entire subtree.

• The root stores the inner aggregate. This aggregate is only affected by changes to the inner part of the tree, and not by changes below the left-most or right-most child of the root. Figure 3(A) shows the inner parts of the tree in white and the root in gray; the root stores the aggregate ghijklmno.

• Non-root nodes on the left spine store the left aggregate. For a given node y on the left spine, the left aggregate encompasses all values under the left-most child of the root except those under y's own left-most child; at the left finger, it thus covers the entire left subtree of the root. When a change occurs below the left-most child of the root, the only aggregates that need to be repaired are those on a traversal up to the left spine and then down to the left finger. Figure 3(A) shows the left spine in dark blue and the nodes affecting it in light blue. For example, the node in the middle of the left spine contains the aggregate cdef, comprising the left subtree of the root except for the left finger.

• Non-root nodes on the right spine store the right aggregate, symmetric to the left aggregate. When a change occurs below the right-most child of the root, only aggregates on a traversal to the right finger are repaired. Figure 3(A) shows the right spine in dark green and the nodes affecting it in light green. For example, the node in the middle of the right spine contains the aggregate qst, comprising the right subtree of the root except for the right finger.


Step A→B, in-order insert 22:v. Spent 0, refunded 1.

Step B→C, out-of-order insert 18:r. Spent 0, billed 2.

Step C→D, evict 1:a. Spent 0, billed 1.

Step D→E, out-of-order insert 16:p, split. Spent 1, refunded 1.

Step E→F, evict 2:b, merge. Spent 1, billed 0.

    Figure 3. Finger B-tree with aggregates: example.

Step G→H, insert 3:c, split, height increase and split. Spent 2, billed 0.

Step I→J, evict 4:d, merge, move. Spent 2, billed 1.

Step K→L, evict 15:o, merge, merge and height decrease. Spent 2, refunded 2.

    Figure 4. Finger B-tree height increase and split.
    Figure 5. Finger B-tree move.
    Figure 6. Finger B-tree merge and height decrease.

    3.3. Using Finger B-Trees

This section describes an algorithm that implements the OoO SWAG using a finger B-tree augmented with aggregates. It achieves sub-O(log n) time complexity by maintaining the size invariants from Section 3.1 and the aggregation invariants from Section 3.2.

The algorithmic complexity analysis accounts for the cost of split, merge, and move operations by counting coins. Specifically, the analysis counts the number of split, merge, or move steps of an insert or evict operation as spent coins. Coins can be imagined as being stored at tree nodes, so they can be used to pay for split, merge, or move operations later. Throughout this paper, coins are visualized as little golden circles next to tree nodes. Sometimes, coins must be added or removed from the outside to make up the difference between the coins spent and the coins in the tree before and after each step; we refer to these coins as being billed or refunded. The key result of the proof will be that the billed coins never exceed 2 for any insert(t, v) or evict(t); hence, rebalancing has amortized constant time complexity.
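Restated as a standard potential argument (our formulation), let \Phi(T) be the number of coins stored in tree T. Each insert or evict then satisfies

\[ c_{\mathrm{billed}} - c_{\mathrm{refunded}} \;=\; c_{\mathrm{spent}} \;+\; \Phi(T_{\mathrm{after}}) - \Phi(T_{\mathrm{before}}), \]

so if the billed coins never exceed 2 per operation, the total rebalancing work over any sequence of m operations is at most 2m plus the initial potential, i.e., amortized O(1) per operation.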

Figures 3–6 show concrete examples covering all the interesting cases of the algorithm. Each state, for instance (A), shows a tree with aggregates and coins. Each step, for instance A→B, shows an insert or evict, illustrating how it affects the tree, its partial aggregates, and the coins.


• In Figure 3, Step A→B is an in-order insert without rebalance, which only affects the aggregate at a single node, the right finger.

• Step B→C is an out-of-order insert without rebalance, affecting aggregates on a walk to the right finger.

• Step C→D is an in-order evict without rebalance, affecting the aggregate at a single node, the left finger.

• Step D→E is an out-of-order insert into a node with arity MAX_ARITY, causing an overflow; rebalancing splits the node.

• Step E→F is an evict from a node with arity MIN_ARITY, causing the node to underflow; rebalancing merges it with its neighbor.

• In Figure 4, Step G→H is an insert that causes nodes to overflow all the way up to the root, causing a height increase followed by splitting the old root. This affects aggregates on all split nodes and on both spines.

• In Figure 5, Step I→J is an evict that first causes an underflow that is fixed by a merge, and then an underflow at the next level, where the neighbor node is too big to merge. The algorithm repairs the size invariant with a move of a child and a timed value from the neighbor. This step affects aggregates on all nodes affected by rebalancing, plus a walk to the left finger.

• In Figure 6, Step K→L is an evict that causes nodes to underflow all the way up to the root, causing a height decrease that eliminates the old empty root. This affects aggregates on all merged nodes and on both spines.

fun query() : Agg
  if root.isLeaf()
    return root.agg
  return leftFinger.agg ⊕ root.agg ⊕ rightFinger.agg

fun insert(t : Time, v : Agg)
  node ← searchNode(t)
  node.localInsertTimeAndValue(t, v)
  top, hit_left, hit_right ← rebalanceForInsert(node)
  repairAggs(top, hit_left, hit_right)

fun evict(t : Time)
  node ← searchNode(t)
  found, idx ← node.localSearch(t)
  if found
    if node.isLeaf()
      node.localEvictTimeAndValue(t)
      top, hit_left, hit_right ← rebalanceForEvict(node, null)
    else
      top, hit_left, hit_right ← evictInner(node, idx)
    repairAggs(top, hit_left, hit_right)

fun repairAggs(top : Node, hit_left : Bool, hit_right : Bool)
  if top.hasAggUp()
    while not (top.isRoot() or top.leftSpine or top.rightSpine)
      top ← top.parent
      top.localRepairAgg()
  else
    top.localRepairAgg()
  if top.leftSpine or (top.isRoot() and hit_left)
    left ← top
    while not left.isLeaf()
      left ← left.getChild(0)
      left.localRepairAgg()
  if top.rightSpine or (top.isRoot() and hit_right)
    right ← top
    while not right.isLeaf()
      right ← right.getChild(right.arity - 1)
      right.localRepairAgg()

fun rebalanceForInsert(node : Node) : Node × Bool × Bool
  hit_left, hit_right ← node.leftSpine, node.rightSpine
  while node.arity > MAX_ARITY
    if node.isRoot()
      heightIncrease()
      hit_left, hit_right ← true, true
    split(node)
    node ← node.parent
    hit_left ← hit_left or node.leftSpine
    hit_right ← hit_right or node.rightSpine
  return node, hit_left, hit_right

fun rebalanceForEvict(node : Node, toRepair : Node) : Node × Bool × Bool
  hit_left, hit_right ← node.leftSpine, node.rightSpine
  if node ≠ toRepair
    node.localRepairAggIfUp()
  while not node.isRoot() and node.arity < MIN_ARITY
    parent ← node.parent
    nodeIdx, siblingIdx ← pickEvictionSibling(node)
    sibling ← parent.getChild(siblingIdx)
    hit_left ← hit_left or sibling.leftSpine
    hit_right ← hit_right or sibling.rightSpine
    if sibling.arity ≤ MIN_ARITY
      node ← merge(parent, nodeIdx, siblingIdx)
      if parent.isRoot() and parent.arity = 1
        heightDecrease()
      else
        node ← parent
    else
      move(parent, nodeIdx, siblingIdx)
      node ← parent
    if node ≠ toRepair
      node.localRepairAggIfUp()
    hit_left ← hit_left or node.leftSpine
    hit_right ← hit_right or node.rightSpine
  return node, hit_left, hit_right
    Figure 7. Finger B-Tree with aggregates: algorithm.
fun evictInner(node : Node, idx : Int) : Node × Bool × Bool
  left, right ← node.getChild(idx), node.getChild(idx + 1)
  if right.arity > MIN_ARITY
    leaf, t, v ← oldest(right)
  else
    leaf, t, v ← youngest(left)
  leaf.localEvictTimeAndValue(t)
  node.setTimeAndValue(idx, t, v)
  top, hit_left, hit_right ← rebalanceForEvict(leaf, node)
  if top.isDescendant(node)
    while top ≠ node
      top ← top.parent
      hit_left ← hit_left or top.leftSpine
      hit_right ← hit_right or top.rightSpine
  return top, hit_left, hit_right
    Figure 8. Finger B-Tree evict inner: algorithm.
Step M→N, out-of-order evict 9:i. Spent 0, billed 1.
    Figure 9. Finger B-tree evict inner: example.

Figure 7 shows most of the algorithm, excluding only evictInner, which is presented in Figure 8. While rebalancing always works bottom-up, aggregate repair works in the direction of the partial aggregates: either up for up-aggs and the inner aggregate, or down for left and right aggregates. Our algorithm piggybacks the repair of up-aggs onto the local insert or evict and onto rebalancing, and then repairs the remaining aggregates separately. To facilitate the handover from the piggybacked phase to the dedicated phase of aggregate repair, the rebalancing routines return a triple ⟨top, hit_left, hit_right⟩, as in the call to rebalanceForInsert in insert. Node top is where rebalancing topped out; if it has an up-agg, it is the last node whose aggregate has already been repaired. Booleans hit_left and hit_right indicate whether rebalancing affected the left or right spine, determining whether aggregates on the respective spine have to be repaired.

To keep the algorithm readable, we factored out the case of evicting from a non-leaf node into function evictInner in Figure 8. To evict a timed value stored at an inner node, evictInner instead evicts a substitute from a leaf—the oldest value of the subtree to the right of the slot, or the youngest of the subtree to its left—and writes that substitute over the evicted slot. Function evictInner creates an obligation to repair an extra node during rebalancing, handled by the toRepair parameter of rebalanceForEvict. Function evictInner can only be triggered by an out-of-order eviction, because in-order evictions always happen at the left finger, which is a leaf.

    The following theorems state our correctness guarantees and the time complexity; their proofs appear in Appendix B.

Theorem 3.3.

In a finger B-tree with aggregates that contains (t_1, v_1), …, (t_n, v_n), the operation query() returns v_1 ⊕ ⋯ ⊕ v_n.

Theorem 3.4.

In a finger B-tree with aggregates, query() costs at most O(1) time, and insert(t, v) and evict(t) take time O(t_search + t_rebalance + t_repair), where

• t_search is O(log d), with d being the distance to the start or end of the window, whichever is closer;

• t_rebalance is amortized O(1) and worst-case O(log n); and

• t_repair is bounded by the cost of search plus rebalance, and thus amortized O(log d).

    4. Window Sharing

fun query(t_from : Time, t_to : Time) : Agg
  node_from, node_to ← searchNode(t_from), searchNode(t_to)
  node_top ← leastCommonAncestor(node_from, node_to)
  return queryRec(node_top, t_from, t_to)

fun queryRec(node : Node, t_from : Time, t_to : Time) : Agg
  if t_from = -∞ and t_to = +∞ and node.hasAggUp()
    return node.agg
  res ← identity
  if not node.isLeaf()
    t_0 ← node.getTime(0)
    if t_from < t_0
      res ← res ⊕ queryRec(node.getChild(0), t_from,
                           t_0 ≤ t_to ? +∞ : t_to)
  for i ∈ [0, ..., node.arity - 2]
    t_i ← node.getTime(i)
    if t_from ≤ t_i and t_i ≤ t_to
      res ← res ⊕ node.getValue(i)
    if not node.isLeaf() and i < node.arity - 2
      t_next ← node.getTime(i + 1)
      if t_i < t_to and t_from < t_next
        res ← res ⊕ queryRec(node.getChild(i + 1),
                             t_from ≤ t_i ? -∞ : t_from,
                             t_next ≤ t_to ? +∞ : t_to)
  if not node.isLeaf()
    t_last ← node.getTime(node.arity - 2)
    if t_last < t_to
      res ← res ⊕ queryRec(node.getChild(node.arity - 1),
                           t_from ≤ t_last ? -∞ : t_from, t_to)
  return res
    Figure 10. Range query algorithm.

    This section explains how to use a single finger B-tree to efficiently answer aggregations on subwindows of different sizes on the fly. Applications are numerous. One common basic example is a simple anomaly detection workflow that compares two related aggregations: one on a large window representing the normal “stable” behavior and the other on a smaller window representing the most recent behavior. Then, an alert is triggered when the aggregates differ substantially. Whereas in this example, the sizes of the windows are known ahead of query time, in many other applications—e.g., interactive data exploration—queries are ad hoc.

We propose to implement window sharing via range queries, as defined at the end of Section 2. This has many benefits. The window contents need to be saved only once, regardless of how many subwindows are involved; thus, each insert or evict needs to be performed only once, on the largest window. This approach can accommodate an arbitrary number of shared window sizes: many users can register queries over different window sizes. Importantly, queries can be ad hoc and interactive, which would otherwise not be possible to support using multiple fixed instances. Furthermore, the range-query formulation also accommodates the case where the window boundary is not the current time: for instance, it can report results with some time lag dictated by punctuation or low watermarks.

To answer the range query query(t_from, t_to), the algorithm, shown in Figure 10, uses recursion starting from the least-common-ancestor node whose subtree encompasses the queried range. The main technical challenge is to avoid making spurious recursive calls. Because the nodes already store partial aggregates, the algorithm should only recurse into a node's children if the partial aggregates cannot be used directly. Specifically, we aim for the algorithm to invoke at most two chains of recursive calls, one visiting ancestors of node_from and the other visiting ancestors of node_to. The insight for preventing spurious recursive calls is that one needs information about neighboring timestamps in a node's parent to determine whether the node itself is subsumed by the range. We encode whether the neighboring timestamp in the parent is included in the range on the left or on the right by passing −∞ or +∞, respectively.

This strategy alone would be similar to a range query in an interval tree (Cormen et al., 1990), albeit without explicitly storing the ranges. However, our specially designed partial aggregates add another layer of detail: not all nodes store up-agg values. But any nodes that lack them are guaranteed to be on one of the two recursion chains, because if a query involves the spines of the entire window, then those spines coincide with the edges of the intersection between the window and the range.

Theorem 4.1.

In a finger B-tree with aggregates that contains (t_1, v_1), …, (t_n, v_n), the operation query(t_from, t_to) returns the aggregate v_i ⊕ ⋯ ⊕ v_j, where i is the smallest index such that t_from ≤ t_i and j is the largest index such that t_j ≤ t_to.

    Proof.

    By induction. Each recursive call returns the aggregate of the intersection between its subtree and the queried range. ∎

Theorem 4.2.

In a finger B-tree with aggregates that contains (t_1, v_1), …, (t_n, v_n), the operation query(t_from, t_to) takes time O(log d_from + log d_to + log m), where

• i is the smallest index such that t_from ≤ t_i;

• j is the largest index such that t_j ≤ t_to;

• d_from and d_to are the distances of positions i and j to the nearer window boundary; and

• m = j − i + 1 is the size of the subwindow being queried.

    Proof.

Using finger searches, the initial searches for node_from and node_to take O(log d_from + log d_to). Now, the distance from either node_from or node_to to the least-common ancestor (LCA) is at most O(log m). Therefore, locating the LCA takes at most O(log m) time, and so do the subsequent recursive calls in queryRec, which traverse the same paths. ∎

In particular, when a query ends at the current time (i.e., when t_to = t_n), the theorem says that the query takes O(log m) time, where m is the size of the subwindow being queried.
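As a usage sketch (in terms of the hypothetical interface from Section 2), the anomaly-detection example from the beginning of this section becomes two queries against one shared tree; the window sizes and the comparison predicate are illustrative:

// One FiBA instance covers the largest window; subwindows are ranges.
Agg stable = fiba.query(now - FIVE_HOURS, now);    // long-run behavior
Agg recent = fiba.query(now - FIVE_MINUTES, now);  // recent behavior
if (isAnomalous(stable, recent))                   // user-defined test
  raiseAlert();

Because each query ends at the current time, each call costs O(log m) in its own subwindow size m, independent of the size of the large window.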

    5. Results

    Figure 11. Out-of-order distance experiments.

We implemented both OoO SWAG variants in C++: the baseline classic B-tree augmented with aggregates and the finger B-tree aggregator (FiBA). We present experiments with competitive min-arity values: 2, 4, and 8; higher values for min-arity were never competitive in our experiments. Our experiments run outside of any particular streaming framework so we can focus on the aggregation algorithms themselves. Our load generator produces synthetic data items with random integers. The experiments perform repeated rounds of evict, insert, and query to maintain a sliding window that accepts a new data item, evicts an old one, and produces a result each round.

We present results with three aggregation operators and their corresponding monoids, each representing a different category of computational cost. The operator sum performs an integer sum over the window; its computational cost is less than that of tree traversals and manipulations. The operator geomean performs a geometric mean over the window. For numerical stability, this requires a floating-point log on insertion and floating-point additions during data structure operations; it represents a middle ground in computational cost. The most expensive operator, bloom, is a Bloom filter (Bloom, 1970) where the partial aggregations maintain a fixed-size bitset. It represents aggregation operators where the computational cost of performing an aggregation easily dominates the cost of maintaining the SWAG data structure.

    We ran all experiments on a machine with an Intel Xeon E5-2697 at 2.7 GHz running Red Hat Enterprise Linux Server 7.5 with a 3.10.0 kernel. We compiled all experiments with g++ 4.8.5 with optimization level -O3.

    5.1. Varying Distance

We begin by investigating how insert's out-of-order distance affects throughput. The distance-varying experiments, Figure 11, maintain a window of constant size. The x-axis is the out-of-order distance d between the newest timestamp already in the window and the timestamp created by our load generator. Our adversarial load generator pre-populates the window with high timestamps and then spends the measured portion of the experiment producing low timestamps. This regime ensures that, after the pre-population with high timestamps, the out-of-order distance of each subsequent insertion is precisely d.

This experiment confirms the prediction of the theory. The classic B-tree's throughput is mostly unaffected by the change in distance, but the finger B-tree's throughput starts out significantly higher and smoothly degrades, following an O(log d) trend. All variants see an uptick in performance when d = n, that is, when the distance equals the size of the window. This is a degenerate special case: when d = n, the lowest timestamp to evict is always in the left-most node in the tree, so the tree behaves like a last-in, first-out (LIFO) stack, and inserting and evicting requires no tree restructuring.

    The min-arity that yields the best-performing B-tree varies with the aggregation operator. For expensive operators, such as bloom, smaller min-arity trees perform better. The reason is that as the min-arity grows, the number of partial aggregations the algorithm needs to perform inside of a node also increases. When the aggregation cost dominates all others, trees that require fewer total aggregations will perform better. On the flip side, for cheap operators, such as sum, trees that require fewer rebalance and repair operations will perform better.

The step-like throughput curves for the finger B-trees are a function of their min-arity: larger min-arity means longer sections where an increased out-of-order distance still affects only a subtree of the same height. When the throughput suddenly drops, the increase in d meant an increase in the height of the affected subtree, causing more rebalances and updates.

    5.2. Latency

    Figure 12. Latency experiments.

The worst-case latency for both classic and finger B-trees is O(log n), but we expect the finger variants to significantly reduce average latency. The experiments in Figure 12 confirm this expectation. All latency experiments use a fixed window size. The top set of experiments uses in-order data (out-of-order distance 0), and the bottom set uses a large out-of-order distance. (We chose the latter distance because it is among the worst-performing in the throughput experiments.) The experimental setup is the same as for the throughput experiments, and the latency is for an entire round of evict, insert, and query. The y-axis is the number of processor cycles for a round, in log scale. Since we used a 2.7 GHz machine, 10^3 cycles take 370 nanoseconds and 10^6 cycles take 370 microseconds. The blue bars represent the median latency, the shaded blue regions represent the distribution of latencies, and the black bar marks the tail-latency percentile. The range shown is the minimum and maximum latency.

When the out-of-order distance is 0 and the aggregation operator is cheap or only moderately expensive, the worst-case latency observed in practice for the classic and finger B-trees is similar. This is expected, as the time is dominated by tree operations, which are worst-case O(log n). However, the minimum and median latencies are orders of magnitude better for the finger B-trees. This is also expected, since with d = 0, the fingers enable amortized constant-time updates. When the aggregation operator is expensive, the finger B-trees have significantly lower latency, because they have to repair fewer partial aggregates.

With a large out-of-order distance and cheap or moderately expensive operators, the classic and finger B-trees have similar latency. This is expected: as d approaches n, the worst-case latency for finger B-trees approaches O(log n). Again, with expensive operators, the minimum, median, and tail-percentile latencies of the finger B-tree with small min-arity are orders of magnitude lower than those of classic B-trees. There is, however, a curious effect clearly present in the bloom experiments with finger B-trees, but still observable in the others: min-arity 2 has the lowest latency; latency gets significantly worse with min-arity 4, then improves with min-arity 8. Recall that the root is not subject to min-arity—in other words, it may be slimmer. With min-arity 4, depending on the arity of the root, some aggregation repairs walk almost to the root and then back down a spine, while others walk to the root and no further. The former case, which traverses roughly two root-to-leaf paths, is generally more expensive than the latter, which is usually a shorter path. The frequency of the expensive case is a function of the window size, tree arity, and out-of-order distance, and these factors do not interact linearly.

    Figure 13. FIFO experiments.

5.3. FIFO

A special case for FiBA is when d = 0: with in-order data, our finger B-tree aggregator enjoys amortized constant-time performance. Figure 13 compares the B-tree-based SWAGs against the state-of-the-art SWAGs optimized for first-in, first-out, completely in-order data. Two-stacks only works on in-order data and is amortized O(1) with worst-case O(n) (adamax, 2011). The De-Amortized Banker's Aggregator (DABA) also only works on in-order data and is worst-case O(1) (Tangwongsan et al., 2017). The Reactive Aggregator supports out-of-order evict but requires in-order insert, and is amortized O(log n) with worst-case O(n) (Tangwongsan et al., 2015). The x-axis represents increasing window size n.

Two-stacks and DABA perform as seen in prior work: for most window sizes, two-stacks, with its amortized O(1) time bound, has the best throughput. DABA is generally second best, as it does a little more work on each operation to maintain worst-case constant performance.

The finger B-tree variants demonstrate constant performance as the window size increases. The best finger B-tree variants stay close to DABA for sum and geomean, but fall further behind with a more expensive operator like bloom. In general, finger B-trees maintain constant performance on completely in-order data, but the extra work of maintaining a tree means that SWAGs specialized for in-order data consistently outperform them.

Figure 14. Window sharing experiments. The out-of-order distance also varies, as it is set to half of the small window size.

The classic B-trees clearly demonstrate O(log n) behavior as the window size increases. Reactive does demonstrate O(log n) behavior too, but it is only obvious with bloom; for sum and geomean, the fixed costs dominate. Reactive was designed to avoid pointer-based data structures under the premise that the extra memory accesses would harm performance. To our surprise, this is not borne out: on our hardware, the extra computation required to avoid pointers ends up costing more. For bloom, Reactive outperforms all of the B-tree-based SWAGs because it is essentially a min-arity 1, max-arity 2 tree. As seen in other results, for the most expensive aggregation operators, reducing the total number of aggregation operations matters more to performance than data structure updates.

    5.4. Window Sharing

    One of the benefits of finger B-trees is that they can support a range-query interface while maintaining logarithmic performance for queries over that range. A range-query interface enables window sharing: the same window can be used for multiple queries over different ranges. An obvious benefit from window sharing is reduced space usage, but we also wanted to investigate if it could improve runtime performance. As Figure 14 shows, window sharing did not consistently improve runtime performance.

The experiments maintain two queries: a big window of fixed size, and a small window whose size varies, shown on the x-axis. The workload consists of out-of-order data items where the out-of-order distance d is half of the small window size. The _twin experiments maintain two separate trees, one for each window size. The _range experiments maintain a single tree, using a standard query for the big window and a range query for the small window.

Our experiment performs out-of-order insert and in-order evict, so insert costs O(log d) and evict costs amortized O(1). Hence, on average, each round of the _range experiment costs O(log d) for insert, O(1) for evict, O(1) for the query on the big window, and O(log m) for the range query on the small window, where m is the small-window size. On average, each round of the _twin experiment costs twice O(log d) for the inserts, twice O(1) for the evicts, and O(1) for each query on the big and small windows. Since we chose d to be half of the small window size, this works out to a total of O(log m) per round in both the _range and the _twin experiments. There is no fundamental reason why window sharing is slightly more expensive in practice. A more optimized code path might make range queries slightly less expensive, but we would still expect them to remain in the same ballpark.

By picking d as half of the small window size, our experiments demonstrate the case where window sharing is most likely to outperform the twin configuration. Since it did not outperform the twin experiment, we conclude that window sharing is unlikely to have a consistent performance benefit. We could have increased the number of shared windows to the point where maintaining multiple non-shared windows performed worse because of the memory hierarchy, but that is the same benefit as reduced space usage. We conclude that the primary benefits of window sharing in this context are reduced space usage and the ability to construct queries against arbitrarily sized windows on the fly.

    6. Related Work

    This section describes work related to out-of-order sliding window aggregation, sliding-window aggregation with window sharing, and finger trees.

Out-of-Order Stream Processing. Processing out-of-order (OoO) streams is a popular research topic with a variety of approaches, but there are surprisingly few incremental algorithms for OoO stream processing. Truviso (Krishnamurthy et al., 2010) handles stream data sources that are out-of-order with respect to each other but where input values are in-order with respect to the stream they arrive on; the algorithm runs separate stream queries on each source, followed by consolidation. In contrast, with FiBA, each individual stream input value can have its own independent OoO behavior. Chandramouli et al. (Chandramouli et al., 2010) describe how to perform pattern matching on out-of-order streams but do not tackle sliding-window aggregation. Finally, the Reactive Aggregator (Tangwongsan et al., 2015) performs incremental sliding-window aggregation and can handle OoO evict in O(log n) time. In contrast, FiBA can handle both OoO insert and OoO evict, and takes sub-O(log n) time.

One approach to OoO streaming is buffering: hold input stream values in a buffer until it is safe to release them to the rest of the stream query (Srivastava and Widom, 2004). Buffering has the advantage of not requiring incremental operators in the query, since the query only sees in-order data. Unfortunately, buffering increases latency (since values endure non-zero delay) and reduces quality (since bounded buffer sizes lead to outputs computed on incomplete data). One can reduce the delay by optimistically performing computation over transactional memory (Brito et al., 2008) and performing commits in order. Finally, one can tune the trade-off between quality and latency by adaptively adjusting buffer sizes (Ji et al., 2015). In contrast to buffering approaches, FiBA can handle arbitrary lateness without sacrificing quality or incurring significant latency.

    Another approach to OoO streaming is retraction: report outputs quickly but revise them if they are affected by late-arriving inputs. At any point, results are accurate with respect to stream input values that have arrived so far. An early streaming system that embraced this approach was Borealis (Abadi et al., 2005), where stateful operators used stored state for retraction. Spark Streaming also takes this approach: it externalizes state from operators and handles stragglers like failures, invalidating parts of the query (Zaharia et al., 2013). Pure retraction requires OoO algorithms such as OoO sliding window aggregation, but the retraction literature does not show how to do that efficiently, as the naïve approach of recomputing from scratch would be inefficient for large windows. Our paper is complementary, describing an efficient OoO sliding window aggregation algorithm that could be used with systems like Borealis or Spark Streaming.

    Using a low watermark (lwm) is an approach to OoO streaming that combines buffering with retraction. The lwm approach allows OoO values to flow through the query but limits state requirements at individual operators by limiting the OoO distance. CEDR proposed 8 timestamp-like fields to support a spectrum of blocking, buffering, and retraction (Barga et al., 2007). Li et al. (Li et al., 2008) formalized the notion of a lwm based on the related notion of punctuation (Tucker et al., 2003). StreamInsight, which was inspired by CEDR, offered a state-management interface to operator developers that could be used for sliding-window aggregation. Subsequently, MillWheel (Akidau et al., 2013), Flink (Carbone et al., 2015), and Beam (Akidau et al., 2015) also adopted the lwm concept. The lwm provides some guarantees but leaves it to the operator developer to handle OoO values. Our paper describes an efficient algorithm for an OoO aggregation operator, which could be used with systems like the ones listed above.

Sliding Window Aggregation with Sharing. All of the following papers focus on sharing over streams with the same aggregation operator, i.e., the same monoid (S, ⊕, 1). The Scotty algorithm supports sliding-window aggregation over out-of-order streams, while sharing windows with both different sizes and different slice granularities (Traub et al., 2018). For instance, Scotty might share a window of size 60 minutes and granularity 3 minutes with a session window whose gap timeout is set to 5 minutes. When a tuple arrives out of order, older slices may need to be updated, fused, or created. Scotty relies upon an aggregate store (e.g., based on a balanced tree) to maintain slice aggregates. One caveat is that the aggregation operator ⊕ must be commutative; otherwise, one needs to keep around the tuples from which a slice is pre-aggregated. Our FiBA algorithm does not make any commutativity assumption. For commutative operators, FiBA could serve as a more efficient aggregate store for Scotty, thus combining the benefits of Scotty's stream slicing with asymptotically faster final aggregation.

Other prior work on window sharing requires in-order streams. The B-Int algorithm uses base intervals, which can be viewed as a tree structure over ordered data, and supports sharing of windows with different sizes (Arasu and Widom, 2004). Krishnamurthy et al. (Krishnamurthy et al., 2006) show how to share windows that differ not just in size but also in granularity. Cutty windows are a more efficient approach to sharing windows with different sizes and granularities (Carbone et al., 2016), and that paper explains how to extend the Reactive Aggregator (Tangwongsan et al., 2015) for sharing. The FlatFIT algorithm performs sliding-window aggregation in amortized constant time and supports window sharing, addressing different granularities with the same technique as cutty windows (Shein et al., 2017). Finally, the SlickDeque algorithm focuses on the special case where ⊕ always returns one of its two operands, and offers window sharing with O(1) time complexity assuming friendly input data distributions (Shein et al., 2018). In contrast to the above work, FiBA combines window sharing with out-of-order processing, directly supporting sliding-window aggregation over windows of different sizes.

Finger Trees. Our FiBA algorithm uses techniques from the literature on finger trees, combining and extending them to work with sliding-window aggregation. Guibas et al. (Guibas et al., 1977) introduced finger trees in 1977. A finger can be viewed as a pointer to some position in a tree that makes tree operations (usually search, insert, or evict) near that position less expensive. Guibas et al. used fingers on B-trees, but without aggregation. Huddleston and Mehlhorn (Huddleston and Mehlhorn, 1982) offer a proof that the amortized cost of insertion or eviction at distance d from a finger is O(log d). Our proof is inspired by Huddleston and Mehlhorn, but simplified and addressing a different data organization: we support values stored at interior nodes, whereas Huddleston and Mehlhorn's trees store values only in leaves. Kaplan and Tarjan (Kaplan and Tarjan, 1996) present a purely functional variant of finger trees. The hands data structure is an implementation of fingers that is external to the tree, thus saving space, e.g., for parent pointers (Blelloch et al., 2003). We did not adopt this technique, because in a B-tree, nodes are wider, so there are fewer nodes and consequently fewer parent pointers in total. Finally, Hinze and Paterson (Hinze and Paterson, 2006) present purely functional finger trees with amortized O(1) time complexity at distance 1 from a finger. They describe caching a monoid-based measure at tree nodes, but this cannot be directly used for sliding-window aggregation. Our paper is the first to use finger trees for fast out-of-order sliding-window aggregation. The main novelty is to use and maintain position-aware partial aggregates.

    7. Conclusion

    FiBA is a novel algorithm for sliding-window aggregation over out-of-order streams. The algorithm is based on finger B-trees with position-aware partial aggregates. It works with any associative aggregation operator, places no restriction on out-of-order behavior, and also supports window sharing. This paper includes proofs of correctness and of the algorithmic complexity bounds of our new algorithm. The proofs demonstrate that FiBA strictly outperforms the prior state-of-the-art in theory and that it matches the lower bound on algorithmic complexity for this problem. In addition, experimental results demonstrate that FiBA yields excellent throughput and latency in practice. Whereas in the past, streaming applications that required out-of-order sliding-window aggregation had to make undesirable trade-offs to reach their performance requirements, our new algorithm enables them to work out-of-the-box in a broad range of circumstances.

    References

    • Abadi et al. (2005) Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur  Cetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag S. Maskey, Alexander Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stan Zdonik. 2005. The Design of the Borealis Stream Processing Engine. In Conference on Innovative Data Systems Research (CIDR). 277–289.
    • adamax (2011) adamax. 2011. Re: Implement a queue in which push_rear(), pop_front() and get_min() are all constant time operations. http://stackoverflow.com/questions/4802038/. Retrieved Oct., 2018.
    • Akidau et al. (2013) Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. MillWheel: Fault-Tolerant Stream Processing at Internet Scale. In Very Large Data Bases (VLDB) Industrial Track. 734–746.
    • Akidau et al. (2015) Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernandez-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. In Conference on Very Large Data Bases (VLDB). 1792–1803.
    • Arasu and Widom (2004) Arvind Arasu and Jennifer Widom. 2004. Resource sharing in continuous sliding window aggregates. In Conference on Very Large Data Bases (VLDB). 336–347.
    • Barga et al. (2007) Roger S. Barga, Jonathan Goldstein, Mohamed Ali, and Mingsheng Hong. 2007. Consistent Streaming Through Time: A Vision for Event Stream Processing. In Conference on Innovative Data Systems Research (CIDR). 363–373.
    • Bayer and McCreight (1972) Rudolf Bayer and Edward M. McCreight. 1972. Organization and Maintenance of Large Ordered Indices. Acta Informatica 1 (1972), 173–189.
    • Blelloch et al. (2003) Guy E. Blelloch, Bruce M. Maggs, and Shan Leung Maverick Woo. 2003. Space-efficient Finger Search on Degree-balanced Search Trees. In Symposium on Discrete Algorithms (SODA). 374–383.
    • Bloom (1970) Burton H. Bloom. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM (CACM) 13, 7 (1970), 422–426.
    • Boykin et al. (2014) Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin. 2014. Summingbird: A Framework for Integrating Batch and Online MapReduce Computations. In Conference on Very Large Data Bases (VLDB). 1441–1451.
    • Brito et al. (2008) Andrey Brito, Christof Fetzer, Heiko Sturzrehm, and Pascal Felber. 2008. Speculative out-of-order event processing with software transaction memory. In Conference on Distributed Event-Based Systems (DEBS). 265–275.
    • Carbone et al. (2015) Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache Flink: Stream and Batch Processing in a Single Engine. IEEE Data Engineering Bulletin 38, 4 (2015), 28–38.
    • Carbone et al. (2016) Paris Carbone, Jonas Traub, Asterios Katsifodimos, Seif Haridi, and Volker Markl. 2016. Cutty: Aggregate Sharing for User-Defined Windows. In Conference on Information and Knowledge Management (CIKM). 1201–1210.
    • Chandramouli et al. (2010) Badrish Chandramouli, Jonathan Goldstein, and David Maier. 2010. High-Performance Dynamic Pattern Matching over Disordered Streams. In Conference on Very Large Data Bases (VLDB). 220–231.
    • Cormen et al. (1990) Thomas Cormen, Charles Leiserson, and Ronald Rivest. 1990. Introduction to Algorithms. MIT Press.
    • Guibas et al. (1977) Leo J. Guibas, Edward M. McCreight, Michael F. Plass, and Janet R. Roberts. 1977. A New Representation for Linear Lists. In Symposium on the Theory of Computing (STOC). 49–60.
    • Hinze and Paterson (2006) Ralf Hinze and Ross Paterson. 2006. Finger Trees: A Simple General-purpose Data Structure. Journal of Functional Programming (JFP) 16, 2 (2006), 197–217.
    • Huddleston and Mehlhorn (1982) Scott Huddleston and Kurt Mehlhorn. 1982. A new data structure for representing sorted lists. Acta Informatica 17, 2 (1982), 157–184.
    • Ji et al. (2015) Yuanzhen Ji, Hongjin Zhou, Zbigniew Jerzak, Anisoara Nica, Gregor Hackenbroich, and Christof Fetzer. 2015. Quality-driven Processing of Sliding Window Aggregates over Out-of-order Data Streams. In Conference on Distributed Event-Based Systems (DEBS). 68–79.
    • Kaplan and Tarjan (1996) Haim Kaplan and Robert E. Tarjan. 1996. Purely Functional Representations of Catenable Sorted Lists. In Symposium on the Theory of Computing (STOC). 202–211.
    • Krishnamurthy et al. (2010) Sailesh Krishnamurthy, Michael J. Franklin, Jeffrey Davis, Daniel Farina, Pasha Golovko, Alan Li, and Neil Thombre. 2010. Continuous Analytics over Discontinuous Streams. In International Conference on Management of Data (SIGMOD). 1081–1092.
    • Krishnamurthy et al. (2006) Sailesh Krishnamurthy, Chung Wu, and Michael Franklin. 2006. On-the-fly sharing for streamed aggregation. In International Conference on Management of Data (SIGMOD). 623–634.
    • Li et al. (2005) Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. 2005. No Pane, No Gain: Efficient Evaluation of Sliding-window Aggregates over Data Streams. ACM SIGMOD Record 34, 1 (2005), 39–44.
    • Li et al. (2008) Jin Li, Kristin Tufte, Vladislav Shkapenyuk, Vassilis Papadimos, Theodore Johnson, and David Maier. 2008. Out-of-order Processing: A New Architecture for High-performance Stream Systems. In Conference on Very Large Data Bases (VLDB). 274–288.
    • Shein et al. (2017) Anatoli U. Shein, Panos K. Chrysanthis, and Alexandros Labrinidis. 2017. FlatFIT: Accelerated Incremental Sliding-Window Aggregation for Real-Time Analytics. In Conference on Scientific and Statistical Database Management (SSDBM). 5.1–5.12.
    • Shein et al. (2018) Anatoli U. Shein, Panos K. Chrysanthis, and Alexandros Labrinidis. 2018. SlickDeque: High Throughput and Low Latency Incremental Sliding-Window Aggregation. In Conference on Extending Database Technology (EDBT). 397–408.
    • Srivastava and Widom (2004) Utkarsh Srivastava and Jennifer Widom. 2004. Flexible time management in data stream systems. In Symposium on Principles of Database Systems (PODS). 263–274.
    • Tangwongsan et al. (2017) Kanat Tangwongsan, Martin Hirzel, and Scott Schneider. 2017. Low-Latency Sliding-Window Aggregation in Worst-Case Constant Time. In Conference on Distributed Event-Based Systems (DEBS). 66–77.
    • Tangwongsan et al. (2015) Kanat Tangwongsan, Martin Hirzel, Scott Schneider, and Kun-Lung Wu. 2015. General Incremental Sliding-Window Aggregation. In Conference on Very Large Data Bases (VLDB). 702–713.
    • Traub et al. (2018) Jonas Traub, Philipp Grulich, Alejandro Rodriguez Cuellar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, and Volker Markl. 2018. Scotty: Efficient Window Aggregation for Out-of-Order Stream Processing. In Poster at the International Conference on Data Engineering (ICDE-Poster).
    • Tucker et al. (2003) Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. 2003. Exploiting punctuation semantics in continuous data streams. Transactions on Knowledge and Data Engineering (TKDE) 15, 3 (2003), 555–568.
    • Zaharia et al. (2013) Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In Symposium on Operating Systems Principles (SOSP). 423–438.

    Appendix A Running Time Lower Bound

    This appendix proves Theorem 2.2, establishing a lower bound on any OoO SWAG implementation. For a permutation $\pi$ on an ordered set $X$, denote by $\pi_i$, $1 \le i \le |X|$, the $i$-th element of the permutation. Let $d_i$ be the number of elements among $\pi_1, \dots, \pi_{i-1}$ that are greater in value than $\pi_i$—that is, $d_i = |\{j < i : \pi_j > \pi_i\}|$. This measure coincides with our notion of out-of-order distance: if elements with timestamps $\pi_1, \pi_2, \dots, \pi_{|X|}$ are inserted into OoO SWAG in that order, the $i$-th element has out-of-order distance $d_i$.
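
    To make the definition concrete, the following helper (our illustration) computes each element's out-of-order distance directly from the definition.

      // Counts, for each position i, how many earlier elements of the
      // permutation exceed the i-th one (its out-of-order distance).
      #include <cstdio>
      #include <vector>

      std::vector<int> oooDistances(const std::vector<int>& pi) {
        std::vector<int> d(pi.size(), 0);
        for (size_t i = 0; i < pi.size(); ++i)
          for (size_t j = 0; j < i; ++j)
            if (pi[j] > pi[i]) ++d[i];  // pi[i] arrives after a larger value
        return d;
      }

      int main() {
        // Permutation (2, 1, 4, 3): elements 1 and 3 are each late by one.
        for (int d : oooDistances({2, 1, 4, 3})) std::printf("%d ", d);
        std::printf("\n");  // prints: 0 1 0 1
      }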

    For an ordered set $X$ and $d \ge 0$, let $\Pi_d(X)$ denote the set of permutations $\pi$ on $X$ such that $\max_i d_i \le d$—i.e., every element is out of order by at most $d$. We begin the proof by bounding the size of such a permutation set.

    Lemma A.1.

    For an ordered set $X$ and $d \ge 0$,
    \[ |\Pi_d(X)| \;=\; \prod_{i=1}^{|X|} \min(i,\, d+1). \]

    Proof.

    The base case is $X = \emptyset$—the empty permutation, so $|\Pi_d(\emptyset)| = 1$. For non-empty $X$, let $m$ be the smallest element in $X$. Then, every $\pi \in \Pi_d(X)$ can be obtained by inserting $m$ into one of the first $\min(|X|, d+1)$ indices of a suitable $\pi' \in \Pi_d(X \setminus \{m\})$: because $m$ is smaller than every other element, placing it within the first $d+1$ positions gives it out-of-order distance at most $d$ and leaves the distances of all other elements unchanged. In particular, each $\pi'$ gives rise to exactly $\min(|X|, d+1)$ unique permutations in $\Pi_d(X)$. Hence, $|\Pi_d(X)| = \min(|X|, d+1) \cdot |\Pi_d(X \setminus \{m\})|$. This expands to
    \[ |\Pi_d(X)| \;=\; \min(|X|, d+1) \cdot \min(|X|-1, d+1) \cdots \min(1, d+1), \]

    which means $|\Pi_d(X)| = \prod_{i=1}^{|X|} \min(i,\, d+1)$, completing the proof. ∎
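
    As a quick sanity check (our example, not from the paper), instantiating the lemma with $|X| = 5$ and $d = 1$ gives
    \[ |\Pi_1(X)| \;=\; \prod_{i=1}^{5} \min(i,\, 2) \;=\; 1 \cdot 2 \cdot 2 \cdot 2 \cdot 2 \;=\; 16, \]
    i.e., $d!\,(d+1)^{|X|-d} = 1! \cdot 2^4$: after the smallest element is placed, each subsequent insertion has exactly two admissible positions.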

    We will now prove Theorem 2.2 by providing a reduction that sorts any permutation using OoO SWAG.

    Proof of Theorem 2.2.

    Fix an ordered set $X$ with $|X| = n$. Let $A$ be an OoO SWAG implementation instantiated with the operator $x \oplus y = x$. When queried, this aggregation produces the first element—the one with the earliest timestamp—in the sliding window. Now let $\pi$ be any permutation in $\Pi_d(X)$. We will sort $\pi$ using $A$. First, insert elements with timestamps $\pi_1, \dots, \pi_n$ into $A$. By construction, each insertion has out-of-order distance at most $d$. Then, query and evict $n$ times, reminiscent of heap sort: each query returns the element with the smallest remaining timestamp, which the following evict removes. At this point, $\pi$ has been sorted using a total of $3n$ OoO SWAG operations.
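
    Expressed in code, the reduction looks as follows; this is a sketch with a stand-in aggregate store, since any OoO SWAG implementation (such as FiBA) can be plugged in behind the same interface.

      // Sketch of the sorting reduction. With x (+) y = x, query returns
      // the earliest-timestamp element, so querying and evicting n times
      // emits the values in sorted order.
      #include <cstdio>
      #include <set>
      #include <vector>

      // Stand-in for an OoO SWAG keyed by timestamp; a real run would
      // use FiBA here instead of std::multiset.
      struct OoOSwag {
        std::multiset<int> s;                     // ordered by timestamp = value
        void insert(int t) { s.insert(t); }
        int query() const { return *s.begin(); }  // x (+) y = x: first element
        void evict() { s.erase(s.begin()); }      // remove earliest timestamp
      };

      int main() {
        std::vector<int> pi = {2, 1, 4, 3};  // out-of-order distance <= 1
        OoOSwag a;
        for (int t : pi) a.insert(t);        // n inserts
        for (size_t i = 0; i < pi.size(); ++i) {
          std::printf("%d ", a.query());     // n queries ...
          a.evict();                         // ... and n evicts: 3n ops total
        }
        std::printf("\n");                   // prints: 1 2 3 4
      }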

    By a standard information-theoretic argument (see, e.g., (Cormen et al., 1990)), sorting a permutation in $\Pi_d(X)$ requires, in the worst case, $\Omega(\log |\Pi_d(X)|)$ time. There are two cases to consider: If $d \ge n$, we have $|\Pi_d(X)| = n!$, so $\log |\Pi_d(X)| = \Theta(n \log n)$. Otherwise, we have $d < n$ and $|\Pi_d(X)| = d!\,(d+1)^{n-d}$. Using Stirling's approximation, we know $\log(d!) = \Theta(d \log d)$, which is $O(n \log d)$ since $d < n$; hence $\log |\Pi_d(X)| = \Theta(n \log(d+1))$. In either case, $\log |\Pi_d(X)| = \Omega(n \log(1 + \min(d, n)))$, so the $3n$ OoO SWAG operations take $\Omega(n \log(1 + \min(d, n)))$ time in the worst case, i.e., $\Omega(\log(1 + \min(d, n)))$ amortized time per operation. ∎

    Appendix B FiBA Correctness & Complexity

    This appendix proves Theorem 3.3 (FiBA correctness) and Theorem 3.4 (FiBA algorithmic complexity).

    Proof of Theorem 3.3.

    There are two cases. If the root has no children (is a leaf), the inner aggregate stored at the root represents the aggregation of all the values inside the root node. Otherwise, by the aggregation invariants, we have the following observations: (1) the aggregate at the right (left) finger is the aggregation of all values in the subtree that is the rightmost (leftmost) child of the root; and (2) the aggregate at the root, represented by an inner aggregate, is the aggregation of all values in the tree excluding those covered by (1). Therefore, query(), which returns leftFinger.agg ⊕ root.agg ⊕ rightFinger.agg, returns the aggregation of the values in the entire tree, in time order. ∎
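
    In code, the query amounts to at most two applications of ⊕; the following is a minimal sketch with illustrative names, not the FiBA source.

      // Minimal sketch of query(): combine leftFinger.agg, root.agg, and
      // rightFinger.agg in time order; a leaf root already covers the
      // whole window.
      template <typename Val, typename Combine>
      Val query(bool rootIsLeaf, const Val& leftFingerAgg,
                const Val& rootAgg, const Val& rightFingerAgg,
                Combine combine) {
        if (rootIsLeaf) return rootAgg;
        return combine(combine(leftFingerAgg, rootAgg), rightFingerAgg);
      }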

    Proof of Theorem 3.4.

    The query() operation performs at most two ⊕ operations; it clearly runs in O(1) time.

    The search cost is bounded as follows. Let $y_0$ be the node at the finger where searching begins and recursively define $y_{i+1}$ as the parent of $y_i$. This forms a sequence of nodes on the spine on which searching takes place. Recall that MIN_ARITY is a constant. Because the subtree rooted at $y_i$ has $\Omega(\texttt{MIN\_ARITY}^{\,i})$ keys and the key we are searching for is at distance $d$, we know the key belongs in the subtree rooted at some $y_h$, where $h = O(\log d)$. Thus, it takes $O(\log d)$ steps to walk up the spine and at most another $O(\log d)$ steps to locate the spot in the subtree, as all leaves are at the same depth, bounding the search cost by $O(\log d)$. The rebalance cost is given by Lemma C.1 in the following section. Finally, following the aggregation invariants, a partial aggregate is affected only if it is along the search path or involved in rebalancing. Therefore, the number of affected nodes that require repairs is bounded by $O(\log d)$. Treating the arity as bounded by a constant, the cost of repairing each affected node is $O(1)$, so the total cost of an insert or evict is $O(\log d)$ amortized.
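
    A sketch of this spine search follows; it is our illustration, where covers and childCovering are assumed helpers that test which key range contains the sought timestamp.

      #include <vector>

      // Minimal stand-in for the node layout sketched in the related
      // work section (only the fields the search needs).
      template <typename Time>
      struct Node {
        Node* parent = nullptr;
        std::vector<Time> times;      // sorted timestamps in this node
        std::vector<Node*> children;  // empty iff leaf
      };

      // Assumed helpers (hypothetical): does the subtree rooted at y
      // cover timestamp t, and which child subtree covers t.
      template <typename Time> bool covers(const Node<Time>* y, Time t);
      template <typename Time> Node<Time>* childCovering(Node<Time>* y, Time t);

      // Walk up from the finger until the current subtree must contain
      // t, then descend as in an ordinary B-tree. Both phases touch
      // O(log d) nodes when t is at distance d from the finger.
      template <typename Time>
      Node<Time>* searchFromFinger(Node<Time>* finger, Time t) {
        Node<Time>* y = finger;
        while (y->parent != nullptr && !covers(y, t)) y = y->parent;  // up the spine
        while (!y->children.empty()) y = childCovering(y, t);         // down to a leaf
        return y;
      }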