The increased connectivity between producers of data, including humans and a broad range of sensors, and consumers of data renders streaming data increasingly prevalent. We consider streams where the elements of the streams are timestamped sets. Examples of such elements include tweets that may be modeled as sets of words, retail point-of-sale transactions that may be modeled as sets of goods, and the clicks in user click-streams on a website. In this setting, we consider continuous top-k joins. Such joins, which limit the results to the k most similar pairs of sets, constitute fundamental functionality. They enable finding the most similar pairs of sets in data streams, where they may also be used in near-duplicate detection and clustering.
As an illustration, assume that we are interested in finding similar trending social media topics in New York City and London. Given two streams of tweets from these cities, we may represent each tweet as a set of words (e.g., after having performed stemming and stop-word removal). We then continuously join the most recent tweet sets from the two streams and maintain the k most similar pairs, thus obtaining an overview of the most similar tweets from the two cities. Further processing may be applied to the top-k result. For example, tag clouds may be created.
A set stream is a sequence of pairs with monotonically increasing timestamps. As is customary in the context of streaming data, we adopt a sliding window model, where only the most recent sets are considered. Thus, newly arriving sets become part of the window, and sets expire as they get older than the window duration. The top-k join result must be kept up-to-date when time passes and such changes occur. Maintaining the join result poses two main challenges: (1) New sets that enter the sliding window may form a similar pair with any of the existing sets in the window. (2) When sets expire, all their pairings are invalidated; expired pairs in the join result must be removed, and replacements must be found to keep the join result correct.
A new set that enters the window may form a similar pair with any of the sets in the sliding window. In rapid streams, the sliding window may contain hundreds of thousands of sets. Computing the similarity between each new set and all sets in the window clearly does not scale to fast stream rates. We are not aware of any previous solutions to this problem. Morales et al. (de2016streaming, ) deal with set streams, but they use a fixed threshold to compute the join and do not consider sliding windows. Xiao et al. (conf/icde/XiaoWLS09, ) compute a top-k join over static collections of sets. A fundamental assumption of this approach, which is leveraged for pruning and index construction, is that all sets are known up front. There is no obvious way to adapt the static top-k join to our dynamic setting with frequent new and expiring sets. Reevaluating the static top-k join each time the sliding window changes does not scale to frequent changes.
As time passes, sets leave the sliding window and expire. When a set expires, all pairs in the top-k result containing the expired set must be removed. The invalidated pairs must be replaced by other pairs. Thus it is not enough to keep only the top-k pairs; rather, a stock of other, less similar but valid pairs must be maintained. The total number of valid pairs is quadratic in the window size. Maintaining all such pairs is not efficient for large sliding windows or rapid streams. The only available solution to this sub-problem, SCase (Shen2014, ), maintains a skyband to avoid unnecessary pairs in the stock. However, the skyband for the stock is recomputed from scratch for every new set that enters the window. The stock may contain pairs that must all be touched to recompute the skyband. As a result, SCase does not scale to large window sizes.
We propose SWOOP for top-k joins over streaming sets. SWOOP uses an inverted list index to efficiently generate candidate pairs when new sets enter the sliding window. Filtering techniques are used to prune candidates, which reduces the number of similarity computations for each new set to a small fraction of the sets in the window. We propose highly efficient update strategies for our inverted index. Similar to SCase (Shen2014, ), SWOOP also maintains a skyband for the stock to keep the join result up-to-date when sets expire. However, in SWOOP, the skyband is maintained incrementally and is never recomputed from scratch. In our experiments, we show that our incremental stock maintenance allows us to process streams at rates that are up to ten times faster than the rates processed by SCase. When combined with our efficient candidate generation techniques, we achieve speed-ups of up to three orders of magnitude compared to an SCase-based approach.
Further, to characterize the similarity functions to which SWOOP is applicable, we define the concept of well-behaved similarity function. All standard set similarity functions are well-behaved, including Overlap, Jaccard, Cosine, Dice, and Hamming (conf/icde/XiaoWLS09, ).
Finally, we report on an extensive experimental study that offers insight into the efficiency of SWOOP compared to SCase and a baseline approach. Most notably, we find that SWOOP scales much better with a growing number of sets in the sliding window.
In summary, we make the following key contributions:
We present SWOOP, a novel algorithm for continuous top-k set similarity joins over streams. Two salient features of SWOOP are (1) the efficient generation of candidates when new sets enter the sliding window and (2) the incremental maintenance of a minimal stock to deal with expiring sets.
We introduce the concept of well-behaved similarity function to accurately characterize the scope of SWOOP.
We present a solution to contend with the absence of so-called token frequency maps in streams; we particularly target difficult streams with very skewed token distributions.
We empirically demonstrate that the proposed algorithms are capable of running up to three orders of magnitude faster than the generic approach SCase.
Section 2 formulates the problem. Section 3 introduces the stream join framework and a baseline solution. Section 4 defines well-behaved similarity functions. Section 5 explains the candidate generation algorithm, including the handling of difficult datasets. Section 6 covers maintenance of the join result. Section 7 presents our experimental study. Section 8 covers related work. Finally, Section 9 concludes the paper.
2. Problem Setting and Definition
A stream is a sequence of two-tuples , where is a set and is a timestamp. The -th tuple in is denoted as . The timestamp is monotonically increasing with the sequence number, i.e., for any two tuples and , . A sliding window of duration over stream contains all sets from that are no older than : , where is the current time, also referred to as the index time. The sets in the sliding window are called valid. We provide an overview of frequently used notation in Table 1.
In general, joins are defined between two different streams and . To simplify the presentation, we discuss the self join scenario, , and extend to non-self joins in Appendix A.
The top-k set similarity join in sliding window returns the k most similar pairs of sets from stream that are valid at the time the query is issued. A range of overlap-based set similarity functions may be used, including Jaccard, Cosine, or Dice (conf/icde/XiaoWLS09, ).
Definition 1 (One-Time Top-k Set Similarity Join).
Given a sliding window over stream and a set similarity function , the one-time top-k set similarity join returns a list of set pairs from , such that (1) each contains only valid sets, (2) is ordered descendingly according to , (3) for all , , (4) for all , , (5) for all pairs of valid sets in not in , . Finally, may contain fewer than k pairs if fewer than k pairs qualify.
In the definition, condition 3 eliminates symmetric pairs so that only one of and is included in .
The above join is a one-time query because it is executed once. We consider the continuous variant of the query that maintains an up-to-date result from when it is started until when it is stopped. As time passes, sets in window leave (expire), and new sets enter . The join result must be kept up-to-date when such events occur. A set that enters window at time can form a new pair with all other sets in , where . A new pair enters the join result if it is sufficiently similar. When a set leaves and thus expires, all pairs that contain become invalid. Invalid pairs must be removed from , and they must be replaced by valid pairs. In general, a pair is valid from time (when the younger set enters the window) until time (when the older set leaves the window). Since we only consider pairs with , the validity interval is always .
Valid pairs always have their start time in the sliding window (time period ) and their end time in the so-called future window (time period ), i.e., their validity interval contains . Invalid pairs have both their start and end time in the sliding window. This is illustrated in Figure 1.
Our goal is to devise an efficient solution for the continuous top-k set similarity join over streams using a sliding window.
3. Join Framework and Baseline
Stream join framework
We introduce our stream join framework, illustrated in Figure 2, and cover a baseline implementation of the framework. The framework comprises:
Index time is the current time in the framework and defines the sliding window. All data structures in the framework must be up-to-date w.r.t. the index time.
Stock maintains the join result at time and additional, valid pairs to deal with expiring sets.
Window stores all tuples of stream covered by the sliding window at time . is required to evaluate the similarity between pairs of sets and to detect expiring sets as the index time increases (i.e., the sliding window is advanced).
The framework supports three operations: (i) retrieves the join result at index time ; (ii) sets the index time to ; (iii) sets the index time to and inserts a new set into the index. Sets must be inserted in the order of their appearance in . The index time can never decrease.
The baseline algorithm implements stock as a binary tree ordered by descending similarity of the pairs, i.e., the top- pairs are ranked first. Window is implemented as a FIFO queue that can be iterated and supports the usual peek/pop/push operations.
We discuss the three operations in the join framework.
(i) topk() retrieves the join result at index time by traversing the first pairs of stock (or pairs if ). No index update is required.
(ii) (Algorithm 1) updates the index time and fetches all sets from window that expire when the sliding window is advanced. The corresponding entries are deleted from and .
(iii) (Algorithm 2) first advances the sliding window to position and updates the affected data structures (line 2) such that only contains valid pairs. Next, (Algorithm 3) is invoked to compute the similarity of each pair ; if , the pair is a candidate and is ranked in stock . After the insert, stores the join result at time .
Figure 2 illustrates for an incoming two-tuple from stream . Steps (1)–(3) reflect the call to , which (1) updates , (2) removes invalid pairs from stock , and (3) removes expired sets from window . Step (4) adds the new pairs generated by to . Step (5) adds set to .
Complexity of baseline. Stock is of size and dominates the memory complexity. The insert operation runs in time since a new set must be paired with every other set in and the pairs must be inserted into binary tree . Function scans the stock in time for expiring sets; removing a set has cost . Finally, runs in optimal time.
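The baseline can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: the class name `BaselineJoin`, the Jaccard default, the candidate condition (similarity greater than zero), and the expiry rule (a set is valid while its timestamp is at least the index time minus the window duration) are all choices made for this sketch. A sorted Python list stands in for the binary tree, so the asymptotic costs differ from those stated above.

```python
from collections import deque


def jaccard(r, s):
    """Jaccard similarity of two non-empty sets."""
    return len(r & s) / len(r | s)


class BaselineJoin:
    """Baseline sketch: FIFO window W plus a stock S holding all valid pairs."""

    def __init__(self, k, window, sim=jaccard):
        self.k, self.window, self.sim = k, window, sim
        self.now = float("-inf")  # index time
        self.W = deque()          # (set_id, tokens, timestamp), oldest first
        self.S = []               # (similarity, older_id, newer_id), best first
        self.next_id = 0

    def increment(self, t):
        """Advance the index time and drop expired sets and their pairs."""
        if t < self.now:
            raise ValueError("index time can never decrease")
        self.now = t
        expired = set()
        while self.W and self.W[0][2] < t - self.window:
            expired.add(self.W.popleft()[0])
        if expired:
            # A pair ends when its older set expires.
            self.S = [p for p in self.S if p[1] not in expired]

    def insert(self, tokens, t):
        """Advance to time t, pair the new set with every valid set, store it."""
        self.increment(t)
        r, rid = frozenset(tokens), self.next_id
        self.next_id += 1
        for sid, s, _ in self.W:
            v = self.sim(r, s)
            if v > 0:  # assumed candidate condition
                self.S.append((v, sid, rid))
        self.S.sort(key=lambda p: (-p[0], p[1], p[2]))
        self.W.append((rid, r, t))
        return rid

    def topk(self):
        """Return the k most similar valid pairs."""
        return self.S[: self.k]
```

Every insert touches all sets in the window and the stock keeps every valid pair, which is exactly the behavior the following sections aim to avoid.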
The inefficiency of the baseline solution arises from the many candidate pairs generated for each incoming set and the quadratic size of the stock, which must be maintained under frequent changes. We address these issues in the following sections. The next section defines the scope of our solution. Section 5 introduces an efficient technique to generate candidates: using an inverted list index on tokens together with an upper and a lower bound, only a small fraction of the sets in window need to be considered. Section 6 proposes an efficient stock implementation, which only stores pairs and is maintained incrementally.
4. Supported Similarity Functions
Our solution works with the most common similarity functions, including Jaccard, Cosine, Dice, Overlap, and Hamming distance, but is not limited to these functions. We introduce the concept of a well-behaved set similarity function to abstract from individual similarity functions and instead identify the essential properties that a similarity function must satisfy to work with our solution.
Definition 4.1 (Well-behaved similarity function).
A similarity function between two sets, , is well-behaved iff there is a function that only depends on the set lengths , , and the overlap , and the following properties hold:
monotonically increases with increasing overlap ( are fixed)
monotonically increases with increasing overlap , i.e. ( is fixed)
there is a function that computes the minimum required overlap such that
Lemma 4.2.
Jaccard, Cosine, Dice, and Overlap similarity, and the Hamming distance are well-behaved set similarity functions.
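To make the definition concrete, the sketch below writes three of these functions in terms of the two set lengths and the overlap only, and derives the minimum required overlap generically by binary search over the overlap, which is valid because the similarity grows monotonically with the overlap. The function names are hypothetical; closed-form expressions for the minimum overlap exist for each function, and the search is used here only to avoid floating-point rounding issues in a short example.

```python
import math


def jaccard(len_r, len_s, overlap):
    return overlap / (len_r + len_s - overlap)


def dice(len_r, len_s, overlap):
    return 2 * overlap / (len_r + len_s)


def cosine(len_r, len_s, overlap):
    return overlap / math.sqrt(len_r * len_s)


def min_overlap(sim, len_r, len_s, threshold):
    """Smallest overlap o with sim(len_r, len_s, o) >= threshold, or None.

    Relies on sim being monotonically increasing in the overlap
    for fixed set lengths.
    """
    lo, hi = 0, min(len_r, len_s)
    if sim(len_r, len_s, hi) < threshold:
        return None  # threshold unreachable for these lengths
    while lo < hi:
        mid = (lo + hi) // 2
        if sim(len_r, len_s, mid) >= threshold:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For example, two sets of length 3 need an overlap of at least 2 to reach a Jaccard similarity of 0.5, while sets of lengths 2 and 10 can never reach it.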
5. Inverted Index for Candidate Generation
We discuss the efficient generation of candidates in SWOOP. Candidates are pairs that must be inserted into the stock. We use an inverted list index to compute candidates. The keys in the index are tokens, and the list entries for a token are all valid sets in which the token appears. When a new set enters the sliding window, the lists of all tokens of are accessed to retrieve candidates, and index is updated. Efficient index updates are discussed in Section 5.1.
A naive use of index offers little improvement over the baseline: only the set pairs with no overlap are avoided, and the use of inverted lists tends to cause more cache misses than the baseline. Instead, we introduce new filters capable of effectively pruning candidate sets that cannot contribute to the join result.
The positional upper bound filter introduced in Section 5.2 is based on the lookup position of a token in with the following reasoning: if a potential candidate is first encountered in the -th list, there must be at least tokens in that do not exist in . The skyband lower bound filter discussed in Section 5.3 is derived from the pairs that are already in the stock. A potential candidate pair is called irrelevant and can be discarded if it is not sufficiently similar to be part of the top-k result at any time in the future. We derive this minimum required similarity by inspecting the stock and taking into account the end time of the candidate pair under consideration.
In Section 5.4 we devise a new candidate generation algorithm that uses our filters and the inverted index. Figure 3 illustrates the algorithm for a newly inserted example set with timestamp . The candidates are computed as follows. (1) A lookup of the tokens of in the inverted list index returns two lists. (2) The lists are scanned from tail to head and produce so-called pre-candidates (shaded in gray) until our filters tell us to stop (cropping). (3) We compute the similarity of each (deduplicated) pre-candidate pair and apply the skyband lower bound to prune irrelevant pairs. The resulting candidates are collected in . A candidate is a pair with its similarity and its end time. (4) Index is updated with the new tokens of set (dashed frame). (5) The stock is updated with the candidates in (dashed frame).
Section 5.5 deals with token orders and discusses the order in which inverted lists in the index should be accessed.
5.1. Updating the Inverted Index
Since only valid sets are indexed, index must be updated frequently. In particular, we must update when old sets expire or new sets enter the sliding window .
We implement the inverted index with doubly-linked lists and keep the sets in the lists ordered increasingly by their expiration time. This allows us to efficiently remove expiring sets from the heads of the lists. The list order comes for free: The timestamps of the new sets cannot decrease; thus, we simply append each new set to the tails of the relevant lists. A set is inserted/deleted in time, independently of the list length. Figure 4 illustrates the index update for an expiring set and a new set .
As a convenient side effect of the list order, we retrieve the candidate pairs in sort order of their expiration time: A lookup of returns all lists with tokens . Let some form a candidate pair with . The expiration time of the candidate pair is , i.e., it depends only on set . Thus, the list order propagates to the candidate pairs.
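The update scheme can be illustrated with a small Python sketch in which `collections.deque` plays the role of the doubly-linked lists. The class and method names are hypothetical. Because sets arrive in timestamp order and expire in the same order, an expiring set is always at the head of each of its lists.

```python
from collections import defaultdict, deque


class InvertedIndex:
    """Token -> deque of (timestamp, set_id), oldest entry at the head."""

    def __init__(self):
        self.lists = defaultdict(deque)

    def add(self, set_id, tokens, timestamp):
        """Append a new set to the tail of each of its token lists."""
        for tok in tokens:
            self.lists[tok].append((timestamp, set_id))

    def expire(self, set_id, tokens):
        """Remove the oldest valid set from the head of each of its lists."""
        for tok in tokens:
            lst = self.lists[tok]
            if lst[0][1] != set_id:
                raise ValueError("only the oldest set can expire")
            lst.popleft()
            if not lst:
                del self.lists[tok]  # drop empty lists
```

Both operations touch one end of each affected list, so their cost per token does not depend on the list length.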
5.2. Positional Upper Bound
We derive an upper bound on the set similarity that will be used to prune candidates during lookups in index .
Theorem 5.1.
Given a well-behaved similarity function , sets and . If at least tokens in do not exist in , then the following upper bound on the similarity between and holds:
We need to show that is maximum if and overlap . W.l.o.g. assume . For the case , the similarity is maximized for the maximum size of (Def. 4.1, claim (4)). For given set lengths and , the similarity is maximum if since in all other cases (Def. 4.1, claim (3)). Thus, the maximum similarity is achieved when . ∎
Consider a lookup of set in the index . The lookup returns an inverted list for each token . Let be the -th token of set that we look up in ; we call the lookup position. A set is new if or for . For the new sets , we know that there are at least tokens in that do not exist in . Based on Theorem 5.1 we derive the following positional upper bound:
For any new set , . This principle has been used before in the context of a specific set similarity function (e.g., Jaccard) (conf/icde/XiaoWLS09, ). Compared to previous work, we provide a formal proof, do not require a global order of tokens, and generalize the bound to the class of well-behaved set similarity functions.
Figure 5 illustrates the upper bound for the Jaccard similarity on a set of length .
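As a hedged illustration of the bound, the sketch below evaluates a similarity function at the configuration that the proof of Theorem 5.1 identifies as optimal, under the assumption that this configuration is a set s contained in r with as many tokens as r has left after removing the missing ones. For Jaccard, this yields (|r| − p)/|r| when at least p tokens of r are missing in s, which can be checked directly: the overlap is at most |r| − p, and for a fixed overlap the Jaccard value is largest when s contains nothing else. The function names and the parameter `missing` are hypothetical.

```python
def jaccard(len_r, len_s, overlap):
    return overlap / (len_r + len_s - overlap)


def positional_upper_bound(sim, len_r, missing):
    """Upper bound on sim(r, s) if at least `missing` tokens of r are not in s.

    Assumes the maximum is attained when s is a subset of r that contains
    all remaining len_r - missing tokens.
    """
    best = len_r - missing
    if best <= 0:
        return 0.0
    return sim(len_r, best, best)
```

For a set of length 10, the Jaccard bound drops from 1.0 with no missing tokens to 0.7 with three missing tokens, so lists looked up later can be cropped more aggressively.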
5.3. Skyband Lower Bound
We define the skyband lower bound that, together with the positional upper bound from the previous section, allows us to stop processing an inverted list early. The skyband lower bound marks the boundary of the so-called skyband, which is formed by the most similar pairs at any time in the future; thereby, only pairs that exist at index time are considered. The skyband is maintained in stock . The red staircase functions in Figure 6 show the skyband lower bound for two example stocks.
The skyband lower bound, , is defined as the similarity of the -th pair at time in stock . The efficient computation of is discussed in Section 6.2. We next introduce the concept of irrelevant pairs, which need not be considered as candidates. Then we show how to detect irrelevant pairs using the lower bound.
A pair is irrelevant if it is not part of the join result at index time and will never become part of . This is the case if for the remaining life time of the pair, , at least more similar pairs exist.
Irrelevant pairs are identified by considering their rank at their end time. The pair is irrelevant if the rank of at its end time exceeds , i.e., at least pairs exist that are better than for the whole remaining life time of . Note that pairs inserted in the future can never increase the rank of .
A pair may (a) be irrelevant before it is inserted into stock (then we can avoid inserting it), or (b) it may become irrelevant due to the insertion of another pair.
Example 5.2.
Consider pair in Figure 6(a) with and end time . For , is relevant since the rank at its end time is . The rank at index time is ; the rank improves to at time when becomes invalid. If we insert pair , becomes irrelevant as illustrated in Figure 6(b): the rank at its end time is now . New pairs cannot improve the rank of pairs that are already in the stock; at best, they leave it unchanged.
Detecting Irrelevant Pairs
We use the skyband lower bound to identify irrelevant pairs. A pair with end time is irrelevant iff its similarity is below the lower bound at end time :
Lemma 1.
The skyband lower bound, , is a non-increasing function in .
All pairs start at or before the index time. The -th pair at the index time has similarity and end time . When a pair ends, a pair with similarity at most is promoted to position in . Thus, the skyband lower bound cannot increase. ∎
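A brute-force version of the skyband lower bound and the irrelevance test helps to make the definitions tangible. In this sketch a stock is a plain list of (similarity, end time) tuples, a pair counts as valid at time t if its end time is at least t, and the bound is taken to be 0.0 when fewer than k pairs are valid; all three conventions are assumptions of the sketch. The efficient computation is the subject of Section 6.2.

```python
def lower_bound(stock, k, t):
    """Similarity of the k-th most similar pair in `stock` valid at time t."""
    sims = sorted((sim for sim, end in stock if end >= t), reverse=True)
    return sims[k - 1] if len(sims) >= k else 0.0


def is_irrelevant(stock, k, sim, end):
    """A pair is irrelevant iff its similarity is below the bound at its end time."""
    return sim < lower_bound(stock, k, end)
```

With the stock [(0.9, 5), (0.8, 3), (0.7, 7), (0.6, 6)] and k = 2, the bound steps down from 0.8 to 0.7 to 0.6 as t grows, in line with Lemma 1, and a new pair with similarity 0.5 and end time 4 is irrelevant.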
5.4. Efficient Candidate Generation
We use the positional upper bound and the skyband lower bound to efficiently prune candidates during the lookup in index , as illustrated in Figure 7. Recall that the positional upper bound, , is constant for an inverted list , where is the -th token that we look up in the index (blue line in the figure). The skyband lower bound, , on the other hand, depends on the time (red line segments).
The similarity of any pair formed with an entry in the inverted list falls on or below the blue line. A pair is relevant iff its end point is on or above the red line. Thus, a pair with a set from list is relevant iff its end point falls into the gray region in Figure 7.
More specifically, we employ the bounds as follows. We process the inverted list from tail to head such that the end times of pairs formed with the sets do not increase (cf. Section 5.1). For each pair, we compute the lower bound at its end time . We stop processing the list when having formed a pair with a lower bound above the upper bound, i.e., . This is correct due to Lemma 1: the lower bounds of all remaining pairs will also exceed the upper bound threshold, i.e., no additional relevant pairs can be formed.
Algorithm 4 generates candidate pairs for a new set using inverted index . The basic structure is as follows (cf. Figure 3): for each token of the new set, , we probe to get a list of set IDs. The list is cropped, i.e., traversed from tail to head in line 4 until the stopping condition based on our upper and lower bounds holds. The list elements are called pre-candidates and are stored with their lower bound in hashmap . In the next step (lines 4–4), we verify the pairs by computing their overlap to get the final set of candidates. Finally, the new set is inserted into the index.
A candidate pair is verified by checking . The overlap computation stops early when cannot be reached. As shown by Mann et al. (mann2016, ) for threshold-based set similarity joins, stopping early has a major impact on the performance.
A pre-candidate may appear in multiple lists. Since the for does not change during a call, in line 4 we look up the bound in and need not recompute it.
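The cropping and verification steps can be sketched as follows. This is a simplified illustration rather than Algorithm 4 itself: it is specialized to Jaccard, it takes the skyband lower bound as a callable `lower_bound`, it assumes that a pair ends at the timestamp of its older set plus the window duration, and it verifies pairs by computing the full overlap instead of stopping early. All names and data layouts are hypothetical.

```python
def jaccard(len_r, len_s, overlap):
    return overlap / (len_r + len_s - overlap)


def generate_candidates(tokens, index, sets, window, lower_bound):
    """Return candidate tuples (similarity, end_time, set_id) for a new set.

    tokens: ordered list of distinct tokens of the new set r
    index: dict token -> list of (timestamp, set_id), oldest entry first
    sets: dict set_id -> frozenset of tokens
    lower_bound: callable mapping an end time to the skyband lower bound
    """
    r, n = frozenset(tokens), len(tokens)
    pre = {}  # set_id -> (lower bound at pair end time, end time)
    for pos, tok in enumerate(tokens):
        # A set first seen at 0-based position pos lacks at least pos tokens of r.
        ub = jaccard(n, n - pos, n - pos)
        for ts, sid in reversed(index.get(tok, [])):  # tail (newest) to head
            if sid in pre:
                lb = pre[sid][0]  # the bound is fixed during one call
            else:
                lb = lower_bound(ts + window)
            if lb > ub:
                break  # older entries end earlier, so their bound is no lower
            pre[sid] = (lb, ts + window)
    candidates = []
    for sid, (lb, end) in pre.items():
        s = sets[sid]
        v = jaccard(n, len(s), len(r & s))
        if v >= lb:  # verification against the skyband lower bound
            candidates.append((v, end, sid))
    return candidates
```

With a constant lower bound of 0.5, a three-token set stops scanning as soon as it reaches its third list, because the positional bound of 1/3 is already below the lower bound.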
5.5. Optimized Token Processing Order
Before we process a new set , we order its tokens. This is required for the merge-like overlap computation. A well-known approach is to order the tokens of each set by increasing token frequency, i.e., rare tokens appear earlier in the sorted sets. This is useful in two ways: First, rare tokens have short lists in the index, which we leverage as discussed below. Second, the stop condition in the merge-like overlap computation improves with the number of mismatches, which are more likely for rare tokens.
Processing rare tokens (i.e., short lists) first when we retrieve candidates for has a substantial impact on the performance. This is due to our upper bound, which improves with the lookup position of a token. A tighter upper bound allows us to skip a longer section of the inverted list. Thus, we want to process long lists as late as possible and use the bound to skip large fractions of the long lists.
Non-streaming set similarity joins count the frequency of each token in a preprocessing step and establish the order up front. This is not possible in our setting since the sets arrive on a stream and are not known up front. Instead, we number each token when it first appears in the stream. Then, a new set is sorted in descending order of the first occurrence of its tokens, i.e., tokens that occur later are sorted lower in sort order. The idea is that frequent tokens are more likely to occur earlier in the stream than infrequent ones.
In our experiments, we show that our ordering heuristic is effective if the token distribution is stable over time, i.e., a token appears with the same probability in each subsection of the stream. Unfortunately, some real world data does not satisfy this assumption. This leads to inefficiencies if we process the tokens in the order of their sort position (as in Algorithm 4, line 4). To deal with skewed token distributions, we process a new set as follows: We first retrieve the inverted lists of all tokens of and heapify the lists such that the shortest list is on top of the heap. We then pop the lists and process them until the heap is empty. This approach substitutes the order in Algorithm 4.
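The heap-based processing order can be sketched with Python's `heapq` module. The function name and the dictionary-based index are assumptions of this sketch; tokens without a list are treated as having an empty list.

```python
import heapq


def lists_shortest_first(tokens, index):
    """Yield (token, inverted list) pairs of a new set, shortest list first.

    index: dict token -> list of entries for that token
    """
    heap = [(len(index.get(tok, [])), tok) for tok in tokens]
    heapq.heapify(heap)  # shortest list on top of the heap
    while heap:
        _, tok = heapq.heappop(heap)
        yield tok, index.get(tok, [])
```

Since the list lengths are read from the current index, the order adapts to the window contents and does not depend on how early a token first appeared in the stream.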
6. Maintaining the Join Result
The stock data structure maintains the join result. This includes ranking the most similar pairs at index time and keeping enough valid replacements for result pairs that leave the sliding window and thus become invalid. We require the following functionality.
: Return the top-k result at index time .
: Increase the index time to and remove expiring pairs.
: Get the skyband lower bound at time , i.e., the similarity of the -th pair at time .
: Insert a collection of candidate pairs that all start at index time .
The operation is trivial: it traverses the first elements of in sort order. The other operations are discussed shortly.
Stock Data Structure
For a pair , the stock stores a quadruple , where is the similarity of the pair and is its end time. We implement as a binary search tree ordered by decreasing similarity (and lexicographically by descending end time, ascending and to break ties).
In addition to search, two rank operations are supported in time (cf. Section 7): (1) given an item , the rank of in the sort order is computed; (2) given rank , the -th item in the sort order is returned. In our algorithms, we use the notation to access the -th item of in sort order.
6.1. Incrementing the Index Time
The operation advances the sliding window and removes expiring pairs from stock . If pairs from the current join result are removed, they must be replaced by other pairs. The baseline algorithm keeps all valid pairs as potential replacements. As we will show, this is not necessary.
We call stock correct if it contains all pairs that may be required in the future to maintain , i.e., all pairs that are relevant at index time (cf. Section 5.3). We call minimal if it is correct and removing any pair makes it incorrect. The stock maintained by the baseline, which is correct but not minimal, is quadratic in the window size . The minimal stock is linear in .
Lemma 2.
The size of a minimal stock is .
The deletion of a set invalidates at most pairs in since ( if fewer than pairs have non-zero similarity). The worst case is illustrated in Figure 8, where pairs end at time and must be replaced by the next pairs in the similarity order. Since only valid sets can expire, no more than replacements are required. ∎
End Time Index
Function removes all pairs with end time smaller than . The naive solution scans , checks the end time of each pair, and removes expired pairs. For expired pairs, the runtime is . This is too slow as the index time is potentially incremented by each new set in the stream.
We introduce the end time index that maintains the same elements as stock , but orders them by ascending end time (ascending similarity, descending , for pair ). Like , is implemented as a binary tree that supports rank operations in logarithmic time. Index is updated whenever is updated, thus .
Our implementation of scans the end time index only while the end time is below . Then the scan stops, and the remaining pairs are not touched. Each scanned pair is removed. The removal of invalid pairs takes time. Since each pair can be removed only once, the worst case is infrequent, and the average complexity is .
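The effect of the end time index on expiry can be sketched as follows. For brevity, a binary min-heap from `heapq` stands in for the end time index and a Python set stands in for the similarity-ordered stock; neither supports the rank operations required elsewhere, so this only illustrates the scan-until-valid idea. The class name `Stock` and its methods are hypothetical, and a pair is assumed to expire when its end time is smaller than the new index time.

```python
import heapq


class Stock:
    """Sketch of expiry with an end time index (heap ordered by end time)."""

    def __init__(self):
        self.by_end = []    # min-heap of (end_time, similarity, pair)
        self.pairs = set()  # stands in for the similarity-ordered tree

    def add(self, sim, end, pair):
        heapq.heappush(self.by_end, (end, sim, pair))
        self.pairs.add((sim, end, pair))

    def increment(self, t):
        """Remove all pairs with end time smaller than t; return their count."""
        removed = 0
        while self.by_end and self.by_end[0][0] < t:
            end, sim, pair = heapq.heappop(self.by_end)
            self.pairs.discard((sim, end, pair))
            removed += 1
        return removed  # pairs ending at t or later are never touched
```

The loop stops at the first pair that is still valid, so an increment that expires nothing inspects a single entry.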
6.2. Efficient Lower Bound Computation
The skyband lower bound (cf. Section 5.3) is the similarity of the -th pair in at some future time . It is used during candidate generation and is evaluated for each entry in the inverted lists until the stopping condition is reached.
A straightforward implementation scans and returns the -th pair that satisfies . This takes time, which is too expensive since the lower bound needs to be computed for each pre-candidate. We exploit the fact that is minimal and use the end time index to retrieve the -th pair at time efficiently. Since the end time index does not store similarities, we need to establish a connection between and .
Lemma 6.1.
Let be a timestamp, the end time of the first pair in that ends at or later, be the rank of in end time index . If stock is minimal, the -th pair in at time is .
By induction on . Pair covers the interval and is the first pair to end; in this interval, the -th pair in is . Assumption: The -th pair in during the interval is . Note that is in the top-k; otherwise could be removed (which is not possible in a minimal stock). Assume unique end times in : The pair defines the next interval. Since is now invalid, the next element in the stock, , is promoted to become the -th pair in . Now assume the general case of entries in with the same end time: is always the position of the first of these entries in . The pair defines the next interval, invalidating the former top- entries to and promoting to rank in . ∎
To compute , we search for the smallest pair (in sort order) with and retrieve its rank . Operation is the similarity of the pair at position in . All these operations (searching in , computing its rank) are logarithmic in .
Example 6.2.
Figure 9 shows six pairs , stock , end time index , and the skyband lower bound for (red line). For the pairs in , we show similarity and end time (e.g., for ); for the pairs in , we only show the end time ( for ). and are ordered by similarity resp. end time. We shift the orders by positions such that is aligned with (gray bars). Note that the pairs in the bars define the steps of the skyband lower bound, e.g., the first bar defines the point , where the first step ends. This is a result of Lemma 6.1 and holds if the stock is minimal. We compute for : at position is the smallest pair in with end time ; the aligned pair has similarity , which is the skyband lower bound at time .
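The alignment between the two orders can be sketched in Python. The sketch assumes the following reading of Lemma 6.1: if p is the 1-based rank in the end time index of the first pair that ends at time t or later, then the k-th pair at time t sits at position p + k − 1 of the stock. Sorted lists and `bisect` replace the rank-supporting trees, the bound is 0.0 if that position does not exist, and the input must be a minimal stock. The function name is hypothetical.

```python
from bisect import bisect_left


def lower_bound_indexed(pairs, k, t):
    """Skyband lower bound at time t for a minimal stock.

    pairs: list of (similarity, end_time) tuples forming a minimal stock
    """
    # In practice both orders are maintained incrementally; sorting here
    # only keeps the sketch self-contained.
    stock = sorted(pairs, key=lambda p: (-p[0], -p[1]))  # similarity order
    end_times = sorted(end for _, end in pairs)          # end time order
    p = bisect_left(end_times, t)  # 0-based rank of first pair ending at >= t
    j = p + k - 1                  # aligned 0-based position in the stock
    return stock[j][0] if j < len(stock) else 0.0
```

For the minimal stock [(0.9, 5), (0.8, 3), (0.7, 7), (0.6, 6)] with k = 2, the function returns 0.8 at t = 3, 0.7 at t = 4, and 0.6 at t = 6, which matches counting the valid pairs directly.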
6.3. Inserting New Pairs
The insert operation adds a set of candidate pairs, , to the stock. The challenge is to keep the stock minimal. New pairs may turn out to be irrelevant (in which case they should not be inserted), or they may render other pairs irrelevant (which then must be removed).
Assume we want to insert pair (dotted) into the stock in Figure 10. To check if is relevant, the rank at its end time must be at most . The rank of is determined by the number of stock elements that do not end before and are at least as similar, i.e., , . There are such pairs (, gray area); thus, is irrelevant (rank at end time). Note that inserting the irrelevant pair disrupts the alignment of and (gray horizontal bars) stated in Lemma 6.1.
Let be the pair to be inserted. First, the relevance of must be checked. This is achieved using a sweep line algorithm that scans in sort order and counts all pairs , , (gray area, Figure 10). If is irrelevant, it is rejected. Otherwise, is inserted, and all pairs , , must be checked since they may have become irrelevant due to the insertion of . For each pair , the plane sweep algorithm must be executed. Thus, the overall runtime is .
We present our efficient insert algorithm in three steps. First, we present a cleanup algorithm that uses end time index to remove all irrelevant pairs from stock in time . An insert algorithm that uses cleanup can add all candidates to the stock without any relevance checks and then remove all irrelevant pairs in one pass. This is a major improvement over the simple algorithm that is quadratic in . Second, we optimize cleanup for the use with insert, where we know the candidate set up front. Third, we present the efficient insert algorithm of SWOOP, which uses a merge approach and inserts pairs only if they are relevant. Intuitively, adding and cleaning the stock are interleaved.
The cleanup algorithm presented next removes all irrelevant pairs from stock for a given . The algorithm uses the end time index and the following property of non-minimal stocks.
Lemma 3.
If is the position of the first irrelevant pair in , , then the position of in exceeds : .
By contradiction. Let be the first irrelevant pair in and assume , . The end time of all irrelevant pairs is . Since , there are pairs that end before . None of these pairs can end at time since we order ties in by ascending similarity, i.e., irrelevant pairs precede relevant pairs. All pairs that end before must be more similar than any , would otherwise render them irrelevant. Further, since is irrelevant, there must be at least additional pairs that are more similar than and are still valid at time . Thus, in total at least pairs exist in that precede . ∎
With Lemma 3 we can clean the stock as follows: We scan and check for each position if the rank of in exceeds : in this case, the pair is irrelevant and is removed. We repeat the procedure from position until all pairs in are processed. Computing the rank of in has complexity . We avoid the logarithmic factor in our cleanup algorithm (Algorithm 5 without gray-shaded parts) as follows: We start with and iterate through the pairs and simultaneously such that . If pair sorts behind in the sort order of then the rank of in is above , and is irrelevant. Thus we avoid computing the exact rank of in . The complexity is for removing irrelevant pairs.
Example 6.3.
We clean the stock in Figure 10, . Initially, and (topmost gray bar). does not sort after ; thus, is relevant. Next step : , , is relevant. For , sorts after ; thus, is irrelevant and is removed. We proceed until is exhausted.
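The aligned scan behind cleanup can be sketched as follows. The sketch rests on a reading of the text that should be treated as an assumption: the pair at position i of the end time index is compared with the pair at position i + k - 1 of the stock (1-based), and it is irrelevant if it sorts strictly behind that stock position. The `Pair` record, the comparators, and the vector-based storage are hypothetical; vectors make each removal linear, whereas the binary search trees used in the paper would not. No two pairs are assumed to share both similarity and end time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Pair {
    int id;
    double sim;
    long end;
};

// Assumed sort order of the stock: descending similarity, ties by later end.
bool before_in_S(const Pair& a, const Pair& b) {
    if (a.sim != b.sim) return a.sim > b.sim;
    if (a.end != b.end) return a.end > b.end;
    return a.id < b.id;
}

// Sort order of the end time index: ascending end time, ties by ascending
// similarity, so that irrelevant pairs precede relevant ones.
bool before_in_E(const Pair& a, const Pair& b) {
    if (a.end != b.end) return a.end < b.end;
    if (a.sim != b.sim) return a.sim < b.sim;
    return a.id < b.id;
}

// Removes all irrelevant pairs for a given k and returns the remaining
// pairs in stock order.
std::vector<Pair> cleanup(std::vector<Pair> pairs, std::size_t k) {
    assert(k >= 1);
    std::vector<Pair> S = pairs;
    std::vector<Pair> E = std::move(pairs);
    std::sort(S.begin(), S.end(), before_in_S);
    std::sort(E.begin(), E.end(), before_in_E);

    std::size_t i = 0;  // 0-based position in E; aligned stock index is i+k-1
    while (i + k - 1 < S.size()) {
        if (before_in_S(S[i + k - 1], E[i])) {
            // E[i] sorts behind its aligned stock pair: it is irrelevant.
            const int id = E[i].id;
            E.erase(E.begin() + static_cast<std::ptrdiff_t>(i));
            S.erase(std::find_if(S.begin(), S.end(),
                                 [id](const Pair& q) { return q.id == id; }));
            // i stays: the next pair in E moves to position i.
        } else {
            ++i;  // E[i] is relevant; advance both aligned positions
        }
    }
    return S;
}
```

Only a comparison with the aligned stock pair is needed per step, so the exact rank of a pair in the stock is never computed. The scan stops once the aligned stock position runs past the end of the stock, since no remaining pair can sort behind a position that does not exist.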
Cleanup can be optimized for insertion by scanning only the regions of that may contain irrelevant pairs. We identify these regions by inspecting the set of inserted pairs, .
Lemma 6.4.
Let stock be minimal, a candidate set of pairs, and the maximum similarity resp. end time of all pairs . After adding to (without removing irrelevant pairs), the following holds for all pairs : if is irrelevant, then and .
Optimized cleanup (Algorithm 5 including gray-shaded parts) uses Lemma 6.4 to scan only those parts of and that might store irrelevant pairs. As an example, consider the stock in Figure 10 and assume that the candidates have been inserted. With and we only need to scan . The algorithm starts the scan at in (since ) and in , and ends after three iterations.
The insert algorithm (cf. Algorithm 6) processes both the stock items and the candidates in sort order of the stock (descending similarity), and a merge-like approach is used to verify candidate pairs before they are inserted. Intuitively, we walk along the skyband boundary (gray boxes in Figure 11). Assume the current vertex of the skyband boundary is . When we insert the candidates that fall between the vertices and , their end times must be above the end time , i.e., the end time of . Irrelevant candidates are never inserted, but the insertion of relevant candidate pairs may render other pairs irrelevant. Since irrelevant pairs can only appear after the current position in , they will be removed as we proceed (as in the cleanup algorithm).
Lines 6–6 deal with the special case . Lines 6–6 (similar to the cleanup algorithm) initialize the positions and : is the rank of the first candidate in sort order of (in a stock ); is aligned such that define a skyband boundary vertex (gray bars in the figure). If the resulting is smaller than , is initialized to and to 1. The end time threshold is also initialized.
In the next step, the algorithm loops over and (lines 6–6). In the inner loop, the relevant candidates that are more similar than are inserted (lines 6–6). Note that a candidate is inserted at position , so becomes , and the loop exits after the first insertion (as ). The relevance of a candidate is determined using the end time threshold as illustrated in Figure 11. The main loop proceeds like the cleanup algorithm (lines 6–6), except that also is updated.
After scanning the whole skyband boundary, there may still be candidates left (lines 6–6). This is the case for candidate pairs that are less similar than the least similar pair in . Some of these pairs may be irrelevant. The end time for this check is the last end time in the skyband boundary, .
The complexity of insert depends on the sizes of and . Inserting or deleting a pair takes . Potentially each candidate pair has to be inserted, and each pair from has to be removed, yielding a worst-case complexity of .
7. Experimental Study
7.1. Experimental Setting
We conduct our experiments on a machine with an 8-core Intel Xeon E5-2630 v3 CPU at 2.4 GHz, 96 GB of RAM, and 20 MB of cache (shared with the other cores), running Debian 8. Our code is written in C++ and compiled with GCC using the -O3 option.
We compare the following algorithms:
We implemented all algorithms ourselves (the source code will be published). We were unable to obtain the original source code of SCase. The algorithms are implemented in C++ using data structures that are available from the STL and Boost (http://www.boost.org/). For the binary search trees and , we use Boost MultiIndex. We define one MultiIndex structure that stores stock and provide two indices on this container.
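The idea of one container with two sort orders can be illustrated with a small STL-only stand-in. It does not use Boost MultiIndex and is not the structure from our implementation: the `Stock` class, the `Pair` record, and the comparators are hypothetical, and each pair is simply stored twice, once per order, whereas a multi-index container stores each element once.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

struct Pair {
    int id;
    double sim;
    long end;
};

// Order of the stock: descending similarity (id breaks ties).
struct BySim {
    bool operator()(const Pair& a, const Pair& b) const {
        if (a.sim != b.sim) return a.sim > b.sim;
        return a.id < b.id;
    }
};

// Order of the end time index: ascending end time, ties by ascending
// similarity (id breaks remaining ties).
struct ByEnd {
    bool operator()(const Pair& a, const Pair& b) const {
        if (a.end != b.end) return a.end < b.end;
        if (a.sim != b.sim) return a.sim < b.sim;
        return a.id < b.id;
    }
};

// Keeps both orders in sync on every insertion and deletion.
class Stock {
public:
    void insert(const Pair& p) {
        by_sim_.insert(p);
        by_end_.insert(p);
    }
    void erase(const Pair& p) {
        by_sim_.erase(p);
        by_end_.erase(p);
    }
    std::size_t size() const { return by_sim_.size(); }
    const std::set<Pair, BySim>& by_similarity() const { return by_sim_; }
    const std::set<Pair, ByEnd>& by_end_time() const { return by_end_; }

private:
    std::set<Pair, BySim> by_sim_;
    std::set<Pair, ByEnd> by_end_;
};
```

Both views are balanced search trees, so insertion and deletion are logarithmic in the stock size. `std::set` offers no positional access, so a rank lookup would need an order-statistic tree or a comparable structure on top of this sketch.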
We use three datasets. Key statistics are offered in Table 3.
TWEET. Geocoded tweets collected from February to April 2017.
DBLP. Articles from DBLP (http://dblp.uni-trier.de/) (DBLP:journals/pvldb/Ley09). A set is a publication, and the tokens correspond to the words in the authors and title fields. The timestamp is the modification date from DBLP’s XML file.
FLICKR. Photo meta-data. A set consists of tokens from the tag or title text describing a photo. The timestamps are assigned randomly between 0 and 10,000 seconds. This dataset was provided by Bouros et al. (boge12).
The average window size is the average number of sets in sliding window , which is controlled by the duration of .
Pre-candidates are the set pairs that must be formed when a new set arrives in the stream. In Base and SCase, a new set will form a pre-candidate with each set in the sliding window. In SWOOP, the number of pre-candidates is the number of processed inverted list items. Candidates are those pre-candidates that are sent to the stock for insertion. Base sends all pre-candidates to the stock. SWOOP and SCase filter the pre-candidates using a lower bound filter.
The set rate is the average number of processed sets per second and measures the performance of an algorithm. We map string tokens to integers as discussed in Section 5.5; this process is identical for all algorithms and is not considered in the set rate. The latency is the time difference between the appearance of a set in the stream and the update of the top- result. It includes candidate generation, stock update, and potential waiting times in a queue.
7.2. Optimized Token Processing Order
|Dataset||# stream items||avg. set size||universe size|