SWOOP: Top-k Similarity Joins over Set Streams

11/07/2017
by Willi Mann, et al.

We provide efficient support for applications that continuously search for pairs of similar sets in rapid streams of sets. A prototypical example is Twitter: a tweet is a set of words, and Twitter emits about half a billion tweets per day. Our solution efficiently maintains the top-k most similar pairs of tweets from a pair of rapid Twitter streams, e.g., to discover similar trends in two cities when each stream covers one city. In a sliding window model, the top-k result changes as new sets enter the window and existing sets leave it. Maintaining the top-k result under rapid streams is challenging. First, when a set arrives, it may form a new top-k pair with any set already in the window. Second, when a set leaves the window, all its pairs in the top-k result are invalidated and must be replaced. It is therefore not enough to maintain only the k most similar pairs, since less similar pairs may later be promoted to the top-k result. A straightforward solution that pairs every new set with all sets in the window and keeps all pairs for maintaining the top-k result is memory intensive and too slow. We propose SWOOP, a highly scalable stream join algorithm that solves these issues. Novel indexing techniques and sophisticated filters efficiently prune useless pairs as new sets enter the window. SWOOP incrementally maintains a stock of similar pairs from which the top-k result can be updated at any time, and we show that this stock is minimal. Our experiments confirm that SWOOP can handle stream rates that are orders of magnitude higher than those supported by existing approaches.
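
To make the cost of the straightforward solution concrete, the following is a minimal sketch of that brute-force baseline, not of SWOOP itself. Several details are assumptions that the abstract does not state: Jaccard similarity as the similarity measure, a time-based window, and all names (`NaiveTopKJoin`, `insert`, `top_k`, `jaccard`), which are hypothetical.

```python
import heapq
from collections import deque
from itertools import count


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure; the abstract names none)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


class NaiveTopKJoin:
    """Brute-force baseline: stores every cross-stream pair in the window."""

    def __init__(self, k, window):
        self.k = k
        self.window = window      # window length in time units (assumed time-based)
        self.sets = deque()       # (timestamp, set_id, stream, frozenset), arrival order
        self.pairs = {}           # (older_id, newer_id) -> similarity, all live pairs
        self._ids = count()

    def _expire(self, now):
        # Remove sets that fell out of the window and every pair they belong to.
        while self.sets and self.sets[0][0] <= now - self.window:
            _, old_id, _, _ = self.sets.popleft()
            self.pairs = {p: s for p, s in self.pairs.items() if old_id not in p}

    def insert(self, timestamp, stream, tokens):
        """Add a set; timestamps are assumed to be non-decreasing."""
        self._expire(timestamp)
        new_id = next(self._ids)
        new_set = frozenset(tokens)
        # Pair the new set with every set of the *other* stream in the window.
        for _, other_id, other_stream, other_set in self.sets:
            if other_stream != stream:
                self.pairs[(other_id, new_id)] = jaccard(other_set, new_set)
        self.sets.append((timestamp, new_id, stream, new_set))
        return new_id

    def top_k(self):
        """Return the k most similar live pairs as ((id_a, id_b), similarity)."""
        return heapq.nlargest(self.k, self.pairs.items(), key=lambda kv: kv[1])
```

Each insertion compares the new set against the whole window, and the number of stored pairs grows quadratically with the window size, which is the memory and time overhead that the abstract attributes to this approach.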

research
03/15/2021

Smoothness of Schatten Norms and Sliding-Window Matrix Streams

Large matrices are often accessed as a row-order stream. We consider the...
research
05/09/2019

Coresets for Minimum Enclosing Balls over Sliding Windows

Coresets are important tools to generate concise summaries of massive da...
research
07/20/2023

Out-of-Order Sliding-Window Aggregation with Efficient Bulk Evictions and Insertions (Extended Version)

Sliding-window aggregation is a foundational stream processing primitive...
research
09/24/2021

k-Center Clustering with Outliers in the Sliding-Window Model

The k-center problem for a point set P asks for a collection of k congru...
research
03/07/2020

Aion: Better Late than Never in Event-Time Streams

Processing data streams in near real-time is an increasingly important t...
research
05/10/2020

Approaching Optimal Duplicate Detection in a Sliding Window

Duplicate detection is the problem of identifying whether a given item h...
research
11/30/2021

Connected Components for Infinite Graph Streams: Theory and Practice

Motivated by the properties of unending real-world cybersecurity streams...
