Continuous Outlier Mining of Streaming Data in Flink

02/21/2019
by   Theodoros Toliopoulos, et al.

In this work, we focus on distance-based outliers in a metric space, where whether an entity is an outlier depends on the number of other entities in its neighborhood. In recent years, several solutions have tackled the problem of distance-based outliers in data streams, where outliers must be mined continuously as new elements become available. An interesting research problem is to combine the streaming environment with massively parallel systems to provide scalable stream-based algorithms. However, none of the previously proposed techniques targets a massively parallel setting. Our proposal fills this gap and investigates the challenges in transferring state-of-the-art techniques to Apache Flink, a modern platform for intensive streaming analytics. We thoroughly present the technical challenges encountered and the alternatives that may be applied. We show speed-ups of up to 117 (resp. 2076) times over a naive parallel (resp. non-parallel) solution in Flink, using just an ordinary four-core machine and a real-world dataset. When moving to a three-machine cluster, due to less contention, we achieve both better scalability in terms of the window slide size and the data dimensionality, and even higher speed-ups, e.g., by a factor of 510. Overall, our results demonstrate that outlier mining can be achieved in an efficient and scalable manner. The resulting techniques have been made publicly available as open-source software.
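To make the notion of a distance-based outlier concrete, the following is a minimal, non-parallel sketch and not the authors' Flink implementation. It assumes the common formulation in which a point is an outlier if fewer than `k` other points lie within distance `radius` of it, evaluated over a count-based sliding window. The names `radius`, `k`, `window`, and `slide`, the Euclidean metric, and both function names are illustrative assumptions that do not appear in the abstract.

```python
from collections import deque
from math import dist  # Euclidean distance, Python 3.8+


def distance_outliers(points, radius, k):
    """Return indices of points with fewer than k other points within radius.

    Brute-force O(n^2) check; `radius` and `k` are assumed parameter names.
    """
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points) if j != i and dist(p, q) <= radius
        )
        if neighbors < k:
            outliers.append(i)
    return outliers


def sliding_window_outliers(stream, window, slide, radius, k):
    """Yield (elements seen so far, outlier points) each time the window slides.

    Uses a count-based window of `window` elements that advances by `slide`
    elements; the whole window is re-evaluated on every slide (naive approach).
    """
    buf = deque(maxlen=window)  # oldest elements expire automatically
    for n, point in enumerate(stream, start=1):
        buf.append(point)
        if n >= window and (n - window) % slide == 0:
            current = list(buf)
            idx = distance_outliers(current, radius, k)
            yield n, [current[i] for i in idx]


if __name__ == "__main__":
    stream = [(0, 0), (0, 1), (1, 0), (10, 10), (0.5, 0.5), (10, 11)]
    for seen, found in sliding_window_outliers(stream, 4, 2, 1.5, 2):
        print(seen, found)
```

Re-evaluating every window from scratch, as this sketch does, is the kind of naive baseline that incremental and parallel techniques aim to improve on; note also that a point's status can change between slides as its neighbors arrive or expire.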


research
02/26/2018

Improved MapReduce and Streaming Algorithms for k-Center Clustering (with Outliers)

We present efficient MapReduce and Streaming algorithms for the k-center...
research
10/18/2021

Fast and Exact Outlier Detection in Metric Spaces: A Proximity Graph-based Approach

Distance-based outlier detection is widely adopted in many fields, e.g.,...
research
11/13/2020

Efficient Subspace Search in Data Streams

In the real world, data streams are ubiquitous – think of network traffi...
research
08/07/2018

Parallel and Streaming Algorithms for K-Core Decomposition

The k-core decomposition is a fundamental primitive in many machine lear...
research
06/02/2022

Sparx: Distributed Outlier Detection at Scale

There is no shortage of outlier detection (OD) algorithms in the literat...
research
05/17/2023

Incremental Outlier Detection Modelling Using Streaming Analytics in Finance Health Care

In this paper, we had built the online model which are built incremental...
research
01/14/2019

CFOF: A Concentration Free Measure for Anomaly Detection

We present a novel notion of outlier, called the Concentration Free Outl...
