Efficient Subspace Search in Data Streams

11/13/2020
by Edouard Fouché, et al.

In the real world, data streams are ubiquitous – think of network traffic or sensor data. Mining patterns, e.g., outliers or clusters, from such data must take place in real time. This is challenging because (1) streams often have high dimensionality, and (2) the data characteristics may change over time. Existing approaches tend to focus on only one of these aspects: either high dimensionality or the specifics of the streaming setting. For static data, a common approach to deal with high dimensionality – known as subspace search – extracts low-dimensional, 'interesting' projections (subspaces), in which patterns are easier to find. In this paper, we address both Challenges (1) and (2) by generalising subspace search to data streams. Our approach, Streaming Greedy Maximum Random Deviation (SGMRD), monitors interesting subspaces in high-dimensional data streams. It leverages novel multivariate dependency estimators and monitoring techniques based on bandit theory. We show that the benefits of SGMRD are twofold: (i) it monitors subspaces efficiently, and (ii) this improves the results of downstream data mining tasks, such as outlier detection. Our experiments, performed on synthetic and real-world data, demonstrate that SGMRD outperforms its competitors by a large margin.

research
11/01/2016

Local Subspace-Based Outlier Detection using Global Neighbourhoods

Outlier detection in high-dimensional data is a challenging yet importan...
research
11/07/2018

Scalable Bottom-up Subspace Clustering using FP-Trees for High Dimensional Data

Subspace clustering aims to find groups of similar objects (clusters) th...
research
09/27/2016

Online Categorical Subspace Learning for Sketching Big Data with Misses

With the scale of data growing every day, reducing the dimensionality (a...
research
04/28/2020

A new effective and efficient measure for outlying aspect mining

Outlying Aspect Mining (OAM) aims to find the subspaces (a.k.a. aspects)...
research
02/21/2019

Continuous Outlier Mining of Streaming Data in Flink

In this work, we focus on distance-based outliers in a metric space, whe...
research
10/14/2020

Adaptive Deep Forest for Online Learning from Drifting Data Streams

Learning from data streams is among the most vital fields of contemporar...
research
10/19/2019

LSTM-Assisted Evolutionary Self-Expressive Subspace Clustering

Massive volumes of high-dimensional data that evolves over time is conti...
