Sparx: Distributed Outlier Detection at Scale

06/02/2022
by Sean Zhang, et al.

There is no shortage of outlier detection (OD) algorithms in the literature, yet a vast body of them is designed for a single machine. As more and more datasets already reside in the cloud, the need for distributed OD techniques grows. This area, however, is not only understudied but also short of public-domain implementations for practical use. This paper aims to fill this gap: we design Sparx, a data-parallel OD algorithm suitable for shared-nothing infrastructures, which we implement in Apache Spark. Through extensive experiments on three real-world datasets, with several billions of points and millions of features, we show that existing open-source solutions fail to scale up, whether to a large number of points or to high dimensionality, whereas Sparx yields scalable and effective performance. To facilitate practical use of OD on modern-scale datasets, we open-source Sparx under the Apache license at https://tinyurl.com/sparx2022.

research
01/06/2019

PyOD: A Python Toolbox for Scalable Outlier Detection

PyOD is an open-source Python toolbox for performing scalable outlier de...
research
10/07/2019

PyODDS: An End-to-End Outlier Detection System

PyODDS is an end-to-end Python system for outlier detection with databas...
research
11/08/2022

OutlierDetection.jl: A modular outlier detection ecosystem for the Julia programming language

OutlierDetection.jl is an open-source ecosystem for outlier detection in...
research
03/12/2020

PyODDS: An End-to-end Outlier Detection System with Automated Machine Learning

Outlier detection is an important task for various data mining applicati...
research
02/21/2019

Continuous Outlier Mining of Streaming Data in Flink

In this work, we focus on distance-based outliers in a metric space, whe...
research
12/01/2022

Biomedical NER for the Enterprise with Distillated BERN2 and the Kazu Framework

In order to assist the drug discovery/development process, pharmaceutica...
research
02/08/2020

SUOD: Toward Scalable Unsupervised Outlier Detection

Outlier detection is a key field of machine learning for identifying abn...
