Homophily Outlier Detection in Non-IID Categorical Data

03/21/2021
by Guansong Pang, et al.

Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications, where the outlierness of different entities is interdependent and/or drawn from different probability distributions (non-IID). As a result, these methods may fail to detect important outliers that are too subtle to identify without considering the non-IID nature of the data. The issue is intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and two instances of it that identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines distribution-sensitive outlier factors and incorporates them, together with their interdependence, into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness supports either direct outlier detection or outlying feature selection. The graph representation and mining approach is used here because it captures the rich non-IID characteristics well. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95 complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection by two different existing detectors.
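
The abstract describes the pipeline only at a high level, so the following is a minimal illustrative sketch and not the authors' formulation. It assumes a mode-based frequency deviation as the distribution-sensitive factor, co-occurrence ratios as edge weights between values of different features, and a damped random walk as the propagation process. The names `value_outlierness`, `object_scores`, `damping`, and `iters` are hypothetical.

```python
from collections import Counter
from itertools import permutations


def value_outlierness(rows, damping=0.9, iters=200):
    """Score each (feature, value) node by a biased random walk on a value graph.

    Illustrative assumptions (not taken from the paper):
    - delta(v): how far v's frequency falls below the mode of its feature;
    - edge u -> v weighted by delta(v) * cooc(u, v) / freq(v).
    """
    n_feat = len(rows[0])
    freq = Counter((j, r[j]) for r in rows for j in range(n_feat))
    cooc = Counter()
    for r in rows:
        # Count co-occurrences of values from different features in a row.
        for i, j in permutations(range(n_feat), 2):
            cooc[(i, r[i]), (j, r[j])] += 1
    nodes = list(freq)
    mode = {j: max(c for (f, _), c in freq.items() if f == j)
            for j in range(n_feat)}
    # The +1 keeps delta strictly positive, even for the modal value.
    delta = {v: 1.0 - freq[v] / (mode[v[0]] + 1.0) for v in nodes}

    # Row-normalised transition probabilities, biased toward outlying values.
    trans = {}
    for u in nodes:
        w = {v: delta[v] * cooc[u, v] / freq[v]
             for v in nodes if cooc[u, v] > 0}
        total = sum(w.values())
        trans[u] = {v: x / total for v, x in w.items()} if total > 0 else {}

    # Power iteration with teleportation so the walk converges.
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / len(nodes) for v in nodes}
        for u in nodes:
            if trans[u]:
                for v, p in trans[u].items():
                    nxt[v] += damping * score[u] * p
            else:
                # A node without neighbours spreads its mass uniformly.
                for v in nodes:
                    nxt[v] += damping * score[u] / len(nodes)
        score = nxt
    return score


def object_scores(rows, score):
    """Score an object as the sum of the outlierness of its values."""
    return [sum(score[(j, r[j])] for j in range(len(r))) for r in rows]


# Toy data: the last object carries the rare values "b" and "y".
rows = [("a", "x")] * 8 + [("a", "y"), ("b", "y")]
value_score = value_outlierness(rows)
ranked = object_scores(rows, value_score)
```

In this toy data set the rare values receive most of the walk's mass, so the last object gets the highest score. Under the same assumptions, summing value scores per feature would give a simple relevance ranking for outlying feature selection.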


