Outlier Detection

What is Outlier Detection?

Outlier detection is the process of identifying observations that differ markedly from, or lie far from, the rest of a data set. It is a popular way to clean a data set, and once outliers have been defined, they can also inform the classification of new observations as anomalies. Outliers fall into two categories: univariate and multivariate. Univariate outliers are found in the distribution of a single feature, whereas multivariate outliers are found in an n-dimensional feature space. Outliers are also described by their surroundings. Those that lie far away from the rest of the data are called "point outliers." "Contextual outliers," by contrast, are found within the data and often appear as noise. Many factors can contribute to the appearance of an outlier; those that are not the product of an error are called "novelties."

How does Outlier Detection work?

Outlier detection works by examining a data set and labeling certain points as outliers. There are several methods for defining outliers; a popular one is z-score analysis. The z-score of a data point is the number of standard deviations by which it lies away from the mean. Particularly when dealing with parametric distributions in a low-dimensional space, a z-score threshold can help filter outliers from a data set.
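
As a rough illustration of z-score filtering, the sketch below flags values whose absolute z-score exceeds a cutoff. The function name zscore_outliers, the sample readings, and the cutoff of 3.0 are assumptions chosen for this example and are not prescribed above. The population standard deviation is used, and the sample contains a single feature, so this covers only the univariate case.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return (value, z_score) pairs whose absolute z-score exceeds the threshold.

    The threshold of 3.0 is a conventional default assumed for this sketch.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so no point deviates from the mean.
        return []
    flagged = []
    for x in values:
        z = (x - mu) / sigma
        if abs(z) > threshold:
            flagged.append((x, z))
    return flagged


# Hypothetical measurements: eleven values near 10 and one far from the rest.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.0, 9.9, 10.1, 25.0]
flagged = zscore_outliers(readings, threshold=3.0)

for value, z in flagged:
    print(f"{value} is {z:.2f} standard deviations from the mean")
```

Here only the value 25.0 passes the cutoff. Lowering the threshold flags more points, so in practice the cutoff should be chosen with the size and distribution of the data in mind.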


Outlier Detection vs. Novelty Detection

In terms of anomaly detection, outlier detection and novelty detection seem very similar. However, the two methods target different kinds of anomalies. In simple terms, outlier detection can be thought of as unsupervised learning: the data being examined already contains the outliers, and the task is to separate them from the rest. Novelty detection is closer to semi-supervised learning: a model is fit to data assumed to be free of anomalies, and new observations are then judged against it. One method of novelty detection is cluster analysis, a technique that does not fit outlier detection as defined here. By definition, outliers are not located near any other populated area of data points. Should a cluster of such points arise, the mean would shift toward it, and the points would no longer be classified as outliers.
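
The contrast can be sketched with the same z-score rule. This is a simplified illustration, not a standard novelty-detection algorithm: the helper names fit and is_anomaly, the sample values, and the cutoff of 3.0 are all assumptions made for the example. The first part fits the mean and standard deviation on a reference sample assumed to be clean and then scores new observations. The second part scores points that are already inside the data, once with a single extreme value and once with a cluster of them.

```python
from statistics import mean, pstdev


def fit(values):
    """Estimate the mean and population standard deviation of a sample."""
    return mean(values), pstdev(values)


def is_anomaly(x, mu, sigma, threshold=3.0):
    """Flag x when it lies more than `threshold` standard deviations from mu."""
    return abs(x - mu) / sigma > threshold


# Hypothetical reference sample, assumed to contain no anomalies.
clean = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.0, 9.9, 10.1]

# Novelty-style use: fit on the clean sample, then judge unseen observations.
mu, sigma = fit(clean)
new_points = [10.2, 25.0]
novelty_flags = [is_anomaly(x, mu, sigma) for x in new_points]

# Outlier-style use: the extreme value is part of the data being fit.
with_single = clean + [25.0]
mu_s, sigma_s = fit(with_single)
single_flag = is_anomaly(25.0, mu_s, sigma_s)

# A cluster of extreme values pulls the mean and inflates the spread,
# so none of the cluster's points exceeds the cutoff any longer.
extremes = [25.0, 25.1, 24.9, 25.0]
with_cluster = clean + extremes
mu_c, sigma_c = fit(with_cluster)
cluster_flags = [is_anomaly(x, mu_c, sigma_c) for x in extremes]

print("novelty flags:", novelty_flags)
print("single extreme flagged:", single_flag)
print("cluster flagged:", cluster_flags)
```

With these values, the lone extreme point is flagged in both settings, while the four clustered points are not flagged once they are part of the fitted data. That matches the observation above that a cluster shifts the mean enough for its members to stop counting as outliers.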