What is an Outlier?
A statistical outlier is any data point in a dataset that falls beyond a pre-defined distribution range, often representing a measurement error or abnormal data that should not be included. Outliers are commonly defined in terms of their distance from the mean of the dataset’s samples. The unit of measure for this distance is the standard deviation of the dataset, which is a measure of how spread out the data samples are around the mean. Outliers can often be spotted visually on a plotted graph of the data samples.
While standard deviation and probability theory give a rough framework for spotting abnormalities, there is no firm mathematical definition of what constitutes an outlier.
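The distance-from-the-mean idea can be sketched in a few lines of Python. This is only an illustration: the function name `flag_outliers` and the default cutoff of 3 standard deviations are assumptions chosen for the example, not a standard definition, and the population standard deviation is used for simplicity.

```python
from statistics import mean, pstdev

def flag_outliers(samples, threshold=3.0):
    """Return samples lying more than `threshold` standard deviations from the mean.

    `threshold` is a hypothetical, context-dependent cutoff; there is no
    universally correct value.
    """
    mu = mean(samples)
    sigma = pstdev(samples)  # population standard deviation
    if sigma == 0:
        return []  # all samples are identical, so nothing stands out
    return [x for x in samples if abs(x - mu) / sigma > threshold]
```

A stricter or looser cutoff can be passed in, for example `flag_outliers(samples, threshold=2.0)`, which reflects the fact that the choice depends on the dataset and the application.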
Why are Outliers Important in Machine Learning?
Because machine learning training techniques work by analyzing large amounts of data to find some sort of mathematical pattern or relationship, outliers can produce all sorts of “ghosts” (spurious patterns) in a trained model if not weeded out early.
Example of Statistical Outliers in Data Analysis
Consider this one-dimensional dataset of integers: [-15,50,50,52,54,54,55,57,59,59,59,200]. Simply by visual inspection or graphing, one might conclude that there are two potential outliers in this dataset: -15 and 200. The mean of this dataset (including -15 and 200) is ~61.2 and the population standard deviation is ~46.2. So -15 is about 1.6 standard deviations below the mean and 200 is about 3 standard deviations above it. Whether these two samples are actually classified as outliers depends on the context. They certainly change the mean and standard deviation when they are included in the dataset. Removing those two points leaves [50,50,52,54,54,55,57,59,59,59], which changes the mean to 54.9 and the standard deviation to ~3.36.
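The arithmetic for this dataset can be checked with Python's standard `statistics` module. The sketch below assumes the population standard deviation (`pstdev`); the sample standard deviation (`stdev`) would give slightly larger values. The variable names are illustrative only.

```python
from statistics import mean, pstdev

data = [-15, 50, 50, 52, 54, 54, 55, 57, 59, 59, 59, 200]
suspects = (-15, 200)

# Statistics with the two suspect points included
mu = mean(data)
sigma = pstdev(data)

# Signed distance of each suspect from the mean, in standard deviations
z_scores = {x: (x - mu) / sigma for x in suspects}

# Statistics after removing the two suspect points
trimmed = [x for x in data if x not in suspects]
trimmed_mu = mean(trimmed)
trimmed_sigma = pstdev(trimmed)

print(f"with suspects:    mean={mu:.1f}, std={sigma:.1f}")
for x, z in z_scores.items():
    print(f"  {x}: {z:+.2f} standard deviations from the mean")
print(f"without suspects: mean={trimmed_mu:.1f}, std={trimmed_sigma:.2f}")
```

Comparing the two printed summaries shows how strongly two extreme values can shift both the mean and the standard deviation of a small dataset.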