Outlier

What is an Outlier? 

A statistical outlier is any data point that lies beyond a pre-defined range of a dataset's distribution, often representing a measurement error or abnormal data that should not be included. Outliers are commonly defined by their distance from the mean of the dataset’s samples. The unit of measure for this distance is the standard deviation of the dataset, which is a measure of how spread out the data samples are. Outliers can often also be identified visually from a plot of the data samples.

While standard deviation and probability theory give a rough framework for spotting abnormalities, there is no firm mathematical definition of what constitutes an outlier.
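
The distance-from-the-mean idea described above can be sketched in a few lines of Python. This is only an illustration: the function names `z_scores` and `flag_outliers` and the default cutoff of 3 standard deviations are assumptions made for this example, not a standard definition, and the population standard deviation is used for simplicity.

```python
from statistics import fmean, pstdev


def z_scores(samples):
    """Return each sample's distance from the mean, in standard deviations."""
    mean = fmean(samples)
    std = pstdev(samples)  # population standard deviation (an assumption)
    if std == 0:
        # All samples are identical, so nothing deviates from the mean.
        return [0.0 for _ in samples]
    return [(x - mean) / std for x in samples]


def flag_outliers(samples, threshold=3.0):
    """Return the samples lying more than `threshold` standard deviations
    from the mean. The threshold is a hypothetical, context-dependent choice."""
    scores = z_scores(samples)
    return [x for x, z in zip(samples, scores) if abs(z) > threshold]
```

For example, `flag_outliers([1, 2, 3, 4, 5, 100], threshold=2.0)` flags only the value 100. Changing the threshold changes which points are flagged, which reflects the lack of a firm definition noted above.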

Why are Outliers Important in Machine Learning?

Because machine learning techniques, deep learning included, work by analyzing large amounts of data to find a mathematical pattern or relationship, outliers can produce all sorts of “ghosts” (spurious patterns) in a trained model if they are not weeded out early.

Outliers are often discarded because of their effect on the overall distribution and statistical analysis of the dataset. This is certainly a good approach if the outliers are due to an error of some kind (measurement error, data corruption, etc.). Often, however, the source of the outliers is unclear. There are many situations where an occasional ‘extreme’ event produces a value that lies outside the usual distribution of the dataset but is a valid measurement and not due to an error. In these situations, how to deal with the outliers is not necessarily clear, and the choice has a significant impact on the results of any statistical analysis done on the dataset. The decision depends on the goals and context of the research and should be documented in any description of the methodology.

Example of Statistical Outliers in Data Analysis

Consider this one-dimensional dataset of integers [-15,50,50,52,54,54,55,57,59,59,59,200]. Simply by visual inspection or graphing, one might conclude that there are two potential outliers in this dataset: -15 and 200. The mean of this dataset (including -15 and 200) is ~61.2 and the (population) standard deviation is ~46.2. So -15 is about 1.6 standard deviations below the mean and 200 is about 3 standard deviations above it. Whether or not these two samples are actually classified as outliers depends on the context. They certainly change the mean and standard deviation if they are included in the dataset. Removing those two points, leaving [50,50,52,54,54,55,57,59,59,59], changes the mean to 54.9 and the standard deviation to ~3.36.
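
The arithmetic in this example can be checked with a short Python sketch. It assumes the population standard deviation (`statistics.pstdev`); the sample standard deviation (`statistics.stdev`) would give somewhat larger values. The variable names are chosen only for illustration.

```python
from statistics import fmean, pstdev

data = [-15, 50, 50, 52, 54, 54, 55, 57, 59, 59, 59, 200]

# Statistics with the two suspected outliers included.
mean = fmean(data)
std = pstdev(data)  # population standard deviation (an assumption)

# Distance of each suspected outlier from the mean, in standard deviations.
z_low = (-15 - mean) / std
z_high = (200 - mean) / std

# Statistics after removing the two suspected outliers.
trimmed = [x for x in data if x not in (-15, 200)]
trimmed_mean = fmean(trimmed)
trimmed_std = pstdev(trimmed)

print(f"all points: mean={mean:.1f}, std={std:.1f}")
print(f"z(-15)={z_low:.2f}, z(200)={z_high:.2f}")
print(f"trimmed:    mean={trimmed_mean:.1f}, std={trimmed_std:.2f}")
```

Removing just two of the twelve points shrinks the standard deviation by more than a factor of ten, which shows how strongly extreme values can dominate summary statistics.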